Chapter 1 – Description
Description: This project applies theories and practices of the Python language and its libraries for Data Analysis and Machine Learning, alongside other technologies such as SAS Enterprise Miner and Streamlit, to demonstrate the importance of each step of a Data Analysis and Machine Learning workflow.
Note: In this post I am commenting on part of a larger project I did and on the techniques and steps I used for it. I cannot post the entire project and documentation here because that would be huge and boring for the reader, so I am spreading it across other, smaller posts over the coming days. However, you can contact me if you have any interest or comments about it.
If you are interested, you can follow more on my blog.
The dataset is presented, analysed, cleaned, and organized; the data is balanced and saved; the models are created, trained, and used for prediction; and the models are then compared with each other using various metrics to determine which was the best choice for the given problem, and why. All results are presented on a web page built with Streamlit, where the user can navigate between several explanatory options, so that even someone without prior knowledge of the subject can understand them.
Technologies: Excel, Python, Streamlit, seaborn, pyplot, Pandas, NumPy, Matplotlib, scikit-learn, SMOTE, Jupyter Notebook, Markdown, SAS Enterprise Miner, GitHub, VS Code, HTML5, CSS3, Bootstrap.
Chosen dataset: The dataset used for this project comes from a marketing campaign. The goal is to predict which customers are likely to respond to the campaign, reducing unnecessary spending. The initial dataset has 1500 rows and 19 columns, with two possible targets. The dataset and this project comply with the General Data Protection Regulation (GDPR).
Workflow of the entire process and stages:
- Data Understanding
- Implementation
- Summary of Metrics and Comparison
- Visualization
Each of these steps is addressed in more depth in its own subtopics.
Models used in both Python and SAS Enterprise Miner:
- Decision Tree (DT)
- DecisionTreeClassifier (DTC)
- Random Forest (RF)
- Logistic Regression (LR)
- Neural Network (NN)
Data Visualization: Data visualization was done in Streamlit, a Python library created so that the Data Analyst does not have to venture into the complexities of other languages or frameworks, such as JavaScript or Django, to present views and results. Using this library, the analyst can easily build an elegant and clear view on a web page.
The image below shows the topics I covered in this project.
Good documentation is also part of a good project. By documenting the entire process, you gain material for future reference, whether for yourself or for others, and even for your portfolio. Clear documentation shows that, beyond the practical part of your project, you can explain your goals, what you accomplished, how you did it, and why it succeeded, so that others besides you can understand it in the future. It can even serve as a reference and a help to someone else.
The objective here is to demonstrate the entire process of Data Analysis and Data Science and the techniques used, such as those mentioned above.
It is worth noting that:
I deliberately looked for unbalanced data like this to demonstrate how we can balance it so that we can compare our models correctly. I will use a technique called SMOTE to balance this data, as we will see throughout this documentation.
For the visualization I will use the Streamlit library, a Python library for building web pages that present data in an elegant and simple way and let the user interact with it.
Let’s start
For good data analysis and data science, it is important that the analyst or scientist follow some key steps. Understanding the data is perhaps the most important part of the whole process, along with visualizing the data clearly enough that lay users can grasp some of the results at a glance. So it is worth dedicating yourself to the data understanding part; believe me, doing it well will make your life much easier and avoid rework later on.
Data Understanding
- Read data
- Transforming into data frame
- Analysing and obtaining relevant information
- Searching for incorrect, blank, or missing data
- Checking the wrong data
- Cleaning Data
- Eliminating Columns
- Transforming Variables
- Checking the data after the transformation process
- Saving the data frame to a *.csv file
- Balancing data
- Saving balanced data in a data frame
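The Data Understanding steps above can be sketched with pandas as follows. The tiny in-memory table and its column names ("age", "job", "response") are hypothetical placeholders, since the real campaign columns are not shown in this post.

```python
# A sketch of the Data Understanding steps, on a hypothetical table.
import pandas as pd

# Read data and transform it into a DataFrame (here built in memory).
df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "job": ["admin", "technician", "admin", None],
    "response": ["yes", "no", "no", "yes"],
})

# Analyse and obtain relevant information.
df.info()
print(df.describe(include="all"))

# Search for incorrect, blank, or missing data.
print(df.isna().sum())

# Clean the data: here we simply drop rows with missing values.
df = df.dropna()

# Transform variables: encode the categorical column numerically.
df["job"] = df["job"].astype("category").cat.codes

# Check the data after the transformation process, then save it.
print(df.head())
df.to_csv("clean_data.csv", index=False)
```

In the real project each of these steps is more involved, but the shape of the workflow is the same: inspect, clean, transform, re-check, save.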
Implementation
- Separating into training and testing
- Saving training and testing samples
- DecisionTreeClassifier (DTC)
- Random Forest (RF)
- Neural Network (NN)
- Summary of metrics and Confusion Matrix
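The Implementation steps above can be sketched with scikit-learn. Synthetic data stands in for the balanced dataset, and the hyperparameters shown are illustrative defaults, not the project's tuned settings.

```python
# A sketch of the Implementation steps: split, train the three
# scikit-learn models listed above, and summarize metrics with a
# confusion matrix. Synthetic data stands in for the balanced dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Separate into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "DTC": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
    "NN": MLPClassifier(max_iter=1000, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```

The same split is reused for every model so that the metrics are directly comparable between them.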
Visualization
- Data visualization on a web page with Streamlit.
The figure below shows the entire workflow mentioned above, which will be used in this work from this chapter onwards, with a detailed and in-depth walkthrough to make all the steps clear.
The figure below shows, in a simple bar graph, the two target classes before data balancing. We can see that the data is unbalanced and needs to be balanced before we can proceed with the next steps and a correct comparison of the models. In later chapters I will explain in more detail why this matters; in short, if you train your models on unbalanced data, they can produce erroneous results.
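A bar graph like that one can be produced in a few lines with pandas and Matplotlib. The class counts below are illustrative, not the project's real numbers.

```python
# A sketch of the "before balancing" bar graph, with hypothetical
# class counts standing in for the real target column.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical unbalanced target column.
target = pd.Series(["no"] * 1300 + ["yes"] * 200, name="target")
counts = target.value_counts()

ax = counts.plot(kind="bar", color=["steelblue", "indianred"])
ax.set_xlabel("Target class")
ax.set_ylabel("Count")
ax.set_title("Target distribution before balancing")
plt.savefig("target_balance.png")
```

A glance at such a chart is often enough to decide whether a balancing technique like SMOTE is needed before modeling.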
In future posts, as mentioned above, I will publish the continuation of this project. By dividing it into small posts, I believe it will not be boring for the reader. In this post I just summarized the entire process used in the project.
Remember, if you are interested or want to leave a comment, here on the blog there are links that point to my social networks, my page, and my GitHub.
Thank you in advance for reading and for your interest.
See you soon.