This is the standard Data Science workflow

Joachim Kuleafenu
12 min read · Feb 7, 2022


Understanding the complete Data Science pipeline while building a real-world project.

Photo by Isaac Smith on Unsplash

To reproduce this project, download the code on my GitHub repository.

Just a reminder: don’t forget to give a clap if you find this piece helpful.

Project objective: use Linear Regression, a machine-learning algorithm, to build a model that predicts a student’s grade based on a given set of attributes.

Things to learn

i. Introducing the algorithm, dataset and various libraries used for the project.
ii. Understanding the data through Exploratory Data Analysis.
iii. Data preprocessing and feature engineering.
iv. Machine learning section; predictive modelling.
v. Saving and loading the model as a file.

1. Introduction

Linear Regression is unique in that it belongs to the field of mathematics, yet it also adapts well to machine learning.

With Linear Regression, we build a model that combines a specific set of numeric input values to produce a predicted numeric output.

Because it deals with numeric values, it lends itself to problems involving scale, counts, statistical probability and grades, which makes student performance evaluation a natural application.

Libraries

We will begin by importing the various Python libraries that will help us in our data exploration.

1.1 Data wrangling libraries

i. Pandas is a fast, powerful, flexible and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language. Read more about pandas.

ii. NumPy is one of the most powerful open-source scientific computing tools, used by scientists, statisticians and other people in quantitative fields. You can read about NumPy’s ease of use and its magic here.

1.2 Data Visualization libraries

iii. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Read about Matplotlib here.

iv. Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Read more about seaborn.

1.3 Model creation and evaluation libraries

These libraries and their operations will be explained in detail as we move along the project.

2. Exploratory data analysis

We will, first of all, have to understand the data and explore how various attributes of each student may directly or indirectly affect his/her score.

The data sets are in the data folder. We will use the pandas read_csv method to read both data sets and pass ';' for the sep parameter because each column is separated by a semicolon.

Let's preview the first five rows of both datasets.

  • We have two different datasets, a maths course and a Portuguese language course, each having the same attributes.
  • For simplicity's sake, we will combine and analyze both datasets together with the help of the Pandas concat method.
import pandas as pd

# file names assume the repo's data folder; columns are semicolon-separated
stu_mat = pd.read_csv('data/student-mat.csv', sep=';')
stu_por = pd.read_csv('data/student-por.csv', sep=';')
stu_mat.head()
stu_por.head()
# concatenate both datasets row-wise
comb_df = pd.concat([stu_mat, stu_por], axis=0)

2.1 Dataset

The students’ performance dataset used for this study was extracted from the UCI data repository.
The data is split into two, one as ‘math course’ and the other as ‘Portuguese course’.

We combined and analyzed the two separate datasets together because they share the same attributes.

2.2. Description of the dataset

There are 1,044 data entries covering only two distinct schools.

Females make up 53% of the students and males 47%.

The average age of a student is 17 years, with a minimum of 15 and a maximum of 22.

On average, a student may absent himself/herself from school about twice.
The dataset has 33 features in total: 16 of integer type and 17 of string type.

comb_df.describe(include='all').T

Let’s combine the two datasets and remove duplicate rows by checking them against a list of identifying attributes.

  • There are 382 duplicate rows.
  • Now, let's use the Pandas duplicated method to remove the duplicates from the data, as shown in the sketch below.
  • The ~ sign simply means the opposite of whatever the outcome is.
  • In this context, ~ is used to negate the outcome of boolean values.
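A minimal sketch of the deduplication, assuming comb_df from above; the exact subset of identifying attributes is an assumption based on the UCI dataset notes:

# keep only the first occurrence of each student
dup_cols = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus',
            'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'nursery', 'internet']
df = comb_df[~comb_df.duplicated(subset=dup_cols)]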

NB: Since we have combined the datasets and gotten rid of the duplicate rows, we can now roll on with the exploratory analysis by doing visualization.

2.3. Visualization

I. Let's check how sex and age affect the student’s grade.

On average, male students aged 19 or younger perform better than those older than 20.

Female students older than 19 perform far better than those aged 19 or younger, which is the direct opposite of their male mates.

Therefore we can say that the older a female student is, the better her performance, while the opposite holds for male students.
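A minimal sketch of how such a comparison might be plotted, assuming df is the deduplicated frame, G3 is the final grade, and an illustrative age split at 19:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df['age_group'] = np.where(df['age'] > 19, 'over 19', '19 or younger')
sns.barplot(data=df, x='age_group', y='G3', hue='sex')  # bar height = mean grade
plt.ylabel('Average final grade (G3)')
plt.show()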

II. Can a student's relationship status affect his/her grade?

  • It is observed that both genders with no romantic life perform much better than those in a relationship.
  • On average, male students perform slightly better than female students in both romantic-status groups.

III. Is there any impact parents' educational level has on students' performance?

  • Upon visualizing both the father's and mother's educational levels, it turned out consistently that parents with no educational background and those with higher educational backgrounds have the best-performing students.
  • The rest of the students, whose parents fall in the other categories, perform relatively lower.

A study on the correlation between parents' educational level and children's success, published on the Lamar University website, shows how significantly parents' educational level affects their wards' performance.

The findings suggest that people whose parents did not hold a degree, and who entered the workforce straight out of high school, were more likely to believe that a college degree was not worth the cost or that they did not need further education to pursue their desired career.

Parents who didn’t get the privilege of attending high school find it necessary to encourage their children to go to school and motivate them to go the extra mile on what they couldn’t do.

IV. How do the financial needs of a student affect his/her grade?

  • The simple heatmap below shows that students with both family support and school support tend to perform lower than all the other categories.

Kirabo Jackson, a professor at Northwestern University, has published several studies showing a connection between higher spending and improved education outcomes.

His research has shown additional funding to low-wealth school districts can make a difference in student performance. Wealthier school districts don’t benefit as much from an extra influx of cash.

  • It is also surprising that students with no financial support perform markedly better, securing the second-highest position on the map.

This may be a result of other non-physical factors such as commitment and determination to make a difference.

We can conclude that lack of finance, in this case, is not a major contributor to students' performance.

To build the heatmap (sketched below):
i. Define a Matplotlib figure.
ii. Use the pivot_table method to create an Excel-style pivot table.
iii. Plot the heatmap and apply the necessary arguments.
iv. Set the various labels.
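A minimal sketch of these steps; the assumption is that schoolsup and famsup are the support columns and G3 the final grade:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 5))                    # i. define a figure
pivot = df.pivot_table(index='schoolsup', columns='famsup',
                       values='G3', aggfunc='mean')       # ii. pivot table
sns.heatmap(pivot, annot=True, cmap='viridis', ax=ax)     # iii. plot the heatmap
ax.set_xlabel('Family support (famsup)')                  # iv. set the labels
ax.set_ylabel('School support (schoolsup)')
plt.show()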

3. Data Preprocessing and Feature Engineering.

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

3.1. Why feature engineering

Better features mean flexibility.

You can choose “the wrong models” (less than optimal) and still get good results. Most models can pick up on good structure in data.

The flexibility of good features will allow you to use less complex models that are faster to run, easier to understand and easier to maintain. This is very desirable.

Better features mean simpler models.

With well-engineered features, you can choose “the wrong parameters” (less than optimal) and still get good results, for much the same reasons. You do not need to work as hard to pick the right models and the most optimized parameters.

With good features, you are closer to the underlying problem and a representation of all the data you have available and could use to best characterize that underlying problem.

3.2. Better features mean better results.

“The algorithms we used are very standard for Kagglers. We spent most of our efforts in feature engineering.”
Xavier Conort, in “Q&A with Xavier Conort”, on winning the Flight Quest challenge on Kaggle.

Let's check the shape of the data frame.

df.shape
(662, 33)

3.3. Let's begin with binarization, discretization and normalization of some features in the data set.

I. age feature

  • We will then create the age_bin feature by grouping age into teenager, young_adult and adult.
  • We will create vote_age, a binary feature showing whether a student is eligible to vote or not.
  • We will then create the is_teenager feature, a binary feature that shows whether a student is a teenager or not.
    The cut method categorizes continuous values into given ranges, as sketched below.
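A minimal sketch of these three features; the bin edges and the voting age of 18 are assumptions:

import pandas as pd

# bins chosen to cover the 15-22 age range in the data
df['age_bin'] = pd.cut(df['age'], bins=[14, 19, 21, 25],
                       labels=['teenager', 'young_adult', 'adult'])
df['vote_age'] = (df['age'] >= 18).astype(int)    # assumed voting age of 18
df['is_teenager'] = (df['age'] <= 19).astype(int)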

II. Medu (mother's education) and Fedu (father's education) features

numeric:
0 = no education
1 = primary education and above

We will derive a feature called higher_edu containing a boolean of whether the parent attended secondary education or not.

  • Per our observation, parents who have had at least primary education have their wards performing better compared to parents who haven't had any formal education.
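A minimal sketch of the derived feature, assuming Medu and Fedu use the UCI coding (0 = none up to 4 = higher education); the threshold is an assumption:

# 3 corresponds to secondary education in the UCI coding
df['higher_edu'] = ((df['Medu'] >= 3) | (df['Fedu'] >= 3)).astype(int)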

III. Fjob and Mjob features

  • Create Fjob_cat by combining the at_home, teacher and health values into one category ('employee').
  • We will leave the Mjob (mother's job) feature as-is.
# use a lambda expression to create three categories:
# 'other', 'services' and 'employee'
df.loc[:, 'Fjob_cat'] = df['Fjob'].apply(
    lambda x: x if x in ('other', 'services') else 'employee')

IV. guardian feature

  • We will create has_parent, a boolean feature indicating whether or not a student has a parent as guardian.
  • We can see that those having their parents as their guardians perform much better than those without.
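A minimal sketch, assuming guardian takes the values 'mother', 'father' and 'other':

# flag students whose guardian is one of their parents
df['has_parent'] = df['guardian'].isin(['mother', 'father']).astype(int)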

V. absences feature

We will use discretization to form a new feature named absent_cat.
We will also create the absented feature, a binary feature representing whether a student has ever been absent or not.
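A minimal sketch of both features; the absent_cat bin edges are illustrative assumptions:

import pandas as pd

df['absent_cat'] = pd.cut(df['absences'], bins=[-1, 0, 5, 15, 100],
                          labels=['never', 'low', 'medium', 'high'])
df['absented'] = (df['absences'] > 0).astype(int)  # ever absent or not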

VI. Label encoding Ordinal-Categorical variables

Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.
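A minimal sketch using scikit-learn's LabelEncoder; the column chosen here is purely illustrative:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# integers are assigned in alphabetical order of the labels
df['famsize_enc'] = le.fit_transform(df['famsize'])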

VII. One-hot-encoding of non-ordinal category

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme.

This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter).
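A minimal sketch with scikit-learn's OneHotEncoder; the column choice is illustrative (note that the sparse argument was renamed sparse_output in scikit-learn 1.2+):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
# one binary column per category, returned as a dense array
reason_onehot = enc.fit_transform(df[['reason']])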

VIII. Continuous Variables

Handling skewness and outliers in continuous variables

In statistics, an outlier is an observation point that is distant from other observations. We wouldn’t want outliers in our dataset since they can have significant effects on model performance.

So we will, first of all, visualize the continuous variables, then decide whether to cap them or remove them with a statistical method.

  • Some data points beyond the age of 21 appear to be outliers.
  • The distribution of the number of absences is positively skewed.
  • On the bottom, a plot between G1 and G2 has outlier grades.

We can either use the Z-score function defined in the SciPy library to detect the outliers, or scale our data with standardization.

The Z-score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured

The intuition behind Z-score is to describe any data point by finding their relationship with the Standard Deviation and Mean of the group of data points.

Computing Z-scores transforms the data into a distribution whose mean is 0 and whose standard deviation is 1, i.e. a standard normal distribution.

In most cases a threshold of 3 or -3 is used, i.e. if the Z-score value is greater than 3 or less than -3, that data point is identified as an outlier.
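A minimal sketch of Z-score outlier detection with SciPy; the column list is an illustrative assumption:

import numpy as np
from scipy import stats

num_cols = ['age', 'absences', 'G1', 'G2']
z = np.abs(stats.zscore(df[num_cols]))
outlier_rows = (z > 3).any(axis=1)   # rows with any |z| beyond the threshold
print(outlier_rows.sum(), 'potential outlier rows')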

Standardization

Standardization (or z-score normalization) scales the values while taking the standard deviation into account.

If the standard deviations of the features differ, so will their ranges; standardizing reduces the effect of outliers in the features.

NB: We will use the standard scaling at the model creation time.

IX. Checking for Multicollinearity

Multicollinearity is correlation between independent variables. It is considered a disturbance in the data: if present, it weakens the statistical power of the regression model.

The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression.

It is calculated as the ratio of the variance of a given model's beta divided by the variance of that beta if it were fit alone.

VIF = 1 / (1 − R²), where R² comes from regressing the given predictor on all the other predictors.

  • VIF value <= 4 suggests no multicollinearity
  • VIF value of >= 10 implies serious multicollinearity and therefore those features will be dropped.
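A minimal sketch of the VIF computation with statsmodels, assuming X is the numeric feature frame after encoding:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values('VIF', ascending=False))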

4. MODEL CREATION

Finally, we have come to the model creation section.

  • We will split the data into train and test using the train_test_split method from the Scikit-learn library.

Sklearn (or Scikit-learn): a Python library that offers various features for data processing and can be used for classification, clustering, and model selection.
train_test_split: a function in Sklearn's model selection module for splitting data arrays into two subsets: one for training and one for testing.

Train_test_split parameters explained

X, y: The arrays to split, i.e. the features and the target.
train_size: This parameter sets the size of the training dataset.
test_size: This parameter specifies the size of the testing dataset.
random_state: The default mode performs a random split using np.random, but you can alternatively pass a constant integer value to make the split reproducible.
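A minimal sketch of the split, assuming X holds the engineered features and y the final grade; test_size=0.25 matches the 496/166 split reported below, while the random_state value is an assumption:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)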

I. Scaling the data

This takes the standard deviation into account, hence reducing the effect of outliers.
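A minimal sketch of the scaling step; fitting on the training set only avoids leaking test-set statistics:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics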

II. Linear regression

Linear regression uses the traditional slope-intercept form, where w and b are the variables our algorithm will try to learn to produce the most accurate predictions.

y = wx + b

x represents our input data
y represents our prediction
w is the slope of the line
b is the intercept, that is, the value of y when x = 0

Multivariable regression

A multi-variable linear regression might look like this:

f(x,y,z)=w1x + w2y + w3z

where w represents the coefficients, or weights, our model will try to learn.
The variables x,y,z represent the attributes or distinct pieces of information, we have about each observation.

Derivation

y = wx + b
y = (y_1, y_2, …, y_n)
y_i = w_1·x_i1 + w_2·x_i2 + … + w_n·x_in + b

Model training

X_train shape = (496, 58) and y_train shape = (496,)
X_test shape = (166, 58) and y_test shape = (166,)
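A minimal sketch of the baseline and model fit that produce the scores below, assuming the scaled splits from above; the DummyRegressor baseline simply predicts the training mean, and lin_reg matches the name pickled later:

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

dummy = DummyRegressor(strategy='mean').fit(X_train_scaled, y_train)
print('Dummy train score', dummy.score(X_train_scaled, y_train))
print('Dummy test score', dummy.score(X_test_scaled, y_test))

lin_reg = LinearRegression().fit(X_train_scaled, y_train)
print('train score', lin_reg.score(X_train_scaled, y_train))
print('test score', lin_reg.score(X_test_scaled, y_test))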


Dummy train score 0.0
Dummy test score -0.01040076083093755


train score 0.8461778266234085
test score 0.9006398918375393

Feature Importance

We determined the feature importance by plotting the coefficients of each feature.

The plot below shows that G1 (first-period grade), G2 (second-period grade) and absented_1 (the absence indicator) have a significant influence on the model predictions.
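A minimal sketch of the coefficient plot; feature_names is assumed to match the columns fed to the model:

import matplotlib.pyplot as plt
import pandas as pd

coefs = pd.Series(lin_reg.coef_, index=feature_names).sort_values()
coefs.plot(kind='barh', figsize=(6, 12))
plt.xlabel('Coefficient value')
plt.show()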

5. Model Evaluation

mean_squared_error

MSE measures the average squared difference between an observation’s actual and predicted values. The output is a single number representing the cost, or score, associated with our current set of weights.

from sklearn.metrics import mean_squared_error

print('test mse ', mean_squared_error(y_test, prediction))
print('dummy test mse ', mean_squared_error(y_test, y_predict_dummy_mean))

test mse  1.7110916870356683
dummy test mse  17.400226050534084

R-Squared

The R-Squared metric provides us with a way to measure the goodness of fit or how well our data fits the model. The higher the R-Squared metric, the better the data fit our model.

from sklearn.metrics import r2_score

print('dummy test r2_score', r2_score(y_test, y_predict_dummy_mean))
print('test r2_score', r2_score(y_test, prediction))

dummy test r2_score -0.01040076083093755
test r2_score 0.9006398918375393

Making predictions

Visualizing the predicted values vs the target values.

Plotting ground_truth vs prediction
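A minimal sketch of that plot, reusing the fitted model and the scaled test set from above:

import matplotlib.pyplot as plt

prediction = lin_reg.predict(X_test_scaled)
plt.scatter(y_test, prediction, alpha=0.5)
plt.xlabel('Ground truth (G3)')
plt.ylabel('Prediction')
plt.show()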

Saving model for future use

import pickle

# save the model
with open('model/multi_linear_reg.sav', 'wb') as f:
    pickle.dump(lin_reg, f)

Loading model for prediction

The custom Predictor helper class will let us use only four variables, first_grade, second_grade, absences and index, for our predictions.

from predictor import Predictor

predict = Predictor('model/multi_linear_reg.sav', 'data/prep_df.csv', scaler)

first_grade = 5
second_grade = 4
absences = 0
index = 6
predict.get_prediction(first_grade, second_grade, absences, index)
output: 2


Joachim Kuleafenu

Software Engineer. I build smart features with Machine Learning techniques.