Data Preprocessing

Gohil Rushabh Navinchandra
5 min read · Sep 16, 2021

This blog covers data preprocessing. We will learn about the different techniques of data preprocessing, so let's start the discussion.

The longest and most important step in this workflow is data preparation/preprocessing, which takes roughly 70% of the time. This step is important because, in most situations, the data provided by the customer is of poor quality or simply cannot be fed directly into an ML model.

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

Steps for data preprocessing

  • Step 1: Import the libraries
  • Step 2: Import the dataset
  • Step 3: Check out the missing values
  • Step 4: See the Categorical Values
  • Step 5: Feature Scaling

Import the libraries
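A minimal set of imports for this workflow might look like the following (pandas for the data, scikit-learn for the encoders and scalers used later in the tutorial):

```python
# Data handling
import numpy as np
import pandas as pd

# Preprocessing utilities used later in this tutorial
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    OneHotEncoder,
    StandardScaler,
)
```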

Import the data-set

For this blog, we are using the Kaggle dataset “Top Spotify Tracks of 2018 — Audio features of top Spotify songs”.

Now we import our dataset.
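Loading could look like this. The filename `top2018.csv` is an assumption (use whatever name the downloaded Kaggle file has), and the tiny inline frame is a stand-in so the snippet runs on its own:

```python
import pandas as pd

# In the real workflow you would load the Kaggle CSV, e.g.:
#   df = pd.read_csv("top2018.csv")  # filename is an assumption
# Here we build a small stand-in frame with the same kind of columns
# (the audio-feature values below are illustrative, not the real ones).
df = pd.DataFrame({
    "name": ["God's Plan", "SAD!", "rockstar"],
    "artists": ["Drake", "XXXTENTACION", "Post Malone"],
    "danceability": [0.75, 0.74, 0.59],
    "energy": [0.45, 0.61, 0.52],
    "tempo": [77.2, 75.0, 159.8],
})
```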

Next, we print the first five rows of our dataset.

We also print the shape of our dataset, i.e. the total number of rows and columns it contains.

The next step is to print basic information about the dataset, such as the data type of each column, the null-value count, and the memory usage.

After that, we describe our dataset, which gives us information like the mean, min, 25%, 50%, 75%, and max values of each numerical column.
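These inspection steps can be sketched as follows, using a small stand-in frame in place of the Spotify data:

```python
import pandas as pd

# Stand-in for the Spotify dataframe; in practice df comes from pd.read_csv
df = pd.DataFrame({
    "danceability": [0.75, 0.74, 0.59, 0.62],
    "energy": [0.45, 0.61, 0.54, 0.70],
})

print(df.head())      # first five rows (here only four exist)
print(df.shape)       # (number of rows, number of columns)
df.info()             # dtypes, non-null counts, memory usage
print(df.describe())  # count, mean, std, min, 25%, 50%, 75%, max
```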

Check out the missing values

Our dataset does not have any null values, but in this tutorial we will still discuss the different techniques for handling them.

There are two major ways to handle missing values.

The first method is to delete a particular row if it has a null value for a particular feature, or to delete a particular column if it has more than 75% missing values. This method is advised only when there are enough samples in the dataset, because removing data leads to a loss of information and may degrade the quality of the predictions.

The second method is to fill the null values: for a numerical column we use the mean, median, or mode, and for a categorical column we use the most frequent value.
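Both approaches can be sketched on a toy frame (the column names here are hypothetical, not from the Spotify data):

```python
import numpy as np
import pandas as pd

# Toy frame with a missing numerical value and a missing categorical value
df = pd.DataFrame({
    "tempo": [120.0, np.nan, 98.0, 140.0],
    "genre": ["pop", "rap", None, "pop"],
})

# Method 1: drop every row that contains a null value
dropped = df.dropna()

# Method 2: impute -- mean (or median) for a numerical column,
# most frequent value (mode) for a categorical column
filled = df.copy()
filled["tempo"] = filled["tempo"].fillna(filled["tempo"].mean())
filled["genre"] = filled["genre"].fillna(filled["genre"].mode()[0])
```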

See the Categorical Values

Machine learning models are based on mathematical equations, and you can intuitively understand that keeping categorical data in the equations would cause problems, because we only want numbers in the equations.

So we have to convert our categorical values into numerical ones. LabelEncoder is the technique I use here; it helps us transform categorical data into numerical data.
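A minimal LabelEncoder sketch (the artist list is illustrative):

```python
from sklearn.preprocessing import LabelEncoder

artists = ["Drake", "Post Malone", "Drake", "XXXTENTACION"]

# LabelEncoder assigns each distinct category an integer code;
# categories are sorted first, so Drake -> 0, Post Malone -> 1, ...
le = LabelEncoder()
codes = le.fit_transform(artists)
```

Note that the integer codes impose an arbitrary order on the categories, which is why dummy variables are often preferred for non-ordinal data.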

There is one more technique called OneHotEncoder, in which we create dummy variables. A dummy variable takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
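A quick dummy-variable sketch; here I use pandas' `get_dummies`, which produces the same kind of 0/1 indicator columns as scikit-learn's `OneHotEncoder`:

```python
import pandas as pd

genres = pd.Series(["pop", "rap", "pop"], name="genre")

# One indicator (dummy) column per category: 1 = present, 0 = absent
dummies = pd.get_dummies(genres, dtype=int)
```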

First, we apply dummy variables to the name column.

Then, we apply dummy variables to the artists column.

After applying OneHotEncoder, the total number of columns is 183. Our final dataset is shown below.
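The two encoding steps together can be sketched like this, on a stand-in frame; on the real dataset, this same call is what inflates the column count to 183:

```python
import pandas as pd

# Stand-in for the Spotify frame (values are illustrative)
df = pd.DataFrame({
    "name": ["God's Plan", "SAD!", "rockstar"],
    "artists": ["Drake", "XXXTENTACION", "Post Malone"],
    "danceability": [0.75, 0.74, 0.59],
})

# One-hot encode both categorical columns; the originals are replaced
# by prefixed dummy columns such as "name_SAD!" and "artists_Drake"
encoded = pd.get_dummies(df, columns=["name", "artists"], dtype=int)
```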

Feature Scaling

Feature scaling is a method to limit the range of variables so that they can be compared on common ground.

In this tutorial we will learn about the Standard scaler and the Min-Max scaler.

Standard Scaler :

The standard scaler assumes your data is normally distributed within each feature and scales it so that the distribution is centered around 0 with a standard deviation of 1.

The mean and standard deviation are calculated for the feature, and then the feature is scaled as z = (x - mean) / std.
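A minimal StandardScaler sketch on a toy feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature, four samples; scaling applies z = (x - mean) / std per feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# X_scaled now has mean 0 and standard deviation 1
```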

Min-Max Scaler :

The Min-Max scaler is probably the most famous scaling algorithm, and applies the formula x_scaled = (x - min) / (max - min) to each feature.
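And the corresponding MinMaxScaler sketch, which maps each feature to the [0, 1] range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature, four samples; scaling applies
# x_scaled = (x - min) / (max - min) per feature
X = np.array([[1.0], [2.0], [3.0], [5.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# the smallest value maps to 0, the largest to 1
```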

So that's the end of our data preprocessing tutorial, and I hope you liked this article.

Here I attach my GitHub link.
