Data Cleaning with Python


Tserundede Ejueyitchie


6 min read
·
Feb 20, 2023
In data science, one of the major steps in arriving at well-processed data for analysis is data cleaning, and this is where Python shines when it comes to data analysis.
Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in data in order to improve its quality and ensure that it is accurate and reliable for analysis. In Python, there are various libraries and tools that can be used for data cleaning. In this discourse, we will cover some of the most common steps involved in data cleaning using Python, which are:
Importing the necessary libraries: The first step is to import the necessary libraries, including Pandas and NumPy. These libraries are essential for data manipulation, analysis, and visualization.
Loading the data: The next step is to load the data into a Pandas DataFrame. Pandas is a powerful library for data manipulation and analysis, and it provides several methods for loading data, including CSV, Excel, SQL, and more.
Handling missing data: Missing data is a common problem in real-world datasets, and it can be handled in various ways, including imputation, deletion, or interpolation. Pandas provides several methods for handling missing data, including fillna, dropna, and interpolate.
Removing duplicates: Duplicates can distort data analysis and lead to incorrect conclusions. Pandas provides a method called drop_duplicates that can be used to remove duplicate rows from a DataFrame.
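Put together, the steps above can be sketched in a few lines. The tiny DataFrame below stands in for a real file (in practice you would start with something like `pd.read_csv('data.csv')`; the file name and column names here are illustrative):

```python
import pandas as pd
import numpy as np

# A small example frame in place of reading a real file,
# e.g. df = pd.read_csv('data.csv')  (file name is a placeholder)
df = pd.DataFrame({
    'Store': [1, 1, 2],
    'Sales': [100.0, np.nan, 250.0],
})

df = df.drop_duplicates()            # remove duplicate rows
df['Sales'] = df['Sales'].fillna(0)  # impute missing values with 0
print(df)
```

Each of these operations returns a new object, so the result must be assigned back (or done with `inplace=True`) for the change to stick.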
In this project, we will be using Pandas to clean our data so that it is ready for analysis. Three different types of data issues that we will deal with are: Missing Data, Duplicate Data, Incorrect Data.

Importing and Correcting data type

To take care of these data issues, we can use built-in Pandas functions to clean the data efficiently. Before starting to clean, we can use the info() function to get a snapshot of the types of data in our DataFrame and systematically convert columns to their proper data types.
import pandas as pd
import numpy as np

dataframe_3 = pd.read_csv('Data Source.csv')
dataframe_3.info()

#RUN and get the below output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1887 entries, 0 to 1886
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Store       1887 non-null   int64
 1   Date        1887 non-null   object
 2   CategoryID  1887 non-null   int64
 3   ProductID   1887 non-null   object
 4   Q1          1886 non-null   float64
 5   Q2          1722 non-null   object
 6   Q3          1738 non-null   float64
 7   Q4          1719 non-null   float64
dtypes: float64(3), int64(2), object(3)
memory usage: 118.1+ KB
We can start by noticing that the Date column is an object type, whereas the correct type for a date is datetime. We can correct this with the following code.
pd.to_datetime(dataframe_3['Date'])
#RUN using shift & enter key:
0      2019-03-10
1      2019-06-04
2      2019-01-16
3      2019-10-18
4      2019-04-11
          ...
1882   2019-03-24
1883   2019-03-24
1884   2019-03-24
1885   2019-09-30
1886   2019-06-14
Name: Date, Length: 1887, dtype: datetime64[ns]
#Date type has been changed successfully. Next we reassign it to our DataFrame.
# In order to make a change to the DataFrame, the result must be assigned back to overwrite the column.
dataframe_3['Date'] = pd.to_datetime(dataframe_3['Date'])
dataframe_3.info()
#RUN using shift & enter key:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1887 entries, 0 to 1886
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Store       1887 non-null   int64
 1   Date        1887 non-null   datetime64[ns]
 2   CategoryID  1887 non-null   int64
 3   ProductID   1887 non-null   object
 4   Q1          1886 non-null   float64
 5   Q2          1722 non-null   object
 6   Q3          1738 non-null   float64
 7   Q4          1719 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(2), object(2)
memory usage: 118.1+ KB
We can now see that the Date column has been converted to the right data type.
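pd.to_datetime also accepts a format string and an errors argument, which is useful when some entries are malformed. A small sketch (the sample dates below are illustrative, not from the dataset):

```python
import pandas as pd

dates = pd.Series(['2019-03-10', '2019-06-04', 'not a date'])

# errors='coerce' turns unparseable entries into NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
print(parsed)
```

The NaT values produced this way then show up in the missing-data checks covered next.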

Missing Data

Using the info() function above, we can see there are null values in our data by looking at the Non-Null Count column. Some columns have a lower non-null count than the total row count, which means those columns contain null values. To see in greater detail which columns and rows have null values, we can use the isna() function.
dataframe_3.isna()
#RUN using shift & enter key:
This output can be difficult to interpret and act on. A way to simplify this information is to pair `isna()` with the `sum()` function.
dataframe_3.isna().sum()
#RUN using shift & enter key:
Store           0
Date            0
CategoryID      0
ProductID       0
Q1              1
Q2            165
Q3            149
Q4            168
dtype: int64
Now, we can see the number of null values in each column.
Missing data can be handled using the following approaches:
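A quick way to put these counts in context is to express them as a share of the total rows, since isna() returns booleans that can be averaged. A sketch on a toy frame (column names and values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Q1': [1.0, np.nan, 3.0, 4.0],
                   'Q2': [np.nan, np.nan, 7.0, 8.0]})

# Mean of the boolean mask = fraction of missing values per column
missing_share = df.isna().mean()
print(missing_share)
```

A column that is, say, half missing may call for a different strategy than one missing a single value.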
Removing null values from DataFrame rows and/or columns
Filling null values with a constant or other values from the DataFrame
Removing Null Values
To remove columns or rows that contain null values, we can use the dropna() function. We can start by dropping every row in which at least one element is null. This reduces our DataFrame from 1887 rows to 1684 rows, as seen below.
dataframe_3.dropna()
#RUN using shift & enter key:
From the output above, the number of rows in the DataFrame has been reduced to 1684. We can also drop columns that contain null values rather than rows, using the same function but specifying the axis.
dataframe_3.dropna(axis=1)
Looking above, we can see that every column containing any null values has been dropped.
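dropna() also accepts subset and thresh arguments, which give finer control than dropping on any null at all. A sketch on a toy frame (column names and values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Store': [1, 2, 3],
                   'Q1': [10.0, np.nan, 30.0],
                   'Q2': [np.nan, np.nan, 5.0]})

# Drop rows only when 'Q1' is missing, ignoring nulls elsewhere
only_q1 = df.dropna(subset=['Q1'])

# Keep rows that have at least 2 non-null values
at_least_two = df.dropna(thresh=2)
print(only_q1)
print(at_least_two)
```

This way a row with one unimportant gap is not thrown away along with its useful values.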

Filling null values

Completely dropping columns & rows that have null values is usually not an efficient method of cleaning data, as lots of data can be lost.
A more common approach to cleaning these null values is to fill them in using the fillna() function. Two popular uses of the function are:
Fill the null values with 0 or any value
Fill the null values based on the next/previous number (‘ffill’ and ‘bfill’).
Forward filling and backward filling are two approaches to filling missing values. Forward filling fills a missing value with the previous data point; backward filling fills it with the next one.
With these methods, we can still retain valuable data and not drop the entire column/row.
# filling NaNs with 0
dataframe_3.fillna(0)

# filling NaNs with the previous non-NaN value ("forward fill")
dataframe_3.fillna(method='ffill')

# filling NaNs with the next non-NaN value ("back fill")
dataframe_3.fillna(method='bfill')
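On a small Series the difference between the two directions is easy to see. Note also that in recent pandas versions (2.1+) passing method= to fillna() is deprecated in favour of the dedicated ffill() and bfill() methods, used here (values are illustrative):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Modern equivalents of fillna(method='ffill') / fillna(method='bfill')
forward = s.ffill()   # carries the previous value forward
backward = s.bfill()  # pulls the next value backward
print(forward.tolist())
print(backward.tolist())
```

Forward fill suits time-ordered data where the last observation remains valid; backward fill suits data where the next observation is the better stand-in.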
With this, I believe you have learned how to clean your data for analysis. In the next episode I will write about handling error data, duplicate data, and exporting data.
Data source: My sincere appreciation goes to the CFI team. The dataset has been uploaded to my Google Drive here.
Follow me for updates on data science and analysis processes.