Data cleansing is the process of detecting and then correcting or removing inaccurate, incomplete, irrelevant, or inconsistent data within a dataset. It is an essential step in data preparation and involves several activities, including:
Handling Missing Data: Identifying missing values in the dataset and dealing with them, either by imputing values with statistical methods (for example, the mean or median of the column) or by removing the affected rows or columns.
Removing Duplicates: Identifying and eliminating duplicate records or observations from the dataset to ensure each entry is unique.
Correcting Errors: Identifying and correcting errors in data entries, such as typos, formatting issues, or inconsistencies in naming conventions.
Standardizing Data: Ensuring data follows a consistent, predefined format, such as a single date format, one unit of measurement, or a fixed set of categorical values.
Handling Outliers: Identifying outliers or anomalies that may skew the analysis or results, and deciding whether to remove, correct, or keep them.
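
The activities above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a definitive pipeline: the sample records, the field names (name, city, signup, age), the alias table, the accepted date formats, and the order of the steps are all assumptions made for the example. Outliers are detected with the common 1.5 x IQR rule and then treated as missing, and missing values are filled with the median; other choices may suit a real dataset better.

```python
from datetime import datetime
from statistics import median, quantiles

# Hypothetical sample data, invented for this illustration.
RAW = [
    {"name": "Alice", "city": "new york", "signup": "2023-01-05", "age": 34},
    {"name": "Bob", "city": "NYC", "signup": "05/02/2023", "age": None},
    {"name": "Alice", "city": "new york", "signup": "2023-01-05", "age": 34},
    {"name": "Carol", "city": "New York", "signup": "2023-03-10", "age": 29},
    {"name": "Dan", "city": "Boston ", "signup": "2023-04-01", "age": 31},
    {"name": "Eve", "city": "boston", "signup": "2023-04-15", "age": 36},
    {"name": "Frank", "city": "Boston", "signup": "2023-05-20", "age": 240},
]

CITY_ALIASES = {"nyc": "new york"}        # assumed alias table
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")   # assumed input formats (day first)


def standardize_date(text):
    """Standardizing: return an ISO 8601 date, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def correct_city(text):
    """Correcting errors: trim whitespace, resolve aliases, fix capitalization."""
    key = text.strip().lower()
    return CITY_ALIASES.get(key, key).title()


def remove_duplicates(records):
    """Removing duplicates: keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def mask_outliers(values, k=1.5):
    """Handling outliers: replace values outside the k * IQR fences with None."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if v is not None and low <= v <= high else None for v in values]


def impute_median(values):
    """Handling missing data: fill None with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def clean(records):
    # Normalize first so that near-identical records become exact duplicates.
    rows = [
        {**r, "city": correct_city(r["city"]), "signup": standardize_date(r["signup"])}
        for r in records
    ]
    rows = remove_duplicates(rows)
    ages = impute_median(mask_outliers([r["age"] for r in rows]))
    return [{**r, "age": age} for r, age in zip(rows, ages)]


cleaned = clean(RAW)
```

Masking outliers before imputing keeps the implausible age from influencing the fill value. In practice, the rule for what counts as an outlier and the choice between removing and imputing values depend on the dataset and on the analysis that follows.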