The Great Cleanup: Handling Missing Values, Duplicates, and Errors
Welcome to the unglamorous reality of Machine Learning. While Data Scientists love to talk about algorithms, they spend the vast majority of their time here, in the “Janitor Work” phase.
Real-world data is never clean. It is full of holes, duplicates, and spelling mistakes. If you feed this mess into a model, you will get a mess back out: garbage in, garbage out. Here is how we scrub the data to ensure it is ready for training.
The Problem of Missing Data
It is rare to find a dataset where every single row is complete. Perhaps a user forgot to enter their age, or a sensor went offline for an hour.
You generally have two options:
- Drop the data: If a row is 90% empty, get rid of it.
- Impute the data: This means filling in the blanks.
Simple imputation involves filling empty spots with the average (mean) or median of the column. More advanced methods use algorithms like K-Nearest Neighbours (KNN) to guess what the value should be based on similar data points.
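Here is a minimal sketch of both options using pandas and scikit-learn. The tiny age/income dataset is made up purely for illustration; swap in your own columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataset with gaps
df = pd.DataFrame({
    "age": [25, np.nan, 47, 52, np.nan],
    "income": [32000, 41000, np.nan, 58000, 61000],
})

# Option 1: drop rows that are mostly empty
# (here: keep only rows with at least 2 non-null values)
df_dropped = df.dropna(thresh=2)

# Option 2a: simple imputation with the column median
median_imputer = SimpleImputer(strategy="median")
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

# Option 2b: KNN imputation - fill each gap from the most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)
```

Median imputation is cheap and robust to outliers; KNN imputation usually gives more realistic values but costs more compute on large datasets.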
Standardising Inconsistent Formats
Human input is notoriously inconsistent. In a “Location” column, you might find:
- “London”
- “london”
- “Greater London”
- “Lndn”
To a computer, these are four completely different cities. You must standardise these entries so the model recognises them as the same category. This often involves rigorous string manipulation and mapping.
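A sketch of that normalise-then-map approach with pandas; the mapping dictionary is a hypothetical example you would build from your own data.

```python
import pandas as pd

df = pd.DataFrame({"location": ["London", "london", "Greater London", "Lndn"]})

# Step 1: normalise whitespace and case so trivial variants collapse together
df["location"] = df["location"].str.strip().str.lower()

# Step 2: map known variants and misspellings to one canonical label
location_map = {
    "london": "London",
    "greater london": "London",
    "lndn": "London",
}
df["location"] = df["location"].replace(location_map)
```

Any value not in the map passes through unchanged, which makes it easy to spot new variants in a later pass.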
Sanity Checks
Finally, run logic checks.
- Are there negative numbers in a “Price” column?
- Is someone’s age listed as zero?
- Are there exact duplicate rows?
Automating these sanity checks ensures that obvious errors do not silently sabotage your model’s performance.
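As a sketch of what that automation might look like, the snippet below counts rule violations and drops exact duplicates; the sample rows and thresholds are invented for illustration.

```python
import pandas as pd

# Hypothetical dataframe containing a few suspicious rows
df = pd.DataFrame({
    "price": [19.99, -4.50, 12.00, 12.00],
    "age":   [34, 0, 28, 28],
})

# Logic checks: count rows that break basic business rules
issues = {
    "negative_price": int((df["price"] < 0).sum()),
    "zero_age": int((df["age"] == 0).sum()),
    "duplicate_rows": int(df.duplicated().sum()),
}
print(issues)  # e.g. {'negative_price': 1, 'zero_age': 1, 'duplicate_rows': 1}

# Remove exact duplicates; flagged rows usually deserve manual review,
# not silent deletion
df = df.drop_duplicates()
```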
Next up: Now that the data is clean, can we make it better? Part 5 explores Feature Engineering, the art of turning raw data into powerful signals.
Get in touch for help cleaning your data