Garbage In, Garbage Out: Why Data Prep is the Real Work of Machine Learning
There is a romanticised version of Machine Learning (ML) that exists in movies and marketing pitch decks. In this version, the hard work is the “AI” itself: complex neural networks, cutting-edge algorithms, and futuristic code.
The reality, as most Data Scientists will tell you, is quite different. A common rule of thumb puts it at roughly 80% data preparation and 20% modeling.
If you are embarking on an ML project, your instinct might be to rush toward import sklearn or import tensorflow. This series is here to tell you to stop. Before you tune a single hyperparameter, you need to fix your foundation.
The Principle of GIGO (Garbage In, Garbage Out)
Machine Learning models are not magic; they are math. They do not “understand” the world; they find patterns in the numbers you feed them.
If you feed a model noisy, biased, or broken data (“Garbage In”), it will confidently give you wrong predictions (“Garbage Out”). It doesn’t matter if you use the most expensive GPU cluster or the latest Transformer architecture: a model trained on bad data is simply a powerful engine in a broken car.
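To make GIGO concrete, here is a toy sketch using only the Python standard library. It is not a real ML pipeline: the one-dimensional data, the 0.5 threshold, and the "flip every third label" corruption are all assumptions made up for illustration. The same nearest-neighbour rule is trained twice, once on clean labels and once on corrupted ones, and scored on the same clean test set.

```python
def nearest_label(x, train):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(nearest_label(x, train) == y for x, y in test)
    return hits / len(test)


def true_label(x):
    """Hypothetical ground truth: class 1 when x is at least 0.5."""
    return int(x >= 0.5)


# Clean training set: 100 evenly spaced points with correct labels.
clean_train = [(i / 100, true_label(i / 100)) for i in range(100)]

# "Garbage" training set: same points, but every third label is flipped.
noisy_train = [
    (x, 1 - y) if i % 3 == 0 else (x, y)
    for i, (x, y) in enumerate(clean_train)
]

# Clean test set, offset slightly so each point has one unambiguous neighbour.
test = [((i + 0.25) / 100, true_label((i + 0.25) / 100)) for i in range(100)]

clean_acc = accuracy(clean_train, test)
noisy_acc = accuracy(noisy_train, test)
print(f"clean labels: {clean_acc:.2f}, corrupted labels: {noisy_acc:.2f}")
# prints "clean labels: 1.00, corrupted labels: 0.66"
```

The algorithm never changed between the two runs. Only the labels did, and the model faithfully reproduced every mistake it was given.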
Step 1: Define the Problem (Not the Code)
Before you collect a single byte of data, you must define the problem you are actually trying to solve.
- What are we predicting? (e.g., Customer Churn)
- Does the data actually capture this? (e.g., Do we have historical data on customers who actually churned, or just those who complained?)
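One quick way to answer the second question is to audit the label before doing anything else. The sketch below is a minimal example under stated assumptions: the record layout and the field names `churned` and `complained` are hypothetical, and `None` is assumed to mean "outcome never recorded".

```python
from collections import Counter


def label_audit(records, label_key):
    """Report how many records have an observed label and how the classes split."""
    observed = [r[label_key] for r in records if r.get(label_key) is not None]
    coverage = len(observed) / len(records) if records else 0.0
    return {"coverage": coverage, "class_counts": Counter(observed)}


# Hypothetical customer records; None means the outcome was never recorded.
customers = [
    {"id": 1, "complained": True, "churned": True},
    {"id": 2, "complained": True, "churned": False},
    {"id": 3, "complained": False, "churned": None},
    {"id": 4, "complained": False, "churned": False},
]

audit = label_audit(customers, "churned")
print(audit["coverage"])             # share of rows with a known outcome
print(dict(audit["class_counts"]))   # how many churned vs. stayed
```

If coverage is low, or one class barely appears, that is a sign the data may not capture what you want to predict, however good the model is.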
Know Your Data’s Lineage
Data rarely arrives on a silver platter. It comes from messy SQL databases, user-generated logs, or third-party APIs. Understanding Data Lineage, where your data was born and how it traveled to you, is critical.
Ask yourself:
- Is this data a proxy? (Are you using “zip code” as a proxy for “income”? That introduces massive bias.)
- How was it collected? (Was it a mandatory form field? If users were forced to select an option, did they just pick the first one?)
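Both questions can be turned into rough sanity checks. The sketch below is illustrative only: the function names, field names, and sample values are assumptions invented for this example, and a suspicious number is a prompt to investigate, not proof of a problem.

```python
from collections import Counter, defaultdict


def proxy_strength(rows, proxy_key, sensitive_key):
    """Return (baseline, with_proxy): how often the sensitive value is guessed
    correctly by always picking its most common value overall, versus picking
    the most common value within each proxy group."""
    overall = Counter(row[sensitive_key] for row in rows)
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[proxy_key]][row[sensitive_key]] += 1
    baseline = overall.most_common(1)[0][1] / len(rows)
    with_proxy = sum(c.most_common(1)[0][1] for c in groups.values()) / len(rows)
    return baseline, with_proxy


def first_option_share(responses, options):
    """Share of responses equal to the first option shown on the form."""
    return sum(r == options[0] for r in responses) / len(responses)


# Hypothetical rows: does "zip" give away "income_band"?
rows = [
    {"zip": "A", "income_band": "high"},
    {"zip": "A", "income_band": "high"},
    {"zip": "A", "income_band": "low"},
    {"zip": "B", "income_band": "low"},
    {"zip": "B", "income_band": "low"},
    {"zip": "B", "income_band": "low"},
]
baseline, with_proxy = proxy_strength(rows, "zip", "income_band")
print(f"guessing income band: {baseline:.2f} without zip, {with_proxy:.2f} with zip")

# Hypothetical mandatory dropdown: is the first option over-represented?
options = ["Mr", "Ms", "Dr", "Mx"]
responses = ["Mr", "Mr", "Mr", "Ms", "Mr", "Dr", "Mr", "Mr"]
share = first_option_share(responses, options)
print(f"first option chosen {share:.0%} of the time")
```

A large gap between the two proxy numbers suggests the feature is standing in for the sensitive attribute. A first-option share far above what you would expect from the real population is a hint that users clicked through the form rather than answering it.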
The Road Ahead
In this series, we are going to walk through the complete “Raw to Ready” pipeline. We won’t just talk theory; we will cover the practical steps of turning messy real-world data into a clean signal.
Next up: We dive into the most underrated part of the ML stack: Data Labeling. How do you teach a machine what “true” looks like?
Get in touch if you would like help getting your data organised and ready for Data Engineering or ML.