Garbage In, Garbage Out: Why Data Prep is the Real Work of Machine Learning

There is a romanticised version of Machine Learning (ML) that exists in movies and marketing pitch decks. In this version, the hard work is the “AI” itself: complex neural networks, cutting-edge algorithms, and futuristic code.

The reality, as most Data Scientists will tell you, is quite different. A common rule of thumb puts the split at roughly 80% data preparation and 20% modeling.

If you are embarking on an ML project, your instinct might be to rush toward import sklearn or import tensorflow. This series is here to tell you to stop. Before you tune a single hyperparameter, you need to fix your foundation.

The Principle of GIGO (Garbage In, Garbage Out)

Machine Learning models are not magic; they are math. They do not “understand” the world; they find patterns in the numbers you feed them.

If you feed a model noisy, biased, or broken data (“Garbage In”), it will confidently give you wrong predictions (“Garbage Out”). It doesn’t matter if you use the most expensive GPU cluster or the latest Transformer architecture: a model trained on bad data is simply a powerful engine in a broken car.
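To make the principle concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy feature `xs`, the labelling rule, the "broken pipeline" that records some positives as negatives, and the deliberately simple threshold model. It is not a real workflow, only a small demonstration that the same algorithm trained on corrupted labels scores worse against the truth.

```python
from statistics import mean


def fit_threshold(xs, ys):
    """Learn a cut-off halfway between the mean of each class."""
    mean_neg = mean(x for x, y in zip(xs, ys) if y == 0)
    mean_pos = mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean_neg + mean_pos) / 2


def accuracy(threshold, xs, ys):
    """Share of rows where 'x >= threshold' matches the true label."""
    correct = sum((x >= threshold) == bool(y) for x, y in zip(xs, ys))
    return correct / len(xs)


# Hypothetical toy data: one feature, true label is 1 when x >= 50.
xs = list(range(100))
true_labels = [1 if x >= 50 else 0 for x in xs]

# "Garbage In": a hypothetical logging bug records 0 for every x in 50..79.
broken_labels = [0 if 50 <= x < 80 else y for x, y in zip(xs, true_labels)]

clean_threshold = fit_threshold(xs, true_labels)
broken_threshold = fit_threshold(xs, broken_labels)

# Both models are scored against the true labels.
clean_accuracy = accuracy(clean_threshold, xs, true_labels)
broken_accuracy = accuracy(broken_threshold, xs, true_labels)

print(f"trained on clean labels:  {clean_accuracy:.2f}")
print(f"trained on broken labels: {broken_accuracy:.2f}")
```

The algorithm is identical in both runs; only the labels differ. No amount of tuning on the second model recovers the information the broken labels destroyed.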

Step 1: Define the Problem (Not the Code)

Before you collect a single byte of data, you must define the problem you are trying to solve. Start with two questions:

  • What are we predicting? (e.g., Customer Churn)

  • Does the data actually capture this? (e.g., Do we have historical data on customers who actually churned, or just those who complained?)
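The second question can be checked before any modeling starts. The sketch below assumes a hypothetical list of customer records with `churned` and `complained` fields (both names are invented for this example). It reports how much of the target is actually recorded, and how often a complaint agrees with real churn in the rows where the outcome is known.

```python
from collections import Counter


def summarize_target(records, target="churned"):
    """Return the share of each target value, counting missing as None."""
    counts = Counter(record.get(target) for record in records)
    total = len(records)
    return {value: count / total for value, count in counts.items()}


def proxy_agreement(records, proxy="complained", target="churned"):
    """Share of labelled rows where the proxy equals the real outcome."""
    labelled = [r for r in records if r.get(target) is not None]
    if not labelled:
        return None  # no ground truth at all: the proxy cannot be validated
    agree = sum(r[proxy] == r[target] for r in labelled)
    return agree / len(labelled)


# Hypothetical records, invented purely for illustration.
customers = [
    {"id": 1, "complained": True, "churned": True},
    {"id": 2, "complained": True, "churned": False},
    {"id": 3, "complained": False, "churned": None},  # outcome never recorded
    {"id": 4, "complained": False, "churned": False},
]

shares = summarize_target(customers)
agreement = proxy_agreement(customers)

print(shares)
print(agreement)
```

A large share of missing outcomes, or weak agreement between complaints and churn, suggests the dataset describes a different question from the one you set out to answer.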

Know Your Data’s Lineage

Data rarely arrives on a silver platter. It comes from messy SQL databases, user-generated logs, or third-party APIs. Understanding Data Lineage, where your data was born and how it traveled to you, is critical.

Ask yourself:

  1. Is this data a proxy? (Are you using “zip code” as a proxy for “income”? That can introduce serious bias.)

  2. How was it collected? (Was it a mandatory form field? If users were forced to select an option, did they just pick the first one?)
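The collection question lends itself to a quick check. The sketch below assumes a hypothetical mandatory dropdown whose options are listed in display order, and flags the field when the first option is chosen far more often than an even spread would suggest. The `factor` cut-off of 2.0 is an arbitrary assumption, not an established rule, so treat a flag as a prompt to investigate rather than as proof.

```python
from collections import Counter


def option_shares(responses, options):
    """Share of responses for each option, in display order."""
    counts = Counter(responses)
    total = len(responses)
    return {option: counts[option] / total for option in options}


def looks_like_default_bias(responses, options, factor=2.0):
    """True if the first option's share is at least `factor` times an even split."""
    shares = option_shares(responses, options)
    even_split = 1 / len(options)
    return shares[options[0]] >= factor * even_split


# Hypothetical dropdown and answers, invented for illustration.
industries = ["Agriculture", "Banking", "Construction", "Education"]

suspicious = ["Agriculture"] * 6 + ["Banking"] * 2 + ["Construction", "Education"]
balanced = ["Agriculture"] * 3 + ["Banking"] * 3 + ["Construction"] * 2 + ["Education"] * 2

print(looks_like_default_bias(suspicious, industries))
print(looks_like_default_bias(balanced, industries))
```

The first option may of course be genuinely popular. The check only tells you where to look in the data's lineage, for example by comparing answers from before and after the field became mandatory.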

The Road Ahead

In this series, we are going to walk through the complete “Raw to Ready” pipeline. We won’t just talk theory; we will cover the practical steps of turning messy real-world data into a clean signal.

Next up: Data Labeling, the most underrated part of the ML stack. How do you teach a machine what “true” looks like?
