Garbage In, Garbage Out: Why Data Prep is the Real Work of Machine Learning

January 20th, 2026

There is a romanticised version of Machine Learning (ML) that exists in movies and marketing pitch decks. In this version, the hard work is the “AI” itself: complex neural networks, cutting-edge algorithms, and futuristic code.

The reality, as most data scientists will tell you, is quite different. As a rule of thumb, the work is about 80% data preparation and 20% modeling.

If you are embarking on an ML project, your instinct might be to rush toward import sklearn or import tensorflow. This series is here to tell you to stop. Before you tune a single hyperparameter, you need to fix your foundation.

The Principle of GIGO (Garbage In, Garbage Out)

Machine Learning models are not magic; they are math. They do not “understand” the world; they find patterns in the numbers you feed them.

If you feed a model noisy, biased, or broken data (“Garbage In”), it will confidently give you wrong predictions (“Garbage Out”). It doesn’t matter if you use the most expensive GPU cluster or the latest Transformer architecture: a model trained on bad data is simply a powerful engine in a broken car.
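To make the principle concrete, here is a minimal sketch in plain Python. It is a toy, not a real training pipeline: the data, the hypothetical logging bug, and the helper names `fit_threshold` and `accuracy` are all invented for illustration. The same simple “model” (a single cut-off value) is fitted twice, once on correct labels and once on labels where a block of positives was recorded as negatives, and both are then scored against the truth.

```python
def fit_threshold(xs, ys):
    """Pick the cut-off t for which 'predict 1 when x >= t' disagrees with the fewest labels."""
    candidates = sorted(set(xs)) + [max(xs) + 1]

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def accuracy(t, xs, ys):
    """Fraction of points where 'x >= t' matches the given labels."""
    return sum((x >= t) == bool(y) for x, y in zip(xs, ys)) / len(xs)


xs = list(range(20))
true_labels = [int(x >= 10) for x in xs]

# Assumption for illustration: a logging bug recorded every positive
# between 10 and 14 as a negative ("Garbage In").
broken_labels = [0 if 10 <= x < 15 else y for x, y in zip(xs, true_labels)]

clean_t = fit_threshold(xs, true_labels)
broken_t = fit_threshold(xs, broken_labels)

# Both models are scored against the TRUE labels.
clean_acc = accuracy(clean_t, xs, true_labels)
broken_acc = accuracy(broken_t, xs, true_labels)

print(f"clean data:  threshold={clean_t}, accuracy={clean_acc:.0%}")
print(f"broken data: threshold={broken_t}, accuracy={broken_acc:.0%}")
```

The model fitted on the broken labels matches its own training data perfectly, so nothing looks wrong from the inside. The damage only shows up when you compare it with what actually happened.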

Step 1: Define the Problem (Not the Code)

Before you collect a single byte of data, you must define the problem you are trying to solve.

What are we predicting? (e.g., Customer Churn)
Does the data actually capture this? (e.g., Do we have historical data on customers who actually churned, or just those who complained?)
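The second question can be checked before any modeling starts. The sketch below assumes a hypothetical list of customer records with a `churned` field; the field names and values are invented for illustration. It simply counts how many rows carry a usable label and how many have none.

```python
from collections import Counter


def target_summary(records, target):
    """Count each value of `target`; rows where it is absent are counted under None."""
    return Counter(r.get(target) for r in records)


# Hypothetical records: we always know who complained,
# but the churn outcome is only recorded for some customers.
customers = [
    {"id": 1, "complained": True, "churned": True},
    {"id": 2, "complained": True, "churned": False},
    {"id": 3, "complained": False, "churned": None},
    {"id": 4, "complained": False},
    {"id": 5, "complained": True, "churned": False},
]

summary = target_summary(customers, "churned")
labeled = sum(n for value, n in summary.items() if value is not None)

print(summary)
print(f"{labeled} of {len(customers)} rows have a usable churn label")
```

If most rows land under `None`, or only one outcome ever appears, the data probably does not capture the thing you want to predict, however many rows there are.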

Know Your Data’s Lineage

Data rarely arrives on a silver platter. It comes from messy SQL databases, user-generated logs, or third-party APIs. Understanding Data Lineage, where your data was born and how it traveled to you, is critical.

Ask yourself:

Is this data a proxy? (Are you using “zip code” as a proxy for “income”? That can introduce significant bias.)
How was it collected? (Was it a mandatory form field? If users were forced to select an option, did they just pick the first one?)
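The “did they just pick the first one?” question can be given a rough first check in code. The sketch below is an assumption-heavy heuristic, not an established test: the option list, the responses, and the “twice the uniform share” cut-off are all made up for illustration, and real answers are rarely uniform anyway. It only flags fields that deserve a closer look.

```python
from collections import Counter


def first_option_share(responses, options):
    """Return (observed share of the first option, share expected under uniform choice)."""
    counts = Counter(responses)
    observed = counts[options[0]] / len(responses)
    expected = 1 / len(options)
    return observed, expected


# Hypothetical mandatory dropdown, listed in the order the form showed it.
options = ["Agriculture", "Education", "Finance", "Healthcare"]
responses = ["Agriculture"] * 6 + ["Education", "Finance", "Finance", "Healthcare"]

observed, expected = first_option_share(responses, options)

# Arbitrary illustrative cut-off: flag the field if the first option
# appears more than twice as often as uniform choice would suggest.
suspicious = observed > 2 * expected

print(f"first option: {observed:.0%} observed vs {expected:.0%} expected")
print("worth investigating" if suspicious else "no obvious default-option effect")
```

A flag here does not prove the data is wrong. It tells you to go back to the lineage questions above and find out how the form actually behaved.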

The Road Ahead

In this series, we are going to walk through the complete “Raw to Ready” pipeline. We won’t just talk theory; we will cover the practical steps of turning messy real-world data into a clean signal.

Next up, we dive into the most underrated part of the ML stack: Data Labeling. How do you teach a machine what “true” looks like?

Get in touch if you would like help getting your data ready for data engineering or ML.
