The Danger Zone (Data Leakage & Splitting)

Don’t Cheat: Proper Splitting and Avoiding Data Leakage

February 25th, 2026

We have now arrived at the most dangerous phase of the data preparation pipeline.

You have collected, labelled, cleaned, and engineered your data. It looks perfect. You train your model, and it achieves 99% accuracy. You pop the champagne.

Then, you deploy it to the real world, and it fails miserably.

Why? Because you unknowingly cheated. You fell victim to Data Leakage.

The first rule of machine learning is that you never judge a model on the data it studied. That is like handing a student the exam questions right before the test.

To prevent this, we split our data into three distinct sets:

Training Set (The Textbook): The model sees this data and learns from it. This is usually 70–80% of your data.
Validation Set (The Mock Exam): Used during training to tune settings (hyperparameters). The model doesn’t learn directly from this, but we use it to guide our decisions.
Test Set (The Final Exam): This data is locked away in a vault. The model never sees it until the very end. It is the only true measure of how your model will perform in the real world.
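The three-way split can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function name three_way_split, the 70/15/15 proportions, and the seed value are assumptions chosen for the example.

```python
import random

def three_way_split(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle once, then cut the data into train / validation / test lists."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is the final exam
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Splitting once, up front, with a fixed seed means the Test Set stays the same between experiments, so it can stay locked away until the end. A random shuffle like this only suits data with no time ordering.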

Understanding Data Leakage

Data Leakage occurs when the model is trained with information it will not have at the moment of prediction. In many cases, it essentially means your model has access to the future.

Example 1: The “Future” Feature

Imagine you are predicting whether a customer will cancel their subscription (churn). Your dataset includes a column called “Cancellation Date.”

If you leave this column in your training data, the model will instantly learn: “If there is a date here, the customer churned.” It gets 100% accuracy. But in the real world, active customers do not have a cancellation date yet. The model is useless because it relied on information it won’t have at the moment of prediction.
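One way to guard against this is to strip such columns before the model ever sees them, and to treat any column that lines up perfectly with the label as suspect. The sketch below assumes rows stored as dictionaries; apart from the cancellation date, every column name and value is hypothetical.

```python
# Hypothetical schema: only the cancellation date comes from the example above.
LEAKY_COLUMNS = {"cancellation_date"}
TARGET = "churned"

def to_features(row):
    """Keep only the fields that would exist at the moment of prediction."""
    return {k: v for k, v in row.items() if k not in LEAKY_COLUMNS and k != TARGET}

def looks_like_leak(rows, column, target):
    """Flag a column whose mere presence matches the label on every row."""
    return all((row[column] is not None) == bool(row[target]) for row in rows)

# Made-up rows for illustration.
customers = [
    {"monthly_spend": 30, "support_tickets": 4, "cancellation_date": "2026-01-10", "churned": 1},
    {"monthly_spend": 55, "support_tickets": 0, "cancellation_date": None, "churned": 0},
]

X = [to_features(c) for c in customers]  # model inputs, leak removed
y = [c[TARGET] for c in customers]       # labels
print(looks_like_leak(customers, "cancellation_date", TARGET))
print(X[0])
```

A check like looks_like_leak is only a heuristic. The more reliable test is to ask, for every column, whether its value would be known at the moment you need the prediction.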

Example 2: Improper Splitting in Time Series

If you are predicting stock prices, you cannot simply shuffle the data randomly.

If you shuffle, your Training Set might contain data from tomorrow, while your Test Set contains data from yesterday. The model will learn to use tomorrow’s news to predict yesterday’s price.

The Fix: For time-series data, you must split chronologically. Train on January–March, test on April.
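A chronological split can be sketched as a sort followed by a cut at a fixed date. The function name chronological_split, the "date" and "close" field names, the year, and the prices are all made up for illustration.

```python
from datetime import date

def chronological_split(rows, cutoff, date_key="date"):
    """Everything before the cutoff trains the model; the rest tests it."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    train = [r for r in ordered if r[date_key] < cutoff]
    test = [r for r in ordered if r[date_key] >= cutoff]
    return train, test

# Made-up prices, deliberately out of order.
prices = [
    {"date": date(2026, 4, 2), "close": 104.0},
    {"date": date(2026, 1, 15), "close": 100.0},
    {"date": date(2026, 3, 30), "close": 103.0},
    {"date": date(2026, 2, 10), "close": 101.5},
]

# Train on January to March, test on April.
train, test = chronological_split(prices, cutoff=date(2026, 4, 1))
print([r["date"].isoformat() for r in train])
print([r["date"].isoformat() for r in test])
```

The property worth checking is that the latest training date is earlier than the earliest test date. If that holds, the model cannot have seen the future.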

Stratified Sampling

Finally, be careful with rare events. If you are detecting a rare disease that only occurs in 1% of patients, a random split of a small dataset might leave very few cases of the disease, or even zero, in your Test Set.

To fix this, we use Stratified Sampling. This forces the split to maintain the same percentage of target classes (e.g., 1% positive, 99% negative) across both the Training and Test sets.
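To make the idea concrete, here is a minimal sketch of a stratified split in plain Python: group the rows by class, then split each group separately. The function name stratified_split, the 20% test fraction, and the synthetic patient records are assumptions for illustration. Libraries such as scikit-learn offer the same behaviour through the stratify argument of train_test_split.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_frac=0.2, seed=42):
    """Split each class separately so both sets keep the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # At least one row per class goes to the test set.
        # Assumes every class has two or more rows, so train is not left empty.
        n_test = max(1, round(len(members) * test_frac))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Synthetic data: 10 positive cases out of 1000 patients (1%).
patients = [{"id": i, "disease": 1 if i < 10 else 0} for i in range(1000)]
train, test = stratified_split(patients, label_of=lambda p: p["disease"])

train_rate = sum(p["disease"] for p in train) / len(train)
test_rate = sum(p["disease"] for p in test) / len(test)
print(train_rate, test_rate)
```

Because each class is split on its own, the rare class can never vanish from the Test Set by bad luck.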

Next up: In the final part of our series, we look at how to stop doing this manually. Part 7 covers Automation and Pipelines.
