Data Cleaning

The Great Cleanup: Handling Missing Values, Duplicates, and Errors

Welcome to the unglamorous reality of Machine Learning. While data scientists love to talk about algorithms, a large share of their time is spent here, in the “janitor work” phase.

Real-world data is rarely clean. It is full of holes, duplicates, and spelling mistakes. If you feed this mess into a model, it will learn from the errors and its predictions will suffer. Here is how we scrub the data so it is ready for training.

The Problem of Missing Data

It is rare to find a dataset where every single row is complete. Perhaps a user forgot to enter their age, or a sensor went offline for an hour.

You generally have two options:

  1. Drop the data: If a row is mostly empty (say, 90% of its fields are missing), remove it.

  2. Impute the data: Fill in the blanks with a reasonable estimate.

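Simple imputation fills each empty spot with the mean (average) or median of its column. The median is usually the safer choice when the column contains outliers, because extreme values pull the mean towards them. More advanced methods use algorithms like K-Nearest Neighbours (KNN) to estimate the missing value from the most similar complete rows.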
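As a rough sketch of the two options above, the snippet below drops rows that are mostly empty and then fills a remaining gap with the column median. It uses only the Python standard library. The function names, the 50% threshold, and the sample columns are assumptions made for illustration; in practice a library such as pandas or scikit-learn would normally handle this.

```python
from statistics import median


def drop_sparse_rows(rows, max_missing_fraction=0.5):
    """Keep rows whose share of missing (None) values is at most the threshold.

    Assumes every row is a non-empty dict; the 0.5 default is an arbitrary example.
    """
    kept = []
    for row in rows:
        missing = sum(1 for value in row.values() if value is None)
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


def impute_median(rows, column):
    """Return copies of the rows with missing values in `column` set to the median.

    Assumes the column is numeric and has at least one observed value;
    statistics.median raises StatisticsError otherwise.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical sample data: None marks a missing value.
sample = [
    {"age": 34, "income": 52000, "city": "London"},
    {"age": None, "income": 48000, "city": "Leeds"},
    {"age": 29, "income": None, "city": "London"},
    {"age": None, "income": None, "city": None},  # entirely empty, so it is dropped
]

usable = drop_sparse_rows(sample)
cleaned = impute_median(usable, "age")
```

Both functions return new lists rather than modifying the input, so the original data stays available for comparison with the cleaned version.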

Standardising Inconsistent Formats

Human input is notoriously inconsistent. In a “Location” column, you might find:

  • “London”

  • “london”

  • “Greater London”

  • “Lndn”

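To a computer, these are four different strings, so a model will treat them as four different categories. You must standardise these entries so the model recognises them as the same place. This usually means normalising case and whitespace, then mapping known variants and misspellings to a single canonical value.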
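A minimal sketch of that clean-up, using the four “Location” values above: normalise case and whitespace first, then look the result up in an alias table. The table and the function name are hypothetical, and deciding that “Greater London” belongs with “London” is a judgement that depends on your use case.

```python
# Hypothetical alias table: each key is an already-normalised variant
# (lowercase, single spaces) and each value is the canonical category.
LOCATION_ALIASES = {
    "london": "London",
    "greater london": "London",
    "lndn": "London",
}


def standardise_location(raw):
    """Map a raw location string to its canonical form.

    Returns None for values not in the alias table so they can be
    reviewed by hand instead of being guessed at.
    """
    normalised = " ".join(raw.strip().lower().split())
    return LOCATION_ALIASES.get(normalised)


raw_values = ["London", "london", "Greater London", "Lndn"]
standardised = [standardise_location(value) for value in raw_values]
```

Returning None for unknown values is a deliberate choice here: a misspelling you have not seen before is flagged for review rather than silently becoming a new category.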

Sanity Checks

Finally, run logic checks.

  • Are there negative numbers in a “Price” column?

  • Is someone’s age listed as zero?

  • Are there exact duplicate rows?

Automating these sanity checks ensures that obvious errors do not silently sabotage your model’s performance.
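One way to automate the three checks above is a small function that scans every row and reports what it finds. This is a sketch under assumptions: the column names "price" and "age", the issue labels, and the function name are all made up for illustration, and values are assumed to be simple scalars such as numbers and strings.

```python
def run_sanity_checks(rows):
    """Return a list of (row_index, issue) pairs for rows that fail a check."""
    issues = []
    seen = set()
    for index, row in enumerate(rows):
        price = row.get("price")
        if price is not None and price < 0:
            issues.append((index, "negative price"))

        age = row.get("age")
        if age is not None and age <= 0:
            issues.append((index, "age is zero or negative"))

        # Sort the items so two rows with the same content but a different
        # key order still count as exact duplicates.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            issues.append((index, "exact duplicate"))
        seen.add(fingerprint)
    return issues


# Hypothetical sample data.
orders = [
    {"price": 10.0, "age": 30},
    {"price": -5.0, "age": 41},
    {"price": 3.5, "age": 0},
    {"price": 10.0, "age": 30},
]

problems = run_sanity_checks(orders)
```

The function only reports problems and leaves the data untouched. Whether a flagged row is fixed, dropped, or kept (an age of zero may be valid for an infant) is a decision for whoever knows the dataset.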

Next up: Now that the data is clean, can we make it better? Part 5 explores Feature Engineering, the art of turning raw data into powerful signals.
