Sherlock Holmes Mode: Getting to Know Your Data

February 3rd, 2026

In Part 2, we discussed the importance of labelling your data to create a ground truth. Now that we have our dataset, the temptation is to immediately start “fixing” it.

However, you cannot fix what you do not understand.

This stage is called Exploratory Data Analysis (EDA). Think of this as the interview phase. You are interviewing your data to learn its secrets, its quirks, and its potential problems.

Visualising the Distribution

The first step is always visualisation. You need to see the shape of your data. If you simply calculate the average (mean) of a column, you might miss the full story.

For example, if you are analysing house prices, a few massive mansions could skew your average significantly. By using Histograms and Box Plots, you can visualise the spread. Is the data bell-shaped (normal distribution)? Or is it heavily skewed to one side? Understanding this shape helps you choose the right statistical tools later on.
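To make the mansion effect concrete, here is a minimal sketch using only the Python standard library. The prices are invented for illustration, and `text_histogram` is a hypothetical helper rather than a library function; in practice you would more likely plot a histogram or box plot with a charting library.

```python
from statistics import mean, median

# Hypothetical house prices in thousands; the last two are the "mansions".
prices = [210, 225, 240, 250, 255, 270, 285, 300, 2500, 4000]

avg = mean(prices)    # pulled upwards by the two extreme values
mid = median(prices)  # stays close to the typical house

BIN_WIDTH = 500

def text_histogram(values, bin_width):
    """Return {bin_start: count} for fixed-width bins, sorted by bin start."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

hist = text_histogram(prices, BIN_WIDTH)

print(f"mean   = {avg}")
print(f"median = {mid}")
for start, count in hist.items():
    # One '#' per house in the bin gives a rough picture of the shape.
    print(f"{start:>5} to {start + BIN_WIDTH - 1:>5}: {'#' * count}")
```

A mean far above the median, with almost every bar sitting in the lowest bin, is a quick sign of a right-skewed distribution.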

The Outlier Dilemma: Error or Insight?

During EDA, you will almost certainly find data points that do not fit. These are outliers.

If you are analysing customer ages and find a value of “200”, that is clearly an error to be corrected or removed. However, if you are analysing credit card transactions and see a massive purchase, that might not be an error. That might be the fraud you are trying to predict.

Key takeaway: Never delete outliers blindly. Investigate them. They often hold the most valuable information in the entire dataset.
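One common way to surface candidates for investigation is the interquartile range (IQR) rule. This is a sketch under assumptions: the transaction amounts are invented, `flag_outliers` is a hypothetical helper, and the 1.5 multiplier is a widely used convention rather than anything specific to this series. Note that it only flags values; it deletes nothing.

```python
from statistics import quantiles

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [12, 15, 18, 20, 22, 25, 27, 30, 35, 950]

def flag_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] without removing them."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

flagged = flag_outliers(amounts)
print("Investigate these values:", flagged)
```

Whether a flagged value is a data entry error or the very fraud you are hunting is a judgement call that needs domain context, which is why the helper returns the values for review instead of dropping them.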

Finding Relationships

Finally, we look for correlations. How do different variables interact?

A Correlation Matrix (often visualised as a heatmap) allows you to spot redundant features. If “variable A” and “variable B” move in near-perfect sync, you likely do not need both. Feeding the model redundant information can slow down training, make some models harder to interpret, and may contribute to overfitting.
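As a rough illustration, the sketch below computes pairwise Pearson correlations and lists the pairs that exceed a threshold. The column names, values, and the 0.95 cut-off are all assumptions made for the example, and `pearson` and `redundant_pairs` are hypothetical helpers; a dataframe library would normally produce the full matrix for you.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Raises ZeroDivisionError if either list is constant (zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical features: the two area columns carry the same information
# in different units, while age is largely unrelated to either.
columns = {
    "area_sq_m": [50, 70, 90, 110, 130],
    "area_sq_ft": [538, 753, 969, 1184, 1399],
    "age_years": [30, 5, 42, 12, 20],
}

def redundant_pairs(cols, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(cols, 2):
        r = pearson(cols[a], cols[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

pairs = redundant_pairs(columns)
for a, b, r in pairs:
    print(f"{a} and {b} look redundant (r = {r})")
```

A high correlation is a prompt to look closer, not an automatic instruction to drop a column; which of the two features to keep still depends on what each one means.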

Once you understand the shape and structure of your data, you are finally ready to pick up the mop and bucket.

Next up: Part 4 covers Data Cleaning. We look at how to handle missing values and scrub the dataset until it shines.
