Sherlock Holmes Mode: Getting to Know Your Data
In Part 2, we discussed the importance of labelling your data to create a ground truth. Now that we have our dataset, the temptation is to immediately start “fixing” it.
However, you cannot fix what you do not understand.
This stage is called Exploratory Data Analysis (EDA). Think of this as the interview phase. You are interviewing your data to learn its secrets, its quirks, and its potential problems.
Visualising the Distribution
The first step is always visualisation. You need to see the shape of your data. If you simply calculate the average (mean) of a column, you might miss the full story.
For example, if you are analysing house prices, a few massive mansions could pull the mean well above what a typical home costs. By using histograms and box plots, you can visualise the spread. Is the data bell-shaped (normal distribution)? Or is it heavily skewed to one side? Understanding this shape helps you choose the right statistical tools later on.
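To make the mansion effect concrete, here is a minimal sketch using pandas and NumPy. The prices are made-up illustration values, not real data, and the variable names are hypothetical. It compares the mean with the median and prints a crude text histogram so the long right tail is visible without any plotting library.

```python
import numpy as np
import pandas as pd

# Made-up house prices (in thousands): eight ordinary homes and two mansions.
prices = pd.Series(
    [210, 225, 240, 250, 255, 260, 275, 290, 2500, 4000],
    name="price_k",
)

mean_price = prices.mean()      # pulled upwards by the two mansions
median_price = prices.median()  # the middle value, barely affected by them
skewness = prices.skew()        # a positive value means a long tail to the right

print(f"mean:   {mean_price:.1f}")
print(f"median: {median_price:.1f}")
print(f"skew:   {skewness:.2f}")

# Text version of a histogram: count the prices in 5 equal-width bins.
counts, edges = np.histogram(prices, bins=5)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:7.0f} to {right:7.0f} | {'#' * int(count)}")
```

A mean far above the median is a quick hint that the data is skewed to the right. If matplotlib is installed, `prices.plot.hist()` and `prices.plot.box()` draw the same picture as proper charts.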
The Outlier Dilemma: Error or Insight?
During EDA, you will almost certainly find data points that do not fit. These are outliers.
If you are analysing customer ages and find a value of “200”, that is clearly a data-entry error and should be corrected or removed. However, if you are analysing credit card transactions and see a massive purchase, that might not be an error. That might be the fraud you are trying to predict.
Key takeaway: Never delete outliers blindly. Investigate them. They often hold the most valuable information in the entire dataset.
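One way to investigate without deleting is to flag suspicious rows and review them by hand. The sketch below uses the common 1.5 × IQR rule of thumb, which is an assumption on our part rather than the only valid method. The transaction amounts are made up for illustration.

```python
import pandas as pd

# Made-up transaction amounts; the last one is far larger than the rest.
amounts = pd.Series(
    [12.5, 40.0, 18.0, 25.0, 33.0, 22.0, 15.0, 29.0, 2500.0],
    name="amount",
)

# Interquartile range: the spread of the middle 50% of the data.
q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Rule-of-thumb fences; anything outside them is worth a closer look.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag the rows instead of dropping them, so nothing is lost.
report = pd.DataFrame({
    "amount": amounts,
    "is_outlier": (amounts < lower) | (amounts > upper),
})
flagged = report[report["is_outlier"]]
print(flagged)
```

Every row stays in `report`, so a flagged value can still be checked against its source and kept if it turns out to be genuine, such as the fraudulent purchase you are trying to detect.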
Finding Relationships
Finally, we look for correlations. How do different variables interact?
A Correlation Matrix (often visualised as a heatmap) allows you to spot redundant features. If “variable A” and “variable B” move in near-perfect sync, you likely do not need both. Feeding the model redundant information can slow down training and contribute to overfitting.
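As a rough sketch of how redundant features show up, the example below builds a small made-up table in which one column is simply another column in different units. The column names and the 0.95 threshold are assumptions chosen for illustration. It then lists every pair of columns whose correlation is at or above the threshold.

```python
import pandas as pd

# Made-up housing data: area_sqm is just area_sqft converted to square metres.
df = pd.DataFrame({
    "area_sqft": [800, 950, 1100, 1400, 1800, 2400],
    "area_sqm": [sqft * 0.0929 for sqft in [800, 950, 1100, 1400, 1800, 2400]],
    "age_years": [30, 5, 22, 12, 40, 8],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))

# Collect each pair once (upper triangle only) when it is strongly correlated.
threshold = 0.95  # assumed cut-off; tune it for your own data
cols = corr.columns
redundant_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) >= threshold
]
print(redundant_pairs)
```

Here only the two area columns are reported, which suggests one of them could be dropped. The `corr` table can also be passed to a plotting function such as seaborn's `heatmap` if you prefer the visual version.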
Once you understand the shape and structure of your data, you are finally ready to pick up the mop and bucket.
Next up: Part 4 covers Data Cleaning. We look at how to handle missing values and scrub the dataset until it shines.