Alchemy for ML: Turning Raw Data into Signal
If data cleaning is the science of ML preparation, Feature Engineering is the art.
Feature Engineering is the process of using domain knowledge to create new variables (features) that make machine learning algorithms work better. A brilliant algorithm with poor features will often be beaten by a simple algorithm with brilliant features.
Encoding: Speaking the Machine’s Language
Machine learning models are mathematical. They require numbers, not words. You cannot simply feed the word “Red” or “Blue” into a neural network.
We use Encoding to translate:
- Label Encoding: Converting categories to numbers (e.g., Small=1, Medium=2, Large=3). This is great for ordinal data where the order matters.
- One-Hot Encoding: Creating a new binary column for every category (e.g., “Is_Red”, “Is_Blue”). This is safer for categories without a natural order, preventing the model from assuming that Blue (2) is “greater than” Red (1).
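The two encodings above can be sketched in a few lines. This is a minimal illustration in plain Python rather than a production recipe; the sample values, the `SIZE_ORDER` mapping and the `one_hot` helper are hypothetical names introduced for the example. In practice a library such as pandas or scikit-learn would usually do this work.

```python
# Hypothetical example data; the values are illustrative only.
sizes = ["Small", "Large", "Medium", "Small"]
colours = ["Red", "Blue", "Red", "Green"]

# Label encoding: an explicit mapping preserves the real-world order.
SIZE_ORDER = {"Small": 1, "Medium": 2, "Large": 3}
size_codes = [SIZE_ORDER[s] for s in sizes]


def one_hot(values):
    """Return (column_names, rows) with one binary column per distinct category."""
    categories = sorted(set(values))
    names = [f"Is_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


colour_columns, colour_rows = one_hot(colours)

print(size_codes)      # [1, 3, 2, 1]
print(colour_columns)  # ['Is_Blue', 'Is_Green', 'Is_Red']
print(colour_rows[0])  # [0, 0, 1]  (the first row is "Red")
```

Writing the size mapping by hand is deliberate: an automatic encoder that numbers categories alphabetically would put “Large” before “Medium”, which loses the very ordering that makes label encoding useful.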
Scaling: Levelling the Playing Field
Imagine a dataset with two columns: “Age” (mostly 20-80) and “Salary” (mostly 30,000-100,000).
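Because the numbers in “Salary” are so much bigger, many models (especially those based on distances or gradient descent) can behave as if Salary were thousands of times more important than Age. To prevent this, we use Scaling to put every column on a comparable scale: Normalisation squeezes values into a fixed range, typically between 0 and 1, while Standardisation rescales them to be centred around zero with a standard deviation of one.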
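To make the two approaches concrete, here is a minimal sketch using only the Python standard library. The function names `min_max_scale` and `standardise` and the sample ages and salaries are assumptions made for illustration; scikit-learn and similar libraries offer equivalent, more robust tools.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalisation: map values linearly onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Standardisation: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data in the ranges mentioned above.
ages = [20, 35, 50, 80]
salaries = [30_000, 45_000, 60_000, 100_000]

ages_scaled = min_max_scale(ages)
salaries_scaled = min_max_scale(salaries)
salaries_standardised = standardise(salaries)

print(ages_scaled)  # [0.0, 0.25, 0.5, 1.0]
print(min(salaries_scaled), max(salaries_scaled))  # 0.0 1.0
```

After scaling, both columns span the same 0 to 1 range, so neither dominates purely because of its units. In a real pipeline the minimum, maximum, mean and standard deviation would be computed on the training data only and then reused for any new data.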
Creating Interaction Features
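Sometimes the strongest signal comes from combining two variables. Suppose you are predicting how quickly a house will sell. The “Asking Price” and “Total Square Footage” are each useful on their own, but creating a new feature called “Price Per Square Foot” (Price divided by Size) might reveal a much stronger pattern about the value of the property. One caution: if the sale price is the very thing you are predicting, a feature built from that price gives the answer away, so only combine columns that are available at prediction time.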
This is where domain expertise shines. You are not just processing data; you are translating real-world logic into inputs the machine can understand.
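The price-per-square-foot idea above can be sketched as follows. This is a minimal illustration in plain Python; the field names (`asking_price`, `square_feet`, `price_per_sqft`) and the listings are hypothetical, and the sketch assumes the price is known before prediction (an asking price, say) rather than being the value the model is asked to predict.

```python
def price_per_square_foot(price, square_feet):
    """Ratio feature: price divided by size, or None when size is unusable."""
    if square_feet is None or square_feet <= 0:
        return None  # missing or zero size: leave the feature empty
    return price / square_feet


# Hypothetical listings; the last one has a bad size value on purpose.
listings = [
    {"asking_price": 300_000, "square_feet": 1_500},
    {"asking_price": 450_000, "square_feet": 1_800},
    {"asking_price": 200_000, "square_feet": 0},
]

for listing in listings:
    listing["price_per_sqft"] = price_per_square_foot(
        listing["asking_price"], listing["square_feet"]
    )

print([listing["price_per_sqft"] for listing in listings])
# [200.0, 250.0, None]
```

Guarding against a zero or missing size matters more than it looks: a single division error or infinite value in a derived column can quietly break everything downstream.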
Next up: We are almost ready to train. But before we do, we must ensure we aren’t cheating. Part 6 covers Splitting and Data Leakage.