Enhancing Machine Learning Models: A Deep Dive into Feature Engineering
Salman Khan
Posted on December 31, 2023
The efficacy of machine learning models heavily depends on the quality of input data and features [1]. In traditional machine learning, transforming raw data into features is crucial for model accuracy. Feature engineering aims to transform existing data into informative, relevant, and discriminative features. Although deep learning and end-to-end learning have revolutionized and automated processing for images, text, and signals, feature engineering for relational and human behavioural data remains an iterative, slow and laborious task [2].
This article explores techniques for feature engineering to enhance the accuracy and reliability of a predictive model. Additionally, it presents solutions that can help streamline the feature engineering process.
Data Science Workflow: An Iterative Three-Step Process [2]
In the initial phase, analysts define the predictive objectives. Data engineers then extract, load, and transform the data, engineer features, and define target labels. Finally, machine learning engineers construct models tailored to the specified predictive goals, iteratively exploring various techniques to find the most suitable solution.
What is Feature Engineering?
Feature engineering is an essential step in traditional machine learning, where experts manually design and extract relevant features from the processed data. The goal is to encode expert knowledge, intuitive judgement, and human preconceptions into the machine learning model. This makes learning easier, especially with smaller data sets, and increases model accuracy and interpretability.
However, feature engineering is not a one-size-fits-all solution. The choice of features and techniques depends on the nature of the data, the complexity of the problem, and the goals of the model. Moreover, feature engineering is an iterative process where the model performance is evaluated, and the features are refined and updated accordingly.
The figure below illustrates how simply projecting existing covariates into higher dimensional space, i.e. creating polynomial variants of existing features, can make the data linearly separable and easier to learn by a simple machine learning model.
Example of how feature engineering improves model accuracy - Adding a polynomial variant of existing features can make classes linearly separable and easier for a simple ML model to learn.
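As a minimal illustration of this idea, consider scikit-learn's synthetic make_circles data (used here instead of the dataset in the figure): a linear classifier struggles on the raw two features but separates the classes easily once squared variants are added.

# Minimal sketch: polynomial variants of existing features can make
# non-linearly-separable classes separable by a simple linear model.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

# Linear model on the raw features: close to chance accuracy.
raw_acc = LogisticRegression().fit(X, y).score(X, y)

# Project into a higher-dimensional space by adding squared terms.
X_poly = np.hstack([X, X ** 2])
poly_acc = LogisticRegression().fit(X_poly, y).score(X_poly, y)

print(f"raw features: {raw_acc:.2f}, with squared features: {poly_acc:.2f}")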
Basic Feature Engineering Techniques on Relational and Temporal Data
The table below summarizes some basic feature engineering techniques relevant to traditional machine learning models; a short sketch applying a few of them follows the table.
| Technique | Description and Use Cases | Procedure |
|---|---|---|
| Imputation | Filling or estimating missing values to complete the dataset. Critical for handling missing data before model training. | Mean, median, or mode imputation for numerical variables. |
| Scaling | Normalizing numerical features to a similar scale to prevent bias. Essential for preventing the dominance of features with larger magnitudes. | Min-Max scaling (values in [0, 1]). Z-score normalization (mean of 0, standard deviation of 1). |
| Outlier Capping | Setting predefined upper and lower bounds for numerical values to limit extreme values (outliers). This prevents extreme values from disproportionately influencing the model. | Define upper and lower bounds based on percentiles or specific thresholds. Cap values above the upper bound and floor values below the lower bound. |
| One-Hot Encoding | Representing categorical variables as binary vectors. Enables machine learning algorithms to work with categorical data. | Create a binary column for each category. Assign 1 to the corresponding category, 0 otherwise. |
| Binning | Transforming continuous numerical features into categorical ones. Useful for handling non-linear relationships in numerical data. | Group values into discrete intervals or bins. |
| Log Transform | Applying a logarithmic transformation to skewed numerical features. Effective for variables like income with skewed distributions. | Take the logarithm of values to handle right-skewed distributions. |
| Polynomial Features | Creating new features via polynomial transformation. Captures non-linear relationships in data. | If x is a feature, x^2 and x^3 become new features. |
| Feature Interactions | Creating new features by combining existing features. Captures joint effects on the target variable. | If x1 and x2 are features, create a new feature x1 * x2. |
| Feature Aggregation | Combining multiple related features into a single, more informative feature. Reduces dimensionality and captures consolidated information. | Calculate averages or sums of related features. |
| Time-Based Features | Extracting temporal information from timestamps or time-related data. Useful for understanding temporal patterns in the data. | Examples: day of the week, hour of the day, time lags for time series data. |
| Regular Expression Features | Extracting patterns from text data using regular expressions. Useful for identifying specific structures or formats in text. | Examples: matching email addresses, extracting dates, identifying hashtags in social media text. |
| Frequency Encoding | Assigning numerical values based on the frequency of categorical variables. Preserves information about the distribution of categories and suits high-cardinality categorical variables. | Replace each category with its count or relative frequency in the dataset. |
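As a hedged sketch of how a few of these techniques might look in practice (using pandas and a hypothetical data frame with age, income, and city columns):

# Hypothetical data frame with a numeric, a skewed numeric, and a categorical column.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 51],
    "income": [30_000, 45_000, 52_000, 250_000],
    "city": ["Lahore", "Karachi", "Lahore", "Islamabad"],
})

# Imputation: fill missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Scaling: Min-Max scale income into [0, 1].
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Log transform: compress the right-skewed income distribution.
df["income_log"] = np.log1p(df["income"])

# Binning: turn age into discrete intervals.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 45, 120], labels=["young", "mid", "senior"])

# One-hot encoding: one binary column per city.
df = pd.get_dummies(df, columns=["city"], prefix="city")

print(df.head())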
Domain Specific Feature Engineering
Natural Language Processing (NLP)
NLP tasks require extracting features from text data; common techniques include the following (a short sketch follows the list):
- Tokenization: Splitting text into individual words or tokens.
- Stemming and lemmatization: Removing prefixes and suffixes from words and mapping them to their base or dictionary form.
- Part-of-speech (POS) tagging: Labelling each word with its corresponding POS, i.e. noun, verb, adjective, etc.
- Named-entity recognition: Locating and tagging named entities in text, such as persons, organizations, and locations.
- Bag of Words: Representing text as an integer vector of its word counts from a predefined vocabulary.
- Term Frequency-Inverse Document Frequency: Representing words by their numeric weights based on their frequency in a document relative to their frequency across all documents.
- Word Embeddings: Representing words as dense vectors in a continuous vector space based on semantic similarity.
- Sentiment analysis: Identifying the sentiment or tone of the text, whether it is positive, negative, or neutral.
- Topic modelling: Identifying the underlying topics in a document or set of documents.
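As a minimal sketch of two of these representations, Bag of Words and TF-IDF vectors can be produced from a toy corpus with scikit-learn (assuming a recent version that provides get_feature_names_out):

# Minimal sketch: bag-of-words and TF-IDF representations of a tiny corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the loan was approved quickly",
    "the payment was missed last month",
    "the loan payment is due next week",
]

# Bag of Words: integer counts over the corpus vocabulary.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: weights each word by its frequency in a document relative to the corpus.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))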
Computer Vision (CV)
Feature engineering in computer vision involves techniques for extracting features from images, including the following (a short sketch follows the list):
- Image augmentation: Expanding the training set with transformed copies of images, e.g. rotations, flips, and filters, to improve model generalization.
- Edge detection filter: Utilizing Sobel, Prewitt, Laplacian, or Canny edge filters to highlight changes in intensity or edges in the image.
- Scale-invariant feature transform (SIFT): Identifying and describing local features in images that are invariant to scaling and rotation.
- Colour Histogram: Representing an image by the distribution of its colours.
- Histogram of Oriented Gradients (HOG): Extracting features from an image based on the distribution of gradients in the image.
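As a hedged sketch (assuming scikit-image is installed and using its built-in sample image), a Canny edge map and a HOG descriptor can be extracted as follows:

# Minimal sketch: edge detection and HOG features with scikit-image.
from skimage import color, data, feature

# Built-in sample image, converted to grayscale.
image = color.rgb2gray(data.astronaut())

# Canny edge detection: a binary map of sharp intensity changes.
edges = feature.canny(image, sigma=2)

# Histogram of Oriented Gradients: a fixed-length descriptor of local gradient structure.
hog_features = feature.hog(image, orientations=9,
                           pixels_per_cell=(16, 16),
                           cells_per_block=(2, 2))

print(edges.shape, hog_features.shape)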
Time-Series Analysis
Feature engineering in time-series analysis involves techniques for extracting features from time-series data, including the following (a brief sketch follows the list):
- Autocorrelation: Measuring the correlation between time series and its lagged values.
- Moving averages: Calculating the average of a subset of time-series data over a defined window.
- Trend analysis: Identifying trends and patterns in the time series data.
- Fourier transforms: Decomposing a time-series signal into its frequency components.
- Mel-frequency cepstral coefficients (MFCCs): Representing an audio signal by its short-term power spectrum on the mel scale.
- Phonemes: Representing words by their phonemes, leveraging human preconceptions about how words are pronounced.
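As a minimal sketch of lag, moving-average, and calendar features (using pandas and a hypothetical daily sales series):

# Minimal sketch: lag, rolling-mean, and calendar features from a daily series.
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=60, freq="D")
ts = pd.DataFrame({"sales": np.random.default_rng(0).poisson(20, size=60)}, index=idx)

# Time lags: yesterday's and last week's value as predictors for today.
ts["lag_1"] = ts["sales"].shift(1)
ts["lag_7"] = ts["sales"].shift(7)

# Moving average over a 7-day window.
ts["rolling_mean_7"] = ts["sales"].rolling(window=7).mean()

# Calendar features extracted from the timestamp index.
ts["day_of_week"] = ts.index.dayofweek
ts["is_weekend"] = (ts.index.dayofweek >= 5).astype(int)

print(ts.tail())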
Automated Feature Engineering
There are a myriad of tools and open-source packages that can help automate and streamline feature engineering. These packages utilize algorithms to generate and select features based on data characteristics. This reduces manual efforts and broadens the exploration of potential features.
- Featuretools: designed for automated feature engineering on temporal and relational data.
- tsfresh: designed for feature engineering from time-series and other sequential data.
- AutoFeat: streamlines the generation of nonlinear features from data.
- TPOT (Tree-based Pipeline Optimization Tool): designed to automate all aspects of the machine learning pipeline, i.e. feature engineering, feature selection, and model optimization, using genetic programming.
- featurewiz: automates feature engineering and selection.
Featuretools, with its deep feature synthesis (DFS) [2] algorithm, stands out for its versatility, particularly when working with relational datasets and incorporating temporal aggregation.
Tutorial
In this tutorial, we will implement Featuretools on a dataset consisting of four tables:
- clients - information about clients at a credit union
- loans - previous loans taken out by the clients
- payments_due - the due date and amount of each scheduled payment
- outcome - the recorded payment outcome and its date
Data Source: Kaggle [3]
Data Sample
The objective of the machine learning model is to classify whether a customer will make or miss the next payment.
Featuretools offers three distinct advantages that make it a powerful tool for automated feature engineering:
1. EntitySet Approach
Featuretools operates on an EntitySet, i.e. a collection of data frames and the relationships between them. This simplifies feature engineering for relational datasets: users define the relationships between tables, and features are generated automatically based on those relationships.
Code:
import featuretools as ft
import pandas as pd

# Create an empty EntitySet to hold the data frames and their relationships
es = ft.EntitySet(id='clients')

## Add the data frames (entities) to the EntitySet
es = es.add_dataframe(
    dataframe_name="clients",
    dataframe=clients,
    index="client_id",
    time_index="joined")

es = es.add_dataframe(
    dataframe_name="loans",
    dataframe=loans,
    index="loan_id",
    time_index="loan_start")

es = es.add_dataframe(
    dataframe_name="payments_due",
    dataframe=payments_due,
    index="payment_id",
    time_index="due_date")

es = es.add_dataframe(
    dataframe_name="outcome",
    dataframe=outcome,
    time_index="outcome_time")

## Add the relationships between data frames (parent table, parent column, child table, child column)
# Relationship between clients and previous loans
es = es.add_relationship('clients', 'client_id', 'loans', 'client_id')
# Relationship between previous loans and payments
es = es.add_relationship('loans', 'loan_id', 'payments_due', 'loan_id')
# Relationship between payments and outcome
es = es.add_relationship('payments_due', 'payment_id', 'outcome', 'payment_id')

# Visualize the EntitySet structure
es.plot()
Featuretools EntitySet - Tables and their Relationship.
2. Feature Primitives and Deep Feature Synthesis (DFS)
"Feature primitives are the building blocks of Featuretools. They define computations that can be applied to raw datasets to create new features."[4].
Feature primitives fall into two categories:
- Aggregation: Functions that group together child datapoints for each parent and compute statistics such as mean and variance [3].
- Transformation: Operations applied to a subset of columns in a table, e.g. extracting the day from a date or computing the difference between two columns [3].
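For reference, the catalogue of built-in primitives can be listed directly (assuming a recent Featuretools version):

# List the aggregation and transformation primitives that ship with Featuretools.
import featuretools as ft

primitives = ft.list_primitives()
print(primitives.head(10))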
Code:
# Use each payment's id and due date as the cutoff time for that record
cutoff_times = es['payments_due'][['payment_id', 'due_date']].sort_values(by='due_date')

# Rename the time column to avoid confusion
cutoff_times.rename(columns={'due_date': 'time'}, inplace=True)
cutoff_times.head()

# Subtract 1 day so only data strictly before the due date is used
cutoff_times['time'] = cutoff_times['time'] - pd.Timedelta(1, 'days')

# Feature primitives to apply
agg_primitives = ["sum", "count", "max"]
trans_primitives = ["time_since_previous"]

# Deep feature synthesis
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_dataframe_name='payments_due',
                                       agg_primitives=agg_primitives,
                                       trans_primitives=trans_primitives,
                                       n_jobs=-1, verbose=1,
                                       cutoff_time=cutoff_times,
                                       cutoff_time_in_index=True,
                                       max_depth=2)
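The resulting feature matrix can then feed a downstream classifier. A hedged sketch follows, assuming the outcome table contains a binary label column, here hypothetically named missed_payment and keyed by payment_id; these column names are illustrative, not taken from the source data.

# Hedged sketch: train a classifier on the DFS feature matrix.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Index is (payment_id, time) because cutoff_time_in_index=True was used above.
features = feature_matrix.reset_index()
labels = outcome[["payment_id", "missed_payment"]]  # hypothetical label column
dataset = features.merge(labels, on="payment_id", how="inner")

X = dataset.drop(columns=["payment_id", "time", "missed_payment"]).select_dtypes("number").fillna(0)
y = dataset["missed_payment"]

# A time-aware split would be preferable in practice; a random split keeps the sketch short.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))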
Moreover, Featuretools provides a visual representation to inspect, interpret, and validate generated features, enhancing understanding and interpretability of the subsequently built machine learning model.
Visual representation from Featuretools of how a particular feature is generated for the given dataset.
3. Data Leakage and Handling Time
Data leakage in its many forms remains a challenge for machine learning models and systems [5]. Featuretools provides a built-in safeguard against temporal leakage in feature generation: a cutoff time can be specified for each record, and only data recorded before that timestamp is used when calculating its features [6].
A detailed tutorial on how to use Featuretools with consideration for handling time can be found here: GitHub
In conclusion, feature engineering is a critical step in building machine learning models and can significantly enhance their performance and accuracy. Automated tools and packages, such as Featuretools, can streamline the feature engineering process and contribute to the success of a machine learning system.
References
[1] Domingos, P., 2012. A few useful things to know about machine learning. Communications of the ACM, 55(10), pp.78-87.
[2] Kanter, J.M. and Veeramachaneni, K., 2015, October. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE international conference on data science and advanced analytics (DSAA) (pp. 1-10). IEEE.
[3] Automated Feature Engineering Tutorial - Kaggle
[4] Feature primitives - Featuretools Documentation
[5] Kapoor, S. and Narayanan, A., 2023. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9).
[6] Handling Time - Featuretools Documentation