Your Guide to Feature Engineering
How to increase your Machine Learning model performance
Although there are multiple methods to improve your Machine Learning (ML) models' performance, the one that stands out is known as “feature engineering”. Feature engineering refers to the process of transforming data into useful representations (features) that a model can learn from more effectively. As an activity, feature engineering can have the greatest impact on a model’s predictive power and can be thought of as the “art and science” of data science.
In this article, you’ll learn about the complexities surrounding feature engineering and some key strategies and techniques for mitigating these complexities while achieving increased model performance.
ML models utilize context, concepts, and events to make inferences.
How you represent these concepts and events (from customer profiles to biological molecules) will define the effectiveness of your ML models. Proper characterization is the key to a successful model.
Features are the fuel of any ML system. More data does not necessarily mean better model performance, and we must use features with a strong predictive signal for our task. This process is continuous: collect more data, transform them into features, and select the best features for our model.
Feature engineering must ultimately take into account the type of model being used. Different ML models have different strengths and weaknesses. For example, neural networks will automatically cross features for you, but they can only handle numeric inputs and tend to struggle with sparse and categorical features unless those are encoded appropriately.
The relevance of a feature is measured by its ability to distinguish instances of the dataset with respect to the target variable. Features can, therefore, be categorized as strongly relevant or weakly relevant to the target.
Weakly relevant features and seemingly irrelevant features can still be extremely useful to a model as part of a larger feature set. For example, a user’s average transaction price may only be weakly relevant in a fraud detection use case. However, it may become very predictive when it’s included in a feature set with the price of that specific transaction.
The feature engineering process has two distinct parts: feature transformation and feature selection.
Feature transformation
The original features in a dataset may not be optimal, and better performance may be achieved by deriving features from that input. Feature transformation is the process of taking attributes from your dataset and creating new features from them to isolate as much signal as possible.
Features in their raw form may not be effective for a model. Transformations such as centering, scaling, or transforming a distribution to symmetry can help the model better identify predictive signals.
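As a minimal sketch of such transformations (the skewed “amounts” values below are purely hypothetical), a log transform followed by standardization could look like this with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical, heavily skewed raw feature (e.g., transaction amounts)
amounts = np.array([[1.0], [10.0], [100.0], [1000.0], [10000.0]])

# Pull the distribution toward symmetry with a log transform,
# then center and scale to zero mean and unit variance.
log_amounts = np.log1p(amounts)
scaled = StandardScaler().fit_transform(log_amounts)
print(scaled.ravel())
```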
Some feature engineering strategies are:
Feature aggregation
A feature aggregation takes multiple rows of data and combines them into a single feature. Examples of aggregations include average, median, max, etc.
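As an illustrative sketch (the transactions DataFrame and its column names are hypothetical), per-user aggregations can be computed with pandas:

```python
import pandas as pd

# Hypothetical transaction data: one row per transaction
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 7.0, 9.0],
})

# Collapse many rows per user into single per-user features
user_features = transactions.groupby("user_id")["amount"].agg(
    avg_amount="mean",
    median_amount="median",
    max_amount="max",
)
print(user_features)
```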
Feature crossing
A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually, and allow us to learn non-linearity within linear models. Neural networks inherently learn feature crosses, but you can also create many different kinds of feature crosses like:
- [A x B]: a feature cross formed by multiplying the values of two features.
- [A x B x C x D x E]: a feature cross formed by multiplying the values of five features.
- [A x A]: a feature cross formed by squaring a single feature.
For example, if we had longitude and latitude features available, we could cross them to represent a more defined area, rather than having latitudes or longitudes as separate descriptors. This way, a model could learn that certain areas within a range of latitudes and longitudes represent a stronger signal than the two features considered individually.
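A minimal sketch of this latitude/longitude cross (the column names and bin counts are assumptions for illustration) might bucketize each coordinate and concatenate the buckets into a single grid-cell feature:

```python
import pandas as pd

# Hypothetical location data
df = pd.DataFrame({
    "latitude": [37.77, 37.78, 40.71, 40.73],
    "longitude": [-122.42, -122.41, -74.00, -74.01],
})

# Bucketize each coordinate, then cross the buckets into one categorical
# feature that identifies a grid cell rather than a single line.
df["lat_bin"] = pd.cut(df["latitude"], bins=10, labels=False)
df["lon_bin"] = pd.cut(df["longitude"], bins=10, labels=False)
df["lat_x_lon"] = df["lat_bin"].astype(str) + "_" + df["lon_bin"].astype(str)

# One-hot encode the crossed feature so a linear model can use it
crossed = pd.get_dummies(df["lat_x_lon"], prefix="cell")
```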
Dimensionality reduction
The idea behind dimensionality reduction is that the information in high-dimensional data is often dominated by a small number of underlying factors. We can therefore either find a subset of the features that represents roughly the same information, or transform the features into a new, smaller set without losing much information. Some of the most common approaches to reduce data dimensionality are:
- Principal Component Analysis (PCA): PCA is an unsupervised algorithm that creates linear combinations of the original features. Principal components are extracted so that the first principal component explains the maximum variance in the dataset, the second principal component tries to explain the remaining variance in the dataset and is uncorrelated to the first principal component, and so on. From a statistical perspective, variance is synonymous with information. An important benefit of this technique is that the resulting PCA scores are uncorrelated, which is useful for several ML modeling techniques (e.g., neural networks or Support Vector Machines).
- Linear Discriminant Analysis (LDA): LDA is a supervised method that can only be used with labeled data. Unlike PCA, LDA doesn’t maximize explained variance; it maximizes the separability between classes. It seeks to preserve as much discriminatory power as possible for the dependent variable while projecting the original data matrix onto a lower-dimensional space: it projects data in a way that maximizes class separability.
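As a rough sketch of how these two techniques differ in practice (the Iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# LDA: supervised, keeps the directions that best separate the classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
```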
Clustering
This technique finds groups of observations (called clusters) that share similar characteristics. It’s an unsupervised learning technique: no labels are required to discover the structure in the data. The idea is to replace a group of “similar” observations with the centroid of their cluster, which then becomes a feature. Some of the most popular algorithms include K-means and hierarchical clustering.
Alternatively, instead of labeling each observation with its nearest cluster centroid, we can measure the distance from a point to all the centroids and return those distances as features.
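Both variants are easy to sketch with scikit-learn's K-means (the synthetic blobs below stand in for real data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Option 1: the nearest-centroid label as a single categorical feature
cluster_id = kmeans.predict(X)

# Option 2: the distance from each point to every centroid,
# giving k numeric features per observation
distances = kmeans.transform(X)  # shape: (n_samples, n_clusters)
```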
Knowledge Graphs
In 2012, Google pioneered the concept of Knowledge Graphs (KGs) to refer to its general-purpose knowledge base. KGs organize data from multiple sources, capture information about entities of interest in a given domain or task (like people, places, or events), and forge connections between them. Unlike traditional databases, the content of a KG is organized as a graph, where the nodes (entities of interest and their types), the relationships between them, and the attributes of the nodes are all equally important.
KGs are developed from linked data, and it’s precisely these linked data sources that can be used for feature generation. Current ML methods typically rely on input data built from tables, which often means abstracting, simplifying, and sometimes entirely leaving out predictive relationships and contextual data. When connected data and relationships are stored as a graph, it becomes straightforward to extract connected features and incorporate this critical information.
KGs can also serve as data models in their own right, providing a learning framework that reduces the need for manual feature engineering.
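As a minimal sketch of extracting connected features from graph data (the toy customer-to-merchant graph below is hypothetical, and networkx stands in for whatever graph store you actually use):

```python
import networkx as nx
import pandas as pd

# Toy graph: customers connected to the merchants they purchase from
G = nx.Graph()
G.add_edges_from([
    ("alice", "shop_a"), ("alice", "shop_b"),
    ("bob", "shop_a"), ("carol", "shop_c"),
])

# Graph-derived features per node: how connected and how central an entity is
features = pd.DataFrame({
    "degree": dict(G.degree()),
    "pagerank": nx.pagerank(G),
})
print(features)
```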
Features can be transformed through linear or non-linear transformations, then combined to create more complex features, which can be transformed again. So, when should you stop?
Start simple: you can get an excellent baseline performance by creating an initial model without deep features. After this baseline is achieved, you can try more esoteric approaches. Feature engineering can be done in collaboration with domain experts who can guide us on what features to engineer and use.
Feature selection
You should not feed every possible feature to your model. Not only will this increase computation cost and training time, but having many uninformative or highly correlated (collinear) features can actually harm model performance.
The main purpose of a feature selection process is to improve our model’s prediction performance, provide faster and more cost-effective predictors, and increase the interpretability of the prediction process.
Unused features create technical debt. If you find that you are not using a feature and that combining it with other features is not working, then remove it from your infrastructure.
Some of the main feature selection strategies are filter, wrapper, and embedded methods.
Filter methods
These techniques assess the intrinsic properties of the features, measured via univariate statistics, and they are faster and less computationally expensive than other methods when dealing with high-dimensional data.
Filter methods include techniques like variable ranking (e.g., using criteria like Information Gain or Fisher’s Score), correlation analysis (used with quantitative features), or Chi-square tests (used with categorical features).
The idea behind these methods is to compute the scores between each feature and your target, and then select the features based on their scores. This feature selection method is independent of any ML algorithm.
However, features rarely explain an outcome in isolation. Since this method only considers each feature’s intrinsic properties, it misses the fact that a variable that is useless by itself can provide a significant performance improvement when combined with others.
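A minimal sketch of a filter-style selection with scikit-learn (the breast cancer dataset and the choice of mutual information as the univariate score are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Score every feature against the target independently of any model,
# then keep the 10 highest-scoring features.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)
```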
Wrapper methods
This method uses the prediction performance of a given ML algorithm to assess the usefulness of a subset of features.
Wrapper methods use different performance measures depending on the type of problem to decide which features to add or remove from that subset. For example, for a regression problem, the performance measure can be p-values, R-squared, or Adjusted R-squared. In contrast, for a classification problem, the criterion can be accuracy, precision, recall, or F1-score. Finally, this method selects the features that give the optimal results for the specified ML algorithm.
Wrapper methods require some approach to search the space of all possible subsets of features. Exhaustive searches could be performed if the number of features was not too large, but this approach can quickly become intractable. For this reason, some of the most commonly used techniques are:
- Forward Feature Selection: an iterative method in which we start with the single best-performing variable against the target. At each step, we add the variable that gives the best performance in combination with the variables already selected, and continue until a preset criterion is met.
- Backward Feature Elimination: works in precisely the opposite way to Forward Feature Selection. We start with all the features and remove the least significant feature at each iteration until a preset criterion is met.
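Both strategies can be sketched with scikit-learn's SequentialFeatureSelector (the dataset, estimator, and number of features to keep are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Forward selection: greedily add the feature that most improves
# cross-validated performance until 10 features have been chosen.
sfs = SequentialFeatureSelector(
    model, n_features_to_select=10, direction="forward", cv=5
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features

# direction="backward" would instead start with all features and
# drop the least useful one at each step.
```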
Embedded methods
This method involves algorithms that have built-in feature selection methods. The main goal of embedded methods is learning which features best contribute to the performance of a given ML model. They also have built-in penalization functions to reduce overfitting.
Embedded methods encompass the benefits of both wrapper and filter methods by evaluating feature interactions while maintaining reasonable computational costs. The typical steps for embedded methods involve training an ML algorithm using all the features and then deriving the importance of those features according to the algorithm used. Afterward, it can remove unimportant features based on criteria specific to the algorithm.
Regularization algorithms like LASSO, Elastic Net, and Ridge Regression are the most commonly used embedded methods. These algorithms shrink the coefficients of uninformative features, and L1-based penalties (as in LASSO) can drive those coefficients to exactly zero, effectively removing the corresponding features from the model.
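As a small, illustrative sketch of LASSO-based selection (the diabetes dataset and the alpha value are assumptions, not recommendations):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The L1 penalty can shrink the coefficients of unhelpful features to
# exactly zero; SelectFromModel keeps only the features whose coefficients
# stay above its threshold.
lasso = Lasso(alpha=0.1).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
print(selector.get_support())  # boolean mask of the retained features
X_selected = selector.transform(X)
```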
So, how can we develop good features?
- Avoid rarely used discrete feature values: a good feature value should appear more than a handful of times (say, five or more) in a dataset. Having several examples with the same discrete value gives the model a chance to see the feature in different settings and, in turn, learn when it’s a good predictor for the label.
- Prefer clear meanings: each feature should have an obvious meaning to anyone on the project. A good feature is clearly named, and the value makes sense concerning the name.
- Don’t mix “magic” values with actual data: good floating-point features don’t contain peculiar out-of-range discontinuities or “magic” values. For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing. For continuous variables, impute missing values with the mean of the feature so they don’t distort the model (see the sketch after this list).
- Account for upstream instability: the definition of a feature shouldn’t change over time.
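As a minimal sketch of the continuous-variable case above (the toy array and the optional missing-value indicator are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 200.0],
              [np.nan, 300.0],
              [3.0, np.nan]])

# Replace missing continuous values with the column mean and add binary
# indicator columns so the model can still tell that a value was
# originally missing, instead of hiding a "magic" value like -999.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```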
Conclusion
Feature engineering is a critical skill in Machine Learning, transforming raw data into predictive insights that fuel model performance. Whether by transforming features to capture new patterns or selecting only the most impactful ones, each stage of feature engineering shapes the model’s effectiveness and its potential to solve real-world challenges. For practitioners, this process goes beyond a set of technical steps: it requires a nuanced understanding of the problem domain, the data’s characteristics, and model-specific needs.
In practice, mastering feature engineering involves experimentation and collaboration with domain experts. You can continually enhance your model's predictive power by methodically iterating through transformations, selections, and validations. In an era of ever-growing data complexity, effective feature engineering is a cornerstone of Machine Learning success and an art form that bridges data insights with powerful, actionable predictions.