Feature Engineering: Processes, Techniques & Benefits in 2024
Data scientists spend around 40% of their time on data preparation and cleaning, down from 80% in 2016, according to a report by Forbes. Automation tools appear to have improved the situation, but data preparation still constitutes a large part of data science work. This is because getting the best possible results from a machine learning model depends on data quality, and creating better features helps provide better quality data.
In this article, we’ll explore what feature engineering is, what are its techniques, and how you can improve feature engineering efficiency.
What is feature engineering?
Feature engineering is the process of transforming raw data into useful features.
Real-world data is almost always messy. Before deploying a machine learning algorithm to work on it, the raw data must be transformed into a suitable form. This is called data preprocessing and feature engineering is a component of this process.
A feature refers to an attribute of data that is relevant to the problem that you would like to solve with the data and the machine learning model. So, the process of creating features depends on the problem, the available data, and the deployed machine learning algorithm. Therefore, the same features created from a dataset are unlikely to be useful for two different problems. In addition, different algorithms require different types of features for optimal performance.
What are feature engineering processes?
Feature engineering can involve:
- Feature construction: Constructing new features from the raw data. Feature construction requires a good knowledge of the data and the underlying problem to be solved with the data.
- Feature selection: Selecting a subset of available features that are most relevant to the problem for model training.
- Feature extraction: Creating new and more useful features by combining and reducing the number of existing features. Principal component analysis (PCA) and embedding are some methods for feature extraction.
What are some feature engineering techniques?
Some common techniques of feature engineering include:
One-hot encoding
Most ML algorithms cannot work with categorical data and require numerical values. For instance, if you have a ‘Color’ column in your tabular dataset with the observations “Red”, “Blue”, and “Green”, you may need to convert these into numerical values for the model to process. However, labeling “Red” = 1, “Blue” = 2, and “Green” = 3 is not enough, because there is no ordered relation between colors (i.e., blue is not twice red).
Instead, one-hot encoding creates a binary column per category. With three colors, two columns, “Red” and “Blue”, are sufficient: if an observation is red, it takes 1 in the “Red” column and 0 in “Blue”. If it is green, it takes 0 in both columns and the model deduces that it is green.
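As a minimal sketch of this idea, pandas can one-hot encode a categorical column in a single call; the toy ‘Color’ column below is illustrative, and `drop_first=True` implements the "one fewer column" scheme described above:

```python
import pandas as pd

# Hypothetical toy dataset with a categorical 'Color' column
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One binary column per category; drop_first=True drops the first
# (alphabetically, "Blue") since an all-zeros row already implies it
encoded = pd.get_dummies(df, columns=["Color"], drop_first=True)
print(encoded)
```

Here a blue observation is encoded as zeros in both remaining columns, exactly as the green observation is in the two-column example above.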
Log-transformation
Log-transformation replaces each value in a column with its logarithm. It is a useful method to handle skewed data, as shown in the image below. Log-transformation can make the distribution approximately normal and decrease the effect of outliers. Fitting a linear predictive model, for instance, would give more accurate results after transformation because the relationship between the two variables is closer to linear.
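The effect on skewed data can be seen in a few lines of NumPy; the values below are hypothetical right-skewed numbers (e.g. incomes) chosen for illustration:

```python
import numpy as np

# Hypothetical right-skewed values with one large outlier
values = np.array([20_000, 35_000, 50_000, 80_000, 1_500_000], dtype=float)

# log1p computes log(1 + x), which also handles zeros safely
logged = np.log1p(values)

# The outlier's dominance shrinks: the raw max is 75x the min,
# while the log-transformed max is only about 1.4x the log-transformed min
print(values.max() / values.min())   # 75.0
print(logged.max() / logged.min())
```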
Outlier handling
Outliers are observations that are distant from other observations. They can be due to errors or be genuine observations. Whatever the reason, it is important to identify them because machine learning models are sensitive to the range and distribution of values. The image below demonstrates how outliers drastically change a linear model’s fit.
The outlier handling method depends on the dataset. Suppose you work with a dataset with house prices in a region. If you know that a house’s price cannot exceed a certain amount in that region and if there are observations above that value, you can
- remove those observations because they are probably erroneous
- replace outlier values with the mean or median of the attribute
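Both options above can be sketched with pandas; the house prices and the regional price cap below are hypothetical stand-ins for the domain knowledge described:

```python
import pandas as pd

# Hypothetical house prices; assume domain knowledge says the regional cap is 1,000,000
prices = pd.Series([250_000, 300_000, 280_000, 9_999_999, 310_000])
cap = 1_000_000

# Option 1: drop observations above the cap (treat them as errors)
dropped = prices[prices <= cap]

# Option 2: replace outliers with the median of the plausible values
median = prices[prices <= cap].median()
replaced = prices.where(prices <= cap, median)
print(replaced.tolist())
```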
Binning
Binning, or discretization, is grouping observations under ‘bins’. Converting ages of individuals to age groups or grouping countries according to their continent are examples of binning. The decision for binning depends on what you are trying to obtain from the data.
Binning can prevent overfitting, which happens when a model performs well with training data but poorly with other data. On the other hand, it sacrifices granular information about data.
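The age-group example above can be sketched with `pandas.cut`; the bin edges and labels are illustrative choices, not fixed rules:

```python
import pandas as pd

# Hypothetical ages; bin edges and group labels are illustrative
ages = pd.Series([5, 17, 25, 42, 67, 80])
groups = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 100],          # (0,18], (18,35], (35,60], (60,100]
    labels=["child", "young adult", "adult", "senior"],
)
print(groups.tolist())
```

Note the trade-off from the text: the model can no longer distinguish a 67-year-old from an 80-year-old once both are binned as “senior”.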
Handling missing values
Missing values are among the most common problems of the data preparation process. They may be due to errors, unavailability of the data, or privacy reasons. A significant portion of machine learning algorithms are designed to work with complete data, so you should handle missing values in a dataset. Otherwise, the model may automatically drop those observations, which can be undesirable.
For handling missing values, also called imputation, you can:
- fill missing observations with mean/median of the attribute if it is numerical.
- fill with the most frequent category if the attribute is categorical.
- use ML algorithms to capture the structure of data and fill the missing values accordingly.
- predict the missing values if you have domain knowledge about the data.
- drop the missing observations.
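The first two strategies above can be sketched with pandas; the small dataset and column names are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing numerical and categorical values
df = pd.DataFrame({
    "age": [25, np.nan, 35, 40],
    "color": ["Red", "Blue", None, "Red"],
})

# Numerical attribute: fill with the column mean (median works the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill with the most frequent category (the mode)
df["color"] = df["color"].fillna(df["color"].mode()[0])
print(df)
```

For the model-based strategies, libraries such as scikit-learn offer imputers that learn the fill values from the data rather than using a fixed statistic.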
Feature scaling
Feature scaling is standardizing the range of numerical features of the data. Consider these two examples:
- Suppose that you have a weight column with some values in kilograms and others in tons. Without scaling, an algorithm can consider 2000 kilograms to be greater than 10 tons.
- Suppose you have two columns for individuals in your dataset: age and height, with values ranging between 18-80 and 152-194, respectively. Without scaling, an algorithm has no criterion to compare these values and is likely to weight larger values more heavily, regardless of their units.
There are two common methods for scaling numerical data:
- Normalization (or Min-Max Normalization): Values are rescaled to the range between 0 and 1.
- Standardization (or Z-score Normalization): Values are rescaled so that they have a distribution with zero mean and unit variance.
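Both methods reduce to one-line formulas, shown here in NumPy; the height values are illustrative:

```python
import numpy as np

# Hypothetical height column in centimeters
heights = np.array([152.0, 160.0, 175.0, 194.0])

# Min-max normalization: (x - min) / (max - min) -> values in [0, 1]
normalized = (heights - heights.min()) / (heights.max() - heights.min())

# Z-score standardization: (x - mean) / std -> zero mean, unit variance
standardized = (heights - heights.mean()) / heights.std()

print(normalized.min(), normalized.max())   # 0.0 1.0
print(standardized.mean(), standardized.var())
```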
Why is it important now?
Feature engineering is an integral part of every machine learning application because the created and selected features have a great impact on model performance. Features that are relevant to the problem and appropriate for the model increase model accuracy. Irrelevant features, on the other hand, result in a “garbage in, garbage out” situation in data analysis and machine learning.
How to increase feature engineering efficiency?
Feature engineering is a process that is time-consuming, error-prone, and demands domain knowledge. It depends on the problem, the dataset, and the model so there is not a single method that solves all feature engineering problems. However, there are some methods to automate the feature creation process:
- Open-source Python libraries for automated feature engineering such as featuretools. Featuretools uses an algorithm called deep feature synthesis to generate feature sets for structured datasets.
- There are also AutoML solutions that offer automated feature engineering. For more information on AutoML, check our comprehensive guide.
- There are MLOps platforms that provide automated feature engineering tools. Feel free to check our article on MLOps tools and our data-driven list of MLOps platforms.
However, it should be noted that automated feature engineering tools use algorithms and may not be able to incorporate valuable domain knowledge that a data scientist may have.
If you have other questions about feature engineering for machine learning and automated ML solutions, don’t hesitate to contact us.
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 60% of Fortune 500 every month.
Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE, NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and media that referenced AIMultiple.
Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised businesses on their enterprise software, automation, cloud, AI / ML and other technology related decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.
He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.