Creating Custom Transformers in Python and scikit-learn

Gershinen Shanding

3 min read · Nov 20, 2023
Transformers are a crucial component in the world of machine learning and data preprocessing. They are responsible for transforming raw data into a format that is suitable for training models. While scikit-learn provides a rich set of transformers, there may be cases where you need to create custom transformers tailored to your specific data and requirements. In this blog post, we’ll explore how to create custom transformers in Python using scikit-learn.

Understanding Transformers in scikit-learn

In scikit-learn, transformers are classes that implement two main methods: fit and transform. The fit method is responsible for learning parameters from the training data, while the transform method applies these learned parameters to new data.
To create a custom transformer, you define a class that inherits from the BaseEstimator and TransformerMixin classes in scikit-learn. BaseEstimator supplies basic estimator functionality such as get_params and set_params, and TransformerMixin adds the fit_transform method, which combines the fit and transform steps into one call.
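To see what the base classes give you for free, here is a minimal do-nothing transformer (the name IdentityTransformer is just for illustration). It defines only fit and transform, yet it immediately supports fit_transform and get_params through inheritance:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class IdentityTransformer(BaseEstimator, TransformerMixin):
    """A pass-through transformer, just to illustrate the inherited machinery."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X  # return the data unchanged

t = IdentityTransformer()
print(t.fit_transform([[1, 2], [3, 4]]))  # fit_transform comes from TransformerMixin
print(t.get_params())                     # get_params comes from BaseEstimator
```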
Let’s walk through the process of creating a simple custom transformer.

Creating a Simple Custom Transformer

Step 1: Import Libraries

```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
```

Step 2: Define the Custom Transformer Class

```python
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        return self  # Nothing to learn here, so fit simply returns self

    def transform(self, X):
        # Copy the input DataFrame to avoid modifying the original
        X_transformed = X.copy()
        # Example transformation: double the values in the chosen column
        X_transformed[self.column_name] = X_transformed[self.column_name].apply(lambda x: x * 2)
        return X_transformed
```

Step 3: Using the Custom Transformer

```python
# Example usage
data = {'feature_1': [1, 2, 3, 4],
        'feature_2': [5, 6, 7, 8]}
df = pd.DataFrame(data)

custom_transformer = CustomTransformer(column_name='feature_1')
df_transformed = custom_transformer.fit_transform(df)
print(df_transformed)
In this example, the CustomTransformer takes a DataFrame and a column name as input during initialization. The transform method then applies a simple transformation to the specified column.
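Because transform works on a copy, the original DataFrame is left untouched. The following self-contained sketch repeats the Step 2 class so it runs on its own, then checks both frames:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# The same transformer as in Step 2, repeated here so this snippet runs on its own
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X_transformed = X.copy()  # work on a copy, never the original
        X_transformed[self.column_name] = X_transformed[self.column_name] * 2
        return X_transformed

df = pd.DataFrame({'feature_1': [1, 2, 3, 4], 'feature_2': [5, 6, 7, 8]})
df_transformed = CustomTransformer(column_name='feature_1').fit_transform(df)

print(df['feature_1'].tolist())              # original is unchanged: [1, 2, 3, 4]
print(df_transformed['feature_1'].tolist())  # transformed copy: [2, 4, 6, 8]
```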

Handling Numerical and Categorical Data

When working with datasets that contain both numerical and categorical features, it’s essential to handle them appropriately. Let’s extend our custom transformer to handle both types.
```python
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, multiplier=2):
        self.column_name = column_name
        self.multiplier = multiplier

    def fit(self, X, y=None):
        return self  # Nothing to learn here, so fit simply returns self

    def transform(self, X):
        # Copy the input DataFrame to avoid modifying the original
        X_transformed = X.copy()
        # Check if the specified column is numerical
        if pd.api.types.is_numeric_dtype(X_transformed[self.column_name]):
            X_transformed[self.column_name] *= self.multiplier
        else:
            # If categorical, apply a different transformation (e.g., capitalize strings)
            X_transformed[self.column_name] = X_transformed[self.column_name].apply(
                lambda x: str(x).capitalize())
        return X_transformed
```
This updated version of the CustomTransformer checks the data type of the specified column. If it's numerical, it multiplies the values by a specified multiplier. If it's categorical, it capitalizes the strings.
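A quick demonstration of both branches, using a toy DataFrame with one categorical and one numerical column (the column names here are invented for illustration). The class is repeated so the snippet is self-contained:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# The extended transformer from above, repeated so this snippet runs on its own
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, multiplier=2):
        self.column_name = column_name
        self.multiplier = multiplier

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_transformed = X.copy()
        if pd.api.types.is_numeric_dtype(X_transformed[self.column_name]):
            # Numerical branch: scale by the multiplier
            X_transformed[self.column_name] *= self.multiplier
        else:
            # Categorical branch: capitalize the strings
            X_transformed[self.column_name] = X_transformed[self.column_name].apply(
                lambda x: str(x).capitalize())
        return X_transformed

df = pd.DataFrame({'color': ['red', 'green'], 'count': [1, 2]})
print(CustomTransformer('color').fit_transform(df)['color'].tolist())  # ['Red', 'Green']
print(CustomTransformer('count', multiplier=3).fit_transform(df)['count'].tolist())  # [3, 6]
```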

Integrating with scikit-learn Pipelines

Custom transformers are often used in conjunction with scikit-learn pipelines, which streamline the process of transforming data and fitting models. Let’s see how our custom transformer can be integrated into a pipeline.
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assume 'numerical_features' and 'categorical_features' are lists of feature names
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

custom_transformer = CustomTransformer(column_name='custom_feature', multiplier=3)

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features),
    ('custom', custom_transformer, ['custom_feature'])
])

# Assuming 'model' is your machine learning model (e.g., RandomForestClassifier)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])

# Now you can use the pipeline for training and prediction
```
In this example, we’ve incorporated our custom transformer into a scikit-learn pipeline alongside transformers for numerical and categorical data. This pipeline can then be used for training and making predictions on new data.
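To make the idea concrete, here is a runnable end-to-end sketch of the same pattern on a tiny invented dataset (the column names, values, and RandomForestClassifier choice are all assumptions for illustration, not part of the original example). It uses only the numerical and categorical branches so it stands alone:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Toy data with missing values in both a numerical and a categorical column
df = pd.DataFrame({
    'age':    [25, 32, None, 41, 29, 37],
    'city':   ['Lagos', 'Jos', 'Lagos', None, 'Abuja', 'Jos'],
    'target': [0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns='target'), df['target']

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),   # fill missing ages with the mean
    ('scaler', StandardScaler())                   # then standardize
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # fill missing cities
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # then one-hot encode
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, ['age']),
    ('cat', categorical_transformer, ['city']),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=10, random_state=0)),
])

pipeline.fit(X, y)
print(pipeline.predict(X))  # predictions for the training rows
```

Calling pipeline.fit runs every preprocessing step before the model sees the data, and pipeline.predict replays the same learned transformations on new rows.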

Conclusion

Creating custom transformers in scikit-learn allows you to tailor your data preprocessing steps to the specific requirements of your dataset. Whether you need to handle numerical and categorical features differently or apply unique transformations, custom transformers provide a flexible solution. By integrating them into scikit-learn pipelines, you can streamline the entire process of data preparation and model training.
