GID

Useful data transformations

· Georgios Is. Detorakis · 7 minutes read

This post briefly introduces fundamental data transformations such as mean subtraction (centering), normalization, standardization, the difference transform, and the power transform. We also provide simple Python examples that apply these transforms to data, relying heavily on the sklearn package [1].

Mean subtraction

Let’s assume we have some data in a vector $ {\bf x} $, and know that the mean value of $ {\bf x} $ is not zero. We could force the mean to be zero by subtracting the mean from each element in the vector $ {\bf x} $. Thus, we center the data when we apply the following transform:

$$ {\bf z} = {\bf x} - \bar{x}, \quad (1) $$

where $ \bar{x} $ is the mean of $ {\bf x} $. Another important use of this transform is comparing data sets measured on different scales, such as temperatures in Celsius and Fahrenheit: we can center each data set separately and then compare them.

The following code snippet shows how we can center data in Python:

import numpy as np

X = np.random.normal(5, 1, (100,))
X_bar = X.mean()
print(X_bar)		# 4.9911754

Z = X - X_bar
print(Z.mean()) 	# -2.930988e-16 

Normalize

Originally, data normalization meant rescaling and shifting a data set’s values so that they range in $ [0, 1] $. The mathematical formula to do that is:

$$ {\bf z} = \frac{ {\bf x} - x_{\text{min}} }{ x_{\text{max}} - x_{\text{min}} }, \quad (2) $$

where $ {\bf x} $ is the input data, $ x_{\text{min}} $ is the minimum element in the vector $ {\bf x} $, and $ x_{\text{max}} $ is the maximum element.

However, if we would like to normalize our data into a different interval $ [a, b] $ we can use the following formula:

$$ {\bf z} = \frac{ {\bf x} - x_{\text{min}} }{ x_{\text{max}} - x_{\text{min}} } (b - a) + a. \quad (3) $$

Normalization is used when we know our data do not follow a Gaussian distribution.

In Python, we can normalize any data set using the MinMaxScaler function of sklearn [2].

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.random.random((100, 1))

# Normalize into [0, 1]
scaler = MinMaxScaler()
Z = scaler.fit_transform(X)

# Normalize into [2, 4]
scaler = MinMaxScaler(feature_range=(2, 4))
Z = scaler.fit_transform(X)
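
For reference, equation (3) can also be applied directly with NumPy. The following sketch (with illustrative variable names) rescales the same data into $ [2, 4] $ and should agree with MinMaxScaler up to floating-point error.

import numpy as np

X = np.random.random((100, 1))

a, b = 2, 4
x_min, x_max = X.min(), X.max()

# Equation (3): rescale X from [x_min, x_max] into [a, b]
Z = (X - x_min) / (x_max - x_min) * (b - a) + a
print(Z.min(), Z.max())		# approximately 2.0 and 4.0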

Standardize

Let’s assume that we have some data on people’s height and weight, and we would like to use machine learning models on them. Naturally, weight and height measure different physical quantities and thus come in different scales and units (weight is usually between $ 10 $ and $ 200 $ kilograms, and height is between $ 0 $ and $ 2 $ m). So, how do we use these data together? One solution is to standardize the data using the z-score

$$ {\bf z} = \frac{{\bf x} - \bar{x}}{\sigma}, \quad (4) $$

where $ {\bf x} $ is the vector that holds the data, $ \bar{x} $ is the mean value of $ {\bf x} $, and $ \sigma $ is the standard deviation. Standardizing our data means they will have a zero mean and a unit standard deviation. We usually apply a standardization transformation when we know that our data follow a Gaussian-like distribution.

In Python, we can standardize our data using the preprocessing functions provided by the sklearn package [3]. Here is an example:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.empty((100, 2))
# Both weight and height follow a Gaussian distribution
X[:, 0] = np.random.normal(1.75, 0.1, (100,))	# height in meters
X[:, 1] = np.random.normal(70, 15, (100,))		# weight in kilograms

scaler = StandardScaler()
Z = scaler.fit_transform(X)

We should standardize our data in the following cases:

  • Before PCA. Features with large variances get weighted more and dominate the principal components (see the sketch after this list).
  • Before clustering algorithms such as k-means. Clustering algorithms rely on distance as a similarity measure; thus, features with a wide range of values dominate the computed distances.
  • Before SVM. A classic SVM maximizes the margin between the separating hyperplane and the support vectors, so features with a wide range of values distort these distances in an undesirable way.
  • Before LASSO or Ridge regression. These algorithms penalize the magnitude of the coefficient associated with each variable, so the scale of each variable determines how strongly it is penalized: variables with large scales get small coefficients and are therefore penalized less.
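
The following sketch illustrates the PCA case: a feature with a much larger variance dominates the first principal component unless we standardize first. The data and variances here are made up purely for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.empty((500, 2))
X[:, 0] = np.random.normal(0, 1, (500,))	# small-scale feature
X[:, 1] = np.random.normal(0, 100, (500,))	# large-scale feature

# Without standardization the large-scale feature dominates
print(PCA(n_components=2).fit(X).explained_variance_ratio_)
# roughly [0.9999, 0.0001]

# After standardization both features contribute comparably
Z = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(Z).explained_variance_ratio_)
# roughly [0.5, 0.5]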

In the following cases, we can skip standardization, since these models are not sensitive to the scale of the features:

  • Logistic regression
  • Random forests
  • Decision trees
  • Gradient boosting

Difference transform

A difference transform is most useful when we are dealing with time series. If there is a trend in a time series, and we would like to eliminate it, we can apply a difference transform by subtracting the value at time $ t-1 $ from the current time $ t $ value. More precisely,

$$ x[t] = x[t] - x[t-1], \quad (5) $$

and if we would like to get rid of a seasonal structure, then we only need to take into account the period (or frequency) of that seasonality,

$$ x[t] = x[t] - x[t - d], \quad (6) $$

where $ d $ is the delay or the period of the seasonality (i.e., how many data points back we subtract).

The difference transform is easy to implement in Python. The following code snippet provides two functions to apply and reverse the difference transform.

import numpy as np


def difference(X, delay=1):
    # Apply the difference transform of equations (5)/(6)
    n = len(X)
    diff = [X[i] - X[i - delay] for i in range(delay, n)]
    return diff


def invDifference(X, dX, delay=1):
    # Invert the difference transform using the original values X
    n = len(X)
    inv = [dX[i - delay] + X[i - delay] for i in range(delay, n)]
    return inv


if __name__ == '__main__':
    x = np.array([i for i in range(1, 10)])
    print(x)				# [1 2 3 4 5 6 7 8 9]

    x_diff = difference(x, 1)
    print(x_diff)			# [1, 1, 1, 1, 1, 1, 1, 1]

    # We can obtain similar results when delay=1 using Numpy's diff function
    x_diff_prime = np.diff(x)
    print(x_diff_prime)			# [1 1 1 1 1 1 1 1]

    x_diff_inv = invDifference(x, x_diff, 1)
    print(x_diff_inv)			# [2, 3, 4, 5, 6, 7, 8, 9]

    # Another way to invert the difference when delay=1 is the following
    x_diff_prime_inv = np.r_[x[0], x_diff_prime].cumsum()
    print(x_diff_prime_inv)		# [1 2 3 4 5 6 7 8 9]
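
The same difference function also covers the seasonal case of equation (6). As a quick, made-up example that reuses the difference function defined above, a series that repeats every four points becomes constant after differencing with delay=4:

# Hypothetical seasonal series with period 4 and no trend
x_seasonal = np.array([10, 20, 30, 40] * 3)

# Subtracting the value one period back (Eq. 6) removes the seasonal pattern
x_deseason = difference(x_seasonal, delay=4)
print(x_deseason)		# [0, 0, 0, 0, 0, 0, 0, 0]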

Power transform

We can apply a power transform to make our data look more "normal" (Gaussian-like) and stabilize their variance. There are two major power transforms: the Box-Cox [4] and the Yeo-Johnson [5]. The sklearn Python package supports both of them. The Box-Cox transform requires strictly positive data, while the Yeo-Johnson transform supports both positive and negative data [6]. The following code snippet demonstrates how we can apply them to our data.

import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.random.random(10)
print(X)	

# array([0.92222554, 0.92306673, 0.82856923, 0.8713333 , 0.08001814,
#        0.12258023, 0.5008433 , 0.60396389, 0.24539718, 0.55259061])

pt = PowerTransformer(method="box-cox")		# or method='yeo-johnson' (default)

# fit_transform receives ndarray of shape (n_samples, n_features)
Z = pt.fit_transform(X.reshape(-1, 1))
Z = Z[:, 0]								# we have only one feature
print(Z)

# The transformed values are standardized (zero mean, unit variance by default);
# the exact numbers depend on the random input X above.
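
Because Box-Cox requires strictly positive inputs, data containing zeros or negative values need the Yeo-Johnson method instead. Here is a minimal sketch with made-up, zero-centered data:

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Data with negative values: Box-Cox would raise an error here
X = np.random.normal(0, 1, (100, 1))

pt = PowerTransformer(method="yeo-johnson")	# the default method
Z = pt.fit_transform(X)
print(Z.mean(), Z.std())	# approximately 0 and 1 (standardize=True by default)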

Summary

We examined five essential data transforms: centering data, normalization, standardization, difference, and power transform. We briefly described the math behind those transformations and provided some Python functions based on sklearn that implement those transformations.

Cited as

@article{detorakis2022acfpacf,
  title   = "Useful data transformations",
  author  = "Georgios Is. Detorakis",
  journal = "gdetor.github.io",
  year    = "2022",
  url     = "https://gdetor.github.io/posts/normalization"
}

References

  1. F. Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825-2830, 2011.
  2. Sklearn MinMaxScaler
  3. Sklearn StandardScaler
  4. G. E. Box and D. R. Cox, An analysis of transformations, Journal of the Royal Statistical Society, Series B, 26(2), pp. 211-252, 1964.
  5. In-Kwon Yeo and R. A. Johnson, A New Family of Power Transformations to Improve Normality or Symmetry, Biometrika, 87(4), 2000.
  6. Sklearn PowerTransformer