# Data Preprocessing

Data quality metrics:

# Cleaning

# Incomplete Data

# Noisy Data

# Inconsistent Data

# Integration

Combine data from multiple sources. This needs a strategy for entity identification (recognizing when records from different sources refer to the same real-world entity) and for removing redundant data.

# Correlation Analysis

Correlation does not imply causality.

Numerical attributes: correlation coefficient

Nominal attributes: chi-square test
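Both measures are short enough to compute by hand. The sketch below is a minimal version using only the standard library; the function names and the sample data are illustrative assumptions, not part of the notes above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two nominal attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Numerical attributes (made-up data): perfectly linear, so r is 1.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
r = pearson(hours, scores)

# Nominal attributes (made-up 2x2 contingency table of observed counts).
observed = [[250, 200], [50, 1000]]
chi2 = chi_square(observed)
```

A large chi-square value relative to the critical value for the table's degrees of freedom suggests the two attributes are not independent. As with the correlation coefficient, that says nothing about which one causes the other.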

# Data Transformation

# Normalization

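Min-max normalization maps v from [min, max] to [min', max']:

v' = [(v-min)/(max-min)](max'-min')+min'

Ex: map [50k, 200k] to [0, 1]

v' = [(v-50k)/(200k-50k)](1-0)+0

Mean normalization:

v' = (v-mean)/(max-min)

Z-score normalization:

v' = (v-mean)/stdev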
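The three rescalings above (min-max, mean normalization, z-score) can be sketched in a few lines. This is a minimal illustration with assumed function names and made-up income values; it also assumes the population standard deviation for the z-score, since the notes do not say which variant is meant.

```python
import statistics

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Map v from [lo, hi] to [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def mean_normalize(values):
    """Center on the mean and divide by the range (max - min)."""
    mean = statistics.fmean(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]

def z_score(values):
    """Center on the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # assumption: population stdev
    return [(v - mean) / stdev for v in values]

# Made-up incomes spanning the [50k, 200k] range from the example.
incomes = [50_000, 80_000, 125_000, 200_000]
scaled = [min_max(v, 50_000, 200_000) for v in incomes]
centered = mean_normalize(incomes)
standardized = z_score(incomes)
```

Min-max keeps every value inside the target range but is sensitive to outliers, because a single extreme value stretches max-min. The z-score has no fixed range but is usually the safer choice when the true minimum and maximum are unknown.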

# Discretization

Split or merge points into intervals. Supervised methods use class labels to choose the interval boundaries; unsupervised methods do not.

# Unsupervised

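Equal-width binning (every interval spans the same range of values) or equal-frequency binning (every interval holds roughly the same number of points).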
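A minimal sketch of both unsupervised schemes follows. The function names and the price list are assumptions made for illustration, and the code assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins that each span the same range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding about the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Bin index depends only on rank, so tied values may straddle a boundary.
        bins[i] = rank * k // len(values)
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)     # bins cover [4,14), [14,24), [24,34]
freq_bins = equal_frequency_bins(prices, 3)  # four prices per bin
```

Equal-width bins are easy to interpret but can end up nearly empty when the data is skewed. Equal-frequency bins adapt to the distribution at the cost of uneven interval sizes.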

# Supervised

# Data Reduction

Use forward selection (start with no attributes and repeatedly add the most informative remaining one) and backward elimination (start with all attributes and repeatedly remove the least informative one) for automated dimensionality reduction.
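Forward selection can be sketched as a greedy loop around any scoring function. The version below is one possible illustration: it assumes "informative" means "improves leave-one-out accuracy of a 1-nearest-neighbour classifier", which is a stand-in chosen to keep the example self-contained. The function names and the toy data are hypothetical.

```python
def loo_1nn_accuracy(rows, labels, attrs):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier using only `attrs`."""
    if not attrs:
        return 0.0
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[a] - rows[j][a]) ** 2 for a in attrs),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, n_attrs, score=loo_1nn_accuracy):
    """Greedily add the attribute that raises the score most; stop when none helps."""
    selected, best = [], 0.0
    remaining = set(range(n_attrs))
    while remaining:
        cand = max(sorted(remaining),
                   key=lambda a: score(rows, labels, selected + [a]))
        cand_score = score(rows, labels, selected + [cand])
        if cand_score <= best:
            break  # no remaining attribute improves the score
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best

# Toy data: attribute 0 separates the classes, attribute 1 is noise.
rows = [(1.0, 5.0), (1.2, 1.0), (0.9, 9.0), (3.0, 5.2), (3.1, 0.8), (2.9, 9.1)]
labels = [0, 0, 0, 1, 1, 1]
selected, best = forward_selection(rows, labels, n_attrs=2)
```

Backward elimination is the mirror image: start from the full attribute set and drop the attribute whose removal hurts the score least, stopping when every removal makes things worse.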

# Principal Component Analysis (PCA)

Wavelet Transformation: a linear signal-processing technique. Store only a small fraction of the strongest wavelet coefficients and drop the resolutions that carry the least information.

# Numerosity reduction

Parametric: assume the data fits a certain model and estimate the model parameters.

Non-parametric: do not assume a model.
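To make the contrast concrete, here is a minimal sketch of one reduction of each kind: a least-squares line (parametric, keeps two numbers) and a fixed-width histogram (non-parametric, keeps bucket counts). Both choices, the function names, and the data are assumptions picked for illustration.

```python
from collections import Counter

def fit_line(xs, ys):
    """Parametric: assume y = slope * x + intercept and estimate the two parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def histogram(values, width):
    """Non-parametric: count values per fixed-width bucket, keyed by bucket start."""
    return Counter((v // width) * width for v in values)

# Parametric: four points collapse to two stored parameters.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Non-parametric: nine values collapse to three bucket counts.
values = [3, 7, 8, 12, 15, 15, 21, 24, 29]
hist = histogram(values, width=10)
```

The parametric summary is far smaller but only as good as the model assumption. The histogram makes no such assumption, so its size grows with the number of buckets needed to describe the data.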

# Sampling

EX: