Data Preprocessing
Data quality metrics:
- Relevance
- Accessibility
- Interpretability
- Reliability
- Timeliness
- Accuracy
- Consistency
- Precision
- Granularity
- Completeness
Cleaning
Incomplete Data
- remove records or attributes containing missing values
- manually fill in missing values
- automate a best guess (global constant, attribute mean, class mean)
- estimate the value (regression, kNN, probabilistic methods)
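A minimal sketch of mean-based imputation with pandas, using a made-up table and a hypothetical "income" attribute:

```python
import pandas as pd

# Hypothetical table with a missing value in "income".
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [50.0, None, 70.0, 90.0],
})

# Attribute mean: fill with the mean over all records.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Class mean: fill with the mean of the record's own class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```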
Noisy Data
- fit data with regression functions (linear, polynomial, logistic, …)
- clustering (group data into clusters and detect outliers)
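A minimal sketch of regression-based smoothing with numpy, on made-up data with one injected outlier; the 2-standard-deviation residual cutoff is an assumed convention, not a fixed rule:

```python
import numpy as np

# Noisy 1-D data: y is roughly linear in x, with one injected outlier.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.normal(0, 0.3, size=10)
y[4] = 40.0  # outlier

# Fit a degree-1 polynomial (linear regression) to the data.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

# Flag points whose residual exceeds 2 standard deviations (assumed cutoff).
residuals = y - fitted
outliers = np.abs(residuals) > 2 * residuals.std()

# Smooth: replace flagged points with their fitted values.
y_smooth = np.where(outliers, fitted, y)
print(np.where(outliers)[0], y_smooth)
```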
Inconsistent Data
- Semantic-based checking (metadata, attribute relationships)
- Data understanding (statistical analysis and visualization)
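A minimal sketch of semantic-based checking with pandas, using made-up attribute rules (non-negative age, birth year before hire year):

```python
import pandas as pd

df = pd.DataFrame({
    "age":        [34, -2, 51],
    "birth_year": [1989, 1990, 1995],
    "hire_year":  [2010, 2015, 2020],
})

# Rule 1: age must be non-negative.
bad_age = df["age"] < 0
# Rule 2: attribute relationship -- birth year must precede hire year.
bad_order = df["birth_year"] >= df["hire_year"]

# Records violating either rule are flagged as inconsistent.
print(df[bad_age | bad_order])
```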
Integration
Combine data from multiple sources, develop strategies for entity identification, and remove redundant data.
Correlation Analysis
Correlation does not imply causality.
Numerical attributes: correlation coefficient
Nominal attributes: chi-square test
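A minimal sketch of both tests using scipy, with made-up measurements and a made-up contingency table:

```python
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

# Numerical attributes: Pearson correlation coefficient.
height = np.array([160, 165, 170, 175, 180], dtype=float)
weight = np.array([55,  60,  66,  70,  78],  dtype=float)
r, p = pearsonr(height, weight)
print(f"correlation r={r:.2f}, p={p:.3f}")

# Nominal attributes: chi-square test on a contingency table
# (rows and columns are made-up category counts).
table = np.array([[250,  200],
                  [ 50, 1000]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.3g}")
```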
Data Transformation
- smoothing: noise removal and reduction
- aggregation: cities => state (n to 1)
- generalization: city => state (1 to 1)
- normalization: feature scaling
- discretization: continuous => intervals
- attribute construction from existing ones
Normalization
- Rescaling (min-max normalization)
v' = [(v-min)/(max-min)](max'-min')+min'
Ex: map [50k, 200k] to [0, 1]
v' = [(v-50k)/(200k-50k)](1-0)+0
- Mean normalization
v' = (v-mean)/(max-min)
- Standardization (z-score normalization)
v' = (v-mean)/stdev
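A minimal numpy sketch of all three normalizations, reusing the [50k, 200k] example above:

```python
import numpy as np

v = np.array([50_000, 80_000, 120_000, 200_000], dtype=float)

# Min-max normalization: rescale to a new range [0, 1].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Mean normalization: center on the mean, scale by the range.
mean_norm = (v - v.mean()) / (v.max() - v.min())

# Z-score standardization: center on the mean, scale by std deviation.
zscore = (v - v.mean()) / v.std()

print(minmax)  # 50k -> 0.0, 200k -> 1.0, matching the example above
```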
Discretization
Split or merge points into intervals. Methods are unsupervised or supervised (the latter use class labels).
Unsupervised
- Binning (equal-width or equal-frequency)
- Histogram analysis (equal-width or equal-frequency)
- Clustering analysis
- Intuitive partitioning
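A minimal numpy sketch of equal-width vs. equal-frequency binning on made-up values:

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
k = 3  # number of intervals

# Equal-width: split the value range into k bins of equal size.
edges = np.linspace(data.min(), data.max(), k + 1)
width_bins = np.digitize(data, edges[1:-1])

# Equal-frequency: each bin gets (roughly) the same number of points.
quantile_edges = np.quantile(data, np.linspace(0, 1, k + 1))
freq_bins = np.digitize(data, quantile_edges[1:-1])

print(width_bins)  # bin index per point, 0..k-1
print(freq_bins)
```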
Supervised
- predetermined class labels
- entropy-based interval splitting (lower entropy means a purer class distribution)
- chi-square analysis-based interval merging (a lower chi-square value means the class is independent of the interval)
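A minimal sketch of entropy-based splitting on made-up values and labels; it scans candidate cut points and keeps the one with the lowest weighted entropy:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array; lower = purer."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    """Pick the cut point minimizing the weighted entropy of both sides."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_cut, best_h = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        cut = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if h < best_h:
            best_cut, best_h = cut, h
    return best_cut, best_h

values = np.array([1, 2, 3, 10, 11, 12], dtype=float)
labels = np.array(["a", "a", "a", "b", "b", "b"])
print(best_split(values, labels))  # cut at 6.5, weighted entropy 0.0
```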
Data Reduction
- Dimensionality reduction (attributes)
- Numerosity reduction (objects)
- Feature engineering (domain knowledge, decision tree induction)
Use forward selection (add the most informative attribute) and backward elimination (remove the least informative attribute) for automated dimensionality reduction.
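A minimal sketch of greedy forward selection, assuming a made-up R²-based score and a fixed attribute budget k (a real wrapper method would stop when the score no longer improves):

```python
import numpy as np

def r2_score(Xs, y):
    """R^2 of an ordinary least-squares fit on the chosen columns."""
    A = np.column_stack([Xs, np.ones(len(y))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

def forward_select(X, y, score, k):
    """Greedily add the attribute that improves the score the most."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best_j = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 2] + 0.5 * X[:, 4] + rng.normal(size=100)
print(forward_select(X, y, r2_score, k=2))  # likely [2, 4]
```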
Principal Component Analysis (PCA)
- n-dimensional data
- use the first few orthogonal vectors (principal components)
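A minimal numpy sketch of PCA via SVD on made-up data, projecting onto the first few principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # n-dimensional data (here n = 5)
Xc = X - X.mean(axis=0)         # center each attribute

# Principal components are the right singular vectors of the
# centered data (equivalently, eigenvectors of the covariance matrix).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                           # keep the first few components
X_reduced = Xc @ Vt[:k].T       # project onto the top-k components

explained = S**2 / (S**2).sum()
print(X_reduced.shape, explained[:k])  # (200, 2), variance per component
```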
Wavelet Transformation: linear signal processing that stores only a small fraction of the strongest wavelet coefficients; remove the resolutions carrying the least information.
Numerosity Reduction
Parametric: assume the data fits a certain model and estimate the model parameters.
Non-parametric: do not assume a model.
- Aggregation
- Histogram
- Clustering
- Sampling
Sampling
- Select a representative subset of data points
- Different sampling methods for different scenarios
Ex:
- Random sampling (with or without replacement to the pool)
- Cluster sampling (randomly select whole clusters of data points)
- Stratified sampling (draw proportionally from every class so rare classes are represented)
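A minimal pandas sketch of random and stratified sampling on a made-up two-class table:

```python
import pandas as pd

df = pd.DataFrame({
    "cls": ["A"] * 80 + ["B"] * 20,
    "x":   range(100),
})

# Simple random sampling without replacement.
srs = df.sample(n=10, replace=False, random_state=0)

# Random sampling with replacement (a point may be drawn twice).
srs_wr = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw 10% from every class so the rare
# class "B" is guaranteed to appear in the sample.
strat = df.groupby("cls").sample(frac=0.1, random_state=0)

print(strat["cls"].value_counts())  # 8 from A, 2 from B
```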