Data Understanding
A data set is a collection of data objects, each described by a number of attributes.
Attribute types can be categorical (nominal, binary, ordinal), numeric (discrete or continuous), and more.
Start by determining how big the data set is and how each attribute's values are distributed (categorical: % of each category; numeric: central tendency and dispersion), then compare across attributes and data sets (similarity).
Central tendency: the typical numeric value, or the norm (mean, median, mode, midrange = (max + min)/2).
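As a rough illustration of the central tendency measures above, the following sketch uses Python's standard statistics module. The sample values are made up for the example and are not from any real data set.

```python
import statistics

# Hypothetical sample of one numeric attribute (illustrative values only).
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)            # arithmetic average
median = statistics.median(values)        # middle value (average of the two middle values here)
modes = statistics.multimode(values)      # most frequent value(s), returned as a list
midrange = (max(values) + min(values)) / 2  # halfway between the extremes

print(mean, median, modes, midrange)
```

The midrange is very sensitive to outliers because it depends only on the two extreme values, while the median barely moves when a single extreme value changes.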
Dispersion: how the values are spread out in your distribution (stretch and squeeze). Range (max - min), quartiles, IQR (Q3 - Q1), variance, standard deviation.
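The dispersion measures can be sketched the same way. The sample values are hypothetical, and the quartile method ("inclusive") is an assumption: different tools use slightly different quartile conventions and may report slightly different Q1 and Q3.

```python
import statistics

# Hypothetical sample of one numeric attribute (illustrative values only).
values = [2, 3, 3, 5, 7, 10]

value_range = max(values) - min(values)

# Quartiles split the sorted values into four parts.
# method="inclusive" is one of several conventions; it is an assumption here.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Population variance and standard deviation (divide by N).
# Use statistics.variance / statistics.stdev for the sample versions (divide by N - 1).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

print(value_range, (q1, q2, q3), iqr, variance, std_dev)
```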
Quick visualization
- Boxplot (show outliers)
- Histogram
- Quantile Plot (percentile comparison)
- Quantile-Quantile Plot (compares the quantiles of two distributions)
- Scatter Plot (comparison of two attributes)
Similarity
Dissimilarity matrix: entry (i,j) is the distance d(i,j) between objects i and j. The distance can be binary or graded.
Binary
Symmetric - Y and N are about equally likely
Asymmetric - one of Y or N is much more likely than the other
Hamming distance - the number of bit positions that differ
Two binary records, A and B (objects i and j in the formulas that follow).
q = count of attributes where Ai == Bi == true
r = count of attributes where Ai == true and Bi == false
s = count of attributes where Ai == false and Bi == true
t = count of attributes where Ai == Bi == false
Symmetric variables:
d(i,j) = (r+s)/(q+r+s+t)
Asymmetric variables (assuming false is the norm in this example, so the t matches are ignored):
d(i,j) = (r+s)/(q+r+s)
Jaccard coefficient:
sim(i,j) = q/(q+r+s) = 1 - d(i,j), where d(i,j) is the asymmetric dissimilarity
Jaccard is useful for sparse data
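The q, r, s, t counts and the three measures above can be sketched as follows. The function names and the two example records are hypothetical. Raising an error when q + r + s is zero is an assumption of this sketch, since the asymmetric formulas are undefined in that case.

```python
def binary_counts(a, b):
    """Return (q, r, s, t) for two equal-length sequences of booleans."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    q = sum(1 for x, y in zip(a, b) if x and y)          # both true
    r = sum(1 for x, y in zip(a, b) if x and not y)      # true in A only
    s = sum(1 for x, y in zip(a, b) if not x and y)      # true in B only
    t = sum(1 for x, y in zip(a, b) if not x and not y)  # both false
    return q, r, s, t


def symmetric_dissimilarity(a, b):
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(a, b):
    q, r, s, _ = binary_counts(a, b)
    if q + r + s == 0:
        # Assumption: treat the all-false case as an error rather than pick a value.
        raise ValueError("undefined when both records are all false")
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    return 1 - asymmetric_dissimilarity(a, b)


# Hypothetical records with six binary attributes each.
A = [True, False, True, False, False, False]
B = [True, True, False, False, False, False]

q, r, s, t = binary_counts(A, B)
hamming = r + s  # number of positions where the records differ
print(q, r, s, t, hamming)
print(symmetric_dissimilarity(A, B), asymmetric_dissimilarity(A, B), jaccard_similarity(A, B))
```

With mostly-false (sparse) records, the symmetric measure is dominated by t, which is why the asymmetric form and Jaccard drop it.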
Ordinal
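Map values to ranks, where r_if is the rank of object i on attribute f and M_f is the number of ordered states of f:
r_if in {1, ..., M_f}
z_if = (r_if - 1)/(M_f - 1)
Each z_if lies in [0, 1]. Dissimilarity is the distance between the mapped values.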
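A small sketch of the rank mapping, assuming the ordered levels are given as a list from lowest to highest. The level names and the helper name are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value to [0, 1] via z = (rank - 1) / (M - 1)."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = ordered_levels.index(value) + 1  # ranks start at 1
    return (rank - 1) / (m - 1)


# Hypothetical ordinal attribute with three levels, lowest first.
levels = ["fair", "good", "excellent"]

z_fair = ordinal_to_unit_interval("fair", levels)
z_good = ordinal_to_unit_interval("good", levels)
z_excellent = ordinal_to_unit_interval("excellent", levels)

# Dissimilarity between two objects on this attribute is the distance
# between their mapped values.
d_fair_excellent = abs(z_fair - z_excellent)
d_fair_good = abs(z_fair - z_good)
print(z_fair, z_good, z_excellent, d_fair_excellent, d_fair_good)
```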
Numeric
Measured by distance, like Minkowski distance (l_p norm)
p = 1: Manhattan distance
p = 2: Euclidean distance
d(i,j) = (|x_i1 - x_j1|^p + ... + |x_in - x_jn|^p)^(1/p)
Useful for dense continuous data
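A minimal sketch of the Minkowski distance for two numeric records. The function name and the sample points are hypothetical.

```python
def minkowski_distance(x, y, p):
    """Minkowski (L_p) distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


# Hypothetical two-attribute records.
x = (1, 2)
y = (4, 6)

manhattan = minkowski_distance(x, y, p=1)  # |1-4| + |2-6|
euclidean = minkowski_distance(x, y, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

Attributes on very different scales should usually be normalized first, otherwise the attribute with the largest range dominates the distance.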
Properties
d(i,i) = 0
d(i,j) >= 0
d(i,j) = d(j,i) # not always true with weights
d(i,j) <= d(i,k) + d(k,j)
triangle inequality (one side of a triangle is never longer than the other two combined: z <= x + y)
Cosine Similarity
Ex: text documents
Angular similarity of high dimensional and sparse data.
cos(theta) = cos(A,B) = (A . B) / (||A|| ||B||) = sum_i(A_i B_i) / (sqrt(sum_i A_i^2) * sqrt(sum_i B_i^2)), where theta is the angle between A and B
Useful for sparse data
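A minimal sketch of cosine similarity on plain Python lists. The vectors stand in for hypothetical term-count vectors of documents. Raising an error for an all-zero vector is an assumption of this sketch, since the formula is undefined there.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: reject zero vectors instead of returning a default.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical term-count vectors for three documents.
doc1 = [1, 0, 2, 0]
doc2 = [2, 0, 4, 0]  # same direction as doc1, twice the length
doc3 = [0, 3, 0, 1]  # shares no terms with doc1

same_direction = cosine_similarity(doc1, doc2)
no_overlap = cosine_similarity(doc1, doc3)
print(same_direction, no_overlap)
```

Because only the angle matters, a document and a copy of it repeated twice score as identical, which is often what is wanted for text.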
Sequential Data, Time Series
- Euclidean distance
- Dynamic time warping
- Minimum jump cost
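Of the options listed above, dynamic time warping is the one that most benefits from a worked example. The sketch below is the textbook dynamic-programming form, assuming absolute difference as the local cost and no window constraint. Both choices are assumptions, and the function name is hypothetical.

```python
def dtw_distance(s, t):
    """Dynamic time warping cost between two numeric sequences.

    Assumptions: local cost is |s_i - t_j|, no warping-window constraint.
    """
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = best alignment cost of s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(s[i - 1] - t[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # s advances, t stays
                cost[i][j - 1],      # t advances, s stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The second series is the first with one value held for an extra step.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
print(stretched, shifted)
```

Unlike Euclidean distance, this works on series of different lengths and tolerates local stretching in time.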
Mixed Attribute Types
Weighted sum across attributes
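One way to read "weighted sum across attributes" is sketched below: compute a dissimilarity in [0, 1] per attribute with the measure that suits its type, then combine them with weights. Dividing by the total weight, so that the result stays in [0, 1], is an assumption of this sketch, as are the function name and the example numbers.

```python
def mixed_dissimilarity(per_attribute, weights):
    """Weighted combination of per-attribute dissimilarities.

    per_attribute: dissimilarities in [0, 1], one per attribute.
    weights: non-negative weights, one per attribute.
    Assumption: normalize by the total weight.
    """
    if len(per_attribute) != len(weights):
        raise ValueError("need one weight per attribute")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * d for w, d in zip(weights, per_attribute)) / total_weight


# Hypothetical object pair: a binary attribute (match, d = 0.0),
# an ordinal attribute (d = 0.5), and a normalized numeric attribute
# (d = 1.0) that is weighted twice as heavily.
d_mixed = mixed_dissimilarity([0.0, 0.5, 1.0], [1, 1, 2])
print(d_mixed)
```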