Data Mining
Understand the data
- Volume (how many data points)
- Variety (what is the mixture of different data types)
- Velocity (how fast is the data arriving or changing, how many copies do you have, is it temporal or does it show change over time)
- Veracity (do you trust each data point)
- Value (is this a useful way to spend time, is this data set worth it)
Types of data
- Relational, transactional
Student records, bank accounts, purchases
- Sequential, temporal, streaming
Gene sequences, stock prices, sensor readings
- Spatial, spatial-temporal
Land use, bird migration, traffic conditions
- Text, multimedia, web
News articles, audio/video/image, hypertext
- Graph, network data
Social network, power grid, co-authorship
Define an Application Goal
- Market analysis, target advertisement
- Healthcare, medical research
- Science and engineering
- Security
- Government
Decide whether you are trying to be descriptive, predictive, or prescriptive.
Knowledge View
- Frequent pattern, association, correlation
- Categorization (similarity and differences between groups)
- Anomaly, outliers (fraud and extreme events)
- Changes over time
Techniques
Frequent pattern analysis
Identify frequent events, items, sequences, correlation, or structure.
Classification
Build a model that assigns items to pre-defined classes; needs training data (see the sketch following this list of techniques).
Prediction
Make a numerical prediction (continuous value) like weather or stock price.
Clustering
Work with no predefined classes; group items to maximize intra-cluster similarity and inter-cluster dissimilarity.
Anomaly Detection
Search for events differing from the norm.
Trend and evolution analysis
Track changes over time, including periodic patterns.
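A minimal sketch contrasting classification (pre-defined classes, labeled training data) with clustering (no predefined classes). It assumes scikit-learn is available; the points and labels are made up for illustration.
```python
# Classification vs. clustering on toy 2-D points (assumes scikit-learn is installed).
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]   # made-up feature vectors

# Classification: pre-defined classes and labeled training data are required.
y = [0, 0, 1, 1]                                       # known class labels
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.05, 1.0]]))                      # -> [0]

# Clustering: no predefined classes; groups emerge from the data itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                                      # e.g. [0 0 1 1] (cluster ids are arbitrary)
```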
Pipeline
A data mining pipeline exists to take data and turn it into knowledge to be used in an application.
- Data understanding
- Data preprocessing
- Data warehousing
- Data modeling
- Pattern evaluation (also performance evaluation of the model)
- Diverse data = diverse knowledge
- Quality data = quality knowledge
Technique
- Frequent Pattern Analysis
- Classification
- Prediction
- Clustering
- Anomaly Detection
- Trend Analysis
- Outlier analysis, complex data types
FPA
Apriori
Algorithm used to find frequent item sets and patterns given a data set and a minimum support threshold.
Minimum support multiplied by the number of transactions gives the lower bound on the number of occurrences. List single items, filter with the bound, list pairs, filter, and so on (sketched below).
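A compact sketch of the level-wise generate-and-filter loop described above, with made-up transactions and a made-up minimum support; it is a simplified illustration, not a tuned Apriori implementation (no candidate pruning or hash-tree counting).
```python
# Level-wise frequent-itemset mining: count 1-itemsets, filter by the
# minimum-occurrence bound, join survivors into larger candidates, filter again, repeat.
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
min_support = 0.6                                  # fraction of transactions (made up)
min_count = min_support * len(transactions)        # lower bound on occurrences

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_count]

all_frequent = list(frequent)
k = 2
while frequent:
    # Generate k-item candidates by unioning frequent (k-1)-itemsets.
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    frequent = [c for c in candidates if support_count(c) >= min_count]
    all_frequent.extend(frequent)
    k += 1

print(sorted(tuple(sorted(s)) for s in all_frequent))
# -> [('a',), ('a', 'b'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```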
To handle big data, try partitioning, sampling, or transaction reduction (a transaction that contains no frequent k-item sets can be skipped in later scans).
Support counting can use a hash tree; a subset function handles item-specific branching.
Also consider the vertical data format: the key is an item set and the value is the set of transactions (transaction IDs) that contain it, as in the sketch below.
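A small illustration of the vertical data format with toy data: each item maps to the set of transaction IDs (a tid-set) containing it, and the support of a larger item set is the size of the intersection of its items' tid-sets.
```python
# Vertical data format: item -> set of transaction IDs (tid-set).
# Support of an itemset = size of the intersection of its items' tid-sets.
transactions = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"b", "c"}}

# Build the vertical representation.
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

# Support of {a, b} without rescanning transactions: intersect tid-sets.
support_ab = len(tidsets["a"] & tidsets["b"])
print(tidsets)        # e.g. {'a': {1, 2, 3}, 'b': {1, 2, 4}, 'c': {1, 3, 4}}
print(support_ab)     # 2  (transactions 1 and 2 contain both a and b)
```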
FP-growth
Algorithm based on the property that if abc is frequent and d is frequent in the set of transactions containing abc (the conditional database of abc), then abcd is frequent (see the sketch below).
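A toy demonstration of the stated property using a conditional database (the transactions containing the base pattern). This is not a full FP-growth implementation, which would build an FP-tree; the data and threshold are made up.
```python
# Restrict to transactions containing {a, b, c}; any item frequent *there*
# extends {a, b, c} to a frequent 4-itemset.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"b", "c", "e"},
]
min_count = 2
base = {"a", "b", "c"}

# Conditional database: transactions that contain the base itemset.
conditional_db = [t - base for t in transactions if base <= t]

# Count remaining items inside the conditional database.
counts = {}
for t in conditional_db:
    for item in t:
        counts[item] = counts.get(item, 0) + 1

# Items frequent in the conditional database extend the base pattern.
extensions = [sorted(base | {item}) for item, c in counts.items() if c >= min_count]
print(extensions)     # [['a', 'b', 'c', 'd']]  -> abcd is frequent
```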
Association Rule
- Given a list of transactions and item sets X, Y
- Association rule: X => Y
- Support: P(X U Y)
- Confidence: P(Y | X)
To measure the correlation of numerical attributes, use the correlation coefficient; for nominal attributes, use the chi-squared test.
lift(A, B) = P(A U B) / (P(A)P(B)); lift > 1 means A and B are positively correlated, lift < 1 negatively correlated, lift = 1 independent.
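A short sketch computing support, confidence, and lift for a rule X => Y directly from the probability definitions above, using a made-up transaction list.
```python
# Support, confidence, and lift for the rule X => Y, from the definitions above (toy data).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"milk", "bread"},
]
X, Y = {"milk"}, {"bread"}
n = len(transactions)

def prob(itemset):
    """P(itemset): fraction of transactions containing every item in the set."""
    return sum(1 for t in transactions if itemset <= t) / n

XY = X | Y                               # itemset union: a transaction must contain all of X and Y
support = prob(XY)                       # P(X U Y)
confidence = prob(XY) / prob(X)          # P(Y | X)
lift = prob(XY) / (prob(X) * prob(Y))

print(round(support, 3), round(confidence, 3), round(lift, 3))
# -> 0.6 0.75 0.938  (lift < 1: milk and bread are slightly negatively correlated here)
```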