Data Mining
Understand the data
- Volume (how many data points)
- Variety (what is the mixture of different data types)
- Velocity (how quickly the data changes, how many copies or versions exist, whether it is temporal)
- Veracity (do you trust each data point)
- Value (is this a useful way to spend time, is this data set worth it)
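A first pass at volume, variety, and veracity can be as simple as counting rows, listing the value types seen in each field, and counting missing values. The sketch below assumes the data is a list of dictionaries; the `profile` function and the sample records are hypothetical and only illustrate the idea.

```python
def profile(records):
    """Summarize row count, observed types, and missing values per field."""
    columns = {}
    for row in records:
        for key, value in row.items():
            col = columns.setdefault(key, {"types": set(), "missing": 0})
            if value is None:
                col["missing"] += 1                      # veracity: gaps in the data
            else:
                col["types"].add(type(value).__name__)   # variety: mixed types
    return {"rows": len(records), "columns": columns}    # volume: row count

# Made-up sample records for illustration only.
records = [
    {"id": 1, "name": "Ana", "balance": 120.5},
    {"id": 2, "name": None, "balance": 80.0},
    {"id": 3, "name": "Li", "balance": "n/a"},
]
summary = profile(records)
```

Here the `balance` field mixes floats and strings and `name` has one missing value, which are the kinds of issues worth knowing about before any modeling.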
Types of data
- Relational, transactional
Student records, bank accounts, purchases
- Sequential, temporal, streaming
Gene sequences, stock prices, sensor readings
- Spatial, spatial-temporal
Land use, bird migration, traffic conditions
- Text, multimedia, web
News articles, audio/video/image, hypertext
- Graph, network data
Social network, power grid, co-authorship
Define an Application Goal
- Market analysis, target advertisement
- Healthcare, medical research
- Science and engineering
- Security
- Government
Decide whether the goal is descriptive, predictive, or prescriptive.
Knowledge View
- Frequent pattern, association, correlation
- Categorization (similarities and differences between groups)
- Anomaly, outliers (fraud and extreme events)
- Changes over time
Techniques
Frequent pattern analysis
Identify frequent events, items, sequences, correlations, or structures.
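As a rough illustration, frequent item pairs can be found by counting how often two items appear in the same transaction and keeping the pairs that meet a minimum count. This is a minimal sketch, not a full algorithm such as Apriori; the function name `frequent_pairs`, the `min_support` parameter, and the baskets are assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair is counted once per transaction.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Made-up shopping baskets for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
    ["bread", "milk", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=2)
# pairs == {("bread", "milk"): 3}
```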
Classification
Build a model that assigns items to predefined classes. Requires labeled training data.
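One of the simplest classifiers is nearest centroid: average the training points of each class, then assign a new point to the class whose average is closest. The sketch below assumes numeric feature tuples; `train_centroids`, `classify`, and the sample data are hypothetical names and values chosen for illustration.

```python
from math import dist

def train_centroids(points, labels):
    """Compute the mean point (centroid) of each class from labeled data."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def classify(centroids, point):
    """Assign the point to the class with the nearest centroid."""
    return min(centroids, key=lambda y: dist(centroids[y], point))

# Made-up training data: two features per item, two predefined classes.
train_x = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
train_y = ["low", "low", "high", "high"]
model = train_centroids(train_x, train_y)
label = classify(model, (2.0, 1.5))  # "low"
```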
Prediction
Make a numerical prediction (continuous value) like weather or stock price.
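A minimal example of numerical prediction is fitting a straight line by ordinary least squares and extrapolating one step ahead. The function `fit_line` and the temperature readings are assumptions for illustration; real forecasting would need a richer model.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up hourly temperature readings.
hours = [1, 2, 3, 4]
temps = [12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(hours, temps)
forecast = slope * 5 + intercept  # predicted temperature at hour 5
```

With this perfectly linear sample the fit gives a slope of 2 and an intercept of 10, so the forecast for hour 5 is 20.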
Clustering
Work with no predefined classes. Group items so that similarity within a cluster is high and similarity between clusters is low.
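k-means is one common way to do this: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The sketch below takes fixed starting centroids so the result is deterministic; the `kmeans` function, its parameters, and the points are hypothetical, and production code would usually pick starting centroids more carefully.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up 2D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, groups = kmeans(points, centroids=[(0, 0), (10, 10)])
```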
Anomaly Detection
Search for events differing from the norm.
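One simple way to flag values that differ from the norm is a z-score test: measure how many standard deviations each value lies from the mean and flag those beyond a threshold. The function name `zscore_outliers`, the threshold of 2.0, and the readings are assumptions for illustration; the right threshold depends on the data.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up sensor readings with one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(readings)  # [50]
```

Note that a single extreme value also inflates the mean and standard deviation, so this test can miss outliers in small or heavily contaminated samples.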
Trend and evolution analysis
Track how data changes over time, including periodic patterns.
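A moving average is one basic tool here: averaging over a window equal to the period smooths out the repeating pattern and leaves the underlying trend. The `moving_average` function and the sales figures are hypothetical, and the period of 2 is an assumption built into the sample data.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Made-up sales that alternate low/high (period 2) while rising slowly.
sales = [10, 20, 12, 22, 14, 24, 16, 26]
trend = moving_average(sales, window=2)
# trend == [15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0]
```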
Pipeline
A data mining pipeline exists to take data and turn it into knowledge to be used in an application.
- Data understanding
- Data preprocessing
- Data warehousing
- Data modeling
- Pattern evaluation (also performance evaluation of the model)
- Diverse data = diverse knowledge
- Quality data = quality knowledge
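The stages listed above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not a reference pipeline: the functions `preprocess`, `fit_threshold`, and `evaluate` are hypothetical, the data is made up, warehousing is skipped, and the "model" is just a threshold halfway between the two class means.

```python
def preprocess(rows):
    """Preprocessing: drop rows that contain missing values."""
    return [r for r in rows if None not in r]

def fit_threshold(rows):
    """Modeling: place a threshold midway between the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def evaluate(threshold, rows):
    """Evaluation: fraction of rows where the threshold predicts the label."""
    correct = sum((x > threshold) == bool(y) for x, y in rows)
    return correct / len(rows)

# Made-up (feature, label) rows, one with a missing feature value.
raw = [(1.0, 0), (2.0, 0), (None, 1), (8.0, 1), (9.0, 1), (3.0, 0), (7.0, 1)]
clean = preprocess(raw)
train, test = clean[:4], clean[4:]   # simple split for illustration
threshold = fit_threshold(train)
accuracy = evaluate(threshold, test)
```

Evaluating on rows the model did not see during fitting is the point of the final step; measuring accuracy on the training rows alone would overstate how well the model works.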