The ultimate goal of data mining is not the acquisition of data, but the exploration and analysis of massive amounts of data resulting in patterns, rules, and relationships. One of the key outcomes is the identification of reliable and meaningful patterns.
Meaningful patterns can do the following:
* Model typical behaviors
* Identify atypical behaviors
* Express possible cause and effect relationships
* Explain past behaviors
* Describe current conditions
* Develop a predictive model for the future
While patterns found across different types of data can show meaningful relationships, time-series data mining can also be used to postulate the following:
* life-cycle behaviors
* impacts of proximal or distal relationships
* cluster formation and disaggregation
Characteristics of Data Mining Data
The data may represent changes of behavior and activities over time, or, alternatively, it could represent the relationship of different types of data which have been collected at one point in time.
* Same type of data, collected at different points in time
* Different types of data, collected at the same point in time
* Data streams, which involve ordered sequences of items that arrive over time
  * offline: regular chunked arrivals
  * online: continuous flow
Planning the Data Mining Process
The development of data mining can follow a fairly clear process:
1. Determine and define the problem
2. Determine the characteristics of the data
3. Develop a plan for data mining
4. Review similar data mining projects and algorithms
5. Become familiar with the data, data issues, and potentially meaningful data subsets
6. Prepare and condition the data
7. Develop the model and algorithm
8. Evaluate the model and compare it with other models
9. Implement the results, which involves generating reports or continuing to develop ongoing activities
Data Mining: Key Tasks
Key tasks in data mining include the following:
* Eliminate unnecessary or distracting frequent item sets
* Definition and differentiation
* Optimize storage and recovery of streams
* Cluster recognition
* Discovery of motifs (sub-sequences)
* Detection of similar clusters or sets
* Detection of outliers and anomalies
* Create predictive models
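As an illustration of one task from the list above, the detection of outliers, here is a minimal sketch. It assumes a simple rule that is not taken from the article: a value counts as an outlier when it lies more than a chosen number of standard deviations from the mean. The function name `find_outliers`, the `threshold` parameter, and the sample readings are all hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean.

    Assumption: a plain z-score rule; real projects often need more robust methods.
    """
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # All values are identical, so nothing can stand out.
        return []
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical sensor readings with one obvious anomaly.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers(readings)
print(outliers)
```

A rule this simple is sensitive to the very anomalies it is looking for, because extreme values inflate both the mean and the standard deviation.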
Data mining that involves continuous streams of data presents unique challenges because of the nature of the data and the need to decide which patterns are meaningful among the many that can be generated. It is also challenging to integrate incoming data with existing databases in order to evaluate patterns qualitatively and in a timely way. A further challenge is avoiding the “concept drift” problem, in which the usefulness and validity of the results degrade over time.
Continuous data streams typically come from sources such as the following:
* Sensors and surveillance: networks, physical locations, manufacturing, transportation
* Performance monitoring: manufacturing, networks, controls
* Transaction / activity monitoring: retail, web performance, manufacturing
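For streams like these, the gradual loss of validity described above can be watched for in a simple way. The sketch below assumes a model whose predictions are scored as right or wrong as data arrives, and it flags possible drift when the recent error rate exceeds the early error rate by more than a tolerance. The function name `drift_detected`, the window size, and the tolerance are hypothetical choices for illustration, not a published detection method.

```python
def drift_detected(errors, window=4, tolerance=0.25):
    """Flag possible drift from a sequence of 0/1 error flags (1 = wrong prediction).

    Assumption: compares the error rate of the first `window` results
    with that of the most recent `window` results.
    """
    if len(errors) < 2 * window:
        # Not enough history to compare two separate windows.
        return False
    baseline = sum(errors[:window]) / window
    recent = sum(errors[-window:]) / window
    return recent - baseline > tolerance


# Hypothetical error histories for a model scored on a stream.
stable = [0, 0, 1, 0, 0, 1, 0, 0]
drifting = [0, 0, 1, 0, 1, 1, 0, 1]

print(drift_detected(stable))
print(drift_detected(drifting))
```

In practice, the baseline window and tolerance would need to be tuned to the application, since a threshold that is too tight raises false alarms and one that is too loose misses real degradation.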
A literature review suggests that data mining for data streams is generally performed using three major classes of algorithms, and that they do not yield the same results. The differences could be quite significant, depending on the application.
Landmark Window Based Data Mining
What is measured is the difference between a specific time-stamp (the landmark) and the present.
Pros: Complete comparison with an a priori property
Cons: The order in which information is considered and placed into sets can lead to errors
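A minimal sketch of the landmark idea follows, assuming the stream is a sequence of (timestamp, item) pairs and that the quantity of interest is how often each item occurs between the landmark and the present. The function name `landmark_counts` and the sample events are hypothetical.

```python
from collections import Counter


def landmark_counts(stream, landmark):
    """Count every item seen from the landmark timestamp up to the present.

    Assumption: `stream` yields (timestamp, item) pairs.
    """
    counts = Counter()
    for timestamp, item in stream:
        if timestamp >= landmark:
            # Everything since the landmark carries equal weight.
            counts[item] += 1
    return counts


# Hypothetical events; the landmark is set at timestamp 3.
events = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a")]
since_landmark = landmark_counts(events, landmark=3)
print(dict(since_landmark))
```

Note that the counts keep growing for as long as the landmark stays fixed, so nothing seen after the landmark is ever forgotten.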
Damped Window Based Data Mining
This approach privileges new data over old or historical data, which means that the older data drops out of consideration for developing sets.
Pros: Efficient use of resources, eliminates old and obsolete information
Cons: Large errors may be made because the information being eliminated may be important for the rule to be effective
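The recency-weighted (damped) approach described above can be sketched with an exponential decay, which is an assumption made here for illustration: each time a new item arrives, all existing weights are multiplied by a decay factor, and items whose weight falls below a cutoff are dropped. The names `damped_counts`, `decay`, and `min_weight` are hypothetical.

```python
def damped_counts(stream, decay=0.5, min_weight=0.01):
    """Keep a decaying weight per item so that recent arrivals dominate.

    Assumption: exponential decay with a cutoff below which items are forgotten.
    """
    weights = {}
    for item in stream:
        # Age every existing weight and forget items that have faded away.
        weights = {
            key: weight * decay
            for key, weight in weights.items()
            if weight * decay >= min_weight
        }
        # The newly arrived item gets full weight.
        weights[item] = weights.get(item, 0.0) + 1.0
    return weights


recent_bias = damped_counts(["a", "a", "b", "b"])
print(recent_bias)

# An item seen once and then followed by many others is eventually dropped.
faded = damped_counts(["a"] + ["b"] * 8)
print(faded)
```

The second call shows the risk noted in the cons: once "a" has faded below the cutoff it is gone, even if it later turns out to matter for a rule.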
Sliding Window Based Data Mining
The sliding window approach favors new data (as in the damped window approach), but does not completely eliminate old data. Instead, it incorporates summarized versions of old data and data relations.
Pros: Can incorporate past data and do so relatively quickly
Cons: The assumptions made to create summaries of old data sets can be flawed
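One possible reading of this description is sketched below; it is not a specific published algorithm. The most recent items are kept in full detail, and items that slide out of the window are folded into a summary of counts rather than discarded. The class name `SummarizingWindow` and its attributes are hypothetical.

```python
from collections import Counter, deque


class SummarizingWindow:
    """Keep the last `size` items in detail and summarize everything older.

    Assumption: the summary is a simple count per item, which loses ordering.
    """

    def __init__(self, size):
        self.size = size
        self.recent = deque()     # detailed view of the newest items
        self.summary = Counter()  # condensed view of expired items

    def add(self, item):
        self.recent.append(item)
        if len(self.recent) > self.size:
            # The oldest item leaves the window but survives in the summary.
            expired = self.recent.popleft()
            self.summary[expired] += 1


window = SummarizingWindow(size=3)
for item in ["a", "b", "a", "c", "d"]:
    window.add(item)

print(list(window.recent))
print(dict(window.summary))
```

Because the summary here records only counts, any pattern that depended on the order or timing of the expired items is lost, which is one way the summarizing assumptions can turn out to be flawed.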
General Observations and Conclusions
The ability to collect data continues to expand, sometimes dramatically, thanks to technological advances in both hardware and software. However, a review of the processes and the literature makes it clear that the algorithms used to process and make meaning of data batches and streams differ widely. Consequently, the results and conclusions produced by data mining techniques (in both collecting and analyzing data) can be highly variable. Thus, decisions based on data mining need to be made carefully, and more than one analytical technique and set of algorithms should be used.
AAPG Session on Big Data at Geomechanics and Reservoir Characterization of Shales and Carbonates
The AAPG Geomechanics GTW in Baltimore, July 16-17, will feature a session on the use of data mining in various aspects of shale plays and the “new” carbonates. For a full preliminary program, click the link.
References
Esling, P., & Agon, C. (2012). Time-Series Data Mining. ACM Computing Surveys, 45(1), 12:1-12:34.
Mala, A. A., & Dhanaseelan, F. (2011). Data Stream Mining Algorithms: A Review of Issues and Existing Approaches. International Journal On Computer Science & Engineering, 3(7), 2726-2732.
Ramageri, B. M., & Desai, B. L. (2013). Role of data mining in retail sector. International Journal On Computer Science & Engineering, 5(1), 47-50.
This article by Susan Smith Nash previously appeared elsewhere.