Data Mining: What You Might Not Know

The ultimate goal of data mining is not the acquisition of data, but the exploration and analysis of massive amounts of data to uncover patterns, rules, and relationships. One of the key outcomes is the identification of reliable and meaningful patterns.

Meaningful patterns can do the following:

* Model typical behaviors

* Identify atypical behaviors

* Express possible cause and effect relationships

* Explain past behaviors

* Describe current conditions

* Develop a predictive model for the future

Patterns developed from different types of data can show meaningful relationships. Time-series data mining can additionally be used to postulate:

* causality

* life-cycle behaviors

* impacts of proximal or distal relationships

* cluster formation and disaggregation

Characteristics of Data Mining Data

The data may represent changes in behavior and activities over time or, alternatively, the relationships among different types of data collected at one point in time. It typically takes one of the following forms:

* Same type of data, collected at different points in time

* Different types of data, collected at the same point in time

* Data streams, which involve ordered sequences of items that arrive over time, either:

  * offline: regular chunked arrivals

  * online: continuous flow
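The difference between offline and online arrival can be illustrated with a minimal Python sketch. The function names, the sample readings, and the chunk size are assumptions made for illustration only; a real stream would come from a sensor, log, or message queue rather than a list.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def online_stream(items: Iterable[int]) -> Iterator[int]:
    """Online arrival: hand over one item at a time, as a continuous flow."""
    for item in items:
        yield item


def offline_chunks(items: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Offline arrival: deliver the same items in regular, fixed-size chunks."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # the source is exhausted
        yield chunk


# Hypothetical readings standing in for a real data source.
readings = [3, 1, 4, 1, 5, 9, 2]
one_by_one = list(online_stream(readings))
in_chunks = list(offline_chunks(readings, chunk_size=3))
print(one_by_one)
print(in_chunks)
```

In both cases the order of the items is preserved; what changes is how much data the mining step sees at once.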

Planning the Data Mining Process

A data mining project can follow a fairly clear process:

1. Define the problem

2. Determine the characteristics of the data

3. Develop a plan for data mining

4. Review similar data mining projects and algorithms

5. Become familiar with the data, data issues, and potentially meaningful data subsets

6. Prepare and condition the data

7. Develop the model and algorithm

8. Evaluate the model and compare it with other models

9. Implement the results, which involves generating reports or continuing to develop ongoing activities

Data Mining: Key Tasks

Key tasks in data mining include the following:

* Identification

* Eliminate unnecessary or distracting frequent item sets

* Definition and differentiation

* Optimize storage and recovery of streams

* Classification

* Cluster recognition

* Segmentation

* Discovery of motifs (sub-sequences)

* Detection of similar clusters or sets

* Detection of outliers and anomalies

* Create predictive models
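As one illustration of the tasks listed above, the detection of outliers can be sketched with a simple z-score test. This is only one of many possible techniques, and the function name, the threshold of 2.0, and the sample values are assumptions chosen for the example rather than recommendations.

```python
from statistics import mean, pstdev
from typing import List


def find_outliers(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return the values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical measurements with one obviously atypical reading.
measurements = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers(measurements)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so a z-score test of this kind can miss outliers in small or heavily contaminated samples.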

Data mining that involves continuous streams of data presents unique challenges because of the nature of the data and the types of patterns that are meaningful, given the array of patterns that it is possible to develop. It is also challenging to integrate incoming data with existing databases in order to qualitatively evaluate patterns in a timely way. A further challenge is “concept drift”: the underlying relationships in the data change over time, so the usefulness and validity of earlier results degrade.
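A rough way to see how results can degrade over time is to compare the mean of recent data against a reference mean taken earlier in the stream. The sketch below is an illustrative heuristic under assumed parameters (`window`, `threshold`) and a hypothetical function name; it is not one of the drift-detection methods from the literature cited here.

```python
from collections import deque
from statistics import mean
from typing import Iterable, List, Optional


def detect_drift(stream: Iterable[float], window: int = 5, threshold: float = 2.0) -> List[int]:
    """Return the indices at which the recent mean departs from the reference mean."""
    recent: deque = deque(maxlen=window)
    reference: Optional[float] = None
    drift_at: List[int] = []
    for index, value in enumerate(stream):
        recent.append(value)
        if len(recent) < window:
            continue  # wait until a full window is available
        current = mean(recent)
        if reference is None:
            reference = current  # first full window becomes the reference
        elif abs(current - reference) > threshold:
            drift_at.append(index)
            recent.clear()       # start over with fresh data only
            reference = None
    return drift_at


# Hypothetical stream whose level shifts from 10 to 20 halfway through.
shifted = [10] * 10 + [20] * 10
drift_indices = detect_drift(shifted)
print(drift_indices)
```

Once drift is flagged, the sketch discards the old window and rebuilds its reference from new data, which mirrors the idea that a model built on earlier data should be refreshed.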

In general, data stream mining is applied in areas such as the following:

* Sensors and surveillance: networks, physical locations, manufacturing, transportation

* Performance monitoring: manufacturing, networks, controls

* Transaction / activity monitoring: retail, web performance, manufacturing

Algorithms

A literature review suggests that data mining for data streams is generally performed using three major classes of algorithms. They do not yield the same results, and the differences can be quite significant, depending on the application.

Landmark Window Based Data Mining

What is mined is all of the data that arrived between a specific time-stamp (the landmark) and the present.

Pros: Complete comparison with an a priori property

Cons: The order in which information is considered and placed into sets can lead to errors
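A landmark window can be sketched in a few lines of Python: every item from the landmark onward is counted, and nothing is ever discarded. The function name and the `(timestamp, item)` event format are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def landmark_counts(events: Iterable[Tuple[int, str]], landmark: int) -> Counter:
    """Count every item whose time-stamp falls at or after the landmark."""
    counts: Counter = Counter()
    for timestamp, item in events:
        if timestamp >= landmark:
            counts[item] += 1
    return counts


# Hypothetical (time-stamp, item) events.
events = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a")]
since_landmark = landmark_counts(events, landmark=3)
print(since_landmark)
```

Because the window only grows, memory use and the influence of early data both increase for as long as the stream runs.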

Damped Window

This approach weights new data more heavily than old or historical data. The weight of each item decays with age, so older data gradually drops out of consideration when developing sets.

Pros: Efficient use of resources, eliminates old and obsolete information

Cons: Large errors may be made because the information being eliminated may be important for the rule to be effective
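The damped window idea can be sketched by multiplying every existing weight by a decay factor each time a new item arrives. The function name and the decay factor of 0.5 are assumptions chosen to keep the arithmetic easy to follow; practical systems usually use a factor much closer to 1.

```python
from typing import Dict, Iterable


def damped_counts(items: Iterable[str], decay: float = 0.5) -> Dict[str, float]:
    """Return decayed item weights, in which newer arrivals count for more."""
    weights: Dict[str, float] = {}
    for item in items:
        for key in weights:
            weights[key] *= decay  # age everything seen so far
        weights[item] = weights.get(item, 0.0) + 1.0  # the new arrival has full weight
    return weights


# Hypothetical stream: "a" appears twice early, "b" once at the end.
weights = damped_counts(["a", "a", "b"])
print(weights)
```

In this run "a" occurs twice but ends with a lower weight than the single, more recent "b", which shows both the benefit (stale data fades) and the risk (important older data is discounted).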

Sliding Window

The sliding window approach favors new data (as the damped window approach does), but does not completely eliminate old data. Instead, it incorporates summarized versions of old data and data relations.

Pros: Can incorporate past data and do so relatively quickly

Cons: The assumptions made to create summaries of old data sets can be flawed
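The behavior described for this approach can be sketched as a fixed-size window of recent values plus a compact summary of everything that has expired from it. The class name, the choice of a running mean as the summary, and the window size are all assumptions made for illustration; published sliding-window algorithms differ in what, if anything, they retain about old data.

```python
from collections import deque
from typing import Optional


class SummarizedSlidingWindow:
    """Keep the most recent values exactly and older values only as a summary."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.recent: deque = deque()
        self.old_count = 0    # how many values have expired
        self.old_total = 0.0  # their sum, used for a summary mean

    def add(self, value: float) -> None:
        self.recent.append(value)
        if len(self.recent) > self.size:
            expired = self.recent.popleft()
            self.old_count += 1
            self.old_total += expired  # detail is lost; only the total survives

    def recent_mean(self) -> Optional[float]:
        return sum(self.recent) / len(self.recent) if self.recent else None

    def old_mean(self) -> Optional[float]:
        return self.old_total / self.old_count if self.old_count else None


window = SummarizedSlidingWindow(size=3)
for reading in [1, 2, 3, 4, 5]:
    window.add(reading)
print(list(window.recent), window.recent_mean(), window.old_mean())
```

The summary here assumes that a mean describes the old data adequately. If the old data contained spikes or distinct regimes, that assumption would hide them, which is the kind of flaw noted in the cons above.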

General Observations and Conclusions

The ability to collect data continues to expand, sometimes dramatically, thanks to technological advances in both hardware and software. However, a review of the processes and the literature makes it clear that the algorithms used to process and make meaning of data batches and streams differ widely. Consequently, the results and conclusions produced using data mining techniques (in both collecting and analyzing data) can be highly variable. Thus, decisions based on data mining need to be made carefully, and more than one analytical technique and set of algorithms should be used.

AAPG Session on Big Data at Geomechanics and Reservoir Characterization of Shales and Carbonates

The AAPG Geomechanics GTW in Baltimore, July 16-17, will feature a session on the use of data mining in various aspects of shale plays and the “new” carbonates.

References

Esling, P., & Agon, C. (2012). Time-Series Data Mining. ACM Computing Surveys, 45(1), 12:1-12:34.

Mala, A. A., & Dhanaseelan, F. (2011). Data Stream Mining Algorithms: A Review of Issues and Existing Approaches. International Journal On Computer Science & Engineering, 3(7), 2726-2732.

Ramageri, B. M., & Desai, B. L. (2013). Role of data mining in retail sector. International Journal On Computer Science & Engineering, 5(1), 47-50.

This article by Susan Smith Nash previously appeared elsewhere.
