What’s the difference between data mining and text mining?

Even though data mining and text mining are often seen as complementary analytic processes that solve business problems through data analysis, they differ in the type of data they handle.

While data mining handles structured data (highly formatted data such as that found in databases or ERP systems), text mining deals with unstructured textual data: text that is not pre-defined or organized in any way, such as social media feeds.

Another difference is how data mining and text mining approach analytics. Neither is a single technology; instead, each uses a broad range of functions to transform available data into valuable insights and knowledge.

On one hand, data mining combines disciplines including statistics, artificial intelligence and machine learning, and applies them directly to structured data. Some of the data modeling functions it uses are listed below:

  • Association – Determines how likely one occurrence is in relation to another occurrence over time. For example, in sales transactions the association function can uncover purchase patterns such as customers buying milk when they buy cereal.
  • Classification – Reveals patterns used to predict the class into which data will fall. For example, predicting whether it will be sunny or cloudy based on weather conditions.
  • Clustering – Organizes data by identifying similarities and grouping it into clusters to reveal new facts about that data. For example, market segmentation is one of its applications.
  • Regression – Predicts a numeric value based on the variables in a given data set. For example, the price of a used car given its mileage and other conditions.
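The milk-and-cereal association example can be made concrete with two common measures, support and confidence. The following is a minimal sketch rather than a production approach: the baskets are invented for illustration, and the `support` and `confidence` helpers are hypothetical names, not part of any particular product.

```python
# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"milk", "cereal", "bread"},
    {"milk", "cereal"},
    {"cereal", "bananas"},
    {"milk", "bread"},
    {"milk", "cereal", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated share of baskets with the antecedent that also hold the consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)


cereal_support = support({"cereal"}, transactions)
rule_confidence = confidence({"cereal"}, {"milk"}, transactions)

print(f"support(cereal) = {cereal_support:.2f}")
print(f"confidence(cereal -> milk) = {rule_confidence:.2f}")
```

In this toy data, four of the five baskets contain cereal and three of those four also contain milk, so the rule "cereal implies milk" has a confidence of about 0.75. Real association mining would run over far larger transaction sets and typically filter rules by minimum support and confidence thresholds.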

Analytics and business intelligence platforms can quickly identify and retrieve information from large sets of structured data and apply these data mining functions to create models that enable descriptive, predictive and prescriptive analytics.

On the other hand, text mining requires an extra step while maintaining the same analytic goal as data mining. Text mining deals with unstructured data so, before any data modeling or pattern recognition function can be applied, the unstructured data has to be organized and structured in a way that allows for data modeling and analytics to occur.

This requires sophisticated statistical and linguistic techniques that can analyze a wide range of unstructured textual data formats and enrich each document with metadata, such as author, date and content summary. This process is typically linked to an AI technique called Natural Language Processing (NLP), which allows the system to understand the meaning in human language.

The metadata can be considered the key element in structuring this type of data. Once the data has been meta-tagged and defined, it can be translated into a machine-readable format that can be used for analysis.
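As a rough illustration of this structuring step, the sketch below turns a piece of free text into a record with metadata fields that a data mining function could then consume. It is only a sketch under stated assumptions: the `structure_document` function, the stop-word list, and the sample text are all hypothetical, and simple word counting stands in for the far more sophisticated NLP a real text mining system would apply.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "but", "it", "my"}


def structure_document(raw_text, author, date):
    """Turn free text into a structured record with metadata and keywords."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", raw_text.lower())
    # Drop common words that carry little meaning on their own.
    terms = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(terms)
    return {
        "author": author,
        "date": date,
        "word_count": len(tokens),
        # The most frequent remaining terms act as a crude content summary.
        "keywords": [term for term, _ in counts.most_common(3)],
    }


record = structure_document(
    "The delivery was late, and the delivery box was damaged.",
    author="customer_123",
    date="2020-01-15",
)
print(record)
```

Once every document is reduced to a record like this, the collection behaves like a structured data set: it can be filtered by date, grouped by author, or fed into the classification and clustering functions described earlier.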

The benefits of data and text mining

As data mining works on the structured data within the organization, it is particularly suited to deliver a wide range of operational and business benefits. For example, it can organize and analyze data from IoT systems to enable the predictive maintenance of factory equipment or it can combine historical sales data with customer behaviors to predict future sales and patterns of demand.

Text mining can take this a stage further by synthesizing vast amounts of content into easily understood information, allowing you to understand what people are actually saying about your organization. Sentiment analysis has become a major business use case for text mining, as it uncovers the opinions and concerns of customers and partners by tracking and analyzing social content.
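To give a feel for how sentiment analysis works at its simplest, here is a minimal lexicon-based sketch. The word lists, the `sentiment_score` and `label` functions, and the sample posts are all hypothetical; real sentiment analysis generally relies on trained NLP models that handle negation, sarcasm and context, none of which this example attempts.

```python
import re

# Hypothetical sentiment lexicon, invented for illustration only.
POSITIVE = {"great", "love", "fast", "helpful", "excellent"}
NEGATIVE = {"slow", "broken", "late", "terrible", "rude"}


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical social media posts.
posts = [
    "Love the new app, support was fast and helpful!",
    "Delivery was late and the box arrived broken.",
    "Order arrived on Tuesday.",
]

labels = [label(sentiment_score(post)) for post in posts]
for post, sentiment in zip(posts, labels):
    print(f"{sentiment:>8}: {post}")
```

Aggregating labels like these over thousands of posts is what lets a business track how opinion about a product or service shifts over time.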

Comparing data mining and text mining

The following table outlines differences between data mining and text mining.

| | Data mining | Text mining |
| --- | --- | --- |
| Overview | A range of functions to search for patterns and relationships in structured data. | A range of functions to turn unstructured textual data into structured information to enable data analysis. |
| Data type | Structured data from large datasets found in systems such as databases, spreadsheets, ERP, CRM and accounting applications. | Unstructured textual data found in emails, documents, presentations, videos, file shares, social media and the Internet. |
| Data retrieval | Structured data is homogeneous and organized, making it easy to retrieve. | Unstructured textual data comes in many different formats and content types located in a more diverse range of applications and systems. |
| Data preparation | Structured data is formal and formatted, facilitating the process of ingesting data into analytical models. | Linguistic and statistical techniques, including NLP keywording and meta-tagging, must be applied to turn unstructured text into usable structured data. |
| Need for taxonomy | There is no need to create an overriding taxonomy for data mining. | As the unstructured text comes in many different forms and formats, there needs to be an overriding taxonomy for the data so that it can be organized into a common framework. |

Until recently, data mining was the dominant approach within most companies as they had greater control over their structured data. However, things are changing rapidly. Data volumes are exploding and most of this data is unstructured. Organizations know that they must be able to use text mining if they are to release the value locked in content and unstructured communications.

The new world of big data means that most enterprises are looking to combine both structured and unstructured data to deliver greater visibility and richer insights into their business and operations. Today, you need to incorporate both data and text mining if you’re to move towards true data-driven decision-making.

To find out more about data mining, text mining or other AI and analytics solutions from OpenText, visit our website.

Editor’s note: This is an installment in our “AI Glossary” series of blog posts, offering guidance on key areas of artificial intelligence and analytics. Look for future posts in this series over the months to come.

Zachary Jarvinen

Zachary is the Product Marketing Lead for Analytics and Artificial Intelligence at OpenText. He previously worked at Global Fortune 500 Epson and the U.S. State Department, and was part of the 2008 Obama Campaign Digital Team. Zachary speaks fluent Spanish and Portuguese, and holds an MBA/MSc from UCLA and the London School of Economics.
