Auto-Classifying data with Magellan and beyond: What approaches are there?

Michael Gagnon

March 7, 2019 • 5 minute read

In the first installment of this series, we considered the reasons why you might want to turn to auto-classification to help manage your data. This time around, we want to have a closer look at the Machine Learning (ML) approaches available and the OpenText™ solutions that integrate them.

Today, ML algorithms are embedded in more and more of the applications we use every day, notably search engines, social media applications and, of course, Enterprise Information Management (EIM) software. But what is ML? Machine learning draws on many fields of research (artificial intelligence, probability and statistics, philosophy, psychology, cognitive science and many others) to devise programs that learn from experience in order to improve their performance at specific tasks. In the words of Tom Mitchell: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

At the outset, let’s distinguish two different types of ML methods: unsupervised and supervised.

Unsupervised methods are ML approaches or algorithms for which we do not know ahead of time what kinds of results to expect. We know what data goes in, but not what results should come out. Clustering and association algorithms fall into this category. Clustering finds natural groupings in the data (e.g. which documents are similar in some respect, such as their textual content), whereas association surfaces correlations found in the data, which can help answer questions about customer churn, for instance.
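
To make the idea of clustering by textual content concrete, here is a minimal sketch in plain Python. It assumes a simple bag-of-words representation, cosine similarity, and a greedy single-pass grouping with a hypothetical threshold of 0.3. These choices are for illustration only and do not describe how any OpenText product works.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a text into a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: a document joins the first cluster
    whose first member is similar enough, otherwise it starts a new one.
    The threshold is an illustrative assumption, not a recommended value."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy documents; no labels are supplied anywhere.
docs = [
    "contract renewal terms and payment schedule",
    "payment schedule for the contract renewal",
    "team lunch on friday at noon",
    "friday team lunch moved to noon",
]
groups = cluster(docs)
print(groups)
```

Note that no categories are given to the program: the two groups (contract documents and lunch emails) emerge from word overlap alone, which is what makes the method unsupervised.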

If you’re looking to gather insight about a set of legal documents, or emails, for an ongoing litigation, and you need to do so quickly, then a tool that uses unsupervised methods such as OpenText™ Axcelerate™ might be appropriate.

By contrast, supervised methods of classification are approaches where the machine learning models or algorithms are trained on known data before being used on your target content. We can do this when we have some idea of the types of results we’re looking for. Supervised methods let us guide the learning process more carefully: we select the examples, or exemplars, used to train the classification model, continue training until we are satisfied with the results, and then apply the model to real-world data. If you wish to build custom classification models specific to your organization’s needs, supervised methods are likely the best choice for you.

Since it requires human intervention both in selecting the training data and in optimizing the model to meet our desired outcome, the supervised learning approach tends to require substantial involvement and effort in the early phases of the project. However, it’s a small price to pay for the long-term benefits of an auto-classification pipeline that automatically classifies your content and reduces storage and manual classification costs.
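
The supervised workflow described above (select labeled exemplars, train a model, then apply it to new content) can be sketched with a tiny multinomial Naive Bayes classifier. Naive Bayes is chosen here only because it is compact enough to write out in full; the labels "finance" and "hr" and the training texts are hypothetical, and this is not a description of the algorithms inside any particular product.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Build per-label word counts from (text, label) exemplars."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the most probable label under multinomial Naive Bayes
    with add-one smoothing; words unseen in training are skipped."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hand-selected exemplars: this is the "supervision".
training = [
    ("invoice payment due amount", "finance"),
    ("quarterly budget payment report", "finance"),
    ("employee vacation leave request", "hr"),
    ("new employee onboarding benefits", "hr"),
]
model = train(training)

print(classify("payment amount overdue", model))
print(classify("vacation request for employee", model))
```

In a real project, the loop of inspecting misclassified documents, adjusting the exemplars, and retraining is where most of the early effort goes.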

For any classification task, we seek to classify documents or content according to some specific features. These features are the first thing that will inform our choice of machine learning tool.

What you know (and don’t know) about your content ahead of time will have an effect on the tools you choose. For instance, if you are trying to classify emails based on a retention schedule, and have content of varying lengths, then you could use a tool like OpenText™ Auto-Classification (OTAC), which combines supervised classification methods with unsupervised methods.

Imagine that you are trying to classify textual content where the individual files are not short like Tweets or emails, but much longer, such as full-length news articles, user reviews, long contracts, or even research papers. Beyond individual document length, the overall volume of content to process is also very large. In this case, a tool like OpenText™ Magellan™ Text Mining is ideal. This specialized solution uses a combination of Natural Language Processing algorithms, a Part of Speech (POS) tagger and supervised methods. This allows us to zoom in even more on the features of the text that truly interest us, the concepts, and ignore others that are perhaps predictable, such as certain functional terms (determiners, demonstratives, etc.) or verbs.
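
The idea of keeping concept-bearing terms and ignoring predictable ones can be illustrated with a toy filter. A real POS tagger assigns categories from context; the sketch below substitutes a small, hand-picked lookup set (the hypothetical IGNORED_WORDS) for the tagger, so it shows only the filtering step, not actual tagging and not Magellan's implementation.

```python
from collections import Counter

# Hypothetical, hand-picked stand-in for POS tags: determiners,
# demonstratives, prepositions, conjunctions and a few copular verb forms.
IGNORED_WORDS = {
    "the", "a", "an", "this", "that", "these", "those",
    "of", "in", "on", "for", "to", "and", "or", "is", "are", "was",
}

def concept_terms(text):
    """Lowercase the text, strip surrounding punctuation from each token,
    and drop every token found in IGNORED_WORDS."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in IGNORED_WORDS]

review = "The battery of this laptop is excellent, and the screen is bright."
terms = concept_terms(review)
term_counts = Counter(terms)  # candidate features for a classifier
print(terms)
```

The surviving terms are the kind of features a supervised model could then be trained on, instead of raw word counts dominated by frequent function words.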

Perhaps you want to classify plans and CAD files based on symbols found in the files. The symbols are arbitrary ones that were selected years ago, long before anyone thought about automating the classification of these assets, and there may be thousands upon thousands of legacy image files. The symbols are fairly consistent in shape, though. In this case, you might want to use a tool like OpenText™ Captiva™, which lets you use supervised methods to train classification models on the symbols that form the basis of your classification. In other words, we would be training and optimizing OpenText Captiva to recognize the specific types of symbols related to our classifications.
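
As a rough illustration of classifying by consistent symbol shape, the sketch below represents each symbol as a tiny 3x3 black-and-white bitmap and assigns a new bitmap the label of its nearest labeled exemplar. The bitmaps, the labels "electrical" and "plumbing", and the nearest-exemplar rule are all assumptions made for this example; production image classification is considerably more involved.

```python
def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )

def classify_symbol(bitmap, exemplars):
    """Return the label of the exemplar closest to the bitmap."""
    return min(exemplars, key=lambda item: hamming(bitmap, item[0]))[1]

# Labeled exemplars: each string is one row of pixels ("1" = ink).
CROSS = ["010", "111", "010"]
SQUARE = ["111", "101", "111"]
exemplars = [(CROSS, "electrical"), (SQUARE, "plumbing")]

# A scanned cross with one noisy pixel in the top-left corner.
noisy = ["110", "111", "010"]
label = classify_symbol(noisy, exemplars)
print(label)
```

Because the symbols are fairly consistent in shape, even a noisy scan stays closer to its own exemplar than to the others, which is what makes this kind of content a good fit for supervised training.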

Once you have chosen the tools you need based on your classification features, and have a good classification scheme, it is time to set up your auto-classification pipeline. In our next installment, we turn to the preparation required ahead of your new auto-classification project.

Learn more by contacting OpenText’s AI & Analytics Services Practice.

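This blog is part of a series about auto-classifying data with OpenText Magellan and beyond. Other blogs in the series include “Auto-Classifying data with Magellan and beyond” and “Auto-Classifying data with Magellan and beyond: Preparing for a project”.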

For more information on how AI technology is transforming enterprise content, read the IDC Industry Brief.

Michael Gagnon

Michael completed his PhD in Linguistics in 2013 at the University of Maryland, with specializations in Syntax, Semantics and Pragmatics. Since then he has been pursuing a variety of challenges as a university professor, a Natural Language Understanding engineer for voice applications, and, nowadays, as a business analyst and computational linguist working with Magellan Text-Mining at OpenText.
