Auto-Classifying data with Magellan and beyond: Preparing for a project


Michael Gagnon

April 9, 2019 · 6 minute read

The Forgotten Step to Ensure AI Project Success: Quality Data

In previous posts in the series, we’ve discussed the advantages brought by Auto-Classification projects and the different types of solutions and Machine Learning (ML) algorithms available from OpenText™ for undertaking such projects. In this installment, we discuss the preparation phase.

Choosing taxonomies

Our first step is to think about our taxonomy, i.e. our classification scheme. We have already answered some important questions that drove our choice of classification tool: What is the nature of the content we wish to classify? Texts? Images? Architectural plans? Tweets? User comments?

Now we get to the nitty-gritty. Are we using a pre-established taxonomy, or are we creating a new one specifically to train with ML? Not all taxonomies are ideal for ML training; some can’t be trained at all. The distinctions between your classifications must be of a kind your ML algorithms or tools can actually detect. A taxonomy must provide a good arrangement and description of the content it seeks to classify, and it must be usable for its intended purpose.

Let’s say you want to classify documents according to their textual content. Then what you need is a taxonomy that relies on the topics of documents rather than on their type, loosely defined. This kind of taxonomy might be very different from the one you use for your file plan, for instance, which could rely on document types.

In your file plan, you might have separate classifications such as contract or budget for each department; but in your ML classification, you might have only one contract classification covering all departments. This is because the contracts from department 1 might be similar to those of department 2, and so on, meaning that to your ML algorithms they look the same. It makes sense: sometimes the contracts are between one department and the other, so the same document belongs to both groups; sometimes they concern the same projects in different phases. And sometimes, if you didn’t know that Karen was in Finance, you wouldn’t know to classify a contract accordingly. For auto-classification with ML, the separate contract classifications should be combined into a single one.
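
To make this concrete, here is a minimal sketch in Python of how such a merge might look in practice. The mapping and label names are hypothetical, purely for illustration:

```python
# Hypothetical mapping from fine-grained, per-department file-plan
# classifications to the coarser labels used for ML training.
FILE_PLAN_TO_ML_LABEL = {
    "Finance/Contract": "Contract",
    "HR/Contract": "Contract",
    "Legal/Contract": "Contract",
    "Finance/Budget": "Budget",
    "HR/Budget": "Budget",
}

def ml_label(file_plan_class: str) -> str:
    """Collapse a file-plan classification into its ML training label."""
    return FILE_PLAN_TO_ML_LABEL.get(file_plan_class, file_plan_class)

print(ml_label("HR/Contract"))  # -> "Contract"
```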

The lesson here is that we need some flexibility regarding the structure of our taxonomies. If we were previously relying on a manual classification process, with employees making judgments that could vary from case to case, our tool needs access to the features those judgments were based on. Does the algorithm know Karen from Finance? Is there a way it could be made aware of her? If not, this distinction is of no use to our tool, and we must rethink the taxonomy. Because this can be a complex task, OpenText offers taxonomy training to help guide you.

Gathering your corpora

But let’s say we have developed a taxonomy that’s well-suited for ML training. Now what?

If you’re using unsupervised methods, it should suffice to gather the content to analyze in a folder your tool can access and to wrangle it into a format your tool can process. Chances are your tool comes equipped with wrangling or conversion workflows to help speed things up.

If you’re using supervised methods, you need to prepare several sets of documents to train and improve your ML models: at minimum, a training set, a test set and a blind set for your final tests. Again, we need to ask why each document belongs to a classification. Once we understand the relevant distinctions, we can build our corpora from good examples of documents for each classification. We recommend a minimum of 100 exemplars (good, representative documents/images/emails) per classification for training and testing. Of course, this is a baseline; more or fewer exemplars might be required depending on the complexity of your classification task and your tool.
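
As a rough illustration of the mechanics, and not of any OpenText tool, here is a Python sketch using scikit-learn that loads a labeled corpus from a hypothetical folder layout (one sub-folder per classification), flags classifications below the 100-exemplar baseline, and produces a stratified training/test split:

```python
from collections import Counter
from pathlib import Path
from sklearn.model_selection import train_test_split

# Hypothetical layout: one folder per classification, e.g.
#   corpus/Contract/*.txt, corpus/Budget/*.txt, ...
docs, labels = [], []
for path in Path("corpus").glob("*/*.txt"):
    docs.append(path.read_text(errors="ignore"))
    labels.append(path.parent.name)  # folder name = classification

# Flag classifications below the recommended ~100-exemplar baseline.
for cls, n in sorted(Counter(labels).items()):
    if n < 100:
        print(f"Warning: only {n} exemplars for '{cls}'")

# A stratified split preserves each classification's share in both sets;
# the blind set is collected separately and left unlabeled.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=42
)
```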

In other words, if you have 25 classifications, you are looking at 2,500 carefully selected exemplars (25 × 100). This means users who clearly understand your content and taxonomy will need to help with collection efforts ahead of training.

Auto-classification tools often reveal unsuspected aspects of the taxonomy, such as features shared among classifications. For instance, the category “training” turns out to include “training about security”. Overlap between classifications tends to happen when a taxonomy is very granular. In such cases, we can merge the overlapping classifications, or try to address the overlap by other means, such as conditional rules. Generally, though, we want to keep our classification definitions as free of overlap as possible.
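
One way to spot this kind of overlap early, sketched below rather than taken from any particular product, is to compare per-classification TF-IDF centroids: pairs whose centroids are very similar are candidates for merging or for conditional rules. The 0.8 threshold is an arbitrary starting point:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_overlaps(docs, labels, threshold=0.8):
    """Flag pairs of classifications whose TF-IDF centroids nearly coincide."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    classes = sorted(set(labels))
    # Average the TF-IDF vectors of each classification's documents.
    centroids = np.asarray(np.vstack([
        tfidf[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
        for c in classes
    ]))
    sims = cosine_similarity(centroids)
    return [(classes[i], classes[j], round(float(sims[i, j]), 3))
            for i in range(len(classes))
            for j in range(i + 1, len(classes))
            if sims[i, j] >= threshold]
```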

It’s also important to allow some flexibility in terms of time: sometimes certain exemplars turn out to be inappropriate for the task and need to be changed or processed in some way. Say a set of documents collected for training and optimization actually requires optical character recognition (OCR): we then need to process those documents with, say, OpenText™ Captiva™ before running them through OpenText™ Magellan™ Text Mining or another classification tool. This adds time and effort to the project.
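
A simple pre-flight check can catch this early. The sketch below uses the open-source pypdf library (not Captiva) to flag PDFs that yield little or no extractable text and are therefore likely scans needing OCR; the folder name and threshold are placeholders:

```python
from pathlib import Path
from pypdf import PdfReader  # pip install pypdf

def needs_ocr(pdf_path: Path, min_chars_per_page: int = 50) -> bool:
    """Heuristic: a PDF with almost no extractable text is probably a scan."""
    reader = PdfReader(pdf_path)
    chars = sum(len(page.extract_text() or "") for page in reader.pages)
    return chars < min_chars_per_page * len(reader.pages)

scans = [p for p in Path("corpus").rglob("*.pdf") if needs_ocr(p)]
print(f"{len(scans)} documents flagged for OCR preprocessing")
```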

Sometimes, a person collecting exemplars didn’t fully think through their taxonomy scheme. It’s possible that some of their selected exemplars will have to be replaced.

This happens especially when users are participating in such initiatives for the first time. For this reason, it’s important to familiarize users with the classification tools as early as possible. In many cases, the tools can provide ways to identify good exemplars. For instance, OpenText™ Auto-Classification allows you to create a set of candidate exemplars that can be compared to others in the model, and then added as you see fit.
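
The underlying idea can be illustrated with a small sketch, not the product’s actual interface: score a candidate by its average similarity to the exemplars already accepted for a classification, and review the low scorers before adding them:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_candidate(candidate: str, accepted_exemplars: list[str]) -> float:
    """Mean cosine similarity of a candidate to a classification's exemplars."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(
        accepted_exemplars + [candidate]
    )
    sims = cosine_similarity(matrix[-1], matrix[:-1])
    return float(sims.mean())

# A high score suggests a representative exemplar;
# a low score warrants a closer look before it joins the training set.
```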

Once satisfied with the training and optimization of our model, we turn to the blind test set: a set representative of your ultimate classification task (at least a few hundred, or a few thousand, test documents). This set doesn’t need to be pre-classified; it’s there to verify the model’s performance on real content and is therefore much easier to collect than the other sets. It is nonetheless important, and it generally leads to re-calibration and optimization of the model: expanding the model’s coverage, removing noise, creating new rules, perhaps reviewing the classification scheme, merging and splitting classifications. It’s important not to underestimate this part of the process. OpenText offers different workshops to guide you through this sometimes complex phase.
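
To make that verification step concrete, here is a hedged sketch of training a simple scikit-learn pipeline on the sets prepared earlier and surfacing the blind-set predictions the model is least confident about, which are usually the ones that drive re-calibration. The model choice and folder name are illustrative only:

```python
import numpy as np
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# train_docs/train_labels come from the split sketch above;
# the blind set is just a folder of raw, unlabeled text files.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

blind_docs = [p.read_text(errors="ignore")
              for p in Path("blind_set").glob("*.txt")]

probs = model.predict_proba(blind_docs)
confidence = probs.max(axis=1)
predicted = model.classes_[probs.argmax(axis=1)]

# Review the least confident predictions first; they often point to
# missing coverage, noisy exemplars, or overlapping classifications.
for i in np.argsort(confidence)[:20]:
    print(f"{confidence[i]:.2f}  {predicted[i]}  {blind_docs[i][:60]!r}")
```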

Next time, we’ll discuss the aftermath of a project, the maintenance of your new auto-classification tool or pipeline, and how it affects your daily business.

Learn more by contacting OpenText’s AI & Analytics Services Practice.


This blog is part of a series about auto-classifying data with OpenText Magellan. Other blogs in the series include:

- Auto-Classifying data with Magellan and beyond
- Auto-Classifying data with Magellan and beyond: What approaches are there?

For more information on how AI technology is transforming enterprise content, read the IDC Industry Brief.


Michael Gagnon

Michael completed his PhD in Linguistics in 2013 at the University of Maryland, with specializations in Syntax, Semantics and Pragmatics. Since then he has been pursuing a variety of challenges as a university professor, a Natural Language Understanding engineer for voice applications, and, nowadays, as a business analyst and computational linguist working with Magellan Text-Mining at OpenText.
