Auto-Classifying data with Magellan and beyond

In today’s digital world, the ability to establish defensible and transparent processes to auto-classify digital content is of growing importance to the intelligent and connected enterprise. As data stores grow, so does the burden of classifying them. The challenge is not just to classify digital content quickly, accurately, and cost-effectively, but also to make these processes legally defensible and transparent if litigation ever rears its ugly head. This challenge raises many questions about auto-classification: How does it benefit your bottom line? Can all your content be classified automatically? What auto-classification tools are there, and how can you use them efficiently? What work is required up front to set up an auto-classification pipeline? What kind of work is required in the long run?

In this blog series, I explore these questions about auto-classification and show you the many ways that OpenText™ can help. In this first installment, I share some insights about the benefits of auto-classification, as well as how to figure out what you want to classify.

Why Auto-Classification?

There are two main reasons why today’s businesses turn to auto-classification:

1. Simplifying record keeping and content classification, and saving on time and related costs.

While this is a very compelling reason to move towards auto-classification, it’s important to know that such solutions come at a cost, both in terms of money and effort. Although auto-classification has many long-term benefits, including reducing your employees’ workload when it comes to classifying your content, setting it up can require a substantial upfront investment in time and effort from external or in-house experts. It’s important to weigh your expected return on investment against the initial cost.

2. Mitigating legal risks when it comes to litigation, while ensuring compliance with respect to a retention policy (if applicable).

Enforcing and maintaining a transparent and defensible retention policy can impose constraints that must be clear throughout the development of classification models. It is crucial to know the exact type of content that is to be classified (emails, images, text files, etc.) and to choose the right tools for the desired outcome: retaining only the documents important to your business while disposing of legacy, unnecessary, or potentially harmful content.

This disposal leads to two immediate benefits: savings on storage and the ability to combine the resulting classification structures with other semantic metadata extracted from the content to build more robust search and AI solutions. (I’ll go into this more in a future blog.)

What do you want to classify?

There are a variety of auto-classification tools on the market, so it’s important to understand what you want to classify and how. Thinking carefully about the nature of all the content you wish to classify, and ensuring you consider the right tools for the task(s), is always the first step.

Ask yourself: when you look at a document, how do you know which category it belongs to? Is it a document containing text (contract, email, financial report)? Is it a plan or CAD file identified by a specific symbol? Is it a photo or a logo? If the textual content is what matters, then you should choose a tool that can train classification models on that text. If you want to classify documents found in your content repositories, such as OpenText™ Content Server, or in your employees’ mailboxes, and you want a good user interface and dashboards that clearly show the performance of your classification models, then a tool like OpenText™ Auto-Classification (OTAC) might be what you’re looking for.

If you want to auto-classify your content and combine the results with other metadata (system or semantic) to build more robust search and AI solutions beyond an auto-classification pipeline, then OpenText™ Magellan™ Text Mining can help. For a quick demo of Magellan Text Mining, simply paste part of your favorite news article here.

If you want to manage usage rights for images or clips, manage your branding and branding history, or resolve discovery challenges relating to your digital content, then a digital asset management platform like OpenText™ Media Management (OTMM) is what you’re looking for.
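The questions above amount to a first-pass triage of your content by the feature you would classify it on. Here is a minimal Python sketch of that triage, assuming the file extension is a good-enough first signal. The extension groups and the `classification_basis` function are hypothetical illustrations, not part of any OpenText product.

```python
from pathlib import Path

# Hypothetical extension groups; adjust them to your own repositories.
TEXT_EXTENSIONS = {".txt", ".docx", ".msg", ".eml", ".html"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tiff"}
DRAWING_EXTENSIONS = {".dwg", ".dxf"}


def classification_basis(filename: str) -> str:
    """Guess which feature of a file a classifier would have to work from."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        return "text"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in DRAWING_EXTENSIONS:
        return "drawing"
    if suffix == ".pdf":
        # A PDF may hold real text or only scanned page images.
        return "pdf (check for a text layer)"
    return "unknown"
```

Counting how many files fall into each group gives a rough picture of which kinds of tools your collection calls for before you evaluate any of them.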

After you’ve identified the basis for your classification, it’s important to verify that this feature of your content is readily accessible. Do you have image PDFs in your collection, reflecting a legacy of documents scanned over the years? These documents will likely need to be processed with optical character recognition (OCR) software before you classify them with a tool made for textual content. If you’re trying to classify invoices and receipts captured by employees on the go, then you might be looking for solutions like OpenText™ Captiva™.
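One simple way to spot image PDFs is to check how much text can actually be extracted from each page. The sketch below assumes you have already pulled the text of each page with whatever PDF library you use. The `needs_ocr` function and its default threshold of 20 characters are hypothetical and should be tuned on your own documents.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Return True when the extracted text is too sparse to classify on."""
    if not page_texts:
        return True  # nothing could be extracted at all
    # Count non-whitespace characters so blank padding is not mistaken for text.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

Documents flagged this way can be sent through OCR first, while the rest go straight to the text classifier.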

Now that we’ve identified our content to be classified and chosen our tools based on our classification criteria, how do we classify our content? Join us next time when we dive into this question and more!

Learn more by contacting OpenText’s AI & Analytics Services Practice.

This blog is part of a series about auto-classifying data with OpenText Magellan and beyond.

For more information on how AI technology is transforming enterprise content, read the IDC Industry Brief.

Michael Gagnon

Michael completed his PhD in Linguistics in 2013 at the University of Maryland, with specializations in Syntax, Semantics and Pragmatics. Since then, he has taken on a variety of challenges as a university professor, a Natural Language Understanding engineer for voice applications, and now as a business analyst and computational linguist working with Magellan Text Mining at OpenText.
