We’re still over a year away from the General Data Protection Regulation’s (GDPR) “go live” date, but the sense of dread at recent conferences is tangible. And understandably so: the GDPR imposes sweeping requirements on organizations to understand and protect the personal data they process and use. While records management and infosec have so far dominated the GDPR discussion, your lawyers and compliance teams are also gearing up with discovery analytics, including machine learning, to help them manage GDPR risk.
The New Cost of Personal Data
The GDPR introduces a slew of information governance (IG) regulations that attach to Personally Identifiable Information (PII), which the regulation calls “personal data” and defines as any information relating to an identified or identifiable individual. If that sounds broad, that’s because it is. Your name, your pictures, your email, your IP address: really anything that could be used to identify you is included. The GDPR creates personal rights in this data, like the right to be forgotten, the right to access your data, the right to correct it, and the right to transfer it. It also includes enhanced data breach notification and response obligations.
Basically, if your organization touches consumer data in some fashion, you’re likely covered by the GDPR. And if your organization’s products or services regularly involve personal data, security takes on even more prominence. Failure to comply with the GDPR could incur fines of up to 20 million euros or 4% of global annual turnover, whichever is higher.
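To make the size of that ceiling concrete, here is a minimal sketch of the arithmetic. It assumes the maximum fine is the greater of the two figures; the function name and the sample turnover values are made up for illustration, and actual fines are set case by case by regulators.

```python
def max_gdpr_fine_eur(global_annual_turnover_eur: int) -> float:
    """Ceiling for the top fine tier: the greater of EUR 20 million or 4% of turnover."""
    flat_cap_eur = 20_000_000
    turnover_cap_eur = global_annual_turnover_eur * 4 / 100
    return max(flat_cap_eur, turnover_cap_eur)


# Hypothetical companies: one with EUR 100 million turnover, one with EUR 1 billion.
for turnover in (100_000_000, 1_000_000_000):
    print(turnover, max_gdpr_fine_eur(turnover))
```

For the smaller company the flat cap dominates; for the larger one the turnover-based cap does.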
The dramatic penalties have spurred organizations to conduct Privacy Impact Assessments (PIAs) and proactively audit their own data to measure risk and exposure. Understanding how and where you handle personal data is the first challenge, and a significant one: PII can be embedded in nearly all your business documents, and some of those documents matter far more than others.
Finding a Needle in a Stack of Needles
If a basic component of GDPR is understanding your data, then naturally you need tools to search, identify, categorize, and flag documents. The traditional method of manually reviewing contracts one by one for language about PII treatment, processing, or warehousing is unreliable and inefficient. During a breach response or a PII assessment, triage is key; you need to rapidly identify the most sensitive documents and tag them for special handling (more on that later). To do so, you need discovery analytics and machine learning.
Pattern identification is a crucial technology for rapidly identifying simple documents containing standardized PII like credit cards, licenses, medical records, and more. But this technology on its own won’t identify all the documents necessary for a PIA, because not all PII follows a pattern; much of it is highly contextual.
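As a rough illustration of pattern identification, the sketch below flags likely credit card numbers. It is not how any particular product works: the regular expression, the function names, and the choice to confirm candidates with the Luhn checksum are all assumptions made for this example.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings in the text that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


sample = "Card 4111 1111 1111 1111 on file, ref 1234 5678 9012 3456."
print(find_card_numbers(sample))
```

The checksum step filters out digit runs such as reference numbers that merely look like card numbers. It also shows the limit described above: a sentence that describes someone's health or finances without any structured number never matches a pattern like this.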
That’s where concept analysis, an unsupervised machine learning algorithm, comes in. This technology analyzes the co-occurrence of words and clusters documents together according to contextual themes, even if they lack specific keywords, and without any human feedback. These tools can, with astounding accuracy, distinguish between different contexts that influence how we interpret words. For instance, if the word “private” appears in a number of documents related to military ranks, the engine could group those documents apart from ones that feature the word “private” in relation to personal data.
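The “private” example can be sketched in a few lines. This is a deliberately simplified stand-in for concept analysis, not the algorithm a commercial engine uses: it represents each document as a bag of words and greedily groups documents whose vocabularies overlap, with no labels supplied. The stopword list, the 0.3 similarity threshold, and the sample sentences are assumptions chosen for the illustration.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was",
             "by", "as", "such", "under"}


def vectorize(text):
    """Turn a document into word counts, ignoring common stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy clustering: join the most similar cluster or start a new one."""
    clusters = []  # each entry: {"centroid": Counter, "members": [doc indexes]}
    for i, text in enumerate(docs):
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # fold the document into the cluster profile
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


docs = [
    "The private was promoted to corporal by the army sergeant.",
    "Keep customer email addresses private under the data privacy policy.",
    "The army sergeant ordered the private and the corporal to march.",
    "The privacy policy covers private customer data such as email addresses.",
]
print(cluster(docs))
```

All four documents contain “private,” yet the surrounding words pull the military documents into one group and the data-privacy documents into another.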
These automated tools can get you started on a privacy evaluation, but the ultimate analysis is too nuanced to rely exclusively on machine categorization. Human review is an indispensable element, so having document review workflows and administration tools is necessary. This means the ability to batch out documents in related groups to keep legal reviewers engaged with relevant content. And with a continuous machine learning algorithm running in the background, each decision that your legal team makes while eyeballing documents will train a recommendation engine. This algorithm can then evaluate the remaining documents and predict which ones are likely to contain sensitive data (much more on that interesting topic here).
In this way, you can start with a known dataset (like your vendor contracts database) and then leverage analytics to identify unknown, risk-prone documents. As you review more documents and find more PII-laden content, the algorithm is constantly learning in the background. It conducts broad sweeps of your remaining data to prioritize batches of content that are likely to contain PII. What’s more, these algorithms can run on an issue-specific basis—a crucial ability since the GDPR distinguishes between “personal data” and “sensitive personal data.”
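The review-and-learn loop described above can be sketched as follows. This is a toy model under stated assumptions, not the recommendation engine of any real product: the `ReviewPrioritizer` class, its naive-Bayes-style word scoring, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter


class ReviewPrioritizer:
    """Learns from reviewer decisions and ranks unreviewed documents."""

    def __init__(self):
        # Per label: how many reviewed documents contained each word.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    @staticmethod
    def _tokens(text):
        return set(re.findall(r"[a-z]+", text.lower()))

    def record_decision(self, text, contains_pii):
        """Update the model with one reviewer decision."""
        self.word_counts[contains_pii].update(self._tokens(text))
        self.doc_counts[contains_pii] += 1

    def score(self, text):
        """Sum of smoothed log-odds that each word signals a PII document."""
        pos_n = self.doc_counts[True]
        neg_n = self.doc_counts[False]
        total = 0.0
        for w in self._tokens(text):
            p = (self.word_counts[True][w] + 1) / (pos_n + 2)
            q = (self.word_counts[False][w] + 1) / (neg_n + 2)
            total += math.log(p / q)
        return total

    def rank(self, unreviewed):
        """Return unreviewed documents, most likely PII first."""
        return sorted(unreviewed, key=self.score, reverse=True)


prioritizer = ReviewPrioritizer()
prioritizer.record_decision("customer passport number and home address", True)
prioritizer.record_decision("employee date of birth and passport scan", True)
prioritizer.record_decision("quarterly revenue forecast and budget", False)
prioritizer.record_decision("office lunch menu and parking schedule", False)

queue = [
    "parking budget for the office",
    "passport and home address of new customer",
    "meeting agenda",
]
for doc in prioritizer.rank(queue):
    print(doc)
```

Each call to `record_decision` shifts the scores, so re-ranking the queue after every batch of review keeps the most likely PII-bearing documents at the front. Training one prioritizer per issue is one way to handle issue-specific categories separately.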
Knowing is Half the Battle
The broader impact of GDPR will shake out over years; it’s still unclear how individuals will exercise their rights or how data protection authorities (DPAs) will enforce the rules. But organizations can take steps today towards understanding their risk exposure and doing what they can to mitigate potential consequences.
OpenText™ Discovery combines tools like machine learning, pattern identification, and entity extraction with data visualizations, keywords, and metadata filters to help legal and compliance teams identify any PII-carrying data. All of this is guided by a document review workflow that has been honed over years of litigation projects and layered security.
Learn more about GDPR readiness by watching our analyst webinar here.