Before we start looking at classification techniques, a few concepts need to be defined.
The information we rely on to classify content automatically comes from three sources within a single document.
Format – The encoding that lets a specific application open and work with the file. Format can also include any structure that affects the document's appearance, such as forms or templates. A Word document and a PDF may be the exact same document and contain all the same words, but they are different formats and possibly different record categories. Conversely, two template-based documents may have zero words in common yet belong to the same record category because they share a template.
Content – The human-readable part of a document file. This includes the words, punctuation, and text patterns. It can also include photos, which have a location, color, format, face (PII), or subject matter. The tricky bits to think about are hidden text, hyperlinks, or languages, which can provide valuable information but are not necessarily what you see.
Context – Slightly broader than just metadata, although metadata makes up the majority of this category. Metadata tells you the who, when, and where of the document. You will find that a document in motion (being written, edited, or transported in an email or automated process, or in some other form of “Work in Progress”) will have different metadata than the completed document. When completed, the document gets filed, stored, archived, published, or otherwise finalized in a way that adds more metadata. Context data includes file name, location, ownership, date, status, attributes, or properties.
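To make the three sources concrete, here is a minimal Python sketch that gathers one small set of signals from each source for a single file. The function name `collect_signals` and the dictionary keys are hypothetical, chosen only for illustration. The sketch also assumes that format can be approximated from the file extension and that only plain-text files are read for content; a real pipeline would need dedicated extractors for PDFs, Office files, and images.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def collect_signals(path: Path) -> dict:
    """Gather format, content, and context signals for one file (illustrative only)."""
    stat = path.stat()
    extension = path.suffix.lower()
    mime_type, _ = mimetypes.guess_type(path.name)

    # Format: how the file is encoded, approximated here by its extension.
    format_signals = {"extension": extension, "mime_type": mime_type}

    # Content: the human-readable text. Assumption: only .txt files are read;
    # other formats would need their own text extractor.
    text = ""
    if extension == ".txt":
        text = path.read_text(encoding="utf-8", errors="replace")
    content_signals = {"text": text, "word_count": len(text.split())}

    # Context: metadata about the file rather than what is written in it.
    context_signals = {
        "file_name": path.name,
        "location": str(path.parent),
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }

    return {
        "format": format_signals,
        "content": content_signals,
        "context": context_signals,
    }
```

Keeping the three groups separate in the result makes it easy to see later which source a classification decision actually relied on.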
There are several uses for our auto-classification efforts, and your purpose in classifying documents will determine the appropriate method. These are the three most important to me.
Type – Auto-classification is used to determine what something is. Usually, we want to use auto-classification to determine whether a document is an invoice, contract, or presentation. This classification purpose can also be linked to a retention category or some other bucket that helps us manage the document better. It is a simple task for a single person to do for a single object, but when you have millions of documents, you will need help.
Metadata – Sometimes we auto-classify to put a tag on a document to make reporting, grouping, searching, protecting, or integrating easier. Added metadata could be an invoice number, an invoice amount, or a contract effective date, for example.
Security or Risk – Sometimes we need to measure risk, responsiveness, or security. Auto-classification can help identify which documents might have risky or valuable information in them, without needing to know their type or what specific value they contain.
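The three purposes above can be sketched as three small functions that run over the same text. This is only a rough illustration built on keyword counts and regular expressions, not a recommended method. The keyword lists, the patterns, and the function names (`classify_type`, `extract_metadata`, `risk_flags`) are all assumptions made up for the example, and real documents would need much more robust rules or a trained model.

```python
import re

# Hypothetical keyword lists; a real system would learn or curate these.
TYPE_KEYWORDS = {
    "invoice": ("invoice", "amount due", "remit"),
    "contract": ("agreement", "effective date", "party"),
    "presentation": ("agenda", "slide"),
}

# Illustrative patterns only; real invoice layouts vary widely.
INVOICE_NUMBER = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)
AMOUNT = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")

RISK_PATTERNS = {
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def classify_type(text: str) -> str:
    """Type: pick the label whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_metadata(text: str) -> dict:
    """Metadata: pull out values that could be stored as tags on the document."""
    found = {}
    number = INVOICE_NUMBER.search(text)
    if number:
        found["invoice_number"] = number.group(1)
    amount = AMOUNT.search(text)
    if amount:
        found["amount"] = amount.group(1)
    return found


def risk_flags(text: str) -> list:
    """Security or risk: report which sensitive-looking patterns appear,
    without needing to know the document type."""
    return sorted(name for name, pattern in RISK_PATTERNS.items() if pattern.search(text))


if __name__ == "__main__":
    sample = "Invoice No: INV-1042\nAmount due: $1,250.00\nContact: pat@example.com"
    print(classify_type(sample))
    print(extract_metadata(sample))
    print(risk_flags(sample))
```

Note that `risk_flags` never calls `classify_type`: a document can be flagged as risky even when its type comes back as unknown, which mirrors the point that risk detection does not depend on knowing what the document is.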