Document Autoclassification: Source and Purpose

Brian Tuemmler


Before we start looking at classification techniques, a few concepts need to be defined.

Source data

The information we rely on to classify content automatically comes from three sources within a single document.

  • Format – The format often includes the encoding that lets a specific application open and work with the file. It can also include any structure that affects the document's appearance, such as forms or templates. A Word document and a PDF may be the exact same document and contain all the same words, but they are different formats and possibly different record categories. Conversely, two template-based documents may have zero words in common yet belong to the same record category because of the template.

  • Content – The human-readable part of a document file. This includes the words, punctuation, and text patterns. It can also include photos, which have a location, color, format, faces (PII), or subject matter. The tricky bits to think about are hidden text, hyperlinks, or languages, which can provide valuable information but are not necessarily what you see.

  • Context – Slightly broader than just metadata, although metadata makes up the majority of this category. Metadata tells you the who, when, and where of the document. You will find that a document in motion (being written, edited, or transported in an email or automated process, or in some other form of “Work in Progress”) will have different metadata than the completed document. When completed, the document gets filed, stored, archived, published, or otherwise finalized in a way that adds further metadata. Context data includes file name, location, ownership, date, status, attributes, or properties.
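
As a rough illustration of the three sources, the sketch below groups the raw signals available for a single file into format, content, and context. The function name `collect_signals` and the dictionary keys are hypothetical and chosen only for this example. It reads text from plain-text files only; Word or PDF files would need a format-specific extractor that is not shown here.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path

# Assumption for this sketch: only these extensions are treated as readable text.
PLAIN_TEXT_EXTENSIONS = {".txt", ".csv", ".md"}


def collect_signals(path):
    """Group raw classification signals for one file by source."""
    p = Path(path)
    stat = p.stat()
    # Guess the MIME type from the file name alone (no content sniffing).
    mime_type, _ = mimetypes.guess_type(p.name)

    # Content: the human-readable text, when it can be read directly.
    text = ""
    if p.suffix.lower() in PLAIN_TEXT_EXTENSIONS:
        text = p.read_text(encoding="utf-8", errors="replace")

    return {
        "format": {"extension": p.suffix.lower(), "mime_type": mime_type},
        "content": {"text": text, "word_count": len(text.split())},
        # Context: the who, when, and where that the file system exposes.
        "context": {
            "file_name": p.name,
            "folder": p.parent.name,
            "size_bytes": stat.st_size,
            "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        },
    }
```

Keeping the three groups separate makes it easier to see which source a later classification rule depends on, for example a rule that looks only at the folder name versus one that reads the text.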


Purpose

There are several uses for our auto-classification efforts, and your purpose in classifying documents will determine the appropriate method. These are the three I consider most important.

  • Type – Auto-classification is used to determine what something is. Usually, we want to use auto-classification to determine whether a document is an invoice, contract, or presentation. This classification purpose can also be linked to a retention category or some other bucket that helps us manage the document better. It is a simple task for a single person to do for a single object, but when you have millions of documents, you will need help.

  • Metadata – Sometimes we auto-classify so that we can put a tag on a document to make reporting, grouping, searching, protecting, or integrating easier. Added metadata could be an invoice number, an invoice amount, or a contract effective date, for example.

  • Security or Risk – Sometimes we need to measure risk, responsiveness, or security. Auto-classification can help identify which documents might contain risky or valuable information, without needing to know their type or the specific value they contain.
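
To make the three purposes concrete, here is a minimal rule-based sketch that answers each one separately for a piece of text. Everything in it is an assumption made for illustration: the keyword lists, the invoice-number pattern, the Social Security number shaped pattern used as a risk signal, and the `classify` function itself. It is not a recommended method, only a way to show that type, metadata, and risk are distinct questions.

```python
import re

# Hypothetical keyword rules; a real deployment would tune or learn these.
TYPE_KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "contract": ("agreement", "effective date"),
}
# Illustrative pattern for pulling an invoice number out of the text.
INVOICE_NUMBER = re.compile(
    r"invoice\s*(?:no\b\.?|number\b|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)
# Illustrative risk signal: text shaped like a US Social Security number.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def classify(text):
    """Answer the three purposes independently for one document's text."""
    lowered = text.lower()

    # Purpose 1: type. The first label with a matching keyword wins.
    doc_type = "unknown"
    for label, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            doc_type = label
            break

    # Purpose 2: metadata. Extract a value that can be stored as a tag.
    metadata = {}
    match = INVOICE_NUMBER.search(text)
    if match:
        metadata["invoice_number"] = match.group(1)

    # Purpose 3: risk. Flag the document without needing to know its type.
    risk_flag = bool(SSN_LIKE.search(text))

    return {"type": doc_type, "metadata": metadata, "risk_flag": risk_flag}
```

Note that the risk check runs even when the type is "unknown", which mirrors the point that risk can be measured without knowing what the document is.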


 © 2024 Infotechtion. All rights reserved 

