Garbage In, Goldmine Out: Curating your Training Sets for CoPilot AI

Brian Tuemmler


Curating your Shared Network Drives and SharePoint Training Sets for MS CoPilot AI

We have come a long way since file analysis tools and defensible disposition helped to mitigate the risk and cost of shared drives. Organizations that tried simple clean-up projects alone often failed to see long-term strategic benefits. If clean-up was part of a larger RIM program, privacy risk remediation, or migration, there was definite value, but only as a first step. What we generally did not do was strike gold in the process.

Identifying value, or better yet, extracting and adding value, is the gold mine in addressing unstructured information landfills. It is the purpose of Digital Value Promotion (DVP) for AI.

Semantic Index for Copilot

Among early adopters of Microsoft CoPilot who are leveraging generative AI capabilities, we are seeing a resurgence of attention to network drives and other unstructured content locations – and it is for some very interesting use cases. You can now use Microsoft Graph connectors or intelligent migration to integrate the data sources you want with Semantic Index for Copilot. That shared drive can now be a gold mine for your generative CoPilot AI project.

However, as with any large-scale analytics or machine learning process, the adage that you “get out what you put in” holds true for AI. It is easy to point your AI tools at “everything” in your environment to harvest the value and try to make AI better. The problem is that you are also ingesting failed projects, outdated research, lost litigations, abandoned drafts, and of course, garbage. The quality and accuracy of your AI engagements will drastically improve if you can focus on promoting the valuable content. By valuable content, I mean content that encapsulates your company’s ideas, concepts, and decisions in words. Infotechtion guidance and products can help.

Yes, what we are describing is traditional Records Management, but with some new and important twists that have great payoff. Our i-ARM product capabilities extend Purview records management, classification, and disposition capabilities to data in Azure, non-Microsoft clouds, SaaS services, and even on-premises network file shares, making DVP for AI possible:

Records Management

Maintain your regular records management activities diligently. Records management is good at bringing the cream of your content value to the top. This is not new. But here are three ideas that are new:

  1. If you have the time, go through record categories, or even content types, and identify content that has real content value, not just transactional or administrative value – research, strategies, specifications, standards, legal documents, etc.
  2. If that time is not available, start from the other end of the value chain and identify records categories that represent transactional data that has no re-usable value, or even potentially harmful data such as privacy information. Unless you are unique, populating your AI source content with transactional data in invoices will not improve results. This extra filtering will add value to your content.
  3. It may be legally necessary to keep records around failed projects or activities, but it is not necessary to add them to your AI content. If you have metadata, such as product names or numbers, that you can use to separate good quality or successful content from bad quality, low value content within a specific category, this is an opportunity to use it.
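
The three ideas above can be sketched as a simple inclusion filter. This is a minimal illustration, not an i-ARM or Purview API: the category names, the `Record` class, and the `project_status` metadata field are all hypothetical and would need to be replaced with the categories and metadata in your own file plan.

```python
from dataclasses import dataclass

# Hypothetical record categories; substitute the ones from your own file plan.
HIGH_VALUE_CATEGORIES = {"research", "strategy", "specification", "standard", "legal"}
TRANSACTIONAL_CATEGORIES = {"invoice", "purchase order", "timesheet"}


@dataclass
class Record:
    path: str
    category: str
    # Hypothetical metadata field, e.g. "succeeded", "failed", or "unknown".
    project_status: str = "unknown"


def include_in_ai_content(record: Record) -> bool:
    """Return True if the record should be promoted to the AI source content."""
    category = record.category.lower()
    if category in TRANSACTIONAL_CATEGORIES:
        return False  # idea 2: transactional data with no re-usable value
    if record.project_status == "failed":
        return False  # idea 3: kept for legal reasons, but not promoted
    return category in HIGH_VALUE_CATEGORIES  # idea 1: real content value


records = [
    Record("rd/battery-study.docx", "Research", "succeeded"),
    Record("finance/inv-0001.pdf", "Invoice"),
    Record("rd/cancelled-prototype.docx", "Research", "failed"),
]
selected = [r.path for r in records if include_in_ai_content(r)]
```

Note that the failed project record stays where it is for retention purposes; the filter only decides what is promoted to the AI content.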

Non Records

If your content is not labeled or classified as a record, it can therefore be classified as a non-record. Some organizations take the approach that non-records should be kept for a short duration (3 years) and then deleted. This is also not new, and it can reduce outdated information influencing your AI. However, it can also have a negative impact on your AI quality. Take a look at your non-record content to find opportunities to identify quality. Quality is often a proxy for value. There are three useful options:

  1. Depending on how you use sensitivity labels, you can leverage those classifications to improve your AI. Content with a sensitivity label identifying valuable intellectual property should definitely be included, while content with a label identifying privacy information should be excluded so that it does not accidentally make its way into your AI content. Here is a good short video on this.
  2. If you have content analytics capabilities, such as in MS cognitive services, look for content elements that indicate that content was valued when it was created. Time spent editing, a lack of spelling or grammar mistakes, and the use of templates or stylesheets are all behavior-based metadata held in Office content that can be used to identify quality.
  3. Look for groups or individuals that you know are the “brains” of the organization and intentionally create content that is high quality (regardless of its record status). Legal, Research and Development, or teams in your key line of business are good examples. Identity-based classification, such as “Management”, can also be a useful identifier.
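
One way to combine the three options above is a rough quality score per document. The sketch below is illustrative only: the label names, department names, field names, weights, and the 60-minute threshold are assumptions, and the signals are assumed to have already been extracted by whatever content analytics tool you use.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label and department names; use your own taxonomy.
EXCLUDE_LABELS = {"privacy"}
INCLUDE_LABELS = {"intellectual property"}
TRUSTED_DEPARTMENTS = {"legal", "research and development"}


@dataclass
class Document:
    name: str
    sensitivity_label: Optional[str] = None
    editing_minutes: int = 0       # total editing time recorded in the file
    spelling_errors: int = 0       # count reported by a content analytics tool
    uses_template: bool = False
    author_department: str = ""


def quality_score(doc: Document) -> Optional[int]:
    """Return None if the document must be excluded, otherwise a rough score."""
    label = (doc.sensitivity_label or "").lower()
    if label in EXCLUDE_LABELS:
        return None  # option 1: privacy content never reaches the AI content
    score = 0
    if label in INCLUDE_LABELS:
        score += 3   # option 1: labeled intellectual property
    if doc.editing_minutes >= 60:
        score += 1   # option 2: someone invested time in it
    if doc.spelling_errors == 0:
        score += 1   # option 2: it was proofread
    if doc.uses_template:
        score += 1   # option 2: it follows a template or stylesheet
    if doc.author_department.lower() in TRUSTED_DEPARTMENTS:
        score += 2   # option 3: written by a known high-quality group
    return score
```

The weights are arbitrary starting points. The useful part is the shape: a hard exclusion for privacy labels first, then additive signals that can be tuned against a sample of content you already know to be good or bad.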

Trivial, Redundant, Abandoned, Superseded, Harmful (TRASH) aka ROT

Again, TRASH or ROT removal is not new. If you intend to start a project to remove TRASH content as part of a bigger initiative, consider the following analysis – UNDERSTAND it first:

  1. Don’t just delete duplicates. Look at the reasons why duplicates are being made. Often it is because different functional areas need access to the same knowledge – a perfect justification for AI.
  2. “DRAFT” is the single most commonly identified marker of trivial content (based on file name), but only if the file has not been touched for a while. Nobody ever intentionally created garbage content; it just became that over time, through changing priorities or changing resources. Age is not a valid indicator of quality by itself, but it is a very useful tool in your AI arsenal.
  3. Shared drives often contain system files, applications, structured data, sample data, and help files. Although this content is not trivial to the IT part of your organization, simple file analysis capabilities can identify it and isolate it from your AI content.
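
The analysis above can be sketched with nothing more than file names, modification dates, and content hashes. The extension list, the two-year staleness threshold, and the function names below are assumptions for illustration, and paths are assumed to be relative to the share root so that the first path segment stands in for a functional area.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta
from pathlib import PurePosixPath
from typing import Dict, List

# Hypothetical choices; tune both to your own environment.
SYSTEM_EXTENSIONS = {".exe", ".dll", ".msi", ".chm", ".sys"}
STALE_AFTER = timedelta(days=2 * 365)


def triage(path: str, last_modified: datetime, now: datetime) -> str:
    """Classify a file as 'system', 'stale_draft', or 'review'."""
    p = PurePosixPath(path)
    if p.suffix.lower() in SYSTEM_EXTENSIONS:
        return "system"       # item 3: isolate from AI content
    if "draft" in p.name.lower() and now - last_modified > STALE_AFTER:
        return "stale_draft"  # item 2: a draft nobody has touched for a while
    return "review"           # everything else needs a closer look


def duplicate_groups(files: Dict[str, bytes]) -> List[List[str]]:
    """Group paths whose content is byte-identical."""
    by_hash = defaultdict(list)
    for path, content in files.items():
        by_hash[hashlib.sha256(content).hexdigest()].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]


def owning_areas(paths: List[str]) -> List[str]:
    """Top-level folders that hold a copy (item 1: who needs this knowledge?)."""
    return sorted({PurePosixPath(p).parts[0] for p in paths})
```

A duplicate group whose copies sit under several top-level folders is a hint that more than one functional area relies on the same document, which is worth recording before any copy is deleted.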

Bringing content from shared drives adds a potential treasure chest of data. Microsoft connectors, along with Infotechtion i-ARM capabilities for file shares, Azure Files, and SharePoint, help identify, classify, and act on content on shared drives, in M365, and in other locations.

Curating your training data to enhance your AI accuracy takes the right tools and a solid strategy.

Finally, remember that all new content created today will be the gold within your AI systems of the future. The rules and queries you use to clean and organize your legacy content today should be the same ones you build into your day-forward environment for tomorrow. Spend the time to build an ongoing M365 and Purview strategy that is consistent between legacy content and anything built from this point forward. That way you are laying a foundation that has longevity and will pay off in the future, rather than simply creating a one-off temporary structure you won’t be able to rely on later.

 © 2024 Infotechtion. All rights reserved 

