Pre-Trained Vs Self-Taught Machine Learning: What’s the Difference?

June 14 2023

When undertaking automation projects, the first step is moving from the physical page to digital data—the process of digitization. It sounds easy, but the reality is that paper documents are inherently messy, lacking in consistency, and prone to change.

Fortunately, machine learning is uniquely positioned to bring order to the chaos of paper documents, especially when coupled with natural language processing. This combination of technologies allows you to begin with the end in mind—starting with what you need to extract from documents—and train the machine learning models to understand your company’s documents and processes.

It sounds simple on the surface, but when every business has unique processes and documents, and those processes and documents change regularly, it raises the question, “How will the AI be able to handle my specific business needs?”

Machine Learning Models Must Be Taught

The simple answer is that a machine learning model is only as good as its teacher. The quality of the machine’s learning, and its ability to perform accurately, relies heavily on the expertise of the teacher. This teaching can happen in one of two ways.

  1. You train the model yourself, using your own data and your familiarity with your business’s processes, or
  2. Someone else with technical expertise but less knowledge of day-to-day operations trains a model based on the data they have available and their “best guess” at the company’s use case.

In both options, the human is giving the machine examples of what success looks like. Once trained, the machine can emulate those results on new documents. The question of whether to utilize a pre-trained model comes down to what you’re trying to accomplish through the use of machine learning.

Pre-Trained Machine Learning Models Pros and Cons

Pre-trained models present a number of advantages. A pre-trained model provides immediate access to processing results that can be incorporated into automation queues right away. When you have limited time or organizational resources, the process of creating custom models can be intimidating, and it often prevents organizations from leveraging something more bespoke.

Pre-trained models are best used for proofs of concept or situations where lower accuracy levels are acceptable. In these situations, they are an inexpensive way to quickly test a use case or to automate areas that are less sensitive to errors being sent downstream.

Pre-trained models aren’t without their drawbacks, however. The earlier analogy still applies: in order for a model to learn, it must be taught, and a pre-trained model will only be able to provide results for the outcomes it was taught to achieve.

Another fundamental problem with pre-trained models is that in order for them to be universally applicable, they must be pre-trained on all permutations of extraction goals, something that isn’t currently feasible. Instead, pre-trained models are a better fit for the 80/20 rule: they can be used for general applications and goals, but lack specificity for unique or bespoke processes. And because they are already trained, it’s difficult to fine-tune them or dramatically change the outcomes yourself, since the underlying ground truth isn’t exposed. It’s a case of the dreaded black box.

Self-Taught Machine Learning Models Pros and Cons

Self-taught models offer an alternative to many of the tradeoffs above. For example, if you have a custom “control number” field that you need extracted from invoices for hazardous materials, a pre-trained model might struggle to find it, as it’s not a universally used field.

A self-taught model, on the other hand, will be based on the exact data points your organization needs and the documents your team encounters. Because the underlying training data (the “ground truth,” in machine learning (ML) terms) is sourced from your day-to-day operations, the result is a much more accurate and specific model.
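To make the “control number” example concrete, here is a deliberately tiny Python sketch. It is not how Hyperscience or any production ML system works: the “pre-trained” extractor is just a fixed set of patterns, and the “learning” is nothing more than counting which label text precedes the value a person marked as correct. Every name in it (`PRETRAINED_PATTERNS`, `learn_field_label`, the sample invoices) is made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a pre-trained model: it only knows the
# generic fields it shipped with, and "control number" is not one of them.
PRETRAINED_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#\s*:\s*(\S+)", re.IGNORECASE),
    "total": re.compile(r"Total\s*:\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_pretrained(text):
    """Return only the fields the fixed, pre-built patterns can find."""
    found = {}
    for field, pattern in PRETRAINED_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1)
    return found


def learn_field_label(examples):
    """Toy 'self-taught' step.

    examples: list of (document_text, correct_value) pairs that a person
    labeled. Returns the most common text that precedes the value on its line.
    """
    prefixes = Counter()
    for text, value in examples:
        for line in text.splitlines():
            position = line.find(value)
            if position > 0:  # the value appears with some label before it
                prefixes[line[:position].strip().lower()] += 1
    if not prefixes:
        raise ValueError("no labeled value was found in the examples")
    return prefixes.most_common(1)[0][0]


def extract_self_taught(text, label):
    """Find the learned label on a new document and return the token after it."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(label):
            rest = stripped[len(label):].split()
            if rest:
                return rest[0]
    return None


# Two invented, hand-labeled training invoices.
training_examples = [
    ("Invoice #: 1001\nControl No: HZ-4471\nTotal: $250.00", "HZ-4471"),
    ("Invoice #: 1002\nControl No: HZ-9920\nTotal: $75.50", "HZ-9920"),
]
learned_label = learn_field_label(training_examples)

new_invoice = "Invoice #: 2001\nControl No: HZ-1234\nTotal: $10.00"
pretrained_result = extract_pretrained(new_invoice)
control_number = extract_self_taught(new_invoice, learned_label)
```

On the new invoice, the fixed patterns return the invoice number and total but have no notion of a control number, while the extractor taught from two labeled examples finds it. A real model generalizes far beyond exact label matching, but the division of knowledge is the same.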

Self-taught models take more time to deploy than pre-trained models, but provide a highly specific ML model for the challenges you’re facing. Teaching a model from your own data has been revolutionized in recent years, reducing the amount of work dramatically. Only a few years ago it would have required a data scientist and prohibitively large data sets in order to create anything on your own.

Platforms like Hyperscience focus on empowering line-of-business employees to teach the machine what good data looks like, minimizing the need for hard-to-find or expensive data analysis expertise. Instead of writing code and working through datasets, a business user points and clicks on the document to mark the data your specific model needs.

In practice, a new model can require as few as 100 training documents before being deployed, making it easier than training a new employee on how to understand the documents. This focus on ease of use dramatically increases the accessibility of AI and does not require large IT investments to deploy new models.

Why Not Both? How Hyperscience Merges Self-Taught & Pre-Trained Models

While there are significant trade-offs when focusing on one approach, finding a balance between the two can lead to fantastic results. That’s why we’ve identified the critical points at which a self-taught model is needed and those at which a pre-trained model is the better choice.

The variation in visual structure, as well as the desired outcomes for data extraction, typically demands a self-taught model to be successful. We believe that using your own documents to train your own identification models creates the best outcomes for customers in situations where accuracy is paramount.

On the other hand, the underlying Natural Language Processing (NLP) models, which transcribe identified fields, normalize outputs, clean up images, classify page structures, and more, are broadly applicable across customers.

Hyperscience itself is delivered to customers with multiple pre-trained models, making the entire document processing flow automated, without sacrificing accuracy. We blend pre-trained with self-taught, using your documents to provide the highest level of performance so that you can always trust the accuracy of your models, control the underlying data, and deploy superior results faster.
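The division of labor described above can be sketched as a small pipeline. This is an illustrative outline only, not the Hyperscience architecture or API: the function names are hypothetical, and plain string handling stands in for what would be trained models in practice. The general-purpose steps (cleanup, normalization) are shared by every customer, while the step that knows where your fields live is the one you teach yourself.

```python
from typing import Callable, Dict, Optional


def collapse_whitespace(text: str) -> str:
    """Stand-in for a shared, pre-trained cleanup stage."""
    return "\n".join(" ".join(line.split()) for line in text.splitlines())


def normalize_value(value: str) -> str:
    """Stand-in for a shared, pre-trained output normalizer."""
    return value.upper().replace(" ", "")


def locate_control_number(text: str) -> Optional[str]:
    """Stand-in for a self-taught locator, specific to one customer's documents."""
    for line in text.splitlines():
        if line.lower().startswith("control no:"):
            return line.split(":", 1)[1].strip()
    return None


def process(
    text: str,
    cleanup: Callable[[str], str],
    locators: Dict[str, Callable[[str], Optional[str]]],
    normalize: Callable[[str], str],
) -> Dict[str, str]:
    """Run the shared cleanup, then each custom locator, then shared normalization."""
    cleaned = cleanup(text)
    result = {}
    for field_name, locate in locators.items():
        raw_value = locate(cleaned)
        if raw_value is not None:
            result[field_name] = normalize(raw_value)
    return result


# An invented, messy scan: uneven spacing and inconsistent casing.
scanned = "  Control   No:   hz - 1234 \nTotal: $10.00"
fields = process(
    scanned,
    cleanup=collapse_whitespace,
    locators={"control_number": locate_control_number},
    normalize=normalize_value,
)
```

Swapping in a different customer’s locators changes what gets extracted without touching the shared stages, which is the point of blending the two approaches.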

To see a demonstration of how these two models work together in the Hyperscience Platform, be sure to watch the demo here. For more information on how Hyperscience delivers the power of AI to specific business use cases, please contact us here.