HOW RETRAINING PIPELINES KEEP ML MODELS UP TO DATE

- Dr. Theresa Bick

When artificial intelligence (AI) is used to automate real-world business processes, the work on a machine learning (ML) model does not end after the first successful training. Over time, the data on which the model operates evolves, which can lead to a decline in output quality. This makes regular updates — retraining — indispensable. In this article, we explore the challenges associated with retraining and demonstrate how this process can be made efficient.

ML: A data logistics challenge

Developing ML technologies for practical use presents challenges of various kinds. A common pattern emerges: exciting ideas lead to rapid results in the proof-of-concept (PoC) phase, especially in areas like natural language processing (NLP) and large language models (LLMs), thanks to adaptable APIs from well-known providers like OpenAI. However, when these insights are transferred into functional solutions, progress can falter due to data privacy requirements or disappointing performance on real-world data, which often doesn’t match the idealized, synthetic datasets used during development. As a result, developments frequently end up as isolated, disconnected solutions that fail to realize their full economic potential.

The solution to these multifaceted challenges is logistical: linking data, models, and use cases in a scalable, enterprise-ready manner. Beyond developing high-performance ML algorithms, scaling to large datasets requires infrastructure expertise to efficiently and reliably operate solutions. Only by combining ML know-how with infrastructure expertise can an ML application tackle these challenges and deliver lasting value in a business context.

The following sections illustrate how retraining pipelines enable seamless interaction between domain experts, data, and ML models, using an ML-based document classification1 system as an example.

Data quality: Expectation vs. reality

For an ML-based document classifier to learn the relationship between document content and its respective class, an ideal scenario would involve a dataset that fully represents the range of data variance and includes sufficient examples for all possible document classes. Additionally, it is assumed that the content will remain consistent over time.

However, a closer look at real-world business data quickly reveals significant deviations from this ideal. Real data is often unstructured or incomplete. This must be considered when designing a solution, because the quality of an ML model heavily depends on the quality and selection of the training data. This includes tasks such as cleaning and preparing the data (e.g., removing duplicates or identifying erroneous data) and selecting the features2 used to train the ML model.
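As a minimal sketch of what such preparation can look like, the following assumes the raw data is available as a pandas DataFrame with hypothetical "text" and "label" columns; typical steps include deduplication and the removal of empty or erroneous records:

import pandas as pd

def clean_training_data(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning for a document-classification training set."""
    # Remove exact duplicates so repeated documents do not bias the model
    df = df.drop_duplicates(subset=["text"])
    # Drop records with missing text or label
    df = df.dropna(subset=["text", "label"])
    # Discard documents that are empty after trimming whitespace
    df = df[df["text"].str.strip().str.len() > 0]
    # Normalize whitespace as a simple preparation step
    df = df.assign(text=df["text"].str.replace(r"\s+", " ", regex=True).str.strip())
    return df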

Topics and language evolve over time

When working with text and language, it’s essential to recognize that both language and relevant topics change over time. ML models, if trained only once and not updated, are subject to what is known as model drift, leading to a decline in model performance.

Model drift can occur through sudden or incremental changes. For example, the term “cloud” once primarily referred to the weather phenomenon; with the rise of cloud services in the early 2000s, it gained a new meaning. Similarly, the introduction of new laws, such as the GDPR in 2018, caused terms like “data protection” and “compliance” to gain immediate prominence. Incremental changes are seen in topics like climate change and sustainability, which have gradually gained more focus over the years.

Freezing models in their initial state without mechanisms for keeping the underlying data up to date increases the risk of declining performance over time. This creates significant maintenance needs, requiring frequent updates as new representative data becomes available — often only after dissatisfaction with the model’s performance arises.
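One practical safeguard is to monitor model quality continuously and flag the need for retraining before dissatisfaction builds up. The following sketch illustrates this idea; the DriftMonitor class, its window size, and its tolerance are illustrative assumptions rather than part of the system described in this article:

from collections import deque

class DriftMonitor:
    """Flags a model for retraining when recent accuracy drops below baseline."""

    def __init__(self, baseline_accuracy: float, window_size: int = 200,
                 tolerance: float = 0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.recent = deque(maxlen=window_size)  # rolling window of 0/1 outcomes

    def record(self, predicted: str, actual: str) -> None:
        self.recent.append(1 if predicted == actual else 0)

    def needs_retraining(self) -> bool:
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough feedback yet for a stable estimate
        rolling_accuracy = sum(self.recent) / len(self.recent)
        return rolling_accuracy < self.baseline - self.tolerance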

The concept: Using expert feedback to improve accuracy

ML solutions often automate existing processes rather than introducing entirely new ones. This is particularly effective when domain experts are integrated into these processes. Interactive ML applications allow employees with years or even decades of expertise to feed their knowledge back into the system. This not only improves the model but also fosters trust in the solution by making its outcomes transparent and empowering users to influence its performance.

In an era where skilled labor shortages are widespread, this approach preserves valuable expertise within algorithms while allowing experts to focus on their core competencies instead of repetitive tasks.

We have implemented this concept for document classification, taking the previously mentioned challenges and requirements into account. The classifier is initially trained on an incomplete dataset, knowing that certain document classes may be underrepresented. The resulting context-specific model is managed within a model pool.

When classifying a new, unseen document, the “best” model from the pool (based on predefined criteria) is selected to determine the document class. Additionally, a confidence level determines whether feedback from employees is requested. This feedback is incorporated into the existing training dataset, enabling the dataset to grow over time. Through regular retraining, the model improves incrementally.

Factors influencing whether feedback is requested include uncertainty caused by rarely encountered document classes or the appearance of content that doesn’t fit any previously seen document classes.
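Put together, the loop can be sketched as follows. This is a simplified illustration of the concept, not our production code; predict_with_confidence, validation_accuracy, and ask_domain_expert are hypothetical stand-ins for the model interface, the predefined selection criteria, and the feedback UI:

def ask_domain_expert(document: str) -> str:
    """Hypothetical stand-in for the UI in which an expert assigns the class."""
    raise NotImplementedError

def classify_with_feedback(document: str, model_pool: list, training_data: list,
                           confidence_threshold: float = 0.8) -> str:
    """Classify a document; request expert feedback when confidence is low."""
    # Select the "best" model from the pool, here by validation accuracy
    model = max(model_pool, key=lambda m: m.validation_accuracy)

    predicted_class, confidence = model.predict_with_confidence(document)

    if confidence < confidence_threshold:
        # Low confidence: a rarely seen class, or content unlike any known class.
        # The expert's answer becomes a new training example, so the dataset
        # grows and the next retraining run improves the model.
        predicted_class = ask_domain_expert(document)
        training_data.append((document, predicted_class))

    return predicted_class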



A focus on footprint and data security

In developing ML technologies, particularly in text and document processing, generative AI and large language models often take center stage. However, hosting such systems on-premises is infeasible for most companies, leading many to use services like OpenAI in a U.S.-based cloud. Few users interacting with tools like ChatGPT realize the technological footprint involved, including energy consumption and costs.

In times of energy crises and sustainability concerns, it’s crucial to evaluate the footprint of ML solutions. By focusing on essential functionality during the design phase, it often becomes evident that large language models are not always necessary. Even small-scale intelligent solutions can provide significant automation benefits. For example, our ML classifier, which relies on established NLP techniques such as tokenization3, word or text embeddings4, and classification algorithms, does not require a GPU5. This minimal resource requirement supports regular, automated retraining.
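As an illustration of how little such a pipeline requires, the following sketch uses scikit-learn with TF-IDF as one lightweight vector representation; this is an assumption-laden example, not our exact implementation. A classifier of this kind trains and retrains in seconds on a CPU:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_classifier() -> Pipeline:
    """A CPU-friendly text classifier: tokenization, vectorization, classification."""
    return Pipeline([
        # Tokenizes documents and maps them into a sparse vector space
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
        # A linear model is fast to (re)train and needs no GPU
        ("classifier", LogisticRegression(max_iter=1000)),
    ])

# Retraining on the grown dataset is then a single call:
# model = build_classifier().fit(documents, labels)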

At the same time, data security becomes a pivotal concern. A resource-efficient design that allows the model to be hosted on-premises ensures that companies retain full control over their data. This approach avoids potential risks associated with the U.S. Cloud Act, which grants U.S. authorities access to corporate and customer data stored in U.S.-based clouds.

Looking beyond: Broader applications of retraining pipelines

In this article, we have demonstrated how retraining pipelines ensure that ML models remain up to date, using our ML-based document classifier as an example. However, retraining and active learning mechanisms are applicable across a wide range of use cases, including areas like computer vision or text content extraction, such as named entity recognition. Once the data logistics and infrastructure challenges are addressed, many more enterprise-ready ML technologies become accessible.

We also discussed the implications of using LLMs and explained why opting for smaller ML models can often be the better choice. In a follow-up article, we will delve deeper into where we see the greatest potential for value creation when using LLMs. Spoiler alert: it’s not in chatbots.

1 Document Classification: Categorizing texts/documents into predefined categories, such as "Invoice," "Delivery Note," "Termination," etc.
2 Feature Engineering: Transforming raw data (e.g., text) into meaningful features (e.g., frequency of specific keywords).
3 Tokenization: The process of breaking text into smaller units (tokens), such as words, subwords, or characters.
4 Embedding: Numerical representation of text (e.g., words or sentences) in a multidimensional vector space.
5 GPU: Graphics Processing Unit, a processor optimized for parallel computations, originally developed for graphics rendering. GPUs are highly efficient in processing large datasets, making them frequently used in ML applications.
