This is probably not the first time you have heard about MLOps. The ideas behind this agile approach go back to Google's 2015 paper “Hidden Technical Debt in Machine Learning Systems”, and it has been at the center of interest in Machine Learning practice ever since. At ML Architects Basel, we conceptualized and developed a simplified and industrialized MLOps approach that enables its adoption even in non-high-tech companies and environments. We call this “Effective MLOps”.

In this article, we analyze the background and fundamental elements required to understand, adopt and master the MLOps approach by introducing our definition of an Effective MLOps framework. This framework addresses the key principles, workflows, activities and artefacts of MLOps, adapted to the digital age, in order to identify MLOps best practices and techniques.

Machine Learning (ML) projects are complex and require both deep and cross-functional knowledge. They are usually hard to maintain and update. In addition to the complexity of traditional software, ML projects deal with two extra layers: data and model. The fast emergence of AI and ML in mainstream businesses exposes how hard it is to build and maintain an ML-driven application. In the real world, many companies are competing to quickly deliver the most reliable and efficient product. In such an environment, we should “act as fast as possible, but as slowly as necessary”.

Machine Learning Challenges

Cross-functional teams composed of data engineers, data scientists and software engineers, amongst others (read more about roles and collaboration within the ML lifecycle here), work on different aspects of an ML project to design, build, deploy and maintain an ML application. ML software production processes require different tools and workflows, and they are complex and hard to predict, test, explain and improve. Moreover, data is often spread out over multiple systems and architectures and is far from being prepared for AI/ML processing. On-premise infrastructures also usually do not offer the scalability and flexibility needed for such a complex project. Developing ML is therefore hard, but operationalizing ML is even harder.

In recent years, through the emergence of paradigms such as microservice architectures and CI/CD, the DevOps approach has become vital in applied software engineering, shifting the focus from development to a broader scope. It aims to extend the agile principles and improve the collaboration between development and operations teams. MLOps, which is becoming increasingly popular as shown in figure 1, is a similar approach that brings the DevOps principles to AI and ML projects.

Like DevOps, MLOps is a Machine Learning (ML) engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops).

Google, 2020

While MLOps inherits the challenges of DevOps for traditional software, the complexity of Machine Learning algorithms and the high dimensionality of data add a further layer of complexity. This layer is difficult to handle and requires a unified approach to establishing practices and addressing cultural challenges. The trade-off between coordinating a cross-functional team and building, improving and delivering ML-based software is daily business for many organizations seeking a digital transformation and machine-learning-driven software.

All of the above reasons lead us to believe that organizations need an effective MLOps framework which is accessible and applicable to "normal" environments, i.e. companies that are not leading high-tech firms like Google.

Figure 1 - Evolution of interest in MLOps according to Google Trends

In other words, we are lacking an agile ML ecosystem that provides an end-to-end unified process to design, implement, deploy and monitor ML applications. The need to «industrialize» and «democratize» the MLOps approach thus calls for MLOps to be made accessible to organizations as an engineering discipline. That's what Effective MLOps is meant to do.

Effective MLOps by MLAB: Democratizing MLOps

At ML Architects Basel (MLAB), we strongly believe that a Digital Transformation and/or an Agile transformation requires a holistic approach that covers not only Technology but also the Operating Model, as presented in figure 2, by leveraging new-generation IT capabilities. We consider the MLOps approach a key pillar of a Digital Highway for continuous ML delivery. We have therefore defined our own approach and developed a best practices model to «industrialize» the MLOps concept, named “The Effective MLOps Framework”.

Figure 2 - 3 Pillars of Effective MLOps

For us, democratizing MLOps is about making it accessible to all organizations by providing:

  • A systematic approach to reduce complexity and increase efficiency by leveraging DataOps & AI/ML, culture & skills and Continuous Delivery & Site Reliability Engineering (SRE) capabilities.
  • A clear and structured set of activities supported by customizable learning modules and an approach leveraging market best practices.

Effective MLOps is an engineering approach that aims to unify operating model, technologies and culture in order to facilitate adoption, provide a smooth interaction between the different (new and existing) roles involved in ML projects, and automate safe increments to continuously deliver reliable and high-quality ML systems.

By Effective MLOps Engineering we mean the application of a systematic, disciplined and holistic approach to the cost-effective development and operation of ML systems in the context of changing business and data landscapes.

Key questions from an engineering perspective

From an engineering perspective, the focus lies on the technical aspects of the application lifecycle:

Pipelining & Automation

How can I build hardened pipelines for data, model and application and reliably recreate my experiments end-to-end?

Testing & Quality Assurance

How can I improve testing for model, data and code to increase automation capabilities?

Scalability

How do I make sure my application is scalable so it can handle increased user load?

Monitoring & Observability

How can I make sure I understand when my application degrades without spending too much time looking at monitoring data?
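
One way to approach the monitoring question is to let a small automated check watch a quality metric and raise a flag only when it clearly drops, so nobody has to stare at dashboards. The following is a minimal sketch and not part of the Effective MLOps framework itself: the class name `DegradationMonitor`, the choice of a rolling mean, and the `baseline`, `tolerance` and `window` parameters are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean


class DegradationMonitor:
    """Hypothetical helper: flags a drop in a model quality metric.

    Assumption: a higher score is better (e.g. accuracy per batch).
    """

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 5):
        self.baseline = baseline      # score measured at release time
        self.tolerance = tolerance    # accepted drop below the baseline
        self.scores = deque(maxlen=window)  # keeps only the latest scores

    def observe(self, score: float) -> bool:
        """Record a new score and return True if the rolling mean has degraded."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        return mean(self.scores) < self.baseline - self.tolerance


monitor = DegradationMonitor(baseline=0.90, tolerance=0.05, window=3)
alerts = [monitor.observe(s) for s in (0.90, 0.89, 0.88, 0.80, 0.78)]
```

Averaging over a window rather than alerting on a single value is one possible design choice: it trades a slightly slower reaction for fewer false alarms.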

Key questions from an operating model perspective

The operating model perspective goes beyond purely technical capabilities and focuses on aspects such as collaboration and generating value:

Roles & Processes

How can I create the right roles & processes for a reliable system?

Value Stream Management

How do I make sure that what we build actually brings value to the end-consumer?

Continuous Learning

How do I make sure to learn from past mistakes and continuously improve?

Collaboration

How can I facilitate collaboration between teams and stakeholders and break down silos?

In line with the aforementioned questions, the Effective MLOps approach we propose covers the three key pillars of Machine Learning, Data and Code from both an engineering and an operating model perspective. More specifically, this translates to:

  • Designing and building the pipelines to manage continuous code, model and data changes by applying DevOps and SRE best practices to the Machine Learning domain.
  • Providing objective and up-to-date benchmarks of both established market solutions and new, innovative (next generation) tools for MLOps.
  • Designing and establishing an operating model which serves as the foundation for effectively building and operating an ML-based application.
  • Maintaining and operating the continuous delivery pipeline and the SRE cockpit to enable continuous monitoring and release management of the ML systems by providing maturity assessments and roadmaps to manage governance, processes and tools.

The Effective MLOps framework helps organizations to design, build and enable their ML systems and operating models to continuously deliver reliable Machine Learning systems. We think it is important to take best practices into consideration for the architecture, the implementation and the operations.

In other words, we believe that the key principles for Effective MLOps are:

  • Data, model and code pipelines driven by reliability
  • Continuous learning and, if needed, online real-time predictions
  • Error budgeting and service level objective (SLO) engineering
  • Cross-functional collaboration between teams
  • Adoption and extension of the DevOps culture and values to the ML domain
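
To make the error budgeting principle more concrete, here is a small worked sketch. It assumes that an SLO is expressed as the share of "good" events (for example predictions served correctly and on time) and that the error budget is whatever share of bad events the SLO still allows. The class name `ErrorBudget`, its fields and the release rule are hypothetical and only illustrate the arithmetic; they are not taken from a specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical error budget for one SLO over one time window."""

    slo_target: float   # e.g. 0.99 means 99% of events must be good
    total_events: int   # all events observed in the window
    bad_events: int     # events that violated the SLO

    @property
    def allowed_bad_events(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched budget, 0.0 or less means it is used up.
        allowed = self.allowed_bad_events
        if allowed == 0:
            return 0.0 if self.bad_events else 1.0
        return 1.0 - self.bad_events / allowed

    def can_release(self, min_remaining: float = 0.0) -> bool:
        # Assumed policy: risky changes are allowed only while budget remains.
        return self.remaining_fraction > min_remaining


budget = ErrorBudget(slo_target=0.99, total_events=100_000, bad_events=400)
print(f"allowed bad events: {budget.allowed_bad_events:.0f}")
print(f"budget remaining:   {budget.remaining_fraction:.0%}")
print(f"release allowed:    {budget.can_release()}")
```

With a 99% target and 100,000 events, about 1,000 events may fail; 400 failures leave roughly 60% of the budget. Tying releases to the remaining budget is one common way to balance delivery speed against reliability, but the exact policy is a decision for each team.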

The Effective MLOps approach we propose at MLAB contributes to enabling what is called the 4 Cs of MLOps: Continuous Integration (CI), Continuous Delivery (CD), Continuous Monitoring (CM) and Continuous Training (CT).

Now that you are aware of the Effective MLOps scope and framework, the principles behind it and how it can be implemented in your organization, we would also like to share some of the best practices that we strongly believe are essential to adopting MLOps:

  • Establish unified model development and data exploration
  • Adopt continuous delivery for ML code, model and data pipelines
  • Leverage unified monitoring, observability and AIOps for ML model and system
  • Define data engineering roles and workflows
  • Define model development roles and workflows
  • Define DevOps/SRE roles and workflows
  • Work on SRE, ML and data science skills development
  • Adopt technical and operating model retrospective
  • Establish continuous culture sessions

In the next blog post, we will take a deeper, more technical look at our Effective MLOps approach. We hope you enjoyed this piece, and stay tuned for the next one!

Machine Learning Architects Basel

Machine Learning Architects Basel (MLAB) is part of the Swiss Digital Network (SDN), which has extensive knowledge and experience in DevOps, SRE and AIOps for "classical" non-ML systems. In addition to our tried and tested effective SRE approach, we have also developed an effective MLOps approach. In this blog post, we presented our blueprint for a digital highway for end-to-end machine learning and effective MLOps.

It includes data, model and application pipelines, observability, incident management, but also operating models, training and cultural transformation as fundamental components. No matter where you stand now, we at MLAB and SDN will be able to support you in your digital journey. Whether you are completely new to ML and want to "start right" or have extensive experience but are looking to improve one aspect such as observability, we will be able to advise you.

If you are interested in learning more about how MLAB can help your business generate sustainable value by building and running reliable machine learning solutions, and supporting the implementation of a digital highway for your company, please do not hesitate to get in touch.
