According to Reuters, ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history. While this is phenomenal news, ChatGPT also crashed repeatedly, struggling with the usage load. OpenAI then announced a $20 monthly subscription, saying it would provide a more stable and faster service and the opportunity to try new features first. However, further outages reportedly affected even subscribers who had paid for access to the new version, GPT-4.

When they hear “machine learning system,” many people think of just the ML algorithms being used, such as the now very popular Large Language Models (LLMs) behind ChatGPT. However, the algorithms and models are only a small part of an ML system. The system also includes the business requirements that gave birth to the ML project in the first place, the interface where users and developers interact with the system, the data stack, and the logic for developing, monitoring, and updating your models, as well as the infrastructure that enables the delivery of that logic.

Figure 1: Different components of an ML system. “ML algorithms” is usually what people think of when they say machine learning, but it’s only a small part of the entire system. [Image adapted from: Overview of Machine Learning Systems by O'Reilly]

Machine learning systems are both complex and unique, and in this blog post, which is part of our Next Generation Data & AI Journey powered by MLOps series, we will explain why. We will consider both System Design & Continuous Delivery for ML as essential aspects for achieving reliable and scalable machine learning deployments. As organizations increasingly rely on data-driven decision-making, ensuring the reliability and performance of ML models becomes more and more critical. By adopting Site Reliability Engineering (SRE) practices and continuously improving the delivery of ML models, we can create robust systems that drive real value.

On our journey so far, we covered data reliability engineering as well as model pipelines. If you have already read our blog post about the complete digital highway, you will know that another essential part of end-to-end MLOps approaches relates to the continuous delivery of the outputs from these pipelines. This blog post explores various topics related to system design, deployment and release strategies, and continuous delivery for ML.

We hope to improve your understanding of the best practices and strategies for implementing continuous delivery and ensuring the reliability of your ML applications. Let's explore how we can unlock the full potential of data, machine learning, and artificial intelligence for your organization!

Unleashing the Power of ML System Design: Strategies for Scalability, Flexibility, and Accuracy

In this section, we will delve into system design principles for machine learning and explore the importance of building scalable, flexible, and reliable ML solutions. With a strong foundation in system design, you can ensure that your ML projects deliver accurate results and remain efficient and robust as they evolve. Moreover, a well-designed system can improve the accuracy of ML models and significantly reduce training time, allowing you to train a model in a fraction of the time it would take with a less optimized design.

System design is a crucial aspect of any software development process, and it plays an equally important role in machine learning projects. The primary goal of system design is to create a well-organized, reliable, scalable, and maintainable system that can adapt to changing requirements and handle increasing workloads. Some fundamental principles of system design include modularity, abstraction, and separation of concerns, which help create flexible and easy-to-understand systems. Implementing these principles in your ML projects can lead to faster training times and more accurate models, making it more feasible to iterate and refine your solutions continuously without excessive resource consumption.
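
As a minimal, hypothetical sketch of what these principles can look like in code, the example below separates preprocessing and model concerns behind small interfaces so that each component can be developed, tested, and replaced independently:

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class Preprocessor(Protocol):
    def transform(self, raw_rows: Sequence[dict]) -> list[list[float]]: ...


class Model(Protocol):
    def predict(self, features: list[list[float]]) -> list[float]: ...


@dataclass
class ScoringPipeline:
    """Composes independent components; any of them can be swapped out in tests."""
    preprocessor: Preprocessor
    model: Model

    def score(self, raw_rows: Sequence[dict]) -> list[float]:
        features = self.preprocessor.transform(raw_rows)
        return self.model.predict(features)
```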

It is important to emphasize that system design should be addressed early in the development life cycle, ideally during the requirements-gathering phase. By considering system design from the beginning, you can ensure that your ML solution meets desired requirements such as performance and reliability. System design is an iterative process, and you may need to revise the design over time. This allows you to adapt to changing needs and requirements, ensuring that your solution remains relevant and efficient.

However, there is, of course, still a significant difference between traditional software and machine learning. The following figure illustrates this:

Figure 2: Machine learning is an approach to (1) learn (2) complex patterns from (3) existing data and use these patterns to make (4) predictions on (5) unseen data.

Machine learning projects often deal with large volumes of data and complex algorithms, making scalability and flexibility essential. As the amount of data or the complexity of the models increases, a well-designed system should be able to handle the additional workload without compromising performance or reliability. Similarly, as the number of users grows - or potentially sees spikes - we want our systems to handle these changing workloads as well as possible. Additionally, a flexible system design allows for the seamless integration of new features, tools, or techniques, ensuring that your ML solution stays up-to-date with the latest advancements in the field. However, it is essential to recognize that while scalability and flexibility are crucial, they often come at a cost. It is vital to strike the right balance between these aspects and system complexity.

Figure 3: Five key areas for robust system design.

To achieve a robust system design for your ML projects, consider the following best practices:

  • Modularization and Microservices: By breaking down the ML system into smaller, independent components or microservices, you can develop, test, and deploy each component separately. This modular approach simplifies the overall architecture, enables parallel development, and makes scaling and maintaining the system easier.
  • API-Driven Development: Adopt an API-driven approach to facilitate communication between system components. This allows for easier integration and collaboration between different teams working on the project and promotes the reusability and interoperability of services (see the sketch after this list).
  • Data Storage and Management: As data is the lifeblood of any ML project, ensure that your system design includes efficient and scalable data storage and management solutions. You can achieve this through distributed databases, data lakes, or other data storage technologies that support real-time processing and analytics. More and more companies recognize that a change in operating model and culture is required and are adopting approaches such as the Data Mesh. See also our previous blog post about robust data pipelines.
  • Reliability and Fault Tolerance: Design your system to be reliable, with the ability to recover quickly from errors or unexpected events. This can be accomplished through replication, load balancing, and auto-scaling, which help ensure your ML system remains available and performs well under various conditions.
  • Monitoring and Observability: Incorporate monitoring and observability tools into your system design to track the performance of your ML models and the underlying infrastructure. This allows for proactive identification and resolution of issues and helps ensure your system operates efficiently and reliably.
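
As an illustration of the API-driven point above, here is a minimal sketch of a model exposed behind an HTTP endpoint. It assumes FastAPI and a scikit-learn model serialized with joblib at a hypothetical path, and is meant as a starting point rather than a production-ready service:

```python
# Assumed dependencies: fastapi, uvicorn, scikit-learn, joblib
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="churn-model")       # hypothetical service name
model = joblib.load("model.joblib")      # hypothetical model artifact


class PredictionRequest(BaseModel):
    features: list[float]                # one feature vector per request


@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # scikit-learn expects a 2D array: one row per sample
    prediction = model.predict([request.features])
    return {"prediction": float(prediction[0])}
```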

Deployment and release strategies, and model serving

Basics of application deployment and release

Deployment usually refers to moving software from one controlled environment to another. Often this means moving it from a staging or testing environment into production. Release refers to making your software available to users. Usually, this goes hand-in-hand with the deployment but can sometimes include additional non-technical business considerations. For example, although a feature might be deployed, you might not yet want to make it available to your users.

When deploying and releasing a machine learning model, many of the challenges you face will be similar to those encountered during traditional software deployment. Some of these challenges include:

  • Scalability: As with classical software, ML models must scale efficiently to handle increasing workloads and user traffic while maintaining reliability. This can involve load balancing, auto-scaling, and distributing workloads across multiple servers or clusters.
  • Interoperability: Achieving seamless interaction between ML models and existing systems or applications can be challenging, as it may require compatibility with various data formats, APIs, and infrastructure components. This is similar to ensuring interoperability between new features or components and existing systems in classical software development.
  • Security: Ensuring data privacy and system security is crucial in both classical software and ML model deployments. This includes implementing secure data storage, access control, and communication protocols.
  • Configuration Management: Managing configurations for both classical software and ML models can be complex, as it involves keeping track of various settings, parameters, and dependencies. This can be addressed using configuration management tools and best practices.

Strategies for effective deployment and release

There are some common strategies for the effective and reliable deployment and release of software systems and ML models (see also this link for more details or contact us if you want to know more):

  • Canary Releases: Canary releasing is a technique that involves deploying a new version of your software or ML model to a small percentage of your user base, monitoring its performance, and gradually increasing the rollout to more users as confidence in the new version grows. This helps to mitigate the risk of deploying faulty or poorly performing models, allowing you to identify and resolve issues before they affect a more significant portion of your users.
  • Blue/Green Deployments: In a blue/green deployment, two identical production environments are maintained, with one hosting the current version of the application (blue) and the other hosting the new version (green). Traffic is gradually (or all at once) shifted from the "blue" environment to the "green" environment, allowing you to test and monitor the new version under real-world conditions before entirely switching over. If issues arise, you can quickly revert to the previous version by redirecting traffic back to the blue environment.
  • Feature Flags: Feature flags let you selectively enable or disable features or ML models for specific users, user groups, or under certain conditions. This approach allows you to test new features or models in a controlled environment, gradually roll out changes to your user base, and quickly roll back features or models if issues arise. By implementing feature flags in your deployment and release strategy, you can minimize the risk of launching new functionality and ensure a smoother user experience (see the routing sketch after Figure 4 below).
  • Shadow models and A/B testing: Utilize A/B testing or shadow models to optimize model performance and mitigate risks associated with deploying new model versions. A/B testing involves exposing distinct model versions to different user groups and comparing their performance. In contrast, shadow models run alongside the current model without affecting the user experience, processing the same input data and making predictions for evaluation purposes. These approaches enable you to test and assess new models in real-world environments, gather performance metrics, and identify potential issues before committing to full deployment. Based on the results, you can confidently promote the new model using a controlled deployment strategy, such as canary releases or blue/green deployments.

Figure 4: A simple illustration of a blue/green deployment by Harness.
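
To make the canary and feature-flag ideas more concrete, the following minimal sketch (all names and thresholds are illustrative) routes a configurable percentage of users to a candidate model, hashing the user ID so that each user consistently sees the same version:

```python
import hashlib

CANARY_PERCENTAGE = 5  # share of users routed to the new version; increase gradually


def in_canary(user_id: str, percentage: int = CANARY_PERCENTAGE) -> bool:
    """Deterministically assign a user to the canary bucket so routing stays stable."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percentage


def route_prediction(user_id: str, features, stable_model, candidate_model):
    """Send a small, consistent slice of traffic to the candidate model."""
    model = candidate_model if in_canary(user_id) else stable_model
    return model.predict([features])
```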

ML-specific challenges and best practices

In addition to the challenges faced during the deployment of classical software, deploying ML models presents several unique challenges that need to be addressed:

  • Architecture: ML models often require specialized hardware, such as GPUs or TPUs, for efficient inference. Switching between GPU or TPU types may be more complex than changing to instances with different CPU types. These resources are typically more expensive than traditional CPU resources. To optimize costs, you should spin up these resources on-demand when needed rather than running them continuously.
  • Post-processing and Filtering: The raw output from an ML model may not always be suitable for direct delivery to users. Sometimes it may be necessary to filter the result to remove sensitive, offensive, or otherwise unwanted content. For example, in medical diagnosis applications, you may not want to provide the diagnosis directly to the user but instead refer them to a specialist for further evaluation and consultation.
  • Model Shelf Life: Unlike traditional software, the effectiveness of ML models can diminish over time due to factors like data or model drift. This means that models may need to be continuously re-trained and updated to maintain accuracy and performance. It's essential to monitor and detect these changes quickly and have a robust re-training and deployment pipeline. For more information on detecting and addressing model drift, refer to this previous blog post.
  • Model Explainability: ML models, especially deep learning models, can sometimes act as black boxes, making it difficult to understand the reasoning behind their predictions. Ensuring that your models are explainable and transparent is crucial for building trust and confidence in the system. This may involve using techniques such as LIME or SHAP to provide insights into the model's decision-making process.
  • Latency and Resource Efficiency: Inference for ML models can sometimes be resource-intensive and time-consuming, especially for complex models or large datasets. Optimizing the model's architecture, using model compression techniques, or leveraging edge computing can help reduce latency and improve resource efficiency during deployment.
  • Monitoring Model Performance: Tracking the performance of deployed ML models is essential to ensure that they continue to meet the desired level of accuracy and quality. This may involve setting up custom performance metrics, alerts, and dashboards to monitor model performance in real time and take corrective actions when necessary.
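
As one concrete way to monitor for drift, the sketch below computes the Population Stability Index (PSI) for a single numeric feature, comparing recent production data against the training distribution; the bin count and the 0.2 warning threshold are common but ultimately illustrative choices:

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a recent production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero and log of zero for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Illustrative usage: a shifted production distribution raises the drift score
training_sample = np.random.normal(0.0, 1.0, 10_000)
production_sample = np.random.normal(0.4, 1.0, 10_000)
psi = population_stability_index(training_sample, production_sample)
if psi > 0.2:  # widely used, but ultimately arbitrary, warning threshold
    print(f"Possible drift detected (PSI={psi:.2f}); consider re-training")
```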

By addressing these ML-specific challenges during the deployment process, you can ensure that your machine learning models integrate seamlessly into existing systems and deliver reliable and high-quality results to users. To maximize the chances of success, consider following some general best practices:

  • Implement a robust deployment pipeline: A streamlined deployment pipeline is crucial for managing the complexities of ML applications and ensuring smooth updates and rollouts.
  • Adopt containerization: Containerizing your ML models can simplify deployment, improve scalability, and ensure consistency across environments, making it easier to manage dependencies and share resources.
  • Monitoring and Observability: Continuously monitor the performance of deployed models to detect drift and other issues early. Use custom performance metrics, alerts, and dashboards to stay informed about your models' performance in real time. For more information on this topic, refer to our previous blog post.
  • Continuous Training: Keep your models up-to-date by continuously updating and re-deploying them based on a schedule, the availability of new data, or in response to detected issues. This helps ensure that your models stay relevant and accurate over time.
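
A continuous-training setup can start as a scheduled job that simply decides whether re-training is warranted. The sketch below (with illustrative thresholds) combines model age, a drift signal such as the PSI shown earlier, and the amount of newly labelled data:

```python
from datetime import datetime, timedelta


def should_retrain(
    last_trained: datetime,
    drift_score: float,
    new_labelled_rows: int,
    max_model_age: timedelta = timedelta(days=30),  # illustrative policy
    drift_threshold: float = 0.2,                   # illustrative threshold
    min_new_rows: int = 10_000,                     # illustrative threshold
) -> bool:
    """Trigger re-training on a schedule, on drift, or when enough new data arrives."""
    too_old = datetime.utcnow() - last_trained > max_model_age
    drifted = drift_score > drift_threshold
    enough_data = new_labelled_rows >= min_new_rows
    return too_old or drifted or enough_data
```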

Model Serving

Once your ML models are deployed, it is crucial to decide how they are made available to users. There are at least two common ways in which model serving is implemented:

  • Integrated Model Serving: In this approach, the ML model is integrated directly into an existing application. The model is loaded into memory, which allows for fast inference. This approach is usually more straightforward than the alternatives. However, it also means that this application is the only one with direct access to the model. This tight coupling can be an appropriate strategy for some systems, depending on the use case and the requirements.
  • Microservice-based Model Serving: In this approach, the ML model is deployed as a standalone service that other services can query through an exposed API. This introduces less coupling and allows the model service and other applications to be changed independently. However, this approach introduces additional network latency and complexity, which need to be carefully managed.
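
The practical difference between the two approaches shows up in how a prediction is obtained. The sketch below contrasts an in-process call with a request to a standalone model service; the endpoint URL and payload shape are hypothetical, and the requests library is assumed:

```python
import requests


def predict_integrated(model, features: list[float]) -> float:
    """Integrated serving: the model object lives in the application's memory."""
    return float(model.predict([features])[0])


def predict_microservice(features: list[float]) -> float:
    """Microservice serving: the model sits behind an HTTP API (hypothetical endpoint)."""
    response = requests.post(
        "http://model-service:8080/predict",  # hypothetical service URL
        json={"features": features},
        timeout=2.0,                          # network calls need explicit timeouts
    )
    response.raise_for_status()
    return float(response.json()["prediction"])
```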

We can also make a distinction between batch-based and stream-based deployments. In batch deployments, data processing and storage occur on aggregated data. Features are retrieved from batch storage, the model carries out inference on the entire dataset, and the results are subsequently saved back to batch storage. This process is typically scheduled but can also be adapted for real-time applications by transferring inference results to real-time storage. Stream-based deployments, on the other hand, conduct inference on incoming data as it flows through a streaming pipeline, with results being written back to a streaming system. Applications can subscribe to the results system, enabling them to access and use the results as soon as they become available.
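
A batch deployment often boils down to a scheduled scoring job along the lines of the following sketch, which assumes pandas and a joblib model artifact; the Parquet paths are hypothetical placeholders for your batch storage:

```python
import joblib
import pandas as pd


def run_batch_scoring(
    features_path: str = "s3://bucket/features/latest.parquet",   # hypothetical path
    output_path: str = "s3://bucket/predictions/latest.parquet",  # hypothetical path
    model_path: str = "model.joblib",                             # hypothetical artifact
) -> None:
    """Score an entire feature table and write the results back to batch storage."""
    model = joblib.load(model_path)
    features = pd.read_parquet(features_path)
    features["prediction"] = model.predict(features)
    features.to_parquet(output_path, index=False)
```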

To ensure reliable and efficient model serving, consider the following topics:

  • Containerization: Package your ML models in containers (e.g., using Docker) to simplify deployment, improve scalability, and ensure consistency across environments. Containers make managing dependencies and sharing resources easier, which is particularly important for ML models that often have specific library requirements.
  • Load Balancing: Distribute incoming requests for your ML models across multiple instances to ensure that no single instance becomes overloaded. This can help maintain low latency and high throughput even as the number of requests increases. Load balancing can be achieved using tools like Kubernetes or other cloud-based services that provide auto-scaling and load balancing features.
  • Caching: Cache the results of common queries to reduce the load on your ML models and improve response times. Caching can be particularly effective for models that receive many similar or identical requests. Implementing a caching layer in front of your model serving infrastructure can help avoid unnecessary computations and improve overall system performance (see the sketch after this list).
  • Monitoring, Observability, and Site Reliability Engineering (SRE): Continuously monitor the performance of your deployed models and their serving infrastructure while adopting SRE practices to ensure reliability, scalability, and maintainability. Set up custom performance metrics, alerts, and dashboards to track key performance indicators (KPIs) such as latency, throughput, and error rates. This information can help you identify bottlenecks, detect issues early, and optimize your model serving setup.
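
As a small example of the caching point above, repeated identical requests can be answered from an in-memory cache instead of re-running inference. The sketch below wraps an arbitrary model with Python's functools.lru_cache and assumes predictions are deterministic for a given feature vector:

```python
from functools import lru_cache


class CachedModel:
    """Wraps a model so that identical requests are served from an in-memory cache."""

    def __init__(self, model, maxsize: int = 10_000):
        self._model = model
        # lru_cache requires hashable arguments, hence the tuple of floats below
        self._predict = lru_cache(maxsize=maxsize)(self._predict_uncached)

    def _predict_uncached(self, features: tuple[float, ...]) -> float:
        return float(self._model.predict([list(features)])[0])

    def predict_one(self, features: tuple[float, ...]) -> float:
        return self._predict(features)
```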

SRE focuses on building and operating large-scale, highly available systems by applying principles from software engineering. Incorporate SRE principles, such as defining Service Level Objectives (SLOs) and Service Level Indicators (SLIs), to set clear expectations for your system's performance and reliability. Additionally, implement practices like error budgeting and automated testing to identify and address potential issues proactively, ultimately improving the overall stability and resilience of your ML model serving infrastructure.
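
To make SLOs and error budgets more tangible, the following sketch computes how much of an error budget has been consumed from request counts and an availability SLO; the 99.9% target and the example numbers are illustrative assumptions:

```python
def error_budget_report(total_requests: int, failed_requests: int, slo: float = 0.999) -> dict:
    """Compare the measured SLI against the SLO and report error-budget consumption."""
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1.0 - slo) * total_requests   # the error budget, in requests
    budget_used = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "sli": round(sli, 5),
        "slo": slo,
        "error_budget_used": round(budget_used, 2),   # 1.0 means the budget is exhausted
        "slo_met": sli >= slo,
    }


# Illustrative numbers: 1M requests and 700 failures against a 99.9% availability SLO
print(error_budget_report(1_000_000, 700))
```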

By carefully choosing the appropriate model serving approach and implementing these best practices, including monitoring, observability, and SRE, you can ensure that your machine learning models are deployed efficiently and deliver high-quality results to users on time.

Rollbacks and Error Handling in ML Deployments

If something goes wrong during the deployment of your ML models, it is crucial to have a well-defined strategy for handling errors and rolling back to previous, stable versions of your system. Consider the following best practices to manage rollbacks and minimize the impact of errors in your ML applications:

  • Model Testing: Before deploying a new version of an ML model, ensure that it has been rigorously tested using various validation techniques, such as cross-validation, holdout sets, and performance metrics. This helps to minimize the risk of issues arising in production and ensures that your model meets the desired level of accuracy and quality.
  • Version Control and Model Management: Maintain a robust version control system for your ML models and associated artifacts, such as data preprocessing scripts and feature engineering code. This enables you to easily track changes, identify the cause of issues, and revert to previous versions when needed.
  • Automated Rollback Processes: Implement automated rollback processes within your deployment pipeline, allowing you to quickly revert to a previous version when needed. By automating the rollback process, you can minimize the downtime and disruption caused by issues that may arise during deployment (see the sketch after this list).
  • Clear Documentation of Rollback Procedures: Clearly document your rollback procedures, ensuring that your team is familiar with the process and can execute it efficiently when required. This includes outlining the steps needed to revert to a previous version, and any necessary checks and validations to ensure that the rollback has been successful.
  • Plan for Failure: Embrace the possibility of failure and develop contingency plans for different scenarios. This includes identifying potential risks, creating mitigation strategies, and regularly reviewing and updating your plans to ensure they remain relevant, effective, and practical.
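
An automated rollback can be as simple as a post-deployment health check that re-points traffic to the previous version when the error rate exceeds a threshold. The sketch below only outlines the control flow; the traffic-switch and monitoring hooks are hypothetical callables you would wire to your own platform:

```python
import time
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    previous_version: str,
    route_traffic_to: Callable[[str], None],      # hypothetical traffic-switch hook
    current_error_rate: Callable[[str], float],   # hypothetical monitoring query
    error_rate_threshold: float = 0.02,           # illustrative rollback criterion
    observation_window_s: int = 300,              # watch the new version for 5 minutes
) -> bool:
    """Shift traffic to the new version, observe it, and revert automatically on failure."""
    route_traffic_to(new_version)
    deadline = time.time() + observation_window_s
    while time.time() < deadline:
        if current_error_rate(new_version) > error_rate_threshold:
            route_traffic_to(previous_version)    # automated rollback
            return False
        time.sleep(30)
    return True
```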

By implementing these best practices for managing rollbacks and error handling, you can minimize the impact of issues that may arise during the deployment of your ML models and maintain a high level of reliability and performance in your ML applications. It is important to note that the effectiveness of these best practices relies on the ability to detect failures in the first place. This is where observability and SRE practices come into play, in conjunction with an alerting strategy, to ensure that issues can be detected promptly and relevant team members can be notified.

By proactively monitoring your system, setting up comprehensive observability, and implementing SRE practices, you create a robust infrastructure capable of rapidly identifying and addressing problems. When combined with an efficient alerting strategy, your team will be well-prepared to initiate rollbacks if necessary, minimizing downtime and ensuring the continued performance of your ML applications.

Continuous Delivery for Machine Learning

Ultimately, we want to automate "everything" described in this blog post, from the data pipeline to the final release. We call this Continuous Delivery (CD). It is a modern software engineering approach that aims to make the process of releasing new features, bug fixes, and updates as automated and streamlined as possible, and it plays a crucial role in achieving reliable machine learning deployments. The goal is to automate as many steps as possible, including the building, testing, and deployment of new versions.

Implementing continuous delivery for machine learning provides significant benefits:

  • Improved quality and reliability: Automated testing and deployment processes help to minimize the risk of errors and ensure that your ML models are of high quality, performance, and reliability.
  • Faster time-to-market: By automating the delivery process, you can release new features, updates, and improvements to your ML models more quickly, allowing your organization to respond more effectively to changing market conditions and requirements.
  • Enhanced collaboration: Continuous delivery encourages collaboration between data engineers, ML scientists, software engineers, and other stakeholders, fostering a culture of shared responsibility for the quality and success of ML models.
  • Greater adaptability: With a continuous delivery pipeline, it becomes easier to integrate new features, tools, and techniques into your ML models, ensuring that your system stays up-to-date with the latest developments in the field.

Throughout this blog post, we described best practices for the implementation of CD, such as using pipelines, version control, testing, monitoring and observability, SRE, and collaboration within and across teams.

Machine Learning Architects Basel

Machine Learning Architects Basel (MLAB) is a member of the Swiss Digital Network (SDN). We have created an effective MLOps framework that combines our expertise in DataOps, Machine Learning, and MLOps with our extensive knowledge and experience in DevOps, SRE, and agile transformations.

If you want to learn more about how MLAB can aid your organization in creating long-lasting benefits by developing and maintaining reliable data and machine learning solutions, don't hesitate to contact us.

References and Acknowledgements

  1. The Digital Highway for End-to-End Machine Learning & Effective MLOps
  2. Intro to Deployment Strategies: Blue-Green, Canary, and More