In our last blog post, we elaborated on some of the challenges that slow down the adoption of machine learning (ML) in the healthcare sector. We also pointed out that to generate sustainable value for both patients and physicians, all stakeholders must be willing to go beyond the proof-of-concept stage and "walk the talk". From a technological perspective, building an ML model becomes secondary, while making it reliable and operationalizing it (a.k.a. MLOps) takes center stage.

ML for healthcare requires involving healthcare professionals and MLOps

Machine learning engineers and data scientists are excellent at creating models that make accurate predictions based on real-world data. However, their work often stops once a model is built. Unfortunately, this means that most models (some claim 90% [1, 2]) never get deployed into production. This happens because building a model is only a small component of developing a production-ready ML system. MLOps is the discipline of developing, deploying, and running ML systems reliably by combining practices from machine learning, data engineering, and DevOps (see what current MLOps approaches are missing in our blog post on effective MLOps). Figure 1 illustrates the lifecycle of a machine learning project and highlights to what degree healthcare professionals (HCPs) should be involved in the various stages. MLOps, as we will see in the following paragraphs, is a vital ingredient for successfully planning and executing an ML initiative in healthcare.

Figure 1 | In the lifecycle [3] of a clinical ML project, healthcare professionals (HCPs) should be involved in many stages. The number of physician icons indicates how important HCP involvement is at each stage.

Define Use Cases

In the beginning, two worlds collide: the technology-oriented world of machine learning and the care-centered world of medicine. One of the biggest challenges in this phase is to find a common language. Typically, the ML engineer has limited knowledge of the medical domain, and the HCP is not familiar with existing ML methods and ML jargon. It can’t be stressed enough that both sides must continuously strive to be understood by the other. For instance, an ML engineer uses the terms training/validation/test data set, whereas a clinical researcher might be more familiar with the terms derivation and validation cohort. Without clarifying questions, it would be easy to miss that the “validation cohort” corresponds to the “test data set” (not the validation set). Both sides must be willing to learn from each other.

Once a common language is established, it is crucial for the ML engineer to understand the clinical problem and how success will be measured. Ideally, the measure of success can be translated into metrics such as sensitivity, specificity, AUROC, or other performance metrics. This is often the case for patient-focused ML applications such as risk stratification or diagnosis prediction. In other cases, success may be of an operational nature and measured in reduced wait times or fewer hours spent on documentation.
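
As a rough illustration of what this translation can look like in practice, the sketch below computes sensitivity, specificity, and AUROC with scikit-learn once predictions and ground-truth labels are available. The example data and the 0.5 decision threshold are assumptions for illustration only, not part of any specific project.

```python
# Minimal sketch: translating "success" into quantitative metrics.
# Assumes binary labels (1 = condition present) and predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])                   # ground-truth labels (illustrative)
y_prob = np.array([0.1, 0.8, 0.6, 0.3, 0.9, 0.2, 0.4, 0.7])   # model probabilities (illustrative)

y_pred = (y_prob >= 0.5).astype(int)                           # assumed decision threshold of 0.5
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # a.k.a. recall / true positive rate
specificity = tn / (tn + fp)   # true negative rate
auroc = roc_auc_score(y_true, y_prob)

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}, AUROC: {auroc:.2f}")
```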

Additionally, the workflow/process supported by the model’s predictions must be understood. Like in many software engineering projects, user acceptance (read: the project's success) depends on a solid understanding of the processes and culture in which the solution is employed. In healthcare, this is particularly important. Solving the problem algorithmically (i.e., achieving high predictive performance) is the first step. The real challenge is understanding how deploying the ML solution affects workflows, physicians, and patients.

Explore Data

Models can only be as good as the data on which they are trained. Therefore, it is imperative to carefully think about what kind of data, and how much of it, is needed to start building good models. These factors will heavily depend on the use case, and it is vital to draw on the expertise of the involved parties, experience from previous similar use cases, or external resources such as published literature.

Once the type and amount of required data have been estimated, one can evaluate what data could feasibly be accessed. This might include several distinct data sources, such as data collected in-house (e.g., EHR, PACS, PDMS) or publicly available data sets. Even if data exists, it might not be readily accessible due to various data protection rules: patients might not have provided consent for their data to be used, or data might have to be anonymized before use. One also needs to consider overall data quality; data from distinct sources might vary in quality (e.g., the number of missing data points) and sampling frequency.

Automating as much of this work as possible is essential to building a robust MLOps data pipeline. This pipeline should aggregate data from multiple sources, version it, anonymize it, and check its quality. It should also allow for the easy integration of new data collected during development or in production use.
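
As a hedged sketch of one such pipeline stage, the snippet below merges two hypothetical in-house extracts, applies a simple hash-based pseudonymization step, and flags columns with too many missing values. The column names, the salted-hash approach, and the quality threshold are illustrative assumptions; a real pipeline would also need versioning and a proper anonymization scheme.

```python
# Minimal sketch of one stage in an automated data pipeline:
# aggregate, pseudonymize, and quality-check incoming records.
import hashlib
import pandas as pd

def pseudonymize(patient_id: str, salt: str = "project-salt") -> str:
    """Replace a direct identifier with a salted hash (illustrative, not a full anonymization scheme)."""
    return hashlib.sha256((salt + patient_id).encode()).hexdigest()[:16]

def quality_report(df: pd.DataFrame, max_missing_ratio: float = 0.2) -> dict:
    """Flag columns whose share of missing values exceeds an assumed threshold."""
    missing = df.isna().mean()
    return {col: ratio for col, ratio in missing.items() if ratio > max_missing_ratio}

# Hypothetical extracts from two in-house systems (e.g., EHR and PDMS).
ehr = pd.DataFrame({"patient_id": ["A1", "A2"], "age": [64, None], "hb": [13.1, 11.8]})
pdms = pd.DataFrame({"patient_id": ["A1", "A2"], "lactate": [1.2, 3.4], "hb": [13.0, None]})

merged = ehr.merge(pdms, on="patient_id", suffixes=("_ehr", "_pdms"))
merged["patient_id"] = merged["patient_id"].map(pseudonymize)

print(quality_report(merged))   # e.g., columns with more than 20% missing values
```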

Evaluate

There are many ways to evaluate models. The most relevant aspects will have been identified during the use case definition, but others will likely also be of interest. For example, in addition to evaluating how accurate a model's predictions are, one might want to assess its latency. Can predictions be made in near real-time (while a patient is still with the clinician), or do they take substantially longer (for example, because the input data requires extensive pre-processing in the data pipeline)?
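
Latency can be assessed with a simple timing harness like the sketch below. The `predict_fn` and `sample_input` names are placeholders for whatever the actual pipeline provides; the run count is an arbitrary assumption.

```python
# Minimal sketch: measuring end-to-end prediction latency percentiles.
import time
import numpy as np

def measure_latency(predict_fn, inputs, n_runs=100):
    """Time repeated predictions and report typical (p50) and tail (p95) latency."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(inputs)
        timings.append(time.perf_counter() - start)
    return {"p50_ms": np.percentile(timings, 50) * 1000,
            "p95_ms": np.percentile(timings, 95) * 1000}

# Usage (placeholders): measure_latency(lambda x: model.predict(preprocess(x)), sample_input)
```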

Notably, the evaluation should be part of a robust model pipeline, which takes the data from the data pipeline, trains a model, computes performance metrics, and compares them to previously recorded values and/or the defined minimum performance criteria. An effective MLOps approach automatically tracks and stores the models and their performance metrics.
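
The sketch below shows one possible shape of such an evaluation gate: newly computed metrics are checked against the agreed minimum criteria and against the previous run before a model is accepted. The thresholds and the JSON-file "history" are illustrative stand-ins for whatever experiment-tracking tool (e.g., MLflow) a team actually uses.

```python
# Minimal sketch of an evaluation gate inside a model pipeline.
# Thresholds and the JSON-based history are illustrative assumptions.
import json
from pathlib import Path

MIN_CRITERIA = {"sensitivity": 0.90, "auroc": 0.85}   # assumed minimum performance criteria
HISTORY_FILE = Path("metric_history.json")            # stand-in for a real experiment tracker

def passes_gate(metrics: dict) -> bool:
    """Accept the model only if it meets minimum criteria and does not regress vs. the last run."""
    if any(metrics[name] < threshold for name, threshold in MIN_CRITERIA.items()):
        return False
    previous = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else None
    if previous and metrics["auroc"] < previous["auroc"]:
        return False
    return True

def record_run(metrics: dict) -> None:
    """Persist the metrics of an accepted run so future runs can be compared against it."""
    HISTORY_FILE.write_text(json.dumps(metrics))

metrics = {"sensitivity": 0.93, "auroc": 0.88}        # illustrative values from the current run
if passes_gate(metrics):
    record_run(metrics)
```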

Models must be robust and reliable to avoid failures or malfunctions, for example in the case of "outlier" input data. Ideally, a model's predictions should also be interpretable, as both clinicians and patients might be interested in how exactly the model arrived at a prediction and why that prediction might differ from the clinician's expert opinion.
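
As a hedged illustration of one interpretability technique, the sketch below computes a global view of feature influence via permutation importance on synthetic data. Explaining individual predictions, as discussed above, would require additional tooling (e.g., SHAP or LIME); the data and model here are purely illustrative.

```python
# Minimal sketch: a global view of which features drive a model's predictions.
# The synthetic data and logistic regression model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Permutation importance: how much does shuffling each feature degrade performance?
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")
```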

Iterate

If the model does not meet the defined minimum performance criteria, e.g., does not detect cancer in at least 90% of cases where it is present, or cannot handle outliers, we need to iterate. This means going back to one of the earlier steps and re-evaluating the data, the selected features, the choice of algorithm, or even the feasibility of the use case.

Present Results

Once the proposed solution reaches the performance metrics agreed upon in the use case definition, the results should be presented to a panel of medical experts and potential users. It is essential to go beyond a superficial performance analysis: to prevent a biased model from harming specific populations, a detailed subcohort analysis (e.g., stratified by sex, ethnicity, or age) is required. A model might perform well on the full patient population but sub-optimally on female patients. The outcome of this analysis could be that we should refrain from using the model, use different decision thresholds, or use a population-specific model for certain patients. Ideally, these considerations are already part of the use case definition and the evaluation step.
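
A subcohort analysis can be as simple as recomputing the chosen metrics per group, as in the hedged sketch below. The `sex` column, the example values, and AUROC as the reported metric are assumptions for illustration; a real analysis would cover all relevant strata and metrics.

```python
# Minimal sketch: stratified (subcohort) performance analysis.
# The group column and example values are illustrative assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

results = pd.DataFrame({
    "sex":    ["F", "F", "F", "M", "M", "M", "F", "M"],
    "y_true": [1, 0, 1, 0, 1, 0, 0, 1],
    "y_prob": [0.7, 0.4, 0.9, 0.2, 0.8, 0.3, 0.6, 0.55],
})

for group, subset in results.groupby("sex"):
    auroc = roc_auc_score(subset["y_true"], subset["y_prob"])
    print(f"AUROC for sex={group}: {auroc:.2f} (n={len(subset)})")
```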

Operationalize and Integrate Model: A model can't generate value in a vacuum

At this point, we have developed an ML model that is of clinical relevance, as it meets the success criteria defined during the (re)definition of our use cases. However, such a model can’t generate value in a vacuum. There are at least two ways to make use of it: we can either integrate its predictions into the software used in the existing clinical workflow (e.g., the EHR, radiology software, …) or build an application specific to our use case. The challenge of the first approach is that not all systems allow for easy integration of third-party services. Unfortunately, many systems prefer to keep their ecosystem closed, which, on the one hand, protects sensitive data from bad actors but, on the other hand, impedes innovation. Building a custom app might therefore sound like a “cleaner” path to production; however, it comes with its own challenges. A new app will almost certainly interrupt workflows that end-users are familiar with, affecting user satisfaction and acceptance. If the pain of switching to a different application exceeds the value that our ML model generates, users will stop using the application and simply return to the old way of doing things.
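
Whichever path is chosen, the model typically ends up behind a service interface that the EHR, a radiology viewer, or a custom app can call. The sketch below shows one minimal way to expose predictions over HTTP using FastAPI; the endpoint path, feature names, and the placeholder model are illustrative assumptions, and a real integration would additionally need authentication, auditing, and an interoperability layer (e.g., HL7 FHIR).

```python
# Minimal sketch: exposing a trained model as a prediction service with FastAPI.
# The feature names, endpoint path, and placeholder model are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PatientFeatures(BaseModel):
    age: float
    heart_rate: float
    creatinine: float

def load_model():
    """Placeholder for loading the actual trained model (e.g., via joblib or a model registry)."""
    class DummyModel:
        def predict_proba(self, rows):
            # Purely illustrative scoring logic standing in for a real model.
            return [[1 - min(row[0] / 100, 1.0), min(row[0] / 100, 1.0)] for row in rows]
    return DummyModel()

model = load_model()

@app.post("/predict")
def predict(features: PatientFeatures) -> dict:
    """Return the model's risk score for a single patient."""
    score = model.predict_proba([[features.age, features.heart_rate, features.creatinine]])[0][1]
    return {"risk_score": float(score)}

# Run with, e.g.: uvicorn service:app --reload  (assuming this file is saved as service.py)
```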

We don’t want to find ourselves in a situation where our model works great, but nobody wants to use it. Therefore, it is crucial to think about operationalization and integration from the very beginning of the project. For this, potential end-users and IT staff (to detect interoperability issues) should be involved, and the ML engineer must understand the relevant clinical processes (see step 1). Lastly, educated users are happy users, so providing training on all aspects of the model and application (e.g., reliability, robustness, interpretation of scores) is necessary to ensure proper and long-term usage of our ML solution.

Monitor and Manage Model

Monitoring is always an important part of effective MLOps, but when applying machine learning within the healthcare sector, it becomes even more important because the models might be involved in high-stakes decisions ("What kind of care should this patient receive?"). Model development in the real world, i.e., outside of research labs, is rarely a finished activity. Instead, models need to be continuously updated when bias is detected, or new data is collected.

We therefore need to take an active approach to monitoring and make sure any anomalies, faults, or errors are detected before end-users are impacted. Many models' performance degrades over time due to various kinds of "drift", which we need to detect and react to appropriately. One important aspect of continuous monitoring in healthcare is detecting bias: a model’s performance should not depend on features such as race, sex, or socio-economic status. A mature MLOps pipeline enables this observability, in addition to the classical monitoring of the health of the overall system. (If you are interested in observability for MLOps, keep an eye out for our upcoming blog post on that topic!)
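
As a hedged sketch, drift in a single input feature can be detected by comparing its recent distribution in production against a reference sample, here with a Kolmogorov-Smirnov test; bias monitoring can reuse the subcohort analysis shown earlier on production data. The significance threshold, the synthetic data, and the choice of test are illustrative assumptions.

```python
# Minimal sketch: detecting feature drift between training-time and production data.
# The significance threshold and synthetic data are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=1.0, scale=0.3, size=1000)    # e.g., feature values seen during training
production = rng.normal(loc=1.4, scale=0.3, size=500)    # recent values observed in production

statistic, p_value = ks_2samp(reference, production)
if p_value < 0.05:   # assumed alerting threshold
    print(f"Possible drift detected (KS statistic={statistic:.2f}, p={p_value:.4f}); trigger review/retraining")
else:
    print("No significant drift detected")
```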

Conclusion

It sounds simple but can’t be stressed enough: a project at the intersection of machine learning and healthcare won’t be a success if healthcare professionals are not involved from the very beginning. While it is easy to define ML use cases based on a superficial understanding of the field, their clinical relevance can be judged best by physicians and domain experts. Also, finding the best way to integrate a model’s predictions into daily clinical practice is a non-trivial undertaking that requires a co-creation approach involving all stakeholders. If we’re serious about improving patient care through ML, we must build robust, trustworthy, and usable systems, not only well-performing models.

Finally, a mature MLOps pipeline requires the cooperation of many different parties. This usually means that cultural challenges arise. At Machine Learning Architects Basel, we strongly believe that it is important to develop an MLOps culture from the beginning of a project and continuously foster and improve it. Ideally, this also means that — as end-users — clinicians and patients are involved in the continuous improvement of the pipeline by providing feedback and offering criticism that can be included in retrospectives.

References and Acknowledgements

  1. Stack Overflow: How to put machine learning models into production [link]
  2. VentureBeat: Why do 87% of data science projects never make it into production? [link]
  3. Adapted from the content of the MLOps (Machine Learning Operations) Fundamentals course [link]
  4. Doctor icons created by DinosoftLabs - Flaticon