Full-Spectrum ML Model Monitoring at Lyft

By Mihir Mathur and Jonas Timmermann

Machine Learning models at Lyft make millions of high-stakes decisions per day, from physical safety classification to fraud detection to real-time price optimization. Since these model-driven actions impact the real-world experiences of our riders and drivers as well as Lyft’s top and bottom line, it is critical to prevent models from degrading in performance and to alert on malfunctions.

However, identifying and preventing model problems is hard. Unlike errors in deterministic systems, which are easier to spot, model performance tends to degrade gradually, which is more difficult to detect. Model problems stem from diverse root causes, including:

  • Bugs in the caller service which passes wrong features or incorrect units to the model (garbage in results in garbage out)
  • Unexpected change in an upstream feature definition
  • Distribution changes of input features (Covariate Shift; see the brief sketch after this list)
  • Distribution changes of output labels (Label Shift)
  • Conditional distribution changes for output given an input (Concept Drift)
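
As a brief, purely illustrative sketch of the covariate shift case (not Lyft’s implementation), one way to flag a shifted input feature is to compare recently logged values against a training-time reference sample; the feature name and threshold below are hypothetical:

# Illustrative only: flag covariate shift by comparing a live sample of a feature
# against a reference sample from training, using a two-sample KS test.
from scipy.stats import ks_2samp

def covariate_shift_check(reference_values, live_values, p_threshold=0.01):
    """A small p-value suggests the two samples come from different distributions."""
    statistic, p_value = ks_2samp(reference_values, live_values)
    return {"statistic": statistic, "p_value": p_value, "shifted": p_value < p_threshold}

# e.g. covariate_shift_check(train_df["distance"], logged_df["distance"])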

One example of a model problem occurred when Estimated Times of Arrival (ETAs) fell in response to the decline in demand caused by COVID. The ETA models were retrained because, having been trained on historical demand, they were over-predicting ride times. This, however, caused other models that took ETAs as input to dramatically under-predict prices, revealing one of the many challenges we hadn’t originally anticipated.

In early 2020, motivated by examples like the one above and by the rapid influx of production ML models at Lyft, we embarked on a journey to build a robust system for identifying and preventing model degradation. In this post, we will discuss the variety of model monitoring approaches we developed and the cultural change needed to get ML practitioners to effectively monitor the models powering Lyft. We hope our learnings help other ML platform teams considering building (or buying) a model monitoring system to navigate this space.

The Challenges We Faced

  1. Prioritization among different monitoring techniques: There are several factors that affect the design of a model monitoring system: the type of model (online or offline), what to monitor (features or predictions), when to monitor (during training, during real-time inference, or on a schedule post-inference), and whether to consider samples by themselves or in a time-series context. Given limited engineering bandwidth, what exactly should we prioritize building to deliver the most impact?
  2. Technical challenges: The technical challenges included extremely tight latency requirements, long delays in getting the ground truth for some models, and building in a way that lets ML modelers configure and fine-tune alarms.
  3. Lack of standardization: Before introducing a platform solution, a few ML modelers at Lyft had built some form of custom monitoring for their models. Building one-off monitoring systems per model in a company with hundreds of models is akin to building dams on minor tributaries — there is duplication of work and no centralized visibility and control. By building a centralized model monitoring system we wanted to build a dam at the source — performant, flexible, and cost-effective.
  4. Adoption: A challenge we faced was getting all ML modelers at Lyft to effectively use the suite of model monitoring approaches in our centralized system as well as instrumenting existing models.

Strategy

Our general strategy for creating a comprehensive model monitoring system was to build in two phases. The focus of the first phase was on monitoring techniques that are quick to onboard models and agile enough to catch the most obvious model problems, i.e. Feature Validation and Model Score Monitoring. The second phase focused on building more powerful offline monitoring techniques, such as Performance Drift Detection and Anomaly Detection, which are capable of diagnosing complex problems.

(Note: We have a separate system for Data Quality monitoring, Verity, that is complementary to our model monitoring system, and is used for offline semantic correctness checks across column values of data tables.)

In this section, we’ll discuss the implementation of our different monitoring modes.

Model Score Monitoring

For every model scoring request made to our online model serving solution, LyftLearn Serving, the system emits the model output to our metrics system. We can then define time-series based alerts on this data.

Out of the box, the system checks that a model is not stuck emitting the same score over a period of time, which would indicate a potential upstream issue.

To reduce false positives, this query requires a minimum number of requests to the model (by default, ten requests over a period of one hour). Each team can parameterize the alert to their requirements and add their own alerts on the data, e.g. checking that the average model score is within a certain value range.
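
As a rough illustration (in practice this is a query against our time-series metrics system, not application code), the default stuck-score logic amounts to the following, with the thresholds mirroring the defaults above:

# Minimal sketch of the default "stuck score" check: alert only when the model
# received enough traffic in the window and every emitted score was identical.
def stuck_score_alert(scores_last_hour, min_requests=10):
    if len(scores_last_hour) < min_requests:
        return False  # too little traffic to judge; avoids false positives
    return len(set(scores_last_hour)) == 1

# e.g. stuck_score_alert([0.42] * 25) -> True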

The benefits of this approach are that checks are online and operate on time-series data, while the approaches discussed next only support one or the other. Another strength is that a base level of protection is provided without requiring any effort from the model owner. The downsides are that only data in the metrics system can be referenced (e.g. no ground-truth data) and that the data processing capabilities are limited.

Feature Validation

This technique validates the features of every prediction request online against a set of expectations on that data.

Typical data tests include:

  • Type checks — passing a string of “1” instead of the integer 1
  • Value ranges — -10 for a distance or 1000 for an age
  • Missing values for a required feature — “None” for city
  • Set membership — an unknown label for a categorical feature
  • Table expectations — checking that all the expected feature names are present

For defining those expectations we use the Great Expectations open-source library. A basic set of expectations can be automatically generated by running a profiler on the feature data set of the model. Further, expectations can be conveniently created with the library, typically in a notebook environment.

Here’s an example of a simple expectation set:

suite.expect_column_values_to_not_be_null('city', meta={'severity': 'ALERT'})
suite.expect_column_values_to_be_in_set('ride_type', {'regular', 'shared', 'xl'})
suite.expect_column_values_to_be_between('distance', 0, 100, meta={'condition': ['distance', '!=', None]})

The expectation suite is registered with our backend, synced to the model serving system, and then applied within our model monitoring library to validate every incoming request.

Model Monitoring Library

On top of Great Expectations we have built our own model monitoring library to tie the validation into our infrastructure and add features geared towards online feature validation including:

  • Tagging expectations with the severity of a violation, which determines whether model owners are paged right away or only after a threshold is exceeded
  • Conditional expectation support — a common scenario is to only perform range checks on a feature value when the feature has a numerical value and otherwise ignore null values
  • Integration with Lyft’s stats and logs systems
  • A data profiler geared towards the type of expectations relevant for our application
  • Abstractions for using other data validation libraries
  • A low-latency data validator — Great Expectations typically validates large datasets against expectation suites. For our online use case, in which we only validate a single feature set (row) at a time, an implementation with less overhead was required. Our lightweight validator reduced the latency by over 500x, to 0.1ms for a typical feature set. The validation is also performed asynchronously to make the latency impact on model scoring negligible (a simplified sketch follows this list).
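
Here is that sketch: a hypothetical illustration of per-request (single-row) validation with severity tags and conditional checks. The class and function names are illustrative, not the actual library API:

# Hypothetical sketch of lightweight per-request feature validation.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Expectation:
    column: str
    check: Callable[[Any], bool]   # returns True if the value is acceptable
    severity: str = "WARN"         # e.g. "WARN" or "ALERT" (page right away)
    skip_if_null: bool = False     # conditional expectation: ignore null values

def validate_request(features: dict, expectations: list) -> list:
    """Validate one feature set (a single row) and return any violations."""
    violations = []
    for exp in expectations:
        value = features.get(exp.column)
        if value is None and exp.skip_if_null:
            continue
        if not exp.check(value):
            violations.append({"column": exp.column, "value": value, "severity": exp.severity})
    return violations

# Expectations mirroring the suite shown earlier (values are illustrative).
expectations = [
    Expectation("city", check=lambda v: v is not None, severity="ALERT"),
    Expectation("ride_type", check=lambda v: v in {"regular", "shared", "xl"}),
    Expectation("distance", check=lambda v: 0 <= v <= 100, skip_if_null=True),
]

violations = validate_request({"city": None, "ride_type": "xl", "distance": 12.3}, expectations)
# -> [{'column': 'city', 'value': None, 'severity': 'ALERT'}]

In production, a validator like this would be invoked asynchronously so that it stays off the model-scoring critical path.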

Anomaly Detection

Anomaly detection is a technique that identifies potential model problems by analyzing statistical deviations of logged features and predictions over long periods of time. The calculation of aggregate metrics and the evaluation of deviations run on a schedule, typically daily. In our experience, the most indicative signals for potential problems are z-scores of over 2 (a simplified check is sketched after the list) for:

  • Call volume
  • Model score mean
  • Feature value mean
  • Feature null percentage
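
That check boils down to computing a z-score for each daily aggregate against its recent history. Here is a minimal illustrative sketch (the production job runs as a scheduled, distributed workflow, and the window size here is hypothetical):

# Minimal z-score check: compare today's daily aggregate (call volume, score mean,
# feature mean, null percentage, ...) against its trailing history.
import statistics

def z_score_anomaly(history, today, threshold=2.0):
    """history: the aggregate's daily values over a trailing window, e.g. 30 days."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return {"z": 0.0, "anomalous": today != mean}
    z = (today - mean) / stdev
    return {"z": z, "anomalous": abs(z) > threshold}

# e.g. z_score_anomaly([10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.1], 14.7)["anomalous"] -> True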

The biggest benefit of this approach is that statistical checks for all numerical features and model scores are performed automatically and require no onboarding from users. On the flip side, there tend to be many false positives, because features and predictions can deviate statistically without implying a problem with the models (e.g. on New Year’s Eve the rider_intents feature might be several standard deviations higher than in the week leading up to it). For this reason, we mostly consume the results in the form of reports rather than alerting the model owners.

Performance Drift Detection

Our most powerful model monitoring technique for catching performance degradation over time is Performance Drift Detection. In this offline monitoring approach, we retrieve arbitrary data, perform transformations on it, and then validate the output against expectations. This system is built using LyftLearn Distributed, which is powered by Kubernetes Spark and Fugue and was discussed in our last blog post. The most popular use case is to join model scores with their ground-truth data and calculate a performance metric on the result. We use the same model monitoring library as in Feature Validation to evaluate the output against expectation suites and emit relevant metrics and logs.

The user-provided parts for Performance Drift Detection are:

  1. SQL query for retrieving the ground truth and predictions from the data store
  2. Post-processing steps on the query output for computing the performance metrics
  3. A collection of expectations for those metrics

The following JSON snippet shows an example configuration with the mean error metrics generated in the SQL query and the squared error ones generated through post-processing:

{
  "drift_detection": {
    "associated_model_uuids": ["model_uuid_1"],
    "metrics": [
      "frac_dist_mean_error",
      "frac_time_mean_error"
    ],
    "data_gen": {
      "sql_path": "data.sql",
      "sql_type": "presto",
      "parameters": {
        "days_back": "7",
        "regions": "('LAX', 'SFO', 'SEA', 'CHI')"
      },
      "postprocess": {
        "frac_time_mean_square_error": {
          "processor": "mean_squared_error",
          "arguments": {
            "y_true": "time",
            "y_pred": "predicted_time"
          }
        },
        "frac_dist_mean_square_error": {
          "processor": "mean_squared_error",
          "arguments": {
            "y_true": "dist",
            "y_pred": "predicted_dist"
          }
        }
      }
    },
    "validation": {
      "rule_path": "drift_detection_rule.json"
    }
  }
}
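
To illustrate what the post-processing and validation steps amount to, here is a simplified sketch, assuming the SQL query returns a pandas DataFrame with the ground-truth and prediction columns named in the config above; the use of scikit-learn and the threshold value are illustrative, with the real expectations coming from the referenced rule file:

# Simplified sketch of the post-processing ("mean_squared_error" processors) and a
# trivial expectation check on the resulting metrics.
import pandas as pd
from sklearn.metrics import mean_squared_error

def compute_drift_metrics(df: pd.DataFrame) -> dict:
    return {
        "frac_time_mean_square_error": mean_squared_error(df["time"], df["predicted_time"]),
        "frac_dist_mean_square_error": mean_squared_error(df["dist"], df["predicted_dist"]),
    }

def validate_metrics(metrics: dict, max_mse: float = 0.05) -> list:
    """Return the names of metrics that violate a simple upper-bound expectation."""
    return [name for name, value in metrics.items() if value > max_mse]

# violations = validate_metrics(compute_drift_metrics(joined_scores_and_ground_truth))
# A non-empty list would page the model owners or surface on the dashboard.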

Similar to the anomaly detection workflow, the performance drift detection workflow is run on a schedule, typically daily. Model owners can be alerted about violations or check the performance of their production models on a dashboard.

This monitoring technique is the most effective at detecting intricate model performance issues. However, it requires high involvement from the model owner to provide the data query, transformation logic, and expectations on the output. An additional challenge to this approach is having reliable, timely ground-truth data in the first place.

When we started to develop a platform solution for monitoring all ML models in early 2020, there were no third-party solutions available that met all of our needs. The main drivers for building our own solution and leveraging open-source libraries were deep integration with our existing ML platform (e.g. ensuring a model is instrumented before we allow deployment) and avoiding lock-in with a commercial offering in this nascent field. As monitoring models was of high urgency and impact for Lyft, given how critical the models’ sustained performance is to our business, the engineering effort of building a robust solution was easy to justify.

In this section, we discuss some of the success stories and learnings from building and rolling out our Full-Spectrum Model Monitoring system.

Wins

  • Catching bugs: With our model monitoring system, many issues have been prevented before they hit production. Feature Validation examples include a required feature being omitted from the model scoring request due to a client code update, a high null rate for a feature because of an upstream data-pipeline issue, and a mismatch in a feature’s name between training and serving due to a typo. Many of these problems can be caught in staging before the model is deployed to production, unless the issue occurs while the model is already live.
  • Deprecating or retraining old models: The push to have all models monitored also shed light on older models that were not being actively used or which had performance deterioration. These models were either deprecated or retrained.
  • Adoption: Today, over 90% of our production models have Feature Validation and Model Score Monitoring, and 75% have Performance Drift Detection or Anomaly Detection. We have fired hundreds of alarms and caught over 15 high-impact issues in the nine months following the internal general availability of this system.
  • Testimonials: One of the most important aspects for a platform solution is user satisfaction and buy-in. Here are a few comments we received:

​​“MLP’s model monitoring made it a snap to set up input validation for our recommendation models. The documentation and example notebook were very helpful in onboarding, and we were able to register validation checks within four hours.” (Data Scientist)

“… I forgot to add new features to the feature list when registering the model and because I set null checks to those features it triggered the alert” (Data Scientist)

“We were able to deploy feature validation for three models in about two days. The proportional observability and reliability benefits are immense, though! Kudos to the MLP for making this such a smooth experience.” (Software Engineer)

“This feature provides exactly what our team needed for a long time. Now, all of our models are enabled in this framework which adds great observability for our systems.” (Data Scientist)

Learnings

  • Building great tools is only half the story. In order to drive adoption, we had to make significant cultural changes. For most data scientists, the operational concerns of running their model in production are typically not top of mind. Data scientists, however, know the expectations of their model’s features and predictions best and need to be closely involved in setting up monitoring. Therefore, we invested significant effort in making onboarding as smooth as possible and in selling the monitoring offerings internally through brown bags and by partnering directly with product teams. The last step, after seeing healthy organic adoption, was to make monitoring mandatory for all models going forward and to programmatically enforce it to ensure best practices are followed.
  • Automate as much as possible. Even better than providing a seamless onboarding experience is to remove it entirely. We auto-generated Feature Validation expectation suites by profiling logged data for old models since it was particularly difficult to get owners to go back and instrument them. Model Score Monitoring and Anomaly Detection are also automatically provided to all models.
  • Monitoring is a defensive investment. Trying to estimate the impact of a model monitoring system is akin to quantifying the value of a smoke detector. A fire is needed to realize the impact of a monitoring system: if the system is working well, it alerts on the first flare before it becomes an expensive problem. This can make it difficult to prioritize over initiatives that more immediately and demonstrably affect the bottom line.
  • Monitoring will reduce shipping velocity slightly. By enforcing monitoring for all models, there is an additional prerequisite in the productionization of models. The effort of setting up a model’s data expectations is marginal compared to the whole development cycle but still affects shipping velocity. We are convinced that this is a worthwhile trade-off between rigor and agility, especially at the scale of Lyft’s ML usage and business overall.

Acknowledgements

We would like to thank the following team members of the Machine Learning Platform team for their contributions to building Model Monitoring at Lyft:

Jonas Timmermann, Yang Zhang, Han Wang, Shiraz Zaman, Craig Martell, Willie Williams, Mihir Mathur, Anindya Saha, Konstantin Gizdarski, Sallie Walecka, Vinay Kakade, Drew Hinderhofer, Hakan Baba

If you’re interested in working with us on building state of the art ML systems or solving other complex challenges to create the world’s best transportation service, we’d love to hear from you! Visit www.lyft.com/careers to see our openings.
