How to detect drift with Evidently and MLFlow

Learn how to detect and monitor data drift in machine learning models using Evidently and MLflow. This blog provides a step-by-step tutorial on a mobile price prediction dataset, tracking and visualizing drift insights over time to help keep model performance consistent.


Data Drift

Data drift refers to a change in the patterns of data over time. In the context of machine learning, it happens when the statistical properties of the data a model sees in production change relative to the data it was trained on. A closely related phenomenon, concept drift, occurs when the relationship between the input features and the target variable changes.
This change in data patterns can lead to a degradation of model performance because the assumptions that the model learned during training no longer hold. For instance, a model trained to predict customer churn based on historical data may start to perform poorly if the behavior of customers changes significantly due to new market conditions or changes in the company's policies.
There are several types of data drift, each simulated in the short sketch after this list:

  1. Sudden Drift: This is when the data distribution changes abruptly. This could be due to a change in data collection, a change in policy, or a sudden shift in user behavior.
  2. Incremental Drift: This is a slow and gradual change in data distribution over time. It can be challenging to detect because it happens slowly.
  3. Seasonal Drift: This type of drift is predictable and cyclical. It's often found in data related to fields like retail, finance, and weather where there are regular and predictable changes.
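
To make these patterns concrete, here is a minimal sketch (entirely synthetic data, numpy only, all values hypothetical) that simulates each type of drift in a single feature:

import numpy as np

rng = np.random.default_rng(42)
n = 365  # one synthetic observation per day

# Sudden drift: the mean jumps abruptly halfway through.
sudden = np.concatenate([rng.normal(0, 1, n // 2), rng.normal(3, 1, n - n // 2)])

# Incremental drift: the mean creeps upward a little each day.
incremental = rng.normal(0, 1, n) + np.linspace(0, 3, n)

# Seasonal drift: a predictable 90-day cycle on top of the noise.
seasonal = rng.normal(0, 1, n) + 2 * np.sin(2 * np.pi * np.arange(n) / 90)

Plotting these three series makes the difference obvious: one sharp jump, one steady slope, and one repeating wave.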

Detecting data drift can be challenging because it requires constant monitoring of the model's input and output data. Some indicators of data drift include a decrease in model performance, an increase in the number of errors, or a change in the distribution of predictions.
MLflow is an open-source platform that helps manage the end-to-end machine learning lifecycle. It includes tools for experiment tracking, model packaging, reproducibility, deployment, and a central model registry. MLflow is designed to work with any machine learning library and algorithm, simplifying the management of ML projects. You can find more about MLflow on their official website.

Evidently

Keeping watch on data drift means continuously monitoring model performance and taking precautionary measures. Evidently is an open-source Python library that helps with most of this.
Evidently works with tabular and text data and supports the whole model lifecycle with its reports, tests, and monitoring.
For data-drift detection, Evidently ships a set of statistical tests with default thresholds chosen by feature type (numerical or categorical), and it also allows users to define custom drift detection methods and thresholds. It produces reports with feature-level as well as dataset-level drift insights, which can be visualized as HTML or consumed further as JSON. It also integrates with MLOps tools like Airflow, MLflow, and Metaflow.
In this blog, we perform data-drift analysis on a sample dataset and integrate the Evidently output with MLflow in a custom way.

Installation and Setup

First, we need to install and import the required libraries: numpy, pandas, evidently, mlflow, and datetime.
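
Note that the Dashboard / DataDriftTab API used below comes from the older Evidently releases (the 0.1.x line) and was removed in later versions, so pinning the Evidently version is advisable. A minimal environment, assuming that older API, could be installed as:

pip install numpy pandas mlflow "evidently<0.2"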


import numpy as np
import pandas as pd
from evidently.pipeline.column_mapping import ColumnMapping
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab
import mlflow
from mlflow.tracking import MlflowClient
from datetime import datetime

Dataset

For this experiment, let’s pick the mobile price prediction dataset with a limited set of features. battery_power, clock_speed, int_memory, mobile_wt, n_cores, and ram are continuous numerical features, whereas dual_sim and four_g are categorical. The dataset is divided into two equal halves: reference data (df_ref) and current data (df_curr).


df = pd.read_csv('mobile_price.csv')
df_ref = df.loc[:500, :]
df_curr = df.loc[500:, :].reset_index(drop=True)

A drift is introduced in the numeric features battery_power and ram, and in the categorical feature dual_sim of the current dataset.


df_curr['battery_power'] = df_curr['battery_power'] * 1.3 + 100
df_curr['ram'] = df_curr['ram'] * 0.8 - 50
# Flip dual_sim from 0 to 1 in the first 150 rows (assigning via chained .iloc would silently modify a copy)
df_curr.loc[:149, 'dual_sim'] = df_curr.loc[:149, 'dual_sim'].replace(0, 1)
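
Before running any statistical test, a quick optional sanity check confirms the shift is actually there, for example by comparing summary statistics across the two halves:

# Means of a shifted numeric feature in reference vs. current data
print(df_ref['battery_power'].mean(), df_curr['battery_power'].mean())

# Category proportions of the shifted categorical feature
print(df_ref['dual_sim'].value_counts(normalize=True))
print(df_curr['dual_sim'].value_counts(normalize=True))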

Code

Dataset and date variables are defined; they drive the naming of the drift reports and of the MLflow experiment and runs.


dataset = 'mobile_price'
date = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

Drift analysis can be done only on the features common to the reference and current datasets. The column mapping is also necessary so that suitable statistical tests are chosen for each feature; columns are mapped as numerical_features and categorical_features.
We use a Dashboard with the DataDriftTab to calculate covariate drift (i.e., changes in the distributions of the independent features). It requires the reference data, the current data, and the column mapping.


common_features = [feature for feature in list(df_ref.columns) if feature in list(df_curr.columns)]
column_mapping = ColumnMapping()
column_mapping.categorical_features = ['dual_sim', 'four_g']
column_mapping.numerical_features = ['battery_power', 'clock_speed', 'int_memory', 'mobile_wt', 'n_cores', 'ram']
covariate_drift_report = Dashboard(tabs=[DataDriftTab()])
covariate_drift_report.calculate(df_ref, df_curr, column_mapping=column_mapping)
covariate_output = list(covariate_drift_report.analyzers_results.values())[0]
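
The same Dashboard object can also render the interactive HTML report mentioned earlier. In the 0.1.x API this is a one-liner (the file name below is our own choice):

# Save the feature-level drift report as a standalone HTML file
covariate_drift_report.save(f'{dataset}_drift_report.html')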

The output of each statistical test (p_value) is compared with the significance level (0.05 in this case); if it is less than the significance level, the feature is considered drifted.


drifted_features = []
drift_p_value = {}
for key in list(covariate_output.metrics.features.keys()):
    p_val = covariate_output.metrics.features[key].p_value
    if p_val < 0.05:
        drifted_features.append(key)
        drift_p_value.update({key: round(p_val, 4)})

Output:

drifted_features : ['battery_power', 'ram', 'dual_sim']
drift_p_value : {'battery_power': 0.0, 'ram': 0.0, 'dual_sim': 0.0}

Integration with MLflow

An MLflow experiment is set with the dataset name. We can log parameters of the experiment such as the date/time, the dataset information, the features and their counts, and the results of the drift analysis, along with metrics like the percentage of drifted features; all of these are easy to extract from the Evidently output.


client = MlflowClient()
mlflow.set_experiment(f'{dataset} Drift')
with mlflow.start_run(run_name=dataset + date) as run:
    mlflow.log_param('date', date)
    mlflow.log_param('reference_data', 'df_ref')
    mlflow.log_param('current_data', 'df_curr')
    mlflow.log_param('n_features', covariate_output.metrics.n_features)
    mlflow.log_param('features', list(covariate_output.metrics.features.keys()))
    mlflow.log_param('n_drifted_features', covariate_output.metrics.n_drifted_features)
    mlflow.log_param('drifted_features', drifted_features)
    mlflow.log_param('drifted_features_p_vals', drift_p_value)
    mlflow.log_param('dataset_drift', covariate_output.metrics.dataset_drift)
    mlflow.log_metric('drifted_features_percent', covariate_output.metrics.share_drifted_features * 100)
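
Since the drift report was already saved as HTML, a natural extension (not part of the original run, but supported by MLflow out of the box) is to log that file as a run artifact, so the visual report travels with the logged parameters:

    # Inside the same `with mlflow.start_run(...)` block:
    mlflow.log_artifact(f'{dataset}_drift_report.html')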

MLflow Dashboard 

A new experiment gets created in MLflow, and the parameters and metrics are logged for each run. We can have different runs for different sets of data, and also for successive data cycles.

MLflow provides functionality to compare runs in tabular form as well as graphically, using scatter, contour, and parallel coordinate plots, making it easy to keep track of data quality and drift over time.
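
If the tracking server is local, the dashboard can be opened with the standard MLflow CLI command and browsed at the printed address (by default http://127.0.0.1:5000):

mlflow ui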

