In Data Engineer’s Lunch #11: MLFlow and Spark, we discussed using MLFlow, a machine learning management tool, with Apache Spark. This post acts both as a continuation of our previous series on Apache Spark Companion Technologies and as one of our Data Engineer’s Lunch events. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
MLFlow is an open-source machine learning management framework created by Databricks. It helps data scientists manage the machine learning lifecycle from training to deployment. MLFlow includes tools for logging and metadata management. It can also save artifacts from various instances of training and associate them with the metadata for those instances. In MLFlow’s internal terminology, a run is a single execution of machine learning training code, and an experiment is a collection of runs. These runs need not be identical; they can be passed different parameters. MLFlow’s architecture is split across a small number of services, each with its own API for interacting with that functionality.
MLFlow Tracking is the section of MLFlow that manages logging and metadata. It logs model parameters: the hyperparameters used to train the model. It also logs several run properties, including the code version, for model lineage. MLFlow additionally keeps track of start and end times for your code, as well as source information, metrics, and artifacts. Source information covers filenames within a project, as well as the project name and entry point (the command used to run the project). Metrics include results like accuracy and potentially even model-internal scores like the loss function tracked over time. Artifacts are things like models output from training, data files, and images generated by the code within a run.
MLFlow Projects is a format for packaging machine learning code in a reusable way. The format consists of a directory for each project, containing code plus an internal project file that gives MLFlow the structure it needs to run the code and ensure that all prerequisites are installed. The project file contains a project name, a list of entry points defining commands that can be run with the project’s code, and an environment definition. Projects can be retrieved from git as well as local directories, and MLFlow allows projects to be submitted to Databricks and Kubernetes clusters to run.
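A project file of the kind described above might look like the following. This is a hypothetical sketch (the project name, parameter, and script name are invented), not the file from the demo repo:

```yaml
# MLproject -- lives at the root of the project directory
name: spark-demo

# Environment definition: MLFlow builds this conda environment before running.
conda_env: conda.yaml

# Entry points: named commands that can be run with the project's code.
entry_points:
  main:
    parameters:
      max_iter: {type: int, default: 10}
    command: "python train.py --max-iter {max_iter}"
```

With a file like this in place, `mlflow run . -P max_iter=20` would resolve the environment, substitute the parameter, and execute the `main` entry point.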
MLFlow Models is a format for packaging exported machine learning models, useful for integration with downstream tools. The MLFlow Models API contains tools that allow for a simple form of deployment, where MLFlow takes inputs and generates predictions using a model. The MLmodel format is a YAML file containing info on when the model was created, a run ID assigned by MLFlow Tracking, and a signature that defines the inputs and outputs in JSON format. It may also contain an example input, also represented in JSON. MLFlow Models has built-in support for Python and R functions, H2O, Keras, MLeap, PyTorch, scikit-learn, Spark MLlib, TensorFlow, ONNX, MXNet Gluon, XGBoost, LightGBM, spaCy, fastai, and statsmodels.
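For a sense of what the MLmodel file contains, here is an illustrative example for a scikit-learn model. All field values here (run ID, timestamp, versions, column names) are made up; a real file is generated by MLFlow when the model is logged:

```yaml
# MLmodel -- written by MLFlow alongside the saved model artifacts
artifact_path: model
run_id: 3f2e0c9a1b8d4e5f9a7c6d5e4f3a2b1c   # assigned by MLFlow Tracking
utc_time_created: '2021-04-01 12:00:00.000000'
flavors:
  python_function:
    loader_module: mlflow.sklearn
    python_version: 3.8.5
  sklearn:
    sklearn_version: 0.24.1
signature:
  inputs: '[{"name": "sepal_length", "type": "double"},
            {"name": "sepal_width", "type": "double"}]'
  outputs: '[{"type": "long"}]'
```

The `flavors` section is what lets downstream tools load the same model either generically (as a Python function) or natively (here, as a scikit-learn estimator).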
MLFlow Model Registry
The MLFlow Model Registry is a central storage location for MLFlow models. It integrates with a UI and an API that help manage models and track their lineage. Models come from runs and can be registered via the UI or the API. The registry tracks models via a unique name and version number, and also keeps track of other metadata. Each time a model is registered under an existing name, the version number associated with that name is incremented. The model stage defines what production stage the model is in, and can have the values Staging, Production, or Archived. Users can add annotations and descriptions to registered models by hand, in Markdown format.
MLFlow and Spark
In the demo portion of the video below, we go through the process of running an MLFlow project using PySpark. The files for that project can be found here, under myRun/spark. The MLProject file defines our entry points and parameters, and points to conda.yaml as our environment definition. The conda.yaml file defines all dependencies needed to run the code. We use code similar to that used in the Machine Learning with Spark and Cassandra series.
The machine learning code is packaged within train.py. Inside the main method, we use findspark to connect to our Spark cluster, download the necessary data from sklearn, and transform it into a DataFrame. We do some minimal pre-processing, split into training and testing datasets, and load our parameters. Then we call mlflow.start_run and do all of our actual training and evaluation inside that section of the code. To properly log parameters and results, the Spark cluster needs access to the MLFlow jar file, and even then some features, like the auto-log functionality that works for sklearn, are currently unavailable when using Spark. With proper management, MLFlow can be a tool for organizing your machine learning efforts whether or not you use Spark.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!