Selecting an Effective & Productive Machine Learning Platform

As you embark on building or enhancing your organization's ML capabilities, the choice you make for your ML platform is crucial. Setting up a machine learning platform is a complicated exercise, because every company and team differs in size and function.

In my years of experience leading ML teams at different companies, I have gone through this exercise many times. I believe anyone who is setting up a platform, or evaluating how to improve their current one, can benefit from what I learned along the way.

I am fortunate to have seen and shaped ML practices across a variety of industries, team compositions, and technologies. I have worked across industries ranging from retail to digital media. I've worked with data science (DS) and machine learning engineering (MLE) teams of various shapes and sizes, in some cases taking a small team and rapidly scaling it up to over 80 data scientists. I've worked with ML & AI technologies across the Azure, Google Cloud, and Amazon Web Services landscapes.

Throughout all these experiences, I have weighed a number of considerations when selecting an ML platform. The major one has always been finding a solution that can keep up with the company's ever-changing Data Science and ML initiatives and let team members execute at a rapid pace. I have seen challenges arise on multiple fronts, and there are a number of learnings that I would like to share.

I see 3 key factors that shape a solid foundation for an ML platform:

Cloud & Platform Portability
Simplicity of Data Science tooling
Notebook & Collaboration experience

1st Consideration: Cloud Platform Environment vs Platform Portability

Data Scientists and Machine Learning Engineers require convenient and secure access to data. The organization will usually have a primary cloud that holds its data assets, often in a data lake or data warehouse. To let the DS & MLE teams be productive immediately, the decision is often to use that cloud platform's native ML & AI tooling. However, these tools usually come with restrictions (e.g. lack of portability across environments, custom DSLs, etc.) that are not future-proof and can easily fall out of step with where the company's data sits.

Multi-cloud usage could disrupt lots of efficiencies across teams

One company I worked at offers a perfect example of this scenario. The bulk of the data was stored in AWS, so most workloads were set up to execute on AWS EC2 instances. We spent considerable time setting up infrastructure within the AWS environment. One year down the track, the technology team decided to move all the data assets to Google Cloud. Not only did this disrupt our rhythm, it also incurred huge time & resource costs in retooling workflows across environments.

However, this could have been avoided had we chosen a cloud-agnostic offering, such as Databricks. A cloud-agnostic offering lets us develop machine learning activities & workflows independently of the underlying cloud, with the assurance that they will operate uniformly across clouds. This future-proofs our work and allows our organizations to adopt multi-cloud architectures and/or move data architectures across clouds.
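To make the portability idea concrete, here is a minimal sketch of workflow code that depends on a small storage interface rather than on one cloud's SDK. This is only an illustration of the general pattern, not how Databricks or any other vendor achieves portability. The names ObjectStore, InMemoryStore, LocalDiskStore, and build_training_set, along with the file keys, are hypothetical; a real setup would add an adapter per cloud behind the same interface.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal storage interface that workflow code depends on."""

    def read_text(self, key: str) -> str: ...
    def write_text(self, key: str, data: str) -> None: ...


class InMemoryStore:
    """Stand-in backend; a cloud adapter would expose the same two methods."""

    def __init__(self) -> None:
        self._objects: dict[str, str] = {}

    def read_text(self, key: str) -> str:
        return self._objects[key]

    def write_text(self, key: str, data: str) -> None:
        self._objects[key] = data


class LocalDiskStore:
    """Backend that maps keys to files under a root directory."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def read_text(self, key: str) -> str:
        return (self._root / key).read_text()

    def write_text(self, key: str, data: str) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(data)


def build_training_set(store: ObjectStore, raw_key: str, out_key: str) -> int:
    """Drop blank rows from a CSV-like object and write the result back.

    Only the ObjectStore interface is used, so this function stays the same
    when the data moves to a different backend.
    """
    rows = [line for line in store.read_text(raw_key).splitlines() if line.strip()]
    store.write_text(out_key, "\n".join(rows))
    return len(rows)


RAW = "id,label\n1,0\n\n2,1\n"  # made-up sample data with one blank row

memory = InMemoryStore()
memory.write_text("raw/events.csv", RAW)
memory_rows = build_training_set(memory, "raw/events.csv", "train/events.csv")

# The same workflow function runs unchanged against a different backend.
with tempfile.TemporaryDirectory() as tmp:
    disk = LocalDiskStore(tmp)
    disk.write_text("raw/events.csv", RAW)
    disk_rows = build_training_set(disk, "raw/events.csv", "train/events.csv")
    disk_output = disk.read_text("train/events.csv")
```

The point of the design is that a migration like the one described above would touch only the backend adapter, not every workflow built on top of it.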

2nd Consideration: Simplicity of Data Science Tooling

A normal workflow for data scientists usually includes exploratory data analysis (EDA), manipulating datasets into different forms, feature engineering, creating training datasets, trying multiple sampling strategies, building several models and comparing their performance, picking the best model to go into production, and finally providing performance reporting for business stakeholders. That's a lot of tasks, with different tools and languages involved. When working with a team of data scientists, you will also meet people with different tooling preferences: some prefer Python or R, while others may be more comfortable with Scala or Java.

You usually end up having to make a few decisions to simplify this technology landscape:

Build and maintain common, reusable tooling, so that team members are not duplicating each other's work
Build in an environment that is language agnostic, such that common tooling functions can be shared and reused easily

At one of my previous companies, we decided to build our own Automated Machine Learning (AutoML) platform. This was a platform made available to all our data scientists, and they could also improve the underlying algorithms.

We built this on Argo. It was widely adopted, with hundreds of models being trained concurrently and minimal bottlenecks thanks to Kubernetes auto-scaling. However, this setup was quite time- and resource-intensive, especially with the ever-changing metrics and model types of each new development. It required valuable MLE time to maintain the many moving parts. Hence, in later stages, we started looking for a platform that would increase productivity, and decided to evaluate Databricks and Vertex AI.

We were already using open-source MLflow to capture model experimentation metrics. Databricks has very strong integration with MLflow, while Vertex AI required a lot of extra custom logging to provide the visibility that MLflow and Databricks gave us. That alone will save you lots of effort. I recently helped a startup set up their ML Platform from scratch with MLflow on Databricks, something that took hours rather than months.

Model deployment and diagnosis

Our ML Platform should allow us to deploy models to production and warn us when the model's performance starts to deteriorate. This means that the ease of deployment and diagnosis is very important.
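As a rough illustration of what "warn us when performance starts to deteriorate" can mean in practice, here is a minimal sketch of a rolling-window accuracy check. It assumes ground-truth labels eventually arrive for each prediction and that a baseline accuracy was measured at deployment time. The class name, the threshold values, and the window size are all hypothetical, and a managed platform would typically provide its own monitoring instead.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions."""

    def __init__(self, baseline: float, tolerance: float, window: int) -> None:
        self.baseline = baseline    # accuracy at deployment time (assumed known)
        self.tolerance = tolerance  # allowed drop before alerting (hypothetical)
        self._hits = deque(maxlen=window)

    def record(self, predicted, actual) -> None:
        """Store whether one prediction matched its ground-truth label."""
        self._hits.append(1 if predicted == actual else 0)

    def accuracy(self):
        """Return rolling accuracy, or None until the window is full."""
        if len(self._hits) < self._hits.maxlen:
            return None
        return sum(self._hits) / len(self._hits)

    def is_degraded(self) -> bool:
        """True once rolling accuracy falls below baseline minus tolerance."""
        acc = self.accuracy()
        return acc is not None and acc < self.baseline - self.tolerance


monitor = RollingAccuracyMonitor(baseline=0.90, tolerance=0.05, window=4)

# Four correct predictions fill the window.
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1)]:
    monitor.record(predicted, actual)
degraded_before = monitor.is_degraded()

# Two misses push the oldest two hits out of the window.
for predicted, actual in [(1, 0), (0, 1)]:
    monitor.record(predicted, actual)
degraded_after = monitor.is_degraded()
```

A check like this only works if predictions and outcomes are captured together, which is where the deployment experiences below come in.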

A few things I have experienced in this area:

Model deployment in containers requires extra setup to host and serve models, plus extra logging facilities to capture results
In less mature teams, models are deployed as pickle files and executed separately. Results are then captured retrospectively, or only the model outputs are saved, which creates confusion when diagnosing results.
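One way to avoid the "only model results are saved" problem is to log the inputs, the model version, and the prediction together at serving time. The sketch below is a minimal, hypothetical version of that idea using only the standard library. The function predict_and_log, the toy model, the feature name basket_value, and the version string are illustrative assumptions rather than part of any platform's API.

```python
import io
import json
import time
import uuid


def predict_and_log(model, model_version: str, features: dict, log_file) -> dict:
    """Score one request and append a self-describing JSON line to log_file."""
    prediction = model(features)
    record = {
        "request_id": str(uuid.uuid4()),  # lets outcomes be joined back later
        "timestamp": time.time(),
        "model_version": model_version,   # which model produced this result
        "features": features,             # the exact inputs that were scored
        "prediction": prediction,
    }
    log_file.write(json.dumps(record) + "\n")
    return record


def toy_model(features: dict) -> int:
    """Hypothetical stand-in for a real model: flags large baskets."""
    return int(features["basket_value"] > 50.0)


log = io.StringIO()  # stand-in for a real log sink
record = predict_and_log(toy_model, "churn-v3", {"basket_value": 72.5}, log)
logged = json.loads(log.getvalue())
```

With records like these, a surprising prediction can be traced to the model version and inputs that produced it, instead of being reconstructed after the fact.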

Based on these points, it is very important to simplify the deployment workflow and minimize the variation between different team members' work. Having compared many of the existing platforms in the field, I find that Databricks stands out for its ease of deployment, which results in a more standardized deployment workflow.

3rd Consideration: Notebook Development Environment

Notebooks enable DS and MLEs to easily get started and share knowledge. Notebooks are a key productivity enabler for my teams, and something I always evaluate seriously. In companies I have worked at, I see two common patterns:

1. Self-hosted Jupyter Notebooks attached to Kubernetes clusters
2. Managed Notebooks (e.g. Databricks notebooks / Vertex AI Workbench notebooks) where the underlying infrastructure and administrative overheads are taken care of by the platform provider

Method 1 definitely costs the most, in time and effort, to host and customize. I would consider this a good learning exercise, but not the most responsible way to empower teams to work fast. Someone has to administer and solve problems such as resource allocation issues, drive mounts, managing container versions, package management, security enforcement, and the list goes on.

Method 2 is what I always recommend. There are several choices in the market, and I have a preference for Databricks notebooks. In my experience, the environment setup (with Python, SQL, and Spark) is best on Databricks. Databricks takes care of pre-loading up-to-date libraries and pre-configures them to interoperate well with each other.

Other evaluation criteria

Looking across managed notebooks, I can summarise my key evaluation criteria:

Transparency: Databricks provides the most transparent AutoML offering. Their glass box approach allows you to download the notebooks associated with your AutoML runs, giving data scientists full confidence in getting to a minimum viable model rapidly.
Multi-Language Support: A DS or MLE can work with their language of choice (Python, SQL, R, Scala, Java) in a notebook, and even mix and match languages in the same notebook.
Feature Store Integration: One of the most important features. At one of the companies I worked at, we spent a lot of effort setting up our own feature store that supported streaming features. It is advantageous to have a readily-available production-ready Feature Store.

Conclusion

There are a lot of other smaller challenges I have been through with my teams in the past. However, I think these are the major ones that you should consider when choosing an ML platform for your team.

The three pillars: 1. Cloud & Platform Portability. 2. Simplicity of Data Science tooling. 3. Notebook & Collaboration experience. Hopefully, the next time you have to build or evaluate your Machine Learning Platform, you can use these to decide what to adopt.