Data scientist hiring tip: Build your machine learning pipeline first

Don’t hire a data scientist to build your data pipeline.

In fact, you might want to build your machine learning pipeline first, before bringing a data scientist on.

The evolution of data scientist job descriptions

This is what job descriptions vs job reality for data scientists was like only a couple years back:

Love that tweet, so true!

Lately I’m seeing a different challenge in job descriptions for data scientists. Companies are asking for a combination of skills that has never existed in one person! They want their head data scientist to be a data engineer, DevOps specialist, machine learning genius, full-stack software developer, research methodologist, classical statistician, data storyteller, reporting analyst, and more.

Also, this person will be a player-coach who generates insightful analytics themselves while managing a growing team of junior data scientists, data programmers, and infoviz experts.

Here’s a mock example (based on real job postings I’ve seen for lead data scientists):

  • Create a product roadmap for artificial intelligence and advanced analytics, then evangelize and socialize throughout the company to achieve buy-in
  • Find and surface insights in data using Tableau, then wrap these into compelling stories that you will present to our most important clients
  • Create production data reporting using SQL, Spark, Airflow, and Kafka
  • Leverage your broad and deep knowledge of statistical and machine learning modeling approaches including deep learning, time series forecasting, natural language processing, and geospatial analysis in developing descriptive and predictive analytical capabilities
  • Design and run experimental studies then analyze the data using statistical techniques including ANOVA, MANOVA, ANCOVA, MANCOVA, and Bayesian multilevel modeling
  • Develop production machine learning capabilities using Java, Scala, Python, and JavaScript
  • Deploy, manage, and monitor your machine learning models in production data pipelines at scale

Even if you could find a person with all that experience and expertise, they wouldn’t have the focus or support they need to succeed with a real-world machine learning project while handling so much. Most likely, they would focus on building pilot predictive models using data easily accessible to them. They’d share the model results in a slide deck, and that would be as far as their work goes.

Building a pipeline is the hard part

I do find it promising that more and more companies realize that they need to get their machine learning capabilities into production for them to make an impact on their customers and their business. They see that to do this, they need to build automated data pipelines that include model management and monitoring.

You may be better off postponing hiring a data scientist until you’ve established a production machine learning infrastructure. That may not sound like the right order of operations, but many companies have attempted the opposite (data scientist first, then production ML pipeline), and failed.

Consider tasking your existing data-savvy staff with building your initial cut at a machine learning pipeline, with some very basic and simple predictive or classification model at its core. Be sure to include monitoring and feedback so that you can gather additional labeled data and improve models over time.
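To make that concrete, here is a minimal sketch of what a bare-bones full loop could look like in Python. Everything in it is an assumption for illustration: the names `MajorityClassModel` and `FullLoopPipeline`, the in-memory storage, and the trivial "predict the most common label" baseline are hypothetical stand-ins, not a recommended design. The point is only that prediction, feedback capture, monitoring, and retraining all exist end to end, so a better model can be dropped in later.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MajorityClassModel:
    """Deliberately trivial baseline: always predicts the most common label."""
    label: str = "unknown"

    def fit(self, labels):
        # Keep the previous label if there is nothing to learn from.
        if labels:
            self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        # The baseline ignores features; a real model would use them.
        return self.label


@dataclass
class FullLoopPipeline:
    """Hypothetical in-memory pipeline: predict, collect feedback, monitor, retrain."""
    model: MajorityClassModel = field(default_factory=MajorityClassModel)
    pending: dict = field(default_factory=dict)   # request_id -> (features, prediction)
    labeled: list = field(default_factory=list)   # (features, true_label) from feedback
    outcomes: list = field(default_factory=list)  # True/False per scored prediction

    def predict(self, request_id, features):
        prediction = self.model.predict(features)
        # Remember what we predicted so feedback can be matched to it later.
        self.pending[request_id] = (features, prediction)
        return prediction

    def record_feedback(self, request_id, true_label):
        features, prediction = self.pending.pop(request_id)
        self.labeled.append((features, true_label))      # new labeled data
        self.outcomes.append(prediction == true_label)   # monitoring signal

    def accuracy(self):
        # Accuracy of the current model on feedback received so far.
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def retrain(self):
        # Refit on labels gathered through feedback, then reset monitoring
        # so the accuracy figure reflects only the new model.
        self.model = MajorityClassModel().fit([label for _, label in self.labeled])
        self.outcomes.clear()
```

In use, you would call `predict` when serving a request, `record_feedback` once the true outcome is known, watch `accuracy`, and call `retrain` when it drops. In a real deployment the in-memory lists would be replaced by whatever storage and scheduling your infrastructure already provides.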

Figure: Creating a full-loop machine learning pipeline

Ideally you’d put an ML-aware product manager in charge, to figure out what sort of ML-powered guidance your end-users and customers (or internal businesspeople) need. You’d get a data engineer who wants to add ML skills to his or her resume to build some basic models using data at hand to support the guidance your product manager envisions. And have a savvy DevOps engineer ready to help the data engineer put the pieces together in your cloud-based services infrastructure.

Once you have a full machine learning pipeline in place, you’re ready to hire a data scientist! By building out a production ML-powered capability with a full data pipeline first, even if relatively bare bones, you’ll be positioned to hone your data scientist job description to just the modeling skills you need. You’ll be better able to evaluate candidates’ understanding of full lifecycle ML development and how they can contribute to it. And you just might find you don’t need an expensive, hard-to-find data scientist quite yet.

If you need advice about roadmapping your first machine learning project, setting up a data science function, or building a machine learning pipeline, we can help. Get in touch.