It’s fair to say that almost ten years after Harvard Business Review declared data scientist the sexiest job of the 21st century, many companies still don’t know how to build an effective data science team.
One reason I joined The Mom Project to lead their data science team was that the team had already established itself as “full-stack.” With the help of TMP’s platform team, two data scientists had already built and deployed a state-of-the-art collaborative filtering capability for recommending jobs to moms on the platform before I arrived. After a decade of struggling to get models into production at the companies where I practiced data science, this talented and hardworking team showed me how it’s done.
(Their example suggests that it is possible to have your data scientists build your data pipeline. I still don’t recommend it though!)
This is how the TMP data scientists describe this full-stack approach:
“Our data science team is full-stack, meaning we own the DS process end-to-end from training models to API deployment. We code in Python, and we use technologies like Docker, Redshift and PyTorch. We work in two-week sprints with daily scrums, bi-weekly planning, demos, and retros.”
From our latest Senior Data Scientist job post
I struggled to get AI/ML capabilities into production at other companies where I worked because the data science teams there lacked the engineering capabilities to deploy models in APIs, to build model monitoring and evaluation, and to develop and maintain the data pipelines that supply data for both training models and using them in production.
Not just data scientists
Building a full-stack data science team doesn’t mean only hiring data scientists who can build a Python REST API. A well-balanced team has specialists who complement the data scientists. To that end, I’m adding three new positions to the TMP data science team, alongside the two full-stack senior data scientists already there:
- Senior Data Scientist – this particular data science role is aimed at Python software engineers with an interest in getting into AI/ML development. You need not have work experience building ML systems, but you should have familiarity with them, e.g., through online coursework. Initially this new data scientist will get up to speed by taking over maintenance and enhancement of TMP’s state-of-the-art collaborative filtering job recommender, and will eventually take on new ML/AI development once familiar with how we work.
- Staff Data Engineer – in this role, you’ll lead the development of our MLOps infrastructure supporting our end-to-end full-loop AI/ML system. We’re looking for someone with expert Python skills, experience with a variety of data management systems including SQL DBMSs and Elasticsearch, and an interest in working in the AI/ML space.
- AI/ML Solutions Architect – in this architecture role, you will map business requirements onto a technical architecture that supports rapid development and deployment of AI/ML systems. You will become the data expert for the team, understanding what data is available, what it means, and how it might be enriched. Along with the data science product manager, you will assist in prioritizing the buildout of specific technical features in the end-to-end system.
Pretty obvious why you’d want the first two, but what about that last one? This is a gap I’ve seen on data science teams.
Why a solutions architect?
Gartner says an AI Architect envisions, builds, deploys, and operationalizes an end-to-end machine learning pipeline. The AI/ML Solutions Architect at The Mom Project will be the technical owner and architect of the machine learning system we are building to promote better matching of talent to jobs. They will become the expert on the data we have available and the additional data we might need to serve as the lifeblood of our AI/ML system. They will consult with stakeholders throughout the company to determine the technical requirements for our end-to-end machine learning capability and its constituent parts, including a data warehouse, data pipelines, a data annotation system, and a job-and-skills graph. This system will form the basis of a career management system that helps a diverse workforce create and manage careers that are rewarding, satisfying, and compatible with a high quality of life as a working parent.
In addition, TMP has just hired a Senior Product Manager focused solely on data science. I’ve never worked at a company that hired someone for that role before. I am so excited to work with someone who will be focused exclusively on how data science can meet the needs of our business and our platform users.
Other ways of organizing data science
I don’t think that a centralized full-stack data science team of the sort we’re building at TMP is the only or necessarily even the best way of organizing data science. Aside from what I’ve just outlined, there are two other main ways of organizing data science. In a distributed or federated model, data scientists are assigned to various teams, for example full-stack feature teams, where they contribute to full-stack development as the data science experts and are not led by any central management. In a hub-and-spoke model, a central “center of excellence” style team creates foundational capabilities and practices, complemented by team members distributed to different departments or teams.
For smaller companies like TMP that are just getting started with data science, the centralized model seems a reasonable first step. I hope we demonstrate so much value that we can explore hub-and-spoke in the future and get our data scientists embedded on user-facing development teams.