Team Data Science Process
If you combine Scrum and CRISP-DM, you will get something that looks like Microsoft’s Team Data Science Process. Launched in 2016, TDSP is “an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently.” (Microsoft, 2020 ).
This is a modern data science process that combines elements of the core data science life cycle, software engineering, and Agile processes.
Team Data Science Lifecycle (Based on Microsoft, 2020)
TDSP Components
TDSP has four main components:
- A data science lifecycle definition
- A standardized project structure
- Recommended infrastructure and resources
- Recommended tools and utilities
It comes as no surprise that these two often leverage Microsoft Azure; however, a team could use other tech stacks and still adhere to TDSP.
TDSP Life Cycle
Although the lifecycle graphic looks quite different, TDSP’s project lifecycle is like CRISP-DM and includes five iterative stages:
- Business Understanding: define objectives and identify data sources
- Data Acquisition and Understanding: ingest data and determine if it can answer the presenting question (effectively combines Data Understanding and Data Cleaning from CRISP-DM)
- Modeling: feature engineering and model training (combines Modeling and Evaluation)
- Deployment: deploy into a production environment
- Customer Acceptance: customer validation if the system meets business needs (a phase not explicitly covered by CRISP-DM)
Team Definition
TDSP addresses the weakness of CRISP-DM’s lack of team definition by defining six roles:
- Solution architect
- Project manager
- Data engineer
- Data scientist
- Application developer
- Project lead
Microsoft defines the relevant tasks and artifacts for many of the team roles during each phase of the project life cycle.
Standardized Resources
Microsoft also provides standardized project documents such as project charters and data reports, infrastructure and resources for data science projects, and tools and utilities for project execution. The use of some of these artifacts is also mapped to the five phases.
Evaluation
Pros
- Agile: Emphasizes the need for incremental deliverables.
- Familiar: The product backlog, features, user stories, bugs, Git versioning, and sprint planning are familiar to those used to common software practices.
- Data Science Native: TDSP acknowledges that data science and software engineering are different, and is built for data science teams working on production-bound projects.
- Flexible: TDSP can be implemented as it is defined or in conjunction with other approaches such as CRISP-DM.
- Thorough: Because of its rich team focus and detailed documentation, TDSP is arguably the most mature CRISP-derived project management approach. It is conceptually similar to Domino Data Lab’s Lifecycle but is more detailed.
- Free Templates: Go to Microsoft Azure’s GitHub repository to get started.
Cons
- Fixed Sprints: TDSP leverages fixed-length planning sprints which many data scientists struggle with.
- Some Inconsistencies: Not all of Microsoft’s documentation is consistent.
Bottom Line
TDSP is a good option for data science teams who aspire to deliver production-level data science products. It may not be appropriate for one-team data scientists or for projects without a production goal.
Learn More
- 72 Min Video from Open Data Science walking through a project using TDSP
- (external): Microsoft’s TDSP Documentation
- Post: 10 Ways to Manage Data Science Projects – Emerging Approaches
- Category: Learn more about leading teams
- Comprehensive Guide: Navigate through a larger set of data science methodologies and process frameworks in this guide.