Description of a data scientist's workflow, published on the CACM blog.
Generally, the workflow consists of four interconnected phases, each with its own sub-steps. These include:
- Acquire data
- Reformat and clean data
- Edit analysis scripts
- Execute scripts
- Inspect outputs
Each of these phases comes with its own challenges. The author has developed prototype tools intended to address them.
One interesting insight from the blog entry is that data scientists report manual data cleaning as the most tedious and time-consuming part of their workflow. However, the author stresses that this step is also very important, since "the chore of data reformatting and cleaning can lend insights into what assumptions are safe to make about the data, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply."
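
As a rough illustration of the sub-steps listed above, here is a minimal Python sketch. It is not taken from the blog entry or from the author's tools: the sample data, the function names (`acquire`, `clean`, `analyze`, `run_workflow`) and the cleaning rule are all assumptions made for this example. The cleaning step keeps notes on every value it rejects, which is one simple way the chore of cleaning can surface idiosyncrasies in the collection process.

```python
import csv
import io
import statistics

# Hypothetical raw input: stray whitespace, a missing value, a stray unit.
RAW_CSV = """sensor,reading
a, 1.5
b,2.5
c,
d,4.0 C
"""


def acquire(text):
    """Acquire data: here, simply parse CSV text held in memory."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Reformat and clean: keep parseable readings, note everything else."""
    cleaned, notes = [], []
    for row in rows:
        value = (row["reading"] or "").strip()
        try:
            cleaned.append({"sensor": row["sensor"], "reading": float(value)})
        except ValueError:
            # Record the oddity instead of silently dropping it.
            notes.append(f"sensor {row['sensor']}: unparseable reading {value!r}")
    return cleaned, notes


def analyze(rows):
    """Execute the analysis: a count and a mean stand in for a real script."""
    readings = [row["reading"] for row in rows]
    return {"count": len(readings), "mean": statistics.mean(readings)}


def run_workflow(text):
    """Chain the steps; the caller inspects the summary and the notes."""
    cleaned, notes = clean(acquire(text))
    return analyze(cleaned), notes


if __name__ == "__main__":
    summary, notes = run_workflow(RAW_CSV)
    # Inspect outputs: the result plus what cleaning revealed about the data.
    print(summary)
    for note in notes:
        print(note)
```

In this sketch the notes returned by `clean` show that some readings are missing and others carry a unit suffix, the kind of observation that would inform which analyses are safe to apply.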