Building your own software, data management, or workflow management systems requires a significant amount of interpersonal management, as well as tracking of development.
Software engineers, having long suffered under the burden of disorganization and communication with clients, have come up with frameworks for developing software and sharing it with their users.
While data science applications have a different audience and intended result, the organizational practices of software developers are valuable tools to consider integrating into your open science lab group.
Frequently Used Terms
- Continuous Integration (CI): automated testing that checks the application is not broken whenever new commits are integrated into the main branch
- Continuous Delivery (CD): an extension of ‘continuous integration’ that makes sure you can release new changes in a sustainable way
- Continuous Deployment: a step further than ‘continuous delivery’; every change that passes all stages of your production pipeline is released
- Continuous Development: a process for iterative software development; an umbrella term covering several other processes, including ‘continuous integration’, ‘continuous testing’, ‘continuous delivery’ and ‘continuous deployment’
- Continuous Testing: the practice of running automated tests throughout the software development process
- Development: the environment on your computer where you write code
- DevOps: Development and information technology Operations; the set of practices surrounding CI/CD
- Production: the environment where users access the final code after all of the updates and testing
- Stage: an environment that is as similar to the production environment as possible, used for final testing
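The distinction between ‘continuous delivery’ and ‘continuous deployment’ in the terms above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular CI/CD service works: the `run_pipeline` function, the stage names, and the `auto_release` flag are all hypothetical.

```python
from typing import Callable, List, Tuple

# A stage is a (name, check) pair; the check returns True when the stage passes.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage], auto_release: bool) -> str:
    """Run each stage in order and stop at the first failure.

    auto_release=False models continuous delivery: the change is ready,
    but a person decides when to release it.
    auto_release=True models continuous deployment: a change that passes
    every stage is released automatically.
    """
    for name, check in stages:
        if not check():
            return f"failed at {name}"
    return "released" if auto_release else "ready for manual release"


# Hypothetical stages; real checks would run tests rather than return True.
stages = [
    ("integration tests", lambda: True),
    ("staging checks", lambda: True),
]

print(run_pipeline(stages, auto_release=False))  # ready for manual release
print(run_pipeline(stages, auto_release=True))   # released
```

In both modes a failing stage stops the pipeline; the only difference is whether the final release step needs a human decision.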
The software developer concept of ‘continuous delivery’ can be applied to your data science projects and lab.
As we’ve discussed, version control is an important component of modern software development. Critically, version control can also be used in data science applications and for research project management. There are two dominant forms of project management for continuous delivery in open source software: Waterfall and Agile Scrum.
Agile development practices involve organizing a team around short-term (1-2 week) ‘sprints’. Sprints are organized by a scrum master. Team members are assigned tasks and evaluate their results during sprint reviews and planning sessions.
Similar to the common Gantt chart, a waterfall model is a breakdown of project activities into linear sequential phases, where each phase depends on the deliverables of the previous one and corresponds to a specialisation of tasks.
Doing reproducible science requires you to host the code and versioned software used to complete the analysis, in addition to the actual data. GitHub or GitLab could become the central point supporting your data science lab.
Powerful uses of GitHub include integration with other web services, such as container registries (DockerHub), documentation and websites (ReadTheDocs, GitHub Pages at https://pages.github.com/), continuous integration (CircleCI, Jenkins, Travis), and workflow managers (GitHub Actions).
Continuous Integration (CI) is the practice of automatically testing a code repository as changes are integrated (typically a few times a day) to catch changes which may cause failures.
CI can be integrated into either scientific programming workflows or into code development.
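As a minimal sketch of what a CI service might run on each commit to a scientific codebase, the example below pairs a small analysis function with two automated checks. The `normalize` function and its tests are hypothetical stand-ins for your own code; the `test_` prefix follows the naming convention that test runners such as pytest use to discover tests.

```python
def normalize(values):
    """Scale values to the range [0, 1] (hypothetical analysis step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_range():
    # The smallest value maps to 0.0 and the largest to 1.0.
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_normalize_constant_input():
    # Guards against a division by zero sneaking in with a later change.
    assert normalize([3, 3, 3]) == [0.0, 0.0, 0.0]


if __name__ == "__main__":
    test_normalize_range()
    test_normalize_constant_input()
    print("all checks passed")
```

If a later commit breaks `normalize`, one of the assertions fails and the CI run is marked as failed before the change reaches other users.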
The most popular CI tools include CircleCI, Jenkins, Travis, and GitHub Actions.
When to use CI?
- building or hosting services for a community
- developing versioned copies of containers for public consumption
- DevOps + Data Science
Status badges can be embedded in a README.md. Badges let you show the state of code or documentation.
You can view a wide variety of badges on Shields.io.
You can pass the `style` GET argument to get custom-styled badges, the same as you would for shields.io. If no argument is passed, `flat` is used as the default.
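To make the `style` argument concrete, here is a small sketch that builds the Markdown for a static Shields.io badge, suitable for pasting into a README.md. The helper names `badge_markdown` and `_escape` are hypothetical; the URL layout (`/badge/<label>-<message>-<color>` with doubled dashes and underscores for literal ones) follows the Shields.io static badge format as I understand it, so check the rendered badge before relying on it.

```python
from urllib.parse import quote, urlencode


def _escape(text):
    # In static badge paths a single dash separates fields, so literal
    # dashes and underscores are doubled before percent-encoding.
    return quote(text.replace("-", "--").replace("_", "__"), safe="")


def badge_markdown(label, message, color, style="flat"):
    """Return Markdown for a static badge; style is sent as a GET argument."""
    path = f"{_escape(label)}-{_escape(message)}-{color}"
    query = urlencode({"style": style})
    url = f"https://img.shields.io/badge/{path}?{query}"
    return f"![{label}: {message}]({url})"


print(badge_markdown("build", "passing", "brightgreen", style="flat-square"))
```

Leaving out `style` here falls back to `flat`, mirroring the default described above.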