Hello everyone! I'm Canas and this is my first post on DEV. Hopefully it will be useful for people working in Data Science and Machine Learning :)
One of the known challenges of working with notebooks is that version control is not ideal, and few tools actually deal with this. This motivated me to try to lessen the burden for engineers who work with these files by using GitHub Actions, leveraging an existing open source tool called nbdime.
My Workflow
The general idea is to have the GitHub Action post a comment on the PR showing the changes to any notebook relative to the target branch.
We will use the following existing actions to accomplish this:
- actions/checkout@v2, for fetching the code
- actions/setup-python@v1, for installing Python
- peter-evans/create-or-update-comment@v1, to create a comment on the PR with nbdiff's output
Submission Category:
I guess this would fall under Maintainer Must-Haves, since it provides much better context when notebooks are submitted to a shared repository (e.g., for research or for persisting experiments).
Yaml File or Link to Code
You can see a working implementation in this repository.
name: Generate notebook diff
on: ["pull_request"]
jobs:
  check-diff:
    runs-on: ubuntu-latest
    steps:
      # Fetch the full history so the target branch can be diffed against
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: Fetch target branch
        run: git fetch origin ${{ github.event.pull_request.base.ref }}:${{ github.event.pull_request.base.ref }}
      - name: Setup Python
        uses: actions/setup-python@v1
        with:
          python-version: "3.6"
      - name: Install requirements
        run: pip3 install nbdime
      - name: Run and store diff
        run: |
          nbdiff ${{ github.event.pull_request.base.ref }} --no-color > diff.log
          # Wrap the output in a Markdown diff code fence
          sed -i '1s/^/\`\`\`diff\n&/' diff.log
          sed -i '$s/$/\n&\`\`\`/' diff.log
      - name: Get comment body
        id: get-comment-body
        run: |
          body=$(cat diff.log)
          # Escape %, \n and \r so the multi-line value survives set-output
          body="${body//'%'/'%25'}"
          body="${body//$'\n'/'%0A'}"
          body="${body//$'\r'/'%0D'}"
          # Quoted so that whitespace inside the diff is preserved
          echo "::set-output name=body::$body"
      - name: Create comment
        uses: peter-evans/create-or-update-comment@v1
        with:
          issue-number: ${{ github.event.pull_request.number }}
          body: ${{ steps.get-comment-body.outputs.body }}
In simple terms, we use nbdiff to generate a file called diff.log. After that, we use sed to prepend and append the Markdown code fence markers. In the next step, we read diff.log into the body variable and percent-encode its %, newline, and carriage return characters, because set-output would otherwise truncate the comment at the first newline. Finally, we pass the body output to the create-or-update-comment action, which takes care of posting our formatted output on the PR.
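To see what the wrapping and escaping do without running the workflow, the two steps can be sketched as a small bash script. This is only an illustration under a few assumptions: the function names wrap_in_diff_fence and escape_body are hypothetical, the sample text is made up rather than real nbdiff output, and printf is swapped in for the sed calls (the octal escape \140 stands for a backtick).

```shell
#!/usr/bin/env bash
# Local sketch of the "Run and store diff" and "Get comment body" steps.
# The sample text below is made up; in the workflow it comes from nbdiff.

# Wrap text in a Markdown "diff" fence. \140 is the octal code for a backtick.
wrap_in_diff_fence() {
  printf '\140\140\140diff\n%s\n\140\140\140' "$1"
}

# Percent-encode the characters that the workflow escapes before set-output.
escape_body() {
  local body="$1"
  body="${body//'%'/'%25'}"     # first, so the escapes added below are not re-encoded
  body="${body//$'\n'/'%0A'}"   # newlines
  body="${body//$'\r'/'%0D'}"   # carriage returns
  printf '%s' "$body"
}

sample=$'-  x = 1\n+  x = 100%'
wrapped="$(wrap_in_diff_fence "$sample")"
escaped="$(escape_body "$wrapped")"

# The whole comment body is now a single line, safe to hand to set-output.
printf '%s\n' "$escaped"
```

The order of the replacements matters: % has to be encoded before the newlines, otherwise the % in every %0A would be encoded a second time.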
The repo exists to try out a GitHub Action that comments on PRs with Jupyter Notebook diffs (via nbdime) when available. A sample is available in its only open PR.
*Be sure to check out the change branch if you want to see the actual file!
Additional Resources / Info
Actions used:
- actions/checkout
- actions/setup-python
- peter-evans/create-or-update-comment
Libraries used:
- nbdime
Possible future work:
- Test and benchmark on large notebooks
- Look for a way to deploy nbdiff-web, the web version of nbdime
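One more candidate for that list: since this workflow was written, GitHub has deprecated the set-output workflow command in favour of appending to the file named by the GITHUB_OUTPUT environment variable, which accepts multi-line values through a heredoc-style delimiter. The following is a minimal sketch of that approach and is not part of the original workflow. The function name write_multiline_output and the delimiter format are hypothetical, and the temporary-file fallback is there only so the snippet can run outside of Actions.

```shell
#!/usr/bin/env bash
# Sketch: pass a multi-line value as a step output via the GITHUB_OUTPUT file.
# Outside of Actions the variable is unset, so fall back to a temporary file.
output_file="${GITHUB_OUTPUT:-$(mktemp)}"

# write_multiline_output NAME VALUE
# Writes NAME<<DELIMITER, the value, then DELIMITER on its own line,
# so no percent-encoding of newlines is needed.
write_multiline_output() {
  local name="$1" value="$2"
  # Hypothetical delimiter; it must not appear on a line of its own in the value.
  local delimiter="EOF_${RANDOM}_${RANDOM}"
  {
    printf '%s<<%s\n' "$name" "$delimiter"
    printf '%s\n' "$value"
    printf '%s\n' "$delimiter"
  } >> "$output_file"
}

write_multiline_output body $'line one\nline two'
cat "$output_file"
```

With this in place, the "Get comment body" step could in principle drop the three replacement lines and write diff.log's contents directly, although that has not been tried against the repository above.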