Create workflow_overview.md #83
base: master
Conversation
Thanks Nadia, this is a good start. And thanks for submitting it as a PR, that really helps!
You can read below that I think a lot of this so far duplicates content we already have. I think you are hoping to have just a single high-level process, but in technical writing it is tricky to balance detail against accuracy (too much detail slides into inaccuracy, because it makes people's eyes glaze over and they miss steps 😅).
I am thinking this doc needs to be something like:
- Make a [git repo](LINK TO GIT TUTORIAL) on GitHub (or on an internal private git server??) for your analysis code.
- Find your data on the [data server](LINK TO DATA SERVER DOC) and include it by (one of):
  - Running
    ```
    git clone -b r20230623 git@data.neuro.polymtl.ca:datasets/<DATASET> /tmp/<DATASET> && (cd /tmp/<DATASET> && git annex get)
    ```
    (or some other folder??) manually and making sure to prefix all analysis commands with that path, e.g. `sct_analyze_lesion -i /tmp/<DATASET>/path/to/file.nii.gz`
  - Writing
    ```
    git clone -b r20230623 git@data.neuro.polymtl.ca:datasets/<DATASET> && (cd <DATASET> && git annex get)
    ```
    in a top-level `analyse.py` or `analyse.sh` script and adding `<DATASET>/` to your `.gitignore`
  - Using
    ```
    git submodule add -b r20230623 git@data.neuro.polymtl.ca:datasets/<DATASET> && (cd <DATASET> && git annex get)
    ```
  - other alternatives??

  Troubleshoot problems by consulting the [git-annex](LINK TO GIT ANNEX PAGE) page.
- Run `git push` regularly to share your work and ask for feedback early.
- Test your reproducibility by having someone else `git clone` (or `git clone --recurse-submodules`) your project and run its analysis script on a different machine at a different time. If it doesn't run with just one or two command lines, fix it so it does by adding the necessary steps to your top-level driver script (a sketch of such a script follows below).
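For reference, a minimal sketch of what such a top-level driver script could look like, assuming the clone-into-a-gitignored-folder option; the dataset name, release branch, and analysis command are placeholders, not settled conventions:

```bash
#!/usr/bin/env bash
# analyse.sh -- minimal driver script sketch (placeholders throughout)
set -euo pipefail

# Fetch the exact dataset release this analysis was written against,
# unless it is already here from a previous run
if [ ! -d "<DATASET>" ]; then
    git clone -b r20230623 git@data.neuro.polymtl.ca:datasets/<DATASET>
fi
(cd <DATASET> && git annex get .)

# Run the analysis
sct_analyze_lesion -i <DATASET>/path/to/file.nii.gz
```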
On a style point, I feel this file should probably be in a subfolder. I'm not sure where though, this crosses both "Onboarding" and "Geek Tips" and "Computing Resources/Data". I notice you added an emoji to the title, which means it would already fit in with the other top-level pages table of contents. Was that your intention?
Wherever you put it, you'll need to find the nearest `toctree` and add it there in order for this page to show up, e.g.:
intranet.neuro.polymtl.ca/mri-scanning/README.md
Lines 7 to 12 in 4f32201
```{toctree}
:hidden:
unf-3t-prisma
mni-mcgill-7t-terra
mhi-7t-agilent
```
If you want it in the top-level folder, this is the `toctree` to edit:
intranet.neuro.polymtl.ca/README.md
Lines 3 to 33 in 4f32201
```{toctree}
:caption: Ressources
:hidden:
onboarding/README
agenda-and-calendar
computing-resources/README
data/README
mri-scanning/README
rf-lab/README
```

```{toctree}
:caption: Academic Life
:hidden:
courses
scholarships
conferences
bibliography/README
ideas-for-cool-projects
writing-articles
```

```{toctree}
:caption: Miscellaneous
:hidden:
practical-information/README
geek-tips/README
edi
contact
NeuroPoly Website <https://neuro.polymtl.ca>
```
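For example, if it stayed at the top level with the filename `workflow_overview.md` (just an assumption for illustration, since we haven't settled on a location), the first block would gain one line:

```{toctree}
:caption: Ressources
:hidden:
onboarding/README
agenda-and-calendar
computing-resources/README
data/README
mri-scanning/README
rf-lab/README
workflow_overview
```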
# 📄 General Overview of Project Workflow
I appreciate that you noticed and kept the emojis here! There's a small glitch with sphinx: all the top-level pages need to have their emojis wrapped like this to get them to align properly
```diff
- # 📄 General Overview of Project Workflow
+ # <span>📄</span> General Overview of Project Workflow
```
Once your [onboarding](https://intranet.neuro.polymtl.ca/onboarding/README.html) is complete, you will be ready to tackle your project!
I try to use relative links everywhere, because the domain this is on is not promised to last forever. So
```diff
- Once your [onboarding](https://intranet.neuro.polymtl.ca/onboarding/README.html) is complete, you will be ready to tackle your project!
+ Once your [onboarding](onboarding) is complete, you will be ready to tackle your project!
```
or maybe that doesn't work, tbh I forget how sphinx handles folder links. Maybe better to be explicit:
```diff
- Once your [onboarding](https://intranet.neuro.polymtl.ca/onboarding/README.html) is complete, you will be ready to tackle your project!
+ Once your [onboarding](onboarding/README.md) is complete, you will be ready to tackle your project!
```
Note that the build process -- sphinx -- is smart enough to map `.md` to `.html` in the final build, but by keeping the link as a `.md`, if you've done the relative links right, GitHub's markdown renderer will render the whole site as a fully functional backup. Take a gander at https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/4f322012f35de2f170111e325110431bda94a7d6/workflow_overview.md: you will see that in this version this link is broken. Even once this is published to the live site, I'd still count that link as broken in spirit 😲 because it'd go to an external site and so the docs aren't self-contained.
You can double-check how your work will render in the end with our theme and all the links fixed up properly. It's a bit more work, but you can just check out this branch on your computer and follow these instructions:
intranet.neuro.polymtl.ca/setup.py
Lines 2 to 7 in 4f32201
```
To build the docs:
    pip install .[sphinx]
    make html
They will end up in _build/html/
```
(these are the same commands that are used by GitHub to publish the live copy of the site)
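Concretely, checking out this branch and building might look something like this (the branch name is a placeholder for whatever this PR's branch is actually called):

```bash
# Get a local copy of the repo and switch to the PR branch
git clone https://github.com/neuropoly/intranet.neuro.polymtl.ca
cd intranet.neuro.polymtl.ca
git checkout <PR_BRANCH>

# Build the site the same way the GitHub deploy does
pip install .[sphinx]
make html

# Inspect the result locally
xdg-open _build/html/index.html   # or 'open' on macOS
```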
**Step 3.**
* Create your project working directory:
```
cd data_nvme_<POLYGRAMES_USERNAME>
```
Most systems don't have `/mnt/nvme` or `~/data_nvme_$USER`. That's really just a hack we added for `joplin` at some point, and it's already documented here:
intranet.neuro.polymtl.ca/computing-resources/neuropoly/cpus.md
Lines 15 to 17 in 58f03fe

| **Hostname** | `joplin.neuro.polymtl.ca` |

For fast I/O, use the NVMe hard drive, which is automatically available: `~/data_nvme_$USER`
I would say there's no need to specify this. Anywhere someone has access that has enough space is fine, and if you just don't mention it then most people will by default end up working in their home directories, which should work perfectly well on most systems, at least to start out. The big gotcha is that combining `duke` and `git` is a bad idea, but I think documenting that is probably out of scope of this.
```
mkdir <PROJECT_NAME>
cd <PROJECT_NAME>
```
We can help people out by being explicit here:
```diff
  mkdir <PROJECT_NAME>
  cd <PROJECT_NAME>
+ git init
```
* After adding your NeuroPoly workstation [SSH key to your Github account](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account?platform=linux), you are ready to make a local fork of that remote repository:
```
cd data_nvme_<POLYGRAMES_USERNAME>/<PROJECT_NAME>
git clone -b "<YOUR_WORKING_BRANCH>" git@github.com:<REPOSITORY>.git
```
I think we want everything we do to be tracked by git by default. Things that should not be under git, like private data or test files, can either go in `/tmp` or be added to `.gitignore` explicitly. That is to say, just have `<PROJECT_NAME>/`, not `<PROJECT_NAME>/<REPOSITORY>/`.

Then, we can either `git init` locally and `git push`, or do the reverse and click the New Repo button on GitHub (which runs `git init` on their side) and then `git clone`. Either way works, and GitHub gives helpful pointers to guide people through either workflow. Once you've done either two or three times they become obvious, and most developers don't give them a second thought.
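For the record, the "init locally, then push" direction is only a handful of commands (the GitHub URL here is a placeholder):

```bash
# Start tracking the project locally
cd <PROJECT_NAME>
git init
git add .
git commit -m "Initial commit"

# Create an empty repo on GitHub first, then connect and push
# (placeholder URL; use 'master' instead of 'main' if that is your default branch)
git remote add origin git@github.com:<GITHUB_USERNAME>/<PROJECT_NAME>.git
git push -u origin main
```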
We already have a page that's supposed to cover this:
# git & Github
and while it needs a lot of love at the moment, it'd be better to work these tips into it than repeat them here. Can we arrange it so this section is just a link to that page? Maybe that page needs to get some edits in this PR as well to bring it up to speed with our current practices and/or to slim out the advice we don't use.
```
git annex drop .
```
* Any data derivatives that you output should be added to `data.neuro:datasets/<PROJECT_DATASET>` according to the [BIDS](https://bids-specification.readthedocs.io/en/stable/) data standard! More documentation on how to version control your data on `data.neuro` can be found [here](https://intranet.neuro.polymtl.ca/data/git-datasets.html#update).
Ditto on the absolute link: better to make it relative.
* Thanks to `git annex`, the following command will copy the directory structure and some small files of your dataset on `data.neuro`:
```
cd data_nvme_<POLYGRAMES_USERNAME>/<PROJECT_NAME>
git clone git@data.neuro.polymtl.ca:datasets/<PROJECT_DATASET>
```
This is the most interesting part to me. This is the part where we might address neuropoly/data-management#136.
One option here would be
```diff
- git clone git@data.neuro.polymtl.ca:datasets/<PROJECT_DATASET>
+ git submodule add git@data.neuro.polymtl.ca:datasets/<PROJECT_DATASET>
```
This would be the Datalad YODA recommendation. They don't use the word "submodule" on that page but it is what they have in mind.
We can also do
```diff
- git clone git@data.neuro.polymtl.ca:datasets/<PROJECT_DATASET>
+ git submodule add -b v1.0.3 git@data.neuro.polymtl.ca:datasets/<PROJECT_DATASET>
```
to pick out a specific older version (in this case `v1.0.3`).
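And if we later want to bump the pin to a newer release, the usual submodule dance would be something like this (the newer tag is hypothetical):

```bash
# Move the submodule pin to a newer dataset release
cd <PROJECT_DATASET>
git fetch --tags
git checkout v1.0.4   # hypothetical newer tag
cd ..
git add <PROJECT_DATASET>
git commit -m "Bump <PROJECT_DATASET> to v1.0.4"
```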
Then when you `git push` the code to GitHub it looks like this: the dataset folder is unclickable, because it's tagged as a submodule reference, and moreover it's a submodule that lives on the private, intentionally-inaccessible storage node (screenshots omitted). But if I'm on one of our internal processing nodes with permission to the datasets, I can reproduce my entire project with `git clone --recurse-submodules`:
```
p115628@bireli:~/src$ git clone --recurse-submodules https://github.com/kousu/proj1
Cloning into 'proj1'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 6 (delta 0), pack-reused 0
Receiving objects: 100% (6/6), done.
Submodule 'canproco' (git@data.neuro.polymtl.ca:datasets/canproco) registered for path 'canproco'
Cloning into '/home/GRAMES.POLYMTL.CA/p115628/src/proj1/canproco'...
remote: Enumerating objects: 51098, done.
remote: Counting objects: 100% (51098/51098), done.
remote: Compressing objects: 100% (27474/27474), done.
remote: Total 51098 (delta 22242), reused 39149 (delta 14297), pack-reused 0
Receiving objects: 100% (51098/51098), 4.44 MiB | 11.12 MiB/s, done.
Resolving deltas: 100% (22242/22242), done.
Submodule path 'canproco': checked out '42d424b0b56b9269fe0c5058130f5e3bc7c9e941'
p115628@bireli:~/src$ cd proj1/
p115628@bireli:~/src/proj1$ ls canproco/
dataset_description.json sub-cal135 sub-cal192 sub-edm086 sub-edm183 sub-mon109 sub-mon174 sub-tor031 sub-tor103 sub-van111 sub-van176
derivatives sub-cal136 sub-cal194 sub-edm087 sub-edm184 sub-mon111 sub-mon175 sub-tor032 sub-tor106 sub-van112 sub-van177
participants.json sub-cal137 sub-cal195 sub-edm088 sub-mon001 sub-mon113 sub-mon176 sub-tor033 sub-tor107 sub-van116 sub-van178
participants.tsv sub-cal138 sub-cal197 sub-edm089 sub-mon002 sub-mon118 sub-mon180 sub-tor035 sub-tor109 sub-van123 sub-van180
sub-cal056 sub-cal140 sub-cal198 sub-edm094 sub-mon003 sub-mon119 sub-mon181 sub-tor036 sub-tor110 sub-van124 sub-van181
sub-cal072 sub-cal142 sub-cal199 sub-edm095 sub-mon004 sub-mon121 sub-mon183 sub-tor037 sub-tor112 sub-van125 sub-van182
sub-cal073 sub-cal143 sub-cal200 sub-edm098 sub-mon005 sub-mon124 sub-mon185 sub-tor038 sub-tor114 sub-van129 sub-van183
sub-cal078 sub-cal144 sub-cal201 sub-edm105 sub-mon006 sub-mon125 sub-mon186 sub-tor039 sub-tor115 sub-van131 sub-van184
sub-cal080 sub-cal145 sub-cal202 sub-edm107 sub-mon007 sub-mon126 sub-mon187 sub-tor040 sub-tor118 sub-van133 sub-van185
sub-cal083 sub-cal146 sub-cal206 sub-edm113 sub-mon009 sub-mon129 sub-mon189 sub-tor041 sub-tor121 sub-van134 sub-van186
sub-cal084 sub-cal149 sub-cal207 sub-edm114 sub-mon010 sub-mon131 sub-mon190 sub-tor043 sub-tor123 sub-van135 sub-van189
sub-cal085 sub-cal150 sub-cal209 sub-edm118 sub-mon011 sub-mon132 sub-mon191 sub-tor044 sub-tor124 sub-van136 sub-van191
sub-cal088 sub-cal151 sub-cal210 sub-edm123 sub-mon013 sub-mon133 sub-mon192 sub-tor049 sub-tor125 sub-van137 sub-van192
```
which is pretty cool! And a lot more reliable than most other reproduction methods, which say "go dig up this DOI and try to find the matching dataset on Zenodo, and then the paper, and some code that went with the paper on some obscure university FTP site" (and don't say, but should, "and make sure you're running such and such a version of such and such an OS and using such and such a version of python and using such and such a version of nvidia's GPU hardware"...)
@sandrinebedard gave me a tour of how she handled https://github.com/sct-pipeline/ukbiobank-spinalcord-csa last year.
In short, she wrote a detailed process guide in the repository. While explaining, she realized some parts were left undocumented: she set up a conda environment to avoid surprise breakage (which may cause different results, since `requirements.txt` doesn't specify versions), and there is a git tag named `r20210928` attached to the precise dataset she analysed:
```
p115628@joplin:~/datasets/uk-biobank-processed$ git remote -v
origin  git@data.neuro.polymtl.ca:datasets/uk-biobank-processed (fetch)
origin  git@data.neuro.polymtl.ca:datasets/uk-biobank-processed (push)
p115628@joplin:~/datasets/uk-biobank-processed$ git log HEAD~..
commit 7ed28cd0aacaab0ce1570bccdba3bff495b5f496 (HEAD -> master, tag: show, tag: r20210928, origin/master, origin/HEAD)
Author: Sandrine Bedard <[email protected]>
Date:   Sun Sep 5 17:49:04 2021 -0400

    change suffix in derivatives from labels-manual to labels-disc-manual
```
but she hasn't documented that that's the tag to go with the project. Otherwise her procedure fits option 3: write down all the steps and expect people will read all your docs and follow all your instructions.
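For what it's worth, closing those two gaps could be as simple as committing the pins next to the code; a sketch (these commands are one way among several):

```bash
# Record the exact Python environment used for the analysis
python --version > python-version.txt
pip freeze > requirements.txt

# Record which dataset release the analysis ran against
echo "Analysed against uk-biobank-processed tag r20210928" >> README.md

git add python-version.txt requirements.txt README.md
git commit -m "Pin environment and dataset version for reproducibility"
```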
Maybe you can already tell, but I'm leaning against option 3. I think we need something more automated, because if there's one thing I know about human-computer interaction it's that people don't read instructions. I don't know if `git submodule` (option 1) is the right answer, but I think we need something automated. And we will know we have it when we can run a weekly cronjob on each of our papers that reproduces the final paper, including all figures and statistical analyses, from just source code and access to the data server.
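To sketch what that acceptance test might look like -- every path, repo, and script name below is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical /etc/cron.weekly/reproduce-papers: re-run each paper end to end
set -euo pipefail

for repo in git@github.com:<ORG>/<PAPER_1> git@github.com:<ORG>/<PAPER_2>; do
    workdir=$(mktemp -d)
    # Code plus pinned dataset submodule(s), exactly as a reader would get them
    git clone --recurse-submodules "$repo" "$workdir/paper"
    (
        cd "$workdir/paper"
        git submodule foreach 'git annex get .'  # fetch the annexed data contents
        ./analyse.sh                             # each paper's top-level driver script
    ) || echo "REPRODUCTION FAILED: $repo"
    rm -rf "$workdir"
done
```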
Oh perf, I just poked around a bit and it looks like you were trying to reproduce this project: https://github.com/NadiaBlostein/Chemical_Structure_Reconstruction
The very first thing it says is pretty much:

> Download the Mordred compound sets (1 through 3) and place them in a `data/` folder

I have no idea where to get the Mordred compound sets (1 through 3), nor what version (if any) of them I'm supposed to get, and I don't know if they assume the `data/` folder is supposed to be in the same folder as the code, or if they mean the current working directory (reading it in the unix convention, I would take it as the current working directory!), which isn't necessarily the same.
And:

> Run the main function in `src/thresholding.py`.

There is no main function in that file. As a programmer I can read the code and interpret that they must mean "run `python src/thresholding.py`", but when working with a large series of projects it would be impossible to track down and correct every small gap like that.
Their repo also depends on pandas and some other third-party python packages, but doesn't declare a `setup.py` or `requirements.txt` or a conda environment or anything. It doesn't even say what version of python it was written against. All of that is key to write down for reproducibility.
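Even a minimal pinned `requirements.txt` would have closed most of those gaps; a sketch (the package list and versions are illustrative, not the project's actual ones):

```bash
# Write a pinned requirements file alongside the code
cat > requirements.txt <<'EOF'
# Tested with Python 3.10 (illustrative)
pandas==1.5.3
numpy==1.24.2
EOF

# Anyone reproducing the work can then recreate the environment with:
pip install -r requirements.txt
```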
Those are some prime examples of the mistakes I want us to be able to avoid.
**Step 5. The data**
* It is critical to make sure that you know what data you are working with.
* Ideally, it should be in [BIDS](https://bids-specification.readthedocs.io/en/stable/) format on the [`data.neuro`](https://intranet.neuro.polymtl.ca/data/git-datasets.html) storage node: `data.neuro:datasets/<PROJECT_DATASET>`.
Ditto with the relative links. Also I'd shorten the BIDS link to be safer against link rot -- their domain will change less frequently than their subfolders, hopefully.
```diff
- * Ideally, it should be in [BIDS](https://bids-specification.readthedocs.io/en/stable/) format on the [`data.neuro`](https://intranet.neuro.polymtl.ca/data/git-datasets.html) storage node: `data.neuro:datasets/<PROJECT_DATASET>`.
+ Ideally, it should be in [BIDS](https://bids-specification.readthedocs.io) format. We have many of these on the private [`data`](data/git-datasets.md) server.
```
We should also think about the workflow for combining datasets. In principle it's easy to include multiple datasets -- each of them is just a different folder, and we can do that with `git submodule` or almost any other tool. We should think about how to leave the door open to this and avoid tricking people into thinking that there's a 1-to-1 mapping between analyses and datasets, while remembering that the 1-to-1 situation is the common case.
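With submodules, for instance, a second dataset is just a second folder (dataset names hypothetical):

```bash
# Each dataset becomes its own pinned subfolder of the analysis repo
git submodule add git@data.neuro.polymtl.ca:datasets/<DATASET_A>
git submodule add git@data.neuro.polymtl.ca:datasets/<DATASET_B>
git commit -m "Add <DATASET_A> and <DATASET_B> as inputs"
```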
**Step 1.**
* Make sure that your VPN connection is established or that you are connected to the Polytechnique wifi.

**Step 2.**
* Log in to one of the available [Neuropoly compute nodes](https://intranet.neuro.polymtl.ca/computing-resources/neuropoly/README.html):
```
ssh <POLYGRAMES_USERNAME>@<STATION>.neuro.polymtl.ca
```
This duplicates the instructions at
intranet.neuro.polymtl.ca/computing-resources/neuropoly/README.md
Lines 152 to 158 in 58f03fe
### SSH (command line)
Once the VPN connection is established, connect via ssh using the `STATION` you want:
```bash
ssh <POLYGRAMES_USERNAME>@<STATION>.neuro.polymtl.ca
```
I think those instructions could be clearer -- it would be nice if that page was broken up -- but in any case it's best not to duplicate the work.
* However, in order to save space, make sure to "undownload" those big files once you are done working with them, with:
```
git annex drop .
```
This is a good tip, but I think it would fit better as a new subsection over in https://intranet.neuro.polymtl.ca/data/git-datasets.html#drop.
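The typical get/work/drop cycle, for reference (the path is a placeholder):

```bash
# Download only what this step needs
git annex get <PATH_TO_FILES>

# ... run the analysis on those files ...

# Free the local copies when done; the server still keeps them
git annex drop <PATH_TO_FILES>
```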
By the way, Jan and I spent several hours today putting together a BIDS curation guide:

intranet.neuro.polymtl.ca/data/dataset-curation.md
Lines 1 to 167 in b89d175

We could add a link to this to the workflow. Some projects start by curating their own data, some just use pre-curated data, and others involve generating derivative datasets (which also need to be curated). There are two other documents I'm involved with that duplicate this work; none of these are finished, or even necessarily correct. I would like to get all steps of all our workflows -- collection, curation, analysis, reproduction -- down in our wiki to offer as a gold standard.
Also, I've left a blank spot in

intranet.neuro.polymtl.ca/data/dataset-curation.md
Lines 174 to 176 in 6222bd9

to link up this doc when it's ready!