open-discussion > RE: AAAS: Your Paper MUST include Data and Code
Mar 10, 2011 08:03 PM | Cinly Ooi
Originally posted by Pierre Bellec:
Well, I'd say that a lot of people will agree on the principles of reproducible research and will clearly see the benefits, but I am surprised no one has mentioned yet how challenging this is in practice for neuroimaging. First, there are huge issues with anonymization, especially for clinical data. Processing a dataset is one thing; releasing it publicly is another (faces in T1 scans have to be blurred, for example). Then you need to host tons of datasets securely online. In the same vein, coding an in-house algorithm is one thing; releasing it publicly is again completely different (you need to document!). For new algorithms, the production environment can be very hard to reproduce. Moreover, the analysis itself can be computationally challenging (I use supercomputers all the time; the vast majority of the neuroimaging community does not). Not to mention that a lot of research groups do not fully automate their data processing flow.
My point is that it is not enough to say "let's go reproducible/public/open source/...". We also need an infrastructure to do so. I bet that in the next couple of years we will have websites that allow datasets to be shared publicly and, with only a couple of clicks, processed on supercomputers with well-tested and maintained analysis pipelines. There are many current efforts in that direction (e.g. http://www.cbrain.mcgill.ca/). But at this stage this is science fiction, as far as I know.
On anonymization:
It is good practice to ensure all fMRI data are anonymized in the first place. The issue with T1 scans (and indeed any MR dataset) can be handled simply by requiring the receiving party to obtain approval from the relevant ethics committee. I cannot see how Science could say no to that requirement.
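To give a flavour, here is a minimal header-scrubbing sketch in Python, assuming pydicom is available. The tag list is illustrative only and must be extended to whatever your ethics committee requires; note it does nothing about faces in T1 volumes, which need a separate defacing tool:

```python
# Minimal DICOM de-identification sketch (not a complete anonymizer).
# Assumes pydicom is installed; extend the tag list per your ethics board.
import pydicom

IDENTIFYING_TAGS = [
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "ReferringPhysicianName", "InstitutionName",
]

def scrub(in_path, out_path, subject_code):
    ds = pydicom.dcmread(in_path)
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            ds.data_element(tag).value = ""
    ds.PatientID = subject_code  # replace identity with a study code
    ds.save_as(out_path)

# Hypothetical paths, for illustration only.
scrub("raw/scan0001.dcm", "anon/scan0001.dcm", "SUBJ-042")
```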
On algorithm:
As Matthew points out, releasing an algorithm has the enormous benefit of getting the algorithm tested; the feedback from that process actually improves it. If you find it hard to reproduce results for a new algorithm on a new production platform, experience says you have bugs. Surely you have a unit-test suite that can tell you where your bug is? ;-)
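For instance, a simple regression test with a stored reference result and a numerical tolerance will catch most platform-dependent breakage. The function and file names below are hypothetical placeholders, just to show the shape:

```python
# Regression test sketch: check that the algorithm reproduces a stored
# reference result within numerical tolerance on any platform.
import unittest
import numpy as np

def my_algorithm(data):
    # stand-in for the published algorithm under test
    return data.mean(axis=0)

class TestReproducibility(unittest.TestCase):
    def test_matches_reference(self):
        data = np.load("test_input.npy")          # hypothetical fixture
        reference = np.load("reference_output.npy")
        result = my_algorithm(data)
        # Exact equality is too strict across compilers and BLAS builds;
        # a tolerance separates numerical noise from real bugs.
        np.testing.assert_allclose(result, reference, rtol=1e-6)

if __name__ == "__main__":
    unittest.main()
```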
I do coding for a living, and I have received and examined so much program code from the neuroscience community that I can tell you this is not really a big problem. I see bad code and good code alike, and even the worst of it I have managed to decipher. Documentation? I never trusted it. Even on sourceforge.net I bet you will find code written worse than the worst you will find in this community, especially considering that the programmers there may actually come from the commercial software field.
On servers hosting datasets:
There is no need for a centralized server; a decentralized approach is probably better. In fact, the data need not be online at all: the fMRI Data Center did quite a good job of distributing data through the postal service.
Institutional libraries are also making digital space available for archiving.
Where there is a will, there is a way. Let's not get too bogged down by servers.
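One practical point when data travels by post or by mirror rather than from one blessed server: ship a checksum manifest alongside it, so the receiver can verify their copy bit-for-bit. A minimal sketch, with illustrative paths:

```python
# Build a checksum manifest for a dataset directory, so a copy received
# by post or from any mirror can be verified against the original.
import hashlib
from pathlib import Path

def write_manifest(dataset_dir, manifest_path):
    with open(manifest_path, "w") as out:
        for path in sorted(Path(dataset_dir).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                out.write(f"{digest}  {path.relative_to(dataset_dir)}\n")

write_manifest("study_data/", "study_data.sha256")
# The receiver recomputes the digests and compares them to the manifest.
```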
On workflow:
You are only asked to make your workflow available for inspection. If you use supercomputing, then make the supercomputing scripts available. If your workflow is only partially automated, give us your scripts, plus documentation for the manual parts written so that we can reproduce them.
For the purpose of reproducing the results, we have to assume the receiving party is capable of either reproducing your environment or adapting your scripts to their own environment while keeping fidelity to your workflow at all times. IMHO, any workflow that cannot survive a change of environment has to be treated with suspicion.
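It also helps if the workflow records the environment it actually ran in next to its outputs, so the receiving party knows exactly what they are trying to reproduce. A sketch of what I mean, assuming a Python-driven workflow (the extra fields are placeholders):

```python
# Record the execution environment alongside the results, so a receiving
# party can see exactly what the workflow depended on when it ran.
import json
import platform
import sys

def record_environment(out_path, extra=None):
    info = {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "argv": sys.argv,
    }
    if extra:
        info.update(extra)  # e.g. toolbox versions, input checksums
    with open(out_path, "w") as f:
        json.dump(info, f, indent=2)

record_environment("environment.json",
                   extra={"pipeline_version": "1.2.0"})  # hypothetical
```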
On infrastructure:
It will be some time before we have facilities like those in genome research, where you can send a search request to servers in Tokyo, Washington DC or elsewhere, and they process the request for you and return the data. Even then, to tell the truth, I don't think I want something like that.