Should Researchers Share Their Code-in-Progress Online?
I am a huge fan of github. Not only do I think it is a great service, but I also love the idea of having my work freely accessible for people to view, use, and critique. I have transitioned all of the code from the ZIA Code Repository there, used it to collaborate with Aric Hagberg on our NetworkX workshop, and I even gave a presentation to a group of fellow graduate students in my agent-based modeling class a few weeks ago on the joys of version control using git.
I am also pushing the code associated with what I hope to be a large part of my dissertation work to my github account. There are, of course, inherent risks in "airing my dirty laundry" for all of the world to see. Last night I had a conversation with a friend about these risks. He uses github like I do, but when he mentioned it to his advisor he was strongly advised to take down the code. Unfortunately, this advice came without an explanation, but clearly this seasoned academic viewed the risk of posting premature code as outweighing any advantages.
Without a doubt, there are numerous bugs in my code on github, but I had a very hard time understanding why that was a problem. During the conversation last night we went back and forth trying to account for all of the risks of putting our code-in-progress online before it was fully developed. As new reasons came up, we seemed to easily find more compelling—at least to me—counter-arguments. Here are some of the reasons we came up with:
- People will steal your ideas - This seems to be the most common reason for keeping code private, but also the easiest to counter. How can someone steal something that you have already publicly claimed as your own? I understand that graduate students may fear that senior academics with a higher profile could "borrow" their work and use their position to get it to publication faster. While this may at one time have been a legitimate fear, code repositories like github date/time stamp everything. If someone steals your work, having a repository is your only recourse, and it actually acts as a much more effective guard against intellectual property theft than keeping things secret.
- People will see all of your mistakes - This is absolutely true, but so what? In both the hard and social sciences there are strong traditions of posting "working" versions of papers online. Part of the reason for doing this is to get some response from the community about the work, including solicited and unsolicited criticism. It is quite common for ambitious graduate students to dig deeply into the appendix of a paper to check a proof or data coding, and forward any errata to the author. This is precisely the same dynamic that occurs when bugs are flagged in code, and this is a good thing.
- Incomplete projects make you seem fickle - Part of what I love about github is how easy it is to create a new repository. Every time I have a new coding idea I can just fire through a few commands in the terminal and be ready to push code. This, however, can lead to many incomplete projects: the dreaded "abandonware." I think this is a fair criticism, but only if abandoned projects are all one ever posts. A better idea is to have one repository that you use as a sandbox, and to be explicit about its purpose. In the software development world this is standard operating procedure, and it should be for scientific research as well. One researcher's sandbox may be another's career. Allowing others to see your ideas in an area can spark brilliance!
After all this self-assurance, however, I am eager for someone to convince me otherwise. Has anyone had a particularly negative experience with posting code? Are there disadvantages that we could not come up with? Posting code seems like an obviously good thing to me, which makes me very suspicious that I am wrong. Please help!