This week is Open Access Week, and we thought it would be interesting to talk with Todd Vision, the senior author on our recent publication “Data reuse and the open data citation advantage”. This article was originally submitted as our first ever PeerJ PrePrint, and the PeerJ version was published three weeks ago; it has subsequently attracted quite a bit of interest in the scientific community.
Todd Vision is in the Department of Biology at the University of North Carolina at Chapel Hill and has been the Associate Director for Informatics at NESCent (the National Evolutionary Synthesis Center) since 2006. His research spans evolutionary genetics, computational biology, data science and scholarly communication. He is a co-founder of Dryad, a widely used repository for data underlying biological and medical publications.
PJ: Tell us a bit about the research you published with us, and what is the ‘Take Home Message’ of your article?
This is part of a larger effort to gather evidence on the costs and benefits of making research data accessible and reusable, particularly for the data producers themselves. We focused on one of the potential motivators for data producers, namely how much additional credit they receive, in the form of increased article citations, for depositing the data reported in the article to a public repository.
One take home message is that there is a nontrivial benefit to data producers for making data openly available, and that it is increasing over time. Another is that reuse by others of openly available gene expression data accounts for well over half of its total usage in the literature, and that this proportion is also on the increase. The implications for policy makers are, we think, pretty self-evident.
PJ: What challenges did you face while doing this research?
The biggest challenge was simply getting machine access to the literature, both to query the citation data and to mine the full text of articles for data accession numbers. At the time we conducted this study, the only option for querying citation data with a list of PubMed IDs was Scopus. Unfortunately, this wasn’t available to us through our institutions, and Elsevier declined to provide us individual access despite our willingness to pay. Heather [Piwowar, first author of the article] tried to use the British Library’s walk-in access during a trip overseas, but the restrictions imposed by the library were not designed for the digital world, and made the exercise impractical. It would have required her to manually type in ten thousand PubMed identifiers one by one. She eventually obtained access to Scopus through an arrangement with Canada’s National Research Library, but even that had its Kafka-esque elements, since she needed to be fingerprinted to obtain a police clearance certificate first. Once she had access, getting the data out of Scopus was very laborious, because she compiled these citation data before Elsevier made the current API available. A study of the scientific literature on this scale is difficult to pull off without the ability to automate the search, either with an API or access to the source data.
Another challenge is that we didn’t (and still don’t) have access to mine the full text of all the scientific articles that might be mentioning these datasets. For that reason, the second part of the study was restricted to the subset of articles available from PubMedCentral. This means we had to extrapolate our estimates, or rely on minimum estimates, for a number of the core results. Even though our university library pays for a subscription that enables humans to read these articles, there is a layer of legal fog that prevents academic researchers from writing software that reads the articles. At one point, multiple Elsevier executives were on the phone with Heather to discuss granting her full-text access to the articles they publish. But the legal negotiations were too slow to be of much help for this study, and at any rate it was only one publisher. There are precious few academics with the time, determination and expertise to negotiate bilateral agreements with all the relevant publishers, and to do it afresh every time there is some new study that requires full text access.
So, we are very happy that publishing our own article as Open Access in PeerJ isn’t contributing another brick in this access wall for future researchers.
PJ: So far, did you get any comments from colleagues about the results you have published with us?
We have been getting an encouraging stream of attention and feedback on this work since it was posted as a PeerJ Preprint in April. In fact, I suspect that the availability of the preprint is the reason the article has managed to start receiving citations already, even though it has been published for less than three weeks.
PJ: PeerJ encourages Authors to make their review comments visible. Why did you choose to reproduce the complete peer-review history of your article?
A lot of expert time goes into reviews, and often the dialog between the authors, the reviewers and the editor adds valuable context that does not get fully surfaced in the paper. I find that writing reviews knowing that they will be made public motivates me to be as constructive as I can. If reviewers wish to stay anonymous, that is still an option - and note that one of the two did in this case. Furthermore, if reviewers wish to make sensitive comments, they always have the option of sharing those privately with the editor. Actually, I’m not sure anymore what purpose is served by having the content of the reviews kept secret by default!
PJ: You received some great media coverage for your article. How was that process? Did the fact that we are an Open Access publisher help with exposure at all?
It’s a fairly involved paper, with lots of different quantitative analyses, including 11 figures and tables. Distilling that down to a few key points for a wider audience has been an interesting and fun challenge, and it has shifted my own thinking about which results are most important and why.
Some of the pieces I have seen were clearly based on the PeerJ press release, but in others you can tell that the reporter went to get additional material from the article itself. It stands to reason that reporters are more likely to do that for an Open Access article like this one.
PJ: With our new Q&A feature, you’ve already had the chance to answer a few questions about your paper. Could you comment on that?
We had one questioner, who asked both ‘did you think of this?’ and ‘where can I learn more about this?’ kinds of questions. Since we, as authors, don’t have an infallible crystal ball that lets us know where readers will want to delve deeper, I think it’s great to allow readers and authors to continue the dialog after publication. It feels like being in a very dispersed, asynchronous journal club, except everyone involved is really interested and has actually read the paper. And the authors can respond if someone thinks they’ve found a fatal flaw! The way the Q&A feature has been implemented on the website, with questions marked in the margin beside the relevant section of the manuscript, is very nice.
PJ: Anything else you would like to talk about?
One interesting piece of backstory is the source of the introductory paragraph. Heather felt that she had opened the introduction well in a paper she published on the same topic in 2007, and was disinclined to reword the same ideas just for the sake of it. The rationale is that since the introduction section is supposed to restate ideas from the literature anyway, there’s not much point in putting them in new, and potentially inferior, words. So we convinced ourselves to just include the passage verbatim. But then it wasn’t clear whether or how to attribute the passage in order to avoid concerns of self-plagiarism. In the end, we simply stated the source in the acknowledgements. That should be noncontroversial, since these were, after all, one of the authors’ own words and the original paper was Open Access. But it did cause some mild unease during review, so we appreciate that PeerJ allowed us, as authors, to make the final call.
Another interesting aspect of the paper is that it was written under version control in such a way that all the analyses in the paper, including tables and figures, can be updated simply by replacing the data and recompiling the document. We used knitr with embedded R code and data, and had all the files versioned on GitHub. After it was published we put the final snapshot into Dryad, so anyone interested in reusing the data or analysis is free to do so. We’d be delighted if others end up using our source files as a template for figuring out how to write their own reproducible papers.
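For readers unfamiliar with that style of workflow, here is a minimal sketch of what a knitr source document looks like; the file name, data file, and chunk contents below are hypothetical illustrations, not the paper’s actual source. Prose and R code live in one file, so recompiling after a data update regenerates every figure and inline number:

```latex
% analysis.Rnw -- hypothetical minimal knitr source (not the paper's own file)
\documentclass{article}
\begin{document}

<<load-data, echo=FALSE>>=
# Read the dataset; replacing this CSV and recompiling
# updates every downstream figure and inline statistic.
d <- read.csv("citations.csv")
@

The dataset contains \Sexpr{nrow(d)} articles.

<<citation-histogram, fig.height=4, echo=FALSE>>=
# Figure is regenerated from the data on each compile.
hist(d$citations, main = "Citation counts")
@

\end{document}
```

Running `knitr::knit("analysis.Rnw")` and then `pdflatex` on the result produces the final PDF, and committing the source and data to Git records the exact provenance of every compiled version.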
PJ: Thank you for your time. We are pleased to have published this paper, which clearly contributes important new information in the open access debate.
If you would like to experience the future of publishing for yourself, then submit now to PeerJ.