Wednesday, July 4, 2012

Early Data on the Success of Citizen Science Projects

The last few days I've been a powerless blogger thanks to losing all electricity last Friday in the DC area storms. It's an interesting time as a citizen science weather observer, but not as a homeowner in 100-degree heat. Fortunately I've found a temporary home base and put the extra free time to good use.

This week I started pulling together quantitative data on the success of citizen science projects. As some of you have suggested, I first began looking at public popularity and scientific value as ways to measure success. Both also come with relatively simple data sets that can be collected online in one's free time.

A public spreadsheet with the results is available here. For this first round I collected data on 1) the number of scientific papers referencing the project (through Google Scholar), 2) the popularity of the project's web site as determined by the number of Google hits for the project name, and 3) the popularity of the project as determined by its Google ranking for the project name. It also includes my ranking of each project by the hypothesized success factors, scientific areas of interest, and type of project (as defined in my previous post classifying citizen science projects).
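To make the shape of the data concrete, here is a minimal sketch of what one row of such a spreadsheet might look like as a record. The field names are my own for illustration, not the actual column headers in the spreadsheet:

```python
from dataclasses import dataclass

# Hypothetical record mirroring one row of the spreadsheet described above.
# Field names are illustrative, not the spreadsheet's actual headers.
@dataclass
class ProjectRecord:
    name: str             # the project's official name
    scholar_papers: int   # papers referencing the project (Google Scholar)
    google_hits: int      # web popularity: number of Google results
    google_rank: int      # Google search ranking for the project name
    project_type: str     # classification from the earlier post

row = ProjectRecord("MoonZoo", 12, 50000, 1, "transcription")
```

Keeping each project as one structured record like this makes it easy to compare projects within a classification later, as the post goes on to describe.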

Combining all the data sets should let us see more than just which projects are successful. It will hopefully yield additional insights into the success of projects by classification (such as which transcription-type projects are most successful). It will also help us account for differences in the data by evaluating projects on an even basis. The best example of this is separating Distributed Computing and non-Distributed Computing projects (tentatively called "Interactive Citizen Science Projects"). This is a lesson learned from earlier in the process, continued here by moving Distributed Computing projects into their own tab. That may be needed for other project types as well, but for now I am only separating the project list into two while maintaining data distinctions in case further separation is needed.

In most cases my data was based on searches for a project's official name in quotation marks. For some, such as "Great Backyard Bird Count" or "MoonZoo", this is a distinctive name I can rely on for results. But some names are more ambiguous and bring up results not associated with the project. So I've used a series of methods to clean up the data. The first is to manually review results for applicability on an ad hoc basis. This eliminates some erroneous results and tests whether modifying the search criteria is necessary. When a modification is made I've recorded it with a series of asterisks as defined below:

Google Site Listings
No asterisk - "[project name]"
* - "[project name]" citizen science
** - "[project name]" citizen science project
*** - "[project name]" app
**** - "[project name]" distributed computing
***** - "[project name]" naval history

Google Scholar Rankings
No asterisk - "[project name]"
* - "[project name]" citizen science
** - "[project name]" distributed computing
*** - "[project name without punctuation]"
**** - "[project name]" citizen
***** - "[project name]" with individual review
****** - "[project web site]"
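The asterisk codes above are just a shorthand for how the search string was built. As a sketch, the Google site-listing codes could be expressed like this (the function and variable names here are my own, purely illustrative):

```python
# Hypothetical mapping from asterisk count to search-string template,
# mirroring the "Google Site Listings" codes in the post.
SITE_MODIFIERS = {
    0: '"{name}"',
    1: '"{name}" citizen science',
    2: '"{name}" citizen science project',
    3: '"{name}" app',
    4: '"{name}" distributed computing',
    5: '"{name}" naval history',
}

def site_query(name: str, asterisks: int = 0) -> str:
    """Build the Google search string for a project given its asterisk code."""
    return SITE_MODIFIERS[asterisks].format(name=name)
```

For example, `site_query("MoonZoo")` gives `"MoonZoo"` quoted as-is, while `site_query("Great Backyard Bird Count", 1)` appends the `citizen science` qualifier. The Google Scholar codes would follow the same pattern with their own modifier table.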

Now that it's all up, feel free to review and analyze the data as you see fit. I'm working on my own models now but am curious what you find. Also, feel free to comment on any potentially incorrect results so I can keep the data clean. Although I reviewed a large number of the "hits" for applicability, no single person can review them all. So if you have ideas about getting a more tailored, accurate number, please let me know.
