Tuesday, July 10, 2012

Proving My Theories of Success (aka Keeping my Fingers Crossed!)

Over the last few weeks I've laid out my theory for predicting success in citizen science projects.  I've quantified that theory and evaluated the likely success of all the projects highlighted so far on my blog.  I've also collected results data including public popularity (Google hits) and peer-reviewed paper citations (Google Scholar).  I've even scrubbed the data and run some initial tests.  So let's see if my theory holds up.

First, just a quick note on the data.  I'm still scrubbing it a tiny amount as I have time to dig deeper into each result.  So you'll see a new category added in my Google Scholar search footnotes for "individually reviewed".  In these cases a project, such as the Sungrazer project, was providing odd results so I did some much broader searches and manually picked out the relevant hits.  It's time consuming and not necessary for all the projects, but in this case it was useful (and called for by the ambiguous initial results I'd had).

Now back to the results.  First I did some preliminary checks of the hypothesized success ranking against Google popularity.  This was not successful.  There was too much going on and there were no strong correlations.  I'd go through the results but they're pretty dull and uninformative.  So I next turned to the rankings compared to Google Scholar results.  This was a bit more promising, especially when using the Google Scholar rank (compared to all results) versus hypothesized success score.  The results are below:

There are two main things of interest.  First, note the regression line and its slightly positive slope.  This would tend to disprove my thesis, but look more closely at the groupings.  We have a sizable group of highly successful projects, all with high hypothesized success scores, in the upper right-hand corner.  Just where they should be under the theory.  But the mass of projects below is what pulls the regression away.  So looking a bit closer and controlling for the type of project, we get a cleaner result that also fits the theory much more nicely.

The most logical place to start is controlling for primary area of science.  Projects within the same area focus on approximately the same things, so they should provide good comparisons.  And indeed they do:

The only problem above is with the Astronomy set which shows only a very small correlation (a zero-slope regression line). I have some answers for that as well, but that will be a whole other posting! I also want to dive even deeper into each success factor to further control our variables.

Overall this is a much better result.  Apart from the flat Astronomy line, all the regressions show a negative slope, meaning the higher the hypothesized success score, the greater the number of peer-reviewed papers generated by that project.  So clearly there is something going on here.  This also fits in with our earlier discussion of separating project types when analyzing success, such as distributed computing vs. non-distributed computing (aka "interactive") projects.  But what exactly is going on here?  There are still more clues hiding in the data.
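For anyone who wants to try the same kind of check at home, here is a minimal sketch of fitting one regression to all the projects and then one per area of science.  The rows are made-up illustrative numbers, not my project data, the area names are placeholders, and I'm assuming a rank of 1 means the most papers (so a negative slope supports the theory).

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_by_group(rows):
    """Fit a separate line for each area; rows are (area, score, rank)."""
    groups = defaultdict(list)
    for area, score, rank in rows:
        groups[area].append((score, rank))
    return {
        area: ols_slope([s for s, _ in pts], [r for _, r in pts])
        for area, pts in groups.items()
    }


# Hypothetical rows: (area, hypothesized success score, Google Scholar rank).
# Assumption: rank 1 = most papers, so lower rank numbers are better.
rows = [
    ("area_1", 1, 3), ("area_1", 2, 2), ("area_1", 3, 1),
    ("area_2", 6, 9), ("area_2", 7, 8), ("area_2", 8, 7),
]

pooled = ols_slope([s for _, s, _ in rows], [r for _, _, r in rows])
grouped = slopes_by_group(rows)

print("pooled slope:", round(pooled, 3))
for area, slope in grouped.items():
    print(area, "slope:", round(slope, 3))
```

In this toy data the pooled slope comes out positive even though each area on its own slopes downward, which is the same pattern that made the combined chart look worse than the per-area charts.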

So far we've used an aggregate hypothesized success score using every single trait I identified and weighting them equally.  But are some more important than others?  Do some have very little impact on project success?  Let me know your thoughts in the comments below and we can check them out in the next set of data runs.
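To make the weighting question concrete, here is a rough sketch of how an equally weighted aggregate differs from a reweighted one.  The trait names, trait values, and alternative weights are all placeholders for illustration, not the actual traits or scores from my ranking.

```python
def aggregate_score(traits, weights=None):
    """Weighted sum of trait values; defaults to weighting every trait equally."""
    if weights is None:
        weights = {name: 1.0 for name in traits}
    return sum(weights[name] * value for name, value in traits.items())


# Hypothetical project with three placeholder traits.
project = {"trait_a": 3, "trait_b": 1, "trait_c": 2}

# Equal weighting, as used for the aggregate score so far.
equal = aggregate_score(project)

# Illustrative alternative: trait_a counts double, trait_b is ignored.
reweighted = aggregate_score(
    project, {"trait_a": 2.0, "trait_b": 0.0, "trait_c": 1.0}
)

print(equal, reweighted)
```

Rerunning the regressions with different weight sets would be one way to see which traits actually move the result.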

UPDATE: Hyperlinked to previous posts explaining the concept of "Hypothesized Success Ranking" and added R-squared values to charts to better indicate how well the regression line fits.  My next post will be much more statistically-based so I was holding off the math a bit in this post.  But in hindsight some additional data is necessary. 
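For readers curious about the R-squared values, here is a small sketch of how that number can be computed for a straight-line fit.  It relies on the fact that, for a simple one-variable regression, R-squared equals the squared correlation between the two variables.  The sample numbers are invented and are not taken from the charts.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line through (xs, ys).

    For a one-variable linear regression this equals the squared
    Pearson correlation. Assumes neither xs nor ys is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Invented example: points that fall exactly on a line, then a noisier set.
perfect = r_squared([1, 2, 3, 4], [8, 6, 4, 2])
noisy = r_squared([1, 2, 3, 4], [2, 1, 4, 3])

print(perfect, noisy)
```

A value near 1 means the line explains most of the variation in the data; a value near 0 means the line tells you very little, whatever its slope.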


  1. What is the source and methodology behind the "hypothesized success ranking"? I may have missed it in an earlier post. If that is the case, I apologize.

  2. Good question...I was referring to the projects scoring highest under my ranking system (http://www.openscientist.org/2012/06/ranking-citizen-science-projects-by.html). Under my theory those should be the most successful, and that's what I'm trying to show. Hopefully. :)

  3. ah yes, this makes good sense. Truly enjoying the exercise.