Tuesday, July 24, 2012

Success Factors for Distributed Computing Projects

We have talked generally about success factors for citizen science projects for weeks now.  We've also talked about success factors for all interactive projects, as well as sub-dividing those into success factors for astronomy and ecology projects.  We even dove into the correlation statistics on them.  So now it's time to do the same thing for Distributed Computing.

First let's look at the data broadly.  The most significant result is no result: nearly fifty percent of the success factors have zero correlation with either the popularity or scholarly success of these projects.  This is not unexpected, and it is part of the reason we separated results for distributed computing projects in the first place, but it still merits discussion.  Looking closely at the data, these factors show absolutely no variation between projects in their success factor rankings.  In other words, each project addresses these factors in the same way, or they just aren't expressed at all.  So a more limited number of factors actually differentiate the projects, limiting the potential factors for success.  A positive interpretation is that project designers only need to focus on a few key elements to make a highly successful project, since most will not impact the result.  However, it could also be interpreted as proof that further research is needed on success factors for distributed computing projects beyond what we've done here.  While I do promise to address that in future posts, let's see what the data we do have says.
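
To make that "no result" point concrete, here is a minimal sketch in Python (with made-up numbers, not my actual spreadsheet) of why a factor that every project handles identically shows up as "n/a" rather than as a low correlation: a zero-variance column leaves Pearson's r undefined.

import statistics

def pearson(xs, ys):
    """Return Pearson's r, or None when either column has zero variance."""
    if statistics.pstdev(xs) == 0 or statistics.pstdev(ys) == 0:
        return None  # this is what gets reported as "n/a" in the tables
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

# Hypothetical rankings for four distributed computing projects:
entertain = [1, 1, 1, 1]        # every project scores the same -> n/a
feedback = [0, 2, 1, 3]         # real variation between projects
citations = [5, 40, 20, 60]     # e.g., Google Scholar citation counts

print(pearson(entertain, citations))  # None ("n/a")
print(pearson(feedback, citations))   # a strong positive correlation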

Now let's look at what affects the scholarly impact of Distributed Computing projects.  By far the largest correlation, almost 1:1, is Providing Data Access.  This confused me for a while and made me double-check my data to make sure it wasn't just one outlier or a series of unreliable data points causing the trend.  And that's what it turned out to be.  The only project that does anything different in this aspect than the other projects is SETI@Home, and the overall success of that project is likely what produces the correlation.  So it may be a real factor, but there is not enough data here to prove it.  I also need to review the other projects more closely on this factor to make sure the lack of differentiation seen in my experience truly holds up.  That may be a future post as well, and I would encourage anyone who sees differences from their experience to let me know in the comments below.

Be Audacious has a very similar problem.  While there are differences with two projects (and not just one), it's still a minor differentiation and nothing I can build a statistical argument on.  So we will also need to disregard that item for lack of evidence.  What does that leave us with?  The correlation chart of success factors for scholarly success below shows the answer:

Entertain n/a
Reward 0.006933365
Challenge n/a
Educate n/a
Motivate the User n/a
Create a Community n/a
Interact in Real Time 0.006933365
Provide Feedback 0.699078158
Offer Excitement 0.515624856
Encourage Dialogue 0.444842746
Provide Data Access 0.967159265
Allow for Errors n/a
Be Audacious 0.705797165
Stay Focused n/a
Make it Convenient n/a
Make Learning Easy 0.567665542
Make Participating Easy 0.071321992

The importance of Providing Feedback sticks out the most to me.  This data point has significant variation, as project designers work to varying degrees to provide timely updates on the project and its results.  Some use e-mail newsletters, some update information on their web site, some provide usage statistics in the program interface, and others use combinations of these techniques.  This helps keep projects popular by keeping them in the participant's (and the public's) eye, and keeps people motivated by showing the benefits they provide to science.  But how is this a scholarly success factor?

The important thing to remember is that distributed computing projects typically have a single goal (or set of goals) that doesn't change over time.  Researchers design the project for a very particular problem and use the brute force of everyone's computers to solve it.  So there is no benefit to a project only half-completed: the problem remains unsolved and the work already done gets discarded.  If there are not enough participants to see a project to its end, there will be no results and all the time will have been wasted, meaning no final result and no academic papers.  So even though this is a popularity measure, it is vital to scholarly success.

Speaking of popular success, below is the correlation chart for those success factors:

Entertain n/a
Reward -0.075083869
Challenge n/a
Educate n/a
Motivate the User n/a
Create a Community n/a
Interact in Real Time -0.075083869
Provide Feedback 0.021533374
Offer Excitement 0.231458187
Encourage Dialogue 0.052357277
Provide Data Access 0.274326387
Allow for Errors n/a
Be Audacious 0.144201148
Stay Focused n/a
Make it Convenient n/a
Make Learning Easy 0.053787176
Make Participating Easy 0.077344773

The important idea in this chart is the need to "Offer Excitement".  This should not be a surprise when you think of how projects gain popularity in the first place.  Most bring people in through existing citizen science portals (such as OpenScientist) advertising them or, more likely, from people reading popular press articles about the project.  The press loves an exciting story and will focus most on projects with the best narrative.  So in some ways it is not directly related to what we consider scientific success.  Except for one important thing.

Remember that from a participant's point of view, none of these projects take significant time, energy, or resources.  There is minimal operational time, and many even share the same infrastructure (such as the BOINC platform).  So projects can't differentiate themselves on ease of use or on a network infrastructure effect; instead they compete on how exciting the project is.  The more exciting it is, the more participants will join and the more computing cycles will be performed.  So it is a necessary component of project success, bringing in a critical mass of participants and ensuring enough computer time will be devoted to the project.

Monday, July 23, 2012

Diving into the Statistics...Ecology

One thing I've learned over the last few weeks is the wide array of success factors important to different types of citizen science projects. There is a lot of diversity out there and no single formula defines success for everyone. Hours staring at Excel spreadsheets and wild regressions taught me that. But once you organize things into neat piles and keep the apples with the apples things start coming into focus.

Up next are the ecology projects, looking first at measures of public popularity. I expect they will be significantly different than the astronomy projects based on the nature of the science. Fortunately the data seem to bear this out.

Entertain 0.114475249
Reward -0.067028298
Challenge -0.25679603
Educate -0.035525285
Motivate the User -0.12342609
Create a Community -0.112233026
Interact in Real Time -0.27786182
Provide Feedback -0.230619358
Offer Excitement -0.179413057
Encourage Dialogue -0.33835345
Provide Data Access -0.236025844
Allow for Errors 0.100673012
Be Audacious -0.237861864
Stay Focused 0.379573448
Make it Convenient 0.20049689
Make Learning Easy 0.149773189
Make Participating Easy 0.27213335

Very quickly we see that the most important factors for success all appear to come from the Keep it Simple family: Stay Focused, Make it Convenient, Make Learning Easy, and Make Participating Easy. This is the first time we are seeing a whole group of items appear together, and it's likely there is a strong correlation amongst them. As discovered earlier, the success factors aren't completely distinct; while they have their own definitions, there is certainly potential for overlap. They are also based on general concepts and not on rigidly defined criteria. So a correlation between these is not unexpected.
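
Checking that suspected overlap is straightforward once the rankings are in a spreadsheet. A quick sketch (assuming a hypothetical CSV export whose column headers match the factor names; adjust both to your own copy of the data):

import pandas as pd

# Hypothetical export of the project rankings; the filename and column
# headers are placeholders for however the spreadsheet is actually saved.
df = pd.read_csv("citizen_science_projects.csv")
simple_family = ["Stay Focused", "Make it Convenient",
                 "Make Learning Easy", "Make Participating Easy"]

# Pairwise correlation matrix; values near 1 would confirm the overlap.
print(df[simple_family].corr())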

Looking more closely at Keep it Simple, it makes sense for these to be strong success factors for ecology projects. Unlike astronomy or meteorology projects, many ecology projects can be performed with minimal time, investment, or training. Of the ones we've reviewed, many just require taking notes in your backyard or a nearby area, and any identification is either 1) relatively simple for the lay person, or 2) aided by tools provided by the project. So participation doesn't require extensive biology training or expertise, which reduces the need for extensive education.

As long as designers work to keep the projects simple and easy to participate in, they can expect to be successful. Part of this is staying focused on a particular species or problem; this focus helps keep things simple and ensures that minimal training is required, since only a few simple things need to be taught. Even then, making the learning easy ensures a well-trained volunteer who is able to participate and won't get frustrated trying to learn alone. Making it convenient, by allowing measurements in one's backyard and on relatively infrequent timetables, also makes participation easy. All these factors work together to help people get involved, stay involved, and tell their friends to become involved.

Surprisingly (or not so surprisingly), the data on important factors leading to academic success (as measured through Google Scholar citations) are much different than those for ensuring popular success.

Entertain 0.372179967
Reward -0.090097108
Challenge 0.595873844
Educate 0.527128459
Motivate the User 0.524441676
Create a Community 0.671781181
Interact in Real Time 0.50987457
Provide Feedback 0.409862612
Offer Excitement 0.194700875
Encourage Dialogue 0.461991758
Provide Data Access 0.504180285
Allow for Errors 0.221117582
Be Audacious 0.28583062
Stay Focused -0.141373398
Make it Convenient -0.406432517
Make Learning Easy -0.214070343
Make Participating Easy -0.062347621


The strongest, and to me one of the most interesting, success factors is creating a community.  This fits very well with the highest-ranking projects (the Great Backyard Bird Count and the Christmas Bird Count), which have both created very strong communities around the project.  For one thing, developing a community of volunteers that supports one another and encourages each other to stay involved keeps users strongly motivated and working hard for the project.  While that is important for remaining popular and bringing in new participants, it is also very important to scientific success.  This also correlates with "Interact in Real Time", where project participants work with one another and keep up a regular dialogue about the project.  My thinking is that the training component involved in having experienced users teach new birdwatchers how to participate significantly improves the quality and quantity of results.  It keeps people in the field longer and ensures the data is accurate.  It also teaches new participants tips on spotting more birds, which increases the number of available data points.  All are important parts of developing strong data sets.

Finally, I find it interesting that the success factors for a publicly popular project are actually negatively correlated with scientific success.  In other words, keeping a project simple can actually hurt its chances of creating scientifically important results.  Why is this?  Well, consider that the simplicity involved in these projects may over-simplify the situation and not allow enough meaningful data to be collected.  Some of this may introduce uncontrolled variables when people can work on their own and on their own time; this lack of rigor can increase errors and make the data unreliable.  There is also not enough flexibility in the data: much of science is not just looking for data on the problem you understand, but also looking for new problems and unexpected connections.  Making a project too simple can stymie that type of discovery.

So those are my initial thoughts on Ecology projects.  But what about Distributed Computing projects?  Find out about the final scientific area being explored in my next post.

Tuesday, July 17, 2012

Diving into the Statistics...Astronomy

Yesterday we looked at what the results told us on a broad scale.  We learned that offering tangible rewards (e.g., cash) was the most sure-fire way to achieve a successful citizen science project.  But we also know that the different scientific fields create projects under different constraints and cater to people with different interests.

Let's look at astronomy first.  It's one of the most popular types of projects available and also one of the most diverse.  A lot falls under this category with participants performing a wide variety of tasks.  So it should be no surprise if there is also wide variety in the importance of success factors for each individual project.  But let's look at what the numbers tell us.

From a scholarly success point of view, we can see that the strongest correlation is with Educate the User.  Given the nature of the field, not much can be discovered without either significant training or concerted design work.  Unlike ecology projects, you can't just send a person into the backyard with a magnifying glass; instead you either need to provide access to advanced telescope data along with tools for interpreting it, or you need to train individuals how to observe and measure stars from their backyard.  The second requires substantial education and training, and the first requires substantial explanation before users can participate.

Additionally, it is these more complex astronomy projects that achieve the most scholarly success.  This means that easy-to-understand projects (requiring little additional education) don't have much potential for new knowledge, while more complex projects with significant teaching requirements also offer educational opportunities as a side benefit.  So this success factor is not just a predictive factor, but also a side effect of the project's very nature.

Entertain 0.095559984
Reward -0.080876458
Challenge 0.183599383
Educate 0.368265236
Motivate the User 0.024092987
Create a Community 0.186135395
Interact in Real Time 0.183557797
Provide Feedback 0.034732258
Offer Excitement -0.310943178
Encourage Dialogue 0.091740298
Provide Data Access 0.27963916
Allow for Errors 0.076900829
Be Audacious 0.272500062
Stay Focused -0.373308726
Make it Convenient -0.02391735
Make Learning Easy 0.231746537
Make Participating Easy 0.113893409

But now let's look at a different success indicator...general popularity.  Here we get a surprisingly different result.  After the importance of offering a reward (previously discussed) the two biggest factors are to Offer Excitement and Be Audacious.

Entertain -0.351447347
Reward 0.576650297
Challenge 0.30922189
Educate -0.341196228
Motivate the User 0.06370121
Create a Community -0.165903482
Interact in Real Time -0.374093447
Provide Feedback 0.140734236
Offer Excitement 0.442401972
Encourage Dialogue 0.088954375
Provide Data Access -0.420434771
Allow for Errors -0.554451346
Be Audacious 0.429358299
Stay Focused -0.366548773
Make it Convenient -0.324426406
Make Learning Easy -0.377610427
Make Participating Easy -0.477168855


What makes these two factors so important?  Since we are looking at general public popularity, these are the two factors that create the most "buzz" about a project.  They are the most exciting to the average person, the media publicize these projects more since they can easily excite their readership about them, and by working on big concepts they are easier to explain.  So they attract more press, more readers, and more participants.  This doesn't necessarily make them academically successful, but it does provide a steady pool of people potentially interested in joining.  So all designers need to do is channel that energy into actual participation, and the project becomes successful.

It should also be noted that "Be Audacious" and "Offer Excitement" can be very similar concepts that overlap significantly.  But there are differences (as described in my previous posts here and here), and while in many cases the audaciousness can lead to excitement, I did separate them on purpose.  It just shows that the overlap must be accounted for, and that project designers can often incorporate multiple success factors even if they focus on only a few points.

Of course, all this deals with only Astronomy projects.  Does it also apply to Ecology projects or Distributed Computing projects?  Check in tomorrow and find out!

Sunday, July 15, 2012

Making a Rewarding Dive into Statistics

Enough introduction to the project.  Let's dive right in to the final results.

First let's look at the correlations between the hypothesized success factors and the actual success results (in the form of overall popularity and scholarly references), but not at correlations between the factors themselves.  I do not dispute that many of my hypothesized factors may be correlated and co-indicative.  As non-quantitative factors developed solely from personal experience, there is not a rigorous separation between them.  I have also not systematically defined each one, relying instead on examples and a basic definition.  So the concepts may indeed overlap, as would the rankings.  While there may be some use in looking for unknown correlations between factors, that effort probably would not be fruitful.
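
For anyone who wants to reproduce these tables themselves, the calculation is just a column-by-column correlation between each factor's ranking and the two outcome measures. A rough sketch follows; the filename and column names are hypothetical placeholders, since your export of the spreadsheet may label things differently:

import pandas as pd

FACTORS = [
    "Entertain", "Reward", "Challenge", "Educate", "Motivate the User",
    "Create a Community", "Interact in Real Time", "Provide Feedback",
    "Offer Excitement", "Encourage Dialogue", "Provide Data Access",
    "Allow for Errors", "Be Audacious", "Stay Focused", "Make it Convenient",
    "Make Learning Easy", "Make Participating Easy",
]

df = pd.read_csv("citizen_science_projects.csv")  # hypothetical filename

for outcome in ["Success (Google Scholar)", "Success (Google Popularity)"]:
    print(outcome)
    for factor in FACTORS:
        # pandas returns NaN when a column is constant, i.e. the "n/a" cases
        r = df[factor].corr(df[outcome])
        print(f"  {factor}: {'n/a' if pd.isna(r) else round(r, 9)}")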

Our first set combines all interactive citizen science projects; it does not include any distributed computing projects, which behave somewhat differently.  But I do expect all the other interactive projects to have similar correlations; the differences between astronomy, ecology, and meteorology projects are not so great as to produce different correlation factors.  Or so I initially propose.


  Success (Google Scholar) Success (Google Popularity)
Entertain -0.220795537 -0.060921511
Reward 0.325528097 0.184150624
Challenge 0.207362325 0.026094657
Educate -0.192750571 -0.068373492
Motivate the User 0.207464013 -0.03338748
Create a Community 0.146555214 -0.109951901
Interact in Real Time -0.058822728 -0.202973033
Provide Feedback 0.200423776 -0.020146269
Offer Excitement 0.179612493 0.0760348
Encourage Dialogue 0.069481952 -0.01456579
Provide Data Access -0.240809998 -0.212537934
Allow for Errors -0.311429459 -0.117534344
Be Audacious 0.137497156 -0.011224756
Stay Focused -0.345734709 0.004637614
Make it Convenient -0.090581135 0.000771944
Make Learning Easy -0.178545686 -0.015490161
Make Participating Easy -0.16152641 -0.032260453

As you can see, no matter how success is defined, the strongest positive correlation for all interactive citizen science projects is the availability of a reward.   As you'll remember from my previous posts describing the success factors, these rewards are monetary rewards, not the many intangible rewards participants receive.  The intangible ones are highly important, but are covered separately under other success factors such as "Motivate the User".

In some ways this makes sense; rewards are the most directly beneficial to individuals because they provide something tangible.  Helping the community, saving the earth, or learning about science are all noble and motivating benefits, but they don't provide the same incentive as cash.  I should also note that non-cash rewards such as prizes or bounties also count as rewards, though pure cash is the most common.  So there are a variety of ways project designers can choose to reward users and create a successful project without breaking the bank; they just need to be creative.  Just check out my previous post on citizen science bounties to learn more and start the brainstorming.

Another item to note is that while Reward is a very powerful factor in success, it is not the only one, and a large number of projects offer no rewards at all.  Many of these are also successful.  In fact only 7 of 52 interactive projects received a non-zero ranking on this factor.  So this reaffirms the earlier prediction that the "Keys to Successful Citizen Science Projects" would not be all-inclusive, but would instead be a useful guide to creating successful projects.

What more can the statistics tell us?  Come back tomorrow and find out!

Tuesday, July 10, 2012

Proving My Theories of Success (aka Keeping my Fingers Crossed!)

Over the last few weeks I've laid out my theory for predicting success in citizen science projects.  I've quantified that theory and evaluated the likely success of all the projects highlighted so far on my blog.  I've also collected results data, including public popularity (Google hits) and peer-reviewed paper citations (Google Scholar).  I've even scrubbed the data and run some initial tests.  So let's see if my theory holds up.

First, just a quick note on the data.  I'm still scrubbing it a bit as I find time to dig deeper into each result.  So you'll see a new category added in my Google Scholar search footnotes for "individually reviewed".  In these cases a project, such as the Sungrazer project, was producing odd results, so I did some much broader searches and manually picked out the relevant hits.  It's time-consuming and not necessary for all the projects, but in this case it was useful (and called for by the ambiguous initial results I'd had).

Now back to the results.  First I did some preliminary checks of the hypothesized success ranking against Google popularity.  This was not successful; there was too much going on and no strong correlations.  I'd walk through the results, but they're pretty dull and uninformative.  So I next turned to the rankings compared to Google Scholar results.  This was a bit more promising, especially when using the Google Scholar rank (compared to all results) versus the hypothesized success score.  The results are below:

There are two main things of interest.  First, note the regression line and its slightly positive slope.  This would tend to disprove my thesis, but look more closely at the groupings.  We have a sizable group of highly successful projects, all with high hypothesized success scores, in the upper right hand corner: just where they should be under the theory.  But the mass of projects below is what pulls the regression away.  So looking a bit closer and controlling for the type of project, we get a cleaner result that also fits the theory much more nicely.
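
Controlling for project type is just a matter of splitting the data before fitting each regression. A sketch of that step (again with a hypothetical filename and column names standing in for whatever the spreadsheet actually uses):

import pandas as pd
from scipy.stats import linregress

df = pd.read_csv("citizen_science_projects.csv")  # hypothetical filename

for area, group in df.groupby("Primary Science Area"):
    fit = linregress(group["Hypothesized Success Score"],
                     group["Google Scholar Rank"])
    # A negative slope means a higher hypothesized score goes with a better
    # (numerically lower) Google Scholar rank, i.e., a more-cited project.
    print(f"{area}: slope={fit.slope:.3f}, R^2={fit.rvalue ** 2:.3f}")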

The most logical place to start is controlling by primary area of science.  All these projects focus on approximately the same things so they should provide good comparisons.  And indeed they do:






The only problem above is with the Astronomy set which shows only a very small correlation (a zero-slope regression line). I have some answers for that as well, but that will be a whole other posting! I also want to dive even deeper into each success factor to further control our variables.

Overall this is a much better result.  All the regressions have a negative slope, meaning the higher the hypothesized success score, the better (numerically lower) the Google Scholar rank, and thus the greater the number of peer-reviewed papers generated by that project.  So clearly there is something going on here.  This also fits in with our earlier discussion of separating project types when analyzing success, such as distributed computing vs. non-distributed computing (aka "interactive") projects.   But what exactly is going on here?  There are still more clues hiding in the data.

So far we've used an aggregate hypothesized success score that weights every single trait I identified equally.  But are some more important than others?  Do some have very little impact on project success?  Suggest candidates in the comments below and we can check them out in the next set of data runs.
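
If anyone wants to experiment before the next data run, re-weighting the aggregate score is easy to do by hand. A toy sketch follows; the factor names and weights are purely illustrative, not a claim about which factors actually matter:

def weighted_score(factor_rankings, weights=None):
    """factor_rankings: {factor name: ranking}; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in factor_rankings}
    return sum(rank * weights.get(name, 1.0)
               for name, rank in factor_rankings.items())

example_project = {"Reward": 2, "Offer Excitement": 1, "Provide Feedback": 0}
print(weighted_score(example_project))                   # equal weighting
print(weighted_score(example_project, {"Reward": 3.0}))  # emphasize Reward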



UPDATE: Hyperlinked to previous posts explaining the concept of "Hypothesized Success Ranking" and added R-squared values to charts to better indicate how well the regression line fits.  My next post will be much more statistically-based so I was holding off the math a bit in this post.  But in hindsight some additional data is necessary. 

Wednesday, July 4, 2012

Early Data on the Success of Citizen Science Projects

The last few days I've been a powerless blogger thanks to losing all electricity last Friday in the DC area storms. It's an interesting time as a citizen science weather observer, but not as a homeowner in 100 degree heat. Fortunately I've found a temporary home base and used the extra free time to our benefit.

This week I started pulling together quantitative data on the success of citizen science projects. As some of you have suggested, I first began looking at public popularity and scientific value as ways to measure success. Both also come with relatively simple data sets that can be collected online when one has free time on their hands.

A public spreadsheet with the results is available here. For this first round I collected data on 1) the number of scientific papers referencing the project (through Google Scholar), 2) the popularity of the project's web site as determined by the number of Google hits on the project name, and 3) the popularity of the project as determined by Google ranking based on the project name. It also includes my ranking of each project by the hypothesized success factors, scientific areas of interest, and type of project (as defined in my previous post classifying citizen science projects).

Combining all the data sets should let us see not just which projects are successful. It will hopefully yield additional insights into the success of projects by classification (such as which transcription-type projects are most successful). It will also help us account for differences in the data by evaluating projects on an even basis. The best example of this is separating Distributed Computing and non-Distributed Computing projects (tentatively called "Interactive Citizen Science Projects"). This is a lesson learned from earlier in the process and continued in my separation of Distributed Computing projects into their own unique tab. That may be needed for other projects as well, but for now I am only separating the project list into two while maintaining data distinctions in case further separation is needed.

In most cases my data was based on searches for each project's official name in quotation marks. For some, such as "Great Backyard Bird Count" or "MoonZoo", this is a distinctive name I can rely on for results. But some names are more ambiguous and bring up results not associated with the project. So I've used a series of methods to clean up the data. The first is to manually review results for applicability on an ad hoc basis. This eliminates some erroneous results and tests whether modifying the search criteria is necessary. When a modification is made I've recorded it with a series of asterisks as defined below (a small scripted version of the legend follows the two lists):

Google Site Listings
No asterisk - "[project name]"
* - "[project name]" citizen science
** - "[project name]" citizen science project
*** - "[project name]" app
**** - "[project name]" distributed computing
***** - "[project name]" naval history

Google Scholar Rankings
No asterisk - "[project name]"
* - "[project name]" citizen science
** - "[project name]" distributed computing
*** - "[project name without punctuation]"
**** - "[project name]" citizen
***** - "[project name]" with individual review
****** - "[project web site]"

Now that it's all up, feel free to review and analyze the data as you see fit. I'm working on my own models now but am curious what you find. Also, feel free to comment on any potentially incorrect results so we can keep the data clean. Although I reviewed a large number of the "hits" for applicability, no single person can review them all. So if you have ideas about getting a more tailored, accurate number please let me know.

Enjoy!