Sunday, April 21, 2013

Wiki Surveys and the Suggestion Box Problem

I'm still looking at the Suggestion Box Problem and finding ways citizen science can capture the breadth and depth of ideas available from the public.  There are so many ideas, of varying quality and in vastly different areas, that it seems impossible to sort through them all.  But I'm determined to find some models that can.

Beginning my research I stumbled upon a working paper published by Matthew J. Salganik and Karen E.C. Levy titled "Wiki Surveys: Open and Quantifiable Social Data Collection" (arXiv:1202.0500v1).  In it they compare different methods social scientists use to collect data from the public, and then go one step further to describe a new web site (AllOurIdeas) built to implement their ideas and prove the value of "Wiki Surveys".  This seems like a great starting point.

I've highlighted a few key points from the article and offer some of my own thoughts on its applicability to citizen science.  But I also recommend reading the full working paper yourself, available for free on arXiv.

The authors begin by looking at the problems with different ways of collecting data.
While surveys allow researchers to quantify large amounts of information quickly and at a reasonable cost, they are routinely criticized for being "top-down" and rigid; that is, the survey questions and possible responses are formulated before data collection begins, meaning that surveys generally are not open to novel or unexpected information from respondents.  In contrast, interviews allow new information to "bubble up" directly from respondents, but are slow, expensive, and difficult to quantify...Advances in computing technology now enable a hybrid approach that combines the quantifiability of a survey and the openness of an interview; we call this new class of data collection tools wiki surveys. 
The authors also describe the trade-offs involved in designing a collection instrument:
The primary advantage of closed questions is that responses can be handled with relative ease: answers can be assigned values, fed into statistical software, and employed in quantitative analysis.  These processes are relatively straightforward, fast, and inexpensive, making closed questions an efficient choice for large-scale social science surveys. In contrast, responses to open-ended questions are more complicated for researchers to reliably code and quantify...In some cases, however, open methods may provide insights that closed methods cannot because they are receptive to new information that was unanticipated by the researcher...Because respondents have a strong tendency to confine their responses to the answer choices offered (Krosnick, 1999; Schuman, 2008), researchers who construct all the possible answer choices necessarily constrain what can be learned.  This is unfortunate because unanticipated information is often the most valuable for research.
Looking at our own field, citizen science seems to rely much more heavily on the first type of collection, closed questions, and much less on open-ended questions.  In fact many of the most popular citizen science programs involve public participation by collecting defined sets of data or answering specific sets of questions.  We see this in tools such as eBird, Nature's Notebook, and the many other survey instruments that ask people to collect and report wildlife data.  There are also specific advantages to closed questions in citizen science: often the public does not know what to look for or how to report it, since they are not highly trained in that field.  In these cases the structure of closed questioning helps them organize their thoughts and narrow their focus.  It also allows data to be transmitted electronically and stored as structured data items, helping analysis and accommodating mobile applications.  So for many projects this can be a highly successful approach.  But it still means many things may be missed.
Traditional surveys attempt to collect a fixed amount of information from each respondent; respondents who want to contribute less than one questionnaire's worth of information are considered problematic and respondents who want to contribute more are prohibited from doing so.  This contrasts sharply with successful information aggregation projects on the Internet, which collect as much or as little information as each respondent is willing to provide.  Such a structure typically results in highly unequal levels of contribution: when contributors are plotted in rank order, the distributions tend to show a small amount of heavy contributors -- the "fat head" -- and a large number of light contributors -- the "long tail" (Anderson, 2006; Wilkinson, 2008).
The Zooniverse projects have had success combining types of questions.  Most of their projects break down large scientific tasks into smaller pieces and walk participants through a series of questions to describe a piece of data (picture, light curve, whale song, etc.) in a structured manner.  But they also encourage people to identify "unknown" or "interesting" aspects of each picture they are reviewing, and they have discussion forums that allow users to discuss interesting items not captured by the project tools.  This is how they stumbled on Hanny's Voorwerp, though there are other examples as well.  Interestingly, the paper authors discuss this type of approach in their background:
[Yochai] Benkler (The Wealth of Networks: How Social Production Transforms Markets and Freedom; 2006) notes that successful information aggregation systems are typically composed of granular, modular tasks.  That is, in successful systems, large problems can be broken down into smaller "chunks" which require low individual investment of time and effort (granularity) and these "chunks" can be independently completed by many individuals before being flexibly integrated into a larger whole (modularity).
In developing their model the authors propose three traits that any successful survey instrument needs in order to capture information from a large number of people and perspectives.  Although these are applied to Wiki Surveys, they may also be widely useful for our citizen science "Suggestion Box" problem and will be a useful tool in evaluating other potential approaches.  The proposed key traits are:
  • Adaptive: They should be continually optimized to elicit the most useful information for estimating the parameters of interest, given what is already known. 
  • Greedy: They should capture as much or as little information as a respondent is willing to provide.
  • Collaborative:  They should be open to new information contributed directly by respondents that may not have been anticipated by the researcher, as often happens during an interview.  Crucially, unlike a traditional "other" box in a survey, this new information would then be presented to future respondents for evaluation.  In this way, a wiki survey bears some resemblance to a focus group in which participants can respond to the contributions of others.
Based on these concepts the authors developed the AllOurIdeas web site to implement the idea of Wiki Surveys.  Project designers set a broad question they want public input on, and then seed the system with a variety of responses.  These answers are then randomly paired and presented to users as a series of either/or questions, with users picking whichever answer they prefer.  The system continues for as long as the user keeps answering, each time presenting a newly-generated pair of answers.  In addition, the user can add their own answers.  Those are added to the list of seeded answers and presented to other users as part of the survey process.  In this way researchers get rankings of their own assumed answers, as well as of additional answers provided by the public.

For example, let's say you want to find out what the most popular color is.  As a researcher you seed the survey with Red, Orange, Yellow, Green, Blue, Indigo, and Violet.  When a user logs on they are asked "Which of these two is your favorite color?" and they are forced to choose between Red and Indigo.  Then Orange and Blue.  Then Red and Orange.  The questions continue and slowly the system can rank responses based on how people respond.  Additionally, someone could add their own favorite, Maroon.  The system will now include Maroon in the randomly-generated pairs, so some users will be asked to choose between Indigo and Maroon, or Yellow and Maroon.
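
To make the mechanics of the color example concrete, here is a minimal Python sketch of the pairwise voting loop.  It is my own toy illustration, not the AllOurIdeas code: the class name PairwiseSurvey, its methods, and the made-up "appeal" weights used to simulate respondents are all assumptions.  The ranking also uses a plain smoothed win rate, which is a stand-in for whatever statistical model the paper's authors actually use.

```python
import random


class PairwiseSurvey:
    """Toy pairwise-comparison survey (hypothetical names throughout)."""

    def __init__(self, question, seed_ideas):
        self.question = question
        self.ideas = list(seed_ideas)
        self.wins = {idea: 0 for idea in self.ideas}
        self.losses = {idea: 0 for idea in self.ideas}

    def add_idea(self, idea):
        # Respondent-submitted ideas join the same pool as the seeds.
        if idea not in self.wins:
            self.ideas.append(idea)
            self.wins[idea] = 0
            self.losses[idea] = 0

    def next_pair(self, rng):
        # Pick two distinct ideas uniformly at random.
        return tuple(rng.sample(self.ideas, 2))

    def vote(self, winner, loser):
        self.wins[winner] += 1
        self.losses[loser] += 1

    def score(self, idea):
        # Smoothed win rate (assumption): an idea with no votes scores 0.5.
        played = self.wins[idea] + self.losses[idea]
        return (self.wins[idea] + 1) / (played + 2)

    def ranking(self):
        return sorted(self.ideas, key=self.score, reverse=True)


survey = PairwiseSurvey(
    "Which of these two is your favorite color?",
    ["Red", "Orange", "Yellow", "Green", "Blue", "Indigo", "Violet"],
)
survey.add_idea("Maroon")  # a respondent contributes a new answer

# Simulated respondents: invented appeal weights, purely for the demo.
appeal = {idea: 1.0 for idea in survey.ideas}
appeal["Blue"] = 3.0
appeal["Maroon"] = 2.0

rng = random.Random(0)
for _ in range(2000):
    a, b = survey.next_pair(rng)
    p_a = appeal[a] / (appeal[a] + appeal[b])
    winner, loser = (a, b) if rng.random() < p_a else (b, a)
    survey.vote(winner, loser)

for idea in survey.ranking():
    print(f"{idea:8s} {survey.score(idea):.2f}")
```

Nothing in the sketch treats Maroon differently from the seeded colors once it is added, which is the "collaborative" trait in action.  And because each vote is a complete unit of data on its own, a respondent can stop after one pair or carry on for hundreds, which is the "greedy" trait.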

The AllOurIdeas site has actually hosted many surveys since the paper was first published, with over 2.5 million votes cast over 1,500 surveys.  There were also over 60,000 new ideas submitted that the individual survey designers had not initially included and presumably never thought of.  This includes two specific surveys, one for the New York City Mayor's Office on creating a greener, greater city and one for the international Organization for Economic Cooperation and Development.  The authors noted a couple of interesting findings.

One is that "wiki surveys are best suited for situations in which there is a single predetermined question."  In some ways this fits the needs of citizen science projects since many currently exist with well-defined questions.  But it doesn't hit our biggest problem...capturing chance discoveries and unexpected results.  Some of the most powerful discoveries have been made accidentally, when a scientist was looking at a different problem or made a "mistake" in an experiment.  Just think of the discovery of penicillin or the invention of the microwave oven...neither was planned for but both became very important.  It also doesn't capture theoretical discoveries or insights, just data.  So the model does require continued development.

Additionally, it was discovered that:
In both wiki surveys, many of the highest scoring ideas were uploaded by users.  For PlaNYC, 8 of the top 10 ideas were uploaded by users, as were 7 of the top 10 for the OECD...There seem to be two general classes of uploaded ideas that score well: novel information -- that is, new ideas that were not anticipated by the wiki survey creators -- and alternative framings -- that is, new and "stickier" ways of expressing existing ideas.
The alternative framings concept is especially intriguing to me.  While it may mean people are just re-wording what is already captured in the seed responses, oftentimes re-framing is exactly what science needs.  All the data is in front of us; we just need a new perspective to understand it.  Unfortunately the authors don't go into any detail on this point, but it's definitely worth a closer look, especially since their wiki surveys demonstrate an ability to both collect data and incorporate new ideas.

All of this has given me some great ideas on how to use wiki surveys for my own research on citizen science topics. But I'm getting ahead of myself; there is still a lot more research to do first.  Hopefully this has given you some insights too, and let me know if there are other papers you find that are also interesting.  I'm still working through some more but would welcome any additional ideas. 

Just let me know in the comments below.


  1. With surveys it becomes easy to understand all the necessary steps that need to be taken in order to make improvements. Surveys are also done to get thoughts and opinions from the participants. Today online surveys are also done using online survey tools. It is easy to get the data in one go, in an organized manner, with the use of online software tools. Apart from this, surveys can be local, within a community, or global. There are also survey communities where people can give their responses. I think it is an easy way to learn about a particular product or service.

  2. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog posts. Anyway, I'll be subscribing to your feed and I hope you post again soon. A fantastic presentation. Very open and informative. You have beautifully presented your thoughts in this blog post.

  3. There was no wide passage there, although had Franklin's men made it so far, they might have stumbled upon the Bellot Strait.
