So, if my del.icio.us inbox is any indication, the blogosphere has been abuzz lately with opinions and commentary on “folksonomy.” It’s interesting stuff, no doubt, especially for those of us who come to social computing from a library and information science background.
Unfortunately, too many of the paeans to tagging that I’ve read have completely ignored some of the key social and cultural issues associated with public and collaborative labeling of content, opting instead for a level of technology-driven optimism that I see as overly naive. I think folksonomy has incredible value—the two web sites that I use most heavily right now are Flickr and del.icio.us. And I understand that this is something that can’t be stuffed back into the bottle. Nonetheless, I don’t think that means we have to accept it with an uncritical eye, or adopt every new implementation of tagging without consideration.
I’ve been happy, however, to see some exceptions to this rule—recent posts by Lou Rosenfeld, Rebecca Blood, Anil Dash, and Foe Romeo have all addressed the darker side of bottom-up classification.
After Technorati unveiled their new tagging implementation, Rebecca Blood wrote this:
It’s certain that some people will try to game the system, deliberately tagging their photos to misdirect people, make a political statement, or otherwise promote their own interests. It seemed to me that Technorati would want to start thinking about that now: to Design for Evil, as Bruce Sterling has said.
The issue of inevitable systematic disruptive behavior has been missing from a lot of these discussions, and I hope Rebecca’s post will spur more discussion on this aspect of tagging.
Foe Romeo followed up with a wonderful example of exactly how decontextualized content, repurposed by Technorati, shapes the perception of content. She linked to the “teens” tag on Technorati, and pointed out the juxtaposition of blog posts referencing pornography with Flickr photos of kids hanging out with each other. (Her screenshot is quite different from what I get now when I try that search, however. I don’t know if that’s a regional issue, a new censorship implementation in the Technorati database, or some other factor.)
One of the topics that’s started coming up in these discussions is the extent to which any given individual’s tagging behavior is (or can be, or should be, or shouldn’t be) influenced by the tags others have assigned. In a recent post on the delicious-discuss mailing list, Saul Albert wrote:
So I’m proposing a kind of tag-brokerage system. A system by which people can form epistomology gangs who decide to share tags, and declare a concensually [sic] decided-upon meaning and remit for them. That’s when tags can start to become categories, grouped, separated, weeded, updated, expanded etc..
I’ve been mulling that over for a bit. On the one hand, as a librarian, I understand completely the value of controlled vocabularies and taxonomies. I don’t want to have to look in six different places for information on a given topic—I want some level of confidence that the things I want are grouped together. On the other hand, I don’t share the optimism that so many of my colleagues in this field seem to have that the collective “wisdom of crowds” will always yield accurate and useful descriptors. Describing things well is hard, and often context-specific.
Last night, I discovered the perfect example to illustrate my concerns. The ESP Game is a site developed by researchers at CMU, intended to create a set of descriptors for images indexed by Google.
Here’s the abstract from a paper they presented at last year’s CHI conference:
We introduce a new interactive system: a game that is fun and can be used to create valuable output. When people play the game they help determine the contents of images by providing meaningful labels for them. If the game is played as much as popular online games, we estimate that most images on the Web can be labeled in a few months. Having proper labels associated with each image on the Web would allow for more accurate image search, improve the accessibility of sites (by providing descriptions of images to visually impaired individuals), and help users block inappropriate images. Our system makes a significant contribution because of its valuable output and because of the way it addresses the image-labeling problem. Rather than using computer vision techniques, which don’t work well enough, we encourage people to do the work by taking advantage of their desire to be entertained.
Noble aims, and a brilliant way to attract and reward input. But the unintended consequences of this approach are non-trivial, as I found when I spent a few hours playing with it yesterday and today.
The way the site works is that you register as a game player, then launch the Java applet. You’re paired at random with another player, and presented with the first in a series of images.
The clock in the top left corner shows how much time is left. The thermometer along the bottom shows how many matches you’ve made, and how many images are left to describe. The “taboo words” are words you can’t use—if a color is included there, all colors are barred, as are parts of existing words (thus if china is banned, so is chin). Not all pictures include taboo words—I suspect these appear once certain words have already been regularly associated with an image, a problem I’ll come back to.
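For readers who think in code, the taboo rule as I observed it can be sketched roughly as follows. This is only my reading of the game's behavior, not its actual implementation: the function name `is_barred` and the `COLORS` list are my own inventions for illustration, and the real game may well check things differently.

```python
# Hypothetical color list; I don't know which colors the real game recognizes.
COLORS = {"red", "orange", "yellow", "green", "blue", "purple",
          "black", "white", "gray", "brown", "pink"}


def is_barred(guess, taboo_words):
    """Return True if a guess would be rejected under the rules as observed."""
    guess = guess.strip().lower()
    taboo = {word.strip().lower() for word in taboo_words}
    # A taboo word itself, or any part of one, is barred ("china" bars "chin").
    if any(guess and guess in word for word in taboo):
        return True
    # If any taboo word is a color, every color is barred.
    if taboo & COLORS and guess in COLORS:
        return True
    return False
```

So with "china" on the taboo list, `is_barred("chin", ["china"])` is true, while a guess of "plate" would still be allowed.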
You start typing words associated with the image, one at a time. When you and your partner have both typed the same word, you’re told it’s a match and you move on to the next item. The goal is to come to agreement on a word for every picture before time runs out.
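The matching step can be sketched in a few lines. Again, this is a simplified illustration under my own assumptions rather than the game's real code: `first_match` is a made-up name, and I have the two players take turns typing, whereas in the actual game both type freely against the clock.

```python
def first_match(guesses_a, guesses_b):
    """Return the first word that both players have typed, or None.

    Each argument is the ordered list of words one player types.
    Assumption: players alternate turns, player A going first.
    """
    seen_a, seen_b = set(), set()
    for i in range(max(len(guesses_a), len(guesses_b))):
        if i < len(guesses_a):
            word = guesses_a[i].lower()
            seen_a.add(word)
            if word in seen_b:  # partner already typed this word
                return word
        if i < len(guesses_b):
            word = guesses_b[i].lower()
            seen_b.add(word)
            if word in seen_a:
                return word
    return None  # time ran out without agreement


# One player starts with context-specific words, the other with generic ones.
print(first_match(["pageant", "contestant", "woman"],
                  ["woman", "girl", "crown"]))  # woman
```

Note that the agreed word is simply whichever one both players happen to reach first, which is why the choice of words matters so much to your score.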
When I started playing the game, my scores were very low. I kept trying to assess the context and content of the image, and choose descriptors based on that. So if I saw a woman in a bathing suit walking down a runway wearing a sash and a crown, I’d type pageant, or contestant. But it turns out that’s a lousy strategy for winning the game, because it’s unlikely that you’ll be matched with someone doing the same thing. If the picture is of a female, regardless of clothing or context, woman is always the most likely match. And if woman is listed as a taboo word, girl will almost always work. Unsurprisingly, with pictures of men, it is not the case that “boy” is the next best choice. Instead, the best match if “man” is taboo is typically race (“black”), hair (“bald”, “gray”), or clothing (“tie”, “suit”, or even “glasses”).
Maximizing your scores in this game means sacrificing a lot of valuable semantics. Colors are great for matching, but often are not the most critical or valuable aspects of the image. Shapes are good. Easiest to match are any images that have text, with typical “stop words” being the best matches—“the”, “of”, etc.
The game developers attempted to push people to richer semantic labels by the use of taboo words. According to their CHI paper:
Taboo words are obtained from the game itself. The first time an image is used in the game, it will have no taboo words. If the image is ever used again, it will have one taboo word: the word that resulted from the previous agreement. The next time the image is used, it will have two taboo words, and so on. (The current implementation of the game displays up to six different taboo words.)
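The accumulation the paper describes can be sketched as below. The class and method names are hypothetical, and the paper excerpt does not say which six words are displayed once more than six have been agreed on; showing the most recent six is purely my assumption.

```python
MAX_TABOO_SHOWN = 6  # the paper says up to six taboo words are displayed


class ImageLabels:
    """Tracks the words players have agreed on for a single image."""

    def __init__(self):
        self.agreed = []  # agreed words, oldest first

    def taboo_words(self):
        # Assumption: the most recent agreements are the ones displayed.
        return self.agreed[-MAX_TABOO_SHOWN:]

    def record_agreement(self, word):
        word = word.lower()
        if word not in self.agreed:
            self.agreed.append(word)


labels = ImageLabels()
print(labels.taboo_words())   # [] on first use: no taboo words yet
labels.record_agreement("coin")
print(labels.taboo_words())   # ['coin'] the next time the image appears
```

Each round's agreed word becomes a constraint on the next pair of players, so the labels for an image are built up one agreement at a time.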
In my experience with the game, however, taboo words also serve to influence player word choice. Looking at the list of words encourages you to find synonyms, rather than analyzing the image itself. If a taboo word shown is “round,” then “circle” turns out to be a very likely match, even if there may be aspects of the image that have far more semantic meaning. If one word in the image is shown, any other words, or word fragments, or letters, are likely to be typed in. In many cases, the list of five or six “taboo” words being shown completely misses key aspects of the image. One image I saw was of a Greek coin. There was no inclusion of greece or greek anywhere in the taboo words, though, nor could I get a match with my partner by typing those. Coin was there, but the other words had to do with obvious physical characteristics rather than inferred or non-explicit information.
There’s another problem that I encountered with the list of “taboo” words, one that’s even more troubling for me. One of the pictures I was shown last night was of a young black woman. The first word in the list of taboo terms was “nigga.” According to the game description, that means that two people, randomly selected, agreed upon that word as the best descriptor for the image.
The paper goes on to say:
We use only words that players agree on to ensure the quality of the labels: agreement by a pair of independent players implies that the label is probably meaningful. In fact, since these labels come from different people, they have the potential of being more robust and descriptive than labels that an individual indexer would have assigned.
Beyond the obviously disturbing example I just provided, there are other problems with this conclusion. The labels chosen by people trying to maximize their matches with an anonymous partner are not necessarily the most “robust and descriptive” labels. They’re the easiest labels, the most superficial labels, the labels that maximize the speed of a match rather than the quality of the descriptor. In addition, they’re words that are devoid of context or depth of knowledge. (Yes, increasing the number of people assigning tags, as in Flickr or del.icio.us, helps with this particular problem.)
I think, however, that the same factors that influence players of the ESP Game to try to maximize agreement rather than depth are also at work in the new folksonomic playgrounds. Increasingly, people are changing the way they label their links or photos because of how they see other people labeling them. Knowing that your descriptors will change how people can access your content can’t help but change the way you use the tags, just as knowing that people will read your blog influences the way you write. Tagging for your own retrieval is different than tagging for retrieval by people you know (say, searching for posts on your blog) and even more different than tagging for retrieval in a completely uncontextualized environment like Technorati. (Anil does a good job of thinking about this impact.)
Another weakness of this approach is that the people who are likely to have the most time to play these games and provide the content are not necessarily those with the broadest range of knowledge and expertise. (Yes, yes, I know…that’s a horribly elitist thing to say.) In fact, when I played the game with my 8-year-old sitting next to me, I did much better—his very simple suggestions were typically better than my more nuanced descriptors. Which is fine, if you’re trying to maximize search results for other eight-year-olds. But what if you want to maximize results for people who need finer granularity?
Clay argues that detractors from Wikipedia and folksonomy are ignoring the compelling economic argument in favor of their widespread use and adoption. Perhaps. But I’m arguing that it’s just as problematic to ignore the compelling social, cultural, and academic arguments against lowest-common-denominator classification. I don’t want to toss out folksonomies. But I also don’t want to toss out controlled vocabularies, or expert assignment of categories. I just don’t believe that all expertise can be replicated through repeated and amplified non-expert input.