I’ve long said that the rating system in Protagonize is inherently flawed because it relies on the honesty of its participants: it depends both on users not artificially inflating their own ratings and on users not artificially deflating the ratings of their peers. As a moderator on the site and a coder with a passing interest in reputation systems, it’s clear to me that relying on users’ honesty isn’t a workable solution.
The other problem with rating systems is that there tends to be a large bias towards positive ratings. In Protagonize, where the people behind the ratings are anonymous to other users, there is a clear trend of ratings towards the 3.5-5.0 end of a 0.5-5.0 scale. Looking across the site, it’s clear that many highly rated things do not deserve those ratings; in an ideal world, the distribution would see a concentration of works rated around the 2.5 mark, with outliers at the more extreme ends of the scale. That doesn’t seem to be the case, though without access to the actual statistics I can only make this assertion based on my own observations, which will be prone to their own biases.
From what I have observed, the majority of users on Protagonize behave in the following way. Upon reading a work, if they like it, they will rate it highly, with 3.5 as a minimum. If they are friends with the author, they tend to nudge their rating up slightly, making a 4.0 become a 4.5, for example. A 5.0 tends to be given out slightly less often, being reserved either for friends or for exceptional pieces of work. If, on the other hand, users dislike a work, they tend to abstain from rating it entirely, most probably because they don’t want to hurt the author’s feelings, feel like they are giving the author a chance to improve before rating them accordingly, or just can’t bear to think about the work any longer. Obviously this creates a strong bias towards high ratings, as low ratings end up a small minority.
Nick, the founder of Protagonize, recently posted on the Protagonize blog about the misuse of the term ‘hater-rater’. The terminology was originally aimed at people who deliberately try to warp the system by actively rating others negatively, usually those with high ratings, in order to lower those authors’ rankings whilst raising their own relative standing, since dragging the site-wide average down leaves their own scores sitting further above it. Over time, the term has come to encompass anyone who makes any kind of negative rating, or who rates lowly due to ‘not getting’ a piece of work, rather than by any objective measure. Being branded a ‘hater-rater’ is obviously not something people want, which further leads people to abstain entirely from rating where they would otherwise have given a low rating. This reinforces the bias towards ratings in the 3.5-5.0 range, and I can see it perpetually refining itself as each new lower bound comes to be considered hateful, narrowing the range tighter and tighter until ratings become 5.0 or nothing at all.
These problems are only exacerbated by people who write to achieve high ratings, rather than writing well and being rewarded with high ratings. Since your overall rating is calculated as an average of all the ratings your posts receive, it makes sense, when trying to competitively climb the ladder, to write a much higher quantity of posts targeted at a group of friends you know will rate highly, rather than to actually try to write well. This way, you can raise your average rating whilst also building increasing immunity to low ratings, by virtue of an overwhelming majority of high ratings.
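To put some illustrative numbers on that (these are made up, not real Protagonize data), here is a quick sketch of how sheer volume insulates an average from the occasional low rating:

```python
# Illustrative only: how a large pile of high ratings dilutes a single low one.

def average(ratings):
    return sum(ratings) / len(ratings)

careful_author = [4.5, 4.0, 3.5, 4.0]   # a handful of posts, rated on merit
prolific_author = [4.5] * 40            # forty posts, all rated highly by friends

# The same 0.5 rating barely dents the prolific author's average.
print(average(careful_author + [0.5]))   # 3.3
print(average(prolific_author + [0.5]))  # roughly 4.4
```

A single low rating drags the careful author’s average down by nearly a full point, while the prolific author’s barely moves – exactly the immunity described above.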
Seeing this problem some time ago, I wrote The Protagonizer’s Manifesto, a call to action for fellow authors to rate everything they read, all the time, and to rate it objectively, leaving their emotions and personal opinions for comments, which are far more suitable for relaying that kind of content than a single number. While many people agreed with much of what I said, nothing much has changed. I’ve proposed a few other systems for ratings, but the core problems remain part of any rating system. Namely:
- People abstaining from rating
- Only rating things positively
Now, for a non-binary rating system, both of the above pose a problem and lead to the biasing seen above. Facebook solves these issues by actually making use of abstinence as a form of negative rating. By allowing users to ‘Like’ something but not providing a way to dislike anything, nor a way to give a more scalar value of how much one ‘likes’ something, they’ve eliminated both problems. Things with more likes are better than things with fewer likes – that’s a clear and obvious metric.
However, for protagonize, does a purely cumulative numeric metric actually provide a useful function? Let us first examine what functions ratings currently serve on the site.
- Determining approval of an individual page of work
- Determining the overall approval of an entire work
- The visibility of that work on the site in terms of recent approval
- The visibility of that work in terms of all-time approval
- The visibility of the author in terms of recent approval
- The visibility of the author in terms of all-time approval
- The average opinion of the author, in terms of the ratings given to works by said author
For the most part, one would expect visibility on the site to act as a kind of endorsement or recommendation – increased visibility implying one should read the work, while decreased visibility implies it is one to avoid. The problem with that assumption is that it has no context and provides no information about a work other than that a certain number of other people ‘liked’ it. As an author, you don’t know why people liked something, which is not very helpful, while the existing system at least attaches some context to each of the scalar values it provides, in terms of ‘Perfect in every way’, ‘Lack originality or suffers from serious mistakes’, etc. As a reader, all you know is that other readers liked this, but not why, and as such you can only judge something on its perceived popularity rather than on whether or not it meets your own criteria for what you consider good.
Thus, in terms of recommendation, just listing things or increasing their visibility based on the total number of ‘likes’ seems, for Protagonize, to be little more than an incentive for people who want to be at the top of a list, rather than a mark of anything actually good, and it promotes rating abuse such as fake accounts, liking friends’ work, reciprocal ‘liking’, etc. However, there are ways of dealing with those kinds of issues, and there have been many papers on the subject, such as ‘Immunizing Online Reputation Reporting Systems Against Unfair Ratings and Discriminatory Behavior’. Most recommendation systems are based on correlations between you, as someone who rates things, and others who also rate things, building a set of items commonly rated by others whose ratings closely match yours, but which you have not yet rated yourself. Something like this is perfectly applicable to Protagonize and far more useful in terms of recommending things you may want to read than lists based on ‘likes’ or the current ratings, but such systems fall outside the scope of this post.
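For the curious, a rough sketch of what that kind of correlation-based recommendation could look like is below; the data, storage format and function names are entirely hypothetical and not part of Protagonize:

```python
# A toy user-user recommender: score unread works by how users with similar
# rating histories rated them. Hypothetical data, not Protagonize's.
from math import sqrt

ratings = {
    "alice": {"story1": 4.5, "story2": 2.0, "story3": 5.0},
    "bob":   {"story1": 4.0, "story2": 2.5, "story4": 4.5},
    "carol": {"story2": 5.0, "story3": 1.0, "story4": 3.0},
}

def similarity(a, b):
    """Pearson correlation over the works both users have rated."""
    common = set(ratings[a]) & set(ratings[b])
    if len(common) < 2:
        return 0.0
    xs = [ratings[a][w] for w in common]
    ys = [ratings[b][w] for w in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0

def recommend(user):
    """Rank works the user hasn't rated, weighted by each other rater's similarity."""
    scores, weights = {}, {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        if sim <= 0:
            continue  # ignore dissimilar raters
        for work, value in ratings[other].items():
            if work not in ratings[user]:
                scores[work] = scores.get(work, 0.0) + sim * value
                weights[work] = weights.get(work, 0.0) + sim
    return sorted(((scores[w] / weights[w], w) for w in scores), reverse=True)

print(recommend("alice"))  # works liked by raters who rate like alice does
```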
Ratings in the current context of Protagonize are markers solely of personal reputation. It can be said that those with a higher reputation are better writers, or at least the implication is there, but due to rater bias and the compression of ratings into such a narrow band, those reputations lose their meaning. Switching to a binary system doesn’t change this fact, as the problems of gaming the system still arise, merely shifting from collecting high ratings to collecting ratings at all. So in terms of personal reputation, perhaps a non-numeric value is better suited to presenting this. Let us get to the heart of the matter.
What qualities make a writer a good writer? What qualities make a work good or bad? Many of these things aren’t quantifiable or otherwise require intense computation to work out automatically. As such, trying to map this to any numeric system isn’t going to work. What’s needed is a more flexible system based on describing qualities of a work, rather than rating it good or bad.
I’ve proposed a system like this before, based on tagging inspired by that which occurs in LittleBigPlanet on the PlayStation 3. In LBP you could choose from a series of adjectives to describe a work. I propose extending this format significantly to add the following features:
- Positive, neutral and negative connotations
- A system for indicating the reliability of any rater
- Increased context for enabling future recommendation systems
In this system, ratings would be given via three boxes: a positive, a neutral and a negative box. You could choose several tags from a pool of attributes and drag each into whatever box you liked. For example, you might indicate you felt positively about the storyline by dragging the storyline tag into the positive box, indicate you felt indifferent to the setting by dragging the setting tag into the neutral box, and indicate your displeasure at the writer’s grammar by dragging grammar into the negative box. This first step gives both the rater and the ratee more context as to what exactly is being rated and what those opinions mean. Things that are not worthy of note need not be dragged into any box, essentially abstaining on those tags, but with a rich corpus of descriptors, abstaining becomes less likely.

In terms of the reliability of raters, raters can be ranked based on how the works they have rated have been rated by others. When one rater places grammar in the negative box while several others put the tag in the positive box, that rater can be said to be less reliable, their choice being an outlying value rather than one approaching the norm. From the distance these outliers sit from the norm, a reliability score can be calculated and used to scale the effect of any outlying rating that rater gives.

In terms of displaying these ratings so that they are useful, it would be worth offering several synonyms of various basic descriptors, to avoid raters simply copying existing ratings or being biased by them when choosing what ratings to give. For authors wishing to review their feedback, a tag cloud for each of the positive and negative values could be used, or perhaps just a list of the first few most popular tags (taking synonyms into account) for each of the positive, neutral and negative aspects.

Lastly, the additional context these tags lend to recommendation systems is obvious. Raters inclined to rate things positively in terms of grammar imply a preference for good grammar, while users who frequently use the sci-fi tag in a negative context can be said to dislike sci-fi. The benefit of using nouns for descriptors is that each has no emotional context except that given by the choice of box. As such, an author seeing sci-fi high up in their negative box need not feel bad, for it’s clear that many people who don’t like sci-fi have read the story, rather than there being anything wrong with it per se.
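To make the reliability idea a little more concrete, here is a minimal sketch assuming each box maps to a number (negative = -1, neutral = 0, positive = +1); the data structures and names are illustrative only:

```python
# Sketch: rate raters by how far their box choices sit from the consensus,
# then use that reliability to weight their votes. Illustrative data only.

BOX_VALUE = {"negative": -1, "neutral": 0, "positive": 1}

# {(work, tag): {rater: box}}
tag_ratings = {
    ("story1", "grammar"): {"alice": "positive", "bob": "positive", "eve": "negative"},
    ("story1", "plot"):    {"alice": "positive", "bob": "neutral",  "eve": "negative"},
}

def consensus(votes):
    """Mean box value for one tag on one work."""
    values = [BOX_VALUE[box] for box in votes.values()]
    return sum(values) / len(values)

def reliability(rater):
    """1 minus the rater's mean distance from consensus, scaled into [0, 1]."""
    distances = [abs(BOX_VALUE[votes[rater]] - consensus(votes))
                 for votes in tag_ratings.values() if rater in votes]
    if not distances:
        return 1.0
    return max(0.0, 1.0 - sum(distances) / (2 * len(distances)))  # 2 is the max distance

def weighted_consensus(votes):
    """Consensus with each vote scaled by its rater's reliability."""
    total = sum(reliability(r) * BOX_VALUE[box] for r, box in votes.items())
    weight = sum(reliability(r) for r in votes)
    return total / weight if weight else 0.0

print(reliability("eve"))                                      # the habitual outlier scores low
print(weighted_consensus(tag_ratings[("story1", "grammar")]))  # the outlier's vote counts for less
```

In practice the consensus and the reliabilities depend on each other and would likely need to be recomputed iteratively, but the principle of scaling down outlying raters stays the same.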
Another way of presenting a simple binary or trinary system with additional helpful context is to follow a method similar to getsatisfaction, where you state your mood (good, bad, neutral) and are then offered a way to contextualize that by adding tags or emotions to the rating. Such a system could work in Protagonize: you choose ‘good’, and the option then expands to let you pick, from a list of items, what specifically made it ‘good’, or to add your own message, while still maintaining your anonymity.
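A record as simple as the following could capture that two-step flow; the field names are hypothetical:

```python
# Hypothetical shape for a mood-first rating: pick a mood, then optionally
# attach tags or an anonymous free-text message.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MoodRating:
    work_id: str
    mood: str                                      # "good", "neutral" or "bad"
    tags: List[str] = field(default_factory=list)  # e.g. ["pacing", "dialogue"]
    message: Optional[str] = None                  # stored without the rater's name

rating = MoodRating("story1", "good", tags=["dialogue"], message="The banter felt natural.")
```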
The core issue is that scalar rating systems suck for subjective assessments. Basing a system around any kind of scalar rating is inherently flawed for a site like Protagonize, where everything is highly opinion-based, or is quantifiable only by human beings and then only in a fuzzy way. Spelling, punctuation and grammar, for example, are all quantifiable in terms of good or bad, correct words versus incorrect words, etc., but a machine isn’t (yet) capable of accurately rating these, and putting such ratings in the hands of humans is prone to error and bias. No, the only option is to get rid of a ranking system entirely, and instead encourage people to judge each other qualitatively and to punish those that try to game the system by unilaterally reducing the worth of all of their judgements, benefiting everyone.
From that basis, it’s possible to construct a quantitative ranking system based upon the number of well-supported, agreed-upon qualitative statements, weighted accordingly for the subject domain. This restores the previous listings, but in a much fairer manner, with additional context that makes them far more useful for everyone involved.
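As a closing sketch of that idea, such a ranking might boil down to counting only statements with enough agreement behind them and weighting each by how much the community cares about that quality; the weights and threshold below are invented purely for illustration:

```python
# Hypothetical weights for how much each quality matters in this community.
DOMAIN_WEIGHTS = {"grammar": 1.0, "plot": 1.5, "originality": 2.0}

def score(positive_tag_counts, min_support=3):
    """Count only well-supported statements, weighted by domain importance."""
    total = 0.0
    for tag, count in positive_tag_counts.items():
        if count >= min_support:  # ignore claims only one or two raters made
            total += count * DOMAIN_WEIGHTS.get(tag, 1.0)
    return total

# 5 raters agreed on grammar, 4 on plot; the lone originality claim is ignored.
print(score({"grammar": 5, "plot": 4, "originality": 1}))  # 5*1.0 + 4*1.5 = 11.0
```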