Machine Learning: Truth, Lies and “Gold Standards”

There’s an interesting article in a recent issue of AI Magazine called “Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation.” AI Magazine is considered the “journal of record for the AI community” and is a product of the Association for the Advancement of Artificial Intelligence. It’s a “delayed open access” journal, which is nice because the articles become openly available 15 months after publication.*

One reason this article caught my attention is that I’ve been thinking about Kevin Lee’s comment on my post a couple of weeks back, where he said he thought the field of artificial intelligence and law seemed “under theorized from a jurisprudential perspective, in the sense of drawing from relatively simplistic and outdated jurisprudence.” Lee also offered some great suggested readings, including Brian Leiter’s reading of Hart and his project of “naturalizing” jurisprudence and Hilary Putnam’s The Collapse of the Fact/Value Dichotomy and Other Essays.

When I saw this paper I wondered if there might be something in it that relates to this under theorization. The “truth” the authors** are exploring in their research relates to the quality of the “gold standards” used to train, test and measure the success of machine learning algorithms. These gold standards are based on “human annotations,” where a group of people each provide their interpretation of a collection of test examples. Because “there is more than one truth for every example,” this process raises issues when trying to determine an “ideal truth.”

“In our experiments we have found that we don’t need extreme cases to see clearly multiple human perspectives reflected in annotation; this has revealed the fallacy of truth and helped us to identify a number of myths in the process of collecting human annotated data.”

These are the seven myths the authors identify and debunk:

Myth One: One Truth
Most data collection efforts assume that there is one correct interpretation for every input example.
Myth Two: Disagreement Is Bad
To increase the quality of annotation data, disagreement among the annotators should be avoided or reduced.
Myth Three: Detailed Guidelines Help
When specific cases continuously cause disagreement, more instructions are added to limit interpretations.
Myth Four: One Is Enough
Most annotated examples are evaluated by one person.
Myth Five: Experts Are Better
Human annotators with domain knowledge provide better annotated data.
Myth Six: All Examples Are Created Equal
The mathematics of using ground truth treats every example the same; either you match the correct result or not.
Myth Seven: Once Done, Forever Valid
Once human annotated data is collected for a task, it is used over and over with no update. New annotated data is not aligned with previous data.
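Myths One, Two and Six can be made concrete with a small sketch. Assuming a hypothetical set of relation labels and five annotators per example (this is not the authors’ actual CrowdTruth data or metrics, just an illustration), majority voting collapses everything to a single “gold” label, while a simple per-example agreement score would reveal that some examples are much more contested than others:

```python
from collections import Counter

# Hypothetical annotations: five annotators label the relation in each example.
annotations = {
    "ex1": ["treats", "treats", "treats", "treats", "causes"],    # clear case
    "ex2": ["treats", "causes", "prevents", "causes", "treats"],  # contested case
}

def majority_label(labels):
    """Collapse to a single 'gold' label, discarding all disagreement."""
    return Counter(labels).most_common(1)[0][0]

def agreement(labels):
    """Fraction of annotators who chose the majority label --
    a crude per-example measure of how clear-cut the example is."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

for ex, labels in annotations.items():
    print(ex, majority_label(labels), agreement(labels))
```

Both examples end up with a gold label, but the agreement scores (0.8 versus 0.4) carry exactly the signal that a one-truth gold standard throws away: treating a match on ex1 the same as a match on ex2 is the equal-examples assumption of Myth Six.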

They conclude that “semantic interpretation is subjective, and gathering a wide range of human annotations is desirable.”

Their research focused on “medical relation extraction,” and it might be interesting to apply their methodology in a legal context and compare their findings with how natural language processing and gold standards have been developed in AI and law. Readers more familiar with the work of the AI and law community may be able to point to research in this area and perhaps add references in the comment section.

* As luck would have it Lora Aroyo, one of the authors, announced that this article has also been made available here:
** Lora Aroyo, associate professor, Department of Computer Science, VU University Amsterdam, The Netherlands; Chris Welty, research scientist at Google Research in New York.


  1. I would tend to agree that AI and law is undertheorized jurisprudentially, but in the same breath I will also claim that most existing theory in general jurisprudence is a distraction at best, and directly counterproductive at worst. The myths you list (and their judicial correlates) are certainly a stumbling block, but even more so, I think the biggest challenge for a general jurisprudence of the kind that would actually be useful for AI+law is explaining how legal reasoning works when it actually works reasonably well without much disagreement (except when provoked by the interests of a party). Such cases tend to be of no interest when seeing things in terms of the human platform, but with computers you need to be a lot more explicit about these things.

    The example of an extremely difficult legal problem from a computational perspective that I keep repeating ad nauseam is the set of legal transactions involved (and usually implicitly negotiated) in buying a cupful of coffee, and whatever other property transactions it may entail (e.g. ownership of the container, short-term lease to use some part of the premises etc.). You can of course engineer away all of these complications, in which case you end up with a vending machine and a transaction with fairly sharp and well-defined edges. (And the vending-machine world is where much of AI+law seems to be stuck.) Obviously there is no research funding or commercial potential in coffeecup jurisprudence, but even when working on other problems one shouldn’t be afraid to ask seemingly stupid questions like these.