There’s an interesting article in a recent issue of AI Magazine called “Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation.” AI Magazine is considered the “journal of record for the AI community” and is a product of the Association for the Advancement of Artificial Intelligence. It’s a “delayed open access” journal, which is nice because articles become openly available 15 months after they’ve been published.*
One reason this article caught my attention is that I’ve been thinking about Kevin Lee’s comment on my post a couple of weeks back, where he said he thought the field of artificial intelligence and law seemed “under theorized from a jurisprudential perspective, in the sense of drawing from relatively simplistic and outdated jurisprudence.” Lee also offered some great suggested readings, including Brian Leiter’s reading of Hart and his project of “naturalizing” jurisprudence and Hilary Putnam’s The Collapse of the Fact/Value Dichotomy and Other Essays.
When I saw this paper I wondered if there might be something in it that relates to this under theorization. The “truth” the authors** are exploring in their research relates to the quality of the “gold standards” used to train, test, and measure the success of machine learning algorithms. These gold standards are based on “human annotations,” where a group of people each provide their interpretation of a collection of test examples. Because “there is more than one truth for every example,” this process raises issues when trying to determine an “ideal truth.”
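To make the issue concrete, here is a minimal sketch (my own illustration, not code from the paper) of how a gold standard is typically built: several annotators label the same example, and a majority vote collapses their interpretations into a single “correct” answer, discarding the minority view entirely.

```python
from collections import Counter

def majority_gold(annotations):
    """Collapse several annotators' labels for one example into a single
    'gold' label by majority vote. Hypothetical helper for illustration."""
    counts = Counter(annotations)
    label, _ = counts.most_common(1)[0]
    return label

# Three annotators read the same sentence and disagree on the relation it expresses.
labels = ["treats", "treats", "causes"]
print(majority_gold(labels))  # prints "treats" -- the dissenting reading is discarded
```

The relation names here are made up; the point is only that whichever aggregation rule you choose, the “one truth” it produces hides the fact that a plausible second interpretation existed.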
“In our experiments we have found that we don’t need extreme cases to see clearly multiple human perspectives reflected in annotation; this has revealed the fallacy of truth and helped us to identify a number of myths in the process of collecting human annotated data.”
These are the seven myths the authors identify and debunk:
- Myth One: One Truth
- Most data collection efforts assume that there is one correct interpretation for every input example.
- Myth Two: Disagreement Is Bad
- To increase the quality of annotation data, disagreement among the annotators should be avoided or reduced.
- Myth Three: Detailed Guidelines Help
- When specific cases continuously cause disagreement, more instructions are added to limit interpretations.
- Myth Four: One Is Enough
- Most annotated examples are evaluated by one person.
- Myth Five: Experts Are Better
- Human annotators with domain knowledge provide better annotated data.
- Myth Six: All Examples Are Created Equal
- The mathematics of using ground truth treats every example the same; either you match the correct result or not.
- Myth Seven: Once Done, Forever Valid
- Once human annotated data is collected for a task, it is used over and over with no update. New annotated data is not aligned with previous data.
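The first two myths in particular suggest treating disagreement as signal rather than noise. A simple way to see this (my own sketch, not one of the CrowdTruth metrics) is to score each example by the normalized entropy of its label distribution: unanimous examples score 0, evenly split ones score 1, and the score flags genuinely ambiguous examples that the “all examples are created equal” view would miss.

```python
import math
from collections import Counter

def disagreement(annotations):
    """Normalized Shannon entropy of one example's label distribution:
    0.0 when all annotators agree, 1.0 when opinions are evenly split.
    A hypothetical stand-in for the paper's own measures."""
    counts = Counter(annotations)
    if len(counts) == 1:
        return 0.0  # unanimous: no disagreement to measure
    n = len(annotations)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))  # scale to [0, 1]

print(disagreement(["treats", "treats", "treats"]))  # prints 0.0 -- clear-cut example
print(disagreement(["treats", "causes", "prevents"]))  # prints 1.0 -- genuinely ambiguous
```

An example with a high score might be ambiguous rather than badly annotated, which is exactly the distinction Myths Two and Six gloss over.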
They conclude that “semantic interpretation is subjective, and gathering a wide range of human annotations is desirable.”
Their research focused on “medical relation extraction,” and it might be interesting to apply their methodology in a legal context, comparing their findings with how gold standards are developed for natural language processing in AI and law. Readers more familiar with the work of the AI and law community may be able to point to research in this area and perhaps add references in the comment section.
* As luck would have it, Lora Aroyo, one of the authors, announced that this article has also been made available here: http://data.crowdtruth.org/Truth_is_a_lie_CrowdTruth.pdf
** Lora Aroyo, associate professor, Department of Computer Science, VU University Amsterdam, The Netherlands; Chris Welty, research scientist at Google Research in New York.