Law as Algae
One of the many brilliant things that Google indexing has created is the Web 1T 5-gram corpus, made available to scholars via the Linguistic Data Consortium at the University of Pennsylvania.
Very roughly stated, as I understand it, n-grams have to do with the frequency with which one unit in a language is followed by another unit: for example, how many times in a given body of text the word “love” is followed by the word “fifteen,” and how predictable this 2-gram therefore is whenever “love” occurs. You can see why Google would be interested in this, because it relates directly to suggested search terms, among other things.
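To make the “love” / “fifteen” idea concrete, here is a minimal Python sketch of that calculation. It is only an illustration of the general idea, not how Google built its corpus: the function name `bigram_probability` and the tiny sample text are my own inventions, and splitting on whitespace is a deliberately crude stand-in for real tokenization.

```python
from collections import Counter

def bigram_probability(text, first, second):
    """Estimate how often `first` is immediately followed by `second`.

    Hypothetical helper for illustration: words are lowercased and
    split on whitespace, which is far cruder than real tokenization.
    """
    words = text.lower().split()
    # Count every adjacent pair of words (every 2-gram) in the text.
    pair_counts = Counter(zip(words, words[1:]))
    # Count occurrences of `first` that actually have a following word.
    first_count = sum(1 for w in words[:-1] if w == first)
    if first_count == 0:
        return 0.0
    return pair_counts[(first, second)] / first_count

# Made-up sample text: "love" occurs four times, twice before "fifteen".
sample = "love fifteen love thirty love fifteen love all"
probability = bigram_probability(sample, "love", "fifteen")
print(probability)  # 0.5
```

In this toy text, half the occurrences of “love” are followed by “fifteen,” so the estimate is 0.5. A real corpus does the same bookkeeping over vastly more text.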
Google’s “5-gram corpus” (the body of all one-, two-, three-, four- and five-word units) contains 1,176,470,663 five-word sequences that appear at least 40 times. A recent Language Log entry talked about common 5-word sequences on the Web, as discovered from this corpus. Turns out that of the top 21 commonest sequences, 18 are patches of legal boilerplate. What follows is the list of items, from least to most frequent, with the number of occurrences after each item:
Use of this Web site 19678811
of this Web site constitutes 19703371
this Web site constitutes acceptance 19723554
Web site constitutes acceptance of 19724386
eBay User Agreement and Privacy 19807811
the eBay User Agreement and 19808132
acceptance of the eBay User 19850253
of the eBay User Agreement 19850627
constitutes acceptance of the eBay 19850700
Designated trademarks and brands are 20820815
User Agreement and Privacy Policy 20917050
trademarks and brands are the 20975334
and brands are the property 21113548
brands are the property of 21139112
site constitutes acceptance of the 21556427
this result in new window 24059811
Open this result in new 24059963
the property of their respective 24891265
are the property of their 24938581
property of their respective owners 25640531
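Many of these entries are plainly overlapping slices of the same few sentences. The Python sketch below shows why, by sliding a five-word window across a single sentence. The sentence is my own reconstruction, made by joining the overlapping entries in the list above, and the helper name `ngrams` is hypothetical; treat this as a rough illustration rather than the method used to build the corpus.

```python
def ngrams(words, n=5):
    """Return every run of n consecutive words as a single string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Assumed sentence, pieced together from the overlapping list entries.
sentence = ("Use of this Web site constitutes acceptance of "
            "the eBay User Agreement and Privacy Policy")

windows = ngrams(sentence.split())
for window in windows:
    print(window)
```

One 15-word sentence produces 11 five-word sequences, so a single line of boilerplate repeated across millions of pages can account for more than half of the list by itself.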
Clearly, when it comes to law, you can run but you can’t hide. Tens of millions of faintly useful cant phrases muttered in tiny print like superstitious charms constitute the white noise of the web. Legal language fit only to be “honoured in the breach” spreads like algae. The law’s contribution to the new medium seems to be pollution.
These boilerplate phrases just beg to be made into a “5-gram corpus” haiku. I tried, but found the words to contain too many syllables, alas.
I know what you mean, Meg. There’s a found poetry feel about them, probably because they’re so devoid of context.
If you fiddled with the phrases you could get:
use of this website
acceptance of the eBay
open this result
Hardly Basho, I’m afraid.