Google Cache & Copyright – I Don’t Get It

I’m wondering about everyone’s opinion on this:

I don’t get it. Google cache is an almost complete reproduction of a webpage, and goes way beyond legitimate copying in my mind. This decision seems to open the door for every scraping program on the web today. They add a couple highlighted terms, and that’s ‘transformative’? What’s next, ads next to the cached page?

And why is it incumbent on webmasters to add a ‘no-cache’ instruction, whether a ‘noarchive’ meta tag on their pages or an exclusion in their robots.txt file? It’s not like the old days where you submit your site to a search engine; Google now indexes without asking. Truth be told, if the option were available to add a ‘yes-cache’ tag, I would do so (and I would definitely do so, and submit, for the Internet Archive); BUT letting for-profit Google build a database of the web without publisher permission smacks of ‘negative option billing’.
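
For anyone wondering what that opt-out looks like in practice: as far as I understand it, the per-page mechanism is a robots meta tag carrying a ‘noarchive’ directive. Below is a minimal Python sketch of how a crawler might check a page for it. This is only an illustration under that assumption, not Google’s actual logic, and the names `RobotsMetaParser` and `allows_cached_copy` are made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def allows_cached_copy(html):
    """Return False if the page asks robots not to keep an archived copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


# Hypothetical pages: one opts out of caching, one says nothing at all.
opted_out = '<html><head><meta name="robots" content="noarchive"></head></html>'
silent = "<html><head><title>My page</title></head></html>"

print(allows_cached_copy(opted_out))
print(allows_cached_copy(silent))
```

Note that silence counts as consent in this sketch, which is exactly the ‘negative option’ default I’m complaining about.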

We’ve got a couple of IP gurus here at Slaw, so help me, what am I missing?


  1. Steven – how does this differ from the wayback machine at

  2. When you search Google, each result has a link under it that says ‘Cached’. When clicked, this allows the user to look at the most recently indexed copy maintained in Google’s database. Google’s brand is also framed at the top of the resulting page.

    Google also caches copies of many website graphics and photos in their images database (which do have advertising next to them), but I’m pretty sure this was not in the scope of the decision.

    The problem with Google (or the Internet Archive) doing this is that it opens the door for other websites to scrape or cache copies of your content and integrate them into website functions that a publisher may not agree to. Caching content can be done for altruistic reasons, but there is a definite dark side too.

    If one has time, check out some of the conversations over at WebmasterWorld.

    Of note, this is me with a webmaster’s hat on, and I am not an IP lawyer.

  3. keep in mind that i’m just a lay internet user with interests in IP and not an IP lawyer.

    first, i’ll admit that i have very little sympathy for the argument that a website publisher is being damaged when google indexes their site. if you don’t want to tap a global audience, then pull up the drawbridge, force users to use a secret login and password and turn away the large portion of ‘drive-by’ browsers who might stumble across your brilliance. the print equivalent is to bury your work and leave it ‘unpublished’.

    yes, the technical ‘standard’ of a robots.txt is not ideal, but neither was splattering a © on every original work. the advantage of the robots.txt is that it allows the indexing to happen automatically without human intervention and its corresponding judgement / fallibility. the plaintiff in this case made things difficult by being web-aware, with prior knowledge of how spiders and the google cache worked, and ignoring the possibility of excluding spiders with a robots.txt file.

    i think that the bad faith efforts of this plaintiff definitely gave impetus to the judge to look for ways to rule the way he did, but the question this leads us to is, would a clueless plaintiff have had more success ? the judge made a very detailed analysis that relied very heavily upon google’s expert opinion. i’d be interested to see what would happen if the plaintiff had found an expert of the same caliber to provide support for the IP rights side.

    if i had been the plaintiff, i would have proclaimed ignorance of how the web worked, pulled the material from the website and put it behind a pay / password combo and then sued google when the cached version was still available. if you search google for you will find an example of a cached page that has disappeared from the web for more than the 14-20 days cited in the safe harbour portion of the decision (CIRA ‘reclaimed’ the domain when i was unable to supply personal info to match the faulty data that i had submitted to their leaky whois database).

    the litigation then might have revolved around whether google’s opt-out provisions were sufficient to ameliorate the economic damages incurred between when the material was pulled from the web and when the plaintiff realized that the material was still available in the cache. all of this would presuppose ignorance by the plaintiff of how the web is indexed and would still be a stretch for anyone professing to be trying to use the internet to distribute material for profit.

    what would have been nice to implement as a technical standard would be a way to ‘whitelist’ approved spiders, while excluding all others, including the rogue spambots. i can understand the hostility in the webmasterworld forum – the search engine spiders do have an impact on site stability and there is indeed a valid argument about how much copying is too much, but again, if no one can find your site or find info on it (webmasterworld has just disallowed all search engine robots, while attempting to roll their own sitesearch), your audience mindshare may be diminished to the point where your advertising support will look elsewhere.
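
A rough illustration of the ‘whitelist’ idea raised above, using Python’s standard `urllib.robotparser`: a robots.txt can already name the spiders it approves and shut the door on everyone else. This is a sketch under assumptions, though. `FriendlyArchiveBot` and `SomeRogueBot` are invented names, and the whole scheme only binds robots that choose to honour robots.txt, so it does nothing against the rogue spambots the comment mentions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist: two approved spiders may fetch everything,
# every other user agent is asked to stay out entirely.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: FriendlyArchiveBot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Ask whether a well-behaved robot with this user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


rules = build_parser(WHITELIST_ROBOTS_TXT)
for agent in ("Googlebot", "FriendlyArchiveBot", "SomeRogueBot"):
    print(agent, may_fetch(rules, agent, "http://example.com/page.html"))
```

An empty `Disallow:` line means nothing is off limits for that agent, while `Disallow: /` under `User-agent: *` is the catch-all exclusion for any robot not listed by name.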

  4. senatorhung, I agree with the sentiment of your post (especially the point that this is an unsympathetic plaintiff), but disagree with your first thought – indexing and inclusion in the search engines is a separate process from caching (IMO). If you display a title, link and a page abstract of my site within the search results, you’re right, that’s part of the game. Offering a link to a ‘full text’ copy of the page (even an old copy), without publisher permission, is another.

    A couple more points that are becoming clearer to me as I read more:

    1) I don’t agree that Google should be treated as an ISP, or given that status. ISPs offer as close to a current copy of a webpage as possible, to benefit the user. Google is keeping an older page copy, which they do not own, and offering that content to the user as a website feature that is connected to the Google brand. Google can be a publisher (selling advertising) or an ISP, but not both.

    2) The vast majority of website owners have no clue about the robots.txt standard, and won’t likely in the near future. The bar to prevent Google caching has been set pretty high.

    And, I’ve still got more questions:

    1) I still don’t understand why caching content in its entirety is not analogous to copying of MP3s.

    2) Will the Google Print project now go full steam ahead on copyrighted works? Highlight the search terms from user queries to ‘transform’ the page, and print publishers will be as powerless as electronic publishers?

  5. we’re just going to have to disagree about publisher permission for materials on the internet. the key point for me is that someone who makes something available on the internet usually wants it to be read. if they want to get paid for it, they can find some additional way to make it harder. yes, this is the opposite of the print world, but again, printing multiple copies costs money; churning out digital copies costs nada. a few authors (cory doctorow, scott adams, tim o’reilly) have understood this and even actively encourage users to copy the electronic versions of their books. if the book is valuable enough, word-of-mouth kicks in and print copies will be sold.

    as for the distinction between indexing and caching – to my understanding, the data collection and indexing processes are much more efficiently done as 2 completely different processes. first, you hoover up the data from the websites. once this is in-house, you index your massive ‘cache’ of the websites and then plug the results into your search engine. you need to have the cache before the index, so why not make it available to the users as well ?

    advantages: users may be able to see the original page that matches their search terms, quality control for the search engine (splogs, reverse-video text, webrings, etc.); disadvantages: possible user confusion about the source, although the link to the original is still pre-eminent and the user has to make an active choice to use the cached version.

    as i advise at my workplace, once you make info publicly available, you cannot control what endusers will do with it. if that is a concern, then you either don’t make it available, restrict its availability or try to lobby legislators to introduce intellectual property laws that will be abused way beyond their intended purpose (garage door openers, print toner cartridges, lego block designs). i do agree that treating google as an ISP was a stretch by the judge, but so was jailing sklyarov for exposing security issues with adobe.

    the vast majority of website owners have no clue about how DNS addresses are assigned, how files get transferred over the internet, even how to determine what kind of file they are looking at (windows stupidly hides filetype by default) – just because it’s obscure doesn’t mean that it’s broken or shouldn’t be followed. if that were the case, most of the fine print of legal contracts would be unenforceable.

    caching = copying mp3’s – sure, it’s all just data. those of us who buy blank cd’s are already subsidizing some of the ‘losses’ that some artists feel that they are due. i have yet to run across an independent musician who has any problem with me making mix cd’s and sharing their music with my circle of friends, whether they are here in town or on the internet via artofthemix.

    will google print go full steam ? – i hope so :)

    o’reilly publishing is leading the way with new models of how to survive as a publisher in the digital age. it comes down to what the purpose of a publisher is – it’s no longer distribution. instead it’s building collaborative teams of authors, recognizing info niches that haven’t yet been filled, innovative packaging of info, and / or providing relevant and timely info on demand. if a publisher consistently does these things, they will retain mind and market share and people will continue to pay for the right to access their products. competing publishers who continue to prop themselves up with copyright will just get pushed out of the market.

    print publishers should be even more worried about a much worse google scenario, as outlined by PBS tech commentator cringely.

  6. Hi Steven:

    I have not had time to look into this issue. A quick Google search (go figure!) resulted in a mention of Google winning a recent copyright dispute in US District Court on this issue (at least according to the following news story; I have not read the decision yet) – see:

    A quick search on the Index to Legal Periodicals (not exhaustive) uncovers the following (dated) material on the issue (once again, I have not had time to read these articles):

    Hugenholtz, P. B. “Caching and Copyright: The Right of Temporary Copying” (2000) 22 European Intellectual Property Review 482.

    Christian, Tamber. “Internet Caching: Something to Think About” (1999) 67 UMKC Law Review 477.

    “Caching on the Internet and the Proxy Caching Notice Project: Avoiding an Internet Copyright Dilemma” (1997) 52 Record of the Association of the Bar of the City of New York 968.

    Hardy, I. Trotter. “Computer RAM ‘Copies’: A Hit or a Myth? Historical Perspectives on Caching as a Microcosm of Current Copyright Concerns” (1997) 22 University of Dayton Law Review 425.

  7. Ted, the concepts may evolve slightly, but shouldn’t date. :-) Please do drop some commentary on this post when you have a chance. I’m interested in your take on things.

    senatorhung, I’m obviously not as ‘open’ in my copyright perspective :-), but I really do appreciate your thoughts… anything that makes us question, right?

  8. Google is violating author-publisher rights and committing copyright infringement when it republishes, i.e. ‘caches’, an entire or almost entire web page without the author’s consent, if the page is clearly marked ALL RIGHTS RESERVED, COPYRIGHTED, etc.
    If they do not specifically ask you to sign an agreement to waive the copyright in order to list your page on their search engine, they are in violation. But they are also so big, and backed up with such legal teams, that who has the money to stop the bullies? There are other search engines available and they do not do this. Google does it to me, and I am disappointed that the ‘hits’ to my site are drastically reduced by such unlawful practices. Why should Google get the actual web page visit hit, instead of the creator? joe martin