Privacy and Internet Log Files
In the past two weeks, the New York Times reported that Microsoft has made a minor concession to European privacy authorities about how long it retains its log files. A committee of European privacy regulators had asked that these logs be kept for only six months. Microsoft’s response? Eighteen months. Yahoo used to keep them for thirteen months and just announced it will cut retention to 90 days. Google keeps them for nine months.
The privacy implications of these innocuous log files have been underestimated, particularly when you think about the full picture of your private life that companies like Google may be assembling about you. An entry in an ordinary web-server log usually contains just a tidbit of information. One “hit” on a website may look like this (but all on one line):
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
The first bundle of numbers is the IP address of the computer that requested a particular web page. “frank” is a userid, a field that is usually not enabled. The next field is the date and time of the request. Following that, and usually beginning with “GET”, is the command your web browser sent to the server. The next bits are the status code returned by the server and then the size of the entity requested. Next is something called a “referer” (a misspelling of “referrer”), which is the page that linked to the one requested, followed by details about your browser.
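To make the fields described above concrete, here is a minimal Python sketch that splits the sample hit into its parts. The regular expression and the names `LOG_PATTERN` and `parse_hit` are my own illustration, not anything a particular server or company uses, and the pattern assumes the log line has exactly the layout of the sample.

```python
import re
from datetime import datetime

# Illustrative pattern for a log line laid out like the sample above.
# Field names are assumptions chosen for readability.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '   # IP address, unused field, userid
    r'\[(?P<time>[^\]]+)\] '                        # date and time of the request
    r'"(?P<request>[^"]*)" '                        # command sent by the browser
    r'(?P<status>\d{3}) (?P<size>\d+|-) '           # status code and size
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'      # referer and browser details
)

def parse_hit(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    hit["status"] = int(hit["status"])
    hit["size"] = None if hit["size"] == "-" else int(hit["size"])
    return hit

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" '
          '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

hit = parse_hit(sample)
for field, value in hit.items():
    print(f"{field}: {value}")
```

Each hit on its own says very little. The privacy question is what happens once many of these small records are stored and combined.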
Since many people often share the same IP address (it could be one IP for an entire company or just a group of people in a house using the same internet connection), some have argued that an IP address is not personal information and that a log file therefore doesn’t contain any. The problem is that even if an IP address is not directly connected to one individual, one can do some easy analysis to make the connections. After AOL released supposedly de-identified search logs to researchers, an intrepid reporter was able to track down at least one of the users who had some very personal health-related searches in the logs (see: Users identifiable by AOL search data).
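As a rough illustration of how little effort that kind of analysis takes, the sketch below groups log entries by IP address and sorts each group by time. The entries, addresses and page names are invented for this example, and `profile_by_ip` is a hypothetical helper, not a description of how any company actually processes its logs.

```python
from collections import defaultdict

# Invented, already-parsed log entries: (ip, timestamp, page visited).
hits = [
    ("203.0.113.7", "2000-10-10 13:55", "http://www.example.com/health/symptoms.html"),
    ("203.0.113.7", "2000-10-10 14:02", "http://www.example.com/clinics/nearby.html"),
    ("198.51.100.4", "2000-10-10 14:05", "http://www.example.org/news.html"),
    ("203.0.113.7", "2000-10-11 09:12", "http://www.example.com/health/treatment.html"),
]

def profile_by_ip(entries):
    """Collect every visit seen from each IP address, in chronological order."""
    profiles = defaultdict(list)
    for ip, when, page in entries:
        profiles[ip].append((when, page))
    return {ip: sorted(visits) for ip, visits in profiles.items()}

profiles = profile_by_ip(hits)
for ip, visits in profiles.items():
    print(ip)
    for when, page in visits:
        print(f"  {when}  {page}")
```

Even without a name attached, a list like this for a single address starts to look like a description of one household or one person.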
What’s additionally troubling from a privacy point of view is that the large internet companies, like Google, Yahoo and Microsoft, don’t just have your search queries. Increasingly, they have a huge trove of data sources in their logs.
Take Google, for example. Google has its famous Google search. It also has GMail, Google Analytics, Google AdSense, Google Documents, Google Toolbar and more. Each time you “hit” one of these sites, you’re in Google’s logs. Most internet users hit Google’s logs dozens of times a day and on many of those occasions aren’t even aware that they’re using a Google service. Google has what is probably the most popular and widely used network of online advertising: AdSense. Each time you go to a website that features Google’s ads, your computer sends a request to Google’s servers and that “hit” goes into their logs, along with the information about what site you were visiting, when you visited and what ad was served. If you click on the ad, even more information is collected and logged. But even if you don’t visit a site with Google’s ads, there’s a very good chance that the webmaster is using Google Analytics to find out about usage of his or her site. (Full disclosure: I use Google Analytics for my site at www.privacylawyer.ca.) I should note that Yahoo! and MSN also have advertising networks, which collect the same sort of information. What this means is that Google, Yahoo and Microsoft register in their logs a significant portion of your usage of the internet.
And if you have a Google, Yahoo! or MSN account, that hit can be connected to your account details, including your name.
I don’t think it’s too far-fetched to think of a day when it will become standard for all investigations involving the internet to include a warrant served on Google or Yahoo! or Microsoft for all logs related to a particular user or IP address or both.
Next week, I’ll discuss efforts being made by governments and law enforcement to make log retention mandatory.
I’ve read a few blog posts and articles on this, and none have addressed the question of mirrored sites or backups. I can’t be certain whether log data sets are mirrored (simultaneously stored on another computer, perhaps in another country) or whether the logs are backed up, but these practices need to be spelled out.
The other area of interest is user control over identifying data. As users get more sophisticated and aware of their rights, companies are going to have to start offering services to meet requests for this information. Some form of self-reporting capability seems inevitable in the future. Taken a bit further, one should also be able to request a purge of all identifying data held by companies. Are these not, in some way, our data?