The Evil Side of Google? Exploring Google's User Data Collection
The author's views are entirely their own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.
Update: You can now download the complete list of Google User Data by clicking here.
Google Inc. is first and foremost a data company. In the past, it competed on a level playing field by manipulating publicly available data better than its competition. By doing this, it had unprecedented success.
Enter Web 2.0. Hard drives, processors, bandwidth, and even workers are now all relatively inexpensive. This has drastically lowered the barriers to entry in the search field. As Google’s competition has started to catch up (MSN Image Search) and new competitors are arising (Cuill), the search engine is looking for some kind of advantage. Since everyone has reasonably equal access to the internet’s content, leaders have been striving to gain access to private data. The most cost-effective way for the engines to do this is to collect data from the users that already use their services. Google has been increasingly serving its users by using their personal data to manipulate public data in individualized ways. These methods are impossible to copy without the necessary personal data.
The Methods Google Uses to Get Data
Click Tracking - Google logs all the navigational clicks (ads, actions, feature clicks, etc.) of all of its users on all of its services.
Forms - Along with the data the user enters directly into the forms (username, password, etc.), Google logs the time, date, and location of submission.
Code From Google Account Sign Up
1. Input type is hidden so user doesn't see or enter data into given field
2. Location to send user after submitting (hidden)
3. Input type is hidden so user doesn't see or enter data into given field
4. User's referrer data is used and sent via the form so Google knows where user clicked "Sign Up" (hidden)
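To make the annotations above concrete, here is a minimal Python sketch of how a server could fold hidden form fields, a timestamp, and the client's IP address into one record. It is an illustration under assumptions, not Google's actual code: the field names (`continue_url`, `referrer`), the function `build_signup_record`, and the record layout are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical names of fields the page fills in itself (the user never sees them).
HIDDEN_FIELDS = ("continue_url", "referrer")


def build_signup_record(form, client_ip, now=None):
    """Combine typed-in and hidden form data into one log record (illustrative only)."""
    now = now or datetime.now(timezone.utc)
    return {
        "submitted_at": now.isoformat(),  # time and date of submission
        "client_ip": client_ip,           # can later be used to approximate location
        "entered": {k: v for k, v in form.items() if k not in HIDDEN_FIELDS},
        "hidden": {k: form.get(k, "") for k in HIDDEN_FIELDS},
    }


record = build_signup_record(
    {
        "username": "example_user",
        "continue_url": "http://www.google.com/",
        "referrer": "http://www.example.com/landing-page",
    },
    client_ip="203.0.113.7",
)
print(record["hidden"]["referrer"])
```

The point of the sketch is that the user only types the visible fields, yet the stored record also says where they came from and when.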
Cookies - Google uses cookies on all of its web properties. Additionally, it sets advertising (DoubleClick) cookies to track users' movement around the web. By doing this, Google can track individual users on any page that has either DoubleClick or AdSense ads, which includes millions of pages that are not Google web properties.
Unique cookies stored on user's computer from multiple Google web properties
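The cross-site tracking described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how DoubleClick is actually implemented: the function `handle_ad_request`, the cookie name `id`, and the example domains are hypothetical. The idea is only that the same ad server is contacted from many sites, so one cookie ties those visits together.

```python
import uuid
from http.cookies import SimpleCookie

visits = {}  # visitor id -> pages on which the ad was loaded


def handle_ad_request(cookie_header, referer):
    """Return (visitor_id, cookie_value_to_set) for one ad request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if "id" in cookie:
        visitor_id = cookie["id"].value  # returning browser: reuse its id
    else:
        visitor_id = uuid.uuid4().hex    # first sighting: mint a new id
    visits.setdefault(visitor_id, []).append(referer)
    out = SimpleCookie()
    out["id"] = visitor_id
    return visitor_id, out["id"].OutputString()


# The same browser loads ads on two unrelated sites.
vid, set_cookie = handle_ad_request("", "http://news.example/story")
handle_ad_request(set_cookie, "http://shop.example/cart")
print(visits[vid])
```

Because the browser sends the cookie back on the second request, both pages end up under one visitor id even though neither site belongs to the ad server.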
Server Requests Stored in Log Files - Every request made to any of Google's servers (e.g., GET http://www.google.com) is stored in log files. The content stored depends on the type of request. (See ‘normal search’ below for more details.)
Example of a log file
URL - "http://www.google.com/search?hl=en&q=seomoz&ie=UTF-8"
1. IP Address from user making request. This can be used to geo-locate the user
2. Date, time, and time zone offset of user
3. Language of requested result (in this case, English)
4. Search query
5. Operating system of user
6. Browser of user
The remaining information is less important; it details the type of request, the server response, and the browser's rendering engine.
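The query-string portion of the example URL above can be unpacked with Python's standard library. This is only a sketch of what a log processor could do with such a URL, not a description of Google's own tooling. Note that the IP address, timestamp, operating system, and browser are not in the URL itself; they would come from the connection and the request headers.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.google.com/search?hl=en&q=seomoz&ie=UTF-8"

parts = urlsplit(url)
# parse_qs returns a list per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.netloc)  # host that received the request
print(params["hl"])  # language of the requested result
print(params["q"])   # search query
print(params["ie"])  # input encoding
```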
JavaScript - Google has small amounts of JavaScript embedded in websites all over the internet. When a user’s browser executes the script in the background, Google is able to learn a lot of important information about a person’s browsing habits (location, operating system, browser type and version, etc.).
Web Beacons - Google embeds small (1 pixel by 1 pixel) transparent .gifs into many of its checkout screens. As with the JavaScript, when a user downloads the invisible image, their browser sends information about their computer to Google.
Example of a Web Beacon (What, you can't see it? That is the point.)
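A web beacon needs very little machinery. The Python sketch below shows the general technique under assumptions: the function `serve_beacon`, the path `/pixel.gif`, and the logged fields are hypothetical, and the image is a commonly circulated 1x1 GIF rather than anything taken from Google. The server returns the tiny image and, as a side effect, records what the browser sent along with the request.

```python
import base64
import struct

# A commonly circulated 1x1 transparent GIF, base64-encoded.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

beacon_log = []


def serve_beacon(path, headers, client_ip):
    """Log what the browser revealed, then return the invisible image."""
    beacon_log.append({
        "path": path,
        "ip": client_ip,
        "user_agent": headers.get("User-Agent", ""),
        "referer": headers.get("Referer", ""),
    })
    return "image/gif", PIXEL


content_type, body = serve_beacon(
    "/pixel.gif?page=checkout",
    {"User-Agent": "ExampleBrowser/1.0", "Referer": "http://shop.example/checkout"},
    "203.0.113.7",
)
# Bytes 6-9 of a GIF header hold the width and height (little-endian).
width, height = struct.unpack("<HH", body[6:10])
print(content_type, width, height)
```

The user sees nothing, but the log entry already holds their IP address, browser string, and the page they were on.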
Understanding What Google Does with the Data
Store - Google stores its data in an internal database called Bigtable, spread over approximately one million servers.
Google Data In 2006
Data | Size (TB) |
Crawl Index | 800 |
Google Analytics | 200 |
Google Base | 2 |
Google Earth | 70 |
Orkut | 9 |
Personalized Search | 4 |
(Source: Bigtable: A Distributed Storage System for Structured Data)
This is the size of the compressed data in terabytes (1 TB = 1,024 GB). That puts Google's disclosed data size at over 1 petabyte (1,048,576 GB). GREAT GOOGLEY MOOGLEY! This doesn't even consider AdSense, Gmail, Google Maps, Street View, Google Images, or other private databases. That would be considered a lot of data even now, and these stats are from over two years ago, before the Web 2.0 Data Rush.
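The petabyte claim is easy to check by adding up the table. The short Python snippet below uses only the figures listed above and the binary units the text uses (1 TB = 1,024 GB, 1 PB = 1,024 TB).

```python
# Compressed sizes in terabytes, as listed in the table above.
sizes_tb = {
    "Crawl Index": 800,
    "Google Analytics": 200,
    "Google Base": 2,
    "Google Earth": 70,
    "Orkut": 9,
    "Personalized Search": 4,
}

total_tb = sum(sizes_tb.values())
total_gb = total_tb * 1024  # 1 TB = 1,024 GB
total_pb = total_tb / 1024  # 1 PB = 1,024 TB

print(total_tb, total_gb, round(total_pb, 2))
```

The disclosed datasets add up to 1,085 TB, or roughly 1.06 PB, which is where the "over 1 petabyte" figure comes from.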
Massive Data Analysis - This is a little like Charlie and the Chocolate Factory. We know that a lot of data goes into Google, and we know a lot of useful manipulated data comes out. We just don't know what happens in between.
Oompa Loompas working hard at Google, writing pretty primary-colored code.
We know that Google has many algorithms to sort and organize its data. PageRank is the most well known. It is also known that Google has many complicated spam filters, duplicate content filters, pattern detection algorithms, natural language interpreters, image recognition software, and loads of other complicated software.
Permanent Backup - The final resting place for data at Google is likely in permanent storage. Google's privacy policies hint that some user data can never be completely deleted because of permanent backups.
Understanding What Specific User Data Google Collects
Below is a list of every self-declared piece of data that Google collects when a user interacts with its many web services. Because the list only covers what Google discloses, there may be even more user data gathered by Google that is unknown to the public. Be forewarned, ignorance is bliss. After you read this, you may feel inclined to wear a tinfoil hat.
The Comprehensive List of All the Data Google Admits to Collecting from Users
Cookies and logs (described above) are used to track users in addition to the methods listed below. Note: a few of the items below require a user to opt in.
Google (Normal Search)
- Search Engine Result Pages
- Country code domain
- Query
- IP address
- Language
- Number of results
- Safe search
- Additional preferences can include:
- Street Address
- City
- State
- Zip/postal code
- Server log
- Query
- URL
- IP address
- Cookie
- Browser
- Date
- Time
- Clicks
Google Personalized Search
- Logs every website visited as a result of a Google search.
Google's data on me while I researched this article
- Content analysis of visited websites
Google Account
- Used as resource to compile information on individual users
- Sign up
- Sign up date
- Username
- Password
- Alternate e-mail
- Location (country)
- Personal picture
- Usage
- Friends
- Google Services usage
- Number of logins
Toolbar
- All websites visited
- Unique application number
- Sends all visited 404s to Google
- Toolbar synchronization function
- Stores autofill info with Google account
- Sends structure of web forms to Google
- Safe browsing
- Stores response to security warnings
- Stores autofill forms data
- Spellcheck sends data to Google servers
Web History
- Every website visited from Google SERP
- Date
- Time
- Search query
- Ads clicked
- Which service
Translate
- All text sent to Google servers
Google Finance
- Stock portfolio
- User’s stocks
- Number of shares
- Date/time bought
- Bought at price
Google Checkout
- Buyers
- Full legal name
- Credit card number
- Debit card number
- Card expiration date
- Card Verification Number (CVN)
- Billing address
- Phone number
- E-mail address
- Sellers
- Bank account number
- Personal address
- Business category
- Government-issued identification number
- Social Security Number
- Taxpayer Identification Number
- Sales Volume
- Government-issued identification number
- Transaction volume
- Business information from Dun & Bradstreet
- Transactions
- Amount
- Description of product
- Name of seller
- Name of buyer
- Type of payment used
- User trend data
- Web Beacons
- Referrer data
YouTube
- YouTube SERP data
- Registered user data
- Videos uploaded
- Comments posted
- Videos flagged
- Subscriptions
- Channels
- Groups
- Favorites
- Contacts
- All videos watched
- Frequency of data transfers
- Size of data transfers
- Click location data
- Information display data
- E-mail
- Web Beacons for tracking
- E-mail opened or discarded
- Web Beacons for tracking
- Account basics
- Password
- Username
- Location (country)
- Postal code
- Birthdate
- Gender
Gmail
- Stores, processes, and maintains all messages
- Account activity
- Storage usage
- Number of log-ins
- Data displayed
- Links clicked
- Stores all e-mails
- Contact lists
- Spam trends
- Gchat
- All conversations and who they involve.
- When service is used
- Size of contact list
- Contacts communicated with
- Gchat
- Frequency of data transfers
- Size of data transfers
- Clicks
Calendar
- Name
- Default language
- Time zone
- Usage statistics
- How long the service is used for
- Frequency of data transfers
- Size of data transfers
- Number of events
- Number of calendars
- Clicks
- Deletes every 90 days
- All events
- Who is going
- Who was invited
- Comments
- Descriptions
- Date
- Time
Desktop
- Indexes and stores
- Versions of your files
- Computer activity
- E-mails
- Chats
- Web history
- Mixed with web search results
- Content analysis of data on computer for integration into SERPs (opt-in)
- Unique application number
- Application interacts with Google’s servers
- Number of searches and response times
GOOG-411
- Phone number
- Time of call
- Duration of call
- Options selected
- Phone number used as identifier
- Records all voice commands
iGoogle
- Settings stored in Cookies
- Settings linked to Google Account
Blogger
- User photo
- Birth date
- Location
- Frequency of data transfers
- Size of data transfers
- Clicks
- Blogger Mobile
- Phone number
- Associates with Google Account
- Device identifiers
- Hardware Identifiers
Google Docs
- E-mail address
- Number of logins
- Actions taken
- Storage usage
- Clicks
- All collaborators
- All text
- All images
- All changes (previous versions)
Groups
- E-mail password
- Contents of posts
- Contents of custom pages
- Contents of external files
- Account activity
- Groups joined
- Groups managed
- List of members
- List of invitees
- Ratings made
- Preferred settings
Orkut
- Name
- Gender
- Age
- Location
- Occupation
- Religion
- Friend graph
- Hobbies
- Interests
- Photos
- Invites
- Messages
- Orkut Mobile
- Phone number
- Wireless carrier
- Content of message
- Date
- Time
- Everything a user writes
- Every blog post a user reads
Picasa
- Friend graph
- Favorite lists
- Clicks (almost all Google services track all clicks)
- All photos
- Geotags (Exif data)
- People who subscribe to albums
Mobile
- Phone number
- Device type
- Request type
- Carrier
- Carrier user ID
- Content of request
- Maps for mobile
- Location information (GPS)
- Address
- Websites visited if user asks Google to transcode
- Voice commands
Web Accelerator
- Web requests
- Cache of websites before you go to them
DoubleClick/AdWords
- Ads clicked
- Age
- Sex
- Location
- Trends of past visited websites
- IP address
Health
- Medical records
- Doctors
- Conditions
- Prescriptions
- Age
- Sex
- Race
- Blood type
- Weight
- Height
- Allergies
- Procedures
- Test results
- Immunizations
Postini
- E-mail address
- Traffic patterns
- Clicks
GrandCentral
- Credit card
- Credit card expiration date
- Credit card verification number
- Billing address
- Stores, processes, and maintains
- Voicemail messages
- Recorded conversations
- Contact lists
- Storage usage
- Number of log-ins
- Data displayed
- Clicks
- Telephony log information
- Calling-party phone number
- Forwarding numbers
- Time of calls
- Date of calls
- Duration of calls
- Types of calls
Google Merchant Search
- Name
- Contact information
- E-mail address
- Phone number
Notebook
- Stores, processes and maintains
- All content in notebook
- Nickname
- Storage usage
- Number of log-ins
Google Web Services That Conveniently Don't Have Individual Privacy Policies Disclosing What User Data is Collected
- Webmaster Tools
- Google Analytics
- AdWords
- AdSense
- Alerts
- Reader
- Earth
- FeedBurner (technically has one, but it is useless)
Search Verticals
- Image search
- Map search
- Blog search
- Book search
- News search
- Patent search
- Product search
- Scholar search
- Special search
- Video search
- Code search
By the way, Google...
I found some broken links and errors on your website. On your main privacy policy page, the link anchored with "Video Player" is broken. Additionally, you capitalized your own product incorrectly: "GMail" should be "Gmail." Lastly, the Google Store has text encoding issues on the homepage, and the link to download SketchUp is broken.
Please send my check in the mail (I am sure you already have my address).
Additional Information:
Can you trust Google to obey the rules? - Excellent analysis of the darker side of Google Inc. as a web giant.
If you have any other advice that you think is worth sharing, feel free to post it in the comments. This post is very much a work in progress. As always, feel free to e-mail me or send me a private message if you have any suggestions on how I can make my posts more useful. All of my contact information is available on my profile: Danny. Thanks!