Due to inconsistencies in format, data collection from social networking sites and Webmail requires careful attention and adaptability to ensure evidence integrity.
When it comes to forensic collection, social media and Webmail are a lawless bunch.
Everyone—from kids to great-grandparents—seems to be using social media and Webmail on a daily basis.
Take a look at these statistics:
• 1 in 5 minutes online is spent on social networks [1]
• 6.6 hours a month are spent per user on Facebook [2]
• 400 million tweets are sent every day [3]
• 72 hours of video are uploaded to YouTube every minute [4]
• 3 million blogs are started every month [5]
• 4.5 million photos are uploaded to Flickr every minute [6]
• There are:
- 3.1 billion e-mail accounts [7]
- 2.4 billion social networking accounts worldwide [7]
- 901 million active users on Facebook [8]
- 140 million active users on Twitter [9]
- 161 million members on LinkedIn [10]
- 64 million blogs on Tumblr [11]
- 54 million WordPress sites [12]
With that much data being created online, it only makes sense that some of it could be essential to a lawsuit and/or an investigation. Yet collecting that information while maintaining data integrity and reviewability is still untamed land.
Questions arise such as:
• How can the various Webmail and e-mail formats be standardized so that they can be deduplicated?
• What authorization do you need from service providers to collect information without violating the user agreements?
• What can legally be collected from social media accounts about a user’s friends and connections?
Life on the Range
The industry of digital forensics and electronic discovery is still a rather young one. Yet it has been around long enough to develop standards and best practices for handling multiple types of digital files on various mediums.
The data collection process has traditionally been about documents, e-mails, and graphics found on computers, hard drives, phones, and other mediums. Now, it also includes data from social networking sites (SNS), which requires careful attention and adaptability to ensure the digital information maintains its initial context and meaning.
The challenge of taming the land of social media and Webmail—where each platform has its own rules, or no rules at all—is just like taming the Wild West. Data collection must be done in a way to fully preserve the information, even if dealing with multiple outside parties and systems for just one social media platform.
All e-mail is not alike, and that is especially true when it comes to Webmail. Different programs and systems output e-mail in various formats, meaning the strings of metadata don’t look the same. It is nearly impossible to effectively cull down a mountain of duplicate e-mails when the data was generated using disparate Webmail and/or internal mail programs.
My company was recently involved with a project for which we collected e-mails from 80 different accounts—approximately 500,000 e-mail messages from both internal mail programs and multiple Webmail applications. The e-mails were collected during the electronic discovery process of a case, meaning that they would need to be culled and searched by attorneys to determine what would be pertinent to the lawsuit.
Many of the e-mails were in EML, a standard format used by multiple e-mail programs. Generally, EML is a file extension for e-mails saved in the MIME RFC 822 format (Multipurpose Internet Mail Extensions over the standard Internet message format) with an ASCII message header. We also collected from other standard platforms such as Lotus Notes and Exchange, which store messages in their own unique formats. Additional e-mails were collected from multiple IMAP (Internet Message Access Protocol) and POP3 (Post Office Protocol) accounts, including Gmail, Hotmail (or Live Mail), Yahoo, Apple's Me.com, and various others.
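For a sense of what these collections look like at the lowest level, Python's standard library can parse a raw RFC 822/EML message directly; the sample message below is invented for illustration:

```python
import email

# A minimal RFC 822 (EML) message; the addresses are invented examples.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 forecast\n"
    "Date: Mon, 16 Jul 2012 09:15:00 -0500\n"
    "\n"
    "Body text here.\n"
)

# Plain-text EML parses cleanly with stdlib tooling.
msg = email.message_from_string(raw)
print(msg["Subject"])   # Q3 forecast
print(msg["From"])      # alice@example.com
```

Proprietary stores such as NSF and EDB, by contrast, are container formats that need their own tooling, which is part of why no single collector covers everything.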
The collection process showed that current electronic discovery and forensics software isn't as comprehensive as we would like to think. No single tool can accurately collect all of the various e-mail formats. Multiple methods and applications must be employed to accommodate the myriad platforms and file types while still maintaining data integrity.
There is also the issue of handling deduplication, a common way to reduce the amount of data on the front end of an e-discovery project. The current method for the deduplication of e-mails is to create an MD5 or SHA1 hash of a string of text generated using portions of the e-mails’ metadata. Because specific fields are static across different copies of the file, this is a sensible way to remove duplicate files. As an example, when you send one e-mail to 10 different people, all the fields for “to,” “from,” “subject,” etc., are the same, meaning they are accurate to deduplicate against.
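The idea can be sketched in a few lines; the field list and the "|" delimiter below are illustrative choices, not any vendor's actual recipe:

```python
import hashlib

def dedup_hash(msg: dict) -> str:
    """Build an MD5 over a canonical string of metadata fields.

    The field list and "|" delimiter are illustrative assumptions;
    each e-discovery platform defines its own recipe.
    """
    canonical = "|".join([
        msg.get("from", ""),
        msg.get("to", ""),
        msg.get("subject", ""),
        msg.get("date", ""),
    ])
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Two copies of the same message (e.g., sender's copy and a
# recipient's copy) hash identically, so one can be culled.
a = {"from": "alice@example.com", "to": "bob@example.com",
     "subject": "Q3 forecast", "date": "2012-07-16T09:15:00"}
b = dict(a)
print(dedup_hash(a) == dedup_hash(b))   # True
```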
However, the storage of e-mail metadata varies greatly depending on the e-mail system; each e-mail and Webmail program can structure its data differently. Different platforms may not contain the same fields, and even when they do, the fields may be named differently, the metadata may be stored differently, or other issues may arise.
Deduplicating across different e-mail formats poses two challenges:
1. Identify all the structural and metadata variations across multiple platforms.
2. Determine what needs to be modified in each program to make them all conform to each other for the purposes of hash generation.
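The field-name variations can be pictured as a mapping table onto a shared schema; the platform labels and header names below are invented examples, not any product's real layout:

```python
# Hypothetical header-name variations for the same logical fields.
FIELD_MAP = {
    "exchange_sample": {"Sender": "from", "Recipients": "to", "Sent": "date"},
    "webmail_sample":  {"From": "from", "To": "to", "Date": "date"},
}

def to_canonical(platform: str, headers: dict) -> dict:
    """Rename platform-specific header fields to a shared schema."""
    mapping = FIELD_MAP[platform]
    return {mapping[k]: v for k, v in headers.items() if k in mapping}

# The same logical message from two systems becomes comparable.
exch = to_canonical("exchange_sample",
                    {"Sender": "x@example.com", "Sent": "2012-07-16"})
web = to_canonical("webmail_sample",
                   {"From": "x@example.com", "Date": "2012-07-16"})
print(exch == web)   # True
```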
Even though e-mail has been around for decades, and in fact pre-dates the Internet, there is no one application that can properly deduplicate across multiple e-mail platforms.
These issues regarding e-mail collection and deduplication combine with existing dilemmas in the industry about file conversion methods and hashcode creation.
There are accepted processes to convert file types (e.g., NSF to PST) and standardize them, and they have been in use for some time; the truth, however, is that those conversion methods are flawed. Streamlining that type of process comes at a cost: Metadata can be lost, either because it is changed during conversion or because items from the original source do not get converted at all. That is unacceptable when those files, and the process of working with those files, can be called into a court of law.
The dilemma with e-mail hashcodes is that, while commonly used in our industry, there is no standard method of creation. Hashcodes uniquely represent an arbitrary amount of information for the purposes of validation and verification. That data can be a file's binary content, a Web site password, an e-mail metadata consolidation string, or any other type of information that can be represented digitally.
However, each platform has a different method of mapping data and determining the hash value. While the logic behind the algorithms is the same, the results differ because the inputs and processes vary greatly. Differences between platforms can include the order of the fields, the delimiters between fields, or the way the data is recorded, such as showing time on a 12-hour versus a 24-hour clock.
In the litigation technology industry, there is a need to standardize e-mail hashing across all processing platforms. It is not that any individual process is incorrect; on the contrary, each is perfectly sound and logical. Yet there is no way to work with that data across platforms because there is no standard for how data is stored and, thus, no definitive method for hashcode creation.
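A toy demonstration of the problem: the same logical e-mail serialized under two invented platform conventions (different delimiter, 12-hour versus 24-hour clock) yields digests that never match:

```python
import hashlib

# The same message as two platforms might serialize it before hashing.
# Both recipes are invented for illustration.
platform_a = "alice@example.com|bob@example.com|Q3 forecast|09:15 AM"
platform_b = "alice@example.com;bob@example.com;Q3 forecast;09:15"

hash_a = hashlib.md5(platform_a.encode("utf-8")).hexdigest()
hash_b = hashlib.md5(platform_b.encode("utf-8")).hexdigest()

# Logically identical e-mails, yet the digests differ, so
# cross-platform deduplication silently fails.
print(hash_a == hash_b)   # False
```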
So, how do you work with various e-mail platforms?
My company ended up designing a solution for Webmail and e-mail that performed the comparison and deduplication we needed. We began by reverse engineering the various programs and then thoroughly analyzing each field from every platform to determine the differences. For example, one e-mail platform may list attachments as “attachment1.doc; attachment2.doc”, while another might list them as “attachment1.doc,attachment2.doc”. Slight distinctions such as a semicolon versus a comma, or a missing space, render hash-based e-mail deduplication completely ineffective.
Next, we wrote custom code to parse the data, create hashcodes, and store the information so that it could be processed through a standard electronic discovery platform. The end process loads the native files, verifies metadata, changes the data temporarily to generate a hashcode, and then reverts the data to the original. The hashcode is retained for deduplication, but the data itself is never altered. Additionally, we kept a forensic copy of the source files, as is customary and best practice, to compare and validate as needed.
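A minimal sketch of that normalize-hash-revert approach follows; the field names and normalization rules here are illustrative assumptions, not our production code:

```python
import hashlib

def normalize(msg: dict) -> dict:
    """Return a temporary, canonical copy for hashing only.

    The rules (semicolon-space attachment separator, lower-cased
    addresses) are illustrative assumptions.
    """
    canon = dict(msg)  # copy, so the original is never touched
    # Unify attachment delimiters: "a.doc,b.doc" -> "a.doc; b.doc"
    parts = [p.strip()
             for p in canon.get("attachments", "").replace(";", ",").split(",")
             if p.strip()]
    canon["attachments"] = "; ".join(parts)
    canon["from"] = canon.get("from", "").lower()
    canon["to"] = canon.get("to", "").lower()
    return canon

def stable_hash(msg: dict) -> str:
    """Hash the canonical copy; the source dict stays unchanged."""
    c = normalize(msg)
    s = "|".join([c["from"], c["to"], c.get("subject", ""), c["attachments"]])
    return hashlib.md5(s.encode("utf-8")).hexdigest()

# Two platforms' renderings of the same message now hash identically,
# while the originals are left exactly as collected.
m1 = {"from": "Alice@Example.com", "to": "bob@example.com",
      "subject": "Q3", "attachments": "a.doc; b.doc"}
m2 = {"from": "alice@example.com", "to": "bob@example.com",
      "subject": "Q3", "attachments": "a.doc,b.doc"}
print(stable_hash(m1) == stable_hash(m2))   # True
print(m1["attachments"])                    # a.doc; b.doc (unchanged)
```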
One thing we had to consider was creating a process that would not exclude data potentially responsive to the case. As applicable, we erred on the side of inclusion versus exclusion to ensure the results were sound. We also conducted numerous quality checks, and made alterations as required, to confirm our process was accurate, effective, and defensible.
Ropin’ In Facebook, Twitter, LinkedIn
Each social media platform is different, with unique code and variations. Each one runs on its own hardware and software platform, and some, such as Facebook, have even developed custom technology to run their sites. Because of that, each requires its own method of forensically collecting data. Additionally, collection processes have to keep up with the constantly changing code base for these social media giants.
Facebook was the first platform to create a simple way to download a user’s information. The archive is comprehensive and quicker than one created with an outside solution. It includes all posts, messages, and chat conversations as well as photos and videos that the user has shared. There is also the option of an “expanded archive” that includes additional historic information such as IP addresses used during logins. Facebook data is provided in an HTML format that can be viewed on a computer.
The downside of this collection module is that the user may need to download the data himself. Even if a forensics company has the user name and password to log into the account and download the information, Facebook has implemented other security protocols that can require the account holder’s participation. For example, once the e-mail is received from Facebook noting the archive is ready for download, the link may direct you to a page with a randomly generated question that only the account holder can answer, such as naming someone in a photo. While it is possible to research the account holder and determine the answer, sometimes the most time-efficient method is to have the individual download his own account information.
If the user is involved in downloading his own Facebook information, it should be done in conjunction with the company handling the forensic collection to ensure everything is handled expertly. It may also need to follow a specific protocol, e.g., requiring that the archive be compressed, encrypted, and uploaded to a secure FTP site.
Twitter accounts do not have an internal method for a user to easily download all of his tweets. However, there are other methods—either writing custom code or using an emerging platform—to grab all available tweets, including contacts, lists, accounts that are following the user and/or being followed by the user, retweets, geographic places, and links. Older tweets no longer stored on Twitter can be accessed by leveraging systems that store databases of archived tweets.
After Twitter information has been downloaded, it needs to be displayed in a format for review by an outside party with the ability to view tweets from multiple users at one time. This can be done using the foundation of an existing application, like Tweet Nest, and modifying the code for viewing requirements. This kind of interactive Web-style database allows attorneys to view and filter tweets by year, month, and day, as well as search for tweets by keywords.
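A toy version of that filtering logic, with an in-memory list standing in for the archived-tweet database (the accounts and tweets are invented):

```python
from datetime import datetime

# Toy data standing in for a collected-tweet database.
tweets = [
    {"user": "acme_corp", "time": datetime(2012, 3, 5, 14, 2),
     "text": "Shipping delayed again"},
    {"user": "acme_corp", "time": datetime(2012, 6, 1, 9, 30),
     "text": "New contract signed"},
    {"user": "j_doe", "time": datetime(2012, 6, 2, 11, 0),
     "text": "Contract terms look off"},
]

def filter_tweets(tweets, year=None, month=None, keyword=None):
    """Filter by year/month and case-insensitive keyword,
    mirroring the year/month/day and keyword filters a review
    interface would expose."""
    out = []
    for t in tweets:
        if year and t["time"].year != year:
            continue
        if month and t["time"].month != month:
            continue
        if keyword and keyword.lower() not in t["text"].lower():
            continue
        out.append(t)
    return out

hits = filter_tweets(tweets, year=2012, month=6, keyword="contract")
print(len(hits))   # 2
```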
For LinkedIn, our experience suggests that the most effective way to gather data is by writing custom code. Due to the way that information is stored and structured on the site, LinkedIn is the most disjointed system of all the major social media networks and thus the most difficult one from which to collect data. Through custom coding, we have had success in pulling all profile information, including groups to which the user belongs. At the time of writing this article, however, LinkedIn is modifying its platform, and the upgrade may allow for easier collection.
Cloud-based documents and calendars can also be collected through an existing application or by writing code to fit specific requirements. Once that information is collected, it can be converted to formats that can be opened in common programs, such as Microsoft Office.
Google’s applications, such as Google Docs, Gmail, chats, and other correspondence, can now be collected through the company’s recently launched e-discovery tool, Google Apps Vault. For a small monthly fee, Vault adds capabilities for information governance, e-mail and chat archiving, legal holds, e-discovery searching, exporting, and auditing. This comprehensive suite was a needed addition for business customers and greatly simplifies future collections of Google information.
What do we have the right to collect?
While these are social media sites, there is still some expectation of privacy. The amount of privacy varies depending on the platform and how the content is distributed through it. For example, most tweets on Twitter are public and easily accessible, but direct messages are private. Additionally, a company can’t “spider out” and get information from someone just because that person is linked with the user being collected. Similarly, courts don’t appreciate “friending” someone as a pretense to being able to collect that person’s information.
However, even if content is private, that doesn’t mean that it is privileged. Any content posted online or e-mailed can still be collected for a legal matter.
Though there are many techniques for gathering data from social networks and Webmail online, digital forensics and e-discovery companies need to proceed with caution. Not every collection method is acceptable. It is important for companies to have proper authorization from the service provider. A common obstacle faced in the collection across various platforms involves the user agreement between the service provider and end user. While a forensics company can write code to collect information, doing so can violate the user agreement and earn the negative connotation of “scraping.” Each platform’s terms of service should be reviewed carefully to determine if the agreement will be violated—either by the manner in which collection happens or because of the information that is gathered.
This Wild West can and will be tamed. In the future, Webmail applications and social platforms will follow the lead of Facebook and Google and establish methods within their applications to collect, search, and view archived records. Similarly, e-discovery and digital forensics firms will place an emphasis on learning and understanding the best practices involved in Webmail and social media collection.
However, we are not yet to that point. Before collecting any Webmail or social media, it is important to conduct an in-depth vetting process with the companies involved to learn about their procedures, protocols, and quality control standards. Once these processes become standardized, we’ll all ride off into the sunset.
1. “It’s a Social World.” comScore, 21 Dec. 2011. Accessed 13 July 2012.
2. Lipsman, Andrew. “comScore Voices.” comScore, 23 Dec. 2011. Accessed 13 July 2012.
3. Gaskell, Adi. “Twitter Passes 400 Million Tweets Per Day – Most from Mobile.” Technorati, 7 June 2012. Accessed 12 July 2012. http://technorati.com/social-media/article/twitter-passes-400-million-tw...
4. “Statistics.” YouTube. Accessed 19 July 2012.
5. “State of the Blogosphere 2011.” Technorati, 4 Nov. 2011.
6. “comScore Media Metrix.” Flickr, Aug. 2011. Accessed 13 July 2012.
7. “Email Statistics Report 2011-2015.” The Radicati Group, Inc., May 2011. Accessed 13 July 2012.
8. “Key Facts.” Facebook, 2012. Accessed 13 July 2012.
9. “Twitter Turns 6.” Twitter, 21 Mar. 2012. Accessed 12 July 2012.
10. “About.” LinkedIn, 2012. Accessed 13 July 2012.
11. “About.” Tumblr. Accessed 13 July 2012.
12. “Stats.” WordPress.com. Accessed 13 July 2012.
Gary Torgersen is Vice President of Technology at DSi, Document Solutions, Inc. A Certified Computer Examiner (CCE) and member of the International Society of Computer Forensics Examiners (ISCFE), he has worked on hundreds of digital forensics and e-discovery cases. Document Solutions, Inc. (DSi), 414 Union Street, Suite 1210, Nashville, Tenn. 37219; (615) 255-5343; firstname.lastname@example.org, www.dsi.co