The Internet Archive and the Search for Integrity

Genie Tyburski, Web Manager, The Virtual Chase

Originally published in The CyberSkeptic's Guide to Internet Research (April 2004) under the title, "Lies, Damned Lies, and the Internet Archive."

Updated 15 June 2007. A prospective client had something to hide when she claimed no previous involvement in an industry rife with fraud. This claim stated in conjunction with the submission of an informed business plan rang false. Other clues about her integrity worried the lawyer. He soon suspected that she was a dishonest person. After the meeting, he consulted another partner, who in turn delivered the puzzle to my e-mail inbox. My mission was to fit the mismatched pieces of information together, either substantiating or disproving the lawyer's skepticism.

Internet Archive to the Rescue

Wanting to emphasize the importance of retaining knowledge of history, George Santayana wrote the words made famous by the film, Rise and Fall of the Third Reich--"Those who cannot remember the past are condemned to repeat it." Of course, at the time the Internet Archive didn't exist; nor did the Information Age. If it had, perhaps he would have edited his philosophy to state, "Those who cannot discover the past are condemned to repeat it."

Certainly in times when new information amounts to five exabytes, or the equivalent of "information contained in half a million new libraries the size of the Library of Congress print collections" (How Much Information 2003?), it is perhaps fortunate that librarians possess a knack for discovering information. It is also in our favor that Brewster Kahle and Alexa Internet foresaw a need for an archive of Web sites.

Internet Archive and the Wayback Machine

Founded in 1996, the Internet Archive contains about 30 billion archived Web pages. While always open to researchers, the collection did not become readily accessible until the introduction of the Wayback Machine in 2001. The Wayback Machine enables finding archived pages by their Web address. Enter a URL to retrieve a dated listing of archived versions. You can then display the archived document as well as any archived pages linked from it.

The Internet Archive helped me successfully respond to the concerns the lawyers had about the prospective client. It contained evidence of a business relationship with a company clearly in the suspect industry. Broadening the investigation to include the newly discovered company led to information about an active criminal investigation. Suddenly, the pieces of the puzzle came together and spelled L-I-A-R.

Using the Internet Archive should be a consideration for any research project that involves due diligence, or the careful investigation of someone or something to satisfy an obligation. In addition to people and company investigations, it can assist in patent research for evidence of prior art, or copyright or trademark research for evidence of infringement. It can also come in handy when researching events in history, looking for copies of older documents like superseded statutes or regulations, or when seeking the ideals of a former political administration. (Note: 25 October 2004. A special keyword search engine, called Recall Search, facilitates some of these queries. Unfortunately, it was removed from the site during mid-September. Messages posted in the Internet Archive forum indicate they plan to bring it back. Note: 15 June 2007. I think it's safe to assume that Recall Search is not coming back. However, check out the site for developments in searching archived audio (music), video (movies) and text (books).)

Recall Search at the Internet Archive

But while the Internet Archive contains information useful in investigative research, finding what you want within the massive collection presents a challenge. If you know the exact URL of the document, or if you want to examine the contents of a specific Web site--as was the case in the scenario involving the prospective client--then the Wayback Machine will suffice. But searching the Internet Archive by keyword was not an option until recently. (Note: See the note in the previous paragraph.)

During September 2003, the project introduced Recall Search, a beta version of a keyword search feature. Recall makes about one-third, or 11 billion, Web pages in the archived collection accessible by keyword. While it further facilitates finding information in the Internet Archive, it does not replace the Wayback Machine. Because of the limited size of the keyword indexed collection and the problems inherent in keyword searching, due diligence researchers should use both finding tools.

Recall does not support Boolean operators. Instead, enter one or more keywords (fewer is probably better) and, if desired, limit the results by date.

Results appear with a graph that illustrates the frequency of the search terms over time. It also provides clues about their context. For example, a search for my name limited to Web pages collected between January 2002 and May 2003 finds ties to the concepts, "school of law," "government resources," "research site," "research librarian," "legal professionals" and "legal research." The resulting graph further shows peaks at the beginning of 2002 and in the spring of 2003.

Applying content-based relevancy ranking, Recall also generates topics and categories. Little information exists about how this feature works, and I have experienced mixed results. But the idea is to limit results by selecting a topic or category relevant to the issue.

Suppose you enter the keyword, Microsoft. The right side of the search results page suggests concepts for narrowing the query. For example, it asks if instead you mean Microsoft Windows, Microsoft Internet Explorer, Microsoft Word, and so on. Likewise, a search for turkey suggests wild turkey, the country of Turkey, turkey hunting, roast turkey and other interpretations.

While content-based relevancy ranking can be a useful algorithm, it is far from perfect. Some topics and categories generated might not seem to make sense. If the queries you run do not produce satisfactory results, consider another approach.

Pinpoint the specific sites you want to investigate by first conducting the research on the Web. In the prospective client example, an old issue of the newsletter of the company under criminal investigation (Company A) mentioned the prospective client's company (Company B). This clue led us to Company A's Web site where we found no further mention of Company B. However, with the Web site address in hand, we reviewed almost every archived page at the Internet Archive and found solid evidence of a past relationship. Additional research, during which we tracked down court records and spoke to one of the investigators, provided the verification we needed to confront the prospective client.

Advanced Search Techniques

You can display all versions of a specific page or Web site during a certain time period by modifying the URL. Greg Notess first illustrated this strategy in his On The Net column (See "The Wayback Machine: The Web's Archive," Online, March/April 2002).

A request for all archived versions of a page looks like this:

http://web.archive.org/web/*/http://www.domain.com

The asterisk is a wildcard that you can modify. For example, to find all versions from the year 2002, you would enter:

http://web.archive.org/web/2002*/http://www.domain.com

Or to find all versions from September 2002, you would enter:

http://web.archive.org/web/200209*/http://www.domain.com

Sometimes you encounter problems when you browse pages in the archive. For example, I often receive a "failed connection" error message. This may be the result of busy Web servers or a problem with the page. It may also occur if the live Web site prohibits crawlers.

To find out if the latter issue is the problem, check the site's robot exclusion file. A standard honored by most search engines, the robot exclusion file resides in the root-level directory. To find it, enter the main URL in your browser address line followed by robots.txt. Like this: http://www.domain.com/robots.txt .

If the site blocks the Internet Archive's crawler, it will contain two lines of text similar to the following:

User-agent: ia_archiver
Disallow: /

If it forbids all crawlers, the commands should look like this:

User-agent: *
Disallow: /

It's common for Web sites to block crawlers, including the Internet Archive, from indexing their copyrighted images and other non-text files. If the Internet Archive blots out images with gray boxes, then the Web site probably prevents it from making the graphics available.

If the site does not appear to block the Internet Archive, don't give up when you encounter a "failed connection" message. Return to the Wayback Machine and enter the Web page address. This strategy generates a list of archived versions of the page whereas Recall presents specific matches to a query. One of the other dated copies of the page may load without problems.

Conclusion

While the Internet Archive does not contain a complete archive of the Web, it offers a significant collection that due diligence researchers should not overlook. Tools like the Wayback Machine and Recall Search provide points of access. However, these utilities only handle simple queries. You can search by Web page address or keyword. You cannot conduct Boolean searching or limit a query by key information. Moreover, Recall Search limits keyword access to one-third of the collection. Consequently, conduct what research you can elsewhere first using public Web search engines and commercial sources. Then use the information you discover to scour relevant sites in the Internet Archive.