A new scarcity

The World Wide Web (WWW) has grown exponentially since 1989, when its inventor, Tim Berners-Lee, created a system of hypertextual documents linked and accessible through the Internet. They were called web pages (or simply webs) [1]. Between 1993 and 1995, the number of web servers (the computers that host websites) jumped from 130 to 22,000 [2]. Gulli and Signorini estimated that the Web had more than 11.5 billion pages in 2005 [3]. According to the Internet Archive's website (www.archive.org/about/faqs.php), its historical record of the Web contains approximately one petabyte (1,024 terabytes) of data and is growing at a rate of 20 terabytes per month [4].

Thanks to the great development the Web has experienced since its origin, some aspects of daily life have changed profoundly, for example, personal communication, business, and research. This revolution is transforming the world by leading it to the Information Society, and it is still evolving towards the Knowledge Society and the Knowledge Economy, where knowledge is considered the main asset of economic dynamics. For this reason, business and research projects based on knowledge and information are the ones most likely to succeed.

Historians are taking advantage of current technology. Museums and archives are digitizing their material both to preserve cultural heritage content and to extend user access to it. Some of these projects are Project Gutenberg, the Million Book Project, the Internet Archive, the Bibliotheca Alexandrina, Amazon, Google Books, and the Open Content Alliance [4]. Not so long ago, historians worried about the small numbers of people they could reach, the pages of scholarship they could publish, the primary sources they could introduce to their students, and the documents that had survived from the past. Digital technology has removed many of these limits [5]. They are now living through a transition from scarcity to abundance.

Still, the astonishingly rapid accumulation of digital data (obvious to anyone who uses the Google search engine and gets 300,000 hits) should make us consider that future historians may face information overload [5]. Information overload, also called infoxication (information + intoxication), is not an issue that concerns only archivists, librarians, and journalists. The Internet is able to intoxicate every single user with its huge amount of knowledge. More information is not always better; in fact, it tends to generate confusion. I myself have lots of resources available when I need to search for information: Google, Wikipedia, the UWO Library Catalogue (alpha.lib.uwo.ca/), databases (SpringerLink, IEEE, etc.), and so on. Sometimes you do not know where to start.

Information overload is not a new concept; it has been a preoccupation since the Middle Ages [6]. Immanuel Kant warned that "pure information without selection criteria is blind." Francis Bacon and Karl Popper added that "Nature will be mute while we do not learn to make it talk with both relevant and purposeful questions" [7]. Incidentally, this latter quotation suggests a certain similarity between Nature and how we have to behave towards a search engine in order to extract useful information from the Web.

The struggle to incorporate the possibilities of new technology into the ancient practice of history has led, most importantly, to questioning the basic goals and methods of historians' craft. And historians should continue taking steps, individually and within their professional organizations, to embrace the culture of abundance made possible by digital media [5]. However, such abundance of information can cause users to have difficulty finding relevant and interesting content [8]. We computer scientists are familiar with this problem. As far as I am concerned, the meaning of scarcity is currently changing from "a lack of quantity" to "a lack of quality", that is, a lack of useful content. Historians need tools that help them deal with such abundance of information. It is necessary to stay afloat in the sea of knowledge, to find the needle in the haystack.

The main obstacle to providing better results to users is the Web itself. Its content is not understandable by machines (only by humans). In other words, the Web does not incorporate mechanisms that allow automated processing of information. In order to overcome this problem, the most broadly supported solution worldwide is to represent Web content in a formal way (processable by machines) and to use techniques based on Artificial Intelligence to take advantage of this sound representation. This plan to revolutionize the Web is called the Semantic Web (semanticweb.org/wiki/Main_Page).

The Semantic Web is an attempt to enrich web pages so that machines can cooperate to perform inferences in the way that people do. The underlying idea is to give information a well-defined meaning specifically in order to enable interaction among machines [4]. When Tim Berners-Lee created the Web, his original idea was not the Web as we know it today; his idea was the Semantic Web. He explains the concept in his own words [1]:

“The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web, a web of data that can be processed directly or indirectly by machines.”

That is, information is organized in a way machines can interpret its meaning, like in a database. For example,

<?xml version="1.0" encoding="ISO-8859-1"?>
<book>
  <title>The Neverending Story</title>
  <author>Michael Ende</author>
  <year>1979</year>
</book>

A standard for describing books and other resources is the Dublin Core Metadata Standard (dublincore.org/). Information structured this way will enable questions such as "Who wrote The Neverending Story?". Notice that simple questions like this one, or like "Qué es La Historia Interminable?" ("What is The Neverending Story?", in Spanish), can currently be answered by Google. The hope is that, in the future, semantic tools will be able to answer much more complex questions as well.
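As a minimal sketch of why such markup matters, the following Python snippet (standard library only; the element names simply mirror the example record above, not any particular standard) shows how a machine can answer the authorship question mechanically once the data is structured:

```python
import xml.etree.ElementTree as ET

# The structured book record from the example above.
RECORD = """<book>
  <title>The Neverending Story</title>
  <author>Michael Ende</author>
  <year>1979</year>
</book>"""

def who_wrote(xml_text: str) -> str:
    """Answer 'Who wrote X?' by reading the <author> element."""
    book = ET.fromstring(xml_text)
    return book.findtext("author").strip()

print(who_wrote(RECORD))  # Michael Ende
```

With plain, unstructured prose, the same question would require fragile text analysis; with markup, it is a one-line lookup.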

Computer scientists are working hard to develop this technology and the tools that let researchers organize knowledge and extract it from structured information. There are lots of interesting applications. For instance, there is a tool for Medieval document XML markup: its authors present a novel tool-suite supporting the working historian in the transcription of original medieval charters into a machine-readable form (XML) [9]. The Catalogus Professorum Lipsiensis is an application of an adaptive, semantics-based knowledge engineering approach for the development of a prosopographical knowledge base on the Web, which enables historians to collect, structure, and publish prosopographical knowledge. The resulting knowledge base contains information about more than 14,000 entities and is tightly interlinked with the emerging Web of Data [10]. The Timeline tool (www.simile-widgets.org/timeline/) is basically an API for visualizing historical events; all you need is to mark up your data in XML [11]. Exhibit (www.simile-widgets.org/exhibit/) is a lightweight structured data publishing framework that lets you create web pages with support for sorting, filtering, and rich visualizations. The only web technology you need is HTML and, optionally, some CSS and JavaScript code [11].
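To give an idea of the kind of sorting and filtering such frameworks make possible, here is a small Python sketch. The records and field names are invented for illustration (this is not Exhibit's actual API or data format), but the operation mirrors what an Exhibit-style facet does over structured items:

```python
# Hypothetical structured records, similar in spirit to the items
# a framework like Exhibit publishes; fields invented for illustration.
books = [
    {"title": "The Neverending Story", "author": "Michael Ende", "year": 1979},
    {"title": "Momo", "author": "Michael Ende", "year": 1973},
    {"title": "Weaving the Web", "author": "Tim Berners-Lee", "year": 1999},
]

def facet(records, field, value):
    """Keep only the records whose `field` equals `value` (one facet filter)."""
    return [r for r in records if r.get(field) == value]

# Filter by author, then sort the matches chronologically.
by_ende = sorted(facet(books, "author", "Michael Ende"), key=lambda r: r["year"])
print([r["title"] for r in by_ende])  # ['Momo', 'The Neverending Story']
```

The point is that once the data carries explicit fields, faceted browsing is trivial for a machine, whereas the same operation over free text is not.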

Historians should embrace these new tools to isolate the relevant data from the abundance, to find the needle in the haystack.

The needle in the haystack

References
[1] T. Berners-Lee, Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web, Harper, San Francisco, USA, 1999
[2] D. J. Cohen and R. Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web, University of Pennsylvania Press, Philadelphia, USA, 2006
[3] A. Gulli and A. Signorini, The Indexable Web is More than 11.5 Billion Pages, ACM, Chiba, Japan, 2005
[4] I. H. Witten, M. Gori and T. Numerico, Web Dragons: Inside the Myths of Search Engine Technology, Morgan Kaufmann, San Francisco, USA, 2007
[5] R. Rosenzweig, Scarcity or Abundance? Preserving the Past in a Digital Era, American Historical Review vol. 108 n. 3, pp. 735-762, 2003
[6] La Infoxicación en el Siglo XVI, 2007 [http://www.documentalistaenredado.net/495/la-infoxicacion-en-el-siglo-xvi/]
[7] X. Rubert de Ventós, La Red del Pescador, 2008 [http://www.elpais.com/articulo/opinion/red/pescador/elpepiopi/20080706elpepiopi_5/Tes]
[8] H. Cramer et al., The Effects of Transparency on Trust in and Acceptance of a Content-Based Art Recommender, User Modeling and User-Adapted Interaction vol. 18 n. 5, pp. 455-496, Springer, 2008
[9] B. Burkard, G. Vogeler and S. Gruner, Informatics for Historians: Tools for Medieval Document XML Markup, and their Impact on the History-Sciences, Journal of Universal Computer Science vol. 14 n. 2, pp. 193-210, 2007
[10] T. Riechert et al., Knowledge Engineering for Historians on the Example of the Catalogus Professorum Lipsiensis, Proceedings of the 9th International Semantic Web Conference (ISWC 2010) vol. 2, Springer, 2010
[11] S. Fischer, History Museums and the Semantic Web, 2007 [http://publichistorian.wordpress.com/2007/01/16/history-museums-and-the-semantic-web/]

About mavillard

PhD student in Hispanic Studies at The University of Western Ontario.