DigitalLibrary

an overview

Digital Library Research

The official documentation for the C·I·T·E digital library architecture is here. (Note as of June 18, 2012: This documentation site is under development and should expand and change rapidly.)

In this space, the Folio server at Furman, we are implementing open-ended research projects with digital libraries at their heart: two very significant papyrus fragments (the “Bankes Homer” and the Aristotelian Constitution of the Athenians, both in the British Library), two Biblical manuscripts from Lichfield Cathedral, and the Botanical Collections of John Bartram, William Bartram, John Lawson, and Mark Catesby (in the Royal Museum of Natural History, London). The botanical volumes are each unique, of course, and the Aristotle papyrus is the only existing witness to that philosopher's history of Athenian democracy. Each of these editions will be an evolving, integrated environment for interdisciplinary research. The editorial work on all of these projects is being done, or will be done, primarily by undergraduate researchers.

This is a project entirely in keeping with the best traditions of the discipline called “Classics”, the study of Greek and Roman antiquity.

Background

In the early 200s BCE, the Greek poet Callimachus compiled the Pinakes (“Tables”) that were the first catalogue of the holdings of the Library at Alexandria. Since then Classicists have always played an active role in the science of managing information, particularly the information that, when formed into an interlocking wall, constitutes scholarly knowledge. The mortar that holds this wall together has always been citation, the ability to identify uniquely and concisely a particular portion of a particular text that exists in an ontological hierarchy: Author, Work, Edition - Section, Word, Character. In that scheme, as presented, everything before the hyphen is the realm of the librarian; everything after the hyphen is in the charge of the editor; everything pertains to the reader. My ongoing research aims at simplifying digital library architecture from its current fragile complexity by using canonical citation as the organizing principle. The goals are simplicity, longevity, flexibility, and scalability.

Problems & Challenges

The Homer Multitext (“HMT”) is both the inspiration and foundation of this work at Furman. The Editors of the HMT have digital images of five large Byzantine codices. Each folio-side of each codex is represented by between one and fifteen images. Each folio-side contains many texts: a poetic text, bibliographic texts, several different commentaries, running enumerations of features, graphical elements. Valuable insight into all of these features is to be found in texts external to these codices: Plato, Aristotle, the Greek Historians, the Attic Orators. We have morphological and lexical data to include and link. How do we organize all of this for our own study of the ancient world?

Historical Botany and Phenology · We have volumes of botanical specimens collected in the Carolinas in the 18th century. Each folio has attached to it one or more pressed plant specimens, one or more printed or hand-written labels, and other hand-written notes. The collection of these specimens is described in published travelogues by the scientists, men like William Bartram and Mark Catesby. Each plant represented has its own taxonomic ontology. These plants exist in the wild, too, and as specimens in modern herbaria, and as accessions in botanical gardens. Their geospatial locations are significant. Each of these, in turn, may be represented by innumerable photographs, taken over time. How do we organize this for study of change over time?

A Generic Digital Library Architecture Driven by Citation

The solution has to be a generic digital library infrastructure, organized from bottom to top, with the bottom being the most important and long-lasting, and the top being the most ephemeral. The bottom is the data. This must exist in its own right, in a public format that is both human-readable and machine-actionable. It must be downloadable as a body, and freely repurposed. It must be in the simplest possible format, unencumbered by proprietary technology. The top layer is the end-user application, which addresses specific identifiable needs: reading, browsing, searching, juxtaposition, analysis. The top layer will have a short life; it will be determined and constrained by the technology of the day, and perish as technology moves on. It will also be determined and constrained by the imagination of its creators, and will be useless to subsequent scholars who have different needs and want to ask different questions.

In the middle are services. These should be defined independently of technology as protocols – defined requests with defined responses – but implemented in the best technology of the time. They will last longer than end-user applications, because they do less, but not as long as the data itself, which should be timeless if it is of any use at all.

My collaborators and I, particularly Neel Smith of the College of the Holy Cross, Casey Dué of the University of Houston, Mary Ebbott of the College of the Holy Cross, and Brent Seales of the University of Kentucky, have spent the past decade reducing the complexity of our experimental digital library architecture to what we consider a highly functional, elegant minimum.

Our base layer consists of only three kinds of object: texts, collections, and indices.

Texts are defined in a hierarchy of Group, Work, Edition, Translation; each text contains a hierarchy of citeable units. We expect to be able to work with texts at any level of precision, from the group-level (e.g. “Homer”), to the most precise (“In the Homeric Iliad, the edition of the Venetus A manuscript, Book 1, line 1, the second occurrence of the letter ‘η’”).

Collections are groups of similar data-objects that do not exist in a hierarchy, but may exist in a sequence. A dictionary is an unordered collection of lexical entries. A collection might equally hold geographic points, vases, or any group of like objects that can be represented as digital data. A manuscript is an ordered collection of folio-objects.

Some collections are special, such as collections of images or geospatial data. For these you might want to perform particular actions or make particular requests that are common to that type of collection, but would make no sense in the context of collections generally. Any portion of a text can be identified by a canonical citation, which we express as a Uniform Resource Name (URN). Any object in a collection, likewise, can be identified by a URN.
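
As a rough illustration of citation by URN, the following sketch splits a text URN into the levels of the hierarchy described above. The URN layout (a `urn:cts:` prefix followed by colon-separated namespace, work hierarchy, and passage, with an optional `@` sub-reference) and the sample identifier are assumptions made for the example; this overview does not define them. The names `TextCitation` and `parse_text_urn` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextCitation:
    namespace: str
    work_levels: List[str]        # group, work, edition or translation
    passage_levels: List[str]     # citeable units, e.g. book and line
    subreference: Optional[str]   # e.g. a word or character within the passage


def parse_text_urn(urn: str) -> TextCitation:
    """Split an assumed 'urn:cts:namespace:work:passage' string into its levels."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a text URN in the assumed layout: {urn!r}")
    namespace = parts[2]
    work_levels = parts[3].split(".")
    passage = parts[4] if len(parts) == 5 else ""
    # Anything after '@' narrows the citation below the smallest citeable unit.
    passage, _, sub = passage.partition("@")
    passage_levels = passage.split(".") if passage else []
    return TextCitation(namespace, work_levels, passage_levels, sub or None)


# Hypothetical identifier: Iliad, Venetus A edition, Book 1, line 1, second eta.
citation = parse_text_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1@η[2]")
group_only = parse_text_urn("urn:cts:greekLit:tlg0012:")
```

The same function accepts a citation at any level of precision: `group_only` carries a single work level and no passage, while `citation` reaches down to a single character.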

Indices are simple lists of paired objects: URN - URN, or URN - arbitrary data.
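
To make the idea of an index concrete, here is a minimal sketch of a URN - URN index held as a plain list of pairs, with a helper that allows lookup from either side. The identifiers are hypothetical placeholders, and `build_lookup` is an illustrative name rather than part of any service described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def build_lookup(pairs: Iterable[Pair]) -> Dict[str, List[str]]:
    """Map every URN to the URNs it is paired with, in both directions."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for left, right in pairs:
        lookup[left].append(right)
        lookup[right].append(left)
    return dict(lookup)


# Hypothetical index: which lines of a text appear on which folio-side.
text_to_folio: List[Pair] = [
    ("urn:cts:greekLit:tlg0012.tlg001.msA:1.1", "urn:cite:hmt:msA.12r"),
    ("urn:cts:greekLit:tlg0012.tlg001.msA:1.2", "urn:cite:hmt:msA.12r"),
]

lookup = build_lookup(text_to_folio)
lines_on_folio = lookup["urn:cite:hmt:msA.12r"]
```

The index itself stays a flat list of pairs; the two-way lookup is only a convenience built on top of it, and would apply to URN - URN indices rather than to pairs of URN and arbitrary data.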

With these three kinds of data, any scholarly argument or act of publication becomes possible.

Christopher W. Blackwell
The Louis G. Forgione University Professor of Classics
October 6, 2011

Infrastructure - Storage

Texts · Texts begin life as TEI-XML documents, and we provide them in that form for download. For delivery through the Canonical Text Service, however, we translate the hierarchical XML files into tabulated files, which are uploaded to Google BigTable for “cloud” storage. Local needs: none.
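
The translation from hierarchical XML to a tabulated form can be sketched roughly as follows. The actual tabulated format is not specified in this overview, so the layout here (one tab-separated row per line of text, keyed by a dotted citation built from `n` attributes) is an assumption, as are the sample markup and the function names.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# Hypothetical, simplified sample: books as <div>, lines as <l>.
SAMPLE = """<text><body>
  <div n="1">
    <l n="1">first line of book one</l>
    <l n="2">second line of book one</l>
  </div>
  <div n="2">
    <l n="1">first line of book two</l>
  </div>
</body></text>"""


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]


def tabulate(element: ET.Element, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Walk the hierarchy and yield (citation, text) for every <l> element."""
    for child in element:
        n = child.get("n")
        child_path = path + (n,) if n is not None else path
        if local_name(child.tag) == "l":
            yield ".".join(child_path), "".join(child.itertext()).strip()
        else:
            yield from tabulate(child, child_path)


rows = list(tabulate(ET.fromstring(SAMPLE)))
table = "\n".join(f"{ref}\t{text}" for ref, text in rows)
```

Each row carries its full citation, so the table can be loaded into flat storage without losing the hierarchy expressed in the XML.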

Collections and Indices · Generic Collections are stored in Google FusionTables. Generic indices are stored in a PostgreSQL database. We provide the output of the pg_dump utility as a download for mirroring or re-purposing, and we generally make XML or plain-text snapshots of the data available as well. Local needs: a working implementation of PostgreSQL, with administrative rights (create, modify, destroy); a few gigabytes of disk-space.

Image-collections · Image-collections are an extension of generic Collections. Their storage requirements include metadata in FusionTables, but also storage-space for the binary image data. We provide direct downloading of images in TIFF format or JPEG format. We also create pyramidal-TIFF versions for delivery to a tiled-image viewer. Image-storage is the most onerous hardware requirement. Local needs: for our current collection, 1.5 TB; looking forward to collaborative projects already in the works, 3 TB.

Infrastructure - Services

General · We prefer to provide directory-listing access to an archive of raw data, from which users can download individual pieces of data, or which they can programmatically mirror to capture all our data, using a utility like wget, vel sim. Local needs: Apache2, with administrative privileges to modify the httpd.conf file, and restart the service.

Texts · Texts are delivered via a CTS implementation in Google AppEngine, drawing its data from Google BigTable. Local needs: none.

Collections and Indices · Our RefIndex service and CollectionService are each implemented as Java/Groovy webapps using the Grails framework. These webapps mediate between end-user applications and the relational database back-end. Local needs: Apache Tomcat, vel sim., with administrative privileges to add .war files, check status of the running service, and restart the servlet container.

Image-Collections · These use the Collections service for their metadata, plus extensions specific to images, allowing end-user applications to request images at different scales, cropped to regions of interest, or presented in an interface for zooming and panning. These extensions are built on the Internet Imaging Protocol (IIP) tools, which run through the FastCGI extension. The directory holding binary image files must reside on the same filesystem as the service. Local needs: the same as for Collections, plus write access to the cgi-bin directory in Apache and image-storage mounted on the local filesystem.
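
A request for a scaled and cropped image might be assembled along the following lines. This is a sketch only: the parameter names (`FIF` for the image file, `WID` for output width, `RGN` for a region given as fractions of the image, `CVT` for the output format) are those commonly used by IIP servers and should be checked against the documentation of the server actually deployed. The server address, image path, and the function name `iip_request_url` are hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

Region = Tuple[float, float, float, float]  # x, y, width, height as fractions 0..1


def iip_request_url(
    server: str,
    image_path: str,
    width: Optional[int] = None,
    region: Optional[Region] = None,
) -> str:
    """Build a URL asking an IIP server for a JPEG rendering of an image."""
    params = {"FIF": image_path}
    if width is not None:
        params["WID"] = str(width)
    if region is not None:
        params["RGN"] = ",".join(f"{value:g}" for value in region)
    # The conversion command goes last, after the scaling and cropping options.
    params["CVT"] = "jpeg"
    # Keep '/' and ',' unescaped so the path and region stay readable.
    return f"{server}?{urlencode(params, safe='/,')}"


# Hypothetical server and file: the central quarter of a folio, 800 pixels wide.
url = iip_request_url(
    "http://example.org/cgi-bin/iipsrv.fcgi",
    "/images/msA/12r.tif",
    width=800,
    region=(0.25, 0.25, 0.5, 0.5),
)
```

Because the image path in the request is resolved by the server itself, the example also shows why the image files need to sit on a filesystem the service can read directly.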