IABS 2014 » 18_Information

18: Information Technologies in Buddhist Studies

Sat., Aug. 23rd, 11:00-12:30

E-texts: Digitizing, Tagging and Beyond: Notes on Content-Based Text Markup and its Potential with Regard to Tibetan Historical Research

Fermer, Mathias (OEAW, Vienna, AUT); Grössing, Benjamin (Vienna University of Technology, Vienna, AUT)

The ‘Sakya Research Centre’ is an open online platform set up by a network of Tibetan and academic contributors working in the field of historical research and digital text processing. Since its start in 2011, the project has been designed as an open reference system for Tibetan historical research based on a digital text corpus embedded in an interlinked, relational database and web application.

By now we have gathered a large body of historical texts in digital form, encoded to common standards and fully searchable. Our present text corpus holds most of the standard works of Tibetan historiography as well as genealogies, biographies and religious histories particularly related to the Sakya school of Tibetan Buddhism. The digital texts, which will be openly accessible on the web, were initially input in (or converted to) Tibetan Unicode and brought into a compatible format following TEI (Text Encoding Initiative) standards, supplemented with our own specifications required for historical text analysis and digital editing.

The present paper will address our system of text markup as an approach to data extraction by which new forms of historical evidence are derived through further processing and organization of the extracted data. In this paper, we will provide examples of how text-based data can be processed in order to answer complex questions historians might want to ask from a wider perspective, going beyond individual case studies.

The possibilities of using relationally structured, semantic data sets derived from digital texts that have been manually supplemented by editorial markup remain largely unexplored in the field of information technology within Tibetan Studies. New ways of presenting or visualizing textual content can convey wider and indirect connections between historical entities and procedures. Gathering orthographic variants of historical toponyms or agents, contextualizing events or making religious networks visible are only some areas in which this newly derived information could be used.

The main advantages of this text-based approach are the transparency and traceability of the system: when markup is added to a text through a set of pre-defined tags (highlighting, for instance, indications of time, agency or geographic space), the text itself preserves its original wording and structure, while metadata about the text can be fed into the relational database back-end for broader, intertextual analysis. For the user, both the database and the digital text input remain accessible for search and reference via an easy-to-use web application.
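The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not publish the project's tag set or database schema, so the sketch uses the generic TEI elements date, persName and placeName, an invented sample sentence, and a hypothetical single-table layout called mention.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample sentence with TEI-style inline tags (not taken from the corpus).
SAMPLE = (
    '<p>In <date when="1073">1073</date> '
    '<persName ref="#P1">dKon mchog rgyal po</persName> founded '
    '<placeName ref="#L1">Sa skya</placeName>.</p>'
)

# Assumed mapping from tag names to the categories named in the abstract.
TAG_KINDS = {"date": "time", "persName": "agent", "placeName": "place"}


def extract_entities(xml_text):
    """Return the untagged text plus one (kind, key, surface) row per tag."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        kind = TAG_KINDS.get(elem.tag)
        if kind:
            key = elem.get("ref") or elem.get("when")
            rows.append((kind, key, "".join(elem.itertext())))
    # Joining all text nodes reproduces the original wording without markup.
    plain = "".join(root.itertext())
    return plain, rows


def load(rows):
    """Store the extracted rows in an in-memory relational table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE mention (kind TEXT, key TEXT, surface TEXT)")
    db.executemany("INSERT INTO mention VALUES (?, ?, ?)", rows)
    return db


plain, rows = extract_entities(SAMPLE)
db = load(rows)
```

Because each row carries a key as well as the surface form, grouping rows by key would be one simple way to collect orthographic variants of the same person or place across texts.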

Entity Relationship Model for Gandhāran Research System

McCrabb, Ian (Sydney University, Sydney, AUS)

The presentation will précis the underlying entity relationship model of the Gandhāran Research System (GRS), present published outputs, and outline some of the research methods enabled by the system.

The GRS is the next incarnation of the software platform that currently supports the Dictionary of Gāndhārī, Bibliography of Gāndhārī Studies and Catalog of Gāndhārī Texts by Stefan Baums and Andrew Glass, as well as the source-text corpus assembled by them on gandhari.org. With development support from a consortium of four universities, the GRS project commenced in 2013 to redevelop the current system into a comprehensive multi-user research workbench and publishing platform for ancient Sanskrit and Prakrit texts (manuscripts, inscriptions, coins and other documents). The platform comprises:

a linked repository of images, transcriptions, translations, metadata, commentary and bibliographic records,
a content management system encompassing import, editing, maintenance, analysis and publishing,
a collaboration platform with comprehensive access and visibility control to support draft development, workgroup collaboration and public presentation,
a research platform for the production of catalogs, glossaries, concordances and grammatical analyses, and
a flexible system for publishing individual transcription renditions or full scholarly editions, both print-ready and online.

The GRS is a database platform based on open source software and built to open standards. It provides an extensible entity model, TEI support and a published API for integration with related systems.

An entity relationship model is an abstract way of describing a relational database, most often represented as a flow chart accompanied by precise descriptions of each entity. This approach allows one to model the entities and their relationships and determine the most effective and flexible way of structuring the data to support authoring, storage, maintenance, analysis, reporting and publishing. The underlying design principle of the GRS is the atomisation of data to its smallest indivisible components and the linking and sequencing of these entities. The design approach has been to build a comprehensive set of entities which model real-world objects. Manuscripts or inscribed items have parts, fragments and surfaces. Images of these surfaces can be segmented to provide a fixed reference system, a baseline much like the grids laid out at an archaeological excavation. Syllables can then be mapped to these image segments and sequenced into spans across a surface of a fragment of a manuscript. These fragments can then be aligned and the spans sequenced into complete lines.
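As a rough illustration of this bottom-up chain (image segment, syllable, span, line), the Python sketch below models three of the entities as data classes. All class names, field names and sample values are hypothetical; the actual GRS entity model is not specified in this abstract.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """A rectangular region of a surface image, in pixel coordinates."""
    surface_id: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Syllable:
    """The smallest textual unit, anchored to one image segment."""
    text: str
    segment: Segment


@dataclass
class Span:
    """An ordered run of syllables on one surface of one fragment."""
    fragment_id: str
    syllables: list = field(default_factory=list)

    def text(self):
        return "".join(s.text for s in self.syllables)


def assemble_line(spans_in_order):
    """Sequence spans from aligned fragments into one complete line."""
    return "".join(span.text() for span in spans_in_order)


# Placeholder syllables on two fragments of the same (hypothetical) manuscript.
span_a = Span("frag-A", [
    Syllable("ka", Segment("frag-A/recto", 10, 5, 20, 20)),
    Syllable("ra", Segment("frag-A/recto", 32, 5, 20, 20)),
])
span_b = Span("frag-B", [
    Syllable("ta", Segment("frag-B/recto", 4, 6, 20, 20)),
    Syllable("ma", Segment("frag-B/recto", 26, 6, 20, 20)),
])
line = assemble_line([span_a, span_b])
```

Keeping the segment on every syllable means any reading in the assembled line can be traced back to a specific region of a specific image.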

The philological process model adopted is one of defining each entity by applying classifying metadata and progressively sequencing these entities from the smallest upwards. This approach allows for attribution and annotation of different interpretations of syllables, words, etymology and translation in order to record scholarly contribution at the finest level of granularity. Multiple versions of all entities may exist in parallel to support the publishing of alternative editions of a text.
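One way to picture parallel, attributed versions is sketched below in Python. It is an assumption-laden toy rather than the GRS design: the Reading class, the edition labels and the fallback rule are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One scholar's interpretation of the syllable at a given position."""
    slot: int
    text: str
    scholar: str
    edition: str


def render(readings, edition, fallback):
    """Pick one reading per slot, preferring `edition` over `fallback`."""
    chosen = {}
    for r in readings:
        if r.edition == fallback:
            chosen.setdefault(r.slot, r)
    for r in readings:
        if r.edition == edition:
            chosen[r.slot] = r
    return [chosen[slot] for slot in sorted(chosen)]


# Two competing readings exist for slot 1; both are kept, with attribution.
READINGS = [
    Reading(0, "ka", "Scholar A", "base"),
    Reading(1, "ra", "Scholar A", "base"),
    Reading(1, "ri", "Scholar B", "alt"),
]

base_edition = render(READINGS, "base", "base")
alt_edition = render(READINGS, "alt", "base")
```

Neither reading overwrites the other: the alternative edition reuses the base text wherever it has no reading of its own, and each chosen syllable still names the scholar responsible for it.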

The application of metadata to each entity in the system enables a range of automated analysis outputs which support palaeographic, phonological, grammatical, orthographical and morphological research, and opens up opportunities for the analysis of formulae and syntax.

Ian McCrabb is the founder and managing director of a Sydney-based IT consulting group established in 1994. His PhD dissertation is focused on methodologies for the analysis of reliquary inscriptions and characterization of the ritual practices and religious significance of relic establishment in Gandhāra. Ian is system designer on the Gandhāran Research System project.

Exploring Possibilities of Digital Environments for Buddhist Studies

Nagasaki, Kiyonori (International Institute for Digital Humanities and University of Tokyo, Bunkyo-ku, JPN); Muller, Charles (University of Tokyo, Tokyo, JPN); Tomabechi, Toru (International Institute for Digital Humanities, Tokyo, JPN); Shimoda, Masahiro (JPN)

In 2012, the SAT project published the results of its efforts to form an alliance among international digital Buddhist studies projects and to develop digitized research tools on the Web under the concept of the “Methodological Commons” of the international digital humanities community.

The 2012 version includes numerous new functions. Notably, the texts of the Taishō Shinshū Daizōkyō (henceforth, Taishō) are linked with the English Tripiṭaka at the sentence level, adopting the concept of “stand-off markup” from the TEI P5 guidelines. We call this the “BDK-SAT parallel corpus.” The work of linking the two texts was done entirely on a Web collaboration system developed to link the Taishō with other objects such as texts and images. The corpus should be useful for translation, education and textual analysis, together with the easy-search function for the Digital Dictionary of Buddhism. At the same time, the text database of the BDK Web site is linked with the SAT text database through the BDK-SAT parallel corpus.
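To illustrate what sentence-level stand-off linking means in practice, the Python sketch below keeps both texts untouched and stores the alignment separately as TEI-style link elements whose target attribute points at one sentence in each text. The sentence identifiers and placeholder sentences are invented, and the sketch makes no claim about how the SAT system stores its links internally.

```python
import xml.etree.ElementTree as ET

# Placeholder sentences keyed by hypothetical identifiers.
SOURCE = {"zh1": "[source sentence 1]", "zh2": "[source sentence 2]"}
TRANSLATION = {"en1": "[English sentence 1]", "en2": "[English sentence 2]"}

# The alignment lives outside both texts, as in TEI stand-off linking.
LINKS = """
<linkGrp type="translation">
  <link target="#zh1 #en1"/>
  <link target="#zh2 #en2"/>
</linkGrp>
"""


def parse_links(xml_text):
    """Map each source sentence id to its aligned translation id."""
    pairs = {}
    for link in ET.fromstring(xml_text.strip()).iter("link"):
        src, dst = (t.lstrip("#") for t in link.get("target").split())
        pairs[src] = dst
    return pairs


def translation_of(sentence_id, pairs):
    """Look up the aligned translation for one source sentence."""
    return TRANSLATION[pairs[sentence_id]]


pairs = parse_links(LINKS)
aligned = translation_of("zh2", pairs)
```

Since the links are held apart from the texts, the alignment can be corrected or extended without editing either the source or the translation.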

The 2012 version also includes several character databases, so that readers can easily look up information on a character from several databases through pop-up windows. Finding such information has become difficult as the number of CJK ideographic characters in Unicode has grown.

Moreover, the 2012 version makes it possible to browse the page images of the Taishō, which were scanned at 600 dpi. The images are roughly linked with the lines of the texts: when a reader clicks a button located next to a text line, the page image is displayed in a narrow window on the left while the lines are centered. The image can also be zoomed.

The SAT project aims to build a wider and deeper alliance with many international projects of digital Buddhist studies so that Buddhist studies can be carried out more efficiently and to greater effect. We would like to discuss various aspects of digitization in Buddhist studies at the conference.