# Workshop on Electronic Texts: Proceedings, 9-10 June 1992

## Part 13

Book page: https://www.cyberlibrary.org/en/books/workshop-on-electronic-texts-proceedings-9-10-june-1992-53/index.md

SPERBERG-McQUEEN dismissed the lowest-common-denominator approach as unable to support the kind of applications that draw people who have never been in the public library regularly before, and make them come back. He advocated more interesting text and more intelligent text. Asserting that it is not beyond economic feasibility to have good texts, SPERBERG-McQUEEN noted that the TEI Guidelines listing 200-odd tags contains tags that one is expected to enter every time the relevant textual feature occurs. It contains all the tags that people need now, and it is not expected that everyone will tag things in the same way.

The question of how people will tag the text is in large part a function of their reaction to what SPERBERG-McQUEEN termed the issue of reproducibility. What one needs to be able to reproduce are the things one wants to work with. Perhaps a more useful concept than that of reproducibility or recoverability is that of processability, that is, what can one get from an electronic text without reading it again in the original. He illustrated this contention with a page from Jan Comenius's bilingual Introduction to Latin.

SPERBERG-McQUEEN returned at length to the issue of images as simulacra for the text, in order to reiterate his belief that in the long run more is needed than images of pages of particular editions of the text. Just as second-generation photocopies and second-generation microfilm degenerate, so second-generation representations tend to degenerate, and one tends to overstress some relatively trivial aspects of the text, such as its layout on the page (which is not always significant, despite what the text critics might say), and to slight other pieces of information, such as the very important lexical ties between the English and Latin versions of Comenius's bilingual text. Moreover, in many crucial respects it is easy to fool oneself concerning what a scanned image of the text will accomplish. For example, in order to study the transmission of texts, information concerning the text carrier is necessary, which scanned images simply do not always handle. Further, even the high-quality materials being produced at Cornell lose much of the information that one would need if studying those books as physical objects. It is a choice that has been made. It is an arguably justifiable choice, but one does not know what color those pen strokes in the margin are or whether there was a stain on the page, because it has been filtered out. One does not know whether there were rips in the page because they do not show up, and on a couple of the marginal marks one loses half of the mark because the pen is very light and the scanner failed to pick it up, so that what is clearly a checkmark in the margin of the original becomes a little scoop in the margin of the facsimile. These are standard problems for facsimile editions, not new to electronics but equally true of light-lens photography; they are remarked here because it is important that we not fool ourselves: even if we produce a very nice image of this page with good contrast, we are not replacing the manuscript any more than microfilm has replaced the manuscript.

The TEI comes from the research community, where its first allegiance lies, but it is not just an academic exercise. It has relevance far beyond those who spend all of their time studying text, because one's model of text determines what one's software can do with a text. Good models lead to good software. Bad models lead to bad software. That has economic consequences, and it is these economic consequences that have led the European Community to help support the TEI, and that will lead, SPERBERG-McQUEEN hoped, some software vendors to realize that if they provide software with a better model of the text they can make a killing.

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ DISCUSSION * Implications of different DTDs and tag sets * ODA versus SGML * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

During the discussion that followed, several additional points were made. Neither AAP (i.e., Association of American Publishers) nor CALS (i.e., Computer-aided Acquisition and Logistics Support) has a document-type definition for ancient Greek drama, although the TEI will be able to handle that. Given this state of affairs and assuming that the technical-journal producers and the commercial vendors decide to use the other two types, then an institution like the Library of Congress, which might receive all of their publications, would have to be able to handle three different types of document definitions and tag sets and be able to distinguish among them.

Office Document Architecture (ODA) has some advantages that flow from its tight focus on office documents and clear directions for implementation. Much of the ODA standard is easier to read and clearer at first reading than the SGML standard, which is extremely general. The cost of that tight focus is that if one wants to use graphics in TIFF with ODA, one is stuck, because ODA defines its own graphics formats and TIFF is not among them, whereas SGML says the world is not waiting for this work group to create another graphics format. What is needed is an ability to use whatever graphics format one wants.

The TEI provides a socket that allows one to connect the SGML document to the graphics. The notation that the graphics are in is clearly a choice that one needs to make based on her or his environment, and that is one advantage. SGML is less megalomaniacal in attempting to define formats for all kinds of information, though more megalomaniacal in attempting to cover all sorts of documents. The other advantage is that the model of text represented by SGML is simply an order of magnitude richer and more flexible than the model of text offered by ODA. Both offer hierarchical structures, but SGML recognizes that the hierarchical model of the text that one is looking at may not have been in the minds of the designers, whereas ODA does not.

ODA is not really aiming for the kind of document that the TEI wants to encompass. The TEI can handle the kind of material ODA has, as well as a significantly broader range of material. ODA seems to be very much focused on office documents, which is what it started out being called: office document architecture.

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ CALALUCA * Text-encoding from a publisher's perspective * Responsibilities of a publisher * Reproduction of Migne's Latin series whole and complete with SGML tags based on perceived need and expected use * Particular decisions arising from the general decision to produce and publish PLD * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The final speaker in this session, Eric CALALUCA, vice president, Chadwyck-Healey, Inc., spoke from the perspective of a publisher re text-encoding, rather than as one qualified to discuss methods of encoding data, and observed that the presenters sitting in the room, whether they had chosen to or not, were acting as publishers: making choices, gathering data, gathering information, and making assessments. CALALUCA offered the hard-won conviction that in publishing very large text files (such as PLD), one cannot avoid making personal judgments of appropriateness and structure.

In CALALUCA's view, encoding decisions stem from prior judgments. Two notions have become axioms for him in the consideration of future sources for electronic publication: 1) electronic text publishing is as personal as any other kind of publishing, and questions of if and how to encode the data are simply a consequence of that prior decision; 2) all personal decisions are open to criticism, which is unavoidable.

CALALUCA rehearsed his role as a publisher or, better, as an intermediary between what is viewed as a sound idea and the people who would make use of it. Finding the specialist to advise in this process is the core of that function. The publisher must monitor and hug the fine line between giving users what they want and suggesting what they might need. One responsibility of a publisher is to represent the desires of scholars and research librarians as opposed to bullheadedly forcing them into areas they would not choose to enter.

CALALUCA likened the questions being raised today about data structure and standards to the decisions faced by the Abbe Migne himself during production of the Patrologia series in the mid-nineteenth century. Chadwyck-Healey's decision to reproduce Migne's Latin series whole and complete with SGML tags was also based upon a perceived need and an expected use. In the same way that Migne's work came to be far more than a simple handbook for clerics, PLD is already far more than a database for theologians. It is a bedrock source for the study of Western civilization, CALALUCA asserted.

In regard to the decision to produce and publish PLD, the editorial board offered direct judgments on the question of appropriateness of these texts for conversion, their encoding and their distribution, and concluded that the best possible project was one that avoided overt intrusions or exclusions in so important a resource. Thus, the general decision to transmit the original collection as clearly as possible with the widest possible avenues for use led to other decisions:

1) To encode the data or not, SGML or not, TEI or not. Again, the expected user community asserted the need for normative tagging structures of important humanities texts, and the TEI seemed the most appropriate structure for that purpose. Research librarians, who are trained to view the larger impact of electronic text sources on 80 or 90 or 100 doctoral disciplines, loudly approved the decision to include tagging. They see what is coming better than the specialist who is completely focused on one edition of Ambrose's De Anima, and they also understand that the potential uses exceed present expectations.

2) What will be tagged and what will not. Once again, the board realized that one must tag the obvious. But in no way should one attempt to identify through encoding schemes every single discrete area of a text that might someday be searched. That was another decision. Searching by a column number, an author, a word, a volume, permitting combination searches, and tagging notations seemed logical choices as core elements.

3) How does one make the data available? Tying it to a CD-ROM edition creates limitations, but a magnetic tape file that is very large, is accompanied by the encoding specifications, and allows one to make local modifications also allows one to incorporate any changes one may desire within the bounds of private research, though exporting tag files from a CD-ROM could serve just as well. Since no one on the board could possibly anticipate each and every way in which a scholar might choose to mine this data bank, it was decided to satisfy the basics and make some provisions for what might come.

4) Not to encode the database would rob it of the interchangeability and portability these important texts should accommodate. For CALALUCA, the extensive options presented by full-text searching require care in text selection and strongly support encoding of data to facilitate the widest possible search strategies. Better software can always be created, but summoning the resources, the people, and the energy to reconvert the text is another matter.
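The core search elements named above (column number, author, word, volume, and combinations of them) can be illustrated with a minimal sketch. It is not a description of the PLD software or its tag set: the element and attribute names (`text`, `col`, `author`, `volume`, `n`), the `search` helper, and the sample data are all invented for illustration, and XML parsed with Python's standard library stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Invented sample data; tag and attribute names are assumptions,
# not the actual PLD or TEI encoding.
SAMPLE = """
<corpus>
  <text author="Ambrose" volume="1">
    <col n="101">In principio erat verbum</col>
    <col n="102">et verbum erat apud Deum</col>
  </text>
  <text author="Another author" volume="2">
    <col n="659">Magnus es, Domine, et laudabilis valde</col>
  </text>
</corpus>
"""

def search(root, author=None, volume=None, column=None, word=None):
    """Return (author, volume, column, text) for columns meeting every criterion given."""
    hits = []
    for text in root.iter("text"):
        if author is not None and text.get("author") != author:
            continue
        if volume is not None and text.get("volume") != str(volume):
            continue
        for col in text.iter("col"):
            if column is not None and col.get("n") != str(column):
                continue
            content = col.text or ""
            # Crude tokenization: split on whitespace, strip basic punctuation.
            tokens = [t.strip(".,;:").lower() for t in content.split()]
            if word is not None and word.lower() not in tokens:
                continue
            hits.append((text.get("author"), text.get("volume"), col.get("n"), content))
    return hits

root = ET.fromstring(SAMPLE)
# Combination search: one author and one word together.
for hit in search(root, author="Ambrose", word="verbum"):
    print(hit)
```

Because each criterion is optional, the same function serves single-field and combined searches; anything not tagged in the source (here, anything other than author, volume, and column) cannot be searched this way, which is the trade-off the board weighed.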

PLD is being encoded, captured, and distributed, because to Chadwyck-Healey and the board it offers the widest possible array of future research applications that can be seen today. CALALUCA concluded by urging the encoding of all important text sources in whatever way seems most appropriate and durable at the time, without blanching at the thought that one's work may require emendation in the future. (Thus, Chadwyck-Healey produced a very large humanities text database before the final release of the TEI Guidelines.)

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ DISCUSSION * Creating texts with markup advocated * Trends in encoding * The TEI and the issue of interchangeability of standards * A misconception concerning the TEI * Implications for an institution like LC in the event that a multiplicity of DTDs develops * Producing images as a first step towards possible conversion to full text through character recognition * The AAP tag sets as a common starting point and the need for caution * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

HOCKEY prefaced the discussion that followed with several comments in favor of creating texts with markup and on trends in encoding. In the future, when many more texts are available for on-line searching, real problems in finding what is wanted will develop, if one is faced with millions of words of data. It therefore becomes important to consider putting markup in texts to help searchers home in on the actual things they wish to retrieve. Various approaches to refining retrieval methods toward this end include building on a computer version of a dictionary and letting the computer look up words in it to obtain more information about the semantic structure or semantic field of a word, its grammatical structure, and syntactic structure.

HOCKEY commented on the present keen interest in the encoding world in creating: 1) machine-readable versions of dictionaries that can be initially tagged in SGML, which gives a structure to the dictionary entry; these entries can then be converted into a more rigid or otherwise different database structure inside the computer, which can be treated as a dynamic tool for searching mechanisms; 2) large bodies of text to study the language. In order to incorporate more sophisticated mechanisms, more about how words behave needs to be known, which can be learned in part from information in dictionaries. However, the last ten years have seen much interest in studying the structure of printed dictionaries converted into computer-readable form. The information one derives about many words from those is only partial: one or two definitions of the common or the usual meaning of a word, and then numerous definitions of unusual usages. If the computer is using a dictionary to help retrieve words in a text, it needs much more information about the common usages, because those are the ones that occur over and over again. Hence the current interest in developing large bodies of text in computer-readable form in order to study the language. Several projects are engaged in compiling, for example, 100 million words. HOCKEY described one with which she was associated briefly at Oxford University involving compilation of 100 million words of British English: about 10 percent of that will contain detailed linguistic tagging encoded in SGML; it will have word-class tagging, with words identified as nouns, verbs, adjectives, or other parts of speech. This tagging can then be used by programs that will begin to learn a bit more about the structure of the language and can then go on to tag more text.
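The cycle HOCKEY describes, in which hand-tagged text teaches a program enough to tag further text, can be sketched in a few lines. This is only a toy illustration under stated assumptions: the seed sample, the tag names (`DET`, `NOUN`, `VERB`, `ADJ`, `UNK`), the `<w pos="...">` markup, and the function names are invented here, and the "learning" is nothing more than remembering the most frequent tag seen for each word, which is far simpler than what a real corpus project would use.

```python
from collections import Counter, defaultdict
from xml.sax.saxutils import escape

# Invented hand-tagged seed sample (word, word class).
tagged_seed = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB"),
]

def learn(tagged):
    """Map each word to the tag it most often carries in the tagged sample."""
    counts = defaultdict(Counter)
    for word, tag_name in tagged:
        counts[word.lower()][tag_name] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, unknown="UNK"):
    """Tag new words from the learned model; unseen words get the `unknown` tag."""
    return [(w, model.get(w.lower(), unknown)) for w in words]

def to_markup(tagged):
    """Render tagged words as SGML-style elements (hypothetical <w pos="..."> form)."""
    return " ".join(f'<w pos="{t}">{escape(w)}</w>' for w, t in tagged)

model = learn(tagged_seed)
print(to_markup(tag("The old dog barks".split(), model)))
# <w pos="DET">The</w> <w pos="ADJ">old</w> <w pos="NOUN">dog</w> <w pos="VERB">barks</w>
```

Words the seed never saw come back as `UNK`, which shows why the size and accuracy of the hand-tagged portion matter: the more that is tagged accurately, the less a program has to guess.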

HOCKEY said that the more that is tagged accurately, the more one can refine the tagging process and thus the bigger body of text one can build up with linguistic tagging incorporated into it. Hence, the more tagging or annotation there is in the text, the more one may begin to learn about language and the more it will help accomplish more intelligent OCR. She recommended the development of software tools that will help one begin to understand more about a text, which can then be applied to scanning images of that text in that format and to using more intelligence to help one interpret or understand the text.

HOCKEY posited the need to think about common methods of text-encoding for a long time to come, because building these large bodies of text is extremely expensive and will only be done once.

In the more general discussion on approaches to encoding that followed, these points were made:

BESSER identified the underlying problem with standards that all have to struggle with in adopting a standard, namely, the tension between a very highly defined standard that is very interchangeable but does not work for everyone because something is lacking, and a standard that is less defined, more open, more adaptable, but less interchangeable. Contending that the way in which people use SGML is not sufficiently defined, BESSER wondered 1) if people resist the TEI because they think it is too defined in certain things they do not fit into, and 2) how progress with interchangeability can be made without frightening people away.

SPERBERG-McQUEEN replied that the published drafts of the TEI had met with surprisingly little objection on the grounds that they do not allow one to handle X or Y or Z. Particular concerns of the affiliated projects have led, in practice, to discussions of how extensions are to be made; the primary concern of any project has to be how it can be represented locally, thus making interchange secondary. The TEI has received much criticism based on the notion that everything in it is required or even recommended, which, as it happens, is a misconception from the beginning, because none of it is required and very little is actually actively recommended for all cases, except that one document one's source.

SPERBERG-McQUEEN agreed with BESSER about this trade-off: all the projects in a set of twenty TEI-conformant projects will not necessarily tag the material in the same way. One result of the TEI will be that the easiest problems will be solved, namely those dealing with the external form of the information; but the problem that is hardest in interchange is that one is not encoding what another wants, and vice versa. Thus, after the adoption of a common notation, the differences in the underlying conceptions of what is interesting about texts become more visible. The success of a standard like the TEI will lie in the ability of the recipient of interchanged texts to use some of what it contains and to add the information that was not encoded that one wants, in a layered way, so that texts can be gradually enriched and one does not have to put in everything all at once. Hence, having a well-behaved markup scheme is important.

STEVENS followed up on the paradoxical analogy that BESSER alluded to in the example of the MARC records, namely, the formats that are the same except that they are different. STEVENS drew a parallel between document-type definitions and MARC records for books and serials and maps, where one has a tagging structure and there is a text-interchange. STEVENS opined that the producers of the information will set the terms for the standard (i.e., develop document-type definitions for the users of their products), creating a situation that will be problematical for an institution like the Library of Congress, which will have to deal with the DTDs in the event that a multiplicity of them develops. Thus, numerous people are seeking a standard but cannot find the tag set that will be acceptable to them and their clients. SPERBERG-McQUEEN agreed with this view, and said that the situation was in a way worse: attempting to unify arbitrary DTDs resembled attempting to unify a MARC record with a bibliographic record done according to the Prussian instructions. According to STEVENS, this situation occurred very early in the process.

WATERS recalled from early discussions on Project Open Book the concern of many people that merely by producing images, POB was not really enhancing intellectual access to the material. Nevertheless, not wishing to overemphasize the opposition between imaging and full text, WATERS stated that POB views getting the images as a first step toward possibly converting to full text through character recognition, if the technology is appropriate. WATERS also emphasized that encoding is involved even with a set of images.

SPERBERG-McQUEEN agreed with WATERS that one can create an SGML document consisting wholly of images. At first sight, organizing graphic images with an SGML document may not seem to offer great advantages, but the advantages of the scheme WATERS described would be precisely that ability to move into something that is more of a multimedia document: a combination of transcribed text and page images. WEIBEL concurred in this judgment, offering evidence from Project ADAPT, where a page is divided into text elements and graphic elements, and in fact the text elements are organized by columns and lines. These lines may be used as the basis for distributing documents in a network environment. As one develops software intelligent enough to recognize what those elements are, it makes sense to apply SGML initially to an image that may, in fact, ultimately become more and more text, either through OCR or edited OCR or even just through keying. For WATERS, the labor of composing the document and saying this set of documents or this set of images belongs to this document constitutes a significant investment.
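A minimal sketch may make the idea concrete: a structured document that begins as nothing but an ordered set of page images and later acquires transcribed text page by page. Everything specific here is an assumption made for illustration, including the element names (`document`, `page`, `image`, `transcription`), the file names, and the two helper functions; XML built with Python's standard library stands in for SGML, and no particular DTD from the projects discussed is implied.

```python
import xml.etree.ElementTree as ET

def image_document(title, image_files):
    """Build a document that consists wholly of references to page images."""
    doc = ET.Element("document", {"title": title})
    for n, path in enumerate(image_files, start=1):
        page = ET.SubElement(doc, "page", {"n": str(n)})
        ET.SubElement(page, "image", {"href": path})
    return doc

def add_transcription(doc, page_number, text):
    """Attach (or replace) transcribed text on one page, leaving its image in place."""
    page = doc.find(f"page[@n='{page_number}']")
    if page is None:
        raise ValueError(f"no page numbered {page_number}")
    transcription = page.find("transcription")
    if transcription is None:
        transcription = ET.SubElement(page, "transcription")
    transcription.text = text
    return page

# Start with images only (hypothetical file names).
doc = image_document("Sample volume", ["p001.tif", "p002.tif"])

# Later, one page gains text from OCR, edited OCR, or keying.
add_transcription(doc, 2, "Text keyed or recognized from page 2.")

print(ET.tostring(doc, encoding="unicode"))
```

The grouping of images into a document, the investment WATERS points to, is made once; text is then layered in without disturbing the page images already recorded.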

WEIBEL also made the point that the AAP tag sets, while not excessively prescriptive, offer a common starting point; they do not define the structure of the documents, though. They have some recommendations about DTDs one could use as examples, but they do just suggest tag sets. For example, the CORE project attempts to use the AAP markup as much as possible, but there are clearly areas where structure must be added. That in no way contradicts the use of AAP tag sets.

SPERBERG-McQUEEN noted that the TEI prepared a long working paper early on about the AAP tag set and what it lacked that the TEI thought it needed, and a fairly long critique of the naming conventions, which has led to a very different style of naming in the TEI. He stressed the importance of the opposition between prescriptive markup, the kind that a publisher or anybody can do when producing documents de novo, and descriptive markup, in which one has to take what the text carrier provides. In these particular tag sets it is easy to overemphasize this opposition, because the AAP tag set is extremely flexible. Even if one just used the DTDs, they allow almost anything to appear almost anywhere.

******

SESSION VI. COPYRIGHT ISSUES

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ PETERS * Several cautions concerning copyright in an electronic environment * Review of copyright law in the United States * The notion of the public good and the desirability of incentives to promote it * What copyright protects * Works not protected by copyright * The rights of copyright holders * Publishers' concerns in today's electronic environment * Compulsory licenses * The price of copyright in a digital medium and the need for cooperation * Additional clarifications * Rough justice oftentimes the outcome in numerous copyright matters * Copyright in an electronic society * Copyright law always only sets up the boundaries; anything can be changed by contract * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Marybeth PETERS, policy planning adviser to the Register of Copyrights, Library of Congress, made several general comments and then opened the floor to discussion of subjects of interest to the audience.

Having attended several sessions in an effort to gain a sense of what people did and where copyright would affect their lives, PETERS expressed the following cautions:

* If one takes and converts materials and puts them in new forms, then, from a copyright point of view, one is creating something and will receive some rights.

* However, if what one is converting already exists, a question immediately arises about the status of the materials in question.