Encoding Guidelines

By Richard Hadden

The Woodman Diary leverages the Text Encoding Initiative (TEI) schema for encoding. This document describes the guidelines and policies that inform our encoding.

Structural Markup

The Woodman Diary project uses a minimal, customised TEI P5-based schema, built on the Core module with additional elements added as required. It was generated using the TEI Roma tool. The schema is available as a schema file (in RELAX NG syntax).

As an edition with a focus on the text of the diary, rather than particular features of the manuscript itself, the marked-up text is contained within the <text> TEI element (rather than using <msDesc>).

The structural markup elements present the text as a series of daily entries, which are then grouped by month.

The body of the text is enclosed in <text> and <body> tags.

Each month is contained in a <div> element: <div type="month" xml:id="x_YYYY-MM">

Each day is contained in a <div> element: <div type="day" xml:id="x_YYYY-MM-DD">

In the above examples, the x is replaced with w for the Wilson Diary and b for the Butterfly Diary.
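The identifier pattern above can be sketched in a few lines of Python. This is only an illustration of the convention, not part of the project's tooling: the function names and the dictionary keys are assumptions, and the date used below is arbitrary.

```python
from datetime import date

# Prefixes as given in the guidelines: w = Wilson Diary, b = Butterfly Diary.
# The dictionary keys are illustrative labels, not project terminology.
DIARY_PREFIXES = {"wilson": "w", "butterfly": "b"}


def month_id(diary: str, day: date) -> str:
    """Build the xml:id of a month <div>, e.g. w_YYYY-MM."""
    return f"{DIARY_PREFIXES[diary]}_{day:%Y-%m}"


def day_id(diary: str, day: date) -> str:
    """Build the xml:id of a day <div>, e.g. w_YYYY-MM-DD."""
    return f"{DIARY_PREFIXES[diary]}_{day.isoformat()}"


# Arbitrary example date, chosen only to show the shape of the identifiers.
example = date(1900, 1, 5)
print(month_id("wilson", example))
print(day_id("butterfly", example))
```

Because the day identifier begins with the month identifier's date portion, a day can be matched to its enclosing month by simple string comparison.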

Dates that have no entry are not encoded.

Diary inserts are contained within a <div> element, placed within the appropriate day (an editorial decision about where each insert most sensibly appears): <div type="insert" xml:id="w_YYYY-MM-DDi">

Insert graphics are contained in the <figure> element, along with any transcribed caption.

    <graphic url="imageURL"/>

Alternative structural elements, notably page breaks and line breaks, are encoded as empty elements (<pb/> and <lb/> respectively). The page break tag uses the @facs attribute to encode the URL of the image file. This allows the text contained on each physical diary page to be extracted using XSLT.

Page breaks fall in the text wherever the end of a page occurs in the diary. For entries that begin at the top of a new page, the <pb> element is placed before the <div> for that day (i.e. between the previous and next <div>s). This is to assist extraction of the text for each page: each day will always begin on the page referenced by the previous page break, wherever that falls. (This is also consistent with the use of line breaks, which begin a new line, and thus fall before the line’s text.)
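The guidelines mention XSLT for page extraction; the same idea can be sketched in Python with the standard library, purely as an illustration. The sample markup, the image file names, and the function name text_by_page are all hypothetical. The sketch walks the tree in document order and assigns every piece of text to the most recent <pb/>, which is why placing <pb/> before the day's <div> matters.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample: the second <pb/> falls mid-entry, the third
# sits between two day <div>s, as the guidelines describe.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="month" xml:id="w_1900-01">
    <pb facs="page-001.jpg"/>
    <div type="day" xml:id="w_1900-01-05"><p>First entry, which runs
      <pb facs="page-002.jpg"/>onto a second page.</p></div>
    <pb facs="page-003.jpg"/>
    <div type="day" xml:id="w_1900-01-06"><p>Second entry.</p></div>
  </div>
</body>"""


def text_by_page(root):
    """Map each <pb/> @facs value to the text between it and the next <pb/>."""
    pages = {}
    current = None  # @facs of the most recent <pb/>, if any

    def walk(elem):
        nonlocal current
        if elem.tag == TEI_NS + "pb":
            current = elem.get("facs")
            pages.setdefault(current, [])
        elif elem.text and current is not None:
            pages[current].append(elem.text)
        for child in elem:
            walk(child)
            # Text after a child belongs to whichever page is now current.
            if child.tail and current is not None:
                pages[current].append(child.tail)

    walk(root)
    # Collapse whitespace introduced by indentation in the source file.
    return {facs: " ".join("".join(parts).split()) for facs, parts in pages.items()}


for facs, text in text_by_page(ET.fromstring(SAMPLE)).items():
    print(facs, "->", text)
```

Text that precedes the first <pb/> is ignored in this sketch, on the assumption that every transcribed page is introduced by a page break.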

Text Markup

Below are the guidelines for the markup of transcription text within the diary:

<date> is used to mark up dates in the text, and to make explicit things like ‘yesterday’ and ‘last Thursday’. The @type="head" attribute is added to the date considered to mark the start of an entry; @type="newPage" is used to indicate Woodman’s repetition of the entry’s date following a page break.

<p> is used to indicate paragraphs. Because paragraphs are part of the semantic structure of the text, rather than something graphically certain such as a line break, the application of the paragraph tag is a matter of editorial judgement.

<lg>, <l> — line group, line — for verse. <lg> wraps the entire poem or stanza as applicable. <l> wraps the entire line, as interpreted as a poetic line. (<lb/> can be used mid-line to indicate how it is written on the page.)

    <l>Humphry Davy</l>
    <l>Was not fond <lb/> of gravy</l>
    <l>He lived in the odium</l>
    <l>Of having discovered sodium</l>

(n.b. the <lb/> in the stanza above is there as it is an actual physical line break)

<hi rend="underlined"> (for example): some highlighted or otherwise distinct writing

<choice> is used to contain <abbr> and <expan> elements: an abbreviation and its expansion.
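As an illustration of why the abbreviation and its expansion are paired inside <choice>, the following Python sketch produces either a diplomatic or an expanded reading from the same markup. The sample sentence and the abbreviation in it are invented, the function name reading_text is an assumption, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Invented sample; not taken from the diary.
SAMPLE = (
    "<p>Went to the <choice><abbr>Sqdn</abbr>"
    "<expan>Squadron</expan></choice> office.</p>"
)


def reading_text(elem, prefer="expan"):
    """Return the text of elem, keeping only the preferred child of each <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(reading_text(chosen, prefer))
        else:
            parts.append(reading_text(child, prefer))
        # Text following the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
print(reading_text(paragraph))            # expanded reading
print(reading_text(paragraph, "abbr"))    # reading as written
```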

<add rend="above">: addition by author. The following @rend values are allowed: above, overwrite

<del rend="strikethrough">: deletion by author. The following @rend values are allowed: strikethrough, overwrite

<supplied> — for an obvious editorial addition to the text. This element is little-used, as the diary is in good condition and little outright conjecture is required. It is, however, used where a word is obviously missing and understanding would be impaired otherwise (e.g., Woodman is describing a plane crash, and, at the point of beginning a new page, omits the word “pilot”).

<unclear> can be used to wrap sections of the transcription that are unclear as to their exact transcription due to poor penmanship, damage to the page, or any other situation in which the original text is unclear.

Named Entities

The following are the elements used to encode named entities, as well as terms and other points deemed to require annotation. The annotations and other references were produced in a spreadsheet and transformed to non-TEI XML syntax using a custom Python script (which uses the Jinja2 templating library).

The resulting XML file will eventually be re-incorporated into the main TEI file (in the <profileDesc> element) should the data be made available. References are established using the name of the person, place, organisation, or term in camel case (i.e. camelCase), prefixed with per_, pla_, org_, and ter_ respectively. Named entities in the text are encoded using the TEI P5 naming tags (persName, placeName, orgName), rather than the longer (though equivalent) <name type="person">, etc.
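A reference key of this kind could be derived from a name roughly as follows. This is a minimal sketch under assumptions: the guidelines do not spell out how punctuation, initials, or internal capitals are normalised, so the rules below (lower camel case, non-alphanumeric characters dropped) and the function name entity_key are illustrative only.

```python
import re

# Prefixes as listed in the guidelines; the dictionary keys are illustrative.
PREFIXES = {
    "person": "per_",
    "place": "pla_",
    "organisation": "org_",
    "term": "ter_",
}


def entity_key(kind: str, name: str) -> str:
    """Build a prefixed camelCase key from a display name (assumed rules)."""
    words = re.findall(r"[^\W_]+", name)  # runs of letters and digits
    if not words:
        raise ValueError("name contains no letters or digits")
    first = words[0][:1].lower() + words[0][1:]
    rest = "".join(w[:1].upper() + w[1:] for w in words[1:])
    return PREFIXES[kind] + first + rest


print(entity_key("person", "Humphry Davy"))
```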

<foreign xml:lang="fra">: foreign word (e.g. in French)

Notes on the text

At various points, it was deemed necessary to add additional editorial notes to the text that were not, strictly, descriptions of places or terminology.

Each day may contain multiple <note> elements at the end of the enclosing <div> (i.e. after the text for the day). The note should have an @xml:id="nYYYYMMDDi", where n designates a note (and so does not change), and i is the number of the note associated with that day. Portions of the day’s text applicable to the note should be enclosed in an <rs> (reference string) element, with a @ref attribute corresponding to the @xml:id of the note in question.
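The link between <rs> and <note> can be checked mechanically. The sketch below builds a note identifier from a date and a note number, then lists any @ref values in a day that do not match a note in the same <div>. The sample markup, its date, and both function names are hypothetical; the sketch also tolerates a leading # on @ref, which is an assumption, since the guidelines do not say which form is used. The TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import date

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical day entry with one note.
SAMPLE = """<div type="day" xml:id="w_1900-01-05">
  <p>Met <rs ref="n190001051">the new arrival</rs> at the station.</p>
  <note xml:id="n190001051">Hypothetical editorial note.</note>
</div>"""


def note_id(day: date, number: int) -> str:
    """Build a note xml:id following the nYYYYMMDDi pattern."""
    return f"n{day:%Y%m%d}{number}"


def unresolved_refs(day_div):
    """Return @ref values on <rs> elements that match no <note> in this day."""
    note_ids = {note.get(XML_ID) for note in day_div.iter("note")}
    return [
        rs.get("ref")
        for rs in day_div.iter("rs")
        if rs.get("ref", "").lstrip("#") not in note_ids
    ]


print(note_id(date(1900, 1, 5), 1))
print(unresolved_refs(ET.fromstring(SAMPLE)))
```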

Image Naming Conventions

The image files have been renamed from the standard camera filenames, to allow easy identification and correspondence with the diary entries.

  • Numbering restarts at 001 and each filename includes the date; the prefix is W- for the Wilson diary (the first of Woodman’s physical diaries) and B- for the Butterfly diary (the second of Woodman’s physical diaries).
  • Clippings and Articles had their naming decided after this text page encoding was complete.