Orthographic variation, which is endemic to non-standard spelling systems, is seen by many researchers as a fatal stumbling block for building morpho-syntactically parsed historical corpora. Graphemic alternations, however, have long been a treasure-trove for historical phonologists, who attempt to piece together bygone sound-systems by close examination of spelling practices. This tug-of-war between morpho-syntactic and grapho-phonological approaches to corpus-building has resulted in independent traditions of spelling standardisation, on the one hand, and of diplomatic transcription with minimal tagging, on the other. A third route, however, is increasingly feasible: producing lemmatised and part-of-speech-tagged texts while preserving fine-grained spelling variation. In this workshop I will give an overview of this approach, based on the construction of the Corpus of Historical Mapudungun (CHM), a project at Edinburgh’s Angus McIntosh Centre for Historical Linguistics.
The main focus will be on the methods and tools used to go from printed or manuscript texts to a lemmatised, morphologically tagged and grapho-phonologically parsed corpus. I will survey the process of optical character recognition, and the principles and conventions of XML tagging used for lemma and morpheme parsing (a schematic example is sketched below). Since the CHM’s implementation of the final stage of the process – grapho-phonological parsing – is still under development, this stage will be illustrated with data from the From Inglis To Scots corpus (FITS, also developed at the AMC), which maps spellings to sounds in the early history of Scots (1380–1500). Here, I will showcase our bespoke tool, Medusa, which creates dynamic visualisations of the grapho-phonological relations in the corpus. I will conclude with some examples of the usefulness of reconciling the core objectives of corpus methods with a level of linguistic analysis often dismissed as cumbersome and uninformative.
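By way of illustration of the kind of XML tagging mentioned above, the sketch below shows one possible layout for a single token: the attested spelling is kept verbatim, while a lemma, a part-of-speech label and a morpheme-level segmentation are layered on top as attributes and child elements. The element names (w, m), the attribute names (orig, lemma, pos, gloss) and the example word are hypothetical, invented for this sketch rather than taken from the CHM’s actual schema.

```python
# Hypothetical sketch: element names, attribute names and the example word
# are invented for illustration; they are not the CHM's actual conventions.
import xml.etree.ElementTree as ET

# One attested token: the original spelling is preserved verbatim, with the
# normalised lemma and part-of-speech tag attached as attributes.
token = ET.Element("w", attrib={
    "orig": "spellyng",    # spelling exactly as it appears in the source text
    "lemma": "spelling",   # normalised citation form used for lemma searches
    "pos": "N",            # part-of-speech tag
})

# Morphemes are recorded as child elements, so morphological queries and
# spelling-variation queries can be run independently over the same token.
for form, gloss in [("spell", "STEM"), ("yng", "SUFFIX")]:
    morph = ET.SubElement(token, "m", attrib={"gloss": gloss})
    morph.text = form

print(ET.tostring(token, encoding="unicode"))
# -> <w orig="spellyng" lemma="spelling" pos="N"><m gloss="STEM">spell</m><m gloss="SUFFIX">yng</m></w>
```

The point of such a layout is that lemma- and morpheme-level searches can be run without discarding the spelling evidence on which grapho-phonological analysis depends.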