Creating a new corpus for use with corsica requires several steps. First, all corpus texts have to be tokenized so that the tokens in a file are separated by newlines. The program tokenizer performs this task (see the separate manual). After this preprocessing, a corpus definition file is required. This file holds information about the corpus and all corpus texts.
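To illustrate the one-token-per-line input format, here is a minimal Python sketch. It is not the tokenizer program mentioned above; `naive_tokenize` and `to_token_lines` are hypothetical helpers that split on words and punctuation only, and the real tokenizer may apply different rules.

```python
import re


def naive_tokenize(text):
    """Split text into word and punctuation tokens.

    A crude stand-in for illustration only; it is NOT the corsica tokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def to_token_lines(text):
    """Return the text in one-token-per-line form, each token ending in a newline."""
    return "".join(token + "\n" for token in naive_tokenize(text))


if __name__ == "__main__":
    # Print a short sentence, one token per line.
    print(to_token_lines("The corpus is tokenized first."), end="")
```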
Options:
Description:
Create a new corpus. The corpus definition file defines the texts of a corpus, the corpus header, and an optional taxonomy. The format of the corpus and text headers complies with the TEI guidelines.
The definition file has the following outline:
<teicorpus.2>
<!-- corpus header starts here -->
<teiheader type="corpus">
<filedesc> ... </filedesc>
<profiledesc>
<!-- taxonomy definition -->
</profiledesc>
</teiheader>
<!-- text definitions start here -->
<teiheader>
<filedesc> ... </filedesc>
<profiledesc> ... </profiledesc>
</teiheader>
<!-- .... more text definitions -->
<teiheader> ... </teiheader>
</teicorpus.2>
A minimal corpus header consists of the element filedesc, which contains the elements titlestmt, publicationstmt and sourcedesc.
Example of a minimal corpus header:
<teiheader type="corpus">
<filedesc>
<titlestmt>
<title> The Testcorpus </title>
</titlestmt>
<publicationstmt>
<idno> corpus_id </idno>
<p> Published by Marco Zierl </p>
<p> CLUE Erlangen </p>
</publicationstmt>
<sourcedesc>
<p> The corpus as a document of its own has no source </p>
</sourcedesc>
</filedesc>
</teiheader>
The text header is almost identical to the corpus header. The main difference is the obligatory element profiledesc, which names the source files of the corpus text. These files must already have been tokenized (see tokenizer); the required format is one token per line. The file names are listed inside the element creation, each inside its own code element.
Example:
<profiledesc>
<creation>
<code> /projects/thesis/maz/work/corsi/text_part1
<code> /projects/thesis/maz/work/corsi/text_part2
<code> /projects/thesis/maz/work/corsi/text_part3
</creation>
</profiledesc>
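Since the build only succeeds when every listed file can be found, it can be useful to check the file names beforehand. The following Python sketch extracts the names from the code elements of a creation block and reports those that do not exist. The functions `source_files` and `missing_files` are hypothetical helpers, not part of corsica, and the regular expressions only cover the simple layout shown in the example.

```python
import os
import re


def source_files(header_text):
    """Return the file names listed in <code> elements inside <creation>."""
    creation = re.search(r"<creation>(.*?)</creation>", header_text, re.S | re.I)
    if creation is None:
        return []
    # A name runs from its <code> tag up to the next tag of any kind,
    # so both closed and unclosed <code> elements are handled.
    names = re.findall(r"<code>\s*([^<]*)", creation.group(1), re.I)
    return [name.strip() for name in names if name.strip()]


def missing_files(header_text):
    """Return the listed file names that do not exist on this machine."""
    return [name for name in source_files(header_text) if not os.path.isfile(name)]


if __name__ == "__main__":
    example = """<profiledesc> <creation>
    <code> /projects/thesis/maz/work/corsi/text_part1
    <code> /projects/thesis/maz/work/corsi/text_part2
    </creation> </profiledesc>"""
    for name in missing_files(example):
        print("not found:", name)
```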
A complete text header consists of the two elements filedesc and profiledesc. The obligatory ID (the idno element inside publicationstmt) must be unique within the corpus.
Example of a complete text header:
<teiheader>
<filedesc>
<titlestmt>
<title> A test text </title>
</titlestmt>
<publicationstmt>
<idno> testtext </idno>
<distributor> CLUE </distributor>
</publicationstmt>
<sourcedesc>
<p> This text has no source
</sourcedesc>
</filedesc>
<profiledesc>
<creation>
<code> /projects/thesis/maz/work/corsi/test2
</creation>
</profiledesc>
</teiheader>
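Because every text ID must be unique within the corpus, a quick check of the definition file before building can save a failed run. The sketch below collects all idno values and reports any that occur more than once. `text_ids` and `duplicate_ids` are hypothetical helpers; the sketch also counts the idno of the corpus header, which is assumed to be harmless for this purpose.

```python
import re
from collections import Counter


def text_ids(definition_text):
    """Return all <idno> values in the order in which they appear."""
    found = re.findall(r"<idno>\s*([^<]*)", definition_text, re.I)
    return [value.strip() for value in found]


def duplicate_ids(definition_text):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(text_ids(definition_text))
    return sorted(value for value, n in counts.items() if n > 1)


if __name__ == "__main__":
    sample = "<idno> testtext </idno> ... <idno> testtext </idno>"
    print(duplicate_ids(sample))
```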
The file example.def contains a complete minimal corpus definition file.
If the definition file was parsed successfully and all indicated files were found, the corpus data structures and the corpus index are built. The following files are created:
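For corpora with many texts, a definition file following the outline above could also be generated by a script. The Python sketch below assembles a minimal corpus header and one text header per text. All function and parameter names are hypothetical, the element layout simply mirrors the examples in this manual, and the output has not been verified against corsica itself.

```python
def corpus_header(title, corpus_id, publisher):
    """Build a minimal corpus header modelled on the example in this manual."""
    return "\n".join([
        '<teiheader type="corpus">',
        "<filedesc>",
        "<titlestmt>",
        f"<title> {title} </title>",
        "</titlestmt>",
        "<publicationstmt>",
        f"<idno> {corpus_id} </idno>",
        f"<p> {publisher} </p>",
        "</publicationstmt>",
        "<sourcedesc>",
        "<p> The corpus as a document of its own has no source </p>",
        "</sourcedesc>",
        "</filedesc>",
        "</teiheader>",
    ])


def text_header(title, text_id, distributor, files):
    """Build a complete text header; files lists the tokenized source files."""
    return "\n".join([
        "<teiheader>",
        "<filedesc>",
        "<titlestmt>",
        f"<title> {title} </title>",
        "</titlestmt>",
        "<publicationstmt>",
        f"<idno> {text_id} </idno>",
        f"<distributor> {distributor} </distributor>",
        "</publicationstmt>",
        "<sourcedesc>",
        "<p> This text has no source </p>",
        "</sourcedesc>",
        "</filedesc>",
        "<profiledesc>",
        "<creation>",
        *[f"<code> {name}" for name in files],
        "</creation>",
        "</profiledesc>",
        "</teiheader>",
    ])


def definition_file(corpus, texts):
    """Combine one corpus header and any number of text headers."""
    parts = ["<teicorpus.2>", corpus_header(**corpus)]
    parts += [text_header(**text) for text in texts]
    parts.append("</teicorpus.2>")
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    print(definition_file(
        {"title": "The Testcorpus", "corpus_id": "corpus_id", "publisher": "CLUE Erlangen"},
        [{"title": "A test text", "text_id": "testtext", "distributor": "CLUE",
          "files": ["/projects/thesis/maz/work/corsi/test2"]}],
    ), end="")
```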
With the exception of the corpus registration file, none of these files may be edited or changed.
Defining a taxonomy for a corpus provides a set of categories by which the texts of the corpus can be classified. Texts can then be selected on the basis of their categories. The taxonomy definitions are contained inside the element encodingdesc, which follows the filedesc element. Each taxonomy definition must have a unique ID.
The outline is as follows:
<encodingdesc>
<classdecl>
<taxonomy id=taxonomy1>
... categories
</taxonomy>
<taxonomy id=taxonomy2>
... categories
</taxonomy>
... more taxonomy elements
</classdecl>
</encodingdesc>
Each taxonomy contains a number of categories, each of which must have an ID that is unique within the corpus. The categories of a taxonomy element are all contained inside one outer category element, which has no ID and defines the name of the taxonomy. Each contained category element defines the name of its category inside a catdesc element.
Example:
<taxonomy id=taxonomy1>
<category>
<catdesc> Medium </catdesc>
<category id=T01>
<catdesc> Book </catdesc>
</category>
<category id=T02>
<catdesc> Periodical </catdesc>
</category>
</category>
</taxonomy>
Example of a complete encoding description:
<encodingdesc>
<classdecl>
<taxonomy id=taxonomy1>
<category>
<catdesc> Medium </catdesc>
<category id=C01>
<catdesc> Law </catdesc>
</category>
<category id=C02>
<catdesc> Economics </catdesc>
</category>
</category>
</taxonomy>
<taxonomy id=taxonomy2>
<category>
<catdesc> Text type </catdesc>
<category id=C03>
<catdesc> written, published </catdesc>
</category>
<category id=C04>
<catdesc> written, unpublished </catdesc>
</category>
<category id=C05>
<catdesc> spoken </catdesc>
</category>
</category>
</taxonomy>
</classdecl>
</encodingdesc>
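The nesting of taxonomy, outer category, and inner categories can be made explicit by reading an encoding description into a dictionary. The following Python sketch does this with regular expressions and also reports category IDs that are used more than once. `parse_taxonomies` and `duplicate_category_ids` are hypothetical helpers; the patterns assume the layout of the examples above (the id attribute directly after the element name, quoted or unquoted) and do not constitute a full SGML parser.

```python
import re
from collections import Counter

# One <taxonomy id=...> ... </taxonomy> block.
TAXONOMY = re.compile(r'<taxonomy\s+id="?([\w.-]+)"?\s*>(.*?)</taxonomy>', re.S | re.I)
# The outer category without an ID; its catdesc names the taxonomy.
TITLE = re.compile(r"<category\s*>\s*<catdesc>\s*(.*?)\s*</catdesc>", re.S | re.I)
# An inner category with an ID and its catdesc.
NAMED = re.compile(
    r'<category\s+id="?([\w.-]+)"?\s*>\s*<catdesc>\s*(.*?)\s*</catdesc>', re.S | re.I
)


def parse_taxonomies(encodingdesc):
    """Map each taxonomy ID to its name and its {category ID: category name} table."""
    result = {}
    for tax_id, body in TAXONOMY.findall(encodingdesc):
        title = TITLE.search(body)
        result[tax_id] = {
            "name": title.group(1) if title else None,
            "categories": dict(NAMED.findall(body)),
        }
    return result


def duplicate_category_ids(encodingdesc):
    """Return category IDs that occur more than once in the whole description."""
    counts = Counter(cat_id for cat_id, _ in NAMED.findall(encodingdesc))
    return sorted(cat_id for cat_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    sample = """<taxonomy id=taxonomy2> <category> <catdesc> Text type </catdesc>
    <category id=C05> <catdesc> spoken </catdesc> </category>
    </category> </taxonomy>"""
    print(parse_taxonomies(sample))
```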
If a taxonomy is defined in the corpus header, the text headers can refer to its categories. These references are placed inside the profiledesc of a text, following the creation element, and are enclosed in the element textclass. Each reference is a catref element that states the ID of the taxonomy (scheme) and the ID of the category (target). Example:
<textclass>
<catref scheme=taxonomy1 target=C01>
<catref scheme=taxonomy2 target=C04>
</textclass>
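A reference is only meaningful if its scheme names a defined taxonomy and its target names a category of that taxonomy. The Python sketch below checks this for a textclass block. `catrefs` and `invalid_catrefs` are hypothetical helpers, the taxonomy table is passed in as a plain dictionary, and the pattern assumes that scheme is written before target, as in the example.

```python
import re

# <catref scheme=... target=...> with quoted or unquoted attribute values.
CATREF = re.compile(
    r'<catref\s+scheme="?([\w.-]+)"?\s+target="?([\w.-]+)"?\s*>', re.I
)


def catrefs(textclass):
    """Return (scheme, target) pairs for every catref in the given text."""
    return CATREF.findall(textclass)


def invalid_catrefs(textclass, taxonomies):
    """Return the references whose scheme or target is not defined.

    taxonomies maps a taxonomy ID to the set of its category IDs.
    """
    return [
        (scheme, target)
        for scheme, target in catrefs(textclass)
        if target not in taxonomies.get(scheme, set())
    ]


if __name__ == "__main__":
    defined = {"taxonomy1": {"C01", "C02"}, "taxonomy2": {"C03", "C04", "C05"}}
    block = "<textclass> <catref scheme=taxonomy1 target=C01> <catref scheme=taxonomy2 target=C04> </textclass>"
    print(invalid_catrefs(block, defined))
```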
The registration file contains information about a corpus. The file is human-readable and may be edited. It contains the following fields:
id:
title:
total_words:
directory:
attributes:
malaga_analysis:
The first three fields should be left untouched. The directory may be changed if the corpus files are moved to another location. The attributes line can contain a list of corpus attributes. These attributes are usually created by executing add_attribute. The file names of the corpus attributes have the following structure: corpus_id + '-' + attribute_name. The files and file extensions are identical to those of the corpus data files.
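As a rough illustration, the registration file could be read and the attribute file names derived as in the following Python sketch. It assumes that each field sits on its own line in the form `name: value` and that attribute names are separated by whitespace; neither detail is confirmed by this manual. `parse_registration` and `attribute_file_stem` are hypothetical helpers, and the sample values are made up.

```python
def parse_registration(text):
    """Read 'name: value' lines into a dictionary (assumed file layout)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split at the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


def attribute_file_stem(corpus_id, attribute_name):
    """File name stem of a corpus attribute: corpus_id + '-' + attribute_name."""
    return corpus_id + "-" + attribute_name


if __name__ == "__main__":
    # Made-up sample content for illustration.
    sample = "id: corpus_id\ntitle: The Testcorpus\nattributes: pos lemma\n"
    fields = parse_registration(sample)
    for attribute in fields["attributes"].split():
        print(attribute_file_stem(fields["id"], attribute))
```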