
Deutsche Malaga-Morphologie (DMM)
What is DMM?
DMM is a system for the automatic recognition of German wordforms. It
provides:
- categorization, i.e. assigning grammatical categories like part
of speech, case, gender, number, person, tense etc., to a wordform
- lemmatisation, i.e. assigning a base form to a wordform
- segmentation, i.e. identification of the morphemes that a
wordform is composed of
How does DMM work?
DMM uses the grammar formalism of Left-Associative Grammar
(LAG). In contrast to Phrase Structure Grammars, which are based on the
principle of possible substitutions, LAGs are based on the principle
of possible continuations. Input is analyzed
left-associatively (left to right in the case of Western
scripts, more generally: in writing direction). Analysis is
time-linear and surface-compositional, which means that
the input segments are concatenated in the order of their occurrence
and that every rule application necessarily reads exactly one segment
of the input. As a consequence, LAGs have favourable computational
complexity.
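The principle of possible continuations can be illustrated with a small sketch. This is not Malaga code and does not reproduce DMM's rules: the lexicon, the category names and the continuation table below are hypothetical and exist only to show that each step reads exactly one segment and that the current state determines what may follow.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy allomorph lexicon: surface -> category.
LEXICON: Dict[str, str] = {
    "haus": "noun_stem",
    "tür": "noun_stem",
    "en": "plural_ending",
}

# Hypothetical continuation table: categories that may follow each state.
CONTINUATIONS: Dict[str, Set[str]] = {
    "start": {"noun_stem"},
    "noun_stem": {"noun_stem", "plural_ending"},
    "plural_ending": set(),
}

# States in which a complete wordform may end (also an assumption).
FINAL_STATES: Set[str] = {"noun_stem", "plural_ending"}


def analyze(wordform: str) -> List[List[str]]:
    """Return every segmentation licensed by the toy lexicon and table."""
    results: List[List[str]] = []
    # Each agenda item: (position in the input, current state, segments so far).
    agenda: List[Tuple[int, str, List[str]]] = [(0, "start", [])]
    while agenda:
        pos, state, segments = agenda.pop()
        if pos == len(wordform):
            if state in FINAL_STATES:
                results.append(segments)
            continue
        # One step = read exactly one segment that is a possible continuation.
        for surface, category in LEXICON.items():
            if wordform.startswith(surface, pos) and category in CONTINUATIONS[state]:
                agenda.append((pos + len(surface), category, segments + [surface]))
    return results


if __name__ == "__main__":
    print(analyze("haustüren"))
```

The agenda always extends an analysis at its right edge, so segments are combined strictly in the order in which they occur in the input.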
How is DMM implemented?
DMM has been developed with the LAG implementation software MALAGA,
which was written by Björn Beutel at the Computational
Linguistics Department of the University Erlangen-Nürnberg
(CLUE). Malaga consists of:
- a programming language called MALAGA
- rule and lexicon compilers which translate grammar components
written by developers into a binary format
- a run time component that can analyze wordforms or whole texts
- a number of development and embedding tools, like a Perl module
written by Michael Piotrowski.
Here
you can find more information about MALAGA.
What lexica does DMM use?
DMM works with a base form lexicon with about 50,000 entries,
consisting of:
- 20,400 nouns
- 11,200 adjectives
- 10,900 proper nouns
- 6,200 verbs
- rest: function words (determiners, prepositions, etc.),
inflectional endings, prefixes, linking morphemes, etc.
Special rules generate 67,000 allomorphs from these 50,000
entries. The run time component then concatenates the allomorphs into
wordforms.
What do DMM analyses look like?
DMM analyses are not single tags but complex feature-value
structures. The example analysis for the wordform
Bundesverfassungsgericht ("Federal Constitutional Court") is:

The result is a list (indicated by the angle brackets) of
analyses. Each analysis is a record (indicated by the square brackets)
containing feature-value pairs. In the above example the result list
contains exactly one analysis with the following information:
- the analysis type, in this case Parsed, i.e. the wordform
was recognized by the LAG rule mechanism (other possibilities would be
unknown and Hypothesis)
- the segmented surface of the wordform
- the part-of-speech tag
- the base form
- a weight which can be used for disambiguation of wordforms that
have more than one reading; the weight is based on heuristics that
evaluate the concatenation processes
- gender of the noun
- case and number of the noun
(This result format is a simplified form of a richer internal format.)
A (rather large) image of an analysis tree can be found here.
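As a rough illustration of the result structure described above, a result list can be modelled as a list of records, with the weight used to choose among several readings. The field names, the example wordform, the weight values and the assumption that a smaller weight marks the preferred reading are all hypothetical; none of this is actual DMM output.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

# Illustrative values only, not actual DMM output: two readings of an
# ambiguous compound, with hypothetical feature names and weights.
EXAMPLE_RESULT: List[Record] = [
    {"type": "Parsed", "segments": ["Staub", "ecken"], "pos": "noun",
     "base_form": "Staubecke", "weight": 2.0},
    {"type": "Parsed", "segments": ["Stau", "becken"], "pos": "noun",
     "base_form": "Staubecken", "weight": 1.0},
]


def best_reading(result: List[Record]) -> Record:
    """Pick one reading from a result list by weight.

    Assumption: a smaller weight marks the preferred reading.
    """
    if not result:
        raise ValueError("empty result list")
    return min(result, key=lambda record: record["weight"])


if __name__ == "__main__":
    print(best_reading(EXAMPLE_RESULT)["segments"])
```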
How does DMM perform?
The coverage of morphological analysis systems depends on a number of
factors. Besides properties of the system itself (such as lexicon
size), these are mainly properties of the text, for example its
domain.
Every domain has a domain-specific vocabulary, that is, the part of the
vocabulary which occurs exclusively or with particular frequency in
this domain. The size of the domain-specific vocabulary differs from
domain to domain. Domains like sports have a comparatively small
domain-specific vocabulary; others like medicine have a rather large
one.
As DMM's base form lexicon contains mainly general vocabulary, the
coverage in domains with a large domain-specific vocabulary is
lower. One reason for this is that texts from these domains have a
less favourable token-to-type ratio due to their larger
vocabulary. The token-to-type ratio is the ratio of wordform
instances (tokens) to distinct wordforms (types). For example, a
text of 1000 wordforms (1000 tokens) might contain only 100 different
wordforms (100 types); this text has a token-to-type ratio of 10,
i.e. each type occurs 10 times on average. The lower the
token-to-type ratio, the lexically richer the text, and a low ratio
usually results in lower analysis coverage.
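The ratio defined above can be computed directly from a tokenized text. The following is a minimal sketch; the function name is illustrative, and tokenization (as well as any case normalization) is assumed to have been done beforehand.

```python
from collections import Counter
from typing import Iterable


def token_type_ratio(tokens: Iterable[str]) -> float:
    """Return the number of tokens divided by the number of distinct types."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("cannot compute the ratio of an empty text")
    # Sum of all counts = tokens; number of distinct keys = types.
    return sum(counts.values()) / len(counts)


if __name__ == "__main__":
    # Mirrors the example in the text: 100 types, each occurring 10 times.
    text = [f"w{i}" for i in range(100)] * 10
    print(token_type_ratio(text))
```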
DMM has been used for annotating the CLUE corpora which have been
collected at the Computational Linguistics Department. The coverages
for the different subcorpora are:
| corpus | tokens | unknown tokens | % unknown tokens | types | unknown types | % unknown types | tokens per type |
|---|---|---|---|---|---|---|---|
| Bible | 1,131,536 | 24,907 | 2.20 | 37,031 | 7,099 | 19.17 | 30.56 |
| Limas | 1,236,774 | 32,549 | 2.63 | 121,650 | 17,544 | 14.42 | 10.16 |
| Sports | 1,140,121 | 57,967 | 5.08 | 64,799 | 14,506 | 22.38 | 17.59 |
| IT | 1,000,001 | 100,825 | 10.08 | 100,208 | 33,233 | 33.16 | 9.98 |
| Medicine | 1,017,646 | 139,682 | 13.72 | 104,425 | 38,004 | 36.29 | 9.74 |
| Total | 5,526,079 | 355,930 | 6.44 | 324,570 | 103,432 | 31.86 | 17.02 |
As can be seen, the percentage of unknown wordforms is low if the
token-to-type ratio is high (as in Bible and Sports); it is higher if
the token-to-type ratio is low (as in IT and Medicine). An exception
is the Limas corpus, where the percentage of unknown wordforms is low
despite a low token-to-type ratio. This is because the Limas corpus
has been used as the empirical basis for extending the base form
lexicon.
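The derived columns of the coverage table follow from the raw counts by simple arithmetic. The sketch below recomputes them for two rows; the counts are copied from the table, while the function and variable names are illustrative. Results may differ from the published figures in the last digit, depending on rounding.

```python
from typing import Dict, Tuple

# (tokens, unknown tokens, types, unknown types), copied from the table.
COUNTS: Dict[str, Tuple[int, int, int, int]] = {
    "Bible": (1_131_536, 24_907, 37_031, 7_099),
    "IT": (1_000_001, 100_825, 100_208, 33_233),
}


def derived_columns(tokens: int, unknown_tokens: int,
                    types: int, unknown_types: int) -> Tuple[float, float, float]:
    """Return (% unknown tokens, % unknown types, tokens per type)."""
    return (
        100.0 * unknown_tokens / tokens,
        100.0 * unknown_types / types,
        tokens / types,
    )


if __name__ == "__main__":
    for corpus, row in COUNTS.items():
        pct_tokens, pct_types, ratio = derived_columns(*row)
        print(f"{corpus}: {pct_tokens:.2f} {pct_types:.2f} {ratio:.2f}")
```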
To test the correctness of DMM, the analyses of 1000 wordforms were
randomly selected from the Limas corpus and evaluated manually. The
results were:
| (error) class | number | % |
|---|---|---|
| correct analysis | 913 | 91.3 |
| correct hypothesis | 3 | 0.3 |
| syntactically correct, incorrect segmentation | 1 | 0.1 |
| ambiguous with correct and incorrect reading | 44 | 4.4 |
| syntactically incorrect | 5 | 0.5 |
| missing ambiguity | 4 | 0.4 |
| not recognized | 25 | 2.5 |
| input (spelling) error | 5 | 0.5 |
| total | 1,000 | 100.0 |
Is there an interactive demo of DMM?
You can find an interactive demo of DMM on the Malaga page.
Is there documentation for DMM?
Unfortunately, there is no documentation for DMM yet. DMM development
continues and documentation will eventually be produced.
There is, however, a master's thesis describing an older version of
DMM. Note that this thesis is only available in German.
- Lorenz, Oliver (1996): Automatische Wortformenerkennung für das
Deutsche im Rahmen von Malaga. Magisterarbeit.
Friedrich-Alexander-Universität Erlangen-Nürnberg, Abteilung für Computerlinguistik.
- available as [HTML] [PostScript]
Further questions about DMM?
If you have further questions about DMM, please contact Oliver Lorenz.