Deutsche Malaga-Morphologie (DMM)

What is DMM?

DMM is a system for the automatic wordform recognition of German. DMM's abilities are:

How does DMM work?

DMM uses the grammar formalism of Left-Associative Grammar (LAG). Phrase Structure Grammars are based on the principle of possible substitutions; LAGs, in contrast, are based on the principle of possible continuations. Input is analyzed left-associatively, that is, in writing direction (left to right in the case of Western scripts). Analysis is time-linear and surface-compositional: the input segments are concatenated in the order of their occurrence, and every rule application is necessarily tied to reading exactly one segment of the input. As a consequence, LAGs have favourable computational complexity.
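The principle of possible continuations can be illustrated with a minimal Python sketch. It is not DMM or Malaga code: the toy lexicon, the category names (`noun_stem`, `linking`) and the continuation table are assumptions made up for illustration. Each step reads exactly one segment from the left and checks whether its category is an allowed continuation of the analysis so far.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: allomorph surface -> category (not DMM's lexicon).
LEXICON = {
    "bund": "noun_stem",
    "es": "linking",
    "verfassung": "noun_stem",
    "s": "linking",
    "gericht": "noun_stem",
}

# Possible continuations: category reached so far -> categories that may follow.
CONTINUATIONS = {
    "start": {"noun_stem"},
    "noun_stem": {"linking", "noun_stem"},
    "linking": {"noun_stem"},
}

# Categories in which an analysis may end.
FINAL = {"noun_stem"}


@dataclass(frozen=True)
class State:
    segments: tuple  # segments read so far, in input order
    category: str    # category of the last segment read
    rest: str        # input that has not been read yet


def analyze(wordform):
    """Return all segmentations licensed by the toy lexicon and continuations."""
    states = [State((), "start", wordform.lower())]
    results = []
    while states:
        next_states = []
        for st in states:
            if not st.rest:
                # Input fully consumed: accept only if the path may end here.
                if st.category in FINAL:
                    results.append(st.segments)
                continue
            # Each rule application reads exactly one segment from the left.
            for surface, cat in LEXICON.items():
                if st.rest.startswith(surface) and cat in CONTINUATIONS[st.category]:
                    next_states.append(
                        State(st.segments + (surface,), cat, st.rest[len(surface):])
                    )
        states = next_states
    return results


print(analyze("Bundesverfassungsgericht"))
```

Ambiguous input simply yields several parallel paths; paths with no possible continuation are dropped at the step where they fail.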

How is DMM implemented?

DMM was developed with MALAGA, a software package for implementing LAGs written by Björn Beutel at the Computational Linguistics Department of the University of Erlangen-Nürnberg (CLUE). Malaga consists of:

Here you can find more information about MALAGA.

What lexica does DMM use?

DMM works with a base form lexicon of about 50,000 entries, consisting of:

Special rules generate 67,000 allomorphs from these 50,000 entries. The run-time component then concatenates the allomorphs into wordforms.

What do DMM analyses look like?

DMM analyses are not single tags but complex feature-value structures. The example analysis for the wordform Bundesverfassungsgericht ("Federal Constitutional Court") is:

The result is a list (indicated by the angle brackets) of analyses. Each analysis is a record (indicated by the square brackets) containing feature-value pairs. In the above example the result list contains exactly one analysis with the following information:

(This result format is a simplified form of a richer internal format.)

A (rather large) image of an analysis tree can be found here.

How does DMM perform?

The coverage of a morphological analysis system depends on a number of factors. Besides properties of the system itself, such as lexicon size, these are mainly properties of the text, such as its domain.

Every domain has a domain-specific vocabulary, that is, the part of the vocabulary which occurs exclusively or with particular frequency in this domain. The size of the domain-specific vocabulary differs from domain to domain. Domains like sports have a comparatively small domain-specific vocabulary; others like medicine have a rather large one.

As DMM's base form lexicon contains mainly general vocabulary, the coverage in domains with a large domain-specific vocabulary is lower. One reason for this is that texts from these domains have a less favourable token-to-type ratio due to their larger vocabulary. The token-to-type ratio is the ratio of wordform instances (tokens) to distinct wordforms (types). For example, a text which is 1000 wordforms long (1000 tokens) might consist of only 100 different wordforms (100 types); this text has a token-to-type ratio of 10, or in other words, every type occurs 10 times on average. The lower the token-to-type ratio, the lexically richer the text. A low token-to-type ratio usually results in lower analysis coverage.
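The token-to-type ratio can be computed directly from a tokenized text. The following Python sketch assumes the text is already available as a list of wordform strings; the function name `token_to_type_ratio` is chosen here for illustration and is not part of DMM.

```python
from collections import Counter


def token_to_type_ratio(tokens):
    """Number of wordform instances divided by number of distinct wordforms."""
    types = Counter(tokens)  # one key per distinct wordform
    if not types:
        raise ValueError("cannot compute the ratio of an empty text")
    return len(tokens) / len(types)


# Artificial text with 1000 tokens drawn evenly from 100 distinct wordforms.
tokens = [f"w{i % 100}" for i in range(1000)]
print(token_to_type_ratio(tokens))  # 10.0
```

Whether wordforms that differ only in capitalization count as the same type is a design decision; this sketch treats them as different.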

DMM has been used for annotating the CLUE corpora, which were collected at the Computational Linguistics Department. The coverage figures for the individual subcorpora are:

Corpus       Tokens   Unknown tokens   % unknown     Types   Unknown types   % unknown   Tokens per type
Bible     1,131,536           24,907        2.20    37,031           7,099       19.17             30.56
Limas     1,236,774           32,549        2.63   121,650          17,544       14.42             10.16
Sports    1,140,121           57,967        5.08    64,799          14,506       22.38             17.59
IT        1,000,001          100,825       10.08   100,208          33,233       33.16              9.98
Medicine  1,017,646          139,682       13.72   104,425          38,004       36.29              9.74
Total     5,526,079          355,930        6.44   324,570         103,432       31.86             17.02
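The percentage and ratio columns follow from the four counts in each row. As a hedged illustration, the Python sketch below recomputes them for the Bible row; the function name `coverage_figures` and the dictionary keys are assumptions made here, and the last digit of a recomputed value can differ from a published one depending on whether values are rounded or truncated.

```python
def coverage_figures(tokens, unknown_tokens, types, unknown_types):
    """Derive percentages of unknown wordforms and tokens per type from counts."""
    return {
        "unknown_token_pct": 100 * unknown_tokens / tokens,
        "unknown_type_pct": 100 * unknown_types / types,
        "tokens_per_type": tokens / types,
    }


# Counts taken from the Bible row of the table.
bible = coverage_figures(1_131_536, 24_907, 37_031, 7_099)
for name, value in bible.items():
    print(f"{name}: {value:.2f}")
```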

As can be seen, the percentage of unknown wordforms is low if the token-to-type ratio is high (as in Bible and Sports); it is higher if the token-to-type ratio is low (as in IT and Medicine). An exception is the Limas corpus, where the percentage of unknown wordforms is low despite a low token-to-type ratio. This is because the Limas corpus was used as the empirical basis for extending the base form lexicon.

To test the correctness of DMM, the analyses of 1000 wordforms were randomly selected from the Limas corpus and evaluated manually. The results were:

(error) class                                   number       %
correct analysis                                   913    91.3
correct hypothesis                                   3     0.3
syntactically correct, incorrect segmentation        1     0.1
ambiguous with correct and incorrect reading        44     4.4
syntactically incorrect                              5     0.5
missing ambiguity                                    4     0.4
not recognized                                      25     2.5
input (spelling) error                               5     0.5
total                                            1,000   100.0

Is there an interactive demo of DMM?

You can find an interactive demo of DMM on the Malaga page.

Is there documentation for DMM?

Unfortunately there is no documentation for DMM yet. DMM development continues and, eventually, documentation will be produced.

There is, however, a master's thesis describing an older version of DMM. Note that this master's thesis is only available in German.

Further questions about DMM?

If you have further questions about DMM, please contact Oliver Lorenz.