To make several corpora globally available, client/server access to the corpus data was implemented. All corpora used by the server must be registered in the registration file corsica.reg. The server looks for this file at the location given by the environment variable CORSICA_PATH, or in the current working directory. The file corsica.reg must contain the complete filenames (including absolute paths) of all corpus registration files (.reg). These corpora are then made accessible by the server.
The program register_corpus registers a corpus in the registration file and checks that corpus IDs are unique:
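The lookup described above can be sketched as follows. This is only an illustration in Python, not the server's actual code. It assumes that CORSICA_PATH names the directory holding corsica.reg, that the environment variable is tried before the current working directory, and that corsica.reg lists one absolute path per line. The function names find_registry and read_registry are hypothetical.

```python
import os
from pathlib import Path

REGISTRY_NAME = "corsica.reg"


def find_registry(environ=os.environ, cwd=None):
    """Return the path of corsica.reg, or None if it cannot be found."""
    candidates = []
    if environ.get("CORSICA_PATH"):
        # Assumption: CORSICA_PATH names the directory holding corsica.reg.
        candidates.append(Path(environ["CORSICA_PATH"]) / REGISTRY_NAME)
    # Fall back to the current working directory.
    candidates.append(Path(cwd or Path.cwd()) / REGISTRY_NAME)
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def read_registry(registry):
    """Return the corpus registration files listed in the registry."""
    # Assumption: one absolute path per line; blank lines are skipped.
    entries = []
    for line in Path(registry).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        path = Path(line)
        if not path.is_absolute():
            raise ValueError(f"not an absolute path: {line}")
        entries.append(path)
    return entries
```

Rejecting relative paths early mirrors the requirement that corsica.reg contains complete filenames including absolute paths.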
The server process is a daemon that waits until it receives a connection request. When a request arrives, a new process is forked, and the daemon continues listening on the specified port. The new process handles all requests from that client and exits when the client is no longer connected.
The server reads the registration files of all registered corpora and opens the associated files. All data is sent over TCP/IP stream sockets.
The client is written in Java and can be used as an applet (started from a web page) or as a standalone program (in the local 'intranet'). The protocol is line-based; the client and server processes read and send strings terminated by a newline.
The Corsica client is a basic concordance browser that supports the selection of all available corpora. Queries must conform to the query language:
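The accept-and-fork pattern described above can be sketched in Python as follows. This is not the Corsica server itself: the function names are hypothetical, and the handler simply answers every line with "OK" plus the line, as a placeholder for real query processing. The sketch assumes a Unix-like system, since it relies on os.fork.

```python
import os
import signal
import socket


def handle_client(conn):
    """Serve one client until it disconnects (placeholder: acknowledge each line)."""
    with conn, conn.makefile("rb") as requests:
        for line in requests:  # the loop ends when the client closes the connection
            conn.sendall(b"OK " + line)


def serve(port, host=""):
    """Listen on the given port and fork one child process per connection."""
    # Ignore SIGCHLD so that exited children do not linger as zombies.
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((host, port))
        listener.listen()
        while True:
            conn, _addr = listener.accept()
            if os.fork() == 0:
                # Child: stop listening, serve this client, then exit.
                listener.close()
                handle_client(conn)
                os._exit(0)
            # Parent: drop its copy of the connection and keep listening.
            conn.close()
```

The parent closes its copy of the connected socket right after forking, so the connection ends as soon as the child exits.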
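A line-based exchange of this kind can be sketched as follows. The sketch is in Python rather than the Java of the actual client, and it makes several assumptions that the text does not confirm: that each request is answered by exactly one reply line, and that strings are encoded as UTF-8. The function name send_request is hypothetical.

```python
import socket


def send_request(conn, request, encoding="utf-8"):
    """Send one request line and return one reply line without its newline."""
    if "\n" in request:
        raise ValueError("a request must fit on one line")
    conn.sendall(request.encode(encoding) + b"\n")
    reply = bytearray()
    while not reply.endswith(b"\n"):
        chunk = conn.recv(1)  # byte-wise reads keep the sketch simple
        if not chunk:
            raise ConnectionError("server closed the connection")
        reply += chunk
    return reply[:-1].decode(encoding)
```

Because the newline is the only message delimiter, a request containing a newline would be read by the server as two requests, which is why the sketch rejects it.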
"find" "this" "sequence"
Regular expressions conforming to the POSIX standard may be used to define arbitrary search patterns:
Queries over attribute values have a different syntax:
The default attribute in every corpus is 'word'.
Logical 'or' and 'and' are supported with the '|' and '&' characters:
[word="thick" | word="thin"] [word="can" & POS="NN"]
Here, 'or' has a higher precedence than 'and'. Attribute queries can be arbitrarily complex:
[word="one" | word="two" | word="three" | word="four"]
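The effect of this precedence rule can be illustrated with a small evaluator for the contents of one pair of square brackets. It is a sketch, not the server's parser: a token is modelled as a Python dict of attribute values, Python's re module stands in for POSIX regular expressions, and the names parse and matches are hypothetical. Because 'or' binds more tightly than 'and', an expression is treated as an 'and' of 'or' groups.

```python
import re

# One attribute test (attr="pattern") or one operator (| or &).
TOKEN = re.compile(r'\s*(?:(\w+)\s*=\s*"([^"]*)"|([|&]))')


def parse(expr):
    """Parse the text inside [...] into AND-ed groups of OR-ed (attr, pattern) tests."""
    expr = expr.strip()
    groups = [[]]
    pos = 0
    expect_test = True
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"syntax error at position {pos}")
        pos = m.end()
        if m.group(3) is None:  # an attribute test
            if not expect_test:
                raise ValueError("operator expected")
            groups[-1].append((m.group(1), m.group(2)))
            expect_test = False
        else:  # an operator
            if expect_test:
                raise ValueError("attribute test expected")
            if m.group(3) == "&":
                groups.append([])  # '&' starts a new OR group
            expect_test = True
    if expect_test:
        raise ValueError("incomplete expression")
    return groups


def matches(expr, token):
    """Return True if the token (a dict of attribute values) satisfies expr."""
    return all(
        any(re.fullmatch(pattern, token.get(attr, "")) for attr, pattern in group)
        for group in parse(expr)
    )
```

For example, matches('word="a" | word="b" & POS="NN"', {"word": "a", "POS": "VB"}) is False, because the expression is read as (word="a" | word="b") & POS="NN" and the POS test fails.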
Queries are evaluated from left to right. The leftmost (first) query value therefore determines how fast the search will be: if the first word is a very general regular expression, the complete query will most likely have to be evaluated several times, on successive sub-results of the first expression (because of the limited buffer size), before all results are found.
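The cost of a general first pattern can be made concrete with a simplified model of left-to-right evaluation. The model is an assumption, not a description of the server's internals: the first pattern selects candidate positions, and the rest of the query is checked one buffer-load of candidates at a time. Python's re module stands in for POSIX regular expressions, and the function name and the buffer_size parameter are hypothetical.

```python
import re


def evaluate(tokens, patterns, buffer_size=1000):
    """Return (hits, rounds): match start positions and the number of buffer rounds."""
    first = re.compile(patterns[0])
    rest = [re.compile(p) for p in patterns[1:]]
    # Step 1: the leftmost pattern alone decides the candidate positions.
    candidates = [i for i, tok in enumerate(tokens) if first.fullmatch(tok)]
    hits, rounds = [], 0
    # Step 2: check the rest of the query one buffer-load of candidates at a time.
    for offset in range(0, len(candidates), buffer_size):
        rounds += 1
        for start in candidates[offset:offset + buffer_size]:
            following = tokens[start + 1:start + 1 + len(rest)]
            if len(following) == len(rest) and all(
                p.fullmatch(t) for p, t in zip(rest, following)
            ):
                hits.append(start)
    return hits, rounds
```

With the tokens "the cat sat on the mat" and a buffer size of 2, the query ["the", "mat"] needs one round, while [".*", "mat"] needs three rounds to find the same single hit at position 4.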