Polona Gantar

Dictionary of Modern Slovene: From the Lexical Database to the Digital Dictionary Database for Slovene

The ability to process language data has become fundamental to the development of technologies in many areas of human life in the digital world. The development of computer-readable linguistic resources, methods and tools is therefore also one of the key challenges for the contemporary Slovene language. This challenge has been recognized in the Slovene language community at both the professional and the state level, and has been the subject of numerous activities over the past ten years, which I would like to present in this paper.

The idea of a comprehensive dictionary database covering all levels of linguistic description of modern Slovene, from the morphological and lexical to the syntactic, had already been formulated in the framework of the European Social Fund project Communication in Slovene (2008-2013), within which the Slovene Lexical Database was created. When designing the Slovene Lexical Database (SLD), we pursued two goals: to create a linguistic description of Slovene intended for human users, and to make that description useful for machine processing of Slovene. Ever since the construction of the first corpus of Slovene, it has been evident that there is a need for a description of modern Slovene based on real language data, and that it is necessary to understand the needs of language users in order to create useful language reference works. In addition, it became obvious that only the digital medium allows for a comprehensive language description and that the design of the database must be adapted to it from the start. Also, in terms of formats and international standards, the description needs to follow good practices as closely as possible, as this enables the inclusion of Slovene in a wider network of resources, such as Linked Open Data, BabelNet, ELEXIS, etc. Given time pressure and trends in lexicography, we also had to consider procedures for automating the extraction of linguistic data from corpora and the inclusion of crowdsourcing in the lexicographic process.

In accordance with the essential idea of creating an all-inclusive Digital Dictionary Database for Slovene, several independent databases have been created over the past two years, specifically the Collocations Dictionary of Modern Slovene and the automatically generated Thesaurus of Modern Slovene; both also exist as independent online dictionary portals. One of the novelties that we put forward together with both dictionaries is the concept of the so-called 'responsive dictionary', which includes crowdsourcing methods. Ultimately, the Digital Dictionary Database will provide the remaining levels of linguistic description as well: morphological, with the Sloleks database upgrade; phraseological, with the construction of a multi-word expressions lexicon; and syntactic, with the formalization of valency patterns of Slovene verbs. Each of these databases contains its own specific language data that will ultimately be included in the comprehensive Slovene Digital Dictionary Database, which will represent a basic linguistic description of Slovene for both the human and the machine user.
