About Mrežnik


Institution: Institute of Croatian Language and Linguistics

Project leader: Dr. Lana Hudeček

Project term: 1 March 2017 – 28 February 2021

Project no.: IP-2016-06-2141

At a time when the concept of the dictionary has become nearly equivalent to that of the (usually online) e-dictionary, and the concept of lexicography to that of e-lexicography, Croatia still belongs to the ever-smaller number of countries that have neither a freely available online dictionary of the national language based on the findings of modern e-lexicography nor systematic scientific research in this area. The basic goal of this project is to change this situation in both respects.

The Croatian Web Dictionary – Mrežnik project aims to create a free, monolingual, easily searchable hypertext online dictionary of the Croatian standard language with 10,000 entries. Entries, entry-words, and sub-entries are interconnected and are also linked with entries in databases created within the framework of the project in parallel with the dictionary (a language advice database, a conjunction database with descriptions of conjunction groups and their modifications, a database of explanations of the origins of idioms, and a database of ethnics and ktetics), as well as with databases being created by project collaborators or other Institute members within the framework of other projects. In addition to basic definitions, the dictionary also includes definitions for children (3,000) and definitions for non-native speakers (1,000). The dictionary is based on the Croatian Web Repository online corpus (http://riznica.ihjj.hr/index.hr.html) and the hrWaC Croatian web corpus (http://nlp.ffzg.hr/resources/corpora/hrwac/). In addition to these sources, all other available print and web sources are taken into account in writing definitions and providing examples and meanings.

The dictionary is written in the TLex programme, which has been adapted to the needs of the project. Data are extracted from the corpora with the SketchEngine web tool, which displays the context of a lexeme through word sketches (WordSketches), lists the strongest collocations sorted into syntactic categories, and helps find good examples of the use of words or collocations. After lexicographic processing is completed, the data will be exported from TLex both to the web application and to the repository of the European research infrastructure CLARIN (the clarin.si repository and the github.com public data management system). This will make Mrežnik available both for use via the web application and for machine processing by downloading the data from the CLARIN repository.

The project goals and objectives are:

1. the creation of a basic dictionary of 10,000 entries (with accented entry-words and accented forms in the grammatical block systematically established for each word class; differentiated definitions [for adult native speakers, for children, and for non-native speakers]; examples; antonyms and synonyms; sub-entries; and phrases) with broad collocation descriptions

2. the creation of the databases compiled in parallel with the dictionary and their linking with the basic dictionary: a language advice database (300 pieces of advice), a conjunction database with descriptions of conjunction groups and their modifications (for all conjunctions in the dictionary), a database of explanations of the origins of idioms (50 idioms), a database of ethnics and ktetics (300 ethnics and ktetics), etc.

3. connecting the basic dictionary with other web sources currently being created at the Institute of Croatian Language and Linguistics: the e-Glava verb valency database, the Kolokacijska baza hrvatskoga jezika database of collocations, the Struna database of Croatian professional terminology, and the Jezični savjetnik and Bolje je hrvatski websites

4. the creation of a reverse dictionary based on the Mrežnik word list

5. intensive scientific study of the possibilities and applications of e-lexicography, training of project collaborators, and dissemination of the acquired knowledge (among other things, an e-lexicography conference is planned for year 3 of the project, as are the publication of e-proceedings of the conference and the creation of an e-monograph on the Croatian Web Dictionary), as well as a significant contribution to this field, which does not have the place it deserves in Croatian scholarship.

It is important to note that the Croatian Web Dictionary – Mrežnik is conceived as a dynamic dictionary that can be further compiled and edited even after the end of the project.

Word list

The frequency lists of hrWaC (first 12,000 words) and the Hrvatska jezična riznica (first 10,000 words) were compared; all words present in Hrvatska jezična riznica but not in hrWaC were extracted, their frequency was multiplied by four, and they were added to the shared list. (The alphabetical list also has a part-of-speech [POS] filter, so that, e.g., only verbs can be displayed.) This word list (first 8,000 entries) was then compared with two separate word lists: the word list for the module for children (excerpted from textbooks for the first four grades of elementary school, with some additions by the Mrežnik collaborators) and the word list for the module for non-native speakers (1,000 words taken from a list in textbooks for non-native speakers). Rare entries found in these two lists, which partially overlap, were added in order to ensure that the words in both lists also appear in the list for adult native speakers. The word list was further supplemented with male/female and aspectual pairs, possessive and descriptive adjectives, adverbs derived from adjectives in the list, nouns ending in -ost derived from adjectives in the list, numerous grammatical and semantic groups, etc. This resulted in a word list of 10,000 words with two separate word lists of 3,000 words (for children) and 1,000 words (for non-native speakers).
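The merging procedure described above can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function names, the toy frequencies, and the reading that the shared list starts from the hrWaC list and is then ranked by frequency are all assumptions made for illustration.

```python
def merge_frequency_lists(hrwac, riznica, riznica_weight=4):
    """Combine two {word: frequency} lists.

    Assumption: the shared list starts from the hrWaC list; words found
    only in the Riznica list are added with their frequency multiplied
    by `riznica_weight` (four in the description above).
    """
    merged = dict(hrwac)
    for word, freq in riznica.items():
        if word not in hrwac:
            merged[word] = freq * riznica_weight
    return merged


def build_word_list(merged, children, non_native, top_n):
    """Keep the `top_n` most frequent words, then append any word from the
    two module lists that is still missing (hypothetical helper)."""
    # Rank by descending frequency; break ties alphabetically.
    ranked = sorted(merged, key=lambda w: (-merged[w], w))
    word_list = ranked[:top_n]
    present = set(word_list)
    for word in list(children) + list(non_native):
        if word not in present:
            word_list.append(word)
            present.add(word)
    return word_list


# Toy data, invented for the example.
hrwac = {"biti": 900, "kuća": 120, "pas": 80}
riznica = {"biti": 300, "kuća": 40, "perivoj": 25}

merged = merge_frequency_lists(hrwac, riznica)
word_list = build_word_list(
    merged,
    children=["pas", "lopta"],
    non_native=["kuća", "hvala"],
    top_n=3,
)
print(word_list)
```

In this toy run, perivoj enters the shared list with a weighted frequency of 100, and the module words missing from the top three (pas, lopta, hvala) are appended so that both module lists are fully covered.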

The word list of the module for children was taken as the basic word list (the majority of the 1,000 entries for non-native speakers were also included in the module for children), and we began by compiling the entries for words from this list in order to make processing for the module for children as compatible as possible with that for adult native speakers.

It is important to note that Mrežnik is not conceived as a finished dictionary and will not reach its full extent if it does not continue to grow; thus, we do not consider the choice of words for the first Mrežnik word list especially important. Nevertheless, we set out the guidelines that we followed in compiling the word list.

Normative dictionary

The Croatian Web Dictionary – Mrežnik is a normative dictionary. Its normative nature is apparent in the following:

1. the selection of entry-words and referencing. The word list is not established automatically from the overlap of the frequency lists of the Hrvatska jezična riznica and hrWaC; in composing it, we took into account that Mrežnik is a corpus-based and not a corpus-driven dictionary. Because of their frequency, some entry-words were included that are not part of the Croatian standard language or are not recommended in it because they belong to its colloquial register. The entries for these words are fully compiled (definitions, meanings, examples, collocations, etc.), but they include a normative note explaining why the word (or phrase) is not part of the standard language. This normative note also appears in the entry of the synonymous word that does belong to the standard language (e.g. in the entries for both stomatologica and stomatologinja, EN. 'female dentist').

2. the accentuation of entry-words (a norm based on neo-Štokavian accentuation rules, i.e. forms do not have accents on the last syllable or falling accents in the middle of the word)

3. the selection of forms in the grammatical block (only forms accepted by the norm are provided)

4. the selection of examples (we try to select examples with no language errors; examples that do contain errors are edited)

5. giving linguistic advice in all three modules

6. systematic treatment: entry-words are accentuated systematically, forms in the grammatical block are selected and accentuated systematically, definitions of words that belong to closed grammatical and semantic groups are composed in a similar way, etc. For this reason, lists of grammatical and semantic groups with their entry-words were compiled.
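The accentuation rule in item 2 can be expressed as a small check. The sketch below is only an illustration: the representation of a form (syllable count, zero-based index of the accented syllable, accent type), the function name, and the accent labels are assumptions, and exempting monosyllabic words from the last-syllable rule follows the usual formulation of the neo-Štokavian rules rather than anything stated in this text.

```python
FALLING = {"short_falling", "long_falling"}
RISING = {"short_rising", "long_rising"}


def accent_violations(syllable_count, accent_index, accent_type):
    """Return a list of rule violations for one accented form.

    `accent_index` is the zero-based position of the accented syllable.
    """
    if accent_type not in FALLING | RISING:
        raise ValueError(f"unknown accent type: {accent_type}")
    if not 0 <= accent_index < syllable_count:
        raise ValueError("accent index outside the word")

    problems = []
    # No accent on the last syllable (monosyllables are exempt: assumption).
    if syllable_count > 1 and accent_index == syllable_count - 1:
        problems.append("accent on the last syllable")
    # Falling accents are allowed only on the first syllable.
    if accent_type in FALLING and accent_index > 0:
        problems.append("falling accent outside the first syllable")
    return problems


# vòda: two syllables, short rising accent on the first -> no violations
print(accent_violations(2, 0, "short_rising"))
# colloquial televȋzija: long falling accent on the third of five syllables
print(accent_violations(5, 2, "long_falling"))
```

A check of this kind could flag forms for review by a lexicographer; it does not decide which accent is correct.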

A corpus-based dictionary

Mrežnik is a corpus-based dictionary, not a corpus-driven dictionary. This means that the corpus and all data extracted from it serve only as guidelines. The glossary on our website ihjj.hr/mreznik defines a corpus-based dictionary as follows: a dictionary for which the lexicographer uses a corpus, but can freely decide what should be included in the dictionary, allowing the dictionary to be supplemented with words from other sources if necessary, as well as collocations and meanings not attested in the corpus.

It follows that, in composing a dictionary entry, meanings can be added to a particular entry, or what we know to be common and representative can be added to the collocation field regardless of word sketches and corpus attestations; we make informed choices from word sketches in order to provide collocations that are representative of the Croatian standard language, and not of the corpus.

In choosing collocations, we take into account all features of the corpus and evaluate the suitability of each collocation for inclusion in Mrežnik regardless of its degree of attestation in the corpus (e.g. the collocations brkata konobarica EN. 'moustachioed waitress', sisata konobarica EN. 'large-breasted waitress', and silovana konobarica EN. 'raped waitress' would not be suitable collocations for Mrežnik).

We also take examples from the corpus; if needed, we edit them by correcting obvious mistakes in accordance with the Hrvatski pravopis orthography manual. Of course, we endeavor to find examples that require no intervention.

We make an effort to be systematic in compiling the entries of Mrežnik (this includes, e.g., similar definitions for male/female pairs). However, this is not done at all costs, e.g. when the approach is not supported by corpus data: brodski kuhar or trkaći inženjer will not be given the female pairs brodska kuharica or trkaća inženjerka, because these forms are not attested on the Internet. By contrast, glavni kuhar will be supplemented with glavna kuharica, as there are many attestations in hrWaC and we know this to be common.

In addition to word sketches, dozens of 'pages' of results in hrWaC and Riznica must also be reviewed, as this often yields more typical collocations than those found in the sketches. For example, the sketches for kuharica EN. 'female chef' do not include glavna kuharica EN. 'head female chef', but the collocation is found if one searches the corpus directly for glavna kuharica.
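The idea of checking candidate collocations directly against the corpus, rather than relying on word sketches alone, can be sketched as follows. This is not the SketchEngine interface: the corpus is a hypothetical list of plain sentences, matching is on exact word forms without lemmatization, and the function names and the sketch list are invented for the example.

```python
def count_phrase(sentences, phrase):
    """Count case-insensitive occurrences of `phrase` as adjacent tokens."""
    target = [t.lower() for t in phrase.split()]
    n = len(target)
    hits = 0
    for sentence in sentences:
        tokens = [t.lower() for t in sentence.split()]
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                hits += 1
    return hits


def missing_from_sketch(sentences, candidates, sketch_collocations, min_hits=1):
    """Return candidates attested in the corpus but absent from the sketch."""
    sketch = {c.lower() for c in sketch_collocations}
    return [
        c for c in candidates
        if c.lower() not in sketch and count_phrase(sentences, c) >= min_hits
    ]


# Toy corpus, invented for the example (no punctuation, for simplicity).
corpus = [
    "Glavna kuharica priprema ručak",
    "Nova glavna kuharica dolazi sutra",
    "Brodski kuhar radi na brodu",
]

print(count_phrase(corpus, "glavna kuharica"))
print(missing_from_sketch(
    corpus,
    candidates=["glavna kuharica", "brodska kuharica"],
    sketch_collocations=["dobra kuharica"],
))
```

Because Croatian is highly inflected, a real search would need to match lemmas rather than word forms; the exact-form matching here is a simplification.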

We also sometimes take examples from the Internet.