About Mrežnik

Host institution: Institute for the Croatian Language

Project leader: Dr. Marijana Horvat

Project phase 1 (Croatian Science Foundation [HRZZ] research project IP-2016-06-2141; host institution: Institute of Croatian Language and Linguistics): 1 March 2017 to 31 July 2021

Project phase 2 (internal project of the Institute of Croatian Language and Linguistics): from 1 January 2022

The Croatian Web Dictionary – Mrežnik (Hrvatski mrežni rječnik – Mrežnik) project has two main goals:

to compile a freely available, monolingual, hypertext, three-module, easily searchable, growing online dictionary of the Croatian standard language
to encourage systematic work on e-lexicography in Croatia.

In the first phase of the project, all goals set out in the project application and work plans were met, and in most cases considerably exceeded:

word sketches for the Croatian language were created
a dictionary was compiled (comprising three dictionary modules: the basic module with 10,000 entries, the module for pupils in the lower grades of elementary school with 3,000 entries, and the module for learners of Croatian as a second or foreign language with 1,000 entries), with accented entry-words and accented forms in the grammatical block systematically established for each word class, recorded pronunciation for 3,000 entry-words, differentiated definitions, examples, collocations, and idioms, data on antonyms, synonyms, hyponyms, and meronyms, links to external sources, and illustrations
the following databases were created and linked with the dictionary entries: a database of language advice for pupils in the lower grades of elementary school, a conjunction database, a database of explanations of the origins of idioms, and a database of ethnics and ktetics; in the second project phase, work began on a database of etymological data
the dictionary was gamified
the dictionary was linked with other online sources being created at the Institute of Croatian Language and Linguistics: e-Glava, Kolokacijska baza hrvatskoga jezika, the Struna Croatian professional terminology database and Hrvatski terminološki portal, the Jezični savjetnik and Bolje je hrvatski websites, Hrvatski pravopis and Hrvatska školska gramatika, Pojmovnik koronavirusa, and papers from the journal Hrvatski jezik
a reversed dictionary was created on the basis of the Mrežnik word list
a working version of an e-monograph on the Croatian Web Dictionary was produced
lasting cooperation was established with the Department of Information and Communication Sciences of the Faculty of Humanities and Social Sciences, University of Zagreb (lectures, student internships), and with a number of other projects and institutions
intensive work was done on the scientific study of the possibilities and applications of e-lexicography, the training of project collaborators, and the dissemination of project results (among other things, the conference E-rječnici i e-leksikografija – eReL was organized in year 3 of the project, an e-volume of conference proceedings was published, the round table Rječnik i stereotipi was held, and the papers from that round table were published).

Within the project, great attention is paid to young researchers. Through the Foundation's call Projekt razvoja karijera mladih istraživača – izobrazba novih doktora znanosti (Young Researchers' Career Development Project – Training of New Doctoral Students), two young researchers were employed on the project, Josip Mihaljević and Maja Matijević, whose work HRZZ rated with the highest grades. As part of the Mrežnik project, doctoral student Josip Mihaljević wrote and defended the doctoral thesis Konceptualni okvir igrifikacije hrvatskoga mrežnog rječnika (A Conceptual Framework for the Gamification of the Croatian Web Dictionary).

At the end of each one-year project period, the evaluators of the Croatian Science Foundation gave the project the highest grade. The overall work in the first project phase was likewise given the highest grades by both the evaluators and the Board of the Croatian Science Foundation.

From the final assessment of project phase 1 by the evaluators and the Board of HRZZ:

The project as a whole has fully earned an excellent overall grade of A, having received that same grade in every period from both the evaluators and the independent expert. Credit for this certainly goes to the project leader, the whole research team, and the excellent planning and execution of project activities throughout the project, as well as the careful planning and spending of project funds.

Over the five years of the project, the set goals were achieved in every project phase (with minor deviations and shifts due to pandemic restrictions), and particular care was taken over the continuous dissemination of project results, the education and training of doctoral students on the project, and the rational spending of the allocated funds. The notable efforts of the project leader to ensure that the project goals were met and exceeded despite the restrictions, and that all project results were presented in an orderly and meticulous manner, also deserve mention.

Special mention should be made of the defence of a doctoral dissertation by a doctoral student on the project, on a topic arising directly from the project.

A significant impact is expected on Croatian studies in general, and especially on new standardological work.

Although the project belongs to Croatian studies, good international collaborations were established.

Mrežnik is one of the few lexicographic projects (not only in Croatia) that managed to achieve its set project goals in five years and to complete the planned lexicographic processing within a reasonable scope.

Institution: Institute for the Croatian Language

Project leader: Dr. Marijana Horvat

Project term: 1 March 2017 - 28 February 2021

Project no.: IP-2016-06-2141

In a time in which the concepts of dictionary and (online) e-dictionary have become nearly equivalent, as have the concepts of lexicography and e-lexicography, Croatia still belongs to the ever-smaller number of countries with no free online national language dictionary founded on modern e-lexicography, nor has systematic scientific research been carried out in this area. The basic goal of this project is to change this in both of the aforementioned aspects.

The Croatian Web Dictionary – Mrežnik project aims at creating a free, monolingual, easily searchable hypertext online dictionary of the Croatian standard language with 10,000 entries. Entries, entry-words, and sub-entries are interconnected and are also linked with entries in databases created within the framework of the project in parallel with the creation of the dictionary (language advice database, conjunction database with description of groups of conjunctions and their modifications, database of explanations of the origins of idioms, database of ethnics and ktetics), as well as with databases being created by project collaborators or other Institute members within the framework of other projects. In addition to basic definitions, the dictionary also includes definitions for children (3,000) and definitions for non-native speakers (1,000). The dictionary is based on the Croatian Web Repository online corpus (http://riznica.ihjj.hr/index.hr.html) and the hrWaC Croatian web corpus (http://nlp.ffzg.hr/resources/corpora/hrwac/). In addition to these sources, all other available print and web sources are taken into account in writing definitions and providing examples and meanings.

The dictionary is written in the TLex programme, which has been adapted to the needs of the project. Data extraction from the corpora is performed with the SketchEngine web tool, which allows the display of lexeme context through WordSketches, the most common collocations sorted into syntactic categories and the discovery of good examples of word usage or collocations. After lexicographic processing is completed, the data will be exported from TLex to both the web application and the CLARIN European science infrastructure repository (clarin.si repository and the github.com public data management system). This will make Mrežnik available for use both via a web application and for machine implementation by downloading data from the CLARIN repository.

The project goals and objectives are:

1. the creation of a basic dictionary of 10,000 entries (with accented entry-words and forms in the grammatical block systematically established for each word class, sorted definitions [for adult native speakers, for children, and for non-native speakers], examples, antonyms and synonyms, sub-entries, and phrases), and broad collocation descriptions.

2. connecting the basic dictionary with the databases created in parallel with dictionary processing: a linguistic advice database (300 pieces of advice), a conjunction database with descriptions of conjunction groups and modifications (for all conjunctions in the dictionary), a database of idioms (50 idioms), a database of ethnics and ktetics (300 ethnics and ktetics), etc.

3. connecting the basic dictionary with other web sources currently being created at the Institute of Croatian Language and Linguistics – e-Glava verb valency database, Kolokacijska baza hrvatskoga jezika database of collocations, the Struna Croatian professional terminology database, and the Jezični savjetnik and Bolje je hrvatski websites

4. the creation of a reversed dictionary based on the Mrežnik word list

5. scientific study of the possibilities and applications of e-lexicography, training project collaborators, and disseminating acquired knowledge (among other things, an e-lexicography conference is envisioned in year 3 of the project, as is the publishing of conference proceedings and the creation of an e-monograph on the Croatian Web Dictionary), as well as a significant contribution to this field, which does not have the place it deserves in the Croatian sciences.
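Goal 4, the reversed dictionary, can be illustrated with a minimal sketch. A reversed dictionary orders words by their spelling read from the last letter to the first, so words with the same ending are grouped together. The function names and the sample words below are illustrative assumptions, not the project's implementation, and the sketch compares single letters only: the Croatian digraphs dž, lj, and nj are treated as two letters each, which a full implementation would need to handle.

```python
# Hypothetical sketch: order a word list by reversed spelling.
# Assumption: single-letter Croatian alphabet order; digraphs (dž, lj, nj)
# are simplified to sequences of two letters.
CROATIAN_ORDER = "abcčćdđefghijklmnoprsštuvzž"
_RANK = {ch: i for i, ch in enumerate(CROATIAN_ORDER)}


def reverse_key(word):
    """Sort key: ranks of the letters read from the end of the word."""
    # Characters outside the alphabet sort after all known letters.
    return [_RANK.get(ch, len(_RANK)) for ch in reversed(word.lower())]


def reversed_dictionary(words):
    """Return the words ordered by their reversed spelling."""
    return sorted(words, key=reverse_key)


if __name__ == "__main__":
    sample = ["kuhar", "kuharica", "konobarica", "riječ", "konobar"]
    for word in reversed_dictionary(sample):
        print(word)
```

With this ordering, words ending in -ica appear next to one another, as do words ending in -ar, which is what makes a reversed word list useful for studying suffixes.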

It is important to note that the Croatian Web Dictionary – Mrežnik is conceived as a dynamic dictionary that can be further compiled and edited even after the end of the project.

Word list

The frequency lists of hrWaC (first 12,000 words) and the Hrvatska jezična riznica (first 10,000 words) were overlapped, all words present only in Hrvatska jezična riznica and not present in hrWaC were extracted, their frequency was multiplied by four, and they were added to the shared list. (The alphabetical list also has a word type filter [POS], allowing e.g. only verbs to be filtered.) This word list (first 8,000 entries) was juxtaposed with two separate word lists: the word list for the module for children (which was excerpted from textbooks for the first four grades of elementary school with some additions by the collaborators of Mrežnik) and the word list for the module for non-native speakers (which includes 1,000 words taken from a list in textbooks for non-native speakers). Rare entries found in these two lists, which partially overlap, were added in order to ensure that the words found in them also appear in the list for adult native speakers. This word list was supplemented with male/female and aspectual pairs, possessive and descriptive adjectives, adverbs derived from adjectives from the list, nouns ending in -ost derived from adjectives from the list, numerous grammatical and semantic groups, etc. This resulted in a word list of 10,000 words with two separate word lists of 3,000 words (for children) and 1,000 words (for non-native speakers).
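The first step described above, overlapping the two frequency lists, might be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary-based input format, and the sample frequencies are hypothetical, and the sketch assumes that words present in both lists keep their hrWaC frequency, which the description does not specify.

```python
# Hypothetical sketch of merging two corpus frequency lists.
# Assumptions: inputs map word -> frequency; words in both lists keep
# the hrWaC frequency; Riznica-only words are scaled by a factor of four.
def merge_frequency_lists(hrwac, riznica, hrwac_top=12000,
                          riznica_top=10000, scale=4):
    """Return (word, frequency) pairs of the shared list, most frequent first."""

    def top(freqs, n):
        # Keep the n most frequent words; ties are broken alphabetically.
        ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
        return dict(ranked[:n])

    hrwac_head = top(hrwac, hrwac_top)
    riznica_head = top(riznica, riznica_top)

    merged = dict(hrwac_head)
    for word, freq in riznica_head.items():
        if word not in hrwac_head:
            # Present only in Riznica: multiply the frequency and add it.
            merged[word] = freq * scale
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Invented toy frequencies, for illustration only.
    hrwac = {"i": 100, "kuća": 50, "pas": 10}
    riznica = {"i": 30, "kuća": 12, "knjiga": 5}
    print(merge_frequency_lists(hrwac, riznica, hrwac_top=3, riznica_top=3))
```

In the toy data, knjiga occurs only in the Riznica list, so its frequency of 5 becomes 20 and it enters the shared list ahead of pas. The later steps (juxtaposition with the children's and non-native speakers' lists, and the manual additions) are editorial and are not modelled here.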

The word list of the module for children was considered as the basic word list (the majority of the 1,000 entries for non-native speakers were also included in the module for children) and we first began compiling the entries for words from this list in order to make processing for the module for children as compatible as possible with that for adult native speakers.

It is important to note that Mrežnik is not conceived as a finished dictionary and will not reach its full extent if it does not continue to grow; thus, we do not consider the choice of words for the first Mrežnik word list especially important. However, we do set out the guidelines that we followed in compiling the word list.

Normative dictionary

The Croatian Web Dictionary – Mrežnik is a normative dictionary. Its normative nature is apparent in the following:

1. the selection of entry-words and referencing (the word list is not established automatically on the basis of a list created from the overlap of the frequency lists of the Hrvatska jezična riznica and hrWaC; instead, in composing the word list, the fact that Mrežnik is a corpus-based dictionary and not a corpus-driven dictionary was taken into account; due to their frequency, entry-words were included that are not a part of the Croatian standard language or are not recommended in the standard language as they belong to its colloquial register; the dictionary entries for these entry-words [definitions, meanings, examples, collocations, etc.] were compiled, but a normative note has been included explaining why the word or phrase is not part of the standard language; this normative note also appears in the entry of synonymous words that do belong to the standard language, e.g. for both stomatologica and stomatologinja, EN. 'female dentist')

2. the accentuation of entry words (a norm based on neo-Štokavian accentuation rules, i.e. forms don't have accents on the last syllable or descending accents in the middle of the word)

3. the selection of forms in the grammatical block; this provides only forms acceptable by the norm

4. the selection of examples (we try to select examples with no language errors while examples with language errors are edited)

5. giving linguistic advice in all three modules

6. the systematic approach: entry-words are accented systematically, the forms provided in the grammatical block are selected and accented systematically, the definitions of words that belong to closed grammatical and semantic groups are composed in a similar way, etc. For this reason, lists of grammatical and semantic groups with lists of entry-words were composed.

A corpus-based dictionary

Mrežnik is a corpus-based dictionary, not a corpus-driven dictionary. This means that the corpus and all data extracted from it serve only as guidelines. The glossary on our website ihjj.hr/mreznik defines a corpus-based dictionary as follows: a dictionary for which the lexicographer uses a corpus, but can freely decide what should be included in the dictionary, allowing the dictionary to be supplemented with words from other sources if necessary, as well as collocations and meanings not attested in the corpus.

It follows that, in composing a dictionary entry, meanings can be added to a particular entry, or what we know to be common and representative can be added to the collocation field regardless of word sketches and corpus attestations; we make informed choices from word sketches in order to provide collocations that are representative of the Croatian standard language, and not of the corpus.

In choosing collocations, we take into account all features of the corpus and evaluate their suitability for inclusion in the collocations in Mrežnik regardless of their degree of attestation in the corpus (e.g. the collocations brkata konobarica EN. 'moustachioed waitress', sisata konobarica EN. 'large-breasted waitress', silovana konobarica EN. 'raped waitress' would not be suitable collocations for Mrežnik).
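The difference between a corpus-driven and a corpus-based selection of collocations can be shown with a small sketch. Everything in it is a hypothetical illustration: the function names, the scores, and the sample collocations other than brkata konobarica are invented, and the lexicographer's judgment is reduced to two explicit lists (rejected and added).

```python
# Hypothetical sketch contrasting corpus-driven and corpus-based selection.
# Assumption: candidates map collocation -> corpus salience score.
def corpus_driven(candidates, top_n):
    """Take the top_n collocations by corpus score, with no editorial input."""
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [collocation for collocation, _ in ranked[:top_n]]


def corpus_based(candidates, top_n, rejected=(), added=()):
    """Use the corpus ranking as a suggestion that the editor can override."""
    rejected = set(rejected)
    ranked = corpus_driven(candidates, len(candidates))
    suggested = [c for c in ranked if c not in rejected][:top_n]
    # Collocations the editor knows to be common may be added even if they
    # are missing from the sketches or unattested in the corpus.
    return suggested + [c for c in added if c not in suggested]


if __name__ == "__main__":
    # Invented scores, for illustration only.
    candidates = {
        "brkata konobarica": 9,
        "ljubazna konobarica": 7,
        "mlada konobarica": 5,
    }
    print(corpus_driven(candidates, 2))
    print(corpus_based(candidates, 2,
                       rejected=["brkata konobarica"],
                       added=["glavna konobarica"]))
```

The corpus-driven function returns whatever ranks highest, while the corpus-based one lets the editor drop unsuitable candidates and add representative ones, which is the approach described for Mrežnik.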

We also take examples from the corpus; if needed, we edit them by correcting obvious mistakes in accordance with the Hrvatski pravopis orthography manual. Of course, we endeavor to find examples that require no intervention.

We make an effort to be systematic in compiling entries of Mrežnik (this includes e.g. similar definitions for male/female pairs). However, this is not done at all costs, e.g. if this approach is not supported by corpus data (e.g. brodski kuhar or trkaći inženjer will not have the female pairs brodska kuharica or trkaća inženjerka on the basis of the fact that there is no attestation of these forms on the Internet). However, glavni kuhar will be supplemented with glavna kuharica as there are many attestations in hrWaC and we know this to be common.

In addition to word sketches, dozens of 'pages' in hrWaC and Riznica must also be reviewed, as this often results in more typical collocations than those found in the sketches, e.g. sketches for kuharica EN. 'female chef' do not include glavna kuharica EN. 'head female chef'; however, this collocation is found if one directly searches the corpus for glavna kuharica.

We also sometimes take examples from the Internet.