Croatian Collocation Database

Associates:

dr. sc. Goranka Blagus Bartolec, project leader (gblagus@ihjj.hr),

dr. sc. Barbara Kovačević, dr. sc. Ivana Kurtović Budja, dr. sc. Ivana Matas Ivanković, Dr. phil. Stefan Rittgasser (external associate)
Internet and computer support: Vedran Cindrić

Advisor for terms in computer and information sciences: dr. sc. Milica Mihaljević

 

The Croatian Collocation Database project (started in June 2014) has been conceived as a dynamic (upgradeable) dictionary of Croatian word combinations that will be populated and processed in a relational database.

Initially, the project drew on an extensive corpus collected and processed by Dr. Stefan Rittgasser, who worked at the University of Heidelberg and the University of Mannheim. A sample database has been published on lingua-hr.de.

The CCD is based on extensive data sources collected and processed for automatic detection. The current structure of the CCD was built on the following sources: (1) Croatian daily, weekly and monthly newspapers from 1998 to today; (2) various online sources; (3) contemporary Croatian lexicographical manuals (dictionaries, lexicons, encyclopaedias); (4) Narodne novine online (official gazette of the Republic of Croatia); (5) recent linguistic journals with articles on word combinations in Croatian; (6) hrWaC – Croatian web corpus (http://nlp.ffzg.hr/resources/corpora/hrwac/) and HNK – Croatian National Corpus (http://www.hnk.ffzg.hr/); (7) STRUNA – database of Croatian special field terminology (http://struna.ihjj.hr/).

The CCD is primarily based on traditional lexicographic and lexicological principles for describing multiword lexical units (Benson et al. 1997, Blagus Bartolec 2014, Mel’čuk 1998). The main plan is to bring together in one database the most common Croatian multiword lexical units, defining their semantic types and contexts of use. The database will be a useful source for inclusion in other, more advanced resources for multiword expressions (MWEs), both Croatian and international, and for the development of tools that permit the extraction of MWEs on the basis of their semantic and lexical features (Sag et al. 2002, Ramisch 2015).

The main challenge of the project is to distinguish several types of MWEs in Croatian according to their lexical and semantic features: idioms (e.g. ustati na lijevu nogu, lit. ‘get up on the left leg’, meaning ‘be in a bad mood’), multiword terms (e.g. zool. prednje noge ‘front legs’, stražnje noge ‘hind legs’), proverbs (e.g. u laži su kratke noge ‘lies have short legs / lies don’t travel far’), collocations (word combinations with a more restricted or specific meaning, e.g. vitke noge ‘slender legs’), and free combinations (combinations with freedom of selection and freedom of combination, e.g. prekrižiti noge ‘to cross (one’s) legs’).

A list of word combinations was entered in a separate file for each letter of the Croatian alphabet. Each combination is entered in the database under every word of the listed parts of speech (noun, verb, adjective, adverb) that forms part of it; e.g. the headwords labudov (‘swan’) and vrat (‘neck’) both contain the word combination labudov vrat (‘swan neck’). The collocation database follows the extended valence model: the basic form of a word combination is extended with the words (verbs, pronouns, nouns) with which it forms its most common syntactic environment. For example, the database contains both the canonical idiom od vrata do vrata (‘door to door’) and the canonical collocation lakša ozljeda (‘minor injury’), as well as their extended forms ići od vrata do vrata (‘to go door to door’) and pretrpjeti lakše ozljede (‘to suffer minor injuries’).
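The entry of one combination under several headwords, as described above, can be illustrated with a small Python sketch. It is only an illustration, not the project's actual tooling: the function name build_headword_index is hypothetical, and the headwords are supplied by hand here rather than derived by any lemmatiser.

```python
from collections import defaultdict


def build_headword_index(combinations):
    """Map each headword to every word combination listed under it.

    `combinations` is a list of (combination, headwords) pairs, where
    `headwords` holds the canonical forms under which the combination
    should appear (assumed to be assigned manually).
    """
    index = defaultdict(list)
    for text, headwords in combinations:
        for headword in headwords:
            # Keep one copy per headword even if a headword repeats.
            if text not in index[headword]:
                index[headword].append(text)
    return dict(index)


# Examples taken from the description above; headword assignment is illustrative.
combinations = [
    ("labudov vrat", ["labudov", "vrat"]),
    ("od vrata do vrata", ["vrata"]),
    ("ići od vrata do vrata", ["ići", "vrata"]),
]

index = build_headword_index(combinations)
for headword, texts in index.items():
    print(headword, "->", texts)
```

Under this sketch the canonical idiom and its extended form both end up under the headword vrata, while labudov vrat can be reached from either of its two headwords.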

The CCD contains eight columns: (1) Headword/Entry (the canonical form of the word: infinitive for verbs, nominative for nouns, masculine gender for adjectives and pronouns); (2) Part of speech / Word class (initially only for homographic headwords/entries; gradually all entries will be labelled with their part of speech); (3) Text (the key part of the database, containing the list of word combinations); (4) Synonyms and variants (single- or multi-word synonyms or variants of the word combination in the Text column); (5) Label (a symbol for the type of word combination: phraseme / idiom (F), collocation / fixed phrase or term (S), proverb (P) and free combination (no symbol)); (6) Subject-field labels for each word combination (e.g. Medicine, Physics, Architecture) or Usage labels for each word combination (non-standard use, official or formal use, informal use, jargon/slang use, literary use, advertising use); (7) Meaning (where necessary, the specific semantic context or context of use of a particular combination is explained); (8) Examples (sentences taken from the corpora that confirm the context in which a word combination is used).
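As a rough illustration of how the eight columns could map onto a relational table, the following Python sketch uses the standard sqlite3 module. The table name, the English column names and the helper combinations_by_label are assumptions made for this example; the actual CCD schema is not given in this description. The sample rows reuse examples from the text above, with the label symbols F, S and P and no symbol (NULL) for free combinations.

```python
import sqlite3

# Hypothetical table and column names; only the eight columns themselves
# come from the project description.
SCHEMA = """
CREATE TABLE entry (
    headword         TEXT NOT NULL,  -- (1) canonical form of the word
    part_of_speech   TEXT,           -- (2) word class, may be empty at first
    combination_text TEXT NOT NULL,  -- (3) the word combination
    synonyms         TEXT,           -- (4) synonyms and variants
    label            TEXT CHECK (label IN ('F', 'S', 'P') OR label IS NULL),  -- (5)
    field_or_usage   TEXT,           -- (6) subject-field or usage label
    meaning          TEXT,           -- (7) explanation where needed
    examples         TEXT            -- (8) corpus sentences
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

rows = [
    ("noga", None, "ustati na lijevu nogu", None, "F", None, "be in a bad mood", None),
    ("noga", None, "prednje noge", None, "S", "zool.", "front legs", None),
    ("noga", None, "vitke noge", None, "S", None, "slender legs", None),
    ("noga", None, "u laži su kratke noge", None, "P", None, None, None),
    ("noga", None, "prekrižiti noge", None, None, None, None, None),
]
conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def combinations_by_label(conn, headword, label):
    """Return combinations under a headword with the given label.

    Pass label=None to select free combinations (no symbol).
    SQLite's IS operator compares like = but also matches NULL with NULL.
    """
    cursor = conn.execute(
        "SELECT combination_text FROM entry "
        "WHERE headword = ? AND label IS ? ORDER BY combination_text",
        (headword, label),
    )
    return [row[0] for row in cursor]


print(combinations_by_label(conn, "noga", "F"))
print(combinations_by_label(conn, "noga", None))
```

Storing the label as a nullable column with a CHECK constraint is one possible way to keep the "no symbol" convention for free combinations while rejecting unknown symbols.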

 

References:

Baldwin, Timothy, Kim, Su Nam. 2009. Multiword expressions. Handbook of Natural Language Processing (2nd edition). Eds. Indurkhya, Nitin; Damerau, Fred J. CRC Press. Boca Raton: 1–39.

Benson, M., E. Benson, R. Ilson. 1997. The BBI dictionary of English word combinations. John Benjamins Publishing Co. Amsterdam – Philadelphia.

Blagus Bartolec, Goranka. 2014. Riječi i njihovi susjedi: Kolokacijske sveze u hrvatskom jeziku. Institut za hrvatski jezik i jezikoslovlje. Zagreb.

Firth, John R. 1957. A Synopsis of Linguistic Theory, 1930–1955. Studies in Linguistic Analysis. Special Volume. 1–32.

Ivir, Vladimir. 1992./1993. Kolokacije i leksičko značenje. Filologija 20–21. 181–189.

 

Mel’čuk, Igor. 1998. Collocations and Lexical Functions. Phraseology: Theory, Analysis and Applications. Ed. Cowie, Anthony Paul. Oxford University Press. Oxford – New York: 23–53.

Menac, Antica, Fink-Arsovski, Željka and Radomir Venturin. 2003. Hrvatski frazeološki rječnik. Naklada Ljevak. Zagreb.

Mihaljević, Milica. 1991. Višerječne natuknice i podnatuknice u jednojezičnom općem rječniku hrvatskoga jezika. Rasprave Instituta za hrvatski jezik i jezikoslovlje 17. 133–143.

Pintarić, Neda. 2002. Pragmemi u komunikaciji. Zavod za lingvistiku Filozofskoga fakulteta. Zagreb.

Pritchard, Boris. 1998. O kolokacijskom potencijalu rječničkog korpusa. Filologija 30–31. 285–304.

Ramisch, Carlos. 2015. Multiword Expressions Acquisition: A Generic and Open Framework. Theory and Applications of Natural Language Processing series XIV. Springer.

Sag, Ivan A., Baldwin, Timothy, Bond, Francis, Copestake, Ann and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. Lecture Notes in Computer Science 2276. 1–15.

Siepmann, Dirk. 2005. Collocation, Colligation and Encoding Dictionaries. Part II: Lexicographical Aspects. International Journal of Lexicography 19/1: 1–39.

Stojić, Aneta, Murica, Sanela. 2010. Kolokacije – teorijska razmatranja i primjena u praksi. Fluminensia 22/2: 111–125.

Škara, Danica. 1997. Glas tradicije. Ziral. Mostar – Zagreb.

Tafra, Branka. 2005. Od riječi do rječnika. Školska knjiga. Zagreb.

Turk, Marija. 2000. Višečlani izrazi s desemantiziranom sastavnicom kao nominacijske jedinice. Riječki filološki dani 3: Zbornik radova s Međunarodnoga znanstvenog skupa Riječki filološki dani. Ed. Stolac, Diana. Filozofski fakultet. Rijeka. 477–486.

Dictionaries:

HJP – Hrvatski jezični portal, http://hjp.znanje.hr/

Školski rječnik hrvatskoga jezika (Institut za hrvatski jezik i jezikoslovlje – Školska knjiga, Zagreb 2012.)

VRH – Veliki rječnik hrvatskoga standardnog jezika (Školska knjiga, Zagreb, 2015.)

Proleksis enciklopedija, http://proleksis.lzmk.hr/

Corpora:

hrWaC – Croatian web corpus (http://nlp.ffzg.hr/resources/corpora/hrwac/)

HNK – Croatian National Corpus (http://www.hnk.ffzg.hr/)