Croatian Collocation Database

Associates: dr. sc. Goranka Blagus Bartolec, dr. sc. Barbara Kovačević, dr. sc. Ivana Kurtović Budja, dr. sc. Ivana Matas Ivanković, Dr. phil. Stefan Rittgasser.
Internet and computer support: Vedran Cindrić

The Croatian Collocation Database project (started in June 2014) has been conceived as a dynamic (upgradeable) dictionary of Croatian word combinations that will be populated and processed in a relational database.

The project is based on an extensive corpus that was collected and processed by Dr. Stefan Rittgasser, who worked at the University of Heidelberg and the University of Mannheim. A sample database has been published on lingua-hr.de.

The current structure of the Croatian Collocation Database project was built on the basis of the following sources: Croatian daily, weekly, and monthly newspapers from 1998 to today; various online sources; contemporary Croatian lexicographical manuals (dictionaries, lexicons, encyclopedias); Narodne novine online (the official gazette of the Republic of Croatia); 10 recent linguistic journals with articles on word combinations in Croatian.

The collocation database will make it possible to explore how words combine at the syntagmatic level. Each word combination in the database will be marked with a special tag according to its lexical and semantic features (as a lexical or grammatical collocation, a phraseme/idiom, a free combination, a term, a pragmeme, or a proverb). Each collocation will be illustrated with examples of its use in everyday language, in journalistic/marketing, conversational, literary, or scientific style.

In addition to listing word combinations, the database will allow different types of data searches and will serve as a starting point for finding data useful for scientific research: synonymy, antonymy, polysemy, stylistic use of words, word formation, morphology, phraseology, terminology.

The main goal of the project is to compile a large list of word combinations in Croatian, with descriptions of their main grammatical and semantic features and of their suitability for use in a particular communicative context.

Data for the collocation database was entered and processed in Microsoft Access. Data from Access will be transferred into a publicly available database on the website of the Institute of Croatian Language and Linguistics.

The main working Microsoft Access table contains nine columns:
1. Headword/Entry
2. Part of speech / Word class (only for homographic headwords/entries)
3. Order of meaning (if the word is polysemous)
4. Text (the key part of the database, containing the list of word combinations)
5. Synonym (single- or multi-word synonym of the word combination in the Text column)
6. Label (symbol for the type of word combination: phraseme/idiom, fixed phrase (collocation or term), proverb, or free combination (no symbol))
7. Subject-field labels for each word combination (e.g. Medicine, Physics, Architecture) or Usage labels for each word combination (non-standard use, official or formal use, informal use, jargon/slang use, literary use, advertising use)
8. Exclamation mark (only in the first phase of the project, if the combination is unclear or rare)
9. Source (if the word combination is taken from another lexicographic manual)
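As a rough illustration of the nine-column structure, one row could be modelled as follows. This is only a sketch: the class name, field names, and the helper function are hypothetical, and the example values are invented for illustration and are not taken from the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollocationRecord:
    """One row of the working table. All field names are assumptions."""
    headword: str                        # 1. Headword/Entry
    part_of_speech: Optional[str]        # 2. only for homographic headwords
    sense_order: Optional[int]           # 3. only if the word is polysemous
    text: str                            # 4. the word combination itself
    synonym: Optional[str]               # 5. single- or multi-word synonym
    label: Optional[str]                 # 6. None stands for "no symbol"
    field_or_usage_label: Optional[str]  # 7. e.g. "Medicine" or "informal use"
    needs_review: bool                   # 8. exclamation mark (unclear or rare)
    source: Optional[str]                # 9. other lexicographic manual, if any


def is_free_combination(record: CollocationRecord) -> bool:
    """A row without a label symbol is treated as a free combination."""
    return record.label is None


# Invented example row; the label value is illustrative only.
example = CollocationRecord(
    headword="ljubav",
    part_of_speech=None,
    sense_order=None,
    text="ljubav na prvi pogled",
    synonym=None,
    label="S",
    field_or_usage_label=None,
    needs_review=False,
    source=None,
)

print(example.text, is_free_combination(example))
```

The optional fields mirror the conditions in the column list: a part of speech is stored only for homographic headwords, and an order of meaning only for polysemous words.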

Since July 2015, prepared data from the working Access files have been publicly available in the test Internet Database as hypertext (for the Croatian letters L, LJ, M, Š and V). The publicly available database initially contains only four of the nine columns: 1. Headword/Entry, 2. Part of speech / Word class (only for homographic headwords/entries), 4. Text, and 6. Label. It will later be extended with the remaining columns of the working Access files.
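The reduction from the working columns to the four public columns can be pictured as a simple projection. The sketch below assumes each working row is a dictionary; the key names, the function name, and the sample row are hypothetical and do not describe the actual export procedure.

```python
# Columns 1, 2, 4 and 6 of the working table (key names are assumptions).
PUBLIC_COLUMNS = ("headword", "part_of_speech", "text", "label")


def to_public_row(working_row: dict) -> dict:
    """Keep only the columns shown in the public test database."""
    return {name: working_row.get(name) for name in PUBLIC_COLUMNS}


# Invented working row with all nine columns, for illustration only.
working_row = {
    "headword": "ljubav",
    "part_of_speech": None,
    "sense_order": None,
    "text": "ljubav na prvi pogled",
    "synonym": None,
    "label": "S",
    "field_or_usage_label": None,
    "needs_review": False,
    "source": None,
}

public_row = to_public_row(working_row)
print(public_row)
```

Adding a column to the public database later would then only require extending `PUBLIC_COLUMNS`.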

This project is primarily based on traditional lexicographic and lexicological principles, so the main plan is to bring together in one database the most common Croatian multiword lexical units, defining their semantic types and contexts of use.

Symbols in the Internet Database:

1. Part of speech

Symbol  Value
a       adjective
b       number
c       conjunction
d       adverb
p       preposition
re      reflexive
s       noun
t       particle
u       interjection
v       verb
z       pronoun

2. Designation

Symbol       Value
F            idiom
E            idiom in context
P            proverb
S            fixed phrase (term or collocation)
(no symbol)  free combination
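
For anyone processing exported data, the two symbol tables above can be expressed as lookup tables. This is a minimal sketch: the dictionary and function names are hypothetical, and treating an empty or missing designation as "free combination" is an assumption based on the "no symbol" row.

```python
from typing import Optional

# 1. Part of speech (lowercase symbols)
PART_OF_SPEECH = {
    "a": "adjective",
    "b": "number",
    "c": "conjunction",
    "d": "adverb",
    "p": "preposition",
    "re": "reflexive",
    "s": "noun",
    "t": "particle",
    "u": "interjection",
    "v": "verb",
    "z": "pronoun",
}

# 2. Designation (uppercase symbols)
DESIGNATION = {
    "F": "idiom",
    "E": "idiom in context",
    "P": "proverb",
    "S": "fixed phrase (term or collocation)",
}


def describe_part_of_speech(symbol: str) -> str:
    """Return the word class for a part-of-speech symbol."""
    return PART_OF_SPEECH[symbol]


def describe_designation(symbol: Optional[str]) -> str:
    """Return the designation; no symbol means a free combination."""
    if not symbol:
        return "free combination"
    return DESIGNATION[symbol]


print(describe_part_of_speech("s"))  # noun
print(describe_designation("S"))     # fixed phrase (term or collocation)
print(describe_designation(None))    # free combination
```

Lookups are case-sensitive on purpose, since "s" (noun) and "S" (fixed phrase) belong to different tables.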