Indexing of textual databases based on lexical resources: A case study for Serbian

Authors: Stanković Ranka, Krstev Cvetana, Obradović Ivan, Kitanović Olivera
Year: 2015
Venue: Semantic Keyword-based Search on Structured Data Sources : First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers
Product of the Action: Yes

Keystone Members Authors:

In this paper, we describe an approach to improving information retrieval results for large textual databases by pre-indexing documents using bag-of-words and Named Entity Recognition. The approach was applied to a database of geological projects financed by the Republic of Serbia over the last half century. Each document in this database is described by metadata consisting of several fields, such as title, domain, keywords, abstract, geographical location and the like. A bag of words was produced from these metadata using morphological dictionaries and transducers, and named entities within the metadata were recognized using a rule-based system. Both were then used for indexing documents, and ranking was based on the tf-idf measure. Ranked retrieval results obtained with pre-indexing were compared, using Precision-Recall curves, to results obtained by information retrieval without pre-indexing, showing a significant improvement in terms of Mean Average Precision (MAP).
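The ranking and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each document has already been reduced to a bag of normalized terms (the paper obtains these with morphological dictionaries, transducers and rule-based Named Entity Recognition, none of which is reproduced here), it uses one common tf-idf variant (raw term frequency times log(N/df)), and the function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def build_index(docs):
    """Build term frequencies and idf weights.

    docs: dict mapping doc_id -> list of already-normalized terms
    (assumed to come from a separate pre-indexing step).
    """
    tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    n = len(docs)
    idf = {term: math.log(n / df_t) for term, df_t in df.items()}
    return tf, idf


def rank(query_terms, tf, idf):
    """Return doc ids with a positive tf-idf score, best first."""
    scores = {}
    for doc_id, counts in tf.items():
        score = sum(counts[t] * idf.get(t, 0.0) for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Ties are broken by doc id so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, set_of_relevant_doc_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical sample data, not taken from the paper's database.
docs = {
    "d1": ["geological", "exploration", "copper", "bor"],
    "d2": ["hydrogeological", "study", "belgrade"],
    "d3": ["copper", "deposit", "copper", "ore"],
}
tf, idf = build_index(docs)
ranking = rank(["copper"], tf, idf)
print(ranking)
print(average_precision(ranking, {"d1"}))
```

In this toy setting, comparing retrieval with and without pre-indexing would amount to calling `mean_average_precision` on two sets of runs over the same queries, one built from normalized terms and one from raw tokens.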