Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources

Authors: Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović
Year: 2017
Venue: Trans. Computational Collective Intelligence 26: 162-185 (2017), Editors: Ngoc Thanh Nguyen, Ryszard Kowalczyk, Alexandre Miguel Pinto and Jorge S. Cardoso, LNCS 10190, ISBN 978-3-319-59267-1 (Print), 10.1007/978-3-319-59268-8_8, pp. 112-123, 2017
Link: http://www.springer.com/us/book/9783319592671
Product of the Action: Yes

Abstract:
Large collections of textual documents represent an example of big data that requires the solution of three basic problems: the representation of documents, the representation of information needs and the matching of the two representations. This paper outlines the introduction of document indexing as a possible solution to document representation. Documents within a large textual database developed for geological projects in the Republic of Serbia for many years were indexed using methods developed within digital humanities: bag-of-words and named entity recognition. Documents in this geological database are described by a summary report, and other data, such as title, domain, keywords, abstract, and geographical location. These metadata were used for generating a bag of words for each document with the aid of morphological dictionaries and transducers. Named entities within metadata were also recognized with the help of a rule-based system. Both the bag of words and the metadata were then used for pre-indexing each document. A combination of several tf-idf-based measures was applied for selecting and ranking of retrieval results of indexed documents for a specific query and the results were compared with the initial retrieval system that was already in place. In general, a significant improvement has been achieved according to the standard information retrieval performance measures, where the InQuery method performed the best.
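
The ranking step summarized in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a bag of lowercase lemmas, and it scores documents with a plain tf-idf sum. This is not the InQuery variant and not the authors' implementation; the function name `tf_idf_rank`, the document identifiers, and the sample terms are all invented for illustration.

```python
import math
from collections import Counter


def tf_idf_rank(query_terms, docs):
    """Rank documents (bags of words) against a query by summed tf-idf.

    docs maps a document id to its list of terms (hypothetical sample data).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for bag in docs.values():
        df.update(set(bag))

    scores = {}
    for doc_id, bag in docs.items():
        tf = Counter(bag)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(n_docs / df[term])
            # Length-normalized term frequency times inverse document frequency.
            score += (tf[term] / len(bag)) * idf
        scores[doc_id] = score
    # Highest score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Invented bags of words standing in for indexed document metadata.
docs = {
    "report_1": ["copper", "ore", "deposit", "bor"],
    "report_2": ["groundwater", "survey", "belgrade"],
    "report_3": ["copper", "copper", "exploration", "bor"],
}
ranking = tf_idf_rank(["copper", "exploration"], docs)
print(ranking)
```

In this toy collection the document that mentions both query terms ranks first and the one that mentions neither ranks last. A real system for a morphologically rich language would need the lemmatization and named entity steps described in the abstract before any such scoring.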