Exploring Multidimensional Continuous Feature Space to Extract Relevant Words

Authors: Márius Šajgalík, Michal Barla, Mária Bieliková
Year: 2014
Venue: Statistical Language and Speech Processing, Lecture Notes in Computer Science Vol. 8791. Springer International Publishing
Link: link.springer.com/content/pdf/10.1007%2F978-3-319-11397-5_12.pdf
Product of the Action: No

As the amount of text data grows, descriptive metadata become increasingly important for processing it efficiently. One kind of such metadata is keywords, which we encounter, for example, in everyday browsing of web pages. Such metadata can be of benefit in various scenarios, such as web search or content-based recommendation. We study the keyword extraction problem from the perspective of vector space and present a novel method for extracting relevant words from an article, in which each word and phrase of the article is represented as a vector of its latent features. We evaluate our method on a text categorisation task using the well-known 20-newsgroups dataset and achieve state-of-the-art results.
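
To make the vector-space view of keyword extraction concrete, here is a minimal sketch. It is not the authors' method, which the abstract does not detail. It assumes a simple stand-in heuristic: score each word by the cosine similarity between its vector and the centroid of all word vectors in the article. The function name `rank_words_by_centroid`, the toy words, and the hand-made three-dimensional vectors are all hypothetical and serve only as illustration.

```python
import numpy as np


def rank_words_by_centroid(words, vectors):
    """Rank words by cosine similarity to the centroid of their vectors.

    Illustrative heuristic only (an assumption, not the paper's method).
    Assumes every vector is non-zero, otherwise normalisation divides by zero.
    """
    matrix = np.asarray(vectors, dtype=float)
    # Scale each word vector to unit length so dot products equal cosines.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    # The centroid of the unit vectors stands in for the article's "topic".
    centroid = unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    scores = unit @ centroid
    # Sort from most to least similar to the centroid.
    order = np.argsort(-scores)
    return [(words[i], float(scores[i])) for i in order]


# Hypothetical latent-feature vectors; real ones would come from a trained model.
toy_words = ["graphics", "image", "render", "banana"]
toy_vectors = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.95],
]

ranking = rank_words_by_centroid(toy_words, toy_vectors)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

In this toy input, three vectors point in a similar direction and one points elsewhere, so the outlying word receives the lowest score. In a realistic setting the vectors would be learned word representations rather than hand-written values.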