WG2 – Keyword search




The amount of structured data published on the Web is constantly growing. These data come from various sources and domains and can foster the creation of new services and businesses for political, social and commercial activities. In this context, it becomes very important to enable end users to easily retrieve relevant data from these sources.

One of the most flexible techniques that gives novice users access to structured data is keyword search. Currently, semantic keyword search over the LOD cloud, relational databases and other kinds of structured sources faces several problems: data quality is not assessed, keyword queries are highly ambiguous, existing methods scale poorly, and query routing techniques that take into account both query semantics and data quality are lacking.

In this WG, we aim to support the development of novel methods and algorithms that address these problems and enable effective and efficient keyword search over structured data sources. In particular, we study techniques for matching user keywords with the data structures and domains of selected sources, and for formulating the corresponding structured queries.
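
As a rough illustration of what matching keywords to data structures and formulating queries can look like, the minimal sketch below maps keywords onto a toy relational schema and builds candidate SQL queries. The schema, the function names and the exact-match strategy are assumptions made for this example only; they do not describe a method developed by the WG.

```python
# Hypothetical toy schema (table name -> column names); not taken from any WG dataset.
SCHEMA = {
    "author": ["name", "affiliation"],
    "paper": ["title", "year"],
}


def match_keywords(keywords, schema):
    """Split keywords into schema matches (table, column) and value keywords."""
    columns, values = [], []
    for kw in keywords:
        kw_lower = kw.lower()
        hits = [(table, col)
                for table, cols in schema.items()
                for col in cols
                if col == kw_lower]
        if hits:
            columns.extend(hits)
        else:
            # A keyword that names no schema element is treated as a data value.
            values.append(kw)
    return columns, values


def formulate_queries(keywords, schema):
    """Build one parameterized SQL candidate per (matched column, value) pair."""
    columns, values = match_keywords(keywords, schema)
    queries = []
    for table, column in columns:
        for value in values:
            sql = f"SELECT * FROM {table} WHERE {column} LIKE ?"
            queries.append((sql, f"%{value}%"))
    return queries


candidates = formulate_queries(["title", "ontology"], SCHEMA)
for sql, param in candidates:
    print(sql, param)
```

A realistic system would replace the exact string match with fuzzy or semantic matching and would rank the candidate queries, which is where the ambiguity mentioned above comes into play.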

Basic concepts



Structured queries (e.g., queries formulated in SPARQL or SQL) are a powerful tool to precisely describe a user's information need and retrieve the intended information from a dataset. However, creating a structured query manually is a labor-intensive and error-prone task even in a single-source search scenario. It requires exact knowledge of the dataset schema as well as proficiency in a query language, both of which are typically beyond the expertise of end users. In addition, in a multi-source context, users face the problem of selecting the datasets that provide relevant, correct and up-to-date information. Finally, the sheer volume of the data causes scalability problems.

Structured data sources include relational databases and Linked Open Data (LOD). LOD is emerging as the de facto standard for publishing data on the Web. It makes it possible to connect and interlink pieces of structured data, information and knowledge spread across different Web sources. The LOD cloud (i.e., the open datasets that have been published in the Linked Data format) already includes a variety of data from the government, media, geographic, publications and life sciences domains, as well as cross-domain data and user-generated content, spread across hundreds of datasets containing billions of entities and facts. The data in LOD-compliant datasets is represented using RDF (Resource Description Framework, http://www.w3.org/TR/rdf-concepts).
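
To make the RDF data model concrete, the sketch below represents a few facts as subject-predicate-object triples and runs a naive keyword lookup over literal values. The example.org URIs, the data and the helper names are hypothetical, and distinguishing literals from URIs by their prefix is a simplification made only for this illustration.

```python
# Hypothetical example data; the example.org URIs are placeholders.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

TRIPLES = [
    ("http://example.org/Berlin", RDFS_LABEL, "Berlin"),
    ("http://example.org/Berlin", "http://example.org/capitalOf",
     "http://example.org/Germany"),
    ("http://example.org/Germany", RDFS_LABEL, "Germany"),
]


def is_literal(obj):
    """Simplification: treat any object that is not an http(s) URI as a literal."""
    return not obj.startswith(("http://", "https://"))


def entities_matching(keyword, triples):
    """Return the subjects that have a literal value containing the keyword."""
    kw = keyword.lower()
    return sorted({s for s, _, o in triples
                   if is_literal(o) and kw in o.lower()})


def facts_about(entity, triples):
    """Return all (predicate, object) pairs stated about an entity."""
    return [(p, o) for s, p, o in triples if s == entity]


for entity in entities_matching("germany", TRIPLES):
    print(entity, facts_about(entity, TRIPLES))
```

A linear scan like this does not scale to billions of facts; it is meant only to show the shape of the data that keyword search over LOD has to work with.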

Expected Outcomes

The tangible outcomes of this WG include:

  • Advanced search techniques exploiting statistics, semantics, and metadata.
  • Techniques for graph-based search and query interpretation in multi-source search scenarios.
  • Scalable keyword search techniques for large-scale structured data.
