The amount of structured data published on the Web is constantly growing. These data come from various sources and domains and can foster the creation of new services and businesses for political, social and commercial activities. In this context, it becomes very important to enable end users to easily retrieve relevant data from these sources.
One of the most flexible techniques enabling novice users to access structured data is keyword search. Currently, semantic keyword search over the Linked Open Data (LOD) cloud, relational databases and other kinds of structured sources faces several problems: a lack of data quality assessment, increased ambiguity of keyword queries, limited scalability, and a lack of query routing techniques that take into account both query semantics and data quality.
In this WG, we aim to support the development of novel methods and algorithms that address these problems and enable effective and efficient keyword search over structured data sources. In particular, we study techniques for matching user keywords with the data structures and domains of selected sources, and for formulating the corresponding queries.
Structured queries (e.g., queries formulated in SPARQL or SQL) are a powerful tool for precisely describing a user’s information need and retrieving the intended information from a dataset. However, creating a structured query manually is a labor-intensive and error-prone task even in a single-source search scenario. This task requires exact knowledge of the dataset schema as well as proficiency in a query language, both of which are typically beyond the expertise of end users. In addition, in a multi-source context, users face the problem of selecting the datasets that provide relevant, correct and up-to-date information. Finally, the sheer volume of the data causes scalability problems.
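The gap between a keyword query and a structured query can be illustrated with a minimal sketch in Python using the standard `sqlite3` module. The schema, the data, and the helper names (`SEARCHABLE`, `match_keywords`, `build_query`, `keyword_search`) are hypothetical and chosen only for illustration; the join path is hard-coded, whereas a real system would have to discover it and rank alternative interpretations.

```python
import sqlite3

# Hypothetical toy schema and data; not taken from any real dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT,
                        author_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'Ada Lovelace'), (2, 'Alan Turing');
    INSERT INTO paper VALUES (10, 'Notes on the Analytical Engine', 1),
                             (11, 'On Computable Numbers', 2);
""")

# Text columns that keywords may match, listed per table (an assumption).
SEARCHABLE = {"author": ["name"], "paper": ["title"]}


def match_keywords(keywords):
    """Map each keyword to the (table, column) pairs whose values contain it."""
    matches = {}
    for kw in keywords:
        hits = []
        for table, columns in SEARCHABLE.items():
            for col in columns:
                # Identifiers come from the fixed SEARCHABLE dict;
                # the keyword itself is passed as a bound parameter.
                row = conn.execute(
                    f"SELECT 1 FROM {table} WHERE {col} LIKE ? LIMIT 1",
                    (f"%{kw}%",),
                ).fetchone()
                if row:
                    hits.append((table, col))
        matches[kw] = hits
    return matches


def build_query(matches):
    """Build one SQL query with one condition per matched keyword."""
    conditions, params = [], []
    for kw, hits in matches.items():
        if not hits:
            continue  # keywords that match nothing are ignored
        conditions.append(
            "(" + " OR ".join(f"{t}.{c} LIKE ?" for t, c in hits) + ")"
        )
        params.extend([f"%{kw}%"] * len(hits))
    sql = (
        "SELECT author.name, paper.title FROM paper "
        "JOIN author ON paper.author_id = author.id "
        "WHERE " + " AND ".join(conditions)
    )
    return sql, params


def keyword_search(text):
    """Answer a whitespace-separated keyword query over the toy database."""
    matches = match_keywords(text.split())
    if not any(matches.values()):
        return []  # no keyword matched any searchable column
    sql, params = build_query(matches)
    return conn.execute(sql, params).fetchall()
```

With this sketch, a user can type `keyword_search("Turing Computable")` without knowing that the two keywords live in different tables or how those tables are joined.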
Structured data sources include relational databases and Linked Open Data (LOD). LOD is emerging as the de facto standard for publishing data on the Web, making it possible to connect and interlink pieces of structured data, information and knowledge spread across different web sources. The LOD cloud (i.e., the open datasets that have been published in the Linked Data format) already includes a variety of data from the government, media, geographic, publications and life sciences domains, as well as cross-domain data and user-generated content, spread across hundreds of datasets containing billions of entities and facts. The data in LOD-compliant datasets is represented using RDF (Resource Description Framework, http://www.w3.org/TR/rdf-concepts).
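As a rough illustration of RDF-style data and of the dataset selection problem, the following Python sketch stores subject-predicate-object triples as plain tuples and ranks datasets by how many query keywords occur in their labels. The dataset names, the `ex:` identifiers, and the function `rank_sources` are made up for this example; this is a toy scoring scheme under those assumptions, not an RDF library and not a method proposed by the WG.

```python
# Toy in-memory "datasets" of (subject, predicate, object) triples.
# All identifiers and dataset names below are invented for illustration.
DATASETS = {
    "geo": [
        ("ex:Berlin", "rdfs:label", "Berlin"),
        ("ex:Berlin", "ex:country", "ex:Germany"),
        ("ex:Germany", "rdfs:label", "Germany"),
    ],
    "media": [
        ("ex:Film1", "rdfs:label", "Wings of Desire"),
        ("ex:Film1", "ex:setIn", "ex:Berlin"),
    ],
}


def rank_sources(keywords):
    """Score each dataset by the number of keywords found in its labels."""
    scores = {}
    for name, triples in DATASETS.items():
        # Only literal labels are searched; other predicates are skipped.
        labels = [o.lower() for (_, p, o) in triples if p == "rdfs:label"]
        score = sum(
            any(kw.lower() in label for label in labels) for kw in keywords
        )
        if score:
            scores[name] = score
    # Highest score first; ties are broken alphabetically by dataset name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A query would then be routed only to the datasets returned by `rank_sources`, which avoids querying sources that cannot contribute any match.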
The tangible outcomes of this WG include:
- Advanced search techniques exploiting statistics, semantics, and metadata.
- Techniques for graph-based search and query interpretation in multi-source search scenarios.
- Scalable keyword search techniques for large-scale structured data.