Seminar "Advances in Modern Query Languages"

News

Sept. 28, 2022 Web page set up.

Oct. 05, 2022 Description added.

Oct. 07, 2022 Topics added, organization details added.

Description

The seminar will be held in English.

SQL is the de facto standard for data access languages. Its declarative nature removes the need to define how a query is processed and allows users to state precisely which data they want, even without deeper knowledge of computer programming.

An even more important aspect of the declarative nature of SQL is the extreme performance achieved by modern relational database systems. Indeed, due to continuous research effort over the last decades, these systems can offer very high performance and push the underlying hardware to its limits. To achieve this efficiency, a restriction to a specific data model is necessary. Working with a specific model allows efficient optimizations and the mapping of query constructs to highly specialized operator implementations.

However, specific application contexts are often at odds with a purely relational approach. Examples include workloads on graph or nested ("JSON-like") data and the incorporation of relational processing in machine learning applications. While it is possible to model many non-relational use cases, such as graphs, in a relational way, non-relational data models can be more intuitive and allow other forms of implementations and features. As a result, many non-relational ("NoSQL") systems offer their own, often specialized, approach to a query interface. While these interfaces are usually convenient to use, their query performance is usually not on par with that of mature relational database systems. For this reason, much effort is invested in bringing "optimizations and features from the relational world" (back) into these systems.

In this seminar, we will consider various extensions and alternatives to SQL, which is itself already an extension of the classical relational algebra/model. A disclaimer: while some of the works below do have formal aspects, many of them are deeply involved with the inner workings of modern database technology and hardware. Having attended the Master's lectures "Architecture and Implementations of DBMS" and/or "Data Processing on Modern Hardware" is helpful, but not required.

Please find a list of the topics below. Note, however, that we also welcome other topic suggestions you might find interesting. If required, we will add additional topics.

Interested students are invited to contact Maximilian Berens with their three favorite topics by 28.10.2022.

Topics 

There are currently 8 relatively broad (sub-)topics to pick from; some are grouped in the same category. Each (sub-)topic is usually accompanied by more than one resource/paper. This is intentional! While it is possible to focus on only one particular paper, we highly recommend approaching the topics in a more holistic manner instead of just summarizing what is written in one specific paper. Of course, if you find other sources, feel free to use them as well.

Depending on the demand and your interests, we can also add new topics. If you are interested in a specific topic that is not listed below, you are very welcome to contact us and make your own suggestion!

Note: Most papers can likely only be retrieved from the university network.

1. Integrating User-Defined Functionality

Sometimes it is beneficial to extend the SQL interface of an existing system with self-implemented functions and to apply them within the database.
This functionality promises users an implementation of their specialized workload without forgoing the underlying system's features.

Subtopic A:

Extending the SQL interface, or even the whole language, can allow the transparent inclusion of user-defined code.

Subtopic B:

So-called polyglot systems allow user-defined functions to be written in host languages such as Python or JavaScript.
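As a small illustration of a user-defined function written in a host language, the following sketch registers a Python function with SQLite via the standard `sqlite3` module and calls it from SQL. SQLite is chosen here only because it ships with Python; it is not one of the systems covered by the seminar topics. The function `domain_of`, the table `users`, and the sample rows are hypothetical.

```python
import sqlite3


def domain_of(email):
    """Hypothetical UDF: return the lower-cased domain part of an e-mail address."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name "domain_of" (one argument).
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("ada", "ada@Example.org"),
        ("bob", "bob@example.org"),
        ("eve", "eve@uni.example"),
    ],
)

# The UDF is used like a built-in function, here inside an aggregation.
rows = conn.execute(
    "SELECT domain_of(email) AS domain, COUNT(*) "
    "FROM users GROUP BY domain_of(email) ORDER BY 1"
).fetchall()
print(rows)
```

Note that the database engine treats such a function as a black box: it can call it per row, but it cannot look inside it to optimize the query.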

2. Common Language for Semi-Structured Data

Data in large-scale compute clusters is often semi-structured, i.e., non-relational. Specific application contexts each have their own requirements, but the variety of interfaces poses the challenge of finding the most suitable one. While each system essentially brings its own interface, a unified query interface offers many advantages. One contender for this task is SQL++, a superset of SQL.

3. Language-Integrated Queries

Many systems, such as web servers, automatically generate SQL queries, usually by concatenating strings of SQL, to access databases and use the results to provide data for a website. However, submitting SQL from other programming languages introduces an impedance mismatch: both the calling program code and the underlying database are forced to treat each other as black boxes. In turn, this comes with a performance penalty due to additional data copies and conversions, among other concerns such as security. Language-integrated query approaches are a solution to this problem. They make the components of a (relational) query available in the host language and allow the generation of efficient SQL instructions.
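To make the idea more concrete, here is a minimal sketch of how query components can be exposed as host-language objects and translated into a parameterized SQL string. The classes `Col` and `Pred` and the function `select` are hypothetical and only illustrate the principle; real language-integrated query frameworks are far more elaborate. The sketch assumes that table and column names come from trusted program code, while values are always passed as parameters.

```python
import sqlite3


class Pred:
    """A SQL predicate fragment together with its parameter values."""

    def __init__(self, sql, params):
        self.sql, self.params = sql, params

    def __and__(self, other):
        return Pred(f"({self.sql}) AND ({other.sql})", self.params + other.params)


class Col:
    """Hypothetical column reference; comparisons build predicates, not booleans."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return Pred(f"{self.name} > ?", [value])

    def __eq__(self, value):
        return Pred(f"{self.name} = ?", [value])


def select(table, columns, where):
    """Translate the host-language query description into SQL plus parameters."""
    column_list = ", ".join(c.name for c in columns)
    return f"SELECT {column_list} FROM {table} WHERE {where.sql}", where.params


name, age, city = Col("name"), Col("age"), Col("city")
# The query is written as an ordinary Python expression.
sql, params = select("people", [name], (age > 30) & (city == "Dortmund"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("ada", 36, "Dortmund"), ("bob", 25, "Dortmund"), ("eve", 41, "Bochum")],
)
rows = conn.execute(sql, params).fetchall()
print(sql, params, rows)
```

Because values never enter the SQL text directly, this style avoids the injection risk of plain string concatenation.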

4. From Flat to Nested and Back

"Arrays as values", while not relational, are ubiquitous in virtually every object-oriented programming language and allow a very flexible and natural way to represent information. While SQL does allow arrays (collections) as values per se, many systems still define their own language and operations. Unnesting is an established way of transforming non-relational data into a relational form and therefore re-enables SQL on nested data.
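The following sketch illustrates unnesting on a toy example: each record with an array-valued field is expanded into one flat row per array element, after which plain SQL can be used on the result. The helper `unnest`, the field names, and the sample data are hypothetical. As a simplifying assumption, records with an empty array produce no rows.

```python
import sqlite3


def unnest(records, array_field):
    """Yield one flat row per array element, repeating the parent's scalar fields."""
    for rec in records:
        parent = {k: v for k, v in rec.items() if k != array_field}
        for item in rec.get(array_field, []):
            yield {**parent, array_field: item}


orders = [
    {"order_id": 1, "customer": "ada", "items": ["pen", "ink"]},
    {"order_id": 2, "customer": "bob", "items": ["paper"]},
    {"order_id": 3, "customer": "eve", "items": []},  # contributes no rows
]
flat = list(unnest(orders, "items"))

# The flat rows fit into an ordinary relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO order_items VALUES (:order_id, :customer, :items)", flat
)
counts = conn.execute(
    "SELECT customer, COUNT(*) FROM order_items GROUP BY customer ORDER BY customer"
).fetchall()
print(counts)
```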

Subtopic A - (Flat) SQL over Nested Data:

Google's Dremel, or its open-source implementation Apache Drill, allows querying nested data via SQL.
- Dremel columnar data https://dl.acm.org/doi/pdf/10.14778/2732977.2732987
- Drill documentation https://drill.apache.org/docs/drill-in-10-minutes/

Subtopic B - Query Shredding:

Relational DBMSs do not support nested arrays in query results. Query shredding is a technique that transforms nested queries over nested data into flat queries over flat representations of nested data.
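The general idea can be sketched on a two-level example: instead of one query that returns a nested result (each department with its list of employees), two flat queries are issued, and their results are stitched together by key in the host program. This is a heavily simplified illustration under assumed table names (`dept`, `emp`) and sample data, not the translation algorithm from the literature.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp  (name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'DB'), (2, 'ML');
    INSERT INTO emp  VALUES ('ada', 1), ('bob', 1), ('eve', 2);
    """
)

# Flat query 1: the outer level of the desired nested result.
outer = conn.execute("SELECT id, name FROM dept ORDER BY id").fetchall()
# Flat query 2: the inner level, each row tagged with the key of its parent.
inner = conn.execute(
    "SELECT dept_id, name FROM emp ORDER BY dept_id, name"
).fetchall()

# Stitching: regroup the inner rows under their parent key.
children = defaultdict(list)
for dept_id, emp_name in inner:
    children[dept_id].append(emp_name)

nested = [
    {"dept": dept_name, "employees": children[dept_id]}
    for dept_id, dept_name in outer
]
print(nested)
```

The number of flat queries depends on the nesting depth of the result type (two here), not on the amount of data.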

5. Mixing Machine Learning and Linear Algebra

Workloads such as model training for machine learning are often performed with lightweight query tools, such as Python's pandas. While these tools are easy to set up and perform reasonably fast at very small scales, more complex data transformations and large data volumes call for the use of more sophisticated systems. Platforms such as Apache Spark offer scale-out to whole compute clusters. However, they are usually unable to exploit optimizations that become possible only if the machine learning task and the construction of the dataset (from one or more tables) are considered holistically.

Subtopic A:

Research has produced many systems that allow both tasks to be specified in one way or another and processed within the same system.

Subtopic B:

SDQL tries to unify linear and relational algebra for both flat and nested data.
- SDQL Paper https://dl.acm.org/doi/pdf/10.1145/3527333
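To give a rough intuition for why linear and relational algebra can be treated uniformly, the sketch below assumes, purely for illustration, that both a sparse matrix and a relation are represented as dictionaries: a matrix maps `(row, column)` to a number, and a relation maps a tuple to its multiplicity. A single hypothetical helper, `sum_product`, then expresses both matrix multiplication and a join with projection. This is not SDQL syntax and makes no claim about how SDQL is defined or implemented; see the paper for the actual language.

```python
from collections import defaultdict


def sum_product(left, right, match, out_key):
    """Sum left[k1] * right[k2] over matching key pairs, grouped by out_key(k1, k2)."""
    result = defaultdict(int)
    for k1, v1 in left.items():
        for k2, v2 in right.items():
            if match(k1, k2):
                result[out_key(k1, k2)] += v1 * v2
    return dict(result)


# Linear algebra: sparse matrices as {(row, col): value}.
A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
B = {(0, 0): 4, (1, 0): 5}
AB = sum_product(
    A, B,
    match=lambda a, b: a[1] == b[0],      # inner dimensions agree
    out_key=lambda a, b: (a[0], b[1]),    # (row of A, column of B)
)

# Relational algebra: relations as {tuple: multiplicity}.
emp = {("ada", 1): 1, ("bob", 1): 1, ("eve", 2): 1}   # (employee, dept_id)
dept = {(1, "DB"): 1, (2, "ML"): 1}                   # (dept_id, dept_name)
joined = sum_product(
    emp, dept,
    match=lambda e, d: e[1] == d[0],      # join condition on dept_id
    out_key=lambda e, d: (e[0], d[1]),    # project to (employee, dept_name)
)
print(AB, joined)
```

The nested loop is quadratic and only meant to show the shared structure of the two operations; it is not an efficient evaluation strategy.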
 

Organization

We will use a peer-reviewing tool for the submission of your report. The tool will be set up in the course of the semester. Each student reviews two reports from other students. More details will follow via e-mail.

First Version

We require that you submit a preliminary version of your report, which will be peer-reviewed. This early version should clearly outline the planned contents of your paper and include a preliminary abstract. We highly recommend that you already write some text for every section, so that the peer-reviewing process can give you meaningful feedback.

Talk

The presentations will be held in person as a block over approx. x days (TBA).

The talks are scheduled for 20 minutes each. Each talk is followed by a discussion of the topic and the presentation.

Report

The report should follow the ACM Proceedings Templates (LaTeX) with a maximum of six pages (including references).

Important dates

Due to winter holidays (until 06. Jan.) and the end of lecture period (03. Feb.), the schedule is a little tight for the submission deadlines. Feel free to finish your first report earlier, so you can get feedback earlier as well!

  • Submission (via mail) of your topic preferences: 28. Oct. 2022
  • Submission of the first version: 13. Jan. 2023
  • Submission of reviews: 20. Jan. 2023
  • Talks: TBA (planned for last week of the lecture period)
  • Submission of the final version: 28. Feb. 2023