Seminar "Advances in Modern Query Languages"

News

Sept. 28, 2022 Web page set up.

Oct. 05, 2022 Description added.

Oct. 07, 2022 Topics added, organization details added.

Description

The seminar will be held in English.

SQL is the de facto standard for data access languages. Its declarative nature removes the need to define how a query is processed and allows users to state precisely which data they want, even without deeper knowledge of computer programming.

An even more important aspect of the declarative nature of SQL is the extreme performance achieved by modern relational database systems. Indeed, thanks to continuous research effort over the last decades, these systems can offer very high performance and push the underlying hardware to its limits. To achieve this efficiency, a restriction to a specific data model is necessary. Working with a specific model enables efficient optimizations and the mapping of query constructs to highly specialized operator implementations.

However, specific application contexts are often at odds with a purely relational approach. Examples include workloads on graph or nested ("JSON-like") data and the incorporation of relational processing in machine learning applications. While it is possible to model many non-relational use cases, such as graphs, in a relational way, non-relational data models can be more intuitive and allow other forms of implementations and features. As a result, many non-relational ("NoSQL") systems offer their own, often specialized, approach to a query interface. While these interfaces are usually convenient to use, their query performance is usually not on par with that of mature relational database systems. For this reason, much effort is invested in bringing "optimizations and features from the relational world" (back) into these systems.

In this seminar, we will consider various extensions and alternatives to SQL, which is itself already an extension of the classical relational algebra/model. As a disclaimer: while some of the works below do have formal aspects, many of them are deeply involved with the inner workings of modern database technology and hardware. Having attended the Master's lectures "Architecture and Implementations of DBMS" and/or "Data Processing on Modern Hardware" is helpful, but not required.

Please find a list of the topics below. Note, however, that we also welcome other topic suggestions you might find interesting. If required, we will add additional topics.

Interested students are invited to contact Maximilian Berens (maximilian.berens@tu-dortmund.de) with their three favorite topics by 28. Oct. 2022.

Topics

There are currently 8 relatively broad (sub-)topics to pick from; some are grouped in the same category. Each (sub-)topic is usually accompanied by more than one resource/paper. This is intentional! While it is possible to focus on only one particular paper, we highly recommend approaching the topics in a more holistic manner instead of just condensing what is written in one specific paper. Of course, if you find other sources, feel free to use them as well.

Depending on the demand and your interests, we can also add new topics. If you are interested in a specific topic that is not listed below, you are very welcome to contact us and make your own suggestion!

Note: Most papers can likely only be retrieved from the university network.

1. Integrating User-Defined Functionality

Sometimes it is beneficial to extend the SQL interface of an existing system with self-implemented functions that are applied within the database.
This functionality lets users implement their specialized workload without forgoing the underlying system's features.
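
As a rough illustration of the idea, the following sketch registers a Python function as a scalar SQL function. It uses SQLite through Python's standard `sqlite3` module, which is not one of the systems covered by the papers below; the table `users` and the function `normalize_email` are hypothetical names chosen for this example.

```python
import sqlite3


def normalize_email(value):
    """Hypothetical user-defined function: trim and lower-case an address."""
    return value.strip().lower() if value is not None else None


conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name normalize_email (one argument).
conn.create_function("normalize_email", 1, normalize_email)

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("  Alice@Example.org ",), ("BOB@example.org",)],
)

# The function can now be called from SQL like a built-in one.
rows = conn.execute(
    "SELECT id, normalize_email(email) FROM users ORDER BY id"
).fetchall()
print(rows)
```

The function runs inside the query, so filtering, ordering and other features of the database remain available around it. From the optimizer's point of view, however, such a function is typically an opaque call.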

Subtopic A:

Extending the SQL interface, or even the whole language, allows the transparent inclusion of user-defined code.

User-defined Operators https://www.vldb.org/pvldb/vol15/p1119-sichert.pdf
UDFs by extending SQL http://reports-archive.adm.cs.cmu.edu/anon/2021/CMU-CS-21-101.pdf (Master Thesis, potentially useful for background knowledge)

Subtopic B:

So-called polyglot systems allow users to write user-defined functions in host languages such as Python or JavaScript.

Babelfish http://www.vldb.org/pvldb/vol15/p196-grulich.pdf
Systems, such as Apache Impala, also count as polyglot systems (UDF documentation)

2. Common Language for Semi-Structured Data

Data in large-scale compute clusters is often semi-structured, i.e., non-relational. Specific application contexts each have their own requirements, but the variety of interfaces makes it challenging to find the most suitable one. While each system basically brings its own interface, a unified query interface offers many advantages. One contender for this task is SQL++, a superset of SQL.

SQL++ Paper https://arxiv.org/pdf/1405.3631.pdf
SQL++ handbook
Overview of multiple languages from an application perspective: http://www.vldb.org/pvldb/vol15/p154-muller.pdf

3. Language-Integrated Queries

Many systems, such as web servers, automatically generate SQL queries, usually by concatenating strings of SQL, to access databases and use the results to provide data for a website. However, submitting SQL from other programming languages introduces an impedance mismatch: both the calling program code and the underlying database are forced to treat each other as black boxes. This comes with a performance penalty due to additional data copies and conversions, among other concerns such as security. Language-integrated query approaches address this problem. They make the components of a (relational) query available in the host language and allow the generation of efficient SQL statements.
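
To make the contrast with string concatenation concrete, here is a minimal sketch of a query that is built as a host-language object and only translated to SQL at the end. The `Query` class is a hypothetical toy for this example and is far simpler than the approaches in the papers below; table and column names are assumed to come from trusted program code.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical toy query object: a table plus (column, op, value) filters."""
    table: str
    filters: tuple = ()

    def where(self, column, op, value):
        # Only a fixed set of operators is accepted.
        if op not in {"=", "<", ">", "<=", ">="}:
            raise ValueError(f"unsupported operator: {op}")
        return Query(self.table, self.filters + ((column, op, value),))

    def to_sql(self):
        # Values are passed as parameters and never spliced into the SQL text.
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(
                f"{column} {op} ?" for column, op, _ in self.filters
            )
        params = [value for _, _, value in self.filters]
        return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("pen", 1.5), ("book", 12.0), ("lamp", 30.0)],
)

# Queries compose like ordinary values in the host language.
affordable = Query("products").where("price", "<", 20).where("price", ">", 1)
sql, params = affordable.to_sql()
rows = conn.execute(sql, params).fetchall()
print(sql, params, rows)
```

Because the query is a structured value rather than a string, the program can inspect, combine and validate it before any SQL is generated.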

T-LINQ: https://dl.acm.org/doi/pdf/10.1145/2544174.2500586
Extension with Aggregation/Group By https://link.springer.com/content/pdf/10.1007/978-3-030-59025-3_9.pdf
Additional Source: https://dl.acm.org/doi/pdf/10.1145/2847538.2847542
ScalaQL: SQL in Scala https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.443.9185&rep=rep1&type=pdf

4. From Flat to Nested and Back

"Arrays as values", while not relational, are ubiquitous in basically every object-oriented programming language and allow a very flexible and natural way to represent information. While SQL does allow arrays (collections) as values per se, many systems still define their own language and operations. Unnesting is an established way of transforming non-relational data into a relational form and thereby re-enables SQL on nested data.
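
A small sketch of what unnesting does, in plain Python: each record with an array-valued field is expanded into one flat row per array element. The data and the helper `unnest` are hypothetical and only meant to illustrate the transformation, not the operator of any particular system.

```python
orders = [
    {"order_id": 1, "customer": "alice", "items": ["pen", "book"]},
    {"order_id": 2, "customer": "bob", "items": ["lamp"]},
    {"order_id": 3, "customer": "carol", "items": []},
]


def unnest(records, array_field, element_name):
    """Yield one flat row per array element.

    Records whose array is empty produce no row, as with an inner unnest.
    """
    for record in records:
        scalars = {k: v for k, v in record.items() if k != array_field}
        for element in record[array_field]:
            yield {**scalars, element_name: element}


flat_rows = list(unnest(orders, "items", "item"))
for row in flat_rows:
    print(row)
```

The resulting rows contain only scalar values and could be stored in an ordinary relational table. Note that the order with an empty array disappears, which is one reason why systems often also offer an outer variant of this operation.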

Subtopic A - (Flat) SQL over Nested Data:

Google's Dremel, or its open-source implementation Apache Drill, allows querying nested data via SQL.
- Dremel columnar Data https://dl.acm.org/doi/pdf/10.14778/2732977.2732987
- Drill documentation https://drill.apache.org/docs/drill-in-10-minutes/

Subtopic B - Query Shredding:

Relational DBMS do not support nested arrays in query results. Query shredding is a technique that transforms nested queries over nested data into flat queries over flat representations of nested data.
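
The following sketch illustrates only the basic intuition: a nested result (departments, each with a list of employees) is obtained by running two flat queries and stitching their results together by key. The schema and data are hypothetical, and the actual translation schemes are defined in the papers listed below.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'Research'), (3, 'Legal');
    INSERT INTO employees VALUES ('Ann', 1), ('Bob', 2), ('Cy', 2);
""")

# Flat query 1: the outer level of the nested result, one key per row.
outer = conn.execute("SELECT id, name FROM departments ORDER BY id").fetchall()

# Flat query 2: the inner collections, each row tagged with its parent's key.
inner = conn.execute(
    "SELECT dept_id, name FROM employees ORDER BY dept_id, name"
).fetchall()

# Stitching: regroup the inner rows under their parent key.
members = defaultdict(list)
for dept_id, employee_name in inner:
    members[dept_id].append(employee_name)

nested = [
    {"department": name, "employees": members[dept_id]}
    for dept_id, name in outer
]
print(nested)
```

The number of flat queries depends on the nesting depth of the result type, not on the size of the data, which avoids issuing one inner query per outer row.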

Shredding for multisets https://dl.acm.org/doi/pdf/10.1145/2588555.2612186
Further reading https://dl.acm.org/doi/pdf/10.14778/1920841.1920866
Optional resource http://vldb.org/pvldb/vol14/p445-smith.pdf

5. Mixing Machine Learning and Linear Algebra

Workloads such as model training for machine learning are often performed with lightweight query tools, such as Python's pandas. While these tools are easy to set up and perform reasonably fast at very small scales, more complex data transformations and large data volumes call for more sophisticated systems. Platforms such as Apache Spark offer scale-out to whole compute clusters. However, they are usually unable to exploit optimizations that become possible if the machine learning task and the construction of the dataset (from one or more tables) are considered holistically.

Subtopic A:

Research has produced many systems that allow both tasks to be specified in one way or another and processed within the same system.

LaraDB https://dl.acm.org/doi/pdf/10.1145/3070607.3070608
MADLib https://arxiv.org/pdf/1208.4165

Subtopic B:

SDQL tries to unify linear and relational algebra for both flat and nested data.
- SDQL Paper https://dl.acm.org/doi/pdf/10.1145/3527333
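
As background for why such a unification is plausible, the sketch below expresses a sparse matrix product as a relational join followed by a grouped aggregation. It uses plain SQL via Python's `sqlite3` module and does not use SDQL syntax; the tables `a` and `b` are hypothetical and store matrices as (row, column, value) triples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (i INTEGER, k INTEGER, v REAL);
    CREATE TABLE b (k INTEGER, j INTEGER, v REAL);
    -- A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]]; zero entries are omitted.
    INSERT INTO a VALUES (0, 0, 1), (0, 1, 2), (1, 1, 3);
    INSERT INTO b VALUES (0, 0, 4), (1, 0, 5), (1, 1, 6);
""")

# C[i][j] = sum over k of A[i][k] * B[k][j]:
# the join pairs entries that share k, the aggregation sums the products.
product = conn.execute("""
    SELECT a.i, b.j, SUM(a.v * b.v)
    FROM a JOIN b ON a.k = b.k
    GROUP BY a.i, b.j
    ORDER BY a.i, b.j
""").fetchall()
print(product)
```

Each result row is one non-zero entry of the product matrix in the same triple representation as the inputs.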

Organization

We will use a peer-reviewing tool for the submission of your report. The tool will be set up in the course of the semester. Each student reviews two reports from other students. More details will follow via e-mail.

First Version

We require that you submit a preliminary version of your report, which will be peer-reviewed. This early version should include a preliminary abstract and clearly outline the planned contents of your paper. We highly recommend already writing some text for every section, so that the peer-reviewing process yields meaningful feedback.

Talk

The presentations will be held in person, in a block spanning approx. x days (TBA).

The talks are scheduled for 20 minutes each. Each talk is followed by a discussion of the topic and the presentation.

Report

The report should follow the ACM Proceedings Templates (LaTeX) with a maximum of six pages (including references).

Important dates

Due to the winter holidays (until 06. Jan.) and the end of the lecture period (03. Feb.), the schedule for the submission deadlines is a little tight. Feel free to finish your first version earlier, so you can get feedback earlier as well!

Submission (via mail) of your topic preferences: 28. Oct. 2022
Submission of the first version: 13. Jan. 2023
Submission of reviews: 20. Jan. 2023
Talks: TBA (planned for last week of the lecture period)
Submission of the final version: 28. Feb. 2023
