Paper “Low-Latency Query Compilation”

Reviews (The VLDB Journal, initial submission)

Editor

Dear authors,

Thank you very much for accepting our invitation to submit an extended version of your DaMoN paper to the VLDBJ special issue. All reviewers liked the paper and have provided the comments and suggestions included below. We look forward to the revised paper incorporating them, which will certainly make the paper even better.

Best regards,
The editorial team

Reviewer #1

Overall, this is an interesting paper with solid results and valuable ideas. Below are a few aspects that could help make the contribution more complete:

The paper argues that one of the reasons why it is faster than generic compilers is that the repetitive nature of query operators and their narrow scope allow for optimizations and a focus on a few constructs, rather than having to compile arbitrary code. To what extent is the mapping of, e.g., a join a true compilation step, as opposed to just instantiating a machine-code template for the join? What applies to interpreted query execution, where the operators are generic and just invoked on different inputs, would seem to apply as well to operators in machine code. In Section 3.2, function calls are mentioned, but the paper never explains how often they are used or the mechanism behind them (is there a catalog of available functions somewhere? If so, how much actual compilation is happening, as opposed to scripting a series of function calls?).
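To make the template-versus-compilation distinction concrete, here is a minimal, hypothetical sketch (in C++, for x86-64) of what pure template instantiation looks like: a pre-assembled byte sequence with a hole that is patched per query. This is purely illustrative and not ReSQL's mechanism; the question is how much of Flounder's operator mapping is closer to this than to real instruction selection.

    // Illustrative only: instantiating a machine-code template by patching a hole.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Pre-assembled x86-64 fragment: movabs rax, <imm64>; ret
    // The 8-byte immediate starting at offset 2 is the patchable "hole".
    static const uint8_t kConstTemplate[] = {
        0x48, 0xb8, 0, 0, 0, 0, 0, 0, 0, 0,  // movabs rax, imm64
        0xc3                                  // ret
    };

    std::vector<uint8_t> instantiate(int64_t constant) {
        std::vector<uint8_t> code(kConstTemplate,
                                  kConstTemplate + sizeof kConstTemplate);
        std::memcpy(code.data() + 2, &constant, sizeof constant);  // patch the hole
        return code;  // running it would additionally need an executable mapping
    }

    int main() {
        for (uint8_t b : instantiate(42)) std::printf("%02x ", b);
        std::printf("\n");
    }

Compilation proper would instead perform instruction selection and register allocation per expression; clarifying where the operator mapping sits on this spectrum would sharpen the paper's claim.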

It would be interesting to see some comment/analysis on the end-to-end steps of query compilation. The paper claims to be end-to-end, but it only deals with the mapping plan -> IR -> machine code. The input is a query tree that, presumably, has already been optimized. Any chance to provide a rough idea of the relative cost of the complete sequence: SQL text string -> query plan -> optimized plan -> IR -> machine code? That would be a good way to frame the results. I see the point that query compilation takes too long for short queries, but for complex queries it is likely that the optimization step and the running time make the compilation time irrelevant (unlike what the paper claims). The results of Figure 15 go along these lines but should be commented on in more detail.
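To make this concrete, the breakdown I have in mind is a per-stage timing of the full sequence, along the following lines (all stage functions are hypothetical placeholders, not ReSQL's actual API):

    // Sketch: instrumenting the end-to-end pipeline stage by stage.
    #include <chrono>
    #include <cstdio>
    #include <string>

    struct Plan {}; struct IR {}; struct Code {};
    Plan parse(const std::string&) { return {}; }  // SQL text -> query plan
    Plan optimize(Plan p)          { return p; }   // query plan -> optimized plan
    IR   lower(const Plan&)        { return {}; }  // optimized plan -> IR
    Code emit(const IR&)           { return {}; }  // IR -> machine code

    template <class F> double ms(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        std::string sql = "select ...";
        Plan plan, opt; IR ir; Code code;
        std::printf("parse:    %.3f ms\n", ms([&] { plan = parse(sql); }));
        std::printf("optimize: %.3f ms\n", ms([&] { opt = optimize(plan); }));
        std::printf("lower:    %.3f ms\n", ms([&] { ir = lower(opt); }));
        std::printf("emit:     %.3f ms\n", ms([&] { code = emit(ir); }));
    }

Even rough numbers of this form for a short and a complex query would frame the latency results well.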

There is quite a bit of work on hardware-optimized operators (see, for instance, the papers by Balkesen et al. on radix hash joins and sort-merge joins in ICDE 13 and VLDB 13). These operators make heavy use of low-level machine instructions (AVX, prefetching, etc.). It would seem that the approach in this paper would be a great way to implement such sophisticated algorithms. It would be good to comment on that, extending the discussion in 5.2. Along these lines, it would seem that the use of a prefetcher is something that could be indicated by the query optimizer rather than just generated in the compilation step. Are there opportunities to pass such hints (AVX, prefetching, data sizes, selectivities, etc.) down the pipeline to be used in the compilation step?
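As a sketch of what such hint passing could look like (all names are hypothetical; nothing like this is in the paper), the optimizer could hand a small structure to the per-operator lowering step:

    // Sketch: one possible shape for optimizer-to-compiler hints.
    #include <cstddef>
    #include <optional>

    struct CodegenHints {
        bool use_avx = false;              // vectorize comparisons / aggregations
        bool prefetch_build_side = false;  // emit software prefetches on hash probes
        std::optional<std::size_t> build_cardinality;  // estimated hash table size
        std::optional<double> selectivity;             // predicate selectivity estimate
    };

    int main() {
        CodegenHints hints;
        hints.use_avx = true;
        hints.build_cardinality = 1'000'000;
        // the per-operator lowering step would branch on these, e.g.,
        // emitting a prefetch before each probe when prefetch_build_side is set
        return hints.use_avx ? 0 : 1;
    }

Discussing whether Flounder's interface could accommodate such hints would strengthen Section 5.2.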

It would be useful to see a discussion of any potential shortcomings of the proposed approach. The paper repeatedly claims that it is simpler and faster, which it proves in the experiments. But the paper does not prove or even discuss the generality of the approach, or whether there are possibilities available with LLVM that are not available with Flounder. For instance, LLVM is used to deal with heterogeneous hardware, while Flounder is very much tied to x86 processors. Similarly, the paper only discusses a handful of operators and query types, while the systems it compares against support a wider range of SQL constructs and queries. Some of this is just because Flounder has not been developed further, but it would be good to present arguments for how it can actually be extended and where its limits are.

Minor comments:

I was puzzled by the "?" in Figure 1. It is not clear what it means or what it represents, and the text does not explain it. I suggest either clarifying it, giving it a name, or removing it.

I would add a line or two in the introduction quantifying the results (the 5.5× speedup). It is in the abstract, but the intro only has a generic description of the results. It would also be useful to already list there which systems are used as baselines for the comparison.

In Section 1.2, I would remove the explanation of the content added over the DaMoN paper. It is useful for reviewing but not needed for the final version.

I am not a great fan of having an outline of the next sections every few paragraphs. It is not needed, and it makes it look like the text is there to make the paper longer. There is already an outline in the introduction; I would suggest removing the repetitive outlines added in several sections (e.g., at the end of 2.2).

For the final version, it would be nice to fix paragraphs whose last line contains only one or two words. There are quite a few.

Some of the code is labeled as a figure, while other code is inlined in the text without a label. I suggest being consistent and labeling all of it as figures. Also, in 2.3, I would suggest making the example a figure rather than having it float beside the text. It does not look nice, and it is not needed, as there is space for it.
 

Reviewer #2

Dear Authors,

A few minor comments:

1. Regarding Fig. 3: I am not familiar with the produce/consume model of [29]. I guess that, because of that, I missed the outer loop of the build side, where all tuples of the build side are inserted into the hash table. I also could not follow the meaning of a1 vs. Attributes in this figure. I propose expanding the explanation of the figure a bit more (see the sketch after this list for my point of confusion).
2. Section 2.3 contains well-known compilation techniques. In my opinion, it does not add to the depth of the paper and can therefore be omitted. By contrast, Section 3.1 does contain a new technique: scoping the use of virtual registers in SQL processing loops.
3. Typo in Section 2.3: the constant 0.23 is later replaced by the constant 0.34.
4. Figure 10: You need to solve the editorial/visual problem of how to annotate the X-axis. There are two different scales, one for projection and one for join, but the figure labels neither.
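To illustrate my point of confusion in comment 1: as far as I understand the produce/consume model, the outer loop over the build side belongs to the scan operator, while the join only contributes the loop body. A hypothetical sketch (the names are mine, not the paper's) of the code-generation flow:

    // Sketch: each operator prints the code it would generate.
    #include <cstdio>

    struct Join;

    struct Scan {
        const char* table;
        Join* parent;
        void produce();  // generates the loop over the build side's tuples
    };

    struct Join {
        Scan* build = nullptr;
        void produceBuild() { build->produce(); }  // ask the child to drive the loop
        void consumeBuild() {
            // called once per tuple, inside the loop the scan generated
            std::printf("    ht.insert(key(tuple), tuple);\n");
        }
    };

    void Scan::produce() {
        std::printf("for (auto& tuple : %s) {\n", table);  // the outer loop lives here
        parent->consumeBuild();                            // loop body from the parent
        std::printf("}\n");
    }

    int main() {
        Join join;
        Scan scan{"build_side", &join};
        join.build = &scan;
        join.produceBuild();  // prints the generated build-side loop
    }

If this reading is correct, saying so explicitly next to Fig. 3 would help readers unfamiliar with [29].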

And a single major comment:

From my perspective, the key contribution of the paper is the demonstration of a domain-specific JIT compiler (for the specific domain of SQL), as opposed to the general-purpose LLVM. For example, the special and simple register allocation scheme benefits from the domain specificity. I would recommend emphasizing this point. Section 1.1 indeed introduces this contribution, but it deserves further elaboration (maybe as future work, for other domains such as AI/ML...).

Reviewer #3

The paper introduces an intermediate language (Flounder IR) to reduce the high latency of compilation with LLVM in database systems, and a database system (ReSQL) that uses JIT compilation for the Flounder IR language. This is a solid paper that tackles an important problem that has the potential to impact practice. This version expands on an earlier workshop publication with sufficient new content to merit publication, with some minor edits.

As a general comment, given the focus on latency, it was a little surprising to see the evaluation with OLAP workloads on a tiny database in Section 6.6. These workloads have long-running queries that can tolerate a few seconds of delay due to compilation. It may have been more interesting to evaluate Flounder IR and ReSQL with OLTP workloads, where compilation would add an unacceptable latency to every transaction, to the point where compilation performance alone determines the peak throughput of the system.

Some specific questions on the new content:
1. How are NULLs handled in ReSQL and Flounder IR? The instruction path length grows considerably when NULLs are involved due to the use of binary logic in the ISA. Flounder IR has an opportunity to encode ternary logic directly in the IR. Some discussion on this would strengthen the paper.
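As an illustration of what a ternary encoding could look like (purely a sketch; the paper does not describe ReSQL's actual NULL representation):

    // Sketch: encoding SQL's three-valued logic explicitly.
    #include <cstdint>
    #include <cstdio>

    enum class Bool3 : uint8_t { False = 0, True = 1, Unknown = 2 };

    // NULL-aware comparison: any comparison involving NULL yields Unknown.
    Bool3 eq(int64_t a, bool a_null, int64_t b, bool b_null) {
        if (a_null || b_null) return Bool3::Unknown;
        return a == b ? Bool3::True : Bool3::False;
    }

    // 3VL AND: False dominates, then Unknown, then True.
    Bool3 and3(Bool3 x, Bool3 y) {
        if (x == Bool3::False || y == Bool3::False) return Bool3::False;
        if (x == Bool3::Unknown || y == Bool3::Unknown) return Bool3::Unknown;
        return Bool3::True;
    }

    int main() {
        // WHERE clauses keep a row only when the predicate evaluates to True.
        Bool3 r = and3(eq(1, false, 1, false), eq(0, true, 0, false));
        std::printf("%d\n", static_cast<int>(r));  // prints 2 (Unknown): row filtered out
    }

The question is whether Flounder IR could carry such ternary values as first-class operands rather than expanding them into binary-logic instruction sequences.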

2. Please clarify the relationship of the decimal type used in the example in Section 2.3 to the SQL exact numeric types (numeric, decimal) and the approximate numeric types (float, real). In particular, clarify whether ReSQL supports both exact and approximate numeric types, or whether approximate numeric types are converted to exact types. Also, clarify how exact decimals wider than the register size are handled.
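For concreteness, one common implementation stores exact decimals as scaled integers; a sketch of that approach follows (whether ReSQL does anything like this is exactly what should be clarified):

    // Sketch: decimal(p, 2) stored as value * 100 in an int64_t.
    #include <cstdint>
    #include <cstdio>

    int64_t to_cents(double v) { return static_cast<int64_t>(v * 100 + 0.5); }

    int main() {
        int64_t price = to_cents(19.99);  // 1999
        int64_t rate  = to_cents(0.23);   // 23 (the constant from Section 2.3)
        // Multiplication doubles the scale, so rescale by 100 afterwards.
        // Decimals wider than 64 bits would need 128-bit intermediates or
        // multiple registers, which is the second part of my question.
        int64_t tax = price * rate / 100;  // 459, i.e., 4.59
        std::printf("%lld.%02lld\n", (long long)(tax / 100), (long long)(tax % 100));
    }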

3. The prefetching optimization in 5.2 aims to match the granularity of prefetching to the granularity of access by tracking cacheline accesses and using loop unrolling. This realization comes late in the section and is a little underwhelming; I was reading eagerly, expecting an interesting way to optimize the prefetching distance, which is a very brittle parameter in practice. Please state the nature of the optimization upfront, instead of at the end of the section. Some discussion of whether and how a database-specific IR can help with tuning the prefetch distance would also be welcome.
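For reference, the granularity matching as I understand it amounts to something like the following (illustrative GCC/Clang-style C++; the sequential scan and the distance of 8 cachelines are my own arbitrary choices, and the hard-coded distance is exactly the brittle parameter I mean):

    // Sketch: one software prefetch per cacheline via loop unrolling.
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    constexpr std::size_t kCacheline = 64;
    constexpr std::size_t kPrefetchDistance = 8;  // in cachelines; a guess

    int64_t sum(const std::vector<int64_t>& col) {
        constexpr std::size_t kPerLine = kCacheline / sizeof(int64_t);  // 8 values
        int64_t acc = 0;
        std::size_t i = 0;
        for (; i + kPerLine <= col.size(); i += kPerLine) {
            // one prefetch per cacheline, not per element
            // (prefetch instructions do not fault, so running past the end is benign)
            __builtin_prefetch(col.data() + i + kPrefetchDistance * kPerLine);
            for (std::size_t j = 0; j < kPerLine; ++j) acc += col[i + j];  // unrolled
        }
        for (; i < col.size(); ++i) acc += col[i];  // tail
        return acc;
    }

    int main() {
        std::vector<int64_t> col(1 << 20);
        std::iota(col.begin(), col.end(), 0);
        return sum(col) != 0 ? 0 : 1;
    }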

4. The 100MB database used in Section 6.6 appears to be outside the specs of TPC-H, where a scale factor of 1 produces a 1GB database. How was the 100MB database generated exactly? 

5. Typo: (page 9) "its low compile-times goals" -> compile-time goal