Department of Computer Science

VLDB 2020 Reviews

Reviews for paper Data-Parallel Query Processing on Non-Uniform Data, submitted to VLDB 2020.

Overall Rating: revision

Reviewer #1

1. Overall Rating

Weak Accept

2. Relevant for PVLDB

Yes

3. Are there specific revisions that could raise your overall rating?

Yes

5. Summary of the paper (what is being proposed and in what context) and a brief justification of your overall recommendation. One solid paragraph.

This paper proposes two generic algorithms on GPUs to improve parallelism over non-uniform data, for which existing approaches would suffer from inactive lanes. Push-down parallelism, in a nutshell, evenly distributes and buffers a dynamic number of tasks generated from joins or variable-sized inputs to GPU lanes before processing them in each GPU lane. The concept of Lane Refill is the same as an existing CPU algorithm: refilling the work buffers of inactive lanes by feeding them tasks from overloaded lanes. Nonetheless, the paper makes a new contribution in applying the concept to GPUs; it avoids reliance on efficient shared memory and uses only bitmap-based intrinsics available among GPU lanes.
Overall, the performance benefit is impressive, and the empirical evaluation is done well enough.
This reviewer is inclined to accept this paper, but a couple of issues should be addressed before publication.
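To make the buffering idea concrete for other readers, here is a minimal sketch of lane refill as this reviewer understands it (a plain-Python simulation; the warp width, threshold, function name, and buffering policy are this reviewer's assumptions, not the paper's CUDA implementation):

```python
WARP = 8          # lanes per warp (assumed width for illustration)
THRESHOLD = 4     # refill if fewer lanes than this survive the filter

def lane_refill(batches, pred):
    """Process batches of up to WARP tuples. When a filter leaves only a
    few lanes active, their tuples are parked in a buffer instead of being
    processed sparsely; once a full warp's worth has accumulated, the
    buffered tuples are reactivated and processed densely."""
    buf, flushes = [], []
    for batch in batches:
        active = [t for t in batch if pred(t)]   # filter divergence source
        if len(active) < THRESHOLD:
            buf.extend(active)                   # park sparse survivors
        else:
            flushes.append(active)               # dense enough: process now
        while len(buf) >= WARP:
            flushes.append(buf[:WARP])           # reactivate a full warp
            buf = buf[WARP:]
    if buf:
        flushes.append(buf)                      # final (possibly sparse) flush
    return flushes
```

For example, with three batches of 8 tuples and a filter keeping only multiples of 4, each batch alone would occupy just 2 of 8 lanes; buffering merges them into a single dense pass of 6 tuples.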

6. Three (or more) strong points about the paper (Please be precise and explicit; clearly explain the value and nature of the contribution).

S1: Push-down parallelism and its GPU implementation are well thought out, and simple and generic enough to have an impact on real products.

S2: The implementation of lane refilling on GPUs is also well done.

S3: Promising and informative experimental results.

7. Three (or more) weak points about the paper (Please clearly indicate whether the paper has any mistakes, missing related work, or results that cannot be considered a contribution; write it so that the authors can understand what is seen as negative).

W1: The presentation should be improved, especially the explanation of push-down parallelism.
W2: The concept of Lane Refill itself is not new. Some might say it merely applies an existing idea to new metal, although the paper has a new trick.
W3: After all, these algorithms require repeated shuffling of data/work among lanes. There might be cases where this causes more overhead than benefit.

8. Novelty (Please give a high novelty ranking to papers on new topics, opening new fields, or proposing truly new ideas; assign medium ratings to delta papers and papers on well-known topics but still with some valuable contribution).

With some new ideas

9. Significance

Improvement over existing work

10. Technical Depth and Quality of Content

Solid work

11. Experiments

Very nicely support the claims made in the paper

12. Presentation

Reasonable: improvements needed

13. Detailed Evaluation (Contribution, Pros/Cons, Errors); please number each point.

D1: Maybe this reviewer was particularly slow, but it took a while to fully understand push-down parallelism. The presentation of this paper is not particularly bad, but there are a few things that would help readers understand the algorithm:
- Section 3.2/3.3 and Figure 3/4 are split across two pages, back-to-back. The reviewer almost hurt their neck flipping the pages tens of times. They would be better placed so that the text and related figures are co-located.
- The meaning of w/s/etc. could be better explained with detailed running examples and real numbers. Only Figure 3 has such numbers, and they are not well connected with the other text/figures. Furthermore, it is confusing that t_buf/w_buf/s_buf are mentioned in Section 3.2 and Figure 3 before their meanings are explained.
- In the mini-figure starting with “gather w_buf, t_buf, ...”, it would be helpful to show graphically that o_orderdata/o_orderkey/c_acctbal are “t” and bucket_offs is “s” in this case.
- In Figure 4, “w <- number expansion items op_i” is unclear. How about “w <- number of work items after expansion in op_i”?
- It would be less confusing to use 8 as the number of lanes everywhere rather than switching between 8 and 32 within the section. The paper can simply say in the experimental section that it used 32 lanes instead of 8.

D2: The benefit of push-down parallelism and lane refilling might be sensitive to the number/type of op_1~op_i-1, op_i+1~. For instance, the algorithm might show significantly different benefits between the following two cases: 1) there are 100 operators and each operator involves heavy computation, and 2) there are few or no operators aside from the diverting operator. In the latter case, and especially when there are many lanes, how does the cost of shuffling and bitmap computation affect the overall cost? The paper should include a sensitivity analysis.

14. If revision is required, list specific revisions you seek from the authors

D1/D2

Reviewer #2

1. Overall Rating:

Weak Reject

2. Relevant for PVLDB

Yes

3. Are there specific revisions that could raise your overall rating?

Yes

5. Summary of the paper (what is being proposed and in what context) and a brief justification of your overall recommendation. One solid paragraph.

The authors raise the topic of control flow divergence in the context of parallel architectures, namely GPUs. To address the sources of divergence, they present two techniques, the latter being a port/refinement of existing work. The presented techniques aim to ensure that the warp lanes of GPUs remain as utilized as possible during query processing. Each technique is presented in detail; the authors explain which GPU primitives are relevant for each one. Finally, the experimental evaluation shows the benefits of the presented approaches, which, however, appear to be marginal for macro-experiments.

6. Three (or more) strong points about the paper (Please be precise and explicit; clearly explain the value and nature of the contribution).

S1. Solid attempt to address a shortcoming of modern hardware.
S2. The paper's text is of high quality, and its contents can be used as a manual for database developers.
S3. Proposed techniques have applicability in multiple database operations.

7. Three (or more) weak points about the paper (Please clearly indicate whether the paper has any mistakes, missing related work, or results that cannot be considered a contribution; write it so that the authors can understand what is seen as negative).

W1. The benefits of the techniques appear to be marginal in macro-benchmarks.
W2. Absence of a discussion on how to expose/incorporate the techniques into a query optimizer.
W3. The 'lane refill' technique is quite similar to pre-existing techniques.

8. Novelty (Please give a high novelty ranking to papers on new topics, opening new fields, or proposing truly new ideas; assign medium ratings to delta papers and papers on well-known topics but still with some valuable contribution).

With some new ideas

9. Significance

Improvement over existing work

10. Technical Depth and Quality of Content

Solid work

11. Experiments

OK, but certain claims are not covered by the experiments

12. Presentation

Reasonable: improvements needed

13. Detailed Evaluation (Contribution, Pros/Cons, Errors); please number each point.

D1. Please try to keep figures in the same page as their description.
Example: Figure 4 and Section 3.2, as well as Figure 7 and Section 4.2.
D2. Please quantify the space penalty incurred by the use of the proposed techniques.
D3. While Sections 3.4 and 4.4 list numerous potential usage scenarios for the proposed techniques, few of them are actually evaluated.
D4. It would be nice to see a discussion of how the proposed techniques would be incorporated into a query optimizer, and when each of them should be used. The evaluation section mentions this topic, yet a more principled discussion would have been interesting. In the same spirit of query optimization, it should be noted that lane refill alters the order of data records, and can affect the presence of an interesting order in the dataset.
D5. The text of the first paragraph of Section 4.3 is a bit convoluted; please try to rewrite slightly.
D6. Section 5: My understanding is that all figures besides Fig. 15 correspond to GPU-resident datasets, but please be explicit regarding where the data lies for each of the experiments.
D7. Figure 8: Please include a bar for a case between pk-fk and pk-32-fk (e.g., 8 or 16). In addition, in the case of pk-fk, why is pushdown faster despite having more complex work to do? Is it just an artifact of small execution times?
D8. As a general comment, bigger data sizes would have been appreciated, to avoid reasoning in terms of small millisecond-level comparisons.
D9. Having most "real-world" experiments use TPC-H, yet shifting to SSB for Section 5.2, feels a bit odd.
D10. The numbers reported in Figure 14 showcase a limited contribution of the proposed techniques compared to the baseline, a concern further exacerbated by the fact that the data seems to be GPU-resident before queries are launched. The authors do include a case of CPU-resident data in Figure 15, but only compare against MonetDB in it -- a comparison which offers little to the discussion. Including the baseline DogQC implementation in Figure 15 would likely show little variation from the optimized DogQC variant.
D11. Parts of the appendix feel very disconnected from the main text. For example, the 'sample operator' part would be more useful to the reader if it had been incorporated in the main body, and revolved around how the proposed techniques can be incorporated in a query compiler that (by default) does not reason in terms of parallelization strategy changes within a pipeline.

14. If revision is required, list specific revisions you seek from the authors

Address weak and detailed points

Reviewer #3

1. Overall Rating

Weak Accept

2. Relevant for PVLDB

Yes

3. Are there specific revisions that could raise your overall rating?

No

5. Summary of the paper (what is being proposed and in what context) and a brief justification of your overall recommendation. One solid paragraph.

This paper introduces DogQC, a query compiler for GPUs that preserves processing efficiency even in the presence of heavily skewed operations.
To achieve this, the authors propose two optimisations, namely push-down parallelism and lane refill, which address respectively expansion divergence and filter divergence, the two types of control flow divergence the authors identify as stemming from operation skew.
Push-down parallelism addresses expansion divergence by adapting the parallelisation strategy within the pipeline to one more suitable for the evaluated operations, instead of simply inheriting the parallelisation strategy of the producing operator (i.e., the consumer "pushing down" its preferred parallelisation strategy to the producer).
To mitigate the effects of filter divergence, Lane Refill introduces a buffer operator that, based on a threshold, either buffers (sparse) active lanes or activates buffered tuples if there are enough of them.
By postponing the handling of some tuples, the operator is able to produce a dense output for the next operator, thereby increasing its efficiency.
An experimental evaluation spanning micro- and end-to-end benchmarks, as well as GPU- and CPU-based baselines, demonstrates the benefits of these two optimisations.
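The work redistribution behind push-down parallelism can be sketched as follows (a plain-Python simulation by this reviewer; the function name and the prefix-sum bookkeeping are illustrative assumptions, not the paper's CUDA code):

```python
from itertools import accumulate

def pushdown_expand(matches_per_lane, work):
    """Each producer lane i generated matches_per_lane[i] items (expansion
    divergence: some lanes have many matches, others none). Instead of each
    lane looping over only its own items, flatten the total work via an
    exclusive prefix sum so that all lanes can cooperatively stride over it."""
    counts = [len(m) for m in matches_per_lane]
    offs = [0] + list(accumulate(counts))  # exclusive prefix sum of lane counts
    total = offs[-1]
    results = []
    for j in range(total):                 # in hardware: lanes stride cooperatively
        # locate which producer lane the j-th flattened work item came from
        src = max(i for i in range(len(offs)) if offs[i] <= j)
        local = j - offs[src]
        results.append(work(matches_per_lane[src][local]))
    return results
```

With one lane producing three matches, one producing none, and one producing one, all four work items are processed in a single dense sweep rather than leaving two lanes idle while one lane loops three times.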

6. Three (or more) strong points about the paper (Please be precise and explicit; clearly explain the value and nature of the contribution).

S1. Excellent presentation: the paper is very well structured. Judiciously-placed examples and applicability discussions are particularly appreciated.

S2. The proposed techniques are simple, practical, and have a wide applicability range.

S3. The paper has substantial (and sound) technical depth.

7. Three (or more) weak points about the paper (Please clearly indicate whether the paper has any mistakes, missing related work, or results that cannot be considered a contribution; write it so that the authors can understand what is seen as negative).

W1. The paper does not offer a direct empirical comparison with other state-of-the-art approaches (see D2).
W2. The paper assumes sometimes too much of the reader. There are several concepts (e.g., shared memory bank conflict) that could be briefly explained (given the available space) instead of relying on references. This would add further technical depth to the paper.
W3. The paper lacks a bigger picture with a vision or direction for DogQC (see D4).

8. Novelty (Please give a high novelty ranking to papers on new topics, opening new fields, or proposing truly new ideas; assign medium ratings to delta papers and papers on well-known topics but still with some valuable contribution).

With some new ideas

9. Significance

Improvement over existing work

10. Technical Depth and Quality of Content

Excellent work

11. Experiments

Very nicely support the claims made in the paper

12. Presentation

Excellent: careful, logical, elegant, easy to understand

13. Detailed Evaluation (Contribution, Pros/Cons, Errors); please number each point.

D1. The paper does a very nice job of explaining the two types of control flow divergence and the two proposed optimisations to address them. One thing I particularly appreciated is the technical depth of the paper, enhanced by code snippets for the most important implementation parts of the two optimisations, as well as meaningful experimental analyses (in contrast to mere observations). The technical depth would be enhanced even further if the authors invested the available space in explaining parts of the paper where too much prior knowledge is expected of the reader (think of every place a reference is used to avoid explaining something).

D2. Even if DogQC supports a bigger functionality range than state-of-the-art GPU query compilers, the paper should establish an empirical comparison point to see whether there is a tradeoff between functionality and performance. The authors cannot claim to outperform other approaches in performance without reporting experimental results (it is qualitatively established only for feature range).

D3. The discussion of Figure 11 should include an explanation of why lane refill is outperformed by the naive setup at high selectivities.

D4. I would have liked to read about the vision for DogQC. What is its goal? Is it just to demonstrate the proposed optimisations? GPUs are relatively widely used in database systems, but mostly for specialised workloads such as geo-spatial applications and embedded machine learning, not for traditional relational workloads. Do you think that DogQC (or its concepts) has the potential to change this?
