Jump label

Service navigation

Main navigation

You are here:

Main content

Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware — PVLDB 2013 (E&A) Reviews

Reviews for paper Efficient Main-Memory Hash Joins on Multi-Core CPUs: Does Hardware Still Matter?, submitted to PVLDB 2013 (Experimental Track).

Overall rating: reject

This paper was later published as “Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware” at ICDE 2013.

Reviewer 1

Overall Recommendation

2nd chance after minor revisions

Summary of the paper (what is being proposed and in what context; brief justification of your overall recommendation). One paragraph

This paper revisits a no-partitioning hash join [3], denoted NPB, and parallel radix join [10], denoted PR, optimizes both resulting NPO and PRO, and compares all of these four algorithms on three hardware platforms. The algorithms are described in sufficient detail; the optimizations are effective; the evaluation results are convincing; the conclusion is clear - in most cases, the parallel radix join performs better than no-partitioning hash join.

Three (or more) strong points about the paper (Please be precise and explicit; clearly explain the value and nature of the contribution).

S1. The paper studies two state-of-the-art main memory hash join algorithms on multi-core machines.

S2. The presentation of the paper is clear on the algorithms, implementation details, and evaluation.

S3. The evaluation is done on three hardware platforms with various factors considered. The results are convincing.

Three (or more) weak points about the paper (Please indicate clearly whether the paper has any mistakes, missing related work, or results that cannot be considered a contribution; write it so that he authors can understand what is seen as negative aspects

W1. While hash joins are important, writing a full paper to re-evaluate two state-of-the-art alorithm sees a bit thin.

W2. There are some typos and a few presentation glitches.

Relevant for VLDB2013

YES

Novelty (Please give a high novelty ranking to papers on new topics, opening new fields, or proposing truly new ideas; assign medium ratings for delta papers and papers on well known topics but still with some valuable contribution; low novelty ratings are

Novelty unclear

Rationale for novelty rating

This is an experimental study paper on existing algorithms, so there is little novelty in the algorithms. The optimizations are all in implementations and are quite simple.

Significance

Improvement over existing work

Technical Depth and Quality of Content

Solid work

Experiments, Repeatability

Very well presented, supports the claims made in the paper

Presentation

Reasonable: improvements needed

Would you "champion" for acceptance of the paper in a discussion with the peer reviewers?

Yes, absolutely

Detailed Evaluation (Contribution, Pros/Cons, Errors); please number each point

D1. The paper categorizes the two existing algorithms NPB and PRB into cache/hardware oblivious versus cache/hardware conscious. This categorization may be inaccurate. Specifically, the NPB paper [3] emphasized that computation and synchronization costs should be considered in addition to the traditional factors of cache misses and TLB misses in main memory join algorithms. I would say both approaches are cache/hardware conscious. Furthermore, the term cache-oblivious refers to a divide-and-conquer approach that partitions data to the finest level in a cache-friendly way without knowledge of specific cache parameters, e.g., the cache- oblivious B-tree.

D2. The shorthands of algorithm names are necessary but are confusing at times. It would help to summarize them at the end of Section 3 or beginning of Section 4: NPB, NPO, RJ, PR, PRB, and PRO.

D3. The names of Non-Equal and Equal Data Sets are somewhat confusing. Hash joins deal with equality data sets. What are Non-Equal data sets for? It would be better to be more specific, e.g., Non-Equal Size data set?

D4. Section 6.1. The first paragraph says "for PR we report in Figure 7", but PR results are not shown in Figure 7. The second paragraph talks about results of RJ and PR, neither of which are shown in Figure 7. I wonder why the results are not shown.

D5. Section 6.1. The third paragraph puzzles me - it says (1) PRO is faster, (2) the (PRO?) results are very "similar" to those in [10], and (3) the performance improvement of PRO over PRB is "comparable" to that of NPO over NPB. How much is similar and how much comparable?

D6. Section 6.3. The second paragraph says "in the rest of the paper, we use the bucket chaining version of PRO". Does it mean there were multiple versions of PRO? If so, what version of PRO was used in figures 7 and 8, and would it affect the results if a non-bucket chaining version of PRO was used there?

D7. Typos and awkward sentences.
Section 6.5 "must is".

Section 7.2. The first paragraph.

"On the AMD machine, performance is not as good as even ..16 cores. This is due to... and we do not report it here".

If 2nd chance with revision required, list specific revisions you seek from the Authors

D5 and D6

Reviewer 2

Overall Recommendation

Reject as major revisions (that go beyond the revision timeframe) needed

Summary of the paper (what is being proposed and in what context; brief justification of your overall recommendation). One paragraph

The author show that hardware-focussed optimization is still relevant, by showing performance gain on existing Join algorithms.

Three (or more) strong points about the paper (Please be precise and explicit; clearly explain the value and nature of the contribution).

1. Experimental evaluation of some Join algorithms

Three (or more) weak points about the paper (Please indicate clearly whether the paper has any mistakes, missing related work, or results that cannot be considered a contribution; write it so that he authors can understand what is seen as negative aspects

1. Comparison only one family of architectures (x86) (Serious drawback)
2. Neglects the NUMA effects of the memory subsystem of the multi-core chips
3. Weak experimental section, no mention of the OS, compiler infrastructure, options used??
4. How is SIMD used?
3. Focus on one type of multi-cores (e.g., GPUs should also be considered)

Relevant for VLDB2013

YES

Novelty (Please give a high novelty ranking to papers on new topics, opening new fields, or proposing truly new ideas; assign medium ratings for delta papers and papers on well known topics but still with some valuable contribution; low novelty ratings are

Novelty unclear

Rationale for novelty rating

No new algorithms are proposed. The paper does provide some new insights into the workings of the algorithms.

Significance

Improvement over existing work

Technical Depth and Quality of Content

Syntacticaly complete but with little contribution

Experiments, Repeatability

Ok, but certain claims are not covered by the experiments

Presentation

Reasonable: improvements needed

Would you "champion" for acceptance of the paper in a discussion with the peer reviewers?

Well, not really

Detailed Evaluation (Contribution, Pros/Cons, Errors); please number each point

1. AVX is a also SIMD. I think you meant SSE/VSX is 128-bit. BTW, SIMD instructions are not new (they are available since 1997 on the Pentium class of machines).
2. It is also important to evaluate hardware-agnostic versions (portable) of these algorithms. In most real systems, this goal is far more important than hardware-obliviousness.
3. Figure number missing on page 3.
4. No mention of VM page size while discussing TLB performance (a common solution is to increase the VM page size to reduce TLB misses)
5. I assume the OS was Linux. No mention of the OS, compiler used, which compiler options are used?
6. Join is a memory bound workload, but there is no evaluation of the memory subsystem (e.g., QPI on Intel)
7. Why didn't you use Vtune? (It is free for academic institutions)
8. Can you compare Figure 6 (scalable results) with Figure 9?

Overall, I was not at all impressed by the conclusions. They are very banal.

If 2nd chance with revision required, list specific revisions you seek from the Authors

Evaluation of the results on another architectural family (ARM or Power). Also, detailed performance characterization of the memory subsystem is necessary. These two results are absolutely necessary for this work to get accepted (it will also increase the wider use of the work).

Reviewer 1

Overall Recommendation

Reject as major revisions (that go beyond the revision timeframe) needed

Summary of the paper (what is being proposed and in what context; brief justification of your overall recommendation). One paragraph

The paper compares two types of main-memory hash join algorithms experimentally: hardware oblivious vs. hardware conscious. There are recent papers highlighting either type of algorithms. Through experimental study, the paper points out that under many situations, hardware conscious algorithms can achieve much higher performance than hardware oblivious algorithms. However, many aspects of the study are missing, incomplete, or questionable.

Three (or more) strong points about the paper (Please be precise and explicit; clearly explain the value and nature of the contribution).

S1. this is an interesting topic.

S2. experiments on three hardware platforms.

Three (or more) weak points about the paper (Please indicate clearly whether the paper has any mistakes, missing related work, or results that cannot be considered a contribution; write it so that he authors can understand what is seen as negative aspects

W1. The paper focuses only on well-tuned hardware conscious algorithms vs. hardware oblivious algorithms. One important aspect of the comparison is missing: not-tuned hardware comscious algorithms vs. hardware oblivious algorithms (for modeling the case of running algorithms on different hardware)

W2. The actual comparison study occupies only a small portion of the paper.
W3. many important aspects of the study are incomplete.

Relevant for VLDB2013

YES

Novelty (Please give a high novelty ranking to papers on new topics, opening new fields, or proposing truly new ideas; assign medium ratings for delta papers and papers on well known topics but still with some valuable contribution; low novelty ratings are

Ideas are not new (say why)

Rationale for novelty rating

The paper compares existing hash join algorithms.

Significance

Improvement over existing work

Technical Depth and Quality of Content

Insignificant contribution

Experiments, Repeatability

Very well presented, supports the claims made in the paper

Presentation

Reasonable: improvements needed

Would you "champion" for acceptance of the paper in a discussion with the peer reviewers?

No

Detailed Evaluation (Contribution, Pros/Cons, Errors); please number each point

The paper compares two types of main-memory hash join algorithms experimentally. The first type are variants of no partitioning join, where the algorithms do not perform partitioning and relies on modern multicore CPUs to deal with cache misses. The second type are variants of radix join, where the algorithms partition the two input relations into CPU cache sized partitions in order to minimize the number of CPU cache misses for building and probing hash tables. The first type of algorithms are known as hardware oblivious, while the second type are known as hardware conscious. There are recent papers highlighting either type of algorithms. Through experimental study, the paper points out that under many situations, hardware conscious algorithms can achieve much higher performance than hardware oblivious algorithms.

Many aspects of the study are missing, incomplete, or questionable. The paper can be siginificantly improved by addressing the following issues:

1. It would be nice to evaluate some other platform. Examples are low-power designs (e.g., ARM or x86 based ATOM), and Sun machines (with simpler cores).

2. In general, hardware conscious algorithms tuned for a specific hardware platform will typically be faster than hardware oblivious algorithms on the hardware platform. This is confirmed by the paper.

However, another equally important use scenario is missing from the paper. This is a software porting scenario where developers do not fine tune algorithms. Hardware oblivious algorithms are easy to port to various platforms and can achieve good performance, while hardware conscious algorithms without parameter tuning may have worse performance on a new hardware platform.

3. The paper compares hash table structure designs. However, several aspects are missing from the paper. First, the paper assumes that every bucket structure should contain 2 tuples, what would be effects of varying the number of tuples in a bucket structure on runtime and space overhead? Second, the tuple size is quite small in both experiments (8B and 16B), It would be nice to vary the tuple size, e.g. from 8B to 100B and see if the hash table design is still good. Third, how frequent are the overflow buckets used when there are skews? What are the performance impact because of the overflow buckets?

4. Sec 6.1 says that PR and RJ are compared in the text, but Figure 7 compares only PRB and PRO. The description of PRB is just one sentence at the end of sec 3.6. This certainly does not explain why PRB performs a global synchronization between build and probe, which is given as the main reason for the cache miss problem in Sec 6.5. In fact, my understanding is that after partitioning in radix style joins, each pair of partitions is independent of each other. Therefore, it is easy to assign different threads to handle different partitions, performing building and probing. There does not seem to be any need for global synchronization.

5. TLB misses are important considerations that drive the design of radix joins. However, the paper does not report TLB size in Table 2 for the hardware platforms under study.

6. The paper performs experiments on a single socket of the CPUs. It would be interesting to see the performance impact of using two sockets as in reality, a DB system will use all the available hardware threads.

7. In Figure 14, the paper varies build relation size. Have the experiments used different partition parameters or the same parameters are used? In the extreme case, when the build relation is very small, there may not be need to do partitioning any more or a single pass partition will be sufficient.

8. The paper spends only 2.5 pages (sec 7) on hardware oblivious vs. hardware conscious comparisons, while 3.5 pages (sec 5 and 6) on the results on the justifications for NPO and PRO algorithms. It would be better to spend more space on the actual comparison, which is the goal of the paper.

If 2nd chance with revision required, list specific revisions you seek from the Authors

1-8

ONLY FOR REVISED VERSIONS: Justify your answer of the above question with respect to the individual comments and concerns raised with respect to the original version

REPLACE THIS WITH YOUR ANSWER

ONLY FOR REVISED VERSIONS: Any other comments for the authors

REPLACE THIS WITH YOUR ANSWER

Related Information