Reviews for paper Efficient Main-Memory Hash Joins on Multi-Core CPUs: Does Hardware Still Matter?, submitted to PVLDB 2013 (Experimental Track).
Overall rating: reject
This paper was later published as “Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware” at ICDE 2013.
2nd chance after minor revisions
This paper revisits a no-partitioning hash join [3], denoted NPB, and parallel radix join [10], denoted PR, optimizes both resulting NPO and PRO, and compares all of these four algorithms on three hardware platforms. The algorithms are described in sufficient detail; the optimizations are effective; the evaluation results are convincing; the conclusion is clear - in most cases, the parallel radix join performs better than no-partitioning hash join.
S1. The paper studies two state-of-the-art main memory hash join algorithms on multi-core machines.
S2. The presentation of the paper is clear on the algorithms, implementation details, and evaluation.
S3. The evaluation is done on three hardware platforms with various factors considered. The results are convincing.
W1. While hash joins are important, writing a full paper to re-evaluate two state-of-the-art alorithm sees a bit thin.
W2. There are some typos and a few presentation glitches.
YES
Novelty unclear
This is an experimental study paper on existing algorithms, so there is little novelty in the algorithms. The optimizations are all in implementations and are quite simple.
Improvement over existing work
Solid work
Very well presented, supports the claims made in the paper
Reasonable: improvements needed
Yes, absolutely
D1. The paper categorizes the two existing algorithms NPB and PRB into cache/hardware oblivious versus cache/hardware conscious. This categorization may be inaccurate. Specifically, the NPB paper [3] emphasized that computation and synchronization costs should be considered in addition to the traditional factors of cache misses and TLB misses in main memory join algorithms. I would say both approaches are cache/hardware conscious. Furthermore, the term cache-oblivious refers to a divide-and-conquer approach that partitions data to the finest level in a cache-friendly way without knowledge of specific cache parameters, e.g., the cache- oblivious B-tree.
D2. The shorthands of algorithm names are necessary but are confusing at times. It would help to summarize them at the end of Section 3 or beginning of Section 4: NPB, NPO, RJ, PR, PRB, and PRO.
D3. The names of Non-Equal and Equal Data Sets are somewhat confusing. Hash joins deal with equality data sets. What are Non-Equal data sets for? It would be better to be more specific, e.g., Non-Equal Size data set?
D4. Section 6.1. The first paragraph says "for PR we report in Figure 7", but PR results are not shown in Figure 7. The second paragraph talks about results of RJ and PR, neither of which are shown in Figure 7. I wonder why the results are not shown.
D5. Section 6.1. The third paragraph puzzles me - it says (1) PRO is faster, (2) the (PRO?) results are very "similar" to those in [10], and (3) the performance improvement of PRO over PRB is "comparable" to that of NPO over NPB. How much is similar and how much comparable?
D6. Section 6.3. The second paragraph says "in the rest of the paper, we use the bucket chaining version of PRO". Does it mean there were multiple versions of PRO? If so, what version of PRO was used in figures 7 and 8, and would it affect the results if a non-bucket chaining version of PRO was used there?
D7. Typos and awkward sentences.
Section 6.5 "must is".
Section 7.2. The first paragraph.
"On the AMD machine, performance is not as good as even ..16 cores. This is due to... and we do not report it here".
D5 and D6
Reject as major revisions (that go beyond the revision timeframe) needed
The author show that hardware-focussed optimization is still relevant, by showing performance gain on existing Join algorithms.
1. Experimental evaluation of some Join algorithms
1. Comparison only one family of architectures (x86) (Serious drawback)
2. Neglects the NUMA effects of the memory subsystem of the multi-core chips
3. Weak experimental section, no mention of the OS, compiler infrastructure, options used??
4. How is SIMD used?
3. Focus on one type of multi-cores (e.g., GPUs should also be considered)
YES
Novelty unclear
No new algorithms are proposed. The paper does provide some new insights into the workings of the algorithms.
Improvement over existing work
Syntacticaly complete but with little contribution
Ok, but certain claims are not covered by the experiments
Reasonable: improvements needed
Well, not really
1. AVX is a also SIMD. I think you meant SSE/VSX is 128-bit. BTW, SIMD instructions are not new (they are available since 1997 on the Pentium class of machines).
2. It is also important to evaluate hardware-agnostic versions (portable) of these algorithms. In most real systems, this goal is far more important than hardware-obliviousness.
3. Figure number missing on page 3.
4. No mention of VM page size while discussing TLB performance (a common solution is to increase the VM page size to reduce TLB misses)
5. I assume the OS was Linux. No mention of the OS, compiler used, which compiler options are used?
6. Join is a memory bound workload, but there is no evaluation of the memory subsystem (e.g., QPI on Intel)
7. Why didn't you use Vtune? (It is free for academic institutions)
8. Can you compare Figure 6 (scalable results) with Figure 9?
Overall, I was not at all impressed by the conclusions. They are very banal.
Evaluation of the results on another architectural family (ARM or Power). Also, detailed performance characterization of the memory subsystem is necessary. These two results are absolutely necessary for this work to get accepted (it will also increase the wider use of the work).
Reject as major revisions (that go beyond the revision timeframe) needed
The paper compares two types of main-memory hash join algorithms experimentally: hardware oblivious vs. hardware conscious. There are recent papers highlighting either type of algorithms. Through experimental study, the paper points out that under many situations, hardware conscious algorithms can achieve much higher performance than hardware oblivious algorithms. However, many aspects of the study are missing, incomplete, or questionable.
S1. this is an interesting topic.
S2. experiments on three hardware platforms.
W1. The paper focuses only on well-tuned hardware conscious algorithms vs. hardware oblivious algorithms. One important aspect of the comparison is missing: not-tuned hardware comscious algorithms vs. hardware oblivious algorithms (for modeling the case of running algorithms on different hardware)
W2. The actual comparison study occupies only a small portion of the paper.
W3. many important aspects of the study are incomplete.
YES
Ideas are not new (say why)
The paper compares existing hash join algorithms.
Improvement over existing work
Insignificant contribution
Very well presented, supports the claims made in the paper
Reasonable: improvements needed
No
The paper compares two types of main-memory hash join algorithms experimentally. The first type are variants of no partitioning join, where the algorithms do not perform partitioning and relies on modern multicore CPUs to deal with cache misses. The second type are variants of radix join, where the algorithms partition the two input relations into CPU cache sized partitions in order to minimize the number of CPU cache misses for building and probing hash tables. The first type of algorithms are known as hardware oblivious, while the second type are known as hardware conscious. There are recent papers highlighting either type of algorithms. Through experimental study, the paper points out that under many situations, hardware conscious algorithms can achieve much higher performance than hardware oblivious algorithms.
Many aspects of the study are missing, incomplete, or questionable. The paper can be siginificantly improved by addressing the following issues:
1. It would be nice to evaluate some other platform. Examples are low-power designs (e.g., ARM or x86 based ATOM), and Sun machines (with simpler cores).
2. In general, hardware conscious algorithms tuned for a specific hardware platform will typically be faster than hardware oblivious algorithms on the hardware platform. This is confirmed by the paper.
However, another equally important use scenario is missing from the paper. This is a software porting scenario where developers do not fine tune algorithms. Hardware oblivious algorithms are easy to port to various platforms and can achieve good performance, while hardware conscious algorithms without parameter tuning may have worse performance on a new hardware platform.
3. The paper compares hash table structure designs. However, several aspects are missing from the paper. First, the paper assumes that every bucket structure should contain 2 tuples, what would be effects of varying the number of tuples in a bucket structure on runtime and space overhead? Second, the tuple size is quite small in both experiments (8B and 16B), It would be nice to vary the tuple size, e.g. from 8B to 100B and see if the hash table design is still good. Third, how frequent are the overflow buckets used when there are skews? What are the performance impact because of the overflow buckets?
4. Sec 6.1 says that PR and RJ are compared in the text, but Figure 7 compares only PRB and PRO. The description of PRB is just one sentence at the end of sec 3.6. This certainly does not explain why PRB performs a global synchronization between build and probe, which is given as the main reason for the cache miss problem in Sec 6.5. In fact, my understanding is that after partitioning in radix style joins, each pair of partitions is independent of each other. Therefore, it is easy to assign different threads to handle different partitions, performing building and probing. There does not seem to be any need for global synchronization.
5. TLB misses are important considerations that drive the design of radix joins. However, the paper does not report TLB size in Table 2 for the hardware platforms under study.
6. The paper performs experiments on a single socket of the CPUs. It would be interesting to see the performance impact of using two sockets as in reality, a DB system will use all the available hardware threads.
7. In Figure 14, the paper varies build relation size. Have the experiments used different partition parameters or the same parameters are used? In the extreme case, when the build relation is very small, there may not be need to do partitioning any more or a single pass partition will be sufficient.
8. The paper spends only 2.5 pages (sec 7) on hardware oblivious vs. hardware conscious comparisons, while 3.5 pages (sec 5 and 6) on the results on the justifications for NPO and PRO algorithms. It would be better to spend more space on the actual comparison, which is the goal of the paper.
1-8
REPLACE THIS WITH YOUR ANSWER
REPLACE THIS WITH YOUR ANSWER