Reviews for paper Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware, submitted to ICDE 2013.
Overall rating: accept
This paper solves an important controversy raised by [2] that came on the heels of [1], which came with a fully contradictory message. By exploring the parameter space in depth, it shows that the results of [2] only holds when the join is between a small and a large table and the join keys are skewed -- in all other cases the conclusions of [1] take precedence over [2].
- Incomplete results like in [2] put both practitioners and researchers in a state of confusion that can easily make them lose a lot of time to reach the conclusions of this paper (and if these unlucky researchers do not have the engineering depth of experience as displayed here, they might remain confused). Hence such clarifying work is quite appreciated.
- excellent presentation
- convincing evaluation
- some reviewers will say that this paper does not introduce a new algorithm and lacks novelty. However, for me clearing the confusion raised by [1] vs. [2] is a strong contribution to the community.
Medium
High
High
This is very well done and sound work; there are little details to complain about/remark on.
III B. "AMD machine" -> The AMD machine
IV C. phsysical
V C. "Resulting overall" -> The resulting overall
the key observations are
- function-call intensiveness (relative inefficiency versus the other implementations) of radix join in [2]
- poor radix join implementation choice in [2] to build all hash tables first before probing them
- sorted insert order and effective non-hashing in [2] that cause a sequential access pattern in hash build
- problem configuration [2] happened to hit its sweet spot combining small build size and skewed key distributions
in all you showed that [1] was right, and [2] either made some very unfortunate mistakes -- or, more likely, pulled all the tricks to sell a story that does not hold up to scrutiny.
High
Accept (I will argue for acceptance)
The authors report on a thorough experimental analysis of hash-join algorithms on multicore systems. The conclusion is that even in this contraint setting, there can be observable performance differences in number of cpu cycles spent per tuple.
- very detailed experimental study
- readable story
- good reference case for micro experimentation.
- lack of algorithmic novelty
- lack of translation into real world performance
- lack of stability under skew analysis
Low
High
Acceptable
The paper presents a thorough analysis of hash-join algorithms on a multicore system. The main weakness is the lack of an analysis of the macro-visible effects. How much of a factor two on cycles/output element translate into observable better query response times. Integration in an open-source system would have made this clear.
Furthermore, the stability under data skew has not been studied. Therefore, the overall contribution remains isolated.
High
Accept (I will argue for acceptance)
The paper performs a very careful evaluation of different hash join strategies and highlights the different sweet spots. Primarily, it is a falsification paper for [2]. What is a bit missing is the "tuning to the underlying hardware" part from the title. The authors give some general hints, but not really a roadmap.
- a very careful evaluation of different join strategies
- highlights systematically weaknesses in the setup of [2]
- the novelty is a bit limited. Most of the results were known (though not properly published, so that might be ok)
- the paper does not really over a guideline for tuning to the underlying hardware
Medium
Medium
High
The paper performs a very careful analysis of different hash join strategies, and I really appreciate the effort. Some algorithms work well only for certain setups, and this paper makes a good point of emphasizing this issue. It primarily falsifies [2], and shows that the results of [2] are artifacts of the the experimental setup in [2]. This is useful work, and reminds us to be honest in our setup and to avoid measuring special cases like sorted input.
What is a bit missing is the "tuning" part of the title. How should we tune the algorithms to the underlying hardware? The paper gives some hints, for example in Figure 9, but the question remains largely unanswered. For example while the no-partition strategy is not competitive in general, it is clearly a good idea if the build relation is very small. Can we quantify "very small"? Can we detect this at execution time? At compile time? Or what about the hardware itself? Section VI-B mentions that the AMD processor might require a different data flow/access pattern than the Intel processor. Can we develop building blocks such that the appropriate algorithm can be "assembled" according to the available hardware?
Of course these are hard questions, and one cannot reasonably expect to get an answer to all of them in one paper. But the current paper neglects this issue a bit. It primarily compares a particular Manegold-style radix join to the non-partitioning hash join and gives numbers, without spending too much thoughts on the tuning itself.
Medium
Accept (I will argue for acceptance)