Reviews for paper Efficient Frequent Item Counting in Multi-Core Hardware, submitted to (P)VLDB 2012.

**Overall rating:** reject

(This work was later published at KDD 2012.)

Reject

The paper proposes a new algorithm for frequent item counting that can make good use of a multi-core system. The main approach is to add a filter stage before a state-of-the-art single-threaded algorithm (called Space-Saving) so as to create a two-stage execution pipeline. The filter stage can be implemented efficiently using vectorized instructions. A multi-stage pipeline is proposed to take advantage of pipelined parallelism to utilize multiple cores. Experimental evaluation shows that the approach is beneficial for highly- skewed datasets.

S1: The insight that the Space-Saving algorithm can be extended with a filter stage that can exploit data skew; thereby also bringing a pipelined parallelism approach for efficiency.

S2: The filtering stage is implemented efficiently with vectorized instructions and no branching.

S3: The paper is easy to read.

S4: The motivation for the work is strong: the number of cores is increasing as are data sizes, and item counting is a core component of many complex analysis tasks.

W1: The overall problem setting is never made clear. (See detailed description below.)

W2: The efficiency of the approach only shows when the degree of skew is high.

W3: While the paper focuses on pipelined parallelism, the applicability of partitioned parallelism is brushed aside without a deeper study.

W4: The algorithm does not seem to scale well beyond two cores.

YES

With some new ideas

The core intuition is to add a "cache" layer to an existing algorithm that exploits high data skew. The use of vectorized instructions is also motivated by past work.

Improvement over existing work

Syntacticaly complete but with little contribution

Ok, but certain claims are not covered by the experiments

Reasonable: improvements needed

No

D1. (On W1) While the paper defines the frequent item counting problem, the overall problem setting is never stated clearly. I can imagine at least four different problem settings:

1. The input comes as a single online stream of data.

2. The input is resident on local storage that only allows a sequential read.

3. The input comes in a parallel fashion as multiple individual data streams.

4. The input is stored on local disk in a way that different partitions of it can be processed in parallel.

These problem settings can admit different solutions. Since the problem setting is not clear, I had trouble understanding why "online queries" and the consequent merge overheads for partitioned parallelism are an issue. What applications need 2000 queries/second (Figure 13)? What are these queries?

D2: (On W2) Skew in real-life datasets is an established fact. However, the paper makes a leap of faith when it uses such a general motivating statement and then claims that the proposed approach can give an order of magnitude improvement. First of all, the analytical and empirical evaluation is done only using synthetic data with Zipfian skew. While I do not see this issue as a major drawback, it will improve the paper if the authors point out that Zipfian skew is not the only form of skew or possibly the predominant form (e.g., see The “DGX” Distribution for Mining Massive, Skewed Data, http://www.informedia.cs.cmu.edu/documents/zhiqiang-sigkdd01-dgx.pdf)

I am more concerned about what degrees of skew are common in practice, and the consequent question of how the proposed approach works in that setting. Slide 7 in http://dimacs.rutgers.edu/~graham/pubs/slides/skewed- slides.pdf gives some typical z factors where the highest z factor is 1.6.

When we dig deeper into the experimental results in this paper, the performance improvements for z factors below 1.6 is small (definitely nowhere near an order of magnitude). It will improve the presentation of the paper if the authors do a survey of typical skew factors, and made an unbiased presentation of the empirical results.

D3: (On W3) I felt that the applicability of partitioned parallelism was brushed aside rather quickly in Section 2.4.1 and never revisited again in the paper:

-- Why not try more adaptive approaches to partitioned parallelism such as: (i) run-time load balancing across the cores if a poor partitioning function was picked originally; (ii) a sampling phase to pick a better partitioning function in the first place (this approach would not apply to pure streaming problem settings; see comment D1). For the levels of "degree of skew" considered in the paper, (ii) may work well.

-- The comparison with Space-Saving in Figure 12 would be more fair if some (simple?) partitioned-parallel-version of Space-Saving is used.

-- It will be useful to include a graph that shows how the overhead of a partitioned-parallel-version of Space-Saving increases as the number of online queries increase.

A related aspect. Space-Saving is the state-of-the-art algorithm developed for a single core. Please give more intuition behind your choice to create a parallel version of this algorithm, rather than developing a parallel algorithm from scratch.

D4: Figure 14 shows negligible improvement as the number of cores is increased beyond 2. Is the choice of pipelined parallelism coming back to bite you?

D5: The presentation can be improved in a number of places:

-- Unclear sentence in Section 1: To date, there is no parallel implementation of frequent item counting that reaches the performance of the best sequential implementation of the algorithm. (reaches the performance?)

-- Unclear sentence in Section 3.1: Data structures for X that support this will increase the cost of count-based lookups in X. (what is "this"?)

-- Please provide more details on the multi-stage algorithm. The end of Section 4.1 is sparse.

-- What is Turbo-boost?

D6: If the items are strings instead of integers, would using the vectorized code need a preprocessing step to replace the strings with integers?

Reject

The paper proposes an implementation of a frequent item counting algorithm to compute heavy hitters on a modern multicore machine. The algorithm extends the previously proposed "Space-Saving" algorithm with a preprocessing step to handle (more efficiently) the true heavy hitters. Speed-ups are observed in workloads where heavy hitters dominate the data stream.

S1. Paying attention to the features/pitfalls of modern hardware is a good idea. Making the best use of multicore hardware is important now that clock frequencies are not increasing.

W1. There is a fundamental performance disconnect between this work and recent work on multicore aggregation (e.g., ref [20]). See the detailed comments.

YES

Novelty unclear

No impact

Syntacticaly complete but with little contribution

Ok, but certain claims are not covered by the experiments

Reasonable: improvements needed

No

D1. There is no comparison of the method with relevant performance numbers from previously published work on computing exact aggregates. Reference [20] is a good candidate; dismissing it because it needs "clever memory arrangements" is unsatisfactory. The present work requires additional complexity at least as high as (if not higher than) [20].

D2. When one does compare the results of [20] with the present work, one finds:

a) The machines used are comparable.

b) When [20] explores a Zipf distribution with skew parameter 0.5, the bandwidth is between 100 million records/sec (large domain cardinality) and 1100 million records/sec (small domain cardinality). The numbers presented in the present paper are about 15 million records/sec.

c) While [20] does not provide results for other values of the skew parameter, it is likely than performance would also improve for the same reasons as in the present paper, i.e., the small local cache-resident table would be hit more often.

D3. Reflecting on D2(c), it seems that [20] (and some prior work before that) also uses the technique of special efficient handling of the common case values. The claims of novelty need to be put in context.

D4. The dismissal of techniques such as locking, partitioning and sharing is naive. Yes, each alone is probably a poor solution, but prior work (such as [20] and the references cited therein) show how it is possible to use combinations of these approaches to achieve efficiency. Also, the authors need to mention atomic primitives based on compare-and-swap as another implementation option. Given the performance results of [20], the dismissal of hash-based techniques is also questionable.

D5. Similarly, the authors claim that the combination step is a bottleneck. This may be true for the Space-Saving algorithm, but is not true for aggregation in general, where the combination step is efficient and can be done in parallel.

D6. [20] uses more space than the proposed solution, which could be seen as an advantage for the present paper. Traditional heavy-hitter methods have tried to minimize space, using motivations such as the implementation in hardware on a router. However, for the present paper that aims to use modern multicore CPUs with many gigabytes of RAM available, the motivation for economizing on space is weak. In particular, at very high performance levels one must be storing the data in RAM because neither the network nor the I/O devices could keep up. If so, one already has a high RAM budget.

D7. To summarize, [20] gives exact answers where the present paper gives an approximation; [20] performs an order of magnitude (or two) faster; while [20] uses more space, there is no strong motivation in the present paper to minimize space.

Reject

The authors propose a new algorithm exploiting modern computer architectures (multi-core machines) to speed up frequent item counting. The novel algorithm contains a filtering stage which filters the input into "heavy hitters" and the remaining items. The look-up of new items for the "heavy hitter" set is speed up by using SIMD instructions. The filtering stage is then combined with a conventional frequent item algorithm.

S1: paper is reasonably well written. It would have been good to explain some terminology instead of just using it.

S2: Frequent item counting is a important problem that is part of many data mining algorithms.

S3: It is a complete paper, however with a very narrow focus.

W1: The main weak point of the paper is the lack of depth. The presented ideas are rather straightforward and unsurprising. The proposed work is embedded in much of related work, but the new ideas are few and far in- between and almost the 'first best thing to do" with the problem. It is a well-written paper, it just suffers from a lack of depth to be competitive within VLDB.

YES

Ideas are too simple (say how)

The paper lacks depth and the ideas are straightforward (how to filter the data, and how to implement the SIMD look-up).

No impact

Syntacticaly complete but with little contribution

Ok, but certain claims are not covered by the experiments

Reasonable: improvements needed

No

It is a complete paper, but it would be better to submit it to a less competitive conference.

- submission (PDF)
- final paper (PDF) — published at KDD 2012