
Parallel Computation of Skyline Queries – FCCM 2013 Reviews

Reviews for the paper "Massively Parallel Computation of Skyline Queries with FPGAs", submitted to FCCM 2013.

Overall rating: accept

Reviewer 1

OVERALL EVALUATION

6 (Full submission: Very good full paper)

Technical quality: This is a measure of the overall technical quality of the paper

5 (Good)

Originality

5 (Highly Original)

Presentation: This is a measure of the clarity of the presentation, including organization and grammar

5 (Good)

Background: This is a measure of the thoroughness of the background material and references.

5 (Good)

Appropriateness: The appropriateness of this paper for FCCM. Regardless of the technical content, if you feel this paper should be submitted elsewhere, give a low mark here

5 (Good)

Applications Papers: If this is an application paper, does it discuss novel use of the reconfigurable device used, describe lessons learned that others can take advantage of, or do a comparative study of different architectures?

3 (Yes)

Review

This paper describes an FPGA implementation of skyline query computation. It gives a clear definition of the skyline query problem and the algorithms to solve it. To achieve high parallelism, the BNL algorithm is implemented as a two-phase pipeline. In the evaluation phase, the data of each input tuple is compared with the working tuple. In the shift phase, the interactions between neighboring tuples are performed based on the comparison in the evaluation phase.
Mechanisms are designed to copy multiple BRAM words between neighboring processing elements in one clock cycle without hurting the throughput.
Their experiments show that the proposed technique achieves a higher throughput than both a single-threaded software implementation and the fastest parallel implementation on a state-of-the-art multicore server.
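The two-phase hardware pipeline described above maps the block-nested-loops (BNL) idea onto a chain of processing elements. As a point of reference, a minimal software sketch of the sequential BNL dominance filter (assuming minimization in every dimension; all names are illustrative, not taken from the paper) might look like:

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every dimension
    and strictly better in at least one (here: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def bnl_skyline(tuples):
    """Block-nested-loops skyline: keep a window of candidate skyline
    tuples; compare each input tuple against the whole window."""
    window = []
    for t in tuples:
        if any(dominates(w, t) for w in window):
            continue  # t is dominated by a candidate, discard it
        # t survives: evict window tuples that t dominates, then insert t
        window = [w for w in window if not dominates(t, w)]
        window.append(t)
    return window

points = [(1, 9), (3, 3), (9, 1), (4, 4), (1, 8)]
print(bnl_skyline(points))  # → [(3, 3), (9, 1), (1, 8)]
```

In the FPGA design, the window comparisons that this loop performs sequentially are what the pipeline of PEs carries out in parallel.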

The paper gives very clear background knowledge on skyline queries and their computation. The authors conducted thorough experiments. The reference point is convincing: it clearly demonstrates that the FPGA implementation reaches a higher throughput than both single-threaded and multi-threaded parallel software implementations.

The paper would be improved if the authors gave more insight into:
1. Why does the FPGA implementation scale while the software implementation doesn't?
2. Why does the memory hierarchy lead to performance differences when processing data with different levels of correlation?

Reviewer 2

OVERALL EVALUATION

6 (Full submission: Very good full paper)

Technical quality: This is a measure of the overall technical quality of the paper

4 (Fair)

Originality

3 (Minor incremental)

Presentation: This is a measure of the clarity of the presentation, including organization and grammar

5 (Good)

Background: This is a measure of the thoroughness of the background material and references.

5 (Good)

Appropriateness: The appropriateness of this paper for FCCM. Regardless of the technical content, if you feel this paper should be submitted elsewhere, give a low mark here

5 (Good)

Applications Papers: If this is an application paper, does it discuss novel use of the reconfigurable device used, describe lessons learned that others can take advantage of, or do a comparative study of different architectures?

3 (Yes)

Review

This paper is about accelerating the computation of Pareto sets from a set of n-dimensional vectors.

The paper is very well written and the Lemmings examples are awesome to read. The transition to the parallel BNL with FPGAs is, however, a bit abrupt.

For an FCCM paper, a reader would be more interested in the PE design; most people at this conference already know very well how to compute a Pareto set. For example, the BRAMs seem to be underutilized because they store only one tuple and use only one port. Why not compare two components in parallel by using the second port? Why not store multiple tuples in one PE and do time multiplexing? This might reduce I/O demands. For a 32-bit comparator and a state machine, 320 LUTs seems quite large. Please explain why. Is it to keep the PE generic? Is there something complex inside the PE that needs more logic?
To summarize: the FPGA implementation should have been discussed in more depth.

One thing I haven't understood is the following: you stream a tuple component-wise from PE to PE. However, only after comparing the last component do we know whether the input tuple has to be stored in a free PE. So do you keep a local copy inside the PEs? Or is data moved tuple-wise?

What I like in this work (but what is not really discussed) is that the approach seems to be highly scalable, both with respect to dimensions and with respect to resources.
One thing I wondered about: for higher dimensions, BNL is inferior to D&C variants. This is not an issue for the six-dimensional test case in the experiments, but it might be for the 15-dimensional one. Have you considered D&C?
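For readers unfamiliar with the divide-and-conquer alternative mentioned above, a simplified (not asymptotically optimal) sketch is: split the input, compute both halves' skylines recursively, then cross-filter each half against the other half's skyline. The code below is illustrative only, assuming minimization in every dimension and distinct tuples:

```python
def dominates(a, b):
    """a dominates b: at least as small in every dimension, and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def dc_skyline(pts):
    """Divide-and-conquer skyline (simplified): recurse on both halves,
    then keep only tuples not dominated by the other half's skyline.
    Transitivity of dominance makes filtering against the skylines
    (rather than the full halves) sufficient."""
    if len(pts) <= 1:
        return list(pts)
    mid = len(pts) // 2
    left, right = dc_skyline(pts[:mid]), dc_skyline(pts[mid:])
    merged = [p for p in left if not any(dominates(q, p) for q in right)]
    merged += [q for q in right if not any(dominates(p, q) for p in left)]
    return merged

print(dc_skyline([(1, 9), (3, 3), (9, 1), (4, 4), (1, 8)]))
# → [(3, 3), (9, 1), (1, 8)]
```

The practical advantage for high dimensions comes from refined merge steps in the literature's D&C variants; this sketch only conveys the recursion structure.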

One comment on the title: I wouldn't call 192 PEs massively parallel these days.
You could do so if you cascaded multiple boards and demonstrated a way to scale this up for distributed databases…

Finally, there is rising interest in database acceleration in the FPGA research community as well, and it would be great if you released your implementation as open IP.

Reviewer 3

OVERALL EVALUATION

6 (Full submission: Very good full paper)

Technical quality: This is a measure of the overall technical quality of the paper

6 (Outstanding)

Originality

4 (Major incremental)

Presentation: This is a measure of the clarity of the presentation, including organization and grammar

6 (Perfectly clear)

Background: This is a measure of the thoroughness of the background material and references.

5 (Good)

Appropriateness: The appropriateness of this paper for FCCM. Regardless of the technical content, if you feel this paper should be submitted elsewhere, give a low mark here

6 (Completely appropriate)

Applications Papers: If this is an application paper, does it discuss novel use of the reconfigurable device used, describe lessons learned that others can take advantage of, or do a comparative study of different architectures?

3 (Yes)

Review

A particularly strong example of massive parallelism achieved by moving data through the processing elements, in a database application that is both compute-intensive and throughput-intensive. Impressive performance versus much larger conventional solutions. Inherently scalable.

Very well written indeed, making great use of a fun yet directly applicable game example to introduce the algorithms. Clear, easily understood text. Simple, readable graphics.

It goes without saying that there is a big energy-efficiency advantage here. Reporting specific power figures for the FPGA, the Xeon, and the PowerEdge would be a good addition.

Contact

Prof. Dr. Jens Teubner
Tel.: 0231 755-6481