Sprungmarken

Servicenavigation

Hauptnavigation

Sie sind hier:

Hauptinhalt

Shifter Lists - Interleaving Data Storage and Computation – FPGA 2013 Reviews

Reviews for paper Shifter Lists - Interleaving Data Storage and Computation, submitted to FPGA 2013.

Overall rating: reject

Reviewer #1

Reviewer's Scores
Technical Contribution and Quality (1-6): 4
Originality: 2
Readability: 5
Relevance: 6
Confidence: 5
Paper Topic: Application/Implementation
Overall: 5
Comments

The paper is well written and provides both concrete and intuitive understanding of the skyline acceleration as implemented. However as the primary focus and claimed contribution of the paper is the novel 'shifter-list' abstraction, this is the arena where I have significant issues. The abstraction seems ill described and quantitatively justified as being novel from well studied and understood abstractions/frameworks that employ pipeline-parallelism and dataflow/streaming style computation networks. Without a more rigorous definition of the novelty and quantitatively justification for generality/performance-optimality/productivity-improvement it becomes difficult for me to consider this as a key contribution of the paper.

Novelty issues:

- minimal extension over extensively studied ideas for streaming/data-flow style processing on FPGAs (Optimus, Maxeler Technologies, Kahn process networks, etc) as well as more classic work such as systolic arrays and network processors such as Xelerated Data and the Intel IXP 2800.

- key contribution claim is novelty shifter-list 'abstraction' but this simply seems to be a chain of FIFO connected blocks with a FSM node controller executing a typical 2-phase compute/communicate protocol

=> poor technical differentiation from prior approaches

=> claims from 2.2 regarding this difference are not quantitatively justified in the paper

i) generic implementation strategy

ii) awareness of communication cost built-in

- secondary contribution is the concrete acceleration of a skyline database query which is reasonably presented and offers some improvement over a typical CPU parallelization strategy

Technical issues:

=> comparison against CPU implementations uses tuples/sec as a throughput metric and reports total query time as a label

-> queries per second seems to be the more relevant metric (especially for a database style accelerator)

-> relevance of choices for dataset sizes / dimensionality are not well defined (why 1M 7-tuples?)

=> generality of the 'abstraction' is only demonstrated for a single algorithm with qualitatively described 'sketches' provided for 2 more algorithms

=> productivity benefits of 'abstraction' are unclear as the decomposition of an algorithm and mapping into a set shifter-list data-structure appears to be a manual process and doesn't even appear to benefit from a common set of templated shifter-nodes (at least no such library is claimed in the paper)

Readability issues:

- well written grammatically and technically

- needs a description of a high-level application of the skyline operator to motivate the basic use-cases for this application study (ex. from previous literature these seem to include finding pareto-optimal subsets of N hotels given a M-dimensional set of properties, or similarly for baskets of equities in a financial environment) => such a description is necessary to understand the expected latency-per-query/queries-per-second performance requirements and the expected dimensionality/dataset-size in typical use cases which is underdescribed in this paper

- reasonable intuition into algorithmic transformation through use of analogies

Summary:

- the concrete implementation of the shifter-list implementation style for the skyline database query is well presented and offers reasonable speedup in both worst and best cases compared to a tuned parallel cpu implementation

- however, the primary focus on the 'abstraction' as a generic strategy is insufficiently justified and how it precisely offers a difference between previously described frameworks

- the use of the abstraction on a single application does not offer sufficient evidence of generality, optimal mapping, or productivity benefits

Reviewer #2

Reviewer's Scores
Technical Contribution and Quality (1-6): 4
Originality: 4
Readability: 4
Relevance: 5
Confidence: 4
Paper Topic: Other
Overall: 5
Comments

This paper presents a FPGA-based database architecture, shifter lists. Processing of parallel data is pipelined to overlap communication operation and computation operation, and Block-Nested Loops (BNL) algorithm is broken into evaluation and shift phases to execute algorithm locally. Experimental results show good scalability as window size increases, and improvements compared with CPU designs in most cases. The benefits of the shifter lists are covered, and the comparison with CPU designs seems reasonable.

The nearest neighbour communication pattern (section 2.2) is used in shifter lists to reduce additional energy and latency introduced by wiring. However, the scatter-gather pattern may be preferred in certain scenarios, for example, data with low density (section 6.3.2). It would be interesting to have some comparisons between the shifter lists and other architectures.

Moreover, several other FPGA-based architectures have been proposed. The FIFO-based merge sorter proposed in [16] combines pipelined processing and scatter-gather communication patterns. The improvements for shifter lists would be more concrete if it can be compared with other published architectures and results, such as [16], to show the limitations of previous work and how they are addressed in shifter lists.

One of the aims for this paper is to provide abstraction that aids in building parallel database implementations. However, only one application is implemented. It is not clear how much effort is needed to map other database applications into the proposed architecture. Will the node phases, node states, communication mechanisms proposed in section 5 be modified?

How much resource is consumed for a single node? What is the limiting resources for current skyline application? Will these be affected if different database algorithms are implemented? It would be clearer about how the abstraction helps building database applications if these issues can be discussed.

[16] http://dl.acm.org/citation.cfm?id=1950427

Reviewer #3

Reviewer's Scores
Technical Contribution and Quality (1-6): 4
Originality: 3
Readability: 5
Relevance: 5
Confidence: 6
Paper Topic: System-level tools
Overall: 5
Comments

The whole paper is well-written. The topic is interesting and particularly relevant to FPGA community, i.e., how to extract an abstract computing pattern from existing applications that are traditionally in FPGA domain? more importantly, how to take advantage of FPGA's special capability and exploit such computing pattern?

The author identified an important database application for the target. There are three claimed contributions: 1. a new data structure abstraction (shifter lists) that helps designing massively parallel algorithms for a number of related database tasks. 2. Shifter lists combine data organization, computational power, and synchronization into a novel parallel processing model that naturally supports the characteristics of FPGAs. 3. illustrate shifter lists based on a common database operator, skyline.

My overall assessment about this paper is that although the whole presentation is well-organized and the problem formulation is clear, the major technical contribution is not significant. Specifically:

For Contribution 1, I failed to fully appreciate the novelty of the proposed shifter list data structure. The streaming computing model has been researcher extensively in Bill Dally's work, although in general-purpose computing. For the type of computing task the author described (a long stream of input data come in the target computing device sequentially and will be processed sequentially. Moreover, there is strong temporal locality among input data items.), it seems to me that chaining the processing elements as a pipeline and using local memory to process data as input data flows by is a rather obvious choice. In fact, these kind of organization is quite well-known among FPGA designers. For example, the previously proposed computing pattern parallel scan and the way an FPGA design implements a h.264 decoder.

For Contribution 2, the reason of simple data synchronization is due to the nature of the target applications. It would be much more interesting if the target application originally has complicated data dependency, but applying the concept of shifter-list mitigates this problem. Furthermore, distributing data with computation is quite natural to any FPGA application.

For Contribution 3, I think the author did a great job to describe the skyline application. However, the author only compared the performance of the proposed shifter-list with a multi-core processor solution. I would think that comparing the proposed shift-list with a more conventionally implemented FPGA solution of skyline would be more fair. In other words, if an FPGA designer doesn't know the concept of shifter-list, what would he typically do given the characteristic of skyline?

Reviewer #4

Reviewer's Scores
Technical Contribution and Quality (1-6): 4
Originality: 2
Readability: 1
Relevance: 5
Confidence: 4
Paper Topic: Application/Implementation
Overall: 5
Comments

the point on how building specialized implementations leaves hardware potential unused is unclear --- by the description it is incorrect, but perhaps you mean instead trying to map many algorithms to pre-defined hardware (which is not building a specialized implementation)

how are shifter lists different from the prior (numerous) streaming computation models (including SCORE, StreamIt, BlueSpec, ...)

it is curious to compare this technique to general purpose high level synthesis --- that implies automatic translation and generality to be able to map from a high level input language (or at least a significant subset)

the lemmings analogy may be a valuable descriptive technique, but too much time is taken to applying the analogy, especially in adding stories about the results of different algorithms instead of using plain english to explain strengths and weaknesses of each algorithm. In this case, the shifter list technique described sounds much like standard software pipelining and techniques commonly employed in stream computation.

you state that the shifter list is implemented in a state automaton, but that the nodes in state X must be shiftered to the end of the pipeline --- this is unclear; if you are actually shifting the order of how hardware communicates this is a poor idea, if you are simply allowing the state to propagate through multiple blocks this would work fine

your text doesn't match the figures (e.g. you state that dotted transitions between W & O states are omitted for readability, but there are transitions there) it is unclear from your diagram that this set of transitions is general to handle all query operations and not just the BNL/SL algorithm you demonstrate

the graphs in terms of non normalized throughput makes it difficult to see the magnitude of the difference although the trend is clear. Because you want to claim speedup (and do), you should show graphs of speedup instead

although you make a legitimate claim about the comparitive prices of the parallel server and the FPGA implementation, you don't describe how the parallel skyline algorithm works (and, importantly how if at all it differs from your algorithm) if the parallel skyline algorihtm chose a similar strategy, your shifter list has even less novelty, and if it did not chose a similar strategy, you should argue why that strategy is or is not appropriate for an FPGA-based design.

Furthermore, you specify that your implementation achieves only ~1% of the performance limit by the memory sub-system. You don't specify resource use for 192 nodes, but this implies that you cannot fit more nodes into the FPGA and those nodes have poor bandwidth use. You don't show how you derive the upper bandwidth limitation -- your design choices may limit the upper bound on achieveable bandwidth due to clock speed and datapath width in a way that has nothing to do with the number of shifter list nodes.

Related Information



Nebeninhalt

Kontakt

Prof. Dr. Jens Teubner
Tel.: 0231 755-6481