
Skeleton Automata for FPGAs: Reconfiguring without Reconstructing — SIGMOD 2012 Reviews

Reviews for the paper "Skeleton Automata for FPGAs: Reconfiguring without Reconstructing", submitted to SIGMOD 2012.

Overall rating: accept (conditionally, with shepherding)

Reviewer 1

Summary of the paper's main contributions and potential impact

This paper extends the complexity of predicates that can be expressed in an FPGA to include a large and common subset of XPath, without incurring a prohibitive compilation overhead, by modeling the XPath expression as a nondeterministic finite-state automaton (NDFSA) and implementing that NDFSA as a sequence of generic "skeleton" segments, each parameterized by the XPath tag and axis for that step. The NDFSA state and axis are stored in FPGA flip-flops, and the (much larger) tags are stored in Block RAM (BRAM). The FPGA generated by this technique was implemented and evaluated to show that documents could be projected at 98.7% of the network's nominal capacity, and close to the FPGA's nominal clock rate.
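To make the summary concrete, here is a rough software analogue of the skeleton idea (my own sketch, not the authors' circuit): segment structure is generic and fixed, and loading a new query only rewrites the (axis, tag) parameters rather than rebuilding anything.

```python
# Rough software analogue (my sketch, not the authors' implementation) of a
# skeleton automaton: the segment *structure* is generic and fixed; only the
# (axis, tag) parameters change when a new query is loaded.
from dataclasses import dataclass

@dataclass
class Segment:
    axis: str  # "desc" for "//" or "child" for "/" -- a flip-flop's worth
    tag: str   # the step's tag name -- would sit in Block RAM on the FPGA

def configure(path):
    """Compile an XPath-like path, e.g. "//regions//item/name", into a
    chain of parameterized segments (no structural rebuild needed)."""
    segs, i = [], 0
    while i < len(path):
        if path.startswith("//", i):
            axis, i = "desc", i + 2
        else:
            axis, i = "child", i + 1
        j = path.find("/", i)
        j = len(path) if j == -1 else j
        segs.append(Segment(axis, path[i:j]))
        i = j
    return segs

def matches(segs, tag_path):
    """Simulate the NDFSA over one root-to-leaf sequence of open tags
    (a real streaming version also handles close tags with a stack).
    `active` holds the set of matched-prefix lengths."""
    active = {0}
    for tag in tag_path:
        nxt = set()
        for k in active:
            if k < len(segs):
                if segs[k].axis == "desc":
                    nxt.add(k)      # "//" may skip this tag
                if tag == segs[k].tag:
                    nxt.add(k + 1)  # consume this tag
        active = nxt
    return len(segs) in active
```

Reconfiguring for a new query is then just another call to `configure` on the same generic structures, which is the point of the skeleton design.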

Strong points of the paper

S1: Significant extension -- to most of XPath -- of the use of hardware accelerators for predicate evaluation beyond the simple "SARGs" (predicates of the form <column> <relop> <expression>) that IBM Netezza can support.
S2: The ideas of the paper were actually implemented in an FPGA and evaluated to show that documents could be filtered with an XPath projection expression in the FPGA at the effective clock rates of the FPGA.
S3: Paper is self-contained and readable by non-hardware types (like myself) by having a nice introduction to the basics of the hardware and programming FPGAs, plus a nice running example and good motivation and English.

Weak points of the paper

W1: Some of the points about multiple paths (Section 4.5) and pipelining and XPath semantics for "descendant-or-self" ("//") axes discussed in Section 6.1 weren't that clear to me (see question in detailed comments). In particular, Section 4.5 could benefit from at least a simple example of what is meant. Are the multiple paths on behalf of the same query? Are they linked by conjunction or disjunction?
W2: I would have liked a bit more thorough evaluation in Section 7, which was presumably limited by space considerations due to all the introductory explanation that made this paper readable. In particular, the workload of XPath expressions and the document corpus used to achieve Figs. 14 and 15 were not given. I would also like to have seen more performance evaluation as a function of different types of XPath expressions (e.g., those involving only "/" paths vs. those having one or more "//" paths). In other words, what really affects the throughput of the FPGAs you generate? There wasn't even any discussion of the few XMark queries shown in Fig. 16!

Innovation

High

Quality of Presentation

High

Quality and repeatability of experiments, if applicable

Unsatisfactory

Detailed comments

C1: It seems that the authors are overloading the term "pipelining" in Section 6.1 to mean two things: (1) an FPGA circuit-design optimization using registers to minimize longest signal paths ("The corresponding circuits [what corresponding circuits?] are thus amenable to pipelining, a very effective circuit optimization technique. Figure 12(b) illustrates the idea."); and (2) starting to process another document immediately (one machine cycle?) after the previous document, i.e., before it completes (In the "Throughput vs. Latency" subsection, "in a fully pipelined circuit, a new input item can enter the circuit every clock cycle".). The reader is further confused by the previous section first discussing the output of the tag decoder, the lengths of signal paths, and its effect on the speed, which seems to me to be a fine point to be discussed after first giving the general idea of pipelining documents through the FPGA. Rewrite this section to be clearer. It seems to me that the first meaning should be given another term, as it enables the second meaning, but maybe this is correct usage in the FPGA community.
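A toy model may help keep the two senses apart (my framing, not the paper's): sense (1), register insertion, shortens the critical path and thereby raises the clock frequency; sense (2) says one item enters per cycle. Together they determine throughput.

```python
# Toy throughput model (my framing, not from the paper) separating the two
# senses of "pipelining" distinguished in C1.
def pipelined_throughput(stages, path_delay_ns):
    """Sense 1: `stages` pipeline registers split a combinational path of
    `path_delay_ns` nanoseconds, shrinking the clock period to
    path_delay_ns / stages. Sense 2: a fully pipelined circuit accepts one
    input item per clock cycle, so items/second equals the clock frequency."""
    clock_hz = 1e9 / (path_delay_ns / stages)
    return clock_hz  # one input item per cycle
```

The sense-1 optimization is what *enables* the sense-2 behavior, which supports the suggestion above that the first deserves its own term.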
C2: In addition (related to C1), it might be instructive to discuss what happens to successive documents having various-length paths when applying an XPath predicate containing a "//" (descendant-or-self) path. It seems to me that processing of document 2 could be stalled by document 1, if the // tag on document 1 was satisfied only after several descendant state transitions, whereas it was satisfied by the immediate descendant state transition in document 2. Isn't that why the "current state" flip-flops are needed to keep track of the state of the FSA for each document in the pipeline (Fig. 7)?
C3: "Effective, projection paths...in the cooked stream." I had trouble following the point you were making, and Fig. 11 didn't help.
C4: In Fig. 1, won't the <open_auctions> portion of the document need to be touched ("evaluated") to process Query Q1, to find the end tag for </site>, even though the path //regions//item/name# has already been satisfied? It isn't underlined. I guess it isn't clear what the meaning of "everything else can be pruned" is, because of the passive voice. WHO can do the pruning? The FPGA? Isn't that part of "evaluating Query Q1"? Please clarify.
C5: Detailed English comments:
-- "without effecting the query outcome" --> "without affecting the query outcome"
-- "we eat the cake and have it too" is more normally said "we can have our cake and eat it, too", at least in American English.
-- "the circuits output" --> "the circuit's output"
-- "determine whether a state is accepting or help us correctly handle..." Huh? Cannot parse.
-- "When arbitrary automata shapes must be supported, this problem is inevitable..." Indefinite reference: to what does "this" refer? Do not use relative pronouns across paragraphs.
-- "...as we discussed it in Section 2.2. The key insight are the..." --> "...as we discussed in Section 2.2. The key insight is the..."
-- "This allows to build..." --> "This allows us to build..." or "This allows building..." (but use of a gerund in the second alternative makes it unclear who does the building). "Allows" cannot take an infinitive as its object.
-- "for a tag in real world and we read out only one character at a time even though..." --> "for a tag in the real world, and we read out only one character at a time, even though..."

Overall recommendation

Strong accept (award-quality paper)

Reviewer 2

Summary of the paper's main contributions and potential impact

The paper aims to implement XML projection using FPGA hardware. The novelty of the paper is the separation of configuration information from the algorithmic structures needed for matching. Generic structures can be reused for a variety of queries, and reprogramming changes only the configuration, not the whole FPGA design (which would involve time-consuming recompilation).

Strong points of the paper

The paper presentation is good. It makes a clear and compelling case for the proposed design.

The real implementation confirms that practical speeds (faster than most disk/network throughput) are achievable.

Weak points of the paper

XML projection in the network is not always efficient. For example, suppose that there are enough parallel queries so that the output from the projection (unioned over all queries) is much larger than the input. That would make the network output the bottleneck. In a software implementation where the CPU does the projection as well as the subsequent processing, that bottleneck would not occur. This is not to say that the proposed techniques are not useful, but it does suggest a dimension of the problem space that should be addressed as a trade-off.
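The concern above can be made concrete with a back-of-envelope calculation (my illustration, not from the paper): if each query's projected output is shipped separately over the network, egress scales with the sum of per-query selectivities and can exceed ingress.

```python
# Back-of-envelope for the egress-bottleneck concern (my illustration):
# each query keeps some fraction of the input document, and each query's
# projection is shipped separately (no sharing across queries), so total
# network output is sum(selectivities) times the network input.
def egress_ratio(selectivities):
    """Bytes sent out per byte received by the in-network projector."""
    return sum(selectivities)
```

Whenever the ratio exceeds 1.0, the projector's outbound link, not the inbound link, becomes the bottleneck; a software setup that projects and processes on the same CPU never pays this shipping cost.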

I would like to see performance results about reprogrammability.

Innovation

High

Quality of Presentation

High

Quality and repeatability of experiments, if applicable

Satisfactory

Detailed comments

In section 2.1, "effecting" should be "affecting".

In Algorithm 1, why don't you assign the output of pop to the appropriate variable?

It's not clear to me how many segments are needed for various kinds of queries. What is the query capacity of a 600-segment device?

Overall recommendation

Accept

Reviewer 3

Summary of the paper's main contributions and potential impact

This paper proposes a hardware design technique ("skeleton automata") which allows flexibly and quickly mapping XML projection workloads to FPGA or other hardware. The implementation allows a large number of queries to filter the data. The observation that hardware structures are largely determined by the nature of the problem has the potential to reduce the cost of developing and using custom hardware, which is increasingly important due to current chip/architectural trends.

Strong points of the paper

S1. The technique represents an elegant and scalable hardware design. The authors would do well to make it very clear that the technique should scale to significantly higher clock frequencies than the FPGA allows, and that it would consume vastly less power even for the same performance (an increasingly important concern).

S2. The hardware-related explanations are clear and easy to follow for a non-specialist audience.

Weak points of the paper

W1. It is unclear whether this specialized hardware would outperform software in practice (see detailed comments), author feedback notwithstanding.

W2. The performance evaluation is unsatisfactory because it doesn't establish the circumstances under which the proposed approach beats the state of the art in software, or even by how much as different factors vary (number of projection paths being a significant one). The experiments do not utilize the best known software parser (SaxonEE could easily be factors faster than SaxonHE), nor do they give the software-only approach the benefit of the same two-phase setup the hardware enjoys.

W3. The proposed technique is argued to be especially useful as the number of queries in the system increases, but this is precisely when projection will be least selective, forcing the downstream engine to parse most or all of the input data even after the projection step completes. While this weakness could potentially be overcome by pushing more of the workflow into the hardware engine, it is not immediately clear how this could be done and the paper has no discussion on this point.

Innovation

Medium

Quality of Presentation

High

Quality and repeatability of experiments, if applicable

Unsatisfactory

Detailed comments

D0. The experiments use a hardware implementation to accelerate a single-threaded Java program by 2-6x (estimated from Fig. 16, but never stated in the paper). This is not very impressive, particularly since the paid-for version of Saxon (EE) might achieve about that much speedup just by compiling its queries to Java bytecode first.

D1. Both the paper and author feedback seem to assume that a SW-only setup must use only one parser for everything. However, it would be perfectly reasonable to use two SaxonEE instances in a pipeline (exactly analogous to the HW+SW setup in the paper), which would allow the first engine to project efficiently and significantly improve the performance of the second. Author feedback argues that parsing costs outweigh any benefit such a chain might provide, but the XML must be parsed twice either way -- the hardware does not pass "cooked" XML to the second phase of processing, and even if it did, there is no reason a software implementation could not do the same.

D2. The memory consumption numbers are misleading, since the free (HE) version of Saxon builds a full parse tree of the XML in memory before processing the query; the author feedback acknowledges that the true memory costs are similar for SW-based streaming, so this result should be retracted. Again, this suggests that a (non-free) SaxonEE-based projector feeding results to a second SaxonEE-based processor might reasonably compete with the hardware-based implementation.

D3. Figure 16 is less meaningful when the relative contributions of parse and execute time are unknown (e.g., improving already-short parse times would not help overall performance).

D4. The effectiveness of projection as a separate step decreases rapidly as the number of queries increases, because together they will tend to reject a decreasing fraction of the data. Author feedback points out that most such queries will be subqueries of the same overall expression (and that all data passed to the second stage is therefore meaningful), but this doesn't change the fact that the second stage will tend to be the bottleneck under such circumstances.

D5. It is unclear from the text whether finite automata present a particularly challenging case for FPGA place and route, or whether the latter is simply a slow and painful process in general (I believe it is the latter).

D6. It would be helpful to discuss what happens if a node's history stack were to overflow, and how to maintain correctness when this occurs.

D7. There should be at least a high-level discussion of how fragmentation of the skeleton chain can be handled: over time, as queries enter and leave, the chain will tend to contain many short gaps which are too small to be useful. In theory, the monotonic nature of matches should allow moving a query's chain by replicating it elsewhere first, no?
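The copy-then-release move suggested in D7 might look like this in a software model (my sketch; the slot layout and naming are hypothetical, and a real device would switch matching to the replica before releasing the original):

```python
# Hypothetical sketch of defragmenting the skeleton chain (D7): a query's
# segments are first replicated into a sufficiently large free gap, and only
# then is the old copy released -- safe because, per the review, matches are
# monotonic, so the replica can take over without losing state.

def find_gap(chain, need):
    """Return the start index of the first run of >= `need` free slots,
    or None if no such gap exists. Free slots are represented by None."""
    run = 0
    for i, slot in enumerate(chain):
        run = run + 1 if slot is None else 0
        if run == need:
            return i - need + 1
    return None

def relocate(chain, start, length):
    """Move the query occupying chain[start:start+length] into a fresh gap.
    Returns the new start index (or the old one if no gap is available)."""
    dest = find_gap(chain, length)
    if dest is None:
        return start
    chain[dest:dest + length] = chain[start:start + length]  # replicate first
    chain[start:start + length] = [None] * length            # then release
    return dest
```

Repeating this for each resident query would coalesce the short gaps that accumulate as queries enter and leave the chain.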

D8. The power consumption of FPGA vs. software is not discussed, but may be relevant given the modest performance edge (< 10x) currently reported. Clock speed is another worthwhile issue: I see no reason the design is really limited to ~180MHz, and suspect that frequencies 10x higher would be straightforward to achieve if the FPGA's internal limitations were not a factor (e.g. in an ASIC implementation). This would directly lead to 50x speedup or more, assuming the input data can be supplied that fast.

Overall recommendation

Weak accept

Kontakt

Prof. Dr. Jens Teubner
Tel.: 0231 755-6481