Dependable Cardinality Forecasts for XQuery — VLDB 2008 Reviews

Reviews for the paper “Dependable Cardinality Forecasts for XQuery”, submitted to VLDB 2008.

Overall rating: accept

Reviewer 1

Novelty

Medium

Technical Depth

Medium

Presentation

Good

Overall Rating

Weak Accept

Summary of the paper's contribution and Impact (up to one paragraph)

This paper provides a detailed description of the cardinality estimation process for a large subset of XQuery.

Three strong points of the paper (please number them S1, S2, S3 in order of importance; if none, write "none")

S1. Provides a detailed description of the cardinality estimation process for a useful subset of XQuery.
S2. Experiments with the XMark benchmark show great improvements in Pathfinder query performance due to the new cardinality estimates.

Three weak points of the paper (please number them W1, W2, W3 in order of importance; if none, write "none")

W1. Motivation for this work is somewhat incomplete.
W2. Experimental evaluation is not sufficient.
W3. The technique is done outside the RDBMS engine, and thus often reinvents the wheel.

Detailed comments (please number them)

First, I’d like to question the need for complete query cardinality estimation in this system. Cardinality information is needed to choose between execution plans. However, the authors don’t explain at all what choices can be made given their query algebra. Instead, they rely on the DB2 relational optimizer, armed with better selectivity estimates, to pick the optimal plan. Thus, only the selectivity of certain key predicates is needed to guide the DB2 optimizer. It is not clear that computing cardinality information for every algebra operator is the only, or even the best, way to achieve this.

The proposed algebra seems to be fairly comprehensive, but I have concerns over the size of the resulting plans. If a fairly simple query Q1 results in the plan of Figure 3, with dozens of operators, one has to wonder how the system is going to handle real-life, page-long XQueries. I would like to see some data on the running time and memory requirements of the optimizer. If this cardinality estimation were used during the enumeration of a large number of plans (as is customary for cardinality estimation), the cost may get out of control. For instance, the computation of histograms for every operator seems a bit heavyweight.

Pathfinder works outside the RDBMS engine, so it has to re-implement cardinality estimation for textbook operators from scratch. This makes it difficult for Pathfinder to match the features of commercial optimizers. Instead, the paper seems to fall back occasionally on System R style uniformity assumptions. For example, it seems that joins are always estimated assuming uniformity (rules Card-5, 8, 9). I realize that almost any given technique for dealing with non-uniform data can be implemented in the context of this system, but why reinvent the wheel rather than utilize the cardinality estimation already available (for example, in DB2, if that’s what the authors use)?
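For reference, the textbook System R style estimate I have in mind here is (the standard formula; not claimed to be the exact form of rules Card-5, 8, 9):

$$|R \bowtie_{R.a = S.b} S| \;\approx\; \frac{|R| \cdot |S|}{\max\bigl(V(R,a),\, V(S,b)\bigr)}$$

where $V(R,a)$ is the number of distinct values of column $a$ in $R$. The formula assumes each value occurs equally often and that the smaller set of distinct values is contained in the larger one, which is exactly where skewed data breaks it.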

The proposed scheme relies on XML data distribution statistics that may be collected for any sequence of XPath steps. Thus there is a potentially infinite space of statistics that could be collected. Are there any plans to decide which statistics should be collected? Is there a way to use superset statistics if nothing else is available, e.g., use the histogram of /a//b for the expression /a/b? A sketch of what I mean follows.
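To illustrate the kind of fallback I have in mind, a minimal sketch in Python; all names below are hypothetical and nothing here is taken from the paper:

def generalize(path):
    # Yield progressively more general variants of an XPath path,
    # e.g. /a/b -> /a//b -> //b; each variant matches a superset
    # of the nodes matched by the original path.
    steps = path.lstrip("/").split("/")
    yield "/" + "/".join(steps)                      # exact path first
    for i in range(len(steps) - 1, 0, -1):
        yield "/" + "/".join(steps[:i]) + "//" + "/".join(steps[i:])
    yield "//" + steps[-1]                           # most general

def lookup_histogram(stats, path):
    # Prefer exact statistics; otherwise fall back to the first
    # available superset histogram. A superset histogram only gives
    # an upper bound, so estimates derived from it are conservative.
    for candidate in generalize(path):
        if candidate in stats:
            return stats[candidate]
    return None                                      # nothing usable

For example, with statistics collected only for /a//b, lookup_histogram(stats, "/a/b") would return the /a//b histogram rather than failing.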

This work does have a lot in common with DB2 pureXML cost-based optimization as described in [2]. The remark in the related work section that the DB2 optimizer is limited to twigs in the XPath surface language is not true. Both System RX and DB2 pureXML employ a number of rewrite rules before the cost-based optimization in order to unnest XQuery expressions and merge twigs in the surface language. For example, the entire Q1 should be rewritten into a single operator “box” by DB2 pureXML, and handled by the DB2 cost-based optimizer holistically.

Many ideas in the paper are not novel individually. E.g., the notions of fanout and selectivity of XPath steps and expressions were used in [2]. Inference of domain sizes has a lot in common with the tracking of column cardinalities by commercial optimizers. However, they are pulled together into a nice and comprehensive, albeit densely explained, scheme.

It is notoriously hard to evaluate the quality of query optimization schemes. Unfortunately, this paper includes only a brief description of experiments using only a single synthetic benchmark (XMark). This benchmark focuses on a particular type of XML application, with a very homogeneous dataset and fairly simple, somewhat text-oriented queries. It would be nice to include experiments with other kinds of datasets and queries. Also, it may be interesting to see comparisons of the DB2 pureXML query plans with Pathfinder’s, since their cardinality estimation techniques have much in common.

Reviewer 2

Novelty

Medium

Technical Depth

Medium

Presentation

Good

Overall Rating

Accept

Summary of the paper's contribution and Impact (up to one paragraph)

The paper describes a framework to compute reliable cardinality estimates for queries expressed in the XQuery language. The framework has the property of being easily embeddable into a traditional relational query processor.

Three strong points of the paper (please number them S1, S2, S3 in order of importance; if none, write "none")

S1. The paper aims to solve the cardinality estimation problem for XQuery, which has tremendous applicability in fast query processing.
S2. The content is well presented, with well-described formulas that make it easy for the reader to understand and reimplement the ideas.

Three weak points of the paper (please number them W1, W2, W3 in order of importance; if none, write "none")

none

Detailed comments (please number them)

1. Typo on page 4, section 3.1, line 6, first word: know -> known.
2. It might be helpful to explain the operators in slightly more detail.
3. I think co-locating each rule with its commentary would help the reader.

Reviewer 3

Novelty

Medium

Technical Depth

Medium

Presentation

Good

Overall Rating

Accept

Summary of the paper's contribution and Impact (up to one paragraph)

The identification of proper query plans heavily relies on the presence of good cardinality estimates for the (sub)expressions involved in the query. While general estimation techniques have been developed and widely adopted in cost-based query evaluation in the relational context, the problem is less well understood in the context of the XQuery language. The current paper provides a solid framework (based on inference rules) for cardinality estimation for XQuery queries. It does this by mapping XQuery into a relational algebra dialect that, apart from the standard relational operators, also incorporates XML-specific operators to deal with navigational axes and the like. The formal framework associates with each operator in this algebra a set of annotations (constant columns, estimated size, and abstract domain identifiers that make it possible to relate sets of values) and then propagates these annotations using inference rules. It is noteworthy that the inference rules only take local information into account, meaning that only information from the current operator's vicinity is used. In order to obtain sensible cardinality estimates for the XPath-related operators, the presence of generic statistics-gathering functions (fan-out & selectivity) is assumed. It is shown that they can be implemented without much difficulty using, e.g., Data Guides. The benefit of having cardinality estimates, as provided by the framework, is reported in the experimental section. It is shown that some of the XMark queries, when compiled into SQL and enriched with selectivity information, experience a dramatic increase in efficiency due to the selection of a better plan by the optimizer.
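To make the flavor of this local, rule-based inference concrete, a minimal sketch in Python; the operator set, field names, and rule bodies below are illustrative only, not the paper's actual algebra or rules:

from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                  # "table" | "select" | "step" | "cross"
    children: list = field(default_factory=list)
    base_card: float = 0.0     # for "table": base row count
    selectivity: float = 1.0   # for "select": from collected statistics
    fanout: float = 1.0        # for "step": result nodes per input node

def card(node: Node) -> float:
    # Bottom-up, purely local inference: each rule sees only the
    # annotations of the operator's direct inputs.
    ins = [card(c) for c in node.children]
    if node.kind == "table":
        return node.base_card
    if node.kind == "select":
        return ins[0] * node.selectivity
    if node.kind == "step":
        return ins[0] * node.fanout
    if node.kind == "cross":
        return ins[0] * ins[1]
    raise NotImplementedError(node.kind)

# 1000 rows, a filter with selectivity 0.1, then an XPath step with
# fan-out 3 yields an estimate of 300.0.
plan = Node("step",
            [Node("select", [Node("table", base_card=1000.0)],
                  selectivity=0.1)],
            fanout=3.0)
assert card(plan) == 300.0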

Three strong points of the paper (please number them S1, S2, S3 in order of importance; if none, write "none")

S1. A nice, clean and flexible approach to cardinality estimation for XQuery is proposed. As far as I know, this is one of the first formal treatments of this subject.
S2. Experiments show its usefulness. Although not surprising, they show the merits of the rule-based system.
S3. A well-written paper, explaining most parts in sufficient detail.

Three weak points of the paper (please number them W1, W2, W3 in order of importance; if none, write "none")

W1. No formal properties of the inference system are derived. E.g., would it be possible to have a set of rules that (i) always underestimates the size, or (ii) always overestimates the size?
W2. It is not reported what the overhead is of computing the estimates.
W3. In the experimental section, the focus is on showing the usefulness of the approach for all XMark queries rather than on showing the behavior on various data sets. Clearly, the validity of the approach also depends on how different plans are generated for the same query depending on the data set.

Detailed comments (please number them)

P1. The rule-based system is clearly very general and easily extendible. In the database theory community, XQuery has recently been extended to work on so-called annotated XML. Here, the annotations are values coming from some semiring, and the XQuery operators are translated into operations in the semiring. I wonder whether the annotations could be used for cardinality estimation, i.e., depending on the choice of semiring, one could obtain different kinds of estimates suited to the application in mind. (See “Annotated XML: Queries and Provenance”, PODS ’08, Tannen et al.)
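To spell out the connection (a known fact about provenance semirings, not something from the paper under review): over the counting semiring $(\mathbb{N}, +, \times, 0, 1)$, annotation propagation computes exact bag multiplicities,

$$\mathrm{ann}_{R \cup S}(t) = \mathrm{ann}_R(t) + \mathrm{ann}_S(t), \qquad \mathrm{ann}_{R \bowtie S}(r \cdot s) = \mathrm{ann}_R(r) \times \mathrm{ann}_S(s),$$

so the exact result cardinality is recovered as $\sum_t \mathrm{ann}(t)$; other choices of semiring could, in principle, propagate bounds or coarser abstractions instead of exact counts.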

P2. Could you shed some light on how a pure XQuery approach (without the detour through the relational setting) could be achieved? You mention that your proposal is also useful for XQuery implementations that are not based on a relational backend, but this is not so clear.
