Algebra-based XQuery cardinality estimation

The Authors

Sherif Sakr, NICTA, University of New South Wales, Sydney, Australia

Abstract

Purpose – Estimating the sizes of query results and intermediate results is crucial to many aspects of query processing. All database systems rely on the use of cardinality estimates to choose the cheapest execution plan. In principle, the problem of cardinality estimation is more complicated in the Extensible Markup Language (XML) domain than the relational domain. The purpose of this paper is to present a novel framework for estimating the cardinality of XQuery expressions as well as their sub-expressions. Additionally, this paper proposes a novel XQuery cardinality estimation benchmark. The main aim of this benchmark is to establish the basis of comparison between the different estimation approaches in the XQuery domain.

Design/methodology/approach – As a major innovation, the paper exploits the relational algebraic infrastructure to provide accurate estimation in the context of XML and XQuery domains. In the proposed framework, XQuery expressions are translated into an equivalent relational algebraic plans and then using a well defined set of inference rules and a set of special properties of the algebraic plan, this framework is able to provide high-accurate estimation for XQuery expressions.

Findings – This paper is believed to be the first which provides a uniform framework to estimate the cardinality of more powerful XML querying capabilities using XQuery expressions as well as their sub-expressions. It exploits the relational algebraic infrastructure to provide accurate estimation in the context of XML and XQuery domains. Moreover, the proposed framework can act as a meta-model through its ability to incorporate different summarized XML structures and different histogram techniques which allows the model designers to achieve their targets by focusing their effort on designing or selecting the adequate techniques for them. In addition, this paper proposes benchmark for XQuery cardinality estimation systems. The proposed benchmark distinguishes itself from the other existing XML benchmarks in its focus on establishing the basis for comparing the different estimation approaches in the XML domain in terms of their accuracy of the estimations and their completeness in handling different XML querying features.

Research limitations/implications – The current status of this proposed XQuery cardinality estimations framework does not support the estimation of the queries over the order information of the source XML documents and does not support non-numeric predicates.

Practical implications – The experiments of this XQuery cardinality estimation system demonstrate its effectiveness and show high-accurate estimation results. Utilizing the cardinality estimation properties during the SQL translation of XQuery expression results in an average improvement of 20 percent on the performance of their execution times.

Originality/value – This paper presents a novel framework for estimating the cardinality of XQuery expressions as well as its sub-expressions. A novel XQuery cardinality estimation benchmark is introduced to establish the basis of comparison between the different estimation approaches in the XQuery domain.

Article Type:

Research paper

Keyword(s):

Benchmarking; Algebra; Database management; Extensible Markup Language.

Journal:

International Journal of Web Information Systems

Volume:

4

Number:

1

Year:

2008

pp:

6-47

Copyright ©

Emerald Group Publishing Limited

ISSN:

1744-0084

1 Introduction

The Extensible Markup Language (XML) (Bray et al., 2006) has been introduced by the end of the 1990s in order to create a standard data-format for the world wide web which can be easily handled by computers as well as by humans. In recent years, XML has found practical application in numerous domains including data interchange, streaming data and data storage. The semi structured nature of XML allows data to be represented in a considerably more flexible nature than in the traditional relational paradigm. As XML continues to grow in popularity, large repositories of XML documents are going to emerge, and users are likely to pose increasingly more complex queries on these data sets. In 2001, XQuery is decided by the world wide web consortium as the standard XML query language (Boag et al., 2006). XQuery is based on a hierarchical and ordered document model which supports a wide variety of constructs and use cases. The language addresses a wide range of requirements, thus incorporating a rich set of features.

Estimating the sizes of query results and intermediate results is crucial to many aspects of query processing. Accurate forecasting for the selectivity of a given query has many advantages such as:

In principle, the problem of cardinality estimation is more complicated in the XML domain than the relational domain. The main reason behind this is that the XML queries involve structural conditions in addition to the value-based conditions. Therefore, any complete and accurate size estimation system for the XML queries requires maintaining statistical summary information about the structure of the source XML documents as well as statistical summary information about the distribution of the values associated with the XML nodes. Currently, XML database management systems still lack a convincing means to estimate cardinalities for arbitrary XQuery (sub-) expressions. All of the existing work in this domain remains limited to a subset of XPath expressions and twig queries but none of them addresses the critical issues of XQuery language such as iterators, nested queries, arithmetic operations, comparison operations, etc. In this paper, we present a novel framework for estimating the cardinality of arbitrary XQuery expressions as well as its sub-expressions.

The work of this paper was developed within the Pathfinder project (Pathfinder, available at: http://pathfinder-xquery.org). The aim of the Pathfinder project is to implement the XQuery language to query XML data stored on relational database systems. The architecture of Pathfinder is designed in a front-end/back-end fashion. Pathfinder receives an XQuery expression, which is parsed, normalized, simplified, type checked, optimized, and translated into an equivalent relational algebraic plan. This algebraic plan thus serves as an intermediate representation for the XQuery expression which is then in our context translated into SQL code over the conventional RDBMS (Grust et al., 2007) and in other contexts translated into MIL code over MonetDB DBMS (Boncz et al., 2006; Rittinger, 2005). At the time of writing, this work, the Pathfinder XQuery compiler supported all features of XQuery langauge except user-defined recursive functions.

The intermediate algebraic plans of Pathfinder play a main role in our framework. As a major innovation, we exploit these algebraic plans as well as the relational estimation techniques to provide accurate cardinality estimations in the context of XML and XQuery domains. In particular, the main contributions of the work of this paper is a novel framework for XQuery cardinality estimation with the following main advantages:

Additionally, this paper introduces a novel XQuery cardinality estimation benchmark. Unlike the usual XML benchmarks, the main focus of this benchmark is to establish the basis of comparison between XQuery cardinality estimation approaches in terms of the accuracy of estimation and the completeness in handling the different features of XQuery language.

The rest of this paper is organized as follows. In Section 2, we give an overview of our proposed XQuery cardinality estimation framework. In Section 3, we give an overview of the loop-lifting compilation technique which translates XQuery expressions into their equivalent intermediate relational algebraic plans. In Section 4, we describe the main building blocks of our proposed XQuery cardinality estimation framework. In Section 5, we describe the process for estimating the cardinality of XQuery expression as well as its sub-expressions. Our proposed benchmark for XQuery cardinality estimation system is presented in Section 6 to establish the basis of the experimental analysis introduced in Section 7. Section 8 reviews the related work before we conclude in Section 9. The appendix of this paper fully describes the queries of our proposed benchmark for XQuery cardinality estimation system.

2 XQuery cardinality estimation framework

Figure 1 shows framework of our proposed XQuery cardinality estimation. In this framework, the cardinality-estimation process for XQuery expressions runs through the following steps:

  1. The source XML documents have to be mapped (shredded) into an encoding relational scheme. During the shredding process, we build our summarized tree structure to capture summary information about the structure of the source XML document. We name this summary structure as the statistical guide.
  2. To support the estimation of the value-based predicates, we need to build statistical histograms for capturing the distribution of the data values. As it is always a trade-off between the required estimation accuracy and the used storage space, the decision to build the histograms for capturing the distribution of the atomic values of the most frequently used guide nodes of the statistical guide is left to the system administrator.
  3. Using the loop-lifting compilation techniques, the Pathfinder algebraic translation module translates its input XQuery expressions into their equivalent relational algebraic plans (Grust et al., 2004; Teubner, 2006).
  4. The XQuery selectivity estimation module receives four inputs, the outputs of the previous three steps, plus inference rules for the cardinality estimation of the Pathfinder algebraic operators (Section 5.1) and returns back the estimated cardinality for each operator in the algebraic plan.
  5. The estimated cardinality for the root operator in the Pathfinder algebraic plan represents the estimated cardinality for the whole XQuery expression, while the estimated cardinalities for the intermediated operators represent the estimated sizes of their correspondent sub-expressions.

3 XQuery algebraic compilation

Relational algebra has been a main component in relational database systems, and has played an important role in their success for gaining widespread usage. A corresponding XML algebra would have the same importance and substantial role for XML query processing. In our context, having an adequate algebraic compilation for XQuery expressions provides us with the solid infrastructure for predicting the cardinality of the main XQuery expressions and its sub-expressions in a very convenient and accurate way. Many proposals for an algebra for XML query processing have been introduced (Brantner et al., 2005; Chen et al., 2003; Sartiani and Albano, 2002; Jagadish et al., 2001; Paparizos et al., 2002; Zhang and Tompa, 2003). According to Re et al. (2006), existing algebras for XQuery fall into two classes:

  1. Tuple-based algebra. This class of algebra tries to facilitate the use of relational optimization techniques as well as the whole relational query processing framework (theory, compilation, optimization, execution).
  2. Tree-based algebra. This class of algebra provides more natural support for novel XML-specific optimizations and manipulates XML data modelled as forests of labelled ordered trees.

The Pathfinder project has a special module for compiling XQuery expressions into its own dialect of tuple-based algebra producing equivalent relational query plans. The Pathfinder algebra is quite primitive such that it can efficiently fit within the capabilities of SQL-based systems. Pathfinder compiles the XQuery core dialect listed in Table I into relational query plans using the set of algebraic operators listed in Table II. The Pathfinder algebraic operators are designed to receive one or more inputs and produce one or more outputs. These inputs and outputs are in the form of sets of tuples. Most of these operators are rather standard or even restricted variants of the operators found in a classical relational algebra which allow the query optimizer to employ the usual relational algebraic optimization techniques such as the pushdown of the selection operator. For example, the attachment operator (@ a:v ) appends one a new attribute a to the tuples of the input relation R where the value for the new attribute is assigned via the value v. The selection operator (σ a ) filters the tuples of the input relation R to only the tuples where the Boolean selection attribute a equals to true. The Boolean selection attribute (a) are usually produced from applying the comparison operator (Fixed graphic 1). The unsorted row numbering operator (# a ) and the sorted row numbering operator (ϱ a:(o 1, … ,o n )/p ) are two of the most important Pathfinder's algebraic operators. Together, they are responsible of preserving the order concept defined by the XQuery/XPath specifications (Fernández et al., 2006). The unsorted row numbering operator (# a ) receives an input relation R and returns the same relation extended with a new consecutive numbering attribute (a) from 1 to n where n is the cardinality of the input relation R. The sorted row numbering operator (ϱ a:(o 1, … ,o n )/p ) receives an input relation R, an ordering attribute list (o 1, … o n ) and an optional partitioning attribute p. It returns the same relation extended with a new consecutive numbering attribute (a). The numbering attribute (a) respects the tuple sorting of relation R defined by the order specification (o 1, … o n ) and restarts the numbering from one for each partition defined by the optional partitioning attribute (p). The XPath location step evaluator operator (Fixed graphic 2 item:(α, n)) is responsible for evaluating XPath expressions. It returns the result nodes for each iteration of the input context relation in the document order and duplicate free. Figure 2 shows examples for the behavior of some Pathfinder algebraic operators.

In this section, we give an overview of the loop-lifting technique, which is considered to be the heart of this compilation process. For a detailed and complete description of this technique and its translation rules we refer to Teubner (2006), Grust et al. (2004) and Grust (2005). In a nutshell, loop-lifting is the key technique used to compile XQuery iterations into efficient bulk style application of algebraic operators. The principal idea behind the compilation scheme is that every XQuery expression occurs in the scope of an iteration. The iterations of each scope are encoded by a column iter in the associated relational representation. In order to clarify the loop-lifting idea, we will show some examples of Pathfinder loop-lifted translations for some XQuery expressions into their associated Pathfinder algebraic relational representation.

3.1 Sequences

The XQuery language is designed to operate over ordered, finite sequences of items as its principal data type. The evaluation of any XQuery expression yields an ordered sequence of n≥0 items. These items can be either atomic values (integers, strings, … , etc.) or XML tree nodes. An XQuery item sequence (x1, … x n ) is encoded with a relational table with a schema (pos, item) where the column pos is used to preserve the order information between the items inside the target sequence. The item column is a polymorphic column. In the case of atomic items, it stores the values of the encoded atomic items (1, “A” …) and in the other case of XML tree nodes, it stores the pre-order ranks of the encoded nodes. The RDBMS supports the representation of such polymorphic columns using the variant data type. The empty sequence ( ) is encoded with an empty table with the same schema (pos, item).

3.2 FLWOR expressions

The FLWOR expression is one of the main features provided by the XQuery language. It is used for representing iterations and for the binding of variables to intermediate results. It is also used for computing joins between two or more sequences and for restructuring data. A loop of n iterations is represented by a relation loop with a single column iter of n values (1, 2, … n). The compilation of variables bound in the iterations of FLWOR expression is translated into a relational table with a schema (iter, pos, item) where each tuple of the encoding relation (i, p, x) indicates that for the i-th iteration, the item at position p stores the value x.

3.3 Path steps

The compilation of path steps is represented in the algebraic plans using the XPath evaluator operator (Fixed graphic 3). It takes a context relation (iter, item) as input, where the item column stores the node identifiers of the input context nodes. It uses the XPath encoding relation of the current live nodes (Γ) to evaluate the path step (α, n) and returns a new relation with the same schema (iter, item). In the output relation, the item column stores the node identifiers of the resulting nodes. Figure 3 shows an example for the loop-lifted compilation of the path step where Figure 3(a) shows the input context nodes, Figure 3(b) shows the sample source XML document and Figure 3(c) shows the output context nodes from applying the path step child::c over input context nodes and the sample source XML document.

3.4 Arithmetic and comparison expressions

Given the relational representation R 1(iter1, item1) and R 2(iter2, item2) of two XQuery expression e 1 and e 2, the design of the loop-lifting compilation of the arithmetic and comparison expressions requires that the schema of the relational representation (R i ) of each argument expression (e i ) must have a single node identifier column (item i ) for each iteration (iter i ) with no order preserving column (pos). The arithmetic expression e 1 e 2 is evaluated by joining R 1 and R 2 over their iterations attribute iter, for each tuple of the result we apply the arithmetic operator (∘) over the two arguments columns item1 and item2, and store the result in a new column res. The algebraic representation of the arithmetic expression (e 1 e 2) is defined as follows: Equation 1 where R 1 and R 2 are consequentially representing the relational representation of the two expressions e 1 and e 2.

Similarly, the comparison expression e 1 Fixed graphic 4 e 2 is evaluated by joining R 1 and R 2 over their iterations attribute iter, for each tuple of the result we apply the comparison operator (∘) over the two argument attributes item1 and item2, and store the result in a new attribute res. The comparison operators are normally followed by a selection operator to filter the tuples satisfying the comparison condition. The algebraic representation of the comparison expression (e 1 Fixed graphic 5 e 2) is defined as follows: Equation 2

4 Building blocks of XQuery cardinality estimation

The problem of cardinality estimation is more complicated in the XML domain than the relational domain. The main reason behind this is that the XML queries involve structural conditions in addition to the value-based conditions. Therefore, any complete and accurate size estimation system for the XML queries requires maintaining statistical summary information about the structure of the source XML documents as well as statistical summary information about the distribution of the values associated with the XML nodes. In this section, we describe the building blocks of our XQuery cardinality estimation system which capture the summary information about the structure and the values of the source XML documents. In addition, we describe the process of inferring some special properties for the Pathfinder algebraic plans. These special properties play a vital role in our estimation process using the relational estimation techniques as we will show next in Section 5.

4.1 Statistical guide

The first step in estimating the result size of an XQuery expression is to summarize the source XML document into a compact graph structure which we name the statistical guide. The statistical guide represents an implementation for the data guide summary tree structure presented in Goldman and Widom (1997) and is very similar to the path tree summary structure presented by Aboulnaga et al. (2001). Every node in the statistical guide represents a rooted path starting from the root node of the source XML document. The root node of the statistical guide represents the root element of the document. The statistical guide has a node for every distinct rooted path in the source XML document. Each node of the statistical guide is labeled with the tag name of the elements reachable by the path it represents and annotated with the number of occurrences for such node (rooted path). Figure 4 shows an XML document and its associated statistical guide. We construct the statistical guide of an XML document during the normal shredding process of the XML document using the XPath accelerator storage scheme presented by Grust (2002). It is thus constructed during the normal scan of the document using a SAX parser. To estimate the cardinality of an XPath expression using the statistical guide, we need to evaluate the XPath expression over the statistical guide. This will result in a set of nodes which correspond to the query path expression. Hence, the estimated cardinality of the XPath expression is defined by computing the sum of the number of occurrences of these resulting nodes. Owing to its compact size even for larger XML documents, the statistical guide can easily fit into main memory and be made available for the query processor during the compilation process.

4.2 Histograms

Histograms are the most popular summary data structures used for estimating query result sizes in relational systems. In our XQuery cardinality estimation system, it is used to estimate the number of nodes satisfying a specified value predicate. Although, we can build a histogram for each value node (text node-attribute) in the statistical guide, it is always a trade-off between the required estimation accuracy and the used storage space. In practice, we leave it to the system administrator to decide building the required histograms for the frequently used nodes in the statistical guide. A histogram H on node X is constructed by partitioning the data distribution of the node values into N mutually disjoint subsets called buckets and then counting the number of values for each bucket. The histogram's buckets can be defined according to different partitioning rules that searches for effective distribution of the values which leads to accurate estimations. The values in the range of each bucket is assumed to be uniformly distributed between the lowest and highest values in the bucket. Poosala et al. (1996) has provided a survey on several well known histograms such as equi-width, equi-depth and serial histograms (Ioannidis, 1993) where each of them is using different partitioning rules and constraints. In particular, each of these histograms has its own advantages. For example, equi-depth histogram works well for range queries only when the data distribution has low skew while serial histogram has shown to be optimal for equality joins and selection when a list of all the attribute values in each bucket is maintained.

In fact our proposed estimation framework is not dependent on or affected by the type of the used histograms and it is flexible to integrate any of them. Currently, we are using on standard the simple equi-width histograms for collecting the statistical information of the relevant nodes however it is also open for the system administrator to use any other type of histograms without affecting the estimation process.

4.3 Histograms operation

4.3.1 Histograms constant manipulation

The XQuery language provides the capability of performing arithmetic operations. One variation of the instances of the arithmetic operators ⊚ x:(x, y) inside the algebraic plan is the instance where one argument (x) is represented by an item sequence of atomic values and the other argument (y) is a constant value v. Assuming the data distribution for the atomic values of the item sequence representing the first argument (x) is captured by the histogram (H1). Applying the arithmetic operation yields a resulting item sequence where the data distribution is captured by the histogram (H2). The characteristics of the data distribution of (H2) is defined by manipulating the boundaries of the histogram (H1), its minimum value, its maximum value and the boundaries of its associated buckets with the defined arithmetic operation and constant value (V) while the distribution of the values inside each buckets remains as the same. Figure 5 shows a visual example of manipulating the distribution of the data values of histogram H1 with different arithmetic operations and the constant value five.

4.3.2 Histograms arithmetic

Another variation of the instances of the arithmetic operators ⊚ z:(x, y) is the situation where both of the arguments (x, y) are represented by an item sequence of atomic values. Assume that the data distribution for the atomic values of the item sequence representing the two arguments (x, y) are captured by the two histograms (HX, HY), respectively. In this situation we rely on the histogram discretization algorithm presented by Berleant (1993) for performing arithmetic operations over histograms and computing the resulting histogram H3 which captures the distribution of the values of the resulting attribute z. The steps of performing the arithmetic operation ∘ over the two histograms (HX, HY) using the histogram discretization algorithm are described as follows:

  1. Compute the Cartesian product of the buckets of two histograms (HX, HY).
  2. For each member (HX i , HY j ) in the Cartesian product, an intermediate result bucket is computed by:
    • executing the arithmetic operation ∘ on HX i and HY j to get the intermediate resulting bucket HZ ij =HX i HY j ; and
    • associating with the intermediate bucket HZ ij the size information equal to the size of the bucket HX i multiplied by the size of the bucket HY j .
  3. The resulting intermediate buckets are combined to get a final result by:
    • Deciding on a set of intervals partitioning the domain of HZ. This partition determines the placement of the boundaries of the buckets of the resulting histogram.
    • Calculating the size of each bucket of the resulting histogram HZ as follows:
      1. any intermediate result bucket HZ ij that falls completely within some member of the partition has its entire results added to that member; and
      2. any intermediate result bucket that overlaps more than one member of the partition has its size divided among them proportionally to the fraction of the intermediate result bucket it overlaps.

To illustrate let us consider the following example:

Let us assume that the histogram HX where its buckets information are described as follows (Table III). The buckets of the histogram HY are described in Table IV.

Applying the multiplication operation between the two histograms HX and HY will produce the intermediate buckets as shown in Table V.

The decision of the resulting number of buckets and partitioning values can be defined in many ways. We use the heuristic rule for deciding the number of the resulting buckets to be equal to the maximum number of the buckets of the two argument histograms. In our example, the resulting histogram HZ will consist of three buckets covering the range of values between 0 and 120 and the width of each bucket is equal to 40. Hence, the buckets of the histogram HZ are described as shown in Table VI, where the resulting bucket number 1 combines the full data of the intermediate buckets 1, 2, 3 and fractions from the buckets 4, 5. The resulting bucket number 2 combines fractions from the data of the buckets 4, 5, 6 and the resulting bucket number 3 contains a fraction of the data in the intermediate bucket number 6.

4.3.3 Histograms comparison

One variation of the instances of the comparison operators Fixed graphic 6 z:(x, y) is the instance where both of the arguments (x, y) are represented by an item sequence of atomic values. Assuming the data distribution for the atomic values of the item sequence representing the two argument (x, y) are captured by the two histograms (HX, HY), respectively. In this situation and using the same idea of the histogram discretization algorithm, the selectivity of the resulting boolean attribute z is computed by applying bucket by bucket comparison of the Cartesian product between the two histograms (HX, HY).To illustrate let us represent the following example:

Let us consider the histogram HX where its buckets information are described as shown in Table VII. The buckets of the histogram HY are described as shown in Table VIII.

Estimating the selectivity of the comparison HX<HY will be computed as shown in Table IX.

For the intermediate buckets number 1, 2 and 4 it is guaranteed that any value which belongs to the HX i bucket is less than any value belongs to the HY j bucket. Hence, the estimated size of the number of pairs satisfying the conditions (HX i <HY j ) is equal to the total size. For the intermediate bucket number 5, it is guaranteed that there is no value which belongs to the HX i bucket can be less than any value which belongs to the HY j bucket. Hence, the estimated size of the number of pairs satisfying the conditions (HX i <HY j ) is equal to 0. For the intermediate buckets number 3 and 6, we use the mean value of the HX i bucket to estimate the number of pairs satisfying the comparison condition. In the case of the buckets comparison 3, the mean value of HX i is equal to 7.5 and the size of the pairs satisfying the condition (HX i <HY j ) is estimated as follows: Equation 3 In the case of the buckets comparison 5, the mean value of HX i is equal to 12.5 and the size of the pairs satisfying the condition (HX i <HY j ) is estimated as follows: Equation 4 The selectivity of the resulting Boolean attribute z is then computed by dividing the estimated sum of all pairs satisfying the condition (HX i <HY j ) by the total number of the pair values as follows: Equation 5

4.4 Guide node annotation

Having an accurate estimation of XPath expression is a very crucial aspect for having accurate XQuery cardinality estimation system. In our presented algebraic compilation of XQuery expression, the XPath evaluator operator (Fixed graphic 7 item:(α, n)) is used for representing the XPath steps. The XPath evaluator operator (Fixed graphic 8 item:(α, n)) receives an input context relation (ctx) with the schema (iter, item) where the item column stores the node identifiers of the input context nodes as well as an encoded XPath accelerator relation for the associated live nodes fragment (Γ) and returns back the set of output context nodes from the evaluation the path step (α, n) over the input context nodes. In order to estimate the cardinality of the XPath evaluator operator Fixed graphic 9 item:(α, n)), we need to keep track of the node identifiers of the input context nodes. Hence, we represent a new annotation for the item columns storing the node identifiers of the context nodes, we name it as guide node annotation. An item column of the context relation (ctx) annotated with the guide node property X indicates that the list of the node identifiers stored in the item are corresponding to the node with the pre-order rank X of the statistical guide. Having an input context relation where the item column is annotated by the guide node property X. Applying the XPath evaluator operator (Fixed graphic 10 item:(α, n)) over the context relation ctx annotates the item column of the resulting context relation with the guide node property equal to Y, where Y is the set of the pre-order rank(s) of the resulting node(s) from applying the path step (α, n) over the guide node X. A detailed explanation for the steps of estimating the cardinality of the XPath evaluator operator Fixed graphic 11 will be presented in Section 5.1.

4.5 Special properties for the algebraic plan

Grust (2005) has shown that further analysis on the Pathfinder algebraic plans reveals the following two important observations:

  1. Pathfinder algebraic plans always include several occurrences of common sub-plans. Based on this observation, it is more suitable and effective to use direct acyclic graphs (DAGs) instead of operators tree for representing the resulting algebraic plans.
  2. Owing to the design of the loop-lifting compilation rules, it is noticed that some of the algebraic operators have specific properties such as: all join operators (⋈) are equi-join, the projection operators (π) do not need to remove any duplicates and all union operators (Fixed graphic 12) are receiving disjoint argument relations.

Based on these observations, it becomes possible infer a variety of properties for the algebraic operators of the Pathfinder algebraic plans. Additionally, Sakr (2007) has added to these properties three cardinality estimation related special properties (GuideNode, Histogram, Selectivity) to keep track of the information about the context nodes paths and values to support the selectivity-estimation process.

Table X lists the properties we are able to infer and propagate during the plan analysis of the algebraic plan. These properties play a vital role in our proposed size estimation system for XQuery. Making use of these inferred properties for having accurate size estimations for the results of XQuery expressions will be described in Section 5.1. Figures 6-8 show the complete and sound set of inference rules for inferring and propagating the special properties during the loop-lifting compilation process of XQuery expressions. These inference rules are represented in the form: Equation 6 where each inference rule establishes a relationship between a set of premises and conclusions, whereby the conclusion is said to be inferable from the premises. In other words, whenever in the course of some logical derivation the given premises have been obtained, the specified conclusion can be taken for granted as well. Examples of the used notations for representing the special properties of the algebraic operators listed in Table X are given as follows:

For a detailed description of the inference rules of the special properties of the Pathfinder algebraic plan (Figures 6-8) we refer to Sakr (2007). Remarks about the inference rules are given as follows:

  1. In the inference rules of the domain property, we use the annotation a (D+) to represent the assignment of a new domain identifier (D+) for the attribute (a). For example, some of the algebraic operators commonly propagate all domain information of the attributes from their input relations to their output relation without any changes and additionally the operators create a new domain identifiers (D+) for their resulting (extending) attributes (r) (Rule Dom-1). These operators are: the attachment operator (@), the unsorted row numbering operator (#), the sorted row numbering operator (ϱ), the arithmetic operator (⊚), the comparison operator (Fixed graphic 13), the negation operator (¬) and the document access operator (Δ).
  2. If the attribute item in the input context relation R is annotated with the guide node property (Ign) (the set of input guide nodes), then applying the XPath location step evaluator operator (Fixed graphic 14 item:(α, n)) over the input context relation R changes the annotation of the attribute item in the resulting relation by the set of the guide nodes (Ogn) resulting from applying the step (α, n) over the set of guide nodes (Ign) in the statistical guide (Rule Guide-6).
  3. The document access operator (Δvalue:item) uses the guide node(s) information annotated to the item column of the context relation to retrieve their associated histogram(s) from the statistical repository. The retrieved histogram(s) are annotated to the histogram property of the resulting value attribute (Rule Hist-3). If the item column of the context relation is annotated with more than one guide node information, then each of the retrieved histograms is assigned a weight of a participation (H i .weight) which is equal to the computed percentage by dividing the size of its associated guide node (StatisticalGuide.getSize(gns i )) by the total size of all annotated guide nodes StatisticalGuide.getSize(gns). Another alternative in this situation is to merge the retrieved histograms for the set of the guide nodes into one histogram. However, we prefer to use and maintain the weight of each histogram for gaining more accurate estimation of the selectivity values by applying the comparison operators. If the item column is annotated with only one guide node, then the weight percentage of the retrieved histogram is naturally equal to 100 percent. In fact, the document access operator is responsible for retrieving the atomic values of the context nodes. Consequently, the document access operator is the algebraic operator which is responsible for introducing the histogram property into the algebraic plans.
  4. If the attribute a in relation R is annotated with the histogram property X and the attribute a in relation S is annotated with the histogram property Y (each of X and Y represents a set of histograms which may contain one or more histogram), then the resulting attribute a in (R Fixed graphic 15 S) combines all the histogram information of both input attributes where the weight of each element histogram in the resulting set of histograms Z (Z.Histogram(X i ).weight and Z.Histogram(Y j ).weight) is computed by dividing the computed values of associated tuples in the source relation (X.Histogram(X i ).weight*C1 and Y.Histogram(Y j ).weight*C2) by the total number of the tuples of the resulting relation (C1 + C2) where C1 is the cardinality of the relation R and C2 is the cardinality of the relations S (Rule Hist-4).
  5. The instances of the arithmetic operators (⊚ z:(x, y)) in the Pathfinder algebraic plans come in one of the following three situations:
    • The argument attribute x is a constant attribute with the value v1 and the argument attribute y is a constant attribute with the value v2. In this case, the resulting attribute z is a constant attribute with the value (v1∘v2) (Rule Hist-7).
    • The argument attribute x is a constant attribute with the value v and the argument attribute y is annotated with the histogram property equal to H1 where H1 represents a set of histograms. In this case, the resulting attribute z is annotated with the histogram property equal to H2. The set of histograms H2 is a result of manipulating each element of the set of histograms H1 with the arithmetic operation ∘ and the constant value v (Rule Hist-9). In Rule Hist-9, the function ArithHistConst represents the process of histogram manipulation using a constant value discussed in Section 4.3.
    • The argument attribute x is annotated with the histogram property equal to the set of histograms H1 and the argument attribute y is annotated with the histogram property equal to the set of histograms H2. In this case, the resulting attribute z is annotated with the histogram property equal to the set of histograms H3. The set of histograms H3 is a result of manipulating the histogram(s) of H1 with the arithmetic operation ∘ and the histogram(s) of H2 (Rule Hist-8). In Rule Hist-8, the function ArithHistHist represents the process of applying arithmetic operations on histograms discussed in Section 4.3.
  6. If the attribute a in relation R is annotated with the selectivity property s1 and the attribute a in relation S is annotated with the selectivity property s2, then the selectivity property of the resulting attribute a in (R Fixed graphic 16 S) is computed by dividing the sum of the tuples with the True value in R and S by the total sum of all tuples in R and S (Rule Sel-3).
  7. The instances of the comparison operators (Fixed graphic 17 z:(x, y)) in the Pathfinder algebraic plans come in one of the following three situations:
    • The argument attribute x is a constant attribute with the value v1 and the argument attribute y is a constant attribute with the value v2. In this case, the resulting Boolean attribute z is either annotated with the selectivity property equal to 1, if (v1∘v2) is equal to true, or with the selectivity property equal to 0, if (v1∘v2) is equal to false (Rule Sel-7). In Rule Sel-7, the function CompareConstConst evaluates the comparison operator (v1∘v2) and returns the equivalent selectivity value.
    • The argument attribute x is annotated with the histogram property equal to the set of histograms H and the argument attribute y is a constant attribute with the value v. In this case, the resulting Boolean attribute z is annotated with the selectivity property equal to (s) where the values of (s) is computed using the following two steps (Rule Sel-8):
      1. Computing the intermediate selectivity values s i by comparing each element histogram H i with the constant value v.
      2. The final resulting selectivity values s is equal to the weighted average of the intermediate selectivity values s i . It is computed by summing the values resulting from multiplying each intermediate selectivity values s i with the weight of each associated histogram H i .weight.
    • The argument attribute x is annotated with the histogram property that is equal to the set of histograms H1 and the argument attribute y is annotated with the histogram property that is equal to the set of histograms H2. In this case, the resulting Boolean attribute z is annotated with the selectivity property equal to (s) where (s) is computed using the following three steps (Rule Sel-9):
      1. Computing the intermediate selectivity values s i results from applying the comparison operator ∘ over the values of each pair of histograms (H1 i and H2 j ).
      2. Each intermediate selectivity value s i is assigned a weight value equals to the resulting value from multiplying the weights of its associated pair of histograms (H1 i and H2 j ).
      3. The final resulting selectivity values s is equal to the weighted average of the intermediate selectivity values s ij . It is computed by summing the values resulting from multiplying each intermediate selectivity values s ij with its computed weight s ij. weight.In Rule Sel-9, the function CompareHistHist represents the process of histograms comparison discussed in Section 4.3.
  8. If the relational R is annotated with the cardinality property equal to C, then applying the unsorted row numbering operator (# a ) or the sorted row numbering operator with no partitioning argument (ϱ a:(o 1, … ,o n )) over the input relation R annotates the resulting numbering attribute a with DomainSize property equal to C (Rules DomSize-6, DomSize-7).

5 XQuery cardinality estimation

5.1 Estimation rules

This section presents the main part of the estimation process which integrates the different building blocks of the XQuery cardinality estimation framework. In this section, we present the inference rules for estimating the cardinality of XQuery expression and its sub-expressions using the previously mentioned building blocks. As previously mentioned, Pathfinder compiles the input XQuery expression into its equivalent algebraic plans. The XQuery expressions are compositional in nature as sub-expressions are combined with each other to form the final query. Our approach is able to estimate the cardinality of the final XQuery expression as well as its sub-expressions. It uses the Pathfinder intermediate algebraic plans and the relational well-known estimation techniques to estimating the cardinality of each algebraic operators in the query plan. Our current set of rules is supporting the most typical and frequent constellations of Pathfinder algebraic plans. Currently, we are working on defining further sophisticated rules to increase the forecast quality and to complete the set of possible arbitrary Pathfinder algebraic constellations. Figure 9 shows the inference rules for estimating the cardinalities of Pathfinder algebraic operators. In Section 5.2, we will provide an example to describe the use of these inference rules in the estimation process.

Remarks about the estimation inference rules are given as follows:

  1. Since the two input relations R, S of the union operator (Fixed graphic 18) are guaranteed to be always disjoint (Grust, 2005), the estimated cardinality of (R Fixed graphic 19 S) is equal to the sum of the cardinality of R (C1) plus the cardinality of S (C2) (Rule Card-2).
  2. An accurate estimation of XPath location steps is crucial for an accurate estimation of the whole XQuery expression. Assuming that the cardinality of the input relation R is equal to C1 and the item column storing the node identifier of the input context nodes is annotated with the set of guide node(s) (Igns), the estimated cardinality of the operator Fixed graphic 20 (α, n)(R) is achieved through the following steps represented in Rule Card-15:
    • Using our summarized tree structure, the statistical guide, we can get the sum of cardinalities of the input guide node(s) (Igns) represented as (C2).
    • We apply the location step (α, n) over the statistical guide with the input context guide nodes (Igns) and the resulting guide nodes (Ogns) are annotated to the resulting item column.
    • Once again, the statistical guide is used to get the sum of cardinalities of the output guide node(s) (Ogns) represented as (C3).
    • Using the cardinality of the input relation (C1), the cardinality of the input guide nodes (C2) and the cardinality of the output guide nodes (C3), the estimated cardinality of the operator Fixed graphic 21 (α, n)(R) is calculated as follows: Equation 7
  3. The estimated cardinality of the resulting relation from applying the join operator (⋈ a=b ) is very dependent on the key and domain properties of the join attributes of the input relations. Many different situations can occur in this context hence, four different inference rules have been defined to deal with the different possible instances of the join operators (Rules Card-16, Card-17, Card-18, Card-19 and Card-20). Since some instances of the join operators may fit with the preconditions of more than one size estimation inference rule, the instances of the join operations are checked against the preconditions of the specified inference rules in a deterministic order as shown in Figure 9. Rule Card-20 represents the else case of these set of the inference rules which is used when we do not have sufficient information about the domain and key properties of the join attributes. In this case, we use the System R rule which estimates the cardinality of the resulting relation to be equal to 1/10 of the Cartesian product of the input relations (Selinger et al., 1979).
  4. Generally, the cardinality estimation of the duplicate elimination operator (δ) is very difficult and not as straightforward in the general case because it is very dependant on the values of the attributes of the tuples of the input relation. However, we have provided different estimation rules for handling the most common cases for instances of this operator. These cases are listed as follows:
    • The estimated cardinality of the resulting relation from applying the duplicate elimination operator (δ(R)) is equal to the cardinality of the input relation R if any of the attributes of the input relation is a key attribute (Rule Card-21).
    • If the schema of the input relation R consists of only one attribute a and this attribute is a constant attribute then the cardinality of the resulting relation is equal to 1 (Rule Card-22).
    • If the schema of the input relation R consists of only one attribute a and this attribute is annotated with the histogram property that is equal to H, the cardinality of the resulting relation is equal to the cardinality of the input relation R multiplied by the percentage of the unique values of the data represented by the histogram H1 (Rule Card-23).
    • If the schema of the input relation R consists of only one attribute a and this attribute is annotated with a DomainSize property that is equal to n, assuming that the cardinality of the input relation R is equal to C1, the cardinality of the resulting relation would then equal to the minimum value between C and n (Rule Card-24).
    • For cases where the schema of the input relation R consists of more than one attribute where all attributes are constant attributes with the exception of an attribute a, if the attribute a is annotated with the histogram property that is equal to H then the estimation is done in the same way as in the third case (Rule Card-25), if the attribute a is annotated with the DomainSize property equal to n then the estimation is done in the same way as described in the fourth case (Rule Card-26).

In the case where the status of the duplicate elimination operator does not fit with the preconditions of any of the mentioned cases, then we estimate the cardinality of the resulting relation to be equal to the cardinality of the input relation.

5.2 Estimation process

Having the algebraic compilation of the XQuery expressions, the special properties of the algebraic plans, the statistical guide, the histogram repository and the estimation inference rules of the algebraic operators, the XQuery cardinality estimation in our framework becomes a very straightforward process. The estimation process is mainly achieved through two main steps. Firstly, the input XQuery expression is compiled into its equivalent algebraic plan. Secondly, the cardinality estimation of the input XQuery expression and its sub-expression is done by traversing the DAG of the algebraic operators in a post-order approach and then estimating the cardinality of each algebraic operator using its defined estimation rules (Section 5.1). Each sub-expression is associated with an algebraic operator in the DAG plan and the estimated cardinality of each intermediate algebraic operator is used as an input for the immediate following operator(s) in the algebraic DAG plan. To illustrate, we present an example of the estimation process of an XQuery expressions using its associated compiled algebraic plan and the estimation rules. In this example, we use the XMark sample document “auction.xml” as a source XML document where Figure 10 shows its associated statistical guide. In the statistical guide figure, we used the following notations:

Example 1

Find all price nodes with value less than 50

    Q1

 for $x in doc(“auction.xml”)//closed_auctions//price/text( )

  where $x<50

  return $x

Figure 11(a) shows the Pathfinder algebraic plan for the XQuery query Q1. Figure 11(b) shows the distribution of the price values (H1 – GuideNode51). Figure 12 shows the operators of the algebraic plans and the annotations of the special properties related to the estimation process. The columns with the title GNode represents the annotation of the guide node property while the column with the title Hist represents the annotation of the histogram property. Remarks about the inferred properties and cardinalities of the algebraic operators are given as follows:

  1. The (Doc) operator 17 retrieves the document node from the XPath accelerator encoding relation of the “auction.xml” document. Hence, the resulting item column is annotated with the guide node property equal to 0 referring to the root node of the statistical guide of the “auction.xml” document.
  2. The operators 16, 15 are instances of the XPath evaluator operator (Fixed graphic 22). Operator 16 represents the evaluation of the path step (descendant::closed_auctions) while the operator 15 represents the evaluation of the path step (descendent::price). The notation desc is used as an abbreviation for the descendant axis.
  3. The estimated cardinalities of the XPath evaluator operators (Fixed graphic 23) 16, 15 are done on the basis of the inference Rule Card-15. For operator number 16, applying the path step (descendant::closed_auctions) over the statistical guide with the input context document guide node (GuideNode 0) yields to the guide node with the pre-order rank equal to 47 (GuideNode 47) and the number of occurrences is equal to 1. Hence, the estimated cardinality for this operator is equal to 1 and the resulting item attribute is annotated with the guide node property equal to 47. Similarly, for operator 15 applying the path step (descendent::price) over the statistical guide with the input context guide node (GuideNode 47) yields to the guide node with the pre-order rank equal to 51 (GuideNode 51) where the number of occurrences is equal to 100. The estimated cardinality for this operator is equal to 100 and the resulting item attribute is annotated with the guide node property equal to 51.
  4. The document access operator (Δitem1:item) 10 uses the guide node property of the input item attribute to annotate the resulting attribute item1 with its associated histogram H1 capturing the distribution of the values of the price values (GuideNode 51). Simultaneously the estimated cardinality of the resulting relation is equal to the cardinality of the input relation.
  5. The comparison operator 7 (<item: item1, item2) applies a comparison operation (<) between the item1 attribute annotated with the histogram property equal to H1 and the item2 attribute annotated with the constant value 50 resulting from the attachment operator 8 (@item2:50). Using the following information of the histogram H1:
    • the histogram size is equal to 100; and
    • the constant value 50 lies within the range of the third bucket [40, 60] where the size of the bucket is equal to 20 and the accumulative size of the values that are less than 40 is equal to 80.the selectivity of the resulting Boolean attribute (item) is computed as follows: Equation 8 The estimated cardinality of the resulting relation from applying the document access operators is equal to the estimated cardinality of the input relation.
  6. The estimated cardinality of the resulting relation from applying the selection operator 5 (σ item) is computed by multiplying the value of the annotated selectivity property of the selection attribute item (0.9) by the estimated cardinality of the input relation (100).
  7. The operator 3 (⋈iter = iter) applies a join operation between the resulting relations from applying the operators 4 and 12. The estimated cardinality of the resulting relation from applying the join operator 3 is derived using the Rule Card-17 where the join attribute of each input relation is a key attribute and the domain of one join attribute is a subset of the domain other join attribute.
  8. Additionally, the cardinality estimation of the sub-expressions could be inferred by detecting the corresponding relationship between each sub-expression and its associated algebraic operator. For example, the operator 15 with estimated cardinality equal to 100 corresponds to the evaluation of the sub-expression doc(“auction.xml”)//closed_auctions//price while the operator 4 with estimated cardinality equal to 90 corresponds to the evaluation of the sub-expression $x<50.
  9. Using the loop-lifted compilation of the FLWOR expression and the loop scope relations described in Section 3, we are able to estimate the cardinality of each iteration in the scope of the FLWOR expression. A description of the estimation of the cardinality of each iteration in the FLWOR expression will be presented in the experiments of Section 7.2.

6 Proposed benchmark for XQuery cardinality estimation systems

The XML research community has proposed several benchmarks such as Böhme and Rahm (2001), Schmidt et al. (2002), Yao et al. (2003) and Nicola et al. (2007) which are very useful for their targets and perspectives. However, none of these benchmarks fits in our context. As such, in our context we need a benchmark which is designed with a main focus to measure the accuracy of our XQuery cardinality estimation systems with their different aspects. In this section, we present our proposed benchmark for XML query size estimation. Unlike the usual XML database benchmarks, the aim of our proposed benchmark is to establish the basis of different estimation approaches in the XML domain in terms of the accuracy of the estimations and its completeness in handling different XML querying features. A good XQuery selectivity estimation system must be able to support as much as possible from the different estimation aspects mentioned in this benchmark, and must scale reasonably well for combinations of these aspects. To the best of our knowledge, the work of this paper is the first which provides a uniform framework to estimate the cardinality of XQuery expressions. Hence, we have the situation that we have no other systems to compare with them in terms of the accuracy or any other aspects. The benchmark consists of six groups of queries where each group is intended to address the challenges posed by the different aspects of XML query result size estimation. The individual queries in each covers the major possibilities of XML querying capabilities.

The queries of our proposed benchmark are based on the well-known sample XML document of the XMark benchmark “auction.xml”. The structure of this XML document is described in details in Schmidt et al. (2002). We also reuse some of the XMark queries but from other perspectives serving our cardinality estimation targets. Table XI lists the descriptions of the queries of our proposed benchmark and the specific XQuery cardinality estimation aspect handled by each query. The appendix of this paper will give a complete description for the queries of our proposed benchmark.

7 Experiments

This sections presents as an assessment for the quality of the estimates produced by our proposed XQuery estimation framework. The experiments of this section have the following goals:

7.1 Estimation accuracy

This sub-section presents an assessment for the accuracy for our XQuery estimation system described in this paper. In these experiments, we use the queries of our proposed XQuery estimation benchmark and an XMark document instance with a size of 10 MB. As previously mentioned, unlike usual XML database benchmarks, our proposed queries are designed with close eyes for the different aspects of testing the accuracy of XQuery cardinality estimation system.

To assess the accuracy of our system, we compare our estimated cardinalities for the proposed queries with their actual resulting cardinalities. We use the following error metric for evaluating the accuracy of our estimation system: Equation 9 where RE(Q) represents the relative error in the estimated cardinality of the query Q, EST(Q) represents the estimated cardinality of the query Q and ACT(Q) represents the actual resulting cardinality of applying the query Q.

Figure 13 shows the computed relative errors for our experiment. Remarks about the results of this figure are given as follows:

  1. Using our introduced notion of guide node annotations and our summary structure, statistical guide, our XQuery estimation system is able to produce exact estimates with no errors for different types of order-insensitive path expressions as well as twig expressions (Q1, Q2, Q3, Q5, Q6, Q7 and Q8). Similarly, the cardinality of nested XQuery expression (Q21) is very dependent on the cardinality of the referenced XPath expression in the scope of the outer for loop. Since we are able to produce exact estimates for the cardinality of XPath expression, we are also able to produce exact estimates for similar nested expressions. Q4 does not appear on the results of Figure 13 because our estimation system does not support the estimation of XPath expressions with order-based axes due to a limitation in our summary structure as it does not capture the order information of the source XML documents.
  2. The resulting relative errors of predicates-based XQuery expressions and join-based XQuery expressions (Q9-Q20) are very dependent on the used statistical histogram techniques for capturing the distribution of the data values as well as the used algorithms for performing comparison and arithmetic operations over histograms. The more sophisticated histogram techniques and algorithms used, the higher the accuracy that could be achieved. Q14 does not appear on the results of Figure 13 because our estimation system does not currently support string-based histograms.
  3. The queries (Q22-Q25) do not appear in the results of Figure 13 because our estimation system does not support them for the following reasons:
    • Q22, Q24 are using a form of predicates which is dependent on individual values of the source data values. We cannot track or estimate these values in our algebraic plans.
    • Q23 is similar to Q14 uses string-based predicates. Our estimation system does not currently support string-based histograms.
    • Q25 is similar to Q4 uses order-based operation. Our currently used summary structure does not capture the order information of the source XML documents.

7.2 Estimating the cardinality of XQuery sub-expression and context of FLWOR iterations

In Section 3, we described the loop-lifting compilation of FLWOR expression. In this compilation, a loop of n iterations is represented by a base loop relation with a single column iter of n values (1, 2, … n). This base loop relation preserves the scope of the context nodes in the FLWOR expression. In this section, we present an example of estimating the cardinality of each sub-expression of the main XQuery expression as well as the cardinality of the context nodes for each iteration in the scope of the FLWOR expressions. In this example, we use the XMark query Q8 and an XMark document instance with a size of 10 MB.

Q8

 let $auction≔doc(“auction.xml”) return

 for $p in $auction/site/people/person

 let $a≔

  for $t in auction/site/closed_auctions/closed_auction

  where $t/buyer/@person=$p/@id

  return $t

 return<item person=“{$p/name/text( )}”>{count($a)}</item>

To illustrate the correspondence relationship between the sub-expressions and their associated algebraic operator we need to represent the algebraic plan of Q8. The algebraic plan of Q8 consists of a total of 245 operators. Owing to space limitation, we are not able to present this algebraic plan. Figure 14 shows the estimated cardinalities of the of XQuery sub-expressions and iterations contexts for Q8. Remarks about this figure are given as follows:

  1. In the loop-lifting compilation of Q8, we have four different scopes described as follows:
    • S0: represents the scope of the outermost let expression;
    • S1: represents the scope of the outer for expression;
    • S2: represents the scope of the inner for expression; and
    • S3: represents the scope of the then part of the conditional where expression.Figure 14(a) shows the four base loop relations, their actual cardinalities and their estimated cardinalities.
  2. Figure 14(b) shows the scope of each sub-expression of the XMark query Q8, its actual cardinality, its estimated cardinality and the estimated cardinality for each iteration in the associated scope. The estimated cardinality of each sub-expression is derived from the estimated cardinality of its correspondent algebraic operator. The estimated cardinality for each iteration in the associated scope is computed by dividing the estimated cardinality of each sub-expression by the cardinality of the base loop relation of its associated scope.

8 Related work

Several research efforts have been presented in the literature to propose different estimation models for XML queries. However, all of the existing work in this domain remains limited to a subset of XPath expressions and twig queries but none of them addresses the critical issues of XQuery language such as iterators, nested queries, arithmetic operations, comparison operations, etc.

Aboulnaga et al. (2001) have presented two different techniques for capturing the structure of the XML documents and for providing accurate cardinality estimations for the path expressions. The first technique is a summarizing tree structure called a path tree. A path tree is a tree containing each distinct rooted path in the database where the nodes are labelled by the tag name of the nodes. The second technique is a statistical structure called Markov table. This table, implemented as an ordinary hash table, contains any distinct path of length up to m and its selectivity. The presented techniques only support the cardinality estimations of simple path expressions that are without predicates, inline conditions, recursive axes and order-based axes. Moreover, the models cannot be applied to twigs. Lim et al. (2002) presented XPathLearner as a cardinality estimation system for XPath expressions which employs the same summarization and estimation techniques presented in Aboulnaga et al. (2001) in a different way. Instead of scanning the XML data, XPathLearner gathers and refines the required statistical information in an online manner from query feedbacks.

Zhang et al. (2006) have addressed the problem of deriving cardinality estimation of XPath expressions. In this work, the authors are mainly focusing on the handling of XPath expressions which involve only structural conditions. The main idea behind the paper is to provide an efficient treatment of recursive XML documents and the accurate estimation of recursive queries.

Li et al. (2006) have described a framework for estimating the selectivity of XPath expressions with a main focus on the order-based axes (following, preceding, following-sibling, preceding-sibling). They used a path encoding scheme to aggregate the path and order information of XML data. They used a path encoding scheme to aggregate the path and order information of XML data.

Freire et al. (2002) have presented an XML schema-based statistics collection technique called StatiX. StatiX leverages the available information in the XML schema to capture both structural and value statistics about the source XML documents. These structural and value statistics are collected in the form of histograms.

The authors of Wang et al. (2004) proposed a framework for XML path selectivity estimation in a dynamic context using a special histogram structure named as bloom histogram. The bloom histogram keeps a count of the statistics for paths in XML data.

The work of Sartiani (2003) is the one closest related to our framework, because it deals with selectivity estimation for XQuery. However, he considers only very restricted kinds of (non-nested) FLWOR expressions. Comparing to our work, we have the following main advantages:

9 Conclusion and outlook

In this paper, we described the design and implementation of our proposed framework for estimating the cardinality of XQuery expressions. Several research literature have presented different estimation models for XML queries. However, some of these models are limited only to cardinality estimation of path expression, while others also deal with twig queries, but all of them do not address the critical issues of XML queries such as iterators, nested queries, element construction, etc. To the best of our knowledge, our work is the first which provides a uniform framework to estimate the cardinality of more powerful XML querying capabilities using XQuery expressions as well as its sub-expressions. It exploits the relational algebraic infrastructure to provide accurate estimation in the context of XML and XQuery domains. Moreover, the proposed framework can act as a meta-model through its ability to incorporate different summarized XML structures and different histogram techniques which allows the model designers to achieve their targets by focusing their effort on designing or selecting the adequate techniques for them. Currently, the current status of the our proposed XQuery cardinality estimations framework suffers from two main limitations:

  1. It does not support the estimation of the queries over the order information of the source XML documents using either ordered-based XPath axes or node ordering operators. That is because the used summary structure, statistical guide, does not keep this order information of the source XML document. However, currently we are working on integrating the SLT ordered-based summary structure presented by Fisher and Maneth (2007). Integrating such XML summary structure will solve this problem and enhance our framework with the ability to support ordered-based XQuery expressions.
  2. It does not support the estimation of the data dependent queries described in our proposed benchmark.

In addition, our contributions in this paper include our proposed benchmark for XQuery cardinality estimation systems. Our proposed benchmark distinguishes itself from the other existing XML benchmarks in its focus on establishing the basis for comparing the different estimation approaches in the XML domain in terms of their accuracy of the estimations and their completeness in handling different XML querying features.

ImageEquation 1
Equation 1

ImageEquation 2
Equation 2

ImageEquation 3
Equation 3

ImageEquation 4
Equation 4

ImageEquation 5
Equation 5

ImageEquation 6
Equation 6

ImageEquation 7
Equation 7

ImageEquation 8
Equation 8

ImageEquation 9
Equation 9

ImageFixed graphic 1
Fixed graphic 1

ImageFixed graphic 2
Fixed graphic 2

ImageFixed graphic 3
Fixed graphic 3

ImageFixed graphic 4
Fixed graphic 4

ImageFixed graphic 5
Fixed graphic 5

ImageFixed graphic 6
Fixed graphic 6

ImageFixed graphic 7
Fixed graphic 7

ImageFixed graphic 8
Fixed graphic 8

ImageFixed graphic 9
Fixed graphic 9

ImageFixed graphic 10
Fixed graphic 10

ImageFixed graphic 11
Fixed graphic 11

ImageFixed graphic 12
Fixed graphic 12

ImageFixed graphic 13
Fixed graphic 13

ImageFixed graphic 14
Fixed graphic 14

ImageFixed graphic 15
Fixed graphic 15

ImageFixed graphic 16
Fixed graphic 16

ImageFixed graphic 17
Fixed graphic 17

ImageFixed graphic 18
Fixed graphic 18

ImageFixed graphic 19
Fixed graphic 19

ImageFixed graphic 20
Fixed graphic 20

ImageFixed graphic 21
Fixed graphic 21

ImageFixed graphic 22
Fixed graphic 22

ImageFixed graphic 23
Fixed graphic 23

ImageFigure 1XQuery cardinality estimation framework
Figure 1XQuery cardinality estimation framework

ImageFigure 2Illustrating examples for the behavior of some Pathfinder algebraic operators
Figure 2Illustrating examples for the behavior of some Pathfinder algebraic operators

ImageFigure 3An illustrating example for the loop-lifted compilation of the path steps
Figure 3An illustrating example for the loop-lifted compilation of the path steps

ImageFigure 4An example of an XML document and its associated statistical guide
Figure 4An example of an XML document and its associated statistical guide

ImageFigure 5Histogram constant manipulation
Figure 5Histogram constant manipulation

ImageFigure 6Inference rules of the key, domain and const properties of the Pathfinder algebraic plans
Figure 6Inference rules of the key, domain and const properties of the Pathfinder algebraic plans

ImageFigure 7Inference rules of the GuideNode, Histogram and Selectivity properties of the Pathfinder algebraic plans
Figure 7Inference rules of the GuideNode, Histogram and Selectivity properties of the Pathfinder algebraic plans

ImageFigure 8Inference rules of DomainSize property of the Pathfinder algebraic plans
Figure 8Inference rules of DomainSize property of the Pathfinder algebraic plans

ImageFigure 9The inference rules for estimating the cardinalities of Pathfinder algebraic operator
Figure 9The inference rules for estimating the cardinalities of Pathfinder algebraic operator

ImageFigure 10The statistical guide for the XMark document
Figure 10The statistical guide for the XMark document

ImageFigure 11Estimation components of the XQuery Q1
Figure 11Estimation components of the XQuery Q1

ImageFigure 12Annotated properties of operators of algebraic plan of Q1
Figure 12Annotated properties of operators of algebraic plan of Q1

ImageFigure 13Estimation accuracy
Figure 13Estimation accuracy

ImageFigure 14The estimated cardinalities of the XQuery sub-expressions and iterations contexts
Figure 14The estimated cardinalities of the XQuery sub-expressions and iterations contexts

ImageTable IPathfinder supported XQuery dialect
Table IPathfinder supported XQuery dialect

ImageTable IIPathfinder algebraic operators
Table IIPathfinder algebraic operators

ImageTable III
Table III

ImageTable IV
Table IV

ImageTable V
Table V

ImageTable VI
Table VI

ImageTable VII
Table VII

ImageTable VIII
Table VIII

ImageTable IX
Table IX

ImageTable XSpecial properties of Pathfinder algebraic operators (relations)
Table XSpecial properties of Pathfinder algebraic operators (relations)

ImageTable XIBenchmark queries list
Table XIBenchmark queries list

References

Aboulnaga, A., Alameldeen, A.R., Naughton, J.F. (2001), "Estimating the selectivity of XML path expressions for internet scale applications", Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, pp.591-600.

[Manual request] [Infotrieve]

Berleant, D. (1993), "Automatically verified reasoning with both intervals and probability density functions", Interval Computations, Vol. 2 pp.48-70.

[Manual request] [Infotrieve]

Boag, S., Chamberlin, D., Fernández, M.F., Florescu, D., Robie, J., Siméon, J. (2006), "XQuery 1.0: an XML query language – world wide web consortium proposed recommendation", available at: www.w3.org/TR/xquery, .

[Manual request] [Infotrieve]

Böhme, T., Rahm, E. (2001), "XMach-1: a benchmark for XML data management", Datenbanksysteme in Büro, Technik und Wissenschaft (BTW) 9, GI-Fachtagung, London, UK, pp.264-73.

[Manual request] [Infotrieve]

Boncz, P.A. (2002), "Monet: a next-generation DBMS kernel for query-intensive applications", Universiteit van Amsterdam, Amsterdam, PhD thesis, .

[Manual request] [Infotrieve]

Boncz, P.A., Grust, T., Van Keulen, M., Manegold, S., Rittinger, J., Teubner, J. (2006), "MonetDB/XQuery: a fast XQuery processor powered by a relational engine", Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, pp.479-90.

[Manual request] [Infotrieve]

Brantner, M., Helmer, S., Kanne, C-C., Moerkotte, G. (2005), "Full-fledged algebraic XPath processing in Natix", Proceedings of the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, pp.705-16.

[Manual request] [Infotrieve]

Bray, T., Paoli, J., Sperberg-McQueen, C.M., Maler, E., Yergeau, F. (2006), "World Wide Web Consortium, Extensible Markup Language (XML) 1.0", 4th ed., available at: www.w3.org/TR/xml, .

[Manual request] [Infotrieve]

Chen, Z., Jagadish, H.V., Lakshmanan, L.V.S., Paparizos, S. (2003), "From tree patterns to generalized tree patterns: on efficient evaluation of XQuery", Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), Berlin, Germany, pp.237-48.

[Manual request] [Infotrieve]

Fernández, M.F., Malhotra, A., Marsh, J., Nagy, M., Walsh, N. (2006), "XQuery 1.0 and XPath 2.0 data model (XDM) – world wide web consortium proposed recommendation", available at: www.w3.org/TR/xpath-datamodel, .

[Manual request] [Infotrieve]

Fisher, D., Maneth, S. (2007), "Structural selectivity estimation for XML documents", Proceedings of the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, .

[Manual request] [Infotrieve]

Freire, J., Haritsa, J.R., Ramanath, M., Roy, P., Siméon, J. (2002), "StatiX: making XML count", Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, WI, USA, pp.181-91.

[Manual request] [Infotrieve]

Goldman, R., Widom, J. (1997), "DataGuides: enabling query formulation and optimization in semistructured databases", Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, pp.436-45.

[Manual request] [Infotrieve]

Grust, T. (2002), "Accelerating XPath location steps", Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, WI, USA, pp.109-20.

[Manual request] [Infotrieve]

Grust, T. (2005), "Purely relational FLWORs", Proceedings of the 2nd International Workshop on XQuery Implementation, Experience and Perspectives (XIME-P), in Cooperation with ACM SIGMOD, Baltimore, MD, USA, .

[Manual request] [Infotrieve]

Grust, T., Sakr, S., Teubner, J. (2004), "XQuery on SQL hosts", Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, pp.252-63.

[Manual request] [Infotrieve]

Grust, T., Mayr, M., Rittinger, J., Sakr, S., Teubner, J. (2007), "A SQL: 1999 code generator for the Pathfinder XQuery compiler", Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, .

[Manual request] [Infotrieve]

Ioannidis, Y.E. (1993), "Universality of serial histograms", Proceedings of the 19th International Conference on Very Large Data Bases (VLDB), San Francisco, CA, USA, pp.256-67.

[Manual request] [Infotrieve]

Jagadish, H.V., Lakshmanan, L.V.S., Srivastava, D., Thompson, K. (2001), "TAX: a tree algebra for XML", Proceedings of the 8th International Workshop of Database Programming Languages (DBPL), Frascati, Italy, pp.149-64.

[Manual request] [Infotrieve]

Li, H., Lee, M-L., Hsu, W., Cong, G. (2006), "An estimation system for XPath expressions", Proceedings of the 22nd International Conference on Data Engineering (ICDE), Atlanta, GA, USA, pp.54.

[Manual request] [Infotrieve]

Lim, L., Wang, M., Padmanabhan, S., Vitter, J.S., Parr, R. (2002), "XPathlearner: an on-line self-tuning Markov histogram for XML path selectivity estimation", Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, pp.442-53.

[Manual request] [Infotrieve]

Nicola, M., Kogan, I., Schiefer, B. (2007), "An XML transaction processing benchmark", Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA, pp.937-48.

[Manual request] [Infotrieve]

Paparizos, S., Al-Khalifa, S., Jagadish, H., Niermann, A., Wu, Y. (2002), A Physical Algebra for XML, University of Michigan, Dearborn, MI, Technical Report, .

[Manual request] [Infotrieve]

Poosala, V., Ioannidis, Y.E., Haas, P.J., Shekita, E.J. (1996), "Improved histograms for selectivity estimation of range predicates", Proceedings of the ACM SIGMOD International Conference on Management of Data, Quebec, pp.294-305.

[Manual request] [Infotrieve]

Re, C., Siméon, J., Fernández, M.F. (2006), "A complete and efficient algebraic compiler for XQuery", Proceedings of the 22nd International Conference on Data Engineering (ICDE), Atlanta, GA, USA, pp.14.

[Manual request] [Infotrieve]

Rittinger, J. (2005), "Pathfinder/MonetDB: a relational runtime for XQuery", Konstanz University, Konstanz, Master's thesis, .

[Manual request] [Infotrieve]

Sakr, S. (2007), "Cardinality-aware and purely relational implementation of an XQuery processor", University of Konstanz, Konstanz, PhD thesis, .

[Manual request] [Infotrieve]

Sartiani, C. (2003), "A general framework for estimating XML query cardinality", Proceedings of the 9th International Database Programming Languages Workshop (DBPL), Potsdam, Germany, pp.257-77.

[Manual request] [Infotrieve]

Sartiani, C., Albano, A. (2002), "Yet another query algebra for XML data", Proceedings of the International Database Engineering & Applications Symposium (IDEAS), Edmonton, Canada, pp.106-15.

[Manual request] [Infotrieve]

Schmidt, A., Waas, F., Kersten, M.L., Carey, M.J., Manolescu, I., Busse, R. (2002), "XMark: a benchmark for XML data management", Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, pp.974-85.

[Manual request] [Infotrieve]

Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G. (1979), "Access path selection in a relational database management system", Proceedings of the ACM SIGMOD International Conference on Management of Data, Boston, MA, USA, pp.23-34.

[Manual request] [Infotrieve]

Teubner, J. (2006), "Pathfinder: XQuery compilation techniques for relational database targets", Technical University of Munich, Munich, PhD thesis, .

[Manual request] [Infotrieve]

Wang, W., Jiang, H., Lu, H., Yu, J.X. (2004), "Bloom histogram: path selectivity estimation for XML data with updates", Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, pp.240-51.

[Manual request] [Infotrieve]

Yao, B.B., Özsu, M.T., Keenleyside, J. (2003), "XBench – a family of benchmarks for XML DBMSs", Proceedings of the VLDB 2002 Workshop EEXTT and CAiSE 2002 Workshop DTWeb on Efficiency and Effectiveness of XML Tools and Techniques and Data Integration over the Web-Revised Papers, London, UK, pp.162-4.

[Manual request] [Infotrieve]

Zhang, H., Tompa, F.W. (2003), "XQuery rewriting at the algebraic level", Trends in XML Technology for the Global Information Infrastructure, a special issue of Journal of Computer Systems, Science, and Engineering, Vol. 18 No.5, pp.241-62.

[Manual request] [Infotrieve]

Zhang, N., Özsu, M.T., Aboulnaga, A., Ilyas, I.F. (2006), "XSeed: accurate and fast cardinality estimation for XPath queries", Proceedings of the 22nd International Conference on Data Engineering (ICDE), Atlanta, GA, USA, pp.61.

[Manual request] [Infotrieve]

Appendix

A.1 Group 1: path expressions

Q1: path expression with non-recursive axes.

Find the names of all persons in the auction document.

for $b in doc(“auction.xml”)/site/people/person/name/text( )

return $b

where non-recursive axes are child, parent, attribute, following-sibling and preceding-sibling.

Q2: path expression with recursive axes.

Find all description nodes descendant of all item nodes in the auction document.

for $b in doc(“auction.xml”)/site//item//description

return $b

where recursive axes are descendant, descendant-or-self, ancestor and ancestor-or-self.

Q3: path expression with wild cards.

Return the subtree of the Africa region.

for $b in doc(“auction.xml”)/site/regions/africa//*

return $b

Q4: path expression with ordered-based axes.

Return the description nodes following the tags with name closed_auction.

for $b in doc(“auction.xml”)/site//closed_auction/following::description

return $b

where ordered-based axes are following, following-sibling, preceding and preceding-sibling.

Q5: branching XPath expressions.

Return the names of all persons with id less than “person10”.

for $b in doc(“auction.xml”)/site//person[@id< ‘person10’]/name

return $b

A.2 Group 2: twig expressions

Q6: simple twig expression.

Return the names and description of all items.

for $b in doc(“auction.xml”)//item

return ($b/name, $b/description)

Q7: twig expression with element construction.

Return the restructured result of the names and description of all items.

for $b in doc(“auction.xml”)//item

return

  <Result>

   <name>{$b/name}</name>

   <description>{$b/description}</description>

 </Result>

A.3 Group 3: predicates

Q8: positional predicates.

Return the third bidder of each open auction.

for $b in doc(“auction.xml”)/site/open_auctions/open_auction

return $b/bidder[3]

Q9: equality predicates.

Return the closed auctions with price equal to 40.

for $b in doc(“auction.xml”)/site//closed_auction

where data($b/price)=40

return $b

Q10: range predicates.

Return the closed auctions with price less than 40.

for $b in doc(“auction.xml”)/site//closed_auction

where data($b/price)<40

return $b

where the range predicates uses any of the operators (< ≤,= != > ≥)

Q11: conjunctive/disjunctive predicates.

Return the closed auctions with price greater than 40 and less than 100.

for $b in doc(“auction.xml”)/site//closed_auction

where data($b/price)>40 and data($b/price)<100

return $b

where conjunctive predicates uses the conjunctive/disjunctive operators (AND, OR)

Q12: predicates with merged nodes from different paths.

Return the African and Asian items with id value greater than 100.

for $b in (doc(“auction.xml”)/site/africa/item,

 doc(“auction.xml”)/site/asia/item)

where data($b/@id)>100

return $b

An accurate estimation of such query should consider the different data distribution for the nodes resulting from each different path expression as well as the percentage of each path in constructing the nodes of the operated sequence.

Q13: predicates with merged nodes from different paths and hybrid nature.

Return the price nodes and quantity nodes with value >100.

for $b in (doc(“auction.xml”)/site//price,

 doc(“auction.xml”)/site//quantity)

where data($b)>1 and data($b)>100

 return $b

This query is more challenging than the previous one because the resulting nodes of the sequence are representing different items (price, quantity).

Q14: string predicates.

Return all persons with id value greater than “person200”.

for $b in doc(“auction.xml”)/site/people/person

where $b/@id>“person200”

return $b

A.4 Group 4: values comparisons ( joins)

Q15: Value comparison where the values of each operand are constructed by path expression.

Return all pairs of increase value and price value where the increase value is greater than the price value.

for $x in doc(“auction.xml”)/site//increase,

 $y in doc(“auction.xml”)/site//price

where data($x) >data($y)

 return<pair>{$x, $y}<(pair>

Q16: value comparison where the values of one operand are constructed by path expression and the values of the other operand are constructed by path expression manipulated with arithmetic expression.

Return all pairs of increase value and price value where the increase value is greater than the price value multiplied by 2.

for $x in doc(“auction.xml”)/site//increase,

 $y in doc(“auction.xml”)/site//price

where data($x) >data($y)*2

return<pair>{$x, $y}<(pair>

Q17: values join.

Return all pairs of increase value and price value where the increase value is equal to the price value.

for $x in doc(“auction.xml”)/site//increase,

 $y in doc(“auction.xml”)/site//price

where data($x) = data($y)

return<pair>{$x, $y}<(pair>

Q18: histograms arithmetic 1.

Return all pairs of increase value and price value where the sum of the increase value and the price value is greater than 100.

for $x in doc(“auction.xml”)/site//increase

 $y in doc(“auction.xml”)/site//price

where data($x) + data($y)>100

return<pair>{$x, $y}<(pair>

Q19: histograms arithmetic 2.

Return all pairs of increase value and price value where the sum of the increase value and the price value is equal to 100.

for $x in doc(“auction.xml”)/site//increase

 $y in doc(“auction.xml”)/site//price

where data($x) + data($y)=100

return<pair>{$x, $y}<(pair>

Q20: histograms arithmetic 3.

Return all triples of increase value, price value and income where the sum of the increase value and the income value is greater than the sum of the price value and the income value.

for $x in doc(“auction.xml”)/site//increase

 $y in doc(“auction.xml”)/site//price

 $z in doc(“auction.xml”)/site//@income

where data($x) + data($z) >data($y) + data($z)

return<pair>{$x, $y, $z}<(pair>

A.5 Group 5: nested expressions

Q21: let – aggregates.

Return the names of persons and the number of items that they bought.

let $auction≔doc(“auction.xml”) return

for $p in $auction/site/people/person

let $a≔

 for $t in $auction/site/closed_auctions/closed_auction

  where $t/buyer/@person=$p/@id

 return $t

return<item>

   <person>{$p/name/text( )}</person>

   <count>{count($a)}</count>

 </item>

A.6 Group 6: data dependent estimations

Q22: predicates with values constructed by aggregate function.

Return the open auctions with sum of bidder increases that are greater than 1,000.

for $b in doc(“auction.xml”)/site/open_auctions/open_auction

where sum(data(b/bidder/increase))>1000

return<increase>$b</increase>

Q23: sub-string matching.

Return the names of all items whose description contains the word “gold”.

let $auction≔doc(“auction.xml”) return

for $i in $auction/site//item

where contains(string(exactly-one($i/description)), “gold”)

return $i

Q24: distinct.

Return the distinct price values.

for $b in distinct-values(doc(“auction.xml”)//price/text( ))

return $b

Q25: document order.

Return the open auctions where a certain person issued a bid before another person.

let $auction≔doc(“auction.xml”) return for $b in

$auction/site/open_auctions/open_auction where

some $pr1 in $b/bidder/personref[@person=“person20”],

 $pr2 in $b/bidder/personref[@person=“person51”]

satisfies $pr1≪$pr2

return<history>$b</history>

Corresponding author

Sherif Sakr can be contacted at: sherif.sakr@nicta.com.au