3. Research Method
3.4. XML Query Model
In this research, we propose a generic XML query model that is applicable to any scenario. We do not describe the queries in terms of a specific application domain such as an auction site or a library. Instead, the queries are specified in generic terms, so users can easily apply them in different scenarios. In addition, we identify the key factors that influence the complexity of each query. This helps users evaluate the performance of a system under increasingly complex queries.
After analyzing three previous XML benchmarks, we identify a comprehensive set of queries. The query model we define is classified into 10 categories comprising 14 different queries, each of which challenges a different aspect of XML processing. In addition, users can specify queries according to their own requirements; we call these “user-driven queries”. The following subsections briefly describe each category and express each query in generic terms. In each query, the generic term is written in italics. We then illustrate the query in XQuery. We use E1, E2, etc. to denote elements, and A1, A2, etc. to denote attributes. The numbers do not indicate the order of the elements or attributes in an XML document; they are used only for notational convenience. Finally, the complexity factors of each category are discussed.
3.4.1 Exact Match
This type of query specifies a full path expression. One main concept of XQuery is the use of path expressions for selecting nodes. The length of the path expression depends on the number of levels being queried in the XML document. This is the simplest query type. We can use it to establish a simple baseline “metric” against which the performance of the following queries is compared. It tests the database’s ability to handle simple string lookups with a fully specified path.
Generic terms:
Given a full path expression, find the elements E1 whose attribute A1 has the value X.
XQuery expression:
FOR $a IN input()/SUBPATH/E1[@A1 = "X"] RETURN $a
The complexity of the query is influenced by the length of the path expression: queries over paths with different numbers of levels perform differently.
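As an illustration outside an XQuery processor, the exact-match query can be mimicked with Python's standard `xml.etree.ElementTree` module, which supports a small subset of XPath. This is a minimal sketch only: the sample document, the path steps S1 and S2 (standing in for SUBPATH), and the helper `exact_match` are hypothetical and not part of the query model.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; S1 and S2 stand in for SUBPATH.
DOC = """
<root>
  <S1><S2>
    <E1 A1="X">first</E1>
    <E1 A1="Z">second</E1>
  </S2></S1>
  <other><E1 A1="X">off the path</E1></other>
</root>
"""

def exact_match(root, subpath, value):
    """Mimic input()/SUBPATH/E1[@A1 = "X"] with a fully specified path."""
    return root.findall(f"./{subpath}/E1[@A1='{value}']")

root = ET.fromstring(DOC)
# Only the E1 on the fully specified path with A1 = "X" is selected.
print([e.text for e in exact_match(root, "S1/S2", "X")])
```

Lengthening the hypothetical sub-path (more nested levels before E1) is the knob that corresponds to the complexity factor named above.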
3.4.2 Joins
References are an integral part of XML; they identify the relationships between related data. With references, richer relationships can be represented than with hierarchical element structures alone. The system must be able to combine separate pieces of information using joins. This type of query defines horizontal traversals. Joins can be based on references or on values. References are specified in the DTD and may be optimized with, for example, logical OIDs. The system should make use of the cardinalities of the sets to be joined. Joins based on values test the database’s ability to handle large (intermediate) results.
Join on Reference
Generic terms:
Find element E1 by the reference attribute A1 of E2, where the reference attribute A1 of E2 refers to E1. In the expression below, A2 denotes the attribute of E1 that A1 refers to.
XQuery expression:
FOR $a IN input()//E1,
    $b IN input()//E2
WHERE $a/@A2 = $b/@A1
RETURN $a
Join on Value
Generic terms:
This time the join is based on data values rather than on references. Find element E1 whose attribute A1 is equal to the attribute A2 of E2.
XQuery expression:
FOR $a IN input()//E1,
    $b IN input()//E2
WHERE $a/@A1 = $b/@A2
RETURN $a
The queries specified above are 2-way joins, which is the simplest form. 3-way, 4-way, and in general N-way joins can be generated with increasing complexity. In addition, the result size affects the query efficiency.
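The two 2-way joins can be sketched as a plain nested loop in Python. This is an illustrative sketch under stated assumptions, not an optimized join: the sample document, its attribute values, and the helper `join` are hypothetical, and a real system would be expected to exploit indexes and cardinalities instead of comparing every pair.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: on E1, A2 is the identifier and A1 a plain value;
# on E2, A1 is the reference to an E1 and A2 a plain value.
DOC = """
<root>
  <E1 A2="p1" A1="red">one</E1>
  <E1 A2="p2" A1="blue">two</E1>
  <E1 A2="p3" A1="green">three</E1>
  <E2 A1="p1" A2="blue"/>
  <E2 A1="p3" A2="green"/>
</root>
"""

def join(root, e1_attr, e2_attr):
    """Nested-loop join: emit E1 once per E2 with E1/@e1_attr = E2/@e2_attr."""
    result = []
    for a in root.iter("E1"):          # FOR $a IN input()//E1
        for b in root.iter("E2"):      #     $b IN input()//E2
            key = a.get(e1_attr)
            if key is not None and key == b.get(e2_attr):
                result.append(a)
    return result

root = ET.fromstring(DOC)
by_reference = [e.text for e in join(root, "A2", "A1")]  # $a/@A2 = $b/@A1
by_value = [e.text for e in join(root, "A1", "A2")]      # $a/@A1 = $b/@A2
print(by_reference, by_value)
```

The nested loop compares every E1 with every E2, which makes the influence of the input cardinalities and of the result size on the cost directly visible.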
3.4.3 Regular Path Expressions
Regular path expressions are a basic building block of almost every XML language, including XPath, XQuery, and XSLT. The system should be capable of optimizing path expressions and reducing traversals of irrelevant parts of the tree. Wildcards are often used in regular path expressions, and the system should realize that it is not necessary to traverse the complete document tree to execute such expressions. This type of query tries to quantify the cost of long path traversals that do not include wildcards and the cost of path traversals that do.
Full Sub-path
Generic terms:
Find element E1 with a long path expression.
XQuery expression:
FOR $a IN input()/SUBPATH/E1 RETURN $a
Unknown Sub-path
Generic terms:
Find element E1 with a regular path expression that includes wildcards.
XQuery expression:
FOR $a IN input()//E1 RETURN $a
The length of the path expression influences the complexity. In a path expression, each step can apply one or more predicates to eliminate nodes that fail to satisfy a given condition. Therefore, the number of unknown elements in the sub-path also affects the query complexity.
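The difference between a full sub-path and an unknown sub-path can be illustrated with the XPath subset of Python's `xml.etree.ElementTree`. This is a minimal sketch; the sample document and the step names S1, S2, and S3 are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with E1 at two different depths.
DOC = "<root><S1><S2><E1>deep</E1></S2></S1><S3><E1>shallow</E1></S3></root>"

root = ET.fromstring(DOC)

# Full sub-path: every step is named, so only one branch has to be visited.
full = [e.text for e in root.findall("./S1/S2/E1")]

# Unknown sub-path: '//' matches E1 at any depth below the root.
unknown = [e.text for e in root.findall(".//E1")]

# One unknown step: '*' matches any single element between the root and E1.
one_step = [e.text for e in root.findall("./*/E1")]

print(full, unknown, one_step)
```

Without extra knowledge about the document structure, the second and third expressions force a traversal of branches that the first one can skip.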
3.4.4 Document Construction
Structure is very important to XML documents. However, XML documents stored in relational DBMSs often need to be broken down, and reconstructing the original document is a big challenge for such systems. We might retrieve fragments of the original documents with their original structures, but sometimes we may want to construct document fragments with new structures. These queries test the ability of the system to reconstruct portions of the original XML document.
Structure Preserving
Generic terms:
Return an XML document constructed from element E1 and its sub-element E2. Retrieve E2 of the E1 whose attribute A1 is equal to a certain value X.
XQuery expression:
FOR $a IN input()//E1[@A1 = "X"]
RETURN <E1> {$a/E2} </E1>
Structure Transforming
Generic terms:
Construct a new XML document. Find element E1 whose attribute A1 is equal to a certain value X, and select several sub-elements of E1 to construct the new document.
XQuery expression:
FOR $a IN input()//E1[@A1 = "X"]
RETURN <output>
         {$a/E2/E3}
         {$a/E2/E4}
         {$a/E2/E3/E5}
         {$a/E6}
       </output>
The complexity of the XML document structure increases the difficulty of reconstruction. The structure of the output document also influences the query complexity.
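Both construction queries can be sketched in Python by copying selected sub-trees into a newly created element. This is a minimal sketch under stated assumptions: the sample document (including the extra element E7) and the helpers `preserve` and `transform` are hypothetical, and the structure-preserving variant assumes that the result element is named E1 and carries none of the original attributes.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical document; E7 is present only to show that it is left out.
DOC = ("<root>"
       "<E1 A1='X'><E2><E3>a<E5>b</E5></E3><E4>c</E4></E2><E6>d</E6><E7>e</E7></E1>"
       "<E1 A1='Z'><E2/></E1>"
       "</root>")

def preserve(root, value):
    """Structure preserving: a new E1 holding copies of its E2 children."""
    out = []
    for a in root.findall(f".//E1[@A1='{value}']"):
        shell = ET.Element(a.tag)
        shell.extend([copy.deepcopy(e) for e in a.findall("E2")])
        out.append(shell)
    return out

def transform(root, value):
    """Structure transforming: flatten selected descendants into <output>."""
    out = []
    for a in root.findall(f".//E1[@A1='{value}']"):
        output = ET.Element("output")
        for path in ("E2/E3", "E2/E4", "E2/E3/E5", "E6"):
            output.extend([copy.deepcopy(e) for e in a.findall(path)])
        out.append(output)
    return out

root = ET.fromstring(DOC)
for doc in preserve(root, "X") + transform(root, "X"):
    print(ET.tostring(doc, encoding="unicode"))
```

The sub-trees are deep-copied so that the source document is left unchanged, which mirrors the fact that element constructors in XQuery build new nodes.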
3.4.5 Ordered Access
The order of elements is important in XML documents. Because documents are sometimes fragmented when they are stored on disk, it is important that the order of these fragments in the original document is preserved. The system should be able to preserve this intrinsic order. This type of query tests how efficiently the system handles queries with order constraints.
Generic terms:
Find element E1 whose attribute A1 has a certain value X, and return the first sub-element E2 of E1.
XQuery expression:
FOR $a IN input()//E1[@A1 = "X"]
RETURN $a/E2[1]
The complexity depends on the order constraints specified in the query. If an index is built on the attribute, the query can take advantage of set-valued aggregates on the indexed attribute to accelerate execution.
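The positional predicate can be tried out with Python's `xml.etree.ElementTree`, whose XPath subset supports `[1]` on a named step. This is a minimal sketch; the sample document and the helper `first_sub_elements` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the first matching E1 has two E2 children.
DOC = ("<root>"
       "<E1 A1='X'><E2>first</E2><E3/><E2>second</E2></E1>"
       "<E1 A1='X'><E2>third</E2></E1>"
       "<E1 A1='Z'><E2>other</E2></E1>"
       "</root>")

def first_sub_elements(root, value):
    """Mimic input()//E1[@A1 = "X"]/E2[1]: first E2 of each matching E1."""
    return root.findall(f".//E1[@A1='{value}']/E2[1]")

root = ET.fromstring(DOC)
print([e.text for e in first_sub_elements(root, "X")])
```

The result is only meaningful if the stored document order of the E2 siblings matches the original document, which is the property this query category is meant to test.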
3.4.6 Sorting
The ORDER BY clause is the only facility provided by XQuery for specifying an order other than document order. In XML documents, the generic data type of element content is string, but users may cast the string type to other types. Therefore, the system should be able to sort values of both string and non-string data types. This type of query tests whether the system can sort efficiently.
By String
Generic terms:
List sub-element E3 of element E1 sorted by sub-element E2.
XQuery expression:
FOR $a IN input()//E1 ORDER BY $a//E2
RETURN $a/E3
By Non-string
Generic terms:
List sub-element E3 of element E1 sorted by sub-element E2. This time E2 is a non-string value.
XQuery expression:
FOR $a IN input()//E1 ORDER BY $a//E2
RETURN $a/E3
The number of tuples that are generated by the FOR and LET-clauses and satisfy the condition in the WHERE-clause influences the complexity of the query. The complexity also increases if the ORDER BY-clause uses several options.
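The difference between the two sorting queries becomes visible as soon as E2 holds numbers. The following Python sketch sorts the same hypothetical elements once by the string value of E2 and once after casting it to a number; the sample data and the use of `float` as the target type are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which E2 contains numeric text.
DOC = ("<root>"
       "<E1><E2>10</E2><E3>ten</E3></E1>"
       "<E1><E2>9</E2><E3>nine</E3></E1>"
       "<E1><E2>100</E2><E3>hundred</E3></E1>"
       "</root>")

root = ET.fromstring(DOC)
elements = list(root.iter("E1"))

# By string: E2 is compared character by character, so "10" < "100" < "9".
by_string = [a.findtext("E3")
             for a in sorted(elements, key=lambda a: a.findtext(".//E2"))]

# By non-string: E2 is cast to a number before comparison, so 9 < 10 < 100.
by_number = [a.findtext("E3")
             for a in sorted(elements, key=lambda a: float(a.findtext(".//E2")))]

print(by_string)
print(by_number)
```

The numeric variant performs one cast per tuple in addition to the comparison work, which is one reason the two queries can differ in cost.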
3.4.7 Missing Elements
In XML, schemas are flexible and may contain a number of irregularities. Queries of this type test how well the system deals with the semi-structured aspect of XML data, especially elements that are declared optional in the schema.
Generic terms:
Find element E1 whose sub-element E2 has a NULL value, that is, E2 is missing or has no text content.
XQuery expression:
FOR $a IN input()//E1 WHERE EMPTY($a/E2/text()) RETURN $a
The complexity depends on the FOR and LET-clauses that generate the test tuples.
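The emptiness test can be approximated in Python as follows. This is a simplified sketch: the sample document, the `id` attribute, and the helper `missing` are hypothetical, and only the leading text of E2 is inspected, so text that follows a child element inside E2 is not considered.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: E2 has content, is empty, or is absent.
DOC = ("<root>"
       "<E1 id='a'><E2>v</E2></E1>"
       "<E1 id='b'><E2/></E1>"
       "<E1 id='c'/>"
       "</root>")

def missing(root):
    """Approximate EMPTY($a/E2/text()): no E2 child carries leading text."""
    result = []
    for a in root.iter("E1"):
        if not any(e.text for e in a.findall("E2")):
            result.append(a)
    return result

root = ET.fromstring(DOC)
# Both the empty E2 (id 'b') and the absent E2 (id 'c') qualify.
print([a.get("id") for a in missing(root)])
```

Note that an absent E2 and an empty E2 are treated alike, which matches the XQuery expression above, where both cases yield an empty sequence.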
3.4.8 Text Search
Text search plays a very important part in XML document systems. This type of query conducts a full-text search in the form of a keyword search. It exercises the textual nature of XML documents.
Generic terms:
Find element E1 whose sub-element E2 contains a specific text Y.
XQuery expression:
FOR $a IN input()//E1
WHERE CONTAINS($a/E2, "Y")
RETURN $a
This query has to scan a large part of the document. Therefore, the number of tuples that are generated by the FOR and LET-clauses influences the query complexity. The difficulty also increases if the query contains multiple search texts.
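A keyword search of this kind can be sketched in Python as a substring test on the string value of E2, that is, on the concatenated text of E2 and its descendants. The sample document, the `id` attribute, and the helper `text_search` are hypothetical; requiring all keywords to occur is one possible reading of a query with multiple texts.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the third E2 contains markup inside its text.
DOC = ("<root>"
       "<E1 id='a'><E2>native XML database</E2></E1>"
       "<E1 id='b'><E2>relational store</E2></E1>"
       "<E1 id='c'><E2>an <b>XML</b> view</E2></E1>"
       "</root>")

def text_search(root, *keywords):
    """Return E1 elements having an E2 whose string value holds all keywords."""
    hits = []
    for a in root.iter("E1"):
        for e in a.findall("E2"):
            text = "".join(e.itertext())  # string value, markup ignored
            if all(k in text for k in keywords):
                hits.append(a)
                break
    return hits

root = ET.fromstring(DOC)
print([a.get("id") for a in text_search(root, "XML")])
print([a.get("id") for a in text_search(root, "XML", "view")])
```

Without a text index, every E2 has to be read in full, which corresponds to the scan cost described above.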
3.4.9 Data-type Cast
Strings are the generic data type in XML documents, but we often need to cast strings to another data type that carries more semantics. These queries challenge the system’s ability to transform between data types.
Generic terms:
Find element E1 under a constraint whose evaluation requires transforming the data value of sub-element E2 to another data type. For example, retrieve element E1 whose sub-element E2 is greater than a certain number X.
XQuery expression:
FOR $a IN input()//E1 WHERE $a/E2 > X RETURN $a
The number of tuples that need to be transformed affects the execution efficiency. If a query contains several casting conditions, the complexity increases.
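Why the cast matters can be shown with a small Python sketch that evaluates the same condition once on the raw strings and once after casting to a number. The sample document, the `id` attribute, the threshold 10, and the choice of `float` as the target type are assumptions; the data is assumed to be well formed, so every E2 holds numeric text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which E2 holds numeric text.
DOC = ("<root>"
       "<E1 id='a'><E2>9</E2></E1>"
       "<E1 id='b'><E2>10</E2></E1>"
       "<E1 id='c'><E2>100</E2></E1>"
       "</root>")

root = ET.fromstring(DOC)

# Without a cast, the comparison is lexicographic: "9" > "10" holds.
as_string = [a.get("id") for a in root.iter("E1")
             if a.findtext("E2") > "10"]

# With a cast, each E2 value is converted before comparison: only 100 > 10.
as_number = [a.get("id") for a in root.iter("E1")
             if float(a.findtext("E2")) > 10]

print(as_string, as_number)
```

One conversion is performed per candidate tuple, so the cost of the query grows with the number of E2 values that have to be cast.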
3.4.10 Function Application
The following query challenges the system with aggregate functions such as count, avg, max, min and sum.
Generic terms:
Group element E1 by sub-element E2, and calculate the total number of elements for each group.
XQuery expression:
FOR $a IN DISTINCT-VALUES(input()//E1/E2)
LET $b := input()//E1[E2 = $a]
RETURN count($b)
The complexity of this query is influenced by the number of tuples that are generated by the FOR and LET-clauses.
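The grouping and counting step can be sketched in Python with a single pass over the E1 elements. This is a minimal sketch: the sample document and the helper `count_by_group` are hypothetical, and the single-pass strategy is swapped in for the nested evaluation of the XQuery expression, which re-scans the document for every distinct value.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical document with two distinct E2 values.
DOC = ("<root>"
       "<E1><E2>x</E2></E1>"
       "<E1><E2>y</E2></E1>"
       "<E1><E2>x</E2></E1>"
       "</root>")

def count_by_group(root):
    """Count E1 elements per distinct E2 value."""
    counts = Counter()
    for a in root.iter("E1"):
        # A set ensures that an E1 with repeated E2 values is counted once
        # per value, as in count(input()//E1[E2 = $a]).
        for value in {e.text for e in a.findall("E2")}:
            counts[value] += 1
    return dict(counts)

root = ET.fromstring(DOC)
print(count_by_group(root))
```

The sketch returns the group value together with its count, whereas the XQuery expression above returns only the counts.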