Document: Advances in Database Technology - P7 ppt

O. Shmueli, G. Mihaila, and S. Padmanabhan

(Step-Predicate Consistency) If the mapping maps a Predicate to one node of the data tree and maps the first step of the RLP of the PredExp of that Predicate to another node, then the two nodes are joined by an edge in the data tree.

(Comparison Mapping) The mapping maps the ‘Op Value’ part of a predicate ‘self::node() Op Value’ to the same node to which it maps the self:: step, and the condition ‘Op Value’ is satisfied by this node.

Similarly to an expression mapping, a matching map is also a function from expression components into tree nodes, this time nodes of T rather than nodes of a data tree. A matching map satisfies the same six conditions listed above [footnote 3]. As we shall see, the existence of an expression mapping from a normalized expression to an image data tree of T implies the existence of a matching map from that expression to T.

Let T be a tagged tree, and consider an image data tree of T and a normalized expression. Suppose there exists an element occurrence elem in the image data tree, computed by the expression on that data tree, as evidenced by an expression mapping from the expression to the data tree. We associate with the expression mapping a matching map such that, if the expression mapping maps a component of the expression to a node in the data tree, then the matching map maps that component to the node in T of which the data node is an image. Because of the way images of tagged trees are defined, if the expression mapping maps two components to two data nodes joined by an edge in the data tree, then the corresponding nodes are joined by an edge in T. Furthermore, the node tests that hold true in the data tree must be “true” in T; that is, node names are the same, and if text() holds on the data node then the corresponding T node “manufactures” text. We also call the matching map a matching of the expression and T.

Observe that there may be a number of distinct matching maps from the same normalized expression to T. Perhaps the simplest example is a tagged tree T having a root and two children tagged with ‘A’. Let the expression be Root/child::A; we can match the step child::A with either of the two A children of the root of T. Observe that these two matching maps correspond to different ways of constructing an expression mapping from the expression to a data tree of T.

4 Tagged Tree Rewriting

Rewriting is done as follows: (1) based on a matching map for an expression, the tagged tree is modified (subsection 4.1); (2) a collection of rewritten tagged trees is superimposed into a single
tagged tree (subsection 4.2). We shall present the obvious option of combining all rewritten trees for all expressions and their matching maps. We note that, in general, the collection of superimposed trees may be any subset of these trees, giving rise to various strategies as to how many superimposed trees to generate. Our techniques apply in this more general setting as well.

[Footnote 3: Except the ‘Op Value’ satisfaction requirement.]

Query-Customized Rewriting and Deployment of DB-to-XML Mappings

4.1 Node Predicate Modification

Our goal is to modify T so as to restrict the corresponding image data to data that is relevant to the normalized expressions of the client queries. Consider one such normalized expression and the set of matching maps from it to T, and consider one matching map in this set. We next describe how to modify the formulas attached to the nodes of T based on this matching map.

Algorithm ASSOCIATEF examines the parsed normalized expression and the tree T in the context of a matching map. Inductively, once the algorithm is called with a node in T and a parsed sub-expression, it returns a formula F which represents conditions that must hold at that T node for the recursive call to succeed in a data tree for T (at a corresponding image node). Some of these returned formulas, the ones corresponding to Predicate* sequences on the major steps of the expression, are attached to tree nodes (in lines 20 and 25); the others are simply returned to the caller with no side effect. The information about whether the algorithm is currently processing a major step is encoded in the Boolean parameter isMajor. For a node of T, consider the original formula labeling it in T; in case of a data annotation, the formula is [...] (and True if no formula originally labels the node in T). Initially, the algorithm is invoked as follows: [...]

At this point let us review the effect of ASSOCIATEF. Consider a matching map for a normalized expression. Consider the major steps in the major path of the expression, i.e., those steps not embedded within any [ ], and the respective nodes of T to which the matching map maps them. Algorithm
ASSOCIATEF attaches formulas that will specify the generation of image data for each such node only if all predicates applied at that node are guaranteed to hold. In other words, image data generation for nodes along the major path of the expression is filtered. Note that this filtering holds only for this matching. We shall therefore need to carefully superimpose rewritings so as not to undo the filtering effect.

Example. Let T be the tagged tree in Figure [...] and let the expression be [...]. In this case, there is only one possible matching of the expression to T. The initial call will recursively call ASSOCIATEF on the node polist, then po, and then on the predicate at node po. The call on the predicate expression will return the formula [...], which will be conjoined with the original binding annotation formula Orders at node po, thus effectively limiting the data tree generation to purchase orders of company “ABC”.

[Footnote: the variable here is a dummy variable, as no variable is bound at this node.]

4.2 Superimposing Rewritings

Algorithm ASSOCIATEF handles one matching of an expression. Consider another matching of the same expression and T. Intuitively, each such matching should be supported independently of the other. We now consider superimposing rewritings due to such independent matching maps.

A straightforward approach is to rewrite a formula at a node that is in the range of several matching maps to a formula [...] that combines the original formula tagging the node with the formulas assigned to the node by ASSOCIATEF, one per matching.

The approach above may lead to unnecessary image data generation. Consider a tree T with Root, a child of the root at which a variable is bound, and a child of that node in which another variable is bound via a formula depending on the first variable. Suppose one matching map results in a single binding on the current relational data. Suppose a second matching map results in no bindings on the current relational
data. Now consider generating image data at [...]. The formula there is of the form [...], with one part corresponding to the first matching map and the other to the second. Suppose that [...] and [...] are true on some other tuple on which [...] is false. Then image data is generated for [...] based on the resultant binding. However, had we applied the rewritings separately, no such image data would be generated. This phenomenon, in which a formula for one matching map generates image data based on a binding due to another matching map, is called tuple crosstalk.

The solution is to add a final step to algorithm ASSOCIATEF. In this step the rewritten tree is transformed into a qual-tagged tree (definition follows). This transformation is simple and results in the elimination of tuple crosstalk. Essentially, it ensures that each disjunct in the superimposed formula will be true only for bindings that are relevant to the corresponding matching map. For example, in subsection 1.1 this solution was used in obtaining the final superimposed tree.

We now define the concept of a qual-tagged tree. Consider a variable bound by a formula. Define [...] to be that formula from which all atoms of the form [...] are eliminated. For example, suppose [...]; then [...]. Consider a node in tagged tree T with ancestor variables [...]. If we replace its formula with [...], the meaning of the tagged tree remains unchanged. This is because, intuitively, the restrictions added are guaranteed to hold at that point. A tree in which this replacement is applied to all nodes is called a qual-tagged tree.

5 Tagged Tree Optimizations

5.1 Node Marking and Mapping Tree Pruning

Consider a tagged tree T and a normalized expression. Each node in T which is in the range of some matching map is marked as visited. Each such node that is the last one to be visited along the major path of the corresponding path expression, and all its descendants, are marked as end nodes. Nodes that are not marked at all are useless for the expression: no image data generated for such nodes will ever be explored by the expression in an image data tree. If, for all client queries ec and all normalized expressions for ec, a node is useless for the expression, then that node may be deleted from
T. One may wonder whether we can further prune the tagged tree; consider the following example.

Example. Let T be a chain-like tagged tree, with a root, a root child A tagged with a formula binding a variable, and a child B (of A) containing data extracted from, say, B. Suppose our set of queries consists of the following normalized expression: [...]. Observe that there is but a single matching map from this expression to T. After performing ASSOCIATEF based on this matching map, the formula at node A is modified to reflect the requirement for the existence of the B node whose data can be interpreted as an integer greater than 10. So, when image data is produced for A, only data tuples satisfying this requirement are extracted. Since the B elements are not part of the answer (just a predicate on the A elements), one is tempted to prune node B from T. This, however, will result in an error when applying the expression to the resulting data: the predicate [child::B > 10] will not be satisfied, as there will simply be no B elements!

5.2 Formula Pushing

Consider a tagged tree produced by an application of algorithm ASSOCIATEF using a matching map. A node in this tree is relevant if it is an image under the matching map. A relevant node is essential if it (i) is an end node, or (ii) has an end node descendant, or (iii) is the image of a last step in a predicate expression. (Observe that this definition implies that all relevant leaves are essential.)
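As a rough illustration of the essential-node definition above, the following Python sketch classifies relevant nodes using conditions (i) to (iii). The tree encoding (a dict of child lists), the node names, the sample sets, and the function name essential_nodes are all assumptions made for this illustration; they are not taken from the paper.

```python
def essential_nodes(children, relevant, end_nodes, predicate_last):
    """Return the relevant nodes that satisfy condition (i), (ii), or (iii).

    children:       dict mapping a node to the list of its children (hypothetical encoding)
    relevant:       set of nodes that are images under the matching map
    end_nodes:      set of nodes marked as end nodes
    predicate_last: set of nodes that are images of a last step in a predicate expression
    """
    has_end_below = {}

    def end_in_subtree(node):
        # True if the node itself is an end node or has an end node descendant.
        if node not in has_end_below:
            below = [end_in_subtree(c) for c in children.get(node, [])]
            has_end_below[node] = node in end_nodes or any(below)
        return has_end_below[node]

    return {
        n for n in relevant
        if end_in_subtree(n)      # conditions (i) and (ii)
        or n in predicate_last    # condition (iii)
    }


# Hypothetical tagged tree: Root -> A -> {B, C}, C -> D.
children = {"Root": ["A"], "A": ["B", "C"], "C": ["D"]}
relevant = {"Root", "A", "B", "C", "D"}
end_nodes = {"B"}          # assumed last node visited on the major path
predicate_last = {"D"}     # assumed image of the last step of a predicate

essential = essential_nodes(children, relevant, end_nodes, predicate_last)
print(sorted(essential))             # ['A', 'B', 'D', 'Root']
print(sorted(relevant - essential))  # ['C']
```

In this made-up tree, C is relevant but not essential: it is an internal node on a predicate path, with no end node at or below it.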
A relevant node that is not essential is said to be dependent. Intuitively, a dependent node is an internal node of the subtree defined by the matching map that has no end node descendant.

Let us formalize the notion of a formula’s push up step. Consider a dependent node tagged with formula F, together with its relevant children and the formulas associated with them. Take the variable bound at the dependent node, if any, and some new “dummy” variable otherwise. Modify F to [...]. The intuition is that the data generated as the image of the dependent node is filtered by its formula as well as by the requirement that this data will embed generated data for some child. There is no point in generating a data element for the node that does not “pass” this filtering: such a data element cannot possibly be useful for data tree matching by the expression mapping implied by the matching map.

Push up steps may be performed in various parts of the tree. In fact, if we proceed bottom-up, handling a dependent node only once all its dependent children nodes (if any) have been handled, the result will be a tagged tree in which each generated image data for a dependent node can potentially be used by the expression mapping implied by the matching map (on the resulting data tree). In other words, image data that certainly cannot lead to generation of image data usable by the expression mapping is not produced.

The benefit of pushing lies in the further filtering of image data node generation. The cost of extensive filtering is that (a) the resulting formulas are complex (they eventually need to be expressed in SQL), and (b) some computations are repeated (that is, a condition is ensured, perhaps as part of a disjunction, and then rechecked for a filtering effect further down the tree). Therefore, there is a tradeoff: less “useless” image data is generated, but the computation is more extensive. Further work is needed to quantify this tradeoff so as to
decide when it is profitable to apply push up steps.

We have explained formula push up as applied to a tagged tree that reflects a single matching. Recall that such trees, for various queries and matching mappings, may be superimposed. The superimposition result reflects the push up steps of the various matching mappings as well. An alternative approach is to first superimpose and then perform push up steps. For example, view a node as essential if it is essential for any of the relevant mappings. Clearly, the resulting filtering will be less strict (analogous to the tuple “crosstalk” phenomenon).

5.3 Minimum Data Generation

For a tree T, let [...] denote the number of nodes in T. For a second tree, [...] denotes that it may be obtained from T by deleting some sub-trees of T. Given a data tree and a query Q, we would like to compute a minimum data tree for which [...]. This problem is related to the following decision problem:

BOUNDED TREE
INSTANCE: A data tree, a query Q, and an integer [...].
QUESTION: Is there a data tree with [...] nodes such that [...]?

Theorem. The BOUNDED TREE problem is NP-complete.
Proof. By reduction from SET COVER (omitted).
Corollary. Finding a minimum data tree is NP-hard.

6 Experimental Evaluation

In order to validate our approach, we applied our selective mapping materialization techniques to relational databases conforming to the TPC-W benchmark [TPC]. The TPC-W benchmark specifies a relational schema for a typical e-commerce application, containing information about orders, customers, addresses, items, authors, etc. We have designed a mapping from this database schema to an XML schema consisting of a polist element that contains a list of po elements, each of which contains information about the customer and one or more orderline elements, which in turn contain the item bought and its author (an extension of the simple tagged tree example in Section 2).

In the first experiment, we examine the effect of tagged tree pruning for XPath queries without predicates. Thus, we simulated a client workload [...] (we use the abbreviated
XPath syntax for readability). For each XPath expression in this workload, the ASSOCIATEF algorithm found a single matching in the original tagged tree and, after eliminating all the unmarked nodes, it resulted in a tagged tree that enabled a data reduction of about 58%. The sizes of the generated documents are shown in Figure 5 (for three database sizes).

Fig. 5. Data Tree Reductions (no predicates)

In the second experiment (see Figure 6), we simulated a client that is interested only in the orders with shipping address in one of the following five countries: USA, UK, Canada, Germany and France. This time, all the order information is needed for the qualifying orders, so no nodes are eliminated from the tagged tree. Instead, the disjunction of all the country selection predicates is attached to the po node, which reduced the data tree size by about 94% (because only about 5.6% of the orders were shipped to one of these countries). In general, the magnitude of data reduction for this type of query is determined by the selectivity of the predicate.

Fig. 6. Data Tree Reduction (with predicates)

In the third experiment (see Figure 7), we simulated a client that wants to mine order data, but does not need the actual details of the ordered items. In this case, the ASSOCIATEF algorithm eliminated the entire subtree rooted at orderline, resulting in a data reduction of about 60%.

Fig. 7. Data Tree Reduction (no orderlines)

Finally, in the fourth experiment (see Figure 8), we simulated the combined effect of predicate-based selection and conditional pruning of different portions of the tagged tree. Thus, we considered the workload described in the Introduction (Section 1.1).

Fig. 8. Data Tree Reduction (predicates and structure)

This time,
the data reduction amounted to about 62% when a qual-tree was generated, compared to about 50% for the non-qual-tree option (the predicates applied only at the “po” node), which demonstrates the usefulness of qual-trees.

Our experiments show that, in practice, our mapping rewriting algorithms manage to reduce the generated data tree size by significant amounts for typical query workloads. Of course, there are cases when the query workload “touches” almost all the tagged tree nodes, in which case the data reduction would be minimal. Our algorithms can identify such cases and advise the data administrator that full mapping materialization is preferable.

7 Conclusions and Future Work

We considered the problem of rewriting an XML mapping, defined by the tagged tree mechanism, into one or more modified mappings. We have laid a firm foundation for rewriting based on a set of client queries, and suggested various techniques for optimization, at the tagged tree level as well as at the data tree level. We confirmed the usefulness of our approach with realistic experimentation.

The main application we consider is XML data deployment to clients requiring various portions of the mapping-defined data. Based on the queries in which a client is interested, we rewrite the mapping to generate image data that is relevant for the client. This image data can then be shipped to the client, which may apply an ordinary XPath processor to the data. The main benefit is in reducing the amount of shipped data and, potentially, the query processing time at the client. In all cases, we ship a sufficient amount of data to correctly support the queries at the client. We also considered various techniques to further limit the amount of shipped data. We have conducted experiments to validate the usefulness of our shipped data reduction ideas. The experiments confirm that in reasonable applications,
data reduction is indeed significant (60-90%).

The following topics are worthy of further investigation: developing a formula-label aware XPath processor; quantifying a cost model for tagged trees, and rules for choosing transformations, for general query processing; and extending rewriting techniques to a full-fledged query language (for example, a useful fragment of XQuery).

References

M. J. Carey, J. Kiernan, J. Shanmugasundaram, E. J. Shekita, and S. N. Subramanian. XPERANTO: Middleware for publishing object-relational data as XML documents. In VLDB 2000, pages 646–648. Morgan Kaufmann, 2000.
[FMS01] Mary Fernandez, Atsuyuki Morishima, and Dan Suciu. Efficient evaluation of XML middleware queries. In SIGMOD 2001, pages 103–114, Santa Barbara, California, USA, May 2001.
[GMW00] Roy Goldman, Jason McHugh, and Jennifer Widom. Lore: A database management system for XML. Dr. Dobb’s Journal of Software Tools, 25(4):76–80, April 2000.
[KM00] Carl-Christian Kanne and Guido Moerkotte. Efficient storage of XML data. In ICDE 2000, pages 198–200, San Diego, California, USA, March 2000. IEEE.
[LCPC01] Ming-Ling Lo, Shyh-Kwei Chen, Sriram Padmanabhan, and Jen-Yao Chung. XAS: A system for accessing componentized, virtual XML documents. In ICSE 2001, pages 493–502, Toronto, Ontario, Canada, May 2001.
[Mic] Microsoft. SQLXML and XML mapping technologies. msdn.microsoft.com/sqlxml
[MS02] G. Miklau and D. Suciu. Containment and equivalence for an XPath fragment. In PODS 2002, pages 65–76, Madison, Wisconsin, May 2002.
[Sch01] H. Schöning. Tamino - A DBMS designed for XML. In ICDE 2001, pages 149–154, Heidelberg, Germany, April 2001. IEEE.
J. Shanmugasundaram, J. Kiernan, E. J. Shekita, C. Fan, and J. Funderburk. Querying XML views of relational data. In VLDB 2001, pages 261–270, Roma, Italy, September 2001.
Jayavel Shanmugasundaram, Eugene J. Shekita, Rimon Barr, Michael J. Carey, Bruce G.
Lindsay, Hamid Pirahesh, and Berthold Reinwald. Efficiently publishing relational data as XML documents. In VLDB 2000, pages 65–76. Morgan Kaufmann, 2000.
[TPC] TPC-W: a transactional web e-commerce performance benchmark. www.tpc.org/tpcw
[XP] XPath. www.w3.org/TR/xpath
[XQ] XQuery. www.w3c.org/XML/Query
