On view processing for a native XML DBMS

Chen Ting
National University of Singapore
2004

Contents

1 Introduction
2 Background
  2.1 XML data model
  2.2 ORA-SS
3 Review of the State of the Art
  3.1 XML Schema Formats and Graphical view definitions
  3.2 XML document storage schemes and Native XML DBMS
  3.3 XML View Processing techniques
4 ORA-SS as XML View Definition Format
  4.1 Why ORA-SS?
  4.2 Semantics of ORA-SS views
  4.3 Comparison and Summary
5 XML Document Storage in Native XML DBMSs
  5.1 Object Based Clustering
  5.2 Object Labelling Scheme
  5.3 Object Based Clustering vs Element Based Clustering
6 ORA-SS View Processing on a native XML DBMS
  6.1 Associative Join: A Primitive XML Join Technique
    6.1.1 Structural Query and Associative Query
    6.1.2 Processing of Associative Query
  6.2 Processing XML views defined in ORA-SS formats
    6.2.1 Value Join vs Associative Join
    6.2.2 The importance of relationship set in ORA-SS view schema
    6.2.3 ORA-SS View Transformation Algorithm
7 Experiments
  7.1 XBase description
    7.1.1 ORA-SS Schema Parser
    7.1.2 Storage Manager
    7.1.3 ORA-SS View Transformer
  7.2 Datasets
    7.2.1 DBLP Bibliography Record (DBLP)
    7.2.2 Project-Researcher-Paper (JRP)
  7.3 Performances and Analysis
    7.3.1 The advantages of OBC storage
    7.3.2 View Processing in XBase
8 Conclusion
A Appendix
  A.1 XSLT Script for view schema in Figure 7.9c
  A.2 XSLT Script for view schema in Figure 7.9d

Chapter 1 Introduction

Traditionally, views are an important aspect of data processing. View support is desirable because it provides automatic security for hidden data and allows the same data to be seen by different users in different ways at the same time. Compared with views in relational databases, views for hierarchical data like XML allow not only basic operations such as selection, projection and join, but also structural swapping of nodes in document trees. For example, a
bibliography XML file (e.g. DBLP [19]) contains a list of publications; “under” each publication there are the authors together with various other properties of the publication. A frequent view operation on XML data like DBLP is to find all authors together with their publications, which is in effect a swapping operation on the nodes “Publication” and “Author”.

The starting point of an XML view transformation is the view definition. There are two general approaches to defining views on source XML data. One way is to define views or queries in script languages like XQuery [32] or XSLT [33]. The alternative approach is to define views by view schemas. Systems like Clio [24], eXcelon [11] and the work in [7] fall into this category. Users only need to define a view schema over the source data to obtain the desired view result. This approach is declarative and relieves the user of writing complex scripts to perform the view transformation.

Both approaches have problems that keep them from being ideal XML view definition formats. The query languages of the first approach (e.g. XSLT and XQuery) usually use regular expressions to express possible variations in the structure of the data. But the use of regular expression queries means the user is responsible for phrasing queries in a way that covers the variations in the structure of the source data. As an example, suppose again that we want to find the authors of each publication; the information we want may be presented in the source data in two ways: in some places author is nested under publication (e.g. in a bibliography record), whereas in other places publication is nested under author (e.g. in the publication list of a researcher). Using regular expressions means that we have to specify two patterns, author//publication and publication//author, to obtain all relevant information. Clearly, the example can be extended such that in the
worst case an exponential number of regular expressions needs to be written to cover all possible variations in the source data.

To overcome this problem, one solution is to utilize the ontology of the source data, which consists of the list of tag names of elements and attributes in the data. It is much easier to start from the ontology to define views than to require a user to comprehend the structural details of the source data. As an example, we can extract the two keywords author and publication from the source schema. Next we let author be the parent node of publication in a view schema, meaning that we want to find all matching pairs of author and publication elements which lie on the same path in the source documents and construct the results by placing publication elements under author elements. Note that we do not restrict the hierarchical order of the elements of a matching pair in the source document.

The approach discussed in this thesis greatly extends the above idea: it allows a user to extract element names from the ontology of the source data and define the structure of the view via a view schema. All the tedious work of finding structural variations of view schemas in the source document is left to the view processing back-end system. Thus view definitions can be phrased succinctly, based only on the ontology.

Meanwhile, simple tree/graph-structured schema languages like DTD and XML Schema, used in the second approach for the XML view (target) schema, cannot express many useful semantics and consequently cause ambiguity. To see this, let us take a look at the sample XML document in Figure 1.1. It contains information about researchers working under different projects and the publication list of each researcher.

Example 1.1 Consider the source XML document and view schema in Figure 1.1. The view schema has at least two possible meanings:

1. For each project, list all the papers published by project members; for each paper of the project, list all the authors of the paper.
2. For each project,
list all the papers published by project members; for each paper of the project, list all the authors of the paper working for the project.

The different interpretations result in quite different views. Current popular XML schema formats like DTD and XML Schema are unable to express these semantic differences. One of the main focuses of our work is to use an XML schema representation, the Object-Relationship-Attribute model for Semi-Structured data (ORA-SS) [9], which overcomes the problems of the two current XML view definition approaches. ORA-SS can extract matches with structural variations from the XML source and at the same time clearly define the semantics of source data and views.

Figure 1.1: A sample XML document with DTD-like source and view schemas
(a) Source XML document:
    <root>
      <Researcher R_Name="r1"> </Researcher>
      <Researcher R_Name="r2"> </Researcher>
      <Researcher R_Name="r2"> </Researcher>
      <Researcher R_Name="r3"> </Researcher>
    </root>
(b) Source schema: Root / Project (J_Name) / Researcher (R_Name) / Paper (P_Name)
(c) View schema: Root / Project (J_Name) / Paper (P_Name) / Researcher (R_Name)

There are three main proposed ways to process XML view definitions. General document-based XML query processing engines (e.g. XQuery and XSLT query engines such as Xalan [30], XT [8], SAXON [26] and QuiP [25]) traverse in-memory source data trees to output the result tree. Another possible solution is to load the XML data file into a relational or object-relational database and perform the view transformation using available RDBMS facilities. This method requires conversion from hierarchical data and schema to relational data and schema. The third approach, and the one used in this thesis, is to use a native XML DBMS to support view transformation. A native XML DBMS is one which is designed and implemented from the ground up for storage and query processing of XML data.

Recently, great efforts have been put into the study of XML query optimization.
Techniques [1][3][34] have been developed mainly for processing queries defined in the XPath [31] standard, which can express both path and branch patterns. However, as we demonstrated earlier, XML views defined based on the ontology of the source data cannot be mapped to a single XPath expression. To meet the new challenges, we investigate new XML query processing techniques for views defined via schema mapping. The new techniques are integrated into our native XML DBMS XBase to process XML views defined in ORA-SS format. Experimental results demonstrate the advantages of our method over current state-of-the-art approaches.

The main contributions of our work are:

1. We introduce a new view schema definition format based on ORA-SS which can
   (a) extract matches with structural variants in tree-structured data like XML without issuing an excessive number of queries, as XSLT and XQuery require;
   (b) express a large variety of semantics resulting in different views, which is not possible under view schema formats like DTD and XML Schema.
2. A native XML document storage and view transformation prototype, XBase, which implements a novel XML document storage scheme and query processing techniques to obtain views defined in our view schema format.

7.3 Performances and Analysis

Figures 7.10 to 7.13, 7.15 and 7.16 each plot running time (sec) against source XML file size (MB) for XBase, SAXON and QuiP.

Figure 7.10: Running time comparison of processing the ORA-SS view schema in Figure 7.9a for the JRP dataset
Figure 7.11: Running time comparison of processing the ORA-SS view schema in Figure 7.9b for the JRP dataset
Figure 7.12: Running time comparison of processing the ORA-SS view schema in Figure 7.9c for the JRP dataset
Figure 7.13: Running time comparison of processing the ORA-SS view schema in Figure 7.9d for the JRP dataset
Figure 7.14: Two views defined over DBLP datasets
(a) View Author: Author (Name)
(b) View Author-Publication: Author (Name), with Publication (year, Title) nested under it through an edge labelled AP, 2
Figure 7.15: Running time comparison of processing the ORA-SS view schema in Figure 7.14a for the DBLP dataset
Figure 7.16: Running time comparison of processing the ORA-SS view schema in Figure 7.14b for the DBLP dataset

(Figure 7.10 and Figure 7.11) in view schema Figure 7.9c because it has one more relationship type than Figure 7.9d and consequently needs one more round of associative join and value join. However, although SAXON 7.5 performs reasonably well on Figure 7.9d, it simply takes too long to finish the view for Figure 7.9c. This happens to QuiP too. We list the XSLT scripts used for both view schemas in Appendices A.1 and A.2. The scripts explain why XSLT engines like SAXON are slow for the view schema with two binary relationship types but much quicker for the view schema with one ternary relationship type. We extract the most important sections from the two scripts:

(a) Script for Proj-Paper-Researcher1
(b) Script for Proj-Paper-Researcher2

The two scripts are identical in finding papers written by researchers in a project (i.e. the first two xsl:for-each-group directives in the two scripts). Their main difference lies in the third xsl:for-each-group directive, for Researcher. The script for Proj-Paper-Researcher1 needs to search the whole document for each paper to find the complete author list, because the authors may not work for the project. On the other hand, the script for Proj-Paper-Researcher2 avoids the global search because it only intends to
find authors of the paper working for the project. This example shows that today’s general-purpose view transformation engines still need more work on optimization.

Our algorithm is very efficient for views which project and swap over a portion, but not all, of the source data (e.g. the views Project-Paper and Paper-Project of the JRP datasets and Author and Author-Publication of the DBLP datasets). This is especially true for big files (e.g. the 80 MB DBLP and JRP XML files) approaching the memory threshold, because loading the whole document into memory (which is what SAXON and QuiP do) degrades performance significantly.

Our algorithm shows little difference in performance for views that differ from each other only in the structural order of their nodes. To illustrate this, let us look at the running times of the views Project-Paper and Paper-Project in Figure 7.10 and Figure 7.11. These two view schemas differ only in their node order. Our algorithm uses roughly the same amount of time (about 2-5% difference) on both views for all file sizes; SAXON, on the other hand, needs about 20% more time to process the view Paper-Project. The big difference in the running times of the SAXON and QuiP engines is due to their “tree-walk” transformation mechanism: different orders in views cause different tree traversal sequences. In contrast, the running times of the associative join, value join and merge operations in the transformation algorithms of XBase are determined solely by the total list sizes.

Chapter 8 Conclusion

The starting point of our work is the observation of problems with two kinds of XML view transformation systems. High-level systems like eXcelon [11] and Clio [24], which perform view transformations from a defined target view schema, are unable to define views with complex semantics. On the other hand, general systems like XSLT and XQuery processors require the user to write transformation scripts
themselves. The problem becomes worse with tree-structured XML data because many possible structural variants need to be considered in the transformation scripts to cover all possible query results. As we have demonstrated in the experiments, it is hard for these systems to optimize many simple and practical XSLT/XQuery scripts.

Our approach, like the other high-level systems, allows users to define a view schema to get the desired views. However, our method differs from other high-level XML view transformation systems in the following aspects:

1. The use of ORA-SS as the underlying schema representation allows us to express view schemas with a great variety of semantic meanings.
2. Our ontology-based view definition approach saves users the trouble of looking into the often complicated source schema. Users just need to know the set of element names (ontology) in the source schema to define views. The mapping from source schema to view schema is done by our system.
3. It performs view transformation directly on a native XML DBMS, XBase. Other systems usually convert the schema mapping into XSLT/XQuery scripts and still rely on XSLT/XQuery engines to perform the transformation. At the same time, based on ORA-SS schemas, our system requires much less time than general XML processors to process view transformations.

Our view transformation engine XBase has many novel features. Our storage method, Object Based Clustering, allows us to leverage the new join processing techniques naturally. In most view transformations, only the relevant portions of the source document are read. ORA-SS, as the conceptual model used in our transformation, differs from traditional structure-based XML data models by taking into full consideration semantic information such as keys, relationships and relationship attributes associated with XML data. In our transformation method, a relationship type in the ORA-SS schema is the basic unit of transformation. We devise a novel and efficient join technique called associative join to
construct a single relationship type.

From a performance point of view, our transformation engine XBase is much more efficient than XQuery/XSLT engines. In particular, our method is very fast at projection and swapping over a portion of the source data because it only loads the data which is required. In comparison, today’s XSLT and XQuery processors are slow at processing view operations like projection and swapping. Our method is also much faster at processing ORA-SS view schemas with multiple relationships on a single path.

Our work can be extended in several directions. View operations like value selection, negation and so on are not considered in our prototype. Moreover, our view transformation algorithms do not utilize source ORA-SS schema information, which could help to further optimize our view transformation engine.

Bibliography

[1] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of SIGMOD 2002, pages 310–321, 2002.
[2] S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: a graphical language for querying and restructuring XML documents. In Proceedings of WWW8, pages 1171–1187, 1999.
[3] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of SIGMOD 2000, pages 379–390, 2000.
[4] Q. Chen, A. Lim, and K. W. Ong. D(k)-index: An adaptive structural summary for graph-structured data. In Proceedings of SIGMOD 2003, pages 134–144, 2003.
[5] T. Chen, T. W. Ling, and C. Y. Chan. Prefix path streaming: a new clustering method for optimal XML twig pattern matching. In Proceedings of DEXA 2004, pages 801–811, 2004.
[6] Y. B. Chen, T. W. Ling, and M. L. Lee. Designing valid XML views. In Proceedings of ER 2002, pages 463–478, 2002.
[7] Y. B. Chen, T. W. Ling, and M. L. Lee. Automatic generation of XQuery view definitions from ORA-SS views. In Proceedings of ER 2003, pages 158–171, 2003.
[8] J. Clark. XT XSLT processor. http://blnz.com/xt/
[9] G. Dobbie, X. Wu, T. W. Ling, and M. L. Lee. ORA-SS: An Object-Relationship-Attribute model for Semistructured data. Technical Report TR21/00, School of Computing, National University of Singapore, 2000.
[10] DTD. Document type definitions. http://www.w3.org/TR/REC-xml
[11] eXcelon. A general XML data manager. http://www.exceloncorp.com/
[12] R. Goldman and J. Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In Proceedings of VLDB 97, pages 436–445, 1997.
[13] H. He and J. Yang. Multiresolution indexing of XML for frequent queries. In Proceedings of ICDE 2004, pages 683–694, 2004.
[14] H. V. Jagadish and S. Al-Khalifa. TIMBER: A native XML database. Technical report, University of Michigan, 2002.
[15] H. Jiang, W. Wang, H. Lu, and J. X. Yu. Holistic twig joins on indexed XML documents. In Proceedings of VLDB 2003, pages 273–284, 2003.
[16] C. C. Kanne and G. Moerkotte. Efficient storage of XML data. In Proceedings of ICDE 2000, pages 198–209, 2000.
[17] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for efficient indexing of paths in graph structured data. In Proceedings of ICDE 2002, pages 129–140, 2002.
[18] M. Ley. Apache Xindice. http://XML.apache.org/xindice/
[19] M. Ley. DBLP computer science bibliography record. http://www.informatik.uni-trier.de/~ley/db/
[20] D. F. Luo, T. Chen, T. W. Ling, and X. F. Meng. On view transformation support for a native XML DBMS. In Proceedings of DASFAA 2004, pages 226–231, 2004.
[21] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, pages 54–66, 1997.
[22] T. Milo and D. Suciu. Index structures for path expressions. In Proceedings of ICDT 99, pages 277–295, 1999.
[23] W. Ni and T. W. Ling. GLASS: A graphical query language for semistructured data. In Proceedings of DASFAA 2003, pages 363–372, 2003.
[24] L. Popa, M. A. Hernandez, Y. Velegrakis, R. J. Miller, F. Naumann, and H. Ho. Mapping XML and relational schemas with Clio. In Proceedings of ICDE 2002, pages 498–499, 2002.
[25] QuiP. http://developer.softwareag.com/tamino/quip/
[26] SAXON. An XSLT processor. http://saxon.sourceforge.net/
[27] XML Schema. http://www.w3.org/XML/Schema
[28] H. Schöning. Tamino - a DBMS designed for XML. In Proceedings of ICDE 2001, pages 149–154, 2001.
[29] Y. Wu, J. M. Patel, and H. V. Jagadish. Structural join order selection for XML query optimization. In Proceedings of ICDE 2003, pages 443–454, 2003.
[30] Xalan. XSLT processor. http://xml.apache.org/xalan-j/
[31] XPath. http://www.w3.org/TR/xpath
[32] XQuery. http://www.w3.org/XML/Query
[33] XSLT. http://www.w3.org/Style/XSL/
[34] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On supporting containment queries in relational database management systems. In Proceedings of SIGMOD 2001, pages 425–436, 2001.

Appendix A

A.1 XSLT Script for view schema in Figure 7.9c:

Remark: The differences between the XSLT scripts in Appendices A.1 and A.2 are the highlighted lines.
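
The difference between the two readings of the Project-Paper-Researcher view (Example 1.1, and the Proj-Paper-Researcher1 and Proj-Paper-Researcher2 scripts discussed in Section 7.3) can be illustrated with a small Python sketch. It is neither one of the XSLT scripts of this appendix nor the XBase algorithm; it is a naive in-memory illustration, and the sample document, the attribute names J_Name, R_Name and P_Name, and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document shaped like the source schema of Figure 1.1(b):
# Project / Researcher / Paper. All names and values are made up.
SOURCE = """
<root>
  <Project J_Name="j1">
    <Researcher R_Name="r1"><Paper P_Name="p1"/></Researcher>
    <Researcher R_Name="r2"><Paper P_Name="p1"/><Paper P_Name="p2"/></Researcher>
  </Project>
  <Project J_Name="j2">
    <Researcher R_Name="r2"><Paper P_Name="p2"/></Researcher>
    <Researcher R_Name="r3"><Paper P_Name="p1"/></Researcher>
  </Project>
</root>
"""


def project_paper_researcher(root, project_members_only):
    """Return {project: {paper: sorted researcher names}} for one reading."""
    # Reading 1 needs every author of a paper, so scan the whole document once.
    all_authors = {}
    for researcher in root.iter("Researcher"):
        for paper in researcher.findall("Paper"):
            all_authors.setdefault(paper.get("P_Name"), set()).add(
                researcher.get("R_Name"))

    view = {}
    for project in root.findall("Project"):
        # Papers of this project, with the authors found inside the project.
        papers = {}
        for researcher in project.findall("Researcher"):
            for paper in researcher.findall("Paper"):
                papers.setdefault(paper.get("P_Name"), set()).add(
                    researcher.get("R_Name"))
        if not project_members_only:
            # Reading 1: replace the local authors by the global author list.
            papers = {name: all_authors[name] for name in papers}
        view[project.get("J_Name")] = {
            name: sorted(authors) for name, authors in papers.items()}
    return view


root = ET.fromstring(SOURCE)
reading_1 = project_paper_researcher(root, project_members_only=False)
reading_2 = project_paper_researcher(root, project_members_only=True)
print(reading_1["j1"]["p1"])  # all authors of p1
print(reading_2["j1"]["p1"])  # only the authors of p1 working for j1
```

Only the first reading needs the document-wide author index, which mirrors the global search that the text attributes to the Proj-Paper-Researcher1 script.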

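The ontology-based view from Chapter 1, where author is made the parent of publication regardless of how the two are nested in the source, can be sketched in the same spirit. The code below is a brute-force illustration over an in-memory tree, not the associative join used by XBase; the sample document, the tag and attribute names, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical source mixing both nesting orders of author and publication.
SOURCE = """
<root>
  <publication title="t1"><author name="a1"/><author name="a2"/></publication>
  <author name="a2"><publication title="t2"/></author>
</root>
"""


def pairs_on_same_path(root, upper_tag, lower_tag):
    """Yield (upper, lower) element pairs where lower is a descendant of upper."""
    for upper in root.iter(upper_tag):
        for lower in upper.iter(lower_tag):
            if lower is not upper:
                yield upper, lower


def author_publication_view(root):
    """Build a view tree with publication elements placed under author elements."""
    matches = set()
    # Collect matching pairs in both hierarchical orders.
    for pub, auth in pairs_on_same_path(root, "publication", "author"):
        matches.add((auth.get("name"), pub.get("title")))
    for auth, pub in pairs_on_same_path(root, "author", "publication"):
        matches.add((auth.get("name"), pub.get("title")))

    view = ET.Element("root")
    by_author = {}
    for name, title in sorted(matches):
        node = by_author.get(name)
        if node is None:
            node = ET.SubElement(view, "author", name=name)
            by_author[name] = node
        ET.SubElement(node, "publication", title=title)
    return view


view = author_publication_view(ET.fromstring(SOURCE))
result = {a.get("name"): [p.get("title") for p in a] for a in view}
print(result)
```

The user of such a view names only the two tags; covering both nesting orders is left to the processing code, which is the division of labour the thesis argues for.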