Abstract
Query languages for XML such as XPath or XQuery support Boolean retrieval: a query result is a (possibly restructured) subset of XML elements or entire documents that satisfy the search conditions of the query. This search paradigm works for highly schematic XML data collections such as electronic catalogs. However, for searching information in open environments such as the Web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a ranked list of XML elements in descending order of (estimated) relevance. Web search engines, which are based on the ranked retrieval paradigm, do, however, not consider the additional information and rich annotations provided by the structure of XML documents and their element names.
This article presents the XXL search engine that supports relevance ranking on XML data. XXL is particularly geared for path queries with wildcards that can span multiple XML collections and contain both exact-match as well as semantic-similarity search conditions. In addition, ontological information and suitable index structures are used to improve the search efficiency and effectiveness. XXL is fully implemented as a suite of Java classes and servlets. Experiments in the context of the INEX benchmark demonstrate the efficiency of the XXL search engine and underline its effectiveness for ranked retrieval.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Aboulnaga A, et al. (2001) Estimating the selectivity of XML path expressions for internet scale applications. In: VLDB 2001, pp. 591–600.
Agirre E and Rigau G (1996) Word Sense disambiguation using conceptual density. In: 16th Int. Conf. on Computational Linguistics 1996, pp. 16–22.
Amer-Yahia S, Botev C and Shanmugasundaram J (2004) TeXQuery: A Full-Text Search Extension to XQuery. In: WWW 2004. Online Proceedings, available from http://www2004.org/.
Amer-Yahia S, et al. (2002) Tree pattern relaxation. In: Jensen CS et al., Eds., EDBT 2002, pp. 496–513.
Baeza-Yates RA and Riberto-Neto B, Eds. (1999) Modern Information Retrieval. Addison Wesley.
Blanken H, Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. (2003) Intelligent Search on XML Data, vol. 2818 of LNCS.
Boag S, et al. (2002) XQuery 1.0: An XML query language. W3c recommendation, World Wide Web Consortium. http://www.w3.org/TR/xquery.
Brin S and Page L (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7):107–117.
Bruno N, Koudas N and Srivastava D (2002) Holistic twig joins: Optimal XML pattern matching. In: SIGMOD 2002, pp. 310–321.
Budanitsky A and Hirst G (2001) Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In: Workshop on WordNet and Other Lexical Resources. Online Proceedings, available from http://engr.smu.edu/rada/mwnw/.
Chen Z, et al. (2001) Counting twig matches in a tree. In: ICDE 2001, pp. 395–404.
Chinenyanga TT and Kushmerick N (2001) Expressive and efficient ranked querying of xml data. In: WebDB 2001, pp. 1–6.
Chinenyanga TT and Kushmerick N (2002) An expressive and efficient language for XML information retrieval. Journal of the Americal Society for Information Science & Technology, 53(6):438–453.
Choi B, et al. (2003) On the optimality of holistic algorithms for Twig Queries. In: DEXA 2003, pp. 28–37.
Chung C-W, et al. (2002) APEX: An adaptive path index for xml data. In: SIGMOD 2002, pp. 121–132.
Cohen E, et al. (2002) Reachability and distance queries via 2-hop labels. In: 13th ACM-SIAM Symposium on Discrete algorithms (SODA 2002), pp. 937–946.
Cohen S, et al. (2003) XSEarch: A semantic search engine for xml. In: VLDB 2003, pp. 45–56.
Cohen WW (1999) Recognizing structure in web pages using Similarity Queries. In: 16th National Conference of Artificial Intelligence (AAAI)/11th Conference on Innovative Applications of Artificial Intelligence (IAAI), pp. 59–66.
Cormen TH, Leiserson CE and Rivest RL (2001) Introduction to Algorithms. 2nd edition, MIT Press.
Deutsch A, Fernandez MF, Florescu D, Levy AY and Suciu D (1998) XML-QL. In: QL ‘98, The Query Languages Workshop, W3C Workshop. available from http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/.
Fellbaum C, ed. (1998) WordNet: An Electronic Lexical Database. MIT Press.
Fuhr N and Großjohann K (2001) ‘XIRQL: A query language for information retrievalin XML documents. In: SIGIR 2001, pp. 172–180.
Graupmann J, Biwer M, Zimmer C, Zimmer P, Bender M, Theobald M and Weikum G (2004) COMPASS: A concept-based Web search engine for html, xml, and deep Web data (demo). In: VLDB 2004.
Grust T (2002)Accelerating XPath location steps. In: SIGMOD 2002, pp. 109–120.
Grust T and van Keulen M (2003) Tree awareness for relational dbms kernels staircase join. In: Blanken H, Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. Intelligent Search on XML Data, vol. 2818 of LNCS, pp. 231–245.
Guha S, et al. (2002) Approximate XML joins. In: SIGMOD 2002, pp. 278–298.
Guo L, et al. (2003) XRANK: ranked keyword search over XML documents. In: SIGMOD 2003, pp. 16–27.
Hayashi Y, et al. (2000) Searching text-rich xml documentswith Relevance Ranking. In: ACM SIGIR 2000 Workshop on XML and Information Retrieval. Online Proceedings, available from http://www.haifa.il.ibm.com/sigir00-xml/.
Hirst G and St-Onge D (1998) Lexical chains as representations of context for the detection and correction of malapropisms. In: Fellbaum C, Ed. WordNet: An Electronic Lexical Database. MIT Press, pp. 305–332.
Horrocks I (2002) DAML+OIL: A reason-able Web ontology language. in: edbt 2002, pp. 2–13.
Jamasz M and Szpankowicz S (2001) Roget’s thesaurus: A lexical resource to treasure. in Proceedings of the NAACL Workshop “WordNet and Other Lexical Resources,” Pittsburg, pp. 186–188.
Jamasz M and Szpankowicz S (2003) Roget’s thesaurus and semantic similarity. technicalReport TR-2003-01, University of Ottawa, Canada.
Jiang H, et al. (2003) Holistic twig joins on indexed xml documents. In: VLDB 2003, pp. 273–284.
Jiang JJ and Conrath DW (1997) Semantic similarity based on corpus statisticsand Lexical Taxonomy. In: 10th Int. Conf. on Research on Computational Linguistics (ROCLING 1997), Taipeh, Taiwan, pp. 19–33.
K+02 Kaushik R, et al. (2002) Covering indexes for branching path queries. In: SIGMOD 2002, pp. 133–144.
Kaushik R, et al. (2004) On the integration of structure indexes and Inverted Lists. In: SIGMOD 2004, pp. 779–790.
Kazai G, et al. (2003) The INEX evaluation initiative. in: blanken h Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. Intelligent Search on XML Data, vol. 2818 of LNCS, pp. 279–293.
Kleinberg J (1999) Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.
Leacock C and Chodrow M (1998) Combining local context and wordnet Similarity for Word Sense Disambiguation. In: Fellbaum C, Ed. WordNet: An Electronic Lexical Database. MIT Press, pp. 265–283.
Lenat DB and Guha RV (1990) Building Large Knowledge Based Systems. Addison Wesley.
Lesk M (1969) Word-word association in document retrieval systems’. American Documentation, 20(1):27–38.
Lewis WD (2002) Measuring conceptual distance using wordnet: the Design of a Metric for Measuring Semantic Similarity. The University of Arizona Working Papers in Linguistics, 12.
Lin D (1998) An information-theoretic definition of similarity’. In: 15th Int. Conf. on Machine Learning (ICML 1998), pp. 296–304.
Manning CD and Schuetze H (1999) Foundations of Statistical Natural Language Processing. The MIT Press.
McHale ML (1998) A comparison of WordNet and Roget’s taxonomy for measuring semantic similarity. in: workshopon Usage of WordNet in Natural Language Processing Systems (COLING-ACL 1998). Online Proceedings, available from http://xxx.lanl.gov/abs/cmp-lg/9809003.
Oracle 9i text. http://otn.oracle.com/products/text/.
Polyzotis N and Garofalakis MN (2002a) Statistical synopses for graph-structured xml databases. in SIGMOD 2002, pp. 358–369.
Polyzotis N and Garofalakis MN (2002b) Structure and value synopses for xml data graphs. In: VLDB 2002, pp. 466–477.
Qiu Y and Frei H-P (1993) Concept-based query expansion. In: SIGIR 1993, pp. 160–169.
Qiu Y and Frei H-P (1995) Improving the retrieval effectiveness by a similarityThesaurus. Technical Report 225, Swiss Federate Institute of Technology, Zürich, Switzerland.
Qun C, et al. (2003) D(k)-index: An adaptive structural summary for graphStructured Data. In: SIGMOD 2003, pp. 134–144.
Rada R, et al. (1989) Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30.
Resnik P (1995) Using information content to evaluate semanticSimilarity in a Taxonomy. In: 14th Int. Joint Conf. on Artificial Intelligence (IJCAI 95), Vol. 1. pp. 448–453.
Resnik P (1999) Semantic similarity in a taxonomy: anInformation-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11:95–130.
Richardson R, et al. (1994) Using WordNet as a knowledge base for measuring semantic similarity between words. In: Proceedings of the AICS Conference.
Rubenstein H and Goodenough JB (1965) Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
Russel S and Norvig P (1995) Artificial Intelligence–-A Modern Approach. Prentice Hall.
Schenkel R (2004) FliX: A flexible framework for indexing complex XML document collections. In: 1st Int. Workshop on Database Technologies for Handling XML Information on the Web.
Schenkel R, Theobald A and Weikum G (2003) Ontology-enabled XML search. In: Blanken H, Grabs T, Schek H-J, Schenkel R and Weikum G., Eds. Intelligent Search on XML Data, vol. 2818 of LNCS, pp. 119–131.
Schenkel R, Theobald A and Weikum G (2004) HOPI: An efficient connection index for complex xml document collections. In: EDBT 2004, pp. 237–255.
Schlieder T (2002) Schema-driven evaluation of approximate treepattern queries. In: EDBT 2002, pp. 514–532.
Schlieder T and Meuss H (2000) Result ranking for structured queries against xml documents. In: DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries, available from http://www.ercim.org/publication/ws-proceedings/DelNoe01.
SAD+00 Staab S, et al. (2000) Semantic community web portals. Computer Networks, 33(1–6):473–491.
Sussna M (1993) Word sense disambiguation for free-textindexing using a Massive Semantic Network. In: 2nd Int. Conf. on Information and Knowledge Management (CIKM 1993), pp. 67–74.
Theobald A and Weikum G (2000) Adding relevance to XML. In: 3rd Int. Workshop WebDB 2000, pp. 105–124.
Theobald A and Weikum G (2002a) The index-based xxl searchengine for Querying XML data with relevance ranking. In: EDBT 2002, pp. 477–495.
Theobald A and Weikum G (2002b) The XXL search engine: ranked retrieval ofxml data using indexes and ontologies. In: SIGMOD 2002, p. 615.
Wu Y, Patel JM and Jagadish H (2002) Estimating answer sizes for xml queries in: edbt 2002, pp. 590–608.
Wu Z and Palmer M (1994) Verb semantics and lexical selection. In: 32nd. Annual Meeting of the Association for Computational Linguistics 1994, pp. 133 –138.
Zezula P, et al. (2003) Tree signatures for XML querying and navigation. in: 1stint. xml Database Symposium, pp. 149–163.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Schenkel, R., Theobald, A. & Weikum, G. Semantic Similarity Search on Semistructured Data with the XXL Search Engine. Inf Retrieval 8, 521–545 (2005). https://doi.org/10.1007/s10791-005-0746-3
Issue Date:
DOI: https://doi.org/10.1007/s10791-005-0746-3