An Experiment in Building Vertical Search Engine

Vietnam National University of Hanoi
University of Engineering and Technology

STUDENT RESEARCH SEMINAR, 2012

Project: An Experiment in Building Vertical Search Engine
Students: Phạm Ngọc Quân (K53CA), Phạm Lê Lợi (K53CA), Bùi Hữu Điệp (K53CA), Lê Đăng Đạt (K53CA)
Faculty: Information Technology
Supervisor: Dr. Lê Quang Hiếu

Hanoi, 2012

PROJECT SUMMARY

Project: An Experiment in Building Vertical Search Engine
Project members: Phạm Ngọc Quân, Phạm Lê Lợi, Bùi Hữu Điệp, Lê Đăng Đạt
Project supervisor: Dr. Lê Quang Hiếu, Department of Information Technology, UET
Management: University of Engineering and Technology, VNU Hanoi
Research time: 9/2011 – 3/2012

1. Motivation

A general search engine returns results that span many topics, because the websites it covers are not classified by topic. For people searching for information about a specific topic, however, the websites in that category should be prioritized. With a popular general search engine, users have to read through the results and pick out the suitable ones themselves, which is inconvenient. The purpose of our research is an experiment in building a search engine that lets users choose the domain they are interested in and returns results closely related to that domain.

2. Main content

In this project, we focus on building a search engine that works as an upgrade of a popular search engine. The vertical search engine collects search results from one or several popular search engines (several, if their search methods differ enough to matter). A classification module then decides which sites are related to the domain. Finally, the out-of-topic web pages are removed and the filtered results are returned to the user. We also add keyword suggestions, such as keyword correction and keyword expansion, so that users can get better results.

3. Research result

In our experiment, we demonstrate the idea by building a vertical search engine for the topic of football. The search engine contains a data collection module that gets search results from the Yahoo and Bing search engines, and a classification module that uses a Support Vector Machine classifier. The whole project was written in Java and deployed as a website using JSP. A separate experiment is the keyword suggestion module, written in Python. Because this module runs too slowly when integrated with JSP, it has been removed from the search engine, but it can be tested separately.

TABLE OF CONTENTS

I. INTRODUCTION
II. LITERATURE REVIEW
    1. In the world
    2. In Vietnam
    3. Our research goal
III. VERTICAL SEARCH ENGINE
    1. System Architecture
    2. System’s Features
IV. SYSTEM MODULES
    1. Meta Search Engine
        1.1. Introduction
        1.2. Operation
        1.3. Structure
    2. Webpage Filter module
        2.1. Introduction
        2.2. Support Vector Machine Introduction
        About 300 more web pages of each category (Football and Non-Football) were collected to test the model. In testing, 97% of the web pages were correctly classified.
    3. Keyword Suggestion Model
        3.1. Introduction
        3.2. Operation
        3.3. Algorithms
    4. Search Interface
V. EXPERIMENTAL RESULT
VI. FUTURE WORK
VII. CONCLUSION
VIII. REFERENCES

I. INTRODUCTION

The Internet is becoming more and more popular in every country, and the amount of data it holds grows every second. Its complex structure and huge volume of data were a serious obstacle for Internet users. That was the reason for the introduction of a large number of search engines, of which Google was a big success. However, a general search engine like Google, which treats data in all domains equally, becomes inconvenient when users prefer one specific domain to the others.
In this situation, a vertical search engine, which draws on domain-specific expertise, would perform better.

A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common verticals include shopping, the automotive industry, legal information, medical information, and travel. In contrast to general Web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only Web pages that are relevant to a pre-defined topic or set of topics. Some vertical search sites focus on individual verticals, while other sites include multiple vertical searches within one search engine.

Vertical search offers several potential benefits over general search engines:
• Greater precision due to limited scope
• Leverage of domain knowledge, including taxonomies and ontologies
• Support for specific, unique user tasks

Domain-specific search is the part of vertical search that focuses on a specific topic. Domain-specific search solutions focus on one area of knowledge, creating customized search experiences that, because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers. [2]

Normally, the process of building a search engine consists of the steps below:
• Creating a crawler that collects websites from the Internet. This step covers the Internet and builds a database of websites for searching.
• Indexing the websites.
• Query processing: processing the user's query with natural language processing and matching it against the websites to list the appropriate results.
• Determining the website ranks and returning the ranked list to the user.
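The indexing, query processing and ranking steps listed above can be illustrated with a minimal sketch. It assumes the crawler has already produced a small dictionary of pages; the page URLs, the function names and the naive term-frequency scoring are all illustrative assumptions, not the method of any real search engine.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def build_index(pages):
    """Indexing step: map each term to {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][url] = count
    return index


def search(index, query):
    """Query processing and ranking: score each page by summed term frequency."""
    scores = Counter()
    for term in tokenize(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties broken by URL so the order is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical pages standing in for what a crawler would have collected.
pages = {
    "a.example/match": "Football match report: the football team won the cup",
    "b.example/cars": "Jaguar cars review and car prices",
    "c.example/club": "The club signed a new football player",
}
index = build_index(pages)
results = search(index, "football cup")
```

Even this toy version keeps every term of every page in memory, which hints at why a full-scale index is expensive to build and maintain.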
However, these steps require a massive amount of storage as well as a sophisticated algorithm for ranking the websites. Instead of building a whole new search engine, another approach is to build a metasearch engine. A metasearch engine is a search tool that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source. Metasearch engines enable users to enter search criteria once and access several search engines simultaneously. Metasearch engines operate on the premise that the Web is too large for any one search engine to index it all and that more comprehensive search results can be obtained by combining the results from several search engines. This also may save the user from having to use multiple search engines separately. [3]

In our research, we combine the technology of a domain-specific search engine with the idea of a metasearch engine, resulting in a two-level structure that combines the advantages of each.

II. LITERATURE REVIEW

1. In the world

Search engine construction has evolved from the early days of building from scratch to today's plethora of data APIs that make tomorrow's vertical search engines more powerful and easier to build.

Past
• Huge expenses to build the index, find the data and maintain the process.
• Most of the time was spent on building relevancy and less on design and creating a unique experience.

Present
• Search APIs reduce the complexity of building an index.
• Vertical search engines still spend significant resources on creating unique data.
• More resources are spent on designing the best relevancy and a unique experience.

Future
• New search engines tap into huge amounts of distributed data.
• More time for developing unique approaches to presenting relevant information and creating a unique experience.

Vertical search engines have a distinct advantage over the general search engines. They already know what their users are interested in.
A search for Jaguar in Yahoo! may return the automobile, the Mac OS, or the animal. However, vertical search engines that specialize in sports, autos, or animals would not have that problem. This assumption of user interest gives vertical search engines more flexibility in creating new models of relevancy ranking. [4]

General search engines such as Google, Yahoo and Bing are familiar to every Internet user and are considered indispensable tools. People now even go to a search page to find websites they already know, instead of typing the website's name directly into the address bar. Vertical search engines and metasearch engines, on the other hand, have had very few successes. Some vertical search engines have been released, such as MedNar, PubMed and BizNar, and some metasearch websites, such as iBoogie and InfoGrid, have been built, but none of them has become famous worldwide. At present, neither vertical search nor metasearch on its own seems powerful enough to displace classic search engines; they need further research or should be combined.

A good representative of vertical search is Truevert, an environmental vertical search engine that goes beyond the basic assumption of a niche user's intentions. It builds a unique natural language dictionary to enhance relevancy. A search for "CFL" on a regular search engine could return "Canadian Football League", but Truevert recognizes this as the acronym for "Compact Fluorescent Lighting", a much more relevant term for environmental concerns. [5]

Historically, Yahoo once dominated search; it then fell to second place after the rise of Google. The success of vertical search engines may come in the future, when convenience is valued more.

2. In Vietnam

In Vietnam, general search engines are very popular, and almost every website has its own search function. However, the idea of vertical search and metasearch has not been explored commercially.

3. Our research goal

We want to demonstrate the idea of a vertical search engine that re-filters the results of a general search engine based on a specific topic such as medicine, health, football, weather or economy. A simple vertical search engine should have a user interface similar to that of general search engines, its speed should be acceptable, its results should be prioritized by topic, and it should be able to suggest keywords to users. Beyond that, the experiment can also serve as an approach to providing a mechanism for quickly building a vertical search engine with the least effort.

III. VERTICAL SEARCH ENGINE

1. System Architecture

The architecture of the system is layer-based. Each layer represents one level of filtering: the more layers we have, the more irrelevant websites are filtered out, and thus the better the results. The layers are independent, so we can easily add or remove a layer or replace it with a new one. We have developed four modules: Meta Search Engine, Webpage Filter, Keyword Suggestion and Search Interface.

Figure 1: System architecture (components: Search Interface, Meta-search Engine, Keyword Suggestion, Filter, Knowledge Base, other search engines)

Meta Search Engine uses the metasearch technique to query other search engines for the search keyword. It then gathers all returned results together with their scores, converts them from HTML to plain text and sends the result to the upper module, Text Classification (the Webpage Filter).

Webpage Filter takes the results from the Web Crawler (the name also used here for the Meta Search Engine module) and refines them based on the knowledge base, using Support Vector Machine classification. All results that pass are sent to the Search Interface for display.

Keyword Suggestion, which is independent of the Web Crawler, gets suggestions from other search engines and then uses Information Gain (IG) to reorder them. The top-scoring suggestions are sent to the Search Interface for display.

Search Interface allows the user to choose the number of pages to retrieve and to enter the keyword.
It then calls Text Classification and Keyword Suggestion to get the returned pages and the suggestions, respectively. Finally, the results are displayed to the user.

2. System’s Features

By combining a popular and efficient search engine such as Bing or Yahoo with the ability to refine results by category, the system offers a vertical search service that helps users find useful information on the topic without filtering the search results themselves. The keyword suggestion function helps the user get keywords within the topic, which is an upgrade over the keyword suggestion function of the popular search engines. The system also offers a way to set up a personal or topic-specific search engine for an organization or company in a limited time, for cases where specific information needs to be searched and the popular search engines are not reliable for it.

[...] displayed in a number of pages.

Figure 7: Search Results with page change

V. EXPERIMENTAL RESULT

1. Experiments in search engine application

At first, the search engine was intended to combine the results from the Bing and Yahoo search engines, in order to get a variety of web pages for a search term as well as multiple perspectives arising from the differences in the web-crawling algorithms [...] of the Yahoo and Bing search engines. However, for most keywords the differences between the results of the two search engines are not noticeable, so the metasearch engine works only with the Bing search engine.

After the web page classification module was integrated into the search engine, some keywords with multiple meanings, such as Manchester City, Arsenal, SEA Games, Olympics and so on, were used to test the search results [...]
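The two stages exercised in this experiment, combining result lists from several engines and then filtering them by topic, can be sketched as below. This is only an illustration under stated assumptions: the report does not say how the Bing and Yahoo lists were to be combined, so round-robin interleaving with de-duplication is assumed here, and a hard-coded set of URLs (`FOOTBALL_PAGES`, hypothetical) stands in for the Support Vector Machine classifier.

```python
def merge_results(result_lists):
    """Interleave ranked URL lists round-robin, keeping the first copy of each URL."""
    merged, seen = [], set()
    longest = max((len(lst) for lst in result_lists), default=0)
    for rank in range(longest):
        for lst in result_lists:
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged


def filter_on_topic(urls, is_on_topic):
    """Drop off-topic pages while preserving the incoming order."""
    return [url for url in urls if is_on_topic(url)]


# Hypothetical stand-in for the SVM classifier: a fixed set of on-topic pages.
FOOTBALL_PAGES = {"arsenal.example", "fifa.example", "mancity.example"}


def toy_classifier(url):
    """Return True when the page is considered football-related."""
    return url in FOOTBALL_PAGES


# Made-up result lists for the ambiguous keyword "Arsenal".
bing = ["arsenal.example", "arsenal-london-tube.example", "fifa.example"]
yahoo = ["arsenal.example", "mancity.example", "weapons-depot.example"]

merged = merge_results([bing, yahoo])
filtered = filter_on_topic(merged, toy_classifier)
```

Because the filter only removes entries, the relative ranking assigned by the underlying engines is left untouched, so any classifier with the same call signature can be swapped in without changing the rest of the pipeline.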
[...] football.

Some advantages and disadvantages of the current search engine:
• Advantages:
  o New websites are picked up as the original search engines are updated.
  o Unrelated websites are removed.
• Disadvantages:
  o The classification works only for English (due to our lack of time and labour).
  o The vertical search engine may take longer to search than the original one.
  o Websites built with Flash cannot be analyzed [...]

• [...] be expanded to multiple languages, as English alone is not sufficient.
• The keyword suggestion module should be integrated into the search engine after its running time has been optimized.
• The project can be expanded into a new framework for quickly setting up a vertical search engine. Data for a specific domain, in an appropriate format, can be plugged into the system to make the search engine for that domain.

VII. CONCLUSION

Search engines are important tools for Internet users because of the structure and size of the Internet. General search engines are very popular, but there are also vertical search engines, which have some advantages over general ones. The purpose of our research is to build this kind of search engine: one that finds pages in one specific domain. Currently, we have [...] stuck at the integration step. In addition, the performance of the search engine is not very good: low speed, a poor interface and so on. However, when it is applied to the related domain, the search results are clearly improved. Although the result of our first experiment was not as good as expected, it confirms that vertical search engines have advantages that general search engines cannot match. Vertical search will obtain considerable [...]
[...] Meta Search Engine

1.1. Introduction

Meta Search Engine is the module that stands between the search engine and the Internet. It is a Java program that receives a keyword from the upper layer and then uses the Internet as the resource for finding related pages. It returns an unordered list of pages covering all aspects of the keyword to the upper layer.

1.2. Operation

After receiving a keyword from the user, the Web Crawler performs the following tasks:
• Ask multiple search engines [...]

[...] suggestions in other search engines and then put them through filters to find the most meaningful ones. Specifically, the following work is done:
• Similar to the Crawler module, it sends requests to search engines and extracts suggestions from the returned HTML pages.
• After getting a list of suggested words, it reorders these words using the Information Gain algorithm.
• The words with the highest scores are sent to the Search Interface.

Believing in [...]

[...] original search engine result (first 10 results):

The results after refining:

Figure 8: Vertical Search Experiment

Compared to the results sent back by the original search engine, the vertical search engine removed the websites that are not related to football while keeping the order of the football websites. With this upgrade, people who intend to find football pages can get the information more easily [...]

[...] the search engine is used. The first method takes much more time than the second one; however, the classification results may stay the same.

2. Experiments in keyword suggestion

Keywords in three different categories (football, non-football and ambiguous) are tested. Each keyword is sent to three different search engines: Yahoo, Bing and our vertical search engine. Then the suggestions from the three search engines above [...]
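The Information Gain reordering that the keyword suggestion module applies to suggested words can be sketched as follows. The excerpt names the algorithm but does not show its formula, so this sketch assumes the standard text-classification definition, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), computed over a small labeled corpus. The corpus, the labels and the function names are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(term, docs):
    """IG(term) = H(C) - P(t) H(C|t) - P(not t) H(C|not t) over labeled docs."""
    labels = [label for _, label in docs]
    with_term = [label for words, label in docs if term in words]
    without_term = [label for words, label in docs if term not in words]
    p_with = len(with_term) / len(docs)
    return (entropy(labels)
            - p_with * entropy(with_term)
            - (1 - p_with) * entropy(without_term))


def rank_suggestions(suggestions, docs):
    """Order suggested words by descending Information Gain."""
    # sorted() is stable, so words with equal scores keep their incoming order.
    return sorted(suggestions, key=lambda w: -information_gain(w, docs))


# Hypothetical labeled corpus: (set of words in the page, class label).
docs = [
    ({"goal", "league", "news"}, "football"),
    ({"goal", "striker", "news"}, "football"),
    ({"stock", "market", "news"}, "other"),
    ({"weather", "rain", "league"}, "other"),
]
ranked = rank_suggestions(["news", "league", "goal"], docs)
```

In this toy corpus "goal" appears only in football pages, so it separates the two classes perfectly and is ranked first, while "league" appears equally in both classes and carries no information about the topic.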

Date posted: 12/04/2014, 15:46
