Information Retrieval (IR)


GATE, ANNIE, Lib Lucene: Course Project
Presenter: Bui Dac Thinh
For: IR Students TH2010

Objective
  • Information Retrieval: search engine on crawled text datasets and open-source systems

Objective
  • Question: “What is the largest city of Vietnam?”
  • Document set
  • Answer: “……….”
  • HIT / NOT HIT

Search Engine Architecture
[Figure: architecture diagram, divided into an Online Part and an Offline Part]
  • Components: User Interface, Caching, Indexing and Ranking, Index Builder, Web Page Parser, Crawler, Web Graph Builder, Link Analysis
  • Data: Inverted index, Cached Pages, Page & Site Statistics, Page Ranks, Web Graph, Pages, Links, Anchors, Link Map

What to do
  • Crawler: crawler4j (Java), Sphider (PHP), Scrapy (Python), HTMLAgilityPack (.NET), GeckoFx (.NET)
  • Link set, textual data
  • Preprocessing: GATE (Java), UIMA (Java)
  • Data
  • ANNIE, OpenNLP

NLP: Survey of Tools & Resources
  • General frameworks
      • UIMA
      • GATE
  • NLP components, pipelines, and tools
      • Stanford Named Entity Recognizer (NER)
      • Stanford CoreNLP (CoreNLP)
      • NegEx (NegEx)
      • ENJU (ENJU)
      • OpenNLP

  • Apache OpenNLP (Java framework)
      • OpenNLP tools: sentence detector, POS tagger, tokenizer, shallow and full syntactic parser, named-entity detector
  • Emdros: text database engine for analyzed and annotated text
  • Mallet: machine learning for language toolkit in Java
  • NLTK
  • Weka
  • WordNet::Similarity: measures of semantic relatedness using WordNet

ANNIE: a Nearly-New Information Extraction System
  • Document Reset
  • Tokeniser
  • Gazetteer
  • Sentence Splitter
  • RegEx Sentence Splitter
  • POS Tagger
  • Semantic Tagger

GATE
  • Open-source software
  • Community of text engineering
  • Defined and repeatable process
  • "The Eclipse of NLP"
  • "The Lucene of Information Extraction"

[...]
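The architecture slide names an Index Builder and an Inverted index without showing either. As a rough illustration only, the Java sketch below maps each term to the set of documents that contain it and answers a query by intersecting those sets. The class TinyInvertedIndex, its methods, and the sample documents are assumptions made for this example; they come from neither the slides nor Lucene, and a real engine adds ranking, compression, and on-disk storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration class; not part of Lucene or GATE. */
final class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stores the document and returns its id (its position in insertion order). */
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of the documents that contain every query term (boolean AND). */
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its postings
            } else {
                result.retainAll(docs);         // later terms: keep only shared documents
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Ho Chi Minh City is the largest city of Vietnam");
        index.add("Hanoi is the capital of Vietnam");
        index.add("Lucene is a search library written in Java");
        System.out.println(index.search("largest city Vietnam"));
    }
}
```

In terms of the Objective slide, a query counts as a HIT when the returned set contains a document holding the answer. Boolean AND is the simplest matching rule; it was chosen here for brevity, not because the slides prescribe it.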
[...] given directory
  • calculates a score for each of the documents that match a given query

Search with Lucene: Search Result
  • Primary classes
      • ScoreDoc: a document that hits
          • Position
          • Score
      • TopDocs: total documents that hit [number]

Codeline: Index

    // state the file location of the index
    string indexFileLocation = @"C:\Index";
    Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation...

[...]

    // state the file location of the index
    string indexFileLocation = @"C:\Index";
    // false: open the existing index (passing true here would re-create it and erase its contents)
    Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
    // create an index searcher that will perform the search
    Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);

Codeline: Search

    // build a query object
    Lucene.Net.Index.Term searchTerm = new Lucene.Net.Index.Term("content",...

[...]

    // [...] to process the text
    Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
    // create the index writer with the directory and analyzer defined
    Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, true); /* true to create a new index */

Codeline

    // create a document, add in a single field
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();...

[...]
  • Apache Jakarta product
  • Popular: Xerox, Apple, Wikipedia, IBM, CNN, Nutch…
  • Open source, in Java
  • Presented as the most efficient framework for IR
      • Index
      • Search

  • Lucene4c / CLucene
  • NLucene / Lucene.NET
  • PyLucene
  • Ferret / RubyLucene
  • Zend Framework

What uses Lucene
  • Wikipedia
  • Nutch
  • CNET
  • Red-Piranha
  • ………

Lucene Sketch in 5 Mins
  http://www.ibm.com/developerworks/library/os-apache-lucenesearch/

Analysis with Lucene
Analysis [...]
    [...] searcher.Search(query);
    // iterate over the results
    for (int i = 0; i < hits.Length(); i++)
    {
        Document doc = hits.Doc(i);
        string contentValue = doc.Get("content");
        Console.WriteLine(contentValue);
    }

Codeline: Display

    /* First parameter is the query to be executed and the second parameter
       indicates the number of search results to fetch */
    TopDocs topDocs = indexSearcher.search(query, 20);
    System.out.println("Total hits " + topDocs.totalHits);
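The slides say that the searcher calculates a score for each matching document and returns the best hits, each with a position and a score. The Java sketch below imitates that idea with the standard library only, as an illustration under stated assumptions: TinyRanker and Hit are hypothetical names, and the score is a plain TF-IDF sum, which is a stand-in and not Lucene's actual scoring formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration class; it neither uses nor reproduces Lucene's API. */
final class TinyRanker {
    /** One matching document: its position in the collection and its score. */
    static final class Hit {
        final int doc;
        final double score;

        Hit(int doc, double score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Scores every document that shares a term with the query; returns the best n hits. */
    static List<Hit> search(List<String> docs, String query, int n) {
        // Term frequency per document, and document frequency per term.
        List<Map<String, Integer>> tf = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();
        for (String d : docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String t : tokenize(d)) {
                counts.merge(t, 1, Integer::sum);
            }
            for (String t : counts.keySet()) {
                df.merge(t, 1, Integer::sum);
            }
            tf.add(counts);
        }

        Set<String> queryTerms = new HashSet<>(tokenize(query));
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            double score = 0.0;
            for (String t : queryTerms) {
                int f = tf.get(i).getOrDefault(t, 0);
                if (f > 0) {
                    // Smoothed idf: a term found in every document still contributes a little.
                    double idf = Math.log(1.0 + (double) docs.size() / df.get(t));
                    score += f * idf;
                }
            }
            if (score > 0.0) {
                hits.add(new Hit(i, score));
            }
        }
        // Highest score first, then keep at most n results.
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "Lucene builds an index and runs a search over the index",
                "GATE and ANNIE extract information from text",
                "A crawler collects pages for the search engine");
        for (Hit h : search(docs, "search index", 20)) {
            System.out.println("doc " + h.doc + " score " + h.score);
        }
    }
}
```

The n parameter plays the same role as the second argument in the slide's search(query, 20) call: it caps how many results are fetched.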

Posted: 02/07/2014, 14:12

Table of contents

  • NLP: Survey of Tools & Resources

  • ANNIE: a Nearly-New Information Extraction System

  • Search with Lucene: Query

  • Search with Lucene: Term

  • Search with Lucene: IndexSearcher

  • Search with Lucene: Search Result

  • Thank you Bui Dac-Thinh Hotline: 01689928267
