Apache Solr 3 Enterprise Search Server pptx


www.it-ebooks.info

Apache Solr 3 Enterprise Search Server

Enhance your search with faceted navigation, result highlighting, relevancy ranked sorting, and more

David Smiley
Eric Pugh

BIRMINGHAM - MUMBAI

Apache Solr 3 Enterprise Search Server

Copyright © 2011 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2009
Second published: November 2011
Production Reference: 2041111

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-84951-606-8

www.packtpub.com

Cover Image by Duraid Fatouhi (duraidfatouhi@yahoo.com)

Credits

Authors: David Smiley, Eric Pugh
Reviewers: Jerome Eteve, Mauricio Scheffer
Acquisition Editor: Sarah Cullington
Development Editors: Shreerang Deshpande, Gaurav Mehta
Technical Editor: Kavita Iyer
Project Coordinator: Joel Goveya
Proofreader: Steve Maguire
Indexers: Hemangini Bari, Rekha Nair
Production Coordinator: Alwin Roy
Cover Work: Alwin Roy

About the Authors

Born to code, David Smiley is a senior software engineer with a passion for programming and open source.
He has written a book, taught a class, and presented at conferences on the subject of Solr. He has 12 years of experience in the defense industry at MITRE, using Java and various web technologies. Recently, David has been focusing his attention on the intersection of geospatial technologies with Lucene and Solr.

David first used Lucene in 2000 and was immediately struck by its speed and novelty. Years later he had the opportunity to work with Compass, a Lucene based library. In 2008, David built an enterprise people and project search service with Solr, with a focus on search relevancy tuning. David began to learn everything there is to know about Solr, culminating with the publishing of Solr 1.4 Enterprise Search Server in 2009, the first book on Solr. He has since developed and taught a two-day Solr course for MITRE and he regularly offers technical advice to MITRE and its customers on the use of Solr. David also has experience using Endeca's competing product, which has broadened his experience in the search field.

On a technical level, David has solved challenging problems with Lucene and Solr including geospatial search, wildcard ngram query parsing, searching multiple multi-valued fields at coordinated positions, and part-of-speech search using Lucene payloads. In the area of geospatial search, David open sourced his geohash prefix/grid based work to the Solr community, tracked as SOLR-2155. This work has led to presentations at two conferences. Presently, David is collaborating with other Lucene and Solr committers on geospatial search.

Acknowledgement

Most, if not all authors seem to dedicate their book to someone. As simply a reader of books I have thought of this seeming prerequisite as customary tradition. That was my feeling before I embarked on writing about Solr, a project that has sapped my previously "free" time on nights and weekends for a year.
I chose this sacrice and want no pity for what was my decision, but my wife, family and friends did not choose it. I am married to my lovely wife Sylvie who has easily sacriced as much as I have to work on this project. She has suffered through the rst edition with an absentee husband while bearing our rst child—Camille. The second edition was a similar circumstance with the birth of my second daughter—Adeline. I ofcially dedicate this book to my wife Sylvie and my daughters Camille and Adeline, who I both lovingly adore. I also pledge to read book dedications with new-found rst- hand experience at what the dedication represents. I would also like to thank others who helped bring this book to fruition. Namely, if it were not for Doug Cutting creating Lucene with an open source license, there would be no Solr. Furthermore, CNET's decision to open source what was an in-house project, Solr itself, in 2006, deserves praise. Many corporations do not understand that open source isn't just "free code" you get for free that others write: it is an opportunity to let your code ourish in the outside instead of it withering inside. Last, but not the least, this book would not have been completed in a reasonable time were it not for the assistance of my contributing author, Eric Pugh. His own perspectives and experiences have complemented mine so well that I am absolutely certain the quality of this book is much better than what I could have done alone. Thank you all. David Smiley www.it-ebooks.info Eric Pugh has been fascinated by the "craft" of software development, and has been heavily involved in the open source world as a developer, committer, and user for the past ve years. He is an emeritus member of the Apache Software Foundation and lately has been mulling over how we solve the problem of nding answers in datasets when we don't know the questions ahead of time to ask. 
In biotech, nancial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source search software. As a speaker, he has advocated the advantages of Agile practices with a focus on testing in search engine implementation. Eric became involved with Solr when he submitted the patch SOLR-284 for Parsing Rich Document types such as PDF and MS Ofce formats that became the single most popular patch as measured by votes! The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the open source model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr version 1.4. He blogs at http://www.opensourceconnections.com/ www.it-ebooks.info Acknowledgement When the topic of producing an update of this book for Solr 3 rst came up, I thought it would be a matter of weeks to complete it. However, when David Smiley and I sat down to scope out what to change about the book, it was immediately apparent that we didn't want to just write an update for the latest Solr, we wanted to write a complete second edition of the book. We added a chapter, moved around content, rewrote whole sections of the book. David put in many more long nights than I over the past 9 months writing what I feel justiable in calling the Second Edition of our book. So I must thank his wife Sylvie for being so supportive of him! I also want to thank again Erik Hatcher for his continuing support and mentorship. Without his encouragement I wouldn't have spoken at Euro Lucene, or become involved in the Blacklight community. I also want to thank all of my colleagues at OpenSource Connections. We've come a long way as a company in the last 18 months, and I look forward to the next 18 months. Our Friday afternoon hack sessions re-invigorate me every week! 
My darling wife Kate, I know 2011 turned into a very busy year, but I couldn't be happier sharing my life with you, Morgan, and baby Asher. I love you.

Lastly I want to thank all the adopters of Solr and Lucene! Without you, I wouldn't have this wonderful open source project to be so incredibly proud to be a part of! I look forward to meeting more of you at the next LuceneRevolution or Euro Lucene conference.

About the Reviewers

Jerome Eteve holds an MSc in IT and Sciences from the University of Lille (France). After starting his career in the field of bioinformatics where he worked as a Biological Data Management and Analysis Consultant, he's now a Senior Application Developer with interests ranging from architecture to delivering a great user experience online. He's passionate about open source technologies, search engines, and web application architecture. He now works for WCN Plc, a leading provider of recruitment software solutions. He has worked on Packt's Enterprise Solr published in 2009.

Mauricio Scheffer is a software developer currently living in Buenos Aires, Argentina. He's worked in dot-coms on almost everything related to web application development, from architecture to user experience. He's very active in the open source community, having contributed to several projects and started many projects of his own. In 2007 he wrote SolrNet, a popular open source Solr interface for the .NET platform. Currently he's also researching the application of functional programming to web development as part of his Master's thesis. He blogs at http://bugsquash.blogspot.com.

www.PacktPub.com

This book is published by Packt Publishing. You might want to visit Packt's website at www.PacktPub.com and take advantage of the following features and offers:

Discounts

Have you bought the print copy or Kindle version of this book? If so, you can get a massive 85% off the price of the eBook version, available in PDF, ePub, and MOBI.
Simply go to http://www.packtpub.com/apache-solr-3-enterprise-search-server/book, add it to your cart, and enter the following discount code: as3esebk

Free eBooks

If you sign up to an account on www.PacktPub.com, you will have access to nine free eBooks.

Newsletters

Sign up for Packt's newsletters, which will keep you up to date with offers, discounts, books, and downloads. You can set up your subscription at www.PacktPub.com/newsletters.

Code Downloads, Errata and Support

Packt supports all of its books with errata. While we work hard to eradicate errors from our books, some do creep in. Meanwhile, many Packt books have accompanying snippets of code to download. You can find errata and code downloads at www.PacktPub.com/support.

Date posted: 07/03/2014, 06:20

Table of Contents

  • Cover

  • Copyright

  • Credits

  • About the Authors

  • About the Reviewers

  • www.PacktPub.com

  • PacktLib.PacktPub.com

  • Table of Contents

  • Preface

  • Chapter 1: Quick Starting Solr

    • An introduction to Solr

      • Lucene, the underlying engine

      • Solr, a Lucene-based search server

      • Comparison to database technology

    • Getting started

      • Solr's installation directory structure

      • Solr's home directory, and Solr cores

      • Running Solr

    • A quick tour of Solr

      • Loading sample data

      • A simple query

      • Some statistics

      • The sample browse interface

    • Configuration files

    • Resources outside this book

    • Summary

  • Chapter 2: Schema and Text Analysis

    • MusicBrainz.org

    • One combined index or separate indices

      • One combined index

        • Problems with using a single combined index

      • Separate indices

    • Schema design

      • Step 1: Determine which searches are going to be powered by Solr

      • Step 2: Determine the entities returned from each search

      • Step 3: Denormalize related data

        • Denormalizing—"one-to-one" associated data

        • Denormalizing—"one-to-many" associated data

      • Step 4: (Optional) Omit the inclusion of fields only used in search results

    • The schema.xml file

      • Defining field types

      • Built-in field type classes

        • Numbers and dates

        • Geospatial

      • Field options

      • Field definitions

        • Dynamic field definitions

      • Our MusicBrainz field definitions

      • Copying fields

      • The unique key

      • The default search field and query operator

    • Text analysis

      • Configuration

      • Experimenting with text analysis

      • Character filters

      • Tokenization

      • WordDelimiterFilter

      • Stemming

        • Correcting and augmenting stemming

      • Synonyms

        • Index-time versus query-time, and to expand or not

      • Stop words

      • Phonetic sounds-like analysis

      • Substring indexing and wildcards

        • ReversedWildcardFilter

        • N-grams

        • N-gram costs

      • Sorting Text

      • Miscellaneous token filters

    • Summary

  • Chapter 3: Indexing Data

    • Communicating with Solr

      • Direct HTTP or a convenient client API

      • Push data to Solr or have Solr pull it

      • Data formats

      • HTTP POSTing options to Solr

      • Remote streaming

    • Solr's Update-XML format

      • Deleting documents

    • Commit, optimize, and rollback

    • Sending CSV formatted data to Solr

      • Configuration options

    • The Data Import Handler Framework

      • Setup

      • The development console

      • Writing a DIH configuration file

        • Data Sources

        • Entity processors

        • Fields and transformers

      • Example DIH configurations

        • Importing from databases

        • Importing XML from a file with XSLT

        • Importing multiple rich document files (crawling)

      • Importing commands

        • Delta imports

    • Indexing documents with Solr Cell

      • Extracting text and metadata from files

      • Configuring Solr

      • Solr Cell parameters

      • Extracting karaoke lyrics

      • Indexing richer documents

    • Update request processors

    • Summary

  • Chapter 4: Searching

    • Your first search, a walk-through

    • Solr's generic XML structured data representation

    • Solr's XML response format

      • Parsing the URL

    • Request handlers

    • Query parameters

      • Search criteria related parameters

      • Result pagination related parameters

      • Output related parameters

      • Diagnostic related parameters

    • Query parsers and local-params

    • Query syntax (the lucene query parser)

      • Matching all the documents

      • Mandatory, prohibited, and optional clauses

        • Boolean operators

      • Sub-queries

        • Limitations of prohibited clauses in sub-queries

      • Field qualifier

      • Phrase queries and term proximity

      • Wildcard queries

        • Fuzzy queries

      • Range queries

        • Date math

      • Score boosting

      • Existence (and non-existence) queries

      • Escaping special characters

    • The Dismax query parser (part 1)

      • Searching multiple fields

      • Limited query syntax

      • Min-should-match

        • Basic rules

        • Multiple rules

        • What to choose

      • A default search

    • Filtering

    • Sorting

    • Geospatial search

      • Indexing locations

      • Filtering by distance

      • Sorting by distance

    • Summary

  • Chapter 5: Search Relevancy

    • Scoring

      • Query-time and index-time boosting

      • Troubleshooting queries and scoring

    • Dismax query parser (part 2)

      • Lucene's DisjunctionMaxQuery

      • Boosting: Automatic phrase boosting

        • Configuring automatic phrase boosting

        • Phrase slop configuration

        • Partial phrase boosting

      • Boosting: Boost queries

      • Boosting: Boost functions

        • Add or multiply boosts?

    • Function queries

      • Field references

      • Function reference

        • Mathematical primitives

        • Other math

        • ord and rord

        • Miscellaneous functions

      • Function query boosting

        • Formula: Logarithm

        • Formula: Inverse reciprocal

        • Formula: Reciprocal

        • Formula: Linear

      • How to boost based on an increasing numeric field

        • Step by step…

        • External field values

      • How to boost based on recent dates

        • Step by step…

    • Summary

  • Chapter 6: Faceting

    • A quick example: Faceting release types

      • MusicBrainz schema changes

    • Field requirements

    • Types of faceting

    • Faceting field values

      • Alphabetic range bucketing

    • Faceting numeric and date ranges

      • Range facet parameters

    • Facet queries

    • Building a filter query from a facet

      • Field value filter queries

      • Facet range filter queries

    • Excluding filters (multi-select faceting)

    • Hierarchical faceting

    • Summary

  • Chapter 7: Search Components

    • About components

    • The Highlight component

      • A highlighting example

      • Highlighting configuration

        • The regex fragmenter

        • The fast vector highlighter with multi-colored highlighting

    • The SpellCheck component

      • Schema configuration

      • Configuration in solrconfig.xml

        • Configuring spellcheckers (dictionaries)

        • Processing of the q parameter

        • Processing of the spellcheck.q parameter

      • Building the dictionary from its source

      • Issuing spellcheck requests

      • Example usage for a misspelled query

    • Query complete / suggest

      • Query term completion via facet.prefix

      • Query term completion via the Suggester

      • Query term completion via the Terms component

    • The QueryElevation component

      • Configuration

    • The MoreLikeThis component

      • Configuration parameters

        • Parameters specific to the MLT search component

        • Parameters specific to the MLT request handler

        • Common MLT parameters

      • MLT results example

    • The Stats component

      • Configuring the stats component

      • Statistics on track durations

    • The Clustering component

    • Result grouping / Field collapsing

      • Configuring result grouping

    • The TermVector component

    • Summary

  • Chapter 8: Deployment

    • Deployment methodology for Solr

      • Questions to ask

    • Installing Solr into a Servlet container

      • Differences between Servlet containers

        • Defining solr.home property

    • Logging

      • HTTP server request access logs

      • Solr application logging

        • Configuring logging output

        • Logging using Log4j

        • Jetty startup integration

        • Managing log levels at runtime

    • A SearchHandler per search interface?

    • Leveraging Solr cores

      • Configuring solr.xml

        • Property substitution

        • Include fragments of XML with XInclude

      • Managing cores

      • Why use multicore?

    • Monitoring Solr performance

      • Stats.jsp

      • JMX

        • Starting Solr with JMX

    • Securing Solr from prying eyes

      • Limiting server access

        • Securing public searches

        • Controlling JMX access

      • Securing index data

        • Controlling document access

        • Other things to look at

    • Summary

  • Chapter 9: Integrating Solr

    • Working with included examples

      • Inventory of examples

    • Solritas, the integrated search UI

      • Pros and Cons of Solritas

    • SolrJ: Simple Java interface

      • Using Heritrix to download artist pages

      • SolrJ based client for Indexing HTML

      • SolrJ client API

        • Embedding Solr

        • Searching with SolrJ

        • Indexing

      • When should I use embedded Solr?

        • In-process indexing

        • Standalone desktop applications

        • Upgrading from legacy Lucene

    • Using JavaScript with Solr

      • Wait, what about security?

      • Building a Solr powered artists autocomplete widget with jQuery and JSONP

      • AJAX Solr

    • Using XSLT to expose Solr via OpenSearch

      • OpenSearch based Browse plugin

        • Installing the Search MBArtists plugin

    • Accessing Solr from PHP applications

      • solr-php-client

      • Drupal options

        • Apache Solr Search integration module

        • Hosted Solr by Acquia

    • Ruby on Rails integrations

      • The Ruby query response writer

      • sunspot_rails gem

        • Setting up MyFaves project

        • Populating MyFaves relational database from Solr

        • Build Solr indexes from a relational database

        • Complete MyFaves website

      • Which Rails/Ruby library should I use?

    • Nutch for crawling web pages

    • Maintaining document security with ManifoldCF

      • Connectors

      • Putting ManifoldCF to use

    • Summary

  • Chapter 10: Scaling Solr

    • Tuning complex systems

    • Testing Solr performance with SolrMeter

    • Optimizing a single Solr server (Scale up)

      • Configuring JVM settings to improve memory usage

        • MMapDirectoryFactory to leverage additional virtual memory

      • Enabling downstream HTTP caching

      • Solr caching

        • Tuning caches

      • Indexing performance

        • Designing the schema

        • Sending data to Solr in bulk

        • Don't overlap commits

        • Disabling unique key checking

        • Index optimization factors

      • Enhancing faceting performance

      • Using term vectors

      • Improving phrase search performance

    • Moving to multiple Solr servers (Scale horizontally)

      • Replication

      • Starting multiple Solr servers

        • Configuring replication

      • Load balancing searches across slaves

        • Indexing into the master server

        • Configuring slaves

      • Configuring load balancing

      • Sharding indexes

        • Assigning documents to shards

        • Searching across shards (distributed search)

    • Combining replication and sharding (Scale deep)

      • Near real time search

    • Where next for scaling Solr?

    • Summary

  • Appendix: Search Quick Reference

    • Quick reference

  • Index
