Document: Solr 1.4 Enterprise Search Server - P1 (pptx)


Solr 1.4 Enterprise Search Server

Enhance your search with faceted navigation, result highlighting, fuzzy queries, ranked scoring, and more

David Smiley
Eric Pugh

BIRMINGHAM - MUMBAI

This material is copyright and is licensed for the sole use by William Anderson on 26th August 2009, 4310 E Conway Dr. NW, Atlanta, 30327.

Solr 1.4 Enterprise Search Server

Copyright © 2009 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2009
Production Reference: 1120809

Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK.

ISBN 978-1-847195-88-3
www.packtpub.com

Cover Image by Harmeet Singh (singharmeet@yahoo.com)

Credits

Authors: David Smiley, Eric Pugh
Reviewers: James Brady, Jerome Eteve
Acquisition Editor: Rashmi Phadnis
Development Editor: Darshana Shinde
Technical Editor: Pallavi Kachare
Copy Editor: Leonard D'Silva
Indexer: Monica Ajmera
Production Editorial Manager: Abhijeet Deobhakta
Editorial Team Leader: Akshara Aware
Project Team Leader: Priya Mukherji
Project Coordinator: Leena Purkait
Proofreader: Lynda Sliwoski
Production Coordinator: Shantanu Zagade
Cover Work: Shantanu Zagade

About the Authors

Born to code, David Smiley is a senior software developer and loves programming. He has 10 years of experience in the defense industry at MITRE, using Java and various web technologies. David is a strong believer in the open source development model and has made small contributions to various projects over the years.

David began using Lucene way back in 2000 during its infancy and was immediately excited by it and its future potential. He later went on to use the Lucene based "Compass" library to construct a very basic search server, similar in spirit to Solr. Since then, David has used Solr in a major search project and was able to contribute modifications back to the Solr community. Although preferring open source solutions, David has also been trained on the commercial Endeca search platform and is currently using that product as well as Solr for different projects.

Most, if not all, authors seem to dedicate their book to someone. As simply a reader of books, I have thought of this seeming prerequisite as customary tradition.
That was my feeling before I embarked on writing about Solr, a project that has sapped my previously "free" time on nights and weekends for a year. I chose this sacrifice and would not change it, but my wife, family, and friends did not choose it. I am married to my lovely wife Sylvie, who has sacrificed easily as much as I have to complete this book. She has suffered through this time with an absentee husband while bearing our first child, Camille. She was born about a week before the completion of my first draft and has been the apple of my eye ever since. I officially dedicate this book to my wife Sylvie and my daughter Camille, both of whom I lovingly adore. I also pledge to read book dedications with newfound firsthand experience of what the dedication represents.

I would also like to thank others who helped bring this book to fruition. Namely, if it were not for Doug Cutting creating Lucene with an open source license, there would be no Solr. Furthermore, CNet's decision in 2006 to open source Solr itself, which had been an in-house project, deserves praise. Many corporations do not understand that open source isn't just "free code" you get for free that others wrote; it is an opportunity to let your code flourish on the outside instead of it withering inside. Finally, I thank the team at Packt, who were particularly patient with me as a first-time author writing at a pace that left a lot to be desired.

Last but not least, this book would not have been completed in a reasonable time were it not for the assistance of my contributing author, Eric Pugh. His perspectives and experiences have complemented mine so well that I am absolutely certain the quality of this book is much better than what I could have done alone. Thank you all.

Fascinated by the 'craft' of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past five years. He is an emeritus member of the Apache Software Foundation and lately has been mulling over how we move from the read/write Web to the read/write/share Web.

In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software. As a speaker, he has advocated the advantages of Agile practices in software development.

Eric became involved with Solr when he submitted the patch SOLR-284 for Parsing Rich Document types such as PDF and MS Office formats, which became the single most popular patch as measured by votes! The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the open source model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr version 1.4. He blogs at http://www.opensourceconnections.com/blog/.

Throughout my life I have been helped by so many people, but all too rarely do I get to explicitly thank them. This book is arguably one of the high points of my career, and as I wrote it, I thought about all the people who have provided encouragement, mentoring, and the occasional push to succeed.

First off, I would like to thank Erik Hatcher, author, entrepreneur, and great family man, for introducing me to the world of open source software. My first hesitant patch to Ant was made under his tutelage, and later my interest in Solr was fanned by his advocacy.
Thanks to Harry Sleeper for taking a chance on a first-time conference speaker; he moved me from thinking of myself as a developer improving myself to thinking of myself as a consultant improving the world (of software!). His team at MITRE are some of the most passionate developers I have met, and it was through them I met my co-author David.

I owe a huge debt of gratitude to David Smiley. He has encouraged me, coached me, and put up with my lack of respect for book deadlines, making this book project a very positive experience! I look forward to the next one.

With my new son Morgan at home, I could only have done this project with generous support of time from my company, OpenSource Connections. I am incredibly proud of what o19s is accomplishing!

Lastly, to all the folks in the Solr/Lucene community who took the time to review early drafts and provide feedback: Solr is at the tipping point of becoming the "it" search engine because of your passion and commitment.

I am who I am because of my wife, Kate. Schweetie, real life for me began when we met. Thank you.

About the Reviewers

James Brady is an entrepreneur and software developer living in San Francisco, CA. Originally from England, James discovered his passion for computer science and programming while at Cambridge University. Upon graduation, James worked as a software engineer at IBM's Hursley Park laboratory, a role which taught him many things, most importantly his desire to work in a small company. In January 2008, James founded WebMynd Corp., which received angel funding from the Y Combinator fund, and he relocated to San Francisco.
WebMynd is one of the largest installations of Solr, indexing up to two million HTML documents per day, and making heavy use of Solr's multicore features to enable a partially active index.

Jerome Eteve holds a BSc in physics, maths and computing and an MSc in IT and bioinformatics from the University of Lille (France). After starting his career in the field of bioinformatics, where he worked as a biological data management and analysis consultant, he's now a senior web developer with interests ranging from database level issues to user experience online. He's passionate about open source technologies, search engines, and web application architecture. He has worked for Careerjet Ltd, a worldwide job search engine, since 2006.

Table of Contents

Preface

Chapter 1: Quick Starting Solr
  An introduction to Solr
  Lucene, the underlying engine
  Solr, the Server-ization of Lucene
  Comparison to database technology
  Getting started
  The last official release or fresh code from source control
  Testing and building Solr
  Solr's installation directory structure
  Solr's home directory
  How Solr finds its home
  Deploying and running Solr
  A quick tour of Solr!
  Loading sample data
  A simple query
  Some statistics
  The schema and configuration files
  Solr resources outside this book
  Summary

Chapter 2: Schema and Text Analysis
  MusicBrainz.org
  One combined index or multiple indices
  Problems with using a single combined index
  Schema design
  Step 1: Determine which searches are going to be powered by Solr
  Step 2: Determine the entities returned from each search
  Step 3: Denormalize related data
  Denormalizing: "one-to-one" associated data
  Denormalizing: "one-to-many" associated data
  Step 4: (Optional) Omit the inclusion of fields only used in search results
  The schema.xml file
  Field types
  Field options
  Field definitions
  Sorting
  Dynamic fields
  Using copyField
  Remaining schema.xml settings
  Text analysis
  Configuration
  Experimenting with text analysis
  Tokenization
  WordDelimiterFilterFactory
  Stemming
  Synonyms
  Index-time versus Query-time, and to expand or not
  Stop words
  Phonetic sounds-like analysis
  Partial/Substring indexing
  N-gramming costs
  Miscellaneous analyzers
  Summary

Chapter 3: Indexing Data
  Communicating with Solr
  Direct HTTP or a convenient client API
  Data streamed remotely or from Solr's filesystem
  Data formats
  Using curl to interact with Solr
  Remote streaming
  Sending XML to Solr
  Deleting documents
  Commit, optimize, and rollback
  Sending CSV to Solr
  Configuration options
  Direct database and XML import
  Getting started with DIH
  The DIH development console

[...]

... that you're going to get a great search experience, because it's not going to have features that users come to expect in a great search. With Solr, the leading open source search server, you'll tap into a host of features from highlighting search results to spell-checking to faceting. As you read Solr Enterprise Search Server you'll be guided through all of the aspects of Solr, from the initial download...

from MusicBrainz.org. Furthermore, you will also find instructions on accessing a Solr image readily deployed from within Amazon's Elastic Compute Cloud. Solr Enterprise Search Server targets the Solr 1.4 version. However, as this book went to print prior to Solr 1.4's release, two features were not incorporated into the book: search result clustering and trie-range numeric fields...

example, if you have an apache-solr-1.4.war file, then you would access it at http://localhost:8983/apache-solr-1.4/, assuming it's on the local machine and running at that default port. We're going to deploy this WAR file into the Jetty servlet engine included with Solr. If you are using a pre-built downloaded Solr distribution, then Solr is already deployed into Jetty as solr.war. Solr has an ant target that...

Solr first checks for a Java system property named solr.solr.home. There are a few ways to set a Java system property, but a universal one, no matter which servlet engine you use, is through the command line where Java is invoked. You could explicitly set Solr's home like so when you start Jetty: java -Dsolr.solr.home=solr/ -jar start.jar, or you could use...

Quick Starting Solr

Welcome to Solr! You've made an excellent choice in picking a technology to power your searching needs. In this chapter, we're going to cover the following topics:
• An overview of what Solr and Lucene are all about
• What makes Solr different from other database technologies
• How to get Solr, what's included, and what is where
• Running Solr and importing sample data...

plugins from the Solr distribution (the WAR file) for convenience. If you extend Solr without modifying Solr itself, then those modifications can be deployed in a JAR file here. ... doing anything with it except perhaps deleting it occasionally. It's really important to know how Solr finds its home directory. This is covered next.

How Solr finds its home

In the next section, you'll start Solr. When Solr starts up,...

Chapter 8: Integrating Solr
  Structure of included examples
  Inventory of examples
  SolrJ: Simple Java interface
  Using Heritrix to download artist pages
  Indexing HTML in Solr
  SolrJ client API
  Indexing POJOs
  When should I use Embedded Solr
  In-Process streaming
  Rich clients
  Upgrading from legacy Lucene
  Using JavaScript to integrate Solr
  Wait, what about security?
  Building a Solr powered artists autocomplete...

  Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config getInstanceDir
  INFO: Solr home defaulted to 'null' (could not find system property or JNDI)
  Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config setInstanceDir
  INFO: Solr home set to 'solr/'

This shows that Solr was left to default to solr/. You'll see this output when you start Solr, as described in the next section...

  Hosted Solr by Acquia
  Ruby on Rails integrations
  acts_as_solr
  Setting up MyFaves project
  Populating MyFaves relational database from Solr
  Build Solr indexes from relational database
  Complete MyFaves web site
  Blacklight OPAC
  Indexing MusicBrainz data
  Customizing display
  solr-ruby versus rsolr
  Summary

Chapter 9: Scaling Solr
  Tuning complex...

data
• A quick tour of the interface and key configuration files

An introduction to Solr

Solr is an open source enterprise search server. It is a mature product powering search for public sites like CNet, Zappos, and Netflix, as well as intranet sites. It is written in Java, and that language is used to further extend/modify Solr. However, being a server that communicates using standards such as HTTP and ...

  ... characters
  Filtering
  Sorting
  Request handlers
  Scoring
  Query-time and index-time boosting
  Troubleshooting scoring
  Summary

  ... types
  MusicBrainz schema changes
  Field requirements
  Types of faceting
  Faceting text
  Alphabetic range bucketing (A-C, D-F, and so on)
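
The excerpts above mention two small mechanics: Solr checks a Java system property named solr.solr.home and otherwise falls back to solr/, and a WAR file is reached at a URL path matching its file name. The sketch below only illustrates that behaviour as described; it is not Solr's actual code. The class name SolrHomeSketch and both method names are hypothetical, the JNDI lookup mentioned in the log output is deliberately left out, and the host and port are assumed to be the localhost:8983 defaults quoted in the text.

```java
import java.util.Properties;

// Illustrative only: a hypothetical helper, not part of Solr.
final class SolrHomeSketch {
    // Property name and default directory as quoted in the excerpts.
    static final String HOME_PROPERTY = "solr.solr.home";
    static final String DEFAULT_HOME = "solr/";

    // Returns the configured home when the property is set and non-blank,
    // otherwise the default. The JNDI step from the log output is omitted.
    static String resolveHome(Properties props) {
        String configured = props.getProperty(HOME_PROPERTY);
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT_HOME;
        }
        return configured;
    }

    // Builds the URL for a WAR deployed under its own file name,
    // assuming the local machine and the default port 8983.
    static String contextUrl(String warFileName) {
        String context = warFileName.endsWith(".war")
                ? warFileName.substring(0, warFileName.length() - 4)
                : warFileName;
        return "http://localhost:8983/" + context + "/";
    }

    public static void main(String[] args) {
        // Reads the real JVM properties, so -Dsolr.solr.home=... takes effect here.
        System.out.println(resolveHome(System.getProperties()));
        System.out.println(contextUrl("apache-solr-1.4.war"));
    }
}
```

Running the class with java -Dsolr.solr.home=/some/dir would print that directory on the first line; without the flag the first line is the solr/ default.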

Date posted: 14/12/2013, 20:15
