Presented by Eric Gries, CEO, Lucid Imagination
The data and content whirlwind continues unabated, with total data volume doubling every 18 months and more forms and formats every day, as end users and organizations alike continue to demand better search technology. Now the Lucene Revolution has arrived, freeing end users and organizations to harness the content whirlwind with better, faster, more scalable and effective search apps and search-enabled services. More and more enterprises are breaking the bonds of commercial proprietary software lock-in, scaling up on open source Lucene/Solr search technology. Thousands are betting their business on the promise and the results delivered by the community innovation engine for disruptive competitive advantage. We'll look at how cutting-edge developers, thought leaders, and market makers who have found the search they've been looking for are driving the Lucene Revolution around the world. In addition, we'll discuss what's needed to take open source search to the next level and extend the dramatic rate of adoption to more enterprises and users.
Presented by Satish Gannu, Director of Engineering, Cisco Systems
Video has had a significant impact in the consumer world. A similar trend is underway in the enterprise, where video is increasingly used for business meetings, user-to-user messaging, training, customer care, corporate events, corporate communications, safety and security. In many cases video must also be recorded. This explosion of video creates two challenges: finding recordings and consuming them.
This talk examines why video is becoming more prevalent in the enterprise and looks at promising new technology for finding and consuming it. It also addresses the role of speech and search technology, as well as the role of network infrastructure.
Presented by Erik Arnold, Adhere Solutions
Lucene has made great strides in the federal government, powering search on some of the highest profile web sites from the Obama Administration, including the flagship Search.USA.gov. Search.USA.gov was contracted out to enterprise software providers for the last decade, but now it has been brought in house with a focus on using open source software and commercial APIs. By embracing the open source movement, Search.USA.gov is now a commercial grade search engine while opening up its services to other government entities and through Application Program Interfaces that may be used by either the government or the commercial sector. This model of creating web services and code that may be leveraged by other entities has two benefits: first, it saves the taxpayers money, and second, it exposes government content to the public outside of HTML and PDF format, which is much more useful for application developers.
Presented by Sumer Jabri, Principal Architect, Rivet Logic
Faceted search and dynamic navigation are key usability features of modern websites, but enterprises struggle to implement such capabilities cost-effectively for large-scale sites, especially where Web content managers and marketers need total editorial control. Using open source platforms like Alfresco WCM and Apache Solr, Rivet Logic has built large-scale faceted search and navigation which is completely under editorial control. Alfresco is used for Web content management, allowing Web editors to manage website content, and tag the content and structure with rich metadata models and taxonomies. The presentation discusses the technical approach we use for Solr/Alfresco integration, content publishing, real-time Solr indexing, and dynamic generation of website visitor experiences for several public sites that are currently in production.
Presented by Brian Pinkerton, Lucid Imagination
What's needed for broader enterprise adoption of Solr? In addition to requirements like flexibility, scalability, cost savings, and transparency met by Solr, there are new demands for ease-of-use, accelerated developer productivity, management, security, and enterprise-grade support. This talk will cover trends in these emerging enterprise search requirements - and how they can be met.
Presented by Steve Kearns, Product Manager, Rosette Platform, Basis Technology
Listening Platforms are a new class of search based applications which help organizations understand their marketplace by monitoring and extracting information from traditional and social media sources. With these applications, global organizations are seeking to understand, interact with, and react to their customers around the world. As these customers increasingly create and consume media in their native languages, leading organizations are going to rely on their listening platforms to understand and deliver insights using information from many languages. This talk will discuss some of the challenges that you will face when building a global listening platform – including the integration of Nutch and Solr, designing a meaningful schema, processing text in many languages, and selecting higher level text analysis tools. The talk will conclude with a demonstration of a prototype global listening platform based on Solr. The platform will be made available for further inspection and discussion during the conference in the exhibit hall.
Presented by Marjorie M.K. Hlava, President, Chairman, and Founder, Access Innovations, Inc.
Lucene is tightly linked to both the MySQL and XML repositories, so that records are created, saved, and immediately searchable on the web. This presentation is a case study on NICEM. NICEM is a directory of educational media, created originally by NICEM at the University of Southern California and maintained by Access Innovations, Inc. since 1984. Metadata extracted programmatically from the source includes the title, description, controlled subject terms, publication date, target audience type, language, media, publisher, and distributor(s). Access Innovations' Data Harmony software uses XML, TCP/IP, Java, and Unicode in its creation, making it platform independent. The platform allows staff to work in the cloud or locally, seamlessly contributing to the database from geographically distributed locations. The software parses the title and description to automatically suggest appropriate subject terms from the NICEM thesaurus. The XML Intranet System (XIS) repository is able to upload a record to the search site at the same time it is saved as an approved record by the editorial review staff. This makes the semantically enriched record instantly searchable and the directory instantly current. The thesaurus is later used in the search interface to suggest search terms to users as they type an entry in the search box. Search returns display related and narrower thesaurus terms to offer broadened or narrowed search paths to the user. The main website search interface provides a subject search facility with filtering by date, audience type and language. A separate search facility provides distributor details. Each title included in the directory is a record for the search side in the open source database manager MySQL. Access Innovations decided to move to the open source Lucene search engine to take advantage of its features. Lucene's ranked search and the mathematical formula it applies to each specific query significantly improve search returns for the user.
The combination of taxonomy terms from a controlled vocabulary and Lucene creates an excellent search experience.
Presented by Rod Cope, CTO, Founder, OpenLogic
Hadoop and HBase make it easy to store terabytes of data, but how do you scale your search mechanism to sift through these mountains of bits and retrieve large result sets in a matter of milliseconds? Solr sharding and careful index creation made these requirements come to life in our production environment. Come see how we handle millions of rapid fire queries from dozens of parallel search clients against many terabytes of data while addressing high availability through load balancing and replication.
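As a rough illustration of the sharding idea mentioned above, the following minimal Python sketch routes documents to shards with a stable hash, queries every shard independently, and merges the per-shard top hits. It is a generic scatter-gather toy, not OpenLogic's implementation and not the Solr API; the function names (`shard_for`, `build_shards`, `search_shard`, `search`) and the term-count scoring are assumptions made purely for illustration.

```python
import heapq
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard with a stable hash so a document always lands in the same place."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(docs, num_shards):
    """Distribute {doc_id: text} across num_shards in-memory 'indexes' (plain dicts)."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search_shard(shard, term, k):
    """Return up to k (score, doc_id) pairs from one shard; score = term frequency (toy scoring)."""
    scored = ((text.lower().split().count(term), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def search(shards, term, k=3):
    """Scatter the query to every shard, then gather and merge the per-shard top-k lists."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


docs = {"a": "solr solr solr", "b": "solr lucene", "c": "hadoop", "d": "solr solr"}
print(search(build_shards(docs, 2), "solr", k=2))
```

Because each shard returns its own top k hits, the merged top k matches what a single large index would return, while each shard only scans its own slice of the data. In a real deployment the shards would be separate servers, with replication and load balancing layered on top for availability.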
Presented by Ching-hsien Wang, Smithsonian Institution
The Smithsonian Institution has created a one-stop search center for its diverse object collections from 40 databases of libraries, archives and museums. Using Solr, this search center currently indexes five million documents & records, and supports faceted searching and navigation. This presentation will demonstrate the functions of the system, describe the overall system architecture and highlight operational and work flow issues in maintaining and supporting the system.
Presented by Peter Wolanin, Acquia
Drupal is a robust, free and open-source content management system that is becoming an ever more popular tool for building enterprise websites. The Apache Solr Search Integration module for Drupal allows you to integrate Solr functionality into a Drupal site with no additional coding. The module manages and displays many facets including taxonomy, content types, authors, and dates, provides options for boosting search results, and sends queries to the "more like this" handler to provide content recommendation. Using this module as a basis, you can make additional customizations or use additional sub-modules for even more functionality, such as indexing file attachments.
Presented by Grant Ingersoll, Lucid Imagination
The NoSQL movement has been gaining steam for some time now, due to its focus on addressing certain database misgivings. Ironically, search has been "no sql" since before SQL even existed. In this talk, Lucene and Solr committer Grant Ingersoll will take a look at using Apache Lucene/Solr not only as a NoSQL solution, but a NoSQL solution with benefits.
Presented by Pascal Dimassimo, Open Text Corporation and Igor Radic, Open Text Corporation
Text analytics automatically extracts named entities, categories and sentiment from unstructured content; using this metadata as facets in Solr is the foundation of Semantic Navigation. Leveraging this set of facets into a flexible and scalable presentation framework allows for delivering highly relevant content to large audiences.
This discussion covers the approach taken by Open Text to deploy Nutch, Solr and other technologies into a self-provisioned cloud-based offering including:
Presented by Bill McQuaide, Black Duck Software
The abundance of open source is well known: there are over 220,000 OSS projects representing more than 2 million person-years of development available to developers and their companies. The open source community continually demonstrates the power of the community development model. And Black Duck Software continually seeks out the most compelling trends in open source to share with this vibrant community, as they help to enable the use of open source components in a multi-source development world. In this session Bill McQuaide, EVP of Product and Strategy at Black Duck Software, will discuss open source software trends Black Duck has identified in their research and explore the results in type and growth of projects.
Presented by Hadley Reynolds, IDC
This session will discuss how search innovation is playing an important part in creating new business opportunities and transforming business models. Open source is now widely accepted and embraced. With trends pointing toward more open source software adoption, enterprises are recognizing the strength of products that are utilizing this approach. We see commercial enterprises as well as those who sell search-based applications incorporating open source in their implementation strategies. Among the benefits of Lucene/Solr or other open source projects are rapid time-to-market, high-level functionality, flexibility and customization, low entry-level cost, and a positive future outlook for technology evolving with the state of the art. However, open source has some limitations; for example, no single point of responsibility, dependence on "the community," its timetable and its priorities for certain enhancements, as well as potentially expensive customization. Open source can also require advanced skills in-house or near-at-hand. In addition, open source search solutions may trail the cutting edge search software specialist firms in feature enrichment. Lifetime system costs for both open source and commercial search, however, can be difficult to predict. At this time, much of the former skepticism about critical mass and future support in open source search can be laid to rest. Lucene/Solr search will play a significant role in the search market from here on.
Presented by Mats Bjore, Infosphere AB
Experience with open source search (OSS) in the various intelligence communities within business and government has revealed both opportunities and obstacles. The generally secretive and suspicious community has accepted OSS, and it is now an integral and important component of both the government and the business intelligence market in many countries. A catalyst for the increased interest has been the downturn in the economy, as business and organizational units look for lower-priced options when their IT budgets are trimmed. The growing interest in OSS challenges existing legacy vendors but also changes procurement behaviours and creates new demands on them. The future looks bright, but within the intelligence communities there are still obstacles to adopting OSS that range from people not knowing that open-source equivalents to proprietary software exist, to not understanding the licenses, to procurement not being set up properly. What appeals to the intelligence community is the absence of contract discussions about limits on query rates, number of servers, index size, etc., which are always THE obstacle to talk about with this set of clients. What also appeals to the intelligence communities is that OSS enables the creation of many specialized, fast-paced projects around existing or new analytical tools. This speech will give examples of successful projects from both a client and a vendor perspective, and it will point to the need for initiatives such as the European Commission’s Open Source Observatory and Repository for European public administrations and the 2009 US Defense Department guidance on open-source software to overcome obstacles from both a vendor and a customer perspective.
Presented by Otis Gospodnetic, Founder, Sematext
Over the past few years, as datasets grew in size, the Hadoop ecosystem mushroomed: the code base grew, the mailing list traffic exploded, JIRA activity and commits shot through the roof, and new subprojects were created, then promoted to their own Apache TLPs. Through this evolutionary process, lots of information about Hadoop (various tips, tricks, solutions to common problems, answers to frequently asked questions, etc.) accumulated in various Hadoop projects' data silos: numerous mailing lists, JIRA projects, Wiki, etc. To help users of various Hadoop projects find what they are looking for in this enormous Hadoop Haystack, we employed tools we work with every day: Solr, Tika, Droids and several others, and we built Search-Hadoop.com. In this talk we will describe which tools we used and how we used them. We'll describe the architecture behind Search-Hadoop.com and share information about the architectural and deployment decisions we've made to make this service cheap to run, as well as how it is deployed and monitored on Amazon's EC2.
Presented by Johannes Goll, J. Craig Venter Institute
JCVI Metagenomics Reports (METAREP) is an open source web application designed to help scientists analyze and compare annotated metagenomics data sets. It utilizes Solr/Lucene to quickly query large data collections. The web logic is implemented using the CakePHP framework. Index files are served by two load-balanced Dell PowerEdge R710 servers, each with 8 cores (2.66 GHz) and 72 GB of RAM. Currently, we have indexed 68 million documents distributed over 330 index files.
Presented by Stefan Olafsson, TwigKit
So you've done the Solr heavy lifting, but can you build a polished user interface in seven minutes? We'll show you how we tackle the challenge of building a great search user experience with just a handful of tags. About the presenter: Stefan Olafsson is the founder of TwigKit, co-organiser of the Enterprise Search London meetup, and has been working with search for the last 7 years.
Presented by Aaron Binns, Senior Software Engineer, Internet Archive
This talk outlines the full-text search system for the Internet Archive's Archive-It hosted archiving service which serves 120+ customers, 1500+ collections, and 900+ million documents, adding nearly 1 million new documents each day. We are currently using a homegrown Lucene solution and want to migrate to Solr. I will touch upon the challenges we have run into so far.
Presented by Tomás Fernández Löbbe, Lucid Imagination
SolrMeter is a stress test tool specifically built for Apache Solr installations. Using SolrMeter you can simulate realtime random, highly concurrent transactional and query loads on your server. It also enables you to understand your server behaviour visually using realtime statistics charts.
Presented by Eric Gaumer, Chief Scientist, ESR Technologies
Data integration is one of the more challenging aspects of designing enterprise search architectures. Data from disparate sources must be aggregated, normalized, and augmented prior to indexing. This talk will touch on some common design mistakes and show how asynchronous messaging architectures can be used to integrate content more effectively.
Presented by Grant Ingersoll, Lucid Imagination
Presented by Brent O. Palmer, Widernet Project
The eGranary Digital Library - also known as "The Internet in a Box" - provides millions of digital educational resources to institutions lacking adequate Internet access. Through a process of garnering permissions, copying web sites and delivering them to our partner institutions in developing countries, we provide access to millions of multimedia documents that can be instantly accessed by patrons over their local area networks. We use Solr to provide full-text search capabilities for these resources.
Presented by Arkadi Kosmynin, CSIRO Astronomy and Space Science, Australia
The talk presents a short overview of Arch, an intranet search engine that is based on Apache Nutch and is designed for efficient and effective indexing and search of organisational web sites (intranets). Corporate web environments have a few distinct characteristics as opposed to the global Web, for which Nutch was originally designed. Arch achieves very high search precision in intranet environments, supports document-level security, offers high availability and inexpensive index updates, and is easy to use to index multiple web sites. We describe the key technical solutions that allowed us to achieve our aims. Arch performance evaluation results are also presented.
Presented by Andrzej Bialecki, Lucid Imagination
This talk will describe issues involved in scalable web crawling and web search, and explain how to integrate Apache Solr as a search platform with web crawling functionality, using existing web crawling platforms: Nutch, Aperture and Lucene Connectors Framework.
Presented by Bill Press, Software Development Manager, Salesforce.com
How do you deploy Lucene to support millions of searches per day, by hundreds of thousands of users (each with distinct privacy settings), over tens of thousands of document sets containing both structured and unstructured data, all the while indexing hundreds of millions of document updates per day? In this talk, we will discuss the scalability challenges of search at salesforce.com, our current architecture, and the new challenges posed by new product lines, including Chatter, our new collaboration and social networking application for the enterprise.
Presented by Yonik Seeley, Lucid Imagination
The Solr/Lucene community is hard at work designing and developing a range of new features and fixes for Apache Solr, advancing the frontiers of search. Solr creator Yonik Seeley will provide a preview survey of these developments, and talk about how one can leverage new functionality. Topics will include new faceting functionality, new function queries, increased scalability, field collapsing, and spatial search. The talk will span features already included in trunk, features slated for the next release, as well as incomplete features under consideration for future releases.
Presented by Sonali Sambhus, Senior Search Architect & Engineering Manager, Cisco Systems, Inc.
Cisco's new Pulse(TM) is a powerful platform that uses embedded Lucene/Solr search technology to tag and index key terms and topics from a broad range of media -- from email to video -- in real time. Tapping into internal communications traffic, it helps find expertise from within the enterprise's internal social network. Cutting edge enterprise search techniques were developed at Cisco with the help of Lucid Imagination. This in-depth technical workshop covers how the Cisco team designed and optimized Pulse with Lucene and Solr, on topics including:
Presented by Michael McCandless, Senior Software Engineer, IBM
Flexible indexing is one of the new features in Lucene's next major release, 4.0. It includes big changes to a number of places in Lucene: a new, higher performance postings iteration API; terms as arbitrary opaque bytes (not chars); direct visibility and control of deleted documents; a low-level, pluggable codec API giving applications full control over the postings data. Several interesting codecs have already been created, including the default "standard" codec, which enables sizable RAM reduction for searchers, and a "pulsing" codec that inlines postings data directly into the terms dictionary, which provides a solid performance boost for primary key fields. In this talk Michael presents an overview of all of these exciting changes, as well as several concrete, real-world examples of how applications can tap into these new features.
Presented by Jon Gifford, Co-Founder, Loggly
This talk will describe a streaming log file search system based on Solr that indexes data in real time and provides search access to it at most 10 seconds later. We describe the use of 0MQ to move data around the system, and the distributed shard management system based on SolrCloud/Zookeeper that gives the system its elasticity. We take advantage of a number of non-traditional features of both the data and the expected search behavior to minimize the overall system size, while still allowing for very large indices, and input rates (all going well) in excess of 100,000 events/second.
Presented by David Oliver, Manage My Life/Sears
Manage My Life is a going-on-four-years-long experiment in a community centered experience by Sears Holdings Company. The site offers expert advice, articles and projects, and owner manuals among other attractive content. But, month after month users would find their way to our site-wide search page, find nothing useful and promptly leave. Our site is implemented in Ruby on Rails, but our search implementation was a home-grown search engine and API (using Lucene libraries under the hood) written in Java, coupled with a crawling and indexing scheme also hand-rolled and also in Java. The search experience on the site, which should have been a first-class citizen, was in reality the ugly step-child. We observed that we had a steady “hard bounce rate” at or near 100% once users landed on our search page. Also at or near 100% was the “exit rate”. We took user feedback--raw user behavior from Omniture as well as survey results--and combined a revamped user experience with the power of Solr to give users what they are actually looking for on the site. Find out the challenges we faced (both technical and otherwise), how we overcame them and what our stakeholders and users are saying now.
Presented by Paul Oakes, Lulu
Lulu is creating a new model in publishing — open publishing — that empowers more creators to sell more content to more readers more profitably than ever before. We have more than a million creators registered from more than 200 countries and territories, and each month they add approximately 20,000 new works to our catalog.
Therein lies one of Lulu's big challenges: How to sort through all that content to quickly, effectively, and efficiently meet the needs of buyers. We needed a best-of-breed search and discovery platform; we chose Lucene and Solr.
Lulu's experience with Lucene and Solr has evolved over the past few years, and our index has grown many times over its earlier size. We've had our share of growing pains, and we've learned a lot from the challenges of integrating internal services into indexing.
With our modern implementation, Lulu has achieved remarkably faster and more meaningful search results, indexing times have been reduced by orders of magnitude, and because this project is open source, we have executed these improvements with minimal costs. Lulu has great plans for its future search and discovery experience, and we look forward to the benefits Lucene and Solr will continue to bring.
Presented by Michael Busch, Search Engineer, Twitter
Lucene has for a while had a nice feature that we call "Near-realtime search" (NRT). The approach works well for a lot of applications, but we're currently working on an even better real-time solution in Lucene: directly searching IndexWriter's RAM buffer while documents are being added! This will dramatically improve indexing performance compared to NRT, and the search latency (the time it takes for a document to become searchable) will shrink to a minimum - hence we will scratch the N in NRT! This talk will discuss this new approach and give an overview of the current status of Lucene's realtime-search branch.
Presented by Jason Eiseman, Yale Law School; Daniel Lovins, Yale University; Jeffrey Barnett; Tom Burton-West, Retrieval Programmer, University of Michigan
Our panel of experts has practical experience with implementation and scaling of Solr within the context of major university library systems. Using the power of Solr within the Drupal integration at Yale – Jason Eiseman will highlight how Drupal and Solr were used to improve the search functionality and usability of Yale University Law School’s library website. His colleagues across campus, Jeffrey Barnett and Daniel Lovins, will discuss the use of the Lucene/Solr platform, integrated with ICU and language detection, and why it is the best way for the Yale Library to provide the same high standard of relevancy ranking and faceting with non-Roman scripts. Rounding out this discussion is Tom Burton-West of the University of Michigan Library, who will explain his experience with scaling Solr to provide full-text access to millions of books at a reasonable cost for the HathiTrust Large Scale Search project.
Presented by Mitch Stewart, Boomi
As more applications are moving into the Cloud, the need to organize and locate relevant data becomes a critical part of any Cloud Application. Cloud Users have become more tech-savvy and expect the application to respond quickly to search requests as well as allow for easy customizations. This talk covers how Boomi utilizes Solr to monitor its Cloud Integration solution, how search can be used to audit the data flowing between applications and the challenges and benefits of implementing Solr in a multi-tenant fashion.
Presented by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global
There are many reasons that an IT department with a large scale search installation would want to move from a proprietary platform to Lucene/Solr. In the case of FAST Search, the company’s purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users. This presentation will compare Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We will further explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases will be presented describing how to map the various functions between systems.
Presented by David Smiley, Senior Software Developer, MITRE
Spatial search is a growing area in the Lucene/Solr community that is steadily progressing. One method used to accomplish spatial search involves geohashes. A geohash is a latitude/longitude geocode system in the public domain, described in detail on Wikipedia. Geohashes are strings that further narrow a latitude-longitude box on the earth with each added character. Given this property, Lucene’s inverted index is well suited as the basis for a geohash-based search filter. There are two challenges to such an implementation: One is dealing with the fact that not every point near another necessarily has the same prefix due to inevitable box boundary conditions. Secondly, an ideal implementation should be optimized to handle searches spanning a great number of points. This presentation will discuss such an implementation with closing thoughts on addressing performance for the latter case.
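To make the prefix property described above concrete, here is a minimal Python sketch of the public-domain geohash encoding: it alternately bisects the longitude and latitude ranges and emits one base-32 character for every five bits. This is an illustrative sketch only, not the implementation presented in the talk; the function name `geohash_encode` and its `precision` parameter are assumptions introduced for the example.

```python
# Geohash base-32 alphabet (digits plus lowercase letters, omitting a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # accumulates the current 5-bit group
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid      # keep the eastern half of the box
            else:
                bits <<= 1
                lon_hi = mid      # keep the western half of the box
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid      # keep the northern half of the box
            else:
                bits <<= 1
                lat_hi = mid      # keep the southern half of the box
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:        # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# Each added character narrows the box, so a shorter hash is a prefix of a longer one.
print(geohash_encode(57.64911, 10.40744, 3))   # u4p
print(geohash_encode(57.64911, 10.40744, 9))

# Boundary condition: two points a few metres apart on opposite sides of the
# equator and prime meridian do not share even their first character.
print(geohash_encode(0.0001, 0.0001, 1), geohash_encode(-0.0001, -0.0001, 1))   # s 7
```

Since every hash inside a box shares that box's prefix, a prefix lookup against the terms of an inverted index can act as a coarse bounding-box filter. The last line shows why a prefix match alone is not enough: a search near a box edge also has to consider the neighbouring boxes.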
Presented by John Wang, Search Architect, LinkedIn
LinkedIn is a high traffic consumer internet site and LinkedIn search, built on Lucene, is serving millions of queries on a daily basis. The search problem at LinkedIn is unique in the following ways:
This presentation will cover the challenges we have faced and the solutions we have come up with, as well as our future plans and the next steps we are taking to enhance our search system using Lucene.
Presented by Uwe Schindler, Schindlers Software/PANGAEA
PANGAEA (Publishing Network for Geoscientific and Environmental Data, www.pangaea.de) is a data library for georeferenced data from earth system research operated in Open Access. Scientific primary data are long-term archived with related meta-information using a relational database. On top of this database, which is used for maintaining and curating the data in the backend, all data citations and corresponding documentation are searchable using Apache Lucene Java. Users are able to use conventional scoring Lucene queries as well as geographical filters to retrieve archived data sets. In this talk, Uwe Schindler presents the use of NumericRangeQueries in combination with custom scoring to create a map-based search and dynamic results display (possibly with live demo). Lucene is also used to quickly look up relations based on Digital Object Identifiers (DOIs) between these data citations and conventional research papers hosted by scientific publishers. Uwe will also present the XML-based workflow used for indexing content from the underlying relational database.
Apache Lucene, Lucene, Apache Solr, Solr, Apache Hadoop and Hadoop are trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by LucidWorks.