Day 1 Presentation Abstracts
The Once and Future History of Enterprise Search and Open Source
Presented by Marc Krellenstein, Lucid Imagination
While it remains challenging to build best practice search applications, core search technology has become commoditized. Open source Lucene/Solr represents the best form of that commodity. It is as good as or better than any commercial search technology while also providing the cost, control and flexibility advantages of open source. In this talk, we'll look at how past challenges in search were met and new ones evolved, and the place of Lucene/Solr in that evolution.
From Publisher To Platform: How The Guardian Embraced the Internet using Content, Search, and Open Source
Presented by Stephen Dunn, Guardian News and Media UK
In 2009 The Guardian launched The Open Platform, a suite of services and tools that enable content partners and developers to build applications with The Guardian's rich content. The content API, hosted on Solr instances on EC2, contains JSON representations of all Guardian articles back to 1999 (over 1 million articles) and is an increasingly complete representation of the output of the organisation. The DataStore contains curated data sets for use in applications and visualizations.
This talk will cover how The Guardian opened up its business, enriched it, and reached new markets with its Open Platform strategy. Stephen will cover the technical architecture, the implementation of Solr (the key technology powering the platform), and how The Guardian has used it to embrace disruption in the media space while finding new sources of revenue and innovation. Two years after the launch, Stephen will cover some of the lessons learned and explain how The Guardian complements its use of Solr with other open-source non-relational technology as its platform evolves.
Finite State Automata in Lucene: Internals and Applications
Presented by Dawid Weiss, Poznan University of Technology, Poland
Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, that their construction can be very efficient (memory and time-wise) and that their field of applications is very broad. This will be backed by an introduction to how FSTs are implemented in Lucene (construction and traversals) and practical use cases where FSTs have been useful so far. If you'd like to see how to squeeze 150MB of text data into a 1.8MB compact data structure, this talk is for you.
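To give a feel for why such automata can be both simple and compact, here is a minimal sketch of a minimized acyclic automaton. It is an acceptor without outputs, so it is not a transducer and not Lucene's FST implementation; the function names and the four-word list are illustrative assumptions. It builds a plain trie and then merges equivalent states bottom-up, which is where the space saving comes from.

```python
def build_trie(words):
    """Build a plain trie; each node is {'final': bool, 'edges': {char: node}}."""
    root = {"final": False, "edges": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["edges"].setdefault(ch, {"final": False, "edges": {}})
        node["final"] = True
    return root


def minimize(node, registry):
    """Merge equivalent states bottom-up and return the canonical node."""
    for ch, child in list(node["edges"].items()):
        node["edges"][ch] = minimize(child, registry)
    # Two states are equivalent if they agree on finality and on the
    # (label, canonical target) pair of every outgoing edge.
    key = (node["final"],
           tuple(sorted((ch, id(c)) for ch, c in node["edges"].items())))
    return registry.setdefault(key, node)


def count_states(root):
    """Count distinct reachable states (shared states are counted once)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node["edges"].values())
    return len(seen)


def accepts(root, word):
    """Follow one edge per character; accept only if we end on a final state."""
    node = root
    for ch in word:
        node = node["edges"].get(ch)
        if node is None:
            return False
    return node["final"]


words = ["tap", "taps", "top", "tops"]  # illustrative input
trie_states = count_states(build_trie(words))
automaton = minimize(build_trie(words), {})
automaton_states = count_states(automaton)
print(trie_states, automaton_states)
```

The shared suffixes ("p", "ps") collapse into single states, so the minimized automaton needs fewer states than the trie while accepting exactly the same words.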
Boosting Documents in Solr by Recency, Popularity and Personal Preferences
Attendees will come away from this presentation with a good understanding of, and access to source code for, boosting and/or filtering documents by recency, popularity, and personal preferences. My solution improves upon the common "recip" based solution for boosting by document age. The framework also supports boosting documents by a popularity score, which is calculated and managed outside the index. I will present a few different ways to calculate popularity in a scalable manner. Lastly, my solution supports the concept of a personal document collection, where each user is only interested in a subset of the total number of documents in the index. My presentation will provide a good example of how to filter and/or boost results based on user preferences, which is a very common requirement of many Web applications.
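For context, the "recip" approach mentioned above refers to Solr's recip(x,m,a,b) function query, which computes a / (m*x + b). The sketch below reproduces only that baseline formula in Python, not the speaker's improved solution. The parameter defaults (m set to one over the number of milliseconds in a year, a = b = 1) are assumptions chosen so that a brand-new document gets a boost of 1.0 and a one-year-old document gets roughly half of that.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10 milliseconds


def recip(x, m, a, b):
    """Same formula as Solr's recip function query: a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms, m=1.0 / MS_PER_YEAR, a=1.0, b=1.0):
    """Boost multiplier for a document that is age_ms milliseconds old.

    The defaults are illustrative assumptions, not recommended settings.
    """
    return recip(age_ms, m, a, b)


for years in (0, 1, 5):
    print(years, round(recency_boost(years * MS_PER_YEAR), 3))
```

Because the curve flattens as documents age, old documents are demoted but never eliminated, which is one reason this formula is a common starting point.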
Real-time Search at Yammer
Presented by Boris Aleksandrovsky, Yammer, Inc
This talk will focus on the architecture, scalability concerns, performance bottlenecks, operational characteristics and lessons learned while designing and implementing Yammer's distributed real-time search system. Yammer is an enterprise social network SaaS offering with over 100,000 networks (including 85% of the Fortune 100) and nearly 2 million users. The search system we developed scales well up to 1B messages and serves as a foundation for the knowledge base analysis services Yammer is developing.
Case Study - Panasonic Europe Powered by Apache Solr
Presented by Daniel Potzinger, AOE media GmbH
In 2010 Panasonic decided to replace its legacy enterprise search tool and switched the search for all its European websites to an Apache Solr-based solution.
Now its customers benefit from an incredibly fast and feature-rich solution that is much more than just a search and has become a valuable sales-driving tool for Panasonic. Features like relevancy manipulation, autosuggest, and contextual filtering for properties like color or product category were implemented under less than ideal circumstances, chiefly that there was no access to structured data. The search has been rolled out in close to 30 countries so far, which has also put Solr's multilingual handling to the test.
Searching The United States Code with Solr/Lucene
Presented by Ronald Matamoros, Search Technologies
What are the challenges in searching an 85 year old document? The United States Code was published by the United States Congress in 1926 as a single bound volume containing all of the general and permanent laws of the United States Government. It has been updated every year since and has grown into a 30 volume set of some 40,000 pages divided into 50 titles.
The talk will cover the challenges of searching this collection and the specific Solr and Lucene solutions and plug-ins implemented at each point, including hierarchical browsing of the TOC, searching and highlighting sub-sections of documents, custom query features, and search user interface components. The implementation required custom token filters, query parsers, document parsing and processing, and Span operators.
Jazzed about Solr: People as a Search Problem
Presented by Joshua Tuberville, eHarmony
Search-oriented architectures are obvious approaches for web pages, emails, documents, and other text-based entities. Often with traditional structured data, text searching is "added on" to the traditional Boolean queries in relational stores. When Jazzed was initiated we wanted search to be front and center. When we evaluated Solr we realized we could take the opposite approach and "add on" Boolean components to textual searches. This hybrid query approach makes transitioning to flexible ranking easy and straightforward. In this talk we will cover
Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0
Presented by Simon Willnauer, Apache Lucene PMC
Lucene 4.0 is on its way to delivering a tremendous number of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues (aka Column Stride Fields) is one of the "next generation" features. DocValues enable Lucene to efficiently store and retrieve type-safe Document and Value pairs in a column stride fashion, either entirely memory resident with random access or disk resident and iterator based, without the need to un-invert fields. Its final goal is to provide independently updatable per-document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene's Codec API for full extensibility.
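As a rough illustration of what "un-inverting" means and why a column stride layout avoids it, the sketch below contrasts the two access patterns in plain Python. It is a conceptual model only, not Lucene's data structures or API; the postings dictionary, the field values and the function name are illustrative assumptions.

```python
def uninvert(postings, max_doc):
    """Rebuild a docID -> value array from term -> [docIDs] postings.

    This is the work a search engine has to do before it can sort on a
    field that is only available in inverted form.
    """
    column = [None] * max_doc
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


# Inverted form: each term points at the documents containing it.
postings = {"blue": [0, 3], "green": [2], "red": [1]}

# Column stride form: one value per document, indexed by docID.
# Storing this array directly means the loop in uninvert() is never needed.
column = uninvert(postings, max_doc=4)

# Sorting documents by field value is then a direct per-document lookup.
order = sorted(range(4), key=lambda doc_id: column[doc_id])
print(column, order)
```

The point of the per-document column is that sorting, scoring and filtering all ask "what is the value for this document?", which the inverted form cannot answer without the rebuild step.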
Integrating Advanced Text Analytics into Solr
Presented by Steve Kearns, Basis Technology
Text analytics provides a number of interesting analytic capabilities that can enhance enterprise search applications, though in practice it is not always obvious how these can be integrated effectively into Solr. This presentation will describe some of the practical ways that leading organizations are using text analytics by integrating them directly into Solr and their user interface to improve relevance, navigate results, and discover new information. The combination of Solr and quality text analytics can improve existing keyword search solutions, and enable new ways of discovering knowledge hidden in existing data.
Search, APIs, capability management and the Sensis journey
Presented by Craig Rees, Sensis
Earlier this year, Sensis launched its Business Search API, which allows publishers to develop local search propositions powered by the two million business listings contained in the Australian Yellow Pages® and White Pages® directories.
This case study will explore Sensis' strategic direction for search and explain how the framework and metrics by which search is managed at Sensis were used to define our search roadmap. Key architectural decisions including our use of Solr and MongoDB will be discussed as well as our approach to real-time search tuning and quality management.
Four Pillars of Designing the Search Experience
Presented by Tyler Tate, TwigKit
Lucene and Solr provide many excellent tools for presenting information to users, but what makes some search user interfaces better than others? Should you aim for a rich, advanced UI or should you "just make it look like Google"?
Through his work at TwigKit with blue-chip corporations, scientific institutes, and governments, Tyler has identified four guiding pillars of the search experience:
This discussion covers the approach taken by Open Text to deploy Nutch, Solr and other technologies into a self-provisioned cloud-based offering including:
We'll delve deep into each dimension and discuss how to achieve useful, usable, and beautiful search interfaces using design patterns including: autocomplete, faceted navigation, breadcrumbs, best bets, related searches, spelling suggestions, clickable metadata, result clustering, saved searches, data visualisation, and more.
A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene
Documentum xPlore provides an integrated search facility for the Documentum Content Server. The standalone search engine is based on EMC's xDB (a native XML database) and Lucene. In this talk we will introduce xPlore and some of its key components and capabilities. These include aspects of a tight integration of Lucene with the XML database: xQuery translation and optimization into Lucene queries and API calls, as well as transactional updates of Lucene. In addition, xPlore is being deployed aggressively into virtualized environments (both disk I/O and VM). We cover some performance results and tuning tips in these areas.
Using Solr in Online Travel Shopping to Improve User Experience
Presented by Esteban Donato, Sudhakara Karegowdra, Travelocity
In this talk we will present three different use cases of Solr in the travel industry. First, we will describe how we implemented faceted navigation for hotel shopping. Then, we will introduce how we implemented destination search functionality such as auto-complete and misspelling correction. Lastly, we will show how we integrated Solr to provide better experiences to mobile users.
Solr @ eBay Kleinanzeigen
Presented by Olaf Zschiedrich, ebay.de
Attendees will learn how eBay Germany has implemented Solr, why Solr was selected, which Solr features are utilized, and how Solr is configured and used in production. Recommended best practices will be profiled along with eBay Kleinanzeigen's plans for future deployment of Solr.
Rapid Prototyping with Solr
Presented by Erik Hatcher, Lucid Imagination
Got data? Let's make it searchable! This interactive presentation will demonstrate getting documents into Solr quickly, will provide some tips on adjusting Solr's schema to better match your needs, and finally will discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
Search Analytics: What? Why? How?
Presented by Otis Gospodnetic, Sematext
You've indexed your data and people are searching it. But how do you know if they are happy with the results? How do you know if they are finding what they need? With search increasingly becoming the primary information access mechanism, knowing how your search is doing is not just a matter of mere curiosity, but often has direct business impact.
In this talk we'll talk about Search Analytics and how it can be used to answer questions like:
We'll explore what specific Search Analytics reports tell us and what specific actions you should take based on those reports.
"Stump The Chump": Get On The Spot Solutions To Your Real Life Solr/Lucene Challenges
Presented by Grant Ingersoll, Lucid Imagination
Got a tough problem with your Solr or Lucene application? Facing challenges that you'd like some advice on? Looking for new approaches to overcome a Lucene/Solr issue? Not sure how to get the results you expected? Don't know where to get started? Then this session is for you.
Now, you can get your questions answered live, in front of an audience of hundreds of Lucene Revolution attendees! Back again by popular demand, "Stump the Chump" at Lucene Revolution 2011 is hosted by PMC chairman and Lucid Imagination co-founder Grant Ingersoll. Grant's going on the hot seat in front of your peers, to tackle questions live.
Our MC will read the questions, and Grant will have to formulate a solution on the spot. A panel of judges will decide if he has provided an effective answer. Prizes will be awarded by the panel for the best question, and for those deemed to have "stumped the chump".
Day 2 Presentation Abstracts
All Data Big and Small
Presented by Stephen O'Grady, Redmonk
The last twenty-four months have seen a veritable explosion in discussion around what is commonly referred to as Big Data and the infrastructure technology employed to manage it. The wealth of available open source software means that businesses from any industry have easily accessible tools with which to tackle projects that would have been out of their reach just a few years prior. Less heralded, however, has been the fact that making data actually useful, whatever its size, remains a challenge. In this session we'll explore the role of search in putting data, big and small, to work answering the important questions for businesses and society by reducing the friction between question and answer.
Highly Relevant Search Result Ranking for Large Law Enforcement Information Sharing Systems
Presented by Ronald Mayer, Forensic Logic
Law enforcement data has many interesting complexities for search. Cross-agency searches are even more challenging because each agency has its own shorthand. Many different types of similarity between search clauses and documents should influence the ranking of results. For example, a search clause mentioning a "tall suspect" might want to include results with "6 foot 4 suspect". Spatial clusters are important, as are temporal patterns. Different fields may be more or less important depending on the type of crime: for example, a victim's race may matter more than a vehicle's make in a sex crime, but less in an auto theft. Also, documents may be related to each other in various ways that may also affect their ideal search ranking.
Solr's great flexibility in its analyzers, filters, synonyms, and boosting makes it an excellent tool for such diverse requirements.
We've contributed a patch to Solr (#SOLR-2058) that helped us further improve search result ranking for cases where a search for a suspect with a 'red baseball cap, black leather jacket' is compared against many documents mentioning red caps, black caps, etc.
This presentation will describe how we addressed some domain-specific challenges of our data.
Intuit's Live Community
Presented by Floyd Morgan, Intuit
TurboTax Live Community is a large-scale web application that uses user contribution and open source technology to help millions of TurboTax users complete their tax returns. Other benefits of Live Community include reduced support calls, highly effective advertising campaigns, usability engineering and, new this year, conversion prediction analytics. I will present how Solr/Lucene powers the many facets of TurboTax Live Community now and in the future.
Using Solr/Lucene/LWE for eCommerce
Presented by Grant Ingersoll, Lucid Imagination
If your user can't find it, they can't buy it, right? In this talk, Apache Lucene and Solr committer Grant Ingersoll will discuss architecture, techniques and tips for successfully deploying search tools like Lucene, Solr and LucidWorks Enterprise in eCommerce environments.
Flexible Indexing in Lucene 4.0
Presented by Uwe Schindler, SD DataSolutions GmbH
Apache Lucene's next major release, 4.0, will introduce lots of flexibility into indexing, but also fundamental changes to the well-known APIs: It features a new and consistent, 4-dimensional iteration API on top of a low-level, pluggable codec API giving applications full control over the postings data. Terms are now arbitrary opaque bytes enabling users to store terms in any encoding, not necessarily UTF-8, natively in the index (e.g. numeric fields). Currently under development is a higher performance postings iteration API, enabling interesting codecs based on recent encoding algorithms to work effectively. Several codecs have already been created, including the default "standard" codec, which enables sizable RAM reduction for searchers, and a "pulsing" codec that inlines postings data directly into the terms dictionary, which provides a solid performance boost for primary key fields. A lot of new codecs are under development like "PFOR", "FOR", "AFOR", or "Simple64". In this talk, Uwe presents an overview of all of these exciting changes, as well as several concrete, real-world examples of how applications can tap into these new features.
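To make the idea of storing numeric values as opaque term bytes concrete, here is a small sketch of one well-known way to encode a signed 64-bit integer so that plain bytewise comparison of the encoded terms matches numeric order. This is an illustrative assumption about how such an encoding can work, not Lucene's actual numeric term format, and both function names are hypothetical.

```python
def int_to_sortable_bytes(value):
    """Encode a signed 64-bit int so that bytewise order equals numeric order.

    Adding 2**63 shifts the range to 0 .. 2**64 - 1 (equivalent to flipping
    the sign bit), and big-endian layout puts the most significant byte first.
    """
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value out of signed 64-bit range")
    return (value + (1 << 63)).to_bytes(8, "big")


def sortable_bytes_to_int(data):
    """Inverse of int_to_sortable_bytes."""
    return int.from_bytes(data, "big") - (1 << 63)


values = [5, -3, 0, 2**40, -(2**63)]
encoded = sorted(int_to_sortable_bytes(v) for v in values)
decoded = [sortable_bytes_to_int(b) for b in encoded]
print(decoded)
```

A term dictionary that only ever compares raw bytes can then serve ordered lookups and range scans over numbers without knowing anything about their type.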
Handy Installation Tool "Anuenue" for Solr Cluster, and Implementation of "Did you mean" Facility for Queries in Japanese.
Presented by Takahiko Ito, mixi
mixi is one of the largest social networking services in Japan, providing various communication services for over 14M monthly active users. The latest internal mixi project is to replace the in-house search engine with Apache Solr. This session covers two topics: a simple packaging system for Solr that eases the installation process and daily operations, and the implementation of a "Did you mean" facility for Japanese queries using a log mining tool. These tools have been released as OSS projects.
Extending Solr: Behind CareerBuilder's Cloud-like Knowledge Discovery Platform
Presented by Trey Grainger, CareerBuilder
For CareerBuilder, a 1% deviance in search relevancy can mean millions of missed job opportunities for our users. When CareerBuilder moved to Solr from an expensive, proprietary search vendor, our top priorities were maintaining the quality of our search results and drastically improving our agility. This talk will describe how we addressed both needs. For search quality, we'll cover some of our internal studies and resulting methods for dealing with multi-lingual content across dozens of languages, as well as customizing and experimenting with relevancy calculations. For platform agility, we'll discuss CareerBuilder's cloud-like search API framework which seamlessly handles millions of searches an hour, processes hundreds of millions of documents, and is powered by hundreds of globally-distributed servers. Come hear the results of our studies and some best practices for quality and performance. Learn how our framework has led to staggering improvements in both maintainability and technology innovation, allowing us to learn from our content, not just find it.
Transforming the House Hunting Experience: How Solr is Helping Trulia Reshape the Real Estate Industry
Presented by Alexander Kanarsky, Trulia
Trulia is a real estate search company that helps customers find homes for sale or to rent and provides them with information to help them make better decisions in the process. It is also a hub for real estate professionals to market their listings, view real estate data and promote their services.
The presentation describes how Solr helped Trulia to transform the traditional real estate experience and make real estate data accessible and understandable to millions of users. It discusses approaches we took to achieve this by using custom-built distributed index management, indexing integration with Hadoop and geospatial search enhancements to Solr.
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Presented by Andrzej Bialecki, Lucid Imagination
This talk will explain what click-through events are and how to process them with LucidWorks Enterprise. This innovative technique puts powerful search and relevancy at your fingertips -- at a fraction of the time and effort required to program them yourself with native Apache Solr. Andrzej will discuss and present how you can use LucidWorks Enterprise for:
Building specialized industry applications using Solr, and migration from FAST ESP
Presented by Rahul Agarwalla, Uchida Spectrum Inc
Uchida Spectrum, Inc. is a leader in the Japan search market. USI provides SMART/InSight, a search application used by many Fortune 500 companies for specialized industry applications such as R&D, quality assurance for manufacturing, and claims and customer management.
Originally SMART/InSight was based on Microsoft FAST. This talk will review how SMART/InSight has migrated from FAST ESP to LucidWorks Enterprise, and how SMART/InSight incorporates virtual data integration, enterprise search, and the ability for users to have a unified way to navigate diverse data sources, analyze data more easily, and personalize results. Several real-world use cases will be profiled with demonstrations.
Using SOLR For Enabling Highly Customized Sitewide Navigation
Presented by Shantanu Deo, AT&T
The organization needed to enable a very customizable form of Global Navigation for the various types of users (based on their profile and other factors). This would normally have involved complex logic to figure out the appropriate set of links to show for a customer, and would have been a maintenance nightmare. Instead we approached the problem as a search problem. Coupled with a novel encoding scheme, this let us solve the problem simply by searching on the customer's profile groups and returning a coherent global navigation, using SOLR to index the data.
This has resulted in a solution that is very simple to understand and maintain and that will stand us in good stead in the future.
The presentation is meant to be a description of using SOLR to implement a real-world application.
Using Solr to find the Right Person for the Right Job
Presented by Laura Kang, TheLadders
In this talk, we'll describe how TheLadders.com uses Lucene/Solr to instantly recommend candidates to a recruiter when he/she posts a job on the recruiter site. Our matching algorithm scores candidates from our job seeker site based on the criteria and description of jobs and job seekers' resume and profile data. This helps recruiters quickly identify candidates that are right for the job and increases the chance of our job seekers getting hired.
The talk covers an overview of our Solr architecture and a description of our matching algorithm. We'll also discuss criteria for evaluating the algorithm, including an overview of our testing sessions and their format. Finally, we'll demo the feature so you can see how it works in practice.
The Seven Deadly Sins of Solr
Presented by Jay Hill, Lucid Imagination
Advanced Search and Analytics in 20 Minutes
Presented by Mark Davis, Kitenga
Kitenga's ZettaVox and ZettaSearch products support SOLR and Lucene ecosystems at both the ingestion point and for the search user. In this talk, I will show how ZettaVox, our professional content mining platform on Hadoop, can be used to index content and rich metadata into a LucidWorks Enterprise installation. Being built on Hadoop, ZettaVox scales up by scaling out. I will then create an end-user search and analytics experience using our ZettaSearch solution that leverages the faceted metadata to enhance information discovery and analysis. All in about 20 minutes.
Solr and Lucene at Etsy
Presented by Gregg Donovan, Etsy
Etsy is using Solr and Lucene to serve queries at a rate of more than 8 billion per year (and growing). In this case study, we will describe how Etsy has integrated Solr/Lucene into our continuous deployment infrastructure (see: http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/), allowing for Solr configuration, Java-based indexers, and query parsing logic to go from passing tests to production code in minutes. We'll also discuss how we're leveraging Solr's new Geo-search to power both local item search and GeoIP-personalized location autosuggest.
We'll also share how we've extended Solr, adding personalized faceting and filtering as well as multi-currency sorting and filtering that accounts for realtime currency fluctuation (contributed in SOLR-2202). [Note that code will be open-sourced/contributed for both of these features.] We will share our real-time monitoring techniques, including how we track Solr replication, query, and GC times in Ganglia. Finally, we'll discuss how we've used Hadoop-based user analytics to improve relevance and power data-driven spelling corrections, autocomplete suggestions, and related searches.
Solr Performance: Key Innovations
Presented by Yonik Seeley, Lucid Imagination
Recent developments in Solr/Lucene have made significant contributions to distributed search processing, scalability, and throughput. In this talk, Yonik Seeley, creator of Solr, will survey key performance strategies for building search applications with Solr, and review innovations included in Solr 3.1, as well as forthcoming development work in Solr 4.0 and beyond.
Building SaaS Solutions for Online Media Using Apache Solr
Presented by Alberto Mijares, Canoo Engineering AG
In recent years, the idea of building applications that can be used remotely by means of the Web has given rise to a new concept called "Software as a Service". Such applications have the advantage of a remote web deployment that can be used instantly by potentially any consumer on the internet, as well as the cost reduction that a Web-based deployment provides.
The speaker explains in this talk the architecture of an innovative "Software as a Service" solution built for Axel Springer media group in Switzerland.
This application is capable of remotely extracting the content of multiple online newspaper articles, then analyzing and classifying them to determine which articles are the most similar to a given one. This information is then integrated back into the article to provide the user with a "related articles" feature. The key points that made this "SaaS" application successful are: the low integration effort, the minimal TCO, the superior quality of the results and the capability to integrate information across different websites with a pragmatic approach.
The core components of the analysis process are: language-specific tools (used to filter the superfluous language terms) and semantic knowledge bases (like Wikipedia, used to enrich the indexed information with new context specific terms, or to disambiguate the extracted terms).
At a more technical level, the speaker will explain the criteria for selecting the emerging enterprise search framework Apache Solr as the platform, and how it drastically reduced the development effort required.
As a summary, a list of the key achievements and conclusions will be presented to the public, pointing out the maturity and robustness of Apache Solr as a flexible and open-source based enterprise search platform.
Using Solr Cloud to Tame an Index Explosion
Presented by Jon Gifford, Loggly
We have hundreds of customers, each of whom may have dozens of shards. To manage this explosion of indexes, I'll describe how we're using Solr Cloud to manage every index - from creation, through migration from box to box, and finally destruction. I'll describe some of the performance issues we had to deal with, especially with ZooKeeper.
Lucene @ Yelp
Presented by Sudarshan Gaikaiwari, Yelp
This talk describes how Yelp uses Lucene to provide search services. It includes
We will cover:
CPython Embedded in Solr - Search Solution for Python Lovers With the Speed of Native Java
Presented by Roman Chyla, CERN
SPIRES is the biggest bibliographic database for High Energy Physics, ArXiv is the biggest repository of full-text papers in High Energy Physics, and INSPIRE is the biggest digital library that merges the two. We must work with result sets bigger than 1 million for citation-related queries, and our partners from Astrophysics work with sets of 6 million; however, INSPIRE is written in Python. So how do we move several million result sets between the two systems fast? How do we take advantage of our special NLP processing pipeline written in Python? How do we join them? We do not use Jython. We do not use pipes. We do not embed Solr inside INSPIRE. We embed INSPIRE into Solr! The talk shows the benefits and challenges of this surprisingly elegant solution.
Lots of Facets, Fast
Presented by Anne Veling, BeyondTrees
We created a web application for a well-known US newspaper: a maps-like zooming interface on top of the 60,000 newspapers published since 1850, using Solr over the 28,000,000 articles to create an interactive heatmap. Using domain knowledge, we optimized the out-of-the-box faceting solution by an order of magnitude, which allowed us to create a great visual way of exploring trends in historical newspapers.
Apache Lucene, Lucene, Apache Solr, Solr, Apache Hadoop and Hadoop are trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by LucidWorks.