Session Abstracts | Day 2
Keynote
Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
Presented by Grant Ingersoll | Lucid Imagination Inc.
Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond the needs of batch processing approaches. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. Furthermore, analyzing how users interact with the content can both enhance the quality of the system and deliver much-needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large-scale search, discovery and analytics over a wide variety of content utilizing tools like Solr, Hadoop, Mahout and others. The talk will discuss the architecture and capabilities of the system along with how the capabilities of Solr 4 help drive real-time access for content discovery and analytics.
Apache Hadoop: Now, Next and Beyond
Presented by Ari Zilka, Chief Product Officer | Hortonworks
Apache Hadoop has evolved rapidly to become the leading platform for managing, processing and analyzing big data. If your organization is examining how you can use Hadoop to store, transform and refine large volumes of multi-structured data, please join us for this session where we will discuss:
- The emergence of “big data” and opportunities for deriving business value
- The evolution of Apache Hadoop and future directions
- Essential components required in a Hadoop-powered platform
- Solution architectures that integrate Hadoop with existing enterprise Data Warehouses
Track Sessions
Lucene in the Cloud: Learn how GCE leveraged the power of search and Big Data to shed light on government spending
Presented by Seshu Simhadri | Global Computer Enterprises
A leader in bringing innovative technologies to the Federal Government, GCE looks to open source tools to drive down cost and provide the foundation for building value-added services for its customers. This talk will discuss GCE’s innovative use of Lucene/Solr combined with the GCE Big Data Cloud to open up access to Federal spending data. This data is in wide use across the Federal government, Federal contracting community, media and press, as well as Capitol Hill. GCE has utilized this toolset to deliver the type of capability that users typically only find in web consumer applications. This session will highlight the technical side of the challenge in implementing these tools across a large user community and data set in a Cloud environment.
How to Access Your Library Book Collections Using Solr
Presented by Engy Ali | The Library of Alexandria
Do you have a large collection of text content that you want to search? Facing challenges on how to facet after performing a full text search across metadata and content? Do you want to use Solr with personalization? Bibliotheca Alexandrina provides public access to digitized book collections that exceed 220,000 books, through a web-based search and browsing facility. The facility is completely built on Solr in five different languages. The website provides full text morphological search within the books’ metadata and content with result highlighting. Different personalization features like annotation tools and tagging are also implemented using Solr. This presentation will cover how Bibliotheca Alexandrina uses Solr to implement full text indexing and searching across the entire collection, faceting, search within the content of a book and result highlighting and techniques used for personalization.
Building a Real-Time Solr-Powered Recommendation Engine
Presented by Trey Grainger | CareerBuilder
Searching text is what Solr is known for, but did you know that many companies receive an equal or greater business impact through implementing a recommendation engine in addition to their text search capabilities? With a few tweaks, Solr (or Lucene) can also serve as a full-featured recommendation engine. Machine learning libraries like Apache Mahout provide excellent behavior-based, off-line recommendation algorithms, but what if you want more control? This talk will demonstrate how to effectively utilize Solr to perform collaborative filtering (users who liked this also liked…), categorical classification and subsequent hierarchical-based recommendations, as well as related-concept extraction and concept-based recommendations. Sound difficult? It’s not. Come learn step-by-step how to create a powerful real-time recommendation engine using Apache Solr and see some real-world examples of some of these strategies in action.
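For readers unfamiliar with the "users who liked this also liked" idea, here is a minimal plain-Python sketch of co-occurrence counting. It is only an illustration of the concept, not the Solr-based method presented in the talk; the function name, the `likes` data and the item labels are all hypothetical.

```python
from collections import Counter


def also_liked(likes, item, top_n=3):
    """Rank other items by how many users liked them together with `item`.

    `likes` maps a user name to the set of items that user liked
    (hypothetical sample data, not taken from the talk).
    """
    counts = Counter()
    for user_items in likes.values():
        if item in user_items:
            # Count every other item liked by a user who also liked `item`.
            counts.update(i for i in user_items if i != item)
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


likes = {
    "alice": {"java-dev", "search-eng", "data-sci"},
    "bob": {"java-dev", "search-eng"},
    "carol": {"java-dev", "qa"},
    "dave": {"search-eng", "data-sci"},
}

print(also_liked(likes, "java-dev"))
```

A search engine can plausibly answer the same question at query time, because an inverted index already maps each item to the users who liked it; how the talk does this in Solr is not specified in the abstract.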
Solr, Lucene and Hadoop @ Etsy
Presented by David Giffin | Etsy
Search at Etsy poses significant challenges. Our marketplace is filled with millions of unique, short-lived items and people trying to find them over 13 million times a day. In this session we'll discuss many of the solutions we've engineered to meet these challenges, including the evolution of indexing at Etsy, how HBase and Hadoop have taken indexing from hours to minutes, how and why we use BitTorrent for Solr replication, how we track search performance, our approach to shaving crucial milliseconds off every search, and an overview of our continuous deployment strategy, web/search config integration, and A/B testing and analytics.
Television News Search and Analysis with Lucene/Solr
Presented by Kai Chan | UCLA
UCLA Communication Studies Archive hosts a collection of over 100,000 hours of digital television news, updated daily. Its search engine provides closed captioning search and online streaming of videos. The search engine allows researchers and students in various fields to study television news, images and language usage, in ways that were not possible before. In this presentation, we will show the setup of our Lucene/Solr-powered search engine, as well as how it is being used. We will discuss our work on custom result formats, such as linking search result text to the video at particular timestamps, counting occurrences of words, phrases or patterns, grouping the result by fields such as month or show, and creating interactive charts. We will also discuss our work on extending Lucene’s proximity searches, and creating custom query types, such as segment-enclosed (two or more words, phrases or patterns occurring within a story-based text segment), time-enclosed (two or more words, phrases or patterns occurring within a certain time), and multi-word regular expression queries. Future goals will also be discussed, such as supporting multiple languages, multiple sources (speech-to-text alongside closed-captioning text), searching user-contributed and generated metadata (programs that identify story segments, objects in video, etc.), and syntactic tags (such as parts of speech).
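To make the "time-enclosed" query type concrete, the following is a small plain-Python sketch of the matching rule as described above: two terms match when they occur within a given number of seconds of each other. It is an assumption-laden illustration, not the archive's Lucene implementation; the `captions` data, the function name and the brute-force pairing are all hypothetical.

```python
def time_enclosed(captions, term_a, term_b, window):
    """Return (time_a, time_b) pairs where both terms occur within `window` seconds.

    `captions` is a list of (timestamp_in_seconds, word) tuples, a stand-in
    for timestamped closed-caption tokens (hypothetical sample data).
    """
    times_a = [t for t, word in captions if word == term_a]
    times_b = [t for t, word in captions if word == term_b]
    # Brute-force pairing keeps the sketch short; a real index would
    # walk sorted postings instead of comparing every pair.
    return [(ta, tb) for ta in times_a for tb in times_b if abs(ta - tb) <= window]


captions = [
    (0.0, "election"),
    (2.5, "results"),
    (40.0, "election"),
    (95.0, "results"),
]

print(time_enclosed(captions, "election", "results", 5.0))
```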
Big Data Meets Metadata – Analyzing Large Data Sets
Presented by Jeremy Bently | Smartlogic
As Big Data becomes more pervasive, the need for increased metadata management becomes critical to the understanding and mining of that content. Metadata is what unlocks the value of information assets. When metadata is well managed, the information assets are more useful and valuable. Badly managed metadata can make information assets less useful and less valuable — creating increased costs and risks related to those assets. During this presentation, we'll discuss the different types of metadata, the role of search and analytics in Big Data and the integration of Apache Solr with Content Intelligence to enable better metadata management of Big Data.
Challenges in Maintaining a High Performance Search Engine Written in Java
Presented by Simon Willnauer | Apache Lucene
During the last decade Apache Lucene became the de facto standard in open source search technology. Thousands of applications, from Twitter-scale web services to computers playing Jeopardy, rely on Lucene, a rock-solid, scalable and fast information-retrieval library written entirely in Java. Maintaining and improving such a popular software library reveals tough challenges in testing, API design, data structures, concurrency and optimizations. This talk presents the most demanding technical challenges the Lucene development team has solved in the past. It covers a number of areas of software development including concurrency and parallelism, testing infrastructure, data structures, algorithms, API design with respect to garbage collection, memory efficiency and efficient resource utilization. This talk doesn’t require any Apache Lucene or general information-retrieval background. Knowledge of the Java programming language will certainly be helpful, although the problems and techniques presented in this talk aren’t Java specific.
Search with Polygons: Another Approach to Solr Geospatial Search
Presented by Andrew Urquhart | Raytheon
After investigating the Lucene Spatial Playground approach to Solr geospatial search, Raytheon determined that the Lucene Spatial Playground was not evolving in a direction that would meet their needs. In particular, they required the ability to search for documents within a geospatial polygon and also desired a solution that would not require special handling at any point on the Earth, specifically including the poles and the 180-degree East/West longitude meridian. Taking these requirements, they implemented a Solr/Lucene geospatial search capability that maps latitude/longitude points onto a spherical Earth and then operates in three-dimensional Cartesian space. Using the geohash algorithm modified to produce Long indices, Raytheon indexes the approximate locations of points as numeric values. This approach enables index lookup using Trie structures with numeric range queries. Come hear about their approach to Solr geospatial search.
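The appeal of working in three-dimensional Cartesian space can be seen in a short sketch. The Python below maps latitude/longitude onto a unit sphere using the standard spherical-to-Cartesian formulas and shows that points on either side of the 180-degree meridian, or at a pole, end up close together without any special handling. This is only an illustration under those assumptions; it is not Raytheon's implementation, and it does not cover the modified geohash or Long indexing step. The function names are hypothetical.

```python
import math


def to_unit_xyz(lat_deg, lon_deg):
    """Map a latitude/longitude pair (degrees) to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def chord(p, q):
    """Straight-line (chord) distance between two (lat, lon) points on the unit sphere."""
    return math.dist(to_unit_xyz(*p), to_unit_xyz(*q))


# Two points 0.2 degrees apart, straddling the 180-degree meridian.
print(chord((0.0, 179.9), (0.0, -179.9)))
# The same pole reached via two different longitudes.
print(chord((90.0, 0.0), (90.0, 120.0)))
```

In raw longitude the first pair differs by 359.8 degrees, while in Cartesian space the two points are neighbors, which is why no wrap-around logic is needed.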
Is your index reader really atomic or maybe slow?
Presented by Uwe Schindler | SD DataSolutions GmbH
Since the first day, Apache Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader and IndexWriter. However, the API did not reflect reality; from the IndexWriter perspective this was desirable, but when reading the index this caused several problems in the past. In reality, a Lucene index is not a single index, although it is logically treated as such. This talk will introduce the new API classes AtomicReader and CompositeReader added in Lucene 4.0 as very general interfaces, and DirectoryReader, which most people know as the segment-based “Lucene index on disk”. The talk will also cover more changes and improvements to the search API, like reader contexts that allow converting local document ids to global ones from IndexSearcher. Lucene changed all IndexReaders to be read-only, so it’s no longer possible to modify indexes using those classes. Finally, Uwe Schindler will show migration paths from custom norm values to the various new ranking models that were added to Lucene; this includes using Similarity with Lucene 4.0’s DocValues as a replacement for norms.
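The local-to-global document id conversion mentioned above comes down to simple offset arithmetic: each segment gets a base equal to the number of documents in the segments before it (in Lucene 4.0 this offset is exposed as docBase on the reader context). The Python sketch below models only that arithmetic; it is not Lucene code, and the function names and segment sizes are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def doc_bases(segment_sizes):
    """Starting global doc id of each segment, given per-segment document counts."""
    return [0] + list(accumulate(segment_sizes))[:-1]


def to_global(bases, segment, local_doc):
    """Convert a segment-local doc id to an index-wide doc id."""
    return bases[segment] + local_doc


def to_local(bases, global_doc):
    """Convert an index-wide doc id back to (segment, local doc id)."""
    # The owning segment is the last one whose base is <= global_doc.
    segment = bisect_right(bases, global_doc) - 1
    return segment, global_doc - bases[segment]


bases = doc_bases([100, 50, 25])  # three segments of hypothetical sizes
print(bases)
print(to_global(bases, 1, 7))
print(to_local(bases, 150))
```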
Things Made Easy: One Click CMS Integration with Solr and Drupal
Presented by Peter Wolanin | Acquia, Inc
If you have a new web project or an existing Drupal site, the combination of Drupal and Apache Solr is both powerful and easy to set up thanks to the existing integration code. The module allows for substantial customization with the administrative UI. Drupal facilitates further customizations of the UI, indexing, and boosting because of the open architecture that provides multiple opportunities for custom code to alter the behavior. A couple of code snippets will be followed by a review of other contributed Drupal modules that further enhance the search capability.
Finally, this session will showcase some examples of Drupal sites using Solr, including Acquia's own sites and many well-known enterprise and government sites.
Solr 4: The SolrCloud Architecture
Presented by Mark Miller | Lucid Imagination Inc.
In this talk, Lucene/Solr committer Mark Miller will discuss the low-level architecture and design decisions around SolrCloud and distributed indexing. Come learn about the latest work on Solr's new scaling and fault tolerance solution: how it works and how we built it.
Indexing Big Data in the Cloud
Presented by Scott Stults | OpenSource Connections
Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility that is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach capable of enlisting hundreds of worker nodes to ingest data, track their progress, and relinquish them back to the cloud when the job is done. The data set that will be discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. Also, the same basic approach was used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case 150 worker nodes were used to generate 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer and tricks for using XSLT on very large XML documents.
How to Gain Greater Business Intelligence from Solr/Lucene
Presented by Patrick Beaucamp | Bpm-Conseil
Vanilla, an Open Source business intelligence application by bpm-conseil.com, offers unique features such as report indexing through an embedded Lucene integration. Using Vanilla and Lucene, developers can manage both report indexing and external document indexing, which ultimately saves end users time when they search for specific keywords such as "product code" or "customer code." Vanilla can build upon an existing Solr/Lucene installation that takes care of all the indexing processes while Vanilla takes care of the Reporting/Dashboard creation. During this presentation, attendees will learn how we moved from an embedded Lucene API to a Solr/Lucene platform and all the technical and business benefits of this architecture in terms of clustering, caching and access mode.
Updateable Fields in Lucene and other Codec Applications
Presented by Andrzej Bialecki | Lucid Imagination Inc.
Lack of field-level updates in Lucene has been a long-standing issue. Join Andrzej Bialecki as he presents several interesting applications that use the Lucene 4.0 Codec API. Specifically, he will cover the design and implementation of updateable fields, examples of online index filtering and mirroring of index data using codecs, and an implementation of a single-pass index splitter.
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics
Presented by M.C. Srivas | MapR
This session addresses the biggest issue facing Big Data: Search, Discovery and Analytics need to be integrated. While creating and maintaining separate Solr and Hadoop clusters is time-consuming, error-prone and difficult to keep in sync, most Hadoop installations do not integrate with Solr within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data Search, including how to protect against silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and where GFS, BigTable and MapReduce were used extensively.
Apache Lucene, Lucene, Apache Solr, Solr, Apache Hadoop and Hadoop are trademarks of The Apache Software Foundation.










