Reading Metadata between the Lines: Searching for Stories, People, Places and More in Television News

Download slides: PowerPoint or PDF or SlideShare

Summary: UCLA’s NewsScape has over 200,000 hours of television news from the United States and Europe. In the last two years, the project has generated a large set of “metadata”: story segment boundaries, story types and topics, named entities, on-screen text, image labels, etc. Including this metadata in searches opens new opportunities for research, understanding, and visualization, and helps answer questions such as “Who was interviewed on which shows about the Ukraine crisis in May 2014?” and “What text or image is shown on the screen as a story is being reported?” However, metadata search poses significant challenges: the search engine must consider not only the content of each metadata instance, but also its position and time relative to other instances, and whether search terms fall within the same or different instances. This session will describe how UCLA has implemented metadata search with Lucene/Solr’s block join and custom query types, as well as the collection’s position-time data. The talk will also cover UCLA’s work on using time as the distance unit for proximity search, on filtering search results by metadata boundaries, and on its metadata-aware, multi-field implementation of auto-suggest.
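To make the block-join idea concrete, below is a minimal SolrJ sketch of the kind of parent/child query the session describes: child metadata documents (for example, named-entity instances) are matched and their parent story-segment documents are returned. The Solr URL, collection name, and field names (doc_type, entity_text, show, air_date) are hypothetical placeholders, not the actual NewsScape schema.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BlockJoinSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection name.
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/newsscape").build();

        // Solr's {!parent} block-join parser: match child metadata documents
        // (named-entity instances mentioning "Ukraine") and return the parent
        // story-segment documents that contain them.
        SolrQuery q = new SolrQuery(
                "{!parent which=\"doc_type:segment\"}entity_text:Ukraine");

        // Restrict parents to segments aired in May 2014 (placeholder date field).
        q.addFilterQuery("air_date:[2014-05-01T00:00:00Z TO 2014-05-31T23:59:59Z]");
        q.setFields("id", "show", "air_date");

        QueryResponse resp = solr.query(q);
        resp.getResults().forEach(System.out::println);
        solr.close();
    }
}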

The project's official website: http://newsscape.library.ucla.edu/