Text Mining for the Real World

The growth of social media and digitized libraries has made computational text analysis, in which users search for and detect trends in vast amounts of text, a vital tool for modern scholarship. Too often, methods that work for expert users on standardized collections with clean and complete data don’t translate to real-world data analysis needs. To be useful, text mining methodologies need to balance theoretical power with practical application. They need to be able to mine real data sets, which are noisy and complicated. They also need to address a key research challenge: huge amounts of data cannot be shared directly due to copyright, potentially including all books published after 1923.

David Mimno, Information Science, is using this CAREER award to develop tools that can model limited, privatized data, including images. Using new algorithmic capabilities, Mimno is developing methods to rectify data that’s incomplete or noisy while also reducing model uncertainty. He is also devising new methods for learning from private and sensitive documents by creating public views of nonpublic data.

Mimno’s group is developing new tools for modeling images and text that are optimized for the way images actually accompany text in real corpora. The algorithms will focus on reliability and efficiency, so that powerful techniques can be used by nonexpert users on easily accessible hardware, thereby increasing the societal impact of the work.

Cornell Researchers

Funding Received

$550 thousand over 5 years

Sponsored by