Cleaning Up Big, Messy Data

Messy data—heterogeneous values, missing entries, and large errors—is a major obstacle to automated modeling. Data cleaning is the first step in any data processing pipeline, and the way it’s carried out has serious consequences for the results of any subsequent analysis. Yet this step is generally performed using ad hoc methods. The simplest approach to coping with missing data is also the most common: many researchers simply discard any record that contains a missing element. When most of the data is present, this practice reduces statistical power; when most of the data is missing, it is disastrous and renders the data useless.
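The cost of that row-deletion strategy is easy to see on a small synthetic table. The sketch below (Python with pandas, using invented values rather than any data from the project) shows how dropping every row that contains a missing cell discards a large share of otherwise usable data.

```python
import numpy as np
import pandas as pd

# Toy table with missing entries scattered at random (np.nan).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 10)),
                  columns=[f"x{i}" for i in range(10)])
mask = rng.random(df.shape) < 0.05       # roughly 5% of cells are missing
df = df.mask(mask)

complete_cases = df.dropna()             # the common "just ignore it" strategy
print(f"rows kept: {len(complete_cases)} of {len(df)}")
# With 5% missingness per cell across 10 columns, roughly 40% of the rows
# are discarded, even though about 95% of the individual values are present.
```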

Along with her team, Madeleine Udell, Operations Research and Information Engineering, is developing basic, composable modeling tools for robust data inference that exploit structure in the data set. Their modeling framework will accept data with noisy, uncertain, or missing values and produce clean, complete data sets, automatically extending any other modeling tool to work on highly incomplete, noisy data. The research involves developing novel algorithms for denoising and imputation that exploit spatiotemporal, network, and low-dimensional structure to infer missing values in complex, heterogeneous databases.
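As a rough illustration of what exploiting low-dimensional structure can buy, the sketch below fills missing entries with a generic iterative-SVD, rank-k fit. It is a textbook-style stand-in, not the team's algorithm; the function name lowrank_impute and all parameters are invented for this example.

```python
import numpy as np

def lowrank_impute(X, rank=2, n_iters=50):
    """Fill missing entries (np.nan) in X by iteratively fitting a rank-k model.

    Illustrative only: a generic iterative-SVD imputer, not the research
    team's algorithm.
    """
    observed = ~np.isnan(X)
    filled = np.where(observed, X, np.nanmean(X, axis=0))   # start from column means
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]        # best rank-k approximation
        filled = np.where(observed, X, approx)               # keep observed values fixed
    return filled

# Example: a noisy rank-2 matrix with 30% of entries missing.
rng = np.random.default_rng(1)
truth = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 30))
X = truth + 0.01 * rng.normal(size=truth.shape)
X[rng.random(X.shape) < 0.3] = np.nan

imputed = lowrank_impute(X, rank=2)
err = np.abs(imputed - truth)[np.isnan(X)].mean()
print(f"mean absolute error on missing entries: {err:.3f}")
```

Because the observed entries pin down the low-rank factors, the missing cells are recovered far more accurately than by simply substituting column means, which is the intuition behind imputing from structure rather than discarding incomplete records.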

These methods of preprocessing data will allow researchers to draw power from the data that they do have and to perform any kind of analysis they normally would perform on complete data. Applications range from social science surveys to medical informatics, from manufacturing analytics to marketing, from finance to hyperspectral imaging, and beyond.

Cornell Researchers

Madeleine Udell, Operations Research and Information Engineering

Funding Received

$1.4 Million spanning 4 years
