The National Science Foundation (NSF) recently awarded a collaborative team led by Andrew McGregor, computer science, a three-year, $1.5 million grant to further develop the foundations of data science in a project that will create NSF’s national TRIPODS Institute for Theoretical Foundations of Data Science.
Also part of the executive committee with McGregor are Markos Katsoulakis and Patrick Flaherty of mathematics and statistics, plus Barna Saha and Arya Mazumdar, both on leave this year from the College of Information and Computer Sciences (CICS).
In addition to the executive committee, a team of senior researchers includes Justin Domke, Marco Duarte, Akshay Krishnamurthy, Anna Liu, Andrew McCallum, Cameron Musco, Luc Rey-Bellet, Dan Sheldon and Ileana Streinu.
McGregor says, “This team will help strengthen ties between the institute and the mathematics, computer science and electrical engineering departments here on campus. The fact that Ileana and Dan have positions at Smith and Mt. Holyoke will also give us the opportunity to engage undergraduates at those colleges.”
Another aspect of the TRIPODS project will be to organize summer schools, speaker series, talks by experts in related technical areas and workshops for faculty researchers in other disciplines who want to learn how big data can help them. “In the next three years we might notice more of our colleagues attending data science workshops on campus,” McGregor notes.
Unlike the “practical outcomes” focus of some big data initiatives, the focus of TRIPODS is more on theoretical, mathematical and foundational aspects of data science, McGregor says. For example, practical data scientists “know that their methods typically work in practice,” he says, “but we don’t necessarily know why, or whether they’re consistent and reliable. In order to know the why and how, you need to mathematically analyze them. You need to show that the algorithm’s estimates will always in fact give an answer that is within a certain percentage of the true answer.”
Thus the TRIPODS team will mathematically prove properties of a given algorithm, such as its running time, accuracy and scalability.
Data sets in the sciences, such as genetics and physics, are growing larger every year, McGregor points out. For example, a personalized medical device may generate data continuously, 24 hours a day and seven days a week, over a year or longer. “By the time you reach 10 years it will be overwhelming,” he says.
He adds, “In statistics the more data you get, the more accurate you can be, but in computer science the more data you get, the longer it will take you to process. That’s one reason you need computer scientists and mathematicians working together. When you double the size of the input, we need to know if the algorithm takes twice as long, four times as long or 100 times as long. That’s important.”
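McGregor's doubling question can be illustrated with a small sketch. The Python below is not from the TRIPODS project; the function names are hypothetical, and it counts basic steps rather than measuring real running time, which keeps the result repeatable. It assumes two toy algorithms: one that makes a single pass over the data and one that examines every pair of items.

```python
def linear_steps(n):
    """Count the steps of a single pass over n items (hypothetical toy algorithm)."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps


def quadratic_steps(n):
    """Count the steps of examining every ordered pair among n items."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


def doubling_ratio(count_steps, n):
    """Return how many times more steps are needed when the input doubles from n to 2n."""
    return count_steps(2 * n) / count_steps(n)


if __name__ == "__main__":
    print("single pass:", doubling_ratio(linear_steps, 500))
    print("all pairs:  ", doubling_ratio(quadratic_steps, 200))
```

Under these assumptions the single pass needs twice as many steps when the input doubles, while the all-pairs approach needs four times as many. Counting steps in this way only describes the toy examples; proving such growth rates for real algorithms is the kind of mathematical analysis the article describes.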
The award is part of the foundation’s $17.7 million support for 12 Transdisciplinary Research in Principles of Data Science (TRIPODS) projects, which will bring together the statistics, mathematics and theoretical computer science communities at 14 institutions in 11 states to promote long-term research and training activities in data science that transcend traditional disciplinary boundaries.