

Contact:

    Snigdhansu (Ansu) Chatterjee,
    Department of Mathematics and Statistics,
    University of Maryland Baltimore County,
    Email: snigchat at umbc dot edu (preferred),
    Phone: 410 dot 455 dot 2235.


        If you are interested in the Statistics Ph.D. or MS programs at UMBC, and do not find adequate or relevant information in the department or UMBC webpages (sometimes these websites are not up to date), feel free to contact me by email.




Chatterjee Lab members:

Additional details will come soon.


  1. Siddhartha Nandy, Post-doc.
  2. Vishal Subedi, Grad student.
  3. Fred Azizi, Grad student.
  4. Pratyusha Sarkar, Grad student.
  5. Jhilam Sur, Grad student.
  6. Yuting Liu, Grad student.
  7. Olivia Yuengling, Undergrad student.


Publications:

        For a listing of my papers, see my Google Scholar profile.




Teaching and course materials:

        Most of my course materials are available to registered students only, via university-controlled portals. Some lecture materials (mostly on data science foundations and related topics) will eventually make it here.




NSF Funding:

  1. Collaborative Research: FDT-BioTech: Aspects of Digital Twin Studies for Neuroimages
  2. Collaborative Research: Ranking of Entities



Research interests:

My research interests are numerous; however, time (primarily), resources, and collaborators are limited. I work on both the theory/foundations and the applications of data science, broadly defined as the collection of data-driven techniques and algorithms that are often labeled as artificial intelligence (AI), machine learning (ML), or Statistics (Stat).
Below are a few topics that I am currently interested in. There are deep connections and interplay between these topics, and most of my research activities address several of these topics simultaneously.



  1. Digital Twins: Neurodegenerative Diseases, Precision Medicine and other applications
    This is currently my main focus area relating to the applications of artificial intelligence (AI), machine learning (ML), and Statistics to real-world scientific problems. Parts of this project are supported by US NSF grant DMS-2436549.
    Our primary application focus is on neurodegenerative diseases such as Alzheimer's and Parkinson's. However, the data science core for digital twins (DT) that we are developing is transportable to many other domains. In particular, we are also studying cancer and personalized treatment regimes for cancer patients, as well as some applications outside medicine that are useful for DT studies.

    1. Verification, validation, uncertainty quantification (VVUQ): We are working on several aspects of digital twins, one major focus being VVUQ. One aspect we are focusing on is monitoring an entire cohort of individuals (for changes in the brain, for example) and the personalized detection and signaling of any changes (that may be a trigger or symptom of Alzheimer's, for example). The techniques we use involve many of the research topics listed below.
    2. Discovering dynamical systems from data: Our work on DT involves close interplay between the domain sciences and the data sciences, and one of the very exciting research questions we would like to address is: can we discover scientific equations from data? We are studying multiple aspects of this problem, and some of the results we have obtained so far are extremely promising!
    3. (Generative) AI on neurodegenerative data: This branch of research couples with my studies on the theoretical aspects of AI and related techniques! Some details are listed below.



  2. Data Sciences - Foundations: Theory, Algorithms, Software (DS-F: TAS)
    This is currently my primary research program on the foundational or theoretical aspects of artificial intelligence (AI), machine learning (ML), and Statistics.

    1. Theoretical development of Data Sciences: We primarily study the theoretical foundations and principles of various topics that come under the broad umbrella of Data Science, including AI, ML, and Statistics, to understand their consistency, optimality, and other properties. We also work on the algorithmic and methodological aspects, often in relation to a specific inter-disciplinary application. Thus, our work encompasses questions about why one should prefer a particular (data science) technique over another, how to make the best use of the data and available computational resources, how to quantify our uncertainty about the results we obtain, and so on.
    2. Uncertainty quantification: A considerable part of our research may be termed uncertainty quantification (UQ). We take a very broad view of UQ, and study probability and probabilistic inference including considerations for complex forms of dependencies, multi-dimensional extremes and heavy tails, generative models of many kinds, risk assessment and decision making under uncertainties, forecasting and predictions, multi-scale and multi-resolution aspects. In some studies, UQ also relates to privacy, confidentiality, representativeness in both the data and methods, which we broadly study under limited description data methods.
    3. Scientific dynamics discovery: Another topic of great interest to me currently is how to enable scientific discoveries from data. This relates to the fourth and fifth paradigms of the scientific discovery process, and my students, collaborators, and I are actively engaged with this topic!
    4. Statistical connections: Our research techniques often take us into the worlds of Bayesian statistics and resampling techniques, which help build inter-disciplinary bridges between Statistics, Computer Science and parts of Mathematics. We often look at high-dimensional data geometry to understand properties of the data and algorithms.
    5. Data Geometry: The geometrical and topological properties of data, especially in high-dimensions, are of great interest to me. There are numerous geometric aspects that one may consider: our focus is tied to the interplay between the probabilistic (measure-theoretic) properties as exhibited by the data in the context of the geometry of the space such data are in.
    6. Competition between different optimalities: The theoretical properties and algorithmic aspects may often be in competition with each other. For example, statistical optimality, which captures how to extract the greatest possible information out of data, may clash with computational optimality, which argues for efficient and fast computations. We study how to balance these different needs of data science research.



  3. Conditional Inference
    Conditional statistical methodology is structured to exploit the available and existing data. This is our core topic of research in mainstream Statistics. The three main lines of research that we pursue here are the following (admittedly, these three streams have completely different philosophical foundations):

    1. Bayesian Statistics: Our research is mainly on using Bayesian principles and theory for understanding and enhancing machine learning, artificial intelligence including deep learning algorithms, primarily with a view to quantifying uncertainty in the answers obtained by such algorithms, assessing the risk associated with using results from such algorithms, and conducting statistical inference and causality studies in ML/AI frameworks.
      We also work on Bayesian methodological developments and applications to inter-disciplinary applications, and other statistical applications.
    2. Resampling Techniques: This is philosophically very different from Bayesian statistics, and encompasses a broad area of research where we either repeatedly use the observed data, or assign random weights to data, or try to imitate the data generating process or optimization framework in a variety of ways. Despite being fundamentally different from Bayesian techniques from a philosophical standpoint, in practical terms, resampling often achieves very similar goals of uncertainty quantification, risk assessment, statistical inference and causality in all kinds of data science frameworks.
      I have been working on resampling for a long time now, and this remains a core (and fun) topic of research for me, owing to its never-ending potential.
    3. Empirical Likelihood Techniques: A third approach for statistical inference relying primarily on observed data is via empirical likelihood. We are generalizing some aspects of empirical likelihood to make it more appealing for geometric inference and a few other settings.



  4. Limited Description Data
    Much of statistics, and almost all of artificial intelligence, deep learning, and machine learning methodology, is built around assumptions of "nice" data: (i) there is representative, unbiased, and adequate data from all sub-groups and sub-domains of interest, (ii) sampling artefacts are not present in the data, (iii) observations are statistically independent and identically distributed (iid), and so on.
    What if all of these are not true, as is often the case for real data? That is, what if the data is biased or unrepresentative and depends on the sampling scheme, there is only a small amount of data or no data from some sub-populations, there are missing observations and systematic biases and errors in the observations, and the observations are not independent or identically distributed? What if the directly observed data in some domains or areas are too few in number for meaningful conclusions to be drawn? What if the data fails to meet fairness, diversity, equity, and representativeness standards? Can we ensure that the analysis of such datasets meets fairness, as well as privacy and confidentiality, standards?


    We study such limited description data from a number of different perspectives:

    1. Small Area Models: This is now a well-established field, where we consider the problem that the data from some or all of the sub-populations are not adequate in size, but we can borrow strength across sub-populations. Our research on this takes us into Bayesian statistics, resampling techniques, and many other interesting topics.
    2. Unrepresentative data, non-probability sampling, citizen-generated data: Like it or not, much of the available data in today's world do not adhere to the strict statistical and mathematical assumptions we would like to impose. Consequently, we have to devise Data Science (AI-ML-Stat) methodologies for data that are biased or unrepresentative.
    3. Synthetic data: Artificial datasets are extremely useful for public dissemination, research purposes, and many other uses. Such datasets are critical for development of digital twins. We work on the delicate balance between data representativeness, confidentiality, security and the quality of the results obtained analyzing such artificial datasets. This arm of research also involves deep dives into Bayesian statistics, resampling techniques as well as machine learning and artificial intelligence techniques.
    4. Record Linkage, Entity Resolution: We try to construct larger datasets from smaller ones, by discovering relationships between observations and among features. This topic engages us in Bayesian statistics as well as several interesting machine learning techniques.
    5. Federated Learning, Transfer Learning: This is a counterpart of record linkage and entity resolution: instead of trying to discover relations between the observations and features of different datasets, we transport the analyses and inferences from one dataset to another. This helps in many ways: we can preserve privacy, confidentiality, respondent rights, and data and intellectual property security; we can reduce computational costs; and we can obtain better statistical inference with high accuracy and precision.
      This arm of research also involves deep dives into Bayesian statistics as well as machine learning and artificial intelligence techniques.
    6. Repeated Surveys: This research domain is on using Bayesian statistics, resampling techniques and machine learning and artificial intelligence techniques to learn from surveys that are repeated over time.
    7. Privacy, Confidentiality: There is considerable overlap of this topic with small area techniques, record linkage and entity resolution, federated and transfer learning.



  5. Other trans-disciplinary data science applications
    It's wonderful to be able to use the multiple arms of Data Sciences, like AI, ML, and Statistics, to solve real-world scientific problems! Here are some of the topics I work on:

    1. Health Informatics: We study multimodal and multi-resolution data on various aspects of health and well-being, including medical images, genomics, cancer biology, and omics-data. This is an exciting field, one where I am learning a lot every day!
    2. Social science and informatics: Can we use data science methods to predict violence and, more importantly, identify the drivers of political violence? Similarly, can we analyze alliances between countries or other groups? We study such tremendously important questions, along with their causes and effects tied to climate change, the environment and ecological systems, migration, and supply chain dynamics.
    3. Climate Informatics: We try to understand properties of this planet's climate, environment and ecological systems, using various data science techniques and domain sciences like physics and biology. This involves using lots of statistical techniques for inference and uncertainty quantification, along with physics-informed data science, extreme value statistics and so on.