Portfolio item number 1
Short description of portfolio item number 1
Short description of portfolio item number 2
Published in Communications in Statistics - Theory and Methods, 2024
An estimator for distributions with infinite mean or variance using transformation, and the construction of confidence intervals.
Recommended citation: de la Peña, V., Gzyl, H., Mayoral, S., Zou, H., and Alemayehu, D. (2009). "Prediction and Estimation of Random Variables with Infinite Mean or Variance." Commun. Stat. Theory Methods. 1(1).
Download Paper
Published in Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 2024
Selected for oral presentation.
Recommended citation: Auddy, A., Zou, H., Rahnama Rad, K. and Maleki, A. (2024). "Approximate Leave-one-out Cross Validation for Regression with L1 Regularizers." Proceedings of The 27th International Conference on AISTATS. 238:2377-2385.
Download Paper
Published in IEEE Transactions on Information Theory, 2024
Recommended citation: Auddy, A., Zou, H., Rahnama Rad, K. and Maleki, A. (2024). "Approximate Leave-one-out CV for Regression with L1 Regularizers." IEEE Trans. Inf. Theory. 70(11):8040-8071.
Download Paper
Published in Statistics and Probability Letters, 2025
This paper establishes the unbiasedness of the classical Gini coefficient for the Gamma distribution, with applications to data grouping.
Recommended citation: Baydil, A., de la Peña, V., Zou, H. and Yao, H. (2025). "Unbiased Estimation of the Gini Coefficient." Stat. Probab. Lett. 222:110376.
Download Paper
Published in Proceedings of The 28th International Conference on AISTATS, 2025
Recommended citation: Zou, H., Auddy, A., Rahnama Rad, K. and Maleki, A. (2025). "Leave-one-out Cross Validation in High Dimensional Settings." Proceedings of The 28th International Conference on AISTATS.
Download Paper
Published in Journal of Machine Learning Research (JMLR), 2025
Machine unlearning focuses on the computationally efficient removal of specific training data from trained models, ensuring that the influence of forgotten data is effectively eliminated without the need for full retraining. Despite advances in low-dimensional settings, where the number of parameters $p$ is much smaller than the sample size $n$, extending similar theoretical guarantees to high-dimensional regimes remains challenging. We study an unlearning algorithm that starts from the original model parameters and performs a theory-guided sequence of $T \in \{1, 2\}$ Newton steps. After this update, carefully scaled isotropic Laplacian noise is added to the estimate to ensure that any (potential) residual influence of the forgotten data is completely removed. We show that when both $n, p \to \infty$ with a fixed ratio $n/p$, significant theoretical and computational obstacles arise due to the interplay between the complexity of the model and the finite signal-to-noise ratio. Finally, we show that, unlike in low-dimensional settings, a single Newton step is insufficient for effective unlearning in high-dimensional problems; two steps, however, are enough to achieve the desired certifiability. We provide numerical experiments to support the theoretical claims of the paper.
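A minimal sketch of the kind of procedure described above, assuming a ridge-regularized logistic loss; the Newton-plus-noise structure follows the abstract, but the helper names and the noise scale are illustrative placeholders rather than the paper's calibrated choices.

```python
# Illustrative sketch (not the paper's exact procedure) of Newton-step
# unlearning with Laplacian output perturbation for a ridge-regularized
# logistic loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_hess(theta, X, y, lam):
    """Gradient and Hessian of the regularized logistic loss on (X, y)."""
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) + lam * theta
    W = p * (1.0 - p)
    hess = (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])
    return grad, hess

def unlearn(theta_full, X_retain, y_retain, lam, T=2, noise_scale=0.1, rng=None):
    """Start from the full-data estimate, take T Newton steps on the retained
    data, then add isotropic Laplace noise.  The noise scale here is a
    placeholder; the paper calibrates it to certify removal of the forgotten
    data."""
    rng = np.random.default_rng(rng)
    theta = theta_full.copy()
    for _ in range(T):                      # T in {1, 2} as in the abstract
        g, H = grad_hess(theta, X_retain, y_retain, lam)
        theta -= np.linalg.solve(H, g)      # Newton update on retained data
    return theta + rng.laplace(scale=noise_scale, size=theta.shape)
```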
Submitted to the Journal of Applied Probability
Following Student's t-statistic, normalization has been a widely used method in statistics and other disciplines, including economics, ecology and machine learning. We focus on statistics taking the form of a ratio over (some power of) the sample mean, whose probabilistic features remain largely unknown. We develop a unified formula for the moments of these self-normalized statistics with non-negative observations, yielding closed-form expressions for several important cases. Moreover, the complexity of our formula does not scale with the sample size $n$. Our theoretical findings, supported by extensive numerical experiments, reveal novel insights into their bias and variance, and we propose a debiasing method illustrated with applications such as the odds ratio, the Gini coefficient and the squared coefficient of variation.
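A quick Monte Carlo illustration of the finite-sample bias of one such self-normalized statistic, the squared coefficient of variation of Gamma data; this is a simulation sketch, not the paper's closed-form moment formula, and the shape parameter and sample size are arbitrary.

```python
# Monte Carlo illustration of the finite-sample bias of a statistic
# normalized by a power of the sample mean: the squared coefficient of
# variation of Gamma(shape) data, whose population value is 1/shape.
import numpy as np

rng = np.random.default_rng(0)
shape, n, reps = 2.0, 20, 200_000
true_scv = 1.0 / shape                       # population CV^2 of Gamma(shape)

samples = rng.gamma(shape, size=(reps, n))
scv_hat = samples.var(axis=1, ddof=1) / samples.mean(axis=1) ** 2

print(f"true CV^2     : {true_scv:.4f}")
print(f"mean estimate : {scv_hat.mean():.4f}")   # noticeably biased at n = 20
print(f"bias          : {scv_hat.mean() - true_scv:+.4f}")
```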
Work in progress, 2025
This paper studies the error of $k$-fold cross validation (k-CV) in estimating the out-of-sample error of regularized empirical risk minimization (R-ERM) in the proportional high-dimensional setting, where the number of observations $n$ and the number of parameters $p$ go to infinity proportionally. We provide a stochastic bound for the MSE of k-CV under mild assumptions. In contrast with the common belief that the MSE keeps decreasing as the number of folds $k$ increases, we find that, for fixed $n$ and $p$, it stops decreasing once $k$ exceeds a certain threshold. The manuscript will be finished and submitted soon.
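A small illustrative experiment in the spirit of this study, comparing k-fold CV estimates of the out-of-sample error of ridge regression across several values of $k$; the dimensions, penalty, and use of scikit-learn are my own choices, not the paper's setup.

```python
# Compare k-fold CV estimates of out-of-sample error for ridge regression
# across several values of k in a proportional-dimension regime (p/n fixed).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p, lam = 300, 150, 1.0                     # illustrative sizes and penalty
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

for k in (2, 5, 10, n):                       # k = n is leave-one-out
    scores = cross_val_score(Ridge(alpha=lam), X, y,
                             scoring="neg_mean_squared_error", cv=k)
    print(f"k = {k:3d}: CV estimate of out-of-sample error = {-scores.mean():.3f}")
```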
Published in ICML 2025 Workshop on Assessing World Models: Methods and Metrics for Evaluating Understanding, 2025
The increasing complexity of machine learning (ML) and artificial intelligence (AI) models has created a pressing need for tools that help scientists, engineers, and policymakers interpret and refine model decisions and predictions. Influence functions, originating from robust statistics, have emerged as a popular approach for this purpose. However, the heuristic foundations of influence functions rely on low-dimensional assumptions where the number of parameters p is much smaller than the number of observations n. In contrast, modern AI models often operate in high-dimensional regimes with large p, challenging these assumptions. In this paper, we examine the accuracy of influence functions in high-dimensional settings. Our theoretical and empirical analyses reveal that influence functions cannot reliably fulfill their intended purpose. We then introduce an alternative approximation, called Newfluence, that maintains similar computational efficiency while offering significantly improved accuracy. Newfluence is expected to provide more accurate insights than many existing methods for interpreting complex AI models and diagnosing their issues. Moreover, the high-dimensional framework we develop in this paper can also be applied to analyze other popular techniques, such as Shapley values.
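For concreteness, here is the classical influence-function approximation to leave-one-out estimates (one Newton step with the full-data Hessian) for ridge-regularized logistic regression, i.e. the low-dimensional heuristic whose accuracy the paper examines; Newfluence itself is not reproduced here.

```python
# Classical influence-function approximation of leave-one-out parameters for
# ridge-regularized logistic regression: theta_{-i} ~ theta_hat + H^{-1} g_i,
# with H the full-data Hessian and g_i observation i's gradient at theta_hat.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loo_params_influence(theta_hat, X, y, lam):
    """Return an (n, p) array whose i-th row approximates the estimate
    obtained by refitting without observation i."""
    p = sigmoid(X @ theta_hat)
    W = p * (1.0 - p)
    H = (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])
    grads = X * (p - y)[:, None]              # per-observation gradients
    return theta_hat[None, :] + grads @ np.linalg.inv(H)
```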
Published:
The Gini coefficient is a crucial statistical measure used widely across various fields. The interest in the study of the properties of the Gini coefficient is highlighted by the fact that every year the World Bank ranks the level of income inequality between countries using it. In order to calculate the coefficient, it is common practice to assume a Gamma distribution when modeling the distribution of individual incomes in a given population. The asymptotic behavior of the sample Gini coefficient for populations with Gamma distributions has been well-documented in the literature. However, research on the finite sample bias has been absent due to the challenge posed by the denominator. This study aims to fill this gap by demonstrating that the sample Gini coefficient is an unbiased estimator of the population Gini coefficient for a population with Gamma distribution. Furthermore, our findings provide an expectation of the downward bias due to grouping when group sizes are equal.
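A Monte Carlo sketch of the unbiasedness claim, comparing the sample Gini coefficient in its mean-difference (U-statistic) form with the standard closed-form population Gini of a Gamma distribution, $G = \Gamma(\alpha + 1/2)/(\Gamma(\alpha + 1)\sqrt{\pi})$; the shape parameter and sample size below are arbitrary.

```python
# Monte Carlo check: average sample Gini (mean-difference form with an
# n(n-1) denominator) versus the population Gini of a Gamma(alpha) law.
import numpy as np
from scipy.special import gamma as G

def sample_gini(x):
    """U-statistic form of the sample Gini coefficient."""
    n = x.size
    mean_abs_diff = np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))
    return mean_abs_diff / (2.0 * x.mean())

rng = np.random.default_rng(2)
alpha, n, reps = 2.0, 10, 50_000
pop_gini = G(alpha + 0.5) / (G(alpha + 1.0) * np.sqrt(np.pi))

est = np.mean([sample_gini(rng.gamma(alpha, size=n)) for _ in range(reps)])
print(f"population Gini: {pop_gini:.4f}   average sample Gini: {est:.4f}")
```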
Published:
The out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross validation (LO) offers a (nearly) distribution-free yet computationally demanding method to estimate OO. Recent theoretical work showed that approximate leave-one-out cross validation (ALO) is a computationally efficient and statistically reliable estimate of LO (and OO) for generalized linear models with twice differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, the theoretical understanding of ALO’s error remains unknown. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with the non-differentiable L1 regularizer. We bound the error $|ALO−LO|$ in terms of intuitive metrics such as the size of leave-i-out perturbations in active sets, sample size n, number of features p and signal-to-noise ratio (SNR). As a consequence, for the L1 regularized problems, we show that $|ALO−LO|\to0$ when $n,p\to\infty$ while $n/p$ and SNR remain bounded.
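A sketch of the ALO idea for the lasso with squared loss, assuming the standard active-set form in which training residuals are inflated by leverages computed on the active set; edge cases (empty or collinear active sets) and scikit-learn's penalty scaling are glossed over.

```python
# ALO sketch for the lasso with squared loss: approximate leave-one-out
# residuals by dividing training residuals by (1 - leverage), with leverages
# computed from the columns in the active set of the full-data solution.
import numpy as np
from sklearn.linear_model import Lasso

def alo_lasso(X, y, lam):
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
    resid = y - X @ fit.coef_
    A = np.flatnonzero(fit.coef_)             # active set of the lasso solution
    XA = X[:, A]
    H = XA @ np.linalg.solve(XA.T @ XA, XA.T) # leverages restricted to A
    alo_resid = resid / (1.0 - np.diag(H))    # approximate LOO residuals
    return np.mean(alo_resid ** 2)            # ALO estimate of out-of-sample error
```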
Published:
Slides can be found here.
Undergraduate/Master/PhD courses, Columbia University, Department of Statistics, 2020
Courses for which I served as a teaching assistant.
Instructor, Columbia University, Department of Statistics, 2024
Weekly recitations for the course Statistical Inference and Modeling.
Co-Instructor, Georgia Institute of Technology, 2024
Co-instructed with Prof. Victor de la Peña; the course covered applications of decoupling and self-normalization, including bandit and sorting problems.