[Lowerbounds, Upperbounds]

Algorithms are everywhere.

Date: 2008-01-30 12:00
Place: NSH 1507
Speaker: Shuheng Zhou
Title: High Dimensional Sparse Regression and Clustering

Abstract:

I will talk about two “high-dimensional” problems: sparse high dimensional regression and learning mixtures of discrete product distributions using both graph-based and spectral techniques.

In high-dimensional regression, the sparse object is a vector $\beta$ in a linear system: $Y = X \beta + \epsilon$, where $X$ is $n \times p$ matrix such that $n < < p$, $\beta \in \R^{p}$ and $\epsilon \in \R^n$ consists of i.i.d. random noise. In the classical setting, this problem is ill-imposed for $p > n$ even for the case when $\epsilon =0$; however, when the vector $\beta$ is sparse, one can recover an empirical $\hat \beta$ that is consistent in terms of its support with true $\beta$, as shown in Zhao and Yu 07, and Wainwright 06.

John Lafferty, Larry Wasserman and I studied the regression problem under the setting that the original $n$ input variables encoded in $(X, Y)$ are compressed by a random Gaussian ensemble $\phi \in \R^{m \times n}$ to $m$ examples in $p$ dimensions, where $m << n$ or $p$. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We established sufficient mutual incoherence conditions on $X$, under which a sparse linear model can be successfully recovered from the compressed data. We characterized the number of random projections that are required for $\ell_1$-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence”. In addition, we showed that $\ell_1$-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence” by Greenshtein and Ritov 2004. Finally, we established upper bounds on the mutual information between the compressed and uncompressed data that decay to zero.

In the second problem, I will describe the problem of partitioning a small data sample drawn from a mixture of $k$ product distributions; each dimension corresponds to a Bernoulli random variable with a certain frequency. We are interested in the case that individual features are of low average quality, and we want to use as few of them as possible to correctly partition the sample. I will describe results on this problem obtained using both graph-based and spectral techniques. This is based on a result in my thesis, and a joint work with Avrim Blum, Amin Coja-Oghlan and Alan Frieze.

I will not be able to go into great details of proofs, but I will give some high level idea why things work out they way they are, and describe some open problems.

No Comments :(