[Lowerbounds, Upperbounds]

Algorithms are everywhere.

On Geomblog, Piotr mentioned a problem that I think many of us can share: if we really have to pick just one to teach (as in his case), do we go with 2-4 trees or red-black trees?

For the most part, the pros and cons of the two approaches are quite, urr, balanced. :P But there has been one critical issue that tilt my preference towards RB. On the surface, the issue is “augmentation”, ala CLRS chapter 14 style. But it really is about the number of rotations used during insertions and deletions…

There is a fascinating history behind this, which is summarized by Ottmann and Wood up to 1990. It all started with a design from Olivie and a swift response from Tarjan. The rest of the juicy bits, well, I offer no spoiler. This post merely needs the following fact to make sense: in the worst case, some implementations of RB need only $O(1)$ rotations during restructuring and the remaining work are color changes; some implementations need logarithmic number of rotations.

But if we can afford a logarithmic number of color changes in the worst case, isn’t it just “a matter of constants” even if we have to do a logarithmic number of rotations as well?

In many cases, the answer is indeed yes, but there is no lack of interesting cases due to augmentations that make rotations “expensive”. For a starter: see section 13.4 of Bob’s textbook, pp 570-571 in the C++ 3rd edition, where this very issue is raised and the model answer is presented. Here is a favorite example of mine: Seidel and Aargon used this to highlight why treaps are preferable to splay trees in applications like segment trees. Let’s end with one more related to this post: McCreight’s priority search trees, used as the motivating example in Ottmann and Wood, also critically rely on search trees that use only $O(1)$ number of rotations to restructure. More precisely, doing $p$ rotations yields a bound of $O(p \log n)$. See, it’s not just in the constants!

I will finish this by saying that this doesn’t imply a landslide victory for RB. Conceptually, 2-4 trees are really hard to beat! In fact, even though I would be speaking RB in my mouth, I would be thinking about 2-4 at the back of my mind! (Kind of like some foreign language speakers, no? Ten years ago I was exactly like this, haha.) And really, if we ever get to more advanced stuff like supporting finger searches, I will pick 2-4 on most but not all days. Complications… complications… :P

Exercise: To understand how a 2-4 speaker would speak in RB, what is “black-height” in the 2-4 language?

Reference: Thomas Ottmann, Derick Wood: How to Update a Balanced Binary Tree with a Constant Number of Rotations. SWAT 1990: 122-131

From the project homepage:

PGF is a TeX macro package for generating graphics. It is platform- and format-independent and works together with the most important TeX backend drivers, including pdftex and dvips. It comes with a user-friedly syntax layer called TikZ.

I have been a happy user of PGF ever since 2006. For certain types of graphics like data structures and geometry, it really makes more sense to “code it” then to “draw it” (interactively). The package is very rich in features, as you can witness just by flipping through the 560 page manual.

560 pages for a package? I know it’s long, but of course you don’t need to learn every single option of every single feature before you can use PGF productively. (Actually, you will use a high-level abstraction layer called TikZ as explained in the manual.) The manual provides several tutorials to help you start.

Lastly, I note that the package is written by fellow Theory researcher Till Tantau. Thank you Till!

P.S. Version 2.0 is in the MiKTeX distribution already. Just do an automatic update.

Date: February 27 2008, noon
Location: NSH 1507
Speaker: Daniel Golovin
Title: Uniquely Represented Data Structures with Applications to Privacy

Abstract:

A uniquely represented data structure has a unique physical state to encode each logical state of its abstract data type. In this talk I will discuss new efficient uniquely represented versions of popular data structures (including hash tables, binary search trees, and queues), how these data structures work, and how they can be used to improve the security and privacy of real-world applications.

As an example application, consider a typical system with some internal data structures. If the memory representations of those internal data structures are inspected, they may leave significant clues to the past use of the system. For example, a data structure with lazy deletions might retain an object that the user believes was deleted long ago; this is problematic in environments requiring high security or strict privacy guarantees. Uniquely represented data structures eliminate such problems entirely by storing exactly the information specified by an abstract data type, and nothing more.

I/O-Efficient Algorithms for Computing Contour Lines on a Terrain
Bardia Sadri, Duke University
February 22, 2008, 3:30PM, Wean 7220

Abstract:
A terrain \Sigma is the graph of a bivariate function. We assume that \Sigma is represented as a triangulated surface with n vertices. A contour (or contour line) of \Sigma is a connected component of a level set of \Sigma. Generically, each contour is a closed polygonal curve; at “critical” levels these curves may touch each other. We present I/O-efficient algorithms for the following two problems related to computing contours of \Sigma:

Given two real parameters h and d > 0, we present an I/O-optimal algorithm that report all contours of \Sigma at heights h + kd, for every positive integer k, using O(Sort(N)+T/B) I/Os, where T is the total number edges in the output contours, B is the “block size,” and Sort(N) is the number of I/Os needed to sort N elements. The algorithm uses O(N/B) disk blocks. Each contour is generated individually with its composing segments sorted in clockwise order. Moreover, our algorithm generates information on how the contours are nested.

We can preprocess \Sigma, using O(Sort(N)) I/Os, into a linear-size data structure so that all contours at a given height can be reported using O(logB N + T/B) I/Os, where T is the output size. Each contour is generated individually with its composing segments sorted in clockwise order.

This is a joint work with Lars Arge, Pankaj K. Agarwal, and Thomas Moelhave.

Title: An Erdos-Stone type theorem for spanning subgraphs
Speaker: Anusch Taraz, Technical University Munich
When: February 21, 12:30-13:30
Where: Porter Hall 125B

Abstract:

In this talk we discuss the proof of the following conjecture by Bollobas and Komlos: Suppose that G and H are both graphs on n vertices, H has bounded maximum degree, chromatic number r, and bandwidth o(n), G has minimum degree at least ((r-1)/r + eps)n, and n is sufficiently large. Then G must contain a copy of H. This is joint work with Julia Boettcher and Mathias Schacht.

Time: 12:00pm 2008-02-19
Place: NSH 1507
Speaker: Yi Wu
Title: An Optimal SDP Approximation Algorithm for Max-Cut

Abstract:

Let $G$ be an undirected graph for which the standard Max-Cut SDP relaxation achieves at least a $c$ fraction of the total edge weight, $1/2 \leq c \leq 1$. If the actual optimal cut for $G$ is at most an $s$ fraction of the total edge weight, we say that $(c, s)$ is an SDP gap. We define the SDP gap curve $GapSDP: [1/2, 1] \rightarrow [1/2, 1]$ by $GapSDP(c) = \text{inf} {s : (c, s) \text{ is an SDP gap}}$.

We complete a long line of work by determining the entire SDP gap curve; we show $GapSDP(c) = S(c)$ for a certain explicit (but complicated to state) function $S$. In particular, our lower bound $GapSDP(c) \geq S(c)$ is proved via a polynomial-time ‘$\text{RPR}^2$’ algorithm. Thus we have given an efficient, optimal SDP-rounding algorithm for Max-Cut. The fact that it is $\text{RPR}^2$ confirms a conjecture of Feige and Langberg.

We also describe and analyze the tight connection between SDP gaps and Long Code tests. Using this connection, we give optimal Long Code tests for Max-Cut. We reach the following conclusions:

- The Max-Cut SDP gap curve subject to triangle inequalities is also given by $S(c)$.

- No $\text{RPR}^2$ algorithm can be guaranteed to find cuts of value larger than $S(c)$ in graphs where the optimal cut is $c$. (Contrast this with the fact that in the graphs exhibiting the $c$ vs. $S(c)$ SDP gap, our $\text{RPR}^2$ algorithm actually finds the optimal cut.)

- Further, no polynomial-time algorithm of any kind can have such a guarantee, assuming P != NP and the Unique Games Conjecture.

Sound 3-query PCPPs are long
Prahladh Harsha, TTI Chicago
February 29, 2008, 3:30PM, Wean 7220

Abstract:
The celebrated PCP Theorem [AS+ALMSS] states that any NP proof can be rewritten in such a manner that a probabilistic verifier can check the veracity of the proof by querying the proof in at most a constant number of locations. In this paper, we ask the question if there exist 3-query PCPs that are both short (of size n.\poly\log n) as well as reject proofs of false assertions with probability close to 1/2? In other worlds, can we have the best of the results of both Hastad (3-query PCPs with soundness 1/2-eps but long proofs) and Ben-Sasson-Sudan+Dinur (3-query PCPs with short proofs, but poor soundness)?

We show that any PCP construction that yields PCPs of proximity with similar parameters CANNOT achieve the above property. We obtain this result by studying the trade-off between the length of a PCP of proximity (PCPP) and the maximal soundness that can be guaranteed by a 3-query verifier with oracle access to the proof. Our main result is that a verifier limited to querying a short (i.e, polynomial) proof cannot obtain the same soundness as that obtained by a verifier querying a long (i.e, exponential) proof.

Title: Avoiding small subgraphs in Achlioptas processes

Speaker: Michael Krivelevich, Tel Aviv University

When: February 14, 12:30-13:30

Where: Porter Hall 125B

Abstract:

Consider the following model of random graphs. We are given a monotone graph property P (Hamiltonicity, non-3-colorability, containment of a copy of a fixed graph H etc.) and an integer parameter r>=1. At each round, we are presented with r random edges from the set of edges of the complete graph K_n on n vertices and are asked to choose one of them. The task is to design an online edge choice algorithm that will be able to postpone (avoid) or to facilitate (embrace) almost surely the appearance of P. This model is sometimes called the Achlioptas process, after Dimitris Achlioptas who suggested it for the case r=2 and the property P being the existence of a linear sized connected component (the so called giant component) in the graph. The latter setting is essentially the only one that has been studied extensively so far. In the work to be presented, we were able to achieve a satisfactory solution for the case where the task is to avoid the appearance of a fixed graph H, and H is either a cycle, or a complete graph, or a complete bipartite graph. The answer is quite surprising and depends heavily on the parameter r. For example, the threshold for avoiding K_4 in the Achlioptas process with parameter r=2 is n^{28/19}.

FRIDAY FEBRUARY 15TH
3:30PM
WEH 7220

TITLE: LP-duality Based Algorithms for Restless Bandit Problems

Kamesh Munagala, Duke University

Abstract:

In numerous areas of operations research and artificial intelligence, we face a problem common to gamblers choosing slot machines (or bandits) in a casino: whether to try new machines or keep playing the one we know and hope for the best! More generally, we face the trade-off between exploration and exploitation: between learning from and adapting to a stochastic system and exploiting our current best-knowledge. A model that captures this trade-off is the celebrated Multi-arm Bandit Problem, which was formulated and solved many decades ago, and has since then become a fundamental tool in decision theory. However, in many stochastic applications, we cannot use the basic multi-arm bandit model, but must consider a generalization called the Restless Bandit Problem. Although this problem has been extensively studied, little positive progress has been made. In fact, it has been proven that in its ultimate generality, the restless bandit problem is PSPACE-Hard to approximate to any non-trivial factor.

In this talk, we derive a 2-approximation algorithm for a general subclass of the Restless Bandit Problem, in which the state-transition probability for each bandit is “separable” into a constant probability matrix and a vector of arbitrary monotone functions of time. The monotonicity models increasing levels of uncertainty as time-since-last-observation increases, which occurs in many applications such as in wireless scheduling. The resulting algorithm is simple and intuitive.

To derive the result, we modify a well-known LP relaxation, take its dual, and add a “balance” condition which provides more structure to the LP. From the dual variables, we construct an index policy, and prove that it’s a 2-approximation by using complementary slackness and a potential function argument. Our techniques are fairly clean, yield simple index policies with small approximation factors, and are applicable to related stochastic scheduling problems.

Joint work with Sudipto Guha, UPenn, and Peng Shi, Duke.

Date: 2008-02-13 12:00
Room: NSH 1507
Speaker: Katrina Ligett
Title: A Learning Theory Approach to Non-Interactive Database Privacy

Abstract:

We demonstrate that, ignoring computational constraints, it is possible to release privacy-preserving databases that are useful for all queries over a discretized domain from any given concept class with polynomial VC-dimension. We show a new lower bound for releasing databases that are useful for halfspace queries over a continuous domain. Despite this, we give a privacy-preserving polynomial-time algorithm that releases information useful for all halfspace queries, for a slightly relaxed definition of usefulness. Inspired by learning theory, we introduce a new notion of data privacy, which we call distributional privacy, and show that it is strictly stronger than the prevailing privacy notion, differential privacy.

February 12, 2008

Todd Phillips

10:00 AM, 4623 Wean Hall

Thesis Proposal

Title: Optimal Meshing Algorithms

Abstract:

The meshing problem is to decompose a geometric domain into simple elements that preserve the domain boundary. Meshing algorithms have been used for physical simulation and graphics applications for fifty years, but have only received serious theoretical attention in the last two decades. I propose to address two fundamental algorithmic questions in this area.

Many meshing algorithms require the use of a bounding box to mesh a larger region that contains the domain of interest. The extra elements are then discarded, similar to scaffolding around a construction project. I will provide a new output analysis of such meshing algorithms, where I will show that the number of extraneous scaffold elements is bounded by the number of interesting domain elements. This favorable result ensures that the output-sensitive runtime of such algorithms will depend only on the size of the domain mesh. This same framework shows that a 3D domain volume mesh is asymptotically the same size as a 2D boundary surface mesh, which has deep ramifications for the development of surface reconstruction and surface meshing algorithms.

An important characteristic of a domain to be meshed is the spread ($\Delta$), the ratio of the largest to smallest input feature. I propose the first runtime-optimal meshing algorithm that does not depend on $\Delta$. Previous optimal meshing algorithms work only for inputs with a constant $\log \Delta$ or a constant $\log\log \Delta$ (equivalent to fixed-length integer or floating-point models). Essential to the algorithm is a new eager point-location strategy. Novel low-entropy sorting techniques are developed for bucketing uninserted points. The algorithm also relies on centervertices, a new geometric construction related to centerpoints.

My research will also address related open questions on size-optimality for meshing algorithms.

Thesis Committee:
Gary L. Miller, Chair
Anupam Gupta
Daniel Sleator
Noel Walkington
Tamal Dey, Computer Science and Engineering, Ohio State

Title: Lehman Matrices
Speaker: Gerard Cornuejols
When: February 7, 12:30-13:30
Where: Porter Hall 125B

Abstract:
A pair of square $0,1$ matrices $A,B$ such that $AB^T=E+kI$ (where $E$ is the $n \times n$ matrix of all 1s and $k$ is a positive integer) are called {\em Lehman matrices}. These matrices figure prominently in Lehman’s seminal theorem on minimally nonideal matrices. There are two choices of $k$ for which this matrix equation is known to have infinite families of solutions. When $n=k^2+k+1$ and $A=B$, we get point-line incidence matrices of finite projective planes, which have been widely studied in the literature. The other case occurs when $k=1$ and $n$ is arbitrary, but very little is known in this case. This talk discusses this class of Lehman matrices and classifies them according to their similarity to circulant matrices. The work is joint with Betrand Guenin and Levent Tuncel.

Who: Mohit Singh
Where: NSH 1507
When: 2008-02-06 Noon

Title: Iterative Methods in Combinatorial Optimization

Abstract:

Linear programming has been a successful tool in combinatorial optimization to achieve polynomial time algorithms for problems in P and also to achieve good approximation algorithms for problems which are NP-hard. We demonstrate that iterative methods give a general framework to analyze linear programming formulations of combinatorial optimization problems. We show that iterative methods are well-suited for problems in P and lead to new proofs of integrality of linear programming formulations for various problems in P. This understanding provides us the basic groundwork to address various problems that are NP-hard and to achieve good approximation algorithms.

In this talk, we focus on degree bounded network design problems. The most well-studied problem in this class is the Minimum Bounded Degree Spanning Tree problem. We present a polynomial time algorithm that returns a spanning tree of optimal cost and the degree of any vertex is one more than its degree bound. This generalizes a result of Furer and Raghavachari to weighted graphs, and thus settles a 15-year-old conjecture of Goemans affirmatively. This is essentially the best possible result for this problem. For degree constrained versions of more general network design problems, we obtain first additive approximation algorithms using the iterative method. These results add to a rather small list of combinatorial optimization problems which have an additive approximation algorithm and show that iterative methods provide a unifying framework for solving large class of constrained network design problems.

Friday February 1st, 2008
11:00AM
1305 Newell-Simon Hall

Speaker: Ivet Bahar (University of Pittsburgh)

TITLE: Supramolecular Machinery: Insights from Elastic Network Models

ABSTRACT:

Many proteins function as molecular machines. Understanding the principles that control the machinery of biomolecular systems can be a challenge due to the involvement of multiple subunits and cooperative interactions manifested by allosteric changes in conformations beyond the range of atomic simulations. We have developed and utilized low resolution models to explore the collective dynamics of such complex systems, and to bridge structure and function, through the paradigm structure-encodes-dynamics-encodes-function. The elastic network models and methods we introduced to this aim have found utility in many applications and have helped us gain insights into the intrinsic, structure-encoded ability of native structures to energetically favor the reconfigurations between functional substates. An overview of these recent progresses will be presented, along with the application to a few systems.

Friday February 1st, 2008
Wean Hall 7220
3:30pm

Speaker: Jiri Sgall (Academy of Sciences of the Czech Republic)

Title: Preemptive Online Scheduling: Optimal Algorithms for All Speeds

Abstract:

In this talk, we describe an optimal online algorithm for a large class of scheduling problems.

We are interested in an online version of the classical problem of preemptive scheduling on uniformly related machines. We are given machines with different speeds and a sequence of jobs, each described by its processing time (length). The objective is to find a schedule of all jobs in which the maximal completion time (makespan) is minimized. In the preemptive version, each job may be divided into several pieces, which can be assigned to different machines in disjoint time slots. In the online problem, jobs arrive one-by-one and we need to assign each incoming job to some time slots on some machines, without any knowledge of the jobs that arrive later.

Our new algorithm is optimal for any fixed combination of speeds of the machine, and thus our results subsume all the previous work on various special cases. The algorithm is deterministic, yet it is optimal even among all randomized algorithms. Together with previous results and a new lower bound it follows that the overall competitive ratio of this optimal algorithm is between $2.054$ and $e \approx 2.718$. Finally, our algorithm can be extended into various semi-online settings, where some partial information about the input instance is known, e.g., the maximal processing time or the sum of all the processing times.

Title: On conditional sorting
Speaker: Victor Grinberg
When: January 31, 12:30-13:30
Where: Porter Hall 125B

Abstract:

Let P be a finite poset and s(P) the minimal number of pairwise comparisons, which will suffice to sort P in the worst case. A sorting procedure can be viewed as a game, where a PLAYER chooses a pair of non-comparable elements, and an ORACLE responds which one is smaller. The game ends when a linear order is achieved. The PLAYER tries to minimize the number of questions, the ORACLE — to maximize it. s(P) is the price of this game. Conditional sorting is a generalization of this (conventional) sorting. In it, the ORACLE is restricted in his answers: they must comply with a given poset Q that is an extension of P. Let s(P/Q) be the price of this game. Clearly, s(P/P)=s(P) and s(Q){\leq} s(P/Q){\leq} s(P). Conditional sorting can be used to obtain nontrivial lower bounds for conventional sorting (an old idea of D. Knuth). For the case when P is an extension of 2 chains (merging), we demonstrate an intimate connection between conditional and conventional sorting. In this case, for any extension Q an “intermediate” poset R can be effectively found such that s(P/Q)=s(R).

Date: 2008-01-30 12:00
Place: NSH 1507
Speaker: Shuheng Zhou
Title: High Dimensional Sparse Regression and Clustering

Abstract:

I will talk about two “high-dimensional” problems: sparse high dimensional regression and learning mixtures of discrete product distributions using both graph-based and spectral techniques.

In high-dimensional regression, the sparse object is a vector $\beta$ in a linear system: $Y = X \beta + \epsilon$, where $X$ is $n \times p$ matrix such that $n < < p$, $\beta \in \R^{p}$ and $\epsilon \in \R^n$ consists of i.i.d. random noise. In the classical setting, this problem is ill-imposed for $p > n$ even for the case when $\epsilon =0$; however, when the vector $\beta$ is sparse, one can recover an empirical $\hat \beta$ that is consistent in terms of its support with true $\beta$, as shown in Zhao and Yu 07, and Wainwright 06.

John Lafferty, Larry Wasserman and I studied the regression problem under the setting that the original $n$ input variables encoded in $(X, Y)$ are compressed by a random Gaussian ensemble $\phi \in \R^{m \times n}$ to $m$ examples in $p$ dimensions, where $m << n$ or $p$. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We established sufficient mutual incoherence conditions on $X$, under which a sparse linear model can be successfully recovered from the compressed data. We characterized the number of random projections that are required for $\ell_1$-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence”. In addition, we showed that $\ell_1$-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence” by Greenshtein and Ritov 2004. Finally, we established upper bounds on the mutual information between the compressed and uncompressed data that decay to zero.

In the second problem, I will describe the problem of partitioning a small data sample drawn from a mixture of $k$ product distributions; each dimension corresponds to a Bernoulli random variable with a certain frequency. We are interested in the case that individual features are of low average quality, and we want to use as few of them as possible to correctly partition the sample. I will describe results on this problem obtained using both graph-based and spectral techniques. This is based on a result in my thesis, and a joint work with Avrim Blum, Amin Coja-Oghlan and Alan Frieze.

I will not be able to go into great details of proofs, but I will give some high level idea why things work out they way they are, and describe some open problems.

Friday January 25th
3:30pm
Wean Hall 7220

Sofya Raskhodnikova, Penn State

Title: Approximating String Compressibility and the Distribution Support Size
Abstract:

Imagine having to choose between a few compression schemes to compress a very long file. Before deciding on the scheme, you might want to obtain a rough estimate of how well each scheme performs on your file. Another motivating scenario comes from areas, such as data-mining, natural language processing and genomics, where compression schemes are used as tools for measuring properties of strings such as similarity and entropy. In these applications, one typically needs only the length of the compressed version of a file, not the output itself.

In this talk, we consider the question of approximating compressibility of a string with respect to a fixed compression scheme, in sublinear time. We will concentrate on the run-length encoding and a version of Lempel-Ziv as our example compression schemes. We present algorithms and lower bounds for approximating compressibility with respect to both schemes.

The second problem we consider is approximating the support size of a distribution from a small number of samples, when each element in the distribution’s support appears with probability at least 1/n. Our motivation is tight relationship between this problem and estimating compressibility with respect to Lempel-Ziv. However, estimating distribution support size is also considered in other areas of computer science along with an important variant, estimating the number of distinct elements in a sequence of length n.

For all three problems (distribution support, distinct elements and Lempel-Ziv compressibility), we prove a nearly linear (in n) lower bound on the query complexity, applicable even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, X and Y, with very different expectations and the following condition on the first k moments: E[X]/E[Y] = E[X2]/E{Y2] = … = E[X^k]/E[Y^k]. Our lower bound method is also applicable to other problems. In particular, it gives a new lower bound for the sample complexity of approximating the entropy of a distribution.

Based on joint work with Dana Ron, Ronitt Rubinfeld and Adam Smith, published in RANDOM 2007 and joint work with Dana Ron, Amir Shpilka and Adam Smith, published in FOCS 2007

Title: Packing cubes in a torus
Speaker: Tom Bohman
When: January 24, 12:30-13:30
Where: Doherty Hall 4303
Abstract:

We consider the following packing problem. How many d-dimensional cubes of side length 2 can we pack into a d-dimensional torus of odd side length? In this talk we present some of the best known constructions, detail the connection between this packing problem and the problem of determining the Shannon capacities of graphs, and discuss some recent techniques for establishing upper bounds. Some related open questions on graph products will also be discussed.

Who: Nina Balcan
Where: NSH 1507
When: 2008-01-22 Noon

Title: A Theory of Similarity Functions for Learning and Clustering

Abstract:

Kernel methods have proven to be extremely powerful tools in machine learning. They perform well in many applications, and there is also a well-developed theory of sufficient conditions for a kernel to be useful for a given learning problem. However, while a kernel can be thought of as just a pairwise similarity function that satisfies additional mathematical properties, this theory requires viewing kernels as implicit (and often difficult to characterize) maps into high-dimensional spaces. In this work we develop an alternative theory of learning with more general similarity functions, which requires neither reference to implicit spaces, nor the function to be positive semi-definite. Our results strictly generalize the standard theory, and any good kernel function under the usual definition can be shown to also be a good similarity function under our definition.

We then show how our framework can also be applied to clustering: multi-way classification from purely unlabeled data. In particular, using this perspective we develop a new model that directly addresses the fundamental question of what kind of information a clustering algorithm needs in order to produce a highly accurate partition of the data. Our work can be viewed as an approach to defining a PAC model for clustering with non-interactive feedback.

This talk is based on work joint with Avrim Blum and Santosh Vempala.