Archive for January, 2008

Theory Seminar 2008-02-01

Friday February 1st, 2008
11:00AM
1305 Newell-Simon Hall

Speaker: Ivet Bahar (University of Pittsburgh)

TITLE: Supramolecular Machinery: Insights from Elastic Network Models

ABSTRACT:

Many proteins function as molecular machines. Understanding the principles that control the machinery of biomolecular systems can be a challenge due to the involvement of multiple subunits and cooperative interactions manifested by allosteric changes in conformations beyond the range of atomic simulations. We have developed and utilized low resolution models to explore the collective dynamics of such complex systems, and to bridge structure and function, through the paradigm structure-encodes-dynamics-encodes-function. The elastic network models and methods we introduced to this aim have found utility in many applications and have helped us gain insights into the intrinsic, structure-encoded ability of native structures to energetically favor the reconfigurations between functional substates. An overview of these recent progresses will be presented, along with the application to a few systems.

Comments

Theory Seminar 2008-02-01

Friday February 1st, 2008
Wean Hall 7220
3:30pm

Speaker: Jiri Sgall (Academy of Sciences of the Czech Republic)

Title: Preemptive Online Scheduling: Optimal Algorithms for All Speeds

Abstract:

In this talk, we describe an optimal online algorithm for a large class of scheduling problems.

We are interested in an online version of the classical problem of preemptive scheduling on uniformly related machines. We are given machines with different speeds and a sequence of jobs, each described by its processing time (length). The objective is to find a schedule of all jobs in which the maximal completion time (makespan) is minimized. In the preemptive version, each job may be divided into several pieces, which can be assigned to different machines in disjoint time slots. In the online problem, jobs arrive one-by-one and we need to assign each incoming job to some time slots on some machines, without any knowledge of the jobs that arrive later.

Our new algorithm is optimal for any fixed combination of speeds of the machine, and thus our results subsume all the previous work on various special cases. The algorithm is deterministic, yet it is optimal even among all randomized algorithms. Together with previous results and a new lower bound it follows that the overall competitive ratio of this optimal algorithm is between $2.054$ and $e \approx 2.718$. Finally, our algorithm can be extended into various semi-online settings, where some partial information about the input instance is known, e.g., the maximal processing time or the sum of all the processing times.

Comments

ACO Seminar 2008-01-31

Title: On conditional sorting
Speaker: Victor Grinberg
When: January 31, 12:30-13:30
Where: Porter Hall 125B

Abstract:

Let P be a finite poset and s(P) the minimal number of pairwise comparisons, which will suffice to sort P in the worst case. A sorting procedure can be viewed as a game, where a PLAYER chooses a pair of non-comparable elements, and an ORACLE responds which one is smaller. The game ends when a linear order is achieved. The PLAYER tries to minimize the number of questions, the ORACLE — to maximize it. s(P) is the price of this game. Conditional sorting is a generalization of this (conventional) sorting. In it, the ORACLE is restricted in his answers: they must comply with a given poset Q that is an extension of P. Let s(P/Q) be the price of this game. Clearly, s(P/P)=s(P) and s(Q){\leq} s(P/Q){\leq} s(P). Conditional sorting can be used to obtain nontrivial lower bounds for conventional sorting (an old idea of D. Knuth). For the case when P is an extension of 2 chains (merging), we demonstrate an intimate connection between conditional and conventional sorting. In this case, for any extension Q an “intermediate” poset R can be effectively found such that s(P/Q)=s(R).

Comments

Theory Lunch 2008-01-30

Date: 2008-01-30 12:00
Place: NSH 1507
Speaker: Shuheng Zhou
Title: High Dimensional Sparse Regression and Clustering

Abstract:

I will talk about two “high-dimensional” problems: sparse high dimensional regression and learning mixtures of discrete product distributions using both graph-based and spectral techniques.

In high-dimensional regression, the sparse object is a vector $\beta$ in a linear system: $Y = X \beta + \epsilon$, where $X$ is $n \times p$ matrix such that $n < < p$, $\beta \in \R^{p}$ and $\epsilon \in \R^n$ consists of i.i.d. random noise. In the classical setting, this problem is ill-imposed for $p > n$ even for the case when $\epsilon =0$; however, when the vector $\beta$ is sparse, one can recover an empirical $\hat \beta$ that is consistent in terms of its support with true $\beta$, as shown in Zhao and Yu 07, and Wainwright 06.

John Lafferty, Larry Wasserman and I studied the regression problem under the setting that the original $n$ input variables encoded in $(X, Y)$ are compressed by a random Gaussian ensemble $\phi \in \R^{m \times n}$ to $m$ examples in $p$ dimensions, where $m << n$ or $p$. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We established sufficient mutual incoherence conditions on $X$, under which a sparse linear model can be successfully recovered from the compressed data. We characterized the number of random projections that are required for $\ell_1$-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence'’. In addition, we showed that $\ell_1$-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence'’ by Greenshtein and Ritov 2004. Finally, we established upper bounds on the mutual information between the compressed and uncompressed data that decay to zero.

In the second problem, I will describe the problem of partitioning a small data sample drawn from a mixture of $k$ product distributions; each dimension corresponds to a Bernoulli random variable with a certain frequency. We are interested in the case that individual features are of low average quality, and we want to use as few of them as possible to correctly partition the sample. I will describe results on this problem obtained using both graph-based and spectral techniques. This is based on a result in my thesis, and a joint work with Avrim Blum, Amin Coja-Oghlan and Alan Frieze.

I will not be able to go into great details of proofs, but I will give some high level idea why things work out they way they are, and describe some open problems.

Comments

Theory Seminar 2008-01-25

Friday January 25th
3:30pm
Wean Hall 7220

Sofya Raskhodnikova, Penn State

Title: Approximating String Compressibility and the Distribution Support Size
Abstract:

Imagine having to choose between a few compression schemes to compress a very long file. Before deciding on the scheme, you might want to obtain a rough estimate of how well each scheme performs on your file. Another motivating scenario comes from areas, such as data-mining, natural language processing and genomics, where compression schemes are used as tools for measuring properties of strings such as similarity and entropy. In these applications, one typically needs only the length of the compressed version of a file, not the output itself.

In this talk, we consider the question of approximating compressibility of a string with respect to a fixed compression scheme, in sublinear time. We will concentrate on the run-length encoding and a version of Lempel-Ziv as our example compression schemes. We present algorithms and lower bounds for approximating compressibility with respect to both schemes.

The second problem we consider is approximating the support size of a distribution from a small number of samples, when each element in the distribution’s support appears with probability at least 1/n. Our motivation is tight relationship between this problem and estimating compressibility with respect to Lempel-Ziv. However, estimating distribution support size is also considered in other areas of computer science along with an important variant, estimating the number of distinct elements in a sequence of length n.

For all three problems (distribution support, distinct elements and Lempel-Ziv compressibility), we prove a nearly linear (in n) lower bound on the query complexity, applicable even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, X and Y, with very different expectations and the following condition on the first k moments: E[X]/E[Y] = E[X2]/E{Y2] = … = E[X^k]/E[Y^k]. Our lower bound method is also applicable to other problems. In particular, it gives a new lower bound for the sample complexity of approximating the entropy of a distribution.

Based on joint work with Dana Ron, Ronitt Rubinfeld and Adam Smith, published in RANDOM 2007 and joint work with Dana Ron, Amir Shpilka and Adam Smith, published in FOCS 2007

Comments

ACO Seminar 2008-01-24

Title: Packing cubes in a torus
Speaker: Tom Bohman
When: January 24, 12:30-13:30
Where: Doherty Hall 4303
Abstract:

We consider the following packing problem. How many d-dimensional cubes of side length 2 can we pack into a d-dimensional torus of odd side length? In this talk we present some of the best known constructions, detail the connection between this packing problem and the problem of determining the Shannon capacities of graphs, and discuss some recent techniques for establishing upper bounds. Some related open questions on graph products will also be discussed.

Comments

Theory Lunch 2008-01-23

Who: Nina Balcan
Where: NSH 1507
When: 2008-01-22 Noon

Title: A Theory of Similarity Functions for Learning and Clustering

Abstract:

Kernel methods have proven to be extremely powerful tools in machine learning. They perform well in many applications, and there is also a well-developed theory of sufficient conditions for a kernel to be useful for a given learning problem. However, while a kernel can be thought of as just a pairwise similarity function that satisfies additional mathematical properties, this theory requires viewing kernels as implicit (and often difficult to characterize) maps into high-dimensional spaces. In this work we develop an alternative theory of learning with more general similarity functions, which requires neither reference to implicit spaces, nor the function to be positive semi-definite. Our results strictly generalize the standard theory, and any good kernel function under the usual definition can be shown to also be a good similarity function under our definition.

We then show how our framework can also be applied to clustering: multi-way classification from purely unlabeled data. In particular, using this perspective we develop a new model that directly addresses the fundamental question of what kind of information a clustering algorithm needs in order to produce a highly accurate partition of the data. Our work can be viewed as an approach to defining a PAC model for clustering with non-interactive feedback.

This talk is based on work joint with Avrim Blum and Santosh Vempala.

Comments

SumatraPDF 0.8 Released

SumatraPDF has just become a very strong contender as an everyday PDF viewer on Windows. The reason: it finally supports searching and bookmarks in version 0.8.

Its interface is still a bit over simplistic, but it’s already quite good for everyday uses. At the moment, the keyboard shortcuts are not very well-documented. Looking at the source code, I noticed some missing entries that are noteworthy:

  • The search interface: / starts a search, so as Ctrl-F. Ctrl-G and F3 is next, and Shift-F3 is previous. Note that the toolbar doesn’t have to be visible for the search keys to work.
  • Ctrl-L toggles full screen mode.
  • F12 toggles the bookmark panel.

(Update: Also note that Ctrl-Drag will select a region and copy its text to clipboard.)

I don’t have a list of the numerous PDF viewers on Windows here: a google search will get you many hits. Personally, I have “switched” to SumatraPDF on my browsers a few months ago, while spending most of my “PDF time” in PDF-XChange Viewer (see this post about its annotation capability).

I still keep Adobe Reader 8 as the default PDF handler: in terms of printing, anti-aliasing and javascript support, I have yet to see any competition to Adobe. However, now that PDF is in the process of becoming an ISO standard and FSF is declaring the GNU PDF project as one of the few high priority projects, hopefully we will see some change in a few years (see this linux.com article).

I should note that SumatraPDF 0.8 is still missing hyperlink support. Also, none of Windows PDF viewers seems to support pdfsync (round-tripping between PDF and TeX source). I know the former is being worked on. The latter, hmm…

P.S. Common PDF viewers on Windows also include Ghostview and Foxit Reader.

Comments (2)