Machine Learning Seminars

5/8/2012 at 12:30
Linguistic Structure Prediction with AD³

Speaker: Noah Smith, Carnegie Mellon University
Location: Gates Commons

In this talk, I will present AD³ (Alternating Directions Dual Decomposition), an algorithm for approximate MAP inference in loopy graphical models with discrete random variables, including structured prediction problems. AD³ is simple to implement and well-suited to problems with hard constraints expressed in first-order logic. It often finds the exact MAP solution, giving a certificate when it does; when it doesn't, it can be embedded within an exact branch and bound technique. I'll show experimental results on two natural language processing tasks, dependency parsing and frame-semantic parsing. This work was done in collaboration with André Martins, Dipanjan Das, Pedro Aguiar, Mário Figueiredo, and Eric Xing.

4/24/2012 at 12:30
Decision making in the primate brain: A model based on POMDPs and reinforcement learning

Speaker: Raj Rao, University of Washington
Location: CSE 305

How does the brain learn to make decisions based on noisy sensory information and incomplete knowledge of the world? In this talk, I will sketch a neural model of action selection and decision making based on the general framework of partially observable Markov decision processes (POMDPs). The model postulates that actions are selected so as to maximize expected cumulative reward, where rewards can be external (e.g., food) or internal (e.g., penalty for delay). Action selection is based on the posterior distribution over states (the "belief" state) computed using Bayesian inference. A reinforcement learning algorithm known as temporal difference (TD) learning maximizes the expected reward. I will describe how such a model provides a unified framework for explaining recent experimental results on decision making in the primate brain. The resulting neural architecture posits an active role for the neocortex in Bayesian inference while ascribing a role to the basal ganglia in value computation and action selection. The model suggests an important role for interactions between the neocortex and the basal ganglia in learning the mapping between probabilistic sensory representations and actions that maximize rewards. (Joint work with Yanping Huang and Abe Friesen)

4/10/2012 at 12:30
Spectral Clustering and Higher-Order Cheeger Inequalities

Speaker: James Lee, Dept. Computer Science & Eng., Univ. Washington
Location: Gates Commons

I will try to tell a story of spectral clustering, from heuristics, to experiments, to a new algorithm and its formal analysis. In an interesting twist, the same tools used to analyze spectral partitioning yield new ways to mine the structure of noisy systems of linear equations modulo a prime. This, in turn, allows us to mount a spectral attack on one of the most prominent open problems in complexity theory. [This is based partly on joint work with Shayan Oveis-Gharan and Luca Trevisan.]

3/6/2012 at 12:30
Gaussian Processes for Approximating Complex Models

Speaker: Murali Haran, Pennsylvania State University
Location: EEB 037

This is a tutorial on the use of Gaussian process to approximate complex models often used in modeling phenomena like climate, disease dynamics, etc. The tutorial will begin with an introduction to Gaussian processes followed by a discussion of how Gaussian processes can be used for both computer model emulation (approximation) and calibration.

2/28/2012 at 12:30
Causality

Speaker: Thomas Richardon, University of Washington
Location: EEB 037

As these talks are intended as overviews, time permitting, I plan to give two mini-tutorials: The first will be on the idea of counterfactuals/potential outcomes. Basically answering the question: how does data obtained from a simple randomized experiment differ from that obtained from an observational study, and how does that weaken inferences that can be obtained. The second will require assume a little bit of background on Bayesian networks and will answer the question: if we obtain data from a subset of the variables in a causal Bayesian network, which causal effects are identified and how can they be computed efficiently.

2/21/2012 at 12:30
Counting and Sampling Solutions of Combinatorial Problems

Speaker: Ashish Sabharwal, IBM T,J, Watson
Location: EEB 037

This tutorial presents a survey of algorithms that are used for counting and sampling of SAT problems.

2/14/2012 at 12:30
Knowledge Extraction for Biomedical Text

Speaker: Hoifung Poon, Microsoft Research
Location: EEB 037

This tutorial summarizes the literature on knowledge extraction in scientific domains, such as biomedical texts.

2/7/2012 at 12:30
Machine Learning for Information Retrieval

Speaker: Niranjan Balasubramanian, University of Washington
Location: EEB 037

This tutorial discusses the machine learning techniques popular in the information retrieval subcommunity such as ranking techniques, pagerank, etc.

1/31/2012 at 12:30
Trajectory Optimization with Differential Dynamic Programming

Speaker: Tom Erez, University of Washington
Location: EEB 037

This tutorial gives an introduction to the control theory, in particular, discussing the trajectory optimization techniques.

1/24/2012 at 12:30
Latent Variable Models of Lexical Semantics

Speaker: Alan Ritter, University of Washington
Location: EEB 037

This tutorial discusses probabilistic models popular in the NLP lexical semantics community.

1/17/2012 at 12:30
Latent Factor Models for Relational and Network Data

Speaker: Peter Hoff, University of Washington
Location: EEB 037

This tutorial discusses probabilistic models for social and other network data.

1/10/2012 at 12:30
Submodular Functions and Active Learning

Speaker: Andrew Guillory
Location: EEB 037

This tutorial presents a brief survey of active learning, submodular functions, and the interesting algorithms and analyses at their intersection. Minimal background knowledge is assumed, and emphasis is placed on open problems and gaps between theory and practice. Slides at: http://ml.cs.washington.edu/www/media/presentations/submodularity_tutorial.pdf

11/29/2011 at 12:30
Machine Learning and Big Data Analysis

Speaker: Alice Zheng
Location: CSE 305

Our world is becoming more data driven. With the spread of ubiquitous sensors, network connectivity, and massive storage capabilities, we are able to collect more and more data. But our computation and analysis capabilities have not increased at a comparable rate. Computer scientists are facing looming questions such as "How do we deal with the massive amounts of data we are collecting? How can we extract value out of data?" A sub-question relevant to machine learning researchers is "what role will machine learning and data mining play?" Through a survey of current sources of Big Data and analysis workflow patterns, this talk aims to shed light on the latter question.

11/22/2011 at 12:30
Entire Relaxation Path for Maximum Entropy Models

Speaker: Yoram Singer
Location: CSE 305

We describe a relaxed and generalized notion of maximum entropy problems for multinomial distributions. By introducing a simple re-parametrization we are able to derive an efficient homotopy tracking scheme for the entire relaxation path using linear space and quadratic time. We also show that the Legendre dual of the relaxed maximum entropy problem is the task of finding the maximum likelihood estimator for an exponential distribution with L1 regularization. Hence, our solution can be used for problems such as language modeling with sparse parameter representation.

11/15/2011 at 12:30
Learning Semantic Parsers for More Languages and with Less Supervision

Speaker: Luke Zettlemoyer
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)

Recent work has demonstrated effective learning algorithms for a variety of semantic parsing problems, where the goal is to automatically recover the underlying meaning of input sentences. Although these algorithms can work well, there is still a large cost in annotating data and gathering other language-specific resources for each new application. In this talk, I will describe efforts to address these challenges by developing scalable, probabilistic CCG grammar induction algorithms. I will present recent work on methods that incorporate new notions of lexical generalization, thereby enabling effective learning for a variety of different natural languages and formal meaning representations. I will also describe a new approach for learning semantic parsers from conversational data, which does not require any manual annotation of sentence meaning. Finally, I will sketch future directions, including our recurring focus on building scalable learning techniques while attempting to minimize the application-specific engineering effort. Joint work with Yoav Artzi, Tom Kwiatkowski, Sharon Goldwater, and Mark Steedman

11/1/2011 at 12:30
Speech Recognition with Segmental Conditional Random Fields

Speaker: Geoff Zweig, Microsoft Research
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)

Segmental Conditional Random Fields (SCRFs) are a mathematical model in which an observation sequence is segmented into labeled chunks. For example, an audio sequence may be segmented into words, or a video segmented into scenes. These models have two key characteristics. First, in contrast to frame-based systems, long-span features can be used - features which relate an entire span of observations to a label, and which may make reference to precisely defined segment boundaries. Second, the models are log-linear in form, and allow for the convenient integration of features derived from heterogeneous information sources. This talk will provide an overview of SCRFs, and in particular their use in speech recognition. First, we describe the basic model. Then, we present results in which SCRFs are used to add new information sources to improve the performance of a state-of-the-art system. Finally, we present initial results in which the framework is used to do recognition from the ground up.

10/18/2011 at 12:30
Penalized Methods for Estimation of High Dimensional Directed (cyclic) Graphs

Speaker: Ali Shojaie, Department of Biostatistics, University of Washington
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)

In graphical models theory, directed edges often represent causal relationships among random variables and discovering such relationships is of main interest in many areas of application, including biological and social problems. In addition to challenges in estimating directed graphs due to high dimensionality of the space and inadequacy of observational data for estimating the direction of causation, presence of cycles in graphs further complicates application of statistical methodologies to estimation of directed graphs. I will start by describing an efficient (polynomial time) algorithm, based on penalized likelihood methods, for estimation of directed graphs with known topological ordering, and discuss extensions of this methodology for estimation of general directed graphs, using (i) temporal observations, and (ii) joint perturbation screens and steady-state observations of the systems. Asymptotic properties of the proposed estimators and their applications in the context of gene regulatory networks are further discussed.

10/4/2011 at 12:30
Direct Loss Minimization for Structural Labeling with Applications to Speech Recognition

Speaker: Joseph Keshet, TTI-Chicago
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)

In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimizes the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or word error rate in speech recognition. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. In the first part of the talk we present a theorem stating that a certain perceptron-like learning rule, involving features vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. Empirical results on phonetic alignment are presented surpassing all previously reported results on this problem on the TIMIT corpus. In the second part of the talk, we describe a new algorithm which aims to minimize the regularized task loss. We state a PAC-Bayesian generalization bound, which gives an upper-bound on the expected task loss in terms of the empirical task loss. Our algorithm is derived by finding the gradient of the PAC-Bayesian bound and minimizing it by stochastic gradient descent. The resulting algorithm is iterative and easy to implement. Experiments on phoneme recognition on the TIMIT corpus, when the task loss is chosen to be the phoneme error rate, show that our method achieves the lowest phoneme error rate compared to other discriminative and generative models with the same expressive power. Joint work with David McAllester and Tamir Hazan.

6/28/2011 at 12:30
Learning with Exploration

Speaker: John Langford, Yahoo! Research
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)

In the real world, we are constantly presented with problems where we first see some information, then make a choice, and then get feedback about that choice. Direct examples of this occur in medical testing and online ad auctions. The standard online supervised learning setting is almost the same, except that it provides feedback about all choices. With this weaker form of feedback provided by the real world, how can we best learn?

5/31/2011 at 12:30
Convex Relaxation and High-Dimensional Matrices

Speaker: Martin Wainwright (UC Berkeley)
Location: EE 403 (community lunch starts at NOON, Talk starts at 12:30pm)

Problems that require estimating high-dimensional matrices from noisy observations arise frequently in statistics and machine learning. Examples include dimensionality reduction methods (e.g., principal components and canonical correlation), collaborative filtering and matrix completion (e.g., Netflix and other recommender systems), system identification of state-space models, and graphical model learning. When the sample size is less than the matrix dimensions, all of these problems are ill-posed, so that some type of structure is required in order to obtain interesting results. In recent years, relaxations based on the nuclear norm and other types of convex matrix regularizers have become popular. By framing a broad class of problems as special cases of matrix regression, we present a single theoretical result that provides guarantees on the accuracy of such convex relaxations. Two properties of the convex relaxation turn out to be essential: decomposability of the regularizer, and restricted strong convexity of the loss function. Our general result can be specialized to obtain various explicit and non-asymptotic bounds, among them sharp rates for noisy forms of matrix sketching, completion, and decomposition. Based on joint work with Sahand Negahban.

5/17/2011 at 12:30
Learning Transformations for Search and Adaptation

Speaker: Brian Kulis (UC Berkeley)
Location: EE 403 (community lunch (Indian food) starts at noon)

There are several scenarios where one needs to construct or learn a transformation of data from one space to another. One example is auto-tagging of images, where an image is mapped from image space to text space to find the most relevant words to describe an image. Another example is metric learning, where one maps all of the data from one space to another such that standard distance functions in the mapped space better capture the desired distance structure of the data. In this talk, I will describe some of my recent work in analyzing and developing techniques for the transformation learning problem. I will describe a general formulation for learning linear transformations, and then show that this formulation may be efficiently adapted to learn non-linear transformations through kernelization. Crucially, this kernelization yields transformations that can be applied inductively to new data, and the analysis generalizes a significant amount of previous research in this area. I will then describe some specific instantiations of this formulation for the problems of metric learning and domain adaptation. I will show some results on a variety of vision tasks, and will discuss connections to related work on online learning and hashing techniques.

5/10/2011 at 12:30
Language: How Far Can We Get With Unlabeled Data?

Speaker: Chris Burges (Microsoft Research)
Location: EE 403 (community lunch starts at noon (now with a lactose-free option))

Unlabeled text data is plentiful and computing power continues to grow exponentially. The accuracy difference between competing supervised learning methods tends to decline as more data becomes available, indicating that more data can be more important than cleverer algorithms. "AskMSR" showed that leveraging unlabeled textual Web data can do an impressive job at question-answering. Recent work has shown that vector space models combined with neural nets can solve standard natural language problems well, and very recent work using recurrent neural nets to predict sequences of characters gives very intriguing results. Inspired by all of this, we are investigating the following question: what is the simplest vector space model we can build that (1) learns from unlabeled data, (2) represents both words and sentences as vectors, (3) is generative (it can efficiently map sentence vectors to text), (4) is as simple as possible but can still do something interesting? This work is still in early development.

4/19/2011 at 12:30
Sum-Product Networks: A Principled Approach to Deep Learning

Speaker: Pedro Domingos (UW CS)
Location: EE 403 (community lunch starts at noon)

The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with previous deep architectures. For example, SPNs perform face completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex. (Joint work with Hoifung Poon.)

4/5/2011 at 12:30
Deep Convex Network: Architectures and Parallelizable Learning

Speaker: Li Deng (Microsoft Researcher)
Location: EE 403 (community lunch starts at noon)

We recently developed a DBN-HMM (Deep Belief Net/Hidden Markov Model) architecture for large-scale speech recognition with three steps in learning and decoding: 1) Stacked RBM pre-training; 2) Stochastic gradient descent back-propagation in fine tuning of DBN; and 3) Tying DBN weights in the phone-bound temporal dimension and using dynamic programming to arrive at the decision rule for speech recognition (Dahl, Yu, and Deng, ICASSP 2011 and IEEE Trans. Audio, Speech and Language Proc., 2011). While achieving remarkable success with this approach compared with the state-of-the-art in discriminatively trained HMM, we are faced with the seemingly insurmountable problem of scalability in dealing with virtually unlimited amount of speech training data in all sorts of acoustic environments. The main difficulty comes from the stochastic gradient learning algorithm in the fine tuning step of DBN. This step of the DBN-HMM learning algorithm is extremely difficult to parallelize over many machines with practical advantages, after some detailed exploration. To overcome the scalability challenge, we have recently developed a novel deep learning architecture, which we call deep convex network (DCN). The learning algorithm in the DCN is a convex one in each layer of the DCN. And the learning algorithm is batch-based instead of stochastic, naturally lending it amenable for parallelization. The construction of the DCN is related to our earlier work on building the deep hidden CRF (Yu, Wang, Deng, IEEE J. Selected Topics in Signal Proc., Dec 2010), but the learning algorithm is much simpler in each layer, enabling the use of hundreds of layers in the overall DCN architecture. We have empirically found that without doing global error fitting over all layers, the DCN can already achieve excellent discrimination results as long as a very deep architecture is built. One version of the algorithm contains a module that makes use of nonlinear random projection inspired by the work of extreme learning machine (Huang, 2006). While this module gives reasonably good results after it is embedded in the deep structure, we found that replacing it with the restricted Boltzmann machine (RBM) contributes to over 30% of error reduction in the image recognition (MNIST) experiment, achieving the state of the art performance in this task. More recent experiments on speech data will be discussed also.

3/9/2011 at 12:30
Some Recent Advances in Local Learning

Speaker: Maya Gupta, UW EE
Location: CSE 305

We discuss some recent advances in local learning. We describe and compare state-of-the-art local learning methods, including the local support vector machine, optimal weights for k-NN, local Bayesian discriminant analysis, and completely lazy learning. We highlight the importance of local learning when given similarity data. Experimental results include applications in computer vision, sonar, bioinformatics, and non-destructive evaluation.

2/23/2011 at 12:30
The Blessing and the Curse of the Multiplicative Updates

Speaker: Manfred K Warmuth, UC Santa Cruz
Location: CSE 305

Multiplicative updates multiply the parameters by nonnegative factors. These updates are motivated by a Maximum Entropy Principle and they are prevalent in evolutionary processes where the parameters are for example concentrations of species and the factors are survival rates. The simplest such update is Bayes rule and we give an in vitro selection algorithm for RNA strands that implements this rule in the test tube where each RNA strand represents a different model. In one liter of the RNA soup there are approximately 1020 different strands and therefore this is a rather high-dimensional implementation of Bayes rule. We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The ``blessing'' of these updates is that they learn very fast because the good parameters grow exponentially. However their ``curse'' is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature's methods parallel the ones developed in Machine Learning, but nature also has some additional tricks. This will be a high level talk. No background in online learning or evolutionary Biology will be required.

1/26/2011 at 12:30
A Framework for Feature Selection in Clustering

Speaker: Daniela Witten, UW Biostatistics Department
Location: CSE 305

We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters.

1/12/2011 at 12:30
The Convex Algebraic Geometry of Linear Inverse Problems

Speaker: Venkat Chandrasekaran, MIT
Location: CSE 305

Deducing the state or structure of a system from partial, noisy measurements is a fundamental task throughout the sciences and engineering. A commonly encountered difficulty that arises in such inverse problems is the very limited availability of data relative to the ambient dimension of the signal to be estimated. We analyze a class of ill-posed linear inverse problems in which the underlying model of interest has simple algebraic structure. Specifically the focus of this talk is on the setting in which we are given a small number of linear measurements of a simple model, which is formed by adding a small number of elements from some atomic set. Examples of such models include previously studied cases such as sparse signals and low-rank matrices, as well as others such as low-rank tensors, binary vectors, orthogonal matrices, and matrices formed as the sum of a few permutations. Inspired by the success of the L1-norm and nuclear-norm heuristics, we propose a general convex regularization framework to recover such simple models. We discuss general recovery guarantees for our approach, and describe several example applications. Thus our work extends the catalog of objects and structures (beyond sparse vectors and low-rank matrices) that can be recovered from limited information via tractable convex programming. Joint work with Benjamin Recht, Pablo Parrilo, and Alan Willsky.

11/30/2010 at 12:30
Learning with Permutations: Consensus Finding, Exponential Models, and Infinite Rankings

Speaker: Marina Meila, Department of Statistics, University of Washington
Location: Gates Commons (CSE 691)

This talk is concerned with summarizing -- by means of statistical models -- of data that expresses preferences. This data is typically a set of rankings of n items by a panel of experts; the simplest summary is the "consensus ranking", or the "centroid" of the set of rankings. Such problems appear in many tasks, ranging from combining voter preferences to boosting of search engines. We study the problem in its more general form of estimating a parametric model known as the Generalized Mallows (GM) model. I will present an exact estimation algorithm, non-polynomial in theory, but extremely effective in comparison with existing algorithms. From a statistical point of view, we show that the GM model is an exponential family, and introduce the conjugate prior for this model class. Then we introduce the infinite GM model, corresponding to "rankings" over an infinite set of items, and show that this model is both elegant and of practical significance. Finally, the talk will touch upon the subject of multimodal distributions and clustering. Joint work with: Harr Chen, Alnur Ali, Bhushan Mandhani, Le Bao, Kapil Phadnis, Arthur Patterson and Jeff Bilmes.

11/9/2010 at 12:30
Relax, Compensate and then Recover: A Theory of Anytime, Approximate Inference

Speaker: Adnan Darwiche, Computer Science Department, University of California, Los Angeles
Location: Gates Commons (CSE 691)

I will present in this talk a theory of anytime, approximate inference, which explains some of the most influential algorithms in probabilistic reasoning, yet is based on one of the older ideas in symbolic reasoning: Relaxations. According to this theory, the fundamental notion of approximation is that of "relaxing" logical constraints (equalities in particular) for the purpose of decomposing a problem into smaller pieces that can be solved independently. A second fundamental notion is that of "compensation", which calls for imposing weaker and, typically, probabilistic notions of equality to compensate for the relaxed equalities. The third fundamental notion of the presented theory is that of "recovery," where some of the relaxed equalities are incrementally recovered, based on an assessment of their impact on improving the quality of approximations. I will discuss how this theory subsumes one of the most influential algorithms in probabilistic inference: loopy belief propagation and some of its generalizations, and comment on its relevance to lifted probabilistic inference. I will also show how this theory has formed the basis of an award-winning solver at the third UAI inference competition in 2010. I will finally discuss the relationship between this theory and current developments in symbolic reasoning -- in particular, how advances in knowledge compilation can impact the state of the art in approximate (and exact) probabilistic inference.

10/26/2010 at 12:30
Unsupervised Inference of Genomic Structure from Multiple Functional Genomics Data Sets

Speaker: Bill Noble, Dept. Genome Sciences, University of Washington
Location: CSE 305

Sequence census methods such as ChIP-seq have begun to produce an unprecedented amount of genome-anchored data. Numerous techniques exist to analyze the results of single assays, but genomics research has lacked an integrative method to identify patterns from multiple experiments simultaneously, while taking full advantage of the high-resolution data now available. We have developed such a method, which employs a dynamic Bayesian network to discover joint patterns across different experiment types. We have applied this method to ENCODE chromatin data for the human chronic myeloid leukemia cell line K562. In an unsupervised fashion, we have identified patterns associated with transcription initiation and elongation, as well as transcriptional repression, finding the locations of genes without using DNA sequence, mRNA or multispecies conservation data. We have also successfully performed a semi-supervised identification of enhancer regions across the genome using a curated set of seed enhancers. In both the unsupervised and semi-supervised cases, the method identifies patterns that elucidate the relationships among open chromatin, various transcription factors and histone modifications.

10/12/2010 at 12:30
The Road from DBNs to Image Segmentation Is Paved with Submodularity

Speaker: Jeff Bilmes, Dept. Electrical Engineering, University of Washington
Location: Gates Commons (CSE 691)

In this talk, I'll discuss a journey taken by my research interests leading from dynamic graphical models to image segmentation. I'll start out discussing dynamic graphical models (DGMs) and their efficient inference. Entailed in performing inference in such models is the problem of identifying a vertex cut that determines the boundary of a critical periodic graphical component. This component is used when the model is expanded to an arbitrary length, and also determines a local junction tree segment used as part of this expansion. Depending on the boundary of this component, inference costs can change by orders of magnitude. There are many ways of judging the quality of the component without performing inference itself, including the standard vertex cut criterion. Most interestingly, however, is the case where the DGM may be sparse, and the quality is determined by an objective that takes into account this sparsity. It turns out that such an objective is submodular. Considering this, and when vertex cut is seen as standard graph cut, a new problem is naturally introduced that involves finding a minimal cut in a graph where edges are judged not by a sum of edge weights but instead by a submodular function, a problem we call cooperative cut. The remainder of the talk will provide a brief background on submodularity, cooperative cut, and various hardness and approximate solutions to this problem, and how it can be applied to image segmentation. In this latter application, results show significant improvements over the standard (node submodular) graph cut approaches, in particular for images that exhibit a "shrinking bias" problem and for those images with smooth illumination gradients. Joint work with Stefanie Jegelka, with earlier contributions from Chris Bartels.