Machine Learning Seminars
Speaker: Noah Smith, Carnegie Mellon University
Location: Gates Commons
In this talk, I will present AD³ (Alternating Directions Dual
Decomposition), an algorithm for approximate MAP inference in loopy
graphical models with discrete random variables, including structured
prediction problems. AD³ is simple to implement and well-suited to
problems with hard constraints expressed in first-order logic. It often
finds the exact MAP solution, giving a certificate when it does; when it
doesn't, it can be embedded within an exact branch and bound technique.
I'll show experimental results on two natural language processing tasks,
dependency parsing and frame-semantic parsing. This work was done in
collaboration with André Martins, Dipanjan Das, Pedro Aguiar, Mário
Figueiredo, and Eric Xing.
Speaker: Raj Rao, University of Washington
Location: CSE 305
How does the brain learn to make decisions based on noisy sensory
information and incomplete knowledge of the world? In this talk, I
will sketch a neural model of action selection and decision making
based on the general framework of partially observable Markov decision
processes (POMDPs). The model postulates that actions are selected so
as to maximize expected cumulative reward, where rewards can be
external (e.g., food) or internal (e.g., penalty for delay). Action
selection is based on the posterior distribution over states (the
"belief" state) computed using Bayesian inference. A reinforcement
learning algorithm known as temporal difference (TD) learning
maximizes the expected reward. I will describe how such a model
provides a unified framework for explaining recent experimental
results on decision making in the primate brain. The resulting neural
architecture posits an active role for the neocortex in Bayesian
inference while ascribing a role to the basal ganglia in value
computation and action selection. The model suggests an important role
for interactions between the neocortex and the basal ganglia in
learning the mapping between probabilistic sensory representations and
actions that maximize rewards.
(Joint work with Yanping Huang and Abe Friesen)
Speaker: James Lee, Dept. Computer Science & Eng., Univ. Washington
Location: Gates Commons
I will try to tell a story of spectral clustering, from
heuristics, to experiments, to a new algorithm and its formal analysis.
In an interesting twist, the same tools used to analyze spectral
partitioning yield new ways to mine the structure of noisy systems of
linear equations modulo a prime. This, in turn, allows us to mount a
spectral attack on one of the most prominent open problems in complexity
theory. [This is based partly on joint work with Shayan Oveis-Gharan and
Luca Trevisan.]
Speaker: Murali Haran, Pennsylvania State University
Location: EEB 037
This is a tutorial on the use of Gaussian process to approximate complex models often used in modeling phenomena like climate, disease dynamics, etc. The tutorial will begin with an introduction to Gaussian processes followed by a discussion of how Gaussian processes can be used for both computer model emulation (approximation) and calibration.
Speaker: Thomas Richardon, University of Washington
Location: EEB 037
As these talks are intended as overviews, time permitting, I plan to give two mini-tutorials:
The first will be on the idea of counterfactuals/potential outcomes.
Basically answering the question: how does data obtained from a simple randomized experiment differ from that obtained from an observational study, and how does that weaken inferences that can be obtained.
The second will require assume a little bit of background on Bayesian networks and will answer the question: if we obtain data from a subset of the variables in a causal Bayesian network, which causal effects are identified and how can they be computed efficiently.
Speaker: Ashish Sabharwal, IBM T,J, Watson
Location: EEB 037
This tutorial presents a survey of algorithms that are used for counting and sampling of SAT problems.
Speaker: Hoifung Poon, Microsoft Research
Location: EEB 037
This tutorial summarizes the literature on knowledge extraction in scientific domains, such as biomedical texts.
Speaker: Niranjan Balasubramanian, University of Washington
Location: EEB 037
This tutorial discusses the machine learning techniques popular in the information retrieval subcommunity such as ranking techniques, pagerank, etc.
Speaker: Tom Erez, University of Washington
Location: EEB 037
This tutorial gives an introduction to the control theory, in particular, discussing the trajectory optimization techniques.
Speaker: Alan Ritter, University of Washington
Location: EEB 037
This tutorial discusses probabilistic models popular in the NLP lexical semantics community.
Speaker: Peter Hoff, University of Washington
Location: EEB 037
This tutorial discusses probabilistic models for social and other network data.
Speaker: Andrew Guillory
Location: EEB 037
This tutorial presents a brief survey of active learning, submodular
functions, and the interesting algorithms and analyses at their
intersection. Minimal background knowledge is assumed, and emphasis
is placed on open problems and gaps between theory and practice.
Slides at: http://ml.cs.washington.edu/www/media/presentations/submodularity_tutorial.pdf
Speaker: Alice Zheng
Location: CSE 305
Our world is becoming more data driven. With the spread of ubiquitous
sensors, network connectivity, and massive storage capabilities, we
are able to collect more and more data. But our computation and
analysis capabilities have not increased at a comparable rate.
Computer scientists are facing looming questions such as "How do we
deal with the massive amounts of data we are collecting? How can we
extract value out of data?" A sub-question relevant to machine
learning researchers is "what role will machine learning and data
mining play?" Through a survey of current sources of Big Data and
analysis workflow patterns, this talk aims to shed light on the latter
question.
Speaker: Yoram Singer
Location: CSE 305
We describe a relaxed and generalized notion of maximum entropy
problems for multinomial distributions. By introducing a simple
re-parametrization we are able to derive an efficient homotopy
tracking scheme for the entire relaxation path using linear space and
quadratic time. We also show that the Legendre dual of the relaxed
maximum entropy problem is the task of finding the maximum likelihood
estimator for an exponential distribution with L1 regularization.
Hence, our solution can be used for problems such as language modeling
with sparse parameter representation.
Speaker: Luke Zettlemoyer
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)
Recent work has demonstrated effective learning algorithms for a
variety of semantic parsing problems, where the goal is to
automatically recover the underlying meaning of input sentences.
Although these algorithms can work well, there is still a large cost
in annotating data and gathering other language-specific resources for
each new application. In this talk, I will describe efforts to address
these challenges by developing scalable, probabilistic CCG grammar
induction algorithms. I will present recent work on methods that
incorporate new notions of lexical generalization, thereby enabling
effective learning for a variety of different natural languages and
formal meaning representations. I will also describe a new approach
for learning semantic parsers from conversational data, which does not
require any manual annotation of sentence meaning. Finally, I will
sketch future directions, including our recurring focus on building
scalable learning techniques while attempting to minimize the
application-specific engineering effort.
Joint work with Yoav Artzi, Tom Kwiatkowski, Sharon Goldwater, and
Mark Steedman
Speaker: Geoff Zweig, Microsoft Research
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)
Segmental Conditional Random Fields (SCRFs) are a mathematical model
in which an observation sequence is segmented into labeled chunks. For
example, an audio sequence may be segmented into words, or a video
segmented into scenes. These models have two key
characteristics. First, in contrast to frame-based systems, long-span
features can be used - features which relate an entire span of
observations to a label, and which may make reference to precisely
defined segment boundaries. Second, the models are log-linear in
form, and allow for the convenient integration of features derived
from heterogeneous information sources. This talk will provide an
overview of SCRFs, and in particular their use in speech
recognition. First, we describe the basic model. Then, we present
results in which SCRFs are used to add new information sources to
improve the performance of a state-of-the-art system. Finally, we
present initial results in which the framework is used to do
recognition from the ground up.
Speaker: Ali Shojaie, Department of Biostatistics, University of Washington
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)
In graphical models theory, directed edges often represent causal
relationships among random variables and discovering such
relationships is of main interest in many areas of application,
including biological and social problems. In addition to challenges in
estimating directed graphs due to high dimensionality of the space and
inadequacy of observational data for estimating the direction of
causation, presence of cycles in graphs further complicates
application of statistical methodologies to estimation of directed
graphs. I will start by describing an efficient (polynomial time)
algorithm, based on penalized likelihood methods, for estimation of
directed graphs with known topological ordering, and discuss
extensions of this methodology for estimation of general directed
graphs, using (i) temporal observations, and (ii) joint perturbation
screens and steady-state observations of the systems. Asymptotic
properties of the proposed estimators and their applications in the
context of gene regulatory networks are further discussed.
Speaker: Joseph Keshet, TTI-Chicago
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)
In discriminative machine learning one is interested in training a
system to optimize a certain desired measure of performance, or
loss. In binary classification one typically tries to minimizes the
error rate. But in structured prediction each task often has its own
measure of performance such as the BLEU score in machine translation
or word error rate in speech recognition. The most common approaches
to structured prediction, structural SVMs and CRFs, do not minimize
the task loss: the former minimizes a surrogate loss with no
guarantees for task loss and the latter minimizes log loss independent
of task loss. In the first part of the talk we present a theorem
stating that a certain perceptron-like learning rule, involving
features vectors derived from loss-adjusted inference, directly
corresponds to the gradient of task loss. Empirical results on
phonetic alignment are presented surpassing all previously reported
results on this problem on the TIMIT corpus.
In the second part of the talk, we describe a new algorithm which aims
to minimize the regularized task loss. We state a PAC-Bayesian
generalization bound, which gives an upper-bound on the expected task
loss in terms of the empirical task loss. Our algorithm is derived by
finding the gradient of the PAC-Bayesian bound and minimizing it by
stochastic gradient descent. The resulting algorithm is iterative and
easy to implement. Experiments on phoneme recognition on the TIMIT
corpus, when the task loss is chosen to be the phoneme error rate,
show that our method achieves the lowest phoneme error rate compared
to other discriminative and generative models with the same expressive
power.
Joint work with David McAllester and Tamir Hazan.
Speaker: John Langford, Yahoo! Research
Location: CSE 305 (community lunch starts at NOON, Talk starts at 12:30pm)
In the real world, we are constantly presented with problems where we
first see some information, then make a choice, and then get feedback
about that choice. Direct examples of this occur in medical testing
and online ad auctions. The standard online supervised learning
setting is almost the same, except that it provides feedback about all
choices. With this weaker form of feedback provided by the real
world, how can we best learn?
Speaker: Martin Wainwright (UC Berkeley)
Location: EE 403 (community lunch starts at NOON, Talk starts at 12:30pm)
Problems that require estimating high-dimensional matrices
from noisy observations arise frequently in statistics and machine
learning. Examples include dimensionality reduction methods (e.g.,
principal components and canonical correlation), collaborative
filtering and matrix completion (e.g., Netflix and other recommender
systems), system identification of state-space models, and graphical
model learning. When the sample size is less than the matrix
dimensions, all of these problems are ill-posed, so that some type of
structure is required in order to obtain interesting results.
In recent years, relaxations based on the nuclear norm and other types
of convex matrix regularizers have become popular. By framing a broad
class of problems as special cases of matrix regression, we present a
single theoretical result that provides guarantees on the accuracy of
such convex relaxations. Two properties of the convex relaxation turn
out to be essential: decomposability of the regularizer, and
restricted strong convexity of the loss function. Our general result
can be specialized to obtain various explicit and non-asymptotic
bounds, among them sharp rates for noisy forms of matrix sketching,
completion, and decomposition.
Based on joint work with Sahand Negahban.
Speaker: Brian Kulis (UC Berkeley)
Location: EE 403 (community lunch (Indian food) starts at noon)
There are several scenarios where one needs to construct or learn a transformation of data from one space to another. One example is auto-tagging of images, where an image is mapped from image space to text space to find the most relevant words to describe an image. Another example is metric learning, where one maps all of the data from one space to another such that standard distance functions in the mapped space better capture the desired distance structure of the data.
In this talk, I will describe some of my recent work in analyzing and developing techniques for the transformation learning problem. I will describe a general formulation for learning linear transformations, and then show that this formulation may be efficiently adapted to learn non-linear transformations through kernelization. Crucially, this kernelization yields transformations that can be applied inductively to new data, and the analysis generalizes a significant amount of previous research in this area. I will then describe some specific instantiations of this formulation for the problems of metric learning and domain adaptation. I will show some results on a variety of vision tasks, and will discuss connections to related work on online learning and hashing techniques.
Speaker: Chris Burges (Microsoft Research)
Location: EE 403 (community lunch starts at noon (now with a lactose-free option))
Unlabeled text data is plentiful and computing power continues to grow exponentially. The accuracy difference between competing supervised learning methods tends to decline as more data becomes available, indicating that more data can be more important than cleverer algorithms. "AskMSR" showed that leveraging unlabeled textual Web data can do an impressive job at question-answering. Recent work has shown that vector space models combined with neural nets can solve standard natural language problems well, and very recent work using recurrent neural nets to predict sequences of characters gives very intriguing results. Inspired by all of this, we are investigating the following question: what is the simplest vector space model we can build that (1) learns from unlabeled data, (2) represents both words and sentences as vectors, (3) is generative (it can efficiently map sentence vectors to text), (4) is as simple as possible but can still do something interesting? This work is still in early development.
Speaker: Pedro Domingos (UW CS)
Location: EE 403 (community lunch starts at noon)
The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with previous deep architectures. For example, SPNs perform face completion better than state-of-the-art deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex. (Joint work with Hoifung Poon.)
Speaker: Li Deng (Microsoft Researcher)
Location: EE 403 (community lunch starts at noon)
We recently developed a DBN-HMM (Deep Belief Net/Hidden Markov Model) architecture for large-scale speech recognition with three steps in learning and decoding: 1) Stacked RBM pre-training; 2) Stochastic gradient descent back-propagation in fine tuning of DBN; and 3) Tying DBN weights in the phone-bound temporal dimension and using dynamic programming to arrive at the decision rule for speech recognition (Dahl, Yu, and Deng, ICASSP 2011 and IEEE Trans. Audio, Speech and Language Proc., 2011). While achieving remarkable success with this approach compared with the state-of-the-art in discriminatively trained HMM, we are faced with the seemingly insurmountable problem of scalability in dealing with virtually unlimited amount of speech training data in all sorts of acoustic environments. The main difficulty comes from the stochastic gradient learning algorithm in the fine tuning step of DBN. This step of the DBN-HMM learning algorithm is extremely difficult to parallelize over many machines with practical advantages, after some detailed exploration.
To overcome the scalability challenge, we have recently developed a novel deep learning architecture, which we call deep convex network (DCN). The learning algorithm in the DCN is a convex one in each layer of the DCN. And the learning algorithm is batch-based instead of stochastic, naturally lending it amenable for parallelization. The construction of the DCN is related to our earlier work on building the deep hidden CRF (Yu, Wang, Deng, IEEE J. Selected Topics in Signal Proc., Dec 2010), but the learning algorithm is much simpler in each layer, enabling the use of hundreds of layers in the overall DCN architecture. We have empirically found that without doing global error fitting over all layers, the DCN can already achieve excellent discrimination results as long as a very deep architecture is built. One version of the algorithm contains a module that makes use of nonlinear random projection inspired by the work of extreme learning machine (Huang, 2006). While this module gives reasonably good results after it is embedded in the deep structure, we found that replacing it with the restricted Boltzmann machine (RBM) contributes to over 30% of error reduction in the image recognition (MNIST) experiment, achieving the state of the art performance in this task. More recent experiments on speech data will be discussed also.
Speaker: Maya Gupta, UW EE
Location: CSE 305
We discuss some recent advances in local learning. We
describe and compare state-of-the-art local learning methods,
including the local support vector machine, optimal weights for k-NN,
local Bayesian discriminant analysis, and completely lazy learning. We
highlight the importance of local learning when given similarity data.
Experimental results include applications in computer vision, sonar,
bioinformatics, and non-destructive evaluation.
Speaker: Manfred K Warmuth, UC Santa Cruz
Location: CSE 305
Multiplicative updates multiply the parameters by
nonnegative factors. These updates are motivated by
a Maximum Entropy Principle and they are prevalent in evolutionary
processes where the parameters are for example
concentrations of species and the factors are survival rates.
The simplest such update is Bayes rule and we give
an in vitro selection algorithm for RNA strands that
implements this rule in the test tube where
each RNA strand represents a different model.
In one liter of the RNA soup there are approximately 1020 different strands
and therefore this is a rather high-dimensional implementation of Bayes rule.
We investigate multiplicative updates for the purpose
of learning online while processing a stream of examples.
The ``blessing'' of these updates is that they learn very fast
because the good parameters grow exponentially.
However their ``curse'' is that they learn too fast and
wipe out parameters too quickly. We describe a number of
methods developed in the realm of online learning
that ameliorate the curse of these updates.
The methods make the algorithm robust against data
that changes over time and prevent the currently good
parameters from taking over.
We also discuss how the curse is circumvented by nature.
Some of nature's methods parallel the ones
developed in Machine Learning, but nature
also has some additional tricks.
This will be a high level talk. No background in online
learning or evolutionary Biology will be required.
Speaker: Daniela Witten, UW Biostatistics Department
Location: CSE 305
We consider the problem of clustering observations using a potentially
large set of features. One might expect that the true underlying
clusters present in the data differ only with respect to a small
fraction of the features, and will be missed if one clusters the
observations using the full set of features. We propose a novel
framework for sparse clustering, in which one clusters the
observations using an adaptively chosen subset of the features. The
method uses a lasso-type penalty to select the features. We use this
framework to develop simple methods for sparse K-means and
sparse hierarchical clustering. A single criterion governs both the
selection of the features and the resulting clusters.
Speaker: Venkat Chandrasekaran, MIT
Location: CSE 305
Deducing the state or structure of a system from partial, noisy
measurements is a fundamental task throughout the sciences and
engineering. A commonly encountered difficulty that arises in such
inverse problems is the very limited availability of data relative to
the ambient dimension of the signal to be estimated. We analyze a
class of ill-posed linear inverse problems in which the underlying
model of interest has simple algebraic structure. Specifically the
focus of this talk is on the setting in which we are given a small
number of linear measurements of a simple model, which is formed by
adding a small number of elements from some atomic set. Examples of
such models include previously studied cases such as sparse signals
and low-rank matrices, as well as others such as low-rank tensors,
binary vectors, orthogonal matrices, and matrices formed as the sum of
a few permutations. Inspired by the success of the L1-norm and
nuclear-norm heuristics, we propose a general convex regularization
framework to recover such simple models. We discuss general recovery
guarantees for our approach, and describe several example
applications. Thus our work extends the catalog of objects and
structures (beyond sparse vectors and low-rank matrices) that can be
recovered from limited information via tractable convex programming.
Joint work with Benjamin Recht, Pablo Parrilo, and Alan Willsky.
Speaker: Marina Meila, Department of Statistics, University of Washington
Location: Gates Commons (CSE 691)
This talk is concerned with summarizing -- by means of statistical
models -- of data that expresses preferences. This data is typically a
set of rankings of n items by a panel of experts; the simplest summary
is the "consensus ranking", or the "centroid" of the set of
rankings. Such problems appear in many tasks, ranging from combining
voter preferences to boosting of search engines.
We study the problem in its more general form of estimating a
parametric model known as the Generalized Mallows (GM) model. I will
present an exact estimation algorithm, non-polynomial in theory, but
extremely effective in comparison with existing algorithms. From a
statistical point of view, we show that the GM model is an exponential
family, and introduce the conjugate prior for this model class.
Then we introduce the infinite GM model, corresponding to "rankings"
over an infinite set of items, and show that this model is both
elegant and of practical significance. Finally, the talk will touch
upon the subject of multimodal distributions and clustering.
Joint work with: Harr Chen, Alnur Ali, Bhushan Mandhani, Le Bao, Kapil
Phadnis, Arthur Patterson and Jeff Bilmes.
Speaker: Adnan Darwiche, Computer Science Department, University of California, Los Angeles
Location: Gates Commons (CSE 691)
I will present in this talk a theory of anytime, approximate
inference, which explains some of the most influential algorithms in
probabilistic reasoning, yet is based on one of the older ideas in symbolic
reasoning: Relaxations. According to this theory, the fundamental notion of
approximation is that of "relaxing" logical constraints (equalities in
particular) for the purpose of decomposing a problem into smaller pieces
that can be solved independently. A second fundamental notion is that of
"compensation", which calls for imposing weaker and, typically,
probabilistic notions of equality to compensate for the relaxed
equalities. The third fundamental notion of the presented theory is that of
"recovery," where some of the relaxed equalities are incrementally
recovered, based on an assessment of their impact on improving the quality
of approximations. I will discuss how this theory subsumes one of the most
influential algorithms in probabilistic inference: loopy belief propagation
and some of its generalizations, and comment on its relevance to lifted
probabilistic inference. I will also show how this theory has formed the
basis of an award-winning solver at the third UAI inference competition in
2010. I will finally discuss the relationship between this theory and
current developments in symbolic reasoning -- in particular, how advances
in knowledge compilation can impact the state of the art in approximate
(and exact) probabilistic inference.
Speaker: Bill Noble, Dept. Genome Sciences, University of Washington
Location: CSE 305
Sequence census methods such as ChIP-seq have begun to produce an
unprecedented amount of genome-anchored data. Numerous techniques exist
to analyze the results of single assays, but genomics research has
lacked an integrative method to identify patterns from multiple
experiments simultaneously, while taking full advantage of the
high-resolution data now available. We have developed such a method,
which employs a dynamic Bayesian network to discover joint patterns
across different experiment types. We have applied this method to ENCODE
chromatin data for the human chronic myeloid leukemia cell line K562.
In an unsupervised fashion, we have identified patterns associated with
transcription initiation and elongation, as well as transcriptional
repression, finding the locations of genes without using DNA sequence,
mRNA or multispecies conservation data. We have also successfully
performed a semi-supervised identification of enhancer regions across
the genome using a curated set of seed enhancers. In both the
unsupervised and semi-supervised cases, the method identifies patterns
that elucidate the relationships among open chromatin, various
transcription factors and histone modifications.
Speaker: Jeff Bilmes, Dept. Electrical Engineering, University of Washington
Location: Gates Commons (CSE 691)
In this talk, I'll discuss a journey taken by my research interests
leading from dynamic graphical models to image segmentation. I'll start
out discussing dynamic graphical models (DGMs) and their efficient
inference. Entailed in performing inference in such models is the problem
of identifying a vertex cut that determines the boundary of a critical
periodic graphical component. This component is used when the model is
expanded to an arbitrary length, and also determines a local junction tree
segment used as part of this expansion. Depending on the boundary of this
component, inference costs can change by orders of magnitude. There are
many ways of judging the quality of the component without performing
inference itself, including the standard vertex cut criterion. Most
interestingly, however, is the case where the DGM may be sparse, and the
quality is determined by an objective that takes into account this
sparsity. It turns out that such an objective is submodular. Considering
this, and when vertex cut is seen as standard graph cut, a new problem is
naturally introduced that involves finding a minimal cut in a graph where
edges are judged not by a sum of edge weights but instead by a submodular
function, a problem we call cooperative cut. The remainder of the talk
will provide a brief background on submodularity, cooperative cut, and
various hardness and approximate solutions to this problem, and how it can
be applied to image segmentation. In this latter application, results show
significant improvements over the standard (node submodular) graph cut
approaches, in particular for images that exhibit a "shrinking bias"
problem and for those images with smooth illumination gradients.
Joint work with Stefanie Jegelka, with earlier contributions from Chris
Bartels.