Page no longer maintained (Sep 4, 2010)

Please follow this link to my new page. Also update your bookmarks as necessary.

On-line publications of Csaba Szepesvári

DBLP:Szepesvari

Last update: June 22, 2010; ICML, COLT, ALT, IROS 2010 papers, RL lecture

Contents: Reinforcement Learning, MDPs, Search, Other Machine Learning, Visual Tracking, Adaptive Control, Robust & Neuro-control, Bioinformatics and Other Applications

Note: Problems with downloading files may have several causes: a broken link, a slow connection, or an inappropriate server/client configuration. In the case of broken links, I would appreciate it if you dropped me an e-mail describing the problem. Comments, suggestions, and criticism are always welcome!

Reinforcement Learning, Markovian Decision Problems, Planning, Search

Algorithms for Reinforcement Learning

Cs. Szepesvári
Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2010 (98 pages) pdf (last update: Aug 8, 2010), at the publisher's website, or buy the paperback at Amazon. Visit the book webpage here.

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book we focus on those algorithms of reinforcement learning which build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas and a large number of state-of-the-art algorithms, and then discuss their theoretical properties and limitations.


Toward a Classification of Finite Partial-Monitoring Games

G. Bartók, D. Pál, Cs. Szepesvári
ALT-10, 2010 pdf

In a finite partial-monitoring game against Nature, the Learner repeatedly chooses one of finitely many actions, Nature responds with one of finitely many outcomes, and the Learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the Learner is to minimize its total cumulative loss. We make progress towards a classification of these games based on their minimax expected regret. Namely, we classify almost all games with two outcomes: we show that their minimax expected regret is either zero, Theta(T^{1/2}), Theta(T^{2/3}), or Theta(T), and we give a simple and efficiently computable classification of these four classes of games. Our hope is that the result can serve as a stepping stone toward classifying all finite partial-monitoring games.

Reinforcement Learning Algorithms for MDPs

Cs. Szepesvári
Wiley Encyclopedia of Operations Research, Wiley, 2010

Extending Rapidly-Exploring Random Trees for Asymptotically Optimal Anytime Motion Planning

Y. Abbasi, J. Modayil, Cs. Szepesvári
IROS, 2010, pdf will be available soon

The Online Loop-free Stochastic Shortest-Path Problem

G. Neu, A. György, Cs. Szepesvári
COLT-10, 2010. pdf

We consider a stochastic extension of the loop-free shortest path problem with adversarial rewards. In this episodic Markov decision problem an agent traverses through an acyclic graph with random transitions: at each step of an episode the agent chooses an action, receives some reward, and arrives at a random next state, where the reward and the distribution of the next state depend on the actual state and the chosen action. We consider the bandit situation when only the reward of the just visited state-action pair is revealed to the agent. For this problem we develop algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability alpha > 0 under all policies, we give an algorithm and prove that its regret is O(L^2 sqrt(T|A|)/alpha), where T is the number of episodes, A denotes the (finite) set of actions, and L is the length of the longest path in the graph. Variants of the algorithm are given that improve the dependence on the transition probabilities under specific conditions. The results are also extended to variations of the problem, including the case when the agent competes with time varying policies.

Model-based reinforcement learning with nearly tight exploration complexity bounds

I. Szita, Cs. Szepesvári
ICML-10, 2010. pdf

One might believe that model-based algorithms of reinforcement learning can propagate the obtained experience more quickly and are able to direct exploration better. As a consequence, fewer exploratory actions should be enough to learn a good policy. Strangely enough, current theoretical results for model-based algorithms do not support this claim: in a finite Markov decision process with N states, the best bounds on the number of exploratory steps necessary are of order O(N^2 log N), in contrast to the O(N log N) bound available for the model-free, delayed Q-learning algorithm. In this paper we show that MorMax, a modified version of the Rmax algorithm, needs to make at most O(N log N) exploratory steps. This matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while our new bound improves the dependence on other problem parameters.

Toward Off-Policy Learning Control with Function Approximation

H.R. Maei, Cs. Szepesvári, S. Bhatnagar, R. Sutton
ICML-10, 2010. pdf

We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.
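
To make the flavour of the update concrete, here is a minimal sketch of a Greedy-GQ-style update with linear action-value features; the function name, the argument layout and the way the greedy next action is supplied are illustrative choices for this sketch, not the paper's pseudocode.

```python
import numpy as np

def greedy_gq_step(theta, w, phi, reward, phi_next_greedy, alpha, beta, gamma):
    """One Greedy-GQ-style update (sketch).

    theta:            main weights, q(s, a) ~= theta @ phi(s, a)
    w:                secondary weights used by the gradient correction
    phi:              features of the visited state-action pair
    phi_next_greedy:  features of the next state paired with its greedy action
    """
    delta = reward + gamma * theta @ phi_next_greedy - theta @ phi   # TD error
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next_greedy)
    w = w + beta * (delta - w @ phi) * phi
    return theta, w
```

The per-time-step cost of such an update is linear in the number of features, which is the property emphasized in the abstract.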

Some recent algorithmic results about the exploration-vs-exploitation dilemma. Comment on 'Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration'

Cs. Szepesvári
Phil Trans R Soc B 2007; 362, 2010. html

Model Selection in Reinforcement Learning

A.m. Farahmand, Cs. Szepesvári
(submitted), 2010. pdf

We consider the challenge of automating parameter tuning in reinforcement learning. More specifically, we consider the batch (off-line, non-interactive) reinforcement learning setting and the problem of learning an action-value function with a small Bellman error. We propose a complexity regularization-based model selection algorithm and prove its adaptivity: the procedure is shown to perform almost as well as if the best parameter setting was known ahead of time. We also discuss other approaches to derive adaptive procedures in reinforcement learning.

A Markov-Chain Monte Carlo Approach to Simultaneous Localization and Mapping

P. Torma, A. György, Cs. Szepesvári
AI&Stat 2010, JMLR: W&CP 9, pp. 605-612, 2010. pdf

We provide a Markov-chain Monte Carlo based algorithm to solve the simultaneous localization and mapping (SLAM) problem with a general dynamics and observation model, under open-loop control and provided that the map representation is finite dimensional. To our knowledge this is the first provably consistent yet (close-to) practical solution to this problem. The superiority of our algorithm over alternative SLAM algorithms is demonstrated in a difficult loop closing situation.

REGO: Rank-based Estimation of Rényi Information using Euclidean Graph Optimization

B. Póczos, S. Kirshner and Cs. Szepesvári
AI&Stat 2010, JMLR: W&CP 9, pp. 852-859, 2010 (accepted for oral presentation). pdf

We propose a new method for a non-parametric estimation of Rényi and Shannon information for a multivariate distribution using a corresponding copula, a multivariate distribution over normalized ranks of the data. As the information of the distribution is the same as the negative entropy of its copula, our method estimates this information by solving a Euclidean graph optimization problem on the empirical estimate of the distribution's copula. Owing to the properties of the copula, we show that the resulting estimator of Rényi information is strongly consistent and robust. Further, we demonstrate its applicability in image registration in addition to simulated experiments.
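
The following is a rough sketch of the copula-plus-Euclidean-graph idea, using a minimum spanning tree as the graph and dropping the additive constant that depends only on the dimension and the entropy order; it illustrates a BHH-type estimator in the spirit of the abstract, not the exact REGO procedure.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import rankdata

def renyi_info_copula(X, alpha=0.7):
    """Rough estimate of Renyi information of order alpha (0 < alpha < 1) for data X (n x d).

    Steps: map the sample to the empirical copula via coordinate-wise ranks,
    compute the total weight of a Euclidean minimum spanning tree on the copula
    sample with edge lengths raised to the power gamma = d * (1 - alpha), and
    plug it into a BHH-type entropy estimator.  The additive constant that
    depends only on d and alpha is omitted here.
    """
    n, d = X.shape
    U = np.column_stack([rankdata(X[:, j]) / n for j in range(d)])  # empirical copula
    gamma = d * (1.0 - alpha)
    W = squareform(pdist(U)) ** gamma                # powered Euclidean edge weights
    L = minimum_spanning_tree(W).sum()               # total weight of the MST
    H_alpha = np.log(L / n ** alpha) / (1.0 - alpha) # copula entropy, up to a constant
    return -H_alpha                                  # information = negative copula entropy
```

For independent coordinates the copula sample is close to uniform, so the estimated information should be near its minimum; dependence shrinks the graph and increases the estimate.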

Models of active learning in group-structured state spaces

G. Bartók, Cs. Szepesvári, and S. Zilles
Information and Computation, 208:364-384, 2010. pdf; the official version is here.

We investigate the problem of learning the transition dynamics of deterministic, discrete-state environments. We assume that an agent exploring such an environment is able to perform actions (from a finite set of actions) in the environment and to sense the state changes. The question investigated is whether the agent can learn the dynamics without visiting all states. Such a goal is unrealistic in general, hence we assume that the environment has structural properties an agent might exploit. In particular, we assume that the set of all action sequences forms an algebraic group.
We introduce a learning model in different variants and study under which circumstances the corresponding ``group structured environments'' can be learned efficiently by experimenting with group generators (actions). It turns out that for some classes of such environments the choice of actions given to the agent determines if efficient learning is possible. Negative results are presented, even without efficiency constraints, for rather general classes of groups, showing that even with group structure, learning an environment from partial information is far from trivial. However, positive results for special subclasses of Abelian groups turn out to be a good starting point for the design of efficient learning algorithms based on structured representations.

Active Learning in Heteroscedastic Noise

A. Antos, V. Grover, and Cs. Szepesvári
Theoretical Computer Science 411:2712-2728, 2010. pdf; the official version is here.

We consider the problem of actively learning the mean values of distributions associated with a finite number of options. The decision maker can select which option to generate the next observation from, the goal being to produce estimates with equally good precision for all the options. If sample means are used to estimate the unknown values, then the optimal solution, assuming that the distributions are known up to a shift, is to sample from each distribution proportional to its variance. No information other than the distributions' variances is needed to calculate the optimal solution. In this paper we propose an incremental algorithm that asymptotically achieves the same loss as an optimal rule. We prove that the excess loss suffered by this algorithm, apart from logarithmic factors, scales as n^(-3/2), which we conjecture to be the optimal rate. The performance of the algorithm is illustrated on a simple problem.
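
As an illustration of the kind of incremental allocation rule discussed above, here is a hedged sketch: sample the option whose estimated variance-to-sample-count ratio is largest, with some forced exploration so that no option is starved. The function names and the sqrt(t) forcing schedule are assumptions for the sketch, not the paper's exact algorithm.

```python
import numpy as np

def allocate(pull, n_options, budget, forced=1.0, rng=None):
    """Sequentially allocate `budget` samples among `n_options` options (sketch).

    pull(k) draws one observation from option k.  After one initial pull per
    option, the rule samples the option with the largest estimated
    variance-to-sample-count ratio, while forcing every option to have been
    pulled at least ~forced*sqrt(t) times as a safeguard against
    underestimated variances early on.
    """
    rng = rng or np.random.default_rng()
    samples = [[pull(k)] for k in range(n_options)]
    for t in range(n_options, budget):
        counts = np.array([len(s) for s in samples])
        starved = np.where(counts < forced * np.sqrt(t))[0]
        if starved.size > 0:
            k = int(rng.choice(starved))                # forced exploration
        else:
            variances = np.array([np.var(s, ddof=1) for s in samples])
            k = int(np.argmax(variances / counts))      # variance-to-count rule
        samples[k].append(pull(k))
    return [float(np.mean(s)) for s in samples], [len(s) for s in samples]
```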

X-Armed Bandits

S. Bubeck, R. Munos, G. Stoltz and Cs. Szepesvári
Submitted on 21/1/2010. pdf

We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder continuous with a known exponent, then the expected regret of HOO is bounded up to a logarithmic factor by n^{1/2}, i.e., the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric.

Reinforcement Learning Algorithms for MDPs -- A Survey

Cs. Szepesvári
Technical Report TR09-13, Department of Computing Science, University of Alberta, 2009. pdf

Please cite the book, which is based on this technical report, if you were planning to cite this report. Keywords: reinforcement learning; Markov Decision Processes; temporal difference learning; stochastic approximation; two-timescale stochastic approximation; Monte-Carlo methods; simulation optimization; function approximation; stochastic gradient methods; least-squares methods; overfitting; bias-variance tradeoff; online learning; active learning; planning; simulation; PAC-learning; Q-learning; actor-critic methods; policy gradient; natural gradient

Abstract: This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDP). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actor-critic methods.

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

H.R. Maei, Cs. Szepesvári, S. Bhatnagar, D. Silver, D. Precup, and R.S. Sutton
NIPS 22, 2009 (draft). pdf

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(lambda), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as non-linear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to non-linear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.

Training Parsers by Inverse Reinforcement Learning

G. Neu and Cs. Szepesvári
Machine Learning 77:303--337, 2009. pdf

One major idea in structured prediction is to assume that the predictor computes its output by finding the maximum of a score function. The training of such a predictor can then be cast as the problem of finding weights of the score function so that the output of the predictor on the inputs matches the corresponding structured labels on the training set. A similar problem is studied in inverse reinforcement learning (IRL) where one is given an environment and a set of trajectories and the problem is to find a reward function such that an agent acting optimally with respect to the reward function would follow trajectories that match those in the training set. In this paper we show how IRL algorithms can be applied to structured prediction, in particular to parser training. We present a number of recent incremental IRL algorithms in a unified framework and map them to parser training algorithms. This allows us to recover some existing parser training algorithms, as well as to obtain a new one. The resulting algorithms are compared in terms of their sensitivity to the choice of various parameters and generalization ability on the Penn Treebank WSJ corpus.

Learning When to Stop Thinking and Do Something!

B. Póczos, Y. Abbasi-Yadkori, Cs. Szepesvári, R. Greiner, and N. Sturtevant
(ICML-09), pp. 825-832, 2009. pdf

An anytime algorithm is capable of returning a response to the given task at essentially any time; typically the quality of the response improves as the time increases. Here, we consider the challenge of learning when we should terminate such algorithms on each of a sequence of iid tasks, to optimize the expected average reward per unit time. We provide a system for addressing this challenge, which combines the global optimizer Cross-Entropy method with local gradient ascent. This paper theoretically investigates how far the estimated gradient is from the true gradient, then empirically demonstrates that this system is effective by applying it to a toy problem, as well as on a real-world face detection task.

Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora
(ICML-09), pp. 993-1000, 2009. pdf

Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
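
A minimal sketch of the TDC update with linear features is given below (GTD2 differs only in how the main weights are updated); the variable names and the importance-sampling-ratio argument are choices made for this sketch.

```python
import numpy as np

def tdc_step(theta, w, phi, reward, phi_next, alpha, beta, gamma, rho=1.0):
    """One TDC (linear TD with gradient correction) update (sketch).

    theta: value-function weights, v(s) ~= theta @ phi(s)
    w:     auxiliary weights that estimate the expected TD update direction
    rho:   importance-sampling ratio (1.0 in the on-policy case)
    """
    delta = reward + gamma * theta @ phi_next - theta @ phi          # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (delta - w @ phi) * phi
    return theta, w
```

The correction term is zero when w is zero, which is why the update initially coincides with conventional TD.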

Learning Exercise Policies for American Options

Y. Li, Cs. Szepesvári and D. Schuurmans
Proc. of the Twelfth International Conference on Artificial Intelligence and Statistics, JMLR: W&CP 5:352-359, Clearwater Beach, Florida USA, April 16-18, 2009 (AISTAT-09) pdf

Options are important instruments in modern finance. In this paper, we investigate reinforcement learning (RL) methods---in particular, least-squares policy iteration (LSPI)---for the problem of learning exercise policies for American options. We develop finite-time bounds on the performance of the policy obtained with LSPI and compare LSPI and the fitted Q-iteration algorithm (FQI) with the Longstaff-Schwartz method (LSM), the standard least-squares Monte Carlo algorithm from the finance community. Our empirical results show that the exercise policies discovered by LSPI and FQI gain larger payoffs than those discovered by LSM, on both real and synthetic data. Furthermore, we find that for all methods the policies learned from real data generally gain similar payoffs to the policies learned from simulated data. Our work shows that solution methods developed in machine learning can advance the state-of-the-art in an important and challenging application area, while demonstrating that computational finance remains a promising area for future applications of machine learning methods.

Model-based and Model-free Reinforcement Learning for Visual Servoing

A.m. Farahmand, A. Shademan, M. Jagersand, and Cs. Szepesvári
ICRA, pp. 2917-2924, 2009. pdf

To address the difficulty of designing a controller for complex visual-servoing tasks, two learning-based uncalibrated approaches are introduced. The first method starts by building an estimated model for the visual-motor forward kinematics of the vision-robot system by a locally linear regression method. Afterwards, it uses a reinforcement learning method named Regularized Fitted Q-Iteration to find a controller (i.e. policy) for the system (model-based RL). The second method directly uses samples coming from the robot without building any intermediate model (model-free RL). The simulation results show that both methods perform comparably well despite not having any a priori knowledge about the robot.

Regularized Fitted Q-iteration for Planning in Continuous-Space Markovian Decision Problems

A.m. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor
ACC-09, pp. 725-730, 2009. pdf

Reinforcement learning with linear and non-linear function approximation has been studied extensively in the last decade. However, as opposed to other fields of machine learning such as supervised learning, the effect of finite sample has not been thoroughly addressed within the reinforcement learning framework. In this paper we propose to use L^2 regularization to control the complexity of the value function in reinforcement learning and planning problems. We consider the Regularized Fitted Q-Iteration algorithm and provide generalization bounds that account for small sample sizes. Finally, a realistic visual-servoing problem is used to illustrate the benefits of using the regularization procedure.
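
A schematic sketch of the idea with a plain linear feature space and a ridge (L^2) penalty is shown below; the paper works with richer function spaces (e.g. an RKHS), so treat this only as an illustration of where regularization enters the fitted Q-iteration loop. The argument layout is an assumption of the sketch.

```python
import numpy as np

def regularized_fqi(phi_sa, rewards, phi_next_all, gamma, lam, n_iter=50):
    """Fitted Q-iteration with ridge-regularized least squares (sketch).

    phi_sa:        (n, d) features of the sampled (state, action) pairs
    rewards:       (n,)   observed rewards
    phi_next_all:  (n, n_actions, d) features of (next state, a) for every action a
    lam:           L2 penalty coefficient controlling model complexity
    """
    n, d = phi_sa.shape
    theta = np.zeros(d)
    A = phi_sa.T @ phi_sa + lam * np.eye(d)              # regularized Gram matrix
    for _ in range(n_iter):
        q_next = phi_next_all @ theta                    # (n, n_actions) next values
        targets = rewards + gamma * q_next.max(axis=1)   # greedy Bellman backup
        theta = np.linalg.solve(A, phi_sa.T @ targets)   # ridge regression fit
    return theta
```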

A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation

R.S. Sutton, Cs. Szepesvári, and H.R. Maei
NIPS 21, pp. 1609-1616, 2008 (draft). pdf

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

Online Optimization in X-Armed Bandits

S. Bubeck, R. Munos, G. Stoltz and Cs. Szepesvári
NIPS 21, pp. 201-208, 2008 (draft). pdf; a longer version is here.

We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space and the mean-payoff function is "locally Lipschitz" with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by sqrt(n), i.e., the rate of the growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm for the class of problems considered.

Regularized Policy Iteration

A.m. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor
NIPS 21, pp. 441-448, 2008 (draft). pdf

In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementations of our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.
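
For the LSTD half of the story, one common way to add an L^2 penalty with plain linear features is sketched below; the paper's RKHS formulation (and the regularized BRM variant) is more general, so this is only meant to show where the regularizer enters the policy evaluation step.

```python
import numpy as np

def lstd_l2(phi, phi_next, rewards, gamma, lam):
    """L2-regularized LSTD for policy evaluation (sketch): solve (A + lam*I) theta = b.

    phi, phi_next: (n, d) features of visited states and their successors
    rewards:       (n,) observed rewards along the trajectory
    """
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next) + lam * np.eye(d)
    b = phi.T @ rewards
    return np.linalg.solve(A, b)
```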

Regularized Fitted Q-iteration: Application to Planning

A.m. Farahmand, M. Ghavamzadeh, S. Mannor, and Cs. Szepesvári
EWRL 2008, LNAI 5323-0055 (revised and selected papers), pp. 55-68, 2008. pdf

We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose to use fitted Q-iteration with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of controlling model-complexity. The algorithm is presented in detail for the case when the function space is a reproducing kernel Hilbert space underlying a user-chosen kernel function. We derive bounds on the quality of the solution and argue that data-dependent penalties can lead to almost optimal performance. A simple example is used to illustrate the benefits of using a penalized procedure.

Exploration-exploitation trade-off using variance estimates in multi-armed bandits

J.-Y. Audibert, R. Munos, and Cs. Szepesvári
Theoretical Computer Science, 410:1876-1902, 2009. pdf

Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. This paper considers a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. We provide the first analysis of the expected regret for such algorithms. As expected, our results show that the algorithm that uses the variance estimates has a major advantage over its alternatives that do not use such estimates provided that the variances of the payoffs of the suboptimal arms are low. We also prove that the regret concentrates only at a polynomial rate. This holds for all the upper confidence bound based algorithms and for all bandit problems except those special ones where with probability one the payoff obtained by pulling the optimal arm is larger than the expected payoff for the second best arm. Hence, although upper confidence bound bandit algorithms achieve logarithmic expected regret rates, they might not be suitable for a risk-averse decision maker. We illustrate some of the results by computer simulations.
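
The index such algorithms maximize combines the empirical mean, a variance-based confidence term and a range-based correction; a sketch with illustrative (untuned) constants follows, so the constants below should not be read as the values analyzed in the paper.

```python
import numpy as np

def ucbv_index(mean, var, count, t, b=1.0, zeta=1.2, c=1.0):
    """UCB-V-style index for one arm (sketch).

    mean, var: empirical mean and variance of the arm's observed payoffs
    count:     number of times the arm has been pulled
    t:         current time step; b is a known bound on the payoff range
    """
    e = zeta * np.log(t)
    return mean + np.sqrt(2.0 * var * e / count) + c * 3.0 * b * e / count

def choose_arm(means, variances, counts, t):
    # Pull each arm once first, then maximize the UCB-V-style index.
    for k, n in enumerate(counts):
        if n == 0:
            return k
    return int(np.argmax([ucbv_index(m, v, n, t)
                          for m, v, n in zip(means, variances, counts)]))
```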

Active learning in group-structured environments

G. Bartók, Cs. Szepesvári, and S. Zilles
Proc. of the 19th International Conference on Algorithmic Learning Theory (ALT-08), Springer-Verlag, Berlin, LNCS/LNAI 5254, pp. 329-343, 2008. pdf at Springer's site
For the extended version, published at I&C, click here.

The question investigated in this paper is to what extent an input representation influences the success of learning, in particular from the point of view of analyzing agents that can interact with their environment. We investigate learning environments that have a group structure. We introduce a learning model in different variants and study under which circumstances group structures can be learned efficiently from experimenting with group generators (actions). Negative results are presented, even without efficiency constraints, for rather general classes of groups showing that even with group structure, learning an environment from partial information is far from trivial. However, positive results for special subclasses of Abelian groups turn out to be a good starting point for the design of efficient learning algorithms based on structured representations.

Active learning in multi-armed bandits

A. Antos, V. Grover, and Cs. Szepesvári
Proc. of the 19th International Conference on Algorithmic Learning Theory (ALT-08), Springer-Verlag, Berlin, LNCS/LNAI 5254, pp. 287-302, 2008. pdf at Springer's site
For the extended version, published at TCS, click here.

In this paper we consider the problem of actively learning the mean values of distributions associated with a finite number of options (arms). The algorithms can select which option to generate the next sample from in order to produce estimates with equally good precision for all the distributions. When an algorithm uses sample means to estimate the unknown values then the optimal solution, assuming full knowledge of the distributions, is to sample each option proportional to its variance. In this paper we propose an incremental algorithm that asymptotically achieves the same loss as an optimal rule. We prove that the excess loss suffered by this algorithm, apart from logarithmic factors, scales as 1/n^(3/2), which we conjecture to be the optimal rate. The performance of the algorithm is illustrated in a simple problem.

Empirical Bernstein Stopping

V. Mnih, Cs. Szepesvári and J.-Y. Audibert
ICML-08, pp. 672-679, 2008. pdf

Sampling is a popular way of scaling up machine learning algorithms to large datasets. The question often is how many samples are needed. Adaptive stopping algorithms monitor the performance in an online fashion and they can stop early, saving valuable resources. We consider problems where probabilistic guarantees are desired and demonstrate how recently-introduced empirical Bernstein bounds can be used to design stopping rules that are efficient. We provide upper bounds on the sample complexity of the new rules, as well as empirical results on model selection and boosting in the filtering setting.
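
A simplified sketch of an empirical-Bernstein-based stopping rule for estimating a mean to relative accuracy eps is given below; the per-check confidence budget (a delta/t^2-type schedule) and the constants are assumptions of the sketch, not the exact rules analyzed in the paper.

```python
import numpy as np

def eb_halfwidth(std, n, value_range, delta):
    """Empirical Bernstein confidence-interval half-width."""
    log_term = np.log(3.0 / delta)
    return std * np.sqrt(2.0 * log_term / n) + 3.0 * value_range * log_term / n

def eb_stop(sample, eps=0.1, delta=0.05, value_range=1.0, max_n=10**6):
    """Sample until the mean is known to relative accuracy eps (sketch).

    sample() draws one observation from a bounded distribution whose range is
    at most value_range.  A delta/t^2-type schedule splits the confidence
    budget across the stopping checks.
    """
    xs = [sample(), sample()]
    t = 2
    while t < max_n:
        mean, std = np.mean(xs), np.std(xs, ddof=1)
        delta_t = 6.0 * delta / (np.pi ** 2 * t ** 2)
        if eb_halfwidth(std, t, value_range, delta_t) <= eps * abs(mean):
            return float(mean), t
        xs.append(sample())
        t += 1
    return float(np.mean(xs)), t
```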

Speeding Up Planning in Markov Decision Processes via Automatically Constructed Abstractions

A. Isaza, Cs. Szepesvári, V. Bulitko and R. Greiner
UAI-08, pp. 306-314, 2008. pdf

In this paper, we consider planning in stochastic shortest path problems, a subclass of Markov Decision Problems (MDP). We focus on medium-size problems whose state space can be fully enumerated. This problem has numerous important applications, such as navigation and planning under uncertainty. We propose a new approach for constructing a multi-level hierarchy of progressively simpler abstractions of the original problem. Once computed, the hierarchy can be used to speed up planning by first finding a policy for the most abstract level and then recursively refining it into a solution to the original problem. This approach is fully automated and delivers a speed-up of two orders of magnitude over a state-of-the-art MDP solver on sample problems while returning near-optimal solutions.

Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping

R. Sutton, Cs. Szepesvári, A. Geramifard and M. Bowling
UAI-08, pp. 528-536, 2008. pdf

We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized-sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.
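
To fix ideas, here is a minimal sketch of linear Dyna-style policy evaluation: a linear model (F, b) of expected next features and expected reward is learned from real transitions, and TD-like planning updates are then applied to feature vectors generated from the model. The prioritized-sweeping machinery is omitted, and the sampling routine sample_phi is an assumed placeholder.

```python
import numpy as np

def dyna_linear_step(phi, reward, phi_next, F, b, theta,
                     alpha, gamma, plan_steps, sample_phi):
    """One real step of model learning plus several planning steps (sketch).

    F, b:       linear model of expected next features and expected reward
    theta:      value-function weights; sample_phi() yields features to plan from
    """
    # Model learning from the real transition (phi, reward, phi_next).
    F = F + alpha * np.outer(phi_next - F @ phi, phi)
    b = b + alpha * (reward - b @ phi) * phi
    # Direct TD(0) update from the real transition.
    theta = theta + alpha * (reward + gamma * theta @ phi_next - theta @ phi) * phi
    # Planning: TD-like updates on imagined transitions generated by the model.
    for _ in range(plan_steps):
        x = sample_phi()
        delta = b @ x + gamma * theta @ (F @ x) - theta @ x
        theta = theta + alpha * delta * x
    return F, b, theta
```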

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

A. Antos, Cs. Szepesvári and R. Munos
Machine Learning, 71:89-129, 2008. DOI link, pdf

We consider batch reinforcement learning problems in continuous space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case when the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is fitted Q-iteration where in successive iterations the Q-functions of the intermediate policies are obtained by means of minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.

Fitted Q-iteration in continuous action-space MDPs

András Antos, Remi Munos and Csaba Szepesvári
NIPS-20, pp. 9-16, 2007. pdf

We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.
Note: In retrospect, it would have been better to call this algorithm an actor-critic algorithm. The algorithm that we consider updates both a policy and a value function (an action-value function in this case).

Value-Iteration Based Fitted Policy Iteration: Learning with a Single Trajectory

András Antos, Remi Munos and Csaba Szepesvári
IEEE ADPRL, pp. 330-337, 2007 pdf

We consider batch reinforcement learning problems in continuous space, expected total discounted-reward Markovian Decision Problems when the training data is composed of the trajectory of some fixed behaviour policy. The algorithm studied is policy iteration where in successive iterations the action-value functions of the intermediate policies are obtained by means of approximate value iteration. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance. The bounds depend on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, the approximation power and capacity of the function set used. One of the main novelties of the paper is that new smoothness constraints are introduced, thereby significantly extending the scope of previous results.

Why use policy iteration?

Tuning bandit algorithms in stochastic environments

Jean Yves Audibert, Remi Munos and Csaba Szepesvári
ALT-07, pp. 150-165, 2007. pdf presentation: ppt pdf

Algorithms based on upper-confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. In this paper we consider a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. The purpose of this paper is to provide a theoretical explanation of these findings and provide theoretical guidelines for the tuning of the parameters of these algorithms. For this we analyze the expected regret and for the first time the concentration of the regret. The analysis of the expected regret shows that variance estimates can be especially advantageous when the payoffs of suboptimal arms have low variance. The risk analysis, rather unexpectedly, reveals that except for some very special bandit problems, the regret, for upper confidence bounds based algorithms with standard bias sequences, concentrates only at a polynomial rate. Hence, although these algorithms achieve logarithmic expected regret rates, they seem less attractive when the risk of suffering much worse than logarithmic regret is also taken into account.

Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods

Gergely Neu and Csaba Szepesvári
UAI-07, pp. 295-302, 2007 pdf

In this paper we propose a novel gradient algorithm to learn a policy from an expert's observed behavior assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem. The algorithm's aim is to find a reward function such that the resulting optimal policy matches well the expert's observed behavior. The main difficulty is that the mapping from the parameters to policies is both nonsmooth and highly redundant. Resorting to subdifferentials solves the first difficulty, while the second one is overcome by computing natural gradients. We tested the proposed method in two artificial domains and found it to be more reliable and efficient than some previous methods.

Improved Rates for the Stochastic Continuum-Armed Bandit Problem

Peter Auer, Ronald Ortner and Csaba Szepesvári
COLT-07, pp. 454-468, 2007. pdf

Considering one-dimensional continuum-armed bandit problems, we propose an improvement of an algorithm of Kleinberg and a new set of conditions which give rise to improved rates. In particular, we introduce a novel assumption that is complementary to the previous smoothness conditions, while at the same time smoothness of the mean payoff function is required only at the maxima. Under these new assumptions new bounds on the expected regret are derived. In particular, we show that apart from logarithmic factors, the expected regret scales with the square-root of the number of trials, provided that the mean payoff function has finitely many maxima and its second derivatives are continuous and non-vanishing at the maxima. This improves a previous result of Cope by weakening the assumptions on the function. We also derive matching lower bounds. To complement the bounds on the expected regret, we provide high probability bounds which exhibit similar scaling.

Continuous Time Associative Bandit Problems

A. György, L. Kocsis, I. Szabó and Cs. Szepesvári
IJCAI-07, pp. 830-835, 2007. pdf

In this paper we consider an extension of the multi-armed bandit problem. In this generalized setting, the decision maker receives some side information, performs an action chosen from a finite set and then receives a reward. Unlike in the standard bandit settings, performing an action takes a random period of time. The environment is assumed to be stationary, stochastic and memoryless. The goal is to maximize the average reward received in one unit time, that is, to maximize the average rate of return. We consider the on-line learning problem where the decision maker initially does not know anything about the environment but must learn about it by trial and error. We propose an ``upper confidence bound''-style algorithm that exploits the structure of the problem. We show that the regret of this algorithm relative to the optimal algorithm that has perfect knowledge about the problem grows at the optimal logarithmic rate in the number of decisions and scales polynomially with the parameters of the problem.

Bandit Based Monte-Carlo Planning (paper introducing UCT)

L. Kocsis, Cs. Szepesvári
Proceedings of the 17th European Conference on Machine Learning, Springer-Verlag, Berlin, LNCS/LNAI 4212, September 18-22, pp. 282-293, 2006. pdf Springer's site

For large state-space Markovian Decision Problems Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
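
The heart of UCT is to treat action selection at each internal node of the search tree as a bandit problem and pick the child that maximizes an upper confidence bound; a sketch of that selection rule is below, with the exploration constant c left as a tunable parameter rather than the paper's analyzed choice.

```python
import numpy as np

def uct_select(child_values, child_counts, c=1.4):
    """Pick a child index by the UCB1-style rule used in UCT (sketch).

    child_values: mean returns observed through each child so far
    child_counts: visit counts for each child
    """
    total = sum(child_counts)
    best, best_score = 0, -np.inf
    for i, (q, n) in enumerate(zip(child_values, child_counts)):
        if n == 0:
            return i                                  # visit unvisited children first
        score = q + c * np.sqrt(np.log(total) / n)    # optimism in the face of uncertainty
        if score > best_score:
            best, best_score = i, score
    return best
```

A full UCT planner repeats: select a path down the tree with this rule, expand a leaf, roll out to estimate its value, and back the returns up the visited nodes.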

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

A. Antos, Cs. Szepesvári and R. Munos
Proceedings of The Nineteenth Annual Conference on Learning Theory, COLT 2006, Springer-Verlag, Berlin, Heidelberg, LNCS/LNAI 4005, pp. 574-588, 2006. pdf
Note: This is the conference paper version of the above MLJ paper.

We consider batch reinforcement learning problems in continuous space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case when the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is fitted Q-iteration where in successive iterations the Q-functions of the intermediate policies are obtained by means of minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.

Universal Parameter Optimisation in Games Based on SPSA

L. Kocsis and Cs. Szepesvári
Machine Learning Journal, vol. 63, pp. 249-286, 2006. preprint: pdf DOI link

[..] The goal of this paper is twofold: (i) to introduce SPSA for the game programming community by putting it into a game-programming perspective, and (ii) to propose and discuss several methods that can be used to enhance the performance of SPSA. These methods include using common random numbers and antithetic variables, a combination of SPSA with RPROP, and the reuse of samples of previous performance evaluations. SPSA with the proposed enhancements was tested in some large-scale experiments on tuning the parameters of an opponent model, a policy and an evaluation function in our poker program, McRAISE. Whilst SPSA with no enhancements failed to make progress using the allocated resources, SPSA with the enhancements proved to be competitive with other methods, including TD-learning, increasing the average payoff per game by as much as 0.19 times the size of the small bet. From the experimental study, we conclude that the use of an appropriately enhanced variant of SPSA for the optimisation of game program parameters is a viable approach, especially if no good alternative exists for the types of parameters considered.
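
For reference, the basic (unenhanced) SPSA step estimates the gradient from just two perturbed evaluations of the performance measure; a minimal sketch follows, without the RPROP combination, common random numbers, antithetic variables or sample reuse discussed in the paper. The step sizes a and c are illustrative constants.

```python
import numpy as np

def spsa_step(theta, evaluate, a=0.1, c=0.1, rng=None):
    """One SPSA update: estimate the gradient from two perturbed evaluations (sketch).

    evaluate(theta) returns a (possibly noisy) performance measure to maximize,
    e.g. the average payoff of a game program over a batch of games.
    """
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
    y_plus = evaluate(theta + c * delta)
    y_minus = evaluate(theta - c * delta)
    g_hat = (y_plus - y_minus) / (2.0 * c * delta)      # simultaneous-perturbation gradient
    return theta + a * g_hat                            # ascend the estimated gradient
```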

Log-optimal currency portfolios and control Lyapunov exponents

L. Gerencsér, M. Rásonyi, Cs. Szepesvári and Zs. Vágó
44th IEEE Conference on Decision and Control and 2005 European Control Conference (CDC-ECC'05), pp. 1764-1769, 2005. pdf

P. Algoet and T. Cover characterized log-optimal portfolios in a stationary market without friction. There is no analogous result for markets with friction, of which a currency market is a typical example. In this paper we restrict ourselves to simple static strategies. The problem is then reduced to the analysis of products of random matrices, the top-Lyapunov exponent giving the growth rate. New insights into products of random matrices will be given and an algorithm for optimizing top-Lyapunov exponents will be presented together with some key steps of its analysis. Simulation results will also be given. [..]

Finite Time Bounds for Sampling Based Fitted Value Iteration

Cs. Szepesvári and R. Munos
ICML'2005, pp. 881-886, 2005. pdf presentation

In this paper we consider sampling based fitted value iteration for discounted, large (possibly infinite) state space, finite action Markovian Decision Problems where only a generative model of the transition probabilities and rewards is available. At each step the image of the current estimate of the optimal value function under a Monte-Carlo approximation to the Bellman-operator is projected onto some function space. PAC-style bounds on the weighted $L^p$-norm approximation error are obtained as a function of the covering number and the approximation power of the function space, the iteration number and the sample size.
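
A schematic sketch of one such iteration with a linear least-squares projection is given below: Monte-Carlo estimates of the Bellman backup are computed at sampled base points and then fitted over the feature space. The paper's analysis covers general function spaces and weighted L^p projections; the helper names here (phi, sample_next) are placeholders.

```python
import numpy as np

def fvi_iteration(states, phi, theta, sample_next, n_actions, m, gamma):
    """One step of sampling-based fitted value iteration (sketch).

    states:      base points at which the Bellman backup is estimated
    phi:         feature map, phi(s) -> (d,) array; V(s) ~= theta @ phi(s)
    sample_next: generative model, sample_next(s, a) -> (reward, next_state)
    m:           number of Monte-Carlo samples per (state, action) pair
    """
    targets = []
    for s in states:
        backups = []
        for a in range(n_actions):
            draws = [sample_next(s, a) for _ in range(m)]
            backups.append(np.mean([r + gamma * theta @ phi(s2) for r, s2 in draws]))
        targets.append(max(backups))                       # Monte-Carlo Bellman backup
    Phi = np.array([phi(s) for s in states])
    theta_new, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return theta_new                                       # projection onto the feature space
```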

Finite Time Bounds for Fitted Value Iteration

R. Munos and Cs. Szepesvári
JMLR, 9:815--857, 2008. pdf JMLR's pdf is here.

This is the longer version of the ICML'2005 paper with all the proofs and some extra material. In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrarily good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted $L^p$-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is ``aligned'' with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.

RSPSA: Enhanced Parameter Optimisation in Games

L. Kocsis, Cs. Szepesvári and M.H.M. Winands
Proceedings of the 11th Advances in Computer Games Conference (ACG-11) pp. 39-56, 2006. pdf

Most game programs have a large number of parameters that are crucial for their performance. While tuning these parameters by hand is rather difficult, successful applications of automatic optimisation algorithms in game programs are known only for parameters that belong to certain components (e.g. evaluation-function parameters). The SPSA (Simultaneous Perturbation Stochastic Approximation) algorithm is an attractive choice for optimising any kind of parameters of a game program, both for its generality and its simplicity. Its disadvantage is that it can be very slow. In this article we propose several methods to speed up SPSA, in particular the combination with RPROP, using common random numbers, antithetic variables and averaging. We test the resulting algorithm for tuning various types of parameters in two domains, poker and LOA. From the experimental study, we conclude that using SPSA is a viable approach for optimisation in game programs, especially if no good alternative exists for the types of parameters considered.

Reduced-Variance Payoff Estimation in Adversarial Bandit Problems

L. Kocsis and Cs. Szepesvári
Proceedings of the ECML-2005 Workshop on Reinforcement Learning in Non-Stationary Environments (accepted). pdf extended version, slide handouts, slides.

A natural way to compare learning methods in non-stationary environments is to compare their regret. In this paper we consider the regret of algorithms in adversarial multi-armed bandit problems. We propose several methods to improve the performance of the baseline exponentially weighted average forecaster by changing the payoff-estimation methods. We argue that improved performance can be achieved by constructing payoff estimation methods that produce estimates with low variance. Our arguments are backed up by both theoretical and empirical results. In fact, our empirical results show that significant performance gains are possible over the baseline algorithm.

Interpolation-based Q-learning

Cs. Szepesvári and William D. Smart
ICML 2004, pp. 791-798 pdf

A presentation showing that the only requirement on the function approximator is that it be an averager can be found here: pdf with 2 slides per page, pdf with 1 slide per page.

We consider a variant of Q-learning in continuous state spaces under the total expected discounted cost criterion combined with local function approximation methods. Provided that the function approximator satisfies certain interpolation properties, the resulting algorithm is shown to converge with probability one. The limit function is shown to satisfy a fixed point equation of the Bellman type, where the fixed point operator depends on the stationary distribution of the exploration policy and approximation properties of the function approximation method. The basic algorithm is extended in several ways. In particular, a variant of the algorithm is obtained that is shown to converge in probability to the optimal Q function. Preliminary computer simulations confirm the validity of the approach.

Shortest Path Discovery Problems: A Framework, Algorithms and Experimental Results

Cs. Szepesvári
AAAI-2004, 2004, pp. 550-555 - with thanks to Levente Kocsis for helpful discussions pdf

In this paper we introduce and study Shortest Path Discovery (SPD) problems, a generalization of shortest path problems: in SPD one is given a directed edge-weighted graph and the task is to find the shortest path between fixed source and target nodes, where the edge-weights are initially unknown but can be queried. Querying the cost of an edge is expensive and hence the goal is to minimize the total number of edge cost queries executed. In this article we characterize some common properties of sound SPD algorithms and propose a particular algorithm that is shown to be sound and effective. Experimental results on a real-world OCR task demonstrate the usefulness of the approach, with the proposed algorithm yielding a substantial speed-up of the recognition process.

Efficient Approximate Planning in Continuous Space Markovian Decision Problems

Cs. Szepesvári
AI Communications, 13:3, pp. 163 - 176, 2001. pdf

Monte-Carlo planning algorithms for planning in continuous state-space, discounted Markovian Decision Problems (MDPs) having a smooth transition law and a finite action space are considered. We prove various polynomial complexity results for the considered algorithms, improving upon several known bounds.

Convergent Reinforcement Learning with Value Function Interpolation

Cs. Szepesvári
Technical Report TR-2001-02. Mindmaker Ltd., Budapest 1121, Konkoly Th. M. u. 29-33, HUNGARY 2000. pdf

We consider the convergence of a class of reinforcement learning algorithms combined with value function interpolation methods using the methods developed in (Littman and Szepesvari, 1996). As a special case of the obtained general results, for the first time, we prove the (almost sure) convergence of Q-learning when combined with value function interpolation in uncountable spaces.

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

S. Singh, T. Jaakkola, M.L. Littman and Cs. Szepesvári
Machine Learning, 38:3, pp. 287-308, 2000. pdf DOI link.

An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

Comparing Value-Function Estimation Algorithms in Undiscounted Problems

F. Beleznay, T. Grõbler and Cs. Szepesvári
Technical Report TR-99-02. Mindmaker Ltd., Budapest 1121, Konkoly Th. M. u. 29-33, HUNGARY 1999. ps

We compare the scaling properties of several value-function estimation algorithms. In particular, we prove that Q-learning can scale exponentially slowly with the number of states. We identify the reasons for the slow convergence and show that both TD($\lambda$) and learning with a fixed learning-rate enjoy rather fast convergence, just like the model-based method.

An Evaluation Criterion for Macro-Learning and Some Results

Zs. Kalmár and Cs. Szepesvári
Technical Report TR-99-01. Mindmaker Ltd., Budapest 1121, Konkoly Th. M. u. 29-33, HUNGARY 1999. ps

It is known that a well-chosen set of macros makes it possible to considerably speed up the solution of planning problems. Recently, macros have been considered in the planning framework built on Markovian decision problems. However, so far no systematic approach has been put forth to investigate the utility of macros within this framework. In this article we begin to systematically study this problem by introducing the concept of multi-task MDPs defined with a distribution over the tasks. We propose an evaluation criterion for macro-sets that is based on the expected planning speed-up due to the usage of a macro-set, where the expectation is taken over the set of tasks. The consistency of the empirical speed-up maximization algorithm is shown in the finite case. For acyclic systems, the expected planning speed-up is shown to be proportional to the amount of ``time-compression'' due to the macros. Based on these observations a heuristic algorithm for learning macros is proposed. The algorithm is shown to return macros identical to those that one would like to design by hand in the case of a particular navigation-like multi-task MDP. Some related questions, in particular the problem of breaking up MDPs into multiple tasks, factorizing MDPs and learning generalizations over actions to enhance the amount of transfer, are also considered briefly at the end of the paper.

A unified analysis of value-function-based reinforcement-learning algorithms

Cs. Szepesvári and M. L. Littman
Neural Computation 11:2017-2060, 1999. ps

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the asynchronous convergence of a complex reinforcement-learning algorithm to be proven by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multi-state updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

Reinforcement Learning: Theory and Practice Cs. Szepesvári
in Proceedings of the 2nd Slovak Conference on Artificial Neural Networks (SCANN'98). ps

Nov. 10-12, 1998, Smolenice, Slovakia, pp. 29-39 (Ed: Marian Hrehus)

We consider reinforcement learning methods for the solution of complex sequential optimization problems. In particular, the soundness of two methods proposed for the solution of partially observable problems will be shown. The first method suggests a state-estimation scheme and requires mild {\em a priori} knowledge, while the second method assumes that a significant amount of abstract knowledge is available about the decision problem and uses this knowledge to set up a macro-hierarchy that turns the partially observable problem into another one which can already be handled using methods worked out for observable problems. This second method is also illustrated with some experiments on a real robot.

Module-Based Reinforcement Learning: Experiments with a Real Robot Zs. Kalmár, Cs. Szepesvári and A. Lorincz
Machine Learning 31:55-85, 1998. ps

Autonomous Robots 5:273-295, 1998. (This was a joint Special Issue of the two journals.)

The behavior of reinforcement learning (RL) algorithms is best understood in completely observable, discrete-time controlled Markov chains with finite state and action spaces. In contrast, robot-learning domains are inherently continuous both in time and space, and moreover are partially observable. Here we suggest a systematic approach to solving such problems in which the available qualitative and quantitative knowledge is used to reduce the complexity of the learning task. The steps of the design process are to: i) decompose the task into subtasks using the qualitative knowledge at hand; ii) design local controllers to solve the subtasks using the available quantitative knowledge; and iii) learn a coordination of these controllers by means of reinforcement learning. It is argued that the approach enables fast, semi-automatic, but still high-quality robot control, as no fine-tuning of the local controllers is needed. The approach was verified on a non-trivial real-life robot task. Several RL algorithms were compared by ANOVA and it was found that the model-based approach worked significantly better than the model-free approach. The learnt switching strategy performed comparably to a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment which were not foreseen in advance, thus supporting the view that adaptive algorithms are advantageous over non-adaptive ones in complex environments.
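A minimal sketch of step iii) above: a higher-level Q-learning agent that learns which hand-designed module (local controller plus operating condition) to activate. The toy corridor environment, the two modules and the learning constants are our own illustrative placeholders, not the paper's robot task.

    import random

    N_CELLS = 10  # corridor cells 0 .. N_CELLS-1; the goal is the rightmost cell

    # Modules: a local controller plus its operating condition (steps i and ii above).
    modules = {
        "step_right": {"applicable": lambda s: s < N_CELLS - 1, "run": lambda s: s + 1},
        "step_left":  {"applicable": lambda s: s > 0,           "run": lambda s: s - 1},
    }

    Q = {(s, m): 0.0 for s in range(N_CELLS) for m in modules}
    alpha, gamma, eps = 0.1, 0.95, 0.1

    def choose_module(s):
        """Epsilon-greedy choice among the modules whose operating condition holds in s."""
        available = [m for m in modules if modules[m]["applicable"](s)]
        if random.random() < eps:
            return random.choice(available)
        return max(available, key=lambda m: Q[(s, m)])

    # Step iii: learn the coordination (switching strategy) by Q-learning over modules.
    for episode in range(500):
        s = 0
        while s != N_CELLS - 1:
            m = choose_module(s)
            s_next = modules[m]["run"](s)                    # run the local controller
            r = 1.0 if s_next == N_CELLS - 1 else -0.01      # module-level reward
            best_next = 0.0 if s_next == N_CELLS - 1 else max(
                Q[(s_next, m2)] for m2 in modules if modules[m2]["applicable"](s_next))
            Q[(s, m)] += alpha * (r + gamma * best_next - Q[(s, m)])
            s = s_next

    # Learnt switching strategy: expected to select "step_right" in every non-goal cell.
    greedy = {s: max((m for m in modules if modules[m]["applicable"](s)),
                     key=lambda m: Q[(s, m)]) for s in range(N_CELLS - 1)}
    print(greedy)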

Module-Based Reinforcement Learning for a Real Robot Zs. Kalmár, Cs. Szepesvári and A. Lorincz
in Proceedings of the 6th European Workshop on Learning Robots, Lecture Notes in AI, to appear. 1998. ps

The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov-chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable. In this article we suggest a systematic design method whose motivation comes from the desire to transform the task-to-be-solved into a finite-state, discrete-time, epsilon-stationary Markovian task, which is completely observable too.

An epsilon-stationary MDP is one whose transition probabilities may vary with time, but remain in an epsilon-neighborhood of some ideal transition probability matrix, p*. We prove that model-based RL algorithms converge to a neighborhood of the optimal value function corresponding to p*, with the neighborhood size being proportional to epsilon and some other parameters of the MDP.

The approach was tried out on a real-life robot. Several RL algorithms were compared and it was found that a model-based approach worked best. The learnt switching strategy performed as well as a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment which could not have been foreseen in advance, suggesting the promising possibility that a learnt controller might outperform a handcrafted switching strategy in the future.
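In symbols, one way to write the epsilon-stationarity condition described above is to require that the time-varying transition kernel $p_t$ stays uniformly close to the ideal kernel $p^*$: $|p_t(y \mid x,a) - p^*(y \mid x,a)| \le \epsilon$ for all times $t$, states $x, y$ and actions $a$; the convergence result then places the learned values within an $O(\epsilon)$ neighborhood (up to MDP-dependent constants) of the optimal value function of $p^*$.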

Non-Markovian Policies in Sequential Decision Problems Cs. Szepesvári
Acta Cybernetica 13:3, pp. 305-318, 1998. ps

In this article we prove the validity of the Bellman Optimality Equation and related results for sequential decision problems with a general recursive structure. The characteristic feature of our approach is that non-Markovian policies are also taken into account. The theory is motivated by some experiments with a learning robot.

Multi-criteria Reinforcement Learning Z. Gábor, Zs. Kalmár and Cs. Szepesvári
In Proceedings of the International Conference on Machine Learning, in press, 1998. ps. Revised on 05/04/2004.

We consider multi-criteria sequential decision making problems where the vector-valued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given for a special, but important class of problems, namely when the evaluation of policies can be computed for the criteria independently of each other. The analysis requires special care as the topology introduced by pointwise convergence and the order-topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. These types of multi-criteria problems are most useful when there are several optimal solutions to a problem and one wants to choose the one among these which is optimal according to another fixed criterion. Possible applications in robotics and repeated games are outlined.
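As a concrete instance of comparing vector-valued evaluations by a fixed total ordering, here is a minimal sketch using a lexicographic order (our illustrative choice; the paper's class of orderings is more general):

    # Greedy action selection when action values are vectors (one component per criterion)
    # compared by a fixed total ordering -- here lexicographic, as one illustrative choice.
    from typing import Dict, Tuple

    QVector = Tuple[float, ...]   # components ordered by decreasing priority of the criteria

    def greedy_action(q_values: Dict[str, QVector]) -> str:
        """Return the action whose value vector is largest in the lexicographic order."""
        return max(q_values, key=lambda a: q_values[a])  # Python tuples compare lexicographically

    # Both actions are optimal w.r.t. the first criterion; the second criterion breaks the tie.
    print(greedy_action({"left": (1.0, 0.2), "right": (1.0, 0.7)}))   # -> right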

Multi-criteria Reinforcement Learning Z. Gábor, Zs. Kalmár and Cs. Szepesvári
Technical Report TR-98-115, "Attila József" University, Research Group on Artificial Intelligence Szeged, HU-6700, 1998 (submitted in 1997). ps. Revised on 05/04/2004.

We consider multi-criteria sequential decision making problems where the vector-valued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given for a special, but important class of problems, namely when the evaluation of policies can be computed for the criteria independently of each other. The analysis requires special care as the topology introduced by pointwise convergence and the order-topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. These types of multi-criteria problems are most useful when there are several optimal solutions to a problem and one wants to choose the one among these which is optimal according to another fixed criterion. Possible applications in robotics and repeated games are outlined.

Static and Dynamic Aspects of Optimal Sequential Decision Making Cs. Szepesvári
PhD Thesis, "Attila József" University, Bolyai Institute of Mathematics, Szeged, HU-6700, 1998. ps

Foreword: In this thesis the theory of optimal sequential decisions having a general recursive structure is investigated via an operator theoretical approach, the recursive structure (of both the dynamics and the optimality criterion) being encoded into the so-called cost propagation operator. Decision problems like Markovian Decision Problems with expected or worst-case total discounted/undiscounted cost criteria; repeated zero-sum games such as Markov games; or alternating Markov games all admit such a recursive structure. Our setup has the advantage that it emphasizes their common properties as well as points out some differences.

The thesis consists of two parts: in the first part the model is assumed to be known, while in the second one the models are to be explored. The setup of Part I is rather abstract but enables a unified treatment of a large class of sequential decision problems, namely the class where the total cost of decision policies is defined recursively by a so-called cost propagation operator. Under natural monotonicity and continuity conditions the greedy policies w.r.t. the optimal cost-to-go function turn out to be optimal, due to the recursive structure.

Part II considers the case when the models are unknown, and have to be explored and learnt. The price of considering unknown models is that here we have to restrict ourselves to models with an additive cost structure in order to obtain tractable learning situations. The almost sure convergence of the most frequently used algorithms proposed in the reinforcement learning community is proved. These algorithms are treated as multidimensional asynchronous stochastic approximation schemes and their convergence is deduced from the main theorem of the second part. The key to the method here is the so-called rescaling property of certain homogeneous processes. A practical and verifiable sufficient condition for the convergence of on-line learning policies to an optimal policy is formulated and a convergence rate is established.

Module Based Reinforcement Learning for a Real Robot Zs. Kalmár, Cs. Szepesvári and A. Lorincz
in Proceedings of the 6th European Workshop on Learning Robots, 22-32, 1997. ps

The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable. In this article we suggest a systematic method whose motivation comes from the desire to transform the task-to-be-solved into a finite-state, discrete-time, "approximately" Markovian task which is completely observable too. The key idea is to break up the problem into subtasks and design controllers for each of the subtasks. Then operating conditions are attached to the controllers (together, the controllers and their operating conditions are called modules) and possibly additional features are designed to facilitate observability. A new discrete time-counter is introduced at the "module-level" that clicks only when a change in the value of one of the features is observed. The approach was tried out on a real-life robot. Several RL algorithms were compared and it was found that a model-based approach worked best. The learnt switching strategy performed as well as a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment which could not have been foreseen in advance, suggesting the promising possibility that a learnt controller might outperform a handcrafted switching strategy in the future.

The Asymptotic Convergence-Rate of Q-learning Cs. Szepesvári
In Proceedings of Neural Information Processing Systems 10, pp. 1064-1070, 1997. ps

In this paper we show that for discounted MDPs with discount factor $\gamma>1/2$ the asymptotic rate of convergence of Q-learning is $O(1/t^{R(1-\gamma)})$ if $R(1-\gamma)<1/2$ and $O(\sqrt{\log\log t/t})$ otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here $R=p_{\min}/p_{\max}$ is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent on-line learning provided that $p_{\min}>0$, where $p_{\min}$ and $p_{\max}$ now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution.
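For reference, a minimal sketch of the setting analysed above: tabular Q-learning with state-action pairs sampled from a fixed distribution. The small random MDP, the uniform sampling distribution and the 1/n(s,a) step sizes are our own illustrative assumptions.

    # Tabular Q-learning with (s, a) pairs sampled i.i.d. from a fixed distribution.
    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
    R = rng.uniform(size=(nS, nA))                  # expected immediate rewards
    Q = np.zeros((nS, nA))
    visits = np.zeros((nS, nA))

    for t in range(50_000):
        s, a = rng.integers(nS), rng.integers(nA)   # fixed sampling distribution (uniform here)
        s2 = rng.choice(nS, p=P[s, a])
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                  # decaying step size
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].max() - Q[s, a])

    print(np.round(Q.max(axis=1), 3))               # estimated optimal value of each state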

Learning and Exploitation do not Conflict Under Minimax Optimality Cs. Szepesvári
In Proceedings of 9th European Conference of Machine Learning, pp. 242-249, 1997. ps

We show that adaptive real time dynamic programming extended with the action selection strategy which chooses the best action according to the latest estimate of the cost function yields asymptotically optimal policies within finite time under the minimax optimality criterion. From this it follows that learning and exploitation do not conflict under this special optimality criterion. We relate this result to learning optimal strategies in repeated two-player zero-sum deterministic games.

Some basic facts concerning minimax sequential decision processes Cs. Szepesvári
Technical Report TR-96-100, "Attila József" University, Research Group on Artificial Intelligence Szeged, HU-6700, 1996. ps

It is shown that for discounted minimax sequential decision processes the evaluation function of a stationary policy is the fixed point of the so-called policy evaluation operator which is a contraction mapping. Using this we prove that Bellman's principle of optimality holds. We also prove that the asynchronous value iteration algorithm converges to the optimal value function.

A generalized reinforcement-learning model: Convergence and applications M. L. Littman and Cs. Szepesvári
In Proceedings of the International Conference on Machine Learning '96, Bari, 1996. ps

Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
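A minimal sketch of the unification described above: the same value-iteration code covers the expected-value criterion and a worst-case criterion by swapping the operator that summarizes over next states. The two-state example and the particular operators are our own illustrative assumptions, not the paper's formal model.

    # Generalized value iteration with a pluggable next-state summary operator.
    import numpy as np

    P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s'] transition probabilities
                  [[0.5, 0.5], [0.9, 0.1]]])
    R = np.array([[1.0, 0.0],                    # R[s, a] immediate rewards
                  [0.0, 2.0]])
    gamma = 0.9

    def expected(p, v):      # standard MDP backup: expectation over next states
        return float(p @ v)

    def worst_case(p, v):    # game-like / robust backup: worst reachable next state
        return float(v[p > 0].min())

    def value_iteration(summary, iters=200):
        V = np.zeros(2)
        for _ in range(iters):
            V = np.array([max(R[s, a] + gamma * summary(P[s, a], V)
                              for a in range(2)) for s in range(2)])
        return V

    print("expected-value criterion:", np.round(value_iteration(expected), 3))
    print("worst-case criterion:    ", np.round(value_iteration(worst_case), 3))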

Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Cs. Szepesvári and M. L. Littman
Technical Report CS-96-11, Brown University, Department of Computer Science, Providence, RI, November 1996. slightly revised ps

Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

General Framework for Reinforcement Learning Cs. Szepesvári
In Proceedings of ICANN'95 Paris, France, Oct. 1995, Vol. II., pp. 165-170 ps

In this article we propose a general framework for sequential decision making. The framework is based on the observation that the derivation of the optimal behaviour under various decision criteria follows the same pattern: the cost of policies can be decomposed into the successive application of an operator that defines the related dynamic programming algorithm and this operator describes completely the structure of the decision problem. We take this mapping (the so called one step lookahead (OLA) cost mapping) as our starting point. This enables the unified treatment of various decision criteria (e.g. the expected value criterion or the worst-case criterion). The main result of this article says that under minimal conditions optimal stationary policies are greedy w.r.t. the optimal cost function and vice versa. Based on this result we feel that former results on reinforcement learning can be transferred to other decision criteria provided that the decision criterion is decomposable by an appropriate mapping.

Dynamic Concept Model Learns Optimal Policies Cs. Szepesvári
In Proceedings of IEEE WCCI ICNN'94 Vol. III. pp. 1738-1742. Orlando, Florida, June 1994. ps

Reinforcement learning is a flourishing field of neural methods. It has a firm theoretical basis and has been proven powerful in many applications. A brain model based alternative to RL has been introduced in the literature: it integrates artificial neural networks (ANN) and knowledge based (KB) systems into one unit or agent for goal oriented problem solving. The agent may possess inherited and learnt ANN and KB subsystems. The agent has and develops ANN cues to the environment for dimensionality reduction in order to ease the problem of combinatorial explosion. A dynamic concept model (DCM) was put forward that builds cue-models of the phenomena in the world, designs action sets (concepts) and makes them compete in a neural stage to come to a decision. The competition was implemented in the form of activation spreading (AS) and a winner-take-all mechanism. The efficiency of the algorithm has been demonstrated on several examples; however, the optimality of the algorithm has not yet been proven in general. Here, a restriction to Markov decision problems (MDPs) is treated, making it possible to show the equivalence of a special AS scheme and RL. The equivalence in this special case means that DCM has all the advantages of RL; moreover, it keeps track of more distinctions, allowing faster convergence and generalization.

Generalized Dynamic Concept Model as a Route to Construct Adaptive Autonomous Agents Zs. Kalmár, Cs. Szepesvári and A. Lõrincz
Neural Network World 3:353-360, 1995. ps

A model of adaptive autonomous agents that (i) builds an internal representation of events and event relations, (ii) utilizes activation spreading for building dynamic concepts and (iii) makes use of the winner-take-all paradigm to come to a decision is extended by introducing generalization into the model. The generalization reduces memory requirements and improves performance in unseen scenes, as indicated by computer simulations.

Generalization in an autonomous agent Zs. Kalmár, Cs. Szepesvári, and A. Lõrincz
In Proceedings of IEEE WCCI ICNN'94, Vol. III. pp. 1815-1817. Orlando, Florida, June 1994. ps

The key point in designing intelligent agents is the autonomous construction of a structured, intentional representation of the agent's environment. In this paper we address the problem of self-organized concept construction in reinforcement environments. The basis of our approach is a map-building algorithm that learns long-term expectations. The map consists of events of the world and edges among successful events. Events are statements about action results. The truth of these statements, however, is not binary because of uncertainties in the world. Since the knowledge of the agent is always limited, we also associate a confidence value with each statement. The map is kept as small as possible: competition among map-building nodes destroys outdated, unimportant nodes.

Properties of the map and the map evaluation are exploited for concept construction. An abduction process creates new concepts using situational (circumstance-based) analogy. Created concepts are stored in an indirect form as part of events. Because of competition among map-building nodes, only concepts that prove to be useful (not too general) are saved for long. Concepts are frame dependent, since concepts are used only at certain points of the world-map. This is thought to be a nice property: it prevents the wrong reuse of concepts. It is hoped that the map will contain more and more concepts and that some concepts will become widespread. Only concepts that prove to be useful may be reused. In this way the originally unstructured representation may turn into a structured one.

We demonstrate the viability of the approach by computer experiments in an artificial domain.

Behavior of an Adaptive Self-organizing Autonomous Agent Working with Cues and Competing Concepts Cs. Szepesvári and A. Lõrincz
Adaptive Behavior 2 (2):131-160, 1994. ps

The concepts of a neural model based shell that integrates artificial neural networks (ANN) and artificial intelligence (AI) for problem solving are presented. The shell may possess inherited and learnt ANN and AI subsystems. The shell has and develops (i) cues to the environment for dimensionality reduction, (ii) rules between elements of the reduced dimensional internal representation, (iii) `concepts' for achieving goals, i.e. for solving existing problems; (iv) the shell then causes the concepts to compete in order to come to a decision. The shell is designed for control problems, e.g. robotic tasks, control of plants, investment advisory systems, and may have very different ANN and AI parts. Here, we consider a simple robot-like object in two-dimensional space.

Machine Learning

Budgeted Distribution Learning of Belief Net Parameters

B. Póczos, R. Greiner, Cs. Szepesvári, L. Li
ICML-10. pdf

Most learning algorithms assume that a data set is given initially. We address the common situation where data is not available initially, but can be obtained, at a cost. We focus on learning Bayesian belief networks (BNs) over discrete variables. As such BNs are models of probabilistic distributions, we consider the "generative" challenge of learning the parameters, for a fixed structure, that best match the true distribution. We focus on the budgeted learning setting, where there is a known fixed cost c_i for acquiring the value of the i-th feature for any specified instance, and a known total budget that can be spent acquiring information. After formally defining this problem from a Bayesian perspective, we first consider allocation algorithms that must decide, before seeing any results, which features of which instances to probe. We show this is NP-hard, even if all variables are independent, then prove that the greedy allocation algorithm IGA is optimal when the costs are uniform and the features are independent, but can otherwise be sub-optimal. We then show that general (non-allocation) policies perform better, and explore the challenges of learning the parameters for general belief networks in this setting, describing conditions for when the obvious round-robin algorithm will, versus will not, work optimally. We also explore the effectiveness of this and various other heuristic algorithms.
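A minimal sketch of the round-robin allocation baseline mentioned above: cycle over the features, buying the current feature's value on the next instance, until the budget is exhausted. The probe() oracle, the costs and the budget are hypothetical placeholders of ours.

    # Round-robin allocation of probes under a fixed budget with known per-feature costs.
    from itertools import cycle
    import random

    costs = {"f1": 1.0, "f2": 2.0, "f3": 0.5}     # known acquisition cost of each feature
    budget = 20.0
    n_instances = 10

    def probe(instance: int, feature: str) -> int:
        """Placeholder oracle: returns the (binary) value of `feature` on `instance`."""
        return random.randint(0, 1)

    purchases = []                                 # collected (instance, feature, value) triples
    next_row = {f: 0 for f in costs}
    for feature in cycle(costs):
        if budget < min(costs.values()):           # cannot afford any feature any more
            break
        if budget < costs[feature]:                # skip features we can no longer afford
            continue
        i = next_row[feature] % n_instances
        purchases.append((i, feature, probe(i, feature)))
        next_row[feature] += 1
        budget -= costs[feature]

    print(len(purchases), "probes bought,", round(budget, 2), "budget left")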

A General Projection Property for Distribution Families

Y.-L. Yu, Y. Li, D. Schuurmans, Cs. Szepesvári
NIPS 22, 2009 (draft). pdf

Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes.

Learning to Segment from a Few Well-Selected Training Images

A. Farhangfar, R. Greiner, C. Szepesvári
(ICML-09), pp. 305-312, 2009. pdf

We address the task of actively learning a segmentation system: given a large number of unsegmented images, and access to an oracle that can segment a given image, decide which images to provide, to quickly produce a segmenter (here, a discriminative random field) that is accurate over this distribution of images. We extend standard models of active learning to define a system for this task that greedily selects, from the pool of unsegmented images, the most informative image, i.e., the one whose expected segmentation will most reduce the uncertainty of the remaining unlabeled images. The results of our experiments, over two real-world datasets (segmenting brain tumors within magnetic resonance images; and segmenting the sky in real images), show that training on very few informative images (here, as few as 2) can produce a segmenter that is as good as training on the entire dataset.

Manifold-Adaptive Dimension Estimation

Amir massoud Farahmand, Jean-Yves Audibert and Csaba Szepesvári
ICML-2007, pp. 265-272, 2007. pdf

Intuitively, learning should be easier when the data points lie on a low-dimensional submanifold of the input space. Recently there has been a growing interest in algorithms that aim to exploit such geometrical properties of the data. Oftentimes these algorithms require estimating the dimension of the manifold first. In this paper we propose an algorithm for dimension estimation and study its finite-sample behaviour. The algorithm estimates the dimension locally around the data points using nearest neighbor techniques and then combines these local estimates. We show that the rate of convergence of the resulting estimate is independent of the dimension of the input space and hence the algorithm is "manifold-adaptive". Thus, when the manifold supporting the data is low dimensional, the algorithm can be exponentially more efficient than its counterparts that are not exploiting this property. Our computer experiments confirm the obtained theoretical results.
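A minimal sketch of a local nearest-neighbour dimension estimate of this flavour, with the local estimates combined by averaging; the specific estimator below (based on the ratio of the k-th and ceil(k/2)-th neighbour distances) and the synthetic example are illustrative and may differ in details from the paper's algorithm.

    # Nearest-neighbour local dimension estimation: estimate the dimension around each
    # point from the ratio of its k-th and ceil(k/2)-th neighbour distances, then average.
    import numpy as np

    def estimate_dimension(X: np.ndarray, k: int = 10) -> float:
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
        dists.sort(axis=1)                                               # column 0 is the point itself
        r_k, r_half = dists[:, k], dists[:, int(np.ceil(k / 2))]
        local = np.log(2.0) / np.log(r_k / r_half)                       # local dimension estimates
        return float(np.mean(local))

    # Data on a 2-D plane embedded in R^10: the estimate should be close to 2.
    rng = np.random.default_rng(0)
    X = np.zeros((500, 10))
    X[:, :2] = rng.uniform(size=(500, 2))
    print(round(estimate_dimension(X), 2))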

Maximum Margin Discriminant Analysis based Face Recognition

K. Kornel, A. Kocsor, Cs. Szepesvari
HACIPPR-2005 (Joint Hungarian-Austrian Conference on Image Processing and Pattern Recognition), pp. 71-78, Dmitry Chetverikov, Laszlo Czuni, Markus Vincze (eds), Oesterreichische Computer Gesellschaft, 2005. pdf

Face recognition is a highly non-trivial classification problem since the input is high-dimensional and there are many classes with just a few examples per class. In this paper we propose using a recent algorithm -- Maximum Margin Discriminant Analysis (MMDA) -- to solve face recognition problems. MMDA is a feature extraction method that is derived from a set of sound principles: (i) each feature should maximize information transmission about the classification labels, (ii) only the decision boundary should determine the features and (iii) features should reveal independent information about the class labels. Previously, MMDA was shown to yield good performance scores on a number of standard benchmark problems. Here we show that MMDA is capable of finding good features in face recognition and performs very well provided it is preceded by an appropriate preprocessing phase.

Margin Maximizing Discriminant Analysis

A. Kocsor, K. Kornel, Cs. Szepesvari
ECML/PKDD-2004, 2004, pp. 227-238. pdf

We propose a new feature extraction method called Margin Maximizing Discriminant Analysis (MMDA) which seeks to extract features suitable for classification tasks. MMDA is based on the principle that an ideal feature should convey the maximum information about the class labels and it should depend only on the geometry of the optimal decision boundary and not on those parts of the distribution of the input data that do not participate in shaping this boundary. Further, distinct feature components should convey unrelated information about the data. Two feature extraction methods are proposed for calculating the parameters of such a projection that are shown to yield equivalent results. The kernel mapping idea is used to derive non-linear versions. Experiments with several real-world, publicly available data sets demonstrate that the new method yields competitive results.

Kernel Machine Based Feature Extraction Algorithm for Regression Problems

Cs. Szepesvari, K. Kornel, A. Kocsor
Proceedings of the 16th European Conference on Artificial Intelligence, ECAI'2004, 2004, pp. 1091-1092 pdf

In this paper we consider two novel kernel machine based feature extraction algorithms in a regression setting. The first method is derived based on the principles underlying the recently introduced Maximum Margin Discriminant Analysis (MMDA) algorithm. However, here it is shown that the orthogonalization principle employed by the original MMDA algorithm can be motivated using the well-known ambiguity decomposition, thus providing firm ground for the good performance of the algorithm. The second algorithm combines kernel machines with average derivative estimation and is derived from the assumption that the true regressor function depends only on a subspace of the original input space. The proposed algorithms are evaluated in preliminary experiments conducted with artificial and real datasets.

Ockham's razor modeling of the matrisome channels of the basal ganglia thalamocortical loop

A. Lorincz, Gy. Hevizi and Cs. Szepesvari
International Journal of Neural Systems 11:125-143, 2001. Obtain pdf here.

A functional model of the basal ganglia-thalamocortical (BTC) loops is described. In our modeling effort, we try to minimize the complexity of our starting hypotheses. For that reason, we call this type of modeling Ockham's razor modeling. We have the additional constraint that the starting assumptions should not contradict experimental findings about the brain. First assumption: The brain lacks direct representation of paths but represents directions (called speed fields in control theory). Then control should be concerned with speed-field tracking (SFT). Second assumption: Control signals are delivered upon differencing in competing parallel channels of the BTC loops. This is modeled by extending SFT with differencing that gives rise to the robust Static and Dynamic State (SDS) feedback-controlling scheme. Third assumption: Control signals are expressed in terms of a gelatinous medium surrounding the limbs. This is modeled by expressing parameters of motion in parameters of the external space. We show that corollaries of the model fit properties of the BTC loops. The SDS provides proper identification of motion related neuronal groups of the putamen. Local minima arise during the controlling process that works in external space. The model explains the presence of parallel channels as the means to avoiding such local minima. Stability conditions of the SDS predict that the initial phase of learning is mostly concerned with selection of sign for the inverse dynamics. The model provides a scalable controller. State description in external space instead of configurational space reduces the dimensionality problem. Falsifying experiment is suggested. Computer experiments demonstrate the feasibility of the approach. We argue that the resulting scheme has a straightforward connectionist representation exhibiting population coding and Hebbian learning properties.

Parallel and robust skeletonization built from self-organizing elements

Zs. Marczell, Cs. Szepesvári, Zs. Kalmar, and A. Lorincz
Neural Networks, 12:163-173, 1999. ps

A massively parallel neural architecture is suggested for the approximate computation of the skeleton of a planar shape. Numerical examples demonstrate the robustness of the method. The architecture is constructed from self-organizing elements that allow the extension of the concept of skeletonization to areas remote to image processing.

Approximate Geometry Representations and Sensory Fusion

Cs. Szepesvári and A. Lõrincz
Neurocomputing, 12:267-287, 1996. ps

This paper summarizes the recent advances in the theory of self-organizing development of approximate geometry representations based on the use of neural networks. Part of this work is based on the theoretical approach of (Szepesvari, 1993), which is different from that of (Martinetz, 1993) and also is somewhat more general. The Martinetz approach treats signals provided by artificial neuron-like entities whereas the present work uses the entities of the external world as its starting point. The relationship between the present work and the Martinetz approach will be detailed. We approach the problem of approximate geometry representations by first examining the problem of sensory fusion, i.e., the problem of fusing information from different transductors. A straightforward solution is the simultaneous discretization of the output of all transductors, which means the discretization of a space defined as the product of the individual transductor output spaces. However, the geometry relations are defined for the external world only, so it is still an open question how to define the metrics on the product of output spaces. It will be shown that simple Hebbian learning can result in the formation of a correct geometry representation. Some topological considerations will be presented to help us clarify the underlying concepts and assumptions. The mathematical framework gives rise to a corollary on the "topographical mappings" realized by Kohonen networks. In fact, the present work as well as (Martinetz, 1993) may be considered as a generalization of Kohonen's topographic maps. We develop topographic maps with self-organizing interneural connections.

Complexity of Learning: The Case of Everyday Neural Networks B. Oláh and Cs. Szepesvári
In Proceedings of IEEE WCCI ICNN'94 Vol. I. pp. 61-65. Orlando, Florida, June 1994. ps

Nowadays artificial neural networks (ANNs) receive increasing attention. However, current computer architectures do not yet allow the implementation of large ANNs. Thus it is an important question how the learning time of ANNs scales with their size (and/or with the size of the tasks). Judd has introduced a computational framework for the learning problem (J. Judd, "Neural Network Design and the Complexity of Learning." A Bradford Book, MIT Press, Cambridge, 1990) and proved that learning in neural networks is in general too hard, i.e. in the worst case learning in neural networks is NP-complete. However, in his proof he restricts the domain of neural network architectures and tasks in such a way that "everyday" neural network architectures, such as those of the back-propagation algorithm, are excluded. Consequently, Judd's proof does not tell us anything about these types of networks.

First we outline a thorough framework for loading. The framework makes it possible to differentiate between loading problems at a finer level. Two theorems are presented about the complexity of learning for "everyday" ANN architectures. The first theorem says that for extended binary tasks and in the worst case the loading problem is NP-complete, while the second says that for binary tasks and basis LUF there exists a polynomial time algorithm. From these results it seems that the loading problem for "everyday" neural networks is interesting from a mathematical point of view, as it lies on the boundary of efficiently solvable problems.

Topology learning solved by extended objects: a neural network model Cs. Szepesvári, L. Balázs, and A. Lõrincz
Neural Computation 6 (3):441-458, 1994. ps

It is shown that local, extended objects of a metrical topological space shape the receptive fields of competitive neurons to local filters. Self-organized topology learning is then solved with the help of Hebbian learning together with extended objects that provide unique information about neighborhood relations. A topographical map is deduced and is used to speed up further adaptation in a changing environment with the help of Kohonen type learning that teaches the neighbors of winning neurons as well.

Particle Filters, Visual Tracking
Illustrations (e.g. movies) of the results of the articles below can be found at the LS-N-IPS homepage.

Local Importance Sampling: A Novel Technique to Enhance Particle Filtering

P. Torma and Cs. Szepesvári
Journal of Multimedia, vol. 1, pp. 32-43, 2006 pdf

In the low observation noise limit particle filters become inefficient. In this paper a simple-to-implement particle filter is suggested as a solution to this well-known problem. The proposed Local Importance Sampling based particle filters draw the particles’ positions in a two-step process that makes use of both the dynamics of the system and the most recent observation. Experiments with the standard bearings-only tracking problem indicate that the proposed new particle filter method is indeed very successful when observations are reliable. Experiments with a high-dimensional variant of this problem further show that the advantage of the new filter grows with the increasing dimensionality of the system.

On using Likelihood-adjusted Proposals in Particle Filtering: Local Importance Sampling

P. Torma and Cs. Szepesvári
4th International Symposium on Image and Signal Processing and Analysis (ISPA 2005) pp. 58-63, 2005. pdf

An unsatisfactory property of particle filters is that they may become inefficient when the observation noise is low. In this paper we consider a simple-to-implement particle filter, called `LIS-based particle filter', whose aim is to overcome the above mentioned weakness. LIS-based particle filters sample the particles in a two-stage process that uses information of the most recent observation, too. Experiments with the standard bearings-only tracking problem indicate that the proposed new particle filter method is indeed a viable alternative to other methods.

Enhancing Particle Filters using Local Likelihood Sampling

P. Torma and Cs. Szepesvári
ECCV-2004, 2004, pp. 16-27. pdf

Particle filters provide a means to track the state of an object even when the dynamics and the observations are non-linear/non-Gaussian. However, they can be very inefficient when the observation noise is low as compared to the system noise, as it is often the case in visual tracking applications. In this paper we propose a new two-stage sampling procedure to boost the performance of particle filters under this condition. The new procedure is shown to reduce the variance of the weights by means of a theoretical analysis. This result is confirmed in a series of synthetic and real-world visual tracking experiments.

Sequential Importance Sampling for Visual Tracking Reconsidered

P. Torma and Cs. Szepesvári
Proceedings of 9th International Workshop on Artificial Intelligence and Statistics, editors: C. M. Bishop, B. J. Frey, pp. 271-278, 2003. pdf external link

We consider the task of filtering dynamical systems observed in noise by means of sequential importance sampling when the proposal is restricted to the innovation components of the state. It is argued that the unmodified sequential importance sampling/resampling (SIR) algorithm may yield high variance estimates of the posterior in this case, resulting in poor performance when, e.g., in visual tracking one tries to build a SIR algorithm on top of the output of a color blob detector. A new method that associates the innovations sampled from the proposal and the particles in a separate computational step is proposed. The method is shown to outperform the unmodified SIR algorithm in a series of vision based object tracking experiments, both in terms of accuracy and robustness.

Combining Local Search, Neural Networks and Particle Filters to Achieve Fast and Reliable Contour Tracking

P. Torma and Cs. Szepesvári
2003 IEEE International Symposium on Intelligent Signal Processing, pdf

LS-N-IPS is an extension of the standard N-IPS particle filter (also known as CONDENSATION in the image processing literature). The modified algorithm adds local search to the baseline algorithm: in each time step the predictions are refined in a local search procedure that utilizes the most recent observation. A critical choice in the design of LS-N-IPS is the way the local search is implemented. Here, we introduce a method based on training artificial neural networks for implementing the local search. In experiments with real-life data (visual tracking) the method is shown to improve robustness and performance significantly, surpassing the performance of previous state-of-the-art algorithms.

Towards Facial Pose Tracking

P. Torma and Cs. Szepesvári
In Proc. First Hungarian Computer Graphics and Geometry Conference Budapest, Hungary, pp. 10-16, 2002 pdf

This paper presents a novel facial-pose tracking algorithm using LS-N-IPS (Local Search N-Interacting Particle System), an algorithm that has been introduced recently by the authors. LS-N-IPS is a probabilistic tracking algorithm that keeps track of a number of alternative hypotheses at any time, the particles. LS-N-IPS has three components: a dynamical model, an observation model, and a local-search operator that has to be chosen by the algorithm designer. The main novelty of the algorithm presented here is that it relies on shading information to guide the local search procedure, the idea of the search being to apply a sort of Hough transformation to the mapping that renders poses to images. Here we introduce this algorithm and report results on the task of tracking synthetic facial masks using grey-scale image sequences.

Efficient Object Tracking in Video Sequences by means of LS-N-IPS

P. Torma and Cs. Szepesvári
In Proc. Second International Symposium on Image and Signal Processing and Analysis (ISAP'01), pp. 277-282, 2001 ps

A recently introduced particle filtering method, called LS-N-IPS, is considered for tracking objects in video sequences. LS-N-IPS is a computationally efficient particle filter that performs better than the standard N-IPS particle filter when observations are highly peaked, as is the case in visual object tracking problems with good observation models. An implementation of LS-N-IPS that uses B-spline based contour models is proposed and is shown to perform very well compared to similar state-of-the-art tracking algorithms.

LS-N-IPS: an Improvement of Particle Filters by Means of Local Search

P. Torma and Cs. Szepesvári
5th IFAC Symposium on Nonlinear Control Systems (NOLCOS'01), pp. 715-719, 2001. ps

A modification of N-IPS, a well known particle filter method is proposed and is shown to be more efficient than the baseline algorithm in the small sample size limit and when the observations are ``reliable''. The algorithm called LS-N-IPS adds local search to the baseline algorithm: in each time step the predictions are refined in a local search procedure that utilizes the most recent observation. The uniform stability of LS-N-IPS is studied and results of experiments are reported both for a simulated and a real-world (visual) tracking problem.
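A schematic sketch of the idea described above: a standard particle filter whose prediction step is followed by a local search that refines each particle using the most recent observation. The one-dimensional random-walk model, the Gaussian likelihood and the single gradient-like refinement step are our own placeholders, not the paper's contour-tracking setup.

    # Particle filter with a local-search refinement of the predicted particles
    # (predict -> refine towards the latest observation -> weight -> resample).
    import numpy as np

    rng = np.random.default_rng(0)
    N, T, sys_std, obs_std, step = 200, 50, 1.0, 0.1, 0.5

    def local_search(x, y):
        """Refine predictions using the most recent observation y (here: move towards y)."""
        return x + step * (y - x)

    x_true = 0.0
    particles = np.zeros(N)
    for t in range(T):
        x_true += rng.normal(0.0, sys_std)                      # hidden state: random walk
        y = x_true + rng.normal(0.0, obs_std)                   # low-noise observation
        particles = particles + rng.normal(0.0, sys_std, N)     # prediction step
        particles = local_search(particles, y)                  # local search step
        log_w = -0.5 * ((y - particles) / obs_std) ** 2         # Gaussian log-likelihood
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        particles = rng.choice(particles, size=N, p=w)          # resampling

    print("estimate:", round(float(particles.mean()), 3), "truth:", round(x_true, 3))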

Adaptive Control
This work has a homepage created by Mark French that you can find here.

Function Approximator based Control Designs for Strict Feedback Systems: LQ Performance and Scaling

M. French and Cs. Szepesvári
submitted for publication. ps

We consider a tracking problem for an uncertain strict feedback system, where the uncertainties are memoryless nonlinearities specified by weighted L2/L1 norms about a nominal system. These non-parametric uncertainties are converted to a parametric form by the use of function approximators, and adaptive backstepping designs are considered. It is shown that when multi-resolution function approximator structures are utilized, it is possible to achieve a transient LQ performance bound independent of the resolution of the approximator. In contrast, mono-resolution approximators can cause the LQ performance to diverge with approximator resolution.

LQ Performance Bounds for Adaptive Output Feedback Controllers for Functionally Uncertain Systems

M. French, Cs. Szepesvári, E. Rogers
Automatica (to appear 2002). ps

We consider functionally uncertain systems which can be written in an output feedback form, where the nonlinearities are functions of the output only. The uncertainty is described by a weighted L2 norm about a nominal system, and an approximate adaptive design is given which ensures output practical stability. The main result requires knowledge of the weighted L2 uncertainty level. An upper bound on the LQ performance of the output transient and the control input is derived, where the cost penalises the output transient and the control effort on the time interval where the output lies outside the prescribed neighbourhood of zero to which we achieve convergence.

An Asymptotic Scaling Analysis of LQ performance for an Approximate Adaptive Control Design

M. French, Cs. Szepesvári, E. Rogers
Mathematics of Control, Signals and Systems (to appear 2002). ps

We consider the adaptive tracking problem for a chain of integrators, where the uncertainty is static and functional. The uncertainty is specified by L2/L1 or weighted L2/L1 norm bounds. We analyse a standard Lyapunov based adaptive design which utilizes a function approximator to induce a parametric uncertainty, on which the adaptive design is completed. Performance is measured by a modified LQ cost functional, penalising both the tracking error transient and the control effort. With such a cost functional, it is shown that a standard control design has divergent performance when the resolution of a `mono-resolution' approximator is increased. The class of `mono-resolution' approximators includes models popular in applications. A general construction of a class of approximators and their associated controllers which have a uniformly bounded performance independent of the resolution of the approximator is given.

Uncertainty, Performance, and Model Dependency in Approximate Adaptive Nonlinear Control

M. French, Cs. Szepesvári, E. Rogers
IEEE Transactions on Automatic Control 45:2, pp. 353-358, 2000. ps

We consider systems satisfying a matching condition which are functionally known up to weighted L2 and L1 measures of uncertainty. A modified LQ measure of control and state transient performance is given, and the performance of a class of approximate model based adaptive controllers is studied. An upper performance bound is derived in terms of the uncertainty models (stability and the state transient bounds require only the L2 uncertainty model; control effort bounds require both L2 and L1 uncertainty models), and various structural properties of the model basis. Sufficient conditions are given to ensure that the performance is bounded independently of the model basis size.

Uncertainty and Performance of Adaptive Controllers for Functionally Uncertain Output Feedback Systems

M. French, Cs. Szepesvári and E. Rogers
In Proc. of the 1998 IEEE Conference on Decision and Control, 1998. ps

We consider nonlinear systems in an output feedback form which are functionally known up to a L2 measure of uncertainty. The control task is to drive the output of the system to some neighbourhood of the origin. A modified L2 measure of transient performance (penalising both state and control effort) is given, and the performance of a class of model based adaptive controllers is studied. An upper performance bound is derived.

Uncertainty, Performance and Model Dependency in Approximate Adaptive Nonlinear Control M. French, Cs. Szepesvári and E. Rogers
In Proc. of the 1997 IEEE Conference on Decision and Control, San Diego, California, 1997. ps

We consider systems satisfying a matching condition which are functionally known up to a L2 measure of uncertainty. A modified L2 performance measure is given, and the performance of a class of model based adaptive controllers is studied. An upper performance bound is derived in terms of the uncertainty measure and measures of the approximation error of the model. Asymptotic analyses of the bounds under increasing model size are undertaken, and sufficient conditions are given on the model that ensure the performance bounds are bounded independent of the model size.

Uncertainty, Performance and Model Dependency in Approximate Adaptive Nonlinear Control M. French, Cs. Szepesvári and E. Rogers
IEEE Trans. on Automatic Control (accepted for publication), 1997. ps

We consider systems satisfying a matching condition which are functionally known up to weighted L2 and L-infinity measures of uncertainty. A modified LQ measure of control and state transient performance is given, and the performance of a class of approximate model based adaptive controllers is studied. An upper performance bound is derived in terms of the uncertainty models (stability and the state transient bounds require only the L2 uncertainty model; control effort bounds require both L2 and L-infinity uncertainty models), and various structural properties of the model basis. Sufficient conditions are given to ensure that the performance is bounded independently of the model basis size.

Robust & Neuro-Control

Neurocontroller using dynamic state feedback for compensatory control

Cs. Szepesvári, Sz. Cimmer and A. Lõrincz
Neural Networks, 10:1691-1708,1997 ps

A common technique in neurocontrol is that of controlling a plant by static state feedback using the plant's inverse dynamics, which is approximated through a learning process. It is well known that in this control mode even small approximation errors or, which is the same, small perturbations of the plant may lead to instability. Here, a novel approach is proposed to overcome the problem of instability by using the inverse dynamics both for the Static and for the error compensating Dynamic State feedback control. This scheme is termed SDS Feedback Control. It is shown that as long as the error of the inverse dynamics model is ``signproper'' the SDS Feedback Control is stable, i.e., the error of tracking may be kept small. The proof is based on a modification of Liapunov's second method. The problem of on-line learning of the inverse dynamics when using the controller simultaneously for both forward control and for dynamic feedback is dealt with, as are questions related to noise sensitivity and robust control of robotic manipulators. Simulations of a simplified sensorimotor loop serve to illustrate the approach.

Approximate inverse-dynamics based robust control using static and dynamic feedback Cs. Szepesvári and A. Lõrincz
in Applications of Neural Adaptive Control Technology, 151-197 (1997) ps

It is rigorously shown that inverse-dynamics models can be used to stabilize plants of any order provided that the inverse-dynamics model is used in a mixed-mode fashion, namely in a `Static and Dynamic' State-feedback (SDS) mode. When the resulting controller is used for tracking, increasing the gain of the dynamic feedback decreases the tracking error. Yet another attractive feature of the SDS scheme is that the inverse-dynamics model can be tuned on-line by {\em any} adaptation mechanism without compromising stability, provided that the conditions of the non-adaptive stability theorem hold at every time instant. Computer simulations of the control of a chaotic bioreactor and a `realistic' robotic manipulator demonstrate the robustness of the approach. It is shown that SDS control will yield zero asymptotic error when controlling the bioreactor using an inverse-dynamics model which, when used in a traditional mode, would yield intolerably large errors. In the case of the robotic arm simulations the effects of perturbation and sampling frequency are investigated and the SDS control is compared with the non-adaptive computed torque method. A fully self-organizing associative neural network architecture that can be used to approximate the inverse dynamics in the form of a Position-and-Direction-to-Action (PDA) map is also described. Similarities between the basal ganglia - thalamocortical loops and the SDS scheme are discussed and it is argued that the SDS scheme could be viewed as a model of higher order motor functions of these areas.

Integrated architecture for motion control and path planning Cs. Szepesvári and A. Lõrincz
Journal of Robotic Systems, 15:1-15, 1997. ps

We consider the problem of learning to control a plant with non-linear control characteristics and solving the path planning problem at the same time. The solution is based on a path planning model that designates a speed field to be tracked, the speed field being the gradient of the stationary solution of a diffusion process. Diffusion is simulated on an artificial neural network by spreading activation. Interneurons between neighboring discretizing neurons detect the strength of the activity flow and emit control signals to control neurons via modifiable connections. The proposed method may be used for learning redundant control problems. The architecture integrates reactive path planning and continuous motion control in a natural fashion.

Inverse Dynamics Controllers for Robust Control: Consequences for Neurocontrollers Cs. Szepesvári and A. Lõrincz
In Proceedings of ICANN'96. Bochum, 1996., pp. 697-702 ps

It is proposed that controllers that approximate the inverse dynamics of the controlled plant can be used for on-line compensation of changes in the plant's dynamics. The idea is to use the very same controller in two modes at the same time: both for static and dynamic feedback. Implications for the learning of neurocontrollers are discussed. The proposed control mode relaxes the demand of precision and as a consequence, controllers that utilise direct associative learning by means of local function approximators may become more tractable in higher dimensional spaces.

Neurocontrol I: Self-organizing speed-field tracking Cs. Szepesvári and A. Lõrincz
Neural Network World 6:875-896, 1996. ps

The problems of controlling a plant while avoiding obstacles and experiencing perturbations in the plant's dynamics are considered. It is assumed that the plant's dynamics is not known in advance. To solve this problem a self-organizing artificial neural network (ANN) solution is advanced here. The ANN consists of various parts. The first part discretizes the state space of the plant and also learns the geometry of the state space. The learnt geometrical relations are represented by lateral connections. These connections are utilized for planning a speed field, allowing collision-free motion. The speed field is defined over the neural representation of the state space and is transformed into control signals with the help of interneurons associated with the lateral connections: connections between interneurons and control neurons encode the inverse dynamics of the plant. These connections are learnt during a direct system inverse identification process by Hebbian learning. Theoretical results and computer experiments show the robustness of the approach.

Neurocontrol II: High precision control achieved using approximate inverse dynamics models Cs. Szepesvári and A. Lõrincz
Neural Network World 6:897-920, 1996. ps

It is common that artificial neural networks (ANNs) are used for approximating the inverse dynamics of a plant. In the accompanying paper a self-organising ANN model for associative identification of the inverse dynamics was introduced. Here we propose the use of approximate inverse dynamics models for both Static and Dynamic State (SDS) feedback control. This compound controller is capable of high-precision control even when the inverse dynamics is only qualitatively modeled or the plant's dynamics is perturbed. Properties of the SDS Feedback Controller in learning the inverse dynamics, as well as comparisons with other methods, are discussed. An example is presented in which a chaotic plant, a bioreactor, is controlled using the SDS Controller. We found that the SDS Controller can compensate for model mismatches that would otherwise lead to an intolerably large error if a traditional controller were used.

Self-organizing multi-resolution grid for motion planning and control T. Fomin, T. Rozgonyi, Cs. Szepesvári and A. Lõrincz
International Journal of Neural Systems 7:757-776, 1996. ps

A fully self-organizing neural network approach to low-dimensional control problems is described. We consider the problem of learning to control an object and solving the path planning problem at the same time. Control is based on a path planning model that follows the gradient of the stationary solution of a diffusion process working in the state space. Previous works are extended by introducing a self-organizing multigrid-like discretizing structure to represent the external world. Diffusion is simulated within a recurrent neural network built on this multigrid system. The novelty of the approach is that the diffusion on the multigrid is fast. Moreover, the diffusion process on the multigrid fits the requirements of path planning well: it accelerates the diffusion in large free-space regions while still keeping the resolution in small bottleneck-like labyrinths along the path. Control is achieved in the usual way: associative learning identifies the inverse dynamics of the system in a direct fashion. To this end, interneurons are introduced between neighbouring discretizing units that detect the strength of the steady-state diffusion and forward control commands to the control neurons via modifiable connections. This architecture forms the Multigrid Position-and-Direction-to-Action (MPDA) map. The architecture integrates reactive path planning and continuous motion control. It is also shown that the scheme leads to population coding for the actual command vector.

Self-Organizing Neurocontrol T. Fomin, Cs. Szepesvári and A. Lőrincz
In Proceedings of IEEE WCCI ICNN'94 Vol. V. pp. 2777-2780. Orlando, Florida, June 1994. ps

Self-organizing neural network solutions to control problems are described. Competitive networks create spatial filters and geometry connections in a self-organizing fashion. The goal position, the obstacles and the object under control all create neural activities through these filters. Spreading activation that discriminates between the controlled object, the goal position and the obstacles is utilized on the internal representation. A local self-training method and Hebbian learning develop the self-organizing control connections. The algorithm provides manoeuvring capability in unseen scenes.

Bioinformatics and other exciting stuff
An experimental server using the protein functional domain prediction algorithm described below is available here (Aug 2002: under version upgrade).

Prediction of Protein Functional Domains from Sequences Using Artificial Neural Networks

J. Murvai, K. Vlahovicek, Cs. Szepesvari, S. Pongor
Genome Research, 11:8, pp. 1410-1417, 2001. pdf local pdf html

An artificial neural network (ANN) solution is described for the recognition of domains in protein sequences. A query sequence is first compared to a reference database of domain sequences, and the output data, encoded in the form of six parameters, are forwarded to feed-forward artificial neural networks with six input and six hidden units with a sigmoidal transfer function. The recognition is based on the distribution of scores precomputed for the known domain groups in a database-versus-database comparison. Applications to the prediction of function are discussed.
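
To make the stated set-up concrete, here is a minimal sketch of a 6-6-1 feed-forward network with sigmoidal units trained by gradient descent; the data and labels are synthetic placeholders, not the comparison-derived parameters or the published model.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    # Six comparison-derived input parameters per query (placeholder random data) and a
    # binary label: does the query contain the domain or not (synthetic rule as stand-in).
    X = rng.normal(size=(200, 6))
    y = (X[:, :3].sum(axis=1) > 0).astype(float)

    # 6-6-1 feed-forward network with sigmoidal units, trained by plain gradient descent.
    W1, b1 = rng.normal(scale=0.5, size=(6, 6)), np.zeros(6)
    W2, b2 = rng.normal(scale=0.5, size=6), 0.0
    lr = 0.5

    for _ in range(3000):
        h = sigmoid(X @ W1 + b1)                   # hidden layer activations
        p = sigmoid(h @ W2 + b2)                   # predicted probability of a domain hit
        g = (p - y) / len(y)                       # cross-entropy gradient wrt output logits
        W2 -= lr * h.T @ g
        b2 -= lr * g.sum()
        gh = np.outer(g, W2) * h * (1 - h)         # backpropagation through the hidden layer
        W1 -= lr * X.T @ gh
        b1 -= lr * gh.sum(axis=0)

    p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    print(((p > 0.5) == (y > 0.5)).mean())         # training accuracy of the toy model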

The SBASE protein domain sequence library release 6.0.

J. Murvai, E. Barta, K. Vlahovicek, Cs. Szepesvári, C. Acatrinei, S. Pongor
Nucleic Acids Research. 27:1, pp. 257-259, 1999.

sorry, no abstract, no ps,..

Prediction of Protein Domain-Types by Backpropagation

J. Murvai, Cs. Szepesvári, Cs. Bachrati and S. Pongor
Technical Report TR-98-117, "Attila József" University, Research Group on Artificial Intelligence, Szeged, HU-6700 (submitted in 1997) ps

We propose a neural net solution for the recognition of the domain types of proteins, which is a hard and important problem in biology. We have found that, using a clever preprocessing technique, relatively small neural networks perform surprisingly well. The performance of the neural nets was measured by cross-validation, and Hoeffding's inequality was utilized to derive a confidence interval for the estimates.
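
For reference, a Hoeffding-based confidence interval for an error rate estimated from n independent test cases is straightforward to compute; the numbers below are placeholders, not the paper's results.

    import math

    # Hoeffding's inequality: for n i.i.d. test predictions, the true error rate lies within
    # +/- eps of the measured rate with probability at least 1 - delta, where
    # eps = sqrt(ln(2/delta) / (2*n)).
    n, measured_error, delta = 500, 0.08, 0.05     # placeholder values, not the paper's figures
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    print(f"error in [{measured_error - eps:.3f}, {measured_error + eps:.3f}] w.p. >= {1 - delta}")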

Automated detection and classification of microcalcifications in mammograms using artificial neural nets E. Sorantin, F. Schmidt, H. Mayer, P. Winkler, Cs. Szepesvári, E. Graif, E. Schuetz
In "4TH INTERNATIONAL WORKSHOP ON DIGITAL MAMMOGRAPHY" (homepage here), 1998.

abstract on IWDM'98 homepage; local html copy

The goal of this study was to develop a fully automated computer application for the detection and classification of clustered microcalcifications using artificial neural nets (ANNs). Cases where additional investigations are necessary should also be identified automatically.

FlexVoice: A Parametric Approach to High-Quality Speech Synthesis Gy. Balogh, E. Dobler, T. Grobler, B. Smodics, and Cs. Szepesvari.
Presented at IEE Seminar on State-of-the-Art in Speech Synthesis, London, April 2000. pdf also on this page.

The TTS system described in this paper is based on the analysis and resynthesis of a given speaker's voice. FlexVoice provides great flexibility in changing voice properties independently of the vocal tract parameters. This flexibility can be demonstrated by a number of voice conversions, including female-to-male and female-to-child conversions. FlexVoice uses only a fraction of the resources of a PC, and its quality is comparable to that of the leading TTS systems.

FlexVoice: A Parametric Approach to High-Quality Speech Synthesis (II) Gy. Balogh, E. Dobler, T. Grobler, B. Smodics, Cs. Szepesvari.
In Text, Speech and Dialogue, pp. 189-194, 2000. (conference page can be found here.)

FlexVoice, an integrated text-to-speech (TTS) system, is presented in this paper. Its most distinctive feature is its low memory and CPU load while preserving the high quality of leading TTS systems. FlexVoice uses a hybrid approach that combines diphone concatenation with LPC-based parametric synthesis. Major improvements in speech quality are achieved by the careful design of each module at all synthesis levels (such as the selection of training data for the various machine learning methods and of the basic synthesis units for the parametric synthesiser). FlexVoice currently supports US English with two male and two female voices.
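
A bare-bones sketch of the LPC-based parametric side of such a hybrid system is given below; the pitch, sampling rate and filter resonances are invented, and FlexVoice's actual analysis/resynthesis pipeline is not reproduced here.

    import numpy as np
    from scipy.signal import lfilter

    # LPC parametric synthesis in a nutshell: a periodic pulse excitation (the voiced source)
    # is shaped by an all-pole filter whose coefficients would normally come from LPC analysis
    # of the recorded units.
    fs, f0, dur = 16000, 120.0, 0.5                      # sample rate, pitch, duration (assumed)
    n = int(fs * dur)

    excitation = np.zeros(n)
    excitation[::int(fs / f0)] = 1.0                     # impulse train at the pitch period

    # Invented resonances (roughly vowel-like formants) used to build a stable all-pole filter.
    poles = np.array([0.97 * np.exp(2j * np.pi * 700 / fs),
                      0.95 * np.exp(2j * np.pi * 1200 / fs)])
    poles = np.concatenate([poles, poles.conj()])        # conjugate pairs give real coefficients
    a = np.real(np.poly(poles))                          # denominator A(z) of the synthesis filter

    speech = lfilter([1.0], a, excitation)               # all-pole (1/A(z)) synthesis
    speech /= np.max(np.abs(speech))                     # normalise; write to a wav file to listen
    print(speech.shape)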

Copyright notice: Papers are presented and may be downloaded to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each copyright holder. In most cases, these works may not be reposted without explicit permission of the copyright holder.