publications
2022

Towards Robust Waveform-Based Acoustic Models Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu IEEE/ACM Transactions on Audio, Speech, and Language Processing [abstract] [arXiv] [doi] [pdf]
We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.
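A minimal sketch of the vicinal risk minimization view described above: each training point's delta function is replaced by a Gaussian vicinity from which augmented samples are drawn during training. The isotropic Gaussian and the function names here are illustrative assumptions; in the paper the mixture components are specified implicitly through waveform-level augmentation schemes rather than explicit additive Gaussian noise.

```python
import random

def vicinal_sample(x, sigma=0.05, rng=random):
    """Draw one sample from the isotropic Gaussian vicinity N(x, sigma^2 I),
    the simplest instance of a mixture-of-Gaussians vicinal density."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def vicinal_batch(batch, sigma=0.05, draws=4):
    """Expand a mini-batch by drawing `draws` samples from each vicinity,
    so the empirical risk is estimated over vicinal densities, not deltas."""
    return [vicinal_sample(x, sigma) for x in batch for _ in range(draws)]
```

Training on `vicinal_batch(batch)` instead of `batch` is the vicinal risk estimate in its simplest form; the choice of vicinity is what encodes the inductive bias.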

Graph Neural Networks with Adaptive Readouts David Buterez, Jon Paul Janet, Steven J. Kiddle, Dino Oglic, Pietro Lio Advances in Neural Information Processing Systems [abstract] [pdf]
An effective aggregation of node features into a graph-level representation via readout functions is an essential step in numerous learning tasks involving graph neural networks. Typically, readouts are simple and non-adaptive functions designed such that the resulting hypothesis space is permutation invariant. Prior work on deep sets indicates that such readouts might require complex node embeddings that can be difficult to learn via standard neighborhood aggregation schemes. Motivated by this, we investigate the potential of adaptive readouts given by neural networks that do not necessarily give rise to permutation-invariant hypothesis spaces. We argue that in some problems, such as binding affinity prediction where molecules are typically presented in a canonical form, it might be possible to relax the constraints on permutation invariance of the hypothesis space and learn a more effective model of the affinity by employing an adaptive readout function. Our empirical results demonstrate the effectiveness of neural readouts on more than 40 datasets spanning different domains and graph characteristics. Moreover, we observe a consistent improvement over standard readouts (i.e., sum, max, and mean) relative to the number of neighborhood aggregation iterations and different convolutional operators.
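To make the contrast concrete, here is a small numpy sketch (an illustration only, not the architectures evaluated in the paper): a standard sum readout is permutation invariant by construction, while an adaptive readout that flattens the node embeddings in their canonical order and feeds them through an MLP is not. The weight shapes and names below are assumptions for the sketch.

```python
import numpy as np

def sum_readout(H):
    """Standard permutation-invariant readout: sum over the node axis."""
    return H.sum(axis=0)

def mlp_readout(H, W1, b1, W2, b2):
    """Adaptive readout: flatten node embeddings (canonical node order
    assumed) and pass them through a small MLP; not permutation invariant."""
    z = H.reshape(-1)                    # (n_nodes * d,)
    h = np.maximum(W1 @ z + b1, 0.0)     # ReLU hidden layer
    return W2 @ h + b2                   # graph-level representation
```

Permuting the rows of `H` leaves `sum_readout` unchanged but, in general, changes the output of `mlp_readout`, which is exactly the relaxation of permutation invariance discussed above.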
2021

Learning Waveform-Based Acoustic Models using Deep Variational Convolutional Neural Networks Dino Oglic, Zoran Cvetkovic, Peter Sollich IEEE/ACM Transactions on Audio, Speech, and Language Processing [abstract] [arXiv] [doi] [pdf] [code]
We investigate the potential of stochastic neural networks for learning effective waveform-based acoustic models. The waveform-based setting, inherent to fully end-to-end speech recognition systems, is motivated by several comparative studies of automatic and human speech recognition that associate standard non-adaptive feature extraction techniques with information loss which can adversely affect robustness. Stochastic neural networks, on the other hand, are a class of models capable of incorporating rich regularization mechanisms into the learning process. We consider a deep convolutional neural network that first decomposes speech into frequency subbands via an adaptive parametric convolutional block where filters are specified by cosine modulations of compactly supported windows. The network then employs standard non-parametric 1D convolutions to extract relevant spectro-temporal patterns while gradually compressing the structured high-dimensional representation generated by the parametric block. We rely on a probabilistic parametrization of the proposed neural architecture and learn the model using stochastic variational inference. This requires evaluation of an analytically intractable integral defining the Kullback–Leibler divergence term responsible for regularization, for which we propose an effective approximation based on the Gauss–Hermite quadrature. Our empirical results demonstrate a superior performance of the proposed approach over comparable waveform-based baselines and indicate that it could lead to robustness. Moreover, the approach outperforms a recently proposed deep convolutional neural network for learning of robust acoustic models with standard FBANK features.
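The Gauss–Hermite approximation mentioned above can be illustrated in a few lines. The sketch below approximates a generic one-dimensional Gaussian expectation; the actual integrand in the paper is the Kullback–Leibler term of the variational objective, so the function f here is a placeholder.

```python
import numpy as np

def gauss_hermite_expectation(f, mu, sigma, order=32):
    """Approximate E[f(x)] for x ~ N(mu, sigma^2) by Gauss-Hermite quadrature.

    With the substitution x = mu + sqrt(2)*sigma*t, the Gaussian expectation
    becomes (1/sqrt(pi)) * sum_i w_i * f(mu + sqrt(2)*sigma*t_i), where t_i
    and w_i are the Hermite quadrature nodes and weights."""
    t, w = np.polynomial.hermite.hermgauss(order)
    return float(np.sum(w * f(mu + np.sqrt(2.0) * sigma * t)) / np.sqrt(np.pi))
```

For polynomial or smooth integrands (e.g. f(x) = x² under N(0, 1), whose expectation is 1) the rule is accurate to machine precision at modest orders, which is what makes it a practical surrogate for an analytically intractable KL term.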

Towards a Unified Analysis of Random Fourier Features Zhu Li, Jean-Francois Ton, Dino Oglic, Dino Sejdinovic Journal of Machine Learning Research, Volume 22 (Issue 108), pages 1–51 [abstract] [pdf]
Random Fourier features is a widely used, simple, and effective technique for scaling up kernel methods. The existing theoretical analysis of the approach, however, remains focused on specific learning tasks and typically gives pessimistic bounds which are at odds with the empirical results. We tackle these problems and provide the first unified risk analysis of learning with random Fourier features using the squared error and Lipschitz continuous loss functions. In our bounds, the trade-off between the computational cost and the learning risk convergence rate is problem-specific and expressed in terms of the regularization parameter and the number of effective degrees of freedom. We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. As ridge leverage scores are expensive to compute, we devise a simple approximation scheme which provably reduces the computational cost without loss of statistical efficiency. Our empirical results illustrate the effectiveness of the proposed scheme relative to the standard random Fourier features method.
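The standard construction can be sketched as follows: sample frequencies from the spectral density of the kernel and map inputs through random cosines, so that inner products of the features approximate kernel evaluations. This is a minimal illustration for the Gaussian kernel only; the data-dependent leverage-score sampling variant analyzed in the paper is not shown.

```python
import numpy as np

def rff_features(X, n_features, gamma=1.0, seed=0):
    """Map X of shape (n, d) to random Fourier features whose inner products
    approximate the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2).

    The spectral density of this kernel is N(0, 2*gamma*I), so frequencies
    are drawn with standard deviation sqrt(2*gamma)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

With Z = rff_features(X, D), the Gram matrix Z @ Z.T converges to the exact kernel matrix at the usual O(1/sqrt(D)) Monte Carlo rate; the paper's contribution is showing how many features are actually needed to preserve the learning risk, which can be far fewer.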

Towards Robust Waveform-Based Acoustic Models Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu Technical report, arXiv:2110.08634 [abstract] [arXiv]
We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.
2020

A Deep 2D Convolutional Network for Waveform-Based Speech Recognition Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals In Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH) [abstract] [pdf] [code]
Due to limited computational resources, acoustic models of early automatic speech recognition (ASR) systems were built in low-dimensional feature spaces that incur considerable information loss at the outset of the process. Several comparative studies of automatic and human speech recognition suggest that this information loss can adversely affect the robustness of ASR systems. To mitigate that and allow for learning of robust models, we propose a deep 2D convolutional network in the waveform domain. The first layer of the network decomposes waveforms into frequency subbands, thereby representing them in a structured high-dimensional space. This is achieved by means of a parametric convolutional block defined via cosine modulations of compactly supported windows. The next layer embeds the waveform in an even higher-dimensional space of high-resolution spectro-temporal patterns, implemented via a 2D convolutional block. This is followed by a gradual compression phase that selects the most relevant spectro-temporal patterns using wide-pass 2D filtering. Our results show that the approach significantly outperforms alternative waveform-based models on both noisy and spontaneous conversational speech (24% and 11% relative error reduction, respectively). Moreover, this study provides empirical evidence that learning directly from the waveform domain could be more effective than learning using handcrafted features.

Deep Scattering Power Spectrum Features for Robust Speech Recognition Neethu M. Joy, Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals In Proceedings of the 21st Annual Conference of International Speech Communication Association (INTERSPEECH) [abstract] [pdf]
Deep scattering spectrum consists of a cascade of wavelet transforms and modulus nonlinearities. It generates features of different orders, with the first-order coefficients approximately equal to the Mel-frequency cepstrum, and higher-order coefficients recovering information lost at lower levels. We investigate the effect of including the information recovered by higher-order coefficients on the robustness of speech recognition. To that end, we also propose a modification of the original scattering transform tailored to noisy speech. In particular, instead of the modulus nonlinearity we opt to work with power coefficients and, therefore, use the squared modulus nonlinearity. We quantify the robustness of scattering features using the word error rates of acoustic models trained on clean speech and evaluated using sets of utterances corrupted with different noise types. Our empirical results show that the second-order scattering power spectrum coefficients capture invariants relevant for noise robustness and that this additional information improves generalization to unseen noise conditions (almost 20% relative error reduction on AURORA4). This finding can have important consequences for speech recognition systems that typically discard the second-order information and keep only the first-order features (known for emulating MFCC and FBANK values) when representing speech.
2019

Scalable Learning in Reproducing Kernel Krein Spaces Dino Oglic, Thomas Gärtner In Proceedings of the 36th International Conference on Machine Learning (ICML) [abstract] [pdf] [code]
We provide the first mathematically complete derivation of the Nyström method for low-rank approximation of indefinite kernels and propose an efficient method for finding an approximate eigendecomposition of such kernel matrices. Building on this result, we devise highly scalable methods for learning in reproducing kernel Kreĭn spaces. The devised approaches provide a principled and theoretically well-founded means to tackle large-scale learning problems with indefinite kernels. The main motivation for our work comes from problems with structured representations (e.g., graphs, strings, time series), where it is relatively easy to devise a pairwise (dis)similarity function based on intuition and/or knowledge of domain experts. Such functions are typically not positive definite and it is often well beyond the expertise of practitioners to verify this condition. The effectiveness of the devised approaches is evaluated empirically using indefinite kernels defined on structured and vectorial data representations.
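A compact sketch of the core computation, under the assumption that one keeps the landmark eigenpairs of largest magnitude rather than clipping negative eigenvalues (the paper's complete derivation covers more, including the approximate eigendecomposition of the full matrix; the function name and rank-selection rule here are illustrative):

```python
import numpy as np

def nystrom_indefinite(K_nm, K_mm, rank):
    """Nystrom approximation K ~ K_nm * pinv(K_mm) * K_nm.T for a symmetric,
    possibly indefinite kernel. The `rank` eigenpairs of the landmark block
    with largest magnitude are kept; negative eigenvalues are retained,
    not discarded, which is what distinguishes the indefinite case."""
    lam, U = np.linalg.eigh(K_mm)
    idx = np.argsort(np.abs(lam))[::-1][:rank]
    lam, U = lam[idx], U[:, idx]
    C = K_nm @ U / lam            # approximate (unnormalised) eigenvectors
    return C @ np.diag(lam) @ C.T
```

When all points serve as landmarks and the full rank is kept, the approximation reproduces the original matrix exactly; lowering the rank truncates the spectrum by magnitude on both its positive and negative parts.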

Towards a Unified Analysis of Random Fourier Features Zhu Li, Jean-Francois Ton, Dino Oglic, Dino Sejdinovic In Proceedings of the 36th International Conference on Machine Learning (ICML) Best Paper Award – Honourable Mention [abstract] [pdf]
Random Fourier features is a widely used, simple, and effective technique for scaling up kernel methods. The existing theoretical analysis of the approach, however, remains focused on specific learning tasks and typically gives pessimistic bounds which are at odds with the empirical results. We tackle these problems and provide the first unified risk analysis of learning with random Fourier features using the squared error and Lipschitz continuous loss functions. In our bounds, the trade-off between the computational cost and the expected risk convergence rate is problem-specific and expressed in terms of the regularization parameter and the number of effective degrees of freedom. We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. As ridge leverage scores are expensive to compute, we devise a simple approximation scheme which provably reduces the computational cost without loss of statistical efficiency.

Bayesian Parznets for Robust Speech Recognition in the Waveform Domain Dino Oglic, Zoran Cvetkovic, Peter Sollich Technical report, arXiv:1906.09526 [abstract] [pdf] [code]
We propose a novel family of band-pass filters for efficient spectral decomposition of signals. Previous work has already established the effectiveness of representations based on static band-pass filtering of speech signals (e.g., mel-frequency cepstral coefficients and deep scattering spectrum). A potential shortcoming of these approaches is the fact that the parameters specifying such a representation are fixed a priori and not learned using the available data. To address this limitation, we propose a family of filters defined via cosine modulations of Parzen windows, where the modulation frequency models the center of a spectral band-pass filter and the length of a Parzen window is inversely proportional to its bandwidth. We propose to learn these filters as part of a multi-layer convolutional operator using stochastic variational inference based on Gaussian dropout posteriors and sparsity-inducing priors. Such a prior leads to an intractable integral defining the Kullback–Leibler divergence term for which we propose an effective approximation based on the Gauss–Hermite quadrature. Our empirical results demonstrate that modulation filter learning can be statistically significantly more effective than static band-pass filtering on continuous speech recognition from raw speech. This is also the first work to achieve state-of-the-art results on speech recognition using variational inference.
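The filter family itself can be sketched directly. This is illustrative only: here the modulation frequency and window length are fixed by hand, whereas the paper learns them variationally; the function names are assumptions of the sketch.

```python
import numpy as np

def parzen_window(length):
    """Symmetric Parzen window: a piecewise-cubic B-spline taper that is
    compactly supported and peaks at the centre."""
    n = np.arange(length) - (length - 1) / 2.0
    x = np.abs(n) / (length / 2.0)
    return np.where(x <= 0.5,
                    1.0 - 6.0 * x**2 * (1.0 - x),
                    2.0 * (1.0 - x) ** 3)

def cosine_modulated_filter(center_hz, length, sample_rate):
    """Band-pass FIR filter: a Parzen window modulated by a cosine. The
    modulation frequency sets the band centre; the window length sets the
    bandwidth (a longer window gives a narrower band)."""
    t = (np.arange(length) - (length - 1) / 2.0) / sample_rate
    return parzen_window(length) * np.cos(2.0 * np.pi * center_hz * t)
```

Convolving a waveform with a bank of such filters at different centre frequencies yields the subband decomposition performed by the parametric block described above.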
2018

Active Search for ComputerAided Drug Design Dino Oglic, Steven A. Oatley, Simon J. F. Macdonald, Thomas Mcinally, Roman Garnett, Jonathan D. Hirst, Thomas Gärtner Molecular Informatics [abstract] [pdf] [doi] [code]
We consider lead discovery as active search in a space of labelled graphs. In particular, we extend our recent data-driven adaptive Markov chain approach, and evaluate it on a focused drug design problem, where we search for an antagonist of an αv integrin, the target protein that belongs to a group of Arg-Gly-Asp integrin receptors. This group of integrin receptors is thought to play a key role in idiopathic pulmonary fibrosis, a chronic lung disease of significant pharmaceutical interest. As an in silico proxy of the binding affinity, we use a molecular docking score to an experimentally determined αvβ6 protein structure. The search is driven by a probabilistic surrogate of the activity of all molecules from that space. As the process evolves and the algorithm observes the activity scores of the previously designed molecules, the hypothesis of the activity is refined. The algorithm is guaranteed to converge in probability to the best hypothesis from an a priori specified hypothesis space. In our empirical evaluations, the approach achieves a large structural variety of designed molecular structures for which the docking score is better than the desired threshold. Some novel molecules, suggested to be active by the surrogate model, are of significant interest from the perspective of medicinal chemistry and warrant prioritization for synthesis. Moreover, the approach discovered 19 of the 24 compounds known to be active from previous biological assays.

Learning in Reproducing Kernel Kreĭn Spaces Dino Oglic, Thomas Gärtner In Proceedings of the 35th International Conference on Machine Learning (ICML) [abstract] [pdf] [code]
We formulate a novel regularized risk minimization problem for learning in reproducing kernel Kreĭn spaces and show that the strong representer theorem applies to it. As a result of the latter, the learning problem can be expressed as the minimization of a quadratic form over a hypersphere of constant radius. We present an algorithm that can find a globally optimal solution to this non-convex optimization problem in time cubic in the number of instances. Moreover, we derive the gradient of the solution with respect to its hyperparameters and, in this way, provide means for efficient hyperparameter tuning. The approach comes with a generalization bound expressed in terms of the Rademacher complexity of the corresponding hypothesis space. The major advantage over standard kernel methods is the ability to learn with various domain-specific similarity measures for which positive definiteness does not hold or is difficult to establish. The approach is evaluated empirically using indefinite kernels defined on structured as well as vectorial data. The empirical results demonstrate superior performance of our approach over state-of-the-art baselines.

Constructive Approximation and Learning by Greedy Algorithms Dino Oglic Dissertation, Universitäts- und Landesbibliothek Bonn [abstract] [pdf]
This thesis develops several kernel-based greedy algorithms for different machine learning problems and analyzes their theoretical and empirical properties. Greedy approaches have been extensively used in the past for tackling problems in combinatorial optimization where finding even a feasible solution can be a computationally hard problem (i.e., not solvable in polynomial time). A key feature of greedy algorithms is that a solution is constructed recursively from the smallest constituent parts. In each step of the constructive process a component is added to the partial solution from the previous step and, thus, the size of the optimization problem is reduced. The selected components are given by optimization problems that are simpler and easier to solve than the original problem. As such schemes are typically fast at constructing a solution they can be very effective on complex optimization problems where finding an optimal/good solution has a high computational cost. Moreover, greedy solutions are rather intuitive and the schemes themselves are simple to design and easy to implement. There is a large class of problems for which greedy schemes generate an optimal solution or a good approximation of the optimum. In the first part of the thesis, we develop two deterministic greedy algorithms for optimization problems in which a solution is given by a set of functions mapping an instance space to the space of reals. The first of the two approaches facilitates data understanding through interactive visualization by providing means for experts to incorporate their domain knowledge into otherwise static kernel principal component analysis. This is achieved by greedily constructing embedding directions that maximize the variance at data points (unexplained by the previously constructed embedding directions) while adhering to specified domain knowledge constraints.
The second deterministic greedy approach is a supervised feature construction method capable of addressing the problem of kernel choice. The goal of the approach is to construct a feature representation for which a set of linear hypotheses is of sufficient capacity — large enough to contain a satisfactory solution to the considered problem and small enough to allow good generalization from a small number of training examples. The approach mimics functional gradient descent and constructs features by fitting squared error residuals. We show that the constructive process is consistent and provide conditions under which it converges to the optimal solution. In the second part of the thesis, we investigate two problems for which deterministic greedy schemes can fail to find an optimal solution or a good approximation of the optimum. This happens as a result of making a sequence of choices which take into account only the immediate reward without considering the consequences for future decisions. To address this shortcoming of deterministic greedy schemes, we propose two efficient randomized greedy algorithms which are guaranteed to find effective solutions to the corresponding problems. In the first of the two approaches, we provide a means to scale kernel methods to problems with millions of instances. An approach frequently used in practice for this type of problem is the Nyström method for low-rank approximation of kernel matrices. A crucial step in this method is the choice of landmarks, which determines the quality of the approximation. We tackle this problem with a randomized greedy algorithm based on the K-means++ cluster seeding scheme and provide a theoretical and empirical study of its effectiveness. In the second problem for which a deterministic strategy can fail to find a good solution, the goal is to find a set of objects from a structured space that are likely to exhibit an unknown target property.
This discrete optimization problem is of significant interest to cyclic discovery processes such as de novo drug design. We propose to address it with an adaptive Metropolis–Hastings approach that samples candidates from the posterior distribution of structures conditioned on them having the target property. The proposed constructive scheme defines a consistent random process and our empirical evaluation demonstrates its effectiveness across several different application domains.
2017

Active Search in Intensionally Specified Structured Spaces Dino Oglic, Roman Garnett, Thomas Gärtner In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence [abstract] [pdf] [code]
We consider an active search problem in intensionally specified structured spaces. The ultimate goal in this setting is to discover structures from structurally different partitions of a fixed but unknown target class. An example of such a process is computer-aided de novo drug design. In the past 20 years several Monte Carlo search heuristics have been developed for this process. Motivated by these handcrafted search heuristics, we devise a Metropolis–Hastings sampling scheme where the acceptance probability is given by a probabilistic surrogate of the target property, modeled with a maximum entropy conditional model. The surrogate model is updated in each iteration upon the evaluation of a selected structure. The proposed approach is consistent and the empirical evidence indicates that it achieves a large structural variety of discovered targets.
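The sampling scheme reduces to a standard Metropolis–Hastings loop in which the surrogate plays the role of the unnormalized target. A minimal sketch, assuming a symmetric proposal and a strictly positive surrogate; the paper additionally refits the surrogate after every evaluation, which this sketch omits, and the function names are illustrative.

```python
import random

def metropolis_hastings(surrogate, propose, x0, n_steps, rng=random):
    """Generic Metropolis-Hastings loop. `surrogate(x)` returns a strictly
    positive, unnormalised score for x having the target property;
    `propose(x)` draws a candidate from a symmetric proposal. Candidates
    that improve the score are always accepted, others with probability
    equal to the score ratio."""
    x, fx = x0, surrogate(x0)
    chain = [x0]
    for _ in range(n_steps):
        y = propose(x)
        fy = surrogate(y)
        if fy >= fx or rng.random() < fy / fx:
            x, fx = y, fy
        chain.append(x)
    return chain
```

In the active search setting, each accepted structure would be evaluated (e.g., by an assay or docking score) and the surrogate updated before the next proposal, which is what makes the chain adaptive.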

Nyström Method with Kernel K-means++ Samples as Landmarks Dino Oglic, Thomas Gärtner In Proceedings of the 34th International Conference on Machine Learning (ICML) [abstract] [pdf] [code]
We investigate, theoretically and empirically, the effectiveness of kernel K-means++ samples as landmarks in the Nyström method for low-rank approximation of kernel matrices. Previous empirical studies (Zhang et al., 2008; Kumar et al., 2012) observe that the landmarks obtained using (kernel) K-means clustering define a good low-rank approximation of kernel matrices. However, the existing work does not provide a theoretical guarantee on the approximation error for this approach to landmark selection. We close this gap and provide the first bound on the approximation error of the Nyström method with kernel K-means++ samples as landmarks. Moreover, for the frequently used Gaussian kernel we provide a theoretically sound motivation for performing Lloyd refinements of kernel K-means++ landmarks in the instance space. We substantiate our theoretical results empirically by comparing the approach to several state-of-the-art algorithms.
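A short sketch of the pipeline: kernel K-means++ seeding operates entirely through the kernel matrix, since the squared feature-space distance between points i and j is K_ii - 2K_ij + K_jj, and the selected indices are then used as Nyström landmarks. Lloyd refinements, which the paper motivates for the Gaussian kernel, are omitted here.

```python
import numpy as np

def kmeans_pp_landmarks(K, m, seed=0):
    """Kernel K-means++ seeding: pick m landmark indices, each new landmark
    sampled with probability proportional to its squared kernel-space
    distance to the closest landmark chosen so far."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    diag = np.diag(K)
    idx = [rng.integers(n)]
    d2 = diag - 2 * K[:, idx[0]] + K[idx[0], idx[0]]
    for _ in range(m - 1):
        j = rng.choice(n, p=d2 / d2.sum())
        idx.append(j)
        d2 = np.minimum(d2, diag - 2 * K[:, j] + K[j, j])
    return np.array(idx)

def nystrom(K, idx):
    """Nystrom low-rank approximation from the selected landmark columns."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T
```

Because the residual distance at an already-chosen landmark is zero, its sampling probability vanishes and the selected landmarks are distinct, spreading them across the kernel-induced geometry of the data.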
2016

Greedy Feature Construction Dino Oglic, Thomas Gärtner In Advances in Neural Information Processing Systems 29 (NIPS) [abstract] [pdf] [code]
We present an effective method for supervised feature construction. The main goal of the approach is to construct a feature representation for which a set of linear hypotheses is of sufficient capacity – large enough to contain a satisfactory solution to the considered problem and small enough to allow good generalization from a small number of training examples. We achieve this goal with a greedy procedure that constructs features by empirically fitting squared error residuals. The proposed constructive procedure is consistent and can output a rich set of features. The effectiveness of the approach is evaluated empirically by fitting a linear ridge regression model in the constructed feature space, and the results indicate superior performance of our approach over competing methods.
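A randomized sketch of the residual-fitting idea: at each step a pool of candidate features is scored by how much a least-squares fit to the current residual would reduce the squared error, and the best one is kept. The candidate pool of random cosine ridge functions is an assumption for illustration; the paper's constructive procedure is more general.

```python
import numpy as np

def greedy_features(X, y, n_features, candidates=50, seed=0):
    """Greedily construct features by fitting squared-error residuals.

    For a candidate feature f, the optimal coefficient is (f.r)/(f.f) and
    the resulting drop in the squared residual norm is (f.r)^2/(f.f); the
    candidate with the largest drop is added and the residual updated."""
    rng = np.random.default_rng(seed)
    residual = y.astype(float).copy()
    feats = []
    for _ in range(n_features):
        best = None
        for _ in range(candidates):
            w = rng.normal(size=X.shape[1])
            b = rng.uniform(0.0, 2.0 * np.pi)
            f = np.cos(X @ w + b)
            coef = f @ residual / (f @ f)
            gain = coef * (f @ residual)   # reduction in squared error
            if best is None or gain > best[0]:
                best = (gain, f, coef)
        gain, f, coef = best
        feats.append(f)
        residual = residual - coef * f
    return np.column_stack(feats), residual
```

The squared residual norm is non-increasing by construction, and a linear model (e.g., ridge regression) is then fit in the constructed feature space.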
2014

Learning to Construct Novel Structures Dino Oglic, Roman Garnett, Thomas Gärtner In NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML) [abstract] [pdf]
We investigate the problem of constructing novel representatives of a fixed but unknown concept over a structured, discrete domain, such as the set of unlabeled graphs. In particular, we consider an online scenario in which the learning algorithm immediately observes the true label of each proposed structure. This problem is highly relevant in many application domains and has so far been tackled mostly by special-purpose algorithms. We propose a general algorithm based on iteratively sampling novel structures from the domain and updating a conditional probability model of the concept based on the observed labels. The conditional probability model is used by the sampling procedure to focus on representatives of the desired concept. An empirical study of our approach demonstrates its effectiveness on a variety of problems.

Interactive Knowledge-Based Kernel PCA Dino Oglic, Daniel Paurat, Thomas Gärtner In Machine Learning and Knowledge Discovery in Databases (ECML-PKDD) [abstract] [pdf]
Data understanding is an iterative process in which domain experts combine their knowledge with the data at hand to explore and confirm hypotheses. One important set of tools for exploring hypotheses about data are visualizations. Often, however, traditional, unsupervised dimensionality reduction algorithms are used for visualization. These tools allow for interaction, i.e., exploring different visualizations, only by means of manipulating some technical parameters of the algorithm. Therefore, instead of being able to intuitively interact with the visualization, domain experts have to learn and argue about these technical parameters. In this paper we propose a knowledge-based kernel PCA approach that allows for intuitive interaction with data visualizations. Each embedding direction is given by a non-convex quadratic optimization problem over an ellipsoid and has a globally optimal solution in the kernel feature space. A solution can be found in polynomial time using the algorithm presented in this paper. To facilitate direct feedback, i.e., updating the whole embedding with a sufficiently high frame rate during interaction, we reduce the computational complexity further by incremental up- and downdating. Our empirical evaluation demonstrates the flexibility and utility of this approach.
2013

Supervised PCA for Interactive Data Analysis Daniel Paurat, Dino Oglic, Thomas Gärtner In Proceedings of the 2nd NIPS Workshop on Spectral Learning [abstract] [pdf]
We investigate a novel approach for intuitive interaction with a data set for explorative data analysis. The key idea is that a user can directly interact with a two- or three-dimensional embedding of the data and actively place data points at desired locations. To achieve this, we propose a variant of semi-supervised kernel PCA which respects the placement of control points and maximizes the variance of the unlabelled data along the 'directions' of the embedding.