Towards a Unified Analysis of Random Fourier Features Zhu Li, Jean-Francois Ton, Dino Oglic, Dino Sejdinovic Journal of Machine Learning Research, Volume 22 (Issue 108), pages 1-51
Random Fourier features is a widely used, simple, and effective technique for scaling up kernel methods. The existing theoretical analysis of the approach, however, remains focused on specific learning tasks and typically gives pessimistic bounds which are at odds with the empirical results. We tackle these problems and provide the first unified risk analysis of learning with random Fourier features using the squared error and Lipschitz continuous loss functions. In our bounds, the trade-off between the computational cost and the learning risk convergence rate is problem specific and expressed in terms of the regularization parameter and the number of effective degrees of freedom. We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. As ridge leverage scores are expensive to compute, we devise a simple approximation scheme which provably reduces the computational cost without loss of statistical efficiency. Our empirical results illustrate the effectiveness of the proposed scheme relative to the standard random Fourier features method.
A Deep 2D Convolutional Network for Waveform-based Speech Recognition Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals In Proceedings of the 21st Annual Conference of International Speech Communication Association (INTERSPEECH)
Due to limited computational resources, acoustic models of early automatic speech recognition (ASR) systems were built in low-dimensional feature spaces that incur considerable information loss at the outset of the process. Several comparative studies of automatic and human speech recognition suggest that this information loss can adversely affect the robustness of ASR systems. To mitigate that and allow for learning of robust models, we propose a deep 2D convolutional network in the waveform domain. The first layer of the network decomposes waveforms into frequency sub-bands, thereby representing them in a structured high-dimensional space. This is achieved by means of a parametric convolutional block defined via cosine modulations of compactly supported windows. The next layer embeds the waveform in an even higher-dimensional space of high-resolution spectro-temporal patterns, implemented via a 2D convolutional block. This is followed by a gradual compression phase that selects most relevant spectro-temporal patterns using wide-pass 2D filtering. Our results show that the approach significantly outperforms alternative waveform-based models on both noisy and spontaneous conversational speech (24% and 11% relative error reduction, respectively). Moreover, this study provides empirical evidence that learning directly from the waveform domain could be more effective than learning using hand-crafted features.
Deep Scattering Power Spectrum Features for Robust Speech Recognition Neethu M. Joy, Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals In Proceedings of the 21st Annual Conference of International Speech Communication Association (INTERSPEECH)
Deep scattering spectrum consists of a cascade of wavelet transforms and modulus non-linearity. It generates features of different orders, with the first order coefficients approximately equal to the Mel-frequency cepstrum, and higher order coefficients recovering information lost at lower levels. We investigate the effect of including the information recovered by higher order coefficients on the robustness of speech recognition. To that end, we also propose a modification to the original scattering transform tailored for noisy speech. In particular, instead of the modulus non-linearity we opt to work with power coefficients and, therefore, use the squared modulus non-linearity. We quantify the robustness of scattering features using the word error rates of acoustic models trained on clean speech and evaluated using sets of utterances corrupted with different noise types. Our empirical results show that the second order scattering power spectrum coefficients capture invariants relevant for noise robustness and that this additional information improves generalization to unseen noise conditions (almost 20% relative error reduction on AURORA4). This finding can have important consequences on speech recognition systems that typically discard the second order information and keep only the first order features (known for emulating MFCC and FBANK values) when representing speech.
Scalable Learning in Reproducing Kernel Krein Spaces Dino Oglic, Thomas Gärtner In Proceedings of the 36th International Conference on Machine Learning (ICML)
We provide the first mathematically complete derivation of the Nyström method for low-rank approximation of indefinite kernels and propose an efficient method for finding an approximate eigendecomposition of such kernel matrices. Building on this result, we devise highly scalable methods for learning in reproducing kernel Krein spaces. The devised approaches provide a principled and theoretically well-founded means to tackle large scale learning problems with indefinite kernels. The main motivation for our work comes from problems with structured representations (e.g., graphs, strings, time-series), where it is relatively easy to devise a pairwise (dis)similarity function based on intuition and/or knowledge of domain experts. Such functions are typically not positive definite and it is often well beyond the expertise of practitioners to verify this condition. The effectiveness of the devised approaches is evaluated empirically using indefinite kernels defined on structured and vectorial data representations.
Towards a Unified Analysis of Random Fourier Features Zhu Li, Jean-Francois Ton, Dino Oglic, Dino Sejdinovic In Proceedings of the 36th International Conference on Machine Learning (ICML)
Random Fourier features is a widely used, simple, and effective technique for scaling up kernel methods. The existing theoretical analysis of the approach, however, remains focused on specific learning tasks and typically gives pessimistic bounds which are at odds with the empirical results. We tackle these problems and provide the first unified risk analysis of learning with random Fourier features using the squared error and Lipschitz continuous loss functions. In our bounds, the trade-off between the computational cost and the expected risk convergence rate is problem specific and expressed in terms of the regularization parameter and the number of effective degrees of freedom. We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. As ridge leverage scores are expensive to compute, we devise a simple approximation scheme which provably reduces the computational cost without loss of statistical efficiency.
Bayesian Parznets for Robust Speech Recognition in the Waveform Domain Dino Oglic, Zoran Cvetkovic, Peter Sollich Technical report, arXiv:1906.09526
We propose a novel family of band-pass filters for efficient spectral decomposition of signals. Previous work has already established the effectiveness of representations based on static band-pass filtering of speech signals (e.g., mel-frequency cepstral coefficients and deep scattering spectrum). A potential shortcoming of these approaches is the fact that the parameters specifying such a representation are fixed a priori and not learned using the available data. To address this limitation, we propose a family of filters defined via cosine modulations of Parzen windows, where the modulation frequency models the center of a spectral band-pass filter and the length of a Parzen window is inversely proportional to its bandwidth. We propose to learn these filters as part of a multilayer convolutional operator using stochastic variational inference based on Gaussian dropout posteriors and sparsity inducing priors. Such a prior leads to an intractable integral defining the Kullback–Leibler divergence term for which we propose an effective approximation based on the Gauss–Hermite quadrature. Our empirical results demonstrate that modulation filter-learning can be statistically significantly more effective than static band-pass filtering on continuous speech recognition from raw speech. This is also the first work to achieve state-of-the-art results on speech recognition using variational inference.
Active Search for Computer-Aided Drug Design Dino Oglic, Steven A. Oatley, Simon J. F. Macdonald, Thomas Mcinally, Roman Garnett, Jonathan D. Hirst, Thomas Gärtner Molecular Informatics
We consider lead discovery as active search in a space of labelled graphs. In particular, we extend our recent data‐driven adaptive Markov chain approach, and evaluate it on a focused drug design problem, where we search for an antagonist of an \alpha_v integrin, the target protein that belongs to a group of Arg−Gly−Asp integrin receptors. This group of integrin receptors is thought to play a key role in idiopathic pulmonary fibrosis, a chronic lung disease of significant pharmaceutical interest. As an in silico proxy of the binding affinity, we use a molecular docking score to an experimentally determined \alpha_v\beta_6 protein structure. The search is driven by a probabilistic surrogate of the activity of all molecules from that space. As the process evolves and the algorithm observes the activity scores of the previously designed molecules, the hypothesis of the activity is refined. The algorithm is guaranteed to converge in probability to the best hypothesis from an a priori specified hypothesis space. In our empirical evaluations, the approach achieves a large structural variety of designed molecular structures for which the docking score is better than the desired threshold. Some novel molecules, suggested to be active by the surrogate model, provoke a significant interest from the perspective of medicinal chemistry and warrant prioritization for synthesis. Moreover, the approach discovered 19 out of the 24 active compounds which are known to be active from previous biological assays.
Learning in Reproducing Kernel Kreı̆n Spaces Dino Oglic, Thomas Gärtner In Proceedings of the 35th International Conference on Machine Learning (ICML)
We formulate a novel regularized risk minimization problem for learning in reproducing kernel Kreı̆n spaces and show that the strong representer theorem applies to it. As a result of the latter, the learning problem can be expressed as the minimization of a quadratic form over a hypersphere of constant radius. We present an algorithm that can find a globally optimal solution to this non-convex optimization problem in time cubic in the number of instances. Moreover, we derive the gradient of the solution with respect to its hyperparameters and, in this way, provide means for efficient hyperparameter tuning. The approach comes with a generalization bound expressed in terms of the Rademacher complexity of the corresponding hypothesis space. The major advantage over standard kernel methods is the ability to learn with various domain specific similarity measures for which positive definiteness does not hold or is difficult to establish. The approach is evaluated empirically using indefinite kernels defined on structured as well as vectorial data. The empirical results demonstrate a superior performance of our approach over the state-of-the-art baselines.
Constructive Approximation and Learning by Greedy Algorithms Dino Oglic Dissertation, Universitäts-und Landesbibliothek Bonn
This thesis develops several kernel-based greedy algorithms for different machine learning problems and analyzes their theoretical and empirical properties. Greedy approaches have been extensively used in the past for tackling problems in combinatorial optimization where finding even a feasible solution can be a computationally hard problem (i.e., not solvable in polynomial time). A key feature of greedy algorithms is that a solution is constructed recursively from the smallest constituent parts. In each step of the constructive process a component is added to the partial solution from the previous step and, thus, the size of the optimization problem is reduced. The selected components are given by optimization problems that are simpler and easier to solve than the original problem. As such schemes are typically fast at constructing a solution they can be very effective on complex optimization problems where finding an optimal/good solution has a high computational cost. Moreover, greedy solutions are rather intuitive and the schemes themselves are simple to design and easy to implement. There is a large class of problems for which greedy schemes generate an optimal solution or a good approximation of the optimum. In the first part of the thesis, we develop two deterministic greedy algorithms for optimization problems in which a solution is given by a set of functions mapping an instance space to the space of reals. The first of the two approaches facilitates data understanding through interactive visualization by providing means for experts to incorporate their domain knowledge into otherwise static kernel principal component analysis. This is achieved by greedily constructing embedding directions that maximize the variance at data points (unexplained by the previously constructed embedding directions) while adhering to specified domain knowledge constraints. The second deterministic greedy approach is a supervised feature construction method capable of addressing the problem of kernel choice. The goal of the approach is to construct a feature representation for which a set of linear hypotheses is of sufficient capacity — large enough to contain a satisfactory solution to the considered problem and small enough to allow good generalization from a small number of training examples. The approach mimics functional gradient descent and constructs features by fitting squared error residuals. We show that the constructive process is consistent and provide conditions under which it converges to the optimal solution. In the second part of the thesis, we investigate two problems for which deterministic greedy schemes can fail to find an optimal solution or a good approximation of the optimum. This happens as a result of making a sequence of choices which take into account only the immediate reward without considering the consequences onto future decisions. To address this shortcoming of deterministic greedy schemes, we propose two efficient randomized greedy algorithms which are guaranteed to find effective solutions to the corresponding problems. In the first of the two approaches, we provide a mean to scale kernel methods to problems with millions of instances. An approach, frequently used in practice, for this type of problems is the Nyström method for low-rank approximation of kernel matrices. A crucial step in this method is the choice of landmarks which determine the quality of the approximation. We tackle this problem with a randomized greedy algorithm based on the K-means++ cluster seeding scheme and provide a theoretical and empirical study of its effectiveness. In the second problem for which a deterministic strategy can fail to find a good solution, the goal is to find a set of objects from a structured space that are likely to exhibit an unknown target property. This discrete optimization problem is of significant interest to cyclic discovery processes such as de novo drug design. We propose to address it with an adaptive Metropolis–Hastings approach that samples candidates from the posterior distribution of structures conditioned on them having the target property. The proposed constructive scheme defines a consistent random process and our empirical evaluation demonstrates its effectiveness across several different application domains.
Active Search in Intensionally Specified Structured Spaces Dino Oglic, Roman Garnett, Thomas Gärtner In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence
We consider an active search problem in intensionally specified structured spaces. The ultimate goal in this setting is to discover structures from structurally different partitions of a fixed but unknown target class. An example of such a process is that of computer-aided de novo drug design. In the past 20 years several Monte Carlo search heuristics have been developed for this process. Motivated by these hand-crafted search heuristics, we devise a Metropolis–Hastings sampling scheme where the acceptance probability is given by a probabilistic surrogate of the target property, modeled with a max entropy conditional model. The surrogate model is updated in each iteration upon the evaluation of a selected structure. The proposed approach is consistent and the empirical evidence indicates that it achieves a large structural variety of discovered targets.
Nyström Method with Kernel K-means++ Samples as Landmarks Dino Oglic, Thomas Gärtner In Proceedings of the 34th International Conference on Machine Learning (ICML)
We investigate, theoretically and empirically, the effectiveness of kernel K-means++ samples as landmarks in the Nyström method for low-rank approximation of kernel matrices. Previous empirical studies (Zhang et al., 2008; Kumar et al.,2012) observe that the landmarks obtained using (kernel) K-means clustering define a good low-rank approximation of kernel matrices. However, the existing work does not provide a theoretical guarantee on the approximation error for this approach to landmark selection. We close this gap and provide the first bound on the approximation error of the Nyström method with kernel K-means++ samples as landmarks. Moreover, for the frequently used Gaussian kernel we provide a theoretically sound motivation for performing Lloyd refinements of kernel K-means++ landmarks in the instance space. We substantiate our theoretical results empirically by comparing the approach to several state-of-the-art algorithms.
Greedy Feature Construction Dino Oglic, Thomas Gärtner In Advances in Neural Information Processing Systems 29 (NIPS)
We present an effective method for supervised feature construction. The main goal of the approach is to construct a feature representation for which a set of linear hypotheses is of sufficient capacity – large enough to contain a satisfactory solution to the considered problem and small enough to allow good generalization from a small number of training examples. We achieve this goal with a greedy procedure that constructs features by empirically fitting squared error residuals. The proposed constructive procedure is consistent and can output a rich set of features. The effectiveness of the approach is evaluated empirically by fitting a linear ridge regression model in the constructed feature space and our empirical results indicate a superior performance of our approach over competing methods.
Learning to Construct Novel Structures Dino Oglic, Roman Garnett, Thomas Gärtner In NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML)
We investigate the problem of constructing novel representatives of a fixed but unknown concept over a structured, discrete domain, such as the set of unlabeled graphs. In particular, we consider an online scenario in which the learning algorithm immediately observes the true label of each proposed structure. This problem is highly relevant in many application domains and has so far been tackled mostly by special-purpose algorithms. We propose a general algorithm based on iteratively sampling novel structures from the domain and updating a conditional probability model of the concept based on the observed labels. The conditional probability model is used by the sampling procedure to focus on representatives of the desired concept. An empirical study of our approach demonstrates its effectiveness on a variety of problems.
Interactive Knowledge-Based Kernel PCA Dino Oglic, Daniel Paurat, Thomas Gärtner In Machine Learning and Knowledge Discovery in Databases (ECML-PKDD)
Data understanding is an iterative process in which domain experts combine their knowledge with the data at hand to explore and confirm hypotheses. One important set of tools for exploring hypotheses about data are visualizations. Often, however, traditional, unsupervised dimensionality reduction algorithms are used for visualization. These tools allow for interaction, i.e., exploring different visualizations, only by means of manipulating some technical parameters of the algorithm. Therefore, instead of being able to intuitively interact with the visualization, domain experts have to learn and argue about these technical parameters. In this paper we propose a knowledge-based kernel PCA approach that allows for intuitive interaction with data visualizations. Each embedding direction is given by a non-convex quadratic optimization problem over an ellipsoid and has a globally optimal solution in the kernel feature space. A solution can be found in polynomial time using the algorithm presented in this paper. To facilitate direct feedback, i.e., updating the whole embedding with a sufficiently high frame-rate during interaction, we reduce the computational complexity further by incremental up- and down-dating. Our empirical evaluation demonstrates the flexibility and utility of this approach.
Supervised PCA for Interactive Data Analysis Daniel Paurat, Dino Oglic, Thomas Gärtner In Proceedings of the 2nd NIPS Workshop on Spectral Learning
We investigate a novel approach for intuitive interaction with a data set for explorative data analysis. The key idea is that a user can directly interact with a two or three dimensional embedding of the data and actively place data points to desired locations. To achieve this, we propose a variant of semi- supervised kernel PCA which respects the placement of control points and maximizes the variance of the unlabelled data along the ‘directions’ of the embedding.