NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS
SCHOOL OF SCIENCE
DEPARTMENT OF INFORMATICS AND TELECOMMUNICATIONS
POSTGRADUATE PROGRAM

PhD DISSERTATION

Bayesian signal processing techniques for hyperspectral image unmixing

Konstantinos E. Themelis

ATHENS January 2012

PhD DISSERTATION

BAYESIAN SIGNAL PROCESSING TECHNIQUES FOR HYPERSPECTRAL IMAGE UNMIXING

Konstantinos E. Themelis

Advisor: Sergios Theodoridis, Professor, NKUA

Three-member Advising Committee:
Sergios Theodoridis, Professor, NKUA
Athanasios Rontogiannis, Senior Researcher, NOA
Konstantinos Koutroumbas, Senior Researcher, NOA

Seven-member Examination Committee:

Sergios Theodoridis Professor NKUA

Athanasios Rontogiannis Senior Researcher NOA

Konstantinos Koutroumbas Senior Researcher NOA

Nicholas Kalouptsidis Professor NKUA

Konstantinos Berberidis Professor UP

Elias Manolakos Associate Professor NKUA

Konstantinos Slavakis Assistant Professor UoP

Examination Date: 30/01/2012

Abstract

This thesis presents a framework of novel Bayesian signal processing techniques developed for the unmixing of hyperspectral data. Hyperspectral data consist of hundreds or thousands of spatially coregistered images, each one produced by sampling at a specific wavelength. Modern remote sensing systems exploit the spectral information of hyperspectral data in order to detect targets of interest and to identify materials. Spectral unmixing is the process of decomposing mixed hyperspectral pixels into their constituent materials, or endmembers, and their corresponding proportional fractions, or abundances. Assuming that the spectral signatures of the endmembers present in the image are known a priori, we focus on the problem of abundance estimation. This problem is formulated as a linear source separation problem, subject to the physical constraints of nonnegativity and sum-to-one for the abundances. An interesting insight into spectral unmixing is also gained by assuming that this estimation problem inherently accepts a sparse solution. The sparsity assumption is based on the fact that only a small number of the available endmembers will be present in each single pixel. Moreover, the unmixing problem is complicated by the deficiencies of hyperspectral data, such as the high dimensionality and the high correlation of the endmembers' spectra. This necessitates the development of sophisticated techniques that take into account the special characteristics of hyperspectral data; such techniques are naturally found in the Bayesian framework.

Bayesian techniques have been widely applied in a range of scientific fields and, inevitably, in hyperspectral signal processing. They are particularly well suited for modeling stochastic processes and exploiting any prior knowledge about them. In this thesis we provide some general information on the fundamentals of Bayesian analysis. We discuss the Bayesian interpretation of widely used sparsity-inducing operators, such as the lasso and its variants. Furthermore, a description of popular probabilistic inference methods is provided. Sampling methods, including Markov chain Monte Carlo techniques and Gibbs sampling, are presented. Bayesian inference methods are also discussed, including the EM algorithm as well as variational Bayesian inference methods.

Exploiting the Bayesian framework, we develop and propose sophisticated hierarchical Bayesian models that promote sparsity and impose the problem's constraints using appropriate prior distributions. A Bayesian maximum a posteriori (MAP) estimator is proposed, which is specifically designed to take into account the constraints of the abundances. The simplicity of the Gaussian distribution, along with the symmetry of the constraints, is exploited to obtain a closed form solution for the proposed estimator. This is an important feature, since, despite the constraints, there is no need to solve an optimization problem. Due to the statistical nature of our estimator, the final step of the algorithm is to project any infeasible point onto the polytope of constraints.

In the sequel, two methods are presented that impose the constraints, but also highlight the sparsity of the unmixing problem. An efficient optimization scheme is first developed, which employs the regularizing function of the adaptive lasso in order to induce sparsity on the abundances. The formulated optimization problem is solved using a modification of the popular LARS algorithm, which retrieves only nonnegative solutions. The nonnegativity constraint is thus imposed by the solver of the problem, while the sum-to-one constraint is incorporated as an extra equation in the model's system of linear equations. Experimental results show that sparsity enhances the performance of the unmixing process.

Another interesting contribution of this thesis is the development of a hierarchical Bayesian model which utilizes the Laplace distribution as a prior for the abundance parameters. The Laplace prior is widely used in the Bayesian compressive sensing literature to match the ℓ1 norm penalty of the celebrated lasso operator, which is known to promote sparsity. This concept is extended in the proposed model, which utilizes an independent Laplace prior for each coefficient of the abundance vector. The proposed model can then be viewed as a Bayesian analogue to the adaptively weighted lasso. To perform Bayesian inference, an efficient method, termed Bayesian inference iterative conditional expectations (BI-ICE), is developed. BI-ICE is a greedy approximation to the Gibbs sampler, but it can also be viewed as a first-moments approximation to variational Bayesian inference techniques. Experimental results on simulated and real hyperspectral data show that the proposed method converges fast, favors sparsity in the abundance vector, and offers improved estimation accuracy compared to other related methods.

Moreover, the newly developed hierarchical Bayesian model is also exploited in the context of compressive sensing. The proposed model is used to induce sparsity on the coefficients of a wavelet transform of a two-dimensional image. The objective is to reconstruct the image from a subset of the wavelet coefficients, which are determined by a sparse estimation problem. The underdeterminacy of the sparse image reconstruction problem is confronted by developing a fast suboptimal type-II maximum likelihood algorithm, which is incremental in the number of wavelet bases used to reconstruct the signal of interest.

The potential of the proposed methods is also demonstrated by experimental results on the unmixing of real hyperspectral data. Images collected from airborne hyperspectral sensors, e.g., HYDICE and AVIRIS, have been included in our experiments. Moreover, an application from the thematic area of planetary exploration is presented. It involves the unmixing of real hyperspectral data collected from the OMEGA spectrometer of the Mars Express mission. The objective is to unmix the collected data in order to obtain thematic maps of the mineral composition of the Mars surface. The algorithms developed in this thesis are applied in this study and their unmixing results are compared to those of existing methods.

Acknowledgments

I would very much like to thank all those who have helped and supported me in any way along the road to completing this demanding task. First and foremost, I would like to express my gratitude to my supervisor, Athanasios Rontogiannis, for giving me an opportunity to pursue a doctoral degree and become acquainted with the process of scientific research. I would like to thank him not only for his valuable scientific guidance over the years, but also for his tremendous support and constant encouragement. I am also grateful to Konstantinos Koutroumbas for many insightful discussions and suggestions, and for his scientific advice and knowledge. Next, I would like to thank my professors Konstantinos Berberidis and Sergios Theodoridis, whose lectures have influenced me and nurtured my inexplicable inclination towards the signal processing field. My work has also benefited from my collaboration with Olga Sykioti, Ioannis Daglis, and Frederic Schmidt, all of whom I would also like to thank. Moreover, I would like to thank my friends and members of the ISARS staff, Kostas Tziotziou, Petros Bithas, George Ropokis, David Benmayor, Giannis Papoutsis, Zefi Papadimitriou, and Marina Georgiou, with whom I have spent many pleasant hours in the lab.

Contents

Contents
List of figures
List of tables
Symbols
Abbreviations
Notation
Preface

1 Introduction
  1.1 What is hyperspectral imagery?
  1.2 Spectral unmixing
  1.3 Work contributions
  1.4 Outline of the thesis

2 Hyperspectral unmixing
  2.1 Hyperspectral imaging
    2.1.1 Hyperspectral data representation
    2.1.2 Practical considerations
    2.1.3 Hyperspectral imaging applications
    2.1.4 Spectral processing
  2.2 Spectral unmixing
    2.2.1 The linear mixing model
    2.2.2 Linear unmixing
      2.2.2.1 Dimensionality reduction
      2.2.2.2 Endmember extraction
      2.2.2.3 Inversion process

3 Sparse modeling and inference methods
  3.1 An introduction to linear regression
    3.1.1 Linear models and least squares
      3.1.1.1 Geometrical interpretation of least squares
      3.1.1.2 Maximum likelihood and least squares
    3.1.2 Regularized least squares
    3.1.3 The lasso
      3.1.3.1 The lasso as a Bayes estimate
    3.1.4 The adaptively weighted lasso
      3.1.4.1 The weighted lasso as a Bayes estimate
  3.2 Sampling methods for inference
    3.2.1 Traditional methods
      3.2.1.1 Transformation method
      3.2.1.2 Rejection sampling
      3.2.1.3 Importance sampling
    3.2.2 Markov chain Monte Carlo techniques
      3.2.2.1 Markov chains
    3.2.3 Gibbs sampling
  3.3 Bayesian inference methods
    3.3.1 Bayesian inference basics
    3.3.2 The EM algorithm
    3.3.3 The variational EM
    3.3.4 The evidence approximation

4 Soft-constrained MAP abundance estimation
  4.1 A soft-constraint MAP estimator
    4.1.1 Problem formulation
    4.1.2 The proposed method
    4.1.3 Imposing the hard constraints
    4.1.4 Simulations
    4.1.5 Conclusion
  4.2 On the unmixing of MEX/OMEGA data
    4.2.1 Introduction
    4.2.2 Linear spectral unmixing techniques
      4.2.2.1 ENVI-SVD
      4.2.2.2 Quadratic programming technique
    4.2.3 Discussion
      4.2.3.1 South Polar Cap image cube
      4.2.3.2 Syrtis Major image cube
    4.2.4 Conclusions

5 Semi-supervised sparsity-promoting abundance estimation methods
  5.1 Problem formulation
  5.2 Semi-supervised hyperspectral unmixing via the weighted lasso
    5.2.1 The proposed method
      5.2.1.1 Enforcing the constraints
    5.2.2 Simulation results
      5.2.2.1 Simulated data
      5.2.2.2 Unmixing of a HYDICE image
  5.3 Semi-supervised hyperspectral unmixing via a novel Bayesian approach
    5.3.1 Hierarchical Bayesian model
      5.3.1.1 Likelihood
      5.3.1.2 Parameter prior distributions
      5.3.1.3 Hyperparameters' priors
    5.3.2 The proposed Bayesian inference methodology
      5.3.2.1 Posterior conditional distributions
      5.3.2.2 The BI-ICE algorithm
      5.3.2.3 Embedding the sum-to-one constraint
      5.3.2.4 Fast computation of (5.42) and (5.43)
      5.3.2.5 Relation to variational Bayesian inference and other methods
    5.3.3 Experimental results
      5.3.3.1 Simulation results on synthetic data
      5.3.3.2 Simulation results on real data
      5.3.3.3 Simulation results on OMEGA data
    5.3.4 Conclusion

6 An empirical Bayes algorithm for sparse signal reconstruction
  6.1 Introduction
  6.2 Bayesian modeling
    6.2.1 The Bayesian lasso
    6.2.2 The Bayesian adaptive lasso
  6.3 Bayesian inference
    6.3.1 Type-II maximum likelihood approach
    6.3.2 Fast suboptimal solution
  6.4 Simulation results
    6.4.1 1D signals
    6.4.2 2D images
  6.5 Conclusion

7 Summary of the thesis
  7.1 Summary
  7.2 Future directions
    7.2.1 Variational approximation for underdetermined systems
    7.2.2 Nonlinear spectral unmixing
    7.2.3 Sparse unmixing using non-convex priors
    7.2.4 Unsupervised spectral unmixing using spatial correlations
    7.2.5 Adaptive Bayesian schemes

Bibliography
Index

List of figures

2.1 An AVIRIS hyperspectral image of the Leadville mining district in Colorado, with the spectral dimension shown as the top and right faces of the cube. The front of the cube is a true color composite, with areas containing secondary minerals from acid mine drainage highlighted in red, orange and yellow.
2.2 A hyperspectral image cube (figure found in [85]).
2.3 Example of data from a simulated hyperspectral scene. Only three wavelengths have been selected for the plot.
2.4 Taxonomy of applications for hyperspectral imaging.
2.5 Stages of hyperspectral signal processing, from the radiance cube to end-user products.
2.6 A hyperspectral scene comprised of pure and mixed pixels (figure found in [101]).
2.7 Single versus multiple scattering (figure found in [77]).
2.8 Polytope of constraints: (a) two-dimensional case, and (b) three-dimensional case.
2.9 A simple taxonomy of unmixing algorithms.
2.10 Conceptual diagram for spectral unmixing.
3.1 The least squares solution ŷ is the orthogonal projection of the data vector y onto the subspace spanned by the columns φi of the mixing matrix Φ.
3.2 Contour plot of the ℓq norm for two variables for various values of q.
3.3 Contour lines of the residual sum of squares, centered at the least squares solution. The shaded areas are the constraint regions due to the regularization terms. The final solution results at the location where the contours meet the constraint regions.
3.4 The Laplace with ς = 1 and the standard normal probability density functions. The former is the implicit prior of the lasso, whereas the latter is the prior for ridge regression.
3.5 Three iterations of the Gibbs sampler in the two-dimensional space.
4.1 ASC and ANC define a 2-simplex in the three-dimensional space.
4.2 Urban HYDICE hyperspectral dataset, band 80.
4.3 Abundance estimation using quadratic programming techniques.
4.4 Abundance estimation using the MAP-s estimator and solving the LMI optimization problem.
4.5 Abundance estimation using the closed form expression of the MAP-s estimator.
4.6 Reference spectra of the South Polar Cap OMEGA image. The available endmembers are: (a) OMEGA typical dust materials with atmosphere absorption, (b) synthetic CO2 ice with grain size of 100 mm, (c) synthetic H2O ice with grain size of 100 microns.
4.7 Abundance map of the dust endmember, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s for the South Polar Cap image cube.
4.8 Abundance map of the CO2 endmember, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s for the South Polar Cap image cube.
4.9 Abundance map of the H2O endmember, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s for the South Polar Cap image cube.
4.10 Reference spectra of the Syrtis Major OMEGA image.
4.11 Abundance map of Hypersthene, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s algorithms for the Syrtis Major hyperspectral scene.
4.12 Abundance map of Diopside, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s algorithms for the Syrtis Major hyperspectral scene.
4.13 Abundance map of Fayalite, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s algorithms for the Syrtis Major hyperspectral scene.
4.14 Abundance maps of the endmembers (a) Hypersthene, (b) Diopside, and (c) Fayalite in the Syrtis Major scene. The abundances are estimated using the band ratio method.
5.1 Abundance estimation MSE curves versus SNR.
5.2 Real hyperspectral data: HYDICE urban image scene, band 80.
5.3 Fraction maps estimated by (a) LS, (b) QP and (c) the proposed method.
5.4 Directed acyclic graph of the proposed Bayesian model. The deterministic model parameters appear in boxes.
5.5 Estimation of the entries of the sparse vector w, as BI-ICE progresses. The algorithm is applied to simulated data, generated using (a) a highly correlated matrix of spectral data, (b) a matrix of i.i.d. uniform data. White noise is added (SNR = 25 dB). Dashed lines: true values. Solid lines: estimated values.
5.6 MSE as a function of the level of sparsity obtained by different unmixing methods when applied to simulated data with white additive noise (SNR = 20 dB) and using two different spectral libraries.
5.7 MSE as a function of the SNR obtained by different sparse unmixing methods when applied to simulated data with white additive noise and using different spectral libraries for sparsity level ξ = 5.
5.8 MSE as a function of the level of sparsity obtained by different unmixing methods when applied to simulated data with colored additive noise (SNR = 20 dB) and using two different spectral libraries.
5.9 MSE as a function of the SNR obtained by different sparse unmixing methods when applied to simulated data with colored additive noise and using different spectral libraries for sparsity level ξ = 5.
5.10 MSE as a function of the level of sparsity obtained by different unmixing methods when applied to simulated data with white additive noise (SNR = 20 dB) and using two different spectral libraries. The sum-to-one constraint is incorporated to the BI-ICE algorithm, as explained in section 5.3.2.3.
5.11 Band 150 of a subimage of the Cuprite Aviris hyperspectral data set.
5.12 Estimated abundance values of four endmembers using: (a) the LS algorithm, (b) the QP algorithm, (c) the proposed BI-ICE algorithm.
5.13 Reference endmember spectra of the Syrtis Major OMEGA image.
5.14 Abundance map produced by BI-ICE of the endmembers (a) Hypersthene, (b) Diopside, and (c) Fayalite for the Syrtis Major hyperspectral scene.
6.1 Reconstruction error versus the number of measurements M in the noise-free case. (a) Uniform spikes ±1, (b) zero mean unit variance Gaussian, (c) unit variance Laplace, (d) Student's t-distribution with 3 degrees of freedom.
6.2 Reconstruction error versus the number of measurements M with noisy observations. (a) Uniform spikes ±1, (b) zero mean unit variance Gaussian, (c) unit variance Laplace, (d) Student's t-distribution with 3 degrees of freedom.
6.3 Mondrian image reconstruction using the algorithms: (a) linear sparse reconstruction (error = 0.13325), (b) BCS (error = 0.14979, time = 11.7836 s), (c) FL (error = 0.15094, time = 15.6063 s), (d) Fast-BALa (error = 0.14402, time = 26.5266 s).
6.4 Error maps of the reconstructed Mondrian images using the algorithms: (a) linear sparse reconstruction, (b) BCS, (c) FL, (d) Fast-BALa.

List of tables

5.1 Algorithm 1. The positivity constrained lasso.
5.2 Algorithm 2. Weighted lasso for sparse unmixing.
5.3 The BI-ICE algorithm.
6.1 The Fast-BALa algorithm.

Abbreviations

Abbreviation   Meaning
ALI            Advanced Land Imager
ANC            Abundance Nonnegativity Constraint
ASC            Abundance Sum-to-one (additivity) Constraint
AWGN           Additive White Gaussian Noise
BCS            Bayesian Compressive Sensing
BI-ICE         Bayesian Inference Iterative Conditional Expectations
CCA            Convex Cone Analysis
CS             Compressive Sensing
EM             Expectation Maximization algorithm
ENVI           The Environment for Visualizing Images
EVD            Eigenvalue Decomposition
FCLS           Fully Constrained Least Squares
IEEE           Institute of Electrical and Electronic Engineers
iid            Independent and identically distributed
KLT            Karhunen-Loève Transform
LARS           Least Angle Regression
NASA           National Aeronautics and Space Administration
NNLS           Nonnegative Least Squares
OMEGA          Observatoire pour la Minéralogie, l'Eau, les Glaces et l'Activité
ORASIS         Optical Real-time Adaptive Spectral Identification System
PCA            Principal Component Analysis
PPI            Pixel Purity Index
PSS            Positive Source Separation
QP             Quadratic Programming
QRD            QR Decomposition
RLS            Recursive Least Squares
SNR            Signal to Noise Ratio
SMA            Spectral Mixture Analysis
SVD            Singular Value Decomposition
VCA            Vertex Component Analysis

Notation

Symbol         Meaning
a              Scalar
|a|            Absolute value of a scalar
a              Vector
a_{¬i}         Vector a after removing its ith element
0              Zero vector
1              All ones vector
A              Matrix
A_{M×N}        Matrix of dimension M × N
A_{¬i,¬j}      Matrix A after removing its ith row and jth column
A^T            Transpose of a matrix
I_K            Identity matrix of size K
|A|            Determinant of a matrix
||a||_1        ℓ1 norm of a vector
||a||_2        ℓ2 norm of a vector
diag(a)        Diagonal matrix that contains the elements of vector a on its diagonal
N(x|µ, Σ)      Gaussian distribution of a random variable x, with mean µ and covariance matrix Σ
R              Field of real numbers
R^N            N-dimensional Euclidean space
≈              Approximately equal

Preface

This research has been conducted at the facilities of the Institute for Space Applications and Remote Sensing (ISARS) of the National Observatory of Athens (NOA).

Chapter 1 Introduction

In this chapter we provide some fundamental background information on the spectral unmixing problem, which is the main topic of research in this thesis. The major contributions of our research work are stated, and an outline of the remaining chapters is provided.

1.1 What is hyperspectral imagery?

Over the past decades, hyperspectral image analysis has emerged as one of the fastest growing technologies in the field of remote sensing. Hyperspectral imagery refers to the process of remotely obtaining data about an object (in our case a geographical area), with the use of hyperspectral sensors. Hyperspectral sensors have the ability to sample the electromagnetic spectrum in hundreds of contiguous spectral bands. As a consequence, each pixel of a hyperspectral image is represented by a vector, where each coefficient is a measurement of reflectance at a respective wavelength. The potential of hyperspectral imagery lies in the wealth of spectral information about the remote objects. Hyperspectral data provide more accurate and detailed information than any other type of remotely sensed data. It is interesting to note that it is not only the number, but also the narrowness and contiguous nature of the measurements that characterizes hyperspectral data. Exploiting the fact that every material reflects, absorbs, and scatters sunlight radiation in a different manner, hyperspectral imagery provides, in principle, the capability to actually determine and identify materials in a remote scene, based exclusively on spectral information. According to the application type, hyperspectral imagery can be collected by satellites, airborne platforms, or unmanned aerial vehicles. There are several application areas that utilize hyperspectral imagery capabilities, including military target detection, [85], terrain classification, [77], and environmental monitoring related to forestry, [1], mineral exploration, [115, 116], and environmental changes. In addition, vegetation scientists have successfully
used hyperspectral imagery to identify vegetation species, and detect vegetation stress, [89]. The immense growth of hyperspectral imaging applications has also been accompanied by the development of numerous techniques to effectively process the overwhelming amount of collected data. Sophisticated algorithms have been proposed in the literature that either ameliorate the shortcomings of hyperspectral data, or focus on the exploitation of their informational content. Among all possible signal processing applications that exploit hyperspectral data, in this dissertation we are more interested in the thematic area of spectral unmixing.

1.2 Spectral unmixing

An integral part of the analysis of hyperspectral data is the determination of the macroscopically distinct materials that comprise the image. This is a rather challenging task, since hyperspectral images collected in natural environments are inevitably comprised of mixed pixels, i.e., pixels whose measured spectrum is a mixture of the spectral signatures of the individual (pure) materials present in the sensor's field of view. Spectral unmixing is the procedure of decomposing the spectrum of the mixed pixels into the spectral signatures of pure materials, called endmembers, and their corresponding fractions, called abundances. The most widely used spectral mixing model is the linear mixing model (LMM), which assumes that the observed spectrum of a mixed pixel is generated by a linear combination of the spectral signatures of the N constituent endmembers. The linear mixing model is defined as

y = \sum_{i=1}^{N} φ_i w_i + n,    (1.1)

where y is an M-dimensional vector containing the mixed pixel's measured reflectance values, M is the number of spectral bands, φ1, φ2, ..., φN are the N known endmember spectra, w1, w2, ..., wN are the abundances, and n is an additive noise vector. To be physically meaningful, two constraints are imposed on the abundance fractions: (a) nonnegativity, i.e., w_i ≥ 0, i = 1, 2, ..., N, and (b) full-additivity, i.e., \sum_{i=1}^{N} w_i = 1. In signal processing terms, spectral unmixing is formulated as a linear source separation problem. Standard unmixing techniques usually discriminate between the tasks of endmember extraction and abundance estimation. Supervised methods assume that endmember spectra are known a priori and they focus on the abundance estimation problem. On the contrary, unsupervised methods use geometrical or statistical approaches to determine the spectral signatures of the pure materials in the image, prior to estimating their corresponding proportions in each pixel. Semi-supervised methods assume that a dictionary of endmember
spectra is available a priori, possibly obtained using an endmember extraction algorithm or utilizing a spectral library, and the objective is then to determine which of the available materials are present in each pixel and at what percentage. In this thesis we are interested in supervised and semi-supervised methods, as will be described in more detail in the following section.
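To make the abundance estimation task concrete, the following minimal Python sketch simulates a mixed pixel according to (1.1) and recovers the abundances under both constraints. It is an illustration only, using synthetic (hypothetical) endmember spectra and the common device of appending the sum-to-one constraint as an extra, heavily weighted equation before solving a nonnegative least squares problem; it is not one of the estimators proposed in this thesis.

```python
# Minimal illustration of the linear mixing model (1.1) and fully constrained
# abundance estimation. NOT the thesis's MAP-s or BI-ICE algorithms: the
# sum-to-one constraint is added as an extra weighted equation and the
# nonnegativity constraint is enforced by nonnegative least squares.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
M, N = 200, 5                                   # spectral bands, endmembers (synthetic)
Phi = rng.uniform(0.0, 1.0, (M, N))             # hypothetical endmember spectra (columns)

w_true = np.array([0.6, 0.3, 0.1, 0.0, 0.0])    # sparse, nonnegative, sums to one
y = Phi @ w_true + 0.001 * rng.standard_normal(M)   # mixed pixel as in (1.1)

# Augment the system with the weighted sum-to-one equation delta * (1^T w) = delta.
delta = 10.0                                    # hypothetical weight of the extra equation
Phi_aug = np.vstack([Phi, delta * np.ones((1, N))])
y_aug = np.append(y, delta)

w_hat, _ = nnls(Phi_aug, y_aug)                 # nonnegativity enforced by the solver
print("true      :", w_true)
print("estimated :", np.round(w_hat, 3))
```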

1.3 Work contributions

The scientific contributions presented in this work provide a Bayesian treatment of the problem of supervised spectral unmixing. A major advantage of the Bayesian approach is that it provides a flexible framework to represent the probabilistic mechanisms of data generation and our prior information about them. To this end, parametric probabilistic models are considered, and appropriate prior distributions are employed to capture the uncertainties of the model parameters. In the context of hyperspectral signal processing, we were greatly influenced by the wide applicability of Bayesian methods.

Our first contribution is a Bayesian maximum a posteriori (MAP) estimator, published in [111], which is specifically designed to account for the convex constraints of the abundance estimation problem. Utilizing the simplicity of the Gaussian distribution and the symmetry of the constraints, closed form expressions are derived for the modes of the posterior distribution of the abundances. Following this path, an efficient estimator is proposed, which has similar estimation performance to, but is computationally more efficient than, quadratic programming methods that address the same constrained estimation problem.

The applicability of our proposed schemes has also been of concern. A representative example is the case study on the unmixing of real hyperspectral data collected from the OMEGA spectrometer, [15], aboard the Mars Express mission. The objective of the OMEGA spectrometer is to collect information that will help determine the mineral composition of the Mars surface. This is an interesting application for spectral unmixing. Three methods are considered and compared in the unmixing process: (a) the ENVI-SVD method, which is commercially available through the popular remote sensing software ENVI, (b) the MAP estimator of the preceding paragraph, and (c) a quadratic programming method, [35], which is an iterative Newton method. The results of this comparison are available in [115, 116], where it is seen that the MAP estimator can provide reliable results and outperform existing methods.

In the sequel, a significant amount of effort was invested in leveraging sparse signal processing techniques for spectral unmixing. The primary assumption is that only a small number of endmembers will be mixed in a single pixel, and hence, the abundance estimation problem
will inherently have a sparse solution. To the best of our knowledge, we were among the first to introduce the notion of sparsity to the problem of spectral unmixing. This consideration is exhibited in the publications [112, 113, 114]. In these publications, a variant of the least absolute shrinkage and selection operator (lasso), [117], was selected to impose sparsity. In [112], the adaptively weighted lasso, [129], is utilized and special manipulation is provided for the constraints of the problem. To enforce the nonnegativity constraint, the optimization problem is solved using a modified LARS algorithm, which retrieves only nonnegative solutions. Moreover, the additivity constraint is included in the quadratic loss function of least squares, by means of an appropriate extra linear equation. Another contribution of this thesis is the development of a Bayesian hierarchical model analogous to the adaptive lasso, [129]. In the proposed Bayesian setup, independent Laplace priors are employed by the model to correspond to the weighted ℓ1 norm penalization of the adaptive lasso. In addition, the nonnegativity constraint is incorporated into the model by a truncation operation on the prior distributions. A novel method for Bayesian inference is then developed, termed the Bayesian inference iterative conditional expectations (BI-ICE) algorithm. BI-ICE appears to be a first-moments approximation to variational approximation methods, and is detailed in chapter 5. The proposed Bayesian approach is described in [114]. Finally, a lot of effort has also been invested in the research and development of a sparse reconstruction algorithm, in the framework of Bayesian compressive sensing. The previous Bayesian set-up has been adopted to promote sparse solutions to an underdetermined system of linear equations. To perform Bayesian inference, a recently proposed sub-optimal, type-II maximum likelihood algorithm was adjusted to fit the needs of the present framework. The resulting incremental-type algorithm has superior performance when compared to other Bayesian compressive sensing methods, as illustrated in the experimental results of Chapter 6.
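For reference, the sparsity-promoting criterion around which these contributions revolve can be written compactly in the notation of (1.1) as a constrained, adaptively weighted lasso problem. The expression below is only a sketch of the general form; the exact handling of the individual weights λ_i and of the two constraints differs between the LARS-based solver and the hierarchical Bayesian (Laplace prior) formulation described above:

\hat{w} = \arg\min_{w} \; \frac{1}{2} \| y - Φ w \|_2^2 + \sum_{i=1}^{N} λ_i |w_i|, \quad \text{subject to} \quad w_i ≥ 0, \; i = 1, \ldots, N, \quad \text{and} \quad \sum_{i=1}^{N} w_i = 1,

where Φ = [φ_1, ..., φ_N] collects the endmember spectra and each λ_i ≥ 0 is an individual regularization weight.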

1.4 Outline of the thesis

The remaining chapters of this thesis are organized as follows. Chapter 2 contains a description of modern hyperspectral remote sensing systems. The technical characteristics of hyperspectral data are discussed and an overview of hyperspectral imaging applications is presented. The spectral unmixing problem, which is the primary research field for this thesis, is formulated as a three-step processing procedure. Special attention is devoted to linear mixing methods, which are enriched by the proposed approaches of our research.
The objective of chapter 3 is twofold. First, it provides valuable information on sparse linear regression modeling and regularizing approaches for least squares. A Bayesian
interpretation of regularized least squares is also presented in detail in this chapter, since it is an important part of the Bayesian analysis reported in the succeeding chapters. Second, probabilistic methods to infer the unknown model parameters are presented. These methods include sampling methods, as well as simple Bayesian inference and methods for variational approximation.
In chapter 4, a soft constrained maximum a posteriori (MAP) estimator is proposed for the supervised problem of spectral unmixing. A comparative study on the unmixing of real hyperspectral data collected from the OMEGA spectrometer of the Mars Express mission is also presented. The study considers three unmixing methods: the least squares estimator, a quadratic programming method, and the recently proposed MAP estimator.
In chapter 5, two Bayesian approaches for semi-supervised spectral unmixing are presented. Both approaches focus on the sparsity of the abundance estimation problem. First, the adaptively weighted lasso is used to impose sparsity and the resulting optimization problem is solved subject to the constraints of the abundances. Second, a novel hierarchical Bayesian model using Laplace priors is introduced, and a novel method to perform Bayesian inference for the proposed Bayesian model, termed Bayesian inference iterative conditional expectations, is presented.
Chapter 6 departs from the thematic area of spectral unmixing and presents a novel Bayesian inference algorithm, which is specifically tailored for the hierarchical Bayesian model previously presented in chapter 5. In the context of the sparse image reconstruction problem, experimental results conducted on both simulated and real data illustrate the robustness of the proposed method.
Finally, chapter 7 gives a summary of the research work in the context of this thesis and provides some possible future research directions.

Chapter 2 Hyperspectral unmixing

The evolution of remote sensing systems has led to the development of hyperspectral imaging sensors which have the ability to sample the electromagnetic spectrum in hundreds of narrow contiguous bands. This chapter introduces the reader to hyperspectral image processing and defines the spectral unmixing problem, which is thoroughly considered in this thesis.

2.1 Hyperspectral imaging

Remote sensing is the acquisition of information about an object or phenomenon without having physical contact with the object. There are two types of remote sensing: active and passive. In active remote sensing, the sensor emits energy in order to scan an area or an object and then measures the radiation that is reflected or backscattered from the target. Typical examples of active remote sensing include radio detection and ranging (RADAR) and light detection and ranging (LiDAR), where the time delay between emission and return is measured, establishing the location, height, speed and direction of an object. On the other hand, passive sensors measure the natural radiation emitted or reflected by the object being observed. Naturally, the most common source of radiation is reflected sunlight. Examples of passive remote sensors include film photography, infrared, charge-coupled devices, and radiometers. A more familiar example of a passive remote sensing system is that of human vision. Our eyes intercept the sunlight scattered by the objects surrounding us and generate an image signal that is carried via the nerve fibres to our brain for processing. As a matter of fact, a large portion of our brain activity is dedicated to the processing of this visual information. Of all five senses, vision has a major role in human perception and interpretation of the world. It is therefore not surprising that the human ability to accurately interpret visual information acts as a motive to develop, produce and evolve accurate imagery, along with sophisticated algorithms that effectively analyze this kind of information.
Historically, passive remote sensors were developed during the second half of the
twentieth century. This is the period when remote imaging sensors were first launched aboard satellites. Some remarkable examples include the Television Infrared Observation Satellite (TIROS) series, first launched in 1960, [3], the Advanced Very High Resolution Radiometer (AVHRR), and the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument, [45], launched aboard the Terra spacecraft in 1999. Among the principal characteristics of these pioneer sensors are the low spatial resolution and the ability to sample the visible and infrared portion of the electromagnetic spectrum. The desire for images of higher spatial resolution soon drove the evolution of remote sensing systems. High resolution images provide more accurate information and can be utilized in a broader range of applications. Increasing the spatial resolution of a sensor is, inevitably, a task of staggering cost, both in terms of deploying larger imaging apertures in space as well as in terms of the amount of collected data that must subsequently be processed. A remedy was sought in the exploitation of the spectral information of the images, and the emergence of multispectral sensors was soon embraced by the remote sensing scientific community. Multispectral sensors, the precursors of hyperspectral sensors, sample the near-infrared (NIR) and short-wave infrared (SWIR) portion of the electromagnetic spectrum in multiple narrow bands. The concept of exploiting spectral rather than spatial information to actually pinpoint materials in a remote sensing scene is based on a fundamental physical property of matter: distinct materials reflect, absorb, scatter and emit incident solar radiation in ways that are characteristic of their molecular composition. Hence, by measuring and observing reflected radiation intensity in a broad range of wavelengths, one can determine the nature of the observed material. Examples of multispectral sensors include Landsat-1, [79], which was launched in 1972, Landsat-7 (the latest in this series of highly successful satellites) launched in 1999, and the Advanced Land Imager (ALI) launched in November 2000 aboard the National Aeronautics and Space Administration (NASA) Earth Observing (EO-1) satellite. The theoretical background of spectroscopy, a science involved in the measurement, analysis and interpretation of electro-optical spectra, was utilized in the framework of remote sensing. The term imaging spectroscopy prevailed, and refers to the simultaneous acquisition of spatially co-registered images over large areas, in many different spectral bands. The small number of spectral bands that were available initially soon proved to be an inherent limitation of multispectral sensors. Depending on the materials under study, a specific set of wavelengths had to be selected in order to be able to discriminate between possible targets and backgrounds. At the same time, as the list of observed targets grew rapidly, researchers could only compromise on the number and the location of the sampled wavelengths. The limited amount of spectral information led to the development of merely application-dependent systems, which, overall, raised the cost of remote sensing observations. A more cost-effective and robust solution was to increase the sensors' spectral resolution,
which led to the development of hyperspectral sensors. Using advances in focal plane technology, hyperspectral sensors have the ability to sample the electromagnetic spectrum in hundreds of narrow contiguous bands. The majority of hyperspectral sensors sample a vast portion of the electromagnetic spectrum, from the visible region (0.4 to 0.7 µm) through the SWIR (about 2.5 µm) bands. Hyperspectral sensors provide an immense amount of spectral information, which can be filtered and processed using application-specific algorithms. Examples of hyperspectral sensors include Hyperion, a VNIR/SWIR hyperspectral sensor with 220 spectral bands aboard the NASA EO-1 satellite, the CHRIS sensor on the European Space Agency's (ESA) PROBA satellite, and NASA's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS).

Figure 2.1: An AVIRIS hyperspectral image of the Leadville mining district in Colorado, with the spectral dimension shown as the top and right faces of the cube. The front of the cube is a true color composite, with areas containing secondary minerals from acid mine drainage highlighted in red, orange and yellow.

2.1.1 Hyperspectral data representation

The optimal exploitation of the information provided by hyperspectral images assumes an in-depth understanding of the manner in which hyperspectral data are sampled and stored.
Hence, we will first discuss the special characteristics and possible representations of these high-dimensional data. Hyperspectral images are sometimes referred to as “image cubes” because they are composed of a sequence of two-dimensional images, each one produced by sampling at a specific wavelength. A pseudocolored illustration of a hyperspectral cube is shown in Figure 2.1. The front face of the cube stands for the spatial dimensions of the image, while the spectral dimension is shown as the top and right faces of the cube. Another representation of hyperspectral images can be obtained on a pixel-by-pixel basis. Specifically, the illustration of Figure 2.2 represents a hyperspectral data cube viewed either as a series of stacked grayscale images or, alternatively, as a matrix of individual pixels carrying spectral information. If we isolate a single pixel, the samples of its spectral information can be plotted as a function of the wavelength, as shown on the left of Figure 2.2 (colder colors are used for shorter wavelengths). This spectral information is called spectral signature or simply spectrum of the pixel. In the case of hyperspectral data, the resulting plot is more or less considered to be continuous, due to the oversampling of the frequency domain.

Figure 2.2: A hyperspectral image cube (Figure found in [85]).

From a signal processing perspective, an alternative representation of hyperspectral data can be derived if we consider the M-dimensional space defined by the spectral bands, where M is the number of spectral bands. Each pixel of a hyperspectral image corresponds to a vector of length M, whose elements are the measurements of radiation intensity at the specific wavelengths. If we assign an axis of space to each wavelength, then each measurement can be viewed as a coordinate in the corresponding axis. In this way, each pixel represents a point in the M-dimensional space. An example plot of 30 pixels belonging to a simulated hyperspectral scene is shown in Figure 2.3, where only three of the available spectral bands are selected for the illustration. This geometrical representation has proven to be quite useful for many spectral processing algorithms, or for the development of advanced signal processing techniques that employ subspace projection operations.
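As a purely illustrative sketch of the two representations discussed above, the short Python snippet below builds a synthetic cube with hypothetical dimensions and shows both the band-wise (grayscale image) and the pixel-wise (M-dimensional vector) views; no real sensor data or specific software library is assumed.

```python
# Minimal sketch of the "image cube" and "pixel vector" views of hyperspectral data,
# using a synthetic cube (rows x cols x M bands). Real data would be loaded from file.
import numpy as np

rows, cols, M = 100, 120, 224          # hypothetical spatial size and band count
cube = np.random.rand(rows, cols, M)   # stand-in for a reflectance cube

# Pixel-wise view: each pixel is a point in the M-dimensional spectral space.
pixels = cube.reshape(-1, M)           # (rows*cols) x M matrix of pixel spectra
spectrum = pixels[0]                   # spectral signature of the first pixel

# Band-wise view: each band is a rows x cols grayscale image.
band_80 = cube[:, :, 80]
print(pixels.shape, spectrum.shape, band_80.shape)
```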

Figure 2.3: Example of data from a simulated hyperspectral scene. Only three wavelengths have been selected for the plot.

2.1.2 Practical considerations

Although hyperspectral data provide a deluge of information, the process of hyperspectral data acquisition is certainly not immune to impediments that distort the information content of hyperspectral images. Some of the issues that have to be addressed, either in the design of the hyperspectral sensor or during the processing of hyperspectral data, include atmospheric effects, such as absorption and scattering, the spectral and spatial resolution of the sensor, the spectral variability of surface materials in the scene, motion and sensor artifacts, and other environmental effects, such as viewing angle, secondary illumination and shadowing.
Inevitably, the atmosphere intervenes between the sensor and the observed objects, and it is known to absorb and scatter solar radiation in a wavelength dependent fashion. First of all, the atmosphere modulates solar illumination before it reaches the ground. This modulation should be measured in order to separate the spectrum of the solar illumination from the actual reflectance spectrum of the surface. This kind of information is scene-dependent, since it may vary according to the solar illumination conditions and angle, the viewing angle of the sensor and, predominantly, the total path through the atmosphere. As far as the scattered solar radiation is concerned, three main propagation paths are worth mentioning. First, the scattered light may be reflected back to the view angle of the sensor, without even reaching the surface. This light is superimposed on the reflected light coming from the scene, causing undesirable effects. Second, another propagation path for the scattered light is the scene itself. It may act as a secondary source of diffused light illumination, especially in the blue region of the visible spectrum. Last but not least, the atmosphere further absorbs and scatters the radiation that is reflected from the scene back to the sensor, allowing these effects to reappear. A number of methods are currently used to estimate and compensate for these atmospheric
propagation effects. The main objective of atmospheric compensation algorithms is to remove solar illumination and atmospheric effects from the measured spectral data, so that an accurate estimate of the surface reflectance can be obtained. Additional details on compensation methods can be found in [57].
The spatial resolution of the sensor also has an impact on the performance of the detection and identification algorithms that focus on the processing of hyperspectral data. A desirable property is that every potential target can be fully resolved in a single pixel. However, in practice, this cannot always be true, since there is a plethora of possible targets with varying size. Considering also the fact that the spatial resolution of a sensor is a direct function of the aperture size, the cost needed to account for all possible target sizes is prohibitive. Consequently, for a given sensor type, a target may be fully resolved (i.e., it covers the total area of a pixel), or it may occupy a sub-pixel area. Highly sophisticated algorithms take into account this vulnerability of hyperspectral data and specialize in the detection of sub-pixel targets.
Spectral variability also has a great influence on hyperspectral data analysis. A basic assumption in hyperspectral signal processing is that every material has a unique spectral signature. Exploiting this physical property, researchers can detect and identify materials by the exploitation of spectral analysis techniques. Spectral variability refers to the fact that different measurement conditions can affect the retrieval of the exact spectral signature of a certain material. Spectral variability may be confirmed using either laboratory data or field measurements and is an active research topic, [37]. Various mechanisms can be responsible for the observed spectral variability, including uncompensated errors in the sensor, uncompensated atmospheric and environmental effects, surface contaminants, variation in the material such as age-induced color fading due to oxidation or bleaching, and adjacency effects in which reflections from nearby objects in the scene change the apparent illumination of the material. Seasonal variations also introduce enormous changes in the spectral character of a scene.
Other environmental effects can also deteriorate the quality of hyperspectral data. Such effects can be the sensor viewing angle or the surface orientation of the target, which determine the amount of illumination received by the observed scene. Clouds and ground cover may cast shadows on the target, substantially changing the illumination of the surface. Nearby objects may also reflect or scatter sunlight onto the target, superimposing various colored illuminations upon the dominant direct solar illumination. In addition, the nonlinear motion of the sensor can corrupt the spectral image by mixing together the spectral returns from different parts of the spatial image. Motion of objects in the scene can also create artifacts, since a certain amount of time is needed to capture a hyperspectral scene.

2.1.3 Hyperspectral imaging applications

An interesting aspect of hyperspectral imaging is its immense number of possible applications. The spectral information sampled by hyperspectral sensors can be adaptively filtered and processed to account for different types of applications. In a nutshell, hyperspectral imaging applications can be separated into three categories, which are also shown in Figure 2.4: (a) anomaly detection, (b) target recognition, and (c) background characterization.


Figure 2.4: Taxonomy of applications for hyperspectral imaging.

Anomaly detection refers to the process of locating and identifying uncommon features of the image. Detection and characterization of man-made objects located in a natural background is a common application in this category. Monitoring crop diseases, stressed vegetation, or the variability of natural effluents are also important examples of applications. The term anomaly usually describes reflectance spectra that deviate from the normal. As is usually the case, anomalies occupy a limited number of pixels in every scene. Target recognition is differentiated from anomaly detection by the availability of some a priori information about the target; a reference spectrum is often used to specify the target of interest. Such reference spectra can be found in specialized spectral libraries, which are specifically developed for this task and comprise laboratory spectral measurements of a variety of materials. Alternatively, the spectral information of the target of interest can be extracted from the image itself, via an appropriate extraction algorithm. Examples of target recognition applications include status characterization, material identification, classification, status monitoring, and change detection. The first two application categories emphasize detection and identification of spatially isolated features in an image, in the form of anomalies or targets. In contrast, the third category performs overall background scene analysis and identification, and spans the domains of land, ocean, and atmosphere. A large number of applications are mainly oriented to land characterization, including trafficability, [69], target characterization, [2, 127], and terrain classification, [96, 87]. An example of background scene characterization is coastal
characterization, including shallow-water bathymetry. In addition, atmospheric compensation is another prominent application of background characterization algorithms, where water vapor and aerosol detection, as well as cloud differentiation, are of major interest.

2.1.4 Spectral processing

From the previous section, it becomes evident that the gamut of potential applications for hyperspectral remote sensing is quite large. However, from a signal processing perspective, all applications can be broadly organized according to the kind of algorithmic procedure used for every application-specific task. This distinction results in the following four categories: (a) target detection or anomaly detection, (b) change detection, (c) classification, and (d) unmixing. Target detection and anomaly detection encompass the tasks of either identifying specific spectral signatures in a hyperspectral image cube, or detecting rare pixels that deviate from the norm and can be characterized as anomalies. Change detection involves the detection and comparison of significant alterations between two hyperspectral scenes of the same geographic location. The third category, namely classification, is a well-known task in the machine learning literature, which involves the process of assigning each hyperspectral image pixel to one of a given number of classes, [31, 47, 25, 38]. Finally, unmixing is formulated as an estimation problem and refers to the estimation of the fraction of the pixel area occupied by each material present in each single pixel. Note that the present thesis concentrates on the thematic area of spectral unmixing, as will also become clear in the sections to follow.

Let us now briefly describe the basic processing stages of a hyperspectral image cube, also shown in Figure 2.5. The raw hyperspectral data from the sensor undergo a series of calibration operations in order to remove artifacts and recover from possible sensor defects. The resulting image cube carries the information of scene radiance. It can either be directly processed, or it can be further compensated for atmospheric effects, in order to produce the reflectance cube. Reflectance values are independent of the observation conditions and are thus most appropriate to describe the intrinsic properties of the scene. When the calibration stages are completed, an optional, yet important for some applications, processing step may follow, that of dimensionality reduction. As already mentioned, the ambient dimension of hyperspectral data is quite large, complicating the processing of hyperspectral data for two reasons. First, high dimensionality adds to the computational complexity of all subsequent processing algorithms. Second, it is difficult to “locate” valuable information in high-dimensional spaces, and hence, hyperspectral data are more difficult to interpret. However, in the analysis of high-dimensional data it is a common assumption that the relevant information in the data resides in a much lower dimensional
space. Therefore, one approach to simplify things is to deploy a dimensionality reduction technique prior to performing either target or anomaly detection or classification. Several important theoretical and algorithmic developments have been made over the recent years under different low-dimensional modeling frameworks, such as compressive sensing (CS), [26, 43], matrix completion, [27, 102], and general factor-model representations, [98, 32]. A widely used dimensionality reduction technique is principal component analysis (PCA), [71], which is also called the discrete Karhunen-Loève transform (KLT). PCA is used to decorrelate hyperspectral data and maximize the information content in a reduced number of features. The exact workings of PCA are described in detail in [61].
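As an illustration, the following Python sketch applies PCA to a (hypothetical) hyperspectral cube stored as a NumPy array; the cube dimensions and the number of retained components are assumptions made only for this example, not values used elsewhere in this thesis.

```python
import numpy as np

def pca_reduce(cube, n_components=10):
    """Reduce the spectral dimension of a hyperspectral cube via PCA.

    cube: array of shape (rows, cols, bands) holding reflectance values.
    Returns the reduced cube of shape (rows, cols, n_components).
    """
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)   # one spectrum per row
    X -= X.mean(axis=0)                         # remove the mean spectrum
    # Eigen-decomposition of the band covariance matrix
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    components = eigvecs[:, order[:n_components]]
    scores = X @ components                     # project onto principal axes
    return scores.reshape(rows, cols, n_components)

# Example with a synthetic 50 x 50 cube of 200 bands
cube = np.random.rand(50, 50, 200)
reduced = pca_reduce(cube, n_components=10)
print(reduced.shape)   # (50, 50, 10)
```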

Figure 2.5: Stages of hyperspectral signal processing, from the radiance cube to end-user products.

Figure 2.5 illustrates a block diagram of the possible stages of hyperspectral signal processing. Starting from a calibrated cube, atmospheric compensation algorithms permit radiance observations to be converted to reflectance values. The process of dimensionality reduction then takes place, where the bands of low information content are discarded. The resulting low-dimensional image cube is further processed according to the application at hand. For example, the tasks of unmixing and detection are illustrated in Figure 2.5. In either case, thematic or detection maps are produced as end-products, respectively. In the following, we will focus our attention on the unmixing problem and discuss the characteristics of different approaches that have been proposed for this problem.


2.2 Spectral unmixing

Hyperspectral sensors usually capture remote scenes in which a number of disparate materials contribute to the spectrum of each measured pixel. The analysis of the spectrum of such mixed pixels with the aim of identifying the constituent materials is an interesting problem that has received considerable attention since the earliest days of hyperspectral imaging. Inevitably, mixed pixels arise for at least one of two main reasons. First, the spatial resolution of the sensor is often insufficient to capture distinct materials in single pixels. In most cases, more than one material will jointly occupy a single pixel, and hence, the resulting measured spectrum will be some composite of the individual spectral signatures of these materials. High altitude airborne systems are a good example of remote sensing systems that retrieve images of varying spatial resolution, according to the altitude of the flight. Second, some materials in nature are combined in a homogeneous mixture, irrespective of the spatial resolution of the sensor. This circumstance may occur, for example, in vegetation areas, where measured pixels will be a mixture of vegetation and soil, no matter the spatial resolution. Figure 2.6 shows a typical hyperspectral scene in which pure as well as mixed pixels are present.

Figure 2.6: A hyperspectral scene comprised of pure and mixed pixels (figure found in [101]).

Spectral unmixing is the procedure by which the measured spectrum of a mixed pixel is decomposed into a collection of constituent spectra, or endmembers, and a set of corresponding fractions, or fractional abundances, that indicate the proportion each endmember
contributes to the pixel, [77]. The spectral breadth of hyperspectral sensors allows us to determine the pure spectral signatures present in every pixel and their corresponding abundance fractions, through the formulation of an overdetermined system of equations. Based on the type of the model used to express these equations, spectral unmixing techniques can be categorized into linear and nonlinear, [77, 101]. In linear spectral unmixing it is assumed that each measured pixel’s spectrum can be decomposed into a linear combination of the individual, macroscopically pure, endmembers’ spectra. This assumption is justified if the total surface area imaged by a pixel is considered to be divided proportionally, according to the endmembers’ abundance fractions. Hence, the incident sunlight is assumed to be reflected straight back to the sensor, so that the radiation reaching the sensor will be measured with respect to the same area proportions. Scattering effects and secondary reflections are not accounted for by the linear model, or their contribution is considered to be minimal. It is interesting to note that the linear model has been widely used in the literature, mainly due to its ease of manipulation.

Figure 2.7: Single versus multiple scattering (figure found in [77]).

Although the linear model has given promising results in interpreting the variability of hyperspectral images, it also has some inherent limitations. It cannot capture the incidental reflection effects and light interactions typically observed at a microscopic scale. An interesting alternative is nonlinear spectral unmixing, which can be more suitable for areas that have a particular endmember distribution, such as those containing trees, sand, or vegetation. In such environments, non-uniformities of the components of interest are randomly distributed through the field of view of the sensor and are responsible for multiple scattering. The nonlinear model assumes that the mixing systematics between the different component spectra comprising a single pixel are nonlinear, due to the multiple scattering effects. The differences and similarities between the linear and nonlinear model are also illustrated in Figure 2.7. In this thesis we adopt and further elaborate on the linear mixing model, which
has been widely used by the remote sensing research community.

2.2.1 The linear mixing model

Let us now focus on the intricate mechanisms that describe the well-known linear mixing model. Assume a remotely sensed hyperspectral image consisting of M spectral bands, and let y be an M × 1 vector containing the measured spectral signature (i.e., the radiance values in all spectral bands) of a single pixel. Also let Φ = [φ1, φ2, . . . , φN] stand for the M × N endmember signature matrix, where the M × 1 vector φi represents the spectral signature of the ith endmember, and N is the total number of distinct endmembers present in the scene. Finally, let w = [w1, w2, . . . , wN]^T be the N × 1 abundance vector associated with y, where wi denotes the abundance fraction of φi in y. The linear mixing model assumes that there is a linear relationship between the spectra of the measured pixel and the endmembers of the form

y = Φw + n,    (2.1)

where n stands for additive noise, which is assumed to be a zero-mean Gaussian distributed random vector with independent and identically distributed (i.i.d.) elements, i.e., n|β ∼ N(n|0, β^{−1} IM), where β denotes the inverse of the noise variance (precision), and IM is the M × M identity matrix. Two physical constraints are generally imposed on the model described by (2.1), namely, the abundance non-negativity constraint (ANC) and the abundance sum-to-one (additivity) constraint (ASC), i.e.,

wi ≥ 0, i = 1, 2, . . . , N,   and   ∑_{i=1}^{N} wi = 1,    (2.2)

respectively. A close examination of the above constraints reveals a quite instructive geometrical interpretation. If we consider each element of the abundance vector as a coordinate in space, it is easy to see that the abundance vector w represents a point in the N-dimensional space. The constraints in (2.2) define a convex polytope, in which the abundance vector should lie. This convex polytope is more widely known as the standard (N − 1)-simplex in the N-dimensional space. Owing to the sum-to-one constraint, the abundance vector w should always lie in the hyperplane 1^T w = 1. For illustration purposes, Figures 2.8a and 2.8b display the colored convex polytopes of constraints in the two- and three-dimensional case, respectively. In the two-dimensional case the resulting polytope is a line, whereas, in the three-dimensional case, it forms an equilateral triangle.

Figure 2.8: Polytope of constraints: (a) two-dimensional case, and (b) three-dimensional case.
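To make the model concrete, the following Python sketch generates a synthetic mixed pixel according to (2.1)–(2.2); the number of bands, the surrogate endmember signatures, and the noise precision are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 100, 4                          # bands, endmembers (illustrative values)
Phi = rng.uniform(0.0, 1.0, (M, N))    # surrogate endmember signature matrix

# Draw an abundance vector on the standard (N-1)-simplex:
# nonnegative entries that sum to one (ANC and ASC).
w = rng.dirichlet(np.ones(N))

beta = 1e4                             # noise precision (inverse variance)
n = rng.normal(0.0, np.sqrt(1.0 / beta), M)

y = Phi @ w + n                        # linear mixing model (2.1)

print("abundances:", np.round(w, 3), "sum =", w.sum())
print("pixel spectrum shape:", y.shape)
```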

2.2.2 Linear unmixing

Over the last decades, many spectral unmixing algorithms have exploited the linear mixing model in order to identify the composition of mixed pixels. Spectral unmixing is formulated as a linear regression problem, where the unknown model parameters may be the endmembers’ spectra or their corresponding abundances. From a machine learning point of view, linear spectral unmixing belongs to the broad category of unsupervised learning. If we delve into the variety of unmixing algorithms, we can distinguish algorithms that either determine the spectral signatures of the endmembers, or, assuming that this knowledge is given a priori, focus on the estimation of the abundances, subject to the constraints. The objective of all linear unmixing algorithms is three-fold:

• to estimate how many endmembers are present in the scene and, specifically, in each separate pixel,

• to determine the spectra of the endmembers comprising the signature matrix Φ,

• to compute their corresponding abundance fractions for each pixel in the image, subject to the constraints.

Depending on the specific task that they address, spectral unmixing algorithms can be categorized into: (a) supervised: the spectral signatures of the endmembers present in the pixel are given a priori, and the estimation problem only considers the abundance fractions, (b) semi-supervised: many different spectral signatures are given in the form of a spectral library, concerning materials that may or may not exist in the scene, and the problem is to determine which and how many of them are present in each pixel, prior to estimating the abundances,
(c) unsupervised: no prior information is provided and the optimization task involves both the extraction of the endmembers from the hyperspectral scene and the estimation of the respective abundances. These three categories are also depicted schematically in Figure 2.9.

Figure 2.9: A simple taxonomy of unmixing algorithms.

Based on the above, the process of linear unmixing can be described as a sequence of three consecutive steps, which are illustrated in Figure 2.10: dimensionality reduction, which is a preprocessing step used to reduce the computational load of the subsequent steps (recall also the related comments in section 2.1.4), endmember extraction, which involves determining the spectral signatures of the endmembers comprising the mixed pixels, and inversion, which is the process of estimating the endmembers’ corresponding abundances. In the following sections we briefly discuss each of these tasks separately.

Figure 2.10: Conceptual diagram for spectral unmixing.


2.2.2.1 Dimensionality reduction

Although the reduction of data dimensionality is a preliminary and optional step, it plays a major role in the unmixing process. Its objective is to discard the bands of the hyperspectral cube that contain noise or have no contribution to the information content of the image. Simultaneously, it keeps a minimal representation of the signal that sufficiently retains the requisite information for successful unmixing in the lower dimension. At the completion of this step, the volume of data that will be further processed is highly reduced, so that the computation time needed by the remaining processing procedures is significantly decreased. The dimensionality reduction step may definitely affect the overall unmixing results, since it suppresses signal noise, reduces the redundancy in the spectral data, and captures the necessary information needed to obtain accurate statistical estimates of the unknown data parameters. At this point, we should make a distinction between the processes of data dimensionality reduction and image compression. Although the two research directions have many features in common, they differ in the intended objective. In image compression the intention is to capture the information content of the image using as few data measurements as possible, so as to reconstruct an approximation of the image itself based on the reduced set of data. In contrast, in dimensionality reduction the aim is to make all necessary transformations to the image in order to reduce the data volume, so as to retain the valuable information regarding image endmembers, and to complete subsequent unmixing steps with lower computational complexity.

2.2.2.2 Endmember extraction

Endmember extraction is, without doubt, the most significant part of the spectral unmixing process, since the quality of the end-products of the unmixing procedure highly depends on this part. Many algorithms have been developed in the literature to specifically address this important task. The objective of these algorithms is to determine the constituent individual spectra that comprise the image, i.e., to estimate the “active” columns of the matrix Φ in the linear model (2.1). It is worth mentioning that endmember extraction algorithms spring from a variety of fields in optimization research and applied mathematics. Among others, one can encounter statistical, deterministic, nonparametric, parametric, geometrical, Bayesian, gradient, or convex optimization methods. Classical techniques include the pixel purity index (PPI), [19], the N-FINDR, [126], the iterative error analysis (IEA), [97], the optical real-time adaptive spectral identification system (ORASIS), [21], and the convex cone analysis (CCA), [62]. Also, an interesting example of a geometrical algorithm is the vertex component analysis
(VCA), [94], whereas, from the Bayesian domain, we can cite the positive source separation (PSS) algorithm, [91]. A survey of spectral unmixing algorithms is provided in [74], and a recent review of spectral unmixing algorithms can be found in [101]. A desirable property of endmember extraction algorithms is that their findings are interpretable by the remote sensing scientific community. To elaborate further on this, the linear mixing model suggests that the endmembers correspond to the macroscopic constituent materials of the image, e.g. water, soil, grass. However, this is not always the case. Inevitably, endmembers extracted from a scene may be substantially differentiated from the spectral measurements of raw materials in laboratory conditions. Moreover, it may be the case that these endmembers cannot be used as prior knowledge to unmix a different scene. Although the findings of automated endmember extraction algorithms are scene-dependent, they provide useful information, as long as they describe realistic components in the scene. The exact correspondence between the actual material spectra and extracted endmembers remains a challenge for all endmember extraction algorithms. A recent trend in endmember extraction algorithms is to take into account not only the spectral but also the spatial information of the hyperspectral cube. Until recently, endmember extraction algorithms considered only the spectral information of the image, treating each pixel as a separate, uncorrelated training sample. Hence, a random perturbation of the initial spatial location of the pixels prior to the endmember extraction process would result in exactly the same image endmembers. However, the spatial information of the image may also convey useful information, since adjacent pixels may potentially be composed of the same endmembers. This kind of information can greatly improve the unmixing results and is well-embraced by the latest endmember extraction algorithms. Typical examples are the spatial spectral endmember extraction (SSEE) algorithm [105], which performs spatial averaging of spectrally similar endmember candidates found via singular value decomposition (SVD), and a spatial preprocessing (SPP) algorithm [128], which estimates, for each pixel vector in the scene, a spatially-derived factor that is used to weight the importance of the spectral information associated with each pixel in terms of its spatial context. Undoubtedly, this is an active area of research, which has been included in the future directions of our work.

2.2.2.3 Inversion process

As already mentioned, the second primary task of spectral unmixing is the inversion process. Having specified the spectral signatures of the endmembers present in a hyperspectral scene, the objective now is to estimate their corresponding abundances. Utilizing the linear model in (2.1), inversion is also a linear regression problem, where the parameters of interest are the abundance fractions wi of the materials. Additionally, the estimation task incorporates the two convex constraints of the model, namely nonnegativity and full additivity, (2.2). In
machine learning, this problem falls under the umbrella of supervised learning. Many algorithms have been developed over the recent years that specifically address the (semi-)supervised spectral unmixing problem. Characteristic examples are the nonnegative least squares (NNLS) method, [81], or the fully constrained least squares (FCLS) method, [60], which uses a penalized iterative method with Lagrange multipliers. Bayesian methods have also been proposed in the literature, which employ hierarchical models and carefully chosen priors to account for the prior knowledge of the abundance constraints, e.g. [41]. Viewed as an optimization problem, inversion can also be addressed by any quadratic programming method, e.g. utilizing interior point methods, [35]. A key aspect of the inversion process is that it encompasses the prominent task of variable selection. Obviously, setting to zero the contribution of a specific endmember to the measured pixel spectrum automatically means that this endmember is not present in the respective pixel. Thus, although a set of multiple distinct endmembers may be available as the product of an endmember extraction algorithm, or, a priori, in the form of a spectral library, inversion is responsible for selecting how many and which of these endmembers are present in each pixel. This is exactly the objective of variable selection schemes. In terms of estimation performance, inversion combined with variable selection schemes can greatly improve the quality of the unmixing results, while, at the same time, it allows the development of computationally efficient unmixing schemes. Recent semi-supervised unmixing approaches exploit this property, based on the concept of ℓ1 norm penalization to enhance sparsity of the abundance coefficients, e.g. [64, 16]. The research work presented in this thesis has also contributed in this direction, via the publications [112, 114], which were among the first to introduce the notion of sparsity for the problem of spectral unmixing, and are discussed in detail in chapter 5.
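For illustration, the following Python sketch estimates the abundances of a single pixel under the nonnegativity and sum-to-one constraints by solving the constrained least squares problem with a generic SciPy solver; the endmember matrix and the pixel are synthetic, and this generic solver is only a stand-in for the specialized methods (NNLS, FCLS) cited above, not an implementation of them.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

M, N = 50, 5
Phi = rng.uniform(0.0, 1.0, (M, N))          # synthetic endmember signatures
w_true = rng.dirichlet(np.ones(N))           # true abundances on the simplex
y = Phi @ w_true + rng.normal(0, 1e-3, M)    # noisy mixed pixel

# Fully constrained least squares: min ||y - Phi w||^2  s.t.  w >= 0, sum(w) = 1
objective = lambda w: np.sum((y - Phi @ w) ** 2)
constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
bounds = [(0.0, None)] * N
w0 = np.full(N, 1.0 / N)                     # start at the simplex barycenter

result = minimize(objective, w0, method='SLSQP', bounds=bounds,
                  constraints=constraints)
print("true     :", np.round(w_true, 3))
print("estimated:", np.round(result.x, 3))
```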


Chapter 3

Sparse modeling and inference methods

In this chapter we introduce the basic concepts of sparse Bayesian linear regression, which is the basis of the rest of this thesis. We confine ourselves to the linear model and elaborate on the potential of ℓ1 regularization methods for variable selection. Through the prism of linear regression we inspect popular methods to perform statistical inference. Sampling methods and Bayesian inference methods are presented. Also, we focus on the relation between the widely used Gibbs sampler and variational approximation methods for Bayesian inference.

3.1 An introduction to linear regression

Regression methods are an integral component of any data analysis concerned with describing the dependence between observed data y and a set of explanatory variables {φi}_{i=1}^N. Etymologically, regression refers to the fact that although the observed data have random values, they tend to “regress” toward their mean value, while linear refers to the type of equations used to describe the relationship between the observed and the explanatory variables. In this chapter we will focus on multiple regression, which allows more than one variable to enter the analysis. Linear regression is considered the cornerstone of statistical data representation. It employs linear functions to model the data, while the unknown model parameters are estimated utilizing the data. As in all types of regression, the conditional distribution of y given {φi}_{i=1}^N is of interest. Usually, linear regression refers to a model in which the conditional mean of y given the values of {φi}_{i=1}^N is an affine function of {φi}_{i=1}^N. According to the application at hand, linear regression problems can be separated into two main categories:

• Prediction: In this case, linear regression can be used to fit a predictive model to
an observed data set of y and {φi}_{i=1}^N values. After developing such a model, if an additional value of φj is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

• Regression: In this case, given a variable y and a number of variables {φi}_{i=1}^N that may be related to y, linear regression analysis can be applied (i) to quantify the strength of the relationship between y and the φj’s, (ii) to assess which φj may have no relationship with y at all, and (iii) to identify which subsets of the φj’s contain redundant information about y.

It is interesting to note that linear regression has been studied rigorously over the past forty years, and it has been extensively used in many scientific disciplines. In particular, two well-known properties are responsible for its wide acceptance: (a) its ease of fit; models in which the unknown parameters are related in a linear fashion are easier to fit than models whose unknown parameters are combined non-linearly, and (b) its simplicity; the statistical properties of the resulting estimators are easier to assess. Linear regression models are often fitted using the least squares approach, which will be discussed in the following section.

3.1.1 Linear models and least squares

Given a vector of input variables, ψ = [φ1, φ2, . . . , φN]^T, the standard linear regression equation can be written as

y = w0 + ∑_{i=1}^{N} φi wi,    (3.1)

where y is the observed variable and w0 is the intercept, also known as the bias in machine learning. The basic property of the model in (3.1) is that y is given as a linear function of the parameters w0, w1, . . . , wN. The scope of the linear model can be extended by considering not only the input variables φi themselves, but also fixed nonlinear functions of them. Without loss of generality, we assume that the first variable φ1 is always one, so as to facilitate having a constant term in the linear combination and being able to express (3.1) as an inner product,

y = φ^T w.    (3.2)

Although the linear model in (3.2) is written for a single response variable, i.e., y is a scalar, it can easily be generalized if multiple response data yi are available, which may be given in the form of an M-dimensional vector y. In that case, the mixing matrix Φ is an M × N matrix with each row being an input vector denoted by ψi^T, i = 1, 2, . . . , M.


The most popular way to fit a linear model is the least squares method. This method estimates the unknown parameter vector w of the linear model by minimizing the well-known residual sum of squares (RSS) function, expressed as

RSS(w) = ∑_{i=1}^{M} (yi − ψi^T w)^2.    (3.3)

The function to be minimized is a quadratic function with respect to the wi’s and its minimization is a relatively simple problem. It can also be expressed in matrix notation,

RSS(w) = (y − Φw)^T (y − Φw) = ‖y − Φw‖_2^2,    (3.4)

where ‖·‖_2 is the ℓ2 norm. The minimum of (3.4) always exists but it may not be unique. By differentiating with respect to w, we get the so-called normal equation

Φ^T (y − Φw) = 0.    (3.5)

For M > N and Φ^T Φ invertible, the unique solution is given by

ŵ_LS = (Φ^T Φ)^{−1} Φ^T y.    (3.6)

The quantity

Φ† = (Φ^T Φ)^{−1} Φ^T    (3.7)

is known as the Moore-Penrose pseudo-inverse of the matrix Φ, [54].

3.1.1.1 Geometrical interpretation of least squares

Apart from its ease of computation, the least squares solution also has an interesting geometrical interpretation. Let us consider the M-dimensional space whose axes are given by the variables yi, i = 1, . . . , M, so that y is a vector in this space. Let us also denote the columns of Φ by φi, i = 1, . . . , N, so that Φ = [φ1, φ2, . . . , φN]. If M > N, we know from linear algebra that the columns of the matrix Φ span a subspace A of R^M, also referred to as the column space of Φ. An illustration of the linear subspace A containing two vectors φ1, φ2 is shown in Figure 3.1. It is apparent from (3.6) that the least squares solution ŷ will lie in the subspace A, since it is expressed as a linear combination of the columns of Φ. However, the sum of squared residuals in (3.3) is equal, up to a scale factor, to the Euclidean distance between the observed vector y and the least squares reconstruction ŷ. Thus, ŷ can be viewed as the orthogonal projection of y on the subspace A. Notice also that
• ŷ is constructed as ŷ = Φŵ_LS = P_A y, where P_A = ΦΦ† is the orthogonal projection matrix for the subspace A,

• the orthogonality between the subspace A and the residual vector (y − ŷ) is also implied in (3.5).

Figure 3.1: The least squares solution ŷ is the orthogonal projection of the data vector y onto the subspace spanned by the columns φi of the mixing matrix Φ.

3.1.1.2 Maximum likelihood and least squares

Linear regression can also be considered in a probabilistic framework, where y is considered as a random vector (i.e., a vector of (scalar) random variables), which is related to the input variable matrix Φ via the following equation

y = Φw + n,    (3.8)

where n is an M × 1 vector of additive, white, zero-mean Gaussian noise with covariance σ^2 I_M, i.e., n ∼ N(n|0, σ^2 I_M), where I_M stands for the identity matrix of size M × M. In this case, the data y can be described via their likelihood function, which is expressed as

p(y|Φ, w, σ^2) = N(Φw, σ^2 I_M) = (2πσ^2)^{−M/2} exp(−‖y − Φw‖_2^2 / (2σ^2)).    (3.9)

This formulation best describes the relation between the least squares solution and the maximum likelihood (ML) estimator of w under Gaussian measurement noise (more details on the ML estimator can be found in section 3.3.1, as well as in [106, 75]). Solving for the
ML estimator we get

ŵ_ML = arg max_w p(y|Φ, w, σ^2) = arg min_w (1/(2σ^2)) ‖y − Φw‖_2^2.    (3.10)

It is obvious that the minimization function of the ML estimator is equal, up to a scale factor, to the residual sum of squares function in (3.3). Thus, the two solutions will be identical. In other words, we conclude that least squares fitting is a maximum likelihood estimation of the fitted parameters if the noise vector n contains components that are independent, normally distributed and share the same (constant) variance (i.e., they are identically distributed). Notice also that we made no assumption on the prior distribution of the input variables Φ.
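As a quick numerical check of this equivalence, the following Python sketch computes the least squares estimate both through the pseudo-inverse of (3.6)–(3.7) and through NumPy's generic least squares solver; the synthetic data and their dimensions are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

M, N = 200, 8
Phi = rng.normal(size=(M, N))
w_true = rng.normal(size=N)
y = Phi @ w_true + rng.normal(scale=0.1, size=M)   # i.i.d. Gaussian noise

# Least squares via the Moore-Penrose pseudo-inverse (3.6)-(3.7)
Phi_pinv = np.linalg.inv(Phi.T @ Phi) @ Phi.T
w_ls = Phi_pinv @ y

# The same estimate via a generic linear least squares solver
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.allclose(w_ls, w_lstsq))   # True: both minimize the RSS, i.e., the ML criterion
```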

3.1.2 Regularized least squares

In some contexts, a regularized version of the least squares solution may be preferable or required. Such contexts may include cases where we want to prevent least squares from overfitting the data, or cases where the matrix Φ^T Φ is close to singular and, as a consequence, the least squares estimate lacks accuracy. Regularization is a mathematical tool to impose some kind of structure on the model parameters. Not surprisingly, it is closely linked to the use of prior distributions in Bayesian analysis, as will be shown in the sections to follow. Usually, regularization is accomplished with the addition of a non-negative parameter-dependent term to the optimization function of least squares, i.e., the regularized minimization function now becomes

RSS(w) + λ REG(w),    (3.11)

where λ is a regularization coefficient that adjusts the trade-off between the data-dependent RSS function and the parameter-dependent REG function. A common choice in data analysis is to use a norm as the function that penalizes the model parameters. This is a typical choice if we consider that a norm always assigns a strictly positive length or size to all nonzero vectors in a vector space. A widely used regularization method, also known as Tychonoff regularization, deploys the squared Euclidean norm as the penalization term, and the minimization task is written as

‖y − Φw‖_2^2 + λ ‖w‖_2^2.    (3.12)

It can be seen that this function remains quadratic with respect to the wi’s, and an analytical solution can be established, which is

ŵ_ridge = (Φ^T Φ + λI_N)^{−1} Φ^T y.    (3.13)

Figure 3.2: Contour plot of the ℓq norm for two variables for various values of q.

This is also referred to as the ridge regression solution. Notice that the diagonal matrix λI_N is added to Φ^T Φ before inversion, so as to potentially prevent ill-conditioning. Capitalizing on the above result, a more general type of norm can be used to create regularized versions of least squares bearing different characteristics. The ℓq norm regularization takes the form

‖y − Φw‖_2^2 + λ ∑_{i=1}^{N} |wi|^q,    (3.14)

where it is easily seen that setting q = 2 leads to Tychonoff regularization. Figure 3.2 shows the contours of two other widely used ℓq norms, for q = 1 and q = 0.5, which promote sparse solutions (that is, solutions where several components of w are zero), as will be discussed later. Before we proceed with the analysis of more general regularization approaches, it is instructive to consider the Bayesian interpretation of the ℓ2-norm regularizer. Turning to the Bayesian framework, let us assume that w is a random vector with a prior multivariate normal distribution. For simplicity, each component is assumed to be independent with variance σw^2 and zero mean. The measured data are also subject to errors, which are also considered to be independent, Gaussian distributed, with zero mean and variance σn^2. Under these assumptions, the ridge regression solution is found to be the most probable solution given the data and the a priori distribution of w, according to Bayes’ theorem. In mathematical terms, the maximum a posteriori probability (MAP) estimator (more details on the MAP estimator can be found in section 3.3.1, as well as in [75]) is computed as

ŵ_MAP = arg max_w p(w|y)
      = arg max_w p(y|w, σn^2) p(w)
      = arg max_w (2πσn^2)^{−M/2} exp(−‖y − Φw‖_2^2 / (2σn^2)) (2πσw^2)^{−N/2} exp(−‖w‖_2^2 / (2σw^2))
      = arg min_w { (1/σn^2) ‖y − Φw‖_2^2 + (1/σw^2) ‖w‖_2^2 }.    (3.15)

Notice the similarities of (3.15) with the squared ℓ2 norm regularized least squares function of (3.12). Using an appropriate value for the regularization coefficient λ, the two functions to be minimized are identical. The MAP estimator in (3.15) also has a closed form solution, which is expressed as

ŵ_MAP = (Φ^T Φ + (σn^2/σw^2) I_N)^{−1} Φ^T y.    (3.16)

Notice that the diagonal regularization matrix is now equal to (σn^2/σw^2) I_N.
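A small Python sketch of the closed-form ridge solution (3.13) on synthetic data is given below; it also reflects the Bayesian reading of (3.16) by choosing λ as a noise-to-prior variance ratio. The data, dimensions, and variances are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

M, N = 100, 20
Phi = rng.normal(size=(M, N))
w_true = rng.normal(size=N)
sigma_n, sigma_w = 0.5, 1.0
y = Phi @ w_true + rng.normal(scale=sigma_n, size=M)

lam = sigma_n**2 / sigma_w**2          # regularization coefficient matching (3.16)

# Ridge regression closed form (3.13)
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)

# Plain least squares for comparison
w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print("ridge norm:", np.linalg.norm(w_ridge))   # shrunk towards zero
print("LS norm   :", np.linalg.norm(w_ls))
```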

3.1.3 The lasso

We now turn our attention to ℓ1 norm regularization, e.g. the least absolute shrinkage and selection operator (lasso), [117], and its variants, whose theoretical properties are fundamental in the field of compressed sensing. The lasso is defined by

ŵ_lasso = arg min_w ‖y − Φw‖_2^2,   subject to   ∑_{i=1}^{N} |wi| ≤ τ,    (3.17)

where τ ≥ 0 is a nonnegative parameter of the optimization problem. It is easily seen that the lasso finds a least-squares solution under the constraint that ‖w‖_1 ≡ ∑_{i=1}^{N} |wi|, the ℓ1 norm of the parameter vector, is no greater than a given value τ. If ‖ŵ_LS‖_1 ≤ τ, where ŵ_LS is the least squares solution, then the regularizing term has no impact on the optimization procedure and the lasso estimate is identical to least squares. On the other hand, if τ < ‖ŵ_LS‖_1, then the coefficients of the least squares solution are shrunk in magnitude and truncated towards zero. Obviously, as τ tends to zero, more coefficients of the lasso solution are forced to zero. Thus, τ is a parameter that adjusts the level of sparsity of the solution. Expressing the lasso using Lagrange multipliers, it is equivalent to write

ŵ_lasso = arg min_w { ‖y − Φw‖_2^2 + λ ∑_{i=1}^{N} |wi| },    (3.18)

where also λ ≥ 0, and there is a one-to-one correspondence between the sparsity promoting parameters τ and λ. The optimization problem in (3.18) may be solved using quadratic
programming, more general convex optimization methods, or specific algorithms such as the least angle regression algorithm (LARS), [44]. The tendency of the lasso estimator to prefer solutions with fewer nonzero parameter values is also depicted in Figure 3.3, where the parameter vector space consists of two components. In this (trivial) case, a sparse solution vector w contains a zero component. The elliptical contours of the residual sum of squares centered at the least squares solution are presented in both subfigures. The shaded areas correspond to the constraint regions, which are formed through the inequalities |w1| + |w2| ≤ τ and w1^2 + w2^2 ≤ τ, and correspond to the ℓ1 and ℓ2 norms, respectively. Note that the sparse solutions are of the form (0, w2), (w1, 0). The regularized least squares solution is located at the point where the elliptical contour first meets the constraint region. The sparse solutions are more likely (but not guaranteed) to be reached from the contours in the ℓ1 case, rather than in the ℓ2 case.

Figure 3.3: Contour lines of the residual sum of squares, centered at the least squares solution. The shaded areas are the constraint regions due to the regularization terms. The final solution results at the location where the contours meet the constraint regions.

3.1.3.1 The lasso as a Bayes estimate

In section 3.1.2 we discussed the Bayesian interpretation of the ℓ2 norm regularization via an appropriate prior distribution. Likewise, the lasso regularization function can be transferred over to the Bayesian framework, by utilizing again an appropriate prior distribution over the parameter vector w. Specifically, each |wi| is considered to be proportional to the (minus) log density of the double exponential (or Laplace) distribution. As a result, we can derive the lasso estimate as the Bayes posterior mode under independent Laplace priors for the
wi’s,

p(wi|ς) = (1/(2ς)) exp(−|wi|/ς),    (3.19)

where ς > 0 is a scale parameter.

Figure 3.4: The Laplace distribution with ς = 1 and the standard normal probability density functions. The former is the implicit prior of the lasso, whereas the latter is the prior for ridge regression.

Under the Gaussian assumption for the data, as expressed in (3.9), the MAP estimator of the parameter vector w is now expressed as

ŵ_MAP = arg max_w p(w|y)
      = arg max_w p(y|w, σn^2) p(w)
      = arg min_w { ‖y − Φw‖_2^2 + (2σn^2/ς) ‖w‖_1 },    (3.20)

where for ς = 2σn^2/λ we get exactly the optimization function of the lasso (3.18). Figure 3.4 shows the univariate Laplace distribution with ς = 1, and the zero-mean, unit-variance normal density function. Notice that the Laplace distribution puts more mass near zero and in the tails. This reflects the tendency of the lasso estimator to favor solutions whose components are either large or zero.
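As an illustration of how the ℓ1 penalty produces exactly sparse solutions, the following Python sketch solves (3.18) with the iterative soft-thresholding algorithm (ISTA); the choice of algorithm, step size, and synthetic data are assumptions made only for this example and are not among the methods developed in this thesis.

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1: component-wise soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(y, Phi, lam, n_iter=1000):
    """Minimize ||y - Phi w||_2^2 + lam * ||w||_1 with iterative soft thresholding."""
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2     # Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - y)    # gradient of the quadratic term
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(4)
M, N = 80, 40
Phi = rng.normal(size=(M, N))
w_true = np.zeros(N)
w_true[[3, 11, 27]] = [1.5, -2.0, 0.8]        # a sparse ground truth
y = Phi @ w_true + rng.normal(scale=0.05, size=M)

w_hat = lasso_ista(y, Phi, lam=0.5)
print("nonzero estimates:", np.flatnonzero(np.abs(w_hat) > 1e-3))
```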

3.1.4 The adaptively weighted lasso

The weighted lasso is expressed as

ŵ_lasso = arg min_w { ‖y − Φw‖_2^2 + λ ∑_{i=1}^{N} αi |wi| },    (3.21)


where {αi}_{i=1}^N are (possibly random) nonnegative weights, which multiply the components of the parameter vector non-uniformly. In the case where the weights {αi}_{i=1}^N stem from an initial estimation procedure for w, the adaptive lasso can be written as, [129],

ŵ_adapt = arg min_w { ‖y − Φw‖_2^2 + λ ∑_{i=1}^{N} α̂i |wi| },    (3.22)

where α̂i = 1/|ŵ_init,i|^γ, ŵ_init is an initial estimate of w, and γ is a positive parameter, γ > 0. If |ŵ_init,i|^γ is large, the adaptive lasso employs a small weight (i.e. little shrinkage) for the ith coefficient ŵ_adapt,i. The adaptive lasso has the obvious property

ŵ_init,i = 0 ⇒ ŵ_adapt,i = 0.    (3.23)

It is easy to verify that the adaptive lasso is a two-stage version of the lasso. The first stage considers an initial estimate of the parameter vector w, so as to determine the penalizing coefficients α̂i. The second stage involves the solution of the convex optimization problem in (3.22), taking into account the nonuniform coefficients of the first stage. Compared to the standard lasso, the adaptive lasso yields consistent estimates of the parameters. Statistical consistency means that, as the sample size grows, the estimates converge to the true values. As shown in [129], the single parameter λ of the standard lasso over-shrinks the nonzero coefficients, causing them to be biased towards zero, and, in general, they are not consistent. Moreover, as also explained in [129], the lasso does not have the so-called oracle property. We say that a method possesses the oracle property if (a) it can correctly select the nonzero coefficients with probability converging to one, and (b) the estimators of the nonzero coefficients are asymptotically normal with the same means and covariances that they would have if the zero coefficients were known in advance. By allowing a relatively higher penalty for zero coefficients and a lower penalty for nonzero coefficients, the adaptive lasso tries to reduce the estimation bias and improve variable selection accuracy. In [129], it is proved that the adaptive lasso possesses the oracle property. Furthermore, the adaptive lasso is also essentially an ℓ1 penalization method, and hence, it retains the attractive convexity property of the lasso.

3.1.4.1 The weighted lasso as a Bayes estimate

Similarly to section 3.1.3.1, the regularization function of the weighted lasso, (3.21), can be easily derived from a Bayesian viewpoint, by utilizing N independent Laplace prior distributions with different scale parameters {ςi}_{i=1}^N. The new prior for the parameter vector w
is now written as

p(w|ς1, ς2, . . . , ςN) = ∏_{i=1}^{N} (1/(2ςi)) exp(−|wi|/ςi),    (3.24)

where ςi > 0, i = 1, 2, . . . , N. Under the Gaussian assumption for the data, as expressed in (3.9), it is easy to verify that the MAP estimator of the parameter vector w is expressed as

ŵ_MAP = arg max_w p(w|y)
      = arg max_w p(y|w, σn^2) p(w)
      = arg min_w { ‖y − Φw‖_2^2 + ∑_{i=1}^{N} (2σn^2/ςi) |wi| },    (3.25)

which, for ςi = 2σn^2/(λαi), is equivalent to (3.21). Thus, it is shown that, in the Bayesian framework, the optimization problem of the weighted lasso is replaced by the search for a mode of the posterior distribution of w, under the specific prior distribution in (3.24) for the parameter vector w and utilizing Bayes’ theorem. To the best of our knowledge, this is the first time this correspondence is shown. In chapter 5, the development of a Bayesian model is presented, which incorporates independent Laplace priors to induce sparsity similar to the adaptive lasso. As also reported in chapter 1, this is one of the contributions of this thesis.
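To illustrate the two-stage character of the adaptive lasso described above, the following Python sketch first obtains an initial (ridge) estimate, builds the weights α̂i = 1/|ŵ_init,i|^γ, and then solves the weighted problem with an ISTA-style iteration; the initial estimator, γ, λ, and the synthetic data are all assumptions made for this illustration only.

```python
import numpy as np

def weighted_lasso_ista(y, Phi, lam, alpha, n_iter=1000):
    """Minimize ||y - Phi w||_2^2 + lam * sum_i alpha_i |w_i| via ISTA."""
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - y)
        z = w - grad / L
        thr = lam * alpha / L                      # component-wise thresholds
        w = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)
    return w

rng = np.random.default_rng(5)
M, N = 80, 40
Phi = rng.normal(size=(M, N))
w_true = np.zeros(N); w_true[[5, 18]] = [2.0, -1.0]
y = Phi @ w_true + rng.normal(scale=0.05, size=M)

# Stage 1: initial (ridge) estimate and adaptive weights
w_init = np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(N), Phi.T @ y)
gamma = 1.0
alpha = 1.0 / (np.abs(w_init) ** gamma + 1e-8)     # small constant avoids division by zero

# Stage 2: weighted lasso with the adaptive penalties
w_adapt = weighted_lasso_ista(y, Phi, lam=0.5, alpha=alpha)
print("selected coefficients:", np.flatnonzero(np.abs(w_adapt) > 1e-3))
```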

3.2 Sampling methods for inference

In this section we depart from sparse regularization methods and their Bayesian modeling, and shift our attention to inference methods that exploit the potential of these models. In the Bayesian framework, learning the parameters of probabilistic models from empirical data is considered a task of statistical inference. We first review Monte Carlo methods based on sampling using Markov chains, and finally we describe the related method of Gibbs sampling, which has been widely applied to problems of Bayesian inference.

3.2.1 Traditional methods

The computation of the posterior distribution for most probabilistic models of practical interest usually requires the integration of high-dimensional functions and, hence, it is analytically intractable. A good approximation can be obtained by utilizing Monte Carlo methods, which are based on numerical sampling. Original Monte Carlo approaches were developed by physicists, in order to compute complex integrals using random number generation methods. More specifically, let us consider a random variable x, whose dimensionality can potentially be large, and let us assume that we wish to compute the integral¹

∫_a^b h(x) dx.    (3.26)

¹ In this example we have assumed that the random variable x is continuous, but the same procedure can also be applied in the discrete case.

If h(x) can be decomposed as the product of a function f(x) and a probability density function (pdf) p(x) defined over the interval [a, b], then the integral can be written as

∫_a^b h(x) dx = ∫_a^b f(x) p(x) dx = E_{p(x)}[f(x)],    (3.27)

where E_{p(x)}[f(x)] denotes the expectation of f(x) with respect to the probability density function p(x). The general idea is to draw a large number of independent samples x^{(t)}, t = 1, . . . , N, from the pdf p(x), so as to approximate the integral by a finite sum, i.e.,

∫_a^b h(x) dx = E_{p(x)}[f(x)] ≈ (1/N) ∑_{t=1}^{N} f(x^{(t)}).    (3.28)

Assuming that a conditional or marginal posterior can be expressed in the integral form of (3.26), Monte Carlo methods can be used to approximate distributions which appear in Bayesian analysis. Moreover, expectations of model parameters with respect to posterior distributions, for example in order to make predictions, can equally be approximated with Monte Carlo methods. It is interesting to note that the estimator in (3.28) is unbiased, provided that the samples x^{(t)} are drawn from p(x), and its accuracy does not depend on the dimensionality of x. In the following sections we discuss the traditional methods of transformation, rejection and importance sampling, before moving on to more sophisticated methods used for Bayesian inference.

3.2.1.1 Transformation method

The transformation method is mainly used to sample from simple nonuniform distributions. Let us focus on the one-dimensional case (extensions to high-dimensional models are straightforward), and assume that we have a set of N independent samples drawn from the standard uniform distribution U([0, 1]). Assume also that the distribution from which we are interested to draw samples is p(x) and that Fx(·) denotes the cumulative distribution
function of the random variable x that is modeled by p(x). Since

p(Fx(x) ≤ u) = p(Fx^{−1}(Fx(x)) ≤ Fx^{−1}(u)) = p(x ≤ Fx^{−1}(u)) = Fx(Fx^{−1}(u)) = u,    (3.29)

where Fx^{−1}(·) denotes the inverse function of Fx(·), we conclude that the random variable u = Fx(x) is uniformly distributed in (0, 1). Thus, x = Fx^{−1}(u) results in samples from the target distribution p(x). The three steps of the inverse transform sampling method can be summarized as follows:

1. Generate a random number u from the standard uniform distribution U([0, 1]).

2. Compute the value x as x = Fx^{−1}(u).

3. Accept x as a random number drawn from the target distribution p(x).

This method is applicable only when the function Fx(·) can be computed and inverted analytically. Unfortunately, this is feasible for a limited number of simple distributions, and its use becomes more difficult as the number of dimensions increases.

3.2.1.2 Rejection sampling

Compared to the transformation method, rejection sampling allows us to sample from more complex distributions. Again, let us consider the one-dimensional case for the description of the method; the extension to multiple dimensions is relatively straightforward once the one-dimensional case has been comprehended. Assume that we want to sample from a distribution p(x) that is too complex to sample from directly. In rejection sampling we utilize a simpler distribution q(x), also termed the proposal distribution, from which it is easier to sample. Let M be a constant, whose value is chosen such that p(x) < M q(x) for all x, and M > 1. The rejection sampling method is a two-step procedure. First, we generate a sample xi from q(x). Next, we generate a number ui from the uniform distribution over [0, M q(xi)]. We accept the sample xi as a sample from the distribution p(x) if ui ≤ p(xi); else, the sample is discarded. The steps of the rejection sampling algorithm can be summarized as follows:

1. Sample xi from q(x) and ui from U([0, M q(xi)]).

2. If p(xi) ≥ ui, then accept xi as a realization of p(x) and repeat the sampling step; else, reject the sample pair (xi, ui).
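A minimal Python sketch of this accept/reject loop is given below; the target density (a bimodal Gaussian mixture), the uniform proposal, and the bound M are assumptions chosen only to make the example concrete.

```python
import numpy as np

rng = np.random.default_rng(6)

def p(x):
    # Target density: a two-component Gaussian mixture
    return 0.5 * np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi) \
         + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2) / np.sqrt(2 * np.pi)

# Proposal: uniform on [-6, 6], so q(x) = 1/12 on that interval
a, b = -6.0, 6.0
q_val = 1.0 / (b - a)
M = 0.25 / q_val          # ensures p(x) < M q(x), since max p(x) < 0.25 here

samples = []
while len(samples) < 10000:
    x = rng.uniform(a, b)               # draw from the proposal
    u = rng.uniform(0.0, M * q_val)     # uniform height under M q(x)
    if u <= p(x):                       # accept with probability p(x) / (M q(x))
        samples.append(x)

samples = np.array(samples)
print("sample mean:", samples.mean())   # close to 0 for this symmetric mixture
```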


It is seen that all samples are drawn from the distribution q(x) and they are then accepted with probability p(x)/(M q(x)). The overall acceptance probability can be computed as

p(accept) = ∫ {p(x)/(M q(x))} q(x) dx = (1/M) ∫ p(x) dx = 1/M.    (3.30)

It is easily discerned that the value of the constant M should be kept as small as possible, in order to increase the acceptance probability. At the same time, M should also satisfy p(x) < M q(x) for all x. Thus, the efficiency of the method depends on how well q(x) approximates p(x). Notice that in the extreme case where q(x) = p(x), we could use M = 1, and, hence, no sample would have to be rejected.

3.2.1.3 Importance sampling

Importance sampling is another fundamental Monte Carlo technique, which is closely related to Markov chain sampling methods. Surprisingly, it does not provide another way to efficiently sample from a complex distribution; it rather provides the means to approximate expectations of the form (3.27) directly. Suppose that we cannot easily draw samples from p(x), but we can evaluate the value of p(x) at any given x. Similarly to rejection sampling, importance sampling is based on a proposal distribution q(x), which approximates the distribution of interest, and from which it is easier to draw samples. Suppose that we want to approximate the expectation in (3.27), which can also be expressed as

E_{p(x)}[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx ≈ (1/N) ∑_{t=1}^{N} f(x^{(t)}) p(x^{(t)})/q(x^{(t)}),    (3.31)

where x^{(1)}, x^{(2)}, . . . , x^{(N)} are samples drawn from q(x). As revealed in (3.31), the basic idea of importance sampling is to draw samples from the proposal distribution q(x), and then weight each sample by the term p(x^{(t)})/q(x^{(t)}) to correct the bias introduced by sampling from the proposal distribution. The only requirement for this method to work is that q(x) must not be zero at the values of x where p(x) is nonzero. As with rejection sampling, the success of the importance sampling approach depends crucially on how well the sampling distribution q(x) matches the desired distribution p(x). In the limiting case where q(x) = p(x), it is easily seen that (3.31) reduces to the simple Monte Carlo estimation method in (3.28). Although the strategies of rejection and importance sampling may be limited to low-dimensional problems and are not generally applied in more complex problems, they can be useful components of the Markov chain methods that are discussed in the sequel.
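The following Python sketch approximates an expectation with the weighted average in (3.31); the target density, the proposal, and the test function f are all assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Target p(x): standard normal; proposal q(x): a wider normal N(0, 2^2)
def p(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q(x):
    return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

f = lambda x: x**2                      # E_p[f(x)] = 1 for the standard normal

N = 100000
x = rng.normal(0.0, 2.0, N)             # samples from the proposal q
weights = p(x) / q(x)                   # importance weights of (3.31)
estimate = np.mean(f(x) * weights)

print("importance sampling estimate:", estimate)   # approximately 1.0
```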


3.2.2 Markov chain Monte Carlo techniques

Markov chain Monte Carlo (MCMC) techniques provide a powerful framework that allows us to sample from a large class of distributions, regardless of the dimensionality of the sample space, [95]. They capitalize on the fact that certain Markov chains converge to a unique invariant distribution, which represents the distribution upon which the objective expectations are estimated. Again, let us assume that it is hard to sample directly from the pdf of interest p(x), but we are able to compute its value at a specific x. As in rejection and importance sampling, a proposal distribution q(x) is used to draw samples. In MCMC sampling, however, a different sampling procedure is adopted: at each step of the algorithm the current sample xt is recorded, and the proposal distribution of the next step is selected in dependence on the current sample xt, i.e., it can be expressed as q(x|xt). In this way, the sequence of generated samples x1, x2, . . . , xN forms a Markov chain. The candidate sample x∗ drawn from q(x|xt) at each step of the algorithm is discarded or not according to an appropriate criterion. Before introducing the Gibbs sampler, which is a widely applicable MCMC method, a brief introduction to Markov chains is provided.

3.2.2.1 Markov chains

In this section we will focus on the properties that describe Markov chains that converge to a unique invariant distribution. More details on Markov chain theory can be found, e.g., in [118, 76, 48]. A Markov chain is a sequence of random variables X0, X1, X2, . . . with the Markov property, namely that, given the present state, the future and past states are independent. In mathematical notation, the following conditional independence holds, [95],

P(X_{t+1} = x^{(t+1)} | X_t = x^{(t)}, {X_n = x^{(n)} : n ∈ Q}) = P(X_{t+1} = x^{(t+1)} | X_t = x^{(t)}),    (3.32)

where Q is any subset of {0, . . . , t − 1}. All the random variables Xt have a common range, which is called the state space of the Markov chain and is denoted by S. In the sequel, we focus on the (easier to follow) case where S is finite and then we generalize very briefly to the case where S is continuous-valued. Moreover, the index t usually denotes the sampling step or “time”. The conditional distributions of Xt+1 given the possible values of Xt are also called transition probabilities (from state to state), and, together with the initial probabilities of the various states, they completely specify a Markov chain. In the following, the transition probability from state x to state x′ at time t is denoted by Tt(x, x′), the initial probability of a state x is denoted as p0(x), and the probability of a state x at time t is denoted as pt(x). If the transition probabilities do not depend on time, the

69

Bayesian signal processing techniques for hyperspectral image unmixing

Markov chain is called homogeneous or stationary and the transition probabilities will be 0 written as T (x, x ). The probability of a particular state x at any given time t + 1 can be expressed utilizing the state probabilities at time t and the transition probabilities, as follows, pt+1 (x) =

X x

0

0

Tt (x , x)pt (x ).

(3.33)

0

This equation is known as the Chapman-Kolomogrov equation, and its successive iteration describes the evolution of the chain. A distribution π(x) over the states of a Markov chain is called invariant (or stationary) if, once reached, it remains unchanged forever. For a homogeneous Markov chain, where the dependence on t can be dropped, we can write π(x) =

X x

0

0

T (x , x)π(x ).

(3.34)

0

Note that a Markov chain can have more than one invariant distribution and that a finite state Markov chain always has at least one invariant distribution. A sufficient but not necessary condition to ensure that a distribution π(x) is invariant (with respect to a Markov chain) is that the equation of detailed balance holds, i.e., 0

0

0

π(x)T (x, x ) = π(x )T (x , x).

(3.35)

This implies that π(x) is invariant because X x

0

0

T (x , x)π(x ) =

0

X x

0

T (x, x )π(x) = π(x)

0

X x

0

T (x, x ) = π(x).

(3.36)

0

A Markov chain that respects detailed balance is said to be reversible. Recall that pt (x) denotes the probability distribution of state x at time t. In [95], a Markov chain is said to be ergodic if the probability distribution over the states, pt (x) converges to an invariant distribution π(x) as t → ∞, regardless of the choice of the initial probability distribution p0 (x), over the states. Obviously, an ergodic Markov chain can have only one invariant distribution, also called equilibrium distribution. At this point, it is interesting to note that all the above properties of Markov chains are, in general, also valid in the case of a continuous state space, if we replace the summation in (3.33) with integrations. Moreover, p0 (x) is now interpreted as a density function for the 0 0 initial state at x, while T (x , x) is the probability density for a transition from state x to state x.


In MCMC sampling, our objective is to construct an ergodic Markov chain whose invariant distribution is the desired distribution from which we want to sample. One such construction is the Gibbs sampler, a simple and widely applicable Markov chain Monte Carlo algorithm.

3.2.3 Gibbs sampling

Gibbs sampling was introduced in the image analysis literature by Geman and Geman in 1984, [52], followed by Besag and York in 1989, [12], and subsequently by Gelfand and Smith in 1990, [50]. The term “Gibbs sampler” is due to Geman and Geman, and refers to the simulation of Gibbs distributions in statistical physics. It is a widely applicable method for probabilistic inference, especially for hierarchical Bayesian models that can be represented by directed graphs, or for cases where the conditional distributions of the individual parameters are easy to compute and sample from. To introduce the inner workings of the Gibbs sampler, let us assume that we want to sample from the distribution p(x) = p(x_1, x_2, . . . , x_N), and we have a set of initial values {x_i^{(0)}, i = 1, . . . , N}. At each step of the sampler the variables are considered sequentially, one after the other. More specifically, the value of each variable is updated by a value drawn from the conditional distribution of that variable given the most recent values of the remaining variables. To understand the rationale of the Gibbs sampler, suppose that we want to sample from a bivariate distribution p(x) = p(x_1, x_2). The univariate conditional distributions p(x_1|x_2) and p(x_2|x_1) are utilized in each step of the sampling procedure. The sampling process starts from an initial value x_1^{(0)} for x_1, which is used to generate a new value for x_2 from the conditional distribution p(x_2|x_1^{(0)}). The sampler proceeds as follows,

x_1^{(t)} \sim p(x_1 | x_2^{(t-1)}),   (3.37)
x_2^{(t)} \sim p(x_2 | x_1^{(t)}).   (3.38)

Cycling through the two variables in turn, a Gibbs sequence is generated. The sampling procedure is also illustrated in Figure 3.5, where the contours of the plot represent the contours of the joint probability p(x_1, x_2), and the labels (0), (1), . . . denote the sampled values, each one completed after both components of x have been updated. Observe that each component is revised along the direction of the coordinate axes. In the case of highly correlated random variables, this procedure results in small moves, and hence, it is difficult for the Gibbs sequence to explore the whole sample space and converge rapidly. By construction, the Gibbs sampler satisfies the conditions of detailed balance, and it can be shown that the joint density p(x) is an invariant distribution of the Markov chain.

Figure 3.5: Three iterations of the Gibbs sampler in the two dimensional space.

To this end, let us consider a transition of the Gibbs sampler Markov chain from state x' to x in the two-dimensional example described above. In addition, to keep things as simple as possible, let us consider the finite state space case. By definition, the transition probability can be expressed as

T(x', x) = p(x_1 | x_2') \, p(x_2 | x_1).   (3.39)

Straightforward computations yield that

\sum_{x'} T(x', x) \, p(x') = \sum_{x'} p(x_1 | x_2') \, p(x_2 | x_1) \, p(x_1', x_2')
                            = p(x_2 | x_1) \sum_{x'} p(x_1 | x_2') \, p(x_1', x_2')
                            = p(x_2 | x_1) \, p(x_1) = p(x),   (3.40)

and, hence, the joint distribution p(x) is an invariant distribution of the Gibbs sampler, with respect to the definition (3.34). This computation can be extended to any number of variables in the same way. Another desired condition for the Markov chain of the Gibbs sampler is that it is ergodic. This can be ensured if all conditional distributions utilized in the Gibbs sampler are nonzero. In this case, the Markov chain of the Gibbs sampler can move from any value of the sample space to any other value in a finite number of steps, with probability greater than zero. If some of the conditional probabilities are zero, then the Markov chain may or may not be ergodic, and a separate analysis should be conducted in this case. The extension of the Gibbs sampler to the case of multiple random variables can be expressed as

1. Initialize {x_i^{(0)}, i = 1, 2, . . . , N}.

2. For t = 1, 2, . . .
   Sample x_1^{(t)} from p(x_1 | x_2^{(t-1)}, . . . , x_N^{(t-1)}).
   Sample x_2^{(t)} from p(x_2 | x_1^{(t)}, x_3^{(t-1)}, . . . , x_N^{(t-1)}).
   ...
   Sample x_j^{(t)} from p(x_j | x_1^{(t)}, x_2^{(t)}, . . . , x_{j-1}^{(t)}, x_{j+1}^{(t-1)}, . . . , x_N^{(t-1)}).
   ...
   Sample x_N^{(t)} from p(x_N | x_1^{(t)}, x_2^{(t)}, . . . , x_{N-1}^{(t)}).

At this point we should mention that there are many variations of the sampling procedure of the Gibbs sampler, like, for example, choosing a random order in which to sample the variables, or grouping the variables while sampling. Moreover, the initial samples of the Gibbs sequence are not representative of the joint invariant distribution and should always be discarded, since the Markov chain needs a period of time to converge, called the “burn-in” period. For more details, an extensive study of the Gibbs sampler is presented in [29].
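As a minimal illustration of the cyclic updates (3.37)-(3.38) and of the burn-in period, the following Python sketch (not part of the original text; the target distribution and all numerical values are arbitrary choices) samples from a zero-mean bivariate Gaussian with correlation ρ, whose full conditionals are the univariate Gaussians x_1 | x_2 ~ N(ρ x_2, 1 − ρ^2) and x_2 | x_1 ~ N(ρ x_1, 1 − ρ^2).

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the target bivariate Gaussian
T, burn_in = 20000, 1000       # chain length and number of initial samples to discard

x1, x2 = 5.0, -5.0             # deliberately poor initial values
samples = np.empty((T, 2))
cond_std = np.sqrt(1.0 - rho**2)

for t in range(T):
    # Draw each component from its full conditional given the most recent value
    # of the other component (the two Gibbs updates of (3.37)-(3.38)).
    x1 = rng.normal(rho * x2, cond_std)
    x2 = rng.normal(rho * x1, cond_std)
    samples[t] = (x1, x2)

kept = samples[burn_in:]                # discard the burn-in period
print(np.corrcoef(kept.T)[0, 1])        # sample correlation, close to rho = 0.8
```

The stronger the correlation ρ, the smaller the coordinate-wise moves and the slower the exploration of the sample space, in line with the remark made above about highly correlated variables.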

3.3 Bayesian inference methods

This section highlights some theoretical insights concerning the most commonly used methods to perform Bayesian inference. Starting with the basics of Bayesian inference, we then present an alternative view of the expectation maximization (EM) algorithm, as described in [18, 123]. In the sequel, the framework of variational approximation is explored, which has been recently developed to ameliorate the shortcomings of the EM algorithm. Finally, the evidence procedure is also presented, which involves the maximization of the marginal likelihood of the data.

3.3.1 Bayesian inference basics

The maximum likelihood (ML) methodology is a well-established parameter estimation process, commonly utilized in the framework of Bayesian probabilistic inference. In this framework, the data are considered to stem from a probabilistic model of known form but unknown parameters, and the aim is to estimate the model parameters based on some observed data. To this end, the so-called likelihood function is utilized, which measures how likely it is for the given data to be represented accurately by specific values of the model parameters. Let y be an M-dimensional random vector that depends on the parameters θ (θ summarizes all the model parameters), and Y = [y_1 y_2 . . . y_N] be the matrix whose columns are N observations of y. Then, the likelihood function^2 can be written as

L(θ|Y) = p(Y; θ).   (3.41)

^2 The likelihood itself does not define a probability distribution, although it is usually expressed as one in the Bayesian literature.

In the notation used in (3.41), we write p(Y; θ) to show that there is no prior distribution assigned to the parameters θ, that is, θ is treated as an unknown parameter vector. Also, we will use the term estimation when we refer to parameters and the term inference when we refer to random variables. As stated before, the likelihood function expresses the ability of the parameters to interpret the observed data, since it may serve as a measure of how likely it is for the given data to appear, for specific values of the parameters. Capitalizing on this, the widely used ML procedure estimates the parameter vector θ so that the likelihood function is maximized, given the observed data. In mathematical terms, the ML estimate is expressed as

\hat{θ}_{ML} = \arg\max_{θ} p(Y; θ).   (3.42)

The ML approach can also be considered as a special case of the maximum a posteriori (MAP) methodology. In the framework of MAP estimation, the parameter vector θ is assumed to be a random variable, whose prior distribution is described by an appropriate pdf p(θ). In this case, we write p(Y|θ) instead of p(Y; θ), in order to express this fact. The MAP estimate is defined as

\hat{θ}_{MAP} = \arg\max_{θ} p(θ|Y),   (3.43)

and, using Bayes’ theorem,

\hat{θ}_{MAP} = \arg\max_{θ} p(Y|θ) \, p(θ),   (3.44)

where p(Y|θ) is the likelihood of the observations. It is interesting to note that the MAP estimator is a “crude” Bayesian estimator, since (3.44) takes into account merely the mode of the posterior distribution. On the contrary, more general Bayesian inference methods are based on the computation of the full posterior of θ. Note that the posterior of θ using Bayes’ law can be written as^3

p(θ|Y) = \frac{p(Y|θ) \, p(θ)}{\int p(Y|θ) \, p(θ) \, dθ}.   (3.45)

^3 In this section we will assume that all random variables are continuous. The same equations can also be expressed in the discrete case, by replacing integrations with summations.

In many complicated problems the exact computation of the posterior distribution of the parameters using (3.45) is a difficult task, mainly due to the complex form of the integral in the denominator of (3.45). In such cases, approximation methods are employed. In the following sections we present some popular Bayesian inference techniques that help in this direction.
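As a simple, hedged illustration of the difference between (3.42) and (3.44), the following Python sketch (not part of the original text; the data, the prior parameters and the noise level are arbitrary placeholder values) uses the conjugate Gaussian case, where both estimators are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observations y_i ~ N(theta, sigma^2) with a Gaussian prior theta ~ N(mu0, tau0^2).
sigma, mu0, tau0 = 1.0, 0.0, 0.5
y = rng.normal(2.0, sigma, size=10)        # data generated with true theta = 2

theta_ml = y.mean()                        # maximizes the likelihood p(y; theta)

# For this conjugate model the posterior is Gaussian, so its mode (the MAP
# estimate of (3.44)) coincides with the posterior mean.
precision = 1.0 / tau0**2 + len(y) / sigma**2
theta_map = (mu0 / tau0**2 + y.sum() / sigma**2) / precision

print(theta_ml, theta_map)   # the MAP estimate is shrunk towards the prior mean mu0
```

With only a few observations the prior pulls the MAP estimate towards mu0; as the number of observations grows, the two estimates converge, reflecting the fact that ML is the special case of MAP with a flat prior.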

3.3.2 The EM algorithm

In the preceding section, we have seen that both the ML and the MAP estimation methodologies rely on the likelihood function to estimate the parameters of a probabilistic model. However, the likelihood function p(y; θ) for some models may be too complex to derive in closed form, or it may be difficult to optimize. To alleviate this problem, hidden or unobserved variables, summarized in a vector z, can be introduced in the probabilistic model, in order to play a connecting role between the model parameters and the data observations; they serve as a means to decompose the likelihood of the data into simpler forms. Hidden variables are not observed and their choice depends on the adopted model and the nature of the problem. They usually have a physical interpretation in the probabilistic mechanism that is assumed to generate the data, although this is not obligatory. They should, though, provide some information about the data, so that the conditional probability p(y|z) is easy to compute. With the help of the hidden variables z, the likelihood of the data can be computed as

p(y; θ) = \int p(y, z; θ) \, dz = \int p(y|z; θ) \, p(z; θ) \, dz,   (3.46)

where it is assumed that the integral in (3.46) is easily computed. The posterior probability of the hidden variables is also a function of the likelihood, i.e.,

p(z|y; θ) = \frac{p(y|z; θ) \, p(z; θ)}{p(y; θ)}.   (3.47)

The computation of (3.47) is also crucial, in order to infer the values of the hidden variables from the data itself. The expectation maximization (EM) algorithm, [39, 88], is an iterative technique to determine the maximum likelihood estimates of the parameters of a model that has hidden variables, without explicitly computing the likelihood function. The EM is based on the following decomposition of the likelihood function,

\ln p(y; θ) = \int \ln\big(p(y; θ)\big) \, q(z) \, dz
            = \int \ln\left(\frac{p(y, z; θ)}{p(z|y; θ)}\right) q(z) \, dz
            = \int \ln\left(\frac{p(y, z; θ) \, q(z)}{p(z|y; θ) \, q(z)}\right) q(z) \, dz
            = \int \left[\ln\left(\frac{p(y, z; θ)}{q(z)}\right) - \ln\left(\frac{p(z|y; θ)}{q(z)}\right)\right] q(z) \, dz
            = F(q, θ) + KL(q \| p),   (3.48)

where

F(q, θ) = \int q(z) \ln\left(\frac{p(y, z; θ)}{q(z)}\right) dz,   (3.49)

and

KL(q \| p) = - \int q(z) \ln\left(\frac{p(z|y; θ)}{q(z)}\right) dz,   (3.50)

with q(z) being any probability density function, and KL(q‖p) being the Kullback-Leibler distance between the distributions p(z|y; θ) and q(z). A brief inspection of the quantities F(q, θ) and KL(q‖p) in (3.49) and (3.50), respectively, reveals that they differ in sign and that the former contains the joint distribution of y and z, whereas the latter contains the conditional distribution of z given y. From the definition of the Kullback-Leibler distance we have that KL(q‖p) ≥ 0, where equality holds if, and only if, p(z|y; θ) = q(z). This means that F(q, θ) ≤ ln p(y; θ), i.e., the function F(q, θ) is a lower bound of the log-likelihood. The EM is a two-step iterative process that maximizes the lower bound F(q, θ) with respect to the parameters θ and the density q(z). To describe the first step, also known as the E-step, assume that the current value of the parameter vector is θ_old. In the E-step, the lower bound F(q, θ) is maximized with respect to q(z), while holding θ fixed to θ_old. The solution to this maximization problem is easily obtained if we observe that the value of ln p(y; θ_old) does not depend on q(z), and so, the largest value of F(q, θ_old) occurs when the Kullback-Leibler distance is set to zero, i.e., when q(z) = p(z|y; θ_old). In this case the lower bound is equal to the log-likelihood. In the subsequent M-step, the lower bound F(q, θ) is maximized with respect to θ, whereas the density q(z) is held fixed. This optimization results in a new value for the parameters, θ_new. It causes not only the lower bound, but also the corresponding log-likelihood, to increase. In addition, since the density is held fixed during the M-step, it will not be equal to the new posterior p(z|y; θ_new) and the Kullback-Leibler distance will be greater than zero. Hence, the increase in the log-likelihood will be greater than the increase in the lower bound. If we substitute q(z) = p(z|y; θ_old) in (3.49) and expand the lower bound, we get

F(q, θ) = \int p(z|y; θ_old) \ln p(y, z; θ) \, dz - \int p(z|y; θ_old) \ln p(z|y; θ_old) \, dz
        = Q(θ, θ_old) + C,   (3.51)

where C is a constant that does not depend on θ. As a result, the function that is maximized in the M-step is computed as

Q(θ, θ_old) = \int p(z|y; θ_old) \ln p(y, z; θ) \, dz = E_{p(z|y; θ_old)}[\ln p(y, z; θ)],   (3.52)

which is the expectation of the joint log-likelihood of the data and the hidden variables, with respect to the conditional posterior distribution of the hidden variables given the value θ_old for θ. The EM algorithm is summarized in the following two steps:

1. E-step: Compute p(z|y; θ_old).

2. M-step: θ_new = \arg\max_{θ} Q(θ, θ_old).
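As a concrete, self-contained illustration of these two steps (and not a reproduction of any algorithm proposed in this thesis), the following Python sketch runs EM for a two-component univariate Gaussian mixture, where the hidden variables z are the component labels; the data and all initial values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a two-component 1-D Gaussian mixture.
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Initial parameter values theta = (pi, mu, sigma) for the two components.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(y, m, s):
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: posterior responsibilities p(z | y; theta_old) of the hidden
    # component labels for every observation.
    r = pi * normal_pdf(y[:, None], mu, sigma)          # shape (n_samples, 2)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: maximize Q(theta, theta_old) by re-estimating the parameters
    # from the responsibility-weighted data.
    nk = r.sum(axis=0)
    pi = nk / len(y)
    mu = (r * y[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)
```

Each E-step computes the posterior of the hidden labels exactly, which is what makes plain EM applicable to this model; when this posterior is intractable, the variational methods discussed below are needed.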

At this point we should mention that the EM can also be used to obtain MAP estimates of the model parameters. Using Bayes’ theorem we can write

\ln p(θ|y) = \ln p(y|θ) + \ln p(θ) - \ln p(y),   (3.53)

and utilizing the same decomposition of the log-likelihood as in (3.48), we get

\ln p(θ|y) = F(q, θ) + KL(q \| p) + \ln p(θ) - \ln p(y) \geq F(q, θ) + \ln p(θ) - \ln p(y),   (3.54)

where ln p(y) is a constant. We can again maximize the lower bound in (3.54) with respect to q(z) and θ. In this case, the E-step of the algorithm remains exactly the same, whereas the prior term ln p(θ) is introduced into the M-step.

It is also interesting to point out that a requirement of the EM algorithm is that the conditional posterior pdf of the hidden variables given the data, p(z|y; θ), can be computed explicitly. Although p(z|y; θ) is easier to derive than p(y; θ), the former may also be intractable for some complicated probabilistic models. In such cases the EM algorithm is not applicable in a straightforward manner. To ameliorate this shortcoming of the EM, variational approximation methods have been proposed in the signal processing literature, which can solve complex problems of probabilistic inference.

3.3.3 The variational EM

The basic idea of the variational Bayesian framework is to approximate the posterior distribution of the hidden variables or parameters with a simpler distribution. This distribution usually admits a factorized form, which implies that the parameters of interest are independent given the data. As the name suggests, the variational EM is an extension of the EM algorithm just described. The term “variational” stems from the calculus of variations, [125], the field of mathematics that deals with maximizing or minimizing functionals, as opposed to ordinary calculus, which deals with the optimization of ordinary functions. In mathematics, a functional is usually a mapping from a set of functions to the real numbers. In our example, the lower bound of the likelihood of the data, F(q, θ), expressed in (3.49), is a functional, since its optimization is conducted with respect to the parameters θ and the function q(z). Supposing that the true posterior p(z|y; θ) of the hidden variables cannot be computed analytically, we consider a family of distributions q(z) to approximate the true posterior, and we try to minimize the Kullback-Leibler divergence between p(z|y; θ) and q(z). By the term family, we mean that optimization is performed over quadratic, exponential, or any other type of distribution that is likely to lead to a tractable solution. The form of factorized distributions has gained considerable attention in the variational Bayesian methodology. Making no assumption on the type of the distribution itself, we suppose that the hidden variables can be separated into groups, and that their joint distribution is factorized with respect to these groups, i.e.,

q(z) = \prod_{i=1}^{N} q_i(z_i) = \prod_{i=1}^{N} q_i,   (3.55)

where, for simplicity, we have replaced q_i(z_i) by q_i. It is interesting to note that the idea of utilizing a factorized form of distributions comes from an approximation framework developed in physics called mean field theory, [99]. Utilizing (3.55), the E-step of the EM algorithm is performed in an iterative fashion with

respect to each of the factors in (3.55). To this end, we substitute (3.55) in (3.49), and then focus on the dependence on a single factor q_j, to obtain

F(q, θ) = \int \prod_i q_i \left[ \ln p(y, z; θ) - \sum_i \ln q_i \right] dz
        = \int q_j \left[ \int \ln p(y, z; θ) \prod_{i \neq j} (q_i \, dz_i) \right] dz_j - \int q_j \ln q_j \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i
        = \int q_j \ln \tilde{p}(y, z_j; θ) \, dz_j - \int q_j \ln q_j \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i
        = -KL(q_j \| \tilde{p}) - \sum_{i \neq j} \int q_i \ln q_i \, dz_i,   (3.56)

where we have defined a new distribution \tilde{p}(y, z_j; θ), such that

\ln \tilde{p}(y, z_j; θ) = E_{q_{i \neq j}}[\ln p(y, z; θ)] = \int \ln p(y, z; θ) \prod_{i \neq j} (q_i \, dz_i),   (3.57)

where E_q[x] denotes the expected value of the random variable x with respect to the probability density function q. To maximize the lower bound with respect to q_j, it is apparent that the Kullback-Leibler distance KL(q_j ‖ p̃) has to be minimized, independently of the form of q_j. The Kullback-Leibler distance becomes zero when q_j = p̃(y, z_j; θ). Thus, the optimal solution has the form

\ln q_j^*(z_j) = E_{q_{i \neq j}}[\ln p(y, z; θ)] + \text{const}.   (3.58)

The constant in the above equation can be found by normalization, so that we get

q_j^*(z_j) = \frac{\exp\big(E_{q_{i \neq j}}[\ln p(y, z; θ)]\big)}{\int \exp\big(E_{q_{i \neq j}}[\ln p(y, z; θ)]\big) \, dz_j}.   (3.59)

It is easily seen that (3.59) is not a closed form solution for the factors q_j, j = 1, 2, . . . , N. Each factor q_j^*(z_j) depends on the remaining ones, since the expectation on the right hand side of (3.59) is taken with respect to q_{i≠j}(z_{i≠j}). Hence, an iterative estimation procedure can be adopted to solve the system of equations (3.59). First we initialize all the distributions q_j, j = 1, 2, . . . , N, appropriately, and then we cycle through the factors, replacing each one with an updated estimate given by (3.59), where the most recent values of all the other factors are utilized.

The variational EM can be summarized in the following two steps:

1. Variational E-step: Compute the q_j^*(z_j), j = 1, 2, . . . , N, that maximize F(q, θ), by iteratively solving (3.59).

2. Variational M-step: Find θ_new = \arg\max_{θ} F(q^*, θ).
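The following toy Python sketch (an illustration only; the target distribution and all numerical values are arbitrary choices) shows the cyclic nature of (3.59) for the classical example of approximating a correlated bivariate Gaussian by a factorized q(z_1) q(z_2). For this target the optimal factors are themselves Gaussian, so only their means need to be updated in turn; there are no model parameters θ here, so only the variational E-step is exercised.

```python
import numpy as np

# Target distribution: a correlated bivariate Gaussian N(mu, inv(Lambda)).
mu = np.array([1.0, -1.0])
Lambda = np.array([[2.0, 1.2],      # precision (inverse covariance) matrix
                   [1.2, 2.0]])

# Factorized approximation q(z) = q1(z1) q2(z2). For a Gaussian target the
# optimal factors are Gaussian with precisions Lambda[0,0] and Lambda[1,1];
# only their means m1, m2 have to be determined by cycling through (3.59).
m1, m2 = 0.0, 0.0
for _ in range(50):
    m1 = mu[0] - Lambda[0, 1] / Lambda[0, 0] * (m2 - mu[1])
    m2 = mu[1] - Lambda[1, 0] / Lambda[1, 1] * (m1 - mu[0])

print(m1, m2)   # the factor means converge to the true means mu = (1, -1)
```

The factorized approximation captures the means of the target but, as is typical of mean-field approximations, it underestimates the spread along the correlated directions.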

3.3.4 The evidence approximation

The evidence approach is another method to perform probabilistic inference for Bayesian hierarchical models, in which the prior is estimated from the data. This is in contrast to standard Bayesian inference methods, which assume that the prior is fixed before the data are observed. It can be viewed as an approximation to a fully Bayesian treatment of a hierarchical model^4, wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. The term evidence approximation arises from the field of machine learning. However, the evidence approximation is also known in statistical inference as empirical Bayes, [10, 51], type-II maximum likelihood, [9], or generalized maximum likelihood, [124]. Let us assume a two-stage hierarchical model, so that the observed data y are parameterized by the unobserved parameters θ, which in turn have a prior distribution that depends on the hyperparameters λ according to a probability distribution p(θ|λ). The evidence approximation is based on the maximization of the marginal likelihood in order to obtain estimates for the set of hyperparameters. As its name implies, a marginal likelihood function is a likelihood function in which some parameters have been integrated out. For the specific hierarchical model, the marginal likelihood is expressed as a function of λ,

L(λ|y) = p(y|λ) = \int p(y|θ, λ) \, p(θ|λ) \, dθ,   (3.60)

where it is seen that the parameters θ are integrated out.

^4 In a hierarchical model, the involved parameters are arranged hierarchically in a manner that shows their interdependence (see e.g. [65]). Such a model is proposed in Chapter 5 of this thesis.

Besides, the posterior distribution p(λ|y) of the hyperparameters can be expressed using Bayes’ theorem as

p(λ|y) = \frac{p(y|λ) \, p(λ)}{p(y)},   (3.61)

which depends on the marginal likelihood (3.60). In the evidence approximation it is assumed that this posterior is sharply peaked, so that we can obtain the most probable values for the hyperparameters λ by maximizing their marginal likelihood, i.e.,

λ_{typeII} = \arg\max_{λ} p(y|λ).   (3.62)

Then, Bayesian inference on the posterior p(θ|y) of interest can be performed via

p(θ|y) = \frac{p(y|θ) \, p(θ)}{p(y)} \approx \frac{p(y|θ) \, p(θ|λ_{typeII})}{p(y|λ_{typeII})}.   (3.63)

In this way, we can see that the evidence approximation consists in using point estimates instead of the whole distribution for the parameters λ. It is also interesting to note that the marginal likelihood may be difficult to compute and optimize directly. In such cases, the EM algorithm can be utilized to obtain the estimates of the hyperparameters λ_{typeII}.
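To illustrate the maximization in (3.62) on a model where the marginal likelihood is available in closed form, the following Python sketch (illustrative only; the linear-Gaussian model, the grid and all numerical values are assumptions made for the example) performs type-II maximum likelihood for the prior precision of a Bayesian linear regression model by a simple grid search.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear observation model y = X w + n with known noise precision beta and a
# zero-mean Gaussian prior w ~ N(0, alpha^{-1} I); alpha is the hyperparameter.
M, N, beta = 50, 10, 25.0
X = rng.normal(size=(M, N))
w_true = rng.normal(scale=0.5, size=N)
y = X @ w_true + rng.normal(scale=1.0 / np.sqrt(beta), size=M)

def log_evidence(alpha):
    # Marginal likelihood p(y | alpha) = N(y | 0, beta^{-1} I + alpha^{-1} X X^T),
    # i.e. the integral of (3.60) computed in closed form for this Gaussian model.
    C = np.eye(M) / beta + X @ X.T / alpha
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (M * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

alphas = np.logspace(-3, 3, 200)
alpha_typeII = alphas[np.argmax([log_evidence(a) for a in alphas])]
print(alpha_typeII)
```

In practice the grid search is usually replaced by gradient-based updates or by the EM iterations mentioned above, but the underlying quantity being maximized is the same marginal likelihood.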


Chapter 4

Soft-constrained MAP abundance estimation

In this chapter a novel method for supervised spectral unmixing is presented. The proposed method consists of a soft-constrained MAP estimator, which is specifically designed to take into consideration the convex constraints of the fractional abundances. Experimental results on simulated and real hyperspectral data are presented that illustrate the enhanced performance of the proposed method. Finally, a case study on the unmixing of OMEGA data collected from the Mars Express mission for planetary mineral exploration is presented. More specifically, a description of the working framework of this application is given, and results from the application of the proposed unmixing method and other relevant techniques are provided.

4.1 A soft-constraint MAP estimator

In this section, a recently proposed soft-constrained maximum a posteriori probability (MAP-s) method [8] is adopted and properly adjusted for abundance estimation in hyperspectral images. In [8], the hard constraints of the estimation problem are considered as a priori knowledge of the estimator and described via a suitably computed multivariate Gaussian distribution. The mean and covariance matrix of this distribution are obtained by solving a linear matrix inequalities (LMI) optimization problem [22]. In the present work, the symmetry of the constraints of the abundance estimation problem is exploited and closed form expressions are derived for both the mean and the covariance matrix of the multivariate Gaussian distribution. These expressions can be used directly to construct the estimator, thus avoiding the need to solve an LMI optimization problem. It should be noted that, due to the statistical approach followed to describe the constraints, the MAP-s estimator is likely to violate the initial hard constraints. To be able to assess the performance of the estimator, a projection of the obtained solution onto the set of constraints is also applied. The proposed estimation method offers significant computational savings compared to existing constrained optimization techniques, making it especially attractive for applications involving real-time processing of hyperspectral data. As verified by simulations, this computational gain comes with no essential loss in the performance of the method.

4.1.1 Problem formulation

As detailed in chapter 2 of the thesis, an M-dimensional vector y is assigned to each pixel in a hyperspectral image, where M is the number of spectral bands. The elements of y correspond to the reflectance energy measured at the respective spectral bands. By assuming that there exist N distinct materials in the image scene (endmembers), we define the M × N matrix Φ = [φ_1 φ_2 . . . φ_N], where φ_j is the spectral signature vector of the jth material. Let w = [w_1, w_2, . . . , w_N]^T be the N × 1 vector of abundances associated with y, i.e. w_j is the abundance fraction of the jth material in the pixel y. A commonly used model to describe the combination of materials in a pixel is the linear mixing model (LMM) [86]. According to this model, y is expressed as follows:

y = Φw + n,   (4.1)

where n is the measurement noise, assumed to be zero-mean with covariance Σ_n. Assuming that the spectral signatures φ_1, φ_2, . . . , φ_N of the endmembers are already known, we are interested in estimating the vector w of abundances. Two constraints are naturally imposed on the abundance vector w: the abundance sum-to-one constraint (ASC), i.e.,

\sum_{i=1}^{N} w_i = 1,   (4.2)

and the abundance nonnegativity constraint (ANC), i.e.,

0 \leq w_i \leq 1, \quad i = 1, 2, . . . , N.   (4.3)

By viewing w as a point in an N-dimensional space, it is easily shown that the above constraints define a standard (N − 1)-simplex [85], i.e. a simplex whose vertices are the points corresponding to the columns of the identity matrix. Denoting this simplex by S, each point w associated with an image pixel y must reside inside the hypervolume of S. A pictorial representation of the ANC and ASC in the three-dimensional space is shown in Figure 4.1. In this case, all points must lie on the surface area of the standard 2-simplex, which corresponds to the shaded area in the figure.

Figure 4.1: ASC and ANC define a 2-simplex in the three-dimensional space.

4.1.2 The proposed method

The problem defined in the previous section can be formulated as a least squares (LS) minimization problem with constraints. This turns out to be a convex optimization problem [23], whose solution relies on quadratic programming methods. Applying such a method may become prohibitive, due to its high computational complexity. In this section we follow an alternative approach, by properly modifying and extending the recently proposed soft-constrained maximum a posteriori (MAP-s) estimator of [8]. The MAP-s estimator belongs to the class of superefficient estimators, i.e. it always provides better performance than the LS estimator, in terms of the matrix mean square error (MSE). Let Σ_LS ≜ (Φ^T Σ_n^{-1} Φ)^{-1} denote the covariance matrix of the unconstrained LS estimator. An estimator ŵ is called superefficient or LS-dominating if its MSE does not exceed the MSE of the LS estimator, i.e.

E\left[(w − ŵ)(w − ŵ)^T\right] \leq Σ_{LS}.   (4.4)

To provide MSE improvement, superefficient estimators exploit some a priori knowledge about the parameters to be estimated. In the problem under consideration, the a priori knowledge is the set of constraints which are naturally imposed on w. Under the assumption that w is a random vector with a prior Gaussian distribution^1, the MAP-s estimate of w is:

ŵ = \left(Φ^T Σ_n^{-1} Φ + Σ^{-1}\right)^{-1} \left(Φ^T Σ_n^{-1} y + Σ^{-1} w̄\right),   (4.5)

where w̄ ∈ R^N and Σ ∈ R^{N×N} are the mean and covariance matrix of w, respectively.

^1 A different prior distribution for w could also be adopted, leading to a different estimation approach [91].

The main idea here is to select the parameters w̄ and Σ, based on the knowledge of the polytope of constraints S, so as to guarantee the LS-domination property (4.4), ∀ w ∈ S. After some calculations reported in [8], the LS-domination condition (4.4) can be rewritten in terms of Σ and w̄ as follows:

Σ \geq \frac{1}{2}\left[(w − w̄)(w − w̄)^T − Σ_{LS}\right].   (4.6)

In order to compute the most appropriate parameters w̄ and Σ according to the constraint set S, Lemma 1 of [8], proved for positive definite matrices, needs to be extended to positive semidefinite matrices. This is so because in the present case the polytope of constraints S lies on a hyperplane of dimension (N − 1) in the N-dimensional space.

Lemma 4.1.1. Let P = P^T ≥ 0 and w ∈ R(P), where R(·) denotes column space. Then

P \geq w w^T \iff w^T P^{\dagger} w \leq 1,   (4.7)

where (·)^{\dagger} denotes the Moore-Penrose pseudoinverse.

Proof. The result follows directly from the properties of the generalized Schur complement [5]. According to the Albert nonnegative conditions [5], for matrices P, C and B with appropriate dimensions, the following two statements are equivalent: i) P ≥ 0 and C − B^T P^{\dagger} B ≥ 0 and R(B) ⊂ R(P); ii) C ≥ 0 and P − B C^{\dagger} B^T ≥ 0 and R(B^T) ⊂ R(C). Therefore, if P ≥ 0, C ≥ 0, R(B) ⊂ R(P), and R(B^T) ⊂ R(C), then P ≥ B C^{\dagger} B^T ⟺ C ≥ B^T P^{\dagger} B. Setting B = w and C = 1 completes the proof.

Let E(c, P) ≜ {w : (w − c)^T P^{\dagger} (w − c) ≤ 1} denote an ellipsoid with center c, where P is a symmetric positive semidefinite matrix. Lemma 4.1.1 suggests that the matrix (P − w w^T) is positive semidefinite if and only if every point w belongs to the ellipsoid E(0, P). By setting

P ≜ 2Σ + Σ_{LS},   (4.8)

Lemma 4.1.1 provides the following geometrical interpretation of Eq. (4.6): an MSE improvement over the LS estimator is achieved if and only if the ellipsoid E(w̄, P) contains the polytope of constraints S. Capitalizing on this result, it can be shown, as in [8], that Σ and w̄ must satisfy the following set of linear matrix inequalities (LMI):

\begin{bmatrix} 2Σ + Σ_{LS} & e_i − w̄ \\ (e_i − w̄)^T & 1 \end{bmatrix} \geq 0, \quad i = 1, 2, . . . , N, \qquad Σ \geq 0,   (4.9)

where e_i, i = 1, 2, . . . , N, are the vertices of the (N − 1)-simplex, which coincide with the columns of the identity matrix. The MAP-s estimator can then be stated as follows: given S and Σ_LS, select the parameters w̄ and Σ so that:

\min_{w̄, Σ} |Σ| \quad \text{subject to (4.9)},   (4.10)

where | · | denotes the determinant of a matrix. In geometrical terms, minimization of the determinant of Σ corresponds to finding the minimum volume ellipsoid containing S. The above minimization criterion does not guarantee the minimum achievable MSE, but gives rise to standard convex LMI problems that can be solved in polynomial time [22]. Although the minimization process is based on both the parameters w̄ and Σ, it has been shown in [8] that for symmetric constraints (as is the case for (4.2) and (4.3)) the optimal solution is obtained when w̄ is selected as the center of symmetry of the constraint set. In the problem of estimating the abundance vectors in a hyperspectral image, the polytope of constraints is explicitly defined as a standard (N − 1)-simplex S. It can be shown that the minimum volume ellipsoid circumscribing S is the hypersphere E(w̄, P), with

P^{\dagger} = \begin{bmatrix} 1 & -\frac{1}{N-1} & \cdots & -\frac{1}{N-1} \\ -\frac{1}{N-1} & 1 & \cdots & -\frac{1}{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{N-1} & -\frac{1}{N-1} & \cdots & 1 \end{bmatrix},   (4.11)

and

w̄ = [1/N, 1/N, . . . , 1/N]^T.   (4.12)

87

Bayesian signal processing techniques for hyperspectral image unmixing

positive semidefinite matrix of rank N − 1 expressed as follows     P=  

(N −1)2 N2 − NN−1 2

.. .

− NN−1 2 (N −1)2 N2

− NN−1 2

..

.

···

··· .. . .. . − NN−1 2

− NN−1 2 .. .

− NN−1 2

(N −1)2 N2

    .  

(4.13)

It should be noted that since w satisfies the constraint (4.2) and the elements of each row ¯ ∈ R(P), as required by the assumptions of Lemma 4.1.1. of P add to zero, w − w From (4.8), (4.13) and assuming knowledge of ΣLS , Σ can be written as folows: Σ=

1 (P − ΣLS ) 2

(4.14)

By substituting w and Σ from (4.12) and (4.14) in (4.5), an estimate of the abundance vector is directly obtained. These quantities need to be computed only once and are then used for the estimation of the abundance vector of each pixel in the image. It should be noted that due to the form of matrix P, Σ given by (4.14) is near to singular. This may seriously affect the estimate in (4.5), where the inverse of Σ needs to be computed. To tackle this problem, some kind of regularization can be applied, i.e., the inverse of Σ is computed as (Σ + δ I)−1 , where δ is a small positive constant and I is the N × N identity matrix.

4.1.3 Imposing the hard constraints As described above, the MAP-s estimator assumes that the vector of abundances has a prior Gaussian distribution, i.e. w ∼ N (w, Σ), where w, Σ are given by (4.12) and (4.14) respectively. Due to this statistical assumption, for some pixels in the image, the corresponding abundance vectors may lie outside the polytope S, violating the hard constraint (4.3). However, as it will become clear in the next section, in order to assess the performance of the proposed method for hyperspectral images, the hard constraints must be somehow imposed to the estimator. Let wq be an estimated vector violating (4.3). The main idea is to replace wq with a new estimation point, by means of projecting wq on the polytope S. In this way, the projection point will be the closest to wq point of S satisfying the constraints of the problem. Unfortunately, there is no known closed form expression for the projection of a point on the standard (N − 1)-simplex in the N -dimensional space. In the following we propose an approximate solution, which is based on the Euclidean distances of wq from the vertices of S. Note that the closest to wq point of S will lie on a (N − 1)-dimensional hypersurface of S. Since the polytope S is convex, any point wp on a (N −1)-dimensional hypersurface of S can be expressed as a linear combination of the corresponding N − 1 vertices Konstantinos E. Themelis

88

Bayesian signal processing techniques for hyperspectral image unmixing

Figure 4.2: Urban HYDICE hyperspectral dataset, band 80. of the polytope, i.e. wp = θ1 e1 + θ2 e2 + . . . + θN −1 eN −1 , (4.15) P −1 where θi are weight coefficients such that N i=1 θi = 1 and θi ≥ 0, i = 1, 2, . . . , N − 1. Apparently, the vertex excluded in (4.15) is the one with the largest Euclidean distance from wq , which without loss of generality is assumed to be eN . Let d1 , d2 , . . . , dN −1 denote the Euclidean distances between wq and the N − 1 remaining vertices of the polytope. Then, the weight coefficients θi in (4.15) can be computed according to the following relation: θi =

ϕ(di ) , i = 1, 2, . . . , N − 1, NP −1 ϕ(dj )

(4.16)

j=1

where ϕ(·) is a properly selected function. Two possible choices are ϕ(x) = 1/x and ϕ(x) = e1/x . As can be easily verified, in both cases the weight parameter θi is reversely proportional to the Euclidean distance di , with the second choice weighting more heavily closely located vertices.

4.1.4 Simulations In this section, we use real hyperspectral data to evaluate the performance of the proposed modified MAP-s estimator, in comparison to the constrained quadratic programming approach. The data correspond to an urban image scene, shown in Figure 4.2, and have been collected by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) sensor. The sensor works in 210 spectral bands, in the region from 400 to 2500nm, with a spectral resolution of 10nm. After removing the low SNR bands, 162 spectral bands remain available, i.e. M =162. Four endmembers are present in the image, namely asphalt, roof, grass and Konstantinos E. Themelis

89

Bayesian signal processing techniques for hyperspectral image unmixing

Figure 4.3: Abundance estimation using quadratic programming techniques.

Figure 4.4: Abundance estimation using the MAP-s estimator and solving the LMI optimization problem.

Figure 4.5: Abundance estimation using the closed form expression of the MAP-s estimator. tree. The spectral signatures of these endmembers have been identified using a supervised technique, [74], which utilizes areas in the image which seem to contain a pure material to extract the spectrum of this material. So far, it has been assumed that the noise covariance matrix Σn is known. Note that ΣLS used in (4.14) to compute the covariance matrix Σ, depends on Σn . As a result, estimation of the noise covariance matrix of the image is a necessary preprocessing step of the proposed algorithm, which may affect the overall performance of the estimator. In our simulation experiments, Σn has been computed using the shift-difference method, as discussed in [56]. In Figs. 4.3, 4.4 and 4.5, the results of three abundance estimation methods are depicted. Each figure comprises four images corresponding to the four endmembers under consideration. A pure black pixel in an image indicates that the abundance of the respective endmember

Konstantinos E. Themelis

90

Bayesian signal processing techniques for hyperspectral image unmixing

is zero, while a pure white pixel represents an abundance value equal to one. All other abundance values between zero and one are illustrated according to the different tones of gray. In Figure 4.3, a quadratic programming based technique has been applied to estimate the abundances, [35]. In Figs. 4.4 and 4.5, the MAP-s estimator has been implemented either by solving an LMI problem or by using directly the closed form expressions derived in this section, respectively. We observe that the proposed method succeeds in discriminating all four endmembers with sufficiently high accuracy. Its performance is similar to that of the LMI-based approach, as shown by comparing Figs. 4.4 and 4.5. Compared to the convex optimization based method, there is some degradation in performance, even though the asphalt endmember seems to be better resolved using the MAP-s method. Recall, however, that the computational complexity of the proposed method is much lower, compared to the one required to numerically solve a convex optimization problem for each pixel in the hyperspectral image.

4.1.5 Conclusion This section has addressed the problem of abundance estimation in hyperspectral signal unmixing, subject to full additivity and nonnegativity constraints. Instead of solving numerically this convex optimization problem, we have followed a novel approach stemming from a recently reported MAP estimation method. The proposed algorithm has almost similar performance to the much more computationally demanding numerical method, thus making it attractive especially for real time processing of hyperspectral data.

4.2 On the unmixing of MEX/OMEGA data This section presents a comparative study of three different types of estimators used for supervised linear unmixing of a MEx/OMEGA hyperspectral cube. The algorithms take into account the constraints of the abundance fractions, so as to get physically interpretable results. Abundance and spatial reconstruction error maps show that the Bayesian maximum the posteriori probability (MAP) estimator described in the previous section outperforms the other two schemes, offering a compromise between complexity and estimation performance.

4.2.1 Introduction The surface of Mars is currently being imaged with a combination of high spectral and spatial resolution. This gives the ability to detect and map chemical components on the Martian surface and atmosphere more accurately than before. The OMEGA (Observatoire pour la Min´eralogie, l’ Eau, les Glaces et l’ Activit´e hyperspectral images) spectrometer, [15], is an Konstantinos E. Themelis

91

Bayesian signal processing techniques for hyperspectral image unmixing

instrument onboard the Mars Express mission, which measures the solar radiation reflected by Mars’ surface. Its objective is not only to map the mineral composition of the planet’s surface, but also to measure aspects of the atmospheric composition of Mars. To this end, spectral unmixing is a vital part of OMEGA data analysis. Several spectral unmixing techniques for the unmixing of OMEGA data have been recently proposed in the bibliography. These techniques can be categorized into unsupervised, where a special procedure is first executed to get the endmembers’ spectral signatures from the image, and supervised, where a priori knowledge of the image endmembers is available. A recent example of an unsupervised technique is the Bayesian source separation method, developed in [109]. This technique is based on a Gibbs sampling scheme to perform Bayesian inference. Due to its high computationally complexity, a special implementation strategy is developed for its application to a complete OMEGA image data set, [109]. Band ratio is the most commonly used supervised technique to detect minerals ([15, 14, 80]). However some multiple endmember linear spectral unmixing algorithms have been proposed such as MELSUM, [36]. MELSUM uses a reference library containing various spectral signatures of minerals (used as endmembers), and is based on the classical spectral mixture analysis (SMA) algorithm. In [73], a modified Gaussian model (MGM) has been exploited to estimate the fractional abundances of a compositionally diverse suite of pyroxene spectra in the martian surface. A wavelet based method has also been applied for the unmixing of hyperspectral martian data in [53, 108]. In this section, we focus on the problem of supervised spectral unmixing. Our main objective is to estimate the abundances of the endmembers that are present in two OMEGA images, subject to the nonnegativity and sum-to-one constraints. In the following, three different supervised unmixing algorithms are considered, namely the ENVI-SVD method, [20], a quadratic programming (QP) technique, [35], and the (MAP-s) estimator, [111], described in the previous section. These algorithms are applied on two different hyperspectral OMEGA data sets, and they are evaluated through their corresponding abundance maps. The endmember reference spectra used for supervised unmixing are either extracted from the image itself, or selected from a spectral library of pure minerals. The experimental results show that the MAP-s estimator results in abundance values that satisfy the constraints of the problem and provides a compromise between the performance of the QP technique and the complexity of the ENVI-SVD method.

4.2.2 Linear spectral unmixing techniques Adopting the linear model in (4.1), three different unmixing algorithms are applied to the OMEGA data sets: i) a singular value decomposition method (ENVI-SVD), [20], available

Konstantinos E. Themelis

92

Bayesian signal processing techniques for hyperspectral image unmixing

in the ENVI image processing software ii) a QP technique, [35], available in the Matlab environment, and iii) the Bayesian MAP-s estimator described in the previous section, [111]. The ENVI-SVD method and the QP algorithms are also briefly described in the following sections. 4.2.2.1 ENVI-SVD ENVI-SVD is a constrained least squares approach to the unmixing problem. Using the singular value decomposition (SVD) algorithm, the pseudo-inverse of the mixing matrix Φ is computed. Then, the abundance fractions are easily estimated by multiplying the pseudoinverse matrix with each image pixel’s spectral vector. The advantage of this method is its low computational complexity, since the pseudo-inverse matrix is computed only once as a preprocessing step, and is then applied to all image pixel vectors. As far as the sum-to-one constraint is concerned, it is imposed to the problem using an extra (weighted) equation to the linear system of equations (4.1). However, ENVI-SVD does not take into account the nonnegativity of the abundances, which can result in negative abundance values that have no physical meaning. 4.2.2.2 Quadratic programming technique This quadratic programming technique is a reflective Newton method, which minimizes the quadratic function of the least squares error of the unimixing problem, subject to the sum-toone and nonnegativity constraints imposed on the abundances. The QP technique is based on an iterative optimization scheme, which has to be repeated separately for each image pixel vector. Keeping in mind that a hyperspectral image may be composed of thousands of pixels, solving a separate optimization problem for each pixel increases significantly to the computational complexity of the method. An implementation of the quadratic programming technique is available in the optimization toolbox of Matlab.

4.2.3 Discussion The previously described algorithms are applied to two different OMEGA data cubes: (a) a scene of Mars’ South Polar Cap, and (b) a Syrtis Major observation. The OMEGA instrument is a spectrometer on board ESA’s Mars Express satellite, which provides hyperspectral images of the Mars surface, with a spatial resolution from 300m to 4km, 96 wavelength channels in the visible band and 256 wavelength channels in the near infrared band, [15]. OMEGA uses three different detectors, with spectral resolutions about 7.5nm in the 0.35 − 1.05µm wavelength range (visible and near infrared channel or VNIR), 14nm between 0.94 and 2.70µm (short wave infrared channel or SWIR) and an average of 21nm Konstantinos E. Themelis

93

Bayesian signal processing techniques for hyperspectral image unmixing

1 0.9

Dust CO2 ice

0.8

H2O

Reflectance

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

1

1.2

1.4

1.6

1.8 2 2.2 Wavelength

2.4

2.6

2.8

3

Figure 4.6: Reference spectra of the South Polar Cap OMEGA image. The available endmembers are: (a) OMEGA typical dust materials with atmosphere absorption. (b) synthetic CO2 ice with grain size of 100 mm, (c) synthetic H2 O ice with grain size of 100 microns. from 2.65 to 5.2µm (long wave infrared channel or LWIR), respectively. The two hyperspectral data sets, the reference spectra used for each image and the SU results obtained from the application of the three methods are analytically described in the following sections. 4.2.3.1 South Polar Cap image cube This data set consists of a single hyperspectral data cube obtained by looking towards the South Polar Cap of Mars in the local summer (Jan. 2004). The data cube is made up of two channels: 128 spectral planes from 0.93 to 2.73µm with a resolution of 14nm and 128 spectral planes from 2.55 to 5.11µm with a resolution of 21nm. Noisy bands were excluded, and 156 out of the 250 initial bands were finally utilized in the region from 0.93 to 2.98µm to avoid the thermal emission spectral range. The linear model mixing matrix consists of the following three reference spectra: (a) CO2 ice (synthetic data with grain size = 100mm), (b) H2 O ice (synthetic data with grain size = 10µm), and (c) dust, which were all detected a priori using the Wavanglet method of [108]. These endmembers are discussed in the first OMEGA publication, [14], and are also verified by [109], using the Bayesian positive source separation method of [92]. The respective spectral signatures of the endmembers are shown in Figure 4.6. The abundance maps for each endmember resulting after the application of the three estimators are displayed in Figs. 4.7 - 4.9. As shown in these figures, the ENVI-SVD abundances do not satisfy the nonnegativity constraint, e.g., regarding the CO2 endmember, the minimum computed abundance value is −9.9 × 10−2 . Notice also that the abundance values calculated by QP and MAP-s are in full agreement, they share the same scale and are quite different from those obtained by ENVI-SVD. This shows that the MAP-s estimates Konstantinos E. Themelis

94

Bayesian signal processing techniques for hyperspectral image unmixing

ENVI−SVD

0.4

0.6

0.8

(a)

QP

MAPs

0.5 0.6 0.70.8 0.9

0.5 0.6 0.7 0.8 0.9

(b)

(c)

Figure 4.7: Abundance map of the dust endmember, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s for the South Polar Cap image cube. represent reliably the information about the abundances. It also adds up to the fact that the MAP-s estimator has almost similar performance with the QP algorithm in simulation scenarios with synthetic data, as shown in [111]. 4.2.3.2 Syrtis Major image cube This data set consists of a single hyperspectral data cube of the Syrtis Major region, which contains well-identified areas with very strong signatures of mafic minerals, [93]. The data Konstantinos E. Themelis

95

Bayesian signal processing techniques for hyperspectral image unmixing

ENVI−SVD

0

0.2

(a)

0.4

QP

0.2

(b)

MAPs

0.4

0.2

0.4

(c)

Figure 4.8: Abundance map of the CO2 endmember, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s for the South Polar Cap image cube. cube consists of 109 spectral bands out of the 128 original wavelengths of the SWIR detector. The spatial dimensions of the cube are 366 × 128 pixels. The OMEGA observations have been calibrated for known instrument artifacts and for atmospheric CO2 . The cube has been radiometrically corrected using the standard correction pipeline (SOFT06) and the atmospheric gas transmission has been empirically corrected using the volcano scan method, [80]. It is well known that OMEGA can identify pyroxene and olivine; it discriminates between the high-calcium pyroxenes (HCPs, e.g., clinopyroxenes) and low-calcium pyroxenes (LCPs,

Konstantinos E. Themelis

96

Bayesian signal processing techniques for hyperspectral image unmixing

QP

ENVI−SVD

0.1

0.2

(a)

0.3

0.1

0.2

MAPs

0.3

(b)

0.1

0.2

0.3

(c)

Figure 4.9: Abundance map of the H2 O endmember, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s for the South Polar Cap image cube. e.g., orthopyroxenes), [13]. In the scene under investigation, we utilized three endmembers which have previously been identified to be present in the image, [93], namely, (a) Hypersthene, (b) Diopside (c) Fayalite. These are all laboratory reference spectra, which have also been used in [107] for supervised unmixing. It is interesting to note that the last two endmembers have been retrieved using NASA’s CRISM multispectral observations. Their respective spectral signatures are displayed in Figure 4.10. Three more artifact endmember spectra are utilized, specifically two neutral spectral components (flat lines at 10−4 and 1),

Konstantinos E. Themelis

97

Bayesian signal processing techniques for hyperspectral image unmixing

0.45 hyperst2.spc Hypersthene PYX02.h >250u Silicate (Ino); Diopside CPX CRISM Olivine Fayalite CRISM

0.4 0.35

Reflectance

0.3 0.25 0.2 0.15 0.1 0.05 0 0.8

1

1.2

1.4

1.6 1.8 Wavelength

2

2.2

2.4

2.6

Figure 4.10: Reference spectra of the Syrtis Major OMEGA image. QP

ENVI−SVD

0

0.05

(a)

0.1

0

0.05

MAPs

0.1

(b)

0.08

0.1

0.12

0.14

0.16

0.18

(c)

Figure 4.11: Abundance map of Hypersthene, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s algorithms for the Syrtis Major hyperspectral scene. and a slope line, as in [107, 82]. The abundance maps obtained from the application of the three methods to the Syrtis Major hyperspectral scene are shown in Figs. 4.11 - 4.13. Each figure illustrates the corresponding abundance map of a single endmember, as it is estimated by all three methods. As a reference, the abundance maps of all three endmembers using the band ratio method are also shown in Figure 4.14. The band ratio and bands depth estimation methods are

Konstantinos E. Themelis

98

Bayesian signal processing techniques for hyperspectral image unmixing

QP

ENVI−SVD

0.02

0.04

0.06

(a)

0.08

0.1

0.02

0.04

0.06

MAPs

0.08

0.1

0.1

(b)

0.15

0.2

(c)

Figure 4.12: Abundance map of Diopside, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s algorithms for the Syrtis Major hyperspectral scene. commonly used to detect minerals on OMEGA data [13]. This method is valid only if at least two wavelength channels can be identified as affected by only one particular mineral. Following the methodology of [107], we used four band ratios: Index(Olivine) = b2.39/b1.06

(4.17)

Index(opx) = 1 − b1.84/((1.84 − 1.25) ∗ b1.25 + (2.47 − 1.84)∗ b2.47) ∗ (2.47 − 1.25)

(4.18)

Index(cpx) = 1 − b1.85/((2.32 − 1.85) ∗ b2.32 + (2.56 − 2.32) ∗ b2.56) ∗ (2.56 − 1.85))

(4.19)

where the wavelength band “b1.84” stands for the band at 1.84 microns, Orthopyroxenes (Hyperstene) are noted as “opx” and Clinopyroxenes (Diopside) are noted as “cpx”. The differences among the three methods are again prominent. Both the MAP-s and the QP methods provide consistent results, as far as the endmembers Fayalite and Hypersthene are concerned. In addition, for both Fayalite and Hypersthene, ENVI-SVD returns negative Konstantinos E. Themelis

99

Bayesian signal processing techniques for hyperspectral image unmixing

QP

ENVI−SVD

−0.1 −0.05

0

0.05

(a)

0.1

0.15

0

0.05

0.1

(b)

MAPs

0.15

0.05

0.1

0.15

0.2

0.25

(c)

Figure 4.13: Abundance map of Fayalite, estimated using (a) ENVI-SVD, (b) QP, and (c) MAP-s algorithms for the Syrtis Major hyperspectral scene. abundance values and its resulting maps substantially deviate from the other two methods. As far as Diopside is concerned, the MAP-s estimator seems to produce a slightly different abundance map in comparison to the other two methods. However, it can be easily verified by comparing Figs. 4.11 - 4.13 and 4.14 that the abundance maps of the MAP-s estimator are in better agreement with those obtained using the band ratio method, compared to the other two methods. Thus, it can be argued that MAP-s provides more reliable results than ENVI-SVD and QP.

4.2.4 Conclusions

In this section, we have presented a comparison of three different supervised spectral unmixing methods (the Bayesian MAP-s estimator, ENVI-SVD, and QP) on the basis of two different OMEGA hyperspectral data sets. As opposed to iterative algorithms or Markov chain Monte Carlo methods, e.g., [109], commonly used for the constrained inverse problem of abundance estimation, the computational complexity of the MAP-s estimator is much lower. Specifically, for the Syrtis Major dataset, the running time of the MAP-s algorithm was 4.7 seconds in a roughly optimized Matlab implementation, while QP needed 26.4 seconds (both algorithms were run on a 2.4-GHz Intel Core 2 CPU). In addition, as verified by the experimental results, the performance of the two methods is approximately equal. Therefore, the MAP-s estimator seems to offer the best compromise between estimation performance and complexity among the three algorithms, and it is thus a serious candidate for spectral unmixing of hyperspectral data in planetary exploration.


Figure 4.14: Abundance maps of the endmembers (a) Hypersthene, (b) Diopside, and (c) Fayalite in the Syrtis Major scene. The abundances are estimated using the band ratio method.


Chapter 5

Semi-supervised sparsity-promoting abundance estimation methods

In this chapter we present two novel methods that deal with the problem of semi-supervised hyperspectral unmixing, [112, 113, 114]. By making the reasonable assumption that only a few of the available endmembers are present in each pixel of a hyperspectral scene, both methods cast the unmixing problem as a sparse linear regression problem. A common, yet important, feature of both methods is the use of the ℓ1 norm to induce sparsity. The first method is deterministic and solves the ℓ1 regularization problem using the LARS algorithm, [44]. The second method is Bayesian in nature. More specifically, a novel hierarchical Bayesian model is first developed, which encompasses appropriate priors to account for ℓ1 regularization. Bayesian inference is then performed using a novel iterative conditional expectations algorithm, which is shown to have a close affinity to variational approximation methods, [6, 65].

5.1 Problem formulation

In this section, we provide definitions and rigorously formulate the sparse semi-supervised hyperspectral unmixing problem. Let y be an M × 1 hyperspectral image pixel vector, where M is the number of spectral bands. Also let Φ = [φ1, φ2, . . . , φN] stand for the M × N signature matrix of the problem, with M > N, where the M × 1 vector φi represents the spectral signature (i.e., the reflectance values in all spectral bands) of the i-th endmember and N is the total number of distinct endmembers. Finally, let w = [w1, w2, . . . , wN]^T be the N × 1 abundance vector associated with y, where wi denotes the abundance fraction of φi in y. The linear mixture model (LMM), also detailed in chapter 2, assumes that the above quantities are interrelated as


y = Φw + n.        (5.1)

The additive noise n is assumed to be a zero-mean Gaussian distributed random vector, with independent and identically distributed (i.i.d.) elements, i.e., n|β ∼ N(n|0, β^{-1} I_M), where β denotes the inverse of the noise variance (precision). Due to the nature of the problem, the abundance vector is usually assumed to satisfy the following two constraints

wi ≥ 0, i = 1, 2, . . . , N,  and  Σ_{i=1}^{N} wi = 1,        (5.2)

namely, a non-negativity constraint and a sum-to-one (additivity) constraint. In geometrical terms, it is easily seen that the constraints define the N − 1 standard simplex¹ in the N-dimensional space, denoted by S, i.e., w ∈ S, where

S = { w ∈ R^N | Σ_{i=1}^{N} wi = 1, wi ≥ 0, i = 1, . . . , N }.        (5.3)
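As a concrete illustration of the observation model (5.1) and of the constraint set S, the following minimal Python sketch generates a synthetic mixed pixel; the dimensions, the uniform signature matrix, and the Dirichlet draw for the abundances are illustrative assumptions only, not part of the formulation above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy instance of the linear mixture model y = Phi w + n, with w on the simplex S
    M, N = 50, 8                                   # spectral bands, endmembers
    Phi = rng.uniform(0.0, 1.0, size=(M, N))       # nonnegative "spectral signatures"
    w = rng.dirichlet(np.ones(N))                  # abundances: nonnegative, sum to one
    beta = 1e4                                     # noise precision (inverse variance)
    n = rng.normal(0.0, 1.0 / np.sqrt(beta), size=M)
    y = Phi @ w + n                                # observed mixed pixel

    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)   # constraints (5.2)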

Based on this formulation, two semi-supervised hyperspectral unmixing methodologies are introduced, where the endmember matrix Φ is assumed to be known a priori. As mentioned before, each column of Φ contains the spectral signature of a single material, and its elements are non-negative, since they represent reflectance values. The mixing matrix Φ can either stem from a spectral library or it can be determined using an endmember extraction technique, e.g., [94]. However, the actual number of endmembers that compose a single pixel's spectrum, denoted by ξ, is unknown and may vary from pixel to pixel. Sparsity is introduced when ξ ≪ N, that is, by assuming that only a few of the available endmembers are present in a single pixel. This is a reasonable assumption, in line with intuition, since a pixel is likely to comprise only a few different materials out of a library of several available materials. Although a related spectral unmixing approach had recently been presented in [58], the idea of performing spectral unmixing under the assumption of sparsity was independently introduced in our research work. Recall that in semi-supervised unmixing we are interested in estimating, for each image pixel, the abundance vector w, which is non-negative and sparse, with ξ out of its N entries being non-zero. This problem can be tackled using either recently proposed compressive sensing techniques, e.g., [44, 42, 68, 7], which focus only on the sparsity issue, or quadratic programming techniques, e.g., [35], which successfully enforce the constraints given in eq. (5.2) but do not exploit sparsity.

¹ The vertices ei of the N − 1 standard simplex in the N-dimensional space are of the form ei = [0, . . . , 0, 1, 0, . . . , 0]^T, with the single 1 in the i-th position.


In the following sections, two novel methods are presented that both (a) favor sparsity and (b) take into account the non-negativity constraint of the problem.
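For comparison purposes, a fully constrained least squares solution of the kind produced by quadratic programming approaches can be obtained with a generic solver. The sketch below uses SciPy's SLSQP routine purely as an illustration; it is not the specific algorithm of [35].

    import numpy as np
    from scipy.optimize import minimize

    def fcls(Phi, y):
        # Least squares abundance estimate under the constraints (5.2):
        # non-negativity and sum-to-one. Generic solver, illustrative only.
        N = Phi.shape[1]
        w0 = np.full(N, 1.0 / N)                        # feasible starting point
        objective = lambda w: np.sum((y - Phi @ w) ** 2)
        constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
        bounds = [(0.0, None)] * N
        res = minimize(objective, w0, method="SLSQP",
                       bounds=bounds, constraints=constraints)
        return res.x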

5.2 Semi-supervised hyperspectral unmixing via the weighted lasso

In this section, a novel approach for semi-supervised spectral unmixing of hyperspectral images is presented, which is based on a properly modified weighted ℓ1-regularized least squares algorithm (see chapter 3). The motivation for using the sparsity-promoting ℓ1 norm is that, in practice, only a small number of the available endmembers are present in each pixel. Based on this observation, a weighted version [129] of the celebrated least absolute shrinkage and selection operator (lasso) criterion [117] is utilized, where weights are used to penalize different coefficients in the ℓ1 regularization scheme. To efficiently solve the ℓ1 minimization problem, the least angle regression (LARS) algorithm [44] is used. However, the ℓ1 minimization criterion contradicts the additivity constraint of the problem, resulting in degraded estimation performance. To counter this effect, the additivity constraint is suitably incorporated into the minimization problem, resulting in a modified cost function. As verified by simulations on both real and simulated hyperspectral data, with this modification the proposed method reaches the performance of quadratic programming methods at a much lower computational cost. It should be mentioned that a similar approach to the spectral unmixing problem has recently been presented independently in [58]. Note, though, that both the iterative algorithm utilized in [58] and the way the constraints of the problem are imposed are quite different from those of the present approach.

5.2.1 The proposed method

Given the observation model (5.1) and the polytope (standard simplex) of constraints S, the sparse solution follows from the optimization problem

min_{w∈S} ||w||_0  subject to  ||y − Φw||_2^2 ≤ ε,        (5.4)

where ||w||_0 is the number of nonzero wi's and ε is a small positive parameter. This problem is nonconvex and its solution requires an exhaustive combinatorial search. To alleviate this difficulty, the ℓ0 norm in (5.4) is replaced by the ℓ1 norm, [117], i.e.,

min_{w∈S} ||w||_1  subject to  ||y − Φw||_2^2 ≤ ε,        (5.5)

where ||w||_1 = Σ_{i=1}^{N} |wi|. The rationale behind this substitution is that, under certain conditions, the ℓ1 solution coincides with the ℓ0 solution (see, e.g., [43, 28]). The ℓ1 penalization criterion associated with (5.5), also called the lasso [117], is

J_lasso = ||y − Φw||_2^2 + λα Σ_{i=1}^{N} |wi|,        (5.6)

where λα is a nonnegative parameter that balances the sparsity of the solution against estimation accuracy. Minimization of J_lasso subject to w ∈ S gives the lasso solution, w_lasso, which is used for both endmember selection and abundance estimation. However, in eq. (5.6) all wi's are penalized equally in the ℓ1 penalty term. As shown in [129], this setup can lead to inconsistent endmember selection, as explained in section 3.1.4 of chapter 3. One way to face this problem is to assign a different weight to each parameter coefficient wi. This gives rise to the weighted lasso formulation, [129],

w_wlasso = argmin_{w∈S} { ||y − Φw||_2^2 + λα Σ_{i=1}^{N} βi |wi| },        (5.7)

where β = [β1, β2, . . . , βN]^T is the vector containing the weight coefficients. A desirable property of the weights βi is to be inversely proportional to their corresponding wi's, e.g., βi = 1/|wi|, since penalizing the nonzero elements of w with small weights and the zero entries of w with very large weights forces the solution w_wlasso to concentrate on the elements where w is nonzero. Since w is not known, a possible choice, among others, for defining the βi's is to use the least squares estimate of w, w_ls = [w_ls,1, w_ls,2, . . . , w_ls,N]^T, as proposed in [129]. For γ ≥ 0, the weights βi can be selected as

βi = 1 / |w_ls,i|^γ,   i = 1, 2, . . . , N.        (5.8)
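A minimal sketch of the weight selection rule (5.8) is given below; the small constant added to the denominator is an illustrative safeguard against exactly zero least squares coefficients and is not part of the formulation.

    import numpy as np

    def adaptive_weights(Phi, y, gamma=1.0, eps=1e-12):
        # Weights of eq. (5.8), computed from the unconstrained LS estimate.
        # Large weights are assigned where the LS coefficient is (near) zero,
        # so those entries are penalized heavily in the weighted lasso (5.7).
        w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return 1.0 / (np.abs(w_ls) ** gamma + eps)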

Note that (5.7) is a quadratic programming problem and can be solved using iterative techniques, such as interior point methods. However, these methods are computationally expensive, rendering their use prohibitive for demanding real-time applications, such as the hyperspectral unmixing problem considered here.

5.2.1.1 Enforcing the constraints

As mentioned in section 5.1, the sum-to-one and non-negativity constraints are physically imposed on the hyperspectral unmixing problem. As far as the sum-to-one constraint is concerned, we propose to incorporate it directly into the optimization problem, by substituting the ℓ2 norm of eq. (5.7) with the ℓ2 norm of an augmented problem, as shown in (5.9) below.

Table 5.1: Algorithm 1. The positivity constrained lasso.
Let A denote a subset of the indices {1, 2, . . . , N}, and let μ̂_A = Φ ŵ_A be a prediction vector of the regression coefficients ŵ_A.
1. Initialize ŵ_A = 0, μ̂_A = 0 and A = ∅.
2. Compute the vector of current correlations ĉ = Φ^T (y − μ̂_A).
3. Select j with the highest correlation, Ĉ = max_j {ĉ_j}, and set A = A ∪ {j : ĉ_j = Ĉ}.
4. Compute Φ_A = (. . . φ_j . . .)_{j∈A}, G_A = Φ_A^T Φ_A, A_A = (1^T G_A^{-1} 1)^{-1/2}, u_A = Φ_A w_A, where w_A = A_A G_A^{-1} 1, and α ≡ [α1, α2, · · · , αN]^T = Φ^T u_A.
5. Define the N-dimensional vector d whose A_j-th entry equals w_j for j = 1, 2, . . . , |A|, while all other entries are zero (A_j denotes the j-th element of A and |A| the cardinality of A).
6. Set γ̃ = min_{j∈A} max{−ŵ_{A_j}/d_j, 0} and let j̃ be the position where the minimum occurs.
7. Set γ̂ = min_{j∈A^C} max{(Ĉ − ĉ_j)/(A_A − α_j), 0}, where A^C denotes the complementary set of A.
8. If γ̃ < γ̂, stop the ongoing LARS step and remove j̃ from A, i.e., ŵ_A = ŵ_A + γ̃ d, μ̂_A = μ̂_A + γ̃ u_A and A = A − {j̃}.
9. Update ŵ_A = ŵ_A + γ̂ d and μ̂_A = μ̂_A + γ̂ u_A.
10. Return to step 2 to compute the new correlations, until a stopping criterion is met.

ŵ = argmin_{w∈S} { || [y ; λβ] − [Φ ; λβ 1^T] w ||_2^2 + λα Σ_{i=1}^{N} βi |wi| },        (5.9)

where [· ; ·] denotes vertical (row-wise) stacking and λβ is a regularization parameter [54]. Geometrically, the parameter λβ determines how close the solution will lie to the hyperplane of the constraint 1^T w = 1. Note that, in practice, the additivity constraint is enforced for large values of λβ. In addition, the nonnegativity constraint can be imposed using a modified version of the LARS algorithm, [44], as shown in Algorithm 1 presented in table 5.1.
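The effect of the augmentation can be checked numerically. The following sketch uses non-negative least squares on the augmented system, a deliberate simplification that drops the ℓ1 term, merely to illustrate how a large λβ pushes the solution toward the sum-to-one hyperplane.

    import numpy as np
    from scipy.optimize import nnls

    def augmented_nnls(Phi, y, lam_beta):
        # Solve the augmented system of (5.9) without the l1 penalty,
        # under non-negativity only; illustrative simplification.
        A = np.vstack([Phi, lam_beta * np.ones((1, Phi.shape[1]))])
        b = np.concatenate([y, [lam_beta]])
        w, _ = nnls(A, b)
        return w

    # As lam_beta grows, sum(w) approaches 1, e.g.:
    # for lb in (0.1, 10.0, 1000.0): print(lb, augmented_nnls(Phi, y, lb).sum())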

Konstantinos E. Themelis

107

Bayesian signal processing techniques for hyperspectral image unmixing

Table 5.2: Algorithm 2. Weighted lasso for sparse unmixing. Select γ ≥ 0 and calculate β as in (5.8). Define φ∗j = φj /βj , j = 1, 2, . . . , N . Select the parameters λα and λβ . Using Algorithm the augmented lasso problem ) ( 1, solve  N  ∗  2 N

P P y φ i ˆ ∗ = arg min βi |wi | . wi w

+ λα

λβ − λ w β i=1 i=1 2 Output w ˆj = wˆj∗ /βj , j = 1, 2, . . . , N .

Tuning the parameters γ, λα and λβ of Algorithm 2 is critical. As already mentioned, λβ is given a large value (e.g. λβ = 1000 in our experiments) so as to enforce the additivity constraint. Though not obvious, λα controls the termination of Algorithm 1. This can be seen if the conventional lasso is stated in its following equivalent form ([117]) minky − w

Φwk22

subject to

N X

|wi | ≤ T.

(5.10)

i=1

ˆ A k1 > 1 specifies the parameter T . Parameter γ reflects Stopping Algorithm 1 when kw our confidence on the least squares estimator wls . For instance, in high SNRs, wls detects correctly the non-zero elements of w and γ can be set equal to one. On the other hand, when wls fails to detect the nonzero elements of w, γ can be set to zero.

5.2.2 Simulation results 5.2.2.1 Simulated data To test the performance of the proposed method, we simulated a hyperspectral image consisting of 105 pixels. Ten endmembers were selected from the USGS spectral library [34], having 474 spectral bands. The USGS library contains spectral signatures of various materials, such as minerals, plants, man-made materials, etc. All pixels were produced using exactly three out of ten endmembers. The abundances were chosen to follow a Dirichlet(1, 1, . . . , 1) distribution, while the additive noise is Gaussian, n ∼ N (0, σ 2 I), where σ 2 determines the SNR level. The proposed method is compared to the least squares algorithm (LS), [54], the equality constrained least squares algorithm (CLS), [54], the weighted lasso without enforcing the additivity constraint, (5.7), and the quadratic programming based technique (QP) of [35]. The simulation results are shown in Figure 5.1, where the mean squared error (MSE) curves versus the SNR are displayed. In terms of performance, we observe that the proposed

Konstantinos E. Themelis

108

Bayesian signal processing techniques for hyperspectral image unmixing

1

10

LS Constrained LS Weighted lasso QP Algorithm 2

0

10

−1

MSE

10

−2

10

−3

10

−4

10

0

5

10

15 SNR (dB)

20

25

30

Figure 5.1: Abundance estimation MSE curves versus SNR. method is equivalent to the QP method. It should be noted, however, that such an equivalence in performance is achieved with much lower computational complexity. Moreover, it is easily seen from Figure 5.1 that solving the weighted lasso problem without enforcing the sum-to-one constraint as in (5.9) results in significant performance degradation. 5.2.2.2 Unmixing of a HYDICE image In this section we apply the proposed method on a real hyperspectral image, which depicts an urban image scene, shown in Figure 5.2, and has been collected by the Hyperspectral Digital

50

100

150

200

250

300 50

100

150

200

250

300

Figure 5.2: Real hyperspectral data: HYDICE urban image scene, band 80. Imagery Collection Experiment (HYDICE) sensor. The image is composed of 210 spectral

Konstantinos E. Themelis

109

Bayesian signal processing techniques for hyperspectral image unmixing

grass

grass

grass

50

50

50

100

100

100

150

150

150

200

200

200

250

250

250

300

300 100

200

300

300 100

tree

200

300

100

tree

50

50

50

100

100

100

150

150

150

200

200

200

250

250

250

300

300 100

200

300

200

300

200

300

tree

300 100

200

300

100

Figure 5.3: Fraction maps estimated by (a) LS, (b) QP and (c) the proposed method. bands, in the region from 400 to 2500nm, with a spectral resolution of 10nm. After removing the low SNR bands, 162 spectral bands remain available, i.e. M = 162. Six endmembers are present in the image, namely asphalt, metal, dirt, roof, grass, and tree. The spectral signatures of these endmembers have been identified using a supervised technique, which utilizes areas in the image that seem to contain pure materials, in order to extract the spectrum of each material [77]. In Figure 5.3 we show the unmixing results for three of the available six endmembers, and for each of the three algorithms, (a) LS, (b) QP and (c) the proposed method. A pure black pixel in an image indicates that the abundance of the respective endmember is zero, while a pure white pixel represents an abundance value equal to one. All other abundance values between zero and one are illustrated using different tones of gray. Again, we observe that the unmixing results of the proposed method are in agreement with the results produced by the quadratic programming technique, while both methods are clearly superior compared to the LS algorithm. Konstantinos E. Themelis

110

Bayesian signal processing techniques for hyperspectral image unmixing

5.3 Semi-supervised hyperspectral unmixing via a novel Bayesian approach In this section the problem of semi-supervised hyperspectral unmixing is considered from a Bayesian viewpoint, [113, 114]. An interesting perspective of the semi-supervised spectral unmixing problem arises when the latent sparsity of the abundance vector is taken into account. As mentioned in the previous section, a reasonable assumption is that only a small number of endmembers are mixed in a single pixel, and hence, the solution to the endmember determination and abundance estimation problem is inherently sparse. This lays the ground for the utilization of sparse signal representation techniques, e.g., [44, 121, 42, 68], in semisupervised unmixing. A number of such semi-supervised unmixing techniques has been recently proposed in [112, 63, 16], based on the concept of `1 norm penalization to enhance sparsity. These methods assume that the spectral signatures of many different materials are available, in the form of a spectral library. Since only a small number of the available materials’ spectra are expected to be present in the hyperspectral image, the abundance vector is expected to be sparse. In this section, a novel hierarchical Bayesian approach for semi-supervised hyperspectral unmixing is presented, which is based on the sparsity hypothesis and takes into account the non-negativity property of the abundances. In the proposed hierarchical model, appropriate prior distributions are assigned to the unknown parameters, which reflect prior knowledge about their natural characteristics. More specifically, to account for the non-negativity of the abundances, a truncated non-negative Gaussian distribution is used as a first level prior. The variance parameters of this distribution are then selected to be exponentially distributed (second-level priors). This two-level hierarchical prior formulates a Laplace type prior for the abundances, which is known to promote sparsity, [49, 40]. In addition, compared to other related hierarchical models, [7, 100, 68], which employ a single sparsity-controlling hyperparameter, the proposed model comprises multiple distinct sparsity-controlling hyperparameters. It is proven that this extension makes the model equivalent to a non-negativity constrained variant of the adaptive least absolute shrinkage and selection operator (Lasso) criterion of [129], whose solution provides a consistent abundance estimator. The proposed hierarchical model also retains the conjugacy of the parameter distributions, which, in the sequel, is exploited to obtain closed form expressions for the parameters’ posterior distributions. As is usually the case in Bayesian analysis, the resulting joint posterior distribution of the proposed hierarchical model does not possess a tractable analytical form. To overcome this impediment, a novel iterative algorithm is developed, which can be considered as a deterministic approximation of the Gibbs sampler ([52] and chapter 3). In this algorithmic Konstantinos E. Themelis

111

Bayesian signal processing techniques for hyperspectral image unmixing

scheme, the conditional posterior distributions of the model parameters are derived and their respective expectations are selected, in contrast to the random samples used by the Gibbs sampler. More specifically, as far as the abundance vector is concerned, an efficient scheme is developed to update its posterior conditional expectation, while the conditional expectations of all remaining parameters are updated through simple, closed form expressions. The proposed Bayesian inference algorithm iterates through the derived conditional expectations, updating each one of them based on the current estimates of the remaining ones. To put the algorithm to its proper setting, its connection to other Bayesian inference methods, [72, 6, 65, 11], is discussed. In particular, emphasis is given in showing the affinity of the proposed algorithm with a variational Bayesian inference scheme, which is based on a suitable factorization of the corresponding variational posterior distribution (see also chapter 3). Interestingly, the proposed algorithm produces a point estimate of the abundance vector, which is sparse and satisfies the non-negativity constraint. As a by-product, estimates of all other parameters involved in the problem are also naturally produced; among them is the variance of the additive noise, which is assumed to corrupt the hyperspectral image. The proposed algorithm is computationally efficient and, as verified by extensive simulations, it converges very fast to the true model parameters. In addition, it offers enhanced estimation performance, as corroborated by the application of the proposed and other related methods for the unmixing of both simulated and real hyperspectral data.

5.3.1 Hierarchical Bayesian model This section introduces a novel hierarchical Bayesian model to estimate the sparse abundance vector w from (5.1), subject to the non-negativity constraint given in (5.2). In a Bayesian framework, all unknown quantities are assumed to be random variables, each one described by a prior distribution, which models our knowledge about its nature. Before we proceed, the definition of a truncated multivariate distribution is provided, which will be frequently used in the sequel to follow.  Definition 1. Let RN be a subset of RN RN ⊆ RN with positive Lebesgue measure, P(·|ζ) a N -variate distribution, where ζ is a vector of parameters, and PRN (·|ζ) the truncated probability density function (pdf) resulting from the truncation of P(·|ζ) on RN . Then, x ∼ PRN (x|ζ) denotes a random vector, whose pdf is proportional to P(x|ζ) IRN (x), where IRN (·) is the indicator function defined as, ( IRN (x) =

Konstantinos E. Themelis

1, x ∈ RN 0, x ∈ / RN .

112

(5.11)

Bayesian signal processing techniques for hyperspectral image unmixing

5.3.1.1 Likelihood Considering the observation model defined in (5.1) and the Gaussian property of the additive noise, the likelihood function of y can be expressed as follows p(y|w, β) = N (y|Φw, β

−1

−M 2

IM ) = (2π)

 β 2 β exp − ky − Φwk2 . 2 M 2



(5.12)

5.3.1.2 Parameter prior distributions The Bayesian formulation requires that both the sparsity and non-negativity properties of w should emanate from a suitably selected prior distribution. A widely used prior that favors sparsity, [49, 68, 119, 7, 100], is the zero-mean Laplace probability density function, which, for a single wi , is defined as λ L(wi |λ) = exp [−λ|wi |] , (5.13) 2 where λ is the inverse of the Laplace distribution shape parameter, λ ≥ 0. Assuming prior independence of the individual coefficients wi ’s, the N -dimensional prior over w can be written as  N N Y λ L(w|λ) = L(wi |λ) = exp [−λ kwk1 ] . (5.14) 2 i=1 It can be easily shown, [49], that under the Laplace prior, the maximum a posteriori (MAP) estimate of w is given by  ˆ = arg min w w

 β 2 ky − Φwk2 + λ kwk1 , 2

(5.15)

which is, surprisingly enough, the solution of the lasso criterion of [117]. However, if the Laplace prior was applied to the sparse vector w directly, conjugacy2 would not be satisfied with respect to the Gaussian likelihood given in (5.12), and, hence, the posterior probability density function of w could not be derived in closed form. As noted in [4], a key property of the Laplace distribution is that it can be expressed as a scaled mixture of normals, with an exponential mixing density, i.e., λ exp [−λ |wi |] = 2

Z 0

+∞

   2  1 wi2 λ2 λs √ exp − exp − ds, 2s 2 2 2πs

λ > 0.

(5.16)

In the framework of the problem at hand, eq. (5.16) suggests that the Laplace prior is equivalent to a two-level hierarchical Bayesian model, where the vector of abundances w 2 In Bayesian probability theory, if the posterior p(θ|x) belongs to the same distribution family with the prior p(θ), (for instance if they are both Gaussians), the prior and posterior are then called conjugate distributions.

Konstantinos E. Themelis

113

Bayesian signal processing techniques for hyperspectral image unmixing

follows a Gaussian distribution (first level), with exponentially distributed variances (second level). This hierarchical Bayesian model, which is a type of a Gaussian scale mixture (GSM), [17], has been adopted in [119, 49, 68, 100, 78, 7]. The main advantage of this formulation is that it maintains the conjugacy of the involved parameters. In this work, a slightly different Bayesian model is developed. More specifically, in order to satisfy the non-negativity constraint of the abundance vector w, the proposed hierarchical Bayesian approach uses a truncated normal distribution3 in the non-negative orthant of RN as a first-level prior for w. Assuming that all wi ’s are i.i.d. and γi ’s are the (normalized by β) variances of wi ’s, the prior assigned to w is analytically expressed as   N  Y γi p(w|γ, β) = NR1+ wi |0, β i=1    N  Y 1 −1 β wi2 − 12 2 = 2 (2π) β 2 γi exp − IR1+ (wi ) 2 γi i=1   1 N β T −N N 2 2 = 2 (2π) β 2 |Λ| exp − w Λw IRN+ (w) 2 = NRN+ (w|0, β −1 Λ−1 ),

(5.17)

where R1+ is the set of non-negative real numbers and RN + is the non-negative orthant of RN , NRN+ (·) stands for the N -variate truncated normal distribution in RN + according to T Definition 1, γ = [γ1 , γ1 , . . . , γN ] is the N × 1 vector containing the hyperparameters, γi ≥ 0, i = 1, 2, . . . , N , and Λ is the N × N diagonal matrix, with Λ−1 = diag(γ). Note that the use of β as a normalization parameter in (5.17), ensures the unimodality of the posterior distribution of w, [100, 78]. For the second parameter, β, appearing in the likelihood function (5.12), a Gamma prior distribution is assumed, defined as p(β|κ, θ) = Γ(β|κ, θ) =

θκ κ−1 β exp [−θβ] , Γ(κ)

(5.18)

where β ≥ 0, κ is the shape parameter, κ ≥ 0, and θ is the inverse of the scale parameter of the Gamma distribution, θ ≥ 0. As it is well known, the mean and variance of the Gamma distribution are E[p(β|κ, θ)] = κθ , and var[p(β|κ, θ)] = θκ2 , respectively. 5.3.1.3 Hyperparameters’ priors Having defined the truncated Gaussian distribution for wi ’s, we focus now on the definition of the exponential distributions for γi ’s, in the spirit of eq. (5.16). Before we describe the 3

Note that the truncation of the normal distribution preserves conjugacy.

Konstantinos E. Themelis

114

Bayesian signal processing techniques for hyperspectral image unmixing

model for the priors of the hyperparameters γi ’s proposed in this work, let us first describe the model adopted in [7, 49]. There, the following exponential priors on γi are used   λ λ λ p(γi |λ) = Γ(γi |1, ) = exp − γi , i = 1, 2, . . . , N, 2 2 2

(5.19)

where λ is a hyperparameter, which controls the level of sparsity, λ ≥ 0. If these priors were used for the elements of γ in (5.17), the prior distribution of w would be given as follows Z p(w|λ, β) = =

p(w|γ, β)p(γ|λ) dγ N Z Y i=1



p(wi |γi , β)p(γi |λ)dγi

0

"

# N p X = (βλ) exp − βλ |wi | IRN+ (w) N 2

i=1

p  = L w| βλ IRN+ (w). 

(5.20)

√ √  With respect to Definition 1, L w| βλ IRN+ (w) is denoted as LRN+ (w| βλ), and is a truncated Laplace distribution on RN + . We have already pointed out the relationship between the Laplace density, shown in (5.14), and the Lasso criterion (5.15). In a similar way, it can be easily shown that under the truncated Laplace prior given in (5.20), the MAP estimator of w would be the solution of a non-negativity constrained Lasso criterion. Moreover, from a Lasso point of view, [117], it is known that as λ increases, sparser solutions arise for w. After the previous parenthesis, we proceed with the description of the model for γi ’s proposed in this work. The latter is an extension of that given in (5.19), where instead of having a single λ for all γi ’s, a distinct λi is associated with each γi (the motivation for such a choice will become clear in the analysis to follow). Thus, in the second stage of our hierarchical model, N independent Gamma priors are assigned to the elements of γ, each parameterized by a distinct λi , as follows     λi λi λi = exp − γi , i = 1, 2, . . . , N, p(γi |λi ) = Γ γi |1, 2 2 2

(5.21)

where λi ≥ 0, i = 1, 2, . . . , N . By assuming that all γi ’s are independent, the joint distribution of γ can now be written as p(γ|λ) =

" #    N N X λi 1 1 λi γi , exp − γi = |Ψ| exp − 2 2 2 2 i=1

N  Y λi i=1

where λ = [λ1 , λ2 , . . . , λN ]T and Ψ = diag(λ). Konstantinos E. Themelis

115

(5.22)

Bayesian signal processing techniques for hyperspectral image unmixing

The first two stages of the Bayesian model, summarized in (5.17) and (5.22), constitute a sparsity-promoting non-negative (truncated) Laplace prior. This prior can be obtained by marginalizing the hyperparameter vector γ from the model. In the one dimensional case, we get ∞

Z p(wi |λi , β) =

p(wi |γi , β)p(γi |λi )dγi h p i p = βλi exp − βλi |wi | IR1+ (wi ), 0

(5.23)

whereas, for the full model, the truncated Laplace prior is given by Z p(w|λ, β) = =

p(w|γ, β)p(γ|λ)dγ N Z Y i=1



N 2



p(wi |γi , β)p(γi |λi )dγi

0

|Ψ|

1 2

N h Y

i h p i exp − βλi |wi | IR1+ (wi )

i=1

"



N 2

# N p p X |Ψ| exp − β λi |wi | IRN+ (w). 1 2

(5.24)

i=1

Our intention behind the use of a hyperparameter vector λ instead of a single λ for all γi ’s is to form a hierarchical Bayesian analogue to the adaptive lasso, proposed in [129]. Indeed, the following proposition holds: Proposition 5.3.1. The MAP estimate of w that utilizes the truncated Laplace prior of (5.24) coincides with the estimation of w resulting via the optimization of the non-negativity constrained adaptive lasso criterion ( ˜ = arg min w w

for αi =



N

X β ky − Φwk22 + αi w i 2 i=1

) , s.t. w ∈ RN +,

(5.25)

βλi , i = 1 . . . N .

Proof. In the Bayesian domain, the MAP estimator of w is defined as wMAP = arg max p(w|y). w

From Bayes’ theorem, the MAP estimator can be expressed as wMAP = arg max p(y|w, β)p(w|λ, β) w

Konstantinos E. Themelis

116

(5.26)

Bayesian signal processing techniques for hyperspectral image unmixing

= arg min {−log [p(y|w, β)p(w|λ, β)]} . w

(5.27)

Then, substituting in (5.27) the likelihood function from (5.12) and the truncated Laplace prior from (5.24), the MAP estimator can be expressed as wMAP = arg min w ( "

" # #)  N p X p 1 N β −log (2π) β exp − ky − Φwk22 β 2 |Ψ| 2 exp − β λi |wi | IRN+ (w) 2 i=1 # " N   X p β ky − Φwk22 + βλi |wi | − log IRN+ (w) . (5.28) = arg min w 2 i=1 −M 2

M 2



    N , and −log I (w) = 0, for w ∈ RN Note that −log IRN+ (w) = ∞, for w ∈ / RN R+ + , i.e., + this term severely penalizes w’s with negative elements. Thus, it is established that the MAP estimation of w, given the truncated Laplace prior of (5.24), is equivalent to solving √ the adaptive lasso criterion of (5.25), for αi = βλi , i = 1, . . . , N , subject to w being non-negative, i.e., w ∈ RN +. As shown in (5.25), the main feature of the adaptive lasso is that each coordinate wi of w is now weighted by a distinct positive parameter αi . This modification results in a consistent estimator, [129], which is not the case for the original Lasso estimator (5.15). It is obvious from (5.24) that the quality of the endmember selection procedure depends on the tuning parameter vector λ. Typically, tuning parameters reflect one’s prior knowledge about the estimation problem and they can either be manually set, or can be considered as random variables. We choose the latter alternative, by assuming a Gamma hyperprior for the components of λ, p(λi |r, δ) = Γ(λi |r, δ) =

δr λi r−1 exp [−δλi ] , i = 1, 2, . . . , N Γ(r)

(5.29)

where r and δ are hyperparameters, with r ≥ 0 and δ ≥ 0. Both Gamma priors of β, in (5.18), and λi ’s, in (5.29), are flexible enough to express prior information, by properly tuning their hyperparameters. In this work, we use a non-informative Jeffrey’s prior (p(x) ∝ x1 ) over these parameters, which is obtained from (5.18) and (5.29) by setting all hyperparameters κ, θ, r, δ of the Gamma distributions to zero, as in [41, 40, 7]. A schematic representation of the proposed hierarchical Bayesian model in the form of a directed acyclic graph is shown in Figure 5.4.

Konstantinos E. Themelis

117

λΝ

γΝ

δ

...

...

Bayesian signal processing techniques for hyperspectral image unmixing

r

λ2 λ1

γ2 γ1

w β

κ θ

Figure 5.4: Directed acyclic graph of the proposed Bayesian model. The deterministic model parameters appear in boxes.

5.3.2 The proposed Bayesian inference methodology As it is common in Bayesian inference, the estimation of the parameters is based on their joint posterior distribution. This posterior for the model presented in section 5.3.1 is expressed as p (y|w, β) p (w|β, γ) p (γ|λ) p (λ) p (β) , (5.30) p (w, β, γ, λ|y) = p(y) which is intractable, in the sense that the integral Z Z Z Z p(y) =

p (y, w, β, γ, λ) dwdγdλdβ

(5.31)

cannot be expressed in closed form. In such cases, the Gibbs sampler [52] provides an alternative method for overcoming this impediment. As explained in Chapter 3, the Gibbs sampler generates random samples from the conditional posterior distributions of the associated model parameters iteratively. As explained in [29], this sampling procedure generates a Markov chain of random variables, which converges to the joint distribution (5.30) (usually the first few iterations, which constitute the so called burn-in period, are ignored). In the sequel, we compute first the conditional posterior distributions, which are vital for the proposed Bayesian inference algorithm, and we explain the difficulty of utilizing Gibbs sampling in the proposed Bayesian framework. Then the proposed algorithm is discussed in detail. 5.3.2.1 Posterior conditional distributions In this subsection, in accordance with the Gibbs sampler spirit, we derive the conditional posterior distributions of the model parameters w, γ, λ and β. Starting with w, it is easily shown (utilizing eq. (5.12) and (5.17)) that its posterior conditional density is a truncated multivariate Gaussian in RN +, p(w|y, γ, λ, β) = R

Konstantinos E. Themelis

p(y|w, β)p(w|γ, β) = NRN+ (w|µ, Σ) p(y|w, β)p(w|γ, β)dw 118

(5.32)

Bayesian signal processing techniques for hyperspectral image unmixing

where µ and Σ are expressed as follows, [75, theorem 10.3], µ = βΣΦT y  −1 . Σ = β −1 ΦT Φ + Λ

(5.33) (5.34)

The posterior conditional for the precision parameter β, after eliminating the terms which are independent of β, is expressed as p(y|w, β)p(w|γ, β)p(β) p(β|y, w, γ, λ) = R ∞ . p(y|w, β)p(w|γ, β)p(β)dβ 0

(5.35)

Utilizing (5.12), (5.17), and (5.18), it is easily shown that β is Gamma distributed as follows   M + N 1 1 T 2 p(β|y, w, γ, λ) = Γ β + κ , ky − Φwk2 + θ + w Λw 2 2 2

(5.36)

The conditional distribution of γi given y, wi , λi , β is a generalized inverse Gaussian distribution [110], analytically computed as p(y|wi , β)p(wi |γi , β)p(γi |λi )p(λi )p(β) p(y|wi , β)p(wi |γi , β)p(γi |λi )p(λi )p(β)dγi p(wi |γi , β)p(γi |λi ) =R p(wi |γi , β)p(γi |λi )dγi h i   1 −1 1 w2 2(2π)− 2 β 2 γi 2 exp − β2 γii IR1+ (wi ) λ2i exp − λ2i γi h i =R 1  λ  ∞ β wi2 λi − 21 12 − 2 1 (wi ) exp − I 2(2π) β γ exp − 2i γi dγi R i 2 γi 2 0 + h i h i −1 −1 w2 w2 γi 2 exp − β2 γii − λ2i γi γi 2 exp − β2 γii − λ2i γi i h p h i =R = q 1 ∞ −2 β wi2 2π λi 2 exp − βλi wi γi exp − 2 γi − 2 γi dγi λi 0   12   p λi βwi2 λi − 12 = γi exp − − γi + βλi wi , 2π 2γi 2

p(γi |y, wi , λi , β) = R

(5.37)

where we used [55, equation 3.471.15] to compute the integral. Finally, the conditional posterior of λi given y, wi , γi , β is expressed as p(λi |y, wi , γi , β) = R ∞ 0

p(γi |λi )p(λi ) , p(γi |λi )p(λi )dλi

(5.38)

which, using (5.21) and (5.29), is shown to be a Gamma pdf,  γi p(λi |y, wi , γi , β) = Γ λi |1 + r, + δ , i = 1, 2, . . . , N. 2 

Konstantinos E. Themelis

119

(5.39)

Bayesian signal processing techniques for hyperspectral image unmixing

(t)

(t)

The Gibbs sampler generates a sequence of samples w(t) , β (t) , γi , and λi , i = 1, 2, . . . , N , by sampling the conditional pdfs (5.32), (5.36), (5.37) and (5.39) respectively. In this work a different procedure is followed. Specifically, we propose a deterministic approximation of the Gibbs sampler, where the randomly generated samples of the Gibbs sampler are replaced by the means of the corresponding conditional distributions, (5.32), (5.36), (5.37) and (5.39). Thus, a novel iterative scheme among the conditional means of w, β, γi ’s and λi ’s arises, termed Bayesian Inference Iterative Conditional Expectations (BI-ICE) algorithm. It should be emphasized that by following this approach, we depart from the statistical framework implied by the Gibbs sampler and we end up with a new deterministic algorithm for estimating the parameters of the proposed hierarchical model. Besides avoiding the complexity of sampling (5.37), BI-ICE is expected to converge faster than the original Gibbs sampler and, as a result, is expected to be much less computationally demanding. Also, as verified by extensive simulations, BI-ICE leads to sparse solutions and offers robust estimation performance under various experimental settings. 5.3.2.2 The BI-ICE algorithm As mentioned previously, BI-ICE needs the conditional expectations of the model parameters. These are computed analytically as described below. Expectation of p(w|y, γ, λ, β) As shown in (5.32), p(w|y, γ, λ, β) is a truncated Gaussian distribution in RN + . We know from [70] that in the one-dimensional case, the expectation of a random variable x modeled by the truncated Gaussian distribution in R1+ can be computed as

x ∼ NR1+ (x|µ∗ , σ ∗2 ) ⇒ E [x] = µ∗ +

√1 exp 2π



1 − 12 erfc

 ∗2 − 12 σµ∗2  ∗  σ∗, √µ 2σ ∗

(5.40)

where erfc(·) is the complementary error function. Unfortunately, to the best of our knowledge, there is no analogous closed form expression for the N -dimensional case. However, as shown in [103, 104], the distribution of the i-th element of w conditioned on the remaining elements w¬i = [w1 , . . . , wi−1 , wi+1 , . . . , wN ]T can be expressed as wi |w¬i ∼ NR1+ (wi |µ∗i , σii∗ )

(5.41)

T µ∗i = µi + σ¬i Σ−1 ¬i¬i (w¬i − µ¬i )

(5.42)

with

Konstantinos E. Themelis

120

Bayesian signal processing techniques for hyperspectral image unmixing

T Σ−1 σii∗ = σii − σ¬i ¬i¬i σ¬i .

(5.43)

Recalling that w ∼ NRN+ (w|µ, Σ), µi and σii represent the i-th and ii-th elements of µ and Σ respectively. The (N − 1) × (N − 1) matrix Σ¬i¬i is formed by removing the i-th row and the i-th column from Σ, while the (N − 1) × 1 vector σ¬i is the ith column of Σ after removing its ith element. By applying (5.40) and utilizing (5.42)-(5.43), the expected values of all random variables wi |w¬i , i = 1, 2, . . . , N can be analytically computed. Based on this result, an iterative procedure is proposed in order to estimate the mean of the posterior p(w|y, γ, λ, β). Specifically, the j-th iteration, j = 1, 2, . . . , of this procedure is described as follows4 (j)

(j−1)

(j)

(j)

1. w1 = E[p(w1 |w2

(j−1)

, w3

(j−1)

2. w2 = E[p(w2 |w1 , w3 .. . (j)

(j)

(j−1)

, . . . , wN

(j−1)

, . . . , wN

(j)

)]

)] (5.44)

(j)

N. wN = E[p(wN |w1 , w2 , . . . , wN −1 )] This procedure is repeated iteratively until convergence. Experimental results have shown that the iterative scheme in (5.44) converges to the mean of w ∼ NRN+ (w|µ, Σ) after a few iterations. Expectation of p(β|y, w, γ, λ) The mean value of the Gamma distribution in (5.36) is given by E [p(β|y, w, γ, λ)] =

1 ky 2



M +N + 2 2 Φwk2 + θ

κ + 21 wT Λw

(5.45)

Expectation of p(γi |y, wi , λi , β) The mean of the generalized inverse Gaussian distribution (5.37) is computed as Z



E [p(γi |y, wi , λi , β)] =

γi p(γi |y, wi , λi , β)dγi 0

 1  p βwi2 λi λi 2 12 = γi exp − − γi + βλi wi dγi 2π 2γi 2 0    21  Z h i ∞ 1 p βwi2 λi λi 2 = exp βλi wi γi exp − − γi dγi 2π 2γi 2 0 Z





4 In the following, for notational simplicity, the expectation Ep(x|y) [x] of a random variable x with conditional distribution p(x|y) is denoted as E[p(x|y)].

Konstantinos E. Themelis

121

Bayesian signal processing techniques for hyperspectral image unmixing

 =

2λi π

 21 

βwi2 λi

 43 exp

hp

i p  βλi wi K3/2 βλi wi ,

(5.46)

where we used [55, equation 3.471.9] for the integral computation, and Kν (·) stands for the modified Bessel function of second kind of order ν. Also, we set p(γi |y, wi , λi , β) = 0, for wi < 0. Note that this does not affect the BI-ICE algorithm, since wi ’s are guaranteed to be non-negative (the fact wi < 0 is impossible by the formulation of the problem). Equivalently, (5.46) can be expressed in a simpler form by utilizing [55, equation 3.471.15], e.g., 1   iZ ∞ 1 hp λi 2 βwi2 λi 2 βλi wi γi exp − E [p(γi |y, wi , λi , β)] = exp − γi dγi 2π 2γi 2 0 s βwi2 1 = + . λi λi 

(5.47)

Expectation of p(λi |y, wi , γi , β) Again, the mean value of the Gamma distribution in (5.39) is given by E [p(λi |y, wi , γi , β)] =

1+r . +δ

1 γ 2 i

(5.48)

Based on the previous expressions, the proposed BI-ICE algorithm is summarized in Table 1. As shown in the Table, the algorithm is initialized with λ(0) = 1, γ (0) = 1 and, as in [7], β (0) = 0.01 kyk2 . Regarding the updating of parameter w(t) , an auxiliary variable v has been utilized in Table 5.3. This is initialized with µ(t) (the value of µ at iteration t) and is updated by performing a single iteration of the scheme described in (5.44). The resulting value of v is assigned to w(t) . The rationale behind this choice is that for a diagonal Σ (which happens when the columns of Φ are orthogonal), it easily follows from (5.42), (5.43) that the wi ’s in (5.44) are uncorrelated. Thus, a single iteration is sufficient to obtain the mean of NRN+ (w|µ, Σ). Although, this is not valid when Σ is not diagonal, experimental results have evidenced that the estimation of the mean of NRN+ (w|µ, Σ) resulting after the execution of a single iteration of the scheme in (5.44) is also sufficient in the framework of the BI-ICE algorithm. Due to the fact that the BI-ICE algorithm springs out from the hierarchical Bayesian model described in section 5.3.1, it leads to sparse estimations for w, and the endmembers present in the pixel are identified by the non-zero entries of w. In addition, all parameters of the model are naturally estimated from the data, as a consequence of the Bayesian Lasso approach followed in this section. This is in contrast to deterministic algorithms for

Konstantinos E. Themelis

122

Bayesian signal processing techniques for hyperspectral image unmixing

Table 5.3: The BI-ICE algorithm Input Φ, y, κ, θ, r, δ Initialize γ (0) = λ(0) = 1, β (0) = 0.01 kyk2 for t = 1, 2, . . . do - Compute w(t) as follows Compute µ(t) , Σ(t) using (5.33), (5.34) Set v(0) = µ(t) h i (1)

(0)

(0)

Compute v1 = E p(v1 |v2 , . . . , vN ) , using (5.42), (5.43), and (5.40) h i (1) (1) (0) (0) Compute v2 = E p(v2 |v1 , v3 , . . . , vN ) , using (5.42), (5.43), and (5.40) .. . h i (1) (1) (1) (1) Compute vN = E p(vN |v1 , v2 , . . . , vN −1 ) , using (5.42), (5.43), and (5.40)

Set w(t) = v(1)   - Compute β (t) = E hp(β|y, w(t) , γ (t−1) , λ(t−1) i , using (5.45) (t)

(t)

(t−1)

, β (t) ) , i = 1, 2, . . . , N, using (5.46) - Compute γi = E p(γi |y, wi , λi i h (t) (t) (t) - Compute λi = E p(λi |y, wi , γi , β (t) ) , i = 1, 2, . . . , N, using (5.48) endfor solving the Lasso, e.g., [44, 129], or adaptive methods, [16], which face the problem of finetuning specific parameters, (corresponding to λ of our model), that control the sparsity of the solution. Besides, useful by-products of the BI-ICE algorithm are the estimates of (a) the variance of the additive noise of the linear model, as in [41], and (b) the variance of the abundance vector. The latter, coupled with the estimate of µ, provides the posterior distribution of the abundance vector, which can be used to provide confidence intervals to assess the reliability of the proposed estimator. Concerning the computational complexity, as it is clear from Table 1, the BI-ICE algorithm requires the evaluation of simple closed form formulas. The main computational burden is due to the calculation of the N inverse matrices Σ¬i¬i , i = 1, 2, . . . , N appearing in (5.42) and (5.43). As shown in section 5.3.2.4, all these matrices can be derived very efficiently from Σ−1 , and thus only one matrix inversion per iteration (related to the computation of Σ in (5.34)) is required. This results in a reduction of the computational complexity of the BI-ICE algorithm by one order of magnitude per iteration. Thus far, the proposed BI-ICE algorithm has been described as a deterministic approximation of the Gibbs sampler. An alternative view of the BI-ICE algorithm in the framework of variational Bayesian inference is provided in detail in section 5.3.2.5. There, it is shown that the adoption of a proper factorization of an approximation of the posterior p(w, γ, λ, β|y) results to a variational Bayesian inference scheme that exploits the same type of distributions and updates the same form of parameters. From this point of view, BI-ICE

Konstantinos E. Themelis

123

Bayesian signal processing techniques for hyperspectral image unmixing

can be thought of as a first moments approximation to a variational Bayesian inference scheme. 5.3.2.3 Embedding the sum-to-one constraint The sparsity-promoting hierarchical Bayesian model presented in the previous sections takes into consideration the non-negativity of the abundance vector w. However, the abundances’ sum-to-one constraint has not yet been considered. As noted in [64], the sum-to-one constraint is prone to strong criticisms. In real hyperspectral images the spectral signatures are usually defined up to a scale factor, and thus, the sum-to-one constraint should be reP placed by a generalized constraint of the form ci wi = 1, in which the weights ci denote the pixel-dependent scale factors. Moreover, it is known that the sparse solution of a linear system with Φ having non-negative entries already admits a generalized sum-to-one constraint, [24]. Thus, it can be safely assumed that the impact of not enforcing the sum-to-one constraint on the performance of the algorithm is not expected to be severe. Despite this fact, in this section we describe an efficient way to enforce this constraint, although through a regularization parameter. Note that direct incorporation of this constraint to the proposed Bayesian framework would require truncation of the prior normal distribution of w over a simplex, rendering the derivation of closed form expressions for the conditional posterior distributions intractable. To alleviate this, we choose, as in [112, 60], [54, p. 586], to impose the sum-to-one constraint deterministically, by augmenting the initial LMM of (5.1) with an extra equation as follows, " # " # " # y Φ n = w+ T α α1 0

(5.49)

where α is a scalar parameter, which controls the effect of the sum-to-one constraint on the estimation of w. Specifically, the larger the value of α is, the closer the sum of the estimated wi ’s will be to one. It should be noticed that the augmentation of the LMM as in (5.49) does not affect the proposed hierarchical Bayesian model and the subsequent analysis. 5.3.2.4 Fast computation of (5.42) and (5.43) In order to complete the presentation of the BI-ICE algorithm, we show in this section how the computational complexity of BI-ICE can be deduced by one order of magnitude by −1 −1 efficiently computing all Σ−1 i , i = 1, 2, ..., N , from Σ . Let us define V = Σ . In [103], T the formula Σ−1 ¬i¬i = V¬i¬i − v¬i v¬i /Vii , i = 1, 2, . . . , N , has been utilized for computing −1 all matrices Σ−1 ¬i¬i , i = 1, 2, . . . , N from Σ , where V¬i¬i and v¬i are related to V in the same way Σ¬i¬i and σ¬i are related to Σ. It has been seen in simulations that this rankKonstantinos E. Themelis

124

Bayesian signal processing techniques for hyperspectral image unmixing

one downdate formula is numerically susceptible. In the following, an alternative method is proposed, which avoids direct computation of Σ−1 ¬i¬i and has exhibited numerical robustness in all simulations performed. Let Ti be an N × N permutation matrix, which when it premultiplies a matrix, moves its i-th row to the N -th position, after upshifting rows i + 1, . . . , N . Then, by defining Σi = Ti ΣTTi , it is easily verified that " Σi =

T Σ¬i¬i σ¬i σ¬i σii

# (5.50)

−1 −1 T Moreover, due to the orthogonality of Ti , Σ−1 i = Ti Σ Ti , i = 1, 2, . . . , N , i.e., all Σi , i = 1, 2, . . . , N , are obtained from Σ−1 by simple permutations. From [106, p. 54], and (5.50), we get

" Σ−1 i =

T Σ−1 ¬i¬i 0 0 0

#

1 + T σii − σ¬i Σ−1 ¬i¬i σ¬i

Let qi =

h

T 0 σ¬i

i

" Σ−1 i

T σ¬i 0

#

"

−Σ−1 ¬i¬i σ¬i 1

#

h

T Σ−1 −σ¬i ¬i¬i 1

2 −1 T Σ σ σ ¬i ¬i ¬i¬i T = σ¬i Σ−1 . ¬i¬i σ¬i + T σii − σ¬i Σ−1 ¬i¬i σ¬i

i

.

(5.51)

(5.52)

T Σ−1 Then, by rearranging (5.52) the term σ¬i ¬i¬i σ¬i can be written as T σ¬i Σ−1 ¬i¬i σ¬i =

and from (5.43) σii∗ = σii −

qi σii , qi + σii

qi σii . qi + σii

(5.53)

(5.54)

Define υ¬i = w¬i − µ¬i and pi =

h

T σ¬i 0

i

" Σ−1 i

υ¬i 0

# T = σ¬i Σ−1 ¬i¬i υ¬i +

−1 T T σ¬i Σ−1 ¬i¬i σ¬i σ¬i Σ¬i¬i υ¬i . T σii − σ¬i Σ−1 ¬i¬i σ¬i

(5.55)

T Then, solving for σ¬i Σ−1 ¬i¬i υ¬i , we get T σ¬i Σ−1 ¬i¬i υ¬i =

and (5.42) becomes µ∗i = µi +

pi σii . qi + σii

pi σii . qi + σii

(5.56)

(5.57)

from Σ−1 , qi and pi are computed from the first equations In summary, after obtaining Σ−1 i in (5.52) and (5.55) respectively. Then, σii∗ µ∗i , i = 1, 2, . . . , N are efficiently computed from

Konstantinos E. Themelis

125

Bayesian signal processing techniques for hyperspectral image unmixing

(5.54) and (5.57) respectively. 5.3.2.5 Relation to variational Bayesian inference and other methods In this section, we highlight the relation of the proposed BI-ICE algorithm with other known Bayesian inference methods and primarily with variational Bayesian inference, [72, 18, 6, 65]. To this end, we first apply the variational inference method to the proposed Bayesian model described in Section 5.3.1. In variational inference, the joint posterior distribution of the model parameters p(w, γ, λ, β|y) is approximated by a variational distribution Q(w, γ, λ, β). Assuming posterior independence among the model parameters, this variational distribution factorizes as follows ! N ! N Y Y Q(w, γ, λ, β) = q(w)q(γ)q(λ)q(β) = q(w) q(γi ) q(λi ) q(β). (5.58) i=1

i=1

According to the variational Bayes methodology, [18, pp. 466], the factors in (5.58) can be computed by minimizing the Kullback−Leibler divergence between the approximate distribution Q(w, γ, λ, β) and the target distribution p(w, γ, λ, β|y). After some straightforward algebraic manipulations, it turns out that q(w) is expressed as q(w) = NRN+ (w|µ, Σ),

(5.59)

with µ = hβiq(β) ΣΦT y

and

Σ = hβiq(β) ΦT Φ + hΛiq(γ)

−1

,

(5.60)

where hβiq(β) denotes the mean value of β with respect to the distribution q(β). For the rest factors we have   1 1 T M +N 2 + κ, hky − Φwk iq(w) + hw Λwiq(w)q(γ) + θ , (5.61) q(β) = Γ β| 2 2 2

r q(γi ) =

  q hλi iq(λi ) − 21 hβiq(β) hwi2 iq(w) 1 1 2 γi exp − − hλi iq(λi ) γi + hλi iq(λi ) hβiq(β) hwi iq(w) , 2π 2 γi 2 (5.62)

and   1 q(λi ) = Γ λi |r + 1, hγi iq(γi ) + δ . 2

Konstantinos E. Themelis

126

(5.63)

Bayesian signal processing techniques for hyperspectral image unmixing

Eqs. (5.59)-(5.63) do not provide an explicit solution, since they depend on each other’s factors. However, in principle, a solution may be reached iteratively, by initializing the required moments and then cycling through the model parameters, updating each distribution in turn. It may come as a surprise, but, although a different approach is used, the derived expressions resemble the conditional posterior distributions (5.32), (5.36), (5.37), and (5.39) employed in the iterative scheme of BI-ICE. Notice that both approaches share (a) the same type of distributions and (b) the updating of the same form of parameters. The only difference is that, in a variational Bayesian framework, the computation of the mean values of the model parameters requires a blend of their first and second moments with respect to the approximate posterior distributions given in eqs. (5.59), (5.61) - (5.63), while this is not the case with BI-ICE (see (5.44), (5.45), (5.46), and (5.48)). As a result, the proposed BI-ICE can be considered as a first moments approximation of the variational Bayesian inference scheme, which is based on the factorization given in (5.58). To elaborate further on the relation of BI-ICE to variational Bayes approximation, let Q us assume that in the variational framework q(w) is factorized as q(w) = N i q(wi ). Then, it can be shown that the posterior approximate distributions q(γ), q(λ), and q(β) of the variational Bayes scheme remain exactly the same as in (5.62), (5.63), and (5.61) respectively, while q(wi ) is expressed as q(wi ) = NR1+ (wi |µ˙ i , σ˙ ii ) −1  1 T φTi (y − Φ¬i hw¬i i) µ˙ i = φi φi + h i γi  −1 1 −1 T σ˙ ii = hβi φi φi + h i γi

(5.64) (5.65) (5.66)

where Φ¬i is the matrix resulting from Φ after removing its i-th column. By superimposing (5.64)-(5.66) and (5.41)-(5.43) it is revealed that the posterior independence of wi ’s assumed in the variational framework leads to a different updating mechanism compared to BI-ICE, in which such an assumption is not made. This means that the proposed scheme in (5.44) Q cannot result from a factorized approximation of the form N i q(wi ). It is also worth noting that the motivation for the derivation of the BI-ICE algorithm has been the so-called Rao-Blackwellized Gibbs sampling scheme [50, 84]. In a Rao-Blackwellized Gibbs sampler with two random variables X, Z, the sequences x1 , x2 , . . . and z1 , z2 , . . . are generated first by sampling the conditional distributions p(X|Z) and p(Z|X) respectively, as in the conventional Gibbs sampler. Then, the conditional expectations E[X|zi ] P PK 1 and E[Z|xi ], ∀i are computed and the sample means K1 K i=1 E[X|zi ] and K i=1 E[Z|xi ] for large K are obtained. According to the Rao-Blackwell theorem [83], these estimates improve P PK 1 upon the original Gibbs sampler estimates K1 K i=1 xi and K i=1 zi , [50, 29]. Note that in Konstantinos E. Themelis

127

Bayesian signal processing techniques for hyperspectral image unmixing

the proposed iterative scheme, the conditional expectations of all involved parameters are computed as well. However, each one of them is now evaluated directly in each iteration, conditioned on the current values of the remaining conditional expectations. Finally, it should be mentioned that the proposed BI-ICE algorithm resembles the iterative conditional modes (ICM) algorithm presented in [11]. As noted in [18, p. 546], the ICM algorithm can be viewed as a “greedy” approximation to the Gibbs sampler, where instead of drawing a sample from each conditional distribution, the maximum of the conditional distribution is selected. The difference with the ICM method is that in BI-ICE the first order moment of the conditional posterior distributions is used instead of the maximum.

5.3.3 Experimental results

5.3.3.1 Simulation results on synthetic data

This section illustrates the effectiveness of the proposed BI-ICE algorithm through a series of experiments related to the unmixing of a synthetic hyperspectral image. Following the experimental settings of [64], where a thorough comparison of several (non-Bayesian) sparse semi-supervised unmixing algorithms is presented, we consider two spectral data sets for the simulated hyperspectral scene: (a) $\boldsymbol{\Phi}_1 \in \mathbb{R}^{453 \times 220}$, a matrix containing the spectral signatures of 220 endmembers selected from the USGS spectral library, [34], and (b) $\boldsymbol{\Phi}_2 \in \mathbb{R}^{453 \times 220}$, a matrix of i.i.d. entries uniformly distributed in the interval $[0, 1]$. As expected, the spectral signatures of the materials in $\boldsymbol{\Phi}_1$ are highly correlated. The condition number and the mutual coherence, [64], of $\boldsymbol{\Phi}_1$ are $36.182 \times 10^6$ and $0.999933$ respectively, whereas, for $\boldsymbol{\Phi}_2$, the same measures are equal to $82$ and $0.8373$ respectively. The abundance fractions of the simulated image and the number of different endmembers composing a single pixel are generated according to a Dirichlet distribution, [94]. In all simulations, the observations are corrupted by either white Gaussian or colored noise. Colored noise is produced by filtering a sequence of white noise with a lowpass filter with a normalized cutoff frequency of $5\pi/M$. The variance of the additive noise is determined by the SNR level.

First, the fast convergence and the sparse estimates of $\mathbf{w}$ produced by the new algorithm are depicted in Figure 5.5. In this experiment, a pixel with three non-zero abundances $(0.1397, 0.2305, 0.6298)$ is considered, and white noise is added to the model such that the SNR equals 25 dB. The curves in Figure 5.5 are the average of 50 noise realizations. We observe that fewer than 15 iterations are sufficient for the BI-ICE algorithm to converge to the correct sparse solution of $\mathbf{w}$. That is, it correctly determines the abundance fractions of the endmembers present in the pixel, while all remaining abundance fractions converge to zero.
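For concreteness, the following sketch shows one way to generate a synthetic mixed pixel of the kind used in these experiments: a sparse abundance vector whose non-zero entries are drawn from a symmetric Dirichlet distribution (so that they are nonnegative and sum to one), mixed through a random spectral library and corrupted by white or colored Gaussian noise at a prescribed SNR. All sizes, the Dirichlet parameter, and the use of a simple moving-average filter for the colored noise are illustrative assumptions, not the exact settings of [64].

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, xi = 453, 220, 3                      # bands, library size, sparsity level (assumed)

Phi = rng.uniform(0.0, 1.0, size=(M, N))    # i.i.d. uniform library, analogous to Phi_2
support = rng.choice(N, size=xi, replace=False)
w = np.zeros(N)
w[support] = rng.dirichlet(np.ones(xi))     # nonnegative abundances summing to one

def add_noise(signal, snr_db, colored=False, taps=9):
    """Add Gaussian noise at the requested SNR; optionally lowpass-filter it (colored noise)."""
    noise = rng.standard_normal(signal.shape)
    if colored:
        noise = np.convolve(noise, np.ones(taps) / taps, mode="same")
    noise *= np.linalg.norm(signal) / (np.linalg.norm(noise) * 10 ** (snr_db / 20.0))
    return signal + noise

y = add_noise(Phi @ w, snr_db=25.0)         # noisy mixed-pixel observation
print(y.shape, w.sum())                     # (453,), 1.0
```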


Figure 5.5: Estimation of the entries of the sparse vector $\mathbf{w}$, as BI-ICE progresses. The algorithm is applied to simulated data, generated using (a) a highly correlated matrix of spectral data, (b) a matrix of i.i.d. uniform data. White noise is added (SNR = 25 dB). Dashed lines: true values. Solid lines: estimated values.

Next, the BI-ICE algorithm was compared to: (a) the least squares (LS) algorithm, (b) a quadratic programming (QP) technique, which enforces the constraints but does not specifically exploit the problem's sparsity, [35], (c) the orthogonal matching pursuit (OMP) algorithm, [121], which is a widely used, greedy, sparsity-promoting algorithm, (d) the sparse unmixing by variable splitting and augmented Lagrangian (SUnSAL) algorithm, [16, 64], which is based on the alternating direction method of multipliers and solves the $\ell_1$ penalization problem of (5.15) subject to the physical constraints of the unmixing problem, and (e) the constrained version of SUnSAL, CSUnSAL, which solves the constrained version of the problem in (5.15) (see also [64] for details). In our experiments, the parameters used for SUnSAL are $\mu = 1$ and $\lambda = 1$, while for CSUnSAL we used $\mu = 1$, $\lambda = 10^{-3}$, and $\delta = 10^{-6}$, see also [16]. The comparison is based on the following metric,

\[ \mathrm{MSE} = E\!\left[ \frac{\left\| \mathbf{w} - \tilde{\mathbf{w}} \right\|_2^2}{\left\| \mathbf{w} \right\|_2^2} \right], \tag{5.67} \]

where $\mathbf{w}$ and $\tilde{\mathbf{w}}$ are the true and the estimated abundance vectors, respectively. The corresponding MSE curves for different sparsity levels ranging from 1 (pure pixel) to 20 are shown in Figure 5.6, for both spectral libraries $\boldsymbol{\Phi}_1$ and $\boldsymbol{\Phi}_2$. Due to poor results, the MSE curve of the LS algorithm is not shown in the figure. It can be seen that the proposed algorithm outperforms the OMP, QP, and CSUnSAL algorithms and has similar performance to the SUnSAL algorithm.
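As a rough sketch of how such curves can be produced, the normalized MSE of (5.67) is simply averaged over independent noise realizations for every sparsity level. The `unmix` argument below is a hypothetical placeholder standing in for any of the compared algorithms; the run counts and SNR are illustrative.

```python
import numpy as np

def normalized_mse(w_true, w_est):
    """Per-pixel metric of eq. (5.67), before averaging over realizations."""
    return np.sum((w_true - w_est) ** 2) / np.sum(w_true ** 2)

def mse_curve(unmix, Phi, sparsity_levels, n_runs=50, snr_db=20.0, rng=None):
    """Average normalized MSE versus sparsity level for a given unmixing routine `unmix`."""
    rng = rng or np.random.default_rng(0)
    M, N = Phi.shape
    curve = []
    for xi in sparsity_levels:
        errs = []
        for _ in range(n_runs):
            w = np.zeros(N)
            w[rng.choice(N, xi, replace=False)] = rng.dirichlet(np.ones(xi))
            y = Phi @ w
            noise = rng.standard_normal(M)
            noise *= np.linalg.norm(y) / (np.linalg.norm(noise) * 10 ** (snr_db / 20.0))
            errs.append(normalized_mse(w, unmix(Phi, y + noise)))
        curve.append(np.mean(errs))
    return np.array(curve)
```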


Figure 5.6: MSE as a function of the level of sparsity obtained by different unmixing methods when applied to simulated data with white additive noise (SNR = 20 dB) and using two different spectral libraries.

In comparison to BI-ICE, the alternating direction algorithms SUnSAL and CSUnSAL are of lower computational complexity. However, it should be pointed out that their comparable MSE performance comes at the additional expense of manually fine-tuning nontrivial parameters, such as the sparsity-promoting parameter $\lambda$ (see (5.15) and [64]). Thus, an advantage of the proposed BI-ICE algorithm over the SUnSAL and CSUnSAL algorithms is that all unknown parameters are inferred directly from the data. In addition, BI-ICE offers useful by-products, such as: (a) estimates of all model parameters (one such parameter that is useful in many applications is the noise variance), (b) estimates of the variances of the estimated parameters, which may lead to the establishment of confidence intervals, and (c) approximate posterior distributions for the estimated parameters. In contrast, all the other algorithms considered return only point estimates of the parameters of interest.

A quick view of Figure 5.6 also reveals that the OMP and QP algorithms attain the worst performance in terms of MSE. OMP adds one endmember to its active set in each iteration and subtracts its contribution from the residual signal, until the correlation of the residual with the dictionary drops below a certain threshold or the maximum of 20 selected endmembers is reached. However, due to its greedy nature and the high condition number of $\boldsymbol{\Phi}_1$, OMP fails to detect the correct endmembers that compose the pixel, which explains its poor performance in Figure 5.6. Note also that, in the cases of high sparsity, the QP algorithm fails to detect the correct support of the sparse vector $\mathbf{w}$, resulting in poor MSE performance. This may not come as a surprise, since the QP algorithm is not specifically designed for sparse regression problems.
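The greedy selection loop described above can be sketched as follows. This is a generic OMP implementation written only for illustration, with a simple residual-norm stopping rule standing in for the correlation threshold mentioned in the text; it is not the exact code of [121] used in the experiments.

```python
import numpy as np

def omp(Phi, y, max_atoms=20, tol=1e-6):
    """Generic orthogonal matching pursuit: pick the column most correlated with the
    residual, re-fit on the active set by least squares, and repeat."""
    M, N = Phi.shape
    residual = y.copy()
    active = []
    w = np.zeros(N)
    for _ in range(max_atoms):
        correlations = Phi.T @ residual
        i = int(np.argmax(np.abs(correlations)))
        if i not in active:
            active.append(i)
        # Least-squares fit restricted to the currently selected endmembers
        w_active, *_ = np.linalg.lstsq(Phi[:, active], y, rcond=None)
        residual = y - Phi[:, active] @ w_active
        if np.linalg.norm(residual) < tol * np.linalg.norm(y):
            break
    w[active] = w_active
    return w
```

With a highly coherent library such as $\boldsymbol{\Phi}_1$, the first greedy selections are easily wrong and cannot be compensated for later, which is consistent with the poor MSE of OMP observed in Figure 5.6.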


Figure 5.7: MSE as a function of the SNR obtained by different sparse unmixing methods when applied to simulated data with white additive noise and using different spectral libraries for sparsity level $\xi = 5$.

In Figure 5.7 the MSE values of the various sparse unmixing algorithms versus the SNR are displayed. For this experiment, the spectral libraries $\boldsymbol{\Phi}_1$ and $\boldsymbol{\Phi}_2$ were used to simulate two different hyperspectral scenes, each having 100 pixels. The level of sparsity of the abundance vectors of all pixels is held fixed and equal to five. As expected, the MSE values of all algorithms decrease as the SNR increases. This is not the case for the QP algorithm though, which completely fails to retrieve the correct support of the sparse abundance vector $\mathbf{w}$, and its MSE is almost constant. Again, the performance of SUnSAL and BI-ICE is comparable, with BI-ICE performing slightly better in the case of the i.i.d. mixing matrix $\boldsymbol{\Phi}_2$.

In Figs. 5.8 and 5.9 the same experimental results are provided for the scenario where the simulated pixels are contaminated with colored noise. We observe that the relative performance of the various algorithms is not affected by the presence of colored noise, apart from the fact that the MSE values are now slightly increased. Although our hierarchical Bayesian model assumes i.i.d. noise, these figures provide enough evidence to conclude that the proposed BI-ICE algorithm can also provide reliable results in colored-noise environments.

Finally, in Figure 5.10 the MSE performance of the proposed BI-ICE algorithm is shown when the sum-to-one constraint is incorporated into the regression problem, as explained earlier in section 5.3.2.3, with $\alpha = 10^3$. It can be seen that the performance of the algorithm is particularly enhanced in the case of high sparsity, i.e., when the image pixel is either pure ($\xi = 1$) or composed of only a few ($\xi < 5$) endmembers. As verified by the experiments, BI-ICE with the sum-to-one constraint correctly detects the support of the sparse signal with a probability close to one, which accounts for the significant decrease of the MSE. The experiment has been conducted for both spectral libraries $\boldsymbol{\Phi}_1$ and $\boldsymbol{\Phi}_2$; the larger MSE improvement is observed for the case of i.i.d. spectral data.
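Assuming that section 5.3.2.3 imposes the sum-to-one constraint in the usual soft way, by appending one extra, heavily weighted equation $\alpha\,\mathbf{1}^T\mathbf{w} = \alpha$ to the linear system, the augmentation can be sketched as below; the helper name and the default value of $\alpha$ are illustrative.

```python
import numpy as np

def augment_sum_to_one(Phi, y, alpha=1e3):
    """Append the weighted row alpha * 1^T (and the target alpha) so that any
    least-squares-type solver softly enforces sum(w) = 1."""
    M, N = Phi.shape
    Phi_aug = np.vstack([Phi, alpha * np.ones((1, N))])
    y_aug = np.concatenate([y, [alpha]])
    return Phi_aug, y_aug

# usage: Phi_aug, y_aug = augment_sum_to_one(Phi, y); the unmixing algorithm is then
# run on (Phi_aug, y_aug) exactly as before.
```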


Figure 5.8: MSE as a function of the level of sparsity obtained by different unmixing methods when applied to simulated data with colored additive noise (SNR = 20 dB) and using two different spectral libraries.


Figure 5.9: MSE as a function of the SNR obtained by different sparse unmixing methods when applied to simulated data with colored additive noise and using different spectral libraries for sparsity level $\xi = 5$.


Figure 5.10: MSE as a function of the level of sparsity obtained by different unmixing methods when applied to simulated data with white additive noise (SNR = 20 dB) and using two different spectral libraries. The sum-to-one constraint is incorporated into the BI-ICE algorithm, as explained in section 5.3.2.3.

5.3.3.2 Simulation results on real data

This section describes the application of the proposed BI-ICE algorithm to real hyperspectral image data. The real data were collected during the airborne visible/infrared imaging spectrometer (AVIRIS) flight over the Cuprite mining site, Nevada, in 1997, [66]. The AVIRIS sensor is a 224-channel imaging spectrometer with approximately 10-nm spectral resolution, covering wavelengths from 0.4 to 2.5 µm; the spatial resolution is 20 m. This data set has been widely used for remote sensing experiments [94, 33, 90, 30]. The spectral bands 1-2, 104-113, 148-167, and 221-224 were removed due to low SNR and water-vapor absorption. Hence, a total of 188 bands were considered in this experiment. The subimage of the 150th band, comprising 200 lines with 200 samples per line (200 × 200 pixels), is shown in Figure 5.11. The VCA algorithm was used to extract 14 endmembers present in the image, as in [94]. Using these spectral signatures, three algorithms are tested for estimating the abundances, namely the LS algorithm, the QP method, and the proposed BI-ICE algorithm. The unmixing process generates an output image for each endmember, depicting the endmember's estimated abundance fraction at each pixel: the darker a pixel, the smaller the contribution of the endmember to that pixel, whereas a light pixel indicates that the proportion of the endmember in the specific pixel is high. The abundance fractions of four endmembers, estimated using the LS, QP, and BI-ICE algorithms, are shown in Figure 5.12a, Figure 5.12b, and Figure 5.12c, respectively. Note that, for the sake of comparison, a necessary linear scaling to the range [0, 1] has been performed on the LS abundance images.


Figure 5.11: Band 150 of a subimage of the Cuprite AVIRIS hyperspectral data set.

By simple inspection, it can be observed that the images obtained using the LS algorithm clearly deviate from the images of the other two methods. The LS algorithm imposes no constraints on the estimated abundances, and hence the scaling has a major impact on the abundance fractions, resulting in performance degradation. On the contrary, the images obtained by QP and BI-ICE share a high degree of similarity and are in full agreement with previous results on the selected abundances reported in [94, 30], as well as with the conclusions drawn in section 5.3.3.1.

5.3.3.3 Simulation results on OMEGA data

In this section we provide some experimental results on the unmixing of OMEGA data using the proposed BI-ICE algorithm. Similar unmixing results were also presented in section 4.2.3.2 of the previous chapter. We consider the Syrtis Major OMEGA image, for which 32 endmember spectra are available, namely: hyperst2.spc Hypersthene PYX02.h, Silicate (Ino); Diopside CPX CRISM, Olivine Fayalite CRISM, Olivine Forsterite CRISM, Silicate (Phyll; Clay Montmorillonite Bentonite), Silicate (Phyll; Clay Illite Smectite), Silicate (Phyll; Serpentine Chrysotile Clinochrysot), Silicate (Phyll; Serpentine Lizardite), Silicate (Phyll; Clay Illite), Silicate (Phyll; Clay Kaolinite), Silicate (Phyll; Nontronite), Sulfate; Gypsum, Sulfate; Jarosite, Sulfate; Kieserite, Epsomite USGS GDS149, Oxide; Goethite, Oxide; Hematite, Oxide; Magnetite, Ferrihydrite USGS GDS75 Sy F6, Maghemite USGS GDS81 Sy (M-3), Carbonate; Calcite, Carbonate; Dolomite, Carbonate; Siderite, Silicate (Phyll; Chlorite), muscovi6.spc Muscovite GDS116 Tanzania, alunite2.spc Alunite GDS83 Na63, Atmo ORB0018 X:45 Y:406, CO2 grain 10 000, H2O grain 100, Flat 0.0001, Flat 1, and Slope Decreasing. As described in the previous chapter, the last three artifact endmembers consist of two neutral spectral components (flat lines at $10^{-4}$ and 1) and a slope line, as in [107, 82].


Figure 5.12: Estimated abundance values of four endmembers using: (a) the LS algorithm, (b) the QP algorithm, (c) the proposed BI-ICE algorithm.

The spectral signatures of the corresponding endmembers are shown in Figure 5.13. A simple inspection of Figure 5.13 suffices to see that the respective endmembers are highly correlated, as is usually the case in most hyperspectral images. This can also be verified by measuring the mutual coherence and the condition number of the endmember matrix, which are $999.5 \times 10^{-3}$ and $20 \times 10^3$, respectively. Utilizing these 32 endmembers and assuming that only a few materials are present in each pixel, the BI-ICE algorithm was used to provide sparse estimates of the abundance coefficients. The resulting abundance maps of the minerals Hypersthene, Diopside, and Fayalite are shown in Figure 5.14. In comparison to the respective abundance maps of section 4.2.3.2 in chapter 4, the minerals of interest are now found to occupy a significantly smaller area of the image. This is a direct effect of sparsity; as expected, the contribution of most endmembers in the majority of pixels is estimated close to zero.
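The two ill-conditioning measures quoted above can be computed directly from the endmember matrix; a minimal sketch, assuming the endmember spectra are stored as the columns of a NumPy array, is given below.

```python
import numpy as np

def coherence_and_condition(Phi):
    """Mutual coherence (largest absolute normalized inner product between distinct
    columns) and condition number of an endmember matrix Phi with columns as spectra."""
    G = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)   # unit-norm columns
    gram = np.abs(G.T @ G)
    np.fill_diagonal(gram, 0.0)
    return gram.max(), np.linalg.cond(Phi)

# mu, kappa = coherence_and_condition(endmember_matrix)
```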


Figure 5.13: Reference endmember spectra of the Syrtis Major OMEGA image.


Figure 5.14: Abundance map produced by BI-ICE of the endmembers (a) Hypersthene, (b) Diopside, and (c) Fayalite for the Syrtis Major hyperspectral scene.

5.3.4 Conclusion

A novel perspective on sparse semi-supervised hyperspectral unmixing has been presented in this section. The unmixing problem has been expressed in the form of a hierarchical Bayesian model, where the problem constraints and the parameters' properties were incorporated by suitably selecting the prior and hyperprior distributions of the model. Then, a new iterative Bayesian inference scheme has been developed for estimating the model parameters.


The proposed algorithm is computationally efficient, converges very fast, and exhibits enhanced estimation performance compared to other related methods. Moreover, it provides sparse solutions without requiring the tuning of any parameters, since all model parameters are naturally estimated by the algorithm. As is also the case for other Bayesian inference methods, a theoretical proof of convergence of the proposed algorithm turns out to be a cumbersome task; such a theoretical analysis is currently under investigation.


Chapter 6

An empirical Bayes algorithm for sparse signal reconstruction

In this chapter, a slightly modified version of the hierarchical Bayesian model introduced in the previous chapter is utilized to reconstruct sparse signals from a set of linear measurements corrupted by Gaussian noise. The proposed model can be considered as the Bayesian counterpart of the adaptive lasso criterion. It utilizes independent Laplace priors, expressed in a hierarchical Bayes form, which favor sparsity. A fast iterative algorithm, based on the type-II maximum likelihood methodology, [7], is properly adjusted to conduct Bayesian inference on the unknown model parameters. The performance of the proposed hierarchical Bayesian approach is illustrated on the reconstruction of both sparse synthetic data and real images. Experimental results show the improved performance of the proposed approach when compared to state-of-the-art Bayesian compressive sensing algorithms.

6.1 Introduction

Compressive sensing (CS) has gained considerable attention in the signal processing community over recent years. CS theory asserts that one can exploit the inherent sparsity of a signal in order to design acquisition systems that need only a few measurements to capture the vital information of the (compressible) signal. CS has a plethora of signal processing applications, since all signals of general interest are potentially sparse if represented in a proper overcomplete basis, as, for example, in the wavelet transform case. In the standard framework of CS, the goal is to recover a sparse signal $\mathbf{w} \in \mathbb{R}^N$ from a small set of $M$ linear, noisy measurements

\[ \mathbf{y} = \boldsymbol{\Phi}\mathbf{w} + \mathbf{n}, \tag{6.1} \]

where $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_N]$ is an $M \times N$ design matrix, $M \ll N$, with $M \times 1$ column vectors $\boldsymbol{\phi}_j$, and $\mathbf{n}$ is the additive noise. The existence of fewer measurements than parameters in the inversion problem at hand results in an infinite number of possible solutions for the signal vector $\mathbf{w}$. However, utilizing the prior knowledge that the signal is sparse, CS theory dictates that, under certain conditions, one may recover the exact solution of $\mathbf{w}$ with high probability, [28]. Mathematically speaking, a sparse solution for $\mathbf{w}$ can be recovered by considering the following $\ell_1$ norm minimization problem

\[ \min_{\mathbf{w}} \left\| \mathbf{w} \right\|_1 \quad \text{subject to} \quad \left\| \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \right\|_2^2 \leq \epsilon, \tag{6.2} \]

which, under certain conditions, retains the sparsity of the solution, [43]. As we have discussed in previous chapters, this is the convex optimization problem termed the lasso, [117], and it can be efficiently solved in polynomial time. Among existing approaches that address the lasso problem, one can find linear programming methods, e.g., basis pursuit, [43], greedy algorithms, e.g., the OMP algorithm, [121], projection-based algorithms, e.g., the LARS algorithm, [44], or, finally, Bayesian methods.

Interesting Bayesian approaches for the CS problem that mainly utilize the sparsity-promoting Laplace distribution have recently emerged. In [49], a hierarchical Bayes interpretation of the Laplacian prior for the sparse vector $\mathbf{w}$ is introduced. It is shown that the maximum a posteriori (MAP) estimate of $\mathbf{w}$ under this prior is equivalent to solving the lasso, and an expectation-maximization (EM) algorithm is used to estimate the unknown model parameters. The properties of the Bayesian lasso are further studied in [100], and in [78] more lasso schemes are sketched out in a Bayesian formulation. In [40] the Laplace prior is directly applied to the parameter $\mathbf{w}$, along with an atom at zero in order to enhance sparsity, and a Gibbs sampler is then used to produce samples from the posterior distribution of $\mathbf{w}$ for magnetic resonance force microscopy (MRFM) image reconstruction. The framework of Bayesian compressive sensing is analytically established in [68], and a hierarchical model based on the concept of the relevance vector machine (RVM), [119], is discussed. Then, a fast, sub-optimal marginal likelihood maximization algorithm, proposed in [120], is adjusted to perform Bayesian inference. In [67] the same methodology is employed under the hypothesis of a distributed environment, as far as the acquisition of the measurements is concerned. In [7] the same efficient algorithm is also employed and adjusted, under a suitable hierarchical formulation of the Laplace prior. In the preceding chapter, we have also formulated a hierarchical Bayesian model using independent Laplace priors, [113, 114], which were suitably truncated to accommodate the inequality constraints on the abundance coefficients.

In this chapter, our motivation is to exploit the potential of the Bayesian hierarchical model described in the previous chapter in order to favor sparse solutions. To this end, we extend the hierarchical Bayesian model of [7] to incorporate an independent Laplace prior for each coefficient of the sparse signal vector $\mathbf{w}$. It is known that assigning different penalization weights to different coefficients of the sparse signal vector $\mathbf{w}$ can lead to a consistent estimator for $\mathbf{w}$, as $M \to \infty$, as opposed to the $\ell_1$ minimization of the original lasso, [117]. This gain is achieved at the cost of introducing many new parameters (the penalizing norm coefficients), whose values are of great importance for the estimation procedure. Owing to the Bayesian framework, all model parameters are given appropriate prior distributions and are then inferred from the data. To perform Bayesian inference, the fast technique proposed in [120] is properly adjusted to our setting. It is based on a type-II maximum likelihood approach, where a fast, iterative, coordinate-type algorithm is employed for the maximization of the marginal log-likelihood function. The proposed algorithm produces confidence intervals for all estimated parameters, which can be used as measures of uncertainty. A useful by-product of the proposed Bayesian approach is also the estimation of the noise variance. We provide experimental results on simulated data, as well as an image restoration example, which illustrate the improved performance of the proposed estimator when compared to other related Bayesian methods.
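As a point of reference for the optimization view of (6.2) and its penalized form, a few lines of iterative soft thresholding (ISTA) suffice to compute a lasso-type solution. This generic sketch is included only for illustration and is not one of the algorithms compared in this chapter; the iteration count is an arbitrary assumption.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(Phi, y, lam, n_iter=500):
    """Minimize 0.5*||y - Phi w||_2^2 + lam*||w||_1 by iterative soft thresholding."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the quadratic term's gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w + Phi.T @ (y - Phi @ w) / L, lam / L)
    return w
```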

6.2 Bayesian modeling

In this section, we explore the relation between $\ell_1$ minimization problems and their Bayesian counterparts that use Laplace priors. Before we proceed, we assume that the additive noise in (6.1) is independent of $\mathbf{w}$, is independent and identically distributed (i.i.d.), and has a zero-mean Gaussian distribution, $\mathbf{n} \sim \mathcal{N}(\mathbf{n} \,|\, \mathbf{0}, \beta^{-1}\mathbf{I}_M)$. This leads to a multivariate Gaussian likelihood for the measurement vector $\mathbf{y}$,

\[ p(\mathbf{y} \,|\, \mathbf{w}, \beta) = \mathcal{N}(\mathbf{y} \,|\, \boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I}_M) = (2\pi)^{-\frac{M}{2}} \beta^{\frac{M}{2}} \exp\left( -\frac{\beta}{2} \left\| \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \right\|_2^2 \right), \tag{6.3} \]

where $\mathbf{I}_M$ is the $M \times M$ identity matrix and $\beta$ is the noise precision parameter.

6.2.1 The Bayesian lasso

The $\ell_1$ minimization problem defined in (6.2) is called the lasso, [117], and, using Lagrangian arguments, it can be rewritten as

\[ \mathbf{w}_{\mathrm{lasso}} = \arg\min_{\mathbf{w}} \left\{ \left\| \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \right\|_2^2 + \mu \sum_{i=1}^{N} |w_i| \right\}, \tag{6.4} \]

where $|\cdot|$ denotes the absolute value of a real number (or the determinant of a matrix) and $\mu$ is a tuning parameter which controls the sparsity of the solution. From a Bayesian viewpoint, the lasso criterion of (6.4) is related to the MAP estimator of $\mathbf{w}$ under a Laplace prior distribution. Let us first define the sparsity-promoting zero-mean Laplace prior for $\mathbf{w}$ as

\[ p(\mathbf{w} \,|\, \lambda) = \prod_{i=1}^{N} \frac{\lambda}{2} \exp\{ -\lambda |w_i| \} = \frac{\lambda^{N}}{2^{N}} \exp\left[ -\lambda \sum_{i=1}^{N} |w_i| \right], \tag{6.5} \]

where $\lambda > 0$. Then, the MAP estimator of $\mathbf{w}$ can be written as

\[ \mathbf{w}_{\mathrm{MAP}} = \arg\max_{\mathbf{w}} p(\mathbf{w} \,|\, \mathbf{y}) = \arg\max_{\mathbf{w}} p(\mathbf{y} \,|\, \mathbf{w}, \beta) p(\mathbf{w} \,|\, \lambda) = \arg\min_{\mathbf{w}} \left\{ -\log\left[ p(\mathbf{y} \,|\, \mathbf{w}, \beta) p(\mathbf{w} \,|\, \lambda) \right] \right\}. \tag{6.6} \]

Utilizing (6.3) and (6.5), equation (6.6) becomes

\[ \mathbf{w}_{\mathrm{MAP}} = \arg\min_{\mathbf{w}} \left\{ \left\| \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \right\|_2^2 + \frac{2\lambda}{\beta} \sum_{i=1}^{N} |w_i| \right\}, \tag{6.7} \]

which is equivalent to the lasso criterion in (6.4), for $\mu = \frac{2\lambda}{\beta}$.

6.2.2 The Bayesian adaptive lasso

As noted in [129], the $\ell_1$ norm penalizes the larger signal components more heavily than the smaller ones and, as a result, can lead to suboptimal solutions. According to [129], different weighting coefficients of the $\ell_1$ penalty can be assigned to different coefficients of the sparse signal vector $\mathbf{w}$, in order to improve estimation accuracy. This gives rise to the adaptive lasso optimization problem (see also chapter 5), which is expressed as

\[ \mathbf{w}_{\mathrm{adlasso}} = \arg\min_{\mathbf{w}} \left\{ \left\| \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \right\|_2^2 + \sum_{i=1}^{N} \mu_i |w_i| \right\}, \tag{6.8} \]

where $\mu_i > 0$, $i = 1, \ldots, N$. It is interesting to note that, in general, the solutions of the lasso (6.4) and the adaptive lasso (6.8) will both be sparse, yet different. Therefore, it is expected that a suitable selection of the penalization parameters can result in better estimation performance. In this section, we develop a hierarchical Bayesian model which is the analogue of the adaptive lasso, by utilizing $N$ independent Laplace priors. This can be considered as a modified version of the hierarchical Bayesian model presented in Chapter 5 and in [113, 114], where nonnegative priors were employed. As in [68, 7], the prior over the signal parameter $\mathbf{w}$ is assumed to be a multivariate zero-mean Gaussian distribution, i.e.,

\[ p(\mathbf{w} \,|\, \boldsymbol{\gamma}) = \prod_{i=1}^{N} \mathcal{N}(w_i \,|\, 0, \gamma_i) = \prod_{i=1}^{N} (2\pi)^{-\frac{1}{2}} \gamma_i^{-\frac{1}{2}} \exp\left( -\frac{1}{2}\frac{w_i^2}{\gamma_i} \right) = (2\pi)^{-\frac{N}{2}} |\boldsymbol{\Lambda}|^{\frac{1}{2}} \exp\left( -\frac{1}{2}\mathbf{w}^T\boldsymbol{\Lambda}\mathbf{w} \right) = \mathcal{N}(\mathbf{w} \,|\, \mathbf{0}, \boldsymbol{\Lambda}^{-1}), \tag{6.9} \]

where $\boldsymbol{\gamma} = [\gamma_1, \gamma_2, \ldots, \gamma_N]^T$, $\boldsymbol{\Lambda}^{-1} = \mathrm{diag}(\boldsymbol{\gamma})$, and $\gamma_i \geq 0$, for $i = 1, 2, \ldots, N$. Moreover, the noise precision parameter is assumed to follow a conjugate Gamma distribution, defined as

\[ p(\beta; \kappa, \theta) = \Gamma(\beta \,|\, \kappa, \theta) = \frac{\theta^{\kappa}}{\Gamma(\kappa)} \beta^{\kappa - 1} \exp\left[ -\theta\beta \right], \tag{6.10} \]

where $\beta \geq 0$. An effective way to construct a sparsity-promoting Laplace distribution for the parameter $\mathbf{w}$ is to introduce an exponential prior over the variance hyperparameter vector $\boldsymbol{\gamma}$, [49]. In this work, $N$ independent exponential distributions are utilized, one for each of the hyperparameters $\gamma_i$, as in [114], i.e.,

\[ p(\boldsymbol{\gamma} \,|\, \boldsymbol{\lambda}) = \prod_{i=1}^{N} p(\gamma_i \,|\, \lambda_i) = \prod_{i=1}^{N} \frac{\lambda_i}{2} \exp\left( -\frac{\lambda_i}{2}\gamma_i \right) = \frac{1}{2^{N}} |\boldsymbol{\Psi}| \exp\left[ -\frac{1}{2}\sum_{i=1}^{N} \lambda_i \gamma_i \right], \tag{6.11} \]

where $\boldsymbol{\lambda} = [\lambda_1, \lambda_2, \ldots, \lambda_N]^T$, $\lambda_i > 0$, for $i = 1, 2, \ldots, N$, and $\boldsymbol{\Psi} = \mathrm{diag}(\boldsymbol{\lambda})$. The proposed hierarchical model is completed by assigning a conjugate Gamma prior over the sparsity-controlling hyperparameter vector $\boldsymbol{\lambda}$. This prior is expressed as

\[ p(\boldsymbol{\lambda}; \rho, \delta) = \prod_{i=1}^{N} \frac{\delta^{\rho}}{\Gamma(\rho)} \lambda_i^{\rho - 1} \exp\left[ -\delta\lambda_i \right] = \left( \frac{\delta^{\rho}}{\Gamma(\rho)} \right)^{N} |\boldsymbol{\Psi}|^{\rho - 1} \exp\left[ -\delta\sum_{i=1}^{N}\lambda_i \right], \tag{6.12} \]

where $\rho$ and $\delta$ are two hyperparameters not treated as random variables, with $\rho \geq 0$ and $\delta \geq 0$.

Let us now highlight the relation of the proposed hierarchical Bayesian model defined by (6.9), (6.10), (6.11), and (6.12) with the adaptive lasso. Marginalizing out the parameter $\boldsymbol{\gamma}$, a multivariate Laplace distribution arises as a prior for $\mathbf{w}$, i.e.,

\[ p(\mathbf{w} \,|\, \boldsymbol{\lambda}) = \int p(\mathbf{w} \,|\, \boldsymbol{\gamma}) p(\boldsymbol{\gamma} \,|\, \boldsymbol{\lambda}) \, d\boldsymbol{\gamma} = \prod_{i=1}^{N} \int p(w_i \,|\, \gamma_i) p(\gamma_i \,|\, \lambda_i) \, d\gamma_i = 2^{-N} |\boldsymbol{\Psi}|^{\frac{1}{2}} \exp\left[ -\sum_{i=1}^{N} \sqrt{\lambda_i}\,|w_i| \right]. \tag{6.13} \]

It is then straightforward to verify that the MAP estimator of $\mathbf{w}$ under the multivariate Laplace prior of (6.13) is expressed as

\[ \mathbf{w}_{\mathrm{MAP}} = \arg\min_{\mathbf{w}} \left\{ -\log\left[ p(\mathbf{y} \,|\, \mathbf{w}, \beta) p(\mathbf{w} \,|\, \boldsymbol{\lambda}) \right] \right\} = \arg\min_{\mathbf{w}} \left\{ \left\| \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \right\|_2^2 + \frac{2}{\beta}\sum_{i=1}^{N} \sqrt{\lambda_i}\,|w_i| \right\}, \tag{6.14} \]

6.3 Bayesian inference As it is common in practice, the Bayesian inference is based on the posterior distribution of the model parameters w, β, γ, and λ, which can be expressed as p(w, γ, λ, β|y) =

p(w, γ, λ, β, y) . p(y)

(6.15)

However, it is rather difficult to compute this posterior distribution analytically, since p(y) cannot be expressed in a tractable form. One way to perform Bayesian inference is to resort Konstantinos E. Themelis

144

Bayesian signal processing techniques for hyperspectral image unmixing

to a type-II maximum likelihood approach, as the one described in [7]. In the following, we recast the inference method used in [7], and modify it properly, in order to account for the hierarchical model described in the previous section.

6.3.1 Type-II maximum likelihood approach In a type-II maximum likelihood approach the objective is to maximize the marginal likelihood obtained by integrating out w, with respect to the unknown model parameters. The posterior distribution of (6.15) can be factored as p(w, γ, λ, β|y) = p(w|γ, λ, β, y)p(γ, λ, β|y).

(6.16)

Then, the distribution p(w|γ, λ, β, y) is easily shown to be the following multivariate Gaussian, p(w|y, γ, λ, β) = N (w|µ, Σ) −N 2

= (2π)

− 21

|Σ|

 (w − µ)T Σ−1 (w − µ) exp − , 2 

(6.17)

with parameters  −1 Σ = βΦT Φ + Λ

and µ = βΣΦT y.

(6.18)

Following the type-II maximum likelihood methodology, the model parameters β, γ, and λ are individually selected so that to maximize their joint posterior distribution p(γ, λ, β|y). From the Bayes’ law, p(γ, λ, β|y) =

p(γ, λ, β, y) ∝ p(γ, λ, β, y). p(y)

(6.19)

It is therefore sufficient to maximize p(γ, λ, β, y), which is obtained by integrating out w from the joint p(y, w, γ, λ, β) , i.e., Z p(γ, λ, β, y) =

p(w, γ, λ, β, y)dw Z

=

p(y|w, β)p(w|γ, β)p(γ|λ)p(λ)p(β)dw M

1

= (2π)− 2 |β −1 IM + ΦΛ−1 ΦT |− 2    1 T −1 −1 T −1 × exp − y β IM + ΦΛ Φ y p(γ|λ)p(λ)p(β) 2

Konstantinos E. Themelis

145

Bayesian signal processing techniques for hyperspectral image unmixing

−M 2

= (2π)

− 12

|C|

 1 T −1 exp − y C y p(γ|λ)p(λ)p(β), 2 

(6.20)

where C = β −1 IM + ΦΛ−1 ΦT . Equivalently, we can maximize the logarithm of p(γ, λ, β, y) , denoted as L, with respect to the hyperparameters λ, γ, and β. The logarithm L can be expressed as L = log (p(γ, λ, β, y)) 1 δρ θκ M = − log(2π) + N log + N log + log + (κ − 1)logβ 2 2 Γ(ρ) Γ(κ) N N X 1 1 T −1 1X − log|C| − y C y + ρlog|Ψ| − λi γi − δ λi − θβ 2 2 2 i=1 i=1

(6.21)

Let us now state some equivalences that will be useful in solving this maximization problem. First, it is |C| = |β −1 IM + ΦΛ−1 ΦT | = |Λ + βΦT Φ| |Λ−1 | |β −1 IM |,

(6.22)

and thus, log|C| = −log|Σ| − M logβ − log|Λ|.

(6.23)

Furthermore, using the Woodbury identity, [54], we get C−1 = (β −1 IM + ΦΛ−1 ΦT )−1 = βIM − βΦ(Λ + βΦT Φ)−1 ΦT β = βIM − β 2 ΦΣΦT .

(6.24)

Hence, we can write yT C−1 y = βyT y − yT βΦΣΦT βy = βyT y − yT βΦµ = βyT (y − Φµ) = β ky − Φµk2 + βµT ΦT (y − Φµ) = β ky − Φµk2 + µT Λµ.

Konstantinos E. Themelis

146

(6.25)

Bayesian signal processing techniques for hyperspectral image unmixing

Using the identities in (6.22), (6.23), (6.24), (6.25), L is expressed as M 1 M 1 1 1 1 log (2π) + log|Σ| + logβ + log|Λ| − β ky − Φµk22 − µT Λµ + N log 2 2 2 2 2 2 2 N N ρ κ X X 1 δ θ −δ + (κ − 1)logβ − θβ (6.26) + ρlog|Ψ| − λi γi + N log λi + log 2 i=1 Γ(ρ) Γ(κ) i=1

L=−

Taking the partial derivatives of L, with respect to the parameters γ, λ, and β, and equating them to zero, the following expressions are obtained s

∂L 1 1 hwi2 i = 0 ⇒ γi = − + + ∂γi 2λi 4λ2i λi ρ ∂L = 0 ⇒ λi = 1 ∂λi γ +δ 2 i M + 2κ − 2 ∂L =0⇒β= ∂β ky − Φµk22 + 2θ ∂L = 0 ⇒ log|Ψ| + N logδ − N ψ(ρ) = 0 ∂ρ ρN ∂L = 0 ⇒ δ = PN , ∂δ λ i i=1

(6.27) (6.28) (6.29) (6.30) (6.31)

where ψ(·) is the digamma function1 , and hwi2 i = µ2i + Σii , with µi being the ith diagonal element of µ and Σii being the ith diagonal element of Σ. In (6.30) - (6.31), the partial derivatives with respect to ρ and δ are also reported, but will not be utilized until a later stage. Equations (6.18), (6.27) - (6.29) form an iterative updating scheme, which can be also obtained using the EM algorithm. However, this scheme is problematic in practice, since, (6.18) requires the inversion of a N ×N matrix, which is both computationally demanding and susceptible to numerical errors. By properly modifying the sub-optimum scheme proposed in [46, 120], a numerically robust, fast Bayesian Adaptive Lasso (Fast-BALa) algorithm is proposed in the following section to maximize the logarithm L of the marginal likelihood.

6.3.2 Fast suboptimal solution In this section we develop a fast sub-optimal technique to maximize the marginal loglikelihood L, that follows the same derivation procedure described in [7]. In the following analysis, it is interesting to notice that the importance of the parameter γ lies in the fact that setting γi = 0 is equivalent to pruning the ith variable out of the model, i.e., wi = 0 (as follows from eq. (6.9)). 1 In mathematics, the digamma function is defined as the logarithmic derivative of the gamma function, d ψ(x) = dx lnΓ(x)

Konstantinos E. Themelis

147

Bayesian signal processing techniques for hyperspectral image unmixing

Exploiting the diagonal form of Λ, matrix C in (6.20) is written in a convenient form to analyze the dependence on a single parameter, C = β −1 IM +

X

γj φj φTj + γi φi φTi = C¬i + γi φi φTi .

(6.32)

j6=i

Using the matrix inversion and determinant lemmas, (6.32) allows us to write C−1 = C−1 ¬i −

T −1 C−1 ¬i φi φi C¬i 1 + φTi C−1 ¬i φi γi

−1 |C| = |1 + γi φTi C−1 ¬i φi | |C¬i |.

and

(6.33)

Thus, the summation of the terms of L that depend on γ, denoted as L(γ), becomes " #   X 1 qi2 γi 1 1 −1 −1 T log + − λi γi L(γ) = − log|C¬i | + y |C¬i |y + λj γj + 2 2 1 + γ 1 + γ i si i si j6=i = L(γ¬i ) + l(γi ),

(6.34)

T −1 where si and qi are defined as si = φTi C−1 ¬i φi and qi = φi C¬i y. It is interesting to see that the term L(γ¬i ) does not depend on the single parameter γi , thus the partial derivative of L(γ) with respect to γi , can be computed as

  ∂L(γ) ∂l(γ) 1 qi2 si = = + − − λi ∂γi ∂γi 2 1 + γi si (1 + γi si )2   1 γi2 (λi s2i ) + γi (s2i + 2λi si ) + (si − qi2 + λi ) =− . 2 (1 + γi si )2

(6.35)

Computing the roots of the numerator polynomial in (6.35) and following the approach described in [7], γi is estimated as

γi =

  

−si (si +2λi )+si

q (si +2λi )2 −4λi (si −qi2 +λi ) 2λi s2i

0

,

if qi2 − si > λi otherwise.

(6.36)

The updating of the parameter γi using (6.36) is followed by the updating of µ, Σ, λ, si , and qi . Based on the analysis given in [120], the updating of si and qi , which has to be performed for all N model variables, is made as follows, Si = βφTi φi − φTi βΦΣΦT βφi Si Qi si = qi = 1 − γi Si 1 − γi Si

Qi = βφTi y − φTi βΦΣΦT βy,

(6.37) (6.38)

The resulting fast Bayesian Adaptive Lasso algorithm is summarized in Table 6.1. Some Konstantinos E. Themelis

148

Bayesian signal processing techniques for hyperspectral image unmixing

Table 6.1: The Fast-BALa algorithm Input Φ, y, κ, θ, ρ, δ. Set β = 0.01 kyk22 . Initialize γ (0) = 0, λ(0) = 1. for n = 1, 2, . . . do (n) Compute all possible γi ’s using (6.36). (n) (n−1) Choose the γi with the maximum d(n) = l(γi ) − l(γi ) if d(n) is sufficiently small, break, endif (n) (n) (n) (n−1) if (qi )2 − si > λi and γi =0 Add γi to the model. (n) (n) (n) (n−1) else if (qi )2 − si > λi and γi >0 (n) Keep γi as the updated value of γi . (n) (n) (n) else if (qi )2 − si < λi Prune γi from the model. endif Update µ and Σ using (6.18). Update si and qi using (6.38). Update λ using (6.28). Update ρ and δ using (6.30) and (6.31). endfor useful remarks are: 1. In its simplest form, i.e., when all λi ’s are identical and are updated using a single equation, the Fast-BALa is similar to the fast Laplace (FL) algorithm presented in [7], which solves the conventional lasso problem (6.4). Updating each λi independently using (6.28) improves estimation accuracy. 2. The parameters ρ and δ are also updated using (6.30) and (6.31), so that a change in the selected variables of the model can affect the values of all λi ’s, according to (6.28). 3. The updating of matrix Σ takes into account only the selected variables at the current iteration, which, in practice, are significantly fewer than N . 4. The parameter β is not included in the iterative scheme; rather, it is set equal to β = 0.01 kyk22 , as in [7]. Due to the incremental nature of the algorithm, updating β using (6.29) can produce unreliable values, which may cause problems to the convergence of the algorithm. 5. A powerful feature of the algorithm is its ability to remove basis vectors from the model that have been selected at early stages, which is not the case for greedy algorithms, e.g., the OMP algorithm. Konstantinos E. Themelis

149

Bayesian signal processing techniques for hyperspectral image unmixing

1.5 BCS FL Fast−BALa

1

0.5

0 40

60

80

100

120

Reconstruction Error

Reconstruction Error

1.5

0.5

60

80

100

M

M

(a)

(b)

120

140

1 BCS FL Fast−BALa

0.8 0.6 0.4 0.2

60

80

100

120

Reconstruction Error

Reconstruction Error

1

0 40

140

1

0 40

BCS FL Fast−BALa

0.8 0.6 0.4 0.2 0 40

140

BCS FL Fast−BALa

60

80

100

M

M

(c)

(d)

120

140

Figure 6.1: Reconstruction error versus the number of measurements M in the noise-free case. (a) Uniform spikes ±1, (b) zero mean unit variance Gaussian, (c) unit variance Laplace, (d) Student’s t-distribution with 3 degrees of freedom.

6.4 Simulation results This section illustrates the accuracy of the proposed algorithm on the sparse reconstruction of 1D signals and images. Simulations are conducted following the experimental settings of [68, 67, 7]. We compare the Fast-BALa algorithm proposed in section 6.3 with the Bayesian compressive sensing (BCS) algorithm of [68], and FL of [7]. We used the Matlab implementations of these methods, found at the corresponding websites. In all experiments, we ˆ est k2 / kwk2 , where w ˆ est is the estimated coeffievaluate the reconstruction error as kw − w cients vector. All experiments are performed on a 2.4Ghz Intel Core 2 machine.

Konstantinos E. Themelis

150

Bayesian signal processing techniques for hyperspectral image unmixing

1.5 BCS FL Fast−BALa

1

0.5

0 40

60

80

100

120

Reconstruction Error

Reconstruction Error

1.5

0.5

60

80

100

M

M

(a)

(b)

120

140

1 BCS FL Fast−BALa

0.8 0.6 0.4 0.2

60

80

100

120

Reconstruction Error

Reconstruction Error

1

0 40

140

1

0 40

BCS FL Fast−BaLa

0.8 0.6 0.4 0.2 0 40

140

BCS FL Fast−BALa

60

80

100

M

M

(c)

(d)

120

140

Figure 6.2: Reconstruction error versus the number of measurements M with noisy observations. (a) Uniform spikes ±1, (b) zero mean unit variance Gaussian, (c) unit variance Laplace, (d) Student’s t-distribution with 3 degrees of freedom.

6.4.1 1D signals The first example considers a signal of length N = 512 that contains T = 20 non-zero coefficients at random locations. As in [7], the design matrix Φ is constructed by first creating a M × N matrix with draws from a uniform distribution, and then we normalize the columns of Φ to unit magnitude. The non-zero coefficients are generated by employing: (a) ±1 uniform spikes, (b) a Gaussian distribution, N (0, 1), (c) a unit variance Laplace distribution, and (d) a Student’s t-distribution with 3 degrees of freedom. The number of measurements M varies between 40 and 120 with a step size of 5, and the average reconstruction error for 100 signal realizations is shown in Figure 6.1. To simulate measurement noise, zero-mean Gaussian noise with standard deviation σ = 0.03 is added to the model, and the results are

Konstantinos E. Themelis

151

Bayesian signal processing techniques for hyperspectral image unmixing

(a)

(b)

(c)

(d)

Figure 6.3: Mondrian image reconstruction using the algorithms: (a) linear sparse reconstruction (error = 0.13325), (b) BCS (error = 0.14979, time = 11.7836s), (c) FL (error = 0.15094, time = 15.6063s), (d) Fast-BALa (error = 0.14402, time = 26.5266s). then reported in Figure 6.2. As expected, the reconstruction error reduces as the number of measurements increases and it appears to converge to the same value for all three methods. It can be seen from Figs. 6.1 and 6.2 that the proposed Fast-BALa agorithm has the best performance, especially for lower values of M . This is more evident in Figs. 6.1a, 6.1b, and 6.2a, 6.2b. It is interesting to note that the only difference among the three methods lies on the updating of the sparsity controlling parameter λ; in BCS λ = 0, in FL all λi ’s are equal and are updated using a single equation, while in Fast-BALa λi ’s are independent and are updated using (6.28). The average computation time per sample for the three methods is 0.0271s for BCS, 0.0872s for FL, and 0.1713s for Fast-BALa.

Konstantinos E. Themelis

152

Bayesian signal processing techniques for hyperspectral image unmixing

(a)

(b)

(c)

(d)

Figure 6.4: Error maps of the reconstructed Mondrian images using the algorithms: (a) linear sparse reconstruction, (b) BCS, (c) FL, (d) Fast-BALa.

6.4.2 2D images In this section, the proposed Fast-BALa algorithm is used for the sparse representation of a 512 × 512 Mondrian image. Using the experimental setup in [68, 7], a multiscale CS scheme [122] is applied on the wavelet transform of the image with the “symmlet 8” wavelet with the coarsest scale 4 and finest scale 6. The number of wavelet samples is N = 4096, the number of measurements is M = 2713, and the measurement matrices are drawn from a uniform spherical distribution. Figure 6.3a displays the linear reconstruction of the image, using all the M = 2713 measurements, which is the best possible reconstruction. The corresponding reconstructions of the BCS, FL, and Fast-BALa are displayed in Figs. 6.3b, 6.3c, and 6.3d, respectively, while Figure 6.4 shows the reconstruction error for each case. The number of the nonzero coefficients for BCS, FL, and Fast-BALa algorithms are 751, 802, and 911 respectively. It is demonstrated that the Fast-BALa algorithm provides the smallest

Konstantinos E. Themelis

153

Bayesian signal processing techniques for hyperspectral image unmixing

reconstruction error among the three methods, although it is slightly more computationally demanding.

6.5 Conclusion In this chapter we have provided a Bayesian perspective over the adaptive lasso criterion. The non-uniformly weighted `1 penalty of the adaptive lasso was attained by employing N independent Laplace priors. A fast algorithm was then properly modified to perform Bayesian inference. Experimental results showed that the proposed method achieves better performance than other state-of-the-art Bayesian CS algorithms, at the cost of small additional complexity.

Konstantinos E. Themelis

154

Chapter 7 Summary of the thesis In the preceding chapters we have presented a general framework for semi-supervised Bayesian learning, comprised of soft-constraint Bayesian estimation methods, sparsity-promoting hierarchical Bayesian models and novel methods for Bayesian inference. As an epilogue, we review the main contributions of this thesis and discuss some intriguing ideas for future research.

7.1 Summary The presented work has addressed the problem of semi-supervised Bayesian learning for hyperspectral unmixing. Under the assumption of the linear mixing model, spectral unmixing has been formulated as a standard linear regression problem, where the parameters of interest are the abundance fractions of the endmembers spectra in each pixel of a hyperspectral scene. Naturally, this presupposes that the spectral signatures of the endmembers are available a priori. In this setup, a handful of Bayesian analysis tools have been presented in this thesis, which take into account either the physical constraints or the sparsity of the fractional abundances. In chapter 4, a Bayesian MAP estimator has been suitably adjusted to address the abundance estimation problem, taking into account the constraints of nonnegativity and full additivity. Specifically, the abundances are assumed to have a Gaussian prior distribution, so that the MAP estimator can be computed in closed form. To compute the parameters of the posterior distribution, i.e., the mean and the covariance matrix, that enter in the computation of the MAP estimator, a criterion is selected for the mean squared error of the proposed estimator; to be lower that the variance of the least squares estimator. The symmetry of the constraints is utilized in order to obtain closed updating forms for the parameters. Although the MAP estimator can then provide a reliable solution, the statistical nature of our approach may well result in estimation points that do not satisfy the constraints. For

Konstantinos E. Themelis

155

Bayesian signal processing techniques for hyperspectral image unmixing

this reason, the last stage of the proposed MAP estimator is a simple projection of the MAP estimation point on the convex polytope formed by the constraints in the signal space. Experimental results for this method showed that it provides reliable results, and hence, it can be utilized to unmix real hyperspectral data. An interesting case study has also been presented in chapter 4, where the unmixing of real OMEGA hyperspectral data from ESA’s Mars Express mission is considered. Three methods are employed and compared in the unmixing process, the soft-constrained MAP estimator, a quadratic programming algorithm and the least squares method. Verifying theoretical expectations, the results of the MAP estimator were coherent with those of the quadratic programming method, while the least squares method produced abundance maps that were substantially degraded from those obtained by the other two methods. This result clearly shows the significance of the imposition of the constraints to the estimation problem. It is interesting to note that, among the two methods that produced the best unmixing results, the soft-constraint MAP estimator has lower computation complexity. This thesis also considered the issue of the sparsity of the abundance parameters. To the best of our knowledge, the approaches described in chapter 5 were among the first to introduce sparsity in the context of spectral unmixing. Sparsity hypothesis is supported by the fact that, in practice, only a few of the available endmembers are present in a single pixel. In order to induce sparsity, our research efforts have focused around the lasso and its variants. First, we have successfully applied the adaptively weighted lasso to the unmixing problem. The optimization function of the adaptive lasso was appropriately adjusted in order to account for the convex constraints of the abundances. A variant of the popular LARS algorithm, which permits the retrieval of nonnegative solutions, was selected for the optimization problem. The additivity constraint was included in the optimization problem by incorporating an extra equation into the system of linear equations. Experimental results, which were conducted on both simulated and real hyperspectral data, verified that the sparsity hypothesis can be utilized in practice to improve the quality of the unmixing results. This conclusion was the motive to further work on sparse methods for spectral unmixing. A fully Bayesian treatment for the unmixing problem was further developed, wherein a suitably selected hierarchical Bayesian model and an efficient method to perform Bayesian inference are proposed. The newly introduced hierarchical Bayesian model utilizes truncated Laplace priors so as to induce sparsity and also account for the nonnegativity of the abundances. The double exponential representation of the Laplace prior is exploited in order to retain the conjugacy of the priors. Instead of relying on standard methods to perform Bayesian inference, like for example variational methods, a deterministic approximation of the Gibbs sampler, termed as Bayesian inference iterative conditional expectations (BI-ICE) algorithm was proposed. As the name suggests, the means of the respective posterior con-

Konstantinos E. Themelis

156

Bayesian signal processing techniques for hyperspectral image unmixing

ditional distributions are used, instead of the random samples used in the Gibbs sampler. This procedure is shown to have a definite relation to variational approximation methods, if the factorization of the approximate posterior distribution in the latter method is similar to the parameters’ separation used in the posterior conditional distributions. More specifically, the BI-ICE algorithm can be considered as a first moments approximation to a variational Bayesian inference scheme. Experimental results for this method revealed that the BI-ICE algorithm retrieves sparse solutions, which enhance the unmixing performance. Moreover, it performs just as good in the case where the mixing matrix of the linear model is badly conditioned, which is a realistic scenario, since, in practice, most material spectral signatures are highly correlated. Notice also that the proposed method infers all parameter values directly from the data, as opposed to other sparsity inducing methods, that require parameter tuning. Departing from the thematic area of spectral unmixing, we attempt to demonstrate the potential of the hierarchical Bayesian model described in chapter 5 by considering a sparse image reconstruction problem. The main difference of this setup is that we are now interested in finding a sparse solution in an underdetermined system of linear equations. To this end, we exploit the same hierarchical structure of the Bayesian model of chapter 5, however excluding any truncation operations on the prior distributions. To perform Bayesian inference, we adopt a type-II maximum likelihood algorithm, which is fast, although suboptimal, and modify it properly so as to account for the proposed hierarchical model. It is worth mentioning that the original algorithm was designed for a similar hierarchical model, where a single Laplace prior instead of independent priors for each parameter coefficient is used to induce sparsity. Experimental results presented in chapter 6 show an improvement on the mean squared error of the proposed estimator, although not substantial.

7.2 Future directions Sparse signal processing has been an area of thriving research during the last years. Applications of sparse signal processing algorithms have spread throughout various scientific research disciplines and are constantly increasing in number. The term compressive sensing has been used to describe the framework of techniques for finding sparse solutions to underdetermined linear systems. Admittedly, we were greatly influenced and intrigued by the advances of compressive sensing during the development of our Bayesian framework. This has opened up several avenues for future work, which are designated towards both directions of unsupervised spectral unmixing and Bayesian compressive sensing, as it can be seen in the sections to follow.

Konstantinos E. Themelis

157

Bayesian signal processing techniques for hyperspectral image unmixing

7.2.1 Variational approximation for underdetermined systems In this thesis we have presented two inference methods for the problem of sparse Bayesian learning: (a) an iterative conditional expectations algorithm (BI-ICE), which serves as a deterministic approximation to the Gibbs sampler, and (b) a type-II maximum likelihood algorithm, which is incremental in the number of basis functions that compose the signal of interest. Based on the model they adopt, the former considers overdetermined systems of linear equations, while the latter is tailored for underdetermined systems. A possible research direction would be to extend BI-ICE to also account for underdetermined linear systems. To this end, an important workaround is necessary; the covariance matrix Σ of the posterior conditional expectation of the parameter w (note that w ∼ N (µ, Σ)) is singular and its inversion has undesirable effects. A possible remedy for the inversion of the high dimensional and rank deficient matrix Σ is to perform variational inference using an incremental type algorithm. Two approaches might be considered: (a) either each of the coefficients of the parameter vector w can be considered to have an independent variational distribution (in this case Σ is a diagonal matrix), or, (b) the contribution of each of the coefficients to the model can be individually expressed, so as to be able to add or remove coefficients according to a selected criterion. The development of an incremental variational approximation algorithm will further decrease the computational complexity of BI-ICE, making it suitable for compressive sensing applications.

7.2.2 Nonlinear spectral unmixing Another interesting research direction is the extension of the models and methods developed in this thesis to account for a nonlinear mixing model. Nonlinear spectral unmixing is a recent research focus, which provides a framework to overcome some of the inherent limitations of the linear mixing model. As noted in chapter 2, nonlinear models are better suited to describe and capture the multiple microscopic scattering effects of solar radiation (such phenomena are typical in hyperspectral scenes of vegetation, swallow water environments, or surfaces of sand grains of different compositions). Formulating spectral unmixing as a nonlinear source separation problem opens up new opportunities for subsequent study and application of novel signal processing techniques. Preliminary results in this direction, e.g. [77], have already shown the potential of nonlinear unmixing. The framework of Bayesian analysis can be utilized for nonlinear unmixing (see e.g. [59]). In this context, sparsity can also be imposed and exploited, leveraging the hierarchical model and techniques presented in this thesis.

Konstantinos E. Themelis

158

Bayesian signal processing techniques for hyperspectral image unmixing

7.2.3 Sparse unmixing using non-convex priors In chapter 5 we have exploited the `1 norm to induce sparsity in the abundance fractions. Ideally, one would like to utilize the `0 norm, which penalizes the number of the nonzero coefficients rather than their absolute values. However, as also reported in chapter 5, the optimization problem with respect to the `0 norm is shown to be NP complete. Therefore, the `1 norm has been proposed as a convex surrogate, in order to be able to solve the optimization problem in polynomial time. Following the latest advances of compressive sensing, the `q norm, with 0 < q < 1, can also be used to impose sparsity in the abundance vector (the `q norm, for q = 0.5 is displayed in Figure 3.2). In comparison to the `1 norm, least squares penalization using the `q norm, with 0 < q < 1, can produce sparser solutions. As a trade-off, the resulting optimization function is non-convex, and hence, the computational complexity of the solvers is highly increased. In a Bayesian formulation, the regularization approach of the `q norm can be equivalently resolved by utilizing appropriate non-convex priors. The full Bayesian treatment of spectral unmixing using non-convex priors is of great interest and has not been thoroughly explored in the literature.

7.2.4 Unsupervised spectral unmixing using spatial correlations

As we have already discussed in chapter 2, a recent trend in hyperspectral unmixing is the exploitation of the spatial information of the hyperspectral scene. Traditional methods for spectral unmixing rely solely on the spectral information of the pixels in order to determine the spectral signatures of the endmembers that comprise the scene. This means that if we apply a random perturbation to the spatial locations of the pixels before the unmixing process, the unmixing results will remain unaffected. However, the spatial information of the scene can also be of use, in the sense that adjacent pixels are likely to share a common set of constituent endmembers. Sophisticated Bayesian models that utilize Markov random fields to incorporate this prior knowledge of the spatial structure of the image have already been proposed in the literature.
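
As a rough sketch of how such spatial coupling could enter the estimation (here in a deterministic, least squares form rather than through a full Bayesian Markov random field prior), the example below penalizes differences between the abundance vectors of 4-connected neighbouring pixels and solves the coupled problem with a few Gauss-Seidel sweeps. The abundance constraints are ignored, the data are synthetic, and all names and parameter values are illustrative.

# Pixel-wise least squares abundances, coupled by a quadratic spatial penalty
# mu * sum over neighbouring pixels of ||a_n - a_m||^2.
import numpy as np

def spatially_regularized_unmixing(Y, M, mu=1.0, sweeps=20):
    """Y: (H, W, bands) image cube, M: (bands, p) endmembers -> (H, W, p) abundances."""
    H, W, _ = Y.shape
    p = M.shape[1]
    MtM = M.T @ M
    MtY = np.einsum('bp,hwb->hwp', M, Y)                        # M^T y for every pixel
    A = np.linalg.solve(MtM, MtY.reshape(-1, p).T).T.reshape(H, W, p)   # pixel-wise LS init
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                nbrs = [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= i + di < H and 0 <= j + dj < W]
                rhs = MtY[i, j] + mu * sum(A[k, l] for k, l in nbrs)
                A[i, j] = np.linalg.solve(MtM + mu * len(nbrs) * np.eye(p), rhs)
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    M = rng.uniform(size=(50, 4))                               # synthetic endmembers
    A_true = rng.dirichlet(np.ones(4), size=(16, 16))           # synthetic abundance maps
    Y = np.einsum('bp,hwp->hwb', M, A_true) + 0.01 * rng.standard_normal((16, 16, 50))
    A_hat = spatially_regularized_unmixing(Y, M, mu=0.5)
    print("mean absolute abundance error:", np.abs(A_hat - A_true).mean())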

7.2.5 Adaptive Bayesian schemes

Finally, an interesting idea is the development of an efficient variational inference scheme that is able to process data which become available in a sequential manner. So far we have assumed that the data are available a priori and that their processing is a batch procedure executed offline. However, there are many application scenarios where the availability of the data is time dependent, and it is crucial for the parameter estimation to be executed online. A representative example from the field of wireless communications is the tracking of
a frequency-selective channel that varies with time. Adaptive signal processing is, in fact, a very popular discipline in the signal processing literature. Particle filtering, for instance, is a sophisticated parameter estimation method based on sequential Monte Carlo sampling, which performs Bayesian inference at each time instant, i.e., given all observations up to that time. Similarly to all other sampling methods, particle methods generate a set of samples from the posterior distribution of the parameters of interest, and posterior expectations are then approximated using the generated samples. The accuracy of the estimation procedure thus depends on the sample size and on whether these samples are representative of the posterior distribution. An alternative approach that may be worth investigating would be to approximate the posterior distribution of the model parameters using variational methods rather than sampling. The Bayesian inference iterative conditional expectations algorithm described in chapter 5 has been shown to be an efficient procedure that is based on closed-form expressions and converges within a few iterations. Its convergence appears to be a consequence of the convexity of the posterior conditional distributions. These properties are well suited to an adaptive setup and may provide sufficient motivation for the development of a variational approach specifically tailored to the framework of adaptive signal processing.
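
As a minimal illustration of sequential processing, the sketch below performs recursive Bayesian updating for a linear model with a plain Gaussian prior and known noise precision: each new observation updates the posterior in closed form, so the estimate is refined online. An adaptive variational scheme of the kind suggested above would follow the same recursive pattern, but with the sparsity-promoting hierarchical priors of chapter 5 in place of the Gaussian prior; the class name and parameter values here are assumptions of the example.

# Online (recursive) Bayesian linear regression with a Gaussian prior.
import numpy as np

class OnlineBayesianLinearRegression:
    def __init__(self, dim, alpha=1.0, beta=100.0):
        self.P = alpha * np.eye(dim)     # posterior precision, initialized at the prior
        self.h = np.zeros(dim)           # precision-weighted mean, P @ m
        self.beta = beta                 # known noise precision

    def update(self, x, y):
        """Absorb one observation y = w^T x + noise and return the posterior mean."""
        self.P += self.beta * np.outer(x, x)
        self.h += self.beta * y * x
        return np.linalg.solve(self.P, self.h)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    w_true = np.array([0.5, -1.0, 2.0])
    model = OnlineBayesianLinearRegression(dim=3)
    for t in range(500):
        x = rng.standard_normal(3)
        y = w_true @ x + 0.1 * rng.standard_normal()
        m = model.update(x, y)
    print("posterior mean after 500 samples:", np.round(m, 3))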


Bibliography

[1] J. D. Aber and M. E. Martin. High spectral resolution remote sensing of canopy chemistry. In Summaries of the Fifth Annual JPL Airborne Earth Science Workshop, volume 1, pages 1–4. JPL Publication 95-1, 1993. [2] N. Acito, G. Corsini, and M. Diani. Adaptive detection algorithm for full pixel targets in hyperspectral images. Vision, Image and Signal Processing, IEE Proceedings, 152(6):731–740, December 2005. [3] I. J. Allison and E. A. Neal. Final report on the TIROS 1 Meteorological satellite system. NASA Technical Report R-131, Goddard Space Flight Center, Greenbelt, Maryland, 1962. [4] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36(1):99–102, 1974. [5] A. Albert. Conditions for positive and nonnegative definiteness in terms of pseudoinverses. SIAM Journal on Applied Mathematics, 17(2):434–440, 1969. [6] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems, volume 12, pages 209–215. MIT Press, 2000. [7] S. D. Babacan, R. Molina, and A. K. Katsaggelos. Bayesian compressive sensing using Laplace priors. IEEE Trans. Image Process., 19(1):53–63, January 2010. [8] A. Benavoli, L. Chisci, and A. Farina. Estimation of constrained parameters with guaranteed MSE improvement. Signal Processing, IEEE Transactions on, 55(4):1264–1274, April 2007. [9] James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 2nd edition, 1985. [10] J. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley Series in Probability and Statistics. John Wiley & Sons Inc, 2007.


[11] J. E. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological), 48:259–302, March 1986. [12] J. E. Besag and J. C. York. Bayesian restoration of images. In Analysis of Statistical Information, pages 491–507, 1989. [13] J. P. Bibring, Y. Langevin, A. Gendrin, B. Gondet, F. Poulet, M. Berthe, A. Soufflot, R. Arvidson, N. Mangold, J. Mustard, P. Drossart, and the OMEGA Team. Mars Surface Diversity as Revealed by the OMEGA/Mars Express Observations. Science, 307(5715):1576–1581, March 2005. [14] J. P. Bibring, Y. Langevin, F. Poulet, A. Gendrin, B. Gondet, M. Berthe, A. Soufflot, P. Drossart, M. Combes, G. Bellucci, V. Moroz, N. Mangold, and B. Schmitt. Perennial water ice identified in the south polar cap of Mars. Nature, 428:627–630, April 2004. [15] J. P. Bibring, A. Soufflot, M. Berthé, Y. Langevin, B. Gondet, P. Drossart, M. Bouyé, M. Combes, P. Puget, G. Bellucci, et al. OMEGA: Observatoire pour la Minéralogie, l'Eau, les Glaces et l'Activité, volume SP-1240, pages 37–50. ESA, 2004. [16] J. Bioucas-Dias and M. Figueiredo. Alternating direction algorithms for constrained sparse regression: Application to hyperspectral unmixing. In Proc. IEEE International Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS'10), Reykjavik, Iceland, June 2010. [17] J. M. Bioucas-Dias. Bayesian wavelet-based image deconvolution: a GEM algorithm exploiting a class of heavy-tailed priors. IEEE Trans. Image Process., 15(4):937–951, April 2006. [18] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. [19] J. W. Boardman, F. A. Kruse, and R. O. Green. Mapping target signatures via partial unmixing of AVIRIS data. In Proc. JPL Airborne Earth Sci. Workshop, pages 23–26, 1995. [20] J. W. Boardman. Inversion of imaging spectrometry data using singular value decomposition. In Proc. of Geoscience and Remote Sensing Symposium, IGARSS'89, volume 4, pages 2069–2072, July 1989. [21] J. H. Bowles, P. J. Palmadesso, J. A. Antoniades, M. M. Baumback, and L. J. Rickard. Use of filter vectors in hyperspectral data analysis. In Proc. SPIE Infrared Spaceborne Remote Sensing III, volume 2553, pages 148–157, 1995.


[22] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory, volume 15 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, June 1994. [23] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [24] A. M. Bruckstein, M. Elad, and M. Zibulevsky. On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Trans. Inf. Theory, 54(11):4813–4820, November 2008. [25] G. Camps-Valls and L. Bruzzone. Kernel-based methods for hyperspectral image classification. Geoscience and Remote Sensing, IEEE Transactions on, 43(6):1351–1362, June 2005. [26] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2005. [27] E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009. [28] E. J. Candes and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Process. Mag., 25(2):21–30, March 2008. [29] G. Casella and E. I. George. Explaining the Gibbs sampler. The American Statistician, 46:167–174, August 1992. [30] T.-H. Chan, C.-Y. Chi, Y.-M. Huang, and W.-K. Ma. A convex analysis-based minimum-volume enclosing simplex algorithm for hyperspectral unmixing. IEEE Trans. Signal Process., 57(11):4418–4432, November 2009. [31] Hao Chen and C. H. Chen. Hyperspectral image data unsupervised classification using Gauss-Markov random fields and PCA principle. In Geoscience and Remote Sensing Symposium, 2002. IGARSS '02. 2002 IEEE International, volume 3, pages 1431–1433, 2002. [32] M. Chen, J. Silva, J. Paisley, C. Wang, D. Dunson, and L. Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. Signal Processing, IEEE Transactions on, 58(12):6140–6155, December 2010.


[33] R. N. Clark, G. A. Swayze, K. E. Livo, R. F. Kokaly, S. J. Sutley, J. B. Dalton, R. R. McDougal, and C. A. Gent. Imaging spectroscopy: Earth and planetary remote sensing with the USGS Tetracorder and expert systems. J. Geophys. Res., 108(E12):5–15–44, December 1993. [34] R. N. Clark, G. A. Swayze, R. Wise, K. E. Livo, T. M. Hoefen, R. F. Kokaly, and S. J. Sutley. USGS digital spectral library, 2007. http://speclab.cr.usgs.gov/spectral.lib06/ds231/datatable.html. [35] T. F. Coleman and Y. Li. A reflective Newton method for minimizing a quadratic function subject to bounds on some of the variables. SIAM Journal on Optimization, 6:1040–1058, 1996. [36] J.-P. Combe, S. Le Mouélic, C. Sotin, A. Gendrin, J. Mustard, L. Le Deit, P. Launeau, J. Bibring, B. Gondet, and Y. Langevin. Analysis of OMEGA/Mars Express hyperspectral data using a Multiple-Endmember Linear Spectral Unmixing Model (MELSUM): Methodology and first results. Planetary and Space Science, 56(7):951–975, 2008. [37] D. J. Costanzo. Hyperspectral imaging spectral variability experiment results. In Geoscience and Remote Sensing Symposium, 2000. Proceedings. IGARSS 2000. IEEE 2000 International, volume 4, pages 1385–1387, 2000. [38] B. Demir and S. Erturk. Hyperspectral image classification using relevance vector machines. Geoscience and Remote Sensing Letters, IEEE, 4(4):586–590, October 2007. [39] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39(1), 1977. [40] N. Dobigeon, A. O. Hero, and J.-Y. Tourneret. Hierarchical Bayesian sparse image reconstruction with application to MRFM. IEEE Trans. Image Process., 18(9):2059–2070, September 2009. [41] N. Dobigeon, J.-Y. Tourneret, and Chein-I Chang. Semi-supervised linear spectral unmixing using a hierarchical Bayesian model for hyperspectral imagery. IEEE Trans. Signal Process., 56(7):2684–2695, July 2008. [42] D. L. Donoho, Y. Tsaig, I. Drori, and J.-L. Starck. Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Technical report, Department of Statistics, Stanford University, 2006.


[43] D. L. Donoho. Compressed sensing. Information Theory, IEEE Transactions on, 52(4):1289–1306, April 2006. [44] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, February 2002. [45] W. E. Esaias, M. R. Abbott, I. Barton, O. B. Brown, J. W. Campbell, K. L. Carder, D. K. Clark, R. H. Evans, F. E. Hoge, H. R. Gordon, W. M. Balch, R. Letelier, and P. J. Minnett. An overview of MODIS capabilities for ocean science observations. Geoscience and Remote Sensing, IEEE Transactions on, 36(4):1250–1265, July 1998. [46] A. C. Faul and M. E. Tipping. Analysis of sparse Bayesian learning. In NIPS, pages 383–389. MIT Press, 2001. [47] M. Fauvel, J. A. Benediktsson, J. Chanussot, and J. R. Sveinsson. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. Geoscience and Remote Sensing, IEEE Transactions on, 46(11):3804–3814, November 2008. [48] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, January 1968. [49] M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1150–1159, September 2003. [50] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, June 1990. [51] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, Second Edition (Chapman & Hall/CRC Texts in Statistical Science). Chapman and Hall/CRC, 2nd edition, July 2003. [52] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., PAMI-6(6):721–741, November 1984. [53] A. Gendrin, Y. Langevin, J. P. Bibring, and O. Forni. A new method to investigate hyperspectral image cubes: An application of the wavelet transform. J. Geophys. Res., 111, 2006. [54] G. H. Golub and C. F. Van Loan. Matrix Computations (Johns Hopkins Studies in Mathematical Sciences) (3rd Edition). The Johns Hopkins University Press, 1996.


[55] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic, New York, 1980. [56] A. A. Green, M. Berman, P. Switzer, and M. D. Craig. A transformation for ordering multispectral data in terms of image quality with implications for noise removal. IEEE Transactions on Geoscience and Remote Sensing, 26(1):65–74, January 1988. [57] M. K. Griffin and H. K. Burke. Compensation of hyperspectral data for atmospheric effects. Lincoln Laboratory Journal, 14(1):29–54, 2003. [58] Z. Guo, T. Wittman, and S. Osher. L1 unmixing and its application to hyperspectral image enhancement. In Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery, volume 7334, pages 73341M–73341M–9. SPIE, 2009. [59] A. Halimi, Y. Altmann, N. Dobigeon, and J. Tourneret. Nonlinear unmixing of hyperspectral images using a generalized bilinear model. Geoscience and Remote Sensing, IEEE Transactions on, 49(11):4153–4162, November 2011. [60] D. C. Heinz and C. I. Chang. Fully constrained least squares linear spectral mixture analysis method for material quantification in hyperspectral imagery. IEEE Trans. Geosci. Remote Sensing, 39:529–545, March 2001. [61] S. M. Hsu and H. K. Burke. Multisensor fusion with hyperspectral imaging data: Detection and classification. Lincoln Laboratory Journal, 14(1):145–159, 2003. [62] A. Ifarraguerri and C.-I. Chang. Multispectral and hyperspectral image analysis with convex cones. Geoscience and Remote Sensing, IEEE Transactions on, 37(2):756–770, March 1999. [63] M.-D. Iordache, J. Bioucas-Dias, and A. Plaza. Unmixing sparse hyperspectral mixtures. In Geoscience and Remote Sensing Symposium, 2009 IEEE International, IGARSS 2009, volume 4, pages 85–88, Cape Town, July 2009. [64] M.-D. Iordache, J. M. Bioucas-Dias, and A. Plaza. Sparse unmixing of hyperspectral data. Geoscience and Remote Sensing, IEEE Transactions on, 49(6):2014–2039, June 2011. [65] T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, January 2000. [66] Jet Propulsion Laboratory (JPL), National Aeronautics and Space Administration (NASA). AVIRIS free standard data products.


[67] S. Ji, D. Dunson, and L. Carin. Multitask compressive sensing. IEEE Trans. Signal Process., 57(1):92–106, January 2009. [68] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Trans. Signal Process., 56(6):2346–2356, June 2008. [69] A. Johnson, E. Windesheim, and J. Brockhaus. Hyperspectral imagery for trafficability analysis. In Aerospace Conference, 1998 IEEE, volume 2, pages 21–35, March 1998. [70] N. L. Johnson and S. Kotz. Continuous univariate distributions-1. John Wiley & Sons, 1970. [71] I. T. Jolliffe. Principal Component Analysis. Springer, second edition, October 2002. [72] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, January 1999. [73] L. C. Kanner, J. F. Mustard, and A. Gendrin. Assessing the limits of the Modified Gaussian Model for remote spectroscopic studies of pyroxenes on Mars. Icarus, 187:442–456, April 2007. [74] N. Keshava. A survey of spectral unmixing algorithms. Lincoln Laboratory Journal, 14(1):55–78, 2003. [75] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, Englewood Cliffs, NJ, 1993. [76] J. G. Kemeny and J. L. Snell. Finite Markov chains. University Series in Undergraduate Mathematics. Van Nostrand, New York, 1969. [77] N. Keshava and J. F. Mustard. Spectral unmixing. IEEE Signal Processing Magazine, 19:44–57, January 2002. [78] M. Kyung, J. Gill, M. Ghosh, and G. Casella. Penalized regression, standard errors, and Bayesian Lassos. Bayesian Analysis, 5:369–412, February 2010. [79] D. Landgrebe. The evolution of Landsat data analysis. Photogrammetric Engineering & Remote Sensing, 63(7):859–867, 1997. [80] Y. Langevin, F. Poulet, J. P. Bibring, and B. Gondet. Sulfates in the north polar region of Mars detected by OMEGA/Mars Express. Science, 307(5715):1584–1586, 2005.


[81] C. L. Lawson and R. J. Hanson. Solving least squares problems. Prentice-Hall Series in Automatic Computation, Englewood Cliffs, 3rd edition, 1995. [82] S. Le Mouélic, J.-P. Combe, V. Sarago, N. Mangold, M. Masse, J.-P. Bibring, B. Gondet, Y. Langevin, and C. Sotin. An iterative least squares approach to decorrelate minerals and ices contributions in hyperspectral images: Application to Cuprite (Earth) and Mars. In Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2009. WHISPERS '09. First Workshop on, pages 1–4, August 2009. [83] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, 1998. [84] J. S. Liu, W. H. Wong, and A. Kong. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81(1):27–40, 1994. [85] D. Manolakis, D. Marden, and G. A. Shaw. Hyperspectral image processing for automatic target detection applications. Lincoln Laboratory Journal, 14(1):79–116, 2003. [86] D. Manolakis, C. Siracusa, and G. Shaw. Hyperspectral subpixel target detection using the linear mixing model. Geoscience and Remote Sensing, IEEE Transactions on, 39(7):1392–1409, July 2001. [87] S. Matteoli, N. Acito, M. Diani, and G. Corsini. An automatic approach to adaptive local background estimation and suppression in hyperspectral target detection. Geoscience and Remote Sensing, IEEE Transactions on, 49(2):790–800, February 2011. [88] G. J. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley, New York, 1997. [89] R. N. Merton. Multi-Temporal Analysis Of Community Scale Vegetation Stress With Imaging Spectroscopy. PhD thesis, University of Auckland, 1999. [90] L. Miao and H. Qi. Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization. IEEE Trans. Geosci. Remote Sens., 45(3):765–777, March 2007. [91] S. Moussaoui, D. Brie, A. Mohammad-Djafari, and C. Carteret. Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling. Signal Processing, IEEE Transactions on, 54(11):4133–4145, November 2006. [92] S. Moussaoui, H. Hauksdóttir, F. Schmidt, C. Jutten, J. Chanussot, D. Brie, S. Douté, and J. A. Benediktsson. On the decomposition of Mars hyperspectral data by ICA and Bayesian positive source separation. Neurocomputing, 71:2194–2208, June 2008.


[93] J. F. Mustard, F. Poulet, A. Gendrin, J. P. Bibring, Y. Langevin, B. Gondet, N. Mangold, G. Bellucci, and F. Altieri. Olivine and Pyroxene Diversity in the Crust of Mars. Science, 307(5715):1594–1597, 2005. [94] J. M. P. Nascimento and J. M. Bioucas-Dias. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Trans. Geosci. Remote Sensing, 4:898–910, April 2005. [95] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, 1993. [96] R. Neher and A. Srivastava. A Bayesian MRF framework for labeling terrain using hyperspectral imaging. Geoscience and Remote Sensing, IEEE Transactions on, 43(6):1363–1374, June 2005. [97] R. A. Neville, K. Staenz, T. Szeredi, J. Lefebvre, and P. Hauff. Automatic endmember extraction from hyperspectral data for mineral exploration. In Proc. 21st Canadian Symp. Remote Sens., pages 21–24, 1999. [98] J. Paisley and L. Carin. Nonparametric factor analysis with beta process priors. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 777–784, New York, NY, USA, 2009. ACM. [99] G. Parisi. Statistical field theory. Advanced book classics. Perseus Books, 1998. [100] T. Park and G. Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, June 2008. [101] A. Plaza, G. Martin, J. Plaza, M. Zortea, and S. Sanchez. Recent developments in endmember extraction and spectral unmixing. In Saurabh Prasad, Lori M. Bruce, Jocelyn Chanussot, Riad I. Hammoud, and Lawrence B. Wolff, editors, Optical Remote Sensing, volume 3 of Augmented Vision and Reality, pages 235–267. Springer Berlin Heidelberg, 2011. [102] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Technical report, 2007. [103] C. P. Robert. Simulation of truncated normal variables. Statistics and Computing, 5:121–125, 1995. [104] G. Rodriguez-Yam, R. A. Davis, and L. Scharf. Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. Technical report, Columbia University, 2004.


[105] D. M. Rogge, B. Rivard, J. Zhang, A. Sanchez, J. Harris, and J. Feng. Integration of spatial-spectral information for the improved extraction of endmembers. Remote Sensing of Environment, 110(3):287–303, 2007. [106] L. L. Scharf. Statistical Signal Processing. Prentice Hall, 1991. [107] F. Schmidt, S. Bourguignon, S. Le Mouélic, N. Dobigeon, C. Theys, and E. Tréguier. Accuracy and performance of linear unmixing techniques for detecting minerals on OMEGA/Mars Express. In Proc. IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Lisbon, Portugal, June 2011. [108] F. Schmidt, S. Douté, and B. Schmitt. Wavanglet: An efficient supervised classifier for hyperspectral images. Geoscience and Remote Sensing, IEEE Transactions on, 45(5):1374–1385, May 2007. [109] F. Schmidt, A. Schmidt, E. Tréguier, M. Guiheneuf, S. Moussaoui, and N. Dobigeon. Implementation strategies for hyperspectral unmixing using Bayesian source separation. Geoscience and Remote Sensing, IEEE Transactions on, 48(11):4003–4013, November 2010. [110] H. Snoussi and J. Idier. Bayesian blind separation of generalized hyperbolic processes in noisy and underdeterminate mixtures. IEEE Trans. Signal Process., 54(9):3257–3269, September 2006. [111] K. E. Themelis and A. A. Rontogiannis. A soft constrained MAP estimator for supervised hyperspectral signal unmixing. In Proc. of European Signal Processing Conference (EUSIPCO), August 2008. [112] K. E. Themelis, A. A. Rontogiannis, and K. D. Koutroumbas. Semi-supervised hyperspectral unmixing via the weighted Lasso. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'10), Dallas, Texas, March 2010. [113] K. E. Themelis, A. A. Rontogiannis, and K. D. Koutroumbas. Sparse semi-supervised hyperspectral unmixing using a novel iterative Bayesian inference algorithm. In 19th European Signal Processing Conference (EUSIPCO), pages 1165–1169, September 2011. [114] K. E. Themelis, A. A. Rontogiannis, and K. D. Koutroumbas. A novel hierarchical Bayesian approach for sparse semi-supervised hyperspectral unmixing. IEEE Trans. Signal Process., 60(2):585–599, February 2012.


[115] K. E. Themelis, A. A. Rontogiannis, O. Sykioti, K. D. Koutroumbas, I. A. Daglis, and F. Schmidt. On the unmixing of MEx/OMEGA hyperspectral data. In Proc. European Planetary Science Congress (EPSC), September 2010. [116] K. E. Themelis, F. Schmidt, O. Sykioti, A. A. Rontogiannis, K. D. Koutroumbas, and I. A. Daglis. On the unmixing of MEx/OMEGA hyperspectral data. Planetary and Space Science, Elsevier, 2011 (accepted). [117] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, 58(1):267–288, January 1996. [118] L. Tierney. Markov Chains for Exploring Posterior Distributions. The Annals of Statistics, 22(4):1701–1728, 1994. [119] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, 2001. [120] M. E. Tipping and A. C. Faul. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pages 3–6, 2003. [121] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory, 53:4655–4666, December 2007. [122] Y. Tsaig and D. L. Donoho. Extensions of compressed sensing. Signal Process., 86:549–571, March 2006. [123] D. G. Tzikas, A. C. Likas, and N. P. Galatsanos. The variational approximation for Bayesian inference. IEEE Signal Process. Mag., 25(6):131–146, November 2008. [124] G. Wahba. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. University of Wisconsin, Department of Statistics, 1983. [125] R. Weinstock. Calculus of variations: with applications to physics and engineering. Dover books on advanced mathematics. Dover Publications, 1974. [126] M. E. Winter. N-FINDR: an algorithm for fast autonomous spectral end-member determination in hyperspectral data. In Proc. SPIE Imaging Spectrometry V, volume 3753, pages 266–275, July 1999.


[127] Ye Zhang and Yanfeng Gu. Kernel-based invariant subspace method for hyperspectral target detection. In Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, volume 5, pages V-801–V-804, May 2004. [128] M. Zortea and A. Plaza. Spatial preprocessing for endmember extraction. Geoscience and Remote Sensing, IEEE Transactions on, 47(8):2679–2693, August 2009. [129] H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, December 2006.


Index

abundances, 28, 42
anomaly detection, 39
AVIRIS data, 129
background characterization, 39
band ratio method, 95
Chapman-Kolmogorov equation, 66
compressive sensing, 135
dimensionality reduction, 41, 47
EM algorithm, 71
endmember extraction, 47
endmembers, 28, 42
evidence approximation, 76
Gibbs sampling, 61
HYDICE data, 85, 105
inversion, 48
LARS algorithm, 103
lasso, 57, 102, 109
  adaptive lasso, 60, 112, 113, 138
  weighted lasso, 102
linear mixing model, 44
linear regression, 51
Markov chain Monte Carlo, 65
Markov chains, 65
  detailed balance, 66
  equilibrium distribution, 67
  ergodicity, 66
  homogeneous chain, 66
  invariant distribution, 66
multispectral sensors, 34
OMEGA data, 87
rejection sampling, 63
remote sensing, 33
  active, 33
  passive, 33
ridge regression, 56
spectral unmixing, 42
  linear, 42, 44
  nonlinear, 42
superefficient estimators, 81
target recognition, 39
transformation method, 62
type-II maximum likelihood, 141
variational EM, 74