In robust statistics, robust regression is a form of regression analysis designed to overcome some limitations of traditional parametric and non-parametric methods. Regression analysis seeks to find the relationship between one or more independent variables and a dependent variable. Certain widely used methods of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results if those assumptions are not true; thus ordinary least squares is said to be not robust to violations of its assumptions. Robust regression methods are designed not to be overly affected by violations of assumptions by the underlying data-generating process. M-estimation, an early and widely used robust approach, is robust to outliers in the response variable, but turned out not to be resistant to outliers in the explanatory variables (leverage points). In fact, when there are outliers in the explanatory variables, the method has no advantage over least squares.
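M-estimation of a regression line can be sketched as iteratively reweighted least squares (IRLS) with Huber's weight function. The following is a minimal illustrative implementation, not the algorithm of any particular package: the function name, the MAD-based scale, and the fixed iteration count are all expository choices, and c = 1.345 is the standard tuning constant giving roughly 95% efficiency at the normal model.

```python
import statistics

def huber_irls(x, y, c=1.345, n_iter=50):
    """Fit y = a + b*x by iteratively reweighted least squares with
    Huber weights w = min(1, c*s/|r|), where s is a MAD-based scale
    estimate of the residuals.  Expository sketch only."""
    w = [1.0] * len(x)
    a = b = 0.0
    for _ in range(n_iter):
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        b = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
             / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
        a = ybar - b * xbar
        r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # MAD-based scale; 0.6745 makes it consistent at the normal model
        s = statistics.median(abs(ri) for ri in r) / 0.6745 or 1.0
        w = [min(1.0, c * s / abs(ri)) if ri else 1.0 for ri in r]
    return a, b
```

With all weights equal to 1, the first pass is ordinary least squares; subsequent passes downweight observations with large scaled residuals, so a gross outlier in the response loses almost all influence on the final line.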
In the 1980s, several alternatives to M-estimation were proposed as attempts to overcome the lack of resistance. See the book by Rousseeuw and Leroy for a very practical review. Least trimmed squares (LTS) is a viable alternative and is currently (2007) the preferred choice of Rousseeuw and Ryan (1997, 2008). The Theil–Sen estimator has a lower breakdown point than LTS but is statistically efficient and popular. Another proposed solution was S-estimation. This method finds a line (plane or hyperplane) that minimizes a robust estimate of the scale (from which the method gets the S in its name) of the residuals. This method is highly resistant to leverage points and is robust to outliers in the response. However, this method was also found to be inefficient.
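The Theil–Sen estimator mentioned above is simple enough to state in full: the slope is the median of all pairwise slopes, and the intercept is the median residual offset. A self-contained sketch (illustrative code examining all O(n²) pairs, not a library implementation):

```python
from itertools import combinations
from statistics import median

def theil_sen(x, y):
    """Theil-Sen simple linear regression: the slope is the median of
    all pairwise slopes, the intercept the median of y_i - slope*x_i."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
              if x2 != x1]
    b = median(slopes)
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return a, b
```

Because the median ignores the minority of contaminated pairwise slopes, a single wild response value leaves the fitted line essentially untouched.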
MM-estimation attempts to retain the robustness and resistance of S-estimation whilst gaining the efficiency of M-estimation. The method proceeds by finding a highly robust and resistant S-estimate that minimizes an M-estimate of the scale of the residuals (the first M in the method's name). The estimated scale is then held constant whilst a nearby M-estimate of the parameters is located (the second M).
Parametric alternatives
Another approach to robust estimation of regression models is to replace the normal distribution with a heavy-tailed distribution. A t-distribution with 4–6 degrees of freedom has been reported to be a good choice in various practical situations. Bayesian robust regression, being fully parametric, relies heavily on such distributions.
Under the assumption of t-distributed residuals, the distribution is a location-scale family. That is, (x − μ)/σ ∼ t(ν). The degrees of freedom ν of the t-distribution is sometimes called the kurtosis parameter. Lange, Little and Taylor (1989) discuss this model in some depth from a non-Bayesian point of view. A Bayesian account appears in Gelman et al. (2003).
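One way to see why t-distributed errors produce robust fits: the EM algorithm for this model (as discussed in the Lange, Little and Taylor line of work) reduces each step to weighted least squares with weights w(r) = (ν + 1)/(ν + (r/σ)²), so observations with large residuals are automatically downweighted. A sketch of just this weight function, with illustrative default parameter values:

```python
def t_weight(r, sigma=1.0, nu=5):
    """EM weight for a residual r under t(nu) errors: observations with
    large scaled residuals receive small weights.  Defaults illustrative."""
    return (nu + 1) / (nu + (r / sigma) ** 2)
```

A residual of zero gets the maximum weight (ν + 1)/ν, while a residual ten scale units out is almost entirely discounted; as ν → ∞ the weights flatten to 1 and the fit reverts to ordinary least squares.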
An alternative parametric approach is to assume that the residuals follow a mixture of normal distributions; in particular, a contaminated normal distribution, in which the majority of observations are from a specified normal distribution but a small proportion are from a normal distribution with much higher variance. That is, residuals have probability 1 − ε of coming from a normal distribution with variance σ², where ε is small, and probability ε of coming from a normal distribution with variance cσ² for some c > 1:
Typically, ε < 0.1. This is sometimes called the ε-contamination model.
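A quick way to build intuition for the ε-contamination model is to simulate from it; the sampler below uses illustrative values ε = 0.05 and c = 10 (these defaults and the function name are expository choices, not taken from the text):

```python
import random
import statistics

def contaminated_sample(n, eps=0.05, sigma=1.0, c=10.0, seed=0):
    """Draw n residuals from the eps-contamination model: with
    probability 1 - eps from N(0, sigma^2), otherwise from the
    wide component N(0, c * sigma^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma if rng.random() > eps else sigma * c ** 0.5)
            for _ in range(n)]
```

With these defaults the residual standard deviation is inflated from 1 to about sqrt(0.95 + 0.05 × 10) ≈ 1.2, while the median stays near zero, illustrating why median-based estimates remain stable under contamination.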
Parametric approaches have the advantage that likelihood theory provides an 'off-the-shelf' approach to inference (although for mixture models such as the ε-contamination model, the usual regularity conditions might not apply), and it is possible to build simulation models from the fit. However, such parametric models still assume that the underlying model is literally true. As such, they do not account for skewed residual distributions or finite observation precisions.
Unit weights
Another robust method is the use of unit weights (Wainer & Thissen, 1976), a method that can be applied when there are multiple predictors of a single outcome. Ernest Burgess (1928) used unit weights to predict success on parole. He scored 21 positive factors as present (e.g., 'no prior arrest' = 1) or absent ('prior arrest' = 0), then summed the scores to yield a predictor that proved useful in forecasting parole success. Samuel S. Wilks (1938) showed that nearly all sets of regression weights yield composites that are very highly correlated with one another, including unit weights, a result referred to as Wilks' theorem (Ree, Carretta, & Earles, 1998). Robyn Dawes (1979) examined decision making in applied settings, showing that simple models with unit weights often outperformed human experts. Bobko, Roth, and Buster (2007) reviewed the literature on unit weights and concluded that decades of empirical studies show that unit weights perform similarly to ordinary regression weights on cross-validation.
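A Burgess-style unit-weighted composite is nothing more than a count of favourable factors scored present/absent. A minimal sketch; the factor names below are hypothetical, not Burgess's original 21:

```python
def unit_weight_score(case, factors):
    """Burgess-style composite: count favourable factors, each scored
    1 if present and 0 if absent."""
    return sum(1 if case.get(f) else 0 for f in factors)

# Hypothetical factors for illustration only:
factors = ["no_prior_arrest", "steady_work_record", "first_offense"]
case = {"no_prior_arrest": True, "steady_work_record": False,
        "first_offense": True}
score = unit_weight_score(case, factors)  # 2 of 3 favourable factors present
```

The robustness of the composite comes from the fact that no estimated weight can be destabilized by unusual observations: the weights are fixed at 1 by construction.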
Example: BUPA liver data
The BUPA liver data have been studied by various authors, including Breiman (2001). The data can be found at the classic data sets page, and there is some discussion in the article on the Box–Cox transformation. A plot of the logs of ALT versus the logs of γGT appears below. The two regression lines are those estimated by ordinary least squares (OLS) and by robust MM-estimation. The analysis was performed in R using software made available by Venables and Ripley (2002).
The two regression lines appear to be very similar (and this is not unusual in a data set of this size). However, the advantage of the robust approach comes to light when the estimates of residual scale are considered. For ordinary least squares, the estimate of scale is 0.420, compared to 0.373 for the robust method. Thus, the relative efficiency of ordinary least squares to MM-estimation in this example is 1.266. This inefficiency leads to loss of power in hypothesis tests and to unnecessarily wide confidence intervals on estimated parameters.
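The efficiency figure can be checked directly: it is the ratio of the estimated residual variances, i.e. the squared ratio of the two scale estimates. From the rounded scales 0.420 and 0.373 this gives about 1.27, in line with the quoted 1.266 (which presumably comes from the unrounded fit):

```python
def relative_efficiency(scale_ols, scale_robust):
    """Relative efficiency of OLS to the robust fit: the ratio of the
    estimated residual variances (squared scale estimates)."""
    return (scale_ols / scale_robust) ** 2

eff = relative_efficiency(0.420, 0.373)  # scale estimates from the BUPA example
```

A ratio above 1 means the OLS scale estimate is inflated relative to the robust one, which translates into wider confidence intervals and reduced power, as noted above.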
Outlier detection
Another consequence of the inefficiency of the ordinary least squares fit is that several outliers are masked: because the estimate of residual scale is inflated, the scaled residuals are pushed closer to zero than when a more appropriate estimate of scale is used. The plots of the scaled residuals from the two models appear below. The variable on the x axis is just the observation number as it appeared in the data set. Rousseeuw and Leroy (1986) contains many such plots.
The horizontal reference lines are at 2 and −2, so that any observed scaled residual beyond these boundaries can be considered to be an outlier. Clearly, the least squares method leads to many interesting observations being masked.
Whilst in one or two dimensions outlier detection using classical methods can be performed manually, with large data sets and in high dimensions the problem of masking can make identification of many outliers impossible. Robust methods automatically detect these observations, offering a serious advantage over classical methods when outliers are present.
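The masking effect is easy to demonstrate in code: whether an observation crosses the ±2 boundaries depends critically on the scale estimate, and an inflated scale hides genuine outliers. A minimal sketch with toy (hypothetical) residuals:

```python
def flag_outliers(residuals, scale, cutoff=2.0):
    """Indices of observations whose scaled residuals fall outside ±cutoff."""
    return [i for i, r in enumerate(residuals) if abs(r / scale) > cutoff]

r = [0.1, -0.2, 3.0, 0.05, -2.5]       # toy residuals (hypothetical)
flagged = flag_outliers(r, scale=1.0)  # robust-sized scale flags points 2 and 4
masked = flag_outliers(r, scale=1.6)   # inflated scale flags nothing
```

With the smaller scale, observations 2 and 4 are flagged; inflating the scale to 1.6 pushes both scaled residuals inside the boundaries, and the outliers are masked.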
See also
- Theil–Sen estimator, a method for robust simple linear regression
References
- Andersen, R. (2008). Modern Methods for Robust Regression. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-152.
- Ben-Gal, I. (2005). 'Outlier detection'. In: Maimon, O. and Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers. ISBN 0-387-24435-2.
- Bobko, P., Roth, P. L., & Buster, M. A. (2007). 'The usefulness of unit weights in creating composite scores: A literature review, application to content validity, and meta-analysis'. Organizational Research Methods, volume 10, pages 689–709. doi:10.1177/1094428106294734
- Breiman, L. (2001). 'Statistical Modeling: the Two Cultures'. Statistical Science. 16 (3): 199–231. doi:10.1214/ss/1009213725. JSTOR 2676681.
- Burgess, E. W. (1928). 'Factors determining success or failure on parole'. In A. A. Bruce (Ed.), The Workings of the Indeterminate Sentence Law and Parole in Illinois (pp. 205–249). Springfield, Illinois: Illinois State Parole Board.
- Dawes, Robyn M. (1979). 'The robust beauty of improper linear models in decision making'. American Psychologist, volume 34, pages 571–582. doi:10.1037/0003-066X.34.7.571
- Draper, David (1988). 'Rank-Based Robust Analysis of Linear Models. I. Exposition and Review'. Statistical Science. 3 (2): 239–257. doi:10.1214/ss/1177012915. JSTOR 2245578.
- Faraway, J. J. (2004). Linear Models with R. Chapman & Hall/CRC.
- Fornalski, K. W. (2015). 'Applications of the robust Bayesian regression analysis'. International Journal of Society Systems Science. 7 (4): 314–333. doi:10.1504/IJSSS.2015.073223.
- Gelman, A.; J. B. Carlin; H. S. Stern; D. B. Rubin (2003). Bayesian Data Analysis (Second ed.). Chapman & Hall/CRC.
- Hampel, F. R.; E. M. Ronchetti; P. J. Rousseeuw; W. A. Stahel (2005) [1986]. Robust Statistics: The Approach Based on Influence Functions. Wiley.
- Lange, K. L.; R. J. A. Little; J. M. G. Taylor (1989). 'Robust statistical modeling using the t-distribution'. Journal of the American Statistical Association. 84 (408): 881–896. doi:10.2307/2290063. JSTOR 2290063.
- Lerman, G.; McCoy, M.; Tropp, J. A.; Zhang T. (2012). 'Robust computation of linear models, or how to find a needle in a haystack', arXiv:1202.4044.
- Maronna, R.; D. Martin; V. Yohai (2006). Robust Statistics: Theory and Methods. Wiley.
- McKean, Joseph W. (2004). 'Robust Analysis of Linear Models'. Statistical Science. 19 (4): 562–570. doi:10.1214/088342304000000549. JSTOR 4144426.
- Radchenko, S. G. (2005). Robust methods for statistical models estimation: Monograph (in Russian). Kiev: РР «Sanspariel». p. 504. ISBN 978-966-96574-0-4.
- Ree, M. J., Carretta, T. R., & Earles, J. A. (1998). 'In top-down decisions, weighting variables does not matter: A consequence of Wilks' theorem'. Organizational Research Methods, volume 1(4), pages 407–420. doi:10.1177/109442819814003
- Rousseeuw, P. J.; A. M. Leroy (2003) [1986]. Robust Regression and Outlier Detection. Wiley.
- Ryan, T. P. (2008) [1997]. Modern Regression Methods. Wiley.
- Seber, G. A. F.; A. J. Lee (2003). Linear Regression Analysis (Second ed.). Wiley.
- Stromberg, A. J. (2004). 'Why write statistical software? The case of robust statistical methods'. Journal of Statistical Software. 10 (5). doi:10.18637/jss.v010.i05.
- Strutz, T. (2016). Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Springer Vieweg. ISBN 978-3-658-11455-8.
- Tofallis, Chris (2008). 'Least Squares Percentage Regression'. Journal of Modern Applied Statistical Methods. 7: 526–534. doi:10.2139/ssrn.1406472. SSRN 1406472.
- Venables, W. N.; B. D. Ripley (2002). Modern Applied Statistics with S. Springer.
- Wainer, H., & Thissen, D. (1976). 'Three steps toward robust regression.' Psychometrika, volume 41(1), pages 9–34. doi:10.1007/BF02291695
- Wilks, S. S. (1938). 'Weighting systems for linear functions of correlated variables when there is no dependent variable'. Psychometrika, volume 3, pages 23–40. doi:10.1007/BF02287917
External links
- Nick Fieller's course notes on Statistical Modelling and Computation contain material on robust regression.
Retrieved from 'https://en.wikipedia.org/w/index.php?title=Robust_regression&oldid=907567775'
Robust Methods in Biostatistics Stephane Heritier The George Institute for International Health, University of Sydney, Australia
Eva Cantoni Department of Econometrics, University of Geneva, Switzerland
Samuel Copt Merck Serono International, Geneva, Switzerland
Maria-Pia Victoria-Feser HEC Section, University of Geneva, Switzerland
A John Wiley and Sons, Ltd, Publication
Robust Methods in Biostatistics
WILEY SERIES IN PROBABILITY AND STATISTICS Established by WALTER A. SHEWHART and SAMUEL S. WILKS Editors David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg, Harvey Goldstein. Editors Emeriti Vic Barnett, J. Stuart Hunter, Jozef L. Teugels A complete list of the titles in this series appears at the end of this volume.
This edition first published 2009. © 2009 John Wiley & Sons Ltd.

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom. For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Robust methods in biostatistics / Stephane Heritier . . . [et al.].
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-02726-4 (cloth)
1. Biometry–Statistical methods. I. Heritier, Stephane.
[DNLM: 1. Biometry–methods. WA 950 R667 2009]
QH323.5.R615 2009
570.1'5195–dc22
2009008863

A catalogue record for this book is available from the British Library.

ISBN 9780470027264

Set in 10/12pt Times by Sunrise Setting Ltd, Torquay, UK. Printed in Great Britain by CPI Antony Rowe, Chippenham, Wiltshire.
To Anna, Olivier, Cassandre, Oriane, Sonia, Johannes, Véronique, Sébastien and Raphaël, who contributed in their ways. . .
Contents

Preface
Acknowledgments

1 Introduction
1.1 What is Robust Statistics?
1.2 Against What is Robust Statistics Robust?
1.3 Are Diagnostic Methods an Alternative to Robust Statistics?
1.4 How do Robust Statistics Compare with Other Statistical Procedures in Practice?

2 Key Measures and Results
2.1 Introduction
2.2 Statistical Tools for Measuring Robustness Properties
2.2.1 The Influence Function
2.2.2 The Breakdown Point
2.2.3 Geometrical Interpretation
2.2.4 The Rejection Point
2.3 General Approaches for Robust Estimation
2.3.1 The General Class of M-estimators
2.3.2 Properties of M-estimators
2.3.3 The Class of S-estimators
2.4 Statistical Tools for Measuring Tests Robustness
2.4.1 Sensitivity of the Two-sample t-test
2.4.2 Local Stability of a Test: the Univariate Case
2.4.3 Global Reliability of a Test: the Breakdown Functions
2.5 General Approaches for Robust Testing
2.5.1 Wald Test, Score Test and LRT
2.5.2 Geometrical Interpretation
2.5.3 General ψ-type Classes of Tests
2.5.4 Asymptotic Distributions
2.5.5 Robustness Properties

3 Linear Regression
3.1 Introduction
3.2 Estimating the Regression Parameters
3.2.1 The Regression Model
3.2.2 Robustness Properties of the LS and MLE Estimators
3.2.3 Glomerular Filtration Rate (GFR) Data Example
3.2.4 Robust Estimators
3.2.5 GFR Data Example (continued)
3.3 Testing the Regression Parameters
3.3.1 Significance Testing
3.3.2 Diabetes Data Example
3.3.3 Multiple Hypothesis Testing
3.3.4 Diabetes Data Example (continued)
3.4 Checking and Selecting the Model
3.4.1 Residual Analysis
3.4.2 GFR Data Example (continued)
3.4.3 Diabetes Data Example (continued)
3.4.4 Coefficient of Determination
3.4.5 Global Criteria for Model Comparison
3.4.6 Diabetes Data Example (continued)
3.5 Cardiovascular Risk Factors Data Example

4 Mixed Linear Models
4.1 Introduction
4.2 The MLM
4.2.1 The MLM Formulation
4.2.2 Skin Resistance Data
4.2.3 Semantic Priming Data
4.2.4 Orthodontic Growth Data
4.3 Classical Estimation and Inference
4.3.1 Marginal and REML Estimation
4.3.2 Classical Inference
4.3.3 Lack of Robustness of Classical Procedures
4.4 Robust Estimation
4.4.1 Bounded Influence Estimators
4.4.2 S-estimators
4.4.3 MM-estimators
4.4.4 Choosing the Tuning Constants
4.4.5 Skin Resistance Data (continued)
4.5 Robust Inference
4.5.1 Testing Contrasts
4.5.2 Multiple Hypothesis Testing of the Main Effects
4.5.3 Skin Resistance Data Example (continued)
4.5.4 Semantic Priming Data Example (continued)
4.5.5 Testing the Variance Components
4.6 Checking the Model
4.6.1 Detecting Outlying and Influential Observations
4.6.2 Prediction and Residual Analysis
4.7 Further Examples
4.7.1 Metallic Oxide Data
4.7.2 Orthodontic Growth Data (continued)
4.8 Discussion and Extensions

5 Generalized Linear Models
5.1 Introduction
5.2 The GLM
5.2.1 Model Building
5.2.2 Classical Estimation and Inference for GLM
5.2.3 Hospital Costs Data Example
5.2.4 Residual Analysis
5.3 A Class of M-estimators for GLMs
5.3.1 Choice of ψ and w(x)
5.3.2 Fisher Consistency Correction
5.3.3 Nuisance Parameters Estimation
5.3.4 IF and Asymptotic Properties
5.3.5 Hospital Costs Example (continued)
5.4 Robust Inference
5.4.1 Significance Testing and CIs
5.4.2 General Parametric Hypothesis Testing and Variable Selection
5.4.3 Hospital Costs Data Example (continued)
5.5 Breastfeeding Data Example
5.5.1 Robust Estimation of the Full Model
5.5.2 Variable Selection
5.6 Doctor Visits Data Example
5.6.1 Robust Estimation of the Full Model
5.6.2 Variable Selection
5.7 Discussion and Extensions
5.7.1 Robust Hurdle Models for Counts
5.7.2 Robust Akaike Criterion
5.7.3 General Cp Criterion for GLMs
5.7.4 Prediction with Robust Models

6 Marginal Longitudinal Data Analysis
6.1 Introduction
6.2 The Marginal Longitudinal Data Model (MLDA) and Alternatives
6.2.1 Classical Estimation and Inference in MLDA
6.2.2 Estimators for τ and α
6.2.3 GUIDE Data Example
6.2.4 Residual Analysis
6.3 A Robust GEE-type Estimator
6.3.1 Linear Predictor Parameters
6.3.2 Nuisance Parameters
6.3.3 IF and Asymptotic Properties
6.3.4 GUIDE Data Example (continued)
6.4 Robust Inference
6.4.1 Significance Testing and CIs
6.4.2 Variable Selection
6.4.3 GUIDE Data Example (continued)
6.5 LEI Data Example
6.6 Stillbirth in Piglets Data Example
6.7 Discussion and Extensions

7 Survival Analysis
7.1 Introduction
7.2 The Cox Model
7.2.1 The Partial Likelihood Approach
7.2.2 Empirical Influence Function for the PLE
7.2.3 Myeloma Data Example
7.2.4 A Sandwich Formula for the Asymptotic Variance
7.3 Robust Estimation and Inference in the Cox Model
7.3.1 A Robust Alternative to the PLE
7.3.2 Asymptotic Normality
7.3.3 Handling of Ties
7.3.4 Myeloma Data Example (continued)
7.3.5 Robust Inference and its Current Limitations
7.4 The Veteran's Administration Lung Cancer Data
7.4.1 Robust Estimation
7.4.2 Interpretation of the Weights
7.4.3 Validation
7.5 Structural Misspecifications
7.5.1 Performance of the ARE
7.5.2 Performance of the robust Wald test
7.5.3 Other Issues
7.6 Censored Regression Quantiles
7.6.1 Regression Quantiles
7.6.2 Extension to the Censored Case
7.6.3 Asymptotic Properties and Robustness
7.6.4 Comparison with the Cox Proportional Hazard Model
7.6.5 Lung Cancer Data Example (continued)
7.6.6 Limitations and Extensions

Appendices
A Starting Estimators for MM-estimators of Regression Parameters
B Efficiency, LRTρ, RAIC and RCp with Biweight ρ-function for the Regression Model
C An Algorithm Procedure for the Constrained S-estimator
D Some Distributions of the Exponential Family
E Computations for the Robust GLM Estimator
E.1 Fisher Consistency Corrections
E.2 Asymptotic Variance
E.3 IRWLS Algorithm for Robust GLM
F Computations for the Robust GEE Estimator
F.1 IRWLS Algorithm for Robust GEE
F.2 Fisher Consistency Corrections
G Computation of the CRQ

References
Index
Preface

The use of statistical methods in medicine, genetics and more generally in the health sciences has increased tremendously in the past two decades. More often than not, a parametric or semi-parametric model is used to describe the data and standard estimation and testing procedures are carried out. However, the validity and good performance of such procedures generally require strict adherence to the model assumptions, a condition that is in stark contrast with experience gained from field work. Indeed, the postulated models are often chosen because they help to understand a phenomenon, not because they fit the data at hand exactly.

Robust statistics is an extension of classical statistics that specifically takes into account the fact that the underlying models used by analysts are only approximate. The basic philosophy of robust statistics is to produce statistical procedures that are stable with respect to small changes in the data or to small model departures. These include 'outliers', influential observations and other more sophisticated deviations from the model or model misspecifications. There has been considerable work in robust statistics in the last forty years following the pioneering work of Tukey (1960), Huber (1964) and Hampel (1968), and the theory now covers all models and techniques commonly used in biostatistics. However, the lack of a simple introduction to the basic concepts, the absence of meaningful examples presented at the appropriate level and the difficulty in finding suitable implementations of robust procedures other than robust linear regression have impeded the development and dissemination of such methods. Meanwhile, biostatisticians continue to use 'ad-hoc' techniques to deal with outliers and underestimate the impact of model misspecifications.

This book is intended to fill the existing gap and present robust techniques in a consistent and understandable manner to all researchers in the health sciences and related fields interested in robust methods. Real examples chosen from the authors' experience or for their relevance in biomedical research are used throughout the book to motivate robustness issues, explain the central ideas and concepts, and illustrate similarities and differences with the classical approach. This material has previously been tested in several short and regular courses in academia, from which valuable feedback has been gained. In addition, the R code and data used for all examples discussed in the book are available on the supporting website (http://www.wiley.com/go/heritier). The data-based approach presented here makes it possible to acquire both the conceptual framework and the practical tools for not only a good introduction but also practical training in robust methods for a large spectrum of statistical models.
The book is organized as follows. Chapter 1 pitches robustness in the history of statistics and clarifies what it is supposed to do and not to do. Concepts and results are introduced in a general framework in Chapter 2. This chapter is more formalized as it presents the ideas and the results in their full generality. It presents in a more mathematical manner the basic concepts and statistical tools used throughout the book, to which the interested reader can refer when studying a particular model presented in one of the following chapters. Fundamental tools such as the influence function, the breakdown point and M-estimators are defined here and illustrated through examples. Chapters 3 to 7 are structured by model and include specific elements of theory but the emphasis is on data analysis and interpretation of the results. These five chapters deal respectively with robust methods in linear regression, mixed linear models, generalized linear models, marginal longitudinal data models, and models for survival analysis. Techniques presented in this book focus in particular on estimation, uni- and multivariate testing, model selection, model validation through prediction and residual analysis, and diagnostics. Chapters can be read independently of each other but starting with linear regression (Chapter 3) is recommended. A short introduction to the corresponding classical procedures is given at the beginning of each chapter to facilitate the transition from the classical to the robust approach. It is however assumed that the reader is reasonably familiar with classical procedures. Finally, some of the computational aspects are discussed in the appendix. 
The intended audience for this book includes: biostatisticians who wish to discover robust statistics and/or update their knowledge with the more recent developments; applied researchers in the medical or health sciences interested in this topic; advanced undergraduate or graduate students acquainted with the classical theory of their model of interest; and also researchers outside the medical sciences, such as scientists in the social sciences, psychology or economics. The book can be read at different levels. Readers mainly interested in the potential of robust methods and their applications in their own field should grasp the basic statistical methods relevant to their problem and focus on the examples given in the book. Readers interested in understanding the key underpinnings of robust methods should have a background in statistics at the undergraduate level; for the understanding of the finer theoretical aspects, a background at the graduate level is required. Finally, the datasets analyzed in this book can be used by the statistician familiar with robustness ideas as examples that illustrate the practice of robust methods in biostatistics. The book does not include all the available robust tools developed so far for each model, but rather a selected set chosen for its practical use in biomedical research. The emphasis has been on choosing only one or two methods for each situation, the methods being selected for their efficiency (at different levels) and their practicality (i.e. their implementation in the R package robustbase), hence making them directly available to the data analyst. This book would not exist without the hard work of all the statisticians who have contributed directly or indirectly to the development of robust statistics, not only the ones cited in this book but also those who are not.
Acknowledgements

We are indebted to Elvezio Ronchetti and Chris Field for stimulating discussions, comments on early versions of the manuscript and their encouragement during the writing process; to Tadeusz Bednarski for valuable exchanges about robust methods for the Cox model and for providing his research code; and to Steve Portnoy for his review of the section on censored regression quantiles. We also thank Sally Galbraith, Serigne Lo and Werner Stahel for reading some parts of the manuscript and for giving useful comments, Dominique Couturier for his invaluable help in the development of R code for the regression model and the mixed linear model, and Martin Mächler and Andreas Ruckstuhl for implementing the robust GLM in the robustbase package. The GFR data have been provided by Judy Simpson and the cardiovascular data by Pascal Bovet. Finally, we would like to thank the staff at Wiley for their support, as well as our respective institutions and our understanding colleagues and students, who had to endure our regular ‘blackouts’ from daily work during the writing process of this book.
1 Introduction
1.1 What is Robust Statistics?

The scientific method is a set of principles and procedures for the systematic pursuit of knowledge involving the recognition and formulation of a problem, the collection of data through observation and experiment, and the formulation and testing of hypotheses (Merriam-Webster online dictionary, http://merriam-webster.com). Although procedures may differ according to the field of study, scientific researchers agree that hypotheses need to be stated as explanations of phenomena, and experimental studies need to be designed to test these hypotheses. In a more philosophical perspective, the hypothetico-deductive model for scientific methods (Whewell, 1837, 1840) was formulated as the following four steps: (1) characterizations (observations, definitions and measurements of the subject of inquiry); (2) hypotheses (theoretical, hypothetical explanations of the observations and measurements); (3) predictions (possibly through a model; logical deductions from the hypothesis or theory); (4) experiments (tests of (2) and (3), essentially to disprove them). It is obvious that statistical theory plays an important role in this process. Not only are measurements usually subject to uncertainty, but experiments are also set up using the theory of experimental design, and predictions are often made through a statistical model that accounts for the uncertainty or the randomness of the measurements. As statisticians, however, we are aware that models can at best be approximations (at least for the random part), and this introduces another type of uncertainty into the process. G. E. P. Box’s famous dictum that ‘all models are wrong, some models are useful’ (Box, 1979) is often invoked by the researcher when faced with the data to analyze. Hence, for truly honest scientific research, statistics should offer methods that deal not only with the uncertainty of the collected information (sampling error), but also with the fact that models are at best an approximation of reality. Consequently, statistics should be in ‘some sense’ robust to model misspecifications. This is important since the aim of scientific research is the pursuit of knowledge that is ultimately used to improve the well-being of people, as is obviously the case, for example, in medical research.

Robust methods date back to the prehistory of statistics, and they naturally start with outlier detection techniques and the subsequent treatment of the data. Mathematicians of the 18th century such as Bernoulli (1777) were already questioning the appropriateness of rejection rules, a common practice among astronomers of the time. The first formal rejection rules are suggested in the second half of the 19th century; see Hampel et al. (1986, p. 34) for details. Student (1927) proposes repetition (additional observations) in the case of outliers, combined with rejection. Independently, the use of mixture models and of simple estimators that can partly downweight observations appears from 1870 onwards; see Stone (1873), Edgeworth (1883), Newcomb (1886) and others. Newcomb even imagines a procedure that can be posthumously described as a sort of one-step Huber estimator (see Stigler, 1973). These attempts to reduce the influence of outliers, to make them harmless instead of discarding them, are in the same spirit as modern robustness theory; see Huber (1972), Harter (1974–1976), Barnett and Lewis (1978) and Stigler (1973). The idea of a ‘supermodel’ is proposed by Pearson (1916), who embedded the normal model, which had gained a central role at the turn of the 20th century, into a system of Pearson curves derived from differential equations. The curves are actually distributions in which two additional parameters are added to ‘accommodate’ most deviations from normality.

Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser © 2009 John Wiley & Sons, Ltd
The discovery by Pearson (1931) of the drastic instability of the test for equality of variances sparked the systematic study of the non-robustness of tests. Exact references on these developments can be found in Hampel et al. (1986, pp. 35–36). The term robust (strong, sturdy, rough) itself appears to have been proposed in the statistical literature by Box (1953). The field of modern robust statistics finally emerged with the pioneering works of Tukey (1960), Huber (1964) and Hampel (1968), and has been intensively developed ever since. Indeed, a rough bibliographic search in the Current Index to Statistics¹ revealed that since 1960 the number of articles having the word ‘robust’ in their title and/or keywords list has increased dramatically (see Figure 1.1). Compared with other well-established keywords, ‘robust’ appears to be quite popular: roughly half as popular as ‘Bayesian’ and ‘design’, but more popular than ‘survival’, ‘bootstrap’, ‘rank’ and ‘smoothing’. Is robust statistics really as popular as it appears to be, in the sense that it is used fairly routinely in practical data analysis? We do not really believe so. It might be that the word ‘robust’ is associated with other keywords such as ‘rank’, ‘smoothing’ or ‘design’ because of the perceived nature of these methods or procedures. We therefore performed a rough bibliographic search under the same conditions as before, but with the combination of the word ‘robust’ and each of the other words. The result is presented in Figure 1.2. It appears that although ‘robust’ is relatively more associated

¹http://www.statindex.org/
Figure 1.1 Number of articles (average per 2 years) citing the selected words in the title or in the keywords list according to the Current Index to Statistics (http://www.statindex.org/), December 2007.

with ‘design’ and ‘Bayesian’, when we remove all of the combined associations there remain 4367 articles citing the word ‘robust’ (group ‘other’), a fairly large number. We believe that this rather impressive number of articles use the term ‘robust’ in quite different manners. At this point, it could be worth searching more deeply, for example by taking a sample of articles and looking at the possible meanings or uses of the statistical term ‘robust’, but we do not attempt that here. Instead, we will clarify in what sense we use the term ‘robust’ or ‘robustness’ in the present book. We hope that this will help in clarifying, for the scientist, the extent and limitations of the theory of robust statistics as set out by Tukey (1960), Huber (1964) and Hampel (1968).
1.2 Against What is Robust Statistics Robust?

Robust statistics aims at producing consistent and reasonably efficient estimators, and test statistics with stable level and power, when the model is slightly misspecified.
Figure 1.2 Number of articles (average per 6 years) citing the selected words together with ‘robust’ in the title or in the keywords list according to the Current Index to Statistics (http://www.statindex.org/), December 2007.
Model misspecifications encompass a relatively large set of possibilities, and robust statistics cannot deal with all types of model misspecification. First we characterize the model using a cumulative probability distribution $F_\theta$ that captures the structural part as well as the random part of the model. The parameters needed for the structural part and/or the random part are included in the parameter vector $\theta$. For example, in the regression model that is thoroughly studied in Chapter 3, $\theta$ contains the (linear) regression coefficients (structural part) as well as the residual error variance (random part), and $F_\theta$ is the (conditional) normal distribution of the response variable (given the set of explanatory variables). Here $F_\theta$ does not need to be fully parametric; e.g. the Cox model presented in Chapter 7 can also be used. Then, by ‘slight model misspecification’, we assume that the …

… the null hypothesis

$$H_0: D = \begin{pmatrix} \sigma_{\gamma_0}^2 & 0 \\ 0 & 0 \end{pmatrix}$$

is compared to the alternative

$$H_1: D = \begin{pmatrix} \sigma_{\gamma_0}^2 & \sigma_{\gamma_{01}} \\ \sigma_{\gamma_{01}} & \sigma_{\gamma_1}^2 \end{pmatrix},$$
with $\sigma_{\gamma_1}^2 > 0$ to guarantee that $D$ is positive-definite. As two additional parameters $\sigma_{\gamma_{01}}$ and $\sigma_{\gamma_1}^2$ have been added to the model, a naive application of the classical theory would compare the corresponding LRT statistic with a $\chi_2^2$ distribution. The exact theory states that a mixture with equal weights 0.5 for $\chi_1^2$ and $\chi_2^2$ must be used. Therefore, a naive analysis could lead to larger p-values and, hence, acceptance of oversimplified variance structures. This result also holds for the REML-based LRT (Morrell, 1998).
Table 4.1 Estimates and standard errors for the REML for the skin resistance data using model (4.2)–(4.3), with and without observation 15.

                REML                        REML without observation 15
Parameter       Estimate (SE)     p-value   Estimate (SE)     p-value
µ                2.030 (0.341)    <10^-4     1.817 (0.284)    <10^-4
λ1              −0.213 (0.334)    0.525      0.076 (0.246)    0.756
λ2               0.842 (0.334)    0.014      0.580 (0.246)    0.221
λ3               0.549 (0.334)    0.105      0.234 (0.246)    0.345
λ4              −0.526 (0.334)    0.120     −0.399 (0.246)    0.110
σ_s              1.190                       0.994
σ_ε              1.495                       1.068
As it is more accurate with a small sample, we use this variant on the orthodontic growth data. The LRT statistic for testing $H_0: \sigma_{\gamma_1}^2 = 0, \sigma_{\gamma_{01}} = 0$ returns a value of $2(-216.3 + 216.9) = 1.2$. A correct p-value is therefore $p = 0.5\,P(\chi_2^2 > 1.2) + 0.5\,P(\chi_1^2 > 1.2) = 0.41$, whereas the naive calculation yields $p = 0.55$. In this case, both procedures conclude that a second random effect is probably not necessary (assuming that no robustness issue arises here). Stram and Lee (1994) also consider the case of testing $k$ versus $k + 1$ random effects. In that case, a mixture with equal weights 0.5 for $\chi_k^2$ and $\chi_{k+1}^2$ is obtained for the asymptotic distribution. A more complex mixture is also available when $l > 1$ random effects are added to the model, but it requires complex calculations. Again, extensions of these results to LRTs based on the REML are possible (see Morrell, 1998). Recent work by Scheipl et al. (2008) shows that they are generally more powerful and should therefore be preferred. The Wald test and classical confidence intervals for the variance parameters must also be corrected, bearing in mind that they are generally outperformed by their LRT counterparts. Finally, a good account of these problems with applications can be found in Verbeke and Molenberghs (2000, pp. 64–74).
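The mixture p-value above is easy to verify numerically. A minimal sketch (in Python with scipy, as an illustrative stand-in for the book's R code):

```python
from scipy import stats

# LRT statistic for H0: sigma_gamma1^2 = 0, sigma_gamma01 = 0
lrt = 2 * (-216.3 + 216.9)                      # = 1.2

# Naive reference distribution: chi-square with 2 degrees of freedom
p_naive = stats.chi2.sf(lrt, df=2)

# Correct reference distribution: equal-weight mixture of chi2_1 and chi2_2
p_mixture = 0.5 * stats.chi2.sf(lrt, df=2) + 0.5 * stats.chi2.sf(lrt, df=1)

print(round(p_naive, 2), round(p_mixture, 2))   # 0.55 0.41
```

The naive p-value is larger, as the text warns, which is what pushes a naive analysis toward oversimplified variance structures.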
4.3.3 Lack of Robustness of Classical Procedures

To illustrate the sensitivity of the classical estimators introduced in Section 4.3.1, let us go back to the skin resistance data. In Figure 4.1 we saw that, out of the 80 readings, two measurements (resistance of electrodes of types 2 and 3) taken on subject 15 were much larger than the others. The experimenter discovered later that the reason for these two rather large readings was the excessive amount of hair on the subject’s arm (see Berry, 1987). Table 4.1 presents the classical (REML) estimates and standard errors with and without case 15.⁵ One may notice that there is considerable variation in the estimates of the different electrode types (significant fixed effects) when observation 15 is present in the data.

⁵The raw data have been divided by 100.
These differences are less obvious when case 15 is removed from the data. Also, a large difference is observed in the residual error variance estimate $\hat\sigma_\varepsilon^2$ when it is computed with and without case 15. This clearly illustrates the lack of robustness of the REML. To quantify the sensitivity of the MLE and REML in a more formal way, we use the IF (see Section 2.2.1), which offers an elegant way to justify these empirical findings theoretically. Indeed, both the MLE and REML are M-estimators defined through the estimating equations (4.19)–(4.20) and (4.19)–(4.24). Their IF is therefore proportional to their defining $\Psi$-functions. Specifically, the influence of the $i$th independent cluster (i.e. the five measurements of the $i$th subject in the skin resistance experiment) for both classical estimators of $\beta$ is proportional to the score function for that parameter

$$s(y_i, x_i; \theta) = x_i^T V_i^{-1} (y_i - x_i \beta). \qquad (4.27)$$

This quantity is unbounded in $y_i$ and in $x_i$, which proves theoretically that both the MLE and REML estimates for the fixed effects are not robust. The situation is even worse for the variance components. The IF of $\hat\alpha_{[MLE]}$ is proportional to the summand in (4.20), a quadratic form in $y_i$, and, as a result, a single abnormal response (such as case 15’s readings for type 2 and 3 electrodes) can ruin $\hat\alpha_{[MLE]}$. It is not possible to assess directly the effect of a single cluster on the REML variance estimates, as the estimating equation cannot be defined at that level. However, a quadratic form appears on the left-hand side of (4.24), showing that $\hat\alpha_{[REML]}$ is just as sensitive as $\hat\alpha_{[MLE]}$ to contamination.
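The practical meaning of this unboundedness can be illustrated with a toy sketch (hypothetical numbers, not the skin resistance data): a single grossly inflated response drags the classical location and scale estimates arbitrarily far, mirroring the unbounded score above.

```python
import numpy as np

# Toy data: seven ordinary readings and one grossly inflated one,
# mimicking case 15's abnormal electrode readings.
y = np.array([1.8, 2.1, 1.9, 2.2, 2.0, 1.95, 2.05, 40.0])

print(y.mean())        # 6.75: classical mean, dragged toward the outlier
print(y.var(ddof=1))   # classical variance, inflated by the quadratic form
print(y[:-1].mean())   # 2.0: close to the bulk of the data without it
```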
4.4 Robust Estimation

4.4.1 Bounded Influence Estimators

It is possible to extend the bounded-influence approach of Section 2.3.2 to MLMs. Most of these methods are based on a weighted version of the likelihood, either directly (Huggins, 1993; Huggins and Staudte, 1994), where a robustified likelihood is maximized, or through a weighted score equation (Richardson and Welsh, 1995; Richardson, 1997; Stahel and Welsh, 1997). Summarizing the previous work, Welsh and Richardson (1997) introduce a very general class that encompasses most of the previous proposals through

$$\sum_{i=1}^n x_i^T W_{0i} V_i^{-1/2}\, \psi_{0i}\bigl(V_i^{-1/2} U_{0i}(y_i - x_i \beta)\bigr) = 0 \qquad (4.28)$$

for the fixed effects, and

$$\frac{1}{2}\sum_{i=1}^n \Bigl\{\psi_{1i}\bigl(V_i^{-1/2} U_{1i}(y_i - x_i \beta)\bigr)^T W_{1i} V_i^{-1/2} [Z_j Z_j^T]_{(ii)}\, V_i^{-1/2} W_{1i}\, \psi_{2i}\bigl(V_i^{-1/2} U_{1i}(y_i - x_i \beta)\bigr) - \operatorname{tr}\bigl(K_{2i} V_i^{-1} [Z_j Z_j^T]_{(ii)}\bigr)\Bigr\} = 0 \qquad (4.29)$$
for each variance component $\sigma_j^2$. The matrices $K_{2i}$ are needed to ensure consistency at the normal model; see Welsh and Richardson (1997) for details. Equations (4.28) and (4.29) generalize the score equations (4.19) and (4.20) for the MLE. The choice of the weight matrices $W_{0i}$, $W_{1i}$, $U_{0i}$, $U_{1i}$ and of the functions $\psi_{0i}$, $\psi_{1i}$, $\psi_{2i}$ defines each particular estimator, including Huggins’ earlier proposals. The $\psi$-functions are typically chosen as Huber functions applied to all components, but other choices are also possible. The robust estimator with all weights equal to one and $\psi_0 = \psi_1 = \psi_2$ is called robust MLE II in Richardson and Welsh (1995), as (4.29) is analogous to Huber’s Proposal 2 in linear regression. Likewise, the choice $\psi_0 = \psi_2$ and $\psi_1(z) = z$ gives the robust MLE I of Richardson and Welsh (1995). It is also possible to define robust versions of the REML by using weighted equations similar to (4.29), the difference being a more complex trace term.⁶ As before, two variants exist and are called robust REML I and II in Richardson and Welsh (1995) and Welsh and Richardson (1997). As all the proposals discussed here are defined through estimating equations of the type $\Psi(y_i, x_i; \theta)$ where $\theta = (\beta^T, \alpha^T)^T$, the general asymptotic theory for M-estimators applies. Although these developments generalize the bounded-influence approach of Section 3.2.4 at a considerable level of generality, several limitations can be mentioned. First, computation is generally complicated by the presence of the complex matrices $K_{2i}$ required for consistency. The problem may even become intractable for redescending $\psi$ or complex variance structures. Second, in the presence of contaminated data, some small residual bias in the robust variance estimates remains, even for the robust REML proposals; see the simulation results in Richardson and Welsh (1995, pp. 1437–1438). Finally, the breakdown point of such bounded influence estimators can be low, and this may be an issue in complex models.
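For concreteness, the componentwise Huber $\psi$-function mentioned above can be sketched as follows (a Python stand-in for the book's R code; the clipping constant c shown is an illustrative default, not a value prescribed by the text):

```python
import numpy as np

def huber_psi(z, c=1.345):
    """Huber's psi applied componentwise: identity for |z| <= c, clipped at +-c.
    The constant c trades robustness (small c) against efficiency (large c)."""
    return np.clip(z, -c, c)

# Large standardized residuals are bounded; moderate ones pass through unchanged.
r = np.array([-3.0, -0.5, 0.2, 4.0])
print(huber_psi(r))   # [-1.345 -0.5    0.2    1.345]
```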
4.4.2 S-estimators

The reformulation of the MLM as a multivariate normal model offers an elegant way to tackle the robustification problem. Specifically, S-estimators, introduced earlier in Section 2.3.3 for their good breakdown properties, can easily be generalized to balanced MLMs, i.e. models of type (4.8)–(4.9) where the cluster size $p_i = p$ and $V_i = V$ for all clusters. This assumption is certainly not desirable from a practical perspective, as many applications involve unbalanced data or a variable number of repeated measures over time. As this theory is new (Copt and Victoria-Feser, 2006), there is however hope that this limitation will be relaxed in the near future. In the multivariate normal setting, one can define an S-estimator for the mean $\mu$ and covariance $V$ as the solution for these parameters that minimizes $\det(V) = |V|$ subject to

$$n^{-1} \sum_{i=1}^n \rho(d_i) = b_0, \qquad (4.30)$$

⁶The equation is similar to (4.29), with the trace term $\operatorname{tr}(K_2 P Z_j Z_j^T)$, where $K_2 = \operatorname{diag}(K_{21}, \dots, K_{2n})$, sitting outside the summation over $i$.
where

$$d_i^2 = (y_i - \mu)^T V^{-1} (y_i - \mu) \qquad (4.31)$$

are the Mahalanobis distances, $\rho$ is a bounded function and $b_0 = E_\Phi[\rho(d)]$ ensures consistency at the normal model. Using the relationship (2.33), the tuning parameter of the $\rho$-function can be chosen to achieve a pre-specified breakdown point $\varepsilon^*$ (see Section 2.3.3). A typical choice for $\rho$ is Tukey’s biweight given in (2.20). For the balanced case, the marginal MLM (4.9) simply becomes $y_i \sim \mathcal{N}(x_i \beta; V)$, where the common covariance matrix is

$$V = \sum_{j=0}^{r} \sigma_j^2\, z_j z_j^T \qquad (4.32)$$

and $z_j$ is the (common) element of the design matrix $Z_j$ for a particular cluster. In the skin resistance data example, $V$ is given by (4.32) (see also (4.12)), with $z_0 z_0^T = I_5$ (for the residual variance) and $z_1 z_1^T = e_5 e_5^T$ (for the subject random effect variance). Likewise, for the semantic priming data example, according to (4.16), we have that $z_0 z_0^T = I_6$ (for the residual variance), and $z_1 z_1^T = J_6$, $z_2 z_2^T = I_2 \otimes J_3$ and $z_3 z_3^T = J_2 \otimes I_3$ for the subject and its factors’ interactions random effects variances. The additional structure on the mean and covariance matrix implied by the MLM formulation does not create additional difficulties in extending the definition of an S-estimator to that setting. Indeed, it can be defined as the solution for $\beta$, $\sigma_j^2$, $j = 0, \dots, r$, of the same minimization problem under the constraint (4.30), with

$$d_i = d_i(\beta) = \sqrt{(y_i - x_i \beta)^T V^{-1} (y_i - x_i \beta)} \qquad (4.33)$$

and $V$ having the particular structure (4.32). The problem can be restated as solving the estimating equations

$$\sum_i \Psi_\beta(y_i, x_i; \theta) = \sum_i w(d_i)\, x_i^T V^{-1} (y_i - x_i \beta) = 0 \qquad (4.34)$$

for $\beta$, and

$$\sum_i \Psi_{\sigma_j^2}(y_i, x_i; \theta) = \sum_i \bigl\{p\, w(d_i)(y_i - x_i \beta)^T V^{-1} z_j z_j^T V^{-1} (y_i - x_i \beta) - w(d_i)\, d_i^2 \operatorname{tr}[V^{-1} z_j z_j^T]\bigr\} = 0, \quad j = 0, \dots, r, \qquad (4.35)$$

for the variance components $\alpha = (\sigma_0^2, \dots, \sigma_r^2)^T$ (see Copt and Victoria-Feser, 2006). Here $w(d) = \frac{(\partial/\partial d)\rho(d)}{d}$ is the robust weight given to each observation. Equations (4.34) and (4.35) can be rewritten in a more compact form as $\sum_i \Psi(y_i, x_i; \theta) = 0$, where $\Psi = (\Psi_\beta^T, \Psi_{\sigma_0^2}, \dots, \Psi_{\sigma_r^2})^T$. We propose to use Tukey’s biweight $\rho$-function (2.20) and call the resulting robust estimator CBS, for constrained biweight S-estimator.
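The biweight $\rho$ and its associated weight $w(d) = \rho'(d)/d$ used by the CBS estimator can be sketched as follows (an illustrative Python version; the book's implementation is in R):

```python
import numpy as np

def rho_biweight(d, c):
    """Tukey's biweight rho: increasing up to |d| = c, then constant at c^2/6,
    so observations with huge Mahalanobis distance stop contributing to (4.30)."""
    d = np.minimum(np.abs(d), c)
    return (c**2 / 6) * (1 - (1 - (d / c) ** 2) ** 3)

def w_biweight(d, c):
    """Associated weight w(d) = rho'(d)/d = (1 - (d/c)^2)^2 for |d| <= c, else 0."""
    d = np.abs(d)
    return np.where(d <= c, (1 - (d / c) ** 2) ** 2, 0.0)

c0 = 4.65                     # 50% breakdown constant for p = 5 (Table 4.2)
print(w_biweight(0.0, c0))    # 1.0: small distances keep full weight
print(w_biweight(10.0, c0))   # 0.0: gross outliers are fully downweighted
```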
Like for S-estimators in the linear regression model in Section 3.2.4, (4.34) and (4.35) may have multiple roots, and hence a good high breakdown point estimator is needed as a starting point to find the solution of (4.34) and (4.35) with a high breakdown point. A simple algorithm has been suggested by Copt and Victoria-Feser (2006) and is given in Appendix C; the way the high breakdown point starting estimator is obtained is also detailed there. Following Davies (1987) and Lopuhaä (1992) for the multivariate normal case, Copt and Victoria-Feser (2006) prove that, under mild regularity conditions, a (constrained) S-estimator of $\theta$ defined through (4.34) and (4.35) is consistent and asymptotically normally distributed. In particular, if the inverse of $\sum_{i=1}^n x_i^T x_i$ exists, $\hat\beta_{[S]}$ has an asymptotic variance given by

$$\frac{e_1}{e_2^2}\left(\sum_{i=1}^n x_i^T x_i\right)^{-1}\left(\sum_{i=1}^n x_i^T V x_i\right)\left(\sum_{i=1}^n x_i^T x_i\right)^{-1}, \qquad (4.36)$$

where

$$e_1 = \frac{1}{p}\, E_\Phi[d^2 w(d)^2] \qquad (4.37)$$

and

$$e_2 = E_\Phi\left[w(d) + \frac{1}{p}\, d\, \frac{\partial}{\partial d} w(d)\right] \qquad (4.38)$$

and $w$ is the weight function associated with the $\rho$-function. The constrained S-estimator of $\alpha$ is also asymptotically normally distributed, with variance given by a complex sandwich formula (omitted here for simplicity); see Copt and Victoria-Feser (2006, p. 294).
4.4.3 MM-estimators

In the same spirit as in the regression setting (see Section 3.2.4), Copt and Heritier (2007) propose MM-estimators for the main effects parameters. They possess many good properties, i.e. a high breakdown point even in the presence of leverage points and a good efficiency and, unlike S-estimators, they can be used to build a robust LRT-type test. This last property was the key incentive for their introduction; see also Section 4.5. The class of MM-estimators was first introduced by Yohai (1987) in the linear regression setting and was then generalized by Lopuhaä (1992) and Tatsuoka and Tyler (2000) to the multivariate linear model. The idea is to dissociate the estimation of the regression parameter (fixed effects) and variance components (random effects), and proceed in two steps. In the MLM setting, one can first obtain a high breakdown point estimator for the covariance matrix via the CBS estimator ($\hat V_{[CBS]}$), then use a better tuned $\rho$-function to obtain a more efficient M-estimator for the fixed effects parameter (i.e. $\beta$). In practice the initial variance estimator is based on a $\rho$-function $\rho_0(d; c_0)$, and the final estimator on $\rho_1(d; c_1)$. The tuning constants are usually chosen to achieve a specific breakdown point (through $c_0$) and efficiency
(through $c_1$) at the model. Technically, the second step amounts to solving for $\beta$

$$\sum_{i=1}^n \Psi(y_i, x_i; \beta) = \sum_{i=1}^n w_1(d_i)\, x_i^T \hat V^{-1} (y_i - x_i \beta) = 0, \qquad (4.39)$$
where e.g. $\hat V = \hat V_{[CBS]}$, and $w_1(d) = (\partial/\partial d)\rho_1(d; c_1)/d$ is the weight function associated with $\rho_1$, i.e. the $\rho$-function in the M-step. The solution of (4.39) is the MM-estimator $\hat\beta_{[MM]}$ of $\beta$. Two choices for $w_1(\cdot)$ (and, hence, $\rho_1$) arise naturally from the regression setting: either Huber’s $\rho$-function (see Equation (2.17)) or the bounded Tukey’s biweight $\rho$-function (see Equation (2.20)), leading to $\hat\beta_{[Hub]}$ and $\hat\beta_{[bi]}$, respectively. The corresponding weights are

$$w_1(d) = \min(1, c_1/|d|) \qquad (4.40)$$

for Huber’s weights, and

$$w_1(d) = \begin{cases} \left(1 - \left(\dfrac{d}{c_1}\right)^2\right)^2 & \text{if } |d| \le c_1 \\ 0 & \text{if } |d| > c_1 \end{cases} \qquad (4.41)$$
for Tukey’s biweight weights (see also (3.14)). These two proposals serve different purposes. Huber’s estimator is well adapted to the cases when model deviations occur in the response variable only such as in ANOVA or models with wellcontrolled covariates. It can, however, be severely biased in the presence of (bad) leverage points. This is not the case with Tukey’s biweight which is robust to both response and covariate extreme observations. Note that for Huber’s weights (4.40), the associated ρ-function is (2.17), and for biweight weights (4.41) it is (3.15) with c replaced by c1 in both cases. Copt and Heritier (2007) show that, under mild conditions on ρ1 , √ ˆ n(β[MM] − β) has a limiting normal distribution with zero mean and var(βˆ[MM] ) = H = (1/n)M −1 QM −T where M and Q are proportional to 0016 = EK [x T 0004 −1 x] and K is the covariates’ distribution.7 A simpler representation for H can thus be given by 1 e1 H = EK [x T 0004 −1 x]−1 , (4.42) n e22 where e1 and e2 are given in (4.37) and (4.38), respectively, with w(d) = (∂/∂d) ρ1 (d)/d. In the case of fixed covariates, K can be replaced by the covariates’ empirical distribution in (4.42) yielding an asymptotic variance matrix H proportional to the asymptotic variance of the MLE (4.22). The multiplicative constant e1 /e22 will 7 In this section, we work under slightly more general conditions than in Section 4.4.2 by assuming that the covariates are not necessarily fixed but have a common distribution K. The rationale for this is to be able to account for leverage points or other problems in the covariates space. If one does not want to specify a particular model for K and therefore get back to the previous setting, one only needs to replace K by the empirical distribution of x.
Table 4.2 Values for c0 and c1 for Tukey’s biweight ρ-function (2.20) for the multivariate normal model.

p                              1     2     3     4     5     6     7     8     9     10
c0 (50% breakdown point)      1.56  2.66  3.45  4.09  4.65  5.14  5.59  6.01  6.40  6.77
c1 (95% efficiency)           4.68  5.12  5.51  5.82  6.10  6.37  6.60  6.83  7.04  7.25
be used to calibrate the efficiency of the MM-estimator (see below). However, we prefer to ignore the reduced form (4.42) when deriving an estimate of $H$, and use instead the sample analog of the sandwich formula

$$\hat H = \frac{1}{n}\, \hat M^{-1} \hat Q \hat M^{-1},$$

where $\hat M$ and $\hat Q$ are the empirical versions of (2.28) and (2.29) for the MLM. For instance, $\hat M = (1/n)\sum_{i=1}^n \Psi(y_i, x_i; \beta)\, s(y_i, x_i; \beta)^T$, with $\Psi$ as in (4.39) and $s$ the score function (4.27), where again $\hat V_{[CBS]}$ has been plugged in for $V$. Such an estimator is usually more robust when extreme covariate values are observed. Numerical values are obtained by replacing $\beta$ with $\hat\beta_{[MM]}$.
4.4.4 Choosing the Tuning Constants

As mentioned earlier, the constant $c_0$ of $\rho_0$ is chosen to ensure a high (asymptotic) breakdown point $\varepsilon^*$ (50% in our case) for the initial estimate $\hat V_{[CBS]}$. For that purpose, the relationship

$$E[\rho_0(d; c_0)] = \varepsilon^* \max_x \rho_0(x; c_0)$$

is solved for $c_0$ to achieve the pre-specified breakdown point $\varepsilon^*$ with (in our examples) Tukey’s biweight $\rho_0$. To determine the constant $c_1$, an efficiency level (typically 95%) needs to be specified a priori. As discussed earlier, formula (4.42) shows that the efficiency of the MM-estimator relative to the MLE is given by the ratio

$$\frac{e_2^2}{e_1} = p\, \frac{E_\Phi\bigl[w_1(d) + (1/p)\, d\, (\partial/\partial d) w_1(d)\bigr]^2}{E_\Phi[d^2 w_1(d)^2]} \qquad (4.43)$$

with $w_1(d)$ given in (4.40) or (4.41), depending on the choice of $\rho_1$ (Huber or biweight), and $c = c_1$. The constant $c_1$ is then found by equating (4.43) to the desired efficiency level (e.g. 95%). Note that in the univariate case ($p = 1$), (4.43) reduces to (3.20). Both constants depend on the dimension $p$ of the response vector and can be obtained by Monte Carlo simulation. They are summarized in Table 4.2 for Tukey’s
Table 4.3 Estimates and standard errors for the REML and the CBS–MM for the skin resistance data using model (4.2).

                REML                        CBS–MM
Parameter       Estimate (SE)     p-value   Estimate (SE)     p-value
µ                2.030 (0.341)    <10^-4     1.440 (0.233)    <10^-4
λ1              −0.213 (0.334)    0.525     −0.161 (0.175)    0.356
λ2               0.842 (0.334)    0.014      0.403 (0.175)    0.021
λ3               0.549 (0.334)    0.105      0.243 (0.175)    0.163
λ4              −0.526 (0.334)    0.120     −0.169 (0.175)    0.332
σ_s              1.190                       0.842
σ_ε              1.459                       0.761

CBS computed with c0 = 4.65 and MM (biweight) computed with c1 = 6.10.
biweight $\rho$-functions (for $\rho_0$ and $\rho_1$). When $p$ becomes large, an asymptotic approximation given by Rocke (1996, p. 1330) can be used for Tukey’s biweight, which yields $c_1 = \sqrt{p}/m$, where $m$ is defined through $\rho_{[bi]}(m) = 0.5\, \rho_{[bi]}(1)$, with $\rho_{[bi]}$ given in (2.20) with $c = 1$. This approximation gives reasonable results for $p > 10$. Finally, note that the values of $c_0$ and $c_1$ given here obviously depend on the choice of the $\rho$-function and would need to be recomputed had other $\rho$-functions been used. Another option is available for the Huber estimator, i.e. when the Huber weights (4.40) are chosen. It stems from the fact that $\rho$ in (2.17) is a function of $d$, the Mahalanobis distance. As $d^2$ has a chi-squared distribution with $p$ degrees of freedom, $\chi_p^2$, $c_1$ can be chosen as the square root of a specific quantile of this distribution.
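Both calibration rules can be checked numerically. The sketch below (Python, an illustrative translation of the calculation rather than the book's code, with Monte Carlo in place of exact expectations) verifies that c1 = 4.68 gives roughly 95% efficiency for the biweight at p = 1, and computes the chi-square-quantile choice of c1 for Huber weights:

```python
import math
import numpy as np
from scipy import stats

# (a) Monte Carlo check of the efficiency ratio e2^2/e1 in (4.43) for
# Tukey's biweight at p = 1, where w1(d) = (1 - (d/c1)^2)^2 for |d| <= c1.
rng = np.random.default_rng(42)
d = np.abs(rng.standard_normal(1_000_000))   # d = |Z|, Z ~ N(0, 1)
c1 = 4.68                                    # Table 4.2 value for p = 1

w = np.where(d <= c1, (1 - (d / c1) ** 2) ** 2, 0.0)
dw = np.where(d <= c1, -4 * d / c1**2 * (1 - (d / c1) ** 2), 0.0)  # dw/dd

e1 = np.mean(d**2 * w**2)     # (1/p) E[d^2 w(d)^2], p = 1
e2 = np.mean(w + d * dw)      # E[w(d) + (1/p) d w'(d)], p = 1
print(e2**2 / e1)             # close to the 0.95 efficiency target

# (b) Chi-square rule for Huber weights: d^2 ~ chi^2_p at the model, so c1
# can be the square root of a chosen quantile, e.g. the 95th, with p = 5.
c1_huber = math.sqrt(stats.chi2.ppf(0.95, df=5))
print(round(c1_huber, 2))
```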
4.4.5 Skin Resistance Data (continued)

As an illustration, we go back to the skin resistance data. Table 4.3 presents the robust MM estimates $\hat\beta_{[bi]}$ and robust CBS estimates $\hat\alpha_{[CBS]}$⁸ and standard errors for the electrode resistance data,⁹ along with the REML estimates obtained earlier. The MM contrast estimates are not affected by case 15’s extreme readings for electrodes of types 2 and 3. They are actually close to what was observed with case 15 removed from the analysis. The CBS variance estimates, especially the residual estimate, are much smaller, confirming the previous findings that the REML estimates are unduly inflated by the two abnormal readings. To limit the influence of potential outlying observations, Berry (1987) actually proposes to use a $\log(y + c)$ ($c = 32$) transformation of the data. A profile plot of the transformed data is presented in Figure 4.3. Graphically, the log-transformation limits the effect of the potential outliers (in particular observation 15). The estimated model parameters using the transformed data and the classical (REML)

⁸For simplicity, we call this set of robust estimators the CBS–MM.
⁹The raw data have been divided by 100.
MIXED LINEAR MODELS
[Figure: profile plot, mean of resistance (log scale) versus electrode type (E1–E5), one profile per subject.]
Figure 4.3 Profile plot for the skin resistance data (log-transformed).
and robust estimators are presented in Table 4.4. The overall mean, the contrasts, the variance components and the p-values are this time similar for the two methods. This illustrates the fact that outliers are model specific, i.e. the two abnormal readings on the original scale do not appear as extreme on the log-transformed one. This was not the case with the non-transformed data. We defer the discussion of the effect of the electrode type to Section 4.5.3.
4.5 Robust Inference

The MM-estimators were introduced earlier to offer more options for testing hypotheses on the main effects. Typical tests usually involve contrasts or multidimensional hypotheses stating that a component of the main effects parameter is null.
4.5.1 Testing Contrasts

A contrast test occurs when a linear combination of the elements of β, typically represented by a (q + 1)-vector L, is tested. For example, suppose that we have a
Table 4.4 Estimates and standard errors for the REML and the CBS–MM for the skin resistance data using model (4.2) with a log-transformed response.

            REML                       CBS–MM
Parameter   Estimate (SE)    p-value   Estimate (SE)    p-value
µ           4.913 (0.166)    <10⁻⁴     4.918 (0.176)    <10⁻⁴
λ1          −0.097 (0.158)   0.542     −0.058 (0.161)   0.718
λ2          0.396 (0.158)    0.015     0.376 (0.161)    0.019
λ3          0.179 (0.158)    0.262     0.167 (0.161)    0.299
λ4          −0.289 (0.158)   0.072     −0.282 (0.161)   0.079
σs          0.585                      0.610
σε          0.701                      0.689

CBS computed with c0 = 4.65 and MM (biweight) computed with c1 = 6.10.
one-factor within-subject ANOVA model with three levels, i.e. β = (β0, β1, β2) = (µ, λ1, λ2), and suppose that the design matrix x is parametrized as 'treatment' contrasts (see e.g. (4.11)) with the third level as the reference level. Suppose also that our goal is to test for differences among elements of the mean vector (µ1, µ2, µ3). The corresponding null and alternative hypotheses are

H0 : µ1 − µ3 = β1 = λ1 = 0 versus H1 : µ1 − µ3 = β1 = λ1 ≠ 0,
H0 : µ2 − µ3 = β2 = λ2 = 0 versus H1 : µ2 − µ3 = β2 = λ2 ≠ 0,
H0 : µ2 − µ1 = β2 − β1 = λ2 − λ1 = 0 versus H1 : µ2 − µ1 = β2 − β1 = λ2 − λ1 ≠ 0.

The corresponding contrasts L are Lᵀ = (0, 1, 0), Lᵀ = (0, 0, 1) and Lᵀ = (0, 1, −1). Simple robust inference for contrasts can be performed using an estimate of the asymptotic covariance of β̂[MM] given in (4.42). For H0 : Lᵀβ = 0, a robust z-test statistic is given by

z-statistic = Lᵀβ̂[MM] / SE(Lᵀβ̂[MM]),   (4.44)

with

SE(Lᵀβ̂[MM]) = √(Lᵀ Ĥ L).
The corresponding p-value is obtained by comparing (4.44) with the standard normal distribution. Note that, although we compute the z-statistic with the MM-estimator, the same sort of calculation can be done with the S-estimator using the appropriate asymptotic variance.
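As a sketch, the z-test (4.44) only needs the estimate β̂[MM], an estimate Ĥ of its asymptotic covariance and the contrast vector L; the function below (names and example inputs are hypothetical, not from the book's software) returns the statistic and the two-sided normal p-value:

```python
import math
from statistics import NormalDist

def contrast_z_test(beta_hat, H_hat, L):
    """Robust z-test for H0: L^T beta = 0 (eq. 4.44).
    beta_hat: list of coefficient estimates; H_hat: estimated asymptotic
    covariance matrix of beta_hat (list of lists); L: contrast vector."""
    est = sum(l * b for l, b in zip(L, beta_hat))
    var = sum(L[i] * H_hat[i][j] * L[j]
              for i in range(len(L)) for j in range(len(L)))
    z = est / math.sqrt(var)                      # L^T beta_hat / SE
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))    # two-sided normal p-value
    return z, p
```

For example, with a hypothetical β̂ = (5.0, 0.5, −0.2), a diagonal Ĥ with entries 0.04, and L = (0, 1, 0), the statistic is 0.5/0.2 = 2.5.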
4.5.2 Multiple Hypothesis Testing of the Main Effects

Tests involving multiple hypotheses can, for instance, be used to compare models with the same variance structure or to assess the statistical significance of a factor with several levels, such as the type of electrode in model (4.2). Denote again by βᵀ = (β(1)ᵀ, β(2)ᵀ) the partition of the vector β into q + 1 − k and k components, and by A(ij), i, j = 1, 2, the corresponding partition of (q + 1) × (q + 1) matrices. The hypothesis to be tested can usually be formulated as

H0 : β = β0 where β0(2) = 0, β0(1) unspecified,
H1 : β(2) ≠ 0, β(1) unspecified.

The need for robust testing in this setting is obvious, as the classical F-test has reportedly been found to be unreliable under sometimes mild model deviations (see e.g. Copt and Heritier, 2007). Robust alternatives to the classical Wald or score tests are readily available through (2.47) for the robust Wald test and (2.48) for the robust score test, for any model. Robustifying the LRT is probably the most natural route to build a robust alternative to the F-test but, as alluded to in Section 4.4.3, such a test does not always exist for S-estimators. The reason is that the corresponding test statistic is by construction zero. To see this, just note that the robust LRT in (2.50) is based on the difference in Σρ(di) between the full and reduced models. As the definition of S-estimators (4.30) sets both sums to b0 (up to a 1/n factor), the difference is simply zero. As shown in Copt and Heritier (2007), MM-estimators circumvent the problem by using another loss function ρ1, different from that used to build the S-estimator, so that the LRT statistic exists. A direct application of the general theory of robust testing introduced in Section 2.5 can then be used. Formally, the LRT statistic is computed in the same way as in the general case. Again let di(β) = (yi − xiβ)ᵀ Σ̂[S]⁻¹ (yi − xiβ) be the Mahalanobis distance for observation i, with Σ̂[S] a chosen S-estimator of Σ (e.g. Σ̂[CBS]). The robust LRT-type test statistic is given by

LRTρ = 2 Σⁿᵢ₌₁ [ρ(di(β̇[MM])) − ρ(di(β̂[MM]))],   (4.45)

where β̂[MM] and β̇[MM] are the robust estimators in the full and reduced models, respectively, with corresponding loss function ρ1. More specifically, LRTρ associated with the Huber estimator, respectively the biweight estimator, is defined through (4.39) with weight function (4.40), respectively (4.41), with corresponding ρ1 function given in (2.17), respectively in (3.15). In both cases the covariance matrix estimate is the CBS Σ̂[CBS].

An estimate of a robust Wald-type test statistic is naturally defined by

Wn² = β̂[MM](2)ᵀ Ĥ(22)⁻¹ β̂[MM](2),

where β̂[MM](2) is the robust MM-estimator of β(2) in the full model and Ĥ(22) the corresponding covariance estimate. Finally, a score- (or Rao-) type test statistic is given by

Rn² = Znᵀ Ĉ⁻¹ Zn,

where Zn = (1/n) Σⁿᵢ₌₁ ψ(yi, xi; β̇[MM])(2), and β̇[MM] is the MM-estimator in the reduced model with corresponding ψ-function given in (4.39) and weights in (4.40) for Huber's estimator and in (4.41) for Tukey's biweight estimator. The k × k positive-definite matrix Ĉ is Ĉ = M̂22.1 Ĥ(22) M̂22.1ᵀ with M̂22.1 = M̂(22) − M̂(21) M̂(11)⁻¹ M̂(12) from the partitioning of the matrix M̂. Again we have defined the three test statistics for the MM-estimators, but it is also possible to define the robust Wald and score tests in a similar fashion for the CBS estimators. Under the null hypothesis, their asymptotic distribution is the same as in the general parametric settings (see Section 2.5.4).
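A minimal sketch of the Wald-type statistic: given the sub-vector β̂[MM](2) and its estimated covariance block Ĥ(22), Wn² is a quadratic form. The small Gauss–Jordan inverse below is our own helper, not part of the book's software:

```python
def mat_inv(A):
    """Gauss-Jordan inverse of a small square matrix (lists of lists)."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        d = M[col][col]
        M[col] = [x / d for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def wald_statistic(beta2, H22):
    """Robust Wald-type statistic W_n^2 = beta2^T H22^{-1} beta2."""
    Hinv = mat_inv(H22)
    return sum(beta2[i] * Hinv[i][j] * beta2[j]
               for i in range(len(beta2)) for j in range(len(beta2)))
```

The resulting statistic is then compared with a chi-square distribution with k = len(beta2) degrees of freedom.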
4.5.3 Skin Resistance Data Example (continued)

We now return to the problem of testing the multivariate hypothesis of equality of mean resistances, given by

H0 : λ1 = λ2 = λ3 = λ4 = 0, µ unspecified,
H1 : H0 is not true,

irrespective of the chosen contrast matrix. The classical F-test statistic is 3.1455. When compared with an F4,60 distribution, we find a p-value of 0.020, so that the test is significant at the 5% level. We could conclude that there is a difference between the five electrode types. Using Tukey's biweight ρ-function, the robust LRT statistic yields a p-value of 0.086; at the same 5% level the test is, hence, not significant. Observations 15 and possibly 2 seem to have an influence on the MLE (or REML) estimates and consequently on the F-test. If the responses are log-transformed, the F-test statistic is 2.87, corresponding to a p-value of 0.03, and the robust LRT gives a p-value of 0.061. Although the log-transformation gives similar results for the parameters' estimates (see Table 4.4), it does not completely reduce the influence of the outlying observations (number 15 and possibly number 2) on the classical F-test: we still reject the null hypothesis of equal resistances. Note that Berry (1987) analyzes this dataset with subject 15 deleted, and finds a significant F-test on the original data (p-value of 0.044) and a non-significant F-test on the log-transformed data (p-value of 0.10).
4.5.4 Semantic Priming Data Example (continued)

The model used to analyze this dataset is given in (4.15), with λj, j = 1, 2, the fixed effect for the delay and γk, k = 1, 2, 3, the fixed effect for the condition. Table 4.5 gives the estimates for the REML and the CBS–MM, with the standard errors for the fixed effects computed using Tukey's biweight weights. The contrasts for each factor are the 'sum'-type contrasts. We can see that both methods detect a significant effect for the delay, but with a borderline p-value of 0.046 for the REML, whereas the message is clearer with the robust method, which yields a p-value of 0.003. Another
Table 4.5 Estimates and standard errors for the REML and the CBS–MM for the semantic priming data using model (4.15).

            REML                         CBS–MM
Parameter   Estimate (SE)      p-value   Estimate (SE)      p-value
µ           633.436 (28.465)   <10⁻⁴     586.420 (18.817)   <10⁻⁴
λ1          −18.071 (8.974)    0.046     −17.876 (6.082)    0.003
γ1          18.563 (13.732)    0.179     14.317 (11.691)    0.221
γ2          −51.222 (13.732)   <10⁻⁴     −56.994 (11.691)   <10⁻⁴
λγ11        −3.690 (12.691)    0.771     12.706 (10.582)    0.230
λγ12        16.809 (12.691)    0.188     8.844 (10.582)     0.403
σs          122.622                      77.991
σλs         0.006                        N/A
σγs         29.433                       27.199
σε          100.73                       81.885

CBS computed with c0 = 5.14 and MM (biweight) computed with c1 = 6.37.
important feature of this model is the estimation of the random effects. The robust estimate of the variance for the interaction between subject and delay is not reported. This is because the robust estimator gives a negative value. This can sometimes happen, as some of the variance components correspond to covariances between responses on the same subject and, hence, can in principle be negative. Standard algorithms included in common statistical packages work around this problem by imputing very small values close to zero each time a variance is found to be negative. In this example, using the R package lme, one obtains a small value (0.006) for the corresponding classical estimator (REML).¹⁰ We also tested the significance of each factor and of the interactions, using the F-test and the robust LRT-type test. Results are presented in Table 4.6. The classical F-test and robust LRT give similar results for the three hypotheses with, again, a stronger effect for the delay variable. In this example, the presence of possible outliers does not seem to influence the results of the tests. With this type of data, one can also consider a log-transformation, although in this domain one usually prefers the original scale, mainly for interpretation reasons. In Table 4.7 we give the REML and the CBS–MM estimates with corresponding standard errors. The estimates and p-values for significance testing are quite similar and lead to the same conclusions. Also note that, again, the variance of the random effect for the interaction between the subject and the delay is set to zero with the REML and found to be negative (and hence reported as N/A) with the CBS estimator. We can also test the significance of each factor and of the interactions, using the F-test and the robust LRT-type test. Results are presented in Table 4.8. Both approaches lead to similar conclusions.
¹⁰ The problem of negative variances is not specific to robust approaches but is a common problem in the general ANOVA/MLM setting; see, for example, Searle et al. (1992).
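The workaround described above (imputing a small positive value whenever a variance-component estimate comes out negative) can be sketched as follows; the function name and the floor value are purely illustrative:

```python
def clamp_variances(var_estimates, floor=1e-6):
    """Replace negative variance-component estimates by a small positive
    value, mimicking the workaround used by standard fitting software,
    and record which components were clamped."""
    clamped = {}
    for name, v in var_estimates.items():
        if v < 0.0:
            clamped[name] = v          # keep the offending value for reporting
            var_estimates[name] = floor
    return var_estimates, clamped
```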
Table 4.6 Classical F-test and robust LRT for the fixed effects of the semantic priming data using model (4.15).

                  p-value
Variable          Classical F-test   Robust LRT test
Delay             0.046              0.005
Condition         0.001              0.001
Delay:Condition   0.383              0.131

Robust LRT test computed using the CBS with c0 = 5.14 and the MM (biweight) with c1 = 6.37.
Table 4.7 Estimates and standard errors for the REML and the CBS–MM for the semantic priming data using model (4.15) with log-transformed data.

            REML                       CBS–MM
Parameter   Estimate (SE)    p-value   Estimate (SE)    p-value
µ           6.421 (0.040)    <10⁻⁴     6.386 (0.036)    <10⁻⁴
λ1          −0.027 (0.012)   0.025     −0.028 (0.010)   0.007
γ1          0.032 (0.021)    0.127     0.029 (0.022)    0.177
γ2          −0.088 (0.021)   <10⁻⁴     −0.092 (0.022)   <10⁻⁴
λγ11        0.002 (0.017)    0.873     0.011 (0.018)    0.526
λγ12        0.018 (0.017)    0.271     0.017 (0.018)    0.336
σs          0.173                      0.148
σλs         0.000                      N/A
σγs         0.069                      0.069
σε          0.136                      0.137

CBS computed with c0 = 5.14 and MM (biweight) computed with c1 = 6.37.
Table 4.8 Classical F-test and robust LRT for the fixed effects of the semantic priming data using model (4.15), with log-transformed data.

                  p-value
Variable          Classical F-test   Robust LRT test
Delay             0.025              0.008
Condition         0.0003             0.001
Delay:Condition   0.390              0.272

Robust LRT test computed using the CBS with c0 = 5.14 and the MM (biweight) with c1 = 6.37.
4.5.5 Testing the Variance Components

Most of the effort in robust testing in MLMs has focused on the main effects, because the variance parameters are often considered as nuisance parameters. If one is truly interested in testing whether some random effects could be removed, the same problem mentioned above arises. As the null hypothesis typically involves restrictions of the type σj² = 0, the null parameter vector θ0 lies on the boundary of the parameter space and, as a result, the general theory of Section 2.5.3 breaks down. One could conjecture that the same kind of mixture of χ² distributions could be used for the robust Wald test. However, such tests are known to perform poorly in the classical case and a similar behavior is expected in the robust case. The LRT could constitute a better alternative, but no such robust LRT exists to the best of our knowledge, as the only proposal to date, the robust LRT (4.45), only targets hypotheses on the fixed effects. At this stage, the only viable option seems to be bootstrapping techniques, with the warning mentioned in Chapter 2 that the simple bootstrap can fail when applied to robust estimators (as the breakdown point may be reached in some bootstrap samples). Our practical recommendation in that case is to use a robust estimator with a 50% breakdown point to have a good chance of avoiding the problem.
4.6 Checking the Model

4.6.1 Detecting Outlying and Influential Observations

Since the MLM can be seen as a multivariate normal model, multivariate tools can be used to measure how far the observations lie from the bulk of the data. Such a tool is given by the Mahalanobis distances in (4.31), in which β and Σ are replaced by suitable estimates. In order for the estimated Mahalanobis distances not to be influenced (hence biased) by extreme observations, it is necessary that β and Σ are replaced by their robust estimators, namely β̂[MM] and Σ̂[CBS]. One can then rely on the asymptotic result that di in (4.31) has an asymptotic χ²p distribution and, hence, compare the estimated Mahalanobis distances to, say, the corresponding 0.975 quantile. One can also, for comparison, estimate the Mahalanobis distances using the MLE or the REML for β and the variance components of Σ. A scatterplot of the robust versus classical Mahalanobis distances then reveals the outlying observations, i.e. the observations with both robust and classical Mahalanobis distances above the 0.975 quantile of the χ²p, as well as the influential observations, i.e. the observations with robust Mahalanobis distances above and classical Mahalanobis distances below that quantile. The latter are observations that the classical estimator is not able to detect but is influenced by. In multivariate settings such as the MLM, Mahalanobis distances are usually preferred to the weights per se to detect outlying observations.

As an example, consider the skin resistance dataset estimated in Section 4.4.5. In Figure 4.1 we saw that out of the 80 readings, two measurements (electrodes of
Figure 4.4 Scatterplot of the Mahalanobis distances for the skin resistance data. CBS computed with c0 = 4.65 and MM (biweight) with c1 = 6.10.
type 2 and 3) taken on subject 15 were much larger than the others. Observation number 2 corresponds to the second largest response. In Figure 4.4 we give the scatterplot of the Mahalanobis distances computed with the REML and the CBS–MM. The horizontal and vertical dotted lines correspond to the 0.975 quantile of the χ²5 distribution, used to detect outlying observations. The REML and CBS–MM estimators both flag observations 15 and 2 as outlying. No influential observation is present in the sample. With the log-transformed data, the scatterplot of the Mahalanobis distances given in Figure 4.5 shows that the CBS–MM detects observation 15 as an influential observation, and observation 2 is no longer considered extreme. As another example, consider the semantic priming dataset estimated in Section 4.5.4. In Figure 4.6 we give the scatterplot of the Mahalanobis distances computed with the REML and the CBS–MM. One can see that the REML and CBS–MM detect one outlier (observation 3) and the CBS–MM detects two influential observations (observations 8 and 16). These observations are certainly the cause of the differences found between the classical and robust estimates. With the log-transformed data, the scatterplot of the Mahalanobis distances for the corresponding REML and CBS–MM estimates is given in Figure 4.7. One can see that there are two outliers detected
Figure 4.5 Scatterplot of the Mahalanobis distances for the skin resistance data (log-transformed). CBS with c0 = 4.65 and MM (biweight) with c1 = 6.10.

by the REML and CBS–MM, and they do not seem to have much influence on the estimates.
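The outlying/influential classification used in Figures 4.4–4.7 can be sketched as follows, with squared Mahalanobis distances compared to a χ²p cutoff such as its 0.975 quantile (the function and variable names are ours):

```python
def classify_observations(d_classical, d_robust, cutoff):
    """Classify observations by comparing classical and robust Mahalanobis
    distances with a chi-square cutoff (e.g. the 0.975 quantile of chi^2_p):
    'outlying' if both distances exceed the cutoff, 'influential' if only
    the robust distance does."""
    outlying, influential = [], []
    for i, (dc, dr) in enumerate(zip(d_classical, d_robust), start=1):
        if dr > cutoff and dc > cutoff:
            outlying.append(i)
        elif dr > cutoff:
            influential.append(i)
    return outlying, influential
```

For p = 5, a cutoff of about 12.83 (the 0.975 quantile of χ²5) reproduces the reading of the dotted lines in the scatterplots.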
4.6.2 Prediction and Residual Analysis

As for the regression model of Chapter 3, residual analysis with MLMs is used to check the model fit and also the model assumptions. In order to compute residuals, one needs to be able to compute predicted values for the response vector y. For that, and with the MLM, one also needs to compute estimates for the random effects levels. Actually, one can define predicted (or fitted) response values at different levels of nesting or directly at the population level. Given estimated values for θ = (βᵀ, σ0², . . . , σr²)ᵀ, the predictions at the so-called population level are

ŷ = Xβ̂   (4.46)

and the predictions at the so-called cluster (lowest) level are

ŷ = Xβ̂ + Zγ̂.   (4.47)

We note that, depending on the problem and for hierarchical models, there might be different cluster levels, so that Zγ̂ in (4.47) can be modified accordingly. In all cases,
Figure 4.6 Scatterplot of the Mahalanobis distances for the semantic priming data. CBS with c0 = 5.14 and MM (biweight) with c1 = 6.37.

when predicting at the cluster levels, an estimate of γ is needed, so that the first step is to define estimators for the random effects levels. Recall that random effects are unobservable variables. However, given the information contained in a sample and given a model, it is possible to predict (an expected value of) the vector of random effects for each response. Classically, one uses the Best Linear Unbiased Predictor (BLUP) given by¹¹

γ̂ = DZᵀV⁻¹(y − Xβ),   (4.48)

where D = cov(γ). Given values for the variance components α, (4.48) is computed using (4.21) for β. An interesting interpretation of γ̂ is that it is the MLE based on the likelihood of the joint distribution f(y, γ) = f(y|γ)f(γ) (for fixed values of α). Henderson et al. (1959) propose a set of equations for the simultaneous estimation of γ̂ and β̂, indeed based on the joint distribution of y and γ.

Prediction and residual analysis with robust estimators is not as straightforward as replacing all parameters in (4.48) by their robust estimates. If we choose this simple approach, we face the risk that the random effect corresponding to a particular observation yijk could be over- or underestimated if this observation is considered as an outlier in terms of the Mahalanobis distance. Indeed Copt and

¹¹ See e.g. McCulloch and Searle (2001, Chapter 9).
Figure 4.7 Scatterplot of the Mahalanobis distances for the semantic priming data (log-transformed). CBS computed with c0 = 5.14 and MM (biweight) with c1 = 6.37.

Victoria-Feser (2009) show that the IF of γ̂ in (4.48) depends on the robustness properties of β̂(α) and also on the deviations (y − Xβ). This means that, in order to make the predictions robust to model deviations, one needs not only a robust estimator such as the CBS–MM, but also to bound (4.48). Copt and Victoria-Feser (2009) propose the use of a ψ-based prediction defined as¹²

γ̂ψ = eψ,c DZᵀV^(−1/2) ψ(V^(−1/2)(y − Xβ)),

where ψ(r) = (∂/∂r)ρ(r) is a bounded function, such as Huber's or Tukey's biweight function, and eψ,c is a correction factor (see below). A bounded ψ-function is necessary to guarantee the robustness of the corresponding prediction estimator. Moreover, in order for γ̂ψ to behave similarly to γ̂ at the normal model, we also need to impose that E[γ̂ψ] = 0 and var(γ̂ψ) = var(γ̂). These constraints define (implicitly) the correction factor eψ,c. For Tukey's biweight ψ-function, Copt and Victoria-Feser (2009) show that
eψ[bi],c = [ I2(c) − (4/c²) I4(c) + (6/c⁴) I6(c) − (4/c⁶) I8(c) + (1/c⁸) I10(c) ]^(−1/2)

¹² To compute V^(−1/2), we follow Richardson and Welsh (1995) and choose V^(−1/2) to be symmetric, with the same additive structure as V and V⁻¹, and with the property that V^(−1/2) V^(−1/2) = V⁻¹.
Figure 4.8 Boxplot and Q-Q plot of the (estimated) subject random effect for the skin resistance data. CBS computed with c0 = 4.65 and MM (biweight) with c1 = 6.10.
where

Ik(c) = ∫₋c^c r^k dΦ(r);
see Appendix B for the computation of these truncated normal moments. For Huber's ψ-function, eψ[Hub],c = (1 − 2c²(1 − Φ(c)))^(−1/2). Finally, to compute γ̂ψ in practice, one replaces α (in V and D) and β by their robust estimates.

Estimated random effects can be used to check the model assumptions. Recall that the random effects are assumed to be normally distributed and independent of each other. A normal probability plot (normal quantiles against ordered estimated random effects) or a boxplot can be used to assess the normality assumption. Again, consider as an example the skin resistance dataset estimated in Section 4.2.2. This model has only one random effect, the subject. Figure 4.8 suggests that the normality assumption for the subject random effect is reasonably well satisfied.

As in the linear regression setting, residuals are defined as the difference between the response and the predicted value, i.e. y − ŷ, where ŷ is given in (4.47) and possibly also (4.46). They thus depend on the choice of predicted response. However, since random effects have been introduced into the model, it is more sensible to use the subject predicted values to define residuals, as population fitted values may produce a structure in the residuals that is simply due to the random effects. The residuals can also be standardized by means of the (estimated) covariance matrix of y, yielding V^(−1/2)(y − ŷ). Figure 4.9 displays the standardized residuals versus fitted values at the subject level. We can see that there is no particular structure in the residuals.
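Reading the bracketed expression in eψ[bi],c as E[ψbi(r)²] under the standard normal (our interpretation of the formula above), the correction factor can be computed numerically; here the truncated moments Ik(c) are approximated by Simpson's rule instead of the closed forms of Appendix B, and the function names are illustrative:

```python
import math

def phi(r):
    """Standard normal density."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)

def truncated_moment(k, c, n=20001):
    """I_k(c) = integral of r^k dPhi(r) over [-c, c], by Simpson's rule
    (n must be odd)."""
    h = 2.0 * c / (n - 1)
    total = 0.0
    for i in range(n):
        r = -c + i * h
        w = 1.0 if i in (0, n - 1) else (4.0 if i % 2 == 1 else 2.0)
        total += w * (r ** k) * phi(r)
    return total * h / 3.0

def e_biweight(c):
    """Correction factor for the Tukey biweight psi-based predictor:
    e = [I2 - (4/c^2) I4 + (6/c^4) I6 - (4/c^6) I8 + (1/c^8) I10]^(-1/2)."""
    s = (truncated_moment(2, c)
         - 4.0 / c**2 * truncated_moment(4, c)
         + 6.0 / c**4 * truncated_moment(6, c)
         - 4.0 / c**6 * truncated_moment(8, c)
         + 1.0 / c**8 * truncated_moment(10, c))
    return s ** -0.5
```

As a sanity check, e tends to 1 as c grows (ψbi approaches the identity), and e > 1 for finite c since the bounded ψ shrinks the residuals.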
Figure 4.9 Standardized residuals (subject level) versus fitted values for the skin resistance data. CBS computed with c0 = 4.65 and MM (biweight) with c1 = 6.10.
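To see what the BLUP (4.48) does in the simplest case, consider one cluster with a single random intercept, where D = σγ², Z = 1p and V = σγ²J + σε²I. The compound-symmetry inverse then collapses (4.48) to a shrunken cluster-mean residual. A minimal sketch (the helper and its inputs are hypothetical, not the book's software):

```python
def blup_random_intercept(residuals, sigma2_gamma, sigma2_eps):
    """BLUP gamma_hat = D Z^T V^{-1} (y - X beta) for one cluster with a
    single random intercept: D = sigma2_gamma, Z = 1_p and
    V = sigma2_gamma * J + sigma2_eps * I (compound symmetry).
    The closed form is a shrunken cluster-mean residual:
    gamma_hat = [p * s2_gamma / (s2_eps + p * s2_gamma)] * mean(residuals)."""
    p = len(residuals)
    shrink = p * sigma2_gamma / (sigma2_eps + p * sigma2_gamma)
    return shrink * (sum(residuals) / p)
```

The robust ψ-based version of Copt and Victoria-Feser (2009) would additionally pass the standardized residuals through a bounded ψ (rescaled by eψ,c) before this shrinkage step.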
4.7 Further Examples

4.7.1 Metallic Oxide Data

Until now, we have only presented models in which each level of a factor is combined with every level of another factor. Hierarchical models are models where only some levels of a factor are combined with the levels of another factor. More formally, suppose that we have two treatments λ and γ with l and g levels, respectively. In the language of experimental design, if each level of treatment γ appears only in one level of treatment λ, γ is said to be nested in λ. One can also extend the models so as to include so-called between-subjects factors. For example, we have the typical experiment in which a measurement is taken from n1 samples of type j = 1 and n2 samples of type j = 2, and in each sample the measure is taken on g 'objects'. For example, the 'objects' can be rats and the samples cages, n1 of which are given treatment j = 1 and the other n2 treatment j = 2. This type of design is called a nested design. The rats are nested within the
cages. A rat belongs either to cage 1 or cage 2. We use different notation to represent nested factors. For example, suppose that γ is the parameter for the cage; then γj(i) would represent rat i nested within cage j. The between-subjects factor here is the treatment.

In this section, we analyze a dataset originating from a sampling study designed to explore the effects of process and measurement variation on the properties of lots of metallic oxides (Bennet, 1954). Two samples were drawn from each lot. Duplicate analyses were then performed by each of two chemists, with a pair of chemists randomly selected for each sample. Hence, the response yijklm corresponds to the metal content (percent by weight) measured on the ith metallic oxide type, on the jth lot, on the kth sample, by the lth chemist for the mth analysis. The model can be written as

yijklm = µ + λJi(j) + γj(i) + δj(i(k)) + ξj(i(k(l))) + εj(i(k(l(m)))),   (4.49)

where

Ji(j) = 0 for j = 1 and Ji(j) = 1 for j = 2,

and with µ + λJi(j) the fixed effect, γj(i), i = 1, . . . , n = n1 + n2, the random effect due to the lot, δj(i(k)), k = 1, . . . , 2n, the random effect due to the sample, and ξj(i(k(l))), l = 1, . . . , 4n, the random effect due to the chemist. We then have

µi = e8(µ + λJi(j)) = e8 ⊗ (1, Ji(j))(µ, λ)ᵀ = xiβ

and Z1 = In ⊗ e8 for σγ², Z2 = In ⊗ I2 ⊗ e4 for σδ², Z3 = In ⊗ I4 ⊗ e2 for σξ², so that

Σ = σγ² J8 + σδ² I2 ⊗ J4 + σξ² I4 ⊗ J2 + σε² I8.

Thus, the parameters to be estimated are the means for each type of metallic oxide and the variances associated with lots, samples and chemists. This dataset contains 248 observations. We can then form n = 31 independent sub-vectors yi of size 8. A plot of the responses by sample and chemist is given in Figure 4.10. One may notice that, whatever the sample or the chemist, the responses are rather low for lots (observations) 24 and 25 relative to the other lots. Table 4.9 presents the estimates and standard errors for the CBS–MM. The mean effect of the metallic oxide type is significant (p-value of 0.005), and the variances are larger for the lot and the chemist, and smaller for the sample. As a comparison, the REML gives larger estimates for the variance components of the lot and sample, and a smaller estimate for the chemist (results not presented here). An analysis of the Mahalanobis distances reveals a few potential outlying observations (see Figure 4.11). One can see that the REML and CBS–MM detect two outliers (observations 24 and 30), and possibly observation 17 as well, while the CBS–MM detects two influential observations (observations 12 and 25). An analysis based on the classical Mahalanobis distances alone would certainly be misleading.
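The covariance structure Σ of one 8-vector yi stated above can be assembled directly from Kronecker products; the helper names below are ours, and the variance labels follow Table 4.9 (lot, sample, chemist, residual):

```python
def eye(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def ones(n):
    """n x n matrix of ones (J_n)."""
    return [[1.0] * n for _ in range(n)]

def kron(A, B):
    """Kronecker product A (x) B for matrices given as nested lists."""
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def sigma_metallic(s2_lot, s2_sample, s2_chemist, s2_eps):
    """Sigma = s2_lot*J8 + s2_sample*(I2 (x) J4) + s2_chemist*(I4 (x) J2)
    + s2_eps*I8: covariance of one 8-vector (2 samples x 2 chemists x
    2 analyses within a lot)."""
    terms = [(s2_lot, ones(8)),
             (s2_sample, kron(eye(2), ones(4))),
             (s2_chemist, kron(eye(4), ones(2))),
             (s2_eps, eye(8))]
    S = [[0.0] * 8 for _ in range(8)]
    for w, M in terms:
        for i in range(8):
            for j in range(8):
                S[i][j] += w * M[i][j]
    return S
```

With unit variances, two analyses by the same chemist share three variance components, two chemists within a sample share two, and two samples within a lot share one, which the assertions below check.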
Figure 4.10 Metal content response for each lot by sample and chemist.
4.7.2 Orthodontic Growth Data (continued)

The orthodontic growth data introduced in Section 4.2 are summarized in Figure 4.2, where individual scatterplots of the distance (between the pituitary and the pterygomaxillary fissure) versus age are displayed. Individual LS fits based on simple linear regression are added to each scatterplot. They reveal that the estimated slope for subject M13 is far larger than the other estimated slopes. Overall, it seems that the
Table 4.9 Estimates and standard errors for the CBS–MM for the metallic oxide data using model (4.49).

Parameter   CBS–MM Estimate (SE)   p-value
µ           3.726 (0.066)          <10⁻⁴
λ           0.184 (0.066)          0.005
σlot        0.317
σsample     0.144
σchemist    0.188
σε          0.186
CBS computed with c0 = 6.01 and MM (biweight) computed with c1 = 6.83.
Figure 4.11 Scatterplot of the Mahalanobis distances for the metallic oxide data. CBS computed with c0 = 6.01 and MM (biweight) with c1 = 6.83.

responses for the boys vary more than those for the girls. Moreover, the plot suggests that two observations on subject M09 are outliers. These potential outliers are also detected in Figure 4.12, which presents the LS residual plots by gender. As discussed in Section 4.2, a plausible working model is thought to be

yijt = β0 + β1 t + (β0g + β1g t)Ji(j) + γ0i + γ1i t + εijt   (4.50)
Figure 4.12 Residuals versus fitted values by gender, corresponding to individual LS fits.
with yijt the response for the ith subject (i = 1, . . . , 27) of gender j (j = 1 for boys and j = 2 for girls) at age t = 8, 10, 12, 14, and Ji(j) = 0 for boys (j = 1) and 1 for girls (j = 2). Table 4.10 presents the CBS–MM estimates and standard errors for the model parameters. The estimates show that there is no significant mean intercept difference between boys and girls (p-value of 0.896), while there is a significant mean slope difference (p-value of 0.036). The random slope variance is found to be relatively small compared with the random intercept variance. As a comparison, the REML gives similar results, with larger random slope variance and residual variance estimates. The robust Mahalanobis distances flag the observations corresponding to the 9th and 13th boys as extreme, as was already found in the graphical data
Table 4.10 Estimates and standard errors for the CBS–MM for the orthodontic data using model (4.50).

Parameter   CBS–MM Estimate (SE)   p-value
β0          17.395 (0.613)         <10⁻⁴
β0g         0.080 (0.613)          0.896
β1          0.581 (0.052)          0.000
β1g         −0.110 (0.052)         0.036
σγ0         1.584
σγ1         0.115
σε          1.04

CBS computed with c0 = 4.09 and MM (biweight) computed with c1 = 5.82.
Figure 4.13 Boxplot and Q-Q plot of the random effects for the orthodontic data. CBS computed with c0 = 4.09 and MM (biweight) with c1 = 5.82.
analysis in Figure 4.2. It should be noted that Pinheiro et al. (2001) also find the same outlying observations. A plot of the estimated random effects (see Figure 4.13) shows that both the random slope and the random intercept estimated with the robust estimator appear approximately normally distributed.
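For model (4.50), the within-subject covariance implied by a random intercept and a random slope is Vi = Z G Zᵀ + σε² I, with rows of Z equal to (1, t). A minimal sketch, assuming independent random effects with variances σγ0² and σγ1² (the book reports only these two components here; the function name is ours):

```python
def vi_growth(ages, s2_int, s2_slope, s2_eps):
    """Within-subject covariance for a random intercept + slope model,
    assuming independent random effects:
    V_i[i][j] = s2_int + s2_slope * t_i * t_j + s2_eps * 1{i == j}."""
    n = len(ages)
    return [[s2_int + s2_slope * ages[i] * ages[j]
             + (s2_eps if i == j else 0.0)
             for j in range(n)] for i in range(n)]
```

With the ages 8, 10, 12, 14 of the orthodontic data, this produces the 4 × 4 matrix that enters the Mahalanobis distances and the standardized residuals of Figure 4.14.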
Figure 4.14 Standardized residuals (subject level) versus fitted values for the orthodontic data. CBS computed with c0 = 4.09 and MM (biweight) with c1 = 5.82.
Figure 4.14 displays the standardized (Pearson) residuals versus fitted values at the subject level. We can see that there is no particular structure in the residuals and that subjects M13 and M09 are the largest outliers.
4.8 Discussion and Extensions

Despite its good robustness properties and the fact that it does not suffer from computational problems when applied to complex data structures (as is often the case when modeling longitudinal data with fixed covariates), the CBS-MM estimator has a few limitations. The first limitation is, as stated earlier in this chapter, that the CBS-MM estimator cannot handle unbalanced data at the moment, unlike the very general bounded influence approach of Richardson and Welsh (1995). This is particularly annoying as balanced data are the exception rather than the rule, and one is more likely to encounter unbalanced data, especially in medical research. The second limitation is the lack of inference theory for the variance components. We have seen (see Sections 4.3.2 and 4.5.5) that no proper solution to this problem
exists in the current robustness theory. Robust inferential procedures presented in this book fail as they all assume the null hypothesis to be an interior point of the parameter space. In addition, the robust LRT test defined in Section 4.5 targets only hypotheses on the fixed effects. It cannot even be defined for a simple testing problem on the variance parameters σj², e.g. testing the equality σj² = σj′². Its extension to more general hypotheses on θ = (βT, αT)T may prove challenging. In general, further research work is needed in this area.

One possible robust extension of the MLM is to assume that the data follow a t distribution instead of the normal distribution assumed throughout this chapter. For example, Pinheiro et al. (2001) incorporate multivariate t distributed random components for the t MLMs. More recently, Lin and Lee (2006) propose a model based on the multivariate t distribution for autocorrelated longitudinal data by first incorporating an autoregressive dependence structure in the variance components, and extend the work of Pinheiro et al. (2001) to allow for inference about the random effects and predictions.

The next natural extension of robustness in the MLM environment is to the class of generalized linear mixed models (GLMMs). Yau and Kuk (2002) introduce robust maximum quasi-likelihood and residual maximum quasi-likelihood estimation to limit the influence of outlying observations. The way they introduce robustness in the GLMM follows the same line of thought as used by Richardson and Welsh (1995) in the MLM. Other attempts at robustifying the GLMM can be found in Mills et al. (2002) or Sinha (2004). More recently, Litière et al. (2007a) study the impact of an incorrectly specified probability model on maximum likelihood estimation in the GLMM. They study the impact of misspecifying the random-effects distribution on estimation and inference and show that the MLEs are inconsistent in the presence of such misspecifications.
5

Generalized Linear Models

5.1 Introduction

The framework of GLMs allows us to extend the class of models considered in Chapter 3 and to address situations with non-normal (non-Gaussian) responses. In particular, it allows us to consider continuous and discrete distributions for the response, both symmetric and asymmetric. From the practical point of view, this unified framework opens many perspectives formalized under the same setting and sharing a number of properties. The fields of application are quite wide: certainly biostatistics, but also medicine, economics, ecology, demography, psychology and many more. The family of possible distributions for the response is quite large, but the most common settings no doubt include binary or binomial responses (e.g. presence or absence of a characteristic, see the example in Section 5.5, or the number of ‘successes’ in a sequence), count data (for example, the number of visits to the doctor, see the example in Section 5.6) and positive responses (e.g. hospital costs, see the example in Section 5.3.5). All of the classical theory of GLMs is likelihood based, and the gain in popularity of GLMs has helped in reinforcing the central role of the likelihood in statistical inference. We will see that the robust versions of GLMs presented in this chapter move away from the likelihood setting, but retain almost all of its advantages in terms of statistical properties and interpretation.
The route to the definition of the unified class of GLMs has been long, and the steps to it went through multiple linear regression (Legendre, Gauss, early 19th century), the ANOVA of designed experiments (Fisher: 1920–1935), the likelihood function (Fisher, 1922), dilution assay (Fisher, 1922), the exponential family of distributions (Fisher, 1934), probit analysis (Bliss, 1935), logit models for proportions (Berkson, 1944; Dyke and Patterson, 1952), item analysis (Rasch, 1960), loglinear models for counts (Birch, 1963) and inverse polynomials (Nelder, 1966); see McCullagh and Nelder (1989, Chapter 1) for additional information.

Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser © 2009 John Wiley & Sons, Ltd

Nelder and
Wedderburn (1972) show that the above problems can all be treated in the same way. They also show that the MLE for all of these models can be obtained using the same algorithm (IRWLS; see Appendix E.3).

Binary logistic regression has received quite a lot of attention in the robust literature. In fact, one can find several robust contributions that follow different approaches: the early contributions of Pregibon (1982), Copas (1988) and Carroll and Pederson (1993), the L1-norm quasi-likelihood approach of Morgenthaler (1992), the weighted likelihood approaches of Markatou et al. (1997) and Victoria-Feser (2002), and the high breakdown approaches of Bianco and Yohai (1997) and Christmann (1997). This breadth of contributions is certainly due to the fact that addressing the binary framework is simpler than addressing the general GLM class. This more general class has nevertheless been addressed in the work of Stefanski et al. (1986) and Künsch et al. (1989), who derive optimal (OBRE, see Section 2.3.1) and conditionally unbiased estimators for the entire GLM class. This theory is quite complex (even in its simpler conditional approach) and only the case of logistic regression can be implemented easily. More recently, Cantoni and Ronchetti (2001b) define Huber and Mallows-type estimators and quasi-deviance functions for application within the GLM framework; see also Cantoni (2003, 2004a) and Cantoni and Ronchetti (2006).

Here we present this last piece of work, which seems to us the most promising for use in the entire GLM class. In fact, it has the advantage over other proposals of having computationally tractable expressions (that allow us to consider the entire class of GLMs and not only the logistic application) and of jointly providing a solution to the variable selection question through the definition of quasi-deviance functions.

The present chapter is organized as follows. In Section 5.2 we set up the notation and define the model.
We continue in Section 5.3, where we define the class of (robust) estimators and give their properties. The technique is illustrated on a real example in Section 5.3.5. The variable selection issue is addressed in Section 5.4.2, where a family of quasi-deviance functions is defined and its distribution studied. Section 5.4.3 considers the application to the previously studied example. Two additional complete data analyses with robust model selection are presented in Sections 5.5 and 5.6. Finally, Section 5.7 discusses possible extensions of this work.
5.2 The GLM

5.2.1 Model Building

We introduce here the GLM modeling approach without necessarily giving a complete and exhaustive treatment of the subject. Instead, we refer the interested reader to the general references treating GLM modeling, which include Dobson (2001) (a good starting point for beginners), Lindsey (1997) (an applied approach), McCullagh and Nelder (1989) (with additional technical details) and Fahrmeir and Tutz (2001) (more focused on discrete data).
Table 5.1 Properties of some distributions belonging to the exponential family.

    Distribution                  θi(µi)             φ     E[yi]                        var(yi)
    Normal     N(µi, σ²)          µi                 σ²    µi = θi                      σ²
    Bernoulli  B(1, pi)           log(pi/(1 − pi))   1     pi = exp(θi)/(1 + exp(θi))   pi(1 − pi)
    Scaled binomial B(m, pi)/m    log(pi/(1 − pi))   1/m   pi = exp(θi)/(1 + exp(θi))   pi(1 − pi)
    Poisson    P(λi)              log(λi)            1     λi = exp(θi)                 λi
    Gamma      G(µi, ν)           −1/µi              1/ν   µi = −1/θi                   µi²/ν

See Appendix D for the definitions of the distributions.
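A minimal numpy sketch of the variance structure var(yi) = φ vµi for some rows of Table 5.1, assuming the standard variance functions vµ = 1, µ(1 − µ), µ and µ² with dispersions φ = σ², 1, 1 and 1/ν; the empirical variances of simulated samples are compared with their theoretical values.

```python
import numpy as np

# Variance functions v_mu for some rows of Table 5.1, so that var(y) = phi * v_mu.
v_mu = {
    "normal":    lambda mu: 1.0,              # phi = sigma^2
    "bernoulli": lambda mu: mu * (1.0 - mu),  # phi = 1
    "poisson":   lambda mu: mu,               # phi = 1
    "gamma":     lambda mu: mu ** 2,          # phi = 1/nu
}

rng = np.random.default_rng(0)
n = 200_000

# Poisson(3): empirical variance should be close to 1 * v_mu(3) = 3.
var_pois = rng.poisson(3.0, n).var()

# Bernoulli(0.3): empirical variance should be close to 1 * 0.3 * 0.7 = 0.21.
var_bern = rng.binomial(1, 0.3, n).var()

# Gamma with mean mu = 2 and shape nu = 5: variance should be mu^2 / nu = 0.8.
nu, mu = 5.0, 2.0
var_gam = rng.gamma(nu, mu / nu, n).var()
```

The dictionary keys and simulation settings are illustrative only; the check simply confirms the mean–variance relationships listed in the table.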
Consider a sample of n individuals, for which we define the three following ingredients.

• The random component. n independent random variables y1, . . . , yn which are assumed to share the same distribution from the exponential family, that is, with density that can be written as

    f(yi; θi, φ) = exp[(yi θi − b(θi))/ai(φ) + c(yi, φ)]                                  (5.1)

for some specific functions a(·), b(·) and c(·). We denote µi = E[yi] and var(yi) = φ vµi, where the specific form of vµi depends on the distributional assumption on yi, see the last column of Table 5.1. The most common families of distributions such as the normal, the binomial, the Poisson, the exponential, and the Gamma belong to the exponential family of distributions. Some of these distributions will be considered more closely here. The parameter θi, which is a function of µi, is called the natural parameter, and φ is an additional scale or dispersion parameter, usually considered as a nuisance parameter. We note that φ is a constant in certain models (for example, φ = 1/m for the scaled binomial and φ = 1 for the Poisson distribution), and coincides with σ² in the normal model, see the fourth column of Table 5.1.

• The systematic component. A set of parameters βT = (β0, β1, . . . , βq) and q explanatory variables or covariates that can either be quantitative (numerical) or qualitative (levels of a factor, then coded with dummy variables as in linear regression). For each individual i = 1, . . . , n, the covariates are stored in the vector xiT = (1, xi1, . . . , xiq), from which the linear predictor ηi = xiTβ is constructed. The parameter β0 therefore identifies the intercept. The pooled
covariate information is collected in a design matrix X as follows:

         ( x1T )
    X =  ( x2T )                                                                          (5.2)
         (  ⋮  )
         ( xnT )

As in the linear model, linearity in GLM is intended with respect to the parameters. We note that one could introduce transformed covariates, log(xij) or xij², for example, as well as interactions. Moreover, there are situations where a parameter βj is known a priori: the corresponding term in the linear structure is called an offset in the GLM terminology.

• The link. A monotone link function g which links the random and the systematic components of the model

    g(µi) = ηi = xiTβ.
(5.3)
The link function defines the form of the relationship between the mean µi of the response and the assumed linear predictor ηi. It needs to be monotonic and differentiable. Moreover, it can be chosen to ensure that the estimated parameter lies in the admissible space of values (for example, the interval (0, 1) for the binomial distribution and (0, ∞) for the Poisson distribution). The natural or canonical link function is that relating the natural parameter directly to the linear predictor (θi = θi(µi) = ηi = xiTβ). Models making use of the canonical link enjoy convenient mathematical and statistical properties, but the canonical link can easily be replaced with a more appropriate link function from the practical or interpretation point of view (see Example 5.3.5). The definition of model (5.3) may be surprising at first to people used essentially to the linear model setting, but the connection with the linear model appears more evident when the latter (as defined in (3.1)) is rewritten in the equivalent form E[yi] = µi = xiTβ, with yi ∼ N(µi, σ²). In this case, the link function is the identity function. In the GLM setting, the distributional assumptions are defined with respect to the response itself (conditionally on the set of explanatory variables) and not with respect to an additive error term. Table 5.1 provides an overview of the components of a GLM model for the most common situations.
5.2.2 Classical Estimation and Inference for GLM

The parameters of model (5.3) are usually estimated by maximizing the corresponding log-likelihood (with respect to β)

    l(β; y) = l(µ; y) = Σ_{i=1}^{n} log f(yi; θi, φ) = Σ_{i=1}^{n} [(yi θi − b(θi))/ai(φ) + c(yi, φ)] = Σ_{i=1}^{n} li(µi; yi),        (5.4)

where µi = g−1(xiTβ) and θi = θi(µi) = θi(g−1(xiTβ)) are functions of β. The maximization of the log-likelihood (5.4) is performed numerically, either directly or via an IRWLS, see McCullagh and Nelder (1989, Section 2.5) and Appendix E.3. The resulting estimator βˆ[MLE] enjoys the general properties of maximum likelihood estimation, in particular the normal asymptotic distribution with variance given by the inverse of the Fisher information matrix I(β) (see (2.30)), that is, √n(βˆ[MLE] − β) ∼ N(0, I−1(β)). Based on this asymptotic result, one can construct univariate test statistics for the coefficients βj, j = 0, . . . , q as

    βˆ[MLE]j / SE(βˆ[MLE]j)                                                               (5.5)

with

    SE(βˆ[MLE]j) = √((1/n)[Iˆ−1(βˆ[MLE])](j+1)(j+1)),

and using an estimator Iˆ for the Fisher information matrix

    Iˆ(βˆ[MLE]) = (1/n) Σ_{i=1}^{n} (∂/∂β) li(µi; yi) (∂/∂βT) li(µi; yi) |β=βˆ[MLE].

The statistic (5.5) is labeled the t-statistic if the dispersion parameter φ is estimated (for example, for the Gaussian and Gamma distributions), and is labeled the z-statistic if the dispersion parameter is known (for example, for the binomial and Poisson distributions). The test statistic (5.5) has a tn−(q+1) distribution under the null hypothesis H0: βj = 0 in the first case and the standard normal in the second. The p-value for a two-sided alternative hypothesis H1: βj ≠ 0 is therefore computed as P(|z-statistic| > |zobs|) = 2(1 − Φ(|zobs|)) or P(|t-statistic| > |tobs|) = 2(1 − tn−(q+1)(|tobs|)), where zobs and tobs are the values taken by the statistic (5.5) on the sample.

Note that the z/t-statistic is a Wald approximation of the log-likelihood (second-order Taylor expansion of the log-likelihood at the MLE) to test H0: βj = 0 and is sometimes misleading with binomial GLMs. In fact, a small value for the z/t-statistic can either correspond to a small LRT statistic or to a situation where |βj| is large, the
Wald approximation is poor and the likelihood ratio statistic is large. These problems can occur in cases where the fitted probabilities are extremely close to zero or one. This is called the Hauck–Donner phenomenon, see Hauck and Donner (1977).

The asymptotic result is also useful in constructing approximate (1 − α) confidence intervals (CIs), according to the formula

    (βˆ[MLE]j − q(1−α/2) SE(βˆ[MLE]j); βˆ[MLE]j + q(1−α/2) SE(βˆ[MLE]j)),

where q(1−α/2) is either the (1 − α/2) quantile of the standard normal distribution or of the tn−(q+1) distribution, depending on whether φ is known or not.

For binomial and Poisson models, it sometimes happens that the data do not satisfy the variance assumption of the model, but rather that var(yi) = τ vµi (recall that φ = 1 for binomial and Poisson models). This phenomenon is called over- or under-dispersion, depending on whether τ is larger or smaller than one. One of the main reasons for over-dispersion is clustering in the population (the parameter θi varies from cluster to cluster, as a function of cluster size for example). This means that the parameter θi is regarded as random rather than fixed. Beyond normality, specifying the expectation and the variance structure separately (first and second moment) does not correspond to a distribution function, therefore preventing the definition of a likelihood function. In this case, the model is fitted via the estimating equations
    Σ_{i=1}^{n} ((yi − µi)/(τ vµi)) µ′i = 0,                                              (5.6)

where µ′i = ∂µi/∂β. Equation (5.6) corresponds to the maximization of the so-called quasi-likelihood function

    Q(µ; y) = Σ_{i=1}^{n} Q(µi; yi) = Σ_{i=1}^{n} ∫_{yi}^{µi} (yi − t)/(τ vt) dt,          (5.7)

where µT = (µ1, . . . , µn) and yT = (y1, . . . , yn). Under some general conditions (see Wedderburn, 1974) the quasi-likelihood estimator is asymptotically normally distributed. Moreover, the MLE and the maximum quasi-likelihood estimator (MQLE) are the same for all of the models of the one-parameter exponential family (binomial and Poisson, for example). Note that τ has no impact on (5.6) because it cancels out, but does have an impact on the computation of the standard errors of the coefficients. The estimation of τ is based on the RSS as follows:

    τˆ = (1/(n − (q + 1))) Σ_{i=1}^{n} (yi − µˆi)²/vµˆi,

where µˆi are the fitted values g−1(xiT βˆ[MQLE]) on the response scale. The estimator τˆ is an unbiased estimator of τ if the fitted model is correct.

A particular function based on the log-likelihood plays an important role in GLM modeling. It is called the deviance, which, assuming that ai(φ) in (5.1) can be
decomposed as φ/wi, is defined by

    D(µˆ; y) = 2φ[l(y; y) − l(µˆ; y)] = Σ_{i=1}^{n} 2φ[li(yi; yi) − li(µˆi; yi)] = Σ_{i=1}^{n} φ di,        (5.8)

where µˆ is the vector of fitted values g−1(xiT βˆ[MLE]), l(µˆ; y) is the log-likelihood of the postulated model and l(y; y) is the saturated log-likelihood for a full model with n parameters. The deviance measures the discrepancy between the performance of the current model via its log-likelihood and the maximum log-likelihood achievable. It can therefore be used for goodness-of-fit purposes. Large values of D(µˆ; y) indicate that the model is not good. On the other hand, small values of D(µˆ; y) arise when the log-likelihood l(µˆ; y) is close to the saturated log-likelihood l(y; y). The distribution of the deviance is exactly χ²n−(q+1) for normally distributed responses, and this distribution can be taken as an approximation for other distributions, for example binomial and Poisson. However, D(µˆ; y) is not usable for goodness-of-fit for Bernoulli responses, because it only depends on the observations y through the fitted probabilities µˆ and as such does not carry information about the agreement between the observations and the fitted probabilities (see Collett 2003a, Section 3.8.2).

The deviance can be regarded as a LRT statistic for testing a specific model within the saturated model, assuming φ = 1. This is the case for binomial and Poisson models, but for other distributions, e.g. normal or Gamma, the deviance is not directly related to a LRT statistic. The deviance is also used to construct a difference of deviance statistics to compare nested models. Suppose that a model Mq−k+1 with (q − k) explanatory variables (plus intercept) is nested into a larger model Mq+1 with q explanatory variables (plus intercept). To test the null hypothesis, which states that the smallest model suffices to describe the data, one can test whether the parameters associated with the variables not included in the smallest model are equal to zero with the test statistic

    ΔD(µˆ, µ˙) = D(µ˙; y) − D(µˆ; y) = 2φ[l(µˆ; y) − l(µ˙; y)],
(5.9)
where µˆ = µ(βˆ[MLE]) and µ˙ = µ(β˙[MLE]) are the MLE estimates in the full model Mq+1 and the reduced model Mq−k+1, respectively. If φ is known, then under the null hypothesis that the smaller model is good enough to represent the data, the distribution of ΔD(µˆ, µ˙) can be approximated by a φχ²k (it is the LRT statistic up to a factor φ). This approximation is more accurate than the approximation of the deviance itself by a χ²n−(q+1) distribution. When φ is not known (e.g. normal, Gamma) the usual approximation under H0 uses an F-type statistic:

    (D(µ˙; y) − D(µˆ; y))/(k φˆ) ∼ Fk,n−(q+1),

where φˆ = D(µˆ; y)/(n − (q + 1)). Note that for the normal case with identity link this is an exact result, but for the Gamma model the accuracy of this approximation is not well known.
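The difference-of-deviance test in (5.9) can be sketched for a Poisson model (φ = 1, so ΔD is compared with χ²k directly). Both nested models are fitted by the IRWLS mentioned above; the data and the dropped covariate are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

def irwls_poisson(X, y, n_iter=25):
    """Fit a Poisson GLM with log link by IRWLS.

    For the log link, mu = exp(eta), d mu/d eta = mu and v_mu = mu, so the
    working weights are W = mu and the working response is z = eta + (y - mu)/mu.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 0.5)          # safe starting value
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu
        W = mu
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

def poisson_deviance(y, mu):
    # D = 2 sum[y log(y/mu) - (y - mu)], with the convention y log y = 0 at y = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.5 * x1)).astype(float)   # x2 is irrelevant by construction

X_full = np.column_stack([np.ones(n), x1, x2])  # M_{q+1}
X_red = X_full[:, :2]                           # M_{q-k+1}: drop x2, so k = 1

beta_full = irwls_poisson(X_full, y)
beta_red = irwls_poisson(X_red, y)
D_full = poisson_deviance(y, np.exp(X_full @ beta_full))
D_red = poisson_deviance(y, np.exp(X_red @ beta_red))

delta_D = D_red - D_full                 # (5.9) with phi = 1
p_value = stats.chi2.sf(delta_D, df=1)   # compare with chi^2_k
```

Since x2 does not enter the true model, the p-value should typically be non-significant; the nesting guarantees delta_D ≥ 0.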
A natural definition of a quasi-deviance function follows from the definition (5.7) of a quasi-likelihood function:

    QD(µˆ; y) = Q(y; y) − Q(µˆ; y).
(5.10)
By analogy with the deviance function, one can use the quasi-deviance function for inference purposes to test whether a smaller model Mq−k+1 nested into a larger model Mq+1 is a good enough representation of the data with the difference of quasi-deviance statistics:

    ΔQD(µˆ, µ˙) = Q(µ˙; y) − Q(µˆ; y),
(5.11)
where µˆ = µ(βˆ[MQLE]) and µ˙ = µ(β˙[MQLE]) are the MQLE estimates in the full model Mq+1 and the reduced model Mq−k+1, respectively. The test statistic ΔQD(µˆ, µ˙) is then compared with a χ²n−(q+1) distribution, at least when φ is known. As with the likelihood, an F-type test is more appropriate if φ is unknown, see above.
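The dispersion estimator τˆ can be illustrated on simulated counts that are deliberately over-dispersed: negative binomial data (var = µ + µ²/k > µ) fitted with a Poisson working model, so that τˆ should clearly exceed one. The simulation settings are illustrative only.

```python
import numpy as np

def irwls_poisson(X, y, n_iter=25):
    # Log-link Poisson IRWLS: working weights W = mu, response z = eta + (y - mu)/mu.
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 0.5)
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        beta = np.linalg.solve(X.T @ (mu[:, None] * X),
                               X.T @ (mu * (eta + (y - mu) / mu)))
    return beta

rng = np.random.default_rng(2)
n, q = 2000, 1
x = rng.normal(size=n)
mu_true = np.exp(1.0 + 0.4 * x)
k = 2.0                                                  # NB shape: var = mu + mu^2/k
y = rng.negative_binomial(k, k / (k + mu_true)).astype(float)

X = np.column_stack([np.ones(n), x])
mu_hat = np.exp(X @ irwls_poisson(X, y))
# tau_hat = (1/(n - (q + 1))) * sum (y_i - mu_hat_i)^2 / v_{mu_hat_i}, with v_mu = mu
tau_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - (q + 1))
```

Here tau_hat estimates roughly 1 + µ/k, i.e. well above one, signaling over-dispersion relative to the Poisson variance assumption.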
5.2.3 Hospital Costs Data Example

We introduce here a dataset on health care expenditures previously analyzed by Marazzi and Yohai (2004) and Cantoni and Ronchetti (2006). The aim is to explain the cost of stay (cost, in Swiss francs) of 100 patients hospitalized at the Centre Hospitalier Universitaire Vaudois in Lausanne (Switzerland) during 1999 for ‘medical back problems’ (APDRG 243). The following explanatory variables have been measured: length of stay (los, in days), admission type (adm: 0 = planned, 1 = emergency), insurance type (ins: 0 = regular, 1 = private), age in years (age), sex (sex: 0 = female, 1 = male) and discharge destination (dest: 1 = home, 0 = another health institution). The median age over the 100 patients is 56.5 years (the youngest patient is 16 years old and the oldest is 93 years old). Moreover, 60 individuals out of the 100 in the sample were admitted as emergencies and only 9 patients had private insurance. Also, both sexes are well represented in the sample, with 53 men and 47 women. After being treated, 82 patients went home directly.

Modeling medical expenses is an important step in cost management and health care policy. Establishing the relationship between the cost and the above explanatory variables could, for example, help in reducing costs in health care expenditures, which are increasing extremely fast everywhere and are therefore a matter of concern. In addition to being positive, cost measurements are known to be highly skewed. Moreover, it is also known that the thickness of the tail of their distribution is often determined by a small number of heavy users. Several authors (e.g. Blough et al., 1999; Gilleskie and Mroz, 2004) report that the variance of health care expenditures data can be considered as proportional to the squared mean. We therefore consider fitting a Gamma GLM model with a logarithmic link.
Note that this model can be seen as arising from a multiplicative model yi = exp(xiTβ) · ui, where the error term ui has constant variance. This is the reason why we use the logarithmic link instead of the canonical link g(µi) = 1/µi (the inverse function), which, by the way, does
Table 5.2 Classical estimates for model (5.12).

    Variable     Estimate (SE)     95% CI            p-value
    intercept    7.234 (0.147)     (6.940; 7.528)    <10−4
    log(los)     0.822 (0.028)     (0.766; 0.878)    <10−4
    adm          0.214 (0.050)     (0.114; 0.314)    <10−4
    ins          0.093 (0.079)     (−0.065; 0.252)   0.2414
    age          −0.0005 (0.001)   (−0.003; 0.002)   0.6790
    sex          0.095 (0.050)     (−0.005; 0.195)   0.0602
    dest         −0.104 (0.069)    (−0.243; 0.034)   0.1353
    1/ν (scale)  0.0496

The estimates are obtained by maximum likelihood, see (5.4) (CI, confidence interval).
not guarantee that µi > 0. More specifically, we consider a parameterization of the Gamma density function such that one parameter identifies µi and the variance structure is defined by v(µi) = µi²/ν, see the top of page 201 in Cantoni and Ronchetti (2006). We start by fitting the full model, that is, the model with all of the available explanatory variables, as follows:

    log(E[cost]) = β0 + β1 log(los) + β2 adm + β3 ins + β4 age + β5 sex + β6 dest.        (5.12)

The MLE parameter estimates, their standard errors and the p-values of the significance tests (5.5) are given in Table 5.2. Before proceeding with any interpretation, it is recommended to validate the model. In this example, the deviance statistic (5.8) takes the value 5.07, which yields a p-value P(D > 5.07) ≈ 1 when compared with a χ²n−(q+1) = χ²93 distribution. This large p-value provides no evidence against the null hypothesis that the postulated model is better than the saturated model.
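Since the hospital data themselves are not reproduced here, the following sketch fits the same type of model, a Gamma GLM with logarithmic link, to simulated stand-in data; the variable names and coefficient values are illustrative only. With the log link, dµ/dη = µ and vµ = µ²/ν, so the IRWLS weights (dµ/dη)²/vµ are constant and each update reduces to an ordinary least squares step on the working response.

```python
import numpy as np

def irwls_gamma_log(X, y, n_iter=50):
    """Gamma GLM with log link: mu = exp(eta), v_mu = mu^2/nu, so the IRWLS
    weights are constant and each update is an OLS step on z = eta + (y - mu)/mu."""
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]   # starting value: OLS on log(y)
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu
        beta = np.linalg.lstsq(X, z, rcond=None)[0]
    return beta

rng = np.random.default_rng(3)
n, nu = 1000, 5.0
log_los = np.log(rng.integers(1, 30, size=n).astype(float))   # stand-in length of stay
adm = rng.binomial(1, 0.6, size=n).astype(float)              # stand-in admission type
eta_true = 7.2 + 0.8 * log_los + 0.2 * adm                    # illustrative coefficients
y = np.exp(eta_true) * rng.gamma(nu, 1.0 / nu, size=n)        # mean exp(eta), var mu^2/nu

X = np.column_stack([np.ones(n), log_los, adm])
beta_hat = irwls_gamma_log(X, y)
```

On clean Gamma data the fit recovers the generating coefficients; the point of the robust methods later in the chapter is precisely that this classical fit degrades when extreme costs contaminate the sample.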
5.2.4 Residual Analysis

Residual diagnostic plots are an alternative to formal tests. In the GLM setting several types of residuals can be defined, among which the most common are:

• the Pearson residuals riP = (yi − µˆi)/√(φˆ vµˆi);

• the standardized Pearson residuals riPS = (yi − µˆi)/√(φˆ vµˆi (1 − hii)), where the leverages hii are the diagonal entries of the hat matrix, see (3.11);

• the deviance residuals riD = sign(yi − µˆi) √di;

• the standardized deviance residuals riDS = riD/√(φˆ(1 − hii)).
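These four residual types can be computed directly; a numpy sketch for a log-link Poisson fit (φ = 1), using the GLM hat matrix H = W^{1/2} X (XT W X)−1 XT W^{1/2} with W the IRWLS working weights. The coefficients below are stand-ins for a fitted βˆ.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_hat = np.array([0.2, 0.5])        # stand-in for fitted coefficients
mu_hat = np.exp(X @ beta_hat)
y = rng.poisson(mu_hat).astype(float)

v_mu = mu_hat                          # Poisson variance function, phi = 1
W = mu_hat                             # log-link IRWLS working weights
XW = np.sqrt(W)[:, None] * X
H = XW @ np.linalg.solve(XW.T @ XW, XW.T)   # GLM hat matrix
h = np.diag(H)                              # leverages h_ii

r_P = (y - mu_hat) / np.sqrt(v_mu)                       # Pearson
r_PS = (y - mu_hat) / np.sqrt(v_mu * (1.0 - h))          # standardized Pearson
with np.errstate(divide="ignore", invalid="ignore"):
    d = 2.0 * (np.where(y > 0, y * np.log(y / mu_hat), 0.0) - (y - mu_hat))
r_D = np.sign(y - mu_hat) * np.sqrt(np.maximum(d, 0.0))  # deviance
r_DS = r_D / np.sqrt(1.0 - h)                            # standardized deviance
```

A useful check on the hat matrix is that the leverages sum to the number of fitted parameters (here 2) and each lies in [0, 1].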
[Figure 5.1: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale–Location and Residuals vs Leverage (with Cook's distance contours); observations 14, 21, 28, 44 and 63 are flagged.]
Figure 5.1 Diagnostic plots for the Gamma model (5.12), estimated with a MLE.
Residual plots can help in identifying departures from the linearity assumption (when plotted against continuous covariates), serial correlation (when plotted against the order in which the observations are collected, if known) and particular structures (when plotted against predicted values). In addition, it is usual to look at a Q-Q plot of the residuals against the normal quantiles. Note that for binary logistic models structures very often appear on the residual plots which are due to the discrete nature of the response variable but do not indicate fitting problems.

Since the diagnostic approach is based on a classical fit, it has to be used with caution. In fact, masking can occur, where a single large outlier may mask others. It is worth noting that in the GLM setting, an outlier or extreme observation would be an observation (yi, xiT) such that, under the GLM model that fits the majority of the data, yi is in some sense far from its fitted value g−1(xiTβˆ). The quantity yi − g−1(xiTβˆ) can be large because yi is an extreme response and/or the covariates xi are (at least for one of them) extreme themselves. A classical residual analysis can suffer from the masking effect in that the distorted data appear to be the norm rather than the exception. For instance, consider a regression setting where an outlier may have such a large effect on a slope estimated by a MLE that its residual (or any other measure used for diagnostics) will tend to be small, whereas other observations will have corresponding relatively large residuals. This behavior is due to the fact that classical estimates are affected by outlying points and are pulled in their direction. We advocate later for the use of a robust analysis in
[Figure 5.2: scatterplot of cost versus log(los); observations 14, 21, 28, 31, 44 and 63 are labeled.]
Figure 5.2 cost versus log(los) for the Gamma example of Section 5.2.3.
the first place (see also the discussion in Section 1.3). We nevertheless propose as a starting point to look at a few plots.

In Figure 5.1 we present the diagnostic plots for the fitted Gamma model as per (5.12). In this figure we represent the Pearson residuals as a function of the fitted values (top left panel), a normal Q-Q plot of the standardized deviance residuals (top right panel), a scale–location plot of the standardized deviance residuals as a function of the fitted values (bottom left panel) and a residuals versus leverage plot, that is, a plot of the standardized deviance residuals as a function of the leverage hii (bottom right panel). This last plot comes with added contour lines of equal Cook's distances (see Cook and Weisberg, 1982). Note that the plot function in R can also produce two extra plots, namely the Cook's distances and the Cook's distances as a function of the leverage.

From Figure 5.1, we can see that there seem to be a few outlying/influential data points with large residuals, in particular those identified with their observation number. To see why these observations are extreme, one can for example look at the plot of the variable cost as a function of the variable log(los), as in Figure 5.2. We see from this figure that the points with large residuals are in fact points which are extreme with respect to observations with the same or similar values of log(los). Even though the Gamma model admits variance increasing with the covariates (of the
order of µi² = exp(2xiTβ)), observations 14, 28, 63, 44 and 21 are considered too extreme with respect to the bulk of the data. On the other hand, observation 31 could be a leverage point, but is not otherwise worrying given that its y-value lies in a region covered by the model assumptions. The more extreme observations identified with this diagnostic analysis can potentially have a very bad impact on the parameter estimates, and this issue needs to be investigated further. We reanalyze this dataset in Section 5.6 with a robust technique.
5.3 A Class of M-estimators for GLMs

Deviations from the model can also occur for GLMs. The nature of possible deviations in the GLM class of models is close to what one can see in the regression setting: outliers in the response (producing large residuals) and leverage points in the design space. A notable exception is the binary response setting, where deviations in the response space take the form of misclassification (a zero instead of a one, or vice versa), and where the difference between an outlier and a leverage point is less clearcut.

To address the potential problem of deviating points in real data, or more generally the problem of slight model misspecification, we propose here a general class of M-estimators (see Section 2.3.1) for the GLM model as defined in Section 5.2. Given the Pearson residuals ri = (yi − µi)/√(φ vµi), the M-estimating equations for β of model (5.3) are given by the solution of the following estimating equations

    Σ_{i=1}^{n} [ψ(ri; β, φ, c) w(xi) (1/√(φ vµi)) µ′i − a(β)] = Σ_{i=1}^{n} Ψ(yi, xi; β, φ, c) = 0,        (5.13)

where µ′i = ∂µi/∂β = (∂µi/∂ηi) xi and a(β) = (1/n) Σ_{i=1}^{n} E[ψ(ri; β, φ, c)] w(xi)/√(φ vµi) µ′i, with the expectation taken over the distribution of yi|xi. The constant a(β) is a correction term to ensure Fisher consistency; see Sections 2.3.2 and 5.3.2. The function ψ(ri; β, φ, c) and the weights w(xi) are the new ingredients with respect to the classical GLM estimators obtained by maximum quasi-likelihood: compare with the estimating equations (5.6), which are obtained with ψ(ri; β, φ, c) = ri and w(xi) = 1 for all i. The function ψ is introduced to control deviations in the y-space, and leverage points are downweighted by the weights w(x). Conforming to the usage in robust linear regression, we call the estimator issued from (5.13) a Mallows-type estimator. It simplifies to a Huber-type estimator when w(xi) = 1 for all i.

It is worth noting that the estimating equations (5.13) can be conveniently rewritten as

    Σ_{i=1}^{n} [w̃(ri; β, φ, c) w(xi) ri (1/√(φ vµi)) µ′i − a(β)] = 0,                    (5.14)

where w̃(r; β, φ, c) = ψ(r; β, φ, c)/r. In this form, the estimating equations (5.13) can be interpreted as the classical estimating equations weighted (both with respect
5.3. A CLASS OF M -ESTIMATORS FOR GLMS
137
to ri and xi ) and re-centered via a(β) to ensure consistency. The particular weighting scheme considered in (5.14) is multiplicative in its design and residuals components (wi = w(r ˜ i ; β, φ, c)w(xi )). Alternatively, one could consider a global weighting scheme of the form wi (ri , xi ), as for example in Künsch et al. (1989). It should nevertheless be stressed that such a scheme increases the difficulty in calculating the Fisher consistency correction a(β). The estimation procedure issued from (5.13) can be written as an IRWLS, in the same manner as it is usually presented for the classical GLM estimating equations. We give the algorithm in Appendix E.3. The IRWLS algorithm has been a particularly convincing ‘selling argument’ when GLMs have been proposed. Thanks to this representation, the estimation procedure only requires software that allows the computation of weighted LS (or even only matrix computation). Nowadays computer power is a less crucial issue and other numerical procedures can be considered. For example, one can use a Newton–Raphson or a quasi-Newton algorithm. Finally, one can see that if we write yT = (y1 , . . . , yn ) and µT = (µ1 , . . . , µn ), the estimating equations (5.13) correspond to the minimization of the quantity QM (µ; y) =
n 0002
QM (µi ; yi ),
(5.15)
i=1
with respect to β, where the functions QM (yi ; µi ) can be written as QM (µi ; yi )
0004 µi 1 yi − t = ψ ; c w(xi ) √ dt φv φv t t s˜ 0007
n 0004 µj 0006 10002 1 yi − t dt, − E ψ ; c w(xi ) √ n j =1 t˜ φvt φvt
(5.16)
with s˜ such that ψ((yi − s˜ )/(φvs˜ ); c) = 0, and t˜ such that E[ψ((yi − t˜)/(φvt˜); c)] = 0. The function QM (µi ; yi ) in (5.16) plays the same role as the function Q(µi ; yi ) in (5.7), and is used later to define a difference of quasi-deviance type statistic, see Section 5.4.2.
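To make the estimating-equation idea concrete in the simplest possible case, here is a hedged sketch (not the Appendix E.3 IRWLS algorithm): a Huber-type M-estimate of a Poisson mean in an intercept-only model with v_µ = µ and φ = 1, solved by bisection. The consistency correction a(β) is deliberately omitted, so the estimate is slightly biased; the data are invented.

```python
import math

def huber_psi(r, c=1.345):
    return max(-c, min(c, r))

def robust_poisson_mean(y, c=1.345, lo=1e-6, hi=None):
    """Huber-type M-estimate of a Poisson mean (intercept-only sketch,
    v_mu = mu, phi = 1).  The Fisher consistency correction a(beta) is
    omitted for brevity."""
    if hi is None:
        hi = max(y) + 1.0
    def score(mu):
        # sum of psi(Pearson residual); decreasing in mu
        return sum(huber_psi((yi - mu) / math.sqrt(mu), c) for yi in y)
    for _ in range(200):          # bisection for the root of the score
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y = [3, 4, 5, 4, 3, 4, 5, 4, 50]          # one gross outlier
print(robust_poisson_mean(y))              # stays close to the bulk of the data
print(sum(y) / len(y))                     # classical mean is dragged upwards
```

The robust solution stays near 4, where the bulk of the counts lie, while the classical mean is pulled towards the outlier; this is the one-dimensional analogue of what (5.13) achieves for a full regression parameter β.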
5.3.1 Choice of ψ and w(x)

The role of the function ψ is to control the effect of large residuals; it therefore has to be bounded. Common choices for ψ are functions that level off, such as the Huber function, or functions that are redescending; see Section 2.3.1 for a discussion of the possible options. The function ψ is usually tuned with a constant c, which is typically chosen to guarantee a given level of asymptotic efficiency (computed as the ratio of traces of the asymptotic variances of the classical and the robust estimators, see, for example, (2.31)). The exact computation of the value of c that guarantees a certain level of efficiency is more complicated in GLM models than in linear regression, because the asymptotic efficiency here also depends on the design, and no
general result can be derived. It is always possible to inspect the estimated efficiency a posteriori and refit the model with a different value of c if it is not satisfactory. In practice, if the Huber ψ-function is used (and this is the case in the glmrob function of the robustbase R package and therefore in our examples), a value of c between 1.2 and 1.8 is often adequate. The default value is set to 1.345, the value that guarantees 95% efficiency for the normal-identity link GLM model. This value is also often a reasonable choice for the other models, such as the binomial and Poisson models. Note that when c → ∞, the classical GLM estimators are reproduced. In practice, very large values of c (e.g. ≥ 10) have the same effect.

The choice of w(x_i) is also suggested by robust estimators in linear models: the simplest approach is to use w(x_i) = √(1 − h_ii), where h_ii is the leverage. More sophisticated choices for w(x_i) are available, in particular some that in addition have high breakdown properties (see Section 3.2.4 for linear regression). The current implementation of the robustbase package, in addition to equal weights (w(x_i) = 1 for all i, the default) and w(x_i) = √(1 − h_ii), allows one to choose weights based on the Mahalanobis distances d_i (see (2.34)) of the form

    w(x_i) = 1/√(1 + 8 max(0, (d_i² − q)/√(2q))).
A few options are available to estimate the center and the scatter in d_i robustly, either by the MCD estimator of Rousseeuw (1984) or by a more efficient S-estimator, see Section 2.3.3. Note, however, that these high breakdown estimators are not well suited for categorical or binary covariates, and their use only makes sense if all of the explanatory variables are continuous. A variation of this kind of weight is given in Victoria-Feser (2002). The weighting scheme issued from a robust fitting procedure can be used for diagnostic purposes: inspecting the observations that received a low weight allows the user to identify the outlying observations. For an illustration, see Section 5.5 (Figure 5.3) and Section 5.6 (Figure 5.7).
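The simplest design weight above, w(x_i) = √(1 − h_ii), can be computed directly in the one-covariate case, where the leverage has the standard closed form h_ii = 1/n + (x_i − x̄)²/Σ_j (x_j − x̄)². The following sketch uses invented covariate values; it is only an illustration of the formula, not of the robustbase implementation.

```python
import math

def leverage_weights(x):
    """Design weights w(x_i) = sqrt(1 - h_ii) for an intercept plus one
    covariate, using the closed-form leverage
    h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [math.sqrt(1.0 - hi) for hi in h]

x = [17, 20, 22, 25, 28, 30, 33, 38]   # hypothetical ages; 17 and 38 are extreme
w = leverage_weights(x)
print([round(wi, 3) for wi in w])      # extreme x-values receive smaller weights
```

Points far from the bulk of the design (here the two extreme ages) receive the smallest weights, which is exactly the downweighting of leverage points that the Mallows-type estimator exploits.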
5.3.2 Fisher Consistency Correction

The term a(β) in the estimating equations (5.13) guarantees that the estimator is Fisher consistent, that is, asymptotically unbiased under the postulated model (normal, binomial, etc.). This term can sometimes be difficult to compute. Note, however, that it can be computed explicitly for GLM models where the responses are binomial and Poisson (cf. Cantoni and Ronchetti (2001b, p. 1028), with the change in notation V(µ_i) = φ v_µi), and Gamma (see Cantoni and Ronchetti (2006, pp. 210–211), with the change in notation v(µ_i) = φ v_µi). The expressions of a(β) for these models in the unified notation of this book are given in Appendix E.1.
When a(β) cannot be computed analytically, its estimation by simulation can be considered: the expectation involved in its computation is replaced by the empirical mean of a simulated sample.¹ A different strategy is to compute a simpler biased estimator of β by solving the uncorrected estimating equations

    Σ_{i=1}^n ψ(r_i; β, φ, c) w(x_i) (1/√(φ v_µi)) µ'_i = Σ_{i=1}^n Ψ̃(y_i, x_i; β, φ, c) = 0   (5.17)

and correct the bias a posteriori. In fact, the asymptotic bias of the estimator solving (5.17) can be approximated by a Taylor expansion and takes the form

    −E[ ∂/∂β ( Σ_{i=1}^n Ψ̃(y_i, x_i; β, φ, c) ) ]⁻¹ E[ Σ_{i=1}^n Ψ̃(y_i, x_i; β, φ, c) ].   (5.18)

This bias has to be estimated. One can either compute the expectations by numerical integration and evaluate them at β̃ (the solution of (5.17)), or replace the expectations with averages with respect to the data. Given that Σ_{i=1}^n Ψ̃(y_i, x_i; β, φ, c) evaluated at the solution β̃ of (5.17) equals zero, a robust pilot estimator, that is, a robust estimator obtained by other means, is needed. For further details on the comparison of the estimator obtained from (5.17)–(5.18) and the estimator obtained from (5.13), see Dupuis and Morgenthaler (2002), in particular their Section 2.2. Using indirect inference (Gallant and Tauchen, 1996; Gouriéroux et al., 1993) is another possible approach that can be implemented to correct the bias a posteriori, as is done in, e.g., Moustaki and Victoria-Feser (2006). For illustrations of the use of indirect inference with robust estimators, see also Genton and Ronchetti (2003).
5.3.3 Nuisance Parameters Estimation

As stated previously, φ is known to be constant for the Bernoulli, (scaled) binomial and Poisson models. In other models, this parameter has to be estimated, and this should be done while paying attention to maintaining the robustness properties gained in the estimation of β. In other words, it is necessary to also use a robust estimator for φ. We address here the normal and the Gamma distribution settings. In both cases the nuisance parameter is a scale parameter (for the Gamma, one may notice that var((y_i − µ_i)/µ_i) = ν), and we suggest borrowing one of the robust scale estimators available in the literature. Namely, we propose to use Huber's Proposal 2 estimator (Huber, 1981, p. 137), defined by (see also (3.7) for the regression model)

    Σ_{i=1}^n χ( (y_i − µ_i)/√(φ v_µi) ; β, φ, c ) = 0,   (5.19)

where χ(u; β, φ, c) = ψ²(u; β, φ, c) − δ, and δ = E[ψ²(u; β, φ, c)] is a constant that ensures Fisher consistency for the estimation of φ, see Hampel et al. (1986, p. 234). The function ψ can be chosen to be the same as that in (5.13).² The expectation in δ is computed under normality for u; see (3.8) for its computation for ψ²(u; β, φ, c) = ψ²_[Hub](u; β, φ, c) = u² w²_[Hub](u; β, φ, c).

Ideally, (5.19) has to be solved simultaneously with (5.13), but in practice a two-step procedure is often used: starting from a first guess for φ, an estimate of β is obtained, which in turn is used in (5.19), and so on until convergence.

¹ Care should be taken that in the iterative estimation process, the value of β used to simulate the data is not equal to the current value of β̂.
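Huber's Proposal 2 can be sketched in the simplest location-scale case with the Huber ψ, for which the consistency constant δ = E[ψ_c(Z)²] under a standard normal Z has the closed form δ = (2Φ(c) − 1) − 2cφ(c) + 2c²(1 − Φ(c)). The following is an illustrative sketch with made-up data, not the glmrob implementation; the location is taken as known rather than iterated in a two-step scheme.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def huber_delta(c=1.345):
    """delta = E[psi_c(Z)^2] for Z ~ N(0,1): the Fisher consistency
    constant in chi(u) = psi^2(u) - delta (closed form for the Huber psi)."""
    return (2 * norm_cdf(c) - 1) - 2 * c * norm_pdf(c) + 2 * c * c * (1 - norm_cdf(c))

def proposal2_scale(y, mu, c=1.345, lo=1e-6, hi=1e6):
    """Huber Proposal 2 scale: solve (1/n) sum psi_c((y_i - mu)/s)^2 = delta
    for s by bisection (the left-hand side is decreasing in s)."""
    delta = huber_delta(c)
    def g(s):
        return sum(min(abs((yi - mu) / s), c) ** 2 for yi in y) / len(y) - delta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 12.0]   # one gross outlier
print(proposal2_scale(y, mu=5.0))                      # not inflated by the outlier
```

The resulting scale stays close to the spread of the clean observations, whereas the classical standard deviation of these data is roughly ten times larger because of the single outlier.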
5.3.4 IF and Asymptotic Properties

The estimator β̂_[M] defined by (5.13) is an M-estimator characterized by the function Ψ(y_i, x_i; β, φ, c) = ψ(r_i; β, φ, c) w(x_i)/√(φ v_µi) µ'_i − a(β). Its IF is then

    IF(y, x; β̂, F_β) = M(Ψ, F_β)⁻¹ Ψ(y, x; β, φ, c),   (5.20)

where M(Ψ, F_β) = −E[(∂/∂β) Ψ(y, x; β, φ, c)]. Moreover, √n (β̂_[M] − β) has an asymptotic normal distribution with asymptotic variance M(Ψ, F_β)⁻¹ Q(Ψ, F_β) M(Ψ, F_β)⁻¹, where Q(Ψ, F_β) = E[Ψ(y, x; β, φ, c) Ψ(y, x; β, φ, c)^T] (see also (2.27)). The matrices M(Ψ, F_β) and Q(Ψ, F_β) for the Mallows quasi-likelihood estimator (5.13) can be easily computed as

    Q(Ψ, F_β) = (1/n) X^T A X − a(β) a(β)^T,   (5.21)

where A is a diagonal matrix with elements a_i = E[ψ(r_i; β, φ, c)²] w²(x_i)/(φ v_µi) (∂µ_i/∂η_i)², and

    M(Ψ, F_β) = (1/n) X^T B X,   (5.22)

where B is a diagonal matrix with elements b_i as defined in Appendix E.1, and where the expectations are taken at the conditional distribution of y_i | x_i. Cantoni and Ronchetti (2001b) have computed these matrices for binomial and Poisson models and Cantoni and Ronchetti (2006) for Gamma models. These results are presented in Appendix E.2 in a unified notation. Estimated versions of the matrices M(Ψ, F_β) and Q(Ψ, F_β) are obtained by replacing the parameters by their M-estimates.
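The sandwich structure M⁻¹ Q M⁻¹ can be illustrated in the scalar case of a Huber location M-estimate with known unit scale, where M̂ is the average of ψ'(r_i) (equal to 1 inside [−c, c] and 0 outside) and Q̂ is the average of ψ(r_i)². This is only a one-parameter sketch with invented data, not the GLM matrices (5.21)–(5.22).

```python
import math

def huber_psi(r, c=1.345):
    return max(-c, min(c, r))

def sandwich_se(y, theta, c=1.345):
    """Sandwich standard error sqrt(M^-1 Q M^-1 / n) in the scalar case
    of a Huber location M-estimate with known unit scale."""
    n = len(y)
    r = [yi - theta for yi in y]
    M = sum(1.0 for ri in r if abs(ri) <= c) / n    # empirical E[psi'(r)]
    Q = sum(huber_psi(ri, c) ** 2 for ri in r) / n  # empirical E[psi(r)^2]
    return math.sqrt(Q / (M * M) / n)

y = [0.2, -0.5, 0.1, 0.4, -0.3, 0.0, 0.6, -0.2, 8.0]  # one outlier
print(sandwich_se(y, theta=0.1))
```

Because ψ is bounded, the outlier contributes at most c² to Q̂, so the resulting standard error reflects the spread of the clean data rather than the outlier; this is the mechanism behind the much smaller robust standard errors reported later in Table 5.3.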
5.3.5 Hospital Costs Example (continued)

Consider again the hospital costs example introduced in Section 5.2.3. Model (5.12) is now refitted via the robust estimating equations (5.13) with c = 1.5 and w(x_i) = 1, that is, with a Huber estimator. The scale estimator (5.19) is used for the nuisance parameter, with the same value of c. The estimated parameters, standard errors, CIs and p-values of the significance test statistics (5.23) are given in Table 5.3, to be compared with Table 5.2 (classical estimates). Only small differences appear in the values of the estimated coefficients between the classical and the robust analysis

² The Huber ψ-function is the one used in the implementation in the robustbase package.
Table 5.3 Robust estimates for model (5.12).

Variable       Estimate (SE)     95% CI              p-value
intercept       7.252 (0.105)    (7.042; 7.462)      <10⁻⁴
log(los)        0.839 (0.020)    (0.799; 0.879)      <10⁻⁴
adm             0.222 (0.036)    (0.151; 0.294)      <10⁻⁴
ins             0.009 (0.057)    (−0.104; 0.122)     0.869
age            −0.001 (0.001)    (−0.003; 0.001)     0.257
sex             0.073 (0.036)    (0.001; 0.144)      0.042
dest           −0.123 (0.050)    (−0.222; −0.024)    0.013
1/ν (scale)     0.0243

The estimates are obtained solving (5.13) with c = 1.5 and w(x_i) = 1 for all i (Huber's estimator), and (5.19) with c = 1.5.
except for the variable ins, where there is a difference by a factor of 10 (which is not a typo). This large difference is certainly due to the small number of patients (only nine) with private insurance, one of which is heavily downweighted in the robust analysis (patient 28, w̃(r_i; β, φ, c) = 0.24). On the other hand, there are major discrepancies between the standard errors estimated by the two estimators, those based on the robust approach being much smaller. These differences are mainly due to the fact that the scale estimate from the classical analysis is twice as large as that from the robust analysis (see also the simulation results of Cantoni and Ronchetti (2006, Section 4)). This will also have an impact on the CIs and significance tests, as we will see in Section 5.4.3. Meanwhile, we look at what the robust fit tells us. The observations that are heavily downweighted, that is, with weights w̃(r_i; β, φ, c) smaller than 0.5, are w̃(r_14; β, φ, c) = 0.23, w̃(r_21; β, φ, c) = 0.50, w̃(r_28; β, φ, c) = 0.24, w̃(r_44; β, φ, c) = 0.42 and w̃(r_63; β, φ, c) = 0.32, which in this case are the same observations as identified in Section 5.2.3. Very similar results in terms of coefficient and standard error estimates are obtained if weights w(x_i) = √(1 − h_ii) are used (not shown). This indicates that we can be confident that there are no bad leverage points (see Section 3.2.4.2) in the sample and, therefore, we can use a Huber-type estimator to avoid any additional loss of efficiency. Indeed, if one computes the weights w(x_i) = √(1 − h_ii), they range from 0.9 to 1, with the first quartile equal to 0.96, the median equal to 0.97 and the third quartile equal to 0.98. It is particularly interesting to look at the weight of observation 31 (a potential influential point, as can be seen in Figure 5.2), which is w(x_31) = 0.96, indicating that there is no leverage effect.
5.4 Robust Inference

5.4.1 Significance Testing and CIs

With the asymptotic result of Section 5.3.4, it is possible to draw approximate inference for β, either by constructing approximate (1 − α) CIs or by computing univariate z-statistics, namely

    z-statistic = β̂_[M]j / SE(β̂_[M]j),   (5.23)

where SE(β̂_[M]j) = √(var̂(β̂_[M]j)) and

    var̂(β̂_[M]j) = (1/n) [ M̂(Ψ, F_β)⁻¹ Q̂(Ψ, F_β) M̂(Ψ, F_β)⁻¹ ]_(j+1)(j+1),

in which the matrices Q̂ and M̂ are estimated using β̂_[M] in (5.21) and (5.22), respectively. The z-statistic can then be compared with a standard normal distribution to test the null hypothesis H₀: β_j = 0 and compute the corresponding p-value. As in the classical setting, the asymptotic distribution can be used to define approximate (1 − α) CIs for each parameter β_j. Here, they write (β̂_[M]j − z_(1−α/2) SE(β̂_[M]j); β̂_[M]j + z_(1−α/2) SE(β̂_[M]j)), where z_(1−α/2) is the (1 − α/2) quantile of the standard normal distribution.
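As a small numerical check, the z-statistic, two-sided p-value and 95% CI of (5.23) can be computed from an estimate and its standard error. Using the rounded sex coefficient of Table 5.3 (estimate 0.073, SE 0.036), the sketch below reproduces that row of the table up to the rounding of the reported inputs.

```python
import math

def z_inference(est, se):
    """z-statistic, two-sided p-value and 95% CI as in (5.23)."""
    z = est / se
    # two-sided p-value from the standard normal CDF (via erf)
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    zq = 1.959964  # z_(0.975), hard-coded to stay stdlib-only
    return z, p, (est - zq * se, est + zq * se)

# sex coefficient of Table 5.3
z, p, ci = z_inference(0.073, 0.036)
print(round(z, 2), round(p, 3), tuple(round(v, 3) for v in ci))
```

The p-value comes out near 0.042–0.043 and the CI near (0.002; 0.144), consistent with Table 5.3 given that the table was computed from unrounded estimates.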
5.4.2 General Parametric Hypothesis Testing and Variable Selection

The general parametric theory on robust testing (e.g. Heritier and Ronchetti, 1994), i.e. robust LRT, Wald and Rao or score tests, can also be used in the GLM setting using the results presented in Section 2.5.3. However, since the deviance has historically been used for inference purposes with GLMs, we prefer to concentrate on the possibilities offered by a robust version of the deviance. Note, however, that in the classical setting the difference of deviances statistic to compare two nested models coincides with the LRT statistic when φ (the scale parameter) is known.

When confronted with data, it is common practice to fit a first model that includes all available explanatory variables (the full model). The p-values associated with the univariate test statistics (z-statistics) on each coefficient separately give a first broad impression of the important variables impacting the response. However, this information has to be interpreted with caution, given the possible correlation between explanatory variables and the non-orthogonality of the tests. It is therefore preferable to conduct a proper variable selection analysis by means of adequate tools. Tools for variable selection, e.g. test statistics, are as much affected by extreme observations as estimators are. This effect manifests itself in terms of level (for example, an actual level which does not correspond to the nominal level) and in terms of loss of power; see the discussions in Sections 2.4.2, 2.4.3 and 2.5.5.

Consider a larger model M_{q+1} with q explanatory variables (plus intercept) and a sub-model M_{q−k+1} with only (q − k) explanatory variables (plus intercept). The question that arises is whether the sub-model is a good enough representation of the data. Testing that some explanatory variables are not significantly contributing to
the model amounts to testing that a subset of β is equal to zero. Therefore, without loss of generality, we split β = (β_(1)^T, β_(2)^T)^T, with β_(1) of dimension (q − k + 1) and β_(2) of dimension k, and we test the null hypothesis H₀: β_(2) = 0. We propose (see Cantoni and Ronchetti, 2001b) a robust counterpart to the difference of deviances statistic

    ΔQ_M = 2 [ Σ_{i=1}^n Q_M(µ̂_i; y_i) − Σ_{i=1}^n Q_M(µ̇_i; y_i) ],   (5.24)

where the quasi-likelihood functions Q_M(µ_i; y_i) are defined by (5.16), µ̂_i = µ_i(β̂_[M]) is the M-estimate under model M_{q+1} and µ̇_i = µ_i(β̇_[M]) is the M-estimate under model M_{q−k+1}. Note that this difference of deviances is independent of s̃ and t̃, see (5.16), because their contributions cancel out. Computing ΔQ_M implies the computation of the functions Q_M(µ_i; y_i), which are integral forms for which there is no general analytical expression. They can easily be approximated numerically, and they have been implemented in this way. In situations where the evaluation of these integrals is problematic, an asymptotic approximation can be used, see Section 5.4.2.1. The same forms for the functions ψ and w(x_i) as for the M-estimator β̂_[M] can be used in (5.24), see the discussion in Section 5.3.1.

The test statistic ΔQ_M can be used to compare two nested models predefined by the analyst, but can also be used for a more automatic analysis, either sequential (see the example in Section 5.6.2) or marginal (stepwise, see the example in Section 5.5.2). The test statistic (5.24) is in fact a generalization of the quasi-deviance test for GLMs ((5.11), which is recovered by taking Q_M(µ_i; y_i) = ∫_{y_i}^{µ_i} ((y_i − t)/(φ v_t)) dt). Moreover, when the link function is the identity (linear regression), the statistic (5.24) becomes the τ-test statistic given by Hampel et al. (1986, Chapter 7), see also Section 3.3.3.

5.4.2.1 Asymptotic Distribution and Robustness Properties

Let A_(ij), i, j = 1, 2, be the partitions of a (q + 1) × (q + 1) matrix A according to the partition of β into β_(1) and β_(2). Under technical conditions discussed in Cantoni and Ronchetti (2001b) and under H₀: β_(2) = 0, the test statistic ΔQ_M defined by (5.24) is asymptotically equivalent to

    n L_n^T C(Ψ, F_β) L_n = n R_{n(2)}^T M(Ψ, F_β)_{22.1} R_{n(2)},   (5.25)

where C(Ψ, F_β) = M⁻¹(Ψ, F_β) − M̃⁺(Ψ, F_β) (with M̃⁺(Ψ, F_β) given below), √n L_n (of dimension (q + 1)) is normally distributed N(0, Q(Ψ, F_β)),

    M(Ψ, F_β)_{22.1} = M(Ψ, F_β)_(22) − M(Ψ, F_β)_(12)^T M(Ψ, F_β)_(11)⁻¹ M(Ψ, F_β)_(12),

and √n R_n (of dimension (q + 1)) is normally distributed N(0, M⁻¹(Ψ, F_β) Q(Ψ, F_β) M⁻¹(Ψ, F_β)) (see Cantoni and Ronchetti, 2001b). Note that R_{n(2)} is of dimension k.
This means that ΔQ_M is asymptotically equivalent to a quadratic form in normal variables and that ΔQ_M is asymptotically distributed as Σ_{i=1}^k d_i N_i², where N_1, . . . , N_k are independent standard normal variables, d_1, . . . , d_k are the k positive eigenvalues of the matrix Q(Ψ, F_β)(M⁻¹(Ψ, F_β) − M̃⁺(Ψ, F_β)), and M̃⁺(Ψ, F_β) is equal to

    M̃⁺(Ψ, F_β) = [ M(Ψ, F_β)_(11)⁻¹   0_{(q−k+1)×k} ; 0_{k×(q−k+1)}   0_{k×k} ],

where 0_{a×b} is a matrix of dimension a × b with only zero entries. The above results imply that the asymptotic distribution of ΔQ_M is a linear combination of χ₁² variables, for which theoretical results (e.g. Imhof, 1961) and algorithms (see Davies, 1980; Farebrother, 1990) exist. Moreover, if necessary, the distribution of the variable Σ_{i=1}^k d_i N_i² can be approximated quite well with a d̄χ_k² distribution, where d̄ = (1/k) Σ_{i=1}^k d_i. No formal proof exists for the asymptotic distribution of this test statistic, but we expect that the results for linear models by Markatou and Hettmansperger (1992) carry over, at least approximately; our experience shows that this is often the case in practice. Other approximations exist, see Wood (1989), Wood et al. (1993) and Kuonen (1999).

In addition to providing the asymptotic distribution of ΔQ_M, result (5.25) states that ΔQ_M is asymptotically equivalent to the quadratic form

    n β̂_{[M](2)}^T M(Ψ, F_β)_{22.1} β̂_{[M](2)}.

This suggests that ΔQ_M can be approximated with this easier to compute quadratic form to avoid the numerical integrations in Q_M(µ_i; y_i), in particular when n is large.³

The robustness properties of a test statistic are measured on the level and on the power scale, see Section 2.2. Cantoni and Ronchetti (2001b) work out the expressions of the level and of the power of ΔQ_M under contamination. These results show in particular that the asymptotic level of ΔQ_M under contamination is stable as long as a bounded influence M-estimator β̂_[M] is used in its definition.
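The d̄χ_k² approximation can be checked numerically against a Monte Carlo evaluation of the tail of Σ d_i N_i². The eigenvalues below are hypothetical, chosen only for illustration; k = 2 is used so that the χ₂² survival function has the closed form exp(−x/2).

```python
import math
import random

def pvalue_dbar_approx(x, d):
    """Approximate P(sum_i d_i N_i^2 > x) by a scaled chi-square dbar*chi2_k;
    for k = 2 the chi2_2 survival function is exp(-x/2)."""
    k = len(d)
    assert k == 2, "closed form used here only for k = 2"
    dbar = sum(d) / k
    return math.exp(-x / (2.0 * dbar))

def pvalue_monte_carlo(x, d, n_sim=200_000, seed=42):
    """Monte Carlo estimate of the same tail probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        s = sum(di * rng.gauss(0.0, 1.0) ** 2 for di in d)
        if s > x:
            hits += 1
    return hits / n_sim

d = [1.3, 0.7]   # hypothetical positive eigenvalues of Q(M^-1 - M~+)
x = 5.99         # observed statistic (roughly the 5% point of chi2_2)
print(pvalue_dbar_approx(x, d), pvalue_monte_carlo(x, d))
```

With these moderately unequal eigenvalues the scaled chi-square approximation (about 0.050 here) stays within a fraction of a percentage point of the simulated tail probability, illustrating why it is often adequate in practice; for strongly unequal eigenvalues the Imhof- or Davies-type algorithms cited above are preferable.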
5.4.3 Hospital Costs Data Example (continued)

If we look back at Tables 5.2 and 5.3, we can see that the conclusions from the classical and the robust analyses on the basis of the univariate test statistics (p-values in Tables 5.2 and 5.3) are quite different: while no doubt arises as to the significance of the intercept and the variables log(los) and adm in both analyses, the robust analysis would suggest a significant effect also for dest, and less clearly for sex, making the role of these two variables less clear (see also the corresponding CIs). A more complete variable selection procedure is therefore recommended before proceeding with any interpretation and conclusion. We now investigate this variable selection issue a little bit further.

³ The anova.glmrob function in the package robustbase in R (called by the generic function anova) implements both the test statistic ΔQ_M and its asymptotic quadratic approximation, in addition to a Wald test.
We first start by comparing the full model to the reduced model without the variables ins and age. This amounts to testing H₀: β₃ = β₄ = 0 in (5.12). We keep the same robustness tuning parameters for the robust test as in Section 5.3.5, that is, c = 1.5 and w(x_i) = 1. The difference of quasi-deviances ΔQ_M is equal to 1.23 (p-value = 0.5), which confirms that these two variables do not have a significant impact on the cost of stay. We go on by comparing the model including log(los), adm, sex and dest to the nested sub-model that excludes sex. The hypothesis that the coefficient corresponding to the variable sex is equal to zero is rejected at the 5% level (ΔQ_M = 5.26 and p-value = 0.015). Similarly, we compare the model including log(los), adm, sex and dest to the nested sub-model that discards dest. The difference of quasi-deviances statistic ΔQ_M is equal to 4.82 and the p-value is 0.02, which implies the rejection, at the 5% level, of the null hypothesis that the coefficient of dest is equal to zero. This means that the models without either sex or dest are not enough to describe the data. As a comparison, a classical analysis would also fail to reject the sub-model without ins and age compared with the full model (5.12) (p-value = 0.44). Starting from this sub-model, the classical analysis would reject the sub-model without sex, but not the sub-model without dest. This confirms the preliminary differences between the classical and robust analyses observed with the full fit in Section 5.3.5. The final model obtained from the robust analysis has the following estimated linear predictor (with standard errors of the coefficients within parentheses):

    7.168 + 0.839 log(los) + 0.231 adm + 0.082 sex − 0.104 dest
    (0.067)  (0.020)         (0.035)     (0.034)    (0.047)

The estimate of the scale parameter is 0.024.
The analysis suggests that hospital costs of stay for back problems are heavily dependent on the length of stay, but also on the type of admission, the sex of the patient and their destination when leaving the hospital. The age of the patient and the type of insurance do not significantly impact the costs for this pathology. The impact of the significant covariates on the average costs E[y_i | x_i] = µ_i is described by µ_i = g⁻¹(x_i^T β). Having used a logarithmic link in this example, we have that µ_i = exp(x_i^T β). The interpretation of each coefficient uses this relationship and can be done separately, provided that all of the other variables are kept fixed. In this respect, the model constructed above tells us that an emergency admission has a multiplicative effect of exp(0.231) = 1.26 on the average cost, which means a 26% increase. Patients that go home directly after the hospital stay (with respect to those that go to another institution) have lower costs, by a factor of exp(−0.104) ≈ 0.90, that is, costs about 10% lower. One could expect the converse to be true, but the patient's destination after hospital is probably an indicator of how severe the back problems under treatment are: a patient who can be independent and go home directly is probably being treated for a lighter problem in the first place. Of course, the longer the stay, the higher the costs, as expected: if log(los) increases by 1, that is, if los is multiplied by e ≈ 2.7, the average cost is multiplied by exp(0.839) = 2.31. Finally, costs for male patients seem to be slightly higher than those for female patients, by a factor of exp(0.082) = 1.09. In this example, the estimated parameters of the variables appearing in this final model are quite close to the corresponding estimates in the full model, see Tables 5.2 and 5.3. This is due to the low correlation between the covariates.
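The multiplicative effects quoted above follow directly from exponentiating the coefficients of the final model, which a few lines of arithmetic confirm (the coefficients are those reported in the text; nothing else is assumed):

```python
import math

# Multiplicative effects exp(beta_j) implied by the log link, using the
# coefficients of the final hospital-costs model quoted above
coefs = {"adm": 0.231, "dest": -0.104, "log(los)": 0.839, "sex": 0.082}
for name, beta in coefs.items():
    print(f"{name}: multiplies the average cost by {math.exp(beta):.2f}")
```

For instance, exp(0.231) ≈ 1.26 reproduces the 26% increase for an emergency admission, and exp(−0.104) ≈ 0.90 the roughly 10% lower cost for patients going home directly.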
5.5 Breastfeeding Data Example

5.5.1 Robust Estimation of the Full Model

We now look at a binary response example. The data come from a study conducted in a UK hospital on the decision of pregnant women to breastfeed their babies or not, see Moustaki et al. (1998). For the study, 135 expectant mothers were asked what kind of feeding method they would use for their coming baby. The responses were classified into two categories (variable breast), the first including breastfeeding, trying to breastfeed and mixed breast- and bottle-feeding (coded 1), and the second exclusive bottle-feeding (coded 0). The available covariates are the advancement of the pregnancy (pregnancy, end or beginning), how the mothers were fed as babies (howfed, some breastfeeding or only bottle-feeding), how the mothers' friends fed their babies (howfedfriends, some breastfeeding or only bottle-feeding), whether they had a partner (partner, no or yes), their age (age), the age at which they left full-time education (educat), their ethnic group (ethnic, white or non-white), and whether they had ever smoked (smokebf, no or yes) or had stopped smoking (smokenow, no or yes). All of the factors are two-level factors; the first listed level of each factor is used as the reference (coded 0). The sample characteristics are as follows: out of the 135 observations, 99 were from mothers who had decided at least to try to breastfeed, 54 mothers were at the beginning of their pregnancy, 77 were themselves breastfed as a baby, 85 of the mothers' friends had breastfed their babies, 114 mothers had a partner, the median age was 28.17 (minimum 17, maximum 40), the median age at the end of education was 17 (minimum = 14, maximum = 38), 77 mothers were white and 32 mothers were smoking during the pregnancy, whereas 51 had smoked before.
The aim of the study was to determine the factors impacting the decision to at least try to breastfeed, in order to target breastfeeding promotion toward women with a lower probability of choosing it. We fitted the following model:

    logit(E[breast]) = logit(P(breast)) = β₀ + β₁ pregnancy + β₂ howfed + β₃ howfedfr + β₄ partner + β₅ age + β₆ educat + β₇ ethnic + β₈ smokenow + β₉ smokebf,   (5.26)

where logit(p) = log(p/(1 − p)), with p/(1 − p) being the odds of a success, and P(breast) is the probability of at least trying to breastfeed. Table 5.4 gives the robust estimates, standard errors and p-values for the z-test (5.23) of model (5.26), for a Huber-type estimator (w(x_i) = 1) and for a Mallows-type estimator with w(x_i) = √(1 − h_ii). The value c = 1.5 has been used in both
Table 5.4 Robust estimates for model (5.26).

                      Huber                        Mallows
Variable              Estimate (SE)     p-value    Estimate (SE)     p-value
intercept             −7.782 (3.365)    0.021      −7.778 (3.363)    0.021
pregnancy beginning   −0.816 (0.695)    0.241      −0.815 (0.694)    0.241
howfed breast          0.545 (0.710)    0.443       0.540 (0.708)    0.445
howfedfr breast        1.479 (0.690)    0.032       1.482 (0.689)    0.032
partner yes            0.772 (0.816)    0.344       0.775 (0.816)    0.342
age                    0.030 (0.060)    0.611       0.031 (0.060)    0.608
educat                 0.377 (0.186)    0.042       0.376 (0.185)    0.042
ethnic non-white       2.712 (1.125)    0.016       2.705 (1.122)    0.016
smokenow yes          −3.476 (1.129)    0.002      −3.468 (1.127)    0.002
smokebf yes            1.507 (1.103)    0.172       1.507 (1.102)    0.171

The estimates are obtained by solving (5.13) with c = 1.5 (Huber's estimator) and with c = 1.5 and w(x_i) = √(1 − h_ii) (Mallows's estimator).
cases. The coefficient estimates from both analyses are quite close, even though individual 18 (see the top panel of Figure 5.3) is considered a potential leverage point. This mother is 38 years old and still in education (educat = 38). This is possible, but is certainly not common to the majority of the population. This remark raises the question of the rationale behind the definition of the variable educat (age at the end of full-time education): what information are we trying to measure with this variable? If it is educational level, maybe that is not what the variable educat really measures. In other studies, the number of years of education is recorded, which can also be seen as a proxy for social status. From Figure 5.3 (bottom panel) we can also see that a small set of observations is downweighted on the grounds of their residuals; in particular, observations 11, 14, 63, 75, 90 and 115 receive a weight of less than 0.6. Note that 6 observations out of 135 constitute about 4.5% of the total information. For these mothers the fitted model (5.26) would predict a probability of at least trying to breastfeed that is not consistent with the behavior of the majority of the mothers in the sample on the basis of the covariates (see Figure 5.4): for instance, for observations 75, 11, 115 and 14 the predicted probability of trying to breastfeed is larger than 0.90, whereas these mothers had decided to bottle-feed. On the other hand, mothers 90 and 63 are given by the model a low probability of trying to breastfeed (only 0.02 and 0.11, respectively), whereas they had chosen to do so. According to the p-values of Table 5.4, the variables that have the greatest impact on the decision to at least try to breastfeed are whether the ethnic group is non-white and whether the mother is currently smoking, and, less strongly, the age at which she left education and whether her friends have chosen to breastfeed. A more formal variable selection procedure follows in Section 5.5.2.
Note that a classical analysis would have yielded different estimates and conclusions, see also Section 5.5.2. A slightly different estimation method for this dataset
has been used in Victoria-Feser (2002), in particular a model-based weighting scheme. The conclusions are similar between our proposal and her Mallows-type estimator.

Figure 5.3 Robustness weights on the design and on the residuals for model (5.26), when estimated by (5.13) with c = 1.5 and w(x_i) = √(1 − h_ii).
5.5.2 Variable Selection

When analysing the full model on the basis of the p-values corresponding to the z-statistics, the variables howfedfr, smokenow, ethnic and educat have an important impact on the decision to at least try to breastfeed. Here we investigate the variable selection issue further. With this dataset, we illustrate a backward stepwise procedure. We start with the full model and use the test statistic ΔQ_M to test each sub-model with one variable removed. All of the sub-models for which the p-value of such a test is larger than 5% are candidates for removal, and among them we choose the sub-model with the largest p-value. We then repeat the procedure, taking this sub-model as the new reference model and testing all of its sub-models. The procedure stops when all of the p-values are smaller than 0.05. Table 5.5 gives the p-values at the first step of the procedure. For comparison, we also give the results for a classical analysis. A comparison of the p-values from the classical and the robust approaches confirms that the robustness issues related to
149
75 11 115 14
0.6 0.4 0.2
Fitted probability of breastfeeding
0.8
1.0
5.5. BREASTFEEDING DATA EXAMPLE
63 0.0
90 0.0
0.2
0.4
0.6
0.8
1.0
Breast
Figure 5.4 Fitted values versus actual √ values for model (5.26), when estimated by (5.13) with c = 1.5 and w(xi ) = 1 − hii . Observations with w(ri ; β, φ, c) < 0.6 are spotted. the presence of deviating data points are also a concern for inference. In fact, large discrepancies (as large as 0.2) appear between the two approaches in terms of pvalues. Some of these differences do not really have an impact on the significance decision at a usual level of 5% or 10% (e.g. howfed or partner), but some others do (e.g. educat). The complete robust stepwise procedure yields the following final model (with standard errors of the coefficients within parentheses): −6.417 + 1.478 howfedfr + 3.260 ethnic + 0.403 educat − 2.421 smokenow. (2.973) (0.622) (1.199) (0.177) (0.664)
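The backward elimination loop described above is mechanical and easy to sketch. The following Python illustration is not the book's implementation: the `pvalue_of` oracle stands in for the quasi-deviance difference tests, and it keeps the first-step p-values fixed across steps, whereas the real procedure recomputes them at every step (which is why the intermediate removal order can differ slightly, even though the final model agrees).

```python
def backward_eliminate(variables, pvalue_of, threshold=0.05):
    """Backward stepwise selection: at each step, test each sub-model with
    one variable removed; among candidates whose removal p-value exceeds
    the threshold, drop the one with the largest p-value; stop when every
    removal p-value is at or below the threshold."""
    current = list(variables)
    while True:
        pvals = {v: pvalue_of(v, current) for v in current}
        worst, p = max(pvals.items(), key=lambda kv: kv[1])
        if p <= threshold:
            return current
        current.remove(worst)

# Toy oracle built from (rounded) first-step robust p-values of Table 5.5
first_step = {"age": 0.585, "howfed": 0.398, "pregnancy": 0.206,
              "partner": 0.329, "smokebf": 0.086, "educat": 0.023,
              "howfedfr": 0.028, "ethnic": 0.002, "smokenow": 1e-5}
final = backward_eliminate(first_step, lambda v, cur: first_step[v])
print(sorted(final))  # the four variables of the robust final model
```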
From the robust analysis, the non-significant variables have been removed in the following order: age (p-value = 0.58, the largest p-value at the first step; see Table 5.5), howfed (p-value = 0.40), pregnancy (p-value = 0.26), partner (p-value = 0.41) and smokebf (p-value = 0.25). As a comparison, a classical backward stepwise procedure would have discarded (in this order) age (p-value = 0.60, the largest p-value at the first step; see Table 5.5), howfed (p-value = 0.58), educat (p-value = 0.10), pregnancy (p-value = 0.20), smokebf (p-value = 0.10) and partner (p-value = 0.0577). The classical final model would therefore include only howfedfr, ethnic and smokenow, which is a smaller and different set of covariates than that obtained by the robust analysis.

Table 5.5 p-values of the first step of a backward stepwise procedure for variable selection for the breastfeeding data example of Section 5.5. Classical p-values obtained with c = ∞ and w(xi) = 1; robust p-values with c = 1.5 and w(xi) = √(1 − hii) in (5.24).

Variable               Classical   Robust
pregnancy beginning    0.08134     0.20600
howfed breast          0.60261     0.39778
howfedfr breast        0.00951     0.02820
partner yes            0.12219     0.32888
age                    0.60271     0.58512
educat                 0.14075     0.02283
ethnic non-white       0.00012     0.00187
smokenow yes           <10^−4      <10^−4
smokebf yes            0.05157     0.08605

From the model identified and fitted by the robust technique we learn that the way a mother has been fed as a child does not play a role in her decision of whether to breastfeed, whereas the choice of her friends is more important and has an effect on the expectant mother's decision. A mother's choice to try to breastfeed does not evolve during the pregnancy. This choice is also not affected by the mother being single. Having smoked before the pregnancy has no effect on the decision to breastfeed, but being a smoker during the pregnancy significantly reduces the probability of at least trying to breastfeed. Ethnicity and the age at which a mother leaves education are also factors that have an impact on a mother's decision. The coefficient values allow us to quantify the identified effects on the decision to at least try to breastfeed. As opposed to the Gamma model of Section 5.3.5 or to a Poisson model (see Section 5.6), the interpretation of the impact of covariates on the probability P(breast) is more difficult due to the nature of the logit transformation. In fact,

P(breasti) = µi = exp(xi^T β) / (1 + exp(xi^T β)).    (5.27)
With these models it is therefore more common to interpret the coefficients on the odds or odds-ratio scale. The robust estimation procedure has no impact on the way the model is interpreted; the only difference is that the coefficients are estimated differently.
For a continuous variable, the effect of a unit change on the odds is equal to the exponential of the corresponding coefficient. For example, leaving education a year later increases the odds of at least trying to breastfeed by a factor of exp(0.403) = 1.50, if all of the other covariates are kept fixed. On the other hand, for two-level factors the logit model leads to the interpretation of the odds-ratio (the ratio of the odds). For instance, the odds-ratio of at least trying to breastfeed for a non-white expectant mother relative to a white mother is equal to exp(3.260) = 26.05. Similarly, the odds-ratio of at least trying to breastfeed for a smoking mother relative to a non-smoking one is exp(−2.421) = 0.09; being a smoker during pregnancy has the strongest (negative) effect in the model. Finally, the odds-ratio of at least trying to breastfeed for an expectant mother whose friends have chosen to breastfeed relative to one whose friends bottlefeed is exp(1.478) = 4.38. The interpretation of odds and odds-ratios pertains to the logistic model (that is, the binomial model with logit link), but does not apply to models with the probit or complementary log–log link. This is one of the reasons that make logistic models more popular than the two other alternatives, in addition to their more convenient computational aspects. To summarize, let us recall that the aim of the study was to better target expectant mothers when promoting breastfeeding. The analysis of this dataset suggests that, to increase the average probability of choosing to at least try to breastfeed, efforts should be directed towards white mothers and towards mothers who leave education earlier. Pregnant women who smoke tend to avoid breastfeeding: investigating this phenomenon further could help increase the average probability of expectant mothers choosing to breastfeed.
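These odds and odds-ratio figures are simply exponentials of the reported coefficients and can be checked directly. A minimal sketch (coefficients copied from the final model above):

```python
import math

# Coefficients from the robust final model reported above
coefs = {"howfedfr": 1.478, "ethnic": 3.260, "educat": 0.403, "smokenow": -2.421}

# For a two-level factor, exp(beta) is the odds-ratio between the two levels;
# for a continuous variable, exp(beta) is the multiplicative effect of a
# one-unit increase on the odds, all other covariates held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, orat in odds_ratios.items():
    print(f"{name}: {orat:.2f}")
```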
5.6 Doctor Visits Data Example

5.6.1 Robust Estimation of the Full Model

Count data are an important subclass of data that fit into the GLM framework. For this application we use data from the Health and Retirement Study (HRS),4 which surveys more than 22 000 Americans over the age of 50 every 2 years. The study paints an emerging portrait of an aging America's physical and mental health, insurance coverage, financial status, family support systems, labor market status and retirement planning. The original full dataset from the RAND HRS Data (Version D) distribution (six waves: 1992, 1994, 1996, 1998, 2000 and 2002) contains 26 728 observations and 4140 variables per individual. Individuals were separated into four cohorts: • HRS cohort (born between 1931 and 1941); • AHEAD cohort (born before 1924); 4 Sponsored by the National Institute of Aging (grant number NIA U01AG09740) and conducted by the University of Michigan, see http://hrsonline.isr.umich.edu/.
• CODA cohort (born between 1924 and 1930);
• WB cohort (born between 1942 and 1947).

In addition to respondents from eligible birth years, the survey interviewed the spouses of married respondents or the partner of a respondent, regardless of age. We focus on a subsample of 3066 individuals of the AHEAD cohort for wave 6 (year 2002). Note that only individuals with full information have been retained, to avoid issues with missing values. The aim is to identify variables impacting equity in health care utilization. When information about costs themselves is not available (in contrast to the example in Section 5.2.3), a proxy variable is used to measure health care consumption, for example the number of visits to the doctor in the previous 2 years. A set of potentially interesting explanatory variables has been retained on the basis of previous studies from the literature, e.g. Dunlop et al. (2002) and Gerdtham (1997); see Table 5.6. These variables are classified into three categories: predisposing variables, health needs and economic access. The first category includes age, gender, race and marital status. Health needs are represented by chronic conditions and functional limitations. In the economic access category, years of education and parents' education measure human capital, whilst income and health insurance from a current or previous employer measure financial ability to pay. A potential concern with count data in the setting of health consumption is an excess of zeros, that is, a large proportion of zero values among the responses, which cannot be modeled with standard distributions (see Ridout et al. (1998) and Section 5.7.1). Given that we target here a population of regular users (the elderly), this issue can be excluded: in fact, only about 4% of the counts are equal to zero, see the histogram in Figure 5.5.
We therefore confidently proceeded with a GLM Poisson model with log-link including all of the available covariates:

log(E[visits]) = β0 + β1 age + β2 gender + β3 race + β4 hispan + β5 marital
               + β6 arthri + β7 cancer + β8 hipress + β9 diabet + β10 lung
               + β11 hearth + β12 stroke + β13 psych + β14 iadla1 + β15 iadla2
               + β16 iadla3 + β17 adlwa1 + β18 adlwa2 + β19 adlwa3 + β20 edyears
               + β21 feduc + β22 meduc + β23 log(income + 1) + β24 insur.    (5.28)
We fitted both a classical MLE and a Mallows' robust estimator according to (5.13) with c = 1.6 and w(xi) = √(1 − hii). Given the large number of covariates, the results are presented graphically. Figure 5.6 shows approximate 95% CIs for each variable resulting from a classical fit (on the left, gray line) and from a robust fit (on the right, black line). The intervals are symmetric and the coefficient itself is represented by a dot in the middle.
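The design weights w(xi) = √(1 − hii) are computed from the diagonal of the hat matrix of the design. A small NumPy sketch; the toy design matrix is hypothetical (not the HRS data), and serves only to show how a high-leverage row is downweighted:

```python
import numpy as np

def leverage_weights(X):
    """Compute w(x_i) = sqrt(1 - h_ii), where h_ii are the diagonal
    entries of the hat matrix H = X (X'X)^{-1} X'.
    High-leverage rows (large h_ii) receive weights well below one."""
    # Hat diagonal via the (reduced) QR decomposition: h_ii = sum_j Q_ij^2
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    return np.sqrt(1.0 - h)

# Toy design: an intercept and one covariate with a single extreme point
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]])
w = leverage_weights(X)
print(np.round(w, 3))  # the last (high-leverage) row gets the smallest weight
```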
Table 5.6 HRS data variables description. Note that iadla sums the answers to 'can use the phone', 'can manage money' and 'can take medication', where the answer to each question is coded 1 = difficulty or 0 = no difficulty. Similarly, adlwa sums the responses to being able to 'bath', 'eat' and 'dress'. Finally, 'med' stands for median. Sample size is 3066.

Name        Description                                        Sample values

Response
visits      Number of visits to the doctor                     0–750 (med = 8)

Predisposing
age         Age in years                                       42–109 (med = 82)
gender      Gender (0 = male, 1 = female)                      2079 females
race        Race (1 = white/Caucasian, 0 = other)              2714 whites
hispan      Hispanic (1 = Hispanic, 0 = other)                 183 Hispanic
marital     Marital status (1 = married, 0 = other)            1203 married

Health needs
arthri      Ever had arthritis (1 = yes, 0 = no)               'yes': 2200
cancer      Ever had cancer (1 = yes, 0 = no)                  'yes': 594
hipress     Ever had high blood pressure (1 = yes, 0 = no)     'yes': 1856
diabet      Ever had diabetes (1 = yes, 0 = no)                'yes': 524
lung        Ever had lung disease (1 = yes, 0 = no)            'yes': 312
hearth      Ever had heart problems (1 = yes, 0 = no)          'yes': 1206
stroke      Ever had a stroke (1 = yes, 0 = no)                'yes': 492
psych       Ever had psychiatric problems (1 = yes, 0 = no)    'yes': 479
iadla       Instr. activities of daily living (0,1,2,3)        '0': 2433, '1': 258, '2': 178, '3': 197
adlwa       Activities of daily living (0,1,2,3)               '0': 2284, '1': 361, '2': 234, '3': 187

Econ. access
edyears     Education years                                    0–17 (med = 12)
feduc       Father education (years)                           0–17 (med = 8.5)
meduc       Mother education (years)                           0–16 (med = 8.5)
income      Total household income                             0–725 600 (med = 21 540)
insur       Ins. from current/prev. empl. (1 = yes, 0 = no)    'yes': 649
Note that the magnitude of the coefficients is not comparable between all of the variables. In fact, some of them are measured in years, e.g. age, meduc, feduc and edyears, one is measured in log-dollars (log(income + 1)) and all of the other variables are dummies.
Figure 5.5 Histogram of visits. Note that the abscissa has been limited to (0, 100) (there are 21 observations out of 3066 outside this range, the largest value being 750).
As one can see, the coefficients of the classical and the robust analyses are sometimes quite different. Also, the standard error estimates tend to be a bit larger in the robust analysis. The CIs from the classical analysis indicate that all of the variables are highly significant (no crossing of the horizontal line at zero), except for marital. From the robust analysis it seems, however, that the variables race, meduc, log(income + 1) and insur are not significant. For additional variable significance tests, see Section 5.6.2. The dataset here is much larger than the previous one, both in sample size and in the number of covariates. For this reason, the plot of the weights (see Figure 5.7) shows what seems to be a large number of downweighted observations. Note, however, that the average of the weights over the total number of observations is (1/3066) Σ_{i=1}^{3066} w(r̃i; β, φ, c) w(xi) = 79.4%, which reflects, loosely speaking, an average degree of 'outlyingness' of about 20%. This may seem a lot, possibly indicating that extra covariates should be added or that the distributional assumptions should be modified. Also, the weights on the design are all close to one.
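Loosely, the average 'outlyingness' figure above is just the mean of the products w(ri)w(xi) over all observations. A sketch, assuming Huber-type residual weights w(r) = min(1, c/|r|) (the weight form is an assumption here, and the residuals and design weights below are invented for illustration, not taken from the HRS fit):

```python
import math

def huber_weight(r, c=1.6):
    """Huber-type robustness weight w(r) = min(1, c/|r|):
    observations with |Pearson residual| <= c keep full weight 1,
    larger residuals are progressively downweighted."""
    return 1.0 if abs(r) <= c else c / abs(r)

# Hypothetical Pearson residuals and design weights w(x_i) (all close to one)
residuals = [0.3, -1.2, 2.5, 0.8, -4.0, 1.1]
design_weights = [0.99, 0.98, 0.99, 0.97, 0.99, 0.98]

avg = sum(huber_weight(r) * wx
          for r, wx in zip(residuals, design_weights)) / len(residuals)
print(round(avg, 3))
```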
5.6.2 Variable Selection

As can be seen in Figure 5.6, almost all of the (preselected) variables for this study seem significant. We would like to confirm whether the variables race, meduc, log(income + 1) and insur can be excluded from the model. For this purpose
Figure 5.6 Coefficient estimates and approximate 95% CIs for the log-link Poisson model (5.28), estimated by maximum likelihood (classical) and by (5.13) with c = 1.6 and w(xi) = √(1 − hii) (robust). For each variable, the results on the left are from the classical analysis and those on the right from the robust analysis.
we use the quasi-deviance difference test statistic with c = 1.6 and w(xi) = √(1 − hii). We first test the null hypothesis H0: β3 = 0 in the full model, which is not rejected (p-value = 0.73); we therefore remove the variable race. We next test whether meduc is significant in the sub-model with race already removed. This variable is not significant (p-value = 0.62) and we remove it. We go on to test whether we can in addition remove log(income + 1), which is not significant (p-value = 0.35). We finally test the removal of insur; its p-value is 0.50, and we decide to remove insur as well. The above approach is called a sequential approach and differs from a marginal/stepwise approach in that it does not test all of the sub-models at each step. The drawback is that the final model depends heavily on the order in which the variables are considered for removal, in particular when the covariates are far from independent. Table 5.7 gives the estimates for the final model retained above. The factors explaining the number of visits to the doctor are numerous, as confirmed by the long list of variables in Table 5.7. We have already learned that being Caucasian, the
Figure 5.7 Robustness weights from the fit of model (5.28) estimated by (5.13) with c = 1.6 and w(xi) = √(1 − hii).
Table 5.7 Final model estimates for the doctor visits data. The estimates are obtained by (5.13) with c = 1.6 and w(xi) = √(1 − hii) (Mallows' estimator).

Variable     Estimate (SE)     p-value
intercept     1.989 (0.114)    <10^−4
age          −0.005 (0.001)    <10^−4
gender        0.030 (0.015)    0.0409
hispan        0.213 (0.027)    <10^−4
marital      −0.050 (0.014)    0.0006
arthri        0.180 (0.015)    <10^−4
cancer        0.178 (0.015)    <10^−4
hipress       0.197 (0.014)    <10^−4
diabet        0.198 (0.015)    <10^−4
lung          0.110 (0.019)    <10^−4
hearth        0.304 (0.013)    <10^−4
stroke        0.125 (0.016)    <10^−4
psych         0.180 (0.016)    <10^−4
iadla1        0.056 (0.023)    0.0143
iadla2        0.176 (0.027)    <10^−4
iadla3        0.244 (0.029)    <10^−4
adlwa1        0.160 (0.019)    <10^−4
adlwa2        0.231 (0.024)    <10^−4
adlwa3        0.382 (0.029)    <10^−4
edyears       0.008 (0.002)    <10^−4
feduc        −0.020 (0.006)    0.0025
level of mother's education, total household income and having a health insurance plan from a previous employer do not have a statistically significant impact on health consumption (doctor visits). The Poisson GLM used for this example has a logarithmic link. Interpretation of the coefficients is therefore done through the relationship µi = exp(xi^T β), as in the Gamma model with logarithmic link in Section 5.4.3. For example, a patient who is five years older would have a number of visits to the doctor multiplied by exp(−0.005 · 5) = 0.975 on average, that is, reduced by 2.5%. It is surprising to see that the coefficient of age is negative, meaning that older patients consume less; however, the effect is really small (no practical significance), even though statistically significant. Interpretation of education level via years of education (edyears) and father's education (feduc) is puzzling. On the one hand, an extra year of father's education decreases the number of visits by 2% (exp(−0.020) = 0.98); on the other hand, the years of education of the patient himself tend to increase doctor needs by 1% (exp(0.008) = 1.01). Married individuals visit the doctor less on average: exp(−0.050) = 95%. All of the effects in the 'health needs' category are positive, indicating, as expected, that if some conditions are present (arthritis, diabetes, high blood pressure, etc.), the number of doctor visits is larger on average than for an individual without these conditions.
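With the log link, these multiplicative effects are plain exponentials of the Table 5.7 coefficients. A quick check:

```python
import math

# Coefficients from Table 5.7 (robust fit of the final Poisson model)
beta_age, beta_feduc, beta_edyears, beta_marital = -0.005, -0.020, 0.008, -0.050

# With E[visits] = exp(x'beta), a change of d units in one covariate
# multiplies the expected count by exp(beta * d), other covariates fixed.
print(round(math.exp(beta_age * 5), 3))   # five years older: factor ~0.975
print(round(math.exp(beta_feduc), 2))     # extra year of father's education: ~0.98
print(round(math.exp(beta_edyears), 2))   # extra year of own education: ~1.01
print(round(math.exp(beta_marital), 2))   # married vs not: ~0.95
```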
5.7 Discussion and Extensions

The GLM class encompasses a large variety of data distributions, but of course it has its own limitations, and GLMs have therefore been extended in various ways. The linear component structure has been relaxed and non-parametric functions have been considered in generalized additive models (GAMs; see Hastie and Tibshirani (1990)). The exponential family restriction can be overcome by using quasi-likelihood functions instead of proper likelihoods; the asymptotic results for the estimators derived in this way have to be adapted, essentially by changing the asymptotic variance estimator (sandwich formula, see Fahrmeir and Tutz (2001, pp. 55–58)). Finally, in GLMs the responses are assumed to be independent, which excludes, for instance, longitudinal or clustered data, where there are typically several observations per subject for which it is not reasonable to assume independence (even though the subjects themselves can be considered independent); see in particular Chapter 6. In the following sections we discuss some ideas for extensions of the approach presented in this chapter and some open areas of research.
5.7.1 Robust Hurdle Models for Counts

A particular feature of count data is that they sometimes show an excess of zeros. Typical examples include the number of visits to the doctor in a given period (see Cameron and Trivedi, 1998) or the abundance of species (see Barry and Welsh, 2002). Data with an excess of zeros have been modeled in various ways: with mixture models, with more flexible distributions than the more common Poisson (e.g. negative binomial, Neyman type-α, see for instance Dobbie and Welsh (2001b)), with zero-inflated distributions (zero-inflated Poisson or zero-inflated negative binomial, see Lambert (1992)), or with hurdle models (also called two-step or conditional models, see Mullahy (1986)). Ridout et al. (1998) and Min and Agresti (2002) give extensive reviews. From our perspective, hurdle models are quite attractive because they possess nice orthogonality properties and they fit nicely into the GLM framework and its robust counterpart presented in this chapter. A hurdle model is characterized by a two-stage procedure. First, the presence (yi > 0) or absence (yi = 0) is modeled through a set of covariates xi with a logistic-type model. Then, conditional on presence, the positive values are modeled through a set of covariates x̃i (possibly equal to xi) with a truncated distribution (e.g. a truncated Poisson) and a corresponding model (a log-linear type of model). This implies that yi = 0 with probability 1 − p(xi) and
yi ∼ truncated Poisson with probability p(xi). In summary,

P(Yi = yi | xi, x̃i) = { 1 − p(xi),                                                 yi = 0,
                       { p(xi) exp(−λ(x̃i)) λ(x̃i)^yi / [yi! (1 − exp(−λ(x̃i)))],    yi = 1, 2, . . . ,

with logit(p(xi)) = xi^T β and log(λ(x̃i)) = x̃i^T α. The log-likelihood l(α, β) of the above model factorizes as l(α) + l(β), which has the double advantage of splitting the fitting into two subproblems of smaller size and rendering the interpretation easier (each set of parameters impacts only one part of the model). A robust procedure for the hurdle model can be derived by robustifying each sub-model separately: the logistic presence/absence model can be fitted robustly by the approach presented in the previous sections, and the truncated Poisson modeling part has been addressed in Zedini (2007). Routines in R are currently under preparation and will be made available either within the robustbase package or as a standalone package.
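The hurdle probability mass function above is straightforward to evaluate. A minimal sketch in pure Python; in a fitted model, p and λ would come from logit(p(xi)) = xi^T β and log(λ(x̃i)) = x̃i^T α, while here they are fixed illustrative values:

```python
import math

def hurdle_poisson_pmf(y, p, lam):
    """P(Y = y) under a hurdle (two-part) model: a zero occurs with
    probability 1 - p; positive counts follow a zero-truncated
    Poisson(lam), scaled by the presence probability p."""
    if y == 0:
        return 1.0 - p
    trunc = math.exp(-lam) * lam**y / math.factorial(y)
    return p * trunc / (1.0 - math.exp(-lam))

# Sanity check: the probabilities sum to one
# (1 - p for zero, p spread over the truncated Poisson on y >= 1).
p, lam = 0.7, 2.5
total = hurdle_poisson_pmf(0, p, lam) + sum(
    hurdle_poisson_pmf(y, p, lam) for y in range(1, 60))
print(round(total, 6))
```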
5.7.2 Robust Akaike Criterion

The principle of the AIC (see Section 3.4.5) is to use the likelihood information at a given model, penalized by its number of parameters, to identify the best model(s), that is, the best compromise(s) between parsimony and goodness of fit; the smaller the value of AIC, the better. In fact, AIC is an estimate of the expected entropy, which one would like to maximize. A robust version of AIC is available for linear models (see (3.31)), but not (yet) for GLMs, where a generalized version of AIC can be constructed based on the quasi-likelihood functions defined in this chapter. We briefly sketch the idea here. The log-likelihood in the original definition of AIC can be replaced by the quasi-likelihood function (5.7), with the penalization term adapted; see Ronchetti (1997b) and Stone (1977). This yields the final generalized criterion:

GAIC = −2 Σ_{i=1}^n QM(µ̂i; yi) + 2 tr(M^{−1}(ψ, Fβ) Q(ψ, Fβ)),

with M(ψ, Fβ) and Q(ψ, Fβ) given in (5.21) and (5.22).
5.7.3 General Cp Criterion for GLMs

The Mallows' Cp criterion (Mallows, 1973) has mainly been used in linear regression. A robust version of it for linear models exists thanks to Ronchetti and Staudte (1994) (see (3.32)). It is constructed upon the idea that the Cp criterion is an unbiased estimator of some measure of prediction error. Following the same reasoning, Cantoni et al. (2005) develop a similar criterion, called GCp, to be used for GEE models to address various issues (missingness, heteroscedasticity) including robustness. The GLM setting being the limiting case of a longitudinal setting with only one observation per subject, GCp for GLMs can be deduced from the original proposal of Cantoni et al. (2005). If we define the rescaled weighted predictive squared error by

Γp = Σ_{i=1}^n E[ w²(ri^p) · ( (ŷi^p − E[yi | xi^(p)]) / √(φ v̂µi) )² ],    (5.29)

where ri^p = (yi − ŷi^p)/√(φ v̂µi) are the Pearson residuals, ŷi^p are the fitted values of the model with p ≤ (q + 1) explanatory variables xi^(p) (including the intercept), v̂µi are 'external' variance estimates (held fixed) and w(·) is a weighting function that downweights atypical observations, then a general form of an unbiased estimator of Γp is

GCp = Σ_{i=1}^n (w(ri^p) ri^p)² − Σ_{i=1}^n E[(w(ri^p) εi)²] + 2 Σ_{i=1}^n E[w²(ri^p) εi δi],    (5.30)

with εi = (yi − E[yi | xi^(p)])/√(φ vµi) and δi = (ŷi^p − E[yi | xi^(p)])/√(φ vµi), where the two latter terms are corrections to achieve unbiasedness. Computing these two terms for GLMs and for our particular (robust) M-estimator (5.13) would yield the final form of GCp.
5.7.4 Prediction with Robust Models

The goals of model fitting are numerous, but they certainly include prediction. For example, in the hospital costs example of Section 5.2.3, health insurers could be interested in forecasting costs for the following year in order to establish their budget. If in this example the robust fitted model is used naively to obtain predictions, the reproducibility of the outliers, that is, the fact that individuals with abnormally high costs are likely to appear again in the future, would imply potentially severe bias in prediction (e.g. underestimation). This feature is shared by all models where the outliers are characterized by particularly large values with respect to the bulk of the data (which is not the case in examples with binary responses, for instance). In this kind of situation, one should therefore correct the predictions for possibly reproducible outliers, by considering shrinkage robust estimators; see for example Welsh and Ronchetti (1998) and Genton and Ronchetti (2008).
6 Marginal Longitudinal Data Analysis
Robust Methods in Biostatistics, S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser. © 2009 John Wiley & Sons, Ltd.

6.1 Introduction

Longitudinal data models are a step further away from linear models. Beyond GLMs, longitudinal studies are those where individuals are measured repeatedly over time. So, with respect to the GLM modeling of Chapter 5, a second dimension is added, in that each subject can be measured several times. With respect to the (normal) MLMs of Chapter 4, the extension broadens the nature of the responses considered: here we allow the response to come from any distribution of the exponential family (discrete or continuous), as in Chapter 5. Note that the terminology 'longitudinal data' is used mostly in medicine, biology and the health sciences, whereas sociologists and economists would mostly use the term 'panel data'. It has to be stressed that even though the most common applications are situations where the main units are individuals (e.g. the example in Section 6.5), the methodology can also be applied to otherwise clustered data, where there are units within which measurements cannot be considered independent (e.g. the example in Section 6.2.3). When there is only one observation per subject, inference solely about the population average is possible. In contrast, longitudinal studies can distinguish between changes over time within individuals (called aging effects) and differences among people in their baseline levels (called cohort effects). In other words, longitudinal studies are able to distinguish between the degree of variation of the response across time for one person and the variation in the response among people. Statistically speaking, one has to take into account the correlation within
measurements of the same subject (even if the subjects themselves can be considered independent). The same pattern/behavior is assumed across subjects, and strength is borrowed from this assumption. The literature on marginal longitudinal models is wide, in part because models have been developed in at least three main directions; see Section 6.2. The bases of the generalized estimating equations (GEE) approach that we follow here were introduced in the seminal work of Liang and Zeger (1986) and Zeger and Liang (1986). Since then, many extensions and variations have been considered, in particular an extension to mixed linear types of models (Zeger et al., 1988), polytomous responses (Heagerty and Zeger, 1996; Liang et al., 1992; Stram et al., 1988), survival responses (Heagerty and Zeger, 2000, and references therein), weighted GEE (Preisser et al., 2000) and zero-inflated count data (Dobbie and Welsh, 2001a). A nice book on longitudinal data is Diggle et al. (2002), an extension of an earlier edition; the book by Molenberghs and Verbeke (2005) is another interesting reference. A more focused book on the GEE approach is Hardin and Hilbe (2003), and a recent book addressing correlated data is Song (2007). The theory around the GEE approach is sometimes sparse, in particular when it comes to the nuisance parameters, for which the inferential aspects have not been well treated. The variable selection issues with these models have been addressed only recently, with Pan (2001) defining an Akaike-type criterion for GEE, called QIC. Moreover, Cantoni et al. (2005) introduce a general Cp-like criterion for variable selection for marginal longitudinal models that can also address robustness issues. Robust alternatives to GEE-type fits were first proposed by Preisser and Qaqish (1999), who define a set of resistant estimating equations. Wang et al.
(2005) propose a robust GEE-type bias-corrected estimator, where the bias is estimated using a classical GEE estimator. Qu and Song (2004) show that their estimating equations proposal based on quadratic inference functions (Qu et al., 2000) has some nice robustness properties for the estimation of the regression parameters in some cases. Cantoni (2004b) proposes a more general and improved version of the estimating equations of Preisser and Qaqish (1999) that also allows quasi-likelihood functions to be defined for inference, which puts the user in a position to carry out a full analysis. We have chosen to present this approach given our familiarity with it, because its extensions make variable selection possible along the same lines as the approach for GLMs, and because of its forthcoming availability in R. In this chapter, after discussing the possible approaches to longitudinal data (Section 6.2), we introduce marginal longitudinal models in more detail and present the classical estimation procedure (GEE) used to fit them, together with the associated inference, in Section 6.2.1. The robust counterpart, as per Cantoni (2004b), is introduced and illustrated in Section 6.3; it is based on a weighted set of estimating equations. In addition, quasi-deviance functions are defined for inference purposes and robust model selection. Three different examples serve as motivation and illustration of the theoretical elements introduced in this chapter, especially in Sections 6.3.4, 6.5 and 6.6.
6.2 The Marginal Longitudinal Data Model (MLDA) and Alternatives

We assume that we have measurements yit for individual (or unit or cluster) i = 1, . . . , n at time (or occasion or occurrence) t = 1, . . . , ni. We additionally define yi^T = (yi1, . . . , yini) as the collection of measurements for subject i and we assume independence between subjects. We assume that E[yi] = µi and that var(yi) is non-diagonal. At each time point, a set of covariates xit^T = (1, xit1, . . . , xitq) is also measured for each individual. The covariates information on subject i is collected in an ni × (q + 1) matrix

       ( xi1^T  )   ( 1   xi11    · · ·   xi1q  )
  Xi = (   ⋮    ) = ( ⋮    ⋮               ⋮    )
       ( xini^T )   ( 1   xini1   · · ·   xiniq )

The complete set of data comprises N = Σ_{i=1}^n ni observations. As with GLMs, the response yit will be allowed to come from any distribution of the exponential family, see Table 5.1 in Chapter 5. However, using the GLM methodology would not be appropriate here, because it ignores the correlation between the measurements of the same subject. Ignoring this correlation has consequences at different levels: inference about the regression parameters is incorrect, estimation of the regression parameters is inefficient and there is suboptimal protection against biases caused by missing data. The difficulty with the analysis of non-Gaussian longitudinal data was the lack of a rich class for the joint distribution of (yi1, . . . , yini). There are essentially three strategies to address the issue; all three model both the dependence of the response on the explanatory variables and the correlation among the responses. In the following we give a brief overview.

1. Marginal models. Via this approach one models parametrically not only the marginal mean of yit (as in GLMs and in cross-sectional studies in general) but also the correlation matrix corr(yi), by imposing a relationship g(E[yit]) = xit^T β for a link function g, and by modeling the covariance matrix with extra parameters τ and α: Vµi,τ,α = τ Aµi^{1/2} Rα,i Aµi^{1/2}, with Aµi = diag(vµi1, . . . , vµini), where vµit = var(yit), Rα,i is the working correlation matrix and τ is a scale parameter. Only inference about the population mean is possible (population average inference). The parameters are estimated via a set of estimating equations, because there is no likelihood available in this setting.

2. Random effects models. With these models it is assumed that the correlation arising among repeated responses is due to the variation of the regression coefficients across individuals.
One therefore models the conditional expectation of y_it given γ_i (the individual's unexplained variation) by assuming g(E[y_it | γ_i]) = x_it^T β + z_it^T γ_i, with γ_i issued from a distribution F (usually Gaussian) such that E[γ_i] = 0 and var(γ_i) = σ_γ² I. This modeling approach allows for inference about individuals (subject-specific inference). Parameter estimation is performed via likelihood maximization.
3. Transition models. In this case, the conditional expectation given the past, E[y_it | y_i(t−1), ..., y_i1], is modeled. The assumptions about the dependence of y_it on the past responses and on x_it are combined into a single equation; that is, the conditional expectation of y_it is written as an explicit function of y_i(t−1), ..., y_i1 and x_it. The likelihood is also the estimation method here.
6.2.1 Classical Estimation and Inference in MLDA

In this chapter, we focus on marginal models, where the final goal is to describe the population average and for which a robust procedure similar to that in Chapter 5 is available. We note at this point that some robust options exist for random effects models as well, see e.g. Mills et al. (2002), Sinha (2004) and Noh and Lee (2007). The model assumptions under which we work are partially common with the main ingredients defined for GLM.

• The marginal expectation of the response E[y_it] = µ_it depends on a set of explanatory variables x_it via g(µ_it) = x_it^T β, where g is the link function.

• The marginal variance depends on the marginal mean through the relationship var(y_it) = τ v_{µit}. The scale parameter τ allows for over- or under-dispersion, in the same manner as for GLMs, see Section 5.2.2.

• The correlation between y_it and y_it′ (t ≠ t′) is a function of the corresponding marginal means and possibly of additional parameters α. This goal is achieved by parameterizing the correlation matrix with a parameter α, yielding a modeled covariance matrix V_{µi,τ,α} = τ A_{µi}^{1/2} R_{α,i} A_{µi}^{1/2}, with A_{µi} = diag(v_{µi1}, ..., v_{µin_i}), where v_{µit} = var(y_it). The modeled correlation matrix R_{α,i} is called the 'working' correlation matrix, as opposed to the true underlying and unknown correlation matrix corr(y_i).

The regression parameters β have the same interpretation as in GLM. They are regarded as the parameters of interest, whereas τ and α are considered nuisance parameters. This may not be appropriate when the time course for each subject is the focus, in which case one would need to consider either the extension proposed by Zeger et al. (1988) or a random effects model. Marginal models are natural extensions of GLM for dependent data. Therefore, the same or similar choices for the marginal distributions (within the exponential family) and the same link functions as in GLMs are used, see Chapter 5.
However, even if a marginal distribution for yit is postulated (e.g. Bernoulli, binomial, Poisson), it does not define a (unique) joint multivariate distribution for yi , making it impossible to define a likelihood function to work with. The regression parameters β are therefore estimated by the GEE approach of Liang and Zeger (1986). Note, however, that the GEE reduce to maximum likelihood when the yi are multivariate
Gaussian distributed. In addition, GEE can be viewed as an extension of the quasi-likelihood approach where the variance cannot be specified only through the expectation µ_i but rather with additional correlation parameters α. This similarity with the quasi-likelihood approach explains why the parameter τ is directly included in the definition of V_{µi,τ,α}. The quasi-likelihood approach used in (5.6) for GLM can be extended by solving for β the GEE (assuming τ and α are given):

Σ_{i=1}^n (D_{µi,β})^T (V_{µi,τ,α})^{-1} (y_i − µ_i) = 0,    (6.1)

where D_{µi,β} = ∂µ_i/∂β and V_{µi,τ,α} = τ A_{µi}^{1/2} R_{α,i} A_{µi}^{1/2}. The resulting GEE estimator β̂_[GEE] can be obtained through an IRWLS by implementing a Fisher scoring algorithm. This algorithm is given in Appendix F.1 in its more general robust form.

As said before, R_{α,i} is called the 'working' correlation, as opposed to the true (unknown) correlation matrix corr(y_i). The working correlation is imposed by the user and possible choices are as follows.

• Independence. Here R_{α,i} = I_{n_i}, where I_{n_i} is the identity matrix of size n_i. In this case, all of the N = Σ_{i=1}^n n_i measurements are considered independent even within the same subject, and we can therefore treat this situation with a simple GLM, as if each observation y_it corresponded to an independent subject.

• Fixed. The correlation matrix R_{α,i} (or R) has a predefined form (either through a known parameter α or in general). This case is rare in practice, but could be implied by a formal theory or be a result of previous studies.

• Exchangeable (or compound symmetry). All of the correlations (R_{α,i})_{tt′} between two occurrences t and t′ (t ≠ t′) are assumed to be equal to a scalar value α to be estimated. Formally, R_{α,i} = α e_{n_i} e_{n_i}^T + (1 − α) I_{n_i}, where e_{n_i} is a vector of ones of dimension n_i and I_{n_i} is the n_i × n_i identity matrix. This hypothesis may not be fulfilled when the repeated measurements are issued from subjects measured on several occasions over time, but is more appropriate for data where units are 'natural' clusters, such as children in the same class, members of a family or patients of the same practice, see e.g. the example in Section 6.2.3. Note that assuming exchangeable correlation in the normal-identity link setting corresponds to a random intercept MLM.

• Autoregressive (AR). The correlation decreases with time difference, e.g. (R_{α,i})_{tt′} = α^{|t−t′|}, for an unknown scalar value α. This hypothesis is quite commonly used for measurements on the same subject over time because it can accommodate an arbitrary number and spacing of observations.

• m-dependence. Observations are correlated up to time distance m, and therefore the correlation is set to zero for observations that are more than m units apart. Formally, for α = (α_1, ..., α_m),

(R_{α,i})_{tt′} = 1 if t = t′, α_d if d = |t − t′| ≤ m, and 0 otherwise.

• Unstructured/unspecified. The correlation matrix R_{α,i} is completely free (apart from a diagonal of ones and the symmetry constraint), which gives many parameters to estimate. Obviously, this option requires clusters to be of the same size, that is, n_i = n* for all i.

We refer the reader to Table 1 in Horton and Lipsitz (1999) for a description of the possible correlation structures and recommendations. Moreover, Hardin and Hilbe (2003, pp. 141–142) give additional guidelines for choosing the correlation structure as a function of the nature of the data at hand (e.g. size of the clusters, balanced data, characteristics defining the clusters).
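The working correlation choices above are simple to construct explicitly. The following minimal numpy sketch (illustrative code; the function names are ours, not from the text) builds the exchangeable, AR and m-dependent matrices R_{α,i} for a cluster of size n_i.

```python
import numpy as np

def exchangeable(n_i, alpha):
    """R = alpha * e e^T + (1 - alpha) * I  (compound symmetry)."""
    return alpha * np.ones((n_i, n_i)) + (1.0 - alpha) * np.eye(n_i)

def ar1(n_i, alpha):
    """(R)_{t,t'} = alpha ** |t - t'|  (autoregressive)."""
    t = np.arange(n_i)
    return alpha ** np.abs(t[:, None] - t[None, :])

def m_dependent(n_i, alphas):
    """(R)_{t,t'} = alphas[d-1] for d = |t - t'| <= m, 1 on the diagonal, 0 beyond lag m."""
    R = np.eye(n_i)
    for d, a in enumerate(alphas, start=1):
        if d < n_i:
            R += a * (np.eye(n_i, k=d) + np.eye(n_i, k=-d))
    return R
```

For instance, `exchangeable(4, 0.3)` returns a 4 × 4 matrix with ones on the diagonal and 0.3 everywhere else.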
6.2.2 Estimators for τ and α

The GEE (6.1) are defined for given values of τ and α. A procedure that iterates between the estimation of the regression parameters β and the (moment) estimation of the nuisance parameters τ and α is implemented in all good software and therefore used in practice. Given that τ and α are nuisance parameters, less attention has been paid to their estimation and almost no theoretical results for inference exist for these parameters.

The estimation of τ is based on the fact that τ is equal to var(√τ r_it), where r_it = (y_it − µ_it)/√(τ v_{µit}) are the Pearson residuals for unit i at occurrence t. Therefore, a simple estimator of τ is derived from the variance estimator based on all of the N residuals, i.e.

τ̂ = Σ_{i=1}^n Σ_{t=1}^{n_i} [(y_it − µ̂_it)²/v_{µ̂it}] / (N − (q + 1)).    (6.2)

On the other hand, the estimator of the correlation parameter α depends on the choice of the correlation structure R_{α,i}. The general approach is to estimate α by a simple function of all of the pairs of residuals r̂_it, r̂_it′ that share the same correlation (t and t′ defined accordingly). Below, we give some of the solutions implemented in software for the most common correlation structures.¹

• If (R_{α,i})_{tt′} = α (exchangeable correlation) for all t ≠ t′, then we have

α̂ = Σ_{i=1}^n Σ_{t>t′} r̂_it r̂_it′ / (K − (q + 1)),    (6.3)

where K = (1/2) Σ_{i=1}^n n_i(n_i − 1) and r̂_it = (y_it − µ̂_it)/√(τ̂ v_{µ̂it}).

¹ Note that this list is not exhaustive, and different software implement different solutions.
• If (R_{α,i})_{tt′} = α_{t,t′} = α^{|t−t′|} (AR correlation), then given that E[r_it r_it′] ≈ α^{|t−t′|} (because E[r_it r_it′] ≈ cov(r_it, r_it′)), one estimates log α by the slope of the regression of log(r̂_it r̂_it′) on |t − t′|. Another option (see Hardin and Hilbe, 2003, p. 66) is to use

α̂_{t,t′} = Σ_{i=1}^n [ Σ_{s=1}^{n_i−(t−t′)} r̂_is r̂_{i,s+(t−t′)} / n_i ].

• If α = (α_1, ..., α_{n*−1}), where α_t = (R_{α,i})_{t(t+1)} and n* is such that n_1 = ... = n_n = n*, then

α̂_t = Σ_{i=1}^n r̂_it r̂_{i(t+1)} / (n − (q + 1)).

In particular, if R_{α,i} is tridiagonal with (R_{α,i})_{t(t+1)} = α_t (one-dependent model), then if we let α_t = α, we can estimate it by

α̂ = Σ_{t=1}^{n*−1} α̂_t / (n* − 1).

The extension to m-dependence is possible.

• If R_{α,i} is totally unspecified, that is (R_{α,i})_{tt′} = α_{tt′} for t ≠ t′, one uses

R̂ = (1/(τ̂ n)) Σ_{i=1}^n (A_{µ̂i})^{-1/2} (y_i − µ̂_i)(y_i − µ̂_i)^T (A_{µ̂i})^{-1/2}.

For the independence, exchangeable and m-dependence correlation structures, τ does not need to be computed to solve the estimating equations (it cancels out). In contrast, it is needed when R_{α,i} is AR. Liang and Zeger (1986, Section 4) give further details.

The above-described estimators for τ and α are moment estimators that have a closed form, but they can be expressed in an estimating equation form to be solved simultaneously with the estimating equations for β, see Liang et al. (1992, pp. 9–10). The GEE approach operates as if α and β were orthogonal to each other, even when they are not, yielding less efficient estimates of β when the correlation structure is misspecified. Zhao and Prentice (1990) introduce a modified version of GEE, called GEE2, that relaxes the orthogonality hypothesis. The price to pay is an increased computational burden and a larger sensitivity to the misspecification of the correlation structure, see Song (2007, p. 96). The GEE2 approach is not what is implemented in most software and, for this reason, we do not pursue this theory further.

If √n-consistent estimators are used to estimate τ and α, it can be proved that √n(β̂_[GEE] − β) is asymptotically normally distributed with zero mean and variance

Σ = lim_{n→∞} M^{-1} Q M^{-1},
where

M = (1/n) Σ_{i=1}^n (D_{µi,β})^T (V_{µi,τ,α})^{-1} D_{µi,β},

and

Q = (1/n) Σ_{i=1}^n (D_{µi,β})^T (V_{µi,τ,α})^{-1} var(y_i) (V_{µi,τ,α})^{-1} D_{µi,β},

see Liang and Zeger (1986, Theorem 2). Note that the asymptotic theory here is intended with respect to the number of subjects (n) and for fixed numbers of occurrences (n_i). The estimator used for Σ is Σ̂ = M̂^{-1} Q̂ M̂^{-1}, where

M̂ = (1/n) Σ_{i=1}^n (D_{µ̂i,β̂})^T (V_{µ̂i,τ̂,α̂})^{-1} D_{µ̂i,β̂},    (6.4)

and

Q̂ = (1/n) Σ_{i=1}^n (D_{µ̂i,β̂})^T (V_{µ̂i,τ̂,α̂})^{-1} (y_i − µ̂_i)(y_i − µ̂_i)^T (V_{µ̂i,τ̂,α̂})^{-1} D_{µ̂i,β̂},    (6.5)
where β̂ = β̂_[GEE], µ̂_i = µ_i(β̂_[GEE]), τ̂ is defined by (6.2) and α̂ is one of the estimators defined in the list above, depending on the assumed correlation structure. Note that an estimator for var(β̂_[GEE]) is n^{-1} Σ̂. This is what is called in the literature a 'robust' variance estimator, in contrast to a 'naive' variance estimator that would be obtained by assuming that the working correlation is true, and hence var(y_i) = V_{µi,τ,α}. This would yield v̂ar(β̂_[GEE]) = n^{-1} M̂^{-1}. So, here 'robust' is intended with respect to the misspecification of the correlation structure. For a similar use of 'robust', see also the discussion in Section 7.2.4.

Approximate z-statistics and (1 − α) CIs can be defined in the usual manner, i.e.

z-statistic = β̂_[GEE]j / SE(β̂_[GEE]j),    (6.6)

with SE(β̂_[GEE]j) = √(v̂ar(β̂_[GEE]j)) and v̂ar(β̂_[GEE]j) = n^{-1} [Σ̂]_{(j+1)(j+1)}. In the same manner, we obtain

(β̂_[GEE]j − z_(1−α/2) SE(β̂_[GEE]j); β̂_[GEE]j + z_(1−α/2) SE(β̂_[GEE]j)),

where z_(1−α/2) is the (1 − α/2) quantile of the standard normal distribution.

The GEE estimator β̂_[GEE] of β is attractive because it presents some nice theoretical properties. For instance, the asymptotic variance of β̂_[GEE] does not depend on the choice of the estimators for τ and α among the √n-consistent estimators. In addition, the consistency of β̂_[GEE] and Σ̂ depends only on the correct specification of the means µ_i and not on the correct specification of the correlation structure. In fact, inference about β is valid even when the correlation matrix is not
specified correctly (see Liang and Zeger (1986) for a more detailed discussion and for the proofs of these theoretical aspects). However, a careful choice of R_{α,i}, close to the true correlation matrix corr(y_i), increases efficiency, even though simulation results in Liang and Zeger (1986, Tables 1 and 2, p. 19) and Liang et al. (1992, Table 1, p. 15) tend to suggest otherwise. In these references the loss of efficiency is important only for highly correlated responses, but is limited in situations with moderate correlation. The drawbacks of the GEE approach are mostly related to the lack of a likelihood function for these models, which limits diagnostics and inference, and to the poor theory for the nuisance parameters.
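To fix ideas, the following numpy sketch (ours, not the book's code) runs the Fisher-scoring iteration for the GEE (6.1) with an exchangeable working correlation on simulated clustered Poisson counts, updates τ and α with the moment estimators (6.2) and (6.3) at each step, and finally computes the sandwich variance M̂^{-1} Q̂ M̂^{-1} of (6.4)–(6.5). All data-generating choices (cluster size, frailty, coefficients) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated clustered counts: n clusters of size n_i, Poisson responses with a
# shared mean-one gamma frailty per cluster, inducing positive
# exchangeable-type within-cluster correlation (invented setup).
n, n_i, p = 60, 4, 2
beta_true = np.array([0.5, 0.3])
X = [np.column_stack([np.ones(n_i), rng.normal(size=n_i)]) for _ in range(n)]
frailty = rng.gamma(5.0, 1.0 / 5.0, size=n)
Y = [rng.poisson(frailty[i] * np.exp(Xi @ beta_true)) for i, Xi in enumerate(X)]

N, K = n * n_i, n * n_i * (n_i - 1) / 2.0
beta, tau, alpha = np.zeros(p), 1.0, 0.0
for _ in range(25):                                  # Fisher scoring for the GEE (6.1)
    R = alpha * np.ones((n_i, n_i)) + (1.0 - alpha) * np.eye(n_i)
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    ssq = prods = 0.0
    for Xi, yi in zip(X, Y):
        mu = np.exp(Xi @ beta)
        D = mu[:, None] * Xi                         # d mu / d beta for the log link
        A_half = np.diag(np.sqrt(mu))                # Poisson variance function v(mu) = mu
        Vinv = np.linalg.inv(A_half @ R @ A_half)    # tau cancels in the scoring step
        lhs += D.T @ Vinv @ D
        rhs += D.T @ Vinv @ (yi - mu)
        r = (yi - mu) / np.sqrt(mu)
        ssq += (r ** 2).sum()
        prods += (r.sum() ** 2 - (r ** 2).sum()) / 2.0  # sum of r_t r_t' over pairs t > t'
    beta = beta + np.linalg.solve(lhs, rhs)
    tau = ssq / (N - p)                              # moment estimator (6.2)
    alpha = prods / tau / (K - p)                    # moment estimator (6.3), exchangeable

# Sandwich ('robust') variance M^{-1} Q M^{-1} as in (6.4)-(6.5); accumulating
# M and Q as sums over clusters, the 1/n factors cancel and the result
# estimates var(beta_hat) directly.
M, Q = np.zeros((p, p)), np.zeros((p, p))
R = alpha * np.ones((n_i, n_i)) + (1.0 - alpha) * np.eye(n_i)
for Xi, yi in zip(X, Y):
    mu = np.exp(Xi @ beta)
    D = mu[:, None] * Xi
    A_half = np.diag(np.sqrt(mu))
    Vinv = np.linalg.inv(tau * A_half @ R @ A_half)  # V = tau * A^{1/2} R A^{1/2}
    M += D.T @ Vinv @ D
    e = (yi - mu)[:, None]
    Q += D.T @ Vinv @ e @ e.T @ Vinv @ D
cov_robust = np.linalg.inv(M) @ Q @ np.linalg.inv(M)
se = np.sqrt(np.diag(cov_robust))
```

With these simulated data the slope estimate should land near the true value 0.3 and α̂ near the frailty-induced within-cluster correlation; the robust variant of Section 6.3 would only change the per-cluster score contributions.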
6.2.3 GUIDE Data Example

We consider the dataset of the GUIDE study (Guidelines for Urinary Incontinence Discussion and Evaluation²), as used by Preisser and Qaqish (1999). The response variable is the coded answer (bothered: 1 for 'yes', 0 for 'no') of a patient to the question: 'Do you consider this accidental loss of urine a problem that interferes with your day to day activities or bothers you in other ways?'. There are five explanatory variables: gender, coded as an indicator for women (female), age (scaled by subtracting 76 and dividing by 10: age), the average number of leaking accidents per day (dayacc), the degree of the leak (severe: coded '1' for 'just create some moisture', '2' for 'wet their underwear (or pad)', '3' for 'trickle down their thigh', '4' for 'wet the floor') and the daily number of visits to the toilet to urinate (toilet). A total of 137 patients distributed over 38 practices participated in the study.

Figure 6.1 shows the responses for each cluster. Note that here the cluster sizes are different, ranging from one to eight. On the other hand, Figure 6.2 presents a summary of all of the covariates for all of the individuals. The series of plots in the left column is for observations such that y_it = 1, that is, for patients that are bothered by their incontinence. The right column of plots is for patients with y_it = 0. We observe a strong presence of female patients in the sample and a slightly larger proportion of females (90% versus 80%) within the subsample for which y_it = 0. The age distribution is quite comparable between the two groups. On the other hand, as one can expect, the three indicators of the severity of the incontinence (dayacc, severe and toilet) show larger values for patients who declare themselves bothered by their problem (left column). For example, the median number of visits to the toilet is 6.5 for patients for which y_it = 1 versus 5 for the other group.
Similarly, the median number of leaking accidents per day for the first group is 4.6 against 1 for the second group.

The model considered for this dataset is a binary logit-link model (τ = 1) defined by

logit(E[bothered]) = logit(P(bothered)) = β_0 + β_1 female + β_2 age + β_3 dayacc + β_4 severe + β_5 toilet,    (6.7)

² Available at http://www.bios.unc.edu/∼jpreisse/personal/uidata/preqaq99.dat
Figure 6.1 The response for the GUIDE dataset for each practice (labeled by an increasing number, appearing in the shaded box).
where logit(p) = log(p/(1 − p)), with p/(1 − p) being the odds and p = P(bothered) the probability of being bothered. The clusters are defined by the practice, which means that patients from different practices are assumed independent. We assume a common exchangeable correlation α between any two patients of the same practice. This hypothesis makes sense a priori in the context of this example: even though the patients of the same practice behave independently, correlation could be induced by the fact that a physician tends to prescribe similar treatments to patients treated for the same problem. Note that the scaling of the variable age is not necessary, but it is kept for consistency with the original analysis in Preisser and Qaqish (1999). Also, severe is used as a count (again for consistency with the original analysis) but should probably enter the model as a four-level factor.
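On this scale, a fitted model maps any covariate profile to a probability through the inverse logit, p = 1/(1 + exp(−η)). A small standard-library sketch (the covariate profile is invented for illustration; the coefficients are the classical estimates reported in Table 6.1):

```python
import math

# Classical GEE estimates from Table 6.1: intercept, female, age, dayacc, severe, toilet
beta = [-3.05, -0.75, -0.68, 0.39, 0.81, 0.11]

# Hypothetical profile: a 76-year-old woman (scaled age = 0), 2 accidents/day,
# severity grade 2, 5 toilet visits/day
x = [1.0, 1.0, 0.0, 2.0, 2.0, 5.0]

eta = sum(b * v for b, v in zip(beta, x))   # linear predictor (logit scale)
p = 1.0 / (1.0 + math.exp(-eta))            # inverse logit
```

This profile gives η = −0.85 and p ≈ 0.30, i.e. roughly a 30% estimated probability of being bothered.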
Figure 6.2 Covariates pattern for the GUIDE dataset. The left column is for observations such that yit = 1 (54 observations out of 137) and the right column is for observations such that yit = 0 (83 observations).
The fitted parameters of model (6.7) via classical GEE with exchangeable correlation are given in the first column of Table 6.1. We interpret the results in Section 6.3.4.
6.2.4 Residual Analysis Residuals with longitudinal data can be considered at the observation level or at the cluster level. In both cases, the residuals proposed for GEE are similar to those used for GLMs with the additional requirement that the cluster structure has to be
Table 6.1 Estimates of α and β by classical and robust GEE for model (6.7).

Variable    Classical coefficient (SE)   Huber coefficient (SE)   Mallows coefficient (SE)
α̂            0.09                         0.11                     0.10
intercept   −3.05 (0.96)                 −3.62 (1.30)             −3.63 (1.28)
female      −0.75 (0.60)                 −1.45 (0.80)             −1.41 (0.78)
age         −0.68 (0.56)                 −1.48 (0.71)             −1.39 (0.69)
dayacc       0.39 (0.09)                  0.51 (0.13)              0.52 (0.13)
severe       0.81 (0.36)                  0.71 (0.42)              0.69 (0.41)
toilet       0.11 (0.10)                  0.36 (0.13)              0.35 (0.13)

The classical estimates are the solution of (6.1)–(6.3). The robust estimates are obtained by solving (6.8), (6.10) and (6.11) with c = 1.5 and k = 2.4 (Huber's estimator), and with c = 1.5, w(x_it) = √(1 − h_i,tt) and k = 2.4 (Mallows' estimator).
considered, see Hammill and Preisser (2006), Hardin and Hilbe (2003, Section 4.2) and Chapter 4. As in GLMs, we define the Pearson residuals

r̂_it = (y_it − µ̂_it)/√(τ̂ v_{µ̂it}).

They can be plotted to identify outliers and other violations of the assumptions, as in other regression settings (e.g. heteroscedasticity, functional form of the regression, etc.). Figure 6.3 is a plot of the Pearson residuals for the GEE fit of the GUIDE dataset. It shows some large residuals, in particular for observations 8, 19, 42, 87 and 88. Given that residuals estimated through non-robust estimators have to be analyzed with caution, in particular with regard to possible masking effects, we defer the detailed interpretation of this residual analysis and first introduce the robust estimators.
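A minimal sketch of this screening step (illustrative code; the Bernoulli data below are invented): compute the Pearson residuals and flag those exceeding a common rule-of-thumb threshold of 2 in absolute value.

```python
import numpy as np

def pearson_residuals(y, mu, v, tau=1.0):
    """r_it = (y_it - mu_it) / sqrt(tau * v_it), stacked over all observations."""
    return (np.asarray(y) - np.asarray(mu)) / np.sqrt(tau * np.asarray(v))

# Invented binary responses and fitted probabilities
y = np.array([1, 0, 1, 0, 1])
mu = np.array([0.9, 0.1, 0.05, 0.5, 0.5])
r = pearson_residuals(y, mu, mu * (1 - mu))      # Bernoulli variance v = mu(1 - mu)
outliers = np.flatnonzero(np.abs(r) > 2)         # indices of suspiciously large residuals
```

Here only the third observation (a success fitted with probability 0.05) is flagged.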
Figure 6.3 Pearson residuals corresponding to the classical GEE fit of the GUIDE dataset (first column of Table 6.1).

6.3 A Robust GEE-type Estimator

6.3.1 Linear Predictor Parameters

The robust counterpart to the GEE approach is built upon the theory of optimally weighted estimating equations (see Hanfelt and Liang 1995; McCullagh and Nelder 1989, p. 334). In the class of all estimating equations based on (y_i − µ_i), the optimal (that is, with smallest asymptotic dispersion) estimating equations are given by

Σ_{i=1}^n (D_{µi,β})^T Γ_i^T (V_{µi,τ,α})^{-1} (ψ_i − c_i) = Σ_{i=1}^n Ψ_1(y_i, X_i; β, α, τ, c) = 0,    (6.8)

where D_{µi,β} = D_i(X_i, β) = ∂µ_i/∂β is an n_i × (q + 1) matrix and V_{µi,τ,α} = τ A_{µi}^{1/2} R_{α,i} A_{µi}^{1/2} is an n_i × n_i matrix. Moreover, ψ_i = W_i · (y_i − µ_i), where the matrix W_i = W(y_i, X_i; µ_i) = diag(w_i1, ..., w_in_i) is an n_i × n_i diagonal weight matrix containing robustness weights w_it for t = 1, ..., n_i, and c_i = E[ψ_i]. Finally, Γ_i = E[ψ̃_i − c̃_i] with ψ̃_i = ∂ψ_i/∂µ_i and c̃_i = ∂c_i/∂µ_i. Note that the set of estimating equations in (6.8) is a slightly modified version of the estimating equations in Preisser and Qaqish (1999) in that it includes the matrix Γ_i, which, for a given choice of weights W_i and 'working' correlation R_{α,i}, makes it optimal (in the sense of smallest asymptotic dispersion) in the class of all estimating equations based on (y_i − µ_i), see Hanfelt and Liang (1995). The computational details of c_i and Γ_i for binary responses are given in Appendix F.2.

We assume that the weights W_i downweight each observation separately, even though it is possible to consider a cluster downweighting scheme, see the discussion
about observation versus cluster outliers in Section 6.2.4. Possible choices for the weights are w(r_it; β, τ, c), a function of the Pearson residuals r_it = (y_it − µ_it)/√(τ v_{µit}), for example Huber's weight (see also (2.16))

w(r_it; β, τ, c) = c/|r_it/√τ| if |r_it/√τ| > c, and 1 otherwise,    (6.9)

to ensure robustness with respect to outlying points in the response space (Huber's estimator), or w(x_it), a function of the diagonal elements h_i,tt of the hat matrix H_i (see (3.11)) for subject i (for example, w(x_it) = √(1 − h_i,tt)), to handle leverage points. In practice, it often makes sense to combine both types of weights multiplicatively: w_it = w(r_it; β, τ, c) w(x_it) (Mallows' estimator). The classical GEE are obtained with W_i equal to the identity matrix. We refer to Cantoni and Ronchetti (2001b) for a detailed discussion on the choice of the weights.

For simplicity, our weighting scheme (as in Preisser and Qaqish, 1999) does not take into account the within-subject correlation and is therefore not suitable for the situation where this correlation is high, in which case it has to be redefined properly, see for example Huggins (1993) and Richardson and Welsh (1995). Doing so will change the definition in (6.8) and affect the distributional properties. Note, however, that protection against outliers affecting all of the observations of a cluster can be handled by our approach by specifying a cluster downweighting scheme, that is, with w_it = w_i* for all t = 1, ..., n_i, where the w_i* have to be defined to take into account the information of the entire cluster.

The estimating equations (6.8) do not simplify exactly to the estimating equations (5.13) for GLMs owing to the presence of the matrix Γ_i in the former. The presence of this matrix in the GEE setting is necessary to allow the construction of the quasi-deviance functions for inference (see Section 6.4.2).
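These weights are straightforward to code. The sketch below (our naming, not the authors' implementation) follows (6.9) for the response weights and w(x_it) = √(1 − h_i,tt) for the design weights.

```python
import numpy as np

def huber_weight(r, tau=1.0, c=1.5):
    """Huber weights (6.9): full weight within [-c, c] after scaling by sqrt(tau),
    then downweighting proportionally to 1/|r| beyond."""
    z = np.abs(np.asarray(r, dtype=float)) / np.sqrt(tau)
    w = np.ones_like(z)
    big = z > c
    w[big] = c / z[big]
    return w

def mallows_weight(h):
    """Design weights from the hat-matrix diagonal: w(x_it) = sqrt(1 - h_it)."""
    return np.sqrt(1.0 - np.asarray(h, dtype=float))

# Combined Mallows-type weights: w_it = w(r_it) * w(x_it)
r = np.array([0.5, -3.0])        # Pearson residuals (invented values)
h = np.array([0.1, 0.4])         # leverages (invented values)
w = huber_weight(r) * mallows_weight(h)
```

The second observation is downweighted both for its large residual (weight 1.5/3 = 0.5) and for its higher leverage.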
6.3.2 Nuisance Parameters

The estimators of the dispersion parameter τ and of the correlation parameter α also have to be made robust to avoid harmful consequences on the estimation of the regression parameters. We build again on the fact that the parameter τ is the variance of (y_it − µ_it)/√(v_{µit}) = √τ r_it, see Section 6.2.2. We therefore proceed similarly as for GLM and choose Huber's Proposal 2 estimator of variance (see Section 5.3.3), which is written here as

Σ_{i=1}^n Σ_{t=1}^{n_i} χ(r_it; β, α, τ, c) = Σ_{i=1}^n Ψ_2(r_i; β, α, τ, c) = 0,    (6.10)

where χ(u; β, α, τ, c) = ψ²(u; β, α, τ, c) − δ. In addition, δ = E[ψ²(u; β, α, τ, c)] (under normality for u) is a constant that ensures Fisher consistency of the estimation of τ. For its computation for ψ(u; β, α, τ, c) = ψ_[Hub](u; β, α, τ, c) (our preferred choice), see (3.8), while noticing that ψ_[Hub](u; β, α, τ, c) = u w_[Hub](u; β, α, τ, c).

As in the classical GEE theory, the estimator of the correlation parameter α depends on the assumed correlation structure. To build a robust estimator of α, the
idea is to base this estimator on appropriate pairs of residuals, along the same lines as for the classical estimators (see Section 6.2.2), but to consider additional weighting schemes to downweight outlying observations. In the following we discuss in detail the case of exchangeable correlation and explain how one can deal with two other common situations, namely the m-dependence and the AR correlation structures.

Let us recall that the exchangeable correlation structure defines R_{α,i} = α e_{n_i} e_{n_i}^T + (1 − α) I_{n_i}, with e_{n_i} a vector of ones of length n_i, and I_{n_i} the identity matrix of size n_i × n_i, which means that corr(y_it, y_it′) = α for t ≠ t′, and one otherwise. A simple M-estimator of covariance can be defined through Huber's type of weights (based on ψ_[Hub](·; β, α, τ, c)), which we define as functions of the Mahalanobis distance d_tt′^i (see (2.34)) between the pair of residuals r̂_it and r̂_it′. The Mahalanobis distance is given in this case by

(d_tt′^i)² = (r̂_it, r̂_it′) Σ̂^{-1} (r̂_it, r̂_it′)^T, with Σ̂ = τ̂_[M] [1, α̂_[M]; α̂_[M], 1].

We define Huber's weights on the Mahalanobis distances by

u_{1,k}(d_tt′^i) = 1 if d_tt′^i ≤ k, and k/d_tt′^i otherwise.

We then put u_{2,k}(d_tt′^i) = u_{1,k}²(d_tt′^i)/γ with γ = E[d² u_{1,k}²(d)]/2, where the expectation is computed under normality for the residual pair (so that d² follows a χ² distribution with two degrees of freedom). This yields γ = F_{χ²4}(k²) + (k²/2)(1 − F_{χ²2}(k²)), where F_{χ²4} and F_{χ²2} are the cumulative distribution functions of χ² distributions with four and two degrees of freedom, respectively.

Let B_i = (r̂_i1 · r̂_i2, r̂_i1 · r̂_i3, ..., r̂_i(n_i−1) · r̂_in_i)^T be the vector of the products of all of the pairs of residuals for cluster i and let G_i = (u_{2,k}(d_12^i), u_{2,k}(d_13^i), ..., u_{2,k}(d_(n_i−1)n_i^i))^T be the vector of weights; then our robust estimator of α is defined as the solution α̂_[M] of

Σ_{i=1}^n (G_i^T B_i − ατ K/n) = Σ_{i=1}^n Ψ_3(r_i; β, α, τ, c) = 0,    (6.11)

with K = Σ_{i=1}^n n_i(n_i − 1)/2. For more details on all of the above computations we refer to Maronna (1976), Devlin et al. (1981) and Marazzi (1993, p. 225). M-estimators are known to have a low breakdown point, namely one over the dimension of the problem, which is equal to two here (see the discussion of this point in Section 2.3.1). Nevertheless, high breakdown point estimators could be considered to estimate Σ. An ad hoc estimator of α in the case of binary responses with exchangeable correlation, inspired by the classical moment estimator, is considered by Preisser and Qaqish (1999). This proposal relies on the hypothesis that var(ψ_i) can be decomposed as C_i var(y_i) C_i and therefore cannot be extended to other settings, e.g. Poisson. Our proposal is more general and has the advantage of inheriting the whole set of distributional properties of M-estimators. It is also worth mentioning that setting all u_{1,k}(d_tt′^i) = 1, and therefore all u_{2,k}(d_tt′^i) = 1, gives the usual (classical) moment estimators for these situations.
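The building blocks of this robust estimator can be sketched with the standard library alone (the function names are ours). Note that, plugging in the closed forms of the two χ² CDFs, the constant simplifies algebraically to γ = 1 − e^{−k²/2} = F_{χ²2}(k²).

```python
import math

def chi2_cdf_2(x):
    """CDF of a chi-square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0)

def chi2_cdf_4(x):
    """CDF of a chi-square distribution with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)

def gamma_const(k):
    """Consistency constant gamma = F_chi2_4(k^2) + (k^2/2) * (1 - F_chi2_2(k^2))."""
    return chi2_cdf_4(k * k) + (k * k / 2.0) * (1.0 - chi2_cdf_2(k * k))

def u1(d, k):
    """Huber-type weight on the Mahalanobis distance of a residual pair."""
    return 1.0 if d <= k else k / d

def u2(d, k):
    """Squared, consistency-corrected weight used in G_i of (6.11)."""
    return u1(d, k) ** 2 / gamma_const(k)

def pair_distance(r1, r2, tau, alpha):
    """Mahalanobis distance of (r1, r2) under Sigma = tau * [[1, alpha], [alpha, 1]]."""
    return math.sqrt((r1 * r1 - 2.0 * alpha * r1 * r2 + r2 * r2)
                     / (tau * (1.0 - alpha * alpha)))
```

With the choice k = 2.4 used later in Section 6.3.4, gamma_const(2.4) ≈ 0.944, so pairs retained with full weight are only mildly rescaled.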
Two other common correlation structures are the m-dependence correlation structure, which assumes that corr(y_it, y_i,t+j) = α_j for j = 1, ..., m, and the AR correlation structure, which assumes that corr(y_it, y_i,t+j) = α^j for j = 0, 1, ..., n_i − t. The procedure described above can be adapted to these cases by constructing B_i appropriately, that is, with all of the products r̂_it · r̂_i,t+j in the first case, and r̂_it · r̂_i,t+1 in the latter. The correction terms have to be defined accordingly.
6.3.3 IF and Asymptotic Properties

Under standard regularity conditions we have that (√n(β̂_[M] − β)^T, √n(τ̂_[M] − τ), √n(α̂_[M] − α))^T, with β̂_[M], τ̂_[M] and α̂_[M] defined through (6.8), (6.10) and (6.11), respectively, follows an asymptotic normal distribution with mean zero and covariance matrix

lim_{n→∞} [F 0 0; G H 0; J L N]^{-1} [Σ_(11) Σ_(12) Σ_(13); Σ_(12)^T Σ_(22) Σ_(23); Σ_(13)^T Σ_(23)^T Σ_(33)] [F 0 0; G H 0; J L N]^{-T},    (6.12)

where all of the sub-matrices in (6.12) are given in Cantoni (2004b) (up to a factor 1/n), where the proof of the distributional result is also given. The particular form of the matrices in (6.12) implies that the marginal asymptotic distribution of √n(β̂_[M] − β) is normal with mean zero and variance equal to

ϒ = lim_{n→∞} F^{-1} Σ_(11) F^{-T},    (6.13)

where

F = (1/n) Σ_{i=1}^n (D_{µi,β})^T Γ_i^T (V_{µi,τ,α})^{-1} Γ_i D_{µi,β},    (6.14)

and

Σ_(11) = (1/n) Σ_{i=1}^n (D_{µi,β})^T Γ_i^T (V_{µi,τ,α})^{-1} var(ψ_i) (V_{µi,τ,α})^{-1} Γ_i D_{µi,β}.    (6.15)

The distributional result in (6.12) generalizes the results of Prentice (1988): it applies to other types of responses than Bernoulli trials, it allows for an over-dispersion parameter (τ) and it is developed in the more general setting of robust estimating equations defined by (6.8), (6.10) and (6.11).

In addition, the estimating equations (6.8), (6.10) and (6.11) define a set of M-estimators (Huber, 1981), with the corresponding score functions Ψ_1(y_i, X_i; β, α, τ, c), Ψ_2(r_i; β, α, τ, c) and Ψ_3(r_i; β, α, τ, c) given in Appendix F.1. From general theory on M-estimation, we know that the IF of these estimators is proportional to the score functions defining them. Therefore, the estimators obtained by our procedure are robust as long as the functions Ψ_i are bounded in the design and in the response. This is in particular achieved if ψ_i in (6.8) and χ in (6.10) are bounded, and u_{2,k} through G_i in (6.11) are allowed to be less than one.
Figure 6.4 Robustness weights on the response, grouped by practice, for the fit corresponding to the middle column of Table 6.1 (Huber’s estimator).
6.3.4 GUIDE Data Example (continued)

We estimate the regression parameters with the set of equations in (6.8), where we consider both a Mallows' estimator with w(x_it) = √(1 − h_i,tt) and c = 1.5, and a Huber estimator with w(x_it) = 1 and c = 1.5. In both cases, the exchangeable correlation parameter α is estimated through (6.11) with k = 2.4, which is approximately the square root of the 95% quantile of a χ² distribution with two degrees of freedom. In addition to the classical results already presented in Section 6.2.3, Table 6.1 gives the estimated coefficients for the two robust alternatives.

First note that the results of the second and third columns (robust analyses) are quite close, whereas they differ noticeably from the classical analysis. This means that the additional weights on the design are probably not crucial, in the sense that the dataset does not seem to contain leverage points. By looking at approximate CIs (see their definition in Section 6.4.1), the variables female and age are not significant in the classical analysis, but are borderline in the robust analysis. The significance of the variable dayacc seems to be equally well assessed in both types of analysis. The variable severe is no longer significant in the robust analysis, whereas the variable toilet seems to play an important role that was hidden in the classical approach.
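The practical impact of these changes is easiest to read on the odds-ratio scale, exp(β). A quick standard-library computation with the Huber column of Table 6.1 (this post-processing is ours, not part of the original analysis):

```python
import math

# Huber-column estimates (coefficient, robust SE) from Table 6.1
estimates = {"dayacc": (0.51, 0.13), "toilet": (0.36, 0.13), "severe": (0.71, 0.42)}

results = {}
for name, (b, se) in estimates.items():
    lo, hi = b - 1.96 * se, b + 1.96 * se            # 95% Wald CI on the logit scale
    results[name] = (math.exp(b), math.exp(lo), math.exp(hi))  # odds-ratio scale
```

For toilet the odds-ratio CI is roughly (1.11, 1.85), excluding 1, while for severe it is roughly (0.89, 4.63), including 1: this matches the conclusions drawn above.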
The robust procedure also gives information on how many and which observations are downweighted. For example, in the analysis with weights on the response only (middle column of Table 6.1), there are 15 observations out of 137 that do not receive full weight, 8 of which have a weight less than 0.6, see Figure 6.4. This group of observations is partially the same as that identified in Preisser and Qaqish (1999) with their robust procedure. The diagnostic approach in Hammill and Preisser (2006) identifies as potential outliers a smaller group of observations, in particular patients 8 and 44. These two patients, together with patient 42, report not being bothered despite their high frequency of visits to the toilet (10 for patients 8 and 42, and 20 for patient 44) and their large average number of leaking accidents per day (9.3 for patient 8, 6 for patient 42 and 3 for patient 44). On the other hand, patients 19 and 88 declared themselves bothered, even though the severity of their symptoms (variables severe and toilet) is pretty low with respect to the other sample values. Only two of these heavily downweighted observations belong to the same practice (cluster), namely observations 87 and 88 from practice 156, confirming that the individual downweighting scheme is justified for this dataset.
6.4 Robust Inference

6.4.1 Significance Testing and CIs

The z-test for significance testing and (1 − α) CIs for the regression parameters β can be constructed based on the asymptotic distribution of the estimator, see Section 6.3.3. The z-statistics and (1 − α) CIs are given by

z-statistic = βˆ[M]j / SE(βˆ[M]j),

and

(βˆ[M]j − z(1−α/2) SE(βˆ[M]j); βˆ[M]j + z(1−α/2) SE(βˆ[M]j)),

with

SE(βˆ[M]j) = ((1/n) [ϒˆ]_(j+1)(j+1))^{1/2},

where z(1−α/2) is the (1 − α/2) quantile of the standard normal distribution, and where ϒˆ = Fˆ^{-1} Σˆ^{(11)} Fˆ^{-T}, with

Fˆ = (1/n) Σ_{i=1}^{n} (D_{µˆi,βˆ})^T Γ_i^T (V_{µˆi,τˆ,αˆ})^{-1} Γ_i D_{µˆi,βˆ},   (6.16)

and

Σˆ^{(11)} = (1/n) Σ_{i=1}^{n} (D_{µˆi,βˆ})^T Γ_i^T (V_{µˆi,τˆ,αˆ})^{-1} (ψi − ci)(ψi − ci)^T (V_{µˆi,τˆ,αˆ})^{-1} Γ_i D_{µˆi,βˆ},   (6.17)

where βˆ = βˆ[M], µˆi = µi(βˆ[M]), τˆ = τˆ[M] and αˆ = αˆ[M].
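The sandwich construction can be illustrated numerically. The sketch below is our own and the matrices are made up (not from the GUIDE fit): given a bread matrix F̂, a meat matrix Σ̂, the estimate and the sample size, it computes the sandwich variance, standard errors, z-statistics and Wald CIs.

```python
import numpy as np
from scipy.stats import norm

def sandwich_inference(beta_hat, F_hat, Sigma_hat, n, level=0.95):
    """Wald inference from a sandwich variance:
    Upsilon = F^{-1} Sigma F^{-T}, SE(beta_j) = sqrt(Upsilon[j, j] / n)."""
    F_inv = np.linalg.inv(F_hat)
    Upsilon = F_inv @ Sigma_hat @ F_inv.T
    se = np.sqrt(np.diag(Upsilon) / n)
    z = beta_hat / se
    q = norm.ppf(0.5 + level / 2)                 # e.g. 1.96 for level = 0.95
    ci = np.column_stack([beta_hat - q * se, beta_hat + q * se])
    return se, z, ci

# Illustrative (made-up) 2-parameter example with n = 137 clusters
beta = np.array([0.49, 0.29])
F = np.array([[2.0, 0.3], [0.3, 1.5]])
Sigma = np.array([[1.8, 0.2], [0.2, 1.2]])
se, z, ci = sandwich_inference(beta, F, Sigma, n=137)
print(np.round(se, 3), np.round(z, 2))
```

A coefficient is then declared significant at level α when 0 falls outside the corresponding interval, exactly as in the GUIDE analyses above.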
6.4.2 Variable Selection

Variable selection is performed here by comparing the adequacy of a submodel M_{q−k+1} with (q − k + 1) regression parameters with respect to a larger model M_{q+1} with (q + 1) regression parameters. This is done either in a stepwise procedure, or by comparing two predefined nested models. For that we define a class of test statistics based on differences of quasi-likelihoods, in the same spirit as the difference of quasi-deviances in (5.24) for GLMs in Chapter 5:

Λ_{t(s)} = 2 ( Σ_{i=1}^{n} Q_{ti(s)}(yi; µˆi) − Σ_{i=1}^{n} Q_{ti(s)}(yi; µ̇i) ),   (6.18)

where µˆi = µi(βˆ[M], αˆ[M], τˆ[M]) is the estimation under model M_{q+1}, where µ̇i = µi(β̇[M], α̇[M], τ̇[M]) is the estimation under model M_{q−k+1}, and where the quasi-likelihood functions take the multidimensional form

Q_{ti(s)}(yi; µi) = (1/τ) ∫_{yi}^{µi} (yi − ti(s))^T W(yi, Xi; ti(s)) (V_{ti(s),τ,α})^{-1} Γ_i(ti(s)) dti(s)
  − (1/τ) ∫_{yi}^{µi} E[(yi − ti(s))^T W(yi, Xi; ti(s))] (V_{ti(s),τ,α})^{-1} Γ_i(ti(s)) dti(s),   (6.19)

with the integrals possibly path-dependent. This means that there are several paths to go from a point yi to a point µi and implies, therefore, that the integrals in (6.19) are not uniquely defined. It is common practice to parameterize this path, and a typical set of integration paths is given for example by tit(s) = yit + (µit − yit) s^{cit}, for s ∈ [0, 1], cit ≥ 1 and t = 1, . . . , ni. For instance, when cit ≡ 1 for all t (see for example McCullagh and Nelder, 1989, p. 335), we have that

Q_{ti(s)}(yi; µi) = −(1/τ) (yi − µi)^T [ ∫_0^1 s W(yi, Xi; ti(s)) (V_{ti(s),τ,α})^{-1} Γ_i(ti(s)) ds ] (yi − µi)
  + (1/τ) ∫_0^1 E[{yi − ti(s)}^T W(yi, Xi; ti(s))] (V_{ti(s),τ,α})^{-1} Γ_i(ti(s)) (yi − µi) ds,
which involves only univariate integrations, uniquely defined. The asymptotic result from Section 6.4.2.1 shows that the path-dependence of the integrals in (6.19) vanishes asymptotically. In addition, Hanfelt and Liang (1995) showed that the path of integration does not play an important role in finite-sample situations. These results support the use of the difference of robust quasi-likelihoods for inference.
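The path parameterization itself is easy to visualize with a small sketch (the helper name is ours): t(s) runs from y at s = 0 to µ at s = 1, and different exponents c trace different routes between the same endpoints, which is exactly why the integrals in (6.19) can be path-dependent.

```python
import numpy as np

def integration_path(y, mu, s, c=1.0):
    """Integration path t(s) = y + (mu - y) * s**c, with t(0) = y, t(1) = mu."""
    return y + (mu - y) * np.power(s, c)

y, mu = 2.0, 5.0
s = np.linspace(0.0, 1.0, 5)
print(integration_path(y, mu, s, c=1.0))  # straight path from y to mu
print(integration_path(y, mu, s, c=2.0))  # same endpoints, different route
```

Whatever the exponent, the endpoints agree; only the intermediate values, and hence the value of a path integral along t(s), can differ.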
6.4.2.1 Multivariate Testing
Multivariate testing of the type H0 : β(2) = 0, with β = (β(1), β(2)), β(1) of dimension (q + 1 − k) and β(2) of dimension k, can be performed using Λ_{t(s)} defined by (6.18) as a test statistic. Cantoni (2004b) proves that, under quite general conditions (and under H0), Λ_{t(s)} is asymptotically equivalent to the following quadratic forms in normal variables:

n L_n^T (M^{-1} − M̃^+) L_n = n R_{n(2)}^T M_{22.1} R_{n(2)},   (6.20)

where

M = lim_{n→∞} F = lim_{n→∞} (1/n) Σ_{i=1}^{n} (D_{µi,β})^T Γ_i^T (V_{µi,τ,α})^{-1} Γ_i D_{µi,β}

is partitioned into four blocks according to the partition of β:

M = ( M(11)   M(12)
      M(12)^T M(22) )   and   M̃^+ = ( M(11)^{-1}        0_{(q−k+1)×k}
                                       0_{k×(q−k+1)}    0_{k×k} ),

where 0_{a×b} is a matrix of dimension a × b with only zero entries. The variables √n L_n and √n R_n are asymptotically normally distributed N(0, Q) and N(0, M^{-1} Q M^{-1}), respectively, where

Q = lim_{n→∞} Σ^{(11)} = lim_{n→∞} (1/n) Σ_{i=1}^{n} (D_{µi,β})^T Γ_i^T (V_{µi,τ,α})^{-1} var(ψi) (V_{µi,τ,α})^{-1} Γ_i D_{µi,β}.

This implies that Λ_{t(s)} is asymptotically distributed as a linear combination of χ²₁ variables, similarly as for GLMs (see Section 5.4.2). More precisely, Λ_{t(s)} is asymptotically distributed as Σ_{i=1}^{k} d_i N_i², where N1, . . . , Nk are independent standard normal variables and d1, . . . , dk are the k positive eigenvalues of the matrix Q(M^{-1} − M̃^+). In practice, the empirical versions of M and Q are used, that is, Mˆ = Fˆ (see (6.16)) and Qˆ = Σˆ^{(11)} (see (6.17)).

In addition to giving the asymptotic distribution, the above result provides an asymptotically equivalent quadratic form to Λ_{t(s)}, which can be used as an asymptotic approximation when the integrals involved in the definition of Λ_{t(s)} are problematic to compute. More precisely, one computes n βˆ_{M(2)}^T Mˆ_{22.1} βˆ_{M(2)}. Finally, Cantoni (2004b) proves that the level and the power of Λ_{t(s)} under contamination are bounded provided that βˆ_{M(2)} has a bounded IF.
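The tail probability of a linear combination of χ²₁ variables has no simple closed form in general, but it is easy to approximate. The sketch below (our own code, with illustrative matrices) extracts the positive eigenvalues of Q(M⁻¹ − M̃⁺) and estimates the p-value by Monte Carlo; more refined numerical methods (e.g. Imhof-type inversion) exist.

```python
import numpy as np

def chi2_mixture_pvalue(stat, d, n_sim=200_000, seed=0):
    """P(sum_i d_i * N_i^2 > stat) by Monte Carlo, for positive weights d."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_sim, len(d)))
    draws = (z**2) @ np.asarray(d, dtype=float)
    return float(np.mean(draws > stat))

def mixture_weights(Q, M, M_tilde_plus):
    """Positive eigenvalues of Q (M^{-1} - M~+): the weights d_1, ..., d_k."""
    eig = np.real(np.linalg.eigvals(Q @ (np.linalg.inv(M) - M_tilde_plus)))
    return eig[eig > 1e-10]

# Sanity check: a single weight d = 1 gives an ordinary chi-squared(1) tail
p = chi2_mixture_pvalue(3.841, [1.0])
print(round(p, 3))   # close to 0.05
```

In the classical, uncontaminated case the weights all equal one and the statistic reduces to the familiar χ²ₖ reference distribution.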
6.4.3 GUIDE Data Example (continued)

Let us consider a backward stepwise procedure based on the difference of quasi-likelihood functions defined by (6.18) to check more carefully the issues related
Table 6.2 p-values of the backward stepwise procedure on the GUIDE dataset.

            Variable   Step 1   Step 2   Step 3   Step 4
Classical   female     0.224    0.270    –        –
            age        0.249    –        –        –
            dayacc     <10−4    <10−4    <10−4    <10−4
            severe     0.089    0.081    0.061    0.011
            toilet     0.224    0.164    0.165    –
Robust      female     0.070    0.095    –        –
            age        0.045    0.041    0.068    –
            dayacc     <10−4    <10−4    <10−4    <10−4
            severe     0.092    –        –        –
            toilet     0.006    0.004    0.004    0.002

The robust test statistics (6.18) are computed by applying Huber's-type weights (c = 1.5), and by using k = 2.4 for the estimation of α in (6.11) (exchangeable correlation). The classical test statistics are computed with c = ∞ and k = ∞.
to model selection. We use the same weights and the same set of parameters as for the Huber estimator of Section 6.3.4, and compute the quadratic form (6.20) asymptotically equivalent to Λ_{t(s)}. At each step of the procedure, we remove the variable that is the least significant by looking at the p-value or, equivalently, at the value of the test statistic. The procedure is stopped when all of the test statistics are significant at the 5% level. The classical counterpart is computed with the same quadratic form, by using c = ∞ and k = ∞ to compute the estimators. Table 6.2 gives the p-values of this backward stepwise procedure. It is impressive to see how much the classical p-values differ from the robust p-values. This highlights the heavy influence of outlying observations on the test procedure, and not only on the estimation procedure. Finally, the robust procedure ends up retaining the variables dayacc and toilet, whereas the classical analysis would retain the variables dayacc and severe. On the basis of the theoretical properties of the robust estimator, and also of the simulation results in Cantoni (2004b), the conclusions from the robust analysis are more reliable. We therefore robustly refit the model with only dayacc and toilet and proceed with interpretations from this model. The estimated coefficients and standard errors for the linear predictor are as follows:

−3.67 + 0.49 dayacc + 0.29 toilet.
(0.76)  (0.12)        (0.10)
The estimated model in this clustered setting can be interpreted in the same way as for GLMs. In this example, the response is binary, and therefore the discussion of Section 5.5.2 about interpreting the coefficients on the odds scale still holds. The effect of an additional leaking accident per day is to increase by 63% (exp(0.49) = 1.63) the odds of a patient being bothered by their incontinence problem. Similarly, the effect of an extra visit to the toilet results in a 34% increase (exp(0.29) = 1.34)
on the same odds. This second effect is smaller in magnitude, which seems consistent with common sense.
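The odds-scale arithmetic above is easy to reproduce; the coefficients and standard errors below are those of the robust refit just reported, and the approximate interval construction is the standard Wald one, added here for illustration.

```python
import math

# Coefficients (SE) from the robust refit: 0.49 (0.12) dayacc, 0.29 (0.10) toilet
fit = {"dayacc": (0.49, 0.12), "toilet": (0.29, 0.10)}

for name, (b, se) in fit.items():
    odds_ratio = math.exp(b)                 # multiplicative effect on the odds
    lo, hi = math.exp(b - 1.96 * se), math.exp(b + 1.96 * se)
    print(f"{name}: OR = {odds_ratio:.2f} ({lo:.2f}, {hi:.2f}), "
          f"i.e. +{100 * (odds_ratio - 1):.0f}% on the odds")
```

This reproduces the 63% (dayacc) and 34% (toilet) increases quoted in the text.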
6.5 LEI Data Example

We consider here a dataset on direct laryngoscopic endotracheal intubation (LEI), a potentially life-saving procedure in which many health care professionals are trained. We examine data from a prospective longitudinal study on LEI at Dalhousie University, previously analyzed by Mills et al. (2002). Variable selection is an important step, as the model(s) chosen will include only those covariates significant in predicting successful completion of LEI. A total of 436 LEIs were analyzed. We let yit = 1 if trainee i performs a complete LEI in less than 30 seconds on trial t, and zero otherwise. The correlation between observations on the same trainee is taken to be exchangeable. An AR correlation structure would be another option with these data. We judge trainees based on the following nine binary covariates taking the value one if the condition is satisfied: whether the head and neck were in optimal position (neckflex and extoa); whether they inserted the scope properly (proplgsp); whether they performed the lift successfully (proplift); whether there was an appropriate request for help (askas); whether there was unsolicited intervention by the attending anesthesiologist (help); whether there were no complications (comps); and the trainee's handedness (trhand) and gender (trgend). Nineteen trainees performed anywhere from 18 to 33 trials. Figure 6.5 gives the pattern profiles for the 19 trainees. These patterns tend to show that training results in better performance over time, see for example the profiles of trainees K, L, VV and Z. It seems less evident for other individuals, namely AA and S. Table 6.3 presents a summary of all of the (binary) covariates for all of the individuals. As expected, the proportion of ones (indicating that successful action has been taken or that no complications were observed) is larger for individuals who have succeeded in performing a complete LEI.
We robustly fitted a GEE model with exchangeable correlation to the above data. The estimates and test results are given in Table 6.4. The robust GEE model uses a Huber estimator with c = 1.5 for the Huber function and k = 2.4 for Huber's Proposal 2 (6.10). No weights on the design have been used here, given the binary nature of all the covariates. A priori we would expect all of the coefficients to be positive, which would indicate that if proper action is taken, the probability of success in performing LEI increases. It is indeed the case except for a few non-significant coefficients (askas and extoa). A classical approach (not shown here) would give substantially different estimated coefficients. The standard errors of the MLE would also be considerably larger, which is a serious drawback when performing significance testing. Figure 6.6 gives the weights w(rit; β, τ, c) from the robust fit. The two main outliers are observations 11 (11th trial of trainee AA) and 273 (14th trial of trainee T). The first observation corresponds to the only successful LEI for trainee AA in
Figure 6.5 LEI responses (one for completed in less than 30 seconds) for each trainee, labeled by capital letters.
21 trials (see Figure 6.5), with a covariate pattern that is quite stable across the trials (not shown) and therefore cannot explain the different response. The second observation is an unsuccessful LEI, even though the covariate pattern would have called for a success. The significant variables stemming from the robust approach, on the basis of their z-statistics, are neckflex, proplgsp, proplift, help and perhaps comps. In the classical analysis, neckflex would have been considered non-significant, as would comps. The significance of all of the variables except comps is clear-cut. We therefore only test three particular nested models with the difference of quasi-deviances (6.18) with Huber's weights with c = 1.5: the full model including all of the available covariates, against the submodel without extoa, askas, trhand, trgend
Table 6.3 Covariate characteristics for the LEI dataset.

               Successful LEI           Unsuccessful LEI
               (118 observations)       (318 observations)
Variable       Prop. ones  Prop. zeros  Prop. ones  Prop. zeros
neckflex       0.99        0.01         0.95        0.05
extoa          0.99        0.01         0.97        0.03
proplgsp       0.86        0.14         0.52        0.48
proplift       0.88        0.12         0.39        0.61
askas          0.15        0.85         0.20        0.80
help           0.70        0.30         0.37        0.63
comps          0.95        0.05         0.78        0.22
trhand         0.82        0.18         0.84        0.16
trgend         0.77        0.23         0.69        0.31
Table 6.4 Robust GEE fits for the LEI dataset.

Variable     Coefficient (SE)   p-value
intercept    −4.18 (0.51)       <10−4
neckflex     1.52 (0.39)        <10−4
extoa        −0.24 (0.41)       0.56
proplgsp     0.69 (0.20)        0.0007
proplift     0.98 (0.25)        <10−4
askas        −0.42 (0.26)       0.11
help         0.34 (0.12)        0.004
comps        0.99 (0.49)        0.04
trhand       0.04 (0.26)        0.89
trgend       0.05 (0.24)        0.84
α            0.061

The estimates are obtained by solving (6.8), (6.10) and (6.11) with c = 1.5 and k = 2.4 (Huber's estimator).
(the clearly non-significant covariates), and the submodel that in addition removes comps. Table 6.5 gives the p-values associated with these tests. It confirms that the submodel without extoa, askas, trhand and trgend is enough to represent the relationship that describes a successful LEI. The robust analysis also shows the importance of the variable comps, given the rejection of the null hypothesis that its coefficient is equal to zero. The estimated final model therefore yields the following coefficients and standard errors for the linear predictor:
Figure 6.6 Robust weights for the LEI data example.
Table 6.5 Robust p-values for comparison of models based on the difference of quasi-deviances (6.18).

Model                                        Λ_{t(s)}   p-value
Full − extoa − askas − trhand − trgend       4.49       0.36
 − extoa − askas − trhand − trgend − comps   2.76       0.01

The robust test statistics (6.18) are computed by applying Huber's-type weights (c = 1.5) and by using k = 2.4 for the estimation of α in (6.11) (exchangeable correlation).

−9.17 + 3.78 neckflex + 1.33 proplgsp + 1.93 proplift + 0.60 help + 1.93 comps.
(1.23)  (0.75)          (0.35)          (0.50)          (0.26)     (0.77)
The multiplicative effects of a positive action taken by the trainee, or of the absence of complications (in which cases the covariate is equal to one), on the odds of succeeding in performing a LEI are as follows (exponential of the coefficient):

neckflex   proplgsp   proplift   help   comps
43.69      3.79       6.89       1.82   6.90
In addition to the statistical significance, we can see that the strongest effect on the odds of a successful LEI is definitely the proper positioning of the neck, followed
by the correct lift and the absence of complications. Inserting the scope properly and asking for help were also positively associated with a successful LEI, but the associations were somewhat weaker.
6.6 Stillbirth in Piglets Data Example

Genetic selection is an important research domain in animal science. It allows animals with 'stronger' characteristics to be selected. For most mammalian species, farrowing is a critical period; in pigs, for example, up to 8% of newborns are stillborn. Limiting or reducing the number of stillbirths requires its major determinants to be investigated. This section is devoted to the study of stillbirth in four genetic types of sow: Duroc × Large White (DU × LW), Large White (LW), Meishan (MS) and Laconie (LA). Data are from the INRA GEPA experimental unit (France) and have been kindly provided by L. Canario and Y. Billon. Related publications are Canario (2006) and Canario et al. (2006), where the reader can find a more extensive discussion of the modeling issues for this dataset. Previous studies have shown that parity number, piglet birth weight, sex and birth assistance were associated with perinatal mortality. The aim of the study is to establish whether there is a genotype effect, in view of possible genetic selection (e.g. development of crossed-synthetic lines). Our dataset comprises 80 litters for the genetic type (coded gentype) DU × LW, 633 litters for LW, 59 litters for MS and 168 litters for LA, for a total of 940 litters and 11 638 observations. There were 565 deaths (coded = 1) out of the 11 638. The genetic type LW is taken as the reference. Parity number, the number of times a mother has given birth (variable parity, taken as a factor), ranges from one to six with the following corresponding frequencies (35%, 26%, 15%, 12%, 8%, 4%), with one taken as the reference. Birth assistance (variable birthassist) is coded zero for no assistance and one if assistance was given at least once. The cluster is defined as the litter, whose size varies from 5 to 23. We fit a binary logit-link model with exchangeable correlation.
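An exchangeable working correlation matrix, as used in this fit, has ones on the diagonal and a common value α everywhere else. A minimal sketch (our own helper, illustrative α), which also checks the validity constraint α > −1/(n − 1) for a cluster of size n:

```python
import numpy as np

def exchangeable_corr(size, alpha):
    """Exchangeable working correlation R(alpha): 1 on the diagonal,
    alpha everywhere else.  Positive definite iff -1/(size-1) < alpha < 1."""
    R = np.full((size, size), float(alpha))
    np.fill_diagonal(R, 1.0)
    return R

R = exchangeable_corr(5, 0.035)           # a litter of size 5, small alpha
print(R[0, :3])                           # first row starts 1, alpha, alpha
print(np.all(np.linalg.eigvalsh(R) > 0))  # valid correlation matrix: True
```

In a GEE, this R(α) is plugged into the working covariance of each cluster, whatever the litter size, with a single α shared across litters.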
For the robust fit we use weights w(rit; β, τ, c) on the residuals with tuning constant c = 1.5 for the Huber function, and k = 2.4 for Huber's Proposal 2 (6.10). The estimated coefficients, standard errors and p-values of the z-tests for significance of each coefficient (H0 : βj = 0) are given in Table 6.6. The robust analysis shows that piglets born from the MS genetic type have a lower risk of stillbirth with respect to LW. The odds ratio of a stillbirth for the MS genotype with respect to the LW genotype is equal to exp(−1.71) = 0.18. Also, the mortality increases with parity, at least for the 5th and 6th parities, which could result from the fatness of old sows or aging of the uterus (or both). The estimated exchangeable correlation is αˆ = 0.035, which is low. The conclusions, however, have to be taken with caution. A careful inspection of the weights assigned by the robust technique to the observations shows a particular pattern. Indeed, in Figure 6.7, one can see that the downweighted observations identify a subpopulation of the data, in fact all of the 565 observations corresponding
Table 6.6 Robust estimates for the piglets dataset.

Variable                  Coefficient (SE)   p-value
intercept                 −3.00 (0.11)       <10−4
factor(gentype)DU × LW    −0.20 (0.19)       0.31
factor(gentype)MS         −1.71 (0.43)       <10−4
factor(gentype)LA         0.11 (0.14)        0.45
factor(parity)2           −0.23 (0.15)       0.12
factor(parity)3           0.10 (0.17)        0.57
factor(parity)4           0.15 (0.17)        0.38
factor(parity)5           0.38 (0.21)        0.08
factor(parity)6           0.55 (0.20)        0.005
birthassist               0.13 (0.12)        0.30
The estimates are obtained by solving (6.8), (6.10) and (6.11) with c = 1.5 and k = 2.4 (Huber’s estimator).
Figure 6.7 Robustness weights on the response for the piglets dataset.
to a death (response = 1). Further investigation allowed us to identify suspected separation or near-separation in the data. This peculiarity of binary regression arises when the design points of the observations for which y = 1 and those for which y = 0 can be completely separated by a hyperplane.
Figure 6.8 Illustration of a situation with no overlap in binary regression: the observations for which y = 0 and the observations for which y = 1 can be completely separated by a hyperplane.
For example, if there are two covariates x1 and x2 , this would correspond to the situation depicted in Figure 6.8. We say that there is no overlap for this dataset (see also the illustration in Christmann and Rousseeuw (2001, pp. 67–69)). Christmann and Rousseeuw (2001) give an algorithm to compute the overlap, that is, the smallest number of observations whose removal yields complete or quasi-complete separation. In these cases, most estimators do not exist. In cases where the overlap is very small, the estimators exist but can potentially be very unstable. In addition, robust estimators work by downweighting (or sometimes removing) outlying points. It can therefore happen that the whole dataset has overlap, but that the robust estimators do not exist. The methodology by Christmann and Rousseeuw (2001) was used on the piglets dataset to compute the overlap, which is equal to eight. This is particularly related to the binary/categorical nature of the data. A (limited) sensitivity analysis has nevertheless shown that some stability is present and that therefore the study provides useful conclusions. In this analysis the robust methodology has helped in highlighting a peculiar feature of the data that could lead to disastrous conclusions if it remains undetected.
6.7 Discussion and Extensions

At the time of writing, only the Bernoulli family has been implemented for robust estimation and inference with GEE. Note, however, that the theory presented in this chapter is general and covers all GLM distributions. The difficulty arising in practice is the computation of the correction term ci in (6.8). This difficulty can be circumvented by computing the correction term by simulation; this is currently work in progress. As mentioned in Section 5.7.3, Cantoni et al. (2005) develop a criterion, called GCp, inspired by Mallows's Cp, for general model comparisons. It is given in (5.29) and the general form for an unbiased estimator of GCp is given in (5.30). The particular form of GCp for a Mallows quasi-likelihood estimator as defined by (6.8) is given in Cantoni et al. (2005), whose extensive simulation study shows that GCp is very effective in handling contaminated data.
7 Survival Analysis

7.1 Introduction

Survival analysis is central to biostatistics, and modeling such data is an important part of the work carried out daily by statisticians working with clinicians and medical researchers. Basically, survival data analysis is necessary each time a survival time, or a time to a specific event (failure) such as organ dysfunction, disease progression or relapse, is the outcome of interest. Such data are often censored, as not all subjects enrolled in the study experience the event. When investigators are interested in testing the effect of a particular treatment on failure time, the default method of analysis is the log-rank test, usually supplemented by Kaplan–Meier survival estimates. The log-rank test is, by definition, based on ranks and therefore offers some degree of protection against outliers. Criticisms have been raised (Kim and Bae, 2005), but the test is not as sensitive as most of the standard testing procedures in other models. When the outcome has to be explained by a set of predictors, the standard approach is the Cox (1972) proportional hazard model. Cox regression is appealing owing to its flexibility in modeling the instantaneous risk of failure (e.g. death), or hazard, even in the presence of censored observations. This interest in the Cox model goes well beyond the world of medicine and biostatistics. Applications in biology, engineering, psychology, reliability theory, insurance and so forth can easily be found in the literature. Its uniqueness also stems from the fact that it is not, strictly speaking, based on maximum likelihood theory but on the concept of partial likelihood. This notion was introduced by Cox in his original paper to estimate the parameters of interest in a semi-parametric formulation of the instantaneous risk of failure at a specific time point, given that such an event has not occurred so far. Over the years many papers dealing with various misspecifications in the Cox model have been published.
Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser. © 2009 John Wiley & Sons, Ltd

Diagnostic techniques have also flourished, boosted by the ever-growing number of applications related to that model; see, for instance,
Chen and Wang (1991), Nardi and Schemper (1999), Therneau and Grambsch (2000), Collett (2003b) and Wang et al. (2006) for a review. In the 1980s, researchers were typically interested in whether a consistent estimate of a treatment effect could be obtained when omitting a covariate. Work by Gail et al. (1984), Bretagnolle and Huber-Carol (1988) and others shows, both theoretically and through simulations, that if important predictors are omitted from the Cox model, a small bias results in the estimate. Later, Lin and Wei (1989) propose a sandwich formula for the treatment effect estimator's variance, which they call 'robust' in that significance testing of the treatment effect has approximately the desired level, even if important predictors are omitted from the model. They also claim that their variance estimator can cope with possible misspecifications of the hazard function. As argued in Section 1.2, this type of robustness is different from that discussed in this book. Robustness methods in the modern sense of the word have been relatively slow to emerge in survival analysis, hindered by the presence of censoring, which is unaccounted for by the general robustness theory. Regarding the Cox model, another complication stems from its semi-parametric nature. This is in essence different from the fully parametric setting discussed at length in the previous chapters. In the early 1990s, researchers such as Hjort (1992) and Schemper (1992) started to tackle the problem, but the first real attempts to robustify Cox's partial likelihood appeared in Bednarski (1993), Sasieni (1993b,a) and Minder and Bednarski (1996). A complex methodology is generally required to cope with censoring. Bednarski's work is based on a doubly weighted partial likelihood and extends the IF approach presented in Chapter 2. Later, Grzegorek (1993) and Bednarski (1999, 2007) refined their weighting estimation technique to make it adaptive and invariant to time-transformation.
A good account of how outliers affect the estimation process in practical terms, with an illustration on clinical data, is given in Valsecchi et al. (1996). A comparison of Bednarski's approach and the work by Sasieni and colleagues is carried out in Bednarski and Nowak (2003). It essentially shows that none of these estimators clearly outperforms the others when problems in the response are the primary target. This technical literature focuses only on the estimation problem, prompting questions about the robustness of tests as defined in Chapter 2. Recent work by Heritier and Galbraith (2008) illustrates the current limitations of robust testing for this model and clarifies the link with the theory of Lin and Wei (1989). Independently of all of these developments related to the Cox model, an innovative technique called regression quantiles appeared in the late 1970s that seemed totally unrelated to survival analysis. That pioneering work, due to Koenker and Bassett (1978), was introduced in the econometric literature as a robust alternative to linear regression. In this work, any percentile of a particular outcome (e.g. a survival time), or a transformation of it, can be regressed on a set of explanatory variables. This work in itself and many subsequent papers would not be sufficient to be mentioned in this chapter, if an extension to the censored case had not been proposed. Fortunately, such a method, called censored regression quantiles, now exists. The extension, due to Portnoy (2003), has great potential in practice. It is easily computable, inherits the robustness of the sample quantiles, and constitutes a viable alternative to the Cox model, especially when the proportional hazard assumption is not met.
This chapter is organized as follows. Cox's partial likelihood and the classical theory are reviewed in Section 7.2. The so-called robust sandwich formula of Lin and Wei (1989) and its link to the IF are presented. The lack of robustness properties of standard estimation and inferential procedures is motivated by the analysis of the myeloma data. A robust (adaptive) estimator based on the work of Bednarski and colleagues is presented and illustrated in Section 7.3. Issues related to robust testing in the Cox model and its current limitations are also discussed. A complete worked-out example using the well-known veterans' administration lung cancer data (see e.g. Kalbfleisch and Prentice, 1980) is described in Section 7.4. Other issues including model misspecifications are outlined in Section 7.5. Finally, Section 7.6 is devoted to censored regression quantiles. We first introduce quantile regression, discuss its extension to censored data and apply the method to the lung cancer dataset.
7.2 The Cox Model

7.2.1 The Partial Likelihood Approach

As mentioned earlier, the proportional hazard model introduced by Cox (1972) is probably the most commonly used model to describe the relationship between a set of covariates and survival time or another time-to-event, possibly censored. Let (ti, δi) be independent random variables recording the survival time and absence of censoring (δi = 1 if ti is observed, 0 otherwise) for a sample of n individuals. It is convenient to write ti = min(ti0, ci), where ti0 is the possibly unknown survival time and ci the censoring time. The ti0 are independent random variables from a cumulative distribution F(· | xi) with density f(· | xi), where xi is a q-dimensional vector of fixed covariates. For simplicity we consider the standard case where all time points are different and ordered, i.e. t1 < t2 < · · · < tn. We also assume that the censoring mechanism is non-informative. The Cox model relates the survival time t to the covariates x through the hazard function of F:¹

λ(t | x) = λ0(t) exp(x^T β),   (7.1)

where λ0(t) is the so-called baseline hazard, usually unspecified, and β the regression parameter vector.² In essence, λ(t | x) measures the instantaneous risk of death (or hazard rate) at time t for an individual with specific characteristics described by x, given that they have survived so far. The interesting feature of formulation (7.1) is that λ(t | x) is the product of a baseline hazard λ0(t) and an exponential term depending on the covariates. This has two major advantages. First, as we will see in Section 7.2.1, it is not necessary to know the baseline hazard λ0(t) to estimate the coefficients β. Second, we can derive immediately the effect of an increase of one unit in a particular covariate xj (e.g. the effect of an experimental treatment represented by a binary indicator: one for treatment, zero for placebo) on survival.

¹ The hazard function of a distribution function F with density f is f(t)/(1 − F(t)).
² Note that, by writing (7.1) on the log scale, log(λ0(t)) can be seen as the intercept term added to the linear predictor x^T β, so that dim(β) = q.
Indeed, such an increase translates into a constant relative change exp(βj) of the hazard λ(t). This quantity is the hazard ratio (HR) and is usually interpreted as the relative risk of death associated with an increment of 1 in that particular predictor. This property justifies the terminology proportional hazards model, commonly used for the Cox model. Model (7.1) encompasses two important parametric models, namely the exponential regression model, for which λ0(t) = λ, and the Weibull regression model, for which λ0(t) = λγt^(γ−1). However, in a fully parametric setting, the additional parameters λ and/or γ need to be estimated along with the slopes β for these models to be fully specified. In the proposal of Cox (1972), this is not necessary. Equation (7.1) can be expressed equivalently through the survival function S(t | x) = 1 − F(t | x) (see, for instance, Collett (2003b) and Therneau and Grambsch (2000)) given by

    S(t | x) = {S0(t)}^exp(xᵀβ),    (7.2)

where S0(t) is defined through

    −log(S0(t)) = ∫₀ᵗ λ0(u) du = Λ0(t),    (7.3)

and Λ0(t) is the baseline cumulative hazard obtained by integrating λ0(u) between zero and t.
The usual estimate of β is the parameter value that maximizes the partial likelihood

    L(β) = ∏ᵢ₌₁ⁿ [ exp(xᵢᵀβ) / Σ_{j≥i} exp(xⱼᵀβ) ]^δᵢ,    (7.4)

or equivalently the solution of the first-order equation

    Σᵢ₌₁ⁿ δᵢ [ xᵢ − S⁽¹⁾(tᵢ; β)/S⁽⁰⁾(tᵢ; β) ] = 0,    (7.5)

where

    S⁽⁰⁾(tᵢ; β) = Σ_{j≥i} exp(xⱼᵀβ)  and  S⁽¹⁾(tᵢ; β) = Σ_{j≥i} exp(xⱼᵀβ) xⱼ,

as in Minder and Bednarski (1996) and Lin and Wei (1989).³ The solution of (7.5) is the partial likelihood estimator (PLE), also denoted by β̂[PLE]. Equation (7.5) is a simple rewriting of a more conventional presentation as in, for instance, Collett (2003b, Chapter 3). There, the risk set R(tᵢ) at time tᵢ is used, i.e. the set of all patients who have not yet experienced the event by time tᵢ and are thus still 'at risk' of dying. It is formed of all observations with indices greater than or equal to i.

³ The idea is to base the likelihood function on the probability for a subject to experience the event at time tᵢ. This is given by the ratio of the hazard at time tᵢ of subject i over the hazard of all subjects who have not yet experienced the event by time tᵢ, i.e. the set j ≥ i (also called the risk set). In this ratio, the baseline hazard cancels out and one obtains the expression given in (7.4).
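For concreteness, the partial likelihood (7.4) and the score (7.5) can be evaluated directly in the single-covariate case. The following pure-Python sketch is illustrative only (it is not the code behind the book's analyses) and assumes the observations are sorted by survival time with no ties; it also lets one check numerically that the score is the derivative of the log partial likelihood.

```python
import math

def cox_loglik_and_score(beta, delta, x):
    """Log partial likelihood (7.4) and score (7.5) for one covariate.
    Observations are assumed sorted by survival time, no ties; delta[i]
    is the event indicator and x[i] the covariate value of subject i."""
    n = len(x)
    loglik, score = 0.0, 0.0
    for i in range(n):
        if not delta[i]:
            continue                                               # censored: no contribution
        s0 = sum(math.exp(beta * x[j]) for j in range(i, n))       # S^(0)(t_i; beta)
        s1 = sum(math.exp(beta * x[j]) * x[j] for j in range(i, n))  # S^(1)(t_i; beta)
        loglik += beta * x[i] - math.log(s0)
        score += x[i] - s1 / s0
    return loglik, score
```

The score returned here is exactly the derivative of the returned log likelihood with respect to β, which can be verified by a finite-difference check.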
The purpose of writing (7.5) in that way is to stress the similarity with the definition of an M-estimator. Indeed, let

    Uᵢ = U(tᵢ, δᵢ, xᵢ; β) = δᵢ [ xᵢ − S⁽¹⁾(tᵢ; β)/S⁽⁰⁾(tᵢ; β) ]

be the individual contribution (or score); then, at least literally, (7.5) looks like an M-estimator with ψ-function U(t, δ, x; β). The main difference is that the scores Uᵢ are no longer independent, since the two sums S⁽⁰⁾(tᵢ; β) and S⁽¹⁾(tᵢ; β) depend on subsequent time points tⱼ for j ≥ i. We have assumed, for simplicity, that all observed time points in the sample are different. The partial likelihood approach is generally modified to handle ties. We refer the reader to Therneau and Grambsch (2000) and Collett (2003b) for a more general introduction, and to Kalbfleisch and Prentice (1980) for technical details.
Under some regularity conditions, the PLE is asymptotically normally distributed with asymptotic variance V = I(β)⁻¹, where I(β) is the information matrix for that model (see Kalbfleisch and Prentice (1980, Chapter 4) for details). Here I(β) is usually estimated by minus the second derivative of the average log partial likelihood, i.e.

    Î(β) = −(1/n) Σᵢ ∂Uᵢ/∂β.    (7.6)

Numerical values are obtained by replacing β by β̂[PLE] in (7.6). An alternative formula for the variance of β̂[PLE] will be given in Section 7.2.4. The asymptotic distribution can then be used for testing a single hypothesis H0 : βj = 0 in a standard way. One just defines a z-statistic as

    z-statistic = β̂[PLE]j / SE(β̂[PLE]j),    (7.7)

where

    SE(β̂[PLE]j) = √( n⁻¹ [Î(β̂[PLE])⁻¹]jj )    (7.8)

is the (estimated) standard error of β̂[PLE]j, i.e. the square root of the jth diagonal element of the inverse of (7.6), divided by n. Here z is compared with a standard normal distribution.
More generally, standard asymptotic tests such as the LRT, score and Wald tests are available to test a composite null hypothesis of the type H0 : β(2) = β⁰(2), with β(1) unspecified and β = (β(1)ᵀ, β(2)ᵀ)ᵀ. Specifically, the LRT is equal to twice the difference in the maximum log partial likelihood obtained at the full and reduced models, i.e.

    LRT = 2(log(L(β̂[PLE])) − log(L(β̇[PLE]))),    (7.9)

where β̂[PLE] denotes the PLE at the full model and β̇[PLE] its value under the null hypothesis (at the reduced model). The Wald test is based on β̂[PLE](2), the second component of the PLE in the full model, i.e.

    W = n (β̂[PLE](2) − β⁰(2))ᵀ V̂(22)⁻¹ (β̂[PLE](2) − β⁰(2)),    (7.10)
where V̂(22) is the (2,2) block of the estimated asymptotic variance V̂ = Î⁻¹(β̂[PLE]). For the score test, the general approach outlined in Section 2.5.1 can be extended to the Cox model. Under H0, all three tests are asymptotically distributed as a χ²ₖ distribution, where k = dim(β(2)). They are generally provided by all common statistical packages.
7.2.2 Empirical Influence Function for the PLE

The IF for the PLE is based on complex functionals taking into account the semiparametric nature of the Cox model and the presence of censoring. We give here its empirical version ÎFᵢ evaluated at the observation (tᵢ, δᵢ, xᵢ), as originally derived by Reid and Crépeau (1985). Here ÎFᵢ can be used as a diagnostic tool to assess the effect of a particular observation on the PLE. It is a q-dimensional vector proportional to a shifted score, i.e.

    ÎFᵢ = Î⁻¹(β)(Uᵢ − Cᵢ(β)),    (7.11)
where Cᵢ(β) (or, to be more specific, C(tᵢ, δᵢ, xᵢ; β)) is a term depending on the observations in a complicated way; see Section 7.2.4. This 'shift' is not needed for consistency but to account for the dependence across the individual scores Uᵢ. As noted by Reid and Crépeau (1985), the first component in ÎFᵢ is similar to the usual IF for M-estimators in the uncensored case, and the second component −Î⁻¹(β)Cᵢ(β) represents the influence of the ith observation on the risk set of other subjects. A similar expression with a two-part IF is generally found for estimators for censored data. The first term is unbounded in xᵢ, which means that spurious observations in the covariates can ruin the PLE. The second term shows the same deficiency and, as a function of the tᵢ only, can be large enough to compromise the estimation process. It captures the influence of a particular observation (e.g. an abnormal long-term survivor) on the risk set of the other subjects. Valsecchi et al. (1996) give a good explanation of the mechanism at work. Abnormal long-term survivors 'exert influence in two ways. First, that individual forms part of very many risk sets (for all preceding failures). Secondly, whereas early failures will be matched to a large risk set, individuals failing toward the end of the study may, depending on the censoring, be matched to a very small risk set. Two groups may be initially of similar size but as time progresses the relative size of the two groups may steadily change as individuals in the high risk group die at a faster rate than those in the other group. Eventually the risk set may be highly imbalanced with just one or two individuals from the high risk group, so that removal of one such individual will greatly affect the hazard ratio.' Atypical long-term survivors are not the only type of abnormal response that can be encountered, but they are by far the most dangerous. Another possibility occurs when a low-risk individual fails early.
As pointed out by Sasieni (1993a), this type of outlier is less harmful, as their early disappearance from the risk set reduces their contribution to the score equation. Despite its relative complexity, the IF for the PLE has properties similar to those given for the M-estimators of Chapter 2. It still measures the worst asymptotic bias caused
to the estimator by some infinitesimal contamination in the neighborhood of the Cox model. It is therefore desirable to find estimators which bound that influence in some way. The two-part structure of (7.11) rules out a similar weighting to that used earlier in a fully parametric model. Innovative ways have to be imagined to control both components, in particular the influence on the risk set. This will be developed further in Section 7.2.4 in relation to the asymptotic variance.
7.2.3 Myeloma Data Example

Krall et al. (1975) discuss the survival of 65 multiple myeloma patients and its association with 16 potential predictors, all listed in their Table 2. They originally selected the logarithms of blood urea nitrogen (bun), serum calcium at diagnosis (calc) and hemoglobin (hgb) as significant covariates. Chen and Wang (1991) used their diagnostic plot and found that case 40 is an influential observation. They also concluded that no log-transformation of the three predictors was necessary. We also use the data without transformation to illustrate the IF approach.
Table 7.1 presents the most influential data points as detected by the change Δᵢβ̂ in the regression coefficient β̂[PLE] when the ith observation is deleted. Figures are given as percentages to make changes comparable across coefficients: percentages were simply obtained by dividing the raw change by the absolute value of the corresponding estimate obtained on all data points. The deletion of any of the remaining observations did not change the coefficients by more than ±11% for these two variables, and even less for bun. These values can be seen as a handy approximation of the IF itself, as ÎFᵢ ≈ (n − 1)Δᵢβ̂, as pointed out by Reid and Crépeau (1985). This result is generally true for all models but is particularly useful here, where the IF has a complicated expression. Clearly case 40 is influential, confirming the analysis by Chen and Wang (1991). Other observations might also be suspicious, e.g. cases 3 and 48. A careful look at all exact values of the IF (not shown here) shows that the approximation works reasonably well, justifying the use of Δᵢβ̂ as a proxy for ÎFᵢ. A word of caution must be added here. The empirical IF in (7.11) is typically computed at the PLE, itself potentially biased. This can cloud its ability to detect outliers, as pointed out by Wang et al. (2006).
However, extreme observations are generally correctly identified by this simple diagnostic technique. To illustrate how they can distort the estimation and testing procedures, we deleted case 40 and refitted the data. PLE estimates, standard errors and p-values for significance testing (z-statistic in (7.7)) are displayed in Table 7.2. Case 40 is actually a patient with high levels of serum calcium who survived much longer than similar patients. For that reason this subject tends on his own to determine the fit, an undesirable feature, as the aim of the analysis is to identify associations that hold for the majority of the subjects. When all observations are included in the analysis, calcium is not significant (p = 0.089). After removal of case 40, a highly significant increase in the risk of death of exp(0.31) − 1.0 = 0.36, 95% CI (0.1; 0.7), per additional unit of serum calcium appears. This clearly illustrates the dramatic effect of case 40 on the test. The differences are even more pronounced if both cases 40 and 48 are removed, making the need for a robust analysis even greater. However, as the dataset is relatively
Table 7.1 Diagnostics Δᵢβ̂ for myeloma data.

Case    hgb     calc
40      +17%    −48%
48        0%    −16%
44      +13%    +12%
3        −1%    +24%
2        +2%    +35%

The regression coefficients are estimated by means of the PLE β̂[PLE].
Table 7.2 PLE estimates and standard errors for the myeloma data.

               All data                   Case 40 removed
Variable   Estimate (SE)    p-value   Estimate (SE)    p-value
bun         0.02 (0.005)     0.000     0.02 (0.005)     0.000
hgb        −0.14 (0.059)     0.019    −0.19 (0.063)     0.003
calc        0.17 (0.099)     0.089     0.31 (0.112)     0.006

Ties treated by Efron's method. Model-based SEs computed using (7.8).
small (n = 65), influential observations are more harmful, and case deletion followed by refitting becomes a delicate exercise. We do not pretend to give a definitive analysis of these data here. The purpose was simply to illustrate the sensitivity of the PLE to unexpected perturbations, especially for small to moderate sample sizes.
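To make the hazard-ratio arithmetic concrete, the quoted increase in risk and its confidence interval can be reproduced from the estimate and standard error reported in Table 7.2 for calc with case 40 removed (0.31 and 0.112). This is a minimal sketch assuming a normal-approximation 95% interval on the log-hazard scale:

```python
import math

# Estimate and SE for calc after removal of case 40 (Table 7.2)
beta_hat, se = 0.31, 0.112

hr = math.exp(beta_hat)                       # hazard ratio per unit of serum calcium
increase = hr - 1.0                           # relative increase in risk of death
lo = math.exp(beta_hat - 1.96 * se) - 1.0     # lower 95% limit, normal approximation
hi = math.exp(beta_hat + 1.96 * se) - 1.0     # upper 95% limit

print(round(increase, 2), round(lo, 1), round(hi, 1))   # 0.36 0.1 0.7
```

The rounded output matches the increase of 0.36 with 95% CI (0.1; 0.7) quoted in the text.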
7.2.4 A Sandwich Formula for the Asymptotic Variance

A different estimate for the asymptotic variance of the PLE has been proposed by Lin and Wei (1989). It is often called the 'robust' variance in common statistical packages but, as we argue below, it is not robust in the sense used in this book. We therefore name it the LW formula or classical sandwich formula. Perhaps the best way to introduce the LW formula is through its link to the IF, something that is generally overlooked. A careful reading of Reid and Crépeau (1985, p. 3) shows that n⁻¹ Σᵢ ÎFᵢ ÎFᵢᵀ, where ÎFᵢ is as in (7.11), provides another asymptotic variance estimate for the PLE. Elementary algebra shows that this can be rewritten as

    V̂LW(β) = Î⁻¹(β) Ĵ(β) Î⁻¹(β),    (7.12)

where Î(β) is the information matrix estimator given in (7.6) and

    Ĵ(β) = (1/n) Σᵢ Uᵢ* Uᵢ*ᵀ,    (7.13)
Table 7.3 Estimates and standard errors for the PLE for the myeloma data.

               All data                    Case 40 removed
Variable   Estimate (SELW)   p-value   Estimate (SELW)   p-value
bun         0.02 (0.004)      0.000     0.02 (0.004)      0.000
hgb        −0.14 (0.059)      0.019    −0.19 (0.060)      0.002
calc        0.17 (0.127)      0.186     0.31 (0.103)      0.003

Ties treated by Efron's method. SE computed using (7.12).
where Uᵢ* = Uᵢ − Cᵢ(β) is a shifted score. If we write down the correcting factor (shift)

    Cᵢ(β) = exp(xᵢᵀβ) xᵢ Σ_{j≤i} δⱼ / S⁽⁰⁾(tⱼ; β) − exp(xᵢᵀβ) Σ_{j≤i} δⱼ S⁽¹⁾(tⱼ; β) / [S⁽⁰⁾(tⱼ; β)]²

and replace β by the PLE in (7.12), we obtain the variance estimate proposed by Lin and Wei (1989, p. 1074). Lin and Wei's derivation is actually more general, as it also covers the case of time-dependent covariates. Although the formula presented here assumes n different time points, its extension to data with ties is straightforward (see Lin and Wei (1989) and Reid and Crépeau (1985) for technical details).
As an illustration, we refitted the myeloma data using exactly the same model as before but with (7.12) as the variance estimate. PLE estimates, standard errors and p-values are displayed in Table 7.3. Note that the coefficients reported in Table 7.3 are the same as those reported in Table 7.2, since the estimation procedure is still the PLE. On the other hand, the standard errors differ, as they are now based on the LW formula. The p-values reported here refer to the individual significance z-tests, i.e. for H0 : βj = 0,

    zLW-statistic = β̂[PLE]j / SELW(β̂[PLE]j),    (7.14)

where SELW(β̂[PLE]j) is the standard error of β̂[PLE]j based on the LW formula (7.12), i.e. the square root of n⁻¹[V̂LW(β̂[PLE])]jj.
Results are very similar to those obtained in Table 7.2. It is clear that case 40 is influential even if the LW formula is used. In other words, the LW formula offers no kind of protection against extreme (influential) observations. For example, no effect of calcium appears when all data are fitted (p-value = 0.186), and after removal of case 40 the deleterious effect of this observation on the significance of serum calcium seems obvious, as a p-value of 0.003 is reported. So we may legitimately ask the question 'What is the LW formula robust against?'. Lin and Wei (1989) motivate their approach by mentioning some structural misspecifications, in particular covariate omission. As an example, they consider a randomized clinical trial in which the effectiveness of a particular treatment on survival time is assessed. The true model is thought to be the Cox model with
parameter β. We can split β into two parts, ν and η, where these components represent, respectively, the treatment parameters and the covariate effects. A valid test of no treatment effect is sought even if some of the predictors may be missing in the working model. Alternatively, investigators may simply prefer an unadjusted analysis for generalizability purposes, in which case only ν will be included in the analysis. Lin and Wei (1989) showed that approximately valid inference can still be achieved using their formula. This, of course, assumes that no treatment-by-covariate interaction exists. To test the null hypothesis of no treatment effect (i.e. H0 : ν = 0), one then uses (7.14). Lin and Wei (1989) also considered more serious departures from the Cox model, e.g. misspecification of the hazard form. This includes models with the hazard defined on the log scale or even a multiplicative model. Their simulation study shows that their approach allows approximately valid inference in the sense that the type I error (empirical level) of the Wald test using the LW formula (7.12) is close to the nominal level. The term 'robust' formula is hence used in that sense. This type of robustness, however, does not protect against biases induced by extreme (influential) observations. The reader is referred to Section 7.5 for further discussion of this topic in a more general setting.
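The quantities entering the score equation (7.5), the model-based standard error (7.8) and the LW sandwich (7.12)–(7.14) are easy to compute directly. The following pure-Python sketch is illustrative only (it is not the book's or any package's implementation): for a single covariate with sorted, tie-free data, it fits the PLE by Newton–Raphson and returns both the model-based and the LW standard errors.

```python
import math

def cox_scalar(times, delta, x):
    """PLE for one covariate via Newton-Raphson on the score (7.5),
    with model-based SE (7.8) and Lin-Wei sandwich SE (7.12)-(7.14).
    Assumes observations sorted by time with no ties."""
    n = len(times)

    def risk_sums(beta):
        # S^(0), S^(1) as in the text, plus S^(2) = sum_{j>=i} exp(x_j b) x_j^2
        s0, s1, s2 = [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):
            for j in range(i, n):
                w = math.exp(beta * x[j])
                s0[i] += w
                s1[i] += w * x[j]
                s2[i] += w * x[j] * x[j]
        return s0, s1, s2

    beta = 0.0
    for _ in range(50):
        s0, s1, s2 = risk_sums(beta)
        score = sum(delta[i] * (x[i] - s1[i] / s0[i]) for i in range(n))
        info = sum(delta[i] * (s2[i] / s0[i] - (s1[i] / s0[i]) ** 2)
                   for i in range(n))          # minus the derivative of the score
        beta += score / info                   # Newton step
        if abs(score) < 1e-10:
            break

    s0, s1, s2 = risk_sums(beta)
    info = sum(delta[i] * (s2[i] / s0[i] - (s1[i] / s0[i]) ** 2) for i in range(n))
    se_model = math.sqrt(1.0 / info)           # (7.8) for a scalar parameter

    # shifted scores U_i* = U_i - C_i(beta), with C_i as in Section 7.2.4
    u_star = []
    for i in range(n):
        u = delta[i] * (x[i] - s1[i] / s0[i])
        c = math.exp(beta * x[i]) * sum(
            delta[j] * (x[i] / s0[j] - s1[j] / s0[j] ** 2) for j in range(i + 1))
        u_star.append(u - c)
    se_lw = math.sqrt(sum(us * us for us in u_star)) / info   # LW sandwich SE
    return beta, se_model, se_lw
```

A useful internal check: the shifts Cᵢ sum to zero over the sample, so at the fitted β the shifted scores, like the raw scores, sum to (numerically) zero.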
7.3 Robust Estimation and Inference in the Cox Model

7.3.1 A Robust Alternative to the PLE

The robust alternative to the PLE we present here has emerged over the years from Bednarski's research. It is based on a doubly weighted PLE that astutely modifies the estimating equation (7.5) without fundamentally changing its structure. It also has the advantage of being easily computable, with code available in the R Coxrobust package. Following Bednarski (1993) and Minder and Bednarski (1996), we assume that a smooth weight function w(t, x) is available. Denote by wᵢⱼ = w(tᵢ, xⱼ) and wᵢ = wᵢᵢ = w(tᵢ, xᵢ) the weights for all 1 ≤ i ≤ j ≤ n, and set all other weights to zero by construction. Define the two sums

    Sw⁽⁰⁾(tᵢ; β) = Σ_{j≥i} wᵢⱼ exp(xⱼᵀβ)    (7.15)

and

    Sw⁽¹⁾(tᵢ; β) = Σ_{j≥i} wᵢⱼ exp(xⱼᵀβ) xⱼ    (7.16)

in a similar way to their unweighted counterparts of Section 7.2. A natural extension of the PLE is the solution for β of

    Σᵢ₌₁ⁿ wᵢ δᵢ [ xᵢ − Sw⁽¹⁾(tᵢ; β)/Sw⁽⁰⁾(tᵢ; β) ] = 0.    (7.17)

The weight function w(t, x) enters at two points: (i) in the main sum, with wᵢ downweighting the uncensored observations; (ii) in the inner sums Sw⁽⁰⁾ and Sw⁽¹⁾, with
all the wᵢⱼ for i ≤ j ≤ n. Equation (7.17) clearly has a similar structure to (7.5). Moreover, when all of the weights are chosen equal to one, the solution of (7.17) is the PLE, so that (7.17) can literally be seen as an extension of equation (7.5). By analogy with the notation of Section 7.2, we also denote by Uw,i the individual score, i.e. the contribution of the ith observation to the sum in (7.17),

    Uw,i = wᵢ δᵢ [ xᵢ − Sw⁽¹⁾(tᵢ; β)/Sw⁽⁰⁾(tᵢ; β) ],    (7.18)

and by Uw the total score, or left-hand side of (7.17). A proper choice of w(t, x) is pivotal to make the solution of (7.17) both consistent and robust. The weights we consider here truncate large values of g(t) exp(xᵀβ), where g(t) is an increasing function of time.⁴ Indeed, Bednarski (1993) and Minder and Bednarski (1996), considering the exponential model, argued that the PLE often fails when tᵢ exp(xᵢᵀβ) is too large. They hence proposed weight functions based on truncations of such quantities (i.e. with g(t) = t). Bednarski (1999), however, pointed out that a better choice for g(t) is the baseline cumulative hazard Λ0(t) in (7.3). The rationale for this is that, given the covariate vector xᵢ, Λ0(tᵢ) exp(xᵢᵀβ) has a unit exponential distribution if the Cox model holds and tᵢ is not censored. This gives rise to the following weights:

    w(t, x) = K − min(K, Λ0(t) exp(xᵀβ))          (linear),
    w(t, x) = exp(−Λ0(t) exp(xᵀβ)/K)              (exponential),
    w(t, x) = max(0, K − Λ0(t) exp(xᵀβ))² / K²    (quadratic),

where K is a known cut-off value that can be chosen on robustness and efficiency grounds. Such weights have been used successfully ever since and are now implemented in the R Coxrobust package. In practice, two additional difficulties occur: first, the truncation value K is generally difficult to specify a priori, especially for censored data;⁵ second, the unknown cumulative baseline hazard Λ0(t) is needed to compute the weights. This hazard is not often estimated in the Cox model, as it is not actually needed to obtain the PLE and related tests. To overcome the first problem, Bednarski and colleagues proposed the use of an adaptive procedure that adjusts K at each step. They deal with the second problem by jointly (and robustly) estimating Λ0(t) and β; see Grzegorek (1993) and Bednarski (1999, 2007).
To compute a robust estimator defined through (7.17) with one of the proposed weighting schemes updated adaptively, one can use the following algorithm. Given a specific quantile value τ, e.g. τ = 90%, used to derive the truncation value adaptively, one proceeds through the following steps.

• Initialization: obtain an initial estimator β̂⁰, e.g. the PLE; compute the cut-off K as the pre-specified quantile τ of the empirical distribution of tᵢ exp(xᵢᵀβ̂⁰), i = 1, . . . , n; set the current estimate b to β̂⁰ and initialize the set of weights.

• Take the current estimate b and evaluate K as the same quantile τ of the empirical distribution of Λ̂w(tᵢ) exp(xᵢᵀb), with

    Λ̂w(t) = Σ_{tᵢ≤t} wᵢδᵢ / [ Σ_{j≥i} wᵢⱼ exp(xⱼᵀb) ].    (7.19)

• Update b by solving (7.17) and then recompute the set of weights.

• Repeat the previous two steps until convergence.

Technical details about the adaptive process and formula (7.19) are omitted for simplicity but can be found in Bednarski (2007). Note, though, that Λ̂w(t) is a robust adaptation of the Breslow estimator.⁶ The final value obtained through this algorithm is the adaptive robust estimator (ARE), β̂[ARE]. It can generally be obtained within a few iterations, even for relatively large datasets. An advantage of this adaptive weighting scheme based on the cumulative hazard estimate (7.19) is that the ARE is invariant with respect to time transformations. It can also better cope with censored data through the way the cut-off value is updated. The price to pay for this flexibility is purely computational. Bednarski (1999) shows that the ARE has the same asymptotic distribution as its 'fixed-weight' counterpart defined in (7.17) and performs similarly in terms of robustness. The issue of the choice of weight function or quantile τ is more a matter of efficiency and/or personal preference. This question is discussed in the next section. Finally, it should be stressed that other possible weights have been proposed by Sasieni (1993a,b). Although the spirit of his approach is essentially the same, the proposed weights cannot handle abnormal responses for patients with extreme values in the covariates, such as elevated blood cell counts or laboratory readings. Such extreme but still plausible data points are harmful to classical procedures. In contrast, the ARE is built to offer some protection in that case. A more formal treatment of these alternative weighting schemes can be found in Bednarski and Nowak (2003), along with a comparison with the ARE.

⁴ Note that the notation above did not mention any dependence of w(t, x) on the regression parameter β; the weights should rather be seen as 'fixed'. However, Bednarski (1999) showed that, under stringent conditions, the dependence on β does not modify the asymptotic distribution of the resulting estimator.
⁵ One could argue that a quantile of the unit exponential distribution could be used, at least in the absence of censoring.
7.3.2 Asymptotic Normality

Under regularity conditions on w(t, x) given in Bednarski (1993, 1999), the ARE is consistent and has the following asymptotic distribution

    √n (β̂[ARE] − β) → N(0, Vw(β)),    (7.20)

where the asymptotic variance is given by a sandwich formula⁷

    Vw(β) = Mw⁻¹ Qw Mw⁻ᵀ.    (7.21)

⁶ Formula (7.19) gives back the Breslow estimator of the baseline cumulative hazard when all weights are equal to one and thus b is the PLE; see, for instance, Collett (2003b, p. 101).
⁷ Again, to obtain the variance of β̂[ARE], one needs to divide (7.21) by n.
The matrices Mw = Mw(β) and Qw = Qw(β) are complicated expectations that we omit for simplicity; see Bednarski (1993) for details. Tedious but straightforward calculations show that their empirical versions have a much simpler form, i.e.

    M̂w(β) = −(1/n) Σᵢ₌₁ⁿ ∂Uw,i/∂β    (7.22)

and

    Q̂w(β) = (1/n) Σᵢ₌₁ⁿ U*w,i U*w,iᵀ,    (7.23)

with Uw,i given in (7.18) and U*w,i = Uw,i − Cw,i(β) a shifted weighted score, with shift given in (7.24). A final estimate for the asymptotic variance follows easily by replacing β by β̂[ARE] in (7.21)–(7.23). The asymptotic distribution (7.20) is valid not only under the assumption that the weights are fixed, i.e. K and g(t) are pre-specified independently of the regression parameters,⁸ but also for the adaptive weighting scheme described above; see Bednarski (1993, 1999, 2007) for details. Technical developments essentially show that the asymptotic result is not altered when smooth adaptive weight functions with bounded support and a robust hazard estimate are used.
There is a clear link between the LW sandwich formula (7.12) and the asymptotic variance (7.21). The first thing to note is that, for both the classical and robust estimators, the sandwich formula stems from the same property, i.e. from the fact that n⁻¹ Σᵢ ÎFᵢ ÎFᵢᵀ is another consistent variance estimator. We have seen this for the PLE in Section 7.2.2, and the same has been shown by Bednarski (1993, 1999) for the ARE. Second, tedious rewriting shows that the empirical IF for the ARE is again proportional to a shifted score, as follows:

    ÎFw,i = M̂w⁻¹(β) U*w,i(β) = M̂w⁻¹(β)(Uw,i(β) − Cw,i(β)),

with Uw,i(β) given in (7.18) and where the shift has an 'ugly' but computable expression given by

    Cw,i(β) = exp(xᵢᵀβ) xᵢ Σ_{j≤i} wⱼ δⱼ wⱼᵢ / Sw⁽⁰⁾(tⱼ; β) − exp(xᵢᵀβ) Σ_{j≤i} wⱼ wⱼᵢ δⱼ Sw⁽¹⁾(tⱼ; β) / [Sw⁽⁰⁾(tⱼ; β)]².    (7.24)

A careful look at all of the quantities involved in (7.18) and (7.24), and in the equations of Section 7.2.4, shows that, if all of the weights wᵢⱼ and wᵢ are equal to one, then not only is the ARE identical to the PLE but their IFs are the same, and consequently the asymptotic variance reduces to the LW formula (7.12). The robust approach proposed in this chapter can literally be seen as an extension of the PLE combined with its LW sandwich variance. In practice this never happens, as the weights cannot all be set to one if one wants the ARE to be robust. The analogy is nevertheless useful, as it helps in understanding both formulas and their properties.

⁸ The function g(t) is discussed on page 201.
The use of AREs is desirable from a robustness standpoint. However, an expected loss of efficiency with respect to the PLE is observed when the Cox model assumptions hold. Unlike in simpler models, e.g. linear regression or mixed models, it is impossible to calibrate the tuning constant (i.e. the quantile τ) to achieve a specific efficiency at the model for all designs. However, some hints can be given. First, it is clear that, by construction, the linear adaptive weights of Section 7.3 automatically set a certain percentage of the weights wᵢ to zero. If we choose τ = 90%, then roughly 10% of the weights will be zero even though the data arise from a proportional hazards model. This automatically generates a loss of efficiency at the model, as genuine observations are ignored. A similar argument holds for the quadratic weights. In contrast, the exponential weighting scheme of Section 7.3 is smoother, and the ARE with exponential weights generally performs better in terms of efficiency. Second, simulations can provide valuable information. Our (limited) experience with the exponential distribution indicates that adaptive exponential weights do reasonably well in terms of robustness and efficiency when τ is chosen in the range 80–90%. An asymptotic efficiency relative to the PLE of at least 90% can easily be obtained even in the presence of a small amount of censoring, while linear weights achieve an efficiency of at most 80–90%. However, both weighting schemes perform equally well from a robustness point of view. For these reasons we tend to prefer exponential weights, with τ in the range 80–90%. Previous references by Bednarski and colleagues also used similar values of τ successfully. We would not recommend the use of much smaller quantiles. Finally, a choice of τ for a given weighting scheme could in principle be computed by simulation to achieve a predetermined loss of efficiency at a parametric model (i.e. a parametric form for the hazard).
For that purpose, one would need an idea of the censoring level, a rough idea of the true parameter values, and a distribution for the covariates or conditioning.

where nᵢ is the number of subjects still 'at risk' just prior to tᵢ and dᵢ the number of deaths at time tᵢ; see Kalbfleisch and Prentice (1980, p. 12) or Collett (2003b, p. 20). Model-based survival curves are obtained as follows. First, note that by combining (7.2) and (7.3) we obtain the usual (but rarely used) expression of S(t | x) as a function of β and the cumulative baseline hazard Λ0(t), i.e.

    S(t | x) = exp(−Λ0(t) exp(xᵀβ)).    (7.27)
Second, an overall survival curve estimate can be computed simply by averaging over the sample the predictions of the individual survival functions S(t | xᵢ) for t = tⱼ, j = 1, . . . , n. For the ARE, the ith patient's survival prediction is obtained by replacing, in formula (7.27), the true cumulative baseline hazard by its estimate (7.19) and the linear predictor by xᵢᵀβ̂[ARE]. The same can be done for the PLE by using the corresponding classical estimates. The comparison between Ŝ[KM](t) and its Cox-based counterparts proceeds by plotting their 'standardized' differences versus the logarithm of the survival time, possibly by categories (e.g. quartiles) of the linear predictor xᵀβ̂[PLE]. For the standardization factor, we follow Minder and Bednarski (1996) and use the square root of Ŝ[KM](t)(1 − Ŝ[KM](t)). Figure 7.5 displays the standardized difference per tertile of the linear predictor xᵀβ̂[PLE]. The horizontal lines represent plus or minus twice the standard error of the Kaplan–Meier estimate obtained through the Greenwood formula (see Collett, 2003b, pp. 24–25) to take into account the sample variability, at least approximately. A good agreement between the Kaplan–Meier and ARE survival curves can be observed for all panels. In contrast, some discrepancy appears when the PLE is used to fit the Cox model, in particular in panels (a) and (c). This lack of fit disappears after deletion of the extreme observations identified earlier and repetition of the procedure (figures not shown). This is a compelling argument in favor of the robust fit, assuming that the model is structurally correct. Other plots can also be found in Minder and Bednarski (1996) and Bednarski (1999). Note as well that separate plots for each treatment arm could also be drawn, but this is not done here as the experimental treatment was found to be ineffective.

¹¹ In the presence of ties, formula (7.26) still applies by replacing the tᵢ, i = 1, . . . , n, by the k < n distinct ordered survival times t₁ < t₂ < · · · < tₖ.
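The Kaplan–Meier estimate and its Greenwood standard error, used for the ±2 SE bands in Figure 7.5, are straightforward to compute. The following is a minimal illustrative sketch (pure Python, not the code used for the book's figures); tied times are handled by grouping distinct time points, as in the footnote above.

```python
def kaplan_meier(times, delta):
    """Kaplan-Meier estimate with Greenwood standard errors.
    Returns (distinct event times, S-hat at those times, SE at those times)."""
    data = sorted(zip(times, delta))
    n_at_risk = len(data)
    s, gw = 1.0, 0.0                    # survival estimate and Greenwood sum
    out_t, out_s, out_se = [], [], []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = at = 0
        while i < len(data) and data[i][0] == t:   # group tied times
            at += 1
            d += data[i][1]
            i += 1
        if d > 0:                                  # update only at event times
            s *= (n_at_risk - d) / n_at_risk
            # Greenwood term d / (n (n - d)); guard the final event
            gw += d / (n_at_risk * (n_at_risk - d)) if n_at_risk > d else 0.0
            out_t.append(t)
            out_s.append(s)
            out_se.append(s * gw ** 0.5)           # Greenwood formula
        n_at_risk -= at
    return out_t, out_s, out_se
```

On four subjects with an event at times 1, 2 and 4 and a censoring at time 3, this gives Ŝ = 3/4, 1/2, 0 at the event times, with Greenwood SE 1/4 at t = 2.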
Figure 7.5 Standardized differences between the Kaplan–Meier estimate (KM) and the model-based survival curves (PLE or ARE). Panel (a): first tertile; panel (b): second tertile; panel (c): third tertile of the linear predictor. Each panel plots the differences PLE − KM and ARE − KM against log survival time.
7.5 Structural Misspecifications

7.5.1 Performance of the ARE

The main objective of this book is to present robust techniques dealing with distributional robustness. In essence, we assume a specific model (e.g. the Cox model) and propose estimation or testing procedures that are meant to be stable and more efficient than the classical procedures in a 'neighborhood' of the working model. We normally speak of model misspecification in that sense. This sometimes creates confusion, in particular for the proportional hazards model, where the effects of many departures have been studied over the years, e.g. covariate omission, deviations from the proportional hazards assumption or measurement error in the variables. These can be seen as structural model misspecifications and are outside the scope of robustness theory. It is, however, important to discuss the performance of the robust procedures (estimation, tests) presented so far in that setting.
Historically, researchers first studied the impact of covariate omission on the estimation process, in particular in randomized experiments where the main endpoint
is a specific time to event. Typically, the question of whether an unadjusted analysis of a two-arm randomized clinical trial (RCT) still provides a consistent estimate of the treatment effect was of primary interest. Work by Gail et al. (1984), Bretagnolle and Huber-Carol (1988) and others showed, both theoretically and by simulations, that if important predictors were omitted from the Cox model the classical estimate (PLE) was slightly biased toward the null. They even showed that this situation could happen when the data were perfectly balanced, as in RCTs, and was worsened by the presence of censoring. The key reason for this is that the PLE does not generally converge toward the true regression parameter unless the treatment effect is itself zero. This type of problem being structural, a similar situation arises with the ARE. No formulas have ever been established, but a hint was given by Minder and Bednarski (1996), who explicitly investigated the problem.¹² They show in their simulation (type B) that the robust proposal is indeed biased toward the null but tends to be less biased than the PLE. A similar situation is encountered in Section 7.3.1 with measurement error problems. If a predictor x_1 cannot be measured exactly but instead v_1 = x_1 + u is used, where u is some random term added independently, it is well known that β_1, the slope of x_1, is not estimated consistently; see Carroll et al. (1995) for instance. An attenuation or dilution effect is observed if the naive approach is used (i.e. regressing the outcome on v_1 and the other covariates) for most types of regression, including Cox's. In addition, estimates of other slopes can also be affected. Again, the ARE is not specifically built to remove such bias, which results more from a key feature of the data, i.e. a structural model misspecification, ignored in a naive analysis (classical or robust).
In another simulation (type C), Minder and Bednarski (1996) showed that the ARE tended to be less biased than its classical counterpart. In such a case it is highly recommended to directly correct for measurement error using one of the many techniques described, for instance, by Carroll et al. (1995) and in the abundant literature dealing with this issue. Robust methods could then also be specifically developed in that setting, i.e. with a model that includes possible measurement error. The problem of 'what to do when the hazard is not proportional' often arises and is even more important in practice. Here two elements of an answer can be given. First, if non-proportionality is caused by a subgroup of patients responding differently, then the ARE will certainly provide safer results. Second, if the problem is more structural, e.g. a multiplicative model better captures the inherent nature of the data, the ARE will not perform any better than the classical technique. The reason is that (i) both methods still assume proportional hazards; and (ii) this type of departure is not in a neighborhood of the working model. In other words it will not be 'within the range' of what the robust method can handle. By definition, the problem is more structural than distributional and beyond the scope of the current method. Finally, one may wonder how large amounts of censoring affect the ARE, or whether something similar is possible for the time-dependent Cox model. The robust approach presented here is only valid under the assumption of fixed predictors.

¹² The estimator used in this reference is a simpler version of the ARE: the weighting scheme is based on g(t) = t, not the cumulative hazard function; see Section 7.3.1 for the definitions of the weights. However, the results are illustrative of what could be obtained with the ARE.
Its extension to time-varying covariates has not been attempted, even under simple circumstances, and seems a considerable challenge. Regarding the impact of censoring, no work has been carried out to illustrate the performance of the ARE in the presence of heavy censoring.
7.5.2 Performance of the robust Wald test

It is probably legitimate to wonder whether the robust Wald test defined in Section 7.3.5 provides some kind of protection against structural misspecifications. This question arises naturally as we know that the asymptotic variance (7.21) is literally a generalization of (7.12), the LW formula, which is supposed to be better at dealing with that type of problem; see the discussion in Section 7.2.4 and the link between the two formulas in Section 7.3.2. An insight is given in Heritier and Galbraith (2008), who carried out simulations similar to those undertaken by Lin and Wei (1989) with the addition of the ARE as a genuine contender. We report here the results for covariate omission, a particularly relevant situation in RCTs as discussed earlier. The data were not, however, generated to mimic that situation as in Lin and Wei (1989). Survival times come from an exponential model with hazard λ(t | x) = exp(x_1²), where x_1 follows a standard normal distribution. This is supposed to be an even worse scenario than simply ignoring the predictors in an RCT. The working model is a Cox model with two predictors x_1 and x_2, generated independently of each other with the same distribution. This model is misspecified, as x_1² has been omitted from the fitted model and x_2 is unnecessary. The primary objective is the performance of tests of H_0: β_1 = 0 at the true model. The standard z-test (with model-based SE) cannot maintain its nominal level of 5% and instead exhibits an inflated type I error around 13%. In contrast, the LW z-test has a type I error around 6–6.5% while the ARE's is around 3.5–4.5%. These results stand for a sample size of 50–100 and are consistent with those initially reported by Lin and Wei (1989). The ARE-based Wald test thus seems to perform well in that particular setting; if anything the test seems to be conservative.
A similar performance to the LW approach is also observed by Heritier and Galbraith (2008) for the other designs studied by Lin and Wei (1989), including misspecified hazards, e.g. fitting the Cox model to data generated with a logarithmic type of hazard. These conclusions are seriously limited by the fact that we are only focusing on the test level. Nothing is said about the loss of power of such procedures compared with those of inferential (robust) procedures developed in a structurally correct model. We therefore strongly recommend sorting out structural problems before carrying out robust inference. Distributional robustness deals with small deviations from the assumed (core) model, and this statement is even more critical for inferential matters. This is clearly not the case if, for instance, the right scale for the data is multiplicative as opposed to additive (i.e. one of the scenarios considered here). Using testing procedures in a Cox model fitted with the ARE should not be done if linearity on the log-hazard scale is clearly violated. The same kind of conclusion holds for violations from the proportional hazard assumption. This recommendation could only be waived if such departures are caused by a few abnormal cases, in which case the use of a robust Wald test can be beneficial. Finally, the LW approach is also
used when correlation (possibly due to clustering) is present in the data. It generally outperforms its model-based counterpart and maintains its level close to the nominal level. The properties of (7.21) in that setting have not been investigated.
7.5.3 Other Issues

Robust methods in survival data have just started their development. As mentioned earlier, the presence of censoring creates a considerable challenge. In the uncensored case robust methods in fully parametric models are readily available. One could, for instance, use robust Gamma regression as described in Chapter 5. Specific methods have also been proposed for the (log-)Weibull or (log-)Gamma distributions by Marazzi (2002), Marazzi and Barbati (2003), Marazzi and Yohai (2004) and Bianco et al. (2005). Interesting applications to the modeling of length of stay in hospital, or its cost, are given as illustrations. The inclusion of covariates is considered in the last two references. Marazzi and Yohai (2004) can also deal with right truncation but, unfortunately, these methods are not yet general enough to accommodate random censoring. In addition, the theory developed in this chapter for the Cox model assumes non-informative censoring. Misspecifications of the censoring mechanism have recently received attention, at least in the classical case; see Kong and Slud (1997) and DiRienzo and Lagakos (2001, 2003). Whether modern robustness ideas can valuably contribute to that type of problem is still an open question. Robust model selection for censored data is also still a research question, with an attempt in that direction by Bednarski and Mocarska (2006) for the Cox model.
7.6 Censored Regression Quantiles

7.6.1 Regression Quantiles

In this section we introduce an approach that is a pure product of robust statistics in the sense that it does not have a classical counterpart. The seminal work dates back to Koenker and Bassett (1978), who were the first to propose modeling any pre-specified quantile of a response variable instead of modeling the conditional mean. By doing so they offered statisticians a unique way to explain the entire conditional distribution. As the quantiles themselves can be modeled as a linear function of covariates, they are called regression quantiles and the approach is termed quantile regression (QR). This technique was historically introduced as a robust alternative to linear regression in the econometric literature. Before presenting the extension to censored data, we present here the basic ideas underlying the QR approach. The basic idea is to estimate the conditional quantile of an outcome y given a vector of covariates x, defined as

Q(y, x; τ) = inf{u : P(y ≤ u | x) = τ}
(7.28)
for any pre-specified level 0 ≤ τ ≤ 1. We further assume that Q(y, x; τ) is a linear combination of the covariates, i.e.

Q(y, x; τ) = x^T β(τ)
(7.29)
with β(τ) the τth regression parameter. The rationale for (7.29) is that in many problems the way small or large quantiles depend on the covariates might be quite different from the median response. This is particularly true for the heteroscedastic data common in the econometric literature, where this approach rapidly gained popularity. On the other hand, the ability to detect structures for different quantiles is appealing irrespective of the context. The linear specification is the simplest functional form we can imagine and corresponds to the problem of finding regression quantiles in a linear, possibly heterogeneous, regression model. Of course the response function need not be linear, and f(x, β(τ)) is the obvious extension of the linear predictor in that case. For 0 ≤ τ ≤ 1 define the piecewise-linear function ρ(u; τ) = u(τ − ι(u < 0)), where ι(u < 0) is one when u < 0 and zero otherwise. Koenker and Bassett (1978) then showed that a consistent estimator of β(τ) is the value β̂(τ) that minimizes the objective function

r(β(τ)) = Σ_{i=1}^n ρ(y_i − x_i^T β(τ); τ),    (7.30)

for an i.i.d. sample (y_i, x_i). When τ = 1/2, ρ(u; τ) reduces to the absolute value up to a multiplicative factor 1/2. Thus, for the special case of the median this estimator is the so-called L_1-estimator, in reference to the absolute (or L_1) norm. For that reason, this approach is also referred to as L_1 regression quantiles. An introduction to this approach at a low level of technicality, with a telling example for a biostatistical audience, can be found in Koenker and Hallock (2001). In their pioneering work Koenker and Bassett (1978) provided an algorithm based on standard linear programming to compute β̂(τ) that was later refined by Koenker and D'Orey (1987). They also proved that this estimator is consistent and asymptotically normal under mild conditions. For instance, in the classical i.i.d. setting we have

√n (β̂(τ) − β(τ)) → N(0, ω(τ) Σ⁻¹),    (7.31)

where ω(τ) = τ(1 − τ)/f²(F⁻¹(τ)), Σ = E[x x^T], and f and F are the density and cumulative distribution functions of the error term, respectively. Conditions on f include f(F⁻¹(τ)) > 0 in a neighborhood of τ. It should be stressed that the fact that the asymptotic distribution of β̂(τ) depends on the (unspecified) error distribution can create some difficulties in computing it. Indeed, the density needs to be estimated non-parametrically and the resulting estimates may suffer from a lack of stability. Inferential methods based on the bootstrap might then be preferred. We refer the reader interested in the technical aspects of this work to Koenker and Bassett (1982) for details and, for a more comprehensive account discussing inferential aspects, to Koenker (2005). The QR technique took two decades to make its way into survival data analysis, probably because of the lack of flexibility of QR to deal with censoring. A step in
the right direction was suggested by Koenker and Geling (2001). It is based on a simple idea: a transformation of the survival time y_i, e.g. the log-transformation, is used, providing a regression quantile approach to the accelerated failure time model. This is straightforward when all survival times are indeed observed; see Koenker and Geling (2001) for an instructive example. However, this approach is insufficient for most applications in medical research where censoring occurs.
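To make the check function concrete, here is a minimal sketch (illustrative Python with toy data, not the book's R code) for the intercept-only case, where minimizing (7.30) returns the τth sample quantile; for τ = 1/2 a gross outlier leaves the fit untouched:

```python
# Koenker-Bassett check function rho(u; tau) = u(tau - 1{u < 0}) and a
# brute-force minimization of the objective (7.30) for a model with an
# intercept only: the minimizer is attained at a data point and equals
# the tau-th sample quantile.
def rho(u, tau):
    return u * (tau - (1.0 if u < 0 else 0.0))

def objective(q, y, tau):
    return sum(rho(yi - q, tau) for yi in y)

y = [1.0, 2.0, 3.0, 4.0, 100.0]   # toy data with one gross outlier
tau = 0.5                          # the median, i.e. L1 regression
best = min(y, key=lambda q: objective(q, y, tau))
print(best)  # 3.0: the sample median, unaffected by the outlier
```

With covariates, the same objective is minimized over β(τ) by linear programming, as in Koenker and D'Orey (1987).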
7.6.2 Extension to the Censored Case

Early attempts to deal with censoring required overly strict assumptions, making their use relatively limited; see Powell (1986), Buchinsky and Hahn (1998), Honore et al. (2002) and Chernozhukov and Hong (2002), among others. The important breakthrough came with Portnoy (2003), who was able to accommodate general forms of censoring. He also made available a user-friendly R package called CRQ for censored regression quantiles (directly accessible on his website). We can then expect a rapid development of this innovative approach in biostatistics and medical research, where it could be used as a valuable complement to the Cox model. CRQ involves more technical aspects since it combines both the elements of regression quantiles and the modeling of censored survival times. The reader may decide to skip this section in the first instance and just accept the existence of the extension to the censored case. Let c_i, i = 1, . . . , n, be the censoring times and y_i^0 the possibly unobserved response (e.g. survival time t_i^0) for the ith subject. We have y_i = min(y_i^0, c_i) (e.g. y_i = t_i, the observed survival time), and δ_i = ι(y_i^0 ≤ c_i), the indicator of non-censoring. We can even allow c_i to depend on x_i but require y_i^0 and c_i to be independent conditionally on x_i. The model now stipulates that the conditional quantiles of y_i^0 are a linear combination of the covariates but does not impose any particular functional form on those of y_i. Portnoy (2003) astutely noticed that QR is actually a generalization of the one-sample Kaplan–Meier approach. Two key ingredients combine here: (1) the Kaplan–Meier estimator (7.26) can be viewed as a 'recursively reweighted' empirical survival estimate; (2) a more technical argument linked to the computation of regression quantiles, i.e. the weighted gradient used in the programming remains piecewise linear in τ. This simple remark permits the use of simplex pivoting techniques.
Point (1) follows from Efron (1967), who shows that the Kaplan–Meier estimator can be computed by redistributing the mass of each censored observation to subsequent non-censored observations. In other words, the mass P(y_i^0 > c_i) can be redistributed to observations above c_i. This is done by exploiting a key point of QR, i.e. the estimator β̂(τ) depends on the sign of the residuals at any given point and not on the actual value of the response. The procedure for estimating β̂(τ) when there is censoring then works in the following way. First, it is easy to start with a low quantile τ. We might not know the exact value of y_i^0, but we do know that it is beyond the censoring time c_i. Then, when c_i lies above the τth regression line, so does y_i^0. The true residual y_i^0 − x_i^T β̂(τ) will be positive irrespective of y_i^0, and we can just use ordinary QR for such a small quantile value. Of course, as τ becomes larger, sooner or later a censored observation
will have a negative residual c_i − x_i^T β̂(τ). We do not know for sure whether the true residual is positive or negative, but as the sign has changed we call such an observation crossed from now on. The level at which the observation is crossed is denoted τ̂_i, thus

c_i − x_i^T β̂(τ̂_i) ≥ 0   and   c_i − x_i^T β̂(τ) ≤ 0   for all τ > τ̂_i.
As explained by Portnoy (2003) and Debruyne et al. (2008), the critical idea is 'to estimate the probability of crossed censored observations having a positive, respectively negative, residual and then use these estimates as weights further on'. This can be achieved by splitting such an observation into two weighted pseudo-observations, one at (c_i, x_i) with weight w_i(τ) ≈ P(y_i^0 − x_i^T β̂(τ) ≥ 0) and one at (+∞, x_i) with weight 1 − w_i(τ). The weight itself comes from quantile regression, as 1 − τ̂_i is a rough estimate of the censoring probability P(y_i^0 > c_i), i.e.

w_i(τ) = (τ − τ̂_i) / (1 − τ̂_i)   for τ > τ̂_i.
Then we can proceed recursively to obtain the CRQ estimate; the exact algorithm as detailed in Debruyne et al. (2008) is given in Appendix G. This process is technically equivalent to one minus the Kaplan–Meier estimate with the Efron recursive reweighting scheme; see the example given in Portnoy (2003, p. 1004), for details. Improvements to the computation of CRQ can also be found in Fitzenberger and Winker (2007) and may prove useful for large datasets.
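The two ingredients above can be sketched in a few lines (illustrative Python with toy data, not the CRQ package itself): Efron's redistribute-to-the-right construction of the Kaplan–Meier estimate, and Portnoy's weight for a crossed observation.

```python
def km_redistribute(times, events):
    """Kaplan-Meier via Efron's redistribute-to-the-right scheme: each
    censored observation (event == 0) passes its mass to the observations
    after it, so the survival curve drops only at uncensored times.
    `times` must be sorted increasingly."""
    n = len(times)
    mass = [1.0 / n] * n
    for i in range(n):
        if events[i] == 0 and i < n - 1:        # censored: redistribute
            share = mass[i] / (n - 1 - i)
            mass[i] = 0.0
            for j in range(i + 1, n):
                mass[j] += share
    surv, s = {}, 1.0
    for t, e, m in zip(times, events, mass):
        if e == 1:
            s -= m
            surv[t] = s
    return surv

def crossing_weight(tau, tau_hat):
    """Portnoy's weight w_i(tau) = (tau - tau_hat_i) / (1 - tau_hat_i)."""
    return (tau - tau_hat) / (1.0 - tau_hat)

# observations at times 2 and 4 are censored
print(km_redistribute([1, 2, 3, 4, 5], [1, 0, 1, 0, 1]))
# survival drops: S(1) = 0.8, S(3) ~ 0.533, S(5) = 0, matching the
# product-limit estimate
print(round(crossing_weight(0.6, 0.4), 3))  # 0.333
```

The full CRQ algorithm applies this weighting recursively within the simplex pivoting of quantile regression, as detailed in Appendix G.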
7.6.3 Asymptotic Properties and Robustness

Establishing asymptotic results for CRQ is a considerable task, as the weighting scheme sketched above must be taken into account. The most accurate result so far is that β̂(τ) converges to β(τ) at the rate n^{−1/2}, as shown by Neocleous et al. (2006). The asymptotic normality with a closed form for the asymptotic variance is still a work in progress. The current way to compute standard errors or CIs is the bootstrap. This technique is computer intensive but stable and provides an effective way to perform inference in the i.i.d. setting; it is also the default method in the R package CRQ provided by Portnoy. More generally, even if an asymptotic result were available, it would not necessarily lead to an accurate estimate. Indeed, as indicated earlier in (7.31), the asymptotic variance of the regression quantile estimates in the non-censored case depends on the underlying (unspecified) error distribution, and hence bootstrap methods can provide more reliable standard error estimates. This is also certainly true in the presence of censoring. Elaboration must be made on the exact implementation of the bootstrap for CRQ, as a few complications arise. First, when the survival distribution presents many censored observations in its right tail, it is virtually impossible to estimate the conditional quantile above the τ value corresponding to the last uncensored observation. When bootstrapping, the problem is even more serious as this cut-off is random. In one bootstrap sample the observed cut-off can be 0.9 whereas in another it is about 0.7 due to the
presence of more censored observations from the right-hand tail. Thus, the simple percentile CI possibly fails. Portnoy (2003) introduced a hybrid approach called the 2.906 IQR bootstrap to cope with this problem: simply take the bootstrap estimate of the interquartile range (IQR) and use normality to obtain the relevant percentiles. Technically, this amounts to computing the bootstrap sample interquartile values β̂*_{0.75} − β̂*_{0.5} and β̂*_{0.5} − β̂*_{0.25}, multiplying them by 2.906 for consistency, and adding the values to the median estimate β̂*_{0.5} to get upper and lower bounds of the 95% CI for all β(τ). This approach seems to work reasonably well both in simulations and examples. Second, as the computational time can be prohibitive for large samples, discouraging users, a possible solution has been implemented in the R package CRQ. It is called the 'n-choose-m' bootstrap, whereby replicates of size m < n are chosen to compute the estimates and the CIs are then adjusted for the smaller sample size. Improvements on the CRQ implementation are work in progress and limitations will certainly be relaxed in the near future. Regression quantiles inherit the robustness of ordinary sample quantiles, and thus present some form of robustness to distributional assumptions. As pointed out by Koenker and Hallock (2001), the estimates have 'an inherent distribution-free character because quantile estimation is influenced only by the local behavior of the conditional distribution near the specified quantile'. This is equally true for CRQ as long as only perturbations in the response are considered. However, both regression quantiles and CRQ break down in the presence of bad leverage points or problems in the covariates. Robust inference has not been specifically studied, but it is safe to say that the bootstrap-based approach probably works well for low levels of contamination and central values of τ (which is probably where most applied problems sit).
In contrast, extreme values of τ or a higher percentage of spurious data in the sample cause more trouble. Indeed, in that case the standard bootstrap approach breaks down as more outliers can be generated in the bootstrap sample. This is even more critical when extreme τ are the target, as the breakdown point of β̂(τ) is automatically lower.
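The 2.906 IQR interval described above can be sketched numerically (Python with toy bootstrap replicates, not the CRQ package, which implements this internally). The factor 2.906 ≈ 1.96/0.6745 rescales a normal quartile deviation to 1.96 standard errors:

```python
import statistics

def iqr_ci_95(boot):
    """Portnoy's 2.906 IQR bootstrap interval: anchor at the bootstrap
    median and stretch each interquartile half-width by 2.906."""
    q1, q2, q3 = statistics.quantiles(boot, n=4, method='inclusive')
    return (q2 - 2.906 * (q2 - q1), q2 + 2.906 * (q3 - q2))

# toy bootstrap replicates of one coefficient
boot = [0.8, 0.9, 0.95, 1.0, 1.0, 1.05, 1.1, 1.15, 1.2]
lo, hi = iqr_ci_95(boot)
print(f"{lo:.4f} {hi:.4f}")  # 0.8547 1.2906
```

Because it uses quartiles rather than extreme percentiles, the interval is insensitive to the random upper cut-off of the bootstrap replicates.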
7.6.4 Comparison with the Cox Proportional Hazard Model

Straightforward computations based on the survival function and cumulative hazard given in Section 7.2 show that the conditional quantile for the survival time t given a particular covariate vector x is

Q(t, x; τ) = Λ₀⁻¹[−log(1 − τ) exp(−x^T β)].    (7.32)

Thus, the exponential form of the Cox model imposes a specific form on the conditional quantiles. More specifically, (7.32) shows that they are all monotone in log(1 − τ) and depend on Λ₀ in a complicated way. As the conditional quantiles are not linear in the covariates, the Cox model does not provide a direct analog of β̂(τ). However, Koenker and Geling (2001) and Portnoy (2003) suggested that a good proxy for β̂(τ) is the derivative of (7.32) evaluated at x̄, the average covariate
vector, i.e.

b(τ) = ∂Q(t, x; τ)/∂x |_{x = x̄}.    (7.33)

If we now plug in the PLE for β into formula (7.33) we obtain b̂(τ), which we can now compare with the censored regression quantile estimate β̂(τ). It is worth noting that (7.33) implies that

b_j(τ) = −(1 − τ)^{γ(x)} log(1 − τ) γ(x) β_j / S₀'[Q(t, x; τ)],
where γ(x) = exp(−x^T β). So the effects of the various covariates as functions of τ are all identical up to a scaling factor depending on x. In particular, the quantile treatment effect for the Cox model must have the same sign as β_j, precluding any form of effect that would allow crossings of the survival functions for different settings of covariates. This can be seen as a lack of flexibility of the Cox model imposed by the proportional hazard assumption.
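As a check on (7.33), the following sketch assumes a unit-exponential baseline (Λ₀(t) = t, so S₀'(t) = −e^{−t}; an assumption for illustration, not the lung cancer model) and verifies that the closed-form b_j(τ) agrees with a numerical derivative of (7.32):

```python
import math

beta = [0.5, -0.3]                        # assumed coefficients

def lin(x):
    return sum(b * xi for b, xi in zip(beta, x))

def Q(x, tau):
    # (7.32) with Lambda0(t) = t, hence Lambda0^{-1}(u) = u
    return -math.log(1.0 - tau) * math.exp(-lin(x))

def b_closed(x, tau, j):
    # b_j(tau) = -(1-tau)^g log(1-tau) g beta_j / S0'(Q), g = exp(-x'beta)
    g = math.exp(-lin(x))
    S0p = -math.exp(-Q(x, tau))           # S0'(t) = -exp(-t)
    return -((1.0 - tau) ** g) * math.log(1.0 - tau) * g * beta[j] / S0p

x, tau, h = [1.0, 0.2], 0.5, 1e-6
num = (Q([x[0] + h, x[1]], tau) - Q([x[0] - h, x[1]], tau)) / (2.0 * h)
print(abs(num - b_closed(x, tau, 0)) < 1e-6)  # True
```

Note the sign reversal: a positive β_j (higher hazard) yields a negative quantile effect on survival time, constant in sign across τ as the text explains.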
7.6.5 Lung Cancer Data Example (continued)

Figure 7.6 displays a concise summary of the results for a censored quantile regression analysis of log(time), i.e. an accelerated failure time model, on the lung cancer data. The model includes eight estimated coefficients, but ptherapy and age were omitted as the same flat non-significant pattern appears for all values of τ and methods. The shaded area represents the 95% pointwise band for each CRQ coefficient obtained by bootstrapping. The dashed line represents the analog of β̂(τ) for the Cox model, given by (7.33). The Karnofsky performance status (karnofsky) is a standard score of 0–100 assessing the functional ability of a patient to perform tasks; 0 represents death and 100 a normal ability with no complaints. Its effect, depicted in the first panel, is highly significant at all levels and for both the Cox and CRQ models. Around median values, e.g. τ = 0.50, the CRQ estimate is roughly 0.04, which translates into a multiplicative effect of exp(0.04 ∗ 10) = 1.49 on median survival for a 10-point increase on that scale (holding all other factors constant). The effect looks somewhat higher for smaller quantiles and weaker for larger values of τ, a decreasing trend that is not detected by the Cox model. dduration and treatment have little impact on the outcome for all values of τ, strengthening the previous findings that these predictors are not important in these data. cell is a more interesting predictor. No clear effect of squamous versus large cells appears, although it seems that in the tails things could be different, with possibly a crossover. With the 95% CI also being larger towards the ends, we do not pursue this interpretation. The situation is much neater for small cells, where a significant constant effect appears at all levels except perhaps for larger values, τ ≥ 0.80 say.
An estimate of −0.70 is obtained for τ = 0.50; this means that the presence of small cells reduces the median survival by 1 − exp(−0.70) = 50% in comparison with large cells. In contrast, the QR estimate (7.33) for the Cox model represented by the
[Figure 7.6 occupies this page: six panels (karnofsky, dduration, squamous, small, adeno, treatment), each plotting the estimated coefficient against τ.]
Figure 7.6 The CRQ coefficient, β̂(τ), with shaded 95% band for the lung cancer data. The Cox coefficient effect (7.33) is represented by the dashed line.
dashed line in the same panel is higher, more variable and of uncertain significance, probably for the same reasons mentioned earlier. Adeno cells seem to act similarly to small cells on survival, although their effect looks clearer towards the upper end of the distribution. Finally, we would like to mention some robustness concerns. As the CRQ approach is based on quantiles, it is robust to outliers in the response, or vertical outliers, as indicated earlier. It is therefore not influenced by the two long-term survivors (cases 17 and 44). This explains why the robust analysis of Section 7.4 is more in line with the current findings, especially on the role of the cell type. For the sake of completeness we also give the CRQ fit at τ = 0.50 in Table 7.6. It can be seen as a snapshot of Figure 7.6 at a particular level, here the median. The 95% CIs provided in this table are based on the bootstrap with B = 1000 replicates. The p-values correspond to the z-statistic obtained by studentizing by the bootstrap IQR, as directly implemented in the R package developed by Portnoy. It is worth noting that the coefficients are similar to those given in the robust analysis of Section 7.4 up to the minus sign. The systematic reversing of the signs for significant predictors is generally observed. This is due to the fact that CRQ explains a specific quantile of the logarithm of time, whereas in a Cox model the classical interpretation with hazard ratios relates more to survival. It is actually possible to obtain similar tables for other values of τ, but the graphical summary is usually more informative unless an investigator is interested in one particular quantile of the
Table 7.6 Estimates, 95% CIs and p-values for significance testing for the Veteran's Administration lung cancer data.

Variable     Estimate   95% CI            p-value
Intercept     2.297     (0.45; 4.12)       0.01
karnofsky     0.036     (0.02; 0.05)       0.00
dduration     0.005     (−0.02; 0.06)      0.80
age           0.003     (−0.02; 0.03)      0.83
ptherapy     −0.010     (−0.07; 0.04)      0.71
cell
  Squamous   −0.117     (−0.81; 0.78)      0.77
  Small      −0.685     (−1.28; −0.05)     0.03
  Adeno      −0.751     (−1.33; −0.06)     0.02
treatment     0.018     (−0.54; 0.44)      0.94

The regression coefficients are estimated by means of the CRQ at τ = 0.50.
distribution. To conclude, it is useful to note that although the differences between the quantile method and the Cox model may not be considered important in this example, this is not always the case. As pointed out by Portnoy (2003), CRQ generally provides new insight into the data, with the discovery of substantial differences when a greater signal-to-noise ratio exists in the data.
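The multiplicative interpretations quoted in the example (the karnofsky and small-cell effects) are simple transformations of the log-time coefficients; a sketch with the values from the text:

```python
import math

# karnofsky: CRQ slope ~0.04 on log(time); a 10-point increase multiplies
# median survival by exp(0.04 * 10)
print(f"{math.exp(0.04 * 10):.2f}")        # 1.49

# small cells: estimate -0.70 at tau = 0.5 reduces median survival by
# a factor 1 - exp(-0.70)
print(f"{1.0 - math.exp(-0.70):.2f}")      # 0.50, i.e. about 50%
```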
7.6.6 Limitations and Extensions

Despite its uniqueness and originality, combined with both good local robustness properties and direct interpretation, CRQ has a few limitations. Unlike the proportional hazard model, it cannot be extended to time-varying predictors, as the whole algorithm is based on fixed x. This must be played down, as many of the time-dependent covariates used in the extended Cox model are actually introduced when the proportional hazard assumption itself is violated. As the proportional hazard assumption is no longer needed in QR, this problem largely disappears. From a robustness perspective CRQ is resistant to vertical outliers, i.e. abnormal responses in time, but not to leverage points. Recent work by Debruyne et al. (2008) shows that this difficulty can be overcome by introducing censored depth quantiles. More research is needed to study their asymptotic properties and compare them with CRQ. More importantly, some more work is needed to sort out inferential issues, even though the bootstrap approach described above offers a workable solution. Recently, Peng and Huang (2008) introduced a new approach for censored QR based on the Nelson–Aalen estimator of the cumulative hazard function. Implementation of this technique has been provided in the R package quantreg; see Koenker (2008). This work is promising as Peng and Huang's estimator admits a Martingale representation, providing a natural route for an asymptotic theory. A key assumption of all of these techniques, however, is that Q(t, x; τ) depends linearly on the regression
parameter β. This condition can be relaxed in partially linear models as investigated by Neocleous and Portnoy (2006). This could constitute a valuable alternative for intrinsically non-linear data. Irrespective of the method, we would like to stress the potential of QR in biostatistics as it constitutes an original complement to the Cox model. It has the advantage of being naturally interpretable and does not assume any form of proportionality of the hazard function. Results obtained by CRQ can sometimes contradict those derived from the Cox model. This should not be seen as a deficiency but more as a major strength. It can often capture new structures that were hidden behind the proportional hazard assumption. In general, its greater flexibility suggests that the corresponding results are more reliable, but we encourage users to carry out additional work to better understand how such differences can be explained.
Appendices
A
Starting Estimators for MM-estimators of Regression Parameters

For the starting point β̂₀, one can choose an estimator among the class of S-estimators as proposed by Rousseeuw and Yohai (1984) (see also Section 2.3.3). A popular choice for the corresponding ρ-function is the biweight function (2.20), hence leading to the solution β̂_[bi] for β and σ̂²_[bi] for σ², which minimize σ² subject to

(1/n) Σ_{i=1}^n {ρ_[bi](r_i; β, σ², c) − E_Φ[ρ_[bi](r; β, σ², c)]} = 0,    (A.1)

where the expected value ensures Fisher consistency of the resulting estimator. The breakdown point of this S-estimator can be chosen through the value of c that satisfies for ρ_[bi] the condition E_Φ[ρ_[bi](r; β, σ², c)] = ε* ρ_[bi](c; β, σ², c), where ε* is the desired breakdown point (see Rousseeuw and Yohai, 1984). When ε* = 0.5 (the maximal value), then c = 1.547 (see Rousseeuw and Leroy, 1987, p. 136). However, its efficiency, i.e. the ratio between the traces of the asymptotic variances of, respectively, the LS and the S-estimator under the exact regression model, is equal to 0.287 (see Yohai et al., 1991); hence it is roughly four times more variable than the LS. The solution can be found by a random resampling algorithm followed by a local search (see Yohai et al., 1991), by a genetic algorithm in place of the resampling algorithm, by an exhaustive form of sampling algorithm for small problems (see Marazzi (1993) for details on the numerical algorithms), or by a faster algorithm for large problems (see Pena and Yohai, 1999). The computational speed is still an issue for computing β̂₀ in general. When some of the explanatory variables are actually categorical (i.e. factors), as is the case

Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser © 2009 John Wiley & Sons, Ltd
with the diabetes data (variables bfmed, bflar and loc), Maronna and Yohai (2000) propose splitting the estimation procedure into an M-estimation part for the categorical variables and an S-estimation part for the other variables, resulting in what they call an MS-estimator. Basically, consider the following regression model

$$y_i = x_{i(1)}^T\beta_{(1)} + x_{i(2)}^T\beta_{(2)} + \epsilon_i, \quad i = 1, \ldots, n,$$

where $x_{i(1)}$ are 0–1 vectors (i.e. dummy variables) of dimension $q_1$ and $x_{i(2)}$ are real-valued vectors of dimension $q_2$. An estimator for $\beta_{(1)}$ is defined conditionally on a value for $\beta_{(2)}$, i.e. the solution $\dot\beta_{(1)}(\beta_{(2)})$ in $\beta_{(1)}$ of

$$\sum_{i=1}^{n}\Psi(\tilde r_i, x_{i(1)}) = 0, \qquad (A.2)$$

with $\tilde r_i = (\tilde y_i - x_{i(1)}^T\beta_{(1)})/\sigma$ and $\tilde y_i = y_i - x_{i(2)}^T\beta_{(2)}$. As an estimator for $\beta_{(2)}$ one uses e.g. the S-estimator (A.1), in which $r_i = y_i - x_{i(1)}^T\dot\beta_{(1)}(\beta_{(2)}) - x_{i(2)}^T\beta_{(2)}$. For a discussion of the choice of the $\Psi$-function in (A.2) and of simplified numerical procedures, see Maronna and Yohai (2000).

One can also choose different $\rho$-functions and/or other objective functions to define high breakdown point estimators for the starting point: the least median of squares (LMS) and least trimmed squares (LTS) estimators, both from Rousseeuw (1984), and the least absolute deviations (LAD) estimator of Edgeworth (1887) (see also Bloomfield and Steiger, 1983), also known as $L_1$-regression. They can be seen either as natural adaptations of the LS estimator or as particular cases of S-estimators. Indeed, the LS estimator (for a given $\sigma^2$) is defined as the solution of

$$\min_\beta \frac{1}{n}\sum_{i=1}^{n} r_i^2, \qquad (A.3)$$

i.e. the minimization of a scale estimate of the residuals, in a similar manner as for S-estimators (for which the square of the residuals is generalized to a function $\rho$). Replacing the mean by the median leads to the LMS, using a trimmed mean leads to the LTS, and taking the absolute value instead of the square in (A.3) leads to the LAD. All of these estimators require a robust estimator for the scale $\sigma$ and special algorithms to compute them. They have progressively been abandoned in favor of $\hat\beta_{[bi]}$ and $\hat\sigma^2_{[bi]}$.
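The difference in resistance between these scale objectives can be illustrated directly. The sketch below (Python; the residual values are made up for illustration) evaluates the LS, LMS, LTS and LAD objectives on a residual vector with and without one gross outlier: the mean of squares explodes, while the median and the trimmed mean are barely affected.

```python
import numpy as np

def scale_objectives(res):
    """LS, LMS, LTS (25% trim) and LAD scale objectives for a residual vector."""
    sq = np.sort(res ** 2)
    n = len(sq)
    return {
        "LS":  sq.mean(),                   # mean of squared residuals
        "LMS": np.median(sq),               # median of squared residuals
        "LTS": sq[: int(0.75 * n)].mean(),  # trimmed mean of squared residuals
        "LAD": np.abs(res).mean(),          # mean absolute residual
    }

clean = np.array([0.3, -0.5, 0.1, 0.4, -0.2, 0.6, -0.4, 0.2])
contaminated = np.append(clean, 50.0)       # one gross outlier

obj_clean = scale_objectives(clean)
obj_dirty = scale_objectives(contaminated)
```

For a fixed candidate fit, an outlier of size 50 multiplies the LS objective by several orders of magnitude but moves the LMS and LTS objectives only marginally, which is exactly why minimizing them yields high breakdown point estimators.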
B
Efficiency, LRT_ρ, RAIC and RC_p with Biweight ρ-function for the Regression Model

To develop the efficiency (3.20) and the other quantities needed for the LRT$_\rho$, the RAIC and the RC$_p$ with the biweight estimator with $\rho$-function (3.15), we make use of

$$E[r^k] = \frac{k!}{2^{k/2}(k/2)!} \quad (k \text{ even})$$

to compute the moments of a $N(0,1)$ variable, and of

$$\int_{-\infty}^{c} r^k\,d\Phi(r) = L_k = -c^{k-1}\phi(c) + (k-1)\Phi(c)L_{k-2},$$

with $L_0 = \Phi(c)$ and $L_1 = -\phi(c)$. We need (even) moments up to the order 14, i.e.

$$L_2 = -c\phi(c) + \Phi(c)^2$$
$$L_4 = -(c^3 + 3c\Phi(c))\phi(c) + 3\Phi(c)^3$$
$$L_6 = -(c^5 + 5c^3\Phi(c) + 15c\Phi(c)^2)\phi(c) + 15\Phi(c)^4$$
$$L_8 = -(c^7 + 7c^5\Phi(c) + 35c^3\Phi(c)^2 + 105c\Phi(c)^3)\phi(c) + 105\Phi(c)^5$$
$$L_{10} = -(c^9 + 9c^7\Phi(c) + 63c^5\Phi(c)^2 + 315c^3\Phi(c)^3 + 945c\Phi(c)^4)\phi(c) + 945\Phi(c)^6$$
$$L_{12} = -(c^{11} + 11c^9\Phi(c) + 99c^7\Phi(c)^2 + 693c^5\Phi(c)^3 + 3465c^3\Phi(c)^4 + 10\,395c\Phi(c)^5)\phi(c) + 10\,395\Phi(c)^7$$
$$L_{14} = -(c^{13} + 13c^{11}\Phi(c) + 143c^9\Phi(c)^2 + 1287c^7\Phi(c)^3 + 9009c^5\Phi(c)^4 + 45\,045c^3\Phi(c)^5 + 135\,135c\Phi(c)^6)\phi(c) + 135\,135\Phi(c)^8$$
and, therefore,

$$\int_{-c}^{c} d\Phi(r) = 1 - 2\Phi(-c)$$
$$\int_{-c}^{c} r^2\,d\Phi(r) = \int_{-\infty}^{\infty} r^2\,d\Phi(r) - 2\int_{-\infty}^{-c} r^2\,d\Phi(r) = 1 - 2c\phi(c) - 2\Phi(-c)^2$$
$$\int_{-c}^{c} r^4\,d\Phi(r) = 3 - 2\phi(c)(c^3 + 3c\Phi(-c)) - 6\Phi(-c)^3$$
$$\int_{-c}^{c} r^6\,d\Phi(r) = 15 - 2\phi(c)(c^5 + 5c^3\Phi(-c) + 15c\Phi(-c)^2) - 30\Phi(-c)^4$$
$$\int_{-c}^{c} r^8\,d\Phi(r) = 105 - 2\phi(c)(c^7 + 7c^5\Phi(-c) + 35c^3\Phi(-c)^2 + 105c\Phi(-c)^3) - 210\Phi(-c)^5$$
$$\int_{-c}^{c} r^{10}\,d\Phi(r) = 945 - 2\phi(c)(c^9 + 9c^7\Phi(-c) + 63c^5\Phi(-c)^2 + 315c^3\Phi(-c)^3 + 945c\Phi(-c)^4) - 1890\Phi(-c)^6$$
$$\int_{-c}^{c} r^{12}\,d\Phi(r) = 10\,395 - 2\phi(c)(c^{11} + 11c^9\Phi(-c) + 99c^7\Phi(-c)^2 + 693c^5\Phi(-c)^3 + 3465c^3\Phi(-c)^4 + 10\,395c\Phi(-c)^5) - 20\,790\Phi(-c)^7$$
$$\int_{-c}^{c} r^{14}\,d\Phi(r) = 135\,135 - 2\phi(c)(c^{13} + 13c^{11}\Phi(-c) + 143c^9\Phi(-c)^2 + 1287c^7\Phi(-c)^3 + 9009c^5\Phi(-c)^4 + 45\,045c^3\Phi(-c)^5 + 135\,135c\Phi(-c)^6) - 270\,270\Phi(-c)^8.$$
For the efficiency (3.20), we have

$$e_c = \left[\frac{5}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{6}{c^2}\int_{-c}^{c} r^2\,d\Phi(r) + \int_{-c}^{c} d\Phi(r)\right]^2\left[\frac{1}{c^8}\int_{-c}^{c} r^{10}\,d\Phi(r) - \frac{4}{c^6}\int_{-c}^{c} r^8\,d\Phi(r) + \frac{6}{c^4}\int_{-c}^{c} r^6\,d\Phi(r) - \frac{4}{c^2}\int_{-c}^{c} r^4\,d\Phi(r) + \int_{-c}^{c} r^2\,d\Phi(r)\right]^{-1}.$$
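This efficiency is the ratio $E[\psi']^2/E[\psi^2]$ for the biweight $\psi(r) = r(1-(r/c)^2)^2$, and it reproduces the standard tuning values. A numerical check (Python sketch; it integrates $\psi'$ and $\psi^2$ against the normal density directly, rather than through the moment identities above):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def efficiency(c):
    """Gaussian efficiency of the biweight estimator: E[psi']^2 / E[psi^2]."""
    psi = lambda r: r * (1.0 - (r / c) ** 2) ** 2            # biweight psi on [-c, c]
    dpsi = lambda r: 1.0 - 6.0 * (r / c) ** 2 + 5.0 * (r / c) ** 4
    num, _ = quad(lambda r: dpsi(r) * norm.pdf(r), -c, c)
    den, _ = quad(lambda r: psi(r) ** 2 * norm.pdf(r), -c, c)
    return num ** 2 / den

print(round(efficiency(1.547), 3))   # about 0.287, as quoted in Appendix A
print(round(efficiency(4.685), 3))   # about 0.95, the usual 95%-efficiency tuning
```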
For the LRT$_\rho$, and using the $\rho$-function given in (3.15), we have that (3.26) reduces to

$$\left[\frac{5}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{6}{c^2}\int_{-c}^{c} r^2\,d\Phi(r) + \int_{-c}^{c} d\Phi(r)\right]\left[\frac{1}{c^8}\int_{-c}^{c} r^{10}\,d\Phi(r) - \frac{4}{c^6}\int_{-c}^{c} r^8\,d\Phi(r) + \frac{6}{c^4}\int_{-c}^{c} r^6\,d\Phi(r) - \frac{4}{c^2}\int_{-c}^{c} r^4\,d\Phi(r) + \int_{-c}^{c} r^2\,d\Phi(r)\right]^{-1}.$$
For the RAIC given in (3.31), and using the $\rho$-function given in (3.15), we have

$$a = \frac{1}{c^8}\int_{-c}^{c} r^{10}\,d\Phi(r) - \frac{4}{c^6}\int_{-c}^{c} r^8\,d\Phi(r) + \frac{6}{c^4}\int_{-c}^{c} r^6\,d\Phi(r) - \frac{4}{c^2}\int_{-c}^{c} r^4\,d\Phi(r) + \int_{-c}^{c} r^2\,d\Phi(r)$$

and

$$b = \frac{5}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{6}{c^2}\int_{-c}^{c} r^2\,d\Phi(r) + \int_{-c}^{c} d\Phi(r).$$

For the RC$_p$, Ronchetti and Staudte (1994) have shown the following.
Writing $\psi(r) = \partial\rho(r)/\partial r$ and $\psi'(r) = \partial^2\rho(r)/\partial r\,\partial r$,

$$U_p - V_p = n\int\psi(r)^2\,d\Phi(r) - 2p\left[\int\psi'(r)\,d\Phi(r)\right]^{-1}\int\psi(r)^2\,\psi'(r)\,d\Phi(r) + p\int\left[\psi'(r)^2 + \frac{2}{r}\,\psi(r)\,\psi'(r) - \frac{3}{r^2}\,\psi(r)^2\right]d\Phi(r)\,\int\psi(r)^2\,d\Phi(r)\,\left[\int\psi'(r)\,d\Phi(r)\right]^{-2}$$

and

$$V_P = p\int\psi(r)^2\,d\Phi(r)\,\int\frac{1}{r^2}\,\psi(r)^2\,d\Phi(r)\,\left[\int\psi'(r)\,d\Phi(r)\right]^{-2}.$$
For the biweight $\rho$-function (3.15), we have

$$U_p - V_p = n\left[\frac{1}{c^8}\int_{-c}^{c} r^{10}\,d\Phi(r) - \frac{4}{c^6}\int_{-c}^{c} r^8\,d\Phi(r) + \frac{6}{c^4}\int_{-c}^{c} r^6\,d\Phi(r) - \frac{4}{c^2}\int_{-c}^{c} r^4\,d\Phi(r) + \int_{-c}^{c} r^2\,d\Phi(r)\right]$$
$$\quad - 2p\left[\frac{5}{c^{12}}\int_{-c}^{c} r^{14}\,d\Phi(r) - \frac{26}{c^{10}}\int_{-c}^{c} r^{12}\,d\Phi(r) + \frac{55}{c^8}\int_{-c}^{c} r^{10}\,d\Phi(r) - \frac{60}{c^6}\int_{-c}^{c} r^8\,d\Phi(r) + \frac{35}{c^4}\int_{-c}^{c} r^6\,d\Phi(r) - \frac{10}{c^2}\int_{-c}^{c} r^4\,d\Phi(r) + \int_{-c}^{c} r^2\,d\Phi(r)\right]\left[\frac{5}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{6}{c^2}\int_{-c}^{c} r^2\,d\Phi(r) + \int_{-c}^{c} d\Phi(r)\right]^{-1}$$
$$\quad + p\left[\frac{32}{c^8}\int_{-c}^{c} r^8\,d\Phi(r) - \frac{80}{c^6}\int_{-c}^{c} r^6\,d\Phi(r) + \frac{64}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{16}{c^2}\int_{-c}^{c} r^2\,d\Phi(r)\right]\left[\frac{1}{c^8}\int_{-c}^{c} r^{10}\,d\Phi(r) - \frac{4}{c^6}\int_{-c}^{c} r^8\,d\Phi(r) + \frac{6}{c^4}\int_{-c}^{c} r^6\,d\Phi(r) - \frac{4}{c^2}\int_{-c}^{c} r^4\,d\Phi(r) + \int_{-c}^{c} r^2\,d\Phi(r)\right]\left[\frac{5}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{6}{c^2}\int_{-c}^{c} r^2\,d\Phi(r) + \int_{-c}^{c} d\Phi(r)\right]^{-2}$$

and

$$V_P = p\left[\int_{-c}^{c} d\Phi(r) - \frac{4}{c^2}\int_{-c}^{c} r^2\,d\Phi(r) + \frac{6}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{4}{c^6}\int_{-c}^{c} r^6\,d\Phi(r) + \frac{1}{c^8}\int_{-c}^{c} r^8\,d\Phi(r)\right]\left[\int_{-c}^{c} r^2\,d\Phi(r) - \frac{4}{c^2}\int_{-c}^{c} r^4\,d\Phi(r) + \frac{6}{c^4}\int_{-c}^{c} r^6\,d\Phi(r) - \frac{4}{c^6}\int_{-c}^{c} r^8\,d\Phi(r) + \frac{1}{c^8}\int_{-c}^{c} r^{10}\,d\Phi(r)\right]\left[\frac{5}{c^4}\int_{-c}^{c} r^4\,d\Phi(r) - \frac{6}{c^2}\int_{-c}^{c} r^2\,d\Phi(r) + \int_{-c}^{c} d\Phi(r)\right]^{-2}.$$
C
An Algorithm Procedure for the Constrained S-estimator

The following is pseudo-code for the algorithm computing the constrained S-estimator.

• Given a model, define the design matrices $z_j z_j^T$ that determine the structure of the covariance matrix, and the matrices $x_i$ that define the mean vectors $x_i\beta$, so that

$$\Sigma = \sum_{j=0}^{r}\sigma_j^2 z_j z_j^T.$$

• Compute the starting point of the constrained estimator, that is $x_i\beta_{start}$ and $\Sigma_{start}$. In principle one can choose any high breakdown point estimator as starting point. It can be made 'constrained' to match the MLM model by averaging out the elements of the estimated covariance matrix that are equal under the MLM. We use the MCD estimator (see Section 2.3.3).

• Compute the constrained estimator through the following iterative procedure:

1. Compute the Mahalanobis distances
$$d_i^{(1)} = \sqrt{(y_i - x_i\beta_{start})^T\,\Sigma_{start}^{-1}\,(y_i - x_i\beta_{start})}.$$

2. Compute the weights $w(d_i^{(1)})$.

3. Compute the fixed effects parameters $\beta^{(1)}$ by solving in $\beta$
$$\sum_i w(d_i^{(1)})\,x_i^T\,\Sigma_{start}^{-1}\,(y_i - x_i\beta) = 0.$$
4. Let $\alpha = (\sigma_0^2, \ldots, \sigma_r^2)^T$; an iterative expression for the variance components $\alpha^{(1)}$ is given by

$$\alpha^{(1)} = \left[\frac{1}{n}\sum_{i=1}^{n} w(d_i^{(1)})(d_i^{(1)})^2\right]^{-1} Q^{-1} U,$$

with $U$ the vector with elements ($j = 0, \ldots, r$)

$$U_j = \frac{1}{n}\sum_i p\,w(d_i^{(1)})\,(y_i - x_i\beta_{start})^T\,\Sigma_{start}^{-1} z_j z_j^T \Sigma_{start}^{-1}\,(y_i - x_i\beta_{start}),$$

and $Q = \mathrm{tr}(M_j M_k)_{j,k=0,\ldots,r}$ with $M_j = \Sigma_{start}^{-1} z_j z_j^T$.

5. Using the design matrices $z_j z_j^T$, update the constrained matrix by

$$\Sigma^{(1)} = \sum_{j=0}^{r}\sigma_j^{2(1)} z_j z_j^T.$$

6. Update the fixed effects by $x_i\beta^{(1)}$.

7. Compute some convergence criterion. If the conditions of the criterion are met, stop; otherwise put $\beta_{start} = \beta^{(1)}$ and $\Sigma_{start} = \Sigma^{(1)}$, and start again at step 1 by computing $d_i^{(2)}$, the weights $w(d_i^{(2)})$, then $\beta^{(2)}$ and $\Sigma^{(2)}$. Repeat the procedure until convergence.
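Steps 1-3 of the iteration can be sketched as follows (Python; the biweight-type weight function, the tuning constant and the toy random-intercept covariance are illustrative assumptions, and the variance-component update of step 4 is omitted, $\Sigma$ being held fixed):

```python
import numpy as np

def biweight_w(d, c):
    """Weight w(d) attached to a Mahalanobis distance d (biweight type, assumed)."""
    w = (1.0 - (d / c) ** 2) ** 2
    return np.where(d <= c, w, 0.0)

def beta_step(X, Y, Sigma_inv, c=1e6, n_iter=20):
    """Iterate steps 1-3 with Sigma fixed: distances -> weights -> weighted GLS."""
    n, p = len(X), X[0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        res = [Y[i] - X[i] @ beta for i in range(n)]
        d = np.array([np.sqrt(r @ Sigma_inv @ r) for r in res])
        w = biweight_w(d, c)
        A = sum(w[i] * X[i].T @ Sigma_inv @ X[i] for i in range(n))
        b = sum(w[i] * X[i].T @ Sigma_inv @ Y[i] for i in range(n))
        beta = np.linalg.solve(A, b)   # solves sum_i w_i x_i' Sigma^-1 (y_i - x_i b) = 0
    return beta

# Toy balanced design: 5 subjects, 3 measurements each, random-intercept covariance.
rng = np.random.default_rng(0)
X = [np.column_stack([np.ones(3), np.arange(3.0)]) for _ in range(5)]
Sigma = 0.5 * np.ones((3, 3)) + np.eye(3)        # sigma_0^2 J + sigma_1^2 I
Y = [x @ np.array([1.0, 2.0]) + rng.normal(size=3) for x in X]
beta_hat = beta_step(X, Y, np.linalg.inv(Sigma))
```

With a very large $c$ all weights are essentially one and the fixed point is the ordinary GLS estimator, which provides a convenient sanity check of the loop.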
D
Some Distributions of the Exponential Family

We give here the definitions of some of the distributions belonging to the exponential family, as listed in Table 5.1.

• Normal. The density function of a variable distributed as $y_i \sim N(\mu_i, \sigma^2)$ is
$$f(y; \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(y - \mu_i)^2\right),$$
for $y$ in $\mathbb{R}$.

• Bernoulli. A Bernoulli distributed variable $y_i$ can take the values $y = 0$ or $y = 1$ according to
$$P(y_i = y; p_i) = p_i^y(1 - p_i)^{1-y}.$$

• Scaled binomial. The scaled binomial distributed variables $y_i/m$ take values $0, 1/m, 2/m, \ldots, 1$ and are derived from the binomial variables $y_i$ with probabilities
$$P(y_i = y; p_i) = \binom{m}{y} p_i^y (1 - p_i)^{m-y},$$
for $y = 0, 1, \ldots, m$.

• Poisson. For a Poisson variable $y_i \sim \mathcal{P}(\lambda_i)$, probabilities are computed according to
$$P(y_i = y; \lambda_i) = \exp(-\lambda_i)\frac{\lambda_i^y}{y!},$$
for $y = 0, 1, 2, \ldots$.
• Gamma. Here $y_i$ is said to be $\Gamma(\mu_i, \nu)$ distributed if its density is
$$f(y; \mu_i, \nu) = \frac{\nu/\mu_i \cdot \exp(-\nu y/\mu_i) \cdot (\nu y/\mu_i)^{\nu-1}}{\Gamma(\nu)},$$
for $y > 0$, with $\Gamma(\nu) = \int_0^\infty \exp(-u)\,u^{\nu-1}\,du$.
E
Computations for the Robust GLM Estimator

E.1 Fisher Consistency Corrections

We give here the Fisher consistency corrections

$$a(\beta) = \frac{1}{n}\sum_{i=1}^{n} E[\psi(r_i;\beta,\phi,c)]\,w(x_i)\,\frac{1}{\sqrt{\phi v_{\mu_i}}}\,\mu_i'\,x_i,$$

with $\mu_i' = \partial\mu_i/\partial\eta_i$, for the binomial, Poisson and Gamma models. Note that for the binomial and Poisson models $\phi = 1$, and for the Gamma model $\phi = 1/\nu$; see Table 5.1. The only term to be computed for each model is $E[\psi(r_i;\beta,\phi,c)]$, which is done below for $\psi(r_i;\beta,\phi,c) = \psi_{[Hub]}(r_i;\beta,\phi,c)$; see Section 3.6.

Let us first define $j_1 = \lfloor\mu_i - c\sqrt{\phi v_{\mu_i}}\rfloor$ and $j_2 = \lfloor\mu_i + c\sqrt{\phi v_{\mu_i}}\rfloor$, where $\lfloor u\rfloor$ denotes the largest integer not greater than $u$.

The binomial model states that $y_i \sim B(m_i, p_i)$, so that $E[y_i] = \mu_i = m_i p_i$ and $\mathrm{var}(y_i) = \mu_i(m_i - \mu_i)/m_i$. Then we have

$$E[\psi_{[Hub]}(r_i;\beta,\phi,c)] = \sum_{j=-\infty}^{\infty}\psi_{[Hub]}\left(\frac{j - \mu_i}{\sqrt{v_{\mu_i}}};\beta,\phi,c\right)P(y_i = j)\,\iota(j \in [0, m_i])$$
$$= c[P(y_i \geq j_2 + 1) - P(y_i \leq j_1)] + \frac{\mu_i}{\sqrt{v_{\mu_i}}}[P(j_1 \leq \tilde y_i \leq j_2 - 1) - P(j_1 + 1 \leq y_i \leq j_2)],$$

with $\tilde y_i \sim B(m_i - 1, p_i)$, and where $\iota(C)$ is the indicator function that takes the value one if $C$ is true and zero otherwise.
The Poisson model states that $y_i \sim \mathcal{P}(\mu_i)$ and, hence, $E[y_i] = V(\mu_i) = \mu_i$. Then,

$$E[\psi_{[Hub]}(r_i;\beta,\phi,c)] = \sum_{j=-\infty}^{\infty}\psi_{[Hub]}\left(\frac{j - \mu_i}{\sqrt{v_{\mu_i}}};\beta,\phi,c\right)P(y_i = j)\,\iota(j \geq 0)$$
$$= c[P(y_i \geq j_2 + 1) - P(y_i \leq j_1)] + \frac{\mu_i}{\sqrt{v_{\mu_i}}}[P(y_i = j_1) - P(y_i = j_2)].$$

Finally, for the Gamma model, one remarks in the first place that $r_i = (y_i - \mu_i)/\sqrt{\phi v_{\mu_i}}$ has a Gamma distribution (independent of $\mu_i$) with expectation equal to $\sqrt\nu$ and origin shifted to $-\sqrt\nu$. It holds that

$$E[\psi_{[Hub]}(r_i;\beta,\phi,c)] = \int_{-\sqrt\nu}^{\infty}\psi_{[Hub]}(r;\beta,\phi,c)\,f(r;\sqrt\nu,\nu)\,\iota(r > -\sqrt\nu)\,dr$$
$$= c[P(r_i > c) - P(r_i < -c)] + \frac{\nu^{(\nu-1)/2}}{\Gamma(\nu)}[G(-c,\nu) - G(c,\nu)],$$

where $f(r;\sqrt\nu,\nu)$ is the Gamma density (see Appendix D) and

$$G(t,\kappa) = \exp(-\sqrt\nu(\sqrt\nu + t))(\sqrt\nu + t)^{\kappa}\,\iota(t > -\sqrt\nu).$$
E.2 Asymptotic Variance

Computing the asymptotic variance amounts to computing the matrices $A$ and $B$ of Section 5.3.4, and therefore $E[\psi^2(r_i;\beta,\phi,c)]$ and $E[\psi(r_i;\beta,\phi,c)(\partial/\partial\mu_i)\log h(y_i\,|\,x_i,\mu_i)]$, again for $\psi(r_i;\beta,\phi,c) = \psi_{[Hub]}(r_i;\beta,\phi,c)$, where $h(y_i\,|\,x_i,\mu_i)$ is the conditional density or probability of $y_i\,|\,x_i$. For the binomial model

$$E[\psi^2_{[Hub]}(r_i;\beta,\phi,c)] = c^2(P(y_i \leq j_1) + P(y_i \geq j_2 + 1)) + \frac{1}{v_{\mu_i}}[\pi_i^2 m_i(m_i - 1)P(j_1 - 1 \leq \tilde{\tilde y}_i \leq j_2 - 2) + (\mu_i - 2\mu_i^2)P(j_1 \leq \tilde y_i \leq j_2 - 1) + \mu_i^2 P(j_1 + 1 \leq y_i \leq j_2)],$$

with $y_i \sim B(m_i,\pi_i)$, $\tilde y_i \sim B(m_i - 1,\pi_i)$ and $\tilde{\tilde y}_i \sim B(m_i - 2,\pi_i)$ ($m_i \geq 3$).
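This closed form can again be confirmed by direct summation over the binomial support (Python sketch; $m = 10$, $\pi = 0.35$ and $c = 1.2$ are arbitrary):

```python
import math

def binom_pmf(j, m, p):
    if j < 0 or j > m:
        return 0.0
    return math.comb(m, j) * p**j * (1 - p) ** (m - j)

def e_psi2_closed(m, p, c):
    """Closed-form E[psi_Hub^2] for y ~ B(m, p), following the expression above."""
    mu = m * p
    v = mu * (m - mu) / m                  # mu (m - mu)/m = m p (1 - p)
    s = math.sqrt(v)
    j1, j2 = math.floor(mu - c * s), math.floor(mu + c * s)
    P = lambda mm, lo, hi: sum(binom_pmf(j, mm, p) for j in range(lo, hi + 1))
    tails = P(m, 0, j1) + (1.0 - P(m, 0, j2))
    middle = (p**2 * m * (m - 1) * P(m - 2, j1 - 1, j2 - 2)
              + (mu - 2 * mu**2) * P(m - 1, j1, j2 - 1)
              + mu**2 * P(m, j1 + 1, j2))
    return c**2 * tails + middle / v

def e_psi2_brute(m, p, c):
    mu = m * p
    s = math.sqrt(mu * (m - mu) / m)
    return sum(max(-c, min(c, (j - mu) / s)) ** 2 * binom_pmf(j, m, p)
               for j in range(m + 1))

print(abs(e_psi2_closed(10, 0.35, 1.2) - e_psi2_brute(10, 0.35, 1.2)) < 1e-12)  # True
```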
Given that $(\partial/\partial\mu_i)\log h(y_i\,|\,x_i,\mu_i)$ is equal to $(y_i - \mu_i)/v_{\mu_i}$, we have

$$E\left[\psi_{[Hub]}(r_i;\beta,\phi,c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i,\mu_i)\right] = E\left[\psi_{[Hub]}(r_i;\beta,\phi,c)\frac{y_i - \mu_i}{v_{\mu_i}}\right]$$
$$= \frac{c\mu_i}{v_{\mu_i}}[P(y_i \leq j_1) - P(\tilde y_i \leq j_1 - 1) + P(\tilde y_i \geq j_2) - P(y_i \geq j_2 + 1)] + \frac{1}{v_{\mu_i}^{3/2}}[\pi_i^2 m_i(m_i - 1)P(j_1 - 1 \leq \tilde{\tilde y}_i \leq j_2 - 2) + (\mu_i - 2\mu_i^2)P(j_1 \leq \tilde y_i \leq j_2 - 1) + \mu_i^2 P(j_1 + 1 \leq y_i \leq j_2)],$$

with $y_i \sim B(m_i,\pi_i)$, $\tilde y_i \sim B(m_i - 1,\pi_i)$ and $\tilde{\tilde y}_i \sim B(m_i - 2,\pi_i)$ ($m_i \geq 3$).

For the Poisson model,

$$E[\psi^2_{[Hub]}(r_i;\beta,\phi,c)] = c^2[P(y_i \leq j_1) + P(y_i \geq j_2 + 1)] + \frac{1}{v_{\mu_i}}[\mu_i^2 P(j_1 - 1 \leq y_i \leq j_2 - 2) + (\mu_i - 2\mu_i^2)P(j_1 \leq y_i \leq j_2 - 1) + \mu_i^2 P(j_1 + 1 \leq y_i \leq j_2)].$$

We have
$$\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i,\mu_i) = \frac{y_i - \mu_i}{\mu_i} = \frac{y_i - \mu_i}{v_{\mu_i}},$$
so that
$$E\left[\psi_{[Hub]}(r_i;\beta,\phi,c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i,\mu_i)\right] = E\left[\psi_{[Hub]}(r_i;\beta,\phi,c)\frac{y_i - \mu_i}{v_{\mu_i}}\right]$$
$$= c[P(y_i = j_1) + P(y_i = j_2)] + \frac{1}{v_{\mu_i}^{3/2}}\left\{\mu_i P(j_1 \leq y_i \leq j_2 - 1) + \mu_i^2[P(y_i = j_1 - 1) - P(y_i = j_1) - P(y_i = j_2 - 1) + P(y_i = j_2)]\right\}.$$
For the Gamma model, we first note that
$$\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i,\mu_i) = \frac{y_i - \mu_i}{\mu_i^2/\nu} = \frac{\sqrt\nu\,r_i}{\mu_i}.$$

This yields

$$E\left[\psi_{[Hub]}(r_i;\beta,\phi,c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i,\mu_i)\right] = \frac{\sqrt\nu}{\mu_i}E[\psi_{[Hub]}(r_i;\beta,\phi,c)\,r_i]$$
$$= \frac{c\,\nu^{\nu/2}}{\mu_i\,\Gamma(\nu)}[G(-c,\nu) + G(c,\nu)] + \frac{\sqrt\nu}{\mu_i}P(-c < r_i < c) + \frac{\nu^{\nu/2}}{\mu_i\,\Gamma(\nu)}[G(-c,\nu+1) - G(c,\nu+1)] + \frac{\nu^{(\nu+1)/2}}{\mu_i\,\Gamma(\nu)}\,\frac{\nu + 1 - 2\nu}{\nu}\,[G(-c,\nu) - G(c,\nu)].$$
E.3 IRWLS Algorithm for Robust GLM

We show here how the estimation procedure derived from (5.13) can be written as an IRWLS algorithm. Given $\beta^{t-1}$, the estimated value of $\beta$ at iteration $t-1$, one can obtain $\beta^t$, the value of $\beta$ at iteration $t$, by regressing $Z = X^T\beta^{t-1} + e^{t-1}$ on $X$ (see Definition (5.2)) with weights $B = \mathrm{diag}(b_1, \ldots, b_n)$, where

$$b_i = E\left[\psi(r_i;\beta,\phi,c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i,\mu_i)\right]\frac{w(x_i)}{\sqrt{\phi v_{\mu_i}}}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2, \qquad (E.1)$$

for $i = 1, \ldots, n$, where $h(\cdot)$ is the conditional density or probability of $y_i\,|\,x_i$, and $e^{t-1} = (e_1^{t-1}, \ldots, e_n^{t-1})$ with

$$e_i^{t-1} = \frac{\psi(r_i^{t-1};\beta,\phi,c) - E[\psi(r_i^{t-1};\beta,\phi,c)]}{E[\psi(r_i^{t-1};\beta,\phi,c)(\partial/\partial\mu_i)\log h(y_i\,|\,x_i,\mu_i^{t-1})]}\,\frac{\partial\eta_i}{\partial\mu_i}. \qquad (E.2)$$
To see the above, define $U(\beta) = \sum_{i=1}^{n}\Psi(y_i, x_i;\beta,\phi,c)$, where $\Psi(y_i, x_i;\beta,\phi,c)$ is given in (5.13). The Fisher-scoring algorithm at step $t$ writes
$$\beta^t = \beta^{t-1} + H^{-1}(\beta^{t-1})U(\beta^{t-1})$$
or, alternatively,
$$H(\beta^{t-1})\beta^t = H(\beta^{t-1})\beta^{t-1} + U(\beta^{t-1}),$$
where
$$H(\beta^{t-1}) = nM(\Psi, F_\beta) = E\left[-\frac{\partial}{\partial\beta}U(\beta)\right]\bigg|_{\beta=\beta^{t-1}} = X^T B|_{\beta=\beta^{t-1}} X.$$
Moreover, for $Z = X^T\beta^{t-1} + e^{t-1}$ with $e^{t-1}$ as defined in (E.2), we have that $H(\beta^{t-1})\beta^{t-1} + U(\beta^{t-1}) = X^T BZ$. In fact, for each $j = 1, \ldots, p$, it holds that

$$[H(\beta^{t-1})\beta^{t-1} + U(\beta^{t-1})]_j = \sum_{k=1}^{p}\sum_{i=1}^{n} b_i x_{ij} x_{ik}\beta_k^{t-1} + \sum_{i=1}^{n}\psi(r_i;\beta,\phi,c)\,w(x_i)\frac{1}{\sqrt{\phi v_{\mu_i}}}\frac{\partial\mu_i}{\partial\eta_i}x_{ij} - \sum_{i=1}^{n}E[\psi(r_i;\beta,\phi,c)]\,w(x_i)\frac{1}{\sqrt{\phi v_{\mu_i}}}\frac{\partial\mu_i}{\partial\eta_i}x_{ij}$$
$$= \sum_{i=1}^{n} b_i x_{ij}\left[\sum_{k=1}^{p}x_{ik}\beta_k^{t-1} + \frac{\psi(r_i;\beta,\phi,c)\,w(x_i)(1/\sqrt{\phi v_{\mu_i}})(\partial\mu_i/\partial\eta_i)}{b_i} - \frac{E[\psi(r_i;\beta,\phi,c)]\,w(x_i)(1/\sqrt{\phi v_{\mu_i}})(\partial\mu_i/\partial\eta_i)}{b_i}\right]$$
$$= \sum_{i=1}^{n} b_i x_{ij}\left[x_i^T\beta^{t-1} + \frac{\psi(r_i;\beta,\phi,c) - E[\psi(r_i;\beta,\phi,c)]}{E[\psi(r_i;\beta,\phi,c)(\partial/\partial\mu_i)\log h(y_i\,|\,x_i,\mu_i)]}\frac{\partial\eta_i}{\partial\mu_i}\right] = \sum_{i=1}^{n} b_i x_{ij} Z_i = [X^T BZ]_j,$$

where all of the involved quantities are evaluated at $\beta^{t-1}$.
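The whole scheme can be sketched for a robust Poisson regression with log link, Huber $\psi$ and $w(x_i) \equiv 1$ (Python; the data are made up, and the expectations $E[\psi]$ and $E[\psi\,\partial\log h/\partial\mu]$ are obtained by direct summation rather than by the closed forms of Appendices E.1 and E.2):

```python
import math
import numpy as np

def pois_pmf(j, mu):
    return math.exp(j * math.log(mu) - mu - math.lgamma(j + 1))

def expectations(mu, c, jmax=400):
    """E[psi(r)] and E[psi(r) (y - mu)/mu] for y ~ Poisson(mu), by summation."""
    e1 = e2 = 0.0
    s = math.sqrt(mu)
    for j in range(jmax):
        p = pois_pmf(j, mu)
        psi = max(-c, min(c, (j - mu) / s))
        e1 += psi * p
        e2 += psi * (j - mu) / mu * p
    return e1, e2

def robust_poisson_irwls(X, y, c=1.345, n_iter=200):
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]   # crude start
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        r = (y - mu) / np.sqrt(mu)
        psi = np.clip(r, -c, c)
        e1, e2 = np.transpose([expectations(m, c) for m in mu])
        b = e2 * np.sqrt(mu) * mu      # b_i = E[psi dlogh] (1/sqrt(v)) (dmu/deta)^2
        e = (psi - e1) / e2 / mu       # e_i = (psi - E[psi]) / E[...] * deta/dmu
        Z = X @ beta + e               # adjusted dependent variable
        WX = X * b[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ Z)   # weighted least squares step
    return beta

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 6.0, 9.0])
beta_hat = robust_poisson_irwls(X, y)
```

At a fixed point of this loop the corrected score $\sum_i(\psi_i - E[\psi_i])\sqrt{\mu_i}\,x_i$ vanishes, i.e. the robust estimating equation (5.13) is solved; with $c \to \infty$ the weights $b_i \to \mu_i$ and the working residuals $e_i \to (y_i - \mu_i)/\mu_i$, recovering classical Poisson IRWLS.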
F
Computations for the Robust GEE Estimator

F.1 IRWLS Algorithm for Robust GEE

The whole robust procedure consists of solving the three following sets of equations:

$$\sum_{i=1}^{n}(D_{\mu_i,\beta})^T\Gamma_i^T(V_{\mu_i,\tau,\alpha})^{-1}(\psi_i - c_i) = \sum_{i=1}^{n}\Psi_1(y_i, X_i;\beta,\alpha,\tau,c) = 0 \qquad (F.1)$$
$$\sum_{i=1}^{n}\sum_{t=1}^{n_i}\chi(r_{it};\beta,\alpha,\phi,c) = \sum_{i=1}^{n}\Psi_2(r_i;\beta,\alpha,\tau,c) = 0 \qquad (F.2)$$
$$\sum_{i=1}^{n}\left[G_i^T B_i - \frac{K}{n}\alpha\tau\right] = \sum_{i=1}^{n}\Psi_3(r_i;\beta,\alpha,\tau,c) = 0. \qquad (F.3)$$

Ideally these equations should be solved simultaneously (as, for example, in Huggins (1993)). We implement a two-stage approach iterating between the estimation of the regression parameters via (F.1) and the estimation of the dispersion and correlation parameters via (F.2) and (F.3). In fact, for fixed values of the nuisance parameters $\tau$ and $\alpha$, the estimation of the regression parameter $\beta$ can be performed via an IRWLS algorithm by regressing the adjusted dependent variable $Z = X_{tot}\hat\beta + D^{*-1}(\psi_{tot} - c_{tot})$ on $X_{tot}$ with a block-diagonal weight matrix $W^*$, where $X_{tot} = (X_1^T, \ldots, X_n^T)^T$, $\psi_{tot} = (\psi_1^T, \ldots, \psi_n^T)^T$ and $c_{tot} = (c_1^T, \ldots, c_n^T)^T$ are the combined information for the entire sample. The $i$th block of $W^*$ is the $n_i \times n_i$ matrix

$$W_i^* = D_{\mu_i,\beta}^{*-1}\,\Gamma_i^T(A_{\mu_i})^{-1/2}(R_{\alpha,i})^{-1}(A_{\mu_i})^{-1/2}\,\Gamma_i\,D_{\mu_i,\beta}^{*-1},$$
and $D^*$ is a block-diagonal matrix with blocks $D_{\mu_i,\beta}^* = \mathrm{diag}(\partial\eta_{i1}/\partial\mu_{i1}, \ldots, \partial\eta_{in_i}/\partial\mu_{in_i})$. We remark that $D_{\mu_i,\beta} = D_{\mu_i,\beta}^{*-1}X_i$. The matrix $H_i = X_i(X^T W^* X)^{-1}X_i^T W_i^*$ defines the hat matrix for subject $i$. One then obtains an estimate of $\tau$ and next an estimate of $\alpha$ from (F.2) and (F.3), respectively. Note that (F.3) can be solved explicitly when exchangeable correlation is assumed, yielding $\hat\alpha = \frac{1}{\hat\tau K}\sum_{i=1}^{n}G_i^T B_i$.
F.2 Fisher Consistency Corrections

Let $Y_{it}$ and $Y_{it'}$ be Bernoulli distributed with probability of success equal to $\mu_{it}$ and $\mu_{it'}$, respectively, and with correlation $\rho_{tt'}$. We assume that the robustness weight $w_{it}$ associated with subject $i$ at time $t$ can be decomposed as $w(x_{it})w(r_{it};\beta,\tau,c)$. The joint distribution of $(y_{it}, y_{it'})$ is multinomial with set of probabilities $(\pi_{11}, \pi_{10}, \pi_{01}, \pi_{00})$, where $\pi_{11} = \rho_{tt'}v_{it}^{1/2}v_{it'}^{1/2} + \mu_{it}\mu_{it'}$, $\pi_{10} = \mu_{it} - \pi_{11}$, $\pi_{01} = \mu_{it'} - \pi_{11}$ and $\pi_{00} = 1 - \mu_{it} - \mu_{it'} + \pi_{11}$. The consistency correction vector $c_i$ has elements $c_{it} = E[\psi_{it}]$ that take the form

$$c_{it} = w(x_{it})\left(w(r_{it}^{(1)};\beta,\tau,c) - w(r_{it}^{(0)};\beta,\tau,c)\right)\sqrt{v(\mu_{it})/\tau},$$

where $w(r_{it}^{(j)};\beta,\tau,c) = w\big((j - \mu_{it})/\sqrt{v(\mu_{it})}/\sqrt\tau\big)$ is the weight for the $t$th measure of cluster $i$ evaluated at $y_{it} = j$. Moreover, the diagonal matrix $\Gamma_i = E[\tilde\psi_i - \tilde c_i]$, with $\tilde\psi_i = \partial\psi_i/\partial\mu_i$ and $\tilde c_i = \partial c_i/\partial\mu_i$, has diagonal elements

$$\gamma_{it} = -w(x_{it})\left((1 - \mu_{it})\,w(r_{it}^{(1)};\beta,\tau,c) + \mu_{it}\,w(r_{it}^{(0)};\beta,\tau,c)\right).$$
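The form of $c_{it}$ follows from evaluating $E[w(r)r]$ over the two Bernoulli outcomes and using $\mu(1-\mu) = v(\mu)$; the identity holds for any weight function $w$, as a direct check shows (Python sketch with a Huber-type weight, which is our illustrative assumption):

```python
import math

def huber_w(r, c=1.345):
    """Huber-type weight (assumed for illustration): min(1, c/|r|)."""
    return min(1.0, c / abs(r)) if r != 0 else 1.0

def e_w_r(mu, tau, w):
    """E[w(r) r] for y ~ Bernoulli(mu), with r = (y - mu)/sqrt(tau v(mu))."""
    v = mu * (1 - mu)
    s = math.sqrt(tau * v)
    r1, r0 = (1 - mu) / s, -mu / s
    return mu * w(r1) * r1 + (1 - mu) * w(r0) * r0

def c_it(mu, tau, w):
    """Closed form above: (w(r1) - w(r0)) sqrt(v(mu)/tau)."""
    v = mu * (1 - mu)
    s = math.sqrt(tau * v)
    return (w((1 - mu) / s) - w(-mu / s)) * math.sqrt(v / tau)

print(abs(e_w_r(0.3, 1.0, huber_w) - c_it(0.3, 1.0, huber_w)) < 1e-12)  # True
```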
G
Computation of the CRQ

The global algorithm uses the notation and definitions introduced in Section 7.6.2. It is taken from Portnoy (2003) or Debruyne et al. (2008) and works as follows.

• As long as no censored observation is crossed, use ordinary QR as in Koenker and Bassett (1978).

• When the $i$th censored observation is crossed at the $\tau$th quantile, store this value as $\hat\tau_i = \tau$.

• When censored observations have been crossed for a specific $\tau$, find the value of $\beta$ that minimizes a weighted version of (7.30):

$$\sum_{i\in K_\tau^c}\rho(y_i - x_i^T\beta(\tau);\tau) + \sum_{i\in K_\tau}\left[w_i(\tau)\,\rho(y_i - x_i^T\beta(\tau);\tau) + (1 - w_i(\tau))\,\rho(y^* - x_i^T\beta(\tau);\tau)\right], \qquad (G.1)$$

where $K_\tau$ represents the set of crossed and censored observations at $\tau$ and $K_\tau^c$ its complement. The weights $w_i(\tau)$ are defined in Section 7.6.2 and $y^*$ is any value sufficiently large to exceed $x_i^T\beta$ for all $i$.

To compute the regression quantile objective function (G.1) in practice, a sequence of breakpoints $\tau_1^*, \tau_2^*, \ldots, \tau_L^*$ is defined so that $\hat\beta(\tau)$ is piecewise constant between these breakpoints. Then, simplex pivoting techniques allow one to move from one breakpoint to another using the gradients of (G.1). Portnoy (2003) points out that the resulting gradients are linear in $\tau$, which makes the computation tractable. The above reference contains a detailed algorithm and additional explanations. Recently, a variant called the grid algorithm has been proposed by Neocleous and Portnoy (2006). It is more stable, faster, and has already been implemented in the R package
provided by Portnoy. It should preferably be used for large datasets. The simplex pivoting algorithm is still available and works well for smaller samples, that is, $n$ up to several thousand.
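The building block of (G.1) is the quantile check function $\rho(u;\tau) = u(\tau - 1\{u < 0\})$. A minimal sketch of evaluating the weighted objective (Python; the weights and the pseudo-value $y^*$ passed in are illustrative placeholders, not the output of the Portnoy algorithm):

```python
import numpy as np

def check_rho(u, tau):
    """Quantile-regression check function rho(u; tau) = u (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def crq_objective(beta, tau, X, y, crossed, w, y_star):
    """Weighted objective (G.1): crossed censored points are split between their
    observed value y_i (weight w_i) and a large pseudo-value y* (weight 1 - w_i)."""
    res = y - X @ beta
    obj = check_rho(res[~crossed], tau).sum()
    obj += (w[crossed] * check_rho(res[crossed], tau)
            + (1 - w[crossed]) * check_rho(y_star - X[crossed] @ beta, tau)).sum()
    return obj

print(check_rho(np.array([1.0, -1.0]), 0.5))   # [0.5 0.5]
```

When all weights equal one, the objective reduces to the ordinary quantile-regression criterion, which is the uncensored special case.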
References

Adrover, J., Salibian-Barrera, M. and Zamar, R. (2004) Globally robust inference for the location and simple regression model. Journal of Statistical Planning and Inference, 119, 353–375. Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory (eds Petrov, B.N. and Csaki, F.), Akademiai Kiado, Budapest, pp. 267–281. Alario, F.J.S. and Ferrand, L. (2000) Semantic and associative priming in picture naming. The Quarterly Journal of Experimental Psychology, 53, 741–764. Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H. and Tukey, J.W. (1972) Robust Estimates of Location: Survey and Advances, Princeton University Press, Princeton, NJ. Atkinson, A.C. (1985) Plots, Transformations and Regression, Oxford University Press, Oxford. Atkinson, A.C. and Riani, M. (2000) Robust Diagnostic Regression Analysis, Springer, Berlin. Barnett, V. and Lewis, T. (1978) Outliers in Statistical Data, John Wiley & Sons, New York. Barry, S. and Welsh, A. (2002) Generalized additive modelling and zero inflated count data. Ecological Modelling, 157, 179–188. Beaton, A.E. and Tukey, J.W. (1974) The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16, 147–185. Bednarski, T. (1993) Robust estimation in Cox's regression model. Scandinavian Journal of Statistics, 20, 213–225. Bednarski, T. (1999) Adaptive robust estimation in the Cox regression model. Biocybernetics and Biomedical Engineering, 19, 5–15. Bednarski, T. (2007) On a robust modification of Breslow's cumulated hazard estimator. Computational Statistics and Data Analysis, 52, 234–238. Bednarski, T. and Mocarska, E. (2006) On robust model selection within the Cox model. Econometrics Journal, 9, 279–290. Bednarski, T. and Nowak, M. (2003) Robustness and efficiency of Sasieni-type estimators in the Cox model. Journal of Statistical Planning and Inference, 115, 261–272. Bednarski, T. and Zontek, S. (1996) Robust estimation of parameters in a mixed unbalanced model. Annals of Statistics, 24, 1493–1510. Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics, John Wiley & Sons, New York. Bennet, C.A. (1954) Effect of measurement error on chemical process control. Industrial Quality Control, 11, 17–20. Beran, R. (1981) Efficient robust tests in parametric models. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57, 73–86.
Berkson, J. (1944) Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 357–365. Bernoulli, D. (1777) Dijudicatio maxime probabilis plurium observationum discrepantium atque verisimillima inductio inde formanda. Acta Acad. Sci. Petropolit., 1, 3–33 (English translation by Allen, C.C. (1961), Biometrika, 48, 3–13.) Berry, D.A. (1987) Logarithmic transformations in ANOVA. Biometrics, 43, 439–456. Bianco, A., Boente, G. and di Rienzo, J. (2000) Some results for robust GM-based estimators in heteroscedastic regression models. Journal of Statistical Planning and Inference, 89, 215–242. Bianco, A.M. and Yohai, V.J. (1997) Robust estimation in the logistic regression model. Robust Statistics, Data Analysis and Computer Intensive Methods (ed. Rieder, H.), Springer, New York, pp. 17–34. Bianco, A.M., Ben, M.G. and Yohai, V.J. (2005) Robust estimation for linear regression with asymmetric errors. The Canadian Journal of Statistics, 33, 511–528. Birch, M.W. (1963) Maximum likelihood in three-way contingency tables. Journal of the Royal Statistical Society, Series B, Methodological, 25, 220–233. Bliss, C.I. (1935) The calculation of the dosage-mortality curve. Annals of Applied Biology, 22, 134–167. Bloomfield, P. and Steiger, W.L. (1983) Least Absolute Deviations: Theory, Applications, and Algorithms, Birkhäuser, Boston, MA. Blough, D.K., Madden, C.W. and Hornbrook, M.C. (1999) Modeling risk using generalized linear models. Journal of Health Economics, 18, 153–171. Box, G. (1979) Robustness in the strategy of scientific model building. Robustness in Statistics (eds Launer, R. and Wilkinson, G.), Academic Press, New York. Box, G.E.P. (1953) Non-normality and tests of variances. Biometrika, 40, 318–335. Bretagnolle, J. and Huber-Carol, C. (1988) Effects of omitting covariates in Cox's model for survival data. Scandinavian Journal of Statistics, 15, 125–128. Brochner-Mortensen, J., Jensen, S. and Rodbro, P. (1977) Assessment of renal function from plasma creatinine in adult patients. Scandinavian Journal of Urology and Nephrology, 11, 263–270. Buchinsky, M. and Hahn, J. (1998) An alternative estimator for the censored quantile regression model. Econometrica, 66, 653–671. Cain, K. and Lange, T. (1984) Approximate case influence for the proportional hazards regression model with censored data. Biometrics, 40, 439–499. Cameron, A.C. and Trivedi, P.K. (1998) Regression Analysis of Count Data, Cambridge University Press, Cambridge. Canario, L. (2006) Genetic aspects of piglet mortality at birth and in early suckling period: relationships with sow maternal abilities and piglet vitality, PhD thesis, Institut National Agronomique Paris-Grignon, France. Canario, L., Cantoni, E., Le Bihan, E., Caritez, J., Billon, Y., Bidanel, J. and Foulley, J. (2006) Between breed variability of stillbirth and relationships with sow and piglet characteristics. Journal of Animal Science, 84, 3185–3196. Cantoni, E. (2003) Robust inference based on quasi-likelihoods for generalized linear models and longitudinal data. Developments in Robust Statistics. Proceedings of ICORS 2001 (eds Dutter, R., Filzmoser, P., Gather, U. and Rousseeuw, P.J.), Springer, Heidelberg, pp. 114–124. Cantoni, E. (2004a) Analysis of robust quasi-deviances for generalized linear models. Journal of Statistical Software, 10(4).
Cantoni, E. (2004b) A robust approach to longitudinal data analysis. Canadian Journal of Statistics, 32, 169–180. Cantoni, E. and Ronchetti, E. (2001a) Resistant selection of the smoothing parameter for smoothing splines. Statistics and Computing, 11, 141–146. Cantoni, E. and Ronchetti, E. (2001b) Robust inference for generalized linear models. Journal of the American Statistical Association, 96, 1022–1030. Cantoni, E. and Ronchetti, E. (2006) A robust approach for skewed and heavy-tailed outcomes in the analysis of health care expenditures. Journal of Health Economics, 25, 198–213. Cantoni, E., Mills Flemming, J. and Ronchetti, E. (2005) Variable selection for marginal longitudinal generalized linear models. Biometrics, 61, 507–514. Carroll, R., Ruppert, D. and Stefanski, L. (1995) Measurement Error in Nonlinear Models, Chapman & Hall, London. Carroll, R.J. and Pederson, S. (1993) On robustness in the logistic regression model. Journal of the Royal Statistical Society, Series B, Methodological, 55, 693–706. Carroll, R.J. and Ruppert, D. (1982) Robust estimation in heteroscedastic linear models. Annals of Statistics, 10, 1224–1233. Chatterjee, S. and Hadi, A.S. (1988) Sensitivity Analysis in Linear Regression, John Wiley & Sons, New York. Chen, C. and Wang, P. (1991) Diagnostic plots in Cox's regression model. Biometrics, 47, 841–850. Chernozhukov, V. and Hong, H. (2002) Three-step censored quantile regression and extramarital affairs. Journal of the American Statistical Association, 97, 872–882. Christmann, A. (1997) High breakdown point estimators in logistic regression. Robust Statistics, Data Analysis and Computer Intensive Methods (ed. Rieder, H.), Springer, New York, pp. 79–90. Christmann, A. and Rousseeuw, P.J. (2001) Measuring overlap in binary regression. Computational Statistics and Data Analysis, 37, 65–75. Cleveland, W.S. (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.
Collett, D. (2003a) Modelling Binary Data, Chapman & Hall, London. Collett, D. (2003b) Modelling Survival Data in Medical Research, 2nd edn, Chapman & Hall, London. Conen, D., Wietlisbach, V., Bovet, P., Shamlaye, C., Riesen, W., Paccaud, F. and Burnier, M. (2004) Prevalence of hyperuricemia and relation of serum uric acid with cardiovascular risk factors in a developing country. BMC Public Health, http://www.biomedcentral.com/1471-2458/4/9. Cook, R.D. and Weisberg, S. (1982) Residuals and Influence in Regression, Chapman & Hall, New York. Copas, J.B. (1988) Binary regression models for contaminated data. Journal of the Royal Statistical Society, Series B, Methodological, 50, 225–265. Copt, S. and Heritier, S. (2007) Robust alternative to the F -test in mixed linear models based on MM-estimates. Biometrics, 63, 1045–1052. Copt, S. and Victoria-Feser, M.P. (2006) High breakdown inference for mixed linear models. Journal of the American Statistical Association, 101(473), 292–300. Copt, S. and Victoria-Feser, M.P. (2009) Robust predictions in mixed linear models, Technical report, University of Geneva.
Cox, D. (1972) Regression models and life tables. Journal of the Royal Statistical Society, Series B, Methodological, 34, 187–220. Cox, D.R. and Hinkley, D.V. (1992) Theoretical Statistics, Chapman & Hall, London. Cressie, N. and Lahiri, S. (1993) The asymptotic distribution of REML estimators. Journal of Multivariate Analysis, 45, 217–233. Croux, C., Dhaene, G. and Hoorelbeke, D. (2003) Robust standard errors for robust estimators, Discussion Paper Series 03.16, Center for Economic Studies, Catholic University of Leuven. Davies, P.L. (1987) Asymptotic behaviour of S-estimators of multivariate location parameters and dispersion matrices. Annals of Statistics, 15, 1269–1292. Davies, R.B. (1980) [Algorithm AS 155] The distribution of a linear combination of χ2 random variables (AS R53: 84V33 pp. 366–369). Applied Statistics, 29, 323–333. Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and their Applications, Cambridge University Press, Cambridge. Debruyne, M., Hubert, M., Portnoy, S. and Vanden Branden, K. (2008) Censored depth quantiles. Computational Statistics and Data Analysis, 52, 1604–1614. Dempster, A.P., Rubin, D.B. and Tsutakawa, R.K. (1981) Estimation in covariance components models. Journal of the American Statistical Association, 76, 341–353. Devlin, S.J., Gnanadesikan, R. and Kettenring, J.R. (1981) Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association, 76, 354–362. Diggle, P.J., Heagerty, P., Liang, K.Y. and Zeger, S.L. (2002) Analysis of Longitudinal Data, Oxford University Press, New York. DiRienzo, A.G. and Lagakos, S.W. (2001) Effects of model misspecification on tests of no randomized treatment effect arising from Cox's proportional hazards model. Journal of the Royal Statistical Society, Series B, Methodological, 63, 745–757. DiRienzo, A.G. and Lagakos, S.W. (2003) The effects of misspecifying Cox's regression model on randomized treatment group comparisons.
Handbook of Statistics, 23, 1–15. Dobbie, M.J. and Welsh, A.H. (2001a) Modelling correlated zero-inflated count data. Australian and New Zealand Journal of Statistics, 43(4), 431–444. Dobbie, M.J. and Welsh, A.H. (2001b) Models for zero-inflated count data using the Neyman type A distribution. Statistical Modelling, 1(1), 65–80. Dobson, A.J. (2001) An Introduction to Generalized Linear Models, Chapman & Hall/CRC, Boca Raton, FL. Dunlop, D.D., Manheim, L.M., Song, J. and Chang, R.W. (2002) Gender and ethnic/racial disparities in health care utilization among older adults. Journal of Gerontology, 57B, S221–S233. Dupuis, D.J. and Morgenthaler, S. (2002) Robust weighted likelihood estimators with an application to bivariate extreme value problems. Canadian Journal of Statistics, 30, 17–36. Dyke, G.V. and Patterson, H.D. (1952) Analysis of factorial arrangements when the data are proportions. Biometrics, 8, 1–12. Edgeworth, F.Y. (1883) The method of least squares. Philosophical Magazine, 23, 364–375. Edgeworth, F.Y. (1887) On observations relating to several quantities. Hermathena, 6, 279–285. Efron, B. (1967) The power of the likelihood ratio test. The Annals of Mathematical Statistics, 38, 802–806.
REFERENCES
Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans, vol. 38, Society for Industrial and Applied Mathematics, Philadelphia, PA. Everitt, B.S. (1994) Statistical Analysis using S-Plus, Chapman & Hall, London. Fahrmeir, L. and Tutz, G. (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, Springer, Berlin. Farebrother, R.W. (1990) [Algorithm AS 256] The distribution of a quadratic form in normal variables. Applied Statistics, 39, 294–309. Fernholz, L.T. (1983) Von Mises Calculus for Statistical Functionals (Lecture Notes in Statistics, vol. 19), Springer, New York. Field, C. and Smith, B. (1994) Robust estimation—a weighted maximum likelihood approach. International Statistical Review, 62, 405–424. Fisher, R. (1925) Statistical Methods for Research Workers, 1st edn, Oliver and Boyd, Edinburgh. Fisher, R.A. (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, 222, 309–368. Fisher, R.A. (1934) Two new properties of mathematical likelihood. Philosophical Transactions of the Royal Society A, 144, 285–307. Fitzenberger, B. and Winker, P. (2007) Improving the computation of censored quantile regressions. Computational Statistics and Data Analysis, 52, 88–108. Gail, M., Wieand, S. and Piantadosi, S. (1984) Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika, 71, 431–444. Gallant, A.R. and Tauchen, G. (1996) Which moments to match? Econometric Theory, 12, 657–681. Genton, M.G. and Ronchetti, E. (2003) Robust indirect inference. Journal of the American Statistical Association, 98(461), 67–76. Genton, M.G. and Ronchetti, E. (2008) Robust prediction of beta, in Computational Methods in Financial Engineering - Essays in Honour of Manfred Gilli (eds Kontoghiorghes, E.J., Rustem, B. and Winker, P.), Springer, Berlin, pp. 147–161. Gerdtham, U. 
(1997) Equity in health care utilization: further tests based on hurdle models and Swedish micro data. Health Economics, 6, 303–319. Gilleskie, D.B. and Mroz, T.A. (2004) A flexible approach for estimating the effect of covariates on health expenditures. Journal of Health Economics, 23, 391–418. Giltinan, D.M., Carroll, R.J. and Ruppert, D. (1986) Some new estimation methods for weighted regression when there are possible outliers. Technometrics, 28, 219–230. Gouriéroux, C., Monfort, A. and Renault, E. (1993) Indirect inference. Journal of Applied Econometrics, 8S, 85–118. Greene, W. (1997) Econometric Analysis, 3rd edn, Prentice Hall, Englewood Cliffs, NJ. Grzegorek, K. (1993) On robust estimation of baseline hazard under the Cox model via Fréchet differentiability, PhD thesis, Preprint of the Institute of Mathematics of the Polish Academy of Sciences, 518. Hammill, B.G. and Preisser, J.S. (2006) A SAS/IML software program for GEE and regression diagnostics. Computational Statistics and Data Analysis, 51, 1197–1212. Hampel, F.R. (1968) Contributions to the theory of robust estimation, PhD thesis, University of California, Berkeley, CA. Hampel, F.R. (1974) The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383–393.
Hampel, F.R. (1985) The breakdown points of the mean combined with some rejection rules. Technometrics, 27, 95–107. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986) Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, New York. Hanfelt, J.J. and Liang, K.Y. (1995) Approximate likelihood ratios for general estimating functions. Biometrika, 82, 461–477. Hardin, J.W. and Hilbe, J.M. (2003) Generalized Estimating Equations, Chapman & Hall, London. Härdle, W. (1990) Applied Nonparametric Regression, Cambridge University Press, Cambridge. Harrell, F.E.J. (2001) Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression and Survival Analysis (Springer Series in Statistics), Springer, Berlin. Harter, H.L. (1974–1976) The method of least squares and some alternatives. International Statistical Review, 42, 147–174 (Part I); 42, 235–264 (Part II); 43, 1–44 (Part III); 43, 125–190 (Part IV); 43, 269–278 (Part V); 44, 113–159 (Part VI). Harville, D.A. (1977) Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320–340. Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models, Chapman & Hall, London. Hauck, W.W. and Donner, A. (1977) Wald’s test as applied to hypotheses in logit analysis (Corr: 75, p. 482). Journal of the American Statistical Association, 72, 851–853. He, X. (1991) A local breakdown property of robust tests in linear regression. Journal of Multivariate Analysis, 38, 294–305. He, X., Simpson, D. and Portnoy, S. (1990) Breakdown robustness of tests. Journal of the American Statistical Association, 85, 446–452. Heagerty, P.J. and Zeger, S.L. (1996) Marginal regression models for clustered ordinal measurements. Journal of the American Statistical Association, 91, 1024–1036. Heagerty, P.J. and Zeger, S.L. (2000) Multivariate continuation ratio models: connections and caveats. 
Biometrics, 56(3), 719–732. Henderson, C.R. (1953) Estimation of variance and covariance components. Biometrics, 9, 226–252. Henderson, C.R., Kempthorne, O., Searle, S.R. and von Krosigk, C.N. (1959) Estimation of environmental and genetic trends from records subject to culling. Biometrics, 15, 192–218. Heritier, S. (1993) Contribution to robustness in nonlinear models: application to economic data, PhD thesis, Faculty of Economic and Social Sciences, University of Geneva, Switzerland. Heritier, S. and Galbraith, S. (2008) A revisit of robust inference in the Cox model, Technical report, University of New South Wales, Australia. Heritier, S. and Ronchetti, E. (1994) Robust bounded-influence tests in general parametric models. Journal of the American Statistical Association, 89(427), 897–904. Heritier, S. and Victoria-Feser, M.P. (1997) Practical applications of bounded-influence tests, in Handbook of Statistics, vol. 15 (eds Maddala, G. and Rao, C.), Elsevier Science, Amsterdam, pp. 77–100. Hettmansperger, T.P. (1984) Statistical Inference Based on Ranks, John Wiley & Sons, New York. Hettmansperger, T.P. and McKean, J.W. (1998) Robust Nonparametric Statistical Methods, Arnold, London.
Hjort, N. (1992) On inference in parametric survival models. International Statistical Review, 60, 355–387. Hodges, J.L.J. (1967) Efficiency in normal samples and tolerance of extreme values for some estimates of location, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, pp. 163–186. Holcomb, P. and McPherson, W. (1994) Event-related brain potentials reflect semantic priming in an object decision task. Brain and Cognition, 24, 259–276. Hollis, S. and Campbell, F. (1999) What is meant by intention to treat? Survey of published randomised clinical trials. British Medical Journal, 319, 670–674. Honore, B., Khan, S. and Powell, J.L. (2002) Quantile regression under random censoring. Journal of Econometrics, 109, 67–105. Horton, N.J. and Lipsitz, S.R. (1999) Review of software to fit generalized estimating equation regression models. The American Statistician, 53, 160–169. Huber-Carol, C. (1970) Etude Asymptotique de Tests Robustes, PhD thesis, ETH Zürich, Switzerland. Huber, P.J. (1964) Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101. Huber, P.J. (1967) The behavior of the maximum likelihood estimates under nonstandard conditions, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, pp. 221–233. Huber, P.J. (1972) Robust statistics: a review. Annals of Mathematical Statistics, 43, 1041–1067. Huber, P.J. (1973) Robust regression: Asymptotics, conjectures and Monte Carlo. Annals of Statistics, 1, 799–821. Huber, P.J. (1979) Robust smoothing, in Robustness in Statistics (eds Launer, R.L. and Wilkinson, G.N.), Academic Press, New York, pp. 33–48. Huber, P.J. (1981) Robust Statistics, John Wiley & Sons, New York. Huber, P.J. and Ronchetti, E.M. (2009) Robust Statistics, 2nd edn, John Wiley & Sons, New York. Huggins, R.M. 
(1993) A robust approach to the analysis of repeated measures. Biometrics, 49, 715–720. Huggins, R.M. and Staudte, R.G. (1994) Variance components models for dependent cell populations. Journal of the American Statistical Association, 89, 19–29. Imhof, J.P. (1961) Computing the distribution of quadratic forms in normal variables. Biometrika, 48, 352–363. Ingelfinger, J.A., Mosteller, F., Thibodeau, L.A. and Ware, J.H. (1987) Biostatistics in Clinical Medicine, 2nd edn, Macmillan, New York. Jain, A., Tindell, C.A., Laux, I., Hunter, J.B., Curran, J., Galkin, A., Afar, D.E., Aronson, N., Shak, S., Natale, R.B. and Agus, D.B. (2005) Epithelial membrane protein-1 is a biomarker of gefitinib resistance. Proceedings of the National Academy of Sciences USA, 102, 11858–11863. Kalbfleisch, J. and Prentice, R. (1980) The Statistical Analysis of Failure Time Data, John Wiley & Sons, Ltd, Chichester. Kenward, M.G. and Roger, J.H. (1997) Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983–997. Kim, C. and Bae, W. (2005) Case influence diagnostics in the Kaplan–Meier estimator and the log-rank test. Computational Statistics, 20, 521–534.
Koenker, R. (2005) Quantile Regression (Econometric Society Monographs), Cambridge University Press, New York. Koenker, R. (2008) Censored quantile regression redux. Journal of Statistical Software, 27(6), 2–25. Koenker, R. and Bassett, G. (1978) Regression quantiles. Econometrica, 46, 33–50. Koenker, R. and Bassett, G. (1982) Robust test for heteroscedasticity based on regression quantiles. Econometrica, 50, 43–62. Koenker, R. and D’Orey, V. (1987) Computing regression quantiles. Applied Statistics, 36, 383–393. Koenker, R. and Geling, O. (2001) Reappraising medfly longevity: a quantile regression survival analysis. Journal of the American Statistical Association, 96, 458–468. Koenker, R. and Hallock, K. (2001) Quantile regression: an introduction. Journal of Economic Perspectives, 15, 143–156. Kong, F.H. and Slud, E. (1997) Robust covariate-adjusted logrank tests. Biometrika, 84, 847–862. Krall, J.M., Uthoff, V.A. and Harley, J.B. (1975) A step-up procedure for selecting variables associated with survival. Biometrics, 31, 49–57. Krasker, W.S. and Welsch, R.E. (1982) Efficient bounded-influence regression estimation. Journal of the American Statistical Association, 77, 595–604. Künsch, H.R., Stefanski, L.A. and Carroll, R.J. (1989) Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84, 460–466. Kuonen, D. (1999) Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika, 86, 929–935. Kurttio, P., Komulainen, H., Leino, A., Salonen, L., Auvinen, A. and Saha, H. (2005) Bone as a possible target of chemical toxicity of natural uranium in drinking water. Environmental Health Perspectives, 113, 68–72. Laird, N. and Ware, J. (1982) Random-effects models for longitudinal data. Biometrics, 38, 963–974. Lambert, D. (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. 
Technometrics, 34, 1–14. Lange, K.L., Little, R.J.A. and Taylor, J.M.G. (1989) Robust statistical modeling using the t-distribution. Journal of the American Statistical Association, 84, 881–896. Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22. Liang, K.Y., Zeger, S.L. and Qaqish, B. (1992) Multivariate regression analyses for categorical data (Discussion: pp. 24–40). Journal of the Royal Statistical Society, Series B, Methodological, 54, 3–24. Lin, D.Y. and Wei, L.J. (1989) The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association, 84, 1074–1078. Lin, T.I. and Lee, J.C. (2006) A robust approach to t linear mixed models applied to multiple sclerosis data. Statistics in Medicine, 25, 1397–1412. Lindsey, J.K. (1997) Applying Generalized Linear Models, Springer, Berlin. Lindstrom, M.J. and Bates, D.M. (1988) Newton–Raphson and EM algorithms for linear mixed-effects models for repeated data (correction: 94(89), 1572). Journal of the American Statistical Association, 83, 1014–1022.
Litière, S., Alonso, A. and Molenberghs, G. (2007a) The impact of misspecified random-effects distribution on the estimation and performance of inferential procedures in generalized linear mixed models. Statistics in Medicine, 27, 3125–3144. Litière, S., Alonso, A. and Molenberghs, G. (2007b) Type I and type II error under random-effects misspecification in generalized linear mixed models. Biometrics, 63, 1038–1044. Littell, R.C. (2002) Analysis of unbalanced mixed model data: a case study comparison of ANOVA versus REML/GLS. Journal of Agricultural, Biological and Environmental Statistics, 7, 472–490. Lopuhaä, H.P. (1989) On the relation between S-estimators and M-estimators of multivariate location and covariance. Annals of Statistics, 17, 1662–1683. Lopuhaä, H.P. (1992) Highly efficient estimators of multivariate location with high breakdown point. Annals of Statistics, 20, 398–413. Lopuhaä, H.P. and Rousseeuw, P.J. (1991) Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics, 19, 229–248. Ma, B. and Ellis, R.E. (2003) Robust registration for computer-integrated orthopedic surgery: laboratory validation and clinical experience. Medical Image Analysis, 7(3), 237–250. Machado, J.A.F. (1993) Robust model selection and M-estimation. Econometric Theory, 9, 478–493. Mahalanobis, P.C. (1936) On the generalized distance in statistics. Proceedings of the National Institute of Science of India, 12, 49–55. Mallows, C.L. (1973) Some comments on Cp. Technometrics, 15, 661–675. Mallows, C.L. (1975) On some topics in robustness, Technical report, Bell Telephone Laboratories, Murray Hill, NJ. Marazzi, A. (1993) Algorithms, Routines and S-Functions for Robust Statistics, Wadsworth and Brooks/Cole, Belmont, CA. Marazzi, A. (2002) Bootstrap tests for robust means of asymmetric distributions with unequal shapes. Computational Statistics and Data Analysis, 39, 503–528. Marazzi, A. and Barbati, G. 
(2003) Robust parametric means of asymmetric distributions: estimation and testing. Estadistica, 54, 47–72. Marazzi, A. and Yohai, V. (2004) Adaptively truncated maximum likelihood regression with asymmetric errors. Journal of Statistical Planning and Inference, 122, 271–291. Markatou, M. and He, X. (1994) Bounded influence and high breakdown point testing procedures in linear models. Journal of the American Statistical Association, 89, 543–549. Markatou, M. and Hettmansperger, T.P. (1990) Robust bounded influence tests in linear models. Journal of the American Statistical Association, 85, 187–190. Markatou, M. and Hettmansperger, T.P. (1992) Applications of the asymmetric eigenvalue problem techniques to robust testing. Journal of Statistical Planning and Inference, 31, 51–65. Markatou, M. and Ronchetti, E. (1997) Robust inference: The approach based on influence functions, in Handbook of Statistics, Vol. 15: Robust Inference (eds Maddala, G.S. and Rao, C.), Elsevier Science, New York, pp. 49–75. Markatou, M., Basu, A. and Lindsay, B. (1997) Weighted likelihood estimating equations: the discrete case with application to logistic regression. Journal of Statistical Planning and Inference, 57, 215–232. Markatou, M., Stahel, W.A. and Ronchetti, E. (1991) Robust M-type testing procedures for linear models, in Directions in Robust Statistics and Diagnostics, Part I (eds Stahel, W.A. and Weisberg, S.), Springer, New York, pp. 201–220.
Maronna, R.A. (1976) Robust M-estimators of multivariate location and scatter. Annals of Statistics, 4, 51–67. Maronna, R.A. and Yohai, V.J. (2000) Robust regression with both continuous and categorical predictors. Journal of Statistical Planning and Inference, 89, 197–214. Maronna, R.A., Bustos, O.H. and Yohai, V.J. (1979) Bias- and efficiency-robustness of general M-estimators for regression with random carriers, in Smoothing Techniques for Curve Estimation (eds Gasser, T. and Rosenblatt, M.), Springer, New York, pp. 91–116. Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006) Robust Statistics: Theory and Methods, John Wiley & Sons, Ltd, Chichester. McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd edn, Chapman & Hall, London. McCulloch, C.E. and Searle, S.R. (2001) Generalized, Linear, and Mixed Models, John Wiley & Sons, Ltd, Chichester. McLean, R.A., Sanders, W.L. and Stroup, W.W. (1991) A unified approach to mixed models. The American Statistician, 45, 54–64. Mills, J.E., Field, C.A. and Dupuis, D.J. (2002) Marginally specified generalized linear mixed models: a robust approach. Biometrics, 58, 727–734. Min, Y. and Agresti, A. (2002) Modeling nonnegative data with clumping at zero: a survey. Journal of the Iranian Statistical Society, 1, 7–33. Minder, C.E. and Bednarski, T. (1996) A robust method for proportional hazards regression. Statistics in Medicine, 15, 1033–1047. Molenberghs, G. and Verbeke, G. (2005) Models for Discrete Longitudinal Data, Springer, Berlin. Morgenthaler, S. (1992) Least-absolute-deviations fits for generalized linear models. Biometrika, 79, 747–754. Morrell, C.H. (1998) Likelihood ratio testing of variance components in the linear mixed-effects model using restricted maximum likelihood. Biometrics, 54, 1560–1568. Moustaki, I. and Victoria-Feser, M.P. (2006) Bounded-influence robust estimation in generalized linear latent variable models. Journal of the American Statistical Association, 101(474), 644–653. 
Moustaki, I., Victoria-Feser, M.P. and Hyams, H. (1998) A UK study on the effect of socioeconomic background of pregnant women and hospital practice on the decision to breastfeed and the initiation and duration of breastfeeding, Technical Report Statistics Research Report LSERR44, London School of Economics, London. Moy, G. and Mounoud, P. (2003) Object recognition in young adults: is priming with pantomimes possible?, in Catalogue des abstracts: 8ème congrès de la société suisse de psychologie (SSP), Bern, Switzerland. Mullahy, J. (1986) Specification and testing of some modified count data models. Journal of Econometrics, 33, 341–365. Müller, S. and Welsh, A.H. (2005) Outlier robust model selection in linear regression. Journal of the American Statistical Association, 100, 1297–1310. Nardi, A. and Schemper, M. (1999) New residuals for Cox regression and their application to outlier screening. Biometrics, 55, 523–529. Nelder, J.A. (1966) Inverse polynomials, a useful group of multi-factor response functions. Biometrics, 22, 128–141. Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized linear models. Journal of the Royal Statistical Society, Series A, 135, 370–384.
Neocleous, T. and Portnoy, S. (2006) A partly linear model for censored regression quantiles, Technical report, Statistics Department, University of Illinois, IL, USA. Neocleous, T., Vanden Branden, K. and Portnoy, S. (2006) Correction to censored regression quantiles by S. Portnoy, 98 (2003), 1001–1012. Journal of the American Statistical Association, 101, 860–861. Newcomb, S. (1886) A generalized theory of the combinations of observations so as to obtain the best result. American Journal of Mathematics, 8, 343–366. Noh, M. and Lee, Y. (2007) Robust modeling for inference from generalized linear model classes. Journal of the American Statistical Association, 102(479), 1059–1072. Pan, W. (2001) Akaike’s information criterion in generalized estimating equations. Biometrics, 57(1), 120–125. Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–554. Pearson, E.S. (1931) The analysis of variance in cases of non-normal variation. Biometrika, 23, 114–133. Pearson, K. (1916) Second supplement to a memoir on skew variation. Philosophical Transactions A, 216, 429–457. Peña, D. and Yohai, V. (1999) A fast procedure for outlier diagnostics in large regression problems. Journal of the American Statistical Association, 94, 434–445. Peng, L. and Huang, Y. (2008) Survival analysis with quantile regression models. Journal of the American Statistical Association, 103, 637–649. Pinheiro, J.C. and Bates, D.M. (2000) Mixed-Effects Models in S and S-PLUS, Springer, New York. Pinheiro, J.C., Liu, C. and Wu, Y.N. (2001) Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics, 10(2), 249–276. Portnoy, S. (2003) Censored regression quantiles. Journal of the American Statistical Association, 98, 1001–1012. Potthoff, R.F. and Roy, S.N. 
(1964) A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51, 313–326. Powell, J.L. (1986) Censored regression quantiles. Journal of Econometrics, 32, 143–155. Pregibon, D. (1982) Resistant fits for some commonly used logistic models with medical applications. Biometrics, 38, 485–498. Preisser, J.S. and Qaqish, B.F. (1999) Robust regression for clustered data with applications to binary regression. Biometrics, 55, 574–579. Preisser, J.S., Galecki, A.T., Lohman, K.K. and Wagenknecht, L.E. (2000) Analysis of smoking trends with incomplete longitudinal binary responses. Journal of the American Statistical Association, 95, 1021–1031. Prentice, R.L. (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 1033–1048. Qu, A. and Song, P.X.K. (2004) Assessing robustness of generalised estimating equations and quadratic inference functions. Biometrika, 91, 447–459. Qu, A., Lindsay, B.G. and Li, B. (2000) Improving generalised estimating equations using quadratic inference functions. Biometrika, 87(4), 823–836. Rao, C.R. (1973) Linear Statistical Inference and its Applications, John Wiley & Sons, New York.
Rasch, G. (1960) Probabilistic Models for some Intelligence and Attainment Tests, Danmarks Paedagogiske Institut, Copenhagen. Reid, N. and Crépeau, H. (1985) Influence functions for proportional hazards regression. Biometrika, 72, 1–9. Reimann, C., Filzmoser, P., Garrett, R. and Dutter, R. (2008) Statistical Data Analysis Explained, John Wiley & Sons, Ltd, Chichester. Renaud, O. and Victoria-Feser, M.P. (2009) Robust coefficient of determination, Technical report, University of Geneva. Richardson, A.M. (1997) Bounded influence estimation in the mixed linear model. Journal of the American Statistical Association, 92, 154–161. Richardson, A.M. and Welsh, A.H. (1994) Asymptotic properties of the restricted maximum likelihood for hierarchical mixed models. Australian Journal of Statistics, 36, 31–43. Richardson, A.M. and Welsh, A.H. (1995) Robust restricted maximum likelihood in mixed linear models. Biometrics, 51, 1429–1439. Ridout, M., Demétrio, C.G.B. and Hinde, J. (1998) Models for count data with many zeros, in Proceedings of the 19th International Biometrics Conference, Cape Town, pp. 179–190. Rieder, H. (1978) A robust asymptotic testing model. Annals of Statistics, 6, 1080–1094. Rocke, D.M. (1996) Robustness properties of S-estimators of multivariate location and shape in high dimension. Annals of Statistics, 24, 1327–1345. Ronchetti, E. (1982a) Robust alternatives to the F-test for the linear model (STMA V24 1026), in Probability and Statistical Inference (eds Grossmann, W., Pflug, G.C. and Wertz, W.), Reidel, Dordrecht, pp. 329–342. Ronchetti, E. (1982b) Robust Testing in Linear Models: The Infinitesimal Approach, PhD thesis, ETH, Zürich, Switzerland. Ronchetti, E. (1997a) Robust inference by influence functions. Journal of Statistical Planning and Inference, 57, 59–72. Ronchetti, E. (1997b) Robustness aspects of model choice. Statistica Sinica, 7, 327–338. Ronchetti, E. (2006) Fréchet and robust statistics. 
Journal de la Société Française de Statistique, 147, 73–75. (Comment on ‘Sur une limitation très générale de la dispersion de la médiane’ by Maurice Fréchet.) Ronchetti, E. and Staudte, R.G. (1994) A robust version of Mallows’s Cp. Journal of the American Statistical Association, 89, 550–559. Ronchetti, E. and Trojani, F. (2001) Robust inference with GMM estimators. Journal of Econometrics, 101(1), 37–69. Ronchetti, E., Field, C. and Blanchard, W. (1997) Robust linear model selection by cross-validation. Journal of the American Statistical Association, 92, 1017–1023. Rousseeuw, P.J. (1984) Least median of squares regression. Journal of the American Statistical Association, 79, 871–880. Rousseeuw, P.J. and Leroy, A.M. (1987) Robust Regression and Outlier Detection, John Wiley & Sons, New York. Rousseeuw, P.J. and Ronchetti, E. (1979) The influence curve for tests, Research Report 21, ETH Zürich, Switzerland. Rousseeuw, P.J. and Ronchetti, E. (1981) Influence curves for general statistics. Journal of Computational and Applied Mathematics, 7, 161–166. Rousseeuw, P.J. and Van Driessen, K. (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
Rousseeuw, P.J. and Yohai, V.J. (1984) Robust regression by means of S-estimators, in Robust and Nonlinear Time Series Analysis (eds Franke, J., Härdle, W. and Martin, R.D.), Springer, New York, pp. 256–272. Rule, A.D., Larson, T.S., Bergstralh, E.J., Slezak, J.M., Jacobsen, S.J. and Cosio, F.G. (2004) Using serum creatinine to estimate glomerular filtration rate: Accuracy in good health and in chronic kidney disease. Annals of Internal Medicine, 141(12), 929–938. Salibian-Barrera, M. and Zamar, R.H. (2002) Bootstrapping robust estimates of regression. Annals of Statistics, 30, 556–582. Sasieni, P.D. (1993a) Maximum weighted partial likelihood estimates in the Cox model. Journal of the American Statistical Association, 88, 144–152. Sasieni, P.D. (1993b) Some new estimators for Cox regression. Annals of Statistics, 21, 1721–1759. Satterthwaite, F.E. (1941) Synthesis of variance. Psychometrika, 6, 309–316. Scheipl, F., Greven, S. and Küchenhoff, H. (2008) Size and power of tests for a zero random effect variance or polynomial regression in additive and linear mixed models. Computational Statistics and Data Analysis, 52, 3283–3299. Schemper, M. (1992) Cox analysis of survival data with non-proportional hazard functions. The Statistician, 41, 455–465. Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464. Searle, S.R., Casella, G. and McCulloch, C.E. (1992) Variance Components, John Wiley and Sons, Ltd, Chichester. Self, S.G. and Liang, K.Y. (1987) Asymptotic properties of the maximum likelihood estimators and likelihood tests under nonstandard conditions. Journal of the American Statistical Association, 82, 605–610. Silvapulle, M.J. (1992) Robust Wald-type tests of one sided hypothesis in the linear model. Journal of the American Statistical Association, 87, 156–161. Simpson, D.G., Ruppert, D. and Carroll, R.J. (1992) One-step GM estimates and stability of inferences in linear regression. 
Journal of the American Statistical Association, 87, 439–450. Sinha, S.K. (2004) Robust analysis of generalized linear mixed models. Journal of the American Statistical Association, 99(466), 451–460. Sommer, S. and Huggins, R.M. (1996) Variable selection using the Wald test and a robust Cp. Applied Statistics, 45, 15–29. Song, P.X.K. (2007) Correlated Data Analysis: Modeling, Analytics, and Applications, Springer, New York. Stahel, W.A. and Welsh, A. (1997) Approaches to robust estimation in the simplest variance components model. Journal of Statistical Planning and Inference, 57, 295–319. Stahel, W.A. and Welsh, A.H. (1992) Robust estimation of variance components, Research Report 69, ETH, Zürich. Staudte, R.G. and Sheather, S.J. (1990) Robust Estimation and Testing, John Wiley & Sons, New York. Stefanski, L.A., Carroll, R.J. and Ruppert, D. (1986) Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika, 73, 413–424. Stern, S.E. and Welsh, A.H. (1998) Likelihood inference for small variance components. The Canadian Journal of Statistics, 28, 517–532. Stigler, S.M. (1973) Simon Newcomb, Percy Daniell, and the history of robust estimation 1885–1920. Journal of the American Statistical Association, 68, 872–879.
Stone, E.J. (1873) On the rejection of discordant observations. Monthly Notices of the Royal Astronomical Society, 34, 9–15. Stone, M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B, Methodological, 39, 44–47. Stram, D.O. and Lee, J.W. (1994) Variance components testing in the longitudinal mixed effects model. Biometrics, 50, 1171–1177. Stram, D.O., Wei, L.J. and Ware, J.H. (1988) Analysis of repeated ordered categorical outcomes with possibly missing observations and time-dependent covariates. Journal of the American Statistical Association, 83, 631–637. Student (1927) Errors of routine analysis. Biometrika, 19, 151–164. Subrahmanian, K., Subrahmaniam, K. and Messeri, J.Y. (1975) On the robustness of some tests of significance in sampling from a normal population. Journal of the American Statistical Association, 70, 435–438. Tai, B.C., White, I.R., Gebski, V. and Machin, D. (2002) On the issue of ‘multiple’ first failures in competing risks analysis. Statistics in Medicine, 21, 2243–2255. Tashkin, D.P. et al. (2006) Cyclophosphamide versus placebo in scleroderma lung disease. New England Journal of Medicine, 354(25), 2655–2666. Tatsuoka, K.S. and Tyler, D.E. (2000) On the uniqueness of S-functionals and M-functionals under nonelliptical distributions. Annals of Statistics, 28(4), 1219–1243. Therneau, T.M. and Grambsch, P.M. (2000) Modeling Survival Data: Extending the Cox Model, Springer, New York. Tukey, J.W. (1960) A survey of sampling from contaminated distributions, in Contributions to Probability and Statistics (ed. Olkin, I.), Stanford University Press, Stanford, CA, pp. 448–485. Tukey, J.W. (1970) Exploratory Data Analysis, Addison-Wesley, Reading, MA. (Mimeographed preliminary edition. Published in 1977.) Valsecchi, M.G., Silvestri, D. and Sasieni, P. 
(1996) Evaluation of long-term survival: use of diagnostics and robust estimators with Cox’s proportional hazards models. Statistics in Medicine, 15, 2763–2780. Verbeke, G. and Molenberghs, G. (1997) Linear Mixed Models in Practice: A SAS-Oriented Approach (Lecture Notes in Statistics, vol. 126), Springer, New York. Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data, Springer, New York. Victoria-Feser, M.P. (2002) Robust inference with binary data. Psychometrika, 67, 21–32. Victoria-Feser, M.P. (2007) De-biasing weighted MLE via indirect inference: The case of generalized linear latent variable models. Revstat Statistical Journal, 5, 85–96. von Mises, R. (1947) On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18, 309–348. Wager, T.D., Keller, M.C., Lacey, S.C. and Jonides, J. (2003) Increased sensitivity in neuroimaging analyses using robust regression. NeuroImage, 26, 99–113. Wald, A. (1943) Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482. Wang, H.M., Jones, M.P. and Storer, B.E. (2006) Comparison of case-deletion diagnostic methods for Cox regression. Statistics in Medicine, 25, 669–683. Wang, Y.G., Lin, X. and Zhu, M. (2005) Robust estimating functions and bias correction for longitudinal data analysis. Biometrics, 61, 684–691.
REFERENCES
263
Wedderburn, R.W.M. (1974) Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61, 439–447. Welsh, A. and Richardson, A. (1997) Approaches to the robust estimation of mixed models, in Handbook of Statistics, vol. 15 (eds Maddala G and Rao C), Elsevier Science, pp. 343–385. Welsh, A.H. (1996) Aspects of Statistical Inference (Wiley Series in Probability and Statistics), John Wiley & Sons, New York. Welsh, A.H. and Ronchetti, E. (1998) Bias-calibrated estimation from sample surveys containing outliers. Journal of the Royal Statistical Society, Series B, Methodological, 60, 413–428. Welsh, A.H. and Ronchetti, E. (2002) A journey in single steps: robust one-step m-estimation in linear regression. Journal of Statistical Planning and Inference, 103(2), 287–310. Welsh, A.H., Cunningham, R.B., Donnelly, C.F. and Lindenmayer, D.B. (1996) Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88, 297–308. Wen, L., Parchman, M.L., Linn, W.D. and Lee, S. (2004) Association between self-monitoring of blood glucose and glycemic control in patients with type 2 diabetes mellitus. American Journal of Health-System Pharmacy, 61, 2401–2405. Whewell, W. (1837) History of the Inductive Sciences, from the Earliest to the Present Time, Parker, London. Whewell, W. (1840) Philosophy of the Inductive Sciences, Founded upon their History, Parker, London. Wilcox, R.R. (1997) Introduction to Robust Estimation and Hypothesis Testing, Academic Press, New York, London. Wood, A.T.A. (1989) An F approximation to the distribution of a linear combination of chisquared variables. Communications in Statistics: Simulation and Computation, 18, 1439– 1456. Wood, A.T.A., Booth, J.G. and Butler, R.W. (1993) Saddlepoint approximations to the CDF of some statistics with nonnormal limit distributions. Journal of the American Statistical Association, 88, 680–686. Yau, K.K.W. and Kuk, A.Y.C. 
(2002) Robust estimation in generalized linear mixed models. Journal of the Royal Statistical Society, Series B, Methodological, 64, 101–117. Ylvisaker, D. (1977) Test resistance. Journal of the American Statistical Association, 72, 551– 556. Yohai, V.J. (1987) High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15, 642–656. Yohai, V.J. and Zamar, R.H. (1998) Optimal locally robust m-estimates of regression. Journal of Statistical Planning and Inference, 64, 309–323. Yohai, V.J., Stahel, W.A. and Zamar, R.H. (1991) A procedure for robust estimation and inference in linear regression, in Directions in Robust Statistics and Diagnostics, part II (eds Stahel WA and Weisberg S) (The IMA Volumes in Mathematics and its Applications, vol. 34), Springer, Berlin, pp. 365–374. Zedini, A. (2007) Poisson hurdle model: Towards a robustified approach, Master’s thesis, University of Geneva. Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121–130. Zeger, S.L., Liang, K.Y. and Albert, P.S. (1988) Models for longitudinal data: a generalized estimating equation approach (Correction: 45, 347). Biometrics, 44, 1049–1060.
264
REFERENCES
Zhang, J. (1996) The sample breakdown of tests. Journal of Statistical Planning and Inference, 52, 161–181. Zhao, L.P. and Prentice, R.L. (1990) Correlated binary regression using a quadratic exponential model. Biometrika, 77, 642–648.
Index

Cp, see Mallows' Cp
RCp, see Mallows' RCp
F-test, 5, 33, 34, 40, 59–62, 83, 84, 88, 94, 106–109
χ²-distribution, see distribution
z-statistic, 35, 37, 40, 94, 105, 186, 195, 197, 199, 205–207, 216, 223
z-test, see z-statistic
adaptive
  procedure, 201, 202
  weights, 202–204, 206, 209
AIC, see Akaike information criterion
Akaike information criterion, 71, 162
  classical (AIC), 46, 72–76, 159
  generalized (GAIC), 159
  robust (RAIC), 73–77, 80, 81, 159, 231, 233
Analysis of variance, see ANOVA
ANOVA, 5, 46–48, 83, 88, 90, 94, 101, 105, 108, 125
ARE, see estimator
asymptotic rejection probability, 21, 31
bias, 14, 15, 20, 28, 48, 83, 93, 139, 160, 162, 192, 215
  asymptotic, 16–18, 21, 139, 196
  correction, 28, 139, 162
  maximal, 19–21
  residual, 7, 98, 209
binary regression, see exponential family – Bernoulli
BLUE, see estimator
bootstrap, 2, 13, 14, 74, 110, 218, 220–224
breakdown point, 14, 16, 20, 22, 23, 26, 27, 30–32, 37, 38, 44, 53, 54, 79, 84, 98–100, 102, 110, 175, 221, 229
  level, 38
  power, 38
coefficient of determination, 66–69
confidence interval, 13, 14, 33, 96, 130, 140–142, 144, 152, 154, 155, 168, 178, 197, 207–210, 220–224
  coverage, 207–209
consistency, 22, 23, 27, 28, 31, 168, 229
  correction, 24, 25, 27, 28, 30, 51, 68, 79, 98, 99, 136–139, 174, 196, 221, 239, 246
  Fisher, see consistency
contrasts, 47, 84–86, 88, 93, 94, 104, 105, 107
correlation, 8–10, 67, 69, 83, 95, 142, 146, 161, 163–170, 174, 176, 182, 245, 246
  m-dependence, 167, 176
  autoregressive, 167, 175, 176, 182
  exchangeable, 165, 170, 171, 175, 177, 181, 182, 186, 246
  serial, 134
  unstructured, 166
  working, 163–165, 168, 173
covariance (matrix), 10, 30–32, 44, 51, 87, 88, 90, 91, 93–95, 98–100, 105, 106, 108, 115, 163, 164, 175, 176, 235
Cox proportional hazard model, see hazard
datasets
  breastfeeding, 146, 150
  cardiovascular risk factors, 9–12, 78–82
Robust Methods in Biostatistics, S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser © 2009 John Wiley & Sons, Ltd
  diabetes, 58–62, 65, 69–72, 75–77, 230
  doctor visits, 151
  glomerular filtration rate (GFR), 49, 50, 54–56, 58, 62–69
  GUIDE data, 169–173, 177, 180, 181
  hospital costs, 125, 132, 140, 144, 160
  LEI data, 182, 184, 185
  metallic oxide, 116–119
  myeloma, 193, 197–199, 205–207
  orthodontic, 90, 92, 95, 96, 118, 121, 122
  semantic priming, 89, 99, 107–109, 111, 113, 114
  skin resistance, 85, 86, 88, 96, 97, 99, 103–105, 107, 110–112, 115, 116
  stillbirth in piglets, 186–188
  Veteran's Administration lung cancer, 193, 209–212, 222–224
deviance, 130–132, 142, 143
  quasi-, 126, 132, 137, 145, 162, 174, 179, 183, 185
  residuals, see residuals
  test, 61, 131–133, 142, 143, 145, 155
diagnostic, 7–9, 48, 52, 133–136, 138, 169, 178, 191, 196–198, 205, 206
distribution
  binomial, see exponential family
  Chi-squared, 31, 40, 42, 43, 52, 60, 61, 94–96, 103, 110, 111, 131–133, 144, 175, 177, 196, 206
  exponential, 127, 201, 204, 205, 207, 211, 212, 216
  Gamma, see exponential family
  gross error, 5, 17, 18, 44
  point mass, see distribution – gross error
  Poisson, see exponential family
efficiency, 13, 18, 22, 23, 25, 26, 28–30, 51, 53, 54, 57, 100, 102, 137, 138, 169, 201, 202, 204, 205, 229, 231, 232
  loss, 28, 29, 46, 58, 141, 169, 204
empirical
  IF, 17, 196, 197, 203, 205
  breakdown point, 20
  distribution, 17, 43, 101, 201, 202
estimator
  GM-, 52–54
  M-, 15, 16, 23–27, 29–31, 39, 41–44, 48, 52, 54, 61, 74, 97, 98, 100, 136, 140, 143, 144, 160, 175, 176, 195, 196, 205
  MM-, 54, 84, 100–102, 104–107, 229
  S-, 30–32, 84, 98–100, 105, 106, 138, 229, 230, 235
  adaptive robust, 202–216
  best linear unbiased, 45
  CBS–MM, 103, 105, 107–109, 111, 112, 117, 119–121
  high breakdown, 26, 27, 47, 53, 84, 100, 138, 175, 230
  Huber's, 50, 51, 53, 101, 107, 172, 177, 181, 182
  least squares, 45–52, 54–56, 59, 60, 62–66, 68–74, 76, 118, 119, 229, 230
  Mallows', 52, 136, 147, 152, 156, 157, 172, 177, 189
  maximum likelihood, 5, 13, 16, 22–25, 27–29, 31, 38–41, 45–48, 51, 55, 56, 83, 84, 91–94, 97, 98, 101, 102, 107, 110, 113, 123, 126, 130–134, 152, 182
  partial likelihood, 13, 191–208, 210, 213–215, 222
  restricted (or residual) maximum likelihood (REML), 83, 84, 91, 93, 94, 96–98, 103, 105, 107–112, 117, 120
  Tukey's biweight, 27–29, 55, 60, 61, 63, 66, 76, 77, 79, 81, 99, 102, 106, 107, 231
  weighted maximum likelihood, 24–29, 50, 53
  weighted partial likelihood, 192, 200
excess of zeros, 5, 152, 158
exchangeable, 166
exponential family, 125, 127, 130, 158, 161, 163, 164, 237
  Bernoulli, 125, 126, 134, 136, 146, 160, 173, 175, 181, 186–188
  binary, see Bernoulli
  binomial, 5, 127–131, 138–140, 151, 158, 164, 237, 239, 240
  Gamma, 127, 129, 131–135, 139, 140, 150, 157, 217, 238–241
  Poisson, 43, 127–131, 138–140, 150, 152, 155, 157, 158, 164, 175, 237, 239–241
exponential weight, 204–208, 210–212, 216
fitted value, 112, 115, 116, 120, 122, 130, 131, 134, 135, 149, 160
generalized linear model, 15, 39, 53, 61, 125, 126, 128–130, 132–134, 136–138, 142, 151, 152, 157–165, 171, 172, 174, 179–181, 189
GES, see gross error sensitivity
GLM, see generalized linear model
gross error model / data generating process, see distribution – gross error
gross error sensitivity, 19–21, 36, 54
hat matrix, 52, 133, 174, 246
hazard, 13, 191, 193, 194, 196, 200, 201, 203, 204, 207, 210, 212, 215, 216, 222–225
  baseline, 193, 194
  cumulative, 194, 201, 202, 213, 215, 221, 224
  function, 192, 193
  proportional, 192
  proportional – Cox model, 9, 12, 191, 193, 194, 204, 214, 221, 224
high breakdown estimator, see estimator
Huber's
  ψ function, 25, 51
  ρ function, 25
  estimator, see estimator
  proposal II, 51, 53, 98, 139, 174, 182, 186
  weight, 25, 26, 50, 53, 101, 174, 175, 181, 183, 185
hurdle model, 5, 158, 159
IF, see influence function
indirect inference, 27, 139
influence curve, see sensitivity curve
influence function, 15–25, 36, 37, 43, 44, 48, 84, 97, 114, 140, 176, 180, 192, 193, 196–198, 203
  empirical, see empirical
IRWLS, see iterative reweighted least squares
iterative reweighted least squares, 51, 53, 54, 79, 126, 129, 137, 165
Kaplan–Meier, 191, 213, 214, 219, 220
leverage, 52, 100, 101, 133, 135, 136, 138, 141, 147, 174, 177, 221, 224
likelihood
  quasi-, 123, 126, 130, 132, 136, 140, 143, 158, 159, 162, 165, 179, 180, 189
likelihood ratio test
  classical (S², LRT), 38, 40, 42, 44, 46, 59, 60, 70, 83, 94–96, 106, 129–131, 142, 195
  robust (Sρ², LRTρ), 42, 61, 62, 84, 100, 106–110, 206, 231, 233
linear model, see regression model
link function, 128, 131, 132, 138, 143, 145, 151, 152, 155, 157, 163–165, 169, 186, 193
logistic regression, see exponential family – Bernoulli
logit, see link function
LRT, see likelihood ratio test
LS, see estimator
LW variance, see variance – sandwich
Mallows'
  Cp, 46, 73, 74, 159, 189
  RCp, 231, 233
Mallows' estimator, see estimator
marginal longitudinal data model, 15, 53, 162, 164
masking effect, 8, 48, 134, 172, 206
missing covariate, 6, 9, 200
mixed linear model, 6, 9, 13–15, 27, 30, 32, 39, 48, 83, 86, 87, 94, 95, 97–100, 102, 110, 112, 123, 161, 162, 165, 204
MLDA, see marginal longitudinal data model
MLE, see estimator
MLM, see mixed linear model
model misspecification, 2, 4–6, 13, 14, 16, 17, 19–21, 35, 37, 136, 193, 214
  distributional, 6, 12, 215
  structural, 6, 9, 199, 214–216
overdispersion, 130, 164, 176
PLE, see estimator
point mass contamination, see distribution – point mass
predicted value, 64, 69, 73, 112, 115
prediction, 1, 54, 69, 72, 84, 112–114, 123, 160, 213
proportional hazard, see hazard
R-squared, see coefficient of determination
RAIC, see Akaike information criterion
Rao test, see score or Rao test
regression model, 4, 8–10, 14, 15, 24–27, 30, 39, 41–48, 53, 55, 56, 58–62, 67, 69–71, 73, 79, 80, 83, 100, 112, 115, 118, 125, 137–139, 143, 159, 192, 204, 209, 229, 230
  non-parametric, 14
  quantiles, 192, 212, 217–220
  quantiles – censored, 192, 193, 217, 219, 222, 224
rejection point, 21, 26
REML, see estimator
residual
  analysis, 8, 48, 62–68, 70, 75, 80, 82, 112, 113, 133, 134, 145, 172
  deviance, 133, 135
  Pearson, 122, 133, 135, 136, 160, 166, 172–174
risk set, 194, 196, 197, 211, 212
robustness
  of efficiency, 34, 35, 38, 44, 201
  of validity, 33, 34, 38, 43, 44, 207
score or Rao test
  classical (R²), 39, 142
  robust (Rρ²), 41, 42, 106
sensitivity curve, 16–18, 23
survival curve, 213, 214
ties, 195, 199, 204, 205, 210, 213
Tukey's bisquare, see Tukey's biweight
Tukey's biweight
  ψ function, 26, 27, 53
  ρ function, 26, 30, 31, 99, 101, 102, 107, 114
  estimator, see estimator
  weights, 68, 101, 107
tuning constant / parameter, 26, 27, 29, 30, 53, 99, 100, 102, 145, 186, 204
variable selection, 46, 59, 70, 73, 74, 79, 80, 126, 142, 144, 147, 148, 150, 154, 162, 179, 182
variance
  asymptotic, 18, 29, 36, 39–42, 57, 93, 100, 101, 105, 137, 140, 158, 168, 195–198, 203, 206, 216, 220, 229, 240
  sandwich, 100, 102, 192, 193, 198–200, 202, 203, 207, 208, 216
Wald test, 6
  classical (W²), 38–41, 46, 61, 62, 74, 83, 94–96, 106, 129, 130, 142, 144, 195, 200, 206, 207, 209
  robust (Wρ²), 16, 41–44, 106, 107, 110, 206–208, 216
weighted partial likelihood, see estimator
WMLE, see estimator
zero-inflated model, 5, 158, 162
SILVAPULLE and SEN · Constrained Statistical Inference: Inequality, Order and Shape Restrictions SINGPURWALLA · Reliability and Risk: A Bayesian Perspective SMALL and MCLEISH · Hilbert Space Methods in Probability and Statistical Inference SRIVASTAVA · Methods of Multivariate Statistics STAPLETON · Linear Statistical Models STAUDTE and SHEATHER · Robust Estimation and Testing STOYAN, KENDALL and MECKE · Stochastic Geometry and Its Applications, Second Edition STOYAN and STOYAN · Fractals, Random and Point Fields: Methods of Geometrical Statistics STYAN · The Collected Papers of T. W. Anderson: 1943–1985 SUTTON, ABRAMS, JONES, SHELDON and SONG · Methods for Meta-Analysis in Medical Research TANAKA · Time Series Analysis: Nonstationary and Noninvertible Distribution Theory THOMPSON · Empirical Model Building THOMPSON · Sampling, Second Edition THOMPSON · Simulation: A Modeler’s Approach THOMPSON and SEBER · Adaptive Sampling THOMPSON, WILLIAMS and FINDLAY · Models for Investors in Real World Markets TIAO, BISGAARD, HILL, PEÑA and STIGLER (editors) · Box on Quality and Discovery: with Design, Control and Robustness TIERNEY · LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics TSAY · Analysis of Financial Time Series UPTON and FINGLETON · Spatial Data Analysis by Example, Volume II: Categorical and Directional Data VAN BELLE · Statistical Rules of Thumb *Now available in a lower priced paperback edition in the Wiley Classics Library.
VAN BELLE, FISHER, HEAGERTY and LUMLEY · Biostatistics: A Methodology for the Health Sciences, Second Edition VESTRUP · The Theory of Measures and Integration VIDAKOVIC · Statistical Modeling by Wavelets VINOD and REAGLE · Preparing for the Worst: Incorporating Downside Risk in Stock Market Investments WALLER and GOTWAY · Applied Spatial Statistics for Public Health Data WEERAHANDI · Generalized Inference in Repeated Measures: Exact Methods in MANOVA and Mixed Models WEISBERG · Applied Linear Regression, Second Edition WELISH · Aspects of Statistical Inference WESTFALL and YOUNG · Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment WHITTAKER · Graphical Models in Applied Multivariate Statistics WINKER · Optimization Heuristics in Economics: Applications of Threshold Accepting WONNACOTT and WONNACOTT · Econometrics, Second Edition WOODING · Planning Pharmaceutical Clinical Trials: Basic Statistical Principles WOOLSON and CLARKE · Statistical Methods for the Analysis of Biomedical Data, Second Edition WU and HAMADA · Experiments: Planning, Analysis and Parameter Design Optimization WU and ZHANG · Nonparametric Regression Methods for Longitudinal Data Analysis: Mixed-Effects Modeling Approaches YANG · The Construction Theory of Denumerable Markov Processes YOUNG, VALERO-MORA and FRIENDLY · Visual Statistics: Seeing Data with Dynamic Interactive Graphics *ZELLNER · An Introduction to Bayesian Inference in Econometrics ZELTERMAN · Discrete Distributions: Applications in the Health Sciences ZHOU, OBUCHOWSKI and McCLISH · Statistical Methods in Diagnostic Medicine
*Now available in a lower priced paperback edition in the Wiley Classics Library.
Robust Methods in Biostatistics
WILEY SERIES IN PROBABILITY AND STATISTICS Established by WALTER A. SHEWHART and SAMUEL S. WILKS Editors David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg, Harvey Goldstein. Editors Emeriti Vic Barnett, J. Stuart Hunter, Jozef L. Teugels A complete list of the titles in this series appears at the end of this volume.
Robust Methods in Biostatistics Stephane Heritier The George Institute for International Health, University of Sydney, Australia
Eva Cantoni Department of Econometrics, University of Geneva, Switzerland
Samuel Copt Merck Serono International, Geneva, Switzerland
Maria-Pia Victoria-Feser HEC Section, University of Geneva, Switzerland
A John Wiley and Sons, Ltd, Publication
This edition first published 2009
© 2009 John Wiley & Sons Ltd

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com. The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Robust methods in biostatistics / Stephane Heritier . . . [et al.].
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-02726-4 (cloth)
1. Biometry–Statistical methods. I. Heritier, Stephane.
[DNLM: 1. Biometry–methods.
WA 950 R667 2009] QH323.5.R615 2009 570.1’5195–dc22 2009008863 A catalogue record for this book is available from the British Library. ISBN 9780470027264 Set in 10/12pt Times by Sunrise Setting Ltd, Torquay, UK. Printed in Great Britain by CPI Antony Rowe, Chippenham, Wiltshire.
To Anna, Olivier, Cassandre, Oriane, Sonia, Johannes, Véronique, Sébastien and Raphaël, who contributed in their ways. . .
Contents

Preface xiii

Acknowledgments xv

1 Introduction 1
   1.1 What is Robust Statistics? 1
   1.2 Against What is Robust Statistics Robust? 3
   1.3 Are Diagnostic Methods an Alternative to Robust Statistics? 7
   1.4 How do Robust Statistics Compare with Other Statistical Procedures in Practice? 11

2 Key Measures and Results 15
   2.1 Introduction 15
   2.2 Statistical Tools for Measuring Robustness Properties 16
      2.2.1 The Influence Function 17
      2.2.2 The Breakdown Point 20
      2.2.3 Geometrical Interpretation 20
      2.2.4 The Rejection Point 21
   2.3 General Approaches for Robust Estimation 21
      2.3.1 The General Class of M-estimators 23
      2.3.2 Properties of M-estimators 27
      2.3.3 The Class of S-estimators 30
   2.4 Statistical Tools for Measuring Tests Robustness 32
      2.4.1 Sensitivity of the Two-sample t-test 34
      2.4.2 Local Stability of a Test: the Univariate Case 34
      2.4.3 Global Reliability of a Test: the Breakdown Functions 37
   2.5 General Approaches for Robust Testing 38
      2.5.1 Wald Test, Score Test and LRT 39
      2.5.2 Geometrical Interpretation 40
      2.5.3 General 0001-type Classes of Tests 40
      2.5.4 Asymptotic Distributions 42
      2.5.5 Robustness Properties 43

3 Linear Regression 45
   3.1 Introduction 45
   3.2 Estimating the Regression Parameters 47
      3.2.1 The Regression Model 47
      3.2.2 Robustness Properties of the LS and MLE Estimators 48
      3.2.3 Glomerular Filtration Rate (GFR) Data Example 49
      3.2.4 Robust Estimators 50
      3.2.5 GFR Data Example (continued) 54
   3.3 Testing the Regression Parameters 55
      3.3.1 Significance Testing 55
      3.3.2 Diabetes Data Example 58
      3.3.3 Multiple Hypothesis Testing 59
      3.3.4 Diabetes Data Example (continued) 61
   3.4 Checking and Selecting the Model 62
      3.4.1 Residual Analysis 62
      3.4.2 GFR Data Example (continued) 62
      3.4.3 Diabetes Data Example (continued) 65
      3.4.4 Coefficient of Determination 66
      3.4.5 Global Criteria for Model Comparison 69
      3.4.6 Diabetes Data Example (continued) 75
   3.5 Cardiovascular Risk Factors Data Example 78

4 Mixed Linear Models 83
   4.1 Introduction 83
   4.2 The MLM 84
      4.2.1 The MLM Formulation 84
      4.2.2 Skin Resistance Data 88
      4.2.3 Semantic Priming Data 89
      4.2.4 Orthodontic Growth Data 90
   4.3 Classical Estimation and Inference 91
      4.3.1 Marginal and REML Estimation 91
      4.3.2 Classical Inference 94
      4.3.3 Lack of Robustness of Classical Procedures 96
   4.4 Robust Estimation 97
      4.4.1 Bounded Influence Estimators 97
      4.4.2 S-estimators 98
      4.4.3 MM-estimators 100
      4.4.4 Choosing the Tuning Constants 102
      4.4.5 Skin Resistance Data (continued) 103
   4.5 Robust Inference 104
      4.5.1 Testing Contrasts 104
      4.5.2 Multiple Hypothesis Testing of the Main Effects 106
      4.5.3 Skin Resistance Data Example (continued) 107
      4.5.4 Semantic Priming Data Example (continued) 107
      4.5.5 Testing the Variance Components 110
   4.6 Checking the Model 110
      4.6.1 Detecting Outlying and Influential Observations 110
      4.6.2 Prediction and Residual Analysis 112
   4.7 Further Examples 116
      4.7.1 Metallic Oxide Data 116
      4.7.2 Orthodontic Growth Data (continued) 118
   4.8 Discussion and Extensions 122

5 Generalized Linear Models 125
   5.1 Introduction 125
   5.2 The GLM 126
      5.2.1 Model Building 126
      5.2.2 Classical Estimation and Inference for GLM 129
      5.2.3 Hospital Costs Data Example 132
      5.2.4 Residual Analysis 133
   5.3 A Class of M-estimators for GLMs 136
      5.3.1 Choice of ψ and w(x) 137
      5.3.2 Fisher Consistency Correction 138
      5.3.3 Nuisance Parameters Estimation 139
      5.3.4 IF and Asymptotic Properties 140
      5.3.5 Hospital Costs Example (continued) 140
   5.4 Robust Inference 141
      5.4.1 Significance Testing and CIs 141
      5.4.2 General Parametric Hypothesis Testing and Variable Selection 142
      5.4.3 Hospital Costs Data Example (continued) 144
   5.5 Breastfeeding Data Example 146
      5.5.1 Robust Estimation of the Full Model 146
      5.5.2 Variable Selection 148
   5.6 Doctor Visits Data Example 151
      5.6.1 Robust Estimation of the Full Model 151
      5.6.2 Variable Selection 154
   5.7 Discussion and Extensions 158
      5.7.1 Robust Hurdle Models for Counts 158
      5.7.2 Robust Akaike Criterion 159
      5.7.3 General Cp Criterion for GLMs 159
      5.7.4 Prediction with Robust Models 160

6 Marginal Longitudinal Data Analysis 161
   6.1 Introduction 161
   6.2 The Marginal Longitudinal Data Model (MLDA) and Alternatives 163
      6.2.1 Classical Estimation and Inference in MLDA 164
      6.2.2 Estimators for τ and α 166
      6.2.3 GUIDE Data Example 169
      6.2.4 Residual Analysis 171
   6.3 A Robust GEE-type Estimator 172
      6.3.1 Linear Predictor Parameters 172
      6.3.2 Nuisance Parameters 174
      6.3.3 IF and Asymptotic Properties 176
      6.3.4 GUIDE Data Example (continued) 177
   6.4 Robust Inference 178
      6.4.1 Significance Testing and CIs 178
      6.4.2 Variable Selection 179
      6.4.3 GUIDE Data Example (continued) 180
   6.5 LEI Data Example 182
   6.6 Stillbirth in Piglets Data Example 186
   6.7 Discussion and Extensions 189

7 Survival Analysis 191
   7.1 Introduction 191
   7.2 The Cox Model 193
      7.2.1 The Partial Likelihood Approach 193
      7.2.2 Empirical Influence Function for the PLE 196
      7.2.3 Myeloma Data Example 197
      7.2.4 A Sandwich Formula for the Asymptotic Variance 198
   7.3 Robust Estimation and Inference in the Cox Model 200
      7.3.1 A Robust Alternative to the PLE 200
      7.3.2 Asymptotic Normality 202
      7.3.3 Handling of Ties 204
      7.3.4 Myeloma Data Example (continued) 205
      7.3.5 Robust Inference and its Current Limitations 206
   7.4 The Veteran's Administration Lung Cancer Data 209
      7.4.1 Robust Estimation 209
      7.4.2 Interpretation of the Weights 210
      7.4.3 Validation 212
   7.5 Structural Misspecifications 214
      7.5.1 Performance of the ARE 214
      7.5.2 Performance of the robust Wald test 216
      7.5.3 Other Issues 217
   7.6 Censored Regression Quantiles 217
      7.6.1 Regression Quantiles 217
      7.6.2 Extension to the Censored Case 219
      7.6.3 Asymptotic Properties and Robustness 220
      7.6.4 Comparison with the Cox Proportional Hazard Model 221
      7.6.5 Lung Cancer Data Example (continued) 222
      7.6.6 Limitations and Extensions 224

Appendices 227
A Starting Estimators for MM-estimators of Regression Parameters 229
B Efficiency, LRTρ, RAIC and RCp with Biweight ρ-function for the Regression Model 231
C An Algorithm Procedure for the Constrained S-estimator 235
D Some Distributions of the Exponential Family 237
E Computations for the Robust GLM Estimator 239
   E.1 Fisher Consistency Corrections 239
   E.2 Asymptotic Variance 240
   E.3 IRWLS Algorithm for Robust GLM 242
F Computations for the Robust GEE Estimator 245
   F.1 IRWLS Algorithm for Robust GEE 245
   F.2 Fisher Consistency Corrections 246
G Computation of the CRQ 247

References 249

Index 265
Preface

The use of statistical methods in medicine, genetics and more generally in health sciences has increased tremendously in the past two decades. More often than not, a parametric or semi-parametric model is used to describe the data, and standard estimation and testing procedures are carried out. However, the validity and good performance of such procedures generally require strict adherence to the model assumptions, a condition that is in stark contrast with experience gained from field work. Indeed, the postulated models are often chosen because they help to understand a phenomenon, not because they fit the data at hand exactly. Robust statistics is an extension of classical statistics that specifically takes into account the fact that the underlying models used by analysts are only approximate. The basic philosophy of robust statistics is to produce statistical procedures that are stable with respect to small changes in the data or to small model departures. These include ‘outliers’, influential observations and other more sophisticated deviations from the model or model misspecifications. There has been considerable work in robust statistics in the last forty years following the pioneering work of Tukey (1960), Huber (1964) and Hampel (1968), and the theory now covers all models and techniques commonly used in biostatistics. However, the lack of a simple introduction to the basic concepts, the absence of meaningful examples presented at the appropriate level and the difficulty in finding suitable implementations of robust procedures other than robust linear regression have impeded the development and dissemination of such methods. Meanwhile, biostatisticians continue to use ‘ad-hoc’ techniques to deal with outliers and underestimate the impact of model misspecifications.
This book is intended to fill the existing gap and present robust techniques in a consistent and understandable manner to all researchers in the health sciences and related fields interested in robust methods. Real examples chosen from the authors’ experience or for their relevance in biomedical research are used throughout the book to motivate robustness issues, explain the central ideas and concepts, and illustrate similarities and differences with the classical approach. This material has previously been tested in several short and regular courses in academia, from which valuable feedback has been gained. In addition, the R code and data used for all examples discussed in the book are available on the supporting website (http://www.wiley.com/go/heritier). The data-based approach presented here makes it possible to acquire both the conceptual framework and the practical tools for not only a good introduction but also practical training in robust methods for a large spectrum of statistical models.
The book is organized as follows. Chapter 1 pitches robustness in the history of statistics and clarifies what it is supposed to do and not to do. Concepts and results are introduced in a general framework in Chapter 2. This chapter is more formalized as it presents the ideas and the results in their full generality. It presents in a more mathematical manner the basic concepts and statistical tools used throughout the book, to which the interested reader can refer when studying a particular model presented in one of the following chapters. Fundamental tools such as the influence function, the breakdown point and M-estimators are defined here and illustrated through examples. Chapters 3 to 7 are structured by model and include specific elements of theory but the emphasis is on data analysis and interpretation of the results. These five chapters deal respectively with robust methods in linear regression, mixed linear models, generalized linear models, marginal longitudinal data models, and models for survival analysis. Techniques presented in this book focus in particular on estimation, uni- and multivariate testing, model selection, model validation through prediction and residual analysis, and diagnostics. Chapters can be read independently of each other but starting with linear regression (Chapter 3) is recommended. A short introduction to the corresponding classical procedures is given at the beginning of each chapter to facilitate the transition from the classical to the robust approach. It is however assumed that the reader is reasonably familiar with classical procedures. Finally, some of the computational aspects are discussed in the appendix. 
The intended audience for this book includes: biostatisticians who wish to discover robust statistics and/or update their knowledge with the more recent developments; applied researchers in medical or health sciences interested in this topic; advanced undergraduate or graduate students acquainted with the classical theory of their model of interest; and also researchers outside the medical sciences, such as scientists in the social sciences, psychology or economics. The book can be read at different levels. Readers mainly interested in the potential of robust methods and their applications in their own field should grasp the basic statistical methods relevant to their problem and focus on the examples given in the book. Readers interested in understanding the key underpinnings of robust methods should have a background in statistics at the undergraduate level and, for the understanding of the finer theoretical aspects, a background at the graduate level is required. Finally, the datasets analyzed in this book can be used by the statistician familiar with robustness ideas as examples that illustrate the practice of robust methods in biostatistics. The book does not include all the available robust tools developed so far for each model, but rather a selected set that has been chosen for its practical use in biomedical research. The emphasis has been put on choosing only one or two methods for each situation, the methods being selected for their efficiency (at different levels) and their practicality (i.e. their implementation in the R package robustbase), hence making them directly available to the data analyst. This book would not exist without the hard work of all the statisticians who have contributed directly or indirectly to the development of robust statistics, not only the ones cited in this book but also those that are not.
Acknowledgements

We are indebted to Elvezio Ronchetti and Chris Field for stimulating discussions, comments on early versions of the manuscript and their encouragement during the writing process; to Tadeusz Bednarski for valuable exchanges about robust methods for the Cox model and for providing his research code; and to Steve Portnoy for his review of the section on censored regression quantiles. We also thank Sally Galbraith, Serigne Lo and Werner Stahel for reading some parts of the manuscript and for giving useful comments, Dominique Couturier for his invaluable help in the development of R code for the regression model and the mixed linear model, and Martin Mächler and Andreas Ruckstuhl for implementing the robust GLM in the robustbase package. The GFR data have been provided by Judy Simpson and the cardiovascular data by Pascal Bovet. Finally, we would like to thank the staff at Wiley for their support, as well as our respective institutions and our understanding colleagues and students, who had to endure our regular ‘blackouts’ from daily work during the writing of this book.
1 Introduction
1.1 What is Robust Statistics?

The scientific method is a set of principles and procedures for the systematic pursuit of knowledge involving the recognition and formulation of a problem, the collection of data through observation and experiment, and the formulation and testing of hypotheses (Merriam-Webster online dictionary, http://merriam-webster.com). Although procedures may differ according to the field of study, scientific researchers agree that hypotheses need to be stated as explanations of phenomena, and experimental studies need to be designed to test these hypotheses. In a more philosophical perspective, the hypothetico-deductive model for scientific methods (Whewell, 1837, 1840) was formulated as the following four steps: (1) characterizations (observations, definitions and measurements of the subject of inquiry); (2) hypotheses (theoretical, hypothetical explanations of observations and measurements of the subject); (3) predictions (possibly through a model; logical deduction from the hypothesis or theory); (4) experiments (tests of (2) and (3), essentially to disprove them). It is obvious that statistical theory plays an important role in this process. Not only are measurements usually subject to uncertainty, but experiments are also set up using the theory of experimental design, and predictions are often made through a statistical model that accounts for the uncertainty or the randomness of the measurements. As statisticians, however, we are aware that models are at best approximations (at least for the random part), and this introduces another type of uncertainty into the process. G. E. P. Box’s famous dictum that ‘all models are wrong, some models are useful’ (Box, 1979) is often cited by the researcher when faced with the data to analyze. Hence, for truly honest scientific research, statistics should offer methods that deal not only with the uncertainty of the collected information (sampling error) but also with the fact that models are at best an approximation of reality. Consequently, statistics should be in ‘some sense’ robust to model misspecifications. This is important since the aim of scientific research is the pursuit of knowledge that is ultimately used to improve the well-being of people, as is obviously the case, for example, in medical research.

Robust methods date back to the prehistory of statistics and naturally start with outlier detection techniques and the subsequent treatment of the data. Mathematicians of the 18th century such as Bernoulli (1777) were already questioning the appropriateness of rejection rules, a common practice among astronomers of the time. The first formal rejection rules were suggested in the second part of the 19th century; see Hampel et al. (1986, p. 34) for details. Student (1927) proposed repetition (additional observations) in the case of outliers, combined with rejection. Independently, the use of mixture models and of simple estimators that can partly downweight observations appeared from 1870 onwards; see Stone (1873), Edgeworth (1883), Newcomb (1886) and others. Newcomb even imagined a procedure that can be posthumously described as a sort of one-step Huber estimator (see Stigler, 1973). These attempts to reduce the influence of outliers, to make them harmless instead of discarding them, are in the same spirit as modern robustness theory; see Huber (1972), Harter (1974–1976), Barnett and Lewis (1978) and Stigler (1973). The idea of a ‘supermodel’ was proposed by Pearson (1916), who embedded the normal model, which had gained a central role at the turn of the 20th century, into a system of Pearson curves derived from differential equations. The curves are actually distributions where two additional parameters are added to ‘accommodate’ most deviations from normality.
The discovery of the drastic instability of the test for equality of variance by Pearson (1931) sparked the systematic study of the non-robustness of tests. Exact references on these developments can be found in Hampel et al. (1986, pp. 35–36). The term robust (strong, sturdy, rough) itself appears to have been proposed in the statistical literature by Box (1953). The field of modern robust statistics finally emerged with the pioneering works of Tukey (1960), Huber (1964) and Hampel (1968), and has been intensively developed ever since. Indeed, a rough bibliographic search in the Current Index to Statistics (http://www.statindex.org/) revealed that since 1960 the number of articles having the word ‘robust’ in their title and/or in their keywords list has increased dramatically (see Figure 1.1). Compared with other well-established keywords, ‘robust’ appears to be quite popular: roughly half as popular as ‘Bayesian’ and ‘design’, but more popular than ‘survival’, ‘bootstrap’, ‘rank’ and ‘smoothing’. Is robust statistics really as popular as it appears to be, in that it is used fairly routinely in practical data analysis? We do not really believe so. It might be that the word ‘robust’ is associated with other keywords such as ‘rank’, ‘smoothing’ or ‘design’ because of the perceived nature of these methods or procedures. We also performed a rough bibliographic search under the same conditions as before, but with the combination of the word ‘robust’ and each of the other words. The result is presented in Figure 1.2. It appears that although ‘robust’ is relatively more associated
with ‘design’ and ‘Bayesian’, when we remove all of the combined associations there remain 4367 articles citing the word ‘robust’ (group ‘other’), a fairly large number. We believe that this rather impressive number of articles have often used the term ‘robust’ in quite different manners. At this point, it could be worth searching more deeply, for example by taking a sample of articles and looking at the possible meanings or uses of the statistical term ‘robust’, but we do not attempt that here. Instead, we will clarify in what sense we use the term ‘robust’ or ‘robustness’ in the present book. We hope that this will help in clarifying, for the scientist, the extent and limitations of the theory of robust statistics as set by Tukey (1960), Huber (1964) and Hampel (1968).

Figure 1.1 Number of articles (average per 2 years) citing the selected words in the title or in the keywords list according to the Current Index to Statistics (http://www.statindex.org/), December 2007. [Figure: counts of articles per year, 1960–2005, for the keywords ‘Robust’, ‘Bayesian’, ‘Bootstrap’, ‘Smoothing’, ‘Rank’, ‘Survival’ and ‘Design’.]
1.2 Against What is Robust Statistics Robust?

Robust statistics aims at producing consistent and reasonably efficient estimators, and test statistics with stable level and power, when the model is slightly misspecified.
Figure 1.2 Number of articles (average per 6 years) citing the selected words together with ‘robust’ in the title or in the keywords list according to the Current Index to Statistics (http://www.statindex.org/), December 2007.
Model misspecifications encompass a relatively large set of possibilities, and robust statistics cannot deal with all types of model misspecification. First, we characterize the model using a cumulative probability distribution F_θ that captures the structural part as well as the random part of the model. The parameters needed for the structural part and/or the random part are included in the parameter vector θ. For example, in the regression model that is thoroughly studied in Chapter 3, θ contains the (linear) regression coefficients (structural part) as well as the residual error variance (random part), and F_θ is the (conditional) normal distribution of the response variable (given the set of explanatory variables). Here F_θ does not need to be fully parametric; e.g. the Cox model presented in Chapter 7 can also be used. Then, by ‘slight model misspecification’, we assume that the

H_0 : D = [ σ²_{γ0}  0 ; 0  0 ]

against the alternative

H_1 : D = [ σ²_{γ0}  σ_{γ01} ; σ_{γ01}  σ²_{γ1} ],

with σ²_{γ1} > 0 to guarantee that D is positive-definite. As two additional parameters, σ_{γ01} and σ²_{γ1}, have been added to the model, a naive application of the classical theory would compare the corresponding LRT statistic with a χ²_2 distribution. The exact theory states that a mixture with equal weights 0.5 for χ²_1 and χ²_2 must be used. Therefore, a naive analysis could lead to larger p-values and, hence, to the acceptance of oversimplified variance structures. This result also holds for the REML-based LRT (Morrell, 1998).
MIXED LINEAR MODELS
Table 4.1 Estimates and standard errors for the REML for the skin resistance data using model (4.2)–(4.3), with and without observation 15.

            REML                       REML without observation 15
Parameter   Estimate (SE)    p-value   Estimate (SE)    p-value
µ            2.030 (0.341)   <10^-4     1.817 (0.284)   <10^-4
λ1          −0.213 (0.334)   0.525      0.076 (0.246)   0.756
λ2           0.842 (0.334)   0.014      0.580 (0.246)   0.221
λ3           0.549 (0.334)   0.105      0.234 (0.246)   0.345
λ4          −0.526 (0.334)   0.120     −0.399 (0.246)   0.110
σ_s          1.190                      0.994
σ_ε          1.495                      1.068
As it is more accurate in small samples, we use this variant on the orthodontic growth data. The LRT statistic for testing H_0 : σ²_{γ1} = 0, σ_{γ01} = 0 returns a value of 2(−216.3 + 216.9) = 1.2. A correct p-value is therefore p = 0.5 · P(χ²_2 > 1.2) + 0.5 · P(χ²_1 > 1.2) = 0.41, whereas the naive calculation yields p = 0.55. In this case, both procedures conclude that a second random effect is probably not necessary (assuming that no robustness issue arises here). Stram and Lee (1994) also consider the case of testing k versus k + 1 random effects. In that case, a mixture with equal weights 0.5 for χ²_k and χ²_{k+1} is obtained for the asymptotic distribution. A more complex mixture is also available when l > 1 random effects are added to the model, but it requires complex calculations. Again, extensions of these results to LRTs based on the REML are possible (see Morrell, 1998). Recent work by Scheipl et al. (2008) shows that they are generally more powerful and should therefore be preferred. The Wald test and classical confidence intervals for the variance parameters must also be corrected, bearing in mind that they are generally outperformed by their LRT counterparts. Finally, a good account of these problems, with applications, can be found in Verbeke and Molenberghs (2000, pp. 64–74).
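The mixture correction above is straightforward to reproduce numerically. A minimal sketch using scipy, with the numbers from the orthodontic growth example:

```python
from scipy.stats import chi2

def mixture_pvalue(lrt, k):
    """P-value for testing k versus k+1 random effects: equal-weight
    mixture of chi-square distributions with k and k+1 df."""
    return 0.5 * chi2.sf(lrt, k + 1) + 0.5 * chi2.sf(lrt, k)

lrt = 2 * (-216.3 + 216.9)          # LRT statistic = 1.2
p_correct = mixture_pvalue(lrt, 1)  # mixture of chi2_1 and chi2_2
p_naive = chi2.sf(lrt, 2)           # naive chi2_2 reference

print(round(p_correct, 2))  # 0.41, as in the text
print(round(p_naive, 2))    # 0.55
```

Because the naive χ²_2 reference ignores the boundary constraint σ²_{γ1} ≥ 0, its p-value (0.55) is larger than the correct mixture p-value (0.41).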
4.3.3 Lack of Robustness of Classical Procedures

To illustrate the sensitivity of the classical estimators introduced in Section 4.3.1, let us go back to the skin resistance data. In Figure 4.1 we saw that, out of the 80 readings, two measurements (resistance of electrodes of type 2 and 3) taken on subject 15 were much larger than the others. The experimenter discovered later that the reason for these two rather large readings was the excessive amount of hair on the subject’s arm (see Berry, 1987). Table 4.1 presents the classical (REML) estimates and standard errors with and without case 15 (the raw data have been divided by 100). One may notice that there is considerable variation in the estimates of the different electrode types (significant fixed effects) when observation 15 is present in the data.
These differences are less obvious when case 15 is removed from the data. A large difference is also observed in the residual error variance estimate σ̂²_ε when it is computed with and without case 15. This clearly illustrates the lack of robustness of the REML. To quantify the sensitivity of the MLE and REML in a more formal way, we use the IF (see Section 2.2.1), which offers an elegant way to justify these empirical findings theoretically. Indeed, both the MLE and REML are M-estimators defined through the estimating equations (4.19)–(4.20) and (4.19)–(4.24). Their IF is therefore proportional to their defining Ψ-functions. Specifically, the influence of the ith independent cluster (i.e. the five measurements of the ith subject in the skin resistance experiment) for both classical estimators of β is proportional to the score function for that parameter

s(y_i, x_i; θ) = x_i^T V_i^{-1} (y_i − x_i β).    (4.27)

This quantity is unbounded in y_i and in x_i, which proves theoretically that both the MLE and REML estimates of the fixed effects are not robust. The situation is even worse for the variance components. The IF of α̂_[MLE] is proportional to the summand in (4.20), a quadratic form in y_i, and, as a result, a single abnormal response (such as case 15’s readings for type 2 and 3 electrodes) can ruin α̂_[MLE]. It is not possible to assess directly the effect of a single cluster on the REML variance estimates, as the estimating equation cannot be defined at that level. However, a quadratic form appears on the left-hand side of (4.24), proving that α̂_[REML] is just as sensitive as α̂_[MLE] to contamination.
4.4 Robust Estimation

4.4.1 Bounded Influence Estimators

It is possible to extend the bounded-influence approach of Section 2.3.2 to MLMs. Most of these methods are based on a weighted version of the likelihood, either directly (Huggins, 1993; Huggins and Staudte, 1994), where a robustified likelihood is maximized, or through a weighted score equation (Richardson and Welsh, 1995; Richardson, 1997; Stahel and Welsh, 1997). Summarizing the previous work, Welsh and Richardson (1997) introduce a very general class that encompasses most of the previous proposals through

Σ_{i=1}^n x_i^T W_{0i} V_i^{-1/2} ψ_{0i}( V_i^{-1/2} U_{0i} (y_i − x_i β) ) = 0    (4.28)

for the fixed effects, and

(1/2) Σ_{i=1}^n { ψ_{1i}( V_i^{-1/2} U_{1i} (y_i − x_i β) )^T W_{1i} V_i^{-1/2} [Z_j Z_j^T]_{(ii)} V_i^{-1/2} W_{1i} · ψ_{2i}( V_i^{-1/2} U_{1i} (y_i − x_i β) ) − tr( K_{2i} V_i^{-1} [Z_j Z_j^T]_{(ii)} ) } = 0    (4.29)
for each variance component σ²_j. The matrices K_{2i} are needed to ensure consistency at the normal model; see Welsh and Richardson (1997) for details. Equations (4.28) and (4.29) generalize the score equations (4.19) and (4.20) for the MLE. The choice of the weight matrices W_{0i}, W_{1i}, U_{0i}, U_{1i} and of the functions ψ_{0i}, ψ_{1i}, ψ_{2i} defines each particular estimator, including Huggins’ earlier proposals. The ψ-functions are typically chosen as Huber functions applied to all components, but other choices are also possible. The robust estimator with all weights equal to one and ψ_0 = ψ_1 = ψ_2 is called robust MLE II in Richardson and Welsh (1995), as (4.29) is analogous to Huber’s Proposal 2 in linear regression. Likewise, the choice ψ_0 = ψ_2 and ψ_1(z) = z gives the robust MLE I of Richardson and Welsh (1995). It is also possible to define robust versions of the REML by using weighted equations similar to (4.29), the difference being a more complex trace term.⁶ As before, two variants exist and are called robust REML I and II in Richardson and Welsh (1995) and Welsh and Richardson (1997). As all the proposals discussed here are defined through estimating equations of the type Ψ(y_i, x_i; θ), where θ = (β^T, α^T)^T, the general asymptotic theory for M-estimators applies. Although these developments generalize the bounded-influence approach of Section 3.2.4 to a considerable level of generality, several limitations can be mentioned. First, computation is generally complicated by the presence of the complex matrices K_{2i} required for consistency. The problem may even become intractable for redescending Ψ or complex variance structures. Second, in the presence of contaminated data, some small residual bias in the robust variance estimates remains, even for the robust REML proposals; see the simulation results in Richardson and Welsh (1995, pp. 1437–1438). Finally, the breakdown point of such bounded-influence estimators can be low, and this may be an issue in complex models.
4.4.2 S-estimators

The reformulation of the MLM as a multivariate normal model offers an elegant way to tackle the robustification problem. Specifically, S-estimators, introduced earlier in Section 2.3.3 for their good breakdown properties, can easily be generalized to balanced MLMs, i.e. models of type (4.8)–(4.9) where the cluster size p_i = p and V_i = V for all clusters. This assumption is certainly not desirable from a practical perspective, as many applications involve unbalanced data or a variable number of repeated measures over time. As this theory is new (Copt and Victoria-Feser, 2006), there is, however, hope that this limitation will be relaxed in the near future. In the multivariate normal setting, one can define an S-estimator of the mean µ and covariance V as the solution for these parameters that minimizes det(V) = |V| subject to

n^{-1} Σ_{i=1}^n ρ(d_i) = b_0,    (4.30)

⁶ The equation is similar to (4.29), with the trace term tr(K_2 P Z_j Z_j^T), where K_2 = diag(K_{21}, …, K_{2n}), sitting outside the summation over i.
where

d_i² = (y_i − µ)^T V^{-1} (y_i − µ)    (4.31)

are the Mahalanobis distances, ρ is a bounded function and b_0 = E_Φ[ρ(d)] ensures consistency at the normal model. Using the relationship (2.33), the tuning parameter of the ρ-function can be chosen to achieve a pre-specified breakdown point ε* (see Section 2.3.3). A typical choice for ρ is Tukey’s biweight given in (2.20). For the balanced case, the marginal MLM (4.9) simply becomes y_i ∼ N(x_i β; V), where the common covariance matrix is

V = Σ_{j=0}^r σ²_j z_j z_j^T    (4.32)

and z_j is the (common) element of the design matrix Z_j for a particular cluster. In the skin resistance data example, V is given by (4.32) (see also (4.12)), with z_0 z_0^T = I_5 (for the residual variance) and z_1 z_1^T = e_5 e_5^T (for the subject random effect variance). Likewise, for the semantic priming data example, according to (4.16), we have z_0 z_0^T = I_6 (for the residual variance), z_1 z_1^T = J_6, z_2 z_2^T = I_2 ⊗ J_3 and z_3 z_3^T = J_2 ⊗ I_3 for the subject and its factor-interaction random effect variances. The additional structure on the mean and covariance matrix implied by the MLM formulation does not create additional difficulties in extending the definition of an S-estimator to this setting. Indeed, it can be defined as the solution for β, σ²_j, j = 0, …, r, of the same minimization problem under the constraint (4.30), with

d_i = d_i(β) = sqrt( (y_i − x_i β)^T V^{-1} (y_i − x_i β) )    (4.33)

and V having the particular structure (4.32). The problem can be restated as solving the estimating equations

Σ_i w(d_i) x_i^T V^{-1} (y_i − x_i β) = Σ_i Ψ_β(y_i, x_i; θ) = 0    (4.34)

for β, and

Σ_i { p w(d_i) (y_i − x_i β)^T V^{-1} z_j z_j^T V^{-1} (y_i − x_i β) − w(d_i) d_i² tr[V^{-1} z_j z_j^T] } = Σ_i Ψ_{σ²_j}(y_i, x_i; θ) = 0,  j = 0, …, r,    (4.35)

for the variance components α = (σ²_0, …, σ²_r)^T (see Copt and Victoria-Feser, 2006). Here w(d) = ρ′(d)/d is the robust weight given to each observation. Equations (4.34) and (4.35) can be rewritten in the more compact form Σ_i Ψ(y_i, x_i; θ) = 0, where Ψ = (Ψ_β^T, Ψ_{σ²_0}, …, Ψ_{σ²_r})^T. We propose to use Tukey’s biweight ρ-function (2.20) and call the resulting robust estimator CBS, for constrained biweight S-estimator.
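To make the quantities in (4.31)–(4.33) concrete, here is a small numerical sketch of two ingredients of the S-estimation problem: the Mahalanobis distances d_i and the bounded Tukey biweight ρ. The data and tuning constant are made up for illustration; this is not the CBS fitting algorithm itself.

```python
import numpy as np

def mahalanobis(y, mu, V):
    """d_i = sqrt((y_i - mu)^T V^{-1} (y_i - mu)), one distance per cluster."""
    r = y - mu
    d2 = np.einsum("ij,jk,ik->i", r, np.linalg.inv(V), r)
    return np.sqrt(d2)

def rho_biweight(d, c):
    """Tukey's biweight rho, bounded at c^2/6 for d beyond c."""
    d = np.minimum(np.abs(d), c)
    return (c**2 / 6.0) * (1.0 - (1.0 - (d / c) ** 2) ** 3)

# two toy 3-variate clusters with identity covariance (made-up numbers)
y = np.array([[0.5, -0.2, 0.1], [4.0, 4.0, 4.0]])
mu = np.zeros(3)
d = mahalanobis(y, mu, np.eye(3))
print(np.round(d, 3))           # with V = I these are just Euclidean norms
print(rho_biweight(d, c=3.45))  # the distant cluster is capped at c^2/6
```

Because ρ is bounded at c²/6, a cluster arbitrarily far from the bulk of the data contributes at most a fixed amount to the constraint (4.30), which is what gives the S-estimator its high breakdown point.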
As for S-estimators in the linear regression model in Section 3.2.4, (4.34) and (4.35) may have multiple roots, and hence a good high breakdown point estimator is needed as a starting point to find the solution of (4.34) and (4.35) with a high breakdown point. A simple algorithm has been suggested by Copt and Victoria-Feser (2006) and is given in Appendix C; the way the high breakdown point starting estimator is obtained is also detailed there. Following Davies (1987) and Lopuhaä (1992) for the multivariate normal case, Copt and Victoria-Feser (2006) prove that, under mild regularity conditions, a (constrained) S-estimator of θ defined through (4.34) and (4.35) is consistent and asymptotically normally distributed. In particular, if the inverse of Σ_{i=1}^n x_i^T x_i exists, β̂_[S] has an asymptotic variance given by

(e_1 / e_2²) ( Σ_{i=1}^n x_i^T x_i )^{-1} ( Σ_{i=1}^n x_i^T V x_i ) ( Σ_{i=1}^n x_i^T x_i )^{-1},    (4.36)

where

e_1 = (1/p) E_Φ[ d² w(d)² ]    (4.37)

and

e_2 = E_Φ[ w(d) + (1/p) d (∂/∂d) w(d) ]    (4.38)

and w is the weight function associated with the ρ-function. The constrained S-estimator of α is also asymptotically normally distributed, with a variance given by a complex sandwich formula (omitted here for simplicity); see Copt and Victoria-Feser (2006, p. 294).
4.4.3 MM-estimators

In the same spirit as in the regression setting (see Section 3.2.4), Copt and Heritier (2007) propose MM-estimators for the main effects parameters. They possess many good properties, i.e. a high breakdown point even in the presence of leverage points, good efficiency and, unlike S-estimators, they can be used to build a robust LRT-type test. This last property was the key incentive for their introduction; see also Section 4.5. The class of MM-estimators was first introduced by Yohai (1987) in the linear regression setting and was then generalized by Lopuhaä (1992) and Tatsuoka and Tyler (2000) to the multivariate linear model. The idea is to dissociate the estimation of the regression parameters (fixed effects) and variance components (random effects), and to proceed in two steps. In the MLM setting, one can first obtain a high breakdown point estimator of the covariance matrix via the CBS estimator (V̂_[CBS]), then use a better tuned ρ-function to obtain a more efficient M-estimator of the fixed effects parameter (i.e. β). In practice, the initial variance estimator is based on a ρ-function ρ_0(d; c_0), and the final estimator on ρ_1(d; c_1). The tuning constants are usually chosen to achieve a specific breakdown point (through c_0) and efficiency
(through c_1) at the model. Technically, the second step amounts to solving for β

Σ_{i=1}^n Ψ(y_i, x_i; β) = Σ_{i=1}^n w_1(d_i) x_i^T V̂^{-1} (y_i − x_i β) = 0,    (4.39)

where, e.g., V̂ = V̂_[CBS], and w_1(d) = ρ_1′(d; c_1)/d is the weight function associated with ρ_1, i.e. the ρ-function in the M-step. The solution of (4.39) is the MM-estimator β̂_[MM] of β.

Two natural choices for w_1(·) (and, hence, ρ_1) arise from the regression setting: either Huber’s ρ-function (see (2.17)) or the bounded Tukey biweight ρ-function (see (2.20)), leading to β̂_[Hub] and β̂_[bi], respectively. The corresponding weights are

w_1(d) = min(1, c_1/|d|)    (4.40)

for Huber’s weights, and

w_1(d) = ((d/c_1)² − 1)²  if |d| ≤ c_1,  and  w_1(d) = 0  if |d| > c_1,    (4.41)

for Tukey’s biweight weights (see also (3.14)). These two proposals serve different purposes. Huber’s estimator is well adapted to cases where model deviations occur in the response variable only, such as in ANOVA or in models with well-controlled covariates. It can, however, be severely biased in the presence of (bad) leverage points. This is not the case with Tukey’s biweight, which is robust to both response and covariate extreme observations. Note that for Huber’s weights (4.40) the associated ρ-function is (2.17), and for the biweight weights (4.41) it is (3.15), with c replaced by c_1 in both cases.

Copt and Heritier (2007) show that, under mild conditions on ρ_1, √n(β̂_[MM] − β) has a limiting normal distribution with zero mean and var(β̂_[MM]) = H = (1/n) M^{-1} Q M^{-T}, where M and Q are both proportional to E_K[x^T V^{-1} x] and K is the covariates’ distribution.⁷ A simpler representation of H can thus be given by

H = (1/n) (e_1/e_2²) E_K[x^T V^{-1} x]^{-1},    (4.42)

where e_1 and e_2 are given in (4.37) and (4.38), respectively, with w(d) = ρ_1′(d)/d. In the case of fixed covariates, K can be replaced by the covariates’ empirical distribution in (4.42), yielding an asymptotic variance matrix H proportional to the asymptotic variance of the MLE (4.22). The multiplicative constant e_1/e_2² will be used to calibrate the efficiency of the MM-estimator (see below). However, we prefer not to use the reduced form (4.42) to derive an estimate of H, and use instead the sample analog of the sandwich formula

Ĥ = (1/n) M̂^{-1} Q̂ M̂^{-1},

where M̂ and Q̂ are the empirical versions of (2.28) and (2.29) for the MLM. For instance, M̂ = (1/n) Σ_{i=1}^n Ψ(y_i, x_i; β) s(y_i, x_i; β)^T, with Ψ as in (4.39) and s the score function (4.27), where again V̂_[CBS] has been plugged in for V. Such an estimator is usually more robust when extreme covariate values are observed. Numerical values are obtained by replacing β by β̂_[MM].

⁷ In this section, we work under slightly more general conditions than in Section 4.4.2 by assuming that the covariates are not necessarily fixed but have a common distribution K. The rationale for this is to be able to account for leverage points or other problems in the covariate space. If one does not want to specify a particular model for K, and therefore wants to get back to the previous setting, one only needs to replace K by the empirical distribution of x.

Table 4.2 Values of c_0 and c_1 for Tukey’s biweight ρ-function (2.20) for the multivariate normal model (c_0 for a breakdown point of 50%; c_1 for 95% efficiency).

p      1     2     3     4     5     6     7     8     9     10
c_0    1.56  2.66  3.45  4.09  4.65  5.14  5.59  6.01  6.40  6.77
c_1    4.68  5.12  5.51  5.82  6.10  6.37  6.60  6.83  7.04  7.25
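The two weight functions (4.40) and (4.41) are easy to sketch numerically; note how Huber's weights merely downweight distant observations while the biweight redescends to exactly zero:

```python
import numpy as np

def w_huber(d, c):
    """Huber weights (4.40): w(d) = min(1, c/|d|)."""
    d = np.abs(np.asarray(d, dtype=float))
    return np.minimum(1.0, np.divide(c, d, out=np.ones_like(d), where=d > 0))

def w_biweight(d, c):
    """Tukey biweight weights (4.41): ((d/c)^2 - 1)^2 if |d| <= c, else 0."""
    d = np.abs(np.asarray(d, dtype=float))
    return np.where(d <= c, ((d / c) ** 2 - 1.0) ** 2, 0.0)

d = np.array([0.0, 1.0, 5.0, 8.0])
print(w_huber(d, 6.10))     # the distant point gets weight 6.10/8 = 0.7625
print(w_biweight(d, 6.10))  # the distant point gets weight exactly 0
```

Here c = 6.10 is the tabled c_1 for p = 5. An observation with d far beyond c still receives a positive Huber weight, which is why the Huber variant can be biased by bad leverage points, whereas the biweight discards it entirely.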
4.4.4 Choosing the Tuning Constants

As mentioned earlier, the constant c_0 of ρ_0 is chosen to ensure a high (asymptotic) breakdown point ε* (50% in our case) for the initial estimate V̂_[CBS]. For that purpose, the relationship

E[ρ_0(d; c_0)] = ε* · max_x ρ_0(x; c_0)

is solved for c_0 to achieve the pre-specified breakdown point ε*, with (in our examples) Tukey’s biweight ρ_0. To determine the constant c_1, an efficiency level (typically 95%) needs to be specified a priori. As discussed earlier, formula (4.42) shows that the efficiency of the MM-estimator relative to the MLE is given by the ratio

e_2²/e_1 = p · ( E_Φ[ w_1(d) + (1/p) d (∂/∂d) w_1(d) ] )² / E_Φ[ d² w_1(d)² ],    (4.43)

with w_1(d) given in (4.40) or (4.41), depending on the choice of ρ_1 (Huber or biweight) and with c = c_1. The constant c_1 is then found by equating (4.43) to the desired efficiency level (e.g. 95%). Note that in the univariate case (p = 1), (4.43) reduces to (3.20). Both constants depend on the dimension p of the response vector and can be obtained by Monte Carlo simulation. They are summarized in Table 4.2 for Tukey’s biweight ρ-functions (for both ρ_0 and ρ_1).
Table 4.3 Estimates and standard errors for the REML and the CBS–MM for the skin resistance data using model (4.2).

            REML                       CBS–MM
Parameter   Estimate (SE)    p-value   Estimate (SE)    p-value
µ            2.030 (0.341)   <10^-4     1.440 (0.233)   <10^-4
λ1          −0.213 (0.334)   0.525     −0.161 (0.175)   0.356
λ2           0.842 (0.334)   0.014      0.403 (0.175)   0.021
λ3           0.549 (0.334)   0.105      0.243 (0.175)   0.163
λ4          −0.526 (0.334)   0.120     −0.169 (0.175)   0.332
σ_s          1.190                      0.842
σ_ε          1.459                      0.761

CBS computed with c_0 = 4.65 and MM (biweight) computed with c_1 = 6.10.
When p becomes large enough, an asymptotic approximation given by Rocke (1996, p. 1330) can be used for Tukey’s biweight, which yields c_1 = √p/m, where m is defined through ρ_[bi](m) = 0.5 ρ_[bi](1), with ρ_[bi] given in (2.20) with c = 1. This approximation gives reasonable results for p > 10. Finally, note that the values of c_0 and c_1 given here obviously depend on the choice of the ρ-function and would need to be recomputed had other ρ-functions been used. Another option is available for the Huber estimator, i.e. when the Huber weights (4.40) are chosen. It stems from the fact that ρ in (2.17) is a function of d, the Mahalanobis distance. As d² has a chi-squared distribution with p degrees of freedom (χ²_p), c_1 can be chosen as the square root of a specific quantile of this distribution.
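Both calibration recipes can be checked numerically. The sketch below assumes that the expectations in (4.43) are taken with respect to the chi distribution of d with p degrees of freedom; the tabled c_1 should then give an efficiency close to 95%, and the Huber variant uses a chi-square quantile:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def biweight_efficiency(c1, p):
    """Efficiency ratio (4.43) for Tukey's biweight, assuming d ~ chi_p.
    The integrands vanish for d > c1, so integration stops there."""
    w = lambda d: (1.0 - (d / c1) ** 2) ** 2            # biweight weights (4.41)
    dw = lambda d: -4.0 * d / c1**2 * (1.0 - (d / c1) ** 2)
    pdf = stats.chi(p).pdf
    e2 = quad(lambda d: (w(d) + d * dw(d) / p) * pdf(d), 0, c1)[0]
    e1 = quad(lambda d: d**2 * w(d) ** 2 * pdf(d), 0, c1)[0] / p
    return e2**2 / e1

# Table 4.2 gives c1 = 4.68 for p = 1; the efficiency should be close to 0.95
print(round(biweight_efficiency(4.68, 1), 3))

# Huber variant: c1 as the square root of a chi-square quantile (here p = 5)
c1_huber = np.sqrt(stats.chi2.ppf(0.95, df=5))
print(round(c1_huber, 3))  # sqrt(11.07) ≈ 3.327
```

For p = 1 this reduces to the classical regression calibration, where c ≈ 4.685 gives 95% efficiency for the biweight, consistent with the first column of Table 4.2.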
4.4.5 Skin Resistance Data (continued)

As an illustration, we go back to the skin resistance data. Table 4.3 presents the robust MM estimates β̂_[bi] and robust CBS estimates α̂_[CBS] (for simplicity, we call this set of robust estimators the CBS–MM) and standard errors for the electrode resistance data (the raw data have been divided by 100), along with the REML estimates obtained earlier. The MM contrast estimates are not affected by case 15’s extreme readings for electrodes of type 2 and 3. They are actually close to what was observed with case 15 removed from the analysis. The CBS variance estimates, especially the residual estimate, are much smaller, confirming the previous findings that the REML estimates are unduly inflated by the two abnormal readings. To limit the influence of potential outlying observations, Berry (1987) actually proposes to use a log(y + c) (c = 32) transformation of the data. A profile plot of the transformed data is presented in Figure 4.3. Graphically, the log-transformation limits the effect of the potential outliers (in particular observation 15). The estimated model parameters using the transformed data and the classical (REML)
Figure 4.3 Profile plot for the skin resistance data (log-transformed). [Figure: mean resistance (log scale) by electrode type (E1–E5), one profile per subject.]
and robust estimators are presented in Table 4.4. This time, the overall mean, the contrasts, the variance components and the p-values are similar for the two methods. This illustrates the fact that outliers are model specific, i.e. the two abnormal readings on the original scale do not appear so extreme on the log-transformed scale. This was not the case with the non-transformed data. We defer the discussion of the effect of the electrode type to Section 4.5.3.
Table 4.4 Estimates and standard errors for the REML and the CBS–MM for the skin resistance data using model (4.2) with a log-transformed response.

            REML                       CBS–MM
Parameter   Estimate (SE)    p-value   Estimate (SE)    p-value
µ            4.913 (0.166)   <10^-4     4.918 (0.176)   <10^-4
λ1          −0.097 (0.158)   0.542     −0.058 (0.161)   0.718
λ2           0.396 (0.158)   0.015      0.376 (0.161)   0.019
λ3           0.179 (0.158)   0.262      0.167 (0.161)   0.299
λ4          −0.289 (0.158)   0.072     −0.282 (0.161)   0.079
σ_s          0.585                      0.610
σ_ε          0.701                      0.689

CBS computed with c_0 = 4.65 and MM (biweight) computed with c_1 = 6.10.

4.5 Robust Inference

The MM-estimators were introduced earlier to offer more options for testing hypotheses on the main effects. Typical tests usually involve contrasts, or the multidimensional hypothesis that a component of the main effects parameter is null.

4.5.1 Testing Contrasts

A contrast test occurs when a linear combination of the elements of β, typically represented by a (q + 1)-vector L, is tested. For example, suppose that we have a one-factor within-subject ANOVA model with three levels, i.e. β = (β_1, β_2, β_3) = (µ, λ_1, λ_2), and suppose that the design matrix x is parametrized with ‘treatment’ contrasts (see e.g. (4.11)), with the third level as the reference level. Suppose also that our goal is to test for differences among elements of the mean vector (µ_1, µ_2, µ_3). The corresponding null and alternative hypotheses are

H_0 : µ_1 − µ_3 = β_2 = λ_1 = 0 against H_1 : µ_1 − µ_3 = β_2 = λ_1 ≠ 0,
H_0 : µ_2 − µ_3 = β_3 = λ_2 = 0 against H_1 : µ_2 − µ_3 = β_3 = λ_2 ≠ 0,
H_0 : µ_2 − µ_1 = β_3 − β_2 = λ_2 − λ_1 = 0 against H_1 : µ_2 − µ_1 = β_3 − β_2 = λ_2 − λ_1 ≠ 0.

The corresponding contrasts are L^T = (0, 1, 0), L^T = (0, 0, 1) and L^T = (0, 1, −1). Simple robust inference for contrasts can be performed using an estimate of the asymptotic covariance of β̂_[MM] given in (4.42). For H_0 : L^T β = 0, a robust z-test statistic is given by

z-statistic = L^T β̂_[MM] / SE(L^T β̂_[MM])    (4.44)

with

SE(L^T β̂_[MM]) = sqrt( L^T Ĥ L ).

The corresponding p-value is obtained by comparing (4.44) with the standard normal distribution. Note that, although we compute the z-statistic with the MM-estimator, the same sort of calculation can be done with the S-estimator, using the appropriate asymptotic variance.
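As a worked sketch of the z-test (4.44), with a made-up estimate β̂_[MM] and a made-up (diagonal) covariance estimate Ĥ — the numbers below are for illustration only and are not taken from the skin resistance fit:

```python
import numpy as np
from scipy.stats import norm

def contrast_test(L, beta, H):
    """Robust z-statistic (4.44) and two-sided p-value for H0: L^T beta = 0."""
    z = L @ beta / np.sqrt(L @ H @ L)
    return z, 2 * norm.sf(abs(z))

# hypothetical MM estimate and diagonal covariance estimate (illustration only)
beta_mm = np.array([2.03, -0.213, 0.842])
H_hat = np.diag([0.12**2, 0.15**2, 0.15**2])

L = np.array([0.0, 1.0, 0.0])  # tests the second component (lambda_1 = 0)
z, pval = contrast_test(L, beta_mm, H_hat)
print(round(z, 2))  # -0.213 / 0.15 = -1.42
print(round(pval, 3))
```

For a contrast such as L^T = (0, 1, −1), the same function applies unchanged; the off-diagonal terms of Ĥ (zero in this toy example) then enter through L^T Ĥ L.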
4.5.2 Multiple Hypothesis Testing of the Main Effects

Tests involving multiple hypotheses can, for instance, be used to compare models with the same variance structure, or to assess the statistical significance of a factor with several levels, such as the type of electrode in model (4.2). Denote again by β^T = (β_(1)^T, β_(2)^T) the partition of the vector β into q + 1 − k and k components, and by A_(ij), i, j = 1, 2, the corresponding partition of (q + 1) × (q + 1) matrices. The hypothesis to be tested can usually be formulated as

H_0 : β = β_0 with β_0(2) = 0, β_0(1) unspecified,
H_1 : β_(2) ≠ 0, β_(1) unspecified.

The need for robust testing in this setting is obvious, as the classical F-test has reportedly been found to be unreliable under sometimes mild model deviations (see e.g. Copt and Heritier, 2007). Robust alternatives to the classical Wald or score tests are readily available, through (2.47) for the robust Wald test and (2.48) for the robust score test, for any model. Robustifying the LRT is probably the most natural route to build a robust alternative to the F-test but, as alluded to in Section 4.4.3, such a test does not always exist for S-estimators. The reason is that the corresponding test statistic is by construction zero. To see this, just note that the robust LRT in (2.50) is based on the difference in Σ ρ(d_i) between the full and reduced models. As the definition of S-estimators (4.30) sets both sums to b_0 (up to a 1/n factor), the difference is simply zero. As shown in Copt and Heritier (2007), MM-estimators circumvent the problem by using another loss function ρ_1, different from that used to build the S-estimator; the LRT statistic therefore exists. A direct application of the general theory of robust testing introduced in Section 2.5 can then be used. Formally, the LRT_ρ statistic is computed in the same way as in the general case. Again, let d_i(β) = sqrt( (y_i − x_i β)^T V̂_[S]^{-1} (y_i − x_i β) ) be the Mahalanobis distance for observation i, with V̂_[S] a chosen S-estimator of V (e.g. V̂_[CBS]). The robust LRT-type test statistic is given by

LRT_ρ = 2 Σ_{i=1}^n [ ρ(d_i(β̇_[MM])) − ρ(d_i(β̂_[MM])) ],    (4.45)

where β̂_[MM] and β̇_[MM] are the robust estimators in the full and reduced models, respectively, with corresponding loss function ρ_1. More specifically, the LRT_ρ associated with the Huber estimator, respectively the biweight estimator, is defined through (4.39) with weight function (4.40), respectively (4.41), and with corresponding ρ_1-function given in (2.17), respectively in (3.15). In both cases the covariance matrix estimate is the CBS estimate V̂_[CBS].

An estimate of a robust Wald-type test statistic is naturally defined by

W_Ψ² = β̂_[MM](2)^T Ĥ_(22)^{-1} β̂_[MM](2),

where β̂_[MM](2) is the robust MM-estimator of β_(2) in the full model and Ĥ_(22) the corresponding covariance estimate. Finally, a score- (or Rao-) type test statistic is given by

R_Ψ² = Z_n^T Ĉ^{-1} Z_n,

where Z_n = (1/n) Σ_{i=1}^n Ψ(y_i, x_i; β̇_[MM])_(2), β̇_[MM] is the MM-estimator in the reduced model with corresponding Ψ-function given in (4.39), and with weights in (4.40) for Huber’s estimator or in (4.41) for Tukey’s biweight estimator. The k × k positive-definite matrix Ĉ is Ĉ = M̂_22.1 Ĥ_(22) M̂_22.1^T, with M̂_22.1 = M̂_(22) − M̂_(21) M̂_(11)^{-1} M̂_(12) from the partitioning of the matrix M̂. Again, we have defined the three test statistics for the MM-estimators, but it is also possible to define the robust Wald and score tests in a similar fashion for the CBS estimators. Under the null hypothesis, their asymptotic distribution is the same as in the general parametric settings (see Section 2.5.4).
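A numerical sketch of the Wald-type statistic, again with hypothetical values for β̂_[MM](2) and Ĥ_(22); the statistic is referred to a χ²_k distribution, k being the number of tested components:

```python
import numpy as np
from scipy.stats import chi2

# hypothetical estimates for the k = 2 tested components and their covariance
beta2 = np.array([0.3, -0.2])
H22 = np.diag([0.04, 0.09])

# Wald-type statistic beta2^T H22^{-1} beta2, compared with chi2 on k = 2 df
W2 = beta2 @ np.linalg.inv(H22) @ beta2
pval = chi2.sf(W2, df=len(beta2))
print(round(W2, 3), round(pval, 3))  # W2 = 0.09/0.04 + 0.04/0.09 ≈ 2.694
```

The score-type statistic R_Ψ² has exactly the same quadratic-form shape, with Z_n in place of β̂_[MM](2) and Ĉ in place of Ĥ_(22), so the same two lines of linear algebra apply.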
4.5.3 Skin Resistance Data Example (continued)

We now return to the problem of testing the multivariate hypothesis of equality of the mean resistances, given by

H_0 : λ_1 = λ_2 = λ_3 = λ_4 = 0, µ unspecified,
H_1 : H_0 is not true,

irrespective of the chosen contrast matrix. The classical F-test statistic is 3.1455. When compared with an F_{4,60} distribution, we find a p-value of 0.020, so that the test is significant at the 5% level. We could conclude that there is a difference between the five electrode types. Using Tukey’s biweight ρ-function, the robust LRT test statistic yields a p-value of 0.086 at the same 5% level. The test is, hence, not significant. Observations 15 and possibly 2 seem to have an influence on the MLE (or REML) estimates and consequently on the F-test. If the responses are log-transformed, the F-test statistic is 2.87, corresponding to a p-value of 0.03, and the robust LRT test gives a p-value of 0.061. Although the log-transformation gives similar results for the parameter estimates (see Table 4.4), it does not completely reduce the influence of the outlying observations (number 15 and possibly number 2) on the classical F-test: we still reject the null hypothesis of equal resistances. Note that Berry (1987) analyzes this dataset with subject 15 deleted, and finds a significant F-test on the original data (p-value of 0.044) and a non-significant F-test on the log-transformed data (p-value of 0.10).
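The classical p-values quoted above can be reproduced directly from the F distribution (the robust LRT p-values require the full robust fit and are not recomputed here):

```python
from scipy.stats import f

# original scale: F = 3.1455 on (4, 60) degrees of freedom
p_raw = f.sf(3.1455, 4, 60)
# log-transformed response: F = 2.87 on the same degrees of freedom
p_log = f.sf(2.87, 4, 60)

print(round(p_raw, 3))  # about 0.020, matching the text
print(round(p_log, 2))  # about 0.03
```

Both classical p-values fall below 0.05, whereas the robust LRT p-values (0.086 and 0.061) do not, which is the point of the comparison: the classical significance is driven by a couple of aberrant readings.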
4.5.4 Semantic Priming Data Example (continued) The model used to analyze this dataset is given in (4.15) with λj , j = 1, 2, the fixed effect for the delay and γk , k = 1, 2, 3, the fixed effect for the condition. Table 4.5 gives the estimates for the REML and the CBS–MM and the standard errors for the fixed effects computed using Tukey’s biweight weights. The contrasts for each factor are the ‘sum’-type contrasts. We can see that both methods detect a significant effect for the delay but with a borderline p-value of 0.046 for the REML whereas the message is clearer with the robust method yielding a p-value of 0.003. Another
MIXED LINEAR MODELS
Table 4.5 Estimates and standard errors for the REML and the CBS–MM for the semantic priming data using model (4.15).

| Parameter | REML Estimate (SE) | REML p-value | CBS–MM Estimate (SE) | CBS–MM p-value |
|---|---|---|---|---|
| µ | 633.436 (28.465) | <10⁻⁴ | 586.420 (18.817) | <10⁻⁴ |
| λ1 | −18.071 (8.974) | 0.046 | −17.876 (6.082) | 0.003 |
| γ1 | 18.563 (13.732) | 0.179 | 14.317 (11.691) | 0.221 |
| γ2 | −51.222 (13.732) | <10⁻⁴ | −56.994 (11.691) | <10⁻⁴ |
| λγ11 | −3.690 (12.691) | 0.771 | 12.706 (10.582) | 0.230 |
| λγ12 | 16.809 (12.691) | 0.188 | 8.844 (10.582) | 0.403 |
| σs | 122.622 | | 77.991 | |
| σλs | 0.006 | | N/A | |
| σγs | 29.433 | | 27.199 | |
| σε | 100.73 | | 81.885 | |

CBS computed with c0 = 5.14 and MM (biweight) computed with c1 = 6.37.
important feature of this model is the estimation of the random effects. The robust estimate of the variance for the interaction between subject and delay is not reported, because the robust estimator gives a negative value. This can sometimes happen, as some of the variance components correspond to covariances between responses on the same subject and, hence, can in principle be negative. Standard algorithms included in common statistical packages work around this problem by imputing very small values close to zero each time a variance is found to be negative. In this example, using the R function lme, one obtains a small value (0.006) for the corresponding classical estimator (REML).¹⁰ We also tested the significance of each factor and of the interactions, using the F-test and the robust LRT-type test. Results are presented in Table 4.6. The classical F-test and robust LRT test give similar results for the three hypotheses with, again, a stronger effect for the delay variable. In this example, the presence of possible outliers does not seem to influence the results of the tests. With this type of data, one can also consider a log-transformation, although in this domain one usually prefers the original scale, mainly for interpretation reasons. In Table 4.7 we give the REML and the CBS–MM estimates with corresponding standard errors. The estimates and p-values for significance testing are quite similar and lead to the same conclusions. Also note that, again, the variance of the random effect for the interaction between the subject and the delay is set to zero with the REML and found to be negative (and hence reported as N/A) with the CBS estimator. We can also test the significance of each factor and of the interactions, using the F-test and the robust LRT-type test. Results are presented in Table 4.8. Both approaches lead to similar conclusions.
10 The problem of negative variances is not specific to robust approaches but is a common problem in the general ANOVA/MLM setting; see, for example, Searle et al. (1992).
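To see concretely how a variance component can come out negative, consider the simplest balanced one-way random-effects layout and the classical ANOVA (method-of-moments) estimator σ̂a² = (MSB − MSW)/n. The sketch below (illustrative Python, not tied to any package or to the semantic priming model) constructs data whose group means coincide, so that MSB < MSW and the estimate is negative; truncating it at (or near) zero is exactly the work-around described above:

```python
import numpy as np

def one_way_varcomp(y):
    """ANOVA (method-of-moments) estimate of the between-group variance
    in a balanced one-way random-effects model y_ij = mu + a_i + e_ij."""
    y = np.asarray(y, dtype=float)   # shape (k groups, n per group)
    k, n = y.shape
    group_means = y.mean(axis=1)
    grand_mean = y.mean()
    msb = n * np.sum((group_means - grand_mean) ** 2) / (k - 1)
    msw = np.sum((y - group_means[:, None]) ** 2) / (k * (n - 1))
    return (msb - msw) / n           # can be negative!

# Groups with identical means but large within-group spread:
est = one_way_varcomp([[-1.0, 1.0], [-1.0, 1.0]])
# est = (0 - 2)/2 = -1.0; software would report it as (close to) zero
```

The true between-group variance is of course non-negative; it is only the moment estimator (and, similarly, REML or robust estimators of covariance-type components) that can fall below zero.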
4.5. ROBUST INFERENCE
Table 4.6 Classical F -test and robust LRT for the fixed effects of the semantic priming data using model (4.15). p-value Variable
Classical F -test
Robust LRT test
Delay Condition Delay:Condition
0.046 0.001 0.383
0.005 0.001 0.131
Robust LRT test computed using the CBS with c0 = 5.14 and the MM (biweight) with c1 = 6.37.
Table 4.7 Estimates and standard errors for the REML and the CBS–MM for the semantic priming data using model (4.15) with log-transformed data.

| Parameter | REML Estimate (SE) | REML p-value | CBS–MM Estimate (SE) | CBS–MM p-value |
|---|---|---|---|---|
| µ | 6.421 (0.040) | <10⁻⁴ | 6.386 (0.036) | <10⁻⁴ |
| λ1 | −0.027 (0.012) | 0.025 | −0.028 (0.010) | 0.007 |
| γ1 | 0.032 (0.021) | 0.127 | 0.029 (0.022) | 0.177 |
| γ2 | −0.088 (0.021) | <10⁻⁴ | −0.092 (0.022) | <10⁻⁴ |
| λγ11 | 0.002 (0.017) | 0.873 | 0.011 (0.018) | 0.526 |
| λγ12 | 0.018 (0.017) | 0.271 | 0.017 (0.018) | 0.336 |
| σs | 0.173 | | 0.148 | |
| σλs | 0.000 | | N/A | |
| σγs | 0.069 | | 0.069 | |
| σε | 0.136 | | 0.137 | |

CBS computed with c0 = 5.14 and MM (biweight) computed with c1 = 6.37.
Table 4.8 Classical F-test and robust LRT for the fixed effects of the semantic priming data using model (4.15), with log-transformed data.

| Variable | Classical F-test (p-value) | Robust LRT test (p-value) |
|---|---|---|
| Delay | 0.025 | 0.008 |
| Condition | 0.0003 | 0.001 |
| Delay:Condition | 0.390 | 0.272 |

Robust LRT test computed using the CBS with c0 = 5.14 and the MM (biweight) with c1 = 6.37.
4.5.5 Testing the Variance Components

Most of the effort in robust testing in MLMs has focused on the main effects, because the variance parameters are often considered as nuisance parameters. If one is truly interested in testing whether some random effects could be removed, the same problem mentioned above arises. As the null hypothesis typically involves restrictions of the type σj² = 0, the overall null parameter vector θ0 is on the boundary of the parameter space and, as a result, the general theory of Section 2.5.3 breaks down. One could conjecture that the same kind of mixture of χ² distributions could be used for the robust Wald test. However, such tests are known to perform poorly in the classical case, and a similar behavior is expected in the robust case. The LRT test could constitute a better alternative, but to the best of our knowledge no such robust LRT test exists, as the only proposal to date, the robust LRT test (4.45), only targets hypotheses on the fixed effects. At this stage, the only viable option seems to be the use of bootstrapping techniques, with the warning mentioned in Chapter 2 that the simple bootstrap can fail when applied to robust estimators (as the breakdown point may be reached in some bootstrap samples). Our practical recommendation in that case is to use a robust estimator with a 50% breakdown point, to have a good chance of avoiding the problem.
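As a toy illustration of the difference breakdown makes in the bootstrap (a generic numpy sketch on a univariate location problem, not the MLM setting), compare percentile-bootstrap intervals for the median, which has a 50% breakdown point, and the mean, which has a 0% breakdown point, in a contaminated sample:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_ci(data, estimator, n_boot=2000, level=0.95):
    """Simple nonparametric (percentile) bootstrap confidence interval."""
    data = np.asarray(data, dtype=float)
    stats = np.array([estimator(rng.choice(data, size=data.size, replace=True))
                      for _ in range(n_boot)])
    alpha = (1.0 - level) / 2.0
    return np.quantile(stats, [alpha, 1.0 - alpha])

# 20 clean observations plus 3 gross outliers
y = np.concatenate([rng.normal(0.0, 1.0, 20), [50.0, 60.0, 70.0]])

lo_med, hi_med = bootstrap_ci(y, np.median)   # 50% breakdown: resampled outliers
                                              # essentially never take over the median
lo_mean, hi_mean = bootstrap_ci(y, np.mean)   # 0% breakdown: interval dragged upwards
```

With a low-breakdown estimator, a bootstrap resample that happens to draw several outliers produces an absurd replicate, which is the failure mode warned about above; with the median, more than half the resample would have to be outlying.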
4.6 Checking the Model

4.6.1 Detecting Outlying and Influential Observations

Since the MLM can be seen as a multivariate normal model, multivariate tools can be used to measure, in some sense, how far the observations are from the bulk of the data. Such a tool is given by the Mahalanobis distances in (4.31), in which β and Σ are replaced by suitable estimates. In order for the estimated Mahalanobis distances not to be influenced (hence biased) by extreme observations, it is necessary that β and Σ are replaced by their robust estimators, namely β̂[MM] and Σ̂[CBS]. One can then rely on the asymptotic result that di in (4.31) has an asymptotic χp² distribution and, hence, compare the estimated Mahalanobis distances to, say, the corresponding 0.975 quantile. One can also, for comparison, estimate the Mahalanobis distances using the MLE or the REML for β and the variance components of Σ. A scatterplot of the robust versus classical Mahalanobis distances then reveals the outlying observations, i.e. the observations with both robust and classical Mahalanobis distances above the 0.975 quantile of the χp², as well as the influential observations, i.e. the observations with robust Mahalanobis distances above, and classical Mahalanobis distances below, the 0.975 quantile of the χp². These influential observations are such that the classical estimator is not able to detect them but is influenced by them. In a multivariate setting such as the MLM, Mahalanobis distances are usually preferred to the weights per se to detect outlying observations.

As an example, consider the skin resistance dataset estimated in Section 4.4.5. In Figure 4.1 we saw that, out of the 80 readings, two measurements (electrodes of
Figure 4.4 Scatterplot of the Mahalanobis distances for the skin resistance data (classical on the x-axis, robust on the y-axis; observations 15 and 2 labeled). CBS computed with c0 = 4.65 and MM (biweight) with c1 = 6.10.
type 2 and 3) taken on subject 15 were much larger than the others. Observation number 2 corresponds to the second largest response. In Figure 4.4 we give the scatterplot of the Mahalanobis distances computed with the REML and the CBS–MM. The horizontal and vertical dotted lines correspond to the 0.975 quantile of the χ5² distribution, used to detect outlying observations. The REML and CBS–MM estimators detect observations 15 and 2 as outlying observations. No influential observation is present in the sample. With the log-transformed data, the scatterplot of the Mahalanobis distances given in Figure 4.5 shows that the CBS–MM detects observation 15 as an influential observation, and that observation 2 is no longer considered as extreme.

As another example, consider the semantic priming dataset estimated in Section 4.5.4. In Figure 4.6 we give the scatterplot of the Mahalanobis distances computed with the REML and the CBS–MM. One can see that the REML and CBS–MM detect one outlier (observation 3) and that the CBS–MM detects two influential observations (observations 8 and 16). These observations are certainly the cause of the differences found between the classical and robust estimates. With the log-transformed data, the scatterplot of the Mahalanobis distances for the corresponding REML and CBS–MM estimates is given in Figure 4.7. One can see that there are two outliers detected
by the REML and CBS–MM, and they do not seem to have much influence on the estimates.

Figure 4.5 Scatterplot of the Mahalanobis distances for the skin resistance data (log-transformed; observation 15 labeled). CBS with c0 = 4.65 and MM (biweight) with c1 = 6.10.
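The outlying/influential classification just described can be sketched in a few lines of numpy (an illustration only: the "robust" estimates below are a crude trimming stand-in for the CBS–MM, the dimension is p = 2 rather than the sub-vector length of the real examples, and the χ²₂ 0.975 quantile is hard-coded):

```python
import numpy as np

# 0.975 quantile of the chi-squared distribution with p = 2 degrees of freedom
# (hard-coded to keep the sketch dependency-free; in practice use scipy.stats.chi2.ppf)
CHI2_975_DF2 = 7.378

def mahalanobis_sq(Y, mu, Sigma):
    """Squared Mahalanobis distance of each row of Y from (mu, Sigma)."""
    diff = Y - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma), diff)

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 2))
Y[0] = [8.0, 8.0]                      # one gross outlier

# Classical estimates use all the data; the "robust" ones simply re-estimate
# after trimming the flagged point (a stand-in for a genuine robust estimator).
d_cls = mahalanobis_sq(Y, Y.mean(0), np.cov(Y.T))
clean = Y[1:]
d_rob = mahalanobis_sq(Y, clean.mean(0), np.cov(clean.T))

outlying = (d_cls > CHI2_975_DF2) & (d_rob > CHI2_975_DF2)
influential = (d_rob > CHI2_975_DF2) & (d_cls <= CHI2_975_DF2)
```

The key mechanism is visible even in this crude version: the outlier inflates the classical covariance estimate, so its own classical distance is much smaller than its robust distance; with several outliers, masking can push classical distances below the cutoff entirely, which is what the "influential" category detects.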
4.6.2 Prediction and Residual Analysis

As for the regression model of Chapter 3, residual analysis with MLMs is used to check the model fit and also the model assumptions. In order to compute residuals, one needs to be able to compute predicted values for the response vector y. For that, with an MLM, one also needs to compute estimates for the random effects levels. Actually, one can define predicted (or fitted) response values at different levels of nesting or directly at the population level. Given estimated values for θ = (β^T, σ0², ..., σr²)^T, the predictions at the so-called population level are

ŷ = Xβ̂,   (4.46)

and the predictions at the so-called cluster (lowest) level are

ŷ = Xβ̂ + Zγ̂.   (4.47)
We note that, depending on the problem and for hierarchical models, there might be different cluster levels, so that Zγ̂ in (4.47) can be modified accordingly. In all cases,
Figure 4.6 Scatterplot of the Mahalanobis distances for the semantic priming data (observations 3, 8 and 16 labeled). CBS with c0 = 5.14 and MM (biweight) with c1 = 6.37.

when predicting at the cluster levels, an estimate γ̂ is needed, so that the first step is to define estimators for the random effects levels. Recall that random effects are unobservable variables. However, given the information contained in a sample and given a model, it is possible to predict (an expected value of) the vector of random effects for each response. Classically, one uses the Best Linear Unbiased Predictor (BLUP) given by¹¹

γ̂ = D Z^T V^{-1} (y − Xβ)
(4.48)
where D = cov(γ). Given values for the variance components α, (4.48) is computed using (4.21) for β. An interesting interpretation of γ̂ is that it is the MLE based on the likelihood of the joint distribution f(y, γ) = f(y|γ)f(γ) (for fixed values of α). Henderson et al. (1959) propose a set of equations for the simultaneous estimation of γ̂ and β̂, indeed based on the joint distribution of y and γ.

Prediction and residual analysis with robust estimators is not as straightforward as replacing all parameters in (4.48) by their robust estimates. If we choose this simple approach, we face the risk that a random effect corresponding to a particular observation yijk could be overestimated or underestimated if this observation is considered as an outlier in terms of the Mahalanobis distance. Indeed, Copt and

¹¹ See e.g. McCulloch and Searle (2001, Chapter 9).
Figure 4.7 Scatterplot of the Mahalanobis distances for the semantic priming data (log-transformed). CBS computed with c0 = 5.14 and MM (biweight) with c1 = 6.37.

Victoria-Feser (2009) show that the IF of γ̂ in (4.48) depends on the robustness properties of β̂(α) and also on the deviations (y − Xβ). This means that, in order to make the predictions robust to model deviations, one needs not only a robust estimator such as the CBS–MM, but also to bound (4.48). Copt and Victoria-Feser (2009) propose the use of the ψ-based prediction defined as¹²

γ̂ψ = eψ,c D Z^T V^{-1/2} ψ(V^{-1/2}(y − Xβ)),

where ψ(r) = (∂/∂r)ρ(r) is a bounded function, such as Huber's or Tukey's biweight function, and eψ,c is a correction factor (see below). A bounded ψ-function is necessary to guarantee the robustness of the corresponding prediction estimator. Moreover, in order for γ̂ψ to behave similarly to γ̂ at the normal model, we also need to impose that E[γ̂ψ] = 0 and var(γ̂ψ) = var(γ̂). These constraints define (implicitly) the correction factor eψ,c. For Tukey's biweight ψ-function, Copt and Victoria-Feser (2009) show that
eψ[bi],c = (I2(c) − (4/c²)I4(c) + (6/c⁴)I6(c) − (4/c⁶)I8(c) + (1/c⁸)I10(c))^{-1/2},

¹² To compute V^{-1/2}, we follow Richardson and Welsh (1995) and choose V^{-1/2} to be symmetric, with the same additive structure as V and V^{-1}, and with the property that V^{-1/2} V^{-1/2} = V^{-1}.
Figure 4.8 Boxplot and Q-Q plot of the (estimated) subject random effect for the skin resistance data. CBS computed with c0 = 4.65 and MM (biweight) with c1 = 6.10.
where

Ik(c) = ∫_{-c}^{c} r^k dΦ(r);
see Appendix B for the computation of these truncated normal moments. For Huber's ψ-function,

eψ[Hub],c = (1 − 2c²(1 − Φ(c)))^{-1/2}.

Finally, to compute γ̂ψ in practice, one replaces α (in V and D) and β by their robust estimates.

Estimated random effects can be used to check the model assumptions. Recall that the random effects are assumed to be normally distributed and independent of each other. A normal probability plot (normal quantiles against ordered estimated random effects) or a boxplot can be used to assess the normality assumption. Again, consider as an example the skin resistance dataset estimated in Section 4.2.2. This model has only one random effect, the subject. Figure 4.8 suggests that the normality of the subject random effect is fairly well respected.

As in the linear regression setting, residuals are defined as the difference between the response and the predicted value, i.e. y − ŷ, where ŷ is given in (4.47) and possibly also (4.46). They thus depend on the choice of predicted response. However, since random effects have been introduced into the model, it is more sensible to use the subject predicted values to define residuals, as population fitted values may produce a structure in the residuals that is simply due to the random effects. The residuals can also be standardized by means of the (estimated) covariance matrix of y, yielding V^{-1/2}(y − ŷ). Figure 4.9 displays the standardized residuals versus fitted values at the subject level. We can see that there is no particular structure in the residuals.
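To make the classical BLUP (4.48) and the ψ-based prediction concrete, here is a small deterministic numpy sketch (an illustration only: a single random intercept with known β and variance components, Huber's ψ with the familiar c = 1.345 rather than this chapter's tuning constants, and the Huber correction factor as given above). The outlier inflates the BLUP of its subject's random effect, while the bounded ψ keeps the robust prediction moderate:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def huber_psi(r, c=1.345):
    return np.clip(r, -c, c)

def sym_inv_sqrt(V):
    """Symmetric V^(-1/2) via eigendecomposition (in the spirit of Richardson and Welsh, 1995)."""
    w, U = np.linalg.eigh(V)
    return U @ np.diag(w ** -0.5) @ U.T

# Toy random-intercept model: m subjects, n observations each, known parameters
m, n = 4, 3
sigma_g2, sigma_e2 = 2.0, 1.0
beta0 = 5.0
Z = np.kron(np.eye(m), np.ones((n, 1)))   # maps subject effects to observations
D = sigma_g2 * np.eye(m)
V = Z @ D @ Z.T + sigma_e2 * np.eye(m * n)

gamma_true = np.array([0.5, -0.3, 0.8, 0.0])
y = beta0 + Z @ gamma_true                 # noise-free for a transparent example
y[0] += 25.0                               # one gross outlier on subject 1

resid = y - beta0
gamma_blup = D @ Z.T @ np.linalg.inv(V) @ resid               # classical BLUP (4.48)

c = 1.345
e_c = (1.0 - 2.0 * c ** 2 * (1.0 - norm_cdf(c))) ** -0.5      # Huber correction factor
Vmh = sym_inv_sqrt(V)
gamma_psi = e_c * (D @ Z.T @ Vmh @ huber_psi(Vmh @ resid, c))  # psi-based prediction
```

With one observation shifted by +25, the BLUP for the first subject is pulled above 7, whereas the ψ-based prediction remains small in magnitude because the standardized deviations are clipped before being mapped back to the random-effects scale.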
Figure 4.9 Standardized residuals (subject level) versus fitted values for the skin resistance data. CBS computed with c0 = 4.65 and MM (biweight) with c1 = 6.10.
4.7 Further Examples

4.7.1 Metallic Oxide Data

Until now, we have only presented models in which each level of a factor is combined with every level of another factor. Hierarchical models are models where only some levels of a factor are combined with the levels of another factor. More formally, suppose that we have two treatments λ and γ with l and g levels, respectively. In the language of experimental design, if each level of treatment γ appears in only one level of treatment λ, γ is said to be nested in λ. One can also extend the models so as to include so-called between-subjects factors. For example, consider the typical experiment in which a measurement is taken from n1 samples of type j = 1 and n2 samples of type j = 2, and in each sample the measure is taken on g 'objects'. For example, the 'objects' can be rats and the samples cages, n1 of which are given treatment j = 1 and the other n2 of which are given treatment j = 2. This type of design is called a nested design. The rats are nested within the
cages. A rat belongs either to cage 1 or to cage 2. We use different notation to represent nested factors: for example, if γ is the parameter for the cage, then γj(i) represents rat i nested within cage j. The between-subjects factor here is the treatment.

In this section, we analyze a dataset originating from a sampling study designed to explore the effects of process and measurement variation on the properties of lots of metallic oxides (Bennet, 1954). Two samples were drawn from each lot. Duplicate analyses were then performed by each of two chemists, with a pair of chemists randomly selected for each sample. Hence, the response yijklm corresponds to the metal content (percent by weight) measured on the ith metallic oxide type, on the jth lot, on the kth sample, by the lth chemist for the mth analysis. The model can be written as

yijklm = µ + λJi(j) + γj(i) + δj(i(k)) + ξj(i(k(l))) + εj(i(k(l(m)))),   (4.49)

where Ji(j) = 0 for j = 1 and Ji(j) = 1 for j = 2, and with µ + λJi(j) the fixed effect, γj(i), i = 1, ..., n = n1 + n2, the random effect due to the lot, δj(i(k)), k = 1, ..., 2n, the random effect due to the sample, and ξj(i(k(l))), l = 1, ..., 4n, the random effect due to the chemist. We then have

µi = e8 (µ + λJi(j)) = e8 ⊗ (1, Ji(j)) (µ, λ)^T = xi β

and Z1 = In ⊗ e8 for σγ², Z2 = In ⊗ I2 ⊗ e4 for σδ², Z3 = In ⊗ I4 ⊗ e2 for σξ², so that

Σ = σγ² J8 + σδ² I2 ⊗ J4 + σξ² I4 ⊗ J2 + σε² I8.

Thus, the parameters to be estimated are the means for each type of metallic oxide and the variances associated with lots, samples and chemists. This dataset contains 248 observations, from which we can form n = 31 independent sub-vectors yi of size 8. A plot of the responses by sample and chemist is given in Figure 4.10. One may notice that, whatever the sample or the chemist, the responses are rather low for lots (observations) 24 and 25 relative to the other lots.

Table 4.9 presents the estimates and standard errors for the CBS–MM. The mean effect of the metallic oxide type is significant (p-value of 0.005), and the variances are larger for the lot and the chemist, and smaller for the sample. As a comparison, the REML gives larger estimates for the variance components of the lot and sample, and a smaller estimate for the chemist (results not presented here). An analysis of the Mahalanobis distances reveals that there are a few potential outlying observations (see Figure 4.11). One can see that the REML and CBS–MM detect two outliers (observations 24 and 30), and possibly observation 17 as well, while the CBS–MM detects two influential observations (observations 12 and 25). An analysis based on the classical Mahalanobis distance alone would certainly be misleading.
Figure 4.10 Metal content response for each lot by sample and chemist (panels Chemist 1/Sample 1, Chemist 1/Sample 2, Chemist 2/Sample 1, Chemist 2/Sample 2, for types 1 and 2; x-axis: metal content (percent by weight); y-axis: lots 1–31).
4.7.2 Orthodontic Growth Data (continued) The orthodontic growth data introduced in Section 4.2 are summarized in Figure 4.2 where individual scatterplots of the distance (between the pituitary and the pterygomaxillary fissure) versus age are displayed. Individual LS fits based on simple linear regression are added to each scatterplot. They reveal that the estimated slope for subject M13 is far larger than the other estimated slopes. Overall, it seems that the
Table 4.9 Estimates and standard errors for the CBS–MM for the metallic oxide data using model (4.49).

| Parameter | Estimate (SE) | p-value |
|---|---|---|
| µ | 3.726 (0.066) | <10⁻⁴ |
| λ | 0.184 (0.066) | 0.005 |
| σlot | 0.317 | |
| σsample | 0.144 | |
| σchemist | 0.188 | |
| σε | 0.186 | |
CBS computed with c0 = 6.01 and MM (biweight) computed with c1 = 6.83.
Figure 4.11 Scatterplot of the Mahalanobis distances for the metallic oxide data (observations 12, 17, 24, 25 and 30 labeled). CBS computed with c0 = 6.01 and MM (biweight) with c1 = 6.83.

responses for the boys vary more than those for the girls. Moreover, the plot suggests that two observations on subject M09 are outliers. These potential outliers are also detected in Figure 4.12, which presents the LS residual plots by gender. As discussed in Section 4.2, a plausible working model is thought to be

yijt = β0 + β1 t + (β0g + β1g t) Ji(j) + γ0i + γ1i t + εijt
(4.50)
Figure 4.12 Residuals versus fitted values by gender, corresponding to individual LS fits (Male and Female panels; x-axis: fitted values (mm); y-axis: standardized residuals; subjects M09 and M13 labeled).
with yijt the response for the ith subject (i = 1, ..., 27) of gender j (j = 1 for boys and j = 2 for girls) at age t = 8, 10, 12, 14, and Ji(j) = 0 for boys (j = 1) and 1 for girls (j = 2). Table 4.10 presents the CBS–MM estimates and standard errors for the model parameters. The estimates show that there is no significant mean intercept difference between boys and girls (p-value of 0.896), while there is a significant mean slope difference (p-value of 0.036). The random slope variance is found to be relatively small compared with the random intercept variance. As a comparison, the REML gives similar results, with larger estimates of the random slope variance and of the residual variance. The robust Mahalanobis distances detect the observations corresponding to the 9th and 13th boys as extreme, as was already found in the graphical data
Table 4.10 Estimates and standard errors for the CBS–MM for the orthodontic data using model (4.50).

| Parameter | Estimate (SE) | p-value |
|---|---|---|
| β0 | 17.395 (0.613) | <10⁻⁴ |
| β0g | 0.080 (0.613) | 0.896 |
| β1 | 0.581 (0.052) | 0.000 |
| β1g | −0.110 (0.052) | 0.036 |
| σγ0 | 1.584 | |
| σγ1 | 0.115 | |
| σε | 1.04 | |

CBS computed with c0 = 4.09 and MM (biweight) computed with c1 = 5.82.
Figure 4.13 Boxplot and Q-Q plot of the random effects (subject and age:subject) for the orthodontic data. CBS computed with c0 = 4.09 and MM (biweight) with c1 = 5.82.
analysis in Figure 4.2. It should be noted that Pinheiro et al. (2001) also find the same outlying observations. A plot of the estimated random effects (see Figure 4.13) shows that both the random slope and the random intercept estimated with the robust estimator are approximately normally distributed.
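The individual LS fits used in Figures 4.2 and 4.12 are simply per-subject simple linear regressions of distance on age at the four measurement occasions. A minimal sketch (illustrative Python with made-up distances, not the real orthodontic measurements) shows how an M13-like subject stands out through its estimated slope:

```python
import numpy as np

ages = np.array([8.0, 10.0, 12.0, 14.0])   # measurement occasions in the study

def individual_ls_fit(distances, t=ages):
    """Per-subject least-squares line distance = a + b*age."""
    b, a = np.polyfit(t, np.asarray(distances, dtype=float), 1)
    return a, b

# Hypothetical subjects (illustrative numbers only):
typical = [21.0, 22.0, 23.0, 24.0]   # steady growth: 0.5 mm/year
steep = [17.0, 21.5, 26.0, 30.5]     # an M13-like subject: 2.25 mm/year

a1, b1 = individual_ls_fit(typical)
a2, b2 = individual_ls_fit(steep)
```

Plotting all 27 such lines side by side is exactly what reveals the atypical slope of M13 in the graphical analysis, before any mixed model is fitted.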
Figure 4.14 Standardized residuals (subject level) versus fitted values for the orthodontic data (subjects M09, M10, M13, F10 and F11 labeled). CBS computed with c0 = 4.09 and MM (biweight) with c1 = 5.82.
Figure 4.14 displays the standardized (Pearson) residuals versus fitted values at the subject level. We can see that there is no particular structure in the residuals and that subjects M13 and M09 are the largest outliers.
4.8 Discussion and Extensions

Despite its good robustness properties, and the fact that it does not suffer from computational problems when applied to complex data structures (as is often the case when modeling longitudinal data with fixed covariates), the CBS–MM estimator has a few limitations. The first limitation is, as stated earlier in this chapter, that the CBS–MM estimator cannot at the moment handle unbalanced data, unlike the very general bounded influence approach of Richardson and Welsh (1995). This is particularly annoying, as balanced data are not the rule and one is more likely to encounter unbalanced data, especially in medical research.

The second limitation is the lack of inference theory for the variance components. We have seen (Sections 4.3.2 and 4.5.5) that no proper solution to this problem
exists in the current robustness theory. The robust inferential procedures presented in this book fail, as they all assume the null hypothesis to be an interior point of the parameter space. In addition, the robust LRT test defined in Section 4.5 targets only hypotheses on the fixed effects. It cannot even be defined for a simple testing problem on the variance parameters σj², e.g. testing the equality σj² = σj'². Its extension to more general hypotheses on θ = (β^T, α^T)^T may prove challenging. In general, further research work is needed in this area.

One possible robust extension of the MLM is to assume that the data follow a t distribution instead of the normal distribution assumed throughout this chapter. For example, Pinheiro et al. (2001) incorporate multivariate t distributed random components in t MLMs. More recently, Lin and Lee (2006) propose a model based on the multivariate t distribution for autocorrelated longitudinal data, by first incorporating an autoregressive dependence structure in the variance components, and extend the work of Pinheiro et al. (2001) to allow for inference about the random effects and predictions.

The next natural extension of robustness in the MLM environment is to the class of generalized linear mixed models (GLMMs). Yau and Kuk (2002) introduce robust maximum quasi-likelihood and residual maximum quasi-likelihood estimation to limit the influence of outlying observations. The way they introduce robustness in the GLMM follows the same line of thought as used by Richardson and Welsh (1995) in the MLM. Other attempts at robustifying the GLMM can be found in Mills et al. (2002) or Sinha (2004). More recently, Litière et al. (2007a) study the impact of an incorrectly specified probability model on maximum likelihood estimation in the GLMM.
They study the impact of misspecifying the random-effects distribution on estimation and inference, and show that the MLE is inconsistent in the presence of such misspecifications.
5 Generalized Linear Models

5.1 Introduction

The framework of GLMs allows us to extend the class of models considered in Chapter 3 and to address situations with non-normal (non-Gaussian) responses. In particular, it allows us to consider continuous and discrete distributions for the response, both symmetric and asymmetric. From the practical point of view, this unified framework opens many perspectives formalized under the same setting and sharing a number of properties. The fields of application are quite wide: certainly biostatistics, but also medicine, economics, ecology, demography, psychology and many more. The family of possible distributions for the response is quite large, but the most common settings no doubt include binary or binomial responses (e.g. presence or absence of a characteristic, see the example in Section 5.5, or the number of 'successes' in a sequence), count data (for example, the number of visits to the doctor, see the example in Section 5.6) and positive responses (e.g. hospital costs, see the example in Section 5.3.5). All of the classical theory of GLMs is likelihood based, and the gain in popularity of GLMs has helped in reinforcing the central role of the likelihood in statistical inference. We will see that the robust versions of GLMs presented in this chapter move away from the likelihood setting, but retain almost all of its advantages in terms of statistical properties and interpretation.
Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser. © 2009 John Wiley & Sons, Ltd.

The route to the definition of the unified class of GLMs has been long, and the steps along it went through multiple linear regression (Legendre, Gauss, early 19th century), the ANOVA of designed experiments (Fisher, 1920–1935), the likelihood function (Fisher, 1922), dilution assay (Fisher, 1922), the exponential family of distributions (Fisher, 1934), probit analysis (Bliss, 1935), logit models for proportions (Berkson, 1944; Dyke and Patterson, 1952), item analysis (Rasch, 1960), log-linear models for counts (Birch, 1963) and inverse polynomials (Nelder, 1966); see McCullagh and Nelder (1989, Chapter 1) for additional information. Nelder and
Wedderburn (1972) show that the above problems can all be treated in the same way. They also show that the MLE for all of these models can be obtained using the same algorithm (IRWLS; see Appendix E.3).

Binary logistic regression has received quite a lot of attention in the robust literature. In fact, one can find several robust contributions that follow different approaches: the early contributions of Pregibon (1982), Copas (1988) and Carroll and Pederson (1993), the L1-norm quasi-likelihood approach of Morgenthaler (1992), the weighted likelihood approaches of Markatou et al. (1997) and Victoria-Feser (2002), and the high breakdown approaches of Bianco and Yohai (1997) and Christmann (1997). This wide range of contributions is certainly due to the fact that addressing the binary framework is simpler than addressing the general GLM class. This more general class has nevertheless been addressed in the work of Stefanski et al. (1986) and Künsch et al. (1989), who derive optimal (OBRE, see Section 2.3.1) and conditionally unbiased estimators for the entire GLM class. This theory is quite complex (even in its simpler conditional approach), and only the case of logistic regression can be implemented easily. More recently, Cantoni and Ronchetti (2001b) define Huber and Mallows-type estimators and quasi-deviance functions for application within the GLM framework; see also Cantoni (2003, 2004a) and Cantoni and Ronchetti (2006). Here we present this last piece of work, which seems to us the most promising for use in the entire GLM class. In fact, it has the advantage over other proposals of having computationally tractable expressions (which allow us to consider the entire class of GLMs and not only the logistic application) and of jointly providing a solution to the variable selection question through the definition of quasi-deviance functions.

The present chapter is organized as follows. In Section 5.2 we set up the notation and define the model.
We continue in Section 5.3, where we define the class of (robust) estimators and give their properties. The technique is illustrated on a real example in Section 5.3.5. The variable selection issue is addressed in Section 5.4.2, where a family of quasi-deviance functions is defined and its distribution studied. Section 5.4.3 considers the application to the previously studied example. Two additional complete data analyses with robust model selection are presented in Sections 5.5 and 5.6. Finally, Section 5.7 discusses possible extensions of this work.
5.2 The GLM

5.2.1 Model Building

We introduce here the GLM modeling approach without necessarily giving a complete and exhaustive treatment of the subject. Instead, we refer the interested reader to the general references treating GLM modeling, which include Dobson (2001) (a good starting point for beginners), Lindsey (1997) (an applied approach), McCullagh and Nelder (1989) (with additional technical details) and Fahrmeir and Tutz (2001) (more focused on discrete data).
Table 5.1 Properties of some distributions belonging to the exponential family.

Distribution                  θi(µi)              φ      E[yi]                         var(yi)
Normal N(µi, σ²)              µi                  σ²     µi = θi                       σ²
Bernoulli B(1, pi)            log(pi/(1 − pi))    1      pi = exp(θi)/(1 + exp(θi))    pi(1 − pi)
Scaled binomial B(m, pi)/m    log(pi/(1 − pi))    1/m    pi = exp(θi)/(1 + exp(θi))    pi(1 − pi)
Poisson P(λi)                 log(λi)             1      λi = exp(θi)                  λi
Gamma Γ(µi, ν)                −1/µi               1/ν    µi = −1/θi                    µi²/ν
See Appendix D for the distribution definitions.
Consider a sample of n individuals, for which we define the three following ingredients.

• The random component. n independent random variables y1, ..., yn which are assumed to share the same distribution from the exponential family, that is, with density that can be written as

    f(yi; θi, φ) = exp{ (yi θi − b(θi))/ai(φ) + c(yi, φ) }   (5.1)

for some specific functions a(·), b(·) and c(·). We denote µi = E[yi] and var(yi) = φ vµi, where the specific form of vµi depends on the distributional assumption on yi, see the last column of Table 5.1. The most common families of distributions, such as the normal, the binomial, the Poisson, the exponential and the Gamma, belong to the exponential family of distributions. Some of these distributions will be considered more closely here. The parameter θi, which is a function of µi, is called the natural parameter, and φ is an additional scale or dispersion parameter, usually considered as a nuisance parameter. We note that φ is a constant in certain models (for example, φ = 1/m for the scaled binomial and φ = 1 for the Poisson distribution), and coincides with σ² in the normal model, see Table 5.1.

• The systematic component. A set of parameters βT = (β0, β1, ..., βq) and q explanatory variables or covariates that can be either quantitative (numerical) or qualitative (levels of a factor, then coded with dummy variables as in linear regression). For each individual i = 1, ..., n, the covariates are stored in the vector xiT = (1, xi1, ..., xiq), from which the linear predictor ηi = xiT β is constructed. The parameter β0 therefore identifies the intercept. The pooled
covariate information is collected in a design matrix X as follows:

    X = (x1T, x2T, ..., xnT)T.   (5.2)

As in the linear model, linearity in GLM is intended with respect to the parameters. We note that one could introduce transformed covariates, log(xij) or xij², for example, as well as interactions. Moreover, there are situations where a parameter βj is known a priori: the corresponding term in the linear structure is called an offset in the GLM terminology.

• The link. A monotone link function g which links the random and the systematic components of the model:

    g(µi) = ηi = xiT β.   (5.3)
The link function defines the form of the relationship between the mean µi of the response and the assumed linear predictor ηi. It needs to be monotonic and differentiable. Moreover, it can be chosen to ensure that the estimated mean lies in the admissible space of values (for example, the interval (0, 1) for the binomial distribution and (0, ∞) for the Poisson distribution). The natural or canonical link function is the one relating the natural parameter directly to the linear predictor (θi = θi(µi) = ηi = xiT β). Models making use of the canonical link enjoy convenient mathematical and statistical properties, but the canonical link can easily be replaced with a link function that is more appropriate from a practical or interpretation point of view (see Example 5.3.5). The definition of model (5.3) may be surprising at first to people used essentially to the linear model setting, but the connection with the linear model appears more evident when the latter (as defined in (3.1)) is rewritten in the equivalent form E[yi] = µi = xiT β, with yi ∼ N(µi, σ²). In this case, the link function is the identity function. In the GLM setting, the distributional assumptions are defined with respect to the response itself (conditionally on the set of explanatory variables) and not with respect to an additive error term. Table 5.1 provides an overview of the components of a GLM for the most common situations.
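The mean and variance entries collected in Table 5.1 follow from the exponential-family form (5.1): E[yi] = b′(θi) and var(yi) = ai(φ) b″(θi). As an illustrative sketch of our own (not from the text), this can be checked numerically for the Poisson case, where b(θ) = exp(θ) and a(φ) = 1:

```python
import math

def b(theta):
    # Cumulant function b(theta) of the Poisson distribution (a(phi) = 1)
    return math.exp(theta)

def first_deriv(f, x, h=1e-6):
    # Central finite difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def second_deriv(f, x, h=1e-4):
    # Central finite difference approximation of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

theta = math.log(3.5)            # natural parameter for lambda = 3.5
mean = first_deriv(b, theta)     # b'(theta): should recover E[y] = 3.5
var = second_deriv(b, theta)     # b''(theta): should recover var(y) = 3.5
print(round(mean, 3), round(var, 3))
```

Both derivatives equal λi here, reproducing the Poisson row of Table 5.1, where E[yi] = var(yi) = λi.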
5.2.2 Classical Estimation and Inference for GLM

The parameters of model (5.3) are usually estimated by maximizing the corresponding log-likelihood (with respect to β)

    l(β; y) = l(µ; y) = Σ_{i=1}^n log f(yi; θi, φ) = Σ_{i=1}^n [ (yi θi − b(θi))/ai(φ) + c(yi, φ) ] = Σ_{i=1}^n li(µi; yi),   (5.4)

where µi = g⁻¹(xiT β) and θi = θi(µi) = θi(g⁻¹(xiT β)) are functions of β. The maximization of the log-likelihood (5.4) is performed numerically, either directly or via an IRWLS, see McCullagh and Nelder (1989, Section 2.5) and Appendix E.3. The resulting estimator β̂[MLE] enjoys the general properties of maximum likelihood estimation, in particular the normal asymptotic distribution with variance given by the inverse of the Fisher information matrix I(β) (see (2.30)), that is, √n(β̂[MLE] − β) ∼ N(0, I⁻¹(β)). Based on this asymptotic result, one can construct univariate test statistics for the coefficients βj, j = 0, ..., q, as

    β̂[MLE]j / SE(β̂[MLE]j),   (5.5)

with SE(β̂[MLE]j) = √( (1/n) [Î⁻¹(β̂[MLE])]_(j+1)(j+1) ), and using an estimator Î for the Fisher information matrix

    Î(β̂[MLE]) = (1/n) Σ_{i=1}^n [ (∂/∂β) li(µi; yi) ][ (∂/∂βT) li(µi; yi) ] |_{β = β̂[MLE]}.

The statistic (5.5) is labeled the t-statistic if the dispersion parameter φ is estimated (for example, for the Gaussian and Gamma distributions), and is labeled the z-statistic if the dispersion parameter is known (for example, for the binomial and Poisson distributions). The test statistic (5.5) has a t_{n−(q+1)} distribution under the null hypothesis H0: βj = 0 in the first case and a standard normal distribution in the second. The p-value for a two-sided alternative hypothesis H1: βj ≠ 0 is therefore computed as P(|z-statistic| > |zobs|) = 2(1 − Φ(|zobs|)) or P(|t-statistic| > |tobs|) = 2(1 − t_{n−(q+1)}(|tobs|)), where zobs and tobs are the values taken by the statistic (5.5) on the sample. Note that the z/t-statistic is a Wald approximation of the log-likelihood (a second-order Taylor expansion of the log-likelihood at the MLE) to test H0: βj = 0, and is sometimes misleading with binomial GLMs. In fact, a small value for the z/t-statistic can either correspond to a small LRT statistic or to a situation where |βj| is large, the
Wald approximation is poor and the likelihood ratio statistic is large. These problems can occur when the fitted probabilities are extremely close to zero or one. This is called the Hauck–Donner phenomenon, see Hauck and Donner (1977). The asymptotic result is also useful in constructing approximate (1 − α) confidence intervals (CIs), according to the formula

    (β̂[MLE]j − q(1−α/2) SE(β̂[MLE]j); β̂[MLE]j + q(1−α/2) SE(β̂[MLE]j)),

where q(1−α/2) is the (1 − α/2) quantile of either the standard normal distribution or the t_{n−(q+1)} distribution, depending on whether φ is known or not.

For binomial and Poisson models, it sometimes happens that the data do not satisfy the variance assumption of the model, but rather that var(yi) = τ vµi (recall that φ = 1 for binomial and Poisson models). This phenomenon is called over- or under-dispersion, depending on whether τ is larger or smaller than one. One of the main reasons for over-dispersion is clustering in the population (the parameter θi varies from cluster to cluster, as a function of cluster size for example). This means that the parameter θi is regarded as random rather than fixed. Beyond normality, specifying the expectation and the variance structure separately (first and second moment) does not correspond to a distribution function, therefore preventing the definition of a likelihood function. In this case, the model is fitted via the estimating equations

    Σ_{i=1}^n (yi − µi)/(τ vµi) µi′ = 0,   (5.6)

where µi′ = ∂µi/∂β. Equation (5.6) corresponds to the maximization of the so-called quasi-likelihood function

    Q(µ; y) = Σ_{i=1}^n Q(µi; yi) = Σ_{i=1}^n ∫_{yi}^{µi} (yi − t)/(τ vt) dt,   (5.7)

where µT = (µ1, ..., µn) and yT = (y1, ..., yn). Under some general conditions (see Wedderburn, 1974) the quasi-likelihood estimator is asymptotically normally distributed. Moreover, the MLE and the maximum quasi-likelihood estimator (MQLE) are the same for all of the models of the one-parameter exponential family (binomial and Poisson, for example). Note that τ has no impact on (5.6) because it cancels out, but it does have an impact on the computation of the standard errors of the coefficients. The estimation of τ is based on the RSS as follows:

    τ̂ = 1/(n − (q + 1)) Σ_{i=1}^n (yi − µ̂i)²/vµ̂i,

where µ̂i are the fitted values g⁻¹(xiT β̂[MQLE]) on the response scale. The estimator τ̂ is an unbiased estimator of τ if the fitted model is correct.

A particular function based on the log-likelihood plays an important role in GLM modeling. It is called the deviance, which, assuming that ai(φ) in (5.1) can be
decomposed as φ/wi, is defined by

    D(µ̂; y) = 2φ[l(y; y) − l(µ̂; y)] = Σ_{i=1}^n 2φ[li(yi; yi) − li(µ̂i; yi)] = Σ_{i=1}^n φ di,   (5.8)

where µ̂ is the vector of fitted values g⁻¹(xiT β̂[MLE]), l(µ̂; y) is the log-likelihood of the postulated model and l(y; y) is the saturated log-likelihood for a full model with n parameters. The deviance measures the discrepancy between the performance of the current model, via its log-likelihood, and the maximum log-likelihood achievable. It can therefore be used for goodness-of-fit purposes. Large values of D(µ̂; y) indicate that the model is not good. On the other hand, small values of D(µ̂; y) arise when the log-likelihood l(µ̂; y) is close to the saturated log-likelihood l(y; y). The distribution of the deviance is exactly χ²_{n−(q+1)} for normally distributed responses, and this distribution can be taken as an approximation for other distributions, for example binomial and Poisson. However, D(µ̂; y) is not usable for goodness-of-fit for Bernoulli responses, because it depends on the observations y only through the fitted probabilities µ̂, and as such does not carry information about the agreement between the observations and the fitted probabilities (see Collett, 2003a, Section 3.8.2). The deviance can be regarded as an LRT statistic for testing a specific model within the saturated model, assuming φ = 1. This is the case for binomial and Poisson models, but for other distributions, e.g. normal or Gamma, the deviance is not directly related to an LRT statistic.

The deviance is also used to construct a difference of deviances statistic to compare nested models. Suppose that a model M_{q−k+1} with (q − k) explanatory variables (plus intercept) is nested within a larger model M_{q+1} with q explanatory variables (plus intercept). To test the null hypothesis that the smaller model suffices to describe the data, one can test whether the parameters associated with the variables not included in the smaller model are equal to zero with the test statistic

    ΔD(µ̂, µ̇) = D(µ̇; y) − D(µ̂; y) = 2φ[l(µ̂; y) − l(µ̇; y)],   (5.9)

where µ̂ = µ(β̂[MLE]) and µ̇ = µ(β̇[MLE]) are the MLE estimates in the full model M_{q+1} and the reduced model M_{q−k+1}, respectively. If φ is known, then under the null hypothesis that the smaller model is good enough to represent the data, the distribution of ΔD(µ̂, µ̇) can be approximated by φ χ²_k (it is the LRT statistic up to a factor φ). This approximation is more accurate than the approximation of the deviance itself by a χ²_{n−(q+1)} distribution. When φ is not known (e.g. normal, Gamma), the usual approximation under H0 uses an F-type statistic:

    [ (D(µ̇; y) − D(µ̂; y))/k ] / φ̂ ∼ F_{k, n−(q+1)},

where φ̂ = D(µ̂; y)/(n − (q + 1)). Note that for the normal case with identity link this is an exact result, but for the Gamma model the accuracy of this approximation is not well known.
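As a small numerical sketch of this F-type comparison (the deviance values and dimensions below are made up for illustration; only the full-model deviance 5.07 echoes the example of Section 5.2.3):

```python
# Hypothetical deviances for a reduced model M_{q-k+1} and a full model M_{q+1}
D_reduced = 7.90    # D(mu_dot; y), illustrative value
D_full = 5.07       # D(mu_hat; y), illustrative value
n, q, k = 100, 6, 3

phi_hat = D_full / (n - (q + 1))           # dispersion estimated from the full model
F = ((D_reduced - D_full) / k) / phi_hat   # to be compared with F_{k, n-(q+1)}
print(round(phi_hat, 4), round(F, 2))
```

The resulting F value would then be compared with the quantiles of an F distribution with k and n − (q + 1) degrees of freedom.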
A natural definition of a quasi-deviance function follows from the definition (5.7) of a quasi-likelihood function:

    QD(µ̂; y) = Q(y; y) − Q(µ̂; y).   (5.10)

By analogy with the deviance function, one can use the quasi-deviance function for inference purposes to test whether a smaller model M_{q−k+1} nested within a larger model M_{q+1} is a good enough representation of the data, with the difference of quasi-deviances statistic:

    ΔQD(µ̂, µ̇) = Q(µ̇; y) − Q(µ̂; y),   (5.11)

where µ̂ = µ(β̂[MQLE]) and µ̇ = µ(β̇[MQLE]) are the MQLE estimates in the full model M_{q+1} and the reduced model M_{q−k+1}, respectively. The test statistic ΔQD(µ̂, µ̇) is then compared with a χ²_{n−(q+1)} distribution, at least when φ is known. As with the likelihood, an F-type test is more appropriate if φ is unknown, see above.
5.2.3 Hospital Costs Data Example

We introduce here a dataset on health care expenditures previously analyzed by Marazzi and Yohai (2004) and Cantoni and Ronchetti (2006). The aim is to explain the cost of stay (cost, in Swiss francs) of 100 patients hospitalized at the Centre Hospitalier Universitaire Vaudois in Lausanne (Switzerland) during 1999 for 'medical back problems' (APDRG 243). The following explanatory variables have been measured: length of stay (los, in days), admission type (adm: 0 = planned, 1 = emergency), insurance type (ins: 0 = regular, 1 = private), age in years (age), sex (sex: 0 = female, 1 = male) and discharge destination (dest: 1 = home, 0 = another health institution). The median age over the 100 patients is 56.5 years (the youngest patient is 16 years old and the oldest is 93 years old). Moreover, 60 of the 100 individuals in the sample were admitted as emergencies and only 9 patients had private insurance. Both sexes are well represented in the sample, with 53 men and 47 women. After being treated, 82 patients went home directly.

Modeling medical expenses is an important step in cost management and health care policy. Establishing the relationship between cost and the above explanatory variables could, for example, help in containing health care expenditures, which are increasing rapidly everywhere and are therefore a matter of concern. In addition to being positive, cost measurements are known to be highly skewed. Moreover, it is also known that the thickness of the tail of their distribution is often determined by a small number of heavy users. Several authors (e.g. Blough et al., 1999; Gilleskie and Mroz, 2004) report that the variance of health care expenditure data can be considered proportional to the squared mean. We therefore consider fitting a Gamma GLM with a logarithmic link.
Note that this model can be seen as arising from a multiplicative model yi = exp(xiT β) · ui, where the error term ui has constant variance. This is the reason why we use the logarithmic link instead of the canonical link g(µi) = 1/µi (the inverse function), which, by the way, does not guarantee that µi > 0. More specifically, we consider a parameterization of the Gamma density function such that one parameter identifies µi and the variance structure is defined by v(µi) = µi²/ν, see the top of page 201 in Cantoni and Ronchetti (2006). We start by fitting the full model, that is, the model with all of the available explanatory variables, as follows:

    log(E[cost]) = β0 + β1 log(los) + β2 adm + β3 ins + β4 age + β5 sex + β6 dest.   (5.12)

The MLE parameter estimates, their standard errors and the p-values of the significance tests (5.5) are given in Table 5.2. Before proceeding with any interpretation, it is recommended to validate the model. In this example, the deviance statistic (5.8) takes the value 5.07, which yields a p-value P(D > 5.07) ≈ 1 when compared with a χ²_{n−(q+1)} = χ²_93 distribution. This large p-value provides no evidence against the null hypothesis that the postulated model is an adequate simplification of the saturated model.

Table 5.2 Classical estimates for model (5.12). The estimates are obtained by maximum likelihood, see (5.4) (CI, confidence interval).

Variable      Estimate (SE)      95% CI             p-value
intercept     7.234 (0.147)      (6.940; 7.528)     <10⁻⁴
log(los)      0.822 (0.028)      (0.766; 0.878)     <10⁻⁴
adm           0.214 (0.050)      (0.114; 0.314)     <10⁻⁴
ins           0.093 (0.079)      (−0.065; 0.252)    0.2414
age           −0.0005 (0.001)    (−0.003; 0.002)    0.6790
sex           0.095 (0.050)      (−0.005; 0.195)    0.0602
dest          −0.104 (0.069)     (−0.243; 0.034)    0.1353
1/ν (scale)   0.0496
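The CIs of Table 5.2 can be reproduced from the estimates and standard errors; since φ is estimated for the Gamma model, the quantile is that of a t93 distribution, whose 0.975 quantile (≈ 1.986) we hard-code below because the Python standard library has no t-distribution:

```python
est, se = 0.822, 0.028   # log(los) row of Table 5.2
t975 = 1.986             # approximate 0.975 quantile of t_93 (hard-coded)

lower = est - t975 * se
upper = est + t975 * se
print(round(lower, 3), round(upper, 3))  # matches the (0.766; 0.878) interval
```

The same recipe applied to the other rows reproduces the remaining intervals of the table.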
5.2.4 Residual Analysis

Residual diagnostic plots are an alternative to formal tests. In the GLM setting several types of residuals can be defined, among which the most common are:

• the Pearson residuals riP = (yi − µ̂i)/√(φ̂ vµ̂i);
• the standardized Pearson residuals riPS = (yi − µ̂i)/√(φ̂ vµ̂i (1 − hii)), where the leverages hii are the diagonal entries of the hat matrix, see (3.11);
• the deviance residuals riD = sign(yi − µ̂i) √di;
• the standardized deviance residuals riDS = riD/√(φ̂ (1 − hii)).
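As a sketch of these definitions in the Poisson case (φ = 1, vµ = µ, and unit deviance di = 2[yi log(yi/µi) − (yi − µi)]), the two unstandardized residuals can be computed as follows; the values fed in are illustrative and not taken from the hospital data:

```python
import math

def pearson_resid(y, mu):
    # r_i^P = (y_i - mu_i) / sqrt(phi * v_mu) with phi = 1, v_mu = mu (Poisson)
    return (y - mu) / math.sqrt(mu)

def deviance_resid(y, mu):
    # r_i^D = sign(y_i - mu_i) * sqrt(d_i) with the Poisson unit deviance
    # d_i = 2 [y log(y/mu) - (y - mu)]  (with y log y := 0 when y = 0)
    d = 2 * ((y * math.log(y / mu) if y > 0 else 0.0) - (y - mu))
    return math.copysign(math.sqrt(d), y - mu)

print(round(pearson_resid(7, 4.0), 3), round(deviance_resid(7, 4.0), 3))
```

The two kinds of residuals agree closely near the fitted value and differ in the tails, which is why deviance residuals are usually preferred for Q-Q plots.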
[Figure 5.1 here: four diagnostic panels, Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours), with observations 14, 21, 28, 44 and 63 labeled.]

Figure 5.1 Diagnostic plots for the Gamma model (5.12), estimated with a MLE.
Residual plots can help in identifying departures from the linearity assumption (when plotted against continuous covariates), serial correlation (when plotted against the order in which the observations were collected, if known) and particular structures (when plotted against predicted values). In addition, it is usual to look at a Q-Q plot of the residuals against the normal quantiles. Note that for binary logistic models, structures very often appear in the residual plots that are due to the discrete nature of the response variable and do not indicate fitting problems.

Since the diagnostic approach is based on a classical fit, it has to be used with caution. In fact, masking can occur, where a single large outlier may mask others. It is worth noting that in the GLM setting, an outlier or extreme observation is an observation (yi, xiT) such that, under the GLM model that fits the majority of the data, yi is in some sense far from its fitted value g⁻¹(xiT β̂). The quantity yi − g⁻¹(xiT β̂) can be large because yi is an extreme response and/or the covariates xi are (at least one of them) extreme themselves. A classical residual analysis can suffer from the masking effect, in that the distorted data appear to be the norm rather than the exception. For instance, consider a regression setting where an outlier has such a large effect on a slope estimated by the MLE that its residual (or any other measure used for diagnostics) will tend to be small, whereas other observations will have relatively large residuals. This behavior is due to the fact that classical estimates are affected by outlying points and are pulled in their direction. We advocate later for the use of a robust analysis in the first place (see also the discussion in Section 1.3). We nevertheless propose, as a starting point, to look at a few plots.

In Figure 5.1 we present the diagnostic plots for the fitted Gamma model (5.12): the Pearson residuals as a function of the fitted values (top left panel), a normal Q-Q plot of the standardized deviance residuals (top right panel), a scale-location plot of the standardized deviance residuals as a function of the fitted values (bottom left panel) and a residuals versus leverage plot, that is, a plot of the standardized deviance residuals as a function of the leverage hii (bottom right panel). This last plot comes with added contour lines of equal Cook's distances (see Cook and Weisberg, 1982). Note that the plot function in R can also produce two extra plots, namely the Cook's distances and the Cook's distances as a function of the leverage. From Figure 5.1, we can see that there are a few outlying/influential data points with large residuals, in particular those identified by their observation number. To see why these observations are extreme, one can, for example, look at the plot of the variable cost as a function of the variable log(los), as in Figure 5.2.

[Figure 5.2 here: scatter plot of cost against log(los), with observations 14, 21, 28, 31, 44 and 63 labeled.]

Figure 5.2 cost versus log(los) for the Gamma example of Section 5.2.3.

We see from this figure that the points with large residuals are in fact points which are extreme with respect to observations with the same or similar values of log(los). Even though the Gamma model admits variance increasing with the covariates (of the
order of µi² = exp(2xiT β)), observations 14, 28, 63, 44 and 21 are considered too extreme with respect to the bulk of the data. On the other hand, observation 31 could be a leverage point, but it is not otherwise worrying given that its y-value lies in a region covered by the model assumptions. The more extreme observations identified by this diagnostic analysis can potentially have a very bad impact on the parameter estimates, and this issue needs to be investigated further. We reanalyze this dataset in Section 5.6 with a robust technique.
5.3 A Class of M-estimators for GLMs

Deviations from the model can also occur for GLMs. The nature of the possible deviations in the GLM class of models is close to what one sees in the regression setting: outliers in the response (producing large residuals) and leverage points in the design space. A notable exception is the binary response setting, where deviations in the response space take the form of misclassification (a zero instead of a one, or vice versa), and where the difference between an outlier and a leverage point is less clear-cut. To address the potential problem of deviating points in real data, or more generally the problem of slight model misspecification, we propose here a general class of M-estimators (see Section 2.3.1) for the GLM model as defined in Section 5.2. Given the Pearson residuals ri = (yi − µi)/√(φ vµi), the M-estimators for β of model (5.3) are given by the solution of the following estimating equations:

    Σ_{i=1}^n [ ψ(ri; β, φ, c) w(xi) (1/√(φ vµi)) µi′ − a(β) ] = Σ_{i=1}^n Ψ(yi, xi; β, φ, c) = 0,   (5.13)

where µi′ = ∂µi/∂β = (∂µi/∂ηi) xi and a(β) = (1/n) Σ_{i=1}^n E[ψ(ri; β, φ, c)] w(xi)/√(φ vµi) µi′, with the expectation taken over the distribution of yi|xi. The constant a(β) is a correction term ensuring Fisher consistency; see Sections 2.3.2 and 5.3.2. The function ψ(ri; β, φ, c) and the weights w(xi) are the new ingredients with respect to the classical GLM estimators obtained by maximum quasi-likelihood: compare with the estimating equations (5.6), which are obtained with ψ(ri; β, φ, c) = ri and w(xi) = 1 for all i. The function ψ is introduced to control deviations in the y-space, and leverage points are downweighted by the weights w(x). Conforming to the usage in robust linear regression, we call the estimator issued from (5.13) a Mallows-type estimator. It simplifies to a Huber-type estimator when w(xi) = 1 for all i.
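A minimal sketch of the two new ingredients, assuming the Huber ψ-function with the common tuning constant c = 1.345: ψc is the identity on [−c, c] and levels off outside, so the implied residual weight ψc(r)/r equals 1 for small Pearson residuals and shrinks like c/|r| for outlying ones:

```python
def psi_huber(r, c=1.345):
    # Huber psi: identity on [-c, c], constant +/- c outside
    return max(-c, min(c, r))

def resid_weight(r, c=1.345):
    # psi(r) / r: weight 1 for small residuals, roughly c/|r| for large ones
    return 1.0 if r == 0 else psi_huber(r, c) / r

print(psi_huber(0.5), psi_huber(4.0), round(resid_weight(4.0), 4))
```

A Pearson residual of 4 thus contributes to the estimating equations with roughly a third of the weight it would receive classically.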
It is worth noting that the estimating equations (5.13) can be conveniently rewritten as

    Σ_{i=1}^n [ w̃(ri; β, φ, c) w(xi) ri (1/√(φ vµi)) µi′ − a(β) ] = 0,   (5.14)

where w̃(r; β, φ, c) = ψ(r; β, φ, c)/r. In this form, the estimating equations (5.13) can be interpreted as the classical estimating equations, weighted (both with respect
to ri and xi) and re-centered via a(β) to ensure consistency. The particular weighting scheme considered in (5.14) is multiplicative in its design and residual components (wi = w̃(ri; β, φ, c) w(xi)). Alternatively, one could consider a global weighting scheme of the form wi(ri, xi), as, for example, in Künsch et al. (1989). It should nevertheless be stressed that such a scheme increases the difficulty of calculating the Fisher consistency correction a(β).

The estimation procedure issued from (5.13) can be written as an IRWLS, in the same manner as is usually done for the classical GLM estimating equations. We give the algorithm in Appendix E.3. The IRWLS algorithm was a particularly convincing 'selling argument' when GLMs were first proposed: thanks to this representation, the estimation procedure only requires software that allows the computation of weighted LS (or even only matrix computation). Nowadays computing power is a less crucial issue and other numerical procedures can be considered; for example, one can use a Newton-Raphson or a quasi-Newton algorithm.

Finally, one can see that if we write yT = (y1, ..., yn) and µT = (µ1, ..., µn), the estimating equations (5.13) correspond to the minimization of the quantity

    QM(µ; y) = Σ_{i=1}^n QM(µi; yi),   (5.15)

with respect to β, where the functions QM(µi; yi) can be written as

    QM(µi; yi) = ∫_{s̃}^{µi} ψ((yi − t)/√(φ vt); c) w(xi) (1/√(φ vt)) dt − (1/n) Σ_{j=1}^n ∫_{t̃}^{µj} E[ψ((yi − t)/√(φ vt); c)] w(xi) (1/√(φ vt)) dt,   (5.16)

with s̃ such that ψ((yi − s̃)/√(φ vs̃); c) = 0, and t̃ such that E[ψ((yi − t̃)/√(φ vt̃); c)] = 0. The function QM(µi; yi) in (5.16) plays the same role as the function Q(µi; yi) in (5.7), and is used later to define a difference of quasi-deviance type statistic, see Section 5.4.2.
5.3.1 Choice of ψ and w(x)

The role of the function ψ is to control the effect of large residuals; therefore it has to be bounded. Common choices for ψ are functions that level off, such as the Huber function, or functions that are redescending; see Section 2.3.1 for a discussion of the possible options. The function ψ is usually tuned with a constant c, which is typically chosen to guarantee a given level of asymptotic efficiency (computed as the ratio of the traces of the asymptotic variances of the classical and the robust estimators, see, for example, (2.31)). The exact computation of the value of c that guarantees a certain level of efficiency is more complicated for GLM models than for linear regression, because the asymptotic efficiency also depends here on the design, and no
general result can be derived. It is always possible to inspect the estimated efficiency a posteriori and refit the model with a different value of c if it is not satisfactory. In practice, if the Huber ψ-function is used (and this is the case in the glmrob function of the robustbase R package, and therefore in our examples), a value of c between 1.2 and 1.8 is often adequate. The default value is set to 1.345, the value that guarantees 95% efficiency for the normal GLM model with identity link. This value is also often a reasonable choice for other models, such as the binomial and Poisson models. Note that when c → ∞, the classical GLM estimators are reproduced; in practice, very large values of c (e.g. ≥ 10) have the same effect.

The choice of w(xi) is also suggested by robust estimators in linear models: the simplest approach is to use w(xi) = √(1 − hii), where hii is the leverage. More sophisticated choices for w(xi) are available, in particular some that in addition have high breakdown properties (see Section 3.2.4 for linear regression). The current implementation of the robustbase package, in addition to equal weights (w(xi) = 1 for all i, the default) and w(xi) = √(1 − hii), allows one to choose weights based on the Mahalanobis distances di (see (2.34)) of the form

    w(xi) = 1 / √( 1 + 8 max(0, (di² − q)/√(2q)) ).
A few options are available to estimate the center and the scatter in di robustly, either by the MCD estimator of Rousseeuw (1984) or by a more efficient S-estimator, see Section 2.3.3. Note, however, that these high breakdown estimators are not well suited for categorical or binary covariates, and their use only makes sense if all of the explanatory variables are continuous. A variation of this kind of weight is given in Victoria-Feser (2002). The weighting scheme issued from a robust fitting procedure can be used for diagnostic purposes: inspecting the observations that received a low weight allows the user to identify the outlying observations. For an illustration, see Section 5.5 (Figure 5.3) and Section 5.6 (Figure 5.7).
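As a sketch, the Mahalanobis-based weight above is straightforward to evaluate once a (robust) squared distance di² is available; here we plug in hand-picked distances rather than computing an MCD fit:

```python
import math

def w_mahalanobis(d2, q):
    # w(x_i) = 1 / sqrt(1 + 8 * max(0, (d_i^2 - q) / sqrt(2 q)))
    return 1.0 / math.sqrt(1.0 + 8.0 * max(0.0, (d2 - q) / math.sqrt(2.0 * q)))

q = 6                                    # number of covariates, as in (5.12)
print(w_mahalanobis(4.0, q))             # d^2 below q: full weight 1.0
print(round(w_mahalanobis(30.0, q), 4))  # a far point is strongly downweighted
```

Points whose squared distance is below q keep full weight, and the weight decays smoothly as di² grows.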
5.3.2 Fisher Consistency Correction

The term a(β) in the estimating equations (5.13) guarantees that the estimator is Fisher consistent, that is, asymptotically unbiased under the postulated model (normal, binomial, etc.). This term can sometimes be difficult to compute. Note, however, that it can be computed explicitly for GLM models where the responses are binomial or Poisson (cf. Cantoni and Ronchetti (2001b, p. 1028), with the change in notation V(µi) = φvµi), and Gamma (see Cantoni and Ronchetti (2006, pp. 210-211), with the change in notation v(µi) = φvµi). The expressions of a(β) for these models in the unified notation of this book are given in Appendix E.1.
When a(β) cannot be computed analytically, its estimation by simulation can be considered: the expectation involved in its computation is replaced by the empirical mean of a simulated sample.1 A different strategy is to compute a simpler, biased estimator of β by solving the uncorrected estimating equations

    Σ_{i=1}^n ψ(ri; β, φ, c) w(xi) (1/√(φ vµi)) µi′ = Σ_{i=1}^n Ψ̃(yi, xi; β, φ, c) = 0   (5.17)

and correct the bias a posteriori. In fact, the asymptotic bias of the estimator solving (5.17) can be approximated by a Taylor expansion and takes the form

    −E[ (∂/∂β) Σ_{i=1}^n Ψ̃(yi, xi; β, φ, c) ]⁻¹ E[ Σ_{i=1}^n Ψ̃(yi, xi; β, φ, c) ].   (5.18)

This bias has to be estimated. One can either compute the expectations by numerical integration and evaluate them at β̃ (the solution of (5.17)), or replace the expectations by averages with respect to the data. Given that Σ_{i=1}^n Ψ̃(yi, xi; β, φ, c) evaluated at the solution β̃ of (5.17) equals zero, a robust pilot estimator, that is, a robust estimator obtained by other means, is needed. For further details on the comparison of the estimator obtained from (5.17)-(5.18) and the estimator obtained from (5.13), see Dupuis and Morgenthaler (2002), in particular their Section 2.2. Using indirect inference (Gallant and Tauchen, 1996; Gouriéroux et al., 1993) is another possible approach to correcting the bias a posteriori, as is done in, e.g., Moustaki and Victoria-Feser (2006). For illustrations of the use of indirect inference with robust estimators, see also Genton and Ronchetti (2003).

1 Care should be taken that, in the iterative estimation process, the value of β used to simulate the data is not equal to the current value β̂.

5.3.3 Nuisance Parameters Estimation

As stated previously, φ is known to be constant for the Bernoulli, (scaled) binomial and Poisson models. In other models, this parameter has to be estimated, and this should be done while paying attention to maintaining the robustness properties gained in the estimation of β. In other words, it is necessary to also use a robust estimator for φ. We address here the normal and the Gamma distribution settings. In both cases the nuisance parameter is a scale parameter (for the Gamma, one may notice that var((yi − µi)/µi) = 1/ν), and we suggest borrowing one of the robust scale estimators available in the literature. Namely, we propose to use Huber's Proposal 2 estimator (Huber, 1981, p. 137), defined by (see also (3.7) for the regression model)

    Σ_{i=1}^n χ( (yi − µi)/√(φ vµi); β, φ, c ) = 0,   (5.19)

where χ(u; β, φ, c) = ψ²(u; β, φ, c) − δ, and δ = E[ψ²(u; β, φ, c)] is a constant that ensures Fisher consistency for the estimation of φ, see Hampel et al. (1986, p. 234). The function ψ can be chosen to be the same as that in (5.13).2 The expectation in δ is computed under normality for u; see (3.8) for its computation for ψ²(u; β, φ, c) = ψ²[Hub](u; β, φ, c) = u² w²[Hub](u; β, φ, c). Ideally, (5.19) has to be solved simultaneously with (5.13), but in practice a two-step procedure is often used: starting from a first guess for φ, an estimate of β is obtained, which in turn is used in (5.19), and so on until convergence.
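Huber's Proposal 2 can be sketched in the simplest possible setting, a location-scale sample with known location (a simplification of the GLM case, where the inputs would be Pearson residuals); the constant δ = E[ψc²(u)] under normality has a closed form in Φ and φ, which the standard library provides via NormalDist:

```python
from statistics import NormalDist

def delta_huber(c):
    # delta = E[psi_c(u)^2] for u ~ N(0, 1):
    # integral of u^2 phi(u) over [-c, c]  plus  c^2 * P(|u| > c)
    nd = NormalDist()
    Phi_c, phi_c = nd.cdf(c), nd.pdf(c)
    return (2 * Phi_c - 1) - 2 * c * phi_c + 2 * c * c * (1 - Phi_c)

def proposal2_scale(x, m, c=1.5, iters=50):
    # Solve (1/n) sum psi_c((x_i - m)/s)^2 = delta by the fixed-point scheme
    # s^2 <- s^2 * mean(psi^2) / delta
    d = delta_huber(c)
    s = 1.0
    for _ in range(iters):
        psi2 = [min(c, max(-c, (xi - m) / s)) ** 2 for xi in x]
        s *= (sum(psi2) / len(x) / d) ** 0.5
    return s

x = [4.1, 4.9, 5.0, 5.2, 5.4, 5.8, 6.0, 50.0]   # one gross outlier
print(round(proposal2_scale(x, m=5.2), 2))       # -> 0.77, barely moved by the outlier
```

Because the outlier's contribution is capped at c², the resulting scale stays close to the spread of the clean observations, whereas the classical standard deviation would explode.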
5.3.4 IF and Asymptotic Properties

The estimator β̂_[M] defined by (5.13) is an M-estimator characterized by the function Ψ(y_i, x_i; β, φ, c) = ψ(r_i; β, φ, c) w(x_i)/√(φ v_{µi}) µ′_i − a(β). Its IF is then

IF(y, x; β̂, F_β) = M(Ψ, F_β)^{−1} Ψ(y, x; β, φ, c),   (5.20)

where M(Ψ, F_β) = −E[(∂/∂β)Ψ(y, x; β, φ, c)]. Moreover, √n (β̂_[M] − β) has an asymptotic normal distribution with asymptotic variance M(Ψ, F_β)^{−1} Q(Ψ, F_β) M(Ψ, F_β)^{−1}, where Q(Ψ, F_β) = E[Ψ(y, x; β, φ, c) Ψ(y, x; β, φ, c)^T] (see also (2.27)). The matrices M(Ψ, F_β) and Q(Ψ, F_β) for the Mallows quasi-likelihood estimator (5.13) can be easily computed as

Q(Ψ, F_β) = (1/n) X^T A X − a(β) a(β)^T,   (5.21)

where A is a diagonal matrix with elements a_i = E[ψ(r_i; β, φ, c)²] w²(x_i)/(φ v_{µi}) (∂µ_i/∂η_i)², and

M(Ψ, F_β) = (1/n) X^T B X,   (5.22)

where B is a diagonal matrix with elements b_i as defined in Appendix E.1, and where the expectations are taken at the conditional distribution of y_i | x_i. Cantoni and Ronchetti (2001b) have computed these matrices for binomial and Poisson models, and Cantoni and Ronchetti (2006) for Gamma models. These results are presented in Appendix E.2 in a unified notation. Estimated versions of the matrices M(Ψ, β) and Q(Ψ, β) are obtained by replacing the parameters by their M-estimates.
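To make the sandwich formula concrete, here is a small self-contained sketch of our own that assembles M = (1/n) X^T B X and Q ≈ (1/n) X^T A X from given diagonal elements a_i and b_i and returns the estimated variance M^{−1} Q M^{−1}/n for a two-column design. The a(β) correction term of (5.21) is omitted for brevity, and the numerical values of a_i and b_i are made up.

```python
# Illustrative sandwich variance M^{-1} Q M^{-1} / n for a 2-column design,
# built from the diagonal entries a_i, b_i of A and B as in (5.21)-(5.22).
# The a(beta) term is dropped and the inputs are hypothetical.

def sandwich_2x2(X, a, b):
    """X: list of (x0, x1) rows; a, b: diagonal entries of A and B."""
    n = len(X)
    def xtdx(d):  # (1/n) X^T diag(d) X
        m = [[0.0, 0.0], [0.0, 0.0]]
        for (x0, x1), di in zip(X, d):
            m[0][0] += di * x0 * x0
            m[0][1] += di * x0 * x1
            m[1][1] += di * x1 * x1
        m[1][0] = m[0][1]
        return [[v / n for v in row] for row in m]
    def inv2(m):  # inverse of a 2 x 2 matrix
        det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
        return [[m[1][1] / det, -m[0][1] / det],
                [-m[1][0] / det, m[0][0] / det]]
    def mul2(p, q):  # 2 x 2 matrix product
        return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]
    Q, Minv = xtdx(a), inv2(xtdx(b))
    V = mul2(mul2(Minv, Q), Minv)               # M^{-1} Q M^{-1}
    return [[v / n for v in row] for row in V]  # /n gives Var(beta_hat)

X = [(1.0, float(i)) for i in range(10)]  # intercept + one covariate
a = [0.5] * 10                            # made-up E[psi^2] w^2/(phi v) terms
b = [1.0] * 10
V = sandwich_2x2(X, a, b)
```

With a_i = 0.5 b_i as above, Q = 0.5 M and the sandwich collapses to 0.5 (X^T X)^{−1}, a convenient sanity check for the code.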
5.3.5 Hospital Costs Example (continued)

Consider again the hospital costs example introduced in Section 5.2.3. Model (5.12) is now refitted via the robust estimating equations (5.13) with c = 1.5 and w(x_i) = 1, that is, with a Huber estimator. The scale estimator (5.19) is used for the nuisance parameter, with the same value of c. The estimated parameters, standard errors, CIs and p-values of the significance test statistics (5.23) are given in Table 5.3, to be compared with Table 5.2 (classical estimates). Only small differences appear in the values of the estimated coefficients between the classical and the robust analysis

² The Huber ψ-function is the one used in the implementation in the robustbase package.
Table 5.3 Robust estimates for model (5.12).

Variable       Estimate (SE)     95% CI               p-value
intercept      7.252 (0.105)     (7.042; 7.462)       <10⁻⁴
log(los)       0.839 (0.020)     (0.799; 0.879)       <10⁻⁴
adm            0.222 (0.036)     (0.151; 0.294)       <10⁻⁴
ins            0.009 (0.057)     (−0.104; 0.122)      0.869
age            −0.001 (0.001)    (−0.003; 0.001)      0.257
sex            0.073 (0.036)     (0.001; 0.144)       0.042
dest           −0.123 (0.050)    (−0.222; −0.024)     0.013
1/ν (scale)    0.0243

The estimates are obtained solving (5.13) with c = 1.5 and w(x_i) = 1 for all i (Huber's estimator), and (5.19) with c = 1.5.
except for the variable ins, where there is a difference by a factor of 10 (which is not a typo). This large difference is certainly due to the small number of patients (only nine) with private insurance, one of whom is heavily downweighted in the robust analysis (patient 28, w̃(r_28; β, φ, c) = 0.24). On the other hand, there are major discrepancies between the standard errors estimated by the two approaches, those based on the robust approach being much smaller. These differences are mainly due to the fact that the scale estimate from the classical analysis is twice as large as that from the robust analysis (see also the simulation results of Cantoni and Ronchetti (2006, Section 4)). This will also have an impact on the CIs and significance tests, as we will see in Section 5.4.3. Meanwhile, we look at what the robust fit tells us. The observations that are heavily downweighted, that is, with weights w̃(r_i; β, φ, c) smaller than 0.5, are w̃(r_14; β, φ, c) = 0.23, w̃(r_21; β, φ, c) = 0.50, w̃(r_28; β, φ, c) = 0.24, w̃(r_44; β, φ, c) = 0.42 and w̃(r_63; β, φ, c) = 0.32; in this case these are the same observations as identified in Section 5.2.3. Very similar results in terms of coefficient and standard error estimates are obtained if weights w(x_i) = √(1 − h_ii) are used (not shown). This indicates that we can be confident that there are no bad leverage points (see Section 3.2.4.2) in the sample and, therefore, we can use a Huber-type estimator to avoid any additional loss in efficiency. Indeed, if one computes the weights w(x_i) = √(1 − h_ii), they range from 0.9 to 1, with the first quartile equal to 0.96, the median equal to 0.97 and the third quartile equal to 0.98. It is particularly interesting to look at the weight of observation 31 (a potential influential point, as can be seen in Figure 5.2), which is w(x_31) = 0.96, indicating that there is no leverage effect.
5.4 Robust Inference 5.4.1 Significance Testing and CIs With the asymptotic result of Section 5.3.4, it is possible to draw approximate inference for β, either by constructing approximate (1 − α) CIs or by computing
univariate z-statistics, namely

z-statistic = β̂_[M]j / SE(β̂_[M]j),   (5.23)

where SE(β̂_[M]j) = √(v̂ar(β̂_[M]j)) and

v̂ar(β̂_[M]j) = (1/n) [ M̂(Ψ, F_β)^{−1} Q̂(Ψ, F_β) M̂(Ψ, F_β)^{−1} ]_{(j+1)(j+1)},

in which the matrices Q̂ and M̂ are estimated using β̂_[M] in (5.21) and (5.22), respectively. The z-statistic can then be compared with a standard normal distribution to test the null hypothesis H0 : βj = 0 and compute the corresponding p-value. As in the classical setting, the asymptotic distribution can be used to define approximate (1 − α) CIs for each parameter βj. Here, they write (β̂_[M]j − z_(1−α/2) SE(β̂_[M]j); β̂_[M]j + z_(1−α/2) SE(β̂_[M]j)), where z_(1−α/2) is the (1 − α/2) quantile of the standard normal distribution.
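As a worked illustration of (5.23), the snippet below (an illustrative sketch of our own, not library code) computes the z-statistic, the two-sided p-value and the approximate 95% CI from a coefficient and its robust standard error; plugging in the dest estimate of Table 5.3 reproduces the reported p-value of 0.013 and CI (−0.222; −0.024) up to rounding.

```python
# Univariate z-inference as in (5.23), using only the standard library.
from statistics import NormalDist

def z_inference(beta, se, alpha=0.05):
    nd = NormalDist()
    z = beta / se
    p = 2.0 * (1.0 - nd.cdf(abs(z)))      # two-sided p-value
    q = nd.inv_cdf(1.0 - alpha / 2.0)     # z_{(1 - alpha/2)} quantile
    return z, p, (beta - q * se, beta + q * se)

# dest coefficient and robust SE from Table 5.3
z, p, ci = z_inference(-0.123, 0.050)
```

The small discrepancy with the CI printed in Table 5.3 is rounding: the table was computed from unrounded estimates and standard errors.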
5.4.2 General Parametric Hypothesis Testing and Variable Selection

The general parametric theory on robust testing (e.g. Heritier and Ronchetti, 1994), i.e. robust LRT, Wald and Rao or score tests, can also be used in the GLM setting through the results presented in Section 2.5.3. However, since historically the deviance has been used for inference purposes with GLMs, we prefer to concentrate on the possibilities offered by a robust version of the deviance. Note, however, that in the classical setting the difference of deviances statistic used to compare two nested models coincides with the LRT statistic when φ (the scale parameter) is known.

When confronted with data, it is common practice to fit a first model that includes all available explanatory variables (the full model). The p-values associated with the univariate test statistics (z-statistics) on each coefficient separately give a first broad impression of the important variables affecting the response. However, this information has to be interpreted with caution, given the possible correlation between explanatory variables and the non-orthogonality of the tests. It is therefore preferable to conduct a proper variable selection analysis by means of adequate tools. Tools for variable selection, e.g. test statistics, are as much affected by extreme observations as estimators are. This effect manifests itself in terms of level (for example, an actual level which does not correspond to the nominal level) and in terms of loss of power; see the discussions in Sections 2.4.2, 2.4.3 and 2.5.5.

Consider a larger model M_{q+1} with q explanatory variables (plus intercept) and a sub-model M_{q−k+1} with only (q − k) explanatory variables (plus intercept). The question that arises is whether the sub-model is a good enough representation of the data. Testing that some explanatory variables are not significantly contributing to
the model amounts to testing that a subset of β is equal to zero. Therefore, without loss of generality, we split β = (β_(1)^T, β_(2)^T)^T, with β_(1) of dimension (q − k + 1) and β_(2) of dimension k, and we test the null hypothesis H0 : β_(2) = 0. We propose (see Cantoni and Ronchetti, 2001b) a robust counterpart to the difference of deviances statistic

Λ_QM = 2 [ Σ_{i=1}^n Q_M(µ̂_i; y_i) − Σ_{i=1}^n Q_M(µ̇_i; y_i) ],   (5.24)
where the quasi-likelihood functions Q_M(µ_i; y_i) are defined by (5.16), µ̂_i = µ_i(β̂_[M]) is the M-estimate under model M_{q+1} and µ̇_i = µ_i(β̇_[M]) is the M-estimate under model M_{q−k+1}. Note that this difference of deviances is independent of s̃ and t̃ (see (5.16)), because their contributions cancel out. Computing Λ_QM implies the computation of the functions Q_M(µ_i; y_i), which are integral forms for which there is no general analytical expression. They can easily be approximated numerically, and they have been implemented in this way. In situations where the evaluation of these integrals is problematic, an asymptotic approximation can be used; see Section 5.4.2.1. The same forms for the functions ψ and w(x_i) as for the M-estimator β̂_[M] can be used in (5.24); see the discussion in Section 5.3.1. The test statistic Λ_QM can be used to compare two nested models predefined by the analyst, but can also be used for a more automatic analysis, either sequential (see the example in Section 5.6.2) or marginal (stepwise, see the example in Section 5.5.2).

The test statistic (5.24) is in fact a generalization of the quasi-deviance test for GLMs (5.11), which is recovered by taking Q_M(µ_i; y_i) = ∫_{y_i}^{µ_i} (y_i − t)/(τ v_t) dt. Moreover, when the link function is the identity (linear regression), the statistic (5.24) becomes the τ-test statistic given by Hampel et al. (1986, Chapter 7); see also Section 3.3.3.

5.4.2.1 Asymptotic Distribution and Robustness Properties

Let A_(ij), i, j = 1, 2, be the partitions of a (q + 1) × (q + 1) matrix A according to the partition of β into β_(1) and β_(2). Under technical conditions discussed in Cantoni and Ronchetti (2001b), and under H0 : β_(2) = 0, the test statistic Λ_QM defined by (5.24) is asymptotically equivalent to

n L_n^T C(Ψ, F_β) L_n = n R_{n(2)}^T M(Ψ, F_β)_{22.1} R_{n(2)},   (5.25)

where C(Ψ, F_β) = M^{−1}(Ψ, F_β) − M̃^+(Ψ, F_β) (with M̃^+(Ψ, F_β) given below), √n L_n (of dimension (q + 1)) is normally distributed N(0, Q(Ψ, F_β)),

M(Ψ, F_β)_{22.1} = M(Ψ, F_β)_(22) − M(Ψ, F_β)_(12)^T M(Ψ, F_β)_(11)^{−1} M(Ψ, F_β)_(12),

and √n R_n (of dimension (q + 1)) is normally distributed N(0, M^{−1}(Ψ, F_β) Q(Ψ, F_β) M^{−1}(Ψ, F_β)) (see Cantoni and Ronchetti, 2001b). Note that R_{n(2)} is of dimension k.
This means that Λ_QM is asymptotically equivalent to a quadratic form in normal variables, and that Λ_QM is asymptotically distributed as Σ_{i=1}^k d_i N_i², where N_1, …, N_k are independent standard normal variables, d_1, …, d_k are the k positive eigenvalues of the matrix Q(Ψ, F_β)(M^{−1}(Ψ, F_β) − M̃^+(Ψ, F_β)), and M̃^+(Ψ, F_β) is equal to

M̃^+(Ψ, F_β) = [ M(Ψ, F_β)_(11)^{−1}   0_{(q−k+1)×k}
                0_{k×(q−k+1)}        0_{k×k}      ],

where 0_{a×b} is a matrix of dimension a × b with only zero entries. The above results imply that the asymptotic distribution of Λ_QM is a linear combination of χ₁² variables, for which theoretical results (e.g. Imhof, 1961) and algorithms (see Davies, 1980; Farebrother, 1990) exist. Moreover, if necessary, the distribution of the variable Σ_{i=1}^k d_i N_i² can be approximated quite well with a d̄ χ_k² distribution, where d̄ = (1/k) Σ_{i=1}^k d_i. No formal proof exists for the asymptotic distribution of this test statistic, but we expect that the results for linear models by Markatou and Hettmansperger (1992) carry over, at least approximately; our experience shows that this is often the case in practice. Other approximations exist; see Wood (1989), Wood et al. (1993) and Kuonen (1999).

In addition to providing the asymptotic distribution of Λ_QM, result (5.25) states that Λ_QM is asymptotically equivalent to the quadratic form

n β̂_{[M](2)}^T M(Ψ, β)_{22.1} β̂_{[M](2)}.
This suggests that Λ_QM can be approximated with this easier-to-compute quadratic form, to avoid the numerical integrations in Q_M(µ_i; y_i), in particular when n is large.³ The robustness properties of a test statistic are measured on the level and on the power scale; see Section 2.2. Cantoni and Ronchetti (2001b) work out the expressions of the level and of the power of Λ_QM under contamination. These results show in particular that the asymptotic level of Λ_QM under contamination is stable as long as a bounded influence M-estimator β̂_[M](2) is used in its definition.
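The d̄ χ_k² approximation is easy to check by simulation. The sketch below (our own code, with hypothetical eigenvalues d_i) compares a Monte Carlo estimate of P(Σ d_i N_i² ≥ stat) with the d̄ χ_k² value, written out for k = 2, where the χ₂² survival function has the closed form exp(−x/2).

```python
# P-value for a statistic whose null law is sum_i d_i N_i^2:
# (i) Monte Carlo; (ii) the d-bar * chi2_k approximation from the text.
# The eigenvalues below are made up for the illustration.
import math, random

def p_monte_carlo(stat, d, n_sim=200_000, seed=1):
    rng = random.Random(seed)
    exceed = sum(
        sum(di * rng.gauss(0.0, 1.0) ** 2 for di in d) >= stat
        for _ in range(n_sim)
    )
    return exceed / n_sim

def p_dbar_chi2(stat, d):
    # Compare stat / d-bar with a chi2_k law; for k = 2 the chi2_2
    # survival function is exp(-x / 2).
    assert len(d) == 2, "closed form written out for k = 2 only"
    dbar = sum(d) / len(d)
    return math.exp(-(stat / dbar) / 2.0)

d = [0.9, 0.6]                 # hypothetical positive eigenvalues (k = 2)
p_mc = p_monte_carlo(5.0, d)   # simulated null probability
p_ap = p_dbar_chi2(5.0, d)     # d-bar chi2_2 approximation
```

For eigenvalues of similar magnitude, as here, the two p-values agree closely; when the d_i are very unequal, the Imhof/Davies-type algorithms cited above are preferable.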
5.4.3 Hospital Costs Data Example (continued)

If we look back at Tables 5.2 and 5.3, we can see that the conclusions from the classical and the robust analyses on the basis of the univariate test statistics (p-values in Tables 5.2 and 5.3) are quite different: while no doubt arises as to the significance of the intercept and the variables log(los) and adm in both analyses, the robust analysis would suggest a significant effect also for dest, and less clearly for sex, making the role of these two variables unclear (see also the corresponding CIs). A more complete variable selection procedure is therefore recommended before proceeding with any interpretation and conclusion. We now investigate this variable selection issue a little further.

³ The anova.glmrob function in the robustbase package in R (called by the generic function anova) implements both the test statistic Λ_QM and its asymptotic quadratic approximation, in addition to a Wald test.
We first start by comparing the full model with the reduced model without the variables ins and age. This amounts to testing H0 : β3 = β4 = 0 in (5.12). We keep the same robustness tuning parameters for the robust test as in Section 5.3.5, that is, c = 1.5 and w(x_i) = 1. The difference of quasi-deviances Λ_QM is equal to 1.23 (p-value = 0.5), which confirms that these two variables do not have a significant impact on the cost of stay. We go on by comparing the model including log(los), adm, sex and dest with the nested sub-model that excludes sex. The hypothesis that the coefficient corresponding to the variable sex is equal to zero is rejected at the 5% level (Λ_QM = 5.26 and p-value = 0.015). Similarly, we compare the model including log(los), adm, sex and dest with the nested sub-model that discards dest. The difference of quasi-deviances statistic Λ_QM is equal to 4.82 and the p-value is 0.02, which implies the rejection, at the 5% level, of the null hypothesis that the coefficient of dest is equal to zero. This means that the models without either sex or dest are not enough to describe the data. As a comparison, a classical analysis would also fail to reject the sub-model without ins and age compared with the full model (5.12) (p-value = 0.44). Starting from this sub-model, the classical analysis would reject the sub-model without sex, but not the sub-model without dest. This confirms the preliminary differences between the classical and robust analyses observed with the full fit in Section 5.3.5. The final model obtained from the robust analysis has the following estimated linear predictor (with standard errors of the coefficients within parentheses):

7.168 + 0.839 log(los) + 0.231 adm + 0.082 sex − 0.104 dest.
(0.067)  (0.020)         (0.035)    (0.034)     (0.047)

The estimate of the scale parameter is 0.024.
The analysis suggests that hospital costs of stay for back problems depend heavily on the length of stay, but also on the type of admission, the sex of the patient and their destination when leaving the hospital. The age of the patient and the type of insurance do not significantly impact the costs for this pathology. The impact of the significant covariates on the average cost E[y_i | x_i] = µ_i is described by µ_i = g^{−1}(x_i^T β). Having used a logarithmic link in this example, we have that µ_i = exp(x_i^T β). The interpretation of each coefficient uses this relationship and can be done separately, provided that all of the other variables are kept fixed. In this respect, the final model tells us that an emergency admission has a multiplicative effect of exp(0.231) = 1.26 on the average cost, which means a 26% increase. Patients that go home directly after the hospital stay (with respect to those that go to another institution) have lower costs, by a factor of exp(−0.104) ≈ 0.90. One could expect the converse to be true, but the patient's destination after hospital is probably an indicator of how severe the back problems under treatment are: a patient that can be independent and go home directly was probably treated for a lighter problem in the first place. Of course, the longer the stay, the higher the costs, as expected: if log(los) increases by 1, that is, if los is multiplied by e ≈ 2.7, the average cost is multiplied by exp(0.839) = 2.31. Finally, costs
for male patients seem to be slightly higher than those for female patients by a factor of exp(0.082) = 1.09. In this example, the estimated parameters of the variables appearing in this final model are quite close to the corresponding estimates in the full model, see Tables 5.2 and 5.3. This is due to the low correlation between the covariates.
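The multiplicative effects quoted in this discussion follow directly from the log link: a coefficient β_j multiplies the mean cost by exp(β_j) per unit increase of x_j. A quick check of the arithmetic:

```python
# Verifying the multiplicative effects quoted for the hospital costs model.
import math

effect_adm  = math.exp(0.231)   # emergency admission: ~1.26, i.e. +26%
effect_dest = math.exp(-0.104)  # going home directly: ~0.90 of the cost
effect_los  = math.exp(0.839)   # one-unit increase of log(los): ~2.31
effect_sex  = math.exp(0.082)   # male vs female: ~1.09
```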
5.5 Breastfeeding Data Example

5.5.1 Robust Estimation of the Full Model

We now look at a binary response example. The data come from a study conducted in a UK hospital on the decision of pregnant women to breastfeed their babies or not; see Moustaki et al. (1998). For the study, 135 expectant mothers were asked what kind of feeding method they would use for their coming baby. The responses were classified into two categories (variable breast), the first including breastfeeding, try to breastfeed and mixed breast- and bottle-feeding (coded 1), and the second exclusive bottle-feeding (coded 0). The available covariates are the advancement of the pregnancy (pregnancy, end or beginning), how the mothers were fed as babies (howfed, some breastfeeding or only bottle-feeding), how the mothers' friends fed their babies (howfedfriends, some breastfeeding or only bottle-feeding), whether they had a partner (partner, no or yes), their age (age), the age at which they left full-time education (educat), their ethnic group (ethnic, white or non-white), whether they had ever smoked (smokebf, no or yes) and whether they were currently smoking (smokenow, no or yes). All of the factors are two-level factors, and the first listed level of each factor is used as the reference (coded 0). The sample characteristics are as follows: out of the 135 observations, 99 were from mothers who had decided at least to try to breastfeed, 54 mothers were at the beginning of their pregnancy, 77 were themselves breastfed as a baby, 85 of the mothers' friends had breastfed their babies, 114 mothers had a partner, the median age was 28.17 (minimum 17, maximum 40), the median age at the end of education was 17 (minimum = 14, maximum = 38), 77 mothers were white and 32 mothers were smoking during the pregnancy, whereas 51 had smoked before.
The aim of the study was to determine the factors impacting the decision to at least try to breastfeed in order to target breastfeeding promotion toward women with a lower probability of choosing it. We fitted the following model: logit(E[breast]) = logit(P (breast)) = β0 + β1 pregnancy + β2 howfed + β3 howfedfr + β4 partner + β5 age + β6 educat + β7 ethnic + β8 smokenow + β9 smokebf,
(5.26)
where logit(p) = log(p/(1 − p)), with p/(1 − p) the odds of a success, and P(breast) is the probability of at least trying to breastfeed. Table 5.4 gives the robust estimates, standard errors and p-values for the z-test (5.23) of model (5.26), for a Huber-type estimator (w(x_i) = 1) and for a Mallows-type estimator with w(x_i) = √(1 − h_ii). The value c = 1.5 has been used in both
Table 5.4 Robust estimates for model (5.26).

                        Huber                       Mallows
Variable                Estimate (SE)    p-value    Estimate (SE)    p-value
intercept               −7.782 (3.365)   0.021      −7.778 (3.363)   0.021
pregnancy beginning     −0.816 (0.695)   0.241      −0.815 (0.694)   0.241
howfed breast           0.545 (0.710)    0.443      0.540 (0.708)    0.445
howfedfr breast         1.479 (0.690)    0.032      1.482 (0.689)    0.032
partner yes             0.772 (0.816)    0.344      0.775 (0.816)    0.342
age                     0.030 (0.060)    0.611      0.031 (0.060)    0.608
educat                  0.377 (0.186)    0.042      0.376 (0.185)    0.042
ethnic non-white        2.712 (1.125)    0.016      2.705 (1.122)    0.016
smokenow yes            −3.476 (1.129)   0.002      −3.468 (1.127)   0.002
smokebf yes             1.507 (1.103)    0.172      1.507 (1.102)    0.171

The estimates are obtained by solving (5.13) with c = 1.5 (Huber's estimator) and with c = 1.5 and w(x_i) = √(1 − h_ii) (Mallows's estimator).
cases. The coefficient estimates from both analyses are quite close, even though individual 18 (see the top panel of Figure 5.3) is considered a potential leverage point. This mother is 38 years old and still in education (educat = 38). This is possible, but it is certainly not common in the majority of the population. This remark raises the question of the rationale behind the definition of the variable educat (age at the end of full-time education): what information are we trying to measure with this variable? If it is educational level, this is perhaps not what educat really measures. In other studies, the number of years of education is recorded, which can also be seen as a proxy for social status. From Figure 5.3 (bottom panel) we can also see that a small set of observations are downweighted on the grounds of their residuals; in particular, observations 11, 14, 63, 75, 90 and 115 receive a weight of less than 0.6. Note that 6 observations out of 135 constitute about 4.5% of the total information. For these mothers the fitted model (5.26) would predict a probability of at least trying to breastfeed that is not consistent with the behavior of the majority of the mothers in the sample on the basis of the covariates (see Figure 5.4): for instance, for observations 75, 11, 115 and 14 the predicted probability of trying to breastfeed is larger than 0.90, whereas these mothers have decided to bottle-feed. On the other hand, mothers 90 and 63 are given low predicted probabilities of trying to breastfeed (only 0.02 and 0.11, respectively), whereas they have chosen to do so. According to the p-values of Table 5.4, the variables that have the greatest impact on the decision to at least try to breastfeed are whether the ethnic group is non-white and whether the mother is currently smoking and, less strongly, the age at which she left education and whether her friends have chosen to breastfeed. A more formal variable selection procedure follows in Section 5.5.2.
Note that a classical analysis would have yielded different estimates and conclusions; see also Section 5.5.2. A slightly different estimation method for this dataset
has been used in Victoria-Feser (2002), in particular a model-based weighting scheme. The conclusions are similar between our proposal and her Mallows-type estimator.

[Figure 5.3 Robustness weights on the design, w(x), and on the residuals, w(r), versus observation index for model (5.26), when estimated by (5.13) with c = 1.5 and w(x_i) = √(1 − h_ii).]
5.5.2 Variable Selection

When analysing the full model, on the basis of the p-values corresponding to the z-statistics, the variables howfedfr, smokenow, ethnic and educat have an important impact on the decision to at least try to breastfeed. Here we investigate the variable selection issue further. With this dataset, we illustrate a backward stepwise procedure. We start with the full model and use the test statistic Λ_QM to test each sub-model with one variable removed. All of the sub-models for which the p-value of such a test is larger than 5% are candidates for removal, and among them we choose the sub-model with the largest p-value. We then repeat the procedure by taking this sub-model as the new reference model and testing all of its sub-models. The procedure stops when all of the p-values are smaller than 0.05. Table 5.5 gives the p-values at the first step of the procedure. For comparison, we also give the results for a classical analysis. A comparison of the p-values from the classical and the robust approaches confirms that the robustness issues related to
the presence of deviating data points are also a concern for inference. In fact, large discrepancies (as large as 0.2) appear between the two approaches in terms of p-values. Some of these differences do not really have an impact on the significance decision at a usual level of 5% or 10% (e.g. howfed or partner), but some others do (e.g. educat). The complete robust stepwise procedure yields the following final model (with standard errors of the coefficients within parentheses):

−6.417 + 1.478 howfedfr + 3.260 ethnic + 0.403 educat − 2.421 smokenow.
(2.973)  (0.622)          (1.199)       (0.177)        (0.664)

[Figure 5.4 Fitted values versus actual values for model (5.26), when estimated by (5.13) with c = 1.5 and w(x_i) = √(1 − h_ii). Observations with w̃(r_i; β, φ, c) < 0.6 are spotted.]
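The backward stepwise logic described above can be summarized in a few lines. The sketch below is a generic skeleton of our own, not the robustbase implementation: `pvalue_drop` stands in for the robust deletion test based on (5.24), and here it simply reads the robust first-step p-values of Table 5.5, so the first variable selected for removal is age, as in the text.

```python
# Skeleton of one backward stepwise elimination step: drop the variable
# whose single-term deletion test has the largest p-value above the
# threshold, or stop if none qualifies.

FIRST_STEP = {  # robust first-step p-values from Table 5.5
    "pregnancy": 0.20600, "howfed": 0.39778, "howfedfr": 0.02820,
    "partner": 0.32888, "age": 0.58512, "educat": 0.02283,
    "ethnic": 0.00187, "smokenow": 1e-5,  # table reports < 10^-4
    "smokebf": 0.08605,
}

def backward_step(variables, pvalue_drop, threshold=0.05):
    """Return the variable to remove at this step, or None if every
    deletion p-value is below the threshold (procedure stops)."""
    pvals = {v: pvalue_drop(variables, v) for v in variables}
    worst = max(pvals, key=pvals.get)
    return worst if pvals[worst] > threshold else None

drop = backward_step(list(FIRST_STEP), lambda _, v: FIRST_STEP[v])
```

In a real analysis `pvalue_drop` would refit the sub-model without the candidate variable and evaluate the robust difference-of-quasi-deviances test; the loop then repeats on the reduced model until `backward_step` returns None.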
From the robust analysis, the non-significant variables have been removed in the following order: age (p-value = 0.58, the largest p-value at the first step, see Table 5.5), howfed (p-value = 0.40), pregnancy (p-value = 0.26), partner (p-value = 0.41) and smokebf (p-value = 0.25). As a comparison, a classical backward stepwise procedure would have discarded (in the order) age (p-value = 0.60, the largest p-value at the first step, see
Table 5.5 p-values of the first step of a backward stepwise procedure for variable selection for the breastfeeding data example of Section 5.5.

Variable                Classical    Robust
pregnancy beginning     0.08134      0.20600
howfed breast           0.60261      0.39778
howfedfr breast         0.00951      0.02820
partner yes             0.12219      0.32888
age                     0.60271      0.58512
educat                  0.14075      0.02283
ethnic non-white        0.00012      0.00187
smokenow yes            <10⁻⁴        <10⁻⁴
smokebf yes             0.05157      0.08605

Classical p-values obtained with c = ∞ and w(x_i) = 1, and robust p-values with c = 1.5 and w(x_i) = √(1 − h_ii), in (5.24).
Table 5.5), howfed (p-value = 0.58), educat (p-value = 0.10), pregnancy (p-value = 0.20), smokebf (p-value = 0.10) and partner (p-value = 0.0577). The classical final model would therefore include only howfedfr, ethnic and smokenow, which is a smaller and different set of covariates than that obtained by the robust analysis.

From the model identified and fitted by the robust technique we learn that the way a mother was fed as a child does not play a role in her decision of whether to breastfeed, whereas the choice of her friends is more important and has an effect on the expectant mother's decision. A mother's choice to try to breastfeed does not evolve during the pregnancy. This choice is also not affected by the mother being single. Having smoked before being pregnant has no effect on the decision to breastfeed, but being a smoker during the pregnancy significantly reduces the probability of at least trying to breastfeed. Ethnicity and the age at which a mother leaves education are also factors that have an impact on a mother's decision.

The coefficient values allow us to quantify the identified effects on the decision to at least try to breastfeed. As opposed to the Gamma model of Section 5.3.5 or to a Poisson model (see Section 5.6), the interpretation of the impact of covariates on the probability P(breast) is more difficult due to the nature of the logit transformation. In fact,

P(breast_i) = µ_i = exp(x_i^T β) / (1 + exp(x_i^T β)).   (5.27)
With these models it is therefore more common to interpret the coefficients on the odds or odds-ratio scale. The robust estimation procedure has no impact on the way the model is interpreted; the only difference is that the coefficients are estimated differently.
For a continuous variable, the effect of a unit change on the odds is equal to the exponential of the corresponding coefficient. For example, leaving education a year later increases the odds of at least trying to breastfeed by a factor of exp(0.403) = 1.50, if all of the other covariates are kept fixed. For two-level factors, on the other hand, the logit model leads to the interpretation of the odds-ratio (the ratio of the odds). For instance, the odds-ratio of at least trying to breastfeed for a non-white expectant mother relative to a white mother is equal to exp(3.260) = 26.05. Similarly, the odds-ratio of at least trying to breastfeed for a smoking mother relative to a non-smoking one is exp(−2.421) = 0.09; being a smoker during pregnancy has the strongest (negative) effect in the model. Finally, the odds-ratio of at least trying to breastfeed for an expectant mother whose friends have chosen to breastfeed relative to one whose friends bottle-feed is exp(1.478) = 4.38. The interpretation of odds and odds-ratios pertains to the logistic model (that is, the binomial model with logit link), but does not apply to models with the probit or complementary log–log link. This fact is one of the reasons that makes logistic models more popular than the two other alternatives, in addition to their more convenient computational aspects. To summarize, let us recall that the aim of the study was to better target expectant mothers when promoting breastfeeding. The analysis of this dataset suggests that, to increase the average probability of choosing to at least try to breastfeed, efforts should be directed towards white mothers and towards mothers who leave education earlier. Pregnant women who smoke tend to avoid breastfeeding: investigating this phenomenon further could help increase the average probability of expectant mothers choosing to breastfeed.
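The odds and odds-ratio figures quoted for the final breastfeeding model are again direct exponentials of the fitted coefficients; a quick check of the arithmetic:

```python
# Verifying the odds-ratio figures quoted for the final logistic model.
import math

or_educat   = math.exp(0.403)    # one extra year of education: ~1.50
or_ethnic   = math.exp(3.260)    # non-white vs white: ~26.05
or_smokenow = math.exp(-2.421)   # smoker vs non-smoker: ~0.09
or_howfedfr = math.exp(1.478)    # friends breastfed vs bottle-fed: ~4.38
```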
5.6 Doctor Visits Data Example

5.6.1 Robust Estimation of the Full Model

Count data are an important subclass of data that fits into the GLM framework. For this application we use data from the Health and Retirement Study (HRS),⁴ which surveys more than 22 000 Americans over the age of 50 every 2 years. The study paints an emerging portrait of aging America's physical and mental health, insurance coverage, financial status, family support systems, labor market status and retirement planning. The original full dataset from the RAND HRS Data (Version D) distribution (six waves: 1992, 1994, 1996, 1998, 2000 and 2002) contains 26 728 observations and 4140 variables per individual. Individuals were separated into four cohorts:

• HRS cohort (born between 1931 and 1941);
• AHEAD cohort (born before 1924);

⁴ Sponsored by the National Institute of Aging (grant number NIA U01AG09740) and conducted by the University of Michigan; see http://hrsonline.isr.umich.edu/.
GENERALIZED LINEAR MODELS
152
• CODA cohort (born between 1924 and 1930);
• WB cohort (born between 1942 and 1947).

In addition to respondents from eligible birth years, the survey interviewed the spouses of married respondents or the partner of a respondent, regardless of age. We focus on a subsample of 3066 individuals of the AHEAD cohort for wave 6 (year 2002). Note that only individuals with full information have been retained, to avoid issues with missing values. The aim is to identify variables impacting equity in health care utilization. When the information about the costs themselves is not available (in contrast to the example in Section 5.2.3), a proxy variable is used to measure health care consumption, for example the number of visits to the doctor in the previous 2 years. A set of potentially interesting explanatory variables has been retained on the basis of previous studies from the literature, e.g. Dunlop et al. (2002) and Gerdtham (1997); see Table 5.6. These variables are classified into three categories: predisposing variables, health needs and economic access. The first category includes age, gender, race and marital status. Health needs are represented by chronic conditions and functional limitations. In the economic access category, years of education and parents' education measure human capital, whilst income and health insurance from a current or previous employer measure financial ability to pay. A potential concern with count data in the setting of health consumption is an excess of zeros, that is, a large proportion of zero values among the responses, which cannot be modeled with standard distributions (see Ridout et al. (1998) and Section 5.7.1). Given that we target here a population of regular users (elderly), this issue can be excluded: in fact, only about 4% of the counts are equal to zero, see the histogram in Figure 5.5.
We therefore confidently proceeded with a GLM Poisson model with log-link including all of the available covariates:

log(E[visits]) = β0 + β1 age + β2 gender + β3 race + β4 hispan + β5 marital
               + β6 arthri + β7 cancer + β8 hipress + β9 diabet + β10 lung
               + β11 hearth + β12 stroke + β13 psych + β14 iadla1 + β15 iadla2
               + β16 iadla3 + β17 adlwa1 + β18 adlwa2 + β19 adlwa3
               + β20 edyears + β21 feduc + β22 meduc + β23 log(income + 1) + β24 insur.
(5.28)
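The classical (non-robust) fit of a log-link Poisson model such as (5.28) reduces to Fisher scoring on the log-likelihood. The following is an illustrative pure-Python sketch only: it uses a single synthetic covariate and invented coefficients, not the HRS data or the book's code.

```python
import math
import random

def rpois(lam, rng):
    """Poisson sampler (Knuth's multiplication method; fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def fit_poisson_loglink(x, y, n_iter=30):
    """Fisher scoring for log(E[y]) = b0 + b1 * x (two-parameter sketch)."""
    b0, b1 = math.log(sum(y) / len(y) + 1e-12), 0.0   # start at the null model
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        u0 = sum(yi - mi for yi, mi in zip(y, mu))                # score, intercept
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))  # score, slope
        j00 = sum(mu)                                             # Fisher information
        j01 = sum(xi * mi for xi, mi in zip(x, mu))
        j11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = j00 * j11 - j01 * j01
        b0 += (j11 * u0 - j01 * u1) / det                         # scoring step
        b1 += (j00 * u1 - j01 * u0) / det
    return b0, b1

rng = random.Random(42)
x = [rng.uniform(0.0, 2.0) for _ in range(4000)]
y = [rpois(math.exp(0.5 + 0.8 * xi), rng) for xi in x]
b0, b1 = fit_poisson_loglink(x, y)
```

With 4000 simulated observations, the estimates recover the (invented) true values 0.5 and 0.8 up to sampling error.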
We fitted both the classical MLE and a Mallows' robust estimator according to (5.13) with c = 1.6 and w(x_i) = √(1 − h_ii). Given the large number of covariates, the results are presented graphically. Figure 5.6 shows approximate 95% CIs for each variable resulting from a classical fit (on the left, gray line) and from a robust fit (on the right, black line). The intervals are symmetric and the coefficient itself is represented by a dot in the middle.
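The design weights w(x_i) = √(1 − h_ii) downweight leverage points through the hat-matrix diagonal. A minimal sketch, using the closed form of h_ii for a toy one-covariate design (the data values are invented for illustration):

```python
import math

def leverages(x):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T for an
    intercept+slope design; closed form: h_ii = 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 20.0]           # the last point is a leverage point
h = leverages(x)
w = [math.sqrt(1.0 - hi) for hi in h]    # Mallows-type design weights
```

The trace of the hat matrix equals the number of regression parameters (here 2), and the leverage point receives a much smaller design weight than the bulk of the data.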
Table 5.6 HRS data variables description. Note that iadla sums the answers to 'can use the phone', 'can manage money' and 'can take medication', where the answer to each question is coded 1 = difficulty or 0 = no difficulty. Similarly, adlwa sums the responses to being able to 'bath', 'eat' and 'dress'. Finally, 'med' stands for median. Sample size is 3066.

Name       Description                                        Sample values

Response
  visits   Number of visits to the doctor                     0–750 (med = 8)

Predisposing
  age      Age in years                                       42–109 (med = 82)
  gender   Gender (0 = male, 1 = female)                      2079 females
  race     Race (1 = white/Caucasian, 0 = other)              2714 whites
  hispan   Hispanic (1 = Hispanic, 0 = other)                 183 Hispanic
  marital  Marital status (1 = married, 0 = other)            1203 married

Health needs
  arthri   Ever had arthritis (1 = yes, 0 = no)               'yes': 2200
  cancer   Ever had cancer (1 = yes, 0 = no)                  'yes': 594
  hipress  Ever had high blood pressure (1 = yes, 0 = no)     'yes': 1856
  diabet   Ever had diabetes (1 = yes, 0 = no)                'yes': 524
  lung     Ever had lung disease (1 = yes, 0 = no)            'yes': 312
  hearth   Ever had heart problems (1 = yes, 0 = no)          'yes': 1206
  stroke   Ever had a stroke (1 = yes, 0 = no)                'yes': 492
  psych    Ever had psychiatric problems (1 = yes, 0 = no)    'yes': 479
  iadla    Instr. activities of daily living (0, 1, 2, 3)     '0': 2433, '1': 258, '2': 178, '3': 197
  adlwa    Activities of daily living (0, 1, 2, 3)            '0': 2284, '1': 361, '2': 234, '3': 187

Econ. access
  edyears  Education years                                    0–17 (med = 12)
  feduc    Father's education (years)                         0–17 (med = 8.5)
  meduc    Mother's education (years)                         0–16 (med = 8.5)
  income   Total household income                             0–725 600 (med = 21 540)
  insur    Ins. from current/prev. empl. (1 = yes, 0 = no)    'yes': 649
Note that the magnitudes of the coefficients are not comparable across variables. In fact, some of them are measured in years, e.g. age, meduc, feduc and edyears, one is measured in log-dollars (log(income + 1)), and all of the other variables are dummies.
Figure 5.5 Histogram of visits. Note that the abscissa has been limited to (0, 100) (there are 21 observations out of 3066 outside this range, the largest value being 750).
As one can see, the coefficients of the classical and the robust analyses are sometimes quite different. Also, the standard error estimates tend to be a bit larger in the robust analysis. The CIs from the classical analysis indicate that all of the variables are highly significant (no crossing of the horizontal line at zero), except for marital. From the robust analysis it seems, however, that the variables race, meduc, log(income + 1) and insur are not significant. For additional variable significance tests, see Section 5.6.2. The dataset here is much larger than the previous dataset, both in sample size and in the number of covariates. For this reason, the plot of the weights (see Figure 5.7) shows what seems to be a large number of downweighted observations. Note, however, that the average of the weights over the total number of observations is

Σ_{i=1}^{3066} w(r_i; β, φ, c) w(x_i) / 3066 = 79.4%,

which reflects, loosely speaking, an average degree of 'outlyingness' of about 20%. This may seem a lot, possibly indicating that extra covariates should be added or that the distributional assumptions should be modified. Also, the weights on the design are all close to one.
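The averaging of robustness weights used as an 'outlyingness' diagnostic can be illustrated with a toy computation; the Huber-type weight function and the residual values below are illustrative and unrelated to the HRS fit.

```python
def huber_weight(r, c=1.6):
    """Huber-type robustness weight: 1 for small residuals, c/|r| beyond c."""
    return 1.0 if abs(r) <= c else c / abs(r)

# toy standardized residuals with two gross outliers (invented values)
residuals = [0.2, -0.5, 1.1, 8.0, -0.3, 16.0]
weights = [huber_weight(r) for r in residuals]
avg_weight = sum(weights) / len(weights)   # overall downweighting summary
```

The two outlying residuals receive weights 0.2 and 0.1, pulling the average weight below one even though most observations are left untouched.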
5.6.2 Variable Selection As can be seen in Figure 5.6, almost all of the (preselected) variables for this study seem significant. We would like to confirm whether the variables race, meduc, log(income + 1) and insur can be excluded from the model. For this purpose
Figure 5.6 Coefficient estimates and approximate 95% CIs for the log-link Poisson model (5.28), estimated by maximum likelihood (classical) and by (5.13) with c = 1.6 and w(x_i) = √(1 − h_ii) (robust). For each variable, the results on the left are from the classical analysis and on the right from the robust analysis.
we use the difference of quasi-deviances statistic ΔQ_M with c = 1.6 and w(x_i) = √(1 − h_ii). We first test the null hypothesis H0: β3 = 0 in the full model, which is not rejected (p-value = 0.73). We therefore remove the variable race. We next test whether meduc is significant in the sub-model from which race has already been removed. This variable is not significant (p-value = 0.62) and we remove it. We go on to test whether we can in addition remove log(income + 1), which is not significant (p-value = 0.35). We last test the removal of insur. The p-value is 0.50, and we decide to remove insur as well.

The above approach is called a sequential approach and differs from a marginal/stepwise approach in that it does not test all of the sub-models at each step. The drawback is that the final model is heavily dependent on the order in which the variables are considered for removal, in particular when the covariates are far from independent. Table 5.7 gives the estimates for the final model retained above. The factors explaining the number of visits to the doctor are numerous, as confirmed by the long list of variables in Table 5.7. We have already learned that being Caucasian, the
Figure 5.7 Robustness weights from the fit of model (5.28) estimated by (5.13) with c = 1.6 and w(x_i) = √(1 − h_ii). Top panel: response weights w(r) against observation index; bottom panel: design weights w(x) against observation index.
Table 5.7 Final model estimates for the doctor visits data.

Variable     Estimate (SE)     p-value
intercept    1.989 (0.114)     <10^-4
age          −0.005 (0.001)    <10^-4
gender       0.030 (0.015)     0.0409
hispan       0.213 (0.027)     <10^-4
marital      −0.050 (0.014)    0.0006
arthri       0.180 (0.015)     <10^-4
cancer       0.178 (0.015)     <10^-4
hipress      0.197 (0.014)     <10^-4
diabet       0.198 (0.015)     <10^-4
lung         0.110 (0.019)     <10^-4
hearth       0.304 (0.013)     <10^-4
stroke       0.125 (0.016)     <10^-4
psych        0.180 (0.016)     <10^-4
iadla1       0.056 (0.023)     0.0143
iadla2       0.176 (0.027)     <10^-4
iadla3       0.244 (0.029)     <10^-4
adlwa1       0.160 (0.019)     <10^-4
adlwa2       0.231 (0.024)     <10^-4
adlwa3       0.382 (0.029)     <10^-4
edyears      0.008 (0.002)     <10^-4
feduc        −0.020 (0.006)    0.0025

The estimates are obtained by (5.13) with c = 1.6 and w(x_i) = √(1 − h_ii) (Mallows' estimator).
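As a sketch of how Wald-type p-values such as those in Table 5.7 arise (a two-sided normal test of estimate/SE; the tabulated values will differ slightly because the published SEs are rounded):

```python
import math

def wald_pvalue(estimate, se):
    """Two-sided p-value for z = estimate/se under the normal approximation:
    p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2.0))

# e.g. the marital row of Table 5.7: estimate -0.050, SE 0.014
p_marital = wald_pvalue(-0.050, 0.014)
```

With the rounded SE, p_marital comes out near the tabulated 0.0006 but not identical to it.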
level of mother's education, total household income and having a health insurance plan from a previous employer do not have a statistically significant impact on health consumption (doctor visits). The Poisson GLM used for this example has a logarithmic link. Interpretation of the coefficients is therefore done through the relationship μ_i = exp(x_i^T β), as in the Gamma model with logarithmic link in Section 5.4.3. For example, a patient who is five years older would, on average, have a number of visits to the doctor multiplied by exp(−0.005 · 5) = 0.975, that is, reduced by 2.5%. It is surprising to see that the coefficient of age is negative, meaning that older patients consume less. However, the effect is very small (no practical significance), even though it is statistically significant. Interpretation of education level via years of education (edyears) and father's education (feduc) is puzzling. On the one hand, an extra year of father's education decreases the number of visits by 2% (exp(−0.02) = 0.98). On the other hand, years of education of the patient himself tend to increase doctor needs by 1% (exp(0.008) = 1.01). Married individuals visit the doctor less on average: exp(−0.05) = 0.95. All of the effects in the 'health needs' category are positive, indicating, as expected, that if some conditions are
present (arthritis, diabetes, high blood pressure, etc.), the number of doctor visits is larger on average than for an individual without these conditions.
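The multiplicative interpretation of log-link coefficients used in this paragraph can be reproduced directly; the small helper below is illustrative only, using the estimates from Table 5.7.

```python
import math

def rate_ratio(beta, delta=1.0):
    """Multiplicative effect on E[visits] of changing a covariate by delta
    in a log-link model: exp(beta * delta)."""
    return math.exp(beta * delta)

rr_age5 = rate_ratio(-0.005, 5)    # five extra years of age
rr_feduc = rate_ratio(-0.020)      # one extra year of father's education
rr_marital = rate_ratio(-0.050)    # married vs. not married
```

These reproduce the 0.975, 0.98 and 0.95 multipliers quoted in the text.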
5.7 Discussion and Extensions

The GLM class encompasses a large variety of data distributions, but of course it has its own limitations. GLMs have therefore been extended in various ways. The linear component structure has been relaxed and non-parametric functions have been considered in generalized additive models (GAMs; see Hastie and Tibshirani (1990)). The exponential family restriction can be overcome by using quasi-likelihood functions instead of proper likelihoods. The asymptotic results for the estimators derived in this way have to be adapted, essentially by changing the asymptotic variance estimator (the sandwich formula, see Fahrmeir and Tutz (2001, pp. 55–58)). Finally, in GLMs the responses are assumed to be independent; the framework therefore does not cover, for instance, longitudinal or clustered data, where there are typically several observations per subject for which it is not reasonable to assume independence (even though the subjects themselves can be considered independent), see in particular Chapter 6. In the following sections we discuss some ideas for extensions of the approach presented in this chapter and some open areas of research.
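The sandwich variance mentioned above can be illustrated on the simplest M-estimator, the sample mean: with ψ_i = y_i − μ one has A = n and B = Σ ψ_i², so A⁻¹BA⁻¹ reduces to the empirical (population) variance divided by n. This is a toy sketch, not the GLM formula of Fahrmeir and Tutz.

```python
import statistics

def sandwich_var_mean(y):
    """Sandwich variance A^{-1} B A^{-1} for the sample mean:
    psi_i = y_i - mu, so A = n and B = sum(psi_i^2)."""
    n = len(y)
    mu = sum(y) / n
    b = sum((yi - mu) ** 2 for yi in y)   # 'meat' B
    return b / n ** 2                      # A^{-1} B A^{-1} with A = n

y = [1.0, 2.0, 4.0, 5.0]                  # invented data
v = sandwich_var_mean(y)                  # equals pvariance(y) / n
```

The agreement with `statistics.pvariance(y) / len(y)` shows why the sandwich formula is called 'model-agnostic': it estimates the variability of the estimating equations directly from the data.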
5.7.1 Robust Hurdle Models for Counts

A particular feature of count data is that they sometimes show an excess of zeros. Typical examples include the number of visits to the doctor over a given period (see Cameron and Trivedi, 1998) or the abundance of species (see Barry and Welsh, 2002). Data with an excess of zeros have been modeled in various ways: with mixture models, with distributions more flexible than the common Poisson (e.g. negative binomial, Neyman type A, see for instance Dobbie and Welsh (2001b)), with zero-inflated distributions (zero-inflated Poisson or zero-inflated negative binomial, see Lambert (1992)), or with hurdle models (also called two-step or conditional models, see Mullahy (1986)). Ridout et al. (1998) and Min and Agresti (2002) give extensive reviews. From our perspective, hurdle models are quite attractive because they possess nice orthogonality properties and fit nicely into the GLM framework and the robust approach presented in this chapter. A hurdle model is characterized by a two-stage procedure. First, presence (yi > 0) or absence (yi = 0) is modeled through a set of covariates x_i with a logistic-type model. Then, conditional on presence, the positive values are modeled through a set of covariates x̃_i (possibly equal to x_i) with a truncated distribution (e.g. a truncated Poisson) and a corresponding model (a log-linear type of model). This implies that y_i = 0 with probability 1 − p(x_i) and
5.7. DISCUSSION AND EXTENSIONS
159
y_i ~ truncated Poisson with probability p(x_i). In summary,

P(Y_i = y_i | x_i, x̃_i) = 1 − p(x_i)                                                  for y_i = 0,
P(Y_i = y_i | x_i, x̃_i) = p(x_i) exp(−λ(x̃_i)) λ(x̃_i)^{y_i} / [y_i! (1 − exp(−λ(x̃_i)))]   for y_i = 1, 2, . . . ,

with logit(p(x_i)) = x_i^T β and log(λ(x̃_i)) = x̃_i^T α. The log-likelihood l(α, β) of the above model factorizes as l(α) + l(β), which has the double advantage of splitting the fitting into two subproblems of smaller size and making the interpretation easier (each set of parameters impacts only one part of the model). A robust procedure for the hurdle model can be derived by robustifying each submodel separately. The logistic presence/absence model can be fitted robustly by the approach presented in the previous sections, and the truncated Poisson modeling part has been addressed in Zedini (2007). Routines in R are currently under preparation and will be made available either within the robustbase package or as a standalone package.
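The two-part probability structure above can be checked numerically; the sketch below (with arbitrary p and λ, not fitted values) verifies that the hurdle probabilities sum to one.

```python
import math

def hurdle_pmf(y, p, lam):
    """P(Y = y) under a hurdle model: presence probability p, and a
    zero-truncated Poisson(lam) for the positive part."""
    if y == 0:
        return 1.0 - p
    trunc = math.exp(-lam) * lam ** y / math.factorial(y)
    return p * trunc / (1.0 - math.exp(-lam))

p, lam = 0.7, 2.5                                         # illustrative values
total = sum(hurdle_pmf(y, p, lam) for y in range(60))     # mass up to y = 59
```

Because the positive part is renormalized by 1 − exp(−λ), the zero mass 1 − p and the positive masses add up to one, which is exactly the factorization the text exploits.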
5.7.2 Robust Akaike Criterion

The principle of the AIC (see Section 3.4.5) is to use the likelihood at a given model, penalized by its number of parameters, to identify the best model(s), that is, the best compromise(s) between parsimony and goodness of fit. The smaller the value of AIC, the better. In fact, AIC is an estimate of the expected entropy, which one would like to maximize. A robust version of AIC is available for linear models, see (3.31), but not (yet) for GLMs, where a generalized version of AIC can be constructed based on the quasi-likelihood functions defined in this chapter. We briefly sketch the idea here. The log-likelihood in the original definition of AIC can be replaced by the quasi-likelihood function (5.7), with the penalization term adapted, see Ronchetti (1997b) and Stone (1977). This yields the final generalized criterion:

GAIC = −2 Σ_{i=1}^n Q_M(μ̂_i; y_i) + 2 tr(M^{−1}(ψ, F_β) Q(ψ, F_β)),

with M(ψ, F_β) and Q(ψ, F_β) given in (5.21) and (5.22).
5.7.3 General Cp Criterion for GLMs

The Mallows' Cp criterion (Mallows, 1973) has mainly been used in linear regression. A robust version of it for linear models exists thanks to Ronchetti and Staudte (1994) (see (3.32)). It is constructed upon the idea that the Cp criterion is an unbiased estimator of a measure of prediction error. Following the same reasoning, Cantoni et al. (2005) develop a similar criterion, called GC_p, to be used for GEE models to address various issues (missingness, heteroscedasticity) including robustness. The GLM setting being the limiting case of a longitudinal setting where there is only one observation per subject, GC_p for GLM can be deduced from the original proposal of Cantoni et al. (2005). If we define the rescaled weighted predictive squared error by

Γ_p = Σ_{i=1}^n E[ w²(r_i^p) · ((ŷ_i^p − E[y_i | x_i^{(p)}]) / √(φ v̂_{μ_i}))² ],   (5.29)

where r_i^p = (y_i − ŷ_i^p)/√(φ v̂_{μ_i}) are the Pearson residuals, ŷ_i^p are the fitted values of the model with p ≤ (q + 1) explanatory variables x_i^{(p)} (including the intercept), v̂_{μ_i} are 'external' variance estimates (held fixed) and w(·) is a weighting function that downweights atypical observations, then a general form of an unbiased estimator of Γ_p is

GC_p = Σ_{i=1}^n (w(r_i^p) r_i^p)² − Σ_{i=1}^n E[(w(r_i^p) ε_i)²] + 2 Σ_{i=1}^n E[w²(r_i^p) ε_i δ_i],   (5.30)

with ε_i = (y_i − E[y_i | x_i^{(p)}])/√(φ v_{μ_i}) and δ_i = (ŷ_i − E[y_i | x_i^{(p)}])/√(φ v_{μ_i}), where the two latter terms are corrections to achieve unbiasedness. Computing these two terms for GLM and for our particular (robust) M-estimator (5.13) would yield the final form of GC_p.
5.7.4 Prediction with Robust Models

The goals of model fitting are numerous, but they certainly include prediction. For example, in the hospital costs example of Section 5.2.3, health insurers could be interested in forecasting costs for the following year in order to establish their budget. If in this example the robustly fitted model is used naively to obtain predictions, the reproducibility of the outliers, that is, the fact that individuals with abnormally high costs will likely appear again in the future, would imply potentially severe bias in the predictions (e.g. underestimation). This feature is shared by all models in which the outliers are characterized by particularly large values with respect to the bulk of the data (this is not the case in examples with binary responses, for instance). In such situations, one should therefore correct the predictions for possibly reproducible outliers by considering shrinkage robust estimators, see for example Welsh and Ronchetti (1998) and Genton and Ronchetti (2008).
6

Marginal Longitudinal Data Analysis

Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser © 2009 John Wiley & Sons, Ltd

6.1 Introduction

Longitudinal data models are a step further away from linear models. Beyond GLMs, longitudinal studies are those where individuals are measured repeatedly over time. So, with respect to the GLM modeling of Chapter 5, a second dimension is added, in which each subject can be measured several times. With respect to the (normal) MLMs of Chapter 4, the extension broadens the nature of the responses considered: here we allow the response to come from any distribution of the exponential family (discrete or continuous), as in Chapter 5. Note that the terminology 'longitudinal data' is used mostly in medicine, biology and the health sciences, whereas sociologists and economists would mostly use the term 'panel data'. It has to be stressed that even though the most common applications are situations where the main units are individuals (e.g. the example in Section 6.5), the methodology can also be applied to otherwise clustered data, where there are units within which measurements cannot be considered independent (e.g. the example in Section 6.2.3). When there is only one observation per subject, inference solely about the population average is possible. In contrast, longitudinal studies can distinguish between changes over time within individuals (called aging effects) and differences among people in their baseline levels (called cohort effects). In other words, longitudinal studies are able to distinguish between the degree of variation of the response across time for one person and the variation in the response among people. Statistically speaking, one has to take into account the correlation within
measurements of the same subject (even if the subjects themselves can be considered independent). The same pattern/behavior is assumed across subjects, and strength is borrowed from this. The literature on marginal longitudinal models is wide, partly because models have been developed in at least three main directions, see Section 6.2. The basis of the generalized estimating equations (GEE) approach that we follow here was introduced in the seminal work of Liang and Zeger (1986) and Zeger and Liang (1986). Since then, many extensions and variations have been considered, in particular an extension to mixed linear types of models (Zeger et al., 1988), polytomous responses (Heagerty and Zeger, 1996; Liang et al., 1992; Stram et al., 1988), survival responses (Heagerty and Zeger, 2000, and references therein), weighted GEE (Preisser et al., 2000) and zero-inflated count data (Dobbie and Welsh, 2001a). A nice book on longitudinal data is Diggle et al. (2002), which is an extension of an earlier edition. The book by Molenberghs and Verbeke (2005) is another interesting reference. A more focused book on the GEE approach is Hardin and Hilbe (2003), and a recent book addressing correlated data is Song (2007). The theory around the GEE approach is sometimes sparse, in particular when it comes to the nuisance parameters, for which the inferential aspects have not been well treated. Variable selection issues for these models have been addressed only recently, with Pan (2001) defining an Akaike-type criterion for GEE, called QIC. Moreover, Cantoni et al. (2005) introduce a general Cp-like criterion for variable selection for marginal longitudinal models that can also address robustness issues. Robust alternatives to GEE-type fits were first proposed by Preisser and Qaqish (1999), who define a set of resistant estimating equations. Wang et al.
(2005) propose a robust GEE-type bias-corrected estimator, where the bias is estimated using a classical GEE estimator. Qu and Song (2004) show that their estimating equations proposal based on quadratic inference functions (Qu et al., 2000) has some nice robustness properties for the estimation of the regression parameters in some cases. Cantoni (2004b) proposes a more general and improved version of the estimating equations of Preisser and Qaqish (1999) that also allows quasi-likelihood functions to be defined for inference, which puts the user in a position to carry out a full analysis. We have chosen to present this approach given our familiarity with it, because its extensions make variable selection possible along the same lines as the approach for GLM, and because of its forthcoming availability in R. In this chapter, after discussing the possible approaches to longitudinal data (Section 6.2), we introduce marginal longitudinal models in more detail and present the classical estimation procedure (GEE) used to fit them and the associated inference in Section 6.2.1. The robust counterpart, as per Cantoni (2004b), is introduced and illustrated in Section 6.3. It is based on a weighted set of estimating equations. In addition, quasi-deviance functions are defined for inference purposes and robust model selection. Three different examples serve as motivation and illustration of the theoretical elements introduced in this chapter, especially in Sections 6.3.4, 6.5 and 6.6.
6.2 The Marginal Longitudinal Data Model (MLDA) and Alternatives

We assume that we have measurements y_it for individual (or unit, or cluster) i = 1, . . . , n at time (or occasion, or occurrence) t = 1, . . . , n_i. We additionally define y_i^T = (y_i1, . . . , y_in_i) as the collection of measurements for subject i, and we assume independence between subjects. We assume that E[y_i] = μ_i and that var(y_i) is non-diagonal. At each time point, a set of covariates x_it^T = (1, x_it1, . . . , x_itq) is also measured for each individual. The covariate information on subject i is collected in an n_i × (q + 1) matrix

X_i = \begin{pmatrix} x_{i1}^T \\ \vdots \\ x_{in_i}^T \end{pmatrix} = \begin{pmatrix} 1 & x_{i11} & \cdots & x_{i1q} \\ \vdots & \vdots & & \vdots \\ 1 & x_{in_i 1} & \cdots & x_{in_i q} \end{pmatrix}.

The complete set of data comprises N = Σ_{i=1}^n n_i observations. As with GLMs, the response y_it will be allowed to come from any distribution of the exponential family, see Table 5.1 in Chapter 5. However, using the GLM methodology would not be appropriate here because it ignores the correlation between the measurements of the same subject. Ignoring this correlation has consequences at different levels: inference about the regression parameters is incorrect, estimation of the regression parameters is inefficient and there is suboptimal protection against biases caused by missing data. The difficulty with the analysis of non-Gaussian longitudinal data has been the lack of a rich class of joint distributions for (y_i1, . . . , y_in_i). There are essentially three strategies to address the issue. All three approaches model both the dependence of the response on the explanatory variables and the correlation among the responses. In the following we give a brief overview.

1. Marginal models. Via this approach one models parametrically not only the marginal mean of y_it (as in GLMs and in cross-sectional studies in general) but also the correlation matrix corr(y_i), by imposing a relationship g(E[y_it]) = x_it^T β for a link function g, and by modeling the covariance matrix with extra parameters τ and α: V_{μ_i,τ,α} = τ A_{μ_i}^{1/2} R_{α,i} A_{μ_i}^{1/2}, with A_{μ_i} = diag(v_{μ_i1}, . . . , v_{μ_in_i}), where v_{μ_it} = var(y_it), R_{α,i} is the working correlation matrix and τ is a scale parameter. Only inference about the population mean is possible (population-average inference). The parameters are estimated via a set of estimating equations, because no likelihood is available in this setting.

2. Random effects models. With these models it is assumed that the correlation arising among repeated responses is due to the variation of the regression coefficients across individuals.
One therefore models the conditional expectation of y_it given γ_i (the individual's unexplained variation) by assuming g(E[y_it | γ_i]) = x_it^T β + z_it^T γ_i, with γ_i issued from a distribution F (usually Gaussian) such that E[γ_i] = 0 and var(γ_i) = σ_γ² I. This modeling approach allows for inference about individuals (subject-specific inference). Parameter estimation is performed via likelihood maximization.

3. Transition models. In this case, the conditional expectation given the past, E[y_it | y_i(t−1), . . . , y_i1], is modeled. The assumptions about the dependence of y_it on the past responses and on x_it are combined into a single equation, that is, the conditional expectation of y_it is written as an explicit function of y_i(t−1), . . . , y_i1 and x_it. Likelihood maximization is also the estimation method here.
6.2.1 Classical Estimation and Inference in MLDA

In this chapter, we focus on marginal models, where the final goal is to describe the population average and for which a robust procedure similar to that in Chapter 5 is available. We note at this point that some robust options exist for random effects models as well, see e.g. Mills et al. (2002), Sinha (2004) and Noh and Lee (2007). The model assumptions under which we work are partially shared with the main ingredients defined for GLM.

• The marginal expectation of the response E[y_it] = μ_it depends on a set of explanatory variables x_it via g(μ_it) = x_it^T β, where g is the link function.

• The marginal variance depends on the marginal mean through the relationship var(y_it) = τ v_{μ_it}. The scale parameter τ allows for over- or under-dispersion, in the same manner as for GLMs, see Section 5.2.2.

• The correlation between y_it and y_it′ (t ≠ t′) is a function of the corresponding marginal means and possibly of additional parameters α. This is achieved by parameterizing the correlation matrix with a parameter α, yielding a modeled covariance matrix V_{μ_i,τ,α} = τ A_{μ_i}^{1/2} R_{α,i} A_{μ_i}^{1/2}, with A_{μ_i} = diag(v_{μ_i1}, . . . , v_{μ_in_i}), where v_{μ_it} = var(y_it). The modeled correlation matrix R_{α,i} is called the 'working' correlation matrix, as opposed to the true, underlying and unknown, correlation matrix corr(y_i).

The regression parameters β have the same interpretation as in GLM. They are regarded as the parameters of interest, whereas τ and α are considered nuisance parameters. This may not be appropriate when the time course for each subject is the focus, in which case one would need to consider either the extension proposed by Zeger et al. (1988) or a random effects model. Marginal models are natural extensions of GLM to dependent data. Therefore, the same or similar choices for the marginal distributions (within the exponential family) and the same link functions as in GLMs are used, see Chapter 5. However, even if a marginal distribution for y_it is postulated (e.g. Bernoulli, binomial, Poisson), this does not define a (unique) joint multivariate distribution for y_i, making it impossible to define a likelihood function to work with. The regression parameters β are therefore estimated by the GEE approach of Liang and Zeger (1986). Note, however, that the GEE reduce to maximum likelihood when the y_i are multivariate
Gaussian distributed. In addition, GEE can be viewed as an extension of the quasi-likelihood approach in which the variance cannot be specified only through the expectation μ_i, but requires additional correlation parameters α. This similarity with the quasi-likelihood approach explains why the parameter τ is directly included in the definition of V_{μ_i,τ,α}. The quasi-likelihood approach used in (5.6) for GLM can be extended by solving for β the GEE (assuming τ and α are given)

Σ_{i=1}^n (D_{μ_i,β})^T (V_{μ_i,τ,α})^{−1} (y_i − μ_i) = 0,   (6.1)

where D_{μ_i,β} = ∂μ_i/∂β and V_{μ_i,τ,α} = τ A_{μ_i}^{1/2} R_{α,i} A_{μ_i}^{1/2}. The resulting GEE estimator β̂_[GEE] can be obtained through an IRWLS procedure implementing a Fisher scoring algorithm. This algorithm is given in Appendix F.1 in its more general robust form. As said before, R_{α,i} is called the 'working' correlation, as opposed to the true (unknown) correlation matrix corr(y_i). The working correlation is imposed by the user, and possible choices are as follows.

• Independence. Here R_{α,i} = I_{n_i}, where I_{n_i} is the identity matrix of size n_i. In this case, the whole set of N = Σ_{i=1}^n n_i measurements is considered independent, even within the same subject, and we can therefore treat this situation with a simple GLM model, as if each observation y_it corresponded to an independent subject.

• Fixed. The correlation matrix R_{α,i} (or R) has a predefined form (either through a known parameter α or in general). This case is rare in practice, but could be implied by a formal theory or a result of previous studies.

• Exchangeable (or compound symmetry). All of the correlations (R_{α,i})_{tt′} between two occurrences t and t′ (t ≠ t′) are assumed to be equal to a scalar value α to be estimated. Formally, R_{α,i} = α e_{n_i} e_{n_i}^T + (1 − α) I_{n_i}, where e_{n_i}^T is a vector of ones of dimension n_i and I_{n_i} is the n_i × n_i identity matrix. This hypothesis may not be fulfilled when the repeated measurements are issued from subjects measured on several occasions over time, but is more appropriate for data where units are 'natural' clusters, such as children in the same class, members of a family or patients of the same practice, see e.g. the example in Section 6.2.3. Note that assuming exchangeable correlation in the normal-identity link setting corresponds to a random intercept MLM.

• Autoregressive (AR). The correlation decreases with time difference, e.g. (R_{α,i})_{tt′} = α^{|t−t′|}, for an unknown scalar value α. This hypothesis is quite commonly used for measurements on the same subject over time because it can accommodate an arbitrary number and spacing of observations.

• m-dependence. Observations are correlated up to time distance m, and the correlation is therefore set to zero for observations that are more than m units apart. Formally, for α = (α_1, . . . , α_m),

(R_{α,i})_{tt′} = 1     if t = t′,
                α_d    if d = |t − t′| ≤ m,
                0      otherwise.

• Unstructured/unspecified. The correlation matrix R_{α,i} is completely free (apart from a diagonal of ones and the symmetry constraint), which gives many parameters to estimate. Obviously, this option requires clusters of the same size, that is, n_i = n* for all i.

We refer the reader to Table 1 in Horton and Lipsitz (1999) for a description of the possible correlation structures and recommendations. Moreover, Hardin and Hilbe (2003, pp. 141–142) give additional guidelines for choosing the correlation structure as a function of the nature of the data at hand (e.g. size of the clusters, balanced data, characteristics defining the clusters).
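The working correlation structures listed above are easy to materialize. A small pure-Python sketch constructing exchangeable, AR and m-dependent matrices (illustrative helper names, not from the book):

```python
def exchangeable(n, alpha):
    """R with 1 on the diagonal and alpha everywhere else."""
    return [[1.0 if s == t else alpha for t in range(n)] for s in range(n)]

def ar1(n, alpha):
    """R with entries alpha^|t - s|: correlation decays with time distance."""
    return [[alpha ** abs(s - t) for t in range(n)] for s in range(n)]

def m_dependent(n, alphas):
    """R with alpha_d at lag d <= m = len(alphas), and 0 beyond lag m."""
    m = len(alphas)
    return [[1.0 if s == t else
             (alphas[abs(s - t) - 1] if abs(s - t) <= m else 0.0)
             for t in range(n)] for s in range(n)]

R = ar1(4, 0.5)   # e.g. corr at lag 3 is 0.5**3 = 0.125
```

Comparing the three shows the trade-off discussed in the text: exchangeable and AR need a single parameter, while m-dependence needs m and the unstructured case needs one per pair.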
6.2.2 Estimators for τ and α

The GEE (6.1) are defined for given values of τ and α. A procedure that iterates between the estimation of the regression parameters β and the (moment) estimation of the nuisance parameters τ and α is implemented in all good software and is therefore used in practice. Given that τ and α are nuisance parameters, less attention has been paid to their estimation, and almost no theoretical results for inference exist for these parameters.

The estimation of τ is based on the fact that τ is equal to var(√τ r_it), where r_it = (y_it − μ_it)/√(τ v_{μ_it}) are the Pearson residuals for unit i at occurrence t. Therefore, a simple estimator of τ is derived from the variance estimator based on all of the N residuals, i.e.

τ̂ = Σ_{i=1}^n Σ_{t=1}^{n_i} [(y_it − μ̂_it)² / v_{μ̂_it}] / (N − (q + 1)).   (6.2)

On the other hand, the estimator of the correlation parameter α depends on the choice of the correlation structure R_{α,i}. The general approach is to estimate α by a simple function of all of the pairs of residuals r̂_it, r̂_it′ that share the same correlation (t and t′ defined accordingly). Below, we give some of the solutions implemented in software for the most common correlation structures.¹

• If (R_{α,i})_{tt′} = α (exchangeable correlation) for all t ≠ t′, then we use

α̂ = Σ_{i=1}^n Σ_{t>t′} r̂_it r̂_it′ / (K − (q + 1)),   (6.3)

where K = (1/2) Σ_{i=1}^n n_i(n_i − 1) and r̂_it = (y_it − μ̂_it)/√(τ̂ v_{μ̂_it}).

¹ Note that this list is not exhaustive, and different software implement different solutions.
6.2. THE MLDA AND ALTERNATIVES
• If (R_{α,i})_{tt′} = α_{t,t′} = α^{|t−t′|} (AR correlation), then given that E[r_it r_it′] ≈ α^{|t−t′|} (because E[r_it r_it′] ≈ cov(r_it, r_it′)), one estimates α by the slope of the regression of log(r̂_it r̂_it′) on log(|t − t′|). Another option (see Hardin and Hilbe, 2003, p. 66) is to use

α̂_{t,t′} = Σ_{i=1}^n [ Σ_{t=1}^{n_i−(t′−t)} r̂_it r̂_it′ / n_i ].

• If α = (α₁, …, α_{n*−1}), where α_t = (R_{α,i})_{t(t+1)} and n* is such that n₁ = ⋯ = n_n = n*, then

α̂_t = Σ_{i=1}^n r̂_it r̂_i(t+1) / (n − (q + 1)).

In particular, if R_{α,i} is tridiagonal with (R_{α,i})_{t(t+1)} = α_t (one-dependent model), then if we let α_t = α, we can estimate it by

α̂ = Σ_{t=1}^{n*−1} α̂_t / (n* − 1).

The extension to m-dependence is possible.

• If R_{α,i} is totally unspecified, that is (R_{α,i})_{tt′} = α_{tt′} for t ≠ t′, one uses

R̂ = (1/(τ̂ n)) Σ_{i=1}^n (A_{μ̂_i})^{−1/2} (y_i − μ̂_i)(y_i − μ̂_i)^T (A_{μ̂_i})^{−1/2}.
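The moment estimators (6.2) and (6.3) are straightforward to compute once the fitted means are available. A minimal sketch for the exchangeable case; the function name and data layout (lists of per-cluster lists) are ours:

```python
import math

def tau_alpha_exchangeable(y, mu, v, q):
    """Moment estimators (6.2)-(6.3).

    y, mu, v: per-cluster lists of responses, fitted means and variance
    function values; q + 1 is the number of regression parameters.
    A sketch following the formulas in the text, not a library routine.
    """
    N = sum(len(yi) for yi in y)
    # (6.2): dispersion from all N squared Pearson-type deviations
    tau = sum((yit - muit) ** 2 / vit
              for yi, mui, vi in zip(y, mu, v)
              for yit, muit, vit in zip(yi, mui, vi)) / (N - (q + 1))
    # Pearson residuals based on tau-hat
    r = [[(yit - muit) / math.sqrt(tau * vit)
          for yit, muit, vit in zip(yi, mui, vi)]
         for yi, mui, vi in zip(y, mu, v)]
    # (6.3): average of within-cluster cross-products of residuals
    K = sum(len(ri) * (len(ri) - 1) for ri in r) / 2
    cross = sum(ri[t] * ri[s]
                for ri in r
                for t in range(len(ri)) for s in range(t))
    alpha = cross / (K - (q + 1))
    return tau, alpha
```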
For the independence, exchangeable and m-dependence correlation structures, τ does not need to be computed to solve the estimating equations (it cancels out). In contrast, it is needed when R_{α,i} is AR. Liang and Zeger (1986, Section 4) give further details.

The estimators for τ and α described above are moment estimators that have a closed form, but they can be expressed in an estimating equation form to be solved simultaneously with the estimating equations for β, see Liang et al. (1992, pp. 9–10). The GEE approach operates as if α and β were orthogonal to each other, even when they are not, yielding less efficient estimates of β when the correlation structure is misspecified. Zhao and Prentice (1990) introduce a modified version of GEE, called GEE2, that relaxes the orthogonality hypothesis. The price to pay is an increased computational burden and a larger sensitivity to the misspecification of the correlation structure, see Song (2007, p. 96). The GEE2 approach is usually not what is implemented in most software and, for this reason, we do not pursue this theory further.

If √n-consistent estimators are used to estimate τ and α, it can be proved that √n(β̂_[GEE] − β) is asymptotically normally distributed with zero mean and variance

Ξ = lim_{n→∞} M⁻¹ Q M⁻¹,

MARGINAL LONGITUDINAL DATA ANALYSIS

where

M = (1/n) Σ_{i=1}^n (D_{μ_i,β})^T (V_{μ_i,τ,α})⁻¹ D_{μ_i,β}

and

Q = (1/n) Σ_{i=1}^n (D_{μ_i,β})^T (V_{μ_i,τ,α})⁻¹ var(y_i) (V_{μ_i,τ,α})⁻¹ D_{μ_i,β},

see Liang and Zeger (1986, Theorem 2). Note that the asymptotic theory here is intended with respect to the number of subjects (n) and for fixed numbers of occurrences (n_i). The estimator used for Ξ is Ξ̂ = M̂⁻¹ Q̂ M̂⁻¹, where

M̂ = (1/n) Σ_{i=1}^n (D_{μ̂_i,β̂})^T (V_{μ̂_i,τ̂,α̂})⁻¹ D_{μ̂_i,β̂},    (6.4)

and

Q̂ = (1/n) Σ_{i=1}^n (D_{μ̂_i,β̂})^T (V_{μ̂_i,τ̂,α̂})⁻¹ (y_i − μ̂_i)(y_i − μ̂_i)^T (V_{μ̂_i,τ̂,α̂})⁻¹ D_{μ̂_i,β̂},    (6.5)
where β̂ = β̂_[GEE], μ̂_i = μ_i(β̂_[GEE]), τ̂ is defined by (6.2) and α̂ is one of the estimators defined in the list above, depending on the assumed correlation structure. Note that an estimator for var(β̂_[GEE]) is n⁻¹ Ξ̂. This is what is called in the literature a 'robust' variance estimator, in contrast to a 'naive' variance estimator that would be obtained by assuming that the working correlation is true, and hence var(y_i) = V_{μ_i,τ,α}. This would yield the estimator n⁻¹ M̂⁻¹ of var(β̂_[GEE]). So, here 'robust' is intended with respect to the misspecification of the correlation structure. For a similar use of 'robust', see also the discussion in Section 7.2.4.

Approximate z-statistics and (1 − α) CIs can be defined in the usual manner, i.e.

z-statistic = β̂_[GEE]j / SE(β̂_[GEE]j),    (6.6)

with SE(β̂_[GEE]j) = √(v̂ar(β̂_[GEE]j)) and v̂ar(β̂_[GEE]j) = n⁻¹ [Ξ̂]_{(j+1)(j+1)}. In the same manner, we obtain

(β̂_[GEE]j − z_(1−α/2) SE(β̂_[GEE]j); β̂_[GEE]j + z_(1−α/2) SE(β̂_[GEE]j)),

where z_(1−α/2) is the (1 − α/2) quantile of the standard normal distribution.

The GEE estimator β̂_[GEE] of β is attractive because it presents some nice theoretical properties. For instance, the asymptotic variance of β̂_[GEE] does not depend on the choice of the estimators for τ and α among the √n-consistent estimators. In addition, the consistency of β̂_[GEE] and Ξ̂ depends only on the correct specification of the means μ_i and not on the correct specification of the correlation structure. In fact, inference about β is valid even when the correlation matrix is not
specified correctly (see Liang and Zeger (1986) for a more detailed discussion and for the proofs of these theoretical aspects). However, a careful choice of R_{α,i}, close to the true correlation matrix corr(y_i), increases efficiency, even though simulation results in Liang and Zeger (1986, Tables 1 and 2, p. 19) and Liang et al. (1992, Table 1, p. 15) tend to suggest that the gain is modest: in these references the loss of efficiency is important only for highly correlated responses, and is limited in situations with moderate correlation. The drawbacks of the GEE approach are mostly related to the lack of a likelihood function for these models, which limits diagnostics and inference, and to the poor theory for the nuisance parameters.
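The 'robust' sandwich variance M̂⁻¹Q̂M̂⁻¹ of (6.4)–(6.5), and the standard errors entering (6.6), can be assembled from per-cluster quantities. A sketch, assuming the derivative matrices, working covariances and residuals have already been computed; the function name and interface are ours:

```python
import numpy as np

def sandwich_se(D_list, V_list, resid_list):
    """'Robust' sandwich standard errors for the GEE estimator.

    D_list[i]:     n_i x p matrix of mean derivatives D_{mu_i, beta}
    V_list[i]:     n_i x n_i working covariance V_{mu_i, tau, alpha}
    resid_list[i]: length-n_i vector y_i - mu_i-hat
    Returns sqrt(diag(M^-1 Q M^-1 / n)), cf. (6.4)-(6.6).  Sketch only.
    """
    n = len(D_list)
    p = D_list[0].shape[1]
    M = np.zeros((p, p))
    Q = np.zeros((p, p))
    for D, V, e in zip(D_list, V_list, resid_list):
        Vinv = np.linalg.inv(V)
        M += D.T @ Vinv @ D / n
        u = D.T @ Vinv @ e          # cluster-level score contribution
        Q += np.outer(u, u) / n
    Minv = np.linalg.inv(M)
    cov = Minv @ Q @ Minv / n       # estimated var(beta-hat)
    return np.sqrt(np.diag(cov))
```

With all clusters of size one this reduces to the usual heteroscedasticity-robust sandwich for independent observations.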
6.2.3 GUIDE Data Example

We consider the dataset of the GUIDE study (Guidelines for Urinary Incontinence Discussion and Evaluation)², as used by Preisser and Qaqish (1999). The response variable is the coded answer (bothered: 1 for 'yes', 0 for 'no') of a patient to the question: 'Do you consider this accidental loss of urine a problem that interferes with your day to day activities or bothers you in other ways?'. There are five explanatory variables: gender, coded as an indicator for women (female); age (scaled by subtracting 76 and dividing by 10: age); the average number of leaking accidents per day (dayacc); the degree of the leak (severe: coded '1' for 'just create some moisture', '2' for 'wet their underwear (or pad)', '3' for 'trickle down their thigh', '4' for 'wet the floor'); and the daily number of visits to the toilet to urinate (toilet). A total of 137 patients divided into 38 practices participated in the study. Figure 6.1 shows the responses for each cluster. Note that here the cluster sizes are different, ranging from one to eight. On the other hand, Figure 6.2 presents a summary of all of the covariates for all of the individuals. The series of plots in the left column is for observations such that yit = 1, that is, for patients who are bothered by their incontinence. The right column of plots is for patients with yit = 0. We observe a strong presence of female patients in the sample and a slightly larger proportion of females (90% versus 80%) within the subsample for which yit = 0. The age distribution is quite comparable between the two groups. On the other hand, as one could expect, the three indicators of the severity of the incontinence (dayacc, severe and toilet) show larger values for patients who declare themselves bothered by their problem (left column). For example, the median number of visits to the toilet is 6.5 for patients for which yit = 1 versus 5 for the other group.
Similarly, the median number of leaking accidents per day for the first group is 4.6 against 1 for the second group.

The model considered for this dataset is a binary logit-link model (τ = 1) defined by

logit(E[bothered]) = logit(P(bothered)) = β₀ + β₁ female + β₂ age + β₃ dayacc + β₄ severe + β₅ toilet,    (6.7)

² Available at http://www.bios.unc.edu/∼jpreisse/personal/uidata/preqaq99.dat
Figure 6.1 The response (bothered, 0/1, plotted against occurrence) for the GUIDE dataset for each practice (labeled by an increasing number, appearing in the shaded box).
where logit(p) = log(p/(1 − p)), with p/(1 − p) being the odds and p = P(bothered) the probability of being bothered. The clusters are defined by the practice, which means that patients from different practices are assumed independent. We assume a common exchangeable correlation α between any two patients of the same practice. This hypothesis makes sense a priori in the context of this example: even though the patients of the same practice behave independently, correlation could be induced by the fact that a physician tends to prescribe similar treatments for their patients under treatment for the same problem. Note that the scaling of the variable age is not necessary, but it is kept for consistency with the original analysis in Preisser and Qaqish (1999). Also, severe is used as a count (again for consistency with the original analysis) but should probably be put in the model as a four-level factor.
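The logit link and its inverse, which maps the linear predictor back to P(bothered), can be sketched as follows (a generic illustration, not tied to the fitted coefficients):

```python
import math

def logit(p):
    """Log-odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# round-trip check with an arbitrary probability
p = 0.3
assert abs(expit(logit(p)) - p) < 1e-12
```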
Figure 6.2 Covariate patterns for the GUIDE dataset (panels, top to bottom: proportions for female, densities for (age−76)/10 and day accidents, proportions for severe, densities for toilet visits). The left column is for observations such that yit = 1 (54 observations out of 137) and the right column is for observations such that yit = 0 (83 observations).
The fitted parameters of model (6.7) via classical GEE with exchangeable correlation are given in the first column of Table 6.1. We interpret the results in Section 6.3.4.
6.2.4 Residual Analysis Residuals with longitudinal data can be considered at the observation level or at the cluster level. In both cases, the residuals proposed for GEE are similar to those used for GLMs with the additional requirement that the cluster structure has to be
Table 6.1 Estimates of α and β by classical and robust GEE for model (6.7).

Variable    Classical coefficient (SE)   Huber coefficient (SE)   Mallows coefficient (SE)
α̂            0.09                         0.11                     0.10
intercept   −3.05 (0.96)                 −3.62 (1.30)             −3.63 (1.28)
female      −0.75 (0.60)                 −1.45 (0.80)             −1.41 (0.78)
age         −0.68 (0.56)                 −1.48 (0.71)             −1.39 (0.69)
dayacc       0.39 (0.09)                  0.51 (0.13)              0.52 (0.13)
severe       0.81 (0.36)                  0.71 (0.42)              0.69 (0.41)
toilet       0.11 (0.10)                  0.36 (0.13)              0.35 (0.13)

The classical estimates are the solution of (6.1)–(6.3). The robust estimates are obtained by solving (6.8), (6.10) and (6.11) with c = 1.5 and k = 2.4 (Huber's estimator), and with c = 1.5, w(x_it) = √(1 − h_i,tt) and k = 2.4 (Mallows' estimator).
considered, see Hammill and Preisser (2006), Hardin and Hilbe (2003, Section 4.2) and Chapter 4. As in GLMs, we define the Pearson residuals

r̂_it = (y_it − μ̂_it) / √(τ̂ v_{μ̂_it}).

They can be plotted to identify outliers and other violations of the assumptions, as in other regression settings (e.g. heteroscedasticity, functional form of the regression, etc.). Figure 6.3 is a plot of the Pearson residuals for the GEE fit of the GUIDE dataset. It shows some large residuals, in particular for observations 8, 19, 42, 87 and 88. Given that residuals estimated through non-robust estimators have to be analyzed with caution, in particular with regard to possible masking effects, we defer the detailed interpretation of this residual analysis and first introduce the robust estimators.
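For a binary logit model, v_μ = μ(1 − μ), so the Pearson residuals above can be computed and screened as follows (helper names are ours; the flagging threshold is an arbitrary illustration):

```python
import math

def pearson_residuals(y, mu, tau=1.0):
    """Pearson residuals for a binary model: v_mu = mu(1 - mu).

    y, mu: flat lists of responses (0/1) and fitted probabilities.
    """
    return [(yit - muit) / math.sqrt(tau * muit * (1.0 - muit))
            for yit, muit in zip(y, mu)]

def flag_large(resid, threshold=2.0):
    """Indices of residuals exceeding the threshold in absolute value."""
    return [i for i, r in enumerate(resid) if abs(r) > threshold]
```

Such a screen can only suggest candidate outliers: with a non-robust fit, masking may hide some of them, which is exactly the caution raised in the text.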
6.3 A Robust GEE-type Estimator

6.3.1 Linear Predictor Parameters

The robust counterpart to the GEE approach is built upon the theory of optimally weighted estimating equations (see Hanfelt and Liang 1995; McCullagh and Nelder 1989, p. 334). In the class of all estimating equations based on (y_i − μ_i), the optimal (that is, with smallest asymptotic dispersion) estimating equations are given by

Σ_{i=1}^n (D_{μ_i,β})^T Γ_i^T (V_{μ_i,τ,α})⁻¹ (ψ_i − c_i) = Σ_{i=1}^n Ψ₁(y_i, X_i; β, α, τ, c) = 0,    (6.8)
Figure 6.3 Pearson residuals corresponding to the classical GEE fit of the GUIDE dataset (first column of Table 6.1).

where D_{μ_i,β} = D_i(X_i, β) = ∂μ_i/∂β is an n_i × (q + 1) matrix and

V_{μ_i,τ,α} = τ A_{μ_i}^{1/2} R_{α,i} A_{μ_i}^{1/2}

is an n_i × n_i matrix. Moreover, ψ_i = W_i · (y_i − μ_i), where the matrix W_i = W(y_i, X_i; μ_i) = diag(w_i1, …, w_in_i) is an n_i × n_i diagonal weight matrix containing robustness weights w_it for t = 1, …, n_i, and c_i = E[ψ_i]. Finally, Γ_i = E[ψ̃_i − c̃_i] with ψ̃_i = ∂ψ_i/∂μ_i and c̃_i = ∂c_i/∂μ_i. Note that the set of estimating equations in (6.8) is a slightly modified version of the estimating equations in Preisser and Qaqish (1999) in that it includes the matrix Γ_i, which, for a given choice of weights W_i and 'working' correlation R_{α,i}, makes it optimal (in the sense of smallest asymptotic dispersion) in the class of all estimating equations based on (y_i − μ_i), see Hanfelt and Liang (1995). The computational details of c_i and Γ_i for binary responses are given in Appendix F.2.

We assume that the weights W_i downweight each observation separately, even though it is possible to consider a cluster downweighting scheme, see the discussion
about observation versus cluster outliers in Section 6.2.4. Possible choices for the weights are w(r_it; β, τ, c), a function of the Pearson residuals r_it = (y_it − μ_it)/√(τ v_{μ_it}), for example Huber's weight (see also (2.16))

w(r_it; β, τ, c) = c/|r_it/√τ| if |r_it/√τ| > c, and 1 otherwise,    (6.9)

to ensure robustness with respect to outlying points in the response space (Huber's estimator), or w(x_it), a function of the diagonal elements h_i,tt of the hat matrix H_i (see (3.11)) for subject i (for example, w(x_it) = √(1 − h_i,tt)), to handle leverage points. In practice, it often makes sense to combine both types of weights multiplicatively: w_it = w(r_it; β, τ, c) w(x_it) (Mallows' estimator). The classical GEE are obtained with W_i equal to the identity matrix. We refer to Cantoni and Ronchetti (2001b) for a detailed discussion on the choice of the weights. For simplicity, our weighting scheme (as in Preisser and Qaqish, 1999) does not take into account the within-subject correlation and is therefore not suitable for situations where this correlation is high, in which case it has to be redefined properly, see for example Huggins (1993) and Richardson and Welsh (1995). Doing so would change the definition in (6.8) and affect the distributional properties. Note, however, that protection against outliers affecting all of the observations of a cluster can be handled by our approach by specifying a cluster downweighting scheme, that is, with w_it = w_i* for all t = 1, …, n_i, where the w_i* have to be defined to take into account the information of the entire cluster.

The estimating equations (6.8) do not simplify exactly to the estimating equations (5.13) for GLMs owing to the presence of the matrix Γ_i in the former. The presence of this matrix in the GEE setting is necessary to allow the construction of the quasi-deviance functions for inference (see Section 6.4.2).
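Huber's weight (6.9) and the multiplicative Mallows combination can be sketched as follows (function names are ours):

```python
import math

def huber_weight(r, c=1.5, tau=1.0):
    """Huber robustness weight (6.9) applied to a Pearson residual r."""
    u = abs(r / math.sqrt(tau))
    return c / u if u > c else 1.0

def mallows_weight(r, h, c=1.5, tau=1.0):
    """Mallows weight: Huber weight on the residual times a leverage
    weight sqrt(1 - h), where h is the hat-matrix diagonal element."""
    return huber_weight(r, c, tau) * math.sqrt(1.0 - h)
```

Observations with small residuals and low leverage keep full weight 1; large residuals or high leverage shrink the weight toward zero.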
6.3.2 Nuisance Parameters

The estimators of the dispersion parameter τ and of the correlation parameter α also have to be made robust to avoid harmful consequences on the estimation of the regression parameters. We build again on the fact that the parameter τ is the variance of (y_it − μ_it)/√(v_{μ_it}) = √τ r_it, see Section 6.2.2. We therefore proceed similarly as for GLMs and choose Huber's Proposal 2 estimator of variance (see Section 5.3.3), which is written here as

Σ_{i=1}^n Σ_{t=1}^{n_i} χ(r_it; β, α, τ, c) = Σ_{i=1}^n Ψ₂(r_i; β, α, τ, c) = 0,    (6.10)
where χ(u; β, α, τ, c) = ψ²(u; β, α, τ, c) − δ. In addition, δ = E[ψ²(u; β, α, τ, c)] (under normality for u) is a constant that ensures Fisher consistency of the estimation of τ. For its computation when ψ(u; β, α, τ, c) = ψ_[Hub](u; β, α, τ, c) (our preferred choice), see (3.8), while noticing that ψ_[Hub](u; β, α, τ, c) = u w_[Hub](u; β, α, τ, c).

As in the classical GEE theory, the estimator of the correlation parameter α depends on the assumed correlation structure. To build a robust estimator of α, the
idea is to base this estimator on appropriate pairs of residuals, along the same lines as for the classical estimators (see Section 6.2.2), but to consider additional weighting schemes to downweight outlying observations.

In the following we discuss in detail the case of exchangeable correlation and explain how one can deal with two other common situations, namely the m-dependence and the AR correlation structures. Let us recall that the exchangeable correlation structure defines R_{α,i} = α e_{n_i} e_{n_i}^T + (1 − α) I_{n_i}, with e_{n_i} a vector of ones of length n_i, and I_{n_i} the identity matrix of size n_i × n_i, which means that corr(y_it, y_it′) = α for t ≠ t′, and one otherwise. A simple M-estimator of covariance can be defined through Huber's type of weights (based on ψ_[Hub](·; β, α, τ, c)), which we define as functions of the Mahalanobis distance d_tt′^i (see (2.34)) between the pair of residuals r̂_it and r̂_it′. The Mahalanobis distance is given in this case by

(d_tt′^i)² = (r̂_it, r̂_it′) Σ̂⁻¹ (r̂_it, r̂_it′)^T,  with  Σ̂ = τ̂_[M] [1 α̂_[M]; α̂_[M] 1].

We define Huber's weights on the Mahalanobis distances by

u_{1,k}(d_tt′^i) = 1 if d_tt′^i ≤ k, and k/d_tt′^i otherwise.

We then put u_{2,k}(d_tt′^i) = u²_{1,k}(d_tt′^i)/γ with γ = E[‖r‖² u²_{1,k}(‖r‖)]/2, where the expectation is computed under normality for r. This yields γ = F_{χ²₄}(k²) + (k²/2)(1 − F_{χ²₂}(k²)), where F_{χ²₄} and F_{χ²₂} are the cumulative distribution functions of χ² distributions with four and two degrees of freedom, respectively.

Let B_i = (r̂_i1 · r̂_i2, r̂_i1 · r̂_i3, …, r̂_i(n_i−1) · r̂_in_i)^T be the vector of the products of all of the pairs of residuals for cluster i and let G_i = (u_{2,k}(d_12^i), u_{2,k}(d_13^i), …, u_{2,k}(d_(n_i−1)n_i^i))^T be the vector of weights; then our robust estimator of α is defined as the solution α̂_[M] of

Σ_{i=1}^n (G_i^T B_i − ατ K/n) = Σ_{i=1}^n Ψ₃(r_i; β, α, τ, c) = 0,    (6.11)

with K = Σ_{i=1}^n n_i(n_i − 1)/2. For more details on all of the above computations we refer to Maronna (1976), Devlin et al. (1981) and Marazzi (1993, p. 225). M-estimators are known to have a low breakdown point, namely one over the dimension of the problem, which is equal to two here (see the discussion of this point in Section 2.3.1). Nevertheless, high breakdown point estimators could be considered to estimate Σ. An ad hoc estimator of α in the case of binary responses with exchangeable correlation, inspired by the classical moment estimator, is considered by Preisser and Qaqish (1999). This proposal relies on the hypothesis that var(ψ_i) can be decomposed as C_i var(y_i) C_i and therefore cannot be extended to other settings, e.g. Poisson. Our proposal is more general and has the advantage of inheriting the whole set of distributional properties of M-estimators. It is also worth mentioning that setting all u_{1,k}(d_tt′^i) = 1, and therefore all u_{2,k}(d_tt′^i) = 1, gives the usual (classical) moment estimators for these situations.
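The distance-based weights and the constant γ can be computed without numerical integration, since the χ² CDFs with two and four degrees of freedom have simple closed forms. A sketch (function names are ours):

```python
import math

def gamma_const(k):
    """Consistency constant gamma = F_chi2_4(k^2) + (k^2/2)(1 - F_chi2_2(k^2)).

    Uses the closed-form chi-square CDFs:
    F_chi2_2(x) = 1 - exp(-x/2), F_chi2_4(x) = 1 - exp(-x/2)(1 + x/2).
    """
    x = k * k
    F2 = 1.0 - math.exp(-x / 2.0)
    F4 = 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)
    return F4 + (x / 2.0) * (1.0 - F2)

def u1(d, k):
    """Huber-type weight on a Mahalanobis distance d."""
    return 1.0 if d <= k else k / d

def u2(d, k):
    """Squared, consistency-corrected weight used in (6.11)."""
    return u1(d, k) ** 2 / gamma_const(k)
```

As k grows, γ tends to 1 and all weights tend to 1, recovering the classical moment estimator, as noted in the text.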
MARGINAL LONGITUDINAL DATA ANALYSIS
176
Two other common correlation structures are the m-dependence correlation structure, which assumes that corr(y_it, y_i,t+j) = α_j for j = 1, …, m, and the AR correlation structure, which assumes that corr(y_it, y_i,t+j) = α^j for j = 0, 1, …, n_i − t. The procedure described above can be adapted to these cases by constructing B_i appropriately, that is, with all of the products r̂_it · r̂_i,t+j in the first case, and r̂_it · r̂_i,t+1 in the latter. The correction terms have to be defined accordingly.
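Returning to Huber's Proposal 2 in (6.10): under normality, the constant δ = E[ψ²_[Hub](u)] has a standard closed form, (2Φ(c) − 1) − 2cφ(c) + 2c²(1 − Φ(c)), which the following sketch evaluates (the closed form is standard for Huber's ψ; the function name is ours):

```python
import math

def delta_huber(c):
    """Fisher-consistency constant delta = E[psi_c(U)^2], U ~ N(0,1),
    for Huber's psi with tuning constant c (used in (6.10))."""
    Phi = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))   # standard normal CDF
    phi = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)  # density
    # E[U^2 1{|U| <= c}] + c^2 P(|U| > c)
    return (2.0 * Phi - 1.0) - 2.0 * c * phi + 2.0 * c * c * (1.0 - Phi)
```

For c → ∞, ψ_[Hub] is the identity and δ → E[U²] = 1, matching the classical (non-robust) case.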
6.3.3 IF and Asymptotic Properties

Under standard regularity conditions, (√n(β̂_[M] − β)^T, √n(τ̂_[M] − τ), √n(α̂_[M] − α))^T, with β̂_[M], τ̂_[M] and α̂_[M] defined through (6.8), (6.10) and (6.11), respectively, follows an asymptotic normal distribution with mean zero and covariance matrix

lim_{n→∞} [F 0 0; G H 0; J L N]⁻¹ [Σ(11) Σ(12) Σ(13); Σ(12)^T Σ(22) Σ(23); Σ(13)^T Σ(23)^T Σ(33)] [F 0 0; G H 0; J L N]⁻T,    (6.12)

where all of the sub-matrices in (6.12) are given in Cantoni (2004b) (up to a factor 1/n and a change of notation), where the proof of the distributional result is also given. The particular form of the matrices in (6.12) implies that the marginal asymptotic distribution of √n(β̂_[M] − β) is normal with mean zero and variance equal to

ϒ = lim_{n→∞} F⁻¹ Σ(11) F⁻T,    (6.13)

where

F = (1/n) Σ_{i=1}^n (D_{μ_i,β})^T Γ_i^T (V_{μ_i,τ,α})⁻¹ Γ_i D_{μ_i,β},    (6.14)

and

Σ(11) = (1/n) Σ_{i=1}^n (D_{μ_i,β})^T Γ_i^T (V_{μ_i,τ,α})⁻¹ var(ψ_i) (V_{μ_i,τ,α})⁻¹ Γ_i D_{μ_i,β}.    (6.15)
The distributional result in (6.12) generalizes the results of Prentice (1988): it applies to other types of responses than Bernoulli trials, it allows for an over-dispersion parameter (τ) and it is developed in the more general setting of the robust estimating equations defined by (6.8), (6.10) and (6.11).

In addition, the estimating equations (6.8), (6.10) and (6.11) define a set of M-estimators (Huber, 1981), with the corresponding score functions Ψ₁(y_i, X_i; β, α, τ, c), Ψ₂(r_i; β, α, τ, c) and Ψ₃(r_i; β, α, τ, c) given in Appendix F.1. From general theory on M-estimation, we know that the IF of these estimators is proportional to the score functions defining them. Therefore, the estimators obtained by our procedure are robust as long as the functions Ψ_j are bounded in the design and in the response. This is in particular achieved if ψ_i in (6.8) and χ in (6.10) are bounded, and the u_{2,k} entering through G_i in (6.11) are allowed to be less than one.
Figure 6.4 Robustness weights w(r) on the response, grouped by practice, for the fit corresponding to the middle column of Table 6.1 (Huber's estimator); observations 8, 19, 22, 42, 44, 87, 88 and 135 receive visibly reduced weights.
6.3.4 GUIDE Data Example (continued)

We estimate the regression parameters with the set of equations in (6.8), where we consider both a Mallows' estimator with w(x_it) = √(1 − h_i,tt) and c = 1.5 and a Huber estimator with w(x_it) = 1 and c = 1.5. In both cases, the exchangeable correlation parameter α is estimated through (6.11) with k = 2.4, such that k² is approximately the 95% quantile of a χ²₂ distribution. In addition to the classical results already presented in Section 6.2.3, Table 6.1 gives the estimated coefficients for the two robust alternatives. First note that the results of the second and third columns (robust analyses) are quite close, whereas they differ noticeably from the classical analysis. This means that the additional weights on the design are probably not crucial, in the sense that the dataset does not seem to contain leverage points. By looking at approximate CIs (see their definition in Section 6.4.1), the variables female and age are not significant in the classical analysis, but are borderline in the robust analysis. The significance of the variable dayacc seems to be equally well assessed in both types of analysis. The variable severe is no longer significant in the robust analysis, whereas the variable toilet seems to play an important role that was hidden in the classical approach.
The robust procedure also gives information on how many and which observations are downweighted. For example, in the analysis with weights on the response only (middle column of Table 6.1), there are 15 observations out of 137 that do not receive full weight, 8 of which have weight less than 0.6, see Figure 6.4. This group of observations is partially the same as that identified in Preisser and Qaqish (1999) with their robust procedure. The diagnostic approach in Hammill and Preisser (2006) identifies as potential outliers a smaller group of observations, in particular patients 8 and 44. These two patients, together with patient 42, report not being bothered despite their high frequency of visits to the toilet (10 for patients 8 and 42, and 20 for patient 44) and their large average number of leaking accidents per day (9.3 for patient 8, 6 for patient 42 and 3 for patient 44). On the other hand, patients 19 and 88 declared themselves bothered, even though the severity of their symptoms (variables severe and toilet) is rather low with respect to the other sample values. Only two of these heavily downweighted observations belong to the same practice (cluster), namely observations 87 and 88 from practice 156, confirming that the individual downweighting scheme is justified for this dataset.
6.4 Robust Inference

6.4.1 Significance Testing and CIs

The z-test for significance testing and (1 − α) CIs for the regression parameters β can be constructed based on the asymptotic distribution of the estimator, see Section 6.3.3. The z-statistics and (1 − α) CIs are given by

z-statistic = β̂_[M]j / SE(β̂_[M]j)

and

(β̂_[M]j − z_(1−α/2) SE(β̂_[M]j); β̂_[M]j + z_(1−α/2) SE(β̂_[M]j)),

with

SE(β̂_[M]j) = √((1/n) [ϒ̂]_(j+1)(j+1)),

where z_(1−α/2) is the (1 − α/2) quantile of the standard normal distribution, and where ϒ̂ = F̂⁻¹ Σ̂(11) F̂⁻T, with

F̂ = (1/n) Σ_{i=1}^n (D_{μ̂_i,β̂})^T Γ_i^T (V_{μ̂_i,τ̂,α̂})⁻¹ Γ_i D_{μ̂_i,β̂},    (6.16)

and

Σ̂(11) = (1/n) Σ_{i=1}^n (D_{μ̂_i,β̂})^T Γ_i^T (V_{μ̂_i,τ̂,α̂})⁻¹ (ψ_i − c_i)(ψ_i − c_i)^T (V_{μ̂_i,τ̂,α̂})⁻¹ Γ_i D_{μ̂_i,β̂},    (6.17)

where β̂ = β̂_[M], μ̂_i = μ_i(β̂_[M]), τ̂ = τ̂_[M] and α̂ = α̂_[M].
6.4.2 Variable Selection

Variable selection is performed here by comparing the adequacy of a submodel M_{q−k+1} with (q − k + 1) regression parameters with respect to a larger model M_{q+1} with (q + 1) regression parameters. This is done either in a stepwise procedure, or by comparing two predefined nested models. For that we define a class of test statistics based on differences of quasi-likelihoods, in the same spirit as the difference of quasi-deviances in (5.24) for GLMs in Chapter 5:

Λ_{t(s)} = 2 [ Σ_{i=1}^n Q_{t_i(s)}(y_i; μ̂_i) − Σ_{i=1}^n Q_{t_i(s)}(y_i; μ̇_i) ],    (6.18)

where μ̂_i = μ_i(β̂_[M], α̂_[M], τ̂_[M]) is the estimation under model M_{q+1}, μ̇_i = μ_i(β̇_[M], α̇_[M], τ̇_[M]) is the estimation under model M_{q−k+1}, and the quasi-likelihood functions take the multidimensional form

Q_{t_i(s)}(y_i; μ_i) = (1/τ) ∫_{y_i}^{μ_i} (y_i − t_i(s))^T W(y_i, X_i; t_i(s)) (V_{t_i(s),τ,α})⁻¹ Γ_i(t_i(s)) dt_i(s)
  − (1/τ) ∫_{y_i}^{μ_i} E[(y_i − t_i(s))^T W(y_i, X_i; t_i(s))] (V_{t_i(s),τ,α})⁻¹ Γ_i(t_i(s)) dt_i(s),    (6.19)

with the integrals possibly path-dependent. This means that there are several paths to go from a point y_i to a point μ_i and implies, therefore, that the integrals in (6.19) are not uniquely defined. It is common practice to parameterize this path, and a typical set of integration paths is given for example by t_it(s) = y_it + (μ_it − y_it) s^{c_it}, for s ∈ [0, 1], c_it ≥ 1 and t = 1, …, n_i. For instance, when c_it ≡ 1 for all t (see for example McCullagh and Nelder, 1989, p. 335), we have that

Q_{t_i(s)}(y_i; μ_i) = −(1/τ) (y_i − μ_i)^T [ ∫_0^1 s W(y_i, X_i; t_i(s)) (V_{t_i(s),τ,α})⁻¹ Γ_i(t_i(s)) ds ] (y_i − μ_i)
  + (1/τ) ∫_0^1 E[(y_i − t_i(s))^T W(y_i, X_i; t_i(s))] (V_{t_i(s),τ,α})⁻¹ Γ_i(t_i(s)) (y_i − μ_i) ds,

which involves only univariate integrations, uniquely defined. The asymptotic result in Section 6.4.2.1 shows that the path-dependence of the integrals in (6.19) vanishes asymptotically. In addition, Hanfelt and Liang (1995) showed that the path of integration does not play an important role in finite-sample situations. These results support the use of the difference of robust quasi-likelihoods for inference.
6.4.2.1 Multivariate Testing
Multivariate testing of the type H₀: β(2) = 0, with β = (β(1), β(2)), β(1) of dimension (q + 1 − k) and β(2) of dimension k, can be performed using Λ_{t(s)} defined by (6.18) as a test statistic. Cantoni (2004b) proves that under quite general conditions (and under H₀), Λ_{t(s)} is asymptotically equivalent to the following quadratic forms in normal variables:

n L_n^T (M⁻¹ − M̃⁺) L_n = n R_n(2)^T M(22.1) R_n(2),    (6.20)

where

M = lim_{n→∞} F = lim_{n→∞} (1/n) Σ_{i=1}^n (D_{μ_i,β})^T Γ_i^T (V_{μ_i,τ,α})⁻¹ Γ_i D_{μ_i,β}

is partitioned into four blocks according to the partition of β:

M = [M(11) M(12); M(12)^T M(22)],

and

M̃⁺ = [M(11)⁻¹ 0_{(q−k+1)×k}; 0_{k×(q−k+1)} 0_{k×k}],

where 0_{a×b} is a matrix of dimension a × b with only zero entries. The variables √n L_n and √n R_n are asymptotically normally distributed N(0, Q) and N(0, M⁻¹QM⁻¹), respectively, where

Q = lim_{n→∞} Σ(11) = lim_{n→∞} (1/n) Σ_{i=1}^n (D_{μ_i,β})^T Γ_i^T (V_{μ_i,τ,α})⁻¹ var(ψ_i) (V_{μ_i,τ,α})⁻¹ Γ_i D_{μ_i,β}.

This implies that Λ_{t(s)} is asymptotically distributed as a linear combination of χ²₁ variables, similarly as for GLMs (see Section 5.4.2). More precisely, Λ_{t(s)} is asymptotically distributed as Σ_{i=1}^k d_i N_i², where N₁, …, N_k are independent standard normal variables and d₁, …, d_k are the k positive eigenvalues of the matrix Q(M⁻¹ − M̃⁺). In practice, the empirical versions of M and Q are used, that is, M̂ = F̂ (see (6.16)) and Q̂ = Σ̂(11) (see (6.17)).

In addition to giving the asymptotic distribution, the above result provides a quadratic form asymptotically equivalent to Λ_{t(s)}, which can be used as an asymptotic approximation when the integrals involved in the definition of Λ_{t(s)} are problematic to compute. More precisely, one computes n β̂_[M](2)^T M̂(22.1) β̂_[M](2). Finally, Cantoni (2004b) proves that the level and the power of Λ_{t(s)} under contamination are bounded provided that β̂_[M](2) has a bounded IF.
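The p-value of the test then requires the tail probability of Σᵢ dᵢNᵢ². Lacking a simple closed form, it can be approximated by Monte Carlo, as in the following sketch (exact numerical methods, such as Imhof's, also exist; the function name is ours):

```python
import numpy as np

def lincomb_chi2_pvalue(eigvals, stat, nsim=200_000, seed=0):
    """Monte Carlo p-value for a statistic asymptotically distributed as
    sum_i d_i N_i^2 with N_i iid standard normal, as for the test based
    on (6.20).  eigvals: the positive eigenvalues d_1, ..., d_k."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((nsim, len(eigvals)))
    draws = (z ** 2) @ np.asarray(eigvals)   # simulated null statistics
    return float(np.mean(draws >= stat))
```

With a single unit eigenvalue the reference distribution reduces to χ²₁, which gives an easy sanity check of the simulation.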
6.4.3 GUIDE Data Example (continued)

Let us consider a backward stepwise procedure based on the difference of quasi-likelihood functions defined by (6.18) to check more carefully the issues related
Table 6.2 p-values of the backward stepwise procedure on the GUIDE dataset.

           Variable   Step 1    Step 2    Step 3    Step 4
Classical  female     0.224     0.270     –         –
           age        0.249     –         –         –
           dayacc     <10⁻⁴     <10⁻⁴     <10⁻⁴     <10⁻⁴
           severe     0.089     0.081     0.061     0.011
           toilet     0.224     0.164     0.165     –
Robust     female     0.070     0.095     –         –
           age        0.045     0.041     0.068     –
           dayacc     <10⁻⁴     <10⁻⁴     <10⁻⁴     <10⁻⁴
           severe     0.092     –         –         –
           toilet     0.006     0.004     0.004     0.002

The robust test statistics (6.18) are computed by applying Huber's-type weights (c = 1.5), and by using k = 2.4 for the estimation of α in (6.11) (exchangeable correlation). The classical test statistics are computed with c = ∞ and k = ∞.
to model selection. We use the same weights and the same set of parameters as for the Huber estimator of Section 6.3.4, and compute the quadratic form (6.20) asymptotically equivalent to Λ_{t(s)}. At each step of the procedure, we remove the variable that is the least significant by looking at the p-value or, equivalently, at the value of the test statistic. The procedure is stopped when all of the test statistics are significant at the 5% level. The classical counterpart is computed with the same quadratic form, by using c = ∞ and k = ∞ to compute the estimators.

Table 6.2 gives the p-values of this backward stepwise procedure. It is impressive to see how much the classical p-values differ from the robust p-values. This highlights the heavy influence of outlying observations on the test procedure, and not only on the estimation procedure. Finally, the robust procedure ends up retaining the variables dayacc and toilet, whereas the classical analysis would retain the variables dayacc and severe. On the basis of the theoretical properties of the robust estimator, and also of the simulation results in Cantoni (2004b), the conclusions from the robust analysis are more reliable. We therefore robustly refit the model with only dayacc and toilet and proceed with interpretations from this model. The estimated coefficients (with standard errors in parentheses) for the linear predictor are as follows:

−3.67 (0.76) + 0.49 (0.12) dayacc + 0.29 (0.10) toilet.
The estimated model in this clustered setting can be interpreted in the same way as for GLMs. In this example the response is binary, and therefore the discussion of Section 5.5.2 about interpreting the coefficients on the odds scale still holds. The effect of an additional leaking accident per day is to increase the odds of a patient being bothered by their incontinence problem by 63% (exp(0.49) = 1.63). Similarly, an extra visit to the toilet results in a 34% increase (exp(0.29) = 1.34)
on the same odds. This second effect is smaller in magnitude, which seems consistent with common sense.
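These odds-scale calculations can be reproduced directly; a minimal sketch (the coefficients are taken from the fitted model above, everything else is illustrative):

```python
import math

# Coefficients from the robust fit reported above
coef = {"dayacc": 0.49, "toilet": 0.29}

# exp(beta) is the multiplicative effect on the odds of a one-unit increase
for name, b in coef.items():
    odds_ratio = math.exp(b)
    print(name, round(odds_ratio, 2), f"{odds_ratio - 1:.0%} increase")
```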
6.5 LEI Data Example

We consider here a dataset on direct laryngoscopic endotracheal intubation (LEI), a potentially life-saving procedure in which many health care professionals are trained. We examine data from a prospective longitudinal study on LEI at Dalhousie University, previously analyzed by Mills et al. (2002). Variable selection is an important step, as the model(s) chosen will include only those covariates significant in predicting successful completion of LEI. A total of 436 LEIs were analyzed. We let yit = 1 if trainee i performs a complete LEI in less than 30 seconds on trial t, and zero otherwise. The correlation between observations on the same trainee is taken to be exchangeable; an AR correlation structure would be another option with these data.

We judge trainees on the following nine binary covariates, each taking the value one if the condition is satisfied: whether the head and neck were in optimal position (neckflex and extoa); whether they inserted the scope properly (proplgsp); whether they performed the lift successfully (proplift); whether there was an appropriate request for help (askas); whether there was unsolicited intervention by the attending anesthesiologist (help); whether there were no complications (comps); and the trainee's handedness (trhand) and gender (trgend).

Nineteen trainees performed anywhere from 18 to 33 trials. Figure 6.5 gives the pattern profiles for the 19 trainees. These patterns tend to show that training results in better performance over time; see, for example, the profiles of trainees K, L, VV and Z. This seems less evident for other individuals, namely AA and S. Table 6.3 presents a summary of all of the (binary) covariates for all of the individuals. As naturally expected, the proportion of ones (indicating that successful action has been taken or that no complications were observed) is larger for individuals who have succeeded in performing a complete LEI.
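The robust fits that follow rely on Huber-type downweighting of large Pearson residuals. As a minimal sketch (the weight form w(r; c) = min(1, c/|r|), the function name and the sample residuals are illustrative assumptions, not code from the book):

```python
import numpy as np

def huber_weight(r, c=1.5):
    """Huber-type weight w(r; c) = min(1, c / |r|): residuals within
    [-c, c] keep full weight 1, larger ones are downweighted."""
    r = np.asarray(r, dtype=float)
    return np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))

# Illustrative Pearson residuals: the last two observations are outlying
r = np.array([0.3, -1.2, 1.5, -4.0, 6.0])
print(huber_weight(r))  # weights: 1, 1, 1, 0.375, 0.25
```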
We fitted robustly a GEE model with exchangeable correlation to the above data. The estimates and test results are given in Table 6.4. The robust GEE model uses a Huber's estimator with c = 1.5 for the Huber function and k = 2.4 for Huber's Proposal 2 (6.10). No weights on the design have been used here, given the binary nature of all the covariates. A priori we would expect all of the coefficients to be positive, which would indicate that if proper action is taken, the probability of success in performing LEI increases. This is indeed the case except for a few non-significant coefficients (askas and extoa). A classical approach (not shown here) would give substantially different estimated coefficients. The standard errors of the MLE would also be considerably larger, which is a serious drawback when performing significance testing. Figure 6.6 gives the weights w(rit ; β, τ, c) from the robust fit. The two main outliers are observations 11 (11th trial of trainee AA) and 273 (14th trial of trainee T). The first observation corresponds to the only successful LEI for trainee AA in
Figure 6.5 LEI responses (one for completed in less than 30 seconds) for each trainee, labeled by capital letters.
21 trials (see Figure 6.5); the covariate pattern for this trainee is quite stable across the trials (not shown) and therefore cannot explain the different response. The second observation is an unsuccessful LEI, even though the covariate pattern would have called for a success. The variables significant under the robust approach, on the basis of their z-statistics, are neckflex, proplgsp, proplift, help and perhaps comps. In the classical analysis, neckflex would have been considered non-significant, as would comps. The significance of all of the variables except comps is clear-cut. We therefore only test three particular nested models with the difference of quasi-deviances (6.18) with Huber's weights with c = 1.5: the full model including all of the available covariates, against the submodel without extoa, askas, trhand, trgend
Table 6.3 Covariates characteristics for the LEI dataset.

                Successful LEI (118 obs.)         Unsuccessful LEI (318 obs.)
Variable     Prop. of ones   Prop. of zeros    Prop. of ones   Prop. of zeros
neckflex         0.99            0.01              0.95            0.05
extoa            0.99            0.01              0.97            0.03
proplgsp         0.86            0.14              0.52            0.48
proplift         0.88            0.12              0.39            0.61
askas            0.15            0.85              0.20            0.80
help             0.70            0.30              0.37            0.63
comps            0.95            0.05              0.78            0.22
trhand           0.82            0.18              0.84            0.16
trgend           0.77            0.23              0.69            0.31
Table 6.4 Robust GEE fits for the LEI dataset.

Variable     Coefficient (SE)   p-value
intercept    −4.18 (0.51)       <10^−4
neckflex      1.52 (0.39)       <10^−4
extoa        −0.24 (0.41)       0.56
proplgsp      0.69 (0.20)       0.0007
proplift      0.98 (0.25)       <10^−4
askas        −0.42 (0.26)       0.11
help          0.34 (0.12)       0.004
comps         0.99 (0.49)       0.04
trhand        0.04 (0.26)       0.89
trgend        0.05 (0.24)       0.84
α             0.061
The estimates are obtained by solving (6.8), (6.10) and (6.11) with c = 1.5 and k = 2.4 (Huber’s estimator).
(the clearly non-significant covariates), and the submodel that in addition removes comps. Table 6.5 gives the p-values associated with these tests. It confirms that the submodel without extoa, askas, trhand and trgend is enough to represent the relationship that describes a successful LEI. The robust analysis also shows the importance of the variable comps, given the rejection of the null hypothesis that its coefficient is equal to zero. The estimated final model therefore yields the following coefficients and standard errors for the linear predictor:

−9.17 (1.23) + 3.78 (0.75) neckflex + 1.33 (0.35) proplgsp + 1.93 (0.50) proplift + 0.60 (0.26) help + 1.93 (0.77) comps.

Figure 6.6 Robust weights for the LEI data example.

Table 6.5 Robust p-values for comparison of models based on the difference of quasi-deviances (6.18).

Model                                        Λ_T(s)   p-value
Full − extoa − askas − trhand − trgend        4.49     0.36
− extoa − askas − trhand − trgend − comps     2.76     0.01

The robust test statistics (6.18) are computed by applying Huber's-type weights (c = 1.5) and by using k = 2.4 for the estimation of α in (6.11) (exchangeable correlation).
The multiplicative effects of a positive action taken by the trainee, or of the absence of complications (in which case the covariate is equal to one), on the odds of succeeding in performing a LEI are as follows (exponential of the coefficient):

neckflex   proplgsp   proplift   help    comps
  43.69       3.79       6.89    1.82     6.90
In addition to the statistical significance, we can see that the strongest effect on the odds of a successful LEI is definitely the proper positioning of the neck, followed
by the correct lift and the absence of complications. Inserting the scope properly and asking for help were also positively associated with a successful LEI, but these associations were somewhat weaker.
6.6 Stillbirth in Piglets Data Example

Genetic selection is an important research domain in animal science: it allows animals with 'stronger' characteristics to be selected. For most mammalian species, farrowing is a critical period; in pigs, for example, up to 8% of newborns are stillborn. Limiting or reducing the number of stillbirths requires its major determinants to be investigated. This section is devoted to the study of stillbirth in four genetic types of sow: Duroc × Large White (DU × LW), Large White (LW), Meishan (MS) and Laconie (LA). Data are from the INRA GEPA experimental unit (France) and have been kindly provided by L. Canario and Y. Billon. Related publications are Canario (2006) and Canario et al. (2006), where the reader can find a more extensive discussion of the modeling issues for this dataset.

Previous studies have shown that parity number, piglet birth weight, sex and birth assistance are associated with perinatal mortality. The aim of the study is to establish whether there is a genotype effect, in view of possible genetic selection (e.g. the development of crossed-synthetic lines). Our dataset comprises 80 litters for the genetic type (coded gentype) DU × LW, 633 litters for LW, 59 litters for MS and 168 litters for LA, for a total of 940 litters and 11 638 observations. There were 565 deaths (coded 1) out of the 11 638. The genetic type LW is taken as the reference. Parity number, the number of times a mother has given birth (variable parity, taken as a factor), ranges from one to six with corresponding frequencies (35%, 26%, 15%, 12%, 8%, 4%), with one taken as the reference. Birth assistance (variable birthassist) is coded zero for no assistance and one for one or more assistances. The cluster is defined as the litter, whose size varies from 5 to 23. We fit a binary logit-link model with exchangeable correlation.
For the robust fit we use weights w(rit ; β, τ, c) on the residuals, with a tuning constant c = 1.5 for the Huber function, and k = 2.4 for Huber's Proposal 2 (6.10). The estimated coefficients, standard errors and p-values of the z-test for significance of each coefficient (H0 : βj = 0) are given in Table 6.6. The robust analysis shows that piglets born from the MS genetic type have a lower risk of stillbirth with respect to LW. The odds ratio of a stillbirth for the MS genotype with respect to the LW genotype is equal to exp(−1.71) = 0.18. Also, mortality increases with parity, at least for the 5th and 6th parities, which could result from the fatness of old sows or aging of the uterus (or both). The estimated exchangeable correlation is α̂ = 0.035, which is low.

The conclusions, however, have to be taken with caution. A careful inspection of the weights assigned to the observations by the robust technique shows a particular pattern. Indeed, in Figure 6.7, one can see that the downweighted observations identify a subpopulation of the data: in fact, all of the 565 observations corresponding
Table 6.6 Robust estimates for the piglets dataset.

Variable                  Coefficient (SE)   p-value
intercept                 −3.00 (0.11)       <10^−4
factor(gentype)DU × LW    −0.20 (0.19)       0.31
factor(gentype)MS         −1.71 (0.43)       <10^−4
factor(gentype)LA          0.11 (0.14)       0.45
factor(parity)2           −0.23 (0.15)       0.12
factor(parity)3            0.10 (0.17)       0.57
factor(parity)4            0.15 (0.17)       0.38
factor(parity)5            0.38 (0.21)       0.08
factor(parity)6            0.55 (0.20)       0.005
birthassist                0.13 (0.12)       0.30
The estimates are obtained by solving (6.8), (6.10) and (6.11) with c = 1.5 and k = 2.4 (Huber’s estimator).
Figure 6.7 Robustness weights on the response for the piglets dataset.
to a death (response = 1). Further investigation allowed us to identify suspected separation or near-separation in the data. This peculiarity of binary regression is a situation where the design space of the observations for which y = 1 and the observations for which y = 0 can be completely separated by a hyperplane.
Figure 6.8 Illustration of a situation with no overlap in binary regression: the observations for which y = 0 and the observations for which y = 1 can be completely separated by a hyperplane.
For example, if there are two covariates x1 and x2 , this would correspond to the situation depicted in Figure 6.8. We say that there is no overlap for this dataset (see also the illustration in Christmann and Rousseeuw (2001, pp. 67–69)). Christmann and Rousseeuw (2001) give an algorithm to compute the overlap, that is, the smallest number of observations whose removal yields complete or quasi-complete separation. In these cases, most estimators do not exist. In cases where the overlap is very small, the estimators exist but can potentially be very unstable. In addition, robust estimators work by downweighting (or sometimes removing) outlying points. It can therefore happen that the whole dataset has overlap, but that the robust estimators do not exist. The methodology by Christmann and Rousseeuw (2001) was used on the piglets dataset to compute the overlap, which is equal to eight. This is particularly related to the binary/categorical nature of the data. A (limited) sensitivity analysis has nevertheless shown that some stability is present and that therefore the study provides useful conclusions. In this analysis the robust methodology has helped in highlighting a peculiar feature of the data that could lead to disastrous conclusions if it remains undetected.
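Separation can also be illustrated numerically: with completely separated binary data the logistic likelihood has no maximizer, and an iterative fit drives the coefficients off to infinity. A minimal sketch on assumed toy data (plain gradient ascent rather than the Newton steps a real fitter would use):

```python
import numpy as np

# Completely separated toy data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])  # intercept + slope

def fit(n_iter, step=0.5):
    """Plain gradient ascent on the logistic log-likelihood."""
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta += step * (X.T @ (y - p))
    return beta

# The fitted slope keeps growing with the number of iterations:
# under complete separation no finite maximizer exists.
print(fit(50)[1], fit(500)[1], fit(5000)[1])
```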
6.7 Discussion and Extensions

At the time of writing, only the Bernoulli family has been implemented for robust estimation and inference for GEE. Note, however, that the theory presented in this chapter is general and includes all GLM distributions. The difficulty arising in practice is the computation of the correction term ci in (6.8). This difficulty can be circumvented by computing the correction term by simulation; this is currently work in progress. As mentioned in Section 5.7.3, Cantoni et al. (2005) develop a criterion, called GCp, inspired by Mallows's Cp, for general model comparisons. It is given in (5.29), and the general form of an unbiased estimator of GCp is given in (5.30). The particular form of GCp for a Mallows's quasi-likelihood estimator as defined by (6.8) is given by Cantoni et al. (2005), whose extensive simulation study shows that GCp is very effective in handling contaminated data.
7 Survival Analysis

7.1 Introduction

Survival analysis is central to biostatistics, and modeling such data is an important part of the daily work of statisticians collaborating with clinicians and medical researchers. Basically, survival data analysis is needed whenever a survival time, or a time to a specific event (failure) such as organ dysfunction, disease progression or relapse, is the outcome of interest. Such data are often censored, as not all subjects enrolled in the study experience the event. When investigators are interested in testing the effect of a particular treatment on failure time, the default method of analysis is the log-rank test, usually supplemented by Kaplan–Meier survival estimates. The log-rank test is, by definition, based on ranks and therefore offers some degree of protection against outliers. Criticisms have been raised (Kim and Bae, 2005), but the test is not as sensitive as most of the standard testing procedures in other models.

When the outcome has to be explained by a set of predictors, the standard approach is the Cox (1972) proportional hazard model. Cox regression is appealing owing to its flexibility in modeling the instantaneous risk of failure (e.g. death), or hazard, even in the presence of censored observations. Interest in the Cox model goes well beyond the world of medicine and biostatistics: applications in biology, engineering, psychology, reliability theory, insurance and so forth can easily be found in the literature. Its uniqueness also stems from the fact that it is not, strictly speaking, based on maximum likelihood theory but on the concept of partial likelihood. This notion was introduced by Cox in his original paper to estimate the parameters of interest in a semi-parametric formulation of the instantaneous risk of failure at a specific time point, given that such an event has not occurred so far. Over the years, many papers dealing with various misspecifications in the Cox model have been published.
Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser. © 2009 John Wiley & Sons, Ltd.

Diagnostic techniques have also flourished, boosted by the ever-growing number of applications related to that model; see, for instance,
Chen and Wang (1991), Nardi and Schemper (1999), Therneau and Grambsch (2000), Collett (2003b) and Wang et al. (2006) for a review. In the 1980s, researchers were typically interested in whether a consistent estimate of a treatment effect could be obtained when omitting a covariate. Work by Gail et al. (1984), Bretagnolle and Huber-Carol (1988) and others shows both theoretically and through simulations that if important predictors are omitted from the Cox model, a small bias results in the estimate. Later, Lin and Wei (1989) propose a sandwich formula for the treatment effect estimator's variance, which they call 'robust' in that significance testing of the treatment effect has approximately the desired level, even if important predictors are omitted from the model. They also claim that their variance estimator can cope with possible misspecifications of the hazard function. As argued in Section 1.2, this type of robustness is different from that discussed in this book. Robustness methods in the modern sense of the word have been relatively slow to emerge in survival analysis, hindered by the presence of censoring, which is unaccounted for by the general robustness theory. Regarding the Cox model, another complication stems from its semi-parametric nature. This is in essence different from the fully parametric setting discussed at length in the previous chapters. In the early 1990s, researchers such as Hjort (1992) and Schemper (1992) started to tackle the problem, but the first real attempts to robustify Cox's partial likelihood appeared in Bednarski (1993), Sasieni (1993b,a) and Minder and Bednarski (1996). A complex methodology is generally required to cope with censoring. Bednarski's work is based on a doubly weighted partial likelihood and extends the IF approach presented in Chapter 2. Later, Grzegorek (1993) and Bednarski (1999, 2007) refined this weighting estimation technique to make it adaptive and invariant to time-transformation.
A good account of how outliers affect the estimation process in practical terms, with an illustration on clinical data, is given in Valsecchi et al. (1996). A comparison of Bednarski's approach and the work by Sasieni and colleagues is carried out in Bednarski and Nowak (2003). It essentially shows that none of these estimators clearly outperforms the others when problems in the response are the primary target. This technical literature focuses only on the estimation problem, prompting questions about the robustness of tests as defined in Chapter 2. Recent work by Heritier and Galbraith (2008) illustrates the current limitations of robust testing for this model and clarifies the link with the theory of Lin and Wei (1989).

Independently of all of these developments related to the Cox model, an innovative technique called regression quantiles appeared in the late 1970s that seemed totally unrelated to survival analysis. That pioneering work, due to Koenker and Bassett (1978), was introduced in the econometric literature as a robust alternative to linear regression. In this framework, any percentile of a particular outcome (e.g. a survival time), or a transformation of it, can be regressed on a set of explanatory variables. This work in itself and many subsequent papers would not be sufficient to be mentioned in this chapter, if an extension to the censored case had not been proposed. Fortunately, such a method, called censored regression quantiles, now exists. The extension, due to Portnoy (2003), has great potential in practice. It is easily computable, inherits the robustness of the sample quantiles, and constitutes a viable alternative to the Cox model, especially when the proportional hazard assumption is not met.
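A flavour of the Koenker–Bassett idea in the simplest (intercept-only) case: the τth regression quantile minimizes the asymmetric 'check' loss ρ_τ(u) = u(τ − 1{u < 0}), whose minimizer is the sample τ-quantile. A small illustrative sketch (the data and names are our own):

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

rng = np.random.default_rng(0)
y = rng.normal(size=101)
tau = 0.25

# Average check loss evaluated at every sample point ...
losses = [check_loss(y - c, tau).mean() for c in y]
best = y[int(np.argmin(losses))]

# ... is minimized at the sample tau-quantile (an order statistic)
print(np.isclose(best, np.quantile(y, tau, method="inverted_cdf")))
```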
This chapter is organized as follows. Cox's partial likelihood and the classical theory are reviewed in Section 7.2. The so-called robust sandwich formula of Lin and Wei (1989) and its link to the IF are presented. The lack of robustness of standard estimation and inferential procedures is illustrated by the analysis of the myeloma data. A robust (adaptive) estimator based on the work of Bednarski and colleagues is presented and illustrated in Section 7.3. Issues related to robust testing in the Cox model and its current limitations are also discussed. A complete worked-out example using the well-known veterans' administration lung cancer data (see e.g. Kalbfleisch and Prentice, 1980) is described in Section 7.4. Other issues, including model misspecifications, are outlined in Section 7.5. Finally, Section 7.6 is devoted to censored regression quantiles: we first introduce quantile regression, discuss its extension to censored data and apply the method to the lung cancer dataset.
7.2 The Cox Model

7.2.1 The Partial Likelihood Approach

As mentioned earlier, the proportional hazard model introduced by Cox (1972) is probably the most commonly used model to describe the relationship between a set of covariates and survival time or another time-to-event, possibly censored. Let (ti, δi) be independent random variables recording the survival time and absence of censoring (δi = 1 if ti is observed, 0 otherwise) for a sample of n individuals. It is convenient to write ti = min(ti0, ci), where ti0 is the possibly unknown survival time and ci the censoring time. The ti0 are independent random variables from a cumulative distribution F(· | xi) with density f(· | xi), where xi is a q-dimensional vector of fixed covariates. For simplicity we consider the standard case where all time points are different and ordered, i.e. t1 < t2 < · · · < tn. We also assume that the censoring mechanism is non-informative. The Cox model relates the survival time t to the covariates x through the hazard function of F:¹

λ(t | x) = λ0(t) exp(x^T β),   (7.1)

where λ0(t) is the so-called baseline hazard, usually unspecified, and β the regression parameter vector.² In essence, λ(t | x) measures the instantaneous risk of death (or hazard rate) at time t for an individual with specific characteristics described by x, given that they have survived so far. The interesting feature of formulation (7.1) is that λ(t | x) is the product of a baseline hazard λ0(t) and an exponential term depending on the covariates. This has two major advantages. First, as we will see in Section 7.2.1, it is not necessary to know the baseline hazard λ0(t) to estimate the coefficients β. Second, we can derive immediately the effect of an increase of one unit in a particular covariate xj (e.g. the effect of an experimental treatment represented by a binary indicator: one for treatment, zero for placebo) on survival.

¹ The hazard function of a distribution function F with density f is f(t)/(1 − F(t)).
² Note that, by writing (7.1) on the log scale, log(λ0(t)) can be seen as the intercept term added to the linear predictor x^T β, so that dim(β) = q.
Indeed, such an increase translates into a constant relative change exp(βj) of the hazard λ(t). This quantity is the hazard ratio (HR) and is usually interpreted as the relative risk of death related to an increment of one in that particular predictor. This property justifies the terminology proportional hazard model, commonly used for the Cox model. Model (7.1) encompasses two important parametric models, namely the exponential regression model, for which λ0(t) = λ, and the Weibull regression model, for which λ0(t) = λγ t^{γ−1}. However, in a fully parametric setting, the additional parameters λ and/or γ need to be estimated along with the slopes β for these models to be fully specified. In the proposal of Cox (1972), this is not necessary. Equation (7.1) can be expressed equivalently through the survival function S(t | x) = 1 − F(t | x) (see, for instance, Collett (2003b) and Therneau and Grambsch (2000)), given by

S(t | x) = {S0(t)}^{exp(x^T β)},   (7.2)

where S0(t) is defined through

−log(S0(t)) = ∫_0^t λ0(u) du = Λ0(t),   (7.3)

and Λ0(t) is the baseline cumulative hazard obtained by integrating λ0(u) between zero and t. The usual estimate of β is the parameter value that maximizes the partial likelihood

L(β) = ∏_{i=1}^n [ exp(xi^T β) / Σ_{j≥i} exp(xj^T β) ]^{δi},   (7.4)

or, equivalently, the solution of the first-order equation

Σ_{i=1}^n δi [ xi − S^{(1)}(ti; β) / S^{(0)}(ti; β) ] = 0,   (7.5)

where

S^{(0)}(ti; β) = Σ_{j≥i} exp(xj^T β)  and  S^{(1)}(ti; β) = Σ_{j≥i} exp(xj^T β) xj,
as in Minder and Bednarski (1996) and Lin and Wei (1989).³ The solution of (7.5) is the partial likelihood estimator (PLE), also denoted by β̂[PLE]. Equation (7.5) is a simple rewriting of a more conventional presentation as in, for instance, Collett (2003b, Chapter 3). There, the risk set R(ti) at time ti is used, i.e. the set of all patients who have not yet achieved the event by time ti and are then still 'at risk' of dying. It is formed of all observations with indices greater than or equal to i.

³ The idea is to base the likelihood function on the probability for subject i to achieve the event at time ti. This is given by the ratio of the hazard at time ti of subject i over the sum of the hazards of all subjects who have not yet experienced the event by time ti, i.e. the set j ≥ i (also called the risk set). In this ratio, the baseline hazard cancels out and one obtains the expression given in (7.4).
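For distinct, increasingly ordered event times, the partial likelihood (7.4) and the score (7.5) are easy to compute with reverse cumulative sums. A minimal numpy sketch (all names and the toy data are illustrative), which also checks the analytic score against a numerical gradient of log L(β):

```python
import numpy as np

def cox_loglik(beta, delta, X):
    """Log partial likelihood (7.4) for distinct times, rows sorted by time."""
    eta = X @ beta
    # S0_i = sum_{j >= i} exp(x_j' beta), via a reversed cumulative sum
    S0 = np.cumsum(np.exp(eta)[::-1])[::-1]
    return float((delta * (eta - np.log(S0))).sum())

def cox_score(beta, delta, X):
    """Score (7.5): sum_i delta_i (x_i - S1_i / S0_i)."""
    eta = X @ beta
    w = np.exp(eta)
    S0 = np.cumsum(w[::-1])[::-1]
    S1 = np.cumsum((w[:, None] * X)[::-1], axis=0)[::-1]
    return (delta[:, None] * (X - S1 / S0[:, None])).sum(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))                 # covariates, rows already time-ordered
delta = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # censoring indicators
beta = np.array([0.5, -0.3])

# Sanity check: analytic score vs numerical gradient of the log partial likelihood
eps = 1e-6
num = np.array([(cox_loglik(beta + eps * e, delta, X)
                 - cox_loglik(beta - eps * e, delta, X)) / (2 * eps)
                for e in np.eye(2)])
print(np.allclose(cox_score(beta, delta, X), num, atol=1e-5))
```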
The purpose of writing (7.5) in that way is to stress the similarity with the definition of an M-estimator. Indeed, let

Ui = U(ti, δi, xi; β) = δi [ xi − S^{(1)}(ti; β) / S^{(0)}(ti; β) ]

be the individual contribution (or score); then, at least formally, (7.5) looks like an M-estimator with Ψ-function U(t, δ, x; β). The main difference is that the scores Ui are no longer independent, since the two sums S^{(0)}(ti; β) and S^{(1)}(ti; β) depend on subsequent time points tj for j ≥ i. We have assumed, for simplicity, that all observed time points in the sample are different. The partial likelihood approach is generally modified to handle ties. We refer the reader to Therneau and Grambsch (2000) and Collett (2003b) for a more general introduction, and to Kalbfleisch and Prentice (1980) for technical details. Under some regularity conditions, the PLE is asymptotically normally distributed with asymptotic variance V = I(β)^{−1}, where I(β) is the information matrix for that model (see Kalbfleisch and Prentice (1980, Chapter 4) for details). Here I(β) is usually estimated by minus the second derivative of the average log partial likelihood, i.e.

Î(β) = −(1/n) Σ_i ∂Ui/∂β.   (7.6)

Numerical values are obtained by replacing β by β̂[PLE] in (7.6). An alternative formula for the variance of β̂[PLE] will be given in Section 7.2.4. The asymptotic distribution can then be used for testing a single hypothesis H0: βj = 0 in a standard way. One just defines a z-statistic as

z-statistic = β̂[PLE]j / SE(β̂[PLE]j),   (7.7)

where

SE(β̂[PLE]j) = { n^{−1} [Î(β̂[PLE])^{−1}]jj }^{1/2}   (7.8)

is the (estimated) standard error of β̂[PLE]j, i.e. the square root of the jth diagonal element of n^{−1} Î(β̂[PLE])^{−1}. Here z is compared with a standard normal distribution. More generally, standard asymptotic tests such as the LRT, score and Wald tests are available to test a composite null hypothesis of the type H0: β(2) = β(2)^0, with β(1) unspecified and β = (β(1)^T, β(2)^T)^T. Specifically, the LRT is equal to twice the difference in the maximum log partial likelihood between the full and reduced models, i.e.

LRT = 2(log(L(β̂[PLE])) − log(L(β̇[PLE]))),   (7.9)

where β̂[PLE] denotes the PLE at the full model, and β̇[PLE] its value under the null hypothesis (at the reduced model). The Wald test is based on β̂[PLE](2), the second component of the PLE in the full model, i.e.

W = n(β̂[PLE](2) − β(2)^0)^T V̂(22)^{−1} (β̂[PLE](2) − β(2)^0),   (7.10)
0005(22) is the block (22) of the estimated asymptotic variance V 0005=0005 where V I −1 (βˆ[PLE] ). For the score test, the general approach outlined in Section 2.5.1 can be extended to the Cox model. Under H0 , all three tests are asymptotically distributed as a χk2 distribution, where k = dim(β(2) ). They are generally provided by all common statistical packages.
7.2.2 Empirical Influence Function for the PLE The IF for the PLE is based on complex functionals taking into account the semiparametric nature of the Cox model and the presence of censoring. We give here its 0005 i evaluated at the observation (ti , δi , xi ) as originally derived empirical version IF 0005 i can be used as a diagnostic tool to assess by Reid and Crépeau (1985). Here IF the effect of a particular observation on the PLE. It is a q-dimensional vector proportional to a shifted score, i.e. 0005 i = I0005−1 (β)(Ui − Ci (β)), IF
(7.11)
where Ci (β) (or to be more specific C(ti , δi , xi ; β)) is a term depending on the observations in a complicate way; see Section 7.2.4. This ‘shift’ is not needed for consistency but to account for the dependence across the individual scores Ui . 0005 i is similar to As noted by Reid and Crépeau (1985), the first component in IF the usual IF for M-estimators in the uncensored case, and the second component −I0005−1 (β)Ci (β) represents the influence of the ith observation on the risk set of other subjects. A similar expression with a two-part IF is generally found for estimators for censored data. The first term is unbounded in xi , which means that spurious observations in the covariates can ruin the PLE. The second term shows the same deficiency and, as a function of ti ’s only, can be large enough to compromise the estimation process. It captures the influence of a particular observation (e.g. an abnormal long-term survivor) on the risk set of the others subjects. Valsecchi et al. (1996) give a good explanation on the acting mechanism. Abnormal longterm survivors ‘exert influence in two ways. First, that individual forms part of the very many risks sets (for all preceding failures). Secondly, whereas early failures will be matched to a large risk set, individuals failing toward the end of the study may, depending on the censoring, be matched to a very small risk set. Two groups may be initially of similar size but as time progresses the relative size of the two groups may steadily change as individuals in the high risk group die at a faster rate than those in the other group. Eventually the risk set may be highly imbalanced with just one or two individuals from the high risk group, so that removal of one such individual will greatly affect the hazard ratio.’ Atypical long-term survivors are not the only type of abnormal response that can be encountered but they are by far the most dangerous. Another possibility occurs when a low-risk individual fails early. 
As pointed out by Sasieni (1993a), such outliers are less harmful as their early disappearance from the risk set reduces their contribution to the score equation. Despite its relative complexity, the IF for the PLE has similar properties to that given for the M-estimators of Chapter 2. It still measures the worst asymptotic bias caused
7.2. THE COX MODEL
197
to the estimator by some infinitesimal contamination in the neighborhood of the Cox model. It is therefore desirable to find estimators that bound this influence in some way. The two-part structure of (7.11) rules out a weighting similar to that used earlier in a fully parametric model. Innovative ways have to be found to control both components, in particular the influence on the risk set. This is developed further in Section 7.2.4 in relation to the asymptotic variance.
7.2.3 Myeloma Data Example

Krall et al. (1975) discuss the survival of 65 multiple myeloma patients and its association with 16 potential predictors, all listed in their Table 2. They originally selected the logarithms of blood urea nitrogen (bun), serum calcium at diagnosis (calc) and hemoglobin (hgb) as significant covariates. Chen and Wang (1991) used their diagnostic plot and found that case 40 is an influential observation. They also concluded that no log-transformation of the three predictors was necessary. We also use the data without transformation to illustrate the IF approach. Table 7.1 presents the most influential data points as detected by the change Δ_i β̂ in the regression coefficient β̂[PLE] when the ith observation is deleted. Figures are given as percentages to make changes comparable across coefficients: percentages were simply obtained by dividing the raw change by the absolute value of the corresponding estimate obtained on all data points. The deletion of any of the remaining observations did not change the coefficients by more than ±11% for these two variables, and even less for bun. These values can be seen as a handy approximation of the IF itself, as ÎF_i ≈ (n − 1)Δ_i β̂, as pointed out by Reid and Crépeau (1985). This result is generally true for all models but is particularly useful here, where the IF has a complicated expression. Clearly case 40 is influential, confirming the analysis by Chen and Wang (1991). Other observations might also be suspicious, e.g. cases 3 or 48. A careful look at all exact values of the IF (not shown here) shows that the approximation works reasonably well, justifying the use of Δ_i β̂ as a proxy for ÎF_i. A word of caution must be added here. The empirical IF in (7.11) is typically computed at the PLE, itself potentially biased. This can cloud its ability to detect outliers, as pointed out by Wang et al. (2006).
However, extreme observations are generally correctly identified by this simple diagnostic technique. To illustrate how they can distort the estimation and testing procedures, we deleted case 40 and refitted the data. PLE estimates, standard errors and p-values for significance testing (z-statistic in (7.7)) are displayed in Table 7.2. Case 40 is actually a patient with high levels of serum calcium who survived much longer than similar patients. For that reason this subject alone tends to determine the fit, an undesirable feature as the aim of the analysis is to identify associations that hold for the majority of the subjects. When all observations are included in the analysis, calcium is not significant (p = 0.089). After removal of case 40, a highly significant increase in the risk of death of exp(0.31) − 1.0 = 0.36, 95% CI (0.1; 0.7), per additional unit of serum calcium appears. This clearly illustrates the dramatic effect of case 40 on the test. The differences are even more pronounced if both cases 40 and 48 are removed, making the need for a robust analysis even greater. However, as the dataset is relatively
SURVIVAL ANALYSIS
198
Table 7.1 Diagnostics Δ_i β̂ for myeloma data.

Case    hgb     calc
40      +17%    −48%
48        0%    −16%
44      +13%    +12%
3        −1%    +24%
2        +2%    +35%

The regression coefficients are estimated by means of the PLE β̂[PLE].
Table 7.2 PLE estimates and standard errors for the myeloma data.

                     All data                     Case 40 removed
Variable    Estimate (SE)      p-value    Estimate (SE)      p-value
bun          0.02 (0.005)       0.000      0.02 (0.005)       0.000
hgb         −0.14 (0.059)       0.019     −0.19 (0.063)       0.003
calc         0.17 (0.099)       0.089      0.31 (0.112)       0.006

Ties treated by Efron's method. Model-based SEs computed using (7.8).
small (n = 65), influential observations are more harmful, and case-deletion followed by refitting becomes a difficult exercise. We do not claim to give a definitive analysis of these data here. The purpose was simply to illustrate the sensitivity of the PLE with respect to unexpected perturbations, especially for small to moderate sample sizes.
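To make the case-deletion diagnostic Δ_i β̂ ≈ ÎF_i/(n − 1) concrete, the following minimal sketch fits a one-covariate Cox partial likelihood by Newton–Raphson on simulated data and recomputes the estimate with one observation deleted. All data, names and settings are invented for illustration; a real analysis would rely on an established survival-analysis routine rather than this toy fitter.

```python
import numpy as np

def cox_ple(t, x, delta, n_iter=50):
    # Newton-Raphson for the PLE of a one-covariate Cox model (no ties).
    order = np.argsort(t)
    t, x, delta = t[order], x[order], delta[order]
    beta = 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for i in range(len(t)):
            if not delta[i]:
                continue                          # censored: no score term
            risk = np.exp(x[i:] * beta)           # risk set {j : t_j >= t_i}
            s0, s1 = risk.sum(), (x[i:] * risk).sum()
            s2 = (x[i:] ** 2 * risk).sum()
            score += x[i] - s1 / s0               # U_i = delta_i (x_i - S1/S0)
            info += s2 / s0 - (s1 / s0) ** 2      # observed information term
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta

rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=n)
t = rng.exponential(1.0 / np.exp(0.5 * x))        # true beta = 0.5
delta = np.ones(n, dtype=bool)                    # no censoring, for simplicity

beta_hat = cox_ple(t, x, delta)

# Delta_1 beta-hat: change in the estimate when observation 0 is deleted,
# a cheap proxy for IF-hat / (n - 1).
keep = np.arange(n) != 0
delta1 = cox_ple(t[keep], x[keep], delta[keep]) - beta_hat
```

Dividing the raw change by |β̂| then gives the percentage diagnostics reported in Table 7.1.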
7.2.4 A Sandwich Formula for the Asymptotic Variance

A different estimate for the asymptotic variance of the PLE has been proposed by Lin and Wei (1989). It is often called the 'robust' variance in common statistical packages but, as we argue below, it is not robust in the sense used in this book. We therefore name it the LW formula or classical sandwich formula. Perhaps the best way to introduce the LW formula is through its link to the IF, something that is generally overlooked. A careful reading of Reid and Crépeau (1985, p. 3) shows that n^{-1} Σ_i ÎF_i ÎF_i^T, where ÎF_i is as in (7.11), provides another asymptotic variance estimate for the PLE. Elementary algebra shows that this can be rewritten as

    V̂_LW(β) = Î^{-1}(β) Ĵ(β) Î^{-1}(β),                                (7.12)

where Î(β) is the information matrix estimator given in (7.6) and

    Ĵ(β) = Σ_i U_i* U_i*^T,                                             (7.13)
Table 7.3 Estimates and standard errors for the PLE for the myeloma data.

                     All data                       Case 40 removed
Variable    Estimate (SE_LW)     p-value    Estimate (SE_LW)     p-value
bun          0.02 (0.004)         0.000      0.02 (0.004)         0.000
hgb         −0.14 (0.059)         0.019     −0.19 (0.060)         0.002
calc         0.17 (0.127)         0.186      0.31 (0.103)         0.003

Ties treated by Efron's method. SE computed using (7.12).
where U_i* = U_i − C_i(β) is a shifted score. If we write down the correcting factor (shift)

    C_i(β) = exp(x_i^T β) x_i Σ_{j≤i} δ_j / S^{(0)}(t_j; β) − exp(x_i^T β) Σ_{j≤i} δ_j S^{(1)}(t_j; β) / [S^{(0)}(t_j; β)]^2

and replace β by the PLE in (7.12), we obtain the variance estimate proposed by Lin and Wei (1989, p. 1074). Lin and Wei's derivation is actually more general as it also covers the case of time-dependent covariates. Although the formula presented here assumes n different time points, its extension to data with ties is straightforward (see Lin and Wei (1989) and Reid and Crépeau (1985) for technical details). As an illustration, we refitted the myeloma data using the exact same model as before but with (7.12) as the variance estimate. PLE estimates, standard errors and p-values are displayed in Table 7.3. Note that the coefficients reported in Table 7.3 are the same as those reported in Table 7.2 since the estimation procedure is still the PLE. On the other hand, the standard errors differ as they are now based on the LW formula. The p-values reported here refer to the individual significance z-tests, i.e. for H0: β_j = 0,

    z_LW-statistic = β̂[PLE]j / SE_LW(β̂[PLE]j),                          (7.14)

where SE_LW(β̂[PLE]j) is the standard error of β̂[PLE]j based on the LW formula (7.12), i.e. the square root of n^{-1}[V̂_LW(β̂[PLE])]_jj. Results are very similar to those obtained in Table 7.2. It is clear that case 40 is influential even if the LW formula is used. In other words, the LW formula offers no protection against extreme (influential) observations. For example, no effect of calcium appears when all data are fitted (p-value = 0.186), and after removal of case 40 the deleterious effect of this observation on the significance of serum calcium seems obvious as a p-value of 0.003 is reported. So we may legitimately ask 'What is the LW formula robust against?'. Lin and Wei (1989) motivate their approach by mentioning some structural misspecifications, in particular covariate omission. As an example they consider a randomized clinical trial in which the effectiveness of a particular treatment on survival time is assessed. The true model is thought to be the Cox model with
parameter β. We can split β into two parts, ν and η, representing, respectively, the treatment parameters and the covariate effects. A valid test of no treatment effect is sought even if some of the predictors may be missing from the working model. Alternatively, investigators may simply prefer an unadjusted analysis for generalizability purposes, in which case only ν will be included in the analysis. Lin and Wei (1989) showed that approximately valid inference can still be achieved using their formula. This, of course, assumes that no treatment-by-covariate interaction exists. To test the null hypothesis of no treatment effect (i.e. H0: ν = 0), one then uses (7.14). Lin and Wei (1989) also considered more serious departures from the Cox model, e.g. misspecification of the hazard form. This includes models with the hazard defined on the log-scale or even a multiplicative model. Their simulation study shows that their approach allows approximately valid inference in the sense that the type I error (empirical level) of the Wald test using the LW formula (7.12) is close to the nominal level. The term 'robust' formula is hence used in that sense. This type of robustness, however, does not protect against biases induced by extreme (influential) observations. The reader is referred to Section 7.5 for further discussion of this topic in a more general setting.
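Numerically, the LW formula (7.12)–(7.13) is just a matrix sandwich built from per-observation shifted scores and an information-matrix estimate. The sketch below shows the shape of the computation on made-up inputs; the stand-in scores and the crude information estimate are invented for the example and are not the book's implementation.

```python
import numpy as np

def lw_sandwich(I_hat, U_star):
    # V_LW = I^{-1} J I^{-1} with J = sum_i U*_i U*_i^T  (formulas 7.12-7.13).
    # I_hat: (p, p) information estimate; U_star: (n, p) shifted scores.
    J_hat = U_star.T @ U_star            # sum of outer products of the scores
    I_inv = np.linalg.inv(I_hat)
    return I_inv @ J_hat @ I_inv

rng = np.random.default_rng(1)
n, p = 100, 3
U_star = rng.normal(size=(n, p))         # stand-in for the shifted scores U_i*
I_hat = np.cov(U_star.T) * n             # a crude information-matrix estimate
V_lw = lw_sandwich(I_hat, U_star)
se_lw = np.sqrt(np.diag(V_lw) / n)       # SE_LW: divide by n, then square root
```

The z-statistic (7.14) is then simply each coefficient divided by the corresponding entry of `se_lw`.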
7.3 Robust Estimation and Inference in the Cox Model

7.3.1 A Robust Alternative to the PLE

The robust alternative to the PLE we present here has emerged over the years from Bednarski's research. It is based on a doubly weighted PLE that astutely modifies the estimating equation (7.5) without fundamentally changing its structure. It also has the advantage of being easily computable, with code available in the R Coxrobust package. Following Bednarski (1993) and Minder and Bednarski (1996), we assume that a smooth weight function w(t, x) is available. Denote by w_ij = w(t_i, x_j) and w_i = w_ii = w(t_i, x_i) the weights for all 1 ≤ i ≤ j ≤ n, and set all other weights to zero by construction. Define the two sums

    S_w^{(0)}(t_i; β) = Σ_{j≥i} w_ij exp(x_j^T β)                        (7.15)

    S_w^{(1)}(t_i; β) = Σ_{j≥i} w_ij exp(x_j^T β) x_j                    (7.16)
in a similar way to their unweighted counterparts of Section 7.2. A natural extension of the PLE is the solution for β of

    Σ_{i=1}^n w_i δ_i ( x_i − S_w^{(1)}(t_i; β) / S_w^{(0)}(t_i; β) ) = 0.    (7.17)

The weight function w(t, x) enters at two points: (i) in the main sum with w_i downweighting the uncensored observations; (ii) in the inner sums S_w^{(0)} and S_w^{(1)} with
all the w_ij for i ≤ j ≤ n. Equation (7.17) clearly has a similar structure to (7.5). Moreover, when all of the weights are chosen equal to one, the solution of (7.17) is the PLE, so that (7.17) can literally be seen as an extension of equation (7.5). By analogy with the notation of Section 7.2 we also denote by U_w,i the individual score, i.e. the contribution of the ith observation to the sum in (7.17),

    U_w,i = w_i δ_i ( x_i − S_w^{(1)}(t_i; β) / S_w^{(0)}(t_i; β) ),     (7.18)

and by U_w the total score, i.e. the left-hand side of (7.17). A proper choice of w(t, x) is pivotal to make the solution of (7.17) both consistent and robust. The weights we consider here truncate large values of g(t) exp(x^T β), where g(t) is an increasing function of time.4 Indeed, Bednarski (1993) and Minder and Bednarski (1996), considering the exponential model, argued that the PLE often fails when t_i exp(x_i^T β) is too large. They hence proposed weight functions based on truncations of such quantities (i.e. with g(t) = t). Bednarski (1999), however, pointed out that a better choice for g(t) is the baseline cumulative hazard Λ_0(t) in (7.3). The rationale for this is that Λ_0(t_i) exp(x_i^T β), given the covariate vector x_i, has a unit exponential distribution if the Cox model holds and t_i is not censored. This gives rise to the following weights:

    w(t, x) = K − min(K, Λ_0(t) exp(x^T β))              (linear),
    w(t, x) = exp(−Λ_0(t) exp(x^T β) / K)                (exponential),
    w(t, x) = max(0, K − Λ_0(t) exp(x^T β))^2 / K^2      (quadratic),

where K is a known cut-off value that can be chosen on robustness and efficiency grounds. Such weights have been used successfully ever since and are now implemented in the R Coxrobust package. In practice, two additional difficulties occur: first, the truncation value K is generally difficult to specify a priori, especially for censored data;5 second, the unknown cumulative baseline hazard Λ_0(t) is needed to compute the weights.
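The three truncation-based weight functions can be written down directly. The sketch below is a plain transcription of the formulas above, with the cut-off `K` and the inputs chosen arbitrarily for illustration; it is not the coxrobust implementation.

```python
import numpy as np

def are_weights(Lambda0_t, lin_pred, K, kind="exponential"):
    # Truncation weights w(t, x) based on r = Lambda0(t) * exp(x^T beta),
    # which is unit exponential under the Cox model for uncensored times.
    r = Lambda0_t * np.exp(lin_pred)
    if kind == "linear":
        return K - np.minimum(K, r)                # 0 once r exceeds K
    if kind == "exponential":
        return np.exp(-r / K)                      # smooth downweighting
    if kind == "quadratic":
        return np.maximum(0.0, K - r) ** 2 / K**2  # 0 once r exceeds K
    raise ValueError(kind)

# Toy inputs: one 'typical' and one 'extreme' value of Lambda0(t) exp(x'b).
Lam, lp, K = np.array([0.0, 5.0]), np.array([0.0, 0.0]), 2.0
w_lin = are_weights(Lam, lp, K, "linear")
w_exp = are_weights(Lam, lp, K, "exponential")
w_quad = are_weights(Lam, lp, K, "quadratic")
```

Note that the linear weights live on [0, K] while the exponential and quadratic ones live on [0, 1]; only relative weights matter in (7.17).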
This hazard is not often estimated in the Cox model as it is not actually needed to obtain the PLE and related tests. To overcome the first problem, Bednarski and colleagues proposed the use of an adaptive procedure that adjusts K at each step. They deal with the second problem by jointly (and robustly) estimating Λ_0(t) and β; see Grzegorek (1993) and Bednarski (1999, 2007). To compute a robust estimator defined through (7.17) with one of the proposed weighting schemes updated adaptively, one can use the following algorithm. Given a specific quantile value τ, e.g. τ = 90%, used to derive the truncation value adaptively, one proceeds through the following steps.

• Initialization: obtain an initial estimator β̂^0, e.g. the PLE, compute the cut-off K as the pre-specified quantile τ of the empirical distribution of t_i exp(x_i^T β̂^0), i = 1, ..., n, set the current estimate b to β̂^0 and initialize the set of weights.

• Take the current estimate b and evaluate K as the same quantile τ of the empirical distribution of Λ̂_w(t_i) exp(x_i^T b) with

    Λ̂_w(t) = Σ_{t_i ≤ t} w_i δ_i / ( Σ_{j≥i} w_ij exp(x_j^T b) ).       (7.19)

• Update b by solving (7.17) and then recompute the set of weights.

• Repeat the previous two steps until convergence.

Technical details about the adaptive process and formula (7.19) are omitted for simplicity but can be found in Bednarski (2007). Note though that Λ̂_w(t) is a robust adaptation of the Breslow estimator.6 The final value obtained through this algorithm is the adaptive robust estimator (ARE) or β̂[ARE]. It can generally be obtained within a few iterations, even for relatively large datasets. An advantage of this adaptive weighting scheme based on the cumulative hazard estimate (7.19) is that the ARE is invariant with respect to time transformations. It can also better cope with censored data through the way the cut-off value is updated. The price to pay for this flexibility is purely computational. Bednarski (1999) shows that the ARE has the same asymptotic distribution as its 'fixed-weight' counterpart defined in (7.17) and performs similarly in terms of robustness. The issue of the choice of weight function or quantile τ is more a matter of efficiency and/or personal preference. This question is discussed in the next section. Finally, it should be stressed that other possible weights have been proposed by Sasieni (1993a,b). Although the spirit of his approach is essentially the same, the proposed weights cannot handle abnormal responses for patients with extreme values in the covariates, such as elevated blood cell counts or laboratory readings. Such extreme but still plausible data points are harmful to classical procedures. In contrast, the ARE is built to offer some protection in that case. A more formal treatment of these alternative weighting schemes can be found in Bednarski and Nowak (2003), along with a comparison with the ARE.

4 Note that the notation above does not mention any dependence of w(t, x) on the regression parameter β, and the weights should rather be seen as 'fixed'. However, Bednarski (1999) showed that, under stringent conditions, the dependence on β does not modify the asymptotic distribution of the resulting estimator.
5 One could argue that a quantile of the unit exponential distribution could be used, at least in the absence of censoring.
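As a sanity check on formula (7.19), a direct transcription reduces to the classical Breslow (Nelson–Aalen-type) estimator when every weight equals one. The following sketch assumes a single covariate and sorted, tie-free data; all inputs are invented for the example.

```python
import numpy as np

def breslow_weighted(t, x, delta, beta, w_diag, w_mat):
    # Lambda_w(t) of (7.19) evaluated at the observed times.
    # One covariate, data sorted by t, no ties;
    # w_diag[i] = w_ii and w_mat[i, j] = w_ij (zero for j < i by construction).
    risk = np.exp(x * beta)
    inc = np.array([w_diag[i] * delta[i] / (w_mat[i, i:] * risk[i:]).sum()
                    for i in range(len(t))])     # w_i d_i / S_w^(0)(t_i; b)
    return np.cumsum(inc)                        # Lambda_w at t_1 <= ... <= t_n

# With all weights equal to one and beta = 0 this is the Nelson-Aalen /
# Breslow estimator: increments 1/n_i over the shrinking risk set.
n = 5
t = np.arange(1.0, n + 1)
x = np.zeros(n)
delta = np.ones(n)
L = breslow_weighted(t, x, delta, 0.0, np.ones(n), np.ones((n, n)))
```

With the weights from one of the truncation schemes plugged in, `L` is the robust hazard estimate used to update the cut-off K in the adaptive loop.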
7.3.2 Asymptotic Normality

Under regularity conditions on w(t, x) given in Bednarski (1993, 1999), the ARE is consistent and has the following asymptotic distribution:

    √n (β̂[ARE] − β) → N(0, V_w(β)),                                     (7.20)

where the asymptotic variance is given by a sandwich formula7

    V_w(β) = M_w^{-1} Q_w M_w^{-T}.                                      (7.21)
6 Formula (7.19) gives back the Breslow estimator of the baseline cumulative hazard when all weights are equal to one and thus b is the PLE; see, for instance, Collett (2003b, p. 101).
7 Again, to obtain the variance of β̂[ARE], one needs to divide (7.21) by n.
The matrices M_w = M_w(β) and Q_w = Q_w(β) are complicated expectations that we omit for simplicity; see Bednarski (1993) for details. Tedious but straightforward calculations show that their empirical versions have a much simpler form, i.e.

    M̂_w(β) = −(1/n) Σ_{i=1}^n ∂U_w,i / ∂β                               (7.22)

and

    Q̂_w(β) = (1/n) Σ_{i=1}^n U*_w,i U*_w,i^T                            (7.23)
with U_w,i given in (7.18) and U*_w,i = U_w,i − C_w,i(β) a shifted weighted score with shift given in (7.24). A final estimate for the asymptotic variance follows easily by replacing β by β̂[ARE] in (7.21)–(7.23). The asymptotic distribution (7.20) is valid not only under the assumption that the weights are fixed, i.e. K and g(t) are pre-specified independently of the regression parameters,8 but also for the adaptive weighting scheme described above; see Bednarski (1993, 1999, 2007) for details. Technical developments essentially show that the asymptotic result is not altered when smooth adaptive weight functions with bounded support and a robust hazard estimate are used. There is a clear link between the LW sandwich formula (7.12) and the asymptotic variance (7.21). The first thing to note is that for both the classical and robust estimators the sandwich formula stems from the same property, i.e. from the fact that n^{-1} Σ_i ÎF_i ÎF_i^T is another consistent variance estimator. We have seen this for the PLE in Section 7.2.2, and the same has been shown by Bednarski (1993, 1999) for the ARE. Second, tedious rewriting shows that the empirical IF for the ARE is again proportional to a shifted score:

    ÎF_w,i = M̂_w^{-1}(β) U*_w,i(β) = M̂_w^{-1}(β) ( U_w,i(β) − C_w,i(β) ),

with U_w,i(β) given in (7.18) and where the shift has an 'ugly' but computable expression given by

    C_w,i(β) = exp(x_i^T β) x_i Σ_{j≤i} w_j δ_j w_ji / S_w^{(0)}(t_j; β) − exp(x_i^T β) Σ_{j≤i} w_j w_ji δ_j S_w^{(1)}(t_j; β) / [S_w^{(0)}(t_j; β)]^2.    (7.24)

A careful look at all of the quantities involved in (7.18) and (7.24) and in the equations of Section 7.2.4 shows that, if all of the weights w_ij and w_i are equal to one, then not only is the ARE identical to the PLE but their IFs are the same, and consequently the asymptotic variance reduces to the LW formula (7.12). The robust approach proposed in this chapter can literally be seen as an extension of the PLE combined with its LW sandwich variance. In practice, this never happens, as the weights cannot all be set to one if one wants the ARE to be robust. This analogy is nevertheless useful, as it helps in understanding both formulas and their properties.

8 The function g(t) is discussed in Section 7.3.1.
The use of AREs is desirable from a robustness standpoint. However, an expected loss of efficiency with respect to the PLE is observed when the Cox model assumptions hold. Unlike in simpler models, e.g. linear regression or mixed models, it is impossible to calibrate the tuning constant (i.e. the quantile τ) to achieve a specific efficiency at the model for all designs. However, some hints can be given. First, it is clear that, by construction, the linear adaptive weights of Section 7.3.1 automatically set a certain percentage of the weights w_i to zero. If we choose τ = 90%, then roughly 10% of the weights will be zero even though the data arise from a proportional hazards model. This automatically generates a loss of efficiency at the model, as genuine observations are ignored. A similar argument holds for the quadratic weights. In contrast, the exponential weighting scheme of Section 7.3.1 is smoother, and the ARE with exponential weights generally performs better in terms of efficiency. Second, simulations can provide valuable information. Our (limited) experience with the exponential distribution indicates that adaptive exponential weights do reasonably well in terms of both robustness and efficiency when τ is chosen in the range 80–90%. An asymptotic efficiency relative to the PLE of at least 90% can easily be obtained even in the presence of a small amount of censoring, while linear weights achieve an efficiency of at most 80–90%. However, both weighting schemes perform equally well from a robustness point of view. For these reasons we tend to prefer exponential weights, with τ in the range 80–90%. Previous references by Bednarski and colleagues also used similar values of τ successfully. We would not recommend the use of much smaller quantiles. Finally, a choice of τ for a given weighting scheme could in principle be computed by simulation to achieve a predetermined loss of efficiency at a parametric model (i.e. a parametric form for the hazard).
For that purpose, one would need an idea of the censoring level, a rough idea of the true parameter values, and a distribution for the covariates or conditioning.

where n_i is the number of subjects still 'at risk' just prior to t_i and d_i the number of deaths at time t_i; see Kalbfleisch and Prentice (1980, p. 12) or Collett (2003b, p. 20). Model-based survival curves are obtained as follows. First, note that by combining (7.2) and (7.3) we obtain the usual (but rarely used) expression of S(t | x) as a function of β and the cumulative baseline hazard Λ_0(t), i.e.

    S(t | x) = exp(−Λ_0(t) exp(x^T β)).                                  (7.27)
Second, an overall survival curve estimate can simply be computed by averaging over the sample the predicted individual survival curves S(t | x_i) for t = t_j, j = 1, ..., n. For the ARE, the ith patient's survival prediction is obtained by replacing, in formula (7.27), the true cumulative baseline hazard by its estimate (7.19) and the linear predictor by x_i^T β̂[ARE]. The same can be done for the PLE by using the corresponding classical estimates. The comparison between Ŝ[KM](t) and its Cox-based counterparts proceeds by plotting their 'standardized' differences versus the logarithm of the survival time, possibly by categories (e.g. quartiles) of the linear predictor x^T β̂[PLE]. For the standardization factor, we follow Minder and Bednarski (1996) and use the square root of Ŝ[KM](t)(1 − Ŝ[KM](t)). Figure 7.5 displays the standardized difference per tertile of the linear predictor x^T β̂[PLE]. The horizontal lines represent plus or minus twice the standard error of the Kaplan–Meier estimate obtained through the Greenwood formula (see Collett, 2003b, pp. 24–25) to take the sample variability into account, at least approximately. A good agreement between the Kaplan–Meier and ARE survival curves can be observed in all panels. In contrast, some discrepancy appears when the PLE is used to fit the Cox model, in particular in panels (a) and (c). This lack of fit disappears after deletion of the extreme observations identified earlier and repetition of the procedure (figures not shown). This is a compelling argument in favor of the robust fit, assuming that the model is structurally correct. Other plots can also be found in Minder and Bednarski (1996) and Bednarski (1999). Note as well that separate plots for each treatment arm could also be drawn, but this is not done here as the experimental treatment was found to be ineffective.

11 In the presence of ties, formula (7.26) still applies by replacing the t_i, i = 1, ..., n, by the k < n distinct ordered survival times t_1 < t_2 < ... < t_k.
[Figure 7.5 shows three panels of standardized differences (PLE − KM and ARE − KM) plotted against log survival time: panel (a) first tertile, panel (b) second tertile, panel (c) third tertile of the linear predictor.]

Figure 7.5 Standardized differences between the Kaplan–Meier estimate (KM) and the model-based survival curves (PLE or ARE).
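The Kaplan–Meier and Greenwood machinery behind Figure 7.5 is standard; a minimal sketch, assuming no ties and using invented data, is:

```python
import numpy as np

def km_greenwood(t, delta):
    # Kaplan-Meier estimate and Greenwood standard error at the ordered
    # event times; data assumed free of ties.
    order = np.argsort(t)
    t, delta = t[order], delta[order]
    n = len(t)
    at_risk = n - np.arange(n)                 # n_i just prior to t_i
    s = np.cumprod(1.0 - delta / at_risk)      # KM survival curve
    gw = np.cumsum(delta / (at_risk * (at_risk - delta)))
    return t, s, s * np.sqrt(gw)               # times, S_KM, Greenwood SE

t = np.array([2.0, 5.0, 3.0, 8.0])
delta = np.array([1.0, 0.0, 1.0, 0.0])         # 1 = death, 0 = censored
times, s_km, se = km_greenwood(t, delta)

# Standardized difference against a model-based curve s_model evaluated at
# the same times: (s_model - s_km) / np.sqrt(s_km * (1 - s_km)),
# plotted versus np.log(times), with bands at +/- 2 * se.
```

Here `s_model` would come from averaging the predictions exp(−Λ̂(t) exp(x_i^T β̂)) over the subjects in the tertile, with Λ̂ and β̂ taken from either the classical or the robust fit.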
7.5 Structural Misspecifications

7.5.1 Performance of the ARE

The main objective of this book is to present robust techniques dealing with distributional robustness. In essence, we assume a specific model (e.g. the Cox model) and propose estimation or testing procedures that are meant to be stable and more efficient than the classical procedures in a 'neighborhood' of the working model. We normally speak of model misspecification in that sense. This sometimes creates confusion, in particular for the proportional hazards model, where the effects of many departures have been studied over the years, e.g. covariate omission, deviations from the proportional hazards assumption or measurement error in the variables. These can be seen as structural model misspecifications and are outside the scope of robustness theory. It is, however, important to discuss the performance of the robust procedures (estimation, tests) presented so far in that setting. Historically, researchers first studied the impact of covariate omission on the estimation process, in particular in randomized experiments where the main endpoint
is a specific time to event. Typically, the question of whether an unadjusted analysis of a two-arm randomized clinical trial (RCT) still provides a consistent estimate of the treatment effect was of primary interest. Work by Gail et al. (1984), Bretagnolle and Huber-Carol (1988) and others showed, both theoretically and by simulation, that if important predictors were omitted from the Cox model the classical estimate (PLE) was slightly biased toward the null. They even showed that this situation could occur when the data were perfectly balanced, as in RCTs, and was worsened by the presence of censoring. The key reason for this is that the PLE does not generally converge toward the true regression parameter unless the treatment effect is itself zero. This type of problem being structural, a similar situation arises with the ARE. No formulas have ever been established, but a hint was given by Minder and Bednarski (1996), who explicitly investigated the problem.12 They show in their simulation (type B) that the robust proposal is indeed biased toward the null but tends to be less biased than the PLE. A similar situation is encountered with measurement error problems. If a predictor x_1 cannot be measured exactly but instead v_1 = x_1 + u is used, where u is some random term added independently, it is well known that β_1, the slope of x_1, is not estimated consistently; see Carroll et al. (1995) for instance. An attenuation effect or dilution is observed if the naive approach is used (i.e. regressing the outcome on v_1 and the other covariates) for most types of regression, including Cox's. In addition, estimates of the other slopes can also be affected. Again, the ARE is not specifically built to remove such bias, which results from a key feature of the data, i.e. a structural model misspecification, ignored in a naive analysis (classical or robust).
In another simulation (type C), Minder and Bednarski (1996) showed that the ARE tended to be less biased than its classical counterpart. In such a case it is highly recommended to correct directly for measurement error using one of the many techniques described, for instance, by Carroll et al. (1995) and in the abundant literature dealing with this issue. Robust methods could then also be specifically developed in that setting, i.e. with a model that includes possible measurement error. The problem of 'what to do when the hazard is not proportional' often arises and is even more important in practice. Two elements of an answer can be given here. First, if non-proportionality is caused by a subgroup of patients responding differently, then the ARE will certainly provide safer results. Second, if the problem is more structural, e.g. a multiplicative model better captures the inherent nature of the data, the ARE will not perform any better than the classical technique. The reason is that (i) both methods still assume proportional hazards; (ii) this type of departure is not in a neighborhood of the working model. In other words, it will not be 'within the range' of what the robust method can handle. By definition, the problem is more structural than distributional and beyond the scope of the current method. Finally, one may wonder how large amounts of censoring affect the ARE, or whether something similar is possible for the time-dependent Cox model. The robust approach presented here is only valid under the assumption of fixed predictors.

12 The estimator used in this reference is a simpler version of the ARE: the weighting scheme is based on g(t) = t, not the cumulative hazard function; see Section 7.3.1 for the definitions of the weights. However, the results are illustrative of what could be obtained with the ARE.
Its extension to time-varying covariates has not been attempted, even under simple circumstances, and seems a considerable challenge. Regarding the impact of censoring, no work has been carried out to assess the performance of the ARE in the presence of heavy censoring.
7.5.2 Performance of the Robust Wald Test

It is probably legitimate to wonder whether the robust Wald test defined in Section 7.3.5 provides some kind of protection against structural misspecifications. This question arises naturally: we know that the asymptotic variance (7.21) is literally a generalization of (7.12), and the LW formula is supposed to be better at dealing with that type of problem; see the discussion in Section 7.2.4 and the link between the two formulas in Section 7.3.2. An insight is given by Heritier and Galbraith (2008), who carried out simulations similar to those undertaken by Lin and Wei (1989) with the addition of the ARE as a genuine contender. We report here the results for covariate omission, a particularly relevant situation in RCTs as discussed earlier. The data were not, however, generated to mimic that situation as in Lin and Wei (1989). Survival times come from an exponential model with hazard λ(t | x) = exp(x_1^2), where x_1 follows a standard normal distribution. This is supposed to be an even worse scenario than simply ignoring the predictors in an RCT. The working model is a Cox model with two predictors x_1 and x_2, generated independently of each other with the same distribution. This model is misspecified, as x_1^2 has been omitted from the fitted model and x_2 is unnecessary. The primary objective is the performance of tests of H0: β_1 = 0 at the true model. The standard z-test (with model-based SE) cannot maintain its nominal level of 5% and instead exhibits an inflated type I error of around 13%. In contrast, the LW z-test has a type I error around 6–6.5%, while the ARE's is around 3.5–4.5%. These results hold for sample sizes of 50–100 and are consistent with those initially reported by Lin and Wei (1989). The ARE-based Wald test thus seems to perform well in that particular setting; if anything, the test seems to be conservative.
A similar performance to the LW approach is also observed by Heritier and Galbraith (2008) for the other designs studied by Lin and Wei (1989), including misspecified hazards, e.g. fitting the Cox model to data generated with a logarithmic type of hazard. These conclusions are seriously limited by the fact that we are only focusing on the test level. Nothing is said about the loss of power of such procedures compared with that of (robust) inferential procedures developed in a structurally correct model. We therefore strongly recommend sorting out structural problems before carrying out robust inference. Distributional robustness deals with small deviations from the assumed (core) model, and this statement is even more critical for inferential matters. This is clearly not the case if, for instance, the right scale for the data is multiplicative as opposed to additive (i.e. one of the scenarios considered here). Testing procedures in a Cox model fitted with the ARE should not be used if linearity on the log-hazard scale is clearly violated. The same kind of conclusion holds for violations of the proportional hazards assumption. This recommendation can only be waived if such departures are caused by a few abnormal cases, in which case the use of a robust Wald test can be beneficial. Finally, the LW approach is also
used when correlation (possibly due to clustering) is present in the data. It generally outperforms its model-based counterpart and maintains its level close to the nominal level. The properties of (7.21) in that setting have not been investigated.
7.5.3 Other Issues

Robust methods for survival data are still in the early stages of their development. As mentioned earlier, the presence of censoring creates a considerable challenge. In the uncensored case robust methods in fully parametric models are readily available. One could, for instance, use robust Gamma regression as described in Chapter 5. Specific methods have also been proposed for the (log-)Weibull or (log-)Gamma distributions by Marazzi (2002), Marazzi and Barbati (2003), Marazzi and Yohai (2004) and Bianco et al. (2005). Interesting applications to the modeling of length of stay in hospital or its cost are given as illustrations. The inclusion of covariates is considered in the last two references. Marazzi and Yohai (2004) can also deal with right truncation but, unfortunately, these methods are not yet general enough to accommodate random censoring. In addition, the theory developed in this chapter for the Cox model assumes non-informative censoring. Misspecifications of the censoring mechanism have recently received attention, at least in the classical case; see Kong and Slud (1997) and DiRienzo and Lagakos (2001, 2003). Whether modern robustness ideas can valuably contribute to that type of problem is still an open question. Robust model selection for censored data also remains a research question, with an attempt in that direction by Bednarski and Mocarska (2006) for the Cox model.
7.6 Censored Regression Quantiles

7.6.1 Regression Quantiles

In this section we introduce an approach that is a pure product of robust statistics in the sense that it does not have a classical counterpart. The seminal work dates back to Koenker and Bassett (1978), who were the first to propose modeling any pre-specified quantile of a response variable instead of modeling the conditional mean. By doing so they offered statisticians a unique way to explain the entire conditional distribution. As the quantiles themselves can be modeled as a linear function of covariates, they are called regression quantiles and the approach is termed quantile regression (QR). This technique was historically introduced in the econometric literature as a robust alternative to linear regression. Before presenting the extension to censored data, we present here the basic ideas underlying the QR approach. The basic idea is to estimate the conditional quantile of an outcome y given a vector of covariates x, defined as

    Q(y, x; τ) = inf{u : P(y ≤ u | x) = τ}                                (7.28)

for any pre-specified level 0 ≤ τ ≤ 1. We further assume that Q(y, x; τ) is a linear combination of the covariates, i.e.

    Q(y, x; τ) = xᵀβ(τ)                                                   (7.29)

with β(τ) the τth regression parameter. The rationale for (7.29) is that in many problems the way small or large quantiles depend on the covariates might be quite different from the median response. This is particularly true for the heteroscedastic data common in the econometric literature, where this approach rapidly gained popularity. On the other hand, the ability to detect structures at different quantiles is appealing irrespective of the context. The linear specification is the simplest functional form we can imagine and corresponds to the problem of finding regression quantiles in a linear, possibly heterogeneous, regression model. Of course the response function need not be linear, and f(x, β(τ)) is the obvious extension of the linear predictor in that case. For 0 ≤ τ ≤ 1 define the piecewise-linear function ρ(u; τ) = u(τ − ι(u < 0)), where ι(u < 0) is one when u < 0 and zero otherwise. Koenker and Bassett (1978) then showed that a consistent estimator of β(τ) is the value β̂(τ) that minimizes the objective function

    r(β(τ)) = Σᵢ₌₁ⁿ ρ(yᵢ − xᵢᵀβ(τ); τ),                                   (7.30)

for an i.i.d. sample (yᵢ, xᵢ). When τ = 1/2, ρ(u; τ) reduces to the absolute value up to a multiplicative factor 1/2. Thus, for the special case of the median this estimator is the so-called L₁-estimator, in reference to the absolute (or L₁) norm. For that reason, this approach is also referred to as L₁ regression quantiles. An introduction to this approach at a low level of technicality, with a telling example for a biostatistical audience, can be found in Koenker and Hallock (2001). In their pioneering work Koenker and Bassett (1978) provided an algorithm based on standard linear programming to compute β̂(τ) that was later refined by Koenker and D'Orey (1987). They also proved that this estimator is consistent and asymptotically normal under mild conditions. For instance, in the classical i.i.d. setting we have

    √n (β̂(τ) − β(τ)) → N(0, ω(τ) Σ⁻¹)                                    (7.31)

where ω(τ) = τ(1 − τ)/f²(F⁻¹(τ)), Σ = E[xxᵀ], and f and F are the density and cumulative distribution functions of the error term, respectively. Conditions on f include f(F⁻¹(τ)) > 0 in a neighborhood of τ. It should be stressed that the fact that the asymptotic distribution of β̂(τ) depends on the (unspecified) error distribution can create some difficulties in computing it. Indeed, the density needs to be estimated non-parametrically and the resulting estimates may suffer from a lack of stability. Inferential methods based on the bootstrap might then be preferred. We refer the reader interested in the technical aspects of this work to Koenker and Bassett (1982) for details, and to Koenker (2005) for a more comprehensive account discussing inferential aspects. The QR technique took two decades to make its way into survival data analysis, probably because of the lack of flexibility of QR to deal with censoring. A step in
the right direction was suggested by Koenker and Geling (2001). It is based on a simple idea: a transformation of the survival time yᵢ, e.g. the log-transformation, is used, providing a regression quantile approach to the accelerated failure time model. This is straightforward when all survival times are actually observed; see Koenker and Geling (2001) for an instructive example. However, this approach is insufficient for most applications in medical research, where censoring occurs.
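To make the check-function machinery of (7.30) concrete, here is a minimal sketch (in Python rather than R; the data and helper names are our own). It exploits the standard fact that the linear-programming solutions of Koenker and Bassett interpolate p data points (p = 2 for one covariate plus an intercept), so for small n a brute-force search over candidate lines minimizes the objective exactly:

```python
# Minimal quantile regression sketch for a single covariate: enumerate all
# candidate lines through pairs of observations and keep the one minimizing
# the check-function objective (7.30). Illustrative data only.

def check_loss(u, tau):
    """rho(u; tau) = u * (tau - 1{u < 0}), the piecewise-linear check function."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def quantile_regression(x, y, tau):
    """Return (intercept, slope) minimizing sum_i rho(y_i - b0 - b1*x_i; tau)."""
    best, best_obj = None, float("inf")
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] == x[j]:
                continue
            b1 = (y[j] - y[i]) / (x[j] - x[i])
            b0 = y[i] - b1 * x[i]
            obj = sum(check_loss(yk - b0 - b1 * xk, tau) for xk, yk in zip(x, y))
            if obj < best_obj:
                best, best_obj = (b0, b1), obj
    return best

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.1, 4.9, 7.2, 9.0, 10.8]          # roughly y = 1 + 2x
b0, b1 = quantile_regression(x, y, tau=0.5)  # median (L1) regression line
```

For τ = 0.5 this is exactly the L₁ fit mentioned above; varying τ traces out the whole conditional distribution. Production work would instead use the linear-programming algorithm of Koenker and D'Orey (1987), e.g. as implemented in the R package quantreg.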
7.6.2 Extension to the Censored Case

Early attempts to deal with censoring required overly strict assumptions, making their use relatively limited; see Powell (1986), Buchinsky and Hahn (1998), Honore et al. (2002) and Chernozhukov and Hong (2002), among others. The important breakthrough came with Portnoy (2003), who was able to accommodate general forms of censoring. He also made available a user-friendly R package called CRQ for censored regression quantiles (directly accessible on his website). We can then expect a rapid development of this innovative approach in biostatistics and medical research, where it could be used as a valuable complement to the Cox model. CRQ involves more technical aspects since it combines both the elements of regression quantiles and the modeling of censored survival times. The reader may decide to skip this section in the first instance and just accept the existence of the extension to the censored case. Let cᵢ, i = 1, ..., n, be the censoring times and yᵢ⁰ the possibly unobserved response (e.g. survival time tᵢ⁰) for the ith subject. We observe yᵢ = min(yᵢ⁰, cᵢ) (e.g. yᵢ = tᵢ the survival time) and δᵢ = ι(yᵢ⁰ ≤ cᵢ), the censoring indicator. We can even allow cᵢ to depend on xᵢ but require yᵢ⁰ and cᵢ to be independent conditionally on xᵢ. The model now stipulates that the conditional quantiles of yᵢ⁰ are a linear combination of the covariates but does not impose any particular functional form on those of yᵢ. Portnoy (2003) astutely noticed that QR is actually a generalization of the one-sample Kaplan–Meier approach. Two key ingredients combine here: (1) the Kaplan–Meier estimator (7.26) can be viewed as a 'recursively reweighted' empirical survival estimate; (2) a more technical argument linked to the regression quantiles computation, i.e. the weighted gradient used in the programming remains piecewise linear in τ. This simple remark permits the use of simplex pivoting techniques.
Point (1) follows from Efron (1967), who shows that the Kaplan–Meier estimator can be computed by redistributing the mass of each censored observation to subsequent non-censored observations. In other words, the mass P(yᵢ⁰ > cᵢ) can be redistributed to observations above cᵢ. This is done by exploiting a key point of QR, i.e. the estimator β̂(τ) depends on the sign of the residuals at any given point and not on the actual value of the response. The procedure for estimating β̂(τ) when there is censoring then works in the following way. It is easiest to start with a low quantile τ. We might not know the exact value of yᵢ⁰ but we do know that it is beyond the censoring time cᵢ. Then, when cᵢ lies above the τth regression line, so does yᵢ⁰. The true residual yᵢ⁰ − xᵢᵀβ̂(τ) will be positive irrespective of yᵢ⁰, and we can just use ordinary QR for such a small quantile value. Of course, as τ becomes larger, sooner or later a censored observation
will have a negative residual cᵢ − xᵢᵀβ̂(τ). We do not know for sure whether the true residual is positive or negative, but as the sign has changed we call such an observation crossed from now on. The level at which the observation is crossed is denoted τ̂ᵢ, thus

    cᵢ − xᵢᵀβ̂(τ̂ᵢ) ≥ 0   and   cᵢ − xᵢᵀβ̂(τ) ≤ 0   for all τ > τ̂ᵢ.

As explained by Portnoy (2003) and Debruyne et al. (2008), the critical idea is 'to estimate the probability of crossed censored observations having a positive, respectively negative residual and then use these estimates as weights further on'. This can be achieved by splitting such an observation into two weighted pseudo-observations, one at (cᵢ, xᵢ) with weight wᵢ(τ) ≈ P(yᵢ⁰ − xᵢᵀβ̂(τ) ≥ 0) and one at (+∞, xᵢ) with weight 1 − wᵢ(τ). The weight itself comes from quantile regression, as 1 − τ̂ᵢ is a rough estimate of the censoring probability P(yᵢ⁰ > cᵢ), i.e.

    wᵢ(τ) = (τ − τ̂ᵢ)/(1 − τ̂ᵢ)   for τ > τ̂ᵢ.
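The crossing weight and the pseudo-observation split can be written down directly; a tiny sketch (the function names are ours, not from the CRQ package):

```python
# Weight w_i(tau) = (tau - tau_hat_i) / (1 - tau_hat_i) for a censored
# observation crossed at level tau_hat_i, and the split of that observation
# into the two weighted pseudo-observations described above.

def crossing_weight(tau, tau_hat_i):
    """Estimated probability that the true residual is positive."""
    return (tau - tau_hat_i) / (1.0 - tau_hat_i)

def split_pseudo_observations(c_i, x_i, tau, tau_hat_i):
    """Return [((response, covariate), weight)] pairs: mass at c_i and at +infinity."""
    w = crossing_weight(tau, tau_hat_i)
    return [((c_i, x_i), w), ((float("inf"), x_i), 1.0 - w)]

pseudo = split_pseudo_observations(c_i=3.0, x_i=1.0, tau=0.5, tau_hat_i=0.2)
```

The two weights always sum to one, so no probability mass is created or lost by the split.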
Then we can proceed recursively to obtain the CRQ estimate; the exact algorithm, as detailed in Debruyne et al. (2008), is given in Appendix G. This process is technically equivalent to one minus the Kaplan–Meier estimate with the Efron recursive reweighting scheme; see the example given in Portnoy (2003, p. 1004) for details. Improvements to the computation of CRQ can also be found in Fitzenberger and Winker (2007) and may prove useful for large datasets.
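Point (1), the redistribution view of the Kaplan–Meier estimator, can be checked numerically on a toy sample (pure-Python sketch; the data and helper names are our own):

```python
# Efron (1967): the Kaplan-Meier masses can be recovered by redistributing
# the mass of each censored observation to the observations beyond it.

def km_jumps(times, events):
    """Kaplan-Meier probability mass at each uncensored time (product-limit form)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    surv, jumps, at_risk = 1.0, {}, len(times)
    for i in order:
        if events[i]:                     # death: survival drops by surv / at_risk
            jumps[times[i]] = surv / at_risk
            surv -= surv / at_risk
        at_risk -= 1
    return jumps

def redistribute_jumps(times, events):
    """Same masses via redistributing each censored mass to points on its right."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    mass = {i: 1.0 / len(times) for i in order}
    for pos, i in enumerate(order):
        if not events[i]:
            rest = order[pos + 1:]
            if not rest:                  # censored last: its mass lies beyond the data
                continue
            share = mass[i] / len(rest)
            for j in rest:
                mass[j] += share
            mass[i] = 0.0
    return {times[i]: m for i, m in mass.items() if events[i]}

times = [1.0, 2.0, 3.0, 4.0]
events = [True, False, True, True]        # the observation at 2.0 is censored
jumps_km = km_jumps(times, events)
jumps_rtr = redistribute_jumps(times, events)
```

On this sample both constructions put mass 1/4 at time 1 and 3/8 at each of times 3 and 4, the censored quarter at time 2 having been shared between its right neighbours.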
7.6.3 Asymptotic Properties and Robustness

Establishing asymptotic results for CRQ is a considerable task, as the weighting scheme sketched above must be taken into account. The most accurate result so far is that β̂(τ) converges to β(τ) at the rate n^(−1/2), as shown by Neocleous et al. (2006). The asymptotic normality with a closed form for the asymptotic variance is still a work in progress. The current way to compute standard errors or CIs is the bootstrap. This technique is computer intensive but stable, and provides an effective way to perform inference in the i.i.d. setting; it is also the default method in the R package CRQ provided by Portnoy. More generally, even if an asymptotic result were available, it would not necessarily lead to an accurate estimate. Indeed, as indicated earlier in (7.31), the asymptotic variance of the regression quantile estimates in the non-censored case depends on the underlying (unspecified) error distribution, and hence bootstrap methods can provide more reliable standard error estimates. This is certainly also true in the presence of censoring. Elaboration must be made on the exact implementation of the bootstrap for CRQ, as a few complications arise. First, when the survival distribution presents many censored observations in its right tail, it is virtually impossible to estimate the conditional quantile above the τ value corresponding to the last uncensored observation. When bootstrapping, the problem is even more serious as this cut-off is random. In one bootstrap sample the observed cut-off can be 0.9 whereas in another one it is about 0.7, due to the
presence of more censored observations from the right-hand tail. Thus, the simple percentile CI possibly fails. Portnoy (2003) introduced a hybrid approach, called the 2.906 IQR bootstrap, to cope with this problem: simply take the bootstrap estimate of the interquartile range (IQR) and use normality to obtain the relevant percentiles. Technically, this amounts to computing the bootstrap sample interquartile values β̂*₀.₇₅ − β̂*₀.₅₀ and β̂*₀.₅₀ − β̂*₀.₂₅, multiplying them by 2.906 for consistency, and adding them to (respectively subtracting them from) the median estimate β̂*₀.₅₀ to get the upper and lower bounds of the 95% CI for all β(τ). This approach seems to work reasonably well both in simulations and examples. Second, as the computational time can be prohibitive for large samples, discouraging users, a possible solution has been implemented in the R package CRQ. It is called the 'n-choose-m' bootstrap, whereby replicates of size m < n are chosen to compute the estimates and the CIs are then adjusted for the smaller sample size. Improvements on the CRQ implementation are work in progress and these limitations will certainly be relaxed in the near future. Regression quantiles inherit the robustness of ordinary sample quantiles, and thus present some form of robustness to distributional assumptions. As pointed out by Koenker and Hallock (2001), the estimates have 'an inherent distribution-free character because quantile estimation is influenced only by the local behavior of the conditional distribution near the specified quantile'. This is equally true for CRQ as long as only perturbations in the response are considered. However, both regression quantiles and CRQ break down in the presence of bad leverage points or problems in the covariates. Robust inference has not been specifically studied, but it is safe to say that the bootstrap-based approach probably works well for low levels of contamination and central values of τ (which is probably where most applied problems sit).
In contrast, extreme values of τ or a higher percentage of spurious data in the sample cause more trouble. Indeed, in that case the standard bootstrap approach breaks down, as more outliers can be generated in the bootstrap sample. This is even more critical when extreme τ are the target, as the breakdown point of β̂(τ) is then automatically lower.
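Portnoy's 2.906 IQR construction described above is easy to sketch in code (the helper names are ours; 2.906 ≈ 1.96/0.674, the normal 97.5% point divided by the normal upper-quartile deviation):

```python
# Bootstrap 95% CI via the 2.906 IQR rule: scale the bootstrap
# semi-interquartile ranges by 2.906 around the bootstrap median instead of
# reading off raw bootstrap percentiles.

def quantile(sorted_vals, q):
    """Linear-interpolation sample quantile of an already sorted list."""
    idx = q * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac

def iqr_ci(boot, factor=2.906):
    """(lower, upper) bounds built from bootstrap replicates of one coefficient."""
    s = sorted(boot)
    q25, q50, q75 = (quantile(s, q) for q in (0.25, 0.50, 0.75))
    return q50 - factor * (q50 - q25), q50 + factor * (q75 - q50)

lo, hi = iqr_ci([float(v) for v in range(1, 101)])  # toy bootstrap sample
```

Because the interval is anchored at the bootstrap median and uses only central quartiles, it is insensitive to the random right-tail cut-off that breaks the simple percentile CI.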
7.6.4 Comparison with the Cox Proportional Hazard Model

Straightforward computations based on the survival function and cumulative hazard given in Section 7.2 show that the conditional quantile of the survival time t given a particular covariate vector x is

    Q(t, x; τ) = Λ₀⁻¹[−log(1 − τ) exp(−xᵀβ)].                             (7.32)

Thus, the exponential form of the Cox model imposes a specific form on the conditional quantiles. More specifically, (7.32) shows that they are all monotone in log(1 − τ) and depend on Λ₀ in a complicated way. As the conditional quantiles are not linear in the covariates, the Cox model does not provide a direct analog of β̂(τ). However, Koenker and Geling (2001) and Portnoy (2003) suggested that a good proxy for β̂(τ) is the derivative of (7.32) evaluated at x̄, the average covariate vector, i.e.

    b(τ) = ∂Q(t, x; τ)/∂x |₍ₓ₌ₓ̄₎.                                         (7.33)

If we now plug the PLE for β into formula (7.33) we obtain b̂(τ), which we can then compare with the censored regression quantile estimate β̂(τ). It is worth noting that (7.33) implies that

    bⱼ(τ) = −[(1 − τ)^γ(x) log(1 − τ) γ(x) / S₀′[Q(t, x; τ)]] βⱼ,

where γ(x) = exp(−xᵀβ) and S₀′ denotes the derivative of the baseline survival function. So the effects of the various covariates as a function of τ are all identical up to a scaling factor depending on x. In particular, the quantile treatment effect for the Cox model must keep the same sign, determined by βⱼ, for all τ, precluding any form of effect that would allow crossings of the survival functions for different settings of covariates. This can be seen as a lack of flexibility of the Cox model imposed by the proportional hazard assumption.
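The no-crossing constraint implied by (7.32)–(7.33) can be verified numerically. The sketch below assumes a unit-exponential baseline, i.e. Λ₀(t) = t so that Λ₀⁻¹ is the identity (an assumption made purely for the sketch), and a single covariate:

```python
# Conditional survival quantiles under a Cox model with unit-exponential
# baseline: Q(t, x; tau) = -log(1 - tau) * exp(-x * beta). The numerical
# derivative in x keeps the same sign at every tau, illustrating both the
# no-crossing property and the sign reversal between hazard-scale and
# survival-time-scale effects.
import math

def cox_quantile(tau, x, beta):
    return -math.log(1.0 - tau) * math.exp(-x * beta)

def b(tau, x_bar, beta, h=1e-6):
    """Central-difference version of (7.33) at the average covariate x_bar."""
    return (cox_quantile(tau, x_bar + h, beta) - cox_quantile(tau, x_bar - h, beta)) / (2.0 * h)

derivs = [b(tau, x_bar=0.5, beta=1.0) for tau in (0.1, 0.25, 0.5, 0.75, 0.9)]
```

For β > 0 (a harmful covariate on the hazard scale) every derivative is negative: higher hazard shortens every survival quantile, and the sign can never flip across τ.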
7.6.5 Lung Cancer Data Example (continued)

Figure 7.6 displays a concise summary of the results of a censored quantile regression analysis of log(time), i.e. an accelerated failure time model, on the lung cancer data. The model includes eight estimated coefficients, but ptherapy and age were omitted from the figure as the same flat, non-significant pattern appears for all values of τ and both methods. The shaded area represents the 95% pointwise band for each CRQ coefficient obtained by bootstrapping. The dashed line represents the analog of β̂(τ) for the Cox model given by (7.33). The Karnofsky performance status (karnofsky) is a standard score of 0–100 assessing the functional ability of a patient to perform tasks; 0 represents death and 100 a normal ability with no complaints. Its effect, depicted in the first panel, is highly significant at all levels and for both the Cox and CRQ models. Around median values, e.g. τ = 0.50, the CRQ estimate is roughly 0.04, which translates into a multiplicative effect of exp(0.04 × 10) = 1.49 on median survival for a 10-point increase on that scale (holding all other factors constant). The effect looks somewhat higher for smaller quantiles and weaker for larger values of τ, a decreasing trend that is not detected by the Cox model. dduration and treatment have little impact on the outcome for all values of τ, strengthening the previous findings that these predictors are not important in these data. cell is a more interesting predictor. No clear effect of squamous versus large cells appears, although it seems that in the tails things could be different, with possibly a crossover. With the 95% CI also being larger towards the ends, we do not pursue this interpretation. The situation is much neater for small cells, where a significant constant effect appears at all levels except perhaps for larger values, τ ≥ 0.80 say.
An estimate of −0.70 is obtained for τ = 0.50; this means that the presence of small cells reduces the median survival by 1 − exp(−0.70) ≈ 50% in comparison with large cells. In contrast, the QR estimate (7.33) for the Cox model represented by the
Figure 7.6 The CRQ coefficients β̂(τ), with shaded 95% bands, for the lung cancer data, plotted against τ (panels: karnofsky, dduration, squamous, small, adeno and treatment). The Cox coefficient effect (7.33) is represented by the dashed line.
dashed line in the same panel is higher, more variable and of uncertain significance, probably for the same reasons mentioned earlier. Adeno cells seem to act similarly to small cells on survival, although their effect looks clearer towards the upper end of the distribution. Finally, we would like to mention some robustness concerns. As the CRQ approach is based on quantiles, it is robust to outliers in the response, or vertical outliers, as indicated earlier. It is therefore not influenced by the two long-term survivors (cases 17 and 44). This explains why the robust analysis of Section 7.4 is more in line with the current findings, especially on the role of the cell type. For the sake of completeness we also give the CRQ fit at τ = 0.50 in Table 7.6. It can be seen as a snapshot of Figure 7.6 at a particular level, here the median. The 95% CIs provided in this table are based on the bootstrap with B = 1000 replicates. The p-values correspond to the z-statistic obtained by studentizing by the bootstrap IQR, as directly implemented in the R package developed by Portnoy. It is worth noting that the coefficients are similar to those given in the robust analysis of Section 7.4 up to the minus sign. The systematic reversal of the signs of significant predictors is generally observed. This is due to the fact that CRQ explains a specific quantile of the logarithm of time, whereas in a Cox model the classical interpretation with hazard ratios relates more to survival. It is actually possible to obtain similar tables for other values of τ, but the graphical summary is usually more informative unless an investigator is interested in one particular quantile of the
Table 7.6 Estimates, 95% CIs and p-values for significance testing for the Veteran's Administration lung cancer data.

    Variable      Estimate    95% CI             p-value
    Intercept       2.297     (0.45; 4.12)        0.01
    karnofsky       0.036     (0.02; 0.05)        0.00
    dduration       0.005     (−0.02; 0.06)       0.80
    age             0.003     (−0.02; 0.03)       0.83
    ptherapy       −0.010     (−0.07; 0.04)       0.71
    cell
      Squamous     −0.117     (−0.81; 0.78)       0.77
      Small        −0.685     (−1.28; −0.05)      0.03
      Adeno        −0.751     (−1.33; −0.06)      0.02
    treatment       0.018     (−0.54; 0.44)       0.94

The regression coefficients are estimated by means of the CRQ at τ = 0.50.
distribution. To conclude, it is useful to note that, although the differences between the quantile method and the Cox model may not be considered important in this example, this is not always the case. As pointed out by Portnoy (2003), CRQ generally provides new insight into the data, with the discovery of substantial differences when a greater signal-to-noise ratio exists in the data.
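Because CRQ here models log(time), a coefficient from Table 7.6 converts into a multiplicative effect on median survival via exp(β·Δ) for a Δ-unit covariate change; a two-line check of the effects quoted above (the helper name is ours):

```python
# Converting CRQ coefficients on log(time) into multiplicative effects on
# median survival, using the tau = 0.50 estimates from Table 7.6.
import math

def median_effect(coef, delta=1.0):
    """Multiplicative change in median survival for a delta-unit covariate change."""
    return math.exp(coef * delta)

karnofsky_10pt = median_effect(0.036, delta=10)  # 10-point Karnofsky increase
small_vs_large = median_effect(-0.685)           # small vs. large cell type
```

This reproduces the roughly 1.4–1.5-fold increase for karnofsky and the roughly 50% reduction in median survival for small cells discussed above.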
7.6.6 Limitations and Extensions

Despite its uniqueness and originality, combined with both good local robustness properties and direct interpretability, CRQ has a few limitations. Unlike the proportional hazard model, it cannot be extended to time-varying predictors, as the whole algorithm is based on fixed x. This limitation should be played down, as many of the time-dependent covariates used in the extended Cox model are actually introduced when the proportional hazard assumption itself is violated. As the proportional hazard assumption is no longer needed in QR, the problem largely disappears. From a robustness perspective, CRQ is resistant to vertical outliers, i.e. abnormal responses in time, but not to leverage points. Recent work by Debruyne et al. (2008) shows that this difficulty can be overcome by introducing censored depth quantiles. More research is needed to study their asymptotic properties and compare them with CRQ. More importantly, some more work is needed to sort out inferential issues, even though the bootstrap approach described above offers a workable solution. Recently, Peng and Huang (2008) introduced a new approach for censored QR based on the Nelson–Aalen estimator of the cumulative hazard function. An implementation of this technique is provided in the R package quantreg; see Koenker (2008). This work is promising, as Peng and Huang's estimator admits a martingale representation providing a natural route to an asymptotic theory. A key assumption of all of these techniques, however, is that Q(t, x; τ) depends linearly on the regression
parameter β. This condition can be relaxed in partially linear models, as investigated by Neocleous and Portnoy (2006). This could constitute a valuable alternative for intrinsically non-linear data. Irrespective of the method, we would like to stress the potential of QR in biostatistics, as it constitutes an original complement to the Cox model. It has the advantage of being naturally interpretable and does not assume any form of proportionality of the hazard function. Results obtained by CRQ can sometimes contradict those derived from the Cox model. This should not be seen as a deficiency but rather as a major strength: CRQ can often capture new structures that were hidden behind the proportional hazard assumption. In general, its greater flexibility suggests that the corresponding results are more reliable, but we encourage users to carry out additional work to better understand how such differences can be explained.
Appendices
A
Starting Estimators for MM-estimators of Regression Parameters

For the starting point β̂₀, one can choose an estimator among the class of S-estimators as proposed by Rousseeuw and Yohai (1984) (see also Section 2.3.3). A popular choice for the corresponding ρ-function is the biweight function (2.20), hence leading to the solution β̂₀ for β and σ̂²₍bi₎ for σ², which minimize σ² subject to

    (1/n) Σᵢ₌₁ⁿ ρ₍bi₎(rᵢ; β, σ², c) − E_Φ[ρ₍bi₎(r; β, σ², c)] = 0          (A.1)

where the expected value ensures Fisher consistency of the resulting estimator. The breakdown point of this S-estimator can be chosen through the value of c that satisfies, for ρ₍bi₎, the condition E_Φ[ρ₍bi₎(r; β, σ², c)] = ε* ρ₍bi₎(c; β, σ², c), where ε* is the desired breakdown point (see Rousseeuw and Yohai, 1984). When ε* = 0.5 (the maximal value), then c = 1.547 (see Rousseeuw and Leroy, 1987, p. 136). However, its efficiency, i.e. the ratio between the traces of the asymptotic variances of the LS estimator and the S-estimator, respectively, under the exact regression model, is equal to 0.287 (see Yohai et al., 1991); hence it is roughly 3.5 times more variable than the LS. The solution can be found by a random resampling algorithm followed by a local search (see Yohai et al., 1991), by a genetic algorithm in place of the resampling algorithm, by an exhaustive form of sampling algorithm for small problems (see Marazzi (1993) for details on the numerical algorithms), or by a faster algorithm for large problems (see Pena and Yohai, 1999). Computational speed is still an issue for computing β̂₀ in general.

Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser. © 2009 John Wiley & Sons, Ltd.

When some of the explanatory variables are actually categorical (i.e. factors), as is the case
with the diabetes data (variables bfmed, bflar and loc), Maronna and Yohai (2000) propose splitting the estimation procedure into an M-estimation part for the categorical variables and an S-estimation part for the other variables, resulting in what they call an MS-estimator. Basically, consider the following regression model

    yᵢ = xᵢ₍₁₎ᵀ β₍₁₎ + xᵢ₍₂₎ᵀ β₍₂₎ + εᵢ,   i = 1, ..., n,

where the xᵢ₍₁₎ are 0–1 vectors (i.e. dummy variables) of dimension q₁ and the xᵢ₍₂₎ are real-valued vectors of dimension q₂. An estimator for β₍₁₎ is defined conditionally on a value for β₍₂₎, i.e. the solution β̇₍₁₎(β₍₂₎) in β₍₁₎ of

    Σᵢ₌₁ⁿ Ψ(r̃ᵢ, xᵢ₍₁₎) = 0,                                               (A.2)

with r̃ = (ỹ − x₍₁₎ᵀ β₍₁₎)/σ and ỹ = y − x₍₂₎ᵀ β₍₂₎. As an estimator for β₍₂₎ one uses, e.g., the S-estimator (A.1), in which rᵢ = yᵢ − xᵢ₍₁₎ᵀ β̇₍₁₎(β₍₂₎) − xᵢ₍₂₎ᵀ β₍₂₎. For a discussion of the choice of the Ψ-function in (A.2) and simplified numerical procedures, see Maronna and Yohai (2000). One can also choose different ρ-functions and/or other objective functions to define high breakdown point estimators for the starting point. Indeed, one can cite the least median of squares estimator (LMS) and the least trimmed squares estimator (LTS), both from Rousseeuw (1984), and the least absolute deviations estimator (LAD) of Edgeworth (1887) (see also Bloomfield and Steiger, 1983), also known as L₁-regression. They can be seen in their definition as natural adaptations of the LS estimator, or as particular cases of S-estimators. Indeed, the LS estimator (for a given σ²) is defined as the solution of

    min_β (1/n) Σᵢ₌₁ⁿ rᵢ²,                                                (A.3)

i.e. the minimization of a scale estimate of the residuals, in a similar manner as for S-estimators (the square of the residuals is generalized to a function ρ). Replacing the mean by the median leads to the LMS, using a trimmed mean leads to the LTS, and taking the absolute value instead of the square in (A.3) leads to the LAD. All of these estimators require a robust estimator for the scale σ and special algorithms to compute them. They have progressively been abandoned in favor of β̂₀ (and σ̂²₍bi₎).
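The four objective functions just discussed differ only in how the (squared or absolute) residuals are aggregated; a side-by-side sketch for a fixed candidate β (toy data and function names are ours):

```python
# LS, LMS, LTS and LAD objectives for a simple-regression candidate
# beta = (b0, b1): mean of squared residuals, median of squared residuals,
# trimmed mean of squared residuals, and mean absolute residual.

def residuals(beta, xs, ys):
    b0, b1 = beta
    return [y - b0 - b1 * x for x, y in zip(xs, ys)]

def ls_objective(beta, xs, ys):
    r2 = [r * r for r in residuals(beta, xs, ys)]
    return sum(r2) / len(r2)

def lms_objective(beta, xs, ys):
    r2 = sorted(r * r for r in residuals(beta, xs, ys))
    n = len(r2)
    return r2[n // 2] if n % 2 else 0.5 * (r2[n // 2 - 1] + r2[n // 2])

def lts_objective(beta, xs, ys, keep=0.5):
    r2 = sorted(r * r for r in residuals(beta, xs, ys))
    h = max(1, int(len(r2) * keep))
    return sum(r2[:h]) / h

def lad_objective(beta, xs, ys):
    r = residuals(beta, xs, ys)
    return sum(abs(v) for v in r) / len(r)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
ls, lms, lts, lad = (f((0.0, 1.0), xs, ys)
                     for f in (ls_objective, lms_objective, lts_objective, lad_objective))
```

With the true line β = (0, 1), the single outlier inflates the LS objective but leaves the LMS and LTS objectives at zero, which is precisely why these criteria yield high breakdown point starting values.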
B

Efficiency, LRTρ, RAIC and RCp with Biweight ρ-function for the Regression Model

To develop the efficiency (3.20) and other quantities for the LRTρ, the RAIC and the RCp with the biweight estimator with ρ-function (3.15), we make use of

    E[rᵏ] = k! / (2^(k/2) (k/2)!)   (k even)

to compute the moments of a N(0, 1) distribution, and of

    ∫₋∞ᶜ rᵏ dΦ(r) = Lₖ = −c^(k−1) φ(c) + (k − 1) L₍ₖ₋₂₎,

with L₀ = Φ(c) and L₁ = −φ(c). We need (even) moments up to order 14, i.e.

    L₂ = −cφ(c) + Φ(c)
    L₄ = −(c³ + 3c)φ(c) + 3Φ(c)
    L₆ = −(c⁵ + 5c³ + 15c)φ(c) + 15Φ(c)
    L₈ = −(c⁷ + 7c⁵ + 35c³ + 105c)φ(c) + 105Φ(c)
    L₁₀ = −(c⁹ + 9c⁷ + 63c⁵ + 315c³ + 945c)φ(c) + 945Φ(c)
    L₁₂ = −(c¹¹ + 11c⁹ + 99c⁷ + 693c⁵ + 3465c³ + 10395c)φ(c) + 10395Φ(c)
    L₁₄ = −(c¹³ + 13c¹¹ + 143c⁹ + 1287c⁷ + 9009c⁵ + 45045c³ + 135135c)φ(c) + 135135Φ(c)
and, therefore, writing Mₖ = ∫₋ᶜᶜ rᵏ dΦ(r) for the truncated central moments,

    M₀ = 1 − 2Φ(−c)
    M₂ = 1 − 2cφ(c) − 2Φ(−c)
    M₄ = 3 − 2(c³ + 3c)φ(c) − 6Φ(−c)
    M₆ = 15 − 2(c⁵ + 5c³ + 15c)φ(c) − 30Φ(−c)
    M₈ = 105 − 2(c⁷ + 7c⁵ + 35c³ + 105c)φ(c) − 210Φ(−c)
    M₁₀ = 945 − 2(c⁹ + 9c⁷ + 63c⁵ + 315c³ + 945c)φ(c) − 1890Φ(−c)
    M₁₂ = 10395 − 2(c¹¹ + 11c⁹ + 99c⁷ + 693c⁵ + 3465c³ + 10395c)φ(c) − 20790Φ(−c)
    M₁₄ = 135135 − 2(c¹³ + 13c¹¹ + 143c⁹ + 1287c⁷ + 9009c⁵ + 45045c³ + 135135c)φ(c) − 270270Φ(−c).

For the efficiency (3.20), we have

    e_c = (5M₄/c⁴ − 6M₂/c² + M₀)² / (M₁₀/c⁸ − 4M₈/c⁶ + 6M₆/c⁴ − 4M₄/c² + M₂).
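The efficiency formula can be evaluated numerically; the sketch below approximates the truncated moments by Simpson's rule and recovers two reference values: e_c ≈ 0.95 at the usual tuning constant c = 4.685, and e_c ≈ 0.287 at c = 1.547, the value quoted in Appendix A:

```python
# Numerical check of the efficiency e_c: approximate M_k, the integral of
# r^k dPhi(r) over (-c, c), by Simpson's rule and plug into the formula above.
import math

def phi(r):
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)

def M(k, c, steps=2000):
    """Simpson approximation of the truncated moment of N(0, 1) over (-c, c)."""
    h = 2.0 * c / steps
    total = 0.0
    for i in range(steps + 1):
        r = -c + i * h
        w = 1.0 if i in (0, steps) else (4.0 if i % 2 == 1 else 2.0)
        total += w * r ** k * phi(r)
    return total * h / 3.0

def biweight_efficiency(c):
    m0, m2, m4, m6, m8, m10 = (M(k, c) for k in (0, 2, 4, 6, 8, 10))
    num = (5.0 * m4 / c**4 - 6.0 * m2 / c**2 + m0) ** 2
    den = m10 / c**8 - 4.0 * m8 / c**6 + 6.0 * m6 / c**4 - 4.0 * m4 / c**2 + m2
    return num / den
```

The numerator is the square of E[ψ′] and the denominator is E[ψ²] for the biweight ψ, so this is the standard M-estimation efficiency at the normal model.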
For the LRTρ, and using the ρ-function given in (3.15), we have that (3.26) reduces to

    (5M₄/c⁴ − 6M₂/c² + M₀) / (M₁₀/c⁸ − 4M₈/c⁶ + 6M₆/c⁴ − 4M₄/c² + M₂),

where Mₖ = ∫₋ᶜᶜ rᵏ dΦ(r).
For the RAIC given in (3.31), and using the ρ-function given in (3.15), we have

    a = M₁₀/c⁸ − 4M₈/c⁶ + 6M₆/c⁴ − 4M₄/c² + M₂,
    b = 5M₄/c⁴ − 6M₂/c² + M₀,

where Mₖ = ∫₋ᶜᶜ rᵏ dΦ(r). For the RCp, Ronchetti and Staudte (1994) have shown that

    Up − Vp = n ∫ ρ′(r)² dΦ(r)
              − 2p [∫ ρ′(r)² ρ′′(r) dΦ(r)] [∫ ρ′′(r) dΦ(r)]⁻¹
              + p [∫ ρ′(r)² dΦ(r)] [∫ (ρ′′(r)² + (2/r) ρ′(r) ρ′′(r) − (3/r²) ρ′(r)²) dΦ(r)] [∫ ρ′′(r) dΦ(r)]⁻²

and

    Vp = p [∫ ρ′(r)² dΦ(r)] [∫ (1/r²) ρ′(r)² dΦ(r)] [∫ ρ′′(r) dΦ(r)]⁻².
For the biweight ρ-function (3.15), with Mₖ = ∫₋ᶜᶜ rᵏ dΦ(r), these reduce to

    Up − Vp = n (M₁₀/c⁸ − 4M₈/c⁶ + 6M₆/c⁴ − 4M₄/c² + M₂)
              − 2p (5M₁₄/c¹² − 26M₁₂/c¹⁰ + 55M₁₀/c⁸ − 60M₈/c⁶ + 35M₆/c⁴ − 10M₄/c² + M₂)
                × (5M₄/c⁴ − 6M₂/c² + M₀)⁻¹
              + p (32M₈/c⁸ − 80M₆/c⁶ + 64M₄/c⁴ − 16M₂/c²)
                × (M₁₀/c⁸ − 4M₈/c⁶ + 6M₆/c⁴ − 4M₄/c² + M₂)
                × (5M₄/c⁴ − 6M₂/c² + M₀)⁻²

and

    Vp = p (M₈/c⁸ − 4M₆/c⁶ + 6M₄/c⁴ − 4M₂/c² + M₀)
           × (M₁₀/c⁸ − 4M₈/c⁶ + 6M₆/c⁴ − 4M₄/c² + M₂)
           × (5M₄/c⁴ − 6M₂/c² + M₀)⁻².
C
An Algorithm Procedure for the Constrained S-estimator

The following is pseudo-code for the algorithm that computes the constrained S-estimator.

• Given a model, define the design matrices $z_j z_j^T$ to obtain the structure of the covariance matrix, and the matrices $x_i$ that define the mean vectors $x_i\beta$, so that
$$\Sigma = \sum_{j=0}^{r}\sigma_j^2\, z_j z_j^T.$$

• Compute the starting point of the constrained estimator, that is, $x_i\beta_{start}$ and $\Sigma_{start}$. In principle one can choose any high breakdown point estimator as starting point. It can be made 'constrained' to match the MLM by averaging out the elements of the estimated covariance matrix that are equal under the MLM. We use the MCD estimator (see Section 2.3.3).

• Compute the constrained estimator through the following iterative procedure:

1. Compute the Mahalanobis distances
$$d_i^{(1)} = \sqrt{(y_i - x_i\beta_{start})^T\,\Sigma_{start}^{-1}\,(y_i - x_i\beta_{start})}.$$
2. Compute the weights $w(d_i^{(1)})$.
3. Compute the fixed effects parameters $\beta^{(1)}$ by solving
$$\sum_i w(d_i^{(1)})\, x_i^T\,\Sigma_{start}^{-1}\,(y_i - x_i\beta_{start}) = 0.$$

Robust Methods in Biostatistics, S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser © 2009 John Wiley & Sons, Ltd
4. Let $\alpha = (\sigma_0^2, \ldots, \sigma_r^2)^T$. An iterative expression for the variance components $\alpha^{(1)}$ is given by
$$\alpha^{(1)} = \left(\frac{1}{n}\sum_{i=1}^{n} w(d_i^{(1)})(d_i^{(1)})^2\right)^{-1} Q^{-1}\,U,$$
with $U$ the vector with elements
$$U_j = \frac{1}{n}\sum_i p\,w(d_i^{(1)})(y_i - x_i\beta_{start})^T\,\Sigma_{start}^{-1}\, z_j z_j^T\,\Sigma_{start}^{-1}\,(y_i - x_i\beta_{start}), \quad j = 0, \ldots, r,$$
and $Q = \left(\mathrm{tr}(M_j M_k)\right)_{j,k=0,\ldots,r}$ with
$$M_j = \Sigma_{start}^{-1}\, z_j z_j^T.$$
5. Using the design matrices $z_j z_j^T$, update the constrained covariance matrix by
$$\Sigma^{(1)} = \sum_{j=0}^{r}\sigma_j^{2(1)}\, z_j z_j^T.$$
6. Update the fixed effects $x_i\beta^{(1)}$.
7. Check a convergence criterion. If the conditions of the criterion are met, stop; otherwise set $\beta_{start} = \beta^{(1)}$ and $\Sigma_{start} = \Sigma^{(1)}$ and start again at step 1 by computing $d_i^{(2)}$, the weights $w(d_i^{(2)})$, then $\beta^{(2)}$ and $\Sigma^{(2)}$. Repeat the procedure until convergence.
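One pass of steps 1–4 can be sketched in NumPy as follows. This is a schematic translation of the pseudo-code above, not the authors' implementation; the weight function `w`, the response dimension `p` and the starting values are supplied by the caller, and step 5 (rebuilding $\Sigma$ from the returned components) is left to the outer loop:

```python
import numpy as np

def constrained_s_step(y, X, Z, beta, sigma2, w, p):
    """One iteration of the constrained S procedure.
    y: list of response vectors y_i; X: list of design matrices x_i;
    Z: list of matrices z_j z_j^T; sigma2: current variance components."""
    Sigma = sum(s * Zj for s, Zj in zip(sigma2, Z))
    Sinv = np.linalg.inv(Sigma)
    res = [yi - Xi @ beta for yi, Xi in zip(y, X)]
    d = np.array([np.sqrt(e @ Sinv @ e) for e in res])        # step 1
    wd = np.asarray(w(d))                                     # step 2
    # step 3: solve sum_i w_i x_i^T Sigma^{-1} (y_i - x_i beta) = 0
    A = sum(wi * Xi.T @ Sinv @ Xi for wi, Xi in zip(wd, X))
    rhs = sum(wi * Xi.T @ Sinv @ yi for wi, Xi, yi in zip(wd, X, y))
    beta_new = np.linalg.solve(A, rhs)
    # step 4: variance components alpha = (sigma_0^2, ..., sigma_r^2)
    U = np.array([np.mean([p * wi * e @ Sinv @ Zj @ Sinv @ e
                           for wi, e in zip(wd, res)]) for Zj in Z])
    M = [Sinv @ Zj for Zj in Z]
    r = len(Z)
    Q = np.array([[np.trace(M[j] @ M[k]) for k in range(r)]
                  for j in range(r)])
    alpha_new = np.linalg.solve(Q, U) / np.mean(wd * d ** 2)
    return beta_new, alpha_new
```

With $w \equiv 1$ and $\Sigma = I$ the fixed-effects step reduces to ordinary least squares on the stacked data, which provides a quick sanity check of the update.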
D
Some Distributions of the Exponential Family

We give here the definitions of some of the distributions belonging to the exponential family, as listed in Table 5.1.

• Normal. The density function of a variable distributed as $y_i \sim N(\mu_i, \sigma^2)$ is
$$f(y; \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(y - \mu_i)^2\right),$$
for $y$ in $\mathbb{R}$.

• Bernoulli. A Bernoulli distributed variable $y_i$ can take the values $y = 0$ or $y = 1$ according to
$$P(y_i = y; p_i) = p_i^y (1 - p_i)^{1-y}.$$

• Scaled binomial. The scaled binomial distributed variables $y_i/m$ take values $0, 1/m, 2/m, \ldots, 1$ and are derived from the binomial variables $y_i$ with probabilities
$$P(y_i = y; p_i) = \binom{m}{y} p_i^y (1 - p_i)^{m-y},$$
for $y = 0, 1, \ldots, m$.

• Poisson. For a Poisson variable $y_i \sim \mathcal{P}(\lambda_i)$, probabilities are computed according to
$$P(y_i = y; \lambda_i) = \exp(-\lambda_i)\frac{\lambda_i^y}{y!},$$
for $y = 0, 1, 2, \ldots.$
• Gamma. Here $y_i$ is said to be $\Gamma(\mu_i, \nu)$ distributed if its density is
$$f(y; \mu_i, \nu) = \frac{(\nu/\mu_i)\exp(-\nu y/\mu_i)(\nu y/\mu_i)^{\nu-1}}{\Gamma(\nu)},$$
for $y > 0$, with $\Gamma(\nu) = \int_0^{\infty}\exp(-u)\,u^{\nu-1}\,du$.
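The $\Gamma(\mu_i, \nu)$ form above has mean $\mu_i$ and shape $\nu$; it corresponds to the usual shape–scale parameterization with shape $\nu$ and scale $\mu_i/\nu$. A quick check against SciPy (the mapping is our illustration, not part of the text):

```python
import math
from scipy.stats import gamma

def gamma_pdf(y, mu, nu):
    """Density of the Gamma(mu, nu) form used above: mean mu, shape nu."""
    u = nu * y / mu
    return (nu / mu) * math.exp(-u) * u ** (nu - 1) / math.gamma(nu)

# identical to scipy's gamma with shape a = nu and scale mu/nu
y, mu, nu = 2.0, 3.0, 1.5
assert abs(gamma_pdf(y, mu, nu) - gamma.pdf(y, a=nu, scale=mu / nu)) < 1e-12
```

The dispersion parameter of Table 5.1, $\phi = 1/\nu$, then controls the squared coefficient of variation of $y_i$.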
E
Computations for the Robust GLM Estimator

E.1 Fisher Consistency Corrections

We give here the Fisher consistency corrections
$$a(\beta) = \frac{1}{n}\sum_{i=1}^{n} E[\psi(r_i; \beta, \phi, c)]\,w(x_i)\frac{1}{\sqrt{\phi v_{\mu_i}}}\,\mu_i',$$
for the binomial, Poisson and Gamma models. Note that for the binomial and Poisson models $\phi = 1$, and for the Gamma model $\phi = 1/\nu$; see Table 5.1. The only term to be computed for each model is $E[\psi(r_i; \beta, \phi, c)]$, which is done below for $\psi(r_i; \beta, \phi, c) = \psi_{[Hub]}(r_i; \beta, \phi, c)$; see Section 3.6.

Let us first define $j_1 = \lfloor\mu_i - c\sqrt{\phi v_{\mu_i}}\rfloor$ and $j_2 = \lfloor\mu_i + c\sqrt{\phi v_{\mu_i}}\rfloor$, where $\lfloor u\rfloor$ denotes the largest integer not greater than $u$.

The binomial model states that $y_i \sim B(m_i, p_i)$, so that $E[y_i] = \mu_i = m_i p_i$ and $\mathrm{var}(y_i) = \mu_i(m_i - \mu_i)/m_i$. Then we have
$$E[\psi_{[Hub]}(r_i; \beta, \phi, c)] = \sum_{j=-\infty}^{\infty}\psi_{[Hub]}\left(\frac{j - \mu_i}{\sqrt{v_{\mu_i}}}; \beta, \phi, c\right)P(y_i = j)\,\iota(j \in [0, m_i])$$
$$= c[P(y_i \geq j_2 + 1) - P(y_i \leq j_1)] + \frac{\mu_i}{\sqrt{v_{\mu_i}}}[P(j_1 \leq \tilde{y}_i \leq j_2 - 1) - P(j_1 + 1 \leq y_i \leq j_2)],$$
with $\tilde{y}_i \sim B(m_i - 1, p_i)$, and where $\iota(C)$ is the indicator function that takes the value one if $C$ is true and zero otherwise.
The Poisson model states that $y_i \sim \mathcal{P}(\mu_i)$ and, hence, $E[y_i] = V(\mu_i) = \mu_i$. Then,
$$E[\psi_{[Hub]}(r_i; \beta, \phi, c)] = \sum_{j=-\infty}^{\infty}\psi_{[Hub]}\left(\frac{j - \mu_i}{\sqrt{v_{\mu_i}}}; \beta, \phi, c\right)P(y_i = j)\,\iota(j \geq 0)$$
$$= c(P(y_i \geq j_2 + 1) - P(y_i \leq j_1)) + \frac{\mu_i}{\sqrt{v_{\mu_i}}}[P(y_i = j_1) - P(y_i = j_2)].$$

Finally, for the Gamma model, one remarks in the first place that $r_i = (y_i - \mu_i)/\sqrt{\phi v_{\mu_i}}$ has a Gamma distribution (independent of $\mu_i$) with expectation equal to $\sqrt{\nu}$ and origin shifted to $-\sqrt{\nu}$. It holds that
$$E[\psi_{[Hub]}(r_i; \beta, \phi, c)] = \int_{-\sqrt{\nu}}^{\infty}\psi_{[Hub]}(r; \beta, \phi, c)\,f(r; \sqrt{\nu}, \nu)\,\iota(r > -\sqrt{\nu})\,dr$$
$$= c[P(r_i > c) - P(r_i < -c)] + \frac{\nu^{(\nu-1)/2}}{\Gamma(\nu)}[G(-c, \nu) - G(c, \nu)],$$
where $f(r; \sqrt{\nu}, \nu)$ is the Gamma density (see Appendix D) and
$$G(t, \kappa) = \exp(-\sqrt{\nu}(\sqrt{\nu} + t))(\sqrt{\nu} + t)^{\kappa}\,\iota(t > -\sqrt{\nu}).$$
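The Poisson correction can be verified by comparing the closed form with a direct evaluation of the defining sum. The snippet below is a verification sketch of ours, taking $\psi_{[Hub]}$ to be the Huber function (implemented as `np.clip`):

```python
import numpy as np
from scipy.stats import poisson

def epsi_poisson(mu, c):
    """Closed-form E[psi_Hub((y - mu)/sqrt(mu))] for y ~ Poisson(mu)."""
    j1 = np.floor(mu - c * np.sqrt(mu))
    j2 = np.floor(mu + c * np.sqrt(mu))
    return (c * (poisson.sf(j2, mu) - poisson.cdf(j1, mu))
            + np.sqrt(mu) * (poisson.pmf(j1, mu) - poisson.pmf(j2, mu)))

def epsi_direct(mu, c, jmax=200):
    """Brute-force evaluation of the sum over j >= 0."""
    j = np.arange(0, jmax)
    r = (j - mu) / np.sqrt(mu)
    return np.sum(np.clip(r, -c, c) * poisson.pmf(j, mu))

for mu in (0.8, 3.0, 10.0):
    assert abs(epsi_poisson(mu, 1.345) - epsi_direct(mu, 1.345)) < 1e-10
```

The agreement over a range of $\mu$ values confirms the truncation identity used in the closed form.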
E.2 Asymptotic Variance

Computing the asymptotic variance amounts to computing the matrices $A$ and $B$ of Section 5.3.4, and therefore $E[\psi^2(r_i; \beta, \phi, c)]$ and $E[\psi(r_i; \beta, \phi, c)(\partial/\partial\mu_i)\log h(y_i\,|\,x_i, \mu_i)]$, again for $\psi(r_i; \beta, \phi, c) = \psi_{[Hub]}(r_i; \beta, \phi, c)$, where $h(y_i\,|\,x_i, \mu_i)$ is the conditional density or probability of $y_i\,|\,x_i$. For the binomial model
$$E[\psi_{[Hub]}^2(r_i; \beta, \phi, c)] = c^2(P(y_i \leq j_1) + P(y_i \geq j_2 + 1))$$
$$+ \frac{1}{v_{\mu_i}}\left[\pi_i^2 m_i(m_i - 1)P(j_1 - 1 \leq \tilde{\tilde{y}}_i \leq j_2 - 2) + (\mu_i - 2\mu_i^2)P(j_1 \leq \tilde{y}_i \leq j_2 - 1) + \mu_i^2 P(j_1 + 1 \leq y_i \leq j_2)\right],$$
with $y_i \sim B(m_i, \pi_i)$, $\tilde{y}_i \sim B(m_i - 1, \pi_i)$ and $\tilde{\tilde{y}}_i \sim B(m_i - 2, \pi_i)$ ($m_i \geq 3$).
Given that $(\partial/\partial\mu_i)\log h(y_i\,|\,x_i, \mu_i)$ is equal to $(y_i - \mu_i)/v_{\mu_i}$, we have
$$E\left[\psi_{[Hub]}(r_i; \beta, \phi, c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i, \mu_i)\right] = E\left[\psi_{[Hub]}(r_i; \beta, \phi, c)\frac{y_i - \mu_i}{v_{\mu_i}}\right]$$
$$= \frac{c\mu_i}{v_{\mu_i}}[P(y_i \leq j_1) - P(\tilde{y}_i \leq j_1 - 1) + P(\tilde{y}_i \geq j_2) - P(y_i \geq j_2 + 1)]$$
$$+ \frac{1}{v_{\mu_i}^{3/2}}\left[\pi_i^2 m_i(m_i - 1)P(j_1 - 1 \leq \tilde{\tilde{y}}_i \leq j_2 - 2) + (\mu_i - 2\mu_i^2)P(j_1 \leq \tilde{y}_i \leq j_2 - 1) + \mu_i^2 P(j_1 + 1 \leq y_i \leq j_2)\right],$$
with $y_i \sim B(m_i, \pi_i)$, $\tilde{y}_i \sim B(m_i - 1, \pi_i)$ and $\tilde{\tilde{y}}_i \sim B(m_i - 2, \pi_i)$ ($m_i \geq 3$).

For the Poisson model,
$$E[\psi^2_{[Hub]}(r_i; \beta, \phi, c)] = c^2[P(y_i \leq j_1) + P(y_i \geq j_2 + 1)]$$
$$+ \frac{1}{v_{\mu_i}}\left[\mu_i^2 P(j_1 - 1 \leq y_i \leq j_2 - 2) + (\mu_i - 2\mu_i^2)P(j_1 \leq y_i \leq j_2 - 1) + \mu_i^2 P(j_1 + 1 \leq y_i \leq j_2)\right].$$
We have
$$\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i, \mu_i) = \frac{y_i - \mu_i}{\mu_i} = \frac{y_i - \mu_i}{v_{\mu_i}},$$
so that
$$E\left[\psi_{[Hub]}(r_i; \beta, \phi, c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i, \mu_i)\right] = E\left[\psi_{[Hub]}(r_i; \beta, \phi, c)\frac{y_i - \mu_i}{v_{\mu_i}}\right]$$
$$= c[P(y_i = j_1) + P(y_i = j_2)] + \frac{1}{v_{\mu_i}^{3/2}}\left\{\mu_i P(j_1 \leq y_i \leq j_2 - 1) + \mu_i^2[P(y_i = j_1 - 1) - P(y_i = j_1) - P(y_i = j_2 - 1) + P(y_i = j_2)]\right\}.$$

For the Gamma model, we first note that
$$\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i, \mu_i) = \frac{y_i - \mu_i}{\mu_i^2/\nu} = \frac{\sqrt{\nu}\,r_i}{\mu_i}.$$
This yields
$$E\left[\psi_{[Hub]}(r_i; \beta, \phi, c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i, \mu_i)\right] = \frac{\sqrt{\nu}}{\mu_i}E[\psi_{[Hub]}(r_i; \beta, \phi, c)\,r_i]$$
$$= \frac{c\,\nu^{\nu/2}}{\mu_i\Gamma(\nu)}[G(-c, \nu) + G(c, \nu)] + \frac{\sqrt{\nu}}{\mu_i}P(-c < r_i < c)$$
$$+ \frac{\nu^{\nu/2}}{\mu_i\Gamma(\nu)}[G(-c, \nu + 1) - G(c, \nu + 1)] + \frac{\nu^{(\nu+1)/2}}{\mu_i\Gamma(\nu)}\left(\frac{\nu + 1}{\nu} - 2\right)[G(-c, \nu) - G(c, \nu)].$$
E.3 IRWLS Algorithm for Robust GLM

We show here how the estimation procedure issued from (5.13) can be written as an IRWLS algorithm. Given $\beta^{t-1}$, the estimated value of $\beta$ at iteration $t - 1$, one can obtain $\beta^t$, the value of $\beta$ at iteration $t$, by regressing $Z = X^T\beta^{t-1} + e^{t-1}$ on $X$ (see Definition (5.2)) with weights $B = \mathrm{diag}(b_1, \ldots, b_n)$, where
$$b_i = E\left[\psi(r_i; \beta, \phi, c)\frac{\partial}{\partial\mu_i}\log h(y_i\,|\,x_i, \mu_i)\right]\frac{w(x_i)}{\sqrt{\phi v_{\mu_i}}}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2, \quad (E.1)$$
for $i = 1, \ldots, n$, where $h(\cdot)$ is the conditional density or probability of $y_i\,|\,x_i$, and $e^{t-1} = (e_1^{t-1}, \ldots, e_n^{t-1})$ with
$$e_i^{t-1} = \frac{\psi(r_i^{t-1}; \beta, \phi, c) - E[\psi(r_i^{t-1}; \beta, \phi, c)]}{E[\psi(r_i^{t-1}; \beta, \phi, c)(\partial/\partial\mu_i)\log h(y_i\,|\,x_i, \mu_i^{t-1})]}\,\frac{\partial\eta_i}{\partial\mu_i}. \quad (E.2)$$

To see the above, define $U(\beta) = \sum_{i=1}^n\Psi(y_i, x_i; \beta, \phi, c)$, where $\Psi(y_i, x_i; \beta, \phi, c)$ is given in (5.13). The Fisher scoring algorithm at step $t$ writes
$$\beta^t = \beta^{t-1} + H^{-1}(\beta^{t-1})U(\beta^{t-1})$$
or, alternatively, $H(\beta^{t-1})\beta^t = H(\beta^{t-1})\beta^{t-1} + U(\beta^{t-1})$, where
$$H(\beta^{t-1}) = nM(\Psi, F_\beta) = E\left[-\frac{\partial}{\partial\beta}U(\beta)\right]\Big|_{\beta=\beta^{t-1}} = X^T B|_{\beta=\beta^{t-1}}\,X.$$
Moreover, for $Z = X^T\beta^{t-1} + e^{t-1}$ with $e^{t-1}$ as defined in (E.2), we have that $H(\beta^{t-1})\beta^{t-1} + U(\beta^{t-1}) = X^T BZ$. In fact, for each $j = 1, \ldots, p$, it holds that
$$[H(\beta^{t-1})\beta^{t-1} + U(\beta^{t-1})]_j = \sum_{k=1}^{p}\sum_{i=1}^{n} b_i x_{ij} x_{ik}\beta_k^{t-1} + \sum_{i=1}^{n}\psi(r_i; \beta, \phi, c)\,w(x_i)\frac{1}{\sqrt{\phi v_{\mu_i}}}\frac{\partial\mu_i}{\partial\eta_i}x_{ij} - \sum_{i=1}^{n}E[\psi(r_i; \beta, \phi, c)]\,w(x_i)\frac{1}{\sqrt{\phi v_{\mu_i}}}\frac{\partial\mu_i}{\partial\eta_i}x_{ij}$$
$$= \sum_{i=1}^{n}\left\{\sum_{k=1}^{p}x_{ik}\beta_k^{t-1} + \frac{\psi(r_i; \beta, \phi, c)\,w(x_i)(1/\sqrt{\phi v_{\mu_i}})(\partial\mu_i/\partial\eta_i)}{b_i} - \frac{E[\psi(r_i; \beta, \phi, c)]\,w(x_i)(1/\sqrt{\phi v_{\mu_i}})(\partial\mu_i/\partial\eta_i)}{b_i}\right\}b_i x_{ij}$$
$$= \sum_{i=1}^{n}\left\{x_i\beta^{t-1} + \frac{\psi(r_i; \beta, \phi, c) - E[\psi(r_i; \beta, \phi, c)]}{E[\psi(r_i; \beta, \phi, c)(\partial/\partial\mu_i)\log h(y_i\,|\,x_i, \mu_i)]}\frac{\partial\eta_i}{\partial\mu_i}\right\}b_i x_{ij} = \sum_{i=1}^{n} Z_i b_i x_{ij} = [X^T BZ]_j,$$
where the involved quantities are evaluated at $\beta^{t-1}$.
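As an illustration of this scheme, the following self-contained sketch fits a robust Poisson GLM with log link and Huber $\psi$. It is our code, not the authors': it takes $w(x_i) = 1$ and $\phi = 1$, uses the closed-form Poisson expectations from E.1 and E.2, and the starting value is an arbitrary choice:

```python
import numpy as np
from scipy.stats import poisson

def robust_poisson_irwls(X, y, c=1.345, tol=1e-9, max_iter=200):
    """IRWLS for a robust Poisson GLM (log link, Huber psi, w(x_i)=1)."""
    n, p = X.shape
    beta = np.zeros(p)
    beta[0] = np.log(y.mean() + 0.1)              # crude starting value
    for _ in range(max_iter):
        mu = np.exp(X @ beta)                      # log link: dmu/deta = mu
        sv = np.sqrt(mu)                           # sqrt of v_mu = mu
        r = (y - mu) / sv
        j1 = np.floor(mu - c * sv)
        j2 = np.floor(mu + c * sv)
        # E[psi] (E.1) and E[psi (y - mu)/v] (E.2 denominator), closed forms
        Epsi = (c * (poisson.sf(j2, mu) - poisson.cdf(j1, mu))
                + sv * (poisson.pmf(j1, mu) - poisson.pmf(j2, mu)))
        mid = (mu * (poisson.cdf(j2 - 1, mu) - poisson.cdf(j1 - 1, mu))
               + mu**2 * (poisson.pmf(j1 - 1, mu) - poisson.pmf(j1, mu)
                          - poisson.pmf(j2 - 1, mu) + poisson.pmf(j2, mu)))
        Edpsi = c * (poisson.pmf(j1, mu) + poisson.pmf(j2, mu)) + mid / mu**1.5
        b = Edpsi * mu**2 / sv                     # weights b_i
        e = (np.clip(r, -c, c) - Epsi) / (Edpsi * mu)  # adjusted residuals e_i
        Z = X @ beta + e
        beta_new = np.linalg.solve(X.T @ (b[:, None] * X), X.T @ (b * Z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

At a fixed point this iteration solves $\sum_i(\psi(r_i) - E[\psi(r_i)])\sqrt{\mu_i}\,x_i = 0$, i.e. the Fisher-consistent robust estimating equation of Chapter 5 for this model.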
F
Computations for the Robust GEE Estimator

F.1 IRWLS Algorithm for Robust GEE

The whole robust procedure consists of solving the three following sets of equations:
$$\sum_{i=1}^{n}(D_{\mu_i,\beta})^T\,\Gamma_i^T\,(V_{\mu_i,\tau,\alpha})^{-1}(\psi_i - c_i) = \sum_{i=1}^{n}\Psi_1(y_i, X_i; \beta, \alpha, \tau, c) = 0 \quad (F.1)$$
$$\sum_{i=1}^{n}\sum_{t=1}^{n_i}\chi(r_{it}; \beta, \alpha, \tau, c) = \sum_{i=1}^{n}\Psi_2(r_i; \beta, \alpha, \tau, c) = 0 \quad (F.2)$$
$$\sum_{i=1}^{n}\left(G_i^T B_i - \frac{K}{n}\alpha\tau\right) = \sum_{i=1}^{n}\Psi_3(r_i; \beta, \alpha, \tau, c) = 0. \quad (F.3)$$

Ideally these equations should be solved simultaneously (as, for example, in Huggins (1993)). We implement a two-stage approach iterating between the estimation of the regression parameters via (F.1) and the estimation of the dispersion and correlation parameters via (F.2) and (F.3). In fact, for fixed values of the nuisance parameters $\tau$ and $\alpha$, the estimation of the regression parameter $\beta$ can be performed via an IRWLS algorithm by regressing the adjusted dependent variable $Z = X_{tot}\hat{\beta} + D^{*-1}(\psi_{tot} - c_{tot})$ on $X_{tot}$ with a block-diagonal weight matrix $W^*$, where $X_{tot} = (X_1^T, \ldots, X_n^T)^T$, $\psi_{tot} = (\psi_1^T, \ldots, \psi_n^T)^T$ and $c_{tot} = (c_1^T, \ldots, c_n^T)^T$ combine the information for the entire sample. The $i$th block of $W^*$ is the $n_i \times n_i$ matrix
$$W_i^* = D_{\mu_i,\beta}^{*-1}\,\Gamma_i^T\,(A_{\mu_i})^{-1/2}(R_{\alpha,i})^{-1}(A_{\mu_i})^{-1/2}\,\Gamma_i\, D_{\mu_i,\beta}^{*-1},$$
and $D^*$ is a block-diagonal matrix with blocks $D^*_{\mu_i,\beta} = \mathrm{diag}(\partial\eta_{i1}/\partial\mu_{i1}, \ldots, \partial\eta_{in_i}/\partial\mu_{in_i})$. We remark that $D_{\mu_i,\beta} = D^{*-1}_{\mu_i,\beta}X_i$. The matrix $H_i = X_i(X^T W^* X)^{-1}X_i^T W_i^*$ defines the hat matrix for subject $i$. One then obtains an estimate of $\tau$ and next an estimate of $\alpha$ from (F.2) and (F.3), respectively. Note that (F.3) can be solved explicitly when exchangeable correlation is assumed, yielding $\hat{\alpha} = (1/(\hat{\tau}K))\sum_{i=1}^{n} G_i^T B_i$.
F.2 Fisher Consistency Corrections

Let $Y_{it}$ and $Y_{it'}$ be Bernoulli distributed with probability of success equal to $\mu_{it}$ and $\mu_{it'}$, respectively, and with correlation $\rho_{tt'}$. We assume that the robustness weight $w_{it}$ associated with subject $i$ at time $t$ can be decomposed as $w(x_{it})w(r_{it}; \beta, \tau, c)$. The joint distribution of $(y_{it}, y_{it'})$ is multinomial with set of probabilities $(\pi_{11}, \pi_{10}, \pi_{01}, \pi_{00})$, where $\pi_{11} = \rho_{tt'}v_{it}^{1/2}v_{it'}^{1/2} + \mu_{it}\mu_{it'}$, $\pi_{10} = \mu_{it} - \pi_{11}$, $\pi_{01} = \mu_{it'} - \pi_{11}$ and $\pi_{00} = 1 - \mu_{it} - \mu_{it'} + \pi_{11}$. The consistency correction vector $c_i$ has elements $c_{it} = E[\psi_{it}]$ that take the form
$$c_{it} = w(x_{it})\left(w(r_{it}^{(1)}; \beta, \tau, c) - w(r_{it}^{(0)}; \beta, \tau, c)\right)v(\mu_{it}),$$
where $w(r_{it}^{(j)}; \beta, \tau, c) = w((j - \mu_{it})/v(\mu_{it})/\sqrt{\tau})$ is the weight for the $t$th measure of cluster $i$ evaluated at $y_{it} = j$. Moreover, the diagonal matrix $\Gamma_i = E[\tilde{\psi}_i - \tilde{c}_i]$, with $\tilde{\psi}_i = \partial\psi_i/\partial\mu_i$ and $\tilde{c}_i = \partial c_i/\partial\mu_i$, has diagonal elements
$$\gamma_{it} = -w(x_{it})\left((1 - \mu_{it})w(r_{it}^{(1)}; \beta, \tau, c) + \mu_{it}w(r_{it}^{(0)}; \beta, \tau, c)\right).$$
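The multinomial construction for a pair $(y_{it}, y_{it'})$ can be written out directly. The snippet below (our illustration) checks that the four probabilities sum to one and reproduce the requested marginals:

```python
import numpy as np

def joint_bernoulli(mu_t, mu_s, rho):
    """Joint probabilities (pi11, pi10, pi01, pi00) of two Bernoulli
    variables with means mu_t, mu_s and correlation rho."""
    v_t, v_s = mu_t * (1 - mu_t), mu_s * (1 - mu_s)
    pi11 = rho * np.sqrt(v_t * v_s) + mu_t * mu_s
    pi10 = mu_t - pi11          # y_t = 1, y_s = 0
    pi01 = mu_s - pi11          # y_t = 0, y_s = 1
    pi00 = 1 - mu_t - mu_s + pi11
    return pi11, pi10, pi01, pi00

pi11, pi10, pi01, pi00 = joint_bernoulli(0.3, 0.6, 0.2)
assert abs(pi11 + pi10 + pi01 + pi00 - 1) < 1e-12
assert abs(pi11 + pi10 - 0.3) < 1e-12   # marginal P(y_t = 1)
```

Note that not every $(\mu_{it}, \mu_{it'}, \rho_{tt'})$ combination yields four nonnegative probabilities; the admissible range of $\rho_{tt'}$ depends on the marginals.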
G
Computation of the CRQ

The global algorithm uses the notation and definitions introduced in Section 7.6.2. It is taken from Portnoy (2003) or Debruyne et al. (2008) and works as follows.

• As long as no censored observation has been crossed, use ordinary QR as in Koenker and Bassett (1978).

• When the $i$th censored observation is crossed at the $\tau$th quantile, store this value as $\hat{\tau}_i = \tau$.

• When censored observations have been crossed for a specific $\tau$, find the value of $\beta$ that minimizes a weighted version of (7.30):
$$\sum_{i\in K_\tau^c}\rho(y_i - x_i^T\beta(\tau); \tau) + \sum_{i\in K_\tau}\left[w_i(\tau)\rho(y_i - x_i^T\beta(\tau); \tau) + (1 - w_i(\tau))\rho(y^* - x_i^T\beta(\tau); \tau)\right], \quad (G.1)$$
where $K_\tau$ represents the set of crossed censored observations at $\tau$ and $K_\tau^c$ its complement. The weights $w_i(\tau)$ are defined in Section 7.6.2 and $y^*$ is any value sufficiently large to exceed $x_i^T\beta$ for all $i$.

To compute the regression quantile objective function (G.1) in practice, a sequence of breakpoints $\tau_1^*, \tau_2^*, \ldots, \tau_L^*$ is defined so that $\hat{\beta}(\tau)$ is piecewise constant between these breakpoints. Then, simplex pivoting techniques allow one to move from one breakpoint to another using the gradients of (G.1). Portnoy (2003) points out that the resulting gradients are linear in $\tau$, which makes the whole computation tractable. The above reference contains a detailed algorithm and additional explanations. Recently a variant of this approach, called the grid algorithm, has been proposed by Neocleous and Portnoy (2006). It is more stable, faster and has already been implemented in the R package
provided by Portnoy. It should preferably be used for large datasets. The simplex pivoting algorithm is still available and works well for smaller samples, that is, for n up to a few thousand.
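Because the weighted objective (G.1) is piecewise linear in $\beta$, a single quantile fit can be posed as a linear program. The sketch below is ours (using `scipy.optimize.linprog` rather than simplex pivoting): it minimizes $\sum_i w_i\,\rho(y_i - x_i^T\beta; \tau)$, so a crossed censored observation would be entered twice, once at $y_i$ with weight $w_i(\tau)$ and once at $y^*$ with weight $1 - w_i(\tau)$:

```python
import numpy as np
from scipy.optimize import linprog

def weighted_rq(X, y, tau, w=None):
    """Minimize sum_i w_i * rho(y_i - x_i'beta; tau) via an LP, where
    rho(u; tau) = u * (tau - I(u < 0)) is the check function."""
    n, p = X.shape
    w = np.ones(n) if w is None else np.asarray(w, float)
    # variables: beta+ (p), beta- (p), u (n), v (n); residual = u - v,
    # all variables nonnegative (linprog's default bounds)
    cost = np.concatenate([np.zeros(2 * p), tau * w, (1 - tau) * w])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(cost, A_eq=A_eq, b_eq=y, method="highs")
    return res.x[:p] - res.x[p:2 * p]
```

For an intercept-only design this reproduces the (weighted) sample quantile, which gives an easy correctness check; the breakpoint-tracking of Portnoy's algorithm then amounts to solving a parametric family of such LPs in $\tau$.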
References

Adrover, J., Salibian-Barrera, M. and Zamar, R. (2004) Globally robust inference for the location and simple regression model. Journal of Statistical Planning and Inference, 119, 353–375.
Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory (eds Petrov, B.N. and Csaki, F.), Akademiai Kiado, Budapest, pp. 267–281.
Alario, F.J.S. and Ferrand, L. (2000) Semantic and associative priming in picture naming. The Quarterly Journal of Experimental Psychology, 53, 741–764.
Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H. and Tukey, J.W. (1972) Robust Estimates of Location: Survey and Advances, Princeton University Press, Princeton, NJ.
Atkinson, A.C. (1985) Plots, Transformations and Regression, Oxford University Press, Oxford.
Atkinson, A.C. and Riani, M. (2000) Robust Diagnostic Regression Analysis, Springer, Berlin.
Barnett, V. and Lewis, T. (1978) Outliers in Statistical Data, John Wiley & Sons, New York.
Barry, S. and Welsh, A. (2002) Generalized additive modelling and zero inflated count data. Ecological Modelling, 157, 179–188.
Beaton, A.E. and Tukey, J.W. (1974) The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16, 147–185.
Bednarski, T. (1993) Robust estimation in Cox's regression model. Scandinavian Journal of Statistics, 20, 213–225.
Bednarski, T. (1999) Adaptive robust estimation in the Cox regression model. Biocybernetics and Biomedical Engineering, 19, 5–15.
Bednarski, T. (2007) On a robust modification of Breslow's cumulated hazard estimator. Computational Statistics and Data Analysis, 52, 234–238.
Bednarski, T. and Mocarska, E. (2006) On robust model selection within the Cox model. Econometrics Journal, 9, 279–290.
Bednarski, T. and Nowak, M. (2003) Robustness and efficiency of Sasieni-type estimators in the Cox model.
Journal of Statistical Planning and Inference, 115, 261–272.
Bednarski, T. and Zontek, S. (1996) Robust estimation of parameters in a mixed unbalanced model. Annals of Statistics, 24, 1493–1510.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics, John Wiley & Sons, New York.
Bennet, C.A. (1954) Effect of measurement error on chemical process control. Industrial Quality Control, 11, 17–20.
Beran, R. (1981) Efficient robust tests in parametric models. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57, 73–86.
Berkson, J. (1944) Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 357–365. Bernoulli, D. (1777) Dijudicatio maxime probabilis plurium observationum discrepantium atque verisimillima inductio inde formanda. Acta Acad, Sci. Petropolit., 1, 3–33 (English translation by Allen, C.C. (1961), Biometrika, 48, 3–13.) Berry, D.A. (1987) Logarithmic transformations in ANOVA. Biometrics, 43, 439–456. Bianco, A., Boente, G. and di Rienzo, J. (2000) Some results for robust GM-based estimators in heteroscedastic regression models. Journal of Statistical Planning and Inference, 89, 215–242. Bianco, A.M. and Yohai, V.J. (1997) Robust estimation in the logistic regression model. Robust Statistics, Data Analysis and Computer Intensive Methods (ed. Rieder H), Springer, New York, pp. 17–34. Bianco, A.M., Ben, M.G. and Yohai, V.J. (2005) Robust estimation for linear regression with asymmetric errors. The Canadian Journal of Statistics, 33, 511–528. Birch, M.W. (1963) Maximum likelihood in three-way contingency tables. Journal of the Royal Statistical Society, Series B, Methodological, 25, 220–233. Bliss, C.I. (1935) The calculation of the dosage-mortality curve. Annals of Applied Biology, 22, 134–167. Bloomfield, P. and Steiger, W.L. (1983) Least Absolute Deviations: Theory, Applications, and Algorithms, Birkhäuser, Boston, MA. Blough, D.K., Madden, C.W. and Hornbrook, M.C. (1999) Modeling risk using generalized linear models. Journal of Health Economics, 18, 153–171. Box, G. (1979) Robustness in the strategy of scientific model building. Robustness in Statistics (ed. Launer, R. and Wilkinson, G.), Academic Press, New York. Box, G.E.P. (1953) Non-normality and tests of variances. Biometrika, 40, 318–335. Bretagnolle, J. and Huber-Carol, C. (1988) Effects of omitting covariates in Cox’s model for survival data. Scandinavian Journal of Statistics, 15, 125–128. Brochner-Mortensen, J., Jensen, S. and Rodbro, P. 
(1977) Assessment of renal function from plasma creatinine in adult patients. Scandinavian Journal of Urology and Nephrology, 11, 263–270. Buchinsky, M. and Hahn, J. (1998) An alternative estimator for the censored quantile regression model. Econometrica, 66, 653–671. Cain, K. and Lange, T. (1984) Approximate case influence for the proportional hazards regression model with censored data. Biometrics, 40, 439–499. Cameron, A.C. and Trivedi, P.K. (1998) Regression Analysis of Count Data, Cambridge University Press, Cambridge. Canario, L. (2006) Genetic aspects of piglet mortality at birth and in early suckling period: relationships with sow maternal abilities and piglet vitality, PhD thesis, Institut National Agronomique Paris-Grignon, France. Canario, L., Cantoni, E., Le Bihan, E., Caritez, J., Billon, Y., Bidanel, J. and Foulley, J. (2006) Between breed variability of stillbirth and relationships with sow and piglet characteristics. Journal of Animal Science, 84, 3185–3196. Cantoni, E. (2003) Robust inference based on quasi-likelihoods for generalized linear models and longitudinal data. Developments in Robust Statistics. Proceedings of ICORS 2001 (eds. Dutter, R., Filzmoser, P., Gather, U. and Rousseeuw, P.J.), Springer, Heidelberg, pp. 114– 124. Cantoni, E. (2004a) Analysis of robust quasi-deviances for generalized linear models. Journal of Statistical Software., Vol. 10, Issue 4.
Cantoni, E. (2004b) A robust approach to longitudinal data analysis. Canadian Journal of Statistics, 32, 169–180. Cantoni, E. and Ronchetti, E. (2001a) Resistant selection of the smoothing parameter for smoothing splines. Statistics and Computing, 11, 141–146. Cantoni, E. and Ronchetti, E. (2001b) Robust inference for generalized linear models. Journal of the American Statistical Association, 96, 1022–1030. Cantoni, E. and Ronchetti, E. (2006) A robust approach for skewed and heavy-tailed outcomes in the analysis of health care expenditures. Journal of Health Economics, 25, 198–213. Cantoni, E., Mills Flemming, J. and Ronchetti, E. (2005) Variable selection for marginal longitudinal generalized linear models. Biometrics, 61, 507–514. Carroll, R., Ruppert, D. and Stefanski, L. (1995) Measurement Error in Nonlinear Models, Chapman & Hall, London. Carroll, R.J. and Pederson, S. (1993) On robustness in the logistic regression model. Journal of the Royal Statistical Society, Series B, Methodological, 55, 693–706. Carroll, R.J. and Ruppert, D. (1982) Robust estimation in heteroscedastic linear models. Annals of Statistics, 10, 1224–1233. Chatterjee, S. and Hadi, A.S. (1988) Sensitivity Analysis in Linear Regression, John Wiley & Sons, New York. Chen, C. and Wang, P. (1991) Diagnostic plots in Cox’s regression model. Biometrics. 47, 841–850. Chernozhukov, V. and Hong, H. (2002) Three-step censored quantile regression and extramarital affairs. Journal of the American Statistical Association, 97, 872–882. Christmann, A. (1997) High breakdown point estimators in logistic regression. Robust Statistics, Data Analysis and Computer Intensive Methods (ed. Rieder H), Springer, New York, pp. 79–90. Christmann, A. and Rousseeuw, P.J. (2001) Measuring overlap in binary regression. Computational Statistics and Data Analysis, 37, 65–75. Cleveland, W.S. (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. 
Collett, D. (2003a) Modelling Binary Data, Chapman & Hall, London. Collett, D. (2003b) Modelling Survival Data in Medical Research, 2nd edn, Chapman & Hall, London. Conen, D., Wietlisbach, V., Bovet, P., Shamlaye, C., Riesen, W., Paccaud, F. and Burnier, M. (2004) Prevalence of hyperuricemia and relation of serum uric acid with cardiovascular risk factors in a developing country. BMC Public Health, http://www.biomedcentral.com/1471-2458/4/9. Cook, R.D. and Weisberg, S. (1982) Residuals and Influence in Regression, Chapman & Hall, New York. Copas, J.B. (1988) Binary regression models for contaminated data. Journal of the Royal Statistical Society, Series B, Methodological, 50, 225–265. Copt, S. and Heritier, S. (2007) Robust alternative to the F -test in mixed linear models based on MM-estimates. Biometrics, 63, 1045–1052. Copt, S. and Victoria-Feser, M.P. (2006) High breakdown inference for mixed linear models. Journal of the American Statistical Association, 101(473), 292–300. Copt, S. and Victoria-Feser, M.P. (2009) Robust predictions in mixed linear models, Technical report, University of Geneva.
Cox, D. (1972) Regression models and life tables. Journal of the Royal Statistical Society, Series B, Methodological, 34, 187–220. Cox, D.R. and Hinkley, D.V. (1992) Theoretical Statistics, Chapman & Hall, London. Cressie, N. and Lahiri, S. (1993) The asymptotic distribution of REML estimators. Journal of Multivariate Analysis, 45, 217–233. Croux, C., Dhaene, G. and Hoorelbeke, D. (2003) Robust standard errors for robust estimators, Discussion Paper Series 03.16, Center for Economic Studies, Catholic University of Leuven. Davies, P.L. (1987) Asymptotic behaviour of S-estimators of multivariate location parameters and dispertion matrices. Annals of Statistics, 15, 1269–1292. Davies, R.B. (1980) [Algorithm AS 155] The distribution of a linear combination of χ 2 random variables (AS R53: 84V33 p366- 369). Applied Statistics, 29, 323–333. Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and their Applications, Cambridge University Press, Cambridge. Debruyne, M., Hubert, M., Portnoy, S. and Vanden Branden, K. (2008) Censored depth quantiles. Computational Statistics and Data Analysis, 52, 1604–1614. Dempster, A.P., Rubin, D.B. and Tsutakawa, R.K. (1981) Estimation in covariance components models. Journal of the American Statistical Association, 76, 341–353. Devlin, S.J., Gnanadesikan, R. and Kettenring, J.R. (1981) Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association, 76, 354–362. Diggle, P.J., Heagerty, P., Liang, K.Y. and Zeger, S.L. (2002) Analysis of Longitudinal Data, Oxford University Press, New York. DiRienzo, A.G. and Lagakos, S.W. (2001) Effects of model misspecification on tests of no randomized treatment effect arising from Coxs proportional hazards model. Journal of the Royal Statistical Society Series B, Methodological, 63, 745–757. DiRienzo, A.G. and Lagakos, S.W. (2003) The effects of misspecifying Coxs regression model on randomized treatment group comparisons. 
Handbook of Statistics, 23, 1–15. Dobbie, M.J. and Welsh, A.H. (2001a) Modelling correlated zero-inflated count data. Australian and New Zealand Journal of Statistics, 43(4), 431–444. Dobbie, M.J. and Welsh, A.H. (2001b) Models for zero-inflated count data using the Neyman type A distribution. Statistical Modelling, 1(1), 65–80. Dobson, A.J. (2001) An Introduction to Generalized Linear Models, Chapman & Hall/CRC, Boca Raton, FL. Dunlop, D.D., Manheim, L.M., Song, J. and Chang, R.W. (2002) Gender and ethnic/racial disparities health care utilization among older adults. Journal of Gerontology, 57B, S221– S233. Dupuis, D.J. and Morgenthaler, S. (2002) Robust weighted likelihood estimators with an application to bivariate extreme value problems. Canadian Journal of Statistics, 30, 17–36. Dyke, G.V. and Patterson, H.D. (1952) Analysis of factorial arrangements when the data are proportions. Biometrics, 8, 1–12. Edgeworth, F.Y. (1883) The method of least squares. Philosophical Magazine, 23, 364–375. Edgeworth, F.Y. (1887) On observations relating to several quantities. Hermathena, 6, 279– 285. Efron, B. (1967) The power of the likelihood ratio test. The Annals of Mathematical Statistics, 38, 802–806.
Efron, B. (1982) The Jackknife, the Bootstrap an Other Resampling Plans, vol. 38, Society for Industrial and Applied Mathematics, Philadelphia, PA. Everitt, B.S. (1994) Statistical Analysis using S-Plus, Chapman & Hall, London. Fahrmeir, L. and Tutz, G. (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, Springer, Berlin. Farebrother, R.W. (1990) [Algorithm AS 256] The distribution of a quadratic form in normal variables. Applied Statistics, 39, 294–309. Fernholz, L.T. (1983) Von Mises Calculus for Statistical Functionals (Lecture Notes in Statistics, vol. 19), Springer, New York. Field, C. and Smith, B. (1994) Robust estimation—a weighted maximum likelihood approach. International Statistical Review, 62, 405–424. Fisher, R. (1925) Statistical Methods for Research Workers, 1st edn, Oliver and Boyd, Edinburgh. Fisher, R.A. (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, 222, 309–368. Fisher, R.A. (1934) Two new properties of mathematical likelihood. Philosophical Transactions of the Royal Society A, 144, 285–307. Fitzenberger, B. and Winker, P. (2007) Improving the computation of censored quantile regressions. Computational Statistics and Data Analysis, 52, 88–108. Gail, M., Wieand, S. and Piantodosi, S. (1984) Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika, 71, 431–444. Gallant, A.R. and Tauchen, G. (1996) Which moments to match? Econometric Theory, 12, 657–681. Genton, M.G. and Ronchetti, E. (2003) Robust indirect inference. Journal of the American Statistical Association, 98(461), 67–76. Genton, M.G. and Ronchetti, E. (2008) Robust prediction of beta, in Computational Methods in Financial Engineering - Essays in Honour of Manfred Gilli (eds Kontoghiorghes, E.J., Rustem, B. and Winker, P.), Springer, Berlin pp. 147–161. Gerdtham, U. 
(1997) Equity in health care utilization: further tests based on hurdle models and Swedish micro data. Health Economics, 6, 303–319. Gilleskie, D.B. and Mroz, T.A. (2004) A flexible approach for estimating the effect of covariates on health expenditures. Journal of Health Economics, 23, 391–418. Giltinan, D.M., Carroll, R.J. and Ruppert, D. (1986) Some new estimation methods for weighted regression when there are possible outliers. Technometrics, 28, 219–230. Gouriéroux, C., Monfort, A. and Renault, E. (1993) Indirect inference. Journal of Applied Econometrics, 8S, 85–118. Greene, W. (1997) Econometric Analysis, 3rd edn, Prentice Hall, Englewood Cliffs, NJ. Grzegorek, K. (1993) On robust estimation of baseline hazard under the Cox model via Fréchet differentiability, PhD thesis, Preprint of the Institute of Mathematics of the Polish Academy of Sciences, 518. Hammill, B.G. and Preisser, J.S. (2006) A SAS/IML software program for GEE and regression diagnostic. Computational Statistics and Data Analysis, 51, 1197–1212. Hampel, F.R. (1968) Contribution to the theory of robust estimation, PhD thesis, University of California, Berkeley, CA. Hampel, F.R. (1974) The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383–393.
Hampel, F.R. (1985) The breakdown points of the mean combined with some rejection rules. Technometrics, 27, 95–107.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986) Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, New York.
Hanfelt, J.J. and Liang, K.Y. (1995) Approximate likelihood ratios for general estimating functions. Biometrika, 82, 461–477.
Hardin, J.W. and Hilbe, J.M. (2003) Generalized Estimating Equations, Chapman & Hall, London.
Härdle, W. (1990) Applied Nonparametric Regression, Cambridge University Press, Cambridge.
Harrell, F.E.J. (2001) Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression and Survival Analysis (Springer Series in Statistics), Springer, Berlin.
Harter, H.L. (1974–1976) The method of least squares and some alternatives. International Statistical Review, 42, 147–174 (Part I); 42, 235–264 (Part II); 43, 1–44 (Part III); 43, 125–190 (Part IV); 43, 269–278 (Part V); 44, 113–159 (Part VI).
Harville, D.A. (1977) Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320–340.
Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models, Chapman & Hall, London.
Hauck, W.W. and Donner, A. (1977) Wald's test as applied to hypotheses in logit analysis (Corr: 75, p. 482). Journal of the American Statistical Association, 72, 851–853.
He, X. (1991) A local breakdown property of robust tests in linear regression. Journal of Multivariate Analysis, 38, 294–305.
He, X., Simpson, D. and Portnoy, S. (1990) Breakdown robustness of tests. Journal of the American Statistical Association, 85, 446–452.
Heagerty, P.J. and Zeger, S.L. (1996) Marginal regression models for clustered ordinal measurements. Journal of the American Statistical Association, 91, 1024–1036.
Heagerty, P.J. and Zeger, S.L. (2000) Multivariate continuation ratio models: connections and caveats. Biometrics, 56(3), 719–732.
Henderson, C.R. (1953) Estimation of variance and covariance components. Biometrics, 9, 226–252.
Henderson, C.R., Kempthorne, O., Searle, S.R. and von Krosigk, C.N. (1959) Estimation of environmental and genetic trends from records subject to culling. Biometrics, 15, 192–218.
Heritier, S. (1993) Contribution to robustness in nonlinear models: application to economic data, PhD thesis, Faculty of Economic and Social Sciences, University of Geneva, Switzerland.
Heritier, S. and Galbraith, S. (2008) A revisit of robust inference in the Cox model, Technical report, University of New South Wales, Australia.
Heritier, S. and Ronchetti, E. (1994) Robust bounded-influence tests in general parametric models. Journal of the American Statistical Association, 89(427), 897–904.
Heritier, S. and Victoria-Feser, M.P. (1997) Practical applications of bounded-influence tests, in Handbook of Statistics, vol. 15 (eds Maddala, G. and Rao, C.), Elsevier Science, Amsterdam, pp. 77–100.
Hettmansperger, T.P. (1984) Statistical Inference Based on Ranks, John Wiley & Sons, New York.
Hettmansperger, T.P. and McKean, J.W. (1998) Robust Nonparametric Statistical Methods, Arnold, London.
Hjort, N. (1992) On inference in parametric survival models. International Statistical Review, 60, 355–387.
Hodges, J.L.J. (1967) Efficiency in normal samples and tolerance of extreme values for some estimates of location, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, pp. 163–186.
Holcomb, P. and McPherson, W. (1994) Event-related brain potentials reflect semantic priming in an object decision task. Brain and Cognition, 24, 259–276.
Hollis, S. and Campbell, F. (1999) What is meant by intention to treat? Survey of published randomised clinical trials. British Medical Journal, 319, 670–674.
Honore, B., Khan, S. and Powell, J.L. (2002) Quantile regression under random censoring. Journal of Econometrics, 109, 67–105.
Horton, N.J. and Lipsitz, S.R. (1999) Review of software to fit generalized estimating equation regression models. The American Statistician, 53, 160–169.
Huber-Carol, C. (1970) Etude Asymptotique de Tests Robustes, PhD thesis, ETH Zürich, Switzerland.
Huber, P.J. (1964) Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101.
Huber, P.J. (1967) The behavior of maximum likelihood estimates under nonstandard conditions, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, pp. 221–233.
Huber, P.J. (1972) Robust statistics: a review. Annals of Mathematical Statistics, 43, 1041–1067.
Huber, P.J. (1973) Robust regression: asymptotics, conjectures and Monte Carlo. Annals of Statistics, 1, 799–821.
Huber, P.J. (1979) Robust smoothing, in Robustness in Statistics (eds Launer, R.L. and Wilkinson, G.N.), Academic Press, New York, pp. 33–48.
Huber, P.J. (1981) Robust Statistics, John Wiley & Sons, New York.
Huber, P.J. and Ronchetti, E.M. (2009) Robust Statistics, 2nd edn, John Wiley & Sons, New York.
Huggins, R.M. (1993) A robust approach to the analysis of repeated measures. Biometrics, 49, 715–720.
Huggins, R.M. and Staudte, R.G. (1994) Variance components models for dependent cell populations. Journal of the American Statistical Association, 89, 19–29.
Imhof, J.P. (1961) Computing the distribution of quadratic forms in normal variables. Biometrika, 48, 352–363.
Ingelfinger, J.A., Mosteller, F., Thibodeau, L.A. and Ware, J.H. (1987) Biostatistics in Clinical Medicine, 2nd edn, Macmillan, New York.
Jain, A., Tindell, C.A., Laux, I., Hunter, J.B., Curran, J., Galkin, A., Afar, D.E., Aronson, N., Shak, S., Natale, R.B. and Agus, D.B. (2005) Epithelial membrane protein-1 is a biomarker of gefitinib resistance. Proceedings of the National Academy of Sciences USA, 102, 11858–11863.
Kalbfleisch, J. and Prentice, R. (1980) The Statistical Analysis of Failure Time Data, John Wiley & Sons, Ltd, Chichester.
Kenward, M.G. and Roger, J.H. (1997) Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983–997.
Kim, C. and Bae, W. (2005) Case influence diagnostics in the Kaplan–Meier estimator and the log-rank test. Computational Statistics, 20, 521–534.
Koenker, R. (2005) Quantile Regression (Econometric Society Monographs), Cambridge University Press, New York.
Koenker, R. (2008) Censored quantile regression redux. Journal of Statistical Software, 27(6), 2–25.
Koenker, R. and Bassett, G. (1978) Regression quantiles. Econometrica, 46, 33–50.
Koenker, R. and Bassett, G. (1982) Robust tests for heteroscedasticity based on regression quantiles. Econometrica, 50, 43–62.
Koenker, R. and D'Orey, V. (1987) Computing regression quantiles. Applied Statistics, 36, 383–393.
Koenker, R. and Geling, O. (2001) Reappraising medfly longevity: a quantile regression survival analysis. Journal of the American Statistical Association, 96, 458–468.
Koenker, R. and Hallock, K. (2001) Quantile regression: an introduction. Journal of Economic Perspectives, 15, 143–156.
Kong, F.H. and Slud, E. (1997) Robust covariate-adjusted logrank tests. Biometrika, 84, 847–862.
Krall, J.M., Uthoff, V.A. and Harley, J.B. (1975) A step-up procedure for selecting variables associated with survival. Biometrics, 31, 49–57.
Krasker, W.S. and Welsch, R.E. (1982) Efficient bounded-influence regression estimation. Journal of the American Statistical Association, 77, 595–604.
Künsch, H.R., Stefanski, L.A. and Carroll, R.J. (1989) Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84, 460–466.
Kuonen, D. (1999) Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika, 86, 929–935.
Kurttio, P., Komulainen, H., Leino, A., Salonen, L., Auvinen, A. and Saha, H. (2005) Bone as a possible target of chemical toxicity of natural uranium in drinking water. Environmental Health Perspectives, 113, 68–72.
Laird, N. and Ware, J. (1982) Random-effects models for longitudinal data. Biometrics, 38, 963–974.
Lambert, D. (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34, 1–14.
Lange, K.L., Little, R.J.A. and Taylor, J.M.G. (1989) Robust statistical modeling using the t-distribution. Journal of the American Statistical Association, 84, 881–896.
Liang, K.Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Liang, K.Y., Zeger, S.L. and Qaqish, B. (1992) Multivariate regression analyses for categorical data (Discussion: pp. 24–40). Journal of the Royal Statistical Society, Series B, Methodological, 54, 3–24.
Lin, D.Y. and Wei, L.J. (1989) The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association, 84, 1074–1078.
Lin, T.I. and Lee, J.C. (2006) A robust approach to t linear mixed models applied to multiple sclerosis data. Statistics in Medicine, 25, 1397–1412.
Lindsey, J.K. (1997) Applying Generalized Linear Models, Springer, Berlin.
Lindstrom, M.J. and Bates, D.M. (1988) Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data (Correction: 89, 1572). Journal of the American Statistical Association, 83, 1014–1022.
Litière, S., Alonso, A. and Molenberghs, G. (2007a) The impact of a misspecified random-effects distribution on the estimation and performance of inferential procedures in generalized linear mixed models. Statistics in Medicine, 27, 3125–3144.
Litière, S., Alonso, A. and Molenberghs, G. (2007b) Type I and type II error under random-effects misspecification in generalized linear mixed models. Biometrics, 63, 1038–1044.
Littell, R.C. (2002) Analysis of unbalanced mixed model data: a case study comparison of ANOVA versus REML/GLS. Journal of Agricultural, Biological and Environmental Statistics, 7, 472–490.
Lopuhaä, H.P. (1989) On the relation between S-estimators and M-estimators of multivariate location and covariance. Annals of Statistics, 17, 1662–1683.
Lopuhaä, H.P. (1992) Highly efficient estimators of multivariate location with high breakdown point. Annals of Statistics, 20, 398–413.
Lopuhaä, H.P. and Rousseeuw, P.J. (1991) Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics, 19, 229–248.
Ma, B. and Ellis, R.E. (2003) Robust registration for computer-integrated orthopedic surgery: laboratory validation and clinical experience. Medical Image Analysis, 7(3), 237–250.
Machado, J.A.F. (1993) Robust model selection and M-estimation. Econometric Theory, 9, 478–493.
Mahalanobis, P.C. (1936) On the generalized distance in statistics. Proceedings of the National Institute of Science of India, 12, 49–55.
Mallows, C.L. (1973) Some comments on Cp. Technometrics, 15, 661–675.
Mallows, C.L. (1975) On some topics in robustness, Technical report, Bell Telephone Laboratories, Murray Hill, NJ.
Marazzi, A. (1993) Algorithms, Routines and S-Functions for Robust Statistics, Wadsworth and Brooks/Cole, Belmont, CA.
Marazzi, A. (2002) Bootstrap tests for robust means of asymmetric distributions with unequal shapes. Computational Statistics and Data Analysis, 39, 503–528.
Marazzi, A. and Barbati, G. (2003) Robust parametric means of asymmetric distributions: estimation and testing. Estadistica, 54, 47–72.
Marazzi, A. and Yohai, V. (2004) Adaptively truncated maximum likelihood regression with asymmetric errors. Journal of Statistical Planning and Inference, 122, 271–291.
Markatou, M. and He, X. (1994) Bounded influence and high breakdown point testing procedures in linear models. Journal of the American Statistical Association, 89, 543–549.
Markatou, M. and Hettmansperger, T.P. (1990) Robust bounded influence tests in linear models. Journal of the American Statistical Association, 85, 187–190.
Markatou, M. and Hettmansperger, T.P. (1992) Applications of the asymmetric eigenvalue problem techniques to robust testing. Journal of Statistical Planning and Inference, 31, 51–65.
Markatou, M. and Ronchetti, E. (1997) Robust inference: the approach based on influence functions, in Handbook of Statistics, Vol. 15: Robust Inference (eds Maddala, G.S. and Rao, C.), Elsevier Science, New York, pp. 49–75.
Markatou, M., Basu, A. and Lindsay, B. (1997) Weighted likelihood estimating equations: the discrete case with application to logistic regression. Journal of Statistical Planning and Inference, 57, 215–232.
Markatou, M., Stahel, W.A. and Ronchetti, E. (1991) Robust M-type testing procedures for linear models, in Directions in Robust Statistics and Diagnostics, Part I (eds Stahel, W.A. and Weisberg, S.), Springer, New York, pp. 201–220.
Maronna, R.A. (1976) Robust M-estimators of multivariate location and scatter. Annals of Statistics, 4, 51–67.
Maronna, R.A. and Yohai, V.J. (2000) Robust regression with both continuous and categorical predictors. Journal of Statistical Planning and Inference, 89, 197–214.
Maronna, R.A., Bustos, O.H. and Yohai, V.J. (1979) Bias- and efficiency-robustness of general M-estimators for regression with random carriers, in Smoothing Techniques for Curve Estimation (eds Gasser, T. and Rosenblatt, M.), Springer, New York, pp. 91–116.
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006) Robust Statistics: Theory and Methods, John Wiley & Sons, Ltd, Chichester.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd edn, Chapman & Hall, London.
McCulloch, C.E. and Searle, S.R. (2001) Generalized, Linear, and Mixed Models, John Wiley & Sons, Ltd, Chichester.
McLean, R.A., Sanders, W.L. and Stroup, W.W. (1991) A unified approach to mixed linear models. The American Statistician, 45, 54–64.
Mills, J.E., Field, C.A. and Dupuis, D.J. (2002) Marginally specified generalized linear mixed models: a robust approach. Biometrics, 58, 727–734.
Min, Y. and Agresti, A. (2002) Modeling nonnegative data with clumping at zero: a survey. Journal of the Iranian Statistical Society, 1, 7–33.
Minder, C.E. and Bednarski, T. (1996) A robust method for proportional hazards regression. Statistics in Medicine, 15, 1033–1047.
Molenberghs, G. and Verbeke, G. (2005) Models for Discrete Longitudinal Data, Springer, Berlin.
Morgenthaler, S. (1992) Least-absolute-deviations fits for generalized linear models. Biometrika, 79, 747–754.
Morrell, C.H. (1998) Likelihood ratio testing of variance components in the linear mixed-effects model using restricted maximum likelihood. Biometrics, 54, 1560–1568.
Moustaki, I. and Victoria-Feser, M.P. (2006) Bounded-influence robust estimation in generalized linear latent variable models. Journal of the American Statistical Association, 101(474), 644–653.
Moustaki, I., Victoria-Feser, M.P. and Hyams, H. (1998) A UK study on the effect of socioeconomic background of pregnant women and hospital practice on the decision to breastfeed and the initiation and duration of breastfeeding, Statistics Research Report LSERR44, London School of Economics, London.
Moy, G. and Mounoud, P. (2003) Object recognition in young adults: is priming with pantomimes possible?, in Catalogue des abstracts: 8ème congrès de la société suisse de psychologie (SSP), Bern, Switzerland.
Mullahy, J. (1986) Specification and testing of some modified count data models. Journal of Econometrics, 33, 341–365.
Müller, S. and Welsh, A.H. (2005) Outlier robust model selection in linear regression. Journal of the American Statistical Association, 100, 1297–1310.
Nardi, A. and Schemper, M. (1999) New residuals for Cox regression and their application to outlier screening. Biometrics, 55, 523–529.
Nelder, J.A. (1966) Inverse polynomials, a useful group of multi-factor response functions. Biometrics, 22, 128–141.
Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized linear models. Journal of the Royal Statistical Society, Series A, 135, 370–384.
Neocleous, T. and Portnoy, S. (2006) A partly linear model for censored regression quantiles, Technical report, Statistics Department, University of Illinois, IL, USA.
Neocleous, T., Vanden Branden, K. and Portnoy, S. (2006) Correction to 'Censored regression quantiles' by S. Portnoy, 98 (2003), 1001–1012. Journal of the American Statistical Association, 101, 860–861.
Newcomb, S. (1886) A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8, 343–366.
Noh, M. and Lee, Y. (2007) Robust modeling for inference from generalized linear model classes. Journal of the American Statistical Association, 102(479), 1059–1072.
Pan, W. (2001) Akaike's information criterion in generalized estimating equations. Biometrics, 57(1), 120–125.
Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–554.
Pearson, E.S. (1931) The analysis of variance in cases of non-normal variation. Biometrika, 23, 114–133.
Pearson, K. (1916) Second supplement to a memoir on skew variation. Philosophical Transactions A, 216, 429–457.
Peña, D. and Yohai, V. (1999) A fast procedure for outlier diagnostics in large regression problems. Journal of the American Statistical Association, 94, 434–445.
Peng, L. and Huang, Y. (2008) Survival analysis with quantile regression models. Journal of the American Statistical Association, 103, 637–649.
Pinheiro, J.C. and Bates, D.M. (2000) Mixed-Effects Models in S and S-PLUS, Springer, New York.
Pinheiro, J.C., Liu, C. and Wu, Y.N. (2001) Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics, 10(2), 249–276.
Portnoy, S. (2003) Censored regression quantiles. Journal of the American Statistical Association, 98, 1001–1012.
Potthoff, R.F. and Roy, S.N. (1964) A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51, 313–326.
Powell, J.L. (1986) Censored regression quantiles. Journal of Econometrics, 32, 143–155.
Pregibon, D. (1982) Resistant fits for some commonly used logistic models with medical applications. Biometrics, 38, 485–498.
Preisser, J.S. and Qaqish, B.F. (1999) Robust regression for clustered data with application to binary responses. Biometrics, 55, 574–579.
Preisser, J.S., Galecki, A.T., Lohman, K.K. and Wagenknecht, L.E. (2000) Analysis of smoking trends with incomplete longitudinal binary responses. Journal of the American Statistical Association, 95, 1021–1031.
Prentice, R.L. (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 1033–1048.
Qu, A. and Song, P.X.K. (2004) Assessing robustness of generalised estimating equations and quadratic inference functions. Biometrika, 91, 447–459.
Qu, A., Lindsay, B.G. and Li, B. (2000) Improving generalised estimating equations using quadratic inference functions. Biometrika, 87(4), 823–836.
Rao, C.R. (1973) Linear Statistical Inference and its Applications, John Wiley & Sons, New York.
Rasch, G. (1960) Probabilistic Models for Some Intelligence and Attainment Tests, Danmarks Paedagogiske Institut, Copenhagen.
Reid, N. and Crépeau, H. (1985) Influence functions for proportional hazards regression. Biometrika, 72, 1–9.
Reimann, C., Filzmoser, P., Garrett, R. and Dutter, R. (2008) Statistical Data Analysis Explained, John Wiley & Sons, Ltd, Chichester.
Renaud, O. and Victoria-Feser, M.P. (2009) Robust coefficient of determination, Technical report, University of Geneva.
Richardson, A.M. (1997) Bounded influence estimation in the mixed linear model. Journal of the American Statistical Association, 92, 154–161.
Richardson, A.M. and Welsh, A.H. (1994) Asymptotic properties of restricted maximum likelihood (REML) estimates for hierarchical mixed linear models. Australian Journal of Statistics, 36, 31–43.
Richardson, A.M. and Welsh, A.H. (1995) Robust restricted maximum likelihood in mixed linear models. Biometrics, 51, 1429–1439.
Ridout, M., Demétrio, C.G.B. and Hinde, J. (1998) Models for count data with many zeros, in Proceedings of the 19th International Biometric Conference, Cape Town, pp. 179–190.
Rieder, H. (1978) A robust asymptotic testing model. Annals of Statistics, 6, 1080–1094.
Rocke, D.M. (1996) Robustness properties of S-estimators of multivariate location and shape in high dimension. Annals of Statistics, 24, 1327–1345.
Ronchetti, E. (1982a) Robust alternatives to the F-test for the linear model, in Probability and Statistical Inference (eds Grossmann, W., Pflug, G.C. and Wertz, W.), Reidel, Dordrecht, pp. 329–342.
Ronchetti, E. (1982b) Robust Testing in Linear Models: The Infinitesimal Approach, PhD thesis, ETH Zürich, Switzerland.
Ronchetti, E. (1997a) Robust inference by influence functions. Journal of Statistical Planning and Inference, 57, 59–72.
Ronchetti, E. (1997b) Robustness aspects of model choice. Statistica Sinica, 7, 327–338.
Ronchetti, E. (2006) Fréchet and robust statistics. Journal de la Société Française de Statistique, 147, 73–75. (Comment on 'Sur une limitation très générale de la dispersion de la médiane' by Maurice Fréchet.)
Ronchetti, E. and Staudte, R.G. (1994) A robust version of Mallows's Cp. Journal of the American Statistical Association, 89, 550–559.
Ronchetti, E. and Trojani, F. (2001) Robust inference with GMM estimators. Journal of Econometrics, 101(1), 37–69.
Ronchetti, E., Field, C. and Blanchard, W. (1997) Robust linear model selection by cross-validation. Journal of the American Statistical Association, 92, 1017–1023.
Rousseeuw, P.J. (1984) Least median of squares regression. Journal of the American Statistical Association, 79, 871–880.
Rousseeuw, P.J. and Leroy, A.M. (1987) Robust Regression and Outlier Detection, John Wiley & Sons, New York.
Rousseeuw, P.J. and Ronchetti, E. (1979) The influence curve for tests, Research Report 21, ETH Zürich, Switzerland.
Rousseeuw, P.J. and Ronchetti, E. (1981) Influence curves for general statistics. Journal of Computational and Applied Mathematics, 7, 161–166.
Rousseeuw, P.J. and Van Driessen, K. (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
Rousseeuw, P.J. and Yohai, V.J. (1984) Robust regression by means of S-estimators, in Robust and Nonlinear Time Series Analysis (eds Franke, J., Härdle, W. and Martin, R.D.), Springer, New York, pp. 256–272.
Rule, A.D., Larson, T.S., Bergstralh, E.J., Slezak, J.M., Jacobsen, S.J. and Cosio, F.G. (2004) Using serum creatinine to estimate glomerular filtration rate: accuracy in good health and in chronic kidney disease. Annals of Internal Medicine, 141(12), 929–938.
Salibian-Barrera, M. and Zamar, R.H. (2002) Bootstrapping robust estimates of regression. Annals of Statistics, 30, 556–582.
Sasieni, P.D. (1993a) Maximum weighted partial likelihood estimates in the Cox model. Journal of the American Statistical Association, 88, 144–152.
Sasieni, P.D. (1993b) Some new estimators for Cox regression. Annals of Statistics, 21, 1721–1759.
Satterthwaite, F.E. (1941) Synthesis of variance. Psychometrika, 6, 309–316.
Scheipl, F., Greven, S. and Küchenhoff, H. (2008) Size and power of tests for a zero random effect variance or polynomial regression in additive and linear mixed models. Computational Statistics and Data Analysis, 52, 3283–3299.
Schemper, M. (1992) Cox analysis of survival data with non-proportional hazard functions. The Statistician, 41, 455–465.
Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Searle, S.R., Casella, G. and McCulloch, C.E. (1992) Variance Components, John Wiley & Sons, Ltd, Chichester.
Self, S.G. and Liang, K.Y. (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82, 605–610.
Silvapulle, M.J. (1992) Robust Wald-type tests of one-sided hypotheses in the linear model. Journal of the American Statistical Association, 87, 156–161.
Simpson, D.G., Ruppert, D. and Carroll, R.J. (1992) On one-step GM estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87, 439–450.
Sinha, S.K. (2004) Robust analysis of generalized linear mixed models. Journal of the American Statistical Association, 99(466), 451–460.
Sommer, S. and Huggins, R.M. (1996) Variables selection using the Wald test and a robust Cp. Applied Statistics, 45, 15–29.
Song, P.X.K. (2007) Correlated Data Analysis: Modeling, Analytics, and Applications, Springer, New York.
Stahel, W.A. and Welsh, A. (1997) Approaches to robust estimation in the simplest variance components model. Journal of Statistical Planning and Inference, 57, 295–319.
Stahel, W.A. and Welsh, A.H. (1992) Robust estimation of variance components, Research Report 69, ETH Zürich.
Staudte, R.G. and Sheather, S.J. (1990) Robust Estimation and Testing, John Wiley & Sons, New York.
Stefanski, L.A., Carroll, R.J. and Ruppert, D. (1986) Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika, 73, 413–424.
Stern, S.E. and Welsh, A.H. (1998) Likelihood inference for small variance components. The Canadian Journal of Statistics, 28, 517–532.
Stigler, S.M. (1973) Simon Newcomb, Percy Daniell, and the history of robust estimation 1885–1920. Journal of the American Statistical Association, 68, 872–879.
Stone, E.J. (1873) On the rejection of discordant observations. Monthly Notices of the Royal Astronomical Society, 34, 9–15.
Stone, M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, Methodological, 39, 44–47.
Stram, D.O. and Lee, J.W. (1994) Variance components testing in the longitudinal mixed effects model. Biometrics, 50, 1171–1177.
Stram, D.O., Wei, L.J. and Ware, J.H. (1988) Analysis of repeated ordered categorical outcomes with possibly missing observations and time-dependent covariates. Journal of the American Statistical Association, 83, 631–637.
Student (1927) Errors of routine analysis. Biometrika, 19, 151–164.
Subrahmaniam, K., Subrahmaniam, K. and Messeri, J.Y. (1975) On the robustness of some tests of significance in sampling from a normal population. Journal of the American Statistical Association, 70, 435–438.
Tai, B.C., White, I.R., Gebski, V. and Machin, D. (2002) On the issue of 'multiple' first failures in competing risks analysis. Statistics in Medicine, 21, 2243–2255.
Tashkin, D.P. et al. (2006) Cyclophosphamide versus placebo in scleroderma lung disease. New England Journal of Medicine, 354(25), 2655–2666.
Tatsuoka, K.S. and Tyler, D.E. (2000) On the uniqueness of S-functionals and M-functionals under nonelliptical distributions. Annals of Statistics, 28(4), 1219–1243.
Therneau, T.M. and Grambsch, P.M. (2000) Modeling Survival Data: Extending the Cox Model, Springer, New York.
Tukey, J.W. (1960) A survey of sampling from contaminated distributions, in Contributions to Probability and Statistics (ed. Olkin, I.), Stanford University Press, Stanford, CA, pp. 448–485.
Tukey, J.W. (1970) Exploratory Data Analysis, Addison-Wesley, Reading, MA. (Mimeographed preliminary edition; published in 1977.)
Valsecchi, M.G., Silvestri, D. and Sasieni, P. (1996) Evaluation of long-term survival: use of diagnostics and robust estimators with Cox's proportional hazards models. Statistics in Medicine, 15, 2763–2780.
Verbeke, G. and Molenberghs, G. (1997) Linear Mixed Models in Practice: A SAS-Oriented Approach (Lecture Notes in Statistics, vol. 126), Springer, New York.
Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data, Springer, New York.
Victoria-Feser, M.P. (2002) Robust inference with binary data. Psychometrika, 67, 21–32.
Victoria-Feser, M.P. (2007) De-biasing weighted MLE via indirect inference: the case of generalized linear latent variable models. Revstat Statistical Journal, 5, 85–96.
von Mises, R. (1947) On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18, 309–348.
Wager, T.D., Keller, M.C., Lacey, S.C. and Jonides, J. (2003) Increased sensitivity in neuroimaging analyses using robust regression. NeuroImage, 26, 99–113.
Wald, A. (1943) Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wang, H.M., Jones, M.P. and Storer, B.E. (2006) Comparison of case-deletion diagnostic methods for Cox regression. Statistics in Medicine, 25, 669–683.
Wang, Y.G., Lin, X. and Zhu, M. (2005) Robust estimating functions and bias correction for longitudinal data analysis. Biometrics, 61, 684–691.
Wedderburn, R.W.M. (1974) Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61, 439–447.
Welsh, A. and Richardson, A. (1997) Approaches to the robust estimation of mixed models, in Handbook of Statistics, vol. 15 (eds Maddala, G. and Rao, C.), Elsevier Science, pp. 343–385.
Welsh, A.H. (1996) Aspects of Statistical Inference (Wiley Series in Probability and Statistics), John Wiley & Sons, New York.
Welsh, A.H. and Ronchetti, E. (1998) Bias-calibrated estimation from sample surveys containing outliers. Journal of the Royal Statistical Society, Series B, Methodological, 60, 413–428.
Welsh, A.H. and Ronchetti, E. (2002) A journey in single steps: robust one-step M-estimation in linear regression. Journal of Statistical Planning and Inference, 103(2), 287–310.
Welsh, A.H., Cunningham, R.B., Donnelly, C.F. and Lindenmayer, D.B. (1996) Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88, 297–308.
Wen, L., Parchman, M.L., Linn, W.D. and Lee, S. (2004) Association between self-monitoring of blood glucose and glycemic control in patients with type 2 diabetes mellitus. American Journal of Health-System Pharmacy, 61, 2401–2405.
Whewell, W. (1837) History of the Inductive Sciences, from the Earliest to the Present Time, Parker, London.
Whewell, W. (1840) Philosophy of the Inductive Sciences, Founded upon their History, Parker, London.
Wilcox, R.R. (1997) Introduction to Robust Estimation and Hypothesis Testing, Academic Press, New York, London.
Wood, A.T.A. (1989) An F approximation to the distribution of a linear combination of chi-squared variables. Communications in Statistics: Simulation and Computation, 18, 1439–1456.
Wood, A.T.A., Booth, J.G. and Butler, R.W. (1993) Saddlepoint approximations to the CDF of some statistics with nonnormal limit distributions. Journal of the American Statistical Association, 88, 680–686.
Yau, K.K.W. and Kuk, A.Y.C. (2002) Robust estimation in generalized linear mixed models. Journal of the Royal Statistical Society, Series B, Methodological, 64, 101–117.
Ylvisaker, D. (1977) Test resistance. Journal of the American Statistical Association, 72, 551–556.
Yohai, V.J. (1987) High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15, 642–656.
Yohai, V.J. and Zamar, R.H. (1998) Optimal locally robust M-estimates of regression. Journal of Statistical Planning and Inference, 64, 309–323.
Yohai, V.J., Stahel, W.A. and Zamar, R.H. (1991) A procedure for robust estimation and inference in linear regression, in Directions in Robust Statistics and Diagnostics, Part II (eds Stahel, W.A. and Weisberg, S.) (The IMA Volumes in Mathematics and its Applications, vol. 34), Springer, Berlin, pp. 365–374.
Zedini, A. (2007) Poisson hurdle model: towards a robustified approach, Master's thesis, University of Geneva.
Zeger, S.L. and Liang, K.Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121–130.
Zeger, S.L., Liang, K.Y. and Albert, P.S. (1988) Models for longitudinal data: a generalized estimating equation approach (Correction: 45, 347). Biometrics, 44, 1049–1060.
Zhang, J. (1996) The sample breakdown of tests. Journal of Statistical Planning and Inference, 52, 161–181.
Zhao, L.P. and Prentice, R.L. (1990) Correlated binary regression using a quadratic exponential model. Biometrika, 77, 642–648.
Index

Cp, see Mallows’ Cp
F-test, 5, 33, 34, 40, 59–62, 83, 84, 88, 94, 106–109
RCp, see Mallows’ RCp
χ²-distribution, see distribution
z-statistic, 35, 37, 40, 94, 105, 186, 195, 197, 199, 205–207, 216, 223
z-test, see z-statistic
adaptive
  procedure, 201, 202
  weights, 202–204, 206, 209
AIC, see Akaike information criterion
Akaike information criterion, 71, 162
  classical (AIC), 46, 72–76, 159
  generalized (GAIC), 159
  robust (RAIC), 73–77, 80, 81, 159, 231, 233
Analysis of variance, see ANOVA
ANOVA, 5, 46–48, 83, 88, 90, 94, 101, 105, 108, 125
ARE, see estimator
asymptotic rejection probability, 21, 31
bias, 14, 15, 20, 28, 48, 83, 93, 139, 160, 162, 192, 215
  asymptotic, 16–18, 21, 139, 196
  correction, 28, 139, 162
  maximal, 19–21
  residual, 7, 98, 209
binary regression, see exponential family - Bernoulli
BLUE, see estimator
bootstrap, 2, 13, 14, 74, 110, 218, 220–224
breakdown point, 14, 16, 20, 22, 23, 26, 27, 30–32, 37, 38, 44, 53, 54, 79, 84, 98–100, 102, 110, 175, 221, 229
  level, 38
  power, 38
coefficient of determination, 66–69
confidence interval, 13, 14, 33, 96, 130, 140–142, 144, 152, 154, 155, 168, 178, 197, 207–210, 220–224
  coverage, 207–209
consistency, 22, 23, 27, 28, 31, 168, 229
  correction, 24, 25, 27, 28, 30, 51, 68, 79, 98, 99, 136–139, 174, 196, 221, 239, 246
  Fisher, see consistency
contrasts, 47, 84–86, 88, 93, 94, 104, 105, 107
correlation, 8–10, 67, 69, 83, 95, 142, 146, 161, 163–170, 174, 176, 182, 245, 246
  m-dependence, 167, 176
  autoregressive, 167, 175, 176, 182
  exchangeable, 165, 170, 171, 175, 177, 181, 182, 186, 246
  serial, 134
  unstructured, 166
  working, 163–165, 168, 173
covariance (matrix), 10, 30–32, 44, 51, 87, 88, 90, 91, 93–95, 98–100, 105, 106, 108, 115, 163, 164, 175, 176, 235
Cox proportional hazard model, see hazard
datasets
  breastfeeding, 146, 150
  cardiovascular risk factors, 9–12, 78–82
  diabetes, 58–62, 65, 69–72, 75–77, 230
  doctor visits, 151
  glomerular filtration rate (GFR), 49, 50, 54–56, 58, 62–69
  GUIDE data, 169–173, 177, 180, 181
  hospital costs, 125, 132, 140, 144, 160
  LEI data, 182, 184, 185
  metallic oxide, 116–119
  myeloma, 193, 197–199, 205–207
  orthodontic, 90, 92, 95, 96, 118, 121, 122
  semantic priming, 89, 99, 107–109, 111, 113, 114
  skin resistance, 85, 86, 88, 96, 97, 99, 103–105, 107, 110–112, 115, 116
  stillbirth in piglets, 186–188
  Veteran’s Administration lung cancer, 193, 209–212, 222–224
deviance, 130–132, 142, 143
  quasi-, 126, 132, 137, 145, 162, 174, 179, 183, 185
  residuals, see residuals
  test, 61, 131–133, 142, 143, 145, 155
diagnostic, 7–9, 48, 52, 133–136, 138, 169, 178, 191, 196–198, 205, 206
distribution
  binomial, see exponential family
  Chi-squared, 31, 40, 42, 43, 52, 60, 61, 94–96, 103, 110, 111, 131–133, 144, 175, 177, 196, 206
  exponential, 127, 201, 204, 205, 207, 211, 212, 216
  Gamma, see exponential family
  gross error, 5, 17, 18, 44
  point mass, see distribution - gross error
  Poisson, see exponential family
efficiency, 13, 18, 22, 23, 25, 26, 28–30, 51, 53, 54, 57, 100, 102, 137, 138, 169, 201, 202, 204, 205, 229, 231, 232
  loss, 28, 29, 46, 58, 141, 169, 204
empirical
  IF, 17, 196, 197, 203, 205
  breakdown point, 20
  distribution, 17, 43, 101, 201, 202
estimator
  GM-, 52–54
  M-, 15, 16, 23–27, 29–31, 39, 41–44, 48, 52, 54, 61, 74, 97, 98, 100, 136, 140, 143, 144, 160, 175, 176, 195, 196, 205
  MM-, 54, 84, 100–102, 104–107, 229
  S-, 30–32, 84, 98–100, 105, 106, 138, 229, 230, 235
  adaptive robust, 202–216
  best linear unbiased, 45
  CBS–MM, 103, 105, 107–109, 111, 112, 117, 119–121
  high breakdown, 26, 27, 47, 53, 84, 100, 138, 175, 230
  Huber’s, 50, 51, 53, 101, 107, 172, 177, 181, 182
  least squares, 45–52, 54–56, 59, 60, 62–66, 68–74, 76, 118, 119, 229, 230
  Mallows’, 52, 136, 147, 152, 156, 157, 172, 177, 189
  maximum likelihood, 5, 13, 16, 22–25, 27–29, 31, 38–41, 45–48, 51, 55, 56, 83, 84, 91–94, 97, 98, 101, 102, 107, 110, 113, 123, 126, 130–134, 152, 182
  partial likelihood, 13, 191–208, 210, 213–215, 222
  restricted (or residual) maximum likelihood (REML), 83, 84, 91, 93, 94, 96–98, 103, 105, 107–112, 117, 120
  Tukey’s biweight, 27–29, 55, 60, 61, 63, 66, 76, 77, 79, 81, 99, 102, 106, 107, 231
  weighted maximum likelihood, 24–29, 50, 53
  weighted partial likelihood, 192, 200
excess of zeros, 5, 152, 158
exchangeable, 166
exponential family, 125, 127, 130, 158, 161, 163, 164, 237
  Bernoulli, 125, 126, 134, 136, 146, 160, 173, 175, 181, 186–188
  binary, see Bernoulli
  binomial, 5, 127–131, 138–140, 151, 158, 164, 237, 239, 240
  Gamma, 127, 129, 131–135, 139, 140, 150, 157, 217, 238–241
  Poisson, 43, 127–131, 138–140, 150, 152, 155, 157, 158, 164, 175, 237, 239–241
exponential weight, 204–208, 210–212, 216
fitted value, 112, 115, 116, 120, 122, 130, 131, 134, 135, 149, 160
generalized linear model, 15, 39, 53, 61, 125, 126, 128–130, 132–134, 136–138, 142, 151, 152, 157–165, 171, 172, 174, 179–181, 189
GES, see gross error sensitivity
GLM, see generalized linear model
gross error model / data generating process, see distribution - gross error
gross error sensitivity, 19–21, 36, 54
hat matrix, 52, 133, 174, 246
hazard, 13, 191, 193, 194, 196, 200, 201, 203, 204, 207, 210, 212, 215, 216, 222–225
  baseline, 193, 194
  cumulative, 194, 201, 202, 213, 215, 221, 224
  function, 192, 193
  proportional, 192
  proportional - Cox model, 9, 12, 191, 193, 194, 204, 214, 221, 224
high breakdown estimator, see estimator
Huber’s
  ψ function, 25, 51
  ρ function, 25
  estimator, see estimator
  proposal II, 51, 53, 98, 139, 174, 182, 186
  weight, 25, 26, 50, 53, 101, 174, 175, 181, 183, 185
hurdle model, 5, 158, 159
IF, see influence function
indirect inference, 27, 139
influence curve, see sensitivity curve
influence function, 15–25, 36, 37, 43, 44, 48, 84, 97, 114, 140, 176, 180, 192, 193, 196–198, 203
  empirical, see empirical
IRWLS, see iterative reweighted least squares
iterative reweighted least squares, 51, 53, 54, 79, 126, 129, 137, 165
Kaplan–Meier, 191, 213, 214, 219, 220
leverage, 52, 100, 101, 133, 135, 136, 138, 141, 147, 174, 177, 221, 224
likelihood
  quasi-, 123, 126, 130, 132, 136, 140, 143, 158, 159, 162, 165, 179, 180, 189
likelihood ratio test
  classical (S², LRT), 38, 40, 42, 44, 46, 59, 60, 70, 83, 94–96, 106, 129–131, 142, 195
  robust (Sρ², LRTρ), 42, 61, 62, 84, 100, 106–110, 206, 231, 233
linear model, see regression model
link function, 128, 131, 132, 138, 143, 145, 151, 152, 155, 157, 163–165, 169, 186, 193
logistic regression, see exponential family - Bernoulli
logit, see link function
LRT, see likelihood ratio test
LS, see estimator
LW variance, see variance - sandwich
Mallows’
  Cp, 46, 73, 74, 159, 189
  RCp, 231, 233
  estimator, see estimator
marginal longitudinal data model, 15, 53, 162, 164
masking effect, 8, 48, 134, 172, 206
missing covariate, 6, 9, 200
mixed linear model, 6, 9, 13–15, 27, 30, 32, 39, 48, 83, 86, 87, 94, 95, 97–100, 102, 110, 112, 123, 161, 162, 165, 204
MLDA, see marginal longitudinal data model
MLE, see estimator
MLM, see mixed linear model
model misspecification, 2, 4–6, 13, 14, 16, 17, 19–21, 35, 37, 136, 193, 214
  distributional, 6, 12, 215
  structural, 6, 9, 199, 214–216
over dispersion, 130, 164, 176
PLE, see estimator
point mass contamination, see distribution - point mass
predicted value, 64, 69, 73, 112, 115
prediction, 1, 54, 69, 72, 84, 112–114, 123, 160, 213
proportional hazard, see hazard
R-squared, see coefficient of determination
RAIC, see Akaike information criterion
Rao test, see score or Rao test
regression model, 4, 8–10, 14, 15, 24–27, 30, 39, 41–48, 53, 55, 56, 58–62, 67, 69–71, 73, 79, 80, 83, 100, 112, 115, 118, 125, 137–139, 143, 159, 192, 204, 209, 229, 230
  non-parametric, 14
  quantiles, 192, 212, 217–220
  quantiles - censored, 192, 193, 217, 219, 222, 224
rejection point, 21, 26
REML, see estimator
residual
  analysis, 8, 48, 62–68, 70, 75, 80, 82, 112, 113, 133, 134, 145, 172
  deviance, 133, 135
  Pearson, 122, 133, 135, 136, 160, 166, 172–174
risk set, 194, 196, 197, 211, 212
robustness
  of efficiency, 34, 35, 38, 44, 201
  of validity, 33, 34, 38, 43, 44, 207
score or Rao test
  classical (R²), 39, 142
  robust (Rρ²), 41, 42, 106
sensitivity curve, 16–18, 23
survival curve, 213, 214
ties, 195, 199, 204, 205, 210, 213
Tukey’s bisquare, see Tukey’s biweight
Tukey’s biweight
  ψ function, 26, 27, 53
  ρ function, 26, 30, 31, 99, 101, 102, 107, 114
  estimator, see estimator
  weights, 68, 101, 107
tuning constant / parameter, 26, 27, 29, 30, 53, 99, 100, 102, 145, 186, 204
variable selection, 46, 59, 70, 73, 74, 79, 80, 126, 142, 144, 147, 148, 150, 154, 162, 179, 182
variance
  asymptotic, 18, 29, 36, 39–42, 57, 93, 100, 101, 105, 137, 140, 158, 168, 195–198, 203, 206, 216, 220, 229, 240
  sandwich, 100, 102, 192, 193, 198–200, 202, 203, 207, 208, 216
Wald test, 6
  classical (W²), 38–41, 46, 61, 62, 74, 83, 94–96, 106, 129, 130, 142, 144, 195, 200, 206, 207, 209
  robust (Wρ²), 16, 41–44, 106, 107, 110, 206–208, 216
weighted partial likelihood, see estimator
WMLE, see estimator
zero-inflated model, 5, 158, 162

Robust Methods in Biostatistics. S. Heritier, E. Cantoni, S. Copt and M.-P. Victoria-Feser. © 2009 John Wiley & Sons, Ltd.