Applied Survey Data Analysis

Authors:
Steve G. Heeringa, Brady T. West, and Patricia A. Berglund
Publisher: Chapman & Hall/CRC
Copyright: 2010
ISBN-13: 978-1-4200-8066-7
Pages: 462; hardcover
Price: $69.50

Comment from the Stata technical group

Applied Survey Data Analysis is an intermediate-level, example-driven treatment of current methods for complex survey data. It will appeal to researchers in any discipline who work with survey data and have a basic knowledge of applied statistical methodology for standard (nonsurvey) data.

The authors begin with some history and a discussion of some widely used survey datasets, such as the National Health and Nutrition Examination Survey (NHANES). They then cover the basic concepts of survey data: sampling plans, weights, clustering, prestratification and poststratification, design effects, and multistage samples. Discussion then turns to the types of variance estimators: Taylor linearization, jackknife, bootstrap, and balanced repeated replication.

The middle sections of the text provide in-depth coverage of the types of analyses that can be performed with survey data, including means and proportions, correlations, tables, linear regression, regression with limited dependent variables (including logit and Poisson), and survival analysis (including Cox regression). Two final chapters are devoted to advanced topics, such as multiple imputation, Bayesian analysis, and multilevel models. The appendix provides overviews of popular statistical software, including Stata.


Table of contents

Preface
1. Applied Survey Data Analysis: Overview
1.1 Introduction
1.2 A Brief History of Applied Survey Data Analysis
1.2.1 Key Theoretical Developments
1.2.2 Key Software Developments
1.3 Example Data Sets and Exercises
1.3.1 The National Comorbidity Survey Replication (NCS-R)
1.3.2 The Health and Retirement Study (HRS)—2006
1.3.3 The National Health and Nutrition Examination Survey (NHANES)—2005, 2006
1.3.4 Steps in Applied Survey Data Analysis
1.3.4.1 Step 1: Definition of the Problem and Statement of the Objectives
1.3.4.2 Step 2: Understanding the Sample Design
1.3.4.3 Step 3: Understanding Design Variables, Underlying Constructs, and Missing Data
1.3.4.4 Step 4: Analyzing the Data
1.3.4.5 Step 5: Interpreting and Evaluating the Results of the Analysis
1.3.4.6 Step 6: Reporting of Estimates and Inferences from the Survey Data
2. Getting to Know the Complex Sample Design
2.1 Introduction
2.1.1 Technical Documentation and Supplemental Literature Review
2.2 Classification of Sample Designs
2.2.1 Sampling Plans
2.2.2 Inference from Survey Data
2.3 Target Populations and Survey Populations
2.4 Simple Random Sampling: A Simple Model for Design-Based Inference
2.4.1 Relevance of SRS to Complex Sample Survey Data Analysis
2.4.2 SRS Fundamentals: A Framework for Design-Based Inference
2.4.3 An Example of Design-Based Inference under SRS
2.5 Complex Sample Design Effects
2.5.1 Design Effect Ratio
2.5.2 Generalized Design Effects and Effective Sample Sizes
2.6 Complex Samples: Clustering and Stratification
2.6.1 Clustered Sampling Plans
2.6.2 Stratification
2.6.3 Joint Effects of Sample Stratification and Clustering
2.7 Weighting in Analysis of Survey Data
2.7.1 Introduction to Weighted Analysis of Survey Data
2.7.2 Weighting for Probabilities of Selection
2.7.3 Nonresponse Adjustment Weights
2.7.3.1 Weighting Class Approach
2.7.3.2 Propensity Cell Adjustment Approach
2.7.4 Poststratification Weight Factors
2.7.5 Design Effects Due to Weighted Analysis
2.8 Multistage Area Probability Sample Designs
2.8.1 Primary Stage Sampling
2.8.2 Secondary Stage Sampling
2.8.3 Third and Fourth Stage Sampling of Housing Units and Eligible Respondents
2.9 Special Types of Sampling Plans Encountered in Surveys
3. Foundations and Techniques for Design-Based Estimation and Inference
3.1 Introduction
3.2 Finite Populations and Superpopulation Models
3.3 Confidence Intervals for Population Parameters
3.4 Weighted Estimation of Population Parameters
3.5 Probability Distributions and Design-Based Inference
3.5.1 Sampling Distributions of Survey Estimates
3.5.2 Degrees of Freedom for t under Complex Sample Designs
3.6 Variance Estimation
3.6.1 Simplifying Assumptions Employed in Complex Sample Variance Estimation
3.6.2 The Taylor Series Linearization Method
3.6.2.1 TSL Step 1
3.6.2.2 TSL Step 2
3.6.2.3 TSL Step 3
3.6.2.4 TSL Step 4
3.6.2.5 TSL Step 5
3.6.3 Replication Methods for Variance Estimation
3.6.3.1 Jackknife Repeated Replication
3.6.3.2 Balanced Repeated Replication
3.6.3.3 The Bootstrap
3.6.4 An Example Comparing the Results from TSL, JRR, and BRR Methods
3.7 Hypothesis Testing in Survey Data Analysis
3.8 Total Survey Error and Its Impact on Survey Estimation and Inference
3.8.1 Variable Errors
3.8.2 Biases in Survey Data
4. Preparation for Complex Sample Survey Data Analysis
4.1 Introduction
4.2 Analysis Weights: Review by the Data User
4.2.1 Identification of the Correct Weight Variables for the Analysis
4.2.2 Determining the Distribution and Scaling of the Weight Variables
4.2.3 Weighting Applications: Sensitivity of Survey Estimates to the Weights
4.3 Understanding and Checking the Sampling Error Calculation Model
4.3.1 Stratum and Cluster Codes in Complex Sample Survey Data Sets
4.3.2 Building the NCS-R Sampling Error Calculation Model
4.3.3 Combining Strata, Randomly Grouping PSUs, and Collapsing Strata
4.3.4 Checking the Sampling Error Calculation Model for the Survey Data Set
4.4 Addressing Item Missing Data in Analysis Variables
4.4.1 Potential Bias Due to Ignoring Missing Data
4.4.2 Exploring Rates and Patterns of Missing Data Prior to Analysis
4.5 Preparing to Analyze Data for Sample Subpopulations
4.5.1 Subpopulation Distributions across Sample Design Units
4.5.2 The Unconditional Approach for Subclass Analysis
4.5.3 Preparation for Subclass Analyses
4.6 A Final Checklist for Data Users
5. Descriptive Analysis for Continuous Variables
5.1 Introduction
5.2 Special Considerations in Descriptive Analysis of Complex Sample Survey Data
5.2.1 Weighted Estimation
5.2.2 Design Effects for Descriptive Statistics
5.2.3 Matching the Method to the Variable Type
5.3 Simple Statistics for Univariate Continuous Distributions
5.3.1 Graphical Tools for Descriptive Analysis of Survey Data
5.3.2 Estimation of Population Totals
5.3.3 Means of Continuous, Binary, or Interval Scale Data
5.3.4 Standard Deviations of Continuous Variables
5.3.5 Estimation of Percentiles and Medians of Population Distributions
5.4 Bivariate Relationships between Two Continuous Variables
5.4.1 X–Y Scatterplots
5.4.2 Product Moment Correlation Statistic (r)
5.4.3 Ratios of Two Continuous Variables
5.5 Descriptive Statistics of Subpopulations
5.6 Linear Functions of Descriptive Estimates and Differences of Means
5.6.1 Differences of Means for Two Subpopulations
5.6.2 Comparing Means over Time
5.7 Exercises
6. Categorical Data Analysis
6.1 Introduction
6.2 A Framework for Analysis of Categorical Survey Data
6.2.1 Incorporating the Complex Design and Pseudo-Maximum Likelihood
6.2.2 Proportions and Percentages
6.2.3 Cross-Tabulations, Contingency Tables, and Weighted Frequencies
6.3 Univariate Analysis of Categorical Data
6.3.1 Estimation of Proportions for Binary Variables
6.3.2 Estimation of Category Proportions for Multinomial Variables
6.3.3 Testing Hypotheses Concerning a Vector of Population Proportions
6.3.4 Graphical Display for a Single Categorical Variable
6.4 Bivariate Analysis of Categorical Data
6.4.1 Response and Factor Variables
6.4.2 Estimation of Total, Row, and Column Proportions for Two-Way Tables
6.4.3 Estimating and Testing Differences in Subpopulation Proportions
6.4.4 Chi-Square Tests of Independence of Rows and Columns
6.4.5 Odds Ratios and Relative Risks
6.4.6 Simple Logistic Regression to Estimate the Odds Ratio
6.4.7 Bivariate Graphical Analysis
6.5 Analysis of Multivariate Categorical Data
6.5.1 The Cochran–Mantel–Haenszel Test
6.5.2 Log-Linear Models for Contingency Tables
6.6 Exercises
7. Linear Regression Models
7.1 Introduction
7.2 The Linear Regression Model
7.2.1 The Standard Linear Regression Model
7.2.2 Survey Treatment of the Regression Model
7.3 Four Steps in Linear Regression Analysis
7.3.1 Step 1: Specifying and Refining the Model
7.3.2 Step 2: Estimation of Model Parameters
7.3.2.1 Estimation for the Standard Linear Regression Model
7.3.2.2 Linear Regression Estimation for Complex Sample Survey Data
7.3.3 Step 3: Model Evaluation
7.3.3.1 Explained Variance and Goodness of Fit
7.3.3.2 Residual Diagnostics
7.3.3.3 Model Specification and Homogeneity of Variance
7.3.3.4 Normality of the Residual Errors
7.3.3.5 Outliers and Influence Statistics
7.3.4 Step 4: Inference
7.3.4.1 Inference Concerning Model Parameters
7.3.4.2 Prediction Intervals
7.4 Some Practical Considerations and Tools
7.4.1 Distribution of the Dependent Variable
7.4.2 Parameterization and Scaling for Independent Variables
7.4.3 Standardization of the Dependent and Independent Variables
7.4.4 Specification and Interpretation of Interactions and Nonlinear Relationships
7.4.5 Model-Building Strategies
7.5 Application: Modeling Diastolic Blood Pressure with the NHANES Data
7.5.1 Exploring the Bivariate Relationships
7.5.2 Naïve Analysis: Ignoring Sample Design Features
7.5.3 Weighted Regression Analysis
7.5.4 Appropriate Analysis: Incorporating All Sample Design Features
7.6 Exercises
8. Logistic Regression and Generalized Linear Models for Binary Survey Variables
8.1 Introduction
8.2 Generalized Linear Models for Binary Survey Responses
8.2.1 The Logistic Regression Model
8.2.2 The Probit Regression Model
8.2.3 The Complementary Log–Log Model
8.3 Building the Logistic Regression Model: Stage 1, Model Specification
8.4 Building the Logistic Regression Model: Stage 2, Estimation of Model Parameters and Standard Errors
8.5 Building the Logistic Regression Model: Stage 3, Evaluation of the Fitted Model
8.5.1 Wald Tests of Model Parameters
8.5.2 Goodness of Fit and Logistic Regression Diagnostics
8.6 Building the Logistic Regression Model: Stage 4, Interpretation and Inference
8.7 Analysis Application
8.7.1 Stage 1: Model Specification
8.7.2 Stage 2: Model Estimation
8.7.3 Stage 3: Model Evaluation
8.7.4 Stage 4: Model Interpretation/Inference
8.8 Comparing the Logistic, Probit, and Complementary Log–Log GLMs for Binary Dependent Variables
8.9 Exercises
9. Generalized Linear Models for Multinomial, Ordinal, and Count Variables
9.1 Introduction
9.2 Analyzing Survey Data Using Multinomial Logit Regression Models
9.2.1 The Multinomial Logit Regression Model
9.2.2 Multinomial Logit Regression Model: Specification Stage
9.2.3 Multinomial Logit Regression Model: Estimation Stage
9.2.4 Multinomial Logit Regression Model: Evaluation Stage
9.2.5 Multinomial Logit Regression Model: Interpretation Stage
9.2.6 Example: Fitting a Multinomial Logit Regression Model to Complex Sample Survey Data
9.3 Logistic Regression Models for Ordinal Survey Data
9.3.1 Cumulative Logit Regression Model
9.3.2 Cumulative Logit Regression Model: Specification Stage
9.3.3 Cumulative Logit Regression Model: Estimation Stage
9.3.4 Cumulative Logit Regression Model: Evaluation Stage
9.3.5 Cumulative Logit Regression Model: Interpretation Stage
9.3.6 Example: Fitting a Cumulative Logit Regression Model to Complex Sample Survey Data
9.4 Regression Models for Count Outcomes
9.4.1 Survey Count Variables and Regression Modeling Alternatives
9.4.2 Generalized Linear Models for Count Variables
9.4.2.1 The Poisson Regression Model
9.4.2.2 The Negative Binomial Regression Model
9.4.2.3 Two-Part Models: Zero-Inflated Poisson and Negative Binomial Regression Models
9.4.3 Regression Models for Count Data: Specification Stage
9.4.4 Regression Models for Count Data: Estimation Stage
9.4.5 Regression Models for Count Data: Evaluation Stage
9.4.6 Regression Models for Count Data: Interpretation Stage
9.4.7 Example: Fitting Poisson and Negative Binomial Regression Models to Complex Sample Survey Data
9.5 Exercises
10. Survival Analysis of Event History Survey Data
10.1 Introduction
10.2 Basic Theory of Survival Analysis
10.2.1 Survey Measurement of Event History Data
10.2.2 Data for Event History Models
10.2.3 Important Notation and Definitions
10.2.4 Models for Survival Analysis
10.3 (Nonparametric) Kaplan–Meier Estimation of the Survivor Function
10.3.1 K–M Model Specification and Estimation
10.3.2 K–M Estimator—Evaluation and Interpretation
10.3.3 K–M Survival Analysis Example
10.4 Cox Proportional Hazards Model
10.4.1 Cox Proportional Hazards Model: Specification
10.4.2 Cox Proportional Hazards Model: Estimation Stage
10.4.3 Cox Proportional Hazards Model: Evaluation and Diagnostics
10.4.4 Cox Proportional Hazards Model: Interpretation and Presentation of Results
10.4.5 Example: Fitting a Cox Proportional Hazards Model to Complex Sample Survey Data
10.5 Discrete Time Survival Models
10.5.1 The Discrete Time Logistic Model
10.5.2 Data Preparation for Discrete Time Survival Models
10.5.3 Discrete Time Models: Estimation Stage
10.5.4 Discrete Time Models: Evaluation and Interpretation
10.5.5 Fitting a Discrete Time Model to Complex Sample Survey Data
10.6 Exercises
11. Multiple Imputation: Methods and Applications for Survey Analysts
11.1 Introduction
11.2 Important Missing Data Concepts
11.2.1 Sources and Patterns of Item-Missing Data in Surveys
11.2.2 Item-Missing Data Mechanisms
11.2.3 Implications of Item-Missing Data for Survey Data Analysis
11.2.4 Review of Strategies to Address Item-Missing Data in Surveys
11.3 An Introduction to Imputation and the Multiple Imputation Method
11.3.1 A Brief History of Imputation Procedures
11.3.2 Why the Multiple Imputation Method?
11.3.3 Overview of Multiple Imputation and MI Phases
11.4 Models for Multiply Imputing Missing Data
11.4.1 Choosing the Variables to Include in the Imputation Model
11.4.2 Distributional Assumptions for the Imputation Model
11.5 Creating the Imputations
11.5.1 Transforming the Imputation Problem to Monotonic Missing Data
11.5.2 Specifying an Explicit Multivariate Model and Applying Exact Bayesian Posterior Simulation Methods
11.5.3 Sequential Regression or “Chained Regressions”
11.6 Estimation and Inference for Multiply Imputed Data
11.6.1 Estimators for Population Parameters and Associated Variance Estimators
11.6.2 Model Evaluation and Inference
11.7 Applications to Survey Data
11.7.1 Problem Definition
11.7.2 The Imputation Model for the NHANES Blood Pressure Example
11.7.3 Imputation of the Item-Missing Data
11.7.4 Multiple Imputation Estimation and Inference
11.7.4.1 Multiple Imputation Analysis 1: Estimation of Mean Diastolic Blood Pressure
11.7.4.2 Multiple Imputation Analysis 2: Estimation of the Linear Regression Model for Diastolic Blood Pressure
11.8 Exercises
12. Advanced Topics in the Analysis of Survey Data
12.1 Introduction
12.2 Bayesian Analysis of Complex Sample Survey Data
12.3 Generalized Linear Mixed Models (GLMMs) in Survey Data Analysis
12.3.1 Overview of Generalized Linear Mixed Models
12.3.2 Generalized Linear Mixed Models and Complex Sample Survey Data
12.3.3 GLMM Approaches to Analyzing Longitudinal Survey Data
12.3.4 Example: Longitudinal Analysis of the HRS Data
12.3.5 Directions for Future Research
12.4 Fitting Structural Equation Models to Complex Sample Survey Data
12.5 Small Area Estimation and Complex Sample Survey Data
12.6 Nonparametric Methods for Complex Sample Survey Data
Appendix A: Software Overview
A.1 Introduction
A.1.1 Historical Perspective
A.1.2 Software for Sampling Error Estimation
A.2 Overview of Stata® Version 10+
A.3 Overview of SAS® Version 9.2
A.3.1 The SAS SURVEY Procedures
A.4 Overview of SUDAAN® Version 9.0
A.4.1 The SUDAAN Procedures
A.5 Overview of SPSS®
A.5.1 The SPSS Complex Samples Commands
A.6 Overview of Additional Software
A.6.1 WesVar®
A.6.2 IVEware (Imputation and Variance Estimation Software)
A.6.3 Mplus
A.6.4 The R survey Package
A.7 Summary
References
Index