Series Editors:

 

Albert:Bayesian Computation with R

CooklSwayne: Interactive and Dynamic Graphics for Data Analysis: With R and GGobi

Paradis: Analysis of Phylogenetics and Evolution with R

Pfaff.- Analysis of Integrated and Cointegrated Time Series with R

 

Jim Albert

 

There has been dramatic growth in the development and application of Bayesian inference in statistics. Berger (2000) documents the increase in Bayesian activity by the number of published research articles, the number of books, and the extensive number of applications of Bayesian articles in applied disciplines such as science and engineering.

One reason for the dramatic growth in Bayesian modeling is the availability of computational algorithms to compute the range of integrals that are necessary in a Bayesian posterior analysis. Due to the speed of modern computers, it is now possible to use the Bayesian paradigm to fit very complex models that cannot be fit by alternative frequentist methods.

To fit Bayesian models, one needs a statistical computing environment. This environment should be such that one can

• write short scripts to define a Bayesian model

• use or write functions to summarize a posterior distribution

• use functions to simulate from the posterior distribution

• construct graphs to illustrate the posterior inference

An environment that meets these requirements is the R system. R provides a wide range of functions for data manipulation, calculation, and graphical displays. Moreover, it includes a well-developed, simple programming language that users can extend by adding new functions. Many such extensions of the language in the form of packages are easily downloadable from the Comprehensive R Archive Network (CRAN).

The purpose of this book is to introduce Bayesian modeling by the use of computation using the R language. At Bowling Green State University, I have taught an introductory Bayesian inference class to students in masters and doctoral programs in statistics for which this book would be appropriate. This book would serve as a useful companion to the introductory Bayesian texts Gelman et al (2003), Carlin and Louis (2000), Press (2003), Gill (2002), or Lee (2004). Also the book would be valuable to the statistical practitioner who wishes to learn more about the R language and Bayesian methodology.

Chapters 2, 3, and 4 illustrate the use of R for Bayesian inference for standard one- and two-parameter problems. These chapters discuss the use of different types of priors, the use of the posterior distribution to perform different types of inference, and the use of the predictive distribution. The base package of R provides functions to simulate from all of the standard probability distributions, and these functions can be used to simulate from a variety of posterior distributions. Modern Bayesian computing is introduced in Chapters 5 and 6. Chapter 5 discusses the summarization of posterior distribution by posterior modes and introduces rejection sampling and the Monte Carlo approach for computing integrals. Chapter 6 introduces the fundamental ideas of Markov chain Monte Carlo (MCMC) methods and the use of MCMC output analysis to decide if the batch of simulated draws provides a reasonable approximation to the posterior distribution of interest. The remaining chapters illustrate the use of these computational algorithms for a variety of Bayesian applications. Chapter 7 introduces the use of exchangeable models in the simultaneous estimation of a set of Poisson rates. Chapter 8 describes Bayesian tests of simple hypotheses and the use of Bayes factors in comparing models. Chapter 9 describes Bayesian regression models, and Chapter 10 describes several applications, such as robust modeling, binary regression with a probit link, and order-restricted inference that are well-suited for the Gibbs sampling algorithm. Chapter 11 describes the use of R to interface with WinBUGS, a popular program for implementing MCMC algorithms.

An R package, LearnBayes, available from the CRAN site, has been written to accompany this text. This package contains all of the Bayesian R functions and datasets described in the book. One goal in writing LearnBayes is to provide guidance for the student and applied statistician in writing short R functions for implementing Bayesian calculations for their specific problems. Also the LearnBayes package will make it easier for users to use the growing number of R packages for fitting a variety of Bayesian models.

I would like to express my appreciation for the people who provided assistance in preparing this book. John Kimmel, my editor, was most helpful in encouraging me to write this book and provide helpful feedback. I am appreciative of Patricia Williamson and Sherwin Toribio for providing useful suggestions. I am appreciative to all of the students at Bowling Green who have enrolled in my Bayesian statistics class over the years. Finally, but certainly not least, I wish to thank my wife Anne and my children Lynne, Bethany and Steven for encouragement and inspiration.

 

1 An Introduction to R

1.1 Overview

1.2 Exploring a Student Dataset

1.2.1 Introduction to the Dataset

1.2.2 Reading the Data into R

1.2.3 R Commands to Summarize and Graph a Single Batch.

1.2.4 R Commands to Compare Batches

1.2.5 R Commands for Studying Relationships

1.3 Exploring the Robustness of the t Statistic

1.3.1 Introduction

1.3.2 Writing a Function to Compute the t Statistic

1.3.3 Programming a Monte Carlo Simulation

1.3.4 The Behavior of the True Significance Level Under Different Assumptions

1.4 Further Reading

1.5 Summary of R Functions

1.6 Exercises

2 Introduction to Bayesian Thinking

2.1 Introduction

2.2 Learning About the Proportion of Heavy Sleepers

2.3 Using a Discrete Prior

2.4 Using a Beta Prior

2.5 Using a Histogram Prior

2.6 Prediction

2.7 Further Reading

2.8 Summary of R Functions

2.9 Exercises

3 Single-Parameter Models

3.1 Introduction

3.2 Normal Distribution with Known Mean but Unknown Variance

3.3 Estimating a Heart Transplant Mortality Rate

3.4 An Illustration of Bayesian Robustness

3.5 A Bayesian Test of the Fairness of a Coin

3.6 Further Reading

3.7 Summary of R Functions

3.8 Exercises

4 Multiparameter Models

4.1 Introduction

4.2 Normal Data with Both Parameters Unknown

4.3 A Multinomial Model

4.4 A Bioassay Experiment

4.5 Comparing Two Proportions

4.6 Further Reading

4.7 Summary of R Functions

4.8 Exercises

5 Introduction to Bayesian Computation

5.1 Introduction

5.2 Computing Integrals

5.3 Setting Up a Problem on R

5.4 A Beta-Binomial Model for Overdispersion

5.5 Approximations Based on Posterior Modes

5.6 The Example

5.7 Monte Carlo Method for Computing Integrals

5.8 Rejection Sampling

5.9 Importance Sampling

5.10 Sampling Importance Resampling

5.11 Further Reading

5.12 Summary of R Functions

5.13 Exercises

6 Markov Chain Monte Carlo Methods

6.1 Introduction

6.2 Introduction to Discrete Markov Chains

6.3 Metropolis-Hasting Algorithms

6.4 Gibbs Sampling

6.5 MCMC Output Analysis

6.6 A Strategy in Bayesian Computing

6.7 Learning About a Normal Population from Grouped Data

6.8 Example of Output Analysis

6.9 Modeling Data with Cauchy Errors

6.10 Analysis of the Stanford Heart Transplant Data

6.11 Further Reading

6.12 Summary of R Functions

6.13 Exercises

7 Hierarchical Modeling

7.1 Introduction

7.2 Introduction to Hierarchical Modeling

7.3 Individual and Combined Estimates

7.4 Equal Mortality Rates?

7.5 Modeling a Prior Belief of Exchangeability

7.6 Posterior Distribution

7.7 Simulating from the Posterior

7.8 Posterior Inferences

7.8.1 Shrinkage

7.8.2 Comparing Hospitals

7.9 Posterior Predictive Model Checking

7.10 Further Reading

7.11 Summary of R Functions

7.12 Exercises

8 Model Comparison

8.1 Introduction

8.2 Comparison of Hypotheses

8.3 A One-Sided Test of a Normal Mean

8.4 A Two-Sided Test of a Normal Mean

8.5 Comparing Two Models

8.6 Models for Soccer Goals

8.7 Is a Baseball Hitter Really Streaky?

8.8 A Test of Independence in a Two-Way Contingency Table

8.9 Further Reading....

8.10 Summary of R Functions

8.11 Exercises

Regression Models

9.1 Introduction

9.2 Normal Linear Regression

9.2.1 The Model

9.2.2 The Posterior Distribution

9.2.3 Prediction of Future Observations

9.2.4 Computation

9.2.5 Model Checking

9.2.6 An Example

9.3 Survival Modeling

9.4 Further Reading

9.5 Summary of R Functions

9.6 Exercises

10 Gibbs Sampling

10.1 Introduction

10.2 Robust Modeling

10.3 Binary Response Regression with a Probit Link

10.4 Estimating a Table of Means

10.4.1 Introduction

10.4.2 A Flat Prior Over the Restricted Space

10.4.3 A Hierarchical Regression Prior

10.4.4 Predicting the Success of Future Students

10.5 Further Reading

10.6 Summary of R Functions

10.7 Exercises

11 Using R to Interface with WinBUGS

11.1 Introduction to WinBUGS

11.2 An R Interface to WinBUGS

11.3 MCMC Diagnostics Using the boa Package

11.4 A Change-Point Model

11.5 A Robust Regression Model

11.6 Estimating Career Trajectories

11.7 Further Reading

11.8 Exercises

References

Index

 

1.1 Overview

R is a rich environment for statistical computing and has many capabilities to explore data in its base package. In addition, R contains a collection of functions for simulating and summarizing the familiar one-parameter probability distributions. One goal of this chapter is to provide a brief introduction to basic commands for summarizing and graphing data. We illustrate these commands on a dataset about students in an introductory statistics class. A second goal of this chapter is to introduce the use of R as an environment for programming Monte Carlo simulation studies. We describe a simple Monte Carlo study to explore the behavior of the two-sample t statistic when testing from populations that deviate from the usual assumptions. We will find these data analysis and simulation commands very helpful in Bayesian computation.

1.2 Exploring a Student Dataset

1.2.1 Introduction to the Dataset

To illustrate some basic commands for summarizing and graphing data, we consider answers from a sheet of questions given to all students in an introductory statistics class at Bowling Green State University. Some of the questions that were asked included:

1. What is your gender?

2. What is your height in inches?

3. Choose a whole number between 1 and 10.

4. Give the time you went to bed last night.

5. Give the time you woke up this morning.

6. What was the cost (in dollars) of your last haircut including the tip?

7. Do you prefer water, pop, or milk with your evening meal?

This is a rich dataset that can be used to illustrate methods for exploring a single batch of categorical or quantitative data, for comparing subgroups of the data, such as comparing the haircut costs of men and women, and for exploring relationships.

1.2.2 Reading the Data into R

The data for 657 students were recorded in a spreadsheet and saved as the file "studentdata.txt" in text format with tabs between the fields. The first line of the data file is a header that includes the variable names.

One can read this data into R by the read. table command. There are three arguments used in this command. The first argument is the name of the datafile in quotes; the next argument, sep, indicates that fields in the file are separated by tab characters; and the header=TRUE argument indicates that the file has a header line with the variable names. This dataset is stored in the R data frame called studentdata.

This dataset is also available as part of the LearnBayes package. Assuming that the package has been installed and loaded into R, one accesses the data by means of the data command:

To see the variable names, we display the first row of the data frame by the studentdata [1, ] command.

To make the variable names visible in the R environment, we use the attach command.

1.2.3 R Commands to Summarize and Graph a Single Batch

One categorical variable in this dataset is Drink which indicates the student's drinking preference between milk, pop, and water. One can tally the different responses by the table command.

We see that more than half the students preferred water, and pop was more popular than milk.

One can graph these frequencies with a bar graph by the barplot command. We first save the output of table in the variable t and then use barplot with t as an argument. We add labels to the horizontal and vertical axes by the xlab and ylab argument options. Fig. 1.1 displays the resulting graph.

Fig. 1.1. Bar plot of the drinking preference of the statistics students.

Suppose we are next interested in examining how long the students slept the previous night. We did not directly ask the students about their sleeping time, but we can compute a student's hours of sleep by subtracting her goto-bed time from her wakeup time. In R we perform this computation for all students and the variable hours.of.sleep contains the sleeping times.

A simple way to summarize this quantitative variable is by the summary command that gives a variety of descriptive statistics about the variable.

On average, we see that students slept 7.5 hours and half of the students slept between 6.5 and 8.5 hours.

To see the distribution of sleeping times, we can construct a histogram using the hist command (see Fig. 1.2).

The shape of this distribution looks symmetric about the average value of 7.5 hours.

1.2.4 R Commands to Compare Batches

Since the gender of each student was recorded, one can make comparisons between men and women on any of the quantitative variables. Do men tend to sleep longer than women? We can answer this question graphically by constructing parallel boxplots of the sleeping times of men and women. Parallel boxplots can be displayed by the boxplot command. The argument is given by

This indicates that a boxplot of the hours of sleep will be constructed for each level of Gender. The resulting graph is displayed in Fig. 1.3. From the display, it appears that men and women are similar with respect to their sleeping time.

For other variables, there are substantial differences between the two genders. Suppose we wish to divide the haircut prices into two groups - the haircut prices for the men and the haircut prices for the women. We do this by use of the R logical operator ==. The syntax

Fig. 1.2. Histogram of the hours of sleep of the statistics students.

is a logical statement that will be TRUE if Gender is "female"; otherwise it will be FALSE. The expression

will produce a subset of Haircut according to when the condition is TRUE. So the statement

will select the haircut prices only for the female students and store the prices into the variable female.Haircut. Similarly, we use the logical operator to store the male haircut prices into the variable male.Haircut.

By use of the summary command, we summarize the haircut prices of the women and the men.

Fig. 1.3. Parallel boxplots of the hours of sleep of the male and female students.

We see large differences between men and women - the men average about $10 for a haircut and the women average about $34.

1.2.5 R Commands for Studying Relationships

There are many interesting relationships that can be explored in this student dataset. To get a good night's sleep, one may want to go to bed early in the evening. This raises the question: Is the length of sleep for a student related to the time that he or she goes to bed?" We can explore the relationship between the ToSleep and hours.of.sleep variables by means of a scatterplot. The R command plot (ToSleep, hours.of. sleep) will construct a scatterplot with ToSleep on the horizontal scale and hours.of.sleep on the vertical scale. If we draw this scatterplot, it is a little difficult to see the pattern in the graph since many of the points are identical. We use the jitter function on each variable before plotting - this has the effect of adding a small amount of noise so that more points are visible on the graph (see Fig. 1.4).

Fig. 1.4. Scatterplot of wake-up time and hours of sleep for students.

We can describe the decreasing pattern in this scatterplot by fitting a line. A least-squares fit is done using the in command; this has the syntax

The output of this fitting is stored in the variable fit. If we display this variable, we see the intercept and slope of the least-squares line.

The slope is approximately -0.5, which means that a student loses about a half-hour of sleep for every hour later that he or she goes to bed.

We can display this line on top of the scatterplot by the abline command (see Fig. 1.5) . There are two arguments to the function, the intercept and slope of the line to be plotted.

Fig. 1.5. Scatterplot of wake-up time and hours of sleep for students with leastsquares line plotted on top.

1.3 Exploring the Robustness of the t Statistic

1.3.1 Introduction

Suppose one has two independent samples xl,..., x,,,,, and yl,..., y, and one wishes to test the hypothesis that the mean of the x population is equal to the mean of the y population:

where sp is the pooled standard deviation

Under the hypothesis Ho, the test statistic T has a t distribution with rn+n-2 degrees of freedom when

• both the xs and ys are independent random samples from normal distributions

• the standard deviations of the x and y populations, Qx and Qy, are equal

Suppose the level of significance of the test is set at a. Then one will reject H when

where tdf,, is the (1 - a) quantile of a t random variable with df degrees of freedom.

If the underlying assumptions of normal populations and equal variances hold, then the level of significance of the t-test will be the stated level of a. But in practice, many people use the t statistic to compare two samples even when the underlying assumptions are in doubt. So an interesting problem is to investigate the robustness or sensitivity of this popular test statistic with respect to changes in the assumptions. If the stated significance level is a=.10 and the populations are skewed or have heavy tails, what will be the true significance level? If the assumption of equal variances is violated and there are significant differences in the spreads of the two populations, what is the true significance level? One can answer these questions through a Monte Carlo simulation study. R is a very suitable platform for writing a simulation algorithm. One can generate random samples from a wide variety of probability distributions and R has an extensive set of data analysis capabilities for summarizing and graphing the simulation output. Here we illustrate the construction of a simple R function to address the robustness of the t statistic.

1.3.2 Writing a Function to Compute the t Statistic

To begin, we generate some random data for the samples of xs and ys. We simulate a sample of 10 observations from a normal distribution with mean 50 and standard deviation 10 using the rnorm function and store the vector of values in the variable x. Likewise we simulate a sample of ys by simulating 10 values from a N(50, 10) distribution and store these values in the variable Y.

Next we write a few lines of R code to compute the value of the t statistic from the samples in x and y. We find the sample sizes m and n by the R length command.

We compute the pooled standard deviation sp - in the R code, sd is the standard deviation function and sqrt takes the square root of its argument.

With m, n, and sp defined, we compute the t statistic

By combining these R statements, we can write a short R function tstatistic to compute the t statistic. This function has two arguments, the vectors x and y, and the output of the function (indicated by the return statement) is the value of the t statistic.

Suppose this function has been saved in the file "tstatistic.R". We enter this function into R by means of the source command.

We try the function by placing some fake data in vectors data.x and data.y and then computing the t statistic on these data:

1.3.3 Programming a Monte Carlo Simulation

Suppose we are interested in learning about the true significance level for the t statistic when the populations don't follow the standard assumptions of normality and equal variances. In general, the true significance level will depend on

• the stated level of significance a

• the shape of the populations (normal, skewed, heavy-tailed, etc.)

• the spreads of the two populations as measured by the two standard deviations

• the sample sizes m and n

Given a particular choice of a, shape, spreads, and sample sizes, we wish to estimate the true significance level given by

Here is an outline of a simulation algorithm to compute aT :

1. Simulate a random sample xl,..., x n from the first population and yl,..., yn from the second population.

2. Compute the t statistic T from the two samples.

3. Decide if ITI exceeds the critical point and Ho is rejected.

One repeats steps 1-3 of the algorithm N times. One estimates the true sib nificance level by

The following is an R script that implements the simulation algorithm for normal populations with mean 0 and standard deviation 1. The R variable alpha is the stated significance level, m and n are the sample sizes, and N is the number of simulations. The rnorm command is used to simulate the two samples and t contains the value of the t statistic. One decides to reject if

where qt (p, df) is the pth quantile of a t distribution with df degrees of freedom. The observed significance level is stored in the variable true. sig. level.

1.3.4 The Behavior of the True Significance Level Under Different Assumptions

The R script described in the previous section can be used to explore the pattern of the true significance level aT for different choices of sample sizes and populations. The only two lines that need to be changed in the R script are the definition of the sample sizes m and n and the two lines where the two samples are simulated.

Suppose we fix the stated significance level at a = .10 and keep the sample sizes at in = 10 and n = 10. We simulate samples from the following populations, where the only restriction is that the population means are equal.

• Normal populations with zero means and equal spreads (ax = o"y = 1).

• Normal populations with zero means and very different spreads (a = 1, a"y = 10).

• T populations, 4 degrees of freedom and equal spreads

• Exponential populations with µx = µy = 1.

• One normal population (px = 10, a = 2) and one exponential population (µy = 10).

The R script was run for each of these five population scenarios using N = 10, 000 iterations and the estimated true significance levels are displayed in Table 1.1. These values should be compared with the stated significance level of a = .1, keeping in mind that the simulation standard error of each estimate is equal to .003. (The simulation standard error, the usual standard error in computing a binomial proportion, is equal to .1(.9)10000 = 0.003.) In this brief study, it appears that if the populations have equal spreads, then the true significance level is approximately equal to the stated level for different population shapes. If the populations have similar shapes and different spreads, then the true significance level can be slightly higher than 10%. If the populations have substantially different shapes (such as normal and exponential) and unequal spreads, then the true significance level can be substantially higher than the stated level.

Since the true significance level in the last case is 50% higher than the stated level, one might be interested in seeing the exact sampling distribution of the t statistic. We rerun this simulation for the normal and exponential populations, storing the simulated values of the t statistic in a vector tstat. We use the R command density to construct a nonparametric density estimate of the exact sampling distribution of the t statistic. The lines command is used to plot the t density with 18 degrees of freedom on top. Fig. 1.6 displays the resulting graph of the two densities. Note that the actual sampling distribution of the t statistic is right-skewed, which would account for the large true significance level.

1.4 Further Reading

Although R is a sophisticated package with many commands, there are many resources available for learning the package. There is some basic instruction on R that can be found from the R Help menu. The R project home page at http://www.r-project.org lists a number of books describing different levels of statistical computing using R. Verzani (2004) is a good book describing the use of R in an introductory statistics course; in particular, the book is helpful for getting started in constructing different types of graphical displays. Gentle (2002), Appendix A, gives a general description of Monte Carlo experiments with an extended example.

Fig. 1.6. Exact sampling density of the t statistic when sampling from normal and exponential distributions.

1.5 Summary of R Functions

An outline of the R functions used in this chapter is presented here. Detailed information about any specific function, say abline, can be found by typing

in the R command window.

abline - add a straight line to a plot

attach - attach a set of R objects to search path

barplot - create a barplot with vertical or horizontal bars

boxplot - produce box-and-whisker plot(s) of the given (grouped) values

density - computes kernel density estimates

hist - computes a histogram of the given data values

lm - used to fit linear models such as regression

mean - computes the arithmetic mean

plot - generic function for plotting R objects

read.table - reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file

rexp - random generation for the exponential distribution

rnorm - random generation for the normal distribution

rt - random generation for the t distribution

sd - computes the value of the standard deviation

summary - generic function used to produce result summaries of the results of various model fitting functions

table - uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels

1.6 Exercises

1. Movie DVDs owned by students

The variable Dvds in the student dataset contains the number of movie DVDs owned by students in the class.

a) Construct a histogram of this variable by use of the hist command.

b) Summarize this variable by the summary command.

c) Use the table command to construct a frequency table of the individual values of Dvds that were observed. If one constructs a barplot of these tabled values by use of the command

one will see that particular response values are very popular. Is there any explanation for these popular values for number of DVDs owned?

2. Student heights

The variable Height contains the height (in inches) of each student in the class.

a) Construct parallel boxplots of the heights by the Gender variable.

b) If one assigns the boxplot output to a variable

then output is a list that contains statistics used in constructing the boxplots. Print output to see the statistics that are stored.

3. Sleeping times

The variables ToSleep and WakeUp contain, respectively, the time to bed and wakeup time for each student the previous evening. (The data are recorded as hours past midnight, so a value of -2 indicates 10 p.m.)

a) Construct a scatterplot of ToSleep and WakeUp.

b) Find a least-squares fit to these data by the lm command.

c) Place the least-squares fit on the scatterplot by the abline command.

d) Use the line to predict the wakeup time for a student who went to bed at midnight.

4. Performance of the traditional confidence interval for a proportion

Suppose one observes y that is binomially distributed with sample size n and probability of success p. The standard 90% confidence interval for p is given by

The function binomial. conf .interval will return the limits of a 90% confidence interval given values of y and it.

a) Read the function binomial.conf.interval into R.

b) Suppose that samples of size n = 20 are taken and the true value of the proportion is p = .5. Using the rbinom command, simulate a value of y and use binomial.conf.interval to compute the 90% confidence interval. Repeat this a total of 20 times and estimate the true probability of coverage P(p E C(y)).

c) Suppose that n = 20 and the true value of the proportion is p = .05. Simulate 20 binomial random variates with n = 20 and p = .05 and for each simulated y, compute a 90% confidence interval. Estimate the true probability of coverage.

5. Performance of the traditional confidence interval for a proportion

Exercise 4 demonstrated that the actual probability of coverage of the traditional confidence interval depends on the values of n and p. Construct a Monte Carlo study that investigates how the probability of coverage depends on the sample size and true proportion value. In the study, let n be 10, 25, and 100, and let p be .05, .25, and .50. Write an R function that has three inputs: n, p, and the number of Monte Carlo simulations m, and will output the estimate of the exact coverage probability. Implement your function using each combination of n and p and in = 1000 simulations. Describe how the actual probability of converge of the traditional interval depends on the sample size and true proportion value.

 

2.1 Introduction

In this chapter, the basic elements of the Bayesian inferential approach are introduced through the basic problem of learning about a population proportion. Before taking data, one has beliefs about the value of the proportion and one models his or her beliefs in terms of a prior distribution. We will illustrate the use of different functional forms for this prior. After data have been observed, one updates one's beliefs about the proportion by the computation of the posterior distribution. One summarizes this probability distribution to perform inferences. Also one may be interested in predicting the likely outcomes of a new sample taken from the population.

Many of the commands in the R base package can be used in this setting. The probability distribution commands such as dbinom and dbeta and simulation commands such as rbeta, rbinom and sample are helpful in simulating draws from the posterior and predictive distributions. Also we illustrate some special R commands pdisc, histprior, and discint in the LearnBayes package that are helpful in constructing priors and computing and summarizing a posterior.

2.2 Learning About the Proportion of Heavy Sleepers

Suppose a person is interested in learning about the sleeping habits of American college students. She hears that doctors recommend eight hours of sleep for an average adult. What proportion of college students get at least eight hours of sleep?

Here we think of a population consisting of all American college students and let p represent the proportion of this population who sleep (on a typical night during the week) at least eight hours. We are interested in learning about the location of p.

The value of the proportion p is unknown. In the Bayesian viewpoint a person's beliefs about the uncertainty in this proportion are represented by a probability distribution placed on this parameter. This distribution reflects the person's subjective prior opinion about plausible values of p.

A random sample of students from a particular university will be taken to learn about this proportion. But first the person does some initial research to learn about the sleeping habits of college students. This research will help her in constructing a prior distribution.

In the Internet article "College Students Don't Get Enough Sleep" in The Gamecock, the student newspaper of the University of South Carolina (April 20, 2004), the person reads that a sample survey reports that most students spend only six hours sleeping. She reads a second article "Sleep on It: Implementing a Relaxation Program into the College Curriculum" in Fresh Writing, a 2003 publication of the University of Notre Dame. Based on a sample of 100 students, "approximately 70% reported receiving only five to six hours of sleep on the weekdays, 28% receiving seven to eight, and only 2% receiving the healthy nine hours for teenagers."

Based on this information, this person believes that college students generally get less than eight hours of sleep and so p (the proportion that sleep at least eight hours) is likely smaller than .5. After some reflection, her best guess at the value of p is .3. But it is very plausible that this proportion could be any value in the interval from 0 to .5.

A sample of 27 students is taken - in this group, 11 record that they had at least eight hours of sleep the previous night. Based on the prior information and this observed data, the person is interested in estimating the proportion p. In addition, she is interested in predicting the number of students that get at least eight hours of sleep if a new sample of 20 students is taken.

Suppose that our prior density for p is denoted by g(p). If we regard a "success" as sleeping at least eight hours and we take a random sample with s successes and f failures, then the likelihood function is given by

The posterior density for p, by Bayes' rule, is obtained, up to a proportionality constant, by multiplying the prior density by the likelihood.

We demonstrate posterior distribution calculations using three different choices of the prior density g corresponding to three methods for representing the person's prior knowledge about the proportion.

2.3 Using a Discrete Prior

A simple approach for assessing a prior for p is to write down a list of plausible proportion values and then assign weights to these values. In our example, the person believes that

are possible values for p. Based on her beliefs, she assigns these values the corresponding weights

which can be converted to prior probabilities by dividing each weight by the sum. In R, we define p to be the vector of proportion values and prior the corresponding weights that we normalize to probabilities. The plot command is used with the "histogram" type option to graph the prior distribution, and Fig. 2.1 displays the graph.

Fig. 2.1. A discrete prior distribution for a proportion p.

In our example, 11 of 27 students sleep a sufficient number of hours, so s = 11 and f = 16, and the likelihood function is

(Note that the likelihood is a beta density with parameters s + 1 = 12 and f + 1 = 17.) The R function pdisc in the package LearnBayes computes the posterior probabilities. To use pdisc, one inputs the vector of proportion values p, the vector of prior probabilities prior, and a data vector data consisting of s and f. The output of pdisc is a vector of posterior probabilities. The cbind command is used to display a table of the prior and posterior probabilities, and Fig. 2.2 displays a line graph of the posterior probabilities.

Here we note that most of the posterior probability is concentrated on the values p = .35 and p = .45. If we combine the probabilities for the three most likely values, we can say the posterior probability that p falls in the set {.25, .35, .45} is equal to .942.

2.4 Using a Beta Prior

Since the proportion is a continuous parameter, an alternative approach is to construct a density g(p) on the interval (0, 1) that represents the person's initial beliefs. Suppose she believes that the proportion is equally likely to be smaller or larger than p = .3. Moreover, she is 90% confident that p is less than .5. A convenient family of densities for a proportion is the beta with kernel proportional to

Fig. 2.2. Posterior distribution for a proportion p using a discrete prior.

where the hyperparameters a and b are chosen to reflect the user's prior beliefs about p. Here the person believes that the median and 90th percentiles are given, respectively, by .3 and .5, and this can be matched, by trial and error, with a beta density with a = 3.4 and b = 7.4. Combining this beta prior with the likelihood function, one can show that the posterior density is also of the beta form with updated parameters a + s and b + f.

where a + s = 3.4 + 11 and b + f = 7.4 + 16. (This is an example of a conjugate analysis where the prior and posterior densities have the same functional form.) Since the prior, likelihood, and posterior are all in the beta family, we can use the R command dbeta to compute values of prior, likelihood, and posterior. These three densities are displayed using the R commands plot and lines in the same graph in Fig. 2.3. This figure is helpful in seeing that the posterior density in this case compromises between the initial prior beliefs and the information in the data.

Fig. 2.3. The prior density g(p), the likelihood function L(p), and the posterior density g(pldata) for learning about a proportion p.

We illustrate different ways of summarizing the beta posterior distribution to make inferences about the proportion of heavy sleepers p. The beta cdf and inverse cdf functions pbeta and qbeta are useful in computing probabilities and constructing interval estimates for p. Is it likely that the proportion of heavy sleepers is greater than .5? This is answered by computing the posterior probability P(p >= .51data), which is given by the R command

This probability is small, so it is unlikely that more than half of the students are heavy sleepers. A 90% interval estimate for p is found by computing the 5th and 95th percentiles of the beta density:

We are 90% confident that the proportion is between .256 and .513.

These summaries are exact because they are based on R functions for the beta posterior density. An alternative method of summarization of a posterior density is based on simulation. In this case we can simulate a large number of values from the beta posterior density and summarize the simulated output. Using the random beta command rbeta, we simulate 1000 random proportion values from the beta(a + s, b + f) posterior by the command

and display the posterior as a histogram of the simulated values in Fig. 2.4.

The probability that the proportion is larger than .5 is estimated by the proportion of simulated values in this range.

A 90 percent interval estimate can be estimated by the 5th and 95th sample quantiles of the simulated sample.

Note that these summaries of the posterior density for p based on simulation are approximately equal to the exact values based on calculations from the beta distribution.

Fig. 2.4. A simulated sample from the beta posterior distribution of p.

2.5 Using a Histogram Prior

Although there are computational advantages to the use of a beta prior, it is straightforward to perform posterior computations for any choice of prior. We outline a "brute-force" method of summarizing posterior computations for an arbitrary prior density g(p).

• Choose a grid of values of p over an interval that covers the posterior density.

• Compute the product of the likelihood L(p) and the prior g(p) on the grid.

• Normalize by dividing each product by the sum of the products. In this step, we are approximating the posterior density by a discrete probability distribution on the grid.

• By use of the R command sample, take a random sample with replacement from the discrete distribution.

The resulting simulated draws are an approximate sample from the posterior distribution.

We illustrate this "brute-force" algorithm for a "histogram" prior that may better reflect the person's prior opinion about the proportion p. Suppose it is convenient for our person to state her prior beliefs about the proportion of heavy sleepers by dividing the range of p into 10 subintervals (0, .1), (.1, .2), (.9, 1), and then assigning probabilities to the intervals. In our example, the person assigns the weights 2, 4, 8, 8, 4, 2, 1, 1, 1, 1 to these intervals - this can be viewed as a continuous version of the discrete prior used earlier.

In R, we represent this histogram prior with the vector midpt that contains the midpoints of the intervals and the vector prior that contains the associated prior weights. We convert the prior weights to probabilities by dividing each weight by the sum. We graph this prior in Fig. 2.5 on a grid of values p using the R function histprior in the LearnBayes package.

Fig. 2.5. A histogram prior for a proportion p.

On the grid of values of p, we compute the posterior density by multiplying the histogram prior by the likelihood function. (Recall that the likelihood function for binomial density is given by a beta(s + 1, f + 1) density; this function is available by the dbeta function.) In Fig. 2.6, the posterior density is displayed.

Fig. 2.6. The posterior density for a proportion using a histogram prior

To obtain a simulated sample from the posterior density by our algorithm, we convert the products on the grid to probabilities

and take a sample with replacement from the grid using the R function sample.

Fig. 2.7 shows a histogram of the simulated values.

Fig. 2.7. A histogram of simulated draws from the posterior distribution of p with the use of a histogram prior.

The simulated draws can be used as before to summarize any feature of the posterior distribution of interest.

2.6 Prediction

If g is a prior density, then we refer to this as the prior predictive density, and if g is a posterior, then f is a posterior predictive density.

We illustrate the computation of the predictive density using the different forms of prior density described in this chapter. Suppose we use a discrete prior where {pi} represent the possible values of the proportion with respective probabilities {g(pi)j. Let fB(yln,p) denote the binomial sampling density given values of the sample size n and proportion p:

The function pdiscp in the LearnBayes package can be used to compute the predictive probabilities when p is given a discrete distribution. As before p is a vector of proportion values and prior a vector of current probabilities. The remaining arguments are the future sample size m and a vector ys of numbers of successes of interest. The output is a vector of the corresponding predictive probabilities.

Suppose instead that we model our beliefs about p using a beta(a, b) prior. In this case, we can analytically integrate out p to get a closed-form expression for the predictive density.

where B(a, b) is the beta function. The predictive probabilities using the beta density are computed using the function pbetap. The inputs to this function are the vector ab of beta parameters a and b, the size of the future sample m, and the vector of numbers of successes y. The output is a vector of predictive probabilities corresponding to the values in y. We illustrate this computation using the beta(3.4, 7.4) prior used to reflect the person's beliefs about the proportion of heavy sleepers at the school.

We demonstrate this simulation approach for the beta(3.4, 7.4) prior. We first simulate 1000 draws from the prior and store the simulated values in p:

Fig. 2.8. A graph of the predictive probabilities of the number of sleepers y" in a future sample of size 20 when the proportion is assigned a beta(3.4, 7.4) prior.

2.7 Further Reading

A number of books are available that describe the basic tenets of Bayesian thinking. Berry (1996) and Albert and Rossman (2001) describe the Bayesian approach for proportions at an introductory statistics level. Albert (1996) describes Bayesian computational algorithms for proportions using the statistics package Minitab. Antleman (1996) and Bolstad (2004) provide elementary descriptions of Bayesian thinking suitable for undergraduate statistics classes.

2.8 Summary of R Functions

discint - computes a highest probability interval for a discrete distribution

Usage: discint (dist, prob)

Arguments: dist, a probability distribution written as a matrix where the first column contains the values and the second column contains the probabilities; prob, the probability content of interest

Value: prob, the exact probability content of the interval, and set, the set of values of the probability interval

histprior - computes the density of a probability distribution defined on a set of equal-width intervals

Usage: histprior(p,midpts,prob)

Arguments: p, the vector of values for which the density is to computed; midpts, the vector of midpoints of the intervals; prob, the vector of probabilities of the intervals

Value: vector of values of the probability density

pdisc - computes the posterior distribution for a proportion for a discrete prior distribution

Usage: pdisc(p, prior, data)

Arguments: p, a vector of proportion values; prior, a vector of prior probabilities; data, a vector consisting of the number of successes and number of failures

Value: the vector of posterior probabilities

pdiscp - computes predictive distribution for the number of successes of a future binomial experiment with a discrete distribution for the proportion

Usage: pdiscp(p, probs, n, s)

Arguments: p, the vector of proportion values; probs, the vector of probabilities; n, the size of the future binomial sample; s, the vector of number of successes for future binomial experiment

Value: the vector of predictive probabilities for the values in the vector s

pbetap - computes predictive distribution for the number of successes of a future binomial experiment with a beta distribution for the proportion

Usage: pbetap(ab, n, s)

Arguments: ab, the vector of parameters of the beta prior; n, the size of the future binomial sample; s, the vector of number of successes for future binomial experiment

Value: the vector of predictive probabilities for the values in the vector s

2.9 Exercises

1. Estimating a proportion with a discrete prior

Bob claims to have ESP. To test this claim, you propose the following experiment. You will select one from four large cards with different geometric figures and Bob will try to identify it. Let p denote the probability that Bob is correct in identifying the figure for a single card. You believe that Bob has no ESP ability (p = .25), but there is a small chance that p is either larger or smaller than .25. After some thought, you place the following prior distribution on p:

Suppose that the experiment is repeated ten times and Bob is correct six times and incorrect four times. Using the function pdisc, find the posterior probabilities of these values of p. What is your posterior probability that Bob has no ability?

2. Estimating a proportion with a histogram prior

Consider the following experiment. Hold a penny on edge on a flat hard surface, and spin it with your fingers. Let p denote the probability that it lands heads. To estimate this probability, we will use a histogram to model our prior beliefs about p. Divide the interval [0,1] into the 10 subintervals [0,.1], [.1,.2], ..., [.9,1], and specify probabilities that p is in each interval. Next spin the penny 20 times and count the number of successes (heads) and failures (tails). Simulate from the posterior distribution by (1) computing the posterior density of p on a grid of values on (0, 1) and (2) taking a simulated sample with replacement from the grid. (The functions histprior and sample are helpful in this computation.) How have the interval probabilities changed on the basis of your data?

3. Estimating a proportion and prediction of a future sample

A study reported on the long-term effects of exposure to low levels of lead in childhood. Researchers analyzed children's shed primary teeth for lead content. Of the children whose teeth had a lead content of more than 22.22 parts per million (ppm), 22 eventually graduated from high school and 7 did not. Suppose your prior density for p, the proportion of all such children who will graduate from high school is beta(l, 1), and so your posterior density is beta(23, 8).

4. Contrasting predictions using two different priors

Suppose two persons are interested in estimating the proportion p of students at a college who commute to school. Suppose Joe uses a discrete prior given in the following table:

a) Use R to compute the mean and standard deviation of p for Joe's prior and for Sam's prior. Based on this computation, do Joe and Sam have similar prior beliefs about the location of p?

b) Suppose one is interested in predicting the number of commuters y in a future sample of size 12. Use the functions pdiscp and pbetap to compute the predictive probabilities of y using both Joe's prior and Sam's prior. Do the two people have similar beliefs about the outcomes of a future sample?

5. Estimating a normal mean with a discrete prior

Suppose you are interested in estimating the average total snowfall per year It (in inches) for a large city on the East Coast of the United States. Assume individual yearly snow totals yl,..., yn, are collected from a population that is assumed to be normally distributed with mean µ and known standard deviation a = 10 inches.

a) Before collecting data, suppose you believe that the mean snowfall It can be the values 20, 30, 40, 50, 60, and 70 inches with the following probabilities:

Place the values of It in the vector mu and the associated prior probabilities in the vector prior.

b) Suppose you observe the yearly snowfall totals 38.6, 42.4, 57.5, 40.5, 51.7, 67.1, 33.4, 60.9, 64.1, 40.1, 40.7, and 6.4. Enter these data into a vector y and compute the sample mean ybar.

c) In this problem, the likelihood function is given by

d) One can compute the posterior probabilities for p using the formula

Compute the posterior probabilities of it for this example.

e) Using the function discint, find an 80% probability interval for µ.

6. Estimating a Poisson mean using a discrete prior (from Antleman (1996))

Suppose you own a trucking company with a large fleet of trucks. Breakdowns occur randomly in time and the number of breakdowns during an interval of t days is assumed to be Poisson distributed with mean t.A. The parameter A is the daily breakdown rate. The possible values for A are .5, 1, 1.5, 2, 2.5, and 3 with respective probabilities .1, .2, .3, .2, .15, and .05. If one observes y breakdowns, then the posterior probability of A is proportional to

where g is the prior probability.

a) If 12 trucks break down in a six-day period, find the posterior probabilities for the different rate values.

b) Find the probability that there are no breakdowns during the next week. Hint: If the rate is A, the conditional probability of no breakdowns during a seven-day period is given by exp{-7.A}. One can compute this predictive probability by multiplying a list of conditional probabilities by the posterior probabilities of A and finding the sum of the products.

 

3.1 Introduction

In this chapter, we introduce the use of R in summarizing the posterior distributions for several single-parameter models. We begin by describing Bayesian inference for a variance for a normal population and inference for a Poisson mean when informative prior information is available. For both problems, summarization of the posterior distribution is facilitated by the use of R functions to compute and simulate distributions from the exponential family. In Bayesian analyses, one may have limited beliefs about a parameter and there may be several priors that provide suitable matches to these beliefs. In estimating a normal mean, we illustrate the use of two distinct priors in modeling beliefs, and show that inferences may or may not be sensitive to the choice of prior. In this example, we illustrate the "brute force" method of summarizing a posterior where the density is computed by the "prior times likelihood" recipe over a fine grid. We conclude by describing a Bayesian test of the simple hypothesis that a coin is fair. The computation of the posterior probability of "fair coin" is facilitated using beta and binom functions in R.

3.2 Normal Distribution with Known Mean but Unknown Variance

Gelman et al (2003) consider a problem of estimating an unknown variance using American football scores. The focus is on the difference d between a game outcome (winning score minus losing score) and a published point spread. We observe dl, ..., dn, the observed differences between game outcomes and point spreads for n football games. If these differences are assumed to be a random sample from a normal distribution with mean 0 and unknown variance a2, the likelihood function is given by

Suppose the noninformative prior density p(c72) = 1/ore is assigned to the variance. Then the posterior density of Q2 is given, up to a proportionality constant, by

In the following R output, we first read in the datafile footballscores that is available in the LearnBayes package. For each of 672 games, the data file contains favorite and underdog, the actual scores of the favorite and underdog teams, and spread, the published point spread. We compute the difference variable d. As in the preceding notation, n is the sample size and v is the sum of squares of the differences.

We simulate 1000 values from the posterior distribution of the standard deviation c7 in two steps. First, we simulate values of the precision parameter P = 1/o72 from the scaled chi-square(n) distribution by the command rchisq(1000, n)/v. Then we perform the transformation or = 11P to get values from the posterior distribution of the standard deviation or. We use the hist command to construct a histogram of the draws of c7 (see Fig. 3.1).

The R quantile command is used to extract the 2.5%, 50%, and 97.5% percentiles of this simulated sample. A point estimate for or is provided by the posterior median 13.85 . In addition, the extreme percentiles (13.2, 14.6) represent a 95% probability interval for Q.

Fig. 3.1. Histogram of simulated sample of the standard deviation Q of differences between game outcomes and point spreads.

3.3 Estimating a Heart Transplant Mortality Rate

Consider the problem of learning about the rate of success of heart transplant surgery of a particular hospital in the United States. For this hospital, we observe the number of transplant surgeries n and the number of deaths within 30 days of surgery y is recorded. In addition, one can predict the probability of death for an individual patient. This prediction is based on a model that uses information such as patients' medical condition before surgery, gender, and race. Based on these predicted probabilities, one can obtain an expected number of deaths, denoted by e. A standard model assumes that the number of deaths y follows a Poisson distribution with mean eA, and the objective is to estimate the mortality rate per unit exposure A.

A convenient source of prior information is heart transplant data from a small group of hospitals that we believe has the same rate of mortality as the rate from the hospital of interest. Suppose we observe the number of deaths zj and the exposure of for 10 hospitals (j = 1, ..., 10), where zj is Poisson with mean ojA. If we assign A the standard noninformative prior p(A) = A-', then the updated distribution for A, given this data from the 10 hospitals, is

and so we assign A a gamma(16, 15174) prior.

If the observed number of deaths from surgery yobs for a given hospital with exposure e is Poisson (eA) and A is assigned the gamma(a, ~3) prior, then the posterior distribution will also have the gamma form with parameters a + yobs and ~ + e. Also the (prior) predictive density of y (before any data are observed) can be computed by the formula

where f (yea) is the Poisson(e)) sampling density and g(A) and g(AIy) are, respectively, the prior and posterior densities of A.

By the model checking strategy of Box (1980), both the posterior density g(Aly) and the predictive density f(y) play important roles in a Bayesian analysis. By use of the posterior density, one performs inference about the unknown parameter conditional on the Bayesian model that includes the assumptions of sampling density and the prior density. One can check the validity of the proposed model by inspecting the predictive density. If the observed data value yobs is consistent with the predictive density p(y), then the model seems reasonable. On the other hand, if yobs is in the extreme tail portion of the predictive density, then this casts doubt on the validity of the Bayesian model, and perhaps the prior density or the sampling density has been misspecified.

We consider inference about the heart transplant rate for two hospitals - one that has experienced a small number of surgeries and a second that has experienced many surgeries. First consider hospital A which experienced only one death (yobs = 1) with an exposure of e = 66. The standard estimate of this hospital's rate, 1/66, is suspect due to the small observed number of deaths. The following R calculations illustrate the Bayesian calculations. After the gamma prior parameters alpha and beta and exposure ex are defined, the predictive density of the values y = 0, 1, ..., 10 are found by use of the preceding formula and the R functions dpois and dgamma. The formula for the predictive density is valid for all A, but to ensure that there is not any underflow in the calculations, the values of f(y) are computed for the prior mean value A = a/~3. Note that practically all of the probability of the predictive density is concentrated on the two values y = 0 and 1. The observed number of deaths (yobs = 1) is in the middle of this predictive distribution and so there is no reason to doubt our Bayesian model.

The posterior density of A can be summarized by simulating 1000 values from the gamma density.

Let's consider the estimation of a different hospital that experiences many surgeries. Hospital B had yobs = 4 deaths with an exposure of e = 1767. For this data, we again have R compute the prior predictive density and simulate 1000 draws from the posterior density using the rgamma command. Again we see that the observed number of deaths seems consistent with this model since yobs = 4 is not in the extreme tails of this distribution.

To see the impact of the prior density on the inference, it is helpful to display the prior and posterior distributions on the same graph. In Fig. 3.2, the histograms of the draws from the posterior distributions of the rates are shown for hospitals A and B. The gamma prior density is placed on top of the histogram in each case. We see that for hospital A with relatively little experience in surgeries, the prior information is significant and the posterior distribution resembles the prior distribution. In contrast, for hospital B with many surgeries, the prior information is less influential and the posterior distribution resembles the likelihood function.

3.4 An Illustration of Bayesian Robustness

In practice, one may have incomplete prior information about a parameter in the sense that one's beliefs won't entirely define a prior density. There may be a number of different priors that match the given prior information. For example, if you believe a priori that the median of a parameter 0 is 30 and its 80th percentile is 50, certainly there are many prior probability distributions that can be chosen that match these two percentiles. In this situation where different priors are possible, it is desirable that inferences from the posterior not be dependent on the exact functional form of the prior. A Bayesian analysis is said to be robust to the choice of prior if the inference is insensitive to different priors that match the user's beliefs.

Fig. 3.2. Histograms for simulated samples from the posterior distributions for two transplant rates. The prior density for the corresponding rate is drawn in each graph.

To illustrate this idea, suppose you are interested in estimating the true IQ 9 for a person we'll call Joe. You believe Joe has an average intelligence and the median of your prior distribution is 100. Also you are 90% confident that Joe's IQ falls between 80 and 120. You construct a prior density by matching this information with a normal density with mean µ and standard deviation T, or N(µ, T). It is straightforward to show that the parameter values that match this prior information are µ = 100 and T = 12.16.

Joe takes four IQ tests and his scores are yl, y2, p3i y4. Assuming that an individual score p is distributed N(9, a-) with known standard deviation o= 15, the observed mean score y is N(9, Q/\).

With the use of a normal prior in this case, the posterior density of 9 will also be normal with standard deviation

and mean

Let's now consider an alternative prior density to model our beliefs about Joe's true IQ. Any symmetric density instead of a normal could be used, so we use a t density with location it, scale T, and two degrees of freedom. Since our prior median is 100, we let the median of our t density be equal to p. = 100. We find the scale parameter T so the t density matches our prior belief that the 95th percentile of 0 is equal to 120. Note that

where T is a standard t variate with two degrees of freedom. It follows that

where t,(p) is the pth quantile of a t random variable with v degrees of freedom. We find T by use of the t quantile function qt in R.

We display the normal and t priors in a single graph in Fig. 3.3. Although they have the same basic shape, note that the t density has significantly flatter tails - we will see that this will impact the posterior density for "extreme" test scores.

Fig. 3.3. Normal and t priors for representing prior opinion about a person's true IQ score.

We perform the posterior calculations using the t prior for each of the possible sample results. Note that the posterior density of 0 is given, up to a proportionality constant, by

Let's compare the posterior moments of 0 using the two priors by combining the two R matrices summa and summ2.

Fig. 3.4. Posterior densities for a person's true IQ using normal and t priors.

When a normal prior is used, the posterior will always be a compromise between the prior information and the observed data, even when the data result conflicts with one's prior beliefs about the location of Joe's IQ. In contrast, when a t prior is used, the likelihood will be in the flat-tailed portion of the prior and the posterior will resemble the likelihood function.

In this case, the inference about the mean is robust to the choice of prior (normal or t) when the observed mean IQ score is consistent with the prior beliefs. But in the case when an extreme IQ score is observed, we see that the inference is not robust to the choice of prior density.

3.5 A Bayesian Test of the Fairness of a Coin

Suppose you are interested in assessing the fairness of a coin. You observe y distributed binomial with parameters n and p and you are interested in testing the hypothesis H that p = .5. If y is observed, then it is usual practice to make a decision on the basis of the p-value

If this p-value is small, then you reject the hypothesis H and conclude that the coin is not fair. Suppose, for example, the coin is flipped 20 times and only 5 heads are observed. In R we compute the probability of obtaining five or fewer heads:

The p-value here is 2 x .021 = .042. Since this value is smaller than the common significance level of .05, you would decide to reject the hypothesis H and conclude that the coin is not fair.

Let's consider this problem from a Bayesian perspective. There are two possible models here - either the coin is fair (p = .5) or the coin is not fair (p .5). Suppose that you are indifferent between the two possibilities, and so you initially assign each model a probability of 1/2. Now if you believe the coin is fair, then your entire prior distribution for p would be concentrated on the value p = .5. If instead the coin is unfair, you would assign a different prior distribution on (0, 1), call it gi(p), that would reflect your beliefs about the unfair coin probability. Suppose you assign a beta(a, a) prior on p. This beta distribution is symmetric about .5 - it says that you believe the coin is not fair, the probability is close to p = .5. To summarize, your prior distribution in this testing situation can be written

where I(A) is an indicator function equal to 1 if the event A is true and otherwise equal to 0.

After observing the number of heads in n tosses, we would update our prior distribution by Bayes' rule. The posterior density for p can be written as

where gi is a beta(a+y, a+n-y) density and A (y) is the posterior probability of the model that the coin is fair

In the expression for A (y), p(yl.5) is the binomial density for y when p = .5, and ml (y) is the (prior) predictive density for y using the beta density.

In R the posterior probability of fairness .A(y) is easily computed. The R command dbinom will compute the binomial probability p(yl.5) and the predictive density for y can be computed using the identity

Assume first that we assign a beta(10, 10) prior for p when the coin is not fair and we observe y = 5 heads in n = 20 tosses. The posterior probability of fairness is stored in the R variable lambda.

We get the surprising result that the posterior probability of the hypothesis of fairness H is .28, which is less evidence against fairness than implied by the above p-value calculation.

The function pbetat in the LearnBayes package performs a test of a binomial proportion. The inputs to the function are the value of p to be tested, the prior probability of that value, a vector of parameters of the beta prior when the hypothesis is not true, and a vector of numbers of successes and failures. In this example, the syntax would be

The output variable post is the posterior probability that p = .5, which agrees with the calculation. The output variable bf is the Bayes factor in support of the null hypothesis which is discussed in Chapter 8.

Since the choice of the prior parameter a = 10 in this analysis seems arbitrary, it is natural to ask about the sensitivity of this posterior calculation to the choice of this parameter. To answer this question, we compute the posterior probability of fairness for a range of values of log a. We graph the posterior probability against log a in Fig. 3.5.

Fig. 3.5. Posterior probability that coin is fair graphed against values of the prior parameter log a.

We see from this graph that the probability of fairness appears to be greater than .2 for all choices of a. Since the value of .2 is larger than the p-value calculation of .042, this suggests that the p-value is overstating the evidence against the hypothesis that the coin is fair.

Another distinction between the frequentist and Bayesian calculations is the event that led to the decision about rejecting the hypothesis that the coin was fair. The p-value calculation was based on the probability of the event "5 heads or fewer," but the Bayesian calculation was based solely on the likelihood based on the event "exactly 5 heads." That raises the question: how would the Bayesian answers change if we observed "5 heads or fewer"? One can show that the posterior probability that the coin is fair is given by

where Po(Y<5) is the probability of five heads or fewer under the binomial model with p = .5, and Pl(Y<5) is the predictive probability of this event under the alternative model with a beta(10, 10) prior on p. In the following R output, the cumulative probability of five heads under the binomial model is computed by the R function pbinom. The probability of five or fewer heads under the alternative model is computed by summing the predictive density over the six values of y.

Note that the posterior probability of fairness is .218 based on the data "5 heads or fewer." This posterior probability is smaller than the value of .280 found earlier based on y = 5. This is a reasonable result, since observing "5 heads or fewer" is stronger evidence against fairness than the result "5 heads."

3.6 Further Reading

Carlin and Louis (2000), chapter 2, and Gelman et al (2003), chapter 2, provide general discussions of Bayesian inference for one-parameter problems. Lee (2004), Antleman (1996), and Bolstad (2004) provide extensive descriptions of inference for a variety of one-parameter models. The notion of Bayesian robustness is discussed in detail in Berger (1985). Bayesian testing for basic inference problems is outlined in Lee (2004).

3.7 Summary of R Functions

pbetat - Bayesian test that a proportion is equal to a specified prior using a beta prior

Usage: pbetat(pO,prob,ab,data)

Arguments: p0, the value of the proportion to be tested; prob, the prior probability of the hypothesis; ab, the vector of parameter values of the beta prior under the alternative hypothesis; data, vector containing the number of successes and number of failures

Value: bf, the Bayes factor in support of the null hypothesis; post, the posterior probability of the null hypothesis

3.8 Exercises

1. Cauchy sampling model

Suppose one observes a random sample yl, ..., y, from a Cauchy density with location 0 and scale parameter 1. If a uniform prior is placed on 0, then the posterior density is given (up to a proportionality constant) by

Suppose one observes the data 0, 10, 9, 8, 11, 3, 3, 8, 8, 11.

a) Using the R command seq, set up a grid of values of 0 from -2 to 12 in steps of 0.1.

b) Compute the posterior density on this grid.

c) Plot the density and comment on its main features.

d) Compute the posterior mean and posterior standard deviation of 9.

2. Learning about an exponential mean

Suppose a random sample is taken from an exponential distribution with mean A. If we assign the usual noninformative prior g(,\) = 1/), then the posterior density is given, up to a proportionality constant, by

a) Show that if we transform A to 0 = 1/), then 0 has a gamma density with shape parameter n and rate parameter s. (A gamma density with shape a and rate 13 is proportional to h(x) = x'-1 exp(-~3x).)

b) In a life-testing illustration, five bulbs are tested with observed burn times (in hours) of 751, 594, 1213, 1126, and 819. Using the R function rgamma, simulate 1000 values from the posterior distribution of 9.

c) By transforming these simulated draws, obtain a simulated sample from the posterior distribution of A.

d) Estimate the posterior probability that A exceeds 1000 hours.

3. Learning about the upper bound of a discrete uniform density

Suppose one takes independent observations yl,..., yn from a uniform distribution on the set {1, 2, ..., N}, where the upper bound N is unknown. Suppose one places a uniform prior for N on the values 1, ..., B, where B is known. Then the posterior probabilities for N are given by

where y(,n) is the maximum observation. To illustrate this situation, suppose a tourist is waiting for a taxi in a city. During this waiting time, she observes five taxis with the numbers 43, 24, 100, 35, and 85. She assumes that taxis in this city are numbered from 1 to N, she is equally likely to observe any numbered taxi at a given time, and observations are independent. She also knows that there cannot be more than 200 taxis in the city.

a) Use R to compute the posterior probabilities of N on a grid of values.

b) Compute the posterior mean and posterior standard deviation of N.

c) Find the probability that there are more than 150 taxis in the city.

4. Bayesian robustness

Suppose you are about to flip a coin that you believe is fair. If p denotes the probability of flipping a head, then your "best guess" at p is .5. Moreover, you believe that it is highly likely that the coin is close to fair, which you quantify by P(.44 < p < .56) = .9. Consider the following two priors for p:

Pl:p distributed beta(100, 100)

P2:p distributed according to the mixture prior

a) Simulate 1000 values from each prior density P1 and P2. By summarizing the simulated samples, show that both priors match the given prior beliefs about the coin flipping probability p.

b) Suppose you flip the coin 100 times and obtain 45 heads. Simulate 1000 values from the posteriors from priors P1 and P2, and compute 90% probability intervals.

c) Suppose that you only observe 30 heads out of 100 flips. Again simulate 1000 values from the two posteriors and compute 90% probability intervals.

d) Looking at your results from (b) and (c), comment on the robustness of the inference with respect to the choice of prior density in each case.

5. Test of a proportion

In the standard Rhine test of extra-sensory perception (ESP), a set of cards is used where each card has a circle, a square, wavy lines, a cross, or a star. A card is selected at random from the deck and a person tries to guess the symbol on the card. This experiment is repeated 20 times and the number of correct guesses y is recorded. Let p denote the probability that the person makes a correct guess, where p = .2 if the person does not have ESP and is just guessing at the card symbol. To see if the person truly has some ESP, we would like to test the hypothesis H : p = .2.

a) If the person identifies y = 8 cards correctly, compute the p-value.

b) Suppose you believe a priori that the probability that p = .2 is .5 and if p .2, you assign a beta(l, 4) prior on the proportion. Using the function pbetat, compute the posterior probability of the hypothesis H. Compare your answer with the p-value computed in part (a)-

c) The posterior probability computed in part (b) depended on your belief about plausible values of the proportion p when p .2. For each of the following priors, compute the posterior probability of H: (1) p - beta(.5, 2), (2) p - beta(2, 8), (3) p - beta(8, 32).

d) On the basis of your Bayesian computations, do you think that y = 8 is convincing evidence that the person really has some ESP? Explain.

6. Learning from grouped data

Suppose you drive on a particular interstate roadway and typically drive at a constant speed of 70 mph. One day, you pass one car and get passed by 17 cars. Suppose that the speeds are normally distributed with unknown mean y and standard deviation c7 = 10. If you pass s cars, and f cars pass you, the likelihood of it is given by

where fly, y, or) is the cdf of the normal distribution with mean p and standard deviation o-. Assign the unknown mean µ a flat prior density.

a) If s = 1 and f = 17, plot the posterior density of p.

b) Using the density found in part (a), find the posterior mean of y.

c) Find the probability that the average speed of the cars on this interstate roadway exceeds 80 mph.

 

4.1 Introduction

In this chapter, we describe the use of R to summarize Bayesian models with several unknown parameters. In learning about parameters of a normal population or multinomial parameters, posterior inference is accomplished by simulating from distributions of standard forms. Once a simulated sample is obtained from the joint posterior, it is straightforward to perform transformations on these simulated draws to learn about any function of the parameters. We next consider estimating the parameters of a simple logistic regression model. Although the posterior distribution does not have a simple functional form, it can be summarized by computing the density on a fine grid of points. A common inference problem is to compare two proportions in a 2 x 2 contingency table. We illustrate the computation of the posterior probability that one proportion exceeds the second proportion in the situation in which one believes a priori that the proportions are dependent.

4.2 Normal Data with Both Parameters Unknown

A standard inference problem is to learn about a normal population where both the mean and variance are unknown. To illustrate Bayesian computation for this problem, suppose we are interested in learning about the distribution of completion times for men between ages 20 and 29 who are running the New York Marathon. We observe the times yl, ... , y20 for 20 runners in minutes, and we assume they represent a random sample from an N(µ, c7) distribution. If we assume the standard noninformative prior g(µ, Q2) a 1/a2, then the

posterior density of the mean and variance is given by

This joint posterior has the familiar normal/inverse chi-square form where

We first use R to construct a contour plot of the joint posterior density for this example. We read in the data marathontimes: when we attach this dataset, we can use the variable time that contains the vector of running times. The R function normchi2post. R in the LearnBayes package computes the logarithm of the joint posterior density of ([b,or 2). We also use a function mycontour.R in the LearnBayes package that facilitates the use of the R contour command. There are three inputs to mycontour: the name of the function that defines the log density, a vector with the values (xlo, xhi, ylo, and yhi) that define the rectangle where the density is to graphed, and the data used in the function for the log density. The function produces a contour graph shown in Fig. 4.1, where the contour lines are drawn at 10%, 1%, and .1% of the maximum value of the posterior density over the grid.

We display the simulated sampled values of (µ, o-2) on top of the contour plot of the distribution in Fig. 4.1.

Inferences about the parameters or functions of the parameters are available from the simulated sample. To construct a 95% interval estimate for the mean µ, we use the R quantile function to find percentiles of the simulated sample of y.

Fig. 4.1. Contour plot of the joint posterior distribution of (µ, a2) for a normal sampling model. The points represent a simulated random sample from this distribution.

A 95% credible interval for the mean completion time is (254.1, 301.7) minutes.

Suppose we are interested in learning about the standard deviation O' that describes the spread of the population of marathon running times. To obtain a sample of the posterior of a, we take square roots of the simulated draws of a2. From the output, we see that an approximate 95% probability interval for a is (37.5, 70.9) minutes.

4.3 A Multinomial Model

Gelman et al (2003) describe a sample survey conducted by CBS news before the 1988 presidential election. A total of 1447 adults were polled to indicate their preference; y1 = 727 supported George Bush, y2 = 583 supported Michael Dukakis, and y3 = 137 supported other candidates or expressed no opinion. The counts y1, y2i and y3 are assumed to have a multinomial distribution with sample size n and respective probabilities 01, 02, and 03. If a uniform prior distribution is assigned to the multinomial vector 0 = (01i 02, 03), then the posterior distribution of 0 is proportional to

which is recognized as a Dirichlet distribution with parameters (y1 + 1, y2 + 1, y3 + 1). The focus is to compare the proportions of voters for Bush and Dukakis by the difference 01 - 02.

The summarization of the Dirichlet posterior distribution is again conveniently done by simulation. Although the base R package does not have a function to simulate Dirichlet variates, it is easy to write a function to simulate this distribution based on the fact that if W1, W2i W3 are independent distributed from gamma(al, 1), gamma(a2, 1), gamma(a3, 1) distributions and T = Wl + W2 + W3i then the distribution of the proportions (W1/T, W2/T, W3/T) has a Dirichlet(al, a2i a3) distribution. The R function rdirichlet. R in the package LearnBayes uses this transformation of random variates to simulate draws of a Dirichlet distribution. One thousand vectors B are simulated and stored in the matrix theta.

Since we are interested in comparing the proportions for Bush and Dukakis, we focus on the difference 01 - 02. A histogram of the simulated draws of this difference is displayed in Fig. 4.2. Note that all of the mass of this distribution is on positive values, indicating that there is strong evidence that the proportion of voters for Bush exceeds the proportion for Dukakis.

4.4 A Bioassay Experiment

In the development of drugs, bioassay experiments are often performed on animals. In a typical experiment, various dose levels of a compound are administered to batches of animals and a binary outcome (positive or negative) is recorded for each animal. We consider data from Gelman et al (2003), where one observes a dose level (in log g/ml), the number of animals, and the number of deaths for each of four groups. The data are displayed in Table 4.1.

Fig. 4.2. Histogram of simulated sample of the marginal posterior distribution of 01 - Bz for the multinomial sampling example.

Let yz denote the number of deaths observed out of nz with dose level xi. We assume yj is binomial(ni, pi), where the probability pi follows the logistic model

The likelihood function of the unknown regression parameters /3o and /31 is given by

where pi = exp(~3o+~31xj)/(l+exp(@o+131xz)). If the standard flat noninformative prior is placed on then the posterior density is proportional to the likelihood function.

We begin in R by defining the covariate vector x and the vectors of sample sizes and observed success counts n and y.

A standard classical analysis fits the model by maximum likelihood. The R function glm is used to do this fitting, and the summary output presents the estimates and the associated standard errors.

The log posterior density for (~3o, X31) in this logistic model is contained in the R function logisticpost. To summarize the posterior distribution, we first find a rectangle that covers essentially all of the posterior probability. The maximum likelihood fit is helpful for finding this rectangle. As seen by the contour plot displayed in Fig. 4.3, we see the rectangle -4 < ~3o < 8, -5 < X31 < 39 contains the contours that are greater than .1% of the modal value.

Fig. 4.3. Contour plot of the posterior distribution of (/3o, X31) for the bioassay example. The contour lines are drawn at 10%, 1%, and .1% of the model height.

Now that we have found the posterior distribution, we use the function simcontour to simulate pairs of (0o, 01) from the posterior density computed on this rectangular grid. We display the contour plot with the points superimposed in Fig. 4.4 to confirm that we are sampling from the posterior distribution.

We illustrate several types of inferences for this problem. Fig. 4.5 displays a density estimate of the simulated values (using the R function density) of the slope parameter 31. All of the mass of the density of /31 is on positive values, indicating that there is significant evidence that increasing the level of the dose does increase the probability of death.

Fig. 4.4. Contour plot of the posterior distribution of (/3o, /3i) for the bioassay example. A simulated random sample from this distribution is shown on top of the contour plot.

In this setting, one parameter of interest is the LD-50, the value of the dose x such that the probability of death is equal to one half. It is straightforward to show that the LD-50 is equal to 0 = -~0/~1. One can obtain a simulated sample from the marginal posterior density of 0 by computing a value of 0 from each simulated pair (~3o,~j). A histogram of the LD-50 is shown in Fig. 4.6.

In contrast to the histogram of X31, the LD-50 is more difficult to estimate and the posterior density of this parameter is relatively wide. We compute a 95% credible interval from the simulated draws of 0.

The probability that 0 is contained in the interval (-.290,.114) is .95.

Fig. 4.5. Histogram of simulated values from the posterior of the slope parameter /3i in the bioassay example.

4.5 Comparing Two Proportions

Howard (1998) considers the general problem of comparing the proportions from two independent binomial distributions. Suppose we observe yj distributed binomial(nl, pl), y2 distributed binomial(n2, p2). One wants to know if the data favor the hypothesis Hl : p1 > p2 or the hypothesis H2 : pl < p2 and want a measure of the strength of the evidence in support of one hypothesis. Howard gives a broad survey of frequentist and Bayesian approaches for comparing two proportions. Here we focus on the application of Howard's recommended "dependent prior" for this particular testing problem.

In this situation, suppose that one is given the information that one proportion is equal to a particular value, say pi = .8. This knowledge can influence a user's prior beliefs about the location of the second proportion p2; specifically, one may believe that the value of pz is also close to .8. This implies that the use of dependent priors for pi and P2 may be more appropriate than the common use of uniform independent priors for the proportions.

Howard proposes the following dependent prior. First the proportions are transformed into the real-valued logit parameters

Fig. 4.6. Histogram of simulated values of the LD-50 parameter -~3o/~31 in the bioassay example.

Then suppose that given a value of 01, the logit 02 is assumed to be normally distributed with mean 01 and standard deviation a-. By generalizing this idea, Howard proposes the dependent prior of the general form

where

This class of dependent priors is indexed by the parameters (a, ~3, -y, 6, or). The first four parameters reflect one's beliefs about the locations of p1 and p2 and the parameter a indicates one prior belief in the dependence between the two proportions.

Suppose that a = ~3 = = b = 1, reflecting vague prior beliefs about each individual parameter. The logarithm of the dependent prior is defined in the R function howardprior. By use of the function mycontour, Fig. 4.7 shows contour plots of the dependent prior for values of the association parameter a = 2, 1, .5, and .25. Note as the value of c7 goes to zero, the prior is placing more of its mass along the line where the two proportions are equal.

Fig. 4.7. Contour graphs of Howard's dependent prior for values of the association parameter (T = 2, 1, .5, and .25.

Suppose we observe counts P1, P2 from the two binomial samples. The likelihood function is given by

Combining the likelihood with the prior, one sees that the posterior density has the same functional "dependent" form with updated parameters

We illustrate testing the hypotheses using a dataset discussed by Pearson (1947) shown in Table 4.2.

Since the posterior distribution is the same functional form as the prior, we can use the same howardprior function for the posterior calculations. In Fig. 4.8, contour plots of the posterior are shown for the four values of the association parameter a.

We can test the hypothesis H1 : p1 > P2 simply by computing the posterior probability of this region of the parameter space. We first produce by the function simcontour a simulated sample from the posterior distribution of (p1, P2) and then find the proportion of simulated pairs where p1 > p2. For example, we display the R commands for the computation of the posterior probability for a = 2.

Fig. 4.8. Contour graphs of posterior for Howard's dependent prior for values of the association parameter a = 2, 1, .5, and .25.

Table 4.3 displays the posterior probability that pl exceeds p2 for four choices of the dependent prior parameter a. Note that this posterior probability is sensitive to the prior belief about the dependence between the two proportions.

4.6 Further Reading

Chapter 3 of Gelman et al (2003) describes the normal sampling problem and other multiparameter problems from a Bayesian perspective. In particular, Gelman et al (2003) illustrate the use of simulation when the posterior has been computed on a grid. Carlin and Louis (2000), chapter 2, and Lee (2004) illustrate Bayesian inference for some basic two-parameter problems. Howard (1998) gives a general discussion of inference for the two-by-two contingency table, contrasting frequentist and Bayesian approaches.

4.7 Summary of R Functions

howardprior - computes the logarithm of a dependent prior on two proportions proposed by Howard in a Statistical Science paper in 1998

Usage: howardprior (xy, par)

Arguments: xy, a matrix of parameter values where each row represents a value of the proportions (pl, p2); par, a vector containing parameter values alpha, beta, gamma, delta, sigma

Value: vector of values of the log posterior where each value corresponds to each row of the parameters in xy

logisticpost - computes the log posterior density of (beta0, betal) when yi are independent binomial(ni, pi) and logit(pi)=beta0+betal*xi

Usage: logisticpost(beta,data)

Arguments: beta, a matrix of parameter values where each row represents a value of (beta0, betal); data, a matrix of columns of covariate values x, sample sizes n, and number of successes y

Value: vector of values of the log posterior where each value corresponds to each row of the parameters in beta

mycontour - for a general two parameter density, draws a contour graph where the contour lines are drawn at 10%, 1%, and .1% of the height at the mode

Usage: mycontour(logf ,limits ,data)

Arguments: logf, a function that defines the logarithm of the density; limits, a vector of limits (xlo, xhi, ylo, yhi) where the graph is to be drawn; data, a vector or list of parameters associated with the function logpost

Value: a contour graph of the density is drawn

normchi2post - computes the log of the posterior density of a mean M and a variance S2 when a sample is taken from a normal density and a standard noninformative prior is used

Usage: normchi2post(theta,data)

Arguments: theta, a matrix of parameter values where each row is a value of (M, S2); data, a vector containing the sample observations

Value: a vector of values of the log posterior where the values correspond to the rows in theta

rdirichlet - simulates values from a Dirichlet distribution

Usage: rdirichlet (n, par)

Arguments: n, the number of simulations required, par, the vector of parameters of the Dirichlet distribution

Value: a matrix of simulated draws, where a row contains one simulated Dirichlet draw

simcontour - for a general two-parameter density defined on a grid, simulates a random sample

Usage: simcontour(logf,limits,data,m)

Arguments: logf, a function that defines the logarithm of the density; limits, a vector limits (xlo, xhi, ylo, yhi) that cover the joint probability density; data, a vector or list of parameters associated with the function logpost; m, the size of simulated sample

Value: x, the vector of simulated draws of the first parameter; y, the vector of simulated draws of the second parameter

4.8 Exercises

1. Inference about a normal population

Suppose we are interested in learning about the sleeping habits of students at a particular college. We collect yl, ..., y20, the sleeping times (in hours), for 20 randomly selected students in a statistics course. Here are the observations:

a) Assuming that the observations represent a random sample from a normal population with mean µ and variance a2 and the usual noninformative prior is placed on (y, a2), simulate a sample of 1000 draws from the joint posterior distribution.

b) Use the simulated sample to find 90% interval estimates for the mean µ and the standard deviation Q.

c) Suppose one is interested in estimating the upper quartile p75 of the normal population. Using the fact that p75 = µ + 0.674c7, find the posterior mean and posterior standard deviation of p7,5.

2. The Behrens-Fisher problem

c) The following data give the mandible lengths in millimeters for 10 male and ten female golden jackals in the collection of the British Museum. Using simulation, find the posterior density of the difference in mean mandible length between the sexes. Is there sufficient evidence to conclude that the males have a larger average?

3. Comparing two proportions

The following table gives the records of accidents in 1998 compiled by the Department of Highway Safety and Motor Vehicles in Florida.

Denote the number of accidents and fatalities when no safety equipment was in use by nN and YN, respectively. Similarly, let ns and ys denote the number of accidents and fatalities when a seat belt was in use. Assume that YN and ys are independent with YN distributed binomial(nN,pN) and ys distributed binomial(ns, ps). Assume a uniform prior is placed on the vector of probabilities (pN,ps).

a) Show that PN and ps have independent beta posterior distributions.

b) Use the function rbeta to simulate 1000 values from the joint posterior distribution of (pN,ps).

c) Using your sample, construct a histogram of the relative risk pN/ps. Find a 95% interval estimate of this relative risk.

d) Construct a histogram of the difference in risks PN - Ps

e) Compute the posterior probability that the difference in risks exceeds 0.

4. Learning from rounded data

It is a common problem for measurements to be observed in rounded form. Suppose we weigh an object five times and measure weights rounded to the nearest pound of 10, 1, 12, 11, 9. Assume the unrounded measurements are normally distributed with a noninformative prior distribution on the mean y and variance Q2.

a) Pretend that the observations are exact unrounded measurements. Simulate a sample of 1000 draws from the joint posterior distribution by using the algorithm described in Section 4.2.

b) Write down the correct posterior distributions for (,u, (72) treating the measurements as rounded.

c) By computing the correct posterior distribution on a grid of points (as in Section 4.4), simulate a sample from this distribution.

d) How do the incorrect and correct posterior distributions for y compare? Answer this question by comparing posterior means and variances from the two simulated samples.

5. Estimating the parameters of a Poisson/gamma density

Suppose that yl, ..., y,z are a random sample from the Poisson/gamma density

where a > 0 and b > 0. This density is an appropriate model for observed counts that show more dispersion than predicted under a Poisson model. Suppose that (a, b) are assigned the noninformative prior proportional to 1/(ab)2. If we transform to the real-valued parameters 9j = log a and 02 = log b, the posterior density is proportional to

where a = exp{B1} and b = exp{02}. Use this framework to model data collected by Gilchrist (1984), in which a series of 33 insect traps were set across sand dunes and the numbers of different insects caught over a fixed time were recorded. The number of insects of the taxa Staphylinoidea caught in the traps are shown here.

By computing the posterior density on a grid, simulate 1000 draws from the joint posterior density of (01, 02). From the simulated sample, find 90% interval estimates for the parameters a and b.

6. Comparison of two Poisson rates (from Antleman (1996))

A seller receives 800-number telephone orders from a first geographic area at a rate of Al per week and from a second geographic area at a rate of .A2 per week. Assume that incoming orders behave as if generated by a Poisson distribution; if the rate is A, then the number of orders y in t weeks is distributed Poisson( t)). Suppose a series of newspaper ads is run in the two areas for a period of four weeks, and sales for these four weeks are 260 units in area 1 and 165 units in area 2. The seller is interested in the effectiveness of these ads. One measure of this would be the probability that the sales rate in area 1 is greater than 1.5 times the sales rate in area 2:

 

5.1 Introduction

In the previous two chapters, two types of strategies were used in the summarization of posterior distributions. If the sampling density has a familiar functional form, such as a member of an exponential family, and a conjugate prior is chosen for the parameter, then the posterior distribution often is expressible in terms of familiar probability distributions. In this case, we can simulate parameters directly by use of the R collection of random variate functions (such as rnorm, rbeta and rgamma), and we can summarize the posterior by computations on this simulated sample. A second type of computing strategy is what we called the "brute-force" method. In the case where the posterior distribution is not a familiar functional form, then one simply computes values of the posterior on a grid of points and then approximates the continuous posterior by a discrete posterior that is concentrated on the values of the grid. This brute-force method can be generally applied for oneand two-parameter problems such as those illustrated in Chapters 3 and 4.

In this chapter, we describe the Bayesian computational problem and introduce some of the more sophisticated computational methods that will be employed in later chapters. One general approach is based on the behavior of the posterior distribution about its mode. This gives a multivariate normal approximation to the posterior that serves as a good first approximation in the development of more exact methods. We then provide a general introduction to the use of simulation in computing summaries of the posterior distribution. When one can directly simulate samples from the posterior distribution, then the Monte Carlo algorithm gives an estimate and associated standard error for the posterior mean of any function of the parameters of interest. In the situation where the posterior distribution is not a standard functional form, rejection sampling with a suitable choice of proposal density provides an alternative method for producing draws from the posterior. Importance sampling and sampling importance resampling (SIR) algorithms are alternative general methods for computing integrals and simulating from a general posterior distribution. The SIR algorithm is especially useful when one wishes to investigate the sensitivity of a posterior distribution with respect to changes in the prior and likelihood functions.

5.2 Computing Integrals

The Bayesian recipe for inference is conceptually simple. If we observe data y from a sampling density f (yl B), where B is a vector of parameters and one assigns B a prior g(9), then the posterior density of B is proportional to

The computational problem is to summarize this multivariate probability distribution to perform inference about functions of B.

Many of the posterior summaries are expressible in terms of integrals. Suppose we are interested in the posterior mean of a function h(9). This mean is expressible as a ratio of integrals

If we are interested in the posterior probability that h(9) falls in a set A, we wish to compute

Integrals are also involved when we are interested in obtaining marginal densities of parameters of interest. Suppose the parameter B = (9k, 92), where Bl are the parameters of interest and 92 are so-called nuisance parameters. One obtains the marginal posterior density of Bl by integrating out the nuisance parameters from the joint posterior:

In the common situation where one needs to evaluate these integrals numerically, there are a number of quadrature methods available. However, these quadrature methods have limited use for Bayesian integration problems. First, the choice of quadrature method depends on the location and shape of the posterior distribution. Second, for a typical quadrature method, the number of evaluations of the posterior density grows exponentially as a function of the number of components of B. In this chapter, we focus on the use of computational methods for computing integrals that are applicable to high-dimensional Bayesian problems.

5.3 Setting Up a Problem on R

Before we describe some general summarization methods, we first describe setting up a Bayesian problem on R. Suppose one is able to write an explicit expression for the joint posterior density. In writing this expression, it is not necessary to include any normalizing constants that don't involve the parameters. Next, for the algorithms described in this book, it is helpful to reparameterize all parameters so that they are all real-valued. If one has a positive parameter such as a variance, then transform using a log function. If one has a proportion parameter p, then it can be transformed to the real line by the logit function logit(p) = log(p/(1 - p)).

After the posterior density has been expressed in terms of transformed parameters, the first step in summarizing this density is to write an R function defining the logarithm of the joint posterior density.

The general structure of this R function is

To apply the functions described in this chapter, theta is assumed to be a matrix with n rows and k columns, where each row of theta corresponds to a value of the parameter vector 0 = (01, ..., 9k). The input data is a vector of observed values or a list of data values and other model specifications such as the values of prior hyperparameters. The output vector val contains n values corresponding to the n values of the parameter vector 0.

One common situation is where one observes a random sample yl,..., y, from a sampling density f (y10) and one assigns 0 the prior density g(9). The logarithm of the posterior density of 0 is given, up to an additive constant, by

When programming this function, it is important to note that the input is a matrix theta of parameter values. So it is necessary to use a loop to perform the summation when programming this function. Suppose we are sampling from a normal distribution with mean µ and standard deviation a, the parameter vector 0 = (,u, log a) and we place an N(10, 20) prior on it and a flat prior on log a. The log posterior would have the form

where 0(y; p, a) is the normal density with mean p and standard deviation a. If data is the vector of observations yl, ..., y, then the function defining the log posterior would in this case would be written as follows.

We use the log = TRUE option in dnorm to compute the logarithm of the density. Note the use of the small trick val=0*mu; this is a simple way of creating a zero column vector of the same size as the vector mu.

5.4 A Beta-Binomial Model for Overdispersion

Tsutakawa et al (1985) describe the problem of simultaneously estimating the rates of death from stomach cancer for males at risk in the age bracket 45-64 for the largest cities in Missouri. Table 5.1 displays the mortality rates for 20 of these cities, where a cell contains the number nj at risk and the number of cancer deaths yj for a given city.

A first modeling attempt might assume that the {yj} represent independent binomial samples with sample sizes {nj} and common probability of death p. But it can be shown that these data are overdispersed in the sense that the counts {yj} display more variation that would be predicted under a binomial model with a constant probability p. A better fitting model assumes that yj is distributed from a beta-binomial model with mean it and precision K:

Suppose we assign the parameters the vague prior proportional to

Then the posterior density of (ii, K) is given, up to a proportionality constant, by

where 0<~q <1andK>0.

We write a short function betabinexchO to compute the logarithm of the posterior density. The inputs to the function are theta, a matrix where the values of y and K are respectively in the first and second columns, and data, a matrix with columns the vector of counts {yj} and the vector of sample sizes {nj}.

We read in the dataset cancermortality and use the function mycontour together with the log density function betabinexchO to display a contour plot of the posterior density of (rj, K) (See Fig. 5.1).

Note the strong skewness in the density, especially toward large values of the precision parameter K. This right skewness is a common characteristic of the likelihood function of a precision or variance parameter. Following the general guidance in Section 5.3, suppose we transform each parameter to the real line by the reexpressions

Fig. 5.2 displays a contour plot of the posterior of (01, 02) using the mycontour function. Although the density has an unusual shape, the strong skewness has been reduced and the distribution is more amenable to the computational methods described in this and the following chapters.

5.5 Approximations Based on Posterior Modes

One method of summarizing a multivariate posterior distribution is based on the behavior of the density about its mode. Let 0 be a vector-valued parameter with prior density g(8). If we observe data y with sampling density f (yJ0), then consider the logarithm of the joint density of 0 and y

Fig. 5.1. Contour plot of parameters q and K in the beta-binomial model problem.

In addition, this approximation allows one to analytically integrate out B from the joint density and obtain the following approximation to the prior predictive density:

where d is the dimension of B.

To apply this approximation, one needs to find the mode of the posterior density of B. A good general-purpose optimization algorithm for finding this mode is provided by Newton's method. Suppose one has a guess at the posterior mode B°. If 0t-1 is the estimate at the mode at the t - 1 iteration of the algorithm, then the next iterate is given by

One continues these iterations until convergence.

After one writes an R function to evaluate the log posterior density, the R function laplace in the LearnBayes package finds the joint posterior mode by several iterations of Newton's method. The inputs to laplace are the function defining the joint posterior, an intelligent guess at the posterior mode, the number of iterations of this algorithm, and data and parameters used in the definition of the log posterior. The choice of "intelligent guess" can be important since Newton's algorithm may fail to converge with a poor choice of starting value.

Fig. 5.2. Contour plot of transformed parameters logit(q) and log K in the betabinomial model problem.

Suppose that a suitable starting value is used and laplace is successful in finding the posterior mode. The output of laplace is a list with three components. The component mode gives the value of the posterior mode B, the component var is the associated variance-covariance matrix V, and the component int is the approximation to the logarithm of the prior predictive density.

5.6 The Example

We illustrate the use of the function laplace for our beta-binomial modeling example. Based on our contour plot, we start Newton's method with the initial guess (logit(rl), log K) = (-7, 6) and perform 10 Newton steps.

We find the posterior mode to be (-6.82, 7.57). Also this gives the approximation that (logit(i ), log K) is approximately bivariate normal with mean vector fit$mode and variance-covariance matrix fit$var. By use of the mycontour function with the log bivariate normal function lbinorm, Fig. 5.3 displays the contours of the approximate normal density. Comparing Fig. 5.2 and Fig.5.3, we see significant differences between the exact and approximate normal posteriors.

One advantage of this algorithm is that one obtains quick summaries of the parameters by use of the multivariate normal approximation. By use of the diagonal elements of the variance-covariance matrix, one can construct approximate probability intervals for logit(q) and log K. For example, the following code constructs 90% probability intervals for the parameters:

So a 90% interval estimate for logit(T/) is (-7.28, -6.36), and a 90% interval estimate for log K is (5.66, 9.49).

Fig. 5.3. Contour plot of normal approximation of logit(q) and log K in the betabinomial model problem.

5.7 Monte Carlo Method for Computing Integrals

A second general approach for summarizing a posterior distribution is based on simulation. Suppose that B has a posterior density g(Bly) and we are interested in learning about a particular function of the parameters h(9). The posterior mean of h(9) is given by

Suppose we are able to simulate an independent sample 91, ..., B" from the posterior density. Then the Monte Carlo estimate at the posterior mean is given by the sample mean

The associated simulation standard error of this estimate is estimated by

The Monte Carlo approach is an effective method for summarizing a posterior distribution when simulated samples are available from the exact posterior distribution. For a simple illustration of the Monte Carlo method, return to Section 2.4 where we were interested in the proportion of heavy sleepers p at a college. With the use of a beta prior, the posterior distribution for p was beta(14.4, 23.4). Suppose we are interested in the posterior mean of p2. (This is the predictive probability that two students in a future sample will be heavy sleepers.) We simulate 1000 draws from the beta posterior distribution. If {pi} represent the simulated sample, the Monte Carlo estimate at this posterior mean will be the mean of the {(pj)2}, and the simulated standard error is the standard deviation of the {(p3)2} divided by the square root of the simulation sample size.

The Monte Carlo estimate at E(p2 data) is 0.152 with an associated simulation standard error of 0.002.

5.8 Rejection Sampling

In the examples of Chapter 2, 3, and 4, we were able to produce simulated samples directly from the posterior distribution since the distributions were familiar functional forms. Then we would be able to obtain Monte Carlo estimates at the posterior mean of any function of the parameters of interest. But in many situations such as the beta-binomial example of this chapter, the posterior does not have a familiar form and we need to use an alternative algorithm for producing a simulated sample.

A general-purpose algorithm for simulating random draws from a given probability distribution is rejection sampling. In this setting, suppose we wish to produce an independent sample from a posterior density g(Bly) where the normalizing constant may not be known. The first step in rejection sampling is to find another probability density p(O) such that

• It is easy to simulate draws from p.

• The density p resembles the posterior density of interest g in terms of location and spread.

• For all 0 and a constant c, g(9 y) < cp(8).

Suppose we are able to find a density p with these properties. Then one obtains draws from g by the following accept/reject algorithm:

1. Simulate independently 0 from p and a uniform random variable U on the unit interval.

2. If U < g(Bly)/(cp(B)), then accept B as a draw from the density g, otherwise reject B.

3. Continue steps 1 and 2 of the algorithm until one has collected a sufficient number of "accepted" 0.

Rejection sampling is one of the most useful methods for simulating draws from a variety of distributions and standard methods for simulating from standard probability distributions such as normal, gamma, and beta are typically based on rejection algorithms. The main task in designing a rejection sampling algorithm is finding a suitable proposal density p and constant value c. At step 2 of the algorithm, the probability of accepting a candidate draw is given by g(Bly)/(cp(B)). One can monitor the algorithm by computing the proportion of draws of p that are accepted; an efficient rejection sampling algorithm has a high acceptance rate.

We consider the use of rejection sampling to simulate draws of B =(logit(,q), log K) in the beta-binomial example. We wish to find a proposal density of a simple functional form that, when multiplied by an appropriate constant, covers the posterior density of interest. One choice for p would be a bivariate normal density with mean and variance given as outputs of the function laplace. Although this density does resemble the posterior density, the normal density has relatively sharp tails and likely the ratio g(01y)/p(0) would not be bounded. A better choice for a covering density is a multivariate t with mean and scale matrix chosen to match the posterior density and a small number of degrees of freedom. The small number of degrees of freedom gives the density heavy tails and one is more likely to find bounds for the ratio g(9 y)/p(8).

In our earlier work, we found approximations to the posterior mean and variance-covariance matrix of 0 =(logit(i ), log K) based on the Laplace method. If the output variable of laplace is fit, then fit$mode is the posterior mode and fit$var the associated variance-covariance matrix. Suppose we decide to use a multivariate t density with location fit$mode, scale matrix 2 fit$var, and four degrees of freedom. These choices are made to mimic the posterior density and ensure that the ratio g(O y)/p(O) is bounded from above.

To set up, we need to find the value of the bounding constant. We want to find the constant c such that

Equivalently, since g is programmed on the log scale, we want to find the constant d = log c such that

Basically we wish to maximize the function log g(Oly) - loo p(0) over all 0. A convenient way to perform this maximization is by use of the laplace function. We write a new function betabinT that computes values of this difference function. There are two inputs, the parameter theta and a list datapar with components data, the data matrix, and par, a list with the parameters of the t proposal density (mean, scale matrix, and degrees of freedom)

For our problem, we define the parameters of the t proposal density and the list datapar:

We run the function laplace with this new function and use of an "intelligent" starting value.

We find the maximum value d occurs at the value 0 = (-6.889,12.42736). We note that this 0 value is not at the extreme portion of the space of simulated draws that indicates that we indeed have found an approximate maximum. The value of d is found by evaluating the function at the modal value.

We implement rejection sampling using the function rejectsampling. The inputs are the function defining the log posterior, the parameters of the t covering density, the value of d, the number of candidate values simulated, and the data for the log posterior function. In this function, we simulate a vector of 0 from the proposal density, compute the values of log g and log f on these simulated draws, compute the acceptance probabilities, and return only the simulated values of 0 where the uniform draws are smaller than the acceptance probabilities.

We run the function rejectsampling using the constant value of d found earlier and simulate 10,000 draws from the proposal density. We see that the output value theta has only 2406 rows, so the acceptance rate of this algorithm is 2406/10,000 = .24. This is a relatively inefficient algorithm since it has a small acceptance rate, but the proposal density was found without too much effort.

We plot the simulated draws from rejection sampling on the contour plot of the log posterior density in Fig. 5.4. As expected, most of the draws fall within the inner contour of the exact density.

5.9 Importance Sampling

Let us return to the basic problem of computing an integral in Bayesian inference. In many situations, the normalizing constant of the posterior density g(Oy) will be unknown. So the posterior mean of the function h(9) will be given by the ratio of integrals

If we were able to simulate a sample {93 } directly from the posterior density g, then one could approximate this expectation by a Monte Carlo estimate. In the case where we are not able to generate a sample directly from g, suppose instead that we can construct a probability density p that we can simulate and that approximates the posterior density g. We rewrite the posterior mean as

Fig. 5.4. Contour plot of logit(g) and log K in the beta-binomial model problem together with simulated draws from the rejection algorithm.

where w(9) = g(Bly)/p(B) is the weight function. If 01,..., 0' are a simulated sample from the approximation density p, then the importance sampling estimate at the posterior mean is

This is called an importance sampling estimate because we are sampling values of 0 that are important in computing the integrals in the numerator and denominator. The simulation standard error of an importance sampling estimate is estimated by

As in rejection sampling, the main issue in designing a good importance sampling estimate is finding a suitable sampling density p. This density should be of a familiar functional form so simulated draws are available. The density should mimic the posterior density g and have relatively flat tails so that the weight function w(9) is bounded from above. One can monitor the choice of p by inspecting the values of the simulated weights w(8j). If there are not any unusually large weights, then it is likely that the weight function is bounded and the importance sampler is providing a suitable estimate.

To illustrate importance sampling, let us return to our beta-binomial example and consider the problem of estimating the posterior mean of log K. For a posterior density of real-valued parameters, a convenient choice of sampler p is a multivariate t density. The R function impsampling will implement importance sampling when p is a t density. There are five inputs to this function: logf is the function defining the logarithm of the posterior, tpar is a list of parameter values of the t density, h is a function defining the function h(9) of interest, n is the size of the simulated sample, and data is the vector or list used in the definition of logf. In the function impsampling, the functions rmt and dmt from the mnormt library are used to simulate and compute values of the t density. Note that the value and is the maximum value of the difference of logs of the posterior and proposing density - this value is used in the computation of the weights to prevent possible overflow. The output of impsampling is a list with four components: est is the importance sampling estimate, se is the corresponding simulation standard error, theta is a matrix of simulated draws from the proposing density p, and wt is a vector of the corresponding weights.

For this example, the choice of proposal density used in the development of a rejection algorithm seems to be a good choice for importance sampling. We choose a t density where the location is the posterior mode (found from laplace), the scale matrix is twice the estimated variance-covariance matrix, and the number of degrees of freedom is four. This choice for p will resemble the posterior density and have flat tails that we hope will result in bounded weights. We define a short function myfunc to compute the function h; since we are interested in the posterior mean of log K we define the function to be the second column of the matrix 0. We are now ready to run impsampling.

We see from the output that the importance sampling estimate of the mean of log K is 7.958 with an associated standard error of 0.020. To check if the weight function is bounded, we compute a histogram of the simulated weights (not shown here) and note that there are no extreme weights.

5.10 Sampling Importance Resampling

In rejection sampling, we simulated draws from a proposal density p and accepted a subset of these values to be distributed according to the posterior density of interest g(9 y). There is an alternative method of obtaining a simulated sample from the posterior density g motivated by the importance sampling algorithm.

As before, we simulate m draws from the proposal density p denoted by B1, ..., Bm and compute the weights {w(93) = g(BJ jy)/p(B3)}. Convert the weights to probabilities by the formula

Suppose we take a new sample 0*1, ..., 0*m from the discrete distribution over 01, ..., 0' with respective probabilities p1, ..., p'. Then the {9*j} will be approximately distributed according to the posterior distribution g. This method, called sampling importance resampling or SIR for short, is a weighted bootstrap procedure where we sample with replacement from the sample {93} with unequal sampling probabilities.

This sampling algorithm is straightforward to implement in R using the sample command. Suppose we wish to obtain a simulated sample of size n. As in importance sampling, we first simulate from the proposal density which in this situation is a multivariate t distribution, and then compute the importance sampling weights stored in the vector wt.

To implement the SIR algorithm, we first convert the weights to probabilities and store them in the vector probs. Next we use sample to take a sample with replacement from the indices 1, ..., n, where the sampling probabilities are contained in the vector probs; the simulated indices are stored in the vector indices.

Finally, we use the random indices in indices to select the rows of theta and assign to the matrix theta.s. The matrix theta.s contain the simulated draws from the posterior.

The function sir implements this algorithm for a multivariate t proposal density. The inputs to this function are the function defining the log posterior logf, the list tpar of parameters of the multivariate proposal density, the number n of simulated draws, and the data used in the log posterior function. The output is a matrix of simulated draws from the posterior. In the betabinomial modeling example, we implement the SIR algorithm by the command

We have illustrated the use of the SIR algorithm in converting simulated draws from a proposal density to draws from the posterior density. But this algorithm can be used to convert simulated draws from one probability density to a second probability density. To show the power of this method, suppose we wish to perform a Bayesian sensitivity analysis with respect to the individual observations in the dataset. Suppose we focus on posterior inference about the log precision parameter log K and question how the inference would change if we removed individual observations from the likelihood. Let g(Bly) denote the posterior density from the full dataset and g(Bly(_j)) denote the posterior density with the ith observation removed. Let {93} represent a simulated sample from the full dataset. We can obtain a simulated sample from g(Bly(_z)) by resampling from {63 }, where the sampling probabilities are proportional to the weights

Suppose that the inference of interest is a 90% probability interval for the log precision log K. The R code for this resampling for the "leave observation i out" follows. One first computes the sampling weights and the sampling probabilities. Then the sample command is used to do the resampling from theta and the simulated draws from the "leave one out" posterior are stored in the variable theta. s. We summarize the simulated values of log K by the 5th, 50th, and 95th quantiles.

The function bayes.influence computes probability intervals for log K for the complete dataset and "leave one out" datasets using the SIR algorithm. We assume one already has simulated a sample of values from the complete data posterior and the draws are stored in the matrix variable theta.s. The inputs to bayes. influence are theta.s and the dataset data. In this case, suppose we have just implemented the SIR algorithm and the posterior draws are stored in the matrix theta. s. Then the form of the function would be

The output of this function is a list S; S$summary is a vector containing the 5th, 50th, and 95th percentiles and S$summary.obs is a matrix where the ith row gives the percentiles for the posterior with the ith observation removed.

Fig. 5.5 is a graphical display of the sensitivity of the posterior inference about log K with respect to the individual observations. The bold line shows the posterior median and 90% probability interval for the complete dataset and the remaining lines show the inference with each possible observation removed. Note that if observation number 15 is removed ((yz, ni) = (54, 53637)), then the location of log K is shifted toward smaller values. Also if either observation 10 or observation 19 is removed, log K is shifted toward larger values. These two observations are notable since each city experienced three deaths and had relatively high mortality rates.

Fig. 5.5. Ninety percent interval estimates for log K for full dataset (thick line) and interval estimates for datasets with each individual observation removed.

5.11 Further Reading

Rejection sampling is a general method used in simulating probability distributions; discussion of rejection sampling for statistical problems is described in Givens and Hoeting (2005), Monahan (2001), and Robert and Casella (2004). Tanner (1996) introduces normal approximations to posterior distributions in chapter 2 and Monte Carlo methods in chapter 3. Robert and Casella (2004) in chapter 3 describe different aspects of Monte Carlo integration. Smith and Gelfand (1992) introduce the use of rejection sampling and the SIR algorithm in simulating from the posterior distribution.

5.12 Summary of R Functions

bayes. influence - computes probability intervals for the log precision parameter K in a beta-binomial model for all "leave one out" models using sampling importance resampling

Usage:bayes.influence(theta,data)

Arguments: theta, matrix of simulated draws from the posterior of (logit eta, log K) for a beta-binomial model; data, matrix with columns of counts and sample sizes

Value: summary, vector of 5th, 50th and 95th percentiles of log K for posterior of complete sample; summary. obs, matrix where the ith row contains the 5th, 50th and 95th percentiles of log K for posterior when the ith observation is removed

betabinexchO - computes the logarithm of the posterior for the parameters (mean and precision) in a beta/binomial model

Usage: betabinexch0 (theta ,data)

Arguments: theta, matrix of parameter values where each row represents a value of (eta, K); data, matrix with columns of counts and sample sizes

Value: vector of values of the log posterior where each value corresponds to each row in the parameters in theta

betabinexch - computes the logarithm of the posterior for the parameters (logit mean and log precision) in a beta/binomial model

Usage: betabinexch(theta,data)

Arguments: theta, matrix of parameter values where each row represents a value of (logit eta, log K); data, matrix with columns of counts and sample sizes

Value: vector of values of the log posterior where each value corresponds to each row in the parameters in theta

impsampling - implements importance sampling to compute the posterior mean of a function using a multivariate t proposal density

Usage: impsampling(logf,tpar,h,n,data)

Arguments: logf, function defining the log density; tpar, list of parameters of a multivariate t proposal density including the mean m, the scale matrix var, and the degrees of freedom df; h, function that defines h(theta); n, number of simulated draws from the proposal density; data, data and or parameters used in the function logf

Value: est, estimate at the posterior mean; se, simulation standard error of the estimate; theta, matrix of simulated draws from proposal density; wt, vector of importance sampling weights

laplace - for a general posterior density, computes the posterior mode, the associated variance-covariance matrix, and an estimate at the logarithm at the normalizing constant

Usage: laplace(logpost,mode, iter,par)

Arguments: logpost, function that defines the logarithm of the posterior density; mode, vector that is a guess at the posterior mode; iter, number of iterations of Newton-Raphson algorithm; par, vector or list of parameters associated with the function logpost

Value: mode, current estimate at the posterior mode; var, current estimate at the associated variance-covariance matrix; int, estimate at the logarithm of the normalizing constant

rejectsampling -implements a rejection sampling algorithm for a probability density using a multivariate t proposal density

Usage:rejectsampling(logf,tpar,dmax,n,data)

Arguments: logf, function that defines the logarithm of the density of interest; tpar, list of parameters of a multivariate t proposal density including the mean m, the scale matrix var, and the degrees of freedom df; dmax, logarithm of the rejection sampling constant; n, number of simulated draws from the proposal

density; data, data and or parameters used in the function loaf

Value: matrix of simulated draws from density of interest

sir - implements the sampling importance resampling algorithm for a multivariate t proposal density

Usage: sir(logf,tpar,n,data)

Arguments: logf, function defining logarithm of density of interest; tpar, list of parameters of a multivariate t proposal density including the mean m, the scale matrix var, and the degrees of freedom df; n, number of simulated draws from the posterior; data, data and parameters used in the function logf

Value: matrix of simulated draws from the posterior where each row corresponds to a single draw

5.13 Exercises

1. Estimating a log-odds with a normal prior

Suppose y has a binomial distribution with parameters n and p, and we are interested in the lob odds value B = log (p/(1 - p)) . Our prior for B is that B - N(µ, or). It follows that the posterior density of B is given, up to a proportionality constant, by

More concretely, suppose we are interested in learning about the probability a special coin lands heads when tossed. A priori we believe that the coin is fair, so we assign B an N(0, .25) prior. We toss the coin n = 5 times and obtain y = 5 heads.

a) Using a normal approximation to the posterior density, compute the probability that the coin is biased toward heads (i.e., that B is positive).

b) Using the prior density as a proposal density, design a rejection algorithm for sampling from the posterior distribution. Using simulated draws from your algorithm, approximate the probability that the coin is biased toward heads.

c) Using the prior density as a proposal density, simulate values from the posterior distribution using the SIR algorithm. Approximate the probability the coin is biased toward heads.

2. Genetic linkage model from Rao (2002)

Suppose 197 animals are distributed into four categories with the following frequencies:

where 8 is an unknown parameter between 0 and 1. If 8 is assigned a uniform prior, then the posterior density of 8 is given by

where 0 < 8 < 1. If 8 is transformed to the real-valued logit rj _ log (8/(1 - 8)), then the posterior density of 77 can be written as

a) Use a normal approximation to find a 95% probability interval for q. Transform this interval to obtain a 95% probability interval for the original parameter of interest 8.

b) Design a rejection sampling algorithm for simulating from the posterior density of rl. Use a t proposal density using a small number of degrees of freedom and mean and scale parameters given by the normal approximation.

3. Estimation for the two-parameter exponential distribution

Martz and Waller (1982) describe the analysis of a "type I/time-truncated" life testing experiment. Fifteen reciprocating pumps were tested for a prespecified time; any failures among the pumps were replaced. One assumes that the failure times follow the two-parameter exponential distribution

Suppose one places a uniform prior on (,u, ~3). Then Martz and Waller show that the posterior density is given by

where n is the number of items placed on test, t is the total time on test, ti is the smallest failure time, and s is the observed number of failures in a sample of size n. In the example, data were reported in cycles to failure; n = 15 pumps were tested for a total time of t = 15, 962, 989. Eight failures (s = 8) were observed and the smallest failure time was ti = 237, 217.

a) Suppose one transforms the parameters to the real line by the transformations 01 = loV3, 92 = log(t1 - µ). Write down the posterior density of (01 i 92) .

b) Construct an R function that computes the log posterior density of (01,02)-

c) Use the laplace function to approximate the posterior density.

d) Use a multivariate t proposal density and the SIR algorithm to simulate a sample of 1000 draws from the posterior distribution.

e) Suppose one is interested in estimating the reliability at time to defined by

Using your simulated values from the posterior, find the posterior mean and posterior standard deviation of R(to) when to = 106 cycles.

4. Poisson regression

Haberman (1978) considers an experiment involving subjects reporting one stressful event. The collected data are yi,..., y18, where yi is the number of events recalled i months before the interview. Suppose yi is distributed Poisson with mean Ai, where the {Ai} satisfy the loglinear regression model

The data are shown in Table 5.2. If (/3o, /81) is assigned a uniform prior, then the logarithm of the posterior density is given, up to an additive constant, by

a) Write a R function to compute the logarithm of the posterior density of (0o, 01).

b) Suppose we are interested in estimating the posterior mean and standard deviation for the slope 01. Approximate these moments by a normal approximation about the posterior mode (function laplace).

c) Use a multivariate t proposal density and the SIR algorithm to simulate 1000 draws from the posterior density. Use this sample to estimate the posterior mean and standard deviation of the slope 31. Compare your estimates with the estimates using the normal approximation.

5. Grouped Poisson data

Hartley (1958) fits a Poisson model to the following grouped data:

Suppose the mean Poisson parameter is A and the frequency of observations with j events is nj, j = 0, 1, 2, and n3 is the frequency of observations with at least three events. If the standard noninformative prior g(A) = 1/. is assigned, then the posterior density is given by

• Write an R function to compute the logarithm of the posterior density of A.

• Use the function laplace to find a normal approximation to the posterior density of the transformed parameter 0 = log A.

• Use a t proposal density and the SIR algorithm to simulate 1000 draws from the posterior. Use the simulated sample to estimate the posterior mean and standard deviation of A. Compare the estimates with the normal approximation estimates found in part (a).

 

6.1 Introduction

In Chapter 5, we introduced the use of simulation in Bayesian inference. Rejection sampling is a general method for simulating from an arbitrary posterior distribution, but it can be difficult to set up since it requires the construction of a suitable proposal density. Importance sampling and SIR algorithms are also general-purpose algorithms, but they also require proposal densities that may be difficult to find for high-dimensional problems. In this chapter, we illustrate the use of Markov chain Monte Carlo (MCMC) algorithms in summarizing posterior distributions. Markov chains are introduced in the discrete state space situation in Section 6.2. Through a simple random walk example, we illustrate some of the important properties of a special Markov chain, and we use R to simulate from the chain and move toward the stationary distribution. In Section 6.3, we describe two variants of the popular Metropolis-Hastings algorithms in setting up Markov chains, and in Section 6.4, we describe Gibbs sampling where the Markov chain is set up through the conditional distributions of the posterior. We describe one strategy for summarizing a posterior distribution and illustrate it for three problems. MCMC algorithms are very attractive in that they are easy to set up and program and require relatively little prior input from the user. R is a convenient language for programming these algorithm and is also very suitable for performing output analysis, where one does several graphical and numerical computations to check if the algorithm is indeed producing draws from the target posterior distribution.

6.2 Introduction to Discrete Markov Chains

Suppose a person takes a random walk on a number line on the values 1, 2, 3, 4, 5, 6. If the person is currently at an interior value (2, 3, 4, or 5), in the next second she is equally likely to remain at that number or move to an adjacent number. If she does move, she is equally likely to move left or right. If the person is currently at one of the end values (1 or 6), in the next second she is equally likely to stay or move to the adjacent location.

This is a simple example of a discrete Markov chain. A Markov chain describes probabilistic movement between a number of states. Here there are six possible states, 1 through 6, corresponding to the possible location of the walker. Given that the person is at a current location, she moves to other locations with specified probabilities. The probability she moves to another location depends only on her current location and not on previous locations visited. We describe movement between states in terms of transition probabilities - they describe the likelihoods of moving between all possible states in a single step in a Markov chain. We summarize the transition probabilities by means of a transition matrix T:

The first row in T gives the probabilities of moving to all states 1 through 6 in a single step from location 1, the second row gives the transition probabilities in a single step from location 2, and so on.

There are several important properties of this particular Markov chain. It is possible to go from every state to every state in one or more steps - a Markov chain with this property is said to be irreducible. Given that the person is in a particular state, if the person can only return to this state at regular intervals, then the Markov chain is said to be periodic. This example is aperiodic since it is not a periodic Markov chain.

We can represent one's current location as a probability row vector of the form

where pi represents the probability the person is currently in state i. If pi represents the location of the traveler at step j, then the location of the traveler at the j + 1 step is given by the matrix product

Suppose we can find a probability vector w such that wT = w. Then w is said to be the stationary distribution. If a Markov chain is irreducible and aperiodic, then it has a unique stationary distribution. Moreover, the limiting distribution of this Markov chain, as the number of steps approaches infinity, will be equal to this stationary distribution.

We can empirically demonstrate the existence of the stationary distribution of our Markov chain by running a simulation experiment. We start our random walk at a particular state, say location 3, and then simulate many steps of the Markov chain using the transition matrix T. The relative frequencies of our traveler in the six locations after many steps will eventually approach the stationary distribution w.

We start our simulation in R by reading in the transition matrix T and setting up a storage vector s for the locations of our traveler in the random walk.

We indicate that the starting location for our traveler is state 3 and perform a loop to simulate 50,000 draws from the Markov chain. We use the sample function to simulate one step - the arguments to this function indicate that we are sampling a single value from the set {1, 2, 3, 4, 5, 6} with probabilities given by the sJ row of the transition matrix T, where si-1 is the current location of our traveler.

We summarize the frequencies of visits to the six states after 500, 2000, 8000, and 50,000 steps of the chain by use of the table command; we convert the counts to relative frequencies by dividing by the number of steps.

It appears from the output that the relative frequencies of the states are converging to the stationary distribution w = (0.1, 0.2, 0.2, 0.2, 0.2, 0.1). We can confirm that w is indeed the stationary distribution of this chain by multiplying w by the transition matrix T:

6.3 Metropolis-Hasting Algorithms

A popular way of simulating from a general posterior distribution is by Markov chain Monte Carlo (MCMC) methods. This essentially is a continuous-valued generalization of the discrete Markov chain setup described in the previous section. The MCMC sampling strategy sets up an irreducible, aperiodic Markov chain for which the stationary distribution equals the posterior distribution of interest. A general way of constructing a Markov chain is by a MetropolisHastings algorithm. In this section, we focus on two particular variants of Metropolis-Hastings algorithms, the independence chain and the random walk chain, that are applicable to a wide variety of Bayesian inference problems.

Suppose we wish to simulate from a posterior density g(Bly). In the following, to simplify notation, we write the density simply as g(8). A MetropolisHastings algorithm begins with an initial value B° and specifies a rule for simulating the tth value in the sequence Bt given the (t - 1)st value in the sequence 0". This rule consists of a proposal density which simulates a candidate value B*, and the computation of an acceptance probability P that indicates the probability the candidate value will be accepted to be the next value in the sequence. Specifically, this algorithm can be described as follows:

• Simulate a candidate value B* from a proposal density p(9* 19t-1)

• Compute the ratio

• Compute the acceptance probability P = min{R,1}.

• Sample a value Bt such that Bt = B* with probability P; otherwise Bt = Bt-1.

Under some easily satisfied regularity conditions on the proposal density p(9* 19t-1), the sequence of simulated draws 01,02'... will converge to a random variable that is distributed according to the posterior distribution g(6).

Different Metropolis-Hastings algorithms are constructed by the choice of proposal density. If the proposal density is independent of the current value in the sequence, that is,

then the resulting algorithm is called an independence chain. Other proposal densities can be defined by letting the density have the form

where h is a symmetric density about the origin. In this type of random walk chain, the ratio R has the simple form

The R functions rwmetrop and indepmetrop in the LearnBayes package implement, respectively, the random-walk and independence MetropolisHasting algorithms for special choices of proposal densities. For the function rwmetrop, the proposal density has the form

where Z is multivariate normal with mean vector 0 and variance-covariance matrix V and scale is a positive scale parameter. For the function indepmetrop, the proposal density for 0* is multivariate normal with mean vector y and covariance matrix V.

To use a Metropolis-Hastings algorithm, one first decides on the proposal density and then obtains a simulated sample of draws {Bt, t = 1, ...m} by use of the R functions rwmetrop or indepmetrop. The output of each of these functions has two components: par is a matrix of simulated draws where each row corresponds to a value of 0, and accept gives the acceptance rate of the algorithm.

Desirable features of the proposal density in an algorithm depend on the MCMC algorithm employed. For an independence chain, we desire that the proposal density p approximates the posterior density g, suggesting a high acceptance rate. But, as in rejection sampling, it is important that the ratio g/p is bounded, especially in the tail portion of the posterior density. This means that one may choose a proposal p that is more diffuse than the posterior, resulting in a lower acceptance rate. For random walk chains with normal proposal densities, it has been suggested that acceptance rates between 25% and 45% are good. The "best" choice of acceptance rate ranges from 45% for one and two parameters to 25% for problems with more parameters. This advice also applies when one monitors the Metropolis within Gibbs algorithm described in Section 6.4.

6.4 Gibbs Sampling

One of the attractive methods of setting up an MCMC algorithm is Gibbs sampling. Suppose that the parameter vector of interest is B = (01i ..., Bp). The joint posterior distribution of B, which we denote by [01data], may be of high dimension and difficult to summarize. Suppose we define the set of conditional distributions

where [X IY, Z] represents the distribution of X conditional on values of the random variables Y and Z. The idea behind Gibbs sampling is that we can set up a Markov chain simulation algorithm from the joint posterior distribution by successfully simulating individual parameters from the set of p conditional distributions. Simulating in turn one value of each individual parameter from these distributions is called one cycle of Gibbs sampling. Under general conditions, draws from this simulation algorithm will converge to the target distribution (the joint posterior of B) of interest.

6.5 MCMC Output Analysis

For the MCMC algorithms described in this book, the distribution of the simulated value at the jth iterate, 93, will converge to a draw from the posterior distribution as j approaches infinity. Unfortunately, this theoretical result provides no practical guidance on how to decide if the simulated sample provides a reasonable approximation to the posterior density g(Bldata).

In typical practice, one monitors the performance of an MCMC algorithm by inspecting the value of the acceptance rate, constructing graphs, and computing diagnostic statistics on the stream of simulated draws. We call this investigation an MCMC output analysis. By means of this exploratory analysis, one decides if the chain has sufficiently explored the entire posterior distribution (there is good mixing) and the sequence of draws has approximately converged. If one has a sample from the posterior distribution, then one wishes to obtain a sufficient number of draws so that one can accurately estimate any particular summary of the posterior of interest.

In this section we briefly describe some of the important issues in interpreting MCMC output and describe a few graphical and numerical diagnostics for assessing convergence. One issue in understanding MCMC output is detecting the size of the burn-in period. The simulated values of B obtained at the beginning of an MCMC run are not distributed from the posterior distribution. However, after some number of iterations have been performed (the burn-in period), the effect of the initial values wears off and the distribution of the new iterates approaches the true posterior distribution. One way of estimating the length of the burn-in period is to examine trace plots of simulated values of a component or particular function of B against the iteration number. Trace plots are especially important when MCMC algorithms are initialized with parameter values that are far from the center of the posterior distribution.

A second concern in analyzing output from MCMC algorithms is the degree of autocorrelation in the sampled values. In both the Metropolis and Gibbs sampling algorithms, the simulated value of B at the (j + 1)st iteration is dependent on the simulated value at the jth iteration. If there is strong correlation between successive values in the chain, then two consecutive values provide only marginally more information about the posterior distribution than a single simulated draw. Also, a strong correlation between successive iterates may prevent the algorithm from exploring the entire region of the parameter space. A standard statistic for measuring the degree of dependence between successive draws in the chain is the autocorrelation that measures the correlation between the sets {93} and {93+L'}, where L is the lag or number of iterates separating the two sets of values. A standard graph is to plot the values of the autocorrelation against the log L. If the chain is mixing adequately, the values of the autocorrelation will decrease to zero as the lag value is increased.

Another issue that arises in output analysis is the choice of the simulated sample size and the resulting accuracy of calculated posterior summaries. Since iterates in an MCMC algorithm are not independent, one cannot use standard "independent sample" methods to compute estimated standard errors. One simple method of computing standard errors for correlated output is the method of batch means. Suppose we estimate the posterior mean of Bi with the summary sample mean

6.6 A Strategy in Bayesian Computing

For a particular Bayesian inference problem, we assume that one has defined the log posterior density by an R function. Following the recommendation of Gelman et al (2003), Chapter 11, a good approach for summarizing this density is to set up a Markov chain simulation algorithm. The Metropolis-Hastings random walk and independence chains and the Gibbs sampling algorithm are attractive Markov chains since they are easy to program and require relatively little prior input. But these algorithms do require some initial guesses at the location and spread of the parameter vector 0. These initial guesses can be found by non-Bayesian methods such as the method of moments or maximum likelihood. Alternatively, one can obtain an approximation to the posterior distribution by finding the mode by use of some optimization algorithm. For example, Newton's method gives the posterior mode and an approximation to the variance-covariance matrix that can be used in specifying the proposal densities in the Metropolis-Hastings algorithms.

6.7 Learning About a Normal Population from Grouped Data

As a first example, suppose a random sample is taken from a normal population with mean fr and standard deviation a-. But one only observes the data in "grouped" form, where the frequencies of the data in bins are recorded. For example, suppose one is interested in learning about the mean and standard deviation of the heights (in inches) of men from a local college. One is given the summary frequency data shown in Table 6.1. One sees that 14 men were shorter than 66 inches, 30 men had heights between 66 and 68 inches, and so on.

We are observing multinomial data with unknown bin probabilities pi,..., P6 where the probabilities are functions of the unknown parameters of the normal population. For example, the probability that a student's height is between 66 and 68 inches is given by p2=oP(68,p,7)-x(66,It,7), where ~P(; µ, 7) is the cdf of a normal(µ, o-) random variable. It is straightforward to show that the likelihood of the normal parameters given this grouped data is given by

Suppose (µ, 7) are assigned the usual noninformative prior proportional to 1/7. Then the posterior density of the parameters is proportional to

Following our general strategy, we transform the positive standard deviation by A = log (7) and the posterior density of (y, A) is given by

We begin by writing a short function groupeddatapost that computes the logarithm of the posterior density of (,u, A). There are two arguments to this function: a matrix theta, where each row corresponds to a value of (µ, A), and a list data. The list has two components: data$b is a vector of cutpoints for the bins and data$f is a vector of bin frequencies.

We begin by defining the grouped data by the list d.

To use the function laplace, one requires a good guess at the location of the posterior mode. To estimate the mode of (p, log a), we first create an artificial continuous dataset by replacing each grouped observation by its bin midpoint. Then we approximate the posterior mode by computing the sample mean and the logarithm of the standard deviation of these artificial observations.

Based on this computation, we believe that the posterior of the vector (µ, log c7) is approximately (70, 1). We use the laplace function, where the log posterior is defined in the function groupeddatapost, start is set equal to this starting value, 10 iterations of Newton's method are run, and the grouped data are contained in the list d.

From the output, the posterior mode of (y, log ar) is found to be (70.17, 0.97). The associated posterior standard deviations of the parameters can be estimated by computing the square roots of the diagonal elements of the variancecovariance matrix.

We use the output from the function laplace to design a Metropolis random walk algorithm to simulate from the joint posterior. For the proposal density we use the variance-covariance matrix obtained from laplace and we set the scale parameter equal to 2. We run 10,000 iterations of the random walk algorithm starting at the value start. The output fit2 is a list with two components: par is a matrix of simulated values where each row corresponds to a single draw of the parameter vector, and accept gives the acceptance rate of the random walk chain.

We monitor the algorithm by displaying the acceptance rate; here the value is .2937 which is close to the desired acceptance rate for this Metropolis random walk algorithm.

We can summarize the parameters µ and log o- by computation of the posterior means and posterior standard deviations.

One can assess the accuracy of the model approximation to the posterior by comparing the means and standard deviations from the function laplace with the values computed from the simulated output from the MCMC algorithm.

For this model, there is close agreement in the two sets of posterior moments which indicates that the modal approximation to the posterior distribution is reasonably accurate.

We confirm this statement by using the function mycontour to draw a contour plot of the joint posterior of µ and log o-. The last 5000 simulated draws from the random walk Metropolis algorithm are drawn on top in Fig. 6.1. Note that the contour lines have an elliptical shape that confirms the accuracy of the normal approximation in this example.

Fig. 6.1. Contour plot of posterior of p and log a for grouped data example. A simulated sample of 5000 draws of the posterior is also shown.

6.8 Example of Output Analysis

We illustrate the use of MCMC output analysis by use of the R package boa that will be described in Chapter 11. Suppose we rerun the Metropolis random walk algorithm for the grouped data posterior with poor choices of starting value and proposal density. As a starting value, we choose (µ, log a) = (65, 1) (the choice of µ is too small) and we select the small scale factor of 0.2 (instead of 2):

We then rerun the Metropolis function rwmetrop:

We find that the acceptance rate of this modified algorithm is 0.89 which is much larger than the 0.29 rate that we found using the scale factor 2.

Fig. 6.2 displays a trace plot of the simulated draws of y from this Metropolis algorithm. Note that there is a significant burn-in period, approximately 600 iterations, before the simulated draws reach the main support of the posterior of µ. Also note the irregularity of the simulated sequence; for example, the iterates will explore the region where µ > 71 for a while before returning to the center of the distribution.

One can observe the strong correlation structure of the sequence by the use of an autocorrelation plot shown in Fig. 6.3. The autocorrelations are close to one for lag one and reduce very slowly as a function of the lag.

The following summary output of the simulated draws of p confirm the behavior of the MCMC run seen in Fig. 6.2 and Fig. 6.3. The estimate at the posterior mean of It is 70.04. If we assume naively that this simulated sample represented independent draws, then the standard error of this estimate is .0069. However, a more accurate estimate at the standard error is the Batch SE given by .0483. The lag one autocorrelation of the batch means (using batches of size 50) is .842.

Fig. 6.2. Trace plot of simulated draws of p for an MCMC chain with poor choices for starting value and scale factor.

It is instructive to compare these diagnostic graphs with the graphs using the better starting value and choice of proposal density used in Section 6.7. Fig. 6.4 and Fig. 6.5 display a trace plot and autocorrelation graph of the simulated draws of µ using the starting value (µ, log or) = (70, 1) and scale factor equal to 2. The trace plot of the simulated stream of µ looks more like random noise. The lag one autocorrelation is high, but the autocorrelation values dissipate rapidly as a function of the lag.

As before, we can compute summary statistics for this stream of MCMC output.

Fig. 6.3. Autocorrelation plot of simulated draws of y for an MCMC chain with poor choices for starting value and scale factor.

Here the estimate of the posterior mean of y is 70.16 with a batch standard error of .011. The autocorrelation between batch means of size 50 is the small value 0.0196. The graphs and the summary statistics confirm the better performance of the MCMC chain with a starting value (µ, log a) = (70, 1) and scale factor of 2.

Fig. 6.4. Trace plot of simulated draws of p for MCMC chain with good choices for starting value and scale factor.

6.9 Modeling Data with Cauchy Errors

For a second example, suppose that we are interested in modeling data where outliers may be presented. Suppose yl, ..., yn are a random sample from a Cauchy density with location parameter it and scale parameter a:

where z = (y - µ)/o-. Suppose that we assign the usual noninformative prior to (/"a):

The posterior density of y and or is given, up to a proportionality constant, by

Fig. 6.5. Autocorrelation plot of simulated draws of p for an MCMC chain with good choices for starting value and scale factor.

Again we first transform the positive parameter a to the real line by the reexpression A = log Q, leading to the posterior density of (µ, A):

The logarithm of the density is then given, up to an additive constant, by

We write the following R function cauchyerrorpost to compute the log arithm of the posterior density. There are two arguments to the function: theta, a matrix where each row corresponds to a value of the pair (it, A), and the vector of observations y. Since the parameters are vectors, we use a loop in the function where the individual terms of the log likelihood are summed over the values of yl, ..., yn. To simplify the code, we use the R function dt, which computes the density of the t random variable. (The Cauchy density is the t density with a single degree of freedom.)

We apply this model to Darwin's famous dataset concerning 15 differences of the heights of cross- and self-fertilized plants quoted by Fisher (1960). This dataset can be found in the LearnBayes library with the name darwin. We read in the dataset and attach the dataframe so we can access the variable difference. We initially compute the mean and logarithm of the standard deviation of the data to get some initial estimates at the locations of the posterior distributions of µ and ) = log(o).

To find the posterior mode, we use the function laplace. The arguments are the name of the function cauchyerrorpost defining the log posterior density, an array of initial estimates at the parameters, the number of iterations of the Newton-Raphson algorithm, and the data used in the log posterior function. For initial estimates, we use the values µ = 21.6, ) = 3.6 found earlier, and we use 10 iterations of the algorithm.

The posterior mode is given by (µ, )) = (24.7, 2.77). The output also gives the associated variance-covariance matrix and an estimate at the log integral.

We can use these estimates of center and spread to construct a rectangle that covers essentially all of the posterior probability of the parameters. As an initial guess at this rectangle, we take for each parameter the posterior mode plus and minus four standard deviations, where the standard deviations are obtainable from the diagonal elements of the variance-covariance matrix.

After some trial and error, we use the rectangle p E (-10, 60), A E (1, 4.5) as the bounding rectangle for the function mycontour. Fig. 6.6 displays the contour graph of the exact posterior distribution.

The contours of the exact posterior distribution have an interesting shape and one may wonder how these contours compare to those for a bivariate normal approximation. In the R code, we rerun the laplace function to obtain the posterior mode t$mode and associated variance-covariance matrix t$var. Using these values as inputs, we draw contours of a bivariate normal density in Fig. 6.7 where the log bivariate normal density is programmed in the function lbinorm. The eliptical shape of these normal contours seems significantly different from the shape of the exact posterior contours, which indicates that the normal approximation may be inadequate.

Although the normal approximation may not be the best summary of the posterior distribution, the estimated variance-covariance matrix is helpful in setting up a Metropolis random walk chain. We initially define a list proposal that contains the estimated variance-covariance matrix and a scale factor. We define the starting value of the chain in the array start. The simulation algorithm is run using the function rwmetrop. The inputs are the function defining the log posterior, the list proposal, the starting value, the number of simulations, and the data vector.

Fig. 6.6. Contour plot of posterior of it and log a for Cauchy error model problem.

In Fig. 6.8 we display simulated draws from rwmetrop on top of the contour graph.

Fig. 6.9 and Fig. 6.10 show the "exact" marginal posterior densities of µ and log a found from a density estimate from 50,000 simulated draws from the random walk algorithm. Also each figure shows the approximate normal approximation from the laplace output. These figures demonstrate the nonnormal shape of these marginal posteriors.

Fig. 6.7. Contour plot of normal approximation to posterior of ,a and log a for Cauchy error model problem.

It is instructive to illustrate "brute-force" and other Metropolis-Hastings algorithms for this problem. The brute-force algorithm is based on simulating draws of (µ, log or) from the grid using the function simcontour. One can use a Metropolis-Hastings independence chain, where the proposal density is multivariate normal with mean and variance given by the normal approximation. Alternatively, one can apply a Gibbs sampling algorithm with a vector of scale parameters equal to (12, .75); these values are approximately equal to twice the estimated posterior standard deviations of the two parameters. All the simulation algorithms were run with a simulation sample size of 50,000. The R code for the implementation of the four simulation algorithms follows.

Fig. 6.8. Contour plot of posterior of y and log Q with simulated sample for Cauchy error model problem.

The simulated draws for a parameter can be summarized by the computation of the 5th, 50th, and 95th percentiles. For example, one can find the summaries of It and log a from the random walk simulation by the command

Table 6.2 displays the estimated posterior quantiles for all of the algorithms described in this chapter. In addition, the acceptance rates for the MetropolisHastings random walk and independence chains and the Gibbs sampler are shown. Generally there is agreement among the simulation-based methods and these "exact" posterior summaries are different from the quantiles found using the Laplace normal approximation. The exact marginal posterior distribution of ,u has heavier tails than suggested by the normal approximation; also there is some skewness in the marginal posterior distribution of log Q.

Fig. 6.9. Posterior density of µ using normal approximation and simulated draws from the Metropolis random walk chain.

Fig. 6.10. Posterior density of log or using normal approximation and simulated draws from the Metropolis random walk chain.

6.10 Analysis of the Stanford Heart Transplant Data

Turnbull et al (1974) describe a number of approaches for analyzing heart transplant data from the Stanford Heart Transplanation Program. One of the inferential goals is to decide if heart transplantation extends a patient's life. One of their models, the Pareto model, assumes individual patients in the nontransplant group have exponential lifetime distributions with mean 1/9, where 9 is assumed to vary between patients and is drawn from a gamma distribution with density

Patients in the transplant group have a similar exponential lifetime distribution where the mean is 1/(9'r). This model assumes that the patient's risk of death changes by an unknown constant factor T > 0. If T = 1, then there is no increased risk by having a transplant operation.

Suppose the survival times {xi} are observed for N nontransplant patients. For n of these patients, xi represents the actual survival time (in days); the remaining N-n patients were still alive at the end of the study, so xi represents the censoring time. For the M patients that have a heart transplant, let yj and zj denote the time to transplant and survival time; in of these patients died during the study. The unknown parameter vector is (T, A, p) with likelihood function given by

where all the parameters are positive. Suppose we place a uniform prior on (T, A, p), and so the posterior density is proportional to the likelihood.

Following our summarization strategy, we transform the parameters by logs:

The posterior density of 0 = (01, 02, 03) is given by

The dataset stanfordheart in the LearnBayes package contains the data for 82 patients; for each patient, there are four variables: survtime, the survival time; transplant, a variable that is 1 or 0 if the patient had a transplant or not; timetotransplant, the time a transplant patient waits for the operation; and state, a variable that indicates if the survival time was censored (0 if the patient died and 1 if he was still alive). We load this datafile into R.

We write a function transplantpost that computes a value of the log posterior. In the following code, we generally follow the earlier notation. The numbers of nontransplant and transplant patients are denoted by N and M. We divide the data into two groups by the transplant indicator variable t. For the nontransplant patients, the survival times and censoring indicators are denoted by xnt and dnt, and for the transplant patients, the waiting times, survival times, and censoring indicators are denoted by y, z, and dt.

To get an initial idea about the location of the posterior, we run the function laplace. Our initial estimate at the posterior mode is 0 = (0, 3, -1) and we run 10 Newton steps. The algorithm converges and we obtain the posterior mode and an estimate at the variance-covariance matrix.

We use a Metropolis random walk algorithm (implemented in the function rwmetrop) to simulate from the posterior. We use a proposal variance of 2V, where V is the estimated variance-covariance matrix from the Laplace fit. We run the simulation for 10,000 iterations and as the output indicates, the acceptance rate was equal to 19%.

One primary inference in this problem is to learn about the three parameters T, A, and p. Fig. 6.11 displays density estimates of the simulated draws from the marginal posterior densities of each parameter. These are simply obtained by exponentiating the simulated draws of 0 that are output from the function rwmetrop. For example, the first plot in Fig. 6.11 is constructed by first computing the simulated draws of T and then using the plot (density O ) command.

Fig. 6.11. Posterior densities of parameters T, A, and p in Pareto survival model.

We can summarize the parameters T, A, and p by computing the 5th, 50th, and 95th percentiles of the simulated draws by the apply command.

From Fig. 6.11 and these summaries, we see that the value T = 1 is in the center of the posterior distribution and so there is insufficient evidence to conclude from this data that T 1. This means that there is insufficient evidence to conclude that the risk of death is higher (or lower) with a transplant operation.

In this problem, one is typically interested in estimating a patient's survival curve. For a nontransplant patient, the survival function is equal to

For a given value of the time to, one can compute a sample from the posterior distribution of S(to) by computing the function )P/(A+to)P from the simulated values from the joint posterior distribution of ) and p. In the following code, we assume that simulated samples from the marginal posterior distributions of ) and p are stored in the vectors lambda and p, respectively. Then we (1) set up a grid of values of t and storage vectors p5, p50 and p95; (2) simulate a sample of values of S(t) for each value of t on the grid; and (3) summarize the posterior sample by the computation of the 5th, 50th, and 95th percentiles. These percentiles are stored in the variables p5, p50, and p95. In Fig. 6.12, we graph these percentiles as a function of the time variable t. Since there is little evidence that T 1, this survival curve represents the risk for both transplant and nontransplant patients.

Fig. 6.12. Posterior distribution of probability of survival S(t) for heart transplant patients. Lines correspond to the 5th, 50th, and 95th percentiles of the posterior of S(t) for each time t.

6.11 Further Reading

A good overview of discrete Markov chains is contained in Kemeny and Snell (1976). Since MCMC algorithms currently play a central role in applied Bayesian inference, most modern textbooks devote significant content to these methods. Chapter 11 of Gelman et al (2003) and chapter 5 of Carlin and Louis (2000) provide good introductions to MCMC methods and their application in Bayesian methods. Robert and Casella (2004) and Givens and Hoeting (2005) give more detailed descriptions of MCMC algorithms within the context of computational statistical methods. Introductory discussions of Metropolis and Gibbs sampling are provided, respectively, in Chib and Greenberg (1995) and Casella and George (1992).

6.12 Summary of R Functions

cauchyerrorpost - computes the log posterior density of (M,log S) when a sample is taken from a Cauchy density with location M and scale S and a uniform prior distribution is taken on (M, log S)

Usage: cauchyerrorpost(theta, data)

Arguments: theta, matrix of parameter values where each row represents a value of (M, log S); data, vector containing sample of observations

Value: vector of values of the log posterior where each value corresponds to each row of the parameters in theta

gibbs - implements a Metropolis within Gibbs algorithm for an arbitrary real-valued posterior density defined by the user

Usage:gibbs(logpost,start,m,scale,data)

Arguments: logpost, function defining the log posterior density; start, array with a single row that gives the starting value of the parameter vector; m, the number of iterations of the Gibbs sampling algorithm; scale, vector of scale parameters for the random walk Metropolis steps; data, data used in the function logpost

Value: par, a matrix of simulated values where each row corresponds to a value of the vector parameter; accept, vector of acceptance rates of the Metropolis steps of the algorithm

groupeddatapost - computes the log posterior for (M, log S), when sampling from a normal density and the data are recorded in grouped format

Usage: groupeddatapost=function(theta,data)

Arguments: theta, matrix of parameter values where each row represents a value of (M, log S); data, list with components b, a vector of midpoints, and f, the corresponding bin frequencies

Value: vector of values of the log posterior where each value corresponds to each row of the parameters in theta

indepmetrop - simulates iterates of a Metropolis independence chain for an arbitrary real-valued posterior density defined by the user

Usage:indepmetrop(logpost,proposal,start,m,data)

Arguments: logpost, function defining the log posterior density; proposal, a list containing mu, an estimated mean and var, an estimated variancecovariance matrix of the normal proposal density; start, array with a single row that gives the starting value of the parameter vector; m, the number of iterations of the chain data, data used in the function logpost

Value: par, a matrix of simulated values where each row corresponds to a value of the vector parameter; accept, the acceptance rate of the algorithm.

lbinorm - computes the logarithm of a bivariate normal density

Usage: lbinorm(xy,par)

Arguments: xy, matrix of values where each row corresponds to a value of (x, y); par, list containing m, a vector of means, and v, a variance-covariance matrix

Value: vector of values of the kernel of the log density function

rwmetrop - simulates iterates of a random walk Metropolis chain for an arbitrary real-valued posterior density defined by the user

Usage:rwmetrop(logpost,proposal,start,m,par)

Arguments: logpost, function defining the log posterior density; proposal, a list containing var, an estimated variance-covariance matrix, and scale, the Metropolis scale factor; start, array with a single row that gives the starting value of the parameter vector; m, the number of iterations of the chain; par, data used in the function logpost

Value: par, a matrix of simulated values where each row corresponds to a value of the vector parameter; accept, the acceptance rate of the algorithm

transplantpost - computes the log posterior for (log tau, log lambda, log p) for a Pareto model for survival data

Usage:transplantpost=function(theta,data)

Arguments: theta, matrix of parameter values where each row represents a value of (log tau, log lambda, log p); data, data matrix where columns are survival time, time to transplant, transplant indicator, and censoring indicator

Value: vector of values of the log posterior where each value corresponds to each row of the parameters in theta

6.13 Exercises

1. A random walk

The following matrix represents the transition matrix for a random walk on the integers {1, 2, 3, 4, 5}.

a) Suppose one starts at the location 1. By use of the sample command, simulate 1000 steps of the Markov chain using the probabilities given in the transition matrix. Store the locations of the walk in a vector.

b) Compute the relative frequencies of the walker in the five states from the simulation output. Guess at the value of the stationary distribution vector w.

c) Confirm that your guess is indeed the stationary distribution by the matrix computation w %*% T.

2. Estimating a log-odds with a normal prior

In Exercise 1 of Chapter 5, we considered the estimation of a log-odds parameter when y is binomial(n, p) and the log-odds 8 = log (p/(1 - p)) is distributed N(µ, o-) with ,u = 0 and or = .25. The coin was tossed n = 5 times and y = 5 heads were observed.

Use a Metropolis-Hastings random walk algorithm to simulate from the posterior density. In the algorithm, let s be equal to twice the approximate posterior standard deviation found in the normal approximation. Use the simulation output to approximate the posterior mean and standard deviation of 9, and the posterior probability that 9 is positive. Compare your answers with those obtained by the normal approximation in Exercise 1 of Chapter 5.

3. Genetic linkage model from Rao (2002)

In Exercise 2 of Chapter 5, we considered the estimation of a parameter 9 in a genetic linkage model. The posterior density was expressed in terms of the real-valued logit y = log (9/(1 - 9)).

a) Use a Metropolis-Hastings random walk algorithm to simulate from the posterior density of y. (Choose the scale parameter s to be twice the approximate posterior standard deviation of 17 found in a normal approximation.) Compare the histogram of the simulated output of y with the normal approximation. From the simulation output, find a 95% interval estimate for the parameter of interest 8.

b) Use a Metropolis-Hastings independence algorithm to simulate from the posterior density of y. Use a normal proposal density. Again compare the histogram of the simulated output with the normal approximation and find a 95% probability interval for the parameter of interest 9.

4. Modeling data with Cauchy errors

As in Section 6.8, suppose we observe yl,..., y,, from a Cauchy density with location µ and scale a and a noninformative prior is placed on (µ, Q). Consider the following hypothetical test scores from a class that is a mixture of good and poor students.

a) Use the laplace function to find the posterior mode. Check that you have indeed found the posterior mode by trying several starting values in Newton's algorithm.

b) Use the Metropolis random walk algorithm (using the function rwmetrop) to simulate 1000 draws from the posterior density. Compute the posterior mean and standard deviation of µ and log or.

Fig. 6.13. Posterior distribution of p and log a for the Cauchy sampling exercise.

5. Estimation for the two-parameter exponential distribution

Exercise 3 of Chapter 5 considered the "type I/time-truncated" life testing experiment. We are interested in the posterior density of 6 = (Bi, 62), where 01 = loo o, 82 = lo;(t1 - µ).

a) Using the posterior mode and variance-covariance matrix from laplace, simulate 1000 values from the posterior distribution by the Metropolis random walk algorithm (function rwmetrop).

b) Suppose one is interested in estimating the reliability at time to defined by

Using your simulated values from the posterior, find the posterior mean and posterior standard deviation of R(to) when to = 106 cycles.

6. Poisson regression

Exercise 4 of Chapter 5 describes an experiment from Haberman (1978) involving subjects reporting one stressful event. The number of events recalled i months before an interview yz is distributed Poisson with mean )j, where the {Az} satisfy the loglinear regression model

One is interested in learning about the posterior density of the regression coefficients (/.3o, /31).

a) Using the output of laplace, construct a Metropolis random walk algorithm for simulating from the posterior density. Use the function rwmetrop to simulate 1000 iterates and compute the posterior mean and standard deviation of /31.

b) Construct a Metropolis independence algorithm and use the function rwindep to simulate 1000 iterates from the posterior. Compute the posterior mean and standard deviation of /31.

c) Use a table such as Table 6.2 to compare the posterior estimates using the three computational methods.

7. Generalized logit model

Carlin and Louis (2000) describe the use of a generalized logit model to fit dose-mortality data from Bliss (1935). Table 6.3 records the number of adult flour beetles killed after five hours of exposure to various levels of gaseous carbon disulphide. The number of insects killed yi under dose wi is assumed binomial(ni, pi), where the probability pi of death is given by

where xi = (wi - µ)/a. The prior distributions for p, or, ml are assumed independent, where p is assigned a uniform prior, c7 is assigned a prior proportional to 1/o, and 7n1 is gamma with parameters ao and bo. In the example, the prior hyperparameters of ao = .25 and bo = 4 were used. If one transforms to the real-valued parameters (01, 02 i 03) = (/1, 10g or, log Ml), then Carlin and Louis (2000) show the posterior density is given by

a) Write an R function that defines the log posterior of (01i 02, 03).

b) Carlin and Louis (2000) suggest running a Metropolis random walk chain with a multivariate normal proposal density where the variancecovariance matrix is diagonal with elements 0.00012, 0.033, and 0.10. Use the function rwmetrop to run this chain for 10,000 iterations. Compute the acceptance rate and the 5th and 95th percentiles for each parameter.

c) Run the function laplace to get a non-diagonal estimate of the variance-covariance matrix. Use this estimate in the proposal density of rwmetrop and run the chain for 10,000 iterations. Compute the acceptance rate and the 5th and 95th percentiles for each parameter.

d) Compare your answers in parts (b) and (c).

 

7.1 Introduction

In this chapter, we illustrate the use of R to summarize an exchangeable hierarchical model. We begin by giving a brief introduction to hierarchical modeling. Then we consider the simultaneous estimation of the true mortality rates from heart transplants for a large number of hospitals. Some of the individual estimated mortality rates are based on limited data and it may be desirable to combine the individual rates in some way to obtain more accurate estimates. We describe a two-stage model, a mixture of gamma distributions, to represent prior beliefs that the true mortality rates are exchangeable. We describe the use of R to simulate from the posterior distribution. We first use contour graphs and simulation to learn about the posterior distribution of the hyperparameters. Once we simulate hyperparameters, we can simulate from the posterior distributions of the true mortality rates from gamma distributions. We conclude by illustrating how the simulation of the joint posterior can be used to perform different types of inferences in the heart transplant application.

7.2 Introduction to Hierarchical Modeling

In many statistical problems, we are interested in learning about many parameters that are connected in some way. To illustrate, consider the following three problems described in this chapter and the chapters to follow.

1. Simultaneous estimation of hospital mortality rates

In the main example of this chapter, one is interested in learning about the mortality rates due to heart transplant surgery for 94 hospitals. Each hospital has a true mortality rate Az, and so one wishes to simultaneously estimate the 94 rates A1, ..., .94. It is reasonable to believe a priori that the true rates are similar in size, which implies a dependence structure between the parameters. If one is told some information about a particular hospital's true rate, that information would likely affect one's belief about the location of a second hospital's rate.

2. Estimating college grade point averages

In an example in Chapter 10, admissions people at a particular university collect a table of means of freshman grade point averages (GPA) organized by the student's high school rank and his or her score on a standardized test. One wishes to learn about the collection of population mean GPAs with the ultimate goal of making predictions about the success of future students that attend the university. One believes that the population GPAs can be represented as a simple linear function of the high school rank and standardized test score.

3. Estimating career trajectories

In an example in Chapter 11, one is learning about the pattern of performance of athletes as they age during their sports careers. In particular, one wishes to estimate the career trajectories of the batting performances of a number of baseball players. For each player, one fits a model to estimate his career trajectory, and Fig. 7.1 displays the fitted career trajectories for nine players. Note that the shapes of these trajectories are similar; a player generally will increase in performance until his late 20s or early 30s and then decline until retirement. The prior belief is that the true trajectories will be similar between players, which again implies a prior distribution with dependence.

In many-parameter situations like the ones described here, it is natural to construct a prior distribution in a hierarchical fashion. In this type of model, the observations are given distributions conditional on parameters, and the parameters in turn have distributions conditional on additional parameters called hyperparameters. Specifically, we begin by specifying a data distribution

and the prior vector 0 will be assigned a prior distribution with unknown hyperparameters A:

The hyperparameter vector A in turn will be assigned a distribution

One general way of constructing a hierarchical prior is based on the prior belief of exchangeability. A set of parameters 0 = (0l, ..., Ok) is exchangeable if the distribution of 0 is unchanged if the parameter components are permuted. This implies that one's prior belief about Oil say, will be the same as one's belief about 0, . One can construct an exchangeable prior by assuming that the components of 0 are a random sample from a distribution gl:

Fig. 7.1. Plots of fitted career trajectories for nine baseball players as a function of their age.

and the unknown hyperparameter vector A is assigned a known prior at the second stage:

This particular form of hierarchical prior will be used for the mortality rates example of this chapter and the career trajectories example of Chapter 11.

7.3 Individual and Combined Estimates

Consider again the heart transplant mortality data discussed in Chapter 3. The number of deaths within 30 days of heart transplant surgery is recorded for each of 94 hospitals. In addition, we record for each hospital an expected number of deaths called the exposure denoted by e. We let yi and ei denote the respective observed number of deaths and exposure for the ith hospital. In R, we read in the relevant dataset hearttransplants in the LearnBayes package.

A standard model assumes that the number of deaths yi follows a Poisson distribution with mean ci)i and the objective is to estimate the mortality rate per unit exposure )i. The fraction yi/ei is the number of deaths per unit exposure and can be viewed as an estimate of the death rate for the ith hospital. In Fig. 7.2, we plot the ratios {yi/ei} against the logarithms of the exposures {log(ei)} for all hospitals where each point is labeled by the number of observed deaths yi.

Fig. 7.2. Plot of death rates against log exposure for all hospitals. Each point is labeled by the number of observed deaths.

Note that the estimated rates are highly variable, especially for programs with small exposures. The programs experiencing no deaths (a plotting label of 0) also are primarily associated with small exposures.

Suppose we are interested in simultaneously estimating the true mortality rates {)j} for all hospitals. One option is to simply estimate the true rates by the individual death rates

Unfortunately these individual rates can be poor estimates, especially for the hospitals with small exposures. In Fig. 7.2, we saw that some of these hospitals did not experience any deaths and the individual death rate y2/e2 = 0 would likely underestimate the hospital's true mortality rate. Also it is clear from the figure that the rates for the hospitals with small exposures have high variability.

Since the individual death rates can be poor, it seems desirable to combine the individual estimates in some way to obtain improved estimates. Suppose we can assume that the true mortality rates are equal across hospitals; that is,

Under this "equal-means" Poisson model, the estimate of the mortality rate for the ith hospital would be the pooled estimate

But this pooled estimate is based on the strong assumption that the true mortality rate is the same across hospitals. This is questionable since one would expect some variation in the true rates.

We have discussed two possible estimates for the mortality rate of the ith hospital: the individual estimate yj /ei and the pooled estimate yj/ej. A third possibility is the compromise estimate

This estimate shrinks or moves the individual estimate yi/Cj toward the pooled estimate yj / ej where the parameter 0 < A < 1 determines the size of the shrinkage. We will see that this shrinkage estimate is a natural byproduct of the application of an exchangeable prior model on the true mortality rates.

7.4 Equal Mortality Rates?

Before we consider an exchangeable model, let's illustrate fitting and checking the model where the mortality rates are assumed equal. Suppose yz is distributed Poisson(ez)), i = 1, ..., 94, and the common mortality rate A is assigned a standard noninformative prior of the form

Then the posterior density of A is given by

and so the posterior density for the common rate ) is gamma(277, 294681).

Using the following R code, Fig. 7.3 displays a histogram of this posterior predictive distribution and the actual number of transplant deaths y94 is shown by a vertical line.

Since the observed yj is in the tail portion of the distribution, it seems inconsistent with the fitted model - it suggests that this hospital actually has a higher true mortality rate than estimated from this equal-rates model.

Fig. 7.3. Histogram of simulated draws from the posterior predictive distribution of y94. The actual number of transplant deaths is shown by a vertical line.

The following R code computes the probabilities of "at least as extreme" for all observations and places the probabilities in the vector pout.

We plot the probabilities against the log exposures which is displayed in Fig. 7.4.

Fig. 7.4. Scatterplot of predictive probabilities of "at least as extreme" against log exposures for all observations.

Note that a number of these tail probabilities appear small (15 are smaller than 0.10) which means that the "equal rates" model is inadequate for explaining the distribution of mortality rates for the group of 94 hospitals. We will have to assume differences between the true mortality rates that will be modeled by the exchangeable model described in the next section.

7.5 Modeling a Prior Belief of Exchangeability

At the first stage of the prior, the true death rates A1, ..., A94 are assumed to be a random sample from a gamma(a, a/y) distribution of the form

The prior mean and variance of A are given by y and y2/a, respectively. At the second stage of the prior, the hyperparameters it and a are assumed independent, with it assigned a gamma(a, b) distribution with density ya-1 exp(-by) and a the density g(a).

This prior distribution induces positive correlation between the true death rates. To see this, suppose one assigns the hyperparameter µ a gamma(10, 10) distribution and sets the hyperparameter a equal to a fixed value ao. (This is equivalent to assigning a density g(a) that places probability one on the value ao.) One can simulate values of, say (A1, )2), from the prior distribution by

• simulating values from y from the gamma(a, b) distribution, a from the prior density g(a)

• for each simulated pair (µ, a), simulate A1, )2 from gamma(a, a/µ) distributions

This simulation is illustrated in the following R code. Fig. 7.5 displays 500 simulated values from the prior distribution of (A1, A2) for the values ao equal to 5, 20, 80, and 400. Note that since y is assigned a gamma(10, 10) distribution, both the true rates ) and )'2 are centered about the value 1. The hyperparameter a is a precision parameter that controls the correlation between the parameters. For the fixed value a = 400, note that Al and )'2 are concentrated along the line Ai = .A2. As the precision parameter a approaches infinity, the exchangeable prior places all of its mass along the space where A1-...-A94.

Fig. 7.5. Simulated values from the exchangeable prior on (Al, A2) for values of the precision parameter a = 5, 20, 80, and 400.

Although we used subjective priors to illustrate the behavior of the prior distribution, in practice vague distributions can be chosen for the hyperparameters µ and a. In this example, we assign the mean parameter the typical vague prior of the form

The precision parameter a assigned the proper, but relatively flat, prior density of the form

The user will specify a value of the parameter zo that is the median of a. In this example, we let zo = 0.53.

7.6 Posterior Distribution

Owing to the conditionally independent structure of the hierarchical model and the choice of a conjugate prior form at stage 2, there is a relatively simple posterior analysis. Conditional on values of the hyperparameters it and a, the rates A1, ..., A94 have independent posterior distributions. The posterior distribution of )i is gamma(yi + a, ei + a/µ). The posterior mean of Ai, conditional on a and µ, can be written as

where

The posterior mean of the true rate )i can be viewed as a shrinkage estimator, where Bi is the shrinkage fraction of the posterior mean away from the usual estimate yi/ei toward the prior mean µ.

Also since a conjugate model structure was used, the rates )'z can be integrated out of the joint posterior density, resulting in the marginal posterior density of (a, µ):

where K is a proportionality constant.

7.7 Simulating from the Posterior

In the previous section the posterior density of all parameters was expressed as

where the hyperparameters are (µ, a) and the true rates are (A1, ..., A94). By the composition method, we can simulate a random draw from the joint posterior by

• simulating (µ, a) from the marginal posterior distribution

• simulating A1, ..., A94 from their distribution conditional on the values of the simulated µ and a

First we need to simulate from the marginal density of the hyperparameters p and a. Since both parameters are positive, a good first step in this simulation process is to transform each to the real-valued parameters

The marginal posterior of the transformed parameters is given by

The following R function poissgamexch contains the definition of the log posterior of 01 and 02.

Note that this function has two inputs:

• theta- a matrix of two columns where each row corresponds to a value of (01, 02)

• datapar - a R list with two components, the data and the value of the hyperparameter zO

Note that since theta is a matrix, we sum over the observations to compute the log posterior. We use the function lgamma that computes the log of the gamma function, log F(x).

Using the R function laplace, we find the posterior mode and associated variance-covariance matrix. We perform five iterations of the Newton-Raphson algorithm at the starting value (9k, B2) = (2, -7). The output of laplace includes the mode and the corresponding estimate at the variance-covariance matrix.

This output gives us information about the location of the posterior density. By trial and error, we use the function mycontour to find a grid that contains the posterior density of (01i 02). The resulting graph is displayed in Fig. 7.6.

Fig. 7.6. Contour plot of the posterior density of (log a, log p) for the heart transplant example. Contour lines are drawn at 10%, 1%, and .1% of the modal value.

By inspection of Fig. 7.6, we see that the posterior density for (01, 02) is nonnormal shaped, especially in the direction of 01 = log cc. Since the normal approximation to the posterior is inadequate, we obtain a simulated sample of (01, 02) by use of the "Metropolis within Gibbs" algorithm in the function gibbs. In this Gibbs sampling algorithm, we start at the value (01, 02) = (4, -7) and iterate through 1000 cycles with Metropolis scale parameters c1 = 1, c2 = .15. As the output indicates, the acceptance rates in the simulation of the two conditional distributions are each about 30%.

Fig. 7.7 shows a simulated sample of 1000 placed on top of the contour graph. Note that most of the points fall within the first two contour lines of the graph, indicating that the algorithm appears to give a representative sample from the marginal posterior distribution of 01 and 02.

Fig. 7.8 shows a kernel density estimate of the simulated draws from the marginal posterior distribution of the precision parameter 01 = log(es).

We can learn about the true mortality rates A1, ..., A94 by simulating values of from their posterior distributions. Given values of the hyperparameters a and it, the true rates have independent posterior distributions with Ai distributed gamma(yi + a, ei + a/y). For each rate, we use the R rgamma function to obtain a sample from the gamma distribution, where the gamma parameters are functions of the simulated values of a and it. For example, one can obtain a sample from the posterior distribution of Al by the R code

After we obtain a simulated sample of size 1000 for each true rate Ai, we can summarize each sample by computing the 5th and 95th percentiles. The interval from these two percentiles constitutes an approximate 90% probability interval for .i. We graph these 90% probability intervals as vertical lines on our original graph of the log exposures and the individual rates in Fig. 7.9. In contrast to the wide variation in the observed death rates, note the similarity in the locations of the probability intervals for the true rates. This indicates that these Bayesian estimates are shrinking the individual rates toward the pooled estimate.

Fig. 7.7. Contour plot of the posterior density of (log a, log y) for the heart transplant example with a sample of simulated values placed on top.

7.8 Posterior Inferences

Once a simulated sample of true rates {)j} and the hyperparameters p, a has been generated from the joint posterior distribution, we can use this sample to perform various types of inferences.

Fig. 7.8. Density estimate of simulated draws from the marginal posterior of log a.

7.8.1 Shrinkage

The posterior mean of the ith true mortality rate )i can be approximated by

Fig. 7.9. Plot of observed death rates against log exposures together with intervals representing 90% posterior probability bands for the true rates {Ai}.

7.8.2 Comparing Hospitals

Suppose one is interested in comparing the true mortality rates of the hospitals. Specifically, suppose one wishes to compare the "best hospital" with the other hospitals. First, we find the hospital with the smallest estimated mortality rate. In the following R output, we compute the posterior mean of the mortality rates, where the posterior mean of the true rate for hospital i is given by

where the expectation is taken over the marginal posterior distribution of (aI ft)

Fig. 7.10. Plot of the posterior shrinkages against the log exposures for the heart transplant example.

We identify hospital 85 as the one with the smallest true mortality rate.

Suppose we wish to compare hospital i with hospital j. One first obtains simulated draws from the marginal distribution of (Aj,.j). Then the probability that hospital i has a smaller mortality rate, POD < Aj), can be estimated by the proportion of simulated (Az, )j) pairs where )'z is smaller than .j. In the following R code, we compute these probabilities for all pairs of hospitals and the results are stored in the matrix better. The probability that hospital is rate is smaller than hospital j's rate is stored in the ith row and jth element of better.

To compare the best hospital 85 with the remaining hospitals, we display the 85th column of the matrix better. These give the probabilities P(Az < A8,5) for all i. We display these probabilities for the first 24 hospitals. Note that hospital 85 is better than most of these hospitals since most of the posterior probabilities are close to zero.

7.9 Posterior Predictive Model Checking

In Section 7.3, we used the posterior predictive distribution to examine the suitability of the "equal rates" model where Al = ... = A94, and we saw that the model seemed inadequate in explaining the number of transplant deaths for individual hospitals. Here we use the same methodology to check the appropriateness of the exchangeable model.

Note that in this case the observed number of deaths for this hospital is in the middle of the predictive distribution that indicates agreement of this observation with the fitted model.

Fig. 7.11. Histogram of posterior predictive distribution of y94 for hospital 94 from the exchangeable model. The observed value of yg4 is indicated by the vertical line.

Recall that the probabilities of at least as extreme for the equal means model were contained in the vector pout. To compare the goodness of fits of the two models, Fig. 7.12 shows a scatterplot of the two sets of probabilities with a comparison line y = x placed on top.

Note that the probabilities of extreme for the exchangeable model are larger, indicating that the observations are more consistent with the exchangeable fitted model. Note that only two of the observations have a probability smaller than 0.1 for the exchangeable model, indicating general agreement of the observed data with this model.

Fig. 7.12. Scatterplot of posterior predictive probabilities of "at least as extreme" for the equal means and exchangeable models.

7.10 Further Reading

Gelman et al (2003), chapter 5, provide a good introduction to hierarchical models. Carlin and Louis (2000), chapter 3, introduce hierarchical modeling from an empirical Bayes perspective. Posterior predictive model checking is described as a general method for model checking in chapter 6 of Gelman et al (2003). The use of hierarchical modeling to analyze the heart transplant data is described in Christiansen and Morris (1995).

7.11 Summary of R Functions

poissgamexch - computes the logarithm of the posterior for the parameters (log alpha, log mu) in a Poisson/gamma model

Usage: poissgamexch(theta,datapar)

Arguments: theta, matrix of parameter values where each row represents a value of (log alpha, log mu); datapar, list with components data (matrix with column of counts and column of exposures) and zO, the value of the second-stage hyperparameter

Value: vector of values of the log posterior where each value corresponds to each row of the parameters in theta

7.12 Exercises

1. Normal/normal exchangeable model

where O(yjy, a) denotes the normal density with mean it and standard deviation a.

a) Write an R function to compute the logarithm of the posterior density of the hyperparameters y and log T. (Don't forget to include the Jacobian term in the transformation to (p,logT).) Use a simulation algorithm such as Gibbs sampling (function gibbs), random walk Metropolis (function rwmetrop), or independence Metropolis (function indepmetrop) to obtain a sample of size 1000 from the posterior of (µ,logT).

b) Using the simulated sample from the marginal posterior of (µ, logT), simulate 1000 draws from the joint posterior density of the means 01, ..., Bj. Summarize the posterior distribution of each Bj by the computation of a posterior mean and posterior standard deviation.

2. Normal/normal exchangeable model (continued)

We assume that the sampling algorithm in Exercise 7.1 has been followed and one has simulated a sample of 1000 values from the marginal posterior of the hyperparameters y and log T, and also from the posterior densities of 01, ..., 0j.

a) The posterior mean of Bj, conditional on y and T, can be written as

b) School A had the largest observed coaching effect of 28. From the simulated draws from the joint distribution of 9i, ..., 9 j, compute the posterior probability P(01 > 8j) for j = 2,..., J.

3. Binomial/beta exchangeable model

In Chapter 5, we described the problem of simultaneously estimating the rates of death from stomach cancer for males at risk in the age bracket 45-64 for the largest cities in Missouri. The dataset is available as cancermortality in the LearnBayes package. Assume that the numbers of cancer deaths {yj} are independent, where yi is binomial with sample size nj and probability of death pj. To model a prior belief of exchangeability, it is assumed that p1, ..., p20 are a random sample from a beta distribution with parameters a and b. We reparameterize the beta parameters a and b to new values

The hyperparameter r) is the prior mean of each pj and K is a precision parameter. At the last stage of this model, we assign (ri, K) the noninformative proper prior

Due to the conjugate form of the prior, one can derive the following posterior distributions.

• Conditional on the values of the hyperparameters y and K, the probabilities p1i ..., p20 are independent, with pj distributed beta with parameters aj = Krj + yj and bj = K(1 - r7) + nj P7.

• The marginal posterior density of (q, K) has the form

where K>0and0<r)<1.

a) To summarize the posterior distribution of the hyperparameters ii and K, first transform the parameters to the real line by the reexpressions 01 = logK and 92 = log(rl/(1 -,q)). Write an R function to compute values of the log posterior of 01 and 02-

b) Use a simulation algorithm such as Gibbs sampling (function gibbs), random walk Metropolis (function rwmetrop), or independence Metropolis (function indepmetrop) to obtain a sample of size 1000 from the posterior of (8i, 92). Summarize the posterior distributions of K and r) by 90% interval estimates.

c) Using the simulated sample from the marginal posterior of (01i 02), simulate 1000 draws from the joint posterior density of the probabilities p1, •••,p20• Summarize the posterior distribution of each pj by a 90% interval estimate.

4. Binomial/beta exchangeable model (continued)

We assume that the sampling algorithm in Exercise 7.3 has been followed and one has simulated a sample of 1000 values from the marginal posterior of the hyperparameters K and m, and also from the posterior densities of p1, •••,P20•

 

8.1 Introduction

In this chapter we illustrate the use of R to compare models from a Bayesian perspective. We introduce the notion of a Bayes factor in the setting where one is comparing two hypotheses about a parameter. In the setting where one is testing hypotheses about a population mean, we illustrate the computation of Bayes factors in both the one-sided and two-sided settings. We then generalize to the setting where one is comparing two Bayesian models, each consisting of a choice of prior and sampling density. In this case, the Bayes factor is the ratio of the marginal densities for the two models. We illustrate Bayes factor computations in two examples. In the analysis of hitting data for a baseball player, one wishes to compare a "consistent" model with a "streaky" model where the probability of a success may change over a season. In the second application, we illustrate the computation of Bayes factors against independence in a two-way contingency table.

8.2 Comparison of Hypotheses

To introduce Bayesian measures of evidence, suppose one observes Y from a sampling distribution f(yJB) and one wishes to test the hypotheses

where e0 and e1 form a partition of the parameter space. If one assigns a proper prior density g(9), then one can judge the two hypotheses a priori by the prior odds ratio

After data Y = y is observed, one's beliefs about the parameter are updated by the posterior density

where L(9) is the likelihood function. One's new beliefs about the two hypotheses are summarized by the posterior odds ratio

The Bayes factor is the ratio of the posterior odds to the prior odds of the hypotheses

The statistic BF is a measure of the evidence provided by the data in support of the hypothesis Ho. The posterior probability of the hypothesis Ho can be expressed as a function of the Bayes factor and the prior probabilities of the hypotheses by

8.3 A One-Sided Test of a Normal Mean

In an example from chapter 14 of Berry (1996), the author was interested in determining his true weight from a variable bathroom scale. We assume the measurements are normally distributed with mean µ and standard deviation Q. The author weighs himself 10 times with the measurements (in pounds) 182, 172, 173, 176, 176, 180, 173, 174, 179, and 175. For simplicity, assume that he knows the accuracy of the scale and a = 3 pounds.

If we let it denote the author's true weight, suppose he is interested in assessing if his true weight is larger than 175 pounds. He wishes to test the hypotheses

Suppose the author has little prior knowledge about his true weight and so he assigns µ a normal prior with mean 170 and standard deviation 5

The prior odds of the null hypothesis Ho is given by

We compute this prior odds from the N(170, 5) density using the pnorm function. In the following output, pmean and pvar are, respectively, the prior mean and prior variance of y.

So a priori the null hypothesis is five times more likely than the alternative hypothesis.

We enter the 10 weight measurements into R and compute the sample mean y and the associated sampling variance sigma2 equal to a2/n.

By the familiar normal density/normal prior updating formula, the posterior precision (inverse of the variance) of y is the sum of the precisions of the data and the prior.

The posterior mean of y is the weighted sum of the sample mean and the prior mean where the weights are proportional to the respective precisions.

The posterior density of p. is N(175.79, 0.93).

Using this normal posterior density, we calculate the odds of the null hypothesis.

So the Bayes factor in support of the null hypothesis is

From the prior probabilities and the Bayes factor, we can compute the posterior probability of the null hypothesis.

Based on this calculation, the author can conclude that it is unlikely that his weight is at most 175 pounds.

There is an interesting connection between this Bayesian measure of evidence and the frequentist p-value. Here with a known value of the standard deviation Q, the traditional test of Ho is based on the test statistic

The p-value is the probability that a standard normal variate exceeds z. In the R output, we compute the p-value using the pnorm function.

Suppose we repeat the Bayesian analysis using a very flat prior where the mean and standard deviation are 170 and 1000, respectively. The function mnormt.onesided in the LearnBayes package performs the calculations where one inputs the value of the mean to be tested, the parameters (mean and standard deviation) of the normal prior, and the data values (sample mean, sample size, known sampling standard deviation).

Note that the probability of the null hypothesis is approximately equal to the p-value. This illustrates a general result that a Bayesian probability of a hypothesis is equal to the p-value for one-sided testing problems when a vague prior distribution is placed on the parameter.

8.4 A Two-Sided Test of a Normal Mean

Consider the "two-sided" test of the hypothesis that a mean from a normal distribution (with known standard deviation) is equal to a specific value. Continuing the example from the last section, suppose that Berry knows that his weight last year was 170 pounds and he wonders whether he still weighs 170 this year. So he is interested in the hypothesis Ho that his true current weight ,u is equal to 170. The alternative hypothesis Hl is that his weight is now either larger or smaller than 170.

The construction of the prior distribution is somewhat unique here since there will be a point mass at the value of µ in the null hypothesis. In the example, the author believes that there is a good chance that his weight did not change from last year, and so he assigns the statement µ = 170 a probability of .5.

Next, the author has to think about plausible values for ,u if the hypothesis Ho is not true. If his weight did change from last year, then he may think that it is more likely that ,u is close to last year's weight (170) than far from it. A normal distribution with mean 170 and standard deviation T will then be a suitable model for alternative values for y.

In general, we are testing the hypothesis Ho : µ = µo against the alternative hypothesis Hl : µ ~ µo in the case where the standard deviation a is known. A normal prior distribution with mean µo and standard deviation T will be used to represent one's opinion under the alternative hypothesis H1.

In this situation the Bayes factor in support of the hypothesis H is given by

As before, if xo is the prior probability of the null hypothesis Ho that µ = µo, then the posterior probability of Ho is

To compute the Bayes factor in practice, one has to input the standard deviation T of the normal density under the alternative hypothesis H1. If the author's weight did change from last year, how large will the change be? One way of obtaining the value of T is to think of the range of possible alternative values for ft and then solve for this standard deviation by setting the 95% range of the normal distribution, 4T, to this range. To illustrate, suppose that the author thinks that his current weight could be five pounds less or more than last year's weight of 170. The range of alternative values for µ is 175 -165 = 10 and by setting 10 = 4T, one obtains T = 2.5.

The function mnormt.twosided in the LearnBayes package computes the Bayes factor and the posterior probability of the null hypothesis in this problem. The inputs to the function are the value µo to be tested, the prior probability To of the hypothesis Ho, the prior standard deviation T, and the data values (sample mean, sample size, known sampling standard deviation). Since it may be difficult to assess values for T, the function allows the user to input a vector of plausible values.

The R code for the computation in this example is shown here. Note the values .5, 1, 2, 4, and 8 are inputted as possible values for T.

For each value of the prior standard deviation T, the program gives the Bayes factor in support of the hypothesis that y takes on the specific value and the posterior probability that the hypothesis H is true. If the author uses a normal (170, 2) density to reflect alternative values for his weight y, then the Bayes factor in support of the hypothesis ,u = 170 is equal to .0000002. The posterior probability that his weight hasn't changed is .0000002, which is much smaller than the author's prior probability of .5. He should conclude that his current weight is not 170.

8.5 Comparing Two Models

The Bayesian approach to comparing hypotheses can be generalized to compare two models. If we let y denote the vector of data and 9 the parameter, then a Bayesian model consists of a specification of the sampling density f (y19) and the prior density g(9). Given this model, one can compute the marginal or prior predictive density of the data

Suppose we wish to compare two Bayesian models

where it is possible that the definition of the parameter B may differ between models. Then the Bayes factor in support of model Mo is the ratio of the respective marginal densities (or prior predictive densities) of the data for the two models.

If pro and 7T1 denote the respective prior probabilities of the models Mo and M1, then the posterior probability of model Mo is given by

where d is the number of parameters. On the log scale, we have

Once an R function is written to compute the logarithm of the product f (yJ9)g(9), then the function laplace can be applied and the component of the output int gives an estimate of log m(y). By applying this method for several models, one can use the computed values of 7n(y) to compute a Bayes factor.

8.6 Models for Soccer Goals

To illustrate the use of the function laplace in computing Bayes factors, suppose you are interested in learning about the mean number of goals scored by a team in Major League Soccer. You observe the number of goals scored yl, ..., y,, for n games. Since goals are relatively rare events, it is reasonable to assume that the yes are distributed according to a Poisson distribution with mean A. We consider the use of the following four subjective priors for A:

1. Prior 1. You assign a conjugate gamma prior to A of the form

with a = 4.57 and ~3 = 1.43. This prior says that you believe that a team averages about 3 goals a game and the quartiles for A are given by 2.10 and 4.04.

2. Prior 2. It is more convenient for you to represent prior opinion in terms of symmetric distributions. So you assume that log A is normal with mean 1 and standard deviation .5. The quartiles of this prior for log A are 0.66 and 1.34, which translates to prior quartiles for A of 1.94 and 3.81. Note that Prior 1 and this prior reflect similar beliefs about the location of the mean rate A.

3. Prior 3. This prior assumes that log) is N(2, .5). The prior quartiles for the rate A are 5.27 and 10.35. This prior says that you believe teams score a lot of goals in Major League Soccer.

4. Prior 4. This prior assumes that log A is N(1, 2)with associated quartiles for the rate A of 1.92 and 28.5. This prior reflects little knowledge about the scoring pattern of soccer games.

The number of goals were observed for a particular team in Major League Soccer for the 2006 season. The dataset is available as soccergoals in the LearnBayes package. The likelihood of A, assuming the Poisson model, is given by

To use the function laplace, we have to write short functions defining the log posterior. The first function logpoissgamma computes the log posterior with Poisson sampling and a gamma prior. Following our usual strategy, we transform A to the real-valued parameter 0 = log A. The arguments to the function are theta and datapar, a list that contains the data vector data and the parameters of the gamma prior par. Note that we use the R function dgamma in computing both the likelihood and the prior.

Similarly, we write the function logpoissnormal to compute the log posterior of log A for Poisson sampling and a normal prior. This function uses both the R functions dgamma and dnorm.

Fig. 8.1. The likelihood function and four priors on 0 = log A for the soccer goal example.

We first load in the datafile soccergoals; there is one variable goals in this dataset that we make available by the attach command. For each of the four priors, we use the function laplace to summarize the posterior. If the output of the function is fit, fit$mode is the posterior mode, fit$var is the associated estimate at the posterior variance and fit$int is the estimate at log m(y).

We display the posterior modes, posterior standard deviations, and log marginal densities for the four models corresponding to the four priors.

By use of the values of log m(y), one can compare the different models by Bayes factors. Does it matter if we use a gamma(4.57, .7) prior on ) or a normal(1, .5) prior on log A? To answer this question, we can compute the Bayes factor in support of Prior 2 over Prior 1:

There is slight support for the normal prior - this makes sense from Fig. 8.1 since Prior 2 is slightly closer to the likelihood function. Comparing Prior 2 with Prior 3, the Bayes factor in support of Prior 2 is

indicating large support for Prior 2. Actually, note that the locations of the likelihood and Prior 3 are far apart, indicating a conflict between the data and the prior and a small value of m3(y). Comparing Prior 2 with Prior 4, the Bayes factor in support of Prior 2 is

Generally, the marginal probability for a prior will decrease as the prior density becomes more diffuse.

8.7 Is a Baseball Hitter Really Streaky?

In sports, we observe much streaky behavior in players and teams. For example, in the sport of baseball, one measure of success of a hitter is the batting average or proportion of base hits. During a baseball season, there will be periods when a player is "hot" and has an unusually high batting average, and there will also be periods when the player is "cold" and has a very small batting average. We observe many streaky patterns in the performance of players. The interesting question is what this streaky data says about the ability of a player to be streaky.

In baseball, the player has opportunities to bat in an individual season - we call these opportunities "at-bats." In each at-bat, there are two possible outcomes - a hit (a success) or an out (a failure). Suppose we divide all of the at-bats in a particular baseball season into N periods. Let pi denote the probability the player gets a hit in a single at-bat during the ith period, i = 1, ..., N. If a player is truly consistent or nonstreaky, then the probability of a hit stays constant across all periods; we call this the nonstreaky model Mo :

To complete this model specification, we assign the common probability value p a uniform prior on (0, 1).

On the other hand, if the player is truly streaky, then the probability of a hit pi will change across the season. A convenient way to model this variation in the probabilities is to assume that p1i ..., pN are a random sample from a beta density of the form

In the density g, g is the mean and K is a precision parameter. We can index this streaky model by the parameter K; we represent the streaky model by

For this model, we place a uniform prior on the mean parameter rl, reflecting little knowledge about the location of the random effects distribution. Note that as the precision parameter K approaches infinity, the streaky model MK approaches the consistent model Mo.

To compare the models Mo and MK, we need to compute the associated marginal densities. Under the model Mo, the numbers of hits y1 i ..., YN are independent, where yz is binomial(ni, p). With the assumption that p is uniform(0, 1), we obtain the marginal density

Under the alternative "streaky" model, the marginal density is given by

The Bayes factor in support of the "streaky" model HK compared to the "nonstreaky" model Ho is given by

We use the function laplace to compute the integral in the Bayes factor BK. We first transform the variable rl in the integral to the real-valued variable 0 = log(,q/(1 - ri)). Using the R function lbeta that computes the logarithm of the beta function, we define the following function bfexch that computes the log integral. The inputs to this function are theta and a list datapar with components data (a matrix with columns y and n) and K.

To compute the Bayes factor BK for a specific value, say KO, we use the function laplace with inputs the function bfexch, a starting value of r) = 0, 10 iterations of Newton's method, and the list datapar using the value KO.

The list s is the output of laplace; the component s$int gives the estimate at the logarithm of the Bayes factor log BK.

To illustrate the use of this method, we consider the hitting data for the New York Yankee player Derek Jeter for the 2004 baseball season. Jeter was one of the "star" players on this team, and he experienced an unusual hitting slump during the early part of the season that attracted much attention from the local media.

Hitting data for Jeter were collected for each of the 155 games he played in that particular season. A natural way of defining periods during the season is by games, so N = 155. However, it is difficult to detect streakiness in these hitting data since Jeter only had about 4-5 opportunities to hit in each game. So we group the data into five-game intervals. The original game-by-game data are available as jeter2004 in the LearnBayes package. In the following R code, we read in the complete hitting data for Jeter and use the regroup function to group the data into periods of five games.

The matrix datal contains the grouped hitting data (yz, ni), i = 1, ..., 31, where yz is the number of hits by Jeter in nz at-bats in the ith interval of games. These data are listed in Table 8.1.

We compute the Bayes factor for a sequence of values of log K using the function laplace and the definition of the log integral defined in the function bfexch. In this example, the vector logK contains the values log(K) = 2, 3, 4, 5, and 6 and the vector logBF stores the corresponding values of the log Bayes factor log BK. We display in a matrix the values of log K, the values of K, the values of log BK, and the values of the Bayes factor BK.

We see from the output that the value log K = 4 is most supported by the data with a corresponding Bayes factor of BK = 4.21. This particular streaky model is approximately four times as likely as the consistent model. This indicates that Jeter indeed did display some true streakiness in his hitting behavior for this particular baseball season.

8.8 A Test of Independence in a Two-Way Contingency Table

A basic problem in statistics is to explore the relationship between two categorical measurements. To illustrate this situation, consider the following example presented in Moore (1995) in which North Carolina State University looked at student performance in a course taken by chemical engineering majors. Researchers wished to learn about the relationship between the time spent in extracurricular activities and the grade in the course. Data on these two categorical variables were collected from 119 students, and the responses are presented using the contingency table in Table 8.2.

To learn about the possible relationship between participation in extracurricular activities and grade, one tests the hypothesis of independence. The usual non-Bayesian approach of testing the independence hypothesis is based on a Pearson chi-squared statistic that contrasts the observed counts with expected counts under an independence model. In R, we read in the table of counts and use the function chisq.test to test the independence hypothesis:

Here the p-value is approximately .03, which is some evidence that one's grade is related to the time spent on extracurricular activities.

From a Bayesian viewpoint, there are two possible models - the model Al, that the two categorical variables are independent and the model MD that the two variables are dependent in some manner. To describe the Bayesian models, assume that these data represent a random sample from the population of interest and the counts of the table have a multinomial distribution with proportion values as shown in Table 8.3. Under the dependence model MD, the proportion values p11, ... , P23 can take any values that sum to 1, and we assume the prior density places a uniform distribution over this space.

Under the independence model MI, the proportions in the table are determined by the marginal probabilities {pl+, p2+} and {p+1, p+2, p+3} as displayed in Table 8.4. Here the unknown parameters are the proportions of students in different activity levels and the proportions with different grades. We assume that our knowledge about these two sets of proportions, {pi+} and {p+j }, are independent and assign to each set a uniform density over all possible values.

We have defined two models - a dependence model MD where the multinomial proportions are uniformly distributed and an independence model All where the multinomial proportions have an independence structure and the marginal proportions are assigned independent uniform priors. It can be shown that the Bayes factor in support of the dependence model over the independence model is given by

where y is the matrix of counts, YR is the vector of row totals, yc is the vector of column totals, 1R is the vector of ones of length R, and D(v) is the Dirichlet function defined by

The R function ctable will compute this Bayes factor for a two-way contingency table. One inputs a matrix a of prior parameters for the matrix of probabilities. By taking a matrix a of ones, one is assigning a uniform prior on {pjj} and uniform priors on {pz+} and {p+,} under the dependence model. The output of this problem is the value of the Bayes factor. Here the value is BF = 1.66, which indicates modest support against independence.

We are comparing "uniform" with "independence" models for a contingency table. One criticism of this method is that we may not really be interested in a "uniform" alternative model. Perhaps we would like to compare "independence" with a model where the cell probabilities are "close to independence." Such a model was proposed by Albert and Gupta (1981). Suppose the table probabilities {pzj } are assigned a conjugate Dirichlet distribution of the form

where the prior means {qzj } satisfy an independence configuration

It can be shown that the Bayes factor in support of the "close to independence" model MK over the independence model All is given by

One straightforward way of computing the Bayes factor is by importance sampling. The Bayes factor can be represented as the integral

where B = (1)A, q B). Suppose the integrand can be approximated by the density g(O), where g is easy to simulate. Then by writing the integral as

we can approximate the integral as

where 9i, ..., 0.. are independent simulated draws from g(9). The simulation standard error of this importance sampler estimate is given by

In our example, it can be shown, as K approaches infinity, the posterior of the vectors of marginal prior means gA and gB can be shown to be independent with

Using this importance sampling algorithm, the function bfindep computes the Bayes factor using this alternative "close to independence" model. One inputs the data matrix y, the Dirichlet precision parameter K, and the size of the simulated sample in. The output is a list with two components: bf, the value of the Bayes factor, and nse, an estimate at the simulation standard error of the computed value of BF.

In the following R input, we compute the Bayes factor for a sequence of values of log K. The output gives the value of the log Bayes factor and the Bayes factor for some values of log K. Fig. 8.2 displays the log Bayes factor as a function of log K and 10,000 simulation draws. (We used the R function spm in the SemiPar package to smooth out the simulation errors in the computed log Bayes factors before plotting.) Note that this maximum value of the Bayes factor is 2.3, indicating some support for an alternative model that is in the neighborhood of the independence model.

8.9 Further Reading

Carlin and Louis (2000), chapter 6, and Kass and Raftery (1995) provide general discussions of the use of Bayes factors in selecting models. Berger and Sellke (1987) and Casella and Berger (1987) describe the relationship between Bayesian and frequentist measures of evidence in the two-sided and one-sided testing situations, respectively. Gunel and Dickey (1974) describe the use of Dirichlet distributions in the development of tests for contingency tables, and Albert and Gupta (1981) introduce the use of mixtures of Dirichlet distributions for contingency tables.

Fig. 8.2. Plot of log Bayes factor in support of model MK over MI against the precision parameter log K.

8.10 Summary of R Functions

bfexch - computes the logarithm of the integral of the Bayes factor for testing homogeneity of a set of probabilities

Usage: bfexch(theta,datapar)

Arguments: theta, vector of values of the logit of the prior hyperparameter ij; datapar, list with components data (matrix with columns y and n) and K (prior precision hyperparameter)

Value: vector of values of the logarithm of the integral

bfindep - computes a Bayes factor against independence for a two-way contingency table assuming a "close to independence" alternative model

Usage: bfindep(y, K, m)

Arguments: y, matrix of counts; K, Dirichlet precision hyperparameter; m, number of simulations

Value: bf, value of the Bayes factor against independence; nse, estimate of the simulation standard error of the computed value of the Bayes factor

ctable - computes a Bayes factor against independence for a two-way contingency table assuming uniform prior distributions

Usage: ctable(y,a)

Arguments: y, matrix of counts; a, matrix of prior parameters for the matrix of probabilities

Value: the Bayes factor against the hypothesis of independence

logpoissgamma - computes the logarithm of the posterior with Poisson sampling and a gamma prior

Usage: logpoissgamma(theta, datapar)

Arguments: theta, vector of values of the log mean parameter; datapar, list with components data (vector of sample values) and par (vector of parameters of the gamma prior)

Value: value of the log posterior for all values in theta

logpoissnormal - computes the logarithm of the posterior with Poisson sampling and a normal prior

Usage: logpoissnormal (theta, datapar)

Arguments: theta, vector of values of the log mean parameter; datapar, list with components data (vector of sample values) and par (vector of parameters of the normal prior)

Value: value of the log posterior for all values in theta

mnormt. onesided - Bayesian test of the hypothesis that a normal mean M is less than or equal to a specific value

Usage:mnormt.onesided(muO,normpar,data)

Arguments: muO, value of the normal mean to be tested; normpar, vector of mean and standard deviation of the normal prior distribution; data, vector of sample mean, sample size, and known value of the population standard deviation

Value: BF, Bayes factor in support of the null hypothesis; prior. odds, the prior odds of the null hypothesis; post. odds, the posterior odds of the null hypothesis, postH, the posterior probability of the null hypothesis

mnormt. twosided - Bayesian test of the hypothesis that a normal mean M is equal to a specific value

Usage: mnormt.twosided(muO, probH, tau, data)

Arguments: muO, the value of the normal mean to be tested; probH, the prior probability of the null hypothesis; tau, vector of values of the prior standard deviation under the alternative hypothesis; data, vector of sample mean, sample size, and known value of the population standard deviation

Value: bf, vector of values of the Bayes factor in support of the null hypothesis; post, vector of values of the posterior probability of the null hypothesis

8.11 Exercises

1. A one-sided test of a binomial probability

In 1986, the St. Louis Post Dispatch was interested in measuring public support for the construction of a new indoor stadium. The newspaper conducted a survey in which they interviewed 301 registered voters. Let p denote the proportion of all registered voters in the St. Louis voting district opposed to the stadium. A city councilman wishes to test the hypotheses H : p > .5, K : p < .5.

a) The number y opposed to the stadium construction is assumed binomial(301, p). Suppose the survey result is y = 135. Using the R function pbinom, compute the p-value P(y < 1351p = .5). If this probability is small, say under 5%, then one concludes that there is significant evidence in support of the hypothesis K : p < .5.

b) Suppose one places a uniform prior on p. Compute the prior odds of the hypothesis K.

c) After observing y = 135, the posterior distribution of p is beta(136, 167). By use of the R function pbeta, compute the posterior odds of the hypothesis K.

d) Compute the Bayes factor in support of the hypothesis K.

2. A two-sided test of a normal mean (example from Weiss (2001))

For last year, a sample of 50 cell phone users had a mean local monthly bill of $41.40. Do these data provide sufficient evidence to conclude that last year's mean local monthly bill for cell phone users has changed from the 1996 mean of $47.70? (Assume that the population standard deviation is c7 = $25.)

a) The usual statistic for testing the value of a normal mean it is z = V/-n(y - µ)/or. Use this statistic and the R function pnorm to compute a p-value for testing the hypothesis H : µ = 47.7.

b) Suppose one assigns a prior probability of .5 to the null hypothesis. Use the R function mnormt. twosided to compute the posterior probability of H. The arguments to mnormt. twosided are the value to be tested (47.70), the prior probability of H (.5), the standard deviation T of the prior under the alternative hypothesis (assume T = 4), and the data vector (values of sample mean, sample size, and known sampling standard deviation).

c) Compute the posterior probability of H for the alternative values T =1, 4, 6, 8, and 10. Compare the values of the posterior probability with the value of the p-value computed in part (a).

3. Comparing Bayesian models by a Bayes factor

Suppose that the number of births to women during a month at a particular hospital has a Poisson distribution with parameter R. During a given year at a particular hospital, 66 births were recorded in January and 48 births were recorded in April. If the birth rates during January and April are given by Rj and RA respectively, then (assuming independence) the probability of the sample result is

Consider the following two priors for (Ri, RA):

• Mi : RJ ti gamma(240, 4), RA - gamma(200, 4).

• M2: RJ = RA and the common value of the rate R - gamma(220, 4).

a) Write R functions to compute the logarithm of the posterior density of (Ri, RA) under model M1, and the logarithm of the posterior density of R under model Mz.

b) Use the function laplace to compute the logarithm of the predictive density for both models All and M2.

c) Compute the Bayes factor in support of the model Ml.

4. Is a basketball player streaky?

Kobe Bryant is one of the most famous players in professional basketball. Shooting data were obtained for Bryant for the first 15 games in the 2006 season. For game i, one records the number of field goal attempts nz and the number of successful field goals yzf the data are displayed in Table 8.6. If pi denotes the probability that Kobe makes a shot during the ith game, it is of interest to compare the nonstreaky hypothesis

against the streaky hypothesis that the pi vary according to a beta distribuiton

MK: pl, ...,p15 random sample from beta(Kri, K(1-ri)), ri - uniform(0, 1).

Use the function laplace together with the function bfexch to compute the logarithm of the Bayes factor in support of the streaky hypothesis MK. Compute the log of the Bayes factors for values of K = 10, 20, 50, and 100. Based on your work, is there much evidence that Bryant displayed true streakiness in his shooting performance in these 15 games?

5. Test of independence (example from Agresti and Franklin (2005))

The 2002 General Social Survey asked the question "Taken all together, would you say that you are very happy, pretty happy, or not too happy?" Also the survey asked "Compared with American families in general, would you say that your family income is below average, average, or above average?" Table 8.7 cross-tabulates the answers to these two questions.

a) By use of the Pearson chi-square statistic, use the function chisq.test to test the hypothesis that happiness and family income are independent. Based on the p-value, is there evidence to suggest that the level of happiness is dependent on the family income?

b) Consider two models, a "dependence model" where the underlying multinomial probability vector is uniformly distributed and an "independence model" where the cell probabilities satisfy an independence configuration and the marginal probability vectors have uniform distributions. Using the R function ctable, compute the Bayes factor in support of the dependence hypothesis.

c) Instead of the analysis in part (b), suppose that one wishes to compare the independence model with the "close to independence" model MK described in Section 8.8. Using the function bfindep, compute the Bayes factor in support of the model MK for values of log K = 2, 3, 4, 5, 6, and 7.

d) Compare the frequentist measure of evidence against independence with the Bayesian measures of evidence computed in parts (b) and (c). Which type of measure, frequentist or Bayesian, indicates more evidence against independence?

 

9.1 Introduction

In this chapter, we illustrate R to fit some common regression models from a Bayesian perspective. We first outline the Bayesian normal regression model and describe algorithms to simulate from the joint distribution of regression parameters and error variance and the predictive distribution of future observations. One can judge the adequacy of the fitted model through use of the posterior predictive distribution and the inspection of the posterior distributions of Bayesian residuals. We then illustrate the R Bayesian computations in an example where one is interested in explaining the variation of extinction times of birds in terms of their nesting behavior, their size, and their migrant status. We conclude by illustrating the Bayesian fitting of a survival regression model.

9.2 Normal Linear Regression

9.2.1 The Model

In the usual multiple regression problem, we are interested in describing the variation in a response variable y in terms of k predictor variables xi, ..., xk. We describe the mean value of yz, the response for the ith individual, as

where xil, ..., Xik are the predictor values for the ith individual and ~3i, ...,13k are unknown regression parameters. If we let xi = (Xi 1, ..., xzk) denote the row vector of predictors for the ith individual and /3 = (~31 i ..., ,(3k) the column vector of regression coefficients, we can reexpress the mean value as

The {yz} are assumed to be conditionally independent given values of the parameters and the predictor variables. In the ordinary linear regression setting, we assume equal variances where var (yj 10,X) = o•'2. We let 0 = (8 i ..., A, C72 ) denote the vector of unknown parameters. Finally, we assume that the errors ez = yz - E(yz 1/3, X) are independent normally distributed with mean 0 and variance o'2.

In matrix notation, this model can be written for all observations as

where y is the vector of observations; X is the design matrix with rows x1, ..., xn,; I is the identity matrix; and NN(y, A) indicates a multivariate normal distribution of dimension k with mean vector y and variance-covariance matrix A.

To complete the Bayesian formulation of the model, we assume (13, Q2) have the typical noninformative prior

9.2.2 The Posterior Distribution

The posterior analysis for the normal regression model has a form similar to the posterior analysis of a mean and variance for a normal sampling model. We represent the joint density of (/3, o 2) as the product

If one defines the inverse gamma(a, b) density proportional toy-a-1 exp{-b/y}, then the marginal posterior distribution of a2 is inverse gamma((n-k)/2 S12) where

9.2.3 Prediction of Future Observations

9.2.4 Computation

The expressions for the posterior and predictive distributions lead to efficient simulation algorithms. To simulate from the joint posterior distribution of the regression coefficient vector /3 and the error variance a2, one

• simulates a value of the error variance a' from its marginal posterior density g(a2ly)

• simulates a value of ~ 3 from the conditional posterior density 1 , 7 y)

Since the two component distributions (inverse gamma and multivariate normal) are convenient functional forms, it is relatively easy to construct an algorithm in R such as the one programmed in the function blinreg to perform this simulation.

Once the joint posterior distribution has been simulated, it is straightforward to obtain a sample from the marginal posterior distribution of any function h(d, a) of interest. For example, if x* denotes a row vector of particular values of covariates, suppose one is interested in the mean response at x*

If 8* is a simulated draw from the marginal posterior of /3, then x*~3* will be a simulated draw from the marginal posterior of x*/8. The R function blinregexpected facilitates the simulation of linear combinations of the beta coefficients.

• simulating (13, (7 2) from the joint posterior given the data y

The R function blinregpred can be used to simulate sets of draws of future observations corresponding to a list of covariate values of interest.

9.2.5 Model Checking

A second approach is based on the use of "Bayesian residuals." In a traditional regression analysis, one judges the adequacy of the fitted model by inspection of the standardized residuals

Before any data are observed, the parametric residuals are a random sample from an N(0, o7) distribution. Suppose we say that the ith observation is an outlier if IEi I > ko-, where k is a predetermined constant such as 2 or 3. The prior probability that a particular observation is an outlier is 20(-k), where ~P(z) is the standard normal cdf.

After data y are observed, we can compute the posterior probability that each observation is an outlier. Define the functions zl and z2 as

where

Then the posterior probability that the ith observation is an outlier is

In practice, the pis can be computed and compared to the prior probability 2oP(-k). The R function bayesresiduals can be used to compute the posterior outlying probabilities for a linear regression model.

9.2.6 An Example

Ramsey and Schafer (1997), chapter 10, describe an interesting study from Pimm et al (1988) on the extinction of birds. Measurements on breeding pairs of land-bird species were collected from 16 islands around Britain over the course of several decades. For each species, the dataset contains TIME, the average time of extinction on the islands where it appeared, NESTING, the average number of nesting pairs, SIZE, the size of the species (large or small), and STATUS, the migratory status of the species (migrant or resident). The objective is to fit a model that describes the variation in the time of extinction of the bird species in terms of the covariates NESTING, SIZE, and STATUS.

This dataset is available as birdextinct in the LearnBayes package. We read in the datafile and construct some initial graphs. Since the TIME variable is strongly right-skewed, we initially transform it by a logarithm creating the variable LOGTIME. Fig. 9.1, Fig. 9.2, and Fig. 9.3 plot LOGTIME against each of the three predictor variables. Since the categorical variables SIZE and STATUS take only two values, we use the R jitter function to jitter the horizontal location of the points so we can see any overlapping points. Note that there is a positive relationship between the average number of nesting pairs and time to extinction. However, there are five particular species (labeled in the graph) with points that seem to vary from the general pattern. There may be relationships of each of the categorical variables with LOGTIME, but the strength of the relationship seems weak in comparison with the relationship of NESTING and LOGTIME.

Fig. 9.1. Plot of logarithm of the extinction time against the average number of nesting pairs for the bird study.

Fig. 9.2. Plot of the logarithm of the extinction time against the bird size for the bird study. The bird size variable is coded 0 for small and 1 for large.

We write the regression model as

As two of the covariates are categorical with two levels, they can be represented by binary indicators; in the data file birdextinct, SIZE is coded 0 (1) for small (large), and STATUS is coded 0 (1) for migrant (resident).

We first perform the traditional least-squares fit by the lm command.

Fig. 9.3. Plot of the logarithm of the extinction time against the bird status for the bird study. The bird status variable is coded 0 for migrant and 1 for resident.

We see from the output that NESTING is a strong effect; species with larger number of nesting pairs tend to have longer extinction times which means that these species are less likely to be extinct. The SIZE and STATUS effects appear to be less significant; larger birds (with SIZE = 1) have smaller extinction times and resident birds (with STATUS = 1) have longer extinction times.

The function blinreg is used to sample from the joint posterior distribution of /3 and Q. The inputs to this function are the vector of values of the response variable y, the design matrix of the linear regression fit X, and the number of simulations in. Note that we used the optional arguments x = TRUE, y = TRUE in the function in so that the design matrix and response vector are available as components of the structure fit.

The algorithm in binreg is based on the decomposition of the joint posterior [/3, u2 ly] as the product [ore Iy] [131x2, y]. To simulate one draw of (C72, 13), a2 is first drawn from the inverse gamma((n - k)/2, S/2) density:

The function blinreg returns two components, beta is a matrix of simulated draws from the marginal posterior of ), where each row is a simulated draw, and sigma is a vector of simulated draws from the marginal posterior of or.

The following R commands construct histograms of the simulated posterior draws of the individual regression coefficients ~1, 02, and /33 and the error standard deviation g (see Fig. 9.4).

We can summarize each individual parameter by computing the 5th, 50th, and 95th percentiles of each collection of simulated draws. In the output, we use the apply and quantile commands to summarize the simulation matrix of /3 theta. sample$beta. Similarly, we use the quantile command to simulate the draws of a.

Fig. 9.4. Histogram of simulated draws from the marginal posterior distributions of /31, /32, /33, and a.

As expected, the posterior medians of the regression parameters are similar in value to the ordinary regression estimates. Actually they are equivalent since we applied a vague prior for /3; any small differences between the posterior medians and the least-square estimates are due to small errors inherent in the simulation.

Next, suppose we are interested in estimating the mean log extinction time E(ylx*) = x*/3 for four nesting pairs and for different combinations of SIZE and STATUS. The values of the four sets of covariates are shown in Table 9.1.

In the following input, we define the four sets of covariates and stack these sets in the matrix X1. The function blinregexpected will give a simulated sample for the expected response E(ylx*) = x*' 3 for each set of covariate values. The inputs to the function are the matrix X1 of covariate values and the list of simulated values of /3 and or obtained from the function binlinreg. The output of the function is a matrix where a column contains the simulated draws for a given covariate set. We construct histograms of the simulated draws for each of the mean extinction times and the plots are displayed in Fig. 9.5.

Fig. 9.5. Histograms of simulated draws of the posterior of the mean extinction time for four sets of covariate values.

Fig. 9.6 displays histograms of the simulated draws from the predictive distribution for the same four sets of covariates. Comparing Fig. 9.5 and Fig. 9.6 note that the predictive distributions are substantially wider than the mean response distributions.

Fig. 9.6. Histograms of simulated draws of the predictive distribution for a future extinction time for four sets of covariate values.

Another method for outlier detection is based on the use of the Bayesian residuals Ez = yz - x13. Following the strategy described in Section 9.2.5, we can compute the posterior outlying probabilities

for all observations for a constant value k. These probabilities can be computed using the function bayesresiduals. The inputs are the lm fit structure fit, the matrix of simulated parameter draws theta.sample, and the value of k. The output is a vector of posterior outlying probabilities. In this example, we use a cutoff value of k = 2. We use the plot command to construct a scatterplot of the probabilities against the nesting covariate; the resulting display is in Fig. 9.8. By use of the identify command, we identify four birds that have outlying probabilities of .4 or higher. These birds have extinction times that are not well-explained by the variables NESTING, SIZE, and STATUS. Two of the outlying species, raven and skylark, were also identified by the posterior predictive methodology.

Fig. 9.7. Posterior predictive distributions of {yz } with actual log extinction times {yz} indicated by solid points.

9.3 Survival Modeling

Suppose one is interested in constructing a model for lifetimes in a survival study. For a set of n individuals, one observes the lifetimes t1, ..., tn. It is possible that some of the lifetimes are not observable since some individuals are still alive at the end of the study. In this case we represent the response by the pair (ti, bj), where ti is the observation and bi is a censoring indicator. If Si = 1, the observation is not censored and ti is the actual survival time. Otherwise when Si = 0, the observation ti is the censored time.

Fig. 9.8. Plot of posterior probabilities of outlying for all observations. Four unusually large probabilities are identified with the name of the species.

Suppose we wish to describe the variation in the survival times using p covariates xj, ..., xP. One can describe this relationship by the Weibull proportional hazards model. This model can be expressed as the log-linear model

where xii, ..., xip are the values of the p covariates for the ith individual and Ei is assumed to have a Gumbel distribution with density f (E) = exp(E - c). There are p+2 unknown parameters in this model, the p regression coefficients, the constant term It, and the scale parameter or.

It can be shown that the density of the log time, yz = log tz is given by

where zi = (yz -,u - ~31xi1 - ... - 13pxzp)/Q. Also, the survival function for the ith individual is given by SS(yz) = exp(-ez4). Then the likelihood function of the regression vector /3 = (/31, ..., /3p), y and or is given by

Suppose we assign y, /3 uniform priors and the scale parameter Q the usual noninformative prior proportional to 1/o-. Then the posterior density is given, up to a proportionality constant, by

To illustrate the application of this model, Edmunson et al (1979) studied the effect of different chemotherapy treatments following surgical treatment of ovarian cancer. The response variable TIME was the survival time in days following randomization to one of two chemotherapy treatments. Also we record a censoring variable STATUS that indicates if TIME is an actual survival time (STATUS = 1) or censored at that time (STATUS = 0). The two covariates are TREAT, the treatment group, and AGE, the age of the patient. The log-linear model is

The dataset is given the name chemotherapy in the LearnBayes package. To begin, we read in the dataset and illustrate fitting this model using the survreg function in the survival library.

Unlike the normal regression model, the posterior distribution of the parameters of this survival model can not be simulated by standard probability distributions. But we are able to apply our general computing strategy described in Chapter 6 to summarize the posterior distribution for this problem. We first make all parameters real-valued by transforming the scale parameter or to r) = log a-. We write the following function weibullregpost, which computes the joint posterior density of 0 = (ri, µ, /31i 02). The argument data is the data matrix where the first two columns are {tz} and {cz} and the remaining columns are the covariates TREAT and AGE.

To get some initial estimates at the location and spread of the posterior density, we use the laplace function. We use the output of the survreg fit to suggest the initial guess at the posterior mode (-.5, 9,.5, -.05). The output of this function is the posterior mode 0 and associated variance-covariance matrix V.

We then use the information from the laplace function to find a proposal density for the Metropolis random walk chain programmed in the R function rwmetrop. The proposal density will be a multivariate normal density with mean 0 and variance-covariance scale V, where scale is a scale parameter chosen so that the random walk chain has an acceptance range in the 20-40% range. With some trial and error, we find that scale = 1.5 seems to give a satisfactory acceptance rate.

By use of several hist commands, we display histograms of the simulated draws from the marginal posterior densities of ~31 (corresponding to TREAT), X32 (corresponding to AGE), and the scale parameter o (see Fig. 9.9).

Suppose one is interested in estimating the survival curve for an individual in the treatment group (TREAT = 1) who is 60 years old. For a given time t, the probability that this individual survives beyond t days is given by

where z = (log t-µ-/.31(1)-/32(60))/a. A simulated sample of draws from this survival probability is obtained by computing this function on the simulated draws of 0, and this simulated sample can be summarized by the 5th, 50th, and 95th percentiles. This procedure was repeated for a grid of t values between 0 and 2000 days. Fig. 9.10 graphs the 5th, 50th, and 95th percentiles for the survival curve for this individual. In a similar fashion, it is straightforward to make inferences about any function of the parameters of interest.

Fig. 9.9. Plot of the posterior probabilities of regression coefficients for TREAT and AGE and the scale parameter Q for the chemotherapy example.

9.4 Further Reading

Chapter 14 of Gelman et al (2003) introduces Bayesian model building and inference for normal linear models. Analogous methods for generalized linear models are presented in chapter 16 of Gelman et al (2003). The Bayesian linear regression model is also described in chapter 6 of Gill (2002) and chapter 12 of Press (2003). The classical Weibull survival regression model is discussed in Collett (1994). Chaloner and Brant (1988) describes the use of Bayesian residuals in a linear regression model.

Fig. 9.10. Posterior median and 90% Bayesian interval estimates for the survival function S for an individual 60 years old in the treatment group.

9.5 Summary of R Functions

bayesresiduals - computation of posterior outlying probabilities for a linear regression model with a noninformative prior

Usage: bayesresiduals(fit, theta.sample, k)

Arguments: fit, output of a least-squares fit (R function lm); theta.sample, list with components beta (matrix of simulated draws from the posterior of beta) and sigma (vector of simulated draws from the posterior of sigma); k, cutoff value that defines an outlier

Value: vector of posterior outlying probabilities

blinreg - gives a simulated sample from the joint posterior distribution of the regression vector and the error standard deviation for a linear regression model with a noninformative prior

Usage: blinreg(y,X,m)

Arguments: y, vector of responses; X, design matrix; m, number of simulations desired

Value: beta, matrix of simulated draws of beta where each row corresponds to one draw; sigma, vector of simulated draws of the error standard deviation blinregexpected - simulates draws of the expected response for a linear regression model with a noninformative prior

Usage: binregexpected(X,theta.sample)

Arguments: X, matrix where each row corresponds to a covariate set;

theta.sample, list with components beta (matrix of simulated draws from the posterior of beta) and sigma (vector of simulated draws from the posterior of sigma

Value: matrix where a column corresponds to the simulated draws of the expected response for a given covariate set

blinregpred - simulates draws of the predicted future response for a linear regression model with a noninformative prior

Usage: binregpred(X,theta.sample)

Arguments: X, matrix where each row corresponds to a covariate set;

theta.sample; list with components beta (matrix of simulated draws from the posterior of beta) and sigma (vector of simulated draws from the posterior of sigma

Value: matrix where a column corresponds to the simulated draws of the predicted future response for a given covariate set

weibullregpost - computes the logarithm of the posterior of (log sigma, mu, beta) for a Weibull proportional odds model

Usage: weibullregpost (theta, data)

Arguments: theta, matrix of parameter values where each row represents a value of (log sigma, mu, beta); data, matrix with columns survival time, censoring variable, and covariate matrix

Value: vector of values of the log posterior where each value corresponds to each row of the parameters in theta

9.6 Exercises

1. Normal linear regression

Dobson (2001) describes a birthweight regression study. One is interested in predicting a baby's birthweight (in grams) based on the gestational age (in weeks) and the gender of the baby. The data are presented in Table 9.2 and available as birthweight in the LearnBayes package. In the standard linear regression model, we assume that

where the ei are independent and normally distributed with mean 0 and variance Q2.

a) Use the R function in to fit this model by least-squares. From the output, assess if the effects AGE and GENDER are significant, and if they are significant, describe the effects of each covariate on birthweight.

b) Suppose a uniform prior is placed on the regression parameter vector ~ = (/3o, /31, /32). Use the function blinreg to simulate a sample of 5000 draws from the joint posterior distribution of (0, (72). From the simulated sample, compute posterior means and standard deviations of /3i and /32. Check the consistency of the posterior means and standard deviations with the least-squares estimates and associated standard errors from the in run.

c) Suppose one is interested in estimating the expected birthweight for male and female babies of gestational weeks 36 and 40. From the simulated draws of the posterior distribution and function binregexpected, construct 90% interval estimates for 36-week males, 36-week females, 40-week males, and 40-week females.

d) Suppose instead one wishes to predict the birthweight for a 36-week male, a 36-week female, a 40-week male, and a 40-week female. Use the function blinregpred and the simulated posterior sample to construct 90% prediction intervals for the birthweight for each type of baby.

2. Logistic regression

For a given professional athlete, his or her performance level will tend to increase until midcareer and then deteriorate until retirement. Let yz denote the number of home runs hit by the professional baseball player Mike Schmidt in nz at-bats (opportunities) during the ith season. Table 9.3 gives Schmidt's age, yz and ni for all 18 years of his baseball career. The datafile is named schmidt in the LearnBayes package. The home run rates {yi/nz} are graphed against Schmidt's year in Fig. 9.11. If yz is assumed to be binomial(ni, pi), where pi denotes the probability of hitting a home run during the ith season, then a reasonable model for the {pi} is the logic quadratic model of the form

where AGEE is Schmidt's age during the ith season.

a) Assume that the regression vector 13 = (130, ,31i X32) has a uniform noninformative prior. Write a short R function to compute the logarithm of the posterior density of 1.3.

b) Use the function laplace to find the posterior mode and associated variance-covariance matrix of /3.

c) Based on the output from laplace, use the function rwmetrop to simulate 5000 draws from the posterior distribution of 13.

d) One would expect the fitted parabola to have a concave down shape where 132 < 0. Use the simulation output from part (c) to find the posterior probability that the fitted curve is concave down.

3. Logistic regression (continued)

For this exercise, we assume that a simulated sample from the posterior distribution of the regression vector 13 has been obtained.

a) When evaluating a baseball player, one is interested in estimating the player's ability at his peak. One can show that if N2 < 0, the peak value of the probability, on the logit scale, is given by

Compute a density estimate of the marginal posterior density of PEAK.

b) One is also interested in the age at which a player achieves his peak performance. From the quadratic model, the peak age can be shown to be equal to

Fig. 9.11. Scatterplot of home run rates HR/AB against age for Mike Schmidt.

Using the simulated draws from the posterior of /.3, find a 90% interval estimate for the PEAK AGE.

4. Survival modeling

Collett (1994) describes an investigation to evaluate a histochemical marker HPA, which discriminates between primary breast cancer that has metastasized and that which has not. The question is whether HPA staining can be used to predict the survival experience of women with breast cancer. Tumors of the women were treated with HPA, and each tumor was classified as being positively or negatively stained, positively staining corresponds to a tumor with the potential for metastasis. Survival times of the women who died of breast cancer were collected; the data are displayed in Table 9.4. For some women (indicated by an asterisk in Table 9.4), the survival status at the end of the study was unknown and the time from surgery to last date they were known to be alive is a censored survival time. The datafile breastcancer in the LearnBayes package contains the data. There are three variables: time is the survival time (in months); status gives the censoring status, where status = 1 indicates a complete survival time and status = 0 indicates a time that is censored; and stain indicates the group, where stain = 0 (1) indicates a tumor that was negatively (positively) stained.

a) Use the function survreg to fit a Weibull proportional hazards model of the form

where Ez is assumed to have a standard Gumbul distribution. Obtain estimates and associated standard errors for the group regression coefficient /3 and the scale parameter o.

b) The function weibullregpost computes the log posterior of

(log a, y, ~) assuming the standard noninformative prior. Use the function laplace to find the posterior mode and associated variancecovariance matrix. Then apply the function rwmetrop to simulate a sample of 1000 iterates from the joint posterior. Compute the posterior mean and standard deviation of ~ and a and compare your answers with the estimates from part (a).

c) Using the simulated sample from the posterior of (log or, µ, ~3), estimate the survival curve S(t) for a patient in the negatively stained group and a patient in the positively stained group. Choose a sequence of values of the time t, and for each t, find 5th, 50th, and 95th percentiles of the survival probability S(t). As in Fig. 9.10, graph the median estimates of the survival curves for the two individuals.

 

10.1 Introduction

One attractive method for constructing an MCMC algorithm is Gibbs sampling, introduced in Chapter 6. To slightly generalize our earlier discussion, suppose that we partition the parameter vector of interest into p components 0 = (9i, ..., Bp), where 6k may consist of a vector of parameters. The MCMC algorithm is implemented by sampling in turn from the p conditional posterior distributions

Under general regularity conditions, draws from this Gibbs sampler will converge to the target joint posterior distribution [Or, ..., Bp] of interest.

For a large group of inference problems, Gibbs sampling is automatic in the sense that all conditional posterior distributions are available or easy to simulate using standard probability distributions. There are several attractive aspects of "automatic" Gibbs sampling. First, one can program these simulation algorithms with a small amount of R code, especially when one can use vector and matrix representations for parameters and data. Second, unlike the more general Metropolis-Hastings algorithms described in Chapter 6, there are no tuning constants or proposal densities to define. Last, these Gibbs sampling algorithms provide a nice introduction to the use of more sophisticated MCMC algorithms in Bayesian fitting.

We illustrate the use of R to write Gibbs sampling algorithms for several popular inferential models. We revisit the robust modeling example of Section 6.8 where we applied various computational algorithms to summarize the exact posterior distribution. In Section 10.2, we illustrate a simple Gibbs sampler by representing the t sampling model as a scale mixture of normal densities. In Section 10.3, we apply the idea of latent variables to simulate from a binary response model where a probit link is used. This algorithm is attractive in that one can simulate from this probit model by iterating between truncated normal and multivariate normal probability distributions.

We conclude the chapter by considering a problem where one desires to smooth a two-way table of means. One model for these data is to assume that the underlying population means of the table follow a particular order restriction. A second model assumes that the population means follow a hierarchical regression model, where the population means are a linear function of row and column covariates. For both problems, R functions can be used to implement Gibbs sampling algorithms for simulating from the joint posterior of all parameters. These algorithms are automatic in that they are entirely based on standard probability distribution simulations.

10.2 Robust Modeling

We revisit the situation in Section 6.9 where we model data with a symmetric continuous distribution. When there is a possibility of outliers, a good strategy assumes the observations are distributed from a population with tails that are heavier than the normal form. One example of a heavy-tailed distribution is the t family with a small number of degrees of freedom.

With this motivation we suppose yl,..., y.,, are a sample from a t distribution with location it, scale parameter or, and known degrees of freedom v. If we assign the usual noninformative prior on (µ Q)

the posterior density is given by

In the case of Cauchy sampling (v = 1), we illustrated in Section 6.9 the use of different computational algorithms to summarize this representation of the posterior density.

By use of a simple trick, we can implement an automatic Gibbs sampler for this problem. A t density with location y, scale a, and degrees of freedom v can be represented as the following mixture:

Suppose each observation yi is represented as a scale mixture of normals with the introduction of the scale parameter .i. Then we can write our model as

In the following, it is convenient to express the posterior in terms of the variance Q2 instead of the standard deviation Q. Using the scale-mixture representation, the joint density of all parameters (µ, o'2, {.Ai}) is given by

On the surface, it appears that we have complicated the analysis through the introduction of the scale parameters {.Ai}. But Gibbs sampling is easy now since all of the conditional distributions have the following simple functional forms:

1. Conditional on y and a •2, A,,..., An are independent where

2. Conditional on 0-'2 and {Ai}, the mean µ has a normal distribution:

3. Conditional on µ and {Az}, the variance Q2 has an inverse gamma distribution:

In R, we can let lam denote the vector {Ai}, and mu and sig2 denote the values of µ and Q2. These three conditional distribution simulations can be implemented by the following R commands:

Note that we are using the random gamma function rgamma using a vector rate parameter; due to the conditional independence property, A,,..., An can be simultaneously simulated by a single command. Also we have defined the function rigamma in the LearnBayes package to simulate from the inverse gamma density y 1 exp(-b/y) with arguments a and b.

The function robustt will implement this Gibbs sampling algorithm. The three arguments to this function are the data vector y, the degrees of freedom v, and the number of cycles of the Gibbs sampler m. The output of this function is a list with three components: mu is a vector of simulated draws of µ, s2 is a vector of simulated draws of Q2, and lam is a matrix of simulated draws of {.i}, where each row corresponds to a single draw.

We apply this algorithm to Darwin's dataset of the differences of the heights of cross- and self-fertilized plants analyzed in Chapter 6. We model the observations with a t(4) density and run the algorithm for 10,000 cycles.

We use the density estimation command density to construct a smooth estimate of the marginal posterior density of the location parameter µ. The resulting graph is shown in Fig. 10.1.

Fig. 10.1. Density estimate of simulated sample of marginal posterior density of µ in t modeling example.

The {Az} parameters are interesting to examine since )z represents the weight of the observation y in the estimation of the location and scale parameters of the t population. In the following R code, we compute the posterior mean of each Ai and place the posterior means in the vector mean. lambda. Likewise, we compute the 5th and 95th percentiles of each simulated sample of {)j} (by use of the apply command with the function quantile) and store these quantiles in the vectors lams and 1am95. We first plot the posterior means of the {)i} against the observations {yi}, then we overlay lines that represent 90% interval estimates for these parameters (see Fig. 10.2). Note that the location of the posterior density of Ai tends to be small for the outlying observations; these particular observations are downweighted in the estimation of the location and scale parameters.

Fig. 10.2. 90% posterior interval estimates of scale parameters {)z} plotted against the observations y. The observations are also plotted along the horizontal axis.

10.3 Binary Response Regression with a Probit Link

In Section 4.4, we considered a regression problem where we modeled the probability of death as a function of the dose level of a compound. We now consider the more general case where a probability is represented as a function of several covariates. By regarding this problem as a missing data problem, one can develop an automatic Gibbs sampling method described in Albert and Chib (1993) for simulating from the posterior distribution.

Suppose one observes binary observations yl,..., y,,. Associated with the ith response, one observes the values of k covariates Xil, ..., Xik. In the probit regression model, the probability that yz = 1, pi, is written as

where l3 = (/31i ..., /3k.) is a vector of unknown regression coefficients and q5() is the cdf of a standard normal distribution. If we place a uniform prior on /3, then the posterior density is given by

In the example to be discussed shortly, the binary response yj is an indicator of survival, where yj = 1 indicates the person survived the ordeal and yz = 0 indicates the person did not survive. Suppose that there exists a continuous measurement Zi of health such that if Zz is positive, then the person survives; otherwise the person does not survive. Moreover the health measurement is related to the k covariates by the normal regression model

where Ei,..., c,, are a random sample from a standard normal distribution. It is a straightforward calculation to show that

So we can regard this problem as a missing data problem where we have a normal regression model on latent data Zl,..., Z,,, and the observed responses are missing or incomplete in that we only observe if Zz > 0 (yj = 1) or Zz < 0(yi=0).

An automatic Gibbs sampling algorithm is constructed by adding the (unknown) latent data Z = (Z1, ..., Z,,,) to the parameter vector /3 and sampling from the joint posterior distribution of Z and ~. Both conditional posterior distributions, [ZJ/3] and [)IZ], have convenient functional forms. If we are given a value of the vector of latent data Z, then it can be shown that the conditional posterior distribution of /3 is

where X is the design matrix for the problem. If we are given a value of the regression parameter vector ~3, then Z1, ..., Z,,, are independent, with

and xi denotes the vector of covariates for the ith individual. So given the value of ~3, we simulate the latent data Z from truncated normal distributions, where the truncation point is 0 and the side of the truncation depends on the values of the binary response.

The function bayes.probit implements this Gibbs sampling algorithm for the probit regression model. The key lines in the R code of this function simulate from the two conditional distributions. To simulate a variate Z from a normal(p,1) distribution truncated on the interval (a, b), one uses the recipe

where 0O and 0-1() are, respectively, the standard normal cdf and inverse cdf, and U is a uniform variate on the unit interval. In the following code, lp is the vector of linear predictors and y is the vector of binary responses. Then the latent data z are simulated by the following code:

Given values of the latent data in the vector z and the design matrix in x, the following code simulates the vector data from the multivariate normal distribution:

To illustrate the use of the function bayes.probit, we consider a dataset on the Donner party, a group of wagon train emigrants who had difficulty in crossing the Sierra Nevada mountains in California and a large number starved to death. (See Grayson (1990) for more information about the Donner party.) The dataset donner in the LearnBayes package contains the age, gender, and survival status for 45 members of the party age 15 and older. For the ith member, we let yi denote the survival status (1 if survived, 0 if not survived), MALEi denote the gender (1 if male, 0 if female), and AGEi denote the age in years. We wish to fit the model

We read in the dataset that has variable names survival, male, and age. We create the design matrix and store it in the variable X.

A maximum likelihood fit of the probit model can be found using the glm function with the family=binomial option, indicating by link=probit that a probit link is used.

To fit the posterior distribution of ~3 by Gibbs sampling, we use the function bayes.probit. The inputs to this function are the vector of binary responses survival, the design matrix X, and the number of cycles of Gibbs sampling M.

The output of this function is a matrix of simulated draws, where each row corresponds to a single draw of 13. We can compute the posterior means and posterior standard deviations of the regression coefficients by use of the apply function.

The posterior mean and standard deviations are similar in value to the maximum likelihood estimates and their associated standard errors. This is expected since the posterior analysis was based on a noninformative prior on the regression vector 13.

Since both the age and gender variables appear to be significant in this study, it is interesting to explore the probability of survival

as a function of these two variables. The function bprobit.probs is useful for computing a simulated posterior sample of probabilities for covariate sets of interest. For example, suppose we wish to estimate the probability of survival for males age 15 through 65. We construct a matrix of covariate vectors Xl, where a row corresponds to the values of the covariates for a male of a particular age. The function bprobit.probs is used with inputs X1 and the simulated matrix of simulated regression coefficients from bayes.probit. The output is a matrix of simulated draws p.male, where each column corresponds to a simulated sample for a given survival probability.

We can summarize the simulated matrix of probabilities by the apply command. We compute the 5th, 50th, and 95th percentiles of the simulated sample of

for each of the AGE values. In Fig. 10.3, we graph these percentiles as a function of age. For each age, the solid line is the location of the median of the survival probability and the interval between the dashed lines corresponds to a 90% interval estimate for this probability. In Fig. 10.4, we repeat this work to estimate the survival probabilities of females of different ages. These two figures clearly show how survival is dependent on the age and gender of the emigrant.

10.4 Estimating a Table of Means

10.4.1 Introduction

A university would like its students to be successful in their classes. Since all students do not do well and some may eventually drop out, the admissions office is interested in understanding what measures of high school performance are helpful in predicting success in college. The standard measure of performance in university courses is the grade point average (GPA). The admissions people are interested in understanding the relationship between a student's GPA with two particular high school measures: the student's score on the ACT exam (a standardized test given to all high school juniors) and the student's percentile rank in his or her high school class.

Fig. 10.3. Posterior distribution of probability of survival for males of different ages. For each age, the 5th, 50th, and 95th percentiles of the posterior are plotted.

The datafile iowagpa in the LearnBayes package contains the data for this problem. This dataset is a matrix of 40 rows, where a row contains the sample mean, the sample size, the high school rank percentile and the ACT score. By use of the R matrix command, these data are represented by the following two-way table of means. The row of the table corresponds to the high school rank (HSR) of the student and the column corresponds to the level of the ACT score. The entry of the table is the mean GPA of all students with the particular high school rank and ACT score.

Fig. 10.4. Posterior distribution of probability of survival for females of different ages. For each age, the 5th, 50th, and 95th percentiles of the posterior are plotted.

The following table gives the number of students in each level of high school rank and ACT score. Note that most of the students are in the upper-right corner of the table corresponding to high values of both variables.

The admissions people at this university believe that both high school rank and ACT score are useful predictors of grade point average. One way of expressing this belief is to state that the corresponding population means of the table satisfy a particular order restriction. Let /,ti denote the mean GPA of the population of students with the ith level of HSR and jth level of ACT score. If one looks at the ith row of the table with a fixed HSR rank, it is reasonable to believe that the column means satisfy the order restriction

T This s i h expresses the belief that if you focus on students with a given highschool rank, then students with higher ACT scores will obtain higher grade point averages. Likewise, for a particular ACT level (jth column), one may believe that students with higher percentile ranks will get higher grades, and thus the row means satisfy the order restriction

The standard estimates of the population means are the corresponding observed sample means. Fig. 10.5 displays the matrix of sample means using a series of line graphs where each row of means is represented by a single line. (This graph is created using the R function matplot.) Note from the figure that the sample means do not totally satisfy the order restrictions. For example, in the "31-40" row of HSR, the mean GPA for ACT score 19-21 is larger than the mean GPA in the same row for larger values of ACT. It is desirable to obtain smoothed estimates of the population means that more closely follow the belief in order restriction.

Fig. 10.5. Sample mean GPAs of students for each level of high school rank (HSR) and ACT score.

10.4.2 A Flat Prior Over the Restricted Space

Suppose one is certain before sampling that the population means follow the order restriction, but otherwise one has little opinion about the location of the means. Then if it denotes the vector of population means, one could assign the flat prior

where A is the space of values of y that follow the order restrictions.

Let yid and nzj denote the sample mean GPA and sample size, respectively, of the (i, j) cell of the table. We assume that the observations yll, ..., y85 are independent with yid distributed normal with mean µzj and variance a2/nzj where c7 is known. The likelihood function of µ then is given by

Combining the likelihood with the prior, the posterior density is given by

This is a relatively complicated 40-dimensional posterior distribution due to the restriction of its mass to the region A. However, to implement the Gibbs sampler, one only requires the availability of the set of full conditional distributions. Here "available" means that the one-dimensional distributions have recognizable distributions that are easy to simulate. Note that the posterior distribution of µij, conditional on the remaining components of y, has the truncated normal form

where max{µi_1j, µz, _1} < µzj < min{ L j+1i µz+1.J}.

The R function ordergibbs implements Gibbs sampling for this model. As mentioned earlier, we assume that the standard deviation o- is known, and the known value o' = .65 is assigned inside the function. To begin the algorithm, the program uses a starting value for the matrix of means µ that satisfies the order restriction. Also for ease in programming, the means are embedded within a larger matrix augmented by two rows and two columns containing values of -oc and +oo. Note in this programming we have changed the ordering of the rows so that the means are increasing from the first to last rows.

In the one main loop, the program goes sequentially through all entries of the population matrix y, simulating at each step from the posterior of an individual cell mean conditional on the values of the remaining means of the table. The posterior density of tjj is given by a truncated normal form, where the truncation points depend on the current simulated values of the means in a neighborhood of this (i, j) cell. For example, beginning with the starting value of y, one would first simulate pl, from a normal (yji,a/nll) distribution truncated on the interval (-oc, min{1.59,1.85}). As shown in this fragment of the code of the function ordergibbs, a truncated normal simulation is accomplished by the special R function rnormt.

Given the R matrix iowagpa containing two columns of sample means and sample sizes, the command s=ordergibbs (iowagpa,m) implements Gibbs sampling for rn cycles and the matrix of simulated values is stored in the matrix MU.A column of the matrix represents an approximate random sample from the posterior distribution for a single cell mean. In the following, we use in = 5000 iterations.

The apply command is used to find the posterior means of all cell means and the collection of posterior means is placed in an 8-by-5 matrix. Fig. 10.6 displays these posterior means. Note that since the prior support is entirely on the order-restricted space, these posterior means do follow the order restrictions.

One way of investigating the impact of the prior belief in order restriction on inference is to compute the posterior standard deviations of the cell means and compare these estimates with the classical standard errors. By use of the apply command, we compute the posterior standard deviations:

Fig. 10.6. Plot of posterior means of GPAs using noninformative prior on orderrestricted space.

The standard error of the observed sample mean yid is given by SE(yzj) = c7/nij, where we assume that a = .65. The following table computes the ratios {SD(,u 1 y)/SE(yzj)} for all cells. Note that most of the ratios are in the .5 to .7 range, indicating that we are adding significant prior information by use of this order-restricted prior.

10.4.3 A Hierarchical Regression Prior

The use of the flat prior over the restricted space A resembles a frequentist analysis where one would find the maximum likelihood estimate. However, from a subjective Bayesian viewpoint, alternative priors could be considered. If one believes that the means satisfy an order restriction, then one may also have prior knowledge about the location of the means. Specifically, one may believe that the mean GPAs may be linearly related to the high school rank and ACT scores of the students.

One can construct a hierarchical regression prior to reflect the relationship between the GPA and the two explanatory variables. At the first stage of the prior, we assume the means are independent where pig is normal with location given by the regression structure

We describe the R implementation for a single Gibbs cycle that simulates in turn from the three sets of conditional posterior distributions.

3. Simulation of µ. Conditional on the remaining parameters, the components of p have independent normal distributions. It is convenient to simultaneously simulate all distributions by means of vector operations. The R variable postvar contains values of the posterior variances for the components of µ and postmean contains the respective posterior means. Then the command rnorm(n,postmean,sgrt(postvar)) simulates the values from the 40 independent normal distributions.

The Gibbs sampler is run for 5000 cycles by executing the function hiergibbs.

Fig. 10.7 shows density estimates of the simulated draws of the regression coefficients X31 and X32 corresponding respectively to the two covariates high school rank and ACT score. We summarize each coefficient by the computation of the .025, .25, .5, .75, and .975 quantiles of each batch of simulated draws. A 95% interval estimate for N2, for example, is given by the .025 and .975 quantiles: (.0223, .0346).

Fig. 10.7. Density estimates of simulated draws of regression coefficients /3i and /32 in hierarchical regression model.

Last, we compute and display the posterior means of the cell means in Fig. 10.8. These posterior mean estimates using a hierarchical prior look similar to the posterior estimates using a noninformative prior on the restricted space displayed in Fig. 10.6.

Fig. 10.8. Plot of posterior means of GPAs using hierarchical prior.

10.4.4 Predicting the Success of Future Students

One can express this probability as the integral

We illustrate this computation when a hierarchical regression model is placed on the cell means. Recall that the output of the function hiergibbs in our example was FIT and so FIT$mu is the matrix of simulated cell means from the posterior distribution. We transform all the cell means to probabilities of success by use of the pnorm function and we compute the sample means for all cells by use of the apply function.

We convert this vector of estimated probabilities of success to a matrix by the matrix command, attach row and column labels to the table by the dimnames command, and then display the probabilities, rounding to the third decimal space.

This table of predictive probabilities should be useful to the admissions officer at the university. By this table, one may wish to admit students who have a predictive probability of, say, at least 0.70, of being successful in college.

10.5 Further Reading

Gelfand and Smith (1990) and Gelfand et al (1990) were the first papers to describe the statistical applications for Gibbs sampling. Wasserman and Verdinelli (1991) and Albert (1992) describe the use of Gibbs sampling in outlier models. The use of latent variables and Gibbs sampling for fitting binary response models is described in Albert and Chib (1993). The use of Gibbs sampling in modeling order restrictions in a two-way table of means was illustrated in Albert (1994).

10.6 Summary of R Functions

bayes.probit - simulates from a probit binary response regression model using data augmentation and Gibbs sampling

Usage: bayes.probit(y, X, m)

Arguments: y, vector of binary responses; X, covariate matrix; m, number of simulations

Value: matrix of simulated draws of the regression vector beta, where each row corresponds to a draw of beta

bprobit.probs - simulates fitted probabilities for a probit regression model

Usage: bprobit.probs(X, fit)

Arguments: X, matrix where each row corresponds to a covariate set;

fit, matrix of simulated draws from the posterior distribution of the regression vector beta

Value: matrix of simulated draws of the fitted probabilities, where a column corresponds to a particular covariate set

hiergibbs - implements Gibbs sampling for estimating a two-way table of normal means under a hierarchical regression model

Usage: hiergibbs(data, m)

Arguments: data, data matrix where columns are observed sample means, sample sizes, and values of two covariates; m, number of cycles of Gibbs sampling

Value: beta, matrix of simulated values of regression parameter; mu, matrix of simulated values of cell means; var vector of simulated values of second-stage prior variance

ordergibbs - implements Gibbs sampling for estimating a two-way table of normal means under an order restriction

Usage: ordergibbs(data, m)

Arguments: data, data matrix where first column contains the sample means and the second column contains the sample sizes; m, number of iterations of Gibbs sampling

Value: matrix of simulated draws of the normal means where each row represents one simulated draw

robustt - implements Gibbs sampling for a robust t sampling model with location mu, scale sigma, and degrees of freedom v

Usage: robustt (y, v, m)

Arguments: y, vector of data values; v, degrees of freedom for t model; m, number of cycles of the Gibbs sampler

Value: mu, vector of simulated values of mu; s2, vector of simulated draws of sigma2; lam, matrix of simulated draws of lambda where each row corresponds to a single draw

10.7 Exercises

1. Robust modeling with Cauchy sampling

In Section 6.9, different computational methods are used to model data where outliers may be present. The data yl, ...y,z are assumed independent, where yz is Cauchy with location y and scale ar. Using the standard noninformative prior of the form g(µ, a) = 1/a and Darwin's dataset, Table 6.2 presents 5th, 50th, and 95th percentiles of the marginal posterior densities of it and log a using Laplace, brute force, random walk Metropolis, independence Metropolis, and Metropolis within Gibbs algorithms. Use the "automatic" Gibbs sampler as implemented in the function robustt to fit this Cauchy error model where the degrees of freedom of the t density is set to one. Run the algorithm for 10,000 cycles and compute the posterior mean and standard deviation of it and log Q. Compare your answers with the values given in Table 6.2 using the other computational methods.

2. Mixtures of sampling densities

Suppose one observes a random sample yl, ..., yn from the mixture density

where f (yea) is a Poisson density with mean A, p is a mixture parameter between 0 and 1, and Al < A2. Suppose that a priori the parameters (p, A,i )'2) are independent with p assigned a uniform density and )'i assigned gamma(ai, bi), i = 1, 2. Then the joint posterior density is given by

Suppose one introduces the latent data Zl,..., Z,,,, where Zz = 1 or 2 if yz - Poisson(A1) or yi - Poisson(A2), respectively. The joint posterior density of the vector of latent data Z = (Z1, ..., Zn,) and the parameters is given by

where I (A) is the indicator function that is equal to 1 if A is true, or 0 otherwise.

a) Find the complete conditional densities of p, Ai, )'2, and Zi.

b) Describe a Gibbs sampling algorithm for simulating from the joint density of (p, A1, '2, Z).

c) Write an R function to implement the Gibbs sampler.

d) To test your function, the following data were simulated from the mixture density with p = .3, Al = 5, and )2 = 15:

Let the prior hyperparameters be equal to al = bl = a2 = b2 = 1. Run the Gibbs sampler for 10,000 iterations. From the simulated output, compute the posterior mean and standard deviation of p, A1, and A2, and compare the posterior means with the parameter values from which the data were simulated.

3. Censored data

Suppose that observations xl, ..., xn, are normally distributed with mean p and variance Q2. However the measuring device has malfunctioned and one only knows the first observation xl exceeds a known constant c; the remaining observations x2i ..., xn are recorded correctly. If we regard the censored observation xi as an unknown and we assign the usual noninformative prior on (µ, Q2), then the joint density of all unknowns (the single observation and the two parameters) has the form

a) Suppose one partitions the unknowns by [µ, O"2] and [xi]. Describe the conditional posterior distributions [µ,a2xl] and [xllp,Q2].

b) Write an R function to program the Gibbs sampling algorithm based on the conditional distributions found in part (a).

c) Suppose the sample observations are 110, 104, 98, 101, 105, 97, 106, 107, 84, 104, where the measuring device is "stuck" at 110 and one knows that the first observation exceeds 110. Use the Gibbs sampling algorithm to find 90% interval estimates for p and a,

4. Order restricted inference

Suppose one observes yi, ..., YN, where yz is distributed binomial with sample size ni and probability of success pi. A priori suppose one assigns a uniform prior over the space where the probabilities satisfy the order restriction

a) Describe a Gibbs sampling algorithm for simulating from the joint posterior distribution of (p1, ..., pN).

b) Write an R function to implement the Gibbs sampler found in part (a).

c) Suppose N = 4, the sample sizes are nl = ... = n4 = 20 and one observes yi = 2, y2 = 5, y3 = 12, and y4 = 9. Use the R function to simulate 1000 draws from the joint posterior distribution of (p1,p2,p3,P4)•

5. Grouped data

In Section 6.7, inference about the mean y and the variance Q2 of a normal population is considered, where the heights of male students are observed in grouped form as displayed in Table 6.1. Let y = (yl, ..., yn,) denote the vector of actual unobserved heights that are distributed N(µ, C7). Consider the joint posterior distribution of all unobservables (y, µ, a2). As in Section 6.7, we assume that the parameters (µ, a2) have the noninformative prior proportional to 1/Q2.

a) Describe the conditional posterior distributions [ylµ, Q2] and [µ, only].

b) Program an R function that implements a Gibbs sampler based on the conditional posterior distributions found in part (a).

c) Using the R function, simulate 1000 cycles of the Gibbs sampler. Compute the posterior mean and posterior standard deviation of /,t and a and compare your estimates with the values reported using the Metropolis random walk algorithm in Section 6.7.

 

11.1 Introduction to WinBUGS

The BUGS project is focused on the development of software to facilitate Bayesian fitting of complex statistical models using Markov chain Monte Carlo algorithms. In this chapter, we introduce the use of R in running WinBUGS, a stand-alone software program for the Windows operating system.

WinBUGS is a program for sampling from a general posterior distribution of a Bayesian model by use of Gibbs sampling and a general class of proposal densities. To describe the use of WinBUGS in a very simple setting, suppose you observe y distributed binomial(n, p) and a beta(a, ~3) prior is placed on p where a = 0.5 and 0 = 0.5. You observe y = 7 successes in a sample of n = 50 and you wish to construct a 90% interval estimate for p.

After you launch the WinBUGS program, you create a file that describes the Bayesian model. For this example, the model script looks like the following:

Note that the script begins with model and one indicates distributional assumptions by the "-" symbol. The names for different distributions (dbin, dbeta, etc.) are similar to the names of these densities in the R system.

After the model is described, one defines the data and any known parameter values in the file. This script begins with the word data and we use a list to specify the values of y, n, a, and /3.

Last, we specify the initial values of parameters in the MCMC simulation. This section begins with the word inits and a list specifies the initial values. Here we have a single parameter p and we decide to begin the simulation at p = .1.

Once the model, data, and initial values have been defined, we tell WinBUGS, in the Sample Monitor Tool, what parameters to monitor in the simulation. These will be the parameters of primary interest in the inferential problem. Here there is only one parameter p that we wish to monitor.

By use of the Update Tool we are able to use WinBUGS to take a simulated sample of a particular size from the posterior distribution. Once the MCMC simulation is finished, we want to make plots or compute diagnostic statistics of the parameters that help us learn if the MCMC simulation has approximately converged to the posterior distribution. If we believe that the simulation draws represent (approximately) a sample from the posterior, then we want to construct a graph of various marginal posterior distributions of interest and compute various summaries to draw inferences about the parameters.

WinBUGS is useful for fitting a variety of Bayesian models, some of high dimension. But the program runs independently of other programs such as R and one is limited to the data analysis tools available in the WinBUGS system. Recently, there have been efforts to provide interfaces between popular statistical packages (such as R) and WinBUGS. In the remainder of the chapter, we describe one attractive R function bugs that simplifies the process of using the WinBUGS program and allows one to use the R system to analyze the simulation output.

11.2 An R Interface to WinBUGS

Before you can use this R/WinBUGS interface, some setup needs to be done. The WinBUGS and OpenBUGS programs should be downloaded and installed on your Windows system. Also, special packages including R2WinBUGS and BRugs need to be downloaded and installed on your R system. This setup procedure likely will be modified over time; you should consult with the WinBUGS home page (http://www.mrc-bsu.cam.ac.uk/bugs/) for the most recent information.

Once the setup is completed, it is easy to define a Bayesian problem for WinBUGS by use of this R interface. There are four necessary inputs that are similar to the inputs required within the WinBUGS program:

• Model. One describes the statistical model by means of a "model" file that describes the model in the BUGS language.

• Data. One inputs data directly into R in the form of constants, vectors, matrices and model parameters.

• Parameters. Within R, one specifies the parameters to be monitored in the simulation run.

• Initial values. One specifies initial values of the parameters in the R console.

Suppose the model is defined in the file model. bug and the data, parameters, and initial values are defined in R in the respective variables data, parameters and inits. Then one simulates from the Bayesian model by the R command bugs:

When this command is executed, the model information is sent to the WinBUGS program. The WinBUGS program will run in the background, simulating parameters from the model. At the completion of the simulation, WinBUGS will close, and one is returned to the R console. The output of bugs is a structure containing the output from the WinBUGS run. Specifically, from the object model.sim, one can access the matrix of simulated draws of the monitored parameters.

One controls different aspects of the simulation by use of optional arguments to the function bugs. A more general form of bugs including optional arguments is given here:

• n.chains contains the number of Markov chains that are run. By default, three parallel chains will be run; if one wishes only to simulate one chain, the argument n. chains = 1 should be used.

• n.iter is the number of total iterations for each chain.

• n.burnin is the number of iterations to discard at the beginning. Typically, one will discard a specific number of the initial draws and base inference on the remaining output. By default, the first half of the iterations are removed; that is, n.burnin = n.iter/2.

• n.thin is the thinning rate. If n.thin=1, every iterate will be collected; if n. thin = 2, every other iterate will be collected, and so on. By default, the thinning rate is set so that 1000 iterations will be collected for each chain.

• bin is the number of iterations between savings of results; the default is only to save at the end.

11.3 MCMC Diagnostics Using the boa Package

Once the MCMC chain has been run and simulated samples from the algorithm have been stored, then the user needs to perform some diagnostics on the simulations to determine if they approximately represent the posterior distribution of interest. Some diagnostic questions include the following:

1. How many chains should be run in the simulation? Does the choice of starting value in the chain make a difference?

2. How long is the burn in time before the simulated draws approximately represent a sample from the posterior distribution?

3. How many simulated draws should be collected to get accurate approximations at summaries of the posterior?

4. What is the simulation standard error of a particular summary of the posterior distribution?

5. Are there high correlations between successive simulated draws?

The boa (Bayesian Output Analysis) package, written by Brian Smith, is based on the Convergence Diagnosis and Output Analysis Software for Gibbs sampling output (CODA) developed by Best, Cowles, and Vines. This package provides a variety of diagnostic functions useful for MCMC output. In particular, the boa package

• provides various summary statistics such as means, standard deviations, quantiles, highest probability density intervals, and simulation standard errors for correlated output based on batch means

• allows one to compare autocorrelations and cross-correlations of simulated samples from different parameters

• computes various convergence diagnostics, such as those proposed by Geweke, Gelman and Rubin, and Raftery and Lewis

• provides a variety of different plots, such as lag correlations, density estimates, and running means

It is convenient to use the boa package after the bugs function is used to perform the MCMC sampling in WinBUGS. One can access the boa functions by the menu option boa.menuO. When the menu appears, one selects Import Data > Data Matrix Object to load vectors or matrices of simulated parameters into the package. Many of the MCMC diagnostics can then be performed by using the Analysis and Plot items on the main menu.

11.4 A Change-Point Model

We begin with an analysis of counts of British coal mining disasters described in Carlin et al (1992). The number of disasters is recorded for each year from 1851 to 1962; we let yt denote the number of disasters in year t, where t = actual year - 1850. Looking at the data, it appears that the rate of accidents decreased in some year during the end of the 19th century. We assume for the early years, say when t < T, yt has a Poisson distribution where the logarithm of the mean log /t = /3o, and for the later years (t > T) log /-It = /30 + X31. We represent this as

where S() is defined to be 1 if its argument is nonnegative, and 0 otherwise. The unknown parameters are the regression parameters X30, /31, and the changepoint parameter T. We complete the model by assigning vague uniform priors to )0 and X31 and assigning T a uniform prior on the interval (1, N), where N is the number of years.

The first step in using WinBUGS is to write a short script defining the model in the BUGS language. The description of the change-point model is displayed next. Note that the observation for a particular year is denoted by D[year] and the corresponding mean as mu [year] . The parameters are b [1] ,b [21, and the change-point parameter T is called changeyear. Note that the syntax is similar to that used in R with some exceptions. The syntax

indicates that D [year] is distributed Poisson with mean mu [year]. Similarly, the code

indicates that ~3j is assigned a normal prior distribution with mean 0 and a precision (reciprocal of the variance) equal to 0.000001. In WinBUGS, one must assign proper distributions to all parameters, and this normal density approximates the improper uniform prior density. Also

indicates that T has a continuous uniform prior density on the interval (1, N). The operator <indicates an assignment to a variable; for example, the syntax

After the model has been defined, we enter the data directly into the R console. The R constant N is the number of years and D is the vector of observed counts. The variable data is a list containing the names of the variables N and D that are sent to WinBUGS.

Next we indicate by the parameters line

that we wish to monitor the simulated samples of the change-point parameter T and the regression vector /3.

Last, we indicate by the line

that the starting value for the parameter (/31i/32) is (0, 0), and the starting value of T is 50.

Now that the problem has been set up, the function bugs is used to run WinBUGS.

The output of bugs is the simulation object coalmining.sim.

To obtain some basic information about the simulated draws, one can apply the print function on the simulation object coalmining. sim. The output explains that three chains were used, each with 1000 iterations, and the first 500 iterations (the burn-in) were discarded in each chain. Summary statistics for each parameter are given for the total 1500 iterations that were saved.

We can construct and display a density estimate of the simulated sample of T by the plot (densityO) command (see Fig. 11.1). This density has an interesting bimodal shape; this indicates that there is support for a changepoint near 37 and 40 years past 1850.

Similarly, we can construct density estimates for the simulated draws of X31 and N2. It is clear from Fig. 11.2 that 02 < 0 which indicates a drop in the rate of coal mining facilities beyond the change-point year.

11.5 A Robust Regression Model

As a second illustration of the R/WinBUGS interface, we consider the fitting of a robust simple regression model. One is interested in the relationship between the vote count in the 1996 and 2000 presidential elections in the state of Florida. For each of 67 counties in Florida, one records the voter count for Pat Buchanan, the Reform party candidate in 2000, and the voter count for Ross Perot, the Reform party candidate in 1996. Fig. 11.3 plots the square root of the Buchanan vote against the square root of the Perot count. One notices a linear relationship with one distinctive outlier. This outlier is due to an unusual high vote count for Buchanan in Palm Beach County due to a butterfly ballot design used in that county.

Fig. 11.1. Density estimate of parameter v for the change-point problem.

Let yz and xi denote the square root of the voter count in the ith county for Buchanan and Perot, respectively. Ftom our preliminary analysis, a linear regression assuming normal errors seems inappropriate. Instead, we assume that yl, ..., yn, follow the regression model

where El,..., c,, are a random sample from a t distribution with mean 0, scale parameter or and v = 4 degrees of freedom. As in Section 10.2, we can represent this model as the following scale mixture of normal distributions:

To complete the model, we assign 30 and /31 uniform priors and let the precision T have the standard noninformative prior proportional to 1/T.

This model is described by means of the following model script in WinBUGS. The observations are y Ell , ... , y [N] ; the observation means are mu [1] , ... , mu EN]; and the observation precisions are p [1] , ... , p EN]. The ith precision, p [i] is defined by tau*lam [i] , where the scale parameter lam [i] is assigned a gamma(2, 2) distribution. One cannot formally assign improper priors to parameters, but we approximate a uniform prior for b [1] by assigning it a normal prior with mean 0 and small precision value .001. In a similar fashion, we assign the precision parameter tau a gamma prior with shape and scale parameters each set to the small value of .001. This script is saved as the file robust. bug.

Fig. 11.2. Density estimates of parameters /3i and /32 for the change-point problem.

Next we define the data in R. The Florida voter data for the 1996 and 2000 elections is stored in the dataset election in the package LearnBayes.

Fig. 11.3. Scatterplot of Buchanan and Perot voter counts in Florida in the 1996 and 2000 presidential elections.

The variables buchanan and perot contain, respectively, the Buchanan and Perot vote totals. There are three quantities to define, the number of paired observations N, the vector of responses y, and the vector of covariates x. Recall that we applied an initial square root reexpression of both 1996 and 2000 vote totals.

The final two inputs are the selection of initial values for the parameters and the decision on what parameters to monitor in the simulation run. In the command

we indicate that the starting values for the regression parameters are 0 and 0, and the starting value of the precision parameter T is 1. We last indicate through the parameters statement that we wish to monitor -r, the vector of values {Ail, and the regression vector ~.

We are ready to use WinBUGS to simulate from the model by the bugs function.

Suppose we are interested in estimating the mean Buchanan (root) count E(ylx) for a range of values of the Perot (root) count x. In the R code, we first create a sequence of x values in the variable xo and store the corresponding design matrix in the variable X0. By multiplying this matrix by the matrix of simulated draws of the regression vector b, we get a simulated sample from the posterior of E(ylx) for all values of x in xo. We summarize the matrix of posterior distributions meanresponse with the 5th, 50th, and 95th percentiles and plot these values as lines in Fig. 11.4. Note that this robust fit is relatively unaffected by the one outlier with an unusually large value of y.

11.6 Estimating Career Trajectories

A professional athlete's performance level will tend to increase until the middle of his or her career and then deteriorate until retirement. For a baseball player, suppose one records the number of home runs yj out of the number of balls that are put into play nj (formally, the number of balls put in play is equal to the number of "at-bats" minus the number of strikeouts) for the jth year of his career. One is interested in the pattern of the home run rate yj/nj as a function of the player's age xj. Fig. 11.5 displays a graph of home run rate against age for the great slugger Mickey Mantle.

To understand a player's career trajectory, we fit a model. Suppose yj is binomial(nj,pj), where pj is the probability of a home run during the jth season. We assume the probabilities follow the logistic quadratic model

Fig. 11.4. Scatterplot of Buchanan and Perot voter counts. The solid line represents the median of the posterior distribution of the expected response and the dashed lines correspond to the 5th and 95th percentiles of the distribution.

Fig. 11.5 displays the fitted probabilities for Mickey Mantle using the glm function.

In studying a player's career performance, one may be interested in the player's peak ability and the age where he achieved this peak ability. From the quadratic model, if N2 < 0, then the probability is maximized at the value

and the peak value of the probability (on the logit scale) is

Fig. 11.5. Career trajectory and fitted probabilities for Mickey Mantle's home run rates.

Although fitting this model is informative about a player's career trajectory, it has some limitations. Since a player only plays for 15-20 years and there is sizable binomial variation, it can be difficult to get precise estimates at a player's peak age and his peak ability. But there are many players in baseball history who display similar career trajectories. It would seem that one could obtain improved estimates at players' career trajectories by combining data from players with similar abilities.

One can get improved estimates by fitting an exchangeable model. Suppose we have k similar players; for player i we record the number of home runs yid, number of balls put in-play nzj, and the age xzj for the seasons j = 1, ...TT. We assume that the associated probabilities {pjj} satisfy the logistic model

Let /3z = (~3iO, /3i1, A2) denote the regression coefficient vector for the ith player. To represent the belief in exchangeability, we assume that 01, ..., A are a random sample from a common multivariate normal prior with mean vector µ,3 and variance-covariance matrix V:

At the second stage of the prior, we assign vague priors to the hyperparameters.

where inverse Wishart(S-1, v) denotes the inverse Wishart distribution with scale matrix S and degrees of freedom v. In WinBUGS, information about a variance-covariance matrix is represented by means of a Wishart(S, v) distribution placed on the precision matrix P:

Data are available for 10 great home run hitters in baseball history in the dataset sluggerdata in the package LearnBayes. This dataset contains batting statistics for these players for all seasons of their careers. The R function careertraj.setup is used to extract the matrices from sluggerdata that will be used in the WinBUGS program.

The variable N is the number of players and the vector T contains the number of seasons for each player. The matrix y has 10 rows and 23 columns where the ith row in y represents the number of home runs of the ith player for the years of his career. Similarly, the matrix n contains the number of balls put in play for all players and the matrix x contains the ages of the players for all seasons.

A listing of the file career.bug describing the model in the WinBUGS language is shown next. The variable beta is a matrix where the ith row corresponds to the regression vector for the ith player. The syntax

indicates that the i row of beta is assigned a multivariate normal prior with mean vector mu.beta and precision matrix R. The syntax

gives the logistic model for the home run probabilities in the matrix p. Finally, the syntax

assigns the second-stage priors. The mean vector mu.beta is assigned a multivariate normal prior with mean mean and precision matrix prec: the precision matrix R is assigned a Wishart distribution with scale matrix Omega and degrees of freedom 3.

The dataset variables N, T, y, n, and x have already been defined in R with help of the careertraj.setup function. One defines the hyperparameter values at the last stage of the prior.

Next one gives initial estimates for ~3, µo, and R. The estimate of /.3z is found by fitting a logistic model to the pooled dataset for all players and µ,Q is also set to be this value. The precision matrix R is initially given a diagonal form with small values.

We then indicate in the data line the list of variables, the inits function specifies the initial values and the parameter line indicates that we will monitor only the matrix beta. We run the MCMC simulation by the bugs command.

Since we saved the output in the variable career.sim, the simulated draws of /3 are contained in the component career.sims$sims.list$beta. This is a three-dimensional array, where beta[, i ,1] contains the simulated draws of ~3io, beta E, i, 2] contains the simulated draws of /3i1, and beta E, i, 3] contains the simulated draws of A2. Suppose we focus on the estimates of the peak age for each player. In the following R code, we create a new matrix to hold the simulated draws of the peak age and then compute the functions in a loop.

We illustrate the use of the boa package to perform output analysis and summarize the samples of peak age parameters. We invoke the boa menu system by typing

To read in the matrix peak. age into the package, we choose the menu option File -> Import Data -> Data Matrix Object and entered the object name peak.age. To obtain trace plots for each parameter, we return to the main menu and choose the menu option Plot -> Descriptive -> Trace. Fig. 11.6 displays the trace plots that are produced for the peak age parameters for the first six players. In a similar fashion, one can produce alternative graphs such as autocorrelations or running means. Density plots for the parameters can be obtained by the menu selection Plot -> Descriptive -> Density. Fig. 11.7 displays density estimates of the peak ages for the same six players. Finally, to compute 95% interval estimates of each parameter, we use the menu option Analysis -> Descriptive Statistics -> Highest Probability Density Intervals and the following output is produced:

We see that baseball players generally peak in home run hitting ability in their early 30s, although there are some exceptions.

Fig. 11.6. 'Dace plots of the peak age parameters for six of the baseball players.

11.7 Further Reading

Cowles (2004) gives a general review and evaluation of WinBUGS. A tutorial on computing Bayesian analyses via WinBUGS is provided by George Woodworth in the complement of chapter 6 of Press (2003). General information about WinBUGS including the program code for many examples can be found in the WinBUGS user manual Spiegelhalter et al (2003). Congdon (2003, 2005, 2007) describes a wide variety of Bayesian inference problems that can be fit using WinBUGS. Cowles and Carlin (1996) give an overview of diagnostics for MCMC output. Sturtz et al (2005) give a general description of the R2WinBUGS package including examples demonstrating the use of the package. Smith (2004) describes the use of the package BOA in implementing MCMC output analysis on R.

Fig. 11.7. Density estimates of the peak age parameters for six of the baseball players.

11.8 Exercises

1. Estimation of a proportion with a discrete prior

In Chapter 2, we considered the situation where one observes y - binomial(n, p) and the proportion p is assigned a discrete prior. Suppose the possible values of p are .05, .15, ..., .95, with respective prior probabilities .0625, .125, .25, .25, .125, .0625, .03125, .03125, .03125, .03125. Place the values of p in a vector p and the probabilities in the vector prior. As in the example of Chapter 2, set y = 11 and n = 27. Define data, inits, and parameters as follows:

Save the following script in a file "proportion. bug".

Use the R interface to simulate 1000 draws from the posterior distribution of p. Compute the posterior probability that p is larger than .5.

2. Fitting a binomial/beta exchangeable model

In Chapter 5, we considered the problem of simultaneously estimating the rates of death from stomach cancer for males at risk for cities in Missouri. Assume the number of cancer deaths yj for a given city is binomial with sample size nj and probability of success pj. To model the belief that the {pj} are exchangeable, we assume that they are a random sample from a beta(a, ~) distribution. The beta parameters a and ~ are assumed independent from gamma(.11,.11) distributions. The WinBUGS model file is shown here. Note that the variable betamean is the prior mean of pj and K1 is the prior precision.

Use the R interface to simulate from the joint posterior distribution of ({pj}, a, /3). Summarize each probability pj and the prior mean aAa+~3) and prior precision K = a + ~3 by 90% interval estimates.

3. Smoothing multinomial counts

Consider the observed multinomial frequencies (14, 20, 20, 13, 14, 10, 18, 15, 11, 16, 16, 24). Using a GLIM formulation for these data, suppose that the counts {yi} are independent Poisson with means {µi}. The multinomial proportion parameters are defined by Bi=pilY:jyj. Suppose one believes that the {9j} are similar in size. To model this belief, assume that {Bi} has a symmetric Dirichlet distribution of the form

The hyperparameter k has a prior density proportional to (1 + k)_2 that is equivalent to log k distributed according to a standard logistic distribution. The WinBUGS model description is shown here:

Using the R interface, simulate from the posterior distribution of {9j} and K. Summarize each parameter by a posterior mean and standard deviation.

4. A gamma regression model

Congdon (2007) gives a Bayesian analysis of an example from McCullagh and Nelder (1989) modeling the effects of three nutrients on coastal Bermuda grass. The design was a 4 x 4 x 4 factorial experiment defined by replications involving the nutrients nitrogen (N), phosphorus (P), and potassium (K). The response yj is the yield of grass in tons/acre. We assume yz is gamma with shape v and scale parameter v€z where the mean ri satisfies

In Congdon's formulation, al, a2i and a3 (background nutrient levels) are assigned independent normal priors with respective means 40, 22, and 32 and variance 100. Noninformative priors were assigned to /3o, v and the growth effect parameters /31, 02, and 03, except that the growth effects are assumed to be positive.

The WinBUGS model description is shown here. The LearnBayes data file bermuda.grass contains the data; the factor levels are stored in the variables Nit, Phos, and Pot and the response values are stored in the variable y. Also one needs to define the sample size variable n = 64 and the nutrient value vectors N= 0, 100, 200, and 400, P= 0, 22, 44, and 88, and K =0, 42, 84, and 168.

Use WinBUGS and the R interface to simulate 10,000 iterations from this model. Compute 90% interval estimates for all parameters.

5. A non-linear hierarchical growth curve model

The BUGS manual presents an analysis of data originally presented in Draper and Smith (1998). The response yip is the trunk circumference recorded at time xj = 1, ..., 7 for each of i = 1, ..., 5 orange trees; the data are displayed in Table 11.1. One assumes yip is normally distributed with mean r1ij and variance o 2 where the means satisfy the nonlinear growth model

Let Oi = (Oil, Oi2, Oi3) represent the vector of growth parameters for the ith tree. To reflect a prior belief in similarity in the growth patterns of the five trees, one assumes that {Bi, i = 1, ..., 5} are a random sample from a multivariate normal distribution with mean vector µ and variancecovariance matrix Q. At the final stage of the prior, one assumes Q-1 is Wishart with parameters R and 3, and assumes µ is multivariate normal with mean vector to and variance-covariance matrix M. In this example, one assumes R is a diagonal matrix with diagonal elements .1, .1, and .1, µo is the zero vector, and M-1 is the diagonal matrix with diagonal elements 1.0E-.6, 1.0E-6, and 1.0E-6.

The WinBUGS model description is shown here:

Use WinBUGS and the R interface to simulate 10,000 iterations from this model. Compute 90% interval estimates for all parameters.

 

Agresti, A., and Franklin, C. (2005), Statistics: The Art and Science of Learning from Data, Prentice-Hall.

Albert, J. (1992), "A Bayesian analysis of a Poisson random effects model for home run hitters," The American Statistician, 46, 246-253.

Albert, J. (1994), "A Bayesian approach to estimation of GPA's of University of Iowa freshmen under order restrictions," Journal of Educational Statistics, 19, 1-22.

Albert, J. (1996), Bayesian Computation Using Minitab, Belmont, CA: Duxbury Press.

Albert, J., and Chib, S. (1993), "Bayesian analysis of binary and polychotomous response data," Journal of the American Statistical Association, 88, 669-679.

Albert, J., and Gupta, A. (1981), "Mixtures of Dirichlet distributions and estimation in contingency tables," Annals of Statistics, 10, 1261-1268.

Albert, J., and Rossman, A. (2001), Workshop Statistics: Discovery with Data, a Bayesian Approach, Emeryville, CA: Key College.

Antleman, G. (1996), Elementary Bayesian Statistics, Cheltenham: Edward Elgar Publishing.

Berger, J. (1985), Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag.

Berger, J. (2000), "Bayesian analysis: A look at today and thoughts of tomorrow," Journal of the American Statistical Association, 95, 1269-1276.

Berger, J., and Sellke, T. (1987), "Testing a point null hypothesis: The irreconcilability of p values and evidence," Journal of the American Statistical Association, 397, 112-122.

Berry, D. (1996), Statistics: A Bayesian Perspective, Belmont, CA: Duxbury Press.

Bliss, C. (1935), "The calculation of the dosage-mortality curve," Annals of Applied Biology, 22, 134-167.

Bolstad, W. (2004), Introduction to Bayesian Statistics, Hoboken, NJ: John Wiley.

Box, G. (1980), "Sampling and Bayes' inference in scientific modelling and robustness (with discussion)," Journal of the Royal Statistical Society, Series A, 143, 383-430.

Carlin, B., Gelfand, A. and Smith, A. (1992), "Hierarchical Bayesian analysis of changepoint problems," Applied Statistics, 41, 389-405.

Carlin, B., and Louis, T. (2000), Bayes and Empirical Bayes Methods for Data Analysis, Boca Rotan: Chapman and Hall.

Casella, G., and Berger, R. (1987), "Testing a point null hypothesis: The irreconcilability of p values and evidence," Journal of the American Statistical Association, 397, 106-111.

Casella, G., and George, E. (1992), "Explaining the Gibbs sampler," The American Statistician, 46, 167-174.

Chaloner, K., and Brant, R. (1988), "A Bayesian approach to outlier detection and residual analysis," Biometrika, 75, 651-659.

Chib, S., and Greenberg, E. (1995), "Understanding the Metropolis-Hastings algorithm," The American Statistician, 49, 327-335.

Christiansen, C., and Morris, C. (1995), "Fitting and checking a two-level Poisson model: modeling patient mortality rates in heart transplant patients," in Berry, D. and Stangl, D, eds, Bayesian Biostatistics, New York: Marcel Dekker.

Collett, D. (1994), Modelling Survival Data in Medical Research, London: Chapman and Hall.

Congdon, P. (2007), Bayesian Statistical Modelling, second edition, Chichester: John Wiley.

Congdon, P. (2004), Applied Bayesian Modelling, Chichester: John Wiley.

Congdon, P. (2005), Bayesian Models for Categorical Data, Chichester: John Wiley.

Cowles, K. (2004), "Review of WinBUGS 1.4," The American Statistician, 58, 330-336.

Cowles, K., and Carlin, B. (1996), "Markov chain Monte Carlo convergence diagnostics: a comparative review," Journal of the American Statistical Association, 91, 883-904.

Dobson, A. (2001), An Introduction to Generalized Linear Models, New York: Chapman and Hall.

Draper, N., and Smith, H. (1998), Applied Regression Analysis, New York: John Wiley.

Edmonson, J., Fleming, T., Decker, D., Malkasian, G., Jorgensen, E., Jefferies, J., Webb, M, and Kvols, L. (1979), "Different chemotherapeutic sensitivities and host factors affecting prognosis in advanced ovarian carcinoma versus minimal residual disease," Cancer Treatment Reports, 63, 241-247.

Fisher, R. (1960), Statistical Methods for Research Workers, Edinburgh: Oliver & Boyd.

Gelfand, A., and Smith, A. (1990), "Sampling-based approaches to calculating marginal densities," Journal of the American Statistical Association, 85, 398-409.

Gelfand, A., Hills, S., Racine-Poon, A., and Smith, A. (1990), "Illustration of Bayesian inference in normal data models using Gibbs sampling," Journal of the American Statistical Association, 85, 972-985.

Gelman, A., Carlin, J., Stern, H. and Rubin, D. (2003), Bayesian Data Analysis, New York: Chapman and Hall.

Gelman, A., Meng, X. and Stern, H. (1996), "Posterior predictive assessment of model fitness via realized discrepancies," Statistics Sinica, 6, 733-807.

Gentle, J. (2002), Elements of Computational Statistics, New York: Springer.

Gilchrist, W. (1984), Statistical Modeling, Chichester: John Wiley and Sons.

Gill, J. (2002), Bayesian Methods,, New York: Chapman and Hall.

Givens, G., and Hoeting, J. (2005), Computational Statistics, Hoboken, NJ: John Wiley.

Grayson, D. (1990), "Donner party deaths: a demographic assessment," Journal of Anthropological Assessment, 46, 223-242.

Gunel, E., and Dickey, J. M. (1974), "Bayes factors for independence in contingency tables," Biometrika, 61, 545-557.

Haberman, S. (1978), Analysis of Qualitative Data: Introductory topics (Vol. 1), New York: Academic Press.

Hartley, H. O. (1958), "Maximum likelihood estimation from incomplete data," Biometrics, 14, 174-194.

Howard, J. (1998), "The 2 x 2 table: a discussion from a Bayesian viewpoint," Statistical Science, 13, 351-367.

Kass, R., and Raftery, A. (1995), "Bayes factors," Journal of the American Statistical Association, 90, 773-795.

Kemeny, J., and Snell, J. (1976), Finite Markov Chains," New York: SpringerVerlag.

Lee, P. (2004), Bayesian Statistics: An Introduction, New York: Oxford University Press.

Martz, H., and Waller, R. (1982), Bayesian Reliability Analysis, New York: John Wiley.

McCullagh, P., and Nelder, J. (1989), Generalized Linear Models, New York: Chapman and Hall.

Monahan, J. (2001), Numerical Methods of Statistics, Cambridge: Cambridge University Press.

Moore, D. (1995), The Basic Practice of Statistics, New York: W. H. Freeman.

Pearson, E. (1947), "The choice of statistical tests illustrated in the interpretation of data classed in a 2 x 2 table," Biometrika, 34, 139-167.

Pimm, S., Jones, H., and Diamond, J. (1988), "On the risk of extinction," American Naturalist, 132, 757-785.

Press, J. (2003), Subjective and Objective Bayesian Statistics, Hoboken, NJ: John Wiley.

Ramsey F., and Schafer, D. (1997), The Statistical Sleuth, Belmont CA: Duxbury Press.

Rao, C. R. (2002), Linear Statistical Inference and Applications, New York: John Wiley & Sons.

Robertson, T., Wright, F. and Dykstra, R. (1988), Order Restricted Statistical Inference, London: John Wiley.

Robert, C., and Casella, G. (2004), Monte Carlo Statistical Methods, New York: Springer.

Smith, B. (2004), "boa: Bayesian output analysis program (BOA) for MCMC," R package version 1.1.2-1, URL http://www.public-health.uiowa.edu/boa.

Smith, A., and Gelfand, A. (1992), "Bayesian statistics without tears: a sampling-resampling perspective," The American Statistician, 46, 84-88.

Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2003), WinBUGS 1.4 Manual.

Sturtz, S., Ligges, U., and Gelman, A. (2005), "R2WinBUGS: A package for running WinBUGS from R," Journal of Statistical Software, 12, 1-16.

Tanner, M. (1996), Tools for Statistical Inference, New York: Springer-Verlag.

Tsutakawa, R., Shoop, G., and Marienfeld, C. (1985), "Empirical Bayes Estimation of Cancer Mortality Rates," Statistics in Medicine, 4, 201-212.

Turnbull, B., Brown, B. and Hu, M. (1974), "Survivorship analysis of heart transplant data," Journal of the American Statistical Association, 69, 74-80.

Verzani, J. (2004), Using R for Introductory Statistics, Boca Raton: Chapman and Hall.

Wasserman, L., and Verdinelli, I. (1991), "Bayesian analysis of outlier models using the Gibbs sampler," Statistics and Computing, 1, 105-117.

Weiss, N. (2001), Elementary Statistics, Boston: Addison-Wesley.