MAT4374 Computational Statistics 

(Summer 2008)

Course Synopsis following the Textbook: EM Optimization Methods , Simulation and Monte Carlo Methods, Markov Chain Monte Carlo , Bootstrapping, Permutation Tests

Professor:  David McDonald

Office Hours: Wednesday, 10:00 - 11:00, or by appointment.


Midterm:  Scheduled TBA

Tuesday,  16:00 - 18:00              CBE 202 (The Cube)

Thursday, 16:00 - 18:00             CBE 202 (The Cube)

Courses start May 1 and end July 31


Textbook: Givens, G.H., Hoeting, J.A. (2005). Computational Statistics. John Wiley & Sons, New Jersey.  

Check the website for this book to get datasets, some R code and information.

Computing: The computer package S plus will be used.  It is available in the Cube. You can buy a student copy for $20 at MRC133.


Chapter 6: Simulation 

Chapter 7: Markov Chain Monte Carlo

Chapter 8: More MCMC

Chapter 9: The Bootstrap

Grades: There will be a small number of homeworks on background material (Markov chains and so forth) worth 10% plus a term test on

basic material (Markov chains and so forth) 25% (on June 26th). There won't be a final exam  but each student must make a presentation. 

The presentation will be on one of the following:

1) Antithetic sampling 6.3.2 - May 27

2) Control variates 6.3.3 - June 3

3) Rao-Blackwellization 6.3.4 June 10

4) Importance Sampling 6.2 - simulation of random variables June 17

5) Importance Sampling 6.3 -network simulation July 1

6) Perfect sampling July 8

7) Perfect sampling II July 15

The presenter will consult with the professor at least two weeks before the presentation. The student will present the material in class together with a written report in pdf format which will attached to this page one week before the presentation.  The student will illustrate the algorithm by solving a practical problem. The solution must be implemented in S+ (from the web if possible). The software and instructions are part of the report.  

The student will also prepare 15 minute written quiz  everybody will write after the presentation.  The presenter will grade the quiz from 1 to 10  and return the grades to the professor.
The presenter will also prepare a homework problem to be solved using the algorithm and software presented. Each student will turn this in for grading one week later. The presenter will provide a solution to the homework problem. The professor will post the solution to the quiz and the homework and grade each homework from 1 to 10.  The quiz grades will count for 5%. The presentation homework grades will count for 15%. The quiz and the homework problem should be cleared with the professor two weeks before the presentation.

Along with the homework, each student will submit a one page written critique on the presentation  giving a grade from 1 to 100 based on the following:
1) clarity of the presentation
2) ease of use of the software provided
3) how much the presentation helped in solving the homework problem
4) how much was learned.

The professor will use these grades and the results of the quiz and homework plus his own assessment to assign a grade on 40% to the presentation.  
The professor will also assign a grade from 1 to 10 to each critique based on insightful comments and the concordance of the critique with the professor's assessment
and that of the other students; i.e. a decisive, accurate critique is better than a limip,  vague critique. The total grade for the critiques will be 5%.

 May 1 - We start with the basic concepts of random number generation and simulation.

HW1 due May 15 :  a)  Solution We have just received a secret message from a friend.  Each of us have a memory stick with 100 gigabytes of random data.  We use one megabyte of data per day as a one time pad; i.e. we never reuse the same random numbers. The start of todays one time pad is 646F7070  and we have received the data  0C0A1C00 all written in hexadecimal.  What is the message in ASCII? We know our friend has taken the original ASCII word and converted it to hex.  Then he xored the hex with 646F7070. We can get the message back by xoring the coded messge with 646F7070. What is the message?

Note that this code is totally unbreakable if the one time pad was created from truly random numbers because a one becomes a zero or a one with probability 1/2. Note that we could send a megabyte a day for the next 100 years without getting a new pad.

b) A circuit board has 4 components   A  B arranged as an array on the board.

                                                          C  D

The circuit board is 5 cm by 5 cm.  The specifications for  all components is 2cm wide and long but there is a standard deviation of  .3 centimeters in each dimension.  The distribution of  the length and width is normal. If the components don't fit onto the board then the board is defective.  What proportion of the boards are defective? Solve by a simulation in S+.  

Generate a large number of boards.  For each board check if components overlap using the data -> transform ifelse function.  For instance if I create a column AW - widths for component A and a column BW - widths for component B then the expression  ifelse(AW+BW<5,0,1) will encode a zero if the widths of A and B fit across the board.  Create columns which check the different dimensions and then create a column of zeros and ones for all those that have a dimension that doesn't fit. The proportion that don't fit is your estimate for the proportion of defectives.

c)  Use the rejection method to generate random variables with density  (5/12)(1+(x-1)^4) on [0,2].  Try to be efficient.  Write the code in Splus.

 HW2  Due May 22  Solution

a) Redo c) from HW1 a bit better using squeeze sampling.

b)   Consider a probability transition kernel $K$ on three states: 0,1,2.
Let the initial distribution be $\alpha=(.2,.1,.7)$ and let the
probability transition kernel $K$ be
.5& .3 & .2 \\
.4& .2 & .4 \\
.3& .5 &.2
a) Compute $P(X_1=2|X_0=1)$.\\
b) Compute $P(X_{21}=2|X_{20}=1)$.\\
c) Compute $P(X_3=0,X_{5}=2,X_{6}=1|X_0=1)$.\\
d) Compute $EX_0$.\\
e) Compute $E(X_0|X_0=1)$.\\
f) Compute $P(X_1=1)$.\\
g) Compute $EX_1$.\\
h) Compute $E(X_1|X_0=1)$.\\
i) Compute $EX_1^2$.\\
j) Compute $E(X_1^2|X_0=1)$.\\
k) Compute $Var(X_1|X_0=1)$.\\
l) Compute $P(X_0=1|X_1=2)$.\\
m) Calculate the stationary probability measure associated with $K$.\\
n) Calculate the mean number of transitions to return to $0$.

Review Markov chains and generate chains with Splus.  For instance the code below generates the chain analyzed in class

b1 <- 1/3
b2 <- 1/2
uni <- runif(1000,0,1)
x <- rep (0,1000)
x[1]<- 0
for (i in 2:1000){
if(x[i-1]=0 && {uni[i-1] <b1}){x[i] <- 0 }
if(x[i-1]=0 && uni[i-1] >b1){x[i] <- 1 }
if(x[i-1]=1 && uni[i-1] <b2){x[i] <- 0 }
if(x[i-1]=1 && uni[i-1] >b2){x[i] < -1 }

 Week three:  We started MCMC.  The next homework is due June 6th.  The task is to write the Splus code to implement the baseball example given in class. Please review the theory on the baseball example is given in the  attached pdf
More details on the problem set will be given after we discuss the Gibbs sampler Here is a  link to baseball data I want you to assume there is a probability p_i  of hitting a home run which remains constant through the preseason and the season.  We assume the p_i's are distributed like a beta distribution with parameters a and b..

Of course we can do a histogram of the 17 p_i's but that is surely compatible with some choice of the parameters a and b so that doesn't prove anything.  If we had the data for many years then maybe we could  justify this assumption betters. In any case all we can do is put a hyperprior (1/1000)^2exp(-(a+b)/1000) on the parameters a and b.  This is a noninformative prior that doesn't assume much about a and b.  We want to simulate the distribution of the p_i's and more specifically AB * p_i in order to predict the home run production of each batter.  Does the derived distribution match the Bayesian analysis copied from the book by Young and Smith.  If it does then we have a lot more faith in the Bayesian analysis of this data.

We are starting the bootstrap.  See the page
and the book

Here is the Splus program for the presentation on antithetic sampling:
X1 <- rep(0,50000)
X2 <- rep(0,50000)
ans <- rep(0,100000)
i <- 1

h <- function(x){(3*x+17)/2^x}

while(i <= 50000)
        x <- rnorm(1,0,1)
        ans[i] <- h(x)
        X1[i] <- h(x)
        ans[i+50000] <- h(-x)
        X2[i] <- h(-x)
        i <- i+1

Here is Mr. Mei's  Splus program for the presentation on importance sampling.
Here is Mr. Tong's Splus program from the presentation on importance sampling.

For the perfect sampling project:
The steady state given for the matrix

K=     .5  .3  .2
          .4   .2  .4
          .3   .5  .2
was WRONG.  It should have been (.4151,3108,.2641)
Below is the code for perfect sampling .  Mr. Mei made some corrections and it now seems correct:

hold<- matrix(0,1000,1)


if({{state==1} && {0 <= pass}} && {pass<=0.5}){return(1); stop}
if({{state==1} && {0.5 < pass}} && {pass<=0.8}){return(2); stop}
if({{state==1} && {0.8< pass}} && {pass<=1}){return(3); stop}
if({{state==2} && {0<= pass}} && {pass<=0.4}){return(1); stop}
if({{state==2} && {0.4< pass}} && {pass<=0.6}){return(2); stop}
if({{state==2} && {0.6< pass}} && {pass<=1}){return(3); stop}
if({{state==3} && {0<=pass}} && {pass<=0.3}) {return(1); stop}
if({{state==3} && {0.3< pass}} && {pass<=0.8}){return(2); stop}
if({{state==3} && {0.8< pass}} && {pass<=1}){return(3); stop}

    tm <- 1
    while(X[1,1]!=X[1,2]|| X[1,1]!=X[1,3] || X[1,2]!=X[1,3])
                hold[tm]<- runif(1,0,1)
                tm<- tm+1
                X[tm,1] <-1
                X[tm,2] <-2
                X[tm,3] <-3

                for(i in (1:tm-1))

                        X[tm-i,1]<- my(X[tm-i+1,1],hold[tm-i])
                        X[tm-i,2]<- my(X[tm-i+1,2],hold[tm-i])
                        X[tm-i,3]<- my(X[tm-i+1,3],hold[tm-i])


guiPlot( PlotType = "Histogram", DataSet = "res", AxisType = "Linear")