Introduction to R

 


 


Table of Contents
Downloading and Installing R
Example 1: Working at the console
Example 2: Binomial and Normal distributions
Example 3: Descriptive Statistics
Example 4: R Editor window
Example 5: Importing data from a file
Example 6: A few graphs: histogram and boxplot.
Example 7: Side-by-side boxplots
Example 8: Operations on numerical vectors
Example 9: Assessing Normality (histogram)
Example 10: Assessing Normality (quantile-quantile plot)
Example 11: One sample t-test
Example 12: Paired t-test
Example 13: Unpaired t-test
Example 14: Contingency Tables
Example 15: Correlation and Simple Linear Regression











 

 

Downloading and Installing R

Educational video:
Downloading and Installing R

Summary of the video:

How to download and install R

 

 




Getting started with R (Working at the R console)

Educational video:
Working at the R console

Summary of the video: Here are a few things you need to know to get started with R

  • The arrow (>) that appears when R is opened is the prompt to start entering commands.
  • R can be used as a calculator. Simply enter the expression that you would like to evaluate and press Enter. For example:
    > log(5)-2.5^2
    [1] -4.640562
    

    R interprets log as the natural log (ln), not the log to the base 10.

  • Numerical and categorical variables can be constructed in R as objects called vectors.

  • To display an object like a vector, simply enter the name you assigned to it.
    > subject
    [1] "Alice"  "Bob"    "Celine"
    > BP
    [1] 112 125 161
    
  • To show the list of vectors already entered, enter ls() at the prompt.
    > ls()
    [1] "BP"      "subject"
    
  • To remove an object (e.g. a vector), enter rm(name), where name is the object that you would like to remove. For example:
    > rm(subject)
    
    Note that after doing this, ls() no longer lists subject among the objects.
    > ls()
    [1] "BP"
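
The vectors subject and BP displayed above have to be created before they can be listed or removed. The following is a minimal sketch of how that might be done with the c() function, using the values shown in the output above; it also shows how to ask for a base-10 log explicitly. The assignment arrow <- is used here, but = also works at the prompt.

```r
# Natural log versus log to the base 10
log(5)             # natural log (ln)
log10(5)           # log to the base 10
log(5, base = 10)  # same value as log10(5)

# Build a categorical (character) vector and a numerical vector with c()
subject <- c("Alice", "Bob", "Celine")
BP <- c(112, 125, 161)

ls()         # both vectors now appear in the list of objects
rm(subject)  # remove one of them
ls()         # subject is no longer listed
```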
    



    Binomial and normal distributions



    Educational video:
    Binomial and the Normal Distributions

    Summary of the video:

    Note: The number sign (#) is used to enter comments in R. R ignores everything that follows the number sign on a line.

    This is an example of how a comment will be written in R:

     > # I am studying Biostatistics 
    Binomial Distributions

    In R, the binomial distribution is called binom. The prefix 'd' or 'p' is placed in front of binom to select the function that we need.

  • dbinom is the probability mass function (pmf). Its usage is dbinom(x, n, p), where x is a value in the range of the random variable, n is the number of trials and p is the probability of success.
  • Example: Compute the probability that X=20, where X has a binomial distribution with n=30 and p=0.75

    Solution: We have P(X=20)=0.0909. Here is the corresponding R output.

    > dbinom (20, 30, 0.75)
    [1] 0.09086524
    

    Suppose now that we want to compute the probability that X is between 20 and 25 (inclusively). We can compute the probability mass for each of the values 20, 21, ..., 25 as follows.

    >  dbinom (20:25, 30, 0.75)
    [1] 0.09086524 0.12980749 0.15930919 0.16623567 0.14545621 0.10472847
    

    The command 20:25 creates a vector that contains the integers from 20 to 25. R will compute the pmf for each of the components of the vector.

    To add up all the individual probabilities that R generates, we use the sum command.

    > sum(dbinom(20:25, 30, 0.75))
    [1] 0.7964023
    
    So P(20 ≤ X ≤ 25)=0.7964.
  • pbinom is the cumulative distribution function (cdf). To compute P(X ≤ 25), where X has a binomial distribution with n=30 and p=0.75, we use:
    > pbinom (25, 30, 0.75)
    [1] 0.9021304
    
    Thus, P(X ≤ 25)=F(25)=0.9021.

    Assuming that we want P(20 ≤ X ≤ 25), we can use the cdf to do the computation. We have P(20 ≤ X ≤ 25)=F(25)-F(19)=0.7964. Here is the computation with R.

    > pbinom (25, 30, 0.75)-pbinom (19, 30, 0.75)
    [1] 0.7964023
    

    Notice that we get the same value as the sum of the probability masses for x=20 to x=25.
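
The two approaches above can be compared directly in a short script. This is only a sketch: the variable names are chosen for illustration. It also shows one way to get an upper-tail probability such as P(X ≥ 20), either as a complement or with the lower.tail argument of pbinom.

```r
n <- 30    # number of trials
p <- 0.75  # probability of success

# P(20 <= X <= 25) by adding the probability masses
by_pmf <- sum(dbinom(20:25, n, p))

# P(20 <= X <= 25) as F(25) - F(19)
by_cdf <- pbinom(25, n, p) - pbinom(19, n, p)

all.equal(by_pmf, by_cdf)  # the two answers agree up to rounding error

# P(X >= 20) = 1 - P(X <= 19), computed in two equivalent ways
upper_a <- 1 - pbinom(19, n, p)
upper_b <- pbinom(19, n, p, lower.tail = FALSE)
```

Note that the lower bound uses 19 and not 20, because F(19) = P(X ≤ 19) removes only the values strictly below 20.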

    Normal Distributions

    The R name given to this distribution is norm.

    Assume that X has a normal distribution with a mean of 25 and a standard deviation of 5.25. Let's compute the probability that X is at most 20, i.e. P(X ≤ 20).

  • pnorm is the cumulative distribution function.
  • Its usage is pnorm(x, mean, sd), where mean is the mean of the distribution and sd is the standard deviation of the distribution.

    We want P(X ≤ 20)=F(20)=0.1705. Here is the computation with R.
    >  pnorm(20, 25, 5.25)
    [1] 0.1704519
    

    Suppose we want P(20 < X ≤ 25). We have P(20 < X ≤ 25)=F(25)-F(20)=0.3295. Here is the computation with R.
    > pnorm(25, 25, 5.25) - pnorm(20, 25, 5.25)
    [1] 0.3295481

  • qnorm is the quantile function. Its usage is qnorm( probability, mean, sd).

    Say that we are interested in the 5th percentile, that is a value q such that 0.05=P(X ≤ q), where X has a normal distribution with mean 25 and standard deviation 5.25. We can use the following command.

    > qnorm(0.05, 25, 5.25)
    [1] 16.36452
    
    So the 5th percentile is around 16.36.
  • Suppose that we want to compute the first quartile (which is approximately 21.46) and the third quartile (which is approximately 28.54). That is, we want the 25th percentile and the 75th percentile. We use:
    	> qnorm(0.25,25,5.25)
    	[1] 21.45893
    	> qnorm(0.75,25,5.25)
    	[1] 28.54107
          
  • Suppose that we want the interquartile range, i.e. the distance between the third quartile and the first quartile. We use:
    	> qnorm(0.75,25,5.25) - qnorm(0.25,25,5.25)
    	[1] 7.082142
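
As a check on the computations above, the sketch below verifies that qnorm undoes pnorm and that the same probabilities can be obtained from the standard normal distribution after standardizing. The variable names are chosen for illustration only.

```r
mu <- 25       # mean
sigma <- 5.25  # standard deviation

# P(X <= 20) directly, and after standardizing to Z = (X - mu) / sigma
p20 <- pnorm(20, mu, sigma)
p20_z <- pnorm((20 - mu) / sigma)
all.equal(p20, p20_z)

# qnorm is the inverse of pnorm: the cdf at the 5th percentile is 0.05
q05 <- qnorm(0.05, mu, sigma)
pnorm(q05, mu, sigma)

# Interquartile range: both quartiles in one call, then their difference
iqr <- diff(qnorm(c(0.25, 0.75), mu, sigma))

# Same value from the standard normal quartile, rescaled by sigma
iqr_z <- 2 * qnorm(0.75) * sigma
```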
           




    Descriptive Statistics



    Educational video:
    Descriptive Statistics

    Summary of the video:

    The following are some of the functions to use when working with descriptive statistics in R:

  • ls() - displays a list of the objects defined during the current R session. In the video, we see that an object x exists. To display x, simply enter x at the prompt. We obtain:
    > x
    [1]  12  13  11   9   2  75 125  35
    
    Remark: We had previously defined this numerical vector with the following command: x=c(12, 13, 11, 9, 2, 75, 125, 35)

  • is.vector(x) - asks R if the object x is a vector (TRUE or FALSE).
  • is.numeric(x) - asks R if the object x is numeric (TRUE or FALSE).
  • length(x) - gives the length of the vector x, which is the number of components (sometimes this corresponds to the sample size).
  • summary(x) - gives a summary of some descriptive statistics of x: the minimum, the 1st quartile, the median, the mean, the 3rd quartile and the maximum.
  • mean(x) - for the mean of x
  • median(x) - for the median of x
  • var(x) - for the variance of x
  • sd(x) - for the standard deviation of x
  • IQR(x) - for the interquartile range of x

    Note: IQR must be written in upper case.

  • range(x) - gives a vector that contains two values. In component 1, we have the minimum. In component 2, we have the maximum.

    For the range defined as the distance between the minimum and the maximum, use the following command (assuming that x is the numerical vector):

    > range(x)[2] - range(x)[1]
    

    The indices of the components (here 2 and 1) are put in square brackets.

  • sort(x) - arranges the numerical values from smallest to largest
  • quantile(x) - to get some percentiles (also known as sample quantiles). Here is an example:
    > quantile(x)
       0%   25%   50%   75%  100%
      2.0  10.5  12.5  45.0 125.0
    
    We notice that the first quartile is 10.5, the third quartile is 45 and the median is 12.5.

    Remark: R does not use, by default, the same formula as us to compute the quartiles. To force R to compute the quartiles as we learned in class, we use quantile(x,type=6). For our example, we get
    > quantile(x,type=6)
       0%   25%   50%   75%  100%
      2.0   9.5  12.5  65.0 125.0
    

    We can add a second argument to the quantile function to specify which quantiles to compute.

    > quantile(x, c(0.05, 0.95))
    

    This will give the 5th and 95th percentile of the numerical vector x.

  • boxplot(x) - to construct a boxplot of the numerical vector x. A boxplot is displayed in the graphics window.
  • boxplot(x)$out - to display the outliers of the numerical vector x.
  • Keywords: Numerical vector, measures of central tendency, measure of variability, descriptive statistics with R.
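
The functions listed above can be collected in one short script. This is a sketch that uses the vector x from the remark above; the names centre, spread, q6 and outliers are chosen for illustration. It uses diff(range(x)) as a shorter alternative to range(x)[2] - range(x)[1], and boxplot.stats(x)$out to obtain the outliers without drawing the boxplot.

```r
x <- c(12, 13, 11, 9, 2, 75, 125, 35)

is.vector(x)   # TRUE or FALSE
is.numeric(x)  # TRUE or FALSE
n <- length(x) # number of components

# Measures of central tendency
centre <- c(mean = mean(x), median = median(x))

# Measures of variability; sd(x) is the square root of var(x)
spread <- c(var = var(x), sd = sd(x), IQR = IQR(x),
            range = diff(range(x)))

# Quartiles with the type 6 formula
q6 <- quantile(x, c(0.25, 0.75), type = 6)

# Outliers according to the boxplot rule, without opening a graphics window
outliers <- boxplot.stats(x)$out
```

Note that IQR(x) relies on the default quantile formula, so it need not equal the difference between the type 6 quartiles.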




    R Editor Window



    Educational video:
    R editor window

    Summary of the video:

    Instead of working in the R console, it is sometimes more efficient to enter and edit your commands in an R editor window.

    To open an editor window in R:

  • Go to the R GUI.
  • Click on File, then New script (on Windows) or New Document (on a Mac).
  • Enter your commands in the editor window.
  • Select the commands that you want to submit and press CTRL-R (on Windows) or CMD-ENTER (on a Mac) to send them to the R prompt.

    To save the commands from the R editor window:

  • Go to the R GUI.
  • Click on File, then Save as.
  • Use the extension .R to save. For example, the file name can be Rstuff.R.
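
A saved script can also be run in one step with the source() function instead of selecting its lines by hand. The sketch below is an illustration only: the contents of the script are hypothetical, and the file is written to a temporary folder so that the example is self-contained. In practice the file would be the one saved from the editor window.

```r
# Hypothetical contents of a script file such as Rstuff.R
script_lines <- c(
  "# Blood pressure example",
  "BP <- c(112, 125, 161)",
  "mean_BP <- mean(BP)"
)

# Write the script to a temporary folder (stands in for File > Save as)
script_path <- file.path(tempdir(), "Rstuff.R")
writeLines(script_lines, script_path)

# Run every command in the file, in order
source(script_path)

mean_BP  # the object created by the script is now available
```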

     




    Importing data from a file



    Educational video:
    Importing data from a file

    Summary of the video:

    Working with the dataframe ...

    Keywords: read.table, indices, boxplot, summary, aggregate
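
The keywords above can be illustrated with a small sketch. The data below are hypothetical and are not the data set used in the video: they are typed in as text so that the example runs on its own. For a real file, the text argument would be replaced by the name of the file, e.g. read.table("mydata.txt", header = TRUE), where mydata.txt is a made-up file name.

```r
# Hypothetical data: first line holds the column names
raw <- "subject group BP
Alice A 112
Bob B 125
Celine A 161
David B 140"

# Read the data into a data frame
bp_data <- read.table(text = raw, header = TRUE)

# Indices: rows first, then columns, inside square brackets
bp_data[2, ]                         # second row
bp_data$BP                           # the BP column as a numerical vector
bp_data[bp_data$group == "A", "BP"]  # BP values for group A only

summary(bp_data$BP)                  # descriptive statistics for BP

# Mean of BP within each group
group_means <- aggregate(BP ~ group, data = bp_data, FUN = mean)
```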




    Graphs: Histogram and boxplot



    Educational video:
    Histogram and boxplot

    Summary of the video: Keywords: histogram, boxplot, quantile of type 6.








    Side by side boxplots (also called comparative boxplots)



    Educational video:
    Comparative boxplots

    Summary of the video:


    Operations on numerical vectors



    Educational video:
    Operations on numerical vectors

    Summary of the video:


    Assessing normality with a density histogram



    Educational video:
    Overlay of a normal density onto a density histogram

    Summary of the video:


    Assessing normality with a normal quantile-quantile plot or a normal probability plot



    Educational video:
    normal QQ plot and normal probability plot

    Summary of the video:


    One sample t-test



    Educational video:
    One sample t-test

    Summary of the video:

     

     




    Paired t-test



    Educational video:
    Paired t-test

    Summary of the video:

    Keywords: paired t-test, numerical variable, t.test
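
As an illustration of the t.test keyword, here is a minimal sketch of a paired t-test. The measurements are hypothetical and are not the data from the video. The sketch also shows that a paired t-test gives the same result as a one sample t-test on the differences.

```r
# Hypothetical paired measurements on the same five subjects
before <- c(112, 125, 161, 140, 133)
after <- c(108, 120, 150, 138, 130)

# Paired t-test of the null hypothesis that the mean difference is 0
paired <- t.test(before, after, paired = TRUE)

# Equivalent one sample t-test on the differences
one_sample <- t.test(before - after)

paired$statistic  # t statistic
paired$p.value    # p-value
paired$conf.int   # confidence interval for the mean difference
```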







    T-test to compare the means from two independent populations



    Educational video:
    unpaired t-test

    Summary of the video:





    Contingency tables



    Educational video:
    Contingency Tables

    Summary of the video:





    Correlation and Simple Linear Regression



    Educational video:
    Correlation and Simple Linear Regression

    Summary of the video: