Table of contents

  1. Table of contents
  2. Learning goals
  3. GWAS project

Learning goals

By the end of this session, you should be able to…

  • calculate odds ratios by hand and associated values (confidence intervals etc.)
  • visualize odds ratios
  • identify extreme odds ratios
  • calculate \(chi^2\) statitics by hand and use \(chi^2\) distribution to get a p-value
  • simulate a null distribution to get a p-value
  • use R’s \(chi^2\) test function to get a p-value
  • use R’s Fisher’s exact test to get a p-value
  • compare the results
  • select a method and test all 280.000 SNPs
  • identify interesting patterns
  • explain the pattern by biomedical knowledge
library(tidyverse)
library(microbenchmark)

GWAS project

First, we read the data and see what kind of variables we have.

snpdata = readRDS(file = "gwas_allele_counts.julia.2008.rds")
head(snpdata)
## # A tibble: 6 x 5
##   rsid       chromosome chromosome_posit~ genome_position contingency_tab~
##   <chr>           <int>             <int>           <dbl> <list>          
## 1 rs3934834           1            995669          995669 <dbl [2 x 2]>   
## 2 rs3737728           1           1011278         1011278 <dbl [2 x 2]>   
## 3 rs6687776           1           1020428         1020428 <dbl [2 x 2]>   
## 4 rs9651273           1           1021403         1021403 <dbl [2 x 2]>   
## 5 rs4970405           1           1038818         1038818 <dbl [2 x 2]>   
## 6 rs12726255          1           1039813         1039813 <dbl [2 x 2]>

This dataset is described in the paper “Genome-wide association study of rheumatoid arthritis in the Spanish population: KLF12 as a risk locus for rheumatoid arthritis susceptibility” by Julia et al.

The first three columns are easy:

rsid = SNP id
chromosome = the name of the chromosome
chromosome_position = the position of the SNP within the chromosome

We have taken all the chromosome and glues them together from chromosome 1…chromosome 22, then we made a new coordinate system starting from 1 (chromosome 1) and ending in position ~3 gigabases (chromosome 22). This position is calculated in the column called “genome_position”. Finally we have counted the number of alleles in cases and controls and put this contingency table into the variable “contingency_table”.

Looking at the data

Count the number of SNPs per chromosome.

table(snpdata$chromosome)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701 
##    13    14    15    16    17    18    19    20    21    22 
## 10470  9032  8085  8271  7639  9518  5394  7121  5029  4970

Plot the number of SNPs per chromosome.

barplot(table(snpdata$chromosome), col = "deeppink4")

We can see that the number of SNPs shows a decreasing tendency - this is likely due to the fact that the chromosomes have different lengths, and the longer the chromosome is, the more likely it is to have more SNPs on it.

Now we will look at the position of SNPs in the genome in more detail.

hist(snpdata$chromosome_position, col = "deeppink4", 
     xlab = "Chromosome position",
     main="Position of SNPs in the genome") 

This tendency can also be explained by the chromosome size. All chromosomes are at least a few million bp long, but only the longest ones stretch to, say, \(1.5*10^8\) bps and beyond.

To illustrate this, let’s make a table of the chromosome positions per chromosome and make a plot.

x = table(snpdata$chromosome, snpdata$chromosome_position%/%10^7)
barplot(x, col=rainbow(25), xlab="Chromosome", ylab="Chromosome position / 10^7") 

Make a histogram of the genome_position. First calculate the number of breaks needed so each bin is ~10 million bp, then make the histogram using that number of breaks.

nbreaks = round(max(snpdata$genome_position) / 10^7)
hist(snpdata$genome_position, col="deeppink4", breaks = nbreaks, border = NA,
     xlab = "Genome position", main = "Histogram of genome position")

What is the average number of SNPs pr. 10 megabases? Eyeball it from the histogram.

Approximately 1000.

x = table(snpdata$genome_position %/% 10^7)
mean(x)
## [1] 998.7402
median(x)
##   64 
## 1015
hist(x, breaks=30, col="deeppink4", main="Histogram of the number of SNPs per 10 megabases",
     xlab="")

Take me back to the beginning!

Odds ratios for one contingency table

We will look at SNP number 1 and SNP number 2059.

M  = snpdata$contingency_table[[1]] # extracting the contingency table corresponding to SNP 1
M2 = snpdata$contingency_table[[2059]]

M
##          allele1 allele2
## cases        143     657
## controls     136     664
addmargins(M) # adding the row and column sums 
##          allele1 allele2  Sum
## cases        143     657  800
## controls     136     664  800
## Sum          279    1321 1600
M2
##          allele1 allele2
## cases        400     400
## controls     280     520
addmargins(M2)
##          allele1 allele2  Sum
## cases        400     400  800
## controls     280     520  800
## Sum          680     920 1600
a = M[1,1] # getting the value from the first row and first column
c = M[2,1] # getting the value from the second row and first column
a+c
## [1] 279

Use the equations from the book the calculate the odds ratio for the first table (M). Make 4 variables like the layout in example 9.3 - also page 239 in Analysis of biological data, 2nd edition. So you should make a,b,c,d from the contingency table.

a = M[1,1]
b = M[1,2]
c = M[2,1]
d = M[2,2]
OR = (a*d)/(b*c)
OR
## [1] 1.062673

Use the equations from the book the calculate the SE of the odds ratio.

SE = sqrt(1/a + 1/b + 1/c + 1/d)
SE
## [1] 0.1318106

Use the equations from the book the calculate the 99% confidence interval of the odds ratio.

low99  = exp(log(OR) - 2.58 * SE)
high99 = exp(log(OR) + 2.58 * SE)
low99
## [1] 0.7563254
high99
## [1] 1.493107

Does the confidence interval overlap 1.0? What is your conclusion?

Yes, the confidence interval overlaps 1.0. Based on these results, we can’t conclude that there is an association between our variables - there might be a slight positive or slight negative effect, or no effect at all, so more analysis would be needed to pinpoint what exactly is going on.

Also calculate the OR and the 99% CI for M2 (the other SNP).

a = M2[1,1]
b = M2[1,2]
c = M2[2,1]
d = M2[2,2]
OR = (a*d)/(b*c)
SE = sqrt(1/a + 1/b + 1/c + 1/d)
low99  = exp(log(OR) - 2.58 * SE)
high99 = exp(log(OR) + 2.58 * SE)
cat(low99, "<", OR, "<", high99, "\n")
## 1.425806 < 1.857143 < 2.418969

Does the confidence interval overlap 1.0? What is your conclusion?

No, in this case the confidence interval for the odds ratio exceeds 1, which suggests an association between the variables (so an association between having the SNP and having the disease, in this case, rheumatoid arthritis).

Make a plot similar to figure 9.3-1 for SNP 1 and SNP 2059.

mosaicplot(snpdata$contingency_table[[1]], col=c("firebrick", "goldenrod1"), cex.axis = 1.2, 
           sub = "Allele", dir = c("h","v"), ylab = "Group",
           main ="Contingency table for SNP 1")

mosaicplot(snpdata$contingency_table[[2059]], col=c("firebrick", "goldenrod1"), cex.axis = 1.2, 
           sub = "Allele", dir = c("h","v"), ylab = "Group", main=paste("SNP 2059\n", 
          round(low99,2), "<", round(OR,2), "<", round(high99,2)))

We can also perform Fisher’s exact test for these SNPs.

fisher.test(M)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  M
## p-value = 0.6927
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.813807 1.387945
## sample estimates:
## odds ratio 
##     1.0626
fisher.test(M2)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  M2
## p-value = 1.637e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.511641 2.281964
## sample estimates:
## odds ratio 
##   1.856437

As expected, we get a high p-value for SNP 1, which makes sense as the confidence interval for this SNP overlapped 1 so we can’t reject the null hypothesis of no association (or OR being 1). In contrast, however, the p-value for SNP 2059 is below 0.05, so we can reject the null hypothesis of no association. These results are visually supported by the mosaic plots.

Take me back to the beginning!

Many odds ratios

Calculate the odds ratio and 99% confidence intervals for all contigency tables. Add these to the snpdata and call them oddsratio, low99 and high99.

# first, we are making three vectors which are as long as many rows our SNPdata dataframe has - so as many SNPs we have in total
r1 = rep(NA, nrow(snpdata))   
r2 = rep(NA, nrow(snpdata))   
r3 = rep(NA, nrow(snpdata))   

# then we loop through the entire dataset
for (i in 1:length(r1)) {
  
  # we extract the i-th contingency table
  M = snpdata$contingency_table[[i]]
  
  # then we make our calculations on this contingency table
  a = M[1,1]
  b = M[1,2]
  c = M[2,1]
  d = M[2,2]
  
  OR     = (a*d)/(b*c)
  SE     = sqrt(1/a + 1/b + 1/c + 1/d)
  low99  = exp(log(OR) - 2.58 * SE)
  high99 = exp(log(OR) + 2.58 * SE)
  
  # then we put the calculated values to the i-th position of our vectors
  r1[i] = OR
  r2[i] = low99
  r3[i] = high99
}

# finally, we append these vectors as new columns to our dataset
snpdata$oddsratio = r1
snpdata$low99 = r2
snpdata$high99 = r3

Plot the odds ratio as function of genome position for all odds ratios.

plot(oddsratio ~ genome_position, data = snpdata, col = chromosome,
     ylab = "Odds ratio", xlab = "Genome position",
     main = "OR as function of genome position, colored by chromosome") 

Plot the odds ratio as function of genome position for all odds ratios where the 99% confidence interval doesn’t overlap 1.00.

In this case, we want a subset of the data where either the lower bound for the confidence interval (low99) is above 1 or the upper bound (high99) is below 1, which are the conditions that we specify in the subset function. The | symbol between them means “or”, so we get all the datapoints where the CI spans either below or above 1. If we substituted | for &, that would mean “and” and we would get no datapoints as our conditions are mutually exclusive.

plot(oddsratio  ~ genome_position, data = subset(x=snpdata, subset = low99 > 1.00 | high99 < 1.00),
     col = chromosome, ylab = "Odds ratio", xlab = "Genome position",
     main = "OR as function of genome position, colored by chromosome")  

A gentle introduction to ggplot - it is an entirely different way of plotting in R, using the package “ggplot2” from the “tidyverse”. It’s pretty awesome, so if you’re interested in plotting, definitely look it up.

plot1 = ggplot(data = snpdata %>% filter(low99 > 1.0 | high99 < 1.0)) +
  geom_point(mapping = aes(x=genome_position, y=oddsratio, color=factor(chromosome))) +
  ggtitle("OR as function of genome position, colored by chromosome") +
  xlab("Genome position") +
  ylab("Odds ratio")

plot(plot1)

Count the number of SNPs where the 99% confidence interval of the odds ratio doesn’t overlap 1.00.

sum(snpdata$low99 > 1.0 | snpdata$high99 < 1.0)
## [1] 2764

Take me back to the beginning!

Look for a biological signal in the genome using the odds ratios

Make a subset of the SNPs where the 99% confidence interval of the odds ratio doesn’t overlap 1.00 and count the number of these SNPs on each chromosome.

df = subset(x = snpdata, subset = low99 > 1.0 |  high99 < 1.0)
x = table(df$chromosome)
x
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 205 267 216 151 153 262 158 121 158 129 116 114  90  79  72  89  73 106 
##  19  20  21  22 
##  38  55  54  58

Which chromosome have the highest percentage of SNPs where the 99% confidence interval do NOT overlap 1.00?

E.g. it turns out that 129 SNPs on chromosome 8 has a 99% CI above or below 1.0. There are 16470 SNPs on chromosome 8 in total, so 129 / 16470 is 0.78%.

x = table(df$chromosome) 
y = table(snpdata$chromosome) 
sort(x/y) 
## 
##          19           8          20          12          13          11 
## 0.007044865 0.007346691 0.007723634 0.008320561 0.008595989 0.008726397 
##          14           5           4          15          10          17 
## 0.008746678 0.008768913 0.008805178 0.008905380 0.009048822 0.009556225 
##           1           7          21          16           9           3 
## 0.009567368 0.010405005 0.010737721 0.010760488 0.010921407 0.010939478 
##          18           2          22           6 
## 0.011136793 0.011586530 0.011670020 0.013839003
barplot(sort(x/y), col = "deeppink4", main ="Percentage of SNPs where the 99% CI doesn't overlap 1")

We can see that chromosome 6 has the highest percentage of such SNPs.

Extract the significant SNPs from this chromosome and plot the odds ratio as function of chromosome_position.

df = subset(x = snpdata, subset = chromosome==6 & (low99 > 1.0 |  high99 < 1.0)) 
plot(oddsratio  ~ chromosome_position, data = df, col = "deeppink4", 
     ylab = "Odds ratio", xlab = "Genome position", 
     main = "OR as function of genome position")  

ggplot(data = df) + 
  geom_point(mapping = aes(x=chromosome_position, y=oddsratio), col = "deeppink4") + 
  geom_hline(yintercept = 1) +
  ggtitle("OR as function of genome position") +
  ylab("Odds ratio") +
  xlab("Genome position")

Do the SNPs with a CI above or below 1.0 look like they are randomly distributed or clumped?

They seem to cluster together around 30 megabases.

table(df$chromosome_position %/% 1000000)
## 
##   0   1   2   6   8   9  10  14  15  16  17  18  22  23  24  25  26  27 
##   1   3   1   3   2   3   2   1   1   2   6   1   5   1   3   1   1   3 
##  28  29  30  31  32  33  34  37  40  41  42  44  45  46  47  48  52  53 
##   6   1   7  39  15   7   3   1   1   2   1   1   2   1   2   2   1   1 
##  55  63  64  68  70  76  78  79  83  85  87  91  92  93  94  95  96  97 
##   1   1   2   3   9   1   1   1   2   2   2   1   2   3   1   1  10   2 
## 100 107 108 109 114 116 122 123 125 127 128 129 130 133 134 137 138 139 
##   1   1   1   1   1  10   1   4   2   4   1   2   2   3   2   2   4   1 
## 141 144 148 149 151 152 154 155 156 157 158 159 161 162 163 164 168 169 
##   1   1   4   3   1   1   1   7   5   3   2   1   1   1   1   3   3   2

We see 39 SNPs between 31 and 32 mb.

Take me back to the beginning!

Chi^2 calculations

Let’s look at SNP number 2059.

Observed = snpdata$contingency_table[[2059]] # we extract the contingency table

# then we extract the values we need from the contingency table
a = Observed[1,1]
b = Observed[1,2]
c = Observed[2,1]
d = Observed[2,2]

n = a+b+c+d # number of observations in total

# calculating the probabilities for the rows and columns
Pr_row1 = (a+b)/n 
Pr_row2 = (c+d)/n
Pr_col1 = (a+c)/n
Pr_col2 = (b+d)/n

# calculating the expected counts for each cell
E_a = Pr_row1 * Pr_col1 * n 
E_b = Pr_row1 * Pr_col2 * n 
E_c = Pr_row2 * Pr_col1 * n 
E_d = Pr_row2 * Pr_col2 * n 

# then we create a table from our calculated values
# c() concatenates the two numbers
# rbind() "glues together" the two rows of those concatenated numbers
# as.table() converts this structure into a table
Expected = as.table(rbind(c(E_a, E_b), c(E_c, E_d)))
colnames(Expected) = c("allele1", "allele2")
rownames(Expected) = c("cases", "controls")

mosaicplot(Observed, col=c("firebrick", "goldenrod1"), cex.axis = 1.2, 
           sub = "Allele", dir = c("h","v"), ylab = "Group", main="SNP 2059 - observed")

mosaicplot(Expected, col=c("firebrick", "goldenrod1"), cex.axis = 1.2, 
           sub = "Allele", dir = c("h","v"), ylab = "Group", main="SNP 2059 - expected")

# then we can do the entire chi^2 test by doing calculations with our tables
# this is going to subtract the corresponding cell values:
Observed-Expected 
##          allele1 allele2
## cases         60     -60
## controls     -60      60
# so we can calculate chi^2 accordingly:
chisquare = sum((Observed-Expected)^2/Expected)
df = (nrow(Observed)-1)*(ncol(Observed) - 1) # Page 249 in Analysis of biological data, 2nd edition

p = pchisq(q = chisquare, df = df, lower.tail = FALSE)
cat("X-square:", chisquare, "df =", df, "p =", p, "\n")
## X-square: 36.82864 df = 1 p = 1.289812e-09
#### Built-in test in R ####
chisq.test(Observed, correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  Observed
## X-squared = 36.829, df = 1, p-value = 1.29e-09

Compare the values of chi^2, the p-value and the degrees of freedom for the two approaches.

Apart from differences due to rounding, they are the same.

Take me back to the beginning!

Correction for continuity

You can find information about this on page 251 in Analysis of biological data, 2nd edition.

Perform a chi^2 test with correction for continuity, and also do it with the built-in chisq.test function. Compare the results.

# notice that the formula for chi^2 changes a bit - we subtract 1/2 and we also take the absolute value
chisquare = sum((abs(Observed-Expected)-1/2)^2/Expected) 
df = (nrow(Observed)-1)*(ncol(Observed) - 1) 
p = pchisq(q = chisquare, df = df, lower.tail = FALSE)

cat("X-square:", chisquare, "df =", df, "p =", p, "\n")
## X-square: 36.21739 df = 1 p = 1.764885e-09
# Compare with the built-in R version of continuity correction - we just need to specify correct = T
chisq.test(Observed, correct = T)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Observed
## X-squared = 36.217, df = 1, p-value = 1.765e-09

We get the same results.

Take me back to the beginning!

Simulating/permuting the null distribution

# we can start by creating a new data structure
M = rbind(rbind(rbind(
  # we create as many rows as many cases we had with Allele 1
  data.frame("Group"=rep("cases", a),    "Allele" = rep("1", a)),
  # then as many controls we had with Allele 2
  data.frame("Group"=rep("controls", c), "Allele" = rep("1", c))),
  # and do the same with the cases and controls having Allele 2
  data.frame("Group"=rep("cases", b),    "Allele" = rep("2", b))),
  data.frame("Group"=rep("controls", d), "Allele" = rep("2", d)))

# now we have a data frame with one row per individual
head(M)
##   Group Allele
## 1 cases      1
## 2 cases      1
## 3 cases      1
## 4 cases      1
## 5 cases      1
## 6 cases      1
# we can turn it into a table and we get the original data back
addmargins(table(M))
##           Allele
## Group         1    2  Sum
##   cases     400  400  800
##   controls  280  520  800
##   Sum       680  920 1600
# let's randomly shuffle one of the columns in our data frame - the sample() function will do this for us
M$Group = sample(M$Group)
head(M)
##      Group Allele
## 1 controls      1
## 2 controls      1
## 3 controls      1
## 4 controls      1
## 5    cases      1
## 6 controls      1
addmargins(table(M))
##           Allele
## Group         1    2  Sum
##   cases     332  468  800
##   controls  348  452  800
##   Sum       680  920 1600
# let's do this again - we get a different outcome due to the randomness of the shuffling
M$Group = sample(M$Group)
head(M)
##      Group Allele
## 1    cases      1
## 2 controls      1
## 3 controls      1
## 4    cases      1
## 5 controls      1
## 6 controls      1
addmargins(table(M))
##           Allele
## Group         1    2  Sum
##   cases     343  457  800
##   controls  337  463  800
##   Sum       680  920 1600
# this is quite troublesome - R has a simpler way of doing this using the r2dtable() function
# we just need the row and column sums
rs = rowSums(Observed)
cs = colSums(Observed)
# and we give these values to the r2dtable() function, along with the number of simulations we need, so in this case, 5
r2dtable(n=5, r = rs, c = cs)
## [[1]]
##      [,1] [,2]
## [1,]  337  463
## [2,]  343  457
## 
## [[2]]
##      [,1] [,2]
## [1,]  328  472
## [2,]  352  448
## 
## [[3]]
##      [,1] [,2]
## [1,]  358  442
## [2,]  322  478
## 
## [[4]]
##      [,1] [,2]
## [1,]  345  455
## [2,]  335  465
## 
## [[5]]
##      [,1] [,2]
## [1,]  344  456
## [2,]  336  464
# so now we just do this one million times
simulations = 1000000
# Simulate a lot of tables
set.seed(0)
simulated_tables = r2dtable(n = simulations,r = rs, c = cs)
# Check the first two tables
addmargins(simulated_tables[[1]])
##              Sum
##     324 476  800
##     356 444  800
## Sum 680 920 1600
addmargins(simulated_tables[[2]])
##              Sum
##     337 463  800
##     343 457  800
## Sum 680 920 1600
addmargins(Observed)
##          allele1 allele2  Sum
## cases        400     400  800
## controls     280     520  800
## Sum          680     920 1600
# now we want to do a chi^2 test for all of them
# first we create a vector that is as long as many simulations we have
sim_chisquares = rep(NA, simulations)

# then we loop through all the tables and calculate the chi square for each of them
for (j in 1:simulations) {
  sim_chisquares[j] = sum(((simulated_tables[[j]]-Expected)^2)/Expected)
}

# let's calculate the observed chi^2 as well
chisquare = sum((Observed-Expected)^2/Expected)
df = (nrow(Observed)-1)*(ncol(Observed) - 1) 

# now, for plotting the theoretical chi^2 distribution, we create breaks that go from 0 to the maximum value of our simulated chi^2 values, by 0.5
breaks = seq(0,ceiling(max(sim_chisquares)), by = 0.5)
# and calculate the midpoints for each bin
midpoints = 1/2 * (breaks[2:length(breaks)] + breaks[1:(length(breaks)-1)])

# we basically calculate the expected frequency of chi^2 values between 0 and 1 - and multiply by the total number of simulations
# we have the midpoints on the x axis
x = midpoints 
# and the probabilities on the y axis
y = (pchisq(q = breaks[2:length(breaks)], df=1) - pchisq(q=breaks[1:(length(breaks)-1)], df=1))*length(sim_chisquares)

# now we put the simulated chi^2 values on a histogram
hist(sim_chisquares, breaks=breaks, xlim = c(0, 40), col = "deeppink4",
     main = "Theoretical chi^2 distribution", xlab = "Simulated chi^2 values")
# add line with the observed value
abline(v=chisquare)
# add the real chi square distribution
lines(y ~ x, lwd=2)

How many time did the null distribution give something that was equally or more extreme than the observed?

sum(sim_chisquares >= chisquare)
## [1] 0

What is the estimated p-value?

sum(sim_chisquares >= chisquare) / length(sim_chisquares)
## [1] 0

Compare it with the built-in function that uses simulation.

chisq.test(Observed, correct = F, simulate.p.value = TRUE, B = 1000000)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 1e+06
##  replicates)
## 
## data:  Observed
## X-squared = 36.829, df = NA, p-value = 1e-06

The chi^2 value is very close to the one we obtained and we also get a very small p-value, so we can arrive to the same conclusion using both methods.

Take me back to the beginning!

Trying Fisher’s exact test

fisher.test(Observed)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  Observed
## p-value = 1.637e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.511641 2.281964
## sample estimates:
## odds ratio 
##   1.856437
x = fisher.test(Observed) # we store the results of the test in a variable
names(x) 
## [1] "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
## [6] "method"      "data.name"
x$conf.int 
## [1] 1.511641 2.281964
## attr(,"conf.level")
## [1] 0.95
x$estimate # estimate for the true odds ratio
## odds ratio 
##   1.856437
x$p.value
## [1] 1.636953e-09

What about the speed of the different solutions?

microbenchmark(chisq.test(Observed))
## Unit: microseconds
##                  expr    min     lq     mean median      uq     max neval
##  chisq.test(Observed) 65.186 66.502 80.80606 68.489 76.4655 257.969   100
microbenchmark(chisq.test(Observed, correct = F))
## Unit: microseconds
##                               expr   min      lq   mean  median      uq
##  chisq.test(Observed, correct = F) 58.92 60.7465 70.524 61.5305 66.6115
##      max neval
##  222.352   100
microbenchmark(chisq.test(Observed, simulate.p.value = TRUE))
## Unit: microseconds
##                                           expr     min      lq     mean
##  chisq.test(Observed, simulate.p.value = TRUE) 437.924 448.669 528.3793
##    median      uq      max neval
##  462.7065 534.875 1025.747   100
microbenchmark(chisq.test(Observed, simulate.p.value = TRUE, B = 100000))
## Unit: milliseconds
##                                                      expr      min
##  chisq.test(Observed, simulate.p.value = TRUE, B = 1e+05) 16.75983
##        lq     mean   median       uq      max neval
##  17.68245 18.56991 18.22254 18.78174 44.61221   100
microbenchmark(fisher.test(Observed))
## Unit: milliseconds
##                   expr      min      lq     mean   median       uq
##  fisher.test(Observed) 1.109719 1.27573 2.021988 1.737481 1.956997
##       max neval
##  17.20881   100

What method would you use for all 280.000 SNPS?

In itself, the chisq.test is fastest, so it’s a good solution when we can make sure that we can satisfy the assumptions of the chi^2 test. However, Fisher’s exact test wins if we compare it to doing a chi^2 test with simulation.

Test all 280.000 SNPs using Fisher’s exact test. For each test you want to extract the p-value and finally save it in column snpdata$p.value.

x = rep(NA, nrow(snpdata))

for (i in 1:length(x)) {
  result =fisher.test(snpdata$contingency_table[[i]])
  x[i] = result$p.value
}

snpdata$p.value = x

Take me back to the beginning!

Making a Manhattan plot and correcting for multiple testing

Make a plot with x = genome position and y = -log10(p.value).

plot(x = snpdata$genome_position, y=-log10(snpdata$p.value),
     ylab = "-log10(p-value)", xlab = "Genome position")

ggplot(data = snpdata) +
  geom_point(mapping = aes(x=genome_position, y=-log10(p.value), color=factor(chromosome))) +
  xlab("Genome position") +
  ylab("-log10(p-value)")

Make a Manhattan plot for the 1000 most significant SNPs.

# we get the indices of the top 1000 significant SNPs and use these to plot
i = order(snpdata$p.value)[1:1000]
plot(x = snpdata$genome_position[i], y=-log10(snpdata$p.value[i]),
     ylab = "-log10(p-value)", xlab = "Genome position")

Let’s adjust the p-values and add them to our data.

# Bonferroni
snpdata$bonferroni = p.adjust(p = snpdata$p.value, method = "bonferroni")

# False discovery rate
snpdata$fdr = p.adjust(p = snpdata$p.value, method = "fdr")

# the different adjustment methods
p.adjust.methods
## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"

How many SNPs are significant using uncorrected p-values (alpha = 0.05)?

sum(snpdata$p.value <= 0.05)
## [1] 9914

How many do you expect?

0.05*nrow(snpdata)
## [1] 14032.3

How many SNPs are significant using Bonferroni-corrected p-values (alpha = 0.05)?

sum(snpdata$bonferroni <= 0.05) 
## [1] 9
# another way of doing this:
sum(snpdata$p.value <= 0.05/nrow(snpdata) )
## [1] 9
snpdata[(snpdata$p.value <= 0.05/nrow(snpdata)),]
## # A tibble: 9 x 11
##   rsid      chromosome chromosome_positi~ genome_position contingency_tab~
##   <chr>          <int>              <int>           <dbl> <list>          
## 1 rs3806308          1           20015453        20015453 <dbl [2 x 2]>   
## 2 rs116795~          2            2385495       249526800 <dbl [2 x 2]>   
## 3 rs2597515          3           13502134       503336259 <dbl [2 x 2]>   
## 4 rs4371616          4           14985834       704118331 <dbl [2 x 2]>   
## 5 rs1924466          6           17880743      1078765480 <dbl [2 x 2]>   
## 6 rs701308           7           83400100      1315014970 <dbl [2 x 2]>   
## 7 rs2462692         10            7184261      1683985726 <dbl [2 x 2]>   
## 8 rs985035          13           83160898      2161969337 <dbl [2 x 2]>   
## 9 rs1975242         15           31588499      2330850156 <dbl [2 x 2]>   
## # ... with 6 more variables: oddsratio <dbl>, low99 <dbl>, high99 <dbl>,
## #   p.value <dbl>, bonferroni <dbl>, fdr <dbl>

After Bonferroni correction, we only have 9 SNPs left, which is quite a drastic cut after having 9914 such SNPs without correction. However, this is a good thing - in a GWAS study, we are looking for disease-causing alleles and it’s less helpful if we find thousands (especially since many of them are found significant due to the inflated type 1 error stemming from multiple testing).

How many SNPs are significant if we accept a false discovery rate of 0.1?

sum(snpdata$fdr <= 0.1)
## [1] 72

How many SNPs are significant if we accept a false discovery rate of 0.2?

sum(snpdata$fdr <= 0.2)
## [1] 166

Take me back to the beginning!

The distribution of the significant SNPs in the genome

For the rest of the exercise we will focus on significant SNPs with an FDR <= 0.2 (false discovery rate).

df = subset(x = snpdata, subset = fdr <= 0.2)

How many SNPs are there on each chromosome (in the original data)?

table(snpdata$chromosome)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701 
##    13    14    15    16    17    18    19    20    21    22 
## 10470  9032  8085  8271  7639  9518  5394  7121  5029  4970

How many significant SNPs are there on each chromosome?

table(df$chromosome) 
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
## 15 18 14  8  6 23  7  3 13  9  5  3  3  7  6  3  2 11  1  3  3  3

What is the proportion of significant SNPs on each chromosome?

table(df$chromosome) / table(snpdata$chromosome)
## 
##            1            2            3            4            5 
## 0.0007000513 0.0007811144 0.0007090403 0.0004664995 0.0003438790 
##            6            7            8            9           10 
## 0.0012148743 0.0004609812 0.0001821494 0.0008985968 0.0006313131 
##           11           12           13           14           15 
## 0.0003761378 0.0002189621 0.0002865330 0.0007750221 0.0007421150 
##           16           17           18           19           20 
## 0.0003627131 0.0002618144 0.0011557050 0.0001853912 0.0004212891 
##           21           22 
## 0.0005965401 0.0006036217

Make a plot of this proportion on each chromosome.

barplot(table(df$chromosome)/table(snpdata$chromosome), col="deeppink4")

Which chromosome is most enriched for significant SNPs?

barplot(sort(table(df$chromosome)/table(snpdata$chromosome)), col="deeppink4")

Chromosome 6.

How many of all SNPs are on this chromosome? How many of all SNPs are one other chromosomes?

table(snpdata$chromosome==6)
## 
##  FALSE   TRUE 
## 261714  18932

How many of the significant SNPs are on this chromosome? How many of the significant SNPs are one other chromosomes?

table(df$chromosome==6) 
## 
## FALSE  TRUE 
##   143    23

So a SNP can be significant or not, and it can be on chromosome 6 or not.

Use a binomial test to test if the number of significant SNPs on this chromosome is different than expected.

You can imagine this as a coin you toss n times - you get x number of heads (heads being landing on chromosome 6), and for each toss you expect to hit heads with probability p (which we estimate from how all the SNPs are distributed).

# p is the probability that a random SNP is on chromosome 6
p = sum(snpdata$chromosome==6) / nrow(snpdata) 

# x is how many significant SNPs we actually observe on chromosome 6
x = sum(df$chromosome==6) 

# n is the number of significant SNPs we have in total (out of which each falls on chromosome 6 with probability p)
n = nrow(df) 

binom.test(x = x, n = n, p = p)
## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 23, number of trials = 166, p-value =
## 0.0009543
## alternative hypothesis: true probability of success is not equal to 0.06745865
## 95 percent confidence interval:
##  0.0899025 0.2006129
## sample estimates:
## probability of success 
##              0.1385542
# we reject the null hypothesis - the number of significant SNPs on chromosome 6 indeed seems to be different than expected

Use a contingency table test instead.

x = snpdata$chromosome==6 # being on chromosome 6 or not
y = snpdata$fdr <= 0.2 # being significant or not
# we put all this data into a table
z = table(x,y)
addmargins(z)
##        y
## x        FALSE   TRUE    Sum
##   FALSE 261571    143 261714
##   TRUE   18909     23  18932
##   Sum   280480    166 280646
# and we can easily perform a chi^2 test 
chisq.test(z)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  z
## X-squared = 12.239, df = 1, p-value = 0.000468
# we can also perform a Fisher's test
fisher.test(z)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  z
## p-value = 0.0009508
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.366136 3.472255
## sample estimates:
## odds ratio 
##   2.224882

Using all methods, we come to the same conclusions, so it seems that there are more significant SNPs falling on chromosome 6 than we’d expect.

Count the number of SNPs per chromosome.

table(snpdata$chromosome)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701 
##    13    14    15    16    17    18    19    20    21    22 
## 10470  9032  8085  8271  7639  9518  5394  7121  5029  4970

Plot the number of SNPS per chromosome.

barplot(table(snpdata$chromosome), col = "deeppink4") 

Which chromosome has the highest “concentration” of SNPs?

# first of all, we create a new dataframe to store the chromosome size information
chromosome_sizes = aggregate(x = snpdata$chromosome_position, by = list(chr = unlist(snpdata$chromosome)), FUN = max)
chromosome_sizes = data.frame(chromosome_sizes)
colnames(chromosome_sizes) = c("chromosome", "size")
chromosome_sizes$size = as.numeric(chromosome_sizes$size)
chromosome_sizes$size/sum(as.numeric(chromosome_sizes$size))
##  [1] 0.08624786 0.08469542 0.06955154 0.06670047 0.06303436 0.05958174
##  [7] 0.05542261 0.05103705 0.04890244 0.04721178 0.04691498 0.04616643
## [13] 0.03982168 0.03711252 0.03497343 0.03094383 0.02744198 0.02656317
## [19] 0.02225672 0.02176843 0.01637046 0.01728109
# table of the number of SNPs in the chromosome
x = table(snpdata$chromosome)
x
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 21427 23044 19745 17149 17448 18932 15185 16470 14467 14256 13293 13701 
##    13    14    15    16    17    18    19    20    21    22 
## 10470  9032  8085  8271  7639  9518  5394  7121  5029  4970
#we divide all of them by the chromosome size, and sort them
sort(x / chromosome_sizes$size)
## 
##           15           19           14            1            4 
## 8.067608e-05 8.457711e-05 8.493104e-05 8.669939e-05 8.972488e-05 
##           13           16            2            7            5 
## 9.175508e-05 9.327965e-05 9.495131e-05 9.561605e-05 9.659870e-05 
##           17           11            3           22            9 
## 9.714582e-05 9.888137e-05 9.907256e-05 1.003664e-04 1.032407e-04 
##           12           10           21            6            8 
## 1.035688e-04 1.053781e-04 1.072072e-04 1.108885e-04 1.126188e-04 
##           20           18 
## 1.141607e-04 1.250457e-04
barplot(x / chromosome_sizes$size,  col = "deeppink4")

# we can plot the number of SNPs per chromosome
pd = as.data.frame(table(snpdata$chromosome))
names(pd) = c("Chromosome", "Frequency")
ggplot(pd, aes(x=Chromosome, y=Frequency)) + 
  geom_col(fill="deeppink4") + 
  ylab("Number of SNPs")

# and the chromosome size per chromosome as well
pd = as.data.frame(chromosome_sizes)
names(pd) = c("Chromosome", "Frequency")
ggplot(pd, aes(x=Chromosome, y=Frequency)) + 
  geom_col(fill="deeppink4") + 
  ylab("Chromosome size")

# finally, the SNP "concentration" - so the number of SNPs divided by the chromosome size
pd = as.data.frame(x / chromosome_sizes$size)
names(pd) = c("Chromosome", "Frequency")
ggplot(pd, aes(x=Chromosome, y=Frequency)) + 
  geom_col(fill="deeppink4") + 
  ylab("Density of SNPs")

Chromosome 18 is the one with the largest concentration of SNPs.

Test the null hypothesis that says: “The number of SNPs per megabase is the same for this chromosome and the rest of the genome.

\(H_A:\) The number of SNPs per megabase on this chromosome is NOT the same as the rest of the genome

We can do a binomial test to test this null hypothesis.

# we get the size of chromosome 18
size18 = chromosome_sizes$size[chromosome_sizes$chromosome==18]

# then the total size of the genome
totalsize = sum(chromosome_sizes$size)

# p will be the proportion of the genome that is chromosome 18
p = size18 / totalsize

# x is the number of SNPs we observe on chromosome 18
x = sum(snpdata$chromosome==18)

# n is the total number of SNPs
n = nrow(snpdata)

p*n  # expected number of SNPs on chromosome 18 - it's less than what we observe
## [1] 7454.846
# with these, we can do a binomial test
binom.test(x = x, n = n, p = p)
## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 9518, number of trials = 280650, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.02656317
## 95 percent confidence interval:
##  0.03324796 0.03459091
## sample estimates:
## probability of success 
##             0.03391461

We reject the null hypothesis, the number of SNPs per megabase on this chromosome is not the same as the rest of the genome.

Actually, we cherry pick a little bit above. The real null hypothesis should be: \(H_0:\) The number of SNPs per base is the same for all chromosomes \(H_A:\) The number of SNPs per base is NOT the same for all chromosomes

Test this null hypothesis.

We can use a chi^2 test.

# the number of SNPs per chromosome
snps = table(snpdata$chromosome)

# not SNPs - we just subtract the number of SNPs (as each have a length of 1) from the total length of the chromosome
not_snps = chromosome_sizes$size - snps

# we put these together in a table
z = rbind(snps, not_snps)
round(addmargins(z)/1000, 2)
##                  1         2         3         4         5         6
## snps         21.43     23.04     19.75     17.15     17.45     18.93
## not_snps 247119.88 242669.78 199278.63 191111.55 180606.10 170711.20
## Sum      247141.30 242692.82 199298.37 191128.70 180623.54 170730.13
##                  7         8         9        10        11       12
## snps         15.19     16.47     14.47     14.26     13.29     13.7
## not_snps 158797.06 146229.04 140114.37 135270.04 134420.52 132275.2
## Sum      158812.25 146245.51 140128.84 135284.29 134433.81 132288.9
##                 13        14        15       16       17       18       19
## snps         10.47      9.03      8.09     8.27     7.64     9.52     5.39
## not_snps 114097.65 106336.07 100207.50 88660.59 78626.73 76106.63 63770.72
## Sum      114108.12 106345.10 100215.58 88668.86 78634.37 76116.15 63776.12
##                20       21       22        Sum
## snps         7.12     5.03     4.97     280.65
## not_snps 62369.84 46904.15 49513.59 2865196.78
## Sum      62376.96 46909.18 49518.56 2865477.42
# and do a chi^2 test
r = chisq.test(z)
r
## 
##  Pearson's Chi-squared test
## 
## data:  z
## X-squared = 2714.7, df = 21, p-value < 2.2e-16
# we can also check what would be the expected number of SNPs and not SNPs per chromosome
r$expected
##                     1            2            3            4            5
## snps         24205.12     23769.43     19519.36     18719.22     17690.34
## not_snps 247117099.88 242669050.57 199278852.64 191109977.78 180605852.66
##                     6            7            8            9          10
## snps         16721.38     15554.13     14323.34     13724.27     13249.8
## not_snps 170713411.62 158796692.87 146231188.66 140115111.73 135271043.2
##                   11           12           13           14           15
## snps         13166.5     12956.42     11175.79     10415.48 9.815154e+03
## not_snps 134420645.5 132275912.58 114096945.21 106334681.52 1.002058e+08
##                    16           17           18           19           20
## snps         8684.263     7701.481     7454.846     6246.258     6109.224
## not_snps 88660171.737 78626664.519 76108697.154 63769871.742 62370848.776
##                    21           22
## snps         4594.303     4849.867
## not_snps 46904580.697 49513709.133

We reject the null hypothesis, the number of SNPs per base is not the same for all chromosomes.

Take me back to the beginning!