CPBayes (Bayesian meta analysis for studying cross-phenotype genetic associations) package

Introduction

Simultaneous analysis of genetic associations with multiple phenotypes may reveal shared genetic susceptibility across traits (pleiotropy). CPBayes is a Bayesian meta analysis method for studying cross-phenotype genetic associations. It uses summary-level data across multiple phenotypes to simultaneously measure the evidence of aggregate-level pleiotropic association and estimate an optimal subset of traits associated with the risk locus. CPBayes model is based on a spike and slab prior.

This R-package consists of following functions:

  1. analytic_locFDR_BF_uncor: This function analytically computes the local FDR & Bayes factor (BF) that quantifies the evidence of aggregate-level pleiotropic association for uncorrelated summary statistics.
  2. cpbayes_uncor: It implements CPBayes for uncorrelated summary statistics to figure out the optimal subset of non-null traits underlying a pleiotropic signal and other insights. The summary statistics across traits/studies are uncorrelated when the studies have no overlapping subjects.
  3. analytic_locFDR_BF_cor: This function analytically computes the local FDR & Bayes factor (BF) that quantifies the evidence of aggregate-level pleiotropic association for correlated summary statistics.
  4. cpbayes_cor: It implements CPBayes for correlated summary statistics to figure out the optimal subset of non-null traits underlying a pleiotropic signal and other insights. The summary statistics across traits/studies are correlated when the studies have overlapping subjects or the phenotypes were measured in a cohort study.
  5. post_summaries: It summarizes the MCMC data produced by cpbayes_uncor or cpbayes_cor. It computes additional summaries to provide a better insight into a pleiotropic signal. It works in the same way for both cpbayes_uncor and cpbayes_cor.
  6. forest_cpbayes: It creates a forest plot presenting the pleiotropy result obtained by cpbayes_uncor or cpbayes_cor. It works in the same way for both cpbayes_uncor and cpbayes_cor.
  7. estimate_corln: It computes an approximate correlation matrix of the beta-hat vector for multiple overlapping case-control studies using the sample-overlap matrices.

Installation

You can install CPBayes from CRAN.

install.packages("CPBayes")
library("CPBayes")

How to run CPBayes for uncorrelated summary statistics.

Load the example data.

library("CPBayes")
# Load the beta hat vector
BetaHatfile <- system.file("extdata", "BetaHat.rda", package = "CPBayes")
load(BetaHatfile)
BetaHat
##  [1]  0.03402285 -0.01987781 -0.01088476 -0.00940975  0.02195188 -0.01728955
##  [7]  0.14180403  0.06979527 -0.11721340 -0.09231879

BetaHat contains an example data of the main genetic effect (beta/log odds ratio) estimates for a single nucleotide polymorphism (SNP) obtained from 10 separate case-control studies for 10 different diseases. In each case-control study comprising a distinct set of 7000 cases and 10000 controls, we fit a logistic regression of the case-control status on the genotype coded as the minor allele count for all the individuals in the sample. One can also include various covariates, such as, age, gender, principal components (PCs) of ancestries in the logistic regression. From each logistic regression for a disease, we obtain the estimate of the main genetic association parameter (beta/logodds ratio) along with the corresponding standard error. Since the studies do not have any overlapping subject, beta-hat across the diseases are uncorrelated.

# Load the standard error vector
SEfile <- system.file("extdata", "SE.rda", package = "CPBayes")
load(SEfile)
SE
##  [1] 0.02736746 0.02741876 0.02749038 0.02747637 0.02749800 0.02753650
##  [7] 0.02731022 0.02714103 0.02786046 0.02783711

SE contains the standard errors corresponding to the above beta hat vector across 10 separate case-control studies.

Next we specify the name of the diseases/phenotypes and the genetic variant.

# Specify the name of the traits and the genetic variant.
traitNames <- paste("Disease", 1:10, sep = "")
SNP1 <- "rs1234"
traitNames
##  [1] "Disease1"  "Disease2"  "Disease3"  "Disease4"  "Disease5"  "Disease6" 
##  [7] "Disease7"  "Disease8"  "Disease9"  "Disease10"
SNP1
## [1] "rs1234"

Now, since the studies are non-overlapping, the summary statistics across traits are uncorrelated. Here we run the analytic_locFDR_BF_uncor function for this example data.

#Run analytic_locFDR_BF_uncor function to analytically compute locFDR and log10BF for uncorrelated summary statistics.
result <- analytic_locFDR_BF_uncor(BetaHat, SE)
str(result)
## List of 2
##  $ locFDR  : num 3.43e-06
##  $ log10_BF: num 3.93

So, locFDR [result$locFDR] was analytically computed as 3.43*10^(-6) and log10(Bayes factor) [result$log10_BF] was estimated as 3.93 indicating an aggregate-level pleiotropic association. While analytically computing locFDR (BF), a fixed value of slab variance is considered.

Next we implement CPBayes for this example data. Since the studies are non-overlapping, the summary statistics across traits are uncorrelated. Hence we run the cpbayes_uncor function.

# Run the uncorrelated version of CPBayes (based on MCMC).
result <- cpbayes_uncor(BetaHat, SE, Phenotypes = traitNames, Variant = SNP1, MCMCiter = 5500, Burnin = 500)
## Important traits with trait-specific posterior prob of assoc: 
##      traits PPAj
## 1  Disease7 1.00
## 2  Disease8 0.23
## 3  Disease9 0.97
## 4 Disease10 0.56

There are more options of arguments to pass into the function (see the Arguments section of cpbayes_uncor in the CPBayes manual). After running cpbayes_uncor, it prints the list of important traits for which the trait-specific posterior probability of association (PPAj) > 20%. However, the printed output is only a part of ‘result’ which is a list that constitutes of various components. An overall summary of ‘result’ can be seen by using the str() function (as shown below).

# Overall summary of the primary results produced by cpbayes_uncor.
str(result)
## List of 7
##  $ variantName     : chr "rs1234"
##  $ log10_BF        : num 8.43
##  $ locFDR          : num 1.09e-10
##  $ subset          : chr [1:3] "Disease7" "Disease9" "Disease10"
##  $ important_traits:'data.frame':    4 obs. of  2 variables:
##   ..$ traits: chr [1:4] "Disease7" "Disease8" "Disease9" "Disease10"
##   ..$ PPAj  : num [1:4] 1 0.227 0.971 0.565
##  $ auxi_data       :List of 8
##   ..$ traitNames     : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ K              : int 10
##   ..$ mcmc.samplesize: num 5000
##   ..$ PPAj           : num [1:10] 0.0334 0.0188 0.0142 0.0132 0.0178 ...
##   ..$ Z.data         : num [1:5000, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ sim.beta       : num [1:5000, 1:10] 0.007478 -0.00191 -0.003954 0.00095 -0.000388 ...
##   ..$ betahat        : num [1:10] 0.03402 -0.01988 -0.01088 -0.00941 0.02195 ...
##   ..$ se             : num [1:10] 0.0274 0.0274 0.0275 0.0275 0.0275 ...
##  $ runtime         : 'proc_time' Named num [1:5] 0.25 0.008 0.258 0 0
##   ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...

Here, result$variantName returns the name of the genetic variant specified by the user. Here, it is ‘rs1234’. Estimated based on the MCMC posterior sample, result$log10_BF provides the log10(Bayes factor) and result$locFDR provides the local false discovery rate (posterior probability of null association) evaluating the aggregate-level pleiotropic association. These values were different from that obtained by the analytical version (above). The main reason is that the locFDR (log10BF) obtained by cpbayes_uncor() function are estimated based on MCMC sample and a range of slab variance is considered. We recommend using the locFDR (log10BF) value obtained from the analytical version. Next, result$subset provides the optimal subset of associated/non-null traits selected by CPBayes. CPBayes selected Disease7, Disease9, and Disease10 as associated/non-null. It also gives a list of important traits (important_traits) comprising phenotypes having PPAj > 20%. Even if a phenotype is not selected in the optimal subset of non-null traits, it can produce a non-negligible value of trait-specific posterior probability of association (PPAj) and can be promising to be pleiotropic. For example, Disease8 had a PPAj of 21% and was listed in important_traits, but was not included in the optimal subset. A detailed interpretation of all the outputs are described in the Value section of cpbayes_uncor in the CPBayes manual.

The post_summaries function provides important insights into an obseved pleiotropic signal, e.g. the direction of associations, trait-specific posterior probability of associations (PPAj), posterior mean/median and 95% credible interval (Bayesian analog of the confidence interval) of the unknown true genetic effect (beta/odds ratio) on each trait, etc.

# Post summary of the MCMC data produced by cpbayes_uncor.
PleioSumm <- post_summaries(result, level = 0.05)  
str(PleioSumm)
## List of 9
##  $ variantName       : chr "rs1234"
##  $ log10_BF          : num 8.43
##  $ locFDR            : num 1.09e-10
##  $ subset            :'data.frame':  3 obs. of  3 variables:
##   ..$ traits   : chr [1:3] "Disease7" "Disease9" "Disease10"
##   ..$ PPAj     : num [1:3] 1 0.971 0.565
##   ..$ direction: chr [1:3] "positive" "negative" "negative"
##  $ important_traits  :'data.frame':  4 obs. of  3 variables:
##   ..$ traits   : chr [1:4] "Disease7" "Disease8" "Disease9" "Disease10"
##   ..$ PPAj     : num [1:4] 1 0.227 0.971 0.565
##   ..$ direction: chr [1:4] "positive" "positive" "negative" "negative"
##  $ traitNames        : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##  $ PPAj              :'data.frame':  10 obs. of  2 variables:
##   ..$ traits: chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ PPAj  : num [1:10] 0.0334 0.0188 0.0142 0.0132 0.0178 ...
##  $ poste_summary_beta:'data.frame':  10 obs. of  6 variables:
##   ..$ traits      : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ poste_mean  : num [1:10] 0.00491 -0.00269 -0.00151 -0.00137 0.00306 ...
##   ..$ poste_median: num [1:10] 0.00423 -0.00231 -0.00133 -0.00134 0.00275 ...
##   ..$ poste_se    : num [1:10] 0.01189 0.01047 0.00984 0.0099 0.01051 ...
##   ..$ lCl         : num [1:10] -0.0143 -0.0226 -0.0203 -0.0207 -0.0157 ...
##   ..$ uCl         : num [1:10] 0.0275 0.0168 0.017 0.0177 0.0235 ...
##  $ poste_summary_OR  :'data.frame':  10 obs. of  5 variables:
##   ..$ traits      : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ poste_mean  : num [1:10] 1.005 0.997 0.999 0.999 1.003 ...
##   ..$ poste_median: num [1:10] 1.004 0.998 0.999 0.999 1.003 ...
##   ..$ lCl         : num [1:10] 0.986 0.978 0.98 0.979 0.984 ...
##   ..$ uCl         : num [1:10] 1.03 1.02 1.02 1.02 1.02 ...

So we have to pass the list ‘result’ returned by cpbayes_uncor as the first argument and the ‘level’ as the second argument into the post_summaries function. If ‘level’ is not specified, the default value is 0.05. Note that post_summaries computes (1-level)% credible interval of the unknown true genetic effect (beta/odds ratio) on each trait. It estimates the direction of association with the important traits, the vector of trait-specific posterior probability of association (PPAj), etc. For detailed description of different outputs provided by this function, see the Value section of post_summaries in the CPBayes manual.

Next we run the forest_cpbayes function to create a forest plot that presents the pleiotropy result produced by cpbayes_uncor.

# Forest plot for the pleiotropy result obtained by cpbayes_uncor.
forest_cpbayes(result, level = 0.05)

Similarly as for the post_summaries function, we need to pass the same list `result’ returned by cpbayes_uncor as the first argument into the function. Second argument is the level whose default value is 0.05. In the forest plot, (1-level)% confidence interval of the beta/log odds ratio parameter is plotted for each trait. For more details, please see the section of forest_cpbayes function in the CPBayes manual.

How to run CPBayes for correlated summary statistics.

Next we demonstrate how to run CPBayes for correlated summary statistics. Get the path to the data.

# Load the beta-hat vector
datafile <- system.file("extdata", "cBetaHat.rda", package = "CPBayes")
load(datafile)
cBetaHat
##  [1] -0.042216290 -0.035624411 -0.058063641 -0.026037581 -0.014434543
##  [6] -0.004717336 -0.043771014 -0.025293739  0.059845798 -0.177518725

Here, ‘c’ in cBetaHat stands for correlated case. cBetaHat contains an example data of the main genetic association parameter (beta/log odds ratio) estimates for a SNP across 10 overlapping case-control studies for 10 different diseases. Each of the 10 studies has a distinct set of 7000 cases and a common set of 10000 controls shared across all the studies. In each case-control study, we fit a logistic regression of the case-control status on the genotype coded as the minor allele count for all the individuals in the sample. One can also include various covariates, such as, age, gender, principal components (PCs) of ancestries in the logistic regression. From each logistic regression for a disease, we obtain the estimate of the main genetic effect (beta/log odds ratio) along with the corresponding standard error. Since the studies have overlapping subjects, beta-hat across the diseases are correlated.

# Load the standard error vector
datafile <- system.file("extdata", "cSE.rda", package = "CPBayes")
load(datafile)
cSE
##  [1] 0.02386863 0.02389768 0.02404311 0.02395724 0.02383974 0.02376651
##  [7] 0.02402079 0.02392416 0.02377087 0.02446383

cSE contains the standard errors corresponding to the above beta hat vector across 10 overlapping case-control studies.

# Load the correlation matrix of the beta-hat vector (cBetaHat)
datafile <- system.file("extdata", "cor.rda", package = "CPBayes")
load(datafile)
cor
##            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
##  [1,] 1.0000000 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
##  [2,] 0.4117647 1.0000000 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
##  [3,] 0.4117647 0.4117647 1.0000000 0.4117647 0.4117647 0.4117647 0.4117647
##  [4,] 0.4117647 0.4117647 0.4117647 1.0000000 0.4117647 0.4117647 0.4117647
##  [5,] 0.4117647 0.4117647 0.4117647 0.4117647 1.0000000 0.4117647 0.4117647
##  [6,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 1.0000000 0.4117647
##  [7,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 1.0000000
##  [8,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
##  [9,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
## [10,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
##            [,8]      [,9]     [,10]
##  [1,] 0.4117647 0.4117647 0.4117647
##  [2,] 0.4117647 0.4117647 0.4117647
##  [3,] 0.4117647 0.4117647 0.4117647
##  [4,] 0.4117647 0.4117647 0.4117647
##  [5,] 0.4117647 0.4117647 0.4117647
##  [6,] 0.4117647 0.4117647 0.4117647
##  [7,] 0.4117647 0.4117647 0.4117647
##  [8,] 1.0000000 0.4117647 0.4117647
##  [9,] 0.4117647 1.0000000 0.4117647
## [10,] 0.4117647 0.4117647 1.0000000

Since the summary statistics across traits are correlated, we run the the analytic_locFDR_BF_cor function for this example data.

# Run analytic_locFDR_BF_cor function to analytically compute locFDR and log10BF for correlated summary statistics.
result <- analytic_locFDR_BF_cor(cBetaHat, cSE, cor)
str(result)
## List of 2
##  $ locFDR  : num 3.54e-10
##  $ log10_BF: num 9.18

So the locFDR [result$locFDR] was analytically computed as 3.54*10^(-10) and log10(Bayes factor) [result$log10_BF] was estimated as 9.18 indicating an aggregate-level pleiotropic association.

The correlation matrix of the beta-hat vector (cBetaHat) is given by ‘cor’ which we estimated by employing the estimate_corln function (demonstrated later in this tutorial) using the sample-overlap matrices (explained later in this tutorial). Next we run the correlated version of CPBayes based on MCMC for this example data.

# Run the correlated version of CPBayes.
result <- cpbayes_cor(cBetaHat, cSE, cor, Phenotypes = traitNames, Variant = SNP1, MCMCiter = 5500, Burnin = 500)
## Important traits with trait-specific posterior prob of assoc: 
##      traits PPAj
## 1  Disease9  0.9
## 2 Disease10  1.0

There are more options of arguments to pass into the function (see the Arguments section of cpbayes_cor in the CPBayes manual). After running cpbayes_cor, it prints the list of important traits for which the trait-specific posterior probability of association (PPAj) > 20%. However, the printed outputs are only a part of ‘result’ which is a list that constitutes of various components. An overall summary of ‘result’ can be seen by using the str() function (as shown below).

# Overall summary of the primary results produced by cpbayes_cor.
str(result)
## List of 8
##  $ variantName     : chr "rs1234"
##  $ log10_BF        : num 20.9
##  $ locFDR          : num 6.85e-22
##  $ subset          : chr [1:2] "Disease9" "Disease10"
##  $ important_traits:'data.frame':    2 obs. of  2 variables:
##   ..$ traits: chr [1:2] "Disease9" "Disease10"
##   ..$ PPAj  : num [1:2] 0.898 1
##  $ auxi_data       :List of 8
##   ..$ traitNames     : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ K              : int 10
##   ..$ mcmc.samplesize: num 5000
##   ..$ PPAj           : num [1:10] 0.0092 0.0092 0.0202 0.0072 0.0074 0.011 0.012 0.0066 0.898 1
##   ..$ Z.data         : num [1:5000, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ sim.beta       : num [1:5000, 1:10] -0.00172 -0.00438 -0.01626 0.00571 -0.01197 ...
##   ..$ betahat        : num [1:10] -0.0422 -0.0356 -0.0581 -0.026 -0.0144 ...
##   ..$ se             : num [1:10] 0.0239 0.0239 0.024 0.024 0.0238 ...
##  $ uncor_use       : chr "No"
##  $ runtime         : 'proc_time' Named num [1:5] 0.83 0 0.831 0 0
##   ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...

Here, result$variantName returns the name of the genetic variant specified by the user. Here, it is ‘rs1234’. Estimated based on the MCMC sample, result$log10_BF provides the log10(Bayes factor) and result$locFDR provides the local false discovery rate (posterior probability of null association) measuring the evidence of aggregate-level pleiotropic association. Again, these values were different from that obtained by the analytical version. The main reason is that the locFDR (log10BF) obtained by cpbayes_cor() function are estimated based on MCMC sample and a range of slab variance is considered. We recommend using the locFDR (log10BF) value obtained from the analytical version. However, for large number of traits (say > 25), analytic_locFDR_BF_cor may be slow to run. Next, result$subset provides the optimal subset of associated traits selected by CPBayes. A detailed interpretation of all the outputs are described in the Value section of cpbayes_cor in the CPBayes manual.

The post_summaries function provides important insights into an observed pleiotropic signal, e.g., the direction of associations, trait-specific posterior probability of associations (PPAj), posterior mean/median and 95% credible interval (Bayesian analog of the confidence interval) of the unknown true genetic effect (beta/odds ratio) on each trait, etc.

# Post summary of the MCMC data produced by cpbayes_cor.
PleioSumm <- post_summaries(result, level = 0.05)  
str(PleioSumm)
## List of 9
##  $ variantName       : chr "rs1234"
##  $ log10_BF          : num 20.9
##  $ locFDR            : num 6.85e-22
##  $ subset            :'data.frame':  2 obs. of  3 variables:
##   ..$ traits   : chr [1:2] "Disease9" "Disease10"
##   ..$ PPAj     : num [1:2] 0.898 1
##   ..$ direction: chr [1:2] "positive" "negative"
##  $ important_traits  :'data.frame':  2 obs. of  3 variables:
##   ..$ traits   : chr [1:2] "Disease9" "Disease10"
##   ..$ PPAj     : num [1:2] 0.898 1
##   ..$ direction: chr [1:2] "positive" "negative"
##  $ traitNames        : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##  $ PPAj              :'data.frame':  10 obs. of  2 variables:
##   ..$ traits: chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ PPAj  : num [1:10] 0.0092 0.0092 0.0202 0.0072 0.0074 0.011 0.012 0.0066 0.898 1
##  $ poste_summary_beta:'data.frame':  10 obs. of  6 variables:
##   ..$ traits      : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ poste_mean  : num [1:10] -0.00413 -0.0026 -0.00811 -0.00022 0.00235 ...
##   ..$ poste_median: num [1:10] -0.004053 -0.002547 -0.007689 -0.000236 0.002184 ...
##   ..$ poste_se    : num [1:10] 0.0092 0.00912 0.01021 0.00903 0.00907 ...
##   ..$ lCl         : num [1:10] -0.0222 -0.0205 -0.028 -0.018 -0.0157 ...
##   ..$ uCl         : num [1:10] 0.01341 0.01511 0.00939 0.01756 0.01977 ...
##  $ poste_summary_OR  :'data.frame':  10 obs. of  5 variables:
##   ..$ traits      : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
##   ..$ poste_mean  : num [1:10] 0.996 0.997 0.992 1 1.002 ...
##   ..$ poste_median: num [1:10] 0.996 0.997 0.992 1 1.002 ...
##   ..$ lCl         : num [1:10] 0.978 0.98 0.972 0.982 0.984 ...
##   ..$ uCl         : num [1:10] 1.01 1.02 1.01 1.02 1.02 ...

Note that, post_summaries works exactly in the same way for both cpbayes_cor and cpbayes_uncor. For detailed description of different outputs provided by post_summaries, see the Value section of post_summaries in the CPBayes manual.

Next we run the forest_cpbayes function to create a forest plot that presents the pleiotropy result produced by cpbayes_cor.

# Forest plot for the pleiotropy result obtained by cpbayes_cor.
forest_cpbayes(result, level = 0.05)

Note that, forest_cpbayes works exactly in the same way for both cpbayes_cor and cpbayes_uncor. For more details, see the section of forest_cpbayes function in the CPBayes manual.

How to run estimate_corln.

The function estimate_corln estimates the correlation matrix of the beta-hat vector for multiple overlapping case-control studies using the sample-overlap matrices which describe the number of cases or controls shared between studies/traits, and the number of subjects who are case for one study/trait but control for another study/trait. For a cohort study, the phenotypic correlation matrix should be a reasonable substitute of this correlation matrix.

# Example data of sample-overlap matrices
SampleOverlapMatrixFile <- system.file("extdata", "SampleOverlapMatrix.rda", package = "CPBayes")
load(SampleOverlapMatrixFile)
SampleOverlapMatrix
## $n11
##        trait1 trait2 trait3 trait4 trait5
## trait1   9048   4647   2985   2835   1812
## trait2   4647  13565   3873   4245   2419
## trait3   2985   3873  14681   6285   2044
## trait4   2835   4245   6285  16697   2059
## trait5   1812   2419   2044   2059   7121
## 
## $n00
##        trait1 trait2 trait3 trait4 trait5
## trait1  44683  35765  32987  30821  39374
## trait2  35765  40166  29358  27714  35464
## trait3  32987  29358  39050  28638  33973
## trait4  30821  27714  28638  37034  31972
## trait5  39374  35464  33973  31972  46610
## 
## $n10
##        trait1 trait2 trait3 trait4 trait5
## trait1      0   4401   6063   6213   7236
## trait2   8918      0   9692   9320  11146
## trait3  11696  10808      0   8396  12637
## trait4  13862  12452  10412      0  14638
## trait5   5309   4702   5077   5062      0

SampleOverlapMatrix is a list that contains an example of the sample overlap matrices for five different diseases in the Kaiser GERA cohort (a real data). The list constitutes of three matrices as follows. SampleOverlapMatrix$n11 provides the number of cases shared between all possible pairs of studies/traits. SampleOverlapMatrix$n00 provides the number of controls shared between all possible pairs of studies/traits. SampleOverlapMatrix$n10 provides the number of subjects who are case for one study/trait and control for another study/trait. For more detailed explanation, see the Arguments section of estimate_corln in the CPBayes manual.

# Estimate the correlation matrix of correlated beta-hat vector
n11 <- SampleOverlapMatrix$n11
n00 <- SampleOverlapMatrix$n00
n10 <- SampleOverlapMatrix$n10
cor <- estimate_corln(n11, n00, n10)
cor
##             trait1      trait2     trait3       trait4      trait5
## trait1 1.000000000 0.270490702 0.05723195  0.002505875  0.08989408
## trait2 0.270490702 1.000000000 0.01601813  0.002744961  0.07849158
## trait3 0.057231953 0.016018131 1.00000000  0.155476792  0.01211052
## trait4 0.002505875 0.002744961 0.15547679  1.000000000 -0.01824859
## trait5 0.089894085 0.078491585 0.01211052 -0.018248589  1.00000000

The function estimate_corln computes an approximate correlation matrix of the correlated beta-hat vector obtained from multiple overlapping case-control studies using the sample-overlap matrices. Note that for a cohort study, the phenotypic correlation matrix should be a reasonable substitute of this correlation matrix. These approximations of the correlation structure are accurate when none of the diseases/traits is associated with the environmental covariates and genetic variant. While demonstrating cpbayes_cor, we used simulated data for 10 overlapping case-control studies with each study having a distinct set of 7000 cases and a common set of 10000 controls shared across all the studies. We used the estimate_corln function to estimate the correlation matrix of the correlated beta-hat vector using the sample-overlap matrices.

Important note on the estimation of correlation structure of correlated beta-hat vector: In general, environmental covariates are expected to be present in a study and associated with the phenotypes of interest. Also, a small proportion of genome-wide genetic variants are expected to be associated. Hence the above approximations of the correlation matrix may not be accurate. So in general, we recommend an alternative strategy to estimate the correlation matrix using the genome-wide summary statistics data across traits as follows. First, extract all the SNPs for each of which the trait-specific univariate association p-value across all the traits are > 0.1. The trait-specific univariate association p-values are obtained using the beta-hat and standard error for each trait. Each of the SNPs selected in this way is either weakly or not associated with any of the phenotypes (null SNP). Next, select a set of independent null SNPs from the initial set of null SNPs by using a threshold of r^2 < 0.01 (r: the correlation between the genotypes at a pair of SNPs). In the absence of in-sample linkage disequilibrium (LD) information, one can use the reference panel LD information for this screening. Finally, compute the correlation matrix of the effect estimates (beta-hat vector) as the sample correlation matrix of the beta-hat vector across all the selected independent null SNPs. This strategy is more general and applicable to a cohort study or multiple overlapping studies for binary or quantitative traits with arbitrary distributions. It is also useful when the beta-hat vector for multiple non-overlapping studies become correlated due to genetically related individuals across studies. Misspecification of the correlation structure can affect the results produced by CPBayes to some extent. Hence, if genome-wide summary statistics data across traits is available, we recommend this alternative strategy to estimate the correlation matrix of the beta-hat vector.

See our paper for more details: Majumdar A, Haldar T, Bhattacharya S, Witte JS (2018) An efficient Bayesian meta analysis approach for studying cross-phenotype genetic associations. PLoS Genet 14(2): e1007139.