Introduction
Simultaneous analysis of genetic associations with multiple
phenotypes may reveal shared genetic susceptibility across traits
(pleiotropy). CPBayes is a Bayesian meta analysis method for studying
cross-phenotype genetic associations. It uses summary-level data across
multiple phenotypes to simultaneously measure the evidence of
aggregate-level pleiotropic association and estimate an optimal subset
of traits associated with the risk locus. CPBayes model is based on a
spike and slab prior.
This R-package consists of following functions:
- analytic_locFDR_BF_uncor: This function analytically computes the
local FDR & Bayes factor (BF) that quantifies the evidence of
aggregate-level pleiotropic association for uncorrelated summary
statistics.
- cpbayes_uncor: It implements CPBayes for uncorrelated summary
statistics to figure out the optimal subset of non-null traits
underlying a pleiotropic signal and other insights. The summary
statistics across traits/studies are uncorrelated when the studies have
no overlapping subjects.
- analytic_locFDR_BF_cor: This function analytically computes the
local FDR & Bayes factor (BF) that quantifies the evidence of
aggregate-level pleiotropic association for correlated summary
statistics.
- cpbayes_cor: It implements CPBayes for correlated summary statistics
to figure out the optimal subset of non-null traits underlying a
pleiotropic signal and other insights. The summary statistics across
traits/studies are correlated when the studies have overlapping subjects
or the phenotypes were measured in a cohort study.
- post_summaries: It summarizes the MCMC data produced by
cpbayes_uncor or cpbayes_cor. It computes additional summaries to
provide a better insight into a pleiotropic signal. It works in the same
way for both cpbayes_uncor and cpbayes_cor.
- forest_cpbayes: It creates a forest plot presenting the pleiotropy
result obtained by cpbayes_uncor or cpbayes_cor. It works in the same
way for both cpbayes_uncor and cpbayes_cor.
- estimate_corln: It computes an approximate correlation matrix of the
beta-hat vector for multiple overlapping case-control studies using the
sample-overlap matrices.
Installation
You can install CPBayes from CRAN.
install.packages("CPBayes")
library("CPBayes")
How to run CPBayes for uncorrelated summary statistics.
Load the example data.
library("CPBayes")
# Load the beta hat vector
BetaHatfile <- system.file("extdata", "BetaHat.rda", package = "CPBayes")
load(BetaHatfile)
BetaHat
## [1] 0.03402285 -0.01987781 -0.01088476 -0.00940975 0.02195188 -0.01728955
## [7] 0.14180403 0.06979527 -0.11721340 -0.09231879
BetaHat contains an example data of the main genetic effect (beta/log
odds ratio) estimates for a single nucleotide polymorphism (SNP)
obtained from 10 separate case-control studies for 10 different
diseases. In each case-control study comprising a distinct set of 7000
cases and 10000 controls, we fit a logistic regression of the
case-control status on the genotype coded as the minor allele count for
all the individuals in the sample. One can also include various
covariates, such as, age, gender, principal components (PCs) of
ancestries in the logistic regression. From each logistic regression for
a disease, we obtain the estimate of the main genetic association
parameter (beta/logodds ratio) along with the corresponding standard
error. Since the studies do not have any overlapping subject, beta-hat
across the diseases are uncorrelated.
# Load the standard error vector
SEfile <- system.file("extdata", "SE.rda", package = "CPBayes")
load(SEfile)
SE
## [1] 0.02736746 0.02741876 0.02749038 0.02747637 0.02749800 0.02753650
## [7] 0.02731022 0.02714103 0.02786046 0.02783711
SE contains the standard errors corresponding to the above beta hat
vector across 10 separate case-control studies.
Next we specify the name of the diseases/phenotypes and the genetic
variant.
# Specify the name of the traits and the genetic variant.
traitNames <- paste("Disease", 1:10, sep = "")
SNP1 <- "rs1234"
traitNames
## [1] "Disease1" "Disease2" "Disease3" "Disease4" "Disease5" "Disease6"
## [7] "Disease7" "Disease8" "Disease9" "Disease10"
SNP1
## [1] "rs1234"
Now, since the studies are non-overlapping, the summary statistics
across traits are uncorrelated. Here we run the analytic_locFDR_BF_uncor
function for this example data.
#Run analytic_locFDR_BF_uncor function to analytically compute locFDR and log10BF for uncorrelated summary statistics.
result <- analytic_locFDR_BF_uncor(BetaHat, SE)
str(result)
## List of 2
## $ locFDR : num 3.43e-06
## $ log10_BF: num 3.93
So, locFDR [result$locFDR] was analytically computed as 3.43*10^(-6)
and log10(Bayes factor) [result$log10_BF] was estimated as 3.93
indicating an aggregate-level pleiotropic association. While
analytically computing locFDR (BF), a fixed value of slab variance is
considered.
Next we implement CPBayes for this example data. Since the studies
are non-overlapping, the summary statistics across traits are
uncorrelated. Hence we run the cpbayes_uncor function.
# Run the uncorrelated version of CPBayes (based on MCMC).
result <- cpbayes_uncor(BetaHat, SE, Phenotypes = traitNames, Variant = SNP1, MCMCiter = 5500, Burnin = 500)
## Important traits with trait-specific posterior prob of assoc:
## traits PPAj
## 1 Disease7 1.00
## 2 Disease8 0.23
## 3 Disease9 0.97
## 4 Disease10 0.56
There are more options of arguments to pass into the function (see
the Arguments section of cpbayes_uncor in the CPBayes manual). After
running cpbayes_uncor, it prints the list of important traits for which
the trait-specific posterior probability of association (PPAj) > 20%.
However, the printed output is only a part of ‘result’ which is a list
that constitutes of various components. An overall summary of ‘result’
can be seen by using the str() function (as shown below).
# Overall summary of the primary results produced by cpbayes_uncor.
str(result)
## List of 7
## $ variantName : chr "rs1234"
## $ log10_BF : num 8.43
## $ locFDR : num 1.09e-10
## $ subset : chr [1:3] "Disease7" "Disease9" "Disease10"
## $ important_traits:'data.frame': 4 obs. of 2 variables:
## ..$ traits: chr [1:4] "Disease7" "Disease8" "Disease9" "Disease10"
## ..$ PPAj : num [1:4] 1 0.227 0.971 0.565
## $ auxi_data :List of 8
## ..$ traitNames : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ K : int 10
## ..$ mcmc.samplesize: num 5000
## ..$ PPAj : num [1:10] 0.0334 0.0188 0.0142 0.0132 0.0178 ...
## ..$ Z.data : num [1:5000, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ sim.beta : num [1:5000, 1:10] 0.007478 -0.00191 -0.003954 0.00095 -0.000388 ...
## ..$ betahat : num [1:10] 0.03402 -0.01988 -0.01088 -0.00941 0.02195 ...
## ..$ se : num [1:10] 0.0274 0.0274 0.0275 0.0275 0.0275 ...
## $ runtime : 'proc_time' Named num [1:5] 0.25 0.008 0.258 0 0
## ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
Here, result$variantName returns the name of the genetic variant
specified by the user. Here, it is ‘rs1234’. Estimated based on the MCMC
posterior sample, result$log10_BF provides the log10(Bayes factor) and
result$locFDR provides the local false discovery rate (posterior
probability of null association) evaluating the aggregate-level
pleiotropic association. These values were different from that obtained
by the analytical version (above). The main reason is that the locFDR
(log10BF) obtained by cpbayes_uncor() function are estimated based on
MCMC sample and a range of slab variance is considered. We recommend
using the locFDR (log10BF) value obtained from the analytical version.
Next, result$subset provides the optimal subset of associated/non-null
traits selected by CPBayes. CPBayes selected Disease7, Disease9, and
Disease10 as associated/non-null. It also gives a list of important
traits (important_traits) comprising phenotypes having PPAj > 20%.
Even if a phenotype is not selected in the optimal subset of non-null
traits, it can produce a non-negligible value of trait-specific
posterior probability of association (PPAj) and can be promising to be
pleiotropic. For example, Disease8 had a PPAj of 21% and was listed in
important_traits, but was not included in the optimal subset. A detailed
interpretation of all the outputs are described in the Value section of
cpbayes_uncor in the CPBayes manual.
The post_summaries function provides important insights into an
obseved pleiotropic signal, e.g. the direction of associations,
trait-specific posterior probability of associations (PPAj), posterior
mean/median and 95% credible interval (Bayesian analog of the confidence
interval) of the unknown true genetic effect (beta/odds ratio) on each
trait, etc.
# Post summary of the MCMC data produced by cpbayes_uncor.
PleioSumm <- post_summaries(result, level = 0.05)
str(PleioSumm)
## List of 9
## $ variantName : chr "rs1234"
## $ log10_BF : num 8.43
## $ locFDR : num 1.09e-10
## $ subset :'data.frame': 3 obs. of 3 variables:
## ..$ traits : chr [1:3] "Disease7" "Disease9" "Disease10"
## ..$ PPAj : num [1:3] 1 0.971 0.565
## ..$ direction: chr [1:3] "positive" "negative" "negative"
## $ important_traits :'data.frame': 4 obs. of 3 variables:
## ..$ traits : chr [1:4] "Disease7" "Disease8" "Disease9" "Disease10"
## ..$ PPAj : num [1:4] 1 0.227 0.971 0.565
## ..$ direction: chr [1:4] "positive" "positive" "negative" "negative"
## $ traitNames : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## $ PPAj :'data.frame': 10 obs. of 2 variables:
## ..$ traits: chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ PPAj : num [1:10] 0.0334 0.0188 0.0142 0.0132 0.0178 ...
## $ poste_summary_beta:'data.frame': 10 obs. of 6 variables:
## ..$ traits : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ poste_mean : num [1:10] 0.00491 -0.00269 -0.00151 -0.00137 0.00306 ...
## ..$ poste_median: num [1:10] 0.00423 -0.00231 -0.00133 -0.00134 0.00275 ...
## ..$ poste_se : num [1:10] 0.01189 0.01047 0.00984 0.0099 0.01051 ...
## ..$ lCl : num [1:10] -0.0143 -0.0226 -0.0203 -0.0207 -0.0157 ...
## ..$ uCl : num [1:10] 0.0275 0.0168 0.017 0.0177 0.0235 ...
## $ poste_summary_OR :'data.frame': 10 obs. of 5 variables:
## ..$ traits : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ poste_mean : num [1:10] 1.005 0.997 0.999 0.999 1.003 ...
## ..$ poste_median: num [1:10] 1.004 0.998 0.999 0.999 1.003 ...
## ..$ lCl : num [1:10] 0.986 0.978 0.98 0.979 0.984 ...
## ..$ uCl : num [1:10] 1.03 1.02 1.02 1.02 1.02 ...
So we have to pass the list ‘result’ returned by cpbayes_uncor as the
first argument and the ‘level’ as the second argument into the
post_summaries function. If ‘level’ is not specified, the default value
is 0.05. Note that post_summaries computes (1-level)% credible interval
of the unknown true genetic effect (beta/odds ratio) on each trait. It
estimates the direction of association with the important traits, the
vector of trait-specific posterior probability of association (PPAj),
etc. For detailed description of different outputs provided by this
function, see the Value section of post_summaries in the CPBayes
manual.
Next we run the forest_cpbayes function to create a forest plot that
presents the pleiotropy result produced by cpbayes_uncor.
# Forest plot for the pleiotropy result obtained by cpbayes_uncor.
forest_cpbayes(result, level = 0.05)
Similarly as for the post_summaries function, we need to pass the
same list `result’ returned by cpbayes_uncor as the first argument into
the function. Second argument is the level whose default value is 0.05.
In the forest plot, (1-level)% confidence interval of the beta/log odds
ratio parameter is plotted for each trait. For more details, please see
the section of forest_cpbayes function in the CPBayes manual.
How to run CPBayes for correlated summary statistics.
Next we demonstrate how to run CPBayes for correlated summary
statistics. Get the path to the data.
# Load the beta-hat vector
datafile <- system.file("extdata", "cBetaHat.rda", package = "CPBayes")
load(datafile)
cBetaHat
## [1] -0.042216290 -0.035624411 -0.058063641 -0.026037581 -0.014434543
## [6] -0.004717336 -0.043771014 -0.025293739 0.059845798 -0.177518725
Here, ‘c’ in cBetaHat stands for correlated case. cBetaHat contains
an example data of the main genetic association parameter (beta/log odds
ratio) estimates for a SNP across 10 overlapping case-control studies
for 10 different diseases. Each of the 10 studies has a distinct set of
7000 cases and a common set of 10000 controls shared across all the
studies. In each case-control study, we fit a logistic regression of the
case-control status on the genotype coded as the minor allele count for
all the individuals in the sample. One can also include various
covariates, such as, age, gender, principal components (PCs) of
ancestries in the logistic regression. From each logistic regression for
a disease, we obtain the estimate of the main genetic effect (beta/log
odds ratio) along with the corresponding standard error. Since the
studies have overlapping subjects, beta-hat across the diseases are
correlated.
# Load the standard error vector
datafile <- system.file("extdata", "cSE.rda", package = "CPBayes")
load(datafile)
cSE
## [1] 0.02386863 0.02389768 0.02404311 0.02395724 0.02383974 0.02376651
## [7] 0.02402079 0.02392416 0.02377087 0.02446383
cSE contains the standard errors corresponding to the above beta hat
vector across 10 overlapping case-control studies.
# Load the correlation matrix of the beta-hat vector (cBetaHat)
datafile <- system.file("extdata", "cor.rda", package = "CPBayes")
load(datafile)
cor
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0000000 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
## [2,] 0.4117647 1.0000000 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
## [3,] 0.4117647 0.4117647 1.0000000 0.4117647 0.4117647 0.4117647 0.4117647
## [4,] 0.4117647 0.4117647 0.4117647 1.0000000 0.4117647 0.4117647 0.4117647
## [5,] 0.4117647 0.4117647 0.4117647 0.4117647 1.0000000 0.4117647 0.4117647
## [6,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 1.0000000 0.4117647
## [7,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 1.0000000
## [8,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
## [9,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
## [10,] 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647 0.4117647
## [,8] [,9] [,10]
## [1,] 0.4117647 0.4117647 0.4117647
## [2,] 0.4117647 0.4117647 0.4117647
## [3,] 0.4117647 0.4117647 0.4117647
## [4,] 0.4117647 0.4117647 0.4117647
## [5,] 0.4117647 0.4117647 0.4117647
## [6,] 0.4117647 0.4117647 0.4117647
## [7,] 0.4117647 0.4117647 0.4117647
## [8,] 1.0000000 0.4117647 0.4117647
## [9,] 0.4117647 1.0000000 0.4117647
## [10,] 0.4117647 0.4117647 1.0000000
Since the summary statistics across traits are correlated, we run the
the analytic_locFDR_BF_cor function for this example data.
# Run analytic_locFDR_BF_cor function to analytically compute locFDR and log10BF for correlated summary statistics.
result <- analytic_locFDR_BF_cor(cBetaHat, cSE, cor)
str(result)
## List of 2
## $ locFDR : num 3.54e-10
## $ log10_BF: num 9.18
So the locFDR [result$locFDR] was analytically computed as
3.54*10^(-10) and log10(Bayes factor) [result$log10_BF] was estimated as
9.18 indicating an aggregate-level pleiotropic association.
The correlation matrix of the beta-hat vector (cBetaHat) is given by
‘cor’ which we estimated by employing the estimate_corln function
(demonstrated later in this tutorial) using the sample-overlap matrices
(explained later in this tutorial). Next we run the correlated version
of CPBayes based on MCMC for this example data.
# Run the correlated version of CPBayes.
result <- cpbayes_cor(cBetaHat, cSE, cor, Phenotypes = traitNames, Variant = SNP1, MCMCiter = 5500, Burnin = 500)
## Important traits with trait-specific posterior prob of assoc:
## traits PPAj
## 1 Disease9 0.9
## 2 Disease10 1.0
There are more options of arguments to pass into the function (see
the Arguments section of cpbayes_cor in the CPBayes manual). After
running cpbayes_cor, it prints the list of important traits for which
the trait-specific posterior probability of association (PPAj) > 20%.
However, the printed outputs are only a part of ‘result’ which is a list
that constitutes of various components. An overall summary of ‘result’
can be seen by using the str() function (as shown below).
# Overall summary of the primary results produced by cpbayes_cor.
str(result)
## List of 8
## $ variantName : chr "rs1234"
## $ log10_BF : num 20.9
## $ locFDR : num 6.85e-22
## $ subset : chr [1:2] "Disease9" "Disease10"
## $ important_traits:'data.frame': 2 obs. of 2 variables:
## ..$ traits: chr [1:2] "Disease9" "Disease10"
## ..$ PPAj : num [1:2] 0.898 1
## $ auxi_data :List of 8
## ..$ traitNames : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ K : int 10
## ..$ mcmc.samplesize: num 5000
## ..$ PPAj : num [1:10] 0.0092 0.0092 0.0202 0.0072 0.0074 0.011 0.012 0.0066 0.898 1
## ..$ Z.data : num [1:5000, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ sim.beta : num [1:5000, 1:10] -0.00172 -0.00438 -0.01626 0.00571 -0.01197 ...
## ..$ betahat : num [1:10] -0.0422 -0.0356 -0.0581 -0.026 -0.0144 ...
## ..$ se : num [1:10] 0.0239 0.0239 0.024 0.024 0.0238 ...
## $ uncor_use : chr "No"
## $ runtime : 'proc_time' Named num [1:5] 0.83 0 0.831 0 0
## ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
Here, result$variantName returns the name of the genetic variant
specified by the user. Here, it is ‘rs1234’. Estimated based on the MCMC
sample, result$log10_BF provides the log10(Bayes factor) and
result$locFDR provides the local false discovery rate (posterior
probability of null association) measuring the evidence of
aggregate-level pleiotropic association. Again, these values were
different from that obtained by the analytical version. The main reason
is that the locFDR (log10BF) obtained by cpbayes_cor() function are
estimated based on MCMC sample and a range of slab variance is
considered. We recommend using the locFDR (log10BF) value obtained from
the analytical version. However, for large number of traits (say >
25), analytic_locFDR_BF_cor may be slow to run. Next, result$subset
provides the optimal subset of associated traits selected by CPBayes. A
detailed interpretation of all the outputs are described in the Value
section of cpbayes_cor in the CPBayes manual.
The post_summaries function provides important insights into an
observed pleiotropic signal, e.g., the direction of associations,
trait-specific posterior probability of associations (PPAj), posterior
mean/median and 95% credible interval (Bayesian analog of the confidence
interval) of the unknown true genetic effect (beta/odds ratio) on each
trait, etc.
# Post summary of the MCMC data produced by cpbayes_cor.
PleioSumm <- post_summaries(result, level = 0.05)
str(PleioSumm)
## List of 9
## $ variantName : chr "rs1234"
## $ log10_BF : num 20.9
## $ locFDR : num 6.85e-22
## $ subset :'data.frame': 2 obs. of 3 variables:
## ..$ traits : chr [1:2] "Disease9" "Disease10"
## ..$ PPAj : num [1:2] 0.898 1
## ..$ direction: chr [1:2] "positive" "negative"
## $ important_traits :'data.frame': 2 obs. of 3 variables:
## ..$ traits : chr [1:2] "Disease9" "Disease10"
## ..$ PPAj : num [1:2] 0.898 1
## ..$ direction: chr [1:2] "positive" "negative"
## $ traitNames : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## $ PPAj :'data.frame': 10 obs. of 2 variables:
## ..$ traits: chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ PPAj : num [1:10] 0.0092 0.0092 0.0202 0.0072 0.0074 0.011 0.012 0.0066 0.898 1
## $ poste_summary_beta:'data.frame': 10 obs. of 6 variables:
## ..$ traits : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ poste_mean : num [1:10] -0.00413 -0.0026 -0.00811 -0.00022 0.00235 ...
## ..$ poste_median: num [1:10] -0.004053 -0.002547 -0.007689 -0.000236 0.002184 ...
## ..$ poste_se : num [1:10] 0.0092 0.00912 0.01021 0.00903 0.00907 ...
## ..$ lCl : num [1:10] -0.0222 -0.0205 -0.028 -0.018 -0.0157 ...
## ..$ uCl : num [1:10] 0.01341 0.01511 0.00939 0.01756 0.01977 ...
## $ poste_summary_OR :'data.frame': 10 obs. of 5 variables:
## ..$ traits : chr [1:10] "Disease1" "Disease2" "Disease3" "Disease4" ...
## ..$ poste_mean : num [1:10] 0.996 0.997 0.992 1 1.002 ...
## ..$ poste_median: num [1:10] 0.996 0.997 0.992 1 1.002 ...
## ..$ lCl : num [1:10] 0.978 0.98 0.972 0.982 0.984 ...
## ..$ uCl : num [1:10] 1.01 1.02 1.01 1.02 1.02 ...
Note that, post_summaries works exactly in the same way for both
cpbayes_cor and cpbayes_uncor. For detailed description of different
outputs provided by post_summaries, see the Value section of
post_summaries in the CPBayes manual.
Next we run the forest_cpbayes function to create a forest plot that
presents the pleiotropy result produced by cpbayes_cor.
# Forest plot for the pleiotropy result obtained by cpbayes_cor.
forest_cpbayes(result, level = 0.05)
Note that, forest_cpbayes works exactly in the same way for both
cpbayes_cor and cpbayes_uncor. For more details, see the section of
forest_cpbayes function in the CPBayes manual.
How to run estimate_corln.
The function estimate_corln estimates the correlation matrix of the
beta-hat vector for multiple overlapping case-control studies using the
sample-overlap matrices which describe the number of cases or controls
shared between studies/traits, and the number of subjects who are case
for one study/trait but control for another study/trait. For a cohort
study, the phenotypic correlation matrix should be a reasonable
substitute of this correlation matrix.
# Example data of sample-overlap matrices
SampleOverlapMatrixFile <- system.file("extdata", "SampleOverlapMatrix.rda", package = "CPBayes")
load(SampleOverlapMatrixFile)
SampleOverlapMatrix
## $n11
## trait1 trait2 trait3 trait4 trait5
## trait1 9048 4647 2985 2835 1812
## trait2 4647 13565 3873 4245 2419
## trait3 2985 3873 14681 6285 2044
## trait4 2835 4245 6285 16697 2059
## trait5 1812 2419 2044 2059 7121
##
## $n00
## trait1 trait2 trait3 trait4 trait5
## trait1 44683 35765 32987 30821 39374
## trait2 35765 40166 29358 27714 35464
## trait3 32987 29358 39050 28638 33973
## trait4 30821 27714 28638 37034 31972
## trait5 39374 35464 33973 31972 46610
##
## $n10
## trait1 trait2 trait3 trait4 trait5
## trait1 0 4401 6063 6213 7236
## trait2 8918 0 9692 9320 11146
## trait3 11696 10808 0 8396 12637
## trait4 13862 12452 10412 0 14638
## trait5 5309 4702 5077 5062 0
SampleOverlapMatrix is a list that contains an example of the sample
overlap matrices for five different diseases in the Kaiser GERA cohort
(a real data). The list constitutes of three matrices as follows.
SampleOverlapMatrix$n11 provides the number of cases shared between all
possible pairs of studies/traits. SampleOverlapMatrix$n00 provides the
number of controls shared between all possible pairs of studies/traits.
SampleOverlapMatrix$n10 provides the number of subjects who are case for
one study/trait and control for another study/trait. For more detailed
explanation, see the Arguments section of estimate_corln in the CPBayes
manual.
# Estimate the correlation matrix of correlated beta-hat vector
n11 <- SampleOverlapMatrix$n11
n00 <- SampleOverlapMatrix$n00
n10 <- SampleOverlapMatrix$n10
cor <- estimate_corln(n11, n00, n10)
cor
## trait1 trait2 trait3 trait4 trait5
## trait1 1.000000000 0.270490702 0.05723195 0.002505875 0.08989408
## trait2 0.270490702 1.000000000 0.01601813 0.002744961 0.07849158
## trait3 0.057231953 0.016018131 1.00000000 0.155476792 0.01211052
## trait4 0.002505875 0.002744961 0.15547679 1.000000000 -0.01824859
## trait5 0.089894085 0.078491585 0.01211052 -0.018248589 1.00000000
The function estimate_corln computes an approximate correlation
matrix of the correlated beta-hat vector obtained from multiple
overlapping case-control studies using the sample-overlap matrices. Note
that for a cohort study, the phenotypic correlation matrix should be a
reasonable substitute of this correlation matrix. These approximations
of the correlation structure are accurate when none of the
diseases/traits is associated with the environmental covariates and
genetic variant. While demonstrating cpbayes_cor, we used simulated data
for 10 overlapping case-control studies with each study having a
distinct set of 7000 cases and a common set of 10000 controls shared
across all the studies. We used the estimate_corln function to estimate
the correlation matrix of the correlated beta-hat vector using the
sample-overlap matrices.
Important note on the estimation of correlation structure
of correlated beta-hat vector: In general, environmental
covariates are expected to be present in a study and associated with the
phenotypes of interest. Also, a small proportion of genome-wide genetic
variants are expected to be associated. Hence the above approximations
of the correlation matrix may not be accurate. So in general, we
recommend an alternative strategy to estimate the correlation matrix
using the genome-wide summary statistics data across traits as follows.
First, extract all the SNPs for each of which the trait-specific
univariate association p-value across all the traits are > 0.1. The
trait-specific univariate association p-values are obtained using the
beta-hat and standard error for each trait. Each of the SNPs selected in
this way is either weakly or not associated with any of the phenotypes
(null SNP). Next, select a set of independent null SNPs from the initial
set of null SNPs by using a threshold of r^2 < 0.01 (r: the
correlation between the genotypes at a pair of SNPs). In the absence of
in-sample linkage disequilibrium (LD) information, one can use the
reference panel LD information for this screening. Finally, compute the
correlation matrix of the effect estimates (beta-hat vector) as the
sample correlation matrix of the beta-hat vector across all the selected
independent null SNPs. This strategy is more general and applicable to a
cohort study or multiple overlapping studies for binary or quantitative
traits with arbitrary distributions. It is also useful when the beta-hat
vector for multiple non-overlapping studies become correlated due to
genetically related individuals across studies. Misspecification of the
correlation structure can affect the results produced by CPBayes to some
extent. Hence, if genome-wide summary statistics data across traits is
available, we recommend this alternative strategy to estimate the
correlation matrix of the beta-hat vector.
See our paper for more details: Majumdar A, Haldar T, Bhattacharya S,
Witte JS (2018) An efficient Bayesian meta analysis approach for
studying cross-phenotype genetic associations. PLoS Genet 14(2):
e1007139.