Finding reference allele for mendelian randomisation

I am trying to perform an MR using summary statistics from this GWAS: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7026164/#MOESM1
Unfortunately, the summary statistics in the supplement only give an A1 allele. They do not give an A2 reference allele or EAF, so I am unable to harmonise the data with my outcome data.
I am using the TwoSampleMR package in R with the following code:
x <- harmonise_data(
  exposure_dat = exposure_dat,
  outcome_dat = outcome_dat_all,
  action = 1
)
and I am getting the error "Error in A2[to_swap] <- A1[to_swap] :
NAs are not allowed in subscripted assignments"
I believe this is because it requires an A2 allele in the exposure dataset. Is there any way I can perform the MR without it? Alternatively, can anybody suggest how I can quickly find all of the reference alleles? There are around 400 SNPs, so searching for them individually would not be ideal.
Thanks, I would appreciate any help.

I think you can fetch A2 from the column uniqID from the supplementary information. If they have not provided EAF, you can potentially use the 1000 Genomes data to calculate it.
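As a rough illustration of pulling A2 out of an ID column, here is a minimal base-R sketch. It assumes, purely for the example, that uniqID looks like "chr:pos_allele_allele"; the real layout in the supplement may differ, and the toy rows below are made up.

```r
# Hypothetical toy rows; the "chr:pos_allele_allele" layout is an assumption,
# so check the real uniqID column before reusing this.
exposure <- data.frame(
  SNP    = c("rs1", "rs2"),
  uniqID = c("1:1000_A_G", "2:2000_C_T"),
  A1     = c("G", "C"),
  stringsAsFactors = FALSE
)

# Split each ID on ":" and "_" and keep the two allele fields (3rd and 4th).
parts   <- strsplit(exposure$uniqID, "[:_]")
alleles <- t(vapply(parts, function(p) toupper(p[3:4]), character(2)))

# A2 is whichever of the two alleles is not the effect allele A1.
exposure$A2 <- ifelse(alleles[, 1] == toupper(exposure$A1),
                      alleles[, 2], alleles[, 1])
```

With A2 filled in, the exposure data frame has both allele columns that harmonise_data() was complaining about.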

Finding summary statistics. Struggling with having anything work after importing data into R from Excel

Very new to R here, also very new to the idea of coding and computer stuff.
Second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the chart of data and tried to follow along with his verbal instructions during class, but I am one of the only non-computer science backgrounds in my degree program (I am an RN going for degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on where to start with his list of tasks. I downloaded his data into an Excel file and then loaded it into R, where it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. I understand that I need to convert these columns into some sort of vector, but every website tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get started would be greatly appreciated. I've included a screenshot of his instructions and of my matrix. Please excuse my ignorance and lack of familiarity compared to most of you here. This is my second week into my master's and I hope to pick this up soon; I am just not there yet.
The instructions include:
# * Import the dataset
# * Summarize the dataset. Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate smokers and age.level data with the variable and its frequency. How many smokers are in each age category?
# * Subset the dataset by the mothers that smoke and weigh less than 100 kg. How many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram

Stack Overflow is not a place for homework, but I feel your pain. Let's go through it piece by piece.
First, let's load a package that helps with these tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) # again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1. Summarize the dataset. Here you'll see the ranges, means, medians and other interesting data of your table.
summary(dt)
1A. Mean and standard deviation of age, height and weight (replace age with height or weight to get the other two):
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
2. Tabulate smokers and age.level. This gets the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3. Subset smoking mothers with weight < 100 (I'm assuming non-pregnant mothers have NA in the gestation field; adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4. This is the same as 1A.
5. Plot a histogram (you don't specify which variable, so let's say it's age):
hist(dt$age)
Keep on studying R; it's not that difficult. The book recommended in the comments is a very good start.
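If data.table is not available, the same steps can be sketched in base R. The data frame below is a made-up stand-in for the professor's file, and the column names (age, height, weight, smoke, age.level) are assumed from the task list and the code above.

```r
# Toy data standing in for the real file; values and column names are assumptions.
df <- data.frame(
  age       = c(25, 31, 38, 42, 29, 35),
  height    = c(160, 165, 170, 158, 172, 168),
  weight    = c(55, 102, 70, 64, 98, 110),
  smoke     = c(1, 0, 1, 1, 0, 1),
  age.level = c("20-29", "30-39", "30-39", "40-49", "20-29", "30-39")
)

vars <- c("age", "height", "weight")

# Mean and standard deviation of all three columns in one call.
stats <- sapply(df[vars], function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
})

# Cross-tabulate smoking status against age category.
smoke_by_age <- table(smoke = df$smoke, age.level = df$age.level)

# Count smoking mothers who weigh less than 100 kg.
n_light_smokers <- sum(df$smoke == 1 & df$weight < 100, na.rm = TRUE)
```

Printing stats gives a small matrix with one column per variable, which covers tasks 1A and 4 in one go.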

Error during harmonisation in TwoSampleMR (R-package)

I am trying to perform Mendelian Randomisation using the R package “TwoSampleMR”.
As exposure data, I use instruments from the GWAS catalog (phenotype: Sphingolipid levels).
As outcome data, I use the GISCOME ischemic stroke outcome GWAS (http://www.kp4cd.org/index.php/node/391).
I get an error when I run the harmonisation with harmonise_data().
The text of the error is:
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 1, 0
I have noticed that the error is caused by specific lines in the outcome file. When I make a text file that contains only one line from the original file and use it as outcome data, some lines cause an error and others don't.
As an example, this one causes an error:
MarkerName CHR POS Allele1 Allele2 Freq1 Effect StdErr P-value
rs10938494 4 47563448 a g 0.2139 0.0294 0.0519 0.5706
This one doesn’t:
rs1000778 11 61655305 a g 0.2559 0.0939 0.0493 0.05705
Here are all the commands that I use:
library(TwoSampleMR)
library(MRInstruments)
data(gwas_catalog)
exp <- subset(gwas_catalog, grepl("Sphingolipid levels", Phenotype))
exp_dat <- format_data(exp)
exp_dat <- clump_data(exp_dat)
exp_dat
out_dat <- read_outcome_data(
  snps = exp_dat$SNP,
  filename = 'giscome.012vs3456.age-gender-5PC.meta1.txt',
  sep = '\t',
  snp_col = 'MarkerName',
  beta_col = 'Effect',
  se_col = 'StdErr',
  effect_allele_col = 'Allele1',
  other_allele_col = 'Allele2',
  eaf_col = 'Freq1',
  pval_col = 'P-value'
)
dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
What would be the reason for this problem?
Thank you.

It is difficult to comment without seeing your input file, but you might encounter this sort of error when the exposure columns in your data frame are named inconsistently.
Please see this thread on GitHub:
https://github.com/MRCIEU/TwoSampleMR/issues/226
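To find out which outcome rows trigger the error without building one-line files by hand, a loop with tryCatch() can record the failures. This is only a sketch: check_row() is a hypothetical stand-in that fails for one marker, and in practice you would replace its body with the real harmonisation call on a single SNP.

```r
# Two outcome rows copied from the question.
out <- data.frame(
  MarkerName = c("rs10938494", "rs1000778"),
  Allele1    = c("a", "a"),
  Allele2    = c("g", "g"),
  Effect     = c(0.0294, 0.0939),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real per-SNP call (not part of TwoSampleMR).
# It fails for one marker so that the loop has something to catch.
check_row <- function(row) {
  if (row$MarkerName == "rs10938494") {
    stop("arguments imply differing number of rows: 1, 0")
  }
  TRUE
}

# Run the check row by row; keep the error message, or NA when the row passes.
errors <- vapply(seq_len(nrow(out)), function(i) {
  tryCatch({
    check_row(out[i, ])
    NA_character_
  }, error = function(e) conditionMessage(e))
}, character(1))

failing <- out$MarkerName[!is.na(errors)]
```

Comparing the failing rows with the exposure data (for example alleles and SNP IDs) may then show what they have in common.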

What is the best way to manage/store results from either posthoc.kruskal.dunn.test() or dunn.test() when my input data is in data frame format?

I am a newbie in R programming and seek help in analyzing metabolomics data: 118 metabolites with 4 conditions (3 replicates per condition). I would like to know, for each metabolite, which conditions differ significantly from which. Here is part of my data:
> head(mydata)
  Conditions   HMDB03331   HMDB00699   HMDB00606    HMDB00707    HMDB00725    HMDB00017  HMDB01173
1 DMSO_BASAL 0.001289121 0.001578235 0.001612297 0.0007772231 3.475837e-06 0.0001221674 0.02691318
2 DMSO_BASAL 0.001158363 0.001413287 0.001541713 0.0007278363 3.345166e-04 0.0001037669 0.03471329
3 DMSO_BASAL 0.001043537 0.002380287 0.001240891 0.0008595932 4.007387e-04 0.0002033625 0.07426482
4   DMSO_G30 0.001195253 0.002338346 0.002133992 0.0007924157 4.189224e-06 0.0002131131 0.05000778
5   DMSO_G30 0.001511538 0.002264779 0.002535853 0.0011580857 3.639661e-06 0.0001700157 0.02657079
6   DMSO_G30 0.001554804 0.001262859 0.002047611 0.0008419137 6.350990e-04 0.0000851638 0.04752020
This is what I have so far.
I learned the first line from this post:
kwtest_pvl = apply(mydata[,-1], 2, function(x) kruskal.test(x, as.factor(mydata$Conditions))$p.value)
and this is where I loop through the metabolites that passed the KW test:
tCol = colnames(mydata[,-1])[kwtest_pvl <= 0.05]
for (k in tCol){
  output = posthoc.kruskal.dunn.test(mydata[,k], as.factor(mydata$Conditions), p.adjust.method = "BH")
}
I am not sure how to organise my output so that it is easy to handle for all the metabolites that passed the KW test. Perhaps by saving the output from each iteration and appending it to an Excel file? I also tried the dunn.test package since it has an option for table or list output. However, it still leaves me at the same point, so I am rather stuck here.
Moreover, should I also apply some kind of p-value adjustment (i.e. FWER, FDR, BH) right after the KW test, before performing the post hoc test?
Any suggestion(s) would be greatly appreciated.
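One possible way to keep the results is to collect one long data frame with a row per metabolite and pair of conditions, instead of overwriting output in the loop. The sketch below uses made-up data and base R's pairwise.wilcox.test() as a stand-in for the Dunn test, so it runs without extra packages. It is not the same test, but it also returns a matrix of pairwise p-values in $p.value. The BH adjustment of the Kruskal-Wallis p-values across metabolites is shown as one option, not as a recommendation.

```r
# Toy data: 4 conditions x 3 replicates, 2 hypothetical metabolites (M1, M2).
mydata <- data.frame(
  Conditions = rep(c("A", "B", "C", "D"), each = 3),
  M1 = c(1.0, 1.2, 1.1, 2.0, 2.3, 2.1, 3.0, 3.2, 3.1, 4.0, 4.1, 4.3),
  M2 = c(2.0, 3.1, 1.5, 2.2, 1.4, 3.0, 1.6, 2.9, 2.1, 3.2, 1.7, 2.3)
)
grp <- as.factor(mydata$Conditions)

# Kruskal-Wallis p-value per metabolite, then BH adjustment across metabolites.
kw_p   <- sapply(mydata[, -1], function(x) kruskal.test(x, grp)$p.value)
kw_adj <- p.adjust(kw_p, method = "BH")
keep   <- names(kw_adj)[kw_adj <= 0.05]

# Post hoc step: pairwise.wilcox.test() is a stand-in for the Dunn test here.
posthoc <- lapply(keep, function(k) {
  pm   <- pairwise.wilcox.test(mydata[[k]], grp, p.adjust.method = "BH")$p.value
  long <- as.data.frame(as.table(pm), stringsAsFactors = FALSE)
  names(long) <- c("group1", "group2", "p.adj")
  long <- long[!is.na(long$p.adj), ]   # drop the empty upper triangle
  cbind(metabolite = k, long)
})

# One table for all metabolites.
results <- do.call(rbind, posthoc)
```

The results table can then be written out in one step, for example with write.csv(results, "posthoc.csv", row.names = FALSE), and opened in Excel.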

Juxtaposing Replicate Data

I have provided a sample dataset that I have arranged in column format (called "full.table").
These data were extracted from a 96-well PCR plate, and while collecting my data I always ran a duplicate experiment, meaning each variable (aka test) has 1 replicate. I would like to take all replicates and juxtapose them (have them side by side), which would allow me to easily visualize replicates next to each other and then calculate an average value for the variable "Cq" between the two.
The complications stem from having done multiple tests over several days (complication one), and NOT always having my samples run in the same layout on the PCR plate (complication two). Typically, as you see in my dataset below, Well A1 has a duplicate in Well B1; however, this is not always the case. Occasionally, Well A7 matches Well A8 (and NOT B7).
Replicates were always run on the same day, so an important variable here is “date”, which I added via R before uploading to Stack Exchange. I am confused about how to rearrange the data to get my desired result (I am not even sure where to start).
I have provided an example of what I would like in the end, called “sample.finished.table”.
Logically, with 768 observations in this example, pairing them should halve the table, resulting in 384 total lines of data (385 with the header).
I appreciate any feedback. Thank you.
full.table<- read.table("https://pastebin.com/raw/kTQhuttv", header=T, sep="")
sample.finished.table <- read.table("https://pastebin.com/raw/Phg7C9xD", header=T, sep="")

You can use dplyr here to group by sample and date and extract the requested values:
library(dplyr)
full.table %>%
  group_by(sample, date) %>%
  summarise(
    Well1 = first(Well), Cq1 = first(Cq),
    Well2 = last(Well), sample1 = last(sample), Cq2 = last(Cq),
    Cq_mean = mean(Cq[Cq > 0])
  )
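For comparison, the same pairing can be sketched in base R with split() and lapply(). The four-row full.table below is a made-up stand-in for the pastebin data; its column names (Well, sample, date, Cq) are taken from the dplyr code above, and the values are invented.

```r
# Toy stand-in for full.table; values are invented for illustration.
full.table <- data.frame(
  Well   = c("A1", "B1", "A7", "A8"),
  sample = c("s1", "s1", "s2", "s2"),
  date   = c("2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"),
  Cq     = c(20.1, 20.5, 25.0, 24.6),
  stringsAsFactors = FALSE
)

# Split the rows into replicate pairs by sample and date.
pairs <- split(full.table, list(full.table$sample, full.table$date), drop = TRUE)

# Put the first and last row of each pair side by side and average Cq.
paired <- do.call(rbind, lapply(pairs, function(p) {
  data.frame(
    sample  = p$sample[1],
    date    = p$date[1],
    Well1   = p$Well[1],       Cq1 = p$Cq[1],
    Well2   = p$Well[nrow(p)], Cq2 = p$Cq[nrow(p)],
    Cq_mean = mean(p$Cq[p$Cq > 0])   # same Cq > 0 filter as the dplyr version
  )
}))
rownames(paired) <- NULL
```

Because the grouping uses sample and date rather than well position, it does not matter whether a replicate sits in B1 or A8.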

ANOVA in R using summary data

Is it possible to run an ANOVA in R with only means, standard deviations and n-values? Here is my data frame:
q2data.mean <- c(90,85,92,100,102,106)
q2data.sd <- c(9.035613,11.479667,9.760268,7.662572,9.830258,9.111457)
q2data.n <- c(9,9,9,9,9,9)
q2data.frame <- data.frame(q2data.mean,q2data.sd,q2data.n)
I am trying to find the mean square residual, so I want to take a look at the ANOVA table.
Any help would be really appreciated! :)

Here you go, using ind.oneway.second from the rpsychi package:
library(rpsychi)
with(q2data.frame, ind.oneway.second(q2data.mean,q2data.sd,q2data.n) )
#$anova.table
#                SS df     MS     F
#Between (A) 2923.5  5 584.70 6.413
#Within      4376.4 48  91.18
#Total       7299.9 53
# etc etc
Update: the rpsychi package was archived in March 2022 but the function is still available here: http://github.com/cran/rpsychi/blob/master/R/ind.oneway.second.R (hat-tip to @jrcalabrese in the comments)
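Since the package is archived, the one-way ANOVA table can also be rebuilt by hand from the summary statistics with the standard sums-of-squares formulas. A minimal base-R sketch, using shorter variable names than the question (m, s, n are just illustrative names):

```r
# Group means, standard deviations and sizes from the question.
m <- c(90, 85, 92, 100, 102, 106)
s <- c(9.035613, 11.479667, 9.760268, 7.662572, 9.830258, 9.111457)
n <- c(9, 9, 9, 9, 9, 9)

# Between-group sum of squares: weighted squared deviations from the grand mean.
grand_mean <- sum(n * m) / sum(n)
ss_between <- sum(n * (m - grand_mean)^2)

# Within-group sum of squares: pooled from the group variances.
ss_within <- sum((n - 1) * s^2)

df_between <- length(m) - 1
df_within  <- sum(n) - length(m)

ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within      # the mean square residual
f_stat     <- ms_between / ms_within
p_value    <- pf(f_stat, df_between, df_within, lower.tail = FALSE)
```

These values agree with the table above: ss_between is 2923.5, ms_within rounds to 91.18 and f_stat rounds to 6.413.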
As an unrelated side note, your data could do with some renaming. q2data.frame is a data.frame, so there is no need to put that in the name. There is also no need to repeat q2data in the column names inside q2data.frame; surely mean would suffice. Otherwise you end up with verbose code like:
q2data.frame$q2data.mean
when:
q2$mean
would give you all the info you need.