I have two cohorts, so I ran a linear regression separately per cohort and used a for loop so that my coefficients were saved per cohort. I now want to get a pooled estimate per SNP, but I have 53 SNPs, so I would prefer not to type out all the coefficients by hand. Is there a way to write a for loop around the rma command from metafor?
So far I've come as far as thinking that it's probably easiest to merge my two coefficient files together. I've called this coeffs. The first column has the SNP names, the 2nd and 6th columns have the betas from cohort 1 and 2, respectively and columns 3 and 7 have the standard errors from the two cohorts.
So I want to make an object beta that holds the betas from cohort 1 and cohort 2 for one SNP, and do the same for se. I then want to run rma(beta,se) per SNP so I can export the results to Excel.
So far I thought of doing the following (but it doesn't work):
output3 <- data.frame(matrix(nrow=84,ncol=3))
names(output3)=c("Pooled Estimate", "Pooled Std.Error", "P-value")
for(l in 3:84){
beta <-c(output3[l,2], output3[l,6])
se <-c(output3[l,3], output3[l,7])
pool <- rma(beta,se)
}
When I run the rma I get the following error message:
Error in [[<-.data.frame(*tmp*, l, value = list(b =
-0.105507438518734, : replacement has 70 rows, data has 84
If I change nrow to 70, then I don't get the information. From the rma output I want the second row and columns 1,2 and 4. I think this is going wrong somewhere.
I figured out my mistake: I forgot to tell R which rows of data I needed and where the results should be saved.
For anyone else with this problem, here is my script which worked. I first created my data.frame where I wanted the data to be saved.
output3 <- data.frame(matrix(nrow=84,ncol=3))
names(output3)=c("Pooled Estimate", "Pooled Std.Error", "P-value")
Next I made a for loop to extract the betas and s.e's from each SNP
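library(metafor)
# Label the output rows once, before the loop, instead of on every iteration
z3 <- colnames(qcwomenc[1:84])
row.names(output3) <- z3
for(l in 3:84){
beta <- c(coeffs[l,2], coeffs[l,6])
se <- c(coeffs[l,3], coeffs[l,7])
# rma() expects sampling variances as its second argument, hence se^2
pool <- rma(beta, se^2)
res <- coef(summary(pool))
output3[l,1] <- res[1,1] # pooled estimate
output3[l,2] <- res[1,2] # pooled standard error
output3[l,3] <- res[1,4] # p-value
}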
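For comparison, here is a minimal loop-free sketch of the same idea in base R. It assumes a fixed-effect, inverse-variance pooling, which is not what rma() does by default (its default is a random-effects model); as far as I know, rma(beta, se^2, method = "FE") is the metafor call this corresponds to. The coeffs table and its column names below are made up for illustration.

```r
# Hypothetical stand-in for the merged coefficient table (names are assumptions)
coeffs <- data.frame(
  snp   = c("rs1", "rs2", "rs3"),
  beta1 = c(0.10, -0.20, 0.05),
  se1   = c(0.10,  0.05, 0.20),
  beta2 = c(0.30, -0.10, 0.15),
  se2   = c(0.10,  0.05, 0.10)
)

# Inverse-variance weights: each cohort is weighted by 1 / se^2
w1 <- 1 / coeffs$se1^2
w2 <- 1 / coeffs$se2^2

# Fixed-effect pooled estimate and its standard error, one value per SNP
pooled_beta <- (w1 * coeffs$beta1 + w2 * coeffs$beta2) / (w1 + w2)
pooled_se   <- sqrt(1 / (w1 + w2))

# Two-sided p-value from the normal approximation
p_value <- 2 * pnorm(-abs(pooled_beta / pooled_se))

pooled <- data.frame(snp = coeffs$snp,
                     estimate = pooled_beta,
                     se = pooled_se,
                     p_value = p_value)
pooled
```

Because everything is vectorised over the rows, there is no need to pre-allocate an output data.frame or to track row indices, and the result can be passed straight to write.csv().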
You say it is not working but you don't say what exactly. I think what you are missing is something to save your results. Like a list or data.frame. The variable pool gets updated on each iteration of the loop so after the loop is through it will contain only the last model. Also, your indices do not match the example data.frame as you are referring to column 6 and 7 which do not exist. But I guess they do exist in your actual data.frame. Also your example data.frame is full of NA values. Maybe try like this:
output3 <- data.frame(matrix(runif(84*4), nrow=84,ncol=4))
names(output3)=c("se1", "beta1", "se2", "beta2")
modellist <- list()
for(l in 3:84){
beta <-c(output3[l,2], output3[l,4])
se <-c(output3[l,1], output3[l,3])
pool <- sum(beta, se)
modellist[[l]] <- pool
}
modellist
Note: I used sum() instead of your rma() because I don't know this function and I don't know what package it's from.
I am a noob at programming, sorry if this is a silly question.
My supervisor doesn't seem to trust the set.seed() function in R, as every seed yields a different output (with different test and train sets). So she asked me to specify the row ranges for my training and test datasets.
I am conducting a Binary logistic regression model in R with a sample size of 1790. There are 8 independent variables in my model. I want to do a 70/30 split for train and test data. I did it using these lines of code the first time:
RLV <- read.csv(file.choose(), header = T)
set.seed(123)
index <- sample(2, nrow(RLV), replace = T, prob = c(0.7, 0.3))
train <- RLV[index == 1,]
test <- RLV[index == 2,]
But if I change 123 into say 1234, the output is similar but not exactly the previous one (and yes I know that's the point).
But according to my supervisor, she wants me to train using the data obtained in Day 1 and Day 2 and test(validate) using the data of Day 3 (That was my initial plan as well).
Thus after intense brainstorming I came up with these lines of code...
RLV <- read.csv(file.choose(), header = T)
train <- RLV[1:1253,]
test <- RLV[1254:1790,]
head(test)
I want all the rows from 1 to 1253 (all columns too) in my train dataset and from 1254 to 1790 in my test(validation) dataset.
I checked using the head function and it does seem to work. But I am on the fence here. Can someone please clarify how this works? Or please if its even right (lol). I just want to complete this project without any hassle.
Thanks a bunch.
As you said: It does work. head() shows you the first six rows of a dataframe. So you should get rows 1254, 1255, 1256, 1257, 1258, and 1259 from your 'test set' after head(test).
It works because if you index a dataframe with [,], everything before the comma specifies row restrictions and everything after the comma specifies column restrictions. You indexed by row number. It would also be possible to index by a logical vector. For example, RLV[RLV$Day %in% 1:2,] would give you all cases from RLV where (the hypothetical) column Day holds the value 1 or 2.
If this doesn't answer your question(s), please specify what you mean by "how this works" and "if it's even right" ;)
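To make the two indexing styles concrete, here is a small sketch on a made-up data frame. RLV_demo and its Day column are assumptions for illustration only; the real RLV may not have such a column.

```r
# Toy data: 10 observations collected over three days (hypothetical)
RLV_demo <- data.frame(
  Day = rep(1:3, times = c(4, 3, 3)),
  y   = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0)
)

# Split by row position: rows before the comma, all columns after it
train_by_row <- RLV_demo[1:7, ]
test_by_row  <- RLV_demo[8:10, ]

# Split by a logical condition on the (hypothetical) Day column
train_by_day <- RLV_demo[RLV_demo$Day %in% 1:2, ]
test_by_day  <- RLV_demo[RLV_demo$Day == 3, ]

# Both approaches select the same rows when the data are sorted by day
nrow(train_by_day)
nrow(test_by_day)
```

The logical version does not depend on knowing that Day 3 starts at a particular row number, so it keeps working if rows are added or the file is re-sorted.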
I’m new to programming and I’m currently writing a function to go through hundreds of csv files in the working directory.
The files have tons of NA values in it.
The function (which I call corr) has two parameters: the directory and a threshold value (a numeric vector of length 1 giving the required number of complete cases).
The purpose of the function is to take the complete cases for two columns that are sulfate and nitrate(second and third column in the spreadsheet) and calculate the correlation between them if the number of complete cases is greater than the threshold parameter.
The function should return a vector with the correlation if it met the threshold requirement (the default threshold value is 0).
When I run the code I get one of the following:
1. A + sign in the console
OR
2. The objects I created in the function can't be found.
Any help would be much appreciated. Thank you in advance!
corr <- function(directory, threshold=0){
filelist2<- data.frame(list.files(path=directory,
pattern=".csv", full.names=TRUE))
corvector <- numeric()
for(i in 1:length(filelist2)){
data <-data.frame(read.csv(filelist2[i]))
removedNA<-complete.cases(data)
newdata<-data[removedNA,2:3]
if(nrow(removedNA) > threshold){
corvector<-c(corvector, cor(data$sulfate, data$nitrate ))
}
}
corvector
}
I don't think your nrow(removedNA) does what you think it does. To replicate the example I use the mtcars dataset.
data <- mtcars # create dataset
data[2:4, 2] <- NA # create some missings in column 2
data[15:17, 3] <- NA # create some missing in column 3
removedNA <- complete.cases(data)
table(removedNA) # 6 missings indeed
nrow(removedNA) # NULL removedNA is no data.frame, so nrow() doesn't work
newdata <- data[removedNA, 2:3] # this works though
nrow(newdata) # and this shows the rows in 'newdata'
#---- therefore instead of nrow(removedNA) try
if(nrow(data)-nrow(newdata) < threshold) {
...
}
NB: I changed the > to < in the line with threshold. I guess it depends on whether you want the threshold to be an absolute minimum number of complete rows (in which case you could simply use nrow(newdata) > threshold), or whether you want it to reflect the difference in the number of rows between the original data and the 'new' data.
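Putting this together, a corrected version of the whole function might look like the sketch below. It assumes the columns are literally named sulfate and nitrate and that the threshold is an absolute minimum number of complete rows. Besides the nrow() issue, it also keeps the file list as a plain character vector (wrapping it in data.frame() makes length() return 1) and computes the correlation on the complete cases rather than on the raw data. The demo files are made up.

```r
corr <- function(directory, threshold = 0) {
  # Keep the paths as a character vector so the loop visits every file
  files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  corvector <- numeric()
  for (f in files) {
    data <- read.csv(f)
    # Rows where both measurements are present
    complete <- data[complete.cases(data[, c("sulfate", "nitrate")]), ]
    if (nrow(complete) > threshold) {
      corvector <- c(corvector, cor(complete$sulfate, complete$nitrate))
    }
  }
  corvector
}

# Small demonstration with two made-up files in a temporary directory
demo_dir <- tempfile("corr_demo")
dir.create(demo_dir)
write.csv(data.frame(Date = c("d1", "d2", "d3", "d4"),
                     sulfate = c(1, 2, NA, 4),
                     nitrate = c(2, 4, 6, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("d1", "d2", "d3", "d4", "d5"),
                     sulfate = c(1, NA, 3, 5, 4),
                     nitrate = c(3, 1, NA, 2, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

corr(demo_dir)                 # one correlation per file
corr(demo_dir, threshold = 2)  # only files with more than 2 complete rows
```

A lone + in the console usually means R is still waiting for a closing bracket or brace, so it is worth checking that the whole function definition was pasted or sourced in one go.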
I wrote a function in R, which parses arguments from a dataframe, and outputs the old dataframe + a new column with stats from each row.
I get the following warning:
Warning message:
In [[.data.frame(xx, sxx[j]) :
named arguments other than 'exact' are discouraged
I am not sure what this means, to be honest. I did spot checks on the results and seem OK to me.
The function itself is quite long, I will post it if needed to better answer the question.
Edit:
This is a sample dataframe:
my_df<- data.frame('ALT'= c('A,C', 'A,G'),
'Sample1'= c('1/1:35,3,0,35,3,35:1:1:0:0,1,0', './.:0,0,0,0,0,0:0:0:0:0,0,0'),
'Sample2'= c('2/2:188,188,188,33,33,0:11:11:0:0,0,11', '1/1:255,99,0,255,99,255:33:33:0:0,33,0'),
'Sample3'= c('1/1:219,69,0,219,69,219:23:23:0:0,23,0', '0/1:36,0,78,48,87,120:7:3:0:4,3,0'))
And this is the function:
multi_allelic_filter_v2<- function(in_vcf, start_col, end_col, threshold=1){
#input: must have gone through biallelic_assessment first
table0<- in_vcf
#ALT_alleles is the number of alt alleles with coverage > threshold across samples
#The following function calculates coverage across samples for a single allele
single_allele_tot_cov_count<- function(list_of_unparsed_cov,
allele_pos){
single_allele_coverage_count<- 0
for (i in 1:length(list_of_unparsed_cov)) { # i is each group of coverages/sample
single_allele_coverage_count<- single_allele_coverage_count+
as.numeric(strsplit(as.character(list_of_unparsed_cov[i]),
split= ',')[[1]])[allele_pos]}
return(single_allele_coverage_count)}
#single row function
#Now we need to reiterate on each ALT allele in the row
single_row_assessment<- function(single_row){
# No. of alternative alleles over threshold
alt_alleles0 <- 0
if (single_row$is_biallelic==TRUE){
alt_alleles0<- 1
} else {
alt_coverages <- numeric() #coverages across sample of each ALT allele
altcovs_unparsed<- character() #Unparsed coverages from each sample
for (i in start_col:end_col) {
#Let's fill altcovs_unparsed
altcovs_unparsed<- c(altcovs_unparsed,
strsplit(x = as.character(single_row[1,i]), split = ':')[[1]][6])}
#Now let's calculate alt_coverages
for (i in 1:lengths(strsplit(as.character(
single_row$ALT),',',fixed = TRUE))) {
alt_coverages<- c(alt_coverages, single_allele_tot_cov_count(
list_of_unparsed_cov = altcovs_unparsed, allele_pos = i+1))}
#Now, let's see how many ALT alleles are over threshold
alt_alleles0<- sum(alt_coverages>threshold)}
return(alt_alleles0)}
#Now, let's reiterate across each row:
#ALT_alleles is no. of alt alleles with coverage >threshold across samples
table0$ALT_alleles<- -99 # Just as a marker, to make sure function works
for (i in 1:nrow(table0)){
table0[i,'ALT_alleles'] <- single_row_assessment(single_row = table0[i,])}
#Now we know how many ALT alleles > threshold coverage are in each SNP
return(table0)}
Basically, in the following line:
'1/1:219,69,0,219,69,219:23:23:0:0,23,0'
fields are separated by ":", and I am interested in the last two numbers of the last field (23 and 0); in each row I want to sum all the numbers in those positions (two separate sums), and output how many of the "sums" are over a threshold. Hope it makes sense...
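As a compact illustration of that description, here is a sketch that works on the sample data frame directly. The helper name count_alt_over_threshold is made up, it skips the is_biallelic shortcut used in the full function, and it assumes every sample string has the per-allele coverages in its sixth ":"-separated field.

```r
my_df <- data.frame(
  ALT = c("A,C", "A,G"),
  Sample1 = c("1/1:35,3,0,35,3,35:1:1:0:0,1,0", "./.:0,0,0,0,0,0:0:0:0:0,0,0"),
  Sample2 = c("2/2:188,188,188,33,33,0:11:11:0:0,0,11", "1/1:255,99,0,255,99,255:33:33:0:0,33,0"),
  Sample3 = c("1/1:219,69,0,219,69,219:23:23:0:0,23,0", "0/1:36,0,78,48,87,120:7:3:0:4,3,0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one row of sample strings, count the ALT alleles
# whose coverage summed across samples exceeds the threshold
count_alt_over_threshold <- function(row_values, threshold = 1) {
  # Sixth ":"-separated field of each sample, e.g. "0,23,0"
  cov_field <- vapply(strsplit(row_values, ":", fixed = TRUE),
                      function(parts) parts[6], character(1))
  # One row per sample, one column per allele (first column is the reference)
  cov <- do.call(rbind, lapply(strsplit(cov_field, ",", fixed = TRUE), as.numeric))
  alt_sums <- colSums(cov[, -1, drop = FALSE])
  sum(alt_sums > threshold)
}

sample_cols <- c("Sample1", "Sample2", "Sample3")
my_df$ALT_alleles <- apply(my_df[, sample_cols], 1, count_alt_over_threshold)
my_df[, c("ALT", "ALT_alleles")]
```

Splitting once per row and summing with colSums() avoids the nested loops, which also makes it easier to spot-check a single row by calling the helper on it by hand.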
OK,
I re-ran the script with the same dataset on the same computer (same project, then a new project), then ran it again on a different computer, and could not get the warning again in any case. I am not sure what happened, and the results seem correct. Never mind. Thanks anyway for the comments and advice.
I have 1 data.frame as follows, each line is a different Stock data :
Teste=data.frame(matrix(runif(25), nrow=5, ncol=5))
colnames(Teste) <- c("AVG_VOLUME","AVG_RETURN","VOL","PRICE","AVG_XX")
AVG_VOLUME AVG_RETURN VOL PRICE AVG_XX
1 0.7028197 0.9264265 0.2169411 0.80897110 0.3047671
2 0.7154557 0.3314615 0.4839466 0.63529520 0.5633933
3 0.4038030 0.4347487 0.3441471 0.07028743 0.7704912
4 0.5392530 0.6414982 0.4482528 0.11087518 0.3512511
5 0.8720084 0.9615865 0.8081017 0.45781973 0.0137508
What i want to do is to apply the function GBM from package sde (https://cran.r-project.org/web/packages/sde/sde.pdf) using the cols AVG_RETURN, VOL, PRICE as arguments for all lines in the data.frame.
Something like this :
Result <- apply(Teste,1,function(x) {
GBM(x[,"PRICE"],x[,"AVG_RETURN"],x[,"VOL"],1,252)
})
So i want the Result to be a data.frame that runs GBM for each Stock in the Teste data.frame.
How can i get this result ?
The answer to the narrow question about why you are getting errors is that apply passes each row to the function as a vector rather than as a dataframe, so removing the commas in the arguments to "[" will get you a result.
Result <- apply(Teste,1,function(x) {
GBM(x["PRICE"],x["AVG_RETURN"],x["VOL"],1,252)
})
If you need it to be a dataframe where each stock would be a column, and the input datastructure has meaningful stock names, then I suggest using:
dfRes <- setNames( data.frame(Result), rownames(Teste) )
I think the only way this could be meaningful in a risk analysis context is if many more simulation runs than these single instances are assembled in some higher level context.
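To see the row-as-vector behaviour without installing sde, here is a self-contained sketch. sim_gbm is a hypothetical stand-in written for this example (a plain geometric Brownian motion path); it is not the sde implementation, and its argument order merely mirrors the call used above.

```r
# Hypothetical stand-in for GBM(): one simulated path with N steps,
# returned as N + 1 values including the starting price
sim_gbm <- function(x0, r, sigma, T = 1, N = 100) {
  dt <- T / N
  increments <- (r - sigma^2 / 2) * dt + sigma * sqrt(dt) * rnorm(N)
  x0 * exp(cumsum(c(0, increments)))
}

set.seed(42)
Teste <- data.frame(matrix(runif(25), nrow = 5, ncol = 5))
colnames(Teste) <- c("AVG_VOLUME", "AVG_RETURN", "VOL", "PRICE", "AVG_XX")

# Inside apply(), x is a named numeric vector, so it is indexed by name
# with one index; x[, "PRICE"] would fail with "incorrect number of dimensions"
Result <- apply(Teste, 1, function(x) {
  sim_gbm(x[["PRICE"]], x[["AVG_RETURN"]], x[["VOL"]], 1, 252)
})

# One column per stock, one row per time point
dim(Result)
dfRes <- setNames(data.frame(Result), rownames(Teste))
```

Using x[["PRICE"]] rather than x["PRICE"] drops the element name, which keeps stray names out of the simulated paths; either form avoids the original error.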
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more conducive to the functions I eventually want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I change it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since the : has a higher precedence than + (see ?Syntax for the order in which mathematical operations are conducted in R). To get the desired two columns, use j:(j+1) instead.
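The precedence point is easy to check at the console. A minimal sketch, where the small matrix m is made up for illustration:

```r
j <- 3

# ':' binds tighter than '+', so this is (j:j) + 1, a single number
single <- j:j + 1

# Parentheses give the intended pair of consecutive indices
pair <- j:(j + 1)

# Effect on column selection with a made-up 2 x 6 matrix
m <- matrix(1:12, nrow = 2)
ncol(m[, j:j + 1, drop = FALSE])  # one column selected
ncol(m[, j:(j + 1)])              # two columns selected
```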
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().