I am trying to conduct a meta-analysis of a group of studies, and I cannot figure out the code to produce the output I want. Essentially, I want to run a fixed-effects meta-analysis using the metafor package and store the coefficient estimates in a matrix.
I have several problems. First, I want to run these tests only for studies that have multiple results, that is, studies whose study numbers appear more than once (see the code below). My existing code does not do this correctly: it returns a number, but not the right one. Second, some study numbers are larger than the number of studies I have. In my own dataset there is a study numbered 3500, for example. When I run my loop, R places the results for that fixed-effects model in the 3500th row instead of in the next empty row.
I have a basic example below that anyone can run in R.
library(metafor)
origdata <- data.frame(matrix(data=NA, nrow=15, ncol=3))
colnames(origdata) <- c("studynum", "Mail_b", "Mail_SE")
origdata$studynum <- c(1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 7, 7, 7, 7, 7)
origdata$Mail_b <- c(1.8, 0.8, 1.2, 1, 1, 5, 3, 3, 6, 4, 5, 8, 5, 9, 2)
origdata$Mail_SE <- c(1.6, 0.8, 1.3, 1, 1, 1, 3, 2.9, 6, 4, 5, 8, 5, 8, 1)
collapsedtest <- data.frame(matrix(data=NA, nrow=5, ncol=3))
colnames(collapsedtest) <- c("studynum", "Meta_b", "Meta_SE")
collapsedtest$studynum <- unique(origdata$studynum)
for(i in unique(origdata$studynum)) {
if((table(origdata$studynum) == 1) == FALSE){
collapsedtest[i, 2] <- (coef(summary(rma(yi=Mail_b[origdata$studynum == i],
sei=Mail_SE[origdata$studynum == i],
method="FE",
data=origdata)))$estimate)
collapsedtest[i, 3] <- (coef(summary(rma(yi=Mail_b[origdata$studynum == i ],
sei=Mail_SE[origdata$studynum == i],
method="FE", data=origdata)))$ci.ub
-
(coef(summary(rma(yi=Mail_b[origdata$studynum == i],
sei=Mail_SE[origdata$studynum == i],
method="FE",
data=origdata)))$estimate)) / 1.96
} else {
collapsed[i, 2] <- origdata$Mail_b[origdata$studynum == i]
collapsed[i, 3] <- origdata$Mail_SE[origdata$studynum == i]
}
}
Use nrow to check how many rows each study has, and select the result row by matching on studynum rather than using i as a row number. Also, run rma only once per study and reuse the resulting estimates.
for(i in unique(origdata$studynum)) {
  # fit the fixed-effects model once per study and reuse the coefficients
  coef_data <- coef(summary(rma(yi=Mail_b[origdata$studynum == i],
                                sei=Mail_SE[origdata$studynum == i],
                                method="FE",
                                data=origdata)))
  if(nrow(origdata[origdata$studynum==i,]) > 1) {
    # several results: store the pooled estimate and its standard error
    collapsedtest[collapsedtest$studynum==i, 2] <- coef_data$estimate
    collapsedtest[collapsedtest$studynum==i, 3] <- (coef_data$ci.ub - coef_data$estimate) / 1.96
  } else {
    # single result: copy the original values
    collapsedtest[collapsedtest$studynum==i, 2] <- origdata$Mail_b[origdata$studynum == i]
    collapsedtest[collapsedtest$studynum==i, 3] <- origdata$Mail_SE[origdata$studynum == i]
  }
}
However, you can avoid the studynum filters by using by, which splits a data frame by one or more factors and applies a function to each subset. After processing, row-bind the resulting data frames with do.call:
df_list <- by(origdata, origdata$studynum, function(sub){
  # fit the model on the current subset only
  coef_data <- coef(summary(rma(yi=Mail_b, sei=Mail_SE, method="FE", data=sub)))
  if(nrow(sub) > 1) {
    df <- data.frame(studynum = sub$studynum[[1]],
                     Meta_b = coef_data$estimate,
                     Meta_SE = (coef_data$ci.ub - coef_data$estimate) / 1.96)
  } else {
    df <- data.frame(studynum = sub$studynum[[1]],
                     Meta_b = sub$Mail_b,
                     Meta_SE = sub$Mail_SE)
  }
  return(df)
})
collapsedtest <- do.call(rbind, df_list)
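As a cross-check that does not depend on metafor, here is a minimal sketch of the inverse-variance weighting that a fixed-effects model is based on. The helper name fe_pool and the result name collapsed_manual are made up for illustration. The numbers should agree with rma(..., method="FE") up to rounding, though that comparison has not been run here. A study with a single row simply gets its own estimate and standard error back, so no special case is needed.

```r
# Toy data copied from the question.
origdata <- data.frame(
  studynum = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 7, 7, 7, 7, 7),
  Mail_b   = c(1.8, 0.8, 1.2, 1, 1, 5, 3, 3, 6, 4, 5, 8, 5, 9, 2),
  Mail_SE  = c(1.6, 0.8, 1.3, 1, 1, 1, 3, 2.9, 6, 4, 5, 8, 5, 8, 1)
)

# Hypothetical helper: inverse-variance (fixed-effect) pooling for one study.
fe_pool <- function(sub) {
  w <- 1 / sub$Mail_SE^2                      # weights = 1 / variance
  data.frame(
    studynum = sub$studynum[1],
    Meta_b   = sum(w * sub$Mail_b) / sum(w),  # weighted mean of the estimates
    Meta_SE  = sqrt(1 / sum(w))               # standard error of the pooled estimate
  )
}

# One row per study, in order of study number, regardless of how large the numbers are.
collapsed_manual <- do.call(rbind, lapply(split(origdata, origdata$studynum), fe_pool))
collapsed_manual
```

Because the rows are built per group and then bound together, a study numbered 3500 would land in the next row rather than in row 3500.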
I'm trying to create a function that compares two matrices. It will compare the element of both matrices at a certain position, and returns "greater than" "equal to" or "less than". Below is the code I have right now. However, when I tried calling the function, R does not return anything, not even an error message. I'm wondering why that is the case. Any suggestions would be helpful. Thanks.
fxn <- function(x, y) {
emptymatrix <- matrix( , nrow = dim(x)[1], ncol = dim(x)[2])
for (i in 1:dim(emptymatrix)[1]) {
for (j in 1:dim(emptymatrix)[2]) {
if (x[i, j] < y[i, j]) {
emptymatrix[i, j] <- "Less Than"
}else if (x[i, j] == y[i, j]) {
emptymatrix[i, j] <- "Equal to"
}else {
emptymatrix[i, j] <- "Greater than"
}
}
}
}
#trying to test the function
vecc1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
vecc2 <- c(4, 5, 2, 3, 1, 1, 8, 9, 10)
matrix1 <- matrix(vecc1, nrow = 3, byrow = T)
matrix2 <- matrix (vecc2, nrow=3, byrow = T)
fxn(matrix1, matrix2)
As SamR pointed out in his comment, your function doesn't return anything, because its last expression is the for loop rather than the result object. He is also right about the loop: R is designed around vectors and matrices, so it can do much of this work for you under the hood. This is a good example of some of R's design principles. First, we don't need a for loop, because the comparisons ==, < and > are vectorized and can be evaluated on all indices at once. The output is a logical matrix (TRUE/FALSE) with the same dimensions as the inputs. We can use that matrix to index our new matrix at all TRUE positions. Then we just need to assign a single string ("Equal", "Less Than" or "Greater Than"), which is recycled to fill every selected position.
vecc1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
vecc2 <- c(4, 5, 2, 3, 1, 1, 8, 9, 10)
matrix1 <- matrix(vecc1, nrow = 3, byrow = T)
matrix2 <- matrix (vecc2, nrow=3, byrow = T)
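# run this to see how the comparison works
matrix1 == matrix2
foo <- function(x,y) {
  # dim(x) has two elements, so take the row and column counts separately
  m_new <- matrix(NA, nrow = nrow(x), ncol = ncol(x))
  m_new[x == y] <- "Equal"
  m_new[x < y] <- "Less Than"
  m_new[x > y] <- "Greater Than"
  m_new # the last expression is returned automatically
  # return(m_new) also works, but is slightly less efficient
}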
foo(matrix1,matrix2)
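A variation on the same vectorized idea, offered only as a sketch: sign(x - y) is -1, 0 or 1 for every cell, so adding 2 turns it into an index into a vector of three labels. The function name compare_matrices is made up here, and the sketch assumes both matrices are numeric, have the same dimensions and contain no NA values.

```r
# Hypothetical alternative: look the label up by the sign of the difference.
compare_matrices <- function(x, y) {
  stopifnot(identical(dim(x), dim(y)))
  labels <- c("Less Than", "Equal", "Greater Than")
  out <- labels[sign(x - y) + 2]  # -1/0/1 becomes index 1/2/3
  dim(out) <- dim(x)              # restore the matrix shape
  out
}

matrix1 <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, byrow = TRUE)
matrix2 <- matrix(c(4, 5, 2, 3, 1, 1, 8, 9, 10), nrow = 3, byrow = TRUE)
result <- compare_matrices(matrix1, matrix2)
result
```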
You missed returning emptymatrix from your function.
In R, the value of the last expression evaluated in a function is returned automatically. In the original function, the last expression was the for loop, whose value is NULL. It was returned, marked "invisible", so it didn't print.
The usual convention in R is to put the name of the object you want to return on the last line when it isn't already the last value produced. So just add one line at the end of your function, containing emptymatrix.
You can also call return(emptymatrix), but that's actually slightly less efficient.
And if you like returning things invisibly, as for loops do, you can call invisible(emptymatrix) as the last line. Then it won't automatically print, but you can still assign it to another variable.
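The behaviour described above can be seen in a small sketch. The two function names are made up for illustration: the first ends with a for loop and so returns NULL invisibly, while the second names the result on its last line.

```r
# Ends with a for loop: the function's value is NULL, returned invisibly.
loop_only <- function() {
  total <- 0
  for (i in 1:3) total <- total + i
}

# Same loop, but the last expression is the object we want back.
loop_then_value <- function() {
  total <- 0
  for (i in 1:3) total <- total + i
  total
}

res_null  <- loop_only()        # nothing prints when called at the console
res_total <- loop_then_value()

is.null(res_null)
res_total
```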
I have data roughly in the following format. The real data is very large and is broken into groups by the class and uniqueId variables. Each row is a location given as an (x, y) pair.
df <-
data.frame(
x = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
y = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
class = c(0, 0, 0, 0, 0, 1, 0, 1, 0),
uniqueId = c("1-2-3", "1-2-3", "1-2-3", "1-2-4", "1-2-4", "1-2-4", "1-3-2", "1-3-2", "1-3-2"),
partialId = c("1.2", "1.2", "1.2", "1.2", "1.2", "1.2", "1.3", "1.3", "1.3")
)
The function I am using should go through the dataframe and calculate the smallest distance to another object within the same uniqueId but different class as the current row. To do this I've broken my data into chunks the following way.
indexes <-
df %>%
select(partialId) %>%
unique()
j <- 1
library(doParallel)
class_separation <- c()
cl <- makePSOCKcluster(24)
registerDoParallel(cl)
while(j <= nrow(indexes)) {
test <- df %>% filter(partialId == indexes$partialId[j])
n <- nrow(test)
vec <- numeric(n)
vec <- foreach(k = 1:n, .combine = 'c', .multicombine = F) %dopar% {
c(
min(
apply(
test[test$uniqueId == test$uniqueId[k] & test$class != test$class[k], c("x","y")],
1,
function(x) dist(rbind(c(test$x[k],test$y[k]), c(x[1], x[2])))
)
)
)
}
class_separation <- c(class_separation, vec)
j <- j + 1
}
endtime <- Sys.time()
stopwatch <- endtime - starttime
closeAllConnections()
registerDoSEQ()
gc()
df <- cbind(df, class_separation)
When handling single plays or small batches, this code seems to operate just fine. However, when handling the full dataset I am getting results that are obviously incorrect. I know there must be a flaw in how I am calculating these distances since there is very little chance the dist() function itself or %dopar% is at fault. I have changed to %do% and my results do not change.
As an example of the discrepancy, the following image shows the class_separation column from when the full data run is conducted vs when I feed it a small example. As you can see the results are wildly different, but I'm not sure why.
After a day of thinking about this, the problem is in how I was sending my df to dist().
For example, if we intended to pass
dist(rbind(c(1, 1), c(6, 6)))
dist(rbind(c(1, 1), c(9, 9)))
What we actually pass is dist(rbind(c(1, 1), c(6, 6, 9, 9)))
This is obviously not what I want. I needed both distances, and then to select the minimum of them (or add other conditionals). The way I found to do this was the cdist function from the rdist package.
foreach(i = 1:nrow(df), .combine = 'c', .multicombine = F, .packages = c('tidyverse',
'rdist')) %dopar% {
min(
cdist(
df[df$class != df$class[i] & df$uniqueId == df$uniqueId[i], ] %>% select(x, y),
df %>% select(x, y) %>% slice(i)
)
)
}
For our test data this returns the vector
Inf Inf Inf 2.828427 1.414214 1.414214 1.414214 1.414214 1.414214
Which is exactly what I needed. The first three entries have no rows of the other class within their uniqueId, so they return Inf. Rows 4, 5 and 6 share a uniqueId, and row 4 is twice as far from row 6 as row 5 is. Row 8 is equally distant from rows 7 and 9. Whether this solution will be sufficiently fast remains to be tested.
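For comparison, the same per-row minimum can be sketched in base R without rdist or a parallel backend. This is an illustration on the toy data only, not a claim about speed on the full dataset. The helper name min_other_class_dist is made up, and returning Inf when no row of the other class exists is an explicit choice here.

```r
# Toy data from the question (partialId is not needed for this sketch).
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
  y = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
  class = c(0, 0, 0, 0, 0, 1, 0, 1, 0),
  uniqueId = c("1-2-3", "1-2-3", "1-2-3", "1-2-4", "1-2-4", "1-2-4",
               "1-3-2", "1-3-2", "1-3-2")
)

# Hypothetical helper: smallest Euclidean distance from row i to any row
# with the same uniqueId but a different class.
min_other_class_dist <- function(i, d) {
  other <- d[d$uniqueId == d$uniqueId[i] & d$class != d$class[i], c("x", "y")]
  if (nrow(other) == 0) return(Inf)  # no candidate rows in this group
  min(sqrt((other$x - d$x[i])^2 + (other$y - d$y[i])^2))
}

class_separation <- vapply(seq_len(nrow(df)), min_other_class_dist,
                           numeric(1), d = df)
class_separation
```

Each distance is computed one pair at a time, so the problem of several candidate points being passed to dist() as a single vector cannot occur.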
I want to run a partition regression in R, for which I need a factor indicating which partition each observation belongs to. For example, when a value is greater than the mean + 2 standard deviations, I assign the indicator 2; between mean + 1 sd and mean + 2 sd, I assign 1; and so on. I know this can be done with if and else, but with many partitions the code becomes too long. Is there an easy and succinct way to accomplish this?
mean=mean(x)
sd=sd(x)
signal <- vector()
for(i in 1:length(x)){
if((x[i]-mean)/sd< -3) signal[i]=-3
if((x[i]-mean)/sd> -3) signal[i]=-2
if((x[i]-mean)/sd> -2) signal[i]=-1
if((x[i]-mean)/sd> -1) signal[i]=0
if((x[i]-mean)/sd>1) signal[i]=1
if((x[i]-mean)/sd>2) signal[i]=2
if((x[i]-mean)/sd>3) signal[i]=3
}
Thanks to @jogo and @r.user.05apr.
Now I have a slightly different problem. I want to compute the partition based on rolling windows, 20 days for example. That is, I need to scale the value for day t using the past 20 days (day t-20 to day t-1) and assign the same values as above according to its z-score. In that case, can the cut function still be used? I have written code with a loop and if statements:
signal <- vector()
n=20 #window
for(i in (n+1):length(x)){
mean=mean(x[(i-20):(i-1)])
sd=sd(x[(i-20):(i-1)])
if((x[i]-mean)/sd< -3) signal[i]=-3
if((x[i]-mean)/sd> -3) signal[i]=-2
if((x[i]-mean)/sd> -2) signal[i]=-1
if((x[i]-mean)/sd> -1) signal[i]=0
if((x[i]-mean)/sd>1) signal[i]=1
if((x[i]-mean)/sd>2) signal[i]=2
if((x[i]-mean)/sd>3) signal[i]=3
}
You can use cut():
x <- iris$Petal.Length
m <- mean(x)
s <- sd(x)
cut((x - m)/s, breaks = c(-Inf, -3, -2, -1, 1, 2, 3, +Inf), labels = c((-3):3))
To coerce the result to numeric:
as.numeric(as.character(cut((x - m)/s, breaks = c(-Inf, -3, -2, -1, 1, 2, 3, +Inf), labels = c((-3):3))))
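Remark: you can shorten (x - m)/s to scale(x). Note that scale() returns a one-column matrix, so wrap it in as.vector() if you need a plain vector.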
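For the rolling-window follow-up in the question, one possible sketch is to compute the z-scores in a loop over the trailing window and then call cut() once on the whole vector. The data here is simulated with rnorm (an assumption, since the question's x is not shown), and the first n positions are left as NA because they have no complete window behind them.

```r
set.seed(1)
x <- rnorm(100)  # simulated stand-in for the real series
n <- 20          # window length

# z-score of each value relative to the previous n values
z <- rep(NA_real_, length(x))
for (i in (n + 1):length(x)) {
  past <- x[(i - n):(i - 1)]
  z[i] <- (x[i] - mean(past)) / sd(past)
}

# same breaks and labels as in the non-rolling case
signal <- cut(z, breaks = c(-Inf, -3, -2, -1, 1, 2, 3, Inf), labels = -3:3)
signal_num <- as.numeric(as.character(signal))
table(signal_num, useNA = "ifany")
```

Only the scaling step changes between the two cases; the binning with cut() stays the same.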
It depends on how dynamic the value assignment has to be. An alternative is to store the conditions in a data frame and loop over them:
criteria <- data.frame(operator = c("<", rep(">", 6)),
criterion = c(-3, seq(-3, -1, 1), 1:3),
result = c(seq(-3, 0, 1), 1:3),
stringsAsFactors = FALSE)
criteria # data frame with individual conditions for if
get_signal <- function(mean, sd, x) {
dummy <- (x-mean)/sd
for (i in (1:nrow(criteria))) {
if (do.call(criteria[i, 1], list(dummy, criteria[i, 2]))) res <- criteria[i, 3]
}
res
}
sapply(-5:10, function(x) get_signal(2, 1, x))
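If the cut points are fixed, another base-R option worth sketching is findInterval(), which returns how many break points are less than or equal to each value. Subtracting 3 maps that count onto the -3 to 3 scale used above. One difference to be aware of: findInterval() treats intervals as closed on the left, whereas cut() by default closes them on the right, so values exactly on a break are assigned differently.

```r
x <- iris$Petal.Length
z <- (x - mean(x)) / sd(x)

breaks <- c(-3, -2, -1, 1, 2, 3)
signal <- findInterval(z, breaks) - 3  # count of breaks <= z, shifted to -3..3

# one value from each of the seven partitions
check <- findInterval(c(-3.5, -2.5, -1.5, 0, 1.5, 2.5, 3.5), breaks) - 3
check
```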
I am working on an assignment for school. I need to transform the columns of a data frame using a for loop and the bcPower function from the car package. My data frame, named bb2.df, consists of 13 columns of baseball statistics for 337 players. The data is from:
http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txt
I read the data in using:
bb.df <- read.fwf("baseball.dat.txt",widths=c(4,6,6,4,4,3,3,3,4,4,4,3,3,2,2,2,2,19))
And then I created a second data frame just for the numeric stats using:
bb2.df <- bb.df[,1:13]
library(car)
Then I unsuccessfully tried to build the for loop.
> bb2.df[[i]] <- bcPower(bb2.df[[i]],c)
> for (i in 1:ncol(bb2.df)) {
+ c <- coef(powerTransform(bb2.df[[i]]))
+ bb2.df[[i]] <- bcPower(bb2.df[[i]],c)
+ }
Error in bc1(out[, j], lambda[j]) :
First argument must be strictly positive.
The loop seems to transform the first three columns but stops.
What am I doing wrong?
This solution:
- tests whether a column appears to contain logical values and omits such columns from the transformation
- replaces zero values in the vectors with a small number, outside the range of the actual values
- stores the transformed values in a new data frame, retaining the column and row names
I have also tested all of the variables for normality before and after the transformation. I tried to find a variable that is interesting in that the transformed variable has a large p-value for the Shapiro test and there was also a large change in the p-value. Finally, the interesting variable is scaled in both the original and transformed versions, and the two versions are overlaid on a density plot.
library(car); library(ggplot2); library(reshape2)
# see this link for column names and type hints
# http://ww2.amstat.org/publications/jse/datasets/baseball.txt
# add placeholder column for opening quotation mark
bb.df <-
read.fwf(
"http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txt",
widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 2, 17)
)
# remove placeholder column
bb.df <- bb.df[,-(ncol(bb.df) - 1)]
names(bb.df) <- make.names(
c(
'Salary', 'Batting average', 'OBP', 'runs', 'hits', 'doubles', 'triples',
'home runs', 'RBI', 'walks', 'strike-outs', 'stolen bases', 'errors',
"free agency eligibility", "free agent in 1991/2" ,
"arbitration eligibility", "arbitration in 1991/2", 'name'
)
)
# test for boolean/logical values... don't try to transform them
logicals.test <- apply(
bb.df,
MARGIN = 2,
FUN = function(one.col) {
asnumeric <- as.numeric(one.col)
aslogical <- as.logical(asnumeric)
renumeric <- as.numeric(aslogical)
matchflags <- renumeric == asnumeric
cant.be.logical <- any(!matchflags)
print(cant.be.logical)
}
)
logicals.test[is.na(logicals.test)] <- FALSE
probably.numeric <- bb.df[, logicals.test]
result <- apply(probably.numeric, MARGIN = 2, function(one.col)
{
# can't transform vectors containing non-positive values
# replace zeros with something small
non.zero <- one.col[one.col > 0]
small <- min(non.zero) / max(non.zero)
zeroless <- one.col
zeroless[zeroless == 0] <- small
c <- coef(powerTransform(zeroless))
transformation <- bcPower(zeroless, c)
return(transformation)
})
result <- as.data.frame(result)
row.names(result) <- bb.df$name
cols2test <- names(result)
normal.before <- sapply(cols2test, function(one.col) {
print(one.col)
temp <- shapiro.test(bb.df[, one.col])
return(temp$p.value)
})
normal.after <- sapply(cols2test, function(one.col) {
print(one.col)
temp <- shapiro.test(result[, one.col])
return(temp$p.value)
})
more.normal <- cbind.data.frame(normal.before, normal.after)
more.normal$more.normal <-
more.normal$normal.after / more.normal$normal.before
more.normal$interest <-
more.normal$normal.after * more.normal$more.normal
interesting <-
rownames(more.normal)[which.max(more.normal$interest)]
data2plot <-
cbind.data.frame(bb.df[, interesting], result[, interesting])
names(data2plot) <- c("original", "transformed")
data2plot <- scale(data2plot)
data2plot <- melt(data2plot)
names(data2plot) <- c("Var1", "dataset", interesting)
ggplot(data2plot, aes(x = data2plot[, 3], fill = dataset)) +
geom_density(alpha = 0.25) + xlab(interesting)
Original, incomplete answer:
I believe you're trying to do illegal power transformations (vectors including non-positive values, specifically zeros; vectors with no variance).
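To see why zeros are a problem, here is a hand-rolled sketch of the Box-Cox power family, which is the transformation bcPower applies. The function name box_cox is made up for illustration and is not part of car. The transform is (x^lambda - 1) / lambda for a nonzero lambda and log(x) when lambda is 0, so it is only defined for strictly positive input.

```r
# Hypothetical re-implementation of the Box-Cox power transform, for illustration only.
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) stop("x must be strictly positive")
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

ok  <- box_cox(c(1, 4, 9), 0.5)  # square-root-like transform
bad <- tryCatch(box_cox(c(0, 1, 2), 0.5),
                error = function(e) conditionMessage(e))
ok
bad
```

A column of counts such as triples or stolen bases will usually contain zeros, which is the situation the error message in the question is complaining about.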
The fact that you are copying bb.df into bb2.df and then overwriting is a sure sign that you should really be using apply.
This doesn't create a useful data frame, but it should get you started:
library(car)
bb.df <-
read.fwf(
"baseball.dat.txt",
widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 19)
)
bb.df[bb.df == 0] <- NA
# skip last (text) col
for (i in 1:(ncol(bb.df) - 1)) {
print(i)
# use comma to indicate indexing by column
temp <- bb.df[, i]
temp[temp == 0] <- NA
temp <- temp[complete.cases(temp)]
if (length(unique(temp)) > 1) {
# estimate and apply the transformation on the cleaned vector
c <- coef(powerTransform(temp))
print(bcPower(temp, c))
} else {
print(paste0("column ", i, " is invariant"))
}
}
# apply solution
result <- apply(bb.df[,-ncol(bb.df)], MARGIN = 2, function(one.col)
{
temp <- one.col
temp[temp == 0] <- NA
temp <- temp[complete.cases(temp)]
if (length(unique(temp)) > 1) {
c <- coef(powerTransform(temp))
transformation <- bcPower(temp, c)
return(transformation)
} else
{
print("skipping invariant column")
return(NULL)
}
})
I wish to extract an increasing subsequence of a vector, starting from the first element. For example, from this vector:
a = c(2, 5, 4, 0, 1, 6, 8, 7)
...I'd like to return:
res = c(2, 5, 6, 8).
I thought I could use a loop, but I want to avoid it. Another attempt with sort:
a = c(2, 5, 4, 0, 1, 6, 8, 7)
ind = sort(a, index.return = TRUE)$ix
mat = (t(matrix(ind))[rep(1, length(ind)), ] - matrix(ind)[ , rep(1, length(ind))])
mat = ((mat*upper.tri(mat)) > 0) %*% rep(1, length(ind)) == (c(length(ind):1) - 1)
a[ind][mat]
Basically I sort the input vector and check if the indices verify the condition "no indices at the right hand side are lower" which means that there were no greater values beforehand.
But it seems a bit complicated and I wonder if there are easier/quicker solutions, or a pre-built function in R.
Thanks
One possibility would be to find the cumulative maxima of the vector, and then extract unique elements:
unique(cummax(a))
# [1] 2 5 6 8
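Spelled out step by step, as a small sketch: cummax() replaces every element with the largest value seen so far, so the running maximum only changes where a new record is set. Taking the unique values of that running maximum therefore keeps exactly the elements that exceed everything before them.

```r
a <- c(2, 5, 4, 0, 1, 6, 8, 7)

running_max <- cummax(a)  # largest value seen so far at each position
running_max

res <- unique(running_max)  # one entry per new record
res
```

The result is strictly increasing, so repeated values in the input appear only once.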
The other answer is better, but I made this iterative function, which works as well. It repeatedly drops every element that is not larger than its predecessor, until all consecutive differences are > 0:
increasing <- function (input_vec) {
while(!all(diff(input_vec) > 0)){
input_vec <- input_vec[c(1,diff(input_vec))>0]
}
input_vec
}