Correlation test loop in R

I am trying to build a data frame of p-values and estimates that compares one gene against many different expression markers. cor.test works when I run it on a single expression marker, but when I put it in a loop it fails with the error "'x' and 'y' must have the same length".
How can I get this loop to work and build the data frame?
Below are the inputs and the loop itself.
M3 <- ads$mean
Expression <- c("Exp1","Exp2","Exp3")
for (i in seq_along(Expression))
{
corr<-cor.test(M3, Expression[i], method = "pearson")
cor_df<-data.frame(Expression = Expression[i],pvalue = corr$p.value,
cor = corr$estimate)
}

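Based on your comment, if Exp1, Exp2, and Exp3 are columns in a data frame (df), the loop fails because Expression[i] is only a length-one character string such as "Exp1", not the column's values, so its length does not match M3. Index the data frame by that name instead: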
corr <- cor.test(M3, df[ ,Expression[i]], method = "pearson")
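Note that the loop in the question also overwrites cor_df on every iteration, so only the last marker would survive. A minimal sketch of one way to collect all rows follows; the simulated ads and df objects are assumptions standing in for the real data, and collecting rows in a list before binding is just one common pattern.

```r
# Simulated stand-ins for the real data (assumption: df holds one column per marker)
set.seed(1)
ads <- data.frame(mean = rnorm(50))
df  <- data.frame(Exp1 = rnorm(50), Exp2 = rnorm(50), Exp3 = rnorm(50))

M3 <- ads$mean
Expression <- c("Exp1", "Exp2", "Exp3")

# Collect one single-row data frame per marker, then bind them at the end
rows <- vector("list", length(Expression))
for (i in seq_along(Expression)) {
  corr <- cor.test(M3, df[[Expression[i]]], method = "pearson")
  rows[[i]] <- data.frame(Expression = Expression[i],
                          pvalue = corr$p.value,
                          cor = unname(corr$estimate))
}
cor_df <- do.call(rbind, rows)
```

cor_df then has one row per marker with the columns Expression, pvalue, and cor.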

Related

Loop over data table columns and apply glm using for loop

I am trying to loop over my data table columns and apply glm to each column using a for loop.
for(n in 1:ncol(dt)){
model = glm(y ~ dt[, n], family=binomial(link="logit"))
}
Why doesn't this work? I am getting this error:
Error in `[.data.table`(dt, , n) :
j (the 2nd argument inside [...]) is a single symbol but column name 'n' is not found. Perhaps you intended DT[, ..n]. This difference to data.frame is deliberate and explained in FAQ 1.1.
I nearly managed to make it work using something like dt[[n]], but I think it gets rid of the column name.
Use lapply to iterate over the column names and reformulate to construct each formula:
model_list <- lapply(names(dt), function(x)
glm(reformulate(x, 'y'), dt, family=binomial(link="logit")))
Alternatively, build the formula with paste0 and pass it to glm:
model <- vector('list', ncol(dt))
for(n in 1:ncol(dt)){
model[[n]] = glm(as.formula(paste0('y ~ ', names(dt)[n])),
data = dt, family=binomial(link="logit"))
}
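As a self-contained sketch of the reformulate approach: the example below uses a plain data.frame with simulated columns a and b, and stores y inside the data (both are assumptions made for illustration; the question keeps y separately). Because each formula refers to a column by name, the column name is kept in the fitted coefficients, which was the concern with dt[[n]].

```r
# Simulated data; the column names a, b and y are illustrative only
set.seed(1)
dt <- data.frame(a = rnorm(100), b = rnorm(100), y = rbinom(100, 1, 0.5))

# Fit one logistic regression per predictor, skipping the response column
predictors <- setdiff(names(dt), "y")
model_list <- lapply(predictors, function(x) {
  glm(reformulate(x, response = "y"), data = dt,
      family = binomial(link = "logit"))
})
names(model_list) <- predictors

# The coefficient names carry the original column name
coef_names_a <- names(coef(model_list$a))
```

Naming the list by predictor makes it easy to look up a model later, for example model_list$b.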

Using apply() function to iterate over different data types doesn't work

I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function to iterate over every feature (i.e. column), check its scale of measure (numeric, factor with two levels, factor with more than two levels) and then use the appropriate correlation function. Unfortunately, my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x,y){
# print out structure of every column
print(str(x))
# if feature is numeric and has more than two outcomes use corr.test
if(is.numeric(x) & length(unique(x))>2){
result <- corr.test(x,y)[['r']]
} else {
result <- "else"
}
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply to a data frame, the first thing that happens is coercing your data frame to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them when this happens.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test, but I don't know what package to load to get the corr.test function.)
result <- sapply(features, dyn.corr, y)
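To see the coercion directly, the sketch below compares the column classes that apply and sapply report for the same data frame. It uses base cor() instead of corr.test() so that no extra package is needed, and dyn_cor is a simplified, hypothetical variant of the function in the question.

```r
set.seed(42)
features <- data.frame(
  foo = sample(c("x", "y"), 200, replace = TRUE),
  bar = sample(1:5, 200, replace = TRUE)
)
y <- sample(1:5, 200, replace = TRUE)

# apply() converts the data frame to a character matrix first
col_classes_apply <- apply(features, 2, class)

# sapply() visits each column as it is stored
col_classes_sapply <- sapply(features, class)

# Simplified variant of the function in the question, using base cor()
dyn_cor <- function(x, y) {
  if (is.numeric(x) && length(unique(x)) > 2) {
    cor(x, y, method = "pearson")
  } else {
    NA_real_
  }
}
result <- sapply(features, dyn_cor, y = y)
```

With apply every column is reported as character, while sapply sees bar as a numeric column, so only the sapply version reaches the correlation branch.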

R - unpacking nested list of t.test results

I have a data frame, attended, with 12 pairs of pre/post numerical metrics (columns), and I am computing a t-test for each pair.
Here is the function that does a single test:
attended_test <- function(pre, post) {
  tryCatch(
    t.test(log10(attended[pre] + 1), log10(attended[post] + 1),
           alternative = "greater", paired = FALSE,
           var.equal = FALSE, conf.level = 0.95),
    error = function(e) c("NA","NA","NA","NA","NA","NA","NA","NA","NA")
  )
}
Creating vectors that correspond to data frame's columns:
pre <- as.list(c(4,5,6,7,8,9,16,17,18,19,20,21))
post <- as.list(c(10,11,12,13,14,15,22,23,24,25,26,27))
Applying test function to each pair of columns:
attended_test_results_list <- mapply(attended_test, pre, post, SIMPLIFY = FALSE)
The problem I'm having is unlisting attended_test_results_list into a single data frame. This structure is a list of 12 list objects, one per test (i.e. a nested list).
I identified the attributes I want from each test result's list:
data.frame(t(unlist(attended_test_results_list[[1]][c("estimate","p.value","statistic","conf.int")])))
Which has an output like so:
estimate.mean.of.x estimate.mean.of.y p.value statistic.t conf.int1 conf.int2
1 0.2476742 0.2530888 0.5950925 -0.2407039 -0.04243605 Inf
I want to create a single data frame with 1 row for each test (12 rows) like above. I've used lapply plenty of times, and I understand that I need to execute the code above for each of the 12 lists in attended_test_results_list and row bind to a single data frame.
But with this function I am getting this error:
attended_unpacked_test_results <- lapply(attended_test_results_list,
function(x){
data.frame(t(unlist(attended_test_results_list[[x]]
[c("estimate","p.value","statistic","conf.int")])))
})
Error in attended_test_results_list[[x]] : invalid subscript type 'list'
Do I need to be using a second lapply somewhere? How can I create the data frame in the format I want?
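One lapply should be enough. lapply already passes each element of attended_test_results_list to your function as x, so x is itself a test result (a list). Using it as an index in attended_test_results_list[[x]] is what raises the error invalid subscript type 'list'.
I have not tested it, but indexing x directly should work: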
attended_unpacked_test_results <- lapply(attended_test_results_list, function(x) {
data.frame(t(unlist(x[c("estimate","p.value","statistic","conf.int")])))
})
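This returns a list of one-row data frames. To get a single data frame, bind them together, for example with do.call(rbind, attended_unpacked_test_results), provided every test succeeded and so produced the same columns.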
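A minimal end-to-end sketch of this pattern on simulated data is shown below. The column names pre1, pre2, post1, and post2 are assumptions for illustration, the tryCatch wrapper is omitted for brevity, and columns are selected by name rather than by position.

```r
# Simulated stand-in for the attended data frame
set.seed(1)
attended <- data.frame(pre1 = rexp(30), pre2 = rexp(30),
                       post1 = rexp(30), post2 = rexp(30))
pre  <- c("pre1", "pre2")
post <- c("post1", "post2")

# One t-test per pre/post pair
tests <- mapply(function(p, q) {
  t.test(log10(attended[[p]] + 1), log10(attended[[q]] + 1),
         alternative = "greater", paired = FALSE,
         var.equal = FALSE, conf.level = 0.95)
}, pre, post, SIMPLIFY = FALSE)

# One single-row data frame per test, then bind the rows together
rows <- lapply(tests, function(x) {
  data.frame(t(unlist(x[c("estimate", "p.value", "statistic", "conf.int")])))
})
results <- do.call(rbind, rows)
```

results has one row per test, with the same six columns as the single-test output shown in the question.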

Create a new variable using dplyr::mutate and pasting two existing variables for user-defined function

I would like to create a function to join the lower and higher bound of confidence intervals (named as CIlow and CIhigh) from a data frame. See the data frame below as example.
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),
CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
The data frame has CIlow and CIhigh for a number of groups (named as a, b and c) and for a number of variables (in this case two, the rows of the data frame).
group<-c("a","b","c")
To build my own function I tried the following code:
f<-function(df,gr){
enquo_df<-enquo(df)
enquo_gr<-enquo(gr)
r<-df%>%
dplyr::mutate(UQ(paste("CI",enquo_gr,sep="_")):=
sprintf("(%s,%s)",
paste("CIlow",quo_name(enquo_gr),sep="_"),
paste("CIhigh",quo_name(enquo_gr),sep="_")))
return(r)
}
However when using the function
library(dplyr)
group<-c("a","b","c")
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
f(df=dataframe,gr=group)
I do not get the expected output
output<-data.frame(CI_a=c("(1.1,1.3)","(1.2,1.4)"),
CI_b=c("(2.1,2.3)","(2.2,2.4)"),
CI_c=c("(3.1,3.3)","(3.2,3.4)"))
but the following error message:
Error: LHS must be a name or string
Do you know why? How could I solve this issue? Thanks in advance.
An old-school solution using a base R loop:
res <- as.data.frame(matrix(NA_character_, nrow(dataframe), ncol(dataframe) / 2))
for (i in seq_along(group)) {
var <- paste0("CI", c("low", "high"), "_", group[[i]])
res[[i]] <- sprintf("(%s,%s)", dataframe[[var[[1]]]], dataframe[[var[[2]]]])
}
names(res) <- paste0("CI_", group)
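Since the question asked for a reusable function, the same idea can be wrapped in one. The sketch below is base R only and does not use dplyr or tidy evaluation; the name join_ci is hypothetical.

```r
dataframe <- data.frame(CIlow_a = c(1.1, 1.2), CIlow_b = c(2.1, 2.2),
                        CIlow_c = c(3.1, 3.2), CIhigh_a = c(1.3, 1.4),
                        CIhigh_b = c(2.3, 2.4), CIhigh_c = c(3.3, 3.4))
group <- c("a", "b", "c")

# Hypothetical helper: paste the low and high bounds together for each group
join_ci <- function(df, groups) {
  cols <- lapply(groups, function(g) {
    sprintf("(%s,%s)", df[[paste0("CIlow_", g)]], df[[paste0("CIhigh_", g)]])
  })
  names(cols) <- paste0("CI_", groups)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

out <- join_ci(dataframe, group)
```

Looking columns up by their constructed names with [[ avoids the need to unquote anything on the left-hand side of an assignment.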

Running for loop across multiple groups

I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward:
1. Define the data set and an empty imputation data set.
2. For columns 1-5, fit a model.
3. Extract model estimates to be used for imputation.
4. Run a model using model estimates and replace values <= 0 with the new values in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and using the lapply() function. However, this does not work due to how model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class "gmm" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.
What about using subset()?
myData$V12 <- as.factor(myData$V12)
listofresults <- list()
for (i in levels(myData$V12)) {
  data <- subset(myData, V12 == i)
  # your analysis here, run on `data`: result saved in myDataImp
  listofresults[[i]] <- myDataImp
}
Storing each group's result as a separate list element keeps the matrices intact, whereas c() would flatten them into one vector. Not the most elegant, but it should work.
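A split() and lapply() version of the same per-group idea is sketched below. The function impute_column is a hypothetical stand-in for the gmm.tmvnorm()/rtmvnorm() step (here it simply replaces values <= 0 with the mean of the positive values), and the data are simulated, so treat this as the shape of the solution rather than the actual imputation.

```r
# Simulated data with a grouping column V12 (three groups instead of 72)
set.seed(1)
myData <- data.frame(V1 = rnorm(60), V2 = rnorm(60),
                     V12 = rep(1:3, each = 20))

# Hypothetical stand-in for the model-based imputation of one column
impute_column <- function(x) {
  replace_these <- x <= 0
  x[replace_these] <- mean(x[!replace_these])
  x
}

# Impute the selected columns within a single group's rows
impute_group <- function(d, cols) {
  d[cols] <- lapply(d[cols], impute_column)
  d
}

# Apply the per-group function to each group, then stack the groups again
by_group  <- lapply(split(myData, myData$V12), impute_group,
                    cols = c("V1", "V2"))
myDataImp <- do.call(rbind, by_group)
```

Keeping the model fit and the replacement inside the per-group function means lapply only ever returns data frames, which avoids trying to coerce a model object into one.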
