I have two lists of matrices (list A and list B, each matrix is of dimension 14x14, and list A contains 10 matrices and list B 11)
and I would like to do a t test for each coordinate to compare the means of each coordinate of group A and group B.
As a result I would like to have a matrix of dimension 14x14 which contains the p value associated with each t test.
Thank you in advance for your answers.
Here's a method using a for loop and then applying the lm() function.
First we'll generate some fake data as described in the question.
#generating fake matrices described by OP
listA <- vector(mode = "list", length = 10)
listB <- vector(mode = "list", length = 10)
for (i in 1:10) {
listA[[i]] <- matrix(rnorm(196),14,14,byrow = TRUE)
}
for (i in 1:11) {
listB[[i]] <- matrix(rnorm(196),14,14,byrow = TRUE)
}
Then we'll unwrap each matrix as described by dcarlson in a for loop.
Unwrapped.Mats <- NULL
for (ID in 1:10) {
unwrapped <- as.vector(as.matrix(listA[[ID]])) #Unwrapping each matrix into a vector
withID <- c(ID, "GroupA", unwrapped) #labeling it with ID# and which group it belongs to
UnwrappedCorMats <- rbind(Unwrapped.Mats, withID)
}
for (ID in 1:11) {
unwrapped <- as.vector(as.matrix(listB[[ID]]))
withID <- c(PID, "GroupB", unwrapped)
UnwrappedCorMats <- rbind(UnwrappedCorMats, withID)
}
Then write and apply a function to run lm(). lm() is statistically equivalent to an unpaired t-test in this context but I'm using it to be more easily adapted into a mixed effect model if anyone wants to add mixed effects.
UnwrappedDF <- as.data.frame(UnwrappedCorMats)
lmPixel2Pixel <- function(i) { #defining function to run lm
lmoutput <- summary(lm(i ~ V2, data= UnwrappedDF))
lmoutputJustP <- lmoutput$coefficients[2,4] #Comment out this line to return full lm output rather than just p value
}
Vector_pvals <- sapply(UnwrappedDF[3:length(UnwrappedDF)], lmPixel2Pixel)
Finally we will reform the vector into the same shape as the original matrix for more easily interpreting results
LM_mat.again <- as.data.frame(matrix(Vector_pvals, nrow = nrow(listA[[1]]), ncol = ncol(listA[[1]]), byrow = T))
colnames(LM_mat.again) <- colnames(listA[[1]]) #if your matrix has row or column names then restoring them is helpful for interpretation
rownames(LM_mat.again) <- colnames(listB[[1]])
head(LM_mat.again)
I'm sure there are faster methods but this one is pretty straight forward and I was surprised there weren't answers for this type of operation yet
Related
I have two datasets with abundance data from groups of different species. Columns are species and rows are sites. The sites (rows) are identical between the two datasets and what i am trying to do is to correlate the columns of the first dataset to the columns of the second dataset in order to see if there is a positive or a negative correlation.
library(Hmisc)
rcorr(otu.table.filter$sp1,new6$spA, type="spearman"))$P
rcorr(otu.table.filter$sp1,new6$spA, type="spearman"))$r
the first will give me the p value of the relation between sp1 and spA and the second the r value
I initially created a loop that allowed me to check all species of the first dataframe with a single column of the second dataframe. Needless to say if I was to make this work I would have to repeat the process a few hundred times.
My simple loop for one column of df1(new6) against all columns of df2(otu.table.filter)
pvalues = list()
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$P
}
rvalues = list()
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$r
}
p<-NULL
for(i in 1:length(pvalues)){
tmp <-print(pvalues[[i]][2])
p <- rbind(p, tmp)
}
r<-NULL
for(i in 1:length(rvalues)){
tmp <-print(rvalues[[i]][2])
r <- rbind(r, tmp)
}
fdr<-as.matrix(p.adjust(p, method = "fdr", n = length(p)))
sprman<-cbind(r,p,fdr)
and using the above as a starting point I tried to create a nested loop that each time would examine a column of df1 vs all columns of df2 and then it would proceed to the second column of df1 against all columns of df2 etc etc
but here i am a bit lost and i could not find an answer for a solution in r
I would assume that the pvalues output should be a list of
pvalues[[i]][[j]]
and similarly the rvalues output
rvalues[[i]][[j]]
but I am a bit lost and I dont know how to do that as I tried
pvalues = list()
rvalues = list()
for (j in 1:7){
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$P
}
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$r
}
}
but I cannot make it work cause I am not sure how to direct the output in the lists and then i would also appreciate if someone could help me with the next part which would be to extract for each comparison the p and r value and apply the fdr function (similar to what i did with my simple loop)
here is a subset of my two dataframes
Here a small demo. Let's assume two matrices x and y with a sample size n. Then correlation and approximate p-values can be estimated as:
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)
y <- matrix(rnorm(5 * n), nrow = n)
## correlation matrix
r <- cor(x, y, method = "spearman")
## p-values
pval <- function(r, n) 2 * (1 - pt(abs(r)/sqrt((1 - r^2)/(n - 2)), n - 2))
pval(r, n)
## for comparison
cor.test(x[,1], y[,1], method = "spearman", exact = FALSE)
More details can be found here: https://stats.stackexchange.com/questions/312216/spearman-correlation-significancy-test
Edit
And finally a loop with cor.test:
## for comparison
p <- matrix(NA, nrow = ncol(x), ncol=ncol(y))
for (i in 1:ncol(x)) {
for (j in 1:ncol(y)) {
p[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$p.value
}
}
p
The values differ a somewhat, because the first uses the t-approximation then the second the "exact AS 89 algorithm" of cor.test.
I would like to clean up my code a bit and start to use more functions for my everyday computations (where I would normally use for loops). I have an example of a for loop that I would like to make into a function. The problem I am having is in how to step through the constraint vectors without a loop. Here's what I mean;
## represents spectral data
set.seed(11)
df <- data.frame(Sample = 1:100, replicate(1000, sample(0:1000, 100, rep = TRUE)))
## feature ranges by column number
frm <- c(438,563,953,963)
to <- c(548,803,1000,993)
nm <- c("WL890", "WL1080", "WL1400", "WL1375")
WL.ps <- list()
for (i in 1:length(frm)){
## finds the minimum value within the range constraints and returns the corresponding column name
WL <- colnames(df[frm[i]:to[i]])[apply(df[frm[i]:to[i]],1,which.min)]
WL.ps[[i]] <- WL
}
new.df <- data.frame(WL.ps)
colnames(new.df) <- nm
The part where I iterate through the 'frm' and 'to' vector values is what I'm having trouble with. How does one go from frm[1] to frm[2].. so-on in a function (apply or otherwise)?
Any advice would be greatly appreciated.
Thank you.
You could write a function which returns column name of minimum value in each row for a particular range of columns. I have used max.col instead of apply(df, 1, which.min) to get minimum value in a row since max.col would be efficient compared to apply.
apply_fun <- function(data, x, y) {
cols <- x:y
names(data[cols])[max.col(-data[cols])]
}
Apply this function using Map :
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))
I intend to find Pearson correlation coefficient from multi-dim data to one numeric vector in R. Basically, I am expecting to get a correlation matrix by using the Pearson method, want to keep the rows (a.k.a, features for each column) in multi-dim data by using certain correlation coefficient as threshold.However, I tentatively tried some R implementation to do that but didn't get correct correlation matrix though. How can I get this one? can anyone point me out how to make this happen easily in R? any thought?
reproducible example
persons_df <- data.frame(person1=sample(1:20,10, replace = FALSE),
person2=as.factor(sample(10)),
person3=sample(1:25,10, replace = FALSE),
person4=sample(1:30,10, replace = FALSE),
person5=as.factor(sample(10)),
person6=as.factor(sample(10)))
row.names(persons_df) <-letters[1:10]
in persons_df, different features in row-wise and different persons in column-wise are given.
I have also age_df which has age of each person.
age_df <- data.frame(personID= colnames(persons_df),
age=sample(1:50, 6 , replace = FALSE))
my initial attempt:
pearson_corr <- function(df1, df2, verbose=FALSE){
stopifnot(ncol(df1)==nrow(df2))
res <- as.data.frame()
lapply(colnames(df1), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(df2)){
cor_mat <- stats::cor(y, df2$age, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
res <- cbind(res, cor=cor_mat[out])
}
})
})
return(final_df)
}
but above code didn't return correct correlation matrix. what I want to do how each features of the certain person is correlated with his age. Is there any efficient way to make this happen? any idea?
goal:
basically, I want to keep the features which show a high correlation with age. I don't have a better idea to do this in R. Can anyone point me out how to get his done easily and efficiently in R? thanks
mylist = do.call(rbind,
apply(persons_df, 1, function(x){
temp = cor.test(age_df$age, as.numeric(x))
data.frame(t = temp$statistic, p = temp$p.value)
}))
mylist
# t p
#a -1.060264 3.488012e-01
#b -2.292612 8.361623e-02
#c -16.785311 7.382895e-05
#d -1.362776 2.446304e-01
#e -1.922296 1.269356e-01
#f -4.671259 9.509393e-03
#g -3.719296 2.048710e-02
#h -2.684663 5.496171e-02
#i -15.814635 9.341701e-05
#j -2.423014 7.252635e-02
Then use mylist to filter out what values you don't want.
I am new to R and programming, I want to store values from loop to a data frame in R. I want ker, cValues, accuracyValues values to be stored a data frame from bellow code. I am not able to achieve this, Data Frame is only saving last value not all the values.
Can you please help me with this please.
# Define a vector which has different kernel methods
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
"besseldot","anovadot","splinedot")
# Define a for loop to calculate accuracy for different values of C and kernel
for (ker in kerna){
cValues <- c()
accuracyValues <- c()
for (c in 1:100) {
model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
data = credit_card_data,
type ="C-svc",
kernel = ker,
C=c,
scaled =TRUE)
pred <- predict(model,credit_card_data[,1:10])
#pred
accuracy <- sum(pred== credit_card_data$V11)/nrow(credit_card_data)
cValues[c] <- c;
accuracyValues[c] <- accuracy;
}
for(i in 1:100) {
print(paste("kernal:",ker, "c=",cValues[i],"accuracy=",accuracyValues[i]))
}
}
Starting from your base code, set up the structure of the output data frame. Then, loop through and fill in the accuracy values on each iteration. This method also "flattens" the nested loop and gets rid of your c variable which conflicts with the built-in c() function.
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
"besseldot","anovadot","splinedot")
# Create dataframe to store output data
df <- data.frame(kerna = rep(kerna, each = 100),
cValues = rep(1:100, times = length(kerna)),
accuracyValues = NA,
stringsAsFactors = F)
# Define a for loop to calculate accuracy for different values of C and kernel
for (i in 1:nrow(df)){
ker <- df$kerna[i]
j <- df$cValues[i]
model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
data = credit_card_data,
type ="C-svc",
kernel = ker,
C=j,
scaled =TRUE)
pred <- predict(model,credit_card_data[,1:10])
accuracy <- sum(pred== credit_card_data$V11)/nrow(credit_card_data)
# Insert accuracy into df$accuracyValues
df$accuracyValues[i] <- accuracy;
}
Consider Map to build a list of data frames from each pairing of ker and cValues (1:100) generated from expand.grid and row bind all elements together.
k_c_pairs_df <- expand.grid(kerna=kerna, c_value=1:100, stringsAsFactors = FALSE)
model_fct <- function(ker, c) {
model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
data = credit_card_data,
type ="C-svc",
kernel = ker,
C=c,
scaled =TRUE)
pred <- predict(model,credit_card_data[,1:10])
accuracy <- sum(pred== credit_card_data$V11)/nrow(credit_card_data)
print(paste("kernal:",ker, "c=",cValues[i],"accuracy=",accuracyValues[i]))
return(data.frame(kernel = ker, cValues = c, accuracyValues = accuracy))
}
df_list <- Map(model_fct, k_c_pairs_df$ker, k_c_pairs_df$c_value)
final_df <- do.call(rbind, df_list)
I have a data set with election poll data with numeric values 0-10.
I have 30 columns with these values and I want to compare every column with
all the other columns in order to create a correlation matrix.
]
But I keep getting the following error code:
Error in columnlist[i, j] <- cor(feeling_therm[, i], feeling_therm[, j], :
incorrect number of subscripts on matrix
Any suggestion on how to get this right? I'm still getting used to the syntax of R.
Just use cor(election). It should create the correlation matrix.
As Nathan Werth remarked, cor(election) works just fine. However, if you insist on using a for loop you should initialize your matrix as a matrix (including correct dimensions), not as a list:
election <- replicate(5, rnorm(n = 100))
election <- as.data.frame(election)
cor_matrix <- matrix(nrow = ncol(election), ncol = ncol(election))
for (i in 1:ncol(election)) {
for (j in 1:ncol(election)) {
cor_matrix[i,j] <- cor(election[,i], election[,j], use = "complete")
}
}
If your aim is to store values in a list then following is an option.
n_col <- ncol(election)
electionlist <- as.list(data.frame( matrix(NA, n_col, n_col)))
for (i in 1 : n_col) {
for (j in 1 : n_col) {
electionlist[[c(i, j)]] <- cor(election[,i],
election[,j], use = "complete")
}
}