Spearman correlation between two matrices of same dimensions - r

I have two matrices of equal dimensions (p and e), and I would like to compute the Spearman correlation between columns of the same name. I want the pairwise correlations collected in a single output matrix (M).
I used the corr.test() function from the psych package, and here is what I did:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for(rs in 1:ncol(p)){
M[rs,1] <- G[rs]
cor <- corr.test(p[,rs],e[,rs],method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
But I get an error message:
Error in 1:ncol(y) : argument of length 0
Could you please show me what is wrong, or suggest another method?

No need for all this looping and indexing etc:
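# test data (column names added so that colnames(p) is not NULL)
library(psych)
p <- matrix(data = rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(data = rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
cor <- corr.test(p, e, method="spearman", adjust="none")
data.frame(name=colnames(p), r=diag(cor$r), p=diag(cor$p))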
# name r p
#a a 0.36969697 0.2930501
#b b 0.16363636 0.6514773
#c c -0.15151515 0.6760652
# etc etc
If the column names of the two matrices are not in the same order, align e to p first:
cor <- corr.test(p, e[,match(colnames(p),colnames(e))], method="spearman", adjust="none")

Since the two matrices are huge, running corr.test() on all possible column pairs would take a very long time. The loop that finally worked is as follows:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character(M[,1])  # make the name column character
G <- colnames(p)
for(rs in 1:ncol(p)){
  M[rs,1] <- G[rs]
  # corr.test() calls ncol() on its inputs, so pass one-column data frames, not bare vectors
  cor <- corr.test(as.data.frame(p[,rs]),as.data.frame(e[,rs]),
                   method="spearman",adjust="none")
  M[rs,2] <- cor$r
  M[rs,3] <- cor$p
}
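The same matched-column idea can also be sketched without psych. This is a minimal sketch, assuming base R's cor.test() is acceptable; the helper name pairwise_spearman and the simulated p and e are made up for illustration. It tests only column j of p against column j of e, so the number of tests grows with the number of columns rather than with its square.

```r
set.seed(1)
# Simulated stand-ins for the real matrices (assumption: same dimensions, same column order)
p <- matrix(rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))

# Hypothetical helper: Spearman test for each matched column pair only
pairwise_spearman <- function(p, e) {
  stopifnot(identical(dim(p), dim(e)))
  res <- lapply(seq_len(ncol(p)), function(j) {
    # exact = FALSE requests the asymptotic p-value
    ct <- cor.test(p[, j], e[, j], method = "spearman", exact = FALSE)
    c(r = unname(ct$estimate), p = ct$p.value)
  })
  out <- as.data.frame(do.call(rbind, res))
  data.frame(name = colnames(p), out)
}

M <- pairwise_spearman(p, e)
head(M)
```

The p-values may differ slightly from those reported by corr.test(), depending on how each function approximates the null distribution.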

Related

Apply function to dataset when function calls from two sources

I have a function that I want to apply to a dataset, but the function also uses global variables as arguments as these variables are needed elsewhere.
With this reduced example I want to apply 'pterotest' to the rows of 'data'. This test case works when the function is given V as a vector, and M and g as a single value.
df<- data.frame(matrix(ncol = 1, nrow = 3))
row.names(df) <- c("Apsaravis_ukhaana", "Jeholornis_prima", "Changchengornis_hengdaoziensis")
colnames(df) <- "M"
mass_var <- c(0.1840000, 1.6910946, 0.0858997)
df$M <- mass_var
V <- seq(0.25,30, by = 0.05)
g <- 9.81
pterotest <- function(V, M, g) {
out1 <- M*g
out2 <- V*M
return(list(V, out1, out2))
}
apply(df,1,pterotest, M = "M", g = g, V = V)
However, all I get is an error of the form:
Error in match.fun(FUN) : '1' is not a function, character or symbol
EDIT: Turning this on its head, I could loop over each row and pass the columns as separate arguments to the function, but with a 4.2M-row dataset I suspect vectorising would be quicker.
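One likely reading of the error: apply()'s formal arguments are X, MARGIN and FUN, and R appears to partially match the named argument M = "M" to MARGIN, which pushes the positional 1 into FUN. The following is a minimal sketch of two ways around that, assuming the function only needs the M column; the named list returned by pterotest and the variables res, out1 and out2 are illustrative choices, not part of the original code.

```r
df <- data.frame(
  M = c(0.1840000, 1.6910946, 0.0858997),
  row.names = c("Apsaravis_ukhaana", "Jeholornis_prima",
                "Changchengornis_hengdaoziensis")
)
V <- seq(0.25, 30, by = 0.05)
g <- 9.81

# Same calculation as in the question, returning a named list for convenience
pterotest <- function(V, M, g) {
  list(V = V, out1 = M * g, out2 = V * M)
}

# Option 1: one call per row, iterating over the mass column directly
res <- lapply(df$M, function(m) pterotest(V = V, M = m, g = g))
names(res) <- rownames(df)

# Option 2: fully vectorised; rows are species, columns are the speeds in V
out1 <- df$M * g
out2 <- outer(df$M, V)
```

With millions of rows and roughly 600 speeds, the matrix from outer() may not fit in memory, so processing the rows in chunks could be necessary.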

PCA analysis in a loop for certain column intervals in R

I have a data set containing 526 rows and 560 columns. I want to run a PCA on each consecutive block of 16 columns in a loop and save the PCA scores for each row. I tried the code below, but it did not work. I would be happy to get your advice.
Thanks in advance for your help.
for(i in 1:ncol(df)) {
df[ , i:(i+15)] <- prcomp(df[, i:(i+15)], scale. = TRUE, center = T)
}
Here is a way with a lapply loop. Create a vector f of consecutive integers, each repeated 16 times. Then split the data frame's column names by this vector and apply prcomp to each subset with lapply. Finally, extract the scores.
f <- c(1, rep(0, 15))
f <- rep(f, length(names(df1))/16)
f <- cumsum(f)
nms <- split(names(df1), f)
pca_list <- lapply(nms, function(x){
prcomp(df1[x], center = TRUE, scale. = TRUE)
})
scores_list <- lapply(pca_list, '[[', 'x')
Test data creation code
set.seed(2021)
df1 <- replicate(560, rnorm(526))
df1 <- as.data.frame(df1)
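The steps above can also be written with an integer-division grouping vector. This is a minimal sketch under the same test data; the names block and pc1 are illustrative. For context, the loop in the question advances i by one column at a time, runs past the last column, and assigns a prcomp object (a list) back into the data frame, which is probably why it failed.

```r
set.seed(2021)
df1 <- as.data.frame(replicate(560, rnorm(526)))

# Block index: columns 1-16 -> 1, columns 17-32 -> 2, and so on
block <- (seq_along(df1) - 1) %/% 16 + 1
nms <- split(names(df1), block)

# One PCA per block of 16 columns
pca_list <- lapply(nms, function(x) {
  prcomp(df1[x], center = TRUE, scale. = TRUE)
})

# Scores for every row: one 526 x 16 matrix per block
scores_list <- lapply(pca_list, `[[`, "x")

# Example: collect the first principal component of every block side by side
pc1 <- sapply(scores_list, function(s) s[, "PC1"])
dim(pc1)
```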

Any efficient way to filter out multi-dim dataframe by measuring its correlation coefficient in R?

I want to compute the Pearson correlation coefficient between each row of a multi-dimensional data set and one numeric vector in R. Basically, I expect to get a correlation matrix using the Pearson method, and I want to keep the rows (the features measured for each column) whose correlation coefficient passes a certain threshold. I tried an R implementation of this, but it did not produce a correct correlation matrix. How can I do this easily in R?
reproducible example
persons_df <- data.frame(person1=sample(1:20,10, replace = FALSE),
person2=as.factor(sample(10)),
person3=sample(1:25,10, replace = FALSE),
person4=sample(1:30,10, replace = FALSE),
person5=as.factor(sample(10)),
person6=as.factor(sample(10)))
row.names(persons_df) <-letters[1:10]
In persons_df, the rows are different features and the columns are different persons.
I also have age_df, which holds the age of each person.
age_df <- data.frame(personID= colnames(persons_df),
age=sample(1:50, 6 , replace = FALSE))
my initial attempt:
pearson_corr <- function(df1, df2, verbose=FALSE){
stopifnot(ncol(df1)==nrow(df2))
res <- as.data.frame()
lapply(colnames(df1), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(df2)){
cor_mat <- stats::cor(y, df2$age, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
res <- cbind(res, cor=cor_mat[out])
}
})
})
return(final_df)
}
But the above code did not return a correct correlation matrix. What I want to know is how each feature correlates with the persons' ages. Is there an efficient way to do this?
Goal:
Basically, I want to keep the features that show a high correlation with age. I don't have a better idea for doing this in R. Can anyone point out how to get this done easily and efficiently? Thanks.
mylist = do.call(rbind,
apply(persons_df, 1, function(x){
temp = cor.test(age_df$age, as.numeric(x))
data.frame(t = temp$statistic, p = temp$p.value)
}))
mylist
# t p
#a -1.060264 3.488012e-01
#b -2.292612 8.361623e-02
#c -16.785311 7.382895e-05
#d -1.362776 2.446304e-01
#e -1.922296 1.269356e-01
#f -4.671259 9.509393e-03
#g -3.719296 2.048710e-02
#h -2.684663 5.496171e-02
#i -15.814635 9.341701e-05
#j -2.423014 7.252635e-02
Then use mylist (the t statistic and p-value for each feature) to filter out the rows you don't want.
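The filtering step could look something like the following. This is a minimal sketch, assuming an all-numeric feature matrix rather than the factor columns in the question; the names feat, age, res and the cut-off threshold are hypothetical. It keeps the correlation estimate alongside the p-value so that the threshold can be applied to the coefficient itself.

```r
set.seed(1)
# Simulated stand-in: 10 features (rows) measured on 6 persons (columns)
feat <- matrix(sample(1:30, 60, replace = TRUE), nrow = 10,
               dimnames = list(letters[1:10], paste0("person", 1:6)))
age <- sample(1:50, 6)

# Pearson correlation of every feature (row) with age
res <- do.call(rbind, lapply(rownames(feat), function(f) {
  ct <- cor.test(age, feat[f, ])
  data.frame(feature = f, r = unname(ct$estimate), p = ct$p.value)
}))

# Keep the features whose absolute correlation reaches the cut-off
threshold <- 0.5  # hypothetical value, choose to suit the data
keep <- res$feature[which(abs(res$r) >= threshold)]
filtered <- feat[keep, , drop = FALSE]
```

Using which() drops features whose correlation is NA, for example rows with constant values.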

input k-means in R

I'm trying to perform k-means on a data frame with 69 columns and 1000 rows. First, I need to decide on the optimal number of clusters using the Davies-Bouldin index. This algorithm requires the input to be a matrix, so I first used this code:
totalm <- data.matrix(total)
followed by this code (Davies-Bouldin index):
clusternumber<-0
max_cluster_number <- 30
#Davies Bouldin algorithm
library(clusterCrit)
smallest <-99999
for(b in 2:max_cluster_number){
a <-99999
for(i in 1:200){
cl <- kmeans(totalm,b)
cl<-as.numeric(cl)
intCriteria(totalm,cl$cluster,c("dav"))
if(intCriteria(totalm,cl$cluster,c("dav"))$davies_bouldin < a){
a <- intCriteria(totalm,cl$cluster,c("dav"))$davies_bouldin }
}
if(a<smallest){
smallest <- a
clusternumber <-b
}
}
print("##clusternumber##")
print(clusternumber)
print("##smallest##")
print(smallest)
I keep getting this error: (list) object cannot be coerced to type 'double'.
How can I solve this?
Reproducible example:
a <- c(0,0,1,0,1,0,0)
b <- c(0,0,1,0,0,0,0)
c <- c(1,1,0,0,0,0,1)
d <- c(1,1,0,0,0,0,0)
total <- cbind(a,b,c,d)
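The error comes from cl <- as.numeric(cl). The result of a call to kmeans is an object of class "kmeans": a list containing various pieces of information about the fitted model, which cannot be coerced to a numeric vector. The code already passes cl$cluster to intCriteria, so that line can simply be dropped.
Run ?kmeans for details.
I would also recommend adding nstart = 20 to your kmeans call. k-means starts from randomly chosen centers, so results vary between runs; nstart = 20 tries 20 random starts for each number of centers and keeps the best fit.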
for(b in 2:max_cluster_number){
  a <- 99999
  for(i in 1:200){
    cl <- kmeans(totalm, centers = b, nstart = 20)
    # cl <- as.numeric(cl)  # removed: a kmeans object is a list
    db <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    if(db < a){
      a <- db
    }
  }
  if(a < smallest){
    smallest <- a
    clusternumber <- b
  }
}
This gave me
[1] "##clusternumber##"
[1] 4
[1] "##smallest##"
[1] 0.138675
(temporarily changing the maximum number of clusters to 4, as the reproducible data is a small set)
EDIT Integer Error
I was able to reproduce your error using
a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))
totalm <- cbind(a,b,c,d)
So that an integer matrix is created.
I was then able to remove the error by using
storage.mode(totalm) <- "double"
Note that
total <- cbind(a,b,c,d)
totalm <- data.matrix(total)
is unnecessary for the data in this example
> identical(total,totalm)
[1] TRUE
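Both points can be checked without clusterCrit. This is a minimal sketch using the reproducible data; the variable err and the choice of centers = 2 are only for illustration. It shows that a kmeans result is a list that as.numeric() cannot coerce, and that storage.mode() converts an integer matrix to double in place.

```r
a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))
totalm <- cbind(a, b, c, d)

storage.mode(totalm)             # integer matrix at this point
storage.mode(totalm) <- "double" # convert in place, dimensions are kept

set.seed(1)
cl <- kmeans(totalm, centers = 2, nstart = 20)

is.list(cl)   # the fitted object is a list
names(cl)     # cluster, centers, totss, ...

# Coercing the whole object fails; capture the message instead of stopping
err <- tryCatch(as.numeric(cl), error = function(e) conditionMessage(e))
err

# The cluster assignment is the vector intCriteria() needs
cl$cluster
```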

How do you find the sample sizes used in calculations on r?

I am running correlations between variables, some of which have missing data, so the sample size for each correlation is likely different. I tried print and summary, but neither shows me the n for each correlation. This seems like a fairly simple problem, but I cannot find the answer anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
You can also get the degrees of freedom like this (for a Pearson test, n is the degrees of freedom plus 2):
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code you use to estimate the correlation, to get exact help.
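To turn the degrees of freedom into a sample size, something like the following should work. This is a minimal sketch, assuming the default Pearson test, where cor.test() drops incomplete pairs and reports n - 2 as its parameter; the simulated x and y are illustrative.

```r
set.seed(1)
x <- c(rnorm(100), NA)            # last value missing
y <- c(x[1:100] + rnorm(100), 5)  # complete, so only one pair is dropped

ct <- cor.test(x, y)

# Pearson: degrees of freedom = n - 2, so n = df + 2
n_used <- unname(ct$parameter) + 2
n_used

# Cross-check against the number of complete pairs
sum(complete.cases(x, y))
```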
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to pivot_wider() the result. On my Databricks cluster it cut the compute time for a 1865-row x 69-column matrix from 2.5-3 minutes to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
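Both functions above call apply() once per pair of columns. A vectorised alternative is to take the cross-product of the "is observed" indicator matrix. This is a minimal sketch, assuming pairwise deletion is what the correlation uses; obs and n_mat are illustrative names.

```r
set.seed(1)
xx <- rnorm(3000)
xx[sample(3000, 200)] <- NA
dd <- matrix(xx, ncol = 3, dimnames = list(NULL, paste0("x", 1:3)))

# 1 where a value is observed, 0 where it is missing
obs <- (!is.na(dd)) * 1

# Entry [i, j] counts the rows where both column i and column j are observed
n_mat <- crossprod(obs)
n_mat
```

The diagonal holds the number of non-missing values in each column, and the result is already an n x n matrix with the column names on both dimensions.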
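If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you? That counts the pairs with missing data; sum(!is.na(a) & !is.na(b)) gives the number of complete pairs.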
