I am trying to assess the stability of a correlation analysis by iteratively dropping a variable, and re-running the analysis.
As I understand it, this requires me to (1) create matrices with p-1 columns by iteratively/sequentially dropping one variable from a data frame, (2) run a correlation function over the series of matrices, and (3) feed the output into a common data frame or list for subsequent analysis.
I am able to achieve each of these steps manually, as follows:
#required library for cc function
library(CCA)
#set seed
set.seed(123)
#X and Y dataframes
X_df <- data.frame(replicate(4,sample(1:10,10,rep=TRUE)))
Y_df <- data.frame(replicate(3,sample(1:10,10,rep=TRUE)))
#X and Y as scaled matrices
X <- scale(X_df)
Y <- scale(Y_df)
#manually omit a variable/column from the X df
X1 <- scale(X_df[2:4])
X2 <- scale(X_df[c(1, 3:4)])
X3 <- scale(X_df[c(1:2, 4)])
X4 <- scale(X_df[1:3])
#manually omit a variable/column from the Y df
Y1 <- scale(Y_df[2:3])
Y2 <- scale(Y_df[c(1, 3)])
Y3 <- scale(Y_df[1:2])
#perform canonical correlation - X sets and Y
cX1 <- cc(X1,Y)$cor
cX2 <- cc(X2,Y)$cor
cX3 <- cc(X3,Y)$cor
cX4 <- cc(X4,Y)$cor
#perform canonical correlation - Y sets and X
cY1 <- cc(X,Y1)$cor
cY2 <- cc(X,Y2)$cor
cY3 <- cc(X,Y3)$cor
#get canonical correlation values into a df
XVALS <- as.data.frame(rbind(cX1, cX2, cX3, cX4))
YVALS <- as.data.frame(rbind(cY1, cY2, cY3))
Of course, I know it's very bad to do this manually, and my real data is much larger.
Unfortunately, I am pretty new to R (and coding) and have been struggling to achieve any of these steps in a better way. I am familiar with the (existence of the) apply functions, and also with some dplyr functions that seem relevant (e.g., select), but I just can't get it to work despite hours of reading documentation and seemingly similar posts. Any guidance would be greatly appreciated.
Don't scale.
First of all, there is no need for scaled vectors, as the code below shows.
The reason vectors are usually scaled is a variant of R FAQ 7.31; see also this SO post.
With older processors the precision loss was a real problem, leading to clearly wrong results. This is no longer true, at least not in the general case.
#perform canonical correlation - original X sets and Y
cX1b <- cc(X_df[2:4], Y)$cor
cX2b <- cc(X_df[c(1, 3:4)], Y)$cor
cX3b <- cc(X_df[c(1:2, 4)], Y)$cor
cX4b <- cc(X_df[1:3], Y)$cor
XVALSb <- as.data.frame(rbind(cX1b, cX2b, cX3b, cX4b))
XVALS and XVALSb have different row names; make them equal in order to please all.equal().
row.names(XVALS) <- 1:4
row.names(XVALSb) <- 1:4
The results are not exactly equal but are within floating-point accuracy. In this case I'm testing equality with all.equal's default tolerance of .Machine$double.eps^0.5.
identical(XVALS, XVALSb)
#[1] FALSE
all.equal(XVALS, XVALSb)
#[1] TRUE
XVALS - XVALSb
# V1 V2 V3
#1 0.000000e+00 1.110223e-16 0.000000e+00
#2 -1.110223e-16 1.110223e-16 5.551115e-17
#3 1.110223e-16 -2.220446e-16 2.220446e-16
#4 1.110223e-16 4.440892e-16 1.110223e-16
The question.
To get all combinations of columns leaving one out there is function combn.
Function cc_df_one_out first calls combn on each of its arguments and then applies, over those index combinations, an anonymous function that computes CCA::cc.
Note that the row order is not the same as in your posted example, since combn does not follow your order of column indices.
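For illustration (not part of your original code), here is what combn produces for a four-column X; each column of the result is one set of retained column indices, so the first combination keeps columns 1:3 (i.e., drops column 4), whereas your manual version drops column 1 first:
combn(4, 3)
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    2
# [2,]    2    2    3    3
# [3,]    3    4    4    4
The function itself: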
cc_df_one_out <- function(X, Y){
  # all leave-one-out column index combinations for X and for Y
  f <- function(x) combn(ncol(x), ncol(x) - 1)
  X_inx <- f(X)
  Y_inx <- f(Y)
  # canonical correlations for each reduced X against full Y, and vice versa
  ccX <- t(apply(X_inx, 2, function(i) cc(X[, i], Y)$cor))
  ccY <- t(apply(Y_inx, 2, function(i) cc(X, Y[, i])$cor))
  list(XVALS = as.data.frame(ccX), YVALS = as.data.frame(ccY))
}
cc_df_one_out(X_df, Y_df)
#$XVALS
# V1 V2 V3
#1 0.8787169 0.6999526 0.5073979
#2 0.8922514 0.7244302 0.2979096
#3 0.8441566 0.7807032 0.3331449
#4 0.9059585 0.7371382 0.1344559
#
#$YVALS
# V1 V2
#1 0.8975949 0.7309265
#2 0.8484323 0.7488632
#3 0.8721945 0.7452478
In R I want to calculate the following:
This is the way I have it
for (i in 1:dim(x)[1]) {
  for (j in 1:dim(x)[2]) {
    omega_2new[i] = sum((X[,j] - munew)^2 / sigma_2new[j])
  }
}
omega_2new = omega_2new / dim(x)[2]
what's the fastest way to do it in R?
With #Zbynek's data, but avoiding R loops and summing the rows (which I believe is required):
1/ncol(M)*colSums(t((M-munew)^2)/sigma_2new)
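As a quick sanity check (assuming the intended quantity is, for each row i, the sum over columns j of (M[i, j] - munew)^2 / sigma_2new[j], divided by the number of columns), the one-liner can be compared against an explicit loop on made-up data:
set.seed(42)
M <- matrix(runif(100), nrow = 10)
sigma_2new <- runif(10)
munew <- mean(M)
# spell the assumed formula out as a loop
omega_loop <- numeric(nrow(M))
for (i in seq_len(nrow(M))) {
  omega_loop[i] <- sum((M[i, ] - munew)^2 / sigma_2new) / ncol(M)
}
# vectorized one-liner from above
omega_vec <- 1/ncol(M) * colSums(t((M - munew)^2) / sigma_2new)
all.equal(omega_loop, omega_vec)
#[1] TRUE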
First I create some sample data
M <- matrix(runif(100), nrow=10)
sigma_2new <- runif(10)
munew <- mean(M)
And then use sapply over the row indices
omega_2new <- sapply(1:nrow(M), function(x) sum((M[x,]-munew)^2/sigma_2new[x]))
omega_2new
[1] 1.4127063 3.9928844 1.3996303 1.2290480 1.4367367 4.9893873 0.9401457 0.6567372 4.5156849 1.0743495
I am using colSums in my code, but I also need the standard deviation besides the sum.
I searched the internet and found this page, which contains only:
colSums
colMeans
http://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html
I tried this:
colSd
but I got this error:
Error: could not find function "colSd"
How can I do the same thing but for the standard deviation:
colSd
Here is the code:
results <- colSums(x, na.rm=TRUE)  #### here I want colSd
I want to provide a fourth (very similar to #Thomas) approach and some benchmarking:
library("microbenchmark")
library("matrixStats")
colSdApply <- function(x, ...) apply(X=x, MARGIN=2, FUN=sd, ...)
colSdMatrixStats <- colSds
colSdColMeans <- function(x, na.rm=TRUE) {
  if (na.rm) {
    n <- colSums(!is.na(x)) # thanks #flodel
  } else {
    n <- nrow(x)
  }
  colVar <- colMeans(x*x, na.rm=na.rm) - (colMeans(x, na.rm=na.rm))^2
  return(sqrt(colVar * n/(n-1)))
}
colSdThomas <- function(x) sqrt(rowMeans((t(x)-colMeans(x))^2) * ((dim(x)[1])/(dim(x)[1]-1)))
m <- matrix(runif(1e7), nrow=1e3)
microbenchmark(colSdApply(m), colSdMatrixStats(m), colSdColMeans(m), colSdThomas(m))
# Unit: milliseconds
# expr min lq median uq max neval
# colSdApply(m) 435.7346 448.8673 456.6176 476.8373 512.9783 100
# colSdMatrixStats(m) 344.6416 357.5439 383.8736 389.0258 465.5715 100
# colSdColMeans(m) 124.2028 128.9016 132.9446 137.6254 172.6407 100
# colSdThomas(m) 231.5567 240.3824 245.4072 274.6611 307.3806 100
all.equal(colSdApply(m), colSdMatrixStats(m))
# [1] TRUE
all.equal(colSdApply(m), colSdColMeans(m))
# [1] TRUE
all.equal(colSdApply(m), colSdThomas(m))
# [1] TRUE
colSds and rowSds are two of many similar functions in the matrixStats package
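A minimal illustration with made-up data:
library(matrixStats)
m <- matrix(rnorm(20), nrow = 5)
colSds(m)  # one standard deviation per column, same values as apply(m, 2, sd)
rowSds(m)  # one per row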
Use the following:
colSd <- function (x, na.rm=FALSE) apply(X=x, MARGIN=2, FUN=sd, na.rm=na.rm)
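Called on a numeric matrix it returns one value per column, and na.rm is passed through to sd (hypothetical data below):
m <- matrix(c(1, 2, NA, 4, 5, 6), ncol = 2)
colSd(m)                # the NA in the first column propagates
colSd(m, na.rm = TRUE)  # NAs are dropped before computing sd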
This is the quickest and shortest way to calculate the standard deviation of the columns:
sqrt(diag(cov(data_matrix)))
Since the diagonal of a covariance matrix consists of the variances of each variable, we do the following (see the short example after the steps):
Calculate the covariance matrix using cov
Extract the diagonal of the matrix using diag
Take the square root of the diagonal values using sqrt in order to get the standard deviation
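A short example (with a hypothetical data_matrix) confirming that it matches the per-column sd:
data_matrix <- matrix(rnorm(30), ncol = 3)
sqrt(diag(cov(data_matrix)))
all.equal(sqrt(diag(cov(data_matrix))), apply(data_matrix, 2, sd))
#[1] TRUE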
I hope that helps :)
I don't know if these are particularly fast, but why not just use the formulae for SD:
x <- data.frame(y = rnorm(1000,0,1), z = rnorm(1000,2,3))
# If you have a population:
colsdpop <- function(x, ...)
  sqrt(rowMeans((t(x) - colMeans(x, ...))^2, ...))
colsdpop(x)
sd(x$y); sd(x$z) # won't match `sd`
# If you have a sample:
colsdsamp <- function(x)
  sqrt((rowMeans((t(x) - colMeans(x))^2) * ((dim(x)[1]) / (dim(x)[1] - 1))))
colsdsamp(x)
sd(x$y); sd(x$z) # will match `sd`
Note: the sample solution won't handle NAs well. One could incorporate something like apply(x,2,function(z) sum(!is.na(z))) into the right-most part of the formula to get an appropriate denominator, but it would get really murky quite quickly.
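For the curious, a rough sketch of that NA-tolerant variant (assuming an all-numeric matrix or data frame); it should agree with apply(x, 2, sd, na.rm = TRUE):
colsdsamp_na <- function(x) {
  n <- apply(x, 2, function(z) sum(!is.na(z)))  # per-column count of non-missing values
  ss <- rowMeans((t(x) - colMeans(x, na.rm = TRUE))^2, na.rm = TRUE)  # mean squared deviation per column
  sqrt(ss * n / (n - 1))  # rescale to the sample variance, then take the square root
}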
I believe I have found a more elegant solution in diag(sqrt(var(data)))
This worked for me to get the standard deviation of each of my columns. However, it does compute a bunch of extra unnecessary covariances (and their square roots) along the way, so it isn't necessarily the most efficient approach. But if your data is small, it works excellently.
EDIT: I just realized that sqrt(diag(var(data))) is probably a bit more efficient, since it drops the unnecessary covariance terms earlier.
I usually do column sd's with apply:
x <- data.frame(y = rnorm(20,0,1), z = rnorm(20,2,3))
> apply(x, 2, sd)
y z
0.8022729 3.4700314
Verify:
> sd(x$y)
[1] 0.8022729
> sd(x$z)
[1] 3.470031
You can also do it with dplyr easily:
library(dplyr)
library(magrittr) # for pipes
> x %>% summarize_all(.,sd)
y z
1 0.8022729 3.470031
You can just use the apply function:
all.sd <- apply(data, 2,sd)
I am running correlations between variables, some of which have missing data, so the sample size for each correlation is likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
you can also get the degrees of freedom like this...
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
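If you are using the default Pearson test, cor.test drops incomplete pairs and reports df = n - 2, so the pairwise n actually used can be recovered as df + 2 (continuing the example above):
unname(cor.test(x, y)$parameter) + 2
#[1] 100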
But I think it would be best if you show the code for how you are estimating the correlation, for exact help.
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
  u <- if (is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  h <- expand.grid(x = u, y = u)
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  h$n <- mapply(f, h[, 1], h[, 2])
  h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, or you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis's function above that outputs an n x n matrix rather than requiring you to pivot_wide() the result. On my Databricks cluster it cut the compute time for an 1865-row x 69-column matrix from 2.5-3 minutes down to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
  cols <- if (is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
  rownames(nn) <- colnames(nn) <- cols
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  for (i in 1:nrow(nn))
    for (j in 1:ncol(nn))
      nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
  nn
}
If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you?
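That counts the pairs lost to missingness; the complementary count, i.e. the pairwise sample size itself, would be (illustrative vectors below):
a <- c(1, 2, NA, 4, 5)
b <- c(2, NA, NA, 3, 1)
sum(is.na(a) | is.na(b))   # pairs lost to NAs: 2
sum(complete.cases(a, b))  # pairs available to the correlation: 3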
First of all, I am new to R (I started yesterday).
I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find j in the second with the minimum distance.
My idea is simple: for each i, let dist[j] be the distance between i and j, I only need to use which.min(dist) to find what I am looking for.
Each point is an array of 64 doubles, so
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
I have tried with
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}
which is extremely slow (with n = 200, it takes more than 40s!!). The fastest solution that I wrote is
distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}
for (i in 1:n) {
  d <- distance(data[i,], centers)
  which.min(d)
}
Even if it does a lot of computation that I don't use (because dist(m) computes the distance between all rows of m), it is much faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. And also, the distance code is very ugly. I tried to replace it with
distance <- function(point, group) {
  return(dist(rbind(point, group))[1:nrow(group)])
}
but this seems to be twice slower. I also tried to use dist for each pair, but it is also slower.
I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?
ps: I need this to implement k-means by hand (and I have to do it; it is part of an assignment). I believe I will only need Euclidean distance, but I am not yet sure, so I would prefer to have code where the distance computation can be replaced easily. stats::kmeans does all the computation in less than one second.
Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.
# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)
system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)
Runs in:
   user  system elapsed
0.100 0.008 0.108
on my laptop.
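To then recover the index of the nearest center for each data point (the S vector in the question), you can take the per-row minimum of dists; squared distances are fine for picking the minimum. For example:
# dists is n x K: one row per data point, one column per center
S <- apply(dists, 1, which.min)
# or, usually faster:
S <- max.col(-dists)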
rdist() is an R function from the {fields} package that can quickly calculate the distances between two sets of points given as matrices.
https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html
Usage :
library(fields)
#generating fake data
n <- 10
m <- 5
d <- 3
x <- matrix(rnorm(n * d), ncol = d)
y <- matrix(rnorm(m * d), ncol = d)
rdist(x, y)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
[2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
[3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
[4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
[5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
[6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
[7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
[8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
[9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
[10,] 2.492993 2.994359 3.3573190 6.498129 3.337441
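Applied to the question's setup (assuming data is the n x 64 matrix and centers the K x 64 matrix, and that rdist(x1, x2) returns an nrow(x1) x nrow(x2) matrix as documented), the nearest center per point could then be picked out like this:
D <- rdist(data, centers)    # n x K matrix of Euclidean distances
S <- apply(D, 1, which.min)  # index of the closest center for each point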
You may want to have a look at the apply functions.
For instance, this code
for (j in 1:K) {
  d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}
Can easily be substituted by something like
dt <- data[i,]
d <- apply(centers, 1, function(x) { sqrt(sum((x - dt)^2)) })
You can definitely optimise it further, but I hope you get the point.
dist is fast because it's vectorized and calls internal C functions.
Your loop code could be vectorized in many ways.
For example to compute distance between data and centers you could use outer:
diff_ij <- function(i,j) sqrt(rowSums((data[i,]-centers[j,])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)
This gives you n x K matrix of distances. And should be way faster than loop.
Then you could use max.col to find the maximum in each row (see the help page; there are some nuances when there are multiple maxima). X must be negated because we are searching for the minimum.
CL <- max.col(-X)
To be efficient in R you should vectorize as much as possible. In many cases loops can be replaced by a vectorized substitute. Check the help for rowSums (which also describes rowMeans, colSums and colMeans), pmax and cumsum. You could search SO, e.g.
https://stackoverflow.com/search?q=[r]+avoid+loop (copy & paste this link, I don't know how to make it clickable), for some examples.
My solution:
# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
  apply(data, 1, function(row) sqrt(sum((point - row)^2)))
}
You can try it, like:
x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
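To use it for the question's problem (hypothetical data and centers with the stated dimensions), one way is to loop it over the centers; the outer/matrix approaches above will generally be faster, but this keeps the distance function easy to swap out:
data <- matrix(rnorm(3823 * 64), ncol = 64)   # hypothetical data
centers <- matrix(rnorm(10 * 64), ncol = 64)  # hypothetical centers
# distance from every point to every center, then the closest center per point
D <- sapply(seq_len(nrow(centers)), function(j) euc.dist(data, centers[j, ]))
S <- max.col(-D)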