I'm very new to R. I have two matrices of different dimensions, C (3 rows, 79 columns) and T(3 rows, 215 columns). I want my code to calculate the Spearman correlation between the first column of C and all the columns of T and return the maximum correlation with the indexes and of the columns. Then, the second column of C and all the columns of T and so on. In fact, I want to find the columns between two matrices which are most correlated. Hope it was clear.
What I did was a nested for loop, but the result is not what I search.
for (i in 1:79){
for(j in 1:215){
print(max(cor(C[,i],T[,j],method = c("spearman"))))
}
}
You don't have to loop over the columns.
x <- cor(C,T,method = c("spearman"))
out <- data.frame(MaxCorr = apply(x,1,max), T_ColIndex=apply(x,1,which.max),C_ColIndex=1:nrow(x))
head(out)
gives,
MaxCorr T_ColIndex C_ColIndex
1 1 8 1
2 1 1 2
3 1 2 3
4 1 1 4
5 1 11 5
6 1 4 6
Fake Data:
C <- matrix(rnorm(3*79),nrow=3)
T <- matrix(rnorm(3*215),nrow=3)
Maybe something like the function below can solve the problem.
pairwise_cor <- function(x, y, method = "spearman"){
ix <- seq_len(ncol(x))
iy <- seq_len(ncol(y))
t(sapply(ix, function(i){
m <- sapply(iy, function(j) cor(x[,i], y[,j], method = method))
setNames(c(i, which.max(m), max(m)), c("col_x", "col_y", "max"))
}))
}
set.seed(2021)
C <- matrix(rnorm(3*5), nrow=3)
T <- matrix(rnorm(3*7), nrow=3)
pairwise_cor(C, T)
# col_x col_y max
#[1,] 1 1 1.0
#[2,] 2 2 1.0
#[3,] 3 2 1.0
#[4,] 4 3 0.5
#[5,] 5 5 1.0
Related
I have a 2D matrix mat with 500 rows × 335 columns, and a data.frame dat with 120425 rows. The data.frame dat has two columns I and J, which are integers to index the row, column from mat. I would like to add the values from mat to the rows of dat.
Here is my conceptual fail:
> dat$matval <- mat[dat$I, dat$J]
Error: cannot allocate vector of length 1617278737
(I am using R 2.13.1 on Win32). Digging a bit deeper, I see that I'm misusing matrix indexing, as it appears that I'm only getting a sub-matrix of mat, and not a single-dimension array of values as I expected, i.e.:
> str(mat[dat$I[1:100], dat$J[1:100]])
int [1:100, 1:100] 20 1 1 1 20 1 1 1 1 1 ...
I was expecting something like int [1:100] 20 1 1 1 20 1 1 1 1 1 .... What is the correct way to index a 2D matrix using indices of row, column to get the values?
Almost. Needs to be offered to "[" as a two column matrix:
dat$matval <- mat[ cbind(dat$I, dat$J) ] # should do it.
There is a caveat: Although this also works for dataframes, they are first coerced to matrix-class and if any are non-numeric, the entire matrix becomes the "lowest denominator" class.
Using a matrix to index as DWin suggests is of course much cleaner, but for some strange reason doing it manually using 1-D indices is actually slightly faster:
# Huge sample data
mat <- matrix(sin(1:1e7), ncol=1000)
dat <- data.frame(I=sample.int(nrow(mat), 1e7, rep=T),
J=sample.int(ncol(mat), 1e7, rep=T))
system.time( x <- mat[cbind(dat$I, dat$J)] ) # 0.51 seconds
system.time( mat[dat$I + (dat$J-1L)*nrow(mat)] ) # 0.44 seconds
The dat$I + (dat$J-1L)*nrow(m) part turns the 2-D indices into 1-D ones. The 1L is the way to specify an integer instead of a double value. This avoids some coercions.
...I also tried gsk3's apply-based solution. It's almost 500x slower though:
system.time( apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat ) ) # 212
Here's a one-liner using apply's row-based operations
> dat <- as.data.frame(matrix(rep(seq(4),4),ncol=2))
> colnames(dat) <- c('I','J')
> dat
I J
1 1 1
2 2 2
3 3 3
4 4 4
5 1 1
6 2 2
7 3 3
8 4 4
> mat <- matrix(seq(16),ncol=4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> dat$K <- apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat )
> dat
I J K
1 1 1 1
2 2 2 6
3 3 3 11
4 4 4 16
5 1 1 1
6 2 2 6
7 3 3 11
8 4 4 16
n <- 10
mat <- cor(matrix(rnorm(n*n),n,n))
ix <- matrix(NA,n*(n-1)/2,2)
k<-0
for (i in 1:(n-1)){
for (j in (i+1):n){
k <- k+1
ix[k,1]<-i
ix[k,2]<-j
}
}
o <- rep(NA,nrow(ix))
o <- mat[ix]
out <- cbind(ix,o)
I have a 2D matrix mat with 500 rows × 335 columns, and a data.frame dat with 120425 rows. The data.frame dat has two columns I and J, which are integers to index the row, column from mat. I would like to add the values from mat to the rows of dat.
Here is my conceptual fail:
> dat$matval <- mat[dat$I, dat$J]
Error: cannot allocate vector of length 1617278737
(I am using R 2.13.1 on Win32). Digging a bit deeper, I see that I'm misusing matrix indexing, as it appears that I'm only getting a sub-matrix of mat, and not a single-dimension array of values as I expected, i.e.:
> str(mat[dat$I[1:100], dat$J[1:100]])
int [1:100, 1:100] 20 1 1 1 20 1 1 1 1 1 ...
I was expecting something like int [1:100] 20 1 1 1 20 1 1 1 1 1 .... What is the correct way to index a 2D matrix using indices of row, column to get the values?
Almost. Needs to be offered to "[" as a two column matrix:
dat$matval <- mat[ cbind(dat$I, dat$J) ] # should do it.
There is a caveat: Although this also works for dataframes, they are first coerced to matrix-class and if any are non-numeric, the entire matrix becomes the "lowest denominator" class.
Using a matrix to index as DWin suggests is of course much cleaner, but for some strange reason doing it manually using 1-D indices is actually slightly faster:
# Huge sample data
mat <- matrix(sin(1:1e7), ncol=1000)
dat <- data.frame(I=sample.int(nrow(mat), 1e7, rep=T),
J=sample.int(ncol(mat), 1e7, rep=T))
system.time( x <- mat[cbind(dat$I, dat$J)] ) # 0.51 seconds
system.time( mat[dat$I + (dat$J-1L)*nrow(mat)] ) # 0.44 seconds
The dat$I + (dat$J-1L)*nrow(m) part turns the 2-D indices into 1-D ones. The 1L is the way to specify an integer instead of a double value. This avoids some coercions.
...I also tried gsk3's apply-based solution. It's almost 500x slower though:
system.time( apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat ) ) # 212
Here's a one-liner using apply's row-based operations
> dat <- as.data.frame(matrix(rep(seq(4),4),ncol=2))
> colnames(dat) <- c('I','J')
> dat
I J
1 1 1
2 2 2
3 3 3
4 4 4
5 1 1
6 2 2
7 3 3
8 4 4
> mat <- matrix(seq(16),ncol=4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> dat$K <- apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat )
> dat
I J K
1 1 1 1
2 2 2 6
3 3 3 11
4 4 4 16
5 1 1 1
6 2 2 6
7 3 3 11
8 4 4 16
n <- 10
mat <- cor(matrix(rnorm(n*n),n,n))
ix <- matrix(NA,n*(n-1)/2,2)
k<-0
for (i in 1:(n-1)){
for (j in (i+1):n){
k <- k+1
ix[k,1]<-i
ix[k,2]<-j
}
}
o <- rep(NA,nrow(ix))
o <- mat[ix]
out <- cbind(ix,o)
I have a 2D matrix mat with 500 rows × 335 columns, and a data.frame dat with 120425 rows. The data.frame dat has two columns I and J, which are integers to index the row, column from mat. I would like to add the values from mat to the rows of dat.
Here is my conceptual fail:
> dat$matval <- mat[dat$I, dat$J]
Error: cannot allocate vector of length 1617278737
(I am using R 2.13.1 on Win32). Digging a bit deeper, I see that I'm misusing matrix indexing, as it appears that I'm only getting a sub-matrix of mat, and not a single-dimension array of values as I expected, i.e.:
> str(mat[dat$I[1:100], dat$J[1:100]])
int [1:100, 1:100] 20 1 1 1 20 1 1 1 1 1 ...
I was expecting something like int [1:100] 20 1 1 1 20 1 1 1 1 1 .... What is the correct way to index a 2D matrix using indices of row, column to get the values?
Almost. Needs to be offered to "[" as a two column matrix:
dat$matval <- mat[ cbind(dat$I, dat$J) ] # should do it.
There is a caveat: Although this also works for dataframes, they are first coerced to matrix-class and if any are non-numeric, the entire matrix becomes the "lowest denominator" class.
Using a matrix to index as DWin suggests is of course much cleaner, but for some strange reason doing it manually using 1-D indices is actually slightly faster:
# Huge sample data
mat <- matrix(sin(1:1e7), ncol=1000)
dat <- data.frame(I=sample.int(nrow(mat), 1e7, rep=T),
J=sample.int(ncol(mat), 1e7, rep=T))
system.time( x <- mat[cbind(dat$I, dat$J)] ) # 0.51 seconds
system.time( mat[dat$I + (dat$J-1L)*nrow(mat)] ) # 0.44 seconds
The dat$I + (dat$J-1L)*nrow(m) part turns the 2-D indices into 1-D ones. The 1L is the way to specify an integer instead of a double value. This avoids some coercions.
...I also tried gsk3's apply-based solution. It's almost 500x slower though:
system.time( apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat ) ) # 212
Here's a one-liner using apply's row-based operations
> dat <- as.data.frame(matrix(rep(seq(4),4),ncol=2))
> colnames(dat) <- c('I','J')
> dat
I J
1 1 1
2 2 2
3 3 3
4 4 4
5 1 1
6 2 2
7 3 3
8 4 4
> mat <- matrix(seq(16),ncol=4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> dat$K <- apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat )
> dat
I J K
1 1 1 1
2 2 2 6
3 3 3 11
4 4 4 16
5 1 1 1
6 2 2 6
7 3 3 11
8 4 4 16
n <- 10
mat <- cor(matrix(rnorm(n*n),n,n))
ix <- matrix(NA,n*(n-1)/2,2)
k<-0
for (i in 1:(n-1)){
for (j in (i+1):n){
k <- k+1
ix[k,1]<-i
ix[k,2]<-j
}
}
o <- rep(NA,nrow(ix))
o <- mat[ix]
out <- cbind(ix,o)
I set x=1:2,y=1:2,and I would like to display all x+y outcomes 2 3 4. But it just prints 2 and 4.
x<-0
for(y in 1:2){
x<-x+1
print(y+x)
}
# [1] 2
# [1] 4
If you want all combinations, you can do this with outer instead of an explicit loop:
x <- 1:2
y <- 1:2
outer(x, y, FUN='+')
## [,1] [,2]
## [1,] 2 3
## [2,] 3 4
You can then reduce this matrix to a vector with c and use unique to get unique entries:
unique(c(outer(x, y, FUN='+')))
## [1] 2 3 4
You can use expand.grid to get all combinations of x and y
dat <- expand.grid(x=x, y=y)
dat
x y
1 1 1
2 2 1
3 1 2
4 2 2
And then calculate the sums with rowSums
rowSums(dat)
[1] 2 3 3 4
Or the unique rowSums
unique(rowSums(dat))
[1] 2 3 4
If you need all the combinations then use,
i<-0
abc <- array()
for(x in 1:2){
for(y in 1:2){
i <- i + 1
abc[i] <- y+x
}
}
If you need only unique combinatinos,
unique(abc)
I have a data frame with list of X/Y locations (>2000 rows). What I want is to select or find all the rows/locations based on a max distance. For example, from the data frame select all the locations that are between 1-100 km from each other. Any suggestions on how to do this?
You need to somehow determine the distance between each pair of rows.
The simplest way is with a corresponding distance matrix
# Assuming Thresh is your threshold
thresh <- 10
# create some sample data
set.seed(123)
DT <- data.table(X=sample(-10:10, 5, TRUE), Y=sample(-10:10, 5, TRUE))
# create the disance matrix
distTable <- matrix(apply(createTable(DT), 1, distance), nrow=nrow(DT))
# remove the lower.triangle since we have symmetry (we don't want duplicates)
distTable[lower.tri(distTable)] <- NA
# Show which rows are above the threshold
pairedRows <- which(distTable >= thresh, arr.ind=TRUE)
colnames(pairedRows) <- c("RowA", "RowB") # clean up the names
Starting with:
> DT
X Y
1: -4 -10
2: 6 1
3: -2 8
4: 8 1
5: 9 -1
We get:
> pairedRows
RowA RowB
[1,] 1 2
[2,] 1 3
[3,] 2 3
[4,] 1 4
[5,] 3 4
[6,] 1 5
[7,] 3 5
These are the two functions used for creating the distance matrix
# pair-up all of the rows
createTable <- function(DT)
expand.grid(apply(DT, 1, list), apply(DT, 1, list))
# simple cartesian/pythagorean distance
distance <- function(CoordPair)
sqrt(sum((CoordPair[[2]][[1]] - CoordPair[[1]][[1]])^2, na.rm=FALSE))
I'm not entirely clear from your question, but assuming you mean you want to take each row of coordinates and find all the other rows whose coordinates fall within a certain distance:
# Create data set for example
set.seed(42)
x <- sample(-100:100, 10)
set.seed(456)
y <- sample(-100:100, 10)
coords <- data.frame(
"x" = x,
"y" = y)
# Loop through all rows
lapply(1:nrow(coords), function(i) {
dis <- sqrt(
(coords[i,"x"] - coords[, "x"])^2 + # insert your preferred
(coords[i,"y"] - coords[, "y"])^2 # distance calculation here
)
names(dis) <- 1:nrow(coords) # replace this part with an index or
# row names if you have them
dis[dis > 0 & dis <= 100] # change numbers to preferred threshold
})
[[1]]
2 6 7 9 10
25.31798 95.01579 40.01250 30.87070 73.75636
[[2]]
1 6 7 9 10
25.317978 89.022469 51.107729 9.486833 60.539243
[[3]]
5 6 8
70.71068 91.78780 94.86833
[[4]]
5 10
40.16217 99.32774
[[5]]
3 4 6 10
70.71068 40.16217 93.40771 82.49242
[[6]]
1 2 3 5 7 8 9 10
95.01579 89.02247 91.78780 93.40771 64.53681 75.66373 97.08244 34.92850
[[7]]
1 2 6 9 10
40.01250 51.10773 64.53681 60.41523 57.55867
[[8]]
3 6
94.86833 75.66373
[[9]]
1 2 6 7 10
30.870698 9.486833 97.082439 60.415230 67.119297
[[10]]
1 2 4 5 6 7 9
73.75636 60.53924 99.32774 82.49242 34.92850 57.55867 67.11930