I have this R code:
> coef
[1] 1.5 2.4 3.9 4.4
> y
[,1] [,2] [,3] [,4]
[1,] 1 2 12 45
[2,] 5 6 7 8
[3,] 9 10 2 12
[4,] 13 14 15 45
[5,] 17 18 39 7
I need to multiply each value in each column by the corresponding element of coef. The result should be:
First column:
1*1.5
5*1.5
9*1.5
13*1.5
17*1.5
Second column:
2*2.4
6*2.4
10*2.4
14*2.4
18*2.4
Third column:
12*3.9
7*3.9
2*3.9
15*3.9
39*3.9
Fourth column:
45*4.4
8*4.4
12*4.4
45*4.4
7*4.4
That is, all of a column's values are multiplied by the coefficient at the same index in the vector.
How can I do this calculation?
The solution could be:
> y[,1] <- y[,1] * coef[1]
> y[,2] <- y[,2] * coef[2]
> y[,3] <- y[,3] * coef[3]
> y[,4] <- y[,4] * coef[4]
But this doesn't seem very efficient. Is there something better?
Thank you!
This will give you what you want:
t( t(y) * coef )
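A quick note on why this works (a sketch added for clarity, not part of the original answer): R recycles a vector down the rows of a matrix, so on the transpose the coefficients line up with what were originally the columns of y.
t_y <- t(y)           # 4 x 5: original columns become rows
scaled <- t_y * coef  # coef is recycled down each column of t(y), i.e. along y's columns
t(scaled)             # transpose back to the original 5 x 4 orientation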
Two more possibilities: sweep and scale (the latter only operates column-wise, and seems to me to be a bit of a hack).
coef <- c(1.5,2.4,3.9,4.4)
y <- matrix(c(seq(1,17,by=4),
seq(2,18,by=4),
c(12,7,2,15,39,
45,8,12,45,7)),
ncol=4)
t(t(y)*coef)
t(apply(y,1,"*",coef))
sweep(y,2,coef,"*")
scale(y,center=FALSE,scale=1/coef)
library(rbenchmark)
benchmark(t(t(y)*coef),
y %*% diag(coef),
t(apply(y,1,"*",coef)),
sweep(y,2,coef,"*"),
scale(y,center=FALSE,scale=1/coef),
replications=1e4)
test replications elapsed relative
5 scale(y, center = FALSE, scale = 1/coef) 10000 0.990 4.342105
4 sweep(y, 2, coef, "*") 10000 0.846 3.710526
3 t(apply(y, 1, "*", coef)) 10000 1.537 6.741228
1 t(t(y) * coef) 10000 0.228 1.000000
2 y %*% diag(coef) 10000 0.365 1.600877
edit: added y %*% diag(coef) from @baptiste [not the fastest here, although it might be for a big problem with a sufficiently optimized BLAS package ...] [and it was fastest in another trial, so I may just not have had a stable estimate]
edit: fixed typo in t(t(y)*coef) [thanks to Timur Shtatland] (but did not update timings, so they might be slightly off ...)
I also tried library(Matrix); y %*% Diagonal(x=coef), which is very slow for this example but might be fast for a large matrix. (I also tried constructing the diagonal matrix just once, but even multiplication by a predefined matrix was slow in this example: 25x slower than the best, vs. 47x slower when defining the matrix on the fly.)
I have a mild preference for sweep, as I think it expresses most clearly the operation being done ("multiply the columns by the elements of coef").
Note that without the outer t(), apply returns the transposed result:
apply(y, 1, "*", coef)
# -- result --
[,1] [,2] [,3] [,4] [,5]
[1,] 1.5 7.5 13.5 19.5 25.5
[2,] 4.8 14.4 24.0 33.6 43.2
[3,] 46.8 27.3 7.8 58.5 152.1
[4,] 198.0 35.2 52.8 198.0 30.8
A late entry:
coef[col(y)]*y
On my system, this is the fastest.
test replications elapsed relative
6 coef[col(y)] * y 10000 0.068 1.000
5 scale(y, center = FALSE, scale = 1/coef) 10000 0.640 9.412
4 sweep(y, 2, coef, "*") 10000 0.535 7.868
3 t(apply(y, 1, "*", coef)) 10000 0.837 12.309
1 t(t(y) * coef) 10000 0.176 2.588
2 y %*% diag(coef) 10000 0.187 2.750
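For context, a short sketch (not part of the original answer) of what coef[col(y)] is doing: col(y) returns a matrix of the same shape as y whose entries are column indices, so indexing coef with it expands the vector into a full matrix that can be multiplied element-wise with y.
col(y)            # 5 x 4 matrix of column indices 1..4
coef[col(y)]      # each entry replaced by the matching coefficient
coef[col(y)] * y  # element-wise product: each column scaled by its coefficient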
I have a computation that does a version of this:
n <- 5
l <- 3
m <- seq(0,1,length.out = n)
r <- seq(3,4,length.out = n)
y <- 1:n
pp <- sapply(0:l, function(h) cumsum(y[(h+1):n]*y[1:(n-h)]))
rec.acf <- sapply(0:l, function(h) pp[[h+1]] + sapply((h+1):n, function(j) m[j] + r[j-h]) )
to obtain
> rec.acf
[[1]]
[1] 4.0 8.5 18.0 34.5 60.0
[[2]]
[1] 5.25 11.75 24.25 44.75
[[3]]
[1] 6.5 15.0 30.5
[[4]]
[1] 7.75 18.25
In practice, of course, n and l are much larger (and the actual functions, computing autocovariances over samples of increasing size, are more complicated).
When l is relatively small, this works much faster than the other implementations I had worked out, as I had hoped, because it recycles many identical computations through pp.
However, the picture reverses when l is large relative to n, likely because the outer sapply then launches many inner loops. Is there anything obviously inefficient about my approach?
I tinkered with mapply, without much success.
The inner loop is unnecessary:
mm = lapply(0:l, function(h) tail(m, length(m) - h) + head(r, length(r) - h))
mapply("+", pp, mm)
#[[1]]
#[1] 4.0 8.5 18.0 34.5 60.0
#
#[[2]]
#[1] 5.25 11.75 24.25 44.75
#
#[[3]]
#[1] 6.5 15.0 30.5
#
#[[4]]
#[1] 7.75 18.25
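To see why the inner loop is unnecessary, a small check using the question's own m, r and n (a sketch, not part of the original answer), for a single lag:
h <- 2
sapply((h + 1):n, function(j) m[j] + r[j - h])   # original inner loop for lag h
tail(m, length(m) - h) + head(r, length(r) - h)  # vectorized equivalent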
I am doing some PCA analysis, and when comparing the results of the FactoMineR function PCA with prcomp from base R, I don't get the same results. One example:
library(ISLR)
library(FactoMineR)
data("NCI60")
df <- NCI60$data
pca_prcomp <- prcomp(df, scale. = T)
pca_facto <- FactoMineR::PCA(df, scale.unit = T, graph = F, ncp = 65)
# One column is missing
dim(pca_prcomp$x)
dim(pca_facto$ind$coord)
# Values are similar - but not the same
head(pca_prcomp$x[, 1:2])
head(pca_facto$ind$coord[, 1:2])
# Using scale function - does not return same values
pca_facto_scale <- PCA(scale(df), scale.unit = F, graph = F, ncp = 65)
head(pca_facto$ind$coord[, 1:2], 3)
head(pca_facto_scale$ind$coord[, 1:2], 3)
Sorry for being late. The FactoMineR package uses the same svd() approach, which should be similar (but not identical) to the prcomp() approach; both are listed under Q-mode, which is the preferred way to do PCA for its numerical accuracy. But note, I didn't say identical. Why? FactoMineR uses its own algorithm for PCA, where it calculates the number of components like this:
ncp <- min(ncp, nrow(X) - 1, ncol(X))
which tells you clearly why you got 63 components rather than the 64 that prcomp() would normally give. Your data set is typical of genomics data, where the n rows are fewer than the p columns of genes, and the code above uses the smaller of the number of rows (minus one) and the number of columns. If you follow the svd() algorithm directly it will return 64 dimensions, not 63.
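To make the arithmetic concrete (a sketch assuming the usual NCI60 dimensions of 64 cell lines by 6830 genes):
nrow_X <- 64    # cell lines (rows)
ncol_X <- 6830  # genes (columns)
min(65, nrow_X - 1, ncol_X)  # 63, hence the "missing" 64th dimension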
To explore the source code further type FactoMineR:::PCA.
For differences between the Q-mode (svd, prcomp(), FactoMineR::PCA()) and R-mode (eigen(), princomp()) I would recommend visiting this answer.
Side note: for prcomp() you want to pass the center = T argument in order to center your data before doing PCA (centering is the default, but it makes the intent explicit). Scaling, on the other hand, gives all your gene columns equal weight.
pca_prcomp <- prcomp(df, center = T, scale. = T) # add center=T
For scaling, prcomp() (like scale()) uses N - 1 as the divisor, while FactoMineR::PCA() uses N instead. The code below will prove it (refer to the same linked answer above):
# this is the scaled data by scale()
df_scaled <- scale(df)
# then you need to get the standardized data matrix back from the output of the FactoMineR::PCA() function, which can be done easily as follows:
df_restored <- pca_facto$svd$U %*% diag(pca_facto$svd$vs) %*% t(pca_facto$svd$V)
# then, to make FactoMineR::PCA() and scale() match up, you need to apply the correction
df_corrected <- df_restored * sqrt(63 / 64) # correct by sqrt((N-1)/N)
head(df[, 1:5]) # glimpse the first five columns only!
head(df_scaled[, 1:5])
head(df_restored[, 1:5]) # glimpse the first five columns only!
head(df_corrected[, 1:5])
round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3) # TRUE
R> head(df[, 1:5])
1 2 3 4 5
V1 0.300 1.180 0.550 1.140 -0.265
V2 0.680 1.290 0.170 0.380 0.465
V3 0.940 -0.040 -0.170 -0.040 -0.605
V4 0.280 -0.310 0.680 -0.810 0.625
V5 0.485 -0.465 0.395 0.905 0.200
V6 0.310 -0.030 -0.100 -0.460 -0.205
R> head(df_scaled[, 1:5])
1 2 3 4 5
V1 0.723 1.59461 1.315 1.345 -0.600
V2 1.584 1.73979 0.438 0.649 0.905
V3 2.173 -0.01609 -0.346 0.264 -1.301
V4 0.678 -0.37256 1.615 -0.441 1.235
V5 1.142 -0.57720 0.958 1.130 0.359
V6 0.746 -0.00289 -0.185 -0.120 -0.476
R> head(df_restored[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.729 1.60722 1.326 1.356 -0.605
[2,] 1.596 1.75354 0.442 0.654 0.912
[3,] 2.190 -0.01622 -0.349 0.266 -1.311
[4,] 0.683 -0.37550 1.628 -0.444 1.244
[5,] 1.151 -0.58176 0.965 1.139 0.361
[6,] 0.752 -0.00291 -0.186 -0.121 -0.480
R> head(df_corrected[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.723 1.59461 1.315 1.345 -0.600
[2,] 1.584 1.73979 0.438 0.649 0.905
[3,] 2.173 -0.01609 -0.346 0.264 -1.301
[4,] 0.678 -0.37256 1.615 -0.441 1.235
[5,] 1.142 -0.57720 0.958 1.130 0.359
[6,] 0.746 -0.00289 -0.185 -0.120 -0.476
R> round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3)
1 2 3 4 5
V1 TRUE TRUE TRUE TRUE TRUE
V2 TRUE TRUE TRUE TRUE TRUE
V3 TRUE TRUE TRUE TRUE TRUE
V4 TRUE TRUE TRUE TRUE TRUE
V5 TRUE TRUE TRUE TRUE TRUE
V6 TRUE TRUE TRUE TRUE TRUE
Book excerpt
There is also a book for the FactoMineR package, "Exploratory Multivariate Analysis by Example Using R", 2nd edition, by François Husson, Sébastien Lê, and Jérôme Pagès. Page 55 of the book discusses a data set from a genomics study similar to yours, the chicken.csv data set, with n = 43 rows, far fewer than its p = 7407 columns; you can find more info on their website, and the data set itself can be downloaded from this link. [The excerpt itself is an image and is not reproduced here.]
The difference is most likely between the eigenvalue and SVD methods for performing PCA (see this great answer for some details).
From ?prcomp:
The calculation is done by a singular value decomposition of the
(centered and possibly scaled) data matrix, not by using ‘eigen’
on the covariance matrix. This is generally the preferred method
for numerical accuracy.
From ?PCA:
Returns a list including:
eig: a matrix containing all the eigenvalues, the percentage of
variance and the cumulative percentage of variance
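As a small illustration of the SVD route (a sketch with made-up data, not taken from either package's internals; individual column signs may differ):
set.seed(1)
X <- matrix(rnorm(20), nrow = 5)
Xc <- scale(X, center = TRUE, scale = FALSE)  # centre only
sv <- svd(Xc)
sv$u %*% diag(sv$d)  # principal component scores via the SVD
prcomp(X)$x          # same values, up to the sign of each column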
I am learning R on my own and I am having some trouble trying to build a transition probability matrix in RStudio using the markovchain package. First I tried to calculate the transition probabilities of a DNA sequence:
ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT
But how can the transition probability matrix be calculated for a sequence like this? I was thinking of using R indexing, but I don't really know how to compute those transition probabilities.
Is there a way of doing this in R?
I am guessing that the output of those probabilities in a matrix should be something like this:
A T C G
A 0.60 0.10 0.10 0.20
T 0.10 0.50 0.30 0.10
C 0.05 0.20 0.70 0.05
G 0.40 0.05 0.05 0.50
You can use the markovchain package for help with this. First, your data
seq <- "ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT"
Then use the package
library(markovchain)
base_sequence <- strsplit(seq, "")[[1]]
mcX <- markovchainFit(base_sequence)$estimate
mcX
# A C G T
# A 0.3000000 0.2250000 0.2583333 0.2166667
# C 0.2857143 0.2619048 0.2380952 0.2142857
# G 0.3764706 0.1882353 0.2117647 0.2235294
# T 0.3068182 0.2159091 0.1818182 0.2954545
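The fitted object is an S4 markovchain object; a couple of follow-ups you might try (a hedged sketch: the @transitionMatrix slot and markovchainSequence() are assumptions about the package's interface rather than something shown in the original answer):
rowSums(mcX@transitionMatrix)  # each row of the transition matrix should sum to 1
set.seed(1)
markovchainSequence(10, mcX)   # simulate a short sequence from the fitted chain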
Create DNA
DNA <- "ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT"
Split it character by character
DNA_list <- unlist(strsplit(DNA, split = ""))
Retrieve unique elements
DNA_unique <- unique(DNA_list)
Create an empty matrix
matrix <- matrix(0, ncol = length(DNA_unique), nrow=length(DNA_unique))
Fill it: for each consecutive pair of elements i and i + 1, add one to the corresponding cell of the matrix.
for (i in 1:(length(DNA_list) - 1)){
index_of_i <- DNA_unique == DNA_list[i]
index_of_i_plus_1 <- DNA_unique == DNA_list[i + 1]
matrix[index_of_i, index_of_i_plus_1] = matrix[index_of_i, index_of_i_plus_1] + 1
}
Normalize it
matrix <- matrix / rowSums(matrix)
> matrix
[,1] [,2] [,3] [,4]
[1,] 0.3000000 0.2166667 0.2250000 0.2583333
[2,] 0.3068182 0.2954545 0.2159091 0.1818182
[3,] 0.2857143 0.2142857 0.2619048 0.2380952
[4,] 0.3764706 0.2235294 0.1882353 0.2117647
NB: There might be a way to perform this faster if you have a really large DNA sequence to process, but here it seems to be fast enough.
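A compact base-R alternative (a sketch, not part of the original answer) is to count consecutive pairs with table() and then normalize the rows:
pairs <- table(head(DNA_list, -1), tail(DNA_list, -1))  # counts of i -> i+1 transitions
prop.table(pairs, margin = 1)                           # row-normalize to probabilities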
I want to get the data where each column has had its mean subtracted.
First I compute the mean of each column.
Here is my data below, called m:
angel distance
[1,] 1.3 0.43
[2,] 4.0 0.84
[3,] 2.7 0.58
[4,] 2.2 0.58
[5,] 3.6 0.70
[6,] 4.9 1.00
[7,] 0.9 0.27
[8,] 1.1 0.29
[9,] 3.1 0.63
> mean<-apply(m,2,FUN=mean)
angel distance
2.6444444 0.5911111
> m-mean
angel distance
1 -1.34444444 -0.16111111
2 3.40888889 -1.80444444
3 0.05555556 -0.01111111
4 1.60888889 -2.06444444
5 0.95555556 0.10888889
6 4.30888889 -1.64444444
7 -1.74444444 -0.32111111
8 0.50888889 -2.35444444
9 0.45555556 0.03888889
So m - mean subtracts the mean vector going down each column (by recycling), which is not what I want. I want the means subtracted across each row, so that each column has its own mean removed. How can I get this?
First, let's use colMeans(m) to get column means of matrix m. Then we use sweep:
sweep(m, 2, colMeans(m))
where 2 specifies margin (we want column-wise operation, and in 2D index, the second index is for column). By default, sweep performs FUN = "-", so in above we are subtracting column means from the matrix, i.e., centring the matrix.
Similarly if we want to subtract row means from all rows, we can use:
sweep(m, 1, rowMeans(m))
You can set FUN argument to other functions, too. Another common use of sweep is for column / row rescaling, where you can read How to rescale my matrix by column or row for more.
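For instance (a small sketch, not from the original answer), rescaling each column of m by its own standard deviation:
sweep(m, 2, apply(m, 2, sd), "/")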
The scale function mentioned in the other answer only operates column-wise. A common use is to standardise all matrix columns; we can set scale = FALSE to perform column centring only.
scale is essentially a wrapper around sweep, which you can verify by inspecting the source code of scale.default:
if (center) {
    center <- colMeans(x, na.rm = TRUE)
    x <- sweep(x, 2L, center, check.margin = FALSE)
}
if (scale) {
    scale <- apply(x, 2L, f)
    x <- sweep(x, 2L, scale, "/", check.margin = FALSE)
}
Read ?sweep, ?scale, ?colMeans for more on those functions.
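A quick check (not part of the original answer) that sweep and scale agree here:
all.equal(sweep(m, 2, colMeans(m)),
          scale(m, center = TRUE, scale = FALSE),
          check.attributes = FALSE)  # TRUE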
You can get the same result with scale by centring without scaling:
scale(m, scale = FALSE)
angel distance
[1,] -1.34444444 -0.16111111
[2,] 1.35555556 0.24888889
[3,] 0.05555556 -0.01111111
[4,] -0.44444444 -0.01111111
[5,] 0.95555556 0.10888889
[6,] 2.25555556 0.40888889
[7,] -1.74444444 -0.32111111
[8,] -1.54444444 -0.30111111
[9,] 0.45555556 0.03888889
This question is sort of a follow-up to how to extract intragroup and intergroup distances from a distance matrix? in R. In that question, they first computed the distance matrix for all points, and then simply extracted the inter-class distance matrix. I have a situation where I'd like to bypass the initial computation and skip right to extraction, i.e. I want to directly compute the inter-class distance matrix. Drawing from the linked example, with tweaks, let's say I have some data in a dataframe called df:
values<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values, class)
What I'd like is a distance matrix:
1 2 3 8 10
4 .003 .295 .395 .003 .005
5 .598 .300 .200 .598 .590
6 .198 .100 .200 .198 .190
7 .001 .299 .399 .001 .009
9 .298 .000 .100 .298 .290
Does there already exist in R an elegant and fast way to do this?
EDIT After receiving a good solution for the 1D case above, I thought of a bonus question: what about a higher-dimensional case, say if instead df looks like this:
values1<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
values2<-c(0.001,0.1,0.1,0.001,0.1,0.1,0.001,0.001,0.1,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values1, values2, class)
And I'm interested in again getting a matrix of the Euclidean distance between points in class B with points in class A.
For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):
square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)
where the sums are over the dimensions of vectors a and b for i=[1,n]. Here, a and b are one pair from A and B. The key here is that this equation can be written as a matrix equation for all pairs in A and B.
In code:
## First split the data with respect to the class
n <- 2 ## the number of dimensions, for this example is 2
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(rowSums(tmp$B*tmp$B),rowSums(tmp$A*tmp$A))),
nrow=nrow(tmp$B)) -
2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
Notes:
The inner rowSums compute sum_i(b[i]*b[i]) and sum_i(a[i]*a[i]) for each b in B and a in A, respectively.
expand.grid then generates all pairs between B and A.
The outer rowSums computes the sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) for all these pairs.
This result is then reshaped into a matrix. Note that the number of rows of this matrix is the number of points of class B as you requested.
Then subtract two times the inner product of all pairs. This inner product can be written as a matrix multiply tmp$B %*% t(tmp$A) where I left out the coercion to matrix for clarity.
Finally, take the square root.
Using this code with your data:
print(d)
## 1 2 3 8 10
##4 0.0030000 0.3111688 0.4072174 0.0030000 0.01029563
##5 0.6061394 0.3000000 0.2000000 0.6061394 0.59682493
##6 0.2213707 0.1000000 0.2000000 0.2213707 0.21023796
##7 0.0010000 0.3149635 0.4110985 0.0010000 0.01272792
##9 0.3140143 0.0000000 0.1000000 0.3140143 0.30364453
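As a cross-check (not part of the original answer), the same matrix can be extracted from a full dist() computation, which is what the linked question originally did:
D <- as.matrix(dist(df[, 1:2]))      # full pairwise Euclidean distance matrix
D[df$class == "B", df$class == "A"]  # keep only the B-vs-A block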
Note that this code will work for any n > 1. We can recover your previous 1-d result by setting n to 1 and not performing the inner rowSums (because there is now only one column in tmp$A and tmp$B):
n <- 1 ## the number of dimensions, set this now to 1
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(tmp$B*tmp$B,tmp$A*tmp$A)),
nrow=length(tmp$B)) -
2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
print(d)
## [,1] [,2] [,3] [,4] [,5]
##[1,] 0.003 0.295 0.395 0.003 0.005
##[2,] 0.598 0.300 0.200 0.598 0.590
##[3,] 0.198 0.100 0.200 0.198 0.190
##[4,] 0.001 0.299 0.399 0.001 0.009
##[5,] 0.298 0.000 0.100 0.298 0.290
Here's an attempt via generating each combination and then simply taking the absolute difference between each pair of values:
abs(matrix(Reduce(`-`, expand.grid(split(df$values, df$class))), nrow=5, byrow=TRUE))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.003 0.295 0.395 0.003 0.005
#[2,] 0.598 0.300 0.200 0.598 0.590
#[3,] 0.198 0.100 0.200 0.198 0.190
#[4,] 0.001 0.299 0.399 0.001 0.009
#[5,] 0.298 0.000 0.100 0.298 0.290
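For the 1-d case specifically, outer() gives an arguably more direct route (a sketch, not part of the original answer, using the original one-column df from the question):
abs(outer(df$values[df$class == "B"], df$values[df$class == "A"], "-"))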