What is the mathematical formula behind scale in R? I tried the following, but the result is not the same as scale(X):
(X - colMeans(X)) / sapply(X, sd)
Since vector subtraction from matrices/data-frames works column-wise instead of row-wise, you have to transpose the matrix/data-frame before subtraction and then transpose back at the end. The result is the same as scale except for rounding errors. This is obviously a hassle to do, which I guess is why there's a convenience function.
x <- as.data.frame(matrix(sample(100), 10 , 10))
s <- scale(x)
my_s <- t((t(x) - colMeans(x))/sapply(x, sd))
isTRUE(all.equal(s, my_s, check.attributes = FALSE))
# [1] TRUE
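The recycling issue is easier to see on a tiny matrix. The sketch below uses a made-up 2 x 3 matrix (X_demo and the other names are only for illustration) and compares plain subtraction with sweep, which is one way to line each column mean up with its own column without transposing.

```r
# Hypothetical 2 x 3 example matrix, not from the question
X_demo <- matrix(c(1, 2, 10, 20, 100, 200), nrow = 2)
cm <- colMeans(X_demo)             # 1.5 15 150

# Recycled down the columns: element [2, 1] gets 15 subtracted, not 1.5
wrong <- X_demo - cm

# sweep() applies cm[j] to column j, which is the centering we actually want
right <- sweep(X_demo, 2, cm)

# Dividing by the column standard deviations the same way reproduces scale()
manual <- sweep(right, 2, apply(X_demo, 2, sd), "/")
isTRUE(all.equal(manual, scale(X_demo), check.attributes = FALSE))
```

Here wrong[2, 1] is 2 - 15 = -13, which shows the second column's mean leaking into the first column.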
1) For each column subtract its mean and then divide by its standard deviation:
apply(X, 2, function(x) (x - mean(x)) / sd(x))
2) Another way to write this which is fairly close to the code in the question is the following. The main difference between this and the question is that the question's code recycles by column (which is not correct in this case) whereas the following code recycles by row.
nr <- nrow(X)
nc <- ncol(X)
(X - matrix(colMeans(X), nr, nc, byrow = TRUE)) /
matrix(apply(X, 2, sd), nr, nc, byrow = TRUE)
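Written out as a formula, and assuming the defaults center = TRUE and scale = TRUE, both versions standardise each entry with the mean and the sample standard deviation of its own column (n is the number of rows, j indexes the columns):

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}
```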
I want to integrate a function which looks like
f <- function(x) 1.96 * sqrt(t(c(1,x)) %*% m %*% c(1,x))
where m is
m <- matrix(c(3.855, -0.206, -0.206, 0.01), nrow = 2, ncol = 2, byrow = TRUE)
Since the matrix multiplication inside my function produces a scalar, for any value of x, this is just a one-dimensional integration for f(x). How can I solve this smoothly?
Simply with integrate and Vectorize:
integrate(Vectorize(f), lower = 0, upper = 1)
Here is another option without Vectorize (though I believe the approach by @Stéphane Laurent is more space-efficient):
> ff <- function(x) 1.96 * sqrt(diag(t(rbind(1, x)) %*% m %*% rbind(1, x)))
> integrate(ff, lower = 0, upper = 1)
3.745299 with absolute error < 4.2e-14
where ff is already vectorized: rbind(1, x) builds one column per element of x, and diag picks out the quadratic form for each element, so ff accepts a vector argument.
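Since m is only 2 x 2, a third option is to expand the quadratic form by hand: c(1, x)' m c(1, x) = m[1,1] + (m[1,2] + m[2,1]) x + m[2,2] x^2. The sketch below (f_poly is a hypothetical name) uses that expansion, which is vectorized without Vectorize or diag, and checks it against the Vectorize approach.

```r
m <- matrix(c(3.855, -0.206, -0.206, 0.01), nrow = 2, ncol = 2, byrow = TRUE)

# Original scalar-only function from the question
f <- function(x) 1.96 * sqrt(t(c(1, x)) %*% m %*% c(1, x))

# Expanded quadratic form; works elementwise on a vector x
f_poly <- function(x) {
  1.96 * sqrt(m[1, 1] + (m[1, 2] + m[2, 1]) * x + m[2, 2] * x^2)
}

res_poly <- integrate(f_poly, lower = 0, upper = 1)
res_vec  <- integrate(Vectorize(f), lower = 0, upper = 1)
all.equal(res_poly$value, res_vec$value)
```

This only pays off for small, fixed dimensions; for larger matrices the diag or Vectorize versions are less error-prone to write.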
I am a novice in R and have been asked to compute a descriptive statistic called dominance (D; expressed as a percentage). D is defined as the mean abundance (MA) of x divided by the sum of the MA values of all vectors (x to i). MA, in turn, is the sum of all values in a vector divided by the length of that vector. Here is how I normally approach things:
#Example data
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)
#Mean abundance function
mean.abundance <- function(x){
N_sum <- sum(x)
N_count <- length(x)
N_sum/N_count
}
#Percent dominance function (workaround)
percent.dominance <- function(x, ...){
MA_a <- (x)
sum_MA_i <- sum(x, ...)
(MA_a/sum_MA_i)*100
}
MA_x <- mean.abundance(x)
MA_y <- mean.abundance(y)
MA_z <- mean.abundance(z)
MA <- c(MA_x, MA_y, MA_z)
MA
D_x <- percent.dominance(MA_x, MA_y, MA_z)
D_y <- percent.dominance(MA_y, MA_x, MA_z)
D_z <- percent.dominance(MA_z, MA_x, MA_y)
D <- c(D_x, D_y, D_z)
D
That approach alone already gives me the %D values I am looking for. My problem is that my (perfectionist) PI is asking me to compute the %D values directly from vectors x, y, and z (and not stepwise, by first calculating the MA values and then using MA_x, MA_y, and MA_z to calculate %D). I am stumped on writing a custom function for %D that takes the vectors of raw data; here is a failed attempt at revising the function, just to give a general idea.
#Percent dominance function (incorrect)
percent.dominance <- function(x, ...){
MA_a <- sum(x)/length(x)
sum_MA_i <- sum(x, ...)/length(x, ...)
(MA_a/sum_MA_i)*100
}
You can capture x together with the optional data using list(x, ...), compute the mean abundance of each vector, and divide the first by their sum -
percent.dominance <- function(x, ...){
  data <- list(x, ...)
  # mean abundance of every vector, so vectors of different lengths also work
  MA_i <- sapply(data, function(v) sum(v)/length(v))
  (MA_i[1]/sum(MA_i))*100
}
percent.dominance(x, y, z)
#[1] 13.33333
percent.dominance(y, x, z)
#[1] 33.33333
percent.dominance(z, x, y)
#[1] 53.33333
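If all the %D values are wanted in one call, a small variation is possible. The following is only a sketch: percent.dominance.all is a hypothetical helper, and it assumes every argument is a numeric vector of raw abundances.

```r
# Hypothetical helper: percent dominance for every vector passed in
percent.dominance.all <- function(...) {
  data <- list(...)
  MA <- sapply(data, mean)   # mean abundance of each vector
  MA / sum(MA) * 100         # each MA as a percentage of the total
}

x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)
D <- percent.dominance.all(x = x, y = y, z = z)
D
```

Naming the arguments (x = x, ...) keeps the names on the result, so each %D value stays attached to its vector.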
Title's a little rough, open to suggestions to improve.
I'm trying to calculate time-average covariances for a 500 length vector.
This is the equation we're using
The result I'm hoping for is a vector with an entry for k from 0 to 500 (0 would just be the variance of the whole set).
I've started with something like this, but I know I'll need to reference the gap (i) in the first mean comparison as well:
x <- rnorm(500)
xMean <-mean(x)
i <- seq(1, 500)
dfGam <- data.frame(i)
dfGam$gamma <- (1/(500-dfGam$i))*(sum((x-xMean)*(x[-dfGam$i]-xMean)))
Is it possible to do this using vector math or will I need to use some sort of for loop?
Here's the for loop that I've come up with for the solution:
gamma_func <- function(input_vec) {
  output_vec <- c()
  input_mean <- mean(input_vec)
  iter <- seq(1, length(input_vec)-1)
  for(val in iter){
    iter2 <- seq((val+1), length(input_vec))
    gamma_sum <- 0
    for(val2 in iter2){
      gamma_sum <- gamma_sum + (input_vec[val2]-input_mean)*(input_vec[val2-val]-input_mean)
    }
    output_vec[val] <- (1/length(iter2))*gamma_sum
  }
  return(output_vec)
}
Thanks
Using data.table, mostly for the shift function to make x_{t - k}, you can do this:
library(data.table)
gammabar <- function(k, x){
xbar <- mean(x)
n <- length(x)
df <- data.table(xt = x, xtk = shift(x, k))[!is.na(xtk)]
df[, sum((xt - xbar)*(xtk - xbar))/n]
}
gammabar(k = 10, x)
# [1] -0.1553118
The filter [!is.na(xtk)] starts the sum at t = k + 1, because xtk will be NA for the first k indices due to being shifted by k.
Reproducible x
x <- c(0.376972124936433, 0.301548373935665, -1.0980231706536, -1.13040590360378,
-2.79653431987176, 0.720573498411587, 0.93912102300901, -0.229377746707471,
1.75913134696347, 0.117366786802848, -0.853122822287008, 0.909259181618213,
1.19637295955276, -0.371583903741348, -0.123260233287436, 1.80004311672545,
1.70399587729432, -3.03876460529759, -2.28897494991878, 0.0583034949929225,
2.17436525195634, 1.09818265352131, 0.318220322390854, -0.0731475581637693,
0.834268741278827, 0.198750636733429, 1.29784138432631, 0.936718306241348,
-0.147433193833294, 0.110431994640128, -0.812504663900505, -0.743702167768748,
1.09534507180741, 2.43537370755095, 0.38811846676708, 0.290627670295127,
-0.285598287083935, 0.0760147178373681, -0.560298603759627, 0.447188372143361,
0.908501134499943, -0.505059597708343, -0.301004012157305, -0.726035976548133,
-1.18007702699501, 0.253074712637114, -0.370711296884049, 0.0221795637601637,
0.660044122429767, 0.48879363533552)
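For all lags at once in base R, something along these lines may also work. It is a sketch rather than a drop-in replacement: gamma_all is a hypothetical name, and it assumes the 1/(n - k) weighting used in the question's loop, whereas gammabar above divides by n.

```r
# Hypothetical helper: time-average covariance for every lag k = 0, ..., n - 1,
# using the 1 / (n - k) weighting from the question's loop
gamma_all <- function(x) {
  n <- length(x)
  d <- x - mean(x)                      # deviations from the overall mean
  sapply(0:(n - 1), function(k) {
    # pair d[t] with d[t - k] for t = k + 1, ..., n
    sum(d[(k + 1):n] * d[1:(n - k)]) / (n - k)
  })
}

set.seed(1)
x_demo <- rnorm(500)
g <- gamma_all(x_demo)
length(g)    # 500 entries; g[1] is lag 0, g[k + 1] is lag k
```

Dividing by n instead of n - k gives the estimator that acf(x, type = "covariance") reports, so that built-in is another route if the 1/n weighting is acceptable.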
I would like to vectorize a distance matrix calculation by using the Law of Cosines. For a matrix with no missing values, this calculation is even faster than dist for very big matrices.
The function goes like this:
vectorizedDistMat <- function(x,y) {
an = rowSums(x^2)
bn = rowSums(y^2)
m = nrow(x)
n = nrow(y)
tmp = matrix(rep(an, n), nrow=m)
tmp = tmp + matrix(rep(bn, m), nrow=m, byrow=TRUE)
sqrt( abs(tmp - 2 * tcrossprod(x,y) ))
}
Now, missing values complicate things, especially in the last line of the above function, where the two matrices are multiplied. There is a way of accounting for missing values retrospectively, e.g. here. But how do I efficiently keep missing values from propagating when the matrices are multiplied?
See for example M1 and M2 below:
set.seed(42)
M1 = matrix(rnorm(50), nrow = 10, ncol = 5)
M2 = matrix(rnorm(50), nrow = 10, ncol = 5)
M1[1,2] = NA
M2[3,4] = NA
Here, tcrossprod(M1, M2) yields NAs in the first row and third column. How do I account for them and get rid of them in advance (as the built-in function dist does)?
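One possible direction, offered as a sketch under assumptions rather than a tested answer: zero-fill the missing entries, carry 0/1 masks of the observed entries through the same matrix products, and rescale by the number of shared coordinates, which is how dist treats missing values. The name vectorizedDistMatNA is hypothetical.

```r
# Hypothetical NA-aware variant; mimics dist()'s rescaling for missing values
vectorizedDistMatNA <- function(x, y) {
  mx <- 1 * !is.na(x)                  # 1 where a value is observed, else 0
  my <- 1 * !is.na(y)
  x0 <- replace(x, is.na(x), 0)        # zero-fill so NAs do not propagate
  y0 <- replace(y, is.na(y), 0)
  # squared differences summed over coordinates observed in both rows
  ss <- tcrossprod(x0^2, my) + tcrossprod(mx, y0^2) - 2 * tcrossprod(x0, y0)
  cnt <- tcrossprod(mx, my)            # number of shared coordinates per pair
  # scale up proportionally to the full number of columns;
  # pairs with no shared coordinates come out as NaN
  sqrt(pmax(ss, 0) * ncol(x) / cnt)
}

set.seed(42)
M1 <- matrix(rnorm(50), nrow = 10, ncol = 5)
M2 <- matrix(rnorm(50), nrow = 10, ncol = 5)
M1[1, 2] <- NA
M2[3, 4] <- NA

d_fast <- vectorizedDistMatNA(M1, M2)
d_ref  <- as.matrix(dist(rbind(M1, M2)))[1:10, 11:20]
all.equal(d_fast, d_ref, check.attributes = FALSE)
```

The cost is two extra matrix products for the masks, so it stays fully vectorized; pmax guards against tiny negative values from floating-point cancellation.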
I have a matrix I want to transform, such that every feature in the transformed dataset has a mean of 0 and a variance of 1.
I have tried to use the following code:
scale <- function(train, test)
{
trainmean <- mean(train)
trainstd <- sd(train)
xout <- test
for (i in 1:length(train[1,])) {
xout[,i] = xout[,i] - trainmean(i)
}
for (i in 1:lenght(train[1,])) {
xout[,i] = xout[,i]/trainstd[i]
}
}
invisible(xout)
normalized <- scale(train, test)
This is, however, not working for me. Am I on the right track?
Edit: I am very new to the syntax!
You can use the built-in scale function for this.
Here's an example, where we fill a matrix with random uniform variates between 0 and 1 and centre and scale them to have 0 mean and unit standard deviation:
m <- matrix(runif(1000), ncol=4)
m_scl <- scale(m)
Confirm that the column means are 0 (within tolerance) and their standard deviations are 1:
colMeans(m_scl)
# [1] -1.549004e-16 -2.490889e-17 -6.369905e-18 -1.706621e-17
apply(m_scl, 2, sd)
# [1] 1 1 1 1
See ?scale for more details.
To write your own normalisation function, you could use:
my_scale <- function(x) {
apply(x, 2, function(x) {
(x - mean(x))/sd(x)
})
}
m_scl <- my_scale(m)
or the following, which is probably faster on larger matrices
my_scale <- function(x) sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), '/')
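The function in the question takes both a train and a test matrix, which suggests the test data should be standardised with the training statistics. A minimal sketch of that, assuming both matrices have the same columns (the data below are made up for illustration), passes the training means and standard deviations to scale explicitly:

```r
set.seed(1)
# Hypothetical train/test matrices with the same columns
train <- matrix(rnorm(200, mean = 5, sd = 2), ncol = 4)
test  <- matrix(rnorm(40,  mean = 5, sd = 2), ncol = 4)

train_mean <- colMeans(train)
train_sd   <- apply(train, 2, sd)

# Standardise both sets with the training statistics only
train_scl <- scale(train, center = train_mean, scale = train_sd)
test_scl  <- scale(test,  center = train_mean, scale = train_sd)
```

With this, train_scl has column means of 0 and standard deviations of 1, while test_scl will only be approximately standardised, since it reuses the training means and standard deviations.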
Just suggesting another hand-written normalising function that avoids apply, which is, in my experience, slower than matrix computation:
m = matrix(rnorm(5000, 2, 3), 50, 100)
m_centred = m - m%*%rep(1,dim(m)[2])%*%rep(1, dim(m)[2])/dim(m)[2]
m_norm = m_centred/sqrt(m_centred^2%*%rep(1,dim(m)[2])/(dim(m)[2]-1))%*%rep(1,dim(m)[2])
## Verification
rowMeans(m_norm)
apply(m_norm, 1, sd)
(Note that here row vectors are considered)