Applying a function to elements of each row and then summarising


I have a question about manipulating data in a data.frame.
Essentially, I have a large data set; an abbreviated version is below:
structure(list(nm_mean = c(194213914.326, 194213914.326, 194213914.326,
194213914.326, 194213914.326, 217947112.739), nm_se = c(9984735.05918367,
9984735.05918367, 9984735.05918367, 9984735.05918367, 9984735.05918367,
11010386.0760204), alpha = c(193.197697846336, 214.592588477741,
240.246557258741, 258.116959355425, 282.560024775668, 306.610038660465
), beta = c(61526.2664158025, 57950.9563448233, 56085.1512614369,
52919.4794239927, 51483.4591654126, 50405.8186695088)), .Names = c("nm_mean",
"nm_se", "alpha", "beta"), row.names = c(NA, 6L), class = "data.frame")
I want to use rbeta to generate probabilities from the beta distribution, with alpha and beta as the parameters.
Similarly, I want to use rnorm to generate random numbers from the normal distribution, with nm_mean and nm_se as the mean and sd.
I then want to multiply the rbeta values by the rnorm values and write the 50th, 25th and 75th percentiles of the product back into the data frame.
So as an example for row 1
x <- rbeta(1000,193.1977,61526.27)
y <- rnorm(1000,194213914,9984735)
z <- x*y
dat$ce <- quantile(z,0.5)
dat$ll <- quantile(z,0.25)
dat$ul <- quantile(z,0.975)
In essence, I get a ce, ll and ul for the product of the rbeta and rnorm draws, appended back to the data frame.

Motivated by @HackR's code, here is what I think is a functional vectorized version:
set.seed(42)
n <- 1000
nrows <- nrow(dat)
rn <- matrix(rnorm(nrows * n, dat$nm_mean, dat$nm_se), ncol = nrows, byrow = TRUE)
rb <- matrix(rbeta(nrows * n, shape1 = dat$alpha, shape2 = dat$beta),
ncol = nrows, byrow = TRUE)
cbind(dat,
structure(t(apply(rn * rb, 2, function(z) quantile(z, c(0.5, 0.25, 0.975)))),
.Dimnames = list(NULL, c("ce", "ll", "ul"))))
# nm_mean nm_se alpha beta ce ll ul
# 1 194213914 9984735 193.1977 61526.27 608455.3 570100.5 710373.6
# 2 194213914 9984735 214.5926 57950.96 715305.0 677754.3 856570.7
# 3 194213914 9984735 240.2466 56085.15 825143.7 778351.2 979361.1
# 4 194213914 9984735 258.1170 52919.48 943261.4 895832.6 1091899.3
# 5 194213914 9984735 282.5600 51483.46 1054514.3 995640.8 1226176.4
# 6 217947113 11010386 306.6100 50405.82 1312325.0 1247030.8 1515630.5

This is a vectorized solution based on my conversation with @thelatemail:
n <- 1000
grp <- nrow(dat)
# Repeat each row's parameters n times so that the draws for row i are contiguous
z <- with(dat, rnorm(grp*n, rep(nm_mean, each = n), rep(nm_se, each = n)) *
rbeta(grp*n, rep(alpha, each = n), rep(beta, each = n)))
for(i in 1:nrow(dat)){
idx <- ((i - 1)*n + 1):(i*n) # block of draws belonging to row i
dat$ce[i] <- quantile(z[idx],0.5)
dat$ll[i] <- quantile(z[idx],0.25)
dat$ul[i] <- quantile(z[idx],0.975)
}
A less vectorized solution is:
for(i in 1:nrow(dat)){
x <- rbeta(1000, shape1 = dat$alpha[i], shape2 = dat$beta[i])
y <- rnorm(n=1000,dat$nm_mean[i],dat$nm_se[i])
z <- x*y
dat$ce[i] <- quantile(z,0.5)
dat$ll[i] <- quantile(z,0.25)
dat$ul[i] <- quantile(z,0.975)
}
dat
nm_mean nm_se alpha beta ce ll ul
1 194213914 9984735 193.1977 61526.27 607563.9 573229.9 713057.2
2 194213914 9984735 214.5926 57950.96 712268.5 674826.3 836950.8
3 194213914 9984735 240.2466 56085.15 823322.9 777482.8 981156.7
4 194213914 9984735 258.1170 52919.48 937331.2 884945.0 1095876.3
5 194213914 9984735 282.5600 51483.46 1059980.4 1003596.4 1225615.6
6 217947113 11010386 306.6100 50405.82 1316733.1 1250190.1 1515185.0
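The row-wise loop above can also be packaged as a small helper and applied with mapply. This is only a minimal sketch, not taken from the answers: the helper name summarise_row and the two-row demo data frame are made up for illustration, and the probabilities 0.5, 0.25 and 0.975 are simply the ones used in the code above.

```r
# Hypothetical helper: simulate n draws for one row and return three quantiles.
summarise_row <- function(nm_mean, nm_se, alpha, beta, n = 1000) {
  z <- rbeta(n, shape1 = alpha, shape2 = beta) * rnorm(n, mean = nm_mean, sd = nm_se)
  quantile(z, probs = c(0.5, 0.25, 0.975), names = FALSE)
}

# Small made-up data frame with the same columns as in the question.
demo <- data.frame(nm_mean = c(194213914, 217947113),
                   nm_se   = c(9984735, 11010386),
                   alpha   = c(193.2, 306.6),
                   beta    = c(61526.3, 50405.8))

set.seed(1)
# mapply calls the helper once per row; the result has one column per row,
# so it is transposed before being bound to the data.
q <- t(mapply(summarise_row, demo$nm_mean, demo$nm_se, demo$alpha, demo$beta))
colnames(q) <- c("ce", "ll", "ul")
demo_out <- cbind(demo, q)
```

Keeping the simulation in one function makes it easy to change n or the probabilities in a single place.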

Related

Thousand separator to numeric columns in R

I am trying to format numbers as shown (adding a thousand separator). The function works fine, but after formatting, the numeric columns no longer sort numerically because they now contain characters.
df <- data.frame(x = c(12345,35666,345,5646575))
format_numbers <- function (df, column_name){
df[[column_name]] <- ifelse(nchar(df[[column_name]]) <= 5, paste(format(round(df[[column_name]] / 1e3, 1), trim = TRUE), "K"),
paste(format(round(df[[column_name]] / 1e6, 1), trim = TRUE), "M"))
}
df$x <- format_numbers(df,"x")
> df
x
1 12.3 K
2 35.7 K
3 0.3 K
4 5.6 M
Can we make sure the numbers still sort in ascending/descending order after formatting?
Note: this data frame df is to be shown in a DT table.

The problem is the formatting part. If you do it correctly, i.e. while keeping your data numeric, everything else falls into place. Here I demonstrate using an S3 class:
my_numbers <- function(x) structure(x, class = c('my_numbers', 'numeric'))
format.my_numbers <- function(x,..., d = 1, L = c('', 'K', 'M', 'B', 'T')){
ifelse(abs(x) >= 1000, Recall(x/1000, d = d + 1),
sprintf('%.1f%s', x, L[d]))
}
print.my_numbers <- function(x, ...) print(format(x), quote = FALSE)
'[.my_numbers' <- function(x, ..., drop = FALSE) my_numbers(NextMethod('['))
Now you can run your code:
df <- data.frame(x = c(12345,35666,345,5646575))
df$x <- my_numbers(df$x)
df
x
1 12.3K
2 35.7K
3 345.0
4 5.6M
You can use any mathematical operation on column x because it is still numeric.
For example,
cbinding it with its double and ordering from smallest to largest:
cbind(x = df, y = df*2)[order(df$x),]
x x
3 345.0 690.0 # smallest
1 12.3K 24.7K
2 35.7K 71.3K
4 5.6M 11.3M # largest ie Millions
Note that under the hood, x does not change:
unclass(df$x)
[1] 12345 35666 345 5646575 # Same as given
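If a custom class is more than is needed, a simpler route is to leave the numeric column untouched and derive a separate display column from it. The sketch below is an illustration under that assumption; the names format_si and x_label are hypothetical and do not come from the answer.

```r
# Hypothetical helper: turn numbers into short labels such as "12.3K".
format_si <- function(x) {
  units <- c("", "K", "M", "B", "T")
  # Power of 1000 for each value: 0 below 1e3, 1 for thousands, and so on.
  idx <- pmin(floor(log10(pmax(abs(x), 1)) / 3), length(units) - 1)
  sprintf("%.1f%s", x / 1000^idx, units[idx + 1])
}

df <- data.frame(x = c(12345, 35666, 345, 5646575))
df$x_label <- format_si(df$x)   # display column (character)
sorted <- df[order(df$x), ]     # ordering still uses the numeric column
```

Sorting is done on x while x_label is only shown, so the order follows the numeric values rather than the label text.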

Error in while (e_i$X1 < 12 | e_i$X2 < 12) { : argument is of length zero

In an earlier question (R: Logical Conditions Not Being Respected), I learned how to make the following simulation :
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many pairs of random numbers had to be generated before Step 1 was completed
Step 3: Repeat Step 1 and Step 2 100 times
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- 1
while(a < 12 | b < 12) {
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- i + 1
}
x <- c(a,b,i)
res <- rbind(res, x)
}
head(res)
[,1] [,2] [,3]
x 12.14232 12.08977 399
x 12.27158 12.01319 1695
x 12.57345 12.42135 302
x 12.07494 12.64841 600
x 12.03210 12.07949 82
x 12.34006 12.00365 782
Question: Now, I am trying to make a slight modification to the above code - Instead of "a" and "b" being produced separately, I want them to be produced "together" (in math terms: "a" and "b" were being produced from two independent univariate normal distributions, now I want them to come from a bivariate normal distribution).
I tried to modify this code myself:
library(MASS)
Sigma = matrix(
c(1,0.5, 0.5, 1), # the data elements
nrow=2, # number of rows
ncol=2, # number of columns
byrow = TRUE) # fill matrix by rows
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
e_i = data.frame(mvrnorm(n = 1, c(10,10), Sigma))
e_i$i <- 1
while(e_i$X1 < 12 | e_i$X2 < 12) {
e_i = data.frame(mvrnorm(n = 1, c(10,10), Sigma))
e_i$i <- i + 1
}
x <- c(e_i$X1, e_i$X2 ,i)
res <- rbind(res, x)
}
res = data.frame(res)
But this is producing the following error:
Error in while (e_i$X1 < 12 | e_i$X2 < 12) { : argument is of length
zero

If I understand your code correctly you are trying to see how many samples occur before both values are >=12 and doing that for 100 trials? This is the approach I would take:
library(MASS)
for(i in 1:100){
n <- 1
while(any((x <- mvrnorm(1, mu=c(10,10), Sigma=diag(0.5, nrow=2)+0.5))<12)) n <- n+1
if(i==1) res <- data.frame("a"=x[1], "b"=x[2], n)
else res <- rbind(res, data.frame("a"=x[1], "b"=x[2], n))
}
Here I am assigning the result of mvrnorm to x within the while() call. In that same call, any() checks whether either value is less than 12. If that evaluates to TRUE, n (the counter) is increased and the process is repeated. Once it is FALSE, the values are appended to your data.frame and control returns to the start of the for-loop.
Regarding your code, the mvrnorm() function is returning a vector, not a matrix, when n=1 so both values go into a single variable in the data.frame:
data.frame(mvrnorm(n = 1, c(10,10), Sigma))
Returns:
mvrnorm.n...1..c.10..10...Sigma.
1 9.148089
2 10.605546
The matrix() function within your data.frame() calls, along with some tweaks to your use of i, will fix your code:
library(MASS)
Sigma = matrix(
c(1,0.5, 0.5, 1), # the data elements
nrow=2, # number of rows
ncol=2, # number of columns
byrow = TRUE) # fill matrix by rows
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:10){
e_i = data.frame(matrix(mvrnorm(n = 1, c(10,10), Sigma), ncol=2))
i <- 1
while(e_i$X1[1] < 12 | e_i$X2[1] < 12) {
e_i = data.frame(matrix(mvrnorm(n = 1, c(10,10), Sigma), ncol=2))
i <- i + 1
}
x <- c(e_i$X1, e_i$X2 ,i)
res <- rbind(res, x)
}
res = data.frame(res)
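The two points made above (mvrnorm returns a plain vector when n = 1, and the counter must be tracked separately from the draw) can be sketched in a small helper. The name draw_until is hypothetical, and the threshold of 11 with 20 trials is an assumption chosen only so the example runs quickly; the question uses 12 and 100.

```r
library(MASS)  # provides mvrnorm()

# With n = 1, mvrnorm() returns a plain vector of length 2, not a 1 x 2 matrix.
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
one_draw <- mvrnorm(n = 1, mu = c(10, 10), Sigma = Sigma)

# Hypothetical helper: draw pairs until both values reach the threshold.
draw_until <- function(threshold, mu, Sigma) {
  n <- 0
  repeat {
    x <- mvrnorm(n = 1, mu = mu, Sigma = Sigma)
    n <- n + 1
    if (all(x >= threshold)) break
  }
  c(a = x[[1]], b = x[[2]], n = n)
}

set.seed(123)
# 20 trials; each column of replicate()'s result is one trial, hence t().
res <- as.data.frame(t(replicate(20, draw_until(11, c(10, 10), Sigma))))
```

Indexing the vector with x[[1]] and x[[2]] avoids the data.frame column-name problem entirely.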

How can I modify my code to include loop?

I am trying to create a function that examines how variables with different distributions influence OLS results. I have created two DVs (y1 and y2) but would like to expand this to include five or so. I am trying to change my code to include a loop so I do not have to copy and paste this multiple times, but I am not having much luck. Any suggestions would be greatly appreciated.
library(psych)
library(arm)
library(plyr)
library(fBasics)
regsim <- function(iter, n) {
ek1 <- rnorm(n, 0, 1)
ek2 <- rnorm(n, 0, 5)
x <- rnorm(n, 0, .5)
y1 <- .3*x + ek1
y2 <- .3*x + ek2
#y1
lm1 <- lm(y1 ~ x)
bhat1 <- coef (lm1)[2]
sehat1 <- se.coef (lm1) [2]
skewy1 <- skew(y1)
stdevy1 <- stdev(y1)
#y2
lm2 <- lm(y2 ~ x)
bhat2 <- coef (lm2)[2]
sehat2 <- se.coef (lm2) [2]
skewy2 <- skew(y2)
stdevy2 <- stdev(y2)
results <- c(bhat1, sehat1, stdevy1, skewy1,
bhat2, sehat2, stdevy2, skewy2)
names(results) <- c('b1', 'se1', 'sdy1', 'skewy1',
'b2', 'se2', 'sdy2', 'skewy2')
return(results)
}
iter <-1000
n <-500
results <- NULL
sims <-ldply(1:iter, regsim, n)
sims$n <- n
results <- rbind(results, sims)

Another option...
regsim <- function(n=100,num.y=5,sd=c(1:5)){
if(length(sd) != num.y){stop('length of sd must match number of dependent vars')
} else {
ldply(1:num.y,function(x){
e <- rnorm(n,0,sd=sd[x])
x <- rnorm(n,0,5)
y <- 0.3*x + e
out <- lm(y~x)
b1 <- coef(out)[2]
int <- coef(out)[1]
data.frame(b1=b1,int=int)
})
}
}
regsim(num.y=10,sd=c(1:10))
b1 int
1 0.30817303 0.0781049
2 0.38681600 -0.3359067
3 0.24560773 -0.0277561
4 0.08032659 0.1877233
5 0.39873955 -0.6027522
6 0.21729930 0.7384340
7 0.33761456 -0.1053028
8 0.26502006 -0.1851552
9 0.15452261 -1.6334873
10 -0.10496863 -0.3225169
This will allow you to specify the number of dependent variables and the SD for each error term. You can then use replicate to repeat the function for the desired number of replications.
replicate(10,regsim(),simplify = F)
[[1]]
b1 int
1 0.3047779 -0.01984306
2 0.3133198 -0.20458410
3 0.2833979 -0.25307502
4 0.3066878 -0.03235019
5 0.1374949 0.10958616
[[2]]
b1 int
1 0.2902103 -0.12683502
2 0.3499006 0.06691437
3 0.1949797 -0.14371830
4 0.2358269 0.53117467
5 0.2869511 0.16281380
[[3]]
b1 int
1 0.2952211 0.05905549
2 0.2367774 0.02862166
3 0.0896778 -0.08467935
4 0.2352622 -0.20835837
5 0.3149963 0.07042032
[[4]]
b1 int
1 0.2946468 -0.08266406
2 0.3322577 0.17558135
3 0.2200087 -0.25778150
4 0.1822915 0.34962679
5 0.2442479 0.34433656
[[5]]
b1 int
1 0.2882853 0.12677506
2 0.3455534 -0.27885958
3 0.2981193 0.04598347
4 0.3380173 0.05243198
5 0.2148643 -0.09631672
[[6]]
b1 int
1 0.2962269 0.03743759
2 0.2979327 -0.12830803
3 0.3352781 -0.03935422
4 0.2584965 -0.05924351
5 0.2856802 0.03430055
[[7]]
b1 int
1 0.2968077 -0.10300109
2 0.2954560 0.25979902
3 0.3276077 -0.07001758
4 0.1825841 0.13508932
5 0.4302788 -0.13951914
[[8]]
b1 int
1 0.2992147 0.02084806
2 0.2765976 0.07277813
3 0.2469616 0.44580403
4 0.2601966 -0.09849855
5 0.2679183 0.50501652
[[9]]
b1 int
1 0.2963905 0.03308366
2 0.3356783 -0.06080088
3 0.3199835 0.22533444
4 0.3546083 -0.26909478
5 0.3536241 -0.19795094
[[10]]
b1 int
1 0.3100336 -0.05228032
2 0.4076447 -0.18715063
3 0.3436858 -0.37518649
4 0.4569368 -0.09114672
5 0.3255668 -0.18738138

How about this:
n <- 1000
x <- rnorm(n, 0, .5)
fun_reg <- function(n, ek_mu, ek_sd, x){
s <- list() # list to collect results for output
ek <- rnorm(n, ek_mu, ek_sd)
y <- .3*x + ek
m <- lm(y ~ x)
s$bhat <- coef(m)[2]
s$sehat <- arm::se.coef(m)[2]
s$skewy <- psych::skew(y)
s$stdevy <- fBasics::stdev(y)
return(s)
}
purrr::map_dfr(c(1, 5, 10, 20, 50), ~fun_reg(n, 0, ., x))
Edit:
This now uses 500 observations, and the regression is repeated 1000 times for each value of the standard deviation. A variable ek_sd has been added to the final output to record which standard deviation produced each result. Note that x is not redrawn for each iteration; I'm not entirely sure that is what you want. If you want x to be redrawn at each iteration, move it inside the function.
n <- 500
x <- rnorm(n, 0, .5)
fun_reg <- function(n, ek_mu, ek_sd, x){
s <- list()
ek <- rnorm(n, ek_mu, ek_sd)
y <- .3*x + ek
m <- lm(y ~ x)
s$ek_sd <- ek_sd
s$bhat <- coef(m)[2]
s$sehat <- arm::se.coef(m)[2]
s$skewy <- psych::skew(y)
s$stdevy <- fBasics::stdev(y)
return(s)
}
intr <- unlist(lapply(c(1, 5, 10, 20, 50), rep, 1000))
purrr::map_dfr(intr, ~fun_reg(n, 0, ., x))

This reduces the package reliance to just psych::skew and an optional ggplot2 call:
library(psych)
regsim <- function(n, eks) {
x <- rnorm(n, 0, .5)
ek <- sapply(eks, function(x) rnorm(n, 0, x))
y <- 0.3 * x + ek
lms <- lm(y ~ x)
data.frame(b_hat = lms[['coefficients']][2,],
int = lms[['coefficients']][1, ],
skew_y = psych::skew(y),
se_hat = unlist(lapply(summary(lms), function(lst) lst[[4]][2,2]), use.names = FALSE),
sd_y = apply(y, 2, sd),
sd_eks = eks
)
}
iter <-1000
n <-500
eks_sd = c(1,5)
# do the simulations and make them into a nice data.frame
sims <- replicate(iter, regsim(n, eks_sd), simplify = FALSE)
results <- do.call(rbind, sims)
#next parts are optional
results$iter_id <- rep(seq_len(iter), each = length(eks_sd))
tibble::as_tibble(results)
# Random graph because everyone loves graphs
library(ggplot2)
ggplot(results, aes(x = iter_id, y = int)) + geom_point() + facet_grid(vars(sd_eks))
The main thing is that lm accepts a matrix response, fitting one regression per column of y. That's why we create a matrix of ek using sapply.
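As a minimal base-R illustration of the matrix-response behaviour described above (the object names sds, fit and coefs are chosen here for the sketch and are not from the answer):

```r
set.seed(1)
n <- 200
x <- rnorm(n, 0, 0.5)
sds <- c(1, 5)  # one error standard deviation per dependent variable

# n x 2 matrix of errors; column j uses sds[j].
ek <- sapply(sds, function(s) rnorm(n, 0, s))
y <- 0.3 * x + ek          # x is recycled down each column of ek

fit <- lm(y ~ x)           # a multi-response ("mlm") fit
coefs <- coef(fit)         # rows: intercept and slope; columns: responses

# The first column matches a separate single-response fit.
single <- coef(lm(y[, 1] ~ x))
```

Each column of coefs holds the intercept and slope that a separate lm call on that column would return.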

How to generate multiple matrix in R

I have gotten two lists of values in R.
daily_max_car: (List 1)
21 21 22 22 22 22 21
daily_0.8: (List 2)
16 17 17 17 18 17 17
I am trying to write a for loop in RStudio that generates multiple matrices, taking one value from each of these two lists at a time.
Here is the code I have been using to generate one matrix:
Lambda <- 21 # from List 1
Mue <- 4
Rho <- Lambda/Mue
N <- 16 # from List 2
All of these four parameters will be used in the "calculatewq" Function.
calculatewq <- function(c)
{....Some thing happening }
##Create Matrix
matrix1 <- matrix(0,Lambda,4)
matrix1[,1] <- 1:Lambda
### Create a column of matrix with repeated "N"
rep.row<-function(x,y)
{matrix(rep(x,each=y),nrow=y)}
created_mar_1 <- rep.row(N,Lambda)
car_n<- created_mar_1-matrix1[,1]
created_mar_3 <- rep.row(69*60*24,Lambda)
## Add into Matrix
for (i in 1:Lambda)
{matrix1[i,2] <- calculatewq(i)[2]
matrix1[i,3] <- calculatewq(i)[5]
matrix1[,4] = car_n*created_mar_3}
Whenever I change one of the parameters, it generates a new matrix.
How can I write a for loop that generates multiple matrices as I plug different values into Lambda and N?
Thank you so much!
Sampson

I removed the for loop inside the calculatewq function. Please check whether you actually need a for loop there.
myfun <- function(Lambda, N, mu )
{
# browser()
var1 <- seq_len( Lambda )
var2 <- ( rep( N, each = Lambda) ) - var1
var3 <- rep( 69*60*24, each = Lambda )
var4 <- var2 * var3
fun_vals <- do.call( 'rbind',
lapply( var1, function( x ) calculatewq( x, Lambda = Lambda, N = N, mu = mu ) ) )
mat <- matrix( NA, nrow = Lambda, ncol = 4 ) # four result columns; ncol = mu only worked because mu happens to be 4
mat[, 1] <- var1
mat[, 2] <- fun_vals[, 'Wq']
mat[, 3] <- fun_vals[, 'customer_serviced']
mat[, 4] <- var4
return(mat)
}
calculatewq <- function( x, Lambda, N, mu )
{
# browser()
Rho <- Lambda / mu
p0_inv <- ( Rho^x * (1-(( Rho/x )^( N-x+1)))) / (factorial( x ) * ( 1 - ( Rho / x ) ) )
p0_inv <- p0_inv + ( Rho^x) / factorial( x )
P0 <- 1/p0_inv
Lq <- ( Rho^(x+1)) * (1-((Rho/x)^(N-x+1))-((N-x+1)*(1-(Rho/x))*((Rho/x)^(N-x))))*P0/(factorial(x-1)*(x-Rho)^2)
Wq <- 60*Lq/Lambda
Ls <- Lq + Rho
Ws <- 60*Ls/Lambda
PN <- (Rho^N)*P0/(factorial(x)*x^(N-x))
customer_serviced <- (1 - PN)*100
a <- cbind( Lq, Wq, Ls, Ws, customer_serviced )
return(a)
}
mu <- 4
res <- Map( myfun,
list( 21 ,21, 22, 22 ,22, 22 ,21 ),
list( 16, 17, 17, 17, 18, 17 ,17 ),
mu)
head( res[[1]])
# [,1] [,2] [,3] [,4]
# [1,] 1 42.184874 19.04762 1490400
# [2,] 2 38.241748 38.09526 1391040
# [3,] 3 33.339271 57.13862 1291680
# [4,] 4 26.014138 75.70348 1192320
# [5,] 5 16.339462 89.88989 1092960
# [6,] 6 9.121053 96.32498 993600

daily_max_car <- list(21,21,22,22,22,22,21)
daily_0.8 <- list(16,17,17,17,18,17,17)
myfunction <- function(Lambda, N){
Mue <- 4
Rho <- Lambda/Mue
df <- as.data.frame(matrix(0, ncol = 4, nrow = Lambda))
names(df) <- c("A","B","C","D")
df[,1] <- 1:Lambda
df[,2] <- N
df[,3] <- df[,1] - df[,2]
df[,4] <- 69*60*24
return(df)
}
myfunction(21,16)
result <- mapply(myfunction, daily_max_car, daily_0.8, SIMPLIFY = FALSE) # keep a list of data frames
result

Lambda <- 21
Mue <- 4
Rho <- Lambda/Mue
N <- 19
matrix1 <- matrix(0,Lambda,4)
matrix1[,1] <- 1:Lambda
rep.row<-function(x,y)
{
matrix(rep(x,each=y),nrow=y)
}
created_mar_1 <- rep.row(N,Lambda)
car_n<- created_mar_1-matrix1[,1]
created_mar_3 <- rep.row(69*60*24,Lambda)
calculatewq <- function(c)
{
P0inv <- (Rho^c*(1-((Rho/c)^(N-c+1))))/(factorial(c)*(1-(Rho/c)))
# same range as the original 1:c-1, which R parses as (1:c)-1
for (i in 0:(c-1))
{
P0inv = P0inv + (Rho^i)/factorial(i)
}
P0 = 1/P0inv
Lq = (Rho^(c+1))*(1-((Rho/c)^(N-c+1))-((N-c+1)*(1-(Rho/c))*((Rho/c)^(N- c))))*P0/(factorial(c-1)*(c-Rho)^2)
Wq = 60*Lq/Lambda
Ls <- Lq + Rho
Ws <- 60*Ls/Lambda
PN <- (Rho^N)*P0/(factorial(c)*c^(N-c))
customer_serviced <- (1 - PN)*100
a <- cbind(Lq,Wq,Ls,Ws,customer_serviced)
return(a)
}
# call the function only after it has been defined
calculatewq(7)
for (i in 1:Lambda)
{
matrix1[i,2] <- calculatewq(i)[2]
matrix1[i,3] <- calculatewq(i)[5]
matrix1[,4] = car_n*created_mar_3
}
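To produce one matrix per (Lambda, N) pair, the single-matrix code can be wrapped in a function and applied with Map. The sketch below is an assumption-laden illustration: build_matrix is a hypothetical stand-in that fills only the columns that do not depend on calculatewq(), so columns 2 and 3 stay at zero.

```r
daily_max_car <- c(21, 21, 22, 22, 22, 22, 21)  # Lambda values (List 1)
daily_0.8     <- c(16, 17, 17, 17, 18, 17, 17)  # N values (List 2)

# Hypothetical stand-in for the real per-pair computation.
build_matrix <- function(Lambda, N) {
  m <- matrix(0, nrow = Lambda, ncol = 4)
  m[, 1] <- seq_len(Lambda)
  m[, 4] <- (N - m[, 1]) * 69 * 60 * 24
  m
}

# Map() pairs the i-th Lambda with the i-th N and returns a list of matrices.
matrices <- Map(build_matrix, daily_max_car, daily_0.8)
```

A list is the natural container here because the matrices have different numbers of rows (21 or 22).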

How to efficiently do complex row operations with nested functions in R?

Suppose we have a multidimensional array, e.g. a zoo object z with columns a, b, c, x, and a function W(w=c(1,1,1), x) that weights every column individually but also depends on the row's value in column x. How can row operations such as rowWeightedMeans be computed efficiently in this case?
zoo is known to be fast and efficient for row operations when the function is very simple, e.g.:
W <- function(w) { return(w); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3)))
But what if W() depends on a value in that row? E.g.:
W <- function(w, x) { return(w*x); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3), z[,4]))
R complains here because it does not know how to handle the multiple dimensions of the arguments in the nested function.
The solution could be a for(i in 1:nrow(z)) loop that computes the values individually for every row i. However, for large data sets this takes an enormous amount of extra computational effort and time.
EDIT
Ok guys, thanks for your time and criticism. I tried and tested all your answers but must admit that the actual problem was not solved or understood. For example, I hadn't asked for my weight function or calculations to be rewritten, because I had already presented a minimal version of much more complex calculations. The issue lies much deeper. So I sat back, tried to boil the problem down to its root, and found a minimal working example for you without any zoo objects, weighted means, and so on. Here you go:
z <- data.frame(matrix (1:20, nrow = 4))
colnames (z) <- c ("a", "b", "c", "x", "y")
z
# a b c x y
#1 1 5 9 13 17
#2 2 6 10 14 18
#3 3 7 11 15 19
#4 4 8 12 16 20
W <- function(abc, w, p) {
ifelse (w[1] == p, return(length(p)), return(0))
# Please do not complain! I know this is stupid, but it is an MWE
# and my calculations contained in W() are much more complex!
}
z[,"y"] <- W(z[,1:3], c(14,7,8), z[,"x"])
# same result: z[,"y"] <- apply(z[,1:3], 1, W, c(14,7,8), z[,"x"])
z
# a b c x y
#1 1 5 9 13 4
#2 2 6 10 14 4
#3 3 7 11 15 4
#4 4 8 12 16 4
# expected outcome:
# a b c x y
#1 1 5 9 13 0
#2 2 6 10 14 4
#3 3 7 11 15 0
#4 4 8 12 16 0
The problem I am facing is, that R passes all lines of z[,"x"] to the function, however, I expect it to take only the line which corresponds to the line of z[,"y"] that is currently processed internally when R loops through it. In this example, I expect 14==14 only in line number 2!
So: how to tell R to pass line by line to functions?
SOLUTION
Besides the awarded and accepted answer, I like to summarize the solution here to improve clarity and provide a better overview about the discussion.
This question was not about rewriting the specific function W (e.g. weighting). It was only about the inability of R to pass multiple row-by-row arguments to a general function. With either z$y <- f(z$a, z$x) or z$y <- apply(z$a, 1, f, z$x), only the first argument is passed row by row; the second is passed as a complete column with all rows. This seems to be intrinsic behaviour of R that we need to work around.
To solve this, the whole row needs to be passed as a single argument to a wrapper function, which in turn then applies the specific calculations on that row. Solution for the problem with the weights:
f <- function(x) weighted.mean(x[1:3], W(c(0.1,0.5,0.3), x[4]))
z[,"wmean"] <- apply(z[,1:4], 1, f)
Solution for the general problem with the data frame:
f <- function(x) W(x[1:3], c(14,7,8), x[4])
z$y <- apply(z, 1, f)
Brian also presents even faster methods using compiled C++ code in his accepted answer. Thanks to @BrianAlbertMonroe, @jaimedash and @inscaven for dealing with the poorly clarified question and for pointing towards this solution.
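The wrapper idea summarised above can also be written by iterating over row indices, so that every argument is taken from the same row. This is a minimal sketch: W_row is a hypothetical row-level function that returns the sum of a, b and c when the row's x matches w[1], which is not the W() from the question.

```r
z <- data.frame(matrix(1:20, nrow = 4))
colnames(z) <- c("a", "b", "c", "x", "y")

# Hypothetical row-level function: it sees the values of one row only.
W_row <- function(abc, w, p) {
  if (w[1] == p) sum(abc) else 0
}

# Iterate over row indices so that abc and p both come from row i.
z$y <- vapply(seq_len(nrow(z)), function(i) {
  as.numeric(W_row(unlist(z[i, c("a", "b", "c")]), c(14, 7, 8), z$x[i]))
}, numeric(1))
```

Only row 2 has x equal to 14, so only that row gets a non-zero value.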

Haven't really worked with zoo or rowWeightedMeans but if you simply apply weights to row elements before taking the mean of them, and require the weights to depend on one of the elements of the row:
z <- matrix(rnorm(100),ncol=4)
W <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
sum(row2) / sum(weights)
}
w.means <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
If the above gives the correct answer but you're worried about speed, write the W function in Rcpp or use the built-in cmpfun:
N <- 10000
z <- matrix(rnorm(N),ncol=4)
# Interpreted R function
W1 <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
sum(row2) / sum(weights) # same formula as W and Wcpp, so the results are comparable
}
# Compiled R function
W2 <- compiler::cmpfun(W1)
# C++ function imported into R via Rcpp
Rcpp::cppFunction('double Wcpp(NumericVector row, NumericVector weights){
int x = row.size() ;
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
nweights = weights * row[x - 1];
for( int i = 0; i < (x-1) ; i++){
wrow[i] = row[i] * nweights[i];
}
double res = sum(wrow) / sum(nweights);
return(res);
}')
w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3))
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3))
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3))
# identical() compares only two objects, so check the pairs separately (up to floating-point tolerance)
isTRUE(all.equal(w.means0, w.means1)) && isTRUE(all.equal(w.means0, w.means2))
#[1] TRUE
Or
# Write the whole thing in C++
Rcpp::cppFunction('NumericVector WM(NumericMatrix z , NumericVector weights){
int x = z.ncol() ;
int y = z.nrow() ;
NumericVector res(y);
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
double nwsum;
double mult;
for( int row = 0 ; row < y ; row++){
mult = z(row,x-1);
nweights = weights * mult;
nwsum = sum(nweights);
for( int i = 0; i < (x-1) ; i++){
wrow[i] = z(row,i) * nweights[i] ;
}
res[row] = sum(wrow) / nwsum;
}
return(res);
}')
microbenchmark::microbenchmark(
w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3)),
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3)),
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3)),
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3))
)
Unit: microseconds
expr min lq mean median uq max neval
w.means0 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 12114.834 12536.9330 12995.1722 12838.2805 13163.4835 15796.403 100
w.means1 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 9941.571 10286.8085 10769.7330 10410.9465 10788.6800 19526.840 100
w.means2 <- apply(z, 1, Wcpp, weights = c(0.1, 0.5, 0.3)) 10919.112 11631.5530 12849.7294 13262.9705 13707.7465 17438.524 100
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3)) 94.172 107.9855 146.2606 125.0075 140.2695 2089.933 100
EDIT:
Incorporating the weighted.mean function slows down the computation dramatically, and according to the help file it does not handle missing values specially, so you will still need to write code to manage them.
> z <- matrix(rnorm(100),ncol=4)
> W <- function(row, weights){
+ weights <- weights * row[4]
+ row2 <- row[1:3] * weights
+ sum(row2) / sum(weights)
+
+ }
> W1 <- compiler::cmpfun(W)
> W2 <- function(row, weights){
+ weights <- weights * row[4]
+ weighted.mean(row[1:3],weights)
+ }
> W3 <- compiler::cmpfun(W2)
> w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
> w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3))
> identical(w.means1,w.means2)
[1] TRUE
> microbenchmark(
+ w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)),
+ w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)),
+ w.means2 < .... [TRUNCATED]
Unit: microseconds
expr min lq mean median uq max neval
w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)) 145.315 167.4550 172.8163 172.9120 180.6920 194.673 100
w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 124.087 134.3365 143.6803 137.8925 148.7145 225.459 100
w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 307.311 346.6320 356.4845 354.7325 371.7620 412.110 100
w.means2 <- apply(z, 1, W3, weights = c(0.1, 0.5, 0.3)) 280.073 308.7110 323.0156 324.1230 333.7305 407.963 100

Here's a solution with zoo::rollapply. It produces the same answer as matrixStats::rowWeightedMeans for the simpler case.
if(! require(matrixStats)) {
install.packages('matrixStats')
library(matrixStats)
}
if(! require(zoo)) {
install.packages('zoo')
library(zoo)
}
z <- zoo (matrix (1:20, nrow = 5))
colnames (z) <- c ("a", "b", "c", "x")
z$x <- 0 # so we can see an effect below...
z
## a b c x
## 1 1 6 11 0
## 2 2 7 12 0
## 3 3 8 13 0
## 4 4 9 14 0
## 5 5 10 15 0
weights <- c(0.1,0.5,0.3)
W <- function (w) { return(w); }
z$wmean <- rowWeightedMeans(z[,1:3], w=W(weights))
## z[,new]<- doesn't work to create new columns in zoo
## objects
## use $
rowWeightMean_zoo <- function (r, W, weights) {
s <- sum(W(weights))
return(sum(r[1:3] * W(weights) / s))
}
z$wmean_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo(r, W, weights))
z
For the requirement in the question, that the return value be dependent on some ancillary data in the row, rowWeightedMeans doesn't work. But, the function passed to rollapply can be modified to use other elements of the row.
W2 <- function (w, x) { return(w * x); }
# z$wmean2 <- rowWeightedMeans(z[,1:3], w=W2(c(0.1,0.5,0.3), z[,4]))
## doesn't work
## Error in rowWeightedMeans(z[, 1:3], w = W#(c(0.1, 0.5, 0.3), z[, 4])) :
## The length of argument 'w' is does not match the number of column in 'x': 5 != 3
## In addition: Warning message:
## In `*.default`(w, x) :
## longer object length is not a multiple of shorter object length
## Calls: rowWeightedMeans -> W -> Ops.zoo -> NextMethod
rowWeightMean_zoo_dependent <- function (r, W, weights) {
s <- sum(W(weights, r[4]))
return(sum(r[1:3] * W2(weights, r[4]) / s))
}
z$wmean2_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo_dependent(r, W2, weights))
z
## a b c x wmean wmean_zoo wmean2_zoo
## 1 1 6 11 0 7.111111 7.111111 NaN
## 2 2 7 12 0 8.111111 8.111111 NaN
## 3 3 8 13 0 9.111111 9.111111 NaN
## 4 4 9 14 0 10.111111 10.111111 NaN
## 5 5 10 15 0 11.111111 11.111111 NaN

I think this can be solved by clever reshaping. I would use dplyr for that, but the workflow should be similar with plyr or data.table; all these packages are heavily optimized.
For this example I assume the weight function is w(x) = w0 ^ x.
Here I create some sample data z and generic weights w (note that I add a row number r to z):
library(dplyr)
library(tidyr)
N <- 10
z <- data.frame(r=1:N, a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
Now the calculation would be:
res <- z %>% gather(key,value,-r,-x) %>% # convert to long format, but keep row numbers and x
left_join(w, 'key') %>% # add generic weights
mutate(eff_weight = weight^x) %>% # calculate effective weights
group_by(r) %>% # group by the original rows for the weighted mean
summarise(ws = sum(value*eff_weight), ww=sum(eff_weight)) %>% # calculate two helper values
mutate(weighted_mean = ws/ww) %>% # compute the weighted mean
select(r, weighted_mean) # remove unnecessary output
left_join(z, res) # add to the original data
I added some notes, but if you have trouble understanding, you can evaluate res stepwise (remove the tail of the pipeline, including its %>%) and look at the intermediate results.
Update
I took up the challenge of doing the same in base R:
N <- 10
z <- data.frame(a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
long.z <- reshape(z, idvar = "row", times=c('a','b','c'),
timevar='key',
varying = list(c('a','b','c')), direction = "long")
compose.z <- merge(long.z,w, by='key')
compose.z2 <- within(compose.z, eff.weight <- weight^x)
sum.stat <- by(compose.z2, compose.z2$row, function(x) {sum(x$a * x$eff.weight )/sum(x$eff.weight)})
nice.data <- c(sum.stat)
It requires somewhat more verbose code, but the same pattern can be applied.
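For the same assumed weight function w(x) = w0 ^ x, the effective weights can also be built as a matrix with outer, which avoids reshaping. This is a sketch under that assumption; the names eff_w, vals and check are introduced here for illustration only.

```r
set.seed(1)
N <- 10
z <- data.frame(a = rnorm(N), b = rnorm(N), c = rnorm(N), x = rpois(N, 5))
w <- c(a = 0.1, b = 0.5, c = 0.3)

# Row i, column j of eff_w holds w[j]^x[i], the effective weight.
eff_w <- outer(z$x, w, function(x, w) w^x)
vals <- as.matrix(z[, names(w)])
z$weighted_mean <- unname(rowSums(vals * eff_w) / rowSums(eff_w))

# Cross-check against an explicit row-by-row computation.
check <- vapply(seq_len(N), function(i) {
  weighted.mean(unlist(z[i, names(w)]), w^z$x[i])
}, numeric(1))
```

The cross-check is there only to confirm that the matrix version and the per-row version agree; in practice the rowSums line is all that is needed.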
