Consider the following table df, with categorical variables x1 and x2 and numerical measurements y1, y2 and y3:
df <- data.frame(x1=sample(letters[1:3], 20, replace=TRUE),
x2=sample(letters[4:6], 20, replace=TRUE),
y1=rnorm(20), y2=rnorm(20), y3=rnorm(20))
I'd like to apply to it a function of the three numerical measurements y, grouped by the categorical variables x. For example, the following function, whose input y is a table of three columns and which should output one new column:
f <- function(y){ sum((y[,1] - y[,2]) / y[,3]) }
I tried it with aggregate, dplyr, summarizeBy... without success, as it seems that for every method, mixing the input columns is not an option. Any idea how to do that with this kind of function (i.e. taking advantage of aggregation)?
aggregate(data = df, y1 + y2 + y3 ~ x1 + x2, FUN = f)
To clarify, the expected result can be obtained with something like:
groups <- unique(df[,c("x1", "x2")]) # co-occurrences of the explanatory variables
res <- c()
for (i in 1:nrow(groups)){ # get the subtables
  temp <- df[df$x1 == groups[i,1] & df$x2 == groups[i,2], c("y1", "y2", "y3")]
  res <- c(res, f(temp)) # apply function on subtables
}
groups$res <- res # aggregate results
This is not that bad for this simple toy example, but it becomes very impractical with more complex data.
The problem is on the input side of your function. The way you specified it, it expects a dataframe.
A possible solution is to feed the function a list of columns. With a small change to your function:
f <- function(y) sum((y[[1]] - y[[2]]) / y[[3]])
You can now use it in a dplyr-chain:
df %>%
  group_by(x1, x2) %>%
  summarise(sum_y = f(list(y1, y2, y3)))
which gives:
# A tibble: 9 x 3
# Groups: x1 [?]
x1 x2 sum_y
<fct> <fct> <dbl>
1 a d 1.20
2 a e 0.457
3 a f -9.46
4 b d -1.11
5 b e -0.176
6 b f -1.34
7 c d -0.994
8 c e 3.38
9 c f -2.63
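As a side note, for a function this simple you can also skip the wrapper entirely and write the cross-column expression inline. A minimal sketch, assuming dplyr >= 1.0 for the .groups argument:
library(dplyr)
df %>%
  group_by(x1, x2) %>%
  # compute the cross-column expression directly on the grouped columns
  summarise(sum_y = sum((y1 - y2) / y3), .groups = "drop")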
Taking into account your past answers, I've changed my code to the following:
n <- 100
B <- 20
S <- 50
alpha <- 0.3
beta <- 1.2
theta <- alpha*beta
for (i in 1:S) {
  ###
  sim_original_samples <- rgamma(n, alpha, beta) # for each of the S iterations, we draw a sample of 100 observations
  sim_original_samples_X_bar <- mean(sim_original_samples) # for each dataset, compute the sample mean and store it
  sim_bs_samples_X_bar <- matrix(0,B,1)
  # in the same loop we are going to compute the sample mean per bootstrap per original sample i
  ####
  ####
  for (j in 1:B) {
    sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
    # for each original sample, we are going to draw B times a bootstrap sample
    sim_bs_samples_X_bar[j] <- mean(sim_bs_samples)
    # all the elements of this matrix should be the bootstrap sample mean
    var_sim_bs_samples <- matrix(0,B,1)
    var_sim_bs_samples[j] <- (sim_bs_samples_X_bar[j] - sim_original_samples_X_bar)^2
    se_sim_bs_samples <- sqrt((1/B*sum(var_sim_bs_samples)))
  }
  ####
  ####
  # now we want to compute the asymptotic CI of i)
  z <- 1.96
  var_gamma <- alpha*beta^2/n
  CI_sim_asy_norm <- matrix(ncol = 3, nrow = S) # create a matrix for the CIs
  names <- c("Lower bound", "Upper bound", "theta covered")
  colnames(CI_sim_asy_norm) <- names
  #
  CI_sim_asy_norm[i,1] <- theta - z*sqrt(var_gamma)
  CI_sim_asy_norm[i,2] <- theta + z*sqrt(var_gamma)
  CI_sim_asy_norm[i,3] <- theta >= CI_sim_asy_norm[i,1] & theta <= CI_sim_asy_norm[i,2]
  # check whether the true parameter of interest is covered
  ####
  ####
  # do the same for the asymptotic BS CI of ii)
  CI_sim_asy_bs <- matrix(ncol = 3, nrow = S)
  colnames(CI_sim_asy_bs) <- names
  CI_sim_asy_bs[i,1] <- sim_original_samples_X_bar - z*se_sim_bs_samples
  CI_sim_asy_bs[i,2] <- sim_original_samples_X_bar + z*se_sim_bs_samples
  CI_sim_asy_bs[i,3] <- theta >= CI_sim_asy_bs[i,1] & theta <= CI_sim_asy_bs[i,2]
  ####
  ####
  # do the same for the percentile BS CI of iii) assuming B = 1000 for simplicity
  sim_bs_samples_X_bar_sorted <- sort(sim_bs_samples_X_bar, decreasing=FALSE)
  CI_sim_percentile <- matrix(ncol = 3, nrow = S)
  colnames(CI_sim_percentile) <- names
  CI_sim_percentile[i,1] <- sim_bs_samples_X_bar_sorted[1000*(0.05/2)]
  CI_sim_percentile[i,2] <- sim_bs_samples_X_bar_sorted[1000*((1-0.05)/2)]
  CI_sim_percentile[i,3] <- theta >= CI_sim_percentile[i,1] & theta <= CI_sim_percentile[i,2]
  ####
}
The issue I have now is that only the last row of each CI matrix is filled (when it is filled at all), whereas all rows should be filled. Where is the issue? I cannot see it.
That is, for each original sample i, I draw B bootstrap samples.
For each original sample i, I want to construct confidence intervals. For each confidence interval I want to know whether the true parameter (theta) is contained in it. Hence, I'd have 50 confidence intervals.
The bootstrap CI is based on the estimates from the 20 bootstrap replications (per original sample).
Many thanks
The Question
Let's divide your question into two parts:
How to create the data: samples, re-samples, means, etc.
How to create a multi-dimensional object
1. Creating the data
Configuration from your question
S <- 5
n <- 100
B <- 2
alpha <- 0.3
beta <- 1.2
Sample and re-sample
require( tidyverse )
U <- map_dfc( 1:S, ~rgamma( n, alpha, beta ))
Ubar <- map_dfc( U, mean )   # one column mean per sample
V <- map_dfc( 1:S, ~sample( U[[ . ]], n, replace = TRUE ))
Vbar <- map_dfc( V, mean )   # one column mean per re-sample
What shape are these?
dim( U )
# 100 5
dim( V )
# 100 5
dim( Ubar )
# 1 5
dim( Vbar )
# 1 5
Now, what did you want to stack? (and why?)
2. How to create a multi-dimensional object
Sometimes it can be helpful to pack data into a multi-dimensional object, in order to facilitate slicing along axes, or to select specific elements.
Define the object
multi_dimensional <- array(
data = 0:( S * B * n )
, dim = c( S, B, n )
, dimnames = list( # <---- names are optional
paste0( 'X', 1:S )
, paste0( 'Y', 1:B )
, paste0( 'Z', 1:n )
)
)
dim( multi_dimensional )
# [1] 5 2 100
Slice, dice, and chop
multi_dimensional[ 1, 1, 1 ]
# [1] 0
multi_dimensional[ S, B, n ]
# [1] 999
multi_dimensional[ 1, 2, 1:10]
# Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9 Z10
# 5 15 25 35 45 55 65 75 85 95
multi_dimensional[ , , 1:2 ]
# , , Z1
#
# Y1 Y2
# X1 0 5
# X2 1 6
# X3 2 7
# X4 3 8
# X5 4 9
#
# , , Z2
#
# Y1 Y2
# X1 10 15
# X2 11 16
# X3 12 17
# X4 13 18
# X5 14 19
multi_dimensional[ S, B, 99:100 ]
# Z99 Z100
# 989 999
Use the named dimensions, if you wish
multi_dimensional[ , , c( 'Z1', 'Z2' ) ]
# , , Z1
#
# Y1 Y2
# X1 0 5
# X2 1 6
# X3 2 7
# X4 3 8
# X5 4 9
#
# , , Z2
#
# Y1 Y2
# X1 10 15
# X2 11 16
# X3 12 17
# X4 13 18
# X5 14 19
multi_dimensional[ c( 'X1', 'X3', 'X5' ), 'Y2' , c( 'Z1', 'Z2' ) ]
# Z1 Z2
# X1 5 15
# X3 7 17
# X5 9 19
Assign new values to specific elements
multi_dimensional[ 5, 1, 29:30 ] <- c( 124.76, -5.0002 )
Now show the new values
multi_dimensional[ 5, 1, 29:30 ]
# Z29 Z30
# 124.8 -5.0
multi_dimensional[ 1:3, , 91:100 ] # slice off a particular 3 x 10 block
# (not shown, due to size)
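To connect this back to your simulation: rather than redeclaring matrices inside the loops, you can allocate one object up front and fill one slice per iteration. A minimal sketch, assuming the n, B, S, alpha and beta values defined above (bs_means is just an illustrative name):
bs_means <- matrix( NA_real_, nrow = S, ncol = B )  # one row per original sample
for ( i in 1:S ) {
  orig <- rgamma( n, alpha, beta )
  for ( j in 1:B ) {
    # fill element (i, j) instead of overwriting a freshly created matrix
    bs_means[ i, j ] <- mean( sample( orig, n, replace = TRUE ))
  }
}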
Initial Answer
I think you misunderstood the purpose of the matrix data type: in R, matrices can't store complex objects such as other matrices; they are limited to double/numeric, integer, logical, character, complex and raw.
Since you know the sizes of most of those data structures, you should declare them beforehand AND outside the loops.
What you seem to want is to store a list of B matrices of arbitrary size (1 by n) that are generated in the inner loop. You can declare an empty list and add the matrices to it in the inner loop with something like this:
#You should declare this outside the loops.
matrix_j <- vector(mode='list', length=B)
#Then in the inner loop you can use [[]] to add elements to the list
for (j in 1:B) {
  matrix_j[[j]] <- sample(sim_original_samples[i],n,replace=TRUE)
}
or if you want an empty list of size 0, you can do matrix_j <- list() instead.
Next, I didn't get whether you want to compute the mean of each sample inside the list or the mean of the whole set of numbers, so:
The first would require the list apply function lapply, like this: lapply(matrix_j,mean), which returns a list in which each element is the mean of the element in the same position of matrix_j.
For the second possibility, I think it would be more appropriate to combine the list elements into one simpler data structure and then compute the mean.
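For example (a minimal sketch, assuming the list elements are numeric vectors as above):
# Flatten the list of samples into one numeric vector, then average it
mean(unlist(matrix_j))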
For your last problem, it seems to me that using lists (lists of lists) would solve your issue.
I'd create a big empty list and then add other lists as elements, as lists are allowed to contain other lists.
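A minimal sketch of that structure, assuming the S, B and n from your question (big_list is just an illustrative name):
# An outer list with S slots, each holding an inner list of B matrices
big_list <- vector(mode='list', length=S)
for (i in 1:S) {
  big_list[[i]] <- vector(mode='list', length=B)
}
big_list[[1]][[2]] <- matrix(0,1,n) # store a 1-by-n matrix in slot (1, 2)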
Answer to edits:
You are redeclaring lots of matrices inside loops. This is bad practice, as every time you do this you reset them to their initial values, so don't do it if you want to keep data from previous iterations.
Consider this part of your code:
for (j in 1:B) {
sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
# for each original sample, we are going to draw B times a bootstrap sample
sim_bs_samples_X_bar[j] <- mean(sim_bs_samples)
# all the elements of this matrix should be the bootstrap sample mean
var_sim_bs_samples <- matrix(0,B,1)
var_sim_bs_samples[j] <- (sim_bs_samples_X_bar[j] - sim_original_samples_X_bar)^2
se_sim_bs_samples <- sqrt((1/B*sum(var_sim_bs_samples)))
}
Every time the var_sim_bs_samples <- matrix(0,B,1) line runs inside the loop, it replaces the current matrix with a new matrix full of zeros, and then the following line assigns something to its j-th position.
This declaration shouldn't happen inside the loop, which is why I told you to create a list and store each new matrix inside it OR move the declaration outside the loop and keep filling in each position. So, to fix that, you can move the declaration outside the loop, like I've done here:
var_sim_bs_samples <- matrix(0,B,1)
for (j in 1:B) {
sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
# for each original sample, we are going to draw B times a bootstrap sample
sim_bs_samples_X_bar[j] <- mean(sim_bs_samples)
# all the elements of this matrix should be the bootstrap sample mean
var_sim_bs_samples[j] <- (sim_bs_samples_X_bar[j] - sim_original_samples_X_bar)^2
se_sim_bs_samples <- sqrt((1/B*sum(var_sim_bs_samples)))
}
The reason only the last row of several matrices is filled is that you're redeclaring them (i.e. erasing them) as empty matrices every time with <- matrix(ncol = 3, nrow = S): you empty the matrix and then assign something to the i-th row, and the last iteration happens to assign to the last row, since i goes from 1 to S.
Second Edit:
Just move the declarations outside of the loops where you're using them, like this:
n <- 100
B <- 20
S <- 50
alpha <- 0.3
beta <- 1.2
theta <- alpha*beta
CI_sim_asy_norm <- matrix(ncol = 3, nrow = S)
CI_sim_asy_bs <- matrix(ncol = 3, nrow = S)
CI_sim_percentile <- matrix(ncol = 3, nrow = S)
for (i in 1:S) {
  sim_original_samples <- rgamma(n, alpha, beta)
  sim_original_samples_X_bar <- mean(sim_original_samples)
  sim_bs_samples_X_bar <- matrix(0,B,1)
  var_sim_bs_samples <- matrix(0,B,1)
  for (j in 1:B) {
    sim_bs_samples <- sample(sim_original_samples,n,replace=TRUE)
Your CI_sim_percentile will still be full of NAs, because sim_bs_samples_X_bar_sorted[1000*(0.05/2)] tries to access index 25 of a vector that has only B = 20 elements, and R returns NA for an out-of-bounds index.
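One way around this (a sketch, assuming you want the usual 95% percentile interval; note that the upper bound should use 1 - 0.05/2 rather than (1 - 0.05)/2) is to let quantile() compute the bounds, so nothing depends on a hard-coded B of 1000:
# Percentile bootstrap CI from the B bootstrap means
CI_sim_percentile[i,1] <- quantile(sim_bs_samples_X_bar, 0.05/2)
CI_sim_percentile[i,2] <- quantile(sim_bs_samples_X_bar, 1 - 0.05/2)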
I need to run this function about 6000 times to cover all combinations. The function has 6 arguments in total. The first 3 go hand in hand and have 75 elements each. The next argument has 9 values, and the last 2 arguments have 3 values each.
#require dplyr
#data is history as list
matchloop <- function(data, data2, x, a, b, c) {
  #history as list
  split <- data
  #history for reference
  fh <- FullHistory
  #start counter
  n <- 1
  #end counter
  m <- a
  tempdf0.3 <- fh
  #set condition for loop
  while (nrow(tempdf0.3) > 1 && m <= (nrow(data2))*b) {
    #put history into a variable
    tempdf0.0 <- split
    #put fh into a variable
    tempdf0.5 <- fh
    #put test path into variable from row n to m
    tempdf0.1 <- as.data.frame(data2[n:m,], stringsAsFactors = FALSE)
    #change column name of test path
    colnames(tempdf0.1) <- "directions"
    #put row n to m of history into variable
    tempdf0.2 <- lapply(tempdf0.0, function(df) df[n:m,])
    #put output into output
    tempdf0.3 <- orderedDistancespos(tempdf0.2, tempdf0.1, "allPaths", "directions")
    #add to output routeID based on reference from fh - the test path ID
    tempdf0.3 <- mutate(tempdf0.3, routeID = (subset(tempdf0.5, routeID != x)$routeID))
    #reduce output based on the matched threshold
    tempdf0.3 <- subset(tempdf0.3, dists >= a*c)
    #create new history based on the IDs remaining in output
    split <- split[as.character(tempdf0.3$routeID)]
    #create new history for reference based on the IDs remaining in output
    fh <- subset(fh, routeID %in% tempdf0.3$routeID)
    #increase loop counter
    n <- n + a
    #increase loop counter
    m <- n + (a - 1)
  }
  #show output
  mylist <- list(tempdf0.3, nrow(tempdf0.3))
  return(mylist)
}
I tried putting the 3 arguments with 75 elements each into their own lists and using mapply. This works, but even at this level I still have to run the code 81 times to cover all the variables, because as far as I understand mapply recycles based on the length of the longest argument.
mapply(matchloop, mylist2,mylist3,mylist4, MoreArgs = list(a=a, b=b, c=c))
data is a list of dataframes
data2 is a dataframe
x, a, b, c are all numerical.
Right now I'm trying to streamline my output. If possible, I would like one single CSV output with all 6000+ lines.
You can combine the mapply and apply functions to cycle through all possible combinations of the a, b and c variables. To create all possible combinations you can use expand.grid. Finally, you can concatenate the list of rows into a data.frame with the help of the do.call and rbind functions, as follows:
matchloop_stub <- matchloop <- function(data, data2, x, a, b, c) {
# stub
c(d = sum(data), d2 = sum(data2), x = sum(x), a = a, b = b, c = c, r = a + b + c)
}
set.seed(123)
mylist2 <- replicate(75, data.frame(rnorm(1)))
mylist3 <- replicate(75, data.frame(rnorm(2)))
mylist4 <- replicate(75, data.frame(rnorm(3)))
a <- 1:9
b <- 1:3
c <- 1:3
abc <- expand.grid(a, b, c)
names(abc) <- c("a", "b", "c")
xs <- apply(abc, 1, function(x) (mapply(matchloop_stub, mylist2, mylist3, mylist4, x[1], x[2], x[3], SIMPLIFY = FALSE)))
df <- do.call(rbind, do.call(rbind, xs))
write.csv(df, file = "temp.csv")
res <- read.csv("temp.csv")
nrow(res)
# [1] 6075
head(res)
# X d d2 x a b c r
# 1 1 -0.5604756 0.7407984 -1.362065 1 1 1 3
# 2 2 -0.5604756 0.7407984 -1.362065 2 1 1 4
# 3 3 -0.5604756 0.7407984 -1.362065 3 1 1 5
# 4 4 -0.5604756 0.7407984 -1.362065 4 1 1 6
# 5 5 -0.5604756 0.7407984 -1.362065 5 1 1 7
# 6 6 -0.5604756 0.7407984 -1.362065 6 1 1 8
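If you don't want the leading X index column (which comes from the row names written by write.csv), you can pass row.names = FALSE when writing:
write.csv(df, file = "temp.csv", row.names = FALSE)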
I have the following dataframes "df1" and "df2":
x1 <- c(1,1,1,2,2,3)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
y <- c(0,1,2)
p <- c(0.1,0.6,0.9)
df2 <- data.frame(y,p)
What I want to do is to update df1$x1 to a new vector df1$x2, based on a random experiment. This can be manually done using the following function and "lapply" on vector df1$x1:
example_function <- function(x,p){
if(runif(1) <= p) return(x + 1)
return(x)
}
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,example_function,0.5))
The function performs a random experiment and compares it with a given probability p. Depending on the result either x remains the same for df$x2 or increases by the value of 1.
In the procedure described above, p was set manually within the function call (here 0.5 for all x-values in df1). However, I want p to be chosen automatically depending on the combination of df1$x1 and df1$y1. This is where df2 comes into play: it shows which p-values are related to which y-values. For example, df1$x1[3] equals 1 and the corresponding y-value df1$y1[3] also equals 1; df2 shows that the associated p-value has to be 0.6 (the p-value for y equal to 1). So to determine the corresponding value df1$x2, p = 0.6 should be used in example_function. Depending on df1$y1, p should be 0.1 for df1$x1[1] and df1$x1[2], 0.6 for df1$x1[3] and df1$x1[4], and 0.9 for df1$x1[5] and df1$x1[6].
The following example is an approach, but it works only if the vector df1$x1 contains distinct values:
x1 <- c(1,2,3,4,5,6)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,
function(z) {
example_function(z, df2$p[df2$y == df1$y1[df1$x1 == z]])
}))
df1
x1 y1 x2
#1 1 0 1
#2 2 0 2
#3 3 1 4
#4 4 1 4
#5 5 2 5
#6 6 2 7
Using x1 <- c(1,1,1,2,2,3), as mentioned above, leads to warnings and errors:
x1 <- c(1,1,1,2,2,3)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,
function(z) {
example_function(z, df2$p[df2$y == df1$y1[df1$x1 == z]])
}))
Error in if (runif(1) <= p) return(x + 1) : argument is of length zero
In addition: Warning message:
In df2$y == df1$y1[df1$x1 == z] :
Does anyone have an idea how to fix this problem? I am very grateful for any help.
Working with "merge" seems to be one solution:
df_new <- merge(df1, df2, by.x = 'y1', by.y = 'y')
set.seed(123)
df1$x2 <- mapply(example_function,df1$x1,df_new$p)
> df1
x1 y1 x2
1 1 0 1
2 1 0 1
3 1 1 2
4 2 1 2
5 2 2 2
6 3 2 4
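Note that merge() sorts by the join key rather than preserving df1's row order (it happens to be harmless here because y1 is already sorted, but it can silently misalign rows in general). A safer sketch using match(), which keeps df1's original order:
# Look up each row's p in df2 by its y1 value, preserving row order
df1$p <- df2$p[match(df1$y1, df2$y)]
set.seed(123)
df1$x2 <- mapply(example_function, df1$x1, df1$p)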
N <- c(1,3,4,6)
a <- c(3,4,5,6)
b <- c(4,5,6,7)
w <- c(5,6,7,6)
dat1 <- data.frame(N,May = a, April = b,June = w)
N May April June
1 1 3 4 5
2 3 4 5 6
3 4 5 6 7
4 6 6 7 6
I need a data frame where each value is the sd of the N value and the corresponding month value in that row:
sd(c(1,3)) sd(c(1,4)) sd(c(1,5)) # for the 1st row
sd(c(3,4)) sd(c(3,5)) sd(c(3,6)) # for the second, and so on.
Try this:
The data:
Norm <- c(1,3,4,6)
a <- c(3,4,5,6)
b <- c(4,5,6,7)
w <- c(5,6,7,6)
mydata <- data.frame(Norm=Norm,May = a, April = b,June = w)
Solution:
finaldata <- do.call('cbind',lapply(names(mydata)[2:4], function(x) apply(mydata[c("Norm",x)],1,sd)))
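As a small optional follow-up, you can carry the month names over so the result is easier to read:
# Label each column with the month that was compared against Norm
colnames(finaldata) <- names(mydata)[2:4]
finaldata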
I hope it helps.
Piece of advice:
Please refrain from using names like data and norm for your variable names. They can easily conflict with things that are native to R. For example norm is a function in R, and so is data.
I think I got it (using mydata from the answer above; note the first index in c(1, j+1) must be 1, the Norm column, not i):
x <- matrix(data=NA, nrow=4, ncol=3)
for(j in 1:3){
  for(i in 1:4){
    # compare column 1 (Norm) with column j + 1 (the j-th month)
    x[i, j] <- sd(unlist(mydata[i, c(1, j+1)]))
  }
}
x
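A quick sanity check, assuming both snippets were run on the same mydata: the loop result should match the lapply answer above.
# Both approaches should produce the same 4 x 3 matrix of sds
all.equal(x, unname(finaldata))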
I have a data.frame with several variables X1, X2, ... and a grouping variable "site". I want to find the proportion of X1 values with site==1 that are greater than the X1 values with site==2. I can do that with a fixed number of site levels, one variable at a time, but I would like to generalize it to any number of levels and several variables. The following is an example:
# Generate data
set.seed(20130226)
n <- 100
x1 <- matrix(c(rnorm(n, mean = 2),rnorm(n, mean = 5)),ncol=2)
x2 <- matrix(c(rnorm(n, mean = 1), rnorm(n, mean = 4)),ncol=2)
x3 <- matrix(c(rnorm(n, mean = 3), rnorm(n, mean = 3)),ncol=2)
xx <- data.frame(x1,site=1)
xx <- rbind(xx, data.frame(x2,site=2))
xx <- rbind(xx, data.frame(x3,site=3))
# comparisons
s <- unique(xx$site)
me1 <- with(xx,xx[site==s[1],])
me2<- with(xx,xx[site==s[2],])
me3<- with(xx,xx[site==s[3],])
Pg1.gt.g2 <- sum(me1[,c("X1")]>me2[,c("X1")])/nrow(me1)
Pg1.gt.g3 <- sum(me1[,c("X1")]>me3[,c("X1")])/nrow(me1)
Pg2.gt.g3 <- sum(me2[,c("X1")]>me3[,c("X1")])/nrow(me1)
# build table
comp1 <- data.frame(Group=c(paste(s[1],">",s[2]),paste(s[1],">",s[3]),paste(s[2],">",s[3])), P=c(Pg1.gt.g2, Pg1.gt.g3,Pg2.gt.g3))
print(comp1)
I can't figure out how to do this for a varying number of groups and several variables, maybe using plyr.
Thanks!
I would reshape the data into a matrix where each column represents a group:
# Unique sites
s <- unique(xx$site)
# Columns are each group, data are X1 values
mat <- do.call(cbind, lapply(split(xx, xx$site), function(x) x$X1))
# Compare all pairs of sites
do.call(rbind, apply(combn(seq_along(s), 2), 2,
function(x) data.frame(g1=s[x[1]], g2=s[x[2]],
prop=sum(mat[,x[1]] > mat[,x[2]])/nrow(mat))))
# g1 g2 prop
# 1 1 2 0.83
# 2 1 3 0.20
# 3 2 3 0.09
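The question also asks about several variables; here is a sketch extending the same idea across the measurement columns (X1 and X2 in the example data), assuming every site has the same number of rows:
# Repeat the pairwise comparison for each measurement column
vars <- c("X1", "X2")
res <- lapply(vars, function(v) {
  mat <- do.call(cbind, lapply(split(xx, xx$site), function(x) x[[v]]))
  do.call(rbind, apply(combn(seq_along(s), 2), 2,
    function(x) data.frame(var = v, g1 = s[x[1]], g2 = s[x[2]],
                           prop = sum(mat[, x[1]] > mat[, x[2]]) / nrow(mat))))
})
do.call(rbind, res)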