I am trying to write a function using aggregate() that will allow me to easily specify one or more variables to group by, along with the names for the output columns.
data:
FCST_VAR OBS_SID FCST_INIT_HOUR ME
WIND 00000 12 4.00000
WIND 11111 12 -0.74948
WIND 22222 12 -0.97792
WIND 00000 00 -2.15822
WIND 11111 00 0.94710
WIND 22222 00 -2.28489
I can do this for a single variable to group by fairly easily:
aggregate.CNT <- function(input.data, aggregate.by) {
  # Calculate mean ME, grouping by the specified variable
  output.data <- aggregate(input.data$ME,
                           list(Station_ID = input.data[[aggregate.by]]),
                           mean, na.rm = TRUE)
  output.data
}
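For reference, the call then looks like this (assuming the data above lives in a data frame named mydata; the name is illustrative):
result <- aggregate.CNT(mydata, "OBS_SID")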
However, I'm stumped on two things:
Firstly, a way to be able to call the function specifying a name for the 'group by' column (instead of the default Group.1), e.g. in the case of:
aggregate.CNT <- function(input.data, aggregate.by, group.name) {
  # Calculate mean ME, grouping by the specified variable
  output.data <- aggregate(input.data$ME,
                           list(group.name = input.data[[aggregate.by]]),
                           mean, na.rm = TRUE)
}
But this results in the column name in the output being group.name rather than the desired value of the argument.
Secondly, building on that: what if I want to optionally specify more than one variable to group by, with names? I tried using ... but that doesn't seem possible, since the additional arguments would need to be in the form:
list(arg1 = input.data[[arg2]], arg3 = input.data[[arg4]])
And I don't think there's a way to place extra arguments into an arg3 = input.data[[arg4]] format.
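As an aside, a list of exactly that shape can be built programmatically from a named character vector; a minimal sketch, assuming aggregate.by is something like c(Station_ID = "OBS_SID", Init_Hour = "FCST_INIT_HOUR"):
# lapply over a named character vector keeps the names, yielding
# list(Station_ID = input.data$OBS_SID, Init_Hour = input.data$FCST_INIT_HOUR)
aggregate.CNT <- function(input.data, aggregate.by) {
  groups <- lapply(aggregate.by, function(v) input.data[[v]])
  aggregate(input.data$ME, groups, mean, na.rm = TRUE)
}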
So I was wondering if there is a way to use an argument to insert a whole string into the function, eg:
aggregate.CNT <- function(input.data, aggregate.by.list) {
  # Calculate mean ME, grouping by the specified variables
  output.data <- aggregate(input.data$ME,
                           list(aggregate.by.list),
                           mean, na.rm = TRUE)
}
aggregate.CNT(data, "Station_ID = data$OBS_SID, Init_Hour = data$FCST_INIT_HOUR")
If this isn't possible, suggestions for alternative methods are also greatly appreciated.
Thanks
Mal
Try this:
aggregate.CNT <- function(data, by) {
  ag <- aggregate(ME ~ ., data[c("ME", by)], mean, na.rm = TRUE)
  if (!is.null(names(by))) names(ag) <- c(names(by), "ME")
  ag
}
Here is an example:
> DF <- data.frame(ME = 1:5, g = c(1, 1, 2, 2, 2), b = c(1, 1, 1, 2, 2))
> aggregate.CNT(DF, "g")
g ME
1 1 1.5
2 2 4.0
> aggregate.CNT(DF, c("g", "b"))
g b ME
1 1 1 1.5
2 2 1 3.0
3 2 2 4.5
> aggregate.CNT(DF, c(G = "g", B = "b"))
G B ME
1 1 1 1.5
2 2 1 3.0
3 2 2 4.5
ADDED: the by vector may be named, as shown in the last example.
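Applied to the data in the question (a sketch assuming it lives in a data frame named mydata), the named form gives the requested column names:
aggregate.CNT(mydata, c(Station_ID = "OBS_SID", Init_Hour = "FCST_INIT_HOUR"))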
I have several probability distribution functions defined using the pdqr package. Let's say they are A, B and C:
library(pdqr)
A <- as_d(function(x) dnorm(x, mean = 3, sd = 1))
B <- as_d(function(x) dnorm(x, mean = 6, sd = 1))
C <- as_d(function(x) dnorm(x, mean = 2, sd = 2))
I have a large data.frame with a character column distr that names the appropriate PDF for each case, let's say:
df <- data.frame(distr = c("A", "C", "A", "B", "B", "A", "C"))
I would like to generate the mean of each PDF per case. Individually this works like this for PDF A:
> pdqr::summ_mean(A)
[1] 3
Now I would like to generate the mean for each case based on the PDF named in distr. This means passing the PDF into pdqr::summ_mean(). I have tried the following, with the resulting errors:
> df$distr_mean <- summ_mean(df$distr)
Error: `f` is not pdqr-function. It should be function.
>
> df$distr_mean <- summ_mean(invoke_map(df$distr))
Error in A() : argument "x" is missing, with no default
>
> df$distr_mean <- df %>%
+ pull(distr) %>%
+ summ_mean()
Error: `f` is not pdqr-function. It should be function.
So either it doesn't understand that a pdqr-function is being passed, or it needs an x-value, which doesn't make sense, since I want the mean over the entire distribution, not at a single x (passing a range like c(1:10) also doesn't work). Furthermore, I understand that any apply or do.call function only applies one single function, while I want to apply several different functions, given in a vector.
How to proceed?
One way to do this is to use the distr column as an argument to mget, which will return all the appropriate functions in a list. Just feed that list to summ_mean using sapply:
sapply(mget(df$distr), pdqr::summ_mean)
#> A C A B B A C
#> 3 2 3 6 6 3 2
Though inside mutate you'll need to tell mget in which environment the functions are to be found:
df %>%
mutate(distr_mean = sapply(mget(distr, envir = .GlobalEnv), pdqr::summ_mean))
#> distr distr_mean
#> 1 A 3
#> 2 C 2
#> 3 A 3
#> 4 B 6
#> 5 B 6
#> 6 A 3
#> 7 C 2
This may be easier to manage if you store your functions in a named list, rather than in the top level environment. From there, it's relatively easy to use sapply or lapply to calculate the mean for each function and then extract the results into df:
df <- data.frame(distr = c("A", "C", "A", "B", "B", "A", "C"))
pdfs <- list(
  A = as_d(function(x) dnorm(x, mean = 3, sd = 1)),
  B = as_d(function(x) dnorm(x, mean = 6, sd = 1)),
  C = as_d(function(x) dnorm(x, mean = 2, sd = 2))
)
means <- sapply(pdfs, summ_mean)
df$distr_mean <- means[df$distr]
distr distr_mean
1 A 3
2 C 2
3 A 3
4 B 6
5 B 6
6 A 3
7 C 2
Or in one line:
df$distr_mean <- sapply(df$distr, \(x) pdqr::summ_mean(pdfs[[x]]))
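If you want to guarantee an atomic double result (sapply can fall back to returning a list), vapply is a stricter variant of the same idea; a minimal sketch:
# vapply pins the return type to exactly one double per element
df$distr_mean <- vapply(df$distr, function(x) pdqr::summ_mean(pdfs[[x]]), numeric(1))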
This is related to R: use the newly generated data in the previous row
I realized the actual problem I was faced with is a bit more complicated than the example I gave in the thread above - it seems I have to pass 3 arguments to the recursive calculation to achieve what I want. Thus, accumulate2 or reduce may not work. So I open a new question here to avoid possible confusion.
I have the following dataset grouped by ID:
ID <- c(1, 2, 2, 3, 3, 3)
pw <- c(1:6)
add <- c(1, 2, 3, 5, 7, 8)
x <- c(1, 2, NA, 4, NA, NA)
df <- data.frame(ID, pw, add, x)
df
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 NA
4 3 4 5 4
5 3 5 7 NA
6 3 6 8 NA
Within each group, for column x I want to keep the value of the first row as it is, and fill in each remaining row with the lagged (previous) value raised to the power stored in pw, plus the value stored in add. The lagged values should update as I proceed. So I would like to have:
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 2^3 + 3
4 3 4 5 4
5 3 5 7 4^5 + 7
6 3 6 8 (4^5 + 7)^6 + 8
I have to apply this calculation to a large dataset, so it would be perfect if there is a fast way to do this!
If we want to use accumulate2, we need to specify the arguments correctly: it takes two input vectors ('pw' and 'add') and an initialization argument, which here is the first value of 'x'. As the data is grouped by 'ID', do the grouping before the accumulate2, refer to the lambda's default arguments ..1, ..2 and ..3 in that order, and build the recursive step from them.
library(dplyr)
library(purrr)
out <- df %>%
  group_by(ID) %>%
  mutate(x1 = accumulate2(pw[-1], add[-1], ~ ..1^..2 + ..3,
                          .init = first(x)) %>%
           flatten_dbl) %>%
  ungroup
out$x1
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
With more than 3 arguments, a for loop would be better
# // initialize an empty vector
out <- c()
# // loop over the unique IDs
for(id in unique(df$ID)) {
  # // create a temporary subset of the data based on that id
  tmp_df <- subset(df, ID == id)
  # // initialize temporary storage for the output
  tmp_out <- numeric(nrow(tmp_df))
  # // initialize the first value with the first element of x
  tmp_out[1] <- tmp_df$x[1]
  # // if the number of rows is greater than 1
  if(nrow(tmp_df) > 1) {
    # // loop over the remaining rows
    for(i in 2:nrow(tmp_df)) {
      # // do the recursive calculation and update
      tmp_out[i] <- tmp_out[i - 1]^tmp_df$pw[i] + tmp_df$add[i]
    }
  }
  out <- c(out, tmp_out)
}
out
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
In base R we could use the following solution for more than two arguments.
In this solution I first subset the original data set on ID values.
Then I choose row ids via seq_len(nrow(tmp))[-1], omitting the first row id since it is supplied by init.
In the anonymous function used in Reduce, the b argument represents the accumulated/previous value (starting from init) and c represents the new/current value of our vector of row numbers.
So in every iteration the previous value (starting from init) is raised to the power of the new value from pw and summed with the new value from add.
cbind(df[-length(df)], unlist(lapply(unique(df$ID), function(a) {
  tmp <- subset(df, df$ID == a)
  Reduce(function(b, c) {
    b ^ tmp$pw[c] + tmp$add[c]
  }, init = tmp$x[1],
  seq_len(nrow(tmp))[-1], accumulate = TRUE)
}))) |> setNames(names(df))
ID pw add x
1 1 1 1 1.000000e+00
2 2 2 2 2.000000e+00
3 2 3 3 1.100000e+01
4 3 4 5 4.000000e+00
5 3 5 7 1.031000e+03
6 3 6 8 1.201025e+18
Data
structure(list(ID = c(1, 2, 2, 3, 3, 3), pw = 1:6, add = c(1,
2, 3, 5, 7, 8), x = c(1, 2, NA, 4, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
Base R, not using Reduce() but rather a while() loop:
# Split-apply-combine while loop: res => data.frame
res <- do.call(rbind, lapply(with(df, split(df, ID)), function(y){
  # While there are any NAs in x:
  while(any(is.na(y$x))){
    # Store the index of the first NA value: idx => integer scalar
    idx <- with(y, head(which(is.na(x)), 1))
    # Calculate x at that index using the business rule provided:
    # x => numeric vector
    y$x[idx] <- with(y, x[(idx-1)] ** pw[idx] + add[idx])
  }
  # Explicitly define the return object: y => GlobalEnv
  y
}))
OR recursive function:
# Recursive function: estimation_func => function()
estimation_func <- function(value_vec, exponent_vec, add_vec){
  # Specify the termination condition; when all elements
  # of value_vec are no longer NA:
  if(all(!(is.na(value_vec)))){
    # Return value_vec: numeric vector => GlobalEnv
    return(value_vec)
    # Otherwise recursively apply the below:
  }else{
    # Store the index of the first NA value: idx => integer vector
    idx <- Position(is.na, value_vec)
    # Calculate the value of value_vec at that index,
    # using the provided business logic: value_vec => numeric vector
    value_vec[idx] <- (value_vec[(idx-1)] ** exponent_vec[idx]) + add_vec[idx]
    # Recursively apply the function: function => Local Env
    return(estimation_func(value_vec, exponent_vec, add_vec))
  }
}
# Split data.frame into a list on ID;
# Overwrite x values, applying recursive function;
# Combine list into a data.frame
# res => data.frame
res <- data.frame(
  do.call(
    rbind,
    Map(function(y){y$x <- estimation_func(y$x, y$pw, y$add); y}, split(df, df$ID))
  ), row.names = NULL
)
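As a quick sanity check (assuming the df from the question), either version of res reproduces the values from the earlier answers:
res$x
#[1] 1.000000e+00 2.000000e+00 1.100000e+01 4.000000e+00 1.031000e+03 1.201025e+18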
I have an imported dataframe Measurements that contains many observations from an experiment.
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
X Data
1 90
2 85
3 100
4 105
I want to add another column Condition that specifies the treatment group for each datapoint. I know which observation ranges are from which condition (e.g. observations 1:2 are from the control group and observations 3:4 are from the experimental group).
I have devised two solutions already that give the desired output but neither are ideal. First:
Measurements["Condition"] <- c(rep("Cont", 2), rep("Exp", 2))
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
The benefit of this is that it is one line of code/one command. But it is not ideal, since I need to do the math separately (e.g. 3:4 = 2 obs), which can be tricky and unclear with larger datasets and more conditions (e.g. 47:83 = ? obs), and it is liable to perpetuate errors, since a small error in the length of an early assignment also shifts the assignment of later groups (e.g. if the rep for Cont is mistakenly 1, then Exp mistakenly gets assigned to 2:3 too).
I also thought of assigning like this, which gives the desired output too:
Measurements[1:2, "Condition"] <- "Cont"
Measurements[3:4, "Condition"] <- "Exp"
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
This makes it more clear/simple/direct which rows will receive which assignment, but this requires separate assignments and repetition. I feel like there should be a way to "vectorize" this assignment, which is the solution I'm looking for.
I'm having trouble finding rules for complex indexing online. Here is my first intuitive guess of how to achieve this:
Measurements[c(1:2, 3:4), "Condition"] <- list("Cont", "Exp")
X Data Condition
1 90 Cont
2 85 Cont
3 100 Cont
4 105 Cont
But this doesn't work. It seems to combine 1:2 and 3:4 into a single equivalent range (1:4) and assigns only the first condition to this range, which suggests I also need to specify the column again. When I try to specify the column again:
Measurements[c(1:2, 3:4), c("Condition", "Condition")] <- list("Cont", "Exp")
X Data Condition Condition.1
1 90 Cont Exp
2 85 Cont Exp
3 100 Cont Exp
4 105 Cont Exp
For some reason this creates a second new column (??), and it again seems to combine 1:2 and 3:4 into essentially 1:4. So I think I need to index the two row ranges in a way that keeps them separate and only specify the column once, but I'm stuck on how to do this. I assume the solution is simple but I can't seem to find an example of what I'm trying to do. Maybe to keep them separate I do have to assign them separately, but I'm hoping there is a way.
Can anyone help? Thank you a ton in advance from an R noobie!
If you already have a list of observations which belong to each condition you could use dplyr::case_when to do a conditional mutate. Depending on how you have this information stored you could use something like the following:
library(dplyr)
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
# set which observations belong to each condition
Cont <- 1:2
Exp <- 3:4
Measurements %>%
  mutate(Condition = case_when(
    X %in% Cont ~ "Cont",
    X %in% Exp ~ "Exp"
  ))
# X Data Condition
# 1 90 Cont
# 2 85 Cont
# 3 100 Exp
# 4 105 Exp
Note that this does not require the observations to be in consecutive rows.
I normally see this done with a merge operation. The trick is getting your conditions data into a nice shape.
composeConditions <- function(...) {
  conditions <- list(...)
  data.frame(
    X = unname(unlist(conditions)),
    condition = unlist(unname(lapply(
      names(conditions),
      function(x) rep(x, times = length(conditions[x][[1]]))
    )))
  )
}
conditions <- composeConditions(Cont = 1:2, Exp = 3:4)
> conditions
X condition
1 1 Cont
2 2 Cont
3 3 Exp
4 4 Exp
merge(Measurements, conditions, by = "X")
X Data condition
1 1 90 Cont
2 2 85 Cont
3 3 100 Exp
4 4 105 Exp
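A compact base R way to build the same lookup table is stack(), which flattens a named list of index vectors into this two-column shape; a minimal sketch (not from the original answer):
# stack() returns a values/ind data.frame; rename the columns for merge()
conditions <- setNames(stack(list(Cont = 1:2, Exp = 3:4)), c("X", "condition"))
merge(Measurements, conditions, by = "X")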
An efficient approach for larger datasets is to separate the labels from the pattern that assigns them:
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))
dat <- c("Cont","Exp")
pattern <- c(1,1,2,2)
Or derive the pattern from the data, e.g. conditionally on Measurements$Data:
pattern <- ifelse(Measurements$Data >= 100, 2, 1)
# [1] 1 1 2 2
Then you can add the data simply by doing:
Measurements$Condition <- dat[pattern]
# X Data Condition
#1 1 90 Cont
#2 2 85 Cont
#3 3 100 Exp
#4 4 105 Exp
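If the conditions always cover consecutive observation ranges (like the 47:83 case mentioned in the question), cut() is another base R option that only needs the boundaries between ranges; a minimal sketch (not from the original answers):
# Breaks at 0|2|4 label rows 1-2 "Cont" and rows 3-4 "Exp" (returns a factor)
Measurements$Condition <- cut(Measurements$X, breaks = c(0, 2, 4),
                              labels = c("Cont", "Exp"))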
I would like to create a new column which equals the mean of several variables (columns) in my data frame. However, I'm afraid I can't use 'rowMeans' because I don't want to average all variables. Moreover, I hesitate to manually type all the variable names (which are many). For example:
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
I want to mutate a column called avg which is the average of variables a, b, c, d, and e only. Because in my dataset the variable names are long (and complex), and there are more than 10 variables, I prefer not to type them out one by one. So I guess I might need to use the dplyr package and the mutate function?? Could you please suggest a clever way for me to do that?
The content below was added after your kind comments and answers. Thank you all again:
Actually, the column names that I need are Mcheck5_1_1, Mcheck5_2_1, ..., Mcheck5_8_1 (so there are 8 in total). However, I tried
my_data$avg = rowMeans(select(my_data, Mcheck5_1_1:Mcheck5_8_1), na.rm = TRUE)
but an error was thrown to me:
Error in select(my_data, Mcheck5_1_1:Mcheck5_8_1) :
unused argument (Mcheck5_1_1:Mcheck5_8_1)
Right now I solved the problem by using the following code:
idx = grep("Mcheck5_1_1", names(my_data))
my_data$avg = rowMeans(my_data[, idx:(idx + 7)], na.rm = TRUE)
But is there a more elegant way to do it? Or why couldn't I use select()? Thanks!
I would do something like this
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
several_variables <- c('a', 'b', 'c', 'd', 'e') # or letters[1:5]
my_data$avg <- rowMeans(my_data[,several_variables])
my_data
#> a b c d e hello bye avg
#> 1 1 4 10 13 10 1 1 7.6
#> 2 2 5 10 24 8 -1 5 9.8
#> 3 3 6 10 81 6 1 5 21.2
Obviously, if the variables are at fixed positions and you know they will stay there, you could use numbered indexing as suggested by Jaap,
my_data$avg <- rowMeans(my_data[,1:5])
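As for why select() threw an 'unused argument' error in the question: that usually means another loaded package (MASS is a common culprit) is masking dplyr::select. Namespacing the call sidesteps the conflict; a minimal sketch, assuming the columns Mcheck5_1_1 through Mcheck5_8_1 sit next to each other:
# Qualify select() so a masking version from another package can't be picked up
my_data$avg <- rowMeans(dplyr::select(my_data, Mcheck5_1_1:Mcheck5_8_1), na.rm = TRUE)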
From ?dplyr::bind_cols:
This is an efficient implementation of the common pattern of do.call(rbind, dfs) or do.call(cbind, dfs) for binding many data frames into one
However, with example data:
tmp_df1 <- data.frame(a = 1)
tmp_df2 <- data.frame(b = c(-2, 2))
tmp_df3 <- data.frame(c = runif(10))
The command do.call(cbind, list(tmp_df1, tmp_df2, tmp_df3)) produces:
a b c
1 1 -2 0.8473307
2 1 2 0.8031552
3 1 -2 0.3057430
4 1 2 0.6344999
5 1 -2 0.7870753
6 1 2 0.9453199
7 1 -2 0.6642231
8 1 2 0.9708049
9 1 -2 0.7189576
10 1 2 0.9217087
That is, rows of tmp_df1 and tmp_df2 are recycled to match the number of rows in tmp_df3.
In dplyr:
> bind_cols(tmp_df1, tmp_df2, tmp_df3)
Error in eval(substitute(expr), envir, enclos) :
incompatible number of rows (2, expecting 1)
The reason why I want to do something like this is because I am in a situation similar to below:
df_normal_param <- data.frame(mu = rnorm(10), sigma = runif(10))
df_normal_sample_list <- lapply(1:10, function(i)
  with(df_normal_param,
       data.frame(sam = rnorm(100, mu[i], sigma[i]))))
and I wish to attach the arguments used to create each entry of df_normal_sample_list to the outputs, e.g.
df_normal_sample_list <- lapply(1:10, function(i)
cbind(df_normal_param[i,], df_normal_sample_list[[i]]))
You argue in a comment that this behavior is safe; I strongly disagree. It seems safe for this very particular case, but it is likely to cause you problems somewhere down the road, which is why I believe that the answer to your stated question ("Is there a way to get dplyr's bind_cols to expand number of rows like in cbind?") is a simple no, and they probably built it that way intentionally.
Instead, I would suggest that you be more explicit in your approach, and just add the columns you want right as you build the data you are creating. For example, you could include that step right in your call (here using apply to clarify what is going where)
df <- data.frame(mu = rnorm(3), sigma = runif(3))
df_normal_sample_list <- apply(df, 1, function(x){
  data.frame(
    mu = x["mu"]
    , sigma = x["sigma"]
    , sam = rnorm(3, x["mu"], x["sigma"])
  )
})
Returns
[[1]]
mu sigma sam
1 -0.6982395 0.1690402 -0.592286
2 -0.6982395 0.1690402 -0.516948
3 -0.6982395 0.1690402 -0.804366
[[2]]
mu sigma sam
1 -1.698747 0.2597186 -1.830950
2 -1.698747 0.2597186 -2.087393
3 -1.698747 0.2597186 -1.961376
[[3]]
mu sigma sam
1 0.9913492 0.3069877 0.9629801
2 0.9913492 0.3069877 1.2279697
3 0.9913492 0.3069877 1.1222780
Then, instead of binding the columns and then the rows, you can just bind the rows at the end (bind_rows, also from dplyr):
bind_rows(df_normal_sample_list)
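A caveat with the apply() approach: apply() converts the data frame to a matrix first, so if df ever gains a non-numeric column, every value is coerced to character. A sketch of a coercion-free variant using Map() over the columns (names as in the example above):
# Map() pairs up the mu/sigma columns directly; no matrix coercion involved
df_normal_sample_list <- Map(
  function(mu, sigma) data.frame(mu, sigma, sam = rnorm(3, mu, sigma)),
  df$mu, df$sigma
)
bind_rows(df_normal_sample_list)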