I'm looking for a better option saving a group mean directly in the same data frame in a new column. I usually solve this problem in the following steps shown below. Is there a possibility of saving the means without merging them explicitly but doing i right away maybe with dplyr?
data <- data.frame(group = rep(c("low","high"),2),
values = runif(n = 4, min = 0, max = 2))
data_mean <- data %>% group_by(group) %>% summarise (mean(values))
merge(data_mean, data)
group mean(values) values
1 high 0.2889459 0.07079697
2 high 0.2889459 0.50709475
3 low 0.7767188 0.93176182
4 low 0.7767188 0.62167588
Just use mutate instead of summarise should do what you want:
data %>%
group_by(group) %>%
mutate(mean = mean(values))
#Source: local data frame [4 x 3]
#Groups: group
#
# group values mean
#1 low 1.4017168 0.7478336
#2 high 0.8074821 1.1018971
#3 low 0.0939505 0.7478336
#4 high 1.3963122 1.1018971
Note: my values are different from yours because you didn't use set.seed for reproducibility of random numbers.
You could use tapply in base R
within(data, means <- tapply(values, group, mean, na.rm=TRUE))
# group values means
# 1 low 1.1069518 1.515846
# 2 high 1.6729194 1.001568
# 3 low 0.8961838 1.515846
# 4 high 1.3587732 1.001568
Related
Starting data
I'm working in R and I have a set of data generated from groups (cohorts) of animals treated with different doses of different drugs. A simplified reproducible example of my dataset follows:
# set starting values for simulation of animal cohorts across doses of various drugs with a few numeric endpoints
cohort_size <- 3
animals <- letters[1:cohort_size]
drugs <- factor(c("A", "B", "C"))
doses <- factor(c(0, 10, 100))
total_size <- cohort_size * length(drugs) * length(doses)
# simulate data based on above parameters
df <- cbind(expand.grid(drug = drugs, dose = doses, animal = animals),
data.frame(
other_metadata = sample(LETTERS[24:26], size = total_size, replace = TRUE),
num1 = rnorm(total_size, mean = 10, sd = 3),
num2 = rnorm(total_size, mean = 60, sd = 9),
num3 = runif(total_size, min = 1, max = 5)))
This produces something like:
## drug dose animal other_metadata num1 num2 num3
## 1 A 0 a X 6.448411 54.49473 4.111368
## 2 B 0 a Y 9.439396 67.39118 4.917354
## 3 C 0 a Y 8.519773 67.11086 3.969524
## 4 A 10 a Z 6.286326 69.25982 2.194252
## 5 B 10 a Y 12.428265 70.32093 1.679301
## 6 C 10 a X 13.278707 68.37053 1.746217
My goal
For each drug treatment, I consider the dose == 0 animals as my control group for that drug (let's say each was run at a different time and has it's own control group). I wish to calculate the mean for each numeric endpoint (columns 5:7 in this example) of the control group. Next I want to normalize (divide) every numeric endpoint (columns 5:7) for every animal by the mean of it's respective control group.
In other words num1 for all animals where drug == "A" should be divided by the mean of num1 for all animals where drug == "A" AND dose == 0 and so on for each endpoint.
The final output should be the same size as the original data.frame with all of the non-numeric metadata columns remaining unchanged on the left side and all the numeric data columns now with the normalized values.
Naturally I'd like to find the simplest solution possible - minimizing creation of new variables and ideally in a single dplyr pipeline if possible.
What I've tried so far
I should say that I have technically solved this but the solution is super ugly with a ton of steps so I'm hoping to get help to find a more elegant solution.
I know I can easily get the averages for the control groups into a new data.frame using:
df %>%
filter(dose == 0) %>%
group_by(drug, dose) %>%
summarise_all(mean)
I've looked into several things but can't figure out how to implement them. In order of what seems most promising to me:
dplyr::group_modify()
dplyr::rowwise()
sweep() in some type of loop
Thanks in advance for any help you can offer!
If the intention is to divide the numeric columns by the mean of the control group values, grouped by 'drug', after grouping by 'drug', use mutate with across (from dplyr 1.0.0), divide the column values (. with mean of the values where the 'dose' is 0
library(dplyr) # 1.0.0
df %>%
group_by(drug) %>%
mutate(across(where(is.numeric), ~ ./mean(.[dose == 0])))
If we have a dplyr version is < 1.0.0, use mutate_if
df %>%
group_by(drug) %>%
mutate_if(is.numeric, ~ ./mean(.[dose == 0]))
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add
variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct())
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct())
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
In R, I have a large list of large dataframes consisting of two columns, value and count. The function which I am using in the previous step returns the value of the observation in value, the corresponding column count shows how many times this specific value has been observed. The following code produces one dataframe as an example - however all dataframes in the list do have different values resp. value ranges:
d <- as.data.frame(
cbind(
value = runif(n = 1856, min = 921, max = 4187),
count = runif(n = 1856, min = 0, max = 20000)
)
)
Now I would like to aggregate the data to be able to create viewable visualizations. This aggregation should be applied to all dataframes in a list, which do each have different value ranges. I am looking for a function, cutting the data into new values and counts, a little bit like a histogram function. So for example, for all data from a value of 0 to 100, the counts should be summated (and so on, in a defined interval, with a clean interval border starting point like 0).
My first try was to create a simple value vector, where each value is repeated in a number of times that is determined by the count field. Then, the next step would have been applying the hist() function without plotting to obtain the aggregated values and counts which can be defined in the hist()'s arguments. However, this produces too large vectors (some Gb for each) that R cannot handle anymore. I appreciate any solutions or hints!
I am not entirely sure I understand your question correctly, but this might solve your problem or at least point you in a direction. I make a list of data-frames and then generate a new column containing the result of applying the binfunction to each dataframe by using mapfrom the purrr package.
library(tidyverse)
d1 <- d2 <- tibble(
value = runif(n = 1856, min = 921, max = 4187),
count = runif(n = 1856, min = 0, max = 20000)
)
d <- tibble(name = c('d1', 'd2'), data = list(d1, d2))
binfunction <- function(data) {
data %>% mutate(bin = value - (value %% 100)) %>%
group_by(bin) %>%
mutate(sum = sum(count)) %>%
select(bin, sum)
}
d_binned <- d %>%
mutate(binned = map(data, binfunction)) %>%
select(-data) %>%
unnest() %>%
group_by(name, bin) %>%
slice(1L)
d_binned
#> Source: local data frame [66 x 3]
#> Groups: name, bin [66]
#>
#> # A tibble: 66 x 3
#> name bin sum
#> <chr> <dbl> <dbl>
#> 1 d1 900 495123.8
#> 2 d1 1000 683108.6
#> 3 d1 1100 546524.4
#> 4 d1 1200 447077.5
#> 5 d1 1300 604759.2
#> 6 d1 1400 506225.4
#> 7 d1 1500 499666.5
#> 8 d1 1600 541305.9
#> 9 d1 1700 514080.9
#> 10 d1 1800 586892.9
#> # ... with 56 more rows
d_binned %>%
ggplot(aes(x = bin, y = sum, fill = name)) +
geom_col() +
facet_wrap(~name)
See this comment for my inspiration for the binning. It bins the data in groups of 100, so e.g. bin 1100 represents 1100 to <1200 etc. I imagine you can adapt the binfunction to your needs.
Good afternoon,
I have to add dummy data to a dataframe whenever a specific variable is absent of several given intervals.
require(plyr)
df <- data.frame(length = c(1.5e+07, 2.5e+07), grade = c(1000, 1000), company = "TEST")
for(x in df$length){
if (x<=0|x>1e+07) {
df <- rbind.fill(df, data.frame(length = c(5000000), grade = c(1000)))
}
This works fine but I am having trouble to check if x is absent in each “length” interval from 0 to 1e+08, with a step of 1e+07, and add “1000“ in “grade” if that is the case. I tried all lot of things, and the end my data frame is only 1 row larger.
After that, I will create subgroups based on these intervals and I need a value for each subgroup.
df$length <- cut(df$length, breaks = seq(0, 1e+08, 1e+07))
In the end, the objective is to still get an empty space on a boxplot for each condition where there is no data, as the “1000“ I added is way above the limit threshold.
The next step will be to do the same but for each “company” variable.
I hope I am clear, sorry for my English.
Thanks
You can do it using dplyr and tidyr.
First, cut your df$length:
df <- data.frame(length = c(1.5e+07, 2.5e+07), grade = c(1000, 1000), company = "TEST")
df$length <- cut(df$length, breaks = seq(0, 1e+08, 1e+07))
Now we can use dplyr to left_join on all the levels of length, we then complete company:length, filter out any NA companies, and change the NA to 1000:
library(dplyr)
library(tidyr)
df %>% left_join(data.frame(length = levels(df$length)), .) %>%
complete(length, company) %>%
filter(!is.na(company)) %>%
mutate(grade = ifelse(is.na(grade), 1000, grade))
Source: local data frame [10 x 3]
length company grade
(fctr) (fctr) (dbl)
1 (0,1e+07] TEST 1000
2 (1e+07,2e+07] TEST 1000
3 (2e+07,3e+07] TEST 1000
4 (3e+07,4e+07] TEST 1000
5 (4e+07,5e+07] TEST 1000
6 (5e+07,6e+07] TEST 1000
7 (6e+07,7e+07] TEST 1000
8 (7e+07,8e+07] TEST 1000
9 (8e+07,9e+07] TEST 1000
10 (9e+07,1e+08] TEST 1000
I have three data frames, each having 1 column but having different number of rows 100,100,1000 for df1,df2,df3 respectively. I want to do an rbind iteratively and calculate measures like mean repeatedly for the small chunks of data by taking 10% of the data each time. Meaning in the first iteration I need to have 10 rows from df1, 10 from df2 and 100 from df3 and for this set i need to get a mean and the process should continue 10 times. And I need to plot the iterations chunks over time showing the mean in y-axis over iterations and get an overall mean with this procedure. Any suggestions?
df1<- data.frame(A=c(1:100))
df2<- data.frame(A=c(1:100))
df3<- data.frame(A=c(1:1000))
library(dplyr)
for i in (1:10)
{ df[i]<- rbind_list(df1,df2,df3)
mean=mean(df$A)}
You're making things complicated by trying to keep separate data frames. Add a "group" column---call it "iteration" if you prefer---and get your data in one data frame:
df1$group = rep(1:10, each = nrow(df1) / 10)
df2$group = rep(1:10, each = nrow(df2) / 10)
df3$group = rep(1:10, each = nrow(df3) / 10)
df = rbind(df1, df2, df3)
means = group_by(df, group) %>% summarize(means = mean(A))
means
# Source: local data frame [10 x 2]
#
# group means
# 1 1 43
# 2 2 128
# 3 3 213
# 4 4 298
# 5 5 383
# 6 6 468
# 7 7 553
# 8 8 638
# 9 9 723
# 10 10 808
Your overall mean is mean(df$A). You can plot with with(means, plot(group, means)).
Edits:
If the groups don't come out exactly, here's how I'd assign the group column. Make sure your dplyr is up-to-date, this uses the the .id argument of bind_rows() which was new this month in version 0.4.3.
library(dplyr)
# dplyr > 0.4.3
df = bind_rows(df1, df2, df3, .id = "id")
df = df %>% group_by(id) %>%
mutate(group = (0:(n() - 1)) %/% (n() / 10) + 1)
The id column tells you which data frame the row came from, and the group column splits it into 10 groups. The rest of the code from above should work just fine.