I have this code where I need to find the mean approval for each quarter.
approval <- read_csv('covid_approval_polls.csv')
quarters2 <- approval %>%
select(start_date, end_date, approve) %>%
filter(approval$party == 'all') %>%
mutate(Quarter = as.yearqtr(approval$start_date)) %>%
group_by(Quarter) %>%
summarise(AVERAGE = ceiling(mean(approval$approve, na.rm = TRUE)))
I am trying to use dplyr, which I think is the right tool, but my code gives me the mean of all the data instead of one mean per quarter.
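A likely cause, sketched here under assumptions: inside a dplyr pipeline, `approval$approve` refers to the whole original column rather than the rows of the current group, so every quarter gets the overall mean. In addition, `select()` drops `party` before `filter()` needs it. The small data frame below is made up for illustration (the real CSV is not shown), and `as.yearqtr()` is assumed to come from the zoo package.

```r
library(dplyr)
library(zoo)  # assumed source of as.yearqtr()

# Made-up stand-in for covid_approval_polls.csv
approval <- tibble(
  start_date = as.Date(c("2020-02-10", "2020-03-15", "2020-04-20", "2020-05-05")),
  end_date   = start_date + 2,
  party      = c("all", "all", "all", "R"),
  approve    = c(40, 50, 61.2, 90)
)

quarters2 <- approval %>%
  filter(party == "all") %>%                     # filter while party still exists
  mutate(Quarter = as.yearqtr(start_date)) %>%   # bare column names, no approval$
  group_by(Quarter) %>%
  summarise(AVERAGE = ceiling(mean(approve, na.rm = TRUE)))
```

Using bare column names lets `summarise()` see only the rows of each quarter.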
I am trying to create new columns, each grouped by a different column, but I am not sure whether the way I am doing it is the best way to use group_by. Is there a way to specify the grouping inline?
I know it can be done using data.table package where the syntax is of type
DT[i,j, by].
But since this is a small piece in a bigger code which uses tidyverse and works great as is, I just don't want to deviate from that.
## Creating Sample Data Frame
library(tidyverse)
state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample() draws from the given range; sample.int() expects a single number n
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)
df <- bind_cols(data.frame(state, county, customers, sales))
## workflow
df2 <- df %>%
  group_by(state) %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>%
  ungroup() %>%
  group_by(county) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>%
  ungroup() %>%
  mutate(salePerCountyPercent = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>%
  group_by(state) %>%
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup()
I want my code to look like
df3 <- df %>%
mutate(customerInState = sum(customers, by = state),
saleInState = sum(sales, by = state),
customerInCounty = sum(customers, by = county),
saleInCounty = sum(sales, by = county),
salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState,
minSale = min(salePerCountyPercent, by = state))
It runs without errors, but I know the output is not right.
I understand that it may be possible to juggle the mutates around to get what I need with fewer group_bys.
But the question is whether there is a way to do an inline group by in dplyr.
You could create a wrapper to do what you want. This specific solution works if you have one grouping variable. Good luck!
# group, mutate, then ungroup in a single call
mutate_by <- function(.data, group, ...) {
  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup()
}
df1 <- df %>%
  mutate_by(state,
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>%
  mutate_by(state,
            minSale = min(salePerCountyPercent))
identical(df2, df1)
[1] TRUE
EDIT: or, more concisely and closer to your code:
df %>%
mutate_by(customerInState = sum(customers),
saleInState = sum(sales), group = state) %>%
mutate_by(customerInCounty = sum(customers),
saleInCounty = sum(sales), group = county) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate_by(minSale = min(salePerCountyPercent), group = state)
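For readers on dplyr 1.1.0 or later, `mutate()` also accepts a `.by` argument that groups for a single call only, which comes close to the inline syntax asked about. A minimal sketch with a small made-up data frame (the values are illustrative, not the sample data above):

```r
library(dplyr)  # .by needs dplyr >= 1.1.0

# Made-up data: two states, counties nested within states
df_small <- tibble(
  state     = c("OH", "OH", "OH", "IL", "IL"),
  county    = c("A-OH", "A-OH", "B-OH", "A-IL", "B-IL"),
  customers = c(10, 20, 30, 40, 50),
  sales     = c(100, 200, 300, 400, 500)
)

res <- df_small %>%
  mutate(customerInState = sum(customers),
         saleInState     = sum(sales), .by = state) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty     = sum(sales), .by = county) %>%
  mutate(salePerCountyPercent     = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

The result is ungrouped after each call, so no `ungroup()` is needed.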
Ah, you mean the syntax style. No, this is not how tidyverse runs, I'm afraid. You want tidyverse, you better use pipes. However: (i) once you grouped something, it stays grouped until you group again with a different column. (ii) No need to ungroup if you group again. We can therefore shorten your code:
df3 <- df %>%
  group_by(county) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>%
  group_by(state) %>%   # replaces the county grouping, no ungroup() needed
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent)) %>%
  ungroup()
Two mutates and two group_by's.
Now: the order of columns is different, but we can easily test that the data is identical:
identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
(iii) I have no idea about the administrative structure of the US, but I assume that counties are nested within states, correct? Then how about using summarize? Do you need to keep all the individual sales, or is it enough to generate per county and/or per state statistics?
You can do it in two steps, creating two data sets, then left_join them.
df2 <- df %>%
group_by(state) %>%
summarise(customerInState = sum(customers),
saleInState = sum(sales))
df3 <- df %>%
group_by(state, county) %>%
summarise(customerInCounty = sum(customers),
saleInCounty = sum(sales))
df2 <- left_join(df2, df3, by = "state") %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
group_by(state) %>%
mutate(minSale = min(salePerCountyPercent))
Final clean up.
I am new to R. I have a data frame with firm level data such as revenue, profits and costs. I would need to loop through 3 variables - revenue, profit and costs over this code:
datagroup %>% group_by(treat) %>% summarise(n = n(), mean = mean(profit), std_error = sd(profit) / sqrt(n))
Basically, I would run the code for revenue and costs by replacing the variable profit. Could you assist? I tried for loops but to no avail.
We can do this in a loop with the column name as string, then convert it to symbol, evaluate (!!) and get the mean
c("revenue", "costs") %>%
map(~ datagroup %>%
group_by(treat) %>%
summarise(n = n(),
!! str_c("mean_", .x) := mean(!! rlang::sym(.x)), # convert to symbol
!! str_c("std_error_", .x) := sd(!! rlang::sym(.x)) / sqrt(n)))
We can also do this with summarise_at
c("revenue", "costs") %>%
  map(~ datagroup %>%
        group_by(treat) %>%
        group_by(n = n(), .add = TRUE) %>%   # .add replaces the deprecated add argument
        summarise_at(vars(.x),
                     list(mean = ~ mean(.x),
                          std_error = ~ sd(.x)/sqrt(first(n)))))
The output will be a list of data.frames
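As a possible alternative on dplyr 1.0.0 or later, `across()` can compute the same statistics for several columns in one `summarise()` call and return a single data frame instead of a list. A sketch with a made-up `datagroup` (column names follow the question; the values are invented):

```r
library(dplyr)  # across() needs dplyr >= 1.0.0

# Made-up firm-level data
datagroup <- tibble(
  treat   = c(0, 0, 1, 1),
  revenue = c(10, 20, 30, 50),
  costs   = c(1, 3, 5, 9),
  profit  = c(9, 17, 25, 41)
)

out <- datagroup %>%
  group_by(treat) %>%
  summarise(
    n = n(),
    # one mean and one standard error column per variable,
    # named e.g. revenue_mean and revenue_std_error
    across(c(revenue, costs, profit),
           list(mean = mean, std_error = ~ sd(.x) / sqrt(n())))
  )
```

The result has one row per `treat` value, which is easier to compare across variables than a list of separate data frames.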
Since you are new to R, consider base R for multiple aggregate functions on multiple numeric columns via cbind + aggregate + do.call:
do.call(data.frame,
        aggregate(cbind(revenue, costs, profit) ~ treat,
                  data = datagroup,
                  FUN = function(x) c(n = length(x),
                                      mean = mean(x),
                                      std_error = sd(x) / sqrt(length(x)))))
dplyr programming question here. I am trying to write a dplyr function that takes column names as inputs and also filters on a value passed to the function. What I am trying to recreate is the following data frame, called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  date_field <- enquo(date_field)
  cumulative_y <- enquo(cumulative_y)
  data %>% group_by(!!date_field) %>%
    summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that I get two different data frames for test and test2. The one produced by the function does not contain the counts from the logical column y.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  date_field <- enquo(date_field)
  cumulative_y <- deparse(substitute(cumulative_y))
  data %>% group_by(!!date_field) %>%
    summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
This is based on the thread: Pass a data.frame column name to a function
But the output in my test2 column is also wildly different, and it seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help, that would be much appreciated.
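One plausible explanation, offered as a sketch rather than a confirmed diagnosis: `!!cumulative_y == TRUE` is parsed as `!!(cumulative_y == TRUE)` because `!` binds less tightly than `==`, so the comparison happens before unquoting. The second attempt uses `data[[cumulative_y]]`, which is the whole column and ignores the grouping, so every date receives the overall total. Assuming rlang 0.4.0 or later is available, the `{{ }}` operator sidesteps the precedence issue. The `y` column below is fixed rather than random so the result is reproducible.

```r
library(dplyr)

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    summarise(total = n(),
              total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total),
           cumulative_b = cumsum(total_2)) %>%
    filter({{ date_field }} >= as.Date(minimum_date))
}

# Deterministic version of the test data from the question
my_df <- data.frame(
  x = 1:10,
  y = rep(c(TRUE, FALSE), 5),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1)
)

test2 <- cumsum_df(my_df, date_field = date, cumulative_y = y,
                   minimum_date = "2018-01-03")
```

With the original `enquo()` style, writing `(!!cumulative_y) == TRUE` with explicit parentheses should have the same effect, just as the question already does for `date_field` in `filter()`.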
Since I have to use a function in a loop, I have to use the dplyr group_by_at() and summarise_at() functions. Unfortunately, I am not able to use the complete function from tidyr to prevent empty groups from being removed when I select columns by index. Or is there another option to prevent dplyr from dropping empty groups?
df1 <- mtcars %>%
group_by(gear) %>%
summarise(Mittelwert = mean(mpg, na.rm = TRUE)) %>%
complete(gear, fill = list(Gewicht = 1))
df2 <- mtcars %>%
group_by_at(10) %>%
summarise_at(1, mean, na.rm = TRUE) %>%
complete(gear, fill = list(Gewicht = 1))
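One option, assuming dplyr 0.8.0 or later: if the grouping column is a factor, `group_by(.drop = FALSE)` keeps levels that have no rows. The sketch below adds a hypothetical empty level `6` to `gear` and selects columns by name via `across(all_of())` (dplyr 1.0.0 or later) so it can be used inside a loop; `group_col` and `value_col` are illustrative names, not part of the original code.

```r
library(dplyr)

group_col <- "gear"   # hypothetical loop variables
value_col <- "mpg"

df2 <- mtcars %>%
  mutate(gear = factor(gear, levels = c(3, 4, 5, 6))) %>%   # level 6 has no rows
  group_by(across(all_of(group_col)), .drop = FALSE) %>%    # keep empty factor levels
  summarise(n = n(),
            across(all_of(value_col), ~ mean(.x, na.rm = TRUE)))
```

The empty level appears as a row with `n` equal to 0 and a mean of `NaN`, which can then be replaced with a default value if needed.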