Aggregate a data frame while keeping other variables, with dplyr - r

Suppose I have the following data frame (note the length of 'score'):
id = 1:10^8
school = LETTERS[1:10]
class = paste0(school, rep(1:10, each=10))
score = rnorm(10^8)
df = data.frame(id, school, class, score,
                stringsAsFactors = FALSE)
I want to compute the mean of each of the 100 classes. Yet, I also want
to keep the school variable in the results. Using dplyr:
df %>% group_by(class) %>%
  summarise(mean = mean(score),
            school = unique(school))
This works, but it is slow (8 seconds on my machine, and my real data is much bigger). I think one option could be not to use unique() but a member of the join() family. But I first need to define another data frame, as follows:
df_join = data.frame(class, school,
                     stringsAsFactors = FALSE)
and then:
df %>% group_by(class) %>%
  summarise(mean = mean(score)) %>%
  left_join(df_join)
This works and is less slow, taking 6 seconds. Yet creating df_join here was easy because I invented the data frame; in real life, obtaining df_join can be much more challenging. So I would like to use only the original data frame (df).
Any idea for making this easier (and maybe faster) with dplyr? (I checked there, but did not find a solution: Aggregate by factor levels, keeping other variables in the resulting data frame)

Since you only have one unique school per class, you can simply include the school variable in the grouping variables:
df %>% group_by(school, class) %>% summarize(mean_score = mean(score))
# # A tibble: 100 x 3
# # Groups: school [?]
# school class mean_score
# <chr> <chr> <dbl>
# 1 A A1 0.000506
# 2 A A10 -0.000275
# 3 A A2 0.00136
# 4 A A3 0.000405
# 5 A A4 -0.00156
# 6 A A5 -0.00214
# 7 A A6 -0.00108
# 8 A A7 -0.000534
# 9 A A8 0.000804
# 10 A A9 0.00106
# # ... with 90 more rows
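If you do want the join-based version without building df_join by hand, one option (a sketch, not from the original answer, assuming the class-to-school mapping in df really is one-to-one) is to derive the lookup table from df itself with distinct():
# Derive the class-to-school lookup table directly from the original data frame
class_school <- df %>% distinct(class, school)
df %>%
  group_by(class) %>%
  summarise(mean = mean(score)) %>%
  left_join(class_school, by = "class")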
Here's a data.table equivalent:
library(data.table)
setDT(df, key = c("school", "class"))
df[, .(mean_score = mean(score)), by=.(school, class)]

Related

Loop through specific columns of dataframe keeping some columns as fixed

I have a large dataset whose first two columns serve as IDs (one is an ID and the other is a year variable). I would like to compute a count by group, looping over each variable that is not an ID one. The code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
  ID1 = c(rep("a", 10), rep("b", 10)),
  year = c(2001:2020),
  var1 = rnorm(20),
  var2 = rnorm(20))
df %>%
  select(ID1, year, var1) %>%
  filter(if_any(starts_with("var"), ~ !is.na(.))) %>%
  group_by(year) %>%
  count() %>%
  print(n = Inf)
I cannot use a loop that starts with for(i in names(df)) since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures, but it did not work: I get the error select() doesn't handle lists. I also tried to work with select(starts_with("var")) but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
  group_by(ID1) %>%
  summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#> ID1 var1 var2
#> <chr> <int> <int>
#> 1 a 10 10
#> 2 b 10 10
If you do want an explicit loop, you can iterate over just the matching column names with for(i in names(df)[grepl('var', names(df))]), as sketched below.
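A minimal sketch of that loop applied to the original pipeline (the results list and the use of all_of()/.data[[i]] are my own additions, not from the original answer):
results <- list()
for (i in names(df)[grepl("var", names(df))]) {
  results[[i]] <- df %>%
    select(ID1, year, all_of(i)) %>%   # keep the ID columns plus one "var" column
    filter(!is.na(.data[[i]])) %>%     # drop rows where that column is NA
    group_by(year) %>%
    count()
}
results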

Using group_by() function on multiple data frames?

I have data that were collected over a year but are broken up by month. In my code, I labeled the data frames df1-df12 for each corresponding month. I am trying to group these data using the group_by function so that all the data frames are grouped the same way. The following code works fine on its own:
df <- df %>%
  group_by(date, id) %>%
  slice(n()) %>%
  ungroup()
However, I would like to streamline this code so that I can use this function for all 12 dataframes without having to copy/paste 12 times, since there is a lot of data to go through. Here is what I have tried to do to that end:
func1 <- function(df) {
  df <- df %>%
    group_by(date, id) %>%
    slice(n()) %>%
    ungroup()
}
yr19<-c(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
map(yr19, func1)
However, I get the following error message: Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "character". As stated above, I don't get this error message if I go through the data frames individually, but there are many months and many years to be analyzed, and doing this manually is not feasible time-wise. Thanks for your help.
Two ways you can approach this, first using the approach suggested by #ktiu:
## Create example data
library(dplyr) # for pipe and group_by()
set.seed(914)
df1 <- tibble(
  date = sample(1:30, 50, replace = T),
  id = sample(1:10, 50, replace = T),
  var1 = rnorm(50, mean = 10, sd = 3)
)
df2 <- tibble(
  date = sample(1:30, 50, replace = T),
  id = sample(1:10, 50, replace = T),
  var1 = rnorm(50, mean = 10, sd = 3)
)
Modifying your function so it explicitly returns the data frame (the error itself actually comes from combining the data frames with c() rather than list(); see below):
func1 <- function(df) {
  df <- df %>%
    group_by(date, id) %>%
    slice(n()) %>%
    ungroup()
  df
}
## And using list() rather than c() to combine the data frames: c() concatenates data
## frames into one long list of their columns, so map() ends up handing individual
## column vectors (here a character column) to group_by(), which causes the error.
yr19 <- list(df1, df2)
yr19_data <- lapply(yr19, func1)
# This will return a list of data frames you can access with `yr19_data[[1]]`
Alternative approach is to add variable for your source data frames, then collapse it all into a single data frame and manipulate from there. Which approach makes more sense will depend on what else you want to do later.
func2 <- function(df.name){
  mutate(get(df.name), source = df.name)
}
# This is set up to get objects given their names, so we'll use a character vector
# of names to iterate off of.
yr19 = c("df1", "df2")
df.list <- lapply(yr19, func2)
df.long <- do.call(bind_rows, df.list)
df.long
# # A tibble: 100 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 27 9 9.31 df1
# 2 5 3 16.5 df1
# 3 28 3 2.67 df1
# 4 24 4 8.94 df1
# 5 13 3 1.68 df1
At this point you can manipulate one data frame in your original pipe:
df <- df.long %>%
  group_by(source, date, id) %>%
  slice(n()) %>%
  ungroup()
df
# # A tibble: 93 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 1 8 9.89 df1
# 2 2 4 10.9 df1
# 3 4 3 8.45 df1
# 4 5 3 16.5 df1
# 5 5 7 10.6 df1
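As a side note (not part of the original answer), the "collapse into one data frame" step can also be done in a single call, since bind_rows() accepts a named list of data frames and an .id argument; a minimal sketch under that assumption:
df.list <- lapply(yr19, get)   # yr19 is the character vector of data frame names
names(df.list) <- yr19
df.long <- bind_rows(df.list, .id = "source")
The same group_by(source, date, id) pipeline can then be applied to df.long as shown above.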

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example: (the question included a small table of ages and values, not reproduced here). I want to group ages 1-2 and count the values; in this data the count for age group 1-2 is 4. Similarly, I want to group ages 3-4 and count the values; here the count for age group 3-4 is 6.
How can I group the ages and aggregate the values corresponding to each group?
I know this approach:
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values by individual Age. I want the values aggregated over several ages treated as one group, as in the example above.
Any help on this will be greatly appreciated.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
  mutate(grp = case_when(
    age %in% 1:2 ~ "1:2",
    age %in% 3:4 ~ "3:4",
    TRUE ~ NA_character_
  )) %>%
  group_by(grp) %>%
  tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
                 Name = letters[1:10],
                 stringsAsFactors = F)
df %>%
  count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
grp n
<fct> <int>
1 (0,2] 4
2 (2,4] 6
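If you prefer readable group names instead of interval notation, cut() also takes a labels argument; a small sketch (not part of the original answers):
df %>%
  count(grp = cut(age, breaks = c(0, 2, 4), labels = c("1-2", "3-4")))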

Summarise but keep length variable (dplyr)

Basic dplyr question... Respondents could select multiple companies that they use. For example:
library(dplyr)
library(tidyr)  # for gather(), used below
test <- tibble(
  CompanyA = rep(c(0:1), 5),
  CompanyB = rep(c(1), 10),
  CompanyC = c(1,1,1,1,0,0,1,1,1,1)
)
test
If it were a forced-choice question - i.e., respondents could make only one selection - I would do the following for a basic summary table:
test %>%
  summarise_all(funs(sum), na.rm = TRUE) %>%
  gather(Response, n) %>%
  arrange(desc(n)) %>%
  mutate("%" = round(100*n/sum(n)))
Note, however, that the "%" column is not what I want. I'm instead looking for the proportion of total respondents for each individual response option (since they could make multiple selections).
I've tried adding mutate(totalrows = nrow(.)) %>% prior to the summarise_all command. This would allow me to use that variable as the denominator in a later mutate command. However, summarise_all eliminates the "totalrows" var.
Also, if there's a better way to do this, I'm open to ideas.
To get the proportion of respondents who chose an option when that variable is binary, you can take the mean. To do this with your test data, you can use sapply:
sapply(test, mean)
CompanyA CompanyB CompanyC
     0.5      1.0      0.8
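A dplyr equivalent of that one-liner (a sketch, not from the original answer, assuming dplyr >= 1.0 for across() and that the columns stay 0/1 coded):
test %>%
  summarise(across(everything(), mean))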
If you wanted to do this in a more complicated fashion (say your data is not binary encoded, but is stored as 1 and 2 instead), you could do that with the following:
test %>%
  gather(key = 'Company') %>%
  group_by(Company) %>%
  summarise(proportion = sum(value == 1) / n())
# A tibble: 3 x 2
Company proportion
<chr> <dbl>
1 CompanyA 0.5
2 CompanyB 1
3 CompanyC 0.8
If you put all the functions in a list within summarise_all(), this will work. You'll need to do some quick tidying up afterwards, though.
test %>%
  summarise_all(
    list(
      rows = length,
      n = function(x){sum(x, na.rm = T)},
      perc = function(x){sum(x, na.rm = T)/length(x)}
    )) %>%
  tidyr::gather(Response, n) %>%
  tidyr::separate(Response, c("Company", "Metric"), '_') %>%
  tidyr::spread(Metric, n)
And you'll get this
Company n perc rows
<chr> <dbl> <dbl> <dbl>
1 CompanyA 5 0.5 10
2 CompanyB 10 1 10
3 CompanyC 8 0.8 10
Here is a solution using tidyr::gather:
test %>%
  gather(Company, response) %>%
  group_by(Company) %>%
  summarise(`%` = 100 * sum(response) / n())
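As an aside (not from the original answers), gather() has since been superseded by pivot_longer() in tidyr; an equivalent sketch:
test %>%
  pivot_longer(everything(), names_to = "Company", values_to = "response") %>%
  group_by(Company) %>%
  summarise(`%` = 100 * sum(response) / n())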

How do you use spread() when your data has multiple "key" variables?

Edit: apologies for the more-than-minimal example. I redid this with a more parsimonious example, and it looks like aosmith's answer worked out!
This is the next step after this question, in the same process. It's been a doozy.
I have a dataset with a series of variables, each with low, medium, and high values. There are also multiple identification variables, which here I am calling "scenario" and "month" just for this example. I'm doing a calculation involving 3 different values, some of which have a low, medium, or high value that varies in each scenario, and each month.
# generating a practice dataset
library(dplyr)
library(tidyr)
set.seed(123)
pracdf <- bind_cols(expand.grid(ID = letters[1:2],
                                month = 1:2,
                                scenario = c("a", "b")),
                    data_frame(p.mid = runif(8, 100, 1000),
                               a = rep(runif(2), 4),
                               b = rep(runif(2), 4),
                               c = rep(runif(2), 4)))
pracdf <- pracdf %>% mutate(p.low = p.mid * 0.75,
                            p.high = p.mid * 1.25) %>%
  gather(p.low, p.mid, p.high, key = "ptype", value = "p")
# all of that is just to generate the practice dataset.
# 2 IDs * 2 months * 2 scenarios * 3 different values of p = 24 total rows in this dataset
# Do the calculation
pracdf2 <- pracdf %>%
  mutate(result = p * a * b * c)
This fully "gathered" dataset has the results that I want. Let's do a spread-type operation to get this in a way that's a bit more readable, with each month, scenario, and p-type combination having it's own column. An example column name would be 'month1_scenario.a_p.low'. The total with this dataset would be 2 months * 3 p types * 2 scenarios = 12 columns.
# this fully "gathered" dataset is exactly what I want.
# Let's put it in a format that the supervisor for this project will be happy with
# ID, month, scenario, and p.type are all "key" variables
# spread() only allows one key variable at a time, so...
pracdf2.spread1 <- pracdf2 %>% spread(ptype, result, sep = ".")
# Produces NA's. Looks like it's messing up with the different values of p
pracdf2.spread2 <- pracdf2 %>% select(-p) %>% spread(ptype, result, sep = ".")
# that's better, now let's spread across scenarios
pracdf2.spread2.spread2low <- pracdf2.spread2 %>% select(-ptype.p.high, -ptype.p.mid) %>% spread(scenario, ptype.p.low, sep = ".")
pracdf2.spread2.spread2mid <- pracdf2.spread2 %>% select(-ptype.p.low, -ptype.p.high) %>% spread(scenario, ptype.p.mid, sep = ".")
pracdf2.spread2.spread2high <- pracdf2.spread2 %>% select(-ptype.p.mid, -ptype.p.low) %>% spread(scenario, ptype.p.high, sep = ".")
pracdf2.spread2.spread2 <- pracdf2.spread2.spread2low %>% left_join(pracdf2.spread2.spread2mid)
# Ok, that was rough and will clearly spiral out of control quickly
# what am I still doing with my life?
I could do the spread() to spread each key column, then redo the spread for each consequent value column, but that will take ages and will likely be error-prone.
Is there a cleaner, tidier, and tidyr way to do this?
Thanks!
You can use unite from tidyr to combine the three columns into one prior to spreading.
Then you can spread, using the new column as the key and the "result" as value.
I also removed columns "a" through "p" prior to spreading, as it didn't seem like these were needed in the desired result.
pracdf2 %>%
  unite("allgroups", month, scenario, ptype) %>%
  select(-(a:p)) %>%
  spread(allgroups, result)
# A tibble: 2 x 13
ID `1_a_p.high` `1_a_p.low` `1_a_p.mid` `1_b_p.high` `1_b_p.low` `1_b_p.mid` `2_a_p.high` `2_a_p.low`
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 160 96.2 128 423 254 338 209 126
2 b 120 72.0 96.0 20.9 12.5 16.7 133 79.5
# ... with 4 more variables: `2_a_p.mid` <dbl>, `2_b_p.high` <dbl>, `2_b_p.low` <dbl>, `2_b_p.mid` <dbl>
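As an aside (not part of the original answer), newer versions of tidyr let pivot_wider() take several names_from columns at once, which avoids the unite() step; a sketch under that assumption:
pracdf2 %>%
  select(ID, month, scenario, ptype, result) %>%   # keep only the key columns and the value
  pivot_wider(names_from = c(month, scenario, ptype), values_from = result)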
