Summarize data based on unique ID column - r

I am trying to summarise multiple columns based on an ID column so I don't double count observations. I have managed to use tapply to get what I need for one variable at a time but can't do this for several variables at the same time.
In addition, the data frame I want to apply this to has more than 50,000 rows, and I want to apply this to more than 10 different count variables. I was wondering if there is a better solution within dplyr, as I ultimately want to create a Shiny dashboard with this data.
I have replicated a small sample of the data and shown the existing code.
library(dplyr)
#Creating data frame
df <- data.frame (ID = c(1, 1, 2, 3, 4, 4, 4),
Count = c(1, 1, 30, 15, 1, 1, 1),
Count2 = c(1, 1, 20, 10, 1, 1, 1),
Service = c("Service A", "Service B", "Service C", "Service D",
"Service E", "Service F", "Service G"))
#Create object of variables to count
myvars <- c("Count", "Count2")
#Count number of unique frequencies for two groups
df %>%
group_by(ID) %>%
summarise(value_sum = sum(tapply(myvars, ID, FUN = max))) %>%
summarise(value_sum = sum(value_sum))
#Count number of unique frequencies (code works for one variable at a time)
df %>%
group_by(ID) %>%
summarise(value_sum = sum(tapply(Count, ID, FUN = max))) %>%
summarise(value_sum = sum(value_sum))
df %>%
group_by(ID) %>%
summarise(value_sum = sum(tapply(Count2, ID, FUN = max))) %>%
summarise(value_sum = sum(value_sum))

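You can use across() to work on multiple variables at the same time within summarise(). Because myvars is a character vector of column names, wrap it in all_of() so the selection is unambiguous. In your case:
df %>%
group_by(ID) %>%
summarise(across(all_of(myvars), max)) %>%
summarise(across(all_of(myvars), sum))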
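
As a self-contained sketch of the approach above, the following takes the per-ID maximum of each count column and then sums those maxima. The object name `totals` is introduced here only for illustration, and all_of() is used to select the columns named in the character vector.

```r
library(dplyr)

# Sample data from the question (Service column omitted, it is not needed here)
df <- data.frame(ID = c(1, 1, 2, 3, 4, 4, 4),
                 Count = c(1, 1, 30, 15, 1, 1, 1),
                 Count2 = c(1, 1, 20, 10, 1, 1, 1))
myvars <- c("Count", "Count2")

totals <- df %>%
  group_by(ID) %>%
  # one row per ID, keeping the maximum of each count column
  summarise(across(all_of(myvars), max)) %>%
  # collapse to a single row of totals
  summarise(across(all_of(myvars), sum))

totals
```

With this sample, the per-ID maxima of Count are 1, 30, 15 and 1, so the total is 47; for Count2 the total is 32.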

Related

Why does map_df produce many missing values? How can I concatenate across rows to remove NAs?

I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want a column for each subject and the possible grade (math_1, science_2, etc.).
I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs and only one chunk per column has actual data. I'm curious to know either:
Why is map_df() doing this and how can I avoid it? OR
How can I tighten this up so I only have this information on one row per the original rows in the first dataset (18 rows)? In other words, I'd concatenate up and down the column, so all the NAs are filled in (unless there truly was missing data).
Here's my code
library(tidyverse)
#Set up - generate sample dataset and get all combinations of grades and subjects
student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4)))
all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3",
"4", "5")
subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)
all_combos <- subjects_and_letter_grades %>%
unite("names", c(Var1, Var2)) %>%
mutate(names = str_replace_all(names, "\\|", "_")) %>%
pull(names)
# iterate over each combination using map_df()
student_map <- map_df(all_combos,
~student_grades %>%
mutate("{.x}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
str_detect(.x, grade) ~ 1,
TRUE ~ 0), na.rm = T)))
EDIT
For the record, my almost identical for loop does not pad in many missing values. I assume it must have something to do with how it is building the dataset, but I don't know how I can override what map_df is doing under the hood.
student_map <- student_grades
for(i in all_combos) {
student_map <- student_map %>%
mutate("{i}" := paste(i)) %>%
group_by(student_id) %>%
mutate("{i}" := sum(case_when(str_detect(i, subject) &
str_detect(i, grade) ~ 1,
TRUE ~ 0), na.rm = T))
}
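There is no i inside map_df(); in a purrr lambda the current element is .x, so paste(i) does not refer to the looped value. The NAs come from map_df() itself, which row-binds the result of every iteration: each pass contributes another copy of the rows with only its own new column filled. It is better to use transmute() instead of mutate() so that each iteration returns only the column it adds, bind those columns together with map_dfc(), and attach them to the original data at the end.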
library(dplyr)
library(purrr)
library(stringr)
student_map2 <- map_dfc(all_combos,
~ student_grades %>%
transmute(subject, grade, student_id, "{.x}" := .x) %>%
group_by(student_id) %>%
transmute("{.x}" := sum(case_when(str_detect( .x, subject) &
str_detect(.x, grade)~ 1, TRUE ~ 0), na.rm = TRUE)) %>%
ungroup %>%
select(-student_id)) %>%
bind_cols(student_grades, .)
Checking against the OP's for loop output:
> all.equal(student_map, student_map2, check.attributes = FALSE)
[1] TRUE
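
The difference between row-binding and column-binding can be seen on a toy example. This is a minimal sketch, not the OP's data: the names `base`, `new_cols`, `pieces`, `stacked` and `side_by_side` are made up for illustration, and bind_rows() / bind_cols() are called explicitly to show the two ways the per-iteration results can be combined.

```r
library(dplyr)
library(purrr)

base <- tibble(id = 1:3)
new_cols <- c("a", "b")

# Each iteration returns a 3-row tibble holding only its own new column
pieces <- map(new_cols, function(nm) transmute(base, "{nm}" := id * 10))

# Row-binding stacks the pieces: 6 rows, and each column is NA
# in the rows that came from the other iteration
stacked <- bind_rows(pieces)

# Column-binding places the pieces side by side: 3 rows, no NAs
side_by_side <- bind_cols(pieces)

stacked
side_by_side
```

Stacking is what produces the blocks of NAs described in the question; binding by column keeps one row per original row.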
Though I can't figure out why map_df() is performing in this undesirable way, I did find a solution, inspired heavily by the answer to this post.
solution <- student_map %>%
group_by(student_id, subject, grade) %>%
summarise_all(~ last(na.omit(.)))
solution
Basically, for each group this code keeps the last non-missing value in every column and returns NA only when a column contains nothing but missing values. Because those columns in my dataset will never have missing values, this solution works in my case.

User defined function in R with dplyr

I have a data frame and am trying to create a function that calculates the number of records by TRT01AN and another variable chosen by the user (I am posting a reduced data frame with only one extra variable to make it simpler).
dataframe <- as.data.frame(cbind(ID = c(1,2,3,4,5,6), TRT01AN = c(1, 1, 3, 2, 2, 2),
AGEGR1 =c("Adult","Child","Adolescent","Adolescent","Adolescent","Child")))
sub1 <- function(SUB1) {
# Calculate number of subjects in each treatment arm
bigN1 <- dataframe %>%
group_by_(SUB1,TRT01AN) %>%
summarise(N = n_distinct(ID))
return(bigN1)
}
bigN1<-sub1(SUB1="AGEGR1")
If I do that with group_by_, I get an error that TRT01AN doesn't exist, and if I use group_by, SUB1 can't be found. Any idea how I can have both variables, a "permanent" one and one defined as the argument of the function?
Thank you!
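Try the curly-curly operator {{ }} inside group_by(), passing the column name unquoted in the function call (a quoted string would be treated as a constant value rather than as a column):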
library(dplyr)
dataframe <-
as.data.frame(cbind(
ID = c(1, 2, 3, 4, 5, 6),
TRT01AN = c(1, 1, 3, 2, 2, 2),
AGEGR1 = c(
"Adult",
"Child",
"Adolescent",
"Adolescent",
"Adolescent",
"Child"
)
))
sub1 <- function(SUB1) {
# Calculate number of subjects in each treatment arm
bigN1 <- dataframe %>%
group_by({{SUB1}}, TRT01AN) %>%
summarise(N = n_distinct(ID))
return(bigN1)
}
bigN1 <- sub1(AGEGR1)
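
If the column name has to arrive as a character string, as in the original call sub1(SUB1 = "AGEGR1"), one option is dplyr's .data pronoun. The sketch below is an assumption about how that could look, not the answerer's code: the function name `count_by_string` and its arguments are hypothetical, and the data frame is rebuilt with data.frame() so that the columns keep their types.

```r
library(dplyr)

dataframe <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  TRT01AN = c(1, 1, 3, 2, 2, 2),
  AGEGR1 = c("Adult", "Child", "Adolescent", "Adolescent", "Adolescent", "Child")
)

# Hypothetical helper: group_col is a string naming the user-chosen column
count_by_string <- function(data, group_col) {
  data %>%
    group_by(.data[[group_col]], TRT01AN) %>%
    summarise(N = n_distinct(ID), .groups = "drop")
}

res <- count_by_string(dataframe, "AGEGR1")
res
```

Here .data[[group_col]] looks the column up by name, while TRT01AN stays a fixed grouping variable.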

gather 3 different detections of three different variables

I have a data frame of 96,074 observations of 31 variables.
The first two variables are the id and the date, followed by 9 measurement columns (three different KPIs, each with three different time properties) and then various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform from wide to long, in order to do graphs with ggplot using facets for example.
If I had a df with just one variable with its three-time scans I would have no problem in using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
Here id, sum_d, sum_i and per_i_d are of class factor, time is of class Date, and the values are of class numeric (I should add that there are no negative measurements in these variables).
What I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
gathering all of the variables in a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each couple id/date, then do some manipulation with separate/unite to have the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
library(tidyverse)
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
select(-key) %>%
spread(type,value)
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
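
The same reshaping can also be sketched with tidyr's pivot_longer(), which supersedes gather()/spread(). This is an alternative to the answer above rather than a restatement of it; the regular expression and the object name `long` are assumptions chosen to fit the column names in the question. The special name ".value" tells pivot_longer() to keep the KPI part of each name as a column and move only the time part into `age`.

```r
library(tidyr)

set.seed(1)  # only so the random sample data is reproducible
df <- data.frame(
  id = rep(1:3, 3),
  time = rep(as.Date("2009-01-01") + 0:2, each = 3),
  sum_d_1day_old = rnorm(9, 2, 1),
  sum_i_1day_old = rnorm(9, 2, 1),
  per_i_d_1day_old = rnorm(9, 0, 1),
  sum_d_5days_old = rnorm(9, 0, 1),
  sum_i_5days_old = rnorm(9, 0, 1),
  per_i_d_5days_old = rnorm(9, 0, 1),
  sum_d_15days_old = rnorm(9, 0, 1),
  sum_i_15days_old = rnorm(9, 0, 1),
  per_i_d_15days_old = rnorm(9, 0, 1)
)

long <- pivot_longer(
  df,
  cols = -c(id, time),
  # first group: KPI name (sum_d, sum_i, per_i_d); second group: time property
  names_to = c(".value", "age"),
  names_pattern = "(.+)_(\\d+days?_old)"
)

head(long)
```

Because no intermediate key/value column is spread back out, NAs that are present in the original measurements are carried through unchanged instead of being mixed with padding NAs.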

summarise mean of a specific column in dplyr

I would like to summarise a grouped data.frame without knowing the name of the column. But what I know is, that the feature is always at position 3 (column) in this data.frame, is that possible?
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
group = rep(c("A", "B"), 3),
temperature = runif(6, -10, 30),
percipitation = runif(6, 0,5)
)
parameter <- "perc"
df1 <- df %>%
select(date, group, starts_with(parameter)) %>%
group_by(group) %>%
summarise(
avg = mean(percipitation)
)
In this example the code works, but of course only for the parameter 'perc' and not for 'temp' or so.
avg = mean(df[[3]])
or something like this doesn't work. Any suggestions?
You could keep just the grouping variable and the third column using select(group, 3). The function summarise_all() can then be used to calculate the mean.
df %>%
select(group, 3) %>%
group_by(group) %>%
summarise_all(mean)
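
Another way to address the column by position is to look up its name first and then use the .data pronoun inside summarise(). This is a minimal sketch with fixed numbers instead of runif() so the result is predictable; the names `target` and `res` are introduced here for illustration only.

```r
library(dplyr)

df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
             group = rep(c("A", "B"), 3),
             temperature = c(1, 2, 3, 4, 5, 6),
             percipitation = c(0, 1, 0, 1, 0, 1))

# Name of whatever column sits at position 3
target <- names(df)[3]

res <- df %>%
  group_by(group) %>%
  summarise(avg = mean(.data[[target]]))

res
```

With these values, group A averages 1, 3 and 5 (mean 3) and group B averages 2, 4 and 6 (mean 4).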

In nested data frame, pass information from one list column to function applied in another

I am working on a report for which I have to export a large number of similar data frames into nice looking tables in Word. My goal is to achieve this in one go, using flextable to generate the tables and purrr / tidyverse to apply all the formatting procedures to all rows in a nested data frame. This is what my data frame looks like:
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
I want to generate separate tables for all groups in column 'school' and started by using the nest() function within tidyr.
list <- df %>%
group_by(school) %>%
nest()
This gives me a nested data frame to which I can apply the functions in flextable using purrr:
list <- list %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade")))
The first mutate generates a new column with flextable objects for each school, and the second mutate applies header labels to the table, based on the column names that are saved in the object.
My goal is now to add another header that is based on the name of the school. This value resides in the list column entitled school, which corresponds row-wise to the tables generated in the list column ftables. How can I pass the name of the school to the add_header function within ftables, using purrr or any other procedure?
Expected output
I have been able to achieve what I want for individual schools with this procedure (identical header cells will later be merged):
school.name <- "A"
ftable.a <- df %>%
filter(school == "A") %>%
select(-school) %>%
flextable() %>%
set_header_labels(students = "No of students",
grade = "Grade") %>%
add_header(students = school.name,
grade = school.name)
ftable.a
The purrr package provides map2(), which iterates over two lists or vectors in parallel, so each table can be paired with its school name:
library(flextable)
library(magrittr)
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
byschool <- df %>%
group_by(school) %>%
nest()
byschool <- byschool %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(
ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade"))) %>%
mutate(ftables = purrr::map2(ftables, school, function(ft, h){
add_header(ft, students = h, grade = h)
} ))
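
The map2() pattern can be tried without flextable. In this sketch the student counts are fixed numbers and a plain title string stands in for the table header; the column name `title` is hypothetical, and nest(data = -school) is used here as an ungrouped equivalent of group_by() followed by nest().

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
                 students = c(3, 5, 2, 4, 6, 1),
                 grade = c(1, 1, 2, 2, 3, 3))

byschool <- df %>%
  nest(data = -school) %>%
  # .x is the nested data frame for a school, .y is that school's name
  mutate(title = map2_chr(data, school,
                          ~ paste0("School ", .y, ": ", sum(.x$students), " students")))

byschool$title
```

The first argument of map2() is the list column and the second is the key column from the same row, which is the same pairing used for add_header() above.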

Resources