Summarizing a collection of data frames - improving upon a clumsy solution - r

I have a collection of data frames, df_i, representing the ith visit of a set of patients to a hospital. I'd like to summarize each of the data frames to determine the number of men, women and total patients at the ith visit. While I can solve this, my solution is clumsy. Is there a simpler way to get the final dataframe that I want? Example follows:
df_1 <- data.frame(
ID = c(rep("A",4), rep("B",3), rep("C",2), "D"),
Dates = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-01-10"), by = "day"),
Sex = c(rep("Male",4), rep("Male",3), rep("Female",2), "Female"),
Weight = seq(100, 190, 10),
Visit = rep(1, 10)
)
df_2 <- data.frame(
ID = c(rep("A",4), rep("B",3), rep("C",2)),
Dates = seq.Date(from = as.Date("2020-02-01"), to = as.Date("2020-02-9"), by = "day"),
Sex = c(rep("Male",4), rep("Male",3), rep("Female",2)),
Weight = seq(100, 180, 10),
Visit = rep(2, 9)
)
df_3 <- data.frame(
ID = c(rep("A",4), rep("B",3)),
Dates = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-03-07"), by = "day"),
Sex = rep("Male",7),
Weight = seq(140, 200, 10),
Visit = rep(3, 7)
)
I'm looking to generate the following result:
> df_sum
Visit Patients Men Women
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0
I can do this in a very clumsy way: First create a temporary data frame that summarizes the information in df_1
library(dplyr)
df_tmp <- df_1 %>%
group_by(ID) %>%
filter(Dates == min(Dates)) %>%
summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
> df_tmp
# A tibble: 4 x 4
ID n Men Women
<chr> <int> <int> <int>
1 A 1 1 0
2 B 1 1 0
3 C 1 0 1
4 D 1 0 1
Next, sum each of the columns in df_tmp to create the first row for the summary column.
r1 <- c(sum(df_tmp$n), sum(df_tmp$Men), sum(df_tmp$Women))
Repeat for the second and third data frames. Finally, rbind the rows together to create the summary data frame. While this works, it is extremely clumsy and doesn't generalize to the case where I have a variable number of visits. Would someone kindly point me to a more elegant solution to my problem?
Many thanks in advance
Thomas Philips

Could also make into a tibble with bind_rows:
library(tidyverse)
bind_rows(df_1, df_2, df_3, .id = "day") %>%
group_by(day, ID) %>%
slice_min(Dates) %>%
group_by(day) %>%
summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
Result
# A tibble: 3 x 4
day n Men Women
* <chr> <int> <int> <int>
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0
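If every data frame already carries a Visit column, as in the question, the .id argument is optional. The following is a minimal sketch, not the answerer's code: it assumes a patient's Sex is constant within a visit, so distinct() can stand in for picking the earliest date. The data frames are compact stand-ins for the question's data that keep only the columns used here.

```r
library(dplyr)

# Compact stand-ins for the question's data frames (only the columns used here).
df_1 <- data.frame(ID = rep(c("A", "B", "C", "D"), times = c(4, 3, 2, 1)),
                   Sex = rep(c("Male", "Female"), times = c(7, 3)),
                   Visit = 1)
df_2 <- data.frame(ID = rep(c("A", "B", "C"), times = c(4, 3, 2)),
                   Sex = rep(c("Male", "Female"), times = c(7, 2)),
                   Visit = 2)
df_3 <- data.frame(ID = rep(c("A", "B"), times = c(4, 3)),
                   Sex = "Male",
                   Visit = 3)

df_sum <- bind_rows(df_1, df_2, df_3) %>%
  distinct(Visit, ID, Sex) %>%   # one row per patient per visit (assumes Sex is constant)
  group_by(Visit) %>%
  summarise(Patients = n(),
            Men = sum(Sex == "Male"),
            Women = sum(Sex == "Female"),
            .groups = "drop")

df_sum
```

The column names here follow the df_sum layout requested in the question.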

Put the data in a list and iterate over them through map so that you don't have to repeat the code for each dataframe. Using janitor::adorn_totals you can add a new row in the output with the total and get the data in wide format.
library(tidyverse)
list_df <- list(df_1, df_2, df_3)
map_df(list_df, ~.x %>%
group_by(ID) %>%
filter(Dates == min(Dates)) %>%
ungroup %>%
count(Sex) %>%
janitor::adorn_totals(name = 'Patients'), .id = 'Visit') %>%
pivot_wider(names_from = Sex, values_from = n, values_fill = 0)
# Visit Female Male Patients
# <chr> <int> <int> <int>
#1 1 2 2 4
#2 2 1 2 3
#3 3 0 2 2
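The question also mentions a variable number of visits. Here is a minimal base R sketch, assuming the data frames all follow the df_<number> naming pattern and live in the current environment. The toy data frames and the helper summarise_visit() are hypothetical and only illustrate the idea of collecting the objects with ls() and mget() instead of listing them by hand.

```r
# Hypothetical toy stand-ins for the per-visit data frames.
df_1 <- data.frame(ID = c("A", "A", "B", "C", "D"),
                   Sex = c("Male", "Male", "Male", "Female", "Female"))
df_2 <- data.frame(ID = c("A", "B", "C"), Sex = c("Male", "Male", "Female"))
df_3 <- data.frame(ID = c("A", "B"), Sex = c("Male", "Male"))

# Collect every object named df_<number>, ordered by the numeric suffix
# (a plain alphabetical sort would put df_10 before df_2).
env <- environment()
nms <- ls(envir = env, pattern = "^df_[0-9]+$")
nms <- nms[order(as.integer(sub("^df_", "", nms)))]
list_df <- mget(nms, envir = env)

# Hypothetical helper: count unique patients, men and women in one visit.
summarise_visit <- function(d) {
  patients <- unique(d[, c("ID", "Sex")])
  c(Patients = nrow(patients),
    Men = sum(patients$Sex == "Male"),
    Women = sum(patients$Sex == "Female"))
}

df_sum <- data.frame(Visit = seq_along(list_df),
                     t(sapply(list_df, summarise_visit)),
                     row.names = NULL)
df_sum
```

The resulting list_df could equally be passed to map_df() in the answer above.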

Related

Summarize one variable/column over all possible values of other variables/columns

I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column, and I need the summarized value for every value of the other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
Exotic = rbinom(length(Species), 1, .3),
Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
Species = Species) %>%
left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence),
N_sp_Exo = sum(Presence[Exotic == 1]),
N_sp_Nat = sum(Presence[Exotic == 0]),
N_sp_M0 = sum(Presence[Migrant == 0]),
N_sp_M1 = sum(Presence[Migrant == 1]),
N_sp_M2 = sum(Presence[Migrant == 2])))
You can get the data in long format for your columns of interest, c(Exotic, Migrant), and take the sum of the Presence column for each unique column name and its values. This can then be merged with the overall sum for each Site.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence))
data2 <- Data_tbl %>%
pivot_longer(cols = c(Exotic, Migrant)) %>%
group_by(Site, name, value) %>%
summarise(result = sum(Presence), .groups = "drop") %>%
pivot_wider(names_from = c(name, value), values_from = result)
inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided in two steps for ease of readability. If you would like to do this in a single chain without creating temporary variables that can be done as well.
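One possible single-chain version is sketched below. It is an assumption about what such a chain could look like, not the answerer's code: the per-site total is attached with mutate() before reshaping and then carried along as a grouping column. The small Data_tbl here is a hypothetical deterministic stand-in for the question's random data.

```r
library(dplyr)
library(tidyr)

# Hypothetical deterministic stand-in for the question's Data_tbl.
Data_tbl <- data.frame(
  Site     = rep(c("S1", "S2"), each = 3),
  Species  = rep(c("Sps1", "Sps2", "Sps3"), times = 2),
  Exotic   = c(0, 1, 1, 0, 1, 1),
  Migrant  = c(0, 1, 2, 0, 1, 2),
  Presence = c(1, 1, 0, 0, 1, 1)
)

result <- Data_tbl %>%
  group_by(Site) %>%
  mutate(N_sp = sum(Presence)) %>%          # per-site total, kept on every row
  ungroup() %>%
  pivot_longer(cols = c(Exotic, Migrant)) %>%
  group_by(Site, N_sp, name, value) %>%     # N_sp rides along as a grouping column
  summarise(result = sum(Presence), .groups = "drop") %>%
  pivot_wider(names_from = c(name, value),
              values_from = result,
              values_fill = 0)

result
```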

How to produce tidy summaries when comparing logical conditions

Consider the following data frame:
set.seed(123)
dat <- data.frame(Region = rep(c("a","b"), each=100),
State =rep(c("NY","MA","FL","GA"), each = 50),
Loc = rep(letters[1:20], each = 5),
ID = 1:200,
count1 = sample(4, 200, replace=T),
count2 = sample(4, 200, replace=T))
Individual observations are denoted with a unique ID. There are three grouping variables for the individual observations: Region, State, and Loc. Let's say that I know the following conditions to be true:
- When count1 equals 1 then count2 should equal 2
- When count1 equals 2 then count2 should equal 4
- When count1 equals 3 then count2 should equal 1
- When count1 equals 4 then count2 should equal 3
I want to answer the following types of questions:
1. How many observations that belong to each grouping variable (Region, State, Loc) are in each level of count1 and count2
2. Which IDs are in which level of count1 and count2 (and what grouping variables do these IDs belong to)
3. How often do the conditions outlined above hold true, and how often do they not hold true
4. For which grouping variables and IDs do these conditions hold true, and for which do they not hold true
5. When the conditions do not hold true, what is actually observed (e.g., when count1 equals 1 then count2 should equal 2; so when count1 equals 1 but count2 does not equal 2, then what does count2 equal instead).
How can I specify these conditions and produce tidy summary-like tables to answer these questions?
You can think of the levels of count1 and count2 as being associated with certain characteristics, and I want to understand the relationship of those levels with each other and with the grouping variables. If anyone has any graphical visualization ideas for these types of questions, that would be very helpful as well!
Here's one way to go for questions 1 & 2 although this feels a little involved. I am using tidyr pivot_wider to create columns for each unique value of count1 and count2. The function length in values_fn counts the number of elements in the vectors created by pivot_wider for relevant combinations. As we need answers for count1 and count2 separately I run pivot_wider twice.
Results are then combined with bind_cols and superfluous columns are removed.
All of this can probably be improved on with a bit more thought.
library(dplyr)
library(tidyr)
library(tibble)
set.seed(123)
data <- data.frame(Region = rep(c("a","b"), each=100),
State =rep(c("NY","MA","FL","GA"), each = 50),
Loc = rep(letters[1:20], each = 5),
ID = 1:200,
count1 = sample(4, 200, replace=T),
count2 = sample(4, 200, replace=T))
# 1. How many observations that belong to each grouping variable (Region, State, Loc) are in each level of count1 and count2
level_count1 <- data %>%
select(-count2) %>%
pivot_wider(id_cols = c(Region, State, Loc),
values_from = count1,
values_fn = list(count1 = length),
names_from = count1,
names_prefix = "count1_")
level_count2 <- data %>%
select(-count1) %>%
pivot_wider(id_cols = c(Region, State, Loc),
values_from = count2,
values_fn = list(count2 = length),
names_from = count2,
names_prefix = "count2_")
level_count <- bind_cols(level_count1, level_count2) %>% select(-Region1, -State1, -Loc1)
# 2. Which IDs are in which level of count1 and count2(and what grouping variables do these IDs belong to)
ID_count1 <- data %>%
select(-count2) %>%
pivot_wider(id_cols = ID,
values_from = count1,
values_fn = list(count1 = length),
names_from = count1,
names_prefix = "count1_") %>%
left_join(data %>% select(Region, State, Loc, ID), by = "ID")
ID_count2 <- data %>%
select(-count1) %>%
pivot_wider(id_cols = ID,
values_from = count2,
values_fn = list(count2 = length),
names_from = count2,
names_prefix = "count2_") %>%
left_join(data %>% select(Region, State, Loc, ID), by = "ID")
ID_count <- bind_cols(ID_count1, ID_count2) %>% select(-Region1, -State1, -Loc1, -ID)
Results are like this:
> head(level_count)
# A tibble: 6 x 11
Region State Loc count1_3 count1_1 count1_2 count1_4 count2_4 count2_2 count2_1 count2_3
<fct> <fct> <fct> <int> <int> <int> <int> <int> <int> <int> <int>
1 a NY a 3 2 NA NA 2 1 1 1
2 a NY b 2 1 2 NA NA 2 1 2
3 a NY c NA NA 3 2 NA 2 2 1
4 a NY d 1 3 NA 1 1 3 NA 1
5 a NY e 2 3 NA NA 2 3 NA NA
6 a NY f 1 2 1 1 NA NA 2 3
The value of 3 in the first row of column count1_3 means that the combination of Region == "a", State == "NY" and Loc == "a" occurs 3 times for the value 3 for count1. Likewise, the value 2 in the second row indicates that the value 3 occurs twice in count1 for the combination Region == "a", State == "NY" and Loc == "b".
NA values indicate that 3 does not occur for the given combination of categorical columns. And so on. Is this useful for you?
The approach for ID is similar.
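The answer above does not cover the questions about whether the conditions hold. One way they might be approached is sketched here, under the assumption that the four rules can be stored as a lookup vector indexed by count1. The vector expected_count2, the column names expected and holds, and the four-row toy data are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the question.
dat <- data.frame(Region = c("a", "a", "b", "b"),
                  ID     = 1:4,
                  count1 = c(1, 2, 3, 4),
                  count2 = c(2, 3, 1, 1))

# Rule table: position = count1, value = the count2 that should go with it.
expected_count2 <- c(2, 4, 1, 3)

# Flag each observation (which IDs and groups satisfy the rule).
checked <- dat %>%
  mutate(expected = expected_count2[count1],
         holds    = count2 == expected)

# How often the rule holds or fails within each Region.
by_region <- checked %>% count(Region, holds)

# What was observed instead when the rule failed.
mismatches <- checked %>%
  filter(!holds) %>%
  count(count1, expected, observed = count2)

checked
by_region
mismatches
```

Replacing Region in count() with State or Loc gives the same breakdown for the other grouping variables.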

30 sec rows to 1 min rows

I have a dataset with a timestamp every 30 seconds and a binary value alongside each timestamp, where 0 means active and 1 means inactive. I want to combine each pair of 30-second intervals into a single one-minute interval: if either of the two values is 0, the new interval is marked 0 (active); if both are 1, it is marked 1 (inactive). I could probably write a macro in Excel, but I think it would be easier to do in R.
11:00.20 1
11:00.50 0
11:01.20 1
11:01.50 1
Here's a way using dplyr -
df %>%
mutate(group = rep(1:n(), each = 2, length.out = n())) %>%
group_by(group) %>%
summarise(
timestamp = first(timestamp),
value = case_when(
sum(value) == 1 ~ "0",
sum(value) == 2 ~ "inactive",
TRUE ~ "active"
)
)
# A tibble: 2 x 3
group timestamp value
<int> <chr> <chr>
1 1 11:00.20 0
2 2 11:01.20 inactive
Data -
df <- read.table(text = "11:00.20 1
11:00.50 0
11:01.20 1
11:01.50 1", header = F, stringsAsFactors = F, col.names = c("timestamp", "value"))
Your data:
df <- tibble(
time = c("11:00.20", "11:00.50",
"11:01.20", "11:01.50"),
active = c(1, 0, 1, 1))
I have tried lubridate:
library(lubridate)
library(tidyverse)
df %>%
mutate(
date = hms(time),
) %>%
group_by(hour= hour(date), minute=minute(date)) %>%
summarize(
active = min(active)
)
and got the following tibble:
# A tibble: 2 x 3
# Groups: hour [1]
hour minute active
<dbl> <dbl> <dbl>
1 11 0 0
2 11 1 1
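The two answers can be combined in base R. This is a minimal sketch that assumes the rows are already sorted and come in complete pairs; it groups consecutive rows by position and takes the minimum, so any 0 in a pair makes the minute active. The names pair, per_minute and status are hypothetical.

```r
df <- data.frame(time   = c("11:00.20", "11:00.50", "11:01.20", "11:01.50"),
                 active = c(1, 0, 1, 1))

# Rows 1-2 form pair 1, rows 3-4 form pair 2, and so on.
pair <- (seq_len(nrow(df)) + 1) %/% 2

per_minute <- data.frame(
  time   = df$time[!duplicated(pair)],               # first timestamp of each pair
  active = as.vector(tapply(df$active, pair, min))   # 0 if either value is 0
)

# Readable label, following the question's coding (0 = active, 1 = inactive).
per_minute$status <- ifelse(per_minute$active == 0, "active", "inactive")
per_minute
```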

Improving a dplyr solution -- Create a variable by conditional ordering (position) based on other information

I'm working on a dataset where every participant (ID) was evaluated 1, 2 or 3 times. It's a longitudinal study. Unfortunately, when the first analyst coded the dataset, she/he did not record which evaluation each row belongs to.
Because all participants have age information (in months), it's easy to identify which evaluation was the first, which was the second, and so on: the participant was younger at the first evaluation than at the second, and so on.
I used tidyverse tools to deal with that and everything is working. However, I imagine there are other, much more elegant solutions, and I came to this forum to ask for them. Could someone give me thoughts on how to make this code shorter and clearer?
This is a fake data to reproduce the code:
ds <- data.frame(id = seq(1:6),
months = round(rnorm(18, mean=12, sd=2),0),
x1 = sample(0:2),
x2 = sample(0:2),
x3 = sample(0:2),
x4 = sample(0:2))
#add how many times each child was assessed
ds <- ds %>% group_by(id) %>% mutate(how_many = n())
#Add position
ds %>% group_by(id) %>%
mutate(first = min(months),
max = max(months),
med = median(months)) -> ds
#add labels to the first and third evaluations (the second stays NA for now)
ds %>%
mutate(group = case_when((how_many == 3) & (months %in% first) ~ "First evaluation",
(how_many == 3) & (months %in% max) ~ "Third evaluation",
TRUE ~ NA_character_)) -> ds
#add label to the second evaluation for all children evaluated three times
ds %>% mutate(group = if_else(is.na(group), "Second Evaluation", group)) -> ds
This is my original code:
temp <- dataset %>% select(idind, arm, infant_sex,infant_age_months)
#add how many times each child was assessed
temp <- temp %>% group_by(idind) %>% mutate(how_many = n())
#Add position
temp %>% group_by(idind) %>%
mutate(first = min(infant_age_months),
max = max(infant_age_months),
med = median(infant_age_months)) -> temp
#add label to the first evaluation
temp %>%
mutate(group = case_when(how_many == 1 ~ "First evaluation")) -> temp
#add label to the second evaluation (and keep all previous results)
temp %>%
mutate(group = case_when((how_many == 2) & (infant_age_months %in% first) ~ "First evaluation",
(how_many == 2) & (infant_age_months %in% max) ~ "Second evaluation",
TRUE ~ group)) -> temp
#add label to the third evaluation (the second will be missing)
temp %>%
mutate(group = case_when((how_many == 3) & (infant_age_months %in% first) ~ "First evaluation",
(how_many == 3) & (infant_age_months %in% max) ~ "Third evaluation",
TRUE ~ group)) -> temp
#add label to the second evaluation for all children evaluated three times
temp %>% mutate_at(vars(group), funs(if_else(is.na(.),"Second Evaluation",.))) -> temp
Please keep in mind that I searched before asking, and I imagine other people may run into the same question when programming.
Thanks much
There you go. I used rank() to give the order of the treatments.
ds <- data.frame(id = seq(1:6),
months = round(rnorm(18, mean=12, sd=2),0),
x1 = sample(0:2),
x2 = sample(0:2),
x3 = sample(0:2),
x4 = sample(0:2))
ds2 = ds %>% group_by(id) %>% mutate(rank = rank(months,ties.method="first"))
labels = c("First", "Second","Third")
ds2$labels = labels[ds2$rank]
Or just arrange by age and use 1:n() instead of n(), which creates a sequence:
ds <- ds %>% group_by(id) %>% arrange(months) %>% mutate(how_many = 1:n())
ds %>% arrange(id, months)
# A tibble: 18 x 7
# Groups: id [6]
id months x1 x2 x3 x4 how_many
<int> <dbl> <int> <int> <int> <int> <int>
1 1 10 1 2 0 1 1
2 1 11 1 2 0 1 2
3 1 12 1 2 0 1 3
4 2 11 0 1 2 2 1
5 2 14 0 1 2 2 2
6 2 14 0 1 2 2 3
You can then use factor to attach a label, if you wish.
ds$label <- factor(ds$how_many, levels = 1:3, labels = c("First", "Second","Third"))
ds %>% arrange(id, months)
# A tibble: 18 x 8
# Groups: id [6]
id months x1 x2 x3 x4 how_many label
<int> <dbl> <int> <int> <int> <int> <int> <fct>
1 1 10 1 2 0 1 1 First
2 1 11 1 2 0 1 2 Second
3 1 12 1 2 0 1 3 Third
4 2 11 0 1 2 2 1 First
5 2 14 0 1 2 2 2 Second
6 2 14 0 1 2 2 3 Third
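Both ideas can be folded into a single mutate(). This is a minimal sketch with hypothetical deterministic data in which participants have 3, 2 and 1 evaluations; row_number(months) ranks by age within each id and breaks ties by order of appearance, which is assumed to be acceptable here.

```r
library(dplyr)

# Hypothetical data: id 1 has three evaluations, id 2 has two, id 3 has one.
ds <- data.frame(id     = c(1, 1, 1, 2, 2, 3),
                 months = c(12, 10, 11, 14, 11, 9))

labelled <- ds %>%
  group_by(id) %>%
  mutate(evaluation = factor(row_number(months),   # 1 = youngest age within id
                             levels = 1:3,
                             labels = c("First", "Second", "Third"))) %>%
  ungroup()

labelled
```

Because the label depends only on the within-participant rank, the same code covers participants with one, two or three evaluations without separate case_when() branches.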

How can I create a column that cumulatively adds the sum of two previous rows based on conditions?

I tried asking this question before, but it was poorly stated. This is a new attempt because I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser that shows how many points they have scored so far (as both winners and losers).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
test_data[i,]$winner_points_sum_loop <-
(
test_data %>%
dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
+
test_data %>%
dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
)
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions on how to tackle this problem? The queries take quite some time as the number of rows grows. I've tried working with the ave() function: I can use it to sum a player's points as a winner, but I can't figure out how to add their points as a loser.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
unite(winner, winner, winner_points) %>% # unite winner columns
unite(loser, loser, loser_points) %>% # unite loser columns
gather(type, pl_pts, winner, loser, -date) %>% # reshape
separate(pl_pts, c("player","points"), convert = T) %>% # separate columns
arrange(date) %>% # order dates (in case it's not)
group_by(player) %>% # for each player
mutate(sum_points = cumsum(points) - points) %>% # get points up to that date
ungroup() %>% # forget the grouping
unite(pl_pts_sumpts, player, points, sum_points) %>% # unite columns
spread(type, pl_pts_sumpts) %>% # reshape
separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum) # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
I finally understood what you want. And I took an approach of getting cumulative points of each player at each point in time and then joining it to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
gather(end_game_status, player_id, winner, loser) %>%
gather(which_point, how_many_points, winner_points, loser_points) %>%
filter(
(end_game_status == "winner" & which_point == "winner_points") |
(end_game_status == "loser" & which_point == "loser_points")) %>%
arrange(date) %>%
group_by(player_id) %>%
mutate(cumulative_points = cumsum(how_many_points)) %>%
mutate(cumulative_points_sofar = lag(cumulative_points, default = 0)) %>%
ungroup() %>%
select(player_id, date, cumulative_points_sofar)
output <- test_data %>%
left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
rename(winner_points_sum = cumulative_points_sofar) %>%
left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
rename(loser_points_sum = cumulative_points_sofar)
output
The difference to the previous question of the OP is that the OP is now asking for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So, my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format, whereby two value variables are reshaped simultaneously, computes the cumulative sums for each player id, and finally reshapes from long back to wide format again. In order to sum only points scored before the actual date, the rows are lagged by one.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4
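The ave() idea mentioned in the question can also be made to work in base R. This is a minimal sketch rather than a tuned solution: it stacks the winner and loser rows into one long table, computes each player's running total minus the current game's points, and writes the values back by row index. It assumes each player appears at most once per date; the names long, row, role and points_before are hypothetical.

```r
winner <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
loser  <- c(3, 1, 1, 2, 1, 1, 3, 1, 2)
date   <- as.Date("2017-10-01") + 0:8
winner_points <- c(2, 1, 2, 1, 2, 1, 2, 1, 2)
loser_points  <- c(1, 0, 1, 0, 1, 0, 1, 0, 1)
test_data <- data.frame(winner, loser, date, winner_points, loser_points)

# Stack winners and losers: one row per player per game.
n <- nrow(test_data)
long <- data.frame(row    = rep(seq_len(n), 2),
                   role   = rep(c("winner", "loser"), each = n),
                   player = c(test_data$winner, test_data$loser),
                   date   = rep(test_data$date, 2),
                   points = c(test_data$winner_points, test_data$loser_points))
long <- long[order(long$date), ]

# Points scored before the current game: running total minus this game's points.
long$points_before <- ave(long$points, long$player,
                          FUN = function(p) cumsum(p) - p)

# Write the values back to the original rows.
w <- long[long$role == "winner", ]
l <- long[long$role == "loser", ]
test_data$winner_points_sum <- NA_real_
test_data$loser_points_sum  <- NA_real_
test_data$winner_points_sum[w$row] <- w$points_before
test_data$loser_points_sum[l$row]  <- l$points_before
test_data
```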
