How to obtain maximum counts by group

Using tidyverse, I would like to obtain the maximum count of events (e.g., dates) by group. Here is a minimal reproducible example:
Data frame:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5),
                 event = c(12, 6, 1, 7, 13, 9, 4, 8, 2, 5, 11, 3, 10, 14))
The following code produces the desired output, but seems overly complicated:
df %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  select(count) %>%
  slice_max(count, n = 1, with_ties = FALSE)
Is there a simpler/better way? The following works, but top_n() has been superseded by slice_max(), and the latter is now recommended instead.
df %>%
  count(id) %>%
  distinct(n) %>% # to remove tied values
  top_n(1)
Any suggestions?

If you want something with fewer steps, you could try base R table() to get the counts as a vector and then take max(). Even if several groups are tied for the largest count, max() returns that value just once.
max(table(df$id))
[1] 4
Or, if you want it in tidyverse style:
df$id %>%
  table() %>%
  max()

If you want the maximum event value by group (where id is the grouping variable), then:
df %>%
  group_by(id) %>%
  summarise(max_n_events = max(event))
If instead you do not care about the specific values in the event column and only look at the id column, the solution proposed by @Josh above can also be written as follows:
df %>% group_by(id) %>% count() %>% ungroup() %>% summarise(max(n))
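For completeness, a sketch (mine, not from the answers above) that uses the recommended slice_max() while staying close to the superseded top_n() version:
df %>%
  count(id) %>%
  slice_max(n, n = 1, with_ties = FALSE)  # with_ties = FALSE plays the role of distinct(n)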


Why does map_df produce many missing values? How can I concatenate across rows to remove NAs?

I'm trying to count how many students received 1s, 2s, 3s, 4s, and 5s across their subjects, and I want a column for each subject and the possible grade (math_1, science_2, etc.).
I originally wrote a for loop, but my actual dataset has so many cases that I need to use map. I can get it to work, but it produces many NAs and only one chunk per column has actual data. I'm curious to know either:
Why is map_df() doing this and how can I avoid it? OR
How can I tighten this up so the information ends up on one row per row of the original dataset (18 rows)? In other words, I'd concatenate up and down each column so all the NAs are filled in (unless there truly was missing data).
Here's my code:
library(tidyverse)

# Set up: generate sample dataset and get all combinations of grades and subjects
student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
                         subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
                         grade = as.character(c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4)))

all_subject_combos <- c("english", "history", "math", "biology")
all_grades <- c("1", "2", "3", "4", "5")

subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)

all_combos <- subjects_and_letter_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names = str_replace_all(names, "\\|", "_")) %>%
  pull(names)
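# (Editor's note, for reference: unite() joins with "_" by default, so all_combos
#  is a character vector of 20 names like "english_1", "history_1", ..., "biology_5".)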
# iterate over each combination using map_df()
student_map <- map_df(all_combos,
                      ~ student_grades %>%
                        mutate("{.x}" := paste(i)) %>%
                        group_by(student_id) %>%
                        mutate("{.x}" := sum(case_when(str_detect(.x, subject) &
                                                         str_detect(.x, grade) ~ 1,
                                                       TRUE ~ 0), na.rm = T)))
EDIT
For the record, my almost identical for loop does not pad the result with missing values. I assume it must have something to do with how map_df builds the dataset, but I don't know how I can override what it is doing under the hood.
student_map <- student_grades
for (i in all_combos) {
  student_map <- student_map %>%
    mutate("{i}" := paste(i)) %>%
    group_by(student_id) %>%
    mutate("{i}" := sum(case_when(str_detect(i, subject) &
                                    str_detect(i, grade) ~ 1,
                                  TRUE ~ 0), na.rm = T))
}
There is no i inside the map() call; the default lambda argument is .x. Also, it is better to use transmute() instead of mutate(), since we only need to return the columns added in each iteration, and then bind them to the original data at the end. This also explains the NA padding: map_df() row-binds one data frame per combo, each with a different new column, so bind_rows() fills the non-matching columns with NA; map_dfc() column-binds instead.
library(dplyr)
library(purrr)
library(stringr)

student_map2 <- map_dfc(all_combos,
                        ~ student_grades %>%
                          transmute(subject, grade, student_id, "{.x}" := .x) %>%
                          group_by(student_id) %>%
                          transmute("{.x}" := sum(case_when(str_detect(.x, subject) &
                                                              str_detect(.x, grade) ~ 1,
                                                            TRUE ~ 0), na.rm = TRUE)) %>%
                          ungroup() %>%
                          select(-student_id)) %>%
  bind_cols(student_grades, .)
Checking against the OP's for loop output:
> all.equal(student_map, student_map2, check.attributes = FALSE)
[1] TRUE
Though I can't figure out why map_df() is performing in this undesirable way, I did find a solution, inspired heavily by the answer to this post.
solution <- student_map %>%
  group_by(student_id, subject, grade) %>%
  summarise_all(~ last(na.omit(.)))
solution
Basically, this code drops the NAs within each group and keeps a missing value only when a group contains nothing but missing values. Because those columns in my dataset will never have missing values, this solution works in my case.
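A quick illustration of that last(na.omit(.)) behaviour (my sketch, not from the original post):
library(dplyr)
last(na.omit(c(NA, 3, NA)))  # 3: the last non-NA value survives
last(na.omit(c(NA, NA)))     # NA: an all-NA group stays NA, since last() of an empty vector returns its NA default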

Is there a way to count the number of null/missing values for each month in a dataframe?

I am currently using station data for my research in R, and I need to count the number of missing/null values for each month. The data is currently in daily measurements, and the monthly total of missing values would let me trim certain months out if they are not useful.
CUM00078310_df %>%
  dplyr::mutate(
    Month = month(Date),
    Mis = rowSums(is.na(.[, grepl("C", colnames(CUM00078310_df))]))
  ) %>%
  group_by(Month) %>%
  summarize(Sum = sum(Mis), Percentage = mean(Mis))
Here is an example. I'm not sure whether you want the data summarized or kept within the data frame; if not summarized, omit the final two lines of code. With your data, add the month grouping variable to group_by(). If you need only the NA rows, filter them with filter(is.na(x)).
df <- data.frame(x = c(NA, 2, 5, 10, 15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))

df <- df %>%
  group_by(x) %>%
  mutate(valueCount = n()) %>%
  arrange(desc(valueCount)) %>%
  group_by(x, valueCount) %>%
  summarise()
Unsummarized example (recreating df first, since it was overwritten above):
df <- data.frame(x = c(NA, 2, 5, 10, 15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))
df <- df %>%
  group_by(x) %>%
  mutate(valueCount = n()) %>%
  arrange(desc(valueCount))
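A more direct sketch of the monthly NA counts the question asks about (my addition; assumes a Date column and, as in the OP's attempt, that the measurement columns are the ones whose names contain "C"):
library(dplyr)
library(lubridate)

CUM00078310_df %>%
  mutate(Month = month(Date)) %>%
  group_by(Month) %>%
  summarise(across(contains("C"), ~ sum(is.na(.x))))  # one NA count per column, per month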

combine lists nested in a tibble

I want to count the number of unique values in a table that contains several from...to pairs like below:
tmp <- tribble(
  ~group, ~from, ~to,
  1,  1, 10,
  1,  5,  8,
  1, 15, 20,
  2,  1, 10,
  2,  5, 10,
  2, 15, 18
)
I tried to nest all values in a list for each row (which works), but combining these nested lists into one vector and counting the unique values doesn't work as expected.
tmp %>%
  group_by(group) %>%
  rowwise() %>%
  mutate(nrs = list(c(from:to))) %>%
  summarise(n_uni = length(unique(unlist(list(nrs)))))
The desired output looks like this:
tibble(group = c(1, 2),
       n_uni = c(length(unique(unlist(list(tmp$nrs[tmp$group == 1])))),
                 length(unique(unlist(list(tmp$nrs[tmp$group == 2]))))))
# # A tibble: 2 × 2
#   group n_uni
#   <dbl> <int>
# 1     1    16
# 2     2    14
Any help would be much appreciated!
tmp %>%
  rowwise() %>%
  mutate(nrs = list(from:to)) %>%
  group_by(group) %>%
  summarise(n_uni = n_distinct(unlist(nrs)))
The issue with the OP's approach is that rowwise() is itself a form of grouping, and it drops the initial group_by() step. We therefore have to group_by() after creating nrs in the rowwise step.
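For comparison, a base R sketch of the same computation (my addition, not part of the original answer):
nrs <- Map(`:`, tmp$from, tmp$to)  # one from:to sequence per row
sapply(split(nrs, tmp$group), function(x) length(unique(unlist(x))))
#  1  2
# 16 14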

Sum of a list in R data frame

I have a column of type "list" in my data frame, and I want to create a column with the sum.
I guess there is no visual difference, but my column consists of list(1, 2, 3)s and not c(1, 2, 3)s:
tibble(
  MY_DATA = list(
    list(2, 7, 8),
    list(3, 10, 11),
    list(4, 2, 8)
  ),
  NOT_MY_DATA = list(
    c(2, 7, 8),
    c(3, 10, 11),
    c(4, 2, 8)
  )
)
Unfortunately when I try
mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
the result is that every cell in the new column contains the sum of the entire source column (so a value in the millions)
I tried reduce() and it did work, but it was slow, so I am looking for a better solution.
You could use purrr::map_dbl(), which returns a vector of type double:
library(tibble)
library(dplyr)
library(purrr)

df <- tibble(
  MY_LIST_COL_D = list(
    c(2, 7, 8),
    c(3, 10, 11),
    c(4, 2, 8)
  )
)
df %>%
  mutate(NEW_COL = map_dbl(MY_LIST_COL_D, sum), .keep = 'unused')
#   NEW_COL
#     <dbl>
# 1      17
# 2      24
# 3      14
Is this what you were looking for? If you don't want to remove the list column, just disregard the .keep argument.
Update
Since the underlying structure is lists, you can still apply the same logic; one way to solve the issue is to unlist() first:
df <- tibble(
  MY_LIST_COL_D = list(
    list(2, 7, 8),
    list(3, 10, 11),
    list(4, 2, 8)
  )
)
df %>%
  mutate(NEW_COL = map_dbl(MY_LIST_COL_D, ~ sum(unlist(.x))), .keep = 'unused')
#   NEW_COL
#     <dbl>
# 1      17
# 2      24
# 3      14
You can use rowwise() in dplyr:
library(dplyr)
df %>% rowwise() %>% mutate(NEW_COL = sum(MY_LIST_COL_D))
rowwise() will also make your original attempt work:
df %>% rowwise() %>% mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
You can also use sapply() in base R:
df$NEW_COL <- sapply(df$MY_LIST_COL_D, sum)
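One caveat (my note, not from the answers): with the list-of-lists column from the question, sum() needs an atomic vector, so unlist each element first:
df$NEW_COL <- sapply(df$MY_LIST_COL_D, function(x) sum(unlist(x)))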

How to use dplyr to make modifications in a dataframe equivalent to the use of a 'which'?

If I have a data frame, say
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))
then I can modify entries with the which function:
w <- which(df[,"y"] == 7)
df[w,c("y", "z")] <- data.frame(6, 9)
One way I see to do this with the package dplyr is the following:
df <- df %>%
  mutate(W = (y == 7),
         y = ifelse(W, 6, y),
         z = ifelse(W, 9, z)) %>%
  select(-W)
But I find it a bit inelegant, and I am not sure it would cover all the ways which() is used. Ideally I would imagine something like:
df <- df %>%
  keep(y == 7) %>%
  mutate(y = 6) %>%
  unkeep()
where keep would provisionally select rows where modifications are to be made, and unkeep would unselect them to recover the full data frame.
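There is no keep()/unkeep() pair in dplyr, but as a hedged sketch, base R's replace() keeps the which()-style condition inside a single mutate(). Note that z is updated before y here, because mutate() evaluates its arguments sequentially and updating y first would change the y == 7 condition:
df %>%
  mutate(z = replace(z, y == 7, 9),  # z first: y still holds its old values
         y = replace(y, y == 7, 6))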
