tidyr::fill() with sequential integers rather than a repeated value

After grouping by id I wish to replace the NAs in dist_from_top with sequential values such that dist_from_top becomes c(5,4,3,2,1,5,4,3,2). I am using the one dist_from_top value within each id grouping as a seed of sorts to fill in the values of dist_from_top that are above and below.
tidyr::fill() can fill in the same value throughout the grouping, but I can't think of a way to make it increase and decrease by 1 as it fills. Any help is greatly appreciated.
library(dplyr)
library(tidyr)
df <-
  tribble(
    ~id, ~mgr, ~dist_from_top,
    "A", "B", NA,
    "A", "C", NA,
    "A", "D", 3,
    "A", "E", NA,
    "A", "F", NA,
    "B", "C", NA,
    "B", "D", 4,
    "B", "E", NA,
    "B", "F", NA
  )
An "almost there" solution using fill()
df %>%
group_by(id) %>%
fill(dist_from_top, .direction = "up") %>%
fill(dist_from_top, .direction = "down")

Create a column that counts downwards in each group, from any starting point:
... %>% mutate(rn = -row_number())
Add the offset that is defined by the difference between dist_from_top and rn for the one row where dist_from_top is not NA:
... %>% mutate(dist_from_top = rn + max(dist_from_top - rn, na.rm = TRUE))
This uses max() merely to pick one value, assuming there is only one value that isn't NA.
Both mutate() operations operate on groups:
df %>%
  group_by(id) %>%
  mutate(rn = -row_number()) %>%
  mutate(dist_from_top = rn + max(dist_from_top - rn, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-rn)
If there is an all-NA group, you'll see a warning.
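As a quick sanity check, here is the arithmetic for id "A", worked by hand from the sample data:

# id "A": rn = c(-1, -2, -3, -4, -5); the only non-NA row has dist_from_top = 3 and rn = -3,
# so the offset max(dist_from_top - rn, na.rm = TRUE) is 3 - (-3) = 6,
# and dist_from_top becomes rn + 6 = c(5, 4, 3, 2, 1), as requested.
# Id "B" works the same way: offset 4 - (-2) = 6, giving c(5, 4, 3, 2).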

Related

How to duplicate rows with non-consecutive dates in R

I need to duplicate rows with non-consecutive dates so that all the dates in between are filled in the dataframe.
Suppose this df:
df <- data.frame(date = c("2022-07-05", "2022-07-07", "2022-07-11", "2022-07-15", "2022-07-18"), letter = c("a", "b", "a", "b", "c"))
The desired output is this df_new:
df_new <- data.frame(date = c("2022-07-05", "2022-07-06",
                              "2022-07-07", "2022-07-08", "2022-07-09", "2022-07-10",
                              "2022-07-11", "2022-07-12", "2022-07-13", "2022-07-14",
                              "2022-07-15"),
                     letter = c("a", "a",
                                "b", "b", "b", "b",
                                "a", "a", "a", "a",
                                "c"))
Could you please help?
We could use complete from tidyr to expand the data based on the min/max date incremented by '1 day', and then fill the NA elements in 'letter' with the previous non-NA element:
library(dplyr)
library(tidyr)
df %>%
mutate(date = as.Date(date)) %>%
complete(date = seq(min(date), max(date), by = '1 day')) %>%
fill(letter)
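One caveat: complete() expands all the way to max(date), which is 2022-07-18 here, so the result also contains rows for 2022-07-16 through 2022-07-18 filled with "c". If the output really should stop at 2022-07-15 as in df_new, an extra step along these lines (the cut-off date is just illustrative) would trim it:

... %>%
  filter(date <= as.Date("2022-07-15"))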

Getting counts of membership in combinations of groups using sparklyr or dplyr

I have a spark dataframe I'm manipulating using sparklyr that looks like the following:
input_data <- data.frame(id = c(10,10,10,20,20,30,30,40,40,40,50,60,70, 80,80,80,100,100,110,110,120,120,120,130,140,150,160,170),
date = c("2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-05"),
group = c("A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A","B","A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A", "A", "B","A"),
event = c(1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,1,0))
I'd like to aggregate the data such that I have a count of the number of "events" (where event == 1 ) and "non_events" (where event == 0) for each combination such that the final output looks like the following:
data.frame(group_a = c(1,0,0,1,0,1),
           group_b = c(0,1,0,1,1,0),
           group_c = c(0,0,1,0,1,1),
           event_occured = c(3,1,2,0,2,2),
           event_not_occured = c(4,2,2,0,2,2))
So, for example, there were no IDs whose groups were A and B only, so that combination gets a 0 for both event and non_event. There were 4 IDs in which group A was involved, of which 3 resulted in an event and 1 resulted in a non_event, and so on.
What approach using sparklyr (or dplyr or pyspark) would allow for the aggregation described above? I tried the following, but I'm getting the exact same number for event as for event_not_occurred, so I must be doing something wrong but can't pinpoint it:
combo_path_sdf <- input_data %>%
group_by(id) %>%
arrange(date) %>%
mutate(order_seq = ifelse(event > 0, 1, NA)) %>%
mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
mutate(order_seq = ifelse((row_number() == 1) & (event > 0), -1, ifelse(row_number() == 1, 0, order_seq))) %>%
ungroup()
combo_path_sdf %>%
group_by(id, order_seq) %>%
summarize(group_a = max(ifelse(group_a == "A", 1, 0)),
group_b = max(ifelse(group_b == "B", 1, 0)),
group_c = max(ifelse(group_c == "C", 1, 0)),
events = sum(event)) %>%
group_by(order_seq, group_a, group_b, group_c) %>%
summarize(event = sum(events),
total_sequences = n()) %>%
mutate(event_not_occured = total_sequences - event)
Final output in the following format would be ok too:
data.frame(group_a = c("A", "B", "C", "A,B", "B,C", "A,C"),
event_occured = c(3,1,2,1,2,2),
event_not_occured = c(4,2,2,1,2,2))
(The first expected output above is incorrect for the A,B combination: it should be 1,1, not 0,0.)
The following matches your requested output format and processes the data in the way I understand you want, but (as per the comment by @Martin Gal) does not match the example result you provided.
input_data %>%
  group_by(id) %>%
  summarise(group_a = max(ifelse(group == 'A', 1, 0)),
            group_b = max(ifelse(group == 'B', 1, 0)),
            group_c = max(ifelse(group == 'C', 1, 0)),
            event_occured = sum(ifelse(event == 1, 1, 0)),
            event_not_occured = sum(ifelse(event == 0, 1, 0)),
            .groups = "drop") %>%
  group_by(group_a, group_b, group_c) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")
The idea is a two-step summary: the first summarise creates the group indicators for each id and counts its events and non-events; the second summarise combines all ids that share the same group combination.
Regarding the code you are using that produces the same number of events as non-events: take a look at hts_combined. It is not defined in the code you have shared, so your script might be reading a variable from elsewhere.
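If the second output format from the question (an "A,B"-style label per group combination) is acceptable, a plain-dplyr sketch of the same two-step idea could look like the following. It has not been checked against sparklyr's SQL translation, so paste()/sort() may need a Spark-friendly substitute.

library(dplyr)

input_data %>%
  group_by(id) %>%
  summarise(groups = paste(sort(unique(group)), collapse = ","),  # e.g. "A,B,C"
            event_occured = sum(event == 1),
            event_not_occured = sum(event == 0),
            .groups = "drop") %>%
  group_by(groups) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")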

How do you use forcats' fct_lump_min() function on a factor while keeping another identifying factor?

Let's consider this dummy dataset:
v1<- c("A","B", "C", "D", "E", "F")
v2<- c("Z","Y", "X", "X", "V", "U")
Count<- c(2, 5, 10, 5, 1)
df<- cbind.data.frame(v1, v2, Count)
I want to use fct_lump_min() to lump all v1 factors that have a count of 2 or less into another factor level named "Unique". If I were to completely disregard the v2 factor column, I have working code like this:
df<-df %>%
mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
count(CombinedDGSequence, wt = Count, name = "Count")
However, doing so removes the corresponding v2 factor column completely. Is there any way I can maintain each v1 factor level's corresponding v2 value in the resulting dataframe after using fct_lump_min?
Thanks guys!
We may need add_count, which creates a new column instead of summarizing:
library(dplyr)
library(forcats)
df %>%
  mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count,
                                           other_level = "Unique")) %>%
  add_count(CombinedDGSequence, wt = Count, name = "Count")
You may try this to combine all the v2 values in one string.
library(dplyr)
library(forcats)
df %>%
  mutate(v1 = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
  group_by(v1) %>%
  summarise(v2 = toString(v2),
            Count = sum(Count))

How to drop observations based on conditions

I have a subset of data that has a total count for each observation from a bigger dataset. If I want to keep, for each name, only the row with the highest count and drop the codes that appear less often, how would I go about that? For instance:
name = c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code = c(1,1,2,3,4,1,1,2,2,3)
n = c(1,10,2,3,5,4,8,100,90,40)
data = data.frame(name,code,n)
The end product would look like this:
name = c("a", "b", "c", "d", "e")
code = c(1,4,1,1,2)
n = c(10,5,4,8,100)
data2 = data.frame(name,code,n)
If you can use dplyr, this should do the trick:
library(dplyr)
data %>%
group_by(name) %>%
filter(n == max(n)) %>%
ungroup()
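If ties are possible (two codes sharing the same maximum n for a name) and you want exactly one row per name, slice_max() from dplyr (1.0.0 or later) is an alternative:

data %>%
  group_by(name) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  ungroup()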

Count number of entries in another dataframe given a certain condition

I have the following problem: I have two dataframes. df1 contains, among other variables (not shown in the code below), a date variable. In df2 I have an id (referring to the id in df1), a factor variable (type), and another date.
df1 <- data.frame(id=1:5, referenceDate=c("2018-01-20","2018-02-03","2018-05-20", "2018-08-01", "2018-07-31"))
df2 <- data.frame(id=c(1,1,1,2,2,4,4,5,5), type=c("A", "A", "B", "A", "A", "B", "A", "B", "B"), dates=c("2018-01-10", "2018-01-23", "2018-01-24", "2018-05-21", "2018-05-18", "2018-06-01", "2018-09-01", "2018-07-10", "2018-07-20"))
My goal is to create a new column in df1 indicating the number of rows in df2 where (e.g.) df2$type=='A' and df2$dates occurs before df1$referenceDate.
In R I have the following solution that gives me the number of rows where df2$type=='A'. But how can I additionally consider the date? I had the idea of first joining the two tables to get the referenceDate variable from df1 into df2, then doing the counting, and then joining the two tables again in the other direction (to get the count variable back into df1). But this does not sound very elegant to me.
library(tidyverse)
reduced <- df2 %>% filter(type=='A') %>% group_by(id) %>% mutate(count=n()) %>% filter(!duplicated(id))
df1 %>% left_join(reduced[, c("id", "count")])
I think this might be what you want:
df1 <- tibble(id = 1:5,
              referenceDate = as.Date(c("2018-01-20","2018-02-03","2018-05-20", "2018-08-01", "2018-07-31")))

df2 <- tibble(id = c(1,1,1,2,2,4,4,5,5),
              type = c("A", "A", "B", "A", "A", "B", "A", "B", "B"),
              dates = as.Date(c("2018-01-10", "2018-01-23", "2018-01-24", "2018-05-21", "2018-05-18", "2018-06-01", "2018-09-01", "2018-07-10", "2018-07-20")))

df1 %>%
  left_join(
    df2 %>%
      left_join(df1, by = 'id') %>%
      filter(dates < referenceDate) %>%
      group_by(id) %>%
      count(type) %>%
      ungroup(),
    by = 'id'
  )
The key is to join df1 to df2 first and then filter based on referenceDate. That allows you to use filter to keep only the rows you want, then count them, and then join the result back to df1.
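Since the question asks specifically about type == 'A', a sketch restricted to that type (the count_A column name is just illustrative) would filter before counting:

df1 %>%
  left_join(
    df2 %>%
      left_join(df1, by = 'id') %>%
      filter(type == 'A', dates < referenceDate) %>%
      count(id, name = 'count_A'),
    by = 'id'
  )

IDs with no qualifying rows come back with NA in count_A; coalesce(count_A, 0L) or tidyr::replace_na() can turn those into zeros.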
