I need to duplicate rows with non-consecutive dates so that every date in the range is filled in a dataframe.
Suppose this df:
df <- data.frame(date = c("2022-07-05", "2022-07-07", "2022-07-11", "2022-07-15", "2022-07-18"), letter = c("a", "b", "a", "b", "c"))
The desired output is this df_new:
df_new <- data.frame(date = c("2022-07-05", "2022-07-06",
"2022-07-07", "2022-07-08", "2022-07-09", "2022-07-10",
"2022-07-11", "2022-07-12", "2022-07-13", "2022-07-14",
"2022-07-15"),
letter = c("a", "a",
"b", "b", "b", "b",
"a", "a", "a", "a",
"c"))
Could you please help?
We could use complete from tidyr to expand the data over the min/max date range in '1 day' increments, and then fill the NA elements in 'letter' with the previous non-NA element:
library(dplyr)
library(tidyr)

df %>%
  mutate(date = as.Date(date)) %>%
  complete(date = seq(min(date), max(date), by = '1 day')) %>%
  fill(letter)
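As a quick sanity check (a sketch, not part of the original answer): complete() generates one row per day between the earliest and latest date, and fill() carries the last observed letter forward by default (.direction = "down").

result <- df %>%
  mutate(date = as.Date(date)) %>%
  complete(date = seq(min(date), max(date), by = '1 day')) %>%
  fill(letter)

nrow(result)  # one row per day between min(date) and max(date)
head(result)  # 2022-07-05 "a", 2022-07-06 "a", 2022-07-07 "b", ...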
I have a Spark dataframe that I'm manipulating using sparklyr. It looks like the following:
input_data <- data.frame(id = c(10,10,10,20,20,30,30,40,40,40,50,60,70, 80,80,80,100,100,110,110,120,120,120,130,140,150,160,170),
date = c("2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-05"),
group = c("A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A","B","A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A", "A", "B","A"),
event = c(1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,1,0))
I'd like to aggregate the data so that I have a count of the number of "events" (where event == 1) and "non-events" (where event == 0) for each combination of groups, so that the final output looks like the following:
data.frame(group_a = c(1,0,0,1,0,1),
           group_b = c(0,1,0,1,1,0),
           group_c = c(0,0,1,0,1,1),
           event_occured = c(3,1,2,0,2,2),
           event_not_occured = c(4,2,2,0,2,2))
So, for example, there were no combinations where A and B were groups for the same ID, so that combination gets a 0 for both event and non-event. There were 4 IDs in which group A was involved, of which 3 resulted in an event and 1 in a non-event, and so on.
What approach using sparklyr (or dplyr or pyspark) would allow for the aggregation described above? I tried the following, but I'm getting exactly the same number for event as for event_not_occured, so I must be doing something wrong, though I can't pinpoint it:
combo_path_sdf <- input_data %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate(order_seq = ifelse(event > 0, 1, NA)) %>%
  mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
  mutate(order_seq = ifelse((row_number() == 1) & (event > 0), -1,
                            ifelse(row_number() == 1, 0, order_seq))) %>%
  ungroup()
combo_path_sdf %>%
  group_by(id, order_seq) %>%
  summarize(group_a = max(ifelse(group_a == "A", 1, 0)),
            group_b = max(ifelse(group_b == "B", 1, 0)),
            group_c = max(ifelse(group_c == "C", 1, 0)),
            events = sum(event)) %>%
  group_by(order_seq, group_a, group_b, group_c) %>%
  summarize(event = sum(events),
            total_sequences = n()) %>%
  mutate(event_not_occured = total_sequences - event)
Final output in the following format would be ok too:
data.frame(group_a = c("A", "B", "C", "A,B", "B,C", "A,C"),
event_occured = c(3,1,2,1,2,2),
event_not_occured = c(4,2,2,1,2,2))
(Note: the first desired-output data frame above is incorrect for the A,B combination; its counts should be 1,1, not 0,0, as shown in this format.)
The following matches your requested output format and processes the data in the way I understand you want, but (as per the comment by @Martin Gal) it does not match the example result you provided.
input_data %>%
  group_by(id) %>%
  summarise(group_a = max(ifelse(group == 'A', 1, 0)),
            group_b = max(ifelse(group == 'B', 1, 0)),
            group_c = max(ifelse(group == 'C', 1, 0)),
            event_occured = sum(ifelse(event == 1, 1, 0)),
            event_not_occured = sum(ifelse(event == 0, 1, 0)),
            .groups = "drop") %>%
  group_by(group_a, group_b, group_c) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")
The idea is a two-step summary. The first summarise collapses each id to a single row, creating an indicator for each group and counting that id's events/non-events. The second summarise combines all ids that share the same group combination.
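For intuition, here is a sketch of what the first summarise yields for two of the ids (values worked out by hand from input_data above; worth verifying on your end):

input_data %>%
  filter(id %in% c(10, 20)) %>%
  group_by(id) %>%
  summarise(group_a = max(ifelse(group == 'A', 1, 0)),
            group_b = max(ifelse(group == 'B', 1, 0)),
            group_c = max(ifelse(group == 'C', 1, 0)),
            event_occured = sum(ifelse(event == 1, 1, 0)),
            event_not_occured = sum(ifelse(event == 0, 1, 0)),
            .groups = "drop")
# id 10 (groups A, B, C; events 1,1,1) -> 1, 1, 1, 3, 0
# id 20 (groups B, C; events 0,1)      -> 0, 1, 1, 1, 1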
Regarding the code you are using that produces the same number of events and non-events: take a look at hts_combined. It is not defined in the code you have shared, so your script might be reading a variable from elsewhere in your environment.
Let's consider this dummy dataset:
v1<- c("A","B", "C", "D", "E", "F")
v2<- c("Z","Y", "X", "X", "V", "U")
Count<- c(2, 5, 10, 5, 1)
df<- cbind.data.frame(v1, v2, Count)
I want to use fct_lump_min() to lump all v1 factors that have a count of 2 or less into another factor level named "Unique". If I were to completely disregard the v2 factor column, I have functional code like this:
df <- df %>%
  mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
  count(CombinedDGSequence, wt = Count, name = "Count")
However, doing so removes the corresponding v2 factor column completely. Is there any way I can maintain each v1 factor level's corresponding v2 value in the resulting dataframe after using fct_lump_min?
Thanks guys!
We may need add_count, which creates a new column instead of summarizing:
library(dplyr)
library(forcats)

df %>%
  mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count,
                                           other_level = "Unique")) %>%
  add_count(CombinedDGSequence, wt = Count, name = "Count")
You may try this to combine all the v2 values into one string:
library(dplyr)
library(forcats)

df %>%
  mutate(v1 = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
  group_by(v1) %>%
  summarise(v2 = toString(v2),
            Count = sum(Count))
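With the example data above (and the assumed sixth Count value), only level E falls below the weighted threshold of 2, so it is the only level lumped into "Unique", and its v2 value survives in the combined string column. Roughly:

# v1      v2     Count
# A       Z          2
# B       Y          5
# C       X         10
# D       X          5
# F       U          3
# Unique  V          1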
I have a subset of data that holds a total count for each observation from a bigger dataset. I want to drop duplicates, keeping the row with the higher count when the name is the same. How would I go about that? So for instance:
name = c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code = c(1,1,2,3,4,1,1,2,2,3)
n = c(1,10,2,3,5,4,8,100,90,40)
data = data.frame(name,code,n)
The end product would be left with these rows:
name = c("a", "b", "c", "d", "e")
code = c(1,4,1,1,2)
n = c(10,5,4,8,100)
data2 = data.frame(name,code,n)
If you can use dplyr, this should do the trick:
library(dplyr)

data %>%
  group_by(name) %>%
  filter(n == max(n)) %>%
  ungroup()
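A note on ties (an alternative sketch, not from the original answer): filter(n == max(n)) keeps every row tied for the maximum. If you want exactly one row per name in that case, dplyr's slice_max() with with_ties = FALSE is an option:

data %>%
  group_by(name) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%  # keep one top row per name
  ungroup()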
I have the following problem: I have two dataframes. df1 contains, among other variables (not shown in the code below), a date variable. df2 has an id (referring to the id in df1), a factor variable (type) and another date.
df1 <- data.frame(id = 1:5,
                  referenceDate = c("2018-01-20", "2018-02-03", "2018-05-20", "2018-08-01", "2018-07-31"))
df2 <- data.frame(id = c(1,1,1,2,2,4,4,5,5),
                  type = c("A", "A", "B", "A", "A", "B", "A", "B", "B"),
                  dates = c("2018-01-10", "2018-01-23", "2018-01-24", "2018-05-21", "2018-05-18", "2018-06-01", "2018-09-01", "2018-07-10", "2018-07-20"))
My goal is to create a new column in df1 indicating the number of rows in df2 where (e.g.) df2$type == 'A' and df2$dates occurs before df1$referenceDate.
In R I have the following solution, which gives me the number of rows where df2$type == 'A'. But how can I additionally take the date into account? I had the idea of first joining the two tables (to get the referenceDate variable from df1 into df2), then doing the counting, and then joining the two tables again in the other direction (to get the count variable back into df1). But that does not sound very elegant to me.
library(tidyverse)

reduced <- df2 %>%
  filter(type == 'A') %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  filter(!duplicated(id))

df1 %>% left_join(reduced[, c("id", "count")])
I think this might be what you want:
df1 <- tibble(id = 1:5,
              referenceDate = as.Date(c("2018-01-20", "2018-02-03", "2018-05-20", "2018-08-01", "2018-07-31")))

df2 <- tibble(id = c(1,1,1,2,2,4,4,5,5),
              type = c("A", "A", "B", "A", "A", "B", "A", "B", "B"),
              dates = as.Date(c("2018-01-10", "2018-01-23", "2018-01-24", "2018-05-21", "2018-05-18", "2018-06-01", "2018-09-01", "2018-07-10", "2018-07-20")))
df1 %>%
  left_join(
    df2 %>%
      left_join(df1, by = 'id') %>%
      filter(dates < referenceDate) %>%
      group_by(id) %>%
      count(type) %>%
      ungroup(),
    by = 'id'
  )
The key is to join df1 to df2 first and then filter on the reference date; that lets you use filter to keep exactly the rows you want. Then use count, and finally join back to df1.
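If you only care about type 'A' (as in the original question) and want ids with no qualifying rows to show 0 instead of NA, one possible refinement is a sketch like the following, using dplyr's coalesce():

df1 %>%
  left_join(
    df2 %>%
      left_join(df1, by = 'id') %>%
      filter(type == 'A', dates < referenceDate) %>%  # restrict to type 'A' before counting
      count(id, name = 'count'),
    by = 'id'
  ) %>%
  mutate(count = coalesce(count, 0L))  # ids with no matches get 0 instead of NA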