I would ultimately like to end up with df2 containing certain dates and the sum of the df1 values that fall within each corresponding date range.
df1 = data.frame("date"=c("10/01/2020","10/02/2020","10/03/2020","10/04/2020","10/05/2020",
"10/06/2020","10/07/2020","10/08/2020","10/09/2020","10/10/2020"),
"value"=c(1:10))
df1
> df1
date value
1 10/01/2020 1
2 10/02/2020 2
3 10/03/2020 3
4 10/04/2020 4
5 10/05/2020 5
6 10/06/2020 6
7 10/07/2020 7
8 10/08/2020 8
9 10/09/2020 9
10 10/10/2020 10
df2 = data.frame("date"=c("10/05/2020","10/10/2020"))
df2
> df2
date
1 10/05/2020
2 10/10/2020
I realize this is incorrect, but I am not sure how to define df2$value as the sums of certain df1$value rows:
df2$value = filter(df1, c(sum(1:5),sum(6:10)))
df2
I would like the output to look like this:
> df2
date value
1 10/05/2020 15
2 10/10/2020 40
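For reference, a minimal base R sketch of the same idea (assuming month/day/year dates and that df2$date is sorted, with each df2 date closing its interval):
d1 <- as.Date(df1$date, format = "%m/%d/%Y")
d2 <- as.Date(df2$date, format = "%m/%d/%Y")
# assign each df1 row to the df2 date that ends its interval, then sum per group
grp <- findInterval(d1, d2, left.open = TRUE) + 1
df2$value <- as.vector(tapply(df1$value, grp, sum))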
Here is another approach using dplyr and lubridate:
library(lubridate)
library(dplyr)
library(tidyr)  # for fill()
df1 %>%
  mutate(date = mdy(date)) %>%
  mutate(date = if_else(date == "2020-10-05" |
                          date == "2020-10-10", date, NA_Date_)) %>%
  fill(date, .direction = "up") %>%
  group_by(date) %>%
  summarise(value = sum(value))
  date       value
  <date>     <int>
1 2020-10-05    15
2 2020-10-10    40
We may use a non-equi join after converting the 'date' columns to Date class:
library(lubridate)
library(data.table)
setDT(df1)[, date := mdy(date)]
setDT(df2)[, date := mdy(date)]
df2[, start_date := fcoalesce(shift(date) + days(1), floor_date(date, 'month'))]
df1[df2, .(value = sum(value)),
    on = .(date >= start_date, date <= date),
    by = .EACHI][, -1, with = FALSE]
date value
<Date> <int>
1: 2020-10-05 15
2: 2020-10-10 40
Or another option is to create a group with findInterval and then do the group-by sum:
library(dplyr)
df1 %>%
group_by(grp = findInterval(date, df2$date, left.open = TRUE)) %>%
summarise(date = last(date), value = sum(value)) %>%
select(-grp)
-output
# A tibble: 2 × 2
date value
<date> <int>
1 2020-10-05 15
2 2020-10-10 40
Right now, my dataset is in wide format, meaning I have one row per person, but I want a long dataset, with multiple rows per person. I have two date variables, ADATE and DDATE, that I want to use as my start and end points, respectively. For example, if someone's ADATE is 02/04/10 and DDATE is 02/07/10, I want 4 rows:
Have:
ID ADATE DDATE
1 02/04/10 02/07/10
Want:
ID ADATE DDATE NEW_DATE
1 02/04/10 02/07/10 02/04/10
1 02/04/10 02/07/10 02/05/10
1 02/04/10 02/07/10 02/06/10
1 02/04/10 02/07/10 02/07/10
I have multiple datasets that I want to do this for, and I have written code that works for every single dataset except one... I'm not sure why. This is my attempt and the error I get:
jan15_long <- chf_jan15 %>%
mutate(NEW_DATE = as.Date(ADATE)) %>%
group_by(ID) %>%
complete(NEW_DATE = seq.Date(as.Date(ADATE), as.Date(DDATE), by = "day")) %>%
fill(vars) %>%
ungroup()
Error in seq.Date(as.Date(ADATE), as.Date(DDATE), by = "day") :
'from' must be of length 1
The above code gives me what I want and runs perfectly for every other dataset I have (10 out of 11).
Is there a better way to do this? dplyr makes the most sense to me, so hopefully there's a solution to this.
If there is more than one row, the seq needs to be looped. We can use map2. Also, based on the format of the 'DATE' columns, as.Date needs a format argument, i.e. as.Date(ADATE, "%m/%d/%y") (assuming it is month/day/year format).
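To see why the original code fails, note that seq.Date is not vectorised over 'from'; with more than one row it receives a 'from' of length > 1 (a quick illustration with made-up dates):
seq(as.Date(c("2010-02-04", "2010-03-01")), as.Date("2010-02-07"), by = "day")
# Error in seq.Date(...) : 'from' must be of length 1
With map2, seq is applied row by row instead: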
library(dplyr)
library(purrr)
library(lubridate)
library(tidyr)  # for unnest()
chf_jan15 %>%
  mutate_at(vars(ends_with("DATE")), mdy) %>%
  mutate(random_date = map2(ADATE, DDATE, seq, by = "day")) %>%
  unnest(c(random_date))
# A tibble: 4 x 4
# ID ADATE DDATE random_date
# <int> <date> <date> <date>
#1 1 2010-02-04 2010-02-07 2010-02-04
#2 1 2010-02-04 2010-02-07 2010-02-05
#3 1 2010-02-04 2010-02-07 2010-02-06
#4 1 2010-02-04 2010-02-07 2010-02-07
If there is only a single row, then after converting to Date class, the complete should work:
library(tidyr)
chf_jan15 %>%
mutate_at(vars(ends_with("DATE")), as.Date, format = "%m/%d/%y") %>%
mutate(NEW_DATE = ADATE) %>%
complete(NEW_DATE = seq(ADATE, DDATE, by = 'day')) %>%
fill(c(ID, ADATE, DDATE))
# A tibble: 4 x 4
# NEW_DATE ID ADATE DDATE
# <date> <int> <date> <date>
#1 2010-02-04 1 2010-02-04 2010-02-07
#2 2010-02-05 1 2010-02-04 2010-02-07
#3 2010-02-06 1 2010-02-04 2010-02-07
#4 2010-02-07 1 2010-02-04 2010-02-07
If there is a single row for each 'ID', then we can group_split and use complete:
chf_jan15 %>%
mutate_at(vars(ends_with("DATE")), as.Date, format = "%m/%d/%y") %>%
mutate(NEW_DATE = ADATE) %>%
group_split(ID) %>%
map_dfr(~ .x %>%
complete(NEW_DATE = seq(ADATE, DDATE, by = 'day')) %>%
fill(c(ID, ADATE, DDATE)))
data
chf_jan15 <- structure(list(ID = 1L, ADATE = "02/04/10",
DDATE = "02/07/10"), class = "data.frame", row.names = c(NA,
-1L))
I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a max of 3 different types of medications administered within 365 days of each other. I am not sure if it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
An option would be to convert 'date' to Date class, group by 'ID', get the absolute difference between 'date' and its lag, check whether it is greater than 365, create a grouping index with cumsum, and get the number of distinct 'meds' in summarise:
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), add = TRUE) %>%
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1
Given a table
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017
I'm trying to split it by calendar month, as in the following table:
id start end
1 22/03/2016 31/03/2016
1 01/04/2016 30/04/2016
1 01/05/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 30/09/2017
3 01/10/2017 31/10/2017
3 01/11/2017 30/11/2017
3 01/12/2017 25/12/2017
I'm trying to modify a code extract from how to split rows of a dataframe in multiple rows based on start date and end date?, but I haven't been able to modify the code correctly. The problem is generally in the months with 30 days; maybe it's easy, but I'm still not familiar with regular expressions.
#sample data
df <- data.frame("starting_date" = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
"end_date" = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
col3=c('1','2', '3'))
df1 <- df[, 1:2] %>%
  rowwise() %>%
  do(rbind(data.frame(matrix(as.character(c(
    .$starting_date,
    seq(.$starting_date, .$end_date, by = 1)[grep("\\d{4}-\\d{2}-31|\\d{4}-\\d{2}-01",
                                                  seq(.$starting_date, .$end_date, by = 1))],
    .$end_date)), ncol = 2, byrow = TRUE)))) %>%
  data.frame() %>%
  `colnames<-`(c("starting_date", "end_date")) %>%
  mutate(starting_date = as.Date(starting_date, format = "%Y-%m-%d"),
         end_date = as.Date(end_date, format = "%Y-%m-%d"))
#add temporary columns to the original and expanded date column dataframes
df$row_idx <- seq(1:nrow(df))
df$temp_col <- (year(df$end_date) - year(df$starting_date)) +1
df1 <- cbind(df1,row_idx = rep(df$row_idx,df$temp_col))
#join both dataframes to get the final result
final_df <- left_join(df1,df[,3:(ncol(df)-1)],by="row_idx") %>%
select(-row_idx)
final_df
If anyone knows how to modify the code, or a better way to do it, I will be very grateful.
We assume there is an error in the sample output in the question since the third row spans parts of two months and so should be split into two rows.
Define Seq which, given single start and end Date values, produces a data.frame of start and end columns, and then run it on each id using group_by. (Within Seq, as.Date(ym, frac = 0) gives the first day of each month and frac = 1 gives the last day.)
library(dplyr)
library(zoo)
Seq <- function(start, end) {
ym <- seq(as.yearmon(start), as.yearmon(end), 1/12)
starts <- pmax(start, as.Date(ym, frac = 0))
ends <- pmin(end, as.Date(ym, frac = 1))
unique(data.frame(start = starts, end = ends))
}
fmt <- "%d/%m/%Y"
DF %>%
mutate(start = as.Date(start, fmt), end = as.Date(end, fmt)) %>%
group_by(id) %>%
do(Seq(.$start, .$end)) %>%
ungroup
giving:
# A tibble: 9 x 3
id start end
<int> <date> <date>
1 1 2016-03-22 2016-03-31
2 1 2016-04-01 2016-04-30
3 1 2016-05-01 2016-05-31
4 1 2016-06-01 2016-06-05
5 2 2016-08-17 2016-08-29
6 3 2017-09-22 2017-09-30
7 3 2017-10-01 2017-10-31
8 3 2017-11-01 2017-11-30
9 3 2017-12-01 2017-12-25
Note
The input DF in reproducible form:
Lines <- "
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017"
DF <- read.table(text = Lines, header = TRUE)
So there's probably a more elegant way to accomplish this, and I feel like I've seen similar questions but could not find a duplicate quickly, so here goes...
SETUP
library(tidyverse)
library(lubridate)
df <- data.frame(
id = c('1', '2', '3'),
starting_date = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
end_date = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
stringsAsFactors = FALSE
)
df
#> id starting_date end_date
#> 1 1 2016-03-22 2016-06-05
#> 2 2 2016-08-17 2016-08-29
#> 3 3 2017-09-12 2017-12-25
SOLUTION
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest() %>%
mutate(row = row_number()) %>%
mutate(
new_end_date = if_else(row == max(row), end_date, date_seq),
new_start_date = if_else(row == min(row), starting_date, floor_date(new_end_date, "month"))
) %>%
select(
id, new_start_date, new_end_date
)
#> # A tibble: 8 x 3
#> # Groups: id [3]
#> id new_start_date new_end_date
#> <chr> <date> <date>
#> 1 1 2016-03-22 2016-03-31
#> 2 1 2016-04-01 2016-04-30
#> 3 1 2016-06-01 2016-06-05
#> 4 2 2016-08-17 2016-08-29
#> 5 3 2017-09-12 2017-09-30
#> 6 3 2017-10-01 2017-10-31
#> 7 3 2017-11-01 2017-11-30
#> 8 3 2017-12-01 2017-12-25
EXPLANATION
Much of what's going on here takes place in the first mutate call, which creates date_seq. To understand it, consider the following:
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month")
# [1] "2016-03-22" "2016-04-22" "2016-05-22"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month")
# [1] "2016-04-01" "2016-05-01" "2016-06-01"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month") - 1
# [1] "2016-03-31" "2016-04-30" "2016-05-31"
So basically, create a sequence of "end-of-month" dates between the original start and end dates. Putting this in a list-column allows us to organize by the id so that we unnest appropriately. Check out the output after the unnest():
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest()
At this point there is one row per end-of-month date in each id's range: three rows for id 1 (2016-03-31, 2016-04-30, 2016-05-31), one for id 2 (2016-08-31), and four for id 3 (2017-09-30 through 2017-12-31). From there I hope things are relatively straightforward. The row_number() probably could have been replaced with something fancier like first()/last(), but I thought this might be easier to follow.
I tried asking this question before, but it was poorly stated. This is a new attempt because I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser, showing how many points each has scored so far (as both winner and loser).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
test_data[i,]$winner_points_sum_loop <-
(
test_data %>%
dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
+
test_data %>%
dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
)
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions on how to tackle this problem? The queries take quite some time when the row numbers add up. I've tried elaborating with the ave function; I can do it for one column to sum a player's points as winner, but I can't figure out how to add their points as loser.
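For reference, the kind of ave-based attempt described above might look like this (a rough sketch, assuming rows are already ordered by date; winner_cum is a made-up column name and it only covers the winner side):
# per winner id: points scored as winner before each match
test_data$winner_cum <- ave(test_data$winner_points, test_data$winner,
                            FUN = function(x) cumsum(x) - x)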
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
unite(winner, winner, winner_points) %>% # unite winner columns
unite(loser, loser, loser_points) %>% # unite loser columns
gather(type, pl_pts, winner, loser, -date) %>% # reshape
separate(pl_pts, c("player","points"), convert = T) %>% # separate columns
arrange(date) %>% # order dates (in case it's not)
group_by(player) %>% # for each player
mutate(sum_points = cumsum(points) - points) %>% # get points up to that date
ungroup() %>% # forget the grouping
unite(pl_pts_sumpts, player, points, sum_points) %>% # unite columns
spread(type, pl_pts_sumpts) %>% # reshape
separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum) # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
I finally understood what you want. I took the approach of getting the cumulative points of each player at each point in time and then joining them to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
  gather(end_game_status, player_id, winner, loser) %>%
  gather(which_point, how_many_points, winner_points, loser_points) %>%
  filter(
    (end_game_status == "winner" & which_point == "winner_points") |
      (end_game_status == "loser" & which_point == "loser_points")) %>%
  arrange(date = as.Date(date)) %>%
  group_by(player_id) %>%
  mutate(cumulative_points = cumsum(how_many_points)) %>%
  mutate(cumulative_points_sofar = lag(cumulative_points, default = 0)) %>%
  select(player_id, date, cumulative_points_sofar)
output <- test_data %>%
left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
rename(winner_points_sum = cumulative_points_sofar) %>%
left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
rename(loser_points_sum = cumulative_points_sofar)
output
The difference from the OP's previous question is that the OP is now asking for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format, whereby two value variables are reshaped simultaneously, computes the cumulative sums for each player id, and finally reshapes from long back to wide format again. In order to sum only points scored before the actual date, the rows are lagged by one.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4