Right now, my dataset is in wide format, meaning I have one row per person, but I want a long dataset, with multiple rows per person. I have two date variables, ADATE and DDATE, that I want to use as my start and end points, respectively. For example, if someone's ADATE is 02/04/10 and DDATE is 02/07/10, I want 4 rows:
Have:
ID ADATE DDATE
1 02/04/10 02/07/10
Want:
ID ADATE DDATE NEW_DATE
1 02/04/10 02/07/10 02/04/10
1 02/04/10 02/07/10 02/05/10
1 02/04/10 02/07/10 02/06/10
1 02/04/10 02/07/10 02/07/10
I have multiple datasets that I want to do this for, and I have written code that works for every single dataset except one...I'm not sure why. This is my attempt and the error I get:
jan15_long <- chf_jan15 %>%
mutate(NEW_DATE = as.Date(ADATE)) %>%
group_by(ID) %>%
complete(NEW_DATE = seq.Date(as.Date(ADATE), as.Date(DDATE), by = "day")) %>%
fill(vars) %>%
ungroup()
Error in seq.Date(as.Date(ADATE), as.Date(DDATE), by = "day") :
'from' must be of length 1
The above code gives me what I want and runs perfectly for every other dataset I have (10 out of 11).
Is there a better way to do this? dplyr makes the most sense to me, so hopefully there's a solution to this.
If there is more than one row per group, seq receives vectors for 'from' and 'to' instead of single values, so the call needs to be looped over the rows. We can use map2 for that. Also, based on the format of the 'DATE' columns, as.Date needs a format argument, i.e. as.Date(ADATE, "%m/%d/%y") (assuming month/day/year format).
library(dplyr)
library(purrr)
library(lubridate)
library(tidyr) # provides unnest()
chf_jan15 %>%
mutate_at(vars(ends_with("DATE")), mdy) %>%
mutate(random_date = map2(ADATE, DDATE, seq, by = "day")) %>%
unnest(c(random_date))
# A tibble: 4 x 4
# ID ADATE DDATE random_date
# <int> <date> <date> <date>
#1 1 2010-02-04 2010-02-07 2010-02-04
#2 1 2010-02-04 2010-02-07 2010-02-05
#3 1 2010-02-04 2010-02-07 2010-02-06
#4 1 2010-02-04 2010-02-07 2010-02-07
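The same idea can be sketched in base R, which also shows where the error comes from. This is only an illustration: the data frame `visits` and its second row are invented here so that one ID has two rows, and Map stands in for map2.

```r
# Hypothetical data: the second row is invented so that ID 1 has two rows
visits <- data.frame(
  ID    = c(1L, 1L),
  ADATE = as.Date(c("02/04/10", "03/01/10"), format = "%m/%d/%y"),
  DDATE = as.Date(c("02/07/10", "03/02/10"), format = "%m/%d/%y")
)

# Passing whole columns to seq() fails because 'from' has length 2;
# tryCatch keeps the error message instead of stopping the script
vectorised <- tryCatch(
  seq(visits$ADATE, visits$DDATE, by = "day"),
  error = function(e) conditionMessage(e)
)

# Looping over the rows builds one daily sequence per row
per_row <- Map(seq, visits$ADATE, visits$DDATE, by = "day")
n_days  <- lengths(per_row)

# Repeat each original row once per generated date
long <- data.frame(
  ID       = rep(visits$ID, n_days),
  ADATE    = rep(visits$ADATE, n_days),
  DDATE    = rep(visits$DDATE, n_days),
  NEW_DATE = do.call(c, per_row)
)
```

If the one failing dataset has an ID that appears on more than one row, that would explain why the grouped complete call works everywhere else; something like table(chf_jan15$ID) is a quick way to check.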
If there is only a single row, complete should work once the columns are converted to Date class
library(tidyr)
chf_jan15 %>%
mutate_at(vars(ends_with("DATE")), as.Date, format = "%m/%d/%y") %>%
mutate(NEW_DATE = ADATE) %>%
complete(NEW_DATE = seq(ADATE, DDATE, by = 'day')) %>%
fill(c(ID, ADATE, DDATE))
# A tibble: 4 x 4
# NEW_DATE ID ADATE DDATE
# <date> <int> <date> <date>
#1 2010-02-04 1 2010-02-04 2010-02-07
#2 2010-02-05 1 2010-02-04 2010-02-07
#3 2010-02-06 1 2010-02-04 2010-02-07
#4 2010-02-07 1 2010-02-04 2010-02-07
If there is a single row for each 'ID', then we can use group_split and apply complete to each piece
chf_jan15 %>%
mutate_at(vars(ends_with("DATE")), as.Date, format = "%m/%d/%y") %>%
mutate(NEW_DATE = ADATE) %>%
group_split(ID) %>%
map_dfr(~ .x %>%
complete(NEW_DATE = seq(ADATE, DDATE, by = 'day')) %>%
fill(c(ID, ADATE, DDATE)))
data
chf_jan15 <- structure(list(ID = 1L, ADATE = "02/04/10",
DDATE = "02/07/10"), class = "data.frame", row.names = c(NA,
-1L))
I would ultimately like to have df2 with certain dates and the cumulative sum of values connected to those date ranges from df1.
df1 = data.frame("date"=c("10/01/2020","10/02/2020","10/03/2020","10/04/2020","10/05/2020",
"10/06/2020","10/07/2020","10/08/2020","10/09/2020","10/10/2020"),
"value"=c(1:10))
df1
> df1
date value
1 10/01/2020 1
2 10/02/2020 2
3 10/03/2020 3
4 10/04/2020 4
5 10/05/2020 5
6 10/06/2020 6
7 10/07/2020 7
8 10/08/2020 8
9 10/09/2020 9
10 10/10/2020 10
df2 = data.frame("date"=c("10/05/2020","10/10/2020"))
df2
> df2
date
1 10/05/2020
2 10/10/2020
I realize this is incorrect, but I am not sure how to define df2$value as the sums of certain df1$value rows:
df2$value = filter(df1, c(sum(1:5),sum(6:10)))
df2
I would like the output to look like this:
> df2
date value
1 10/05/2020 15
2 10/10/2020 40
Here is another approach using dplyr and lubridate:
library(lubridate)
library(dplyr)
library(tidyr) # provides fill()
df1 %>%
# the dates are month/day/year, so parse them with mdy() rather than dmy()
mutate(date = mdy(date)) %>%
mutate(date = if_else(date == "2020-10-05" |
date == "2020-10-10", date, NA_Date_)) %>%
fill(date, .direction = "up") %>%
group_by(date) %>%
summarise(value = sum(value))
date value
<date> <int>
1 2020-10-05 15
2 2020-10-10 40
We may use a non-equi join after converting the 'date' columns to Date class
library(lubridate)
library(data.table)
setDT(df1)[, date := mdy(date)]
setDT(df2)[, date := mdy(date)]
df2[, start_date := fcoalesce(shift(date) + days(1), floor_date(date, 'month'))]
df1[df2,.(value = sum(value)), on = .( date >= start_date,
date <= date), by = .EACHI][, -1, with = FALSE]
date value
<Date> <int>
1: 2020-10-05 15
2: 2020-10-10 40
Another option is to create a grouping variable with findInterval and then sum by group (this assumes the 'date' columns have already been converted to Date class, as above)
library(dplyr)
df1 %>%
group_by(grp = findInterval(date, df2$date, left.open = TRUE)) %>%
summarise(date = last(date), value = sum(value)) %>%
select(-grp)
-output
# A tibble: 2 × 2
date value
<date> <int>
1 2020-10-05 15
2 2020-10-10 40
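For reference, the findInterval idea can also be written without any packages. The following is a minimal base R sketch, assuming the dates are month/day/year and that every date in df1 falls on or before the last cutoff in df2; the helper names d1, d2 and grp are made up for the example.

```r
df1 <- data.frame(
  date  = sprintf("10/%02d/2020", 1:10),
  value = 1:10
)
df2 <- data.frame(date = c("10/05/2020", "10/10/2020"))

# Convert both date columns (assumed month/day/year) to Date
d1 <- as.Date(df1$date, format = "%m/%d/%Y")
d2 <- as.Date(df2$date, format = "%m/%d/%Y")

# left.open = TRUE makes each interval include its right end point,
# so a cutoff date is counted in the range that ends on it
grp <- findInterval(as.numeric(d1), as.numeric(d2), left.open = TRUE)

# One sum per interval, in the same order as the cutoffs in df2
df2$value <- as.vector(tapply(df1$value, grp, sum))
```

Rows of df1 dated after the last cutoff would form an extra group, so they would need to be dropped before the assignment to df2$value.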
I have a problem with a time series which I don't know how to solve.
I have a tibble with 4 different variables. In my real dataset there are over 10,000 documents.
document date author label
1 2018-04-05 Mr.X 1
2 2018-02-05 Mr.Y 0
3 2018-04-17 Mr.Z 1
So now my problem is that, as a first step, I want to count the articles that occur in a specific month of a specific year, for every month in my time series. I know that I can filter for a specific month in a year like this:
tibble %>%
filter(date > "2018-02-01" & date < "2018-02-28")
Result out of this would be a tibble with 1 Observation, but my problem is that I have 360 different time periods in my data. Can I write a function for this to solve this problem or do I need to make 360 own calculations?
The best solution for me would be a table with 360 different columns, where every column holds the number of articles counted in that month. Is this possible?
Thank you so much in advance.
If you want each result into a separate list, you can do something like this
suppressMessages(library(dplyr))
df %>% mutate(date = as.Date(date)) %>%
group_split(substr(date, 1, 7), .keep = F)
<list_of<
tbl_df<
document: integer
date : date
author : character
label : integer
>
>[2]>
[[1]]
# A tibble: 1 x 4
document date author label
<int> <date> <chr> <int>
1 2 2018-02-05 Mr.Y 0
[[2]]
# A tibble: 2 x 4
document date author label
<int> <date> <chr> <int>
1 1 2018-04-05 Mr.X 1
2 3 2018-04-17 Mr.Z 1
You can further use list2env() to save each element of this list as a separate object (the list has to be named first).
To count the number of rows for each month-year combination, in tidyverse you can do :
library(dplyr)
library(tidyr)
df %>%
mutate(date = as.Date(date),
year_mon = format(date, '%Y-%m')) %>%
select(year_mon) %>%
pivot_wider(names_from = year_mon, values_from = year_mon,
values_fn = length, values_fill = 0)
# `2018-04` `2018-02`
# <int> <int>
#1 2 1
and in base R :
df$date <- as.Date(df$date)
table(format(df$date, '%Y-%m'))
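Note that table only reports months that actually occur in the data. If a count is wanted for all 360 periods, including months without any document, one possible sketch is to build the full set of months first and use it as factor levels. The data frame docs and the helper names are assumptions for the example.

```r
# Sample data from the question (only the columns needed here)
docs <- data.frame(
  document = 1:3,
  date     = as.Date(c("2018-04-05", "2018-02-05", "2018-04-17"))
)

# First day of every month between the earliest and latest document
first_month  <- as.Date(format(min(docs$date), "%Y-%m-01"))
month_starts <- seq(first_month, max(docs$date), by = "month")
all_months   <- format(month_starts, "%Y-%m")

# Using all months as factor levels keeps empty months with a count of 0
counts <- table(factor(format(docs$date, "%Y-%m"), levels = all_months))
```

as.data.frame(counts) turns the result into a two-column data frame if a long layout is easier to work with than 360 columns.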
I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a max of 3 different types of medications administered within 365 days of each other. I am not sure if it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
An option would be to convert 'date' to Date class, group by 'ID', take the absolute difference between 'date' and its lag, flag differences greater than 365, build a grouping index from those flags with cumsum, count the distinct 'meds' per group in summarise, and finally take the maximum per 'ID'
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), .add = TRUE) %>%
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1
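Both answers above group rows by the gap between consecutive administrations. A different reading of "within 365 days of each other" is a sliding window: for every administration, count the distinct meds given in the following 365 days and keep the largest count per ID. The base R sketch below follows that reading on the sample data from the question; it is not the method of either answer, and the names admin and max_distinct are made up for the example.

```r
# Sample data from the question, using dayoflife as the time axis
admin <- data.frame(
  ID        = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
  dayoflife = c(16361, 16361, 16407, 20015, 20016, 20016,
                24133, 24074, 8458, 8485),
  meds      = c("lasiks", "vigab", "lacos", "pheno", "vigab", "lasiks",
                "pheno", "vigab", "pheno", "pheno")
)

# Largest number of distinct meds in any window that starts at an
# administration day and covers the next `window` days
max_distinct <- function(day, med, window = 365) {
  max(vapply(day, function(start) {
    length(unique(med[day >= start & day <= start + window]))
  }, integer(1)))
}

# Apply the helper to each ID separately
res <- sapply(split(admin, admin$ID),
              function(g) max_distinct(g$dayoflife, g$meds))
result <- data.frame(ID = as.integer(names(res)), N = unname(res))
```

On this sample the sketch gives 3, 2 and 1 for IDs 1, 2 and 3, which is the outcome requested in the question. dayoflife is used because it is already numeric; differences of the Date column would work the same way.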
I want to calculate a conditional sum based on specified dates in R. My sample df is
start_date = c("7/24/2017", "7/1/2017", "7/25/2017")
end_date = c("7/27/2017", "7/4/2017", "7/28/2017")
`7/23/2017` = c(1,5,1)
`7/24/2017` = c(2,0,2)
`7/25/2017` = c(0,0,10)
`7/26/2017` = c(2,2,2)
`7/27/2017` = c(0,0,0)
df = data.frame(start_date,end_date,`7/23/2017`,`7/24/2017`,`7/25/2017`,`7/26/2017`,`7/27/2017`)
In Excel it looks like:
I want to perform calculations as specified in Column H which is a conditional sum of columns C through G based on the dates specified in columns A and B.
Apparently, Excel allows columns to be dates but not R.
#wide to long format
dat <- reshape(df, direction="long", varying=list(names(df)[3:7]), v.names="Value",
idvar=c("start_date","end_date"), timevar="Date",
times=seq(as.Date("2017/07/23"),as.Date("2017/07/27"), "day"))
#convert from factor to date class
dat$end_date <- as.Date(dat$end_date, format = "%m/%d/%Y")
dat$start_date <- as.Date(dat$start_date, format = "%m/%d/%Y")
library(dplyr)
dat %>% group_by(start_date, end_date) %>%
mutate(mval = ifelse(between(Date, start_date, end_date), Value, 0)) %>%
summarise(conditional_sum=sum(mval))
# # A tibble: 3 x 3
# # Groups: start_date [?]
# start_date end_date conditional_sum
# <date> <date> <dbl>
# 1 2017-07-01 2017-07-04 0
# 2 2017-07-24 2017-07-27 4
# 3 2017-07-25 2017-07-28 12
You could achieve that as follows:
# number of leading columns that do not hold values (start_date and end_date)
n_lead = 2
# create a separate vector with the dates
dates = as.Date(gsub("X","",tail(colnames(df),-n_lead)),format="%m.%d.%Y")
# convert date columns in dataframe
df$start_date = as.Date(df$start_date,format="%m/%d/%Y")
df$end_date = as.Date(df$end_date,format="%m/%d/%Y")
# calculate sum
sapply(1:nrow(df),function(x) {y = df[x,(n_lead+1):ncol(df)][dates %in%
seq(df$start_date[x],df$end_date[x],by="day") ]; ifelse(length(y)>0,sum(y),0) })
returns:
[1] 4 0 12
Hope this helps!
Here's a solution all in one dplyr pipe:
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
gather(date, value, -c(1, 2)) %>%
mutate(date = gsub('X', '', date)) %>%
mutate(date = gsub('\\.', '/', date)) %>%
mutate(date = mdy(date)) %>%
filter(date >= mdy(start_date) & date <=mdy(end_date)) %>%
group_by(start_date, end_date) %>%
summarize(Conditional_Sum = sum(value)) %>%
right_join(df) %>%
mutate(Conditional_Sum = ifelse(is.na(Conditional_Sum), 0, Conditional_Sum)) %>%
select(-one_of('Conditional_Sum'), one_of('Conditional_Sum'))
## start_date end_date X7.23.2017 X7.24.2017 X7.25.2017 X7.26.2017 X7.27.2017 Conditional_Sum
## <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7/24/2017 7/27/2017 1 2 0 2 0 4
## 2 7/1/2017 7/4/2017 5 0 0 2 0 0
## 3 7/25/2017 7/28/2017 1 2 10 2 0 12
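The "X7.23.2017" names in the answers come from data.frame rewriting column names that are not syntactically valid. As a sketch of an alternative, check.names = FALSE keeps the original date headers, and the conditional sum can then be computed with a logical mask in base R. The helper names col_dates, start, end, values and in_range are assumptions for the example.

```r
# check.names = FALSE keeps the date strings as column names
df <- data.frame(
  start_date  = c("7/24/2017", "7/1/2017", "7/25/2017"),
  end_date    = c("7/27/2017", "7/4/2017", "7/28/2017"),
  `7/23/2017` = c(1, 5, 1),
  `7/24/2017` = c(2, 0, 2),
  `7/25/2017` = c(0, 0, 10),
  `7/26/2017` = c(2, 2, 2),
  `7/27/2017` = c(0, 0, 0),
  check.names = FALSE
)

# Dates encoded in the column headers and in the two range columns
col_dates <- as.numeric(as.Date(names(df)[-(1:2)], format = "%m/%d/%Y"))
start     <- as.numeric(as.Date(df$start_date, format = "%m/%d/%Y"))
end       <- as.numeric(as.Date(df$end_date, format = "%m/%d/%Y"))

# in_range[i, j] is TRUE when column j's date lies within row i's range
values   <- as.matrix(df[, -(1:2)])
in_range <- outer(start, col_dates, "<=") & outer(end, col_dates, ">=")

# Multiplying by the mask zeroes out values outside the range
df$conditional_sum <- rowSums(values * in_range)
```

This gives 4, 0 and 12 in the original row order, matching the answers above.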
Here is my df (data.frame):
id group date
[1] 1 B 2000-01-01
[2] 1 B 2001-02-11
[3] 1 A 2001-04-06
[4] 2 C 2000-02-01
[5] 2 A 2001-01-01
[6] 2 B 2004-11-12
...
The data.frame has been arranged by id and date.
I would like to calculate difference in dates (in days) between group A and the row above it for each id. In my data, every group A has a row above it for the same id.
The results that I am interested in will look something like this
id days
[1] 1 54
[2] 2 335
...
Please advise
Thanks.
Since it's already sorted, you can just do:
df %>%
group_by(id) %>%
mutate(diff_days = difftime(date, lag(date))) %>%
filter(group == "A") %>%
select(diff_days)
which gives:
id diff_days
<int> <time>
1 1 54 days
2 2 335 days
Here is an idea using dplyr
library(dplyr)
#make sure "date" has the appropriate class
df$date <- as.POSIXct(df$date, format = '%Y-%m-%d')
df %>%
group_by(id) %>%
mutate(diff1 = c(NA, round(diff.difftime(date, units = 'days')))) %>%
filter(group == 'A') %>%
select(id, diff1)
#Source: local data frame [2 x 2]
#Groups: id [2]
# id diff1
# <int> <dbl>
#1 1 54
#2 2 335
We can use data.table
library(data.table)
setDT(df)[, diff1 := c(NA, round(diff.difftime(date,
units = 'days'), 0)), id][group=="A"][, c("id", "diff1"), with = FALSE]
# id diff1
#1: 1 54
#2: 2 335
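A package-free version of the same idea is sketched below, assuming the rows are already sorted by id and date as stated in the question. The first six rows of the question's data are used, and the helper name prev_date is made up for the example.

```r
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  group = c("B", "B", "A", "C", "A", "B"),
  date  = as.Date(c("2000-01-01", "2001-02-11", "2001-04-06",
                    "2000-02-01", "2001-01-01", "2004-11-12"))
)

# Previous date within each id (NA for the first row of an id);
# dates are handled as day counts so that ave() returns plain numbers
prev_date <- ave(as.numeric(df$date), df$id,
                 FUN = function(d) c(NA, head(d, -1)))

# Difference in days between each row and the row above it
df$days <- as.numeric(df$date) - prev_date

# Keep only the group A rows
result <- df[df$group == "A", c("id", "days")]
```

For the sample rows this yields 54 days for id 1 and 335 days for id 2.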