Find difference between dates in consecutive rows - r

I know this has been asked before, but the answers I have found seem to rely on POSIXct, whereas I don't see why I can't do this with Date.
I have data like
Person Event VisitDate
1 RFA 2004-06-04
1 EMR 2016-06-03
1 Nil 2016-06-05
I want to get the difference between the dates in a separate column (eventually to average the difference of the dates over all Person ids).
Expected output:
Person Event VisitDate Date Difference in days
1 RFA 2004-06-04
1 EMR 2016-06-03 4382
1 Nil 2016-06-05 2
So far I have used:
EndoSubsetOnSurveil %>%
arrange(Person, as.Date(EndoSubsetOnSurveil$VisitDate, '%d/%m/%y')) %>%
difftime(VisitDate[1:(length(VisitDate)-1)] , VisitDate[2:length(VisitDate)])
but I get the error
Error in as.POSIXct.default(time1, tz = tz) :
do not know how to convert 'time1' to class “POSIXct”

Explanation:
(i) The format passed to as.Date should be %Y-%m-%d, to match the data. (ii) The column itself has to be converted with as.Date if you want it to be treated as a date. In your code the conversion is only used inside arrange() for sorting and is never stored, so the column is not a Date later on. (iii) Using lag() is a simpler way to pair each row with the previous one.
Code:
I think the output of the last chunk (grouped by Person), rather than that of the second chunk, is what you want.
# SAMPLE DATA -------------------------------------------------------------
EndoSubsetOnSurveil <-
data.frame(Person = c(1,1,2,2),
VisitDate = c("2004-06-04", "2016-06-03", "2016-07-01",
"2016-08-01"))
EndoSubsetOnSurveil$VisitDate <-
as.Date(EndoSubsetOnSurveil$VisitDate, '%Y-%m-%d')
# DIFFERENCE BETWEEN VISIT WITHOUT GROUPING -------------------------------
library(dplyr)
EndoSubsetOnSurveil %>% arrange(Person, VisitDate) %>%
mutate(diffDate = difftime(VisitDate, lag(VisitDate,1)))
# DIFFERENCE BETWEEN VISIT BY PATIENT -------------------------------------
EndoSubsetOnSurveil %>% arrange(Person, VisitDate) %>% group_by(Person) %>%
mutate(diffDate = difftime(VisitDate, lag(VisitDate,1))) %>% ungroup()
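For the stated end goal (averaging the gaps per Person), here is a minimal base R sketch that works directly on Date values without POSIXct. The data frame name visits and the column diffDays are illustrative names, not from the original post, and the third row for Person 1 is taken from the question's sample.

```r
# Sample data: VisitDate is stored as Date, not POSIXct
visits <- data.frame(
  Person    = c(1, 1, 1, 2, 2),
  VisitDate = as.Date(c("2004-06-04", "2016-06-03", "2016-06-05",
                        "2016-07-01", "2016-08-01"))
)
visits <- visits[order(visits$Person, visits$VisitDate), ]

# Days since the previous visit of the same person; NA for each first visit
visits$diffDays <- ave(
  as.numeric(visits$VisitDate), visits$Person,
  FUN = function(d) c(NA, diff(d))
)

# Mean gap per person; the formula interface drops the NA rows
mean_gap <- aggregate(diffDays ~ Person, data = visits, FUN = mean)
print(visits)
print(mean_gap)
```

Converting the dates to plain day counts first keeps ave() from having to deal with difftime objects.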

Related

How can I identify and extract duplicates from data frame?

My objective is to check if a patient is using two drugs at the same date.
In the example, patient 1 is using drug A and drug B at the same date, but I want to extract it with code.
df <- data.frame(id = c(1,1,1,2,2,2),
date = c("2020-02-01","2020-02-01","2020-03-02","2019-10-02","2019-10-18","2019-10-26"),
drug_type = c("A","B","A","A","A","B"))
df$date <- as.factor(df$date)
df$drug_type <- as.factor(df$drug_type)
In order to do this, I firstly made date and drug type factor variables.
Next I used following code:
df %>%
mutate(lev_actdate = as.factor(actdate))%>%
filter(nlevels(drug_type)>1 & nlevels(date) < nrow(date))
But I failed. I assumed that if a patient is using two drugs at the same date, the number of levels in the date column will be less than its row number. However, now I don't know how to make it with code.
Additionally, I am confused by the following:
if I use nlevels(df$date), the right result is returned, but when I use df %>% nlevels(date), an error is returned:
"Error in nlevels(., df$date) : unused argument (df$date)"
Could you please tell me why this occurred and how can I fix it?
Thank you for your time.
You could use something like
library(dplyr)
df %>%
group_by(id, date) %>%
filter(n_distinct(drug_type) >= 2)
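df %>% nlevels(date) is the same as nlevels(df, date), which is not the same as nlevels(df$date). Note that df %>% nlevels(.$date) does not help either: when the dot only appears inside a nested expression, the pipe still inserts df as the first argument, giving nlevels(df, df$date) and the error you saw. Instead you could try df %>% {nlevels(.$date)} or df$date %>% nlevels().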
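The same grouping idea can be sketched in base R, without factors or pipes. This is only an illustration under the assumption that "same date" means the same id and date pair. The names key, n_drugs and same_day are made up for the example.

```r
df <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2),
  date      = c("2020-02-01", "2020-02-01", "2020-03-02",
                "2019-10-02", "2019-10-18", "2019-10-26"),
  drug_type = c("A", "B", "A", "A", "A", "B")
)

# One key per patient/date combination
key <- paste(df$id, df$date)

# Number of distinct drugs recorded for each key
n_drugs <- tapply(df$drug_type, key, function(x) length(unique(x)))

# Keep rows whose patient/date combination has two or more distinct drugs
keep <- as.vector(n_drugs[key] >= 2)
same_day <- df[keep, ]
print(same_day)
```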
Do you need something like this?
library(dplyr)
df %>%
group_by(date) %>%
distinct() %>%
summarise(drug_type_sum = toString(drug_type))
date drug_type_sum
<fct> <chr>
1 2019-10-02 A
2 2019-10-18 A
3 2019-10-26 B
4 2020-02-01 A, B
5 2020-03-02 A

R -- Always grab the last day of the previous year in R

I am an aspiring data scientist, and this will be my first ever question on StackOF.
I have this line of code to help wrangle my data. My date filter is static. I would prefer not to have to go in and change this hardcoded value every year. What is the best alternative for my date filter to make it more dynamic? The date column is also difficult to work with because it is not a "date", it is a "dbl".
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)
# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>% # take the dataframe
mutate(DATE = ymd(DATE)) %>% # turn the DATE column actually into a date
filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and filter rows where DATE is >= one day before the first day of this year (floor_date(Sys.Date(), "year"))
DATE
1 2019-12-31
2 2020-01-22
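If lubridate is not available, the same cutoff can be computed in base R. The sketch below is an assumption-laden illustration: last_day_prev_year is a hypothetical helper name, and a fixed "today" is passed in only so that the result is reproducible.

```r
# Hypothetical helper: last day of the year before `today`
last_day_prev_year <- function(today = Sys.Date()) {
  # Build January 1st of the current year, then step back one day
  as.Date(format(today, "%Y-01-01")) - 1
}

df <- data.frame(DATE = c(20191230, 20191231, 20200122))

# Turn the numeric yyyymmdd values into real Date objects
df$DATE <- as.Date(as.character(df$DATE), format = "%Y%m%d")

# Fixed reference date for a reproducible example; use Sys.Date() in practice
cutoff <- last_day_prev_year(as.Date("2020-03-15"))
kept <- df[df$DATE >= cutoff, , drop = FALSE]
print(kept)
```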

Using mutate and summarize to find elements in a vector

I'm trying to replace VBA code with R code. Currently in VBA I use SUMIF over a range to find the total value for an ID depending on some dates. In R I'm using mutate and summarize, but there's always an error. I don't know how to fix it.
Suppose I want to find the total for ID=1 of the values recorded within the last 2 days:
#sys.Date() = 2016-01-06
df
DATES ID VALUE
2016/01/01 1 10
2016/01/02 2 15
2016/01/05 1 13
the result must be:
ID Value
1 13
Currently, the code is:
df%>%
group_by(ID) %>%
mutate(Total_op = if (Sys.Date()-as.Date(Dates,format="%YYYY-%mm-%dd")>=1) Value else 0)))%>%
summarize(SumTotal = sum(Total_op))%>%
collect
But the error showed is:
Error: Column 'sumTotal' must be length X (the group size) or one, not Y
With lubridate we can convert the DATES strings to Date objects and filter accordingly:
library(lubridate)
library(tidyverse)
Dat <- ymd("2016-01-06") #Set a date. Can be done by Sys.Date()
df %>%
mutate(DATES = ymd(DATES)) %>% #convert the strings to Date
filter(DATES %within% interval(Dat-2,Dat)) %>% #filter entries in the last 2 days
group_by(ID) %>% #group by ID
summarise(SumTotal = sum(VALUE)) #summarise value as Sum
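As a rough base R counterpart to a SUMIF over a date window, the sketch below filters first and sums afterwards. The names ref_date, recent and totals are illustrative, and the reference date is fixed to the one given in the question.

```r
df <- data.frame(
  DATES = c("2016/01/01", "2016/01/02", "2016/01/05"),
  ID    = c(1, 2, 1),
  VALUE = c(10, 15, 13)
)

# Parse the slash-separated strings into Date objects
df$DATES <- as.Date(df$DATES, format = "%Y/%m/%d")

# Fixed reference date from the question; use Sys.Date() in practice
ref_date <- as.Date("2016-01-06")

# Keep only rows from the last 2 days, then sum VALUE per ID
recent <- df[df$DATES >= ref_date - 2 & df$DATES <= ref_date, ]
totals <- aggregate(VALUE ~ ID, data = recent, FUN = sum)
print(totals)
```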

dplyr: left_join where df A value lies between df B values

I'd like to know if it is possible to achieve the following using dplyr, or some tidyverse package...
Context: I am having trouble getting my data into a structure that will allow the use of geom_rect. See this SO question for the motivation.
library(tis)
# Prepare NBER recession start end dates.
recessions <- data.frame(start = as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"),
end= as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"))
dt <- tibble(date=c(as.Date('1983-01-01'),as.Date('1990-10-15'), as.Date('1993-01-01')))
Desired output:
date start end
1983-01-01 NA NA
1990-10-15 1990-08-01 1991-03-31
1993-01-01 NA NA
Appreciate any suggestions.
Note: Previous questions indicate that sqldf is one approach to take. However, the data here involves dates, and my understanding is that date is not a data type in SQLite.
In the spirit of 'write the code you wish you had':
df <- dt %>%
left_join(x=., y=recessions, date >= start & date <= end)
"Date" class objects in R are internally stored as the number of days since the Epoch (January 1, 1970) and that number is what is sent to SQLite so the order is still maintained even though the class is not; therefore, we can do this using the SQLite back end:
library(sqldf)
sqldf("select * from dt left join recessions on date between start and end")
giving:
date start end
1 1983-01-01 <NA> <NA>
2 1990-10-15 1990-08-01 1991-03-31
3 1993-01-01 <NA> <NA>
Also note that sqldf works with several other back ends that do fully support dates so you are not restricted to SQLite. Suggest you review the FAQ and Examples at https://github.com/ggrothendieck/sqldf .
The following uses only dplyr and produces the desired data frame result.
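Note: On larger datasets you will likely run into memory issues, because the dummy join below builds every date/recession pair before filtering. In that case the sqldf approach proposed by G. Grothendieck is the better choice.
Hat-tip:
@nick-criswell for directing me to @ian-gow for this partial solution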
# Build data frame of dates within the interval [start, end]
df1 <- dt %>%
mutate(dummy=TRUE) %>%
left_join(recessions %>% mutate(dummy=TRUE)) %>%
filter(date >= start & date <= end) %>%
select(-dummy)
# Build data frame of all other dates with start=NA and end=NA
df2 <- dt %>%
mutate(dummy=TRUE) %>%
left_join(recessions %>% mutate(dummy=TRUE)) %>%
mutate(start=NA, end=NA) %>%
unique() %>%
select(-dummy)
# Now merge the two. Overwrite NA values with start and end dates
df <- df2 %>%
left_join(x=., y=df1, by="date") %>%
mutate(date, start = ifelse(is.na(start.y), as.character(start.x), as.character(start.y)),end = ifelse(is.na(end.y), as.character(end.x), as.character(end.y))) %>%
mutate(start=as.Date(start), end=as.Date(end) )
> df
# A tibble: 3 x 3
date start end
<date> <date> <date>
1 1983-01-01 NA NA
2 1990-10-15 1990-08-01 1991-03-31
3 1993-01-01 NA NA
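A lookup-style alternative that avoids the intermediate cross join can be sketched in base R. This is only an illustration: the recessions table is typed in by hand instead of loaded from tis (the first row comes from the output above, the second is a made-up stand-in), and the code assumes the intervals do not overlap.

```r
# Hand-typed stand-in for the NBER table (second row is illustrative only)
recessions <- data.frame(
  start = as.Date(c("1990-08-01", "2001-04-01")),
  end   = as.Date(c("1991-03-31", "2001-11-30"))
)
dt <- data.frame(date = as.Date(c("1983-01-01", "1990-10-15", "1993-01-01")))

# For each date, find the row of the interval that contains it (NA if none)
idx <- vapply(seq_along(dt$date), function(i) {
  hit <- which(dt$date[i] >= recessions$start & dt$date[i] <= recessions$end)
  if (length(hit) == 0) NA_integer_ else hit[1]
}, integer(1))

# Indexing with NA yields all-NA rows, which mimics a left join
result <- cbind(dt, recessions[idx, ])
rownames(result) <- NULL
print(result)
```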

Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value.
I've used the aggregate function to find the minimum values, then created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids appear more than once with different e_date values. Plus, the code gives me different results when I convert the date with as.Date instead of keeping its original format (integer). I think the answer is simple, but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data_full)), order by 'e_date', and, grouped by 'id', take the first row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
group_by(id) %>%
arrange(e_date) %>%
slice(1L)
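If we need a base R option, ave can be used. Because ave returns values of the same type as its input, the result is converted back to logical before it is used as a row index.
data_full[with(data_full, as.logical(ave(as.numeric(e_date), id, FUN = function(x) rank(x, ties.method = "first") == 1))), ]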
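Another base R route, sketched here under the assumption that e_date is (or can be converted to) a Date column, is to sort by id and date and keep the first row per id. The sample values are borrowed from the sqldf example further down, and sorted and earliest are illustrative names.

```r
data_full <- data.frame(
  id     = c("789", "123", "456", "123", "123", "456", "789"),
  e_date = as.Date(c("2016-05-01", "2016-07-02", "2016-08-25", "2015-12-11",
                     "2014-03-01", "2015-07-08", "2015-12-11"))
)

# Sort so that each id's earliest date comes first within its block
sorted <- data_full[order(data_full$id, data_full$e_date), ]

# duplicated() flags every row after the first occurrence of an id
earliest <- sorted[!duplicated(sorted$id), ]
print(earliest)
```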
Another answer that uses dplyr's filter command:
data_full %>%
group_by(id) %>%
filter(e_date == min(e_date))
You may use library(sqldf) to get the minimum date as follows:
data1<-data.frame(id=c("789","123","456","123","123","456","789"),
e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))
library(sqldf)
data2 = sqldf("SELECT id,
min(e_date) as 'earliest_date'
FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32
