Suppose I have a dataframe that looks like this:
id start_date death_date
1 2011-05-20 2014-12-11
2 2014-08-01 2016-01-05
3 2005-01-02 2015-10-20
4 2015-06-30 2016-02-14
5 2014-07-01 2014-09-03
I want to create a new column that contains the difference between death_date and start_date in months UNLESS start_date is before 2014-05-31. If start_date < 2014-05-31, then I want the new column to be the difference between death_date and 2014-05-31 in months.
Code to create sample dataframe:
id <- c(1:5)
start_date <- c(as.Date("2011-05-20"), as.Date("2014-08-01"),
as.Date("2005-01-02"), as.Date("2015-06-30"),
as.Date("2014-07-01"))
death_date <- c(as.Date("2014-12-11"), as.Date("2016-01-05"),
as.Date("2015-10-20"), as.Date("2016-02-14"),
as.Date("2014-09-03"))
example_dates <- data.frame(id, start_date, death_date)
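Try this (a month is approximated as 30 days):
example_dates$new_col <- round(ifelse(example_dates$start_date < as.Date("2014-05-31"),
example_dates$death_date - as.Date("2014-05-31"),
example_dates$death_date - example_dates$start_date) / 30, 2)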
# id start_date death_date new_col
# 1 1 2011-05-20 2014-12-11 6.47
# 2 2 2014-08-01 2016-01-05 17.40
# 3 3 2005-01-02 2015-10-20 16.90
# 4 4 2015-06-30 2016-02-14 7.63
# 5 5 2014-07-01 2014-09-03 2.13
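The same rule can also be written by clamping the start date instead of branching. The sketch below is an alternative to the ifelse() call, not the answerer's code. It assumes the same 30-day approximation of a month, and the names cutoff, effective_start and months_diff are made up for illustration.

```r
# Sample data from the question
example_dates <- data.frame(
  id = 1:5,
  start_date = as.Date(c("2011-05-20", "2014-08-01", "2005-01-02",
                         "2015-06-30", "2014-07-01")),
  death_date = as.Date(c("2014-12-11", "2016-01-05", "2015-10-20",
                         "2016-02-14", "2014-09-03"))
)

cutoff <- as.Date("2014-05-31")  # hypothetical name for the reference date

# pmax() keeps the later of start_date and cutoff, row by row
effective_start <- pmax(example_dates$start_date, cutoff)

# Days between the two dates, converted to approximate months (30 days each)
days <- as.numeric(difftime(example_dates$death_date, effective_start, units = "days"))
example_dates$months_diff <- round(days / 30, 2)
```

If calendar months are needed rather than 30-day blocks, the division by 30 is the part to replace.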
The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Datetime= ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00","2016-08-23 12:00:00","2016-08-29 03:00:00","2016-08-27 23:00:00","2016-09-02 02:00:00","2016-09-01 12:00:00","2016-09-09 04:00:00","2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like this (it may contain mistakes, since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get the earliest datetime for each ID
min_datetime_id <- df1 %>% group_by(ID) %>% summarise(min_datetime = min(Datetime))
# join back onto df1 by ID and compute the time difference in hours
df1 <- df1 %>% left_join(min_datetime_id, by = "ID") %>% mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
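The summarise-then-join steps above can also be collapsed into one grouped calculation. Here is a minimal base R sketch of that idea using ave(). It is not the answerer's code, the UTC time zone is an assumption (the question does not state one), and secs and first_secs are illustrative names.

```r
# Sample detections from the question (tz = "UTC" is an assumption)
df1 <- data.frame(
  ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  Datetime = as.POSIXct(c("2016-08-21 00:00:00", "2016-08-24 08:00:00",
                          "2016-08-23 12:00:00", "2016-08-29 03:00:00",
                          "2016-08-27 23:00:00", "2016-09-02 02:00:00",
                          "2016-09-01 12:00:00", "2016-09-09 04:00:00",
                          "2016-09-01 12:00:00", "2016-09-10 12:00:00"),
                        tz = "UTC")
)

secs <- as.numeric(df1$Datetime)            # seconds since the epoch
first_secs <- ave(secs, df1$ID, FUN = min)  # earliest detection per ID, repeated on every row

# Elapsed time since each individual's first detection, in hours
df1$Hours_since_beginning <- (secs - first_secs) / 3600
```

With this data the result agrees with the hand-calculated values in the question (0, 0, 60, 115, 167, 210, 276, 380, 276, 412).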
How can I use a date in one dataframe (dataframe 1) to look up a value in another dataframe (dataframe 2) when the date falls between a start date and an end date in dataframe 2?
For example, I have two dataframes. The first one only has dates; we will call it "dates".
library(lubridate)
date <- ymd(c("2017-06-01", "2013-01-01", "2014-03-01", "2008-01-01","2011-03-01","2009-03-01","2012-03-01","2015-08-01","2008-08-01"))
date <- as.data.frame(date)
> date
date
1 2017-06-01
2 2013-01-01
3 2014-03-01
4 2008-01-01
5 2011-03-01
6 2009-03-01
7 2012-03-01
8 2015-08-01
9 2008-08-01
My other dataframe, "df2", contains the start and end dates and a value to be assigned to "dates" whenever date$date falls between the start date and the end date of a row in "df2".
start_date <- dmy(c("1/6/2001","1/6/2002","1/6/2003","1/10/2011","1/11/2015","1/1/2016","1/1/2017","1/1/2018"))
end_date <-dmy(c("1/5/2002","1/5/2003","1/9/2011","1/10/2015","1/12/2015","1/12/2016","1/12/2017","1/12/2018"))
value <- c(2400,3600,4800,7000,7350,7717.5,8103.38,8508.54)
df2 <- data.frame(start_date, end_date, value)
> df2
start_date end_date value
1 2001-06-01 2002-05-01 2400.00
2 2002-06-01 2003-05-01 3600.00
3 2003-06-01 2011-09-01 4800.00
4 2011-10-01 2015-10-01 7000.00
5 2015-11-01 2015-12-01 7350.00
6 2016-01-01 2016-12-01 7717.50
7 2017-01-01 2017-12-01 8103.38
8 2018-01-01 2018-12-01 8508.54
In the end I would like this result:
date value
1 2017-06-01 8103.38
2 2013-01-01 7000.00
3 2014-03-01 7000.00
4 2008-01-01 4800.00
5 2011-03-01 4800.00
6 2009-03-01 4800.00
7 2012-03-01 7000.00
8 2015-08-01 7000.00
9 2008-08-01 4800.00
Using data.table, you can specify the join condition on the fly:
library(data.table)
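date1 <- as.data.table(date) # the "dates" data frame from the question
setDT(df2)
# non-equi join: add the value of the df2 row whose interval contains the date
date1[df2, on = .(date >= start_date, date <= end_date), value := i.value]
print(date1[order(date)]) # sorted by date for display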
date value
1: 2008-01-01 4800.00
2: 2008-08-01 4800.00
3: 2009-03-01 4800.00
4: 2011-03-01 4800.00
5: 2012-03-01 7000.00
6: 2013-01-01 7000.00
7: 2014-03-01 7000.00
8: 2015-08-01 7000.00
9: 2017-06-01 8103.38
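For readers without data.table, the interval lookup can be sketched in base R as well. This is an illustration of the same matching rule (start_date <= date <= end_date), not part of the original answer. The helper lookup_value and the data frame name dates are hypothetical, and the sketch assumes the intervals in df2 do not overlap.

```r
# Data from the question, written with ISO dates
dates <- data.frame(date = as.Date(c("2017-06-01", "2013-01-01", "2014-03-01",
                                     "2008-01-01", "2011-03-01", "2009-03-01",
                                     "2012-03-01", "2015-08-01", "2008-08-01")))
df2 <- data.frame(
  start_date = as.Date(c("2001-06-01", "2002-06-01", "2003-06-01", "2011-10-01",
                         "2015-11-01", "2016-01-01", "2017-01-01", "2018-01-01")),
  end_date = as.Date(c("2002-05-01", "2003-05-01", "2011-09-01", "2015-10-01",
                       "2015-12-01", "2016-12-01", "2017-12-01", "2018-12-01")),
  value = c(2400, 3600, 4800, 7000, 7350, 7717.5, 8103.38, 8508.54)
)

# Hypothetical helper: value of the first interval containing d, or NA if none does
lookup_value <- function(d) {
  hit <- which(d >= df2$start_date & d <= df2$end_date)
  if (length(hit) == 0) NA_real_ else df2$value[hit[1]]
}

# Apply the lookup to every date, keeping the original row order
dates$value <- vapply(seq_len(nrow(dates)),
                      function(i) lookup_value(dates$date[i]),
                      numeric(1))
```

Dates that fall in a gap between intervals (for example, between 2002-05-01 and 2002-06-01) get NA here. This loop is fine for small tables; the non-equi join is the better fit for large ones.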
I am working with electronic health records data and would like to create an indicator variable called "episode" that joins antibiotic medications that occur within 7 days of each other. Below is a mock dataset and the output that I would like. I program in R.
df2=data.frame(
id = c(01,01,01,01,01,02,02,03,04),
date = c("2015-01-01 11:00",
"2015-01-06 13:29",
"2015-01-10 12:46",
"2015-01-25 14:45",
"2015-02-15 13:30",
"2015-01-01 10:00",
"2015-05-05 15:20",
"2015-01-01 15:19",
"2015-08-01 13:15"),
abx = c("AMPICILLIN",
"ERYTHROMYCIN",
"NEOMYCIN",
"AMPICILLIN",
"VANCOMYCIN",
"VANCOMYCIN",
"NEOMYCIN",
"PENICILLIN",
"ERYTHROMYCIN"));
df2
Output desired
id date abx episode
1 2015-01-01 11:00 AMPICILLIN 1
1 2015-01-06 13:29 ERYTHROMYCIN 1
1 2015-01-10 12:46 NEOMYCIN 1
1 2015-01-25 14:45 AMPICILLIN 2
1 2015-02-15 13:30 VANCOMYCIN 3
2 2015-01-01 10:00 VANCOMYCIN 1
2 2015-05-05 15:20 NEOMYCIN 1
3 2015-01-01 15:19 PENICILLIN 1
4 2015-08-01 13:15 ERYTHROMYCIN 1
Use ave like this:
grpno <- function(x) cumsum(c(TRUE, diff(x) >=7 ))
transform(df2, episode = ave(as.numeric(as.Date(date)), id, FUN = grpno))
giving:
id date abx episode
1 1 2015-01-01 11:00 AMPICILLIN 1
2 1 2015-01-06 13:29 ERYTHROMYCIN 1
3 1 2015-01-10 12:46 NEOMYCIN 1
4 1 2015-01-25 14:45 AMPICILLIN 2
5 1 2015-02-15 13:30 VANCOMYCIN 3
6 2 2015-01-01 10:00 VANCOMYCIN 1
7 2 2015-05-05 15:20 NEOMYCIN 2
8 3 2015-01-01 15:19 PENICILLIN 1
9 4 2015-08-01 13:15 ERYTHROMYCIN 1
or with dplyr and grpno from above:
library(dplyr)
df2 %>%
group_by(id) %>%
mutate(episode = date %>% as.Date %>% as.numeric %>% grpno) %>%
ungroup()
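To see what grpno does, it helps to run it on the dates of a single id. The sketch below reuses grpno exactly as defined above and feeds it the dates for id 1 and id 2 from the mock data; the variable names are illustrative.

```r
grpno <- function(x) cumsum(c(TRUE, diff(x) >= 7))

# Dates for id 1 and id 2, converted to day numbers (time of day is dropped)
id1_days <- as.numeric(as.Date(c("2015-01-01", "2015-01-06", "2015-01-10",
                                 "2015-01-25", "2015-02-15")))
id2_days <- as.numeric(as.Date(c("2015-01-01", "2015-05-05")))

gaps_id1 <- diff(id1_days)       # 5 4 15 21: days between consecutive rows
episodes_id1 <- grpno(id1_days)  # 1 1 1 2 3: a gap of 7+ days starts a new episode
episodes_id2 <- grpno(id2_days)  # 1 2: the two rows are 124 days apart
```

Two assumptions are worth checking against your own data: the rows must already be sorted by date within each id, and a gap of exactly 7 days starts a new episode. If "within 7 days" should include day 7, `diff(x) > 7` would be the comparison to use.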
I have a column of IDs and, for each ID, several event dates. I want to create two columns: for each ID, the first row should hold the first date and the next consecutive date, and each following row should hold the previous row's second date and the next consecutive date for that ID. An example:
This is the data I have
id date
1 1 2015-01-01
2 1 2015-01-18
3 1 2015-08-02
4 2 2015-01-01
5 2 2015-01-13
6 3 2015-01-01
This is data I want
id date1 date2
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using dplyr:
library(dplyr)
df %>% group_by(id) %>%
mutate(date2 = lead(date))
id date date2
(int) (fctr) (fctr)
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using data.table, you can do as follows:
require(data.table)
DT[, .(date1 = date, date2 = shift(date, type = "lead")), by = id]
Or simply (also mentioned by @docendodiscimus):
DT[, date2 := shift(date, type = "lead"), by = id]
Also, if you want to create several lead columns at once (edited to use @docendodiscimus's comment to simplify the code):
i = 1:5
DT[, paste0("date", i+1) := shift(date, i, type = "lead"), by = id]
Base R solution using transform() and ave():
transform(df,date1=date,date2=ave(date,id,FUN=function(x) c(x[-1L],NA)),date=NULL);
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
The above line of code produces a copy of the data.frame. The return value can be assigned over the original df, assigned to a new variable, or passed as an argument/operand to a function/operator. If you want to modify it in-place, which would be a more efficient way to overwrite df, you can do this:
df$date2 <- ave(df$date,df$id,FUN=function(x) c(x[-1L],NA));
colnames(df)[colnames(df)=='date'] <- 'date1';
df;
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
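Another base R option, assuming the rows are already sorted by id (note that ifelse() strips attributes such as the Date class, so the result is only directly readable when date is a character column):
df$date2 = ifelse(df$id==c(df$id[-1],-1), c(df$date[-1],NA), NA)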
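If date is stored as a Date, the ifelse() one-liner returns bare numbers because ifelse() drops the class. A minimal sketch of a class-preserving variant follows. It keeps the same "compare each id with the next row's id" idea, assumes the rows are sorted by id and date, and uses the illustrative names same_id_next and next_date.

```r
# Sample data from the question, with date stored as a Date
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("2015-01-01", "2015-01-18", "2015-08-02",
                   "2015-01-01", "2015-01-13", "2015-01-01"))
)

# TRUE where the following row belongs to the same id (-1 is a sentinel for the last row)
same_id_next <- df$id == c(df$id[-1], -1)

# Shift the dates up by one row; c() on a Date vector keeps the Date class
next_date <- c(df$date[-1], NA)

# Blank out the shifted date wherever the next row is a different id
next_date[!same_id_next] <- NA
df$date2 <- next_date
```

Subset assignment is used instead of ifelse() so that date2 stays a Date column.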
I have a dataset in long form with start and end dates. Each id can have multiple start and end dates.
I need to find the difference between the first end date and the second start date. I am not sure how to use two rows to calculate the difference. Any help is appreciated.
df=data.frame(c(1,2,2,2,3,4,4),
as.Date(c( "2010-10-01","2009-09-01","2014-01-01","2014-02-01","2009-01-01","2013-03-01","2014-03-01")),
as.Date(c("2016-04-30","2013-12-31","2014-01-31","2016-04-30","2014-02-28","2013-05-01","2014-08-31")));
names(df)=c('id','start','end')
My output would look like this:
df$diff=c(NA,1,1,NA,NA,304, NA)
Here's an attempt in base R that I think does what you want:
df$diff <- NA
split(df$diff, df$id) <- by(df, df$id, FUN=function(SD) c(SD$start[-1], NA) - SD$end)
df
# id start end diff
#1 1 2010-10-01 2016-04-30 NA
#2 2 2009-09-01 2013-12-31 1
#3 2 2014-01-01 2014-01-31 1
#4 2 2014-02-01 2016-04-30 NA
#5 3 2009-01-01 2014-02-28 NA
#6 4 2013-03-01 2013-05-01 304
#7 4 2014-03-01 2014-08-31 NA
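The split()/by() combination above is compact but can be hard to read. As a sketch of the same calculation in a more step-by-step form, the version below uses ave() on day numbers. It is an alternative illustration rather than the answerer's code, it assumes rows are sorted by start within each id, and next_start is an illustrative name.

```r
# Sample data from the question
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 4, 4),
  start = as.Date(c("2010-10-01", "2009-09-01", "2014-01-01", "2014-02-01",
                    "2009-01-01", "2013-03-01", "2014-03-01")),
  end = as.Date(c("2016-04-30", "2013-12-31", "2014-01-31", "2016-04-30",
                  "2014-02-28", "2013-05-01", "2014-08-31"))
)

# Within each id, shift the start dates up one row (as day numbers); the last row gets NA
next_start <- ave(as.numeric(df$start), df$id, FUN = function(s) c(s[-1], NA))

# Gap in days between this row's end and the next row's start for the same id
df$diff <- next_start - as.numeric(df$end)
```

The result is a plain numeric column of days, with NA on the last row of each id.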
Alternatively, in data.table it would be:
library(data.table)
setDT(df)[, diff := shift(start, n = 1, type = "lead") - end, by = id]
Here's an alternative using the popular dplyr package:
library(dplyr)
df %>%
group_by(id) %>%
mutate(diff = difftime(lead(start), end, units = "days"))
# id start end diff
# (dbl) (date) (date) (dfft)
# 1 1 2010-10-01 2016-04-30 NA days
# 2 2 2009-09-01 2013-12-31 1 days
# 3 2 2014-01-01 2014-01-31 1 days
# 4 2 2014-02-01 2016-04-30 NA days
# 5 3 2009-01-01 2014-02-28 NA days
# 6 4 2013-03-01 2013-05-01 304 days
# 7 4 2014-03-01 2014-08-31 NA days
You can wrap diff in as.numeric if you want.
Again with base R, you can get the length of each interval in days (end minus start within the same row) as follows:
df$noofdays <- as.numeric(difftime(df$end, df$start, units = "days"))