I have a dataset in long form with start and end dates. Each id can have multiple start and end dates.
Within each id, I need to find the difference between the end date of one row and the start date of the next (for example, between the first end date and the second start date). I am not sure how to use two rows to calculate the difference. Any help is appreciated.
df=data.frame(c(1,2,2,2,3,4,4),
as.Date(c( "2010-10-01","2009-09-01","2014-01-01","2014-02-01","2009-01-01","2013-03-01","2014-03-01")),
as.Date(c("2016-04-30","2013-12-31","2014-01-31","2016-04-30","2014-02-28","2013-05-01","2014-08-31")));
names(df)=c('id','start','end')
My output would look like this:
df$diff=c(NA,1,1,NA,NA,304, NA)
Here's an attempt in base R that I think does what you want:
df$diff <- NA
split(df$diff, df$id) <- by(df, df$id, FUN=function(SD) c(SD$start[-1], NA) - SD$end)
df
# id start end diff
#1 1 2010-10-01 2016-04-30 NA
#2 2 2009-09-01 2013-12-31 1
#3 2 2014-01-01 2014-01-31 1
#4 2 2014-02-01 2016-04-30 NA
#5 3 2009-01-01 2014-02-28 NA
#6 4 2013-03-01 2013-05-01 304
#7 4 2014-03-01 2014-08-31 NA
Alternatively, in data.table it would be:
library(data.table)
setDT(df)[, diff := shift(start, n = 1, type = "lead") - end, by = id]
Here's an alternative using the popular dplyr package:
library(dplyr)
df %>%
group_by(id) %>%
mutate(diff = difftime(lead(start), end, units = "days"))
# id start end diff
# (dbl) (date) (date) (dfft)
# 1 1 2010-10-01 2016-04-30 NA days
# 2 2 2009-09-01 2013-12-31 1 days
# 3 2 2014-01-01 2014-01-31 1 days
# 4 2 2014-02-01 2016-04-30 NA days
# 5 3 2009-01-01 2014-02-28 NA days
# 6 4 2013-03-01 2013-05-01 304 days
# 7 4 2014-03-01 2014-08-31 NA days
You can wrap diff in as.numeric if you want.
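If plain numbers are preferred over a difftime column, a minimal base R sketch could look like the one below. It is not part of the original answers, and it assumes the rows are already sorted by start date within each id.

```r
# Sample data from the question
df <- data.frame(
  id    = c(1, 2, 2, 2, 3, 4, 4),
  start = as.Date(c("2010-10-01", "2009-09-01", "2014-01-01", "2014-02-01",
                    "2009-01-01", "2013-03-01", "2014-03-01")),
  end   = as.Date(c("2016-04-30", "2013-12-31", "2014-01-31", "2016-04-30",
                    "2014-02-28", "2013-05-01", "2014-08-31"))
)

# Within each id, shift start up by one row; the last row of a group gets NA
next_start <- ave(as.numeric(df$start), df$id,
                  FUN = function(x) c(x[-1], NA))

# Dates are stored as days since 1970-01-01, so the difference is in days
df$diff <- next_start - as.numeric(df$end)
df$diff
```

Working on the numeric representation avoids any surprises with how ave() handles the Date class, at the cost of losing the "days" label.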
Again with base R, if what you need is the length of each interval (end minus start within the same row), you can do the following:
df$noofdays <- as.numeric(as.difftime(df$end-df$start, units=c("days"), format="%Y-%m-%d"))
Here is my toy dataset:
df <- tibble::tribble(
~date, ~value,
"2007-01-31", 25,
"2007-05-31", 31,
"2007-12-31", 26
)
I am creating month-end date series using the following code.
df %>%
mutate(date = as.Date(date)) %>%
complete(date = seq(as.Date("2007-01-31"), as.Date("2019-12-31"), by="month"))
However, I am not getting the correct month-end dates.
date value
<date> <dbl>
1 2007-01-31 25
2 2007-03-03 NA
3 2007-03-31 NA
4 2007-05-01 NA
5 2007-05-31 31
6 2007-07-01 NA
7 2007-07-31 NA
8 2007-08-31 NA
9 2007-10-01 NA
10 2007-10-31 NA
11 2007-12-01 NA
12 2007-12-31 26
What am I missing here? I am okay using other functions from any other package.
You don't need the complete function to generate the dates; you can do this in base R.
Since the last day of the month differs from month to month, we can create a sequence of month-start dates and subtract 1 day from each.
seq(as.Date("2007-02-01"), as.Date("2008-01-01"), by="month") - 1
#[1] "2007-01-31" "2007-02-28" "2007-03-31" "2007-04-30" "2007-05-31" "2007-06-30"
# "2007-07-31" "2007-08-31" "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31"
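The odd dates in the question come from stepping by month from the 31st: when that day does not exist in the target month, the result rolls over into the following month. A small base R sketch of the problem and the fix, using the dates from the question:

```r
# Stepping by month from the 31st overshoots the short months
bad <- seq(as.Date("2007-01-31"), by = "month", length.out = 3)
bad   # compare with rows 1 to 3 of the question's output

# Every month has a 1st, so stepping from it is safe;
# subtracting one day then lands on the previous month's last day
good <- seq(as.Date("2007-02-01"), by = "month", length.out = 3) - 1
good  # "2007-01-31" "2007-02-28" "2007-03-31"
```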
Using the same logic on the updated data frame, we can do:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
tidyr::complete(date = seq(min(date) + 1, max(date) + 1, by="month") - 1)
# date value
# <date> <dbl>
# 1 2007-01-31 25
# 2 2007-02-28 NA
# 3 2007-03-31 NA
# 4 2007-04-30 NA
# 5 2007-05-31 31
# 6 2007-06-30 NA
# 7 2007-07-31 NA
# 8 2007-08-31 NA
# 9 2007-09-30 NA
#10 2007-10-31 NA
#11 2007-11-30 NA
#12 2007-12-31 26
Consider the following data frame (df):
"id" "date_start" "date_end"
a 2012-03-11 2012-03-27
a 2012-05-17 2012-07-21
a 2012-06-09 2012-08-18
b 2015-06-21 2015-07-12
b 2015-06-27 2015-08-04
b 2015-07-02 2015-08-01
c 2017-10-11 2017-11-08
c 2017-11-27 2017-12-15
c 2018-01-02 2018-02-03
I am trying to create a new data frame with sequences of monthly dates, starting one month prior to the minimum value of "date_start" for each group in "id". Each sequence contains only first-of-month dates and ends at the maximum value of "date_end" for that group.
This is a reproducible example for my data frame:
library(lubridate)
id <- c("a","a","a","b","b","b","c","c","c")
df <- data.frame(id)
df$date_start <- as.Date(c("2012-03-11", "2012-05-17","2012-06-09", "2015-06-21", "2015-06-27","2015-07-02", "2017-10-11", "2017-11-27","2018-01-02"))
df$date_end <- as.Date(c("2012-03-27", "2012-07-21","2012-08-18", "2015-07-12", "2015-08-04","2015-08-01", "2017-11-08", "2017-12-15","2018-02-03"))
What I have tried to do:
library(dplyr)
library(DescTools)
library(timeDate)
df2 <- df %>%
group_by(id) %>%
summarize(start= floor_date(AddMonths(min(date_start),-1), "month"),end=max(date_end)) %>%
do(data.frame(id=.$id, date=seq(.$start,.$end,by="1 month")))
The code works perfectly fine for an ungrouped data frame. Somehow, with the grouping by "id" it throws an error message:
Error in seq.default(.$date_start, .$date_end, by = "1 month") :
'from' must be of length 1
This is what the desired output looks like for the data frame given above:
"id" "date"
a 2012-02-01
a 2012-03-01
a 2012-04-01
a 2012-05-01
a 2012-06-01
a 2012-07-01
a 2012-08-01
b 2015-05-01
b 2015-06-01
b 2015-07-01
b 2015-08-01
c 2017-09-01
c 2017-10-01
c 2017-11-01
c 2017-12-01
c 2018-01-01
c 2018-02-01
Is there a way to alter the code to function with a grouped data frame? Is there an altogether different approach for this operation?
Another option using dplyr and lubridate is to first summarise a list of Date objects for each id and then unnest them to expand them into different rows.
library(dplyr)
library(lubridate)
df %>%
group_by(id) %>%
summarise(date = list(seq(floor_date(min(date_start),unit = "month") - months(1),
floor_date(max(date_end), unit = "month"), by = "month"))) %>%
tidyr::unnest(date)
# id date
# <fct> <date>
# 1 a 2012-02-01
# 2 a 2012-03-01
# 3 a 2012-04-01
# 4 a 2012-05-01
# 5 a 2012-06-01
# 6 a 2012-07-01
# 7 a 2012-08-01
# 8 b 2015-05-01
# 9 b 2015-06-01
#10 b 2015-07-01
#11 b 2015-08-01
#12 c 2017-09-01
#13 c 2017-10-01
#14 c 2017-11-01
#15 c 2017-12-01
#16 c 2018-01-01
#17 c 2018-02-01
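In your code, the result of summarize is no longer grouped, so do() receives all rows at once and .$start has length greater than 1, which seq() rejects. Grouping by row number makes do() run once per row and gives the same result as below: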
df %>%
group_by(id) %>%
summarize(start= floor_date(AddMonths(min(date_start),-1), "month"),end=max(date_end)) %>%
group_by(rn=row_number()) %>%
do(data.frame(id=.$id, date=seq(.$start, .$end, by="1 month"))) %>%
ungroup() %>%
select(-rn)
# A tibble: 17 x 2
id date
<fct> <date>
1 a 2012-02-01
2 a 2012-03-01
3 a 2012-04-01
4 a 2012-05-01
5 a 2012-06-01
6 a 2012-07-01
7 a 2012-08-01
8 b 2015-05-01
9 b 2015-06-01
10 b 2015-07-01
11 b 2015-08-01
12 c 2017-09-01
13 c 2017-10-01
14 c 2017-11-01
15 c 2017-12-01
16 c 2018-01-01
17 c 2018-02-01
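For comparison, here is a base R sketch of the same operation without any extra packages. It is not from the original answers; the helper names first_of_month and month_seq are made up for illustration.

```r
# Sample data from the question
df <- data.frame(id = rep(c("a", "b", "c"), each = 3))
df$date_start <- as.Date(c("2012-03-11", "2012-05-17", "2012-06-09",
                           "2015-06-21", "2015-06-27", "2015-07-02",
                           "2017-10-11", "2017-11-27", "2018-01-02"))
df$date_end   <- as.Date(c("2012-03-27", "2012-07-21", "2012-08-18",
                           "2015-07-12", "2015-08-04", "2015-08-01",
                           "2017-11-08", "2017-12-15", "2018-02-03"))

# Hypothetical helper: truncate a Date to the first day of its month
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# Hypothetical helper: monthly sequence for one id's rows
month_seq <- function(g) {
  # Step back one month from the first of the earliest start month
  from <- seq(first_of_month(min(g$date_start)),
              by = "-1 month", length.out = 2)[2]
  data.frame(id = g$id[1],
             date = seq(from, max(g$date_end), by = "month"))
}

# Apply per id and stack the pieces
out <- do.call(rbind, lapply(split(df, df$id), month_seq))
rownames(out) <- NULL
out
```

Because split() hands month_seq one group at a time, min() and max() always return a single date, which is exactly what seq() needs.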
Use as.yearmon to convert to year/month. Note that yearmon objects are represented internally as year + fraction, where the fraction is 0 for January, 1/12 for February, 2/12 for March and so on. Then use as.Date to convert the result to Date class. do() allows each group's result to have a different number of rows than the group itself.
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
do( data.frame(month = as.Date(seq(as.yearmon(min(.$date_start)) - 1/12,
as.yearmon(max(.$date_end)),
1/12) ))) %>%
ungroup
giving:
# A tibble: 17 x 2
id month
<fct> <date>
1 a 2012-02-01
2 a 2012-03-01
3 a 2012-04-01
4 a 2012-05-01
5 a 2012-06-01
6 a 2012-07-01
7 a 2012-08-01
8 b 2015-05-01
9 b 2015-06-01
10 b 2015-07-01
11 b 2015-08-01
12 c 2017-09-01
13 c 2017-10-01
14 c 2017-11-01
15 c 2017-12-01
16 c 2018-01-01
17 c 2018-02-01
This could also be written like this using the same library statements as above:
Seq <- function(st, en) as.Date(seq(as.yearmon(st) - 1/12, as.yearmon(en), 1/12))
df %>%
group_by(id) %>%
do( data.frame(month = Seq(min(.$date_start), max(.$date_end))) ) %>%
ungroup
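To make the year + fraction representation concrete, here is a small sketch. It assumes the zoo package is installed and uses one of the dates from the question.

```r
library(zoo)

# A date in March 2012 becomes 2012 + 2/12 internally
ym <- as.yearmon(as.Date("2012-03-11"))
as.numeric(ym)

# Subtracting 1/12 moves back one month; as.Date() returns the 1st of that month
prev <- ym - 1/12
as.Date(prev)
```

This is why the answer can build the monthly sequence with a plain numeric step of 1/12.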
A couple of basic data manipulations. I searched with different wordings and couldn't find much.
I have data structured as below. In reality the hourly data is continuous, but I just included 4 lines as an example.
start <- as.POSIXlt(c('2017-1-1 1:00','2017-1-1 2:00','2017-1-2 1:00','2017-1-2 2:00'))
values <- as.numeric(c(2,5,4,3))
df <- data.frame(start,values)
df
start values
1 2017-01-01 01:00:00 2
2 2017-01-01 02:00:00 5
3 2017-01-02 01:00:00 4
4 2017-01-02 02:00:00 3
I would like to add a couple columns that:
1) Show the max of the same day.
2) Show the max of the previous day.
3) Show the value of one previous hour.
The goal is to have an output like:
MaxValueDay <- as.numeric(c(5,5,4,4))
MaxValueYesterday <- as.numeric(c(NA,NA,5,5))
PreviousHourValue <- as.numeric(c(NA,2,NA,4))
df2 <- data.frame(start,values,MaxValueDay,MaxValueYesterday,PreviousHourValue)
df2
start values MaxValueDay MaxValueYesterday PreviousHourValue
1 2017-01-01 01:00:00 2 5 NA NA
2 2017-01-01 02:00:00 5 5 NA 2
3 2017-01-02 01:00:00 4 4 5 NA
4 2017-01-02 02:00:00 3 4 5 4
Any help would be greatly appreciated. Thanks
A solution using dplyr, magrittr, and lubridate packages:
library(dplyr)
library(magrittr)
library(lubridate)
df %>%
within(MaxValueDay <- sapply(as.Date(start), function (x) max(df$values[which(x==as.Date(start))]))) %>%
within(MaxValueYesterday <- MaxValueDay[sapply(as.Date(start)-1, match, as.Date(start))]) %>%
within(PreviousHourValue <- values[sapply(start-hours(1), match, start)])
# start values MaxValueDay MaxValueYesterday PreviousHourValue
# 1 2017-01-01 01:00:00 2 5 NA NA
# 2 2017-01-01 02:00:00 5 5 NA 2
# 3 2017-01-02 01:00:00 4 4 5 NA
# 4 2017-01-02 02:00:00 3 4 5 4
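The same three columns can also be sketched in base R with ave(), tapply() and match(). This is only an illustration under stated assumptions: the timestamps are taken to be in UTC (as.Date() on date-times can shift the day in other time zones), and the hourly series is assumed to have exact one-hour steps.

```r
# Sample data from the question; the UTC time zone is an assumption
start <- as.POSIXct(c("2017-01-01 01:00", "2017-01-01 02:00",
                      "2017-01-02 01:00", "2017-01-02 02:00"), tz = "UTC")
df <- data.frame(start, values = c(2, 5, 4, 3))

# Calendar day of each row, and the day before it, as text keys
day      <- format(df$start, "%Y-%m-%d")
prev_day <- format(as.Date(day) - 1)

# 1) Max of the same day, repeated on every row of that day
df$MaxValueDay <- ave(df$values, day, FUN = max)

# 2) Max of the previous day: look up yesterday's key in the daily maxima
daily_max <- tapply(df$values, day, max)
df$MaxValueYesterday <- as.vector(daily_max)[match(prev_day, names(daily_max))]

# 3) Value one hour earlier: match on seconds since the epoch
secs <- as.numeric(df$start)
df$PreviousHourValue <- df$values[match(secs - 3600, secs)]
df
```

match() returns NA when no row exists for the previous day or hour, which produces the NA entries in the desired output.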
I have a column of IDs and, for each ID, several event dates. I want to create two columns for each ID: one with a date and the other with the next consecutive date. Each following row for the ID should start with the date from the previous row's second column, paired with the next consecutive date for that ID. An example:
This is the data I have
id date
1 1 2015-01-01
2 1 2015-01-18
3 1 2015-08-02
4 2 2015-01-01
5 2 2015-01-13
6 3 2015-01-01
This is the data I want
id date1 date2
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using dplyr:
library(dplyr)
df %>% group_by(id) %>%
mutate(date2 = lead(date))
id date date2
(int) (fctr) (fctr)
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using data.table, you can do as follows:
require(data.table)
DT[, .(date1 = date, date2 = shift(date, type = "lead")), by = id]
Or simply (also mentioned by @docendodiscimus)
DT[, date2 := shift(date, type = "lead"), by = id]
Also, if you are interested in creating several successive lead columns at once (edited, taking advantage of @docendodiscimus's comment to simplify the code):
i = 1:5
DT[, paste0("date", i+1) := shift(date, i, type = "lead"), by = id]
Base R solution using transform() and ave():
transform(df,date1=date,date2=ave(date,id,FUN=function(x) c(x[-1L],NA)),date=NULL);
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
The above line of code produces a copy of the data.frame. The return value can be assigned over the original df, assigned to a new variable, or passed as an argument/operand to a function/operator. If you want to modify it in-place, which would be a more efficient way to overwrite df, you can do this:
df$date2 <- ave(df$date,df$id,FUN=function(x) c(x[-1L],NA));
colnames(df)[colnames(df)=='date'] <- 'date1';
df;
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
df$date2 = as.Date(ifelse(df$id==c(df$id[-1],-1), c(as.character(df$date)[-1],NA), NA))  # ifelse() drops the Date class, so go through character
I have the following data.frame:
df <- data.frame(id=c(1,2,3),
first.date=as.Date(c("2014-01-01", "2014-03-01", "2014-06-01")),
second.date=as.Date(c("2015-01-01", "2015-03-01", "2015-06-1")),
third.date=as.Date(c("2016-01-01", "2017-03-01", "2018-06-1")),
fourth.date=as.Date(c("2017-01-01", "2018-03-01", "2019-06-1")))
> df
id first.date second.date third.date fourth.date
1 1 2014-01-01 2015-01-01 2016-01-01 2017-01-01
2 2 2014-03-01 2015-03-01 2017-03-01 2018-03-01
3 3 2014-06-01 2015-06-01 2018-06-01 2019-06-01
Each row represents three timespans; i.e. the time spans between first.date and second.date, second.date and third.date, and third.date and fourth.date respectively.
I would like to, for lack of a better word, unnest the data frame to obtain this instead:
id StartDate EndDate
1 1 2014-01-01 2015-01-01
2 1 2015-01-01 2016-01-01
3 1 2016-01-01 2017-01-01
4 2 2014-03-01 2015-03-01
5 2 2015-03-01 2017-03-01
6 2 2017-03-01 2018-03-01
7 3 2014-06-01 2015-06-01
8 3 2015-06-01 2018-06-01
9 3 2018-06-01 2019-06-01
I have been playing around with the unnest function from the tidyr package, but I came to the conclusion that I don't think it's what I'm really looking for.
Any suggestions?
You can try tidyr/dplyr as follows:
library(tidyr)
library(dplyr)
df %>% gather(DateType, StartDate, -id) %>% select(-DateType) %>% arrange(id) %>% group_by(id) %>% mutate(EndDate = lead(StartDate))
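You can drop the last row in each id group (the one whose EndDate is NA) by appending the following to the pipeline above:
%>% slice(-4)
Note that slice(-4) relies on there being exactly four date columns; slice(-n()) drops the last row of each group regardless of how many there are.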
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), then melt the dataset to long format, use shift with type='lead' grouped by 'id' and then remove the NA elements.
library(data.table)
na.omit(melt(setDT(df), id.var='id')[, shift(value,0:1, type='lead') , id])
# id V1 V2
#1: 1 2014-01-01 2015-01-01
#2: 1 2015-01-01 2016-01-01
#3: 1 2016-01-01 2017-01-01
#4: 2 2014-03-01 2015-03-01
#5: 2 2015-03-01 2017-03-01
#6: 2 2017-03-01 2018-03-01
#7: 3 2014-06-01 2015-06-01
#8: 3 2015-06-01 2018-06-01
#9: 3 2018-06-01 2019-06-01
The column names can be changed by using either setnames or earlier in the shift step.
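For comparison, a base R sketch that pairs each date column with the next one and stacks the results. It is not from the original answers; date_cols simply lists the question's columns in chronological order, and the output names follow the desired result.

```r
# Sample data from the question
df <- data.frame(
  id = c(1, 2, 3),
  first.date  = as.Date(c("2014-01-01", "2014-03-01", "2014-06-01")),
  second.date = as.Date(c("2015-01-01", "2015-03-01", "2015-06-01")),
  third.date  = as.Date(c("2016-01-01", "2017-03-01", "2018-06-01")),
  fourth.date = as.Date(c("2017-01-01", "2018-03-01", "2019-06-01"))
)

date_cols <- c("first.date", "second.date", "third.date", "fourth.date")

# One data frame per consecutive pair of columns (1-2, 2-3, 3-4)
pairs <- lapply(seq_len(length(date_cols) - 1), function(i) {
  data.frame(id        = df$id,
             StartDate = df[[date_cols[i]]],
             EndDate   = df[[date_cols[i + 1]]])
})

# Stack the pairs and restore the per-id ordering
long <- do.call(rbind, pairs)
long <- long[order(long$id, long$StartDate), ]
rownames(long) <- NULL
long
```

Since only adjacent columns are paired, no NA rows are created and nothing needs to be dropped afterwards.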