I would like to compare 2 time series by their time of day. The 2 series are from different dates (e.g., 2018-08-10 for the first series and 2018-09-10 for the second) but have the same timestamps. Is it possible to do a cbind/merge between the 2 series that takes into account only the time of day of each timestamp and not its date?
Thanks
I have no idea what your data looks like, so next time please create a reproducible example. I created 2 data.frames that can serve as an example. You should know that xts needs a valid time series index, and an hms time series is not a valid index for xts.
That being said, you can always transform an xts object into a data.frame with:
my_df <- data.frame(times = index(my_xts), coredata(my_xts))
Now for the example:
I'm showing it via dplyr, but merge will work as well if you create a hms object in each data.frame. I'm using as.hms from the hms package to create a hms object in the data.frames and join them together.
x <- Sys.time() + 1:10*60 # today
y <- x - 60*60*24 # same time yesterday
df1 <- data.frame(times = x, val1 = 1:10)
df2 <- data.frame(times = y, val2 = 10:1)
library(dplyr)
# create hms object in df1 and in df2 on the fly
df1 %>%
  mutate(times2 = hms::as.hms(times)) %>%
  inner_join(df2 %>% mutate(times2 = hms::as.hms(times)), by = "times2")
times.x val1 times2 times.y val2
1 2018-10-01 18:26:05 1 18:26:05.421764 2018-09-30 18:26:05 10
2 2018-10-01 18:27:05 2 18:27:05.421764 2018-09-30 18:27:05 9
3 2018-10-01 18:28:05 3 18:28:05.421764 2018-09-30 18:28:05 8
4 2018-10-01 18:29:05 4 18:29:05.421764 2018-09-30 18:29:05 7
5 2018-10-01 18:30:05 5 18:30:05.421764 2018-09-30 18:30:05 6
6 2018-10-01 18:31:05 6 18:31:05.421764 2018-09-30 18:31:05 5
7 2018-10-01 18:32:05 7 18:32:05.421764 2018-09-30 18:32:05 4
8 2018-10-01 18:33:05 8 18:33:05.421764 2018-09-30 18:33:05 3
9 2018-10-01 18:34:05 9 18:34:05.421764 2018-09-30 18:34:05 2
10 2018-10-01 18:35:05 10 18:35:05.421764 2018-09-30 18:35:05 1
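The remark above that merge will work as well can be sketched in base R. This is only a minimal sketch under assumptions: the data below are made up, and the key column name `tod` is hypothetical. Instead of an hms object it uses a character key built with format(), which also drops fractional seconds, so two timestamps match whenever they agree to the second.

```r
# Two hypothetical series sampled at the same clock times on different dates
t1 <- as.POSIXct("2018-08-10 09:00:00", tz = "UTC") + 0:4 * 60
t2 <- as.POSIXct("2018-09-10 09:00:00", tz = "UTC") + 0:4 * 60
df1 <- data.frame(times = t1, val1 = 1:5)
df2 <- data.frame(times = t2, val2 = 5:1)

# Build a join key that keeps only the time of day (the date is dropped)
df1$tod <- format(df1$times, "%H:%M:%S")
df2$tod <- format(df2$times, "%H:%M:%S")

# Base R merge on the time-of-day key; suffixes label the two timestamp columns
merged <- merge(df1, df2, by = "tod", suffixes = c(".x", ".y"))
merged
```

If sub-second precision matters, a numeric key (seconds since midnight) would probably be a better choice than a formatted string.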
Related
I have a dataset with dates in tibble format from tidyverse/dplyr.
library(tidyverse)
A = seq(from = as.Date("2019/1/1"),to=as.Date("2022/1/1"), length.out = 252*3)
length(A)
x = rnorm(252*3)
d = tibble(A,x);d
which results in:
# A tibble: 756 x 2
A x
<date> <dbl>
1 2019-01-01 1.43
2 2019-01-02 0.899
3 2019-01-03 0.658
4 2019-01-05 -0.0720
5 2019-01-06 -1.99
6 2019-01-08 -0.743
7 2019-01-09 0.426
8 2019-01-11 0.00675
9 2019-01-12 0.967
10 2019-01-14 -0.606
# ... with 746 more rows
I also have a date of interest, say:
start = as.Date("2021/12/15");start
I want to subset the dataset from this specific date (start) back one year, where a year has 252 observations.
I tried:
d%>%
dplyr::filter(A<start)%>%
dplyr::slice_tail(n=252)
but I don't like it because my real dataset has more than one factor label, and if I use this I will get only 252 observations.
I also tried:
last_year = start - 365
d%>%
dplyr::filter(A <= start & A >= last_year)
which works, but I want to use the 252. Imagine that I want to go 2 years (252*2) back and find how many observations I have in that time interval.
Any help on how I can do that?
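One possible direction, sketched in base R under assumptions: the column `label` below is a hypothetical stand-in for the factor labels in the real dataset, and `n_back` is a made-up name for the number of observations to go back. The idea is to filter on the date first and then take the last `n_back` rows within each label rather than over the whole table.

```r
set.seed(1)
dates <- seq(as.Date("2019-01-01"), as.Date("2022-01-01"), length.out = 252 * 3)

# 'label' is a hypothetical grouping column standing in for the factor labels
d <- data.frame(
  A     = rep(dates, times = 2),
  label = rep(c("a", "b"), each = length(dates)),
  x     = rnorm(2 * length(dates))
)

start  <- as.Date("2021-12-15")
n_back <- 252  # observations per "year"; use 252 * 2 to go two years back

# Keep rows before 'start', sort, then take the last n_back rows per label
before <- d[d$A < start, ]
before <- before[order(before$label, before$A), ]
last_n <- do.call(rbind, lapply(split(before, before$label), tail, n = n_back))

# Number of observations kept for each label
table(last_n$label)
```

In dplyr the same idea would presumably be a group_by() on the label column placed before the slice_tail(n = 252) call from the question.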
I am trying to figure out how to add a row when a date range spans a calendar year. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#> from to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame that splits into two rows when to and from span a calendar year.
want <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'), as.Date('2020-01-01'), as.Date('2019-05-13')),
to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#> from to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I want to do this because, for a particular row, I want to know how many days fall in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R, tidyverse solutions would be much appreciated.
You can determine the additional years needed for your date intervals with map2, then unnest to create additional rows for each year.
Then, you can identify date intervals of intersections between partial years and a full calendar year. This will keep the partial years starting Jan 1 or ending Dec 31 for a given year.
library(tidyverse)
library(lubridate)
have %>%
  mutate(date_int = interval(from, to),
         year = map2(year(from), year(to), seq)) %>%
  unnest(year) %>%
  mutate(year_int = interval(as.Date(paste0(year, '-01-01')), as.Date(paste0(year, '-12-31'))),
         year_sect = intersect(date_int, year_int),
         from_new = as.Date(int_start(year_sect)),
         to_new = as.Date(int_end(year_sect))) %>%
  select(from_new, to_new)
Output
# A tibble: 5 x 2
from_new to_new
<date> <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
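Since base R was also asked for, here is a rough base R sketch of the same splitting idea, followed by the per-year day count. The helper name `split_by_year` is made up for illustration. It clamps each interval to every calendar year it touches with pmax() and pmin().

```r
have <- data.frame(
  from = as.Date(c("2018-12-15", "2019-12-20", "2019-05-13")),
  to   = as.Date(c("2019-06-20", "2020-01-25", "2019-09-10"))
)

# Split one interval at calendar-year boundaries (hypothetical helper)
split_by_year <- function(from, to) {
  years <- seq(as.integer(format(from, "%Y")), as.integer(format(to, "%Y")))
  year_start <- as.Date(paste0(years, "-01-01"))
  year_end   <- as.Date(paste0(years, "-12-31"))
  data.frame(
    from = pmax(from, year_start),  # later of interval start and Jan 1
    to   = pmin(to, year_end)       # earlier of interval end and Dec 31
  )
}

# Apply the helper row by row and stack the pieces
pieces <- do.call(rbind, lapply(seq_len(nrow(have)), function(i) {
  split_by_year(have$from[i], have$to[i])
}))
rownames(pieces) <- NULL

# Inclusive day count; drop the "+ 1L" to match difftime(to, from)
pieces$days <- as.integer(pieces$to - pieces$from) + 1L
pieces
```

Note that difftime(to, from) does not count the end day itself, so whether to add 1 depends on how the days should be counted.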
I have another question in the same project scope ("pandas dataframe groupby datetime month"); however, I fear the data structure might be too complicated, so I am trying an alternative approach. I am hoping this achieves the same result.
I am ideally looking to build a matrix of phone numbers as rows and start and end dates as columns and identify the period in which a telephone call was made.
This will be achieved by transforming a dataset of dates and phone numbers to a complete list of dates, identifying an end day match, and then seeing if the date the telephone call was made falls within that period.
The original data looks like:
Date = as.Date(c("2019-03-01", "2019-03-15","2019-03-29", "2019-04-10","2019-03-05","2019-03-20"))
Phone = c("070000001","070000001","070000001","070000001","070000002","070000002")
df<-data.frame(Date,Phone)
df
## Date Phone
## 1 2019-03-01 070000001
## 2 2019-03-15 070000001
## 3 2019-03-29 070000001
## 4 2019-04-10 070000001
## 5 2019-03-05 070000002
## 6 2019-03-20 070000002
Ideally I would want it to look like this:
## Date Phone INT_1 INT_2 INT_3 INT_4 INT_5
## 1 2019-03-01 070000001 X X X X X
## 2 2019-03-15 070000002 X X X
Where INT is a series of dates + 30 and X indicates that the telephone number appeared in that rolling period.
To do this I assume you need two datasets: the one above, of telephone numbers by date called, and a second that is the complete list of days and their + 30 day counterparts.
dates <- data.frame(start = seq(as.Date("2016/7/1"), as.Date("2019/7/1"), by = "days"))
dates$end <- dates$start + 30
## INT start end
## 1 2016-07-01 2016-07-31
## 2 2016-07-02 2016-08-01
## 3 2016-07-03 2016-08-02
## 4 2016-07-04 2016-08-03
But how do I get the two to evaluate together? I am assuming some kind of merge and expand of the telephone data into the date list then spread the dates by the row index/ INT?
I think that to match the two dataframes you could use a fuzzyjoin. For example, if I define a dataframe of phone numbers and usage dates as:
library(dplyr)
library(fuzzyjoin)
fake_phone_data <- tibble(
date = as.Date(c("2019-01-03", "2019-01-27", "2019-02-12", "2019-02-25", "2019-02-26")),
phone = c("1", "1", "2", "2", "2")
)
and a dataframe of starting/ending dates (plus an ID column) as:
id_dates <- tibble(
ID = c("1", "2", "3", "4"),
starting_date = as.Date(c("2019-01-01", "2019-01-16", "2019-02-01", "2019-02-16")),
ending_date = as.Date(c("2019-01-15", "2019-01-31", "2019-02-15", "2019-02-27"))
)
then I can join the two dataframes using a fuzzyjoin, i.e. two rows are matched if the date of the phone call happens between the starting date and the end date of the corresponding period:
fuzzy_left_join(
fake_phone_data,
id_dates,
by = c(
"date" = "starting_date",
"date" = "ending_date"
),
match_fun = list(`>=`, `<`)
)
#> # A tibble: 5 x 5
#> date phone ID starting_date ending_date
#> <date> <chr> <chr> <date> <date>
#> 1 2019-01-03 1 1 2019-01-01 2019-01-15
#> 2 2019-01-27 1 2 2019-01-16 2019-01-31
#> 3 2019-02-12 2 3 2019-02-01 2019-02-15
#> 4 2019-02-25 2 4 2019-02-16 2019-02-27
#> 5 2019-02-26 2 4 2019-02-16 2019-02-27
Created on 2019-07-19 by the reprex package (v0.3.0)
Does it solve your problem?
This approach is very similar to this question.
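If installing fuzzyjoin is not an option, the same interval match can be sketched in base R with a cross join followed by a filter. This is a sketch on the toy data above, with the made-up names `calls`, `periods` and `all_pairs`. It uses `<=` on the ending date so that a call made on the last day of a period is still matched, which differs slightly from the `<` used in the fuzzy join. The cross join grows as rows times rows, so it is only reasonable for small tables.

```r
calls <- data.frame(
  date  = as.Date(c("2019-01-03", "2019-01-27", "2019-02-12", "2019-02-25", "2019-02-26")),
  phone = c("1", "1", "2", "2", "2")
)
periods <- data.frame(
  ID            = c("1", "2", "3", "4"),
  starting_date = as.Date(c("2019-01-01", "2019-01-16", "2019-02-01", "2019-02-16")),
  ending_date   = as.Date(c("2019-01-15", "2019-01-31", "2019-02-15", "2019-02-27"))
)

# by = NULL makes merge() return the Cartesian product of the two tables
all_pairs <- merge(calls, periods, by = NULL)

# Keep only the pairs where the call date falls inside the period
matched <- all_pairs[all_pairs$date >= all_pairs$starting_date &
                       all_pairs$date <= all_pairs$ending_date, ]
matched <- matched[order(matched$phone, matched$date), ]

# Phones as rows, period IDs as columns, cells = number of calls in that period
tab <- table(phone = matched$phone, ID = matched$ID)
tab
```

The final table is close to the phone-by-interval matrix asked for in the question, with counts in place of the X marks.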
I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I am using, I can't work out how to count per month rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
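# Packages used below: dplyr for %>% and mutate, lubridate for day<-
library(dplyr)
library(lubridate)
# Test data
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
  mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))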
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
  # Change each day to the first to aid month counting
  day(start) <- 1
  day(end) <- 1
  # Create a sequence of months
  seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date, SIMPLIFY = FALSE)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
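The same steps (floor each date to the first of its month, build a monthly sequence per client, then tabulate) can also be sketched without lubridate or dplyr. The data and the helper name `month_floor` below are made up for illustration.

```r
# Hypothetical episodes for three clients
start_date <- as.Date(c("2014-01-10", "2014-02-20", "2014-03-05"))
end_date   <- as.Date(c("2014-03-15", "2014-02-25", "2014-05-01"))

# First day of the month for a Date vector (hypothetical helper)
month_floor <- function(x) as.Date(format(x, "%Y-%m-01"))

# For each client, list every year-month the episode touches
months_active <- unlist(lapply(seq_along(start_date), function(i) {
  months <- seq(month_floor(start_date[i]), month_floor(end_date[i]), by = "month")
  format(months, "%Y-%m")
}))

# Count clients per year-month
counts <- as.data.frame(table(month = months_active))
counts
```

A client counts as active in a month if the episode overlaps that month at all, even by a single day.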
I have a dataframe in R, which has two variables that are dates and I need to calculate the difference in days between them. However, they are formatted as YYYYMMDD. How do I change it to a date format readable in R?
This should work
lubridate::ymd(given_date_format)
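A base R alternative, as a small sketch with made-up column names (`start`, `end`): as.Date() with an explicit format parses YYYYMMDD strings, and integer input just needs as.character() first.

```r
# Hypothetical data: one column stored as integer, one as character
df <- data.frame(
  start = c(20170623L, 20170701L),
  end   = c("20170705", "20170714")
)

# Convert to Date; as.character() handles integer (and factor) input
df$start_date <- as.Date(as.character(df$start), format = "%Y%m%d")
df$end_date   <- as.Date(as.character(df$end), format = "%Y%m%d")

# Difference in days as a plain number
df$days <- as.numeric(df$end_date - df$start_date)
df
```

Values that do not parse with the given format come back as NA, so it is worth checking anyNA() on the new columns afterwards.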
I like anydate() from the anytime package. Quick demo, with actual data:
R> set.seed(123) # be reproducible
R> data <- data.frame(inp=Sys.Date() + cumsum(runif(10)*10))
R> data$ymd <- format(data$inp, "%Y%m%d") ## as yyyymmdd
R> data$int <- as.integer(data$ymd) ## same as integer
R> library(anytime)
R> data$diff1 <- c(NA, diff(anydate(data$ymd))) # reads YMD
R> data$diff2 <- c(NA, diff(anydate(data$int))) # also reads int
R> data
inp ymd int diff1 diff2
1 2017-06-23 20170623 20170623 NA NA
2 2017-07-01 20170701 20170701 8 8
3 2017-07-05 20170705 20170705 4 4
4 2017-07-14 20170714 20170714 9 9
5 2017-07-24 20170724 20170724 10 10
6 2017-07-24 20170724 20170724 0 0
7 2017-07-29 20170729 20170729 5 5
8 2017-08-07 20170807 20170807 9 9
9 2017-08-13 20170813 20170813 6 6
10 2017-08-17 20170817 20170817 4 4
R>
Here the first column holds the actual dates we work from. Columns two and three are then generated to match OP's requirement: YMD, either as character or as integer.
We then compute differences on them, pad with NA for the first data point (which has no predecessor and so no difference), and show that either date format works.