Creation of a time series adds day and time to the date - R

My data frame looks like this (with more columns and rows):
DT
           Date PERMNO lag.ME.Jun
  <S3: yearmon> <fctr>      <dbl>
1      Jan 2000  34936     21.860
2      Feb 2000  34936     21.860
3      Mar 2000  34936     21.860
Then I create different time series for each column (for the variable lag.ME.Jun, for example):
v6 <- xts(newdata11$lag.ME.Jun, newdata11$Date)
However, it also adds a day and time to the index, which I never provided. So v6 looks like:
                    34936
2000-01-01 01:00:00 21.86
2000-02-01 01:00:00 21.86
2000-03-01 01:00:00 21.86
How can I keep the day and time from appearing in the time series index, and show only the month and year?

It seems newdata11$Date is either a POSIXct type, or being converted to one. If yearmon is what you want, be explicit about it:
v6 <- xts(newdata11$lag.ME.Jun, as.yearmon(newdata11$Date))
E.g.
> library(xts)   # loads zoo, which provides as.yearmon
> d <- as.yearmon(c("mar07", "apr07", "may07"), "%b%y")
> d
[1] "Mar 2007" "Apr 2007" "May 2007"
> xts(1:3, d)
         [,1]
Mar 2007    1
Apr 2007    2
May 2007    3
> d2 <- as.Date(c("2017-05-01", "2017-06-01", "2017-07-01"))
> xts(1:3, as.yearmon(d2))
         [,1]
May 2017    1
Jun 2017    2
Jul 2017    3

Related

Specifying start date of timeseries data in R as Q2

I have time series data that is seasonal by the quarter. However, the data starts in the 2nd quarter of the first year but all other years have all four quarters.
> EquifaxData
         DATE EQFXSUBPRIME013045
1  2014-04-01           42.58513
2  2014-07-01           43.15483
3  2014-10-01           43.55090
4  2015-01-01           42.59218
5  2015-04-01           41.47105
6  2015-07-01           41.53640
7  2015-10-01           41.82020
8  2016-01-01           40.98760
9  2016-04-01           40.51305
10 2016-07-01           39.91170
11 2016-10-01           40.15402
I then converted the Date column to a date as follows:
> EquifaxData$DATE <- as.Date(EquifaxData$DATE)
Now comes the issue: I want to convert this data to a time series, but I need to specify the start date as the beginning of Q2 2014, not the beginning of 2014. As you can see from what I have tried below, the resulting time series shown by head() has all the values shifted one quarter back, because it starts at the beginning of 2014.
> EquifaxTs <- ts(EquifaxData$EQFXSUBPRIME013045, start=2014, frequency = 4)
> head(EquifaxTs)
         Qtr1     Qtr2     Qtr3     Qtr4
2014 42.58513 43.15483 43.55090 42.59218
2015 41.47105 41.53640
>
How can I define EquifaxTs to correctly start in Q2 2014 and still remain seasonal with a frequency of 4 per year?
I think this solves it:
EquifaxTs <- ts(EquifaxData$EQFXSUBPRIME013045, start = c(2014, 2), frequency = 4)
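With start = c(2014, 2), the first observation is placed in period 2 (Q2) of 2014 rather than Q1. A quick sanity check (a minimal sketch; the expected value comes from the data shown in the question):
start(EquifaxTs)                                         # should report 2014 2
window(EquifaxTs, start = c(2014, 2), end = c(2014, 2))  # should be 42.58513, the first reading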

Formatting date column with different formats (including missing day information) - lubridate

I'm relatively new to R. I downloaded a dataset of clinical trial data, but noticed that the dates in the relevant column are in mixed formats: most of them are like "September 1, 2012", but some are missing the day information (e.g. October 2015).
I want to express them all in the same way (e.g. yyyy-mm-dd) so I can work with them. That went fine; the only remaining problem is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col" to which I can pass the intended name for the created (formatted) column, but the new column is always literally named output_col.
Do you know how I could handle this, i.e. pass the intended name of the output column right into the function?
Is there a better way to solve my problem?
I even tried a more complex orders argument for lubridate::parse_date_time, like
parse_date_time(input_col, orders="mdy", "my")
but this didn't work.
Here's the code:
library("tidyverse")
library("lubridate")
Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
"June 24, 2010","January 2005")
df_dates <- data.frame(Observation, Date_original)
# looking for a comma in the cell
comma_detect <- function(a_string){
str_detect(a_string, ",")
}
# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
if_else(comma_detect(input_col),
parse_date_time(input_col, orders="mdy"),
parse_date_time(input_col, orders="my"))
}
# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
mutate(df, output_col = date_correction_row(input_col))
}
df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()
OUTPUT
  Observation Date_original output_col
1           1  October 2014 2014-10-01
2           2   August 2014 2014-08-01
3           3     June 2013 2013-06-01
4           4 June 24, 2010 2010-06-24
5           5  January 2005 2005-01-01
In the code below we assume that output_col equals "Date". All of the approaches set the column name from that variable, give no warnings, and return Date class columns.
1) Try each format and take the one that does not give NA. This uses only base R.
output_col <- "Date"
within(df_dates, assign(output_col, pmin(na.rm = TRUE,
  as.Date(Date_original, "%B %d, %Y"),
  as.Date(paste(Date_original, 1), "%B %Y %d"))))
##   Observation Date_original       Date
## 1           1  October 2014 2014-10-01
## 2           2   August 2014 2014-08-01
## 3           3     June 2013 2013-06-01
## 4           4 June 24, 2010 2010-06-24
## 5           5  January 2005 2005-01-01
2) This can also be done in lubridate. It is important that my is the first argument to coalesce rather than the second: my returns NA for values that do not match its format, whereas mdy returns a wrong date for them, so if mdy came first coalesce would never get to my. This approach is shorter than (3), but you might prefer the robustness of (3), since it does not depend on what is returned for non-matching dates.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
  mutate(!!output_col := coalesce(my(Date_original, quiet = TRUE),
                                  mdy(Date_original)))
##   Observation Date_original       Date
## 1           1  October 2014 2014-10-01
## 2           2   August 2014 2014-08-01
## 3           3     June 2013 2013-06-01
## 4           4 June 24, 2010 2010-06-24
## 5           5  January 2005 2005-01-01
3) If you prefer your own method of first checking for a comma, here is a more compact variation of it. It uses my and mdy instead of parse_date_time, since my and mdy return Date class results, which are more appropriate here than the POSIXct returned by parse_date_time given that there are no times.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
  mutate(!!output_col := if_else(grepl(",", Date_original),
                                 mdy(Date_original),
                                 my(Date_original, quiet = TRUE)))
##   Observation Date_original       Date
## 1           1  October 2014 2014-10-01
## 2           2   August 2014 2014-08-01
## 3           3     June 2013 2013-06-01
## 4           4 June 24, 2010 2010-06-24
## 5           5  January 2005 2005-01-01
When the date structure is known, I like to explicitly correct the structure first, then parse. Here I use a regex to substitute in a day of 1 where the day is missing, then parse as normal.
library(tidyverse)
df_dates %>%
  mutate(
    output_col = gsub("(?<!,)\\s(?=\\d{4})", " 1, ", Date_original, perl = TRUE) %>%
      as.Date(., format = '%B %d, %Y')
  )
  Observation Date_original output_col
1           1  October 2014 2014-10-01
2           2   August 2014 2014-08-01
3           3     June 2013 2013-06-01
4           4 June 24, 2010 2010-06-24
5           5  January 2005 2005-01-01
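To answer the naming part of the question directly, the OP's own date_correction() could accept the output column name as a string by using the same !!output_col := injection shown above (a sketch; the {{ }} operator for the input column is an rlang feature available through dplyr, and this keeps the OP's parse_date_time logic, so the result is POSIXct and may emit parse warnings, unlike (1)-(3)):
date_correction <- function(df, input_col, output_col) {
  # output_col is a character string; {{ input_col }} captures a bare column name
  mutate(df, !!output_col := date_correction_row({{ input_col }}))
}
df_dates %>% date_correction(Date_original, "date_formatted")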

How to create new variable based on time and preexisting variables?

I have a dataset with repeated measurements on multiple individuals over time. It looks something like this:
ID Time              Event
1  Jan 1 2012, 4pm   Abx
1  Jan 2 2012, 2pm   Test
1  Jan 26 2012 3 pm  Test
1  Jan 29 2012 10 pm Abx
1  Jan 30 2012, 3 pm Test
1  Jan 5 2012 3 pm   Test
2  Jan 1 2012, 4pm   Abx
2  Jan 2 2012, 2pm   Test
2  Jan 26 2012 3 pm  Test
The dataset is currently based around events. It will later be filtered down to just tests. What I need to do is make a new variable that is 1 when certain events (Abx, in this case) occur within a certain time range of tests. So if the event 'Abx' occurs within, let's say, 48 hours of a Test event, the new variable should equal 1. Otherwise, it should equal zero.
I'm hoping to produce something like this:
ID Time              Event New_variable
1  Jan 1 2012, 4pm   Abx   1
1  Jan 2 2012, 2pm   Test  1
1  Jan 26 2012 3 pm  Test  0
1  Jan 29 2012 10 pm Abx   1
1  Jan 30 2012, 3 pm Test  1
1  Jan 5 2012 3 pm   Test  0
2  Jan 1 2012, 4pm   Abx   1
2  Jan 2 2012, 2pm   Test  1
2  Jan 26 2012 3 pm  Test  0
I know that I could probably solve this with dplyr mutate() and ifelse() statements; if I just wanted a variable that reads 1 whenever an antibiotic event occurs, I could do that like this:
test %>%
  mutate(New_variable = ifelse(Event == 'Abx', 1, 0)) -> test2
But I don't know how to factor in time so that Test events = 1 within 48 hours of an Abx event. I also am not sure how to make sure that the condition is applied only within the same ID. How can I do this?
Any help is appreciated!
Update: Thank you so much for the suggestions! I tried them out on the data and they worked. I also modified the suggested helper function to accept more than one type of Abx event:
abxRows <- type == "Abx" | type == "Abx2"
To the data provided, I added two "Abx" events that should not be flagged with a 1 (one that is not within 48 hours of any test, and one that is within 48 hours of a test -- the added "Test" row in ID 1 -- but belongs to a different ID group).
library(dplyr)
library(lubridate)
library(purrr)
eventData <-
  data.frame(stringsAsFactors = FALSE,
             ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1),
             Time = c("Jan 1 2012 4 pm", "Jan 2 2012, 2pm",
                      "Jan 26 2012 3 pm", "Jan 29 2012 10 pm",
                      "Jan 30 2012 3 pm", "Jan 5 2012 3 pm",
                      "Jan 1 2012 4 pm", "Jan 2 2012, 2pm",
                      "Jan 26 2012 3 pm", "Feb 12 2012 1pm",
                      "Jan 16 2012 3 pm", "Jan 16 2012 1 pm"),
             Event = c("Abx", "Test", "Test", "Abx", "Test", "Test",
                       "Abx", "Test", "Test", "Abx", "Abx", "Test")) %>%
  mutate(Time = mdy_h(Time),
         window = if_else(Event == "Test",
                          interval(Time - hours(48), Time + hours(48)),
                          interval(NA, NA)))
First, you want to make sure the Time column is in a date-time format. Then create a column of the lubridate Interval class that defines a 48-hour window around each "Test" event.
Define the helper function that will check if the event occurred within the window.
chkFun <- function(eventTime, intervals, grp, type){
  abxRows <- type == "Abx"
  testRows <- !abxRows
  hits <- map2_lgl(eventTime, grp,
                   ~any(.x %within% intervals[grp %in% .y], na.rm = TRUE)) &
    abxRows
  testHits <- map_lgl(which(testRows),
                      ~any(eventTime[abxRows & (grp[.x] == grp)] %within%
                             intervals[.x]))
  hits[testRows] <- testHits
  as.integer(hits)
}
This function first tests whether the "Abx" events occur within any of the intervals for their group. It then determines which "Test" rows have an interval that contains an "Abx" event. The function returns the combination of these, cast as integers.
Last, just use a mutate statement with the helper function, dropping the window column:
eventData %>%
  mutate(New_variable = chkFun(Time, window, ID, Event)) %>%
  select(-window)
Alternatively, the helper function could just take the data.frame as an argument and assume the column names. In the form above, though, if you define it first in your script, it can also be used inside the original definition of eventData.
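A minimal sketch of that alternative, as a thin wrapper (it assumes the column names Time, window, ID and Event used above):
chkFunDF <- function(df) {
  # evaluate chkFun() using the data.frame's own columns
  with(df, chkFun(Time, window, ID, Event))
}
eventData$New_variable <- chkFunDF(eventData)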
Results:
#> ID Time Event New_variable
#> 1 1 2012-01-01 16:00:00 Abx 1
#> 2 1 2012-01-02 14:00:00 Test 1
#> 3 1 2012-01-26 15:00:00 Test 0
#> 4 1 2012-01-29 22:00:00 Abx 1
#> 5 1 2012-01-30 15:00:00 Test 1
#> 6 1 2012-01-05 15:00:00 Test 0
#> 7 2 2012-01-01 16:00:00 Abx 1
#> 8 2 2012-01-02 14:00:00 Test 1
#> 9 2 2012-01-26 15:00:00 Test 0
#> 10 2 2012-02-12 13:00:00 Abx 0
#> 11 2 2012-01-16 15:00:00 Abx 0
#> 12 1 2012-01-16 13:00:00 Test 0
I don't have a copy of your data, so I'm not sure what format your dates are in.
I would recommend converting the dates to the right type using as.POSIXct(Time, format = "%b %d %Y, %I%p"). For more info on the format, look up ?strptime, but I think that is right for your column.
If we assume your data frame is like the one below (I know I have changed parts of it, but this is for simplicity; start, end and interval are not defined in the original post, so the values here are assumptions chosen to reproduce the printed times):
start <- as.POSIXct("2012-01-01 00:00:00")
end <- as.POSIXct("2012-01-30 00:00:00")
interval <- 60  # seconds, so by = interval * 6840 is a 114-hour step
df <- data.frame(ID = c(rep(1, 6), rep(2, 3)),
                 Time = c(seq(from = start, by = interval * 6840, to = end)[1:6],
                          seq(from = start, by = interval * 6840, to = end)[1:3]),
                 Event = rep(c("Abs", "Test", "Test"), 3))
This would look like this
ID Time Event
1 1 2012-01-01 00:00:00 Abs
2 1 2012-01-05 18:00:00 Test
3 1 2012-01-10 12:00:00 Test
4 1 2012-01-15 06:00:00 Abs
5 1 2012-01-20 00:00:00 Test
6 1 2012-01-24 18:00:00 Test
7 2 2012-01-01 00:00:00 Abs
8 2 2012-01-05 18:00:00 Test
9 2 2012-01-10 12:00:00 Test
So you can use the following code to test whether a Test falls within 48 hours of an Abs:
df[which(df$Event == "Test"), ]$Time %in%
  unlist(Map(`:`,
             df[which(df$Event == "Abs"), ]$Time - 48*60*60,
             df[which(df$Event == "Abs"), ]$Time + 48*60*60))
So this will return FALSE for all, but that is because the synthetic data is at larger time steps.
To unpack this...
df[which(df$Event == "Test"), ]$Time gives the times of the tests.
%in% says: look for the values that precede it within the set of values that follows it.
What follows it is: unlist(Map(`:`, df[which(df$Event == "Abs"), ]$Time - 48*60*60, df[which(df$Event == "Abs"), ]$Time + 48*60*60))
This builds a vector of times within +/- 48 hours of each Abs. To add or subtract 48 hours, note that arithmetic on POSIXct objects is done in seconds, hence 48*60*60.
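To also restrict the comparison to the same ID and build the 0/1 column asked for in the question, one possible variant (a sketch, not part of the answer above; it uses dplyr grouping and difftime instead of expanding every second of the window, and assumes there are exactly the two event types):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(New_variable = as.integer(vapply(seq_along(Time), function(i) {
    others <- Time[Event != Event[i]]   # times of the other event type in this ID
    any(abs(difftime(others, Time[i], units = "hours")) <= 48)
  }, logical(1)))) %>%
  ungroup()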

How to fill missing and adjust irregular time intervals in a data.frame in R

I have several datasets, mostly with 15-minute time intervals. However, some datasets have missing readings (e.g., the 3rd row in the sample dataset was supposed to be "May 1 2015 00:40AM"). In addition, some timesteps are longer than 15 minutes (e.g., see the 3rd and 6th rows).
How can I add the missing time steps so that Date continues in 15-minute intervals, and at the same time adjust the timestamps that are off the 15-minute grid back onto it?
s <- data.frame(Date = c(
"May 1 2015 00:10AM","May 1 2015 00:25AM",
"May 1 2015 00:56AM","May 1 2015 01:10AM",
"May 1 2015 01:25AM","May 1 2015 01:41AM",
"May 1 2015 01:55AM"),
val = c(1:7)
)
My desired output would be the following:
> s
Date val
1 May 1 2015 00:10AM 1
2 May 1 2015 00:25AM 2
3 May 1 2015 00:40AM NA
4 May 1 2015 00:55AM 3
5 May 1 2015 01:10AM 4
6 May 1 2015 01:25AM 5
7 May 1 2015 01:40AM 6
8 May 1 2015 01:55AM 7
You could try the following:
First, turn your s dataframe variable "Date" into POSIXct, so you can work with it:
s <- data.frame(Date = c(
"May 1 2015 00:10AM","May 1 2015 00:25AM",
"May 1 2015 00:56AM","May 1 2015 01:10AM",
"May 1 2015 01:25AM","May 1 2015 01:41AM",
"May 1 2015 01:55AM"),
val = c(1:7)
) %>% dplyr::mutate(Date = lubridate::parse_date_time(Date, "b d Y HM"))
Second, you can join this with another data frame that has all the time intervals you are expecting. First, we construct it, using a difference of time intervals (15 mins in this case):
one <- lubridate::parse_date_time("May 1 2015 00:10AM", orders = "b d Y HM")
two <- lubridate::parse_date_time("May 1 2015 00:25AM", orders = "b d Y HM")
dif <- two - one
Now the dataframe:
other_df <- data.frame(
Date = seq(from = lubridate::parse_date_time("May 1 2015 00:10AM",
orders = "b d Y HM"),
to = lubridate::parse_date_time("May 1 2015 01:55AM",
orders = "b d Y HM"),
by = dif))
Join the two:
result <- dplyr::full_join(other_df, s)
> result
Date val
1 2015-05-01 00:10:00 1
2 2015-05-01 00:25:00 2
3 2015-05-01 00:40:00 NA
4 2015-05-01 00:55:00 NA
5 2015-05-01 01:10:00 4
6 2015-05-01 01:25:00 5
7 2015-05-01 01:40:00 NA
8 2015-05-01 01:55:00 7
9 2015-05-01 00:56:00 3
10 2015-05-01 01:41:00 6
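Note that this join keeps the off-grid timestamps (00:56 and 01:41) as extra rows instead of snapping them onto the 15-minute grid, which is why rows 4 and 7 are NA while rows 9 and 10 hold their values. To get the desired output, you could first snap each timestamp in s to the nearest grid point of other_df (a sketch, assuming the nearest grid point is always the intended one):
nearest <- vapply(seq_along(s$Date), function(i) {
  # index of the grid time closest to this observation
  which.min(abs(as.numeric(difftime(other_df$Date, s$Date[i], units = "mins"))))
}, integer(1))
s_snapped <- dplyr::mutate(s, Date = other_df$Date[nearest])
result <- dplyr::full_join(other_df, s_snapped, by = "Date")
This should reproduce the desired output from the question, with a single NA at 00:40.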

Long string date to short date R

I have a df with dates formatted in the following way.
Date                    Year
<chr>                  <dbl>
Sunday, Jul 27          2008
Tuesday, Jul 29         2008
Wednesday, July 31 (1)  2008
Wednesday, July 31 (2)  2008
Is there a simple way to achieve the following format of columns and values? I'd also like to remove the (1) and (2) notations on the two July 31 dates.
Date Year Month Day Day_of_Week
2008-07-27 2008 07 27 Sunday
With base R, you can do:
dat <- data.frame(
  Date = c("Sunday, Jul 27", "Tuesday, Jul 29", "Wednesday, July 31", "Wednesday, July 31"),
  Year = rep(2008, 4),
  stringsAsFactors = FALSE
)
dts <- as.POSIXlt(paste(dat$Year, dat$Date), format = "%Y %A, %B %d")
POSIXlt provides a list-based reference for the date/time. To see them, try unclass(dts[1]).
From here it can be rather academic:
dat$Month = 1 + dts$mon # months are 0-based in POSIXlt
dat$Day = dts$mday
dat$Day_of_Week = weekdays(dts)
dat
# Date Year Month Day Day_of_Week
# 1 Sunday, Jul 27 2008 7 27 Sunday
# 2 Tuesday, Jul 29 2008 7 29 Tuesday
# 3 Wednesday, July 31 2008 7 31 Thursday
# 4 Wednesday, July 31 2008 7 31 Thursday
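If you also want the combined date as its own ISO-formatted column, as in the desired output, one small addition (Date_iso is a column name I am introducing here, since Date already holds the original string):
dat$Date_iso <- as.Date(dts)   # POSIXlt -> Date, prints as yyyy-mm-dd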
library(dplyr)
library(lubridate)
dat = data_frame(date = c('Sunday, Jul 27', 'Tuesday, Jul 29',
                          'Wednesday, July 31 (1)', 'Wednesday, July 31 (2)'),
                 year = rep(2008, 4))
dat %>%
  mutate(date = gsub("\\s*\\([^\\)]+\\)", "", as.character(date)),
         date = parse_date_time(date, 'A, b! d ')) -> dat1
year(dat1$date) <- dat1$year
# A tibble: 4 × 2
date year
<dttm> <dbl>
1 2008-07-27 2008
2 2008-07-29 2008
3 2008-07-31 2008
4 2008-07-31 2008
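To get the remaining columns from the desired output (Month, Day, Day_of_Week), a possible follow-up on dat1 using lubridate accessors (note that, as in the base R answer above, 2008-07-31 comes out as Thursday, since that is the actual weekday):
dat1 %>%
  mutate(Month = month(date),
         Day = day(date),
         Day_of_Week = wday(date, label = TRUE, abbr = FALSE))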

Resources