Dynamically read all possible date formats in R

I am trying to dynamically read dates in R from csv or xlsx files. The challenge is that the dates could be in any format: any combination of %d for the day, %m, %b or %B for the month, and %y or %Y for the year, in any order of day, month and year.
Are there any ready-made functions I can use, or is the solution to read the characters from a series of dates and then determine which format they could be in?
Any pointers are highly appreciated.

The function findAndTransformDates from the dataPreparation package will automatically find the date format in each column and transform it.
NB: it only works if all rows of a column share the same format.
For example:
require(dataPreparation)
data("messy_adult")
head(messy_adult[, .(date1, date2, date3, date4)])
date1 date2 date3 date4
1: 2017-10-07 NA 19-Jan-2017 21-January-2017
2: 2017-31-12 1513465200 06-Jun-2017 08-June-2017
3: 2017-12-10 1511305200 03-Jul-2017 05-July-2017
4: 2017-06-09 1485126000 19-Jul-2017 21-July-2017
5: 2017-02-03 1498345200 16-May-2017 18-May-2017
6: 2017-04-10 1503183600 02-Apr-2017 04-April-2017
messy_adult <- findAndTransformDates(messy_adult)
head(messy_adult[, .(date1, date2, date3, date4)])
date1 date2 date3 date4
1: 2017-07-10 <NA> 2017-01-19 2017-01-21
2: 2017-12-31 2017-12-17 00:00:00 2017-06-06 2017-06-08
3: 2017-10-12 2017-11-22 00:00:00 2017-07-03 2017-07-05
4: 2017-09-06 2017-01-23 00:00:00 2017-07-19 2017-07-21
5: 2017-03-02 2017-06-25 01:00:00 2017-05-16 2017-05-18
6: 2017-10-04 2017-08-20 01:00:00 2017-04-02 2017-04-04
Hope it helps.
Disclaimer: I'm the author of this package.
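The manual approach the question describes (inspect the strings, then work out which format fits) can be sketched in base R. This is only a minimal illustration, not how dataPreparation works internally: the function name guess_date_format and the candidate list are assumptions made up for this example, and like the package it assumes one format per column. Formats using %b or %B depend on the locale, and two-digit years can be ambiguous, so the candidate order matters.

```r
# Hypothetical helper: return the first candidate format that parses
# every non-missing value in x, or NA if none does.
guess_date_format <- function(x,
                              candidates = c("%Y-%m-%d", "%Y-%d-%m",
                                             "%d/%m/%Y", "%m/%d/%Y",
                                             "%d-%b-%Y", "%d-%B-%Y")) {
  x <- x[!is.na(x)]
  for (fmt in candidates) {
    parsed <- as.Date(x, format = fmt)
    # as.Date() returns NA for strings that do not fit the format
    if (!anyNA(parsed)) return(fmt)
  }
  NA_character_
}

x <- c("2017-10-07", "2017-31-12", NA)
fmt <- guess_date_format(x)   # "%Y-%m-%d" fails on month 31, so "%Y-%d-%m" wins
dates <- as.Date(x, format = fmt)
```

A column such as c("01/02/2017", "03/04/2017") fits both %d/%m/%Y and %m/%d/%Y, so the sketch simply returns whichever comes first in the candidate list; that ambiguity cannot be resolved from the data alone.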

Related

R: time series monthly max adjusted by group

I have a data frame like this (head):
date Value
1: 2016-12-31 169361280
2: 2017-01-01 169383153
3: 2017-01-02 169494585
4: 2017-01-03 167106852
5: 2017-01-04 166750164
6: 2017-01-05 164086438
I would like to calculate a ratio, and for that I need the max of every period. The max normally falls on the last day of the month, but sometimes it can be a few days before or after (28, 29, 30, 31, 01, 02).
To calculate it properly, I would like to assign to my reference date (the last day of the month) the max value of this group of days, so that the ratio reflects what it is supposed to.
This could be a reproducible example:
library(data.table)
library(zoo) # as.yearmon() is used in DateSeq below
Start<-as.Date("2016-12-31")
End<-Sys.Date()
window<-data.table(seq(Start,End,by='1 day'))
dt<-cbind(window,rep(rnorm(nrow(window))))
colnames(dt)<-c("date","value")
# Create a Dateseq
DateSeq <- function(st, en, freq) {
st <- as.Date(as.yearmon(st))
en <- as.Date(as.yearmon(en))
as.Date(as.yearmon(seq(st, en, by = paste(as.character(12/freq),
"months"))), frac = 1)
}
# df to be fulfilled with the group max.
Value.Max.Month<-data.frame(DateSeq(Start,End,12))
colnames(Value.Max.Month)<-c("date")
date
1 2016-12-31
2 2017-01-31
3 2017-02-28
4 2017-03-31
5 2017-04-30
6 2017-05-31
7 2017-06-30
8 2017-07-31
9 2017-08-31
10 2017-09-30
11 2017-10-31
12 2017-11-30
13 2017-12-31
14 2018-01-31
15 2018-02-28
16 2018-03-31
You could use data.table:
library(data.table)
library(lubridate)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
# one row per day with a random value
dt <- data.table(date = seq(Start, End, by = "1 day"))
dt[, value := rnorm(.N)]
# period = first day of the month, plus one month, minus one day
dt[, period := as.Date(as.yearmon(date)) %m+% months(1) - 1][, maximum := max(value), by = period][, unique(maximum), by = period]
In the first expression we create a new column called period. Then we group by this new column and look for the maximum of value. In the last expression we just output the unique rows.
Notice that to get the last day of each period we add one month using lubridate and then subtract 1 day.
The output is:
period V1
1: 2016-12-31 -0.7832116
2: 2017-01-31 2.1988660
3: 2017-02-28 1.6644812
4: 2017-03-31 1.2464980
5: 2017-04-30 2.8268820
6: 2017-05-31 1.7963104
7: 2017-06-30 1.3612476
8: 2017-07-31 1.7325457
9: 2017-08-31 2.7503439
10: 2017-09-30 2.4369036
11: 2017-10-31 2.4544802
12: 2017-11-30 3.1477730
13: 2017-12-31 2.8461506
14: 2018-01-31 1.8862944
15: 2018-02-28 1.8946470
16: 2018-03-31 0.7864341
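For comparison, the same monthly maximum can be sketched in base R without data.table, lubridate or zoo. This is an illustrative variant under the same assumption as the answer above (each value belongs to its calendar month); the column names ym and period and the fixed date range are choices made for this example only.

```r
set.seed(1)
dates <- seq(as.Date("2016-12-31"), as.Date("2017-03-31"), by = "1 day")
d <- data.frame(date = dates, value = rnorm(length(dates)))

# Group key: year and month of each date
d$ym <- format(d$date, "%Y-%m")
monthly <- aggregate(value ~ ym, data = d, FUN = max)

# Last day of each month: the 1st plus 31 days always lands in the next
# month, so truncate that to its 1st and step back one day
first_day <- as.Date(paste0(monthly$ym, "-01"))
monthly$period <- as.Date(format(first_day + 31, "%Y-%m-01")) - 1
monthly
```

To include the first days of the following month in the window, as the question describes, the grouping key would have to be shifted (for example by computing ym from date minus a few days); that shift is not shown here.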

How to insert missing dates/times using R based on criteria?

I have a data frame like the one below. Three staff members have hourly readings per day, but the readings are incomplete (every staff member should have 24 readings a day).
The staff members have different numbers of readings on each day. I am only interested in the staff member with the most readings on a given day.
There are many days. I want to insert the missing (hourly) rows only for the staff member with the most readings on each day. That is, insert rows for 2018-03-02 only for Jack, for 2018-03-03 only for David and for 2018-03-04 only for Kate.
I tried the lines from this question (even though they fill everything without differentiation) but I am not getting there.
How can it be done in R?
date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00","2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00","2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00","2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00","2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00","3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00","3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00","3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00","4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00","4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00","4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00","4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)
One option is to do this separately. Create a data.table of the dates of interest and the corresponding 'staff', and build the full sequence of date-times for each. Then rbind this with the original dataset and, using a condition, summarise the data.
library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff= stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"),
tz = "GMT"),
length.out = 24, by = "1 hour")), staff]
setDT(df)[, date_time := as.POSIXct(date_time, format = "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[,
.(reading = if(any(is.na(reading))) sum(reading, na.rm = TRUE) else reading),
.(staff, date_time)]
table(res$staff, as.Date(res$date_time))
# 2018-03-02 2018-03-03 2018-03-04
# David 3 24 2
# Jack 24 1 1
# Kate 3 1 24
head(res)
# staff date_time reading
#1: Jack 2018-03-02 00:00:00 7.5
#2: Jack 2018-03-02 01:00:00 8.3
#3: Kate 2018-03-02 02:00:00 7.0
#4: Jack 2018-03-02 03:00:00 6.9
#5: Jack 2018-03-02 05:00:00 7.1
#6: Jack 2018-03-02 06:00:00 8.1
tail(res)
# staff date_time reading
#1: Kate 2018-03-04 04:00:00 0
#2: Kate 2018-03-04 09:00:00 0
#3: Kate 2018-03-04 13:00:00 0
#4: Kate 2018-03-04 18:00:00 0
#5: Kate 2018-03-04 21:00:00 0
#6: Kate 2018-03-04 23:00:00 0
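The answer above hard-codes which staff member to fill on which day. That mapping could also be derived from the data. The following is a minimal base R sketch on a small made-up data frame (the object names readings, counts and top_staff are assumptions for illustration); with ties, which.max simply picks the first staff member in alphabetical order.

```r
# Toy data: one row per reading
readings <- data.frame(
  day   = as.Date(c("2018-03-02", "2018-03-02", "2018-03-02",
                    "2018-03-03", "2018-03-03", "2018-03-03")),
  staff = c("Jack", "Jack", "Kate", "David", "Kate", "David")
)

# Count readings per day (rows) and staff member (columns)
counts <- table(readings$day, readings$staff)

# For each day, take the staff member with the most readings
top_staff <- colnames(counts)[apply(counts, 1, which.max)]
names(top_staff) <- rownames(counts)
top_staff
```

The resulting day/staff pairs could then replace the manually typed stf and date vectors when building the full hourly sequence.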
Try this code:
Identify each daily hour and all staff members
# parse first: min()/max() on the raw strings would compare them alphabetically
date_p<-as.POSIXct(date_time,format="%d/%m/%Y %H:%M")
date_h<-seq(min(date_p),max(date_p),by="hour")
staff_u<-unique(staff)
comb<-expand.grid(staff_u,date_h)
colnames(comb)<-c("staff","date_time")
Uniform date format in df
df$date_time<-as.POSIXct(df$date_time,format="%d/%m/%Y %H:%M")
Merge information
out<-merge(comb,df,all.x=T)
Your output:
head(out)
staff date_time reading
1 Jack 2018-03-02 00:00:00 7.5
2 Jack 2018-03-02 01:00:00 8.3
3 Jack 2018-03-02 02:00:00 NA
4 Jack 2018-03-02 03:00:00 6.9
5 Jack 2018-03-02 04:00:00 NA
6 Jack 2018-03-02 05:00:00 7.1

Create a time interval of 15 minutes from minutely data in R?

I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have data from 00:00 to 23:59, with a count per minute. I'd like to group the data in intervals of 15 minutes such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, so I am sure there has to be a function or something similar to do it easily. I just haven't figured out how yet.
I'd really appreciate some help!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time=seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by=60),
count=sample(1:50, 100, replace=TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>% group_by(by15=cut(time, "15 min")) %>%
summarise(count=sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.
If you want the end point to be the nearest minute before the next interval (instead of the nearest second), you could do as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
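The question's data holds the time as "HH:MM" text rather than POSIXct. For that case, a rough base R sketch is to convert each time to minutes since midnight and use integer division to form the 15-minute bins. The column names follow the question; the constant count of 1 and the label format are assumptions for illustration.

```r
# Made-up minutely data for one full day, with a count of 1 per minute
dat <- data.frame(
  time  = sprintf("%02d:%02d", rep(0:23, each = 60), rep(0:59, times = 24)),
  count = 1L
)

# Minutes since midnight, then the index of the 15-minute bin
mins <- as.integer(substr(dat$time, 1, 2)) * 60L +
        as.integer(substr(dat$time, 4, 5))
bin_start <- (mins %/% 15L) * 15L
bin_end   <- bin_start + 14L

# Label such as "00:00-00:14"; zero padding keeps the labels sortable
dat$interval <- sprintf("%02d:%02d-%02d:%02d",
                        bin_start %/% 60L, bin_start %% 60L,
                        bin_end %/% 60L, bin_end %% 60L)

res <- aggregate(count ~ interval, data = dat, FUN = sum)
head(res, 3)
```

Each interval here covers minutes 0 to 14, 15 to 29 and so on, so every bin holds exactly 15 minutes; the labels in the question (00:00-00:15, 00:16-00:30) would put 16 minutes in the first bin.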
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records.)
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if(!is(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  # floor() propagates NA, so missing values need no special case and
  # the function also works on vectors longer than one element
  as.POSIXct(floor(as.numeric(date_var) / floor_seconds) * floor_seconds,
             origin = origin)
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
bad1 = as.Date(Sys.time()),
bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2017-05-06 13:45:00 <NA>
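To connect the flooring idea back to the original question, the floored timestamp can serve as the grouping column for the sum. The sketch below is self-contained and uses its own helper, floor_to_interval, a hypothetical name chosen so it does not clash with the function above or with lubridate's floor_datetime; the fixed UTC time zone is also an assumption to keep the output reproducible.

```r
# Hypothetical helper: floor POSIXct values to a multiple of `seconds`
floor_to_interval <- function(x, seconds) {
  stopifnot(inherits(x, "POSIXct"))
  as.POSIXct(floor(as.numeric(x) / seconds) * seconds,
             origin = "1970-01-01", tz = "UTC")
}

# Thirty minutes of made-up minutely data, one count per minute
dat <- data.frame(
  time  = as.POSIXct("2016-05-01 00:00:00", tz = "UTC") + 60 * (0:29),
  count = 1L
)

# Floor to 15 minutes, then sum the counts within each interval
dat$by15 <- floor_to_interval(dat$time, 15 * 60)
summary15 <- aggregate(count ~ by15, data = dat, FUN = sum)
summary15
```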
You can do it in one line by using the trs function from the foqat package, like this:
df_15mins=trs(df, "15 mins")
Below is a reproducible example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455

R time series missing values

I was working with a time series dataset having hourly data. The data contained a few missing values so I tried to create a dataframe (time_seq) with the correct time value and do a merge with the original data so the missing values become 'NA'.
> data
date value
7980 2015-03-30 20:00:00 78389
7981 2015-03-30 21:00:00 72622
7982 2015-03-30 22:00:00 65240
7983 2015-03-30 23:00:00 47795
7984 2015-03-31 08:00:00 37455
7985 2015-03-31 09:00:00 70695
7986 2015-03-31 10:00:00 68444
# converting the date in the data to POSIXct format.
> data$date <- format.POSIXct(data$date,'%Y-%m-%d %H:%M:%S')
# creating a dataframe with the correct sequence of dates.
> time_seq <- seq(from = as.POSIXct("2014-05-01 00:00:00"),
to = as.POSIXct("2015-04-30 23:00:00"), by = "hour")
> df <- data.frame(date=time_seq)
> df
date
8013 2015-03-30 20:00:00
8014 2015-03-30 21:00:00
8015 2015-03-30 22:00:00
8016 2015-03-30 23:00:00
8017 2015-03-31 00:00:00
8018 2015-03-31 01:00:00
8019 2015-03-31 02:00:00
8020 2015-03-31 03:00:00
8021 2015-03-31 04:00:00
8022 2015-03-31 05:00:00
8023 2015-03-31 06:00:00
8024 2015-03-31 07:00:00
# merging with the original data
> a <- merge(data,df, x.by = data$date, y.by = df$date ,all=TRUE)
> a
date value
4005 2014-07-23 07:00:00 37003
4006 2014-07-23 07:30:00 NA
4007 2014-07-23 08:00:00 37216
4008 2014-07-23 08:30:00 NA
The values I get after merging are incorrect and contain half-hourly values. What would be the correct approach for solving this?
Why is the merge result in 30-minute intervals when both my dataframes are hourly?
PS: I looked into this question: Fastest way for filling-in missing dates for data.table and followed the steps, but it didn't help.
You can use the padr package to solve this problem.
library(padr)
library(dplyr) #for the pipe operator
data %>%
pad() %>%
fill_by_value()
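The merge-based approach from the question can also be made to work in base R. Two details in the question's code stand out: format.POSIXct returns character strings rather than POSIXct values, and merge takes by, by.x and by.y (not x.by and y.by). The sketch below uses a few made-up rows in the shape of the question's data and a fixed UTC time zone; both are assumptions for illustration, and it does not claim to explain where the half-hourly rows came from.

```r
# A few hourly rows with a gap, stored as text as in a freshly read file
data <- data.frame(
  date  = c("2015-03-30 22:00:00", "2015-03-30 23:00:00", "2015-03-31 02:00:00"),
  value = c(65240, 47795, 37455),
  stringsAsFactors = FALSE
)

# Parse the text into real date-times (as.POSIXct, not format.POSIXct)
data$date <- as.POSIXct(data$date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Complete hourly sequence covering the observed range
full <- data.frame(date = seq(min(data$date), max(data$date), by = "hour"))

# Left join: every hour is kept, missing readings become NA
merged <- merge(full, data, by = "date", all.x = TRUE)
merged
```

Both data frames must use the same time zone for the keys to match; otherwise the join can silently produce extra unmatched rows.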

Alter values in one data frame based on comparison values in another in R

I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
#create sample data
Time<-as.POSIXct(c("2015-10-02 08:00:00","2015-11-02 11:00:00","2015-10-11 10:00:00","2015-11-11 09:00:00","2015-10-24 08:00:00","2015-10-27 08:00:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,01,02,02,03,03)
data<-data.frame(Time,ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value one hour should be subtracted from the value in data:
#create sample comparison data
Comparison<-as.POSIXct(c("2015-10-29 08:00:00","2015-11-02 08:00:00","2015-10-26 08:30:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,02,03)
ComparisonData<-data.frame(Comparison,ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this but I cannot understand how to also check the times using the right timing with that particular ID.
I think ddply seems quite a promising option but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID and then just modify the Times which are earlier than or equal to Comparison.
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (not sure how efficient this is, though):
setDT(data)[ComparisonData,
Time := ifelse(Time <= i.Comparison,
Time - 3600L, Time),
on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
There is probably a better solution than this, but I think this works.
for(i in seq_len(nrow(data))) {
  # "<=" because times equal to the comparison value should also be shifted
  if(data$Time[i] <= ComparisonData[data$ID[i], 1]){
    data$Time[i] <- data$Time[i] - 3600
  }
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This iterates through every row in data.
ComparisonData[data$ID[i], 1] gets the Comparison time for the corresponding ID (this relies on row n of ComparisonData holding ID n, as in the sample data). If it is later than or equal to the Time value in data, the time is reduced by 1 hour.
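The same lookup can be sketched without a loop in base R by using match() to align each row of data with its comparison time. Unlike indexing by row number, this does not depend on the IDs being 1, 2, 3 in order. The sample values are taken from the question; the fixed UTC time zone and the intermediate names cmp and early are assumptions for illustration.

```r
data <- data.frame(
  Time = as.POSIXct(c("2015-10-02 08:00:00", "2015-11-02 11:00:00",
                      "2015-10-11 10:00:00", "2015-11-11 09:00:00",
                      "2015-10-24 08:00:00", "2015-10-27 08:00:00"),
                    tz = "UTC"),
  ID = c(1, 1, 2, 2, 3, 3)
)
ComparisonData <- data.frame(
  Comparison = as.POSIXct(c("2015-10-29 08:00:00", "2015-11-02 08:00:00",
                            "2015-10-26 08:30:00"), tz = "UTC"),
  ID = c(1, 2, 3)
)

# Comparison time for each row of data, looked up by ID
cmp <- ComparisonData$Comparison[match(data$ID, ComparisonData$ID)]

# Rows at or before their comparison time are shifted back one hour
early <- data$Time <= cmp
data$Time[early] <- data$Time[early] - 3600
data
```

An ID that is missing from ComparisonData would give NA in cmp and therefore NA in early, which makes the assignment fail; such rows would need to be handled first, for example with early[is.na(early)] <- FALSE.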

Resources