Dealing with datetime values in R

I have a large data.table with a single column, date, but str() reports that column as chr (character):
date
2015-07-01 0:15:00
2015-07-01 0:30:00
2015-07-01 0:45:00
2015-07-01 0:60:00
2015-07-01 1:15:00
2015-07-01 1:30:00
2015-07-01 1:45:00
2015-07-01 1:60:00
What I want to do is:
put them in a standard format, e.g. 2015-07-01 00:15:00
correct the time, e.g. 2015-07-01 1:60:00 -> 2015-07-01 02:00:00
For the first one, I tried as.POSIXct() to reset the format. That works for valid times, but values like 2015-07-01 1:60:00 come back as NA after the conversion.
Does anybody have ideas?
Here is a code to generate test data:
dd <- data.table(date = c("2015-07-01 0:15:00", "2015-07-01 0:30:00",
"2015-07-01 0:45:00","2015-07-01 0:60:00", "2015-07-01 1:15:00",
"2015-07-01 1:30:00","2015-07-01 1:45:00","2015-07-01 1:60:00","2015-07-01 2:15:00"))
Note: this table covers just one day, and the last value in the table is
2015-07-01 23:60:00
If anything is unclear, feel free to let me know.
Thanks!

In base R you could try this:
df1$date <- gsub(":60:",":59:",df1$date, fixed = TRUE)
df1$date <- as.POSIXct(df1$date)
the59s <- grepl(":59:",df1$date)
df1$date[the59s] <- df1$date[the59s] + 60
#> df1
# date
#1 2015-07-01 00:15:00
#2 2015-07-01 00:30:00
#3 2015-07-01 00:45:00
#4 2015-07-01 01:00:00
#5 2015-07-01 01:15:00
#6 2015-07-01 01:30:00
#7 2015-07-01 01:45:00
#8 2015-07-01 02:00:00
#9 2015-07-01 02:15:00
The idea is to let POSIXct perform the conversion to the next hour / day / month / ... triggered by a "60 minutes" value. For this we first identify those entries containing :60: and replace that part with :59:. Then the column is converted into a POSIXct object. Afterwards we find all those entries containing a ":59:" and add 60 (seconds), thereby converting the time/date to the intended format.
In the case described by the OP the data contains only quarter-hour minute values: 0, 15, 30, 45, 60. A more general situation may include genuine 59-minute values that should not be moved to the next hour. It would then be better to store the relevant row indices before performing the conversion:
the60s <- grepl(":60:", df1$date)
df1$date <- gsub(":60:",":59:",df1$date, fixed = TRUE)
df1$date <- as.POSIXct(df1$date)
df1$date[the60s] <- df1$date[the60s] + 60
data:
df1 <- structure(list(date = structure(1:9, .Label = c("2015-07-01 0:15:00",
"2015-07-01 0:30:00", "2015-07-01 0:45:00", "2015-07-01 0:60:00",
"2015-07-01 1:15:00", "2015-07-01 1:30:00", "2015-07-01 1:45:00",
"2015-07-01 1:60:00", "2015-07-01 2:15:00"), class = "factor")),
.Names = "date", row.names = c(NA, -9L), class = "data.frame")
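An alternative to the :59: substitution, sketched here under the assumption that every value has the shape "YYYY-MM-DD H:MM:SS": split off the clock part, convert it to seconds since midnight, and add those seconds to midnight of the date. The variable names and the choice of tz = "UTC" (which sidesteps daylight-saving shifts) are illustrative assumptions, not part of the original answer.

```r
# Sample values from the question, plus the 23:60:00 case that rolls into the next day
x <- c("2015-07-01 0:15:00", "2015-07-01 0:60:00",
       "2015-07-01 1:60:00", "2015-07-01 23:60:00")

# Split each string into its date part and its clock part
parts <- strsplit(x, " ", fixed = TRUE)
day   <- vapply(parts, `[`, character(1), 1)
clock <- vapply(parts, `[`, character(1), 2)

# Convert "H:M:S" to seconds since midnight; 60 minutes simply becomes 3600 seconds
secs <- vapply(strsplit(clock, ":", fixed = TRUE),
               function(p) sum(as.numeric(p) * c(3600, 60, 1)),
               numeric(1))

# Midnight of each date plus the offset gives the corrected timestamp
fixed  <- as.POSIXct(day, format = "%Y-%m-%d", tz = "UTC") + secs
result <- format(fixed, "%Y-%m-%d %H:%M:%S")
result
```

Because the arithmetic is done in seconds, no row indices need to be remembered, and genuine 59-minute values are left untouched.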

Related

Extract maximum hourly value each day R

I have this data.frame:
Time a b c d
1 2015-01-01 00:00:00 863 1051 1899 25385
2 2015-01-01 01:00:00 920 1009 1658 24382
3 2015-01-01 02:00:00 1164 973 1371 22734
4 2015-01-01 03:00:00 1503 949 779 21286
5 2015-01-01 04:00:00 1826 953 720 20264
6 2015-01-01 05:00:00 2109 952 743 19905
...
Time a b c d
8756 2015-12-31 19:00:00 0 775 4957 28812
8757 2015-12-31 20:00:00 0 783 5615 29568
8758 2015-12-31 21:00:00 0 790 4838 28653
8759 2015-12-31 22:00:00 0 766 3841 27078
8760 2015-12-31 23:00:00 72 729 2179 24565
8761 2016-01-01 00:00:00 290 710 1612 23311
It represents every hour of every day for a year. I would like to extract one line per day, as a function of the maximum value of d. So at the end I want to obtain a data.frame of 365x5.
I have tried all the suggestions from "Extract the maximum value within each group in a dataframe" and from "Daily minimum values in R", but it still doesn't work.
Maybe the problem comes from the way I generate my time series?
library(lubridate)
start <- dmy_hms("1 Jan 2015 00:00:00")
end <- dmy_hms("01 Jan 2016 00:00:00")
time <- as.data.frame(seq(start, end, by="hours"))
Thanks for help!
Since we are aggregating by day, convert the 'Time' column to Date class (dropping the time of day), group by it, and take the max of 'd'. The OP's post uses both mydf and df in the data.table syntax; assuming these are the same object, we need
library(data.table)
setDT(mydf)[, .(d = max(d)), by = .(Day = as.Date(Time))]
Or using aggregate from base R
aggregate(d ~ Day, transform(mydf, Day = as.Date(Time)), FUN = max)
Or with tidyverse
library(tidyverse)
mydf %>%
group_by(Day = as.Date(Time)) %>%
summarise(d = max(d))
NOTE: Based on the OP's comments, columns 'a' to 'd' are of class factor. They need to be converted to numeric, either up front or during processing
mydf$d <- as.numeric(as.character(mydf$d))
For multiple columns
mydf[c('a', 'b', 'c', 'd')] <- lapply(mydf[c('a', 'b', 'c', 'd')], function(x)
as.numeric(as.character(x)))
data
mydf <- structure(list(Time = c("2015-01-01 00:00:00", "2015-01-01 01:00:00",
"2015-01-01 02:00:00", "2015-01-01 03:00:00", "2015-01-01 04:00:00",
"2015-01-01 05:00:00"), a = c(863L, 920L, 1164L, 1503L, 1826L,
2109L), b = c(1051L, 1009L, 973L, 949L, 953L, 952L), c = c(1899L,
1658L, 1371L, 779L, 720L, 743L), d = c(25385L, 24382L, 22734L,
21286L, 20264L, 19905L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
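The reason for going through as.character() in the note above can be seen in a small sketch. The factor below is hypothetical (its labels are borrowed from column d of the sample data); it is only meant to show the difference between a factor's level codes and its labels.

```r
# A factor whose labels look like numbers (hypothetical, for illustration only)
d <- factor(c("25385", "24382", "22734"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(d)

# Going through as.character() first recovers the printed values
values <- as.numeric(as.character(d))

codes        # positions in the sorted levels
values       # the numbers that were actually printed
max(values)
```

Taking max() of the codes would silently give the wrong answer, which is why the conversion has to go through the labels.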
max() doesn't work on factors, so convert the column you are taking the maximum of (in your case, column d) to numeric. Note that as.numeric() applied directly to a factor returns the internal level codes, so go through as.character() first.
Assuming your data set is in a data frame:
mydf$d = as.numeric(as.character(mydf$d))
Thanks for your help! In the end I chose
do.call(rbind, lapply(split(test,test$time), function(x) {return(x[which.max(x$d),])}))
which gives me a 365x5 data.frame. All your suggestions were right; I just needed to change my time series to a day index, like this:
time <- as.data.frame(rep(c(1:365), each = 24))
test<- cbind.data.frame(time, df, timebis)
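The split / which.max approach can also group directly on the calendar day, without building a separate day index. This is a minimal sketch on hypothetical data (two days, two hours each, column names taken from the question); tz = "UTC" is an assumption made so that as.Date() and the timestamps agree on where a day starts.

```r
# Hypothetical two-day sample with the question's column names
mydf <- data.frame(
  Time = as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 01:00:00",
                      "2015-01-02 00:00:00", "2015-01-02 01:00:00"),
                    tz = "UTC"),
  a = c(863L, 920L, 1164L, 1503L),
  d = c(25385L, 24382L, 21286L, 22734L)
)

# One data frame per calendar day
by_day <- split(mydf, as.Date(mydf$Time))

# Keep the complete row holding the largest d within each day
daily_max <- do.call(rbind, lapply(by_day, function(x) x[which.max(x$d), ]))
daily_max
```

Unlike aggregating only d, this keeps every column of the winning row, so a full year of hourly data would yield one complete row per day.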

R add specific (different) amounts of times to entire column

I have a table in R like:
start duration
02/01/2012 20:00:00 5
05/01/2012 07:00:00 6
etc... etc...
I got to this by importing a table from Microsoft Excel that looked like this:
date time duration
2012/02/01 20:00:00 5
etc...
I then merged the date and time columns by running the following code:
d.f <- within(d.f, { start=format(as.POSIXct(paste(date, time)), "%m/%d/%Y %H:%M:%S") })
I want to create a third column called 'end', which will be calculated as the number of hours after the start time. I am pretty sure that my time is a POSIXct vector. I have seen how to manipulate one datetime object, but how can I do that for the entire column?
The expected result should look like:
start duration end
02/01/2012 20:00:00 5 02/02/2012 01:00:00
05/01/2012 07:00:00 6 05/01/2012 13:00:00
etc... etc... etc...
Using lubridate
> library(lubridate)
> df$start <- mdy_hms(df$start)
> df$end <- df$start + hours(df$duration)
> df
# start duration end
#1 2012-02-01 20:00:00 5 2012-02-02 01:00:00
#2 2012-05-01 07:00:00 6 2012-05-01 13:00:00
data
df <- structure(list(start = c("02/01/2012 20:00:00", "05/01/2012 07:00:00"
), duration = 5:6), .Names = c("start", "duration"), class = "data.frame", row.names = c(NA,
-2L))
You can simply add duration*3600 (hours converted to seconds) to the start column of the data frame, once it is POSIXct. E.g. with one date:
start = as.POSIXct("02/01/2012 20:00:00",format="%m/%d/%Y %H:%M:%S")
start
[1] "2012-02-01 20:00:00 CST"
start + 5*3600
[1] "2012-02-02 01:00:00 CST"
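The same arithmetic is vectorised, so it applies to the whole column at once. A minimal sketch using the two rows from the question; tz = "UTC" is an assumption made here to keep the result independent of the local time zone.

```r
df <- data.frame(start = c("02/01/2012 20:00:00", "05/01/2012 07:00:00"),
                 duration = 5:6,
                 stringsAsFactors = FALSE)

# Parse the character column into POSIXct
df$start <- as.POSIXct(df$start, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# POSIXct arithmetic is in seconds, so hours are multiplied by 3600
df$end <- df$start + df$duration * 3600

# Format back to the month/day/year layout used in the question
end_txt <- format(df$end, "%m/%d/%Y %H:%M:%S")
end_txt
```

Keeping start and end as POSIXct and formatting only for display avoids having to re-parse strings before every calculation.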

subsetting a data frame according factor date

I have a data frame (df) in which one of the columns holds dates. However, that column's type is factor:
> head(df$date)
[1] 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01
1519 Levels: 2010-11-27 2010-11-28 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 2010-12-04 ... 2015-02-07
I want to subset this data frame by date. For example, I want to create a second data frame (df2) containing only the rows of df whose date is earlier than 2014-03-30.
How can I do that in R? I would be very glad for any help. Thanks a lot.
You could begin exploring the lubridate library. It makes working with dates very simple.
df <- data.frame(date = c("2013-01-01", "2014-04-01", "2014-01-01",
"2011-06-01", "2012-03-01", "2014-08-01"))
df
date
1 2013-01-01
2 2014-04-01
3 2014-01-01
4 2011-06-01
5 2012-03-01
6 2014-08-01
library(lubridate)
# ymd - year-month-day
df$date <- ymd(df$date)
with(df, df[date < ymd("2014-03-30"),])
[1] "2013-01-01 UTC" "2014-01-01 UTC" "2011-06-01 UTC" "2012-03-01 UTC"
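A base R sketch of the same subsetting, for cases where loading lubridate is not wanted. The value column is hypothetical and is added only to show that whole rows are kept; drop = FALSE ensures the result remains a data frame even when there is a single column.

```r
df <- data.frame(
  date  = factor(c("2013-01-01", "2014-04-01", "2014-01-01",
                   "2011-06-01", "2012-03-01", "2014-08-01")),
  value = 1:6   # hypothetical payload column
)

# Convert the factor to Date via its labels
df$date <- as.Date(as.character(df$date))

# Keep rows strictly before the cutoff
df2 <- df[df$date < as.Date("2014-03-30"), , drop = FALSE]
df2
```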

Combining time series data with different resolution in R

I have read in and formatted my data set as shown below.
library(xts)
#Read data from file
x <- read.csv("data.dat", header=F)
x[is.na(x)] <- c(0) #If empty fill in zero
#Construct data frames
rawdata.h <- data.frame(x[,2],x[,3],x[,4],x[,5],x[,6],x[,7],x[,8]) #Hourly data
rawdata.15min <- data.frame(x[,10]) #15 min data
#Convert time index to proper format
index.h <- as.POSIXct(strptime(x[,1], "%d.%m.%Y %H:%M"))
index.15min <- as.POSIXct(strptime(x[,9], "%d.%m.%Y %H:%M"))
#Set column names
names(rawdata.h) <- c("spot","RKup", "RKdown","RKcon","anm", "pp.stat","prod.h")
names(rawdata.15min) <- c("prod.15min")
#Convert data frames to time series objects
data.htemp <- xts(rawdata.h,order.by=index.h)
data.15mintemp <- xts(rawdata.15min,order.by=index.15min)
#Select desired subset period
data.h <- data.htemp["2013"]
data.15min <- data.15mintemp["2013"]
I want to be able to combine hourly data from data.h$prod.h with data, with 15 min resolution, from data.15min$prod.15min corresponding to the same hour.
An example would be to take the average of the hourly value at time 2013-12-01 00:00-01:00 with the last 15 minute value in that same hour, i.e. the 15 minute value from time 2013-12-01 00:45-01:00. I'm looking for a flexible way to do this with an arbitrary hour.
Any suggestions?
Edit: Just to clarify further: I want to do something like this:
N <- NROW(data.h$prod.h)
for (i in 1:N){
prod.average[i] <- mean(data.h$prod.h[i] + #INSERT CODE THAT FINDS LAST 15 MIN IN HOUR i )
}
I found a solution to my problem by converting the 15 minute data into hourly data using the very useful .index* functions from the xts package, as shown below.
prod.new <- data.15min$prod.15min[.indexmin(data.15min$prod.15min) %in% c(45:59)]
This creates a new time series with only the values occurring in the 45-59 minute interval of each hour.
For those curious my data looked like this:
Original hourly series:
> data.h$prod.h[1:4]
2013-01-01 00:00:00 19.744
2013-01-01 01:00:00 27.866
2013-01-01 02:00:00 26.227
2013-01-01 03:00:00 16.013
Original 15 minute series:
> data.15min$prod.15min[1:8]
2013-09-30 00:00:00 16.4251
2013-09-30 00:15:00 18.4495
2013-09-30 00:30:00 7.2125
2013-09-30 00:45:00 12.1913
2013-09-30 01:00:00 12.4606
2013-09-30 01:15:00 12.7299
2013-09-30 01:30:00 12.9992
2013-09-30 01:45:00 26.7522
New series with only the last 15 minutes in each hour:
> prod.new[1:4]
2013-09-30 00:45:00 12.1913
2013-09-30 01:45:00 26.7522
2013-09-30 02:45:00 5.0332
2013-09-30 03:45:00 2.6974
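To finish the job described in the question (averaging each hourly value with the last 15 minute value of the same hour), the filtered series still has to be aligned to the hour. Below is a minimal base R sketch using plain data frames rather than xts. The 15 minute values are the ones printed above; the hourly values (10 and 20) and the use of tz = "UTC" are assumptions made purely for illustration.

```r
# Hypothetical hourly values for the two hours covered by the 15 minute sample
hourly <- data.frame(
  time   = as.POSIXct(c("2013-09-30 00:00:00", "2013-09-30 01:00:00"), tz = "UTC"),
  prod.h = c(10, 20)
)

# The eight 15 minute values shown above
q <- data.frame(
  time       = seq(as.POSIXct("2013-09-30 00:00:00", tz = "UTC"),
                   by = "15 min", length.out = 8),
  prod.15min = c(16.4251, 18.4495, 7.2125, 12.1913,
                 12.4606, 12.7299, 12.9992, 26.7522)
)

# Keep only observations in minutes 45-59 of each hour
last_q <- q[as.integer(format(q$time, "%M")) >= 45, ]

# Floor their timestamps to the hour so they line up with the hourly series
last_q$time <- as.POSIXct(trunc(last_q$time, "hours"))

# Join on the hour and average the two columns row by row
combined <- merge(hourly, last_q, by = "time")
combined$prod.average <- rowMeans(combined[, c("prod.h", "prod.15min")])
combined
```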
Short answer
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
Long answer
Since you want to compress the 15 minute time series to a coarser resolution (30 minutes), you should use the dplyr package or any other package that implements the "group by" concept.
For instance:
s = seq(as.POSIXct("2017-01-01"), as.POSIXct("2017-01-02"), "15 min")
df = data.frame(time = s, value=1:97)
df is a time series with 97 rows and two columns.
head(df)
time value
1 2017-01-01 00:00:00 1
2 2017-01-01 00:15:00 2
3 2017-01-01 00:30:00 3
4 2017-01-01 00:45:00 4
5 2017-01-01 01:00:00 5
6 2017-01-01 01:15:00 6
The cut.POSIXt, group_by and summarise functions do the work:
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
t v
1 2017-01-01 00:00:00 1.5
2 2017-01-01 00:30:00 3.5
3 2017-01-01 01:00:00 5.5
4 2017-01-01 01:30:00 7.5
5 2017-01-01 02:00:00 9.5
6 2017-01-01 02:30:00 11.5
A more robust way is to convert the 15 minute values into hourly values by averaging them, and then do whatever operation you want on the hourly data.
### 15 Minutes Data
min15 <- structure(list(V1 = structure(1:8, .Label = c("2013-01-01 00:00:00",
"2013-01-01 00:15:00", "2013-01-01 00:30:00", "2013-01-01 00:45:00",
"2013-01-01 01:00:00", "2013-01-01 01:15:00", "2013-01-01 01:30:00",
"2013-01-01 01:45:00"), class = "factor"), V2 = c(16.4251, 18.4495,
7.2125, 12.1913, 12.4606, 12.7299, 12.9992, 26.7522)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -8L))
min15
### Hourly Data
hourly <- structure(list(V1 = structure(1:4, .Label = c("2013-01-01 00:00:00",
"2013-01-01 01:00:00", "2013-01-01 02:00:00", "2013-01-01 03:00:00"
), class = "factor"), V2 = c(19.744, 27.866, 26.227, 16.013)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -4L))
hourly
### Convert 15min data into hourly data by taking average of 4 values
min15$V1 <- as.POSIXct(min15$V1,origin="1970-01-01 0:0:0")
min15 <- aggregate(. ~ cut(min15$V1,"60 min"),min15[setdiff(names(min15), "V1")],mean)
min15
names(min15) <- c("time","min15")
names(hourly) <- c("time","hourly")
### merge the corresponding values
combined <- merge(hourly,min15)
### average of hourly and 15min values
rowMeans(combined[,2:3])

obtain hour from DateTime vector

I have a DateTime vector within a data.frame where the data frame is made up of 8760 observations representing hourly intervals throughout the year e.g.
2010-01-01 00:00
2010-01-01 01:00
2010-01-01 02:00
2010-01-01 03:00
and so on.
I would like to create a data.frame which has the original DateTime vector as the first column and then the hourly values in the second column e.g.
2010-01-01 00:00 00:00
2010-01-01 01:00 01:00
How can this be achieved?
Use format or strptime to extract the time information.
Create a POSIXct vector:
x <- seq(as.POSIXct("2012-05-21"), by=("+1 hour"), length.out=5)
Extract the time:
data.frame(
date=x,
time=format(x, "%H:%M")
)
date time
1 2012-05-21 00:00:00 00:00
2 2012-05-21 01:00:00 01:00
3 2012-05-21 02:00:00 02:00
4 2012-05-21 03:00:00 03:00
5 2012-05-21 04:00:00 04:00
If the input vector is a character vector, then you have to convert to POSIXct first:
Create some data
dat <- data.frame(
DateTime=format(seq(as.POSIXct("2012-05-21"), by=("+1 hour"), length.out=5), format="%Y-%m-%d %H:%M")
)
dat
DateTime
1 2012-05-21 00:00
2 2012-05-21 01:00
3 2012-05-21 02:00
4 2012-05-21 03:00
5 2012-05-21 04:00
Split time out:
data.frame(
DateTime=dat$DateTime,
time=format(as.POSIXct(dat$DateTime, format="%Y-%m-%d %H:%M"), format="%H:%M")
)
DateTime time
1 2012-05-21 00:00 00:00
2 2012-05-21 01:00 01:00
3 2012-05-21 02:00 02:00
4 2012-05-21 03:00 03:00
5 2012-05-21 04:00 04:00
Or generically, not treating them as dates, you can use the following provided that the time and dates are padded correctly.
library(stringr)
df <- data.frame(DateTime = c("2010-01-01 00:00", "2010-01-01 01:00", "2010-01-01 02:00", "2010-01-01 03:00"))
df <- data.frame(df, Time = str_sub(df$DateTime, -5, -1))
It depends on your needs really.
Using lubridate
library(stringr)
library(lubridate)
library(plyr)
df <- data.frame(DateTime = c("2010-01-01 00:00", "2010-01-01 01:00", "2010-01-01 02:00", "2010-01-01 03:00"))
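df <- mutate(df, DateTime = ymd_hm(DateTime),
# pad hour and minute on the left so that e.g. 1:05 becomes "01:05"
time = str_c(str_pad(hour(DateTime), 2, side = 'left', pad = '0'), str_pad(minute(DateTime), 2, side = 'left', pad = '0'), sep = ':'))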
On a more general note, for anyone who comes here from Google and wants to group by hour:
The key here is: lubridate::hour(datetime)
See p. 22 of the CRAN manual: https://cran.r-project.org/web/packages/lubridate/lubridate.pdf
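Grouping by hour can be sketched in base R as well, using format(x, "%H") as a stand-in for lubridate::hour(). The data below are hypothetical (48 hourly observations over two days), and tz = "UTC" is an assumption to keep the hours unambiguous.

```r
# Two days of hypothetical hourly observations
x  <- seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"), by = "1 hour", length.out = 48)
df <- data.frame(DateTime = x, value = seq_along(x))

# Hour of day as an integer 0-23
df$hour <- as.integer(format(df$DateTime, "%H"))

# Mean of value for each hour of the day, across both days
hourly_mean <- tapply(df$value, df$hour, mean)
hourly_mean[c("0", "23")]
```

Each hour appears once per day, so every group here averages two observations.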

Resources