Selecting rows by date and time of day with POSIXct in R

I couldn't find a solution to my problem with the POSIXct format. I have monthly data; this is a snippet of my code:
Data <- as.POSIXct(as.character(czerwiec$Data), format = "%Y-%m-%d %H:%M:%S")
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) & Data <= as.POSIXct(as.character("2013-06-09 23:59:59"))
czerwiec <- czerwiec[get.rows,]
Data <- Data[get.rows]
I chose one whole week of June, from the 3rd to the 9th, and wanted to estimate the sum of column X (czerwiec$X) for every hour. As you can see, I could narrow the time range, but it would be tedious to do it like this:
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-03 00:59:59"))
then
get.rows <- Data >= as.POSIXct(as.character("2013-06-04 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-04 00:59:59"))
and at the end of these operations estimate the sum for that hour, and so on.
Do you have any idea how I can select every row whose date is between 2013-06-03 and 2013-06-09 and whose time of day is between 00:00:01 and 00:59:59?
About the data frame "czerwiec": it has three columns, where the first is called "ID", the second "Price", and the third "Data" (which means Date).
Thx for help :)

This might help. I've used the lubridate package, which doesn't really do anything you can't do in base R, but it makes handling dates much easier
# Set up Data as a string vector
Data <- c("2013-06-01 05:05:05", "2013-06-06 05:05:05", "2013-06-06 08:10:05", "2013-07-07 05:05:05")
require(lubridate)
# Set up the data frame with fake data. This makes a reproducible example
set.seed(4) #For reproducibility, always set the seed when using random numbers
# Create a data frame with Data and price
czerwiec <- data.frame(price=runif(4))
# Use lubridate to turn the Data string into a vector of POSIXct objects
czerwiec$Data <- ymd_hms(Data)
# Determine the 'yearday' -i.e. yearday of Jan 1 is 1; yearday of Dec 31 is 365 (or 366 in a leap year)
czerwiec$yday <- yday(czerwiec$Data)
# in.range is true if the date is in the desired date range
czerwiec$in.range <- czerwiec$yday >= yday(ymd("2013-06-03")) &
                     czerwiec$yday <= yday(ymd("2013-06-09"))
# Pick out the dates that have the range that you want
selected_dates <- subset(czerwiec, in.range==TRUE)
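The original question also asked for the sum of a column for every hour across the selected week; here is a minimal sketch of that last step (not part of the original answer), using the fake data above, where the real data's czerwiec$X would take the place of price:
# Sketch only: sum the price column for each hour of the day within the selected rows
selected_dates$hour <- hour(selected_dates$Data)   # lubridate's hour of day, 0-23
hourly_sums <- aggregate(price ~ hour, data = selected_dates, FUN = sum)
hourly_sums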

Related

Create efficient week over week calculation with subsetting

In my working dataset, I'm trying to calculate week-over-week values for the changes in wholesale and revenue. The code seems to work, but my estimates show it will take about 75 hours to run what is a seemingly simple calculation. Below is the generic reproducible version, which takes about 2 minutes to run on this smaller dataset:
########################################################################################################################
# MAKE A GENERIC REPRODUCIBLE STACK OVERFLOW QUESTION
########################################################################################################################
# Create empty data frame of 26,000 observations similar to my data, but populated with noise
exampleData <- data.frame(product = rep(LETTERS,1000),
wholesale = rnorm(1000*26),
revenue = rnorm(1000*26))
# create a week_ending column which increases by one week with every set of 26 "products"
for(i in 1:nrow(exampleData)){
  exampleData$week_ending[i] <- as.Date("2016-09-04") + 7*floor((i-1)/26)
}
exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")
# create empty columns to fill
exampleData$wholesale_wow <- NA
exampleData$revenue_wow <- NA
# loop through the wholesale and revenue numbers and append the week-over-week changes
for(i in 1:nrow(exampleData)){
  # set a condition where the loop only appends the week-over-week values if it's not the first week
  if(exampleData$week_ending[i] != "2016-09-04"){
    # set temporary values for the current and past week's wholesale value
    currentWholesale <- exampleData$wholesale[i]
    lastWeekWholesale <- exampleData$wholesale[which(exampleData$product == exampleData$product[i] &
                                                     exampleData$week_ending == exampleData$week_ending[i] - 7)]
    exampleData$wholesale_wow[i] <- currentWholesale/lastWeekWholesale - 1
    # set temporary values for the current and past week's revenue
    currentRevenue <- exampleData$revenue[i]
    lastWeekRevenue <- exampleData$revenue[which(exampleData$product == exampleData$product[i] &
                                                 exampleData$week_ending == exampleData$week_ending[i] - 7)]
    exampleData$revenue_wow[i] <- currentRevenue/lastWeekRevenue - 1
  }
}
Any help understanding why this takes so long or how to cut down the time would be much appreciated!
The first for loop can be replaced with the following vectorized expression:
exampleData$week_ending2 <- as.Date("2016-09-04") + 7 * floor((seq_len(nrow(exampleData)) - 1) / 26)
setequal(exampleData$week_ending, exampleData$week_ending2)
[1] TRUE
Replacing the second for loop:
library(data.table)
dt1 <- as.data.table(exampleData)
dt1[, wholesale_wow := wholesale / shift(wholesale) - 1 , by = product]
dt1[, revenue_wow := revenue / shift(revenue) - 1 , by = product]
setequal(exampleData, dt1)
[1] TRUE
This takes about 4 milliseconds to run on my laptop
Here is a vectorized solution using the tidyr package.
set.seed(123)
# Create empty data frame of 26,000 observations similar to my data, but populated with noise
exampleData <- data.frame(product = rep(LETTERS,1000),
wholesale = rnorm(1000*26),
revenue = rnorm(1000*26))
# create a week_ending column which increases by one week with every set of 26 "products"
#vectorize the creating of the data
i<-1:nrow(exampleData)
exampleData$week_ending <- as.Date("2016-09-04")+7*floor((i-1)/26)
exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")
# create empty columns to fill
exampleData$wholesale_wow <- NA
exampleData$revenue_wow <- NA
#find the index of rows of interest (ie removing the first week)
i<-i[exampleData$week_ending!="2016-09-04"]
library(tidyr)
#create temp variables and convert into wide format
# the rows are product and the columns are the ending weeks
Wholesale<-exampleData[ ,c(1,2,4)]
Wholesale<-spread(Wholesale, week_ending, wholesale)
Revenue<-exampleData[ ,c(1,3,4)]
Revenue<-spread(Revenue, week_ending, revenue)
#number of columns
numCol<-ncol(Wholesale)
#remove the first two columns for current wholesale
#remove the first and last column for last week's wholesale
#perform the calculation on every element in the data frame (divide this week / last week)
Wholesale_wow<- Wholesale[ ,-c(1, 2)]/Wholesale[ ,-c(1, numCol)] - 1
#convert back to long format
Wholesale_wow<-gather(Wholesale_wow)
#repeat for revenue
Revenue_wow<- Revenue[ ,-c(1, 2)]/Revenue[ ,-c(1, numCol)] - 1
#convert back to long format
Revenue_wow<-gather(Revenue_wow)
#assemble calculated values back into the original dataframe
exampleData$wholesale_wow[i]<-Wholesale_wow$value
exampleData$revenue_wow[i]<-Revenue_wow$value
The strategy was to convert the original data into a wide format where the rows were the product IDs and the columns were the weeks, then divide the two data frames by each other, convert back into a long format, and add the newly calculated values to the exampleData data frame. This works; it is not very clean, but it is much faster than the loop. The dplyr package is another tool for this type of work (a sketch of a dplyr version follows below).
To compare the results of this code with your test case, use:
print(identical(goldendata, exampleData))
where goldendata is your known-good result; be sure to use the same random numbers by calling set.seed() with the same seed.
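Since dplyr is mentioned above as another tool, here is a hedged sketch of the same week-over-week calculation with dplyr's lag(); it assumes the exampleData frame built above, with rows already ordered by week within each product:
library(dplyr)
# Sketch only: within each product, divide each value by the previous week's value;
# lag() yields NA for the first week, matching the NA placeholders above
exampleData_dplyr <- exampleData %>%
  group_by(product) %>%
  arrange(week_ending, .by_group = TRUE) %>%
  mutate(wholesale_wow = wholesale / lag(wholesale) - 1,
         revenue_wow   = revenue / lag(revenue) - 1) %>%
  ungroup()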

How to calculate summary statistics within specified date/time range within time series, using an input of multiple start and end dates?

I have a (dummy) data frame with time series data:
datetime <- as.POSIXct(seq(ISOdate(2012,12,22), ISOdate(2012,12,23), by="hour"), tz='EST')
data <- rnorm(25, 10, 5)
df <- data.frame(datetime, data)
I also have a separate data frame with start and end times as the two columns:
start <- as.POSIXct(c('2012/12/22 19:53', '2012/12/22 23:05'), tz='gmt')
end <- as.POSIXct(c('2012/12/22 21:06', '2012/12/22 23:58'), tz='gmt')
index <- data.frame(start, end)
What I'd like to do is "feed" the main data frame 'df' the 'index' data frame, and, for each start and end date/time combination, find the average value of "data" within that date/time range. This would be equivalent to doing a subset of 'df' manually for each start/end time, but in a combined fashion. (My real data set has years of data, and a hundred date/time ranges I want to feed it FYI).
End goal is to have three columns, start time, end time, and the average numeric value of 'data' within those times.
In general you don't want to grow a data frame one row at a time by calling rbind because it is very inefficient (see the second Circle of The R Inferno for details). In your case, you can use sapply to replicate this logic:
index$mean <- sapply(1:nrow(index), function(i) mean(df[df$datetime >= index$start[i] &
df$datetime <= index$end[i],2]))
index
# start end mean
# 1 2012-12-22 19:53:00 2012-12-22 21:06:00 9.563336
# 2 2012-12-22 23:05:00 2012-12-22 23:58:00 NaN
The second range returns NaN because the data are hourly, on the hour, so no observation falls between 23:05 and 23:58 and the mean is taken over an empty set.
I figured out how to do it with a for loop. If anyone has a more efficient solution, that would be great. The for loop solution:
d <- data.frame()
for (i in 1:nrow(index)) {
  d <- rbind(d, mean(subset(df, datetime >= index[i,1] &
                                datetime <= index[i,2])[, 2]))
}
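For the stated end goal of three columns (start time, end time, and the average of 'data' within those times), here is a minimal sketch building on the sapply idea above, assuming the df and index data frames defined in the question:
# Sketch only: one row per (start, end) range, without growing a data frame with rbind
window_mean <- function(i) mean(df$data[df$datetime >= index$start[i] &
                                        df$datetime <= index$end[i]])
d <- cbind(index, mean = vapply(seq_len(nrow(index)), window_mean, numeric(1)))
d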

lm and time series formats - is conversion necessary?

I want the slope from a couple of columns that look like this:
date time
7/8/2014 23.4917166
7/9/2014 28.69671107
7/10/2014 27.3262166
7/11/2014 30.25426663
7/12/2014 29.8345944
7/13/2014 27.7473055
7/14/2014 29.8657722
7/15/2014 29.2622055
The problem is, lm() doesn't seem to play ball with date in a mm/dd/yyyy format. If I make the date data numeric like so:
date time
1 23.4917166
2 28.69671107
3 27.3262166
4 30.25426663
5 29.8345944
6 27.7473055
7 29.8657722
8 29.2622055
and run something like
timetest <- read.table("clipboard", sep="\t", header=T)
test <- lm(time ~ date, data=timetest)
coefficients(test)[2]
I get:
date
0.5605038
So how should I go about transforming the mm/dd/yyyy date format into something numeric? Is there a function to cast them as unix time?
If you first convert the date field to a Date, specifying the format used (MM/DD/YYYY, i.e. %m/%d/%Y), then lm does the numeric conversion for you:
timetest$new_date <- as.Date(timetest$date, format = "%m/%d/%Y")
So, the regression looks like
test <- lm(time ~ new_date, data = timetest)
coefficients(test)[2]
and gives
as.numeric(new_date)
0.5605038
Note that as.numeric turns the date into the number of days since 1970-01-01
as.numeric(timetest$new_date[1])
[1] 16259
and
difftime(timetest$new_date[1], as.Date("1970-01-01"))
Time difference of 16259 days
You can also use predict to obtain predicted values for new dates given in the original field's format:
predict(test, data.frame(new_date =
seq.Date(as.Date("7/16/2014", format = "%m/%d/%Y"),
as.Date("7/20/2014", format = "%m/%d/%Y"), by = 1)))
that returns
1 2 3 4 5
30.83212 31.39262 31.95312 32.51363 33.07413
For some reason the as.POSIXct() wasn't working, so I went with:
timetest <- read.table("clipboard", sep="\t", header=T)
timetest$date <- as.numeric(as.Date(timetest$date, "%m/%d/%Y"))
test <- lm(time ~ date, data=timetest)
coefficients(test)[2]
where the new second line just overwrites the original m/d/y data with numeric values. Unix time is not necessary for this process.
If you want to convert your dates into Unix time, you can use something like
unix_time <- as.numeric(as.POSIXct(timetest$date, format = "%m/%d/%Y"))
but the values will end with a lot of zeros, so I'm not sure how useful they'll be in a regression.
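If you do want to go through Unix time, one option (a sketch, not from the original answers, and assuming timetest$date still holds the original m/d/Y strings) is to rescale the seconds to days so the slope is again a change per day, matching the Date-based regression above:
# Sketch only: seconds since 1970-01-01 (UTC assumed), divided by 86400 to get days
timetest$unix_days <- as.numeric(as.POSIXct(timetest$date,
                                            format = "%m/%d/%Y", tz = "UTC")) / 86400
test_unix <- lm(time ~ unix_days, data = timetest)
coefficients(test_unix)[2]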

Finding a more elegant way to aggregate hourly data to mean hourly data using zoo

I have a chunk of data logging temperatures from a few dozen devices every hour for over a year. The data are stored as a zoo object. I'd very much like to summarize those data by looking at the average values for every one of the 24 hours in a day (1am, 2am, 3am, etc.). So that for each device I can see what its average value is for all the 1am times, 2am times, and so on. I can do this with a loop but sense that there must be a way to do this in zoo with an artful use of aggregate.zoo. Any help?
require(zoo)
# random hourly data over 30 days for five series
x <- matrix(rnorm(24 * 30 * 5),ncol=5)
# Assign hourly data with a real time and date
x.DateTime <- as.POSIXct("2014-01-01 0100", format = "%Y-%m-%d %H%M") +
              seq(0, 24 * 30 * 60 * 60 - 3600, by = 3600)   # 720 stamps, matching nrow(x)
# make a zoo object
x.zoo <- zoo(x, x.DateTime)
#plot(x.zoo)
# what I want:
# the average value for each series at 1am, 2am, 3am, etc. so that
# the dimensions of the output are 24 (hours) by 5 (series)
# If I were just working on x I might do something like:
res <- matrix(NA,ncol=5,nrow=24)
for(i in 1:nrow(res)){
  res[i,] <- apply(x[seq(i, nrow(x), by = 24), ], 2, mean)
}
res
# how can I avoid the loop and write an aggregate statement in zoo that
# will get me what I want?
Calculate the hour for each time point and then aggregate by that:
hr <- as.numeric(format(time(x.zoo), "%H"))
ag <- aggregate(x.zoo, hr, mean)
dim(ag)
## [1] 24 5
ADDED
Alternately use hours from chron or hour from data.table:
library(chron)
ag <- aggregate(x.zoo, hours, mean)
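The data.table variant mentioned above would look much the same; a sketch only, assuming data.table is installed:
library(data.table)
# Sketch only: data.table's hour() extracts the hour of day from the POSIXct index
ag_dt <- aggregate(x.zoo, hour, mean)
dim(ag_dt)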
This is quite similar to the other answer but takes advantage of the fact that the by=... argument to aggregate.zoo(...) can be a function, which will be applied to time(x.zoo):
as.hour <- function(t) as.numeric(format(t,"%H"))
result <- aggregate(x.zoo,as.hour,mean)
identical(result,ag) # ag from G. Grothendieck answer
# [1] TRUE
Note that this produces a result identical to the other answer, but not the same as yours. This is because your dataset starts at 1:00am, not midnight, so your loop produces a matrix wherein the first row corresponds to 1:00am and the last row corresponds to midnight. These solutions produce zoo objects wherein the first row corresponds to midnight.
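As a quick check of that last point, a sketch (assuming the res and ag objects created above): moving the midnight row of the aggregated result to the end should reproduce the loop's matrix.
# Sketch only: row 1 of ag is hour 0 (midnight); rotating it to the end matches res,
# whose first row is 1:00am and whose last row is midnight
all.equal(unname(coredata(ag)[c(2:24, 1), ]), res)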

Mean hour-of-day and imputation...would this be easier with time calculations?

I'm working with a data set and am imputing NAs for times. I have a simplified example below where I create a new column that includes the original data and imputed values for the NAs (i.e., the mean time of day). The code works fine, but I am so weak with dates that I was wondering whether there is an easier way to calculate the mean of time-of-day values?
arrivals <- data.frame(
ships=c("Glory","Discover","Intrepid","Enchantment","Summit"),
times=c("8:00","10:00","11:42",NA,"9:20"), stringsAsFactors=FALSE)
sumtime <- sapply(strsplit(as.character(arrivals$times),":"),
function(x) as.numeric(x[1])*60 + as.numeric(x[2]))
avgtime <- paste(trunc((mean(sumtime, na.rm=TRUE)/60)),":",
trunc(mean(sumtime, na.rm=TRUE)%%60), sep="")
arrivals$times2 <- arrivals$times
arrivals$times2[is.na(arrivals$times)] <- avgtime
You can use the chron package to convert your times column to a numeric representation that you can take the average of:
library(chron)
Arrivals <- arrivals[,c("ships","times")]
# Will give some warnings due to the missing value
Arrivals$times <- chron(times.=paste(Arrivals$times, ":00", sep=""))
Arrivals$times[is.na(Arrivals$times)] <- mean(Arrivals$times,na.rm=TRUE)
ships times
1 Glory 08:00:00
2 Discover 10:00:00
3 Intrepid 11:42:00
4 Enchantment 09:45:30
5 Summit 09:20:00
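As an alternative (a sketch only, not part of the original answer), the same imputation can be done with lubridate, which appears earlier in this document; this assumes the arrivals data frame defined above:
library(lubridate)
# Sketch only: parse the "H:MM" strings to seconds since midnight, average them,
# and fill the missing arrival with the formatted mean (09:45:30 for this data)
secs <- period_to_seconds(hm(arrivals$times))   # the NA stays NA (with a parse warning)
avg <- seconds_to_period(mean(secs, na.rm = TRUE))
arrivals$times2 <- arrivals$times
arrivals$times2[is.na(arrivals$times)] <- sprintf("%02.0f:%02.0f:%02.0f",
                                                  hour(avg), minute(avg), second(avg))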
