Efficient and Succinct Vector Transformation of Weekly to Daily Hourly Data in R

I've got a working function, but I'm hoping there is a more succinct way of going about this.
I have a dataset of events that are captured with the hour of the week they occurred in. For example, 4 AM on Sunday = 4, 4 AM on Monday = 28, etc. I want to analyze this data on a daily basis, for instance all of the events that happen between 8 and 10 AM on any day.
To do this I have built a function that returns a dichotomous value for the given range for an ordered list. The function two_break accepts an ordered list of integers in 0:168 representing the hours of a week, and a range (b1 and b2) for the desired period of a 24-hour day. b1 and b2 bound the part of the 24-hour day that is wanted, i.e. if b1=8 and b2=10, two_break will return all values of 9, (9+24)=33, (9+48)=57, etc. as 1 and all others as 0.
two_break <- function(test_hr,b1,b2){
test_hr<-ifelse(test_hr==1,1.1,test_hr)
for(i in 0:6){
test_hr<-ifelse(test_hr> (b1+24*i) & test_hr< (b2+24*i), 1 ,test_hr)
}
test_hr<-ifelse(test_hr==1,1,0)
return(test_hr)
}
This function works fine, but I'm wondering if anybody out there could do it more efficiently/succinctly.
See full code and data set at my github: anthonyjp87 168 hr transformation file/data.
Cheers!

You can use integer division, %/%, to capture the day of the week, and the modulus operator, %%, to capture the hour in the day:
weekHours <- 1:168
# return all hours of the week whose hour of day is 8, 9 or 10 (8AM to 10AM, inclusive)
test_hr <- weekHours[weekHours %% 24 %in% 8:10]
Note that midnight is represented by 0. If you want to wrap this into a function, you might use
getTest_hr <- function(weekHours, startTime, stopTime) {
weekHours[weekHours %% 24 %in% seq(startTime, stopTime)]
}
To get the day of the week, you can use integer division:
# get all hours of the week that fall on the third day of the week
dayOfWeek3 <- weekHours[(weekHours %/% 24 + 1) == 3]
To get a binary vector of the selected time periods, simply pull the logical out of the index:
allTimesBinary <- (weekHours %% 24) %in% 8:10
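For comparison with the original two_break, here is a minimal sketch of a replacement built on the same modulus idea. The name two_break_mod is hypothetical, and the strict inequalities are an assumption chosen to mirror the question's function, which flags only hours strictly between b1 and b2 (so b1 = 8, b2 = 10 flags hour 9 of each day).

```r
# Hypothetical replacement for two_break(): returns 1 for hours of the week
# whose hour of day lies strictly between b1 and b2, and 0 otherwise.
two_break_mod <- function(test_hr, b1, b2) {
  hour_of_day <- test_hr %% 24          # 0 = midnight, 23 = 11 PM
  as.integer(hour_of_day > b1 & hour_of_day < b2)
}

two_break_mod(c(8, 9, 10, 33, 57, 100), 8, 10)
# [1] 0 1 0 1 1 0
```

Because %% is vectorized, no loop over the seven days is needed, and the same function would also handle hour counts that run past a single week.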

Related

Replacement of missing day and month in dates using R

This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). Example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Of note is the varying number of days for different months, e.g. 31 days for October and 30 days for April.
For dates where both day and month are missing, as in the case of id 2, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g. 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)
selm <- grepl("^../99", df$date)
md <- seld & (!selm)
mm <- seld & selm
df$date <- as.Date(gsub("99","01",as.character(df$date)), format="%d/%m/%Y")
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1)
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1)
#df
#  id       date   dateorig
#1  1 2014-10-14 99/10/2014
#2  2 2011-02-05 99/99/2011
#3  3 2016-02-23 23/02/2016
#4  4       <NA>         NA
#5  5 2009-04-19 99/04/2009
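One detail worth checking in the code above: sample(n, 1) draws from 1 to n, so adding the draw to the first day of the month can land on the first day of the following month. A minimal sketch of the month case that keeps the draw inside the month is shown below. The helper name random_day_in_month is hypothetical, and it samples uniformly over the whole month, which is only one possible reading of "the middle of the interval".

```r
# Hypothetical helper: given the first day of a month, return one random
# date inside that same month.
random_day_in_month <- function(first_day) {
  next_month <- seq(first_day, by = "month", length.out = 2)[2]
  n_days <- as.integer(next_month - first_day)   # 28 to 31
  first_day + (sample.int(n_days, 1) - 1L)       # offset 0 .. n_days - 1
}

set.seed(1)
random_day_in_month(as.Date("2009-04-01"))  # some date in April 2009
```

The same pattern would apply to the year case by stepping with by = "year" instead of by = "month".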

Creating time stamps and repeat in the same column

I want to create a vector of time stamps consisting of 60 monthly dates and repeat the process n times. That means, if n = 2, the vector should contain 120 time stamps.
A single vector of time stamps I am creating in this way,
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
To repeat it n number of times I am doing the following,
n <- 2
X <- data.frame(replicate(n, seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")))
Y <- stack(X)[,"values", drop=FALSE]
head(Y)
> head(Y)
values
1 16071
2 16102
3 16130
4 16161
5 16191
6 16222
As you see the values are not in time format anymore. My question is how to retain the time format in the vector Y? Is there any smarter way to do this problem?
Take a look at the 'zoo' package; there is an old thread here: https://stat.ethz.ch/pipermail/r-help//2010-March/233159.html
where they discuss much the same problem.
Either way, after installing and loading zoo you can do
as.Date(16071)
and it will return the date in date format. Hope this makes sense.
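As an alternative that avoids the conversion altogether, here is a minimal base R sketch. It rests on the observation that replicate() simplifies its result to a plain numeric matrix, which drops the Date class, whereas rep() on a Date vector keeps it. The column name values is carried over from the question's output.

```r
n <- 2
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")

# rep() preserves the Date class, so no round trip through numbers is needed
Y <- data.frame(values = rep(t, times = n))

class(Y$values)
# [1] "Date"
nrow(Y)
# [1] 120
```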

calculating ages in R by subtracting two dates columns

I have 2 columns with ~ 2000 rows of dates in them. One is a variable with a visit date (df$visitdate), and the other is a birth date of the individual (df$birthday).
I am wondering if there is a simple way to subtract the birth date from the visit date to create the variable "age at the time of the visit", accounting for leap years, etc.
I tried to use the following code (from an answer in a similar question) but it didn't work in my case.
find number of seconds in one year:
seconds_in_a_year <- as.integer((seconds(ymd("2010-01-01")) - seconds(ymd("2009-01-01"))))
now obtain number of seconds between the 2 dates you desire
seconds_between_dates <- as.integer(seconds(date1) - seconds(date2))
your final answer for number of years in floating points will be
years_between_dates <- seconds_between_dates / seconds_in_a_year
When I tried to apply this to my data frame (note: using variables rather than specific dates, so this may be the cause) I got the following:
seconds_in_a_year <- as.integer((seconds(ymd(df$visitdate)) - seconds(ymd(df$birthday))))
Warning message:
NAs introduced by coercion
Following the code along I got a final output of:
years_between_dates
[1] 1.157407e-05 [2] 1.157407e-05
Any help is greatly appreciated!
Subtracting one Date object from another gives you the time difference in days, e.g.
> dates = as.Date(c("2007-03-01", "2004-05-23"))
>
> dates[1] - dates[2]
Time difference of 1012 days
So, assuming 365 days in a year
> age_time_visit = as.numeric(dates[1] - dates[2]) / 365
> age_time_visit
[1] 2.772603
There are various answers for this scattered around the internet.
I think the one I've typically used was inspired by Professor Ripley:
http://r.789695.n4.nabble.com/Calculate-difference-between-dates-in-years-td835196.html
age_years <- function(first, second)
{
lt <- data.frame(first, second)
age <- as.numeric(format(lt[,2],format="%Y")) - as.numeric(format(lt[,1],format="%Y"))
first <- as.Date(paste(format(lt[,2],format="%Y"),"-",format(lt[,1],format="%m-%d"),sep=""))
age[which(first > lt[,2])] <- age[which(first > lt[,2])] - 1
age
}
There's another approach at https://gist.github.com/mmparker/7254445
Or, if you just want the raw decimal value of years, you can get the number of days and divide by 365.2425.
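A minimal sketch of that last suggestion applied to whole columns follows. The column names birthday and visitdate come from the question; the example values are made up, and the sketch assumes both columns are already of class Date (character columns would need as.Date() first).

```r
# Made-up example data using the column names from the question
df <- data.frame(
  birthday  = as.Date(c("2004-05-23", "1990-01-15")),
  visitdate = as.Date(c("2007-03-01", "2015-09-09"))
)

# Date subtraction is vectorized; asking for days explicitly avoids unit surprises
days_alive <- as.numeric(difftime(df$visitdate, df$birthday, units = "days"))

# 365.2425 is the average length of a Gregorian calendar year in days
df$age_years <- days_alive / 365.2425
df
```

This gives a decimal age that is close to, but not exactly, the calendar age; the age_years function above is the better fit when whole completed years are needed.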
Here is an approach that accounts for leap years (don't know if this has been done before, but suspect it has...).
get.age <- function(from, to) {
require(lubridate) # for leap_year(...)
n <- as.integer(to-from)
n.l <- sum(leap_year(seq(from,to,by=1)))
n.l/366 + (n+1-n.l)/365
}
get.age(as.Date("2009-01-01"),as.Date("2012-12-31"))
# [1] 4
get.age(as.Date("2012-01-01"),as.Date("2012-01-31")) # 2012 was a leap year
# [1] 0.08469945
get.age(as.Date("2011-01-01"),as.Date("2011-01-31")) # 2011 was not
# [1] 0.08493151
So the basic idea is to create a vector with one element for every day between from and to (inclusive), then for each day account for whether that day is part of a leap year or not. Then we add up the leap year days and the non-leap year days separately and calculate the number of years as:
leap-year-days/366 + non-leap-year-days/365
This works for single dates (vectors of length 1). To enable this for columns of dates, as you asked, we use Vectorize(...).
vget.age <- Vectorize(get.age) # vectorized version
And then a demo:
# example data set
set.seed(1) # for reproducible example
today <- as.Date("2015-09-09")
df <- data.frame(birth.date=today-sample(1000:10000,2000)) # 2000 birthdays
result <- vget.age(df$birth.date,today) # how old are they?
head(result)
# [1] 9.282192 11.909589 16.854795 25.115068 7.706849 24.865753

Finding a more elegant way to aggregate hourly data to mean hourly data using zoo

I have a chunk of data logging temperatures from a few dozen devices every hour for over a year. The data are stored as a zoo object. I'd very much like to summarize those data by looking at the average values for every one of the 24 hours in a day (1am, 2am, 3am, etc.). So that for each device I can see what its average value is for all the 1am times, 2am times, and so on. I can do this with a loop but sense that there must be a way to do this in zoo with an artful use of aggregate.zoo. Any help?
require(zoo)
# random hourly data over 30 days for five series
x <- matrix(rnorm(24 * 30 * 5),ncol=5)
# Assign hourly data with a real time and date
x.DateTime <- as.POSIXct("2014-01-01 0100",format = "%Y-%m-%d %H") +
seq(0,24 * 30 * 60 * 60, by=3600)
# make a zoo object
x.zoo <- zoo(x, x.DateTime)
#plot(x.zoo)
# what I want:
# the average value for each series at 1am, 2am, 3am, etc. so that
# the dimensions of the output are 24 (hours) by 5 (series)
# If I were just working on x I might do something like:
res <- matrix(NA,ncol=5,nrow=24)
for(i in 1:nrow(res)){
res[i,] <- apply(x[seq(i,nrow(x),by=24),],2,mean)
}
res
# how can I avoid the loop and write an aggregate statement in zoo that
# will get me what I want?
Calculate the hour for each time point and then aggregate by that:
hr <- as.numeric(format(time(x.zoo), "%H"))
ag <- aggregate(x.zoo, hr, mean)
dim(ag)
## [1] 24 5
ADDED
Alternately use hours from chron or hour from data.table:
library(chron)
ag <- aggregate(x.zoo, hours, mean)
This is quite similar to the other answer but takes advantage of the fact that the by=... argument to aggregate.zoo(...) can be a function which will be applied to time(x.zoo):
as.hour <- function(t) as.numeric(format(t,"%H"))
result <- aggregate(x.zoo,as.hour,mean)
identical(result,ag) # ag from G. Grothendieck answer
# [1] TRUE
Note that this produces a result identical to the other answer, but not the same as yours. This is because your dataset starts at 1:00am, not midnight, so your loop produces a matrix wherein the 1st row corresponds to 1:00am and the last row corresponds to midnight. These solutions produce zoo objects wherein the first row corresponds to midnight.
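The row-ordering difference can be illustrated without zoo. The following is a base R sketch under stated assumptions: the time stamps are built in UTC and start at 01:00 as in the question, tapply stands in for aggregate.zoo, and the names times and by_hour are made up for the example.

```r
set.seed(1)
# 30 days of hourly time stamps starting at 01:00, plus five random series
times <- as.POSIXct("2014-01-01 01:00", tz = "UTC") + 3600 * (0:(24 * 30 - 1))
x <- matrix(rnorm(24 * 30 * 5), ncol = 5)

# Group by hour of day: rows come out ordered 0, 1, ..., 23 (midnight first)
hr <- as.numeric(format(times, "%H", tz = "UTC"))
by_hour <- apply(x, 2, function(col) tapply(col, hr, mean))

# The loop from the question: row 1 is 01:00 and row 24 is midnight
res <- matrix(NA_real_, nrow = 24, ncol = 5)
for (i in 1:24) res[i, ] <- colMeans(x[seq(i, nrow(x), by = 24), ])

# Moving the midnight row to the end lines the two results up
all.equal(unname(by_hour[c(2:24, 1), ]), res)
# [1] TRUE
```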

How do I subset every day except the last five days of zoo data?

I am trying to extract all dates except for the last five days from a zoo dataset into a single object.
This question is somewhat related to How do I subset the last week for every month of a zoo object in R?
You can reproduce the dataset with this code:
library(zoo)
set.seed(123)
price <- rnorm(365)
data <- cbind(seq(as.Date("2013-01-01"), by = "day", length.out = 365), price)
zoodata <- zoo(data[,2], as.Date(data[,1]))
For my output, I'm hoping to get a combined dataset of everything except the last five days of each month. For example, if there are 20 days in the first month's data and 19 days in the second month's, I only want to subset the first 15 and 14 days of data respectively.
I tried using the head() function and the first() function to extract the first three weeks, but since each month will have a different amount of days according to month or leap year months, it's not ideal.
Thank you.
Here are a few approaches:
1) as.Date Let tt be the dates. Then we compute a Date vector the same length as tt which has the corresponding last date of the month. We then pick out those dates which are at least 5 days away from that:
tt <- time(zoodata)
last.date.of.month <- as.Date(as.yearmon(tt), frac = 1)
zoodata[ last.date.of.month - tt >= 5 ]
2) tapply/head For each month tapply head(x, -5) to the data and then concatenate the reduced months back together:
do.call("c", tapply(zoodata, as.yearmon(time(zoodata)), head, -5))
3) ave Define revseq which given a vector or zoo object returns sequence numbers in reverse order so that the last element corresponds to 1. Then use ave to create a vector ix the same length as zoodata which assigns such reverse sequence numbers to the days of each month. Thus the ix value for the last day of the month will be 1, for the second last day 2, etc. Finally subset zoodata to those elements corresponding to sequence numbers greater than 5:
revseq <- function(x) rev(seq_along(x))
ix <- ave(seq_along(zoodata), as.yearmon(time(zoodata)), FUN = revseq)
z <- zoodata[ ix > 5 ]
ADDED Solutions (1) and (2).
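The reverse-numbering idea from approach (3) can also be tried on a plain Date vector. This is a base R sketch only: it uses format(dates, "%Y-%m") as the month key in place of as.yearmon, and the names month_id, days_from_end and kept are made up for the example.

```r
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 365)

# Month key for grouping, e.g. "2013-01"
month_id <- format(dates, "%Y-%m")

# Within each month, number the days from the end: last day = 1
days_from_end <- ave(seq_along(dates), month_id,
                     FUN = function(i) rev(seq_along(i)))

# Keep everything except the last five days of each month
kept <- dates[days_from_end > 5]
length(kept)
# [1] 305
```

Like approach (3), this counts the observations actually present in each month, so it would also behave sensibly when a month has gaps in its data.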
Exactly the same way as in the answer to your other question:
Split dataset by month, remove last 5 days, just add a "-":
library(xts)
xts.data <- as.xts(zoodata)
lapply(split(xts.data, "months"), last, "-5 days")
And the same way, if you want it on one single object:
do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))
