Take the start date of a data frame - r

With this we can create a data frame with days having the start day:
dataframe = seq(as.Date("2000-09-01"), by="days", length=11)
If we have already a dataframe which has values can we do the same?

That actually creates a Date vector, not a data frame.
x <- seq(as.Date("2000-09-01"), by="days", length=11)
class(x)
# [1] "Date"
If you define the starting date as the earliest date (i.e., the minimum date), then use min:
min(x)
# [1] "2000-09-01"
But if you're looking for the first element in the vector (which is also the earliest in your case, but of course it doesn't have to be), then grab it by its index:
x[1]
# [1] "2000-09-01"

Related

calculating ages in R by subtracting two dates columns

I have 2 columns with ~ 2000 rows of dates in them. One is a variable with a visit date (df$visitdate), and the other is a birth date of the individual (df$birthday).
Wondering if there is any simple way to subtract the visit date - birth date to create the variable "age at the time of the visit", accounting for leap years, etc.
I tried to use the following code (from an answer in a similar question) but it didn't work in my case.
find number of seconds in one year:
seconds_in_a_year <- as.integer((seconds(ymd("2010-01-01")) - seconds(ymd("2009-01-01"))))
now obtain number of seconds between the 2 dates you desire
seconds_between_dates <- as.integer(seconds(date1) - seconds(date2))
your final answer for number of years in floating points will be
years_between_dates <- seconds_between_dates / seconds_in_a_year
When I tried to apply this to my data frame (note: using variables rather than specific dates, so this may be the cause) I got the following:
seconds_in_a_year <- as.integer((seconds(ymd(df$visitdate)) - seconds(ymd(df$birthday))))
Warning message:
NAs introduced by coercion
Following the code along I got a final output of:
years_between_dates
[1] 1.157407e-05 [2] 1.157407e-05
Any help is greatly appreciated!
Subtracting from a Date object another Date object gives you the time difference in days, e.g.
> dates = as.Date(c("2007-03-01", "2004-05-23"))
>
> dates[1] - dates[2]
Time difference of 1012 days
So, assuming 365 days in a year
> age_time_visit = as.numeric(dates[1] - dates[2]) / 365
> age_time_visit
[1] 2.772603
There are various answers for this scattered around the internet.
I think the one I've typically used was inspired by Professor Ripley:
http://r.789695.n4.nabble.com/Calculate-difference-between-dates-in-years-td835196.html
age_years <- function(first, second)
{
lt <- data.frame(first, second)
age <- as.numeric(format(lt[,2],format="%Y")) - as.numeric(format(lt[,1],format="%Y"))
first <- as.Date(paste(format(lt[,2],format="%Y"),"-",format(lt[,1],format="%m-%d"),sep=""))
age[which(first > lt[,2])] <- age[which(first > lt[,2])] - 1
age
}
There's another approach at https://gist.github.com/mmparker/7254445
Or you you just want to raw, decimal value of years, you can get the number of days and divide by 365.2425
Here is an approach that accounts for leap years (don't know if this has been done before, but suspect it has...).
get.age <- function(from, to) {
require(lubridate) # for leap_year(...)
n <- as.integer(to-from)
n.l <- sum(leap_year(seq(from,to,by=1)))
n.l/366 + (n+1-n.l)/365
}
get.age(as.Date("2009-01-01"),as.Date("2012-12-31"))
# [1] 4
get.age(as.Date("2012-01-01"),as.Date("2012-01-31")) # 2012 was a leap year
# [1] 0.08469945
get.age(as.Date("2011-01-01"),as.Date("2011-01-31")) # 2011 was not
# [1] 0.08493151
So the basic idea is to create a vector with one element for every day between from and to (inclusive), then for each day account for whether that day is part of a leap year or not. The we add up the leap year days and the non-leap year days separately and calculate the number of years as:
leap-year-days/366 + non-leap-year-days/365
This works for single dates (vectors of length 1). To enable this for columns of dates, as you asked, we use Vectorize(...).
vget.age <- Vectorize(get.age) # vectorized version
And then a demo:
# example data set
set.seed(1) # for reproducible example
today <- as.Date("2015-09-09")
df <- data.frame(birth.date=today-sample(1000:10000,2000)) # 2000 birthdays
result <- vget.age(df$birth.date,today) # how old are they?
head(result)
# [1] 9.282192 11.909589 16.854795 25.115068 7.706849 24.865753

How to calculate summary statistics within specified date/time range within time series, using an input of multiple start and end dates?

I have a (dummy) data frame with time series data:
datetime <- as.POSIXct(seq(ISOdate(2012,12,22), ISOdate(2012,12,23), by="hour"), tz='EST')
data <- rnorm(25, 10, 5)
df <- data.frame(datetime, data)
I also have a separate data frame with start and end times as the two columns:
start <- as.POSIXct(c('2012/12/22 19:53', '2012/12/22 23:05'), tz='gmt')
end <- as.POSIXct(c('2012/12/22 21:06', '2012/12/22 23:58'), tz='gmt')
index <- data.frame(start, end)
What I'd like to do is "feed" the main data frame 'df' the 'index' data frame, and, for each start and end date/time combination, find the average value of "data" within that date/time range. This would be equivalent to doing a subset of 'df' manually for each start/end time, but in a combined fashion. (My real data set has years of data, and a hundred date/time ranges I want to feed it FYI).
End goal is to have three columns, start time, end time, and the average numeric value of 'data' within those times.
In general you don't want to grow a data frame one row at a time by calling rbind because it is very inefficient (see the second circle of the R inferno for details). In your case, you can use sapply to replicate this logic:
index$mean <- sapply(1:nrow(index), function(i) mean(df[df$datetime >= index$start[i] &
df$datetime <= index$end[i],2]))
index
# start end mean
# 1 2012-12-22 19:53:00 2012-12-22 21:06:00 9.563336
# 2 2012-12-22 23:05:00 2012-12-22 23:58:00 NaN
I figured out how to do it with a for loop. If anyone has a more efficient solution, that would be great. The for loop solution:
d <- data.frame()
for i in (1:nrow(index)) {
d <- rbind(d, mean(subset(df, datetime >= index[i,1] &
datetime <= index[i,2])[,2]))}

How do I subset every day except the last five days of zoo data?

I am trying to extract all dates except for the last five days from a zoo dataset into a single object.
This question is somewhat related to How do I subset the last week for every month of a zoo object in R?
You can reproduce the dataset with this code:
set.seed(123)
price <- rnorm(365)
data <- cbind(seq(as.Date("2013-01-01"), by = "day", length.out = 365), price)
zoodata <- zoo(data[,2], as.Date(data[,1]))
For my output, I'm hoping to get a combined dataset of everything except the last five days of each month. For example, if there are 20 days in the first month's data and 19 days in the second month's, I only want to subset the first 15 and 14 days of data respectively.
I tried using the head() function and the first() function to extract the first three weeks, but since each month will have a different amount of days according to month or leap year months, it's not ideal.
Thank you.
Here are a few approaches:
1) as.Date Let tt be the dates. Then we compute a Date vector the same length as tt which has the corresponding last date of the month. We then pick out those dates which are at least 5 days away from that:
tt <- time(zoodata)
last.date.of.month <- as.Date(as.yearmon(tt), frac = 1)
zoodata[ last.date.of.month - tt >= 5 ]
2) tapply/head For each month tapply head(x, -5) to the data and then concatenate the reduced months back together:
do.call("c", tapply(zoodata, as.yearmon(time(zoodata)), head, -5))
3) ave Define revseq which given a vector or zoo object returns sequence numbers in reverse order so that the last element corresponds to 1. Then use ave to create a vector ix the same length as zoodata which assigns such reverse sequence numbers to the days of each month. Thus the ix value for the last day of the month will be 1, for the second last day 2, etc. Finally subset zoodata to those elements corresponding to sequence numbers greater than 5:
revseq <- function(x) rev(seq_along(x))
ix <- ave(seq_along(zoodata), as.yearmon(time(zoodata)), FUN = revseq)
z <- zoodata[ ix > 5 ]
ADDED Solutions (1) and (2).
Exactly the same way as in the answer to your other question:
Split dataset by month, remove last 5 days, just add a "-":
library(xts)
xts.data <- as.xts(zoodata)
lapply(split(xts.data, "months"), last, "-5 days")
And the same way, if you want it on one single object:
do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))

How can I subtract 2 dataframes of different length by searching for the closest timestamp in R?

I am using R to extract data from a process historian using SQL. I have two dataframes, one of net weights (NetWt) with timestamps (100 rows) and another of weight setpoints (SetPt) with timestamps (6 rows). The setpoint is changed infrequently but a new bag weight is recorded every 30 seconds. I need to subtract the two such that I get a resultant dataframe of NetWt - SetPt for each timestamp in NetWt. In my last dataset the most recent SetPt timestamp is earlier than the first NetWt timestamp. I need a function that will go through each row in NetWt, take the timestamp, search for the closest timestamp before that time in the SetPt dataframe, return the most recent SetPt and output the difference (NetWt-SetPt).
I have researched merge, rbind, cbind, and I can't find a function to search backwards for the most recent SetPt value and merge that with the NetWt so that I can subtract them to plot the difference with time. Can anyone please help?
Data:
SetPtLines <- "Value,DateTime
51.35,2014-02-10 08:10:49
53.30,2014-02-10 07:52:37
53.10,2014-02-10 07:52:19
51.70,2014-02-10 07:50:26
51.35,2014-02-09 19:25:21
51.40,2014-02-09 19:13:11
51.50,2014-02-09 18:24:53
51.45,2014-02-09 16:10:38
51.40,2014-02-09 15:54:42"
SetPt <- read.csv(text=SetPtLines, header=TRUE)
NetWtLines <- "DateTime,Value
2014-02-11 12:51:50,50.90735
2014-02-11 12:52:24,50.22308
2014-02-11 12:52:55,50.88604
2014-02-11 12:53:27,50.69514
2014-02-11 12:53:58,51.38968
2014-02-11 12:54:29,50.96672"
NetWt <- read.csv(text=NetWtLines, header=TRUE)
There are 100 rows in NetWt.
data.table has a roll argument which would probably be very helpful here
library(data.table)
NetWt <- as.data.table(NetWt)
SetPt <- as.data.table(SetPt)
## Only needed if dates are strings:
## Ensure that your DateTime columns are actually times and not strings
NetWt[, DateTime := as.POSIXct(DateTime)]
SetPt[, DateTime := as.POSIXct(DateTime)]
## Next, set keys to the dates
setkey(NetWt, DateTime)
setkey(SetPt, DateTime)
## Join the two, use roll
NetWt[SetPt, NewValue := Value - i.Value, roll="nearest"]
## It's not clear which you want to subtract from which
SetPt[NetWt, NewValue := Value - i.Value, roll="nearest"]
Here's a solution using xts. Note that your example would be more helpful if SetPt and NetWt included some overlapping observations.
library(xts)
# convert your data to xts
xSetPt <- xts(SetPt$Value, as.POSIXct(SetPt$DateTime))
xNetWt <- xts(NetWt$Value, as.POSIXct(NetWt$DateTime))
# merge them
xm <- merge(xNetWt, xSetPt)
# fill all missing values in the SetPt column with their prior value
xm$xSetPt <- na.locf(xm$xSetPt)
# plot the difference
plot(na.omit(xm$xNetWt - xm$xSetPt))

Recall in date POSIXct

I couldn't find a solution of my problem with POSIXct format - I have a monthly data. This is a scrap of my code:
Data <- as.POSIXct(as.character(czerwiec$Data), format = "%Y-%m-%d %H:%M:%S")
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) & Data <= as.POSIXct(as.character("2013-06-09 23:59:59"))
czerwiec <- czerwiec[get.rows,]
Data <- Data[get.rows]
I chose one hole week of June from 3 to 9 and wanted to estimate the sum of column X (czerwiec$X) by every hours. As you see I could reduce time, but it will be stupid to do it, like this
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-03 00:59:59"))
then
get.rows <- Data >= as.POSIXct(as.character("2013-06-04 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-04 00:59:59"))
And in the end of this operations, I can estimate sum for this hour etc.
Do you have any idea, how I can recall to every rows, which have time like 2013-06-03 to 2013-06-09 and 00:00:01 to 00:59:59??
Something about data frame "czerwiec", so I have three columns, where first call "ID", second "Price" and third "Data" (means Date).
Thx for help :)
This might help. I've used the lubridate package, which doesn't really do anything you can't do in base R, but it makes handling dates much easier
# Set up Data as a string vector
Data <- c("2013-06-01 05:05:05", "2013-06-06 05:05:05", "2013-06-06 08:10:05", "2013-07-07 05:05:05")
require(lubridate)
# Set up the data frame with fake data. This makes a reproducible example
set.seed(4) #For reproducibility, always set the seed when using random numbers
# Create a data frame with Data and price
czerwiec <- data.frame(price=runif(4))
# Use lubridate to turn the Data string into a vector of POSIXctn objects
czerwiec$Data <- ymd_hms(Data)
# Determine the 'yearday' -i.e. yearday of Jan 1 is 1; yearday of Dec 31 is 365 (or 366 in a leap year)
czerwiec$yday <- yday(czerwiec$Data)
# in.range is true if the date is in the desired date range
czerwiec$in.range <- czerwiec$yday[czerwiec$yday >= yday(ymd("2013-06-03")) &
czerwiec$yday yday(ymd("2013-06-09")]
# Pick out the dates that have the range that you want
selected_dates <- subset(czerwiec, in.range==TRUE)

Resources