I have a data.frame that contains two date columns, one for date of birth (DOB) for an individual, and a reference point in time (Snapshot.Date), let's say it's the date we last saw that individual. There are other columns (omitted), so I'd ideally like the results to be added as a column to my existing data.frame.
I would like to calculate how many months (as a continuous value) lie between the individual's last birthday (relative to the Snapshot.Date) and the Snapshot.Date.
I've tried a plyr solution and a base sapply solution, and both are slower than I expected (and I need to process one million rows in my 'real' data.frame).
First, here is a test dataset. 20 original records (with the 'special' case of Feb 29th, only existing in a leap year).
data.test = structure(list(Snapshot.Date = structure(c(1433030400, 1396224000,
1375228800, 1396224000, 1383177600, 1362009600, 1367280000, 1369958400,
1346371200, 1348963200, 1435622400, 1435622400, 1435622400, 1435622400,
1435622400, 1435622400, 1435622400, 1435622400, 1435622400, 1346371200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), DOB = structure(c(-268790400,
-155692800, -955065600, -551232000, -149644800, -774230400, -485395200,
-17625600, -131932800, -387244800, 545961600, 18489600, -230515200,
441676800, -32745600, 775180800, 713491200, 483235200, 114307200,
-815443200), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Snapshot.Date",
"DOB"), row.names = c(32806L, 21294L, 14880L, 21730L, 17525L,
8516L, 11068L, 11751L, 2564L, 3832L, 802276L, 1031697L, 129222L,
588224L, 1093247L, 878037L, 370736L, 709108L, 861908L, 2199L), class = "data.frame")
And the function for calculating months (I'm sure this can be improved too).
library(lubridate)
months_since_last_birthday = function(CurrentDate, DateOfBirth)
{
  last_birthday = DateOfBirth
  # 29 Feb only occurs in leap years, so treat it as 28 Feb
  if(month(last_birthday) == 2 && day(last_birthday) == 29)
  {
    day(last_birthday) = 28
  }
  year(last_birthday) = year(CurrentDate)
  if(last_birthday > CurrentDate)
  {
    last_birthday = last_birthday - years(1) # last year's birthday is the most recent occurrence
  }
  # interval() replaces new_interval(), the name used in older lubridate releases
  return(as.period(interval(last_birthday, CurrentDate)) / months(1))
}
For the base 20 records, here is the desired output:
Snapshot.Date DOB Months.Since.Birthday
32806 2015-05-31 1961-06-26 11.1643836
21294 2014-03-31 1965-01-25 2.1972603
14880 2013-07-31 1939-09-27 10.1315068
21730 2014-03-31 1952-07-14 8.5589041
17525 2013-10-31 1965-04-05 6.8547945
8516 2013-02-28 1945-06-20 8.2630137
11068 2013-04-30 1954-08-15 8.4931507
11751 2013-05-31 1969-06-11 11.6575342
2564 2012-08-31 1965-10-27 10.1315068
3832 2012-09-30 1957-09-24 0.1972603
802276 2015-06-30 1987-04-21 2.2958904
1031697 2015-06-30 1970-08-03 10.8876712
129222 2015-06-30 1962-09-12 9.5917808
588224 2015-06-30 1983-12-31 5.9863014
1093247 2015-06-30 1968-12-18 6.3945205
878037 2015-06-30 1994-07-26 11.1315068
370736 2015-06-30 1992-08-11 10.6246575
709108 2015-06-30 1985-04-25 2.1643836
861908 2015-06-30 1973-08-16 10.4602740
2199 2012-08-31 1944-02-29 6.0986301
Scaling up the dataset for benchmarking:
# Make 5000 records total for benchmarking, didn't replicate Feb 29th
# since it is a very rare case in the data
set.seed(1)
data.test = rbind(data.test, data.test[sample(1:19, size = 4980, replace = TRUE),])
library(plyr) # provides adply
start.time = Sys.time()
res = suppressMessages(adply(data.test , 1, transform, Months.Since.Birthday = months_since_last_birthday(Snapshot.Date, DOB)))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.793945 mins
start.time = Sys.time()
data.test$Months.Since.Birthday = suppressMessages(sapply(1:5000, function(x){return(months_since_last_birthday(data.test$Snapshot.Date[x], data.test$DOB[x]))}))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.743053 mins
Am I doing something seriously wrong? Does this seem really slow to you?
Any feedback is welcome!
Unless I'm missing something obvious, there are a bunch of built-in ways of working with time data in R, notably base::difftime, which may save you some trouble.
Taking your above dataset data.test:
data.test$dif <- round(as.vector(as.double(difftime(strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"), strptime(data.test$DOB, format = "%Y-%m-%d"), units = "days"))) %% 365, 1)
or, laid out so the nesting is easier to follow:
data.test$dif <-
  round(
    as.vector(
      as.double(
        difftime(
          strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"),
          strptime(data.test$DOB, format = "%Y-%m-%d"),
          units = "days")
      )
    ) %% 365,
    1)
The above uses the difftime function to find the difference in days between the two dates, parsed with format = "%Y-%m-%d", then takes the remainder after division by 365 to approximate the number of days since the last birthday. I personally think days are a better measure than months, because a two-month span covering July and August (62 days) is not the same length as one covering January and February (59 or 60 days).
Note: The above solution does not account for leap years, so every 29 February between the DOB and the Snapshot.Date shifts the result by one day. You could look up a list of leap years and correct the day count by one day for each leap year the individual lived through to get an accurate number.
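If days since the last birthday are acceptable, the leap-year problem can be avoided by rebuilding the birthday in the snapshot year instead of taking a remainder. The following is only a sketch in base R. The function name days_since_last_birthday is made up for illustration, it copies the question's choice of treating a 29 February birthday as 28 February, and it returns whole days rather than lubridate's fractional months.

```r
# Hypothetical helper (not from the original posts): vectorised days since
# the most recent birthday, with 29 Feb birthdays treated as 28 Feb.
days_since_last_birthday <- function(snapshot, dob) {
  snapshot <- as.Date(snapshot)
  dob <- as.Date(dob)

  snap_year <- as.integer(format(snapshot, "%Y"))
  month_day <- format(dob, "%m-%d")
  month_day[month_day == "02-29"] <- "02-28"

  # Birthday in the snapshot year and in the year before it
  this_year <- as.Date(paste(snap_year, month_day, sep = "-"))
  last_year <- as.Date(paste(snap_year - 1L, month_day, sep = "-"))

  # Use this year's birthday only if it is not after the snapshot
  last_birthday <- last_year
  passed <- this_year <= snapshot
  last_birthday[passed] <- this_year[passed]

  as.numeric(snapshot - last_birthday)
}

# Three rows taken from the question's test data
days_since_last_birthday(
  as.Date(c("2015-05-31", "2012-08-31", "2012-09-30")),
  as.Date(c("1961-06-26", "1944-02-29", "1957-09-24"))
)
# 339 185 6
```

Because every step works on whole vectors, the result can be assigned straight to a new column, for example data.test$Days.Since.Birthday <- days_since_last_birthday(data.test$Snapshot.Date, data.test$DOB).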
I have tried to convert the daily prices of more than one financial asset to weekly prices and then combine them, but I cannot get it to work.
When I join the two weekly series, some dates do not coincide, so the weekly frequency is not respected.
Example:
I download the data using quantmod
getSymbols('^FCHI', from = '2005-01-06', to= "2022-03-18")
Y<-Cl(to.weekly(FCHI))
getSymbols("^GDAXI", from = '2005-01-06', to= "2022-03-18")
O<-Cl(to.weekly(GDAXI))
i<-cbind(Y,O)
I get this data:
structure(c(3803.719971, 3794.439941, NA, 3912.72998, NA, 3936.330078,
4045.139893, 3954.379883, 3820.780029, 3739.459961, 5756.290039,
5831.209961, 5957.439941, NA, 5957.430176, NA, 6037.609863, 5875.970215,
5695.319824, 5608.790039), class = c("xts", "zoo"), src = "yahoo", updated = structure(1648424873.12071, class = c("POSIXct",
"POSIXt")), na.action = structure(c(528L, 1120L, 2567L), class = "omit", index = c(1325462400,
1398902400, 1577232000)), index = structure(c(1260489600, 1261094400,
1261526400, 1261612800, 1262131200, 1262217600, 1262908800, 1263513600,
1264118400, 1264723200), tzone = "UTC", tclass = "Date"), .Dim = c(10L,
2L), .Dimnames = list(NULL, c("FCHI.Close", "GDAXI.Close")))
           FCHI.Close GDAXI.Close
2009-12-11    3803.72     5756.29
2009-12-18    3794.44     5831.21
2009-12-23         NA     5957.44
2009-12-24    3912.73          NA
2009-12-30         NA     5957.43
2009-12-31    3936.33          NA
Even if I replace these missing values with the last value, I would still have a problem: the weekly frequency would be lost, because two successive dates are generated for the same week.
How can I fix this? Thanks in advance, and sorry for my bad English.
Stock market databases tend to have missing values and missing dates for a number of reasons. For France and Germany these tend to be days such as Easter Monday, the second day of Christmas (Boxing Day), or any other holiday on which the stock market is closed locally.
This was especially common before the European financial markets synchronized their opening days. A second kind of missing date is one that is not recorded in the database at all, like Christmas Day. If this isn't a Friday you won't notice, but the to.period function takes the last day it finds in a week. If that day differs between time series, you get different last days for that week.
This shows up in December 2009, where you have the 23rd (DAX) and the 24th (CAC40) as the last day of the week. Merging these gives you two rows for that week, as seen in your example.
So there are two things you need to do. First, synchronize the dates across the time series, i.e. insert all missing dates into every series. Second, fill the NAs with the value from the previous day(s); then you can use the to.period functions.
Below is some code to handle this:
#Create date sequence
dates <- seq.Date(from = as.Date('2005-01-06'), to= as.Date("2022-03-18"), by = 1)
# remove Saturday and Sunday
dates <- dates[lubridate::wday(dates, week_start = 1) %in% (1:5)]
# merge dates with timeseries
FCHI <- merge(FCHI, dates)
GDAXI <- merge(GDAXI, dates)
# fill in the NA's with the previous value
FCHI <- na.locf(FCHI)
GDAXI <- na.locf(GDAXI)
Y <- Cl(to.weekly(FCHI))
O <- Cl(to.weekly(GDAXI))
#merge CAC40 and DAX
i <- merge(Y, O)
# data for December 2009
i["2009-12"]
           FCHI.Close GDAXI.Close
2009-12-04    3846.62     5817.65
2009-12-11    3803.72     5756.29
2009-12-18    3794.44     5831.21
2009-12-25    3912.73     5957.44
As you can see, for December 2009 the dates are now aligned. You might argue that 2009-12-25 should be 2009-12-24; in that case, remove every 25 December from the dates sequence before doing the rest.
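Dropping every 25 December from the date sequence could look roughly like this. It is a base R sketch on a short illustrative range; the weekday filter uses as.POSIXlt() rather than lubridate::wday() only to keep the example free of package dependencies.

```r
# Short illustrative range around Christmas 2009
dates <- seq.Date(from = as.Date("2009-12-21"), to = as.Date("2009-12-31"), by = 1)

# Keep Monday to Friday (POSIXlt wday: 0 = Sunday, ..., 6 = Saturday)
dates <- dates[as.POSIXlt(dates)$wday %in% 1:5]

# Remove 25 December of every year in the sequence
dates <- dates[format(dates, "%m-%d") != "12-25"]
dates
```

With the 25th gone, the last available day of that week becomes 2009-12-24, so a weekly aggregation that takes the last day it finds should label the week with the 24th.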
I want to generate the same time window over several days, e.g. from 09:30:00 to 16:00:00 every day, and I know that
dates<- seq(as.POSIXct("2000-01-01 9:00",tz='UTC'), as.POSIXct("2000-04-9 16:00",tz='UTC'), by=300)
can give me a time series observed every 5 minutes, 24 hours a day, for 100 days. But what I want is only 09:30:00 to 16:00:00 on each of the 100 days.
Thanks in advance
Here is one way. We can create a sequence of dates, one per day, and then build a list holding the five-minute sequence for each day. Finally, we combine the list into a single vector; final_seq is the final output.
date_seq <- seq(as.Date("2000-01-01"), as.Date("2000-04-09"), by = 1)
hour_seq <- lapply(date_seq, function(x){
  temp_date <- as.character(x)
  temp_seq <- seq(as.POSIXct(paste(temp_date, "09:30"), tz = "UTC"),
                  as.POSIXct(paste(temp_date, "16:00"), tz = "UTC"),
                  by = 300)
})
final_seq <- do.call("c", hour_seq)
An option using tidyr::crossing() (which I love) and the lubridate package:
library(tidyr)
library(dplyr)
library(lubridate)
# 0:99 keeps 1 January as the first of the 100 days
crossing(c1 = paste(dmy("01/01/2000") + 0:99, "09:30"),
         c2 = seq(0, 390, 5)) %>%
  mutate(time_series = ymd_hm(c1) + minutes(c2)) %>%
  pull(time_series)
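The same series can also be built without a loop or extra packages by adding a vector of within-day offsets to a vector of day starts. This is a sketch only; the variable names are illustrative and UTC is assumed, as in the question.

```r
# 100 consecutive days starting on 2000-01-01
days <- seq(as.Date("2000-01-01"), by = "day", length.out = 100)

# Seconds after midnight for 09:30, 09:35, ..., 16:00
offsets <- seq(from = 9.5 * 3600, to = 16 * 3600, by = 300)

# Midnight (UTC) of each day, repeated once per offset
day_starts <- as.POSIXct(format(days), tz = "UTC")
final_seq <- rep(day_starts, each = length(offsets)) +
  rep(offsets, times = length(days))

length(final_seq)  # 79 time points per day * 100 days = 7900
```

Here rep(..., each = ) and rep(..., times = ) line up every day with every offset, so a single vectorised addition produces the whole series.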
How can I get the difference between two times in R (in minutes) when the day, month and year are not provided?
For instance, the minutes between "23:14:01" and "00:02:01".
You can use difftime:
a <- strptime("23:14:01", format = "%H:%M:%S")
b <- strptime("00:02:01", format = "%H:%M:%S")
difftime_res <- difftime(a, b, units = "mins")
difftime_res
# Time difference of 1392 mins
difftime_res_2 <- 1440 - difftime_res # In case the second time falls on the following day
difftime_res_2
# Time difference of 48 mins
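The 1440-minute correction can be folded into one step with the modulo operator, which wraps a negative difference around midnight. The sketch below assumes the second time is never more than 24 hours after the first. The helper name minutes_between is made up for illustration, and tz = "UTC" is set only so that a daylight-saving change on the current date cannot distort the result.

```r
# Hypothetical helper: minutes from `from` to `to`, assuming `to` is the
# same day or the following day (i.e. less than 24 hours later).
minutes_between <- function(from, to) {
  from <- as.POSIXct(strptime(from, format = "%H:%M:%S", tz = "UTC"))
  to <- as.POSIXct(strptime(to, format = "%H:%M:%S", tz = "UTC"))
  # %% 1440 maps a negative difference onto the next day
  as.numeric(difftime(to, from, units = "mins")) %% 1440
}

minutes_between("23:14:01", "00:02:01")  # 48
minutes_between("00:02:01", "23:14:01")  # 1392
```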
I have data like this:
customer_id Last_city First city recent_date
1020 Jaipur Gujarat 20130216
1021 Delhi Lucknow 20130129
1022 Mumbai Punjab 20130221
and I want to find the number of days between recent_date and today (for every record).
The difftime function calculates a time difference in days, hours, minutes, etc.
First, parse the date string into a date representation (e.g. Date or POSIXct), then compare that to the current date/time.
# create dummy data.frame for testing
df <- data.frame("customer_id"=1020, "Last_city"="Jaipur",
"First_city"="Gujarat", "recent_date"="20130216",
stringsAsFactors = FALSE)
now <- Sys.Date()
# parse date into date type (Note: %Y=4-digit year, %y=2-digit year)
df$date = as.Date(df$recent_date, format = "%Y%m%d")
# next calculate the difference between recent date and current time
df$diff = as.double(difftime(now, df$date, units = c("days")))
> df
customer_id Last_city First_city recent_date date diff
1 1020 Jaipur Gujarat 20130216 2013-02-16 1604
If you want the difference in weeks instead:
> as.double(difftime(now, df$date, units = c("weeks")))
[1] 229.1429
I'm new to R, so this may very well be a simple problem, but it's causing me a lot of difficulty.
I am trying to subset one data frame using two values taken from another data frame, and the subset fails when I combine both bounds. I will first describe what I've done, what is working, and then what is not working.
I have two data frames. One has a series of storm data, including the dates of storm events, and the other has discharge data for many thousands of monitoring events. I am trying to see whether any of the discharge data falls within the storm event start and end dates/times.
What I have done thus far is as follows:
Example discharge data:
X. DateTime Depth DateTime1 newcol
1 3 8/2/2013 13:15 0.038 2013-08-02 13:15:00 1375463700
2 4 8/2/2013 13:30 0.038 2013-08-02 13:30:00 1375464600
3 5 8/2/2013 13:45 0.039 2013-08-02 13:45:00 1375465500
4 6 8/2/2013 14:00 0.039 2013-08-02 14:00:00 1375466400
Example storm data:
Storm newStart newEnd
1 1 1382125500 1382130000
2 2 1385768100 1385794200
#Make a value to which the csv files are attached
CA_Storms <- read.csv(file = "CA_Storms.csv", header = TRUE, stringsAsFactors = FALSE)
CA_adj <- read.csv(file = "CA_Adj.csv", header = TRUE, stringsAsFactors = FALSE)
#strptime function (do this for all data sets)
CA_adj$DateTime1 <- strptime(CA_adj$DateTime, format = "%m/%d/%Y %H:%M")
CA_Storms$Start.time1 <- strptime(CA_Storms$Start.time, format = "%m/%d/%Y %H:%M")
CA_Storms$End.time1 <- strptime(CA_Storms$End.time, format = "%m/%d/%Y %H:%M")
#Make dates and times continuous
CA_adj$newcol <- as.numeric(CA_adj$DateTime1)
CA_Storms$newStart <- as.numeric(CA_Storms$Start.time1)
CA_Storms$newEnd <- as.numeric(CA_Storms$End.time1)
This allows me to do the following subsets successfully:
CA_adj[CA_adj$newcol == "1375463700", ]
Example output:
X. DateTime Depth DateTime1 newcol
1 3 8/2/2013 13:15 0.038 2013-08-02 13:15:00 1375463700
CA_adj[CA_adj$newcol == CA_Storms[1,19], ]
X. DateTime Depth DateTime1 newcol
7403 7408 10/18/2013 15:45 0.058 2013-10-18 15:45:00 1382125500
CA_adj[CA_adj$newcol <= CA_Storms[1,20], ]
However, whenever I try to select the rows between two values, such as in:
CA_adj[CA_adj$newcol >= CA_Storms[1,19] & CA_adj$newol <= CA_Storms[1,20], ]
it responds with this:
[1] X. DateTime Depth DateTime1 newcol
<0 rows> (or 0-length row.names)
I know this output is incorrect, as, through a cursory look through my large data set, there is at least one value that falls within these criteria.
What gives?
# rebuild the example discharge data
discharge <- data.frame(x = c(3, 4, 5, 6),
                        DateTime = c("8/2/2013 13:15", "8/2/2013 13:30",
                                     "8/2/2013 13:45", "8/2/2013 14:00"),
                        Depth = c(0.038, 0.038, 0.039, 0.039))
# convert to POSIXct, which can be compared directly
discharge$DateTime1 <- as.POSIXct(discharge$DateTime, format = "%m/%d/%Y %H:%M")
# example storms with start and end times
storm <- data.frame(storm = c(1, 2),
                    start = c("8/2/2013 13:15", "8/2/2013 16:30"),
                    end = c("8/2/2013 13:45", "8/2/2013 16:45"))
storm$start <- as.POSIXct(storm$start, format = "%m/%d/%Y %H:%M")
storm$end <- as.POSIXct(storm$end, format = "%m/%d/%Y %H:%M")
# rows of discharge that fall within the first storm
discharge[discharge$DateTime1 >= storm[1, 2] & discharge$DateTime1 <= storm[1, 3], ]
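One thing worth checking in the failing line from the question: the second condition refers to CA_adj$newol rather than CA_adj$newcol. Asking a data frame for a column that does not exist with $ returns NULL without an error, the comparison then has length zero, and the combined condition selects no rows. The toy data frame below is invented purely to show the effect.

```r
# Toy data, invented for illustration
df <- data.frame(newcol = c(10, 20, 30))

is.null(df$newol)                         # TRUE: misspelled column, no error

bad <- df$newcol >= 15 & df$newol <= 25   # zero-length logical vector
length(bad)                               # 0
nrow(df[bad, , drop = FALSE])             # 0 rows, as in the question

good <- df$newcol >= 15 & df$newcol <= 25
df[good, , drop = FALSE]                  # the row with newcol == 20
```

Using df[["newol"]] would also return NULL here, so a quick guard such as stopifnot("newcol" %in% names(df)) before subsetting is one way to catch this kind of slip.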