Converting numeric dates into Hours since format - r

I'm having trouble converting dates into an "hours since" format. I don't think I need the more complex methods of date manipulation, but I might be wrong. Does anyone know a good solution?
The data I have is in a table format, which I read in from a text file. A 3-line example of the 5,000+ rows of data is:
date1 <- matrix(c(2007,2007,2007, 12,12,12,1,2,3,0.365,0.096,-0.416),nrow=3)
Which prints out as:
date1
[,1] [,2] [,3] [,4]
[1,] 2007 12 1 0.365
[2,] 2007 12 2 0.096
[3,] 2007 12 3 -0.416
The first column is the year, the second the month, and third the day. The value in the 4th column is an index value relevant to my study.
The data I would like to match the index value is in a slightly odd format, of hours since "1800-01-01"
ftime <- c(1822548, 1822572, 1822596)
ftime can be printed as just a date, via the following function.
as.Date(ftime/24,"1800-01-01")
[1] "2007-12-01" "2007-12-02" "2007-12-03"
My code all uses the numeric values in ftime to match data, but I cannot work out how to convert the new data (date1) into the same format.
I have the feeling it should be a simple solution, but cannot seem to get it to work.
Help is always greatly appreciated!

You can use the difftime function, if I got what you want:
#setting the origin
myorigin<-as.Date("1800-01-01")
#converting date1 to Date objects
myDates<-as.Date(do.call(function(...) paste(...,sep="-"),as.data.frame(date1[,1:3])))
#get the results
difftime(myDates,myorigin,units="hour")
#Time differences in hours
#[1] 1822536 1822560 1822584
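Note that the values printed above are each 12 hours smaller than the corresponding entries of ftime (1822536 versus 1822548). Since ftime/24 has a fractional part of 0.5, the ftime stamps appear to sit at 12:00 rather than midnight. The sketch below assumes that is true for the whole data set, which should be checked against the real file; hours_noon and idx are illustrative names.

```r
# Data from the question
date1 <- matrix(c(2007, 2007, 2007, 12, 12, 12, 1, 2, 3, 0.365, 0.096, -0.416), nrow = 3)
ftime <- c(1822548, 1822572, 1822596)

origin <- as.Date("1800-01-01")
dates <- as.Date(paste(date1[, 1], date1[, 2], date1[, 3], sep = "-"))

# Hours from the origin to midnight of each date, as a plain numeric vector
hours_midnight <- as.numeric(difftime(dates, origin, units = "hours"))

# Assumption: every ftime stamp is at 12:00, so shift by 12 hours before matching
hours_noon <- hours_midnight + 12

# Position of each row of date1 within ftime (NA where there is no match)
idx <- match(hours_noon, ftime)
```

If the time of day in ftime varies, comparing as.Date(ftime/24, origin) with dates may be a safer key than the raw hour counts.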

Related

How to extract date from time series and convert it to date in R

I have a dataset consisting of 4 variables, namely date, gold price, crude price and dollar price. I converted the data to a time series using the ts() function. After converting, the dates were changed into numeric values. I am able to retrieve the dates from the converted values using the as.Date() function. Now I want to replace the numeric values in the ts object's DATE variable with the dates themselves.
#CONVERTING DATA FRAME TO TIME SERIES OBJECT.
Gold.ts <- ts(Gold,start=Gold$DATE[1])
head(Gold.ts)
#OUTPUT.
DATE GOLD.PRICE CRUDE DOLLAR.INR
[1,] 13152 533.9 63.42 44.705
[2,] 13153 526.3 62.79 44.600
[3,] 13154 539.7 64.21 44.320
head(as.Date(index(Gold.ts)))
[1] "2006-01-04" "2006-01-05" "2006-01-06" "2006-01-07" "2006-01-08" "2006-01-09"
Gold.ts$DATE <- as.Date(index(Gold.ts)) # This won't work because $ is not acceptable to extract variables from a time series object.
index(Gold.ts) <- as.Date(index(Gold.ts)) #This should work but gives error. How to display date instead of values in time series object i.e Gold.ts?
What is the right way to do it?

R - converting dataframe to xts adds quotations (in xts)

I am using a dataframe to create an xts.
The xts gets created, but all values (except the index of the xts) are wrapped in quotation marks. As a result I cannot use the data, since many functions, such as sum, do not work.
Any ideas how I can get an xts produced without the quotation marks?
Here is my code [updated following a comment about inconsistent names of dataframes/xts]:
library(xts)
# creates a dataframe with dates and currency
mydf3 <- data.frame(date = c("2013-12-10", "2015-04-01",
"2016-01-03"), plus = c(4, 3, 2), minus = c(-1, -2, -4))
# transforms the column date to date-format
mydf3 = transform(mydf3,date=as.Date(as.character(date),format='%Y-%m-%d'))
# creates the xts, based on the dataframe mydf3
myxts3 <- xts(mydf3, order.by = mydf3$date)
# removes the column date, since date is now stored as index in the xts
myxts3$date <- NULL
You need to realize that the underlying data structure that stores your data in the xts object is an R matrix object, which can only be of one R type (e.g. all numeric or all character). The timestamps are stored as a separate vector (your date column in this case) which is used for indexing/subsetting the data by time.
The cause of your problem is that your date column is forcing the matrix of data to convert to character type matrix (in the xts object) instead of numeric. It seems that the date class converts to character when it is included in the matrix:
> as.matrix(mydf3)
date plus minus
[1,] "2013-12-10" "4" "-1"
[2,] "2015-04-01" "3" "-2"
[3,] "2016-01-03" "2" "-4"
Any time you have non-numeric data in your data you're converting to xts (in the x argument of xts), you'll get this kind of problem.
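The coercion can be seen without xts at all. The following is a small base-R sketch using the question's data; m_all and m_num are illustrative names, not part of the original code.

```r
mydf3 <- data.frame(date  = as.Date(c("2013-12-10", "2015-04-01", "2016-01-03")),
                    plus  = c(4, 3, 2),
                    minus = c(-1, -2, -4))

# A matrix holds a single type, so including the Date column forces character
m_all <- as.matrix(mydf3)
typeof(m_all)

# Keeping only the numeric columns preserves a numeric matrix
m_num <- as.matrix(mydf3[, c("plus", "minus")])
typeof(m_num)

# Arithmetic works on the numeric matrix but would fail on m_all
sum(m_num[, "plus"])
```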
Your issue can be resolved as follows (which wici has shown in the comments)
myxts3 <- xts(x= mydf3[, c("plus", "minus")], order.by = mydf3[, "date"])
> coredata(myxts3)
plus minus
[1,] 4 -1
[2,] 3 -2
[3,] 2 -4
> class(coredata(myxts3))
[1] "matrix"
The date part should be in the index, not part of the data. read.zoo will do that for you producing a zoo object. You can then convert that to xts if you need it in that form.
library(xts)
as.xts(read.zoo(mydf3))
## plus minus
## 2013-12-10 4 -1
## 2015-04-01 3 -2
## 2016-01-03 2 -4

Extract numbers from string as numeric or dates in R

I am working with some hdf5 data sets. However, the dates are stored inside the files, and the file names give no hint of them. The attribute file consists of day of the year, month of the year, day of the month and year columns.
I would like to pull out data to create a time series identity for each of the files, i.e. a year-month-day format that can be used for time series.
A sample of the data can be downloaded here:
[ ftp://l5eil01.larc.nasa.gov/tesl1l2l3/TES/TL3COD.003/2007.08.31/TES-Aura_L3-CO_r0000006311_F01_09.he5 ]
There is an attribute group file and a data group file.
I use the R library "rhdf5" to explore the hdf5 files. For example:
library(rhdf5)
CO1 <- h5ls("TES-Aura_L3-CO_r0000006311_F01_09.he5")
Attr <- h5read("TES-Aura_L3-CO_r0000006311_F01_09.he5", "HDFEOS INFORMATION/coremetadata")
# HDF5 group paths use forward slashes; a backslash starts an escape sequence in an R string
Data <- h5read("TES-Aura_L3-CO_r0000006311_F01_09.he5", "HDFEOS/SWATHS/ColumnAmountNO2/Data Fields/ColumnAmountNO2Trop")
The Attr, when read, consists of a long string in which the only required information is "2007-08-31", the date of acquisition. I have been able to extract this using the stringr library:
library(stringr)
regexp <- "([[:digit:]]{4})([-])([[:digit:]]{2})([-])([[:digit:]]{2})"
Date <- str_extract(Attr, pattern = regexp)
which returns the Date as:
"2007-08-31"
The only problem left now is that the Date isn't recognised as numeric or as a date. How do I change this? I need to bind the Date with the data for all days to create a time series (more like an identifier, as the data sets are irregular). A sample of how it looks after extracting the dates from the strings and binding them with the CO values for each date is below:
Dates CO3b
[1,] "2011-03-01" 1.625811e+18
[2,] "2011-03-04" 1.655504e+18
[3,] "2011-03-11" 1.690428e+18
[4,] "2011-03-15" 1.679871e+18
[5,] "2011-03-17" 1.705987e+18
[6,] "2011-03-17" 1.661198e+18
[7,] "2011-03-17" 1.662694e+18
[8,] "2011-03-20" 1.520328e+18
[9,] "2011-03-21" 1.510642e+18
[10,] "2011-03-21" 1.556637e+18
However, R recognises these dates as character and not as date. I need to convert them to a time series I can work with.
Seems like you've already done all the hard work! Based off your comment, here's how you could take it across the finish line.
From your comment, it seems you have the strings in a good format. Given that your variable is named Date, simply go
dateObjects<-as.Date(Date) #where Date is your variable
and either the single value or vector of character strings (as the format you gave in the comment) will now be date objects, which you could use with a library like zoo to create time series.
If your strings are not necessarily in the format you've described, then refer to the following link to see how to format other string forms as dates.
http://www.statmethods.net/input/dates.html
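As a brief illustration of that case, as.Date() accepts a format argument built from strptime conversion codes. The input strings below are invented examples, not values from the hdf5 files.

```r
# Month/day/year separated by slashes
us_style <- as.Date("08/31/2007", format = "%m/%d/%Y")

# Compact year-month-day without separators
compact <- as.Date("20070831", format = "%Y%m%d")

# Day first, separated by dots
day_first <- as.Date("31.08.2007", format = "%d.%m.%Y")

# All three parse to the same Date
c(us_style, compact, day_first)
```

Strings that do not match the supplied format come back as NA, which makes a quick anyNA() check worthwhile after converting a long vector.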
Given your example data frame you can create a time series in the following way, using the package zoo.
library(zoo)
datavect <- as.zoo(df$CO3b)
index(datavect) <- as.Date(df$Dates) # the date column is named Dates in your sample
Here we take your CO data, convert it to a zoo object, then assign the appropriate date to each entry, converting it from a character to a date object. Now if you print datavect, you'll see each data entry attached to a date. This allows you to take advantage of zoo methods, such as merge and window.
Here is one approach not using string extraction. If you know how long your time series should be, which you should based on the length of your dataset and knowledge of its periodicity, you could just create a regular date series and then add that into a data.frame with other variables of interest. Assuming you have daily data the below would work. Obviously your length.out would be different.
d1 <- ISOdate(year=2007,month=8,day=31)
d2 <- as.Date(format(seq(from=d1,by="day",length.out=10),"%Y-%m-%d"))
[1] "2007-08-31" "2007-09-01" "2007-09-02" "2007-09-03" "2007-09-04" "2007-09-05" "2007-09-06" "2007-09-07" "2007-09-08" "2007-09-09"
class(d2)
[1] "Date"
Edit of Original:
Oh I see. Well, after reading in your new data example, the code below worked for me. It was a pretty straightforward transform. Cheers
library(magrittr) # Needed for the pipe operator %>% it makes it really easy to string steps together.
dateData
Dates CO3b
1 2011-03-01 1.63e+18
2 2011-03-04 1.66e+18
3 2011-03-11 1.69e+18
4 2011-03-15 1.68e+18
5 2011-03-17 1.71e+18
6 2011-03-17 1.66e+18
7 2011-03-17 1.66e+18
8 2011-03-20 1.52e+18
9 2011-03-21 1.51e+18
10 2011-03-21 1.56e+18
dateData %>% sapply(class) # classes before transforming (character,numeric)
dateData[,1] <- as.Date(dateData[,1]) # Transform to date
dateData %>% sapply(class) # classes after transforming (Date,numeric)
str(dateData) # one more check
'data.frame': 10 obs. of 2 variables:
$ Dates: Date, format: "2011-03-01" "2011-03-04" "2011-03-11" "2011-03-15" ...
$ CO3b : num 1.63e+18 1.66e+18 1.69e+18 1.68e+18 1.71e+18 ...

Efficiently formatting date and time in large data sets in R - dplyr performance

This question is on code performance. I have a data frame with two columns:
DATE is represented as numeric in MMDDYYYY format
EPOCH is a representation of time in 5 minute increments from midnight. EPOCH count starts at 0 - so 00:00 to 00:05 would be 0, 00:05 to 00:10 would be 1 and so on.
I have about 15 million rows of data in my data frame. As a part of processing this data I am converting the two columns to R's Date and POSIXct format. I am using dplyr; however, the code I have is taking way too long (about 30 minutes). Below I generate a toy data set and provide the code I am using:
library(dplyr)
DATA <- data.frame(DATE = rep(10082013,15000000), EPOCH = rep(6,15000000))
Here is a sample view of the data
DATA %>%
head()
DATE EPOCH
1 10082013 6
2 10082013 6
3 10082013 6
4 10082013 6
5 10082013 6
6 10082013 6
This is the part where I transform the data into the format I want it in:
DATA %>%
mutate(DATE_FORMATTED = as.Date(as.character(DATE), "%m%d%Y")) %>%
mutate(DOW = weekdays(DATE_FORMATTED)) %>%
mutate(TIME_FORMATTED = strftime(as.POSIXct(((EPOCH+1)*5*60), origin=as.character(DATE_FORMATTED), tz="UTC"), format="%R", tz="UTC")) %>%
head()
I feel the overhead is due to all the coercions in the TIME_FORMATTED formula. Is there a way to achieve the end result faster? Maybe a different function that is dplyr optimized?
As suggested in "Why is as.Date slow on a character vector?", the bottleneck is probably strptime. In particular, the answer by user daniel.s suggests using lubridate::fast_strptime.
And there's no need to convert DATE_FORMATTED to character.
Mind you, I haven't done any testing myself so maybe a better answer will come along.
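One way to sidestep most of the parsing is sketched below in base R. It is not the answer's lubridate suggestion and has not been timed on 15 million rows: it parses each distinct DATE only once and builds the HH:MM string by integer arithmetic instead of going through POSIXct. The names date_keys, date_vals and end_min are illustrative, and the second date value is invented to show zero-padding.

```r
# Toy data in the question's layout (far fewer rows than the original)
DATA <- data.frame(DATE  = rep(c(10082013, 1052014), each = 3),
                   EPOCH = c(0, 6, 287, 0, 6, 287))

# Parse each distinct DATE once, then map the results back onto all rows.
# sprintf() zero-pads 7-digit values such as 1052014 (January 5, 2014).
date_keys <- unique(DATA$DATE)
date_vals <- as.Date(sprintf("%08d", as.integer(date_keys)), format = "%m%d%Y")
DATA$DATE_FORMATTED <- date_vals[match(DATA$DATE, date_keys)]
DATA$DOW <- weekdays(DATA$DATE_FORMATTED)

# End of each 5-minute interval in minutes past midnight, wrapped at 24 hours
end_min <- as.integer((DATA$EPOCH + 1) * 5) %% 1440L
DATA$TIME_FORMATTED <- sprintf("%02d:%02d", end_min %/% 60L, end_min %% 60L)
```

The same expressions can be placed inside mutate() if staying within a dplyr pipeline is preferred.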

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily data.
I have a data frame where NAs have been removed (Link of picture of data is given below). Data has been collected 3 times a day, but sometimes due to NAs, there is just 1 or 2 entries per day; some days data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the data of "dist" of one day and dividing it by number of entries per day (so 3 if there is no data missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that for every day, it should sum up "dist" and divide it by the number of entries that are available for every day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested, however, the mean value of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data :
dates <- rep(seq(as.Date("2001-01-05"),
as.Date("2001-01-20"),
by="day"),
each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of :
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a dataframe back, you can look at the package plyr :
Data <- data.frame(dates, values)
library(plyr)
ddply(Data, "dates", summarise, mean_value = mean(values, na.rm = TRUE))
Keep in mind that ddply is not fully supporting the date format (yet).
Look at the data.table package especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = "date"]
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
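A small self-contained sketch of those two steps follows. The toy data frame stands in for Dis_sub, its dist values are invented, and UTC is assumed for the timestamps.

```r
# Toy stand-in for Dis_sub: up to three readings per day, one of them missing
toy <- data.frame(
  date = as.POSIXct(c("2006-10-06 12:00:00", "2006-10-06 20:00:00",
                      "2006-10-07 04:00:00", "2006-10-07 12:00:00",
                      "2006-10-07 20:00:00", "2006-10-08 04:00:00",
                      "2006-10-08 12:00:00", "2006-10-08 20:00:00"), tz = "UTC"),
  dist = c(10, 20, 30, 40, 50, 60, 70, NA)
)

# Strip the time of day so that all readings from one day share a key
toy$date_only <- as.Date(toy$date, tz = "UTC")

# One mean per calendar day; the NA reading is ignored
daily <- aggregate(dist ~ date_only, data = toy, FUN = mean, na.rm = TRUE)
daily
```

Grouping on the full timestamp instead gives one group per reading, which is why the output in the question shows the individual values rather than daily means.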
However if for some reason you really want to use a loop you could try something like
newFrame <- data.frame()
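# loop over the dates as character strings, because for() drops the Date class
for (d in unique(as.character(Dis_sub$date_only))) {
meanDist <- mean(Dis_sub$dist[as.character(Dis_sub$date_only) == d], na.rm = TRUE)
newFrame <- rbind(newFrame, data.frame(date = d, mean_dist = meanDist))
}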
But keep in mind that this will be slow and memory-inefficient.
