R: creating xts changes dataset, losing data - r

When creating an xts object from a data.frame I seem to lose some data (approximately 3,000 of 33,000 observations).
My dataset is as follows (the time is in day-month-year, EU format):
> head(mesdonnees)
time value
1 05-03-2006 04:07 NA
2 05-03-2006 04:17 NA
3 05-03-2006 04:27 NA
4 05-03-2006 04:37 NA
5 05-03-2006 04:47 NA
6 05-03-2006 04:57 NA
Due to the format I had to extract the different parts of the date (at least I couldn't get as.POSIXct to work with this format).
Here is how I did it:
# Extract characters and define as S....
Syear <- substr(mesdonnees$time, 7,10)
Smonth <- substr(mesdonnees$time, 4,5)
Sday <- substr(mesdonnees$time, 1, 2)
#Gather all parts and use "-" as sep
datetext <- paste(Syear, Smonth, Sday, sep="-")
#define format of each part of the string
formatdate<-as.POSIXct(datetext, format="%Y-%m-%d", tz = "GMT")
I then try to create my xts with...
xtsdata <- xts(mesdonnees$value, order.by = formatdate, tz = "GMT")
... but when doing this I get some quite weird results: the first value is in 1900
> head(xtsdata)
[,1]
1900-01-04 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
and many (3000) dates are not kept:
> xtsdata[30225:30233,]
[,1]
2006-12-31 0
2006-12-31 0
2006-12-31 0
2006-12-31 0
<NA> NA
<NA> NA
<NA> NA
<NA> NA
<NA> NA
When looking at what should be the same row in both my data.frame and my xts, I can see that the rows are offset (the date format was changed when the xts object was created):
> mesdonnees[25617,]
time value
25617 08-11-2006 23:51 0
> xtsdata[25617,]
[,1]
2006-11-25 0.27
How is it that my data are offset? I tried changing the tz but it doesn't affect it. I removed all duplicates using the dplyr package, but that doesn't affect the xts results either. Thank you for your help!
After changing my xts code to the one suggested by Joshua:
xtsdata <- xts(mesdonnees$value, order.by = as.POSIXct(mesdonnees$time, tz = "GMT", format = "%d-%m-%Y %H:%M"))
... my data show properly for the "last" part, but I now have a different problem. The first 2300 rows show the following results (xtsdata[1500,], or any row < 2300, displays the same results):
> view(xtsdata):
0206-06-30 23:08:00 NA
0206-06-30 23:18:00 NA
0206-06-30 23:28:00 NA
1900-01-04 12:00:00 NA
2006-03-05 04:07:00 NA
2006-03-05 04:17:00 NA
I noticed this error before and thought it was due to the date format; maybe it is not? Also, when looking at xtsdata I do not get the same results for the same row (the last rows are correct, though):
> mesdonnees[2360,]
time value
2360 23-03-2006 03:09 NA
> xtsdata[2360,]
[,1]
2006-03-05 09:07:00 NA
As requested:
> str(mesdonnees)
'data.frame': 32556 obs. of 2 variables:
$ time : chr "05-03-2006 04:07" "05-03-2006 04:17" "05-03-2006 04:27" "05-03-2006 04:37" ...
$ value: num NA NA NA NA NA NA NA NA NA NA ...
And if needed:
An ‘xts’ object on 0206-06-01 00:09:00/2006-12-31 23:29:00 containing:
Data: num [1:32556, 1] NA NA NA NA NA NA NA NA NA NA ...
Indexed by objects of class: [POSIXct,POSIXt] TZ: GMT
xts Attributes:
NULL

The problem is that you only include the date portion of the timestamp in datetext and formatdate, but your data have dates and times.
You also do not need to do all the string subsetting. You can achieve the same result by specifying the format argument in your as.POSIXct call.
xtsdata <- xts(mesdonnees$value,
  as.POSIXct(mesdonnees$time, tz = "GMT", format = "%d-%m-%Y %H:%M"))
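The stray 1900 entries and the row offsets in the question can be investigated before the xts object is built. The following is a minimal base R sketch, assuming a small made-up vector in the question's layout (the 1900 entry and the empty string are hypothetical stand-ins for malformed rows). It flags strings that fail to parse, flags years outside the expected range, and shows that a time-ordered index, which is what xts uses, does not follow file order.

```r
# Hypothetical sample in the question's day-month-year layout; the last two
# entries are made-up stand-ins for malformed rows.
time_chr <- c("05-03-2006 04:17", "05-03-2006 04:07",
              "04-01-1900 12:00", "")

# Parse date and time in one step
parsed <- as.POSIXct(time_chr, format = "%d-%m-%Y %H:%M", tz = "GMT")

# Entries that do not match the format come back as NA
unparsed <- which(is.na(parsed))

# Entries that parse but fall outside the expected year
yr <- as.integer(format(parsed, "%Y"))
out_of_range <- which(!is.na(yr) & yr != 2006L)

# A time-ordered index visits the rows in this order, not in file order
sorted_rows <- order(parsed, na.last = NA)
```

Because the index is sorted by time, row i of the data.frame only matches row i of the xts object when the input is already in time order and contains no out-of-range timestamps.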

Related

format input to ts object in R

I have some data with a time column expressed as week.year and a corresponding number of units measured in that week.
Week-Year Units
01.2020 39.12727273
02.2020 33.34545455
03.2020 118.7181818
04.2020 83.71818182
05.2020 58.56985
. .
52.2020 89.54651534
I have to create a ts object which takes these Week-Year values as input.
The reason for requiring this step is that values are sometimes missing for certain weeks, so using an auto-generated time scale (start=, end=, frequency=) will misalign the readings.
Is there any way of achieving it? or is there any way to accommodate such a situation?
R novice here, would really appreciate some guidance. :)
Assuming the input is the data frame DF shown reproducibly in the Note at the end, convert it to a zoo object and then use as.ts to create a ts series with frequency 52.
library(zoo)
week <- as.integer(DF[[1]])
year <- as.numeric(sub("...", "", DF[[1]]))
z <- zoo(DF[[2]], year + (week - 1) / 52)
tt <- as.ts(z)
tt
## Time Series:
## Start = c(2020, 1)
## End = c(2020, 52)
## Frequency = 52
## [1] 39.12727 33.34545 118.71818 83.71818 58.56985 NA NA
## [8] NA NA NA NA NA NA NA
## [15] NA NA NA NA NA NA NA
## [22] NA NA NA NA NA NA NA
## [29] NA NA NA NA NA NA NA
## [36] NA NA NA NA NA NA NA
## [43] NA NA NA NA NA NA NA
## [50] NA NA 89.54652
frequency(tt)
## [1] 52
class(tt)
## [1] "ts"
Note
Lines <- " Week-Year Units
01.2020 39.12727273
02.2020 33.34545455
03.2020 118.7181818
04.2020 83.71818182
05.2020 58.56985
52.2020 89.54651534"
DF <- read.table(text = Lines, header = TRUE, colClasses = c("character", NA))
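The same idea can be sketched without zoo: place each reading in the slot of its week number and leave the missing weeks as NA. This is only an illustration under stated assumptions: the `key` and `units` vectors are a hypothetical subset of the question's data, and the sketch handles a single year only.

```r
# Hypothetical subset of the question's data; most weeks are absent
key   <- c("01.2020", "02.2020", "05.2020", "52.2020")
units <- c(39.12727273, 33.34545455, 58.56985, 89.54651534)

# Split "ww.yyyy" into its week and year parts
week <- as.integer(substr(key, 1, 2))
year <- as.integer(substr(key, 4, 7))
stopifnot(all(year == 2020L))  # this sketch assumes one year only

# One slot per week; weeks without a reading stay NA
vals <- rep(NA_real_, 52)
vals[week] <- units

tt <- ts(vals, start = c(2020, 1), frequency = 52)
```

Each value lands in the position of its own week, so gaps in the input no longer shift later readings.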

Replace values in dataframe by matching dates of different lengths

I have 52 time series files with differing lengths for date. All have the same end date - 31-01-2017, but all 52 dataframes have different start dates.
'data': nRows
Date FLOW Modelled
01-01-1992 1.856 NA
02-01-1992 1.523 NA
03-01-1992 2.623 NA
04-01-1992 3.679 NA
...
31-12-2017
I also have a file with simulated FLOW values for each of the datasets in columns.
'Simulated': 20819 rows, 53 columns (including Date).
Date 1 2 3 ..52
01-01-1961 1.856 2.889 2.365
02-01-1961 1.523 3.536 4.624
03-01-1961 2.536 2.452 6.352
04-01-1961 3.486 4.267 3.685
...
31-12-2017
I want to select each column from the Simulated data (e.g. column 1 corresponds to 'data' file 1) and fill the Modelled column of 'data' with the simulated values. Ideally this would loop through the 52 files based on a list of their names.
The problem I am facing is when using left_join the error I get is
e.g. replacement has 20819 rows, data has 9657
when 'data' is shorter than 'Simulated', and
e.g. replacement has 20819 rows, data has 22821
when 'data' is longer than 'Simulated'.
I have tried to use left_join from the dplyr package with no luck, as the dates do not match up across the 'data' and 'Simulated' dataframes.
library(dplyr)
df <-left_join(data, Simulated, by = c("Date"),all.x=TRUE)
I have formatted the dates in both 'data' and 'Simulated' using something similar to Simulated$Date <- as.Date(with(Simulated, paste(Year, Month, Day, sep="-")), "%Y-%m-%d"). But I still get the error below when using left_join:
cannot join a Date object with an object that is not a Date object
A solution can be achieved using tidyverse and read.table. First read all data frames from all files in a list and then use dplyr::bind_rows to merge them in one dataframe.
# Get the list of data files (File1.txt, File2.txt, ...), excluding simulated.txt
filelist = list.files(path = ".", pattern = "^File\\d+\\.txt$", full.names = TRUE)
# Read all files in a list
ll <- lapply(filelist, FUN=read.table, header=TRUE, stringsAsFactors = FALSE)
# Read data from the file containing the simulated data
simulated <- read.table(file = "simulated.txt", header=TRUE, stringsAsFactors = FALSE)
library(tidyverse)
#Convert simulated data to long format and then join with other dataframes
simulated %>% mutate(Date = as.Date(Date, format = "%d-%m-%Y")) %>%
gather(df_num, SIM_FLOW, -Date) %>%
mutate(df_num = gsub("X(\\d+)", "\\1", df_num)) %>%
right_join(bind_rows(ll, .id="df_num") %>% mutate(Date = as.Date(Date, format = "%d-%m-%Y")),
by=c("df_num", "Date"))
# Date df_num SIM_FLOW FLOW Modelled
# 1 1992-01-01 1 1.86 1.86 NA
# 2 1992-01-02 1 NA 1.52 NA
# 3 1992-01-03 1 NA 2.62 NA
# 4 1992-01-04 1 NA 3.68 NA
# 5 1993-01-01 2 NA 11.86 NA
# 6 1993-01-02 2 3.54 11.52 NA
# 7 1993-01-03 2 NA 12.62 NA
# 8 1993-01-04 2 NA 13.68 NA
# 9 1994-01-01 3 NA 111.86 NA
# 10 1994-01-02 3 NA 111.52 NA
# 11 1994-01-03 3 6.35 112.62 NA
# 12 1994-01-04 3 NA 113.68 NA
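For a single file, the fill-in step can also be sketched in base R with match(), which avoids both the length-mismatch error and the Date class mismatch from the question. The data frames below are made-up miniatures, and the column name X1 is an assumption based on how read.table renames a column headed 1.

```r
# Hypothetical miniature versions of one 'data' file and of 'Simulated'
data1 <- data.frame(
  Date = c("01-01-1992", "02-01-1992", "03-01-1992"),
  FLOW = c(1.856, 1.523, 2.623),
  Modelled = NA_real_,
  stringsAsFactors = FALSE
)
simulated <- data.frame(
  Date = c("31-12-1991", "01-01-1992", "03-01-1992"),
  X1 = c(9.9, 1.8, 2.6),
  stringsAsFactors = FALSE
)

# Both Date columns must have the same class before they can be compared
data1$Date <- as.Date(data1$Date, format = "%d-%m-%Y")
simulated$Date <- as.Date(simulated$Date, format = "%d-%m-%Y")

# Position of each data1 date within simulated (NA when absent)
idx <- match(data1$Date, simulated$Date)
data1$Modelled <- simulated$X1[idx]
```

Dates present in only one of the two tables either stay NA in Modelled or are ignored, so differing start dates do not matter.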
Data:
simulated.txt
Date 1 2 3
01-01-1992 1.856 2.889 2.365
02-01-1993 1.523 3.536 4.624
03-01-1994 2.536 2.452 6.352
04-01-1902 3.486 4.267 3.685
File1.txt
Date FLOW Modelled
01-01-1992 1.856 NA
02-01-1992 1.523 NA
03-01-1992 2.623 NA
04-01-1992 3.679 NA
File2.txt
Date FLOW Modelled
01-01-1993 11.856 NA
02-01-1993 11.523 NA
03-01-1993 12.623 NA
04-01-1993 13.679 NA
File3.txt
Date FLOW Modelled
01-01-1994 111.856 NA
02-01-1994 111.523 NA
03-01-1994 112.623 NA
04-01-1994 113.679 NA

Converting Date formats in R [closed]

dates <- as.Date(dli$Dates)
class(dates)
[1] "Date"
dates
[1] "2016-01-01" "2016-01-02" "2016-01-03" "2016-01-04" "2016-01-05" "2016-01-06"
[7] "2016-01-07" "2016-01-08" "2016-01-09" "2016-01-10" "2016-01-11" "2016-01-12"
[13] "2016-01-13" "2016-01-14" "2016-01-15" "2016-01-16" "2016-01-17" "2016-01-18"
[19] "2016-01-19" "2016-01-20" "2016-01-21" "2016-01-22" "2016-01-23" "2016-01-24"
[25] "2016-01-25" "2016-01-26" "2016-01-27" "2016-01-28" "2016-01-29" "2016-01-30"
[31] "2016-01-31" "2016-02-01" "2016-02-02" "2016-02-03" "2016-02-04" "2016-02-05"
[37] "2016-02-06" "2016-02-07" "2016-02-08" "2016-02-09" "2016-02-10" "2016-02-11"
This is my date format, and I need to convert it into "2016-month-day".
I am getting NA values:
dates <- as.Date(dli$Dates,"%d/%b/%Y")
class(dates)
[1] "Date"
dates
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[31] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[61] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[91] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[121] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Can you give any suggestions?
Thanks in advance
Good practice is to store dates in R as YYYY-MM-DD, and your strings already seem to be in the right format, but:
The format you pass to as.Date must describe what the strings contain, not what you expect as output.
"%d/%b/%Y" stands for "day as a number (01-31), slash, abbreviated month name, slash, 4-digit year", whereas your strings are "4-digit year, dash, month as a number, dash, day as a number".
If you want to format the date, you need to call format :
> date <- "2016-01-01"
> date <- as.Date(date, format = "%Y-%m-%d")
> date
[1] "2016-01-01"
> format(date, "%d/%b/%Y")
[1] "01/jan/2016"
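The difference between the parsing format and the display format can be checked on a couple of strings. This is a small sketch using made-up values in the question's layout; it uses %m rather than %b for display so that the result does not depend on the locale.

```r
s <- c("2016-01-01", "2016-02-11")

# The format describes the input: a mismatch yields NA
wrong <- as.Date(s, format = "%d/%b/%Y")

# Describing what the strings actually contain works
right <- as.Date(s, format = "%Y-%m-%d")

# Changing how the dates are displayed is a separate step
shown <- format(right, "%d/%m/%Y")
```

Here `wrong` is all NA, `right` holds Date values, and `shown` is a character vector for display only.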
To obtain your required format, i.e. 2016-month-day, you can use the format function once you have converted the vector of strings to Date type.
I hope the code snippet below clears up your doubt.
> d = c("2016-02-08","2016-02-18","2015-02-08","2016-02-02")
> class(d)
[1] "character"
> d = as.Date(d)
> class(d)
[1] "Date"
> d = format(d,"%Y-%b-%d")
> d
[1] "2016-Feb-08" "2016-Feb-18" "2015-Feb-08" "2016-Feb-02"
Format function converts the date type objects into the required format. Refer to this link for more information on date type formatting.
If you just want to render your dates in this format, then use format:
x <- as.Date("2016-01-01")
format(x, "%Y %b %a %d")
[1] "2016 Jan Fri 01"
There is a separation of concerns here. If you already have your date information stored in R as date types, then you need not change anything internally to extract further information from those dates.
You would use as.Date() to convert between dates saved as character and Date objects.
If you want to change the format of a Date object, you can use format().
You have specified "2016-month-day" as the desired format of the dates in the question, but in the code you provide you are using "%d/%b/%Y". The way this works is: the % indicates that the next character is a conversion specification; everything else (e.g. - or /) is treated as a literal delimiter, either matched when parsing or inserted when formatting (see ?strptime for details).
So in your case, just use
dates <- format(as.Date(dli$Dates), format = "%Y-%b-%d")
to get the result specified in the text of the question:
[1] "2016-Jan-01" "2016-Jan-02" "2016-Jan-03" "2016-Jan-04" "2016-Jan-05"
or this:
dates <- format(as.Date(dli$Dates), format = "%Y/%b/%d")
to get what you have used in the code snippet:
[1] "2016/Jan/01" "2016/Jan/02" "2016/Jan/03" "2016/Jan/04" "2016/Jan/05"
You can use the package lubridate to convert to a date with ymd, then format it the way you want it displayed:
Dates_df <- mutate(Dates, dli = format(ymd(dli), "%Y/%b/%d"))
(I use dplyr here, I assume you have other variables in Dates)
without dplyr if you just want to keep the dates in a vector:
Dates_vec <- format(ymd(Dates$dli), "%Y/%b/%d")

xts assignment changes column class

I have a data.frame earlyCloses defined as follows:
earlyCloses <- read.table('EarlyCloses.txt', header=T, colClasses= c(rep("character", 3)))
earlyCloses
StartDate EndDate EarlyClose
1 2012-12-24 2012-12-24 13:00
I define a xts object pricesXts as follows:
prices <- read.table('sample.txt', header=T, colClasses=c("character", "numeric"))
pricesXts = xts(prices$Close, as.POSIXct(prices$Date, tz='America/New_York'))
colnames(pricesXts) = c("Close")
pricesXts$CloseTime = NA
pricesXts
Close CloseTime
2012-12-21 13190.84 NA
2012-12-24 13139.08 NA
2012-12-26 13114.59 NA
2012-12-27 13096.31 NA
2012-12-28 12938.11 NA
Now I execute a for loop over the rows of earlyCloses and set the CloseTime of pricesXts.
for (i in 1:nrow(earlyCloses)) {
pricesXts[paste(earlyCloses[i,"StartDate"], earlyCloses[i,"EndDate"], sep='/'), 2] = earlyCloses[i,"EarlyClose"]
}
pricesXts
Close CloseTime
2012-12-21 "13190.84" NA
2012-12-24 "13139.08" "13:00"
2012-12-26 "13114.59" NA
2012-12-27 "13096.31" NA
2012-12-28 "12938.11" NA
Why has the class of the Close column in the xts object changed from numeric to character? Is this because an xts object is represented internally as a matrix? Is there a way to avoid this conversion?
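An xts object is stored internally as a matrix (for better performance), and a matrix can hold only one type, so assigning a character value converts every column to character. Since you just want to store the early close, you can convert it to a numeric, for example: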
strptime(earlyCloses$EarlyClose,'%H:%M')$hour
Then
for (i in seq_len(nrow(earlyCloses)))
  # use the early close of row i only, converted to its hour
  pricesXts[paste(earlyCloses[i,"StartDate"],
                  earlyCloses[i,"EndDate"],
                  sep='/'), 2] <- strptime(earlyCloses[i,"EarlyClose"],'%H:%M')$hour
Close CloseTime
2012-12-21 13191 NA
2012-12-24 13139 13
2012-12-26 13115 NA
2012-12-27 13096 NA
2012-12-28 12938 NA
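The coercion described above is ordinary matrix behaviour and can be reproduced in base R without xts. This sketch uses a made-up two-row matrix with the question's column names. Encoding the time as fractional hours is an assumption chosen for illustration, so that a close such as 13:30 would not be truncated to 13.

```r
# A numeric matrix with the question's column names (values are made up)
m <- cbind(Close = c(13190.84, 13139.08), CloseTime = c(NA, NA))
type_before <- typeof(m)

# Assigning one character value converts the whole matrix to character
m[2, "CloseTime"] <- "13:00"
type_after <- typeof(m)

# Keeping everything numeric: encode the time as fractional hours
m2 <- cbind(Close = c(13190.84, 13139.08), CloseTime = NA_real_)
t <- strptime("13:00", "%H:%M", tz = "UTC")
m2[2, "CloseTime"] <- t$hour + t$min / 60
```

After the first assignment the Close values are strings too, which matches the quoted output in the question; `m2` stays numeric throughout.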

Using a date field in a ts?

I wonder how I can make use of an already existing date field when creating a ts in R.
Sometimes you simply have a date before you have a ts object, e.g.
x <- as.Date("2008-01-01") + c(30,60,90,120,150)
# add some data to it
df = data.frame(datefield=x,test=1:length(x))
Now, is there a way to use the datefield of the df as an index when creating a ts object? Because:
ts(df$test,start=c(2008,1,2),frequency=12)
(obviously) completely ignores the date information I already have. Making use of ts methods like acf is the reason why I'd like to make it a ts object. I typically use monthly and quarterly time series...
You don't necessarily need to create new types of objects from scratch; you can always coerce to other classes, including ts, as you need to. zoo or xts are arguably the most useful and intuitive, but there are others. Here is your example, cast as a zoo object, which we then coerce to class ts for use in acf().
## create the data
x <- as.Date("2008-01-01") + c(30,60,90,120,150)
df = data.frame(datefield=x,test=1:length(x))
## load zoo
library(zoo)
## convert to a zoo object, with order given by the `datefield`
df.zoo <- with(df, zoo(test, order.by = datefield))
## or to a regular zoo object
df.zoo2 <- with(df, zooreg(test, order.by = datefield))
Now we can easily go to a ts object using the as.ts() method:
> as.ts(df.zoo)
Time Series:
Start = 13920
End = 14040
Frequency = 0.0333333333333333
[1] 1 2 3 4 5
> ## zooreg object:
> as.ts(df.zoo2)
Time Series:
Start = 13909
End = 14029
Frequency = 1
[1] 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[21] NA NA NA NA NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA
[41] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[61] 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[81] NA NA NA NA NA NA NA NA NA NA 4 NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[121] 5
Notice the two ways in which the objects are represented (although we could have made the zooreg version the same as the standard zoo object by setting the frequency argument to 0.03333333):
> as.ts(with(df, zooreg(test, order.by = datefield,
+ frequency = 0.033333333333333)))
Time Series:
Start = 13920.0000000001
End = 14040.0000000001
Frequency = 0.033333333333333
[1] 1 2 3 4 5
We can use the zoo/zooreg object in acf() and it will get the correct lags (daily observations but every 30 days):
acf(df.zoo)
Whether this is intuitive to you or not depends on how you view the time series. We can do the same thing in terms of a 30-day interval via:
acf(coredata(df.zoo))
where we use coredata() to extract the time series itself, ignoring the date information.
I don't know exactly what you're trying to do, but acf also works on simple vectors, provided of course that the vector represents a regular time series (i.e. evenly spaced). Otherwise the result is just bollocks.
>acf(df$test)
Regarding the ts object :
The "dates" you see are just produced by the print.ts function, so they're not inherent to the ts object. The ts object has no date information in it. You can pass calendar = FALSE to print() to get the standard printout of the ts object.
> ts(df$test,start=2008,frequency=12)
Jan Feb Mar Apr May
2008 1 2 3 4 5
> print(ts(df$test,start=2008,frequency=12),calendar=F)
Time Series:
Start = c(2008, 1)
End = c(2008, 5)
Frequency = 12
[1] 1 2 3 4 5
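What a ts object actually stores can be inspected directly. This short sketch rebuilds the series from the example above and looks at its tsp attribute and time index, which are plain numbers rather than Date values.

```r
tt <- ts(1:5, start = c(2008, 1), frequency = 12)

# start, end and frequency are all the "time" information a ts carries
tt_tsp <- tsp(tt)

# the time index is numeric (fractions of a year here), not Date
tt_time <- as.numeric(time(tt))
is_date <- inherits(time(tt), "Date")
```

The month names in the printed calendar view are derived from these numbers and the frequency at print time.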
Now, the vector you construct looks like :
> x
[1] "2008-01-31" "2008-03-01" "2008-03-31" "2008-04-30" "2008-05-30"
which is or isn't regular, depending on how you see it. If you extract the months, you have 1 observation for January, none for February, 2 for March, 1 for April...: not regular. If you count an observation every 30 days: regular. With an observation every 30 days, you shouldn't bother about the dates, as 365 is not divisible by 30. Hence, one year you'll have 12 observations, another year you'll have 13. So you can't set the frequency in ts in a consistently correct way.
So I'd refrain from using a ts altogether, as James already indicated in the comments.
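Both views of the regularity of the question's dates can be checked in a few lines of base R. The sketch below uses the same x as the question; restricting the month table to January through May is a choice made here for brevity.

```r
x <- as.Date("2008-01-01") + c(30, 60, 90, 120, 150)

# Regular when measured in days: every gap is 30
gaps <- as.numeric(diff(x))

# Irregular when measured in calendar months (January to May)
per_month <- table(factor(format(x, "%m"), levels = sprintf("%02d", 1:5)))
counts <- as.integer(per_month)
```

`gaps` is constant, while `counts` shows an empty February and a doubled March, which is why a monthly frequency does not fit these dates.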
If you want:
Use the date information you already have
Easily set the frequency to a desired value
End up with a ts object
You can start with an xts object, add a frequency attribute, and then convert to ts:
library("xts")
my_xts <- xts(df$test, df$datefield)
attr(my_xts, 'frequency') <- 12 # Set the frequency
my_ts <- as.ts(my_xts)
The resulting ts object will have the specified period and will have the correct dates assigned to each data point.
