Trouble finding non-unique index entries in zooreg time series - r

I have several years of data that I'm trying to work into a zoo object (.csv at Dropbox). I get a warning once the data is coerced into a zoo object, but I cannot find any duplicates in the index.
df <- read.csv(choose.files(default = "", caption = "Select data source", multi = FALSE), na.strings="*")
df <- read.zoo(df, format = "%Y/%m/%d %H:%M", regular = TRUE, row.names = FALSE, col.names = TRUE, index.column = 1)
Warning message:
In zoo(rval3, ix) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I've tried:
sum(duplicated(df$NST_DATI))
But the result is 0.
Thanks for your help!

You are using read.zoo(...) incorrectly. According to the documentation:
To process the index, read.zoo calls FUN with the index as the first
argument. If FUN is not specified then if there are multiple index
columns they are pasted together with a space between each. Using the
index column or pasted index column: 1. If tz is specified then the
index column is converted to POSIXct. 2. If format is specified then
the index column is converted to Date. 3. Otherwise, a heuristic
attempts to decide among "numeric", "Date" and "POSIXct". If format
and/or tz is specified then they are passed to the conversion function
as well.
You are specifying format=... so read.zoo(...) converts everything to Date, not POSIXct. Obviously, there are many, many duplicated dates.
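To see why duplicates appear, here is a small sketch (made-up timestamps, not from the question's data) showing that two different times on the same day collapse to the same Date once format= triggers Date conversion:

```r
# Two distinct timestamps on the same day become one Date,
# i.e. duplicate index entries for zoo
idx <- as.Date(c("2013/01/01 00:00", "2013/01/01 00:15"),
               format = "%Y/%m/%d %H:%M")
idx
# both entries are "2013-01-01"
```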
Simplistically, the correct solution is to use:
df <- read.zoo(df, FUN=as.POSIXct, format = "%Y/%m/%d %H:%M")
# Error in read.zoo(df, FUN = as.POSIXct, format = "%Y/%m/%d %H:%M") :
# index has bad entries at data rows: 507 9243 18147 26883 35619 44355
but as you can see this does not work either. Here the problem is much more subtle. The index is converted using POSIXct, but in the system time zone (which on my system is US Eastern). The referenced rows have timestamps that coincide with the changeover from Standard to DST, so these times do not exist in the US Eastern timezone. If you use:
df <- read.zoo(df, FUN=as.POSIXct, format = "%Y/%m/%d %H:%M", tz="UTC")
the data imports correctly.
EDIT:
As @G.Grothendieck points out, this would also work, and is simpler:
df <- read.zoo(df, tz="UTC")
You should set tz to whatever timezone is appropriate for the dataset.
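To see the DST problem in isolation, here is a sketch (not from the original post, using a made-up timestamp) of a wall-clock time that falls in the US Eastern spring-forward gap:

```r
# 02:30 on 2017-03-12 does not exist in US Eastern time (clocks jump
# from 02:00 to 03:00); on most platforms parsing it in that zone yields NA
t_gap <- as.POSIXct("2017-03-12 02:30", format = "%Y-%m-%d %H:%M",
                    tz = "America/New_York")
# the same clock time parses fine in UTC, which has no DST transitions
t_utc <- as.POSIXct("2017-03-12 02:30", format = "%Y-%m-%d %H:%M",
                    tz = "UTC")
```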

Related

Convert a character column to dates in R

I am trying to convert a date column (x_date) of class "character" with values like "31.03.2013" into Dates of the form "2013-03-31".
I tried with the following codes:
as.Date(x_date, format = "%d-%m-%Y")
as.Date(x_date, format="%Y-%m-%d")
as.Date(x_date,format= "%Y-%m-%d", tryFormats = c("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y"), optional=FALSE )
in all of the three cases the complete data column turns into "NA".
Then I tried this code:
format.Date(x_date, format="%Y-%m-%d")
but that throws an error.
Can anybody help me to convert my column into the respective Dates?
Specify the actual format of the data in the format argument instead of in tryFormats:
as.Date(x_date, format = '%d.%m.%Y')
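A minimal illustration (sample values made up to match the question's format):

```r
# "%d.%m.%Y" matches day.month.year separated by dots
x_date <- c("31.03.2013", "01.04.2013")
as.Date(x_date, format = "%d.%m.%Y")
# → "2013-03-31" "2013-04-01"
```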

Trouble using foverlaps with POSIXct objects - maybe due to fractional seconds?

I have 2 data tables and I would like to find the rows that overlap using foverlaps. I think I am getting tripped up because some of the dates have fractional seconds.
library(data.table)
First create a data table of shift times
On <- as.POSIXct(c("2017-08-01 00:05:54", "2017-08-01 00:07:20", "2017-08-01 00:21:53"), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
Off <- as.POSIXct(c("2017-08-01 00:05:54", "2017-08-01 00:07:20", "2017-08-01 00:21:53"), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
shifts <- data.table(On, Off)
Now create a data table of observations times
The first bunch of observation times are from Matlab, so need to be converted to POSIXct first. These end up giving me fractional seconds
timestamp <- c(736908.0041, 736908.0051, 736908.009, 736908.012, 736908.0152)
Obs = data.table(SightingTime = as.POSIXct((timestamp-719529)*86400, origin = "1970-01-01", tz = "UTC"))
#add a variable for the "date type"
Obs$DateType = "Long"
Add a row to the data table that does not have fractional seconds (for the purpose of this example)
Obs <- rbind(Obs, data.table(SightingTime=as.POSIXct("2017-08-01 00:05:54", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), DateType = "Short"))
create point intervals so can use foverlaps
Obs[, SightingTime2 := SightingTime]
get ready for foverlaps
setkey(Obs, SightingTime, SightingTime2)
setkey(shifts, On, Off)
do the overlap join
Obs.ov <- foverlaps(shifts, Obs ,type="any",nomatch=0L)
This results in Obs.ov having a single row - the overlap with the "Short" date format. Rows with the "Long" date format don't get included in the overlap. I would have expected three rows to overlap (assuming the fractional seconds would be rounded off): the 00:05:54 and 00:21:53 "Long" timestamps as well.
I think this might be due to the fractional seconds in the dates I converted from Matlab, but I don't know how to get rid of the fractional bit. I did try using
attributes(Obs$SightingTime)$format <- "%Y-%m-%d %H:%M:%OS"
as well as including the "format" argument when the SightingTime variable was created from the "timestamp" variable early on. But have had no luck with either.
I did look here How to format fractional seconds in POSIXct in r, but can't quite figure out what change I need to make based on this.
I found what I needed here Remove seconds from time in R
I just needed to round off the seconds after creating the SightingTime variable, but before creating the "SightingTime2" variable.
Obs$SightingTime <- as.POSIXct(round(Obs$SightingTime, units="secs"))
Now when I do the overlaps, I get the 3 overlapping rows as expected.
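The fractional-seconds issue and the fix can be seen on a single Matlab datenum from the question (the %OS3 format is just to display the fractional part):

```r
# Matlab datenum 736908.0041 → POSIXct carries fractional seconds
timestamp <- 736908.0041
t_frac <- as.POSIXct((timestamp - 719529) * 86400,
                     origin = "1970-01-01", tz = "UTC")
format(t_frac, "%Y-%m-%d %H:%M:%OS3")  # shows ~00:05:54.24

# rounding to whole seconds makes it comparable with the shift times
t_round <- as.POSIXct(round(t_frac, units = "secs"))
format(t_round, "%H:%M:%S", tz = "UTC")
# → "00:05:54"
```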

Custom function to find date column in df and standardize name to "date" in R

A lot of my work involves unioning new datasets to old, but often the standardized "date" name I have in the master dataset won't match up to the date name in the new raw data (which may be "Date", "Day", "Time.Period", etc...). To make life easier, I'd like to create a custom function that will:
Detect the date columns in the new and old datasets
Standardize the column name to "date" (oftentimes the raw new data will come in with the date column named "Date" or "Day" or "Time Period", etc..)
Here are a couple datasets to play with:
Dates_A <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-12-31"), by = "day")
Dates_B <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-12-31"), by = "day")
Numbers <- rnorm(365)
df_a <- data.frame(Dates_A, Numbers)
df_b <- data.frame(Dates_B, Numbers)
My first inclination is to try a for-loop that searches for the class of the columns by index and automatically renames any with Class = Date to "date", but ideally I'd also like the function to solve for the examples below, where the class of the date column may be character or factor.
Dates_C <- as.character(Dates_B)
df_c <- data.frame(Dates_C, Numbers)
df_d <- data.frame(Dates_C, Numbers, stringsAsFactors = FALSE)
If you have any ideas or can point me in the right direction, I'd really appreciate it!
Based on the description, we could check whether a particular column is Date class, get a logical index and assign the name of that column to 'date'
is.date <- function(x) inherits(x, 'Date')
names(df_a)[sapply(df_a, is.date)] <- 'date'
This assumes there is only a single 'date' column in the dataset. If there are multiple 'date' columns, in order to avoid duplicate column names, use make.unique:
names(df_a) <- make.unique(names(df_a))
akrun's solution works for columns of class Date but not for columns of classes factor or character like you ask at the end of the question, so maybe the following can be of use to you.
library(lubridate)
checkDates <- function(x) {
  op <- options(warn = -1) # needed to keep stderr clean
  on.exit(options(op))     # reset to original value
  !all(is.na(ymd(x)))
}
names(df_c)[sapply(df_c, checkDates)] <- 'date'
names(df_d)[sapply(df_d, checkDates)] <- 'date'
Note that you could take inspiration from both solutions and combine them into one function: if inherits returns TRUE, you are done; otherwise try ymd.
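One possible combined sketch (the function name is made up; it assumes at most a handful of date-like columns and ymd-style strings):

```r
library(lubridate)

# rename any Date, date-like character, or date-like factor column to "date"
standardize_date_col <- function(df) {
  is_dateish <- function(x) {
    if (inherits(x, "Date")) return(TRUE)     # akrun's check
    if (is.factor(x)) x <- as.character(x)
    if (!is.character(x)) return(FALSE)
    suppressWarnings(!all(is.na(ymd(x))))     # the ymd-based check
  }
  names(df)[vapply(df, is_dateish, logical(1))] <- "date"
  make.unique(names(df)) -> names(df)         # guard against duplicates
  df
}
```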

Use dplyr::mutate and lubridate::force_tz based on arguments from data frame columns

I am trying to use lubridate::force_tz to add timezone information to timestamps (date+time) formatted as strings (as.character()). Both are stored as two columns in a data frame:
require(lubridate)
require(dplyr)
row1<-c(as.character(now()),"Etc/UTC")
row2<-c(as.character(now()+5),"America/Chicago")
df<-as.data.frame(rbind(row1,row2))
names(df)<-c("dt","tz")
x<-force_tz(as.POSIXct(as.character(now())),"Etc/UTC") #works
df<-df%>%mutate(newDT=force_tz(as.POSIXct(dt),tz)) #fails
I get: Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('matrix', 'character')"
Following Stibu's comments, I tried (an un-R like) approach with an iteration:
for (i in seq(from = 1, to = length(df$dt))) {
  timestamp <- as.character(df[i, 1])
  tz <- as.character(df[i, 2])
  print(tz)
  newdt <- force_tz(as.POSIXct(timestamp), tz)
  df[i, 3] <- newdt
  print(attr(df[i, 3], "tzone"))
  df$timezone <- attr(df[i, 3], "tzone")
}
This extracts the values correctly, but seems to get stuck with setting the value of the tz to the first value encountered - weirdly:
[1] "Etc/UTC"
[1] "Etc/UTC"
[1] "America/Chicago"
[1] "Etc/UTC"
I would have expected the last printout to result in "America/Chicago"
The df then looks like:
> df
dt tz newDT timezone
1 2016-04-13 23:07:45 Etc/UTC 2016-04-13 23:07:45 Etc/UTC
2 2016-04-13 23:07:50 America/Chicago 2016-04-14 04:07:50 Etc/UTC
You actually have two issues in your code, which I will discuss separately below.
dplyr works with data frames
Your df is a matrix, not a data frame. But mutate() (and functions from dplyr in general) works with data frames. The error message simply tells you that mutate() does not know what to do with a matrix.
You can solve this by converting df to a data frame:
df <- as.data.frame(df)
names(df)<-c("dt","tz")
A remark regarding names(): This function can be used to get/set the column names of a data frame. For matrices, the corresponding function is colnames(). You used names() on a matrix, which did not set the column names of the matrix. Therefore, the names of the data frame are also not set after conversion.
You could also create a data frame from the start as follows:
df <- data.frame(dt = as.character(c(now(), now() + 5)),
tz = c("Etc/UTC", "America/Chicago"),
stringsAsFactors = FALSE)
Note that you need to define the contents column-wise, not row-wise as you did.
If you use the data frame df, there will be no error from mutate().
One time zone per vector
Unfortunately, there is a second issue. What you want to do simply cannot be done. The reason is the following.
Let's convert the first column of df to POSIXct with time zone CET:
ts <- as.POSIXct(df$dt, tz = "CET")
ts
## [1] "2016-04-13 14:42:26 CEST" "2016-04-13 14:42:31 CEST"
Let's try to do the same with two time zones:
ts <- as.POSIXct(df$dt, tz = c("CET", "UTC"))
## Error in strptime(xx, f <- "%Y-%m-%d %H:%M:%OS", tz = tz) :
## invalid 'tz' value
This does not work. The reason is that there is a single time zone per vector and not a time zone per element in the vector. Look at the attributes of ts:
attributes(ts)
## $class
## [1] "POSIXct" "POSIXt"
##
## $tzone
## [1] "CET"
The time zone is set as an attribute of the entire vector and it is not a property of each element.
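Given that constraint, one per-row workaround (a sketch, not from the original answer) is to keep a formatted character column, since a character vector can carry a different zone per element; it rebuilds the question's example data frame:

```r
library(lubridate)

df <- data.frame(dt = c("2016-04-13 23:07:45", "2016-04-13 23:07:50"),
                 tz = c("Etc/UTC", "America/Chicago"),
                 stringsAsFactors = FALSE)

# force each wall-clock time into its own zone, then format with the zone;
# the result must be character, because one POSIXct vector has one tzone
df$newDT <- mapply(function(d, z)
                     format(force_tz(as.POSIXct(d, tz = "UTC"), z),
                            usetz = TRUE),
                   df$dt, df$tz, USE.NAMES = FALSE)
```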

Convert CSV with dates using lubridate

I got a dataset in CSV format that has two columns: Date and Value. There are hundreds of rows in the file. Date format in the file is given as YYYY-MM-DD. When I imported this dataset, the Date column got imported as a factor, so I cannot run a regression between those two variables.
I am very new to R, but I understand that lubridate can help me convert the data in the Date column. Could someone provide some suggestions on what command should I use to do so? The file name is: Test.csv.
Next time please provide some test data and show what you did. For variations see ?as.Date and ?read.csv. The following does not use any packages:
# test data
Lines <- "Date,Value
2000-01-01,12
2001-01-01,13"
# DF <- read.csv("myfile.csv")
DF <- read.csv(text = Lines)
DF$Date <- as.Date(DF$Date)
plot(Value ~ Date, DF, type = "o")
giving:
> DF
Date Value
1 2000-01-01 12
2 2001-01-01 13
Note: Since your data is a time series you might want to use a time series representation. In this case read.zoo automatically converts the first column to "Date" class:
library(zoo)
# z <- read.zoo("myfile.csv", header = TRUE, sep = ",")
z <- read.zoo(text = Lines, header = TRUE, sep = ",")
plot(z)
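Since the question asked specifically about lubridate, an equivalent sketch with lubridate::ymd (using the same test data as above) would be:

```r
library(lubridate)

# same test data as in the base-R answer
Lines <- "Date,Value
2000-01-01,12
2001-01-01,13"

DF <- read.csv(text = Lines)
DF$Date <- ymd(DF$Date)  # parses YYYY-MM-DD strings/factors to Date
```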