R time formatting with dirty data

I'm using R to generate a CZML file from a database.
The database has dirty data.
I need a way to make sure times are in the format "%H:%M:%S".
The data can already be in the correct %H:%M:%S form, or it can be missing the leading zero on the hour, e.g. 8:30:00, which is not valid ISO 8601 and throws the CZML parsing off entirely.
It always needs to look like 08:30:00 or 07:09:00, in 24-hour format.
I get errors because some values look like 8:30:00 or 7:09:00 (still 24-hour). I haven't checked whether the minutes or seconds are malformed too; for the moment I assume they are correct and that only the hours are the problem.
For example, I have a csv file like this:
"Date","Time","TZ","Jul.Time","BirdID","Species","Sex","Age","SiteID","Latitude","Longitude"
"4-Mar-13","08:30:00","America/Costa_Rica",2456356.187500,"test2","GREH","M","AHY","56scr25",8.71191178,-82.96866316
"4-Mar-13","8:30:00","America/Costa_Rica",2456356.187500,"test2","GREH","M","AHY","56scr25",8.71191178,-82.96866316
I need to generate a CZML like so:
"point": {
"color": {
"rgba": [
"2013-03-04T08:30:00Z",225,50,50,196,"2013-03-04T08:30:01Z",50,50,225,196,"2013-03-04T13:30:00Z",225,50,50,196,"2013-03-04T13:30:01Z",50,50,225,196,"2013-03-04T16:00:00Z",225,50,50,196,"2013-03-04T16:00:01Z",50,50,225,196
]
},
"pixelSize": { "number": 10 }
}
My code is like so:
j=1
numVisits=nrow(visitedTimes)
while(j<=numVisits){
date=as.Date(visitedTimes$Date[j], format="%d-%b-%y")
time=format(visitedTimes$Time[j], format="%H:%M:%S")
timeOfPassage=paste0(date,"T",time,"Z")
timeAfter=as.POSIXlt(timeOfPassage, format="%Y-%m-%dT%H:%M:%SZ")
timeAfter$sec=timeAfter$sec+1
timeAfter=format(timeAfter, format="%Y-%m-%dT%H:%M:%SZ")
cat(paste0("\"",timeOfPassage,"\","))
cat("225,50,50,196,")
cat(paste0("\"",timeAfter,"\","))
cat("50,50,225,196")
if(j<numVisits){
cat(",")
}
j=j+1
}
But it doesn't produce the desired output because of the dirty data.
Any ideas?

We can use times from chron
library(chron)
times(v1)
#[1] 08:30:00 08:30:00 07:09:00 07:09:00
Or using base R
format(strptime(v2, '%H:%M:%S'), '%H:%M:%S')
#[1] "08:30:00" "08:30:00" "07:09:00" "07:09:00" "07:09:05" "11:10:00"
Using the OP's updated dataset
df1$Time <- times(df1$Time)
df1$Time
#[1] 08:30:00 08:30:00
Or using regex
sub('^(.:)', '0\\1', df1$Time)
gsub('[^:]{2}(*SKIP)(*F)|(\\d)', '0\\1', v2, perl=TRUE)
#[1] "08:30:00" "08:30:00" "07:09:00" "07:09:00" "07:09:05" "11:10:00"
data
v1 <- c('8:30:00', '08:30:00', '7:09:00', '7:9:00')
v2 <- c(v1, '7:9:5', '11:10:0')
df1 <- structure(list(Date = c("4-Mar-13", "4-Mar-13"), Time = c("08:30:00",
"8:30:00"), TZ = c("America/Costa_Rica", "America/Costa_Rica"
), Jul.Time = c(2456356.1875, 2456356.1875), BirdID = c("test2",
"test2"), Species = c("GREH", "GREH"), Sex = c("M", "M"), Age = c("AHY",
"AHY"), SiteID = c("56scr25", "56scr25"), Latitude = c(8.71191178,
8.71191178), Longitude = c(-82.96866316, -82.96866316)), .Names = c("Date",
"Time", "TZ", "Jul.Time", "BirdID", "Species", "Sex", "Age",
"SiteID", "Latitude", "Longitude"), class = "data.frame", row.names = c(NA,
-2L))
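As a rough sketch of how this could be applied to the loop in the question, the date and time can be parsed together and formatted once, since strptime-style parsing accepts "8" as well as "08" for %H. The visitedTimes data frame below is a made-up stand-in for the OP's data, and treating the times as UTC simply mirrors the question's habit of appending "Z"; %b also assumes an English (or C) locale for "Mar".

```r
# Hypothetical stand-in for the OP's visitedTimes data frame
visitedTimes <- data.frame(
  Date = c("4-Mar-13", "4-Mar-13"),
  Time = c("08:30:00", "8:30:00"),
  stringsAsFactors = FALSE
)

# Parse date and time in one step; "8:30:00" and "08:30:00" both parse
stamp <- as.POSIXct(
  paste(visitedTimes$Date, visitedTimes$Time),
  format = "%d-%b-%y %H:%M:%S", tz = "UTC"
)

# Format both the time of passage and one second later as ISO 8601
iso <- "%Y-%m-%dT%H:%M:%SZ"
timeOfPassage <- format(stamp, iso)
timeAfter <- format(stamp + 1, iso)

# Build the rgba entries without an explicit while loop
entries <- paste0('"', timeOfPassage, '",225,50,50,196,"',
                  timeAfter, '",50,50,225,196')
cat(paste(entries, collapse = ","))
```

Formatting from a parsed date-time, rather than patching the string, also zero-pads minutes and seconds if they turn out to be dirty as well.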

Related

Creating a function to change a variable type to time

I'm playing around with functions in R and want to create a function that takes a character variable and converts it to a POSIXct.
The time variable currently looks like this:
"2020-01-01T05:00:00.283236Z"
I've successfully converted the time variable in my janviews dataset with the following code:
janviews$time <- gsub('T',' ',janviews$time)
janviews$time <- as.POSIXct(janviews$time, format = "%Y-%m-%d %H:%M:%S", tz = Sys.timezone())
Since I have to perform this on multiple datasets, I want to create a function that will perform this. I created the following function but it doesn't seem to be working and I'm not sure why:
set.time <- function(dat, variable.name){
dat$variable.name <- gsub('T', ' ', dat$variable.name)
dat$variable.name <- as.POSIXct(dat$variable.name, format = "%Y-%m-%d %H:%M:%S", tz = Sys.timezone())
}
Here's the first four rows of the janviews dataset:
structure(list(customer_id = c("S4PpjV8AgTBx", "p5bpA9itlILN",
"nujcp24ULuxD", "cFV46KwexXoE"), product_id = c("kq4dNGB9NzwbwmiE",
"FQjLaJ4B76h0l1dM", "pCl1B4XF0iRBUuGt", "e5DN2VOdpiH1Cqg3"),
time = c("2020-01-01T05:00:00.283236Z", "2020-01-01T05:00:00.895876Z",
"2020-01-01T05:00:01.362329Z", "2020-01-01T05:00:01.873054Z"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x1488180e0>)
Also, if there is a better way to convert my time variable, I am open to changing my method!
I would use the lubridate package and the as_datetime() function.
lubridate::as_datetime("2020-01-01T05:00:00.283236Z")
Returns
"2020-01-01 05:00:00 UTC"
Lubridate Info
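As for why the function in the question does nothing useful: dat$variable.name looks for a column literally named "variable.name", and the function never returns the modified data. A minimal base-R sketch of one possible fix follows; the name set_time is made up here, a plain data.frame is assumed rather than a data.table, and tz = "UTC" is an assumption based on the trailing "Z" (the question used Sys.timezone()).

```r
# Hypothetical helper: convert a character column to POSIXct
set_time <- function(dat, variable_name) {
  # [[ ]] looks up the column whose name is stored in variable_name,
  # whereas dat$variable_name would look for a column with that literal name
  cleaned <- gsub("T", " ", dat[[variable_name]])
  dat[[variable_name]] <- as.POSIXct(cleaned,
                                     format = "%Y-%m-%d %H:%M:%S",
                                     tz = "UTC")
  dat  # return the modified copy so the caller can reassign it
}

janviews <- data.frame(
  time = c("2020-01-01T05:00:00.283236Z", "2020-01-01T05:00:00.895876Z"),
  stringsAsFactors = FALSE
)
janviews <- set_time(janviews, "time")
```

Note that the %S format drops the fractional seconds, just as the code in the question does.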

What is the origin of numeric dates starting with 6?

I have dates stored as numbers, all starting with the digit 6. I know that each x date falls between startDate and endDate.
Example data:
#dput(df1)
df1 <- structure(list(
startDate = structure(c(9748, 11474, 12204, 12204), class = "Date"),
endDate = structure(c(16645, 16535, 13376, 15863), class = "Date"),
x = c(63719L, 63622L, 60448L, 62940L)),
row.names = c(NA, -4L), class = "data.frame")
?as.Date suggests several origins, but none of them works:
as.Date(63719, origin = "1900-01-01")
# [1] "2074-06-16"
as.Date(63719, origin = "1899-12-30")
# [1] "2074-06-14"
as.Date(63719, origin = "1904-01-01")
# [1] "2078-06-15"
as.Date(63719, origin = "1970-01-01")
# [1] "2144-06-16"
Any ideas?
The origin could be MUMPS origin date "1840-12-31", the reason for this date is explained in MUMPS Language faq:
27. "What happened in 1841?"
When I decided on specifications for the date routine, I remembered reading
of the oldest (one of the oldest?) U.S. citizen, a Civil War veteran, who
was 121 years old at the time. Since I wanted to be able to represent dates
in a Julian-type form so that age could be easily calculated and to be able
to represent any birth date in the numeric range selected, I decided that a
starting date in the early 1840s would be 'safe.' Since my algorithm worked
most logically when every fourth year was a leap year, the first year was
taken as 1841. The zero point was then December 30, 1840...
That's the origin of December 31, 1840 or January 1, 1841. I wasn't party
to the MDC negotiations, but I did explain the logic of my choice to members
of the Committee.
Wikipedia System time:
Language/Application | Function or variable | Resolution | Epoch or range
MUMPS | $H (short for $HOROLOG) | 1 s | 31 December 1840
Let's test:
df1$xClean <- as.Date(df1$x, origin = "1840-12-31")
df1$xClean > df1$startDate & df1$xClean < df1$endDate
# [1] TRUE TRUE TRUE TRUE
Note: Thanks to @Frank for pointing me to this blog post, which led me to the original MUMPS FAQ. I posted this as a self-answered Q&A for reference, as searching SO and Google didn't yield much.
You have already found an answer, but for the fun of it, here is a code snippet to automate this:
library(rvest)
library(tidyverse)
library(magrittr)
df1 <- structure(list(
startDate = structure(c(9748, 11474, 12204, 12204), class = "Date"),
endDate = structure(c(16645, 16535, 13376, 15863), class = "Date"),
x = c(63719L, 63622L, 60448L, 62940L)),
row.names = c(NA, -4L), class = "data.frame")
epochs <- read_html("https://en.wikipedia.org/wiki/Epoch_(computing)") %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
html_table() %>%
extract2(1) %>%
set_names(c("epoch", "users", "rationale")) %>%
mutate(epoch_date = parse_date(epoch, "%B %d, %Y", locale = locale("en"))) %>%
filter(!is.na(epoch_date))
potential_origins <- map_lgl(epochs$epoch_date,
function(origin) {
d <- as.Date(df1$x, origin = origin)
all(d >= df1$startDate & d <= df1$endDate)
})
epochs$users[potential_origins]
# [1] "MUMPS programming language"
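The same check can be sketched offline, without scraping, by testing a short hand-picked list of origins. The candidate list below is an assumption chosen for illustration (the four origins tried in the question plus the MUMPS one), not an exhaustive set of known epochs.

```r
df1 <- data.frame(
  startDate = as.Date(c(9748, 11474, 12204, 12204), origin = "1970-01-01"),
  endDate   = as.Date(c(16645, 16535, 13376, 15863), origin = "1970-01-01"),
  x         = c(63719L, 63622L, 60448L, 62940L)
)

# Hand-picked candidate origins (illustrative only)
candidates <- c("1840-12-31", "1899-12-30", "1900-01-01",
                "1904-01-01", "1970-01-01")

# An origin fits if every converted date lies within its start/end window
fits <- vapply(candidates, function(origin) {
  d <- as.Date(df1$x, origin = origin)
  all(d >= df1$startDate & d <= df1$endDate)
}, logical(1))

candidates[fits]
```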

R Error: index is not in increasing order

NOTE: PROBLEM RESOLVED IN THE COMMENTS BELOW
I'm getting the following error when trying to turn a data.frame into an xts object, following the answer found here.
Error in .xts(DA[, 3:6], index = as.POSIXct(DAINDEX, format = "%m/%d/%Y %H:%M:%S", :
index is not in increasing order
I've not been able to find much on this error or how to resolve it, so any help towards that would be greatly appreciated.
The data is daily S&P 500 in a comma delimited format with the following columns: "Date" "Time" "Open" "High" "Low" "Close".
Below is the code:
DA <- read.csv("SNP.csv", header = TRUE, stringsAsFactors = FALSE)
DAINDEX <- paste(DA$Date, DA$Time, sep = " ")
Data.hist <- .xts(DA[,3:6], index = as.POSIXct(DAINDEX, format = "%m/%d/%Y %H:%M:%S", tzone = "GMT"))
As requested, some lines of the data
structure(list(Date = c("5/20/2016", "5/19/2016", "5/18/2016",
"5/17/2016", "5/16/2016", "5/13/2016"), Time = c("0:00:00", "0:00:00",
"0:00:00", "0:00:00", "0:00:00", "0:00:00"), Open = c(2041.880005,
2044.209961, 2044.380005, 2065.040039, 2046.530029, 2062.5),
High = c(2058.350098, 2044.209961, 2060.610107, 2065.689941,
2071.879883, 2066.790039), Low = c(2041.880005, 2025.910034,
2034.48999, 2040.819946, 2046.530029, 2043.130005), Close = c(2052.320068,
2040.040039, 2047.630005, 2047.209961, 2066.659912, 2046.609985
)), .Names = c("Date", "Time", "Open", "High", "Low", "Close"
), row.names = c(NA, 6L), class = "data.frame")
The above is the output of dput(head(DA))
The easiest thing to do is use the regular xts constructor instead of .xts. It will check if the index is sorted correctly, and sort the index and data, if necessary.
Data.hist <- xts(DA[,3:6], as.POSIXct(DAINDEX, format = "%m/%d/%Y %H:%M:%S", tz = "GMT"))
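To see why the error appears in the first place, here is a small base-R sketch using three rows from the sample data: the file is newest-first, so the parsed index is decreasing. Sorting by hand, as shown, is only an illustration of the kind of reordering the answer describes; it is not a claim about how xts does it internally.

```r
# Three rows from the sample data, newest first as in the file
DA <- data.frame(
  Date  = c("5/20/2016", "5/19/2016", "5/18/2016"),
  Time  = c("0:00:00", "0:00:00", "0:00:00"),
  Close = c(2052.320068, 2040.040039, 2047.630005),
  stringsAsFactors = FALSE
)

idx <- as.POSIXct(paste(DA$Date, DA$Time),
                  format = "%m/%d/%Y %H:%M:%S", tz = "GMT")

# TRUE: at least one step goes backwards in time
has_decreasing_step <- any(diff(as.numeric(idx)) < 0)

# Reorder both the index and the data so time increases
ord <- order(idx)
idx_sorted <- idx[ord]
DA_sorted <- DA[ord, ]
```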

R convert YYMMDD to date

I have data in YYMMDDHH format and am trying to get the weekday, so I need to convert it to a date first, but I can't figure out how.
Here's a dput of the relevant data:
structure(list(id = c(7927751403363142656, 18236986451472797696,
5654946373641778176, 14195690822403907584, 1693303484298446848,
1.1362181921561e+19, 11694645532962195456, 1221431312630614784,
1987127670789791488, 379819848497418688), hour = c(14102118L,
14102217L, 14102812L, 14102912L, 14102820L, 14102401L, 14102117L,
14102312L, 14102301L, 14102414L)), .Names = c("id", "hour"), row.names = c(3620479L,
8510796L, 29632625L, 34450879L, 31874113L, 13420799L, 3332671L,
11543560L, 9602012L, 15574701L), class = "data.frame")
When I use:
dat2$dow <- as.Date(substr(as.character(dat2$hour), 1,6), format = '%Y%m%d')
I just get NA's. Any suggestions?
"%Y" is for 4-digit years; "%y" is for 2-digit years. And you don't need to use substr. as.Date will ignore anything after the end of the specified format.
dat2$dow <- as.Date(as.character(dat2$hour), format='%y%m%d')
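Since the goal in the question was the weekday, a short sketch of that last step follows, using three of the sample values. The variable names are made up for illustration; as.POSIXlt()$wday is used because it gives a locale-independent number, while weekdays() returns names in the current locale.

```r
hour <- c(14102118L, 14102217L, 14102812L)

# %y reads the two-digit year; the trailing hour digits are ignored
d <- as.Date(as.character(hour), format = "%y%m%d")

# Numeric weekday: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(d)$wday

# Weekday names, in the language of the current locale
dow <- weekdays(d)
```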

R read.zoo error for incorrect date format

I have a dataset with one date column and 10 other columns.
The date column has values such as 199010, so the format is yyyymm.
It seems that zoo/xts requires the date to include the day.
Is there any way to address this issue?
Here is my data:
structure(list(Date = 198901:198905, NoDur = c(5.66, -1.44, 5.51,
5.68, 5.32)), .Names = c("Date", "NoDur"), class = "data.frame", row.names = c(NA,
5L))
data<-read.zoo("C:/***/data_port.csv",sep=",",format="%Y%m",header=TRUE,index.column=1,colClasses=c("character",rep("numeric",1)))
The code has these problems:
the data is space separated but the code specifies that it is comma separated
the data does not describe dates since there is no day but the code is using the default of dates
the data is not provided in reproducible form. Note how one can simply copy the data and code below and paste it into R without any additional work.
Try this:
Lines <- "Date NoDur
198901 5.66
198902 -1.44
198903 5.51
198904 5.68
198905 5.32
"
library(zoo)
read.zoo(text = Lines, format = "%Y%m", FUN = as.yearmon, header = TRUE,
colClasses = c("character", NA))
The above converts the index to "yearmon" class, which probably makes the most sense here, but it would alternatively be possible to convert it to "Date" class by using FUN = function(x, format) as.Date(as.yearmon(x, format)) in place of the FUN argument above.
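If only base R is available, one alternative (not the zoo approach above) is to append a day of "01" to each yyyymm value so that as.Date has a complete date to parse. Pinning every month to its first day is an assumption made here for illustration.

```r
Date <- 198901:198905

# Append "01" so each value becomes a full yyyymmdd date
d <- as.Date(paste0(Date, "01"), format = "%Y%m%d")

format(d)
```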
