What is the origin of numeric dates starting with 6? - r

I have dates stored as numbers, all starting with the digit 6. I know that each x date falls between startDate and endDate.
Example data:
#dput(df1)
df1 <- structure(list(
startDate = structure(c(9748, 11474, 12204, 12204), class = "Date"),
endDate = structure(c(16645, 16535, 13376, 15863), class = "Date"),
x = c(63719L, 63622L, 60448L, 62940L)),
row.names = c(NA, -4L), class = "data.frame")
?as.Date suggests several origins, but none of them works:
as.Date(63719, origin = "1900-01-01")
# [1] "2074-06-16"
as.Date(63719, origin = "1899-12-30")
# [1] "2074-06-14"
as.Date(63719, origin = "1904-01-01")
# [1] "2078-06-15"
as.Date(63719, origin = "1970-01-01")
# [1] "2144-06-16"
Any ideas?

The origin could be the MUMPS epoch, "1840-12-31". The reason for this date is explained in the MUMPS Language FAQ:
27. "What happened in 1841?"
When I decided on specifications for the date routine, I remembered reading
of the oldest (one of the oldest?) U.S. citizen, a Civil War veteran, who
was 121 years old at the time. Since I wanted to be able to represent dates
in a Julian-type form so that age could be easily calculated and to be able
to represent any birth date in the numeric range selected, I decided that a
starting date in the early 1840s would be 'safe.' Since my algorithm worked
most logically when every fourth year was a leap year, the first year was
taken as 1841. The zero point was then December 30, 1840...
That's the origin of December 31, 1840 or January 1, 1841. I wasn't party
to the MDC negotiations, but I did explain the logic of my choice to members
of the Committee.
Wikipedia, System time:
| Language/Application | Function or variable | Resolution | Epoch or range |
|---|---|---|---|
| MUMPS | $H (short for $HOROLOG) | 1 s | 31 December 1840 |
Let's test:
df1$xClean <- as.Date(df1$x, origin = "1840-12-31")
df1$xClean > df1$startDate & df1$xClean < df1$endDate
# [1] TRUE TRUE TRUE TRUE
Note: Thanks to @Frank for pointing me to this blog post, which led me to the original MUMPS FAQ. I posted this as a self-answered Q&A for reference, as searching SO and Google didn't yield much.
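The check above can also be turned around: instead of testing candidate origins one by one, the known bounds can be used to narrow down where the origin must lie. The following is a minimal sketch of that idea, assuming x counts whole days and that the bounds are inclusive; the names lower, upper and mumps are introduced here for illustration only.

```r
# Rebuild the example data from the question.
df1 <- data.frame(
  startDate = as.Date(c(9748, 11474, 12204, 12204), origin = "1970-01-01"),
  endDate   = as.Date(c(16645, 16535, 13376, 15863), origin = "1970-01-01"),
  x         = c(63719L, 63622L, 60448L, 62940L)
)

# origin + x must fall in [startDate, endDate] for every row,
# so origin must fall in [startDate - x, endDate - x] for every row.
lower <- max(df1$startDate - df1$x)  # latest of the per-row lower bounds
upper <- min(df1$endDate - df1$x)    # earliest of the per-row upper bounds

# Any plausible origin has to lie inside this window.
print(c(lower, upper))

# Assumption under test: the MUMPS epoch is a candidate.
mumps <- as.Date("1840-12-31")
mumps_ok <- mumps >= lower & mumps <= upper
print(mumps_ok)
```

If the window is narrow, as it is for this data, a list of well-known epochs only needs to be checked against two dates.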

You have already found an answer, but for the fun of it, here is a code snippet to automate this:
library(rvest)
library(tidyverse)
library(magrittr)
df1 <- structure(list(
startDate = structure(c(9748, 11474, 12204, 12204), class = "Date"),
endDate = structure(c(16645, 16535, 13376, 15863), class = "Date"),
x = c(63719L, 63622L, 60448L, 62940L)),
row.names = c(NA, -4L), class = "data.frame")
epochs <- read_html("https://en.wikipedia.org/wiki/Epoch_(computing)") %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
html_table() %>%
extract2(1) %>%
set_names(c("epoch", "users", "rationale")) %>%
mutate(epoch_date = parse_date(epoch, "%B %d, %Y", locale = locale("en"))) %>%
filter(!is.na(epoch_date))
potential_origins <- map_lgl(epochs$epoch_date,
function(origin) {
d <- as.Date(df1$x, origin = origin)
all(d >= df1$startDate & d <= df1$endDate)
})
epochs$users[potential_origins]
# [1] "MUMPS programming language"

Related

Can't remove empty `character(0)` or `list()` values from R data frame

I have an R data frame that has character(0) and list() values inside the cells. I want to replace these with NA values.
In the following example, the field "teaser" has this issue, but it can be anywhere in the data frame.
df <- structure(list(body = "BAKER TO VEGAS 2022The Office fielded two squads this year in the 36th Annual Baker to Vegas (“B2V”) Challenge Cup Relay on April 9-10. Members of our 2022 B2V Team include many staff and AUSAs who were joined by office alums and a cadre of friends and family who helped out during some rather brutal conditions this year with temperatures around 100 degrees for much of the days. Most importantly, everyone had fun… and nobody got hurt! It was a great opportunity to meet (and run past) various members of our law enforcement community and to see the amazing logistics of the yearly event. Congratulations to all the participants.",
changed = structure(19156, class = "Date"), created = structure(19156, class = "Date"),
date = structure(19090, class = "Date"), teaser = "character(0)",
title = "Baker to Vegas 2022", url = "https://www.justice.gov/usao-cdca/blog/baker-vegas-2022",
uuid = "cd7e1023-c3ed-4234-b8af-56d342493810", vuuid = "8971702d-6f96-4bbd-ba8c-418f9d32a486",
name = "USAO - California, Central,"), row.names = 33L, class = "data.frame")
I've tried numerous things that don't work, including the following:
df <- na_if(df, "character(0)")
Error in charToDate(x) :
character string is not in a standard unambiguous format
Thanks for your help.
We could use
library(dplyr)
df %>%
mutate(across(where(is.character), ~ na_if(.x, "character(0)")))
Here is a base R way:
create a logical index that is TRUE for the columns of class "character";
create a list of logical indices on those columns with lapply;
with Map, change the bad values to NA.
i_chr <- sapply(df, is.character)
inx_list <- lapply(df[i_chr], \(x) x == "character(0)")
df[i_chr] <- Map(\(x, i) {is.na(x) <- i; x}, df[i_chr], inx_list)
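The error in the question probably arises because na_if() was applied to the whole data frame, so the string was also compared against the Date columns and R tried to parse "character(0)" as a date. Restricting the replacement to character columns avoids that. Below is a minimal base R sketch that handles both placeholder strings mentioned in the question; df_toy and placeholders are hypothetical names, and the toy data only imitates the shape of the real data frame.

```r
# Hypothetical toy data: the placeholder strings stand in for the scraped values.
df_toy <- data.frame(
  title  = c("Baker to Vegas 2022", "Another post"),
  teaser = c("character(0)", "A real teaser"),
  tags   = c("list()", "news"),
  date   = as.Date(c("2022-04-08", "2022-06-13")),
  stringsAsFactors = FALSE
)

# Strings that should be treated as missing (assumed to be these two only).
placeholders <- c("character(0)", "list()")

# Touch character columns only, so Date columns are never compared to a string.
is_chr <- vapply(df_toy, is.character, logical(1))
df_toy[is_chr] <- lapply(df_toy[is_chr], function(col) {
  col[col %in% placeholders] <- NA_character_
  col
})

print(df_toy)
```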

Rstudio Time sum calculation

How can I sum the total time per driver in R? Can someone help me?
[Image: total time]
[Image: preferred end result]
One recommendation to make: please do not use images to share data. Instead, use dput() of your data frame. See this post on making a reproducible example on SO.
One approach to this involves the tidyverse and lubridate packages (I am sure there are other solutions).
First, I would put your data into long form instead of wide. The times are then converted from %H:%M:%OS (with milliseconds) to durations since midnight.
Then, for each driver, these times are summed up, and results are provided in different formats:
total_time1 - total number of seconds (with decimal places)
total_time2 - number of minutes (M) and number of decimal seconds (S)
total_time3 - total time in %M:%OS format (minutes and decimal seconds)
Edit: In addition, I have added two columns based on OP request:
total_time_minutes - total number of minutes (with decimal places)
avg_speed - average speed in km/h, assuming a total distance of 27004.65 meters
I hope this is helpful. Please let me know.
library(tidyverse)
library(lubridate)
df %>%
pivot_longer(cols = -lap) %>%
mutate(lap_time = as.numeric(as.POSIXct(value, format = "%H:%M:%OS", tz = "UTC")) -
as.numeric(as.POSIXct(Sys.Date(), tz = "UTC"))) %>%
group_by(name) %>%
summarise(total_time1 = sum(lap_time)) %>%
mutate(total_time2 = seconds_to_period(total_time1),
total_time3 = sprintf("%d:%.4f", minute(total_time2), second(total_time2)),
total_time_minutes = total_time1/60,
avg_speed = 3.6 * 27004.65/total_time1) %>%
as.data.frame()
Output
name total_time1 total_time2 total_time3 total_time_minutes avg_speed
1 Bottas 319.782 5M 19.7815999984741S 5:19.7816 5.32969 304.010
2 Hamilton 320.320 5M 20.3204002380371S 5:20.3204 5.33867 303.498
3 Leclerc 319.981 5M 19.98140001297S 5:19.9814 5.33302 303.820
4 Verstappen 318.220 5M 18.219899892807S 5:18.2199 5.30366 305.502
5 Vettel 318.625 5M 18.6247997283936S 5:18.6248 5.31041 305.114
Data
df <- structure(list(lap = 1:5, Bottas = c("00:01:04.9388", "00:01:03.7164",
"00:01:04.0028", "00:01:03.3424", "00:01:03.7812"), Hamilton = c("00:01:04.5280",
"00:01:03.7524", "00:01:03.9632", "00:01:04.3712", "00:01:03.7056"
), Leclerc = c("00:01:04.9812", "00:01:03.7740", "00:01:04.6026",
"00:01:03.3920", "00:01:03.2316"), Verstappen = c("00:01:04.1704",
"00:01:03.7383", "00:01:03.7128", "00:01:02.8460", "00:01:03.7524"
), Vettel = c("00:01:04.3632", "00:01:02.8244", "00:01:03.7164",
"00:01:03.8532", "00:01:03.8676")), class = "data.frame", row.names = c(NA,
-5L))
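As an alternative to parsing the lap times as date-times and subtracting midnight, the strings can be split on ":" and converted to seconds directly, which does not depend on the current date or time zone. This is a minimal base R sketch, assuming every value has exactly three fields (hours, minutes, seconds); to_seconds is a hypothetical helper name, and only one driver's laps from the data above are used.

```r
# Convert "HH:MM:SS.ssss" strings to seconds (assumes exactly three ":"-separated fields).
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Lap times for one driver, copied from the data above.
bottas <- c("00:01:04.9388", "00:01:03.7164", "00:01:04.0028",
            "00:01:03.3424", "00:01:03.7812")

total_seconds <- sum(to_seconds(bottas))
total_minutes <- total_seconds / 60
print(total_seconds)
print(total_minutes)
```

The total can be compared against total_time1 in the output above.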

Convert date to month/year format for time series

I have some water quality sample data.
> dput(GrowingArealog90s[1:10,])
structure(list(SampleDate = structure(c(6948, 6949, 6950, 7516,
7517, 7782, 7783, 7784, 8092, 8106), class = "Date"), Flog90 = c(1.51851393987789,
1.48970743802793, 1.81243963000062, 0.273575501327576, 0.874218895695207,
1.89762709129044, 1.44012088794774, 0.301029995663981, 1.23603370361931,
0.301029995663981)), .Names = c("SampleDate", "Flog90"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
This data is collected monthly, although some months are missed over the 25 year period.
I know there is so much help out there for converting dates to different formats but I have not been able to figure this out. I want to create a time series with just a month/year format, so that I can do things like decompose the data by month and run seasonal kendalls and such. I have tried so many different ways of converting my date to the desired format that I have completely confused myself. I don't care about the exact format as long as it is recognized month/year.
I also need to fill in the missing months with NAs.
I tried uploading the "SampleDate" column in a numeric format, "yyyymm". I could then merge that data frame with another that contained all the dates I need.
GA90 <- merge(Dates, GrowingArealog90s, by.x = "Date", by.y = "Date", all.x = TRUE)
However, when I converted the resulting data frame to a time series it would not recognize the 12 month frequency.
GA90ts <- as.ts(GA90, frequency(12))
> GA90ts
Time Series:
Start = 1
End = 324
Frequency = 1
Any help with this is appreciated.
Here's how to do it with zoo. You'll get a warning, but it's OK for now. You'll get a series with mon/yy.
series <-structure(list(SampleDate = structure(c(6948, 6949, 6950, 7516,
7517, 7782, 7783, 7784, 8092, 8106), class = "Date"), Flog90 = c(1.51851393987789,
1.48970743802793, 1.81243963000062, 0.273575501327576, 0.874218895695207,
1.89762709129044, 1.44012088794774, 0.301029995663981, 1.23603370361931,
0.301029995663981)), .Names = c("SampleDate", "Flog90"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
library(zoo)
series <-as.data.frame(series) #to drop dplyr class
series.zoo <-zoo(series[,-1,drop=FALSE],as.yearmon(series[,1]))
Best practice would be to keep your series with actual dates and use as.yearmon only when you actually need to make calculations or aggregate by month and year (aggregate.zoo).
The following is a matter of taste, but I've dealt with a lot of time series and I think zoo is superior to ts and xts. Much more flexible.
Now, to fill in missing values, you have to create a vector of dates. Here, I'm using a zoo object with actual dates. I then use na.locf, which is "last observation carry forward". You could also look at na.approx.
series.zoo <-zoo(series[,-1,drop=FALSE],(series[,1]))
my.seq <-seq(min(series[,1]), max(series[,1]), by="month") #first and last sample dates as Date values
merged <-merge.zoo(series.zoo,zoo(,my.seq))
na.locf(merged)
UPDATE
With aggregate.
GrowingArealog90s <-structure(list(SampleDate = structure(c(6948, 6949, 6950, 7516,
7517, 7782, 7783, 7784, 8092, 8106), class = "Date"), Flog90 = c(1.51851393987789,
1.48970743802793, 1.81243963000062, 0.273575501327576, 0.874218895695207,
1.89762709129044, 1.44012088794774, 0.301029995663981, 1.23603370361931,
0.301029995663981)), .Names = c("SampleDate", "Flog90"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
library(zoo);library(xts)
GrowingArealog90s <-as.data.frame(GrowingArealog90s) #to remove dplyr format
GrowingArealog90s.zoo <-zoo(GrowingArealog90s[,-1,drop=FALSE],as.Date(GrowingArealog90s[,1]))
#First aggregate by month. I chose to get the mean per month
GrowingArealog90s.agg <-aggregate(GrowingArealog90s.zoo, as.yearmon, mean) #replace mean with last to get last reading of the month
#Then create a sequence of months and merge it
my.seq <-seq.Date(first(GrowingArealog90s[,1]), last(GrowingArealog90s[,1]),by="month")
merged <-merge.zoo(GrowingArealog90s.agg ,zoo(,as.yearmon(my.seq)))
na.locf(merged)
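If a plain ts object with a 12-month frequency is what is needed (the question's as.ts call did not set one), the same steps can be done in base R: aggregate to one value per month, merge with the full sequence of months so missing months become NA, and pass frequency = 12 to ts(). This is a minimal sketch under the assumption that the monthly mean is an acceptable summary; dat, monthly, all_months and full are illustrative names, and the values are rounded copies of the sample data.

```r
# Sample data from the question (values rounded for brevity).
dat <- data.frame(
  SampleDate = as.Date(c(6948, 6949, 6950, 7516, 7517, 7782, 7783, 7784, 8092, 8106),
                       origin = "1970-01-01"),
  Flog90 = c(1.5185, 1.4897, 1.8124, 0.2736, 0.8742,
             1.8976, 1.4401, 0.3010, 1.2360, 0.3010)
)

# One value per calendar month (assumption: the mean is the wanted summary).
dat$month <- format(dat$SampleDate, "%Y-%m")
monthly <- aggregate(Flog90 ~ month, data = dat, FUN = mean)

# Every month between the first and last sample, as "YYYY-MM" keys.
all_months <- format(
  seq(as.Date(paste0(min(dat$month), "-01")),
      as.Date(paste0(max(dat$month), "-01")),
      by = "month"),
  "%Y-%m"
)

# Left join: months without samples get NA. merge() sorts by the key.
full <- merge(data.frame(month = all_months), monthly, all.x = TRUE)

# Build the monthly time series; start is c(year, month).
start_ym <- as.integer(strsplit(all_months[1], "-")[[1]])
GA90ts <- ts(full$Flog90, start = start_ym, frequency = 12)
print(GA90ts)
```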

R time formatting with dirty data

I'm using R to generate a CZML file from a database.
The database has dirty data.
I need a way to make sure times are in the format "%H:%M:%S".
The data can already be in the correct %H:%M:%S format or be missing the leading zero of the hour, e.g. 8:30:00, which is invalid ISO 8601 and throws the CZML parsing off entirely.
It needs to always be like so 08:30:00 or 07:09:00 in the 24h format.
The errors occur because some times look like 8:30:00 or 7:09:00 (still in 24h format). I haven't checked whether the minutes or seconds are malformed too, but for the moment I assume they are correct and that the only problem is the hours.
For example, I have a csv file like this:
"Date","Time","TZ","Jul.Time","BirdID","Species","Sex","Age","SiteID","Latitude","Longitude"
"4-Mar-13","08:30:00","America/Costa_Rica",2456356.187500,"test2","GREH","M","AHY","56scr25",8.71191178,-82.96866316
"4-Mar-13","8:30:00","America/Costa_Rica",2456356.187500,"test2","GREH","M","AHY","56scr25",8.71191178,-82.96866316
I need to generate a CZML like so:
"point": {
"color": {
"rgba": [
"2013-03-04T08:30:00Z",225,50,50,196,"2013-03-04T08:30:01Z",50,50,225,196,"2013-03-04T13:30:00Z",225,50,50,196,"2013-03-04T13:30:01Z",50,50,225,196,"2013-03-04T16:00:00Z",225,50,50,196,"2013-03-04T16:00:01Z",50,50,225,196
]
},
"pixelSize": { "number": 10 }
}
My code is like so:
j=1
numVisits=nrow(visitedTimes)
while(j<=numVisits){
date=as.Date(visitedTimes$Date[j], format="%d-%b-%y")
time=format(visitedTimes$Time[j], format="%H:%M:%S")
timeOfPassage=paste0(date,"T",time,"Z")
timeAfter=as.POSIXlt(timeOfPassage, format="%Y-%m-%dT%H:%M:%SZ")
timeAfter$sec=timeAfter$sec+1
timeAfter=format(timeAfter, format="%Y-%m-%dT%H:%M:%SZ")
cat(paste0("\"",timeOfPassage,"\","))
cat("225,50,50,196,")
cat(paste0("\"",timeAfter,"\","))
cat("50,50,225,196")
if(j<numVisits){
cat(",")
}
j=j+1
}
But it doesn't produce the desired output because of the dirty data.
Any ideas?
We can use times from chron
library(chron)
times(v1)
#[1] 08:30:00 08:30:00 07:09:00 07:09:00
Or using base R
format(strptime(v2, '%H:%M:%S'), '%H:%M:%S')
#[1] "08:30:00" "08:30:00" "07:09:00" "07:09:00" "07:09:05" "11:10:00"
Using the OP's updated dataset
df1$Time <- times(df1$Time)
df1$Time
#[1] 08:30:00 08:30:00
Or using regex
sub('^(.:)', '0\\1', df1$Time)
gsub('[^:]{2}(*SKIP)(*F)|(\\d)', '0\\1', v2, perl=TRUE)
#[1] "08:30:00" "08:30:00" "07:09:00" "07:09:00" "07:09:05" "11:10:00"
data
v1 <- c('8:30:00', '08:30:00', '7:09:00', '7:9:00')
v2 <- c(v1, '7:9:5', '11:10:0')
df1 <- structure(list(Date = c("4-Mar-13", "4-Mar-13"), Time = c("08:30:00",
"8:30:00"), TZ = c("America/Costa_Rica", "America/Costa_Rica"
), Jul.Time = c(2456356.1875, 2456356.1875), BirdID = c("test2",
"test2"), Species = c("GREH", "GREH"), Sex = c("M", "M"), Age = c("AHY",
"AHY"), SiteID = c("56scr25", "56scr25"), Latitude = c(8.71191178,
8.71191178), Longitude = c(-82.96866316, -82.96866316)), .Names = c("Date",
"Time", "TZ", "Jul.Time", "BirdID", "Species", "Sex", "Age",
"SiteID", "Latitude", "Longitude"), class = "data.frame", row.names = c(NA,
-2L))
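Because strptime tolerates a missing leading zero, the date and time columns can also be parsed together and formatted straight into the timestamps the CZML needs, without a loop. This is a minimal sketch, assuming the times are meant to be written with a literal Z suffix as in the question's code, and that the session locale uses English month abbreviations (needed for %b); visits, stamp, iso_now and iso_after are illustrative names.

```r
# Two rows shaped like the CSV in the question, one with a dirty hour.
visits <- data.frame(
  Date = c("4-Mar-13", "4-Mar-13"),
  Time = c("08:30:00", "8:30:00"),
  stringsAsFactors = FALSE
)

# Parse date and time in one step; "8:30:00" and "08:30:00" parse identically.
stamp <- as.POSIXct(paste(visits$Date, visits$Time),
                    format = "%d-%b-%y %H:%M:%S", tz = "UTC")

# Zero-padded ISO 8601 strings for the visit and for one second later.
iso_now   <- format(stamp,     "%Y-%m-%dT%H:%M:%SZ")
iso_after <- format(stamp + 1, "%Y-%m-%dT%H:%M:%SZ")
print(iso_now)
print(iso_after)
```

Rows that fail to parse come back as NA, which makes remaining dirty values easy to spot before the file is written.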

R convert YYMMDD to date

I have data in YYMMDDHH format and am trying to get the weekday, so I need to convert it to a date format, but I can't figure out how.
Here's a dput of the relevant data:
structure(list(id = c(7927751403363142656, 18236986451472797696,
5654946373641778176, 14195690822403907584, 1693303484298446848,
1.1362181921561e+19, 11694645532962195456, 1221431312630614784,
1987127670789791488, 379819848497418688), hour = c(14102118L,
14102217L, 14102812L, 14102912L, 14102820L, 14102401L, 14102117L,
14102312L, 14102301L, 14102414L)), .Names = c("id", "hour"), row.names = c(3620479L,
8510796L, 29632625L, 34450879L, 31874113L, 13420799L, 3332671L,
11543560L, 9602012L, 15574701L), class = "data.frame")
When I use:
dat2$dow <- as.Date(substr(as.character(dat2$hour), 1,6), format = '%Y%m%d')
I just get NA's. Any suggestions?
"%Y" is for 4-digit years; "%y" is for 2-digit years. And you don't need to use substr. as.Date will ignore anything after the end of the specified format.
dat2$dow <- as.Date(as.character(dat2$hour), format='%y%m%d')
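Since the goal was the weekday, one more step is needed after the conversion. The following minimal sketch uses the first three values from the question; hour, d and wday_num are illustrative names. weekdays() returns names in the session's locale, so a numeric weekday is shown alongside it for code that must not depend on the locale.

```r
# First three values of the hour column from the question (YYMMDDHH).
hour <- c(14102118L, 14102217L, 14102812L)

# "%y%m%d" consumes the first six digits; the trailing hour digits are ignored.
d <- as.Date(as.character(hour), format = "%y%m%d")
print(d)

# Weekday name (locale dependent).
print(weekdays(d))

# Locale-independent weekday number: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday_num <- as.POSIXlt(d)$wday
print(wday_num)
```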
