Retrieve time zone based on locale country information (OS-independent)

Yet another date/time related question ;-)
Before you aim and shoot
Things are kind of messed up with a Germany + MS Windows + R combination as the following yields an invalid time zone:
> Sys.timezone()
[1] "MST"
Warning message:
In as.POSIXlt.POSIXct(Sys.time()) : unknown timezone 'MET-1MST'
That's definitely not R's fault, it's Windows. Hence the question in the first place ;-)
Question
Is there an easy/alternative and OS-independent way to query your current country via locale info and then look up the corresponding time zone (format "<country>/<city>", e.g. "Europe/Berlin" for Germany)?
I should also add that I'd like the solution to be independent of internet resources such as the one stated in this post/answer.
The problem context
Suppose you don't know how to specify your time zone yet. You might have heard something about CET/CEST etc, but AFAIK that doesn't really get you anywhere when using base R functionality (at least being located in Germany ;-)).
You can get a list of available "<country>/<city>" pairs from the /share/zoneinfo/zone.tab file in your RHOME directory. Yet, in order to find the time zone corresponding to the current country you're in, you need to know the ISO country code.
Of course we usually do for our native country, but let's suppose we don't (I'd like to end up with a generic approach). What do you do next?
Below is my "four-step" solution, but I'm not really happy with it because
it relies on yet another contrib package (ISOcodes)
I can't test if it works for other locales as I don't know what the info actually would look like if you're in India, Russia, Australia etc.
Anyone got a better idea? Also, it'd be great if some of you in countries other than Germany could run this through and post your locale info from Sys.getlocale().
Step 1: get locale info
loc <- strsplit(unlist(strsplit(Sys.getlocale(), split=";")), split="=")
foo <- function(x) {
  out <- list(x[2])
  names(out) <- x[1]
  out
}
loc <- sapply(loc, foo)
> loc
$LC_COLLATE
[1] "German_Germany.1252"
$LC_CTYPE
[1] "German_Germany.1252"
$LC_MONETARY
[1] "German_Germany.1252"
$LC_NUMERIC
[1] "C"
$LC_TIME
[1] "German_Germany.1252"
Step 2: get country name from locale info
country.this <- unlist(strsplit(loc$LC_TIME, split="_|\\."))[2]
> country.this
[1] "Germany"
Step 3: get ISO country code
Use country.this to look up the associated country code in data set ISO_3166_1 of package ISOcodes
require("ISOcodes")
data("ISO_3166_1")
iso <- ISO_3166_1
idx <- which(iso$Name %in% country.this)
code <- iso[idx, "Alpha_2"]
> code
[1] "DE"
Step 4: get time zone
Use code to look up the time zone in the data frame that can be derived from file RHOME/share/zoneinfo/zone.tab
path <- file.path(Sys.getenv("R_HOME"), "share/zoneinfo/zone.tab")
tzones <- read.delim(
  path,
  row.names = NULL,
  header = FALSE,
  col.names = c("country", "coords", "name", "comments"),
  as.is = TRUE,
  fill = TRUE,
  comment.char = "#"
)
> tzones[which(tzones$country == code), "name"]
[4] "Europe/Berlin"

Specifically regarding your question:
Is there an easy/alternative and OS-independent way to query your current country via locale info and then look up the corresponding time zone?
No - there is not. This is because several countries have multiple time zones, so one cannot determine the time zone from the country alone.
This is why TZDB identifiers are in the form of Area/Location, rather than just a list of country codes.
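To illustrate with the zone.tab data frame built in step 4 of the question (a sketch; the exact entries depend on your tzdata version), a single country code can map to several zones:
tzones[tzones$country == "AU", "name"]
# e.g. "Australia/Lord_Howe" "Australia/Sydney" "Australia/Melbourne" "Australia/Perth" ...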

Some simplifications to your workflow.
You can retrieve just the time part of the locale using
Sys.getlocale("LC_TIME")
which avoids the need to split strings.
The lubridate package contains a function to retrieve Olson-style time zone names, so you don't have to worry about reading and parsing zone.tab.
library(lubridate)
olson_time_zones()
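In recent versions of R, base R itself also exposes this list, so even lubridate is optional for this step:
OlsonNames()   # character vector of Area/Location names, e.g. "Europe/Berlin"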

Related

As.POSIXct() is giving difference of one second for dates before 1970-01-01

In RStudio I am getting the following:
> as.POSIXct("1970-01-01 18:30:00.001", origin = "1900-01-01")
[1] "1970-01-01 18:30:00.000 IST"
> as.POSIXct("1969-01-01 18:30:00.001", origin = "1900-01-01")
[1] "1969-01-01 18:30:01.000 IST" ## extra second got added here
> as.POSIXct("1969-01-01 18:30:01.001", origin = "1900-01-01")
[1] "1969-01-01 18:30:02.000 IST" ## extra second got added here
I am doing the same thing in Rcpp and I am getting the same result:
// [[Rcpp::export]]
Rcpp::Datetime rcppdatetime() {
  Rcpp::Datetime dt("1969-01-01 18:30:00.001");
  return(dt);
}
/*** R
rcppdatetime()
*/
"1969-01-01 18:30:01.000 IST"
This is expected, as the Rcpp::Datetime object is of POSIXct type.
I need help in the following regards:
1) How do I correct this second value for dates before the year 1970?
2) I am facing a similar error with the microseconds representation. I went through this thread:
https://github.com/RcppCore/Rcpp/issues/899
Can someone point me to the documentation in R/Rcpp where this is mentioned as an R constraint? I am using mingw 8.1.0 to compile my application, so I am not sure how C++11-specific code will help here.
3) I checked this thread:
How R formats POSIXct with fractional seconds
But it provides the output in character format. I need the output in POSIXct form so that the end user can format it as they want or do further processing.
4) I want to do this in Rcpp, as most of my application code is in .cpp files (we have a DLL which gets loaded). Since I am putting all date-time objects in a DatetimeVector, doing it in R instead would mean going through the entire vector once again. Are there any links which can help?
5) Is there any other package/interface available which I can use in my .cpp files to achieve it there itself?

Search for timestamp anomalies in a dataframe R

I'm working with some GTFS data from Berlin and I am hitting a wall here right now.
There is a stop_times.txt file for all bus stops in Berlin with 5 million rows.
Two columns (Arrival_time and Departure_time) contain anomalies, such as
Arrival_time: 112:30:0 instead of the regular format 11:20:30.
I don't really know how to extract those specific lines and erase them from the dataset. I can't come up with an algorithm which is able to detect them. I tried to go by the length of the strings (00:00:00 should be 8 characters), but the erroneous ones are also 8 characters long.
Do you know a simple way to make sure that the format is always xx:xx:xx and delete all others?
Thanks...
Edit :
So, after trying the solution suggested below, it didn't work for me because it would only tell me how many rows were malformed, not where they are or how I could delete them.
My idea is basically now:
Find every timestamp which does not correspond to this exact format:
'00:00:00', where it has to be of length 8 and 2 digits separated by ':'. Is there a way to detect anomalies within this pattern and then delete them? I really don't know how to fix this issue anymore.
Thanks
lubridate is such a useful package that I can't remember how I ever managed without it.
require(lubridate)
times <- c("112:30:0", "11:20:30")
datetimes <- paste("01.01.2018", times)
parsed.datetimes <- lubridate::dmy_hms(datetimes)
#[1] NA "2018-01-01 11:20:30 UTC"
#Warning message:
# 1 failed to parse.
This function will automatically tell you when parsing has failed. The only catch is that it takes a datetime format as input instead of just times, but you can easily get around that as shown above.
In order to know exactly which ones have failed to parse, you can then apply:
failed.list <- which(is.na(parsed.datetimes))
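If you prefer to check the exact pattern described in the question instead, a plain regular expression will also find (and let you drop) the offending rows. A sketch, assuming the data frame is called stop_times and the column is Arrival_time (adjust the names to your data):
# rows whose Arrival_time is not exactly two digits, colon, two digits, colon, two digits
bad <- !grepl("^\\d{2}:\\d{2}:\\d{2}$", stop_times$Arrival_time)
which(bad)                          # where the anomalies are
stop_times <- stop_times[!bad, ]    # delete them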

SPARQL: Date conversion

In the R package SPARQL, xsd:date datatypes are by default converted into Unix time. This is a problem because it involves two date transformations - the first taking place within the function SPARQL() - both of which are determined by the local system time zone. If you are, let's say, in Sydney, Australia (Sys.timezone() == "Australia/Sydney"), the following query, requesting the date of the 2016 US presidential election,
query <- "SELECT ?date WHERE {wd:Q699872 wdt:P585 ?date}"
res <- SPARQL('https://query.wikidata.org/sparql', query)
as.POSIXct(res$results$date, origin = '1970-01-01')
will return "2016-11-07" instead of "2016-11-08" (the correct date), which is instead returned if
Sys.setenv(TZ='GMT')
res <- SPARQL('https://query.wikidata.org/sparql', query)
as.Date(as.POSIXct(res$results$date, origin = '1970-01-01'))
Is there any way to ask SPARQL to return date datatypes as characters?
I'm not sure how the R SPARQL package determines it's a date, but assuming it looks at the assigned datatype, you can coerce to string by retrieving only the lexical value:
SELECT (STR(?date) as ?dateString) ....
Of course this only works if the Unix time conversion happens on the result processing side, not during query evaluation. If the latter is the case: get a better SPARQL engine.
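Putting that together for the query in the question (a sketch - the ?dateString name is just illustrative, and this assumes the conversion indeed happens while processing results):
query <- "SELECT (STR(?date) AS ?dateString) WHERE {wd:Q699872 wdt:P585 ?date}"
res <- SPARQL('https://query.wikidata.org/sparql', query)
res$results$dateString   # comes back as character, e.g. "2016-11-08T00:00:00Z"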

Grabbing part of a link from a URL in R

I have parts of links pertaining to baseball players in my character vector:
teamplayerlinks <- c(
"/players/i/iannech01.shtml",
"/players/l/lindad01.shtml",
"/players/c/canoro01.shtml"
)
I would like to isolate the letters/numbers after the 3rd / sign and before the .shtml portion. I want my resulting string to read:
desiredlinks
# [1] "iannech01" "lindad01" "canoro01"
I assume this may be a job for sub, but after many trials and errors I'm having a very tough time learning the escape and character sequences. I know it can be done with two sub calls to remove the front and back portions, but I'd rather do it in one step that dynamically handles other links.
Thank you in advance to anyone who replies - I'm still learning R and trying to get better everyday.
You could try
gsub(".*/|\\..*$", "", teamplayerlinks)
# [1] "iannech01" "lindad01" "canoro01"
Here we have
.*/ remove everything up to and including the last /
| or
\\..*$ remove everything after the ., starting from the end of the string
By the way, these look a bit like player IDs given in the Lahman baseball data sets. If so, you can use the Lahman package in R and not have to scrape the web. It has numerous baseball data sets. It can be installed with install.packages("Lahman"). I also wrote a package retrosheet for downloading data sets from retrosheet.org. It's also on CRAN. Check it out!
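For example (a sketch - the table is called People in current versions of the Lahman package, and Master in older ones):
# install.packages("Lahman")
library(Lahman)
head(People$playerID)   # IDs in the same style, e.g. "aardsda01" "aaronha01" ...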
The basename function is useful here.
gsub("\\.shtml", "", basename(teamplayerlinks))
# [1] "iannech01" "lindad01" "canoro01"
This can also be done without regex:
tools::file_path_sans_ext(basename(teamplayerlinks))
#[1] "iannech01" "lindad01" "canoro01"

Importing option chain data from Bloomberg

I would like to import the entire option chain for a particular stock, i.e. all expiries and strikes for the exchange-traded options, from Bloomberg into R for a specified day. I am able to import the option chain for an unspecified day (today):
bbgData <- bds(connection,sec,"OPT_CHAIN")
where connection is a valid Bloomberg connection and sec is a Bloomberg security ticker such as "TLS AU Equity".
However, if I add extra fields it doesn't work, i.e.
bbgData <- bds(connection, sec,"OPT_CHAIN", testDate, "OPT_STRIKE_PX", "MATURITY", "PX_BID", "PX_ASK")
bbgData <- bds(connection, sec,"OPT_CHAIN", "OPT_STRIKE_PX", "MATURITY", "PX_BID", "PX_ASK")
Similarly, if I switch to using the historical data function it doesn't work
bbgData <- dateDataHist <- bdh(connection,sec,"OPT_CHAIN","20160201")
I just need the data for one day, but for a specified day, and including the additional fields.
Hint: I think the issue is that every field following "OPT_CHAIN" depends on the result of "OPT_CHAIN" - for example, the strike price is for the option code returned by "OPT_CHAIN" - but I am unsure how to introduce this conditionality into the R Bloomberg query.
It's better to use the field CHAIN_TICKERS and related overrides when retrieving option data for a given underlying from Bloomberg. You can, for example, request points for a given moneyness by getting CHAIN_TICKERS with an override of CHAIN_STRIKE_PX_OVRD equal to 90%-110%.
In either case you need to use the tickers that are the result of your first request in a second request if you want to retrieve additional data. So:
option_tickers <- bds("TLS AU Equity", "CHAIN_TICKERS",
                      overrides = c(CHAIN_STRIKE_PX_OVRD = "90%-110%"))
option_prices <- bdp(sapply(option_tickers, paste, "equity"),
                     c("PX_BID", "PX_ASK"))

Resources