Creating a unified time-series, with dates coming from different (natural) languages

I am using the as.Date function as follows:
x$time_date <- as.Date(x$time_date, format = "%H:%M - %d %b %Y")
This worked fine until I saw a lot of NA values in the output, which I traced back to some of the dates stemming from a different language: German.
My English dates look like this: 18:00 - 10 Dec 2014
Where the German equivalent is: 18:00 - 10 Dez 2014
The month December is abbreviated the German way. This is not recognised by the as.Date function. I have the same problem for five other months:
Mar - März
May - Mai
Jun - Juni
Jul - Juli
Oct - Okt
This looks like it would be of use, but I am unsure of how to implement it for 'unrecognised' formats:
How to change multiple Date formats in same column
I attempted to go through and use gsub to replace all the occurrences of German months, but without luck. x below is the data.table and I work on just the time_date column:
x$time_date <- gsub("(März)?", "Mar", x$time_date) %>%
gsub("(Mai)?", "May", .) %>%
gsub("(Juni)?", "Jun", .) %>%
gsub("(Juli)?", "Jul", .) %>%
gsub("(Okt)?", "Oct", .) %>%
gsub("(Dez)?", "Dec", .)
Not only did this not work, but it is also a very slow process and I have nearly 20 GB of pure .csv files to work through.
In the as.Date documentation there is mention of different locales / languages, but not how to work with several simultaneously. I also found instructions on how to use different languages. However, my data is all mixed, so I can only think of a conditional loop that uses the correct language for each file, and that would also be slow.
Is there a known workaround for this, which I can't find?

Create a lookup table tab that contains all the translations, then use subscripting to do the translation. The code below works for me on Windows, provided the abbreviations in your input match the standard ones that R generates for each locale. The precise locale names ("German", etc.) may differ on your system; see ?Sys.setlocale for more information. If your input uses abbreviations other than the generated ones, add them to tab yourself, e.g. tab <- c(tab, Juli = "Jul")
langs <- c("French", "German", "English")
tab <- unlist(lapply(langs, function(lang) {
Sys.setlocale("LC_TIME", lang)
nms <- format(ISOdate(2000, 1:12, 1), "%b")
setNames(month.abb, nms)
}))
x <- c("18:00 - 10 Juli 2014", "18:00 - 10 Mai 2014") # test input
source_month <- gsub("[^[:alpha:]]", "", x)
mapply(sub, source_month, tab[source_month], x, USE.NAMES = FALSE)
giving:
[1] "18:00 - 10 Jul 2014" "18:00 - 10 May 2014"
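As a side note on the question's gsub attempt: a pattern such as "(Mai)?" makes the whole group optional, so it also matches the empty string at every position and the replacement is inserted throughout the text. Below is a minimal sketch of a locale-independent alternative that maps the month token to a number instead of rewriting the string. The names month_num and parse_mixed_dates are hypothetical, the table only covers the English and German spellings listed in the question, and the input layout "HH:MM - D Mon YYYY" is assumed.

```r
# Hypothetical lookup table: English and German month spellings -> month number.
spellings <- c("Jan", "Feb", "Mar", "M\u00e4rz", "Apr", "May", "Mai", "Jun", "Juni",
               "Jul", "Juli", "Aug", "Sep", "Oct", "Okt", "Nov", "Dec", "Dez")
numbers   <- c(1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 10, 11, 12, 12)
month_num <- setNames(numbers, spellings)

# Assumes strings shaped like "18:00 - 10 Dez 2014"; anything else becomes NA.
parse_mixed_dates <- function(x) {
  pattern <- "^\\d{1,2}:\\d{2} - (\\d{1,2}) (\\S+) (\\d{4})$"
  ok <- grepl(pattern, x)
  day  <- as.integer(ifelse(ok, sub(pattern, "\\1", x), NA))
  mon  <- as.integer(month_num[ifelse(ok, sub(pattern, "\\2", x), NA)])
  year <- as.integer(ifelse(ok, sub(pattern, "\\3", x), NA))
  # Build an ISO date string, so no locale-dependent %b parsing is needed.
  as.Date(sprintf("%04d-%02d-%02d", year, mon, day), format = "%Y-%m-%d")
}

x <- c("18:00 - 10 Dez 2014", "18:00 - 10 Dec 2014", "09:30 - 3 Juli 2014")
parsed <- parse_mixed_dates(x)
print(parsed)
```

Because every step is vectorized, the function can be applied to a whole column at once, which matters for large files. Like the as.Date call in the question, it drops the time of day.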

Related

Writing a function to clean string data and rename columns

I am writing a function to be applied to many individual matrices. Each matrix has 5 columns of string text. I want to remove a piece of one string which matches the string inside another element exactly, then apply a couple more stringr functions, transform it into a data frame, then rename the columns and in the last step I want to add a number to the end of each column name, since I will apply this to many matrices and need to identify the columns later.
This is very similar to another function I wrote so I can't figure out why it won't work. I tried running each line individually by filling in the inputs like this and it works perfectly:
Review1[,4] <- str_remove(Review1[,4], Review1[,3])
Review1[,4] <- str_sub(Review1[,4], 4, -4)
Review1[,4] <- str_trim(Review1[,4], "both")
Review1 <- as.data.frame(Review1)
colnames(Review1) <- c("Title", "Rating", "Date", "User", "Text")
Review1 <- Review1 %>% rename_all(paste0, 1)
But when I run the function nothing seems to happen at all.
Transform_Reviews <- function(x, y, z, a) {
x[,y] <- str_remove(x[,y], x[,z])
x[,y] <- str_sub(x[,y], 4, -4)
x[,y] <- str_trim(x[,y], "both")
x <- as.data.frame(x)
colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
x <- x %>% rename_all(paste0, a)
}
Transform_Reviews(Review1, 4, 3, 1)
This is the only warning message I get. I also receive this when I run the str_remove function individually, but it still changes the elements. But it changes nothing when I run the UDF.
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), ... :
empty search patterns are not supported
This is an example of the part of Review1 that I'm working with.
[,3] [,4]
[1,] "6 April 2014" "By Copnovelist on 6 April 2014"
[2,] "18 Dec. 2015" "By kenneth bell on 18 Dec. 2015"
[3,] "26 May 2015" "By Simon.B :-) on 26 May 2015"
[4,] "22 July 2013" "By Lilla Lukacs on 22 July 2013"
This is what I want the output to look like:
Date1 User1
1 6 April 2014 Copnovelist
2 18 Dec. 2015 kenneth bell
3 26 May 2015 Simon.B :-)
4 22 July 2013 Lilla Lukacs

I realized I just needed to use an assignment operator to see my function work.
Review1 <- Transform_Reviews(Review1, 4, 3, 1)
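The reason the bare call appeared to do nothing is that R functions operate on a copy of their arguments, and the last expression in Transform_Reviews is an assignment, whose value is returned invisibly. The sketch below illustrates this with a stripped-down, hypothetical helper (add_suffix is not part of the original code) that returns the modified data frame explicitly.

```r
# Hypothetical helper: append a suffix to every column name.
add_suffix <- function(df, suffix) {
  names(df) <- paste0(names(df), suffix)
  df  # return the modified copy explicitly so it also prints when not assigned
}

reviews <- data.frame(Date = "6 April 2014", User = "Copnovelist")

renamed <- add_suffix(reviews, 1)  # the caller must capture the result
print(names(renamed))
print(names(reviews))              # the original object is unchanged
```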

How to extract date from the text

I tried to extract a date from the following text. Unfortunately, it keeps giving me a warning and the result is NA.
I have a following text:
"IRA-401K Investment Assets Under Management (AUM) As of July 31, 2018 BMG Funds
$217,743,573 BMG BullionBars $45,176,561 TOTAL $262,920,134 Physical Holdings Download
Scotiabank BMG BullionBars List Download Brinks BMG BullionBars List Holdings by Ounces As
of July 31, 2018 Gold Bars 21,132.496 Silver Bars 453,531.574 Silver Coins
80,500 Platinum Bars"
The text contains following date: July 31, 2018. These dates appear twice in the text.
I used following code to extract the dates out of the text.
test_take <- lapply(cleanurl_text, parse_date_time, orders = "mdy",
locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"))
I get the following warning message:
Warning message:
All formats failed to parse. No formats found.
When I include exact = TRUE
test_take <- lapply(as.character(cleanurl_text), parse_date_time, orders = "mdy",
locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"), exact = TRUE)
I get the following warning:
Warning message:
1 failed to parse.
The resulting object still contains NA.

The following regex can extract the date in the posted format.
pattern <- paste(month.name, collapse = "|")
pattern <- paste0("(", pattern, ")\\s\\d{1,2}.{1,2}\\d{4}")
m <- gregexpr(pattern, cleanurl_text)
regmatches(cleanurl_text, m)
#[[1]]
#[1] "July 31, 2018" "July 31, 2018"
Note that this can be done in just one code line, regmatches(gregexpr(.)), but I have opted for two lines in order to make it more readable.
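If the extracted strings then need to become Date objects, one option is to look the month name up in month.name rather than rely on the session locale. This is only a sketch: the sample text is shortened from the question, the pattern assumes the "Month D, YYYY" layout, and the variable names are illustrative.

```r
# Shortened sample text from the question (assumed layout: "Month D, YYYY").
txt <- "As of July 31, 2018 BMG Funds $217,743,573 Holdings by Ounces As of July 31, 2018 Gold Bars"

pattern <- paste0("(", paste(month.name, collapse = "|"), ")\\s\\d{1,2},\\s\\d{4}")
found <- regmatches(txt, gregexpr(pattern, txt))[[1]]

# Split "July 31, 2018" into month name, day and year.
parts <- strsplit(found, "[ ,]+")
mon   <- match(vapply(parts, `[`, character(1), 1), month.name)
day   <- as.integer(vapply(parts, `[`, character(1), 2))
year  <- as.integer(vapply(parts, `[`, character(1), 3))

# Assemble ISO strings so that parsing does not depend on LC_TIME.
dates <- as.Date(sprintf("%04d-%02d-%02d", year, mon, day), format = "%Y-%m-%d")
print(unique(dates))
```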

Add a variable including the day of the week

This could seem a repetition but I haven't found this exact question's answer yet.
I have this dataframe:
Day.of.the.month Month Year Items Amount.in.euros
1 1 January 2005 Nothing 0.00
2 2 February 2008 Food 7.78
3 3 April 2009 Nothing 0.00
4 4 January 2016 Bus 2.00
I want to create a column named "day.of.the.week" including, of course, "saturday", "sunday" and so on. If the date was formatted as '2012/02/02' I would not have probs, but this way I don't know whether there is a way nor a workaround to solve the issue.
Any hint?

Do you want this?
options(stringsAsFactors = F)
df <- data.frame( x = c(1, 2, 3, 4) ,y = c("January", "February","April", "January"), z = c(2005, 2008, 2009, 2016))
weekdays(as.Date(paste0(df$x, df$y, df$z),"%d%B%Y")) # %d for day of the month, %B for the full month name and %Y for the four-digit year
Note: as a commenter pointed out, this solution is locale dependent. If that is a problem, you can run Sys.setlocale("LC_TIME", "C") to change the locale setting, and Sys.getlocale() to inspect the current settings.
To make this permanent for every R session, add the following to your .Rprofile file (usually located in your home directory; on Windows that is typically the Documents folder):
.First <- function() {
Sys.setlocale("LC_TIME", "C")
}

Day.of.the.month<-as.numeric(c(1,2,3,4))
Month<-as.character(c("January","February","April","January"))
Year<-as.numeric(c(2005,2008,2009,2016))
Items<-as.character(c("Nothing","Food","Nothing","Bus"))
Amount.in.euros<-as.numeric(c(0.00,7.78,0.0,2.0))
complete.date<-paste(Year,Month,Day.of.the.month)
# Parse "2005 January 1" with an explicit format; %B needs a locale with English month names
parsed.date <- as.Date(complete.date, format = "%Y %B %d")
example1.data <-
data.frame(Day.of.the.month,Month,Year,Items,Amount.in.euros,complete.date)
example1.data$weekday <- weekdays(parsed.date)
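Both answers depend on the session locale, for parsing the month names and for the weekday names that come back. A minimal sketch that avoids this, assuming the month column always holds full English month names, uses month.name for the lookup and the numeric wday field of POSIXlt for the weekday. The data frame and column names below are illustrative, not the asker's.

```r
df <- data.frame(day   = c(1, 2, 3, 4),
                 month = c("January", "February", "April", "January"),
                 year  = c(2005, 2008, 2009, 2016))

# Month name -> month number, independent of LC_TIME.
month_idx <- match(df$month, month.name)
dates <- as.Date(sprintf("%04d-%02d-%02d",
                         as.integer(df$year), month_idx, as.integer(df$day)),
                 format = "%Y-%m-%d")

# POSIXlt$wday is 0 for Sunday through 6 for Saturday.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
df$day.of.the.week <- day_names[as.POSIXlt(dates)$wday + 1]
print(df)
```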

Date sequence in R spanning B.C.E. to A.D

I would like to generate a sequence of dates from 10,000 B.C.E. to the present. This is easy for 0 C.E. (or A.D.):
ADtoNow <- seq.Date(from = as.Date("0/1/1"), to = Sys.Date(), by = "day")
But I am stumped as to how to generate dates before 0 AD. Obviously, I could do years before present but it would be nice to be able to graph something as BCE and AD.

To expand on Ricardo's suggestion, here is some testing of how things work. Or don't work for that matter.
I will repeat Joshua's warning taken from ?as.Date for future searchers in big bold letters:
"Note: Years before 1CE (aka 1AD) will probably not be handled correctly."
as.integer(as.Date("0/1/1"))
[1] -719528
as.integer(seq(as.Date("0/1/1"),length=2,by="-10000 years"))
[1] -719528 -4371953
seq(as.Date(-4371953,origin="1970-01-01"),Sys.Date(),by="1000 years")
# nonsense
[1] "0000-01-01" "'000-01-01" "(000-01-01" ")000-01-01" "*000-01-01"
[6] "+000-01-01" ",000-01-01" "-000-01-01" ".000-01-01" "/000-01-01"
[11] "0000-01-01" "1000-01-01" "2000-01-01"
> as.integer(seq(as.Date(-4371953,origin="1970-01-01"),Sys.Date(),by="1000 years"))
# also possibly nonsense
[1] -4371953 -4006710 -3641468 -3276225 -2910983 -2545740 -2180498 -1815255
[9] -1450013 -1084770 -719528 -354285 10957
Though this does seem to work for graphing somewhat:
yrs1000 <- seq(as.Date(-4371953,origin="1970-01-01"),Sys.Date(),by="1000 years")
plot(yrs1000,rep(1,length(yrs1000)),axes=FALSE,ann=FALSE)
box()
axis(2)
axis(1,at=yrs1000,labels=c(paste(seq(10000,1000,by=-1000),"BC",sep=""),"0AD","1000AD","2000AD"))
title(xlab="Year",ylab="Value")

Quite some time has gone by since this question was asked. With that time came a new R package, gregorian which can handle BCE time values in the as_gregorian method.
Here's an example of piecewise constructing a list of dates that range from 10000 BCE to the current year.
library(lubridate)
library(gregorian)
# Container for the dates
dates <- c()
starting_year <- year(now())
# Add the CE dates to the list
for (year in starting_year:0){
date <- sprintf("%s-%s-%s", year, "1", "1")
dates <- c(dates, gregorian::as_gregorian(date))
}
starting_year <- -10000
# Add the BCE dates to the list
for (year in starting_year:0){
date <- sprintf("%s-%s-%s", year, "1", "1")
dates <- c(dates, gregorian::as_gregorian(date))
}
How you use the list is up to you, just know that the relevant properties of the date objects are year and bce. For example, you can loop over list of dates, parse the year, and determine if it's BCE or not.
> gregorian_date <- gregorian::as_gregorian("-10000-1-1")
> gregorian_date$bce
[1] TRUE
> gregorian_date$year
[1] 10001
Notes on 0AD
The gregorian package assumes that when you say Year 0, you really mean year 1 (shown below). I personally think an exception should be thrown, but that's the mapping users need to keep in mind.
> gregorian::as_gregorian("0-1-1")
[1] "Monday January 1, 1 CE"
This is also the case with BCE
> gregorian::as_gregorian("-0-1-1")
[1] "Saturday January 1, 1 BCE"

As @JoshuaUlrich commented, the short answer is no.
However, you can splice out the year into a separate column and then convert to integer. Would this work for you?
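A minimal sketch of that idea keeps the years as plain integers and only turns them into era labels for display. It assumes astronomical year numbering (year 0 is 1 BCE, year -1 is 2 BCE); that convention is an assumption of the sketch, not something the question specifies, and era_label is a hypothetical helper name.

```r
# Hypothetical helper: integer year (astronomical numbering) -> "N BCE" / "N CE".
era_label <- function(y) {
  y <- as.integer(y)
  ifelse(y <= 0L,
         sprintf("%d BCE", 1L - y),  # year 0 -> 1 BCE, year -9999 -> 10000 BCE
         sprintf("%d CE", y))
}

years <- c(-9999L, 0L, 1L, 2013L)
labels <- era_label(years)
print(labels)

# The integer years can serve as a numeric x axis, with labels attached via axis():
# plot(years, rep(1, length(years)), xaxt = "n"); axis(1, at = years, labels = labels)
```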

The package lubridate seems to handle "negative" years ok, although it does create a year 0, which from the above comments seems to be inaccurate. Try:
library(lubridate)
start <- -10000
stop <- 2013
myrange <- NULL
for (x in start:stop) {
myrange <- c(myrange,ymd(paste0(x,'-01-01')))
}

R time series data, daily only working days

I am using the following code:
dates<-seq(as.Date("1991/1/4"),as.Date("2010/3/1"),"days")
However, I would like to only have working days, how can it be done?
(Assuming that 1991/1/4 is a Monday, I would like to exclude: 1991/6/4 and 1991/7/4.
And that for each week.)
Thank you for your help.

Would this work for you? (note, it requires the timeDate package to be installed)
# install.packages('timeDate')
require(timeDate)
# A ’timeDate’ Sequence
tS <- timeSequence(as.Date("1991/1/4"), as.Date("2010/3/1"))
tS
# Subset weekdays
tW <- tS[isWeekday(tS)]; tW
dayOfWeek(tW)

You are entering your dates incorrectly. In order to use the YYYY/DD/MM input mode which is implied by 1991/1/4 being Monday, you need to have a format string in as.Date.
So the full solution assuming you want to exclude weekends is:
X <- seq( as.Date("1991/1/4", format="%Y/%m/%d"), as.Date("2010/3/1", format="%Y/%m/%d"),"days")
weekdays.X <- X[ ! weekdays(X) %in% c("Saturday", "Sunday") ]
# negation easier since only two cases in exclusion
# probably do not want to print that vector to screen.
str(weekdays.X)
Regarding your comment I am unable to reproduce. I get:
> table(weekdays(weekdays.X) )
Friday Monday Thursday Tuesday Wednesday
1000 1000 999 999 999
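For comparison, here is a sketch of what a year/day/month reading of the input would look like. The format string "%Y/%d/%m" is an assumption based on the question's remark that 1991/1/4 is a Monday; under that reading the start date is 1 April 1991. Weekends are filtered with the numeric wday field, so the result does not depend on the locale's weekday names.

```r
# Assumption: the input is year/day/month, so "1991/1/4" means 1 April 1991.
start <- as.Date("1991/1/4", format = "%Y/%d/%m")
end   <- as.Date("2010/3/1", format = "%Y/%d/%m")

all_days <- seq(start, end, by = "day")

# POSIXlt$wday: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday <- as.POSIXlt(all_days)$wday
working_days <- all_days[wday %in% 1:5]

print(head(working_days))
```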

I came to this question while looking up business day functions, and since the OP requested "business days" instead of "weekdays", and timeDate also has the isBizday function, this answer uses that.
# A timeDate Sequence
date.sequence <- timeSequence(as.Date("1991-12-15"), as.Date("1992-01-15")); # a short example period with three London holidays
date.sequence;
# holidays in the period
years.included <- unique( as.integer( format( x=date.sequence, format="%Y" ) ) );
holidays <- holidayLONDON(years.included) # (locale was not specified by OP in question nor in profile, so this assumes for example: holidayLONDON; also supported by timeDate are: holidayNERC, holidayNYSE, holidayTSX & holidayZURICH)
# Subset business days
business.days <- date.sequence[isBizday(date.sequence, holidays)];
business.days
