Inconsistency in abbreviated month names in non-english (danish & norwegian) - r

I wish to convert some dates in Norwegian to actual dates in R. I'm using readr, and it kind of works, but I stumbled upon an issue which really annoys me, and I don't really know how to get around it.
Here is an illustration of my problem:
> parse_date(c("29. mai 2017", "29. sep 2017"), format = "%d. %b %Y", locale = locale("nn"))
Warning: 1 parsing failure.
row # A tibble: 1 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 2 NA date like %d. %b %Y 29. sep 2017
[1] "2017-05-29" NA
So it catches the date in May but not the one in September. It turns out that this is because the abbreviation for September in Norwegian needs a "." (sep. instead of sep), whereas the May abbreviation does not (probably because it's actually not an abbreviation ;-)):
locale("nb")
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: søndag (søn.), mandag (man.), tirsdag (tir.), onsdag (ons.), torsdag (tor.), fredag (fre.), lørdag (lør.)
Months: januar (jan.), februar (feb.), mars (mar.), april (apr.), mai (mai), juni (jun.), juli (jul.), august (aug.), september (sep.), oktober (okt.), november (nov.), desember (des.)
AM/PM: a.m./p.m.
However, it seems inconsistent that it does not require the same number of characters for all months. I also noticed that these annoying "." are not part of the abbreviations in English:
> locale("en")
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May), June (Jun), July (Jul), August (Aug),
September (Sep), October (Oct), November (Nov), December (Dec)
AM/PM: AM/PM
It really is terribly inconvenient, also because I believe it is somewhat rare to actually include the "." at all when registering dates with abbreviations (but that is really just based on personal preference and experience). Any input is much appreciated.

You can edit the locale manually like this...
loc <- locale("nb")
loc$date_names$mon_ab <- substr(loc$date_names$mon_ab, 1, 3) #just take first 3 characters
parse_date(c("29. mai 2017", "29. sep 2017"), format = "%d. %b %Y", locale = loc)
[1] "2017-05-29" "2017-09-29"

A solution similar to, and inspired by, @Andrew Gustar's is to create your own date_names object:
loc <- locale("nb")
myNo <- date_names(mon = loc$date_names$mon,
mon_ab = substr(loc$date_names$mon_ab, 1, 3),
day = loc$date_names$day,
day_ab = substr(loc$date_names$day, 1, 3))
parse_date(c("29. mai 2017", "29. sep 2017"), format = "%d. %b %Y", locale = locale(date_names = myNo))
[1] "2017-05-29" "2017-09-29"
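
If readr is not available, the same idea can be sketched in base R by looking the month up in a hand-written table instead of relying on a locale. This is only an illustration: nb_months and parse_nb_date are hypothetical names, and the pattern assumes input shaped like "29. sep 2017", with or without a period after the month.

```r
# Hypothetical lookup table: three-letter Norwegian month abbreviations, no periods
nb_months <- c("jan", "feb", "mar", "apr", "mai", "jun",
               "jul", "aug", "sep", "okt", "nov", "des")

# Hypothetical helper: parse "29. sep 2017" or "29. sep. 2017" into a Date
parse_nb_date <- function(x) {
  lx <- tolower(x)
  pattern <- "^([0-9]{1,2})\\. ([a-z]{3})\\.? ([0-9]{4})$"
  parts <- regmatches(lx, regexec(pattern, lx))
  iso <- vapply(parts, function(p) {
    # p holds the full match plus day, month and year; anything else is a failed match
    if (length(p) != 4) return(NA_character_)
    month <- match(p[3], nb_months)
    if (is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[2]))
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

parsed_nb <- parse_nb_date(c("29. mai 2017", "29. sep 2017", "29. sep. 2017", "not a date"))
print(parsed_nb)
```

Strings that do not fit the pattern come back as NA, which mirrors what parse_date does on a parsing failure.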

Related

How to locate and convert the all date format for a file.txt?

Suppose I have a diary.txt file. I import it as a string. All the dates in this string appear as YYYY.MM.DD, and I want to locate them and convert them into DDMMYYYY. What should I do?
For example, here is a diary.txt,
2018.01.01
It's a nice day.
2018.01.02
Today is a rainy day.
It should be converted into
Jan 01 2018
It's a nice day.
Jan 02 2018
Today is a rainy day.
You first need to coerce your dates to proper date objects (as.Date) and then replace them with a newly formatted date. See ?strptime for the syntax on how to specify the new format.
# import data
diary <- tempfile(fileext = ".txt")
cat("2018.01.01
It's a nice day.
2018.01.02
Today is a rainy day.", file = diary)
xy <- readLines(con = diary)
# coerce to proper date format
dates <- as.Date(xy, format = "%Y.%m.%d")
# replace valid dates with new dates formatted using format()
# date should be all non-NAs
xy[!is.na(dates)] <- format(dates[!is.na(dates)], format = "%b %d %Y") # %b will depend on your locale, see ?strptime
# write to file
writeLines(xy, con = "result.txt")
# contents of result.txt
jan. 01 2018
It's a nice day.
jan. 02 2018
Today is a rainy day.
Notice that it doesn't say Jan, but jan. This is due to my locale, which may not match what you are used to.
> Sys.getlocale()
[1] "LC_COLLATE=Slovenian_Slovenia.1250;LC_CTYPE=Slovenian_Slovenia.1250;LC_MONETARY=Slovenian_Slovenia.1250;LC_NUMERIC=C;LC_TIME=Slovenian_Slovenia.1250"
If I set the time locale to something else (this locale name may only work on Windows)
> Sys.setlocale(category = "LC_TIME", locale = "English_United States.1252")
the result is
> xy
[1] "Jan 01 2018" "It's a nice day." "" "Jan 02 2018"
[5] "Today is a rainy day."
Try this out:
# Loading data
data <- readLines("diary.txt")
# Identifying lines with dates
date_lines <- grep("^[[:digit:]]", data)
# Creating dates
data[date_lines] <- format(as.POSIXct(data[date_lines], format = "%Y.%m.%d"), "%b %d %Y")
# Writing to new file
fileConn<-file("diary_fixed.txt")
writeLines(data, fileConn)
close(fileConn)
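
Both answers depend on %b, which follows the current LC_TIME setting. If the file has been read in as a single string, as the question describes, the replacement could be sketched with gregexpr and regmatches, using the built-in month.abb vector so the output is always in English. The helper name reformat_dates is hypothetical, and the sketch assumes every date is written exactly as YYYY.MM.DD.

```r
# Hypothetical helper: replace every YYYY.MM.DD in a string with "Mon DD YYYY"
reformat_dates <- function(text) {
  hits <- gregexpr("[0-9]{4}\\.[0-9]{2}\\.[0-9]{2}", text)
  regmatches(text, hits) <- lapply(regmatches(text, hits), function(d) {
    if (length(d) == 0) return(d)            # no dates in this element
    parsed <- as.Date(d, format = "%Y.%m.%d")
    # month.abb is always English, unlike format(parsed, "%b")
    new <- paste(month.abb[as.integer(format(parsed, "%m"))],
                 format(parsed, "%d %Y"))
    ifelse(is.na(parsed), d, new)            # leave impossible dates untouched
  })
  text
}

diary <- "2018.01.01\nIt's a nice day.\n2018.01.02\nToday is a rainy day."
converted <- reformat_dates(diary)
cat(converted, "\n")
```

Because the dates are replaced in place, the surrounding text and line breaks are left as they were.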

Extract time stamps from string and convert to R POSIXct object

Currently, my dataset has a time variable (factor) in the following format:
weekday month day hour min seconds +0000 year
I don't know what the "+0000" field is but all observations have this. For example:
"Tues Feb 02 11:05:21 +0000 2018"
"Mon Jun 12 06:21:50 +0000 2017"
"Wed Aug 01 11:24:08 +0000 2018"
I want to convert these values to POSIXlt or POSIXct objects (year-month-day hour:min:sec) and make them numeric. Currently, using as.numeric(as.character(time-variable)) outputs incorrect values.
Thank you for the great responses! I really appreciate it.
Not sure how to reproduce the transition from factor to char, but starting from that this code should work:
t <- unlist(strsplit(as.character("Tues Feb 02 11:05:21 +0000 2018")," "))
strptime(paste(t[6],t[2],t[3], t[4]),format='%Y %b %d %H:%M:%S')
PS: More on date formats and conversion: https://www.stat.berkeley.edu/~s133/dates.html
For this problem you can get by without using lubridate. First, to extract individual dates we can use regmatches and gregexpr:
date_char <- 'Tue Feb 02 11:05:21 +0000 2018 Mon Jun 12 06:21:50 +0000 2017'
ptrn <- '([[:alpha:]]{3} [[:alpha:]]{3} [[:digit:]]{2} [[:digit:]]{2}\\:[[:digit:]]{2}\\:[[:digit:]]{2} \\+[[:digit:]]{4} [[:digit:]]{4})'
date_vec <- unlist( regmatches(date_char, gregexpr(ptrn, date_char)))
> date_vec
[1] "Tue Feb 02 11:05:21 +0000 2018" "Mon Jun 12 06:21:50 +0000 2017"
You can learn more about regular expressions in ?regex.
In the above example, the +0000 field is the UTC offset in hours and minutes, e.g. it would be -0500 for the EST time zone. To convert to an R date-time object:
> as.POSIXct(date_vec, format = '%a %b %d %H:%M:%S %z %Y', tz = 'UTC')
[1] "2018-02-02 11:05:21 UTC" "2017-06-12 06:21:50 UTC"
which is the desired output. The format codes are documented in ?strptime, or you can use lubridate::guess_formats(). If you don't specify the tz, you'll get the output in your system's time zone (e.g. for me that would be EST). Since the offset is specified in the format, R correctly carries out the conversion.
To get numeric values, the following works:
> as.numeric(as.POSIXct(date_vec, format = '%a %b %d %H:%M:%S %z %Y', tz = 'UTC'))
[1] 1517569521 1497248510
Note: this is based on uniform string structure. In the OP there was Tues instead of Tue which wouldn't work. The above example is based on the three-letter abbreviation which is the standard reporting format.
If however, your data is a mix of different formats, you'd have to extract individual time strings (customized regexes, of course), then use lubridate::guess_formats() to get the formats and then use those to carry out the conversion.
Hope this is helpful!!
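
For data like the question's, where the weekday is sometimes spelled "Tues", one possible workaround is to drop the weekday before parsing, since it is not needed to identify the instant. The sketch below assumes English month abbreviations (it forces the C time locale so that %b matches them) and assumes the weekday is always the first space-separated word. The variable names are made up for illustration.

```r
# Assumption: month names are English, so parse under the C time locale
invisible(Sys.setlocale("LC_TIME", "C"))

time_factor <- factor(c("Tues Feb 02 11:05:21 +0000 2018",
                        "Mon Jun 12 06:21:50 +0000 2017"))

# Convert the factor to character, then strip the leading weekday word
no_weekday <- sub("^[[:alpha:]]+ ", "", as.character(time_factor))

# %z consumes the +0000 UTC offset
parsed_time <- as.POSIXct(no_weekday, format = "%b %d %H:%M:%S %z %Y", tz = "UTC")

# Seconds since 1970-01-01 00:00:00 UTC
time_numeric <- as.numeric(parsed_time)
print(time_numeric)
# [1] 1517569521 1497248510
```

Calling as.character() on the factor first matters: as.numeric() applied directly to a factor returns the internal level codes, not the values.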

How can I convert to short form of months instead of full names?

I have this time dataframe
3/31/2001 8:15
4/31/2001 8:25
2/31/2001 8:45
4/31/2001 8:55
Which I am converting into months in a different column using this line of code
all$month<-strftime(as.Date(all$time, format="%m/%d/%Y %H:%M"), "%B")
The result I get is of the form:
March
April
February
April
But I would like to get only the short form of the month names, i.e.
Mar
Apr
Feb
Apr
How can I implement it?
Use a lowercase "b":
> strftime(as.Date("3/31/2001 8:15", format="%m/%d/%Y %H:%M"), "%b")
[1] "Mar"
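
Applied to a whole column, and assuming English abbreviations are wanted regardless of the system locale, a sketch could index the built-in month.abb vector instead of using %b. The sample values below are illustrative. The last one is deliberately an impossible date (April has 30 days), which as.Date turns into NA.

```r
# Illustrative data; "4/31/2001" is not a real calendar date
all <- data.frame(time = c("3/31/2001 8:15", "4/30/2001 8:25",
                           "2/28/2001 8:45", "4/31/2001 8:55"),
                  stringsAsFactors = FALSE)

dates <- as.Date(all$time, format = "%m/%d/%Y %H:%M")

# month.abb is c("Jan", "Feb", ...) in every locale; NA dates stay NA
all$month <- month.abb[as.integer(format(dates, "%m"))]
print(all)
```

With an English LC_TIME setting, format(dates, "%b") gives the same abbreviations.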

Converting char to date time

In a data.frame, I have a date time stamp in the form:
head(x$time)
[1] "Thu Oct 11 22:18:02 2012" "Thu Oct 11 22:50:15 2012" "Thu Oct 11 22:54:17 2012"
[4] "Thu Oct 11 22:43:13 2012" "Thu Oct 11 22:41:18 2012" "Thu Oct 11 22:15:19 2012"
Every time I try to convert it with as.Date, lubridate, or zoo I get NAs or errors.
What is the way to convert this time to a readable form?
I've tried:
Time<-strptime(x$time,format="&m/%d/%Y %H:$M")
x$minute<-parse_date_time(x$time)
x$minute<-mdy(x$time)
x$minute<-as.Date(x$time,"%m/%d/%Y %H:%M:%S")
x$minute<-as.time(x$time)
x$minute<-as.POSIXct(x$time,format="%H:%M")
x$minute<-minute(x$time)
What you really want is strptime(). Try something like:
strptime(x$time, "%a %b %d %H:%M:%S %Y")
As an example of the interesting things you can do with strptime(), consider the following:
thedate <- "I came to your house at 11:45 on January 21, 2012."
strptime(thedate, "I came to your house at %H:%M on %B %d, %Y.")
# [1] "2012-01-21 11:45:00"
Another option is to use lubridate::parse_date_time():
library(lubridate)
parse_date_time(x$time, "%a %b %d %H:%M:%S %Y")
Or more simply:
parse_date_time(x$time, "abdHMSY")
From the docs:
It differs from base::strptime() in two respects. First, it allows specification of the order in which the formats occur without the need to include separators and % prefix. Such a formatting argument is referred to as "order". Second, it allows the user to specify several format-orders to handle heterogeneous date-time character representations.
The docs contain all the formats (the "abdHMSY" etc.) recognized by lubridate.
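
strptime() returns a POSIXlt object, which can be awkward as a data frame column, so a base R sketch for the question's data might convert to POSIXct and then pull out the minute. The time zone "UTC" is an assumption (the question does not say which zone the stamps are in), and the C time locale is forced so that %a and %b match the English names.

```r
# Assumption: English day and month names, so parse under the C time locale
invisible(Sys.setlocale("LC_TIME", "C"))

x <- data.frame(time = c("Thu Oct 11 22:18:02 2012",
                         "Thu Oct 11 22:50:15 2012"),
                stringsAsFactors = FALSE)

# as.POSIXct() accepts the same format string as strptime()
x$parsed <- as.POSIXct(x$time, format = "%a %b %d %H:%M:%S %Y", tz = "UTC")

# Extract the minute as an integer
x$minute <- as.integer(format(x$parsed, "%M"))
print(x)
```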

date objects in month day format

I was wondering if there is a way for R to turn this format into any date object. The format is 'month [space] day'. For example: Jan 1 or Jul 29 or Jul 30. I just want those examples to be read as a date object so I can manipulate them.
Yes, use as.Date, but you also have to specify a year:
x <- c("Jan 1", "Jul 29", "Jul 30")
as.Date(paste("2012", x), format="%Y %b %d")
[1] "2012-01-01" "2012-07-29" "2012-07-30"
See ?as.Date for more help on Date objects, and ?strptime for help on the formatting codes.
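
As a small sketch of the manipulation this enables, the conversion can be wrapped in a helper that defaults to the current year. The name month_day_to_date is hypothetical, and the C time locale is forced on the assumption that the month abbreviations are English.

```r
# Assumption: English month abbreviations, so parse under the C time locale
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical helper: attach a year (current year by default) and parse
month_day_to_date <- function(x, year = as.integer(format(Sys.Date(), "%Y"))) {
  as.Date(paste(year, x), format = "%Y %b %d")
}

d <- month_day_to_date(c("Jan 1", "Jul 29", "Jul 30"), year = 2012)

# Date arithmetic now works, e.g. days between consecutive entries
gaps <- as.numeric(diff(d))
print(gaps)
# [1] 210   1
```

The chosen year matters for results like these, because 2012 is a leap year.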
