I want to parse some Apache log files in R, but date conversion fails for some months. Example:
> as.Date(c("11/Jan/2012", "11/Mar/2012", "11/Apr/2012", "11/May/2012"), "%d/%b/%Y")
[1] "2012-01-11" NA
[3] "2012-04-11" NA
Why do Jan and Apr work while Mar and May do not?
Kind regards,
Arne
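One likely explanation, assuming the R session runs under a non-English locale: %b matches month abbreviations in the current LC_TIME locale. In German, for instance, "Jan" and "Apr" coincide with the English abbreviations, while March and May are abbreviated differently. A minimal sketch of a workaround follows. The helper name parse_english_dates is made up for illustration; it temporarily switches LC_TIME to the "C" locale, which uses English month names, and restores the previous setting afterwards.

```r
# Hypothetical helper: parse dates with English month names regardless of
# the session locale, then restore the previous LC_TIME setting.
parse_english_dates <- function(x, format) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  as.Date(x, format = format)
}

log_dates <- c("11/Jan/2012", "11/Mar/2012", "11/Apr/2012", "11/May/2012")
parsed <- parse_english_dates(log_dates, "%d/%b/%Y")
print(parsed)
```

Running Sys.getlocale("LC_TIME") beforehand shows which locale is active and whether this explanation applies.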
I'm relatively new to R. I downloaded a dataset of clinical trial data and noticed that the dates in the relevant column come in mixed formats: most look like "September 1, 2012", but some are missing the day (e.g. "October 2015").
I want to express them all in the same way (e.g. yyyy-mm-dd) so that I can work with them. That went fine; the only remaining problem is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col" to which I can pass the intended name of the created (formatted) column, but the column is always named output_col.
Do you know how I could handle this, i.e. how to pass the intended name of the output column into the function?
Is there a better way to solve my problem?
I also tried passing a more complex orders argument to lubridate::parse_date_time, like
parse_date_time(input_col, orders="mdy", "my")
but this didn't work.
Here's the code:
library("tidyverse")
library("lubridate")
Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
"June 24, 2010","January 2005")
df_dates <- data.frame(Observation, Date_original)
# looking for a comma in the cell
comma_detect <- function(a_string){
str_detect(a_string, ",")
}
# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
if_else(comma_detect(input_col),
parse_date_time(input_col, orders="mdy"),
parse_date_time(input_col, orders="my"))
}
# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
mutate(df, output_col = date_correction_row(input_col))
}
df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()
OUTPUT
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
In the code below we assume that output_col equals "Date". All three approaches set the column name, give no warnings, and return Date class.
1) Try each format and take the one that does not give NA. This uses only base R.
output_col <- "Date"
within(df_dates, assign(output_col, pmin(na.rm = TRUE,
as.Date(Date_original, "%B %d, %Y"),
as.Date(paste(Date_original, 1), "%B %Y %d"))))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
2) This can also be done in lubridate. It is important that my is the first rather than the second argument to coalesce: my outputs NA for values that do not match its format, whereas mdy gives a wrong date, so if mdy came first, coalesce would never get to my. This approach is shorter than (3), but you might prefer the robustness of (3), since it does not depend on what is returned for non-matching dates.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := coalesce(my(Date_original, quiet = TRUE),
mdy(Date_original)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
3) If you prefer your own method of first checking for comma here is a variation of that which is more compact. It uses my and mdy instead of parse_date_time since my and mdy give Date class results which are more appropriate here than the POSIXct of parse_date_time given that there are no times.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := if_else(grepl(",", Date_original),
mdy(Date_original), my(Date_original, quiet = TRUE)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
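To address the "pass the name into the function" part of the question directly, the comma check from (3) can be wrapped in a function. This is only a sketch: it assumes dplyr 1.0 or later (for the {{ }} embrace operator) and a lubridate version that provides my(). Here input_col is given as an unquoted column name and output_col as a character string, which is a design choice of this sketch rather than anything from the original post.

```r
library(dplyr)
library(lubridate)

df_dates <- data.frame(
  Observation = 1:5,
  Date_original = c("October 2014", "August 2014", "June 2013",
                    "June 24, 2010", "January 2005")
)

# Hypothetical rewrite of date_correction():
#   input_col  - unquoted column name (forwarded with {{ }})
#   output_col - character string naming the new column (injected with !!)
date_correction <- function(df, input_col, output_col) {
  mutate(df, !!output_col := if_else(
    grepl(",", {{ input_col }}),
    mdy({{ input_col }}, quiet = TRUE),
    my({{ input_col }}, quiet = TRUE)
  ))
}

res <- date_correction(df_dates, Date_original, "date_formatted")
print(names(res))
```

The original version always produced a column called output_col because mutate treats the left-hand side of = as a literal name; := together with !! is what makes the name programmable.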
When the date structure is known, I like to explicitly correct the date structure first, then parse. Here I use regex to sub in 1 when the day is missing, then we just parse like normal.
library(tidyverse)
df_dates %>%
mutate(
output_col = gsub("(?<!,)\\s(?=\\d{4})", " 1, ", Date_original, perl = TRUE) %>%
as.Date(., format = '%B %d, %Y')
)
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
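To see what the regular expression above does on its own, it can help to run only the substitution step on a couple of strings. This is a small base-R illustration with two of the sample values; the pattern matches a space that is not preceded by a comma and is followed by a four-digit year.

```r
x <- c("October 2014", "June 24, 2010")

# Insert " 1, " before the year only when no day (and comma) is present.
fixed <- gsub("(?<!,)\\s(?=\\d{4})", " 1, ", x, perl = TRUE)
print(fixed)
```

After this step every string has the shape "Month day, year", so a single format string is enough for parsing.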
I have a dataset with 228 observations and, one of the three columns specifying "Date" is in the following form: 2 December 1999 4 November 1999 7 October 1999 .....
My aim is to convert it into this format: 1999-12-02 (yyyy-mm-dd). To do so, I use the "as.POSIXct" function but I get "NA" for all the 228 observations.
I tried this code and every variation of it I could think of (including hints from earlier questions similar to mine), such as "as.Date", "strptime", etc.:
new_date <- as.POSIXct(ecb_result1$Date, format = "%Y-%m-%d")
As I said, I expected to see the conversion from "2 December 1999" to "1999-12-02". Instead, I got:
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Is there anyone who can help me understand what's wrong and how to fix it?
The format for as.Date or as.POSIXct in this case is "%e %B %Y", because the format has to describe the input strings, not the desired output. It is also locale dependent. In my case October is not converted because my locale is Dutch, so the format expects "oktober", not "October". This might be happening in your case as well. I would suggest trying lubridate's dmy function. See the examples below.
dates <- c("2 December 1999", "4 November 1999", "7 October 1999")
# goes wrong for my locale
as.Date(dates, "%e %B %Y") # as.Date
[1] "1999-12-02" "1999-11-04" NA
as.POSIXct(dates, format = "%e %B %Y") # as.POSIXct
[1] "1999-12-02 CET" "1999-11-04 CET" NA
But lubridate's dmy function does the trick for me.
lubridate::dmy(dates)
[1] "1999-12-02" "1999-11-04" "1999-10-07"
Alternatively, changing the LC_TIME locale with Sys.setlocale also works (the locale name below is the Windows spelling):
Sys.setlocale("LC_TIME", "English_United Kingdom")
as.POSIXct(dates, format = "%e %B %Y")
[1] "1999-12-02 CET" "1999-11-04 CET" "1999-10-07 CEST"
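The two steps, parsing with a format that describes the input and then formatting for output, can be sketched in base R as follows. To keep the example independent of the session locale, it temporarily switches LC_TIME to "C" (English month names); that switch is an addition for illustration and not part of the original answer.

```r
dates <- c("2 December 1999", "4 November 1999", "7 October 1999")

old <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")          # English month names for %B

# A format describing the desired output does not match the input: all NA.
wrong <- as.Date(dates, format = "%Y-%m-%d")

# A format describing the input parses correctly.
parsed <- as.Date(dates, format = "%d %B %Y")

Sys.setlocale("LC_TIME", old)          # restore the previous locale

# Dates print as yyyy-mm-dd by default; format() makes the output explicit.
out <- format(parsed, "%Y-%m-%d")
print(out)
```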
I want to know how to find out which part of a string is the month and which part is the day when parsing dates.
The problem is that 01-06-2017 can be 1 June or 6 January. How do I parse it correctly? In India we mostly write dates as day, month, year; in the West it is mostly month, day, year. When I have mixed data, how do I infer which part is the month and which is the day?
The data is not clean: it sometimes has dates in mdy and sometimes in dmy format, and if both numbers are 12 or less, it is difficult to know which is the day and which is the month.
11/1/11 can be 11 Jan 2011 or 1 November 2011.
Example
I am using lubridate package and I have dates in this format
library(lubridate)
fundates2=c("1Apr2017","12-30-2017","1/6/17")
fun3=dmy(fundates2)
## Warning: 1 failed to parse.
fun3
## [1] "2017-04-01" NA "2017-06-01"
fun4=mdy(fundates2)
## Warning: 1 failed to parse.
fun4
## [1] NA "2017-12-30" "2017-01-06"
Well, you have to know from your context which interpretation is correct.
To check how each parser read your date, you can simply add 1 day to it and see which field changes:
In fun3:
fun3 + 1
[1] "2017-04-02" NA "2017-06-02"
Here the third value is read as month 06 (June).
In fun4:
fun4 + 1
[1] NA "2017-12-31" "2017-01-07"
Here it is read as month 01 (January).
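Since no parser can decide this from the string alone, one practical step is to flag the values that are genuinely ambiguous, i.e. those that parse under both conventions but to different dates, and resolve them from context. A minimal sketch with lubridate; the column names in the result are made up for illustration.

```r
library(lubridate)

fundates2 <- c("1Apr2017", "12-30-2017", "1/6/17")

# Parse under both conventions; quiet = TRUE suppresses the failure warnings.
as_dmy <- dmy(fundates2, quiet = TRUE)
as_mdy <- mdy(fundates2, quiet = TRUE)

# Ambiguous: both parses succeed but disagree.
ambiguous <- !is.na(as_dmy) & !is.na(as_mdy) & as_dmy != as_mdy

check <- data.frame(raw = fundates2, as_dmy, as_mdy, ambiguous)
print(check)
```

Values where only one parse succeeds can be accepted as is; only the flagged rows need a decision based on the data source.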
I have data with dates in MM/DD/YY HH:MM format and others in plain old MM/DD/YY format. I want to parse all of them into the same format as "2010-12-01 12:12 EST." How should I go about doing that? I tried the following ifelse statement and it gave me a bunch of long integers and told me a large number of my data points failed to parse:
df_prime$date <- ifelse(!is.na(mdy_hm(df$date)), mdy_hm(df$date), mdy(df$date))
df_prime is a duplicate of the data frame df that I initially loaded in
IEN date admission_number KEY_PTF_45 admission_from discharge_to
1 12 3/3/07 18:05 1 252186 OTHER DIRECT
2 12 3/9/07 12:10 1 252186 RETURN TO COMMUNITY- INDEPENDENT
3 12 3/10/07 15:08 2 252382 OUTPATIENT TREATMENT
4 12 3/14/07 10:26 2 252382 RETURN TO COMMUNITY-INDEPENDENT
5 12 4/24/07 19:45 3 254343 OTHER DIRECT
6 12 4/28/07 11:45 3 254343 RETURN TO COMMUNITY-INDEPENDENT
...
1046334 23613488506 2/25/14 NA NA
1046335 23613488506 2/25/14 11:27 NA NA
1046336 23613488506 2/28/14 NA NA
1046337 23613488506 3/4/14 NA NA
1046338 23613488506 3/10/14 11:30 NA NA
1046339 23613488506 3/10/14 12:32 NA NA
Sorry if some of the formatting isn't right, but the date column is the most important one.
EDIT: Below is some code for a portion of my data frame via a dput command:
structure(list(IEN = c(23613488506, 23613488506, 23613488506, 23613488506, 23613488506, 23613488506), date = c("2/25/14", "2/25/14 11:27", "2/28/14", "3/4/14", "3/10/14 11:30", "3/10/14 12:32")), .Names = c("IEN", "date"), row.names = 1046334:1046339, class = "data.frame")
Have you tried the function guess_formats() in the lubridate package?
A reproducible example to build a dataframe like yours could be helpful!
The lubridate package's mdy_hm has a truncated parameter that lets you supply dates that might not have all the bits. For your example:
> mdy_hm(d$date,truncated=2)
[1] "2014-02-25 00:00:00 UTC" "2014-02-25 11:27:00 UTC"
[3] "2014-02-28 00:00:00 UTC" "2014-03-04 00:00:00 UTC"
[5] "2014-03-10 11:30:00 UTC" "2014-03-10 12:32:00 UTC"
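As for the long integers in the question: base ifelse drops attributes such as the POSIXct class, so it returns the underlying number of seconds since 1970-01-01. A small base-R sketch with made-up timestamps shows the effect and one way to restore the class.

```r
a <- as.POSIXct("2014-02-25 11:27:00", tz = "UTC")
b <- as.POSIXct("2014-02-25 00:00:00", tz = "UTC")

# ifelse() keeps only the numeric values, not the date-time class.
raw <- ifelse(c(TRUE, FALSE), a, b)
print(class(raw))

# The class can be restored from the seconds since the epoch.
restored <- as.POSIXct(raw, origin = "1970-01-01", tz = "UTC")
print(restored)
```

Parsing everything in one call, as with truncated above, avoids the problem because no ifelse is needed.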
Is there a good way to convert a year plus week number to a date in R? I have tried the following:
> as.POSIXct("2008 41", format="%Y %U")
[1] "2008-02-21 EST"
> as.POSIXct("2008 42", format="%Y %U")
[1] "2008-02-21 EST"
According to ?strftime:
%Y Year with century. Note that whereas there was no zero in the
original Gregorian calendar, ISO 8601:2004 defines it to be valid
(interpreted as 1BC): see http://en.wikipedia.org/wiki/0_(year). Note
that the standard also says that years before 1582 in its calendar
should only be used with agreement of the parties involved.
%U Week of the year as decimal number (00–53) using Sunday as the
first day 1 of the week (and typically with the first Sunday of the
year as day 1 of week 1). The US convention.
This is kinda like another question you may have seen before. :)
The key issue is: what day should a week number specify? Is it the first day of the week? The last? That's ambiguous. I don't know if week one is the first day of the year or the 7th day of the year, or possibly the first Sunday or Monday of the year (which is a frequent interpretation). (And it's worse than that: these generally appear to be 0-indexed, rather than 1-indexed.) So, an enumerated day of the week needs to be specified.
For instance, try this:
as.POSIXlt("2008 42 1", format = "%Y %U %u")
The %u indicator specifies the day of the week.
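The same idea can be wrapped in a small helper. This is a sketch under the %U convention (weeks start on Sunday, week 1 begins with the first Sunday of the year); the function name week_to_date is made up for illustration, and it uses as.Date so that no time zone is involved.

```r
# Hypothetical helper: date of a given weekday (%u: 1 = Monday ... 7 = Sunday)
# within a %U-numbered week of a year.
week_to_date <- function(year, week, weekday = 1) {
  as.Date(paste(year, week, weekday), format = "%Y %U %u")
}

mondays <- week_to_date(2008, 41:42)
print(mondays)
```

Without the weekday the string does not identify a single day, which is why the two calls in the question return the same (current) date.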
Additional note: See ?strptime for the various options for format conversion. It's important to be careful about the enumeration of weeks, as these can be split across the end of the year, and day 1 is ambiguous: is it specified based on a Sunday or Monday, or from the first day of the year? This should all be specified and tested on the different systems where the R code will run. I'm not certain that Windows and POSIX systems sing the same tune on some of these conversions, hence I'd test and test again.
Day-of-week == zero in the POSIXlt DateTimeClasses system is Sunday. Not exactly Biblical, and not in agreement with the R convention of indexing from 1 either, but that's what it is. Week zero is the first (partial) week in the year. Week one (but day of week zero) starts with the first Sunday. Most of the other sequence-type elements in POSIXlt also have 0 as their starting point. It is kind of interesting to see what coercing the list elements of POSIXlt objects does. The only way you can actually change a POSIXlt date is to alter the $year, the $mon or the $mday elements. The others seem to be epiphenomena.
today <- as.POSIXlt(Sys.Date())
today # Tuesday
#[1] "2012-02-21 UTC"
today$wday <- 0 # attempt to make it Sunday
today
# [1] "2012-02-21 UTC" The attempt fails
today$mday <- 19
today
#[1] "2012-02-19 UTC" Success
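The zero-based elements are easy to inspect directly. A short sketch for the same date as above, fixed rather than taken from Sys.Date() so the values are reproducible:

```r
lt <- as.POSIXlt("2012-02-21", tz = "UTC")

# year counts from 1900, mon from 0 (January), wday from 0 (Sunday),
# yday from 0 (1 January); mday is the only one that starts at 1.
parts <- c(year = lt$year, mon = lt$mon, mday = lt$mday,
           wday = lt$wday, yday = lt$yday)
print(parts)
```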
I did not come up with this myself (it's taken from a blog post by Forester), but nevertheless I thought I'd add this to the answer list because it's the first implementation of the ISO 8601 week number convention that I've seen in R.
No doubt, week numbers are a very ambiguous topic, but I prefer an ISO standard over the current implementation of week numbers via format(..., "%U") because it seems that this is what most people agreed on, at least in Germany (calendars etc.).
I've put the actual function def at the bottom to facilitate focusing on the output first. Also, I just stumbled across package ISOweek, maybe worth a try.
Approach Comparison
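# Assumed inputs (not shown in the original post): every day of January 2012
# and R's Sunday-based week number (%U) for each day. ISOweek() is the
# function defined at the bottom and must be sourced first.
posix <- seq(as.Date("2012-01-01"), as.Date("2012-01-31"), by = "day")
weeknum <- as.numeric(format(posix, "%U"))
x.days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
# Map POSIXlt weekdays (0 = Sunday) to names, treating Sunday as day 7
x.names <- sapply(seq_along(posix), function(x) {
x.day <- as.POSIXlt(posix[x], tz="Europe/Berlin")$wday
if (x.day == 0) {
x.day <- 7
}
out <- x.days[x.day]
})
data.frame(
posix,
name=x.names,
week.r=weeknum,
week.iso=ISOweek(as.character(posix), tzone="Europe/Berlin")$weeknum
)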
# Result
posix name week.r week.iso
1 2012-01-01 Sun 1 4480458
2 2012-01-02 Mon 1 1
3 2012-01-03 Tue 1 1
4 2012-01-04 Wed 1 1
5 2012-01-05 Thu 1 1
6 2012-01-06 Fri 1 1
7 2012-01-07 Sat 1 1
8 2012-01-08 Sun 2 1
9 2012-01-09 Mon 2 2
10 2012-01-10 Tue 2 2
11 2012-01-11 Wed 2 2
12 2012-01-12 Thu 2 2
13 2012-01-13 Fri 2 2
14 2012-01-14 Sat 2 2
15 2012-01-15 Sun 3 2
16 2012-01-16 Mon 3 3
17 2012-01-17 Tue 3 3
18 2012-01-18 Wed 3 3
19 2012-01-19 Thu 3 3
20 2012-01-20 Fri 3 3
21 2012-01-21 Sat 3 3
22 2012-01-22 Sun 4 3
23 2012-01-23 Mon 4 4
24 2012-01-24 Tue 4 4
25 2012-01-25 Wed 4 4
26 2012-01-26 Thu 4 4
27 2012-01-27 Fri 4 4
28 2012-01-28 Sat 4 4
29 2012-01-29 Sun 5 4
30 2012-01-30 Mon 5 5
31 2012-01-31 Tue 5 5
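For comparison, base R can also report the ISO 8601 week on output via the %V conversion (with %G for the matching ISO year). A short sketch, assuming a platform where format() supports these conversions on output; they are not usable for parsing input.

```r
d <- as.Date(c("2012-01-01", "2012-01-02", "2012-01-08", "2012-01-09"))

weeks <- data.frame(
  date     = d,
  week_U   = format(d, "%U"),  # Sunday-based week number
  week_iso = format(d, "%V"),  # ISO 8601 week number
  year_iso = format(d, "%G")   # ISO 8601 week-based year
)
print(weeks)
```

Sunday 1 January 2012 belongs to ISO week 52 of 2011, which is the case the function above gets wrong.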
Function Def
It's taken directly from the blog post, I've just changed a couple of minor things. The function is still kind of sketchy (e.g. the week number of the first date is far off), but I find it to be a nice start!
ISOweek <- function(
date,
format="%Y-%m-%d",
tzone="UTC",
return.val="weekofyear"
){
##converts dates into "dayofyear" or "weekofyear", the latter providing the ISO-8601 week
##date should be a vector of class Date or a vector of formatted character strings
##format refers to the date form used if a vector of
## character strings is supplied
##convert date to POSIXt format
if(class(date)[1]%in%c("Date","character")){
date=as.POSIXlt(date,format=format, tz=tzone)
}
# if(class(date)[1]!="POSIXt"){
if (!inherits(date, "POSIXt")) {
stop("Date is of wrong format.") # break is only valid inside a loop
}else if(class(date)[2]=="POSIXct"){
date=as.POSIXlt(date, tz=tzone)
}
print(date)
if(return.val=="dayofyear"){
##add 1 because POSIXt is base zero
return(date$yday+1)
}else if(return.val=="weekofyear"){
##Based on the ISO8601 weekdate system,
## Monday is the first day of the week
## W01 is the week with 4 Jan in it.
year=1900+date$year
jan4=strptime(paste(year,1,4,sep="-"),format="%Y-%m-%d")
wday=jan4$wday
wday[wday==0]=7 ##convert to base 1, where Monday == 1, Sunday==7
##calculate the date of the first week of the year
weekstart=jan4-(wday-1)*86400
weeknum=ceiling(as.numeric((difftime(date,weekstart,units="days")+0.1)/7))
#########################################################################
##calculate week for days of the year occurring in the next year's week 1.
#########################################################################
mday=date$mday
wday=date$wday
wday[wday==0]=7
year=ifelse(weeknum==53 & mday-wday>=28,year+1,year)
weeknum=ifelse(weeknum==53 & mday-wday>=28,1,weeknum)
################################################################
##calculate week for days of the year occurring prior to week 1.
################################################################
##first calculate the number of weeks in the previous year
year.shift=year-1
jan4.shift=strptime(paste(year.shift,1,4,sep="-"),format="%Y-%m-%d")
wday=jan4.shift$wday
wday[wday==0]=7 ##convert to base 1, where Monday == 1, Sunday==7
weekstart=jan4.shift-(wday-1)*86400
weeknum.shift=ceiling(as.numeric((difftime(date,weekstart)+0.1)/7))
##update year and week
year=ifelse(weeknum==0,year.shift,year)
weeknum=ifelse(weeknum==0,weeknum.shift,weeknum)
return(list("year"=year,"weeknum"=weeknum))
}else{
stop("Unknown return.val") # break is only valid inside a loop
}
}