Format two kinds of factor dates in R - r

I have two sets of date-like strings: either 31.3.14 or 31/3/14.
I would like to format them as 31-3-2014.
I know how to convert each of them to the desired format, but I don't know how to distinguish them and apply the right approach below.
For this format 31.3.14 :
format(as.Date(as.character("31.3.14"), "%d.%m.%y"), "%d-%m-%Y")
For this format 31/3/14 :
format(as.Date(as.character("31/3/14"), "%d/%m/%y"), "%d-%m-%Y")
I have these sorts of dates mixed randomly in a data frame column, so I need to apply the matching method to each format.
EDIT: Sorry, I also have other kinds of dates, such as "2013-04-01"; here the solution provided with the dmy function fails.

You could also do it in base R by replacing the punctuation with a single separator first:
Dates <- c("31.3.14", "31/3/14")
format(as.Date(gsub("[[:punct:]]", "-", Dates), format = "%d-%m-%y"), "%d-%m-%Y")
## [1] "31-03-2014" "31-03-2014"
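The same base R idea can be stretched to cover year-first strings such as "2013-04-01" from the question's edit. The following is only a minimal sketch: normalize_dates is a hypothetical helper name (not from any package), and it assumes every string is either day-month-two-digit-year or four-digit-year-month-day.

```r
# Hypothetical helper: unify separators, then pick the format per element.
normalize_dates <- function(x) {
  x <- as.character(x)                     # factor columns -> character
  x <- gsub("[[:punct:]]", "-", x)         # "31.3.14", "31/3/14" -> "31-3-14"
  iso <- grepl("^[0-9]{4}-", x)            # TRUE for year-first strings
  out <- rep(as.Date(NA), length(x))
  out[iso]  <- as.Date(x[iso],  format = "%Y-%m-%d")
  out[!iso] <- as.Date(x[!iso], format = "%d-%m-%y")
  format(out, "%d-%m-%Y")
}

normalize_dates(c("31.3.14", "31/3/14", "2013-04-01"))
# "31-03-2014" "31-03-2014" "01-04-2013"
```

Because the helper converts with as.character first, it should also work directly on a factor column.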

Hadley Wickham's Lubridate package makes this easy.
> require(lubridate)
> test <- data.frame(raw = c("31.3.14", "31/3/14"))
> test$formatted <- dmy(test$raw)
> test
raw formatted
1 31.3.14 2014-03-31
2 31/3/14 2014-03-31
EDIT:
Based on the edit to the question, one can write a function that detects a four-digit sequence at the start of the date string and picks the parser accordingly.
require(stringr)
myDateFun <- function(x){
  x <- as.character(x)             # avoid working on factor codes
  iso <- str_detect(x, "^\\d{4}")  # TRUE for year-first strings
  z <- rep(as.Date(NA), length(x))
  z[iso] <- ymd(x[iso])            # e.g. "2014-3-31"
  z[!iso] <- dmy(x[!iso])          # e.g. "31.3.14" or "31/3/14"
  z <- format(z, "%Y-%m-%d")
  return(z)
}
test <- data.frame(raw = c("31.3.14", "31/3/14", "2014-3-31"))
test$formatted.2 <- myDateFun(test$raw)
test
raw formatted formatted.2
1 31.3.14 2014-03-31 2014-03-31
2 31/3/14 2014-03-31 2014-03-31
3 2014-3-31 <NA> 2014-03-31

Related

How to convert different date formats to single format in multiple columns of dataframe

I have a dataframe with dates in different formats scattered across the columns and I would like to standardize them to a single format. I can do the standardization for a single vector of heterogeneous dates, as in d, by defining the possible date formats in a vector such as formats and passing it to as.Date:
d <- c("01-02-2009","01/04/2009","15-Jan-2019", "12-12-2020")
formats <- c("%d-%m-%Y", "%d/%m/%Y", "%d-%b-%Y")
format(as.Date(d, format = formats), "%d-%b-%Y")
[1] "01-Feb-2009" "01-Apr-2009" "15-Jan-2019" "12-Dez-2020"
But this doesn't work for the dataframe:
df <- data.frame(Transaction = c("01-Mar-2015", "31-01-2012", "15/01/1999"),
Delivery = c("01-02-2018", "01/08/2016", "17-09-2007"),
Return = c("27/11/2009", "22-Jan-2013", "20-Nov-1987"))
Here, the standardization works only partly:
df[,1:3] <- lapply(df[,1:3], function(x) format(as.Date(x, format = formats), "%d-%b-%Y"))
df
Transaction Delivery Return
1 <NA> 01-Feb-2018 <NA>
2 <NA> 01-Aug-2016 <NA>
3 <NA> <NA> 20-Nov-1987
How can the dates be standardized to the %d-%b-%Y format in the whole dataframe?
With mutate_all you can convert every column of your data frame to a date-time using the parse_date_time function from lubridate, passing your vector of formats as the orders argument.
Then, you can format these dates into the desired output by using format:
library(lubridate)
library(dplyr)
formats <- c("%d-%m-%Y", "%d/%m/%Y", "%d-%b-%Y")
df %>% mutate_all( ~parse_date_time(., orders = formats)) %>%
mutate_all(~format(., "%d-%b-%Y"))
Transaction Delivery Return
1 01-Mar-2015 01-Feb-2018 27-Nov-2009
2 31-Jan-2012 01-Aug-2016 22-Jan-2013
3 15-Jan-1999 17-Sep-2007 20-Nov-1987
Using apply you can do:
library(lubridate)
apply(df, 2, function(x) format(parse_date_time(x, orders = formats), "%d-%b-%Y"))
Transaction Delivery Return
[1,] "01-Mar-2015" "01-Feb-2018" "27-Nov-2009"
[2,] "31-Jan-2012" "01-Aug-2016" "22-Jan-2013"
[3,] "15-Jan-1999" "17-Sep-2007" "20-Nov-1987"
Does it answer your question?
NB: parse_date_time is working for lubridate version 1.7.8. For lubridate version 1.7.4, you can use parse_date and replace orders by format
The issue is that the formats within each column do not occur in the same order as in the formats vector already created. So, for the 'Transaction' column, we need something like
as.Date(df$Transaction, format = c("%d-%b-%Y", "%d-%m-%Y", "%d/%m/%Y"))
#[1] "2015-03-01" "2012-01-31" "1999-01-15"
i.e. the formats specified by the OP are
formats
#[1] "%d-%m-%Y" "%d/%m/%Y" "%d-%b-%Y"
if we check the 'Transaction' column
df$Transaction
#[1] 01-Mar-2015 31-01-2012 15/01/1999
Its entries follow %d-%b-%Y, %d-%m-%Y and %d/%m/%Y, in that order, which does not line up with the order in formats.
Also, to make it clearer: a vector passed as format is matched against the input element by element
as.Date(df$Transaction, format = c("%d-%b-%Y", "%d/%m/%Y"))
#[1] "2015-03-01" NA NA
i.e. "%d/%m/%Y" would have matched the third entry, but because the comparison is elementwise it is checked against the second entry instead. The format vector is then recycled, because it is shorter than the 'Transaction' column, so the third entry is checked against "%d-%b-%Y" again.
This implies that for a dataset of 1e6 rows, the formats (after recycling) must line up with all 1e6 elements.
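One way around the elementwise matching, sketched in base R: try each format against every element that is still unparsed and keep the first match. parse_multi is a hypothetical helper name (not from any package), and the order of formats matters if a string could match more than one of them.

```r
# Hypothetical helper: try each format in turn on the elements still NA.
parse_multi <- function(x, formats) {
  x <- as.character(x)                  # also handles factor columns
  out <- rep(as.Date(NA), length(x))
  for (f in formats) {
    todo <- is.na(out)                  # elements not parsed yet
    out[todo] <- as.Date(x[todo], format = f)
  }
  out
}

formats <- c("%d-%m-%Y", "%d/%m/%Y", "%d-%b-%Y")
transaction <- c("01-Mar-2015", "31-01-2012", "15/01/1999")
parse_multi(transaction, formats)
# in an English locale: "2015-03-01" "2012-01-31" "1999-01-15"
# (%b depends on the locale's month abbreviations)
```

For the whole data frame, something like df[] <- lapply(df, function(col) format(parse_multi(col, formats), "%d-%b-%Y")) should apply the same idea to every column.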
Or using anydate from anytime
library(anytime)
addFormats(c('%d-%m-%Y', '%d/%m/%Y'))
df[] <- lapply(df, function(x) format(anydate(x), "%d-%b-%Y"))
df
# Transaction Delivery Return
#1 01-Mar-2015 01-Feb-2018 27-Nov-2009
#2 31-Jan-2012 01-Aug-2016 22-Jan-2013
#3 15-Jan-1999 17-Sep-2007 20-Nov-1987

Change column with different formats into dates

I would like to convert a column of my data.frame to the Date format in R.
The problem is that the format of the column is not consistent.
Most rows are in the format "%Y-%m-%d" and I can convert them easily with the as.Date() function.
A few rows are in the format "%Y/%d/%m"; as.Date() cannot convert them and I get NAs instead.
input <- c("2019-01-22", "2019-04-17", "2019/27/05", "2019/13/05", "2019/15/06", "2019-07-30")
Input: Output:
Dates Dates
2019-01-22 2019-01-22
2019-04-17 2019-04-17
2019/27/05 2019-27-05
2019/13/05 2019-13-05
2019/15/06 2019-15-06
2019-07-30 2019-07-30
In your case, with "%Y-%m-%d" and "%Y/%d/%m", you might want to use as.Date with the format each string actually has. So, for example:
input <- c("2019-10-11", "2019/27/10", "2014-12-10")
If you use:
input2 <- ifelse(grepl("/",input), format(as.Date(input,"%Y/%d/%m"),"%Y-%m-%d"), input)
then:
> input2
[1] "2019-10-11" "2019-10-27" "2014-12-10"
If the two formats differ only in the separator (that is, the slash-separated strings are also in year/month/day order), you can replace every / with -:
Example:
input <- c("2019-10-11", "2019/10/12", "2014-10-13")
as.Date(gsub("/", "-", input), format = "%Y-%m-%d")
# [1] "2019-10-11" "2019-10-12" "2014-10-13"
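Note that the substitution only helps when the slash-separated strings use the same year-month-day order. A quick check, using two values from the question (where the slash format is year/day/month), suggests those rows would still come back as NA and need their own format:

```r
input <- c("2019-01-22", "2019/27/05")   # second string is %Y/%d/%m
swapped <- as.Date(gsub("/", "-", input), format = "%Y-%m-%d")
is.na(swapped)
# FALSE TRUE: "27" is not a valid month, so the second string is not parsed

# Parse the remaining rows with the format they actually have:
fixed <- swapped
fixed[is.na(swapped)] <- as.Date(input[is.na(swapped)], format = "%Y/%d/%m")
format(fixed)
# "2019-01-22" "2019-05-27"
```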
If only a few rows use the / format, I think the following code might be an efficient way to convert them:
output <- as.Date(input)
output[is.na(output)]<-as.Date(input[is.na(output)],format = "%Y/%d/%m")
such that
> output
[1] "2019-01-22" "2019-04-17" "2019-05-27" "2019-05-13" "2019-06-15" "2019-07-30"
We can use anydate from anytime
library(anytime)
anydate(input)
#[1] "2019-10-11" "2019-10-12" "2014-10-13"
Or using lubridate
library(lubridate)
ymd(input)
data
input <- c("2019-10-11", "2019/10/12", "2014-10-13")

Convert string to date with different types of time format

I have a dataframe as below
library(lubridate)
Date <- c("18.11.2016 21:03:41", "19.11.2016", "20.11.2016","21.11.2016")
df = data.frame(Date)
df
I want to get
df$Date
[1] "2016-11-18" "2016-11-19" "2016-11-20" "2016-11-21"
and I try to convert it to a date like this
df$Date = dmy(df$Date)
and I get
Warning message:
1 failed to parse.
How to fix it?
Try this, using toPOSIXct from the RcppBDT package:
library(RcppBDT)
s <- c("2004-03-21 12:45:33.123456", # ISO
"2004/03/21 12:45:33.123456", # variant
"20040321", # just dates work fine as well
"Mar/21/2004", # US format, also support month abbreviation or full
"rapunzel") # will produce a NA
p <- toPOSIXct(s)
options("digits.secs"=6) # make sure we see microseconds in output
print(format(p, tz="UTC"))
Read about it more here.
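For the data in the question, base R may already be enough. According to ?as.Date, character strings are processed only as far as the format requires and any trailing characters are ignored, so the time part of the first element can simply be left unparsed. A minimal sketch:

```r
Date <- c("18.11.2016 21:03:41", "19.11.2016", "20.11.2016", "21.11.2016")
df <- data.frame(Date, stringsAsFactors = FALSE)

# Only the leading day.month.year part is consumed; " 21:03:41" is ignored.
df$Date <- as.Date(df$Date, format = "%d.%m.%Y")
df$Date
# "2016-11-18" "2016-11-19" "2016-11-20" "2016-11-21"
```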

How to convert a date to YYYYDDD?

I can't figure out how to turn Sys.Date() into a number in the format YYYYDDD, where DDD is the day of the year, i.e. Jan 1 would be 2016001 and Dec 31 would be 2016366 (2016 is a leap year).
Date <- Sys.Date() ## The Variable Date is created as 2016-01-01
SomeFunction(Date) ## Returns 2016001
You can just use the format function as follows:
format(Date, '%Y%j')
which gives:
[1] "2016161" "2016162" "2016163"
If you want to format it in other ways, see ?strptime for all the possible options.
Alternatively, you could use the year and yday functions from the data.table or lubridate packages and combine them with sprintf, which zero-pads the day of the year to three digits (a plain paste0 would turn Jan 1 into "20161"):
library(data.table) # or: library(lubridate)
sprintf("%d%03d", year(Date), yday(Date))
which will give you the same result.
The values that are returned by both options are of class character. Wrap the above solutions in as.numeric() to get real numbers.
Used data:
> Date <- Sys.Date() + 1:3
> Date
[1] "2016-06-09" "2016-06-10" "2016-06-11"
> class(Date)
[1] "Date"
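To see the zero padding that %j provides, here is a small base R check on the first and last day of 2016 (a leap year, so Dec 31 is day 366). The manual version is only an illustration of what %j does; it uses as.POSIXlt, whose yday field is zero-based.

```r
d <- as.Date(c("2016-01-01", "2016-12-31"))

yj <- format(d, "%Y%j")              # %j is the zero-padded day of the year
yj
# "2016001" "2016366"

# The same result built by hand, padding the day of the year to 3 digits:
year <- as.integer(format(d, "%Y"))
doy  <- as.POSIXlt(d)$yday + 1L      # yday counts from 0 in POSIXlt
manual <- sprintf("%d%03d", year, doy)
identical(yj, manual)
# TRUE

as.numeric(yj)                       # numeric form: 2016001 2016366
```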
Here's one option with lubridate:
library(lubridate)
x <- Sys.Date()
#[1] "2016-06-08"
sprintf("%d%03d", year(x), yday(x)) # zero-pad so that early-January dates keep three day digits
#[1] "2016160"
This should work for creating a new column with the specified date format (%j is the day of the year; %d would give the day of the month):
Date <- Sys.Date()
df$Month_Yr <- format(as.Date(df$Date), "%Y%j")
But, especially when working with larger data sets, it can be easier to do the following:
library(data.table)
setDT(df)[, NewDate := format(as.Date(Date), "%Y%j")]
Hope this helps. You may have to tinker if you only want one value and are not working with a data set.

Converting dates from excel to R

I have difficulty converting dates from excel (reading from csv) to R. Help is much appreciated.
Here is what I'm doing:
df$date = as.Date(df$excel.date, format = "%d/%m/%Y")
However, some dates get converted but some not. Here is the output of:
head(df$date)
[1] NA NA NA "0006-01-05" NA NA
the first 5 entries imported from csv file are as follows:
7/28/05
7/28/05
12/16/05
5/1/06
4/21/05
and here is the output of:
head(df$excel.date)
[1] 7/28/05 7/28/05 12/16/05 5/1/06 4/21/05 1/25/07
1079 Levels: 1/1/00 1/1/02 1/1/97 1/10/96 1/10/99 1/11/04 1/11/94 1/11/96 1/11/97 1/11/98 ... 9/9/99
str(df)
.
.
$ excel.date : Factor w/ 1079 levels "1/1/00","1/1/02",..: 869 869 288 618 561 48 710 1022 172 241 ...
First of all, make sure you have the dates in your file in an unambiguous format, using full years (not just 2 last numbers). %Y is for "year with century" (see ?strptime) but you don't seem to have century. So you can use %y (at your own risk, see ?strptime again) or reformat the dates in Excel.
It is also a good idea to use as.is=TRUE with read.csv when reading in these data -- otherwise character vectors are converted to factors which can lead to unexpected results.
And on Windows it may be easier to use RODBC to read in dates directly from an xls or xlsx file.
(edit)
The following may give a hint:
> as.Date("13/04/2014", format= "%d/%m/%Y")
[1] "2014-04-13"
> as.Date(factor("13/04/2014"), format= "%d/%m/%Y")
[1] "2014-04-13"
> as.Date(factor("13/04/14"), format= "%d/%m/%Y")
[1] "14-04-13"
> as.Date(factor("13/04/14"), format= "%d/%m/%y")
[1] "2014-04-13"
(So as.Date can actually take care of factors - the magic happens in the as.Date.factor method, defined as:
function (x, ...) as.Date(as.character(x), ...)
It is not a good idea to represent dates as factors but in this case it is not a problem either. I think the problem is excel which saves your years as 2-digit numbers in a CSV file, without asking you.)
-
The ?strptime help file says that using %y is platform specific - you can have different results on different machines. So if there's no way of going back to the source and saving the csv in a better way, you might use something like the following:
x <- c("7/28/05", "7/28/05", "12/16/05", "5/1/06", "4/21/05", "1/25/07")
repairExcelDates <- function(x, yearcol=3, fmt="%m/%d/%Y") {
x <- do.call(rbind, lapply(strsplit(x, "/"), as.numeric))
year <- x[,yearcol]
if(any(year>99)) stop("don't know what to do")
x[,yearcol] <- ifelse(year <= as.numeric(format(Sys.Date(), "%y")), year+2000, year + 1900)
# if the 2-digit year <= the current 2-digit year then add 2000, otherwise add 1900
x <- apply(x, 1, paste, collapse="/")
as.Date(x, format=fmt)
}
repairExcelDates(x)
# [1] "2005-07-28" "2005-07-28" "2005-12-16" "2006-05-01" "2005-04-21"
# [6] "2007-01-25"
Your data is formatted as Month/Day/Year with two-digit years, so
df$date = as.Date(df$excel.date, format = "%d/%m/%Y")
should be
df$date = as.Date(df$excel.date, format = "%m/%d/%y")
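A small check on a few of the values from the question illustrates the difference between the two format strings. This is only a sketch on a hand-built factor, not on the original csv:

```r
x <- factor(c("7/28/05", "12/16/05", "5/1/06"))

# Month/day order with a two-digit year (%y):
good <- as.Date(x, format = "%m/%d/%y")
format(good)
# "2005-07-28" "2005-12-16" "2006-05-01"

# The format from the question: day/month order with a four-digit year (%Y).
# "7/28/05" has no month 28, so it becomes NA; "5/1/06" is read as
# 5 January of the year 6.
bad <- as.Date(x, format = "%d/%m/%Y")
is.na(bad)
# TRUE TRUE FALSE
```

As noted above, how %y maps two-digit years to a century is described in ?strptime as platform specific, so it is worth checking the result on your own system.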
