Parsing dates in R from strings with multiple formats - r

I have a tibble in R with about 2,000 rows. It was imported from Excel using read_excel. One of the fields is a date field: dob. It imported as a string, and has dates in three formats:
"YYYY-MM-DD"
"DD-MM-YYYY"
"XXXXX" (ie, a five-digit Excel-style date)
Let's say I treat the column as a vector.
dob <- c("1969-02-02", "1986-05-02", "34486", "1995-09-05", "1983-06-05",
"1981-02-01", "30621", "01-05-1986")
I can see that I probably need a solution that uses both parse_date_time and as.Date.
If I use parse_date_time:
dob_fixed <- parse_date_time(dob, c("ymd", "dmy"))
This fixes them all, except the five-digit one, which returns NA.
I can fix the five-digit one, by using as.integer and as.Date:
dob_fixed2 <- as.Date(as.integer(dob), origin = "1899-12-30")
Ideally I would run one and then the other, but because each returns NA on the strings that don't work I can't do that.
Any suggestions for doing all? I could simply change them in Excel and re-import, but I feel like that's cheating!

Create a logical index from the NA values left by the first run, and use it to select the elements for the second run:
i1 <- is.na(dob_fixed)
dob_fixed[i1] <- as.Date(as.integer(dob[i1]), origin = "1899-12-30")
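The same two-pass idea can also be sketched in base R alone by dispatching on the shape of each string, so every value is only parsed with the format it matches. The helper name parse_mixed_dob and the three regular expressions are assumptions made for illustration; they are not part of the original answer and only cover the three formats listed in the question.

```r
# Hypothetical helper (not from the original answer): choose a parser per element
# based on the pattern of the string.
parse_mixed_dob <- function(x) {
  out <- rep(as.Date(NA), length(x))            # start with all-NA Dates
  ymd    <- grepl("^\\d{4}-\\d{2}-\\d{2}$", x)  # "YYYY-MM-DD"
  dmy    <- grepl("^\\d{2}-\\d{2}-\\d{4}$", x)  # "DD-MM-YYYY"
  serial <- grepl("^\\d{5}$", x)                # five-digit Excel serial
  out[ymd]    <- as.Date(x[ymd], format = "%Y-%m-%d")
  out[dmy]    <- as.Date(x[dmy], format = "%d-%m-%Y")
  out[serial] <- as.Date(as.integer(x[serial]), origin = "1899-12-30")
  out                                           # anything unmatched stays NA
}

dob <- c("1969-02-02", "1986-05-02", "34486", "1995-09-05", "1983-06-05",
         "1981-02-01", "30621", "01-05-1986")
dob_fixed <- parse_mixed_dob(dob)
```

Values that match none of the three patterns are left as NA, which makes unexpected formats easy to spot with is.na(dob_fixed).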

Related

How do I import an Excel spreadsheet where there is a date column that has text and numerical date formats? e.g. ("3-Dec-19", "2019-05-04", "43787")

If I have multiple spreadsheets with date column that have different formats, is it possible to convert the dates if some of the rows have the numeric format?
If I import the columns as character, I would still need to convert them to dates. I believe parse_date_time will fix anything except the numeric format.
The following will convert the first two values but not the numeric one. I don't think this function has an option for numeric dates.
Is there a function that can process both Text and Numeric dates?
x<- c("2019-12-05","8-Dec-19","43787")
lubridate::parse_date_time(x, c("ymd", "d-b-y"))
It's a bit clunky, but you can take a second pass through the data with janitor::excel_numeric_to_date() for any values that are numeric and failed to parse via parse_date_time ... (if you're going to use this often, you can write a wrapper function - you might also want to suppress the warning messages from parse_date_time and as.numeric ...)
x <- c("2019-12-05","8-Dec-19","43787")
y <- lubridate::parse_date_time(x, c("ymd", "d-b-y"))
exceld <- is.na(y) & !is.na(as.numeric(x))
y[exceld] <- janitor::excel_numeric_to_date(as.numeric(x[exceld]))
Perhaps there is a better approach, but I was able to write a function that handles both the numeric and text versions of the date format, using the tryFormats argument of as.Date.
dfix <- function(x1) {
  dout <- c()
  for (x in x1) {
    d <- NA_real_  # fallback so an unmatched value yields NA instead of reusing the previous result
    if (grepl('^\\d{5}$', x)) {  # numeric Excel date (exactly 5 digits)
      d <- as.Date(as.numeric(x), origin = "1899-12-30")
    } else if (grepl('-', x)) {  # text dates containing "-": try each format in turn
      d <- as.Date(x, tryFormats = c("%Y-%m-%d", "%d-%b-%y"))
    }
    dout <- c(dout, d)  # c() drops the Date class here, leaving day counts
  }
  as.Date(dout, origin = '1970-01-01')  # restore the Date class
}
dfix(c("2019-12-05","8-Dec-19","43787"))
[1] "2019-12-05" "2019-12-08" "2019-11-18"

regex single digit

I have a question which I think is solved by regex use in R.
I have a set of dates (as chr) which I would like in a different format (as chr).
I have tried to fool around with the below examples where the first (new_dates) gives the right format for months 1-9 and wrong for 10-12 and (new_dates2) gives the right format for 10-12 but nothing for 1-9.
I see that the code in the first case matches a single digit twice for 10-12, but don't really know how to tell it to match only single digit.
The final vector of correct dates shows the result I would like.
dates <- c("1/2016", "2/2016", "3/2016", "4/2016", "5/2016", "6/2016", "7/2016", "8/2016", "9/2016", "10/2016", "11/2016", "12/2016", "1/2017")
new_dates <- sub("(\\d)[:/:](\\d{4})","\\2M0\\1", dates)
new_dates2 <- sub("(\\d{2})[:/:](\\d{4})","\\2M\\1", dates)
correctdates <- c("2016M01", "2016M02", "2016M03", "2016M04", "2016M05", "2016M06", "2016M07", "2016M08", "2016M09", "2016M10", "2016M11", "2016M12", "2017M01")
Here's a base R method that will return the desired format:
format(as.Date(paste0("1/",dates), "%d/%m/%Y"), "%YM%m")
[1] "2016M01" "2016M02" "2016M03" "2016M04" "2016M05" "2016M06" "2016M07" "2016M08" "2016M09"
[10] "2016M10" "2016M11" "2016M12" "2017M01"
The idea is to first convert to a Date object and then use the format function to create the desired character representation. I pasted on 1/ so that a day is present in each element.
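As #a p o m said, it might be better to look for another solution if you are manipulating dates, but if you want to stick with regular expressions, this pattern matches only valid month numbers:
([02-9]|1[0-2]?)[:\/](\d{4})
A single sub() call cannot add the leading zero conditionally (a fixed "\\2M0\\1" replacement turns 10 into "2016M010"), so pad the single-digit months first and then reorder:
padded <- sub("^(\\d)/", "0\\1/", dates)
new_dates <- sub("^(\\d{2})/(\\d{4})$", "\\2M\\1", padded)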
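If a regular expression is not a hard requirement, the zero padding can also be left to sprintf. The following is a minimal sketch, assuming every element has exactly the form "month/year"; the variable names parts and padded_dates are introduced here for illustration only.

```r
dates <- c("1/2016", "9/2016", "10/2016", "12/2016", "1/2017")

# Split each string on "/" into a two-column character matrix: month, year
parts <- do.call(rbind, strsplit(dates, "/", fixed = TRUE))

# %02d pads the month to two digits; the year is passed through as a string
padded_dates <- sprintf("%sM%02d", parts[, 2], as.integer(parts[, 1]))
padded_dates
```

Because the month is converted to an integer before formatting, one-digit and two-digit months go through the same code path.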

Reading Time column in R

I am reading a Excel file with time as a column.
This column has values like
23:29:04
23:04:31
21:55:37
21:52:27
21:49:53
When I read this column in R, it comes through as numeric values:
0.961469907
0.913622685
0.911423611
0.907094907
0.906250000
0.899490741
There is no correspondence between above mentioned Excel and R column values. These are just samples.
I tried using
strptime(TimeStamp,format="%H:%M:%S")
It gives all values as NA.
Please suggest how to read time correctly in R.
These numbers are fractions of a day corresponding to times. Time objects are, e.g., implemented in package chron:
library(chron)
x <- c(0.961469907, 0.913622685, 0.911423611, 0.907094907, 0.906250000, 0.899490741)
x <- times(x)
print(x)
#[1] 23:04:31 21:55:37 21:52:27 21:46:13 21:45:00 21:35:16
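The conversion itself is plain arithmetic: a day has 86400 seconds, so multiplying the fraction by 86400 gives seconds since midnight. A minimal base R sketch of that step, without chron, is shown below; the helper name fraction_to_hms is hypothetical, and rounding to whole seconds is an assumption about the precision wanted.

```r
# Hypothetical helper: convert Excel day fractions to "HH:MM:SS" strings
fraction_to_hms <- function(x) {
  secs <- as.integer(round(x * 86400))   # seconds since midnight
  sprintf("%02d:%02d:%02d",
          secs %/% 3600L,                # whole hours
          (secs %% 3600L) %/% 60L,       # remaining whole minutes
          secs %% 60L)                   # remaining seconds
}

hms <- fraction_to_hms(c(0.961469907, 0.913622685, 0.906250000))
hms
```

The result is a character vector, so it suits display or export; for arithmetic on times, a dedicated class such as the chron times object above is more convenient.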
Read the columns as string and wrap your strptime command with as.POSIXct:
as.POSIXct(strptime(TimeStamp,format="%H:%M:%S"))

How to lag dates in form of strings in R

The following vector of Dates is given in form of a string sequence:
d <- c("01/09/1991","01/10/1991","01/11/1991","01/12/1991")
As an example, I would like to lag this vector by 1 month, that is, to produce the following structure:
d <- c("01/08/1991","01/09/1991","01/10/1991","01/11/1991")
My data is much larger and I must impose higher lags as well, but this seems to be the basis I need to know.
I would like to end up with the same format again ("%d/%m/%Y"). How can this be done in R? I found a couple of packages (e.g. lubridate), but I always have to convert between formats (strings, dates and more), so it is a bit messy and seems error-prone.
edit: some more info on why I want to do this: I am using this vector as rownames of a matrix, so I would prefer a solution where the final outcome is a string vector again.
This does not use any packages. We convert to "POSIXlt" class, subtract one from the month component and convert back:
fmt <- "%d/%m/%Y"
lt <- as.POSIXlt(d, format = fmt)
lt$mon <- lt$mon - 1
format(lt, format = fmt)
## [1] "01/08/1991" "01/09/1991" "01/10/1991" "01/11/1991"
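Since the question mentions higher lags, the same POSIXlt idea can be wrapped in a small function. This is a sketch under assumptions: the name lag_months and its arguments are made up for illustration, and the result is passed through as.Date so that a month component outside 0 to 11 (for example when the lag crosses a year boundary) is normalised before formatting.

```r
# Hypothetical helper: shift date strings back by k months, keeping the format
lag_months <- function(d, k, fmt = "%d/%m/%Y") {
  lt <- as.POSIXlt(d, format = fmt, tz = "UTC")
  lt$mon <- lt$mon - k            # may leave the 0..11 range
  format(as.Date(lt), fmt)        # as.Date() normalises month and year
}

d <- c("01/09/1991", "01/10/1991", "01/11/1991", "01/12/1991")
lagged1 <- lag_months(d, 1)
lagged3 <- lag_months(c("01/09/1991", "01/01/1992"), 3)
```

The output is a character vector in the input format, so it can be used directly as row names.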
My solution uses lubridate, but it does return what you want in the specified format:
require(lubridate)
d <- c("01/09/1991","01/10/1991","01/11/1991","01/12/1991")
format(as.Date(d,format="%d/%m/%Y")-months(1),'%d/%m/%Y')
[1] "01/08/1991" "01/09/1991" "01/10/1991" "01/11/1991"
You can then change the lag and (if you want) the output (which is this part : '%d/%m/%Y') by specifying what you want.

Sprintf Function and Character Dates

I have a data set in which I want to pad zeroes in front of a set of dates that don't have six characters. For example, I have a date that reads 91003 (October 3rd, 2009) and I want it to read 091003, as well as any other date that is missing a zero in front. When I use the sprintf function, the code is:
data1$entrydate <- sprintf("%06d", data1$entrydate)
But what it spits out is something like 000127, or some other seemingly random number, for all the other dates in the column. I don't understand what's going on, and I would appreciate some help on the issue. Thanks.
PS. I sometimes also get an error message saying that sprintf is only for character values; I don't know if there is any code for numerical values.
I guess you got different results than expected because the column class was factor. You can convert the column to numeric with either as.numeric(as.character(datacolumn)) or as.numeric(levels(datacolumn))[datacolumn]. According to ?factor
To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
So, you can use
levels(data1$entrydate) <- sprintf('%06d', as.numeric(levels(data1$entrydate)))
Example
Here is an example that shows the problem
v1 <- factor(c(91003, 91104,90103))
sprintf('%06d', v1)
#[1] "000002" "000003" "000001"
Or, it is equivalent to
sprintf('%06d', as.numeric(v1)) #the formatted numbers are
# the numeric index of factor levels.
#[1] "000002" "000003" "000001"
When you convert the levels back to numeric, it works as expected (the result is in level order):
sprintf('%06d', as.numeric(levels(v1)))
#[1] "090103" "091003" "091104"
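Note that formatting levels(v1) returns the padded values in level order rather than in the order of the original vector. A minimal sketch that pads each element in place goes through as.character first; the variable name padded is introduced here for illustration.

```r
v1 <- factor(c(91003, 91104, 90103))

# as.character() recovers the labels element by element, so the
# original order of the vector is preserved after padding
padded <- sprintf("%06d", as.integer(as.character(v1)))
padded
```

This yields one padded string per element, which is what is needed when assigning back to a data frame column.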
