How can I fix corrupted dates in R?

I have a dataset as follows:
19/9/1997
22/9/1997
23/9/1997
24/9/1997
25/9/1997
26/9/1997
29/9/1997
30/9/1997
35440
35471
35499
35591
35621
35652
35683
35713
13/10/1997
14/10/1997
15/10/1997
16/10/1997
17/10/1997
20/10/1997
21/10/1997
22/10/1997
23/10/1997
24/10/1997
27/10/1997
28/10/1997
29/10/1997
30/10/1997
31/10/1997
35500
35531
35561
35592
35622
35714
35745
35775
13/11/1997
14/11/1997
17/11/1997
18/11/1997
19/11/1997
20/11/1997
21/11/1997
24/11/1997 ...
The data that should be there are (for reproduction, as requested):
19/9/1997
22/9/1997
23/9/1997
24/9/1997
25/9/1997
26/9/1997
29/9/1997
30/9/1997
10/01/1997
10/02/1997
10/03/1997
10/06/1997
10/07/1997
10/08/1997
10/09/1997
10/10/1997
13/10/1997
14/10/1997
15/10/1997
16/10/1997
17/10/1997
20/10/1997
21/10/1997
22/10/1997
23/10/1997
24/10/1997
27/10/1997
28/10/1997
29/10/1997
30/10/1997
31/10/1997
11/03/1997
11/04/1997
11/05/1997
11/06/1997
11/07/1997
11/10/1997
11/11/1997
11/12/1997
13/11/1997
14/11/1997
17/11/1997
18/11/1997
19/11/1997
20/11/1997
21/11/1997
24/11/1997
I have 5,149 rows of dates in which some dates have been replaced by numbers. I tried fixing the corrupted dates with this:
Attempt 1 (before revision):
rm (list = ls(all=TRUE))
graphics.off()
library(readxl)
Dates <- read_excel("F:/OneDrive - University of Tasmania/Mardi Meetings/Dataset/Dates.xlsx")
x<-Dates[,1]
library(date)
library(datetime)
ans <- Reduce(function(prev, curr) {
f1 <- as.Date(curr, "%d/%m/%Y")
f2 <- as.Date(curr, "%m/%d/%Y")
if (is.na(f1)) return(f2)
if (is.na(f2)) return(f1)
if (prev < f1 && prev < f2) return(min(f1, f2))
if (prev < f1) return(f1)
if (prev < f2) return(f2)
}, x[-1], init=as.Date(x[1], "%d/%m/%Y"), accumulate=TRUE)
as.Date(ans, origin="1970-01-01")
But I am getting the following error:
+ }, x[-1], init=as.Date(x[1], "%d/%m/%Y"), accumulate=TRUE)
Error in Reduce(function(prev, curr) { : object 'x' not found
>
> as.Date(ans, origin="1970-01-01")
Error in as.Date(ans, origin = "1970-01-01") : object 'ans' not found
Any suggestions will be highly appreciated.
Following the advice, I revised the code. Attempt 2 (after revision):
> rm (list = ls(all=TRUE))
> graphics.off()
> library(readxl)
> Dates <- read_excel("F:/OneDrive - University of Tasmania/Mardi Meetings/Dataset/Dates.xlsx")
> dput(head(Dates))
structure(list(Date = c("33274", "33302", "33394", "33424", "33455",
"33486")), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
> x<-Dates[[1]]
> library(date)
> library(datetime)
Attaching package: ‘datetime’
The following object is masked from ‘package:date’:
as.date
> dates <- as.Date(x, format="%d/%m/%Y")
> dput(head(dates))
structure(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), class = "Date")
> head(dates,10)
[1] NA NA NA NA NA NA NA
[8] "1991-05-13" "1991-05-14" "1991-05-15"
As you can see, I have lost the corrupted dates completely.
Today, on the 28th, I tried again:
> rm (list = ls(all=TRUE))
> graphics.off()
> library(readxl)
> Dates <- read_excel("F:/OneDrive - University of Tasmania/Mardi Meetings/Dataset/Dates.xlsx")
> x<-Dates[[1]]
>
> library(date)
> library(datetime)
Attaching package: ‘datetime’
The following object is masked from ‘package:date’:
as.date
> formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")
> dates <- as.Date(rep(NA, length(x)))
> for (fmt in formats) {
+ nas <- is.na(dates)
+ dates[nas] <- as.Date(as.integer(x[nas], format=fmt))
+ }
Error in as.Date.numeric(as.integer(x[nas], format = fmt)) :
'origin' must be supplied
In addition: Warning message:
In as.Date(as.integer(x[nas], format = fmt)) : NAs introduced by coercion
> dates <- as.Date(x, format="%d/%m/%Y")
> head(dates)
[1] NA NA NA NA NA NA
> head(dates, 10)
[1] NA NA NA NA NA NA NA
[8] "1991-05-13" "1991-05-14" "1991-05-15"

You need none of the packages you've loaded, nor do you need Reduce, since the functions used here are naturally "vectorized".
Here's a sample of your data. (A good question includes data in an easily copied format such as this.)
x <- c("19/9/1997", "22/9/1997", "23/9/1997", "24/9/1997", "25/9/1997",
"26/9/1997", "29/9/1997", "30/9/1997",
"35440", "35471", "35499", "35591", "35621",
"35652", "35683", "35713")
dates <- as.Date(x, format="%d/%m/%Y")
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" NA NA
# [11] NA NA NA NA NA
# [16] NA
Not surprisingly, the second half of the vector is not recognized given format="%d/%m/%Y". You mentioned "%m/%d/%Y" in your question, so we can (1) do a literal second pass with that format (it matches nothing in this example, but may still be relevant for your full data):
dates[is.na(dates)] <- as.Date(x[is.na(dates)], format="%m/%d/%Y")
where [is.na(dates)] restricts the work to the not-yet-converted elements.
(2) If you have several candidate formats, you can put them in a vector and loop over them. (I'll start over here, since this loop replaces the steps above.)
formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")
dates <- as.Date(rep(NA, length(x)))
for (fmt in formats) {
nas <- is.na(dates)
dates[nas] <- as.Date(x[nas], format=fmt)
}
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" NA NA
# [11] NA NA NA NA NA
# [16] NA
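One caveat with looping over formats, sketched below under the assumption that your full data may contain strings whose day is 12 or less: such strings are valid under both "%m/%d/%Y" and "%d/%m/%Y", so whichever format comes first in the vector decides the result. The string "01/10/1997" is a made-up example, not taken from your data.

```r
# A string that is valid under both formats (hypothetical example)
ambiguous <- "01/10/1997"

as.Date(ambiguous, format = "%m/%d/%Y")
# [1] "1997-01-10"
as.Date(ambiguous, format = "%d/%m/%Y")
# [1] "1997-10-01"
```

If your strings are day-first, as the sample suggests, putting "%d/%m/%Y" first in formats is the safer order.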
This still leaves NAs for the integer-looking entries. For these you need to convert to integer and supply origin=. R itself counts days from "1970-01-01", which you can confirm with
as.integer(Sys.Date())
# [1] 17787
Sys.Date() - 17787
# [1] "1970-01-01"
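but your numbers look like Excel serial dates. Excel for Windows nominally counts from 1900-01-01 as day 1 but wrongly treats 1900 as a leap year, so the origin to use in R is "1899-12-30" (see the examples in ?as.Date). A rough check confirms an origin around 1900:
x[9] # the first integer-looking element
# [1] "35440"
dates[1] - as.integer(x[9])
# [1] "1900-09-08"
(I'm assuming that your dates are from the same relative period of time.)
From here:
nas <- is.na(dates)
dates[nas] <- as.Date(as.integer(x[nas]), origin="1899-12-30")
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" "1997-01-10" "1997-02-10"
# [11] "1997-03-10" "1997-06-10" "1997-07-10" "1997-08-10" "1997-09-10"
# [16] "1997-10-10"
These serials decode to the 10th of various months from January to October, while the neighbouring rows run from 30/9/1997 to 13/10/1997, so Excel probably read day/month strings such as 1/10/1997 as month/day. If so, day and month in these entries still need to be swapped.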
(Indexing by the NA positions is reasonably efficient, since each pass only touches the not-yet-matched entries. If nothing is left, as.Date is still called, but with a zero-length argument, which is cheap. I don't think wrapping each pass in if (any(nas)) ... would help here, though it may be worth considering if a later pass is more "expensive".)
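Putting the steps together, here is a minimal sketch of a helper; it is not part of the answer above, and the name parse_mixed_dates and its arguments are hypothetical. It assumes that the digit-only entries are Windows Excel serial numbers (origin "1899-12-30", as in the examples in ?as.Date) and that Excel read day/month strings as month/day, which the expected data in the question suggests. Set swap_day_month = FALSE if that second assumption does not hold for your file.

```r
# Hypothetical helper: parse a character vector that mixes d/m/Y strings
# with Excel serial numbers stored as text.
parse_mixed_dates <- function(x, fmt = "%d/%m/%Y",
                              origin = "1899-12-30",
                              swap_day_month = TRUE) {
  x <- trimws(as.character(x))
  out <- as.Date(x, format = fmt)          # string dates first

  # Entries made only of digits are treated as Excel serials
  serial <- is.na(out) & grepl("^[0-9]+$", x)
  d <- as.Date(as.integer(x[serial]), origin = origin)

  if (swap_day_month) {
    # Assumption: Excel read "1/10/1997" (1 Oct) as 10 Jan, so swap back
    d <- as.Date(format(d, "%Y-%d-%m"), format = "%Y-%m-%d")
  }
  out[serial] <- d
  out
}

x <- c("30/9/1997", "35440", "35471", "13/10/1997")
parse_mixed_dates(x)
# [1] "1997-09-30" "1997-10-01" "1997-10-02" "1997-10-13"
```

With the swap applied, the recovered dates fall between their neighbours in chronological order, which is a quick sanity check worth running on the full column.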

Related

Problem with dates imported from Excel into R [duplicate]

function to automatically recognize the word "date" contained in variable column and changing the entries into date format (despite the heterogeneity) [duplicate]

Use as.Date with tryFormats to parse dates with different formats

I have a variable with dates in two different formats ("%Y-%m-%d" and "%m/%d/%Y"):
dput(df)
structure(1:8, .Label = c("2019-04-07", "2019-04-08", "2019-04-09",
"2019-04-10", "7/29/2019", "7/30/2019", "7/31/2019", "8/1/2019"
), class = "factor")
# [1] 2019-04-07 2019-04-08 2019-04-09 2019-04-10 7/29/2019 7/30/2019 7/31/2019 8/1/2019
# 8 Levels: 2019-04-07 2019-04-08 2019-04-09 2019-04-10 7/29/2019 7/30/2019 ... 8/1/2019
I tried to parse the dates using as.Date with tryFormats:
df <- as.character(df)
d <- as.Date(df, tryFormats = c("%Y-%m-%d", "%m/%d/%Y"))
This converts the entries in the first format but returns NA for those in the second. If I run the two formats separately, each parses its own entries correctly:
t1 <- as.Date(df, format = "%Y-%m-%d")
t2 <- as.Date(df, format = "%m/%d/%Y")
t1
# [1] "2019-04-07" "2019-04-08" "2019-04-09" "2019-04-10" NA
# [6] NA NA NA
t2
# [1] NA NA NA NA "2019-07-29"
# [6] "2019-07-30" "2019-07-31" "2019-08-01"
Any suggestions? I've looked through other responses, but haven't found any good tryFormats examples/questions that seem to address this.
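tryFormats selects a single format: as.Date tries each format on the first non-NA element and then applies the first one that works to the whole vector. In your case you can convert the two formats separately, as you have already done, and combine the results.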
d <- as.Date(df,format="%Y-%m-%d")
d[is.na(d)] <- as.Date(df[is.na(d)],format="%m/%d/%Y")
d
#[1] "2019-04-07" "2019-04-08" "2019-04-09" "2019-04-10" "2019-07-29"
#[6] "2019-07-30" "2019-07-31" "2019-08-01"
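To see the single-format behaviour of tryFormats directly, here is a small sketch with a two-element vector (the values are taken from the question; the variable names are only for illustration). Whichever format matches the first element is applied to everything, so reversing the input changes which entry survives.

```r
x <- c("2019-04-07", "7/29/2019")
fmts <- c("%Y-%m-%d", "%m/%d/%Y")

# The first element matches "%Y-%m-%d", so that format is used for all
a <- as.Date(x, tryFormats = fmts)
a
# [1] "2019-04-07" NA

# Reversing the input makes "%m/%d/%Y" the format that is chosen
b <- as.Date(rev(x), tryFormats = fmts)
b
# [1] "2019-07-29" NA
```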
We can use anydate from the anytime package:
library(anytime)
anydate(df)
If one of your formats is not recognized, register it with addFormats() and then apply the function.
Or with lubridate:
library(lubridate)
as.Date(parse_date_time(df, c("ymd", "mdy")))
For a base R solution, you may try the following:
> df
#[1] "2019-04-07" "2019-04-08" "2019-04-09" "2019-04-10" "7/29/2019" "7/30/2019"
#"7/31/2019" "8/1/2019"
fmts <- c("%Y-%m-%d","%m/%d/%Y")
as.Date(apply(outer(df, fmts, as.Date),1,na.omit),'1970-01-01')
#[1] "2019-04-07" "2019-04-08" "2019-04-09" "2019-04-10" "2019-07-29" "2019-07-30" "2019-07-31" "2019-08-01"

I want to convert a factor to a date; the factor looks like this: 2011-05-05:16:30:04.466

I tried both the lubridate package and as.Date(), but neither works:
# the factor
> x
[1] '2011-05-05:16:30:04.466 '
873 Levels: '2011-05-05:16:30:04.466 ' ... '2017-08-10:20:05:51.406967'
# try 1
> as.Date(x, format = "%m/%d/%Y")
[1] NA
# try 2
> xx <- mdy(x)
Warning message:
All formats failed to parse. No formats found.
> xx
[1] NA
> xx <- mdy_hms(x)
Warning message:
All formats failed to parse. No formats found.
Can someone help me?
To add to the other answer by Jason Clark, there is also as.POSIXct, if you want to keep the times.
getOption("digits.secs")
#NULL
options(digits.secs = 6)
x <- factor('2011-05-05:16:30:04.466')
y <- as.POSIXct(x, format = "%Y-%m-%d:%H:%M:%OS")
y
#[1] "2011-05-05 16:30:04.466 BST"
class(y)
#[1] "POSIXct" "POSIXt"
It looks like your problem is the format: "%m/%d/%Y" does not match strings that start with the year. By default as.Date tries "%Y-%m-%d" (then "%Y/%m/%d"), which handles the example; the time part after the date is ignored.
> as.Date(as.factor('2011-05-05:16:30:04.466 '), format = "%m/%d/%Y")
[1] NA
> as.Date(as.factor('2011-05-05:16:30:04.466 '))
[1] "2011-05-05"
> as.Date(as.factor('2011-05-05:16:30:04.466 '), format = "%Y-%m-%d")
[1] "2011-05-05"
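If the strings carry stray whitespace, as the level shown in the question appears to, one possible route is to clean them and parse the full timestamp before dropping the time. This is a minimal sketch: the sample value, the call to trimws() and the choice of tz = "UTC" are assumptions, not something stated in the question.

```r
x <- factor("2011-05-05:16:30:04.466 ")

# Convert the factor to character first, then drop surrounding whitespace
s <- trimws(as.character(x))

# Parse date and time; %OS accepts fractional seconds on input
p <- as.POSIXct(s, format = "%Y-%m-%d:%H:%M:%OS", tz = "UTC")

# Keep only the calendar date
d <- as.Date(p, tz = "UTC")
d
# [1] "2011-05-05"
```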

Create vector of character strings in R using for loop

I'm trying to create a vector of dates (formatted as character strings not as dates) using a for loop. I've reviewed a few other SO questions such as (How to create a vector of character strings using a loop?), but they weren't helpful. I've created the following for loop:
start_dates <- c("1993-12-01")
j <- 1
start_dates <- for(i in 1994:as.numeric(format(Sys.Date(), "%Y"))){
date <- sprintf("%s-01-01", i)
j <- j + 1
start_dates[j] <- date
}
However, it returns a NULL (empty) vector start_dates. When I increment the i index manually it works. For example:
> years <- 1994:as.numeric(format(Sys.Date(), "%Y"))
> start_dates <- c("1993-12-01")
> j <- 1
> i <- years[1]
> date <- sprintf("%s-01-01", i)
> j <- j + 1
> start_dates[j] <- date
> start_dates
[1] "1993-12-01" "1994-01-01"
> i <- years[2]
> date <- sprintf("%s-01-01", i)
> j <- j + 1
> start_dates[j] <- date
> start_dates
[1] "1993-12-01" "1994-01-01" "1995-01-01"
It must have something to do with the construction of my for() statement, but I can't figure it out. I'm sure it's super simple. Thanks in advance.
What is wrong with:
sprintf("%s-01-01", 1994:2015)
> sprintf("%s-01-01", 1994:2015)
[1] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[6] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[11] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[16] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[21] "2014-01-01" "2015-01-01"
sprintf() is fully vectorised, take advantage of this.
Problems with your loop
The main problem is that you assign the value returned by for() to start_dates once the loop finishes, overwriting everything the loop built up. This is effectively what is happening:
j <- 1
foo <- for (i in 1:10) {
j <- j + 1
}
foo
> foo
NULL
And reading ?'for' we see that this behaviour is by design:
Value:
....
‘for’, ‘while’ and ‘repeat’ return ‘NULL’ invisibly.
Solution: Don't assign the returned value of for(). Hence the template might be:
for(i in foo) {
# ... do stuff
start_dates[j] <- bar
}
Fix that and there is one more thing to watch: you start with j <- 1 and increment j before assigning, so the first date written by the loop goes into element 2. That works here only because start_dates already holds "1993-12-01" in element 1.
This would be easier if you made i take values from a sequence 1, 2, ..., n rather than the actual years you want. You can use i to index the years vector and as an index for the elements of start_dates too.
Not that you should do the loop this way, but if you wanted to...
years <- seq.int(1994, 2015)
start_dates <- character(length(years))  # preallocate a character vector
for (i in seq_along(years)) {
start_dates[i] <- sprintf("%s-01-01", years[i])
}
which would give:
> start_dates
[1] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[6] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[11] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[16] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[21] "2014-01-01" "2015-01-01"
Sometimes it is helpful to loop over the actual values in a vector (as you did) rather than its indices (as I just did), but only in specific cases. For general operations like the one you have here, it is just an additional complication you need to work around. That said, think about doing vectorised operations in R before resorting to a loop.
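For completeness, a vectorised sketch that reproduces what the loop in the question was aiming for, including the initial "1993-12-01" element and an end year taken from the current date. The name this_year is only for illustration.

```r
# Current year as an integer, as in the question's loop bounds
this_year <- as.integer(format(Sys.Date(), "%Y"))

# Prepend the fixed first element to the vectorised sprintf() result
start_dates <- c("1993-12-01", sprintf("%d-01-01", 1994:this_year))

head(start_dates, 3)
# [1] "1993-12-01" "1994-01-01" "1995-01-01"
```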
You shouldn't assign the loop to a variable. Do:
start_dates <- c("1993-12-01")
j <- 1
for(i in 1994:as.numeric(format(Sys.Date(), "%Y"))){ #use the for-loop on its own. Don't assign it to a variable
date <- sprintf("%s-01-01", i )
j <- j + 1
start_dates[j] <- date
}
and you are fine:
> start_dates
[1] "1993-12-01" "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01" "1999-01-01" "2000-01-01" "2001-01-01"
[10] "2002-01-01" "2003-01-01" "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01" "2009-01-01" "2010-01-01"
[19] "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01" "2015-01-01"
