I am using R to format a character variable that holds dates in two different forms: MM-DD-YYYY strings and Excel serial numbers (days counted from the Excel origin date).
DateVar <- c("12-07-2017", "43229", "43137", "03-27-2018")
I created a vector using grepl to identify both types and then a for loop to apply the as.date function to only the "excel origin dates".
indicator <- !grepl("-", DateVar)
for(i in indicator == TRUE){
as.date(DateVar, origin = "1899-12-30")
}
It is not working for me, however, so I am hoping someone can point me in the right direction.
Thanks.
Couple of things: The for loop is unnecessary - just subset DateVar with [indicator]. Second, it's as.Date, not as.date (note the "D"). Third, since it's a character vector, you need to pass the origin numbers through as.integer for as.Date to be able to work with them:
as.Date(as.integer(DateVar[indicator]), origin = "1899-12-30")
Or, without the intervening indicator assignment:
as.Date(as.integer(DateVar[!grepl("-",DateVar)]), origin = "1899-12-30")
[1] "2018-05-09" "2018-02-06"
If you wish to put these dates back into DateVar, you again subset with [indicator]:
DateVar[indicator]<-format(as.Date(as.integer(DateVar[indicator]), origin = "1899-12-30"), "%m-%d-%Y")
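Putting the pieces together, here is a minimal sketch that converts the whole mixed vector into a single Date vector instead of writing formatted strings back. It assumes every element without a dash is an Excel serial number and every other element is MM-DD-YYYY; the names is_serial and out are only illustrative.

```r
DateVar <- c("12-07-2017", "43229", "43137", "03-27-2018")

# TRUE where the element has no dash (assumed to be an Excel serial number)
is_serial <- !grepl("-", DateVar)

# Start from an all-NA Date vector and fill each kind separately
out <- rep(as.Date(NA), length(DateVar))
out[is_serial]  <- as.Date(as.integer(DateVar[is_serial]), origin = "1899-12-30")
out[!is_serial] <- as.Date(DateVar[!is_serial], format = "%m-%d-%Y")

out
# [1] "2017-12-07" "2018-05-09" "2018-02-06" "2018-03-27"
```

Keeping the result as Date (rather than character) means it can be sorted, differenced, or reformatted later with format().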
Related
Well, first things first, I'm still a noob and am learning R. I have a dataset with 0.9 million rows and 36 columns. Of these columns, one, let's say DATE, has dates in string format and another, let's say TZ, has timezones as strings too.
What I'm trying to do is combine these two columns into one of type POSIXlt, which holds date, time and timezone. Here's my code for trying to get a vector of all the converted dates:
# Let's suppose my data exist in a variable "data" with dates in "DATE" column and timezones in "TZ"
indices <- NULL
dates <- NULL
zones <- unique (data$TZ)
for(i in seq_along(zones)){
indices <<- which(data$TZ==zones[i])
dates <<- c(dates, as.POSIXlt(data$DATE[indices], format = "%m/%d/%Y %H:%M:%S", tz = zones[i]))
}
Now, although there are ~1 million observations, it seems to do the job in 3-4 seconds. Only, that it "seems" to. The result I get is a list with NAs.
It does work when I try to convert a group individually, i.e., store result for every iteration in a different variable, or not run a for loop and do each iteration manually, storing each result in a different variable and, in the end, concatenate it all using c() function.
What am I doing wrong?
For anyone who might stumble here, I figured it out.
You can't use c() on a POSIXlt object as it'll convert it into the local timezone. (Not the reason for the NAs, but it's helpful.)
POSIXlt is stored as a list of different components like mday, zone etc., due to which its value cannot be used in a data frame element. Instead of POSIXlt, we can use POSIXct, as that's internally represented as seconds from 1970-01-01.
Since we'll be replacing a data frame column with dates, it's easier to do so by converting each result into a tibble using dplyr::as_tibble() and then using rbind() to combine the different results.
The reason NAs were introduced is the lexical scoping in R. I used dates <<- c(dates, as.POSIXlt(data$DATE[indices], format = "%m/%d/%Y %H:%M:%S", tz = zones[i])), due to which the value of i in zones[i] was NA or unknown.
So, the correct working code is -
dates <- NULL
for (i in seq_along(zones)) {
indices <- which(data$TZ==zones[i])
dts <- as.POSIXct(data$DATE[indices], format = "%m/%d/%Y %H%M", tz = zones[i])
dates <- rbind(dates,as_tibble(dts))
}
# Further, to combine the dates into the data frame
data <- arrange(data, TZ) %>% mutate(DATEandTime = dates$value) %>% select(-c("DATE","TZ"))
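One thing to watch with the code above: unique() returns zones in order of first appearance while arrange() sorts them, so the stacked dates may not line up with the rows unless the zones happen to first appear in sorted order. A hedged alternative sketch is to preallocate a POSIXct vector and fill it by row index, which keeps every value on its original row. The sample data and the column name when are made up for illustration, and the format string is the one from the question. A POSIXct vector carries a single display timezone, so the instants are shown here in UTC.

```r
# Illustrative data (assumed layout: DATE strings plus a TZ column)
data <- data.frame(
  DATE = c("01/02/2020 03:04:05", "06/07/2021 08:09:10", "11/12/2022 13:14:15"),
  TZ   = c("UTC", "America/New_York", "UTC"),
  stringsAsFactors = FALSE
)

# All-NA POSIXct vector, one slot per row, displayed in UTC
when <- as.POSIXct(rep(NA_real_, nrow(data)), origin = "1970-01-01", tz = "UTC")

for (z in unique(data$TZ)) {
  idx <- which(data$TZ == z)
  # Parse each group in its own zone; assignment stores the absolute instant
  when[idx] <- as.POSIXct(data$DATE[idx], format = "%m/%d/%Y %H:%M:%S", tz = z)
}

data$when <- when
```

Because the assignment is by index, no sorting or row binding is needed afterwards.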
I am trying to convert integer data in my data frame in R to date format.
The data is in a column named svcg_cycle within the orig_svcg_filtered data frame.
The original data looks something like 200502, 200503, and so forth, and I expect to turn it into yyyy-mm-dd format.
I am trying to use this code:
as.Date(orig_svcg_filtered$svcg_cycle, origin = "2000-01-01")
but the output is not something that I expected:
[1] "2548-12-15" "2548-12-15" "2548-12-15" "2548-12-15" "2548-12-15"
while it is supposed to be 2005-02-01, 2005-03-01, and so forth.
How can I solve this?
If you have
x <- c(200502, 200503)
Then
as.Date(x, origin = "2000-01-01")
tells R you want the dates that fall 200,502 and 200,503 days after 2000-01-01. From help("as.Date"):
as.Date will accept numeric data (the number of days since an epoch),
but only if origin is supplied.
So, integer data gives days after the origin supplied, not some sort of numeric code for the dates like 200502 for "2005-02-01".
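As a small illustration of that day-counting behaviour (the variable names here are only for the example), the number that actually corresponds to 2005-02-01 under this origin is 1858, not 200502:

```r
origin <- "2000-01-01"

first  <- as.Date(0,  origin = origin)  # the origin itself: "2000-01-01"
later  <- as.Date(31, origin = origin)  # 31 days later:     "2000-02-01"

# Number of days between the origin and the date the asker wanted
n_days <- as.numeric(as.Date("2005-02-01") - as.Date(origin))
n_days
# [1] 1858

as.Date(n_days, origin = origin)
# [1] "2005-02-01"
```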
What you want is
as.Date(paste(substr(x, 1, 4), substr(x, 5, 6), "01", sep = "-"))
# [1] "2005-02-01" "2005-03-01"
The
paste(substr(x, 1, 4), substr(x, 5, 6), "01", sep = "-")
part takes your integers and creates strings like
# [1] "2005-02-01" "2005-03-01"
Then as.Date() knows how to deal with them.
You could alternatively do something like
as.Date(paste0(x, "01"), format = "%Y%m%d")
# [1] "2005-02-01" "2005-03-01"
This just pastes on an "01" to each element (for the day), converts to character, and tells as.Date() what format to read the date into. (See help("as.Date") and help("strptime")).
I like to use regex to fix these kinds of string formatting issues. as.Date by default only checks for a couple of standard date formats like YYYY-MM-DD. origin is used when you have an integer date (i.e. days from some reference point), but in this case your date is not actually an integer date; rather, it's just a date formatted as a string of digits.
We simply split the year and month with a dash, and add a day, in this case the first of the month, to make it a valid date (you must have a day to store it as a date object in R). The regex bit captures the first four digits in group one and the final two digits in group two. We then combine the two groups, separated by dashes, along with the day.
as.Date(gsub("^(\\d{4})(\\d{2})", "\\1-\\2-01", x))
[1] "2005-02-01" "2005-03-01"
You don't need to specify format in this case, because YYYY-MM-DD is one of the standard formats as.Date checks. If you did want to specify it, the format argument would be format = "%Y-%m-%d".
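If the column might contain values that are not valid YYYYMM codes, a small wrapper can make the conversion more defensive. This is only a sketch: the helper name yyyymm_to_date is hypothetical, and it assumes that invalid codes should become NA rather than raise an error.

```r
# Hypothetical helper: convert YYYYMM codes to the first day of that month
yyyymm_to_date <- function(x) {
  s  <- sprintf("%d", as.integer(x))              # e.g. 200502 -> "200502"
  ok <- grepl("^[0-9]{4}(0[1-9]|1[0-2])$", s)     # four-digit year, month 01-12
  out <- rep(as.Date(NA), length(s))
  out[ok] <- as.Date(paste0(s[ok], "01"), format = "%Y%m%d")
  out
}

res <- yyyymm_to_date(c(200502, 200503, 200513))
res
# [1] "2005-02-01" "2005-03-01" NA
```

The month check is what turns 200513 into NA here instead of letting it reach as.Date().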
I have the following DF:
date <- c("2017-10-11","2018-04-02","2017-05-03")
df <- data.frame(date)
The following code fails to format it as a date, instead returning NAs:
df$date <- as.Date(df$date,format='%Y/%m/%d')
The following code successfully formats it as a date:
df$date <- as.Date(df$date)
I've specified the as.Date format (as in the first as.Date code) in other projects and it has worked for me before. There are numerous responses to similar questions about as.Date returning NAs, but I can't find any that explain why my first as.Date code doesn't work in this situation, but my second one does. I don't need to specify the format for my purposes, but I would like to understand why the first line of code does not work.
The default format tried first by the as.Date function is "%Y-%m-%d", which matches the format of the date column of df.
Hence, df$date <- as.Date(df$date) works perfectly.
The R help page (?as.Date) describes the format argument as:
format
A character string. If not specified, it will try "%Y-%m-%d"
then "%Y/%m/%d" on the first non-NA element, and give an error if
neither works. Otherwise, the processing is via strptime
But df$date <- as.Date(df$date,format='%Y/%m/%d') will not work. The reason is that the separator / specified in the format is not present in the data, which uses - instead.
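A quick way to see the difference is to run the three variants side by side on the same strings. This is just a small check using the data from the question; the result names are illustrative.

```r
date <- c("2017-10-11", "2018-04-02", "2017-05-03")

# "/" in the format never matches the "-" in the data, so every element is NA
slash_fmt <- as.Date(date, format = "%Y/%m/%d")

# A format that matches the data parses correctly
dash_fmt <- as.Date(date, format = "%Y-%m-%d")

# With no format given, as.Date tries "%Y-%m-%d" first, which matches
no_fmt <- as.Date(date)

all(is.na(slash_fmt))
# [1] TRUE
identical(dash_fmt, no_fmt)
# [1] TRUE
```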
I have read a CSV file in as mydata. An existing column called inbound_date contains data like
NULL
2017-06-24 16:47:35
2017-06-24 16:47:35
I want to create a new column that extracts the day from this column. I have tried the code below, but it failed:
mydata$inbound_day<-ifelse(is.null(mydata$inbound_date),"null",as.Date(mydata$inbound_date,format = "%Y-%m-%d"))
The new column inbound_day has been added, but it shows as NA in the column for all the rows.
Can someone help me see which part of the code is wrong? Thanks!
There are two things at play here.
The behaviour of ifelse. It returns as many values as the length of the condition. If the condition has only one value, ifelse too will return a single value.
The behaviour of is.null is not the same as that of is.na. Unlike is.na, is.null(mydata$inbound_date) checks the whole of mydata$inbound_date as a single object, so you get just one value in return, which is FALSE.
The combined effect of these two things is that you only get the as.Date value for the first item as the result, and it is a single NA. What's more, this NA is then recycled to fill the whole column with NAs.
Solution -- Use is.na where you are using is.null. It will return one value per element and the code will work as expected.
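The difference between the two checks can be seen on a small stand-in for the data. This sketch assumes the column holds the literal string "NULL" (as in the sample shown in the question) rather than a real NA; in that case as.Date() already returns NA for the unparseable entry, so ifelse() is not strictly needed, which also avoids ifelse() dropping the Date class.

```r
# Stand-in for the CSV data (assumed: "NULL" is read in as a plain string)
mydata <- data.frame(
  inbound_date = c("NULL", "2017-06-24 16:47:35", "2017-06-24 16:47:35"),
  stringsAsFactors = FALSE
)

is.null(mydata$inbound_date)  # one value for the whole column
# [1] FALSE
is.na(mydata$inbound_date)    # one value per element
# [1] FALSE FALSE FALSE

# Unparseable strings such as "NULL" simply become NA;
# the trailing time part is ignored by this format
mydata$inbound_day <- as.Date(mydata$inbound_date, format = "%Y-%m-%d")
mydata$inbound_day
# [1] NA           "2017-06-24" "2017-06-24"
```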
You have to specify the time as well.
x <- as.POSIXlt("2017-06-24 16:47:35", format = "%Y-%m-%d %H:%M:%S")
format(x, "%Y-%m-%d")
[1] "2017-06-24"
Using lubridate to parse instead of as.Date, then extracting the day:
library(lubridate)
x <- ymd_hms("2017-06-24 16:47:35")
format(x, "%d")
I have dates in a character vector. I cannot easily convert to a date vector using as.Date, because not all of the strings have the form mm/dd/yyyy, thus giving me the ambiguous date error. Some strings have the form m/dd/yyyy (months 1:9).
Here's part of the vector:
data$Date <- c("8/26/2014","3/10/2014","9/25/2014","11/12/2014","8/4/2015")
Indicator for date to let me know which strings I need to add a zero to
data$date <- grepl("[0-9]{2}/[0-9]{2}/[0-9]{4}", data$Date)
Attempt to add zeros through a conditional:
data$Date<-ifelse(data$date == "FALSE", paste0("0", data$Date), data$Date)
Doesn't work (I'm not familiar with paste). Any concise solutions to add a leading zero to single-digit months (m/dd/yyyy)? I'm guessing gsub or sub? I need all the strings to be in the form mm/dd/yyyy so I can convert to a date vector.
data <- data.frame(Date=c("8/26/2014","3/10/2014","9/25/2014","11/12/2014","8/4/2015"))
as.Date(data$Date,format="%m/%d/%Y")
works fine for me with your data. Output is
"2014-08-26" "2014-03-10" "2014-09-25" "2014-11-12" "2015-08-04"