R: readxl and date format

I read in an Excel file in which one column contains dates in two different formats: Excel serial numbers (e.g. 43596) and text (e.g. "01.01.2020").
To convert the Excel format, one can use as.Date(as.numeric(df$date), origin = "1899-12-30");
to convert the text, one can use as.Date(df$date, format = "%d.%m.%Y").
These work for individual values, but when I try ifelse:
df$date <- ifelse(length(df$date)==5,
as.Date(as.numeric(df$date), origin = "1899-12-30"),
as.Date(df$date, format = "%d.%m.%Y"))
or a for loop:
for (i in length(x)) {
if(nchar(x[i])==5) {
y[i] <- as.Date(as.numeric(x[i]), origin = "1899-12-30")
} else {x[i] <- as.Date(x[i], , format = "%d.%m.%Y"))}
} print(x)
it does not work and fails with the error:
"character string is not in a standard unambiguous format"
Could you advise a better way to convert these mixed date formats into a single proper one?

I have two solutions for it.
The first is to change the code, which I don't like because it makes you depend on the xlsx date formats:
> df <- tibble(date = c("01.01.2020","43596"))
>
> df$date <- as.Date(ifelse(nchar(df$date)==5,
+ as.Date(as.numeric(df$date), origin = "1899-12-30"),
+ as.Date(df$date, format = "%d.%m.%Y")), origin = "1970-01-01")
Warning message:
In as.Date(as.numeric(df$date), origin = "1899-12-30") :
NAs introduced by coercion
>
> df$date
[1] "2020-01-01" "2019-05-11"
>
The second is to save the document as CSV and read it with the read_csv() function from the readr package. That solves everything!

You could use sapply to apply ifelse to each value:
df$date <- as.Date(sapply(df$date,function(date) ifelse(nchar(date)==5,
as.Date(as.numeric(date), origin = "1899-12-30"),
as.Date(date, format = "%d.%m.%Y"))),
origin="1970-01-01")
df
# A tibble: 6 x 2
contract date
<dbl> <date>
1 231429 2019-05-11
2 231437 2020-01-07
3 231449 2021-01-01
4 231459 2020-03-03
5 231463 2020-10-27
6 231466 2011-03-17
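A vectorised alternative to the element-wise sapply call is to fill the result by logical indexing. This keeps the Date class throughout, whereas ifelse drops it, which is why the calls above need the extra as.Date(..., origin = "1970-01-01") wrapper. This is only a sketch: the helper name parse_mixed_dates is made up for illustration, and it assumes every value is either an all-digit Excel serial number or a dd.mm.yyyy string.

```r
# Hypothetical helper: parse a character vector that mixes Excel serial
# numbers ("43596") with dd.mm.yyyy text ("01.01.2020").
parse_mixed_dates <- function(x) {
  out <- rep(as.Date(NA), length(x))        # Date vector of NAs to fill in
  is_serial <- grepl("^[0-9]+$", x)         # all digits => Excel serial number
  out[is_serial] <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")
  out[!is_serial] <- as.Date(x[!is_serial], format = "%d.%m.%Y")
  out
}

parse_mixed_dates(c("01.01.2020", "43596"))
#[1] "2020-01-01" "2019-05-11"
```

Because each branch only ever sees the values it can parse, no coercion warning is raised.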

A tidyverse solution using rowwise:
library(dplyr)
library(lubridate)
df %>%
rowwise() %>%
mutate(date_new=as.Date(ifelse(grepl("\\.",date),
as.character(dmy(date)),
as.character(as.Date(as.numeric(date), origin="1899-12-30"))))) %>%
ungroup()
# A tibble: 6 × 3
contract date date_new
<dbl> <chr> <date>
1 231429 43596 2019-05-11
2 231437 07.01.2020 2020-01-07
3 231449 01.01.2021 2021-01-01
4 231459 03.03.2020 2020-03-03
5 231463 44131 2020-10-27
6 231466 40619 2011-03-17

Related

Transform number chain into date

I am trying to transform a list of numbers (e.g. 20200119) into a valid date (here: 2020-01-19)
This is my trial data:
df <- data.frame(c(20200119, 20180718, 20180729, 20150502, 20010301))
colnames(df)[1] = "Dates"
And this is what I tried so far:
df <- as_date(df)
df <- as.Date.numeric(df)
df <- as.Date.factor(df)
None of them works, unfortunately.
I also tried to separate the numbers, but I couldn't manage that either.
Can somebody help me?
Convert it to character and then convert that to a Date with the format %Y%m%d:
as.Date(as.character(df$Dates), "%Y%m%d")
#[1] "2020-01-19" "2018-07-18" "2018-07-29" "2015-05-02" "2001-03-01"
Another option using strptime with the right format like this:
df <- data.frame(c(20200119, 20180718, 20180729, 20150502, 20010301))
colnames(df)[1] = "Dates"
df$Dates2 <- strptime(df$Dates, format = "%Y%m%d")
df
#> Dates Dates2
#> 1 20200119 2020-01-19
#> 2 20180718 2018-07-18
#> 3 20180729 2018-07-29
#> 4 20150502 2015-05-02
#> 5 20010301 2001-03-01
Created on 2023-01-12 with reprex v2.0.2
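If lubridate is available (the as_date attempt in the question suggests it is), its ymd() parser also accepts numbers of this form directly, so the explicit conversion to character can be skipped. A minimal sketch reusing the question's data, with the column named when the data frame is created:

```r
library(lubridate)

df <- data.frame(Dates = c(20200119, 20180718, 20180729, 20150502, 20010301))

# ymd() reads each number as year-month-day and returns a Date vector
df$Dates <- ymd(df$Dates)
format(df$Dates)
#[1] "2020-01-19" "2018-07-18" "2018-07-29" "2015-05-02" "2001-03-01"
```

Note that the parser is applied to the column (df$Dates), not to the whole data frame as in the question's attempts.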

How to convert a "char" column to datetime column in large datasets

I am working with large datasets in which one column is stored as character instead of a date-time type. I am trying to convert it, but so far I have been unable to.
Could you please suggest a solution? It would be very helpful.
Thanks in advance.
The code I am using right now:
c_data$dt_1 <- lubridate::parse_date_time(c_data$started_at,"ymd HMS")
getting output:
2027- 05- 20 20:10:03
but desired output is
2020-05-20 10:03
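Here is another way using lubridate. The strings contain hours and minutes but no seconds, so dmy_hm is the matching parser (dmy_hms misreads them):
library(dplyr)
library(lubridate)
df <- tibble(start_at = c("27/05/2020 10:03", "25/05/2020 10:47"))
df %>%
mutate(start_at = dmy_hm(start_at))
# A tibble: 2 x 1
start_at
<dttm>
1 2020-05-27 10:03:00
2 2020-05-25 10:47:00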
In R, dates and times have a single format. You can change its format to the one you require, but the result will then be of type character.
If you want to display the data as year-month-day hour:minute, you can use format:
format(Sys.time(), '%Y-%m-%d %H:%M')
#[1] "2021-08-27 17:54"
For the entire column you can apply this as:
c_data$dt_2 <- format(c_data$dt_1, '%Y-%m-%d %H:%M')
Read ?strptime for different formatting options.
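The difference between the stored date-time and its formatted text can be checked directly. The sketch below uses base R only and one of the sample strings from this thread; the variable names dt and txt are arbitrary, and %H:%M (hour and minute) is chosen to match the layout asked for in the question.

```r
# Parse the text once into a real date-time value
dt <- as.POSIXct("27/05/2020 10:03", format = "%d/%m/%Y %H:%M", tz = "UTC")
class(dt)
#[1] "POSIXct" "POSIXt"

# format() only produces a display string; the result is no longer a date-time
txt <- format(dt, "%Y-%m-%d %H:%M")
class(txt)
#[1] "character"
txt
#[1] "2020-05-27 10:03"
```

Keeping the POSIXct column for calculations and creating the formatted column only for display avoids re-parsing later.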
Using anytime
library(dplyr)
library(anytime)
addFormats("%d/%m/%Y %H:%M")
df %>%
mutate(start_at = anytime(start_at))
Output:
# A tibble: 2 x 1
start_at
<dttm>
1 2020-05-27 10:03:00
2 2020-05-25 10:47:00

How to change the date format & remove rows from dataframe before certain date R Studio

I have a dataframe with over 8.8 million observations and I need to remove the rows before a certain date. Currently the dates are in MM/DD/YYYY format, but I would like to convert them to R's date format (I believe YYYY-MM-DD).
When I run the code below, it puts them in the correct R format but does not keep the correct year: for some reason every date becomes 2020, although none of the dates in my data frame are from 2020.
> dates <- nyc_call_data_sample$INCIDENT_DATETIME
> date <- as.Date(dates,
+ format = "%m/%d/%y")
> head(nyc_call_data_sample$INCIDENT_DATETIME)
[1] "07/01/2015" "04/24/2016" "04/01/2013" "02/07/2015" "06/27/2016" "05/04/2017"
> head(date)
[1] "2020-07-01" "2020-04-24" "2020-04-01" "2020-02-07" "2020-06-27" "2020-05-04"
> nyc_call_data_sample$INCIDENT_DATETIME <- strptime(as.character(nzd$date), "%d/%m/%y")
Also, I have data that goes back as far as 2013. How would I go about removing all rows from the dataframe that are before 01/01/2017?
Thanks!
as.Date and basic ?Extract are your friends here.
dat <- data.frame(
unformatted = c("07/01/2015", "04/24/2016", "04/01/2013", "02/07/2015", "06/27/2016", "05/04/2017")
)
dat$date <- as.Date(dat$unformatted, format = "%m/%d/%Y")
dat
# unformatted date
# 1 07/01/2015 2015-07-01
# 2 04/24/2016 2016-04-24
# 3 04/01/2013 2013-04-01
# 4 02/07/2015 2015-02-07
# 5 06/27/2016 2016-06-27
# 6 05/04/2017 2017-05-04
dat[ dat$date >= as.Date("2017-01-01"), ] # >= keeps 2017-01-01 itself
# unformatted date
# 6 05/04/2017 2017-05-04
(Feel free to remove the unformatted column with dat$unformatted <- NULL.)
With tidyverse:
library(dplyr)
dat %>%
mutate(date = as.Date(unformatted, format = "%m/%d/%Y")) %>%
select(-unformatted) %>%
filter(date >= as.Date("2017-01-01"))
# date
# 1 2017-05-04
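The 2020 dates in the question come from the lowercase %y, which reads a two-digit year: it consumes only the "20" of "2015" and ignores the rest. A short comparison on two of the question's values (the variable names are arbitrary):

```r
x <- c("07/01/2015", "04/24/2016")

# %y takes just two digits of the year, so every value lands in 2020
two_digit <- as.Date(x, format = "%m/%d/%y")
format(two_digit)
#[1] "2020-07-01" "2020-04-24"

# %Y reads the full four-digit year
four_digit <- as.Date(x, format = "%m/%d/%Y")
format(four_digit)
#[1] "2015-07-01" "2016-04-24"
```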

How to clean a time column in r

I have a time column in R as:
22:34:47
06:23:15
7:35:15
5:45
How can I convert all the time values in the column to hh:mm:ss format? I have used
as_date(a$time, tz=NULL), but I am not able to get the format I want.
Here is an option with parse_date_time, which can take multiple formats:
library(lubridate)
format(parse_date_time(time, c("HMS", "HM"), tz = "GMT"), "%H:%M:%S")
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
data
time <- c("22:34:47", "06:23:15", "7:35:15", "5:45")
Nothing a bit of formatting can't take care of:
x <- c("22:34:47","06:23:15","7:35:15","5:45")
format(
pmax(
as.POSIXct(x, format="%T", tz="UTC"),
as.POSIXct(x, format="%R", tz="UTC"), na.rm=TRUE
),
"%T"
)
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
pmax keeps the later of the two parses for each element, so a time parsed with seconds (%T) is taken in preference to the hh:mm-only parse (%R).
You could get functional if you wanted to get a similar result with less typing, and more opportunity for turning it into a repeatable function.
do.call(pmax, c(lapply(c("%T","%R"), as.POSIXct, x=x, tz="UTC"), na.rm=TRUE))
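One way the do.call/pmax idea might be wrapped into a reusable function is sketched below. The function name normalise_time and its formats argument are made up for illustration; it simply tries each format in turn and formats the winning parse as hh:mm:ss.

```r
# Hypothetical wrapper around the pmax approach shown above.
# formats: strptime patterns to try; an unmatched pattern yields NA,
# which pmax(..., na.rm = TRUE) then skips.
normalise_time <- function(x, formats = c("%T", "%R")) {
  parsed <- lapply(formats, function(f) as.POSIXct(x, format = f, tz = "UTC"))
  format(do.call(pmax, c(parsed, na.rm = TRUE)), "%T")
}

normalise_time(c("22:34:47", "06:23:15", "7:35:15", "5:45"))
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
```

Further patterns can be passed through formats if the column contains other layouts.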
Using a tidyverse approach with dplyr and hms verbs.
library(dplyr)
library(hms)
a <- tibble(time = c("22:34:47", "06:23:15", "7:35:15", "5:45"))
a %>%
mutate(
time = case_when(
is.na(parse_hms(time)) ~ parse_hm(time),
TRUE ~ parse_hms(time)
)
)
# # A tibble: 4 x 1
# time
# <time>
# 1 22:34
# 2 06:23
# 3 07:35
# 4 05:45
Note that the use of case_when could be replaced with an ifelse. The reason for this conditional is that parse_hms will return NA for values without seconds.
You may also want the output to be a POSIX-compliant value; the previous solution can be adapted to do so.
a %>%
mutate(
time = case_when(
is.na(parse_hms(time)) ~ as.POSIXct(parse_hm(time)),
TRUE ~ as.POSIXct(parse_hms(time))
)
)
# # A tibble: 4 x 1
# time
# <dttm>
# 1 1970-01-01 22:34:47
# 2 1970-01-01 06:23:15
# 3 1970-01-01 07:35:15
# 4 1970-01-01 05:45:00
Note this will set the date to origin, which is 1970-01-01 by default.

Vectorised time zone conversion with lubridate

I have a data frame with a column of date-time strings:
library(tidyverse)
library(lubridate)
testdf = data_frame(
mytz = c('Australia/Sydney', 'Australia/Adelaide', 'Australia/Perth'),
mydt = c('2018-01-17T09:15:00', '2018-01-17T09:16:00', '2018-01-17T09:18:00'))
testdf
# A tibble: 3 x 2
# mytz mydt
# <chr> <chr>
# 1 Australia/Sydney 2018-01-17T09:15:00
# 2 Australia/Adelaide 2018-01-17T09:16:00
# 3 Australia/Perth 2018-01-17T09:18:00
I want to convert these date-time strings to POSIX date-time objects with their respective timezones:
testdf %>% mutate(mydt_new = ymd_hms(mydt, tz = mytz))
Error in mutate_impl(.data, dots) :
Evaluation error: tz argument must be a single character string.
In addition: Warning message:
In if (tz != "UTC") { :
the condition has length > 1 and only the first element will be used
I get the same result if I use ymd_hms without a timezone and pipe it into force_tz. Is it fair to conclude that lubridate doesn't support any sort of vectorisation when it comes to timezone operations?
Another option is map2. It may be better to store the results for the different time zones in a list, because combining them into a single date-time vector may coerce them to a single tz.
library(tidyverse)
out <- testdf %>%
mutate(mydt_new = map2(mydt, mytz, ~ymd_hms(.x, tz = .y)))
If required, it can be unnested:
out %>%
unnest(mydt_new)
The values in the list are
out %>%
pull(mydt_new)
#[[1]]
#[1] "2018-01-17 09:15:00 AEDT"
#[[2]]
#[1] "2018-01-17 09:16:00 ACDT"
#[[3]]
#[1] "2018-01-17 09:18:00 AWST"
The message "tz argument must be a single character string" indicates that more than one time zone is being passed to ymd_hms(). To make sure that only one time zone reaches the function at a time, I used rowwise(). Note that I am not in an Australian time zone, so I am not sure whether my output is identical to yours.
testdf <- data_frame(mytz = c('Australia/Sydney', 'Australia/Adelaide', 'Australia/Perth'),
mydt = c('2018-01-17 09:15:00', '2018-01-17 09:16:00', '2018-01-17 09:18:00'))
testdf %>%
rowwise %>%
mutate(mydt_new = ymd_hms(mydt, tz = mytz))
mytz mydt mydt_new
<chr> <chr> <dttm>
1 Australia/Sydney 2018-01-17 09:15:00 2018-01-17 06:15:00
2 Australia/Adelaide 2018-01-17 09:16:00 2018-01-17 06:46:00
3 Australia/Perth 2018-01-17 09:18:00 2018-01-17 09:18:00
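On the vectorisation question itself: newer lubridate releases include force_tzs(), which takes one time zone per element. The sketch below assumes such a version is installed. It parses the strings as UTC first, then reinterprets each clock time in its own zone and returns a single vector in UTC, since one POSIXct vector can carry only one time zone.

```r
library(lubridate)

mytz <- c("Australia/Sydney", "Australia/Adelaide", "Australia/Perth")
mydt <- c("2018-01-17T09:15:00", "2018-01-17T09:16:00", "2018-01-17T09:18:00")

# ymd_hms() parses as UTC by default; force_tzs() then treats each clock
# time as local to the matching entry of tzones and converts it to tzone_out.
utc <- force_tzs(ymd_hms(mydt), tzones = mytz, tzone_out = "UTC")
format(utc, "%Y-%m-%d %H:%M:%S")
#[1] "2018-01-16 22:15:00" "2018-01-16 22:46:00" "2018-01-17 01:18:00"
```

Inside a pipeline the same call can go in mutate(), for example mutate(mydt_utc = force_tzs(ymd_hms(mydt), tzones = mytz)), without rowwise() or map2.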
