Parse strings like "now-1h" in R

Does anyone have experience parsing/converting strings like "now-1h", "today", "now-3d", "today+30m" in R?
How can I recognize such a string (for example, passed as a function argument) and convert it to a date-time?

I do not recommend using this code, which is brittle. This is just to demonstrate this is possible:
library(lubridate)
library(stringr)
library(dplyr)
data <- c("now-1h", "today", "now-3d", "today+30m", "-3d")
We're going to use lubridate functions now(), today(), days(), hours(), ... which resemble your input data:
str(now())
# POSIXct[1:1], format: "2017-03-24 12:26:18"
str(today())
# Date[1:1], format: "2017-03-24"
str(now() - days(3))
# POSIXct[1:1], format: "2017-03-21 12:27:11"
str(- days(3))
# Formal class 'Period' [package "lubridate"] with 6 slots
# ..@ .Data : num 0
# ..@ year : num 0
# ..@ month : num 0
# ..@ day : num -3
# ..@ hour : num 0
# ..@ minute: num 0
We're going to have to parse() them as strings, to be able to actually use them, like that:
eval(parse(text = "now() + days(3)"))
# [1] "2017-03-27 12:41:34 CEST"
Now let's parse the input strings with a regex, manipulate them a bit to match lubridate syntax, then eval()uate them:
res <-
  str_match(data, "(today|now)?([+-])?(\\d+)?([dhms])?")[, - 1] %>%
  apply(1, function(x) {
    time_ <- if (is.na(x[1])) NULL else paste0(x[1], "()")
    offset_ <- if (any(is.na(x[2:4]))) NULL else paste(x[2],
      recode(x[4], "d" = "days(", "h" = "hours(", "m" = "minutes(", "s" = "seconds("),
      x[3],
      ")")
    parse(text = paste(time_, offset_))
  }) %>%
  lapply(eval)
Notice that you get a variety of classes as output (POSIXct, Date, POSIXlt or lubridate::Period):
invisible(lapply(res, function(x) { print(x) ; str(x) }))
# [1] "2017-03-24 11:57:52 CET"
# POSIXct[1:1], format: "2017-03-24 11:57:52"
# [1] "2017-03-24"
# Date[1:1], format: "2017-03-24"
# [1] "2017-03-21 12:57:52 CET"
# POSIXct[1:1], format: "2017-03-21 12:57:52"
# [1] "2017-03-24 00:30:00 UTC"
# POSIXlt[1:1], format: "2017-03-24 00:30:00"
# [1] "-3d 0H 0M 0S"
# Formal class 'Period' [package "lubridate"] with 6 slots
# ..@ .Data : num 0
# ..@ year : num 0
# ..@ month : num 0
# ..@ day : num -3
# ..@ hour : num 0
# ..@ minute: num 0
(What I recommend instead is to pre-process the data with the language that produced it and possesses the right tools for the job, which appears to be Perl).
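If the strings do have to be handled in R, the same idea can be sketched without eval(parse()). This is only a minimal base R sketch, not a drop-in replacement for the code above: the function name parse_relative_time is made up for illustration, a missing anchor (as in "-3d") is assumed to mean "now", the result is always POSIXct, and a day is counted as a fixed 86400 seconds, which ignores daylight saving changes.

```r
# Hypothetical helper: turn "now-1h", "today+30m", ... into a POSIXct value.
# Assumptions: no anchor means "now"; "d" is a fixed 86400 seconds.
parse_relative_time <- function(x, now = Sys.time()) {
  pattern <- "^(now|today)?(([+-])([0-9]+)([dhms]))?$"
  m <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(m) == 0 || !nzchar(x)) {
    stop("unrecognised time expression: ", x)
  }
  anchor <- m[2]
  sign   <- m[4]
  n      <- m[5]
  unit   <- m[6]

  # "today" starts at midnight of the current day, anything else at `now`
  base <- if (identical(anchor, "today")) {
    as.POSIXct(trunc(now, units = "days"))
  } else {
    now
  }

  # No offset part: return the anchor itself
  if (is.na(unit) || !nzchar(unit)) {
    return(base)
  }

  secs_per_unit <- c(d = 86400, h = 3600, m = 60, s = 1)
  offset <- as.numeric(n) * secs_per_unit[[unit]]
  if (sign == "-") base - offset else base + offset
}

# Fixed reference time so that the results are reproducible
ref <- as.POSIXct("2017-03-24 12:00:00", tz = "UTC")
results <- lapply(c("now-1h", "today", "now-3d", "today+30m", "-3d"),
                  parse_relative_time, now = ref)
```

Passing `now` explicitly keeps the function testable; in real use the default `Sys.time()` would apply.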

Related

Saving and reading lubridate intervals to/from disk

I am having problems recovering lubridate intervals when reading them back from csv and fst files.
Does anyone have a suggestion for how to do this?
library(tidyverse)
library(fst)
library(lubridate)
test <- tibble(
start = ymd_hms("2020-01-01 12:13:14", tz="UTC"),
end = ymd_hms("2021-01-01 12:13:14", tz="UTC"),
interval = lubridate::interval(start, end)
) %>%
write_csv("test1.csv")
test %>% fst::write_fst("test1.fst")
str(test)
test_read_back_csv <- read_csv("test1.csv")
str(test_read_back_csv)
test_read_back_fst <- read_fst("test1.fst")
str(test_read_back_fst)
You will see that the returned object test_read_back_csv$interval or test_read_back_fst$interval is not a lubridate interval. I need to be able to both save this file and read it back properly.
Save to a binary format such as .RDS:
str(test)
#> tibble [1 x 3] (S3: tbl_df/tbl/data.frame)
#> $ start : POSIXct[1:1], format: "2020-01-01 12:13:14"
#> $ end : POSIXct[1:1], format: "2021-01-01 12:13:14"
#> $ interval:Formal class 'Interval' [package "lubridate"] with 3 slots
#> .. ..@ .Data: num 31622400
#> .. ..@ start: POSIXct[1:1], format: "2020-01-01 12:13:14"
#> .. ..@ tzone: chr "UTC"
write_rds(test, "test1.rds")
test_read_back_rds <- read_rds("test1.rds")
str(test_read_back_rds)
#> tibble [1 x 3] (S3: tbl_df/tbl/data.frame)
#> $ start : POSIXct[1:1], format: "2020-01-01 12:13:14"
#> $ end : POSIXct[1:1], format: "2021-01-01 12:13:14"
#> $ interval:Formal class 'Interval' [package "lubridate"] with 3 slots
#> .. ..@ .Data: num 31622400
#> .. ..@ start: POSIXct[1:1], format: "2020-01-01 12:13:14"
#> .. ..@ tzone: chr "UTC"
Created on 2022-03-14 by the reprex package (v2.0.1)
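If a text format is required, one possible workaround (a sketch, not something the answer above proposes) is to store only the plain start and end columns and rebuild the interval after reading. The example below uses base write.csv/read.csv and a temporary file; it assumes all timestamps are in UTC.

```r
library(lubridate)

start <- ymd_hms("2020-01-01 12:13:14", tz = "UTC")
end   <- ymd_hms("2021-01-01 12:13:14", tz = "UTC")
original <- interval(start, end)

# Write only the plain date-time columns; the interval itself is not stored
path <- tempfile(fileext = ".csv")
write.csv(data.frame(start = start, end = end), path, row.names = FALSE)

# Read back as text, re-parse the date-times, then rebuild the interval
back <- read.csv(path, stringsAsFactors = FALSE)
back$start <- ymd_hms(back$start, tz = "UTC")
back$end   <- ymd_hms(back$end, tz = "UTC")
restored <- interval(back$start, back$end)
```

The interval is fully determined by its two endpoints, so nothing is lost as long as the time zone is handled consistently on both sides.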

R: How to change column type of a df in a list and combine list to df with bind_rows()

I have multiple data frames/tibbles inside a list, like this (the real dataset is more like 500 dfs with 180 columns and 30 rows each):
df1 <- data.frame(Col_1 = c(0,0,0,0,0),
Col_2 = c(1,1,1,1,1),
Col_3 = c("text", "text", "text", "text", "text"))
df2 <- data.frame(Col_1 = c(0,0,0,0,0),
Col_2 = c(1,1,1,1,1),
Col_3 = c(2,2,2,2,2))
l <- list(df1, df2)
The reason is that I'm using readxl over multiple Excel files. In principle those Excel files/columns are the same, but some columns are imported as character and others as double, depending on user input.
In the end I want one big data frame with all the dfs combined by bind_rows() or another function.
Simply using dplyr::bind_rows(l) gives an error (Error: Can't combine `..1$Col_3` <character> and `..2$Col_3` <double>.) because the column types differ. To solve this problem, I'm using this approach:
l <- lapply(l, function(df) dplyr::mutate_at(df, vars(matches("Col_3")), as.character))
and afterwards this:
df <- dplyr::bind_rows(l)
which results in my desired df using this simple example.
But when I apply the same lapply call to my real dataset, this error always occurs:
Error: Can't transform a data frame with duplicate names.
How can I narrow down the problem? (I can't share the dataset for confidentiality reasons, and I couldn't reproduce the error with the example above.)
Is there perhaps a better way/function to convert this list into one df? Maybe by automatically converting all columns to character (for now this would work, though it is not a good choice in the long run).
Change all columns to character, and then you can combine the data frames.
library(dplyr)
library(purrr)
df_all <- map_dfr(l, ~.x %>% mutate(across(everything(), as.character)))
A better way could be that when you read the data into R, make sure all columns are character.
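The "duplicate names" error from the question suggests that at least one data frame in the real list has repeated column names. A minimal sketch of how one might locate them, using a made-up two-element list (the real data is not available, so this is an assumption about the cause):

```r
# Hypothetical example list: the second frame has a duplicated column name
l <- list(
  data.frame(Col_1 = 0, Col_2 = 1),
  data.frame(Col_1 = 0, Col_1 = 1, check.names = FALSE)
)

# TRUE for every data frame whose column names are not unique
has_dupes <- vapply(l, function(df) anyDuplicated(names(df)) > 0, logical(1))

# Positions of the offending data frames in the list
bad_idx <- which(has_dupes)

# The duplicated names themselves, one entry per offending data frame
dupe_names <- lapply(l[bad_idx], function(df) {
  unique(names(df)[duplicated(names(df))])
})
```

Once the offending files are known, the names can be fixed at the source or made unique (for example with make.unique(names(df))) before converting the columns.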
Up front, the best fix for this is to specify the column classes when importing the data. The function I propose below should not be used when you have any semblance of control over the import process; its utility is limited to the extreme cases where you need some normalization of classes after the fact.
normalize_attributes <- function(L) {
  all_nms <- unique(unlist(sapply(L, names)))
  L <- lapply(L, function(dat) {
    missing_nms <- setdiff(all_nms, names(dat))
    if (length(missing_nms)) dat[missing_nms] <- NA
    dat
  })
  first_nms <- names(L[[1]])
  reattrib_funcs <- setNames(vector("list", length(all_nms)), all_nms)
  LL_attr <- lapply(setNames(nm = all_nms),
                    function(nm) lapply(L, function(dat) attributes(dat[[nm]])))
  LL_first <- lapply(setNames(nm = all_nms),
                     function(nm) lapply(L, function(dat) dat[[nm]][1]))
  for (nm in all_nms) {
    haschr <- sapply(LL_first[[nm]], inherits, "character")
    hasnum <- sapply(LL_first[[nm]], inherits, "numeric")
    haspsx <- sapply(LL_first[[nm]], inherits, "POSIXt")
    hasdate <- sapply(LL_first[[nm]], inherits, "Date")
    # use psx if any present otherwise date
    hastime <- if (any(haspsx)) haspsx else hasdate
    if (any(haschr)) {
      # character wins all
      reattrib_funcs[[nm]] <- as.character
    } else if (any(hastime)) {
      # need to use attributes here; this allows up-conversion without
      # needing to specify the 'origin' (this might be a bug!)
      att <- LL_attr[[nm]][[ which.max(hastime) ]]
      reattrib_funcs[[nm]] <- substitute(function(vec) `attributes<-`(vec, att))
    } else if (any(hasnum)) {
      reattrib_funcs[[nm]] <- as.numeric
    } else {
      cls <- class(unlist(LL_first[[nm]]))
      reattrib_funcs[[nm]] <- substitute(function(vec) `class<-`(vec, cls))
    }
  }
  L <- lapply(L, function(dat) {
    dat[all_nms] <- Map(function(func, vec) eval(func)(vec),
                        reattrib_funcs[all_nms], dat[all_nms])
    dat
  })
}
Some cruel data, showing various combinations of data classes/types:
df1 <- data.frame(int_int = c(0L,0L,0L,0L,0L),
num_num = c(0,0,0,0,0),
num_int = c(0,0,0,0,0),
num_psx = c(0,0,0,0,0),
num_dat = c(0,0,0,0,0),
chr_num = c("text", "text", "text", "text", "text"),
psx_dat = rep(Sys.time(),5),
chr_mis = c(0,0,0,0,0))
df2 <- data.frame(int_int = c(1L,1L,1L,1L,1L),
num_num = c(1,1,1,1,1),
num_int = c(1L,1L,1L,1L,1L),
num_psx = rep(Sys.time(),5),
num_dat = rep(Sys.Date(),5),
psx_dat = rep(Sys.Date(),5),
chr_num = c(1,1,1,1,1))
l <- list(df1, df2)
str(l)
# List of 2
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 0 0 0 0 0
# ..$ num_num: num [1:5] 0 0 0 0 0
# ..$ num_int: num [1:5] 0 0 0 0 0
# ..$ num_psx: num [1:5] 0 0 0 0 0
# ..$ num_dat: num [1:5] 0 0 0 0 0
# ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
# ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
# ..$ chr_mis: num [1:5] 0 0 0 0 0
# $ :'data.frame': 5 obs. of 7 variables:
# ..$ int_int: int [1:5] 1 1 1 1 1
# ..$ num_num: num [1:5] 1 1 1 1 1
# ..$ num_int: int [1:5] 1 1 1 1 1
# ..$ num_psx: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
# ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ psx_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ chr_num: num [1:5] 1 1 1 1 1
(The column names indicate the different types/classes.) Note the problems with those frames:
"chr_mis" is missing in one;
columns are in a different order, "chr_num" notably; and obviously
the classes are rarely the same :-)
I expect from this that:
if any column is a string, all frames have that column as a string;
if a column is POSIXt or Date (special-case numeric due to attributes), all should be;
num and int go to num, normal R behavior;
missing columns are added, using the appropriate R NA class (there are at least six types of NA)
And the fixed data:
str(l2 <- normalize_attributes(l))
# List of 2
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 0 0 0 0 0
# ..$ num_num: num [1:5] 0 0 0 0 0
# ..$ num_int: num [1:5] 0 0 0 0 0
# ..$ num_psx: POSIXct[1:5], format: "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" ...
# ..$ num_dat: Date[1:5], format: "1970-01-01" "1970-01-01" "1970-01-01" "1970-01-01" ...
# ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
# ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
# ..$ chr_mis: num [1:5] 0 0 0 0 0
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 1 1 1 1 1
# ..$ num_num: num [1:5] 1 1 1 1 1
# ..$ num_int: num [1:5] 1 1 1 1 1
# ..$ num_psx: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
# ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ psx_dat: POSIXct[1:5], format: "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" ...
# ..$ chr_num: chr [1:5] "1" "1" "1" "1" ...
# ..$ chr_mis: num [1:5] NA NA NA NA NA
which can now be rbinded more safely:
do.call(rbind, l2)
# int_int num_num num_int num_psx num_dat chr_num psx_dat chr_mis
# 1 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 2 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 3 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 4 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 5 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 6 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 7 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 8 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 9 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 10 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
library(dplyr)
library(purrr)
library(readr)
I would first create a list of all the files to be combined
csv_files = list.files(path = (paste0(data_path,"folder_with_csvs/")), pattern = "csv$", full.names = TRUE)
Then choose from the following options (second is probably preferred).
First option is to have every column as a character.
df_all <- map_dfr(.x = set_names(csv_files),
.f = ~ read_csv(.x, col_types = cols(.default = "c")))
Second option is to keep the default parsing/guessing for all columns except your exception(s).
df_all <- map_dfr(.x = set_names(csv_files),
                  .f = ~ read_csv(.x, col_types = cols(.default = "?", Col_3 = "c")))
Third option is to classify each column.
df_all <- map_dfr(.x = set_names(csv_files),
                  .f = ~ read_csv(.x, col_types = cols(Col_1 = "c", #char
                                                       Col_2 = "d", #double
                                                       Col_3 = "c"))) #char
see p. 34 of http://www.hiercourse.com/docs/Working_in_the_Tidyverse.pdf for more details

Date to day and time

I have read a lot of blogs, but I cannot find the answer to my question:
I have a date 2020-25-02 17:42:03 and I would like to convert it to two columns, day and time.
hello <- strptime(as.character("2020-25-02 17:42:03"),"%Y-%d-%m %H:%M:%S")
df$day <- as.Date(hello, format = "%Y-%d-%m")
But I also would like df$time. Is it possible ?
library(chron)
dtimes = c("2002-06-09 12:45:40","2003-01-29 09:30:40",
           "2002-09-04 16:45:40","2002-11-13 20:00:40",
           "2002-07-07 17:30:40")
dtparts = t(as.data.frame(strsplit(dtimes,' ')))
row.names(dtparts) = NULL
thetimes = chron(dates=dtparts[,1],times=dtparts[,2],
                 format=c('y-m-d','h:m:s'))
thetimes
# [1] (02-06-09 12:45:40) (03-01-29 09:30:40) (02-09-04 16:45:40)
# [4] (02-11-13 20:00:40) (02-07-07 17:30:40)
Please see this link
Use function hms in package lubridate.
df <- data.frame(day = as.Date(hello, format = "%Y-%d-%m"))
df$time <- lubridate::hms(sub("^[^ ]*\\b(.*)$", "\\1", hello))
df
# day time
#1 2020-02-25 17H 42M 3S
str(df)
#'data.frame': 1 obs. of 2 variables:
# $ day : Date, format: "2020-02-25"
# $ time:Formal class 'Period' [package "lubridate"] with 6 slots
# .. ..@ .Data : num 3
# .. ..@ year : num 0
# .. ..@ month : num 0
# .. ..@ day : num 0
# .. ..@ hour : num 17
# .. ..@ minute: num 42
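If a plain character time column is enough (rather than a lubridate Period), the same split can be sketched in base R with format(). This assumes the input really is in year-day-month order, as in the question, and fixes the time zone to UTC to keep the example reproducible.

```r
# Parse the year-day-month timestamp from the question
x <- as.POSIXct("2020-25-02 17:42:03", format = "%Y-%d-%m %H:%M:%S", tz = "UTC")

df <- data.frame(
  day  = as.Date(x, tz = "UTC"),   # calendar date only
  time = format(x, "%H:%M:%S"),    # clock time as text
  stringsAsFactors = FALSE
)
```

The time column is then an ordinary string, which is easy to print or export but needs re-parsing before any arithmetic.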

R - How to convert a column times in ####M #S format to just number of minutes

I have used lubridate to create a column of times in the format ####M ##S
df$delay <- minutes(df$finish-Delays$start)
How can I convert this column to give just the number #### in front of the minutes?
Thanks for any help.
Take a look at the S4 slots...
library(lubridate)
# create data
df <- data.frame(finish = 20)
Delays <- data.frame(start = 10)
(df$delay <- minutes(df$finish-Delays$start))
# [1] "10M 0S"
# take a look at the 'delay' object
str(df$delay)
# Formal class 'Period' [package "lubridate"] with 6 slots
# ..@ .Data : num 0
# ..@ year : num 0
# ..@ month : num 0
# ..@ day : num 0
# ..@ hour : num 0
# ..@ minute: num 10
# access the 'minute' slot
df$delay@minute
# [1] 10
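One caveat with reading the slot directly: it only holds the minute component of the period. A small sketch of the difference, using a made-up period that also has an hour part and lubridate's period_to_seconds() to get the total length:

```r
library(lubridate)

# Hypothetical period with both an hour and a minute component
p <- hours(1) + minutes(30)

# The slot only holds the minute component, not the total
minute_part <- p@minute

# Total length in minutes, with the hour counted as well
total_minutes <- period_to_seconds(p) / 60
```

In the question's case every value was built with minutes() alone, so both approaches give the same number there.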

How to combine a list of timeDate into a single timeDate?

I can generate a list of timeDate objects for New York Exchange. However, most of the analytical functions expect a single timeDate object. The underlying data representation is POSIXct, so I can't just append them like a vector or a list.
How to do it?
library(timeDate)
x <- lapply(c(1885: 1886), holidayNYSE)
x
[[1]]
NewYork
[1] [1885-01-01] [1885-02-23] [1885-04-03] [1885-11-03] [1885-11-26] [1885-12-25]
[[2]]
NewYork
[1] [1886-01-01] [1886-02-22] [1886-04-23] [1886-05-31] [1886-07-05] [1886-11-02] [1886-11-25]
class(x[[1]])
[1] "timeDate"
attr(,"package")
[1] "timeDate"
class(x[[1]]@Data)
[1] "POSIXct" "POSIXt"
# ??? How to combine my two timeDate objects ???
We can use do.call with c
x1 <- do.call(c, x)
x1
#NewYork
#[1] [1885-01-01] [1885-02-23] [1885-04-03] [1885-11-03] [1885-11-26] [1885-12-25] [1886-01-01] [1886-02-22] [1886-04-23] [1886-05-31] [1886-07-05] [1886-11-02]
#[13] [1886-11-25]
str(x1)
#Formal class 'timeDate' [package "timeDate"] with 3 slots
# ..@ Data : POSIXct[1:13], format: "1885-01-01 05:00:00" "1885-02-23 05:00:00" "1885-04-03 05:00:00" "1885-11-03 05:00:00" ...
# ..@ format : chr "%Y-%m-%d"
# ..@ FinCenter: chr "NewYork"
and the structure of OP's list is
str(x)
#List of 2
#$ :Formal class 'timeDate' [package "timeDate"] with 3 slots
# .. ..@ Data : POSIXct[1:6], format: "1885-01-01 05:00:00" "1885-02-23 05:00:00" "1885-04-03 05:00:00" "1885-11-03 05:00:00" ...
# .. ..@ format : chr "%Y-%m-%d"
# .. ..@ FinCenter: chr "NewYork"
# $ :Formal class 'timeDate' [package "timeDate"] with 3 slots
# .. ..@ Data : POSIXct[1:7], format: "1886-01-01 05:00:00" "1886-02-22 05:00:00" "1886-04-23 05:00:00" "1886-05-31 05:00:00" ...
# .. ..@ format : chr "%Y-%m-%d"
# .. ..@ FinCenter: chr "NewYork"
