I am having problems recovering lubridate intervals when reading data back from csv and fst files.
Does anyone have a suggestion for how to do this?
library(tidyverse)
library(fst)
library(lubridate)
test <- tibble(
start = ymd_hms("2020-01-01 12:13:14", tz="UTC"),
end = ymd_hms("2021-01-01 12:13:14", tz="UTC"),
interval = lubridate::interval(start, end)
) %>%
write_csv("test1.csv")
test %>% fst::write_fst("test1.fst")
str(test)
test_read_back_csv <- read_csv("test1.csv")
str(test_read_back_csv)
test_read_back_fst <- read_fst("test1.fst")
str(test_read_back_fst)
You will see that test_read_back_csv$interval and test_read_back_fst$interval are no longer lubridate intervals. I need to be able to both save this data and read it back with the interval column intact.
Save to a binary format such as .RDS:
str(test)
#> tibble [1 x 3] (S3: tbl_df/tbl/data.frame)
#> $ start : POSIXct[1:1], format: "2020-01-01 12:13:14"
#> $ end : POSIXct[1:1], format: "2021-01-01 12:13:14"
#> $ interval:Formal class 'Interval' [package "lubridate"] with 3 slots
#> .. ..@ .Data: num 31622400
#> .. ..@ start: POSIXct[1:1], format: "2020-01-01 12:13:14"
#> .. ..@ tzone: chr "UTC"
write_rds(test, "test1.rds")
test_read_back_rds <- read_rds("test1.rds")
str(test_read_back_rds)
#> tibble [1 x 3] (S3: tbl_df/tbl/data.frame)
#> $ start : POSIXct[1:1], format: "2020-01-01 12:13:14"
#> $ end : POSIXct[1:1], format: "2021-01-01 12:13:14"
#> $ interval:Formal class 'Interval' [package "lubridate"] with 3 slots
#> .. ..@ .Data: num 31622400
#> .. ..@ start: POSIXct[1:1], format: "2020-01-01 12:13:14"
#> .. ..@ tzone: chr "UTC"
Created on 2022-03-14 by the reprex package (v2.0.1)
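If the file has to stay a CSV, one possible workaround is to store only the start and end columns and rebuild the interval after reading. The sketch below assumes that both endpoints are saved alongside the data, as in the example above; the temporary file path and the explicit col_types are illustrative choices, not part of the original post.

```r
library(readr)
library(dplyr)
library(lubridate)

# Same endpoints as in the question; the interval itself is not written to disk
test <- tibble(
  start = ymd_hms("2020-01-01 12:13:14", tz = "UTC"),
  end   = ymd_hms("2021-01-01 12:13:14", tz = "UTC")
)

csv_path <- tempfile(fileext = ".csv")  # hypothetical location for this sketch
write_csv(test, csv_path)

# Read the endpoints back as date-times, then recompute the interval column
test_read_back_csv <- read_csv(
  csv_path,
  col_types = cols(start = col_datetime(), end = col_datetime())
) %>%
  mutate(interval = interval(start, end))

is.interval(test_read_back_csv$interval)
```

This keeps the file readable by other tools, at the cost of one extra mutate() step after every read.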
I have multiple data frames/tibbles inside a list, like this (in the real dataset it's more like 500 data frames with 180 columns and 30 rows each):
df1 <- data.frame(Col_1 = c(0,0,0,0,0),
Col_2 = c(1,1,1,1,1),
Col_3 = c("text", "text", "text", "text", "text"))
df2 <- data.frame(Col_1 = c(0,0,0,0,0),
Col_2 = c(1,1,1,1,1),
Col_3 = c(2,2,2,2,2))
l <- list(df1, df2)
The reason for this is that I'm using readxl over multiple Excel files. In principle those Excel files have the same columns, but some columns are imported as character in one file and as double in another. This is caused by user input.
In the end I want one big data frame with all the data frames bound together by bind_rows() or another function.
Simply calling dplyr::bind_rows(l) throws an error (Error: Can't combine `..1$Col_3` <character> and `..2$Col_3` <double>.) because the column types differ. To solve this problem, I'm using this approach:
l <- lapply(l, function(df) dplyr::mutate_at(df, vars(matches("Col_3")), as.character))
and afterwards this:
df <- dplyr::bind_rows(l)
which results in my desired df in this simple example.
But when I use the lapply call on my real dataset, this error always occurs:
Error: Can't transform a data frame with duplicate names.
How can I narrow down the problem? (I can't share the dataset for confidentiality reasons, and I couldn't reproduce the error with the example above.)
Is there a better way or function to convert this list into one df? Maybe one that automatically converts all column types to character (this would work for now, but is of course not a good choice in the long run).
Change all columns to character, and then you can combine the data frames.
library(dplyr)
library(purrr)
df_all <- map_dfr(l, ~.x %>% mutate(across(everything(), as.character)))
A better way would be to make sure all columns are read as character when you import the data into R.
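To narrow down the "duplicate names" error, it may help to check which elements of the list actually contain repeated column names before transforming anything. The following is a minimal base R sketch; the two-element list is a made-up stand-in for the real data, with the duplicate forced via check.names = FALSE.

```r
# Hypothetical list: the second data frame has a duplicated column name
l <- list(
  data.frame(Col_1 = 0, Col_2 = 1),
  data.frame(Col_1 = 0, Col_1 = 1, check.names = FALSE)
)

# TRUE for every data frame that has at least one repeated column name
has_dups <- vapply(l, function(df) anyDuplicated(names(df)) > 0, logical(1))

# Positions of the offending data frames in the list
which(has_dups)

# The repeated names themselves, one vector per offending data frame
dup_names <- lapply(l[has_dups], function(df) unique(names(df)[duplicated(names(df))]))
dup_names
```

Once the offending files are known, the duplicated columns can be renamed or dropped at import time, after which the type conversion should no longer fail for that reason.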
Up front, the best fix for this is to specify the column class when importing the data. The function I propose below should never be used when you have any semblance of control over the import process. Its utility is in the extreme cases where you need some normalization of classes after the fact.
normalize_attributes <- function(L) {
all_nms <- unique(unlist(sapply(L, names)))
L <- lapply(L, function(dat) {
missing_nms <- setdiff(all_nms, names(dat))
if (length(missing_nms)) dat[missing_nms] <- NA
dat
})
first_nms <- names(L[[1]])
reattrib_funcs <- setNames(vector("list", length(all_nms)), all_nms)
LL_attr <- lapply(setNames(nm = all_nms),
function(nm) lapply(L, function(dat) attributes(dat[[nm]])))
LL_first <- lapply(setNames(nm = all_nms),
function(nm) lapply(L, function(dat) dat[[nm]][1]))
for (nm in all_nms) {
haschr <- sapply(LL_first[[nm]], inherits, "character")
hasnum <- sapply(LL_first[[nm]], inherits, "numeric")
haspsx <- sapply(LL_first[[nm]], inherits, "POSIXt")
hasdate <- sapply(LL_first[[nm]], inherits, "Date")
# use psx if any present otherwise date
hastime <- if (any(haspsx)) haspsx else hasdate
if (any(haschr)) {
# character wins all
reattrib_funcs[[nm]] <- as.character
} else if (any(hastime)) {
# need to use attributes here; this allows up-conversion without
# needing to specify the 'origin' (this might be a bug!)
att <- LL_attr[[nm]][[ which.max(hastime) ]]
reattrib_funcs[[nm]] <- substitute(function(vec) `attributes<-`(vec, att))
} else if (any(hasnum)) {
reattrib_funcs[[nm]] <- as.numeric
} else {
cls <- class(unlist(LL_first[[nm]]))
reattrib_funcs[[nm]] <- substitute(function(vec) `class<-`(vec, cls))
}
}
L <- lapply(L, function(dat) {
dat[all_nms] <- Map(function(func, vec) eval(func)(vec),
reattrib_funcs[all_nms], dat[all_nms])
dat
})
}
Some cruel data, showing various combinations of data classes/types:
df1 <- data.frame(int_int = c(0L,0L,0L,0L,0L),
num_num = c(0,0,0,0,0),
num_int = c(0,0,0,0,0),
num_psx = c(0,0,0,0,0),
num_dat = c(0,0,0,0,0),
chr_num = c("text", "text", "text", "text", "text"),
psx_dat = rep(Sys.time(),5),
chr_mis = c(0,0,0,0,0))
df2 <- data.frame(int_int = c(1L,1L,1L,1L,1L),
num_num = c(1,1,1,1,1),
num_int = c(1L,1L,1L,1L,1L),
num_psx = rep(Sys.time(),5),
num_dat = rep(Sys.Date(),5),
psx_dat = rep(Sys.Date(),5),
chr_num = c(1,1,1,1,1))
l <- list(df1, df2)
str(l)
# List of 2
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 0 0 0 0 0
# ..$ num_num: num [1:5] 0 0 0 0 0
# ..$ num_int: num [1:5] 0 0 0 0 0
# ..$ num_psx: num [1:5] 0 0 0 0 0
# ..$ num_dat: num [1:5] 0 0 0 0 0
# ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
# ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
# ..$ chr_mis: num [1:5] 0 0 0 0 0
# $ :'data.frame': 5 obs. of 7 variables:
# ..$ int_int: int [1:5] 1 1 1 1 1
# ..$ num_num: num [1:5] 1 1 1 1 1
# ..$ num_int: int [1:5] 1 1 1 1 1
# ..$ num_psx: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
# ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ psx_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ chr_num: num [1:5] 1 1 1 1 1
(The column names indicate the different types/classes.) Note the problems with those frames:
"chr_mis" is missing in one;
columns are in a different order, "chr_num" notably; and obviously
the classes are rarely the same :-)
I expect from this that:
if any column is a string, all frames have that column as a string;
if a column is POSIXt or Date (special-case numeric due to attributes), all should be;
num and int go to num, normal R behavior;
missing columns are added, using the appropriate R NA class (there are at least six types of NA)
And the fixed data:
str(l2 <- normalize_attributes(l))
# List of 2
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 0 0 0 0 0
# ..$ num_num: num [1:5] 0 0 0 0 0
# ..$ num_int: num [1:5] 0 0 0 0 0
# ..$ num_psx: POSIXct[1:5], format: "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" ...
# ..$ num_dat: Date[1:5], format: "1970-01-01" "1970-01-01" "1970-01-01" "1970-01-01" ...
# ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
# ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
# ..$ chr_mis: num [1:5] 0 0 0 0 0
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 1 1 1 1 1
# ..$ num_num: num [1:5] 1 1 1 1 1
# ..$ num_int: num [1:5] 1 1 1 1 1
# ..$ num_psx: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
# ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ psx_dat: POSIXct[1:5], format: "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" ...
# ..$ chr_num: chr [1:5] "1" "1" "1" "1" ...
# ..$ chr_mis: num [1:5] NA NA NA NA NA
which can now be rbinded more safely:
do.call(rbind, l2)
# int_int num_num num_int num_psx num_dat chr_num psx_dat chr_mis
# 1 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 2 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 3 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 4 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 5 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 6 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 7 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 8 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 9 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 10 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
library(dplyr)
library(purrr)
library(readr)
I would first create a list of all the files to be combined
csv_files = list.files(path = (paste0(data_path,"folder_with_csvs/")), pattern = "csv$", full.names = TRUE)
Then choose from the following options (second is probably preferred).
First option is to have every column as a character.
df_all <- map_dfr(.x = set_names(csv_files),
.f = ~ read_csv(.x, col_types = cols(.default = "c")))
Second option is to have the default parsing/guessing for all columns except your exception(s).
df_all <- map_dfr(.x = set_names(csv_files),
.f = ~ read_csv(.x, col_types = cols(.default = "?",Col_3 = "c")))
Third option is to classify each column.
df_all <- map_dfr(.x = set_names(csv_files),
.f = ~ read_csv(.x, col_types = cols(Col_1 = "c", #char
Col_2 = "d", #double
Col_3 = "c"))) #char
see p. 34 of http://www.hiercourse.com/docs/Working_in_the_Tidyverse.pdf for more details
I have read a lot of blogs, but I cannot find the answer to my question:
I have a date 2020-25-02 17:42:03 (year-day-month) and I would like to convert it to two columns, day and time.
hello <- strptime("2020-25-02 17:42:03", "%Y-%d-%m %H:%M:%S")
df$day <- as.Date(hello, format = "%Y-%d-%m")
But I would also like df$time. Is that possible?
library(chron)
dtimes = c("2002-06-09 12:45:40","2003-01-29 09:30:40",
           "2002-09-04 16:45:40","2002-11-13 20:00:40",
           "2002-07-07 17:30:40")
# split each string into a date part and a time part
dtparts = t(as.data.frame(strsplit(dtimes,' ')))
row.names(dtparts) = NULL
thetimes = chron(dates=dtparts[,1],times=dtparts[,2],
                 format=c('y-m-d','h:m:s'))
thetimes
# [1] (02-06-09 12:45:40) (03-01-29 09:30:40) (02-09-04 16:45:40)
# [4] (02-11-13 20:00:40) (02-07-07 17:30:40)
Please see this link
Use function hms in package lubridate.
df <- data.frame(day = as.Date(hello, format = "%Y-%d-%m"))
df$time <- lubridate::hms(sub("^[^ ]*\\b(.*)$", "\\1", hello))
df
# day time
#1 2020-02-25 17H 42M 3S
str(df)
#'data.frame': 1 obs. of 2 variables:
# $ day : Date, format: "2020-02-25"
# $ time:Formal class 'Period' [package "lubridate"] with 6 slots
# .. ..@ .Data : num 3
# .. ..@ year : num 0
# .. ..@ month : num 0
# .. ..@ day : num 0
# .. ..@ hour : num 17
# .. ..@ minute: num 42
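If a lubridate Period is more than is needed, a plain character time column can be produced with base R alone. This is a sketch under two assumptions that are not in the original answer: the input is in year-day-month order, as in the question, and tz = "UTC" is set only to make the example independent of the local time zone.

```r
# Parse the year-day-month timestamp (time zone chosen for reproducibility)
hello <- strptime("2020-25-02 17:42:03", "%Y-%d-%m %H:%M:%S", tz = "UTC")

df <- data.frame(
  day  = as.Date(hello),             # calendar date only
  time = format(hello, "%H:%M:%S")   # clock time as a character string
)
df
```

The time column here is text, so it sorts and prints well but needs converting again before doing arithmetic on it.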
I can generate a list of timeDate objects for the New York Stock Exchange holidays. However, most of the analytical functions expect a single timeDate object. The underlying data representation is POSIXct, so I can't just append them like a vector or a list.
How can I combine them into one object?
library(timeDate)
x <- lapply(c(1885: 1886), holidayNYSE)
x
[[1]]
NewYork
[1] [1885-01-01] [1885-02-23] [1885-04-03] [1885-11-03] [1885-11-26] [1885-12-25]
[[2]]
NewYork
[1] [1886-01-01] [1886-02-22] [1886-04-23] [1886-05-31] [1886-07-05] [1886-11-02] [1886-11-25]
class(x[[1]])
[1] "timeDate"
attr(,"package")
[1] "timeDate"
class(x[[1]]@Data)
[1] "POSIXct" "POSIXt"
# ??? How do I combine my two timeDate objects ???
We can use do.call with c
x1 <- do.call(c, x)
x1
#NewYork
#[1] [1885-01-01] [1885-02-23] [1885-04-03] [1885-11-03] [1885-11-26] [1885-12-25] [1886-01-01] [1886-02-22] [1886-04-23] [1886-05-31] [1886-07-05] [1886-11-02]
#[13] [1886-11-25]
str(x1)
#Formal class 'timeDate' [package "timeDate"] with 3 slots
# ..@ Data : POSIXct[1:13], format: "1885-01-01 05:00:00" "1885-02-23 05:00:00" "1885-04-03 05:00:00" "1885-11-03 05:00:00" ...
# ..@ format : chr "%Y-%m-%d"
# ..@ FinCenter: chr "NewYork"
and the structure of OP's list is
str(x)
#List of 2
#$ :Formal class 'timeDate' [package "timeDate"] with 3 slots
# .. ..@ Data : POSIXct[1:6], format: "1885-01-01 05:00:00" "1885-02-23 05:00:00" "1885-04-03 05:00:00" "1885-11-03 05:00:00" ...
# .. ..@ format : chr "%Y-%m-%d"
# .. ..@ FinCenter: chr "NewYork"
# $ :Formal class 'timeDate' [package "timeDate"] with 3 slots
# .. ..@ Data : POSIXct[1:7], format: "1886-01-01 05:00:00" "1886-02-22 05:00:00" "1886-04-23 05:00:00" "1886-05-31 05:00:00" ...
# .. ..@ format : chr "%Y-%m-%d"
# .. ..@ FinCenter: chr "NewYork"