I have split my data frame using group_split and now want to save each piece as a separate csv. I am using the following code, but I am unable to get those csv files. Please help.
Thank you.
library(dplyr)
y<-year_x %>% group_split(Year)
for(i in 1:length(y)) {
  write.csv2(get(y[i]),
             paste0("D:/newfolder",
                    y[i],
                    ".csv"),
             row.names = FALSE)
}
group_split will not return the year as the list names. Use base::split instead.
y <- split(year_x, year_x$Year)
for(i in seq_along(y)) {
  write.csv2(y[[i]], paste0("D:/newfolder/", names(y)[i], ".csv"), row.names = FALSE)
}
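One practical note: write.csv2 will fail if the target folder does not exist, so it can be worth creating it first. A small guard, assuming the same D:/newfolder path:
# create the output folder once, if it is not already there
if (!dir.exists("D:/newfolder")) dir.create("D:/newfolder", recursive = TRUE)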
You could also do this with purrr::imap:
purrr::imap(y, ~ write.csv2(.x, paste0("D:/newfolder/", .y, ".csv"), row.names = FALSE))
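Since write.csv2 is called only for its side effect, purrr::iwalk (the walk variant of imap) is arguably a cleaner fit; a minimal sketch under the same assumptions:
# iwalk passes each element as .x and its name as .y, and returns the input invisibly
purrr::iwalk(y, ~ write.csv2(.x, paste0("D:/newfolder/", .y, ".csv"), row.names = FALSE))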
We can use group_map:
library(dplyr)
library(purrr)
library(stringr)
year_x %>%
  group_by(Year) %>%
  group_map(~ write.csv2(.x, file.path("D:/newfolder", str_c(.y$Year, ".csv")), row.names = FALSE))
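Since the return value is discarded here, dplyr's group_walk could also replace group_map; a sketch with the same assumed Year column and output folder:
year_x %>%
  group_by(Year) %>%
  group_walk(~ write.csv2(.x, file.path("D:/newfolder", str_c(.y$Year, ".csv")), row.names = FALSE))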
How does one create a named list of all dataframes/tibbles in the global environment in R? Is there a way to do this without manually hardcoding all dataframes/tibbles?
I.e. if the global environment contains the dataframes/tibbles df_1, my_data_1, science_1, all_data, how does one create an output that looks like:
files_list <- list(
df_1 = df_1,
my_data_1 = my_data_1,
science_1 = science_1,
all_data = all_data
)
We may Filter the elements that are data.frames or tibbles in whichever environment we are working in - e.g. in the global env it can be (strictly, is.data.frame alone would suffice, since a tibble is also a data.frame):
library(tibble)
Filter(length, eapply(.GlobalEnv,
    function(x) if(is.data.frame(x) || is_tibble(x)) x))
We can get all objects first, then keep only the data.frames:
library(purrr)
mget(ls()) %>% keep(is.data.frame)
A base way, combining the methods of @GuedesBF and @akrun, is to use ls, mget and Filter.
Filter(is.data.frame, mget(ls()))
# Filter(is.data.frame, mget(ls(.GlobalEnv), envir = .GlobalEnv)) # more explicit: look in the global environment
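A quick self-contained check of the base approach, with a couple of hypothetical example objects:
df_1 <- data.frame(a = 1)       # example data frame
my_data_1 <- data.frame(b = 2)  # example data frame
not_a_df <- 1:10                # not a data frame, should be dropped
files_list <- Filter(is.data.frame, mget(ls()))
names(files_list)               # expect "df_1" "my_data_1" in a fresh session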
Please try the code below, which collects the name and class of every data.frame in the global environment and then generates a wide one-row df:
library(dplyr)
library(tibble)
library(tidyr)
naml <- list()
for (j in ls(envir = .GlobalEnv)) {
  if (is.data.frame(get(j))) {
    # record the object's name and its class
    naml[[j]] <- data.frame(namex = j, classx = class(get(j))[1])
  }
}
df2 <- do.call(rbind, naml) %>%
  rownames_to_column('name') %>%
  pivot_wider(names_from = name, values_from = namex)
Is there a quick and easy way, using dplyr, to add a column called 'site_id' that populates each row with the number from the file name, when using map_df from the purrr package to bring the data into one dataframe?
For example my.files will read in two csv files:
"H:/Documents/2015.csv" and "H:/Documents/2021.csv"
my.files <- list.files(my.path, pattern = "\\.csv$", full.names = TRUE)
I then use map_df to bring all the data into one data frame, but would like to create an additional column called 'site_id' that populates each row from a file with its original file title, e.g. 2015 or 2021.
I currently merge the .csv files together with this code:
temp.df <- my.files %>% map_df(~read.csv(., skip = 15))
I envisage using mutate to help, but am unsure how it would work...
temp.df <- my.files %>% map_df(~read.csv(., skip = 15) %>%
mutate(site_id = ????))
Any help is much appreciated.
We may use imap_dfr if we want to use mutate:
library(dplyr)
library(purrr)
setNames(my.files, my.files) %>%
  imap_dfr(~ read.csv(.x, skip = 15) %>%
             mutate(site_id = .y))
Or specify the .id in map_dfr:
setNames(my.files, my.files) %>%
map_dfr(read.csv, skip = 15, .id = "site_id")
Using purrr & dplyr:
temp.df <- my.files %>%
  purrr::set_names() %>%
  purrr::map(~ read.csv(.x, skip = 15)) %>%
  dplyr::bind_rows(.id = "site_id")
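Note that in all of the above, site_id holds the full file path. If only the bare file title (e.g. 2015) is wanted, the names can be stripped first; a sketch, assuming the paths from the question:
my.files %>%
  purrr::set_names(tools::file_path_sans_ext(basename(.))) %>%
  purrr::map_dfr(read.csv, skip = 15, .id = "site_id")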
I have a folder with different files, each in a different format, so I created a different function to read each of the files. Is it possible to use map to apply the corresponding function to the corresponding file?
I have found this post about applying several functions to the same object, but I don't think it is applicable here, since there all of the functions are always applied.
all_files <- list.dirs(file.path(path))
fun_A <- function(x) {read.csv(x)}
fun_B <- function(x) {read.table(x)}
fun_C <- function(x) {read.delim(x)}
funs <- c(fun_A, fun_B, fun_C)
So, if I do it manually it works:
(all_files %>%
   purrr::map(., ~list.files(., full.names = T)))[[1]][1] %>% fun_A() %>%
  dplyr::bind_rows((all_files %>%
     purrr::map(., ~list.files(., full.names = T)))[[1]][2] %>% fun_B()) %>%
  dplyr::bind_rows((all_files %>%
     purrr::map(., ~list.files(., full.names = T)))[[1]][3] %>% fun_C())
But I tried several times with purrr and I am not able to make it work. This is my final attempt:
all_files %>% purrr::map(.x = ., ~{
  df = (.x)
  funs %>% purrr::map(., ~ df %>% (.))
})
Any suggestions?
You can use Map or map2 as suggested by @akrun:
do.call(rbind, Map(function(x, y) y(x), all_files, funs))
Using map2_df:
purrr::map2_df(all_files, funs, ~.y(.x))
For this to work it is expected that length(all_files) and length(funs) are equal.
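If the files do not arrive in a fixed, known order, matching reader to file by position is fragile. An alternative sketch that dispatches on the file extension instead (the readers lookup here is a hypothetical mapping; adjust it to the real formats):
library(purrr)
# hypothetical extension-to-reader lookup
readers <- list(csv = read.csv, txt = read.table, tsv = read.delim)
files <- list.files(path, full.names = TRUE)
dfs <- map(files, ~ readers[[tools::file_ext(.x)]](.x))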
Consider I have two Excel files in my subdirectory:
.../Myfolder/File1.xlsx
.../Myfolder/File2.xlsx
I know that I can read them into R as a list using the following code:
data <- list.files(path = "./Myfolder/", pattern="*.xlsx", full.names = T)
data.list <- lapply(data, read_excel)
However, I want to name the objects in the list according to the file names. That is, the first object's name should be "File1" and the second one's "File2". I can use:
names(data.list) <- data
But then I get the full path as the name (because I use full.names = T).
You can do:
names(data.list) <- sub('\\.xlsx', '', basename(data))
Or without any regex:
names(data.list) <- tools::file_path_sans_ext(basename(data))
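Putting both steps together, a minimal sketch using the paths from the question:
data <- list.files(path = "./Myfolder/", pattern = "\\.xlsx$", full.names = TRUE)
data.list <- lapply(data, read_excel)
names(data.list) <- tools::file_path_sans_ext(basename(data))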
This is what you're asking for:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
   keep(~ str_detect(.x, '\\.xlsx$')) %>%
   set_names(.) %>%
   map(read_excel) ->
   data)
But supposing they all have the same columns:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
   keep(~ str_detect(.x, '\\.xlsx$')) %>%
   map_dfr(~ read_excel(.x) %>% mutate(sheet = .x)) ->
   data)
Supposing they all share an identification column and represent different data about the same individuals:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
   keep(~ str_detect(.x, '\\.xlsx$')) %>%
   map(read_excel) %>%
   reduce(left_join) -> # or reduce(~ left_join(.x, .y, by = 'key_variable_name'))
   data)
Either way, with set_names you can pipe in the name assignment, which is preferable to having two expressions: one to create the data, another to label it.
P.S:
This is how I'd do it nowadays:
library(tidyverse)
library(readxl)
library(fs)
fs::dir_ls(
  path = "folder/",
  glob = "*.xlsx") %>%
  purrr::set_names(
    x = purrr::map(., readxl::read_excel),
    nm = .)
# or maybe within a tibble?
tibble::tibble(
  path = fs::dir_ls(
    path = "folder/",
    glob = "*.xlsx"),
  data = purrr::map(path, readxl::read_excel))
I had to modify this to get it to work. However, it keeps the file extension in the list names, which I don't like.
(list.files(path = 'filepath', pattern = "\\.xlsx$", full.names = TRUE) %>%
   keep(~ str_detect(.x, '\\.xlsx$')) %>%
   set_names(.) %>%
   map(read_excel) ->
   data)
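The unwanted extension can be dropped by naming explicitly instead of with set_names(.), reusing tools::file_path_sans_ext from the earlier answer; a sketch:
(list.files(path = 'filepath', pattern = "\\.xlsx$", full.names = TRUE) %>%
   set_names(tools::file_path_sans_ext(basename(.))) %>%
   map(read_excel) ->
   data)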
allcsvs = list.files(pattern = "\\.csv$", recursive = TRUE)
library(tidyverse)
##LOOP to redact the snow data csvs##
for(x in 1:length(allcsvs)) {
  df = read.csv(allcsvs[x], check.names = FALSE)
  newdf = df %>%
    gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
    mutate(
      DATE = as.Date(DATE, format = "%m/%d/%Y"),
      COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
    ) %>%
    filter(DATE == COL_DATE) %>%
    select(-COL_DATE)
  ####TURN DATES UNAMBIGUOUS HERE####
  df$DATE = lubridate::mdy(df$DATE)
  finaldf = merge(newdf, df, all.y = TRUE)
  write.csv(finaldf, allcsvs[x])
  df = read.csv(allcsvs[x])
  newdf = df[, -grep("X20", colnames(df))]
  write.csv(newdf, allcsvs[x])
}
I am using the code above to populate a new column row-by-row using values from different existing columns, with the date as the selection criterion. If I manually open each .csv in Excel and delete the first column, this code works great. However, if I run it on the .csvs "as is",
I get the following message:
Error: Column 1 must be named
So far I've tried putting -rownames within the parentheses of gather, and putting remove_rownames %>% below newdf = df %>%, but nothing seems to work. I also tried reading the csv without the first column ([, -1]) or deleting the first column in R (df[, 1] <- NULL), but for some reason my code then returns an empty table instead of what I want. In other words, I can delete the rownames in Excel and it works great; if I delete them in R, something funky happens.
Here is some sample data: https://drive.google.com/file/d/1RiMrx4wOpUdJkN4il6IopciSF6pKeNLr/view?usp=sharing
Consider importing them with readr::read_csv.
An easy solution with tidyverse:
allcsvs %>%
  map(read_csv) %>%
  reduce(bind_rows) %>%
  gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
  mutate(
    DATE = as.Date(DATE, format = "%m/%d/%Y"),
    COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
  ) %>%
  filter(DATE == COL_DATE) %>%
  select(-COL_DATE)
With utils::read.csv, strings are imported as factors, so as.Date(DATE, format = "%m/%d/%Y") evaluates to NA.
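As an aside, the map plus reduce pair above can be collapsed into a single call; an equivalent sketch:
allcsvs %>%
  map_dfr(read_csv) # same result as map(read_csv) %>% reduce(bind_rows)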
Update
The solution above returns one single dataframe. To write each data file separately, use a for loop:
for(x in 1:length(allcsvs)) {
  read_csv(allcsvs[x]) %>%
    gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
    mutate(
      DATE = as.Date(DATE, format = "%m/%d/%Y"),
      COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
    ) %>%
    filter(DATE == COL_DATE) %>%
    select(-COL_DATE) %>%
    write_csv(paste('tidy', allcsvs[x], sep = '_'))
}
Comparison
purrr::map and purrr::reduce can be used instead of for loops in some cases. Those functions take other functions as arguments.
readr::read_csv is typically 10x faster than the base R equivalents (more info: http://r4ds.had.co.nz/data-import.html). It also handles CSV files more predictably, e.g. it never converts strings to factors.