Add filename to a new column when using map_df in R

Is there a quick and easy way, using dplyr, to add a column called 'site_id' that populates each row with the number taken from the filename, when using map_df from the purrr package to bring the data into one data frame?
For example, my.files will list two csv files:
"H:/Documents/2015.csv" and "H:/Documents/2021.csv"
my.files <- list.files(my.path, pattern = "*.csv", full.names = TRUE)
I then use map_df to bring all the data into one data frame, but I would like to create an additional column called 'site_id' that populates each row with the title of the file it came from, e.g. 2015 or 2021.
I currently merge the .csv files together with this code:
temp.df <- my.files %>% map_df(~read.csv(., skip = 15))
I envisage using mutate to help, but am unsure how it would work:
temp.df <- my.files %>% map_df(~read.csv(., skip = 15) %>%
mutate(site_id = ????))
Any help is much appreciated.

We may use imap_dfr if we want to use mutate:
library(dplyr)
library(purrr)
setNames(my.files, my.files) %>%
  imap_dfr(~ read.csv(.x, skip = 15) %>%
             mutate(site_id = .y))
Or specify the .id argument in map_dfr:
setNames(my.files, my.files) %>%
  map_dfr(read.csv, skip = 15, .id = "site_id")

Using purrr & dplyr:
temp.df <- my.files %>%
  purrr::set_names() %>%
  purrr::map(~ read.csv(.x, skip = 15)) %>%
  dplyr::bind_rows(.id = "site_id")
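Both approaches above leave the full file path in site_id. If you want just the file title (e.g. 2015 or 2021), one possible follow-up, sketched here, is to strip the directory and extension afterwards with basename() and tools::file_path_sans_ext():
library(dplyr)
library(purrr)

# Sketch: same idea as above, then reduce the full path to the bare
# file name, e.g. "H:/Documents/2015.csv" -> "2015".
temp.df <- my.files %>%
  set_names() %>%
  map_dfr(read.csv, skip = 15, .id = "site_id") %>%
  mutate(site_id = tools::file_path_sans_ext(basename(site_id)))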

Related

Using purrr::map to rename columns based on another list in R

I have multiple files and I want to rename the second column of each file with a name coming from the samples = c("sample1", "sample2") vector. As I am learning the purrr::map functions, I am struggling to do the renaming inside map.
Here is an example:
Any help is extremely appreciated
library(purrr)
library(data.table)
library(dplyr)
files <- paste0("file", 1:3, ".txt")
## Create example files in a temp dir
temp <- tempdir()
walk(files, ~ write.csv(iris[1:2], file.path(temp, .x), row.names = FALSE))
files |>
  map(~ fread(file.path(temp, .x)) %>% rename(test = 1, samples = 2))
Of course, this does not work, but this is where I am so far.
This is one way to do it:
We use map2() to loop over both files and samples; for each file we first read in the data with fread(file.path(temp, .x)) and then pipe that into rename(., test = 1, !!sym(.y) := 2).
samples contains strings. We need to turn the strings into symbols with sym() (or alternatively as.name()) and evaluate them with !!. When we use this kind of syntax on the left-hand side, we also need the walrus operator := instead of =.
samples <- c("sample1", "sample2", "sample3")
files |>
  map2(samples, ~ fread(file.path(temp, .x)) %>% rename(., test = 1, !!sym(.y) := 2))
If you want to rename a different column in every data.frame, it's better to construct a list of lists as below and splice each sublist into rename() with !!!. (The example below just uses the second column, but we could change that to any column number we want.)
samples <- list(
  list("sample1" = 2),
  list("sample2" = 2),
  list("sample3" = 2)
)
files |>
  map2(samples, ~ fread(file.path(temp, .x)) %>% rename(., test = 1, !!!.y))
Since you are using data.table to read in the data, we don't need dplyr::rename() to rename the columns. In particular, the case where you want to rename each file's second column is easier with data.table::setnames():
samples <- c("sample1", "sample2", "sample3")
files |>
  map2(samples, ~ fread(file.path(temp, .x)) %>% setnames(., 2, .y))

Add new column name to a list of data frames from a part of the file name using lapply

I have 10 Excel files in a folder that all have the same columns. The end goal is to bind_rows() and combine them all. Each file represents a week (in this case weeks 1-10). I am looking to see how I can add a new column called "Week" to the final product. Each file name looks like "...wk1.xlsx", "...wk2.xlsx", etc. I am trying to figure out how I can detect "wk1", etc. and turn that into a new column in a list of data frames.
Here's what I have...
files <- list.files(path = "Users/Desktop/week", pattern = "*.xlsx", full.names = TRUE) %>%
  lapply(read_excel, sheet = 4, skip = 39) %>%
  bind_rows()
Name the list of filenames using setNames(), then use the .id argument in bind_rows(), which adds a column containing list names.
library(tidyverse)
library(readxl)
files <- list.files(path = "Users/Desktop/week", pattern = "*.xlsx", full.names = TRUE) %>%
  setNames(nm = .) %>%
  lapply(read_excel, sheet = 4, skip = 39) %>%
  bind_rows(.id = "Week") %>%
  mutate(Week = str_extract(Week, "wk\\d+"))
You could also combine the iteration and row-binding steps using purrr::map_dfr():
files <- list.files(path = "Users/Desktop/week", pattern = "*.xlsx", full.names = TRUE) %>%
  setNames(nm = .) %>%
  map_dfr(read_excel, sheet = 4, skip = 39, .id = "Week") %>%
  mutate(Week = str_extract(Week, "wk\\d+"))

output of group_split needs to be saved as separate dataframe

I have split my data frame using group_split and now want to save the outputs as separate csv files. I am using the following code, but I am unable to get those csvs. Please help.
Thank you.
library(dplyr)
y <- year_x %>% group_split(Year)
for(i in 1:length(y)) {
  write.csv2(get(y[i]),
             paste0("D:/newfolder",
                    y[i],
                    ".csv"),
             row.names = FALSE)
}
group_split will not give you the year as the list names. Use base::split instead:
y <- split(year_x, year_x$Year)
for(i in seq_along(y)) {
  write.csv2(y[[i]], paste0("D:/newfolder/", names(y)[i], ".csv"), row.names = FALSE)
}
You could also do this with purrr::imap
purrr::imap(y, ~ write.csv2(.x, paste0("D:/newfolder/", .y, ".csv"), row.names = FALSE))
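If you would rather keep group_split(), one workaround (just a sketch, assuming year_x has a Year column) is to attach the names yourself from group_keys(), which returns the groups in the same order as group_split():
library(dplyr)

# Sketch: name the group_split() pieces from group_keys(), then write
# them out as before.
year_x_grouped <- year_x %>% group_by(Year)
y <- group_split(year_x_grouped)
names(y) <- group_keys(year_x_grouped)$Year
for(i in seq_along(y)) {
  write.csv2(y[[i]], paste0("D:/newfolder/", names(y)[i], ".csv"), row.names = FALSE)
}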
We can use group_map
library(dplyr)
library(purrr)
library(stringr)
year_x %>%
  group_by(Year) %>%
  group_map(~ write.csv2(.x, file.path("D:/newfolder", str_c(.y$Year, ".csv")), row.names = FALSE))

Import multiple Excel files with names in R as a list

Consider I have two Excel files in my subdirectory:
.../Myfolder/File1.xlsx
.../Myfolder/File2.xlsx
I know that I can read them into R as a list using the following approach:
data <- list.files(path = "./Myfolder/", pattern="*.xlsx", full.names = T)
data.list <- lapply(data, read_excel)
However, I want to name the objects in the list according to the file names. That is, the first object's name should be "File1" and the second one "File2". I can use:
names(data.list) <- data
But then I get the full path (because I use full.names = T).
You can do:
names(data.list) <- sub('\\.xlsx', '', basename(data))
Or without any regex:
names(data.list) <- tools::file_path_sans_ext(basename(data))
This is what you're asking for:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
  keep(~ str_detect(.x, '.xlsx')) %>%
  set_names(.) %>%
  map(read_excel) ->
  data)
But supposing they all have the same columns in each:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
  keep(~ str_detect(.x, '.xlsx')) %>%
  map_dfr(~ read_excel(.x) %>% mutate(sheet = .x)) ->
  data)
Supposing they all share an identification column and represent different data about the same individuals:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
  keep(~ str_detect(.x, '.xlsx')) %>%
  map(read_excel) %>%
  reduce(left_join) -> # or reduce(~ left_join(.x, .y, by = 'key_variable_name'))
  data)
Either way, with set_names you can pipe in the name assignment, which is preferable to having two expressions, one to create the data and the other to label it.
P.S.:
This is how I'd do it nowadays:
library(tidyverse)
library(readxl)
library(fs)
fs::dir_ls(
  path = "folder/",
  glob = "*.xlsx") %>%
  purrr::set_names(
    x = purrr::map(., readxl::read_excel),
    nm = .)
# or maybe within a tibble?
tibble::tibble(
  path = fs::dir_ls(
    path = "folder/",
    glob = "*.xlsx"),
  data = purrr::map(path, readxl::read_excel))
I had to modify this a bit, as below. However, it does keep the full path and file extension in the list names, which I don't like.
(list.files(path = 'filepath', pattern = "\\.xlsx$", full.names = TRUE) %>%
  keep(~ str_detect(.x, '\\.xlsx$')) %>%
  set_names(.) %>%
  map(read_excel) ->
  data)
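If you also want to drop the directory and the .xlsx extension from those names, a small variation (just a sketch) is to pass cleaned names to set_names(), reusing basename() and tools::file_path_sans_ext() from the earlier answer:
library(tidyverse)
library(readxl)

# Sketch: store the paths first, then use the cleaned file names
# ("File1", "File2", ...) as the list names.
files <- list.files(path = 'filepath', pattern = "\\.xlsx$", full.names = TRUE)
data <- files %>%
  set_names(tools::file_path_sans_ext(basename(files))) %>%
  map(read_excel)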

`gather` can't handle rownames

allcsvs = list.files(pattern = "*.csv$", recursive = TRUE)
library(tidyverse)
##LOOP to redact the snow data csvs##
for(x in 1:length(allcsvs)) {
  df = read.csv(allcsvs[x], check.names = FALSE)
  newdf = df %>%
    gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
    mutate(
      DATE = as.Date(DATE, format = "%m/%d/%Y"),
      COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
    ) %>%
    filter(DATE == COL_DATE) %>%
    select(-COL_DATE)
  ####TURN DATES UNAMBIGUOUS HERE####
  df$DATE = lubridate::mdy(df$DATE)
  finaldf = merge(newdf, df, all.y = TRUE)
  write.csv(finaldf, allcsvs[x])
  df = read.csv(allcsvs[x])
  newdf = df[, -grep("X20", colnames(df))]
  write.csv(newdf, allcsvs[x])
}
I am using the code above to populate a new column row by row with values from different existing columns, using the date as the selection criterion. If I manually open each .csv in Excel and delete the first column, this code works great. However, if I run it on the .csvs "as is", I get the following message:
Error: Column 1 must be named
So far I've tried putting -rownames within the parentheses of gather, and I've tried putting remove_rownames %>% below newdf = df %>%, but nothing seems to work. I also tried reading the csv without the first column ([, -1]) or deleting the first column in R (df[, 1] <- NULL), but for some reason when I do that my code returns an empty table instead of what I want. In other words, if I delete the rownames in Excel it works great, but if I delete them in R something funky happens.
Here is some sample data: https://drive.google.com/file/d/1RiMrx4wOpUdJkN4il6IopciSF6pKeNLr/view?usp=sharing
You can consider importing them with readr::read_csv.
An easy solution with tidyverse:
allcsvs %>%
  map(read_csv) %>%
  reduce(bind_rows) %>%
  gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
  mutate(
    DATE = as.Date(DATE, format = "%m/%d/%Y"),
    COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
  ) %>%
  filter(DATE == COL_DATE) %>%
  select(-COL_DATE)
With utils::read.csv, you are importing strings as factors, so as.Date(DATE, format = "%m/%d/%Y") evaluates to NA.
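(Since R 4.0.0, read.csv defaults to stringsAsFactors = FALSE; on older versions, a minimal sketch of the tweak for the base-R route would be, for example:)
# Sketch for the base-R route on pre-4.0.0 defaults: read strings as
# character so as.Date() can parse them.
df <- read.csv(allcsvs[1], check.names = FALSE, stringsAsFactors = FALSE)
df$DATE <- as.Date(df$DATE, format = "%m/%d/%Y")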
Update
The solution above returns one single data frame. To write each data file separately with a for loop:
for(x in 1:length(allcsvs)) {
  read_csv(allcsvs[x]) %>%
    gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
    mutate(
      COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
    ) %>%
    filter(DATE == COL_DATE) %>%
    select(-COL_DATE) %>%
    write_csv(paste('tidy', allcsvs[x], sep = '_'))
}
Comparison
purrr::map and purrr::reduce can be used instead of a for loop in some cases. These functions take other functions as arguments.
readr::read_csv is typically 10x faster than the base R equivalents (more info: http://r4ds.had.co.nz/data-import.html). It also handles CSV files better.
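For example, the per-file loop above could also be expressed with purrr::walk (a sketch of the same steps, used here because we only care about the side effect of writing files):
library(tidyverse)

# Sketch: same tidy-and-write steps as the for loop above, per file.
walk(allcsvs, function(f) {
  read_csv(f) %>%
    gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
    mutate(COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")) %>%
    filter(DATE == COL_DATE) %>%
    select(-COL_DATE) %>%
    write_csv(paste('tidy', f, sep = '_'))
})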
