I have a folder of csv files that I need to loop through in R, cleaning each one and creating new columns based on information in the file name. I am trying to use purrr, and this is what I have done so far.
# get file names
files_names <- list.files("data/", recursive = TRUE, full.names = TRUE)
# inspect
files_names
[1] "data/BOC_All_ATMImage_(Aug 2020).txt" "data/BOC_All_ATMImage_(Aug 2021).txt" "data/BOC_All_ATMImage_(Feb 2021).txt"
[4] "data/BOC_All_ATMImage_(May 2021).txt" "data/BOC_All_ATMImage_(Nov 2020).txt" "data/BOC_All_ATMImage_(Nov 2021).txt"
# extract month/year inside brackets and convert to snakecase
# this will be used later to create column names
names_data <- files_names %>%
str_extract(., "(?<=\\().*?(?=\\))") %>%
str_to_lower() %>%
str_replace(., " ", "_")
names_data
[1] "aug_2020" "aug_2021" "feb_2021" "may_2021" "nov_2020" "nov_2021"
Now loop through the csvs: read each one, do some data cleaning, and create the columns:
mc_data <-
map(files_names,
~ read_csv(.x, guess_max = 50000) %>%
janitor::clean_names() %>%
mutate(month_year = str_extract(.x, "(?<=\\().*?(?=\\))"),
date_dmy = paste0(day, "-", month_year),
date = dmy(date_dmy),
fsa = str_sub(postal_code, start = 1, end=3),
?? = 1) %>%
select(-date_dmy),
.id = "group"
)
I need to mutate one more column, and that column has to be named based on the names_data extracted above. I currently have this as ?? in the fake code above. names_data follows the same order as the file paths, so the idea is to do it in one loop and save each data set after it has been cleaned.
We can use glue syntax and map2. Perhaps:
mc_data <-
map2(files_names, names_data,
~ read_csv(.x, guess_max = 50000) %>%
janitor::clean_names() %>%
mutate(month_year = str_extract(.x, "(?<=\\().*?(?=\\))"),
date_dmy = paste0(day, "-", month_year),
date = dmy(date_dmy),
fsa = str_sub(postal_code, start = 1, end=3),
'{.y}' := 1) %>%
select(-date_dmy)
)
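Note that map2() returns an unnamed list (unlike map_dfr(), it has no .id argument), so it can help to key each cleaned piece by its label before saving. A small sketch, assuming the objects above (the "clean/" output folder is illustrative):
mc_data <- set_names(mc_data, names_data)
# write each cleaned data set out, keyed by its month/year label
iwalk(mc_data, ~ write_csv(.x, paste0("clean/", .y, ".csv")))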
Related
Is there a quick and easy way, using dplyr, to add a column called 'site_id' that populates each row with the number given in the filename, when using map_df from the purrr package to bring the data into one data frame?
For example, my.files will pick up two csv files:
"H:/Documents/2015.csv" and "H:/Documents/2021.csv"
my.files <- list.files(my.path, pattern = "*.csv", full.names = TRUE)
I then use map_df to bring all the data into one data frame, but would like to create an additional column called 'site_id' that will populate each row from a file with its original file title, e.g. 2015 or 2021.
I currently merge the .csv files together with this code:
temp.df <- my.files %>% map_df(~read.csv(., skip = 15))
I envisage using mutate to help, but am unsure how it would work...
temp.df <- my.files %>% map_df(~read.csv(., skip = 15) %>%
mutate(site_id = ????))
Any help is much appreciated.
We may use imap if we want to use mutate:
library(dplyr)
library(purrr)
setNames(my.files, my.files) %>%
imap_df(~ read.csv(.x, skip = 15) %>%
mutate(site_id = .y))
Or specify the .id argument in map_dfr:
setNames(my.files, my.files) %>%
map_dfr(read.csv, skip = 15, .id = "site_id")
Using purrr & dplyr:
temp.df <- my.files %>%
purrr::set_names() %>%
purrr::map(~ read.csv(.x, skip = 15)) %>%
dplyr::bind_rows(.id = "site_id")
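Both versions fill site_id with the full path. If you only want the year part of the title (e.g. 2015 or 2021), an optional extra step, assuming four-digit-year file names like "2015.csv":
temp.df <- temp.df %>%
  dplyr::mutate(site_id = stringr::str_extract(basename(site_id), "\\d{4}"))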
I have 10 Excel files in a folder that all have the same columns. The end result is to bind_rows and combine them all. Each file represents a week (in this case weeks 1-10). I am looking to add a new column called "Week" to the final product. Each file name looks like "...wk1.xlsx", "...wk2.xlsx", etc. I am trying to figure out how I can detect "wk1", etc. and turn that into a new column in a list of data frames.
Here's what I have...
files <- list.files(path = "Users/Desktop/week", pattern = "*.xlsx", full.names = TRUE) %>%
lapply(read_excel, sheet = 4, skip = 39) %>%
bind_rows()
Name the list of filenames using setNames(), then use the .id argument in bind_rows(), which adds a column containing list names.
library(tidyverse)
library(readxl)
files <- list.files(path = "Users/Desktop/week", pattern = "*.xlsx", full.names = TRUE) %>%
setNames(nm = .) %>%
lapply(read_excel, sheet = 4, skip = 39) %>%
bind_rows(.id = "Week") %>%
mutate(Week = str_extract(Week, "wk\\d+"))
You could also combine the iteration and row-binding steps using purrr::map_dfr():
files <- list.files(path = "Users/Desktop/week", pattern = "*.xlsx", full.names = TRUE) %>%
setNames(nm = .) %>%
map_dfr(read_excel, sheet = 4, skip = 39, .id = "Week") %>%
mutate(Week = str_extract(Week, "wk\\d+"))
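If a numeric week is handier than the "wk1" label, an optional extra step (readr::parse_number() pulls the digits out of each label):
files <- files %>% mutate(Week = readr::parse_number(Week))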
I have multiple Excel files to be imported into R. I used the following code:
files <- list.files(path = "D:/xxx/Daily Report/", pattern = "*.xlsx", full.names = T)
tbl <- sapply(files, read_xlsx, simplify=FALSE) %>% bind_rows(.id = "S No")
In the S No column, the values filled in are the file paths. I want to convert them into row numbers. I get the following error when I try to change the value of S No:
tbl <- tbl %>% mutate(.$`S No` = row_number())
Error: unexpected '=' in "tbl <- tbl %>% mutate(.$`S No` ="
This should work:
tbl <- sapply(files, read_xlsx, simplify = FALSE) %>% bind_rows() %>% mutate(.id = 1:n())
You could also change .id to whatever other column name you want for it.
And using the intended function for this, thanks to OP and the commenter below, @Gregor Thomas:
tbl <- sapply(files, read_xlsx, simplify = FALSE) %>% bind_rows() %>% mutate(.id = row_number())
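If you would rather keep the column name from the question ("S No"), the same idea works with backticks (a sketch):
tbl <- sapply(files, read_xlsx, simplify = FALSE) %>%
  bind_rows() %>%
  mutate(`S No` = row_number())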
Consider I have two Excel files in my subdirectory:
.../Myfolder/File1.xlsx
.../Myfolder/File2.xlsx
I know that I can read them into R as a list using the following code:
data <- list.files(path = "./Myfolder/", pattern="*.xlsx", full.names = T)
data.list <- lapply(data, read_excel)
However, I want to name the objects in the list according to the file names. That is, the first object's name should be "File1" and the second one "File2". I can use:
names(data.list) <- data
But then I get the full name (because I use full.names = T).
You can do:
names(data.list) <- sub('\\.xlsx', '', basename(data))
Or without any regex:
names(data.list) <- tools::file_path_sans_ext(basename(data))
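Either way, each data set can then be pulled from the list by its clean name, e.g.:
data.list[["File1"]]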
This is what you're asking.
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
keep(~ str_detect(.x, '\\.xlsx$')) %>%
set_names(.) %>%
map(read_excel) ->
data)
But supposing they all have the same columns in each:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
keep(~ str_detect(.x, '\\.xlsx$')) %>%
map_dfr(~ read_excel(.x) %>% mutate(sheet = .x)) ->
data)
Supposing they all share an identification column and represent different data about the same individuals:
library(tidyverse)
library(stringr)
library(readxl)
(list.files('folder_with_sheets') %>%
keep(~ str_detect(.x, '\\.xlsx$')) %>%
map(read_excel) %>%
reduce(left_join) -> # or reduce(~ left_join(.x, .y, by = 'key_variable_name'))
data)
Either way, with set_names you can pipe in the name assignment, which is preferable to having two expressions: one to create the data, another to label it.
P.S:
This is how I'd do it nowadays:
library(tidyverse)
library(readxl)
library(fs)
fs::dir_ls(
path = "folder/",
glob = "*.xlsx") %>%
purrr::set_names(
x = purrr::map(., readxl::read_excel),
nm = .)
# or maybe within a tibble?
tibble::tibble(
path = fs::dir_ls(
path = "folder/",
glob = "*.xlsx"),
data = purrr::map(path, readxl::read_excel))
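If the files share the same columns, the nested tibble flattens into one data frame with tidyr::unnest() (a sketch built on the tibble above):
tibble::tibble(
path = fs::dir_ls(
path = "folder/",
glob = "*.xlsx"),
data = purrr::map(path, readxl::read_excel)) %>%
tidyr::unnest(cols = data)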
I had to modify it a bit. However, it keeps the file extension in the list names, which I don't like.
(list.files(path = 'filepath', pattern = "\\.xlsx$", full.names = TRUE) %>%
keep(~ str_detect(.x, '\\.xlsx$')) %>%
set_names(.) %>%
map(read_excel) ->
data)
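To also drop the extension from the list names, the names can be computed from the paths inside the pipe (a sketch on the same pipeline):
(list.files(path = 'filepath', pattern = "\\.xlsx$", full.names = TRUE) %>%
keep(~ str_detect(.x, '\\.xlsx$')) %>%
set_names(tools::file_path_sans_ext(basename(.))) %>%
map(read_excel) ->
data)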
allcsvs = list.files(pattern = "*.csv$", recursive = TRUE)
library(tidyverse)
##LOOP to redact the snow data csvs##
for(x in 1:length(allcsvs)) {
df = read.csv(allcsvs[x], check.names = FALSE)
newdf = df %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
DATE = as.Date(DATE,format = "%m/%d/%Y"),
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE)
####TURN DATES UNAMBIGUOUS HERE####
df$DATE = lubridate::mdy(df$DATE)
finaldf = merge(newdf, df, all.y = TRUE)
write.csv(finaldf, allcsvs[x])
df = read.csv(allcsvs[x])
newdf = df[, -grep("X20", colnames(df))]
write.csv(newdf, allcsvs[x])
}
I am using the code above to populate a new column row by row, using values from different existing columns, with date as the selection criterion. If I manually open each .csv in Excel and delete the first column, this code works great. However, if I run it on the .csvs "as is", I get the following message:
Error: Column 1 must be named
So far I've tried putting -rownames within the parentheses of gather, and I've tried putting remove_rownames %>% below newdf = df %>%, but nothing seems to work. I tried reading the csv without the first column ([, -1]) or deleting the first column in R (df[, 1] <- NULL), but for some reason when I do that my code returns an empty table instead of what I want. In other words, if I delete the rownames in Excel it works great; if I delete them in R, something funky happens.
Here is some sample data: https://drive.google.com/file/d/1RiMrx4wOpUdJkN4il6IopciSF6pKeNLr/view?usp=sharing
Consider importing them with readr::read_csv instead.
An easy solution with tidyverse:
allcsvs %>%
map(read_csv) %>%
reduce(bind_rows) %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
DATE = as.Date(DATE,format = "%m/%d/%Y"),
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE)
With utils::read.csv, you are importing strings as factors, so as.Date(DATE, format = "%m/%d/%Y") evaluates to NA.
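If you do want to stay with base R, stringsAsFactors = FALSE avoids the factor issue (a minimal sketch, reading the first file only):
df <- read.csv(allcsvs[1], stringsAsFactors = FALSE, check.names = FALSE)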
Update
The solution above returns one single data frame. To write each data file out separately, use a for loop:
for(x in 1:length(allcsvs)) {
read_csv(allcsvs[x]) %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE) %>%
write_csv(paste('tidy', allcsvs[x], sep = '_'))
}
Comparison
purrr::map and purrr::reduce can be used instead of a for loop in some cases. Those functions take other functions as arguments.
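For example, the writing loop above can be phrased with purrr::walk(), the side-effect counterpart of map(), since write_csv() is called for its effect rather than its return value (a sketch of the same steps, assuming library(tidyverse) is loaded as above):
walk(allcsvs, ~ read_csv(.x) %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE) %>%
write_csv(paste('tidy', .x, sep = '_')))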
readr::read_csv is typically ~10x faster than the base R equivalents (more info: http://r4ds.had.co.nz/data-import.html), and it also handles CSV files more reliably.