Loop code through multiple .csv files in R - r

I'm trying to run a chunk of code through 50+ csv files, but I can't figure out how to do it. All the csv files have the same columns, but different number of rows.
So far I have the following:
filelist<- list.files(pattern = ".csv") #made a file list with all the .csv files in the directory
samples <- lapply(filelist, function(x) read.table(x, header=T)[,c(1,2,3,5)]) #read all the csv files (only the columns I'm interested in)
output <- samples %>%
rename (orf = protein) %>%
filter (!grepl("sp", orf)) %>%
write.csv (paste0("new_", filename))
#I want to rename a column and remove all rows containing "sp" in that column, then export the dataframe as new_originalfilename.csv
Any help would be greatly appreciated!

You may do this in the same lapply loop.
library(dplyr)
lapply(filelist, function(x) {
read.table(x, header=T) %>%
select(1, 2, 3, 5) %>%
rename(orf = protein) %>%
filter(!grepl("sp", orf)) %>%
write.csv(paste0("new_", x), row.names = FALSE)
})

Related

How to combine multiple .txt files with different # of rows in R and keep file names?

The goal is to combine multiple .txt files with single column from different subfolders then cbind to one dataframe (each file will be one column), and keep file names as columne value, an example of the .txt file:
0.348107
0.413864
0.285974
0.130399
...
My code:
#list all the files in the folder
listfile<- list.files(path="",
pattern= "txt",full.names = T, recursive = TRUE) #To include sub directories, change the recursive = TRUE, else FALSE.
#extract the files with folder name aINS
listfile_aINS <- listfile[grep("aINS",listfile)]
#inspect file names
head(listfile_aINS)
#combined all the text files in listfile_aINS and store in dataframe 'Data'
for (i in 1:length(listfile_aINS)){
if(i==1){
assign(paste0("Data"), read.table(listfile[i],header = FALSE, sep = ","))
}
if(!i==1){
assign(paste0("Test",i), read.table(listfile[i],header = FALSE, sep = ","))
Data <- cbind(Data,get(paste0("Test",i))) #choose one: cbind, combine by column; rbind, combine by row
rm(list = ls(pattern = "Test"))
}
}
rm(list = ls(pattern = "list.+?"))
I ran into two problems:
R returns this error because the .txt files have different # of rows.
"Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 37, 36"
I have too many files so I hope to work around the error without having to fix the files into the same length.
my code won't keep file name as the column name
It will be easier to write a function and then rbind() the data from each file. The resulting data frame will have a file column with the filename from the listfile_aINS vector.
read_file <- function(filename) {
dat <- read.table(filename,header = FALSE, sep = ",")
dat$file <- filename
return(dat)
}
all_dat <- do.call(rbind, lapply(listfile_aINS, read_file))
If they don't all have the same number of rows it might not make sense to have each column be a file, but if you really want that you could make it into a wide dataset with NA filling out the empty rows:
library(dplyr)
library(tidyr)
all_dat %>%
group_by(file) %>%
mutate(n = 1:n()) %>%
pivot_wider(names_from = file, values_from = V1)

loop through batch read folders in setwd(), format dfs & write.csv() to different folder R

I have folders with 4 .csv files in each folder. Currently I am batch reading the .csv files in the folder:
setwd("/Users/Drive/MS/Ma/Ec/Effort_variation_Ec/MES1/")
ecosmpr <-
list.files(pattern = "*.csv") %>%
map_df(~read_csv(.))
ecosmpr=data.frame(ecosmpr)
After batch reading in the 4 csvs in the folder as one data.frame, I need to do some formatting:
ecosmpr1=ecosmpr[,-c(2:13)]
dim(ecosmpr1)
ecosmpr1=ecosmpr1 %>%
row_to_names(row_number = 1)
names(ecosmpr1)=rev(c("detritus","phyto","peri","zoops","amphipods","inverts","leucisids","lns","yct5plus","yct4","yct3","yct2","yctyoy","lkt5plus","lkt34","lkt2","lkt7mo1yo",
"lktyoy","years"))
Then I want to export the formatted data.frame to a csv, but in a different location:
write.csv(ecosmpr1,"/Users/Drive/MS/Ma/Ec/Effort_variation_Ec/ecosmpr1_partialformat.csv",row.names = FALSE)
My issue is that I need to loop through the first setwd(MESXX) rename each "ecosmprXX" and export each "ecosmprXX_partialformat.csv" I am having issues with even starting this loop. My naming convention for the folder is MESXX (where XX is the number, 1:30),data frame is ecosmprXX (where XX is the number, 1:30), and exported .csv is ecosmprXX_partialformat.csv (where XX is the number, 1:30). I have 30 different folders so doing this without a loop is inefficient.
This should do the trick:
library(tidyverse)
library(janitor)
new_col_names <- rev(c("detritus","phyto","peri","zoops","amphipods","inverts","leucisids","lns",
"yct5plus","yct4","yct3","yct2","yctyoy","lkt5plus","lkt34","lkt2",
"lkt7mo1yo", "lktyoy","years"))
for (i in 1:30) {
setwd(paste0("/Users/Drive/MS/Ma/Ec/Effort_variation_Ec/MES", i, "/"))
ecosmpr <- list.files(pattern = "*.csv") %>%
map_df(~read_csv(.x))
ecosmpr <- ecosmpr %>%
select(-c(2:13)) %>%
row_to_names(row_number = 1)
names(ecosmpr) <- new_col_names
output_file <-
paste0("/Users/Drive/MS/Ma/Ec/Effort_variation_Ec/ecosmpr", i, "_partialformat.csv")
write.csv(ecosmpr, output_file, row.names = FALSE)
}

import multible CSV files and use file name as column [duplicate]

This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
(2 answers)
Closed 14 days ago.
I have numerous csv files in multiple directories that I want to read into a R tribble or data.table. I use "list.files()" with the recursive argument set to TRUE to create a list of file names and paths, then use "lapply()" to read in multiple csv files, and then "bind_rows()" stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
This approach works fine. However, I need to extract a substring from the each file name and add it as a column to the final table. I can get the substring I need with "str_extract()" like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck however on how to add the extracted substring as a column as lapply() runs through read_csv() for each file.
I generally use the following approach, based on dplyr/tidyr:
data = tibble(File = files) %>%
extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
mutate(Data = lapply(File, read_csv)) %>%
unnest(Data) %>%
select(-File)
tidyverse approach:
Update:
readr 2.0 (and beyond) now has built-in support for reading a list of files with the same columns into one output table in a single command. Just pass the filenames to be read in the same vector to the reading function. For example reading in csv files:
(files <- fs::dir_ls("D:/data", glob="*.csv"))
dat <- read_csv(files, id="path")
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
# specify the directory, then read a list of files
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = ".csv$")
# return a single data frame w/ purrr:map_dfr
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source")
# Alternatively, rename source from the file path to the file name
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source") %>%
dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}") # same length as filenames
library(purrr)
library(dplyr)
library(readr)
stopifnot(length(filenames)==length(sites)) # returns error if not the same length
ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y)) # .x is element in filenames, and .y is element in sites
The output of map2 is a list, similar to lapply
If you have a development version of purrr, you can use imap, which is a wrapper for map2 with an index
data.table approach:
If you name the list, then you can use this name to add to the data.table when binding the list together.
workflow
files <- list.files( whatever... )
#read the files from the list
l <- lapply( files, fread )
#names the list using the basename from `l`
# this also is the step to manipuly the filesnamaes to whatever you like
names(l) <- basename( l )
#bind the rows from the list togetgher, putting the filenames into the colum "id"
dt <- rbindlist( dt.list, idcol = "id" )
You just need to write your own function that reads the csv and adds the column you want, before combining them.
my_read_csv <- function(x) {
out <- read_csv(x)
site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
cbind(Site=site, out)
}
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a filenames vector based on "sites" with the exact same length as tbl and then combine the two using cbind
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
### Get length of each csv
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))
### Repeat sites using lengths
file_names <- rep(sites,file_lengths))
### Create table
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)

Read in multiple files and append file name to data frame [duplicate]

This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
(2 answers)
Closed 15 days ago.
I have numerous csv files in multiple directories that I want to read into a R tribble or data.table. I use "list.files()" with the recursive argument set to TRUE to create a list of file names and paths, then use "lapply()" to read in multiple csv files, and then "bind_rows()" stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
This approach works fine. However, I need to extract a substring from the each file name and add it as a column to the final table. I can get the substring I need with "str_extract()" like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck however on how to add the extracted substring as a column as lapply() runs through read_csv() for each file.
I generally use the following approach, based on dplyr/tidyr:
data = tibble(File = files) %>%
extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
mutate(Data = lapply(File, read_csv)) %>%
unnest(Data) %>%
select(-File)
tidyverse approach:
Update:
readr 2.0 (and beyond) now has built-in support for reading a list of files with the same columns into one output table in a single command. Just pass the filenames to be read in the same vector to the reading function. For example reading in csv files:
(files <- fs::dir_ls("D:/data", glob="*.csv"))
dat <- read_csv(files, id="path")
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
# specify the directory, then read a list of files
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = ".csv$")
# return a single data frame w/ purrr:map_dfr
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source")
# Alternatively, rename source from the file path to the file name
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source") %>%
dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}") # same length as filenames
library(purrr)
library(dplyr)
library(readr)
stopifnot(length(filenames)==length(sites)) # returns error if not the same length
ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y)) # .x is element in filenames, and .y is element in sites
The output of map2 is a list, similar to lapply
If you have a development version of purrr, you can use imap, which is a wrapper for map2 with an index
data.table approach:
If you name the list, then you can use this name to add to the data.table when binding the list together.
workflow
files <- list.files( whatever... )
#read the files from the list
l <- lapply( files, fread )
#names the list using the basename from `l`
# this also is the step to manipuly the filesnamaes to whatever you like
names(l) <- basename( l )
#bind the rows from the list togetgher, putting the filenames into the colum "id"
dt <- rbindlist( dt.list, idcol = "id" )
You just need to write your own function that reads the csv and adds the column you want, before combining them.
my_read_csv <- function(x) {
out <- read_csv(x)
site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
cbind(Site=site, out)
}
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a filenames vector based on "sites" with the exact same length as tbl and then combine the two using cbind
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
### Get length of each csv
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))
### Repeat sites using lengths
file_names <- rep(sites,file_lengths))
### Create table
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)

Saving filenames of a list of imported files into a column of a data frame

I'm reading and combining a large group of csv tables into R but before merging them all I'll like to create a column with the name of the file where those specific set of rows belong.
Here is an example of the code I wrote to read the list of files:
archivos <- list.files("proyecciones", full.names = T)
#proyecciones is the folder where all the csv files are located.
tbl <- lapply(archivos, read.table, sep="", head = T) %>% bind_rows()
As you can see I already have the names of the files in "archivos" but still haven't been able to figure it out how to put it into the lapply command.
Thanks!
We need to use the .id in bind_rows
lapply(archivos, read.table, sep="", header = TRUE) %>%
set_names(archivos) %>%
bind_rows(.id = 'grp')
A more tidyverse syntax would be
library(tidyverse)
map(archivos, read.table, sep='', header = TRUE) %>%
setnames(archivos) %>%
bind_rows(.id = 'grp')

Resources