Remove column using list.files in R - r

So, I've got the following script
output <- list.files(pattern = "some_files.csv", recursive = TRUE) %>%
lapply(read_csv) %>%
bind_rows
and it's works perfect find all posibble files csvsand making one big file, but i encountered the following problem, one csv file generate error: Error: Column `some_column` can't be converted from numeric to character. And I decided to remove this column from the dataset
output <- list.files(pattern = "some_files.csv", recursive = TRUE) %>%
lapply(read_csv) %>% subset(read_csv, select = -c(some_column)) %>%
bind_rows
that genearte another error
Error in subset.default(., read_csv, select = -c(some_column)) :
'subset' must be logical
Any idea?

Try this
list.files(pattern = "some_files.csv", recursive = TRUE) %>%
purrr::map_df(~{
x <- readr::read_csv(.)
x[setdiff(names(x), "some_column")]
})
Instead of lapply we use map, to avoid bind_rows at the end we can use map_df or map_dfr. Instead of subset we use setdiff to remove columns since subset would fail in case if the column is not present.
Or keeping everything in base R
file_names <- list.files(pattern = "some_files.csv", recursive = TRUE)
do.call(rbind, lapply(file_names, function(x) {
df <- read.csv(x)
df[setdiff(names(df), "some_column")]
))

Related

How to combine multiple .txt files with different # of rows in R and keep file names?

The goal is to combine multiple .txt files with single column from different subfolders then cbind to one dataframe (each file will be one column), and keep file names as columne value, an example of the .txt file:
0.348107
0.413864
0.285974
0.130399
...
My code:
#list all the files in the folder
listfile<- list.files(path="",
pattern= "txt",full.names = T, recursive = TRUE) #To include sub directories, change the recursive = TRUE, else FALSE.
#extract the files with folder name aINS
listfile_aINS <- listfile[grep("aINS",listfile)]
#inspect file names
head(listfile_aINS)
#combined all the text files in listfile_aINS and store in dataframe 'Data'
for (i in 1:length(listfile_aINS)){
if(i==1){
assign(paste0("Data"), read.table(listfile[i],header = FALSE, sep = ","))
}
if(!i==1){
assign(paste0("Test",i), read.table(listfile[i],header = FALSE, sep = ","))
Data <- cbind(Data,get(paste0("Test",i))) #choose one: cbind, combine by column; rbind, combine by row
rm(list = ls(pattern = "Test"))
}
}
rm(list = ls(pattern = "list.+?"))
I ran into two problems:
R returns this error because the .txt files have different # of rows.
"Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 37, 36"
I have too many files so I hope to work around the error without having to fix the files into the same length.
my code won't keep file name as the column name
It will be easier to write a function and then rbind() the data from each file. The resulting data frame will have a file column with the filename from the listfile_aINS vector.
read_file <- function(filename) {
dat <- read.table(filename,header = FALSE, sep = ",")
dat$file <- filename
return(dat)
}
all_dat <- do.call(rbind, lapply(listfile_aINS, read_file))
If they don't all have the same number of rows it might not make sense to have each column be a file, but if you really want that you could make it into a wide dataset with NA filling out the empty rows:
library(dplyr)
library(tidyr)
all_dat %>%
group_by(file) %>%
mutate(n = 1:n()) %>%
pivot_wider(names_from = file, values_from = V1)

Can't Combine X <character> and X <double>

I am trying to merge various .csv files into one dataframe using the following:
df<- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>% lapply(read_csv) %>% bind_rows clean
However, I get an error saying that I can't combine X character variable and X double variable.
Is there a way I can transform one of them to a character or double variable?
Since each csv file is slightly different, from my beginners standpoint, believe that lapply would be the best in this case, unless there is an easier way to get around this.
Thank you everyone for your time and attention!
You can change the X variable to character in all the files. You can also use map_df to combine all the files in one dataframe.
library(tidyverse)
result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
map_df(~read_csv(.x) %>% mutate(X = as.character(X)))
If there are more columns with type mismatch issue you can change all the columns to character, combine the data and use type_convert to change their class.
result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
map_df(~read_csv(.x) %>% mutate(across(.fns = as.character))) %>%
type_convert()
if all file has same number of columns, then try plyrr::rbind.fill instead of dplyr::bind_rows.
list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>% lapply(read_csv) %>% plyrr::rbind.fill

import multible CSV files and use file name as column [duplicate]

This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
(2 answers)
Closed 14 days ago.
I have numerous csv files in multiple directories that I want to read into a R tribble or data.table. I use "list.files()" with the recursive argument set to TRUE to create a list of file names and paths, then use "lapply()" to read in multiple csv files, and then "bind_rows()" stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
This approach works fine. However, I need to extract a substring from the each file name and add it as a column to the final table. I can get the substring I need with "str_extract()" like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck however on how to add the extracted substring as a column as lapply() runs through read_csv() for each file.
I generally use the following approach, based on dplyr/tidyr:
data = tibble(File = files) %>%
extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
mutate(Data = lapply(File, read_csv)) %>%
unnest(Data) %>%
select(-File)
tidyverse approach:
Update:
readr 2.0 (and beyond) now has built-in support for reading a list of files with the same columns into one output table in a single command. Just pass the filenames to be read in the same vector to the reading function. For example reading in csv files:
(files <- fs::dir_ls("D:/data", glob="*.csv"))
dat <- read_csv(files, id="path")
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
# specify the directory, then read a list of files
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = ".csv$")
# return a single data frame w/ purrr:map_dfr
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source")
# Alternatively, rename source from the file path to the file name
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source") %>%
dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}") # same length as filenames
library(purrr)
library(dplyr)
library(readr)
stopifnot(length(filenames)==length(sites)) # returns error if not the same length
ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y)) # .x is element in filenames, and .y is element in sites
The output of map2 is a list, similar to lapply
If you have a development version of purrr, you can use imap, which is a wrapper for map2 with an index
data.table approach:
If you name the list, then you can use this name to add to the data.table when binding the list together.
workflow
files <- list.files( whatever... )
#read the files from the list
l <- lapply( files, fread )
#names the list using the basename from `l`
# this also is the step to manipuly the filesnamaes to whatever you like
names(l) <- basename( l )
#bind the rows from the list togetgher, putting the filenames into the colum "id"
dt <- rbindlist( dt.list, idcol = "id" )
You just need to write your own function that reads the csv and adds the column you want, before combining them.
my_read_csv <- function(x) {
out <- read_csv(x)
site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
cbind(Site=site, out)
}
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a filenames vector based on "sites" with the exact same length as tbl and then combine the two using cbind
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
### Get length of each csv
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))
### Repeat sites using lengths
file_names <- rep(sites,file_lengths))
### Create table
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)

Rbind multiple Data Frames in a loop

I have a bunch of data frames that are named in the same pattern "dfX.csv" where X represents a number from 1 to 67. I loaded them into seperate dataframes using following piece of code:
folder <- mypath
file_list <- list.files(path=folder, pattern="*.csv")
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep=',', header=TRUE))
)}
What I'm trying to do is merge/rbind them into a single huge dataframe.
for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv)
}
However using that I'm getting an error:
Error: unexpected symbol in:
"for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv"
Any idea what might be causing an issue & whether there's a simpler way of doing things.
If file_list is a character vector of filenames that have since been loaded into variables in the local environment, then perhaps one of
do.call(rbind.data.frame, mget(ls(pattern = "^df\\s+\\.csv")))
do.call(rbind.data.frame, mget(paste0("df", seq_along(file_list), ".csv")))
The first assumes anything found (as df*.csv) in R's environment is appropriate to grab. It might not grab then in the correct order, so consider using sort or somehow ordering them yourself.
mget takes a string vector and retrieves the value of the object with each name from the given environment (current, by default), returning a list of values.
do.call(rbind.data.frame, ...) does one call to rbind, which is much much faster than iteratively rbinding.
Here I use map() to iterate over your files reading each one into a list of dataframes and bind_rows is used to bind all df together
library(tidyverse)
df <- map(list.files(), read_csv) %>%
bind_rows()
If you have a lot of data (lot of rows), here's a data.table approach that works great:
library(data.table)
basedir <- choose.dir() # directory with all the csv files
file_names <- list.files(path = basedir, pattern= '*.csv', full.names = F, recursive = F)
big_list <- lapply(file_names, function(file_name){
dat <- fread(file = file.path(basedir, file_name), header = T)
# Add a 'filename' column to each data.table to back-track where it was read from
# this is why we set full.names = F in the list.files line above
dat$filename <- gsub('.csv', '', file_name)
return(dat)
})
big_data <- rbindlist(l = big_list, use.names = T, fill = T)
If you want to read only some columns and not all, you can use the select argument in fread - helps improve speed since empty columns are not read in, similarly skip lets you skip reading in a bunch of rows.

Read in multiple files and append file name to data frame [duplicate]

This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
(2 answers)
Closed 15 days ago.
I have numerous csv files in multiple directories that I want to read into a R tribble or data.table. I use "list.files()" with the recursive argument set to TRUE to create a list of file names and paths, then use "lapply()" to read in multiple csv files, and then "bind_rows()" stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
This approach works fine. However, I need to extract a substring from the each file name and add it as a column to the final table. I can get the substring I need with "str_extract()" like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck however on how to add the extracted substring as a column as lapply() runs through read_csv() for each file.
I generally use the following approach, based on dplyr/tidyr:
data = tibble(File = files) %>%
extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
mutate(Data = lapply(File, read_csv)) %>%
unnest(Data) %>%
select(-File)
tidyverse approach:
Update:
readr 2.0 (and beyond) now has built-in support for reading a list of files with the same columns into one output table in a single command. Just pass the filenames to be read in the same vector to the reading function. For example reading in csv files:
(files <- fs::dir_ls("D:/data", glob="*.csv"))
dat <- read_csv(files, id="path")
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
# specify the directory, then read a list of files
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = ".csv$")
# return a single data frame w/ purrr:map_dfr
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source")
# Alternatively, rename source from the file path to the file name
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source") %>%
dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}") # same length as filenames
library(purrr)
library(dplyr)
library(readr)
stopifnot(length(filenames)==length(sites)) # returns error if not the same length
ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y)) # .x is element in filenames, and .y is element in sites
The output of map2 is a list, similar to lapply
If you have a development version of purrr, you can use imap, which is a wrapper for map2 with an index
data.table approach:
If you name the list, then you can use this name to add to the data.table when binding the list together.
workflow
files <- list.files( whatever... )
#read the files from the list
l <- lapply( files, fread )
#names the list using the basename from `l`
# this also is the step to manipuly the filesnamaes to whatever you like
names(l) <- basename( l )
#bind the rows from the list togetgher, putting the filenames into the colum "id"
dt <- rbindlist( dt.list, idcol = "id" )
You just need to write your own function that reads the csv and adds the column you want, before combining them.
my_read_csv <- function(x) {
out <- read_csv(x)
site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
cbind(Site=site, out)
}
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a filenames vector based on "sites" with the exact same length as tbl and then combine the two using cbind
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
### Get length of each csv
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))
### Repeat sites using lengths
file_names <- rep(sites,file_lengths))
### Create table
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)

Resources