R - avoid new column when merging csv files

I have a script that merges all CSV files in a folder.
My problem is that a new column named "...20" is created with empty data. How can I avoid that?
Thanks for helping.
My script:
folderfiles <- list.files(path = "//myserver/Depots/",
                          pattern = "\\.csv$",
                          full.names = TRUE)
data_csv <- folderfiles %>%
  set_names() %>%
  map_dfr(.f = read_delim,
          delim = ";")
and the message :

It's difficult to debug this without access to the specific files. However, you can specify exactly which columns to read with the cols_only() function. For example, assuming you only want to read the mpg column, you can do it like this:
library("fs")
library("readr")
library("tidyverse")
# Generating some sample files
temp_dir_files <- path_temp("cars")
dir_create(temp_dir_files)
for (i in 1:10) {
write_csv(mtcars, file = path(temp_dir_files, paste0("cars", i, ".csv")))
}
# Selected column import
# read_* can handle a vector of paths
read_csv(
file = dir_ls(temp_dir_files, glob = "*.csv"),
col_types = cols_only(
mpg = col_double()
),
id = "input_file"
)
The cols_only() specification passed to read_csv() makes it skip all other columns and import only the columns whose names match.
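If the column names are not known in advance, another option is to drop empty columns after the merge. readr gives names of the form "...N" to columns that have no header, and one common cause is a trailing delimiter at the end of each line. Whether that is the cause here is an assumption, since the files are not available. The following is a minimal base R sketch; drop_empty_cols and the merged data frame are hypothetical stand-ins, not part of the original script.

```r
# Hypothetical helper: keep only columns that contain at least one non-NA value.
drop_empty_cols <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Stand-in for the merged data; "...3" mimics the unwanted empty column.
merged <- data.frame(a = 1:2, b = c("x", "y"), check.names = FALSE)
merged[["...3"]] <- NA

cleaned <- drop_empty_cols(merged)
names(cleaned)
```

Note that this also removes any named column that happens to be entirely NA, so it should only be used when that is acceptable.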

Related

How to combine multiple .txt files with different # of rows in R and keep file names?

The goal is to combine multiple single-column .txt files from different subfolders, then cbind them into one data frame (each file becomes one column) and keep the file names as column names. An example of a .txt file:
0.348107
0.413864
0.285974
0.130399
...
My code:
#list all the files in the folder
listfile<- list.files(path="",
pattern= "txt",full.names = T, recursive = TRUE) #To include sub directories, change the recursive = TRUE, else FALSE.
#extract the files with folder name aINS
listfile_aINS <- listfile[grep("aINS",listfile)]
#inspect file names
head(listfile_aINS)
#combined all the text files in listfile_aINS and store in dataframe 'Data'
for (i in 1:length(listfile_aINS)){
  if(i==1){
    assign(paste0("Data"), read.table(listfile[i],header = FALSE, sep = ","))
  }
  if(!i==1){
    assign(paste0("Test",i), read.table(listfile[i],header = FALSE, sep = ","))
    Data <- cbind(Data,get(paste0("Test",i))) #choose one: cbind, combine by column; rbind, combine by row
    rm(list = ls(pattern = "Test"))
  }
}
rm(list = ls(pattern = "list.+?"))
I ran into two problems:
1. R returns this error because the .txt files have different numbers of rows:
"Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 37, 36"
I have too many files, so I hope to work around the error without having to make all the files the same length.
2. My code doesn't keep the file name as the column name.
It will be easier to write a function and then rbind() the data from each file. The resulting data frame will have a file column with the filename from the listfile_aINS vector.
read_file <- function(filename) {
dat <- read.table(filename,header = FALSE, sep = ",")
dat$file <- filename
return(dat)
}
all_dat <- do.call(rbind, lapply(listfile_aINS, read_file))
If they don't all have the same number of rows it might not make sense to have each column be a file, but if you really want that you could make it into a wide dataset with NA filling out the empty rows:
library(dplyr)
library(tidyr)
all_dat %>%
group_by(file) %>%
mutate(n = 1:n()) %>%
pivot_wider(names_from = file, values_from = V1)
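The wide layout can also be sketched in base R by padding each column with NA up to the longest length before combining. This is only an illustration under assumptions: the columns list below is a hypothetical stand-in for the vectors read from each file, and its names are made-up file paths.

```r
# Hypothetical stand-ins for the single-column data read from each file.
columns <- list(
  "sub1/aINS.txt" = c(0.348107, 0.413864, 0.285974),
  "sub2/aINS.txt" = c(0.130399, 0.222222)
)

# Extending a vector's length pads it with NA.
max_len <- max(lengths(columns))
padded <- lapply(columns, function(v) {
  length(v) <- max_len
  v
})

# check.names = FALSE keeps the file paths as column names unchanged.
wide <- as.data.frame(padded, check.names = FALSE)
wide
```

In practice the list would be built with something like lapply() over listfile_aINS, with names set to the file paths.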

Modifying multiple CSV files and saving them all as TXT in R

I have a folder containing several .csv files. I need to delete the first three rows and the very last row of all those .csv files and then save them all as .txt. All the files have the same format so it's always the same rows that I would need to delete.
I know how to modify a single dataframe but I do not know how to load, modify and save as txt several dataframes.
I am a beginner using R so I do not have examples of things I have tried yet.
Any help will be really appreciated!
It's hard to get started on Stack Overflow, but the other comments about reproducible examples are worth keeping in mind for the future. My suggestion would be to write a function that reads, modifies, and writes one file, and then apply it to all the files.
I can't tell exactly how to do this as I can't see your data, but something like this should work:
library('tidyverse')
old_paths = list.files(
path = your_folder,
pattern = '\\.csv$',
full.names = TRUE
)
read_write = function(path){
  new_filename = str_replace(
    string = path,
    pattern = '\\.csv$',
    replacement = '.txt'
  )
  read_csv(path) %>%
    slice(-(1:3)) %>%
    slice(-n()) %>%
    write_tsv(new_filename) %>%
    invisible()
}
lapply(old_paths, read_write)
Let's do this for one data frame, only referencing its file name
input_file = "my_data_1.csv"
data = read.csv(input_file)
# modify
data = data[-(1:3), ] # delete first 3 rows
data = data[-nrow(data), ] # delete last row
# save as .txt
output_file = sub("csv$", "txt", input_file)
write.table(x = data, file = output_file, sep = "\t", row.names = FALSE)
Now we can turn it into a function taking the file name as an argument:
my_txt_convert = function(input_file) {
data = read.csv(input_file)
# modify
data = data[-(1:3), ] # delete first 3 rows
data = data[-nrow(data), ] # delete last row
# save as .txt
output_file = sub("csv$", "txt", input_file)
write.table(x = data, file = output_file, sep = "\t", row.names = FALSE)
}
Then we call the function on all your files:
to_convert = list.files(pattern = '\\.csv$')
for (file in to_convert) {
my_txt_convert(file)
}
# or
lapply(to_convert, my_txt_convert)
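One thing to check before running this on real files: read.csv() treats the first line as the header by default, so "the first three rows" means the first three data rows after the header. A small sketch on made-up inline data (the csv_text contents are hypothetical) shows which rows survive:

```r
# Made-up CSV content: one header line followed by six data rows.
csv_text <- paste(
  c("id,value", "1,a", "2,b", "3,c", "4,d", "5,e", "6,f"),
  collapse = "\n"
)
data <- read.csv(text = csv_text)

# Drop the first three data rows and the last data row in one step.
trimmed <- data[-c(1:3, nrow(data)), , drop = FALSE]
trimmed$id
```

If the lines to delete include the header line itself, reading with header = FALSE (or using the skip argument) would be needed instead.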

R read_excel or readxl Multiple Files with Multiple Sheets - Bind

I have a directory full of .xlsx files. They all have multiple sheets. I want to extract the same sheet from all of the files and append them into a tibble.
I have found numerous solutions for extracting multiple sheets from a single Excel file; however, not a single sheet from multiple files.
I have tried:
paths = as.tibble(list.files("data/BAH", pattern = ".xlsx", full.names = TRUE, all.files = FALSE))
test <- paths %>% read_xlsx(sheet = "Portal", col_names = TRUE)
I know the "paths" variable contains all of my file names with path. However, I am not sure how to iterate through each file name appending just the specific sheet = "Portal" to a csv file.
The error is:
Error: path must be a string
I have tried to pass in paths as a vector, as a tibble, and tried sub-scripting it as well. All fails.
So, in summary. I have a directory of xlsx files and I need to extract a single sheet from each one and append it to a csv file. I have tried using purrr with some map functions but also could not get it to work.
My goal was to use the Tidy way.
Thanks for any hints.
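read_xlsx() reads one file at a time, so you have to iterate with lapply() or map(). Keep paths as a plain character vector (without the as.tibble() wrapper), then try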
test <- lapply(paths, read_xlsx, sheet = "Portal", col_names = TRUE)
or
library(purrr)
test <- map_dfr(paths, read_xlsx, sheet = "Portal", col_names = TRUE)
map_dfr() already returns a single row-bound data frame. With lapply(), you can then bind the list of data frames with
library(dplyr)
test %>% bind_rows()
library(tidyverse)
library(readxl)
library(fs)
# Get all files
xlsx_files <- fs::dir_ls("data/BAH", regexp = "\\.xlsx$")
paths = as_tibble(list.files("data/BAH", pattern = ".xlsx", full.names = TRUE, all.files = FALSE))
#portal_tabs <- map_dfr(paths, read_xlsx, sheet = "Portal", col_names = TRUE)
portal_tabs <- map_dfr(xlsx_files, read_xlsx, sheet = "Portal", col_names = TRUE, .id = 'source')

How to iterate over excel-sheets with readxl

I am supposed to load the data for my master-thesis in an R-dataframe, which is stored in 74 excel workbooks. Every workbook has 4 worksheets each, called: animals, features, r_words, verbs. All of the worksheets have the same 12 variables(starttime, word, endtime, ID,... etc.). I want to concatenate every worksheet under the one before, so the resulting dataframe should have 12 columns and the number of rows is depending on how many answers the 74 subjects produced.
I want to use the readxl-package of the tidyverse and followed this article: https://readxl.tidyverse.org/articles/articles/readxl-workflows.html#csv-caching-and-iterating-over-sheets.
The first problem I face is how to read all 4 worksheets with read_excel(path, sheet = "animals", "features", "r_words", "verbs"). This only works with the first worksheet, so I tried to make a list with all the sheet-names (object sheet). This is also not working. And when I try to use the following code with just one worksheet, the next line throws an error:
Error in basename(.) : a character vector argument expected
So, here is a part of my code, hopefully fulfilling the requirements:
filenames <- list.files("data", pattern = '\\.xlsm',full.names = TRUE)
# indices
subfile_nos <- 1:length(filenames)
# function to read all the sheets in at once and cache to csv
read_then_csv <- function(sheet, path) {
  for (i in 1:length(filenames)){
    sheet <- excel_sheets(filenames[i])
    len.sheet <- 1:length(sheet)
    path <- read_excel(filenames[i], sheet = sheet[i]) #only reading in the first sheet
    pathbase <- path %>%
      basename() %>% #Error in basename(.) : a character vector argument expected
      tools::file_path_sans_ext()
    path %>%
      read_excel(sheet = sheet) %>%
      write_csv(paste0(pathbase, "-", sheet, ".csv"))
  }
}
You should do a double loop or a nested map, like so:
library(dplyr)
library(purrr)
library(readxl)
# I suggest looking at
?purrr::map_df
# Function to read all the sheets in at once and save as csv
read_then_csv <- function(input_filenames, output_file) {
# Iterate over files and concatenate results
map_df(input_filenames, function(f){
# Iterate over sheets and concatenate results
excel_sheets(f) %>%
map_df(function(sh){
read_excel(f, sh)
})
}) %>%
# Write csv
write_csv(output_file)
}
# Test function
filenames <- list.files("data", pattern = '\\.xlsm',full.names = TRUE)
read_then_csv(filenames, 'my_output.csv')
You say...'I want to concatenate every worksheet under the one before'... The script below will combine all sheets from all files. Test it on a COPY of your data, in case it doesn't do what you want/need it to do.
# load names of excel files
files = list.files(path = "C:\\your_path_here\\", full.names = TRUE, pattern = ".xlsx")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
sheets <- readxl::excel_sheets(filename)
sapply(sheets, function(f) as.data.frame(readxl::read_excel(filename, sheet = f)),
simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)

Include .csv filename when reading data into r using list.files [duplicate]

I'm aggregating a bunch of CSV files in R, which I have done successfully using the following code (found here):
Tbl <- list.files(path = "./Data/CSVs/",
pattern="*.csv",
full.names = T) %>%
map_df(~read_csv(., col_types = cols(.default = "c")))
I want to include the .csv filename (ideally without the file extension) as a column in Tbl. I found a solution using plyr, but I want to stick with dplyr as plyr causes glitches further down my code.
Is there any way I can add something to the above code that will tell R to include the file name in Tbl$filename?
Many thanks!
Here's my solution. Let me know if this helps.
Tbl <- list.files(path = "./Data/CSVs/",
                  pattern = "\\.csv$",
                  full.names = T) %>%
  map_df(function(x) read_csv(x, col_types = cols(.default = "c")) %>%
           mutate(filename = gsub("\\.csv$", "", basename(x))))
It's difficult to know exactly what you want since the format of your data in .csv is unclear. But try gsub. Assuming you have list of your files in Tbl.list:
library(dplyr)
Tbl.list <- list.files(path = "./Data/CSVs/",
pattern="*.csv",
full.names = T)
Convert to a data.frame and then use mutate to create the filename column, replacing ".csv" with "":
Tbl.df <- data.frame( X1 = Tbl.list ) %>%
  mutate( filename_wo_ext = gsub( "\\.csv$", "", X1 ) )
You could also try the following, but I'm not sure it'll work. (Let's assume you have Tbl.list still). Start by changing your map_df statement to add an index column:
map_df(~ read_csv(., col_types = cols(.default = "c")),
       .id = "index") %>%
  mutate( filename = gsub( "\\.csv$", "", Tbl.list[as.numeric(index)] ) )
The index column should contain the positions 1...n as character strings. The mutate statement looks up the file name at that position in Tbl.list and replaces ".csv" with "".
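A note on the pattern itself: in gsub(".csv", ...) the dot is a regex wildcard, so it can also remove text in the middle of a name. The sketch below compares it with an anchored pattern and with tools::file_path_sans_ext(); the example path is made up.

```r
# Made-up path whose name contains "csv" before the extension.
path <- "./Data/CSVs/my_csv_data.csv"

# Unescaped dot: "_csv" in the middle of the name is removed as well.
loose <- gsub(".csv", "", basename(path))

# Escaped dot anchored at the end: only the extension is removed.
strict <- gsub("\\.csv$", "", basename(path))

# Base R helper that strips the extension without a hand-written regex.
helper <- tools::file_path_sans_ext(basename(path))

c(loose = loose, strict = strict, helper = helper)
```

Either of the last two forms can be dropped into the mutate() calls above in place of the unescaped pattern.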
