How to merge files in a directory with R?

Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have colnames. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
        V1            V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder into one big dataframe, right? At least this would be very practical for me.
So what I tried is this: I set the working directory to the according folder, define an empty variable d, store all the file names in file.list, and then loop through the names in file.list. First, I want to change the colnames of every file to "Wavenumber" and the according file name itself, so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. And then I could probably do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for(i in file.list){
  colnames(i) = c("Wavenumber", deparse(substitute(i)))
  d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error
"Error in `colnames<-`(`*tmp*`, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the colnames() line, which does not produce an error, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.

Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
  # get csv full paths. (?i) is for case insensitive
  list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
  # create a named vector: you need it to assign ids in the next step,
  # and remove the file extension to get clean colnames
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  # read the files one by one, bind them by row and create an id column
  # (header = FALSE since the spectrometer files have no column names)
  map_dfr(read.csv, header = FALSE,
          col.names = c("Wavenumber", "V2"), .id = "colname") %>%
  # pivot to create one column for each .id
  pivot_wider(names_from = colname, values_from = V2)
Explanation
First, a quick note on the error: inside your loop, i is just a character string (the file name), so colnames(i) <- ... fails, because colnames can only be set on a two-dimensional object like a dataframe. Apart from that, I would suggest not to change the working directory; it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr suffix stands for).
Note that I've used .id = to create a new column called colname. It gets populated with the names of the vector you're looping over (that's why we added the names with set_names()).
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
At the end, you will have a dataframe with as many rows as there are Wavenumbers and as many columns as the number of CSV files plus one (the Wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 6 x 3
#>   Wavenumber file1 file2
#>        <dbl> <dbl> <dbl>
#> 1       399.  77.5  77.5
#> 2       401.  77.8  77.8
#> 3       403.  78.1  78.1
#> 4       405.  78.4  78.4
#> 5       407.  78.4  78.4
#> 6       409.  78.2  78.2
Thanks a lot to @Konrad Rudolph for the suggestions!!

No need for a loop here, simply use lapply().
First set your working directory to the file location.
library(dplyr)
files_to_upload <- list.files(pattern = "\\.csv$")
theData_list <- lapply(files_to_upload, read.csv, header = FALSE)
C_Sycamore <- bind_rows(theData_list)
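Note that bind_rows() stacks the 231 spectra on top of each other. If you want the wide layout from the question (one Reflectance column per file, joined on Wavenumber), here is a hedged base-R sketch, assuming every file shares an identical Wavenumber column:
# read each file, name its value column after the file, then merge by Wavenumber
files_to_upload <- list.files(pattern = "\\.csv$")
spectra <- lapply(files_to_upload, read.csv, header = FALSE,
                  col.names = c("Wavenumber", "Reflectance"))
names(spectra) <- tools::file_path_sans_ext(files_to_upload)
for (nm in names(spectra)) names(spectra[[nm]])[2] <- nm
C_Sycamore <- Reduce(function(x, y) merge(x, y, by = "Wavenumber"), spectra)
Reduce() applies merge() pairwise down the list, so the result has one Wavenumber column plus one named column per file.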

Related

Read all csv files in a directory and add the name of each file in a new column [duplicate]

I have this code that reads all CSV files in a directory.
nm <- list.files()
df <- do.call(rbind, lapply(nm, function(x) read_delim(x, ";", col_names = TRUE)))
I want to modify it in a way that appends the filename to the data. The result would be a single data frame containing all the CSV files, with a column that specifies which file each row came from. How can I do this?
Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:
library(readr)
library(purrr)
df <- list.files() |>
  set_names() |>
  map_dfr(read_delim, .id = "file")
df
# A tibble: 9 × 3
  file    col1  col2
  <chr>  <dbl> <dbl>
1 f1.csv     1     4
2 f1.csv     2     5
3 f1.csv     3     6
4 f2.csv     1     4
5 f2.csv     2     5
6 f2.csv     3     6
7 f3.csv     1     4
8 f3.csv     2     5
9 f3.csv     3     6
Example data:
for (f in c("f1.csv", "f2.csv", "f3.csv")) {
  readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}
readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."
nm |>
  readr::read_csv(
    id = "file_path"
  )
I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.
nm |>
  readr::read_csv(
    id = "file_path"
  ) |>
  dplyr::mutate(
    file_name_1 = basename(file_path), # If you want the extension
    file_name_2 = tools::file_path_sans_ext(file_name_1), # If you don't
  )
Here is another solution using purrr, which removes the file extension from the value in the filename column.
library(tidyverse)
nm <- list.files(pattern = "\\.csv$")
df <- map_dfr(
  .x = nm,
  ~ read.csv(.x) %>%
    mutate(
      filename = stringr::str_replace(.x, "\\.csv$", "")
    )
)
View(df)
EDIT
Actually, you can still remove the file extension from the file-name column when you apply @zephryl's approach, by adding a mutate() step as follows:
df <- nm %>%
  set_names() %>%
  map_dfr(read_delim, .id = "file") %>%
  mutate(
    file = stringr::str_replace(file, "\\.csv$", "")
  )
You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.
df <- dplyr::bind_rows(
  lapply(setNames(nm, basename(nm)), read_csv2),
  .id = 'src'
)
The use of basename() removes the directory paths prepended to the file names.
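For example, with a hypothetical path:
basename("some_dir/f1.csv")
#> [1] "f1.csv"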
For conventional scenarios, I prefer to let readr loop through the csvs by itself. But there are some scenarios where it helps to process files individually before stacking them together.
A few weeks ago, purrr 1.0's map_dfr() function was "superseded in favour of using the
appropriate map function along with list_rbind()".
@zephryl's snippet is slightly modified to become
list.files() |>
  rlang::set_names() |>
  purrr::map(readr::read_delim) |>
  # { possibly process files here before stacking/binding } |>
  purrr::list_rbind(names_to = "file")
The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.
Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.
Source: https://purrr.tidyverse.org/reference/map_dfr.html

Iteratively skip the last rows in CSV files when using read_csv

I have a number of CSV files exported from our database, say site1_observations.csv, site2_observations.csv, site3_observations.csv etc. Each CSV looks like below (site1 for example):
Column A          Column B   Column C
# Team: all teams
# Observation type: xyz
Site ID           Reason     Quantity
a                 xyz        1
b                 abc        2
Total quantity               3
We need to skip the top 2 rows and the last row from each CSV before combining them into one master dataset for further analysis. I know I can use the skip = argument to skip the first few lines of a CSV, but read_csv() doesn't seem to have a simple argument to skip the last lines, so I have been using n_max = as a workaround. The data import has been done manually so far. I want to shift the manual process to a programmatic one using purrr::map(), but just couldn't work out how to efficiently skip the last few lines here.
library(tidyverse)
observations_skip_head <- 2
# Approach 1: manual ----
site1_rawdata <- read_csv("/data/site1_observations.csv",
                          skip = observations_skip_head,
                          n_max = nrow(read_csv("/data/site1_observations.csv",
                                                skip = observations_skip_head)) - 1)
# site2_rawdata
# site3_rawdata
# [etc]
# all_sites_rawdata <- bind_rows(site1_rawdata, site2_rawdata, site3_rawdata, [etc])
I have tried to use purrr::map() and I believe I am almost there, except for the n_max = part, which I am not sure how to handle inside map() (or any other effective way to get rid of the last line in each CSV). How to do this with purrr?
observations_csv_paths_chr <- paste0("data/site", 1:3,"_observations.csv")
# Approach 2: programmatically import csv files with purrr ----
all_sites_rawdata <- observations_csv_paths_chr %>%
  map(~ read_csv(., skip = observations_skip_head,
                 n_max = nrow(read_csv("/data/site1_observations.csv",
                                       skip = observations_skip_head)) - 1)) %>%
  set_names(observations_csv_paths_chr)
I know this post uses a custom function and fread. But for my education I want to understand how to achieve this goal using the purrr approach (if it's doable).
You could try something like this?
library(tidyverse)
csv_files <- paste0("data/site", 1:3, "_observations.csv")
csv_files |>
  map(
    ~ .x |>
      read_lines() |>
      tail(-3) |> # skip the first 3 lines...
      head(-2) |> # ...and the last 2
      paste(collapse = '\n') |>
      read_csv()
  )
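If you would rather keep read_csv() and its n_max = argument (the part the question was stuck on), a hedged sketch is to count each file's lines with read_lines() inside the same map() call. It assumes 2 skipped lines, one column-name row, and one trailing total row per file:
library(tidyverse)
observations_skip_head <- 2
all_sites_rawdata <- observations_csv_paths_chr %>%
  map(~ read_csv(
    .x,
    skip = observations_skip_head,
    # total lines minus the skipped lines, minus the column-name row
    # consumed by read_csv, minus the trailing "Total quantity" row
    n_max = length(read_lines(.x)) - observations_skip_head - 2
  )) %>%
  set_names(observations_csv_paths_chr)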
manual_csv <- function(x) {
  txt <- readLines(x)
  txt <- txt[-c(2, 3, length(txt))] # insert the rows you want to delete
  result <- read.csv(text = paste0(txt, collapse = "\n"))
  result
}
test <- manual_csv('D:/jaechang/pool/final.csv')

Create a separate csv for each match

I've already looked for a similar question to mine but I couldn't find it.
If you find one, let me know.
I have a df that looks like (in reality this df has three columns and more than 1000 rows):
Name, Value
Markus, 2
Markus, 4
Markus, 1
Caesar, 77
Caesar, 70
Brutus, 3
Nero, 4
Nero, 9
Nero, 10
Nero, 19
How can I create a separate csv file for each match (depending on Name)?
I don't know how to approach this.
In this case the end result should be four csv files named after the Name column:
Markus.csv
Caesar.csv
Brutus.csv
Nero.csv
I'm thankful for any advice.
We can split your df by Name to create a list, iterate over it, and write a csv for each group. File names are created using paste0().
lst <- split(df, df$Name)
lapply(seq_along(lst), function(i) {
  write.csv(lst[[i]], paste0(names(lst)[i], ".csv"), row.names = FALSE)
})
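The same idea reads a bit more compactly with purrr::iwalk(), which hands each list element (.x) and its name (.y) to the function (a sketch, assuming the same df):
library(purrr)
# write one csv per group, named after the group
iwalk(split(df, df$Name),
      ~ write.csv(.x, paste0(.y, ".csv"), row.names = FALSE))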
Similar idea to @Jilber Urbina's: you can use dplyr and tidyr to filter the dataset. The basic principle for both answers is to iterate the write.csv() function over the list/vector of names via the lapply() function.
library(dplyr)
library(tidyr)
lapply(unique(df$Name), function(i) {
  write.csv(
    x = df %>% dplyr::filter(Name == i) %>% # filter dataset by Name
      select(-Name), # if you want to remove the Name column from the output
    file = paste0(i, '.csv'),
    row.names = FALSE
  )
})

How do I download and combine multiple files in r?

I am endeavoring to combine files but find myself writing very redundant code, which is cumbersome. I have looked at the documentation but for some reason cannot find anything about how to do this.
Basically, I download the code from my native machine, and then want to combine the exact same columns for each file (the only difference is year).
Can you help?
I download the files from my machine ("C:/SAM/CODE1_2005.csv", then "C:/SAM/CODE1_2006.csv", then "C:/SAM/CODE1_2007.csv", and so on until 2016).
I then define the columns, all the same for each year I have downloaded, such as COLLEGESCORECARD05_A <- subset(COLLEGESCORECARD05, select = c(ï..UNITID, OPEID, OPEID6, INSTNM)), and so forth...
and then combine the files into one database.
The issue is that this seems inefficient. Is there a more efficient way?
You can make a list of the .csv files in the folder and then read them all together into a single df with purrr::map_df(). You can then add a column with the filename to differentiate between files.
library(tidyverse)
df <- list.files(path = "C:/SAM",
                 pattern = "\\.csv$",
                 full.names = TRUE) %>%
  purrr::map_df(function(x) readr::read_csv(x) %>%
                  mutate(filename = gsub("\\.csv$", "", basename(x))))
At the risk of seeming icky for self-promotion, I wrote a function that does exactly this (desiderata::apply_to_files()):
# Apply a function to every file in a folder that matches a regex pattern
rain <- apply_to_files(path = "Raw data/Rainfall", pattern = "csv",
                       func = readr::read_csv, col_types = "Tiic",
                       recursive = FALSE, ignorecase = TRUE,
                       method = "row_bind")
dplyr::sample_n(rain, 5)
#> # A tibble: 5 x 5
#>   orig_source_file       Time                 Tips    mV Event 
#>   <chr>                  <dttm>              <int> <int> <chr> 
#> 1 BOW-BM-2016-01-15.csv  2015-12-17 03:58:00     0  4047 Normal
#> 2 BOW-BM-2016-01-15.csv  2016-01-03 00:27:00     2  3962 Normal
#> 3 BOW-BM-2016-01-15.csv  2015-11-27 12:06:00     0  4262 Normal
#> 4 BIL-BPA-2018-01-24.csv 2015-11-15 10:00:00     0  4378 Normal
#> 5 BOW-BM-2016-08-05.csv  2016-04-13 19:00:00     0  4447 Normal
In this case, all of the files have identical columns and order (Time, Tips, mV, Event), so I can just use method = "row_bind" and the function will automatically add the filename as an extra column. There are other methods available:
"full_join" (the default) returns all columns and rows. "left_join" returns all rows from the first file, and all columns from subsequent files. "inner_join" returns rows from the first file that have matches in subsequent files.
Internally, the function builds a list of files in the path (recursive or not), runs an lapply() on the list, and then handles merging the new list of dataframes into a single dataframe:
apply_to_files <- function(path, pattern, func, ..., recursive = FALSE, ignorecase = TRUE,
                           method = "full_join") {
  file_list <- list.files(path = path,
                          pattern = pattern,
                          full.names = TRUE,      # Return full relative path.
                          recursive = recursive,  # Search into subfolders.
                          ignore.case = ignorecase)
  df_list <- lapply(file_list, func, ...)
  # The .id arg of bind_rows() uses the names to create the ID column.
  names(df_list) <- basename(file_list)
  out <- switch(method,
                "full_join" = plyr::join_all(df_list, type = "full"),
                "left_join" = plyr::join_all(df_list, type = "left"),
                "inner_join" = plyr::join_all(df_list, type = "inner"),
                # The fancy joins don't have orig_source_file because the values were
                # getting all mixed together.
                "row_bind" = dplyr::bind_rows(df_list, .id = "orig_source_file"))
  return(invisible(out))
}

How to merge .txt files in subfolders and name the columns after their parent folders using R?

I have performed an experiment under different conditions. Each of those conditions has its own folder. In each of those folders, there is a subfolder for each replicate that contains a text file called DistList.txt. The structure looks like this, where the folders "C1.1", "C1.2" and so on contain the mentioned .txt files:
Those .txt files then look like this, but their length may vary from only one or two to several hundreds:
Now, I would like to merge those .txt files and create a .csv file out of them that looks like this:
C1.1  C1.2  C1.3  ...
155   223   996
169   559   999
259   623   1033
2003  2220
4421
Until now, I was able to write a script that gathers all the files and puts the single data in different columns, just as I want it. However, I would like the title of each column to be the name of the main folder I extracted the .txt file from (e.g. C1.1, C1.2, C1.3, C2.1, ...).
So far, I have this script:
fileList <- list.files(path = ".", recursive = TRUE, pattern = "DistList.txt", full.names = TRUE)
listData <- lapply(fileList, read.table)
names(listData) <- gsub("DistList.txt", "", basename(fileList))
library(tidyverse)
library(reshape2)
bind_rows(listData, .id = "FileName") %>%
  group_by(FileName) %>%
  mutate(rowNum = row_number()) %>%
  dcast(rowNum ~ FileName, value.var = "V1") %>%
  select(-rowNum) %>%
  write.csv(file = "Result.csv")
This then yields a .csv file with just numbers as column headers instead of the names I would like to have. This is an extract of the file created, where I have marked the row that should contain the titles mentioned above (C1.1, C1.2, C1.3, ...):
Is there any possibility to name the columns as I have mentioned above?
Perhaps I misunderstood, but why not do this
# Generate some sample data consisting of a list
# of single-column data.frame's
set.seed(2017)
listData <- list(
  C1 = data.frame(V1 = runif(10)),
  C2 = data.frame(V1 = runif(10)),
  C3 = data.frame(V1 = runif(10)))
setNames(bind_cols(listData), names(listData))
#           C1          C2         C3
#1  0.92424261 0.674331481 0.63411352
#2  0.53717641 0.002020766 0.37986744
#3  0.46919565 0.025093514 0.94207403
#4  0.28862618 0.432077786 0.75499369
#5  0.77008816 0.499391912 0.22761184
#6  0.77276871 0.388681932 0.91466603
#7  0.03932234 0.395375316 0.62044504
#8  0.43490560 0.715707325 0.31910458
#9  0.47216639 0.940999879 0.07628881
#10 0.27383312 0.827229161 0.26083932
Explanation: We simply bind the single column of all data.frames in listData and set column names with setNames.
I don't understand why you're doing a bind_rows first, and then later recasting from long to wide. The same can be achieved with a single bind_cols().
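One caveat: bind_cols() needs all columns to be the same length, while the DistList.txt files in the question have varying lengths. A hedged sketch that pads the shorter replicates with NA first, assuming the listData list of one-column data frames from the question:
# pad each replicate's V1 vector with NA up to the longest one,
# then assemble the data frame
n <- max(sapply(listData, nrow))
padded <- lapply(listData, function(d) c(d$V1, rep(NA, n - nrow(d))))
result <- as.data.frame(padded)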
In this case, the line
names(listData) <- gsub("DistList.txt", "", basename(fileList))
has to be replaced by
names(listData) <- basename(dirname(fileList))
so that the names of the subfolders are used as the headers of the single columns.
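For example, with a hypothetical path:
fileList <- "C1/C1.1/DistList.txt"
basename(dirname(fileList))
#> [1] "C1.1"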
