Create a separate csv for each match - r

I've already looked for a similar question to mine but couldn't find one.
If you find one, let me know.
I have a df that looks like this (in reality it has three columns and more than 1000 rows):
Name, Value
Markus, 2
Markus, 4
Markus, 1
Caesar, 77
Caesar, 70
Brutus, 3
Nero, 4
Nero, 9
Nero, 10
Nero, 19
How can I create a separate csv file for each match (depending on Name)?
I don't know how to approach this.
In this case the end result should be four csv files named after the values in the Name column:
Markus.csv
Caesar.csv
Brutus.csv
Nero.csv
I'm thankful for any advice.

We can split your df by Name to create a list, iterate over it, and write a csv for each group. File names are created using paste0().
lst <- split(df, df$Name)
lapply(seq_along(lst), function(i) {
  write.csv(lst[[i]], paste0(names(lst)[i], ".csv"), row.names = FALSE)
})

Similar idea to u/Jilber Urbina's: you can use dplyr to filter the dataset by each name. The basic principle in both answers is to iterate write.csv() over the list/vector of names via lapply().
library(dplyr)
lapply(unique(df$Name), function(i) {
  write.csv(
    x = df %>%
      dplyr::filter(Name == i) %>% # filter dataset by Name
      select(-Name), # if you want to remove the Name column from the output
    file = paste0(i, ".csv"),
    row.names = FALSE
  )
})
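For completeness, the same split-and-write can be done in one dplyr pipeline with group_walk(), a sketch assuming dplyr >= 0.8 (where group_walk() was introduced):
library(dplyr)
df %>%
  group_by(Name) %>%
  group_walk(~ write.csv(.x, paste0(.y$Name, ".csv"), row.names = FALSE))
Here .x is each group's data (group_walk() drops the grouping column from it by default) and .y is a one-row tibble holding the group key, so .y$Name supplies the file name.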


Read all csv files in a directory and add the name of each file in a new column [duplicate]

I have this code that reads all CSV files in a directory.
nm <- list.files()
df <- do.call(rbind, lapply(nm, function(x) read_delim(x, ";", col_names = TRUE)))
I want to modify it so that it appends the filename to the data. The result would be a single data frame containing all the CSV files, with a column that specifies which file each row came from. How can I do that?
Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:
library(readr)
library(purrr)
df <- list.files() |>
  set_names() |>
  map_dfr(read_delim, .id = "file")
df
# A tibble: 9 × 3
  file    col1  col2
  <chr>  <dbl> <dbl>
1 f1.csv     1     4
2 f1.csv     2     5
3 f1.csv     3     6
4 f2.csv     1     4
5 f2.csv     2     5
6 f2.csv     3     6
7 f3.csv     1     4
8 f3.csv     2     5
9 f3.csv     3     6
Example data:
for (f in c("f1.csv", "f2.csv", "f3.csv")) {
  readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}
readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."
nm |>
  readr::read_csv(
    id = "file_path"
  )
I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.
nm |>
  readr::read_csv(
    id = "file_path"
  ) |>
  dplyr::mutate(
    file_name_1 = basename(file_path), # if you want the extension
    file_name_2 = tools::file_path_sans_ext(file_name_1) # if you don't
  )
Here is another solution using purrr, which removes the file extension from the values in the filename column.
library(tidyverse)
nm <- list.files(pattern = "\\.csv$")
df <- map_dfr(
  .x = nm,
  ~ read.csv(.x) %>%
    mutate(
      filename = stringr::str_replace(.x, "\\.csv$", "")
    )
)
View(df)
EDIT
Actually, you can still remove the file extension from the file column when you apply #zephryl's approach, by adding a mutate() step as follows:
df <- nm %>%
  set_names() %>%
  map_dfr(read_delim, .id = "file") %>%
  mutate(
    file = stringr::str_replace(file, "\\.csv$", "")
  )
You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.
df <- dplyr::bind_rows(
  lapply(setNames(nm, basename(nm)), read_csv2),
  .id = 'src'
)
The use of basename() removes the directory paths prepended to the file names.
For conventional scenarios, I prefer to let readr loop through the csvs by itself. But there are some scenarios where it helps to process files individually before stacking them together.
A few weeks ago, purrr 1.0's map_dfr() function was "superseded in favour of using the appropriate map function along with list_rbind()".
#zephryl's snippet is slightly modified to become
list.files() |>
  rlang::set_names() |>
  purrr::map(readr::read_delim) |>
  # { possibly process files here before stacking/binding } |>
  purrr::list_rbind(names_to = "file")
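As a sketch of what such a per-file step could look like, purrr::keep() can filter the list before binding (the drop-empty-files predicate here is just an illustration):
list.files() |>
  rlang::set_names() |>
  purrr::map(readr::read_delim) |>
  purrr::keep(\(df) nrow(df) > 0) |> # e.g. drop files that parsed to zero rows
  purrr::list_rbind(names_to = "file")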
The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.
Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.
Source: https://purrr.tidyverse.org/reference/map_dfr.html

Iteratively skip the last rows in CSV files when using read_csv

I have a number of CSV files exported from our database, say site1_observations.csv, site2_observations.csv, site3_observations.csv etc. Each CSV looks like below (site1 for example):
# Team: all teams,,
# Observation type: xyz,,
Site ID,Reason,Quantity
a,xyz,1
b,abc,2
Total quantity,,3
We need to skip the top 2 rows and the last 1 row from each CSV before combining them into one master dataset for further analysis. I know I can use the skip = argument to skip the first few lines of a CSV, but read_csv() doesn't seem to have a simple argument to skip the last lines, and I have been using n_max = as a workaround. The data import has been done manually so far. I want to shift the manual process to a programmatic one using purrr::map(), but just couldn't work out how to efficiently skip the last few lines here.
library(tidyverse)
observations_skip_head <- 2
# Approach 1: manual ----
site1_rawdata <- read_csv("/data/site1_observations.csv",
                          skip = observations_skip_head,
                          n_max = nrow(read_csv("/data/site1_observations.csv",
                                                skip = observations_skip_head)) - 1)
# site2_rawdata
# site3_rawdata
# [etc]
# all_sites_rawdata <- bind_rows(site1_rawdata, site2_rawdata, site3_rawdata, [etc])
I have tried to use purrr::map() and I believe I am almost there, except for the n_max = part, which I am not sure how to handle inside map() (or any other effective way to get rid of the last line in each CSV). How can I do this with purrr?
observations_csv_paths_chr <- paste0("data/site", 1:3,"_observations.csv")
# Approach 2: programmatically import csv files with purrr ----
all_sites_rawdata <- observations_csv_paths_chr %>%
  map(~ read_csv(., skip = observations_skip_head,
                 n_max = nrow(read_csv("/data/site1_observations.csv", # hard-coded to site1; this is the part I can't generalise
                                       skip = observations_skip_head)) - 1)) %>%
  set_names(observations_csv_paths_chr)
I know this post uses a custom function and fread. But for my education I want to understand how to achieve this goal using the purrr approach (if it's doable).
You could try something like this?
library(tidyverse)
csv_files <- paste0("data/site", 1:3, "_observations.csv")
csv_files |>
  map(
    ~ .x |>
      read_lines() |>
      tail(-3) |> # skip the first 3 lines
      head(-2) |> # ...and the last 2
      paste(collapse = '\n') |>
      read_csv()
  )
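If you would rather stay with skip = and n_max =, the per-file row count can also be computed inside the mapped function itself. A sketch, reusing observations_csv_paths_chr and observations_skip_head from the question (each file is read twice, which is the price of this approach):
all_sites_rawdata <- observations_csv_paths_chr |>
  set_names() |>
  map(\(path) {
    # first pass: count the data rows, minus the trailing total row
    n_keep <- nrow(read_csv(path, skip = observations_skip_head)) - 1
    # second pass: read only those rows
    read_csv(path, skip = observations_skip_head, n_max = n_keep)
  })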
manual_csv <- function(x) {
  txt <- readLines(x)
  txt <- txt[-c(2, 3, length(txt))] # drop the rows you want to delete (here rows 2, 3 and the last one)
  read.csv(text = paste0(txt, collapse = "\n"))
}
test <- manual_csv('D:/jaechang/pool/final.csv')

How to merge files in a directory with r?

Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have colnames. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder into one big dataframe, right? At least this would be very practical for me.
So here is what I tried. I set the working directory to the corresponding folder, define an empty variable d, store all the file names in file.list, and then loop through the names in file.list. First, I want to change the colnames of every file to "Wavenumber" and the corresponding file name itself, so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. After that I could probably do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for (i in file.list) {
  colnames(i) = c("Wavenumber", deparse(substitute(i)))
  d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error
"Error in colnames<-(*tmp*, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the colnames() line, which does not produce an error, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.
Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
  # get csv full paths. (?i) is for case insensitive
  list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
  # create a named vector: you need it to assign ids in the next step,
  # and remove the file extension to get clean colnames
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  # read the files one by one, bind them into one df and create an id column
  map_dfr(read.csv, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
  # pivot to create one column for each .id
  pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not to change the working directory.
I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated out of the names of the vector you're looping over. (That's why we added the names with set_names)
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
You will end up with a dataframe with as many rows as there are Wavenumbers and as many columns as there are CSV files plus 1 (the Wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 5 x 3
#> Wavenumber file1 file2
#> <dbl> <dbl> <dbl>
#> 1 401. 77.8 77.8
#> 2 403. 78.1 78.1
#> 3 405. 78.4 78.4
#> 4 407. 78.4 78.4
#> 5 409. 78.2 78.2
Thanks a lot to #Konrad Rudolph for the suggestions!!
No need for a loop here; simply use lapply().
First set your working directory to the file location.
library(dplyr)
files_to_upload <- list.files(pattern = "\\.csv$")
theData_list <- lapply(files_to_upload, read.csv, header = FALSE)
C_Sycamore <- bind_rows(theData_list)
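Note that bind_rows() alone won't record which spectrum each row came from. Naming the list first lets it add an id column, a sketch reusing files_to_upload and theData_list from above:
names(theData_list) <- tools::file_path_sans_ext(basename(files_to_upload))
C_Sycamore <- bind_rows(theData_list, .id = "file")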

How do I make the name of a list the name of its column?

I have a set of 270 RNA-seq samples, and I have already subsetted out their expected counts using the following code:
for (i in 1:length(sample_ID_list)) {
  assign(sample_ID_list[i], subset(get(sample_file_list[i]), select = expected_count))
}
Where sample_ID_list is a character list of each sample ID (e.g., 313312664) and sample_file_list is a character list of the file names for each sample already in my environment to be subsetted (e.g., s313312664).
Now, the head of one of those subsetted samples looks like this:
> head(`308087571`)
# A tibble: 6 x 1
expected_count
<dbl>
1 129
2 8
3 137
4 6230.
5 1165.
6 0
The problem is I want to paste all of these lists together to make a counts dataframe, but I will not be able to differentiate between columns without their sample ID as the column name instead of expected_count.
Does anyone know of a good way to go about this? Please let me know if you need any more details!
You can use:
dplyr::bind_rows(mget(sample_ID_list), .id = "name")
If we want to name the list, loop over the list, extract the first element of 'expected_count' ('nm1') and use that to assign the names of the list
nm1 <- sapply(mget(sample_file_list), function(x) x$expected_count[1])
names(sample_file_list) <- nm1
Or from sample_ID_list
do.call(rbind, Map(cbind, mget(sample_ID_list), name = sample_ID_list))
Update
Based on the comments, we can loop over 'sample_file_list' and 'sample_ID_list' with Map and rename the 'expected_count' column with the corresponding value from 'sample_ID_list'
sample_file_list2 <- Map(function(dat, nm) {
  names(dat)[match('expected_count', names(dat))] <- nm
  dat
}, mget(sample_file_list), sample_ID_list)
Or if we need a package solution,
library(data.table)
rbindlist(mget(sample_ID_list), idcol = "name")
Update:
Thank you all so much for your help. I had to update my for loop as follows:
for (i in 1:length(sample_ID_list)) {
  assign(sample_ID_list[i], subset(get(sample_file_list[i]), select = expected_count))
  data <- get(sample_ID_list[i])
  colnames(data) <- sample_ID_list[i]
  assign(sample_ID_list[i], data)
}
and was able to successfully reassign the names!
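For what it's worth, once each one-column tibble carries its sample ID, the whole counts dataframe can be assembled without a loop. A sketch, assuming every sample has its rows in the same gene order:
library(dplyr)
library(purrr)
counts <- mget(sample_ID_list) %>%
  imap(~ setNames(.x, .y)) %>% # rename each column to its sample ID
  bind_cols()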

r - read in multiple files and select max value from column by group

I am looking for the most elegant way to loop through and read in multiple files organized by date, and select the most recent value if anything changed, based on multiple keys.
Sadly, the reason I need to read in all the files and not just the last file is that an instance could disappear from a later file, and I would like to capture that.
Here is an example of what the files looks like (I'm posting comma separated even though it's fixed width)
file_20200101.txt
key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101
file_20200102.txt
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
and an example of the desired output:
desired_df
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
100,foo,15,20200101
In addition, here is some code I know works to read in multiple files and then get my ideal output, but I need it to happen inside the loop. The dataframe would be way too big to import and bind all the files:
files <- list.files(path, pattern = ".txt")
df <- files %>%
  map(function(f) {
    print(f)
    df <- fread(f)
    df <- df %>% mutate(date_as_numb = f)
    return(df)
  }) %>%
  bind_rows()
df <- df %>%
  mutate(file_date = as.numeric(str_remove_all(date_as_numb, ".*_"))) %>%
  group_by(key_1, key_2) %>%
  filter(date_as_numb == max(date_as_numb))
Thanks in advance!
Don't know what counts as "way too big" for you. data.table can (allegedly) handle really big data. So if bind_rows of the list is not OK, maybe use data.table.
(In my own experience, dplyr::group_by can be really slow with many groups (say 10^5 groups) in large-ish data (say around 10^6 rows). I don't have much experience with data.table, but all the threads mention its superiority for large data.)
I've used this answer for merging a list of data tables
library(data.table)
dt1 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101")
dt2 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102")
ls_files <- list(dt1, dt2)
# you would have created this list by calling fread with lapply, like
# ls_files <- lapply(files, fread)
# Now here's a two-liner with data.table.
alldata <- Reduce(function(...) merge(..., all = TRUE), ls_files)
alldata[alldata[, .I[which.max(date_as_numb)], by = .(key_1, key_2)]$V1]
#> key_1 key_2 value date_as_numb
#> 1: 100 foo 15 20200101
#> 2: 123 abc 50 20200102
#> 3: 456 def 500 20200102
#> 4: 789 xyz 300 20200102
Created on 2021-01-28 by the reprex package (v0.3.0)
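Since the stated constraint is that the full stack is too big to hold at once, here is a minimal sketch of the incremental version (assuming the same files and columns as above): dplyr::slice_max() keeps only the newest row per key after each file, so only the running "latest" table stays in memory.
library(dplyr)
library(data.table)
latest <- NULL
for (f in list.files(path, pattern = "\\.txt$", full.names = TRUE)) {
  current <- fread(f)
  latest <- bind_rows(latest, current) %>%
    group_by(key_1, key_2) %>%
    slice_max(date_as_numb, n = 1, with_ties = FALSE) %>% # keep the newest row per key
    ungroup()
}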
