Save files into a specific subfolder in a loop in R

I feel I am very close to the solution, but at the moment I can't figure out how to get there.
I've got the following problem.
In my folder "Test" I've got stacked data files with the names M1_1, M1_2, M1_3 and so on: /Test/M1_1.dat, for example.
Now I want to separate the files, so that I get M1_1[1].dat, M1_1[2].dat, M1_1[3].dat and so on. I'd like to save these files in specific subfolders: Test/M1/M1_1[1], Test/M1/M1_1[2] and so on, and Test/M2/M1_2[1], Test/M2/M1_2[2] and so on.
I have already created the subfolders, and I have the following code to split up the files, which gives me M1_1.dat[1] and so on:
for (e in dir(path = "Test/", pattern = ".dat", full.names=TRUE, recursive=TRUE)){
data <- read.table(e, header=TRUE)
df <- data[ -c(2) ]
out <- split(df , f = df$.imp)
lapply(names(out),function(z){
write.table(out[[z]], paste0(e, "[",z,"].dat"),
sep="\t", row.names=FALSE, col.names = FALSE)})
}
The paste0 call gets me my desired split-up data (although it's named M1_1.dat[1] instead of M1_1[1].dat), but I can't figure out how to get this data into my subfolders.
Maybe you've got an idea?
Thanks in advance.

I don't have any idea what your data looks like, so I am going to attempt to recreate the scenario with the gender datasets available at baby names.
Assume all the files from the zip folder are stored in "inst/data".
Store all file paths in the all_fi variable:
all_fi <- list.files("inst/data",
full.names = TRUE,
recursive = TRUE,
pattern = "\\.txt$")
> head(all_fi, 3)
[1] "inst/data/yob1880.txt" "inst/data/yob1881.txt"
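Preset function that will apply to each file in the directory
library(dplyr) # %>%, select(), mutate()
library(jsonlite) # rbind.pages(), the older name of rbind_pages()
f.it <- function(f_in = NULL){
# Create the new folder based on the existing basename of the input file
new_folder <- tools::file_path_sans_ext(f_in)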
dir.create(new_folder)
data.table::fread(f_in) %>%
select(name = 1, gender = 2, freq = 3) %>%
mutate(
gender = ifelse(grepl("F", gender), "female","male")
) %>% (function(x){
# Dataset contains names for males and females
# so that's what I'm using to mimic your split
out <- split(x, x$gender)
o <- rbind.pages(
lapply(names(out), function(i){
# New filename for each iteration of the split dataframes
###### THIS IS WHERE YOU NEED TO TWEAK FOR YOUR NEEDS
new_dest_file <- sprintf("%s/%s.txt", new_folder, i)
# Write the sub-data-frame to the new file
data.table::fwrite(out[[i]], new_dest_file)
# For our purposes return a dataframe with file info on the new
# files...
data.frame(
file_name = new_dest_file,
file_size = file.size(new_dest_file),
stringsAsFactors = FALSE)
})
)
o
})
}
Now we can just loop through:
NOTE: for my purposes I'm not going to spend time looping through each file, for your purposes this would apply to each of your initial files, or in my case all_fi rather than all_fi[2:5].
> rbind.pages(lapply(all_fi[2:5], f.it))
============================ =========
file_name file_size
============================ =========
inst/data/yob1881/female.txt 16476
inst/data/yob1881/male.txt 15306
inst/data/yob1882/female.txt 18109
inst/data/yob1882/male.txt 16923
inst/data/yob1883/female.txt 18537
inst/data/yob1883/male.txt 15861
inst/data/yob1884/female.txt 20641
inst/data/yob1884/male.txt 17300
============================ =========
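Applied back to the original question, the same idea can be sketched in base R. This is only a sketch under assumptions: split_to_subfolder is a made-up helper name, and the mapping from file name to subfolder (M1_2.dat goes to M2) is guessed from the examples in the question. Unlike the question's code, it keeps all columns.

```r
# Sketch: split one stacked file by its .imp column and write the pieces
# into a subfolder, naming them like M1_2[1].dat (index before the extension).
split_to_subfolder <- function(f_in, root = dirname(f_in)) {
  stem <- tools::file_path_sans_ext(basename(f_in))   # e.g. "M1_2"
  # Assumed mapping: the number after the underscore picks the subfolder ("M2")
  subfolder <- file.path(root, paste0("M", sub("^.*_", "", stem)))
  dir.create(subfolder, showWarnings = FALSE, recursive = TRUE)

  dat <- read.table(f_in, header = TRUE)
  pieces <- split(dat, dat$.imp)
  out_files <- file.path(subfolder, paste0(stem, "[", names(pieces), "].dat"))
  for (k in seq_along(pieces)) {
    write.table(pieces[[k]], out_files[k], sep = "\t",
                row.names = FALSE, col.names = FALSE)
  }
  invisible(out_files)
}

# Possible usage (non-recursive, so files already written are not re-read):
# for (f in list.files("Test", pattern = "\\.dat$", full.names = TRUE)) {
#   split_to_subfolder(f)
# }
```

Building the output path from the file stem rather than from the full input path is what moves the [z] index in front of the extension.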


How to loop over on different files and save the output with filename in R?

I have several files with the names RTDFE, TRYFG, FTYGS, WERTS...like 100 files in txt format. For each file, I'm using the following code and writing the output in a file.
name = c("RTDFE")
file1 <- paste0(name, "_filter",".txt")
file2 <- paste0(name, "_data",".txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
nrow(C)
145
Output:
Samples Common
RTDFE 145
Right now I assign one file's name to the variable name, run my code, and write the output to a file, once per file. Instead, I want the code to run on all the files in one go and produce the following output. Common is the number of rows of the merged data frame C.
The output I need:
Samples Common
RTDFE 145
TRYFG ...
FTYGS ...
WERTS ...
How to do this? Any help.
How about putting all your names in a single vector, called names, like this:
names<-c("TRYFG","RTDFE",...)
and then feeding each one to a function that reads the files, merges them, and returns the rows
f<-function(n) {
fs = paste0(n,c("_filter", "_data"),".txt")
C = merge(
read.delim(fs[1],sep="\t", header=F),
read.delim(fs[2],sep="\t", header=F), by="XYZ")
data.frame(Samples=n,Common=nrow(C))
}
Then just call this function f on each of the values in names, row-binding the results together
do.call(rbind, lapply(names, f))
An easy way to create the vector names is like this:
p = "_(filter|data)\\.txt$"
names = unique(gsub(p,"",list.files(pattern = p)))
I am making some assumptions here.
The first assumption is that you have all these files in a folder with no other text files (.txt) in this folder.
If so you can get the list of files with the command list.files.
But when doing so you will get both the "_data.txt" and the "_filter.txt" files.
We need a way to extract the base part of the name.
I use "str_replace" to remove the "_data.txt" and the "_filter.txt" from the names.
After doing so each name appears twice. Therefore I use the "unique" command.
I store this in "lfiles", which will now contain "RTDFE, TRYFG, FTYGS, WERTS..." and any other file that satisfies the conditions.
After this I run a for loop on this list.
I reopen the files similarly as you do.
I merge by XYZ and I immediately put the results in a data frame.
By using rbind I keep adding results to the data frame "res".
library(stringr)
lfiles=list.files(path = ".", pattern = "\\.txt$")
## we strip the "_filter.txt" and "_data.txt" suffixes from the file names
lfiles=unique( sapply(lfiles, function(x){
x=str_replace(x, "_data\\.txt$", "")
x=str_replace(x, "_filter\\.txt$", "")
return(x)
} ))
res=NULL
for(i in lfiles){
file1 <- paste0(i, "_filter.txt")
file2 <- paste0(i, "_data.txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
# append this sample's row to the rows collected so far
res=rbind(res, data.frame(Samples=i, Common=nrow(merge(A, B, by="XYZ"))))
}
Ok, I will assume you have a folder called "data" with files named "RTDFE_filter.txt, RTDFE_data.txt, TRYFG_filter.txt, TRYFG_data.txt", etc. (only and exactly these files).
This code should give a possible way
# save the file names (full paths, so they can be read from any working directory)
files = list.files("data", full.names = TRUE)
# get indexes for "data" (for "filter" indexes, add 1)
files_data_index = seq(1, length(files), 2) # 1, 3, 5, ...
# loop on indexes
results = lapply(files_data_index, function(i) {
A <- read.delim(files[i+1], sep = "\t", header = FALSE)
B <- read.delim(files[i], sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
samp = strsplit(basename(files[i]), "_")[[1]][1]
com = nrow(C)
return(c(Samples = samp, Common = com))
})
# combine results
do.call(rbind, results)
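For reference, here is a compact base R sketch of the same approach. It rests on assumptions not stated in the question: count_common is a made-up name, and the files are assumed to have a header row containing the key column (called "XYZ" in the question), since merging by a column name only works when that name exists in both data frames.

```r
# Sketch: one row per sample with the number of rows its two files share.
# Assumes <sample>_filter.txt and <sample>_data.txt are tab-separated files
# with a header row and a common key column (named "XYZ" by default here).
count_common <- function(dir = ".", key = "XYZ") {
  data_files <- list.files(dir, pattern = "_data\\.txt$", full.names = TRUE)
  samples <- sub("_data\\.txt$", "", basename(data_files))
  common <- vapply(samples, function(s) {
    a <- read.delim(file.path(dir, paste0(s, "_filter.txt")))
    b <- read.delim(file.path(dir, paste0(s, "_data.txt")))
    nrow(merge(a, b, by = key))
  }, integer(1))
  data.frame(Samples = samples, Common = unname(common))
}

# Possible usage:
# write.table(count_common("data"), "common.txt", sep = "\t",
#             row.names = FALSE, quote = FALSE)
```

Deriving the sample names from the "_data.txt" files alone avoids the need for unique(), at the cost of assuming every sample has both files.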

Can I automate an increasing value in a file name in R?

So I have .csv's of nesting data that I need to trim. I wrote a series of functions in R and then spit out the new pretty .csv. The issue is that I need to do this with 59 .csv's and I would like to automate the name.
data1 <- read.csv("Nest001.csv", skip = 3, header=F)
functions functions functions
write.csv("Nest001_NEW.csv, file.path(out.path, edit), row.names=F)
So...is there any way for me to loop the name from Nest001 to Nest059 so that I don't have to delete and retype the name for every .csv?
EDIT to incorporate Gregor's suggestion:
One option:
filenames_in <- sprintf("Nest%03d.csv", 1:59)
filenames_out <- sub(pattern = "(\\d{3})(\\.)", replacement = "\\1_NEW\\2", filenames_in)
all_files <- matrix(c(filenames_in, filenames_out), ncol = 2)
And then loop through them:
for (i in 1:nrow(all_files)) {
temp <- read.csv(all_files[[i, 1]], skip = 3, header=F)
# do stuff
write.csv(temp, all_files[[i, 2]], row.names = FALSE)
}
To do this purrr-style, you would use the same two vectors as above, and then write a custom function that reads in a file, performs all the functions, and writes the output.
e.g.
purrr::walk2(
.x = filenames_in,
.y = filenames_out,
.f = ~my_function(.x, .y)
)
Consider .x and .y as the i in the for loop; walk2 goes through both vectors simultaneously and calls the function on each pair of items.
More info is available here.
Your best bet is to put all of these CSVs into one folder, without any other CSVs in that folder. Then, you can write a loop to go over every file in that folder, and read them in.
library(dplyr)
setwd("path to the folder with CSV's goes here")
combinedData = data.frame()
files = list.files()
for (file in files)
{
# read each file and append its rows to the combined data frame
combinedData = bind_rows(combinedData, read.csv(file))
}
EDIT: if there are other files in the folder that you don't want to read, you can add this line of code to only read in files that contain the word "Nest" in the name:
files = files[grepl("Nest", files)]
Note that grepl is case sensitive by default; pass ignore.case = TRUE if you want to match "nest" as well.
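Putting the sprintf-based naming into one function, a minimal base R sketch could look like the following. The names process_nests and trim_nest are made up for illustration, and trim_nest is only a placeholder for the asker's own series of functions.

```r
# Sketch: read Nest001.csv ... NestNNN.csv and write NestNNN_NEW.csv.
# trim_nest() is a hypothetical stand-in for the real trimming functions;
# here it simply drops rows that contain missing values.
trim_nest <- function(df) df[stats::complete.cases(df), , drop = FALSE]

process_nests <- function(ids, in_dir = ".", out_dir = in_dir) {
  # %03d zero-pads the number to three digits: 1 -> "001", 59 -> "059"
  files_in  <- file.path(in_dir,  sprintf("Nest%03d.csv", ids))
  files_out <- file.path(out_dir, sprintf("Nest%03d_NEW.csv", ids))
  for (i in seq_along(ids)) {
    temp <- read.csv(files_in[i], skip = 3, header = FALSE)
    write.csv(trim_nest(temp), files_out[i], row.names = FALSE)
  }
  invisible(files_out)
}

# Possible usage: process_nests(1:59)
```

Generating the input and output names from the same ids keeps the two in step without any regex.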

Function to extract date from jpg files in a directory

I have a large volume (aprx 10 000) jpg files with dates written on each one. I wish to extract the date from each jpg and add this to a dataframe with a corresponding filename.
I have read this forum and beyond and I have tried to patch together a function in R which will perform the task but I cannot get it to work. I have used a loop to:
1) generate a list of image files in the chosen directory
2) create a dataframe for the results with a column for file path and a column
for date (extracted from the jpg)
3) loop through files in directory:
Resize,
Crop to portion of image showing date,
OCR the image,
Write date to dataframe - created in step 2
This seems to crash when I run the function and I am not really sure why. I am an R user but I have not written functions before (you can probably tell)
I am using R 3.6.0 and RStudio
library(tesseract)
library(magick)
library(tidyverse)
library(gsubfn)
get_jpeg_date <- function(folder) {
file_list <- list.files(path=folder, pattern="*.jpg", recursive = T)
image_dates <- as.data.frame(file_list)
image_dates $ ImageDate <- rep_len(x = NA, length.out = length(file_list))
eng <- tesseract("eng")
for (i in length(file_list) ) {
ImageDate <- image_read(paste(folder,"\\",file_list, sep = ""))%>%
image_resize("2000") %>%
image_crop("300x100+1800") %>%
tesseract::ocr(engine = eng) %>%
strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)%>%
image_dates[,i]
}
}
x <- get_jpeg_date(folder = folder)
folder <- "C:/file_path"
x <- get_jpeg_date(folder = folder)
The code in the loop works on single files but there is no output when I run the function on a small test sample of 3 jpg images.
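Consider refactoring your function to run on a single jpg file, then build the column by applying it to every file with sapply or map. In R, a function returns the value of its last evaluated expression. Your function ends with a for loop, which returns NULL, so nothing comes back (note also that for (i in length(file_list)) visits only the last index; seq_along(file_list) would visit them all). The function below ends with the pipeline, so it returns the OCR'ed and regex-matched string vector.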
get_jpeg_date <- function(pic) {
eng <- tesseract("eng")
image_read(pic) %>%
image_resize("2000") %>%
image_crop("300x100+1800") %>%
tesseract::ocr(engine = eng) %>%
strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)
}
file_list <- list.files(path=folder, pattern="\\.jpg$", full.names = TRUE, recursive = TRUE)
# DATA FRAME BUILD (keep paths as character, not factor, on R < 4.0)
image_dates_df <- data.frame(img_path = file_list, stringsAsFactors = FALSE)
# COLUMN ASSIGNMENT
image_dates_df$img_date <- sapply(image_dates_df$img_path, get_jpeg_date)
# ALTERNATIVELY WITH dplyr::mutate() and purrr::map()
image_dates_df <- data.frame(img_path = file_list, stringsAsFactors = FALSE) %>%
mutate(img_date = map(img_path, get_jpeg_date))
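The date-matching step can also be tried on its own, without OCR, which makes it easier to check the regex against sample text. The sketch below uses base R in place of strapplyc; extract_date is a made-up name, and the sample strings are invented, not real OCR output.

```r
# Sketch: pull the first d/m/y-style token out of each string, NA if none.
extract_date <- function(txt) {
  m <- regexpr("\\d+/\\d+/\\d+", txt)
  out <- rep(NA_character_, length(txt))
  hit <- !is.na(m) & m > 0          # regexpr() gives -1 where nothing matched
  out[hit] <- regmatches(txt, m)    # regmatches() returns only the matches
  out
}

extract_date(c("CAM1 12/05/2019 14:32", "no date here"))
# [1] "12/05/2019" NA
```

Returning NA for images where nothing matched keeps the result the same length as the file list, so it can be assigned directly as a data frame column.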

Merge multiple Excel files starting at row in R

I have multiple Excel files that I need to merge into one, but only certain rows. The Excel files look like this...
The column headers are identical for all files. I also need to add a new column A to the newly generated file, so I created a separate Excel file with just the headers and the new column A. My script first reads in this file (below) and writes it to the workbook...
Next, I need to read each file, starting at row 9 and merge all the data, one after another. So the final result should look like this (minus the Member site column, I haven't attempted the logic for that yet, but thinking it will be a substring of the Specimen ID value)...
However, my current result is...
I am currently only using 3 files, each with a few dozen rows, to start, but the end goal is to merge about 15-30 files, each with 25 to 200 rows, give or take. So...
1) I know my code is incorrect, but not sure how to get the intended results. For one, my loop is overwriting data because it's constantly starting at row/column 2 when it writes. However, I can't think how to rewrite this.
2) The dates are returning in General format ("43008" instead of "9/30/2017")
3) Data from certain columns is being placed under different columns (for example, Nucleic Acid Concentration has the values from the Date of Tissue Content).
Any advice or help would be greatly appreciated!
My code...
library(openxlsx) # Excel and csv files
library(svDialogs) # Dialog boxes
setwd("C:/Users/Work/Combined Manifest")
# Create and load Excel file
wb <- createWorkbook()
# Add worksheet
addWorksheet(wb, "Template")
# Read in & write header file
df.headers <- read.xlsx("headers.xlsx", sheet = "Template")
writeData(wb, "Template", df.headers, colNames = TRUE)
# Function to get user path
getPath <- function() {
# Ask for path
path <- dlgInput("Enter path to files: ", Sys.info()["user"])$res
if (dir.exists(path)) {
# If path exists, set the path as the working directory
return(path)
} else {
# If not, issue an error and recall the getPath function
dlg_message("Error: The path you entered is not a valid directory. Please try again.")$res
getPath()
}
}
# Call getPath function
folder <- getPath()
setwd(folder)
# Get list of files in directory
pattern.ext <- "\\.xlsx$"
files <- dir(folder, full=TRUE, pattern=pattern.ext)
# Get basenames and remove extension
files.nms <- basename(files)
files.nms <- gsub(pattern.ext, "", files.nms)
# Set the names
names(files) <- files.nms
# Iterate to read in files and write to new file
for (nm in files.nms) {
# Read in files
df <- read.xlsx((files[nm]), sheet = "Template", startRow = 9, colNames = FALSE)
# Write data to sheet
writeData(wb, "Template", df, startCol = 2, startRow = 2, colNames = FALSE)
}
saveWorkbook(wb, "Combined.xlsx", overwrite = TRUE)
EDIT:
So with the loop below, I am successfully reading in the files and merging them. Thanks for all the help!
for (nm in files.nms) {
# Read in files
df <- read.xlsx(files[nm], sheet = "Template", startRow = 8, colNames = TRUE, detectDates = TRUE, skipEmptyRows = FALSE,
skipEmptyCols = FALSE)
# Append the data
allData <- rbind(allData, df)
}
EDIT: FINAL SOLUTION
Thanks to everyone for the help!!
library(openxlsx) # Excel and csv files
library(svDialogs) # Dialog boxes
# Create and load Excel file
wb <- createWorkbook()
# Add worksheet
addWorksheet(wb, "Template")
# Function to get user path
getPath <- function() {
# Ask for path
path <- dlgInput("Enter path to files: ", Sys.info()["user"])$res
if (dir.exists(path)) {
# If path exists, set the path as the working directory
return(path)
} else {
# If not, issue an error and recall the getPath function
dlg_message("Error: The path you entered is not a valid directory. Please try again.")$res
getPath()
}
}
# Call getPath function
folder <- getPath()
# Set working directory
setwd(folder)
# Get list of files in directory
pattern.ext <- "\\.xlsx$"
files <- dir(folder, full=TRUE, pattern=pattern.ext)
# Get basenames and remove extension
files.nms <- basename(files)
# Set the names
names(files) <- files.nms
# Create empty dataframe
allData <- data.frame()
# Create list (reserve memory)
f.List <- vector("list",length(files.nms))
# Look and load files
for (nm in 1:length(files.nms)) {
# Read in files
f.List[[nm]] <- read.xlsx(files[nm], sheet = "Template", startRow = 8, colNames = TRUE, detectDates = TRUE, skipEmptyRows = FALSE,
skipEmptyCols = FALSE)
}
# Append the data
allData <- do.call("rbind", f.List)
# Add a new column as 'Member Site'
allData <- data.frame('Member Site' = "", allData)
# Take the substring of the Specimen.ID column for Member Site
allData$Member.Site <- sapply(strsplit(allData$Specimen.ID, "-"), "[", 2)
# Write data to sheet
writeData(wb, "Template", startCol = 1, allData)
# Save workbook
saveWorkbook(wb, "Combined.xlsx", overwrite = TRUE)
First of all, you are providing a lot of information in your question, which is generally a good thing, but I’m wondering if you could make your problems easier to solve by recreating the problem using fewer and smaller files. Could you figure out how to merge two files, each containing a small amount of data first?
With regards to the first challenge you raise:
1) Yes you are overwriting the workbook in each loop. I would suggest you load the data and append it to a data.frame and then store the end result after loading all the files. Have a look at the example below. Please note that this example uses rbind, which is inefficient if you are combining a large number of files. So if you have many files you may need to use a different structure.
# Create an empty data frame
allData <- data.frame()
# Loop and load files
for(nm in files.nms) {
# Read in files
df <- read.xlsx((files[nm]), sheet = "Template", startRow = 9, colNames = FALSE)
# Append the data
allData <- rbind(allData, df)
}
# Write the combined data to the sheet once, after the loop
writeData(wb, "Template", allData, startCol = 2, startRow = 2, colNames = FALSE)
Hopefully this gets you closer to what you need!
Edit: Updating the answer to address the comments made
If you have more than a few files, rbind will get slow, as @Parfait mentioned, because multiple copies of the data are made. The way to avoid this is to first reserve space in memory by creating an empty list with enough slots to hold your data, then fill in the list, and only at the end merge all the data together using do.call("rbind", ...). I've compiled some sample code below that's in line with what you provided in your question.
# Create list (reserve memory)
f.List <- vector("list",length(files.nms))
# Loop and load files
for(eNr in 1:length(files.nms)) {
# Read in files
f.List[[eNr]] <- read.xlsx(files[eNr], sheet = "Template", startRow = 9)
}
# Append the data
allData <- do.call("rbind", f.List)
Below to illustrate this further, a small reproducible example. It uses just a couple of data frames, but it illustrates the process of creating a list, populating that list, and merging the data as the last step.
# Sample data
df1 <- data.frame(x=1:3, y=3:1)
df2 <- data.frame(y=4:6, x=3:1)
df.List <- list(df1,df2)
# Create list
d.List <- vector("list",length(df.List))
# Loop and add data
for(eNr in 1:length(df.List)) {
d.List[[eNr]] <- df.List[[eNr]]
}
# Bind all at once
dfAll <- do.call("rbind", d.List)
print(dfAll)
Hope this helps! Thanks!
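On point 2 of the question (dates coming back as "43008"): those values are Excel serial day numbers. The final solution above handles this with detectDates = TRUE, but if the numbers have already been read in, they can be converted afterwards. A small base R sketch follows; excel_serial_to_date is a made-up name, and the origin is the one used for Excel's default (1900) date system on Windows.

```r
# Sketch: convert Excel serial day numbers (1900 date system) to Date.
excel_serial_to_date <- function(x) {
  as.Date(as.numeric(x), origin = "1899-12-30")
}

excel_serial_to_date("43008")
# [1] "2017-09-30"
format(excel_serial_to_date("43008"), "%m/%d/%Y")
# [1] "09/30/2017"
```

openxlsx also ships a convertToDate() helper for the same purpose, which may be preferable if the workbook uses the 1904 date system.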

R batch txt file processing

I am new to R and I want to batch process all files in a working directory.
I have lots of .txt files and want to read them in, calculate a frequency of one Column, calculate percentage and a so called "H-Score", calculate the sum of the H-Score and store it in a vector. Then the next .txt file should be processed and so on.
After all files are processed, I want to write the vector to another .txt file as a result. The final .txt file should also contain the name of each input file and its calculated sum of H-Score. This is what I have so far, but as you can see, I am an absolute newbie to programming and R...
setwd("~/Desktop/Automated Analysis/TXT/") # Set working directory
# List all txt files including sub-folders
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.txt$", full.names = TRUE)
library(data.table)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist( sapply(list_of_files, fread, simplify = FALSE),
use.names = TRUE, idcol = "FileName" )
br = c(0,1,3,9,15,500) # Set breaks
bins = c(0,1,2,3,4) # Set bins
for (k in 1:length(list_of_files)) { # process all the files in the working directory
HScore_list = c() # create a vector for storing the results
for(i in 1:5) { my_vector = c(HScore_list,i) }
freq = hist(Count, breaks=br, plot=FALSE)
df = data.frame(bins, frequency=freq$counts,
df$percent=df$frequency / sum(df$frequency) * 100,
df$HScore=df$percent * df$bins)
HScore = sum(df$HScore)
}
write(HScore_list, "HScore_list.txt", sep="\n")
Do you know what I want and can help me?
EDIT: My problem is that the code produces no output.
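One possible way to structure this, as a sketch only: compute the H-Score for one vector of counts in a small function, then apply it to every file and collect the results in a data frame. The names h_score and h_score_files are made up; the sketch assumes each .txt file is tab-separated with a header row and a column called Count, and that all counts fall between 0 and 500 (the range of the breaks in the question).

```r
# Sketch: H-Score of one vector of counts, using the breaks and bin weights
# from the question.
h_score <- function(count, br = c(0, 1, 3, 9, 15, 500), bins = 0:4) {
  freq <- hist(count, breaks = br, plot = FALSE)$counts
  percent <- freq / sum(freq) * 100
  sum(percent * bins)
}

# Apply h_score() to the Count column of every file; one result row per file.
h_score_files <- function(files, column = "Count") {
  scores <- vapply(files, function(f) {
    h_score(read.delim(f)[[column]])
  }, numeric(1))
  data.frame(FileName = basename(files), HScore = unname(scores))
}

# Possible usage:
# files <- list.files(".", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
# write.table(h_score_files(files), "HScore_list.txt", sep = "\t",
#             row.names = FALSE, quote = FALSE)
```

If the result is written into the same folder, a later run would pick up HScore_list.txt as an input file, so writing it elsewhere (or excluding it from the file list) is probably safer.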
