R batch txt file processing

I am new to R and want to batch process all files in a working directory.
I have lots of .txt files and want to read each one in, calculate the frequency of one column, calculate a percentage and a so-called "H-Score", calculate the sum of the H-Scores and store it in a vector. Then the next .txt file should be processed, and so on.
After all files are processed, I want to write the vector to another .txt file as the result. The final .txt file should also contain the name of each input file and its calculated H-Score sum. This is what I have so far, but as you can see, I am an absolute newbie to programming and R...
setwd("~/Desktop/Automated Analysis/TXT/") # Set working directory
# List all txt files including sub-folders
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.txt$", full.names = TRUE)
library(data.table)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist( sapply(list_of_files, fread, simplify = FALSE),
use.names = TRUE, idcol = "FileName" )
br = c(0,1,3,9,15,500) # Set breaks
bins = c(0,1,2,3,4) # Set bins
for (k in 1:length(list_of_files)) { # process all the files in the working directory
HScore_list = c() # create a vector for storing the results
for(i in 1:5) { my_vector = c(HScore_list,i) }
freq = hist(Count, breaks=br, plot=FALSE)
df = data.frame(bins, frequency=freq$counts,
df$percent=df$frequency / sum(df$frequency) * 100,
df$HScore=df$percent * df$bins)
HScore = sum(df$HScore)
}
write(HScore_list, "HScore_list.txt", sep="\n")
Do you understand what I want, and can you help me?
EDIT: My problem is that the code produces no output.
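In case it helps, here is a minimal sketch of the intended workflow (an assumption-laden example, not a drop-in fix: it assumes each .txt file is tab-separated with a header row and contains a numeric column named Count, as the hist(Count, ...) call suggests; adjust read.table() to match your files):
setwd("~/Desktop/Automated Analysis/TXT/")
list_of_files <- list.files(".", pattern = "\\.txt$",
                            recursive = TRUE, full.names = TRUE)

br   <- c(0, 1, 3, 9, 15, 500)  # breaks
bins <- c(0, 1, 2, 3, 4)        # bins

HScore_list <- numeric(length(list_of_files))  # one H-Score sum per file
for (k in seq_along(list_of_files)) {
  dat  <- read.table(list_of_files[k], header = TRUE, sep = "\t")  # assumed format
  freq <- hist(dat$Count, breaks = br, plot = FALSE)
  df   <- data.frame(bins = bins, frequency = freq$counts)
  df$percent <- df$frequency / sum(df$frequency) * 100
  df$HScore  <- df$percent * df$bins
  HScore_list[k] <- sum(df$HScore)
}

# One row per input file: file name plus its summed H-Score
result <- data.frame(FileName = basename(list_of_files), HScore = HScore_list)
write.table(result, "HScore_list.txt", sep = "\t", row.names = FALSE, quote = FALSE)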

Related

Why does this loop only read in the first txt file correctly?

ep_dir <- "C:/Users/J/Desktop/e_prot_unicode"
reading and merging data
# reading the data. empty list that gets filled up
ep_ldf<-list()
# creates a list of all the files in the directory with ending .txt
listtxt_ep<-list.files(path = ep_dir, pattern="*.txt", full.names = T)
# loop for reading all the files in the list
for(m in 1:length(listtxt_ep)){
ep_ldf[[m]]<-read.table(listtxt_ep[m],fill=T,header=F,sep = "\t",stringsAsFactors=FALSE)
}
f_ep = "C:/Users/J/Desktop/e_prot_unicode//05AP.U1"
#reading and merging the files, data.table is then called d_ep
d_ep = data.frame()
for(f_ep in listtxt_ep){
tmp_ep <- read.delim(f_ep,row.names = NULL,sep = "\t",fileEncoding="UTF-16LE",fill = T) %>% as.data.frame(stringsAsFactors = F)
d_ep <- rbind.fill(d_ep, tmp_ep)
}
I want to read in a bunch of txt files. The above code reads the files in incorrectly: only the first one (05AP.U1) contains all values properly. All the others are missing the values in the first column (I do not mean the row numbering), which are the names. Why does this code only read the first file correctly?
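For what it is worth, here is a sketch that reads every file with identical settings and only then combines them (it assumes all files really are UTF-16LE and tab-separated like the first one, and uses dplyr::bind_rows in place of rbind.fill):
library(dplyr)  # for bind_rows()

listtxt_ep <- list.files(path = ep_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file with the same arguments, then stack the data frames
d_ep <- bind_rows(lapply(listtxt_ep, function(f) {
  read.delim(f, row.names = NULL, sep = "\t",
             fileEncoding = "UTF-16LE", fill = TRUE,
             stringsAsFactors = FALSE)
}))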

Passing a list of one function to another, across different directories

I am trying to assign random numbers to a copy of a file in various directories (tricky to replicate). The directory structure is as follows:
1100
1100/Images
I first create the new directories and copy the images across. For this I have the following working code:
NewImageGen <- function(singledir){
#
Directoryforrandoms <- "RandomNumAsignment"
Directoryforrandoms <- paste(singledir, "/", Directoryforrandoms, sep = "")
dir.create(Directoryforrandoms,
showWarnings = F)
#
Imagedir <- paste(singledir, '/Images', sep = '')
filestocopy <- list.files(Imagedir,
recursive = F,
full.names = T)
file.copy(from = filestocopy,
to = Directoryforrandoms,
overwrite = F)
#
newfiles <- list.files(path = Directoryforrandoms,
pattern = ".tif", # they are all tiff files
full.names = T)
#
return(newfiles)
}
NewImages <- pblapply(alldirs, FUN = NewImageGen)
This gives me a large list, which is divided into four (due to there being four directories in this instance). I want to pass the newfiles to another function which generates a random number prefix and sticks it on the front. I can do this on a regular list of files using:
RandomNumGen <- function(singleimg){
randomnumber <- as.character(sample(100000:999999, 1, replace=F))
singlerename <- sub('^', randomnumber, singleimg)
file.rename(from = singleimg,
to = singlerename)
}
It runs through all elements of the list but returns a frustrating FALSE.
Any help would be top notch!
pblapply() returns a list, while list.files() returns a character vector.
So if your function works with the return value of list.files(), try changing your call from pblapply() to pbsapply().
Edit:
newfiles, i.e. the return value of the NewImageGen() function, contains full paths (as you set full.names = T in the last list.files() call).
sub('^', randomnumber, singleimg) will add the random number before the first part of the path, like 677570/home/Jim/1100/Images/RandomNumAsignment/original_image_name.tif.
Instead you want to do
file.path(dirname(singleimg), sub('^', randomnumber, basename(singleimg)))
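Putting that together, a revised RandomNumGen might look like this (a sketch: it prefixes only the file name and leaves the directory part of the path untouched):
RandomNumGen <- function(singleimg){
  randomnumber <- as.character(sample(100000:999999, 1))
  singlerename <- file.path(dirname(singleimg),
                            paste0(randomnumber, basename(singleimg)))
  file.rename(from = singleimg, to = singlerename)
}

# NewImages is a list of character vectors (one vector per directory),
# so apply the function at both levels
lapply(NewImages, function(imgs) vapply(imgs, RandomNumGen, logical(1)))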

Using lapply to apply a function over read-in list of files and saving output as new list of files

I'm quite new to R and a bit stuck on what I feel is likely a common operation. I have a number of files (57, with ~1.5 billion rows cumulatively across 6 columns) that I need to perform basic functions on. I'm able to read these files in and perform the calculations I need with no problem, but I'm tripping up on the final output. I envision the function working on one file at a time, outputting the worked file and moving on to the next.
After the calculations I would like to output 57 new .txt files named after the file the input data first came from. So far I'm able to perform the calculations on smaller test datasets and spit out one appended .txt file, but this isn't what I want as the final output.
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#begin looping process
loop_output = lapply(files,
function(x) {
#Load 'x' file in
DF<- read.table(x, header = FALSE, sep= "\t")
#Call calculated height average a name
R_ref= 1647.038203
#Add column names to .las data
colnames(DF) <- c("X","Y","Z","I","A","FC")
#Calculate return
DF$R_calc <- (R_ref - DF$Z)/cos(DF$A*pi/180)
#Calculate intensity
DF$Ir_calc <- DF$I * (DF$R_calc^2/R_ref^2)
#Output new .txt with calculated columns
write.table(DF, file=, row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
})
My latest code endeavors have been to mess around with the initial lapply/sapply call, like so:
#begin looping process
loop_output = sapply(names(files),
function(x) {
As well as the output line:
#Output new .csv with calculated columns
write.table(DF, file=paste0(names(DF), "txt", sep="."),
row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
From what I've been reading, the file naming during the write.table output may be one of the pieces I don't have fully aligned yet with the rest of the script. I've been looking at a lot of other questions that I felt were applicable:
Using lapply to apply a function over list of data frames and saving output to files with different names
Write list of data.frames to separate CSV files with lapply
but with no luck. I deeply appreciate any insights or pointers in the right direction on reading in x files, performing the same function on each, and then outputting the same x files. Thank you.
The reason the output is directed to the same file is probably that file = paste0(names(DF), "txt", sep=".") returns the same value for every iteration. That is, DF must have the same column names in every iteration, therefore names(DF) will be the same, and paste0(names(DF), "txt", sep=".") will be the same. Along with the append = TRUE option the result is that all output is written to the same file.
Inside the anonymous function, x is the name of the input file. Instead of using names(DF) as a basis for the output file name you could do some transformation of this character string.
Example:
Given
x <- "/foo/raw_data.csv"
Inside the function you could do something like this
infile <- x
outfile <- file.path(dirname(infile), gsub('raw', 'clean', basename(infile)))
outfile
[1] "/foo/clean_data.csv"
Then use the new name for output, with append = FALSE (unless you need it to be true)
write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE, append = FALSE, fileEncoding = "UTF-8")
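Slotting that into the original lapply call, the whole loop might look something like this (a sketch that keeps the column names and R_ref value from the question; the "_out" suffix for the output files is just an example):
loop_output <- lapply(files, function(x) {
  DF <- read.table(x, header = FALSE, sep = "\t")
  colnames(DF) <- c("X", "Y", "Z", "I", "A", "FC")
  R_ref <- 1647.038203
  DF$R_calc  <- (R_ref - DF$Z) / cos(DF$A * pi / 180)
  DF$Ir_calc <- DF$I * (DF$R_calc^2 / R_ref^2)
  # Build the output name from the input name, e.g. scan01.txt -> scan01_out.txt
  outfile <- file.path(dirname(x),
                       paste0(tools::file_path_sans_ext(basename(x)), "_out.txt"))
  write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE,
              append = FALSE, fileEncoding = "UTF-8")
  outfile
})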
Using your code, this is the general idea:
require(purrr)
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#Call calculated height average a name
R_ref= 1647.038203
dfTransform <- function(file){
colnames(file) <- c("X","Y","Z","I","A","FC")
#Calculate return
file$R_calc <- (R_ref - file$Z)/cos(file$A*pi/180)
#Calculate intensity
file$Ir_calc <- file$I * (file$R_calc^2/R_ref^2)
return(file)
}
output <- files %>%
  map(read.table, header = FALSE, sep = "\t") %>%
  map(dfTransform) %>%
  map2(files, ~ write.table(.x,  # one output file per input file
       file = paste0(tools::file_path_sans_ext(.y), "_out.txt"),
       row.names = FALSE, col.names = FALSE, fileEncoding = "UTF-8"))

Combine csv files with common file identifier

I have a list of approximately 500 csv files each with a filename that consists of a six-digit number followed by a year (ex. 123456_2015.csv). I would like to append all files together that have the same six-digit number. I tried to implement the code suggested in this question:
Import and rbind multiple csv files with common name in R, but I want the appended data to be saved as new csv files in the same directory where the original files are currently saved. I have also tried to implement the code below, but the csv files it produces contain no data.
rm(list=ls())
filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test")
NAPS_ID <- gsub('.+?\\([0-9]{5,6}?)\\_.+?$', '\\1', filenames)
Unique_NAPS_ID <- unique(NAPS_ID)
n <- length(Unique_NAPS_ID)
for(j in 1:n){
curr_NAPS_ID <- as.character(Unique_NAPS_ID[j])
NAPS_ID_pattern <- paste(".+?\\_(", curr_NAPS_ID,"+?)\\_.+?$", sep = "" )
NAPS_filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test", pattern = NAPS_ID_pattern)
write.csv(do.call("rbind", lapply(NAPS_filenames, read.csv, header = TRUE)),file = paste("C:/Users/smithma/Desktop/PM25_test/MERGED", "MERGED_", Unique_NAPS_ID[j], ".csv", sep = ""), row.names=FALSE)
}
Any help would be greatly appreciated.
Because you're not doing any data manipulation, you don't need to treat the files like tabular data. You only need to copy the file contents.
filenames <- list.files("C:/Users/smithma/Desktop/PM25_test", full.names = TRUE)
NAPS_ID <- substr(basename(filenames), 1, 6)
Unique_NAPS_ID <- unique(NAPS_ID)
for(curr_NAPS_ID in Unique_NAPS_ID){
NAPS_filenames <- filenames[startsWith(basename(filenames), curr_NAPS_ID)]
output_file <- paste0(
"C:/Users/nwerth/Desktop/PM25_test/MERGED_", curr_NAPS_ID, ".csv"
)
for (fname in NAPS_filenames) {
line_text <- readLines(fname)
# Write the header from the first file
if (fname == NAPS_filenames[1]) {
cat(line_text[1], '\n', sep = '', file = output_file)
}
# Append every line in the file except the header
line_text <- line_text[-1]
cat(line_text, file = output_file, sep = '\n', append = TRUE)
}
}
My changes:
list.files(..., full.names = TRUE) is usually the best way to go.
Because the digits appear at the start of the filenames, I suggest substr. It's easier to get an idea of what's going on when skimming the code.
Instead of looping over the indices of a vector, loop over the values. It's more succinct and less likely to cause problems if the vector's empty.
startsWith and endsWith are relatively new functions, and they're great.
You only care about copying lines, so just use readLines to get them in and cat to get them out.
You might consider something like this:
##will take the first 6 characters of each file name
six.digit.filenames <- substr(filenames, 1,6)
path <- "C:/Users/smithma/Desktop/PM25_test/"
unique.numbers <- unique(six.digit.filenames)
for(j in unique.numbers){
sub <- filenames[which(substr(filenames,1,6) == j)]
data.for.output <- c()
for(file in sub){
##now do your stuff with these files including read them in
data <- read.csv(paste0(path,file))
data.for.output <- rbind(data.for.output,data)
}
write.csv(data.for.output,paste0(path,j, '.csv'), row.names = F)
}

save files into a specific subfolder in a loop in R

I feel I am very close to the solution, but at the moment I can't figure out how to get there.
I've got the following problem.
In my folder "Test" I've got stacked data files with the names M1_1, M1_2, M1_3 and so on: /Test/M1_1.dat, for example.
Now I want to separate the files so that I get M1_1[1].dat, M1_1[2].dat, M1_1[3].dat and so on. These files I'd like to save in specific subfolders: Test/M1/M1_1[1], Test/M1/M1_1[2] and so on, and Test/M2/M1_2[1], Test/M2/M1_2[2] and so on.
I have already created the subfolders, and I have the following command to split up the files so that I get M1_1.dat[1] and so on:
for (e in dir(path = "Test/", pattern = ".dat", full.names=TRUE, recursive=TRUE)){
data <- read.table(e, header=TRUE)
df <- data[ -c(2) ]
out <- split(df , f = df$.imp)
lapply(names(out),function(z){
write.table(out[[z]], paste0(e, "[",z,"].dat"),
sep="\t", row.names=FALSE, col.names = FALSE)})
}
Now the paste0 command gets me my desired split-up data (although it is named M1_1.dat[1] instead of M1_1[1].dat), but I can't figure out how to get this data into my subfolders.
Maybe you've got an idea?
Thanks in advance.
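For reference, one way to route each split piece into its subfolder is to build the output path with file.path() from the subfolder and the file's base name. The sketch below follows the original loop and assumes the subfolder for M1_k is Test/Mk, as described above, and that it already exists:
for (e in dir(path = "Test/", pattern = "\\.dat$", full.names = TRUE, recursive = TRUE)) {
  data <- read.table(e, header = TRUE)
  df   <- data[ -c(2) ]
  out  <- split(df, f = df$.imp)
  base      <- tools::file_path_sans_ext(basename(e))                    # e.g. "M1_1"
  subfolder <- file.path(dirname(e), paste0("M", sub(".*_", "", base)))  # e.g. "Test/M1"
  lapply(names(out), function(z) {
    write.table(out[[z]],
                file.path(subfolder, paste0(base, "[", z, "].dat")),
                sep = "\t", row.names = FALSE, col.names = FALSE)
  })
}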
I don't have any idea what your data looks like, so I am going to attempt to recreate the scenario with the gender datasets available at baby names.
Assuming all the files from the zip folder are stored in "inst/data":
Store all file paths in the all_fi variable:
all_fi <- list.files("inst/data",
full.names = TRUE,
recursive = TRUE,
pattern = "\\.txt$")
> head(all_fi, 3)
[1] "inst/data/yob1880.txt" "inst/data/yob1881.txt"
Preset a function that will apply to each file in the directory (dplyr, tools and jsonlite need to be loaded for %>%/select/mutate, file_path_sans_ext and rbind.pages):
library(dplyr)
library(tools)
library(jsonlite)
f.it <- function(f_in = NULL){
# Create the new folder based on the existing basename of the input file
new_folder <- file_path_sans_ext(f_in)
dir.create(new_folder)
data.table::fread(f_in) %>%
select(name = 1, gender = 2, freq = 3) %>%
mutate(
gender = ifelse(grepl("F", gender), "female","male")
) %>% (function(x){
# Dataset contains names for males and females
# so that's what I'm using to mimic your split
out <- split(x, x$gender)
o <- rbind.pages(
lapply(names(out), function(i){
# New filename for each iteration of the split dataframes
###### THIS IS WHERE YOU NEED TO TWEAK FOR YOUR NEEDS
new_dest_file <- sprintf("%s/%s.txt", new_folder, i)
# Write the sub-data-frame to the new file
data.table::fwrite(out[[i]], new_dest_file)
# For our purposes return a dataframe with file info on the new
# files...
data.frame(
file_name = new_dest_file,
file_size = file.size(new_dest_file),
stringsAsFactors = FALSE)
})
)
o
})
}
Now we can just loop through:
NOTE: for my purposes I'm not going to spend time looping through every file; for your purposes this would apply to each of your initial files, i.e. all_fi rather than all_fi[2:5] in my case.
> rbind.pages(lapply(all_fi[2:5], f.it))
============================ =========
file_name file_size
============================ =========
inst/data/yob1881/female.txt 16476
inst/data/yob1881/male.txt 15306
inst/data/yob1882/female.txt 18109
inst/data/yob1882/male.txt 16923
inst/data/yob1883/female.txt 18537
inst/data/yob1883/male.txt 15861
inst/data/yob1884/female.txt 20641
inst/data/yob1884/male.txt 17300
============================ =========
