Processing multiple files through an application in R

I'm processing files through an application using R. The application requires a simple input file / output file name specification as parameters. Using the code below, this works fine for a single file.
input <- "\"7374.txt\""
output <- "\"7374_cleaned.txt\""
system2("DataCleaner", args = c(input, output))
However, I wish to process a whole folder of .txt files rather than have to do each one individually. If I had access to the source code I would simply alter the application to accept a folder rather than an individual file, but unfortunately I don't. Is it possible to somehow do this in R? I had started to create a loop,
input <- dir(pattern=".txt")
but I don't know how I could pass the resulting vector in as arguments one file at a time, and I would also need to be able to paste '_cleaned' onto the end of the output file names. Many thanks in advance.

Obviously, I can't test it because I don't have your DataCleaner program, but how about this...
# make some files
dir.create('folder')
x <- sapply(1:5, function(f) {
  t <- tempfile(tmpdir = 'folder', fileext = '.txt')
  file.create(t)
  t
})
# find the files
inputfiles <- list.files(path = 'folder', pattern = 'txt', full.names = TRUE)
# remove the extension
base <- tools::file_path_sans_ext(inputfiles)
# make the output file names
outputfiles <- paste0(base, '_cleaned.txt')
mysystem <- function(input, output) {
  system2('DataCleaner', args = c(input, output))
}
lapply(seq_along(inputfiles), function(f) mysystem(inputfiles[f], outputfiles[f]))
It uses lapply to iterate over the input and output file names in parallel and calls system2 on each pair.
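An equivalent base-R alternative (a sketch, not from the original answer) pairs the two vectors with Map(); if the paths contain spaces, shQuote() can be wrapped around them:
# Call DataCleaner once per input/output pair.
Map(function(infile, outfile) system2("DataCleaner", args = c(infile, outfile)),
    inputfiles, outputfiles)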

Related

Can I automate an increasing value in a file name in R?

So I have .csv files of nesting data that I need to trim. I wrote a series of functions in R and then spit out the new, pretty .csv. The issue is that I need to do this with 59 .csv files, and I would like to automate the naming.
data1 <- read.csv("Nest001.csv", skip = 3, header = FALSE)
# functions functions functions
write.csv(edit, file.path(out.path, "Nest001_NEW.csv"), row.names = FALSE)
So... is there any way for me to loop the name Nest001 to Nest059 so that I don't have to delete and retype the name for every .csv?
EDIT to incorporate Gregor's suggestion:
One option:
filenames_in <- sprintf("Nest%03d.csv", 1:59)
filenames_out <- sub(pattern = "(\\d{3})(\\.)", replacement = "\\1_NEW\\2", filenames_in)
all_files <- matrix(c(filenames_in, filenames_out), ncol = 2)
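For illustration, the first few entries of those two vectors look like this:
head(filenames_in, 3)   # "Nest001.csv" "Nest002.csv" "Nest003.csv"
head(filenames_out, 3)  # "Nest001_NEW.csv" "Nest002_NEW.csv" "Nest003_NEW.csv"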
And then loop through them:
for (i in 1:nrow(all_files)) {
  temp <- read.csv(all_files[[i, 1]], skip = 3, header = FALSE)
  # do stuff
  write.csv(temp, all_files[[i, 2]], row.names = FALSE)
}
To do this purrr-style, you would create two lists similar to the above, and then write a custom function to read in the file, perform all the functions, and then output it.
e.g.
purrr::walk2(
  .x = filenames_in,
  .y = filenames_out,
  .f = ~ my_function(.x, .y)
)
Consider .x and .y as the i in the for loop; it goes through both lists simultaneously, and performs the function on each item.
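For completeness, here is a minimal sketch of what my_function could look like; the function is only named, not defined, in the answer, and the skip = 3 / header = FALSE settings are assumptions carried over from the read.csv calls above.
# Hypothetical my_function(): read one input file, process it, write the output.
my_function <- function(file_in, file_out) {
  temp <- read.csv(file_in, skip = 3, header = FALSE)
  # do stuff
  write.csv(temp, file_out, row.names = FALSE)
}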
More info is available in the purrr documentation.
Your best bet is to put all of these CSVs into one folder, without any other CSVs in that folder. Then, you can write a loop to go over every file in that folder, and read them in.
library(dplyr)
setwd("path to the folder with the CSVs goes here")
combinedData <- data.frame()
files <- list.files()
for (file in files) {
  current <- read.csv(file)
  combinedData <- bind_rows(combinedData, current)
}
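Equivalently (an alternative sketch, not part of the original answer), the loop can be collapsed into a single call:
# Read every file and row-bind the resulting data frames in one step.
combinedData <- dplyr::bind_rows(lapply(files, read.csv))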
EDIT: if there are other files in the folder that you don't want to read, you can add this line of code to only read in files that contain the word "Nest" in the title:
files <- files[grepl("Nest", files)]
Note that grepl() is case-sensitive by default; pass ignore.case = TRUE if the capitalisation of "Nest" might vary.

R rename files keeping part of original name

I'm trying to rename all the files in a folder (about 7,000 files) with just a portion of their original name.
The initial fips code is a 4- or 5-digit code that identifies counties and is different for every file in the folder. The rest of the original name is the state_county_lat_lon of each file.
For example:
Original name:
"5081_Illinois_Jefferson_-88.9255_38.3024_-88.75_38.25.wth"
"7083_Illinois_Jersey_-90.3424_39.0953_-90.25_39.25.wth"
"11085_Illinois_Jo_Daviess_-90.196_42.3686_-90.25_42.25.wth"
"13087_Illinois_Johnson_-88.8788_37.4559_-88.75_37.25.wth"
"17089_Illinois_Kane_-88.4342_41.9418_-88.25_41.75.wth"
And I need it to rename with just the initial code (fips):
"5081.wth"
"7083.wth"
"11085.wth"
"13087.wth"
"17089.wth"
I've tried using the list.files and file.rename functions, but I do not know how to pick the code out of the full name. Some kind of "wildcard" could work, but I don't know how to apply one properly, because the files all follow the same pattern but differ in content.
This is what I've tried this far:
setwd("C:/Users/xxx")
Files <- list.files(path = "C:/Users/xxx", pattern = "fips_*.wth", all.files = TRUE)
newName <- paste("fips", ".wth", sep = "")
for (x in length(Files)) {
  file.rename(Files, newName)
}
I've also tried with the "sub" function as follows:
setwd("C:/Users/xxxx")
Files <- list.files(path = "C:/Users/xxxx", all.files = TRUE)
for (x in length(Files)) {
sub("_*", ".wth", Files)}
but I get: Error in as.character(x) : cannot coerce type 'closure' to vector of type 'character'
OR
setwd("C:/Users/xxxx")
Files <- list.files(path = "C:/Users/xxxx", all.files = TRUE)
for (x in length(Files)) {
sub("^(\\d+)_.*", "\\1.wth", file)}
This runs without errors but does nothing to the file names.
I could use any help.
Thanks
Here is my example.
Preparation of the data to use:
dir.create("test_dir")
data_sets <- c("5081_Illinois_Jefferson_-88.9255_38.3024_-88.75_38.25.wth",
"7083_Illinois_Jersey_-90.3424_39.0953_-90.25_39.25.wth",
"11085_Illinois_Jo_Daviess_-90.196_42.3686_-90.25_42.25.wth",
"13087_Illinois_Johnson_-88.8788_37.4559_-88.75_37.25.wth",
"17089_Illinois_Kane_-88.4342_41.9418_-88.25_41.75.wth")
setwd("test_dir")
file.create(data_sets)
Rename the files:
Files <- list.files(all.files = TRUE, pattern = ".wth")
newName <- sub("^(\\d+)_.*", "\\1.wth", Files)
file.rename(Files, newName)
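Optionally (an addition, not part of the original answer), file.rename() returns a logical vector, so you can capture it instead of making the bare call above to confirm that every rename succeeded:
# Variant of the last step: keep the per-file success flags and warn on failure.
renamed <- file.rename(Files, newName)
if (!all(renamed)) warning("some files were not renamed")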

looping over all files in the same directory in R

I want to run the following code in R for all the files. Actually, I made a for loop for that, but when I run it, it is applied to only one file, not all of them. BTW, my files do not have a header.
You use [[ to subset something from peaks. However, once a file has been read in, peaks is just a data frame with no remaining reference to the file name, so you simply have to get rid of the [[i]].
for (i in filelist.coverages) {
  peaks <- read.delim(i, sep = '', header = FALSE)
  PeakSizes <- c(PeakSizes, peaks$V3 - peaks$V2)
}
By using the iterator i within read.delim(), which holds a new file name on each pass through the loop, peaks will contain the content of a new file every time.
In your code, i refers to a file name. Use indices instead.
And, by the way, don't use setwd; use the full.names = TRUE option in list.files. And preallocate PeakSizes, e.g. as a list, since each file contributes a vector of peak sizes: PeakSizes <- vector("list", length(filelist.coverages)).
So do:
filelist.coverages <- list.files('K:/prostate_cancer_porto/H3K27me3_ChIPseq/',
                                 pattern = 'island.bed', full.names = TRUE)
## all 97 bed files
PeakSizes <- vector("list", length(filelist.coverages))
for (i in seq_along(filelist.coverages)) {
  peaks <- read.delim(filelist.coverages[i], sep = '', header = FALSE)
  PeakSizes[[i]] <- peaks$V3 - peaks$V2
}
Or you could simply use sapply or purrr::map:
sapply(filelist.coverages, function(file) {
  peaks <- read.delim(file, sep = '', header = FALSE)
  peaks$V3 - peaks$V2
})
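If you want one flat numeric vector of all peak widths, as the c()-based loop above produces, unlist the result; for example, a purrr sketch of the same idea (an addition, not part of the original answer):
library(purrr)
# map() returns a list of per-file peak-width vectors; unlist() flattens them
# into a single numeric vector.
PeakSizes <- unlist(map(filelist.coverages, function(file) {
  peaks <- read.delim(file, sep = '', header = FALSE)
  peaks$V3 - peaks$V2
}))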

How to find all files sourcing a particular file?

Suppose I have file_a.R. It is sourced, via R's base source function, by some other files, file_b.R and file_c.R, which are located in the same folder or a subfolder. Is there an easy way to get the paths of file_b.R and file_c.R given the path of file_a.R?
EDIT:
If you want to get all links between R files and the files that are sourced in them, you can use something like this:
library(stringr)
#Get all R file paths in the working directory and subdirectories
filelist <- list.files(pattern = "[.]R$", recursive = TRUE)
#Extract one file's sources
getSources <- function(file, pattern) {
  #Store all file lines in a character vector
  lines <- readLines(file, warn = FALSE)
  #Extract R file names starting with "pattern" in all lines containing "source"
  sources <- lapply(lines, function(x) {
    if (length(grep("source", x)) > 0) {
      str_extract(x, paste0(pattern, ".*[.]R"))
    } else {
      NA
    }
  })
  #Remove NA (lines without source)
  sources <- sources[!is.na(sources)]
  #Return a list
  list(path = file,
       pattern = pattern,
       sources = unlist(sources))
}
#Example
corresp <- lapply(X = filelist, FUN = getSources, pattern = "file")
For each file, it returns a list containing:
$path: the R file's path
$pattern: the pattern used to match sources
$sources: the names of the sourced files
And you'll be able to see what is sourced where, including file_a.R.
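For example, to list only the files whose source() calls mention file_a.R (a hypothetical follow-up built on the corresp list above, not part of the original answer):
# Keep the entries whose sources include file_a.R, then pull out their paths.
sourcing_file_a <- Filter(function(x) any(grepl("file_a\\.R$", x$sources)), corresp)
sapply(sourcing_file_a, function(x) x$path)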

loop to change string (for file import function)

I'm working on some R code to automate file import. I've been using sub to change the path string; more specifically, I want to go through Trial1 to Trial10 for participant 1 and so forth, and then save it as data[i]. Instead of doing this manually for all trials, could it be done more efficiently with a loop? The function itself adds the file path to the imported data so I can use this information later.
library(plyr)
library(readr)

path <- "C:/Users/Thomas/Desktop/tapping backup/Pilot141116/pilot_151116_pat1_250/realisations/participant_8/Trial1"
setwd(path)
files <- list.files(path = path, pattern = "midi.*\\.csv", full.names = TRUE)

# set up a function to read a file and add a column for the filename
import <- function(file) {
  df <- read_csv(file, col_names = TRUE)
  df$file <- file
  return(df)
}

# run that function across all files.
data1 <- ldply(.data = files, .fun = import)
I would build the file list from pilot_151116_pat1_250/realisations/ with recursive = TRUE and full.names = TRUE, then run the ldply loop with the import function. Later you can deduce from the file column which participant and trial your data belonged to. This can be done with strsplit using split = "/", or with separate from the tidyr package.
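A hedged sketch of that last step (the column names, the regex, and the use of tidyr::extract() rather than separate() are my assumptions, based on the path layout shown in the question):
library(tidyr)
# Pull the participant and trial numbers out of the stored file path,
# keeping the original file column.
data1 <- extract(data1, file,
                 into = c("participant", "trial"),
                 regex = "participant_(\\d+)/Trial(\\d+)/",
                 remove = FALSE)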
