I am writing a script in R and I am trying to read files from different folders in a for loop.
I am doing something like this, but it didn't work. How can I read files from different folders in a for loop?
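# pattern is a regular expression, not a shell glob, so escape the dots and anchor the end
index2 <- list.files(path = "/path/to/folder2", pattern = "\\.entire\\.txt$")
index1 <- list.files(path = "/path/to/folder1", pattern = "\\.file\\.txt$")
for (y in seq_along(index2))
{
  for (x in seq_along(index1))
  {
    # file.path() adds the "/" that paste0() omitted; header is an argument of read.table()
    ac2 <- read.table(file.path("/path/to/folder2", index2[y]), header = FALSE)
    ac1 <- read.table(file.path("/path/to/folder1", index1[x]), header = FALSE)
  }
}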
P.S. I know about the lapply function, but here I am reading files, performing calculations, and saving the results to new files, so I want to use a for loop. Can anyone recommend how to read files from different folders in a for loop?
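A minimal sketch of one way to do this with plain for loops, assuming the goal is to combine every file from one folder with every file from the other and save each result. The temporary folders, the sample files, the output folder, and the "sum of first columns" calculation are all placeholders invented for illustration. The key point is `full.names = TRUE`, which makes `list.files()` return complete paths so nothing has to be pasted together later.

```r
# Hypothetical folders and file contents, created only so the sketch runs
folder1 <- file.path(tempdir(), "loop_demo_folder1")
folder2 <- file.path(tempdir(), "loop_demo_folder2")
out_dir <- file.path(tempdir(), "loop_demo_results")
for (d in c(folder1, folder2, out_dir)) dir.create(d, showWarnings = FALSE)
write.table(data.frame(1:3), file.path(folder1, "a.file.txt"),
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(4:6), file.path(folder2, "b.entire.txt"),
            row.names = FALSE, col.names = FALSE)

# full.names = TRUE returns complete paths, so no pasting is needed later
index1 <- list.files(folder1, pattern = "\\.file\\.txt$", full.names = TRUE)
index2 <- list.files(folder2, pattern = "\\.entire\\.txt$", full.names = TRUE)

for (y in seq_along(index2)) {
  ac2 <- read.table(index2[y], header = FALSE)
  for (x in seq_along(index1)) {
    ac1 <- read.table(index1[x], header = FALSE)
    # Placeholder calculation: element-wise sum of the first columns
    result <- data.frame(sum = ac1[[1]] + ac2[[1]])
    # One output file per pair of inputs, named after both inputs
    out_name <- paste0(basename(index1[x]), "_", basename(index2[y]))
    write.table(result, file.path(out_dir, out_name),
                row.names = FALSE, quote = FALSE)
  }
}
```

Reading `ac2` in the outer loop avoids re-reading the same file for every value of `x`.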
Related
I'm trying to load multiple .rds files that are saved in the same directory. I wrote a function for that and iterate over the list of files in the directory to load them, but it doesn't work. Here is what I wrote:
markerDir="..."
markerFilesList <- list.files(markerDir, pattern = "\\.rds$", recursive = TRUE)
readRDSfct <- function(markerFilesList, markerDir, i){
  print(paste0("Reading the marker file called: ", basename(markerFilesList[[i]])))
  nameVariableTmp = basename(markerFilesList[[i]])
  nameVariable = gsub(pattern = "\\.rds$", '', nameVariableTmp)
  print(paste0("File saved in variable called: ", nameVariable))
  # file.path() inserts the separator between directory and file name
  currentRDSfile = readRDS(file.path(markerDir, markerFilesList[[i]]))
  return(currentRDSfile)
}
# Store the returned objects; calling the function without assigning discards them
markers <- vector("list", length(markerFilesList))
for (i in seq_along(markerFilesList)){
  markers[[i]] <- readRDSfct(markerFilesList, markerDir, i)
}
Does anyone have a suggestion for how to do this?
Thanks in advance!
If I understand correctly, you just want to load all the RDS files saved in one directory into the R environment?
To load and bind all .RDS files in one directory, I use something like this:
List_RDS = list.files(pattern = "\\.RDS$", ignore.case = TRUE)
List_using = lapply(List_RDS, readRDS)
Data_bind <-do.call("rbind", List_using)
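A small sketch of the same idea that also keeps track of which object came from which file, by naming the list elements after the files. The directory `rds_demo` and the two sample files are made up so the example can run; `rbind` at the end is only appropriate when every file holds a data frame with the same columns, which is an assumption here.

```r
# Hypothetical directory with two small .rds files, created so the sketch runs
rds_dir <- file.path(tempdir(), "rds_demo")
dir.create(rds_dir, showWarnings = FALSE)
saveRDS(data.frame(id = 1:2), file.path(rds_dir, "markersA.rds"))
saveRDS(data.frame(id = 3:4), file.path(rds_dir, "markersB.rds"))

# "\\.rds$" is a regular expression; ignore.case also matches ".RDS"
rds_files <- list.files(rds_dir, pattern = "\\.rds$", ignore.case = TRUE,
                        full.names = TRUE)

# Read every file and name each element after its file (extension removed)
rds_list <- lapply(rds_files, readRDS)
names(rds_list) <- tools::file_path_sans_ext(basename(rds_files))

# Binding rows only makes sense when all objects share the same columns
all_rows <- do.call(rbind, rds_list)
```

Individual objects are then available as `rds_list[["markersA"]]` and so on, without creating one global variable per file.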
I am trying to import multiple Excel files, based on their titles, in RStudio. However, I have to build the paths to these files myself, and the files have two different extensions, .xls and .xlsx. I am confused about how to import them successfully in the fastest way possible.
I have tried to write some 'for' and 'if' constructs and failed miserably. I have given an example below. I just don't know how to go about this. Even producing some sort of error message, using 'try/tryCatch/stop', would be helpful.
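library(readxl)
a = paste0(c("subtitle1", "subtitle2"), ".xls")
b = vector("list", length(a))
for (i in seq_along(a)){
  # try() returns a "try-error" object instead of aborting the loop
  b[[i]] = try(read_excel(file.path(getwd(), a[i])))
  if (inherits(b[[i]], "try-error")) warning("could not read ", a[i])
}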
OR
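library(readxl)
titles = c("subtitle1", "subtitle2")
b = vector("list", length(titles))
for (i in seq_along(titles)){
  xls = file.path(getwd(), paste0(titles[i], ".xls"))
  xlsx = file.path(getwd(), paste0(titles[i], ".xlsx"))
  # read the .xls file if it exists, otherwise fall back to .xlsx
  if (file.exists(xls)){
    b[[i]] = read_excel(xls)
  } else {
    b[[i]] = read_excel(xlsx)
  }
}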
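One possible way to combine the extension check and the error handling is sketched below. The function name `read_by_title` and its `reader` parameter are hypothetical, introduced only for this illustration; by default the reader is `readxl::read_excel`, which requires the readxl package to be installed. Failures are reported as warnings and leave a `NULL` in the result, so one bad file does not stop the rest.

```r
# Sketch: find each title with either extension and read it, reporting failures.
# `reader` defaults to readxl::read_excel (needs the readxl package installed).
read_by_title <- function(titles, dir = getwd(), reader = readxl::read_excel) {
  out <- vector("list", length(titles))
  names(out) <- titles
  for (i in seq_along(titles)) {
    candidates <- file.path(dir, paste0(titles[i], c(".xls", ".xlsx")))
    found <- candidates[file.exists(candidates)]
    if (length(found) == 0) {
      warning("no .xls or .xlsx file found for: ", titles[i])
      next
    }
    # tryCatch() turns a read error into a warning so the loop continues.
    # Assigning with out[i] <- list(...) keeps the slot even when it is NULL.
    out[i] <- list(tryCatch(
      reader(found[1]),
      error = function(e) {
        warning("could not read ", found[1], ": ", conditionMessage(e))
        NULL
      }
    ))
  }
  out
}
```

A call such as `sheets <- read_by_title(c("subtitle1", "subtitle2"))` would then return a named list with one element per title.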
I want to, programmatically, source all .R files contained within a given array retrieved with the Sys.glob() function.
This is the code I wrote:
# fetch the different ETL parts
parts <- Sys.glob("scratch/*.R")
if (length(parts) > 0) {
for (part in parts) {
# source the ETL part
source(part)
# rest of code goes here
# ...
}
} else {
stop("no ETL parts found (no data to process)")
}
The problem I have is I cannot do this or, at least, I get the following error:
simpleError in source(part): scratch/foo.bar.com-https.R:4:151: unexpected string constant
I've tried different combinations for the source() function like the following:
source(sprintf("./%s", part))
source(toString(part))
source(file = part)
source(file = sprintf("./%s", part))
source(file = toString(part))
No luck. As I'm globbing the contents of a directory I need to tell R to source those files. As it's a custom-tailored ETL (extract, transform and load) script, I can manually write:
source("scratch/foo.bar.com-https.R")
source("scratch/bar.bar.com-https.R")
source("scratch/baz.bar.com-https.R")
But that's dirty and right now there are 3 different extraction patterns. They could be 8, 80 or even 2000 different patterns so writing it by hand is not an option.
How can I do this?
Try getting the list of files with dir and then using lapply:
For example, if your files are of the form t1.R, t2.R, etc., and are inside the path "StackOverflow" do:
d = dir(pattern = "^t\\d\\.R$", path = "StackOverflow/", recursive = TRUE, full.names = TRUE)
m = lapply(d, source)
The option recursive = TRUE will search all subdirectories, and full.names = TRUE will add the path to the filenames.
If you still want to use Sys.glob(), this works too:
d = Sys.glob(paths = "StackOverflow/t*.R")
m = lapply(d, source)
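Note that the error message in the question names a file, a line, and a column (scratch/foo.bar.com-https.R:4:151), which suggests a syntax problem inside that script rather than in the way source() is called. A rough sketch of how to keep going and collect the scripts that fail to parse is shown below. The folder `scratch_demo`, the two sample scripts, and the environment `etl_env` are invented for the example.

```r
# Hypothetical scripts written to a temporary folder: one valid, one broken
etl_dir <- file.path(tempdir(), "scratch_demo")
dir.create(etl_dir, showWarnings = FALSE)
writeLines("x_from_good <- 42", file.path(etl_dir, "good.R"))
writeLines("y <- 'a' 'b'", file.path(etl_dir, "bad.R"))  # two adjacent strings do not parse

# Evaluate the scripts in their own environment to keep the workspace clean
etl_env <- new.env()

parts <- Sys.glob(file.path(etl_dir, "*.R"))
failed <- character(0)
for (part in parts) {
  ok <- tryCatch({
    source(part, local = etl_env)
    TRUE
  }, error = function(e) {
    # Report which file failed and why, then continue with the next one
    message("failed to source ", part, ": ", conditionMessage(e))
    FALSE
  })
  if (!ok) failed <- c(failed, part)
}
```

After the loop, `failed` lists the scripts that need to be fixed by hand, and the objects created by the working scripts live in `etl_env`.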
I need to create a function in R that reads all the files in a folder (let's assume that all files are tables in tab-delimited format) and creates objects with the same names in the global environment. I did something similar to this (see code below): I was able to write a function that reads all the files in the folder, makes some changes in the first column of each file, and writes it back into the folder. But I couldn't find out how to assign the files that were read to objects that will stay in the global environment.
changeCol1 <- function () {
filesInfolder <- list.files()
for (i in 1:length(filesInfolder)){
wrkngFile <- read.table(filesInfolder[i])
wrkngFile[,1] <- gsub(0,1,wrkngFile[,1])
write.table(wrkngFile, file = filesInfolder[i], quote = F, sep = "\t")
}
}
You are much better off assigning them all to elements of a named list (and it's pretty easy to do, too):
changeCol1 <- function () {
filesInfolder <- list.files()
lapply(filesInfolder, function(fname) {
wrkngFile <- read.table(fname)
wrkngFile[,1] <- gsub(0, 1, wrkngFile[,1])
write.table(wrkngFile, file=fname, quote=FALSE, sep="\t")
wrkngFile
}) -> data
names(data) <- filesInfolder
data
}
a_list_full_of_data <- changeCol1()
Also, F will come back to haunt you some day: F and T are ordinary variables that can be reassigned, whereas FALSE and TRUE are reserved words.
Add this to your loop after making the changes:
assign(filesInfolder[i], wrkngFile, envir=globalenv())
If you want to put them into a list, one way would be, outside your loop, declare a list:
mylist = list()
Then, within your loop, do like so:
mylist[[filesInfolder[i]]] = wrkngFile
And then you can access each object by looking at:
mylist[[filename]]
from the global env.
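If separate objects are really needed, a named list can also be turned into objects in one step with list2env(). The sketch below is only an illustration: the list `data` stands in for the result of reading the files, and a fresh environment `demo_env` is used so that nothing in the workspace is overwritten. Passing `globalenv()` instead would create the objects in the global environment.

```r
# Hypothetical named list standing in for the data read from the folder
data <- list(
  "fileA.txt" = data.frame(V1 = c("1", "1")),
  "fileB.txt" = data.frame(V1 = c("1", "2"))
)

# Object names: strip the extension and make the result syntactically valid
names(data) <- make.names(tools::file_path_sans_ext(names(data)))

# Copy every list element into an environment as its own object.
# demo_env is used here for safety; pass globalenv() to create global objects.
demo_env <- new.env()
list2env(data, envir = demo_env)

ls(demo_env)
```

Cleaning the names first matters because file names such as "fileA.txt" are awkward to use as variable names.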
The easy answer to this is "buy more RAM" but I am hoping to get a more constructive answer and learn something in the process.
I am running Windows 7 64-bit with 8GB of RAM.
I have several very large .csv.gz files (~450MB uncompressed) with the same exact header information that I read into R and perform some processing on. Then, I need to combine the processed R objects into a single master object and write back out to .csv on disk.
I do this same operation on multiple sets of files. As an example, I have 5 folders each with 6 csv.gz files in them. I need to end up with 5 master files, one for each folder.
My code looks something like the following:
for( loop through folders ){
master.file = data.table()
for ( loop through files ) {
filename = list.files( ... )
file = as.data.table ( read.csv( gzfile( filename ), stringsAsFactors = F ))
gc()
...do some processing to file...
# append file to the running master.file
if ( nrow(master.file) == 0 ) {
master.file = file
} else {
master.file = rbindlist( list( master.file, file) )
}
rm( file, filename )
gc()
}
write.csv( master.file, unique master filename, row.names = FALSE )
rm( master.file )
gc()
}
This code does not work. I get the "cannot allocate memory" error before it writes out the final csv. I was watching the resource monitor while running this code and don't understand why it would be using 8GB of RAM to do this processing. The total of all the file sizes is roughly 2.7GB, so I was expecting that the maximum memory R would use is 2.7GB. But the write.csv operation seems to use the same amount of memory as the data object you are writing, so if you have a 2.7GB object in memory and try to write it out, you would be using about 5.4GB of memory.
This apparent reality, combined with a for loop in which memory doesn't seem to be freed adequately, seems to be the problem.
I suspect that I could use the sqldf package as mentioned here and here but when I set the sqldf statement equal to an R variable I ended up with the same out of memory errors.
Update 12/23/2013 - The following solution works entirely in R without running out of memory
(Thanks @AnandaMahto).
The major caveat with this method is that you must be absolutely sure that the files you are reading in and writing out each time have exactly the same header columns, in exactly the same order, or your R processing code must ensure this, since write.table does not check this for you.
for( loop through folders ){
for ( loop through files ) {
filename = list.files( ... )
file = as.data.table ( read.csv( gzfile( filename ), stringsAsFactors = F ))
gc()
...do some processing to file...
# append file to the running master.file
if ( first time through inner loop) {
write.table(file,
"masterfile.csv",
sep = ",",
dec = ".",
qmethod = "double",
row.names = FALSE)
} else {
write.table(file,
"masterfile.csv",
sep = ",",
dec = ".",
qmethod = "double",
row.names = FALSE,
append = TRUE,
col.names = FALSE)
}
rm( file, filename )
gc()
}
gc()
}
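For reference, a self-contained sketch of the append approach above for a single folder, using only base R. The folder `gz_demo`, the two tiny gzipped files, and the output name `masterfile_demo.csv` are placeholders created so the example runs. Note that row.names, col.names, and append are passed as logical values, not as quoted strings.

```r
# Hypothetical folder with two small gzipped CSV files sharing one header
in_dir <- file.path(tempdir(), "gz_demo")
dir.create(in_dir, showWarnings = FALSE)
for (k in 1:2) {
  con <- gzfile(file.path(in_dir, paste0("part", k, ".csv.gz")), "w")
  write.csv(data.frame(id = k, value = k * 10), con, row.names = FALSE)
  close(con)
}

master <- file.path(tempdir(), "masterfile_demo.csv")
if (file.exists(master)) file.remove(master)  # start from a clean file

files <- list.files(in_dir, pattern = "\\.csv\\.gz$", full.names = TRUE)
for (i in seq_along(files)) {
  chunk <- read.csv(gzfile(files[i]), stringsAsFactors = FALSE)
  # ...processing of `chunk` would go here...
  first <- (i == 1)
  write.table(chunk, master, sep = ",", dec = ".", qmethod = "double",
              row.names = FALSE,
              col.names = first,   # write the header only for the first file
              append = !first)     # later files are appended to the master
  rm(chunk)
}
```

Because only one file is held in memory at a time, peak memory use is bounded by the largest single file rather than by the combined size.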
My Initial Solution:
for( loop through folders ){
for ( loop through files ) {
filename = list.files( ... )
file = as.data.table ( read.csv( gzfile( filename ), stringsAsFactors = F ))
gc()
...do some processing to file...
#write out the file
write.csv( file, ... )
rm( file, filename )
gc()
}
gc()
}
I then downloaded and installed GnuWin32's sed package and used Windows command line tools to append the files as follows:
copy /b *common_pattern*.csv master_file.csv
This appends together all of the individual .csv files whose names have the text pattern "common_pattern" in them, headers and all.
Then I use sed.exe to remove all but the first header line as follows:
"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -i 2,${/header_pattern/d;} master_file.csv
-i tells sed to just overwrite the specified file (in-place).
2,$ tells sed to look at range from the 2nd row to the last row ($)
{/header_pattern/d;} tells sed to find all lines in the range that contain the text "header_pattern" and delete them (d)
In order to make sure this was doing what I wanted it to do, I first printed the lines I was planning to delete.
"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -n 2,${/header_pattern/p;} master_file.csv
Works like a charm, I just wish I could do it all in R.