R loop opening files

I am trying to run a simple five-line command over 9000 different files. I wrote the following for loop:
setwd("/Users/morandin/Desktop/Test")
output_file<- ("output_file.txt")
files <- list.files("/Users/morandin/Desktop/Test")
for(i in files) {
chem.w.in <- scan(i, sep=",")
pruned.tree<-drop.tip(mytree,which(chem.w.in %in% NA))
plot(pruned.tree)
pruned.tree.ja.chem.w.in <- phylo4d(pruned.tree, c(na.omit(chem.w.in)))
plot(pruned.tree.ja.chem.w.in)
out <- abouheif.moran(pruned.tree.ja.chem.w.in)
print(out)
}
Hey I am editing my question: the above code does the for loop perfectly now (thanks for all your help). I am still having an issue with the output.
I can redirect the entire output using R through bash commands but I would need the name of the processed file. My output looks like this:
class: krandtest
Monte-Carlo tests
Call: as.krandtest(sim = matrix(res$result, ncol = nvar, byrow = TRUE),
obs = res$obs, alter = alter, names = test.names)
Number of tests: 1
Adjustment method for multiple comparisons: none
Permutation number: 999
Test Obs Std.Obs Alter Pvalue
1 dt 0.1458514 0.7976225 greater 0.2
other elements: adj.method call
Is there a way to print Pvalue results and name of the file (element i)??
Thanks

Since Paul Hiemstra's answer answered #1, here's an answer to #2, assuming that by "answers" you mean "the printed output of abouheif.moran(pruned.tree.ja.chem.w.in)".
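Use cat() with the argument append = TRUE. For example:
output_file = "my_output_file.txt"
for(i in files) {
# do stuff
# make plots
out <- abouheif.moran(pruned.tree.ja.chem.w.in)
# sprintf() cannot format the result object directly, so capture its printed form as text
out_text <- paste(capture.output(print(out)), collapse = "\n")
out_text <- sprintf("-------\n%s:\n-------\n%s\n\n", i, out_text)
cat(out_text, file = output_file, append = TRUE)
}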
This will produce a file called my_output_file.txt that looks like:
-------
file_1:
-------
output_goes_here
-------
file_2:
-------
output_goes_here
Obviously the formatting is entirely up to you; I just wanted to demonstrate what could be done here.
An alternative solution would be to sink() the entire script, but I'd rather be explicit about it. A middle road might be to sink() just a small piece of the code, but except in extreme cases it's a matter of preference or style.
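If only the file name and the p-value are needed, one option is to collect them in a data frame instead of formatting the printed output. This is a minimal sketch, assuming the object returned by abouheif.moran() keeps its p-values in a `pvalue` element (check with str(out) on a real result). `fake_test()` and the file names are hypothetical stand-ins so the example runs without the phylogenetics packages.

```r
# Hypothetical stand-in for abouheif.moran(); it returns a list with a
# `pvalue` element, which is assumed to mirror the real result object.
fake_test <- function(x) list(obs = mean(x), pvalue = 0.2)

files <- c("file_1.txt", "file_2.txt")   # made-up file names
results <- data.frame(file = character(0), pvalue = numeric(0))

for (i in files) {
  out <- fake_test(c(1, 2, 3))           # real code: out <- abouheif.moran(...)
  # One row per processed file: its name and the p-value of the test
  results <- rbind(results, data.frame(file = i, pvalue = out$pvalue))
}

# Write the table as tab-separated text
output_file <- file.path(tempdir(), "pvalues.txt")
write.table(results, file = output_file, sep = "\t",
            row.names = FALSE, quote = FALSE)
```

With 9000 files, growing a data frame with rbind() inside the loop is slow but simple; collecting the rows in a list and binding once at the end would be the faster variant.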

I suspect what is going wrong here is that list.files() by default returns only the names of the files, not the entire path to each file. Setting full.names to TRUE will fix this issue. Note that you will not have to add the .txt extension to the file name, as list.files() already returns the full path to an existing file.
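The difference is easy to see in a throwaway folder. This is a small sketch; the folder and file names are made up for illustration.

```r
# Create a temporary folder with two empty files (names are made up).
test_dir <- file.path(tempdir(), "full_names_demo")
dir.create(test_dir, showWarnings = FALSE)
file.create(file.path(test_dir, c("a.txt", "b.txt")))

short_names <- list.files(test_dir)                     # names only
full_paths  <- list.files(test_dir, full.names = TRUE)  # folder + name

# The full paths can be opened from any working directory;
# basename() recovers the short names for labelling output.
basename(full_paths)
```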

Related

How to write to csv each split chunk?

I have list data for which I used split:
x <- split(A, f = A$Col_1)
It works beautifully. But now I need to write each chunk of the split to an individual .csv. There are 2100 chunks of 140 rows each. Let's call them "1:2100". I would like to create something that wrote "1" to "~/full_path_name/A1.csv" then go to "2" and write to "~/full_path_name/A2.csv", then "3" to "~/full_path_name/A3.csv", etc.
I included "~/full_path_name/" because down the road this path name will change for other data using the same code, and for my own understanding I need to see it in the code. I don't know how to write a small sample of what I am asking for for someone to correct because I don't know how to write it at all.
Can someone make a suggestion on how to do this? Thank you.
I have only been coding for a month and am entirely self-taught. I do not have a background in other programming languages. I have no one to ask for help but here. I struggle with the terminology, so please understand if I am not asking in the proper way and I will try to correct it if need be.
EDIT, AFTER DOING SOME FURTHER RESEARCH --
This is what I have found elsewhere on SO from @RichPaloo, and my adaptations below that:
#example data.frame
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
#split into a list by the y column
l <- split(df, df$y)
#the names of the list are the unique values of the y column
nam <- names(l)
#iterate over the list and the vector of list names and write csvs
library(readr) #provides write_csv()
for(i in seq_along(l)) {
write_csv(l[[i]], paste0(nam[i], ".csv"))
}
This is my version:
bcc4.5_WINTER <- split(bcc4.5_FinalWinterRO, f = bcc4.5_FinalWinterRO$HUC8)
nam <- names(bcc4.5_WINTER)
for(i in 1:length(bcc4.5_WINTER)) {
write_csv(bcc4.5_WINTER[[i]], paste0(“~/Rprojects/BCC_CSM1_1_RCP_45/Winter/”, nam[i], “.csv”))
}
I appear to have a problem with the folder within my home folder "/BCC_CSM1_1_RCP_45/Winter/” It says "unexpected token" at both ends, but not at the "~Rprojects". Can I not send something to a folder within my home folder?
It also shows redlines under the quotes around ".csv" near the end. I don't know what to make of this because it's exactly what the person used successfully, apparently, in another post. Thank you.
So, the code example above (@Paul) worked except the df[l] was not being iterated, so I removed the _i from each l instance. The final problem I had (in comments above) was because the path name was not complete.
I used fwrite() rather than write.csv because it gave me better feedback as I struggled with mistakes. This gave me what I needed:
library(data.table) #provides fwrite()
#split the data into chunks by the values in a column, in this case column "BBB"
df <- split(old_df, f = old_df$BBB)
#write those chunks to individual .csv files with the name being the name of each chunk
save_fun <- function(df, name_i) {
fwrite(df, file = paste0("~/Desktop/projects_folder/", name_i, ".csv"))
}
#save the file on your computer
mapply(FUN = save_fun, df, name_i = names(df), SIMPLIFY = FALSE)
Much thanks to Paul.
Investigating the potential typo problem
Please see the two lines below:
write.csv(l[[1]], file = paste0("./a_folder/", names(l)[1], ".csv"))
write.csv(l[[1]], file = paste0(“./a_folder/”, names(l)[1], “.csv”))
Line 1 will save the file. Note that "./a_folder/" and ".csv" are seen as text.
In line 2, “./a_folder/” and “.csv” are not recognized as text, because they are wrapped in curly (typographic) quotes instead of straight quotes. Line 2 produces an error: unexpected input in " write.csv(l[[1]], file = paste0(“"
RStudio colors your code to help you with this problem.
Thoughts about not using a for loop.
I think a better way to go (especially when you have a large dataset) is to use lapply or mapply. These functions take each "chunk" of a list and apply a function to it.
However, lapply loses the name of each chunk while processing it, which is annoying when you want to use the name of the chunk to name the file on your computer. mapply() comes in handy in this situation.
Here is an example using the provided example.
# example data.frame
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
# split df
l <- split(df, df$y)
# save each "chunk" of l as a .csv file on a hard drive
# 1st, create a function that takes a "chunk" of your list and its name as inputs
save_fun <- function(l_i, name_i) {
print(l_i) # print the output in console
write.csv(l_i, file = paste0("./a_folder/", name_i, ".csv")) # save the file on your computer
}
# 2nd, use mapply (and not lapply) to apply the previous function to each chunk/name pair
mapply(FUN = save_fun, l_i = l, name_i = names(l), SIMPLIFY = FALSE) # see ?mapply for how to use mapply()
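The same idea can be sketched so that it runs anywhere: write into a temporary folder that is created first, because write.csv() does not create missing folders. Map() is base R's wrapper around mapply() with SIMPLIFY = FALSE; the folder name is a hypothetical placeholder.

```r
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
l  <- split(df, df$y)

out_dir <- file.path(tempdir(), "a_folder")   # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)     # write.csv() will not create it

# Build one target path per chunk, then write each chunk/path pair
paths <- file.path(out_dir, paste0(names(l), ".csv"))
invisible(Map(function(chunk, path) write.csv(chunk, path, row.names = FALSE),
              l, paths))

list.files(out_dir)
```

Changing `out_dir` is the only edit needed to send the files somewhere else, which keeps the path visible in one place.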

Saving plots with different filenames using R

While fine-tuning parameters for plots I want to save all the test runs in different files so that they will not be lost. So far, I managed to do it using the code below:
# Save the plot as WMF file - using random numbers to avoid overwriting
number <- sample(1:20,1)
filename <- paste("dummy", number, sep="-")
fullname <- paste(filename, ".wmf", sep="")
# Next line actually creates the file
dev.copy(win.metafile, fullname)
dev.off() # Turn off the device
This code works, generating files with name "dummy-XX.wmf", where XX is a random number between 1 and 20, but it looks cumbersome and not elegant at all.
Is there any more elegant method to accomplish the same? Or even, to keep a count of how many times the code has been run and generate nice progressive numbers for the files?
If you really want to increment (to avoid overwriting what files already exist) you can create a small function like this one:
createNewFileName = function(path = getwd(), pattern = "plot_of_something", extension=".png") {
myExistingFiles = list.files(path = path, pattern = pattern)
print(myExistingFiles)
completePattern = paste0("^(",pattern,")([0-9]*)(",extension,")$")
existingNumbers = gsub(pattern = completePattern, replacement = "\\2", x = myExistingFiles)
if (identical(existingNumbers, character(0)))
existingNumbers = 0
return(paste0(pattern,max(as.numeric(existingNumbers))+1,extension))
}
# will create the file myplot1.png
png(filename = createNewFileName(pattern="myplot"))
hist(rnorm(100))
dev.off()
# will create the file myplot2.png
png(filename = createNewFileName(pattern="myplot"))
hist(rnorm(100))
dev.off()
If you are printing many plots, you can do something like
png("plot-%02d.png")
plot(1)
plot(1)
plot(1)
dev.off()
This will create three files "plot-01.png", "plot-02.png", "plot-03.png"
The filename you specify can take an sprintf-like format where the index of the plot is passed in. Note that counting is reset when you open a new graphics device, so all calls to plot() will need to be done before calling dev.off().
Note however with this method, it will not look to see which files already exist. It will always reset the counting at 1. Also, there is no way to change the first index to anything other than 1.
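The numbering behaviour can be checked without a screen device. The sketch below uses pdf() with onefile = FALSE instead of png(), on the assumption that a PDF per plot is acceptable; pdf() accepts the same kind of integer format in the file name. The output folder is a made-up temporary location.

```r
out_dir <- file.path(tempdir(), "numbered_plots")   # hypothetical folder
dir.create(out_dir, showWarnings = FALSE)

# onefile = FALSE makes pdf() start a new file for every page
pdf(file.path(out_dir, "plot-%02d.pdf"), onefile = FALSE)
plot(1)
plot(2)
plot(3)
dev.off()

# List the files the device produced
created <- list.files(out_dir, pattern = "^plot-[0-9]+\\.pdf$")
```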

Applying an R script to multiple files

I have an R script that reads a certain type of file (nexus files of phylogenetic trees), whose name ends in *.trees.txt. It then applies a number of functions from an R package called bGMYC, available here and creates 3 pdf files. I would like to know what I should do to make the script loop through the files for each of 14 species.
The input files are in a separate folder for each species, but I can put them all in one folder if that facilitates the task. Ideally, I would like to output the pdf files to a folder for each species, different from the one containing the input file.
Here's the script
# Call Tree file
trees <- read.nexus("L_boscai_1411_test2.trees.txt")
# To use with different species, substitute "L_boscai_1411_test2.trees.txt" by the path to each species tree
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
pdf('results_single_boscai.pdf')
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
pdf('results_boscai.pdf') # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
pdf('trees_boscai.pdf') # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
I've read several posts with a similar question, but they almost always involve .csv files, refer to multiple files in a single folder, have a simpler script or do not need to output files to separate folders, so I couldn't find a solution to my specific problem.
Should I use a for loop or could I create a function out of this script and use lapply or another sort of apply? Could you provide me with sample code for your proposed solution or point me to a tutorial or another reference?
Thanks for your help.
It really depends on the way you want to run it.
If you are using linux / command line job submission, it might be best to look at
How can I read command line parameters from an R script?
If you are using a GUI (RStudio...) you might not be familiar with this, so I would solve the problem as a function or a loop.
First, get all your file names.
files = list.files(path = "your/folder")
# Now you have list of your file name as files. Just call each name one at a time
# and use for loop or apply (anything of your choice)
And since you would need to name the pdf files, you can use your file name or an index (e.g. the loop counter) and append it to the desired file name (e.g. paste("single_boscai", i)).
In your case,
files = list.files(path = "your/folder", full.names = TRUE) # full paths, so read.nexus() can find each file
# Use pattern = "" if you want to do string matching, and extract
# only matching files from the source folder.
genPDF = function(input) {
# Read the file
trees <- read.nexus(input)
# Store the index (numeric)
index = which(files == input)
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
outname = paste('results_single_boscai', index, '.pdf', sep = "")
pdf(outname)
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
outname = paste('results_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
outname = paste('trees_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
}
for (i in 1:length(files)) {
genPDF(files[i])
}
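The question also asked for the PDFs to be named after each species and written to a separate folder per species. One way to sketch that is a small helper that derives the output paths from the input file name. `make_output_paths()`, the folder layout and the example path are hypothetical; the assumption is that the part of the file name before ".trees.txt" identifies the species.

```r
# Hypothetical helper: build the three PDF paths for one input tree file,
# placing them in an output folder named after the species.
make_output_paths <- function(input, out_root) {
  # "L_boscai_1411_test2.trees.txt" -> "L_boscai_1411_test2"
  species <- sub("\\.trees\\.txt$", "", basename(input))
  out_dir <- file.path(out_root, species)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  list(
    single = file.path(out_dir, paste0("results_single_", species, ".pdf")),
    multi  = file.path(out_dir, paste0("results_", species, ".pdf")),
    trees  = file.path(out_dir, paste0("trees_", species, ".pdf"))
  )
}

paths <- make_output_paths("input/L_boscai/L_boscai_1411_test2.trees.txt",
                           out_root = file.path(tempdir(), "bgmyc_output"))
basename(paths$single)
```

Inside a function like genPDF(), pdf(paths$single) and so on would then replace the index-based names. If the inputs stay in one folder per species, list.files() with recursive = TRUE and full.names = TRUE can collect them all in one call.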

R loop for anova on multiple files

I would like to execute anova on multiple datasets stored in my working directory. I have come up so far with:
files <- list.files(pattern = ".csv")
for (i in seq_along(files)) {
mydataset.i <- files[i]
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.i)
summary(AnovaModel.1)
}
As you can see I am very new to loops and cannot make this work. I also understand that I need to add a code to append all summary outputs in one file. I would appreciate any help you can provide to guide to the working loop that can execute anovas on multiple .csv files in the directory (same headers) and produce outputs for the record.
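You might want to use list.files with full.names = TRUE in case you are not on the same path. Note that pattern is a regular expression, not a wildcard, so "\\.csv$" is used to match names ending in .csv.
files <- list.files("path_to_my_dir", pattern="\\.csv$", full.names = TRUE)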
# use lapply to loop over all files
out <- lapply(1:length(files), function(idx) {
# read the file
this.data <- read.csv(files[idx], header = TRUE) # choose TRUE/FALSE accordingly
aov.mod <- aov(DES ~ DOSE, data = this.data)
# if you want just the summary as object of summary.aov class
summary(aov.mod)
# if you require it as a matrix, comment the previous line and uncomment the one below
# as.matrix(summary(aov.mod)[[1]])
})
head(out)
This should give you a list with each entry of the list having a summary matrix in the same order as the input file list.
Your error is that your loop is not loading your data. Your list of file names is in "files"; you then move through that list and set mydataset.i equal to the name of the file that matches your iterator i... but then you try to run aov on the file name that is stored in mydataset.i!
The command you are looking for to redirect your output to a file is sink. Consider the following:
sink("FileOfResults.txt") #starting the redirect to the file
files <- list.files("path_to_my_dir", pattern="\\.csv$", full.names = TRUE) #using the fuller code from Arun
for (i in seq_along(files)){
mydataset.i <- files[i]
mydataset.d <- read.csv(mydataset.i) #this line is new
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.d) #this line is modified
print(summary(AnovaModel.1))
}
sink() #ending the redirect to the file
I prefer this approach to Arun's because the results are stored directly to the file without jumping through a list and then having to figure out how to store the list to a file in a readable fashion.
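A third variant, between the two approaches, labels each summary with the file it came from and appends it to a report without redirecting all console output. This is only a sketch: the two CSV files are generated with random made-up data so the example is self-contained, and the report name is a placeholder.

```r
# Make two small CSV files with DES and DOSE columns (made-up data).
csv_dir <- file.path(tempdir(), "anova_demo")
dir.create(csv_dir, showWarnings = FALSE)
set.seed(1)
for (nm in c("study1.csv", "study2.csv")) {
  d <- data.frame(DOSE = rep(c("low", "mid", "high"), each = 5),
                  DES  = rnorm(15))
  write.csv(d, file.path(csv_dir, nm), row.names = FALSE)
}

files  <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
report <- file.path(tempdir(), "anova_report.txt")
if (file.exists(report)) file.remove(report)   # start from an empty report

for (f in files) {
  dat <- read.csv(f)
  # capture.output() returns the printed summary as character lines
  txt <- capture.output(summary(aov(DES ~ DOSE, data = dat)))
  # Header line with the file name, then the summary, then a blank line
  cat(paste0("== ", basename(f), " =="), txt, "",
      sep = "\n", file = report, append = TRUE)
}

report_lines <- readLines(report)
```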

For loop question in R

Hope I can explain my question well enough to obtain an answer - any help will be appreciated.
I have a number of data files which I need to merge into one. I use a for loop to do this and add a column which indicates which file it is.
In this case there are 6 files with up to 100 data entries in each.
When there are 6 files I have no problem in getting this to run.
But when there are fewer I have a problem.
What I would like to do is use the for loop to test for the files and use the for loop variable to assemble a vector which references the files that exist.
I can't seem to get the new variable to combine the new value of the for loop variable as it goes through the loop.
Here is the sample code I have written so far.
for ( rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
files_found <- c(rloop1)
}
What I am looking for is that files_found will contain those files where 1...6 are valid for the files found.
Regards
Steve
It would probably be better to list the files you want to load, and then loop over that list to load them. list.files is your friend here. We can use a regular expression to list only those files that end in "_Stats.csv". For example, in my current working directory I have the following files:
$ ls | grep Stats
bar_Stats.csv
foobar_Stats.csv
foobar_Stats.csv.txt
foo_Stats.csv
Only three of them are csv files I want to load (the .txt file doesn't match the pattern you showed). We can get these file names using list.files():
> list.files(pattern = "_Stats.csv$")
[1] "bar_Stats.csv" "foo_Stats.csv" "foobar_Stats.csv"
You can then loop over that and read the files in. Something like:
fnames <- list.files(pattern = "_Stats.csv$")
for(i in seq_along(fnames)) {
assign(paste("file_", i, sep = ""), read.csv(fnames[i]))
}
That will create a series of objects file_1, file_2, file_3 etc in the global workspace. If you want the files in a list, you could instead lapply over the fnames:
lapply(fnames, read.csv)
and if suitable, do.call might help combine the files from the list:
do.call(rbind, lapply(fnames, read.csv))
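Since the question also wants a column showing which file each row came from, the list approach can be extended slightly. This is a sketch with two made-up files written to a temporary folder; `read_tagged()` and the `source_file` column name are hypothetical.

```r
# Write two tiny example files (made-up names ending in _Stats.csv).
stats_dir <- file.path(tempdir(), "stats_demo")
dir.create(stats_dir, showWarnings = FALSE)
write.csv(data.frame(value = 1:2), file.path(stats_dir, "1foo_Stats.csv"),
          row.names = FALSE)
write.csv(data.frame(value = 3:5), file.path(stats_dir, "2foo_Stats.csv"),
          row.names = FALSE)

fnames <- list.files(stats_dir, pattern = "_Stats\\.csv$", full.names = TRUE)

# Read one file and tag its rows with the file it came from
read_tagged <- function(f) {
  d <- read.csv(f)
  d$source_file <- basename(f)
  d
}

# Stack all files into one data frame
merged <- do.call(rbind, lapply(fnames, read_tagged))
```

Because only existing files are listed, the code behaves the same whether two or six files are present.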
There's a much shorter way to do this using list.files() as Henrik showed. In case you're not familiar with regular expressions (see ?regex), you could do.
n <- 6
Fnames <- paste(1:n,SampleName,"_",FileName,"_Stats.csv",sep="")
Filelist <- Fnames[file.exists(Fnames)]
which is perfectly equivalent. Both paste and file.exists are vectorized functions, so you better make use of that. There's no need for a for-loop whatsoever.
To get the number of the filenames (assuming that's the only digits), you can do:
gsub("[^[:digit:]]","", Filelist)
See also ?regex
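The vectorized version can be tried end to end in a temporary folder. In this sketch `SampleName` and `FileName` are given made-up values, and only files 1, 2 and 5 are created, to mimic the case where fewer than six files exist.

```r
# Hypothetical values standing in for the question's variables.
SampleName <- "foo"
FileName   <- "run"

demo_dir <- file.path(tempdir(), "exists_demo")
dir.create(demo_dir, showWarnings = FALSE)
# Only files 1, 2 and 5 exist in this example
file.create(file.path(demo_dir,
                      paste0(c(1, 2, 5), SampleName, "_", FileName, "_Stats.csv")))

# All six candidate names, then keep the ones that exist
candidates <- file.path(demo_dir,
                        paste0(1:6, SampleName, "_", FileName, "_Stats.csv"))
found <- candidates[file.exists(candidates)]

# Leading digits of each file name, as numbers
files_found <- as.integer(sub("^([[:digit:]]+).*$", "\\1", basename(found)))
```

When the candidates are built from 1:6 as here, which(file.exists(candidates)) gives the same indices without any pattern matching.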
I think there are better solutions (e.g., you could use list.files() to scan the folder and then loop over the length of the returned object), but this should (I didn't try it) do the trick (using your sample code):
files.found <- c()
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files.found <- c(files.found, rloop1)
}
Alternatively, you could get the fileNames (other than their index) via:
files.found <- c()
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files.found <- c(files.found, ReadFile)
}
Finally, in your case list.files could look something like this:
files.found <- list.files(pattern = paste0("^[[:digit:]]", SampleName, "_", FileName, "_Stats\\.csv$"))
