Using an R script on multiple folders - r

I would like to use an R script that I wrote on multiple folders, each of which contains a csv file and a text file.
The function I wrote takes the csv and the text file and calculates a vector.
So basically the code I need would open every folder, take the csv file and the text file, and calculate the fitting vector for each.
I thought about using list.files to get a list of the names of all folders and then lapply to apply the function to every folder, but I don't know how to point read.csv and read.table at the right files.
setwd("C:\\WD")
ptf <- "C:\\PathtoFiles"
temp <- list.files(path = ptf)

exfunction <- function() {
  csvfile <- read.csv("nameofile.csv")
  textfile <- read.table("nameoffile.txt", header = TRUE)
  calcvec <- vector(mode = "numeric", length = length(textfile))
  # code that calculates the vector
  return(calcvec)
}

lapply(temp, exfunction)
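One way to fill in the missing piece is to give exfunction the folder as an argument and let list.files() find the two files inside it. A minimal sketch, assuming each folder contains exactly one .csv and one .txt file (and, here, that the result has one entry per row of the text file):

ptf <- "C:\\PathtoFiles"

exfunction <- function(folder) {
  # locate the single csv and txt file inside this folder
  csvfile <- read.csv(list.files(folder, pattern = "\\.csv$", full.names = TRUE)[1])
  textfile <- read.table(list.files(folder, pattern = "\\.txt$", full.names = TRUE)[1],
                         header = TRUE)
  calcvec <- numeric(nrow(textfile))
  # code that calculates the vector goes here
  calcvec
}

# full.names = TRUE so each folder comes back as a usable path
folders <- list.files(path = ptf, full.names = TRUE)
results <- lapply(folders, exfunction)  # one vector per folder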

Related

Loop through files and use functions, then use that result to form a dataframe in r

I have a directory of sorted bam files to which I want to apply the pileup function. The output of pileup is a data frame. I would then like to take the result for each file and combine them into one data frame.
For each file, I use the following code:
r16<-pileup(filename, index=filename, scanBamParam = ScanBamParam(), pileupParam = PileupParam())
r16$sample_id <- "sample id"
For the sample_id column, I would like it to be the name of the file; for example, if the file name is file1.sorted.bam, I would like sample_id to be file1.
After all files are processed, I would use rbind to get one big data frame and save it to an RData file.
So far I have tried a loop, but it is not giving me any output.
library(pasillaBamSubset)
library(Rsamtools)
filenames<-Sys.glob("*.sorted.bam")
for (file in filenames) {
  output <- pileup(pileup(filenames, index = filenames,
                          scanBamParam = ScanBamParam(),
                          pileupParam = PileupParam()))
  save(output, file = "res.RData")
}
I am assuming that you want to stack all the data.frames on top of each other (row bind). map (from purrr) or lapply can apply a function to each item in a given list/vector (each filename in this case). map_dfr does the same and row-binds all the outputs.
filenames <- list.files(pattern = "*.sorted.bam")
library(purrr)
purrr::map_dfr(filenames, ~ pileup(.x,
                                   index = .x,
                                   scanBamParam = ScanBamParam(),
                                   pileupParam = PileupParam()))
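To also get the sample_id column the question asks for, map_dfr() can take a named vector and write the names into an extra column via its .id argument. A sketch along those lines, with the .sorted.bam suffix stripped by sub():

library(Rsamtools)
library(purrr)

filenames <- list.files(pattern = "\\.sorted\\.bam$")
# name each path by its bare sample name: file1.sorted.bam -> file1
names(filenames) <- sub("\\.sorted\\.bam$", "", filenames)

output <- map_dfr(filenames,
                  ~ pileup(.x, index = .x,
                           scanBamParam = ScanBamParam(),
                           pileupParam = PileupParam()),
                  .id = "sample_id")
save(output, file = "res.RData")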

Load multiple .csv files in R, edit them and save as new .csv files named by a list of character strings

I am pretty new to R and programming, so I do apologise if this question has been asked elsewhere.
I'm trying to load multiple .csv files, edit them, and save them again, but I cannot work out how to handle more than one .csv file or how to name the new files based on a list of character strings.
So I have a .csv file and can do:
species_name <- 'ace_neg'
species <- read.csv('species_data/ace_neg.csv')
species_1_2 <- species[, 1:2]
species_1_2$species <- species_name
species_3_2_1 <- species_1_2[, c(3, 1, 2)]
write.csv(species_3_2_1, file = 'ace_neg.csv', row.names = FALSE)
But I would like to run this code for all .csv files in the folder and add text to a new column based on each .csv file's name.
So I can load all the .csv files and make a list of character strings to use as the new column text and as new file names:
NDOP_files <- list.files(path="species_data", pattern="*.csv$", full.names=TRUE, recursive=FALSE)
short_names<- substr(NDOP_files, 14,20)
Then I tried:
lapply(NDOP_files, function(x){
  species <- read.csv(x)
  species_1_2 <- species[, 1:2]
  # don't know how to insert the first character string of short_names instead
  # of 'name', then the second string of short_names for the second csv file, etc.
  species_1_2$species <- 'name'
  # then continue in the code to change the order of the columns
  species_3_2_1 <- species_1_2[, c(3, 1, 2)]
})
And then write all the new, modified csv files, naming them again by the list of short_names.
I'm sorry if the text is somewhat confusing.
Any help or suggestions would be great.
You are actually quite close, and using lapply() is a really good idea.
As you state, the issue is that it only takes one list as an argument, but you want to work with two. mapply() is a function in base R that you can feed multiple lists into and cycle through synchronously. However, lapply() and mapply() are both designed to create and manipulate objects in R, whereas you want to write files and are not interested in the output within R. The purrr package has the walk*() functions, which are useful when you want to cycle through lists and are only interested in creating side effects (in your case, saving files).
purrr::walk2() takes two lists, so you can provide the data and the file names at the same time.
library(purrr)
First I create some example data (I’m basically already using the same concept here as I will below):
test_data <- map(1:5, ~ data.frame(
  a = sample(1:5, 3),
  b = sample(1:5, 3),
  c = sample(1:5, 3)
))
walk2(test_data,
      paste0("species_data/", 1:5, "test.csv"),
      ~ write.csv(.x, .y))
Instead of getting the file paths and then stripping away the path
to get the file names, I just call list.files(), once with full.names = TRUE and once with full.names = FALSE.
NDOP_filepaths <- list.files(
  path = "species_data",
  pattern = "*.csv$",
  full.names = TRUE,
  recursive = FALSE
)
NDOP_filenames <- list.files(
  path = "species_data",
  pattern = "*.csv$",
  full.names = FALSE,
  recursive = FALSE
)
Now I feed the two lists into purrr::walk2(). Using the ~ before the curly brackets, I can define the anonymous function a bit more elegantly and then use .x and .y to refer to the entries of the first and the second list.
walk2(NDOP_filepaths,
      NDOP_filenames,
      ~ {
        species <- read.csv(.x)
        species <- species[, 1:2]
        species$species <- gsub(".csv", "", .y)
        write.csv(species, .x)
      })
Learn more about purrr at purrr.tidyverse.org.
Alternatively, you could just extract the file name in the loop and stick to lapply() or use purrr::map()/purrr::walk(), like this:
lapply(NDOP_filepaths,
       function(x) {
         species <- read.csv(x)
         species <- species[, 1:2]
         species$species <- gsub("species_data/|\\.csv", "", x)
         write.csv(species, gsub("species_data/", "", x))
       })
NDOP_files <- list.files(path = "species_data", pattern = "*.csv$",
                         full.names = TRUE, recursive = FALSE)
# Get the name of each file (without the extension)
# basename() removes all of the path up to and including the last path separator
# file_path_sans_ext() removes the .csv extension
csvFileNames <- tools::file_path_sans_ext(basename(NDOP_files))
Then, I would write a function that takes in 1 csv file and does some manipulation to the file and outputs out a data frame. Since you have a list of csv files from using list.files, you can use the map function in the purrr package to apply your function to each csv file.
doSomething <- function(NDOP_file){
  # your code here to manipulate NDOP_file to your liking
  return(NDOP_file)
}
NDOP_files <- map(NDOP_files, ~ doSomething(.x))
Lastly, you can manipulate the file names when you write the new csv files using csvFileNames and a custom function you write to change the file name to your liking. Essentially, use the same architecture of defining your custom function and using map to apply to each of your files.
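Putting those pieces together, a sketch of the whole read-manipulate-write cycle (doSomething here stands in for whatever manipulation you need):

library(purrr)

NDOP_files <- list.files(path = "species_data", pattern = "*.csv$", full.names = TRUE)
csvFileNames <- tools::file_path_sans_ext(basename(NDOP_files))

doSomething <- function(df, name) {
  df <- df[, 1:2]
  df$species <- name    # the file name becomes the new column's value
  df[, c(3, 1, 2)]      # reorder the columns
}

# walk2() pairs each path with its bare name; each result is written
# to the working directory under that name
walk2(NDOP_files, csvFileNames, ~ {
  species <- doSomething(read.csv(.x), .y)
  write.csv(species, paste0(.y, ".csv"), row.names = FALSE)
})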

R: Exporting potentially infinite function outputs to a csv file

I have a script that takes the raw csv files in a folder, transforms the data with a function called analyze(filename), and prints the resulting values to the console. When I attempt to write.csv these values, it only writes the last value the function produced. If there were a fixed number of files per folder, I would just run each specific csv file through the program, say [1:5], and lapply/set a matrix into write.csv. However, the directory can hold an arbitrarily large number of files, so this will not work (I think?). How would I export a potentially unbounded number of function outputs to a csv file? I have listed my final steps after the function definition below: the code lists all the files in the folder and applies the function analyze to each of them.
filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)
for (f in filename) {
  print(f)
  analyze(f)
}
Best,
Evan
It's hard to tell without a reproducible example, but I think you have to assign the output of analyze to a vector or a data frame (instead of printing it to the console).
Something along these lines:
filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)
results <- vector()  # empty vector
for (f in filename) {
  print(f)
  results[which(filename == f)] <- analyze(f)  # assign output to the vector
}
write.csv(results, file = xxx)  # write the csv file when the loop is finished
I hope this answers your question, but it really depends on the format of the output of the analyze function.
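If analyze() returns a data frame rather than a single value, the same idea works with a list plus do.call(rbind, ...). A sketch, assuming the columns match across files (the output file name here is just a placeholder):

filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)
results <- lapply(filename, analyze)  # one data frame per file
results <- do.call(rbind, results)    # stack them
write.csv(results, file = "results.csv", row.names = FALSE)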

using loop variable in text string for loops in R

If I have multiple csv files stored as m1.csv, m2.csv, ..., m50.csv, what I would like to do is load each csv into R, run the analysis on the data in the i-th file, and store the result as a variable m'i'. I am trying to use a for loop, but I'm not sure I can use one in such a way. For example:
for (i in 1:100){
  A <- as.matrix(read.csv("c:/Users/Desktop/m"i".csv"))
  ...
  # some analysis on A
  ...
  m"i" <- # result of analysis on A
}
V <- cbind(m1, m2, ..., m100)
Try this
filenames = list.files(getwd())
filenames = filenames[grepl(".csv", filenames)]
files = lapply(filenames, read.csv)
files = do.call(rbind, files)
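If you do want one result per file combined with cbind() as in the question, it is usually cleaner to keep the results in a list than to create variables m1, m2, ... with assign(). A sketch, with the analysis step left as a placeholder:

filenames <- list.files("c:/Users/Desktop", pattern = "^m\\d+\\.csv$", full.names = TRUE)
results <- lapply(filenames, function(f) {
  A <- as.matrix(read.csv(f))
  # some analysis on A goes here; return its result
  A
})
V <- do.call(cbind, results)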

Assigning Directory as a Variable in R

I need to create a function called PollutantMean with the following arguments: directory, pollutant, and id = 1:332.
I have most of the code written, but I can't figure out how to assign my directory as a variable. My current working directory is C:/Users/User/Documents. I tried writing the variable as:
directory <- "C:/Users/User/specdata"
and that didn't work. Next I tried the following:
directory <- list.files("specdata", full.names = TRUE)
and that didn't work either.
Any ideas on how to change this?
If you are trying to assign the values in your current working directory to the variable "directory", why not take the simple route and add:
directory <- getwd()
This takes the contents of the working directory and assigns them to the variable "directory".
I've already worked with directories as variables; I usually declare them like this:
directory <- "C://Users//User//specdata//"
to take up your example. Then, if I want to read a specific file in this directory, I just go:
read.table(paste(directory, "myfile.txt", sep = ""), ...)
It's the same process to write to a file:
write.table(res, file = paste(directory, "myfile.txt", sep = ""), ...)
Does this help?
EDIT: you can then use read.csv and it will work fine.
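As a side note, base R's file.path() builds paths from components, which avoids hand-managing separators with paste() (file names here are placeholders):

directory <- "C:/Users/User/specdata"
dat <- read.csv(file.path(directory, "myfile.csv"))
write.csv(dat, file.path(directory, "myfile_out.csv"), row.names = FALSE)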
I think you are confused by the assignment operation in R. The following line
directory <- "C:/Users/User/specdata"
assigns a string to a new object that just happened to be called directory. It has the same effect on your working environment as
elephant <- "C:/Users/User/specdata"
To change where R reads its files, use the function setwd (short for set working directory):
setwd("C:/Users/User/specdata")
You can also specify full path names to functions that read in data (like read.table). For your specific problem,
# creates a list of all files ending with `csv` (i.e. all csv files)
all.specdata.files <- list.files(path = "C:/Users/User/specdata", pattern = "csv$")
# creates a list resulting from the application of `read.csv` to
# each of these files (which may be slow!!)
all.specdata.list <- lapply(all.specdata.files, read.csv)
Then we use dplyr::bind_rows (the replacement for the now-removed rbind_all) to row-bind them into one data frame.
library(dplyr)
all.specdata <- bind_rows(all.specdata.list)
Then use colMeans to determine the grand means. Not sure how to do this without seeing the data.
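Without seeing the data, one can only sketch that last step; assuming the pollutant readings are the numeric columns and may contain NAs:

numeric_cols <- sapply(all.specdata, is.numeric)
colMeans(all.specdata[, numeric_cols], na.rm = TRUE)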
Assuming that the columns in each of the 300+ csv files are the same, that is, column j contains the same type of data in all files, the following example should be of use:
# let's use a temp directory for storing the files
tmpdr <- tempdir()

# let's create a large data set and then split it into many different files
original_data <- data.frame(matrix(rnorm(10000L), nrow = 1000L))

# write each row to its own file
for (i in seq(1, nrow(original_data), by = 1)) {
  write.csv(original_data[i, ],
            file = paste0(tmpdr, "/", formatC(i, format = "d", width = 4, flag = "0"), ".csv"),
            row.names = FALSE)
}

# get a character vector with the full path of each of the files
files <- list.files(path = tmpdr, pattern = "\\.csv$", full.names = TRUE)

# read each file into a list
read_data <- lapply(files, read.csv)

# bind read_data into one data.frame
read_data <- do.call(rbind, read_data)

# check that our two data.frames are the same
all.equal(read_data, original_data)
# [1] TRUE
