Looping through files in R

I am using R to calculate the mean values of a column in a file like so:
R
file1 = read.table("x01")
mean(file1$V4)
However, I have no experience building loops in R, only in bash.
How would I convert this into a loop that does this for every file in a folder and saves the output into one file, with the file name and mean value as the two columns of each row?
eg:
x01(or file1 if that is simpler) 23.4
x02 25.4
x03 10.4
etc
(Don't mind if the solution is bash and R or exclusively R)
Many thanks for your help!
Current error from one of the solutions using bash and R:
Error in `[.data.frame`(read.table("PercentWindowConservedRanked_Lowest_cleanfor1000genomes_1000regions_x013", :
undefined columns selected
Calls: mean -> [ -> [.data.frame
Execution halted

This is similar to what @jmsigner has done, but with minor changes. For instance, writing to a file is done at the end. The code has not been tested.
files <- list.files()
out <- lapply(files, FUN = function(x) {
  # no header row, so read.table names the fourth column V4
  m <- mean(read.table(x, header = FALSE)$V4)
  return(m)
})
result <- do.call("rbind", out) # merge the list row-wise, one row per file
# label each row with its file name so the output has two columns
rownames(result) <- files
write.table(result, file = "means.txt", col.names = FALSE, quote = FALSE)
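For reference, here is a minimal self-contained sketch of the same idea, run against throwaway files so it can be tried anywhere. The directory `demo_dir`, the pattern `^x[0-9]+$` and the sample values are assumptions made for illustration. The only thing taken from the question is that the files have no header row, so the fourth column is called V4.

```r
# Hypothetical demo data: three small header-less files in a temp directory
demo_dir <- file.path(tempdir(), "means_demo")
dir.create(demo_dir, showWarnings = FALSE)
vals <- list(x01 = c(10, 20, 30), x02 = c(1, 2, 3), x03 = c(5, 5, 5))
for (f in names(vals)) {
  demo <- data.frame(1:3, 4:6, 7:9, vals[[f]])
  write.table(demo, file.path(demo_dir, f), row.names = FALSE, col.names = FALSE)
}

# One mean per file; vapply() guarantees a numeric vector of the right length
files <- list.files(demo_dir, pattern = "^x[0-9]+$", full.names = TRUE)
means <- vapply(files, function(f) mean(read.table(f)$V4), numeric(1))

# Two columns: file name and mean, written once at the end
result <- data.frame(file = basename(files), mean = unname(means))
out_file <- file.path(tempdir(), "means.txt")
write.table(result, out_file, row.names = FALSE, col.names = FALSE, quote = FALSE)
```

Writing the output outside the input directory means a second run does not pick up means.txt as if it were a data file.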

Assuming the columns are always named the same, you could do the following in R:
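out.file <- 'means.txt'
for (i in list.files()) {
  # header = FALSE gives the default column names V1, V2, ...; use header = TRUE
  # (and the real column name) if your files have a header row
  tmp.file <- read.table(i, header = FALSE)
  tmp.mean <- mean(tmp.file$V4)
  write(paste0(i, ",", tmp.mean), out.file, append = TRUE)
}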
Or the same thing with more bash:
for i in *
do
  # cat() prints the bare number, without R's "[1]" prefix
  mean=$(Rscript -e "cat(mean(read.table('$i')[, 'V4']))")
  echo "$i,$mean" >> means.txt
done
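The "undefined columns selected" error quoted in the question is what `[, 'V4']` gives when the data frame has no column called V4. One way this can happen (an assumption about the asker's files, which are not shown) is reading a header-less file with header = TRUE: the first row is consumed as column names, so the default names V1, V2, ... never appear. A small sketch with made-up data:

```r
# Made-up header-less file with four columns
tmp <- tempfile()
writeLines(c("1 2 3 4", "5 6 7 8", "9 10 11 12"), tmp)

with_header <- read.table(tmp, header = TRUE)   # first row becomes the names
no_header   <- read.table(tmp, header = FALSE)  # names are V1, V2, V3, V4

names(with_header)  # derived from the first data row, so there is no V4
names(no_header)

# Selecting a missing column by name with [ , ] raises an error;
# capture its message instead of stopping
err <- tryCatch(with_header[, "V4"], error = function(e) conditionMessage(e))

mean(no_header[, "V4"])
```

Checking `names()` on one file before looping is a quick way to confirm which column name to use.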

My solution is also similar to @jmsinger's, but you can specify the path to your files in the code itself and then calculate the mean like this:
filename <- list.files("/dir")
for (i in seq_along(filename)) {
  ## add header = TRUE if you have headers in your files
  dat <- read.table(file.path("/dir", filename[i]))
  m <- mean(dat$V4)
  ## this writes the mean of each file to a separate file rather than to one combined file
  write.table(m, file = file.path("/dir", paste("mean", filename[i], sep = ".")))
}
Hope this helps.

Related

How to work with nested for loops in R with same list?

I am trying to do some R coding for my project, where I have to read some .csv files from one directory into R and assign each one to a data frame named like df_subject1_activity1. I have tried nested loops, but it is not working.
For example:
My directory is named "Test" and I have six .csv files:
subject1activity1.csv,
subject1activity2.csv,
subject1activity3.csv,
subject2activity1.csv,
subject2activity2.csv,
subject2activity3.csv
Now I want to write code to load these .csv files into R and assign data frame names as follows,
ex:
subject1activity1 = df_subject1_activity1
subject1activity2 = df_subject1_activity2
.... so on using for loop.
my expected output is:
df_subject1_activity1
df_subject1_activity2
df_subject1_activity3
df_subject2_activity1
df_subject2_activity2
df_subject2_activity3
I have tried the following code:
setwd(dirname(getActiveDocumentContext()$path))
new_path <- getwd()
new_path
data_files <- list.files(pattern=".csv") # Identify file names
data_files
for(i in 1:length(data_files)) {
for(j in 1:4){
assign(paste0("df_subj",i,"_activity",j),
read.csv2(paste0(new_path,"/",data_files[i]),sep=",",header=FALSE))
}
}
I am not getting the desired output.
I am new to R; can anyone please help?
Thanks

One solution is to use the vroom package (https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/), e.g.
library(tidyverse)
library(vroom)
library(fs)
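files <- fs::dir_ls(glob = "subject*.csv")  # the file names have no underscore after "subject"
data <- purrr::map(files, ~vroom::vroom(.x))
# name each element df_subjectN_activityM so list2env() creates those variables
names(data) <- paste0("df_", sub("activity", "_activity", tools::file_path_sans_ext(basename(files))))
list2env(data, envir = .GlobalEnv)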
# You can also combine all the dataframes if they have the same columns, e.g.
library(data.table)
concat <- data.table::rbindlist(data, fill = TRUE)

You are almost there. As always, if you are unsure, it is never a bad idea to code clearly using more lines.
data_files <- list.files(pattern = "\\.csv$", full.names = TRUE) # Identify file names
for (data_file in data_files) {
  ## check that the data file matches our expected pattern:
  if (!grepl("subject[0-9]activity[0-9]", basename(data_file))) {
    warning("skipping file ", basename(data_file))
    next
  }
  ## start creating the variable name from the filename
  ## remove the .csv extension
  var.name <- sub("\\.csv$", "", basename(data_file), ignore.case = TRUE)
  ## prepend 'df' and introduce underscores:
  ## gsub looks for literal 'subject' and 'activity' and, if found, adds an underscore in front of it
  var.name <- paste0("df", gsub("(subject|activity)", "_\\1", var.name))
  ## now read the file (comma-separated with no header row, as in the question)
  data.from.file <- read.csv(data_file, header = FALSE)
  ## and assign it to our variable name
  assign(var.name, data.from.file)
}
I don't have your files to test with, but should the above fail, you should be able to run the code line by line and easily see where it starts to go wrong.
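An alternative to assign() that is often easier to work with is a named list of data frames. The following is only a sketch: it creates two made-up files in a temporary directory (`demo_dir` is a hypothetical name), follows the subject/activity naming from the question, and builds the df_subjectN_activityM names from the file names.

```r
# Hypothetical demo files following the naming scheme in the question
demo_dir <- file.path(tempdir(), "subject_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (f in c("subject1activity1.csv", "subject2activity3.csv")) {
  write.table(data.frame(1:2, 3:4), file.path(demo_dir, f),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

# Only pick up files that match the expected pattern
file_pattern <- "^subject([0-9]+)activity([0-9]+)\\.csv$"
data_files <- list.files(demo_dir, pattern = file_pattern, full.names = TRUE)

# Turn subject1activity1.csv into df_subject1_activity1
var_names <- sub(file_pattern, "df_subject\\1_activity\\2", basename(data_files))

# One data frame per file, stored in a named list
dfs <- lapply(data_files, read.csv, header = FALSE)
names(dfs) <- var_names

names(dfs)
# Optional: create separate variables from the list
# list2env(dfs, envir = .GlobalEnv)
```

With the list, a single data frame is available as `dfs$df_subject1_activity1`, and all of them can be processed together with lapply().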

Config file in a csv (or txt) format

I want to create a config file. In an R file it would look like the following:
#file:config.R
min_birthday_year <- 1920
max_birthday <- Sys.Date() %m+% months(9)
min_startdate_year <- 2010
max_startdate_year <- 2022
And in the main script I would do: source("config.R") .
However, now I want to source the config data from a .csv file. Does anyone have an idea how to do this? The file could also be in .txt format.

First thing I would suggest is looking into the config package.
It allows you to specify variables in a yaml text file. I haven't used it but it seems pretty neat and looks like it may be a good solution.
If you don't want to use that, then if your csv is something like this, with var names in one column and values in the next:
min_birthday_year,1920
max_birthday,Sys.Date() %m+% months(9)
min_startdate_year,2010
max_startdate_year,2022
then you could do something like this:
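# Read in the file
# assuming that names are in one column and values in another
# will create vars using vals from second col with names from first
library(lubridate)  # provides %m+%, which the max_birthday expression needs
config <- read.table("config.csv", sep = ",", stringsAsFactors = FALSE)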
# mapply with assign, with var names in one vector and values in the other
# eval(parse()) call to evaluate value as an expression - needed to evaluate the Sys.Date() thing.
# tryCatch in case you add a string value to the csv at some point, which will throw an error in the `eval` call
mapply(
function(x, y) {
z <- tryCatch(
eval(parse(text = y)),
error = function(e) y
)
assign(x, z, inherits = TRUE)
},
config[[1]],
config[[2]]
)
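If the config values are plain numbers or strings, eval(parse()) can be avoided altogether by reading the file into a named list. This is a sketch under that assumption: it does not handle expressions such as the Sys.Date() line, and the file contents (including the `output_dir` entry) are made up for illustration.

```r
# Made-up config file: name,value pairs without a header row
cfg_file <- tempfile(fileext = ".csv")
writeLines(c("min_birthday_year,1920",
             "min_startdate_year,2010",
             "max_startdate_year,2022",
             "output_dir,results"), cfg_file)

raw <- read.csv(cfg_file, header = FALSE, col.names = c("name", "value"),
                stringsAsFactors = FALSE)

# Convert each value to the simplest fitting type (integer, numeric, character)
config <- lapply(raw$value, type.convert, as.is = TRUE)
names(config) <- raw$name

config$min_birthday_year
```

Keeping the settings in one list (`config$max_startdate_year` and so on) also avoids filling the global environment with variables.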

R - cut a specific column from multiple files and bind them altogether

I have multiple files (30, tab delimited) that look like the one below:
|target_id | length| eff_length| est_counts| tpm|
|:------------|------:|----------:|----------:|--------:|
|LmjF.27.1250 | 966| 823.427| 2932| 94.7314|
|LmjF.09.0430 | 1410| 1267.430| 3603| 75.6304|
|LmjF.13.0210 | 2001| 1858.430| 4435| 63.4897|
|LmjF.28.0530 | 4083| 3940.430| 7032| 47.4778|
|LmjF.16.1400 | 591| 448.577| 1163| 68.9761|
|LmjF.29.2570 | 1506| 1363.430| 11135| 217.2770|
I am trying to cut the fifth column from all 30 of these files with a command such as:
fifth_colum_file1 = file1.csv[ , 5]
But I want to automate the process.
The files that I want to work with all match the pattern "bs_abundance", so I think a good starting point would be to list all of them with a command such as:
temp = list.files(pattern="*bs_abundance")
Or perhaps I can load all the tables I want to work with directly into the workspace:
for(i in temp) {
x <- read.table(i, header=TRUE, comment.char = "A", sep="\t")
assign(i,x)
}
Then, as explained, I want to cut the fifth column of each of the files to later bind them all to another table of same number of rows.

Put the files into a folder. For this example let's call it temp. Set your working directory appropriately or specify the full path for the example below.
cols <- NULL
files <- dir("temp", full.names = TRUE)
for (i in files) {
  # i is already the file path; the question says the files are tab-delimited
  tmp <- read.delim(i, header = TRUE)
  tmp <- tmp[, 5]
  cols <- cbind(cols, tmp)
}
Then you can just cbind the columns in cols with your final data object.

Here is a method using lapply that assumes each file in the folder has the same number of rows.
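# get file names (with their path, so the files can be found from any working directory)
files <- dir("temp", full.names = TRUE)
# remove one file; logical indexing keeps everything if no name matches
files <- files[basename(files) != "removeFileName"]
# get list of vectors from the remaining 29 files (tab-delimited, as in the question)
myList <- lapply(files, function(i) {temp <- read.delim(i); temp[, 5]})
# name the vectors so the columns of the result are labelled by file
names(myList) <- basename(files)
# get new data.frame
dfDone <- do.call(data.frame, myList)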
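To try the approach without the real data, the sketch below writes two made-up tab-delimited files, pulls the fifth column (tpm) from each, and binds the results. The directory `demo_dir`, the `.tsv` extension and the sample prefixes `s1` and `s2` are assumptions. The column layout and the "bs_abundance" pattern come from the question.

```r
# Hypothetical demo files with the column layout shown in the question
demo_dir <- file.path(tempdir(), "abundance_demo")
dir.create(demo_dir, showWarnings = FALSE)
tpm_values <- list(s1 = c(94.7314, 75.6304), s2 = c(1, 2))
for (s in names(tpm_values)) {
  demo <- data.frame(target_id = c("LmjF.27.1250", "LmjF.09.0430"),
                     length = c(966, 1410),
                     eff_length = c(823.427, 1267.43),
                     est_counts = c(2932, 3603),
                     tpm = tpm_values[[s]])
  write.table(demo, file.path(demo_dir, paste0(s, "_bs_abundance.tsv")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Read each file and keep only the fifth column
files <- list.files(demo_dir, pattern = "bs_abundance", full.names = TRUE)
tpm_list <- lapply(files, function(f) read.delim(f)[, 5])

# Label each column by the part of the file name before "_bs_abundance"
names(tpm_list) <- sub("_bs_abundance.*$", "", basename(files))

# One column per file; assumes every file has the same number of rows
tpm <- do.call(cbind, tpm_list)
```

The resulting matrix can then be joined to the other table with cbind(), provided the rows are in the same order in every file.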

R: selectively importing data from several csv files into single data frame while also changing data from rows to individual columns

I’m looking to do the following in R.
I have 250+ csv files of chromatographic data structured similarly to the example below, but with 21 rows instead of three:
1 4.708252 BB 9.946890 7.830349 0.01982016 4.684836 4.742056
2 4.970352 BB 1.792341 1.497008 0.01896829 4.945352 5.005390
3 6.393414 BB 6.599891 5.309925 0.01950091 6.368413 6.428723
What I want to do is read a subset of the data in all 250 files into a single data frame, which is easy enough — but I also need to restructure it a fair bit.
Every row in the table above is a peak. I only want the data from the first and fourth columns (which are ‘peak number’ and ‘area under the peak’, respectively), and in the output I need to make each peak an individual column, rather than a row as above, with the peak number as the header. Finally, I want to create a new column where each row (that is, the data from each individual csv file) is given the same name as the csv file name.
So, imagine I have 3 files: ABC1.csv, ABC2.csv, and ABC3.csv. Each file looks like my example above. I want to automatically take all those files and merge them into a single data frame such as the one below.
ID 1 2 3
ABC1 9.94689 1.792341 6.599891
ABC2 9.76651 1.932332 6.600022
ABC3 8.99193 2.556471 6.718934
I hope I’ve made this clear enough. I’ve been able to manage most of the steps but haven’t been successful writing them into a single script. And I have no idea how, if there is any way, to make the file name into a variable.
Cheers

I am assuming the working directory is set to where the files are. Then you can get the list of files below.
filenames <- list.files()
Have a helper function to read a file and keep just columns 1 and 4.
readdata <- function(filename) {
df <- read.csv(filename, header = FALSE)  # the files have no header row
vec <- df[, 4]
names(vec) <- df[, 1]
return(vec)
}
Loop over all of the files and rbind them
result <- do.call(rbind, lapply(filenames, readdata))
Name them as you like
row.names(result) <- filenames
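A self-contained sketch of the same approach that also produces the ID column asked for in the question. The demo files are made up and, for brevity, have only four columns and three peaks (the real files have eight columns and 21 peaks). `demo_dir` and `read_areas` are hypothetical names.

```r
# Hypothetical demo files: peak number in column 1, area in column 4
demo_dir <- file.path(tempdir(), "peaks_demo")
dir.create(demo_dir, showWarnings = FALSE)
areas <- list(ABC1 = c(9.94689, 1.792341, 6.599891),
              ABC2 = c(9.76651, 1.932332, 6.600022))
for (id in names(areas)) {
  demo <- data.frame(peak = 1:3, rt = c(4.7, 4.9, 6.4), type = "BB",
                     area = areas[[id]])
  write.table(demo, file.path(demo_dir, paste0(id, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Return the areas as a vector named by peak number
read_areas <- function(f) {
  df <- read.csv(f, header = FALSE)
  setNames(df[[4]], df[[1]])
}

# One row per file, one column per peak
wide <- do.call(rbind, lapply(files, read_areas))

# Add the file name (without extension) as the ID column
result <- data.frame(ID = tools::file_path_sans_ext(basename(files)), wide,
                     check.names = FALSE, stringsAsFactors = FALSE)
```

`check.names = FALSE` keeps the peak numbers as column names instead of converting them to X1, X2, and so on.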

The following code can probably be of some help, though the file name is still not working properly:
path <- "C:\\Users\\Vidyut\\"
filenames <- list.files(path = path,pattern = ".csv")
l <- data.frame(ID=character(),col1=numeric(),col2=numeric(),col3=numeric(),stringsAsFactors=FALSE)
for (i in filenames) {
#i = filenames[1]
full = paste(path,i,sep="")
m <- read.csv(full, header=F)
# extract the subset of rows required from each file
# m <- m[c(),]
n<- m[,c(1,4)]
y <- gsub('.csv','',i)
print("y=")
print(y)
d <- list(ID=as.character(y),col1=n[1,2],col2=n[2,2],col3=n[3,2])
print("d=")
print(d)
l <- rbind.data.frame(l,d)
print("l=")
print(l)
}
Mind you, this is not very pretty code - just something hacked together to get the job done (visible from the multiple print lines scattered across).

Here's a solution for you. This only works if we can assume that there are exactly 21 peaks in each file and they are in order 1:21. If that's not the case a few changes to the code should remedy this.
folder = "c:/temp/"
files <- dir(folder)
first_loop <- TRUE
for (file in files) {
# Read one file, only the first and fourth columns
temp <- read.csv(file=paste0(folder,file),
header = FALSE,
colClasses = c("integer", "NULL", "NULL", "numeric", "NULL", "NULL", "NULL", "NULL"))
# Transpose the data
temp <- data.frame(t(temp))
# Remove the peak number
temp <- temp[2,]
# Concatenate the dataframes together
temp$file <- file
if (first_loop) {
data <- temp
first_loop <- FALSE
} else {
data <- rbind(data, temp)
}
}
data