loop loading pairs of files - r

I am writing a loop that takes two files per run, e.g. a0.txt and b0.txt. I am running this over 100 files that run from a0.txt and b0.txt to a999.txt and b999.txt. The pattern I use works perfectly if I run it for files a0 and b0 to a9 and b9 with only file pairs 0-9 in the directory. But when I put more files in the directory and do the run over 0:10, the loop fails and mixes up vectors between files. I think this is because of the pattern I use, i.e.
list.files(pattern=paste('.', x, '\\.txt', sep=''))
This only looks for files matching '.', x, '\\.txt'.
So if '.' = a and x = 1 it finds file a1. But I think it gets confused between a0 and a10 when I do the run over more files. I cannot seem to find a pattern that also works for files up to a999 and b999.
Can anyone help with a better way to do this? Code below.
dostuff <- function(x)
{
  files <- list.files(pattern=paste('.', x, '\\.txt', sep=''))
  a <- read.table(files[1],header=FALSE) #file a0.txt
  G <- a$V1-a$V2
  b <- read.table(files[2],header=FALSE) #file b0.txt
  as.factor(b$V2)
  q <- tapply(b$V3,b$V2,FUN=length)
  H <- b$V1-b$V2
  model <- lm(G~H)
  return(model$coefficients[2],q)
}
results <- sapply(0:10,dostuff)
Error in tapply(b$V3, b$V2, FUN = length) : arguments must have same length

How about constructing the file names directly, without searching? For example:
dostuff <- function(x)
{
a.filename <- paste('a', x, '.txt', sep='') # a<x>.txt
b.filename <- paste('b', x, '.txt', sep='') # b<x>.txt
a <- read.table(a.filename, header=FALSE)
# [...]
b <- read.table(b.filename, header=FALSE)
# [...]
}
But the error message says the problem is caused by the call to tapply rather than by incorrect file names, and I don't see how that could happen, since a data frame (which read.table creates) always has the same number of rows in every column. Did you copy-paste that error message out of R? (I suspect there might be a typo, so that the call was really, for example, q <- tapply(a$V3,b$V2,FUN=length). But I could easily be wrong.)
Also, as.factor(b$V2) doesn't modify b$V2; it just returns a factor representing b$V2. After you call as.factor, b$V2 is still a plain vector. You need to assign the result to something, e.g.:
V2.factor <- as.factor(b$V2)
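The points in this answer can be combined into one sketch. It assumes the column layout described in the question (two columns in the a-files, three in the b-files). The dir argument and the sample numbers are made up so that the example runs on its own. It returns a list, because return() in R accepts only a single value.

```r
dostuff <- function(x, dir = ".") {
  # Build both file names explicitly instead of searching for them
  a <- read.table(file.path(dir, paste0("a", x, ".txt")), header = FALSE)
  b <- read.table(file.path(dir, paste0("b", x, ".txt")), header = FALSE)

  G <- a$V1 - a$V2
  H <- b$V1 - b$V2              # computed while V2 is still numeric
  b$V2 <- as.factor(b$V2)       # assign the factor back so it is kept
  q <- tapply(b$V3, b$V2, FUN = length)
  model <- lm(G ~ H)

  # Bundle both results in a single list
  list(slope = unname(coef(model)[2]), counts = q)
}

# Made-up sample data written to a temporary directory (assumption)
tmp <- tempfile("pairs")
dir.create(tmp)
write.table(data.frame(c(5, 7, 9, 12), c(1, 2, 3, 4)),
            file.path(tmp, "a0.txt"), row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(3, 5, 6, 9), c(1, 1, 2, 2), c(10, 20, 30, 40)),
            file.path(tmp, "b0.txt"), row.names = FALSE, col.names = FALSE)

res <- dostuff(0, tmp)
res$counts
```

For the full run, something like lapply(0:999, dostuff) would then give one list element per file pair.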

If the beginning of the two file names is always the same (a or b in your example), you could use this information in the pattern:
x <- 1
list.files(pattern=paste('[ab]', x, '\\.txt', sep=''))
# [1] "a1.txt" "b1.txt"
x <- 11
list.files(pattern=paste('[ab]', x, '\\.txt', sep=''))
# [1] "a11.txt" "b11.txt"
Edit: you should include the ^ as well, as Wojciech proposed. ^ matches the beginning of a line or, in your case, the beginning of the file name.
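To see the effect of the anchor, here is a small sketch that runs the patterns against a made-up vector of file names instead of a real directory (the names are an assumption for illustration). In a regular expression, '.' matches any single character, so the original pattern for x = 0 also matches a10.txt and b10.txt. If list.files returns those matches in alphabetical order, files[2] would be a10.txt, which has no V3 column, and that would be consistent with the tapply error in the question.

```r
# Hypothetical directory listing, used instead of list.files()
files <- c("a0.txt", "a1.txt", "a10.txt", "a11.txt",
           "b0.txt", "b1.txt", "b10.txt", "b11.txt")

x <- 0
# Original pattern: any character, then x, then ".txt"
loose <- grep(paste0(".", x, "\\.txt"), files, value = TRUE)
# Anchored pattern: name must start with a or b and end with ".txt"
strict <- grep(paste0("^[ab]", x, "\\.txt$"), files, value = TRUE)

loose   # "a0.txt" "a10.txt" "b0.txt" "b10.txt"
strict  # "a0.txt" "b0.txt"
```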

Related

function with FOR and IF loops

I am writing a function that will go through a list of files in a directory, count the number of complete cases, and, if the number of complete cases is above a given threshold, calculate a correlation. The output must be a numeric vector of correlations for all files that meet the threshold requirement. This is what I have so far (and it gives me an Error: unexpected '}' in "}"). Full disclosure: I am a complete newbie, as in I wrote my first code 2 weeks ago. What am I doing wrong?
correlation <- function (directory, threshhold = 0) {
all_files <- list.files(path = getwd())
correlations_list <- numeric()
for (i in seq_along(all_files)) {
dataFR2 <- read.csv(all_files[i])
c <- c(sum(complete.cases(dataFR2)))
if c >= threshhold {
d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
correlations_list <- c(correlations_list, d)
}
}
correlations_list
}
"Unexpected *" errors are syntax errors, often caused by a missing parenthesis, comma, or curly bracket. In this case, you need to change if c >= threshhold { to if (c >= threshhold) {. R's syntax requires the condition of an if statement to be wrapped in parentheses.
I'd also strongly recommend that you not use c as a variable name. c() is the most commonly used R function, and giving an object the same name will make your code look very strange to anyone else reading it.
Lastly, I'd recommend that you make your output the same length as the number of files. As you have it, there won't be any way to know which files met the threshold to have their correlations calculated. I'd make correlations_list have the same length as the number of files, and add names to it so you know which correlation belongs to which file. This has the side benefit of not "growing an object in a loop", which is an anti-pattern known for its inefficiency. A rewritten function would look something like this:
correlation <- function (directory, threshhold = 0) {
  all_files <- list.files(path = getwd())
  correlations_list <- numeric(length(all_files)) ## initialize to full length
  for (i in seq_along(all_files)) {
    dataFR2 <- read.csv(all_files[i])
    n_complete <- sum(complete.cases(dataFR2))
    if(n_complete >= threshhold) {
      d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
    } else {
      d <- NA
    }
    correlations_list[i] <- d
  }
  names(correlations_list) <- all_files
  correlations_list
}
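As an alternative sketch of the same idea, vapply can fill the full-length result without an explicit loop. The function name correlation_by_file, the use of the directory argument, and the two sample CSV files are assumptions made so the example runs on its own; the sulfate and nitrate column names come from the question.

```r
correlation_by_file <- function(directory, threshold = 0) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  out <- vapply(paths, function(p) {
    d <- read.csv(p)
    if (sum(complete.cases(d)) >= threshold) {
      cor(d$sulfate, d$nitrate, use = "complete.obs")
    } else {
      NA_real_   # below the threshold: keep the slot, mark it missing
    }
  }, numeric(1))
  names(out) <- basename(paths)
  out
}

# Made-up sample files (assumption)
tmp <- tempfile("csvs")
dir.create(tmp)
write.csv(data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 8)),
          file.path(tmp, "one.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, NA, NA), nitrate = c(1, 2, 3)),
          file.path(tmp, "two.csv"), row.names = FALSE)

res <- correlation_by_file(tmp, threshold = 2)
res
```

Here one.csv has three complete cases and is correlated, while two.csv has only one and is reported as NA.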

Substitute/Placeholder Variable in R Syntax

I have the following problem:
I need the same syntax over and over again for different variable-sets.
They all have the same "core" name (for example: variable_1) and different suffixes like:
variable_1_a, variable_1_b, variable_1_c, variable_1_d, variable_1_e, ...
Since the syntax is large and I need to run it for variables _2, _3, _4, _5, and so on, I was wondering whether there is some form of placeholder expression that I could define once with the "core" name I want to run it with, instead of copy-pasting the whole syntax and substituting every "variable_1" with the next core name.
For example saving the core name in term !XY! (the "!" is just to represent that it is something atypical) and having that term in the whole syntax with "_a", "_b", "_c" attached
!XY!_a, !XY!_b, !XY!_c, !XY!_d, !XY!_e, ...
I played around with saving the core-name in an element called XY and pasting it with the endings:
XY <- "variable_1"
paste0(as.character(XY),"_a")
"variable_1_a"
OR
as.symbol(paste0(as.character(XY),"_a"))
variable_1_a
Of course that looks horribly long, but I would accept that if the result could also be used like a variable, for example to read from it or write to it. Instead, assigning to it results in an error:
as.symbol(paste0(as.character(XY),"_a")) <- "test"
Error in as.symbol(paste0(as.character(XY),"_a")) <- "test" :
could not find function "as.symbol<-"
It would be a huge time-saver if there is a chance to write one syntax to fit all procedures!
Thx a lot for your ideas!
Let's assume you have 5 variables ("variable_1", "variable_2" etc) and 4 letters ("_a", "_b" etc).
We can use outer like :
n <- 1:5
l <- letters[1:4]
c(outer(n, l, function(x, y) paste("variable", x, y, sep = "_")))
#Or a bit shorter :
#paste0("variable_", c(outer(n, l, paste, sep = "_")))
#[1] "variable_1_a" "variable_2_a" "variable_3_a" "variable_4_a"
#[5] "variable_5_a" "variable_1_b" "variable_2_b" "variable_3_b"
#[9] "variable_4_b" "variable_5_b" "variable_1_c" "variable_2_c"
#[13] "variable_3_c" "variable_4_c" "variable_5_c" "variable_1_d"
#[17] "variable_2_d" "variable_3_d" "variable_4_d" "variable_5_d"
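Names built this way are plain strings. One way to use them like variables is assign() and get(); another, often easier to maintain, is to keep the data in a data frame or list and index it by name. The sketch below shows both. The data frame dat, its values, and the helper row_sums_for are made up for illustration.

```r
# Option 1: assign() / get() with a constructed name
XY <- "variable_1"
assign(paste0(XY, "_a"), "test")
get(paste0(XY, "_a"))

# Option 2: keep the variables as columns and select them by name
dat <- data.frame(variable_1_a = 1:3, variable_1_b = 4:6,
                  variable_2_a = 7:9, variable_2_b = 10:12)

row_sums_for <- function(core, data, suffixes = c("a", "b")) {
  cols <- paste(core, suffixes, sep = "_")   # e.g. "variable_1_a", "variable_1_b"
  rowSums(data[cols])
}

# The same code runs once per core name
cores <- c(variable_1 = "variable_1", variable_2 = "variable_2")
totals <- lapply(cores, row_sums_for, data = dat)
totals$variable_2
```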

test if compressed archives contain same data

Is it possible to test whether the contents of compressed archives are the same without needing to decompress them? What is the standard way of doing this in R? I was thinking of hashing them, with md5 or something similar, but this all takes more time. Is it necessary? I don't care about the times the archives were created or anything like that, only whether the contents of the files are the same.
Example (creates some test files on your comp.)
## Create some test files
dir.create("test1")
dir.create('test2')
writeLines(text="hi", con="test1/test1.txt")
writeLines(text="hi*2", con="test2/test2.txt")
## Make some compressed archives
tar("test.tar.gzip2", files="test1", compression="bzip2") # should be same as test1.tar.gzip2
tar("test1.tar.gzip2", files="test1", compression="bzip2")
tar("test2.tar.gzip2", files="test2", compression="bzip2")
I want to be able to test that "test.tar.gzip2" and "test1.tar.gzip2" are the same, but "test2.tar.gzip2" is different. How?
The following function extracts the bytes from a file, which you can then compare:
binRead <- function(fName){
  f_s <- file.info(fName)$size
  f <- file(fName,"rb")
  res <- readBin(f,"raw",f_s)
  close(f)
  return(res)
}
t0 <- binRead("test.tar.gzip2")
t1 <- binRead("test1.tar.gzip2")
t2 <- binRead("test2.tar.gzip2")
identical(t0, t1) # TRUE
identical(t0, t2) # FALSE
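The hashing idea from the question can be sketched with tools::md5sum, which avoids holding both files in memory as raw vectors. The helper name same_bytes and the small temporary files are assumptions for illustration. Note that this, like the byte comparison above, tests whether two files are byte-for-byte identical; archives with the same contents can still differ in their bytes, for example because tar stores file metadata such as modification times.

```r
same_bytes <- function(p, q) {
  # Cheap size check first, then compare MD5 digests
  file.size(p) == file.size(q) &&
    unname(tools::md5sum(p) == tools::md5sum(q))
}

# Made-up stand-ins for the archives (assumption)
tmp <- tempfile("cmp")
dir.create(tmp)
f0 <- file.path(tmp, "x0.bin")
f1 <- file.path(tmp, "x1.bin")
f2 <- file.path(tmp, "x2.bin")
writeBin(charToRaw("hi"), f0)
writeBin(charToRaw("hi"), f1)
writeBin(charToRaw("hi*2"), f2)

same_bytes(f0, f1)
same_bytes(f0, f2)
```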

Assign, Save and Load a Sequence of Objects to a Sequence of Variables in R

I would like to assign a sequence of results to a sequence of variables and then save these variables in a sequence of .Rdata files. I want to do this in order to use sapply to load all the .Rdata files once I'll need them.
To illustrate my problem, I wrote a simple function save.try1, and ran it 5 times.
save.try1 = function(x){
M = matrix(c(x,x+1,x+2,x+3), ncol=2, nrow = 2)
filename = paste('/Volumes/GIORDANO 3/MTS/MTS - essai/matrice', x, '.Rdata', sep='')
save(M, file =filename)
}
sapply(1:5, save.try1)
I now have 5 .Rdata files named matrice1.Rdata, ... , matrice5.Rdata in my folder MTS - essai.
Now, when I load the files in R, I obtain:
> matrices.chargees = sapply(1:5, function(x) load(file = paste('/Volumes/GIORDANO 3/MTS/MTS - essai/matrice', x, '.Rdata', sep='')))
> matrices.chargees
[1] "M" "M" "M" "M" "M"
And I can't open any of the M matrices:
> matrices.chargees[[2]]
[1] "M"
> matrices.chargees[2]
[1] "M"
> matrices.chargees$M
Error in matrices.chargees$M : $ operator is invalid for atomic vectors
First, I want to have different names for each matrix. I read the following post:
how to assign a sequence of computations to a sequence of variables in r
I tried to change my code to:
save.try1 = function(x){
M = matrix(c(x,x+1,x+2,x+3), ncol=2, nrow = 2)
assign(paste('M', x, sep =''), M)
filename = paste('/Volumes/GIORDANO 3/MTS/MTS - essai/matrice', x, '.Rdata', sep='')
save(paste('M', x, sep =''), file =filename)
}
But I obtain the following error:
> sapply(1:5, save.try1)
Error in save(paste("M", x, sep = ""), file = filename) :
object ‘paste("M", x, sep = "")’ not found
My questions are therefore the following:
1- how do I modify my code to have a different name for each matrix?
2- Will this allow me to access the content of each matrix? If not, how do I access it?
Thank you!
Please do not use = in your assignments unless you know exactly what you are doing.
Here is a working version of your function. You need to pass the variable name to save as a string, through its list argument, rather than as a symbol:
save.try1 <- function(x){
M <- matrix(c(x,x+1,x+2,x+3), ncol=2, nrow = 2)
foo <- paste('M', x, sep ='')
assign(foo, M)
filename <- paste('ENTER_YOUR_PATH_HERE', x, '.Rdata', sep='')
save(list=foo, file=filename)
}
Having said that, I really doubt using a functional instead of a for loop here is good style. Functionals are supposed to return a structure of values not a list of NULLs as this sapply call will. Love your for loops even in R, they exist for a reason.
You also have a tiny error in your load script. load returns "a character vector of the names of objects created, invisibly" (see ?load).
So with small fixes it will load the matrices as expected:
matrices.chargees <- sapply(1:5, function(x) {
  name <- load(file = paste('matrice', x, '.Rdata', sep=''))
  out <- list(get(name))
  names(out) <- name
  out
})
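A different approach, not the one used above, is to store each matrix with saveRDS and read it back with readRDS. readRDS returns the object itself rather than its name, so no assign or get is needed. The sketch below is written under that assumption and uses a temporary directory and .rds file names chosen for illustration.

```r
out_dir <- tempfile("matrices")
dir.create(out_dir)

# Save one matrix per file
for (x in 1:5) {
  M <- matrix(c(x, x + 1, x + 2, x + 3), ncol = 2, nrow = 2)
  saveRDS(M, file.path(out_dir, paste0("matrice", x, ".rds")))
}

# Read them back into a named list
matrices <- lapply(1:5, function(x) {
  readRDS(file.path(out_dir, paste0("matrice", x, ".rds")))
})
names(matrices) <- paste0("M", 1:5)

matrices$M2
```

Each matrix is then available as matrices$M1, matrices[["M2"]], and so on.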

Repeated vector names in loop error

I am running the following code over pairs of files like a1.txt and b1.txt, a2.txt and b2.txt, .... a999.txt and b999.txt:
dostuff <- function(x)
{
  files <- list.files(pattern=paste('.', x, '\\.txt', sep=''))
  a <- read.table(files[1],header=FALSE) #file a1.txt
  G <- a$V1-a$V2
  b <- read.table(files[2],header=FALSE) #file b1.txt
  as.factor(b$V2)
  q <- tapply(b$V3,b$V2,FUN=length)
  H <- b$V1-b$V2
  model <- lm(G~H)
  return(model$coefficients[2],q)
}
results <- sapply(0:999,dostuff)
Error in tapply(b$V3, b$V2, FUN = length) : arguments must have same length
I assume this is because neither file in a pair has a header, so a has V1 V2 and b has V1 V2 V3.
However, this error does not arise when I run this over small batches of files like 0:3, and the results for these come out the same as if I did every analysis separately, manually clearing the environment between runs of the whole code.
I believe the problem arises once the files run from a1 b1 to a10 b10 and above, because the loop, I think, gets confused over which files to select. The problem disappears as long as I run with a0 b0 to a9 b9.
Best solutions?
I've encountered similar issues when there was some stray file in a batch I was trying to process that didn't have the expected number of rows. I'd recommend checking the dimensions of all your files first to see if they all agree. My guess is that files 0-3 are all properly formatted, but maybe some file in [0, 999] isn't.
Try checking whether the dimensions are the same for all files:
doDims <- function(x){
  files <- list.files(pattern=paste('.', x, '\\.txt', sep=''))
  a <- read.table(files[1],header=FALSE) #file a1.txt
  b <- read.table(files[2],header=FALSE) #file b1.txt
  # dim() is NULL for a single column, so record rows and columns of each data frame,
  # plus how many files the pattern matched
  c(x=x, n_files=length(files), a_rows=nrow(a), a_cols=ncol(a), b_rows=nrow(b), b_cols=ncol(b))
}
dims <- t(sapply(0:999, doDims))
Then you can see whether all the dimensions are, indeed, what you expect.
