Repeated vector names in loop error - r

I am running the following code over pairs of files like a1.txt and b1.txt, a2.txt and b2.txt, ..., a999.txt and b999.txt:
dostuff <- function(x)
{
files <- list.files(pattern=paste('.', x, '\\.txt', sep=''))
a <- read.table(files[1],header=FALSE) #file a1.txt
G <- a$V1-a$V2
b <- read.table(files[2],header=FALSE) #file b1.txt
as.factor(b$V2)
q <- tapply(b$V3,b$V2,Fun=length)
H <- b$V1-b$V2
model <- lm(G~H)
return(model$coefficients[2],q)
}
results <- sapply(0:999,dostuff)
Error in tapply(b$V3, b$V2, FUN = length) : arguments must have same length
I assume this is because neither file in a pair has a header, so a gets columns V1 and V2 while b gets V1, V2 and V3.
However, this error does not arise when I run it over small batches of files like 0:3, and the results for those come out the same as when I do each analysis separately, manually clearing the environment between runs of the whole code.
I believe the problem arises once the files run from a1/b1 up to a10/b10 and above, because I think the loop gets confused over which files to select. The problem disappears as long as I only run a0/b0 to a9/b9.
Best solutions?

I've encountered similar issues when there was some stray file in a batch I was trying to process that didn't have the expected number of rows. I'd recommend checking the dimensions of all your files first to see if they all agree. My guess is that files 0-3 are all properly formatted, but maybe some file in [0, 999] isn't.
Try checking to see if the dimensions are the same for all files:
doDims <- function(x){
files <- list.files(pattern=paste('.', x, '\\.txt', sep=''))
b <- read.table(files[2], header=FALSE) #file b1.txt
# dim() is NULL for a single column, and filling a list defined outside the
# function has no effect from inside it, so return the two column lengths instead
c(len2=length(b$V2), len3=length(b$V3))
}
Then you can see if all the dimensions are, indeed, what you expect.
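For example, a quick sketch of how you might run it and flag suspect pairs (assuming the doDims() above, which returns the two lengths):
dims <- sapply(0:999, doDims)
colnames(dims) <- 0:999
which(dims["len2", ] != dims["len3", ]) # x values whose b file has no V3 column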

Related

Where in this loop does b get rewritten to 29?

I have the following loop:
n_range <- 4:29
bResultsStra <- vector("double", 27)
for (b in n_range) {
bRowsStra<-Stratified[sample(nrow(Stratified), b), ]
base <- read.table("base.csv", header=T, sep="," )
base$area<-5036821
base$quadrea <- base$area * 16
bootStra <- boot(data=bRowsStra$count, average, R=1000)
base$data<- bootStra$t
base$popsize<-(base$data*base$quadrea)
basesortStra <- base[order(base$popsize),]
bValue <- basesortStra$popsize[975] - basesortStra$popsize[25]
bResultsStra[[b - 1]] <- bValue
}
which I changed from
n_range <- 2:29
bResultsStra <- vector("double", 29)
Basically, I no longer want to take values from n=2,3.
However, when I run it and attempt to put the results in a data frame, it returns:
Error in data.frame(n = n_range, bResultsStra) :
arguments imply differing number of rows: 26, 28
For some reason I do not know, b gets written to 29L rather than 27, which is what it should be given the n_range. Why is it doing this?
Thank You
The offending line is
bResultsStra[[b - 1]] <- bValue
That’s the problem with magic constants: once you change one of them, the relationship with the other constants in the code gets lost.
To fix this and make your code more robust at the same time, change your code as follows:
from <- 4L
to <- 29L
n_range <- seq(from, to)
bResultsStra <- vector("double", length(n_range))
for (b in n_range) {
# […]
bResultsStra[[b - from + 1L]] <- bValue
}
That said, this code could be improved further by choosing better variable names (even just consistent naming conventions go a long way!) and by replacing the entire loop with a call to lapply.
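For instance, here is a sketch of that rewrite, assuming the loop body is pulled into a helper (hypothetically named boot_width here); vapply() is used so the result comes back directly as a numeric vector:
boot_width <- function(b) {
bRowsStra <- Stratified[sample(nrow(Stratified), b), ]
# […] same body as in the loop above, ending with the interval width
basesortStra$popsize[975] - basesortStra$popsize[25]
}
bResultsStra <- vapply(n_range, boot_width, numeric(1))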

Loop Changing to Matrix then Running tests

I have a dataframe with ~9000 rows of human coded data in it, two coders per item so about 4500 unique pairs. I want to break the dataset into each of these pairs, so ~4500 dataframes, run a kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
v <- i
v <- v[c("V1","V2")]
v <- assign(v, as.matrix(get(v)))
s <- kripp.alpha(v)
i$alpha <- s$value
}
Or I am getting the warning "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
i.m <- i[c("V1","V2")]
i.m <- as.matrix(i.m)
s <- kripp.alpha(i.m)
i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and with looking objects up by name when they are referenced through strings. You'd need get() (or eval()) on those names to make your current approach work.
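If you really wanted to keep working from a character vector of names, a rough sketch using get() and assign() (assuming items is the character vector c("l", "t") from your example) could look like this:
library("irr") # For kripp.alpha
for (nm in items) {
v <- get(nm) # look the data frame up by its name
s <- kripp.alpha(as.matrix(v[c("V1", "V2")]))
v$alpha <- s$value
assign(nm, v) # write the modified copy back under the same name
}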
Here's another solution
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
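And if you also want the alpha attached to each data frame as a new column, as in your original loop, a small sketch with lapply():
items <- lapply(items, function(v) {
v$alpha <- kripp.alpha(as.matrix(v[c("V1", "V2")]))$value
v
})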

test if compressed archives contain same data

Is it possible to test whether the contents of compressed archives are the same without needing to decompress them? What is the standard way of doing this in R? I was thinking of hashing them, with an md5 or something, but that takes more time, and is it even necessary? I don't care about the times the archives were created or anything like that, only whether the contents of the files are the same.
Example (creates some test files on your computer):
## Create some test files
dir.create("test1")
dir.create('test2')
writeLines(text="hi", con="test1/test1.txt")
writeLines(text="hi*2", con="test2/test2.txt")
## Make some compressed archives
tar("test.tar.gzip2", files="test1", compression="bzip2") # should be same as test1.tar.gzip2
tar("test1.tar.gzip2", files="test1", compression="bzip2")
tar("test2.tar.gzip2", files="test2", compression="bzip2")
I want to be able to test that "test.tar.gzip2" and "test1.tar.gzip2" are the same, but "test2.tar.gzip2" is different. How?
The following function extracts the bytes from a file, which you can then compare:
binRead <- function(fName){
f_s <- file.info(fName)$size
f <- file(fName,"rb")
res <- readBin(f,"raw",f_s)
close(f)
return(res)
}
t0 <- binRead("test.tar.gzip2")
t1 <- binRead("test1.tar.gzip2")
t2 <- binRead("test2.tar.gzip2")
all(t0 == t1) #true
all(t0 == t2) #false
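If you would rather hash the files, as you mention in the question, tools::md5sum() works directly on file paths; this sketch is equivalent to the byte comparison above, i.e. it only calls two archives the same if they are byte-for-byte identical:
library(tools)
md5s <- md5sum(c("test.tar.gzip2", "test1.tar.gzip2", "test2.tar.gzip2"))
md5s["test.tar.gzip2"] == md5s["test1.tar.gzip2"] # TRUE
md5s["test.tar.gzip2"] == md5s["test2.tar.gzip2"] # FALSE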

Parallelize and speed up R code to read in many files

I have code that works perfectly for my purpose (it reads some files matching a specific pattern, reads the matrix within each file, and computes something using each file pair; the final output is a matrix whose dimensions equal the number of files) and looks like this:
m<- 100
output<- matrix(0, m, m)
lista<- list.files(pattern = "q")
listan<- as.matrix(lista)
n <- nrow(listan)
for (i in 1:n) {
AA <- read.table((listan[i,]), header = FALSE)
A<- as.matrix(AA)
dVarX <- sqrt(mean(A * A))
for (j in i:n) {
BB <- read.table ((listan[j,]), header = FALSE)
B<- as.matrix(BB)
V <- sqrt (dVarX * (sqrt(mean(B * B))))
output[i,j] <- (sqrt(mean(A * B))) / V
}
}
My problem is that it takes a lot of time (I have about 5000 matrices, which means 5000x5000 loops).
I would like to parallelize, but I need some help!
Waiting for your kind suggestions!
Thank you in advance!
Gab
The bottleneck is likely reading from disk. Running code in parallel isn't guaranteed to make things faster. In this case, multiple processes attempting to read from the same disk at the same time is likely to be even slower than a single process.
Since your matrices are being written by another R process, you really should save them in R's binary format. You're reading every matrix once and only once, so the only way to make your program faster is to make reading from disk faster.
Here's an example that shows you how much faster it could be:
# make some random data and write it to disk
set.seed(21)
for(i in 0:9) {
m <- matrix(runif(700*700), 700, 700)
f <- paste0("f",i)
write(m, f, 700) # text format
saveRDS(m, paste0(f,".rds")) # binary format
}
# initialize two output objects
m <- 10
o1 <- o2 <- matrix(NA, m, m)
# get list of file names
files <- list.files(pattern="^f[[:digit:]]+$")
n <- length(files)
First, let's run your code using scan, which is already a lot faster than your current solution with read.table.
system.time({
for (i in 1:n) {
A <- scan(files[i],quiet=TRUE)
for (j in i:n) {
B <- scan(files[j],quiet=TRUE)
o1[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
}
}
})
# user system elapsed
# 31.37 0.78 32.58
Now, let's re-run that code using the files saved in R's binary format:
system.time({
for (i in 1:n) {
fA <- paste0(files[i],".rds")
A <- readRDS(fA)
for (j in i:n) {
fB <- paste0(files[j],".rds")
B <- readRDS(fB)
o2[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
}
}
})
# user system elapsed
# 2.42 0.39 2.92
So the binary format is ~10x faster! And the output is the same:
all.equal(o1,o2)
# [1] TRUE
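If you still want to try parallelising on top of that, the outer loop maps fairly naturally onto parallel::mclapply() (which forks, so it is not available on Windows). This is only a sketch, and as noted above, several processes hitting the same disk at once may cancel out any gain:
library(parallel)
rows <- mclapply(1:n, function(i) {
A <- readRDS(paste0(files[i], ".rds"))
row <- rep(NA_real_, n)
for (j in i:n) {
B <- readRDS(paste0(files[j], ".rds"))
row[j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
}
row
}, mc.cores = 2)
o3 <- do.call(rbind, rows) # same upper-triangular layout as o1 and o2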

loop loading pairs of files

I am writing a loop that takes two files per run, e.g. a0.txt and b0.txt. I am running this over file pairs that run from a0.txt and b0.txt up to a999.txt and b999.txt. The pattern function I use works perfectly if I do the run for files a0 and b0 to a9 and b9, with only file pairs 0-9 in the directory, but when I put more files in the directory and do the run from 0:10, the loop fails and confuses vectors across files. I think this is because of the pattern I use, i.e.
list.files(pattern=paste('.', x, '\\.txt', sep=''))
This only looks for files that match '.', x, '\\.txt'.
So if '.' matches a and x is 1, it finds file a1.txt. But I think it gets confused between a0 and a10 when I do the run over more files, and I cannot seem to find an appropriate pattern that will also find files all the way up to a999 and b999.
Can anyone help with a better way to do this? code below.
dostuff <- function(x)
{
files <- list.files(pattern=paste('.', x, '\\.txt', sep=''))
a <- read.table(files[1],header=FALSE) #file a0.txt
G <- a$V1-a$V2
b <- read.table(files[2],header=FALSE) #file b0.txt
as.factor(b$V2)
q <- tapply(b$V3,b$V2,Fun=length)
H <- b$V1-b$V2
model <- lm(G~H)
return(model$coefficients[2],q)
}
results <- sapply(0:10,dostuff)
Error in tapply(b$V3, b$V2, FUN = length) : arguments must have same length
How about getting the files directly, without searching? I.e.:
dostuff <- function(x)
{
a.filename <- paste('a', x, '.txt', sep='') # a<x>.txt
b.filename <- paste('b', x, '.txt', sep='') # b<x>.txt
a <- read.table(a.filename, header=FALSE)
# [...]
b <- read.table(b.filename, header=FALSE)
# [...]
}
But the error message says the problem is caused by the call to tapply rather than anything about incorrect file names, and I have literally no idea how that could happen, since I thought a data frame (which read.table creates) always has the same number of rows for each column. Did you copy-paste that error message out of R? (I have a feeling there might be a typo, and that the call was actually something like q <- tapply(a$V3, b$V2, Fun=length). But I could easily be wrong.)
Also, as.factor(b$V2) doesn't modify b$V2; it just returns a factor representing b$V2, so after you call as.factor, b$V2 is still a plain vector. You need to assign the result to something, e.g.:
V2.factor <- as.factor(b$V2)
If the beginnings of the two file names are always the same (a and b in your example), you can use this information in the pattern:
x <- 1
list.files(pattern=paste('[a,b]', x, '\\.txt', sep=''))
# [1] "a1.txt" "b1.txt"
x <- 11
list.files(pattern=paste('[a,b]', x, '\\.txt', sep=''))
# [1] "a11.txt" "b11.txt"
Edit: you should include the ^ as well, as Wojciech proposed; ^ matches the beginning of a line, or in your case the beginning of the file name.
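For example (a sketch), anchoring both ends makes the pattern match only file names that are exactly a<x>.txt or b<x>.txt, so stray files such as data1.txt can no longer slip in:
x <- 1
list.files(pattern=paste('^[ab]', x, '\\.txt$', sep=''))
# [1] "a1.txt" "b1.txt"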
