Seemingly simple question, but I don't know the loop syntax and variable assignment in R very well. I have a 6900-line table that I want to split into 10 equal-sized text files. My code is below, but how would I design a loop around it and iterate through the filenames?
write.table(clipboard[1:619,1], "mydata1.txt", sep="\t")
write.table(clipboard[619:1238,1], "mydata2.txt", sep="\t")
write.table(clipboard[1238:1857,1], "mydata3.txt", sep="\t")
write.table(clipboard[1857:2476,1], "mydata4.txt", sep="\t")
write.table(clipboard[2476:3095,1], "mydata5.txt", sep="\t")
write.table(clipboard[3095:3714,1], "mydata6.txt", sep="\t")
write.table(clipboard[3714:4333,1], "mydata7.txt", sep="\t")
write.table(clipboard[4333:4952,1], "mydata8.txt", sep="\t")
write.table(clipboard[4952:5571,1], "mydata9.txt", sep="\t")
write.table(clipboard[5571:6190,1], "mydata10.txt", sep="\t")
The manual way
I guess it's not such an issue to use a loop for I/O here:
for (i in 1:10) {
    start <- 1 + (i - 1) * nrow(clipboard) / 10
    end   <- i * nrow(clipboard) / 10
    fname <- paste("mydata", i, ".txt", sep="")
    write.table(x=clipboard[start:end, 1], file=fname, sep="\t")
}
Note that this assumes that it can actually be separated into 10 equally sized files!
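If the row count does not divide evenly, one option is to let cut() assign the rows to roughly equal groups instead of computing start/end indices by hand. A minimal sketch, assuming clipboard is a data frame:
# Split the rows into 10 roughly equal groups, even when nrow() is not a
# multiple of 10, and write each group to its own file.
n.files <- 10
grp <- cut(seq_len(nrow(clipboard)), breaks = n.files, labels = FALSE)
for (i in seq_len(n.files)) {
    write.table(clipboard[grp == i, 1],
                file = paste0("mydata", i, ".txt"), sep = "\t")
}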
Done properly, write.split:
When the data is not perfectly divisible, this method will create an extra file for the remainder.
I used this splitter to create a list of data chunks that is then used in parallel for some statistical computations in my package correlate. Here it means we can also write the files in parallel. Note that this is pointless for small files, and may even be slower.
# Helper to split the data in chunks
splitter <- function(x, splitsize) {
    nr <- nrow(x)
    if (splitsize > nr) {
        splitsize <- nr
    }
    splits <- floor(nr / splitsize)
    splitted.list <- lapply(split(x[seq_len(splits * splitsize), ], seq_len(splits)),
                            function(x) matrix(x, splitsize))
    if (nr %% splitsize != 0) {
        splitted.list$last <- x[(splits * splitsize + 1):nr, ]
    }
    return(splitted.list)
}
write.split <- function(x, chunks, file.prefix, file.extension, cores = 1, ...) {
    splitsize <- nrow(x) / chunks
    splitted.list <- splitter(x, splitsize)
    if (cores == 1) {
        sapply(names(splitted.list), function(z)
            write.table(splitted.list[z],
                        file = paste(file.prefix, z, file.extension, sep=""),
                        ...))
    } else {
        # currently just the simple Linux version; this won't work on Windows.
        # Upon request I'll add it
        stopifnot(require(parallel))
        mclapply(names(splitted.list), function(z)
            write.table(splitted.list[z],
                        file = paste(file.prefix, z, file.extension, sep=""),
                        ...),
            mc.cores = cores)  # actually use the requested number of cores
    }
}
Usage:
write.split(z, chunks = 10,
file.prefix = "mydata", file.extension = ".txt", sep="\t")
You can also give it the row.names and col.names arguments, basically anything that can be passed to write.table.
Benchmark:
Using `matrix(1:1000000, 1000)` as data.
Unit: seconds
expr min lq median uq max neval
1-core 1.780022 1.990751 2.079907 2.166891 2.744904 100
4-cores 1.305048 1.438777 1.492114 1.559110 2.070911 100
Extensibility:
It could also easily be extended to take a number of lines to write per file rather than a number of chunks.
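For instance, a thin wrapper could derive the chunk count from a desired number of lines per file and hand off to write.split() as defined above; a sketch (the wrapper name is made up):
# Choose the number of chunks from a target number of lines per file.
write.split.lines <- function(x, lines.per.file, file.prefix, file.extension, ...) {
    chunks <- ceiling(nrow(x) / lines.per.file)
    write.split(x, chunks = chunks,
                file.prefix = file.prefix, file.extension = file.extension, ...)
}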
Related
I am running the following for loop for the gwr.basic function in the GWmodel package in R. What I need to do is collect the mean of the estimated parameter for each given bandwidth.
The code looks like:
library(GWmodel)
data("DubVoter")
#Dub.voter
LARentMean = list()
for (i in 20:21)
{
gwr.res <- gwr.basic(GenEl2004 ~ DiffAdd + LARent + SC1 + Unempl + LowEduc + Age18_24 + Age25_44 + Age45_64, data = Dub.voter, bw = i, kernel = "bisquare", adaptive = TRUE, F123.test = TRUE)
a <- mean(gwr.res$SDF$LARent)
LARentMean[i] <- a
}
outcome = unlist(LARentMean)
> outcome
[1] -0.1117668 -0.1099969
However, it is terribly slow to return the result, and I need a much wider range, such as 20:200. Is there a way to speed the process up? If not, how can I use a stepped range, say 20 to 200 in steps of 5, to reduce the number of operations?
I am a python user new to R. I read on SO that R is well known for being slow at for loops and that there are more efficient alternatives. More clarity on this point would be welcomed.
I got the same impression as @musically_ut: the for loop and the traditional for-vs.-apply debate are unlikely to help you here. Try parallelization if you have more than one core. There are several packages, like parallel or snowfall; which one is ultimately the best and fastest depends on your machine and operating system.
Best does not always equal fastest here. Code that works cross-platform can be worth more than a bit of extra performance, and transparency and ease of use can outweigh maximum speed. That being said, I like the standard solution a lot and would recommend parallel, which ships with R and works on Windows, OSX and Linux.
EDIT: here is a fully reproducible example based on the OP's code.
library(GWmodel)
data("DubVoter")
library(parallel)
bwlist <- list(bw1 = 20, bw2 = 21)
cl <- makeCluster(detectCores())
# load 'GWmodel' for each node
clusterEvalQ(cl, library(GWmodel))
# export data to each node
clusterExport(cl, varlist = c("bwlist","Dub.voter"))
out <- parLapply(cl, bwlist, function(e){
try(gwr.basic(GenEl2004 ~ DiffAdd + LARent + SC1 +
Unempl + LowEduc + Age18_24 + Age25_44 +
Age45_64, data = Dub.voter,
bw = e, kernel = "bisquare",
adaptive = TRUE, F123.test = TRUE ))
} )
LArent_l <- lapply(lapply(out,"[[","SDF"),"[[","LARent")
unlist(lapply(LArent_l,"mean"))
# finally, stop the cluster
stopCluster(cl)
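To cover the wider, stepped bandwidth range from the question (20 to 200 in steps of 5), the list of bandwidths can be built with seq() and plugged into the same parLapply() call above; a sketch:
# A named list of bandwidths from 20 to 200 in steps of 5,
# used in place of the two-element bwlist above.
bws <- seq(20, 200, by = 5)
bwlist <- setNames(as.list(bws), paste0("bw", bws))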
Besides using parallelization as Matt Bannert suggests, you should preallocate the vector LARentMean. Often it's not the for loop itself that is slow, but the fact that a for loop tempts you into doing slow things like growing a vector inside it.
Consider the following example to see the impact of a growing vector as compared to preallocating the memory:
library(microbenchmark)
growing <- function(x) {
mylist <- list()
for (i in 1:x) {
mylist[[i]] <- i
}
}
allocate <- function(x) {
mylist <- vector(mode = "list", length = x)
for (i in 1:x) {
mylist[[i]] <- i
}
}
microbenchmark(growing(1000), allocate(1000), times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# growing(1000) 3055.134 4284.202 4743.4874 4433.024 4655.616 47977.236 1000
# allocate(1000) 867.703 917.738 998.0719 956.441 995.143 2564.192 1000
The growing list is about 5 times slower than the version that preallocates the memory.
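Applied to the original problem, preallocation combined with a stepped bandwidth sequence might look like this (a sketch; the gwr.basic() call is the one from the question):
# Preallocate the result vector and loop over a stepped bandwidth range.
bws <- seq(20, 200, by = 5)
LARentMean <- numeric(length(bws))
for (k in seq_along(bws)) {
    gwr.res <- gwr.basic(GenEl2004 ~ DiffAdd + LARent + SC1 + Unempl + LowEduc +
                           Age18_24 + Age25_44 + Age45_64,
                         data = Dub.voter, bw = bws[k], kernel = "bisquare",
                         adaptive = TRUE, F123.test = TRUE)
    LARentMean[k] <- mean(gwr.res$SDF$LARent)
}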
I have a large data.frame (15 columns and 100,000 rows) in an existing R session that I want to send to a Q/KDB instance. From KDB's cookbook, the possible solutions are:
RServer for Q: use KDB to create a new R instance which shares memory space. This doesn't work because my data is in an existing instance of R.
RServe: run an R server and use TCP/IP to communicate with the Q/KDB client. This does not work because, as per RServe's documentation, "every connection has a separate workspace and working directory", so I presume it would not see my existing data.
R Math Library: access R's functionality via a math library without needing an instance of R. This does not work because my data is already in an instance of R.
So any other ideas on how to send data from R to Q/KDB?
Open a port in Q. I start Q with a batch file:
@echo off
c:\q\w32\q -p 5001
Then load qserver.dll in R:
tryCatch({
    dyn.load("c:/q/qserver.dll")
}, error = function(f) {
    print("can't load qserver.dll")
})
Then use these functions:
open_connection <- function(host="localhost", port=5001, user=NULL) {
parameters <- list(host, as.integer(port), user)
h <- .Call("kx_r_open_connection", parameters)
assign(".k.h", h, envir = .GlobalEnv)
return(h)
}
close_connection <- function(connection) {
.Call("kx_r_close_connection", as.integer(connection))
}
execute <- function(connection, query) {
.Call("kx_r_execute", as.integer(connection), query)
}
d <<- open_connection(host="localhost", port=thePort)

ex2 <- function(...) {
    query <- list(...)
    theResult <- NULL
    for (i in query) theResult <- paste0(theResult, i)
    return(execute(d, paste0(theResult)))
}
ex2 can then take multiple arguments, so you can build queries from R variables and strings.
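For instance, a query could be assembled like this (the trade table and the symbol are hypothetical; any valid q expression works):
# The pieces are pasted together into one query string and sent to the Q process.
sym <- "AAPL"
ex2("select from trade where sym=`", sym)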
Edit: that's for getting data from Q into R; here's R to Q.
Second edit: improved algorithm:
library(stringr)

RToQTable <- function(Rtable, Qname, withColNames=TRUE, withRowNames=TRUE, colSuffix = NULL) {
    theColnames <- if (!withColNames || length(colnames(Rtable)) == 0) {
        paste0("col", as.character(1:length(Rtable[1,])), colSuffix)
    } else {
        colnames(Rtable)
    }
    if (!withRowNames || length(rownames(Rtable)) == 0) withRowNames <- FALSE
    Rtable <- rbind(Rtable, "linesep")
    charnum <- as.integer(nchar(thestr <- paste(paste0(theColnames, ':("',
        str_split(paste(Rtable, collapse='";"'), ';\"linesep\";\"')[[1]], ');'),
        collapse="")) - 11)
    if (withRowNames) {
        ex2(Qname, ":([]", Qname,
            str_replace_all(paste0("`", paste(rownames(Rtable), collapse="`")), " ", "_"),
            ";", .Internal(substr(thestr, 1L, charnum)), "))")
    } else {
        ex2(Qname, ":([]", .Internal(substr(thestr, 1L, charnum)), "))")
    }
}
> bigMat <- matrix(runif(1500000),nrow=100000,ncol=15)
> microbenchmark(RToQTable(bigMat,"Qmat"),times=3)
Unit: seconds
expr min lq mean median uq max neval
RToQTable(bigMat, "Qmat") 10.29171 10.315 10.32766 10.33829 10.34563 10.35298 3
This will work for a matrix. For a data frame, save a vector containing the type of each column, convert the data frame to a matrix, import the matrix into Q, and then cast the columns back to their types, as sketched below.
Note that this algorithm is approximately O(rows * cols^1.1), so if you have any more than 20 columns you will need to chop them up into multiple matrices to stay near O(rows * cols).
But for your example of 100,000 rows and 15 columns it takes about 10 seconds, so further optimization may not be necessary.
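A rough sketch of that data frame route (the example data frame, its columns, and the Q table name are made up; the cast back to the proper types is then done on the Q side):
# Remember the column types, send the data frame as a character matrix,
# and keep the type vector around for casting the columns on the Q side.
# Assumes a connection has already been opened with open_connection().
df <- data.frame(price = runif(5), qty = 1:5, sym = letters[1:5],
                 stringsAsFactors = FALSE)
col.types <- sapply(df, class)    # "numeric" "integer" "character"
RToQTable(as.matrix(df), "Qdf")   # every column arrives in Q as strings
col.types                         # use this to cast the columns in Q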
I have following for loop in R:
v = c(1,2,3,4)
s = create.some.complex.object()
for (i in v){
print(i)
s = some.complex.function.that.updates.s(s)
}
# s here has the right content.
Needless to say, this loop is horribly slow in R.
I tried to write it in functional style:
lapply(v, function(i){
print(i)
s = some.complex.function.that.updates.s(s)
})
# s wasn't updated.
But this doesn't work, because s is passed by value and not by reference.
I only need the result of the last iteration, not all of the intermediate steps.
How do I formulate the first loop in R-style?
Mulone
lapply(v, function(i){
print(i)
s = some.complex.function.that.updates.s(s)
return(s)
})
The result will be a list of the s objects created for each value of v. (The function would have returned s anyway, since it was the last expression it evaluated.)
If you can't afford to create s many times, there are not a lot of options, and it is hard to say more without seeing the object you are operating on. If the object is growing/appending, you could collect the intermediate results and do the appending at the end. If it is actually mutated, you should try to get away from pass-by-value and use reference classes (http://www.inside-r.org/r-doc/methods/ReferenceClasses); the function that modifies the object then becomes a method that you just call n times, as sketched below.
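A minimal sketch of that reference-class idea (the class, field, and method names are made up, and the update body just stands in for some.complex.function.that.updates.s):
# A reference class whose update() method mutates the object in place,
# so s is not copied on every iteration.
ComplexState <- setRefClass("ComplexState",
    fields  = list(data = "numeric"),
    methods = list(
        update = function() {
            data <<- data + 1  # stand-in for the real update logic
        }
    )
)

s <- ComplexState$new(data = 0)
for (i in c(1, 2, 3, 4)) {
    print(i)
    s$update()
}
s$data  # 4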
Is the loop itself really the problem? Or is it rather the time the execution of some.complex.function.that.updates.s needs?
Some R programmers will jump through hoops to avoid loops, but have a look at this example:
f <- function(a) a/1.001
loop <- function(n) { s = (1/f(1)^n); for (i in 1:n) s <- f(s); s}
system.time(loop(1E7))
user system elapsed
7.011 0.030 7.008
This is 0.7 microseconds (on a MacBook Pro) per call of a very trivial function inside a loop.
v = c(1,2,3,4)
s = create.some.complex.object()
lapply(v, function(i){
print(i)
s <<- some.complex.function.that.updates.s(s)
}) |> invisible()
Use of the <<- operator can sometimes get you into trouble and is (somewhat) discouraged, but when I want to mimic a for loop with side-effects this is a pattern I have found useful.
v = c(1,2,3,4)
s = create.some.complex.object()
lapply(v, function(i){
print(i)
assign('s', some.complex.function.that.updates.s(s), envir = .GlobalEnv)
}) |> invisible()
Using assign allows you to avoid the <<- operator, although <<- is noticeably faster than invoking the assign function. For more intensive applications it can be well worth replacing sequential for loops with lapply or vectorized operations, as the median execution time can be several orders of magnitude lower. Here are some toy benchmarks to support this assertion:
v <- c(1, 2, 3, 4)
microbenchmark::microbenchmark({
s <- 1
lapply(v, function(i) {
s <<- s + i
})
}, times = 1e4, unit = 'microseconds')
Median: ~ 4 microseconds
v <- c(1, 2, 3, 4)
microbenchmark::microbenchmark({
s <- 1
for(i in v) {
s <- s + i
}
}, times = 1e4, unit = 'microseconds')
Median: ~ 1488 microseconds
I have code that works perfectly for my purpose: it reads files matching a specific pattern, reads the matrix within each file, and computes something from each file pair; the final output is a matrix whose dimensions equal the number of files. It looks like this:
m <- 100
output <- matrix(0, m, m)
lista <- list.files(pattern = "q")
listan <- as.matrix(lista)
n <- nrow(listan)
for (i in 1:n) {
    AA <- read.table(listan[i, ], header = FALSE)
    A <- as.matrix(AA)
    dVarX <- sqrt(mean(A * A))
    for (j in i:n) {
        BB <- read.table(listan[j, ], header = FALSE)
        B <- as.matrix(BB)
        V <- sqrt(dVarX * (sqrt(mean(B * B))))
        output[i, j] <- (sqrt(mean(A * B))) / V
    }
}
My problem is that it takes a lot of time (I have about 5000 matrices, which means on the order of 5000 x 5000 loop iterations).
I would like to parallelize, but I need some help!
Waiting for your kind suggestions!
Thank you in advance!
Gab
The bottleneck is likely reading from disk. Running code in parallel isn't guaranteed to make things faster. In this case, multiple processes attempting to read from the same disk at the same time is likely to be even slower than a single process.
Since your matrices are being written by another R process, you really should save them in R's binary format. The nested loop spends most of its time reading matrices from disk, so the main way to make your program faster is to make those reads faster.
Here's an example that shows you how much faster it could be:
# make some random data and write it to disk
set.seed(21)
for(i in 0:9) {
m <- matrix(runif(700*700), 700, 700)
f <- paste0("f",i)
write(m, f, 700) # text format
saveRDS(m, paste0(f,".rds")) # binary format
}
# initialize two output objects
m <- 10
o1 <- o2 <- matrix(NA, m, m)
# get list of file names
files <- list.files(pattern="^f[[:digit:]]+$")
n <- length(files)
First, let's run your code using scan, which is already a lot faster than your current solution with read.table.
system.time({
for (i in 1:n) {
A <- scan(files[i],quiet=TRUE)
for (j in i:n) {
B <- scan(files[j],quiet=TRUE)
o1[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
}
}
})
# user system elapsed
# 31.37 0.78 32.58
Now, let's re-run that code using the files saved in R's binary format:
system.time({
for (i in 1:n) {
fA <- paste0(files[i],".rds")
A <- readRDS(fA)
for (j in i:n) {
fB <- paste0(files[j],".rds")
B <- readRDS(fB)
o2[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
}
}
})
# user system elapsed
# 2.42 0.39 2.92
So the binary format is ~10x faster! And the output is the same:
all.equal(o1,o2)
# [1] TRUE
I have a large list of objects (say 100k elements). Each element has to be processed by a function "process", BUT I would like to do the processing in chunks, say 20 passes, because I want to save the processing results to a file on disk and keep working memory free.
I'm new to R and I know that it should involve some apply magic but I don't know how to do it (yet).
Any guidance would be much appreciated.
A small example:
objects <- list();
for (i in 1:100){
objects <- append(objects, 500);
}
objects;
processOneElement <- function(x){
x/20 + 23;
}
I would like to process the first 20 elements in one go and save the results, then process the second 20 elements and save those results, and so on.
objects <- list()
for (i in 1:100) {
    objects <- append(objects, 500)
}
objects

process <- function(x) {
    x/20 + 23
}

# process everything in one go
results <- lapply(objects, FUN=process)

# process in batches of 20 and write each batch to its own file
index <- seq(1, length(objects), by=20)
lapply(index, function(idx1) {
    idx2 <- min(idx1 + 20 - 1, length(objects))
    batch <- lapply(idx1:idx2, function(x) {
        process(objects[[x]])
    })
    write.table(batch, paste("batch", idx1, sep=""))
})
With what you have given, this is the answer I could suggest. Assuming your list is stored in list.object,
lapply(seq(1, length(list.object), by=20), function(idx) {
    # here idx will be 1, 21, 41 etc...
    idx2 <- min(idx + 20 - 1, length(list.object))
    # do what you want here..
    batch.20.processed <- lapply(idx:idx2, function(x) {
        process(list.object[[x]]) # passes idx:idx2 indices one at a time
    })
    # here you have the processed list with (up to) 20 elements
    # finally write each processed element to file
    lapply(seq_along(batch.20.processed), function(x) {
        write.table(batch.20.processed[[x]], ...)
        # where "..." is all other allowed arguments to write.table
        # such as file, row.names, col.names, quote etc.
        # don't literally pass "..." to write.table
    })
})
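Wired up to the small example from the question, the call might look like this (the file names and write.table arguments are just illustrative):
# Apply the chunked pattern above to the question's example objects.
list.object <- objects            # the 100-element list built earlier
process     <- processOneElement  # the per-element function from the question

invisible(lapply(seq(1, length(list.object), by = 20), function(idx) {
    idx2 <- min(idx + 20 - 1, length(list.object))
    batch.20.processed <- lapply(idx:idx2, function(x) process(list.object[[x]]))
    write.table(unlist(batch.20.processed),
                file = paste0("batch", idx, ".txt"),
                row.names = FALSE, col.names = FALSE)
}))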