write.table(...,append=T) : Cannot open the connection - r

I'm wondering if anyone else has ever encountered this problem. I'm writing a fairly small amount of data to a csv file. It's about 30 lines, 50 times.
I'm using a for loop to write data to the file.
It seems "finicky": sometimes the operation completes successfully, other times it stops after the first ten writes (300 lines), other times after 3 or 5... telling me
"cannot open connection".
I imagine it is some type of timeout. Is there a way to tell R to "slow down" when writing tables?
Before you ask: there's just too much code to provide an example here.

Code would help, despite your objections. R has a fixed-size connection pool and I suspect you are running out of connections.
So make sure you follow the three steps of
open the connection (and check for errors as a bonus)
write using the connection
close the connection

I can't reproduce it with R 2.11.1 32-bit on Windows 7 64-bit. For this kind of thing, please provide more info on your system (see e.g. ?R.version, ?Sys.info).
Memory is a lot faster than disk access. 1500 lines are easily manageable in memory and can be written to file in one go. If it's different sets of data, add an extra factor variable indicating the set (set1 to set50). All your data fits comfortably in one data frame, and you avoid having to access the disk many times.
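A minimal sketch of that approach, where makeChunk() is just a placeholder for however each ~30-line chunk is produced (names and file path are assumptions, not your code):
chunks <- vector("list", 50)
for (i in 1:50) {
  chunk <- makeChunk(i)                     # placeholder: build one ~30-line chunk
  chunk$set <- paste("set", i, sep = "")    # factor variable identifying the set
  chunks[[i]] <- chunk
}
allData <- do.call(rbind, chunks)           # one data frame holding all ~1500 lines
write.table(allData, file = "C:/Mydir/Myfile.csv", sep = ",", row.names = FALSE)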
In case it really is for 50 files, this code illustrates the valuable advice of Dirk :
for(i in 1:50){
  ...                                              # build myData for this iteration
  ff <- file("C:/Mydir/Myfile.txt", open = "at")   # open the connection in append mode
  write.table(myData, file = ff)                   # consider col.names = FALSE after the first write
  close(ff)                                        # always close the connection again
}
See also the help: ?file
EDIT: you should use open="at" instead of open="wt". "at" is append mode; "wt" is write mode, which truncates the file each time it is opened. append=TRUE in write.table is the equivalent of open="at".
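For reference, a minimal sketch of the append=TRUE form the question used, which lets write.table manage the connection itself (file path reused from above):
write.table(myData, file = "C:/Mydir/Myfile.txt", append = TRUE,
            col.names = FALSE)   # col.names = FALSE avoids re-writing the header on each append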

Related

Issue with applying str_length to a dataframe

I created a simple R Script that is run on a monthly basis by colleagues.
This script brings in a fairly chunky RDS file that has around 2.6M observations and 521 variables.
Against this file the following two commands are run:
Latest$MFU <- substr(Latest$SUB_BUSINESS_UNIT_CODE, 1, 2)
Latest$LENGTH <- str_length(Latest$POLICYHOLDER_COMPANY_NAME_LAST_NAME)
This script has run perfectly for the last three years, but today, for some reason, it is failing for all three people tasked with running it, and it has indeed fallen over for me too.
The error message received is
Error: cannot allocate vector of size 10.0 Mb
At first I assumed that their computers were running out of memory, or that they were not using 64-bit R, or some other reason such as not restarting their computers, etc.
It turns out, though, that they have plenty of memory available, have restarted their computers, are using 64-bit R in RStudio, and are all on different versions of RStudio/R.
I tried running the process myself; my computer has 32 GB of RAM and 768 GB of hard drive space free. I am getting the same error message.
So, it must be a corrupt source file, I figured. I tried last month's file, which ran just fine for everyone last month: same error.
Maybe just try the stringr package instead then, and move around the problem that way. Nope, no dice, exact same error message.
I have to admit I'm stumped. I have tried gc(), tried previous versions of the file, tried cutting the file in half and running it that way; it just flat out refuses to run.
Does anyone know of an alternative to the stringr/base R commands for getting the length of a character string as a new variable and getting a substring as a new variable?
What about rm(list=ls()) before running, and memory.limit(size = 16265*4) (or another big number) ?
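A minimal sketch combining that suggestion with base-R equivalents of the two stringr calls (column names copied from the question; the RDS path is a placeholder, and memory.limit() is Windows-only and no longer has an effect on recent R versions):
rm(list = ls())                    # clear the workspace before loading the RDS
gc()                               # hand freed memory back to the OS
# memory.limit(size = 16265 * 4)   # Windows-only; a no-op on recent R versions

Latest <- readRDS("latest.rds")    # placeholder path for the chunky RDS file
Latest$MFU    <- substr(Latest$SUB_BUSINESS_UNIT_CODE, 1, 2)
Latest$LENGTH <- nchar(Latest$POLICYHOLDER_COMPANY_NAME_LAST_NAME)   # base-R stand-in for str_length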

ROracle Connection on Worker Nodes // Automated reporting with R Markdown

I am running into several distinct problems while trying to speed up automated report generation for a large dataset. I'm using R + markdown -> HTML to generate a report and looping over ~10K distinct groupings, accessing the data from Oracle.
The system consists mainly of two parts:
a main script
a markdown template file
The main script sets up the computing environment and parallel processing backends:
library(ROracle)
library(doParallel)   ## ...etc
....
cl <- makeCluster(4)
clusterEvalQ(cl, con <- dbConnect(db, un, pw))   ## pseudocode...
Here the first issue appears: R throws an exception stating that the connections on the workers are invalid, BUT when I monitor live sessions on Oracle they appear to be fine...
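For reference, a minimal, hedged sketch of one way to set up and sanity-check per-worker connections (driver and credentials are placeholders, not the poster's actual setup):
library(ROracle)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

clusterEvalQ(cl, {
  library(ROracle)
  drv <- dbDriver("Oracle")
  con <- dbConnect(drv, "username", "password")   # placeholder credentials
  dbGetQuery(con, "select 1 from dual")           # trivial query: an error here means the connection really is bad
})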
Next, the main script calls the loop for report generation.
foreach(i = 1:nrow(reportgroups), .packages = c('ROracle', 'ggplot2', 'knitr')) %dopar% {   ## ...etc
  rmarkdown::render("inputfile.Rmd", output_file = "outputfile.html",
                    params = list(groupParam1[i], groupParam2[i]))   ## etc
}
If I run the foreach loop sequentially, i.e. %do% instead of %dopar%, everything seems to work fine. No errors, and the entire set runs correctly (I have only tested up to ~400 groups; I will do a full run of all 10k overnight).
However, if I attempt to run the loop in parallel, 'pandoc' invariably throws error 1 when converting the file. If I run the broken loop multiple times, the 'task' in the loop (or the cluster; I'm not sure what 'task' refers to in this context) which causes the error changes.
The template file is pretty basic: it takes in parameters for the groups, runs an SQL query on the connection defined for the cluster worker, and uses ggplot2 + dplyr to generate results. Since the template works when not run through a cluster, I believe the problem must have something to do with the connection objects on the cluster nodes from ROracle, although I don't know enough about the subject to really pinpoint the problem.
If anyone has had a similar experience, or has a hunch about what is going on, any advice would be appreciated!
Let me know if I can clarify anything...
Thanks
The problem has been solved using a variety of hacks. First, the main loop was rewritten to pull data into the local R session instead of running SQL within the markdown report. This seemed to cure some of the collisions between connections to the database, but did not fix it entirely. So I added some tryCatch() and repeat functionality to the functions that query data or attempt to connect to the DB. I highly recommend that anyone having issues with ROracle on a cluster implements something similar, and spends the time to review the error messages to see what exactly is going on.
Next, the issues with pandoc conversion. Most of the problems were solved by fixing a join to a deprecated table in the SQL (the old table lacked data for some groups and thus pulled in no rows, so the report would not generate). However, some issues still remained where rmarkdown::render() would run successfully but the actual output would be empty or broken. I have no idea what caused this, but the way I solved it was to compare the file sizes of the generated reports (empty ones were ~300 KB, completed ones 400+ KB) and re-run the report generation for those on a single machine after the cluster was shut down. This seemed to clean up the last of the issues.
In summary:
Previously: reports generated, but with significant issues and incomplete data within.
Fixes:
For the cluster, ensure multiple attempts to connect to the DB in case of an issue during the run. Don't forget to close the connection after grabbing the data. In ROracle, creating the driver, i.e.,
driver <- dbDriver("Oracle")
is the most time-consuming part of connecting to a database, but the driver can be reused across multiple connections in a loop, e.g.,
## Create the driver outside the loop, and reuse it inside
for(i in 1:n){
  con  <- dbConnect(driver, 'username', 'password')
  data <- dbGetQuery(con, 'Select * from mydata')
  dbDisconnect(con)   # close the connection as soon as the data has been pulled
  ...                 ## do something with data
}
is much faster than calling
dbConnect(dbDriver("Oracle"), 'username', 'password')
inside the loop
Wrap connection / SQL attempts inside a function that implements tryCatch and repeat functionality in case of errors, e.g. the sketch below.
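A minimal sketch of such a retry wrapper, reusing the driver from above; the function name, credentials and max_tries limit are placeholders, not the actual code used:
connectWithRetry <- function(driver, user, pass, max_tries = 5) {
  attempt <- 0
  repeat {
    attempt <- attempt + 1
    con <- tryCatch(dbConnect(driver, user, pass),
                    error = function(e) NULL)   # swallow the error and signal failure with NULL
    if (!is.null(con)) return(con)              # success
    if (attempt >= max_tries) stop("could not connect after ", max_tries, " attempts")
    Sys.sleep(2)                                # brief pause before retrying
  }
}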
Wrap calls to rmarkdown::render() with tryCatch and repeat functionality, and log status, file size, file name, file location, etc.
Load in the logfile created from the rmarkdown::render() calls and find outlier file sizes (in my case, simply Z-scores of file size, filtering those below -3) to identify reports where issues still exist. Fix them and rerun on a single worker; see the sketch below.
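A minimal sketch of that file-size check; the directory and column names are placeholders for whatever the logging step records:
report_log <- data.frame(file = list.files("reports", full.names = TRUE),
                         stringsAsFactors = FALSE)
report_log$size <- file.size(report_log$file)   # bytes on disk per rendered report
z <- (report_log$size - mean(report_log$size)) / sd(report_log$size)
bad <- report_log$file[z < -3]   # suspiciously small reports: likely empty/broken, re-render these sequentially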
Overall, ~4500 reports were generated from a subset of my data; the time on a 4-core i5 @ 2.8 GHz is around 1 hour. The total number of 'bad' reports generated is around 1% of the total.

R console unexpectedly slow, lags long after job (PDF output) is finished

When I run a large R script (which works nicely as expected and produces a correct PDF at the end; base plotting plus beeswarm, the last line of the script is dev.off()), I notice that the PDF is finished after ~3 seconds and can even be opened in other applications, long before the console output (merely a few integer values and the echo of ~400 lines of code) is finished (~20 seconds). No errors are reported. In between, the echo stops and does nothing for seconds.
I work with R Studio V0.97.551, R version 3.0.1, on Win-7.
gc() or closing and restarting R did not help, and the data structures used are not big anyway (5 data frames with up to 60 obs and 64 numeric or short character variables). The available memory should be sufficient (according to the Task Manager, around 4 GB throughout), but the CPU is busy during that time.
I agree this is not reproducible for other people without the script, which is however too large to post, but maybe someone has experienced the same problem, or has an explanation or a suggestion of what to check? Thanks in advance!
EDIT:
I ran exactly the same code directly in R 3.0.1 (without RStudio), and the problem was gone, suggesting the problem is related to RStudio. I added the rstudio tag, but I am not sure if I am now supposed to move this question somewhere else?
Recently I came across a similar problem: running from RStudio becomes very slow, even when it is executing something as simple as example('plot'). After searching around, this post pointed me to the right place, which eventually led to a workaround: resetting RStudio by renaming the "RStudio-Desktop Directory". The exact way to do so depends on the OS you are using, and you can find the detailed instructions here. I just tried it, and it works.

Why does loading saved R file increase CPU usage?

I have an R script that I want to run frequently. A few months ago, when I wrote it and first ran it, there was no problem.
Now my script is consuming almost all (99%) of the CPU and it's slower than it used to be. I am running the script on a server, and other users experience slow response from the server while the script is running.
I tried to find the piece of code where it is slow. The following loop takes almost all of the time and CPU used by the script.
for (i in 1:100){
  load(paste(saved_file, i, ".RData", sep = ""))   # load the i-th saved object
  ## do something (which is fast)
  assign(paste("var", i, sep = ""), vector)
}
The loaded data is about 11 MB in each iteration. When I run the above code for an arbitrary i, the file-loading step takes longer than the other commands.
I spent a few hours reading forum posts but could not get any hint about my problem. It would be great if you could point out what I am missing, or suggest a more efficient way to load a file in R.
EDIT: Added spaces in the code to make it easier to read.
paste(saved_file, i, ".RData", sep = "")
Loads an object at each iteration, with the name xxx1, xxx2, and so on.
Did you try to rm() the object at the end of the loop? I guess the object stays in memory, regardless of your variable being reused.
Just a tip: add spaces in your code (like I did), it's much easier to read/debug.
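A minimal sketch of that suggestion, keeping the placeholder names from the question; note that load() returns the name(s) of the object(s) it restored, which makes the clean-up easy:
for (i in 1:100) {
  loaded <- load(paste(saved_file, i, ".RData", sep = ""))   # 'loaded' holds the restored object's name
  ## ... do something (which is fast) ...
  assign(paste("var", i, sep = ""), vector)
  rm(list = loaded)   # drop the loaded object before the next iteration
}
gc()                  # optionally return the freed memory to the OS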

Speed up RData load

I've checked several related questions, such as this one:
How to load data quickly into R?
I'm quoting a specific part of the most upvoted answer:
It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your harddrive, not R. The binary form is the internal representation of the dataframe in the workspace, so there is no transformation needed anymore
I really believed that. However, life is about experimenting. I have a 1.22 GB file containing an igraph object. That said, I don't think what I found here is related to the object class, mainly because you can load('file.RData') even before you call library().
The disks in this server are pretty fast, as you can see from the time to read the file into memory:
user@machine data$ pv mygraph.RData > /dev/null
1.22GB 0:00:03 [ 384MB/s] [==================================>] 100%
However when I load this data from R
>system.time(load('mygraph.RData'))
user system elapsed
178.533 16.490 202.662
So it seems loading *.RData files is 60 times slower than the disk limit, which means R actually does something during load().
I've had the same feeling using different R versions on different hardware; it's just that this time I had the patience to do the benchmarking (mainly because, with such fast disk storage, it was terrible how long the load actually takes).
Any ideas on how to overcome this?
After the ideas in the answers:
save(g,file="test.RData",compress=F)
Now the file is 3.1 GB, against 1.22 GB before. In my case, loading the uncompressed file is a bit faster (the disk is not my bottleneck by far).
> system.time(load('test.RData'))
user system elapsed
126.254 2.701 128.974
Reading the uncompressed file into memory takes about 12 seconds, so I confirm most of the time is spent in setting up the environment.
I'll be back with RDS results, that sounds interesting.
Here we are, as promised:
system.time(saveRDS(g,file="test2.RData",compress=F))
user system elapsed
7.714 2.820 18.112
And I get a 3.1 GB file, just like with save() uncompressed, although the md5sum is different, probably because save() also stores the object name.
Now reading...
> system.time(a<-readRDS('test2.RData'))
user system elapsed
41.902 2.166 44.077
So combining both ideas (no compression and RDS) makes loading 5 times faster. Thanks for your contributions!
save compresses by default, so it takes extra time to uncompress the file. Then it takes a bit longer to load the larger file into memory. Your pv example is just copying the compressed data to memory, which isn't very useful to you. ;-)
UPDATE:
I tested my theory and it was incorrect (at least on my Windows XP machine with a 3.3 GHz CPU and 7200 RPM HDD). Loading compressed files is faster (probably because it reduces disk I/O).
The extra time is spent in RestoreToEnv (in saveload.c) and/or R_Unserialize (in serialize.c). So you could make loading faster by changing those files, or maybe by using saveRDS to individually save the objects in myGraph.RData and then somehow using readRDS across multiple R processes to load the data into shared memory...
For variables that big, I suspect that most of the time is taken up inside the internal C code (http://svn.r-project.org/R/trunk/src/main/saveload.c). You can run some profiling to see if I'm right. (All the R code in the load function does is check that your file is non-empty and hasn't been corrupted.)
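For what it's worth, a minimal profiling sketch using base R's Rprof; if the time really is inside load()'s internal C code, essentially all samples will simply be attributed to load itself (the output file name is arbitrary):
Rprof("load_profile.out")          # start the sampling profiler
load("mygraph.RData")
Rprof(NULL)                        # stop profiling
summaryRprof("load_profile.out")   # summarise where the time went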
As well as reading the variables into memory, they (amongst other things) need to be stored inside an R environment.
The only obvious way of getting a big speedup in loading variables would be to rewrite the code in a parallel way to allow simultaneous loading of variables. This presumably requires a substantial rewrite of R's internals, so don't hold your breath for such a feature.
The main reason why RData files take a while to load is that the de-compression step is single-threaded.
The fastSave R package allows using parallel tools for saving and restoring R sessions:
https://github.com/barkasn/fastSave
But it only works on UNIX (you should still be able to open the files on other platforms, though).
