tm crashing R when converting VCorpus to corpus

I am using Windows 7 with a 32-bit operating system and 4 GB of RAM, of which only 3 GB is accessible due to 32-bit limitations. I shut everything else down and can see that I have about 1 GB cached and 1 GB available before starting. The "free" memory varies but is sometimes 0.
Using tm, I successfully create a 517 MB VCorpus from three .txt documents in a SwiftKey dataset. When I attempt the next step of converting it to a "corpus" using the tm::Corpus() command, I get an error. Code and output follow:
library(tm)
cname <- file.path("./final/en_US/")
docs <- Corpus(DirSource(cname))   # builds the corpus from the three .txt files
myCorpus <- tm::Corpus(docs)       # R crashes on this second call
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
...and R terminates. Any ideas how to prevent this?
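A minimal sketch of two things worth trying, assuming the goal is just to get the text loaded: skip the second Corpus() call, since docs is already a corpus, or build a disk-backed PCorpus so the documents are not all held in RAM. The database name below is a placeholder.
library(tm)
library(filehash)   # PCorpus stores documents on disk via filehash

cname <- file.path("./final/en_US/")

# Option 1: docs is already a corpus; no second Corpus() call is needed.
docs <- Corpus(DirSource(cname))

# Option 2: a permanent (disk-backed) corpus keeps the documents out of RAM.
pdocs <- PCorpus(DirSource(cname),
                 dbControl = list(dbName = "swiftkey.db", dbType = "DB1"))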

Related

Freeing up resources while running loops in h2o

I am running a loop that uploads a CSV file from my local machine, converts it to an H2O frame, and then runs an H2O model. I then remove the H2O frame from my R environment and the loop continues. These frames are massive, so I can only have one loaded at a time (hence removing it from my environment).
My problem is that H2O creates temporary files which quickly max out my memory. I know I can restart my R session, but is there another way to flush this out in code so my loop can run happily? When I look at my task manager, all my memory is taken up by Java(TM) Platform SE Binary.
Removing the object from the R session using rm(h2o_df) will eventually trigger garbage collection in R, and the delete will be propagated to H2O. I don't think this is ideal, however.
The recommended way is to use h2o.rm, or for your particular use case it seems like h2o.removeAll would be best (it takes care of everything: models, data, etc.).
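A minimal sketch of what that cleanup could look like inside such a loop; the file list, response column, and model call are placeholders rather than the original code.
library(h2o)
h2o.init()

csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

for (f in csv_files) {
  h2o_df <- h2o.importFile(f)                               # load one file into the H2O cluster
  model  <- h2o.gbm(y = "target", training_frame = h2o_df)  # placeholder model
  # ... score / save results here ...
  h2o.rm(h2o_df)          # free the frame on the H2O side right away
  rm(h2o_df); gc()        # drop the R handle as well
}

# Or wipe frames and models between iterations:
# h2o.removeAll()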

RStudio Connect - Intermittent C stack usage errors when downloading from Snowflake database

I'm having an intermittent issue when downloading a fairly large dataset from a Snowflake database view. The dataset is about 60 million rows. Here is the code used to connect to Snowflake and download the data:
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)
library(data.table)

# Connect to Snowflake over ODBC
sf_db <- dbConnect(odbc::odbc(),
                   Driver = "SnowflakeDSIIDriver",
                   AUTHENTICATOR = "SNOWFLAKE_JWT",
                   Server = db_param_snowflake$server,
                   Database = db_param_snowflake$db,
                   PORT = db_param_snowflake$port,
                   Trusted_Connection = "True",
                   uid = db_param_snowflake$uid,
                   db = db_param_snowflake$db,
                   warehouse = db_param_snowflake$warehouse,
                   PRIV_KEY_FILE = db_param_snowflake$priv_key_file,
                   PRIV_KEY_FILE_PWD = db_param_snowflake$priv_key_file_pwd,
                   timeout = 20)

# Pull the whole view into memory and convert it to a data.table
snowflake_query <- 'SELECT "address" FROM ABC_DB.DEV.VW_ADDR_SUBSET'
my_table <- tbl(sf_db, sql(snowflake_query)) %>%
  collect() %>%
  data.table()
About 50% of the time, this part of the script runs fine. When it fails, the RStudio Connect logs contain messages like this:
2021/05/05 18:17:51.522937848 Error: C stack usage 940309959492 is too close to the limit
2021/05/05 18:17:51.522975132 In addition: Warning messages:
2021/05/05 18:17:51.523077000 Lost warning messages
2021/05/05 18:17:51.523100338
2021/05/05 18:17:51.523227401 *** caught segfault ***
2021/05/05 18:17:51.523230793 address (nil), cause 'memory not mapped'
2021/05/05 18:17:51.523251671 Warning: stack imbalance in 'lazyLoadDBfetch', 113 then 114
To try to get this working consistently, I have tried a process that downloads rows in batches; that also fails intermittently, usually after downloading many millions of records. I have also tried connecting with pool and downloading, which also works only sometimes. I tried dbGetQuery as well, with the same inconsistent results.
I have Googled this extensively and found threads about C stack errors and recursion, but those problems seemed to be consistent (unlike this one, which works sometimes), and I'm not sure what I could do if some recursive process is running as part of this download.
We are running this on a Connect server with 125 GB of memory, and at the time this script runs there are no other scripts running; at least according to the Admin screen that shows CPU and memory usage, this script doesn't use more than 8-10 GB before it (sometimes) fails. As for when it succeeds and when it fails, I haven't noticed any pattern. I can run it now and have it fail, then immediately run it again and have it work. When it succeeds, it takes about 7-8 minutes. When it fails, it generally fails after anywhere from 3 to 8 minutes. All packages are the newest versions, and this has always been inconsistent, so I cannot think of anything to roll back.
Any ideas for troubleshooting, or alternate approach ideas, are welcome. Thank you.
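One possible shape for the batched approach mentioned above, reusing the sf_db connection and fetching the 60 million rows in chunks with DBI's dbSendQuery()/dbFetch() instead of a single collect(); the chunk size is a guess and would need tuning.
library(DBI)
library(data.table)

res <- dbSendQuery(sf_db, 'SELECT "address" FROM ABC_DB.DEV.VW_ADDR_SUBSET')

chunks <- list()
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 500000)      # fetch 500k rows at a time
  chunks[[length(chunks) + 1]] <- chunk
}
dbClearResult(res)

my_table <- rbindlist(chunks)            # combine the chunks into one data.table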

R session terminated data lost

I collected a 2.5 GB matrix with R, and upon its completion I accidentally passed a View command to RStudio:
View(Matrix)
RStudio hung and I had to force quit it. I lost all the data. Is there any possibility that R stored some of the data somewhere? If yes, where could I find it? I am using a Mac.

Non-blocking download using curl in R

I am writing some code where I download many pages of data from a web API, do some processing, and combine them into a data frame. The API takes ~30 seconds to respond to each request, so it would be convenient to send the request for the next page while processing the current page. I can do this using, e.g., mcparallel, but that seems like overkill. The curl package claims that it can make non-blocking connections, but this does not seem to work for me.
From vignette("intro", "curl"):
As of version 2.3 it is also possible to open connections in non-blocking mode. In this case readBin and readLines will return immediately with data that is available without waiting. For non-blocking connections we use isIncomplete to check if the download has completed yet.
con <- curl("https://httpbin.org/drip?duration=1&numbytes=50")
open(con, "rb", blocking = FALSE)
while (isIncomplete(con)) {
  buf <- readBin(con, raw(), 1024)
  if (length(buf))
    cat("received: ", rawToChar(buf), "\n")
}
close(con)
The expected result is that open() should return immediately and then 50 asterisks should be printed progressively over one second as the results come in. For me, open() blocks for about a second, and then the asterisks are printed all at once.
Is there something else I need to do? Does this work for anyone else?
I am using R version 3.3.2, curl package version 3.1, and libcurl3 version 7.47.0 on Ubuntu 16.04 LTS. I have tried in RStudio and the command line R console, with the same results.
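A minimal sketch of an alternative using curl's multi interface (curl_fetch_multi() plus multi_run()), which queues requests and delivers results through callbacks rather than a blocking connection; the URLs below are placeholders.
library(curl)

pages <- c("https://httpbin.org/delay/2?page=1",
           "https://httpbin.org/delay/2?page=2")

results <- list()

# Queue each request; the 'done' callback fires as each download finishes,
# so per-page processing can happen there while other transfers continue.
for (url in pages) {
  curl_fetch_multi(url, done = function(res) {
    results[[res$url]] <<- rawToChar(res$content)
  })
}

# Drive all queued transfers; returns once the pool is drained.
multi_run()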

Every time I source my R script it leaks a db connection

I cannot paste the entire script here, but I will explain the situation. If you have ever had leaked DB connections, you will know what I am talking about.
I have an R script file with many functions (around 50) that use DB connections via the DBI and RMySQL packages. I have consolidated all DB access through 4 or 5 functions. I use on.exit(dbDisconnect(db)) in every single function where dbConnect is used.
I discovered that just loading this script with source("dbscripts.R") causes one DB connection to leak. I see this when I run the command:
dbListConnections(MySQL())
[[1]]
<MySQLConnection:0,607>
[[2]]
<MySQLConnection:0,608>
[[3]]
<MySQLConnection:0,609>
[[4]]
<MySQLConnection:0,610>
I see one more DB connection added to the list every time. This quickly reaches 16 and my script stops working.
The problem is that I am unable to find out which line of code is causing the leak.
I have checked each dbConnect line in the code. All of them are within functions, and no dbConnect happens outside them in the main code.
So, why is the connection leak occurring?
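A minimal sketch of how one might confirm where the extra connection comes from; the accessor function, credentials, and query are placeholders, not the original code.
library(DBI)
library(RMySQL)

# The pattern described above: the connection closes even if the body errors.
get_rows <- function(query) {                 # hypothetical accessor function
  db <- dbConnect(MySQL(), dbname = "mydb")   # placeholder credentials
  on.exit(dbDisconnect(db), add = TRUE)
  dbGetQuery(db, query)
}

# Confirm the leak happens at load time rather than when a function is called:
length(dbListConnections(MySQL()))   # count before
source("dbscripts.R")
length(dbListConnections(MySQL()))   # count after; if it went up by one, some
                                     # top-level expression in dbscripts.R runs
                                     # dbConnect while the file is being sourced.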
