I frequently use the packages future.apply and future to parallelize tasks in R. This works perfectly well on my local machine. However, if I try to use them on a computer cluster managed by PBS/TORQUE, the job gets killed for violating the resources policy. After reviewing the processes, I noticed that the resources_used.mem and resources_used.vmem reported by qstat are ridiculously high. Is there any way to fix this?
Note: I already know and use the packages batchtools and future.batchtools, but they produce jobs to submit to the queues, which requires organizing the scripts in a particular way, and I would like to avoid that for this specific example.
I have prepared the following MVE. As you can see, the code simply allocates a vector with 10^9 elements, and then performs, in parallel using future_lapply, some operations (here just a trivial check).
library(future.apply)
plan(multicore, workers = 12)
sample <- rnorm(n = 10^9, mean = 10, sd = 10)
print(object.size(sample)/(1024*1024)) # prints the size in MB (~7629); the vector takes ~8 GB of RAM
options(future.globals.maxSize=+Inf)
options(future.gc = TRUE)
future_lapply(
  X = 1:12,
  FUN = function(idx) {
    # just do some stuff
    for (i in sample) {
      if (i > 0) dummy <- 1
    }
    return(dummy)
  },
  future.seed = TRUE
)
If run on my local computer (no PBS/TORQUE involved), this works well (meaning no problems with RAM), assuming 32 GB of RAM are available. However, if run through TORQUE/PBS on a machine that has enough resources, like this:
qsub -I -l mem=60Gb -l nodes=1:ppn=12 -l walltime=72:00:00
the job gets automatically killed for violating the resources policy. I am pretty sure this has to do with PBS/TORQUE not measuring the resources used correctly, since if I check
qstat -f JOBNAME | grep used
I get:
resources_used.cput = 00:05:29
resources_used.mem = 102597484kb
resources_used.vmem = 213467760kb
resources_used.walltime = 00:02:06
This tells me the process is using ~102 GB of mem and ~213 GB of vmem. It is not: you can monitor the node with e.g. htop and see that it uses the expected amount of RAM, but TORQUE/PBS reports much more.
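To see where the inflated numbers most likely come from, one can sum each forked worker's resident set size, which is essentially what a naive per-process accounting does (a rough sketch for a Linux node, assuming ps is available):
rss_kb <- future_lapply(1:12, function(idx) {
  # resident set size of this forked worker, in kB
  as.numeric(system(sprintf("ps -o rss= -p %d", Sys.getpid()), intern = TRUE))
})
sum(unlist(rss_kb)) / 1024^2  # naive sum in GB; counts copy-on-write pages once per worker
Because multicore workers are forks of the master R process, the ~8 GB vector is shared copy-on-write, so summing per-process RSS counts it up to 12 times, which is in the same ballpark as the ~102 GB that qstat reports.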
I have several large R data.frames that I would like to put into a local duckdb database. The problem I am having is that duckdb seems to load everything into memory even though I am specifying a file as the location.
Also, the correct way to establish a connection isn't clear to me (so I'm not sure whether this has something to do with it). I have tried:
duckdrv <- duckdb(dbdir="dt.db", read_only=FALSE)
dkCon <- dbConnect(drv=duckdrv)
and also:
duckdrv <- duckdb()
dkCon <- dbConnect(drv=duckdrv, dbdir="dt.db", read_only=FALSE)
Both work fine, meaning I can create tables, use dbWriteTable, run queries, etc. However, the memory usage is very high (about the same as the size of the data.frames). I think I read somewhere that duckdb defaults to using a certain percentage of the available memory, which won't work for me because the system I am using is a shared resource. I also want to run some queries in parallel, which will drive memory usage even higher.
I have tried this:
dbExecute(dkCon, "PRAGMA memory_limit='1GB';")
but that doesn't seem to make a difference, even if I close the connection, shut down the instance, and reconnect.
Does anyone know how I can fix this problem? RSQLite also has high memory usage temporarily when I am writing data to a table, but it then goes back to normal, and with a read-only connection it isn't an issue at all. I would like to get duckdb working because I think the queries are supposed to be much faster. Any help would be appreciated!
The memory limit can be set using a PRAGMA or SET statement in DuckDB. By default, the limit is 75% of the RAM.
con.execute("PRAGMA memory_limit='200MB'")
OR
con.execute("SET memory_limit='200MB'")
I can confirm that this limit works. However, it is not a hard limit and may sometimes be exceeded depending on the volume of data, the format of the data you are querying (e.g. Parquet from S3), and the type of query; there are certain limitations and constraints around it at the moment.
Below is one example where the volume of plain-text (CSV) data was around 4.23 GB. This data was first loaded into DuckDB, and then some SQL queries were run with memory_limit='200MB'. The screenshot below shows the maximum memory recorded for the Python script.
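For the R setup in the question, the same limit can be applied through DBI; as a further option (worth verifying against the docs for your DuckDB version), a temp_directory can be pointed at disk so that operations exceeding the cap can spill there instead of staying in RAM. A minimal sketch:
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "dt.db", read_only = FALSE)

# Cap DuckDB's memory and give it a place to spill to disk when a query
# would otherwise exceed the cap (check your version's docs for temp_directory).
dbExecute(con, "SET memory_limit = '1GB'")
dbExecute(con, "SET temp_directory = '/tmp/duckdb_spill'")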
Your approach of using the memory_limit pragma is correct, but you may have used an outdated version.
For example, using DuckDB version 0.5.1:
library("DBI")
con = dbConnect(duckdb::duckdb(), dbdir="my-db.duckdb")
dbExecute(conn = con, "PRAGMA memory_limit='500MB'")
dbGetQuery(conn = con, "PRAGMA version")
dbExecute(con, "CREATE TABLE gen AS SELECT * FROM 'gen1GB.csv'")
dbGetQuery(conn = con, "select count(*) from gen")
This outputs for me:
library_version source_id
1 0.5.1 7c111322d
count_star()
1 1e+08
Memory usage stays below 500 MB. On macOS this can be checked using:
ps axu | grep 'lib\/R' | awk '{print $6 " " $11}'
464768 /usr/local/Cellar/r/4.2.1_4/lib/R/bin/exec/R
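The effective limit can also be checked from inside DuckDB itself, assuming the current_setting() function is available in the version you are running:
dbGetQuery(con, "SELECT current_setting('memory_limit') AS memory_limit")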
You can generate a test CSV file using:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 100, size=(100000000, 4)), columns=list('ABCD'))
df.to_csv('gen1GB.csv', index=False)
I have 200,000 links that I am trying to download. I tried downloading them all in one go, but I ran into memory issues.
I am trying to create a function which will download 1000 links at a time and save them in a folder.
Packages:
library(dplyr)
library(purrr)
library(edgarWebR)
A small sample of the data is as follows:
Data 1:
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)
I then apply the following function to download these 10 links
parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))
This stores the results as a nice list; I can then apply names(parsed_files) <- urls_to_parse to name the list elements after the links they were downloaded from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a nice data frame.
Using the data below, how could I create batches to download the data in, say, batches of 10?
What I have currently:
start = 1
end = 100
output <- NULL
output_fin <- NULL
for (i in start:end) {
  output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
  names(output) <- urls_to_parse[start:end]
  save(output_fin, file = paste0("C:/Users/Downloads/data/", i, "output.RData"))
}
I am sure there is a better way to do this with a function, since this code breaks for some of the results.
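To make the goal concrete, this is roughly what I imagine the batching would look like (an untested sketch, reusing parse_filing from edgarWebR and the save path above):
batches <- split(urls_to_parse, ceiling(seq_along(urls_to_parse) / 10))

for (b in seq_along(batches)) {
  parsed <- map(batches[[b]], possibly(parse_filing, otherwise = NA))
  names(parsed) <- batches[[b]]
  save(parsed, file = paste0("C:/Users/Downloads/data/output_batch_", b, ".RData"))
  rm(parsed)
  gc()  # release the parsed batch before starting the next one
}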
More data (100 links):
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
Looping over batches the way you showed is a bad idea. If you have thousands of files to download, how do you recover from errors?
Performance does not depend solely on your computer's configuration; network performance is crucial.
Here are a couple of suggestions.
Option 1
Partition all the URLs into batches so they can be downloaded in parallel. The number of files downloaded at a time could equal the number of cores on your computer. Look at this question: reading multiple files quickly in R
Store these batches in a queue object, for example using a package like dequer (https://cran.r-project.org/web/packages/dequer/dequer.pdf)
Pop the queue and pass the batch of URLs to your parallel file-download function.
Use a retryable file-download function like in HTTP error 400 in R, error handling, How to retry instead of forcing to stop?
Once the queue is completed, move on to the next partition.
Wrap the whole operation in a retryable loop. For example: How to retry a statement on error?
Why use a queue? Because you can easily retry on error.
Some pseudocode:
file_url_partitions <- partition_as_batches(all_urls, batch_size)
attempt <- 1
while (file_url_partitions is not empty && attempt <= 3) {
  batch <- file_url_partitions.pop()
  tryCatch({
    download_parallel(batch)
  }, error = function(e) {
    file_url_partitions.push(batch)
    attempt <- attempt + 1
  })
}
Note: I don't have access to an R environment right now, so I have no way to try this.
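Here is a minimal runnable sketch of the same idea in R, using a plain list as the queue and a hypothetical download_batch() helper built on download.file(); it assumes all_urls and batch_size are defined as in the pseudocode, and parse_filing() could be substituted for the downloader:
# Runnable sketch of the queue-with-retries idea. download_batch() is a
# hypothetical helper; swap in parse_filing() or a parallel downloader as needed.
download_batch <- function(urls, dest_dir = "downloads") {
  dir.create(dest_dir, showWarnings = FALSE)
  for (u in urls) {
    download.file(u, destfile = file.path(dest_dir, basename(u)), quiet = TRUE)
  }
}

queue   <- split(all_urls, ceiling(seq_along(all_urls) / batch_size))
attempt <- 1

while (length(queue) > 0 && attempt <= 3) {
  batch <- queue[[1]]
  queue <- queue[-1]
  tryCatch(
    download_batch(batch),
    error = function(e) {
      # put the failed batch back at the end of the queue and count the retry
      queue[[length(queue) + 1]] <<- batch
      attempt <<- attempt + 1
    }
  )
}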
Option 2
Download the files separately using a download manager or a similar tool, and then work with the downloaded files.
Some useful resources:
https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html
Because I constantly hit the memory limit in my R session (8 GB Windows PC), I have started removing the big objects I load. However, once I reach this limit, removing objects no longer seems to help.
So, I was wondering if there is a way to get the size of the whole R session. I know it's possible to retrieve an individual object's size (as seen in this thread), but I want to know if there is a way to measure the complete R session size (loaded packages, objects, etc.).
Thank you!
I personally use this function to get the available memory:
getAvailMem <- function(format = TRUE) {
  gc()
  if (Sys.info()[["sysname"]] == "Windows") {
    # memory.limit()/memory.size() are Windows-only (and stubs on R >= 4.2)
    memfree <- 1024^2 * (utils::memory.limit() - utils::memory.size())
  } else {
    # http://stackoverflow.com/a/6457769/6103040
    memfree <- 1024 * as.numeric(
      system("awk '/MemFree/ {print $2}' /proc/meminfo", intern = TRUE))
  }
  `if`(format, format(structure(memfree, class = "object_size"),
                      units = "auto"), memfree)
}
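Usage is then simply (the value returned will of course depend on the machine):
getAvailMem()       # formatted, e.g. "12.4 Gb"
getAvailMem(FALSE)  # raw number of bytes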
To get the total memory used by R, you may try mem_used() from the pryr package. Unlike memory.size, it is not OS-dependent, because it uses the R function gc() underneath. Take a look at the function body, and also at pryr:::node_size and pryr:::show_bytes.
pryr::mem_used()
The help file ?pryr::mem_used describes
R breaks down memory usage into Vcells (memory used by vectors) and
Ncells (memory used by everything else). However, neither this
distinction nor the "gc trigger" and "max used" columns are typically
important. What we're usually most interested in is the first
column: the total memory used. This function wraps around gc() to
return the total amount of memory (in megabytes) currently used by R.
You can also use pryr::mem_change to track changes in the memory used by R code. Try the example on its documentation page.
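For instance (the exact numbers will vary by session):
library(pryr)

mem_used()                      # total memory currently used by R
mem_change(x <- numeric(1e7))   # roughly +80 MB for 1e7 doubles
mem_change(rm(x))               # roughly the same amount freed again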
The numbers such as 28L and 56L used for the node size in pryr:::node_size come from the help file of ?gc, which describes:
gc returns a matrix with rows "Ncells" (cons cells), usually 28 bytes
each on 32-bit systems and 56 bytes on 64-bit systems, and "Vcells"
(vector cells, 8 bytes each),
After removing a large object, run gc() to free the memory.
I'm trying to query a large database and then save the returned result as an ff object using the read.odbc.ffdf function. The following code should allow me to pull in a thousand rows at a time, save the data as an ff object, and then move onto the next thousand rows, appending these to the previous file, so that I can preserve memory:
library(ff);
library(ffbase);
library(ETLUtils);
library(RODBC);
sqlcode <- "SELECT f.*
FROM table1 AS f;";
data <- read.odbc.ffdf(query = sqlcode,
                       odbcConnect.args = list('data1; db=research'),
                       nrows = 1000,
                       next.rows = 1000,
                       BATCHBYTES = 100000);
dim(data);
However, when I do this, R eats up all my RAM, and the process eventually terminates before the object "data" is completely filled. Inspecting "data" reveals the following error message:
read.odbc.ffdf 1.. () odbc-read=822.06secError in if (nrow(dat) == 0) { : argument is of length zero
Any ideas what this error message means? How can I query this database without exhausting my memory (4 GB RAM)? I expect the option "BATCHBYTES", in combination with "first.rows" and "next.rows", to keep my memory usage low (within 100,000 bytes per batch, which my system should handle easily).
Am I just not understanding how the function read.odbc.ffdf works with the first.rows, next.rows, and BATCHBYTES options?
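For reference, this is how I understood the chunking options are meant to be used (untested sketch; parameter names such as odbcConnect.args, first.rows and next.rows are taken from my reading of the ETLUtils documentation, so they should be verified against ?read.odbc.ffdf):
data <- read.odbc.ffdf(
  query = sqlcode,
  odbcConnect.args = list('data1; db=research'),
  first.rows = 1000,   # size of the first chunk, used to set up the ffdf columns
  next.rows  = 1000,   # size of every subsequent chunk
  VERBOSE    = TRUE    # print progress per chunk, as in the log line above
)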