How do I limit the memory usage of duckdb in R?

How do I limit the memory usage of duckdb in R? - r

I have several large R data.frames that I would like to put into a local duckdb database. The problem I am having is duckdb seems to load everything into memory even though I am specifying a file as the location.
Also, it isn't clear to me the correct way to establish a connection (so I'm not sure if this has something to do with it). I have tried:
duckdrv <- duckdb(dbdir="dt.db", read_only=FALSE)
dkCon <- dbConnect(drv=duckdrv)
and also:
duckdrv <- duckdb()
dkCon <- dbConnect(drv=duckdrv, dbdir="dt.db", read_only=FALSE)
Both work fine, meaning I can create tables, use dbWriteTable, run queries, etc. However, the memory usage is very high (about the same size as the data.frames). I think I read somewhere that duckdb defaults to using a certain % of the available memory which won't work for me because the system that I am using is a shared resource. I also want to run some queries in parallel which will drive memory usage even higher.
I have tried this:
dbExecute(dkCon, "PRAGMA memory_limit='1GB';")
but that doesn't seem to make a difference, even if I close the connection, shutdown the instance and reconnect.
Does anyone know how I can fix this problem? RSQLite, also has high memory usage temporarily when I am writing data to a table but then it goes back to normal and if I open a read only connection it isn't an issue at all. I would like to get duckdb working because I think the queries are supposed to be much faster. Any help would be appreciated!

Memory limit can be set using PRAGMA or SET statement in DuckDB. By default, 75% of the RAM is the limit.
con.execute("PRAGMA memory_limit='200MB'")
OR
con.execute("SET memory_limit='200MB'")
I can confirm that this limit works. However this is not a hard limit and might get exceeded sometimes based on the volume of data, format of data your are querying(eg: parquet from s3), type of query - certain limitations or certain constraints around it at the moment.
Below is one of the examples where the volume of data in plain text(csv) was around 4.23 GB. This data was first loaded into DuckDB and then some SQL queries were run by setting the memory_limit='200MB'. Below screenshot shows max recorded memory used by the py script.

Your approach is correct - using memory_limit pragma, but you used an outdated version.
For example, using DuckDb version 0.5.1:
library("DBI")
con = dbConnect(duckdb::duckdb(), dbdir="my-db.duckdb")
dbExecute(conn = con, paste0("PRAGMA memory_limit='500MB'"))
dbGetQuery(conn = con, "PRAGMA version")
dbExecute(con, "CREATE TABLE gen AS SELECT * FROM 'gen1GB.csv'")
dbGetQuery(conn = con, "select count(*) from gen")
This outputs for me:
library_version source_id
1 0.5.1 7c111322d
count_star()
1 1e+08
Memory usage is less than 500MB. On MacOs can be checked using:
ps axu | grep 'lib\/R' | awk '{print $6 " " $11}'
464768 /usr/local/Cellar/r/4.2.1_4/lib/R/bin/exec/R
You can generate a test csv-file using:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 100, size=(100000000, 4)), columns=list('ABCD'))
df.to_csv('gen1GB.csv', index=False)

Related

How to analyze a MariaDB crash log?

When I created a partition for an existing table, I got an exception stack below:
Thread pointer: 0x1ce1ecc6ab8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
mysqld.exe!ha_partition::read_par_file()[ha_partition.cc:3021]
mysqld.exe!ha_partition::get_from_handler_file()[ha_partition.cc:3161]
mysqld.exe!ha_partition::initialize_partition()[ha_partition.cc:512]
mysqld.exe!partition_create_handler()[ha_partition.cc:185]
mysqld.exe!get_new_handler()[handler.cc:302]
mysqld.exe!TABLE_SHARE::init_from_binary_frm_image()[table.cc:2025]
mysqld.exe!open_table_def()[table.cc:698]
mysqld.exe!tdc_acquire_share()[table_cache.cc:842]
mysqld.exe!open_table()[sql_base.cc:1905]
mysqld.exe!open_and_process_table()[sql_base.cc:3802]
mysqld.exe!open_tables()[sql_base.cc:4300]
mysqld.exe!mysql_alter_table()[sql_table.cc:9353]
mysqld.exe!Sql_cmd_alter_table::execute()[sql_alter.cc:510]
mysqld.exe!mysql_execute_command()[sql_parse.cc:6087]
mysqld.exe!Prepared_statement::execute()[sql_prepare.cc:4760]
mysqld.exe!Prepared_statement::execute_loop()[sql_prepare.cc:4246]
mysqld.exe!mysql_sql_stmt_execute()[sql_prepare.cc:3364]
mysqld.exe!mysql_execute_command()[sql_parse.cc:3901]
mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:3652]
mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:3335]
mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:3513]
mysqld.exe!sp_head::execute()[sp_head.cc:1346]
mysqld.exe!sp_head::execute_procedure()[sp_head.cc:2288]
mysqld.exe!do_execute_sp()[sql_parse.cc:3005]
mysqld.exe!Sql_cmd_call::execute()[sql_parse.cc:3247]
mysqld.exe!mysql_execute_command()[sql_parse.cc:6087]
mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:3652]
mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:3335]
mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:3513]
mysqld.exe!sp_head::execute()[sp_head.cc:1346]
mysqld.exe!sp_head::execute_procedure()[sp_head.cc:2288]
mysqld.exe!Event_job_data::execute()[event_data_objects.cc:1459]
mysqld.exe!Event_worker_thread::run()[event_scheduler.cc:312]
mysqld.exe!event_worker_thread()[event_scheduler.cc:268]
mysqld.exe!pthread_start()[my_winthread.c:62]
ucrtbase.dll!_configthreadlocale()
KERNEL32.DLL!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x1ce21cf66b8): alter table mytable add PARTITION (PARTITION t20221103 VALUES LESS THAN (TO_DAYS("2022-11-03")+1))
Connection ID (thread ID): 1649
Status: NOT_KILLED
The running platform is windows so I cannot debug the core file.
Therefore, I read the source code but I cannot understand the reason still deeply.
The database version is 10.4.6 so here is the source code link:
https://github.com/MariaDB/server/blob/mariadb-10.4.6/sql/ha_partition.cc#L3021
chksum= 0;
for (i= 0; i < len_words; i++)
chksum ^= uint4korr((file_buffer) + PAR_WORD_SIZE * i);
if (chksum)
goto err2;
m_tot_parts= uint4korr((file_buffer) + PAR_NUM_PARTS_OFFSET);
DBUG_PRINT("info", ("No of parts: %u", m_tot_parts));
tot_partition_words= (m_tot_parts + PAR_WORD_SIZE - 1) / PAR_WORD_SIZE;
tot_name_len_offset= file_buffer + PAR_ENGINES_OFFSET +
PAR_WORD_SIZE * tot_partition_words;
tot_name_words= (uint4korr(tot_name_len_offset) + PAR_WORD_SIZE - 1) /
PAR_WORD_SIZE; // <--- crashed here
There are some macros, so I don't know whether the line number is exact.
It seems a bug for loading the par file. I found that the above code has checked the validation on the par file, but subsequently, MariaDB still raises the exception.
So, I wonder if my analysis is right and how to bypass the exception, thanks a lot.

First is to search existing bug reports for 'read_par_file', and there's no obvious match.
Second is, looking at the line it crashed, we can assume its reading from memory that isn't allocated. The file_buffer was allocated earlier in the file based of a word length at the beginning.
Third, if you look at the git blame of the function, you see nothing has changed in quite a while.
So its likely a new bug. Please report it using the show create table and par file.
To confirm, that the show create table mytable, paste that into a running 10.4 latest version (containers are good for this). Then issue your alter table statement again. Its likely to crash, but its good to confirm.
There does seem to be a lack of checking around some of these offsets with regard to the allocated space.
Given it looks like you are doing daily partitions, have you exceeded the maximum partitions of 8k per table? Either way, an error message should occur rather than a crash.

AzureML Dataset.File.from_files creation extremely slow even with 4 files

I have a few thousand of video files in my BlobStorage, which I set it as a datastore.
This blob storage receives new files every night and I need to split the data and register each split as a new version of AzureML Dataset.
This is how I do the data split, simply getting the blob paths and splitting them.
container_client = ContainerClient.from_connection_string(AZ_CONN_STR,'keymoments-clips')
blobs = container_client.list_blobs('soccer')
blobs = map(lambda x: Path(x['name']), blobs)
train_set, test_set = get_train_test(blobs, 0.75, 3, class_subset={'goal', 'hitWoodwork', 'penalty', 'redCard', 'contentiousRefereeDecision'})
valid_set, test_set = split_data(test_set, 0.5, 3)
train_set, test_set, valid_set are just nx2 numpy arrays containing blob storage path and class.
Here is when I try to create a new version of my Dataset:
datastore = Datastore.get(workspace, 'clips_datastore')
dataset_train = Dataset.File.from_files([(datastore, b) for b, _ in train_set[:4]], validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
How is it possible that the Dataset creation seems to hang for an indefinite time even with only 4 paths?
I saw in the doc that providing a list of Tuple[datastore, path] is perfectly fine. Do you know why?
Thanks

Do you have your Azure Machine Learning Workspace and your Azure Storage Account in different Azure Regions? If that's true, latency may be a contributing factor with validate=True.

Another possibility may be slowness in the way datastore paths are resolved. This is an area where improvements are being worked on.
As an experiment, could you try creating the dataset using a url instead of datastore? Let us know if that makes a difference to performance, and whether it can unblock your current issue in the short term.
Something like this:
dataset_train = Dataset.File.from_files(path="https://bloburl/**/*.mp4?accesstoken", validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)

I'd be interested to see what happens if you run the dataset creation code twice in the same notebook/script. Is it faster the second time? I ask because it might be an issue with the .NET core runtime startup (which would only happen on the first time you run the code)
EDIT 9/16/20
While it doesn't seem to make sense that .NET core invoked when not data is moving, is suspect it is the validate=True part of the param that requires that all the data be inspected (which can computationally expensive). I'd be interested to see what happens if that param is False

RODBC connection- limited rows

I set up an ODBC connect to a Netezza (SQL database). The connection is fine. However, R only pulls out 256 rows by default and restricts the number of rows it can pull out.
If I ran the query in Netezza, it would return a total number of rows (300k). I am expecting the same number of rows in R. However, it only returned 256 rows quite a bit short from 300k.
The driver I am using NetezzaSQL version 7.00.02 NSQLODBC.DLL
I tried to change the pre-fetch count to zero in the "Drivers Option' from
Control Panel > Administrative Tools > Data Sources(OBBC) > System DNS
It didn't work. Any ideas?

I think RODBC acts poorly with Netezza. A solution http://datamining.togaware.com/survivor/Database_Connection.html
just add believeNRows=FALSE to either your sqlQuery or odbcConnect call (use the later if you also use sqlFetch.

You can also try using JDBC driver:
library(RJDBC)
drv <- JDBC("org.netezza.Driver", "nzjdbc.jar", "'")
conn <- dbConnect(drv, "jdbc:netezza://host:5480/database", "user", "password")
res <- dbSendQuery(conn, "select * from mytable")
That way you don't have to deal with DSNs, etc.

I know this is kind of out-dated but the problem is not with the RODBC package. The problem lies in how you set up the ODBC connection if you configure the connection in windows you'll see a last tab in the settings where you can specify the amount of rows it'll fetch. And the default is on 256.

Logfile analysis in R?

I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. I might not be the first thought to do it in R, but still R has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there a R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?

In connection with a project to build an analytics toolbox for our Network Ops guys,
i built one of these about two months ago. My employer has no problem if i open source it, so if anyone is interested i can put it up on my github repo. I assume it's most useful to this group if i build an R Package. I won't be able to do that straight away though
because i need to research the docs on package building with non-R code (it might be as simple as tossing the python bytecode files in /exec along with a suitable python runtime, but i have no idea).
I was actually suprised that i needed to undertake a project of this sort. There are at least several excellent open source and free log file parsers/viewers (including the excellent Webalyzer and AWStats) but neither parse server error logs (parsing server access logs is the primary use case for both).
If you are not familiar with error logs or with the difference between them and access
logs, in sum, Apache servers (likewsie, nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X,
that directory in /var, just below root:
$> pwd
/var/log/apache2
$> ls
access_log error_log
For network diagnostics, error logs are often far more useful than the access logs.
They also happen to be significantly more difficult to process because of the unstructured nature of the data in many of the fields and more significantly, because the data file
you are left with after parsing is an irregular time series--you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
i wanted an app that i could toss in raw error logs (of any size, but usually several hundred MB at a time) have something useful come out the other end--which in this case, had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, i coded the raw-log parser in python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization, i coded in R.
I have been building analytics tools for a long time, but only in the past
four years have i been using R. So my first impression--immediately upon parsing a raw log file and loading the data frame in R is what a pleasure R is to work with and how it is so well suited for tasks of this sort. A few welcome suprises:
Serialization. To persist working data in R is a single command
(save). I knew this, but i didn't know how efficient is this binary
format. Thee actual data: for every 50 MB of raw logfiles parsed, the
.RData representation was about 500 KB--100 : 1 compression. (Note: i
pushed this down further to about 300 : 1 by using the data.table
library and manually setting compression level argument to the save
function);
IO. My Data Warehouse relies heavily on a lightweight datastructure
server that resides entirely in RAM and writes to disk
asynchronously, called redis. The proect itself is only about two
years old, yet there's already a redis client for R in CRAN (by B.W.
Lewis, version 1.6.1 as of this post);
Primary Data Analysis. The purpose of this Project was to build a
Library for our Network Ops guys to use. My goal was a "one command =
one data view" type interface. So for instance, i used the excellent
googleVis Package to create a professional-looking
scrollable/paginated HTML tables with sortable columns, in which i
loaded a data frame of aggregated data (>5,000 lines). Just those few
interactive elments--e.g., sorting a column--delivered useful
descriptive analytics. Another example, i wrote a lot of thin
wrappers over some basic data juggling and table-like functions; each
of these functions i would for instance, bind to a clickable button
on a tabbed web page. Again, this was a pleasure to do in R, in part
becasue quite often the function required no wrapper, the single
command with the arguments supplied was enough to generate a useful
view of the data.
A couple of examples of the last bullet:
# what are the most common issues that cause an error to be logged?
err_order = function(df){
t0 = xtabs(~Issue_Descr, df)
m = cbind( names(t0), t0)
rownames(m) = NULL
colnames(m) = c("Cause", "Count")
x = m[,2]
x = as.numeric(x)
ndx = order(x, decreasing=T)
m = m[ndx,]
m1 = data.frame(Cause=m[,1], Count=as.numeric(m[,2]),
CountAsProp=100*as.numeric(m[,2])/dim(df)[1])
subset(m1, CountAsProp >= 1.)
}
# calling this function, passing in a data frame, returns something like:
Cause Count CountAsProp
1 'connect to unix://var/ failed' 200 40.0
2 'object buffered to temp file' 185 37.0
3 'connection refused' 94 18.8
The Primary Data Cube Displayed for Interactive Analysis Using googleVis:
A contingency table (from an xtab function call) displayed using googleVis)

It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning alogorithms, has three different regexp engines to parse etc pp.
And it may not be a novel idea. A few years ago I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: Read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs but I no longer have a URL. Even outside of temporal patterns there is a lot one could do here.

I have used R to load and parse IIS Log files with some success here is my code.
Load IIS Log files
require(data.table)
setwd("Log File Directory")
# get a list of all the log files
log_files <- Sys.glob("*.log")
# This line
# 1) reads each log file
# 2) concatenates them
IIS <- do.call( "rbind", lapply( log_files, read.csv, sep = " ", header = FALSE, comment.char = "#", na.strings = "-" ) )
# Add field names - Copy the "Fields" line from one of the log files :header line
colnames(IIS) <- c("date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query", "s_port", "cs_username", "c_ip", "cs_User_Agent", "sc_status", "sc_substatus", "sc_win32_status", "sc_bytes", "cs_bytes", "time-taken")
#Change it to a data.table
IIS <- data.table( IIS )
#Query at will
IIS[, .N, by = list(sc_status,cs_username, cs_uri_stem,sc_win32_status) ]

I did a logfile-analysis recently using R. It was no real komplex thing, mostly descriptive tables. R's build-in functions were sufficient for this job.
The problem was the data storage as my logfiles were about 10 GB. Revolutions R does offer new methods to handle such big data, but I at last decided to use a MySQL-database as a backend (which in fact reduced the size to 2 GB though normalization).
That could also solve your problem in reading logfiles in R.

#!python
import argparse
import csv
import cStringIO as StringIO
class OurDialect:
escapechar = ','
delimiter = ' '
quoting = csv.QUOTE_NONE
parser = argparse.ArgumentParser()
parser.add_argument('-f', '--source', type=str, dest='line', default=[['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"'''], ['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"''']])
arguments = parser.parse_args()
try:
with open(arguments.line, 'wb') as fin:
line = fin.readlines()
except:
pass
finally:
line = arguments.line
header = ['IP', 'Ident', 'User', 'Timestamp', 'Offset', 'HTTP Verb', 'HTTP Endpoint', 'HTTP Version', 'HTTP Return code', 'Size in bytes', 'User-Agent']
lines = [[l[:-1].replace('[', '"').replace(']', '"').replace('"', '') for l in l1] for l1 in line]
out = StringIO.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer = csv.writer(out,dialect=OurDialect)
writer.writerows([[l1 for l1 in l] for l in lines])
print(out.getvalue())
Demo output:
IP,Ident,User,Timestamp,Offset,HTTP Verb,HTTP Endpoint,HTTP Version,HTTP Return code,Size in bytes,User-Agent
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
This format can easily be read into R using read.csv. And, it doesn't require any 3rd party libraries.

QSqlTableModel.insertRecord() is very slow

I am using PyQt to insert records into a MySQL database. the code basically looks like
self.table = QSqlTableModel()
self.table.setTable('mytable')
while True:
rec = self.table.record()
values = getValueDictionary()
for k,v in values.items():
rec.setValue(k,QVariant(v))
self.table.insertRecord(-1,rec)
The table currently has ~ 50,000 rows in it.
I have timed each line and found that the insertRecord function is taking ~5 seconds to execute, which is unacceptably slow. Everything else is fast.
For comparison, I also made a version of the code that uses
QSqlQuery.prepare("INSERT INTO mytable (f1,f2,...) VALUES (:f1, :f2,...)")
query.bindValue(":f1",blah)
query.exec_()
In this case, the whole thing takes only ~ 20 milliseconds, so the delay is not in the database connection as far as I can tell.
I'd really prefer to use the QtSql stuff instead of the awkward MySQL commands. Any ideas on how to add a bunch of rows to a MySQL database with QtSql instead of raw comands and with reasonable speed?
Thanks,
G

Things to try:
set your EditStrategy to QSqlTableModel.OnManualSubmit
mass insert rows with insertRows
and see if it helps...

you should use begin before and commit after the loop, or turn off the autocommit feature from MySQL ..
this will give you usually a performance increase of 50% or more ..

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How do I limit the memory usage of duckdb in R? - r

Related

How to analyze a MariaDB crash log?

AzureML Dataset.File.from_files creation extremely slow even with 4 files

RODBC connection- limited rows

Logfile analysis in R?

QSqlTableModel.insertRecord() is very slow

Categories

Resources