QSqlTableModel.insertRecord() is very slow - qt

I am using PyQt to insert records into a MySQL database. the code basically looks like
self.table = QSqlTableModel()
self.table.setTable('mytable')
while True:
rec = self.table.record()
values = getValueDictionary()
for k,v in values.items():
rec.setValue(k,QVariant(v))
self.table.insertRecord(-1,rec)
The table currently has ~ 50,000 rows in it.
I have timed each line and found that the insertRecord function is taking ~5 seconds to execute, which is unacceptably slow. Everything else is fast.
For comparison, I also made a version of the code that uses
QSqlQuery.prepare("INSERT INTO mytable (f1,f2,...) VALUES (:f1, :f2,...)")
query.bindValue(":f1",blah)
query.exec_()
In this case, the whole thing takes only ~ 20 milliseconds, so the delay is not in the database connection as far as I can tell.
I'd really prefer to use the QtSql stuff instead of the awkward MySQL commands. Any ideas on how to add a bunch of rows to a MySQL database with QtSql instead of raw comands and with reasonable speed?
Thanks,
G

Things to try:
set your EditStrategy to QSqlTableModel.OnManualSubmit
mass insert rows with insertRows
and see if it helps...

you should use begin before and commit after the loop, or turn off the autocommit feature from MySQL ..
this will give you usually a performance increase of 50% or more ..

Related

How to analyze a MariaDB crash log?

When I created a partition for an existing table, I got an exception stack below:
Thread pointer: 0x1ce1ecc6ab8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
mysqld.exe!ha_partition::read_par_file()[ha_partition.cc:3021]
mysqld.exe!ha_partition::get_from_handler_file()[ha_partition.cc:3161]
mysqld.exe!ha_partition::initialize_partition()[ha_partition.cc:512]
mysqld.exe!partition_create_handler()[ha_partition.cc:185]
mysqld.exe!get_new_handler()[handler.cc:302]
mysqld.exe!TABLE_SHARE::init_from_binary_frm_image()[table.cc:2025]
mysqld.exe!open_table_def()[table.cc:698]
mysqld.exe!tdc_acquire_share()[table_cache.cc:842]
mysqld.exe!open_table()[sql_base.cc:1905]
mysqld.exe!open_and_process_table()[sql_base.cc:3802]
mysqld.exe!open_tables()[sql_base.cc:4300]
mysqld.exe!mysql_alter_table()[sql_table.cc:9353]
mysqld.exe!Sql_cmd_alter_table::execute()[sql_alter.cc:510]
mysqld.exe!mysql_execute_command()[sql_parse.cc:6087]
mysqld.exe!Prepared_statement::execute()[sql_prepare.cc:4760]
mysqld.exe!Prepared_statement::execute_loop()[sql_prepare.cc:4246]
mysqld.exe!mysql_sql_stmt_execute()[sql_prepare.cc:3364]
mysqld.exe!mysql_execute_command()[sql_parse.cc:3901]
mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:3652]
mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:3335]
mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:3513]
mysqld.exe!sp_head::execute()[sp_head.cc:1346]
mysqld.exe!sp_head::execute_procedure()[sp_head.cc:2288]
mysqld.exe!do_execute_sp()[sql_parse.cc:3005]
mysqld.exe!Sql_cmd_call::execute()[sql_parse.cc:3247]
mysqld.exe!mysql_execute_command()[sql_parse.cc:6087]
mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:3652]
mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:3335]
mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:3513]
mysqld.exe!sp_head::execute()[sp_head.cc:1346]
mysqld.exe!sp_head::execute_procedure()[sp_head.cc:2288]
mysqld.exe!Event_job_data::execute()[event_data_objects.cc:1459]
mysqld.exe!Event_worker_thread::run()[event_scheduler.cc:312]
mysqld.exe!event_worker_thread()[event_scheduler.cc:268]
mysqld.exe!pthread_start()[my_winthread.c:62]
ucrtbase.dll!_configthreadlocale()
KERNEL32.DLL!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x1ce21cf66b8): alter table mytable add PARTITION (PARTITION t20221103 VALUES LESS THAN (TO_DAYS("2022-11-03")+1))
Connection ID (thread ID): 1649
Status: NOT_KILLED
The running platform is windows so I cannot debug the core file.
Therefore, I read the source code but I cannot understand the reason still deeply.
The database version is 10.4.6 so here is the source code link:
https://github.com/MariaDB/server/blob/mariadb-10.4.6/sql/ha_partition.cc#L3021
chksum= 0;
for (i= 0; i < len_words; i++)
chksum ^= uint4korr((file_buffer) + PAR_WORD_SIZE * i);
if (chksum)
goto err2;
m_tot_parts= uint4korr((file_buffer) + PAR_NUM_PARTS_OFFSET);
DBUG_PRINT("info", ("No of parts: %u", m_tot_parts));
tot_partition_words= (m_tot_parts + PAR_WORD_SIZE - 1) / PAR_WORD_SIZE;
tot_name_len_offset= file_buffer + PAR_ENGINES_OFFSET +
PAR_WORD_SIZE * tot_partition_words;
tot_name_words= (uint4korr(tot_name_len_offset) + PAR_WORD_SIZE - 1) /
PAR_WORD_SIZE; // <--- crashed here
There are some macros, so I don't know whether the line number is exact.
It seems a bug for loading the par file. I found that the above code has checked the validation on the par file, but subsequently, MariaDB still raises the exception.
So, I wonder if my analysis is right and how to bypass the exception, thanks a lot.
First is to search existing bug reports for 'read_par_file', and there's no obvious match.
Second is, looking at the line it crashed, we can assume its reading from memory that isn't allocated. The file_buffer was allocated earlier in the file based of a word length at the beginning.
Third, if you look at the git blame of the function, you see nothing has changed in quite a while.
So its likely a new bug. Please report it using the show create table and par file.
To confirm, that the show create table mytable, paste that into a running 10.4 latest version (containers are good for this). Then issue your alter table statement again. Its likely to crash, but its good to confirm.
There does seem to be a lack of checking around some of these offsets with regard to the allocated space.
Given it looks like you are doing daily partitions, have you exceeded the maximum partitions of 8k per table? Either way, an error message should occur rather than a crash.

How do I limit the memory usage of duckdb in R?

I have several large R data.frames that I would like to put into a local duckdb database. The problem I am having is duckdb seems to load everything into memory even though I am specifying a file as the location.
Also, it isn't clear to me the correct way to establish a connection (so I'm not sure if this has something to do with it). I have tried:
duckdrv <- duckdb(dbdir="dt.db", read_only=FALSE)
dkCon <- dbConnect(drv=duckdrv)
and also:
duckdrv <- duckdb()
dkCon <- dbConnect(drv=duckdrv, dbdir="dt.db", read_only=FALSE)
Both work fine, meaning I can create tables, use dbWriteTable, run queries, etc. However, the memory usage is very high (about the same size as the data.frames). I think I read somewhere that duckdb defaults to using a certain % of the available memory which won't work for me because the system that I am using is a shared resource. I also want to run some queries in parallel which will drive memory usage even higher.
I have tried this:
dbExecute(dkCon, "PRAGMA memory_limit='1GB';")
but that doesn't seem to make a difference, even if I close the connection, shutdown the instance and reconnect.
Does anyone know how I can fix this problem? RSQLite, also has high memory usage temporarily when I am writing data to a table but then it goes back to normal and if I open a read only connection it isn't an issue at all. I would like to get duckdb working because I think the queries are supposed to be much faster. Any help would be appreciated!
Memory limit can be set using PRAGMA or SET statement in DuckDB. By default, 75% of the RAM is the limit.
con.execute("PRAGMA memory_limit='200MB'")
OR
con.execute("SET memory_limit='200MB'")
I can confirm that this limit works. However this is not a hard limit and might get exceeded sometimes based on the volume of data, format of data your are querying(eg: parquet from s3), type of query - certain limitations or certain constraints around it at the moment.
Below is one of the examples where the volume of data in plain text(csv) was around 4.23 GB. This data was first loaded into DuckDB and then some SQL queries were run by setting the memory_limit='200MB'. Below screenshot shows max recorded memory used by the py script.
Your approach is correct - using memory_limit pragma, but you used an outdated version.
For example, using DuckDb version 0.5.1:
library("DBI")
con = dbConnect(duckdb::duckdb(), dbdir="my-db.duckdb")
dbExecute(conn = con, paste0("PRAGMA memory_limit='500MB'"))
dbGetQuery(conn = con, "PRAGMA version")
dbExecute(con, "CREATE TABLE gen AS SELECT * FROM 'gen1GB.csv'")
dbGetQuery(conn = con, "select count(*) from gen")
This outputs for me:
library_version source_id
1 0.5.1 7c111322d
count_star()
1 1e+08
Memory usage is less than 500MB. On MacOs can be checked using:
ps axu | grep 'lib\/R' | awk '{print $6 " " $11}'
464768 /usr/local/Cellar/r/4.2.1_4/lib/R/bin/exec/R
You can generate a test csv-file using:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 100, size=(100000000, 4)), columns=list('ABCD'))
df.to_csv('gen1GB.csv', index=False)

BigQuery Timeout Errors in R Loop Using bigrquery

I am running a query in a loop for each store in a dataframe. Typically there are 70 or so stores so the loop repeats that many times for each complete loop.
Maybe 75% of the time this loop works all the way through with no errors.
About 25% of the time I get the following error during any one of the loop iterations:
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached
Then I have to figure out which iteration bombed, and repeat the loop excluding iterations that completed successfully.
I can't find anything on the web to help me understand what is causing this seemingly random error. Perhaps it is a BQ technical issue? There does not seem to be any relation to the size of the result set it crashes on.
Here is the part of my code that does the loop...again it works all the way through most of the time. The cartesian product across IDs is intentional, as I want every combination of each Test ID with all possible Control IDs within store.
sql<-"SELECT pstore as store, max(pretrips) as pretrips FROM analytics.campaign_ids
group by 1 order by 1"
store_maxtrips<-query_exec(sql,project=project, max_pages = 1)
store_maxtrips
for (i in 1:length(store_maxtrips$store)) {
#pull back all ids shopping in same primary store as each test ID with their pre metrics
sql<-paste("SELECT a.pstore as pstore, a.id as test_id,
b.id as ctl_id,
(abs(a.zpbsales-b.zpbsales)*",wt_pb_sales,")+(abs(a.zcatsales-b.zcatsales)*",wt_cat_sales,")+
(abs(a.zsales-b.zsales)*",wt_retail_sales,")+(abs(a.ztrips-b.ztrips)*",wt_retail_trips,") as zscore
FROM analytics.campaign_ids a inner join analytics.pre_zscores b
on a.pstore=b.pstore
where a.id<>b.id and a.pstore=",store_maxtrips$store[i]," order by a.pstore, a.id, zscore")
print(paste("processing store",store_maxtrips$store[i]))
query_exec(sql,project=project,destination_table = "analytics.campaign_matches",
write_disposition = "WRITE_APPEND", max_pages = 1)
}
Solved!
It turns out I was using query_exec, but I should have been using insert_query_job since I do not want to retrieve any results. The errors were all happening in the course of R trying to retrieve results from BigQuery which I didn't want anyhow.
By using insert_query_job + wait_for(job) in my loop instead of the query_exec command, it eliminated all issues with the loop finishing.
I did also need to add a try() function to help circumvent some rare errors that still popped up with this approach. Thanks to MarkeD for this tip. So my final solution looked like this:
try(job<-insert_query_job(sql,project=project,destination_table = "analytics.campaign_matches",
write_disposition = "WRITE_APPEND"))
wait_for(job)
Thanks to everyone who commented and helped me research the issue.

Titan Graph Queries taking too long to execute

I have a problem with the executing speed of Titan queries.
To be more specific:
I created a property file for my graph using BerkeleyJe which is looking like this:
storage.backend=berkeleyje
storage.directory=/finalGraph_script/graph
Afterwards, i opened the Gremlin.bat to open my Graph.
I set up all the neccessary Index Keys for my nodes:
m = g.getManagementSystem();
username = m.makePropertyKey('username').dataType(String.class).make()
m.buildIndex('byUsername',Vertex.class).addKey(username).unique().buildCompositeIndex()
m.commit()
g.commit()
(all other keys are created the same way...)
I imported a csv file containing about 100 000 lines, each line is producing at least 2 nodes and some edges. All this is done via Batchloading.
That works without a Problem.
Then i execute a groupBy query which is looking like that:
m = g.V.has("imageLink").groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()
With this query i want for every node with the property key "imageLink" the number of the different "species". "Species" are also nodes, and can be called by going back the edge "is_on_image" and following the edge "is_species".
Well this is also working like a charm, for my recent nodes. This query is taking about 2 minutes on my local PC.
But now to the problem.
My whole dataset is a csv with 10 million entries. The structure is the same as above, and each line is also creating at least 2 nodes and some edges.
With my local PC i cant even import this set, causing an Memory Exception after 3 days of loading.
So I tried the same on a server with much more RAM and memory. There the Import works, and takes about 1 day. But the groupBy failes after about 3 days.
I actually dont know if the groupBy itself fails, or just the Connection to the Server after that long time.
So my first Question:
In my opinion about 15 million nodes shouldn't be that big deal for a graph database, should it?
Second Question:
Is it normal that it takes so long? Or is there anyway to speed it up using indices? I configured the indices as listet above :(
I don't know which exact information you need for helping me, but please just tell me what you need in addition to that.
Thanks a lot!
Best regards,
Ricardo
EDIT 1: The way im loading the CSV in the Graph:
I'm using this code, i deleted some unneccassry properties, which are also set an property for some nodes, loaded the same way.
bg = new BatchGraph(g, VertexIDType.STRING, 10000)
new File("annotation_nodes_wNothing.csv").eachLine({ final String line ->def (annotationId,species,username,imageLink) = line.split('\t')*.trim();def userVertex = bg.getVertex(username) ?: bg.addVertex(username);def imageVertex = bg.getVertex(imageLink) ?: bg.addVertex(imageLink);def speciesVertex = bg.getVertex(species) ?: bg.addVertex(species);def annotationVertex = bg.getVertex(annotationId) ?: bg.addVertex(annotationId);userVertex.setProperty("username",username);imageVertex.setProperty("imageLink", imageLink);speciesVertex.setProperty("species",species);annotationVertex.setProperty("annotationId", annotationId);def classifies = bg.addEdge(null, userVertex, annotationVertex, "classifies");def is_on_image = bg.addEdge(null, annotationVertex, imageVertex, "is_on_image");def is_species = bg.addEdge(null, annotationVertex, speciesVertex, "is_species");})
bg.commit()
g.commit()

Batch loading in DynamoDB

Hi Currently I am loading each row row in Dynamodb which is veryslow.
I have a huge data which i want to load to DynamoDb by JAVA API.
But this takes huge time .For example to load 1 million data it took me 2 days to load to Dynamo.
Is Batch load possible in DynamoDb.I am not finding any information about bulkload or batch load.
Appreciate any help here.
i know it's an old post, but we came up short exploring how to optimize this so embarked on a scientific discovery :)
http://tech.equinox.com/driving-miss-dynamodb/
long and short of it it Hive on EMR is an excellent option (i know, it's "old skool")..
using and tuning these parameters do the trick (see blog for details):
SET dynamodb.throughput.write.percent = x;
SET mapred.reduce.tasks = x;
SET hive.exec.reducers.bytes.per.reducer = x;
SET tez.grouping.split-count = x;

Resources