Very slow insert speed using sqlite's .import command line - sqlite

I'm trying to import a database with around 4.5 million entries, 350 fields, and 2 indexes (I didn't design this) from a .csv file into a SQLite database. Most of the performance problems I've read about come from not using batched transactions and things like that, but I assumed the sqlite3 command-line .import would be about as fast as it gets. Yet I'm only getting roughly 150 inserts per second. Is there a way to speed this up?
As for what I've tried: I recreated the table schema without the two indexes, and I set PRAGMA synchronous to OFF based on recommendations I found by googling, but neither of these helped; I still get the same insert rate.
For whatever reason, the first 5,000 inserts seem to happen nearly instantly, but after that it slows down to around 150/second.

There are a few things you can try to speed up the import.
Turn off journaling: PRAGMA journal_mode = OFF (see the example session below).
If you're importing into a VM, increase its resources until you're done with the import.
Maybe convert your CSV to an SQL dump file, since that's native to SQLite and should import faster.
If none of that helps, you can try the sqlite-utils Python library; its CSV import support might give you a faster load. At the end of the day, you're working with a huge dataset.
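For reference, here is a minimal sketch of what such a session might look like in the sqlite3 shell. The database, table, and column names are placeholders, and the PRAGMAs only apply to the connection you set them in, so they must be issued before the .import:

$ sqlite3 mydb.db
sqlite> PRAGMA journal_mode = OFF;    -- no rollback journal while loading
sqlite> PRAGMA synchronous = OFF;     -- skip fsyncs; acceptable for a one-off import
sqlite> .mode csv
sqlite> .import data.csv mytable
sqlite> CREATE INDEX idx_field1 ON mytable(field1);   -- recreate the two indexes only after the load

Creating the indexes after the bulk load is usually cheaper than maintaining them row by row during it. Also note that if the CSV has a header row and the table already exists, .import will treat that header as data unless you strip it first or use a shell version that can skip it.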

Related

Knime too slow - performance

I just started using KNIME and it's supposed to manage a huge amount of data, but it isn't: it's slow and often unresponsive. And I'll be managing even more data than I am now. What am I doing wrong?
I set in my configuration file "knime.ini":
-XX:MaxPermSize=1024m
-Xmx2048m
I also read data from a database node (millions of rows), but I can't even limit it with SQL (I don't really mind, since I need all this data anyway). For example:
SELECT * FROM foo LIMIT 1000
gives this error:
WARN Database Reader com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'LIMIT 0' at line 1
I had the same issue and was able to solve it really simply. KNIME has a knime.ini file, which holds the parameters KNIME uses when it runs.
The real issue is that the JDBC driver is set to a fetch size of 10. By default, when Oracle JDBC runs a query, it retrieves the result set 10 rows at a time from the database cursor. That is the default Oracle row fetch size, so whenever you read from the database you'll have a long wait while all the rows trickle back.
The fix is simple: go to the folder where KNIME is installed, look for the knime.ini file, open it, and add the following lines to the bottom. This overrides the default JDBC fetch size, and you will get the data in literally seconds.
-Dknime.database.fetchsize=50000
-Dknime.url.timeout=9000
Hope this helps!
see http://tech.knime.org/forum/knime-users/knime-performance-reading-from-a-database for the rest of this discussion and solutions...
I'm not sure if your question is about the performance problem or the SQL problem.
For the former, I had the same issue and only found a solution when I started searching for Eclipse performance fixes rather than KNIME performance fixes. It's true that increasing the Java heap size is a good thing to do, but my performance problem (and perhaps yours) was caused by something bad going on in the saved workspace metadata. Solution: Delete contents of the knime/workspace/.metadata directory.
As for the latter, not sure why you're getting that error; maybe try adding a semicolon at the end of the SQL statement.

Download large data to Excel

I have about 3 million rows in a SqlDataReader (~500 MB).
How can I deliver this data to the client as Excel? (Excel has a limit of 1,048,576 rows per sheet.)
I tried ClosedXML, but I couldn't get more than about 100K rows per sheet; I always get an OutOfMemoryException when calling InsertData.
OS: 32-bit Windows.
Please don't suggest a txt download; that's already done. Now I need Excel, with the data split across sheets or files.
Any ideas?
If you can use a commercial product, I suggest SoftArtisans OfficeWriter. You can download an eval here.
With OfficeWriter you can use ExcelTemplate's BindData in conjunction with the continue modifier.
What this boils down to is fewer than 10 lines of code and a couple of placeholders in a template file. It is very easy to implement and will let your data flow seamlessly onto the next worksheet.
Disclaimer: I work for SoftArtisans the makers of OfficeWriter
How will this data be used?
If it will end up in some sort of pivot table, you might consider PowerPivot as well.
It can handle far more than 3 million rows, and it's free unless you need to publish to a SharePoint server.
Friend, have you tried importing the data and then opening it with Excel? I've seen someone do it and it worked, though I don't know whether it still works with the updates and integrity constraints that programmers add these days. If it doesn't, another idea would be this: you already have the data in a SQL table, so export it from there; it's easier to find everything there and bring it into Excel gradually.
3 million rows, 1 million rows per sheet? Loop over the data and load it into each sheet, or write some queries to split the first table into 3 smaller ones and load each new table into its own sheet (a sketch follows below). The first option will be dog slow but easy; the second will be harder but faster.
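Since the data comes from a SqlDataReader, the source is presumably SQL Server; a rough sketch of paging the query per sheet could look like this. BigTable and Id are placeholder names, and the OFFSET/FETCH syntax assumes SQL Server 2012 or later (it is standard SQL:2008):

-- sheet 1: rows 1 to 1,000,000
SELECT * FROM BigTable ORDER BY Id
OFFSET 0 ROWS FETCH NEXT 1000000 ROWS ONLY;

-- sheet 2: rows 1,000,001 to 2,000,000
SELECT * FROM BigTable ORDER BY Id
OFFSET 1000000 ROWS FETCH NEXT 1000000 ROWS ONLY;

-- ...and so on: run one query per sheet and stream each result into its own worksheet.

Streaming one bounded result set at a time also keeps memory use down, which matters on 32-bit Windows.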

How can I improve the performance of the SQLite database?

Background: I am using a SQLite database in my Flex application. The database is 4 MB and has 5 tables:
table 1 has 2,500 records
table 2 has 8,700 records
table 3 has 3,000 records
table 4 has 5,000 records
table 5 has 2,000 records
Problem: Whenever I run a SELECT query on any table, it takes around 50 seconds to fetch the data. This makes the application quite slow and unresponsive while it fetches data from a table.
How can I improve the performance of the SQLite database so that the time taken to fetch data from the tables is reduced?
Thanks
As I told you in a comment, without knowing what structures your database consists of and what queries you run against the data, there is nothing we can infer about why your queries take so much time.
However, here is an interesting read about indexes: Use the Index, Luke!. It explains what an index is, how you should design your indexes, and what benefits you can expect.
Also, if you can post the queries and the table schemas and cardinalities (not the contents), that might help.
Are you using asynchronous or synchronous execution modes? The difference between them is that asynchronous execution runs in the background while your application continues to run. Your application will then have to listen for a dispatched event and then carry out any subsequent operations. In synchronous mode, however, the user will not be able to interact with the application until the database operation is complete since those operations run in the same execution sequence as the application. Synchronous mode is conceptually simpler to implement, but asynchronous mode will yield better usability.
The first time you call SQLStatement.execute() on a SQLStatement instance, the statement is automatically prepared before it executes. Subsequent calls execute faster as long as the SQLStatement.text property has not changed. Reusing the same SQLStatement instance is better than creating new instances again and again. If you need to vary your queries, use parameterized statements.
You can also use techniques such as deferring what data you need at runtime. If you only need a subset of data, pull that back first and then retrieve other data as necessary. This may depend on your application scope and what needs you have to fulfill though.
Specifying the database along with the table name prevents the runtime from checking each database to find a matching table when you have multiple databases attached, and from picking the wrong one. Do SELECT email FROM main.users; instead of SELECT email FROM users; even if you only have a single database. (main is automatically assigned as the database name when you call SQLConnection.open.)
If you happen to be writing lots of changes to the database (multiple INSERT or UPDATE statements), wrap them in a transaction. The runtime makes the changes in memory and then writes them to disk once. If you don't use a transaction, each statement results in multiple disk writes to the database file, which is slow and consumes a lot of time.
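In plain SQLite SQL, the batched pattern looks roughly like the sketch below (the users table is borrowed from the main.users example above; the columns and values are invented). In AIR you would typically call SQLConnection.begin() and SQLConnection.commit() around your SQLStatement executions rather than running BEGIN/COMMIT yourself:

BEGIN TRANSACTION;
INSERT INTO users (name, email) VALUES ('alice', 'alice@example.com');
INSERT INTO users (name, email) VALUES ('bob', 'bob@example.com');
-- ...many more INSERT/UPDATE statements...
COMMIT;   -- the changes hit disk once, at commit, instead of after every statement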
Try to avoid schema changes. The table definition data is kept at the start of the database file, and the runtime loads these definitions when the database connection is opened. Data added to tables is stored after the table definition data in the file. If you make changes such as adding columns or tables, the new table definitions get mixed in with the table data, so the runtime has to read the definitions from different parts of the file rather than just the beginning. The SQLConnection.compact() method restructures the file so the table definition data is back at the beginning, but the downside is that compacting can itself take a long time, especially for a large database file.
Lastly, as Benoit pointed out in his comment, consider improving your own SQL queries and table structure. It would help to know whether your database structure and queries are the actual cause of the slow performance. My guess is that you're using synchronous execution; if you switch to asynchronous mode you'll see better performance, but that doesn't mean it has to stop there.
The Adobe Flex documentation online has more information on improving database performance and best practices working with local SQL databases.
You could try indexing some of the columns used in the WHERE clause of your SELECT statements. You might also try minimizing usage of the LIKE keyword.
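For example, if a query filters on a particular column, an index on that column is the usual first step (orders and customer_id here are invented names matching whatever appears in your WHERE clauses):

-- speeds up queries like: SELECT * FROM orders WHERE customer_id = 42;
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- note: LIKE '%foo%' (leading wildcard) cannot use such an index, while LIKE 'foo%' sometimes can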
If you are joining your tables together, you might try simplifying the table relationships.
Like others have said, it's hard to get specific without knowing more about your schema and the SQL you are using.

Importing 60,000 nodes

I'm using Table Wizard + Migrate module to import nodes into my Drupal installation.
I need to import around 60,000 questions/answers (both are nodes), and I thought it would be an easy task.
However, the migrate process imports 4 nodes per minute, so it would take approximately 11 days to finish the import.
I was wondering if I could make it faster by importing directly into MySQL, but I actually need to create 60,000 nodes, and I guess Drupal stores additional information in other tables... so it doesn't seem that safe.
What do you suggest I do? Wait the 11 days?
Thanks
Table migrate should be orders of magnitude faster than that.
Are you using pathauto?
If so, try disabling the pathauto module; it often causes big performance problems on import.
Second, if disabling pathauto doesn't help, turn off all non-essential modules you may have running - some modules do crazy stuff. Eliminate other modules as possible sources of the problem.
Third, is the MySQL db log turned on? That can have a big performance impact - not at the level you're talking about, but it's something to consider.
Fourth, install xdebug and tail your MySQL log to see exactly what's happening.
What is your PHP memory limit?
Do you have plenty of disk space left?
If you're not already doing so, use drush to migrate the nodes in batches. You could even write a shell script for it if you want it automated. Using the command line should cut the import time a lot, and with a script it becomes an automated task you don't have to babysit.
One thing I want to note though: 4 nodes per minute is very low. I once needed to import 300 nodes from a CSV file using Migrate, with location and 4-5 CCK fields, and it took a matter of seconds. So if you're only importing 4 nodes per minute, you either have extremely complex nodes or something fishy is going on.
What are the specs of the computer you are using for this? Where's the import source located?
This is a tough topic, but it's actually very well covered within Drupal. I don't know all the ins and outs, but I do know where to look.
The Data Mining Drupal group has some pointers, knowledge, and information on processing large amounts of data in PHP/Drupal.
Drupal core has batch functionality built in, called the Batch API, at your service when writing modules. For a working example, see this tutorial on CSV import.
4 nodes per minute is incredibly slow; Migrate shouldn't normally take anywhere near that long. You could speed things up a bit by using Drush, but probably not enough to get a reasonable import time (hours, not days). That wouldn't really address your core problem: the import itself is taking too long, and the overhead of the Migrate GUI isn't that big.
Importing directly into MySQL would certainly be faster, but there's a reason Migrate exists. Node database storage in Drupal is complicated, so it's generally best to let Drupal work it out rather than trying to figure out what goes where.
Are you using Migrate's hooks to do additional processing on each node? I'd suggest adding some logging to see what exactly is taking so long. Test it on 10 nodes at a time until you figure out the lag before doing the whole 60k.
We had a similar problem on a Drupal 7 install. We left an import running all weekend, and it only imported 1,000 lines of the file.
The funny thing is that exactly the same import on a pre-production machine was taking 90 minutes.
We ended up comparing the source code (making sure we were at the same commit in git), the database schema (identical), and the number of nodes on each machine (not identical, but similar)...
Long story short, the only significant difference between the two machines was the max_execution_time option in the php.ini settings file.
The production machine had max_execution_time = 30, while the pre-production machine had max_execution_time = 3000. It looks like the migrate module has some mechanism for handling a "short" max_execution_time that is less than optimal.
Conclusion: set max_execution_time = 3000 or more in your php.ini; it helps the migrate module a lot.
I just wanted to add a note saying that disabling pathauto really does help. I had an import of over 22k rows; before disabling it, the import took over 12 hours and crashed multiple times. After disabling pathauto and re-running the import, it took only 62 minutes and didn't crash once.
Just a heads up: I created a module that disables the pathauto module before the import starts and re-enables it when the feed finishes. Here's the code from the module in case anyone needs this:
function YOURMODULENAME_feeds_before_import(FeedsSource $source) {
  // Disable pathauto while the feed runs so alias generation doesn't slow the import.
  $modules = array('pathauto');
  module_disable($modules);
  drupal_set_message(t('The @module module has been disabled for the import.', array('@module' => $modules[0])), 'warning');
}

function YOURMODULENAME_feeds_after_import(FeedsSource $source) {
  // Re-enable pathauto once the feed has finished importing.
  $modules = array('pathauto');
  module_enable($modules);
  drupal_set_message(t('The @module module has been re-enabled.', array('@module' => $modules[0])), 'warning');
}

sqlite3 bulk insert from C?

I came across the .import command to do this (bulk insert), but is there a query version of it that I can execute using sqlite3_exec()?
I would just like to copy a small text file contents into a table.
That is, a query version of the command below:
".import demotab.txt mytable"
SQLite's performance doesn't depend on a special bulk-insert mechanism. Simply performing the inserts individually (but within a single transaction!) gives very good performance.
You might benefit from increasing SQLite's page cache size; that depends on the number of indexes and/or the order in which the data is inserted. If you don't have any indexes, the cache size is unlikely to matter much for a pure insert.
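For example, the cache can be enlarged per connection with a PRAGMA before the inserts start (the numbers are only illustrative):

PRAGMA cache_size = 10000;    -- cache size expressed in pages
PRAGMA cache_size = -64000;   -- in recent SQLite versions, a negative value means kibibytes (~64 MB here)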
Be sure to use a prepared query rather than regenerating a query plan in the innermost loop. It's extremely important to wrap the statements in a transaction, since this avoids syncing the database to disk after every statement; after all, a partially written transaction is rolled back atomically anyway, which means all fsync()s can be deferred until the transaction completes.
Finally, indexes will limit your insert performance, since maintaining them is somewhat expensive. If you're really dealing with a lot of data and start off with an empty table, it may be beneficial to add the indexes after the data, though this isn't a huge factor.
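To answer the literal question: .import is a feature of the sqlite3 shell, not of the SQL engine, so one option is to generate the INSERTs yourself and pass a script like the sketch below to sqlite3_exec() (the column names and values are made up; mytable is taken from your example):

BEGIN TRANSACTION;
INSERT INTO mytable (col1, col2) VALUES ('a', 1);
INSERT INTO mytable (col1, col2) VALUES ('b', 2);
-- ...one INSERT per line of the text file...
COMMIT;
-- if the table started out empty, create any indexes after the load:
CREATE INDEX IF NOT EXISTS idx_mytable_col1 ON mytable (col1);

In practice, though, it's usually better to prepare a single INSERT INTO mytable VALUES (?1, ?2) once with sqlite3_prepare_v2(), then bind, step, and reset it for each row inside the transaction, rather than building SQL strings yourself.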
Oh, and you might want to get one of those intel X25-E SSD's and ensure you have an AHCI controller ;-).
I'm maintaining an app with SQLite databases holding about 500,000,000 rows (spread over several tables), much of which was bulk inserted using plain old begin-insert-commit: it works fine.
