Importing 60,000 nodes - Drupal

I'm using Table Wizard + Migrate module to import nodes into my Drupal installation.
I need to import around 60,000 questions/answers (both are nodes), and I thought it would be an easy task.
However, the migrate process imports 4 nodes per minute, so it would take approximately 11 days to finish the import.
I was wondering if I could make it faster by importing directly into MySQL, but I actually need to create 60,000 nodes, and I guess Drupal stores additional information in other tables... so it doesn't seem that safe.
What do you suggest I do? Wait 10 days?
Thanks

Migrate should be orders of magnitude faster than that.
First, are you using Pathauto?
If so, try disabling the Pathauto module; it often causes big performance problems on import.
Second, if disabling Pathauto doesn't help, turn off all non-essential modules you may have running - some modules do crazy stuff. Eliminate other modules as the source of the problem.
Third, is the MySQL log turned on? That can have a big performance impact - not at the level you're talking about, but it's something to consider.
Fourth, install Xdebug and tail your MySQL log to see exactly what's happening.
What is your PHP memory limit?
Do you have plenty of disk space left?

If you're not already doing so, you should use Drush to migrate the nodes in batches. You could even write a shell script for it if you want it automated. Using the command line should lower the time it takes to import the nodes a lot, and with a script you can make it an automated task that you don't have to worry about.
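For illustration, here is a rough Python sketch of the batching idea as a wrapper around drush. It assumes your Migrate version provides a migrate-import drush command with a --limit option (check drush help first); the migration name "Questions" and the numbers are placeholders.

import subprocess

BATCH = 500      # items per drush invocation (placeholder)
TOTAL = 60000    # approximate number of nodes to import (placeholder)

for done in range(0, TOTAL, BATCH):
    # Each invocation is a fresh PHP process, so memory is released between
    # batches and no single long-running web request can time out.
    subprocess.check_call([
        "drush", "migrate-import", "Questions",  # "Questions" is a made-up migration name
        "--limit=%d items" % BATCH,
    ])
    print("Requested roughly %d of %d items so far" % (done + BATCH, TOTAL))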
One thing I want to note, though: 4 nodes per minute is very low. I once needed to import 300 nodes from a CSV file using Migrate, each with a location and 4-5 CCK fields, and it finished in a matter of seconds. So if you're only importing 4 nodes per minute, you either have extremely complex nodes or something fishy is going on.
What are the specs of the computer you are using for this? Where's the import source located?

This is a tough topic, but it is actually very well covered within Drupal. I don't know the ins and outs, but I do know where to look.
The Data Mining Drupal group has some pointers, knowledge, and information on processing large amounts of data in PHP/Drupal.
Drupal core has batch functionality built in, called the Batch API, at your service when writing modules. For a working example, see this tutorial on CSV import.

4 nodes per minute is incredibly slow; Migrate shouldn't normally take that long. You could speed things up a bit by using Drush, but probably not enough to get a reasonable import time (hours, not days), and that wouldn't really address your core problem: the import itself is taking too long. The overhead of the Migrate GUI isn't that big.
Importing directly into MySQL would certainly be faster, but there's a reason Migrate exists. Node database storage in Drupal is complicated, so it's generally best to let Drupal work it out rather than trying to figure out what goes where.
Are you using Migrate's hooks to do additional processing on each node? I'd suggest adding some logging to see what exactly is taking so long. Test it on 10 nodes at a time until you figure out the lag before doing the whole 60k.

We had a similar problem on a Drupal 7 install. We left an import running all weekend, and it only imported 1,000 lines of the file.
The funny thing is that exactly the same import took 90 minutes on a pre-production machine.
We ended up comparing the source code (making sure we were at the same commit in Git), the database schema (identical), and the number of nodes on each machine (not identical, but similar)...
Long story short, the only significant difference between the two machines was the max_execution_time option in the php.ini settings file.
The production machine had max_execution_time = 30, while the pre-production machine had max_execution_time = 3000. It looks like the Migrate module has a mechanism for handling "short" max_execution_time values that is less than optimal.
Conclusion: set max_execution_time = 3000 or more in your php.ini; it helps the Migrate module a lot.

I just wanted to add a note saying that disabling Pathauto really does help. I had an import of over 22k rows; before disabling it, the import took over 12 hours and crashed multiple times. After disabling Pathauto and running the import again, it took only 62 minutes and didn't crash once.
Just a heads up: I created a module that disables Pathauto before the import starts and re-enables it once the feed finishes. Here's the code from the module in case anyone needs this ability:
/**
 * Implements hook_feeds_before_import().
 *
 * Disable Pathauto so alias generation doesn't slow down the import.
 */
function YOURMODULENAME_feeds_before_import(FeedsSource $source) {
  $modules = array('pathauto');
  module_disable($modules);
  drupal_set_message(t('The @module module has been disabled for the import.', array('@module' => $modules[0])), 'warning');
}

/**
 * Implements hook_feeds_after_import().
 *
 * Re-enable Pathauto once the import has finished.
 */
function YOURMODULENAME_feeds_after_import(FeedsSource $source) {
  $modules = array('pathauto');
  module_enable($modules);
  drupal_set_message(t('The @module module has been re-enabled.', array('@module' => $modules[0])), 'warning');
}

Related

Very slow insert speed using sqlite's .import command line

I'm trying to import a database with around 4.5 million entries, 350 fields, and 2 indexes (I didn't design this) from a .csv file into a SQLite database. Most of the performance issues I've read about involve not using batch transactions and things like that, but I figured the sqlite command-line .import would be as fast as possible. Yet I'm only getting around ~150 inserts per second. Is there a way to speed this up somehow?
As for what I've tried: I recreated the table schema without the two indexes, and I tried setting PRAGMA synchronous to OFF based on recommendations I found, but neither of these helped; I still get the same insert rate.
For whatever reason the first 5,000 inserts seem to happen nearly instantly but after that it slows down to around 150/second.
There are a few things you can try to speed up the import.
Turn off journaling - PRAGMA journal_mode = OFF
If you're importing into a VM, increase its resources until you're done with the import
Maybe convert your CSV to an SQL dump file, since that's native to SQLite and should import more quickly
If none of that helps, you can try the sqlite-utils Python library, which has CSV import support and might result in a faster import. At the end of the day, you're working with a huge dataset.
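If you end up writing your own loader instead, here is a minimal sketch using Python's standard sqlite3 module. The database path, CSV file, table name, and batch size are placeholders, and the table is assumed to already exist with columns matching the CSV; batching rows into explicit transactions with executemany is usually where most of the gain comes from.

import csv
import sqlite3

conn = sqlite3.connect("import.db")
# These PRAGMAs trade durability for speed, so only use them on a
# throwaway import database.
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")

BATCH = 10000
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    insert = "INSERT INTO mytable VALUES (%s)" % ",".join("?" * len(header))
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH:
            conn.executemany(insert, batch)  # one transaction per batch
            conn.commit()
            batch = []
    if batch:
        conn.executemany(insert, batch)
        conn.commit()
conn.close()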

KNIME too slow - performance

I just started using KNIME. It's supposed to manage huge amounts of data, but it isn't: it's slow and often unresponsive. And I'll need to manage more data than I'm using now. What am I doing wrong?
I set this in my configuration file "knime.ini":
-XX:MaxPermSize=1024m
-Xmx2048m
I also read data from a database node (millions of rows), but I can't limit it with SQL (which I don't really mind, since I need all of this data). When I try, for example:
SELECT * FROM foo LIMIT 1000
I get this error:
WARN Database Reader com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'LIMIT 0' at line 1
I had the same issue... and was able to solve it really simply. KNIME has a knime.ini file that holds the parameters KNIME uses when it executes.
The real issue is that the JDBC driver's fetch size is set to 10. By default, when the Oracle JDBC driver runs a query, it retrieves a result set of 10 rows at a time from the database cursor. This is the default Oracle row fetch size, so whenever you read from the database you spend a long time waiting to retrieve all the rows.
The fix is simple: go to the folder where KNIME is installed, look for the knime.ini file, open it, and add the following lines at the bottom. This overrides the default JDBC fetch size, and then you'll get the data in literally seconds.
-Dknime.database.fetchsize=50000
-Dknime.url.timeout=9000
Hope this helps!
See http://tech.knime.org/forum/knime-users/knime-performance-reading-from-a-database for the rest of this discussion and solutions.
I'm not sure if your question is about the performance problem or the SQL problem.
For the former, I had the same issue and only found a solution when I started searching for Eclipse performance fixes rather than KNIME performance fixes. It's true that increasing the Java heap size is a good thing to do, but my performance problem (and perhaps yours) was caused by something bad going on in the saved workspace metadata. Solution: Delete contents of the knime/workspace/.metadata directory.
As for the latter, not sure why you're getting that error; maybe try adding a semicolon at the end of the SQL statement.

ASP Pre-Caching with Application_Start Considerations

I am designing a web API which requires fast read-only access to a large dataset that will hopefully be constantly stored in memory and ready for access. Access will be from a static class which will just do some super-fast lookups on the data.
So I want to pre-cache a Dictionary<string,Dictionary<string,Dictionary<string,myclass>>>, with the total number of elements in the third-level dictionaries being around 1 million. That will increase eventually, but let's say never more than 2 million. 'myclass' is a small class with a (small) list of strings, an int, an enum, and a couple of bools, so nothing major. It should be a bit over 100 MB in memory.
From what I can tell, the way to do this is simply to call my StaticClass.Load() method, which reads all this data in from a file, in the Application_Start event in Global.asax.
I am wondering what things I need to consider or worry about with this. I am guessing it is not just as simple as calling Load() and then assuming everything will be OK for future access. Will the GC know to leave the data there even if the API is not hit for a couple of hours?
To complicate things, I want to reload this data every day as well. I think I'll just be able to throw out the old dataset and load the new one in from another file, but I'll get to that later.
Cheers
Please see my similar question, IIS6 ASP.NET 2.0 Application Cache - data storage options and performance for large amounts of data, and in particular the answer from Marc; his last paragraph about options for large caches I think would apply to your case.
The standard ASP.NET application Cache could work for you here. Check out this article. With it you get built-in management of dependencies (e.g. the file changing) or time-based expiry. The linked article shows loading the cache on application start.
My concern is the size of what you want to cache.
Cheers guys
I also found this article which addresses the options: http://www.asp.net/data-access/tutorials/caching-data-at-application-startup-cs
Nothing really gives recommendations for large amounts of data - or even defines what 'large amounts' means. I'll keep doing my research, but Redis looks pretty good.

ASP.NET/SQL 2008 Performance issue

We've developed a system with a search screen that looks a little something like this:
[Screenshot of the search screen - source: nsourceservices.com]
As you can see, there is some fairly serious search functionality. You can use any combination of statuses, channels, languages, campaign types, and then narrow it down by name and so on as well.
Then, once you've searched and the leads pop up at the bottom, you can sort the headers.
The query uses ROW_NUMBER() to do a paging scheme, so we only return something like 70 rows at a time.
The Problem
Even though we're only returning 70 rows, an awful lot of IO and sorting is going on. This makes sense of course.
This has always caused some minor spikes to the Disk Queue. It started slowing down more when we hit 3 million leads, and now that we're getting closer to 5, the Disk Queue pegs for up to a second or two straight sometimes.
That would actually still be workable, but this system has another area with a time-sensitive process, let's say for simplicity that it's a web service, that needs to serve up responses very quickly or it will cause a timeout on the other end. The Disk Queue spikes are causing that part to bog down, which is causing timeouts downstream. The end result is actually dropped phone calls in our automated VoiceXML-based IVR, and that's very bad for us.
What We've Tried
We've tried:
Maintenance tasks that reduce the number of leads in the system to the bare minimum.
Added the obvious indexes to help.
Ran the index tuning wizard in profiler and applied most of its suggestions. One of them was going to more or less reproduce the entire table inside an index so I tweaked it by hand to do a bit less than that.
Added more RAM to the server. It was a little low but now it always has something like 8 gigs idle, and the SQL server is configured to use no more than 8 gigs, however it never uses more than 2 or 3. I found that odd. Why isn't it just putting the whole table in RAM? It's only 5 million leads and there's plenty of room.
Pored over query execution plans. I can see that at this point the indexes seem to be mostly doing their job - about 90% of the work is happening during the sorting stage.
Considered partitioning the Leads table out to a different physical drive, but we don't have the resources for that, and it seems like it shouldn't be necessary.
In Closing...
Part of me feels like the server should be able to handle this. Five million records is not so many given the power of that server, which is a decent quad core with 16 gigs of ram. However, I can see how the sorting part is causing millions of rows to be touched just to return a handful.
So what have you done in situations like this? My instinct is that we should maybe slash some functionality, but if there's a way to keep this intact that will save me a war with the business unit.
Thanks in advance!
Database bottlenecks can frequently be improved by improving your SQL queries. Without knowing what those look like, consider creating an operational data store or a data warehouse that you populate on a scheduled basis.
Sometimes flattening out your complex relational databases is the way to go. It can make queries run significantly faster, and make it a lot easier to optimize your queries, since the model is very flat. That may also make it easier to determine if you need to scale your database server up or out. A capacity and growth analysis may help to make that call.
Transactional/highly normalized databases are not usually as scalable as an ODS or data warehouse.
Edit: Your ORM may also support optimizations that are worth looking into, rather than just optimizing the queries it sends to your database. Perhaps bypassing your ORM altogether for the reports would be one way to get full control over your queries and better performance.
Consider how your ORM is creating the queries.
If you're having poor search performance perhaps you could try using stored procedures to return your results and, if necessary, multiple stored procedures specifically tailored to which search criteria are in use.
Determine which ad-hoc queries will most likely be run, or limit the search criteria with stored procedures. Can you summarize data? Treat this app like a data warehouse.
Create indexes on each column involved in the search to avoid table scans.
Create fragments on expressions.
Periodically reorganize the data and update statistics as more leads are loaded.
Put the temporary files created by queries (result sets) on a RAM disk.
Consider migrating to a high-performance RDBMS engine such as Informix OnLine.
Initiate another thread to start displaying N rows from the result set while the query continues to execute.

How to build large/busy RSS feed

I've been playing with RSS feeds this week, and for my next trick I want to build one for our internal application log. We have a centralized database table that our myriad batch and intranet apps use for posting log messages. I want to create an RSS feed off of this table, but I'm not sure how to handle the volume- there could be hundreds of entries per day even on a normal day. An exceptional make-you-want-to-quit kind of day might see a few thousand. Any thoughts?
I would make the feed a static file (you can easily serve thousands of those), regenerated periodically. Then you have a much broader choice of tools, because generation doesn't have to finish in under a second; it can even take minutes. And users still get perfect download speed and a reasonable update frequency.
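To make that concrete, here is a rough Python sketch of the regenerate-to-a-static-file idea, meant to be run from cron every few minutes. The sample rows, feed metadata, and output path are placeholders; in practice the rows would come from a query against the central log table.

import os
from xml.sax.saxutils import escape

OUTPUT = "applog.xml"  # placeholder path under the web root

# Placeholder data: (RFC 822 timestamp, application, message).
rows = [
    ("Mon, 01 Jan 2024 06:00:00 GMT", "nightly-batch", "Job finished OK"),
    ("Mon, 01 Jan 2024 06:05:00 GMT", "intranet-app", "Login failure spike"),
]

items = "".join(
    "<item><title>%s: %s</title><pubDate>%s</pubDate></item>"
    % (escape(app), escape(msg), escape(ts))
    for ts, app, msg in rows
)
feed = (
    '<?xml version="1.0"?><rss version="2.0"><channel>'
    "<title>Application log</title><link>http://example.invalid/log</link>"
    "<description>Recent log messages</description>%s</channel></rss>" % items
)

# Write to a temporary file and rename, so readers never see a partial feed.
tmp = OUTPUT + ".tmp"
with open(tmp, "w") as f:
    f.write(feed)
os.replace(tmp, OUTPUT)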
If you are building a system with notifications that must not be missed, then a pub-sub mechanism (using XMPP, one of the other protocols supported by ApacheMQ, or something similar) will be more suitable than a syndication mechanism. You need some measure of coupling between the system that is generating the notifications and the ones that are consuming them, to ensure that consumers don't miss notifications.
(You can do this using RSS or Atom as a transport format, but it's probably not a common use case; you'd need to vary the notifications shown based on the consumer and which notifications it has previously seen.)
I'd split up the feeds as much as possible and let users recombine them as desired. If I were doing it I'd probably think about using Django and the syndication framework.
Django's models could probably handle representing the data structure of the tables you care about.
You could have a URL pattern that catches everything after /rss/ in one group, for example r'^rss/(?P<feeds>.+)/$' (I can't test it right now, so it might not be perfect).
That way you could use URLs like:
http://feedserver/rss/batch-file-output/
http://feedserver/rss/support-tickets/
http://feedserver/rss/batch-file-output/support-tickets/ (the first two combined into one)
Then in the view:
def get_batch_file_messages():
    # Grab all the recent batch file messages here.
    # Maybe cache the result and only regenerate every so often.
    return []

# Other feed functions here.

feed_mapping = {'batch-file-output': get_batch_file_messages}

def rss(request, feeds):
    items_to_display = []
    for feed in feeds.strip('/').split('/'):
        items_to_display += feed_mapping[feed]()
    # Process items_to_display and return the feed here.
Having individual, chainable feeds means that users can subscribe to one feed at a time, or merge the ones they care about into one larger feed. Whatever's easier for them to read, they can do.
Without knowing your application, I can't offer specific advice.
That said, it's common in these sorts of systems to have a level of severity. You could have a query string parameter that you tack onto the end of the URL to specify the severity. If it's set to "DEBUG" you would see every event, no matter how trivial. If you set it to "FATAL" you'd only see events that were "system failure" in magnitude.
If there are still too many events, you may want to sub-divide your events into some sort of category system. Again, I would make this a query string parameter.
You can then have multiple RSS feeds for the various categories and severities. This should allow you to tune the volume of alerts you get to an acceptable level.
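As a hedged illustration of the severity idea (the level names, sample entries, and the hard-coded minimum are placeholders; in the web app the minimum would come from the query string parameter):

LEVELS = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}

def filter_by_severity(entries, minimum="INFO"):
    # Keep only entries at or above the requested severity level.
    threshold = LEVELS.get(minimum.upper(), 0)
    return [e for e in entries if LEVELS.get(e["severity"], 0) >= threshold]

# Example: a FATAL-only feed drops the noisy DEBUG entry.
entries = [
    {"severity": "DEBUG", "message": "cache warmed"},
    {"severity": "FATAL", "message": "batch job crashed"},
]
print(filter_by_severity(entries, minimum="FATAL"))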
In this case, it's more of a manager's dashboard: how much work was put into support today, is there anything pressing in the log right now, and for when we first arrive in the morning as a measure of what went wrong with batch jobs overnight.
Okay, I decided how I'm going to handle this. I'm using the timestamp field on each row and grouping by day. It takes a little bit of SQL-fu to make it happen, since of course there's a full timestamp there and I need to be semi-intelligent about how I pick the log message to show from within each group, but it's not too bad. Further, I'm building it to let you select which application to monitor, and then show every message (max 50) from a specific day.
That gets me down to something reasonable.
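For what it's worth, the grouping logic looks roughly like this when sketched in Python (the field layout and the pick-the-worst-message rule are placeholders; the real version is done in SQL against the log table):

from datetime import datetime
from itertools import groupby

LEVELS = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}

# Placeholder rows: (timestamp, severity, message).
entries = [
    (datetime(2024, 1, 1, 2, 0), "INFO", "batch started"),
    (datetime(2024, 1, 1, 3, 0), "ERROR", "batch step failed"),
    (datetime(2024, 1, 2, 2, 0), "INFO", "batch started"),
]

# groupby needs sorted input; group by calendar day and keep one
# representative message per day (here, the most severe one).
entries.sort(key=lambda e: e[0])
per_day = []
for day, group in groupby(entries, key=lambda e: e[0].date()):
    worst = max(group, key=lambda e: LEVELS.get(e[1], 0))
    per_day.append((day, worst[1], worst[2]))

for day, severity, message in per_day[:50]:  # cap the feed at 50 items
    print(day, severity, message)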
I'm still hoping for a good answer to the more generic question: "How do you syndicate many important messages, where missing a message could be a problem?"
