Faster persist and flush for Doctrine MongoDB ODM import - Symfony

I am using Symfony2 and Doctrine MongoDB ODM to import product data from CSV files. I created a console command to create the Product objects, then persist them and flush the DocumentManager. The flush is taking upwards of 30 seconds, and I only have a couple thousand products; there will potentially be many more in the future.
I am wondering if there are any optimizations/best practices to make flushing a large quantity of new objects faster in Doctrine. It seems like there shouldn't need to be much processing on the objects, since they are all new and just need to be added to the collection.

I experienced a similar problem (loading thousands of products from a CSV, as it happens). My problem was more about running out of memory, but the fix produced a significant speed increase as well.
Essentially, I put a counter inside the loop and, every so often, flushed the manager and then cleared it. I found that a batch size of 150 yielded the best results. I'm sure the ideal size depends largely on how you process each record, as I had a lot of number crunching going on to clean the data before inserting it.
For reference, it loads about 5,500 products with 100+ fields, and does processing on them, in about 20 seconds. It was taking 3+ minutes before the modification (if it finished at all, given it often ran out of memory).
//LOOP {
    if ($count % $batchSize == 0) {
        $manager->flush();
        $manager->clear();
        gc_collect_cycles();
        echo $count . ' | ' . number_format(memory_get_usage() / 1024, 4) . " KBs\n";
    }
    $count++;
//}
Don't forget to call $manager->flush() at least one more time after the loop completes, to catch the last 1-149 records that won't have triggered a flush inside the loop.
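For completeness, here is a minimal sketch of how the whole command loop might look with that trailing flush; $handle (an already-opened CSV file), buildProduct() and $manager (the DocumentManager) are placeholders for your own code:
$batchSize = 150;
$count = 0;
// $handle is an already-opened CSV file; buildProduct() is your own hydration code.
while (($row = fgetcsv($handle)) !== false) {
    $product = buildProduct($row);
    $manager->persist($product);
    if (++$count % $batchSize === 0) {
        $manager->flush();
        $manager->clear(); // detach managed documents so memory is released
        gc_collect_cycles();
    }
}
// Final flush/clear for the remaining documents that never hit a batch boundary.
$manager->flush();
$manager->clear();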

I have a very large database. I find it much more efficient to flush after every insert; that way the code manages access to the database better.
$dm->persist($object);
$dm->flush();
$dm->clear();

Related

Optimizing a set of 20 web requests with threads

This is for ASP.NET. I want to improve the time it takes to run my function; today it takes around 20-30 seconds, usually closer to 30 than 20. That's running on one thread making 20 web requests.
I'm thinking of using threads to do all 20 web requests, in order to find the result quickly, or simply to go through all the data (i.e. make all 20 requests without finding anything).
Here's how it works:
1. I'm using HTML Agility Pack to fetch HTML documents.
2. Then I parse them for information.
3. Lastly I add that information to a dictionary, OR I move on to the next web request, until I have made 20 requests.
I make at most 20 web requests, at minimum 1. I have set the function to end when the info I'm searching for is found. Sometimes the info isn't there, hence the 20 web requests (it goes through all the data).
Every web request adds between 5 and 20 entries to the dictionary. This is then compared with the information I sent to it; if it's in the list I get the key back, otherwise it returns 201. If found, it gets added to the database.
QUESTIONS
A: If I want to do this with threads, how many should I create? 20, one for each request, and let them all loose to do the job? Or should I create something like 4 of them, each making at most 5 requests?
B: What if two threads finish at the same time and both want to add info to the dictionary? Can that lock the whole site (I'm using ASP.NET), or will it just add one result from thread A and then one from thread B? I already have a check that verifies the key doesn't exist before adding it.
C: What would be the fastest way to do this?
This is my code, showing the loop that makes the (up to) 20 requests:
public void FetchAndParseAllPages()
{
    int _maxSearchDepth = 200;
    int _searchIncrement = 10;
    PageFetcher fetcher = new PageFetcher();
    for (int i = 0; i < _maxSearchDepth; i += _searchIncrement)
    {
        string keywordNsearch = _keyword + i;
        ParseHtmldocuments(fetcher.GetWebpage(keywordNsearch));
        if (GetPostion() != 201)
        {
            // ADD DATA TO DATABASE
            InsertRankingData(DocParser.GetSearchResults(), _theSearchedKeyword);
            return;
        }
    }
}
.NET allows only 2 concurrent HTTP connections to the same host by default. If you want more than that, you need to configure it in web.config. Look here: http://msdn.microsoft.com/en-us/library/aa480507.aspx
You can use the Parallel.For method, which is very straightforward and handles the "how many threads" question for you. Of course, you can tweak how many threads (or tasks) you want with ParallelOptions. Look here: http://msdn.microsoft.com/en-us/library/dd781401.aspx
For a thread-safe dictionary you can use ConcurrentDictionary. Look here: http://msdn.microsoft.com/en-us/library/dd287191.aspx

Please suggest a way to store a temp file in Windows Azure

I have a simple feature in an ASP.NET MVC3 application hosted on Azure.
1st step: the user uploads a picture
2nd step: the user crops the uploaded picture
3rd step: the system saves the cropped picture and deletes the temp file, which is the uploaded original picture
Here is the problem I am facing now: where to store the temp file?
I tried somewhere on the Windows file system, and LocalResources: the problem is that these resources are per instance, so there is no guarantee that the instance running the code that shows the picture to crop will be the same instance that saved the temp file.
Do you have any idea on this temp file issue?
Normally the file exists just for a short while before it is deleted.
The temp file needs to be instance independent.
Ideally the file would have an expiry setting (for example, 1 hour) so it deletes itself, in case the code crashes somewhere.
OK. So what you're after is basically something that is shared storage but expires. Amazon have just announced a rather nice setting called object expiration (https://forums.aws.amazon.com/ann.jspa?annID=1303). Nothing like this for Windows Azure storage yet, unfortunately, but that doesn't mean we can't come up with some other approach; indeed, we may even come up with a better (more cost effective) approach.
You say that it needs to be instance independent, which means using a local temp drive is out of the picture. As others have said, my initial leaning would be towards Blob storage, but you will have cleanup effort there. If you are working with large images (>1MB) or low throughput (<100rps) then I think Blob storage is the only option. If you are working with smaller images AND high throughput then the transaction costs for blob storage will start to really add up (I have a white paper coming out soon which shows some modelling of this, but some quick thoughts are below).
For a scenario with small images and high throughput, a better option might be to use the Windows Azure Cache as your temporary storage area. At first glance it will be eye-wateringly expensive on a per-GB basis (110GB/month for Cache, 12c/GB for Storage). But with storage your transactions are paid for, whereas with Cache they are 'free'. (Quotas are here: http://msdn.microsoft.com/en-us/library/hh697522.aspx#C_BKMK_FAQ8) This can really add up; e.g. using 100kb temp files held for 20 minutes with a system throughput of 1500rps, using Cache is about $1000 per month vs $15000 per month for storage transactions.
The Azure Cache approach is well worth considering, but to be sure it is the 'best' approach I'd really want to know:
Size of images
Throughput per hour
A bit more detail on the actual client interaction with the server during the crop process? Is it an interactive process where the user will pull the image into their browser and crop visually? Or is it just a simple crop?
Here is what I see as a possible approach:
1. The user uploads the picture.
2. Your code saves it to a blob and records in some data backend the relation between the user session and the uploaded image (marking it as a temp image).
3. Display the image in the cropping user interface.
4. When the user is done cropping on the client:
4.1. retrieve the original from the blob
4.2. crop it according to the data sent from the user
4.3. delete the original from the blob and the record in the data backend used in step 2
4.4. save the final image to another blob (the final blob)
And have one background process checking for "expired" temp images in the data backend (used in step 2) to delete those images and the records in the data backend.
Please note that even in a WebRole, you still have the RoleEntryPoint descendant, and you can still override the Run method. By implementing an infinite loop in Run() (that method shall never exit!), you can check every N seconds (depending on your Thread.Sleep() in Run()) whether there is anything to delete.
You can use Azure Blob storage. Have a look at this tutorial.
The sample below may help you.
https://code.msdn.microsoft.com/How-to-store-temp-files-in-d33bbb10
You have two ways to handle temp files in Azure.
1. You can use the Path.GetTempPath() and Path.GetTempFileName() functions for the temp file name.
2. You can use Azure Blob storage to simulate it.
private long TotalLimitSizeOfTempFiles = 100 * 1024 * 1024;

private async Task SaveTempFile(string fileName, long contentLength, Stream inputStream)
{
    try
    {
        // First, check whether the container exists; if not, create it.
        await container.CreateIfNotExistsAsync();
        // Init a blob reference.
        CloudBlockBlob tempFileBlob = container.GetBlockBlobReference(fileName);
        // If the blob already exists, delete the old one.
        tempFileBlob.DeleteIfExists();
        // Check whether the blobs exceed the size limit; if so, clear the oldest ones.
        await CleanStorageIfReachLimit(contentLength);
        // Upload the new file.
        tempFileBlob.UploadFromStream(inputStream);
    }
    catch (Exception ex)
    {
        if (ex.InnerException != null)
        {
            throw ex.InnerException;
        }
        else
        {
            throw;
        }
    }
}

// Check whether the blobs exceed the size limit; if so, delete the oldest until there is room.
private async Task CleanStorageIfReachLimit(long newFileLength)
{
    List<CloudBlob> blobs = container.ListBlobs()
        .OfType<CloudBlob>()
        .OrderBy(m => m.Properties.LastModified)
        .ToList();
    // Get the total size of all blobs.
    long totalSize = blobs.Sum(m => m.Properties.Length);
    // Calculate the space actually available before this upload.
    long realLimitSize = TotalLimitSizeOfTempFiles - newFileLength;
    // Delete the oldest blobs; stop as soon as enough space is free.
    foreach (CloudBlob item in blobs)
    {
        if (totalSize <= realLimitSize)
        {
            break;
        }
        await item.DeleteIfExistsAsync();
        totalSize -= item.Properties.Length;
    }
}

Create a timed cache in Drupal

I am looking for more detailed information on how I can get the following caching behavior in Drupal 7.
I want a block that renders information I'm retrieving from an external service. As the block is rendered for many users, I do not want to continually request data from that service, but instead cache the result. However, this data changes relatively frequently, so I'd like to retrieve the latest data every 5 or 10 minutes and then cache it again.
Does anyone know how to achieve such caching behavior without writing too much of the code oneself? I also haven't found much in terms of good documentation on how to use caching in Drupal (7), so any pointers on that are appreciated as well.
Keep in mind that cache_get() does not actually check if an item is expired or not. So you need to use:
if (($cache = cache_get('your_cache_key')) && $cache->expire >= REQUEST_TIME) {
return $cache->data;
}
Also make sure to use the REQUEST_TIME constant rather than time() in D7.
The functions cache_set() and cache_get() are what you are looking for. cache_set() has an expire argument.
You can use them basically like this:
<?php
if ($cached_data = cache_get('your_cache_key')) {
// Return from cache.
return $cached_data->data;
}
// No or outdated cache entry, refresh data.
$data = _your_module_get_data_from_external_service();
// Save data in cache with 5min expiration time.
cache_set('your_cache_key', $data, 'cache', time() + 60 * 5);
return $data;
?>
Note: You can also use a different cache bin (see documentation links) but you need to create a corresponding cache table yourself as part of your schema.
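If you do go the custom-bin route, the usual Drupal 7 pattern is to clone the core cache table definition in your module's hook_schema(); a minimal sketch, with the module name as a placeholder:
// mymodule.install
function mymodule_schema() {
  $schema['cache_mymodule'] = drupal_get_schema_unprocessed('system', 'cache');
  $schema['cache_mymodule']['description'] = 'Cache table for data fetched from the external service.';
  return $schema;
}
You would then pass 'cache_mymodule' as the $bin argument to cache_set() and cache_get().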
I think this should be $cache->expire, not expires. I didn't have luck with this example if I'm setting REQUEST_TIME + 300 in cache_set() since $cache->expires will always be less than REQUEST_TIME. This works for me:
if (($cache = cache_get('your_cache_key', 'cache')) && (REQUEST_TIME < $cache->expire)) {
return $cache->data;
}
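Putting the two answers together, a minimal sketch of the whole timed-cache pattern for the block data; the function name is a placeholder, and _your_module_get_data_from_external_service() is the hypothetical fetcher from the answer above:
function mymodule_get_block_data() {
  $cache = cache_get('your_cache_key');
  if ($cache && $cache->expire > REQUEST_TIME) {
    return $cache->data;
  }
  // Cache miss or expired: refresh from the external service and cache for 5 minutes.
  $data = _your_module_get_data_from_external_service();
  cache_set('your_cache_key', $data, 'cache', REQUEST_TIME + 300);
  return $data;
}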

Drupal module to control user post frequency?

We've been having a new type of spam-bot this week at PortableApps.com which posts at a rate of about 10 comments a minute and doesn't seem to stop - at least the first hour or so (we've always stopped it within that time so far). We've had them about a dozen times in the last week - sometimes stopping it at 50 or 60, sometimes up to 250 or 300. We're working to stop it and other spam bots as much as possible, but at the moment it's still a real pest.
I was wondering whether, in the mean time, there's any sort of module to control the frequency a user can post at, e.g. 50 an hour, or something like 10 an hour for new users. That at least would mean that instead of having to clear up 300 comments 50 at a time in admin/content/comment, we'd have a smaller number to clear. (A module to add a page that deletes all content by a user and blocks them would also be helpful!)
I believe that there's a plugin to do this available for WordPress, but I can't find any such thing for Drupal.
For your second question, I would have a look at the code of the User Delete module.
The module also disables the user account and unpublishes all nodes/comments from a certain user. By extending the code, you could easily add the option to unpublish and delete all nodes/comments from a certain user and block the account.
After the unpublish code in the module, you would just add delete code (in SQL if the module selects via an SQL query, or using the Drupal delete functions); see the rough sketch after this answer.
Another option would be to make a view (using the Views module), visible only to administrators, where you choose a certain user with the filters and list his/her posts. Then, in node-contenttype.tpl.php, you place a button that calls a function which deletes all nodes/comments and the user.
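For illustration, a rough Drupal 6 sketch of the "delete everything and block the account" idea described above; the function name is made up, and you would wire it to an admin page or confirmation form yourself:
function mymodule_purge_user($uid) {
  // Delete every node owned by the user; node_delete() also removes the node's comments.
  $result = db_query("SELECT nid FROM {node} WHERE uid = %d", $uid);
  while ($row = db_fetch_object($result)) {
    node_delete($row->nid);
  }
  // Remove comments the user left on other people's nodes (blunt SQL variant,
  // as suggested above; node_comment_statistics would need recounting afterwards).
  db_query("DELETE FROM {comments} WHERE uid = %d", $uid);
  // Block the account.
  db_query("UPDATE {users} SET status = 0 WHERE uid = %d", $uid);
}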
First problem (post frequency)
I've been thinking about the comment post limit. If I remember correctly, Drupal stores comments in a separate table and has comment-specific functions.
I'd create a new module and, using the comment_nodeapi function, check in the 'insert' operation how many comments the current user has already made within a certain timeframe.
To check this, I would write a custom SQL query that takes the count of all comments made by that uid where the post date is greater than NOW minus 1 hour. If that count is larger than 10 or 15 or whatever post frequency you want, you give a message back to the user. You can retrieve the user id and name via the global $user variable.
(example: print $user->name;)
You will have to verify the SQL query yourself, but here's some code for once you have the count:
<?php
function comment_nodeapi(&$node, $op, $arg = 0) {
  global $user;
  switch ($op) {
    case 'insert':
      // Sketch of the count query described above: comments by this user in the last hour.
      $count = db_result(db_query("SELECT COUNT(cid) FROM {comments} WHERE uid = %d AND timestamp > %d", $user->uid, time() - 3600));
      if ($count > 15) {
        $repeat = FALSE;
        $type = 'status';
        drupal_set_message("You have reached the comment limit for this time.", $type, $repeat);
        break;
      }
      else {
        db_query('INSERT INTO {node_comment_statistics} (nid, last_comment_timestamp, last_comment_name, last_comment_uid, comment_count) VALUES (%d, %d, NULL, %d, 0)', $node->nid, $node->changed, $node->uid);
        break;
      }
  }
}
?>
(this code has not been tested so no guarantees, but this should put you on the right track)
I would suggest something like Mollom (from the creator of Drupal). It scans the message for known spam patterns/keywords, and if this scan fails it displays a CAPTCHA to the user, to make sure it's a real human who wants to enter content that has the same properties as spam.
They offer a free service and some paid solutions. We are using it for some customers and it's worth the money. It also integrates very well with Drupal.
Comment Limit is probably what you need.
http://drupal.org/project/spam
http://drupal.org/project/antispam - with akismet support

WordPress Write Cache Issue with Multiple Sessions

I'm working on a content dripper custom plugin in WordPress that my client asked me to build. He says he wants it to catch a page view event, and if it's the right time of day (24 hours since last post), to pull from a resource file and output another post. He needed it to also raise a flag and prevent other sessions from firing that same snippet of code. So, raise some kind of flag saying, "I'm posting that post, go away other process," and then it makes that post and releases the flag again.
However, the strangest thing occurs when the site is placed under load with multiple sessions hitting it with page views. Instead of firing one post, it randomly creates 1, 2, or 3 extra posts, with each one thinking it was the right time to post because it was 24 hours past the time of the last post. Because it's somewhat random, I'm guessing the problem is some kind of write caching, where the other sessions don't see the raised flag until a few microseconds have passed.
The plugin was raising the "flag" by simply writing to the wp_options table with the update_option() API in WordPress. The other user sessions were supposed to read that value with get_option() and see the flag, and then not run that piece of code that creates the post because a given session was already doing it. Then, when done, I lower the flag and the other sessions continue as normal.
But what it's doing is letting those other sessions in.
To make this work, I was using add_action('loop_start','checkToAddContent'). The odd thing about that function though is that it's called more than once on a page, and in fact some plugins may call it. I don't know if there's a better event to hook. Even still, even if I find an event to hook that only runs once on a page view, I still have multiple sessions to contend with (different users who may view the page at the same time) and I want only one given session to trigger the content post when the post is due on the schedule.
I'm wondering if there are any WordPress plugin devs out there who could suggest another event hook to latch on to, and to figure out another way to raise a flag that all sessions would see. I mean, I could use the shared memory API in PHP, but many hosting plans have that disabled. Can't use a cookie or session var because that's only one single session. About the only thing that might work across hosting plans would be to drop a file as a flag, instead. If the file is present, then one session has the flag. If the file is not present, then other sessions can attempt to get the flag. Sure, I could use the file route, but it's kind of immature in my opinion and I was wondering if there's something in WordPress I could do.
The key may be to create a semaphore record in the database for the "drip" event.
Warning - consider the following pseudocode - I'm not looking up the functions.
When the post is queried, use a SQL statement like
$ts = get_time_now(); // or whatever the function is
$sid = session_id();
INSERT INTO table (postcategory, timestamp, sessionid)
VALUES ("$category", $ts, "$sid")
WHERE NOT EXISTS (SELECT 1 FROM table WHERE postcategory = "$category"
                  AND timestamp > $ts - 24 hours)
Database integrity will make this atomic, so only one record can be inserted, and the insertion will only take place if the timespan has been exceeded (no semaphore row exists from the last 24 hours).
Then immediately check to see if the current session_id() and timestamp are yours. If they are, drip.
SELECT sessionid FROM table
WHERE postcategory = "$postcategory"
AND timestamp = $ts
AND sessionid = "$sid"
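Here is a minimal sketch of what that pseudocode might look like with $wpdb; the table name wp_drip_semaphore and its columns are assumptions, and you would create the table yourself (e.g. with dbDelta() on plugin activation) and ensure a PHP session is started before calling session_id():
function mydrip_try_acquire_flag($category) {
    global $wpdb;
    $table = $wpdb->prefix . 'drip_semaphore';
    $sid   = session_id();
    $ts    = current_time('mysql');

    // Atomic "insert only if there was no drip in the last 24 hours" (MySQL).
    $wpdb->query($wpdb->prepare(
        "INSERT INTO $table (postcategory, ts, sessionid)
         SELECT %s, %s, %s FROM DUAL
         WHERE NOT EXISTS (
             SELECT 1 FROM $table
             WHERE postcategory = %s AND ts > DATE_SUB(%s, INTERVAL 24 HOUR)
         )",
        $category, $ts, $sid, $category, $ts
    ));

    // We hold the flag only if the row for this timestamp is ours.
    return (bool) $wpdb->get_var($wpdb->prepare(
        "SELECT 1 FROM $table WHERE postcategory = %s AND ts = %s AND sessionid = %s",
        $category, $ts, $sid
    ));
}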
This problem occurs with page requests even from the same session (same visitor), but it can also occur with page requests from separate visitors. It works like this:
If you are doing content dripping, then a page request is probably what you intercept with add_action('wp','myPageRequest'). From there, if a scheduled post is due, then you create the new post.
The post takes a little bit of time to write to the database. In that time, a query on get_posts() may not see that new record yet. It may actually trigger your piece of code to create a new post when one has already been placed.
The fix, which forces WordPress to flush the write cache, appears to be this:
try {
    $asPosts = array();
    $asPosts = wp_get_recent_posts(1);
    foreach ($asPosts as $asPost) { break; }
    delete_post_meta($asPost['ID'], '_thwart');
    add_post_meta($asPost['ID'], '_thwart', '' . date('Y-m-d H:i:s'));
} catch (Exception $e) {}

$asPosts = array();
$asPosts = wp_get_recent_posts(1);
foreach ($asPosts as $asPost) { break; }
$sLastPostDate = '';
$sLastPostDate = $asPost['post_date'];
$sLastPostDate = substr($sLastPostDate, 0, strpos($sLastPostDate, ' '));
$sNow = date('Y-m-d H:i:s');
$sNow = substr($sNow, 0, strpos($sNow, ' '));
if ($sLastPostDate != $sNow) {
    // No post today, so go ahead and post your new blog post.
    // Place that code here.
}
The first thing we do is get the most recent post. We don't really care whether it's actually the most recent post or not; all we're getting it for is a single post ID, so we can add a hidden custom field (hence the leading underscore) called
_thwart
...as in, thwart the write cache by posting some data to the database that's not too CPU heavy.
Once that is in place, we then also use wp_get_recent_posts(1) yet again so that we can see if the most recent post is not today's date. If not, then we are clear to drip some content in. (Or, if you want to only drip in like every 72 hours, etc., you can change this a little here.)
