How to force Scrapy to exit when there is an exception - web-scraping

I wrote a crawler with Scrapy.
There is a function in the pipeline where I write my data to a database. I use the logging module to log runtime logs.
I found that when my string contains Chinese characters, logging.error() throws an exception. But the crawler keeps running!
I know this is a minor error, but if there is a critical exception I will miss it if the crawler keeps running.
My question is: Is there a setting that forces Scrapy to stop when there is an exception?

You can use CLOSESPIDER_ERRORCOUNT:
An integer which specifies the maximum number of errors to receive
before closing the spider. If the spider generates more than that
number of errors, it will be closed with the reason
closespider_errorcount. If zero (or not set), spiders won’t be closed
by number of errors.
By default it is set to 0:
CLOSESPIDER_ERRORCOUNT = 0
You can change it to 1 if you want to exit on the first error.
UPDATE
According to the answers to this question, you can also call the following from within the spider, where crawler is the spider's crawler (self.crawler):
crawler.engine.close_spider(self, 'log message')
For more information, read:
Close spider extension

In the process_item method of your pipeline you receive the spider instance as an argument.
To solve your problem, you could catch the exceptions raised when you insert your data, then cleanly stop your spider when you catch a specific exception, like this:
def process_item(self, item, spider):
    try:
        # Insert your item here
        pass
    except YourExceptionName:
        # close_spider expects the spider, not the pipeline (self)
        spider.crawler.engine.close_spider(spider, reason='finished')
    return item

I don't know of a setting that would close the crawler on any exception, but you have at least a couple of options:
you can raise CloseSpider exception in a spider callback, maybe when you catch that exception you mention
you can call crawler.engine.close_spider(spider, 'some reason') if you have a reference to the crawler and spider object, for example in an extension. See how the CloseSpider extension is implemented (it's not the same as the CloseSpider exception).
You could hook this with the spider_error signal for example.
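The signal-based option could be sketched roughly as below. This is only a minimal sketch: the class name CloseOnSpiderError and the reason string are made up for illustration, and it assumes the spider_error signal and engine.close_spider(spider, reason) behave as described in the answers above.

```python
class CloseOnSpiderError:
    """Hypothetical Scrapy extension: close the spider on the first callback error."""

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the handler below can be exercised without Scrapy installed.
        from scrapy import signals

        ext = cls(crawler)
        # spider_error is sent when a spider callback raises an exception.
        crawler.signals.connect(ext.on_spider_error, signal=signals.spider_error)
        return ext

    def on_spider_error(self, failure, response, spider):
        # Ask the engine to shut this spider down; the reason string is arbitrary.
        self.crawler.engine.close_spider(spider, reason="spider_error")
```

To try it, the class would be listed in the EXTENSIONS setting under whatever module path you place it in (for example a hypothetical myproject.extensions.CloseOnSpiderError).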

Related

How can my Flask app check whether a SQLite3 transaction is in progress?

I am trying to build some smart error messages using the @app.errorhandler(500) feature. For example, my route includes an INSERT command to the database:
if request.method == "POST":
    userID = int(request.form.get("userID"))
    topicID = int(request.form.get("topicID"))
    db.execute("BEGIN TRANSACTION")
    db.execute("INSERT INTO UserToTopic (userID,topicID) VALUES (?,?)", userID, topicID)
    db.execute("COMMIT")
If that transaction violates a constraint, such as UNIQUE or FOREIGN_KEY, I want to catch the error and display a user-friendly message. To do this, I'm using the Flask @app.errorhandler as follows:
@app.errorhandler(500)
def internal_error(error):
    db.execute("ROLLBACK")
    return render_template('500.html'), 500
The "ROLLBACK" command works fine if I'm in the middle of a database transaction. But sometimes the 500 error is not related to the db, and in those cases the ROLLBACK statement itself causes an error, because you can't rollback a transaction that never started. So I'm looking for a method that returns a Boolean value that would be true if a db transaction is under way, and false if not, so I can use it to make the ROLLBACK conditional. The only one I can find in the SQLite3 documentation is for a C interface, and I can't get it to work with my Python code. Any suggestions?
I know that if I'm careful enough with my forms and routes, I can prevent 99% of potential violations of db rules. But I would still like a smart error catcher to protect me for the other 1%.
I don't know how transactions work in SQLite, but you can achieve what you are trying to do with try/except statements.
Use try/except within the error handler function:
    try:
        db.execute("ROLLBACK")
    except Exception:
        # No transaction was open, so there is nothing to roll back
        pass
    return render_template('500.html'), 500
Use try/except when inserting data:
from flask import abort
try:
    userID = int(request.form.get("userID"))
    [...]
except Exception:
    db.rollback()
    abort(500)
I am not familiar with SQLite errors. If you know which specific error occurs, catch that specific exception instead of catching everything.
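If the connection comes from Python's standard sqlite3 module, the connection object has an in_transaction attribute that gives the Boolean the question asks for. The sketch below assumes a plain sqlite3 connection, which may not match the db wrapper used in the question. The helper names rollback_if_needed and add_link are made up for illustration.

```python
import sqlite3


def rollback_if_needed(conn: sqlite3.Connection) -> bool:
    """Roll back only when a transaction is open; return True if a rollback ran."""
    if conn.in_transaction:
        conn.rollback()
        return True
    return False


def add_link(conn: sqlite3.Connection, user_id: int, topic_id: int) -> bool:
    """Insert one row; return False instead of raising on a constraint violation."""
    try:
        conn.execute(
            "INSERT INTO UserToTopic (userID, topicID) VALUES (?, ?)",
            (user_id, topic_id),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Catch only the constraint error; anything else still propagates.
        rollback_if_needed(conn)
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserToTopic (userID INTEGER, topicID INTEGER, UNIQUE (userID, topicID))"
)

nothing_open = rollback_if_needed(conn)  # no transaction is open yet
first = add_link(conn, 1, 2)             # succeeds and commits
duplicate = add_link(conn, 1, 2)         # violates the UNIQUE constraint
rows = conn.execute("SELECT COUNT(*) FROM UserToTopic").fetchone()[0]
```

In a 500 handler, calling rollback_if_needed(conn) is then safe whether or not the error came from the database.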

Strange ZQuery behavior

I'm using Zeos and SQLite3 DB in Delphi
ZQuery2.Close;
ZQuery2.SQL.Clear;
ZQuery2.SQL.Add('SELECT * FROM users WHERE un = ' + QuotedStr( UserName ) );
ZQuery2.Open;
OutputDebugString(PWideChar( ZQuery2.FieldDefList.CommaText )); // log : id,un,pw
OutputDebugString(PWideChar(ZQuery2.FieldByName('pw').AsString)); //causes error sometimes
The code works, but sometimes I get the following error message:
Exception class EDatabaseError with message 'ZQuery2: Field 'pw' not found'.
This is odd because a field of a dataset shouldn't just disappear while the app is in the middle of running, especially if other fields are still operating normally. So, I would suspect something like a memory overwrite being the cause.
Memory overwrites usually happen when something is written to the wrong place in memory, overwriting what is there, usually because of an incorrect pointer value or a so-called "buffer overrun", where the write operation carries on beyond where it should stop. Usually, the pointer value is so wildly wrong that the OS can detect it and raise an AV, but sometimes it is less obvious.
Delphi's memory manager has a 'full debug mode' which adds special checks for this condition, see here.
I suggest you enable full debug mode as per the linked document and wait for the exception to occur.

What does "rerun in upstream in pipeline" do?

I am defining a pipeline in Data Factory. I had some errors, which I corrected.
The first activity calls a U-SQL script to do some aggregation. I changed the script many times, but the error is still:
[{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC","source":"USER","message":"syntax
error. Final statement did not end with a semicolon","details":"at
token 'usql', line 4\r\nnear the ###:\r\n**************\r\nCLARE
@lineitemsfile string =
\"/datalakerepo/input/2016/01/01lineitems.txt\";\nDECLARE @ordersfile
string = \"/datalakerepo/input/2016/01/01orders.txt\";\nsales.usql ###
\n","description":"Invalid syntax found in the
script.","resolution":"Correct the script syntax, using expected
token(s) as a
guide.","helpLink":"","filePath":"","lineNumber":4,"startOffset":228,"endOffset":232}].
It seems like not all of the U-SQL script is read by Data Factory, so I thought that maybe "rerun in upstream in pipeline" has something to do with this, like clearing the cache from the previous script.
Does anyone know what "rerun in upstream in pipeline" does?
Many thanks!
"Rerun with upstream in pipeline" basically means "recalculate with all dependencies". For example, if one has pipeline1 -> dataset1 -> pipeline2 and tries to rerun pipeline2 with dependencies, then both pipeline1 and pipeline2 will be executed. I believe it works the same way with several chained activities within a single pipeline.
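As a toy illustration of "recalculate with all dependencies", the chain from the example can be modeled as a small dependency graph. This is not Data Factory's API, only a sketch of the idea. The depends_on mapping and the function name rerun_with_upstream are made up.

```python
def rerun_with_upstream(target, depends_on):
    """Return the target and everything upstream of it, dependencies first."""
    order = []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in depends_on.get(node, []):
            visit(dep)
        # A node is appended only after all of its upstream nodes.
        order.append(node)

    visit(target)
    return order


# pipeline1 -> dataset1 -> pipeline2, written as "node: what it depends on"
depends_on = {
    "pipeline2": ["dataset1"],
    "dataset1": ["pipeline1"],
    "pipeline1": [],
}

order = rerun_with_upstream("pipeline2", depends_on)
pipelines_to_run = [node for node in order if node.startswith("pipeline")]
```

Here pipelines_to_run contains pipeline1 followed by pipeline2, matching the behavior described above.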

ASP.NET Unexpected and Different Behavior in Different Environments

I have an ASP.NET site (VB.NET) that I'm trying to clean up. When it was originally created it was written with no error handling, and I'm trying to add it in to improve the User Experience.
Try
    If Not String.IsNullOrEmpty(strMfgName) And Not String.IsNullOrEmpty(strSortType) Then
        If Integer.TryParse(Request.QueryString("CategoryID"), i) And String.IsNullOrEmpty(Request.QueryString("CategoryID"))
            MyDataGrid.DataSource = ProductCategoryDB.GetMfgItems(strMfgName, strSortType, i)
        Else
            MyDataGrid.DataSource = ProductCategoryDB.GetMfgItems(strMfgName, strSortType)
        End If
        MyDataGrid.DataBind()
        If CType(MyDataGrid.DataSource, DataSet).Tables("Data").Rows.Count > 0 Then
            lblCatName.Text = CType(MyDataGrid.DataSource, DataSet).Tables("Data").Rows(0).Item("mfgName")
        End If
        If MyDataGrid.Items.Count < 2 Then
            cboSortTypes.Visible = False
            table_search.Visible = False
        End If
        If MyDataGrid.PageCount < 2 Then
            MyDataGrid.PagerStyle.Visible = False
        End If
    Else
        lblCatName.Text &= "<br /><span style=""fontf-size: 12px;"">There are no items for this manufacturer</span>"
        MyDataGrid.Visible = False
        table_search.Visible = False
    End If
Catch
    lblCatName.Text &= "<br /><span style=""font-size: 12px;"">There are no items for this manufacturer</span>"
    MyDataGrid.Visible = False
    table_search.Visible = False
End Try
Now, this is trying to avoid generating a 500 error by catching exceptions. There can be three items on the query string, but only two matter here. In my test environment and in Visual Studio when I run this site, it doesn't matter if that item is on the query string. In production, it does matter. If that third item isn't present (SubCategoryID) on the query string, then the "There are no items for this manufacturer" displays instead of the data from the database.
In the two different environments I am seeing two different code execution paths, despite the same URLs and the same code base.
The site is running on Server 2003 with IIS 6.
Thoughts?
EDIT:
In response to the answer below, I doubt it's a connection error (though I see what you're getting to), as when I add the SubCategoryID to the query string, the site works correctly (displaying data from the database).
Also, please let me know if you have any suggestions for how to test this scenario without deploying the code back to production (it's been rolled back).
I think you should try to print out the exception details in your catch block to see what the problem is. It could be anything, for example a connection error to your database.
The error could be anything, and you should definitely consider printing this out or logging it somewhere, rather than making the assumption that there's no data. You're also outputting the same error message to the UI for two different code paths, which makes things harder to debug, especially without knowing if an exception occurred, and if so, what it was.
Generally, it's also better not to have a catch for all exceptions in cases like this, especially without logging the error. Instead, you should catch specific exceptions and handle these appropriately, and any real exceptions can get passed up the stack, ideally to a global error handler which can log it and/or send out some kind of error notification.
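The pattern described above (catch only the exception you expect, log it, and let everything else reach a global handler that logs it too) can be sketched as follows. The sketch is in Python rather than VB.NET, and every name in it (NoItemsError, load_items, handle_request) is hypothetical; it only illustrates the control flow.

```python
import logging

logger = logging.getLogger("catalog")


class NoItemsError(Exception):
    """Hypothetical error meaning 'this manufacturer has no items'."""


def load_items(fetch, mfg_name):
    """Handle only the expected case; unexpected errors propagate to the caller."""
    try:
        return fetch(mfg_name)
    except NoItemsError:
        logger.info("No items for manufacturer %r", mfg_name)
        return []


def handle_request(fetch, mfg_name):
    """Stand-in for a global error handler: log the full traceback, return a 500."""
    try:
        return {"status": 200, "items": load_items(fetch, mfg_name)}
    except Exception:
        logger.exception("Unhandled error while loading items for %r", mfg_name)
        return {"status": 500, "items": []}


def fetch_none(mfg_name):
    raise NoItemsError(mfg_name)


def fetch_broken(mfg_name):
    raise ConnectionError("database unreachable")


empty_result = handle_request(fetch_none, "acme")
failed_result = handle_request(fetch_broken, "acme")
```

With this split, "no data" and "something broke" no longer produce the same message, and the log shows which one actually happened.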
I discovered the reason yesterday. In short it was because when I copied my files from my computer into my dev-test environment, I missed a file, which ironically caused it to work, rather than not. So in the end it would have functioned the same in both environments.

URLLoader fails randomly without throwing an error or dispatching any events

In Adobe AIR 1.5, I'm using URLLoader to upload a video in 1 MB chunks. It uploads 1 MB, waits for the Event.COMPLETE event, and then uploads the next chunk. The server-side code knows how to construct the video from these chunks.
Usually, it works fine. However, sometimes it just stops without throwing any errors or dispatching any events. This is an example of what is shown in a log that I create:
Uploading chunk of size: 1000000
HTTP_RESPONSE_STATUS dispatched: 200
HTTP_STATUS dispatched: 200
Completed chunk 1 of 108
Uploading chunk of size: 1000000
HTTP_RESPONSE_STATUS ...
etc...
Most of the time, it completes all of the chunks fine. However, sometimes, it just fails in the middle:
Completed chunk 2 of 108
Uploading chunk of size: 1000000
... and nothing else, and no network activity.
Through debugging, I can tell that it does successfully call urlLoader.load(). When it fails, it just seems to stall, calling load(), and then calling the UIComponent's callLaterDispatcher() and then nothing.
Does anyone have any idea why this could be happening? I'm setting up my URLLoader like this:
urlLoader.dataFormat = URLLoaderDataFormat.BINARY;
urlLoader.addEventListener(Event.COMPLETE, chunkComplete);
urlLoader.addEventListener(IOErrorEvent.IO_ERROR, ioErrorHandler);
urlLoader.addEventListener(SecurityErrorEvent.SECURITY_ERROR, securityErrorHandler);
urlLoader.addEventListener(HTTPStatusEvent.HTTP_RESPONSE_STATUS, responseStatusHandler);
urlLoader.addEventListener(HTTPStatusEvent.HTTP_STATUS, statusHandler);
urlLoader.addEventListener(ProgressEvent.PROGRESS, progressHandler);
And I'm re-using it for each chunk. No events get called when it doesn't succeed, and urlLoader.load() doesn't throw any exceptions. When it succeeds, HTTP_RESPONSE_STATUS, HTTP_STATUS, and PROGRESS events are dispatched.
Thanks!
Edit: One thing that might be helpful is that, we have the same upload functionality implemented in .NET. In .NET, the request.GetResponse() method sometimes throws an exception, complaining that the connection was closed unexpectedly. We catch the exception if this happens, and try that chunk again, until it succeeds. I'm looking to implement something similar here, but there are no exceptions being thrown or error events being dispatched.
More detailed code example below. The URLLoader is set up as described above. The readAgain variable just makes it skip reading a new set of bytes from the file stream (i.e. it tries to send the old one again). However, it never catches any exceptions, because none are ever thrown.
private function uploadSegment():void
{
    // ... prepare byte array, set up URL ...
    // Create a URL request
    var urlRequest:URLRequest = new URLRequest();
    urlRequest.url = _url + "?" + paramStr;
    urlRequest.method = URLRequestMethod.POST;
    urlRequest.data = byteArray;
    urlRequest.useCache = false;
    urlRequest.requestHeaders.push(new URLRequestHeader('Cache-Control', 'no-cache'));
    try
    {
        urlLoader.load(urlRequest);
    }
    catch (e:Error)
    {
        Logger.error("Failed to upload chunk. Caught exception. Trying again.");
        readAgain = true;
        uploadSegment();
        return;
    }
    readAgain = false;
}
Have you tried signing up for 'Event.OPEN' to see if the connection is opening correctly? If you're doing this per chunk - perhaps that event or lack thereof would help?
[Edit]
Can you also try setting useCache to false on your URLRequest?
[Edit]
I assume your urlLoader is globally referenced... If not, while you're waiting for async behavior, something evil like GC might hurt you... But, skipping that: if you call 'bytesTotal' while you're waiting for something to happen, does it always return zero?
[More]
Also - check the URL in the cases where NOTHING happens - because online I've found some mention that if the server is unreachable there are no events fired (though there is some argument around that)...
I encountered a similar problem in Flex, only with Safari.
The URLLoader sometimes returned nothing, not even the OPEN event.
I made sure that this wasn't a cache problem.
After lots of trial and error, the only remedy I found was to use the https protocol in the URL. I am not sure what this does to Safari, but now the problem is gone.
