Skip feed export when an exception occurred - web-scraping

Is it possible to only export the extracted items if no exception was thrown during the crawl?
I run into errors from time to time, but I need to make sure the crawl completed successfully before processing the exported items.
I am using scrapy-feedexporter-sftp and made the configuration in settings as described in the README.
The upload to SFTP works.
To produce an error, I am using an invalid XPath:
File "/usr/local/lib/python3.5/dist-packages/parsel/selector.py", line 256, in xpath
**kwargs)
File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
ValueError: XPath error: Invalid expression in td[1]//span//text()TEST_TEXT_TO_THROW_ERROR
The crawl failed, but Scrapy pushes the file anyway:
[scrapy.extensions.feedexport] INFO: Stored json feed (59 items) in: sftp://user:pass#host/my/path/to/file/foo_2020-07-23T09-03-50.json
2020-07-23 11:03:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 581,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 24199,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 2.831581,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 23, 9, 3, 53, 763269),
'item_dropped_count': 1,
'item_dropped_reasons_count/DropItem': 1,
'item_scraped_count': 59,
'log_count/DEBUG': 86,
'log_count/ERROR': 1, <------
'log_count/INFO': 15,
'log_count/WARNING': 2,
'memusage/max': 60858368,
'memusage/startup': 60858368,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 7, 23, 9, 3, 50, 931688)}
2020-07-23 11:03:54 [scrapy.core.engine] INFO: Spider closed (finished)
Regards ;)
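One possible workaround (a sketch, not from the original post and not part of scrapy-feedexporter-sftp): write the feed to a local file first, then inspect the crawl stats after the run and only upload when no errors or spider exceptions were logged. The spider class, feed path, and upload_feed() helper below are placeholders.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class FooSpider(scrapy.Spider):
    """Stand-in for the real spider; replace with your own."""
    name = "foo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url}


def upload_feed(path):
    """Placeholder for the SFTP upload step (e.g. via paramiko/pysftp)."""
    print(f"uploading {path} ...")


def main():
    settings = get_project_settings()
    # Write the feed to a local file for this run; upload it only after
    # the stats have been checked.
    settings.set("FEEDS", {"exports/foo.json": {"format": "json"}})

    process = CrawlerProcess(settings)
    crawler = process.create_crawler(FooSpider)
    process.crawl(crawler)
    process.start()  # blocks until the crawl has finished

    stats = crawler.stats.get_stats()
    had_errors = stats.get("log_count/ERROR", 0) > 0 or any(
        key.startswith("spider_exceptions/") for key in stats
    )

    if had_errors:
        print("Crawl had errors; skipping the upload")
    else:
        upload_feed("exports/foo.json")


if __name__ == "__main__":
    main()

This keeps the decision outside the feed exporter itself, so no assumptions are made about the order in which Scrapy's spider_closed handlers run.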

Related

Using gsub with a data frame and getting an odd result: how do you use gsub with a data frame?

I am trying to use the gsub function on my data frame. In my data frame, I have phrases like "text.Democrat17_P" and many others with different numbers. My goal is to replace phrases like this with just "DEM".
I first wanted to test the gsub function on only one row before replacing every value in the data frame. However, when I ran my script, gsub seemed to disassemble my data frame and output lists of numbers instead.
My result looked like this:
[1] "c(14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 16)"
[2] "c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4)"
[3] "c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1)"
[4] "c(2322, 2490, 2912, 3181, 3245, 2640, 3215, 4506, 3256, 2705, 2662, 5676, 7344, 2888, 2891, 4387, 9494, 2525, 3649, 1654, 2178, 2913, 2922, 3320, 7243, 5836, 6054, 6283, 4499, 5291, 4747, 2538, 5433, 5354, 5272, 4166, 3427, 5432, 4566, 5371, 5503, 4550, 1639, 2603, 3937, 2359, 1516, 1204, 826, 916, 1039, 1738, 2077, 874, 2495, 628, 582, 872, 2179, 682, 578, 476, 2207, 1178, 859, 1345, 1223, 2014, 438, 448, 1020, 879, 1117, 271, 210, 295, 233, 172, 77, 205, 3958)"
[5] "c(\"DEM\", \"text.Demminus14_P\", \"text.Demplus9_P\", \"text.Repminus11_O\", \"text.Repminus18_O\", \"text.Demminus12_P\", \"text.Repminus4_O\", \"text.Demminus12_P\", \"text.Repplus8_O\", \"text.Demminus4_P\", \"text.Demplus9_P\", \"text.Demminus20_P\", \"text.Repplus16_O\", \"text.Repminus10_O\", \"text.Repminus13_O\", \"text.Demplus18_P\", \"text.Repplus18_O\", \"text.Demplus1_P\", \"text.Repminus15_O\", \"text.Demminus11_P\", \"text.Repplus14_O\", \"text.Demminus8_P\", \"text.Repminus18_O\", \"text.Repminus13_O\", \"text.Demminus9_P\", \n\"text.Repminus13_O\", \"text.Repminus16_O\", \"text.Demminus9_P\", \"text.Repminus1_O\", \"text.Demplus15_P\", \"DEM\", \"text.Demminus1_P\", \"text.Repplus2_O\", \"text.Demminus18_P\", \"text.Repplus14_O\", \"text.Repminus20_O\", \"text.Repplus16_O\", \"text.Demplus2_P\", \"text.Repplus10_O\", \"text.Demminus18_P\", \"text.Repplus2_O\", \"text.Demminus15_P\", \"text.Repminus6_O\", \"text.Demminus19_P\", \"text.Repminus9_O\", \"text.Repplus15_O\", \"text.Repminus15_O\", \"text.Repminus8_O\", \"text.Repplus12_O\", \"text.Demminus19_P\", \n\"text.Repplus6_O\", \"text.Demplus13_P\", \"text.Demminus14_P\", \"text.Demminus5_P\", \"text.Demminus2_P\", \"text.Repplus1_O\", \"text.Repminus18_O\", \"text.Repplus14_O\", \"text.Demplus20_P\", \"text.Repplus6_O\", \"text.Repminus16_O\", \"text.Demminus19_P\", \"text.Demplus12_P\", \"text.Demminus12_P\", \"text.Demminus10_P\", \"text.Repplus5_O\", \"text.Demplus5_P\", \"text.Repplus17_O\", \"text.Repminus13_O\", \"text.Demplus3_P\", \"text.Demminus5_P\", \"text.Repminus10_O\", \"text.Repplus6_O\", \"text.Repplus16_O\", \"text.Repminus10_O\", \n\"text.Repplus1_O\", \"text.Demminus6_P\", \"text.Repplus5_O\", \"text.Demplus3_P\", \"text.Demplus3_P\", \"text.Repminus19_P\")"
Does anyone know why this happened and how I can get my result to look like the original data frame, with rows and columns?
This is the code that I am using:
DDMB <- DDMBehavfin[1, ]
DDMB
gsub(pattern = "text.Demminus17_P", replacement = "DEM", x = DDMB)
Could this have something to do with the data types of the columns? What can I do to my gsub call so that I get a regular-looking data frame instead of a messy result like this?
I first want to understand why my result looks odd before using gsub to replace all of the values.
Thank you for any help.

Scrapy Depth limit changing Itself

I am crawling a website using Scrapy. Let's say there are 150 pages to crawl; the site has pagination where each page gives the URL of the next page to crawl.
Now my spider stops by itself, with the following logs:
{'downloader/request_bytes': 38096,
'downloader/request_count': 55,
'downloader/request_method_count/GET': 55,
'downloader/response_bytes': 5014634,
'downloader/response_count': 55,
'downloader/response_status_count/200': 55,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 17, 19, 12, 11, 607000),
'item_scraped_count': 2,
'log_count/DEBUG': 58,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'request_depth_max': 36,
'response_received_count': 55,
'scheduler/dequeued': 55,
'scheduler/dequeued/memory': 55,
'scheduler/enqueued': 55,
'scheduler/enqueued/memory': 55,
'start_time': datetime.datetime(2016, 8, 17, 19, 9, 13, 893000)}
The request_depth_max is sometimes 51, and now it is 36, but in my settings I have DEPTH_LIMIT = 1000000000.
I have also tried setting DEPTH_LIMIT to 0, but the spider still stops by itself. Is there any setting that I am missing?
The stat request_depth_max is not a setting; it just reports the highest depth your spider reached in this run.
Also, DEPTH_LIMIT defaults to 0, which equates to no limit (infinity).
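To make the distinction concrete, a minimal sketch (the values are just for illustration):

# settings.py
DEPTH_LIMIT = 0  # a setting you choose; 0 (the default) means no depth limit

# request_depth_max, by contrast, is only a statistic reported at the end of a
# run (as in the "Dumping Scrapy stats" block above); it can also be read with
# crawler.stats.get_value("request_depth_max").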

Scrapy isn't making all requests

I'm using Scrapy to download a list of pages. I'm not extracting any data at the moment, so I am only saving the response.body to a CSV file.
I'm not crawling either, so the start URLs are the only URLs I need to fetch. I have a list of 400 URLs:
start_urls = ['url_1', 'url_2', 'url_3', ..., 'url_400']
But I'm only getting the source for about 170, with no clue what's happening with the rest.
This is the log I got at the end:
2016-05-16 04:30:25 [scrapy] INFO: Closing spider (finished)
2016-05-16 04:30:25 [scrapy] INFO: Stored csv feed (166 items) in: pages.csv
2016-05-16 04:30:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 11,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 5,
'downloader/request_bytes': 95268,
'downloader/request_count': 180,
'downloader/request_method_count/GET': 180,
'downloader/response_bytes': 3931169,
'downloader/response_count': 169,
'downloader/response_status_count/200': 166,
'downloader/response_status_count/404': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 16, 9, 0, 25, 461208),
'item_scraped_count': 166,
'log_count/DEBUG': 350,
'log_count/INFO': 17,
'response_received_count': 169,
'scheduler/dequeued': 180,
'scheduler/dequeued/memory': 180,
'scheduler/enqueued': 180,
'scheduler/enqueued/memory': 180,
'start_time': datetime.datetime(2016, 5, 16, 8, 50, 34, 443699)}
2016-05-16 04:30:25 [scrapy] INFO: Spider closed (finished)
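One way to see what happens to the missing URLs (a sketch; the spider name, example URLs, and settings values below are illustrative) is to attach an errback to every start request and raise the retry limits, so timeouts and dropped connections show up explicitly:

import scrapy


class PagesSpider(scrapy.Spider):
    """Illustrative spider; replace start_urls with the real 400 URLs."""
    name = "pages"
    start_urls = ["http://example.com/page1", "http://example.com/page2"]

    custom_settings = {
        "RETRY_TIMES": 5,        # retry timeouts / dropped connections harder
        "DOWNLOAD_TIMEOUT": 60,  # give slow pages more time before TimeoutError
    }

    def start_requests(self):
        for url in self.start_urls:
            # dont_filter=True keeps duplicate URLs from being silently dropped
            # by the dupefilter, one common reason for "missing" requests.
            yield scrapy.Request(url, callback=self.parse,
                                 errback=self.on_error, dont_filter=True)

    def parse(self, response):
        yield {"url": response.url, "body": response.text}

    def on_error(self, failure):
        # Every request that never produced a successful response lands here.
        request = failure.request
        self.logger.error("Request failed: %s -- %s", request.url, failure.value)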

Auto.Arima transform timeseries and xreg correlation with lagged forecast timeseries

I'm trying to forecast with an auto.arima() model like the one below.
I was wondering, in general, if it is necessary to transform a time series so that it resembles a normal distribution before passing it to auto.arima()?
Also, does it matter if your xreg = ... predictor is correlated with a lag of the time series you're trying to predict, or vice versa?
Code:
tsTrain <- tsTiTo[1:60]
tsTest <- tsTiTo[61:100]
Xreg <- CustCount[1:100]
## Predictor
xregTrain2 <- Xreg[1:60]
xregTest2 <- Xreg[61:100]
Arima.fit2 <- auto.arima(tsTrain, xreg = xregTrain2)
Acast2 <- forecast(Arima.fit2, h = 40, xreg = xregTest2)
Data:
#dput(ds$CustCount[1:100])
CustCount = c(3, 3, 1, 4, 1, 3, 2, 3, 2, 4, 1, 1, 5, 6, 8, 5, 2, 7, 7, 3, 2, 2, 2, 1, 3, 2, 3, 1, 1, 2, 1, 1, 3, 2, 2, 2, 3, 7, 5, 6, 8, 7, 3, 5, 6, 6, 8, 4, 2, 1, 2, 1, NA, NA, 4, 2, 2, 4, 11, 2, 8, 1, 4, 7, 11, 5, 3, 10, 7, 1, 1, NA, 2, NA, NA, 2, NA, NA, 1, 2, 3, 5, 9, 5, 9, 6, 6, 1, 5, 3, 7, 5, 8, 3, 2, 6, 3, 2, 3, 1 )
# dput(tsTiTo[1:100])
tsTiTo = c(45, 34, 11, 79, 102, 45, 21, 45, 104, 20, 2, 207, 45, 2, 3, 153, 8, 2, 173, 11, 207, 79, 45, 153, 192, 173, 130, 4, 173, 174, 173, 130, 79, 154, 4, 104, 192, 153, 192, 104, 28, 173, 52, 45, 11, 29, 22, 81, 7, 79, 193, 104, 1, 1, 46, 130, 45, 154, 153, 7, 174, 21, 193, 45, 79, 173, 45, 153, 45, 173, 2, 1, 2, 1, 1, 8, 1, 1, 79, 45, 79, 173, 45, 2, 173, 130, 104, 19, 4, 34, 2, 192, 42, 41, 31, 39, 11, 79, 4, 79)
The short answer is no to both questions. See the long answer below.
I was wondering in general if it was necessary to transform a timeseries so that it resembled a normal distribution before passing it to auto.arima()?
No. In the case of time series data, it is the innovation errors that you want to be normally distributed, not the time series you are modelling.
This is similar to the case of a linear regression model: you don't expect the predictors to be normally distributed; it is the errors that you'd expect to be normally distributed.
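In symbols (a generic ARMA(p, q) illustration added here for clarity, not taken from the original answer):

y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t, \qquad \varepsilon_t \overset{\text{iid}}{\sim} N(0, \sigma^2)

The normality assumption sits on the innovations \varepsilon_t, not on the observed series y_t.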
Also does it matter if your xreg=... predictor is correlated with a lag of the timeseries you're trying to predict, or vice versa?
You'd hope the xreg predictor is correlated in this way; we are trying to use that information when looking for an appropriate model to forecast with.

Difference Predictors in Auto.Arima Forecast

I'm trying to build an auto.arima forecast with predictors like the example below. I've noticed that my predictor is non-stationary, so I was wondering if I should difference the predictor before passing it to the xreg parameter, as I've shown below. The real data set is much larger; this is just an example. Any advice is greatly appreciated.
Code:
tsTrain <- tsTiTo[1:60]
tsTest <- tsTiTo[61:100]
ndiffs(ds$CustCount)
## returns 1
diffedCustCount <- diff(ds$CustCount, differences = 1)
Xreg <- diffedCustCount[1:100]
## Predictor
xregTrain2 <- Xreg[1:60]
xregTest2 <- Xreg[61:100]
Arima.fit2 <- auto.arima(tsTrain, xreg = xregTrain2)
Acast2 <- forecast(Arima.fit2, h = 40, xreg = xregTest2)
Data:
dput(ds$CustCount[1:100])
c(3, 3, 1, 4, 1, 3, 2, 3, 2, 4, 1, 1, 5, 6, 8, 5, 2, 7, 7, 3, 2, 2, 2, 1, 3, 2, 3, 1, 1, 2, 1, 1, 3, 2, 2, 2, 3, 7, 5, 6, 8, 7, 3, 5, 6, 6, 8, 4, 2, 1, 2, 1, NA, NA, 4, 2, 2, 4, 11, 2, 8, 1, 4, 7, 11, 5, 3, 10, 7, 1, 1, NA, 2, NA, NA, 2, NA, NA, 1, 2, 3, 5, 9, 5, 9, 6, 6, 1, 5, 3, 7, 5, 8, 3, 2, 6, 3, 2, 3, 1 )
dput(tsTiTo[1:100])
c(45, 34, 11, 79, 102, 45, 21, 45, 104, 20, 2, 207, 45, 2, 3, 153, 8, 2, 173, 11, 207, 79, 45, 153, 192, 173, 130, 4, 173, 174, 173, 130, 79, 154, 4, 104, 192, 153, 192, 104, 28, 173, 52, 45, 11, 29, 22, 81, 7, 79, 193, 104, 1, 1, 46, 130, 45, 154, 153, 7, 174, 21, 193, 45, 79, 173, 45, 153, 45, 173, 2, 1, 2, 1, 1, 8, 1, 1, 79, 45, 79, 173, 45, 2, 173, 130, 104, 19, 4, 34, 2, 192, 42, 41, 31, 39, 11, 79, 4, 79)
The xreg argument in auto.arima() performs a dynamic regression, which is to say that you are performing a linear regression and fitting the errors with an ARMA process.
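In symbols, this regression-with-ARMA-errors setup can be sketched as (notation added here for illustration):

y_t = \beta_0 + \beta_1 x_t + \eta_t, \qquad \eta_t \sim \text{ARMA}(p, q)

where x_t is the xreg series and the errors \eta_t are modelled by the ARMA part (or, once differencing is applied, an ARIMA process).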
While auto.arima() used to require manual differencing for non-stationary data when external regressors are included, this is no longer the case. auto.arima() will take non-stationary data as an input and determine the order of differencing using a unit-root test.
See this Post from Rob Hyndman for further detail.
