scrapy isn't making all request

scrapy isn't making all request - web-scraping

I'm using scrapy to download a list of pages, I'm not extracting any data at this moment so I only saving the response.body in a csv file.
I'm not crawling either, so the start urls are the only urls I need to get, I've a list of 400 urls
start_urls =['url_1','url_2,'url_3',---,'url_400']
but I'm only getting the source for about 170, no clue of what;s happening with the rest.
this is the log I got at the end
2016-05-16 04:30:25 [scrapy] INFO: Closing spider (finished)
2016-05-16 04:30:25 [scrapy] INFO: Stored csv feed (166 items) in: pages.csv
2016-05-16 04:30:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 11,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 5,
'downloader/request_bytes': 95268,
'downloader/request_count': 180,
'downloader/request_method_count/GET': 180,
'downloader/response_bytes': 3931169,
'downloader/response_count': 169,
'downloader/response_status_count/200': 166,
'downloader/response_status_count/404': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 16, 9, 0, 25, 461208),
'item_scraped_count': 166,
'log_count/DEBUG': 350,
'log_count/INFO': 17,
'response_received_count': 169,
'scheduler/dequeued': 180,
'scheduler/dequeued/memory': 180,
'scheduler/enqueued': 180,
'scheduler/enqueued/memory': 180,
'start_time': datetime.datetime(2016, 5, 16, 8, 50, 34, 443699)}
2016-05-16 04:30:25 [scrapy] INFO: Spider closed (finished)

Related

How to make spiked rather than rectangular columns with ggplot

I'm using ggplot to display some data. The machine that generates it produces a plot that looks like this. You'll notice each individual blue peak is a spike, with a sharp top.
I've recreated this with ggplot, but is there a way to get geom_col, or another function, to produce a spike for each peak?
data <- data.frame(rpt = c(100.803333333333, 101.783333333333, 102.733333333333, 103.653333333333,
104.64, 105.566666666667, 106.526666666667, 107.46, 108.36, 109.33,
110.303333333333, 111.276666666667, 112.223333333333, 113.136666666667,
114.186666666667, 115.21, 116.2, 117.153333333333, 118.18, 119.17,
120.16, 121.153333333333, 122.163333333333, 123.086666666667,
124.046666666667, 125.003333333333, 125.956666666667, 126.946666666667,
127.866666666667, 128.82, 129.736666666667, 130.7, 131.633333333333,
132.6, 133.573333333333, 134.51, 135.453333333333, 136.43, 137.376666666667,
138.356666666667, 139.306666666667, 140.256666666667, 141.213333333333,
142.166666666667, 143.16, 144.086666666667, 145.046666666667,
146.01, 147.013333333333, 147.946666666667, 148.913333333333,
149.853333333333, 150.86, 151.803333333333, 152.746666666667,
153.693333333333, 154.673333333333, 155.66, 156.576666666667,
157.53, 158.486666666667, 159.443333333333, 160.403333333333,
161.403333333333, 162.33, 164.223333333333, 165.226666666667,
166.123333333333, 167.023333333333, 168.103333333333, 168.936666666667,
170.026666666667, 170.9, 171.85, 172.62, 173.61, 174.716666666667,
175.713333333333, 176.64, 177.57, 178.466666666667, 180.913333333333,
181.256666666667, 182.286666666667, 183.32, 184.283333333333,
185.246666666667, 186.136666666667, 187.146666666667, 188.16,
188.823333333333, 190.036666666667, 190.983333333333, 191.613333333333,
191.93, 192.323333333333, 192.84, 193.91, 194.866666666667, 195.706666666667,
199.523333333333, 200.563333333333, 201.326666666667, 202.293333333333,
203.346666666667, 204.893333333333, 206.943333333333, 208.223333333333,
208.846666666667, 209.68, 210.723333333333, 211.476666666667,
212.023333333333, 212.363333333333, 214.71, 216.576666666667,
216.97, 219.883333333333, 221.806666666667, 222.62, 223.526666666667,
224.436666666667, 225.35, 226.273333333333, 227.056666666667,
228.22, 229.013333333333, 232.076666666667, 238.52, 240.806666666667,
245.48, 248.49, 251.136666666667, 256.523333333333, 258.646666666667,
260.856666666667, 265.93, 268.88, 270.963333333333, 283.38, 285.2,
286.223333333333, 288.49, 294.926666666667),
height = c(119, 127, 132, 139, 136, 136, 140, 161, 162, 194, 239, 278,
370, 288, 434, 361, 286, 232, 213, 221, 238, 244, 266, 295, 306,
325, 358, 420, 497, 670, 838, 1104, 1451, 1743, 2018, 2170, 2226,
2058, 1777, 1464, 1158, 916, 702, 604, 540, 535, 554, 543, 517,
490, 434, 365, 322, 315, 312, 293, 272, 281, 279, 293, 297, 286,
253, 222, 170, 111, 69, 49, 29, 39, 33, 25, 24, 23, 16, 19, 18,
17, 24, 16, 19, 12, 20, 16, 19, 17, 19, 13, 16, 16, 17, 17, 16,
11, 19, 11, 18, 19, 17, 16, 14, 18, 14, 14, 11, 15, 10, 10, 11,
15, 11, 10, 10, 10, 11, 10, 10, 12, 10, 15, 14, 24, 27, 30, 22,
11, 12, 10, 10, 10, 10, 11, 10, 11, 10, 10, 12, 12, 15, 11, 10,
11, 10, 12))
ggplot(data, aes(x=rpt, y=height)) +
geom_col(position="identity")

Here's an approach where I add a zero value halfway in between each original value.
library(tidyverse)
data %>%
uncount(2, .id = "id") %>% # repeat each row
mutate(rpt = if_else(id == 2, (rpt + lead(rpt))/2, rpt), # space in between
height = if_else(id == 2, 0, height)) %>% # zero out every other row
ggplot(aes(x=rpt, y=height)) +
geom_line()

skip feedexport when exception occured

Is it possible to only export the extracted items if no exception was thrown during the crawl?
I run into errors from time to time but I need to make sure the crawl was successfully completed before processing the exported items.
I am using scrapy-feedexporter-sftp
and made the configration in settings as described in the README.
The upload to sFTP works.
To produce an error I am using a wrong XPATH
File "/usr/local/lib/python3.5/dist-packages/parsel/selector.py", line 256, in xpath
**kwargs)
File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
ValueError: XPath error: Invalid expression in td[1]//span//text()TEST_TEXT_TO_THROW_ERROR
The crawl failed but scrapy will push the file anyways:
[scrapy.extensions.feedexport] INFO: Stored json feed (59 items) in: sftp://user:pass#host/my/path/to/file/foo_2020-07-23T09-03-50.json
2020-07-23 11:03:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 581,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 24199,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 2.831581,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 23, 9, 3, 53, 763269),
'item_dropped_count': 1,
'item_dropped_reasons_count/DropItem': 1,
'item_scraped_count': 59,
'log_count/DEBUG': 86,
'log_count/ERROR': 1, <------
'log_count/INFO': 15,
'log_count/WARNING': 2,
'memusage/max': 60858368,
'memusage/startup': 60858368,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 7, 23, 9, 3, 50, 931688)}
2020-07-23 11:03:54 [scrapy.core.engine] INFO: Spider closed (finished)
Regards ;)

Scrapy Depth limit changing Itself

I am crawling a website using Scrapy. Lets say there are 150 pages to crawl , the site has pagination where one page give url of the next page to crawl.
Now, my spider stop by itself, with the following logs:
{'downloader/request_bytes': 38096,
'downloader/request_count': 55,
'downloader/request_method_count/GET': 55,
'downloader/response_bytes': 5014634,
'downloader/response_count': 55,
'downloader/response_status_count/200': 55,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 17, 19, 12, 11, 607000),
'item_scraped_count': 2,
'log_count/DEBUG': 58,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'request_depth_max': 36,
'response_received_count': 55,
'scheduler/dequeued': 55,
'scheduler/dequeued/memory': 55,
'scheduler/enqueued': 55,
'scheduler/enqueued/memory': 55,
'start_time': datetime.datetime(2016, 8, 17, 19, 9, 13, 893000)}
the request_depth_max sometimes become 51, and now it 36. But in my settings I have it as DEPTH_LIMIT = 1000000000
I have also tried setting DEPTH_LIMIT to 0, but still the spider stops by itself, is there any setting that I am missing.

The stat request_depth_max is not a setting, it just means the highest depth your spider reached in this run.
Also DEPTH_LIMIT defaults to 0 which equates to infinity.

Show a specific value of x-axis on ggplot

Im creating a ggplot with geom_vline at a specific location on the x axis. i would like the x axis to show that specific value
Following is my data + code:
dput(agg_data)
structure(list(latency = structure(c(0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 24, 26, 28,
29, 32, 36, 37, 40, 43, 46, 47, 48, 49, 54, 64, 71, 72, 75, 87,
88, 89, 93, 134, 151), class = "difftime", units = "days"), count = c(362,
11, 8, 5, 4, 2, 8, 6, 4, 2, 2, 1, 5, 1, 2, 2, 2, 1, 1, 1, 2,
1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1), cum_sum = c(362, 373, 381, 386, 390, 392, 400, 406,
410, 412, 414, 415, 420, 421, 423, 425, 427, 428, 429, 430, 432,
433, 435, 436, 437, 438, 439, 440, 441, 442, 444, 446, 447, 448,
449, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460), cum_dist = c(0.78695652173913,
0.810869565217391, 0.828260869565217, 0.839130434782609, 0.847826086956522,
0.852173913043478, 0.869565217391304, 0.882608695652174, 0.891304347826087,
0.895652173913044, 0.9, 0.902173913043478, 0.91304347826087,
0.915217391304348, 0.919565217391304, 0.923913043478261, 0.928260869565217,
0.930434782608696, 0.932608695652174, 0.934782608695652, 0.939130434782609,
0.941304347826087, 0.945652173913043, 0.947826086956522, 0.95,
0.952173913043478, 0.954347826086957, 0.956521739130435, 0.958695652173913,
0.960869565217391, 0.965217391304348, 0.969565217391304, 0.971739130434783,
0.973913043478261, 0.976086956521739, 0.980434782608696, 0.982608695652174,
0.984782608695652, 0.98695652173913, 0.989130434782609, 0.991304347826087,
0.993478260869565, 0.995652173913044, 0.997826086956522, 1)), .Names = c("latency",
"count", "cum_sum", "cum_dist"), row.names = c(NA, -45L), class = "data.frame")
and code:
library(ggplot2)
library(ggthemes)
russ<-ggplot(data=agg_data,aes(x=as.numeric(latency),y=cum_dist))+geom_line(size=2)
russ<-russ+ggtitle("Latency from first click to Qualified Demo:") + xlab("in Days")+ ylab("Proportion of maturity")+theme_economist()
russ<-russ+geom_vline(aes(xintercept=10), color="black", linetype="dashed")
russ
which creates the following plot:
I want the plot show the value '10' (same location as the vline) on the x-axis
I looked for some other similar answers, like in Customize axis labels
but this one re creates the x axis labels (with scale_x_discrete), and does not add a new number to the current scale, which is more of what im looking for.
thanks in advance!

In your case x scale is continuous, so you can use function scale_x_continuous() and provide breaks at positions you need.
russ + scale_x_continuous(breaks=c(0,10,50,100,150))

removing columns with similar variance

I have a dataframe of 3500 X 4000. I am trying to write a professional command in R to remove any columns in a matrix that show the same variance. I am able to do this a with a long, complicated command such as
datavar <- apply(data, 2, var)
datavar <- datavar[!duplicated(datavar)]
then assemble my data by matching the remaining column names, but this is SAD! I was hoping to do this in a single go. I was thinking of something like
data <- data[, which(apply(data, 2, function(col) !any(var(data) = any(var(data)) )))]
I know the last part of the above command is nonsense, but I also know there is someway it can be done in some... smart command!
Here's some data that applies to the question
data <- structure(list(V1 = c(3, 213, 1, 135, 5, 2323, 1231, 351, 1,
33, 2, 213, 153, 132, 1321, 53, 1, 231, 351, 3135, 13), V2 = c(1,
1, 1, 2, 3, 5, 13, 33, 53, 132, 135, 153, 213, 213, 231, 351,
351, 1231, 1321, 2323, 3135), V3 = c(65, 41, 1, 53132, 1, 6451,
3241, 561, 321, 534, 31, 135, 1, 1351, 31, 351, 31, 31, 3212,
3132, 1), V4 = c(2, 2, 5, 4654, 5641, 21, 21, 1, 1, 465, 31,
4, 651, 35153, 13, 132, 123, 1231, 321, 321, 5), V5 = c(23, 13,
213, 135, 15341, 564, 564, 8, 464, 8, 484, 6546, 132, 165, 123,
135, 132, 132, 123, 123, 2), V6 = c(2, 1, 84, 86468, 464, 18,
45, 55, 2, 5, 12, 4512, 5, 123, 132465, 12, 456, 15, 45, 123213,
12), V7 = c(1, 2, 2, 5, 5, 12, 12, 12, 15, 18, 45, 45, 55, 84,
123, 456, 464, 4512, 86468, 123213, 132465)), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7"), row.names = c(NA, 21L), class = "data.frame")
Would I be able to keep one of the "similar variance" columns too?
Thanks,

I might go a more cautious route, like
data[, !duplicated(round(sapply(data,var),your_precision_here))]

This is pretty similar to what you've come up with:
vars <- lapply(data,var)
data[,which(sapply(1:length(vars), function(x) !vars[x] %in% vars[-x]))]
One thing to think about though is whether you want to match variances exactly (as in this example) or just variances that are close. The latter would be a significantly more challenging problem.

... or as alternative:
data[ , !c(duplicated(apply(data, 2, var)) | duplicated(apply(data, 2, var), fromLast=TRUE))]
...but also not shorter :)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

scrapy isn't making all request - web-scraping

Related

How to make spiked rather than rectangular columns with ggplot

skip feedexport when exception occured

Scrapy Depth limit changing Itself

Show a specific value of x-axis on ggplot

removing columns with similar variance

Categories

Resources