Scrapy depth limit changing itself - web-scraping

I am crawling a website using Scrapy. Let's say there are 150 pages to crawl; the site uses pagination, where one page gives the URL of the next page to crawl.
Now my spider stops by itself, with the following logs:
{'downloader/request_bytes': 38096,
'downloader/request_count': 55,
'downloader/request_method_count/GET': 55,
'downloader/response_bytes': 5014634,
'downloader/response_count': 55,
'downloader/response_status_count/200': 55,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 17, 19, 12, 11, 607000),
'item_scraped_count': 2,
'log_count/DEBUG': 58,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'request_depth_max': 36,
'response_received_count': 55,
'scheduler/dequeued': 55,
'scheduler/dequeued/memory': 55,
'scheduler/enqueued': 55,
'scheduler/enqueued/memory': 55,
'start_time': datetime.datetime(2016, 8, 17, 19, 9, 13, 893000)}
The request_depth_max sometimes becomes 51, and now it is 36, but in my settings I have DEPTH_LIMIT = 1000000000.
I have also tried setting DEPTH_LIMIT to 0, but the spider still stops by itself. Is there any setting that I am missing?

The stat request_depth_max is not a setting; it just reports the highest request depth your spider reached in this run.
Also, DEPTH_LIMIT defaults to 0, which means no depth limit at all. Since your finish_reason is 'finished', the spider most likely stopped because it simply ran out of requests to schedule (e.g. no next-page link was extracted), not because of any depth setting.
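For reference, a minimal illustration of the difference. DEPTH_LIMIT, the closed() hook and the stats collector are standard Scrapy; the spider name and URL below are just placeholders for the sketch:
# settings.py -- DEPTH_LIMIT = 0 is the default and means "no depth limit"
DEPTH_LIMIT = 0
# myspider.py -- request_depth_max is only an observation; you can read it from
# the stats collector once the run is over (hypothetical spider):
import scrapy
class DepthStatsSpider(scrapy.Spider):
    name = 'depth_stats'
    start_urls = ['https://example.com']   # placeholder
    def parse(self, response):
        pass
    def closed(self, reason):
        self.logger.info('max depth reached: %s',
                         self.crawler.stats.get_value('request_depth_max'))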

Related

Missing nodes in node view of graph after randomly removing some nodes

I created a graph G and I have a node view like the following: <0, 1, 2, ..., 100>.
I randomly removed 20 nodes, and the node view of the new graph is missing the nodes I removed. To be precise, in the new graph some nodes are missing (since they were removed), for example:
node view <0, 1, 3, 5, 6, 7, 9, ..., 100>
However, I want this graph to be a new graph with a node view like the following:
<0, 1, 2, ..., 80>
Is there any solution? I tried relabelling and copying the same graph; they didn't work.
PS. My nodes have an attribute label equal to either 0 or 1,
and I want to preserve them.
Here is one approach you can take: after removing the nodes from the graph, you can relabel the remaining nodes using nx.relabel_nodes to get the node view you want. See the example below:
import networkx as nx
import numpy as np
# Create a random graph
N_nodes = 50
G = nx.erdos_renyi_graph(N_nodes, p=0.25)
# Remove random nodes
N_del_nodes = 10
del_node_list = np.random.choice(N_nodes, size=N_del_nodes, replace=False)
G.remove_nodes_from(del_node_list)
print('Node view without relabelling:' + str(G.nodes))
# Relabel the remaining nodes to consecutive integers starting from 0
label_mapping = {node: j for j, node in enumerate(G.nodes)}
G_rel = nx.relabel_nodes(G, label_mapping)
print('Node view with relabelling:' + str(G_rel.nodes))
And the output gives:
Node view without relabelling:[0, 1, 2, 5, 6, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 30, 31, 32, 33, 34, 36, 37, 38, 40, 41, 44, 45, 46, 47, 48, 49]
Node view with relabelling:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
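Regarding the PS about the label attribute: nx.relabel_nodes copies node attributes over to the relabelled graph, so the labels are preserved. A minimal check (the tiny graph here is made up just for illustration):
import networkx as nx
# Nodes carry a 'label' attribute of 0 or 1, as in the question
H = nx.Graph()
H.add_nodes_from([(0, {'label': 0}), (3, {'label': 1}), (7, {'label': 0})])
H_rel = nx.relabel_nodes(H, {node: j for j, node in enumerate(H.nodes)})
print(list(H_rel.nodes(data=True)))
# [(0, {'label': 0}), (1, {'label': 1}), (2, {'label': 0})]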

Skip feedexport when an exception occurred

Is it possible to only export the extracted items if no exception was thrown during the crawl?
I run into errors from time to time, but I need to make sure the crawl completed successfully before processing the exported items.
I am using scrapy-feedexporter-sftp
and made the configuration in settings as described in the README.
The upload to SFTP works.
To produce an error I am using a wrong XPath:
File "/usr/local/lib/python3.5/dist-packages/parsel/selector.py", line 256, in xpath
**kwargs)
File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
ValueError: XPath error: Invalid expression in td[1]//span//text()TEST_TEXT_TO_THROW_ERROR
The crawl failed, but Scrapy pushes the file anyway:
[scrapy.extensions.feedexport] INFO: Stored json feed (59 items) in: sftp://user:pass#host/my/path/to/file/foo_2020-07-23T09-03-50.json
2020-07-23 11:03:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 581,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 24199,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 2.831581,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 23, 9, 3, 53, 763269),
'item_dropped_count': 1,
'item_dropped_reasons_count/DropItem': 1,
'item_scraped_count': 59,
'log_count/DEBUG': 86,
'log_count/ERROR': 1, <------
'log_count/INFO': 15,
'log_count/WARNING': 2,
'memusage/max': 60858368,
'memusage/startup': 60858368,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 7, 23, 9, 3, 50, 931688)}
2020-07-23 11:03:54 [scrapy.core.engine] INFO: Spider closed (finished)
Regards ;)
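One possible workaround, sketched under the assumption that checking Scrapy's own stats is acceptable (the marker-file idea and all names here are illustrative, not part of scrapy-feedexporter-sftp): inspect the error counters in the spider's closed() hook and only signal downstream processing when the run was clean.
import json
import scrapy
class CleanRunSpider(scrapy.Spider):      # hypothetical spider
    name = 'clean_run'
    start_urls = ['https://example.com']  # placeholder
    def parse(self, response):
        yield {'url': response.url}
    def closed(self, reason):
        stats = self.crawler.stats
        errors = stats.get_value('log_count/ERROR', 0)
        exceptions = sum(v for k, v in stats.get_stats().items()
                         if k.startswith('spider_exceptions/'))
        # Write the marker that downstream processing waits for only when the
        # crawl finished cleanly, without logged errors or spider exceptions.
        if reason == 'finished' and errors == 0 and exceptions == 0:
            with open('crawl_ok.marker', 'w') as f:
                json.dump({'items': stats.get_value('item_scraped_count', 0)}, f)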

Shuffle deck of cards without built-in random function

My friend suggested that I try to solve this problem before an interview, but I have no idea how to approach it.
I need to write code that shuffles a deck of 52 cards without using a built-in random function.
Update
Thanks to Yifei Wu; his answer was very helpful.
Here is a link to my GitHub project where I implemented the given algorithm:
https://github.com/Dantsj16/Shuffle-Without-Random.git
Your question does not say it must be a random shuffle of 52 cards. There is such a thing as a perfect shuffle, where a riffle shuffle is done so that the top card remains on top and every other card comes from the other half of the deck. Many magicians and card sharks can do this shuffle at will. It is well known that eight perfect shuffles in a row return a standard 52-card deck to its original order, provided the top card remains on top in each shuffle.
Here are 8 perfect shuffles in Python. Note that this shuffle is implemented differently from how it would be done by hand, to simplify the code.
In [1]: d0=[x for x in range(1,53)] # the card deck
In [2]: print(d0)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52]
In [3]: d1=d0[::2]+d0[1::2] # a perfect shuffle
In [4]: print(d1)
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52]
In [5]: d2=d1[::2]+d1[1::2]
In [6]: d3=d2[::2]+d2[1::2]
In [7]: d4=d3[::2]+d3[1::2]
In [8]: d5=d4[::2]+d4[1::2]
In [9]: d6=d5[::2]+d5[1::2]
In [10]: d7=d6[::2]+d6[1::2]
In [11]: d8=d7[::2]+d7[1::2]
In [12]: print(d8)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52]
In [13]: print(d0 == d8)
True
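The same eight shuffles can also be written as a short loop; this is just a compact restatement of the transcript above:
deck = list(range(1, 53))    # the card deck
d = deck
for _ in range(8):
    d = d[::2] + d[1::2]     # one perfect shuffle, top card stays on top
print(d == deck)             # True -- eight perfect shuffles restore the order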
If you want the perfect shuffle as done by hand, use
d1 = [None] * 52
d1[::2] = d0[:26]    # top half of the deck goes to the even positions
d1[1::2] = d0[26:]   # bottom half goes to the odd positions
This gives, for d1,
[1, 27, 2, 28, 3, 29, 4, 30, 5, 31, 6, 32, 7, 33, 8, 34, 9, 35, 10, 36, 11, 37, 12, 38, 13, 39, 14, 40, 15, 41, 16, 42, 17, 43, 18, 44, 19, 45, 20, 46, 21, 47, 22, 48, 23, 49, 24, 50, 25, 51, 26, 52]
Let me know if you really need a random shuffle. I can adapt my Borland Delphi code into Python if you need it.
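For completeness, here is a sketch of one way to get a random-looking shuffle without the built-in random module: a Fisher-Yates shuffle driven by a hand-rolled linear congruential generator. The LCG constants and the clock seed are illustrative choices, and this is not the Delphi code mentioned above:
import time
def lcg(seed):
    """A tiny linear congruential generator (not cryptographically strong)."""
    state = seed
    while True:
        state = (1103515245 * state + 12345) % (2 ** 31)
        yield state
def shuffle(deck, seed=None):
    """Fisher-Yates shuffle using the hand-rolled generator above."""
    rng = lcg(seed if seed is not None else time.time_ns())
    deck = deck[:]                       # don't mutate the caller's list
    for i in range(len(deck) - 1, 0, -1):
        j = next(rng) % (i + 1)          # pick an index in 0..i
        deck[i], deck[j] = deck[j], deck[i]
    return deck
print(shuffle(list(range(1, 53))))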

Scrapy isn't making all requests

I'm using Scrapy to download a list of pages. I'm not extracting any data at the moment, so I'm only saving response.body to a CSV file.
I'm not crawling either, so the start URLs are the only URLs I need to fetch; I have a list of 400 URLs:
start_urls = ['url_1', 'url_2', 'url_3', ..., 'url_400']
but I'm only getting the source for about 170 of them, with no clue what's happening with the rest.
This is the log I got at the end:
2016-05-16 04:30:25 [scrapy] INFO: Closing spider (finished)
2016-05-16 04:30:25 [scrapy] INFO: Stored csv feed (166 items) in: pages.csv
2016-05-16 04:30:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 11,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 5,
'downloader/request_bytes': 95268,
'downloader/request_count': 180,
'downloader/request_method_count/GET': 180,
'downloader/response_bytes': 3931169,
'downloader/response_count': 169,
'downloader/response_status_count/200': 166,
'downloader/response_status_count/404': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 16, 9, 0, 25, 461208),
'item_scraped_count': 166,
'log_count/DEBUG': 350,
'log_count/INFO': 17,
'response_received_count': 169,
'scheduler/dequeued': 180,
'scheduler/dequeued/memory': 180,
'scheduler/enqueued': 180,
'scheduler/enqueued/memory': 180,
'start_time': datetime.datetime(2016, 5, 16, 8, 50, 34, 443699)}
2016-05-16 04:30:25 [scrapy] INFO: Spider closed (finished)
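Not part of the original post, but a sketch of one way to find out where the missing URLs go: with duplicate-filter logging enabled and an errback attached, every dropped or failed request shows up in the log. DUPEFILTER_DEBUG, RETRY_TIMES, errback and dont_filter are standard Scrapy; the spider shown here is only illustrative:
# settings.py -- make Scrapy report dropped and failing requests
DUPEFILTER_DEBUG = True   # log every request dropped as a duplicate
RETRY_TIMES = 5           # retry timeouts / lost connections a few extra times
# in the spider -- attach an errback so every failed URL is accounted for
import scrapy
class PagesSpider(scrapy.Spider):            # hypothetical spider name
    name = 'pages'
    start_urls = ['url_1', 'url_2']          # your 400 URLs
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 errback=self.on_error, dont_filter=True)
    def parse(self, response):
        yield {'url': response.url, 'body': response.body}
    def on_error(self, failure):
        self.logger.error('request failed: %s', failure.request.url)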

Changing the spacing between vertices in igraph in R

Suppose I want to make a plot with the following data:
pairs <- c(1, 2, 2, 3, 2, 4, 2, 5, 2, 6, 2, 7, 2, 8, 2, 9, 2, 10, 2, 11, 4,
14, 4, 15, 6, 13, 6, 19, 6, 28, 6, 36, 7, 16, 7, 23, 7, 26, 7, 33,
7, 39, 7, 43, 8, 35, 8, 40, 9, 21, 9, 22, 9, 25, 9, 27, 9, 33, 9,
38, 10, 12, 10, 18, 10, 20, 10, 32, 10, 34, 10, 37, 10, 44, 10, 45,
10, 46, 11, 17, 11, 24, 11, 29, 11, 30, 11, 31, 11, 33, 11, 41, 11,
42, 11, 47, 14, 50, 14, 52, 14, 54, 14, 55, 14, 56, 14, 57, 14, 58,
14, 59, 14, 60, 14, 61, 15, 48, 15, 49, 15, 51, 15, 53, 15, 62, 15,
63)
g <- graph(pairs)
plot(g, layout = layout.reingold.tilford)
I get a plot like the one below:
As you can see, the spaces between some of the vertices are so small that these vertices overlap.
1. I wonder if there is a way to change the spacing between vertices.
2. In addition, is the spacing between vertices arbitrary? For example, vertices 3, 4, and 5 are very close to each other, but 5 and 6 are far apart.
EDIT:
For my 2nd question, I guess the spacing depends on the number of nodes below. E.g., 10 and 11 are farther from each other than 8 and 9 are because there are more children below 10 and 11 than below 8 and 9.
I bet there is a better solution, but I cannot find it. Here is my approach. Since a general width parameter seems to be missing, you have to adjust the parameters manually to obtain the desired output.
My approach is primarily to resize some elements of the plot so they take up the right amount of space and to adjust the margins to use the space as efficiently as possible. The most important parameter here is asp, which controls the aspect ratio of the plot (since in this case the plot, I guess, is better wide than tall, an aspect ratio even below 0.5 is appropriate). Other tricks are to shrink the vertices and the label fonts. Here is the code:
plot(g, layout = layout.reingold.tilford,
     edge.width = 1,
     edge.arrow.width = 0.3,
     vertex.size = 5,
     edge.arrow.size = 0.5,
     vertex.size2 = 3,
     vertex.label.cex = 1,
     asp = 0.35,
     margin = -0.1)
That produces this plot:
Another approach would be to send the plot to a graphics device such as PDF (or JPEG, etc.) and set rescale to FALSE. With the RStudio viewer this cuts off a large part of the plot, but with other graphics devices it might (no guarantee) work well.
Anyway, for any doubt about how to use these parameters (which can be very tricky at times), type help(igraph.plotting).
For the second part of the question I am not sure; looking inside the function I cannot figure out a precise answer, but I guess the space between elements on the same level is calculated from the child elements they have: vertices with children and sub-children require more space, which is why, say, 3, 4 and 5 end up closer together.
