Is there a way to keep track of each crawler's depth?
I am recursively crawling some websites, and my setup is similar to the code below.
import scrapy

class Crawl(scrapy.Spider):
    name = "Crawl"

    def start_requests(self):
        if condition_is_satisfied:  # placeholder condition
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 meta={'depth': 1})

    def parse(self, response):
        next_crawl_depth = response.meta['depth'] + 1
        if condition_is_satisfied:  # placeholder condition
            with open(filename, "a") as file:
                file.write(...)  # record depth and url here
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 meta={'depth': next_crawl_depth})
This approach doesn't work.
For example, I would like to record each crawl's activity like this:
crawler depth1 URL1
crawler depth2 URL2
...
Thank you in advance.
I think you are almost there. Please try this code.
import scrapy

class Crawl(scrapy.Spider):
    name = "Crawl"

    def start_requests(self):
        if condition_is_satisfied:  # placeholder condition
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 meta={'depth': 1})

    def parse(self, response):
        cur_crawl_depth = response.meta['depth']
        next_crawl_depth = cur_crawl_depth + 1
        if condition_is_satisfied:  # placeholder condition
            # open in append mode ("a"); "w+" would truncate the file on every call
            with open(filename, "a") as f:
                f.write(url + " " + str(cur_crawl_depth) + "\n")
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 meta={'depth': next_crawl_depth})
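As a side note, Scrapy's built-in DepthMiddleware (enabled by default) already maintains a depth counter for you, so you may not need to manage it by hand at all. Here is a minimal sketch, assuming a reasonably recent Scrapy; the spider name, start URL and link-extraction rule are only placeholders:

import scrapy

class DepthLoggingSpider(scrapy.Spider):
    name = "depth_logging"
    start_urls = ["https://example.com"]  # placeholder start URL

    def parse(self, response):
        # DepthMiddleware sets response.meta['depth'] automatically:
        # 0 for start requests, then +1 for every request scheduled from a response.
        depth = response.meta.get('depth', 0)
        self.logger.info("depth=%d url=%s", depth, response.url)
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

You can also cap the recursion with the DEPTH_LIMIT setting instead of checking a counter yourself.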
Whenever I run this code, I get a "pseudo-class is not implemented" error. I found the code online and am trying to scrape relevant information about a set of cities from Wikipedia.
I have updated Python and Beautiful Soup to their most recent versions. Any help is greatly appreciated.
import requests
import bs4
from bs4 import BeautifulSoup as bs
import pandas as pd
import unicodedata
import re

# cities = ['Berlin', 'Hamburg', 'Frankfurt','Munich','Stuttgart','Leipzig','Cologne','Dresden','Hannover','Paris', 'Barcelona','Lisbon','Madrid']
cities = ['Berlin','Paris','Amsterdam','Barcelona','Rome','Lisbon','Prague','Vienna','Madrid']

def City_info(soup):
    ret_dict = {}
    ret_dict['city'] = soup.h1.get_text()
    if soup.select_one('.mergedrow:-soup-contains("Mayor")>.infobox-label') is not None:
        i = soup.select_one('.mergedrow:-soup-contains("Mayor")>.infobox-label')
        mayor_name_html = i.find_next_sibling()
        mayor_name = unicodedata.normalize('NFKD', mayor_name_html.get_text())
        ret_dict['mayor'] = mayor_name
    if soup.select_one('.mergedrow:-soup-contains("City")>.infobox-label') is not None:
        j = soup.select_one('.mergedrow:-soup-contains("City")>.infobox-label')
        area = j.find_next_sibling('td').get_text()
        ret_dict['city_size'] = unicodedata.normalize('NFKD', area)
    if soup.select_one('.mergedtoprow:-soup-contains("Elevation")>.infobox-data') is not None:
        k = soup.select_one('.mergedtoprow:-soup-contains("Elevation")>.infobox-data')
        elevation_html = k.get_text()
        ret_dict['elevation'] = unicodedata.normalize('NFKD', elevation_html)
    if soup.select_one('.mergedtoprow:-soup-contains("Population")') is not None:
        l = soup.select_one('.mergedtoprow:-soup-contains("Population")')
        c_pop = l.findNext('td').get_text()
        ret_dict['city_population'] = c_pop
    if soup.select_one('.infobox-label>[title^=Urban]') is not None:
        m = soup.select_one('.infobox-label>[title^=Urban]')
        u_pop = m.findNext('td')
        ret_dict['urban_population'] = u_pop.get_text()
    if soup.select_one('.infobox-label>[title^=Metro]') is not None:
        n = soup.select_one('.infobox-label>[title^=Metro]')
        m_pop = n.findNext('td')
        ret_dict['metro_population'] = m_pop.get_text()
    if soup.select_one('.latitude') is not None:
        o = soup.select_one('.latitude')
        ret_dict['lat'] = o.get_text()
    if soup.select_one('.longitude') is not None:
        p = soup.select_one('.longitude')
        ret_dict['long'] = p.get_text()
    return ret_dict

list_of_city_info = []
for city in cities:
    url = 'https://en.wikipedia.org/wiki/{}'.format(city)
    web = requests.get(url)  # the parser name belongs to BeautifulSoup, not requests.get
    soup = bs(web.content, 'html.parser')
    list_of_city_info.append(City_info(soup))

df_cities = pd.DataFrame(list_of_city_info)
df_cities = df_cities.set_index('city')
df_cities
Unfortunately, I have not found any solution to this.
:-soup-contains is a CSS pseudo-class selector used to target an element by its text.
It is provided by Soup Sieve, the official CSS selector implementation for Beautiful Soup 4.7.0+, so for most people running Beautiful Soup 4.7.0 or later your script should work fine.
So first check that your version is up to date; in older versions only the deprecated form :contains() is available.
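As a quick self-contained check (a minimal sketch with made-up HTML; it assumes Beautiful Soup 4.7.0+ with a recent soupsieve), the same selector style used in the question works like this:

from bs4 import BeautifulSoup

html = """
<table class="infobox">
  <tr class="mergedrow">
    <th class="infobox-label">Mayor</th>
    <td class="infobox-data">Jane Doe</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains("Mayor") matches the row whose text contains "Mayor"
label = soup.select_one('.mergedrow:-soup-contains("Mayor") > .infobox-label')
if label is not None:
    print(label.find_next_sibling().get_text())  # prints: Jane Doe

If this snippet raises the same "pseudo-class is not implemented" error, the installed Beautiful Soup/soupsieve is too old.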
With BeautifulSoup 4 and Python 3.7 I'm trying to loop over some lists of links and then extract some text from the tags, but I'm getting an error when I run the code in the terminal.
Here is the code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
import csv

my_url = "http://www.example.com"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

links = page_soup.select('dt > a[href]')
link = [tag.get('href') for tag in links]

i = 0
for i in range(0, 5000):
    url = link[i]
    Client = uReq(url)
    pageHtml = Client.read()
    Client.close()
    pSoup = soup(pageHtml, "html.parser")
    linkeas = pSoup.findAll(href=re.compile(my_url))

    def linkas(href):
        return href and re.compile("html").search(href) and re.compile(my_url).search(href)

    linka = pSoup.findAll(href=linkas)
    if linka != []:
        linkia = [tag.get('href') for tag in linka]
        linko = len(linkia)
        j = 0
        for j in range(0, linko):
            curl = linkia[j]
            cClient = uReq(curl)
            pageHtml = cClient.read()
            cClient.close()
            Soup = soup(page_html, "html.parser")
            country = Soup.select('.class > a:nth-of-type(3)')
            countri = country[0].text.strip()
            print(countri)
I've tried several approaches over the past few days, but this is as far as I've got, with no results:
Traceback (most recent call last):
File "<stdin>", line 22, in <module>
IndexError: list index out of range
Could someone give me a tip?
NOTE:
The lists look like this when printed with print(linkia):
['http://www.example/example/1.html']
['http://www.example/example/2.html']
['http://www.example/example/3.html', 'http://www.example/example/4.html',
'http://www.example/example/5.html', 'http://www.example/example/6.html',
'http://www.example/example/7.html', 'http://www.example/example/8.html',
'http://www.example/example/9.html', 'http://www.example/example/10.html',
'http://www.example/example/11.html', 'http://www.example/example/12.html',
'http://www.example/example/13.html', 'http://www.example/example/14.html',
'http://www.example/example/15.html', 'http://www.example/example/16.html',
'http://www.example/example/17.html', 'http://www.example/example/18.html',
'http://www.example/example/19.html']
['http://www.example/example/20.html', 'http://www.example/example/example/21.html',
'http://www.example/example/example/22.html']
['http://www.example/example/23.html']
Thanks a lot for your time, I really appreciate it. I'll be online the whole time and will respond quickly.
change:

i = 0
for i in range(0, 5000):
    url = link[i]

to just:

for url in link:

You can then get rid of the url = link[i] line entirely.
You're essentially telling it to loop through 5000 items in your list when you don't have 5000 items, hence the list index out of range error. You really just want it to loop through each element until it runs out of items, and you can do that simply by writing for url in link:.
Then do the same for your other nested for loop.
change:

j = 0
for j in range(0, linko):
    curl = linkia[j]

to:

for curl in linkia:
I will also note that if you were to keep your original setup, you wouldn't need to initialize i or j to 0 first. Since you set the range to start at 0, the for loop automatically begins at that first value. But that point is moot, because I would not recommend iterating through the list that way: (a) it isn't robust (you would need exactly 5000 items in the list every time you reach that loop), and (b) while your second loop happens to work because its range runs to the length of the list, it's unnecessary when the whole thing can be condensed into one line.
Try:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
import csv

my_url = "http://www.example.com"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

links = page_soup.select('dt > a[href]')
link = [tag.get('href') for tag in links]

for url in link:
    Client = uReq(url)
    pageHtml = Client.read()
    Client.close()
    pSoup = soup(pageHtml, "html.parser")
    linkeas = pSoup.findAll(href=re.compile(my_url))

    def linkas(href):
        return href and re.compile("html").search(href) and re.compile(my_url).search(href)

    linka = pSoup.findAll(href=linkas)
    if linka != []:
        linkia = [tag.get('href') for tag in linka]
        for curl in linkia:
            cClient = uReq(curl)
            pageHtml = cClient.read()
            cClient.close()
            Soup = soup(pageHtml, "html.parser")  # parse the page just fetched, not the top-level page_html
            country = Soup.select('.class > a:nth-of-type(3)')
            countri = country[0].text.strip()
            print(countri)
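One further cleanup, purely optional since the behaviour is unchanged: the linkas helper only depends on my_url, so it can be defined once above the outer loop instead of being redefined on every iteration, e.g.

def linkas(href):
    # keep only hrefs that point at .html pages on the same site
    return href and re.compile("html").search(href) and re.compile(my_url).search(href)

and then the loop simply calls pSoup.findAll(href=linkas) as before.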
Recently I ran some code released by other authors. They used Chainer v1.3, but I have v4 installed. When I run the code, it raises an AttributeError saying the MomentumSGD optimizer has no attribute 'prepare'. Here is the relevant part of the code:
import chainer
from chainer import serializers

class BaseModel(chainer.Chain):
    loss = None
    accuracy = None
    gpu_mode = False
    _train = False

    def __call__(self, *arg_list, **arg_dict):
        raise NotImplementedError()

    def clear(self):
        self.loss = None
        self.accuracy = None

    def train(self, data, optimizer):
        self._train = True
        optimizer.update(self, data)
        if self.accuracy is None:
            return float(self.loss.data)
        else:
            return float(self.loss.data), float(self.accuracy.data)

    def validate(self, data):
        self._train = False
        self(data)
        if self.accuracy is None:
            return float(self.loss.data)
        else:
            return float(self.loss.data), float(self.accuracy.data)

    def test(self, data):
        self._train = False
        raise NotImplementedError()

    def save(self, fname):
        serializers.save_hdf5(fname, self)

    def load(self, fname):
        serializers.load_hdf5(fname, self)

    def cache(self):
        self.to_cpu()
        cached_model = self.copy()
        self.to_gpu()
        return cached_model

    # this part is the error part
    def setup(self, optimizer):
        self.to_gpu()
        optimizer.target = self
        optimizer.prepare()

    def to_cpu(self):
        if not self.gpu_mode:
            return
        super(BaseModel, self).to_cpu()
        self.gpu_mode = False

    def to_gpu(self):
        if self.gpu_mode:
            return
        super(BaseModel, self).to_gpu()
        self.gpu_mode = True
Newer versions of Chainer use the setup method to initialize an optimizer.
Can you try modifying your code as follows?
def setup(self, optimizer):
    self.to_gpu()
    optimizer.setup(self)
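For context, here is a minimal sketch of how an optimizer is wired up in recent Chainer versions; the tiny Linear link is just a stand-in for your BaseModel subclass:

import chainer
import chainer.links as L
from chainer import optimizers

model = L.Linear(3, 2)  # stand-in model; your BaseModel subclass would go here
optimizer = optimizers.MomentumSGD(lr=0.01, momentum=0.9)
optimizer.setup(model)  # attaches the target link; replaces the old optimizer.prepare() pattern

After setup, calls like optimizer.update(...) work as before, so the rest of the training loop should not need changes.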
I have a situation where the gradient of one component is, by necessity, calculated in another component. What I have attempted to do is simply make the gradient an output of the first component and an input to the second component. I have set it to pass_by_obj so that it doesn't affect other calculations. Any recommendations on whether this is the best way to do it would be appreciated. In any case, I am getting an error when using check_partial_derivatives(). It seems to be an error for any output that is specified as pass_by_obj. Here is a simple case:
import numpy as np
from openmdao.api import Group, Problem, Component, ScipyGMRES, ExecComp, IndepVarComp

class Comp1(Component):
    def __init__(self):
        super(Comp1, self).__init__()
        self.add_param('x', shape=1)
        self.add_output('y', shape=1)
        self.add_output('dz_dy', shape=1, pass_by_obj=True)

    def solve_nonlinear(self, params, unknowns, resids):
        x = params['x']
        unknowns['y'] = 4.0*x + 1.0
        unknowns['dz_dy'] = 2.0*x

    def linearize(self, params, unknowns, resids):
        J = {}
        J['y', 'x'] = 4.0
        return J

class Comp2(Component):
    def __init__(self):
        super(Comp2, self).__init__()
        self.add_param('y', shape=1)
        self.add_param('dz_dy', shape=1, pass_by_obj=True)
        self.add_output('z', shape=1)

    def solve_nonlinear(self, params, unknowns, resids):
        y = params['y']
        unknowns['z'] = y*2.0

    def linearize(self, params, unknowns, resids):
        J = {}
        J['z', 'y'] = params['dz_dy']
        return J

class TestGroup(Group):
    def __init__(self):
        super(TestGroup, self).__init__()
        self.add('x', IndepVarComp('x', 0.0), promotes=['*'])
        self.add('c1', Comp1(), promotes=['*'])
        self.add('c2', Comp2(), promotes=['*'])

p = Problem()
p.root = TestGroup()
p.setup(check=False)

p['x'] = 2.0

p.run()

print p['z']
print 'gradients'

test_grad = open('partial_gradients_test.txt', 'w')
partial = p.check_partial_derivatives(out_stream=test_grad)
I get the following error message:
partial = p.check_partial_derivatives(out_stream=test_grad)
File "/usr/local/lib/python2.7/site-packages/openmdao/core/problem.py", line 1699, in check_partial_derivatives
dresids._dat[u_name].val[idx] = 1.0
TypeError: '_ByObjWrapper' object does not support item assignment
I asked before about the params being checked for pass_by_obj in check_partial_derivatives(), and it might simply be a matter of checking the unknowns for pass_by_obj as well.
The error you're getting is another bug related to the check_partial_derivatives function. It should be easy enough to fix, but in the meantime you can just remove the pass_by_obj setting. Since you're computing a value in one component and passing it to another, there is no need for pass_by_obj at all (and it will be more efficient if you don't use it).
You said that you set it so that it "doesn't affect other calculations", but I'm not quite sure what you mean by that. It won't affect anything unless you use it in the solve_nonlinear method.
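For illustration, here is a sketch of the two components from the question with pass_by_obj removed, so dz_dy flows as an ordinary float connection (this mirrors the question's OpenMDAO 1.x code; only the pass_by_obj flags are dropped):

from openmdao.api import Component

class Comp1(Component):
    def __init__(self):
        super(Comp1, self).__init__()
        self.add_param('x', shape=1)
        self.add_output('y', shape=1)
        self.add_output('dz_dy', shape=1)  # plain float output, no pass_by_obj

    def solve_nonlinear(self, params, unknowns, resids):
        x = params['x']
        unknowns['y'] = 4.0*x + 1.0
        unknowns['dz_dy'] = 2.0*x

    def linearize(self, params, unknowns, resids):
        return {('y', 'x'): 4.0}

class Comp2(Component):
    def __init__(self):
        super(Comp2, self).__init__()
        self.add_param('y', shape=1)
        self.add_param('dz_dy', shape=1)  # plain float param, no pass_by_obj
        self.add_output('z', shape=1)

    def solve_nonlinear(self, params, unknowns, resids):
        unknowns['z'] = params['y']*2.0

    def linearize(self, params, unknowns, resids):
        return {('z', 'y'): params['dz_dy']}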
I have a component that has an input that is an int so I am setting pass_by_obj = True. However, when I check derivatives with check_partial_derivatives(), it throws this error:
data = prob.check_partial_derivatives(out_stream=sys.stdout)
File "/usr/local/lib/python2.7/site-packages/openmdao/core/problem.py", line 1711, in check_partial_derivatives
jac_rev[(u_name, p_name)][idx, :] = dinputs._dat[p_name].val
TypeError: float() argument must be a string or a number
It appears to be trying to take the derivative even though it cannot. Here is a simple example:
import sys
from openmdao.api import IndepVarComp, Problem, Group, Component

class Comp(Component):
    def __init__(self):
        super(Comp, self).__init__()
        self.add_param('x', val=0.0)
        self.add_param('y', val=3, pass_by_obj=True)
        self.add_output('z', val=0.0)

    def solve_nonlinear(self, params, unknowns, resids):
        unknowns['z'] = params['y']*params['x']

    def linearize(self, params, unknowns, resids):
        J = {}
        J['z', 'x'] = params['y']
        return J

prob = Problem()
prob.root = Group()
prob.root.add('comp', Comp(), promotes=['*'])
prob.root.add('p1', IndepVarComp('x', 0.0), promotes=['x'])
prob.root.add('p2', IndepVarComp('y', 3, pass_by_obj=True), promotes=['y'])

prob.setup(check=False)

prob['x'] = 2.0
prob['y'] = 3

prob.run()

print prob['z']
data = prob.check_partial_derivatives(out_stream=sys.stdout)
Is it possible to use the check_partial_derivatives() method with components that have inputs specified as pass_by_obj? I don't care about the derivatives with respect to the pass_by_obj inputs, but I do care about the other inputs.
Thanks for the report and test. This was a bug where we weren't excluding the design variables that were declared pass_by_obj. I've got a pull request up on the OpenMDAO repo with a fix. It'll probably be merged to master within a day.
EDIT -- The fix is merged. https://github.com/OpenMDAO/OpenMDAO/commit/b123b284e46aac7e15fa9bce3751f9ad9bb63b95