Scrapy Splash recursive crawl not working - web-scraping

I tried the tips from similar questions but had no success, so I have come back to my starting point and want to ask for your help.
I can't get a recursive crawl working with Scrapy Splash, although it works without problems on a single page. The issue seems to be the bad URLs that get scheduled:
2019-04-16 16:17:11 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to '192.168.0.104': <GET http://192.168.0.104:8050/************>
But the link should be https://www.someurl.com/***************
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse,
                             meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})

def parse(self, response):
    ***********
    items_urls = ***********
    for url in items_urls.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_items, meta={'item': item})

def parse_items(self, response):
    ***********
    yield item

I have found a solution:
Remove the urlparse.urljoin(response.url, url) call and replace it with a simple string concatenation like "someurl.com" + url. With Splash, response.url points to the Splash endpoint (http://192.168.0.104:8050/...), so joining against it produces the wrong host.
Now all links are correct and the crawl works fine.
But now I have some trouble with crawl loops, although that is another question :)
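For reference, a minimal sketch of that fix, assuming the real site (redacted above) is https://www.someurl.com and that the item-link selector is only a placeholder; scrapy is imported as in the original spider:

BASE_URL = "https://www.someurl.com"  # assumption: the real domain is redacted in the question

def parse(self, response):
    # With Splash, response.url is the Splash endpoint (192.168.0.104:8050),
    # so absolute URLs are built from the site's own base instead.
    items_urls = response.css("a.item::attr(href)")  # placeholder selector
    for url in items_urls.extract():
        yield scrapy.Request(BASE_URL + url,
                             callback=self.parse_items,
                             meta={'splash': {'endpoint': 'render.html',
                                              'args': {'wait': 0.5}}})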

Related

Cannot Authenticate to website, Scrapy Spider, Bad Request

I'm trying to write a web scraper to study different social media platforms, and now I'm working on one for Gab. When I try to log in, I get what I believe is an HTTP 400 (Bad Request) response and I'm not sure why.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class FeedparserSpider(scrapy.Spider):
    name = 'feedparser'
    allowed_domains = ['gab.com']
    start_urls = ['https://gab.com/auth/sign_in/']

    def parse(self, response):
        # Everything we need to sign in
        authenticity_token = response.xpath(
            '//form[@class="simple_form new_user"]/input[@name="authenticity_token"]/@value').get()
        user_email = "my@eamiladdress.com"
        user_password = "MyPassword"
        open_in_browser(response)
        return FormRequest.from_response("https://gab.com/auth/sign_in", formdata={
            'authenticity_token': authenticity_token,
            'user[email]': user_email,
            'user[password]': user_password,
        }, callback=self.parsefeed)

    def parsefeed(self, response):
        home_url = 'https://gab.com/home'
        yield scrapy.Request(url=home_url, callback=self.parse_feed)

    def parse_feed(self, response):
        open_in_browser(response)
Current Predicament
I suspect I'll need to change my formdata to include a user object with an email and a password property, but I'm not sure.
I am VERY new to web scraping, so I don't have many troubleshooting strategies or much insight yet. Any advice on what to do and how to proceed would be very helpful, and if this post could use any additional details, please let me know and I will add them as quickly as possible.
You need to pass your response to FormRequest.from_response:
return FormRequest.from_response(response, formdata={
    'authenticity_token': authenticity_token,
    'user[email]': user_email,
    'user[password]': user_password},
    callback=self.parsefeed)
Then you'll get to a page with the following text:
To use the Gab Social web application, please enable JavaScript. Alternatively, try one of the native apps for Gab Social for your platform.
You need to check how the website detects JavaScript and see whether you can bypass it.
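If it cannot be bypassed, one common option (my suggestion, not part of the original answer) is to render the page with a JavaScript-capable tool such as scrapy-splash. A minimal sketch, assuming a running Splash instance and scrapy-splash configured in settings.py:

from scrapy_splash import SplashRequest

def parsefeed(self, response):
    # Render the JavaScript-driven home page through Splash instead of a plain request.
    yield SplashRequest('https://gab.com/home',
                        callback=self.parse_feed,
                        args={'wait': 2})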

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull this info to generate some stats about the website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful, though I am open to any other approach.
Thank you in advance.
This is just pseudo code to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically, this is just raw

for pep_id in page_ids:
    # for the initial page
    if pep_id == '0':
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        # Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        # Enter some parsing logic
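As a rough illustration of the parsing step, the endpoint returns JSON, so something along these lines could work; the 'jobs' and 'url' field names are assumptions and need to be checked against the actual response:

# Hedged sketch: the exact JSON layout of the Workable API is an assumption.
data = page.json()
for job in data.get('jobs', []):
    # Print whatever company/job URL field the API actually exposes.
    print(job.get('url'))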

Find the final redirected url using Python

I am trying to use Python to find the final redirected URL for a given URL. I tried various solutions from Stack Overflow answers but nothing worked for me; I only ever get the original URL back.
To be specific, I tried the requests, urllib2 and urlparse libraries and none of them worked as they should. Here is some of the code I tried:
Solution 1:
s = requests.session()
r = s.post('https://www.boots.com/search/10055096', allow_redirects=True)
print(r.history)
print(r.history[1].url)
Result:
[<Response [301]>, <Response [302]>]
https://www.boots.com/search/10055096
Solution 2:
import urlparse

url = 'https://www.boots.com/search/10055096'
try:
    out = urlparse.parse_qs(urlparse.urlparse(url).query)['out'][0]
    print(out)
except Exception as e:
    print('not found')
Result:
not found
Solution 3:
import urllib2

def get_redirected_url(url):
    opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
    request = opener.open(url)
    return request.url

print(get_redirected_url('https://www.boots.com/search/10055096'))
Result:
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
The expected URL below is the final redirected page, and that is what I want to return.
Original URL: https://www.boots.com/search/10055096
Expected URL: https://www.boots.com/gillette-fusion5-razor-blades-4pk-10055096
Solution #1 was the closest: at least it returned two responses, but the second response wasn't the final page; judging by its content, it looks like a loading page.
The first request returns an HTML file that contains JavaScript which updates the page, and requests does not execute JavaScript. You can find the updated link by using:
import requests
from bs4 import BeautifulSoup
import re

r = requests.get('https://www.boots.com/search/10055096')
soup = BeautifulSoup(r.content, 'html.parser')
# The redirect target is embedded in the <script> right after the search box input.
reg = soup.find('input', id='searchBoxText').findNext('script').contents[0]
print(re.search(r'ht[\w\://\.-]+', reg).group())
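If you then want to confirm where that extracted link finally lands, a quick follow-up along these lines should work (extracted_url stands for the string printed above):

# Follow the extracted link; requests resolves any remaining redirects,
# and final.url then holds the final address.
final = requests.get(extracted_url, headers={'User-Agent': 'Mozilla/5.0'})
print(final.url)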

Python-requests: Can't scrape all the html code from a page

I am trying to scrape the content of the Financial Times search page.
Using Requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I cannot find it in the Requests response, unlike the articles' titles or hyperlinks.
from bs4 import BeautifulSoup
import requests

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")

def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print(ref.get('href'))
            print(ref.get('title'))
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page
Or:
def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print(ref.get('page next'))
If you can see the required links in the page source but cannot get them via requests or urllib, it can mean one of two things.
There is something wrong with your logic. Let's assume it's not that.
Then what remains is Ajax: the parts of the page you are looking for are loaded by JavaScript after the document.onload event fires, so you cannot get something that isn't there in the first place.
My solutions (more like suggestions) are:
Reverse engineer the network requests. Difficult, but universally applicable; I personally do that. You might want to use the re module.
Find something that renders JavaScript, i.e. simulate web browsing. You might want to check out the webdriver component of Selenium, Qt, etc. This is easier, but rather memory hungry and consumes a lot more network resources compared to option 1; a rough sketch follows below.
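A minimal Selenium sketch of option 2, assuming Chrome and chromedriver are installed; the CSS selector for the pagination link is a guess and must be checked against the actual page markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Render the page in a real browser so JavaScript-inserted elements exist.
driver = webdriver.Chrome()
driver.get(url)  # the FT search URL from the question

# "li.page.next a" is an assumed selector, not taken from the real page.
next_link = driver.find_element(By.CSS_SELECTOR, "li.page.next a")
print(next_link.get_attribute("href"))
driver.quit()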

crawl dynamic data using scrapy

I try to get the product rating information from target.com. The URL for the product is
http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty
After looking through response.body, I found out that the rating information is not loaded statically, so I need to get it some other way. I found some similar questions saying that in order to get dynamic data, I need to:
find out the correct XHR and where to send the request
use FormRequest to get the right JSON
parse the JSON
(if I am wrong about the steps, please tell me)
I am stuck at step 2 right now. I found that one XHR named 15258543 contains the rating distribution, but I don't know how I can send a request to get the JSON: to where, and with what parameters?
Can someone walk me through this?
Thank you!
The trickiest thing is to get that 15258543 product ID dynamically and then use it inside the URL to get the reviews. This product ID can be found in multiple places on the product page, for instance, there is a meta element that we can use:
<meta itemprop="productID" content="15258543">
Here is a working spider that makes a separate GET request to get the reviews, loads the JSON response via json.loads() and prints the overall product rating:
import json
import scrapy


class TargetSpider(scrapy.Spider):
    name = "target"
    allowed_domains = ["target.com"]
    start_urls = ["http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty"]

    def parse(self, response):
        product_id = response.xpath("//meta[@itemprop='productID']/@content").extract_first()
        return scrapy.Request("http://tws.target.com/productservice/services/reviews/v1/reviewstats/" + product_id,
                              callback=self.parse_ratings,
                              meta={"product_id": product_id})

    def parse_ratings(self, response):
        data = json.loads(response.body)
        print(data["result"][response.meta["product_id"]]["coreStats"]["AverageOverallRating"])
Prints 4.5585.
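For a quick standalone check of the same endpoint outside Scrapy (assuming the API shown in the answer is still reachable; the product ID is the one from the question):

import json
import requests

# Same review-stats endpoint the spider above calls, queried directly.
resp = requests.get("http://tws.target.com/productservice/services/reviews/v1/reviewstats/15258543")
data = json.loads(resp.text)
print(data["result"]["15258543"]["coreStats"]["AverageOverallRating"])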
