I put a function in my spider that generates a random user agent from a txt file. I call this function from start_requests:
def start_requests(self):
    url = 'someurl'
    head = self.loadUserAgents()
    headers = {
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.5',
        'User-Agent': head
    }
    yield scrapy.http.Request(url, headers=headers)
I use a parse function that follows the next page. I think that this way the spider will only generate the random user agent once. How can I force the spider to generate a new user agent for each following page?
Thanks.
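A minimal sketch of one way to do this, assuming loadUserAgents() is your own function and returns a freshly chosen user agent string on each call: rebuild the headers whenever you yield the next-page request from parse (the spider name and the next-page CSS selector below are hypothetical). A downloader middleware that sets the User-Agent per request is another common route.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # hypothetical spider name

    def start_requests(self):
        url = 'someurl'
        yield scrapy.Request(url, headers=self.randomHeaders())

    def randomHeaders(self):
        # Build a fresh headers dict; loadUserAgents() is assumed to return
        # one randomly chosen user agent string per call.
        return {
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.5',
            'User-Agent': self.loadUserAgents(),
        }

    def parse(self, response):
        # ... extract items here ...
        next_page = response.css('a.next::attr(href)').get()  # hypothetical selector
        if next_page:
            # New headers (and a new user agent) for every follow-up request.
            yield response.follow(next_page, headers=self.randomHeaders(),
                                  callback=self.parse)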
I am trying to scrape this webpage. I am interested in scraping the text under DIV CLASS="example".
This is the snippet of the script I am interested in (Stack Overflow would not let me post the code directly, so here is a screenshot):
[snapshot of the source code]
I tried using the find function from beautifulsoup. The code I used was:
import urllib.request
from bs4 import BeautifulSoup as soup

testurl = "https://www.snopes.com/fact-check/dark-profits/"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
HEADERS = {'User-agent': user_agent}

req = urllib.request.Request(testurl, headers=HEADERS)  # visit disguised as a browser
pagehtml = urllib.request.urlopen(req).read()           # read the website
pagesoup = soup(pagehtml, 'html.parser')

potentials = pagesoup.findAll("div", {"class": "example"})
potentials[0]
potentials[0].find_children
potentials[0].find_children was not able to find anything. I have also tried potentials[0].findChildren() and it was not able to find anything either. Why is find_children not picking up the children of the div tag?
Try to change the parser from html.parser to html5lib:
import requests
from bs4 import BeautifulSoup
url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")
print(soup.select_one(".example").get_text(strip=True, separator="\n"))
Prints:
Welcome to the site www.darkprofits.com, it's us again, now we extended our offerings, here is a list:
...and so on.
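Note that html5lib is a separate package; if BeautifulSoup complains that the parser is not found, install it first with pip install html5lib.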
It's an easy starter project I'm working on, but the main part is accessing the information on the webpage, so I don't know if I'm doing something wrong.
The starting code (to check that it works on the Fotocasa webpage) is:
import requests
from bs4 import BeautifulSoup

url = 'https://www.fotocasa.es/es/comprar/vivienda/valencia-capital/aire-acondicionado-trastero-ascensor-no-amueblado/161485852/d'
# url = 'https://www.idealista.com/inmueble/97795476/'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'es,es-ES;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'sec-ch-ua': '"Chromium";v="106", "Microsoft Edge";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.47'
}

r = requests.get(url, headers=headers)
print(r)
req = requests.get(url, headers=headers).text

# Now that the content is ready, iterate
# through the content using BeautifulSoup:
soup = BeautifulSoup(req, "html.parser")

# get the information of a given tag
inm = soup.find(class_="re-DetailHeader-propertyTitle").text
print(inm)
You can try it and see that with the Fotocasa URL it works perfectly (gets <Response [200]>), but with the Idealista one it doesn't (gets <Response [403]>).
The code is:
import requests
from bs4 import BeautifulSoup

# url = 'https://www.fotocasa.es/es/comprar/vivienda/valencia-capital/aire-acondicionado-trastero-ascensor-no-amueblado/161485852/d'
url = 'https://www.idealista.com/inmueble/97795476/'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'es,es-ES;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'sec-ch-ua': '"Chromium";v="106", "Microsoft Edge";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.47'
}

r = requests.get(url, headers=headers)
print(r)
req = requests.get(url, headers=headers).text

# Now that the content is ready, iterate
# through the content using BeautifulSoup:
soup = BeautifulSoup(req, "html.parser")

# get the information of a given tag
inm = soup.find(class_="main-info__title-main").text
print(inm)
Your headers are probably fine, but I think you need to include cookies as well.
One way to replicate your browser's request almost exactly is to go to the network tab on the browser's dev tools, and then copy the request for the page
(you might need to refresh for it to show up - it's the one with the same Request URL as whatever you entered in the address bar)
Then you can paste the copied cURL into something like curlconverter to convert it to Python code, and paste that into your code so that you can continue with
soup = BeautifulSoup(response.content, "html.parser")
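As a rough illustration (the cookie names and values below are placeholders, not real ones), the converted code is essentially a requests.get call with a cookies dict passed alongside the headers:
import requests
from bs4 import BeautifulSoup

# Placeholder cookie names/values - curlconverter fills these in from the request you copied.
cookies = {
    'some_cookie_name': 'some_cookie_value',
    'another_cookie_name': 'another_cookie_value',
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    'accept-language': 'es,es-ES;q=0.9,en;q=0.8',
}

response = requests.get('https://www.idealista.com/inmueble/97795476/',
                        headers=headers, cookies=cookies)
soup = BeautifulSoup(response.content, "html.parser")
print(soup.find(class_="main-info__title-main").text)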
However, you will have to update the cookies frequently, so it might be less hassle to use a library or API that can bypass these blocks. For example, if you sign up for ScrapingAnt and then paste your API token into the code below:
import requests
from bs4 import BeautifulSoup

scrapingant_api = 'https://api.scrapingant.com/v2/general'
scrapingant_key = 'YOUR_API_TOKEN'  # paste here
url_to_Scrape = 'https://www.idealista.com/inmueble/97795476/'
url = f'{scrapingant_api}?url={url_to_Scrape}&x-api-key={scrapingant_key}'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
inm = soup.find(class_="main-info__title-main").text
print(inm)  # prints "Ático en venta en plaza del Ayuntamiento, 6"
There is a limit to how many requests you can run on ScrapingAnt's free tier, though, so I suggest also considering selenium if you'll need to scrape an unlimited number of times. If you copy the function from this gist, you can simply call it like:
# def linkToSoup_selenium .... # paste function into your code
soup = linkToSoup_selenium('https://www.idealista.com/inmueble/97795476/')
inm = soup.find(class_="main-info__title-main").text
print(inm) # prints "Ático en venta en plaza del Ayuntamiento, 6"
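The gist itself isn't reproduced here, but a rough stand-in for such a helper (a minimal headless-Chrome version, not the gist's actual code) could look like:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def linkToSoup_selenium(url):
    # Load the page in headless Chrome and hand the rendered HTML to BeautifulSoup.
    opts = Options()
    opts.add_argument('--headless')
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()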
You should ALWAYS check the robots.txt file if you want to scrape a page. Read this: https://dan-suciu.medium.com/the-complete-manual-to-legal-ethical-web-scraping-in-2021-3eeae278b334
In the case of your 2nd URL it seems that scraping is not allowed - it is blocked. Try the URL https://www.idealista.com/robots.txt and look at the text; Google translates it as:
Misuse has been detected. Access has been blocked.
Having trouble accessing the site? Contact support
ID: fc1d890d-6ed6-8959-cd68-de965251f89b
IP: xx.xx.xx.xx
All the best,
The idealist team
Regards...
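As a side note, a minimal sketch of how you might check robots.txt programmatically before scraping, using the standard library's urllib.robotparser (the "*" user agent below just means a generic crawler):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.idealista.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Check whether a generic crawler ("*") may fetch a given URL.
url = "https://www.idealista.com/inmueble/97795476/"
print(rp.can_fetch("*", url))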
I want to scrape all the product data for the 'Cushion cover' category, URL = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
I found that the data is in a script tag, but how do I get the data from all the pages? I need the URLs of all the products from all the pages. The data is also available through an API for the different pages: API = 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If I keep changing the page number in the link above, I get the data for the respective page, but how do I get that data for all the different pages?
Please suggest something for this.
import requests
import pandas as pd
import json
import csv
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
print(prodresp)

partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = partjson[0]
You are almost there. You can paginate through the next pages using a for loop and the range function to pull all the pages; since we know there are 192 pages in total, I've built the pagination this way. To get all the product URLs (or any other data item) from all of the pages, you can follow the next example.
Script:
import requests
import pandas as pd
import json
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc'

data = []
for page in range(0, 192):
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodResphtml = html.fromstring(prodresp.text)
    #print(prodresp)
    partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
    partjson = json.loads(partjson[0])
    #print(partjson)
    # with open('data.json', 'w', encoding='utf-8') as f:
    #     f.write(partjson)
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        link = 'https://www.noon.com/' + item['url']
        data.append(link)

df = pd.DataFrame(data, columns=['URL'])
#df.to_csv('product.csv', index=False)  # to save data into your system
print(df)
Output:
URL
0 https://www.noon.com/graphic-geometric-pattern...
1 https://www.noon.com/classic-nordic-decorative...
2 https://www.noon.com/embroidered-iconic-medusa...
3 https://www.noon.com/geometric-marble-texture-...
4 https://www.noon.com/traditional-damask-motif-...
... ...
9594 https://www.noon.com/geometric-printed-cushion...
9595 https://www.noon.com/chinese-style-art-printed...
9596 https://www.noon.com/chinese-style-art-printed...
9597 https://www.noon.com/chinese-style-art-printed...
9598 https://www.noon.com/chinese-style-art-printed...
[9599 rows x 1 columns]
I used the re lib. In other words, I used regex; it is much better for scraping pages that use JavaScript.
import requests
import pandas as pd
import json
import csv
from lxml import html
import re

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

url = "https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/"
prodresp = requests.get(url, headers=headers, timeout=30)
jsonpage = re.findall(r'type="application/json">(.*?)</script>', prodresp.text)
jsonpage = json.loads(jsonpage[0])
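From there, assuming the embedded JSON has the same structure used in the previous answer (this is an assumption, not verified here), the product URLs could be pulled out like:
# Assumed structure, mirroring the lxml-based answer above.
links = ['https://www.noon.com/' + item['url']
         for item in jsonpage['props']['pageProps']['props']['catalog']['hits']]
print(len(links), links[:3])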
I have been trying to figure out how to work things out with requests.
Right now I have done something like:
url = 'www.helloworld.com'

params = {
    "": page_num,
    "orderBy": 'Published'
}
headers = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
                   ' (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
}

resp = requests.get(url, headers=headers, params=params, timeout=12)
resp.raise_for_status()
print(resp.url)
and basically how it prints out now is:
www.helloworld.com/?=2&orderBy=Published
and what I wish to have is:
www.helloworld.com/2?orderBy=Published
How would I be able to change the params requests so it will end up like above?
Your issue is that you are trying to modify the target URL path, not the query parameters, so you can't use the params argument of requests to do that.
I suggest 2 options to do what you want:
construct the URL by hand (see the sketch after this list). You can do it with string concatenation for simple cases, but there are modules to do it properly, such as https://pypi.org/project/furl/ and https://hyperlink.readthedocs.io/en/latest/ , which are easier to use and more powerful than urllib.parse.urljoin
use apirequests which is a simple wrapper around requests: https://pypi.org/project/apirequests
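A minimal sketch of the first option using only the standard library, hand-building the path and encoding the query string with urllib.parse.urlencode (page_num is taken from the question and set to 2 here; the scheme is added since requests needs a full URL):
import requests
from urllib.parse import urlencode

page_num = 2  # value taken from the question's params
base_url = 'https://www.helloworld.com'

# Put the page number in the path and keep orderBy as a query parameter.
url = f"{base_url}/{page_num}?{urlencode({'orderBy': 'Published'})}"
print(url)  # https://www.helloworld.com/2?orderBy=Published

resp = requests.get(url, timeout=12)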
Sample using apirequests:
import apirequests
client = apirequests.Client('www.helloworld.com')
resp = client.get('/2', headers=headers, params=params, timeout=12)
# note that apirequests calls resp.raise_for_status() automatically
I want to use requests to scrape a site that requires login. I have already written the code using selenium, but it is inconvenient and slower that way, and I want to make it public (every user would have to download the chrome driver).
The problem is that the site makes multiple requests, and I don't have any experience processing that data and extracting the header data and names. Any help is great, thanks.
[Premise]
Using the requests module you can send requests in this way:
import requests

url = "http://www.example.com"  # request url

headers = {  # headers dict to send in request
    "header_name": "headers_value",
}
params = {  # params to be encoded in the url
    "param_name": "param_value",
}
data = {  # data to send in the request body
    "data_name": "data_value",
}

# Send GET request.
requests.get(url, params=params, headers=headers)

# Send POST request.
requests.post(url, params=params, headers=headers, data=data)
Once you perform a request, you can get a lot of information from the response object:
>>> import requests

# We perform a request and get the response object.
>>> response = requests.get(url, params=params, headers=headers)
>>> response = requests.post(url, params=params, headers=headers, data=data)

>>> response.status_code  # server response status code
200  # e.g.

>>> response.request.method
'GET'  # or eventually 'POST'

>>> response.request.headers  # headers you sent with the request
{'Accept-Encoding': 'gzip, deflate, br'}  # e.g.

>>> response.request.url  # sent request url
'http://www.example.com'

>>> response.request.body  # body of the request you sent
'name=value&name2=value2'  # e.g.
In conclusion, you can retrieve all the information that you can find in Dev Tools in the browser, from the response object. You need nothing else.
Once you send a GET or POST request you can retrieve information from Dev Tools:
In General:
Request URL: the url you sent the request to. Corresponds to response.request.url
Request Method: corresponds to response.request.method
Status Code: corresponds to response.status_code
In Response Headers:
You find response headers which correspond to response.headers
eg. Connection: Keep-Alive,
Content-Length: 0,
Content-Type: text/html; charset=UTF-8...
In Request Headers:
You find request headers which correspond to response.request.headers
In Form Data:
You can find the data you passed with data keyword in requests.post.
Corresponds to response.request.body
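A small sketch tying these together (the URL and form fields below are placeholders):
import requests

# Placeholder URL and form data for illustration.
response = requests.post("http://www.example.com",
                         params={"q": "1"},
                         data={"username": "user", "password": "pass"})

print(response.request.url)      # Request URL shown in Dev Tools "General"
print(response.request.method)   # Request Method ("POST")
print(response.status_code)      # Status Code
print(response.headers)          # Response Headers panel
print(response.request.headers)  # Request Headers panel
print(response.request.body)     # Form Data, e.g. 'username=user&password=pass'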