It's an easy starter project that I'm doing, but its core is accessing the information on a web page, and I don't know if I'm doing something wrong.
The starting code is below (to check that it works on the Fotocasa page):
import requests
from bs4 import BeautifulSoup
url = 'https://www.fotocasa.es/es/comprar/vivienda/valencia-capital/aire-acondicionado-trastero-ascensor-no-amueblado/161485852/d'
# url = 'https://www.idealista.com/inmueble/97795476/'
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'es,es-ES;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'cache-control': 'max-age=0',
'dnt': '1',
'sec-ch-ua': '"Chromium";v="106", "Microsoft Edge";v="106", "Not;A=Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.47'
}
r = requests.get(url, headers=headers)
print(r)
req = r.text  # reuse the response fetched above instead of requesting the page twice
# Now that the content is ready, iterate
# through the content using BeautifulSoup:
soup = BeautifulSoup(req, "html.parser")
# get the information of a given tag
inm = soup.find(class_="re-DetailHeader-propertyTitle").text
print(inm)
You can try it and see that the Fotocasa URL works perfectly (it returns <Response [200]>), but the Idealista one doesn't (it returns <Response [403]>).
The code is:
import requests
from bs4 import BeautifulSoup
# url = 'https://www.fotocasa.es/es/comprar/vivienda/valencia-capital/aire-acondicionado-trastero-ascensor-no-amueblado/161485852/d'
url = 'https://www.idealista.com/inmueble/97795476/'
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'es,es-ES;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'cache-control': 'max-age=0',
'dnt': '1',
'sec-ch-ua': '"Chromium";v="106", "Microsoft Edge";v="106", "Not;A=Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.47'
}
r = requests.get(url, headers=headers)
print(r)
req = r.text  # reuse the response fetched above instead of requesting the page twice
# Now that the content is ready, iterate
# through the content using BeautifulSoup:
soup = BeautifulSoup(req, "html.parser")
# get the information of a given tag
inm = soup.find(class_="main-info__title-main").text
print(inm)
Your headers are probably fine, but I think you need to include cookies as well.
One way to replicate your browser's request almost exactly is to open the Network tab in the browser's dev tools and copy the request for the page as cURL (you might need to refresh for it to show up; it's the one with the same Request URL as whatever you entered in the address bar).
Then you can paste the copied cURL command into a tool like curlconverter to convert it to Python code, and paste that into your code so that you can continue with
soup = BeautifulSoup(response.content, "html.parser")
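For the cookie part, here is a minimal sketch of one way to turn a `Cookie` request header copied from the Network tab into a dict that can be passed to requests as `cookies=`. It uses only the standard library. The helper name `cookie_header_to_dict` and the sample cookie names and values are made up for illustration; they are not real Idealista cookies.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(cookie_header):
    """Parse a raw 'Cookie' header value into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical header value; paste the real one from your browser instead.
raw_cookie_header = "session_id=abc123; consent=yes"
cookies = cookie_header_to_dict(raw_cookie_header)
print(cookies)

# The dict can then be sent along with the headers, for example:
# response = requests.get(url, headers=headers, cookies=cookies)
```

Keeping the cookies in a dict rather than inside the headers makes them easier to replace when they expire.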
However, you will have to update the cookies frequently, so it might be less hassle to use a library or API that can bypass these blocks. For example, if you sign up for ScrapingAnt, you can paste your API token into the code below:
scrapingant_api = 'https://api.scrapingant.com/v2/general'
scrapingant_key = 'YOUR_API_TOKEN' # paste here
url_to_Scrape = 'https://www.idealista.com/inmueble/97795476/'
url = f'{scrapingant_api}?url={url_to_Scrape}&x-api-key={scrapingant_key}'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly
inm = soup.find(class_="main-info__title-main").text
print(inm) # prints "Ático en venta en plaza del Ayuntamiento, 6"
There is a limit to how many requests you can run on the free tier of ScrapingAnt, though, so I suggest also considering Selenium if you need to scrape an unlimited number of times. If you copy the function from this gist, you can simply call it like this:
# def linkToSoup_selenium .... # paste function into your code
soup = linkToSoup_selenium('https://www.idealista.com/inmueble/97795476/')
inm = soup.find(class_="main-info__title-main").text
print(inm) # prints "Ático en venta en plaza del Ayuntamiento, 6"
You should ALWAYS check the robots.txt file before you scrape a site. Read this: ( https://dan-suciu.medium.com/the-complete-manual-to-legal-ethical-web-scraping-in-2021-3eeae278b334 )
In the case of your second URL, it seems that scraping is not allowed: access is blocked. Try opening https://www.idealista.com/robots.txt and read the text; Google translates it as:
Misuse has been detected Access has been blocked
Having trouble accessing the site? Contact support
ID: fc1d890d-6ed6-8959-cd68-de965251f89b
IP: xx.xx.xx.xx
All the best,
The idealist team
Regards...
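As a rough sketch of what checking robots.txt can look like in code, the standard library's `urllib.robotparser` can evaluate a set of rules against a URL. The rules, the user agent name `my-scraper`, and the example.com URLs below are invented for illustration. The helper `is_allowed` is hypothetical and works on robots.txt text you have already obtained, so it runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration; a real check would use the site's own robots.txt.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "my-scraper", "https://example.com/private/page"))
print(is_allowed(sample_rules, "my-scraper", "https://example.com/public/page"))
```

Note that a site that blocks the request outright, as described above, may answer with an error page instead of a robots.txt file, in which case there are no rules to parse.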
I'm trying to access the metadata of a Solana token via the Solscan API.
The following code works in principle, but the API doesn't return the expected data.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
params = {
'token': '24jvtWN7qCf5GQ5MaE7V2R4SUgtRxND1w7hyvYa2PXG6',
}
response = requests.get('https://api.solscan.io/token/meta', headers=headers, params=params)
print(response.content.decode())
It returns:
{"succcess":true,"data":{"holder":1}}
However, I expected the following according to the docs https://public-api.solscan.io/docs/#/Token/get_token_meta:
Any help? Thx!
I tried this with another token and got the full response. It seems the example SPL token lacks metadata to display.
import requests
from requests.structures import CaseInsensitiveDict
url = "https://public-api.solscan.io/token/meta?tokenAddress=4k3Dyjzvzp8eMZWUXbBCjEvwSkkk59S5iCNLY3QrkX6R"
headers = CaseInsensitiveDict()
headers["accept"] = "application/json"
resp = requests.get(url, headers=headers)
print(resp.status_code)
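To tell a sparse response apart from a full one, a small sketch like the following could list which keys come back under `data`. The helper name `metadata_keys` is made up, and the sample body is the one quoted in the question; nothing here assumes which fields a full Solscan response contains.

```python
import json


def metadata_keys(response_text):
    """Return the sorted keys found under 'data' in a JSON response body."""
    payload = json.loads(response_text)
    data = payload.get("data") or {}
    return sorted(data)


# The body quoted in the question: only 'holder' comes back.
sparse_body = '{"succcess":true,"data":{"holder":1}}'
print(metadata_keys(sparse_body))
```

With a live response, the same helper could be called as `metadata_keys(resp.text)`.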
I get a 403 error when trying to grab data, and it seems like cloudscraper is dead. Any ideas?
I'm trying this code:
import requests
headers = {
'accept': 'application/json',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}
s = requests.Session()
url = 'https://www.brownsfashion.com/api/listing'
r = s.get(url, headers=headers)
print(r.text)
or the cloudscraper version:
import cloudscraper
s = cloudscraper.create_scraper()  # was cfscrape.create_scraper(), but only cloudscraper is imported
url = 'https://www.brownsfashion.com/api/listing'
r = s.get(url)
print(r.text)
error message -
This website is using a security service to protect itself from online attacks.
I have been trying to figure out how to build a URL with requests.
Right now I have something like:
url = 'www.helloworld.com'
params = {
"": page_num,
"orderBy": 'Published'
}
headers = {
'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
' (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
}
resp = requests.get(url, headers=headers, params=params, timeout=12)
resp.raise_for_status()
print(resp.url)
and basically how it prints out now is:
www.helloworld.com/?=2&orderBy=Published
and what I wish to have is:
www.helloworld.com/2?orderBy=Published
How can I change the request parameters so that the URL ends up like the one above?
Your issue is that you are trying to modify the path of the target URL, not its query parameters, so you can't use the params argument of requests to do that.
I suggest two options:
1. Construct the URL by hand. String concatenation is enough for simple cases, but there are modules that do it properly, such as https://pypi.org/project/furl/ and https://hyperlink.readthedocs.io/en/latest/, which are easier to use and more powerful than urllib.parse.urljoin.
2. Use apirequests, which is a simple wrapper around requests: https://pypi.org/project/apirequests
Sample using apirequests:
import apirequests
client = apirequests.Client('www.helloworld.com')
resp = client.get('/2', headers=headers, params=params, timeout=12)
# note that apirequests calls resp.raise_for_status() automatically
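For the first option (building the URL by hand), a minimal standard-library sketch could look like this. The helper name `build_page_url` is made up, and the `https://` scheme is an assumption added because requests needs a full URL. The page number goes into the path, and only the remaining parameters go into the query string.

```python
from urllib.parse import quote, urlencode


def build_page_url(base_url, page_num, params):
    """Put page_num into the URL path and encode params as the query string."""
    path = quote(str(page_num), safe="")
    url = f"{base_url.rstrip('/')}/{path}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


print(build_page_url("https://www.helloworld.com", 2, {"orderBy": "Published"}))
```

Equivalently, the path part alone can be formatted into the URL and the remaining dict passed to `requests.get` as `params=`, since requests only appends `params` to the query string.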
I'm trying to scrape data from an ASPX page using requests with POST data.
In the parsed HTML I get this error: "An application error occurred on the server. The current custom error settings for this application prevent the details of the application error from being viewed remotely (for security reasons). It could, however, be viewed by browsers running on the local server machine."
I have been searching for a solution for a while, but frankly I'm new to Python and can't figure out what's wrong.
The ASPX page has a JavaScript onclick function that opens a new window with the data in HTML.
The code I've written is below.
Any help or suggestions would be greatly appreciated. Thank you!
import requests
from bs4 import BeautifulSoup
session = requests.Session()
url = 'http://ws1.osfi-bsif.gc.ca/WebApps/FINDAT/Insurance.aspx?T=0&LANG=E'
r=session.get(url)
soup = BeautifulSoup(r.content,'lxml')
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
payload = {
r'__EVENTTARGET': r'',
r'__EVENTARGUMENT': r'',
r'__LASTFOCUS': r'',
r'__VIEWSTATE': viewstate,
r'__VIEWSTATEGENERATOR': r'B2E4460D',
r'__EVENTVALIDATION': eventvalidation,
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionType': r'radioButton1',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionDropDownList': r'F018',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportTemplateDropDownList': r'C_LIFE-1',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportDateDropDownList': r'3+-+2015',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$submitButton': r'Submit'
}
HEADER = {
"Content-Type":"application/x-www-form-urlencoded",
"Content-Length":"11759",
"Host":"ws1.osfi-bsif.gc.ca",
"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language":"en-US,en;q=0.5",
"Cache-Control": "max-age=0",
"Accept-Encoding":"gzip, deflate",
"Connection":"keep-alive",
}
df = session.post(url, data=payload, headers=HEADER)
print(df.text)
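One thing that may be worth checking (an assumption, not a confirmed diagnosis of the server error): requests form-encodes a dict passed as `data=`, so a value that is already URL-encoded, such as '3+-+2015', would be encoded a second time. The standard-library sketch below shows the difference using `urlencode`. The short field name `reportDate` is a stand-in for the long `reportDateDropDownList` key, and whether the form really expects '3 - 2015' is a guess based on how browsers encode spaces.

```python
from urllib.parse import urlencode

# Hypothetical short field name standing in for the real form key.
pre_encoded = urlencode({"reportDate": "3+-+2015"})  # value copied in already-encoded form
plain = urlencode({"reportDate": "3 - 2015"})        # value as a user would see it

print(pre_encoded)
print(plain)
```

If the second form matches what the browser sends in its request body, passing the plain value in the payload would be the thing to try.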