I am trying to scrape a website using BeautifulSoup, but when I scrape it I receive an error.
The error I received is:
505|error|500|Invalid postback or callback argument. Event validation is enabled using <pages enableEventValidation="true"/> in configuration or <%@ Page EnableEventValidation="true" %> in a page. For security purposes, this feature verifies that arguments to postback or callback events originate from the server control that originally rendered them. If the data is valid and expected, use the ClientScriptManager.RegisterForEventValidation method in order to register the postback or callback data for validation.|
And my code is:
from bs4 import BeautifulSoup
import requests
import csv

final_data = []
url = "https://rera.cgstate.gov.in/Default.aspx"

def writefiles(alldata, filename):
    with open("./" + filename, "w") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        writer.writerow("")
        for i in range(0, len(alldata)):
            writer.writerow(alldata[i])

def getbyGet(url, values):
    res = requests.get(url, params=values)
    text = res.text
    return text

def readHeaders():
    global url
    html = getbyGet(url, {})
    soup = BeautifulSoup(html, "html.parser")
    EVENTVALIDATION = soup.select("#__EVENTVALIDATION")[0]['value']
    VIEWSTATE = soup.select("#__VIEWSTATE")[0]['value']
    #VIEWSTATEGENERATOR = soup.select("#__VIEWSTATEGENERATOR")[0]["value"]
    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Content-Type': 'application/x-www-form-urlencoded',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0'}
    formfields = {"__ASYNCPOST": "true",
                  "__EVENTARGUMENT": "",
                  "__EVENTTARGET": "",
                  "__EVENTVALIDATION": EVENTVALIDATION,
                  "__LASTFOCUS": "",
                  "__VIEWSTATE": VIEWSTATE,
                  "ApplicantType": "",
                  "Button1": "Search",
                  "District_Name": "0",
                  "DropDownList1": "0",
                  "DropDownList2": "0",
                  "DropDownList4": "0",
                  "DropDownList5": "0",
                  "group1": "on",
                  "hdnSelectedOption": "0",
                  "hdnSelectedOptionForContractor": "0",
                  "Mobile": "",
                  "Tehsil_Name": "0",
                  "TextBox1": "",
                  "TextBox2": "",
                  "TextBox3": "",
                  "TextBox4": "",
                  "TextBox5": "",
                  "TextBox6": "",
                  "ToolkitScriptManager1": "appr1|Button1",
                  "txt_otp": "",
                  "txt_proj_name": "",
                  "txtRefNo": "",
                  "txtRefNoForContractor": ""}
    s = requests.session()
    res = s.post(url, data=formfields, headers=headers).text
    soup = BeautifulSoup(res, "html.parser")
    print(soup)

readHeaders()
What am I doing wrong? Can someone guide me? I read another post where someone received the same error, but that post didn't have a solution either. This is the post link: EVENTVALIDATION error while scraping asp.net page
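For what it's worth, a pattern that often helps with ASP.NET form posts is to echo back every hidden field the server rendered (including __VIEWSTATEGENERATOR, which the code above leaves commented out) instead of hand-picking a few. A minimal sketch of that approach, reusing the question's URL and visible field names; this is an untested suggestion, not a confirmed fix:

from bs4 import BeautifulSoup
import requests

url = "https://rera.cgstate.gov.in/Default.aspx"
s = requests.Session()
soup = BeautifulSoup(s.get(url).text, "html.parser")

# Collect every hidden input (__VIEWSTATE, __VIEWSTATEGENERATOR,
# __EVENTVALIDATION, ...) so the server-side event validation sees exactly
# the values it issued, then overlay the visible search fields.
formfields = {tag["name"]: tag.get("value", "")
              for tag in soup.select("input[type=hidden]") if tag.get("name")}
formfields.update({"Button1": "Search", "District_Name": "0", "DropDownList1": "0"})

res = s.post(url, data=formfields)
print(res.status_code)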
I am trying to scrape this webpage. I am interested in scraping the text under DIV CLASS="example".
This is the snippet of the script I am interested in (Stack Overflow automatically banned my post when I tried to post the code, lol):
[snapshot of the source code]
I tried using the find function from BeautifulSoup. The code I used was:
import urllib.request
from bs4 import BeautifulSoup as soup

testurl = "https://www.snopes.com/fact-check/dark-profits/"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
HEADERS = {'User-agent': user_agent}
req = urllib.request.Request(testurl, headers=HEADERS)  # visit disguised as a browser
pagehtml = urllib.request.urlopen(req).read()  # read the website
pagesoup = soup(pagehtml, 'html.parser')
potentials = pagesoup.findAll("div", {"class": "example"})
potentials[0]
potentials[0].find_children
potentials[0].find_children was not able to find anything. I have also tried potentials[0].findChildren(), and it was not able to find anything either. Why is find_children not picking up the children of the div tag?
Try changing the parser from html.parser to html5lib:
import requests
from bs4 import BeautifulSoup
url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")
print(soup.select_one(".example").get_text(strip=True, separator="\n"))
Prints:
Welcome to the site www.darkprofits.com, it's us again, now we extended our offerings, here is a list:
...and so on.
I was exploring Scrapy + Splash and ran into an issue: SplashRequest is not rendering the JavaScript and gives exactly the same response as scrapy.Request.
The webpage I want to scrape is this. I want some fields from the webpage for my course project.
I am unable to get the final HTML after the JS is rendered, even after waiting with 'wait': '30'. In fact, the result is the same as with scrapy.Request. The same code works perfectly for another website that I have tried, i.e. this one, so I believe the settings are fine.
This is the spider definition:
import scrapy
from scrapy_splash import SplashRequest
from bs4 import BeautifulSoup

from .. import IndeedItem

class IndeedSpider(scrapy.Spider):
    name = "indeed"

    def __init__(self):
        self.headers = {"Host": "www.naukri.com",
                        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"}

    def start_requests(self):
        yield SplashRequest(
            url="https://www.naukri.com/job-listings-Sr-Python-Developer-Rackspace-Gurgaon-4-to-9-years-270819005015",
            endpoint='render.html', headers=self.headers,
            args={
                'wait': 3,
            }
        )

    def parse(self, response):
        soup = BeautifulSoup(response.body, "html.parser")
        it = IndeedItem()
        it['job_title'] = soup
        yield it
The settings.py file (only the relevant part) is:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050/'
And the output file is here
I do not know what to make of the output; it has embedded JavaScript in it. Opening it in a browser shows that very little has been rendered (the title only). How would I get the rendered HTML for this website? Any help is much appreciated.
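One avenue worth trying (a hedged sketch, not a verified fix for this particular site) is Splash's execute endpoint with a small Lua script, which gives explicit control over navigation and waiting instead of relying on render.html's wait argument. The spider and script names below are made up for the sketch; the URL is the one from the question:

import scrapy
from scrapy_splash import SplashRequest

# Lua script run inside Splash: load the page, wait explicitly, return HTML.
LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    return splash:html()
end
"""

class NaukriExecuteSpider(scrapy.Spider):
    name = "naukri_execute"  # hypothetical spider name for this sketch

    def start_requests(self):
        yield SplashRequest(
            url="https://www.naukri.com/job-listings-Sr-Python-Developer-Rackspace-Gurgaon-4-to-9-years-270819005015",
            endpoint='execute',
            args={'lua_source': LUA_SCRIPT, 'wait': 10},
        )

    def parse(self, response):
        # If Splash managed to run the page's JS, response.text now holds the
        # post-render DOM instead of the raw JavaScript bundle.
        self.logger.info(response.text[:200])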
I am trying to log in to Instagram with Python. I am able to get the CSRF token, but the requests.Session().post() doesn't seem to post the login data to the website correctly; I always get class="no-js not-logged-in client-root". I've been searching for a while and have also tried logging into some random sites, which seemed to work. In the login method I just start a requests.Session() and make a post request to https://www.instagram.com/accounts/login/ with the login name and password as the data parameter.
def login(self):
    with requests.Session() as s:
        p = s.post(self.loginUrl, data=self.loginData, allow_redirects=True)
Also, please don't tell me to use Selenium; I strictly want to do it with requests.
Currently (January 2021) a working solution to log into Instagram using Python is the following:
import datetime
import requests

# Any desktop browser User-Agent string will do here.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0'

def login(username, password):
    """Login to Instagram"""
    time = str(int(datetime.datetime.now().timestamp()))
    enc_password = f"#PWD_INSTAGRAM_BROWSER:0:{time}:{password}"
    session = requests.Session()
    # set a cookie that signals Instagram the "Accept cookie" banner was closed
    session.cookies.set("ig_cb", "2")
    session.headers.update({'user-agent': USER_AGENT})
    session.headers.update({'Referer': 'https://www.instagram.com'})
    res = session.get('https://www.instagram.com')
    csrftoken = None
    for key in res.cookies.keys():
        if key == 'csrftoken':
            csrftoken = session.cookies['csrftoken']
    session.headers.update({'X-CSRFToken': csrftoken})
    login_data = {'username': username, 'enc_password': enc_password}
    login = session.post('https://www.instagram.com/accounts/login/ajax/', data=login_data, allow_redirects=True)
    session.headers.update({'X-CSRFToken': login.cookies['csrftoken']})
    cookies = login.cookies
    print(login.text)
    session.close()
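A hedged usage note: at the time this answer circulated, a successful call printed a JSON body containing "authenticated": true; treat that as an observation about past behaviour rather than a guarantee, since Instagram changes its login flow often.

# Hypothetical call; substitute real credentials. Success used to show up as
# {"authenticated": true, ...} in the printed response.
login("my_username", "my_password")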
Try using this code:
import requests

# Creating URL, usr/pass and user agent variables
BASE_URL = 'https://www.instagram.com/'
LOGIN_URL = BASE_URL + 'accounts/login/ajax/'
USERNAME = '****'
PASSWD = '*******'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'

# Setting some headers and referers
session = requests.Session()
session.headers = {'user-agent': USER_AGENT}
session.headers.update({'Referer': BASE_URL})

try:
    # Requesting the base url. Grabbing and inserting the csrftoken
    req = session.get(BASE_URL)
    session.headers.update({'X-CSRFToken': req.cookies['csrftoken']})
    login_data = {'username': USERNAME, 'password': PASSWD}
    # Finally logging in
    login = session.post(LOGIN_URL, data=login_data, allow_redirects=True)
    session.headers.update({'X-CSRFToken': login.cookies['csrftoken']})
    cookies = login.cookies
    # Print the html results after logging in
    print(login.text)
# In case of refused connection
except requests.exceptions.ConnectionError:
    print("Connection refused")
I found it in this YouTube video. It worked for me; I hope it can work for you too.
I've been trying to pull some data from Gearbest.com about several products, and I'm having real trouble pulling the shipping price.
I'm working with requests and BeautifulSoup, and so far I've managed to get the name + link + price.
How can I get the shipping price?
The URLs are:
https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363
https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363
I've tried:
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong")
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong").text
shipping = soup.find("strong", class_="goodsIntro_shippingCost")
shipping = soup.find("strong", class_="goodsIntro_shippingCost").text
soup is the return value from here (the url is each product link):
import requests
from bs4 import BeautifulSoup

def get_page(url):
    client = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"})
    try:
        client.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Error in gearbest with the url:", url)
        exit(0)
    soup = BeautifulSoup(client.content, 'lxml')
    return soup
Any ideas what I can do?
You want to use soup, not souo. Also, there seems to be a difference between what is returned from the request and what is on the page for me.
from bs4 import BeautifulSoup as bs
import requests
urls = ['https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363', 'https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363']
with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        print(soup.select_one('.goodsIntro_price').text)
        print(soup.select_one('.goodsIntro_shippingCost').text)  # soup.find("strong", class_="goodsIntro_shippingCost").text
For the actual price, there seem to be dynamic feeds for the price in the network tab, where it is stored under actualFee. So perhaps shipping prices are updated dynamically based on location.
from bs4 import BeautifulSoup as bs
import requests
urls = ['https://www.gearbest.com/goods/goods-shipping?goodSn=455718101&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=540DCE6E4F455639641E0BB2B6356F15&goodPrice=1729.99&num=1&categoryId=13300&saleSizeLong=50&saleSizeWide=40&saleSizeHigh=10&saleWeight=4.5&volumeWeight=4.5&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=2&backRuleId=',
        'https://www.gearbest.com/goods/goods-shipping?goodSn=459768501&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=91D909FDFFE8F8F1F9D1EC1D5D1B7C2C&goodPrice=159.99&num=1&categoryId=12004&saleSizeLong=12&saleSizeWide=10.5&saleSizeHigh=6.5&saleWeight=0.266&volumeWeight=0.266&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=1&backRuleId=']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
        print(r['data']['shippingMethodList'][0]['actualFee'])
I'm trying to scrape data from an aspx page using requests with POST data.
In the parsed HTML I'm getting the error "An application error occurred on the server. The current custom error settings for this application prevent the details of the application error from being viewed remotely (for security reasons). It could, however, be viewed by browsers running on the local server machine."
I've been searching for solutions for a while, but frankly I'm new to Python and can't really figure out what's wrong.
The ASPX page has a JavaScript onclick function which opens a new window with the data in HTML.
The code I've created is below.
Any help or suggestions would be greatly welcomed. Thank you!
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = 'http://ws1.osfi-bsif.gc.ca/WebApps/FINDAT/Insurance.aspx?T=0&LANG=E'
r = session.get(url)
soup = BeautifulSoup(r.content, 'lxml')
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

payload = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__LASTFOCUS': '',
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': 'B2E4460D',
    '__EVENTVALIDATION': eventvalidation,
    'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionType': 'radioButton1',
    'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionDropDownList': 'F018',
    'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportTemplateDropDownList': 'C_LIFE-1',
    'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportDateDropDownList': '3+-+2015',
    'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$submitButton': 'Submit'
}

HEADER = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": "11759",
    "Host": "ws1.osfi-bsif.gc.ca",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Cache-Control": "max-age=0",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}

df = session.post(url, data=payload, headers=HEADER)
print(df.text)
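One detail worth flagging in the code above (an observation, not a verified fix): the hand-set Content-Length of 11759 will almost certainly not match this payload, and a wrong length can corrupt the POST. requests computes Content-Length (and Host) itself when you leave them out, so a safer variant of the final request looks like this:

# Same POST as above, but letting requests compute Content-Length and Host.
# HEADER keeps only the fields that are safe to pin manually.
HEADER = {
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

df = session.post(url, data=payload, headers=HEADER)
print(df.text)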