I'm relatively new to Splash. I'm trying to scrape a website that requires a login. I started off with the Splash API, with which I was able to log in perfectly. However, when I put my code in a Scrapy spider script using SplashRequest, it's not able to log in.
import scrapy
from scrapy_splash import SplashRequest

class Payer1Spider(scrapy.Spider):
    name = "payer1"
    start_url = "https://provider.wellcare.com/provider/claims/search"

    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        splash:wait(0.5)
        local search_input = splash:select('#Username')
        search_input:send_text('')
        local search_input = splash:select('#Password')
        search_input:send_text('')
        assert(splash:wait(0.5))
        local login_button = splash:select('#btnSubmit')
        login_button:mouse_click()
        assert(splash:wait(7))
        return {splash:html()}
    end
    """

    def start_requests(self):
        yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script})

    def parse_result(self, response):
        yield {'doc_title': response.text}
The output HTML is the login page and not the one after logging in.
You have to add endpoint='execute' to your SplashRequest to execute the Lua script:
yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script}, endpoint='execute')
I believe you don't actually need Splash to log in to this site. You can try the following: request https://provider.wellcare.com first, and then:
# Get the request verification token
token = response.css('input[name=__RequestVerificationToken]::attr(value)').get()

# Forge the POST request payload
data = [
    ('__RequestVerificationToken', token),
    ('Username', 'user'),
    ('Password', 'pass'),
    ('ReturnUrl', '/provider/claims/search'),
]

# Make a dict from the list of tuples
formdata = dict(data)

# And then execute the request
yield scrapy.FormRequest(
    url='https://provider.wellcare.com/api/sitecore/Login',
    formdata=formdata,
)
Not completely sure if all of this will work. But you can try.
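As a self-contained sketch of the token-and-payload step, here is the same idea against a hard-coded sample of the login page HTML, using a plain regex in place of Scrapy's CSS selector (the field names are the ones from the payload above; the sample HTML and token value are made up for illustration):

```python
import re

# A minimal stand-in for the login page HTML (the real page is fetched by Scrapy)
sample_html = '<form><input name="__RequestVerificationToken" type="hidden" value="abc123"></form>'

# Equivalent of response.css('input[name=__RequestVerificationToken]::attr(value)').get()
match = re.search(r'name="__RequestVerificationToken"[^>]*value="([^"]+)"', sample_html)
token = match.group(1)

# Same payload shape as above: list of tuples, then a dict for FormRequest
data = [
    ('__RequestVerificationToken', token),
    ('Username', 'user'),
    ('Password', 'pass'),
    ('ReturnUrl', '/provider/claims/search'),
]
formdata = dict(data)
print(formdata['__RequestVerificationToken'])  # -> abc123
```

In the spider, formdata would then be passed straight to scrapy.FormRequest as shown above.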
We are trying to execute a GET request from the Azure Web App using our Python code. The request is made to our DevOps repo following the available API. This is just a part (example) of the entire code:
from fastapi import FastAPI
import requests
import base64
import uvicorn

app = FastAPI()

@app.get("/")
async def root():
    return {"Hello": "Pycharm"}

@app.get('/file')
async def sqlcode(p: str = '/ppppppppp',
                  v: str = 'vvvv',
                  r: str = 'rrrrrr',
                  pn: str = 'nnnnnnn'):
    organizational_url = f'https://xxxxxxxx/{pn}/_apis/git/repositories/{r}/items?path={p}&versionDescriptor.version={v}&api-version=6.1-preview.1'
    username = 'username'
    password = 'password'
    basic_authentication = base64.b64encode((f'{username}:{password}').encode('utf-8')).decode('utf-8')
    headers = {
        'Authorization': f'Basic {basic_authentication}',
    }
    response = requests.get(url=organizational_url, headers=headers)
    return {"sqlcode": response.text}

# Code ends for FastAPI with Swagger

def print_hi(name):
    print(f'Hi, {name}')  # Press Ctrl+F8 to toggle the breakpoint.

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    print_hi('PyCharm')
And this is the requirements.txt
Flask==2.0.1
asyncpg==0.21.0
databases==0.4.1
fastapi==0.63.0
gunicorn==20.0.4
pydantic==1.7.3
SQLAlchemy==1.3.22
uvicorn==0.11.5
config==0.3.9
pandas==1.5.3
dask==2023.1.0
azure.storage.blob==12.14.1
pysftp==0.2.9
azure-identity==1.12.0
azure-keyvault-secrets==4.6.0
When we execute the code in PyCharm, it works without any issue. But when we execute the same code from the Web App, we get an Internal Server Error message.
I'm just trying to help the team understand this error; I'm not a cloud engineer. What could be missing? Is it something in the code, a particular port to use, or a permission needed in the Azure Web Service?
I am running a Scrapy spider that starts by getting an authorization token from the website I am scraping, using the basic requests library. The function for this is called get_security_token(). This token is passed as a header to the Scrapy request. The issue is that the token expires after 300 seconds, after which I get a 401 error. Is there any way for a spider to see the 401 error, run the get_security_token() function again, and then pass the new token on to all future request headers?
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'

    def start_requests(self):
        urls = ['URL GOES HERE']
        header_data = {'Authorization': 'Bearer 72bb65d7-2ff1-3686-837c-61613454928d'}
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=header_data)

    def parse(self, response):
        yield response.json()
If it's pure Scrapy, you can add handle_httpstatus_list = [401] after start_urls, and then in your parse method you need to do something like this:

if response.status == 401:
    get_security_token()
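The refresh logic itself can be sketched independently of Scrapy. Below, get_security_token() is a stub standing in for the asker's real function, and TokenManager is a hypothetical helper the spider could call from its parse callback (with handle_httpstatus_list = [401] set so the 401 actually reaches the callback):

```python
def get_security_token():
    # Stub: the real function would request a fresh token from the site
    return "new-token"

class TokenManager:
    """Caches an auth token and refreshes it when a request comes back 401."""

    def __init__(self):
        self.token = get_security_token()

    def headers(self):
        # Header dict to attach to each outgoing scrapy.Request
        return {'Authorization': f'Bearer {self.token}'}

    def handle_response(self, status):
        """Return True if the request should be retried with fresh headers."""
        if status == 401:
            self.token = get_security_token()
            return True
        return False
```

On a True return, the spider would re-yield the failed request with the new headers and dont_filter=True, so Scrapy's duplicate filter doesn't drop the retry.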
I have a Flask app served by Gunicorn with NGINX as the web server. I only have two endpoints that accept POST requests: a login and a submit-post endpoint. The login works flawlessly; however, whenever I attempt to POST to the submit-post endpoint, I receive a 404. My code works running on localhost without NGINX or Gunicorn, and to detect whether the request is POST vs GET I am using WTForms' validate_on_submit method. The NGINX error and access logs don't show anything out of the ordinary.
Edit: Here is the code for the endpoints with post requests
New Post:
@app.route('/newpost', methods=['GET', 'POST'])
@login_required
def new_post():
    form = BlogForm()
    if form.validate_on_submit():
        # Omitted code for getting form data and creating form object
        try:
            # Omitted code for database actions
            return redirect(url_for('index', _external=True))
        except IntegrityError:
            flash("Error inserting into database")
            return ('', 203)
    else:
        # Omitted some code for template rendering
        return render_template('new_post.html')
Login:
@app.route('/login', methods=['GET', 'POST'])
def login():
    form = LoginForm()
    userAuth = current_user.is_authenticated
    if form.validate_on_submit():
        flash("Login attempt logged")
        user = User.query.filter_by(username=form.userID.data).first()
        if user:
            checkPass = form.password.data.encode('utf-8')
            if bcrypt.checkpw(checkPass, user.password):
                user.authenticated = True
                login_user(user, remember=form.remember_me.data)
                return redirect(url_for('index', _external=True))
            else:
                form.errors['password'] = 'false'
        else:
            form.errors['username'] = 'false'
    else:
        return render_template('login.html',
                               title='Log In',
                               loggedIn=userAuth,
                               form=form)
The problem ended up being with NGINX after all, I fixed it by changing the ownership of /var/lib/nginx from the nginx user to the user running the Flask application. Prior to doing this AWS wasn't letting me upload files from my form in new_post.
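For reference, the ownership change was along these lines (appuser is a placeholder for whatever user runs the Flask app; adjust to your setup):

```shell
# Give the Flask user ownership of nginx's buffer/temp directory,
# where nginx spools large request bodies such as file uploads
sudo chown -R appuser:appuser /var/lib/nginx

# Reload nginx so workers pick up a clean state
sudo systemctl reload nginx
```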
I am trying to get some technical information about automobiles from this page.
Here is my current code:
import scrapy
import re
from arabamcom.items import ArabamcomItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BasicSpider(CrawlSpider):
    name = "arabamcom"
    allowed_domains = ["arabam.com"]
    start_urls = ['https://www.arabam.com/ikinci-el/otomobil']
    rules = (Rule(LinkExtractor(allow=(r'/ilan')), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = ArabamcomItem()
        item['fiyat'] = response.css('span.color-red.font-huge.bold::text').extract()
        item['marka'] = response.css('p.color-black.bold.word-break.mb4::text').extract()
        item['yil'] = response.xpath('//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/div[2]/dl[1]/dd/span/text()').extract()
And this is my items.py file
import scrapy

class ArabamcomItem(scrapy.Item):
    fiyat = scrapy.Field()
    marka = scrapy.Field()
    yil = scrapy.Field()
When I run the code I can get data for the 'marka' and 'fiyat' items, but the spider does not get anything for the 'yil' attribute, or for other parts like 'Yakit Tipi', 'Vites Tipi', etc. How can I solve this problem?
What's wrong:
//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/......
This id starts with js and may be a dynamic element appended by JavaScript.
Scrapy does not have the ability to render JavaScript by default.
There are two solutions you can try:
Scrapy-Splash
This is a JavaScript rendering engine for Scrapy.
Install Splash as a Docker container.
Modify your settings.py file to integrate Splash (append the following middlewares to your project):
SPLASH_URL = 'http://127.0.0.1:8050'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Replace your Request Function with SplashRequest
from scrapy_splash import SplashRequest as SP
SP(url=url, callback=parse, endpoint='render.html', args={'wait': 5})
Selenium WebDriver
This is a browser automation-testing framework
Install Selenium from PyPI and put the corresponding driver (e.g. Firefox -> geckodriver) on your PATH.
Append the following middleware class to your project's middlewares.py file:
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver

class SeleniumMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        request.meta['driver'] = self.driver
        self.driver.get(request.url)
        self.driver.implicitly_wait(2)
        body = to_bytes(self.driver.page_source)
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_opened(self, spider):
        """Change your browser mode here"""
        self.driver = webdriver.Firefox()

    def spider_closed(self, spider):
        self.driver.close()
Modify your settings.py file to integrate the Selenium middleware (append following middleware to your project and replace yourproject with your project name)
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.SeleniumMiddleware': 200
}
Comparison
Scrapy-Splash
An official module from the Scrapy developers.
You can deploy a Splash instance to the cloud, browse the URL in the cloud, and transfer the rendered HTML back to your spider.
It's slow.
The Splash container will stop if there is a memory leak (be sure to deploy the Splash instance on a high-memory cloud instance).
Selenium WebDriver
You need Firefox or Chrome with the corresponding automated-test driver on your machine, unless you use PhantomJS.
You can't modify request headers directly with Selenium WebDriver.
You could render the webpage with a headless browser, but this data can easily be extracted without one. Try this:
import re
import ast
...
def parse_item(self, response):
    regex = re.compile(r'dataLayer\.push\((\{.*\})\);', re.DOTALL)
    html_info = response.xpath('//script[contains(., "dataLayer.push")]').re_first(regex)
    data = ast.literal_eval(html_info)
    yield {'fiyat': data['CD_Fiyat'],
           'marka': data['CD_marka'],
           'yil': data['CD_yil']}
    # outputs an item like {'fiyat': '103500', 'marka': 'Renault', 'yil': '2017'}
I've been trying to scrape some lists from this website, http://www.golf.org.au; it's ASP.NET-based. I did some research, and it appears that I must pass some values in a POST request to make the website fetch the data into the tables. I did that, but I'm still failing. Any idea what I'm missing?
Here is my code:
# -*- coding: utf-8 -*-
import scrapy

class GolfscraperSpider(scrapy.Spider):
    name = "golfscraper"
    allowed_domains = ["golf.org.au", "www.golf.org.au"]
    ids = ['3012801330', '3012801331', '3012801332', '3012801333']
    start_urls = []
    for id in ids:
        start_urls.append('http://www.golf.org.au/handicap/%s' % id)

    def parse(self, response):
        scrapy.FormRequest('http://www.golf.org.au/default.aspx?s=handicap',
            formdata={
                '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                'ctl11$ddlHistoryInMonths': '48',
                '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
                '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                'gaHandicap': '6.5',
                'golflink_No': '2012003003',
                '__VIEWSTATEGENERATOR': 'CA0B0334',
            },
            callback=self.parse_details)

    def parse_details(self, response):
        for name in response.css('div.rnd-course::text').extract():
            yield {'name': name}
Yes, ASP pages are tricky to scrape. Most probably some little parameter is missing.
Solution for this:
instead of creating the request through scrapy.FormRequest(...) use the scrapy.FormRequest.from_response() method (see code example below). This will capture most or even all of the hidden form data and use it to prepopulate the FormRequest's data.
it seems you forgot to return the request; maybe that's another potential problem too
as far as I recall, the __VIEWSTATEGENERATOR also changes each time and has to be extracted from the page
If this doesn't work, fire up your Firefox browser with Firebug plugin or Chrome's developer tools, do the request in the browser and then check the full request header and body data against the same data in your request. There will be some difference.
Example code with all my suggestions:
def parse(self, response):
    req = scrapy.FormRequest.from_response(response,
        formdata={
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            'ctl11$ddlHistoryInMonths': '48',
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'gaHandicap': '6.5',
            'golflink_No': '2012003003',
            '__VIEWSTATEGENERATOR': 'CA0B0334',
        },
        callback=self.parse_details)
    self.logger.info(req.headers)
    self.logger.info(req.body)
    return req