Scrapy does not extract data - web-scraping

I am trying to get some technical information about automobiles from this page.
Here is my current code:
import scrapy
import re
from arabamcom.items import ArabamcomItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BasicSpider(CrawlSpider):
    name = "arabamcom"
    allowed_domains = ["arabam.com"]
    start_urls = ['https://www.arabam.com/ikinci-el/otomobil']
    rules = (Rule(LinkExtractor(allow=(r'/ilan')), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = ArabamcomItem()
        item['fiyat'] = response.css('span.color-red.font-huge.bold::text').extract()
        item['marka'] = response.css('p.color-black.bold.word-break.mb4::text').extract()
        item['yil'] = response.xpath('//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/div[2]/dl[1]/dd/span/text()').extract()
        yield item
And this is my items.py file
import scrapy

class ArabamcomItem(scrapy.Item):
    fiyat = scrapy.Field()
    marka = scrapy.Field()
    yil = scrapy.Field()
When I run the code I get data for the 'marka' and 'fiyat' items, but the spider does not get anything for the 'yil' attribute, nor for other parts like 'Yakit Tipi', 'Vites Tipi', etc. How can I solve this problem?

What's wrong:
//*[@id="js-hook-appendable-technicalPropertiesWrapper"]/......
This id starts with js, and the element is most likely appended dynamically by JavaScript.
Scrapy does not have the ability to render JavaScript by default.
There are two solutions you can try.
Scrapy-Splash
This is a JavaScript rendering service for Scrapy.
Install Splash as a Docker container
Modify your settings.py file to integrate Splash (append the following settings and middlewares to your project):
SPLASH_URL = 'http://127.0.0.1:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Replace your requests with SplashRequest:
from scrapy_splash import SplashRequest as SP
SP(url=url, callback=parse, endpoint='render.html', args={'wait': 5})
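For instance, a simplified sketch in a plain scrapy.Spider might look like this (the URL and the parse_item callback come from the question; routing CrawlSpider rule requests through Splash needs extra handling that is not shown here, and the spider name is hypothetical):

import scrapy
from scrapy_splash import SplashRequest

class ArabamSplashSpider(scrapy.Spider):
    name = 'arabamcom_splash'  # hypothetical name, only for this sketch
    start_urls = ['https://www.arabam.com/ikinci-el/otomobil']

    def start_requests(self):
        for url in self.start_urls:
            # render.html returns the DOM after JavaScript has executed
            yield SplashRequest(url, callback=self.parse_item,
                                endpoint='render.html', args={'wait': 5})

    def parse_item(self, response):
        # the technical-properties block should now be present in the rendered HTML
        yield {
            'yil': response.xpath('//*[@id="js-hook-appendable-technicalPropertiesWrapper"]'
                                  '/div[2]/dl[1]/dd/span/text()').extract(),
        }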
Selenium WebDriver
This is a browser automation and testing framework.
Install Selenium from PyPI and put the corresponding driver (e.g. geckodriver for Firefox) in a folder on your PATH.
Append the following middleware class to your project's middlewares.py file:
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver

class SeleniumMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        request.meta['driver'] = self.driver
        self.driver.get(request.url)
        self.driver.implicitly_wait(2)
        body = to_bytes(self.driver.page_source)
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_opened(self, spider):
        """Change your browser mode here"""
        self.driver = webdriver.Firefox()

    def spider_closed(self, spider):
        self.driver.close()
Modify your settings.py file to integrate the Selenium middleware (append the following to your project and replace yourproject with your project name):
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.SeleniumMiddleware': 200
}
Comparison
Scrapy-Splash
An official module maintained by the company behind Scrapy.
You can deploy a Splash instance to the cloud, so the URL is rendered there and the resulting HTML is sent back to your spider.
It's slow.
The Splash container can stop if it runs out of memory, so be sure to deploy the Splash instance on a machine with plenty of memory.
Selenium WebDriver
You need Firefox or Chrome with the corresponding automation driver on your machine, unless you use PhantomJS.
You can't modify request headers directly with Selenium WebDriver.

You could render the webpage with a headless browser, but this data can be extracted easily without one; try this:
import re
import ast

...

def parse_item(self, response):
    regex = re.compile(r'dataLayer.push\((\{.*\})\);', re.DOTALL)
    html_info = response.xpath('//script[contains(., "dataLayer.push")]').re_first(regex)
    data = ast.literal_eval(html_info)
    yield {'fiyat': data['CD_Fiyat'],
           'marka': data['CD_marka'],
           'yil': data['CD_yil']}
    # outputs an item like {'fiyat': '103500', 'marka': 'Renault', 'yil': '2017'}

Related

In gRPC python, the Service class and the Serve method are always in the same file, why?

for example -
the service class - link and the serve() - link
Similarly - service-class and serve()
I am new to Python and gRPC. In my project I wrote the serve() method in a different file, importing the service class. The server seems to start, but when I invoke it from client code (Postman) it doesn't work.
Here is my code - ems_validator_service.py contains the service class
and main.py has the serve() method
File: ems_validator_service.py -
from validator.src.grpc import ems_validator_service_pb2
from validator.src.grpc.ems_validator_service_pb2_grpc import EmsValidatorServiceServicer

class EmsValidatorServiceServicer(EmsValidatorServiceServicer):
    def Validate(self, request, context):
        # TODO: logic to validate
        return ems_validator_service_pb2.GetStatusResponse(
            validation_status=ems_validator_service_pb2.VALIDATION_STATUS_IN_PROGRESS)

    def GetStatus(self, request, context):
        # TODO: logic to get actual status
        return ems_validator_service_pb2.GetStatusResponse(
            validation_status=ems_validator_service_pb2.VALIDATION_STATUS_IN_PROGRESS)
File: main.py -
from validator.src.grpc.ems_validator_service_pb2_grpc import (
    EmsValidatorServiceServicer,
    add_EmsValidatorServiceServicer_to_server
)
from concurrent import futures
import grpc

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    add_EmsValidatorServiceServicer_to_server(EmsValidatorServiceServicer(), server)
    server.add_insecure_port('localhost:50051')  # todo change it
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
With the above code I can't invoke the RPC, but if I move the serve() method into the ems_validator_service.py file and call it from main.py, then it works fine. I'm not sure if this is a Python thing or a gRPC thing.
The error I get from client.py -
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = "Method not implemented!"
debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50051 {created_time:"2022-10-19T22:10:18.898439-07:00", grpc_status:12, grpc_message:"Method not implemented!"}"
>
As already mentioned, the same client works fine if I move the above serve() method into the service class's file.
These are just simple examples. It's totally fine to call serve() from a different file.
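A likely cause of the UNIMPLEMENTED error above is that main.py imports EmsValidatorServiceServicer from the generated *_pb2_grpc module, which is the empty base class, rather than the subclass defined in ems_validator_service.py. A sketch of a corrected main.py (assuming ems_validator_service.py is importable as validator.src.grpc.ems_validator_service) would be:

from concurrent import futures
import grpc

from validator.src.grpc.ems_validator_service_pb2_grpc import add_EmsValidatorServiceServicer_to_server
# Import the subclass that actually implements Validate/GetStatus,
# not the generated base class of the same name.
from validator.src.grpc.ems_validator_service import EmsValidatorServiceServicer

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    add_EmsValidatorServiceServicer_to_server(EmsValidatorServiceServicer(), server)
    server.add_insecure_port('localhost:50051')
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()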

How to set the FastAPI version so that HTTP requests can specify the version in the Accept header?

I am working on a project that requires me to version FastAPI endpoints. We want to version the endpoints through the HTTP Accept header, e.g. headers={'Accept': 'application/json;version=1.0.1'} or headers={'Accept': 'application/json;version=1.0.2'}. Only setting up the API version like this does not seem to work:
app = FastAPI(
    version=version,
    title="A title",
    description="Some description.",
)
Does anyone know what else I need to do with this?
Well, maybe putting the version in the path URL could be better.
See the sub-applications docs.
from fastapi import FastAPI

app = FastAPI()

v1 = FastAPI()

@v1.get("/app/")
def read_main():
    return {"message": "Hello World from api v1"}

v2 = FastAPI()

@v2.get("/app/")
def read_sub():
    return {"message": "Hello World from api v2"}

app.mount("/api/v1", v1)
app.mount("/api/v2", v2)
You will see the auto docs for each app
localhost:8000/api/v1/docs
localhost:8000/api/v2/docs
But you can always get the headers from the request:
from starlette.requests import Request
from fastapi import FastAPI

app = FastAPI()

@app.post("/hyper_mega_fast_service")
def fast_service(request: Request):
    accept = request.headers.get('Accept')
    value = great_fuction_to_get_version_from_header(accept)
    if value == '1.0.1':
        "Do something"
    if value == '1.0.2':
        "Do something"
Try fastapi-versioning, which provides API versioning for FastAPI web applications.
Installation
pip install fastapi-versioning
Examples
from fastapi import FastAPI
from fastapi_versioning import VersionedFastAPI, version

app = FastAPI(title="My App")

@app.get("/greet")
@version(1, 0)
def greet_with_hello():
    return "Hello"

@app.get("/greet")
@version(1, 1)
def greet_with_hi():
    return "Hi"

app = VersionedFastAPI(app)
this will generate two endpoints:
/v1_0/greet
/v1_1/greet
as well as:
/docs
/v1_0/docs
/v1_1/docs
/v1_0/openapi.json
/v1_1/openapi.json
There's also the possibility of adding a set of additional endpoints that redirect to the most recent API version. To do that, make the argument enable_latest True:
app = VersionedFastAPI(app, enable_latest=True)
this will generate the following additional endpoints:
/latest/greet
/latest/docs
/latest/openapi.json
In this example, /latest endpoints will reflect the same data as /v1.1.
Try it out:
pip install pipenv
pipenv install --dev
pipenv run uvicorn example.annotation.app:app
# pipenv run uvicorn example.folder_name.app:app
Usage without minor version
from fastapi import FastAPI
from fastapi_versioning import VersionedFastAPI, version

app = FastAPI(title='My App')

@app.get('/greet')
@version(1)
def greet():
    return 'Hello'

@app.get('/greet')
@version(2)
def greet():
    return 'Hi'

app = VersionedFastAPI(app,
    version_format='{major}',
    prefix_format='/v{major}')
this will generate two endpoints:
/v1/greet
/v2/greet
as well as:
/docs
/v1/docs
/v2/docs
/v1/openapi.json
/v2/openapi.json
Extra FastAPI constructor arguments
It's important to note that only the title from the original FastAPI app will be provided to the VersionedFastAPI app. If you have any middleware, event handlers, etc., these arguments will also need to be provided to the VersionedFastAPI call, as in the example below:
from fastapi import FastAPI, Request
from fastapi_versioning import VersionedFastAPI, version
from starlette.middleware import Middleware
from starlette.middleware.sessions import SessionMiddleware

app = FastAPI(
    title='My App',
    description='Greet uses with a nice message',
    middleware=[
        Middleware(SessionMiddleware, secret_key='mysecretkey')
    ]
)

@app.get('/greet')
@version(1)
def greet(request: Request):
    request.session['last_version_used'] = 1
    return 'Hello'

@app.get('/greet')
@version(2)
def greet(request: Request):
    request.session['last_version_used'] = 2
    return 'Hi'

@app.get('/version')
def last_version(request: Request):
    return f'Your last greeting was sent from version {request.session["last_version_used"]}'

app = VersionedFastAPI(app,
    version_format='{major}',
    prefix_format='/v{major}',
    description='Greet users with a nice message',
    middleware=[
        Middleware(SessionMiddleware, secret_key='mysecretkey')
    ]
)

How to set an OpenTelemetry span attribute for a FastAPI route?

Adding a tag to a trace's span can be very useful to later analyse the tracing data and slice & dice it by the wanted tag.
After reading the OpenTelemetry docs, I couldn't figure out a way to add a custom tag to a span.
Here is my sample FastAPI application, already instrumented with OpenTelemetry:
"""main.py"""
from typing import Dict
import fastapi
from opentelemetry import trace
from opentelemetry.sdk.trace.export import (
ConsoleSpanExporter,
SimpleExportSpanProcessor,
)
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleExportSpanProcessor(ConsoleSpanExporter()))
app = fastapi.FastAPI()
#app.get("/user/{id}")
async def get_user(id: int) -> Dict[str, str]:
"""Test endpoint."""
return {"message": "hello user!"}
FastAPIInstrumentor.instrument_app(app)
You can run it with uvicorn main:app --reload
How can I add the user id to the span?
After reading the source code of the ASGI instrumentation's OpenTelemetryMiddleware (here), I realised you can simply get the current span, set the tag (or attribute), and the current span will be returned with all its attributes.
#app.get("/user/{id}")
async def get_user(id: int) -> Dict[str, str]:
"""Test endpoint."""
# Instrument the user id as a span tag
current_span = trace.get_current_span()
if current_span:
current_span.set_attribute("user_id", id)
return {"message": "hello user!"}

Login working in Splash API but not when SplashRequest is used

I'm relatively new to Splash. I'm trying to scrape a website which needs a login. I started off with the Splash API, with which I was able to log in perfectly. However, when I put my code in a Scrapy spider script using SplashRequest, it's not able to log in.
import scrapy
from scrapy_splash import SplashRequest

class Payer1Spider(scrapy.Spider):
    name = "payer1"
    start_url = "https://provider.wellcare.com/provider/claims/search"

    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        splash:wait(0.5)
        local search_input = splash:select('#Username')
        search_input:send_text('')
        local search_input = splash:select('#Password')
        search_input:send_text('')
        assert(splash:wait(0.5))
        local login_button = splash:select('#btnSubmit')
        login_button:mouse_click()
        assert(splash:wait(7))
        return {splash:html()}
    end
    """

    def start_requests(self):
        yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script},)

    def parse_result(self, response):
        yield {'doc_title': response.text}
The output HTML is the login page and not the one after logging in.
You have to add endpoint='execute' to your SplashRequest to execute the lua-script:
yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script}, endpoint='execute')
I believe you don't actually need Splash to log in to the site. You can try the following:
Get https://provider.wellcare.com and then...
# Get the request verification token...
token = response.css('input[name=__RequestVerificationToken]::attr(value)').get()

# Forge the POST request payload...
data = [
    ('__RequestVerificationToken', token),
    ('Username', 'user'),
    ('Password', 'pass'),
    ('ReturnUrl', '/provider/claims/search'),
]

# Make a dict from the list of tuples
formdata = dict(data)

# And then execute the request (set a callback to continue after login)
yield scrapy.FormRequest(
    url='https://provider.wellcare.com/api/sitecore/Login',
    formdata=formdata
)
Not completely sure if all of this will work. But you can try.
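Put together as a spider, the idea looks roughly like this (a sketch only: the login endpoint, field names, and ReturnUrl are taken from the fragments above and are unverified, and the spider and callback names are hypothetical):

import scrapy

class Payer1FormLoginSpider(scrapy.Spider):
    name = 'payer1_form_login'  # hypothetical name for this sketch
    start_urls = ['https://provider.wellcare.com']

    def parse(self, response):
        # Grab the anti-forgery token from the login page
        token = response.css('input[name=__RequestVerificationToken]::attr(value)').get()
        yield scrapy.FormRequest(
            url='https://provider.wellcare.com/api/sitecore/Login',
            formdata={
                '__RequestVerificationToken': token,
                'Username': 'user',
                'Password': 'pass',
                'ReturnUrl': '/provider/claims/search',
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # The session cookie from the login response is reused automatically,
        # so the claims search page should now be reachable.
        yield {'doc_title': response.text}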

Download entire web pages and save them as html file with urllib.request

I can save multiple web pages using the code below; however, I can't see a proper website view after saving them as HTML. For example, the text in tables is shifted and images can't be seen.
I need to download entire pages, just as we do with "Save as" in any web browser, so that I can see a proper view.
import urllib.request

url = 'https://asd.com/asdID='
for i in range(1, 5):
    print(' --> ID:', i)
    newurl = url + str(i)
    f = open(str(i) + '.html', 'w')
    page = urllib.request.urlopen(newurl)
    pagetext = str(page.read())
    f.write(pagetext)
    f.close()
You can use Selenium instead to download the full website nicely. Just run the following code:
from selenium import webdriver

# Download the Chrome driver from the link below and specify the path of chromedriver
# https://chromedriver.storage.googleapis.com/index.html?path=2.40/
chromedriver = 'C:/python36/chromedriver.exe'

url = 'https://asd.com/asdID='
for i in range(1, 5):
    browser = webdriver.Chrome(chromedriver)
    browser.get(url + str(i))
    data = browser.page_source
    with open("webpage%s.html" % (str(i)), "w+") as f:
        f.write(data)
UPDATE
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import ahk

firefox = FirefoxBinary("C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe")
driver = webdriver.Firefox(firefox_binary=firefox)
driver.get("http://www.yahoo.com")

ahk.start()
ahk.ready()
ahk.execute("Send,^s")
ahk.execute("WinWaitActive, Save As,,2")
ahk.execute("WinActivate, Save As")
ahk.execute("Send, C:\\path\\to\\file.htm")
ahk.execute("Send, {Enter}")
You will now get everything
