Scrapy Selenium middleware with meta - web-scraping

Basically, I have a working version of a middleware that passes all requests through Selenium and returns an HtmlResponse. The problem is that I also want some meta data attached to the request, which I can then access in the parse method of the spider. For some reason I can't access it in the parse method; could you help me, please?
middleware.py
from scrapy.http import HtmlResponse

def process_request(self, request, spider):
    # attach the meta data and keep a reference to the modified request
    request = request.replace(meta={'test': 'test'})
    self.driver.get(request.url)
    body = self.driver.page_source
    # request= ties the response to this request, so response.meta resolves in the spider
    return HtmlResponse(self.driver.current_url, body=body,
                        encoding='utf-8', request=request)
spider.py
def parse(self, response):
    yield {'meta': response.meta}

In the yield of the first function, where we request the URL, there is another argument called meta, by which we can pass details from the first function to the second one in a dictionary.
For example, in the first function we will have:
yield Request(url, callback=self.parse_function, meta={'Date Added':date, 'Category':category}) 
In the second function, what we yield is this meta dictionary instead. It will look like this:
yield response.meta 
What about the details we get from the second function? We should first add them to the dictionary before yielding, just as with any dictionary, like this:
response.meta['Brand'] = brand
response.meta['Model'] = model
response.meta['Price'] = price
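Putting it all together, a minimal two-callback sketch (the start URL, selectors, and field values here are made up for illustration; note that response.meta will also carry Scrapy's internal keys such as depth and download_slot):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/listing']

    def parse(self, response):
        for url in response.css('a.item::attr(href)').getall():
            # pass listing-level details to the item callback via meta
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_item,
                                 meta={'Category': 'photos'})

    def parse_item(self, response):
        # add item-level details to the meta dict, then yield it
        response.meta['Price'] = response.css('span.price::text').get()
        yield response.meta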

Related

FastAPI - post any key/value data

I'm using FastAPI.
I need an API route that allows users to post any key/value data. I use this to let users add custom profile fields, where the key is a string and the value is a string, number, or boolean.
How do I add such a route?
I'm using this route, but it is not working:
@route.post('/update_profile')
def update_profile(access_token, **kwargs):
    # I need to get a dictionary like this:
    # {"name": "John", "nick_name": "Juju", "birth_year": 1999, "allow_newsletter": False}
    # and so on... any key/value pair
    pass
I want to be able to post any key/value pair(s) to this route. Is there any way to do it with FastAPI?
Thank you.
You can use the request object directly:
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/something")
async def get_body(request: Request):
    return await request.json()
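Any JSON object posted to this route is returned unchanged. A quick check with FastAPI's TestClient (a sketch, assuming the app defined above):

from fastapi.testclient import TestClient

client = TestClient(app)
r = client.post("/something", json={"name": "John", "birth_year": 1999})
print(r.json())  # {'name': 'John', 'birth_year': 1999}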
After searching for a way to do it, I found this solution:
from typing import Any, Optional

@route.post('/update_profile')
def update_profile(access_token, custom_fields: Optional[dict[str, Any]]):
    pass
And this is the best solution so far (for me).
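With this signature, FastAPI reads the dict-typed custom_fields parameter from the JSON body, while the untyped access_token becomes a query parameter, so a call looks roughly like this (a sketch with a made-up token, using a TestClient wrapped around the app that includes this route):

from fastapi.testclient import TestClient

client = TestClient(app)
r = client.post('/update_profile?access_token=abc123',
                json={'name': 'John', 'allow_newsletter': False})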

Access fastapi.Request object without router method

I have a FastAPI app and on app-startup, I create a State for it like this:
app.state.player["id"] = Player()
And now I want to access this player state whenever I need it. The only way I know of accessing this state is through the fastapi.Request object, like this:
request.app.state.player["id"]
I tried this:
def _return_player(request: Request):
    return request.app.state.player["id"]

async def process_new_player(player: Player = Depends(_return_player)):
    await player.play()
but apparently fastapi.Depends can be used only inside router methods.
And then I wanted to do something like this:
async def process_new_player():
    player = request.app.state.player["id"]
    await player.play()
but again I was not sure how to get the value of fastapi.Request. And no, the process_new_player method is not initiated from any router at the very beginning (if that were the case, I could pass the fastapi.Request object down to here).
How can I access the value of request.app.state.player["id"] inside process_new_player method?
I already know about the starlette-context library, but I want a solution without a 3rd party and a simpler one.
Thanks!
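One dependency-free option, sketched under the assumption that the app instance is importable from wherever process_new_player lives: request.app is just a reference back to the application object, so request.app.state and app.state are the same object, and any module that can import app can read the state without a Request.

from fastapi import FastAPI

app = FastAPI()
app.state.player = {}  # filled at startup, e.g. app.state.player["id"] = Player()

async def process_new_player():
    # app.state is the same object router methods see via request.app.state
    player = app.state.player["id"]
    await player.play()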

FastAPI: Return response with an array of objects

I'm currently working on an API that returns scraped data. The data is stored as an array, but when I return that array as a response from an endpoint, I get the following error:
pydantic.error_wrappers.ValidationError: 1 validation error for InnerObject
response
  value is not a valid dict (type=type_error.dict)
Here's a simplified version of what I'm trying to achieve:
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InnerObject(BaseModel):
    foo: str

class OuterObject(BaseModel):
    bar: List[InnerObject]

@app.get("/test_single", response_model=InnerObject)
def test():
    return InnerObject(foo="Hello Mars")

@app.get("/test_multiple", response_model=OuterObject)
def test():
    objects = [InnerObject]
    objects.append(InnerObject(foo="Hello Earth"))
    objects.append(InnerObject(foo="Hello Mars"))
    return objects
I have an array of objects that I want to return as a response. It's also possible that I don't need the outer/inner models, but I have also attempted this with response_model=List[InnerObject]. Returning a single InnerObject, as in the "/test_single" endpoint, works fine, so I assume the problem is with trying to return a list of InnerObject.
Thanks for any responses in advance
Solution
Thanks kosciej16, the problem was that I was adding the class name when declaring the list: I was writing objects = [InnerObject] when I should have been writing objects = [].
Generally, FastAPI tries to create OuterObject from the thing you return.
In such a case you have a few options.
Create the object explicitly:

@app.get("/test_multiple", response_model=OuterObject)
def test():
    objects = []
    objects.append(InnerObject(foo="Hello Earth"))
    objects.append(InnerObject(foo="Hello Mars"))
    return OuterObject(bar=objects)
Change the response_model:

@app.get("/test_multiple", response_model=List[InnerObject])
def test():
    objects = []
    objects.append(InnerObject(foo="Hello Earth"))
    objects.append(InnerObject(foo="Hello Mars"))
    return objects
Change the definition of OuterObject:

class OuterObject(List[InnerObject]):
    pass
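Either of the first two variants can be verified quickly with FastAPI's TestClient (a sketch, assuming the app and models above):

from fastapi.testclient import TestClient

client = TestClient(app)
print(client.get("/test_multiple").json())
# first variant:  {'bar': [{'foo': 'Hello Earth'}, {'foo': 'Hello Mars'}]}
# second variant: [{'foo': 'Hello Earth'}, {'foo': 'Hello Mars'}]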

Web2py: Sending JSON data via a REST API POST call in the Web2py scheduler

I have a form, one field of which is validated as IS_JSON:
db.define_table('vmPowerOpsTable',
    Field('launchId', label=T('Launch ID'),
          default=datetime.datetime.now().strftime("%d%m%y%H%M%S")),
    Field('launchDate', label=T('Launched On'), default=datetime.datetime.now()),
    Field('launchBy', label=T('Launched By'),
          default=auth.user.email if auth.user else "Anonymous"),
    Field('inputJson', 'text', label=T('Input JSON*'),
          requires=[IS_NOT_EMPTY(error_message='Input JSON is required'),
                    IS_JSON(error_message='Invalid JSON')]),
    migrate=True)
When the user submits this form, the data is also simultaneously inserted into another table.
db.opStatus.insert(launchId=vmops_launchid, launchDate=vmops_launchdate,
                   launchBy=vmops_launchBy, opsType=operation_type,
                   opsData=vmops_inputJson,
                   statusDetail="Pending")
db.commit()
Now from the scheduler, I am trying to retrieve this data and make a POST request.
vm_power_opStatus_row_data = vm_power_opStatus_row.opsData
Note that in the above step I am able to retrieve the data. (I inserted it into a DB and saw that the field exactly matches what the user entered.)
Then from the scheduler, I do a POST.
power_response = requests.post(vm_power_op_url, json=vm_power_opStatus_row_data)
The POST request is handled by a function in my controller.
Controller function:

@request.restful()
def vmPowerOperation():
    response.view = 'generic.json'
    si = None
    def POST(*args, **vars):
        jsonBody = request.vars
        print "Debug 1" + str(jsonBody)  # -> here jsonBody is blank
But if I make the same request from outside (a Postman client or even a Python request), I get the desired result.
Is anything going wrong with the data type when I am trying to fetch it from the table?
power_response = requests.post(vm_power_op_url,
                               json=vm_power_opStatus_row_data)
It appears that vm_power_opStatus_row_data is already a JSON-encoded string. However, the json argument to requests.post() should be a Python object, not a string (requests will automatically encode the Python object to JSON and set the content type appropriately). So, the above should be:
power_response = requests.post(vm_power_op_url,
                               json=json.loads(vm_power_opStatus_row_data))
Alternatively, you can use the data argument and set the content type to JSON:
power_response = requests.post(vm_power_op_url,
                               data=vm_power_opStatus_row_data,
                               headers={'Content-Type': 'application/json'})
Also, note that in your REST POST function, request.vars is already passed to the function as **vars, so within the function, you can simply refer to vars rather than request.vars.
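To see why the decoding matters, compare what requests would actually send in each case (a quick illustration; the payload value is made up). The json argument runs its value through json.dumps, so passing an already-encoded string double-encodes it:

import json

payload = '{"power": "on"}'  # JSON-encoded string, as stored in the table

print(json.dumps(payload))              # "{\"power\": \"on\"}"  (double-encoded)
print(json.dumps(json.loads(payload)))  # {"power": "on"}        (the intended body)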

Change website delivery country with Scrapy

I need to scrape the website http://www.yellowkorner.com/.
By choosing a different country, all the prices change. There are 40+ countries listed, and each of them must be scraped.
My current spider is pretty simple
# coding=utf-8
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
            yield scrapy.Request(response.urljoin(url), self.parse_prices)

    def parse_prices(self, response):
        yield None
How can I scrape price information for all countries?
Open the page with Firebug and refresh. Inspecting the web page in the Network panel / Cookies sub-panel, you will see that the page saves the country information in cookies.
So you have to force the cookie "YellowKornerCulture" attribute values LANGUAGE and COUNTRY on the request. I made an example based on your code that gets the countries available on the site and loops over them to get all the prices. See the code below:
# coding=utf-8
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        countries = self.get_countries(response)
        # countries = ['BR', 'US']  # try this if you only want some countries
        for country in countries:
            # With the expression re(r'/photos/\d\d\d\d/.*$') you only get
            # photos with 4-digit ids. I think this is not your goal.
            for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
                yield scrapy.Request(response.urljoin(url),
                                     cookies={'YellowKornerCulture': 'Language=US&Country=' + str(country),
                                              'YellowKornerHistory': '',
                                              'ASP.NET_SessionId': ''},
                                     callback=self.parse_prices,
                                     dont_filter=True,
                                     meta={'country': country})

    def parse_prices(self, response):
        yield {
            'name': response.xpath('//h1[@itemprop="name"]/text()').extract()[0],
            'price': response.xpath('//span[@itemprop="price"]/text()').extract()[0],
            'country': response.meta['country']
        }

    # function that gets the countries available on the site
    def get_countries(self, response):
        return response.xpath('//select[@id="ctl00_languageSelection_ddlCountry"]/option/attribute::value').extract()
It took a while to figure this out, but you have to erase other cookies that the site uses to choose the language page. Also, I fixed the language value to English (US). The parameter dont_filter=True is used because you request an already-requested URL on each loop iteration, and Scrapy's default behavior is not to repeat a request to the same URL, for performance reasons.
PS: The xpath expressions provided can be improved.
Hope this helps.
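If you prefer launching the spider from a plain Python script instead of the scrapy CLI, the standard CrawlerProcess pattern works (the output file name here is made up):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'FEEDS': {'prices.json': {'format': 'json'}}})
process.crawl(BlogSpider)
process.start()  # blocks until the crawl finishes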
