Extract text from a site using Scrapy spider - web-scraping

I am trying to extract the description of a book from the Amazon site. Note: I am using a Scrapy spider.
This is the link to the Amazon book: https://www.amazon.com/Local-Woman-Missing-Mary-Kubica/dp/1665068671
This is the div that contains the description text:
<div aria-expanded="true" class="a-expander-content a-expander-partial-collapse-content a-expander-content-expanded" style="padding-bottom: 20px;">
  <p><span class="a-text-bold">MP3 CD Format</span></p>
  <p><span class="a-text-bold">People don’t just disappear without a trace…</span></p>
  <p class="a-text-bold"><span class="a-text-bold">Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.</span></p>
  <p class="a-text-bold"><span class="a-text-bold">Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…</span></p>
  <p class="a-text-bold"><span class="a-text-bold">In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried.</span></p>
  <p></p>
</div>
I actually tried these lines:
div = response.css(".a-expander-content.a-expander-partial-collapse-content.a-expander-content-expanded")
description = " ".join([re.sub('<.*?>', '', span) for span in response.css('.a-expander-content span').extract()])
but it's not working as expected. If you have any idea, please share it here. Thanks in advance.

Here is the Scrapy code:
import scrapy
from scrapy import Request

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = ['https://www.amazon.com/dp/1665068671']

    def start_requests(self):
        yield Request(self.start_urls[0], callback=self.parse_book)

    def parse_book(self, response):
        # Grab every text node inside the book-description expander.
        description = "".join(response.css('[data-a-expander-name="book_description_expander"] .a-expander-content ::text').getall())
        yield {"description": description}
Output:
{'description': ' MP3 CD FormatPeople don’t just disappear without a trace…Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried. '}
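The output above still carries the stray whitespace Amazon leaves between the <p>/<span> fragments. If that matters, here is a minimal sketch of a variant of the parse_book callback that normalizes it (plain Python string handling; nothing beyond the selector already shown is assumed):
def parse_book(self, response):
    parts = response.css(
        '[data-a-expander-name="book_description_expander"] '
        '.a-expander-content ::text').getall()
    # Join the fragments, then collapse runs of whitespace into single spaces.
    description = ' '.join(' '.join(parts).split())
    yield {"description": description}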

Related

How can I get tweet having only tweet ID without using twitter API?

I have a large number of tweet IDs that were collected by other people (https://github.com/echen102/us-pres-elections-2020), and I now want to fetch these tweets from those IDs. What can I do without the Twitter API?
Do you want the URL? It is: https://twitter.com/user/status/<tweet_id>
If you want the text of the tweet without using the API, you have to render the page and then scrape it.
You can do it with one module, requests-html:
from requests_html import HTMLSession

session = HTMLSession()
url = "https://twitter.com/user/status/1414963866304458758"
r = session.get(url)
r.html.render(sleep=2)  # execute the page's JavaScript, then wait 2 s
tweet_text = r.html.find('.css-1dbjc4n.r-1s2bzr4', first=True)
print(tweet_text.text)
Output:
Here’s a serious national security question: Why does the Biden administration want to protect COMMUNISM from blame for the Cuban Uprising? They attribute it to vaccines. Even if the Big Guy can’t comprehend it, Hunter could draw a picture for him.
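Rendering the page in a headless browser is slow, and those CSS class names are fragile. A lighter, hedged alternative is Twitter's public oEmbed endpoint, which returns the tweet markup as JSON without authentication (its availability and response shape are subject to change):
import requests

tweet_url = 'https://twitter.com/user/status/1414963866304458758'
# publish.twitter.com/oembed is unauthenticated; it returns JSON whose
# 'html' field is a <blockquote> containing the tweet text.
resp = requests.get('https://publish.twitter.com/oembed',
                    params={'url': tweet_url}, timeout=30)
print(resp.json()['html'])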

scrapy giving a different output than on website, problem with geo location?

I'm really a newbie at all of this and am just trying to learn a bit more.
I had a lot of help getting this going, but now I'm stuck on a very weird problem.
I am scraping info from a grocery store in Australia. I'm located in the state of Victoria, and when I go to the website the price of a Red Bull is $10.50, but as soon as I run my script I get $11.25.
I am guessing it might have to do with geolocation... but I'm not sure.
I basically need some pointers on where to look to get the same price I see when I visit the website.
Also, I noticed that when I go to the same website from my phone it gives me the price of $11.25, but if I use the store's app I get the correct price of $10.50.
import json
import scrapy

class SpidervenderSpider(scrapy.Spider):
    name = 'spidervender'
    allowed_domains = ['woolworths.com.au']
    start_urls = ['https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink']

    def parse(self, response):
        product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
        yield {
            'title': product_schema['name'],
            'price': product_schema['offers']['price'],
        }
So the code works perfectly but the price is (I presume) for a different part of Australia.
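Location-based pricing like this is usually tied to a store selected in your browser session, typically via cookies. A hedged sketch of how you might replay your browser's session in Scrapy (the cookie name and value below are hypothetical; inspect your browser's dev tools while the site shows the Victorian price and copy the real pair):
import json
import scrapy

class SpidervenderSpider(scrapy.Spider):
    name = 'spidervender'
    start_urls = ['https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink']

    def start_requests(self):
        # 'w-store' is a placeholder cookie name, not a documented one.
        cookies = {'w-store': 'VIC-store-id'}
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
        yield {'title': product_schema['name'], 'price': product_schema['offers']['price']}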

USA 'aka Title' returned as default?

Trying to get the title for "Blood Oath" from 1990, https://www.imdb.com/title/tt0100414/ . In this example I am using Jupyter, but it works the same in my .py program:
movie = ia.get_movie('0100414')
movie
<Movie id:0100414[http] title:_Prisoners of the Sun (1990)_>
Am I doing something wrong? This seems to be the 'USA aka' title. I do know how to get the AKA titles back via the API, but am just puzzled as to why it's returning this one. On the IMDb web page, "Blood Oath" is listed under the AKA section as the "(original title)". Thank you.
What you are doing is correct.
IMDbPY takes the movie title from the value of a meta tag whose property is set to "og:title". So what's considered the title of a movie depends on decisions made by IMDb.
You can also use the "original title" key, which is taken from what is actually shown to the reader of the web page. This, however, is even more subject to change, since it's usually shown in the language guessed by the IMDb web servers from the language set by a registered user, the settings of your browser, or geolocation of the IP.
So, for example, for that title I get "Blood Oath" via the browser, since my browser is set to English, and "Giuramento di sangue (1990)" if I access movie['original title'] (geolocation of my IP, I guess).
To conclude, if you really need another title, you can get the whole list this way:
ia.update(movie, 'release info')
print(movie.get('akas from release info'))
You will get a list that you can parse, looking for a string ending in '(original title)'.
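For instance, a minimal sketch of that parse (it assumes the akas come back as plain strings carrying the '(original title)' suffix, as described above):
from imdb import IMDb

ia = IMDb()
movie = ia.get_movie('0100414')
ia.update(movie, 'release info')

# Keep only the entries flagged by IMDb as the original title.
akas = movie.get('akas from release info') or []
original = [aka for aka in akas if aka.strip().endswith('(original title)')]
print(original)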
(disclaimer: I'm one of the main authors of IMDbPY)

Can't scrape site of Central Bank BR (in R)

I already checked the copyright terms of the Brazilian Central Bank, from now on "BR Central Bank" (link here), and:
The total or partial reproduction of the content of this site is allowed, preserving the integrity of the information and citing the source. It is also authorized to insert links on other websites to the Central Bank of Brazil (BCB) website. However, the BCB reserves the right to change the provision of information on the site as necessary without notice.
Thus, I'm trying to scrape this website: https://www.bcb.gov.br/estabilidadefinanceira/leiautedoc2061e2071/atuais , but I can't understand why I'm not able to. Below you'll find what I'm doing. The HTML, when saved, is empty. What am I doing wrong? Can anybody help me, please? After this step I'll read the HTML code and look for new additions since the last database update.
url_bacen <- "https://www.bcb.gov.br/estabilidadefinanceira/leiautedoc2061e2071/atuais"
file_bacen_2061 <- paste("Y:/Dir_Path/", "BACEN_2061.html", sep="")
download.file(url_bacen, file_bacen_2061, method="auto", quiet=FALSE, mode="wb")
Thanks for any help,
Felipe
Data is dynamically pulled from an API call; you can find it in the network tab (press F5 to refresh the page). That is, the landing page makes an additional XHR request for the info, which you are not capturing. If you mimic this request, it returns JSON you can parse for whatever info you want:
library(jsonlite)
data <- jsonlite::read_json('https://www.bcb.gov.br/api/servico/sitebcb/leiautes2061')
print(data$conteudo)
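For readers working outside R, the same request is easy to reproduce; a hedged sketch in Python (the endpoint and the conteudo field are exactly those used above, but the API is undocumented and may change):
import requests

# Same undocumented BCB endpoint the R answer hits; response shape may change.
url = 'https://www.bcb.gov.br/api/servico/sitebcb/leiautes2061'
data = requests.get(url, timeout=30).json()
print(data['conteudo'])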

Google Maps API: Bring up address selections as you type

I'm looking to create a web application that starts to suggest home addresses as you type. For instance, imagine a pizza delivery company, where you start typing in your address, "1279", and beneath the box it brings up 1279's in the US for people to choose from, like:
1279 Main Street, St. Louis, MO
1279 Tree Street, Baltimore, MD
In this way, it would really mirror maps.google.com in bringing up suggestions as you type.
I've looked through the Google Places and Maps APIs without much success. The Geocoding API works OK when passing an address parameter through, but it often returns no results or really bad ones... nothing like maps.google.com. Plus the results are difficult to parse. (The address-parts parameters aren't always consistent, meaning I have to send the formatted address through another parser... not a deal-breaker, though.)
Anyone else have any suggestions out there? Thanks! Jeremy
You can improve the Places Autocomplete results by passing a bounds option when creating it. This example binds it to the map viewport:
autocomplete.bindTo('bounds', map);
In this demo I hardcoded the continental US bounds (plus some of Mexico and Canada):
var input = document.getElementById('searchTextField');
var autocomplete = new google.maps.places.Autocomplete(input, {
  bounds: new google.maps.LatLngBounds(
      new google.maps.LatLng(23.730197707069532, -126.14240169525146),
      new google.maps.LatLng(50.1805258484942, -65.32208919525146))
});
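If you need the same suggestions server-side rather than in the browser widget, the Places Autocomplete web service accepts a similar location bias. A minimal sketch in Python (the API key is a placeholder, and the exact parameter support should be checked against the current Places API docs):
import requests

API_KEY = 'YOUR_API_KEY'  # placeholder; use your own Places API key

params = {
    'input': '1279',
    'types': 'address',
    'components': 'country:us',  # restrict results to the US
    'location': '37.09,-95.71',  # rough center of the continental US
    'radius': 2500000,           # metres; a large radius gives a loose bias
    'key': API_KEY,
}
resp = requests.get(
    'https://maps.googleapis.com/maps/api/place/autocomplete/json',
    params=params, timeout=30)
for prediction in resp.json().get('predictions', []):
    print(prediction['description'])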
