Can't scrape site of Central Bank BR (in R)

I already checked the copyright terms of the Brazilian Central Bank, from now on: "BR Central Bank", (link here), which say:
The total or partial reproduction of the content of this site is allowed, preserving the integrity of the information and citing the source. It is also authorized to insert links on other websites to the Central Bank of Brazil (BCB) website. However, the BCB reserves the right to change the provision of information on the site as necessary without notice.
Thus, I'm trying to scrape this website: https://www.bcb.gov.br/estabilidadefinanceira/leiautedoc2061e2071/atuais , but I can't understand why I'm not able to do it. Below you'll find what I'm doing. The HTML file, when saved, is empty. What am I doing wrong? Can anybody help me, please? After this step I'll read the HTML code and look for new additions since the last database update.
url_bacen <- "https://www.bcb.gov.br/estabilidadefinanceira/leiautedoc2061e2071/atuais"
file_bacen_2061 <- paste("Y:/Dir_Path/", "BACEN_2061.html", sep="")
download.file(url_bacen,file_bacen_2061, method="auto",quiet= FALSE, mode="wb")
Thanks for any help,
Felipe

The data is pulled in dynamically via an API call, which you can find in the network tab when pressing F5 to refresh the page; i.e. the landing page makes an additional XHR request for the info that you are not capturing. If you mimic this request, it returns JSON you can parse for whatever info you want:
library(jsonlite)
data <- jsonlite::read_json('https://www.bcb.gov.br/api/servico/sitebcb/leiautes2061')
print(data$conteudo)

Related

How can I get a tweet having only the tweet ID, without using the Twitter API?

I have a large number of tweet IDs that have been collected by other people (https://github.com/echen102/us-pres-elections-2020), and I now want to get these tweets from those IDs. How can I do that without the Twitter API?
Do you want the URL? It is: https://twitter.com/user/status/<tweet_id>
If you want the text of the tweet without using the API, you have to render the page and then scrape it.
You can do it with one module, requests-html:
from requests_html import HTMLSession
session = HTMLSession()
url = "https://twitter.com/user/status/1414963866304458758"
r = session.get(url)
r.html.render(sleep=2)
tweet_text = r.html.find('.css-1dbjc4n.r-1s2bzr4', first=True)
print(tweet_text.text)
Output:
Here’s a serious national security question: Why does the Biden administration want to protect COMMUNISM from blame for the Cuban Uprising? They attribute it to vaccines. Even if the Big Guy can’t comprehend it, Hunter could draw a picture for him.
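Since the question mentions a large number of tweet IDs, the same approach can be wrapped in a loop. A minimal sketch, assuming your IDs are in a plain Python list and that the CSS selector above still matches the page layout:
from requests_html import HTMLSession

session = HTMLSession()
tweet_ids = ['1414963866304458758']  # hypothetical: load your collected IDs here

for tweet_id in tweet_ids:
    r = session.get(f'https://twitter.com/user/status/{tweet_id}')
    r.html.render(sleep=2)  # let the JavaScript build the page first
    tweet_text = r.html.find('.css-1dbjc4n.r-1s2bzr4', first=True)
    print(tweet_id, tweet_text.text if tweet_text else 'not found')
Keep in mind that rendering each page takes a few seconds, so a large ID list will be slow.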

Skyscanner API does not return data for flight search

I am playing around with the Skyscanner API from their webpage in Postman and testing the endpoint for browsing flights. This is what the API says on their page:
And this is what I am trying: browsing for flights from Stockholm Arlanda Airport (ARN-sky) to Heathrow (LHR-sky), on the 22nd of July (around 4 days from now) for the first leg, and the 25th for the return, but as you can see, I am not getting any results. The URL I am trying is this.
Any idea what I am doing wrong, and how to fix it?
Please mind that the image you present refers to the Browse Quotes endpoint, but you are trying to consume the Browse Routes endpoint.
Assuming that you actually want to browse routes, I believe the problem may be this:
The endpoint is of the form:
GET /browseroutes/v1.0/{country}/{currency}/{locale}/{originPlace}/{destinationPlace}/{outboundPartialDate}/{inboundPartialDate}
You are writing a URL like:
.../browseroutes/v1.0/FR/eur/en-US/us/ARN-sky/LHR-sky/2021-07-22/2021-07-25?apikey=<api-key>
So it seems that you are actually specifying:
originPlace = us
destinationPlace = ARN-sky
But I think you wanted to define:
originPlace = ARN-sky
destinationPlace = LHR-sky
To solve this, you may remove the /us member, thus writing: http://partners.api.skyscanner.net/apiservices/browseroutes/v1.0/FR/eur/en-US/ARN-sky/LHR-sky/2021-07-22/2021-07-25?apikey=api-key
(Please replace the api-key value with an actual API key.)
This URL already returns a valid 200 OK result :)
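For completeness, a minimal sketch of the corrected request with Python's requests library (the apikey value is a placeholder you must replace with your own key):
import requests

API_KEY = 'api-key'  # placeholder: substitute your actual Skyscanner API key

url = ('http://partners.api.skyscanner.net/apiservices/browseroutes/v1.0'
       '/FR/eur/en-US/ARN-sky/LHR-sky/2021-07-22/2021-07-25')
resp = requests.get(url, params={'apikey': API_KEY})
resp.raise_for_status()  # fail loudly on anything other than 200 OK
print(resp.json())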

How to figure out where the raw data in a table comes from?

https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date in HISTORIC PRICES, then click 'GO'. The table is updated, but I don't see any relevant HTTP requests sent in devtools.
So this means that the data has already been downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract the raw data from the table? (Note that I don't want to use methods like Selenium; I want to stay with raw HTTP requests to get the raw data.)
EDIT: websockets are mentioned in the comments, but I can't see one in Developer Tools. I am adding the websocket tag anyway, in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract JavaScript-rendered content without Selenium. You can always make use of a headless browser (you don't see any instance on your screen; the only pitfall is that you have to wait until the page fully loads) and it won't bother you anymore.
In other words, all the other scraping libs are based on URLs and forms. Scrapy can post forms but not run JavaScript.
Selenium will save the day; all you lose is a couple of seconds for each attempt (it would be milliseconds if run in the frontend). You can grab the page source with driver.page_source and it can be used directly for parsing (as HTML text) with BeautifulSoup or whatever.
You can do it with requests-html; for example, let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (here 7 seconds, but maybe less), but if you want to make multiple requests, you can do them asynchronously!
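A minimal sketch of the asynchronous variant, using requests-html's AsyncHTMLSession (the URL list is a placeholder; add the quote pages you need):
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()
urls = ['https://www.nyse.com/quote/XNYS:A']  # placeholder: add more quote pages

async def fetch(url):
    r = await asession.get(url)
    await r.html.arender(sleep=7)  # async counterpart of render()
    return r.html.find('.flex_tr', first=True)

# run() takes callables, schedules them concurrently, and gathers the results
rows = asession.run(*[lambda u=url: fetch(u) for url in urls])
for row in rows:
    print(row.text if row else 'table not found')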

How to create a stock availability checker with python requests if JavaScript is used?

I wrote some code which should check whether a product is back in stock and, when it is, send me an email to notify me. This works when the things I'm looking for are in the HTML.
However, sometimes certain elements are loaded through JavaScript. How could I edit my code so that the web scraping also works with JavaScript?
This is my code thus far:
import time
import requests

while True:
    # Get the url of the IKEA page
    url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
    # Get the text from that page and put everything in lower cases
    productpage = requests.get(url).text.lower()
    # Set the strings that should be on the page if the product is not available
    outofstockstrings = ['niet beschikbaar voor levering', 'alleen beschikbaar in de winkel']
    # Check whether the strings are in the text of the webpage
    if any(x in productpage for x in outofstockstrings):
        time.sleep(1800)
        continue
    else:
        # send me an email and break the loop
        pass
Instead of scraping and analyzing the HTML, you could use the unofficial stock API that the IKEA website itself uses. That API returns JSON data, which is much easier to analyze, and you'll also get estimates of when the product will be back in stock.
There is even a project written in JavaScript / Node which provides this kind of information straight from the command line: https://github.com/Ephigenia/ikea-availability-checker
You can easily check the stock amount of the chair in all stores in the Netherlands:
npx ikea-availability-checker stock --country nl 20336841
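If you would rather keep the string-matching loop from the question, another option is to render the JavaScript before checking the text, e.g. with requests-html as in other answers here. A minimal sketch of that adaptation (the sleep values are guesses):
import time
from requests_html import HTMLSession

url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
outofstockstrings = ['niet beschikbaar voor levering',
                     'alleen beschikbaar in de winkel']

session = HTMLSession()
while True:
    r = session.get(url)
    r.html.render(sleep=5)  # execute the page's JavaScript first
    productpage = r.html.html.lower()  # the rendered HTML, lower-cased as before
    if any(x in productpage for x in outofstockstrings):
        time.sleep(1800)  # still out of stock: check again in half an hour
        continue
    # in stock: send the notification email here, then stop
    break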

USA 'aka Title' returned as default?

Trying to get the title for "Blood Oath" from 1990 (https://www.imdb.com/title/tt0100414/). In this example I am using Jupyter, but it works the same in my .py program:
from imdb import IMDb
ia = IMDb()
movie = ia.get_movie('0100414')
movie
<Movie id:0100414[http] title:_Prisoners of the Sun (1990)_>
Am I doing something wrong? This seems to be the 'USA aka' title. I do know how to get the AKA titles back via the API, but I am just puzzled as to why it's returning this one. On the IMDb web page, "Blood Oath" is listed, under the AKA section, as the "(original title)". Thank you.
What you do is correct.
IMDbPY takes the movie title from the value of a meta tag with the property set to "og:title". So, what's considered the title of a movie depends on the decisions made by IMDb.
You can also use the "original title" key, which is taken from what is actually shown to the reader of the web page. This, however, is even more subject to change, since it's usually shown in the language guessed by the IMDb web servers using the language set by a registered user, the settings of your browser, or geolocation of the IP.
So, for example, for that title I get "Blood Oath" via the browser, since my browser is set to English, and "Giuramento di sangue (1990)" if I access movie['original title'] (geolocation of my IP, I guess).
To conclude, if you really need another title, you may get the whole list this way:
ia.update(movie, 'release info')
print(movie.get('akas from release info'))
You will get a list that you can parse, looking for a string ending in '(original title)'.
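For example, a minimal sketch of that parsing step (assuming the list entries are plain strings, as 'akas from release info' returns):
akas = movie.get('akas from release info', [])
# pick the entry that IMDb flags as the original title, if present
original = next((aka for aka in akas if aka.strip().endswith('(original title)')), None)
print(original)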
(disclaimer: I'm one of the main authors of IMDbPY)
