Scrape relevant information from soup file - web-scraping

I am trying to scrape a URL to obtain the address and branch_name for all the branches.
URL="https://www.uob.co.id/personal/branch-and-atm-locator.page"
From the Network tab, I found the requested URL as:
URL="https://www.uob.co.id/wsm/stayinformed.do?path=lokasicabangatm"
but the format in which the data is presented there isn't clear.
import requests
from bs4 import BeautifulSoup
r = requests.get(URL)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)
How can I extract the relevant information?

Just dump the response to a .csv file and there you have all the data.
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
}

response = requests.get("https://www.uob.co.id/wsm/stayinformed.do?path=lokasicabangatm", headers=headers).text

with open("data.csv", "w") as f:
    f.write(response)
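If you only want the branch_name and address fields rather than the raw dump, you can parse the saved response afterwards. The delimiter and column names below are assumptions (the endpoint's format isn't shown here), so inspect data.csv first and adjust them:
import csv

# A minimal parsing sketch. The "branch_name" and "address" column names are
# placeholders -- replace them with whatever headers data.csv actually contains.
with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row.get("branch_name"), "-", row.get("address"))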

Related

How to get around 403 Forbidden error with R download.file

I am trying to download multiple files from a website. I'm scraping the website to come up with the individual URLs, but when I put the URLs into download.file, I'm getting a 403 Forbidden error.
It looks like there's a user-validation step on the website, but adding a header doesn't help.
Any help getting around this is appreciated. Here's what I'm trying with one sample URL and file:
headers = c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)

download.file("https://gibsons.civicweb.net/filepro/document/125764/Regular%20Council%20-%2006%20Dec%202022%20-%20Minutes%20-%20Pdf.pdf",
              "file",
              mode = "wb",
              headers = headers)
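For comparison, here is the same browser user-agent idea sketched in Python with requests (the answers further down this page take the same approach). Whether a user-agent alone is enough depends on the site's checks, and the output filename is just illustrative:
import requests

url = ("https://gibsons.civicweb.net/filepro/document/125764/"
       "Regular%20Council%20-%2006%20Dec%202022%20-%20Minutes%20-%20Pdf.pdf")
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36"}

# Download the PDF with a browser user-agent and write it in binary mode,
# the equivalent of mode = "wb" in download.file()
response = requests.get(url, headers=headers)
response.raise_for_status()
with open("minutes.pdf", "wb") as f:
    f.write(response.content)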

R programming download.file() returning 403 Forbidden error

I had been scraping a webpage without issues, but it is now returning a 403 Forbidden error. When I visit the site manually through a browser I have no problems; however, when I scrape the page I now get the error.
Code is:
library(rvest)

url <- 'https://www.punters.com.au/form-guide/'
download.file(url, destfile = "webpage.html", quiet = TRUE)
html <- read_html("webpage.html")
Error is:
Error in download.file(url, destfile = "webpage.html", quiet = TRUE) :
cannot open URL 'https://www.punters.com.au/form-guide/'
In addition: Warning message:
In download.file(url, destfile = "webpage.html", quiet = TRUE) :
cannot open URL 'https://www.punters.com.au/form-guide/': HTTP status was '403 Forbidden'
I've looked at the documentation and tried to find an answer online, but had no luck so far. Any suggestions on how I can circumvent this?
It looks like they added user-agent validation. You need to add a user-agent header and it works.
If you do not send the user-agent of some browser, the site assumes you are a bot and blocks you. Here is some Python code:
from bs4 import BeautifulSoup
import requests
baseurl = "https://www.punters.com.au/form-guide/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(baseurl, headers=headers).content
soup = BeautifulSoup(page, 'html.parser')
title = soup.find("div", class_="short_title")
print("Title: " +title.text)
Request in R with user-agent:
require(httr)

headers = c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)

res <- httr::GET(url = 'https://www.punters.com.au/form-guide/', httr::add_headers(.headers = headers))
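If you want to keep the original two-step workflow from the question (save the page to webpage.html, then parse it), the same user-agent fix carries over. Here is a minimal Python sketch of that flow; the parsing step at the end is only illustrative:
import requests
from bs4 import BeautifulSoup

url = "https://www.punters.com.au/form-guide/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}

# Download with a browser user-agent and save to disk, mirroring download.file()
response = requests.get(url, headers=headers)
response.raise_for_status()
with open("webpage.html", "w", encoding="utf-8") as f:
    f.write(response.text)

# Parse the saved file, mirroring read_html()
with open("webpage.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
print(soup.title.text if soup.title else "no <title> found")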

Different API response for axios and curl requests

Instagram has an endpoint that gives you JSON in the response.
When I call it using curl, I get JSON back:
curl -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" https://www.instagram.com/microsoft/\?__a\=1
However, if I use other HTTP clients such as Axios for Node.js, I get an HTML page instead. With some other HTTP clients I also occasionally get a 302 redirect.
const axios = require('axios')

axios.request({
  url: 'https://www.instagram.com/microsoft/?__a=1',
  headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
})
Is there a way to get around this with axios or other HTTP clients so that they follow the curl behaviour for HTTP requests?
What I usually do in these cases is use Postman to convert one type of request, such as a curl-based request, into another type, for example axios, fetch, etc.
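As a rough illustration of what such a conversion looks like, here is the same curl request reproduced in Python with requests. Whether Instagram still returns JSON for this endpoint depends on its current behaviour; unauthenticated clients are often redirected to the login page:
import requests

# Reproduce the curl call: same URL and the same User-Agent header
url = "https://www.instagram.com/microsoft/?__a=1"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}

response = requests.get(url, headers=headers, allow_redirects=False)
print(response.status_code)                  # 200 with JSON, or 302 if redirected to login
print(response.headers.get("Content-Type"))  # shows whether JSON or HTML came back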

Instagram blocks me for the requests with 429

I have sent a lot of requests to https://www.instagram.com/{username}/?__a=1 to check whether a username exists, and now I am getting 429.
Before, I just had to wait a few minutes for the 429 to disappear. Now it is persistent! I'm trying once a day and it no longer works.
Do you know anything about Instagram's request limits?
Do you have any workaround? Thanks
Code:
import requests
r = requests.get('https://www.instagram.com/test123/?__a=1')
res = str(r.status_code)
Try adding a user-agent header; otherwise, the website thinks that you're a bot and will block you.
import requests
URL = "https://www.instagram.com/bla/?__a=1"
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
response = requests.get(URL, headers=HEADERS)
print(response.status_code) # <- Output: 200
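If the 429 persists even with a user-agent, you are most likely still hitting Instagram's rate limits, which are not publicly documented. The only reliable workaround is to slow down and back off when a 429 arrives; here is a minimal sketch of that idea, with illustrative delay values:
import time
import requests

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

def get_with_backoff(url, max_retries=5, base_delay=60):
    """Retry on 429, waiting longer each time. The delays are illustrative, not documented Instagram limits."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 429:
            return response
        # Honour Retry-After if the server sends it, otherwise back off exponentially
        wait = int(response.headers.get("Retry-After", base_delay * (2 ** attempt)))
        time.sleep(wait)
    return response

print(get_with_backoff("https://www.instagram.com/test123/?__a=1").status_code)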

Why is nothing getting parsed in my web scraping program?

I made this code to get the top links from a Google search, but it's returning None.
import webbrowser, requests
from bs4 import BeautifulSoup
string = 'selena+gomez'
website = f'http://google.com/search?q={string}'
req_web = requests.get(website).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)
Google requires that you specify a User-Agent HTTP header to return the correct page. Without the correct User-Agent specified, Google returns a page that doesn't contain <div> tags with the r class. You can see this when you do print(soup) with and without the User-Agent.
For example:
import requests
from bs4 import BeautifulSoup
string = 'selena+gomez'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
website = f'http://google.com/search?hl=en&q={string}'
req_web = requests.get(website, headers=headers).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)
Prints:
https://www.instagram.com/selenagomez/?hl=en
The answer from Andrej Kesely will throw an error, since this CSS class no longer exists:
gotolink = parser.find('div', class_='r').a["href"]
AttributeError: 'NoneType' object has no attribute 'a'
Learn more about user-agent and request headers.
Basically, the user-agent identifies the browser, its version number, and its host operating system. It represents a person (a browser) in a web context and lets servers and network peers determine whether a request is coming from a bot.
In this case, you need to send a fake user-agent so Google treats your request as a visit from a "real" user, which is also known as user-agent spoofing.
Pass user-agent in request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get(YOUR_URL, headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "selena gomez"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# select the first organic result's link
link = soup.select_one('.yuRUbf a')['href']
print(link)

# https://www.instagram.com/selenagomez/
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
Essentially, the main difference in your case is that you don't need to think about how to bypass Google's blocks if they appear, or figure out how to scrape elements that are harder to extract, since that's already done for the end user. The only thing left to do is get the data you want from the returned JSON.
Example code:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "selena gomez",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] is the index of the first organic result
link = results['organic_results'][0]['link']
print(link)

# https://www.instagram.com/selenagomez/
Disclaimer: I work for SerpApi.
