Newbie scraping issue: FUTBIN web scraping - web-scraping

I'm new to web scraping and I was trying to scrape the FUTBIN (FUT 22) player database at "https://www.futbin.com/players". My code is below, and I don't know why it can't get any results from the FUTBIN page, even though the same approach worked on other pages such as IMDb.
Code:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://www.futbin.com/players")
src = request.content
soup = BeautifulSoup(src, features="html.parser")
results = soup.find("a", class_="player_name_players_table get-tp")
print(results)
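
No answer is recorded for this question here, but a first diagnostic step (my suggestion, not from the thread) is to check the HTTP status code and send a browser-like User-Agent, since many sites respond to the default requests client with a block page:

import requests
from bs4 import BeautifulSoup

# Sketch: verify what the server actually returned before parsing.
# The User-Agent value is an illustrative browser string, not a guaranteed fix.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
request = requests.get("https://www.futbin.com/players", headers=headers)
print(request.status_code)  # anything other than 200 explains the empty result

soup = BeautifulSoup(request.content, features="html.parser")
results = soup.find("a", class_="player_name_players_table get-tp")
print(results)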

Related

Unable to scrape Craigslist with BeautifulSoup

I am just learning and new to scraping.
Yesterday I was able to scrape Craigslist with BeautifulSoup. Today I am unable to.
Here is my code to scrape the first page of rental-housing search results on CL.
from requests import get
from bs4 import BeautifulSoup
#get the first page of the San Diego housing prices
url = 'https://sandiego.craigslist.org/search/apa?hasPic=1&availabilityMode=0&sale_date=all+dates'
response = get(url)  # the link excludes posts with no pictures
html_soup = BeautifulSoup(response.text, 'html.parser')
#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_="result-row")
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 (elements/page)
The html_soup is not the same as what the actual URL shows in a browser. It actually has the following in there:
<script>
window.cl.specialCurtainMessages = {
    unsupportedBrowser: [
        "We've detected you are using a browser that is missing critical features.",
        "Please visit craigslist from a modern browser."
    ],
    unrecoverableError: [
        "There was an error loading the page."
    ]
};
</script>
Any help would be much appreciated.
I am not sure if I've been 'blocked' somehow from scraping. I read an article about proxies and rotating IP addresses, but I do not want to break the rules if I have been blocked, and I also do not want to spend money on this. Is scraping Craigslist not allowed? I have seen so many educational tutorials on it, so I thought it was okay.
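Craigslist now renders its search results client side; the listing data comes from a JSON endpoint at sapi.craigslist.org, which the snippet below queries directly: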
import requests
from pprint import pp

def main(url):
    with requests.Session() as req:
        # Query parameters copied from the request the site itself makes
        params = {
            "availabilityMode": "0",
            "batch": "8-0-360-0-0",
            "cc": "US",
            "hasPic": "1",
            "lang": "en",
            "sale_date": "all dates",
            "searchPath": "apa"
        }
        r = req.get(url, params=params)
        # Print just the first listing to inspect its structure
        for i in r.json()['data']['items']:
            pp(i)
            break

main('https://sapi.craigslist.org/web/v7/postings/search/full')

Unable to scrape a table

I'm attempting to scrape the data from a table on the following website: https://droughtmonitor.unl.edu/DmData/DataTables.aspx
import requests
from bs4 import BeautifulSoup
url = 'https://droughtmonitor.unl.edu/DmData/DataTables.aspx'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
drought_table = soup.find('table', {'id':'datatabl'}).find('tbody').find_all('tr')
For some reason I am getting no output. I've tried to use pandas for the same job:
import pandas as pd
url = 'https://droughtmonitor.unl.edu/DmData/DataTables.aspx'
table = pd.read_html(url)
df = table[0]
That also ended up returning an empty DataFrame.
What could be causing this?
Checking the browser's network tool makes it obvious that the site uses Fetch/XHR to load the table in a separate request.
[Image: network monitor]
You can use this code to get the table data:
import requests
import json

headers = {
    'Content-Type': 'application/json; charset=utf-8',
}
params = (
    ('area', "'conus'"),
    ('statstype', "'1'"),
)
response = requests.get(
    'https://droughtmonitor.unl.edu/DmData/DataTables.aspx/ReturnTabularDMAreaPercent_national',
    headers=headers, params=params
)
table = json.loads(response.content)
# Code generated by https://curlconverter.com/
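If you want the result as a DataFrame, as the pandas attempt above intended, you can feed the parsed JSON straight in. A sketch; the 'd' wrapper is an assumption based on the usual ASP.NET web-method response shape:

import json
import pandas as pd

# Sketch: ASP.NET web methods conventionally wrap their payload in a 'd' key,
# sometimes as a JSON string that needs a second parse (both are assumptions).
rows = table.get('d', table)
if isinstance(rows, str):
    rows = json.loads(rows)
df = pd.DataFrame(rows)
print(df.head())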

Web scraping returns empty list

I am trying to scrape the info below from https://www.dsmart.com.tr/yayin-akisi. However, the code below returns an empty list. Any idea?
<div class="col"><div class="title fS24 paBo30">NELER OLUYOR HAYATTA</div><div class="channel orangeText paBo30 fS14"><b>24 | 34. KANAL | 16 Nisan Perşembe | 6:0 - 7:0</b></div><div class="content paBo30 fS14">Billur Aktürk’ün sunduğu, yaşam değerlerini sorgulayan program Neler Oluyor Hayatta, toplumsal gerçekliğin bilgisine ulaşma noktasında sınırları zorluyor. </div><div class="subTitle paBo30 fS12">Billur Aktürk’ün sunduğu, yaşam değerlerini sorgulayan program Neler Oluyor Hayatta, toplumsal gerçekliğin bilgisine ulaşma noktasında sınırları zorluyor. </div></div>
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.dsmart.com.tr/yayin-akisi"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "lxml")
for link in page_soup.find_all("div", {"class": "col"}):
    print(link)
This page is rendered in the browser. The HTML you're downloading contains only links to JS files, which render the page content later.
You can use a real browser to render the page (Selenium, Splash, or similar technologies), or work out how the page receives the data you need.
Long story short, the data rendered on this page is requested from https://www.dsmart.com.tr/api/v1/public/epg/schedules?page=1&limit=10&day=2020-04-16
It is well-formatted JSON, so it is very easy to parse. My recommendation is to download it with the requests module, which can return the JSON response as a dict.
This website is populated by GET calls to its API. You can see the GET calls in your browser's (Chrome/Firefox) devtools Network tab. If you check, you will see that the page is calling the API.
import requests
URL = 'https://www.dsmart.com.tr/api/v1/public/epg/schedules'
# Parameters that you can tweak or add in a loop,
# e.g. `for page in range(1, 10):` to fetch multiple pages
params = dict(page=1, limit=10, day='2020-04-16')
r = requests.get(URL, params=params)
assert r.ok, 'issues getting data'
data = r.json()
# data is a dictionary; grab what you need out of it by key
print(data)
In cases like this, using BeautifulSoup is unwarranted.
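As a follow-up sketch, the page/limit parameters make pagination straightforward (the inner structure of the JSON is not shown in the thread, so it is left unparsed here):

import requests

URL = 'https://www.dsmart.com.tr/api/v1/public/epg/schedules'
all_pages = []
for page in range(1, 5):  # fetch the first few pages
    r = requests.get(URL, params=dict(page=page, limit=10, day='2020-04-16'))
    r.raise_for_status()
    all_pages.append(r.json())
print(len(all_pages), 'pages fetched')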

Web Scraping Yahoo Finance Recommendation Rating

I am trying to web scrape Yahoo Finance's recommendation rating using BeautifulSoup, but it keeps returning None.
E.g. the recommendation rating for AAPL is '2':
https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL
Please advise. Thank you!
Below is the code:
from requests import get
from bs4 import BeautifulSoup

tickers = ['AAPL']
for ticker in tickers:
    url = 'https://sg.finance.yahoo.com/quote/%s/profile?p=%s' % (ticker, ticker)
    print(url)
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # yf_rec refers to the Yahoo Finance recommendation
    yf_rec = None  # default if the rating element is not found
    try:
        yf_rec = html_soup.find('div', attrs={'class': 'B(8px) Pos(a) C(white) Py(2px) Px(0) Ta(c) Bdrs(3px) Trstf(eio) Trsde(0.5) Arrow South Bdtc(i)::a Fw(b) Bgc($buy) Bdtc($buy)'}).text.strip()
    except AttributeError:
        pass
    print(yf_rec)
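
No answer was posted in this excerpt. Consistent with the other answers here, one option (an assumption on my part: Yahoo's unofficial quoteSummary endpoint, which the page itself calls, and which may change or start requiring cookies) is to skip the obfuscated CSS classes and read the JSON directly:

import requests

# Sketch using Yahoo's unofficial quoteSummary endpoint; the endpoint,
# module name, and field path are unofficial and subject to change.
ticker = 'AAPL'
url = 'https://query1.finance.yahoo.com/v10/finance/quoteSummary/%s' % ticker
params = {'modules': 'financialData'}
headers = {'User-Agent': 'Mozilla/5.0'}  # Yahoo rejects the default requests UA
r = requests.get(url, params=params, headers=headers)
data = r.json()
rating = data['quoteSummary']['result'][0]['financialData']['recommendationMean']['raw']
print(ticker, rating)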

Download entire web pages and save them as html file with urllib.request

I can save multiple web pages using the code below; however, I can't get a proper view of the site after saving the pages as HTML. For example, the text in tables is misplaced and images can't be seen.
I need to download entire pages just as a browser's "Save As" does, so that I get a proper view.
import urllib.request

url = 'https://asd.com/asdID='
for i in range(1, 5):
    print(' --> ID:', i)
    newurl = url + str(i)
    page = urllib.request.urlopen(newurl)
    pagetext = page.read().decode('utf-8')  # decode the raw bytes to text
    with open(str(i) + '.html', 'w', encoding='utf-8') as f:
        f.write(pagetext)
You can use Selenium instead to download the full website nicely.
Just run the following code:
from selenium import webdriver

# Download the chromedriver from the link below and point to its path:
# https://chromedriver.storage.googleapis.com/index.html?path=2.40/
chromedriver = 'C:/python36/chromedriver.exe'
url = 'https://asd.com/asdID='
browser = webdriver.Chrome(chromedriver)  # one browser instance is enough
for i in range(1, 5):
    browser.get(url + str(i))
    data = browser.page_source
    with open("webpage%s.html" % (str(i)), "w+") as f:
        f.write(data)
browser.quit()
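
If a browser window popping up for every page is unwanted, Chrome can run headless. A sketch; it assumes a Chrome/chromedriver pairing recent enough to support the flag, and the Selenium 3-style instantiation used above:

from selenium import webdriver

# Sketch: render pages without opening a visible window.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(chromedriver, options=options)  # chromedriver path as above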
UPDATE
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import ahk

firefox = FirefoxBinary("C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe")
driver = webdriver.Firefox(firefox_binary=firefox)
driver.get("http://www.yahoo.com")

# Drive the browser's own "Save As" dialog via AutoHotkey
ahk.start()
ahk.ready()
ahk.execute("Send, ^s")
ahk.execute("WinWaitActive, Save As,,2")
ahk.execute("WinActivate, Save As")
ahk.execute("Send, C:\\path\\to\\file.htm")
ahk.execute("Send, {Enter}")
You will now get everything.
