I am trying to scrape the info below from https://www.dsmart.com.tr/yayin-akisi, but the code below returns an empty list. Any idea why?
<div class="col"><div class="title fS24 paBo30">NELER OLUYOR HAYATTA</div><div class="channel orangeText paBo30 fS14"><b>24 | 34. KANAL | 16 Nisan Perşembe | 6:0 - 7:0</b></div><div class="content paBo30 fS14">Billur Aktürk’ün sunduğu, yaşam değerlerini sorgulayan program Neler Oluyor Hayatta, toplumsal gerçekliğin bilgisine ulaşma noktasında sınırları zorluyor. </div><div class="subTitle paBo30 fS12">Billur Aktürk’ün sunduğu, yaşam değerlerini sorgulayan program Neler Oluyor Hayatta, toplumsal gerçekliğin bilgisine ulaşma noktasında sınırları zorluyor. </div></div>
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url="https://www.dsmart.com.tr/yayin-akisi"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "lxml")
for link in page_soup.find_all("div", {"class": "col"}):
    print(link)
This page is rendered in the browser: the HTML you are downloading only contains links to JS files, which later render the page content.
You can either use a real browser to render the page (Selenium, Splash or similar technologies) or work out where the page gets its data from.
Long story short, the data rendered on this page is requested from https://www.dsmart.com.tr/api/v1/public/epg/schedules?page=1&limit=10&day=2020-04-16
It is well-formatted JSON, so it is very easy to parse. My recommendation is to download it with the requests module - it can return the JSON response as a dict.
This website is populated by GET calls to their API. You can see these calls in your browser's (Chrome/Firefox) devtools, on the Network tab; if you check it, you will see the page calling the API.
import requests
URL = 'https://www.dsmart.com.tr/api/v1/public/epg/schedules'
# parameters that you can tweak, or vary in a loop
# (e.g. for page in range(1, 10): to fetch multiple pages)
params = dict(page=1, limit=10, day='2020-04-16')
r = requests.get(URL,params=params)
assert r.ok, 'issues getting data'
data = r.json()
# data is a dictionary, so you can pull values out of it by key
print(data)
In cases like this, using BeautifulSoup is unwarranted.
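If you want to pull specific fields (programme title, channel, start and end time) out of the JSON response, pretty-print it first to see the real structure; the drill-down keys sketched in the comments below are only guesses, not the actual schema:
import json
import requests

r = requests.get('https://www.dsmart.com.tr/api/v1/public/epg/schedules',
                 params=dict(page=1, limit=10, day='2020-04-16'))
r.raise_for_status()
data = r.json()

# Print a readable slice to discover the actual key names.
print(json.dumps(data, indent=2, ensure_ascii=False)[:1500])

# Hypothetical drill-down -- replace these guessed keys with what you actually see:
# for channel in data['data']['channels']:
#     for programme in channel.get('schedule', []):
#         print(programme.get('title'), programme.get('start_date'))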
I'm new to web scraping and I was trying to scrape the FUTBIN (FUT 22) player database at https://www.futbin.com/players. My code is below, and I don't know why I can't get any results from the FUTBIN page, even though the same approach worked on other pages such as IMDB.
Code:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://www.futbin.com/players")
src = request.content
soup = BeautifulSoup(src, features="html.parser")
results = soup.find("a", class_="player_name_players_table get-tp")
print(results)
I am following along with a web scraping example in Automate the Boring Stuff with Python, but my CSS selector is returning no results.
import bs4
import requests
import sys
import webbrowser
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.find_all(".r a")
numopen = min(5, len(linkelems))
for i in range(numopen):
    webbrowser.open('https://google.com' + linkelems[i].get('href'))
Has Google since modified how they store search links?
From inspecting the search page elements, I see no reason this selector would not work.
There are two problems:
1.) Instead of soup.find_all(".r a"), use soup.select(".r a"). Only the .select() method accepts CSS selectors.
2.) Google needs you to specify a User-Agent header to return the correct page.
import bs4
import sys
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]), headers=headers)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.select(".r a")
for a in linkelems:
    print(a.text)
Prints (for example):
Googling ...
Tree - Wikipediaen.wikipedia.org › wiki › Tree
... and so on.
A complementary answer to Andrej Kesely's answer.
If you don't want to deal with figuring out which selectors to use or how to bypass blocks from Google, you can try the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that bypassing blocks, data extraction, and more are already done for the end user. All that's left is to iterate over the structured JSON and pick out the data you want.
Example code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",               # search engine
    "q": "fus ro dah",                # query
    "api_key": os.getenv("API_KEY"),  # environment variable holding your API key
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    print(link)
------------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.nexusmods.com/skyrimspecialedition/mods/14094/
https://tenor.com/search/fus-ro-dah-gifs
'''
Disclaimer: I work for SerpApi.
I can save multiple web pages with the code below; however, I can't get a proper view of the website after saving the pages as HTML. For example, the text in tables is misaligned and the images can't be seen.
I need to download entire pages just as a browser's "Save as" does, so that I get a proper view.
import urllib.request
url= 'https://asd.com/asdID='
for i in range(1, 5):
    print(' --> ID:', i)
    newurl = url + str(i)
    f = open(str(i) + '.html', 'w')
    page = urllib.request.urlopen(newurl)
    pagetext = str(page.read())
    f.write(pagetext)
    f.close()
You can use Selenium instead to download the full page nicely. Just run the following code:
from selenium import webdriver

# Download the chromedriver from the link below and point this path at it:
# https://chromedriver.storage.googleapis.com/index.html?path=2.40/
chromedriver = 'C:/python36/chromedriver.exe'
url = 'https://asd.com/asdID='
for i in range(1, 5):
    browser = webdriver.Chrome(chromedriver)
    browser.get(url + str(i))
    data = browser.page_source  # fully rendered HTML, after JavaScript has run
    with open("webpage%s.html" % str(i), "w+") as f:
        f.write(data)
    browser.quit()  # close the browser before opening the next one
UPDATE
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import ahk

firefox = FirefoxBinary("C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe")
driver = webdriver.Firefox(firefox_binary=firefox)
driver.get("http://www.yahoo.com")
ahk.start()
ahk.ready()
ahk.execute("Send,^s")
ahk.execute("WinWaitActive, Save As,,2")
ahk.execute("WinActivate, Save As")
ahk.execute("Send, C:\\path\\to\\file.htm")
ahk.execute("Send, {Enter}")
You will now get everything
I've been trying to scrape some lists from http://www.golf.org.au, which is ASP.NET based. I did some research, and it appears that I must pass some values in a POST request to make the website fetch the data into the tables. I did that, but I'm still failing. Any idea what I'm missing?
Here is my code:
# -*- coding: utf-8 -*-
import scrapy

class GolfscraperSpider(scrapy.Spider):
    name = "golfscraper"
    allowed_domains = ["golf.org.au", "www.golf.org.au"]
    ids = ['3012801330', '3012801331', '3012801332', '3012801333']
    start_urls = []
    for id in ids:
        start_urls.append('http://www.golf.org.au/handicap/%s' % id)

    def parse(self, response):
        scrapy.FormRequest('http://www.golf.org.au/default.aspx?s=handicap',
            formdata={
                '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                'ctl11$ddlHistoryInMonths': '48',
                '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
                '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                'gaHandicap': '6.5',
                'golflink_No': '2012003003',
                '__VIEWSTATEGENERATOR': 'CA0B0334',
            },
            callback=self.parse_details)

    def parse_details(self, response):
        for name in response.css('div.rnd-course::text').extract():
            yield {'name': name}
Yes, ASP pages are tricky to scrape. Most probably some small parameter is missing.
Solution for this:
Instead of creating the request through scrapy.FormRequest(...), use the scrapy.FormRequest.from_response() method (see the code example below). This captures most or even all of the hidden form fields and uses them to prepopulate the FormRequest's formdata.
It also seems you forgot to return the request; that may be another problem.
As far as I recall, __VIEWSTATEGENERATOR also changes each time and has to be extracted from the page.
If this doesn't work, fire up Firefox with the Firebug plugin or Chrome's developer tools, perform the request in the browser, and then compare the full request headers and body against the same data in your request. There will be some difference.
Example code with all my suggestions:
def parse(self, response):
    req = scrapy.FormRequest.from_response(
        response,
        formdata={
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            'ctl11$ddlHistoryInMonths': '48',
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'gaHandicap': '6.5',
            'golflink_No': '2012003003',
            '__VIEWSTATEGENERATOR': 'CA0B0334',
        },
        callback=self.parse_details)
    self.logger.info(req.headers)
    self.logger.info(req.body)
    return req
I am using Scrapy to retrieve information about projects on https://www.indiegogo.com. I want to scrape all pages with the url format www.indiegogo.com/projects/[NameOfProject]. However, I am not sure how to reach all of those pages during a crawl. I can't find a master page that hardcodes links to all of the /projects/ pages. All projects seem to be accessible from https://www.indiegogo.com/explore (through visible links and the search function), but I cannot determine the set of links/search queries that would return all pages. My spider code is given below. These start_urls and rules scrape about 6000 pages, but I hear that there should be closer to 10x that many.
About the URLs with query parameters: the filter_quick values used come from the "Trending", "Final Countdown", "New This Week", and "Most Funded" links on the Explore page and obviously miss unpopular and poorly funded projects. There is no maximum value for the per_page URL parameter.
Any suggestions? Thanks!
class IndiegogoSpider(CrawlSpider):
    name = "indiegogo"
    allowed_domains = ["indiegogo.com"]
    start_urls = [
        "https://www.indiegogo.com/sitemap",
        "https://www.indiegogo.com/explore",
        "http://go.indiegogo.com/blog/category/campaigns-2",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=countdown&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=new&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=most_funded&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=popular_all&per_page=50000"
    ]

    rules = (
        Rule(LinkExtractor(allow=('/explore?'))),
        Rule(LinkExtractor(allow=('/campaigns-2/'))),
        Rule(LinkExtractor(allow=('/projects/')), callback='parse_item'),
    )

    def parse_item(self, response):
        [...]
Sidenote: there are other URL formats www.indiegogo.com/projects/[NameOfProject]/[OtherStuff] that either redirect to the desired URL format or give 404 errors when I try to load them in the browser. I am assuming that Scrapy is handling the redirects and blank pages correctly, but would be open to hearing ways to verify this.
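One quick way to verify that, sketched below: Scrapy's RedirectMiddleware records the chain of redirected URLs in response.meta under 'redirect_urls', and non-2xx responses never reach a callback unless their status codes are whitelisted, so logging both from the callback makes the behaviour visible.
# Sketch of a verification callback; 'redirect_urls' is filled in by RedirectMiddleware.
def parse_item(self, response):
    redirects = response.meta.get('redirect_urls', [])
    if redirects:
        self.logger.info('Redirected %s -> %s', redirects[0], response.url)
    self.logger.info('Status %s for %s', response.status, response.url)
    # By default only 2xx responses reach this callback; set
    # handle_httpstatus_list = [404] on the spider to see 404s here as well.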
Well, if you have a link to the sitemap, it will be faster to let Scrapy fetch the pages from there and process them.
It will work something like this:
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    # you can set rules for extracting URLs under sitemap_rules
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
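Adapted to this question, a sketch might look like the following. It assumes that https://www.indiegogo.com/sitemap (already in your start_urls) serves an XML sitemap listing the project pages; verify that before relying on it.
from scrapy.contrib.spiders import SitemapSpider

class IndiegogoSitemapSpider(SitemapSpider):
    name = 'indiegogo_sitemap'
    allowed_domains = ['indiegogo.com']
    # URL taken from the question's start_urls; it must point to an XML sitemap (or robots.txt).
    sitemap_urls = ['https://www.indiegogo.com/sitemap']
    # Only sitemap entries whose URL contains /projects/ are passed to the callback.
    sitemap_rules = [('/projects/', 'parse_item')]

    def parse_item(self, response):
        yield {'link': response.url, 'title': response.xpath('//title').extract()}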
Try the code below; it will crawl the site and follow only the indiegogo.com/projects/ links.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from sitemap.items import myitem

class DmozSpider(CrawlSpider):
    name = 'indiego'
    allowed_domains = ['indiegogo.com']
    start_urls = [
        'http://indiegogo.com'
    ]
    # Extract only indiegogo.com/projects/ links and follow them.
    rules = (
        Rule(LinkExtractor(allow=('/projects/',), allow_domains=('indiegogo.com',)),
             callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        item = myitem()
        item['link'] = response.request.url
        item['title'] = response.xpath('//title').extract()
        yield item