Python-requests: Can't scrape all the html code from a page - web-scraping

I am trying to scrape the content of the Financial Times Search page.
Using Requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I cannot find it in the Requests response, unlike the articles' titles or hyperlinks.
from bs4 import BeautifulSoup
import requests

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")

def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print ref.get('href')
            print ref.get('title')
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page
Or:
def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print ref.get('page next')

If you can see the required links in the page source but cannot get them via requests or urllib, it usually means one of two things.
There is something wrong with your logic. Let's assume it's not that.
Then what remains is Ajax: the parts of the page you are looking for are loaded by JavaScript after the document.onload event fires, so you cannot scrape something that isn't in the downloaded HTML in the first place.
My solutions (more like suggestions) are:
Reverse engineer the network requests. Difficult, but universally applicable; this is what I personally do. The re module can help with picking the data out of the responses.
Find something that renders the JavaScript, that is, simulate real browsing. Check out the webdriver component of Selenium, Qt, etc. This is easier, but it is memory hungry and consumes a lot more network resources than option 1. A minimal sketch of this approach follows below.
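Here is a minimal sketch of option 2 with Selenium and headless Chrome. It assumes a Selenium 4 install; the li element with class "page next" is taken from the question, and whether the page renders fully without an FT login is not verified here.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

# the full search URL from the question goes here
driver.get('http://search.ft.com/search?q=SABMiller+PLC')

# page_source now contains the JavaScript-rendered HTML
soup = BeautifulSoup(driver.page_source, "lxml")

next_li = soup.find("li", class_="page next")
if next_li and next_li.find("a"):
    print(next_li.find("a").get("href"))

driver.quit()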

Related

How to figure out where the raw data in a table is?

https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date under HISTORIC PRICES, and click 'GO'. The table is updated, but I don't see any relevant HTTP request being sent in devtools.
So this means that the data has already been downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract it? (Note that I don't want to use methods like Selenium; I want to stay with raw HTTP requests to get the raw data.)
EDIT: a websocket is mentioned in the comments, but I can't see one in Developer Tools. I am adding the websocket tag anyway, in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract JavaScript-rendered content without something like Selenium. You can always use a headless browser (no browser window shows up on your screen; the only pitfall is that you have to wait until the page fully loads), and then it won't bother you anymore.
In other words, all the other scraping libs are based on URLs and forms. Scrapy can post forms but cannot run JavaScript.
Selenium will save the day; all you lose is a couple of seconds per attempt (milliseconds if it is run in the frontend). You can grab the page source with driver.page_source and parse it directly (as HTML text) with BeautifulSoup or whatever.
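If you go the Selenium route, a minimal sketch might look like this (assuming Selenium 4 with headless Chrome, and borrowing the '.flex_tr' row selector from the requests-html answer below; the selector may change if NYSE redesigns the page):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless=new")  # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://www.nyse.com/quote/XNYS:A")
# wait until the JavaScript-rendered table rows show up (up to 15 seconds)
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".flex_tr"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
for row in soup.select(".flex_tr"):
    print(row.get_text(" ", strip=True))

driver.quit()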
You can do it with requests-html; for example, let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As #Nikita said, you will have to wait for the page to load (7 seconds here, but maybe less). If you want to make multiple requests, you can do it asynchronously!
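A rough sketch of that asynchronous variant with requests-html's AsyncHTMLSession; the 7-second render delay and the '.flex_tr' selector are carried over from the synchronous example above, and the second ticker URL is only a hypothetical placeholder to show several requests:
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

urls = [
    'https://www.nyse.com/quote/XNYS:A',
    'https://www.nyse.com/quote/XNYS:IBM',  # hypothetical second ticker, just to demonstrate multiple requests
]

async def fetch_first_row(url):
    r = await asession.get(url)
    await r.html.arender(sleep=7)  # render the JavaScript, same wait as above
    first_row = r.html.find('.flex_tr', first=True)
    return url, first_row.text if first_row else None

# run() takes coroutine *functions*, so wrap each URL in a lambda with a default argument
results = asession.run(*[lambda url=url: fetch_first_row(url) for url in urls])
for url, row in results:
    print(url, row, sep='\n')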

How to scrape/download all tumblr images with a particular tag

I am trying to download many (1000's) of images from tumblr with a particular tag (.e.g #art). I am trying to figure out the fastest and easiest way to do this. I have considered both scrapy and puppeteer as options, and I read a little bit about the tumblr API, but I'm not sure how to use the API to locally download the images I want.
Currently, puppeteer seems like the best way, but I'm not sure how to deal with the fact that tumblr uses lazy loading (e.g. what is the code for getting all the images: scrolling down, waiting for the images to load, and collecting them).
Would appreciate any tips!
I recommend you use the Tumblr API, so here are some instructions on how to go about that.
Read up on the What You Need section of the documentation
Read up on the Get Posts With Tag section
Consider using a library like PyTumblr
import pytumblr

list_of_all_posts = []

# Authenticate via OAuth
client = pytumblr.TumblrRestClient(
    'YOUR KEY HERE'
)

def get_art_posts():
    posts = client.tagged('art', **params)  # returns the 20 most recent posts in the tag; each post's body holds its HTML
    # use params (shown in the Tumblr documentation) to change the timestamp or limit of the posts,
    # i.e. to only get posts before a certain time
    return posts

list_of_all_posts.append(get_art_posts())
I'm pretty rusty with the Tumblr API, not gonna lie, but the documentation is kept well up to date. Once you have the HTML of a post, the links to its images will be in there. There are plenty of libraries out there, like Beautiful Soup, that can extract the images from the HTML by their CSS selectors. Hope this helped!
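As a rough illustration of that last step, here is a sketch that pulls every img URL out of a post's body with Beautiful Soup (assuming, as in the solution below, that text posts carry their HTML under post['body'] while photo posts are handled separately):
from bs4 import BeautifulSoup

def extract_image_urls(post):
    # Pull every <img> src out of a post's HTML body, if it has one
    html = post.get('body', '')  # assumption: text/HTML posts keep their markup under 'body'
    soup = BeautifulSoup(html, 'html.parser')
    return [img['src'] for img in soup.find_all('img') if img.get('src')]

# usage: feed it the posts returned by client.tagged('art') above
for post in client.tagged('art'):
    for image_url in extract_image_urls(post):
        print(image_url)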
My solution is below. Since I couldn't use offset, I used the timestamp of each post as an offset instead. Since I was specifically trying to get the links of images in the posts, I did a little processing of the output as well. I then used a simple Python script to download every image from my list of links. I have included a website and an additional Stack Overflow post which I found helpful.
import pytumblr

def get_all_posts(client, blog):
    offset = None
    for i in range(48):
        # response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        response = client.tagged('YOUR TAG HERE', limit=20, before=offset)
        for post in response:
            if 'photos' not in post:
                # print(post)
                if 'body' in post:
                    body = post['body']
                    body = body.split('<')
                    body = [b for b in body if 'img src=' in b]
                    if body:
                        body = body[0].split('"')
                        print(body[1])
                        yield body[1]
                    else:
                        yield
            else:
                print(post['photos'][0]['original_size']['url'])
                yield post['photos'][0]['original_size']['url']
        # move to the next offset
        offset = response[-1]['timestamp']
        print(offset)

client = pytumblr.TumblrRestClient('USE YOUR API KEY HERE')
blog = 'staff'

# use our function
with open('{}-posts.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print(post, file=out_file)
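For completeness, a minimal sketch of the "simple Python script" mentioned above that downloads every image from the generated links file; skipping the 'None' lines the generator can yield and naming files after the last URL segment are my own choices here:
import os

import requests

blog = 'staff'

with open('{}-posts.txt'.format(blog)) as in_file:
    links = [line.strip() for line in in_file if line.strip() and line.strip() != 'None']

os.makedirs('images', exist_ok=True)
for link in links:
    filename = os.path.join('images', link.split('/')[-1])
    r = requests.get(link)
    if r.ok:
        with open(filename, 'wb') as img_file:
            img_file.write(r.content)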
Links:
https://64.media.tumblr.com/9f6b4d8d15caffe88c5877cd2fb31726/8882b6bec4975045-23/s540x810/49586f5b05e8661d77e370845d01b34f0f5f2ca6.png
Print more than 20 posts from Tumblr API
Also thank you very much to Harada, whose advice helped a lot!

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company who has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful. I am open to any other approach though.
Thank you in advance.
This is just pseudo-code to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='

page_ids = ['0', '10', '20', '30', '40', '50']  ## could also be generated dynamically; hard-coded here to keep it simple

for pep_id in page_ids:
    if pep_id == '0':
        # the initial page
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        ## Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        ## Enter some parsing logic
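As a rough sketch of what that missing parsing logic might look like, assuming the endpoint returns JSON; the 'jobs', 'company', and 'url' field names below are guesses, so print page.json() once and adjust them to whatever the API actually returns:
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='

company_urls = set()
for offset in range(0, 60, 10):  # builds the same 0..50 offsets dynamically
    page = requests.get(base_url + str(offset), headers=headers)
    data = page.json()  # assumed: the endpoint answers with JSON
    for job in data.get('jobs', []):  # 'jobs' key is a guess
        url = job.get('company', {}).get('url')  # 'company' / 'url' keys are guesses too
        if url:
            company_urls.add(url)

print(len(company_urls), 'distinct company URLs found')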

Detecting valid search parameters for a site? (Web scraping)

I'm trying to scrape a bunch of search results from the site:
http://www.wileyopenaccess.com/view/journals.html
Currently the results show up on 4 pages. The 4th page could be accessed with http://www.wileyopenaccess.com/view/journals.html?page=4
I'd like some way to get all of the results on one page for easier scraping, but I have no idea how to determine which request parameters are valid. I tried a couple of things like:
http://www.wileyopenaccess.com/view/journals.html?per_page=100
http://www.wileyopenaccess.com/view/journals.html?setlimit=100
to no avail. Is there a way to detect the valid parameters of this search?
I'm using BeautifulSoup; is there some obvious way to do this that I've overlooked?
Thanks
You cannot pass any magic params to get all the links on one page, but you can follow the Next button to get all the pages, which works regardless of how many pages there are:
import requests
from bs4 import BeautifulSoup

def get_all_pages():
    response = requests.get('http://www.wileyopenaccess.com/view/journals.html')
    soup = BeautifulSoup(response.text, "html.parser")
    yield soup.select("div.journalRow")
    nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")
    while nxt:
        response = requests.get(nxt["href"])
        soup = BeautifulSoup(response.text, "html.parser")
        yield soup.select("div.journalRow")
        nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")

for page in get_all_pages():
    print(page)

Search a url for unique phrase using Google API

Does Google have an API with a function which will verify if a specific phrase can be found at a given url?
Say I have a webpage url: www.mysite/2011/01/check-if-phrase-exists
I want to know if the phrase foobar exists somewhere on that document (it can be anywhere on the html document - not just "readable text").
The function/api would return True or False.
Question Update: the method should spare me from having to retrieve the entire page to my server and search it myself. It is the fetching of the webpage to my server that I am trying to avoid (to cut down on bandwidth).
I don't think they do, but you could do this yourself without much code (this is adapted from the App Engine docs):
import urllib2

url = "http://www.google.com/"
try:
    result = urllib2.urlopen(url)
    my_search_function(result)
    # or perhaps my_search_function(result.content)
except urllib2.URLError, e:
    handleError(e)
Then you can just define my_search_function(text) to do what you need.
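For completeness, here is a tiny sketch of what my_search_function could look like, together with a Python 3 variant that uses requests instead of urllib2; both simply check whether the phrase appears anywhere in the raw HTML:
import requests

def my_search_function(text, phrase="foobar"):
    # plain substring check over the raw HTML, markup included
    return phrase in text

response = requests.get("http://www.google.com/")  # same example URL as the answer above
print(my_search_function(response.text))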
