Scrape an image using Soup - web-scraping

I am trying to scrape an image from this website: https://www.remax.ca/on/richmond-hill-real-estate/-2407--9201-yonge-st-wp_id268950754-lst. The current code is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.remax.ca/on/richmond-hill-real-estate/-2407--9201-yonge-st-wp_id268950754-lst'
soup = BeautifulSoup(urlopen(url), 'html.parser')
imgs = soup.findAll('div', attrs={'class': 'images is-flex flex-one has-flex-align-center has-flex-content-center'})
When I look inside imgs, I cannot find the img tag with the classes active ng-star-inserted ng-lazyloaded, nor its srcset attribute. As a result, I cannot download the image.
Can someone suggest how to approach this problem?

The images are lazy loaded, and I think that is the problem. So instead I scraped the application/ld+json script on the page, which contains the photo URLs.
import json

script = soup.find('script', {'type': 'application/ld+json'})
script_json = json.loads(script.contents[0])
imgs = script_json['@graph'][1]['photo']['url']
Now imgs contains the list of all 11 image URLs from the link you provided for that residence.
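If you then want to save the pictures, here is a minimal sketch, assuming imgs is a flat list of direct image URLs and that requests is available (the image_{i}.jpg filenames are just placeholders):
import requests

for i, img_url in enumerate(imgs):
    resp = requests.get(img_url)
    resp.raise_for_status()
    with open(f'image_{i}.jpg', 'wb') as f:
        f.write(resp.content)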

You can use XPath to find the image URLs, then use requests to fetch each image and write it to a file, as follows:
import requests
from lxml import html

# send a request to the website
r = requests.get("thewebsite")

# convert the response to an html object
tree = html.fromstring(r.content)

# find the image urls with an xpath expression
image_urls = tree.xpath("xpaths/@href")

# download each image and write it to your computer
for i, image_url in enumerate(image_urls):
    img = requests.get(image_url)
    with open(f"image_{i}.jpg", "wb") as f:
        f.write(img.content)
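The exact XPath depends on the page markup; for images you would normally pull the src or srcset attribute of the img elements rather than an href. A hypothetical example (the ng-lazyloaded class comes from the question and is only an assumption about the rendered markup):
# hypothetical: select lazily loaded img tags and take their src attribute
image_urls = tree.xpath('//img[contains(@class, "ng-lazyloaded")]/@src')
Note, though, that on the page from the question the images are injected by JavaScript, so they may not be present in the HTML that requests receives; in that case the JSON-LD approach above is more reliable.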

Related

Web scrape Excel files for different dates

I'm a newbie to Beautiful Soup. Can anyone suggest how to scrape the Excel files for the past 14 days? My understanding is that I should loop over the dates and save each file. Thanks.
https://www.hkexnews.hk/reports/sharerepur/sbn.asp
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.hkexnews.hk/reports/sharerepur/sbn.asp")
soup = BeautifulSoup(res.text, "lxml")
Now we will find the table using the find method, use find_all to get all td tags, and append the extracted URLs to the list lst.
main_data = soup.find("table").find_all("td")
lst = []
for data in main_data:
    try:
        url = data.find("a").get('href')[1:]
        main_url = "https://www.hkexnews.hk/reports/sharerepur" + url
        lst.append(main_url)
    except AttributeError:
        pass
Now iterate through lst and request each individual URL to download the data to an Excel file.
for index, url in enumerate(lst):
    resp = requests.get(url)
    output = open(f'test_{index}.xls', 'wb')
    output.write(resp.content)
    output.close()
    print(index)
Screenshot: the test_*.xls files being created locally.
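On the original question about only the past 14 days: the code above downloads every file it finds, so you would still need to filter. A rough sketch of one way to do that, assuming (you would need to verify this against the actual page) that the visible link text of each a tag is the report date in DD/MM/YYYY format:
from datetime import datetime, timedelta

cutoff = datetime.today() - timedelta(days=14)
recent = []
for data in main_data:
    a = data.find("a")
    if a is None:
        continue
    try:
        # assumption: the link text is the report date, e.g. 05/01/2021
        report_date = datetime.strptime(a.get_text(strip=True), "%d/%m/%Y")
    except ValueError:
        continue
    if report_date >= cutoff:
        recent.append("https://www.hkexnews.hk/reports/sharerepur" + a.get('href')[1:])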

How to deal with this website when web scraping?

I am trying to scrape this website.
I am applying the same code that I always use to scrape pages:
url_dv1 <- "https://ec.europa.eu/commission/presscorner/detail/en/qanda_20_171?fbclid=IwAR2GqXLmkKRkWPoy3-QDwH9DzJiexFJ4Sp2ZoWGbfmOR1Yv8POdlLukLRaU"
url_dv1 <- paste(html_text(html_nodes(read_html(url_dv1), "#inline-nav-1 .ecl-paragraph")), collapse = "")
For this website, though, the code doesn't seem to be working. In fact, I get Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')".
Why is it so? How can I fix it?
Thanks a lot!
The problem is that the web page is dynamically rendered. You can overcome this using PhantomJS (it can be downloaded here: https://phantomjs.org/download.html). You will also need a custom JavaScript script (see below). The following R code works for me.
library(tidyverse)
library(rvest)

dir_js <- "path/to/a/directory" # the JS code below needs to be saved in this directory as javascript.js

url <- "https://ec.europa.eu/commission/presscorner/detail/en/qanda_20_171?fbclid=IwAR2GqXLmkKRkWPoy3-QDwH9DzJiexFJ4Sp2ZoWGbfmOR1Yv8POdlLukLRaU"

# run phantomjs to render the page and save it as myhtml.html
system2("path/to/where/you/have/phantomjs.exe", # path to the phantomJS executable
        args = c(file.path(dir_js, "javascript.js"), url))

read_html("myhtml.html") %>%
  html_nodes("#inline-nav-1 .ecl-paragraph") %>%
  html_text()
# this is the JavaScript code to be saved in dir_js as javascript.js
// create a webpage object
var page = require('webpage').create(),
    system = require('system');
// the url for each country is provided as a command-line argument
var country = system.args[1];
// include the File System module for writing to files
var fs = require('fs');
// specify the path to the output file
// we'll just overwrite the same page in the working directory on each run
var path = 'myhtml.html';
// open the page, save the rendered HTML to disk, and exit
page.open(country, function (status) {
    fs.write(path, page.content, 'w');
    phantom.exit();
});

How to get missing HTML data when web scraping with python-requests

I am working on building a job board, which involves scraping job data from company sites. I am currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, I am not getting the job data itself -- the scraper seems to miss it. Based on other questions, this could be because the data is rendered by JavaScript, but that is not obvious.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://www.twilio.com/company/jobs'

# Connect to the URL
response = requests.get(url)
if "_job-title" in response.text:
    print("Found the jobs!")  # FAILS

# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# To download the whole data set, loop through all 'a' tags with the _job class
for one_a_tag in soup.findAll('a', class_='_job'):
    link = one_a_tag['href']
    print(link)  # FAILS
Nothing is displayed when this code is run. I have tried urllib2 as well, and it has the same problem. Selenium works, but it is too slow for this job. Scrapy looks promising, but I am having install issues with it.
Basic info for all the jobs at the different offices comes back dynamically from an API call you can find in the network tab. If you extract the ids from that response, you can then make separate requests for the detailed job info using those ids. Example as shown:
import requests
from bs4 import BeautifulSoup as bs

listings = {}
with requests.Session() as s:
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']:  # you could perform some filtering here or later on
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job  # store basic job info in dict
    for key in listings.keys():
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup  # store soup from the detail page
        print(soup.select_one('.app-title').text)  # print an example value from the page
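As a rough illustration of the filtering mentioned in the comment above, here is one hypothetical way to keep only jobs from a particular department. The 'Engineering' filter value and the assumption that each department dict carries a 'name' field are mine, not something taken from the answer:
wanted = 'Engineering'  # hypothetical department filter
filtered = {}
for office in r['offices']:
    for dept in office['departments']:
        # assumption: each department dict has a 'name' key
        if wanted in dept.get('name', '') and 'jobs' in dept:
            for job in dept['jobs']:
                filtered[job['id']] = job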

RCurl: parsing an HTML webpage using a class tag

I am trying to parse the following webpage to return the links to each result sub-page. However, the 'result' variable just comes back as an empty list. What do I need to put into the span clause in order for it to correctly return the header and underlying URL of each result page?
Many thanks.
# load packages
library(RCurl)
library(XML)

# download html
url = "http://www.sportinglife.com/racing/results"
http = htmlParse(url)
result = lapply(http['//span[@class="hdr t2"]'], xmlValue)
Easy. When you look for "hdr t2" in the source code of the URL, you'll notice that the tag carrying this class name is an h3 tag, while you are querying for a span tag. Replace "span" with "h3" and it'll work. This works for me:
# load packages
library(RCurl)
library(XML)

# download html
url = "http://www.sportinglife.com/racing/results"
http = htmlParse(url)
result = lapply(http['//h3[@class="hdr t2"]'], xmlValue)
I say it's easy, but it's also easy to overlook :)

Web Scraping Problems

I am having a problem with my web scraping application. I want to return a list of the counties in a state, but I am having trouble printing only the text. The code below prints all of the elements (the counties) in the selection, but I only want the list of counties (no HTML tags, just the contents).
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.stats.indiana.edu/dms4/propertytaxes.asp'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
counties = soup.find_all(id='Select1')  # works
print(counties)
The following returns the text of everything on the web page without the HTML, which is the format I want, but it prints everything on the page:
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.stats.indiana.edu/dms4/propertytaxes.asp'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
counties = soup.get_text()  # works
print(counties)
I was wondering if there was a way to combine the two, but every time I try I get error messages. I thought this might work:
counties = soup.find_all(id='Select1').get_text()
I keep getting a "has no attribute 'get_text'" error.
So what you actually want to do here is find the children (the option tags) of the select field. The reason your combined attempt fails is that find_all returns a ResultSet (essentially a list of tags), and a ResultSet has no get_text method; you need to call it on a single tag. Use find to get the select element itself, then iterate over its children:
select = soup.find(id='Select1')  # find() returns a single Tag, not a ResultSet
options = select.findChildren()
for option in options:
    print(option.get_text())
The BeautifulSoup documentation is pretty good. You can look around to find other methods you can use on Tag objects, as well as options to pass to findChildren.
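A small aside: the same result can also be obtained with a CSS selector, which sidesteps the ResultSet issue entirely; a minimal sketch:
# select all option tags inside the element with id="Select1"
for option in soup.select('#Select1 option'):
    print(option.get_text(strip=True))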
