Web Scraping Problems - web-scraping

I am having a problem with my web scraping application. I want to return a list of the counties in a state, but I am having trouble printing only the text. The code below prints all of the elements (the counties) in the selection, but I only want the list of county names (no HTML tags, just the contents).
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.stats.indiana.edu/dms4/propertytaxes.asp'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
counties = soup.find_all(id='Select1')  # Works
print(counties)
The following returns the text of the page without the HTML tags, which is the format I want, but it prints everything on the page rather than just the county list:
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.stats.indiana.edu/dms4/propertytaxes.asp'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
counties = soup.get_text()  # works
print(counties)
I was wondering if there was a way to combine the two, but every time I try I get error messages. I thought this might work:
counties = soup.find_all(id='Select1').get_text()
but I keep getting a "has no attribute 'get_text'" error.

What you actually want to do here is find the children (the option elements) of the select field. Note that find_all() returns a ResultSet (a list of tags), which is why calling get_text() on it fails; use find() to get the single select element, then iterate over its children:
select = soup.find(id='Select1')
options = select.findChildren()
for option in options:
    print(option.get_text())
The BeautifulSoup documentation is pretty good. You can look around to find other methods you can use on Tag objects, as well as options you can pass to findChildren().
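If you just want the names in a list, here is a minimal variant of the same idea (assuming the dropdown on that page is still the element with id='Select1'):
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.stats.indiana.edu/dms4/propertytaxes.asp'
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
# find() returns a single Tag, so get_text() works on each <option> child
county_names = [opt.get_text(strip=True)
                for opt in soup.find(id='Select1').find_all('option')]
print(county_names)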

Related

Pagination stuck on page. Python web scraping

I'm trying to scrape a list of Japanese companies. It works fine until it gets to a certain page, and then it starts repeating that page over and over. I'm a web scraping noob, so apologies in advance. I have attached a png of the error.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
l = []
for x in range(1, 999):
    url = 'https://jpn.bizdirlib.com/company?page='
    r = requests.get(url + str(x))
    soup = BeautifulSoup(r.content, "html.parser")
    spans = soup.find_all('span', class_='field-content')
    for item in spans:
        name = item.find('a').text
        company_info = {'name': name}
        l.append(company_info)
    print('companies found:', len(l))
    time.sleep(0.5)
df = pd.DataFrame(l)
print(df.head())
df.to_csv('companies.csv')
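A likely cause of this symptom is that the site keeps serving its last real page for any higher page number, so the loop re-scrapes the same page forever. A minimal sketch of a guard against that (the stop condition is an assumption about the site's behaviour, not something taken from the original code):
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

l = []
prev_names = None
for x in range(1, 999):
    r = requests.get('https://jpn.bizdirlib.com/company?page=' + str(x))
    soup = BeautifulSoup(r.content, "html.parser")
    names = [item.find('a').text
             for item in soup.find_all('span', class_='field-content')
             if item.find('a')]
    # assumption: an empty or repeated page means we ran past the last real page
    if not names or names == prev_names:
        break
    prev_names = names
    l.extend({'name': n} for n in names)
    time.sleep(0.5)

pd.DataFrame(l).to_csv('companies.csv')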

Scrape an image using Soup

I am trying to scrape an image from this website: https://www.remax.ca/on/richmond-hill-real-estate/-2407--9201-yonge-st-wp_id268950754-lst. The current code is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.remax.ca/on/richmond-hill-real-estate/-2407--9201-yonge-st-wp_id268950754-lst'
soup = BeautifulSoup(urlopen(url), 'html.parser')
imgs = soup.findAll('div', attrs={'class': 'images is-flex flex-one has-flex-align-center has-flex-content-center'})
When I look inside imgs, I cannot find the img element (the one with classes active ng-star-inserted ng-lazyloaded) or its srcset. As a result, I cannot download the image.
Can someone suggest how to approach this problem?
The images are lazy loaded, and I think that is the problem. So instead I scraped the application/ld+json script tag that describes these pictures.
import json

script = soup.find('script', {'type': 'application/ld+json'})
script_json = json.loads(script.contents[0])
imgs = script_json['@graph'][1]['photo']['url']
Now imgs contains the URLs of all 11 images from the link you provided for that residence.
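To actually save the pictures, here is a short follow-up sketch (assuming imgs is the list of URLs extracted above, and that generic numbered filenames are acceptable):
import requests

for n, img_url in enumerate(imgs):
    resp = requests.get(img_url)
    resp.raise_for_status()
    # write the raw image bytes to a numbered file
    with open('image_{}.jpg'.format(n), 'wb') as f:
        f.write(resp.content)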
You can also use XPath to find the image URLs and requests to download each image, then write it to a file, as follows (the website URL and the XPath itself are placeholders):
import requests
from lxml import html

# send request to website
r = requests.get("thewebsite")
# convert the response to an html tree
tree = html.fromstring(r.content)
# find image urls with an xpath that selects the relevant attribute
image_urls = tree.xpath("xpaths/@href")
# download each image and write its bytes to a separate file
for n, url in enumerate(image_urls):
    img = requests.get(url)
    with open("image_{}.jpg".format(n), "wb") as f:
        f.write(img.content)

How to get missing HTML data when web scraping with python-requests

I am working on building a job board, which involves scraping job data from company sites. I am currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, I am not getting the job data itself; it seems to be missed by the scraper. Based on other questions, this could be because the data is loaded with JavaScript, but that is not obvious from the page.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://www.twilio.com/company/jobs'
# Connect to the URL
response = requests.get(url)
if "_job-title" in response.text:
    print("Found the jobs!")  # FAILS
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# Loop through all 'a' tags with the job class and print their links
for one_a_tag in soup.findAll('a', class_='_job'):
    link = one_a_tag['href']
    print(link)  # FAILS
Nothing displays when this code is run. I have tried using urllib2 as well and that has the same problem. Selenium works but it is too slow for the job. Scrapy looks like it could be promising but I am having install issues with it.
Here is a screenshot of the data I am trying to access.
Basic info for all the jobs at the different offices comes back dynamically from an API call you can find in the browser's network tab. If you extract the job ids from that response, you can then make separate requests for the detailed job info using those ids. Example:
import requests
from bs4 import BeautifulSoup as bs

listings = {}
with requests.Session() as s:
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']:  # you could perform some filtering here or later on
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job  # store basic job info in dict
    for key in listings.keys():
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup  # store soup from detail page
        print(soup.select_one('.app-title').text)  # print example: something from the page
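As a follow-up, here is a small sketch of how the collected basic info could be tabulated with pandas (it assumes, as is typical for Greenhouse job objects, that each job dict carries title and absolute_url keys; treat those field names as assumptions):
import pandas as pd

# assumption: Greenhouse job dicts expose 'title' and 'absolute_url'
rows = [{'id': key, 'title': job.get('title'), 'url': job.get('absolute_url')}
        for key, job in listings.items()]
df = pd.DataFrame(rows)
print(df.head())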

Web scraping: how to scrape one particular table body out of many table bodies?

I am trying to scrape a particular table in the site - http://stats.espncricinfo.com/ci/engine/player/35320.html?class=2;template=results;type=batting
There are multiple tables that are indistinguishable from each other, and I want to scrape only one particular table. How do I do that?
I have tried using the find_all() function, but that only lists ALL the <tbody> tags.
I want to scrape only the highlighted table body.
The table body you want is a tbody tag, and you can select it with the following CSS selector in bs4. Then wrap the result in table tags and pass it to pandas to print it nicely. I'm using bs4 4.7.1.
You could also select it by the text it contains with table = soup.select_one('tbody:contains(year)').
Python:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('http://stats.espncricinfo.com/ci/engine/player/35320.html?class=2;template=results;type=batting')
soup = bs(r.content, 'lxml')
# select_one returns the single tbody rather than a ResultSet
table = soup.select_one('tbody:nth-child(7)')
headers = [item.text for item in soup.select('.headlinks th')]
df = pd.read_html('<table>' + str(table) + '</table>')[0]
df.columns = headers
df = df.dropna(how='all', axis=0).drop(['Span', ''], axis=1)
print(df)
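For completeness, a minimal sketch of the :contains variant mentioned above (this assumes bs4 4.7.1+, where the :contains selector is supported, and reuses the 'year' text from the answer):
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('http://stats.espncricinfo.com/ci/engine/player/35320.html?class=2;template=results;type=batting')
soup = bs(r.content, 'lxml')
# match the tbody by the text it contains rather than by its position
table = soup.select_one('tbody:contains(year)')
df = pd.read_html('<table>' + str(table) + '</table>')[0]
print(df.head())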

How to get data from the live table using web scraping?

I am trying to set up a live table by downloading the data directly from a website with Python. I believe I am following all the steps to the letter, but I still am not able to get the data from the table.
I have referred to many web pages and blogs to try to correct the issue, but without success, so I would like the Stack Overflow community's help.
The following is the table website and there is only one table on the page from which I am trying to get the data:
https://etfdb.com/themes/smart-beta-etfs/#complete-list__esg&sort_name=assets_under_management&sort_order=desc&page=1
The data in the table is partially available for free and the rest is paid, so I guess that could be the problem, but I would assume I should still be able to download the free data. Since this is my first time trying this and I am a beginner at Python, I could be wrong. All help is appreciated.
The code is as follows:
import pandas as pd
import html5lib
import lxml
from bs4 import BeautifulSoup
import requests

site = 'https://etfdb.com/themes/smart-beta-etfs/#complete-list&sort_name=assets_under_management&sort_order=desc&page=1'
page1 = requests.get(site)  # the original passed proxies=proxy_support, but proxy_support was never defined
print(page1.status_code)
print(page1.text)
soup = BeautifulSoup(page1.text, 'html.parser')
print(soup.prettify())
table = soup.find_all("div", class_="fixed-table-body")
print(table)
When I inspect table, there is no data: the element is completely empty, even though there is a table visible on the website. All help will be really appreciated.
The page makes another request for this info, which returns json you can parse:
import requests
r = requests.get('https://etfdb.com/data_set/?tm=77630&cond=&no_null_sort=&count_by_id=&sort=assets_under_management&order=desc&limit=25&offset=0').json()
Some of the values (those for the output columns Symbol and ETF Name, stored under the keys symbol and name) contain html, so you can use bs4 to parse those values and extract the final desired result; the other key-value pairs are straightforward.
For example, if you loop over each row in the json:
for row in r['rows']:
    print(row)
    break
you get rows ready for parsing, of which two items need bs4, as shown below.
Python:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://etfdb.com/data_set/?tm=77630&cond=&no_null_sort=&count_by_id=&sort=assets_under_management&order=desc&limit=25&offset=0').json()
results = []
for row in r['rows']:
    soup = bs(row['symbol'], 'lxml')
    symbol = soup.select_one('.caps').text
    soup = bs(row['name'], 'lxml')
    etf_name = soup.select_one('a').text
    esg_score = row['esg_quality_score']
    esg_quality_score_pctl_peer = row['esg_quality_score_pctl_peer']
    esg_quality_score_pctl_global = row['esg_quality_score_pctl_global']
    esg_weighted_avg_carbon_inten = row['esg_weighted_avg_carbon_inten']
    esg_sustainable_impact_pct = row['esg_sustainable_impact_pct']
    results.append([symbol, etf_name, esg_score, esg_quality_score_pctl_peer,
                    esg_quality_score_pctl_global, esg_weighted_avg_carbon_inten,
                    esg_sustainable_impact_pct])
headers = ['Symbol', 'ETF Name', 'ESG Score', 'ESG Score Peer Percentile (%)',
           'ESG Score Global Percentile (%)', 'Carbon Intensity (Tons of CO2e / $M Sales)',
           'Sustainable Impact Solutions (%)']
df = pd.DataFrame(results, columns=headers)
print(df)
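Since the request URL already carries limit=25 and offset=0 parameters, here is a hedged sketch of paging through the whole dataset by stepping the offset (the stop condition assumes the endpoint returns an empty rows list once you run past the end):
import requests

all_rows = []
offset = 0
while True:
    url = ('https://etfdb.com/data_set/?tm=77630&cond=&no_null_sort=&count_by_id='
           '&sort=assets_under_management&order=desc&limit=25&offset={}'.format(offset))
    rows = requests.get(url).json().get('rows', [])
    if not rows:  # assumption: an empty page means no more data
        break
    all_rows.extend(rows)
    offset += 25
print(len(all_rows), 'rows fetched')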
I would like to use a pandas DataFrame to fetch the table, which can then be exported to csv:
import pandas as pd

tables = pd.read_html("https://etfdb.com/themes/smart-beta-etfs/#complete-list&sort_name=assets_under_management&sort_order=desc&page=1")
table = tables[0][:-1]  # drop the last row
print(table)
table.to_csv('table.csv')  # the csv file appears in the project folder after the run
