I am trying to scrape data from the shareholding disclosures page of the Hong Kong Exchange, but when I look for the tr elements, my code only grabs the previous balance number of shares rather than the whole tr row, including the name and ticker.
url = "https://di.hkex.com.hk/di/summary/DSM20220218C1.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
result =soup.find('table', id="Table3")
for stock in result:
rows = stock.find('tr')
print(rows)
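A minimal sketch of the likely fix, assuming the table is present in the static HTML rather than rendered by script: iterating result walks the table's children (including bare text nodes), and calling find('tr') on each child is not a row lookup. Calling find_all('tr') on the table itself returns each complete row:

import requests
from bs4 import BeautifulSoup

url = "https://di.hkex.com.hk/di/summary/DSM20220218C1.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', id="Table3")

# find_all('tr') yields whole rows; the cells hold the name,
# ticker and balance columns.
for row in table.find_all('tr'):
    print([td.get_text(strip=True) for td in row.find_all('td')])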
So I am trying to scrape the country names from the table on the website https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths as a list. But when I print it out, it just gives me an empty list instead of a list containing the country names. Could anybody explain why I am getting this? The code is below:
import requests
from bs4 import BeautifulSoup

webpage = requests.get("https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths")
soup = BeautifulSoup(webpage.content, "html.parser")
countries = soup.find_all("div", attrs={"class": 'gv-cell gv-country-name'})
print(countries)

list_of_countries = []
for country in countries:
    list_of_countries.append(country.get_text())
print(list_of_countries)
This is the output I am getting:
[]
[]
Also, not only here: I was getting the same result (an empty list) when I was trying to scrape a product's information from Amazon's website.
The list is dynamically retrieved from another endpoint, which you can find in the network tab; it returns JSON. Something like the following should work:
import requests

# The country list is built client-side; this endpoint serves the underlying JSON.
r = requests.get('https://interactive.guim.co.uk/2020/coronavirus-central-data/latest.json').json()  # may need to add headers
countries = [i['attributes']['Country_Region'] for i in r['features']]
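If the request comes back empty or blocked (the inline comment above hints headers may be needed), a hedged guess is that a browser-like User-Agent suffices; the header value here is illustrative, not something confirmed against the endpoint:

import requests

# Illustrative headers; any common browser UA string should do.
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://interactive.guim.co.uk/2020/coronavirus-central-data/latest.json', headers=headers).json()
countries = [i['attributes']['Country_Region'] for i in r['features']]
print(countries[:10])  # quick sanity check on the first few names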
I'm trying to scrape information from https://www.kff.org/interactive/subsidy-calculator. For instance, put in state=California, zip=90001, income=20000, no coverage, 1 person, 1 adult, no children, age=21, no tobacco.
We get the following:
https://www.kff.org/interactive/subsidy-calculator/#state=ca&zip=94704&income-type=dollars&income=20000&employer-coverage=0&people=1&alternate-plan-family=individual&adult-count=1&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=0
I would like to get the numbers for "estimated financial help" and "your cost for a silver plan" (they are bolded in blue in the "Results" grey box; for some reason I can't upload the screenshot). When I use the XPath for the numbers, I get back an empty string. This is not the case if I retrieve some other text (not in the grey box). I wonder what could be wrong with this. I have attached the code below. Please forgive me if this is a stupid question, since I'm very new to web scraping. Thank you!
library(rvest)

state = tolower('CA')
zip = 94704
income = 20000
people = 1
adult = 1
children = 0
url = paste0("https://www.kff.org/interactive/subsidy-calculator/#state=", state, "&zip=", zip, "&income-type=dollars&income=", income, "&employer-coverage=0&people=", people, "&alternate-plan-family=individual&adult-count=", adult, "&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=", children)

# This returns an empty string
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-calculator-new"]/div[5]/div/div/dl/dd[1]/span') %>%
  html_text()

# This returns "Number of children (20 and younger) enrolling in Marketplace coverage", a line that's not in the grey box.
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/p') %>%
  html_text()
The values are generated by scripts that run on the page. Your current method doesn't allow for this, hence your result. You are likely better off using a method that allows scripts to run, such as RSelenium.
The form you complete, #subsidy-form, feeds values into a template in a script tag, #results-template. The associated calculations are implemented in this script, https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/calculator.js?ver=1.7.7, where you will find the logic and the preset values, such as poverty lines per year.
The simplest quick view is probably to inspect the JavaScript variables when the new SubsidyCalculator object is created to process the form, i.e. the js starting with var sc = new SubsidyCalculator. You could 'reverse engineer' those variables from your inputs plus the values returned from the JSON below, which I think (but haven't confirmed) feed the six variables that begin with kff_sc, according to zipcode, into the calculator, e.g. silver: kff_sc.silver. You can get an idea of the ballpark figures, given the default values at the top of the script.
Figures in relation to zipcode are retrieved from this: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/94.json, where the two digits before .json are the first two digits of the zipcode. You can see this in the input-validation script https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/shared.js?ver=1.7.7:
var bucket = $( this ).val().substring( 0, 2 );
if ( kff_sc.buckets[bucket] ) return;
$.ajax( '/wp-content/themes/vip/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/' + bucket + '.json',
The first two digits determine the bucket.
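To illustrate the bucket lookup, here is a small sketch (in Python rather than R, purely for illustration; the shape of the returned JSON is something to inspect, not something confirmed here):

import requests

zipcode = '94704'
bucket = zipcode[:2]  # the first two digits of the zip select the bucket file

url = ('https://www.kff.org/wp-content/themes/kaiser-foundation-2016/'
       'interactives/subsidy-calculator/2019/json/zips/' + bucket + '.json')
data = requests.get(url).json()

# Inspect the response to see how the per-zip premiums (the values that
# feed the kff_sc variables, e.g. kff_sc.silver) are keyed.
print(data)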
All in all, you could likely implement your own calculator, but you would be re-inventing the wheel. It seems easier to just automate the browser and then extract the resulting values.
I am trying to extract the "Four Factors" table from the following URL: https://www.basketball-reference.com/boxscores/201810160GSW.html. When I use the findAll method from the BeautifulSoup library to search for tables, I do not see that table, nor do I see the "Line Score" table. I am only concerned with the "Four Factors" table, but I figured the note about the "Line Score" table could be useful information.
import bs4
import requests

URL2 = 'https://www.basketball-reference.com/boxscores/201810160GSW.html'
page2 = requests.get(URL2)
page2 = page2.text
soup2 = bs4.BeautifulSoup(page2, 'html.parser')
content = soup2.findAll('table')
If you look at content, you can find the other four tables on the page, but "Four Factors" and "Line Score" do not show up there. In addition to helping me extract the "Four Factors" table, can you explain why it doesn't show up in content?
It's inside one of the HTML comments, which is why you weren't finding it, I think.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

r = requests.get('https://www.basketball-reference.com/boxscores/201810160GSW.html')
soup = BeautifulSoup(r.text, 'lxml')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    if 'id="four_factors"' in comment:
        soup = BeautifulSoup(comment, 'lxml')
        break
table = soup.select_one('#four_factors')
df = pd.read_html(str(table))[0].fillna('')
print(df)
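If you also want the "Line Score" table, or any other comment-wrapped table, the same idea generalizes; this sketch just assumes every hidden table lives in a comment containing a <table> tag:

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

r = requests.get('https://www.basketball-reference.com/boxscores/201810160GSW.html')
soup = BeautifulSoup(r.text, 'lxml')

hidden_tables = []
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        # read_html parses every table found in the comment's HTML
        hidden_tables.extend(pd.read_html(str(comment)))
print(len(hidden_tables))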
I am parsing a certain webpage with Beautiful Soup, trying to retrieve all links that are inside h3 tags:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www....")
soup = BeautifulSoup(page.text, "html.parser")
links = []
for item in soup.find_all('h3'):
    links.append(item.a['href'])
However, the links found are different from the links present in the page. For example, when the link http://www.estense.com/?p=116872 is present in the page, Beautiful Soup returns http://www.estense.com/%3Fp%3D116872, replacing '?' with '%3F' and '=' with '%3D'. Why is that?
Thanks.
You can unquote the URL using urllib.parse:
from urllib import parse
parse.unquote(item.a['href'])
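Applied to the loop from the question, that would look something like this (the guard for h3 tags without a nested anchor is my addition):

from urllib import parse

links = []
for item in soup.find_all('h3'):
    if item.a:  # skip any h3 that has no <a> inside
        links.append(parse.unquote(item.a['href']))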
I'm pretty new to R and I'm trying to solve some real-world challenges while taking the datacamp.com R course. The thing is: I'm trying to scrape the address, name, phone, email and site from a webpage. The information is in a table. I have tried this code:
library(rvest)

# Store web url
apel_url <- read_html("http://www.apel.pt/pageview.aspx?pageid=944&langid=1")

txt <- html_text(apel_url)
txt

associados <- apel_url %>%
  html_nodes(css = "p.MsoNormal") %>%
  html_text()

print(associados)
As a result I have a chr [1:1481] vector, but some of the lines were scraped joined together, although on the site they are separate lines. For instance:
associados[969]
results in:
[1] "PENUMBRA EDITORA, LDA.Rua da Marinha, 50 - Madalena4405-761 VILA NOVA DE GAIA Tel.: 22 375 04 52"
I wonder what I'm missing, and I would like to know the best way to transform this string into a data frame, separating each field into a column (phone, address, email, URL, etc.). Some of the entries have one or more phone numbers, others don't have a URL, etc., so a field has to be blank when there is no information.
Thanks for helping.
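One way to attack the splitting, sketched in Python purely to illustrate the regex logic (the patterns are guesses based on the single sample entry above; in R the same expressions work with stringr::str_extract, and unmatched fields simply come back blank):

import re

entry = ("PENUMBRA EDITORA, LDA.Rua da Marinha, 50 - Madalena"
         "4405-761 VILA NOVA DE GAIA Tel.: 22 375 04 52")

# Portuguese postal codes follow the NNNN-NNN pattern; use it as an anchor.
postal = re.search(r'\d{4}-\d{3}', entry)
# Phone numbers follow a "Tel.:" label in the sample entry.
phone = re.search(r'Tel\.:\s*([\d ]+)', entry)

print(postal.group() if postal else '')         # 4405-761
print(phone.group(1).strip() if phone else '')  # 22 375 04 52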