I am doing lead generation and want to extract text from a handful of URLs. Here is my code for extracting from one URL. What should I do if I want to extract from more than one URL and save the results into a dataframe?
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wdtl.com/'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# remove script and style elements so only visible text remains
for script in soup(["script", "style"]):
    script.extract()  # rip it out
text = soup.get_text()
# break into lines and strip leading/trailing whitespace on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# join the non-empty chunks, one per line
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
If I understand you correctly, you can get there using this simplified method. Let's see if it works for you:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
headers={'User-Agent':'Mozilla/5.0'}
url = 'https://www.wdtl.com/'
resp = requests.get(url,headers = headers)
soup = bs(resp.content, "lxml")
#first, find the links
links = soup.find_all('link',href=True)
#create a list to house the links
all_links= []
#find each link and add it to the list
for link in links:
    if 'http' in link['href']:  # the soup contains many non-http links; this will remove them
        all_links.append(link['href'])
#finally, load the list into a dataframe
df = pd.DataFrame(all_links)
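If what you are actually after is the extracted text of several URLs in one dataframe (rather than the links), here is a minimal sketch along the same lines. The URL list below is only a placeholder, so swap in your own:
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
# placeholder list; replace with the URLs you want to scrape
urls = ['https://www.wdtl.com/', 'https://www.example.com/']

rows = []
for url in urls:
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.content, 'lxml')
    # drop script and style tags so only visible text remains
    for tag in soup(['script', 'style']):
        tag.extract()
    text = ' '.join(soup.get_text().split())
    rows.append({'url': url, 'text': text})

df = pd.DataFrame(rows)
print(df)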
I am trying to scrape the price data from this website: https://fuelkaki.sg/home
However, the data does not appear to be in the HTML code of the page. Upon inspecting, the data seems to be nested inside a tag, for instance under Caltex for the retailer name, and similarly under multiple nested tags for the price data, which I am unable to scrape with the following code (it returns no results).
Any help would be much appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://fuelkaki.sg/home'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='fuel-name')
The table is behind JS (JavaScript) so BeautifulSoup won't see it.
Here's how I'd do it:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
url = "https://fuelkaki.sg/home"
driver.get(url)
time.sleep(3)
element = driver.find_element_by_xpath('//*[@class="table"]')
print(element.text)
driver.close()
Output:
Diesel
92
95
98
Others
(V-Power, etc)
Caltex
30 September 2020, 02:05pm
S$ 1.73
S$ 2.02
S$ 2.06
N.A.
S$ 2.61
and so on...
EDIT:
If you want the table in a Dataframe try this:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
url = "https://fuelkaki.sg/home"
driver.get(url)
time.sleep(3)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser").select_one(".table")
df = pd.read_html(str(soup))
df = pd.concat(df).rename(columns={"Unnamed: 0": ""})
df.to_csv("fuel_data.csv", index=False)
driver.close()
Outputs a .csv file with the table's data.
I'm trying to find the hrefs for all the states where this company has stores; however, it only finds the href for the first state.
Can anyone figure out why the for loop doesn't repeat for the rest of the states? Thank you very much for your help!
import requests
from bs4 import BeautifulSoup
import csv
# website
sitemap = 'website_url'
# content of website
sitemap_content = requests.get(sitemap).content
# parsing website
soup = BeautifulSoup(sitemap_content, 'html.parser')
#print(soup)
list_of_divs = soup.findAll('div', attrs={'class':'listings-inner'})
#print(list_of_divs)
header = ['Links']
with open('/Users/ABC/Desktop/v1.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(header)
    for state in list_of_divs:
        # get the urls by state
        print(state.find('div', attrs={'class':'itemlist'}).a.get('href'))
        rows = [state.find('div', attrs={'class':'itemlist'}).a.get('href')]
        writer.writerow(rows)
list_of_divs actually contains only one element, which is the only div on the page with the class listings-inner. So when you iterate through its elements and use the find method, it will only return the first result.
You want to use the find_all method on that div:
import requests
from bs4 import BeautifulSoup
sitemap = 'https://stores.dollargeneral.com/'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')
listings_div = soup.find('div', attrs={'class':'listings-inner'})
for state in listings_div.find_all('div', attrs={'class':'itemlist'}):
    print(state.a.get('href'))
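If you also want to keep writing the links to a CSV as your original script does, the same fix slots into that structure. The file path and header below are taken from your code, so adjust them as needed:
import csv
import requests
from bs4 import BeautifulSoup

sitemap = 'https://stores.dollargeneral.com/'
soup = BeautifulSoup(requests.get(sitemap).content, 'html.parser')
listings_div = soup.find('div', attrs={'class':'listings-inner'})

with open('/Users/ABC/Desktop/v1.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(['Links'])
    for state in listings_div.find_all('div', attrs={'class':'itemlist'}):
        # one row per state link
        writer.writerow([state.a.get('href')])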
I am trying to extract some information from the HTML of a web page, but neither the regex method nor the list-comprehension method works.
At http://bitly.kr/RWz5x, there is a key called encparam enclosed in getjason, inside a JavaScript tag that is the 49th of all the script elements on the page.
Thank you in advance for your help.
import re
import requests
from bs4 import BeautifulSoup

sam = requests.get('http://bitly.kr/RWz5x')
#html = sam.text
html=sam.content
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')
#your_script = [script for script in scripts if 'encparam' in str(script)][0]
#print(your_script)
#print(scripts)
pattern = re.compile("(\w+): '(.*?)'")
fields = dict(re.findall(pattern, scripts.text))
Send your request to the following URL, which you can find in the sources tab:
import requests
from bs4 import BeautifulSoup as bs
import re
res = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
soup = bs(res.content, 'lxml')
r = re.compile(r"encparam: '(.*)'")
data = soup.find('script', text=r).text
encparam = r.findall(data)[0]
print(encparam)
It is likely you can avoid bs4 altogether:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"encparam: '(.*)'")
encparam = p.findall(r.text)[0]
print(encparam)
If you actually want the encparam part in the string:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"(encparam: '\w+')")
encparam = p.findall(r.text)[0]
print(encparam)
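And if what you were originally after is a dict of every key: 'value' pair in that script (as your findall attempt suggests), here is a rough sketch. Which keys are actually present depends entirely on the page:
import re
import requests
from bs4 import BeautifulSoup

res = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
soup = BeautifulSoup(res.content, 'lxml')
# locate the script tag whose text mentions encparam, then pull all key: 'value' pairs
script = soup.find('script', text=re.compile('encparam'))
fields = dict(re.findall(r"(\w+): '(.*?)'", script.text))
print(fields.get('encparam'))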
I searched for a keyword (cybersecurity) on a newspaper website and the results show around 10 articles. I want my code to grab each link, go to that link, get the whole article, and repeat this for all 10 articles on the page. (I don't want the summary, I want the whole article.)
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup
ssl._create_default_https_context = ssl._create_unverified_context
pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')
    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])
        for link in links:
            headline = link.h1.find('div', class_="padding_block")
            headline = headline.text
            print(headline)
            content = link.p.find_all('div', class_="entry")
            content = content.text
            print(content)
            print()
    time.sleep(3)
This is not working:
date = link.li.find('time', class_="post_time")
It shows the error:
AttributeError: 'NoneType' object has no attribute 'find'
This code is working and grabs all the article links. I want to add code that pulls the headline and content from every article link.
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup
ssl._create_default_https_context = ssl._create_unverified_context
pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')
    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])
        print()
    time.sleep(3)
Try the following script. It will fetch all the titles along with their content. Set pages to the highest page number you want to go across.
import requests
from bs4 import BeautifulSoup
url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'
pages = 4
for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".content_col header p > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        title = sauce.select_one("header h1").text
        content = [elem.text for elem in sauce.select("#jtarticle p")]
        print(f'{title}\n{content}\n')
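If you would rather keep the results than just print them, one option is to collect rows and hand them to pandas; the column names and output file below are my own choice:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'
pages = 4
records = []
for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".content_col header p > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        title = sauce.select_one("header h1").text
        # join the article paragraphs into one string
        content = ' '.join(elem.text for elem in sauce.select("#jtarticle p"))
        records.append({'title': title, 'content': content})

df = pd.DataFrame(records)
df.to_csv('cybersecurity_articles.csv', index=False)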
I am trying to pick the FTSE price from the BBC website using BeautifulSoup and Requests, but I get the output 'None' when I run it.
import sys
import requests
from bs4 import BeautifulSoup
URL = 'https://www.bbc.co.uk/news/topics./c9qdqqkgz27t/ftse-100'
page = requests.get(URL,timeout=5)
#fetch content from URL
soup = BeautifulSoup(page.content,'html.parser')
#parse html content
price = soup.find(class_='gel-paragon nw-c-md-market-summary_value')
#price = soup.find("div", class_="gel-paragon nw-c-md-market-summary_value")
#find class with name 'gel...'
print(price)
I've tried different variants of the find function, but both return the same result. I ultimately plan to use this logic to gather data from multiple pages, but I want to get it right before I try to iterate.
Your URL was wrong; I made a few edits and now it works!
import requests
from bs4 import BeautifulSoup
URL = 'https://www.bbc.co.uk/news/topics/c9qdqqkgz27t/ftse-100'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
price = soup.find('div', attrs={
    'class': 'gel-paragon nw-c-md-market-summary__value'})
print(price.text)
Output:
7442.28
This works perfectly:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bbc.com/news/topics/c9qdqqkgz27t/ftse-100'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
price = soup.select_one('div.gel-paragon')
print(price.text)
Output:
7418.34
Note: try 'html.parser' if you don't have 'lxml' installed.
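Since you plan to use this across multiple pages, here is a rough sketch of that loop. The list of URLs is only a placeholder for whatever pages you actually have in mind:
import requests
from bs4 import BeautifulSoup

# placeholder list; swap in the pages you want to cover
urls = [
    'https://www.bbc.co.uk/news/topics/c9qdqqkgz27t/ftse-100',
]

for url in urls:
    page = requests.get(url, timeout=5)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find('div', attrs={'class': 'gel-paragon nw-c-md-market-summary__value'})
    # guard against pages where the element is missing
    print(url, price.text if price else 'not found')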