BeautifulSoup does not scrape all data [closed] - web-scraping

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I am trying to scrape a website, but when I run this code it prints only half of the data (including the critics' data). Here is my script:
from bs4 import BeautifulSoup
from urllib.request import urlopen

inputfile = "Chicago.csv"
f = open(inputfile, "w")
Headers = "Name, Link\n"
f.write(Headers)

url = "https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
page_details = soup.find("dl", {"class": "boccat"})
Readers = page_details.find_all("a")

for i in Readers:
    poll = i.contents[0]
    link = i['href']
    print(poll)
    print(link)
    f.write("{}".format(poll) + ",https://www.chicagoreader.com{}".format(link) + "\n")
f.close()
Is my scripting style wrong?
How can I make the code shorter?
When should I use find_all and when find, so that I don't get an AttributeError? I read the documentation but didn't understand it.

To make your code shorter, you can switch to the requests library; it is easy to use and concise. If you want to make the code even shorter, you can use CSS selectors (soup.select).
find selects the container, and find_all selects the individual items of that container, which you then loop over. Here is the complete code:
from bs4 import BeautifulSoup
import csv
import requests

outfile = open("chicagoreader.csv", "w", newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Link"])

base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")

for item in soup.select(".boccat dd a"):
    writer.writerow([item.text, base + item.get('href')])
    print(item.text, base + item.get('href'))

outfile.close()
Or with find and find_all:
from bs4 import BeautifulSoup
import requests

base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")

for items in soup.find("dl", {"class": "boccat"}).find_all("dd"):
    item = items.find_all("a")[0]
    print(item.text, base + item.get("href"))
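On the AttributeError question: find returns a single tag, or None when nothing matches, so chaining another call onto a failed find raises AttributeError: 'NoneType' object has no attribute .... find_all always returns a list (possibly empty), which you can loop over safely. A minimal defensive sketch, assuming the same page as above:
from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "html.parser")

# find() returns None when nothing matches, so check before calling methods on the result
container = soup.find("dl", {"class": "boccat"})
if container is None:
    print("No dl.boccat element found - the page layout may have changed")
else:
    # find_all() returns an empty list when nothing matches, so the loop simply does nothing
    for anchor in container.find_all("a"):
        print(anchor.text, anchor.get("href"))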

Related

For Loop Not Repeating

I'm trying to find the hrefs for all the states where this company has stores; however, it only finds the href for the first state.
Can anyone figure out why the for loop doesn't repeat for the rest of the states? Thank you very much for your help!
import requests
from bs4 import BeautifulSoup
import csv

# website
sitemap = 'website_url'

# content of website
sitemap_content = requests.get(sitemap).content

# parsing website
soup = BeautifulSoup(sitemap_content, 'html.parser')
#print(soup)

list_of_divs = soup.findAll('div', attrs={'class': 'listings-inner'})
#print(list_of_divs)

header = ['Links']
with open('/Users/ABC/Desktop/v1.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(header)

    for state in list_of_divs:
        # get the url's by state
        print(state.find('div', attrs={'class': 'itemlist'}).a.get('href'))
        rows = [state.find('div', attrs={'class': 'itemlist'}).a.get('href')]
        writer.writerow(rows)
list_of_divs actually contains only one element, which is the only div on the page with class listings-inner. So when you iterate through all of its elements and use the find method, it will only return the first result.
You want to use the find_all method on that div:
import requests
from bs4 import BeautifulSoup

sitemap = 'https://stores.dollargeneral.com/'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')

listings_div = soup.find('div', attrs={'class': 'listings-inner'})
for state in listings_div.find_all('div', attrs={'class': 'itemlist'}):
    print(state.a.get('href'))
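If you also want the CSV output from the original script, here is a small sketch that writes the same links to a file (the output filename is just an example):
import csv
import requests
from bs4 import BeautifulSoup

sitemap = 'https://stores.dollargeneral.com/'
soup = BeautifulSoup(requests.get(sitemap).content, 'html.parser')
listings_div = soup.find('div', attrs={'class': 'listings-inner'})

# 'state_links.csv' is an example filename - adjust the path as needed
with open('state_links.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    writer.writerow(['Links'])
    for state in listings_div.find_all('div', attrs={'class': 'itemlist'}):
        writer.writerow([state.a.get('href')])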

How can I extract text from Multiple URLs using beautifulsoup?

I am doing lead generation and want to extract text from a handful of URLs. Here is my code to extract text from one URL. What should I do if I want to extract text from more than one URL and save it into a dataframe?
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wdtl.com/'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

for script in soup(["script", "style"]):
    script.extract()  # rip it out

text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
If I understand you correctly, you can get there using this simplified method. Let's see if it works for you:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.wdtl.com/'
resp = requests.get(url, headers=headers)
soup = bs(resp.content, "lxml")

# first, find the links
links = soup.find_all('link', href=True)

# create a list to house the links
all_links = []

# find each link and add it to the list
for link in links:
    if 'http' in link['href']:  # the soup contains many non-http links; this will remove them
        all_links.append(link['href'])

# finally, load the list into a dataframe
df = pd.DataFrame(all_links)
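If what you actually need is the visible page text from several URLs rather than the links, here is a minimal sketch that reuses the get_text() idea from the question and collects the results in a DataFrame; the URL list is only an example:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# example list of URLs - replace with the pages you actually want to scrape
urls = ['https://www.wdtl.com/', 'https://www.example.com/']

rows = []
for url in urls:
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(resp.content, 'html.parser')

    # drop script and style elements so get_text() only returns visible text
    for tag in soup(['script', 'style']):
        tag.extract()

    text = ' '.join(soup.get_text().split())
    rows.append({'url': url, 'text': text})

df = pd.DataFrame(rows)
print(df.head())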

extracting key-value data from javascript json type data with bs4

I am trying to extract some information from the HTML of a web page, but neither the regex method nor the list comprehension method works.
At http://bitly.kr/RWz5x there is a key called encparam, enclosed in getjason, inside a JavaScript tag that is the 49th of all the script elements on the page.
Thank you in advance for your help.
import re
import requests
from bs4 import BeautifulSoup

sam = requests.get('http://bitly.kr/RWz5x')
#html = sam.text
html = sam.content
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')
#your_script = [script for script in scripts if 'encparam' in str(script)][0]
#print(your_script)
#print(scripts)
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, scripts.text))
Send your request to the following URL, which you can find in the browser's Sources tab:
import requests
from bs4 import BeautifulSoup as bs
import re
res = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
soup = bs(res.content, 'lxml')
r = re.compile(r"encparam: '(.*)'")
data = soup.find('script', text=r).text
encparam = r.findall(data)[0]
print(encparam)
It is likely you can avoid bs4 altogether:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"encparam: '(.*)'")
encparam = p.findall(r.text)[0]
print(encparam)
If you actually want the encparam part in the string:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"(encparam: '\w+')")
encparam = p.findall(r.text)[0]
print(encparam)

I want links and all the content from each link

I searched for a keyword (cybersecurity) on a newspaper website and the results show around 10 articles. I want my code to grab each link, go to that link, get the whole article, and repeat this for all 10 articles on the page. (I don't want the summary; I want the whole article.)
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context

pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')

    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])

    for link in links:
        headline = link.h1.find('div', class_="padding_block")
        headline = headline.text
        print(headline)

        content = link.p.find_all('div', class_="entry")
        content = content.text
        print(content)

        print()
    time.sleep(3)
This is not working.
date = link.li.find('time', class_= "post_time")
It shows the error:
AttributeError: 'NoneType' object has no attribute 'find'
The code below works and grabs all the article links. I want to add code that will also extract the headline and content from every article link.
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context

pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')

    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])
        print()
    time.sleep(3)
Try the following script. It will fetch all of the titles along with their content. Set pages to the highest page number you want to traverse.
import requests
from bs4 import BeautifulSoup

url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'

pages = 4
for page in range(1, pages + 1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")

    for item in soup.select(".content_col header p > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        title = sauce.select_one("header h1").text
        content = [elem.text for elem in sauce.select("#jtarticle p")]
        print(f'{title}\n{content}\n')
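If you want to keep the results instead of just printing them, here is a small sketch that writes each article to a CSV row, assuming the same selectors still match the site (the filename is just an example):
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'
pages = 4

# 'articles.csv' is an example filename
with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'content'])
    for page in range(1, pages + 1):
        soup = BeautifulSoup(requests.get(url.format(page)).text, "lxml")
        for item in soup.select(".content_col header p > a"):
            sauce = BeautifulSoup(requests.get(item.get("href")).text, "lxml")
            title = sauce.select_one("header h1").text
            content = " ".join(elem.text for elem in sauce.select("#jtarticle p"))
            writer.writerow([title, content])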

Get the last page number of a webpage - Beautiful Soup

I'm trying to get the page number of the last page of this website:
http://digitalmoneytimes.com/category/crypto-news/
The link shows that the last page number is 335, but I can't extract the page number.
soup = BeautifulSoup(page.content, 'html.parser')
soup_output= soup.find_all("li",{"class":"active"})
soup_output=soup.select(tag)
print(soup_output)
I get an empty list as the output
In order to get the last page of the given website, I would strongly recommend using the following code:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://digitalmoneytimes.com/category/crypto-news/")
soup = BeautifulSoup(page.content, 'html.parser')
soup = soup.find_all("a", href=True)

pages = []
for x in soup:
    if "http://digitalmoneytimes.com/category/crypto-news/page/" in str(x):
        pages.append(x)

last_page = pages[2].getText()
where last_page is the last page number. Since I don't have access to your tag and page variables, I can't really tell where the problem in your code is.
I really hope this solves your problem.
If it is about getting the last page number, there is something you might try out as well:
import requests
from bs4 import BeautifulSoup

link = 'http://digitalmoneytimes.com/category/crypto-news/'

res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")
last_page_num = soup.find(class_="pagination-next").find_previous_sibling().text
print(last_page_num)
Output:
336
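Once you have the last page number, here is a minimal sketch of how it could drive a loop over every listing page; the /page/<n>/ URL pattern is an assumption based on the pagination links the site exposes:
import requests
from bs4 import BeautifulSoup

base = 'http://digitalmoneytimes.com/category/crypto-news/'

res = requests.get(base)
soup = BeautifulSoup(res.text, "lxml")
last_page_num = int(soup.find(class_="pagination-next").find_previous_sibling().text)

# assumes the site paginates as .../crypto-news/page/<n>/
for n in range(1, last_page_num + 1):
    page_url = base if n == 1 else f"{base}page/{n}/"
    print(page_url)  # fetch and parse each listing page here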
