I would like to scrape the sawmill owner (after "Owned by:") from https://www.sawmilldatabase.com/sawmill.php?id=1282 with BeautifulSoup.
I've tried to adapt this very similar answer, but it doesn't work for a reason I don't understand.
<td>
AKD Softwoods
</td>
Python:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.sawmilldatabase.com/sawmill.php?id=1282')
soup = BeautifulSoup(page.text, 'html.parser')
lst = soup.find_all('TD')
for td in lst:
    if td.text == "Owned by":
        print("yes")
        print(lst[lst.index(td)+1].text)
To address the code you submitted, the reason you are not successful is that you use if td.text == "Owned by" as your conditional. While this seems like it should work, it never returns what you want, because the website you are scraping places the sawmill owner after "Owned by: ". (If you inspect the webpage, you will see the tag is <td>Owned by: </td>.)
While the difference between "Owned by" and "Owned by: " seems negligible, it makes all the difference to your program. Simply by changing your code to if td.text == "Owned by: ":, you will get the correct response:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.sawmilldatabase.com/sawmill.php?id=1282')
soup = BeautifulSoup(page.text, 'html.parser')
lst = soup.find_all('td')
for td in lst:
    if td.text == "Owned by: ":
        print("yes")
        print(lst[lst.index(td)+1].text)
Alternatively, you could use if "Owned by" in td.text: as your conditional, but that is not ideal if another <td> tag happens to contain the same text.
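For example, a minimal sketch of that looser match (it still assumes the owner sits in the <td> immediately following the label, as it does on this page):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.sawmilldatabase.com/sawmill.php?id=1282')
soup = BeautifulSoup(page.text, 'html.parser')

for td in soup.find_all('td'):
    # A substring check avoids depending on the exact trailing space,
    # but it could also match some other cell containing the same text.
    if "Owned by" in td.text:
        owner_cell = td.find_next_sibling('td')
        if owner_cell:
            print(owner_cell.text.strip())
            break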
Hope it helps!
EDIT
Oh, and also don't capitalize TD in lst = soup.find_all('TD'); tag names should be lowercase, i.e. soup.find_all('td').
How about the approach below? If you use a comparison like if sth.text == "sth else: ", the main problem is that the text within the quotes has to be identical to the text stored in the webpage. If you write if sth.text == "sth else:" instead, it no longer works, because the trailing space at the end has been dropped. Try this instead:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://www.sawmilldatabase.com/sawmill.php?id=1282").text,"lxml")
for items in soup.select("table td"):
    if "Owned by:" in items.text:
        name = items.find_next_sibling().text
        print(name)
Output:
AKD Softwoods
I've used a regex to reach the element you're looking for.
Code:
import requests, re
from bs4 import BeautifulSoup
page = requests.get('https://www.sawmilldatabase.com/sawmill.php?id=1282')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find('a', href=re.compile('company.php')).text)
Output:
AKD Softwoods
I'm trying to find the hrefs for all the states where this company has stores, however it only finds the href for the first state.
Can anyone figure out why the for loop doesn't repeat for the rest of the states? Thank you very much for your help!
import requests
from bs4 import BeautifulSoup
import csv
# website
sitemap = 'website_url'
# content of website
sitemap_content = requests.get(sitemap).content
# parsing website
soup = BeautifulSoup(sitemap_content, 'html.parser')
#print(soup)
list_of_divs = soup.findAll('div', attrs={'class':'listings-inner'})
#print(list_of_divs)
header = ['Links']
with open ('/Users/ABC/Desktop/v1.csv','wt') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(header)

    for state in list_of_divs:
        # get the url's by state
        print(state.find('div', attrs={'class':'itemlist'}).a.get('href'))
        rows = [state.find('div', attrs={'class':'itemlist'}).a.get('href')]
        writer.writerow(rows)
list_of_divs actually only contains one element, which is the only div on the page with class listings-inner. So when you iterate through all of its elements and use the find method, it only returns the first result.
You want to use the find_all method on that div:
import requests
from bs4 import BeautifulSoup
sitemap = 'https://stores.dollargeneral.com/'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')
listings_div = soup.find('div', attrs={'class':'listings-inner'})
for state in listings_div.find_all('div', attrs={'class':'itemlist'}):
    print(state.a.get('href'))
I've been working on this for a week and am determined to get this working!
My ultimate goal is to write a webscraper where you can insert the county name and the scraper will produce a csv file of information from mugshots - Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site is organized as: primary site page --> state page --> county page (120 mugshots with names and URLs) --> URL with the data I am ultimately after, plus next links to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names from the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a CSV of 120 names and URLs.
I closely followed this article which was very helpful
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd
county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')
base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'
data = {'Name': [],'URL': []}
def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class', 'label'})
    url = mugshot.find('a', attrs={'class', 'image-preview'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)

def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class', 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
        next_page_text = mugshot.find('a class', attrs={'next page'})
        if next_page_text == 'Next':
            next_page_text = mugshot.get_text()
            next_page_url = mugshot.get('href')
            next_page_url = base_url + next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)

parse_page(search_url)
Any ideas on how to get the pagination to work and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna
It seems you want to know how to collect the URLs from each page, follow the next-page links, and then use those URLs to get the content from their inner pages. This is how you can parse all the links from each page, including the next page, and then use those links to fetch the content from the inner pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "https://mugshots.com/"
base = "https://mugshots.com"
def get_next_pages(link):
    print("**"*20, "current page:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base, item.get("href")))
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url, next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)
This issue concerns the same page I asked about yesterday. The URL is:
https://www.fourfourtwo.com/statszone/22-2016/matches/861695/team-stats/6339/0_SHOT_01#tabs-wrapper-anchor
I am trying to scrape the date of the match:
I want to get:
Waldstadion Frankfurt, Saturday, May 20, 2017 - 14:30
Then, extract:
May 20, 2017
And this happens to be inside a <div> tag with the class teams, as seen in the inspect element view. I try to access this div tag and the teams class in the code below:
import requests
from bs4 import BeautifulSoup
import csv
import re
url = "https://www.fourfourtwo.com/statszone/22-2016/matches/861695/team-stats/6339/0_SHOT_01#tabs-wrapper-anchor"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
# Try find date
date = soup.select('div.teams')
date_raw = date[0].text
date_strip = date_raw.strip()
y = re.findall('(^[A-Z].+)\n', date_strip)
y1 = str(y).strip()
print(y1)
But this is not quite successful: the result is still in a list and has a lot of whitespace to be trimmed. The problem is that this class has lots of children, and I just want to access the text of the class='teams' element and extract the date.
['Waldstadion Frankfurt, Saturday, May 20, 2017 - 14:30 ']
Is there any better way to extract this element? Thank you very much for your help and time.
As you can see, the desired text is the first content after <div class="teams">. You can access it in BeautifulSoup with the .contents property, which can be indexed (0 for the first content):
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.fourfourtwo.com/statszone/22-2016/matches/861695/team-stats/6339/0_SHOT_01#tabs-wrapper-anchor')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('div.teams').contents[0].strip())
Prints:
Waldstadion Frankfurt, Saturday, May 20, 2017 - 14:30
EDIT:
To parse the string for place, date, time you can use regular expression:
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('https://www.fourfourtwo.com/statszone/22-2016/matches/861695/team-stats/6339/0_SHOT_01#tabs-wrapper-anchor')
soup = BeautifulSoup(r.text, 'lxml')
data = soup.select_one('div.teams').contents[0].strip()
place, date, time = re.search(r'(.*?)(?:,.*?)((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec)\s+\d+,\s+\d+).*?(\d+:\d+)', data).groups()
print(place)
print(date)
print(time)
This will print:
Waldstadion Frankfurt
May 20, 2017
14:30
Explanation of this regular expression is here.
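Since that explanation isn't reproduced above, here is the same pattern written with re.VERBOSE and commented, run against the string from this page purely for illustration:

import re

# The same pattern as above, spelled out with re.VERBOSE so each part is commented.
pattern = re.compile(r"""
    (.*?)                 # group 1: the place (lazy, up to the first comma)
    (?:,.*?)              # skip the comma and the day-of-week part (not captured)
    (                     # group 2: the date
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec)   # month prefix (note: Nov is absent in the original pattern)
        \s+\d+,\s+\d+     # day, comma, year
    )
    .*?                   # anything between the date and the time
    (\d+:\d+)             # group 3: the time
""", re.VERBOSE)

data = "Waldstadion Frankfurt, Saturday, May 20, 2017 - 14:30"
place, date, time = pattern.search(data).groups()
print(place)  # Waldstadion Frankfurt
print(date)   # May 20, 2017
print(time)   # 14:30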
You can do it with plain JS before parsing it.
document.getElementById("match-head").
children[0].
innerText.
split(/[,-]/).
splice(1,2).
join("")
// produces " Saturday May 20"
The first three statements are just W3C DOM; the last 3 are array manipulation to extract the second and third items separated by "-" or "," characters and join them back together.
My first choice, dateutil.parser, wasn't able to find the date, so I used a simple regex to extract it. The only caveat is that the date must begin with the full month name and end at a dash, a period, or a newline.
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.fourfourtwo.com/statszone/22-2016/matches/861695/team-stats/6339/0_SHOT_01#tabs-wrapper-anchor"
soup = BeautifulSoup(requests.get(url).text, "lxml")
pattern = "(?:January|February|March|April|May|June|July|August|September|October|November|December)[^-\n.]+"
print(re.search(pattern, soup.select("div.teams")[0].text).group().strip())
Output:
May 20, 2017
Personally, I trust that the site will be more consistent about date format than, say, commas or whitespace, but here's a version that relies on token position instead:
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.fourfourtwo.com/statszone/22-2016/matches/861695/team-stats/6339/0_SHOT_01#tabs-wrapper-anchor"
soup = BeautifulSoup(requests.get(url).text, "lxml")
print(" ".join(re.split("\s+", soup.select("div.teams")[0].text)[4:7]))
I'm trying to get the page number of the last page of this website
http://digitalmoneytimes.com/category/crypto-news/
This link shows that the last page number is 335, but I can't extract the page number.
soup = BeautifulSoup(page.content, 'html.parser')
soup_output= soup.find_all("li",{"class":"active"})
soup_output=soup.select(tag)
print(soup_output)
I get an empty list as the output
In order to get the last page of the given website, I would strongly recommend using the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://digitalmoneytimes.com/category/crypto-news/")
soup = BeautifulSoup(page.content, 'html.parser')
soup = soup.find_all("a", href = True)
pages = []
for x in soup:
    if "http://digitalmoneytimes.com/category/crypto-news/page/" in str(x):
        pages.append(x)
last_page = pages[2].getText()
where last_page is equal to the last page. Since I don't have access to your tag and page variables, I can't really tell you where the problem is in your code.
Really hope this solves your problem.
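If the position-based pages[2] ever turns out to be fragile, a small variant is to take the largest page number seen in the pagination links instead (a sketch, assuming those hrefs contain /page/<number>/):

import re
import requests
from bs4 import BeautifulSoup

page = requests.get("http://digitalmoneytimes.com/category/crypto-news/")
soup = BeautifulSoup(page.content, "html.parser")

# Collect every page number that appears in a pagination link
# and take the largest one, instead of relying on list position.
page_nums = []
for a in soup.find_all("a", href=True):
    match = re.search(r"/category/crypto-news/page/(\d+)/?", a["href"])
    if match:
        page_nums.append(int(match.group(1)))

if page_nums:
    print(max(page_nums))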
If it is about getting the last page number, there is something you might try out as well:
import requests
from bs4 import BeautifulSoup
link = 'http://digitalmoneytimes.com/category/crypto-news/'
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
last_page_num = soup.find(class_="pagination-next").find_previous_sibling().text
print(last_page_num)
Output:
336
I want to scrape the links from each page, move on to the next pages, and do the same. Here is my code to scrape links from the first page:
import requests
from bs4 import BeautifulSoup
page='https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet'
request = requests.get(page)
soup = BeautifulSoup(request.text,'lxml')
links= soup.findAll('a',class_='search-list__item')
url=[]
prefix = "https://www.booli.se"
for link in links:
    url.append(prefix+link["href"])
I tried the following for the first three pages, but it didn't work.
import re
import requests
from bs4 import BeautifulSoup
url=[]
prefix = "https://www.booli.se"
with requests.Session() as session:
    for page in range(4):
        response = session.get("https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=%f" % page)
        soup = BeautifulSoup(response.content, "html.parser")
        links = soup.findAll('a', class_='search-list__item')
        for link in links:
            url.append(prefix + link["href"])
First you have to create code that works correctly with one page.
Then you have to put your scraping code in a loop:
url = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=1"
while True:
    # code goes here
You will notice there is a page=number at the end of the link.
You have to run the loop over these URLs, changing the page=number each time:
i=1
url = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=" + str(i)
while True:
    i = i+1
    page = requests.get(url)
    if page.status_code != 200:
        break
    url = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=" + str(i)

    # Your scraping code goes here
    #
    #
I have used an if statement so that the loop does not go on forever; it will run up to the last page.
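Combining this loop with the link-scraping code from your first snippet might look roughly like this (a sketch under the same assumptions, not tested against the live site):

import requests
from bs4 import BeautifulSoup

prefix = "https://www.booli.se"
base = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page="

urls = []
i = 1
while True:
    page = requests.get(base + str(i))
    if page.status_code != 200:
        # Stop once the site stops returning pages for this number.
        break
    soup = BeautifulSoup(page.text, 'lxml')
    links = soup.findAll('a', class_='search-list__item')
    if not links:
        # Some sites return 200 even for out-of-range pages; an empty
        # result list is a second stopping condition, just in case.
        break
    for link in links:
        urls.append(prefix + link["href"])
    i = i + 1

print(len(urls))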
Yes, I did it. Thank you. Here is the code for the first two pages:
urls=[]
for page in range(3):
    urls.append("https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page={}".format(page))

page = urls[1:]
#page

import requests
from bs4 import BeautifulSoup

inturl = []
for page in page:
    request = requests.get(page)
    soup = BeautifulSoup(request.text, 'lxml')
    links = soup.findAll('a', class_='search-list__item')
    prefix = "https://www.booli.se"
    for link in links:
        inturl.append(prefix + link["href"])