Scrape America's Career InfoNet - web-scraping

I've got employer IDs, which can be used to get the business area:
https://www.careerinfonet.org/employ4.asp?emp_id=558742391
The HTML contains the data in tr/td tables:
Business Description: Exporters (Whls)
Primary Industry: Other Miscellaneous Durable Goods Merchant Wholesalers
Related Industry: Sporting and Athletic Goods Manufacturing
So I would like to get
Exporters (Whls)
Other Miscellaneous Durable Goods Merchant Wholesalers
Sporting and Athletic Goods Manufacturing
My example code looks like this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find('td', class_='content')
for td in div.find_all('td'):
    print(td.text)

I would like to preface this by saying that this technique is fairly sloppy, but it gets the job done, assuming each page you scrape has a similar setup.
Your code already handles fetching the page itself; I simply add a check on each element to determine whether it is the "Business Description:", "Primary Industry:" or "Related Industry:" label. Then you can access the element that follows it and use that.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find('td', class_='content')
lst = div.find_all('td')
for td in lst:
    if td.text == "Business Description:":
        print(lst[lst.index(td)+1].text)  # the value cell sits right after its label cell
    if td.text == "Primary Industry:":
        print(lst[lst.index(td)+1].text)
    if td.text == "Related Industry:":
        print(lst[lst.index(td)+1].text)
The other small change I made is storing the result of div.find_all('td') in a variable, so it can be indexed to access the element you want.
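If you would rather not rely on list positions, here is a sketch of a variant that pairs each label cell with the cell that follows it via BeautifulSoup's find_next(); like the code above, it assumes the value <td> always comes directly after its label:

import requests
from bs4 import BeautifulSoup

# Labels whose following cell we want to print.
LABELS = {"Business Description:", "Primary Industry:", "Related Industry:"}

page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, "html.parser")
content = soup.find("td", class_="content")

for td in content.find_all("td"):
    if td.get_text(strip=True) in LABELS:
        value = td.find_next("td")  # the cell right after the label cell
        if value:
            print(value.get_text(strip=True))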
Hope it helps!

Related

Extract text from a site using Scrapy spider

I am trying to extract the description of a book from the Amazon site. Note: I am using a Scrapy spider.
This is the link to the Amazon book: https://www.amazon.com/Local-Woman-Missing-Mary-Kubica/dp/1665068671
This is the div that contains the description text:
<div aria-expanded="true" class="a-expander-content a-expander-partial-collapse-content a-expander-content-expanded" style="padding-bottom: 20px;">
  <p><span class="a-text-bold">MP3 CD Format</span></p>
  <p><span class="a-text-bold">People don’t just disappear without a trace…</span></p>
  <p class="a-text-bold"><span class="a-text-bold">Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.</span></p>
  <p class="a-text-bold"><span class="a-text-bold">Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…</span></p>
  <p class="a-text-bold"><span class="a-text-bold">In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried.</span></p>
  <p></p>
</div>
Actually, I tried these lines:
div = response.css(".a-expander-content.a-expander-partial-collapse-content.a-expander-content-expanded")
description = " ".join([re.sub('<.*?>', '', span) for span in response.css('.a-expander-content span').extract()])
but they are not working as expected. If you have any idea, please share it here. Thanks in advance.
Here is the scrapy code:
import scrapy
from scrapy import Request

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = ['https://www.amazon.com/dp/1665068671']

    def start_requests(self):
        yield Request(self.start_urls[0], callback=self.parse_book)

    def parse_book(self, response):
        description = "".join(response.css('[data-a-expander-name="book_description_expander"] .a-expander-content ::text').getall())
        yield {"description": description}
Output:
{'description': ' MP3 CD FormatPeople don’t just disappear without a trace…Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried. '}
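If you want to try the spider without setting up a full Scrapy project, it can also be run from a plain script with CrawlerProcess. This is just a sketch: it assumes the AmazonSpider class above is defined in the same file, the FEEDS setting requires Scrapy 2.1+, and the browser-like user agent is my own addition.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"description.json": {"format": "json"}},  # write the yielded item to a JSON file
    "USER_AGENT": "Mozilla/5.0",  # assumption: Amazon often blocks Scrapy's default user agent
})
process.crawl(AmazonSpider)
process.start()  # blocks until the crawl finishes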

How can I get tweet having only tweet ID without using twitter API?

I have a large number of Tweet IDs that have been collected by other people (https://github.com/echen102/us-pres-elections-2020), and I now want to get these tweets from those IDs. What should I do without the Twitter API?
Do you want the URL? It is: https://twitter.com/user/status/<tweet_id>
If you want the text of the tweet without using the API, you have to render the page and then scrape it.
You can do it with one module, requests-html:
from requests_html import HTMLSession
session = HTMLSession()
url = "https://twitter.com/user/status/1414963866304458758"
r = session.get(url)
r.html.render(sleep=2)
tweet_text = r.html.find('.css-1dbjc4n.r-1s2bzr4', first=True)
print(tweet_text.text)
Output:
Here’s a serious national security question: Why does the Biden administration want to protect COMMUNISM from blame for the Cuban Uprising? They attribute it to vaccines. Even if the Big Guy can’t comprehend it, Hunter could draw a picture for him.
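Since the question mentions a large number of IDs, the same approach can be wrapped in a loop. This is only a sketch: the .css-1dbjc4n.r-1s2bzr4 selector is the same auto-generated class combination used above, so it may break whenever Twitter changes its markup.

from requests_html import HTMLSession

tweet_ids = ["1414963866304458758"]  # e.g. IDs loaded from the us-pres-elections-2020 files

session = HTMLSession()
for tweet_id in tweet_ids:
    r = session.get(f"https://twitter.com/user/status/{tweet_id}")
    r.html.render(sleep=2)  # run the page's JavaScript so the tweet text is present
    node = r.html.find(".css-1dbjc4n.r-1s2bzr4", first=True)
    if node:
        print(tweet_id, node.text)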

scrapy giving a different output than on website, problem with geo location?

I'm really a newbie at all of this and am just trying to learn a bit more. I had a lot of help getting this going, but now I'm stuck on a very weird problem.
I am scraping info from a grocery store in Australia. I'm located in the state of Victoria, and when I go to the website the price of a Red Bull is $10.50, but as soon as I run my script I get $11.25.
I am guessing it might have to do with geolocation... but I'm not sure.
I basically need some help on where to look to find out how to get the right price, the one I see when I visit the website.
Also, I noticed that when I go to the same website from my phone it gives me the price of $11.25, but if I use the store's app I get the accurate price of $10.50.
import json
import scrapy

class SpidervenderSpider(scrapy.Spider):
    name = 'spidervender'
    allowed_domains = ['woolworths.com.au']
    start_urls = ['https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink']

    def parse(self, response):
        product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
        yield {
            'title': product_schema['name'],
            'price': product_schema['offers']['price']
        }
So the code works perfectly but the price is (I presume) for a different part of Australia.
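One thing worth trying is to send the request with the cookies your browser has after you pick your store, since the site most likely derives the store (and therefore the price) from them. This is only a sketch: the spider name and the cookie name and value below are placeholders, and you would need to copy the real cookies from your browser's DevTools (Application -> Cookies) while the $10.50 Victorian price is showing.

import json
import scrapy

class SpidervenderLocalSpider(scrapy.Spider):
    name = 'spidervender_local'
    allowed_domains = ['woolworths.com.au']

    def start_requests(self):
        # Placeholder cookie: replace with the real location/store cookies
        # copied from the browser session that shows the correct price.
        cookies = {'example-store-cookie': 'copied-from-browser'}
        yield scrapy.Request(
            'https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink',
            cookies=cookies,
            callback=self.parse,
        )

    def parse(self, response):
        product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
        yield {
            'title': product_schema['name'],
            'price': product_schema['offers']['price'],
        }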

Scraping "older" pages with scrapy, rules and link extractors

I have been working on a project with Scrapy. With help from this lovely community, I have managed to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am trying to scrape information from the "older" pages as well. I have researched CrawlSpider, rules and link extractors, and believed I had the proper code. I want the spider to perform the same loop on subsequent pages. Unfortunately, when I run it at the moment, it just spits out the first page and doesn't continue to the "older" pages.
I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February of 2004... I am new to data mining and not sure whether it is actually a realistic goal to scrape every post, but if it is, I would like to. Any help is appreciated. Thanks!
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}
If your intention is to fetch the data by traversing multiple pages, you don't need to go for Scrapy. If you still want a solution related to Scrapy, then I suggest you opt for Splash to handle the pagination.
I would do something like below to get the items (assuming you have already installed Selenium on your machine):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

while True:
    for item in wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='pb']"))):
        player = item.find_element_by_xpath(".//div[@class='player']/a").text
        player = player.encode()  # it should handle the encoding issue; I'm not totally sure, though
        print(player)
    try:
        idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='date']"))).text
        if "Jun 9" in idate:  # put here any date you wanna go back to (last limit: where the scraper will stop)
            break
        wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='cp1_ctl00_btnNavigate1']"))).click()
        wait.until(EC.staleness_of(item))
    except:
        break

driver.quit()
My suggestion: Selenium
If you want to change pages automatically, you can use Selenium WebDriver.
Selenium lets you interact with the page: click on buttons, write into inputs, and so on. You'll need to change your code to scrape the data and then click on the "older" button. Then it will change the page and keep scraping.
Selenium is a very useful tool. I'm using it right now on a personal project. You can take a look at my repo on GitHub to see how it works. In the case of the page that you're trying to scrape, you cannot reach the older pages just by changing the link to be scraped, so you need to use Selenium to switch between pages.
Hope it helps.
There is no need to use Selenium in this case. Before scraping, open the URL in a browser and press F12 to inspect the code and watch the packets in the Network tab. When you press Next, or "OLDER" in your case, you can see a new set of TCP packets in the Network tab. They provide everything you need. Once you understand how it works, you can write a working spider.
import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.<DOMAIN>/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}

        older = response.css('input#cp1_ctl00_btnNavigate1')
        if not older:
            return

        # Rebuild the ASP.NET form state and simulate a click on the "Older" button.
        inputs = response.css('div.aspNetHidden input')
        inputs.extend(response.css('div.RW_pn input'))
        formdata = {}
        for input in inputs:
            name = input.css('::attr(name)').extract_first()
            value = input.css('::attr(value)').extract_first()
            formdata[name] = value or ''
        formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
        formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
        del formdata['ctl00$cp1$ctl00$btnFilterResults']
        del formdata['ctl00$cp1$ctl00$btnNavigate1']

        action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'

        yield FormRequest(
            action_url,
            formdata=formdata,
            callback=self.parse
        )
Be careful: you need to replace <DOMAIN> with the correct domain everywhere in my code.

Python-requests: Can't scrape all the html code from a page

I am trying to scrape the content of the
Financial Times Search page.
Using Requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I can not find it in the Requests response, unlike the articles' titles or hyperlinks.
from bs4 import BeautifulSoup
import requests
url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")
def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print(ref.get('href'))
            print(ref.get('title'))
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page

Or:

def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print(ref.get('page next'))
If you can see the required links in the page source but are not able to get them via requests or urllib, it can mean one of two things:
1. There is something wrong with your logic. Let's assume it's not that.
2. Ajax: the parts of the page you are looking for are loaded by JavaScript after the document.onload event fires, so you cannot get something that isn't there in the first place.
My solutions (more like suggestions) are:
1. Reverse engineer the network requests. Difficult, but universally applicable; this is what I personally do. You might want to use the re module.
2. Find something that renders JavaScript, which is to say, simulate web browsing. You might want to check out the WebDriver component of Selenium, Qt, etc. This is easier, but kind of memory-hungry and it consumes a lot more network resources compared to option 1. A minimal sketch of this second approach follows.
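Here is a minimal sketch of that second suggestion, rendering the page with Selenium and then reading the next-page link. The li.page.next selector is taken from the question's own find_all call, and the sketch skips the login that the requests version passes with auth=, so treat it as an outline rather than a drop-in solution.

from selenium import webdriver

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'

driver = webdriver.Chrome()
driver.get(url)  # the browser executes the JavaScript that builds the pagination

# Selenium 3-style locator call, matching the other answers on this page.
next_links = driver.find_elements_by_css_selector('li.page.next a')
if next_links:
    print(next_links[0].get_attribute('href'))  # hyperlink of the next results page
else:
    print('No next-page link found (the selector may need adjusting).')

driver.quit()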
