Web crawler in R with heading and summary - r

I'm trying to extract links from here, along with the article heading and a brief summary for each link.
The output should contain the heading and the brief summary of each article listed on that page.
I'm able to get the links. Can you please suggest how I can get the heading and summary for each link? Please see my code below.
install.packages('rvest')
#Loading the rvest package
library('rvest')
library(xml2)
#Specifying the url of the website to be scraped
url <- 'http://money.howstuffworks.com/business-profiles.htm'
webpage <- read_html(url)
pg <- read_html(url)
head(html_attr(html_nodes(pg, "a"), "href"))

We can use purrr to inspect each node and extract the relevant information:
library(rvest)
library(purrr)
url <- 'http://money.howstuffworks.com/business-profiles.htm'
articles <- read_html(url) %>%
  html_nodes('.infinite-item > .media') %>%
  map_df(~{
    title <- .x %>%
      html_node('.media-heading > h3') %>%
      html_text()
    head <- .x %>%
      html_node('p') %>%
      html_text()
    link <- .x %>%
      html_node('p > a') %>%
      html_attr('href')
    data.frame(title, head, link, stringsAsFactors = FALSE)
  })
head(articles)
#> title
#> 1 How Amazon Same-day Delivery Works
#> 2 10 Companies That Completely Reinvented Themselves
#> 3 10 Trade Secrets We Wish We Knew
#> 4 How Kickstarter Works
#> 5 Can you get rich selling stuff online?
#> 6 Are the Golden Arches really supposed to be giant french fries?
#> head
#> 1 The Amazon same-day delivery service aims to get your package to you in no time at all. Learn how Amazon same-day delivery works. See more »
#> 2 You might be surprised at what some of today's biggest companies used to do. Here are 10 companies that reinvented themselves from HowStuffWorks. See more »
#> 3 Trade secrets are often locked away in corporate vaults, making their owners a fortune. Which trade secrets are the stuff of legend? See more »
#> 4 Kickstarter is a service that utilizes crowdsourcing to raise funds for your projects. Learn about how Kickstarter works at HowStuffWorks. See more »
#> 5 Can you get rich selling your stuff online? Find out more in this article by HowStuffWorks.com. See more »
#> 6 Are McDonald's golden arches really suppose to be giant french fries? Check out this article for a brief history of McDonald's golden arches. See more »
#> link
#> 1 http://money.howstuffworks.com/amazon-same-day-delivery.htm
#> 2 http://money.howstuffworks.com/10-companies-reinvented-themselves.htm
#> 3 http://money.howstuffworks.com/10-trade-secrets.htm
#> 4 http://money.howstuffworks.com/kickstarter.htm
#> 5 http://money.howstuffworks.com/can-you-get-rich-selling-online.htm
#> 6 http://money.howstuffworks.com/mcdonalds-arches.htm
Obligatory comment: In this case I saw no disclaimer against harvesting on their Terms and conditions, but always be sure to check the terms of a site before scraping it.
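Each summary in the output above ends with the link text "See more »", because the link sits inside the same paragraph. A minimal sketch of stripping that suffix, assuming the trailing text is always exactly "See more »" (the two sample strings are shortened from the output above, and `clean` is just an illustrative name):

```r
# Shortened sample summaries; "\u00bb" is the "»" character.
summaries <- c(
  "Learn how Amazon same-day delivery works. See more \u00bb",
  "Find out more in this article by HowStuffWorks.com. See more \u00bb"
)

# Drop the trailing "See more »" and any whitespace around it.
clean <- sub("\\s*See more\\s*\u00bb\\s*$", "", summaries)
clean
```

The same call could be applied to the scraped column, for example `articles$head <- sub("\\s*See more\\s*\u00bb\\s*$", "", articles$head)`.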

Related

Is there a way to put a wildcard character in a web address when using rvest?

I am new to web scraping and am using R and rvest to try to pull some info for a friend. This project might be a bit much for my first, but hopefully someone can help or tell me whether it is possible.
I am trying to pull info from https://www.veteranownedbusiness.com/mo, such as business name, address, phone number, and description. I started by pulling all the business names and was going to loop through each page to pull the information for each business. The problem I ran into is that the business URLs have numbers assigned to them:
www.veteranownedbusiness.com/business/**32216**/accel-polymers-llc
Is there a way to tell R to ignore this number or accept any entry in its spot so that I could loop through the business names?
Here is the code I have so far to get and clean the business titles if it helps:
library(rvest)
library(tibble)
library(stringr)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
HTML <- read_html(vet_name_list)
biz_names_html <- html_nodes(HTML, '.top_level_listing a')
biz_names <- html_text(biz_names_html)
biz_names <- biz_names[biz_names != ""]
biz_names_lower <- tolower(biz_names)
biz_names_sym <- gsub("[][!#$&%()*,.:;<=>@^_`|~.{}]", "", biz_names_lower)
biz_names_dub <- str_squish(biz_names_sym)
biz_name_clean <- chartr(" ", "-", biz_names_dub)
No, I'm afraid you can't use wildcards to get a valid URL. What you can do is scrape all the correct URLs from the page, number and all.
To do this, we find all the correct nodes (I'm using XPath here rather than CSS selectors since it gives a bit more flexibility) and then get the href attribute from each node.
This produces a data frame of business names and URLs. Here's a fully reproducible example:
library(rvest)
library(tibble)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
biz <- read_html(vet_name_list) %>%
  html_nodes(xpath = "//tr[@class='top_level_listing']/td/a[@href]")
tibble(business = html_text(biz),
       url = paste0(biz %>% html_attr("href")))
#> # A tibble: 550 x 2
#> business url
#> <chr> <chr>
#> 1 Accel Polymers, LLC /business/32216/ac~
#> 2 Beacon Car & Pet Wash /business/35987/be~
#> 3 Compass Quest Joplin Regional Veteran Services /business/21943/co~
#> 4 Financial Assistance for Military Experiencing Divorce /business/20797/fi~
#> 5 Focus Marines Foundation /business/29376/fo~
#> 6 Make It Virtual Assistant /business/32204/ma~
#> 7 Malachi Coaching & Training Ministries Int'l /business/24060/ma~
#> 8 Mike Jackson - Author /business/29536/mi~
#> 9 The Mission Continues /business/14492/th~
#> 10 U.S. Small Business Conference & EXPO /business/8266/us-~
#> # ... with 540 more rows
Created on 2022-08-05 by the reprex package (v2.0.1)
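The hrefs in that tibble are relative paths, so they need to be resolved against the site before they can be visited in a loop. A minimal sketch using `xml2::url_absolute()`, run on a small inline HTML stand-in rather than the live page (the markup below is a simplified assumption about the listing table):

```r
library(xml2)

# Simplified stand-in for the listing page (an assumption, not the real markup).
html <- '<table>
  <tr class="top_level_listing"><td><a href="/business/32216/accel-polymers-llc">Accel Polymers, LLC</a></td></tr>
  <tr class="top_level_listing"><td><a href="/business/14492/the-mission-continues">The Mission Continues</a></td></tr>
</table>'

page  <- read_html(html)
links <- xml_find_all(page, "//tr[@class='top_level_listing']/td/a[@href]")

# Resolve each relative href against the page it was scraped from.
biz <- data.frame(
  business = xml_text(links),
  url      = url_absolute(xml_attr(links, "href"),
                          "https://www.veteranownedbusiness.com/mo"),
  stringsAsFactors = FALSE
)
biz
```

Each value in `biz$url` can then be passed to `read_html()` in turn, ideally with a pause between requests.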

Why can't I read clickable links for webscraping with rvest?

I am trying to webscrape this website.
The content I need is available after clicking on each title. I can get the content I want if I do this for example (I am using SelectorGadget):
library("rvest")
url_boe ="https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I would need to get each text for each clickable link in the website. So I usually do:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
I get an empty object though. I tried different variants of the code but with the same result.
How can I read those links and then apply the code in the first part to all the links?
Can anyone help me?
Thanks!
As @KonradRudolph noted, the links are inserted dynamically into the webpage, so a plain read_html() call never sees them. I therefore wrote some code using RSelenium and rvest to tackle this issue:
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)
# Get links names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
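The question also asks how to apply the single-page extraction to every link. One possible sketch wraps the `"#output .page-section"` selector from the question in a helper and demonstrates it on inline HTML; `extract_speech_text` is a hypothetical name, and the inline markup is only an assumption about how a speech page is laid out:

```r
library(rvest)

# Hypothetical helper: collapse all matching sections of a parsed page into one string.
extract_speech_text <- function(page) {
  sections <- html_text(html_nodes(page, "#output .page-section"))
  paste(sections, collapse = "\n")
}

# Inline stand-in for a speech page (assumed structure).
demo_page <- read_html(
  '<div id="output">
     <div class="page-section">First part.</div>
     <div class="page-section">Second part.</div>
   </div>'
)
demo_text <- extract_speech_text(demo_page)
demo_text

# With the real data frame from above, something like this could be used
# (not run here; it needs network access and should pause between requests):
# df$text <- vapply(df$links, function(u) {
#   Sys.sleep(1)
#   extract_speech_text(read_html(u))
# }, character(1))
```

Whether a plain `read_html()` is enough for the individual speech pages is taken from the question, where it worked for one of them.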

Could you please help me with web scraping using Rvest?

I am currently trying to webscrape the following website: https://chicago.suntimes.com/crime/archives
I have been relying on the CSS SelectorGadget to find the XPath and do web scraping. However, I am unable to use the gadget on this website, so I would have to use Inspect Source to find what I need. I have been trying to find the relevant CSS and XPath by scrolling through the source, but I have not managed to do it.
Could you please help me find the xpath or css for
Title
Author
Date
I am so sorry if this is a dry laundry list of everything... but I am really stuck. I will really appreciate if you could give me some help!
Thank you very much.
If you find the relevant tag and its class for each element you want to extract, using SelectorGadget, you'll be able to get what you want.
library(rvest)
url <- 'https://chicago.suntimes.com/crime/archives'
webpage <- url %>% read_html()
title <- webpage %>% html_nodes('h2.c-entry-box--compact__title') %>% html_text()
author <- webpage %>% html_nodes('span.c-byline__author-name') %>% html_text()
date <- webpage %>% html_nodes('time.c-byline__item')%>% html_text() %>% trimws()
result <- data.frame(title, author, date)
result
# title author date
#1 Belmont Cragin man charged with carjacking in Little Village: police Sun-Times Wire February 17
#2 Gas station robbed, man carjacked in Horner Park Jermaine Nolen February 17
#3 8 shot, 2 fatally, Tuesday in Chicago Sun-Times Wire February 17
#4 Businesses robbed at gunpoint on the Northwest Side: police Sun-Times Wire February 17
#5 Man charged with carjacking in Aurora Sun-Times Wire February 16
#6 Woman fatally stabbed in Park Manor apartment Sun-Times Wire February 16
#7 Woman critically hurt by gunfire in Woodlawn David Struett February 16
#8 Teen boy, 17, charged with attempted carjacking in Back of the Yards Sun-Times Wire February 16
#...
#...
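One caveat with building the columns separately: `data.frame()` fails if an article is missing one of the fields, because the vectors end up with different lengths. A sketch of a per-article variant, shown on inline HTML; the container selector `div.c-entry-box--compact` is an assumption and not taken from the answer above:

```r
library(rvest)

# Inline stand-in for the archive page; the second entry has no author.
html <- '<div class="c-entry-box--compact">
  <h2 class="c-entry-box--compact__title">Story one</h2>
  <span class="c-byline__author-name">Sun-Times Wire</span>
  <time class="c-byline__item"> February 17 </time>
</div>
<div class="c-entry-box--compact">
  <h2 class="c-entry-box--compact__title">Story two</h2>
  <time class="c-byline__item"> February 16 </time>
</div>'

# Select one node per article, then look up each field inside it.
# html_node() (singular) returns a missing node when nothing matches,
# which html_text() turns into NA, so the rows stay aligned.
entries <- html_nodes(read_html(html), "div.c-entry-box--compact")
result <- data.frame(
  title  = html_text(html_node(entries, "h2.c-entry-box--compact__title")),
  author = html_text(html_node(entries, "span.c-byline__author-name")),
  date   = trimws(html_text(html_node(entries, "time.c-byline__item"))),
  stringsAsFactors = FALSE
)
result
```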

Using rvest to creating a database from multiple XML files

Using R to extract relevant data from multiple online XML files to create a database
I just started to learn R to do text analysis. Here is what I am trying to do: I'm trying to use rvest in R to create a CSV database of bill summaries from the 116th Congress from online XML files. The database should have two columns:
The title of the bill.
The summary text of the bill.
The website source is https://www.govinfo.gov/bulkdata/BILLSUM/116/hr
The issue I am having is that I would like to collect all the speeches that are returned from the search, so I need to scrape multiple links. But I don't know how to make R run a function over a series of different links and then extract the expected data.
I have tried the following code, but I am not sure exactly how to apply it to my specific problem. I also got an error from my code. Please see my code below. Thanks in advance for any help!
library(rvest)
library(tidyverse)
library(purrr)
html_source <- "https://www.govinfo.gov/bulkdata/BILLSUM/116/hr?page="
map_df(1:997, function(i) {
cat(".")
pg <- read_html(sprintf(html_source, i))
data.frame(title = html_text(html_nodes(pg, "title")),
bill_text %>% html_node("summary-text") %>% html_text(),
stringsAsFactors = FALSE)
}) -> Bills
Error in open.connection(x, "rb") : HTTP error 406.
At the bottom of that page is a link to a zipfile with all of the XML files, so instead of scraping each one individually (which will get onerous with a suggested crawl-delay of 10s) you can just download the zipfile and parse the XML files with xml2 (rvest is for HTML):
library(xml2)
library(purrr)
local_dir <- "~/Downloads/BILLSUM-116-hr"
local_zip <- paste0(local_dir, '.zip')
download.file("https://www.govinfo.gov/bulkdata/BILLSUM/116/hr/BILLSUM-116-hr.zip", local_zip)
# returns vector of paths to unzipped files
xml_files <- unzip(local_zip, exdir = local_dir)
bills <- xml_files %>%
  map(read_xml) %>%
  map_dfr(~list(
    # note xml2 functions only take XPath selectors, not CSS ones
    title = xml_find_first(.x, '//title') %>% xml_text(),
    summary = xml_find_first(.x, '//summary-text') %>% xml_text()
  ))
bills
#> # A tibble: 1,367 x 2
#> title summary
#> <chr> <chr>
#> 1 For the relief of certain aliens w… Provides for the relief of certain …
#> 2 "To designate the facility of the … "Designates the facility of the Uni…
#> 3 Consolidated Appropriations Act, 2… <p><b>Consolidated Appropriations A…
#> 4 Financial Institution Customer Pro… <p><strong>Financial Institution Cu…
#> 5 Zero-Baseline Budget Act of 2019 <p><b>Zero-Baseline Budget Act of 2…
#> 6 Agriculture, Rural Development, Fo… "<p><b>Highlights: </b></p> <p>This…
#> 7 SAFETI Act <p><strong>Security for the Adminis…
#> 8 Buy a Brick, Build the Wall Act of… <p><b>Buy a Brick, Build the Wall A…
#> 9 Inspector General Access Act of 20… <p><strong>Inspector General Access…
#> 10 Federal CIO Authorization Act of 2… <p><b>Federal CIO Authorization Act…
#> # … with 1,357 more rows
The summary column is HTML-formatted, but by and large this is pretty clean already.
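If plain text is needed instead, the markup in the summary column can be parsed and its text extracted. A minimal sketch with xml2; `strip_html` is a hypothetical helper, and the sample strings are made up for illustration rather than taken from the bill data:

```r
library(xml2)

# Hypothetical helper: parse each string as an HTML fragment and keep only its text.
strip_html <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    # Wrap in a <div> so strings without any tags are still parsed as HTML.
    doc <- read_html(paste0("<div>", s, "</div>"))
    # Collapse runs of whitespace left over from the markup.
    gsub("\\s+", " ", trimws(xml_text(doc)))
  }, character(1), USE.NAMES = FALSE)
}

# Made-up examples: one HTML-formatted summary, one already plain.
plain <- strip_html(c(
  "<p><b>SAFETI Act</b> requires X.</p>",
  "Provides for relief."
))
plain
```

Applied to the tibble above, this would be `bills$summary <- strip_html(bills$summary)`.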

Using Rvest to webscrape rankings

I am wanting to scrape all rankings from the following website:
https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating
I have tried using CSS selectors, which tell me to use ".ratingNum", but it leaves me with blank data. I have also tried using the GET function, which results in a similar problem.
# Attempt 1
url <- 'https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating'
webpage <- read_html(url)
rank_data_html <- html_nodes(webpage,'.rankingNum')
rank_data <- html_table(rank_data_html)
head(rank_data)
# Attempt 2
res <- GET("https://www.glassdoor.com/ratingsDetails/full.htm",
query=list(employerId="432",
employerName="McDonalds"))
doc <- read_html(content(res, as="text"))
html_nodes(doc, ".ratingNum")
rank_data <- html_table(rank_data_html)
head(rank_data)
I expect the result to give me a list of all of the rankings, but instead it is giving me an empty list, or a list that doesn't include the rankings.
Your list is empty because you're GETing an unpopulated HTML document. Frequently when this happens you have to resort to RSelenium and co., but Glassdoor's public-facing API actually has everything you need – if you know where to look.
(Note: I'm not sure if this is officially part of Glassdoor's public API, but I think it's fair game if they haven't made more of an effort to conceal it. I tried to find some information, but their documentation is pretty meager. Usually companies will look the other way if you're just doing a smallish analysis and not slamming their servers or trying to profit from their data, but it's still a good idea to heed their ToS. You might want to shoot them an email describing what you're doing, or even ask about becoming an API partner. Make sure you adhere to their attribution rules. Continue at your own peril.)
Take a look at the network tab in your browser's developer tools. You will see some GET requests that return JSON, and one of those has the address you need. Send a GET request and parse the JSON:
library(httr)
library(purrr)
library(dplyr)
ratings <- paste0("https://www.glassdoor.com/api/employer/432-rating.htm?",
                  "locationStr=&jobTitleStr=&filterCurrentEmployee=false")
req_obj <- GET(ratings)
cont <- content(req_obj)
ratings_df <- map(cont$ratings, bind_cols) %>% bind_rows()
ratings_df
You should end up with a data frame containing the ratings data. Just don't forget that "ceoRating", "bizOutlook", and "recommend" are proportions from 0 to 1 (or percentages if multiplied by 100), while the rest reflect average user ratings on a 5-point scale:
# A tibble: 9 x 3
hasRating type value
<lgl> <chr> <dbl>
1 TRUE overallRating 3.3
2 TRUE ceoRating 0.72
3 TRUE bizOutlook 0.42
4 TRUE recommend 0.570
5 TRUE compAndBenefits 2.8
6 TRUE cultureAndValues 3.1
7 TRUE careerOpportunities 3.2
8 TRUE workLife 3.1
9 TRUE seniorManagement 2.9
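Since the two scales are mixed in one column, it can help to label them explicitly. A small base R sketch, using a hand-built data frame with four of the rows shown above; the `scale` and `display` columns are illustrative additions, not part of the API response:

```r
# Four rows copied from the output above.
ratings_df <- data.frame(
  type  = c("overallRating", "ceoRating", "bizOutlook", "recommend"),
  value = c(3.3, 0.72, 0.42, 0.57),
  stringsAsFactors = FALSE
)

# These three types are proportions (0 to 1); everything else is on a 5-point scale.
prop_types <- c("ceoRating", "bizOutlook", "recommend")
is_prop <- ratings_df$type %in% prop_types

ratings_df$scale   <- ifelse(is_prop, "percent", "5-point")
ratings_df$display <- ifelse(is_prop, ratings_df$value * 100, ratings_df$value)
ratings_df
```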

Resources