Could you please help me with web scraping using rvest?

I am currently trying to scrape the following website: https://chicago.suntimes.com/crime/archives
I have been relying on the SelectorGadget tool to find the XPath or CSS selectors I need for web scraping. However, I am unable to use the gadget on this website, so I would have to use Inspect Source to find what I need. I have been trying to find the relevant CSS and XPath by scrolling through the source, but with my limited skills I was not able to.
Could you please help me find the xpath or css for
Title
Author
Date
I am so sorry if this is a dry laundry list of everything... but I am really stuck. I will really appreciate if you could give me some help!
Thank you very much.

For each element that you want to extract, if you find the relevant tag with its respective class using SelectorGadget, you'll be able to get what you want.
library(rvest)
url <- 'https://chicago.suntimes.com/crime/archives'
webpage <- url %>% read_html()
title <- webpage %>% html_nodes('h2.c-entry-box--compact__title') %>% html_text()
author <- webpage %>% html_nodes('span.c-byline__author-name') %>% html_text()
date <- webpage %>% html_nodes('time.c-byline__item') %>% html_text() %>% trimws()
result <- data.frame(title, author, date)
result
# title author date
#1 Belmont Cragin man charged with carjacking in Little Village: police Sun-Times Wire February 17
#2 Gas station robbed, man carjacked in Horner Park Jermaine Nolen February 17
#3 8 shot, 2 fatally, Tuesday in Chicago Sun-Times Wire February 17
#4 Businesses robbed at gunpoint on the Northwest Side: police Sun-Times Wire February 17
#5 Man charged with carjacking in Aurora Sun-Times Wire February 16
#6 Woman fatally stabbed in Park Manor apartment Sun-Times Wire February 16
#7 Woman critically hurt by gunfire in Woodlawn David Struett February 16
#8 Teen boy, 17, charged with attempted carjacking in Back of the Yards Sun-Times Wire February 16
#...
#...
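If the three node sets ever differ in length (say, an article card without a byline), data.frame() will fail with a length mismatch. A more defensive sketch, assuming the article cards share a '.c-entry-box--compact' wrapper class (inferred from the class names above, so verify it in the page source), walks each card and records NA for missing fields:

```r
library(rvest)
library(purrr)

url <- 'https://chicago.suntimes.com/crime/archives'
webpage <- url %>% read_html()

# html_node() (singular) returns a missing node for an absent child, and
# html_text() turns that into NA, so every card yields exactly one row
# even when a byline or date is missing.
result <- webpage %>%
  html_nodes('.c-entry-box--compact') %>%
  map_df(~ data.frame(
    title  = .x %>% html_node('h2.c-entry-box--compact__title') %>% html_text(trim = TRUE),
    author = .x %>% html_node('span.c-byline__author-name') %>% html_text(trim = TRUE),
    date   = .x %>% html_node('time.c-byline__item') %>% html_text(trim = TRUE),
    stringsAsFactors = FALSE
  ))
```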

Related

Is there a way to put a wildcard character in a web address when using rvest?

I am new to web scraping and am using R and rvest to try and pull some info for a friend. This project might be a bit much for my first, but hopefully someone can help or tell me if it is possible.
I am trying to pull info from https://www.veteranownedbusiness.com/mo like business name, address, phone number, and description. I started by pulling all the names of the businesses and was going to loop through each page to pull the information by business. The problem I ran into is that the business URLs have numbers assigned to them:
www.veteranownedbusiness.com/business/32216/accel-polymers-llc
Is there a way to tell R to ignore this number, or accept any entry in its spot, so that I could loop through the business names?
Here is the code I have so far to get and clean the business titles if it helps:
library(rvest)
library(tibble)
library(stringr)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
HTML <- read_html(vet_name_list)
biz_names_html <- html_nodes(HTML, '.top_level_listing a')
biz_names <- html_text(biz_names_html)
biz_names <- biz_names[biz_names != ""]
biz_names_lower <- tolower(biz_names)
biz_names_sym <- gsub("[][!#$&%()*,.:;<=>#^_`|~.{}]", "", biz_names_lower)
biz_names_dub <- str_squish(biz_names_sym)
biz_name_clean <- chartr(" ", "-", biz_names_dub)
No, I'm afraid you can't use wildcards to get a valid url. What you can do is to scrape all the correct urls from the page, number and all.
To do this, we find all the correct nodes (I'm using xpath here rather than css selectors since it gives a bit more flexibility). You then get the href attribute from each node.
This can produce a data frame of business names and url. Here's a fully reproducible example:
library(rvest)
library(tibble)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
biz <- read_html(vet_name_list) %>%
  html_nodes(xpath = "//tr[@class='top_level_listing']/td/a[@href]")

tibble(business = html_text(biz),
       url = html_attr(biz, "href"))
#> # A tibble: 550 x 2
#> business url
#> <chr> <chr>
#> 1 Accel Polymers, LLC /business/32216/ac~
#> 2 Beacon Car & Pet Wash /business/35987/be~
#> 3 Compass Quest Joplin Regional Veteran Services /business/21943/co~
#> 4 Financial Assistance for Military Experiencing Divorce /business/20797/fi~
#> 5 Focus Marines Foundation /business/29376/fo~
#> 6 Make It Virtual Assistant /business/32204/ma~
#> 7 Malachi Coaching & Training Ministries Int'l /business/24060/ma~
#> 8 Mike Jackson - Author /business/29536/mi~
#> 9 The Mission Continues /business/14492/th~
#> 10 U.S. Small Business Conference & EXPO /business/8266/us-~
#> # ... with 540 more rows
Created on 2022-08-05 by the reprex package (v2.0.1)
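With the relative URLs in hand, the per-business pages can then be fetched in a loop by prefixing the site's domain. A sketch under that assumption (the one-second pause and the five-page cap are just courtesy choices for testing; the selectors on each profile page still need to be inspected separately before you can pull the address or phone number):

```r
library(rvest)

base <- "https://www.veteranownedbusiness.com"

# Scrape the relative business URLs, number and all.
links <- read_html(paste0(base, "/mo")) %>%
  html_nodes(xpath = "//tr[@class='top_level_listing']/td/a[@href]") %>%
  html_attr("href")

# Fetch each business page; parse out the fields you need once you have
# identified the right selectors on an individual profile page.
pages <- lapply(head(links, 5), function(path) {
  Sys.sleep(1)  # be polite to the server
  read_html(paste0(base, path))
})
```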

Web scraping a table with R rvest

As an example to teach myself rvest, I attempted to scrape a website to grab data that's already written in a table format. The only problem is that I can't get an output of the underlying table data.
The only thing I really need is the player column.
library(tidyverse)
library(rvest)
base <- "https://www.milb.com/stats/"
base2 <- "?page="
base3 <- "&playerPool=ALL"
html <- read_html(paste0(base,"pacific-coast/","2017",base2,"2",base3))
html2 <- html %>% html_element("#stats-app-root")
html3 <- html2 %>% html_text("#stats-body-table player")
https://www.milb.com/stats/pacific-coast/2017?page=2&playerPool=ALL (easy way to see actual example url)
The html2 step appears to work, but I'm a little stuck about what to do from there. A couple of different attempts just hit a wall.
Once this works, I'll replace the text with numbers and do a few for loops (which seems pretty simple).
If you "inspect" the page in Chrome, you see it's making a call to download a JSON file. Just do that yourself...
library(jsonlite)
data <- fromJSON("https://bdfed.stitch.mlbinfra.com/bdfed/stats/player?stitch_env=prod&season=2017&sportId=11&stats=season&group=hitting&gameType=R&offset=25&sortStat=onBasePlusSlugging&order=desc&playerPool=ALL&leagueIds=112")
df <- data$stats
head(df)
year playerId playerName type rank playerFullName
1 2017 643256 Adam Cimber player 26 Adam Cimber
2 2017 458547 Vladimir Frias player 27 Vladimir Frias
3 2017 643265 Garrett Cooper player 28 Garrett Cooper
4 2017 542979 Keon Broxton player 29 Keon Broxton
5 2017 600301 Taylor Motter player 30 Taylor Motter
6 2017 624414 Christian Arroyo player 31 Christian Arroyo
...
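To collect more than one page, note that the offset= query parameter in the JSON URL advances 25 players at a time, mirroring the ?page= parameter on the site itself. A sketch that pages through the first few offsets and stacks the results, assuming the endpoint keeps accepting the same parameters:

```r
library(jsonlite)

# Everything except the offset stays fixed from the URL above.
base <- paste0(
  "https://bdfed.stitch.mlbinfra.com/bdfed/stats/player",
  "?stitch_env=prod&season=2017&sportId=11&stats=season&group=hitting",
  "&gameType=R&sortStat=onBasePlusSlugging&order=desc&playerPool=ALL",
  "&leagueIds=112&offset="
)

# Offsets 0, 25, 50, 75 correspond to the first four pages of 25 players.
pages <- lapply(seq(0, 75, by = 25), function(off) {
  fromJSON(paste0(base, off))$stats
})
df <- do.call(rbind, pages)
```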

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with ngrams, using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
But when I run this I get the following error message:
Error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- tibble(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?
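A quick way to answer that question yourself is to inspect the class of every column. A minimal sketch with a made-up data frame (the meta column here is invented purely to illustrate a list-column):

```r
# A data frame with one atomic column and one list-column, for illustration.
df <- data.frame(text = c("a b", "c d"), stringsAsFactors = FALSE)
df$meta <- list(1:2, 3:4)

sapply(df, class)
# text is "character"; meta is "list"

# Keep only atomic columns before calling unnest_tokens():
df_atomic <- df[, sapply(df, is.atomic), drop = FALSE]
```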

Web crawler in R with heading and summary

I'm trying to extract links from this page along with the article heading and a brief summary for each link.
The output should have the article heading and the brief summary of each article which is on the same page.
I'm able to get the links. Can you please suggest how I can get the heading and summary for each link? Please see my code below.
install.packages('rvest')
#Loading the rvest package
library('rvest')
library(xml2)
#Specifying the url for the desired website to be scraped
url <- 'http://money.howstuffworks.com/business-profiles.htm'
pg <- read_html(url)
head(html_attr(html_nodes(pg, "a"), "href"))
We can use purrr to inspect each node and extract the relevant information:
library(rvest)
library(purrr)
url <- 'http://money.howstuffworks.com/business-profiles.htm'
articles <- read_html(url) %>%
  html_nodes('.infinite-item > .media') %>%
  map_df(~{
    title <- .x %>%
      html_node('.media-heading > h3') %>%
      html_text()
    head <- .x %>%
      html_node('p') %>%
      html_text()
    link <- .x %>%
      html_node('p > a') %>%
      html_attr('href')
    data.frame(title, head, link, stringsAsFactors = F)
  })
head(articles)
#> title
#> 1 How Amazon Same-day Delivery Works
#> 2 10 Companies That Completely Reinvented Themselves
#> 3 10 Trade Secrets We Wish We Knew
#> 4 How Kickstarter Works
#> 5 Can you get rich selling stuff online?
#> 6 Are the Golden Arches really supposed to be giant french fries?
#> head
#> 1 The Amazon same-day delivery service aims to get your package to you in no time at all. Learn how Amazon same-day delivery works. See more »
#> 2 You might be surprised at what some of today's biggest companies used to do. Here are 10 companies that reinvented themselves from HowStuffWorks. See more »
#> 3 Trade secrets are often locked away in corporate vaults, making their owners a fortune. Which trade secrets are the stuff of legend? See more »
#> 4 Kickstarter is a service that utilizes crowdsourcing to raise funds for your projects. Learn about how Kickstarter works at HowStuffWorks. See more »
#> 5 Can you get rich selling your stuff online? Find out more in this article by HowStuffWorks.com. See more »
#> 6 Are McDonald's golden arches really suppose to be giant french fries? Check out this article for a brief history of McDonald's golden arches. See more »
#> link
#> 1 http://money.howstuffworks.com/amazon-same-day-delivery.htm
#> 2 http://money.howstuffworks.com/10-companies-reinvented-themselves.htm
#> 3 http://money.howstuffworks.com/10-trade-secrets.htm
#> 4 http://money.howstuffworks.com/kickstarter.htm
#> 5 http://money.howstuffworks.com/can-you-get-rich-selling-online.htm
#> 6 http://money.howstuffworks.com/mcdonalds-arches.htm
Obligatory comment: in this case I saw no disclaimer against harvesting in their terms and conditions, but always be sure to check a site's terms before scraping it.

Can't import this excel file into R

I'm having trouble importing a file into R. The file was obtained from this website: https://report.nih.gov/award/index.cfm, where I clicked "Import Table" and downloaded a .xls file for the year 1992.
Here's what I've tried typing into the console, along with the results:
Input:
> library('readxl')
> data1992 <- read_excel("1992.xls")
Output:
Not an excel file
Error in eval(substitute(expr), envir, enclos) :
Failed to open /home/chrx/Documents/NIH Funding Awards, 1992 - 2016/1992.xls
Input:
> data1992 <- read.csv ("1992.xls", sep ="\t")
Output:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I'm not sure whether or not this is relevant, but I'm using GalliumOS (linux). Because I'm using Linux, Excel isn't installed on my computer. LibreOffice is.
Why bother with getting the data in and out of a .csv if it's right there on the web page for you to scrape?
# note the query parameters in the url when you apply a filter, e.g. fy=
url <- 'http://report.nih.gov/award/index.cfm?fy=1992'
library('rvest')
library('magrittr')
library('dplyr')
df <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="orgtable"]') %>%
  html_table() %>%
  extract2(1) %>%
  mutate(Funding = as.numeric(gsub('[^0-9.]', '', Funding)))
head(df)
returns
Organization City State Country Awards Funding
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 356221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 1097158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 629946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 1757241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 2161146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 450411
If you need to loop through years 1992 to present, or something similar, this programmatic approach will save you a lot of time versus handling a bunch of flat files.
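For instance, the fy= parameter makes looping over fiscal years straightforward. A sketch wrapping the pipeline above in a function (the Year column is added here just to keep rows distinguishable after binding, and the 1992:1994 range is only an example):

```r
library(rvest)
library(magrittr)
library(dplyr)

# Fetch and clean the NIH awards table for one fiscal year.
get_year <- function(fy) {
  paste0('http://report.nih.gov/award/index.cfm?fy=', fy) %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="orgtable"]') %>%
    html_table() %>%
    extract2(1) %>%
    mutate(Funding = as.numeric(gsub('[^0-9.]', '', Funding)),
           Year = fy)
}

awards <- bind_rows(lapply(1992:1994, get_year))
```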
This works for me
library(gdata)
dat1 <- read.xls("1992.xls")
If you're on 32-bit Windows this will also work:
require(RODBC)
con <- odbcConnectExcel("1992.xls")  # opens a connection rather than returning data
dat1 <- sqlFetch(con, "Sheet1")      # the sheet name may differ in your file
odbcClose(con)
For several more options that rely on rJava-based packages like xlsx you can check out this link.
As someone mentioned in the comments it's also easy to save the file as a .csv and read it in that way. This will save you the trouble of dealing with the effects of strange formatting or metadata on your imported file:
dat1 <- read.csv("1992.csv")
head(dat1)
ORGANIZATION CITY STATE COUNTRY AWARDS FUNDING
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 $356,221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 $1,097,158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 $629,946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 $1,757,241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 $2,161,146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 $450,411
In my opinion, converting to .csv is also usually the fastest way (though speed is only an issue with big data).
