R: readLines on a URL leads to missing lines

When I readLines() on a URL, I get missing lines or values. This might be due to spacing that the computer can't read.
For the IMDb search URL used in the answer below, Ctrl + F in the browser finds 38 instances of text matching "TV-". On the other hand, when I run readLines() and then grep("TV-", HTML), I only find 12.
So, how can I avoid encoding/spacing errors so that I can get complete lines of the HTML?

You can use rvest to scrape the data. For example, to get all the titles you can do:
library(rvest)
url <- 'https://www.imdb.com/search/title/?locations=Vancouver,%20British%20Columbia,%20Canada&start=1.json'
url %>%
  read_html() %>%
  html_nodes('div.lister-item-content h3 a') %>%
  html_text() -> all_titles
all_titles
# [1] "The Haunting of Bly Manor" "The Haunting of Hill House"
# [3] "Supernatural" "Helstrom"
# [5] "The 100" "Lucifer"
# [7] "Criminal Minds" "Fear the Walking Dead"
# [9] "A Babysitter's Guide to Monster Hunting" "The Stand"
#...
#...
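As a side note, part of the Ctrl + F vs. grep() gap may simply be how grep() counts: it returns the indices of lines that contain at least one match, so a long line of HTML with several "TV-" occurrences only counts once. A minimal sketch that counts individual occurrences instead (reusing the url defined above):
html <- readLines(url)
# lines containing at least one "TV-" (what grep() reports)
length(grep("TV-", html))
# total "TV-" occurrences across all lines (closer to what Ctrl + F counts)
sum(lengths(regmatches(html, gregexpr("TV-", html))))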

Related

Objects scraped from web are character "empty" in R

I'm sorry to ask this question once again: I know a lot of people have asked this before, but even looking at the answers they received I still can't solve my problem.
The code I'm using was actually inspired by some of the answers I was able to find:
link <- "https://letterboxd.com/alexissrey/activity/"
page <- link %>% GET(config = httr::config(ssl_verifypeer = FALSE))%>% read_html
Until this point everything seems to be working ok, but then I try to run the following line...
names <- page %>% html_nodes(".prettify > a") %>% html_text()
... to download all the movie names on that page, but the object I get is empty.
It is worth mentioning that I've tried the same code for other pages (especially the ones mentioned by other users in their questions) and it worked perfectly.
So, can anyone see what I'm missing?
Thanks!
We can get the film links and names by using RSelenium.
Start the browser:
library(RSelenium)
library(rvest)

url <- 'https://letterboxd.com/alexissrey/activity/'
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)
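Depending on your connection, you may want to pause briefly after navigating so the JavaScript-rendered content has time to load before grabbing the page source (an extra step, not part of the original answer):
Sys.sleep(3)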
Get the links to the films with:
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="SentimentContainer"]/div[1]/div[1]') %>%
  html_text()
[1] "/film/the-power-of-the-dog/" "/nachotorresok/film/dune-2021/" "/furquerita/film/the-princess-switch/"
[4] "/film/fosse-verdon/" "/film/the-greatest-showman/" "/film/misery/"
[7] "/film/when-harry-met-sally/" "/film/stand-by-me/" "/film/things-to-come-2016/"
[10] "/film/bergman-island-2021/" "/film/king-lear-2018/" "/film/21-grams/"
[13] "/film/the-house-that-jack-built-2018/" "/film/dogville/" "/film/all-that-jazz/"
[16] "/alexissrey/list/peliculas-para-ver-en-omnibus/" "/film/in-the-mouth-of-madness/"
Get the movie names with:
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.target') %>%
  html_text()
[1] "The Power of the Dog" " ★★★½ review of Dune" " ★★★½ review of The Princess Switch"
[4] "Fosse/Verdon" "The Greatest Showman" "Misery"
[7] "When Harry Met Sally..." "Stand by Me" "Things to Come"
[10] "Bergman Island" "King Lear" "21 Grams"
[13] "The House That Jack Built" "Dogville" "All That Jazz"
[16] "Películas para ver en ómnibus" "In the Mouth of Madness"

Is there a way to scrape through multiple pages on a website in R

I am new to R and web scraping. For practice I am trying to scrape book titles from a fake website that has multiple pages ('http://books.toscrape.com/catalogue/page-1.html'), and then calculate certain metrics based on the book titles. There are 20 books on each page and 50 pages; I have managed to scrape and calculate metrics for the first 20 books, but I want to calculate the metrics for the full 1000 books on the website.
The current output looks like this:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
I want this to be 1000 books long instead of 20, which will allow me to use the same code to calculate the metrics, but for 1000 books instead of 20.
Code:
library(rvest)

url <- 'http://books.toscrape.com/catalogue/page-1.html'
url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles
What would be the best way to scrape every book from the website and make the list 1000 book titles long instead of 20? Thanks in advance.
Generate the 50 URLs, then iterate over them, e.g. with purrr::map:
library(rvest)
urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
titles <- purrr::map(
  urls,
  . %>%
    read_html() %>%
    html_nodes('h3 a') %>%
    html_attr('title')
)
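map() returns a list with one character vector per page; if you want a single vector of all 1000 titles (so the rest of your code can stay unchanged), flatten it, for example:
all_titles <- unlist(titles)
length(all_titles)  # 50 pages x 20 books = 1000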
Something like this, perhaps?
library(tidyverse)
library(rvest)
library(data.table)
# Vector with URLs to scrape
url <- paste0("http://books.toscrape.com/catalogue/page-", 1:50, ".html")
# Scrape to list
L <- lapply(url, function(x) {
  print(paste0("scraping: ", x, " ... "))
  data.table(titles = read_html(x) %>%
               html_nodes('h3 a') %>%
               html_attr('title'))
})
# Bind list to single data.table
data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
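If you assign the combined table, you can check that every page was picked up, for example:
DT <- data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
nrow(DT)  # expect 1000 rows: 50 pages x 20 titles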

How to find the frequency of words in book titles that I have scraped from a website in R

I am very new to R and web scraping. For practice I am scraping book titles from a website and working out some basic stats using the titles. So far I have managed to scrape the book titles, add them to a table, and find the mean length of the books.
I now want to find the most commonly used word in the book titles; it is probably 'the', but I want to prove this using R. At the moment my program only looks at the full book titles, so I need to split them into individual words to count how often each different word appears. However, I am not sure how to do this.
Code:
library(rvest)

url <- 'http://books.toscrape.com/index.html'
bookNames <- read_html(url) %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "product_pod", " "))]//a') %>%
  html_text()
View(bookNames)
values <- lapply(bookNames, nchar)
mean(unlist(values))
bookNames <- tolower(bookNames)
sort(table(bookNames), decreasing = TRUE)[1:2]
I think splitting every word into a new list would solve my problem, yet I am not sure how to do this.
Thanks in advance.
You can get all the book titles with:
library(rvest)
url <- 'http://books.toscrape.com/index.html'
url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles
# [1] "A Light in the Attic"
# [2] "Tipping the Velvet"
# [3] "Soumission"
# [4] "Sharp Objects"
# [5] "Sapiens: A Brief History of Humankind"
# [6] "The Requiem Red"
# [7] "The Dirty Little Secrets of Getting Your Dream Job"
#....
To get the most common words in the titles, you can split the strings on whitespace and use table() to count the frequencies.
head(sort(table(tolower(unlist(strsplit(titles, '\\s+')))), decreasing = TRUE))
#  the    a   of  #1)  and  for
#   14    3    3    2    2    2
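If you want to ignore punctuation when counting (note the "#1)" token above), you can strip it first, for example:
words <- tolower(unlist(strsplit(titles, '\\s+')))
words <- gsub('[[:punct:]]', '', words)
head(sort(table(words[words != '']), decreasing = TRUE))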

Rvest: Headlines returning empty list

I'm trying to replicate a tutorial on rvest. However, at the start I'm already having issues. This is the code I'm using:
library(rvest)
# Specifying the url for the desired website to be scraped
url <- 'https://www.nytimes.com/section/politics'
# Reading the HTML code from the website - headlines
webpage <- read_html(url)
headline_data <- html_nodes(webpage, '.story-link a, .story-body a')
When I look at headline_data, the result is
{xml_nodeset (0)}
But in the tutorial it returns a list of length 48
{xml_nodeset (48)}
Any reason for the discrepancy?
As mentioned in the comments, there are no elements with the specified class you are searching for.
To begin, based on the current tags you can get the headlines with:
library(rvest)
library(dplyr)
url <- 'https://www.nytimes.com/section/politics'
url %>%
  read_html() %>%
  html_nodes("h2.css-l2vidh a") %>%
  html_text()
#[1] "Trump’s Secrecy Fight Escalates as Judge Rules for Congress in Early Test"
#[2] "A Would-Be Trump Aide’s Demands: A Jet on Call, a Future Cabinet Post and More"
#[3] "He’s One of the Biggest Backers of Trump’s Push to Protect American Steel. And He’s Canadian."
#[4] "Accountants Must Turn Over Trump’s Financial Records, Lower-Court Judge Rules"
and to get the individual URLs of those headlines you could do:
url %>%
  read_html() %>%
  html_nodes("h2.css-l2vidh a") %>%
  html_attr("href") %>%
  paste0("https://www.nytimes.com", .)
#[1] "https://www.nytimes.com/2019/05/20/us/politics/mcgahn-trump-congress.html"
#[2] "https://www.nytimes.com/2019/05/20/us/politics/kris-kobach-trump.html"
#[3] "https://www.nytimes.com/2019/05/20/us/politics/hes-one-of-the-biggest-backers-of-trumps-push-to-protect-american-steel-and-hes-canadian.html"
#[4] "https://www.nytimes.com/2019/05/20/us/politics/trump-financial-records.html"

html_nodes returning two results for a link

I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note, the use of download.file is to get around my company's firewall, per this answer
library(dplyr)
library(rvest)
myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")
links <- content %>%
  html_nodes("a") %>% # Note that I don't know the significance of "a"; this was trial and error
  html_attr("href") %>%
  data.frame()
# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the dataframe
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath vs CSS selectors and picks the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .) %>%
  head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
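From there you could download and read one of the files, sticking with download.file() since it already gets through the firewall. A sketch; dl_links is just a name introduced here, and the destination filename is taken from the first link shown above:
dl_links <- html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .)
download.file(dl_links[1], destfile = "aact_ali01.tsv.gz", mode = "wb")
aact_ali01 <- read.delim(gzfile("aact_ali01.tsv.gz"))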
