I want to get the names of all the meals listed on Wikipedia:
https://en.wikipedia.org/wiki/Lists_of_prepared_foods
How can I query it in R?
There is a query function, but I could not find a good example of how to use it.
I know there is a package called WikipediR that helps, but rvest could also be helpful:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/Lists_of_prepared_foods"
temp <- URL %>%
read_html %>%
html_nodes("#mw-content-text h3+ ul a , .column-width a") %>% html_text()
[1] "List of almond dishes" "List of ancient dishes" "List of avocado dishes"
[4] "List of bacon substitutes" "List of baked goods" "List of breakfast beverages"
[7] "List of breakfast cereals" "List of breakfast foods" "List of cabbage dishes"
[10] "List of cakes" "List of candies" "List of carrot dishes" ... (trunc. output)
EDIT
To scrape the names on each page, I advise you to use a loop: reuse the selector from above, but this time scrape the links instead of the text:
temp <- URL %>%
read_html %>%
html_nodes("#mw-content-text h3+ ul a , .column-width a") %>% html_attr('href')
temp
[1] "/wiki/List_of_almond_dishes" "/wiki/List_of_ancient_dishes"
[3] "/wiki/List_of_avocado_dishes" "/wiki/List_of_bacon_substitutes" ... (trunc. output)
Now you create an empty list to populate with the foods for each link:
# an empty list
listed <- list()
for (i in temp) {
# build the full URL from the site root plus the scraped path
# (the scraped paths already start with "/")
url <- paste0("https://en.wikipedia.org", i)
# each URL becomes one component of the list, holding the extracted names
listed[[i]] <- url %>%
read_html %>%
# make sure these are the right nodes; these selectors seem to work
html_nodes("h2~ ul li > a:nth-child(1) , a a") %>% html_text()
Sys.sleep(15) # very important: pause 15 seconds after each page
# so the site is not flooded with requests in a short time
}
As a result:
$`/wiki/List_of_almond_dishes`
[1] "Ajoblanco" "Almond butter" "Alpen (food)" "Amandine (culinary term)" "Amlu"
[6] "Bakewell tart" "Bear claw (pastry)" "Bethmännchen" "Biscuit Tortoni" "Blancmange"
[11] "Christmas cake" "Churchkhela" "Ciarduna" "Colomba di Pasqua" "Comfit"
[16] "Coucougnette" "Crème de Noyaux" "Cruncheroos" "Dacquoise" "Daim bar"
[21] "Dariole" "Esterházy torte" ... (trunc. output)
$`/wiki/List_of_ancient_dishes`
[1] "Anfu ham" "Babaofan" "Bread" "Flatbread" "Focaccia" "Mantou"
[7] "Chili pepper" "Chutney" "Congee" "Curry" "Doubanjiang" "Fish sauce"
[13] "Forcemeat" "Garum" "Ham" "Harissa" "Jeok" "Jusselle"
[19] "Liquamen" "Maccu" "Misu karu" "Moretum" "Nian gao" "Noodle" ... (trunc. output)
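If a single collection of names is more convenient than a list with one element per page, the result can be flattened. This is only a minimal sketch: the small `listed` below is a hand-made stand-in built from the first few names in the output above, not a fresh scrape, and the column names `food` and `page` are my own choice.

```r
# Stand-in for the scraped result: a named list of character vectors
listed <- list(
  "/wiki/List_of_almond_dishes"  = c("Ajoblanco", "Almond butter", "Amlu"),
  "/wiki/List_of_ancient_dishes" = c("Anfu ham", "Babaofan", "Bread")
)

# One plain character vector holding every name
all_names <- unlist(listed, use.names = FALSE)

# Or a long data frame that remembers which page each name came from
foods <- stack(listed)
names(foods) <- c("food", "page")
```

The data frame form makes it easy to count or filter the names per list page later on.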
I am using rvest to scrape an IMDB list and want to access the list of full cast and crew. Unfortunately, IMDB has created a summary page when you click on the title and it takes me to the wrong page.
This is the webpage I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt
This is the webpage I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl
Notice the addition of the /fullcredits in the URL.
How can I insert /fullcredits into the middle of a URL I have built?
#install.packages("rvest")
#install.packages("dplyr")
library(rvest) #webscraping package
library(dplyr) #piping
link = "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
credits = "fullcredits/"
page = read_html(link)
name <- page %>% rvest::html_nodes(".lister-item-header a") %>% rvest::html_text()
movie_link = page %>% rvest::html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste("https://www.imdb.com", ., sep="")
Here is an option: get the dirname and basename of each link, replace the substring of the basename with the new substring ("tt_ql_cl"), and join the parts again with file.path, inserting "fullcredits" in between.
library(stringr)
movie_link2 <- file.path(dirname(movie_link), "fullcredits",
str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))
-output
> head(movie_link2)
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
> tail(movie_link2)
[1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
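To see what each piece contributes, the same steps can be run on a single link. A small sketch using the example URL from the question; it uses base R's sub() in place of stringr::str_replace(), which is my substitution and not part of the answer above.

```r
url <- "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"

dir_part  <- dirname(url)   # everything before the last "/"
base_part <- basename(url)  # the trailing "?ref_=..." part

# Swap the ref tag (fixed = TRUE treats the pattern as plain text)
new_base <- sub("ttls_li_tt", "tt_ql_cl", base_part, fixed = TRUE)

# file.path() joins the pieces with "/"
full_credits <- file.path(dir_part, "fullcredits", new_base)
full_credits
```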
Another way:
df1 = gsub("\\?.*", "", movie_link)
df = paste0(df1, 'fullcredits/?ref_=tt_ql_cl')
df
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
[7] "https://www.imdb.com/title/tt0137523/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0108052/fullcredits/?ref_=tt_ql_cl"
[9] "https://www.imdb.com/title/tt0118749/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0105236/fullcredits/?ref_=tt_ql_cl"
[11] "https://www.imdb.com/title/tt0111161/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0073195/fullcredits/?ref_=tt_ql_cl"
[13] "https://www.imdb.com/title/tt0075314/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0119488/fullcredits/?ref_=tt_ql_cl"
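If the ref_ tag does not need to change (an assumption on my part; the question only asks for /fullcredits to be inserted), a single literal replacement is enough. A sketch on two links, where the second one is reconstructed from the IDs in the output above:

```r
movie_link <- c(
  "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt",
  "https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt"
)

# Replace the literal "/?" that separates path and query string,
# which puts "fullcredits/" in between and leaves the query untouched
credits_link <- sub("/?", "/fullcredits/?", movie_link, fixed = TRUE)
credits_link
```

sub() only replaces the first match, and fixed = TRUE stops "?" from being read as a regex quantifier.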
I'm sorry to ask this question once again: I know a lot of people have asked this before, but even looking at the answers they received I still can't solve my problem.
The code I'm using was actually inspired by some of the answers I was able to find:
library(httr)
library(rvest)
link <- "https://letterboxd.com/alexissrey/activity/"
page <- link %>% GET(config = httr::config(ssl_verifypeer = FALSE)) %>% read_html
Until this point everything seems to be working ok, but then I try to run the following line...
names <- page %>% html_nodes(".prettify > a") %>% html_text()
... to download all the movie names on that page, but the object I get is empty.
It is worth mentioning that I've tried the same code on other pages (especially the ones mentioned by other users in their questions) and it worked perfectly.
So, can anyone see what I'm missing?
Thanks!
We can get the film links and names by using RSelenium.
Start the browser:
url = 'https://letterboxd.com/alexissrey/activity/'
library(RSelenium)
library(rvest) # for read_html(), html_nodes() and %>%
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
Get the links to the films with:
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[@id="SentimentContainer"]/div[1]/div[1]') %>%
html_text()
[1] "/film/the-power-of-the-dog/" "/nachotorresok/film/dune-2021/" "/furquerita/film/the-princess-switch/"
[4] "/film/fosse-verdon/" "/film/the-greatest-showman/" "/film/misery/"
[7] "/film/when-harry-met-sally/" "/film/stand-by-me/" "/film/things-to-come-2016/"
[10] "/film/bergman-island-2021/" "/film/king-lear-2018/" "/film/21-grams/"
[13] "/film/the-house-that-jack-built-2018/" "/film/dogville/" "/film/all-that-jazz/"
[16] "/alexissrey/list/peliculas-para-ver-en-omnibus/" "/film/in-the-mouth-of-madness/"
Get the movie names with:
remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.target') %>%
html_text()
[1] "The Power of the Dog" " ★★★½ review of Dune" " ★★★½ review of The Princess Switch"
[4] "Fosse/Verdon" "The Greatest Showman" "Misery"
[7] "When Harry Met Sally..." "Stand by Me" "Things to Come"
[10] "Bergman Island" "King Lear" "21 Grams"
[13] "The House That Jack Built" "Dogville" "All That Jazz"
[16] "Películas para ver en ómnibus" "In the Mouth of Madness"
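A quick way to tell whether a selector fails because the content is simply absent from the HTML the server sends (presumably the case here, given that a real browser session does find the nodes) is to count the matched nodes. The sketch below uses two made-up HTML snippets, not the real Letterboxd markup:

```r
library(rvest)

# Made-up page as the server sends it: the container is still empty
static_html <- read_html(
  '<html><body><section id="activity"></section></body></html>'
)

# Made-up page as a browser sees it after a script has filled it in
rendered_html <- read_html(
  '<html><body><section id="activity">
     <h2 class="prettify"><a href="/film/misery/">Misery</a></h2>
   </section></body></html>'
)

n_static <- length(html_nodes(static_html, ".prettify > a"))
rendered_names <- rendered_html %>% html_nodes(".prettify > a") %>% html_text()
```

If the count is zero for the page fetched with read_html() but the elements are visible in the browser, rendering the page first (for example with RSelenium, as above) is the usual way out.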
I am new to R and web scraping. For practice, I am trying to scrape book titles from a fake website that has multiple pages ('http://books.toscrape.com/catalogue/page-1.html'), and then calculate certain metrics based on the book titles. There are 20 books on each page and 50 pages. I have managed to scrape and calculate metrics for the first 20 books, but I want to calculate the metrics for all 1000 books on the website.
The current output looks like this:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
I want this to be 1000 books long instead of 20, so that I can use the same code to calculate the metrics for all 1000 books.
Code:
url<-'http://books.toscrape.com/catalogue/page-1.html'
url %>%
read_html() %>%
html_nodes('h3 a') %>%
html_attr('title')->titles
titles
What would be the best way to scrape every book from the website and make the list 1000 book titles long instead of 20? Thanks in advance.
Generate the 50 URLs, then iterate over them, e.g. with purrr::map (which returns a list with one character vector of titles per page):
library(rvest)
urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
titles <- purrr::map(
urls,
. %>%
read_html() %>%
html_nodes('h3 a') %>%
html_attr('title')
)
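Because mapping over the pages yields a list rather than one long vector, a final flattening step is needed to get all titles in a single vector. A minimal sketch that runs offline: the two HTML strings are made-up stand-ins for catalogue pages, `get_titles` is a hypothetical helper name, and lapply() stands in for purrr::map().

```r
library(rvest)

# Stand-ins for two catalogue pages, so the sketch needs no network access
pages <- c(
  '<html><body><h3><a title="A Light in the Attic">A Light in the ...</a></h3>
   <h3><a title="Tipping the Velvet">Tipping the Velvet</a></h3></body></html>',
  '<html><body><h3><a title="Soumission">Soumission</a></h3></body></html>'
)

# Same pipeline as above, wrapped in a function
get_titles <- function(page) {
  read_html(page) %>% html_nodes("h3 a") %>% html_attr("title")
}

titles_list <- lapply(pages, get_titles)  # list: one vector per page
titles <- unlist(titles_list)             # single character vector
```

With the real URLs, `unlist(titles)` on the result of purrr::map should give the 1000-element vector the question asks for.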
Something like this, perhaps?
library(tidyverse)
library(rvest)
library(data.table)
# Vector with the URLs to scrape (the site has 50 pages)
url <- paste0("http://books.toscrape.com/catalogue/page-", 1:50, ".html")
# Scrape to list
L <- lapply( url, function(x) {
print( paste0( "scraping: ", x, " ... " ) )
data.table(titles = read_html(x) %>%
html_nodes('h3 a') %>%
html_attr('title') )
})
# Bind list to single data.table
data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
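When looping over 50 pages, one failed request would otherwise abort the whole run. A hedged sketch of one way to guard against that with tryCatch(); `scrape_safely` and `toy_scraper` are hypothetical names, and the toy scraper merely simulates a failure in place of the real read_html() pipeline.

```r
# Hypothetical helper: run `scrape_fun` on one URL, return NA on failure
scrape_safely <- function(url, scrape_fun, pause = 0) {
  Sys.sleep(pause)  # optional delay between requests
  tryCatch(
    scrape_fun(url),
    error = function(e) {
      message("failed: ", url, " (", conditionMessage(e), ")")
      NA_character_
    }
  )
}

# Toy scraper standing in for the real pipeline: fails on one "page"
toy_scraper <- function(url) {
  if (grepl("page-2", url, fixed = TRUE)) stop("simulated network error")
  paste("title from", url)
}

urls <- paste0("http://books.toscrape.com/catalogue/page-", 1:3, ".html")
results <- lapply(urls, scrape_safely, scrape_fun = toy_scraper)
```

Pages that failed show up as NA in `results`, so they can be retried afterwards.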
When I call readLines() on a URL, I get missing lines or values. This might be due to spacing that the computer can't read.
When you use the URL above, Ctrl+F finds 38 instances of text that matches "TV-". On the other hand, when I run readLines() and grep("TV-", HTML), I only find 12.
So, how can I avoid encoding/spacing errors so that I can get the complete lines of the HTML?
You can use rvest to scrape the data. For example, to get all the titles you can do:
library(rvest)
url <- 'https://www.imdb.com/search/title/?locations=Vancouver,%20British%20Columbia,%20Canada&start=1.json'
url %>%
read_html() %>%
html_nodes('div.lister-item-content h3 a') %>%
html_text() -> all_titles
all_titles
# [1] "The Haunting of Bly Manor" "The Haunting of Hill House"
# [3] "Supernatural" "Helstrom"
# [5] "The 100" "Lucifer"
# [7] "Criminal Minds" "Fear the Walking Dead"
# [9] "A Babysitter's Guide to Monster Hunting" "The Stand"
#...
#...
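One possible reason for the 38 versus 12 mismatch, offered as an assumption since the page itself is not shown here: grep() returns one index per matching line, while Ctrl+F counts every occurrence, so several matches on the same line are counted only once. A small illustration on made-up lines:

```r
# Made-up lines standing in for the output of readLines()
html_lines <- c(
  "<span>TV-14</span><span>TV-MA</span>",  # two matches on one line
  "<span>TV-PG</span>",
  "<span>no rating</span>"
)

# Number of LINES that contain the text
n_lines <- length(grep("TV-", html_lines, fixed = TRUE))

# Number of OCCURRENCES across all lines
hits <- gregexpr("TV-", html_lines, fixed = TRUE)
n_occurrences <- sum(lengths(regmatches(html_lines, hits)))
```

Here `n_lines` is 2 and `n_occurrences` is 3, which is the same kind of gap described in the question.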
I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note: download.file is used to get around my company's firewall, per this answer.
library(dplyr)
library(rvest)
myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")
links <- content %>%
html_nodes("a") %>% # Note that I don't know the significance of "a"; this was trial and error
html_attr("href") %>%
data.frame()
# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the data frame:
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical() check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath vs CSS selectors and picks the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
html_attr("href") %>%
sprintf("http://ec.europa.eu/%s", .) %>%
head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
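If the hrefs have already been extracted as a character vector, the same filtering can be done afterwards in base R. A sketch using the first four hrefs from the question; since they already start with "/", pasting them onto the host without a trailing slash avoids the doubled "//" seen in the output above. The variable names are my own.

```r
# First four hrefs from the question's output
hrefs <- c(
  "/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz",
  "/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz",
  "/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz",
  "/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
)

# Keep only the "downfile" variant of each pair
down <- hrefs[grepl("downfile=", hrefs, fixed = TRUE)]

# Absolute URLs: host without trailing slash + path with leading slash
full_urls <- paste0("http://ec.europa.eu", down)

# File names: drop everything up to the encoded "data/" prefix
file_names <- sub(".*data%2F", "", down)
```

`file_names` can then serve as the destfile argument when downloading each URL.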