Web-scraping in R - r

I am practicing my web scraping coding in R and I cannot pass one phase no matter what website I try.
For example,
https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music
My goal is to extract all 77 schools' name (Oxford to London Metropolitan)
So I tried...
library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()
From F12, I could find out that all schools' name is under class '.league-table-institution-name'... and that's why I wrote that in html_nodes...
What have I done wrong?

You appear to be running html_nodes() twice: first on college, an xml_document (which is correct) and then on info, a character vector, which is not correct.
Try this instead:
url_college %>%
read_html() %>%
html_nodes('.league-table-institution-name') %>%
html_text()
and then you'll need an additional step to clean up the school names; this one was suggested:
%>%
str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")

Related

html_nodes return an empty list for complex table

I would like to webscraping the table in the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code but it is not working, thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page<- read_html(link)
rank<- page %>% html_nodes(".sorting_2") %>% html_text()
university<-page %>% html_nodes(".ranking-institution-title ") %>% html_text()
statistics<-page %>% html_nodes(".stats") %>% html_text()
The Terms and Services of this site state: "Use data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That being said, you can read the json file that #QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # give you the data frame with universities
Now you have a well structured R list. The $data element contains a data frame with the stats of each university in rows. The other 3 list elements only provide supplementary information.

How to scrape data from multiple wikipedia pages with R?

I am new to data scraping in R, but I would like to do the following. I have a list of celebrities, celebs, and I would like to grab their date of birth from Wikipedia. I know how to do it for each individual celebrity, but I am trying to animate this process.
celebs <- c("Tom Hanks", "Tim Cook", "Michael Bloomberg")
I do the following to get the information I need for the first celebrity, Tom Hanks.
library(rvest)
wiki <- read_html("https://en.wikipedia.org/wiki/Tom_Hanks")
birth_date <- wiki %>%
html_nodes(xpath = '//*[#id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
html_text()
Is there a way to get the information I need for Tim Cook and Michael Bloomberg without manually editing the above code?
welcome to SO.
To do any task repeatedly with code, you should always look to build a loop. Before you can build a loop, you should try to build a single iteration of the loop. You almost have that ready here, but there are a few missing steps.
First of all, we should try to generalize the code so that it could work by simply switch the value of one variable from your vector of iterators (celebs).
person <- "Tom Hanks"
Now, using that, we need to create the wikipedia link through code. There are two things to consider here:
We need to add the link before the name of the person;
We should replace the space in "Tom Hanks" for an underline
We can do that with this code:
link <- paste0("https://en.wikipedia.org/wiki/",
str_replace_all(person, " ", "_"))
This creates the correct link, which we can use for the subsequent steps. Now, it is just a question of iterating through the celebs vector. There are many ways to do it, but in R, the most appropriate would be with an sapply. For that, we will create an anonymous function that will take a person's name as input, query wikipedia and extract their birthday, using the code that you have already written:
function(person) {
link <- paste0("https://en.wikipedia.org/wiki/",
str_replace_all(person, " ", "_"))
wiki <- read_html(link)
birth_date <- wiki %>%
html_nodes(xpath = '//*[#id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
html_text()
return(birth_date)
}
You can now wrap an sapply structure around that:
birthdates <- sapply(celebs, function(person) {
link <- paste0("https://en.wikipedia.org/wiki/",
str_replace_all(person, " ", "_"))
wiki <- read_html(link)
birth_date <- wiki %>%
html_nodes(xpath = '//*[#id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
html_text()
return(birth_date)
})

Webscraping with R: How can I tidy my data?

I'm building a web scraper for some News websites in Switzerland. After some trial & error and a lot of help from StackOverflow (thx everyone!), I've gotten to a point where I can get text data from all articles.
#packages instalieren
install.packages("rvest")
install.packages("tidyverse")
install.packages("dplyr")
library(rvest)
library(stringr)
#seite einlesen
apisrf<- read_xml('https://www.srf.ch/news/bnf/rss/1646')
urls_srf <- apisrf %>% html_nodes('link') %>% html_text()
zeit_srf <- apisrf %>% html_nodes('pubDate') %>% html_text()
#data.frame basteln
dfsrf_titel_text <- data.frame(Text = character())
#scrape
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- artikel %>% html_nodes('p') %>% html_text()
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
running this gives me dfsrf_titel_text. (I'm going to combine it with the titles of the articles at some point but let that be my problem.)
however, now my data is pretty untidy and I can't really figure out how to clean it in a way so it works for me. Especially annoying is that the texts from the different articles are not really structured in that way but get a new line whenever there is a paragraph in the texts. I can't combine the paragraphs because all the texts have different lengths. (The first article, starting at point 3, is super long because it's a live ticker covering the corona crisis so don't get confused if you run my code.)
how can I get R to create a new row in my dataframe only if the text is from a new article (meaning from a new URL?
thx for your help!
can you provide a sample of your data? you can use the strsplit(string, pattern) function where the pattern you specify is something that only happens between articles. Perhaps the URL?
strsplit(dfsrf_text,"www.\\w+.ch")
That will split your text anytime a URL in the .ch domain is found. you can use this regular expression cheat sheet to help you identify the pattern that seperates your articles.
You should correct this while creating dataframe itself. Here I am binding this all the data for each article together using paste0 adding new line character between them (\n\n).
library(rvest)
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- paste0(artikel %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
However, growing data in a loop is highly inefficient and can slow the process terribly especially when you have large data to scrape like this. Try using sapply.
dfsrf_titel_text <- data.frame(text = sapply(urls_srf, function(x) {
paste0(read_html(x) %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
}))
So this will give you number of rows same as length of urls_srf .

Webscraping in R From Dataframe

From the following data frame
I am trying to use the package rvest to scrape each words Part of speech and synonyms from the website: https://www.thesaurus.com/browse/research?s=t into a csv.
I am not sure how to have R search each word of the data frame and pull its Part of Speech and Synonym.
install.packages("rvest")
install.packages("xml2")
library(xml2)
library(rvest)
library(dplyr)
words<data.frame("keywords"=c("research","survey","staff","outpatient","consent"))
html<- read_html("https://www.merriam-webster.com/thesaurus/research")
html %>% html_nodes(".mw-list") %>% html_text () %>%
head(n=1) # take the first 1st records
If you search [your term] on thesaurus, you will end up on the following HTML page: "https://www.thesaurus.com/browse/[your term]". If you know this, you can get the HTMLs of all the pages of terms you're interested in. After that you should be able to iterate with the map() function from the purrr pacakage to get the information you want:
# It makes more sense to just keep "words" as a vector for now
words <- c("research","survey","staff","outpatient","consent")
htmls <- paste0("https://www.thesaurus.com/browse/", words)
info_list <- map(htmls, .x %>%
read_html() %>%
html_node(.mw-list) %>%
html_text())

Using str_extract_all to extract publications from specific portal out of hundreds of portals in a string.

Using a simple code to extract the links to my articles (one by one)
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www dot time dot mk/"
frontpagelinks = paste0(mark, frontpage)
final = list()
final = read_html(frontpagelinks[1]) %>%
html_nodes("h1 a") %>%
html_attr("href")%>%
paste0()
I used
a1onJune = str_extract_all(frontpage, ".*a1on.*") to extract articles from the website a1on dot mk, which worked like a charm finding only the articles I needed.
After getting some help here as to how to make my code more efficient, i.e. extract numerous links at once, via:
linksList <- lapply(frontpagelinks, function(i) {
read_html(frontapagelinks[i]) %>%
html_nodes("h1 a") %>%
html_attr("href")%>%
paste0()
which extracts all of the links I need, the same stringr code returns oddly enough something like this
"\"standard dot mk/germancite-ermenskiot-genocid/\", \"//plusinfo dot mk/vest/72702/turcija-ne-go-prifakja-zborot-genocid\", \"/a1on dot mk/wordpress/archives/618719\", \"sitel dot mk/na-povidok-nov-sudir-megju-turcija-i-germanija\",
Where as shown in bold I also extract the links to the website I need, but also a bunch of other noise that I definitely don't want there. I tried a variety of regex expressions, however I've not managed to define only those lines of code that contain a1on posts.
Given that the list which I am attempting to clear out outputs separated links I am a bit baffled by the fact that when I use stringr it (as far as im concerned) randomly divides them into strings of multiple links:
[93] "http://telegraf dot mk /aktuelno/svet/ns-newsarticle-vo-znak-na-protest-turcija-go-povlece-svojot-ambasador-od-germanija.nspx"
[94] "http://tocka dot mk /1/197933/odnosite-pomegju-berlin-i-ankara-pred-totalen-kolaps-germanija-go-prizna-turskiot-genocid-nad-ermencite"
[95] "lokalno dot mk /merkel-vladata-na-germanija-e-podgotvena-da-pomogne-vo-dijalogot-megju-turcija-i-ermenija/"
Any thoughts as to how I can go about this? Perhaps something that is more general, given that I need to do the same type of cleaning for five different portals.
Thank you.
Using a simple code to extract the links to my articles (one by one)
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
# lappy returns a list of lists, so use unlist to flatten
linksList <- unlist( lapply(frontpagelinks, function(i) {
read_html(i) %>%
html_nodes("h1 a") %>%
html_attr("href") %>%
paste0()}))
# grab the lists of interest
a1onLinks <- linksList[grepl(".*a1on.*", linksList)]
# [1] "http://a1on.mk/wordpress/archives/621196" "http://a1on.mk/wordpress/archives/621038"
# [3] "http://a1on.mk/wordpress/archives/620576" "http://a1on.mk/wordpress/archives/620686"
# [5] "http://a1on.mk/wordpress/archives/620364" "http://a1on.mk/wordpress/archives/620399"

Resources