Scraping Movie reviews from IMDB using rvest - r

I have extracted the reviews of a movie on IMDB but the separate reviews have a lot of blank lines between them. It is unstructured and very difficult to view.
I have to apply certain functions on each of them separately and then store them together as 1 for some text mining for some other functions.
How can I structure (clean) them and access them one at a time and also how to combine them and store it together?
Here is my code for scraping the reviews
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
html_nodes("p") %>%
html_text()

I would suggest that you are more specific when you navigate the DOM. For instance, this code will only deliver reviews and none of the other information that you are presumably not looking to scrape:
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/tt", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>% html_nodes("#pagecontent") %>%
html_nodes("div+ p") %>%
html_text()
And here is a way to remove line breaks, applying a function to each review, and merging all reviews into one paragraph (also see this post on concatenating vector elements and this post on replacing line breaks):
ex_review <- gsub("[\r\n]", " ", ex_review) # replace line breaks
sapply(ex_review, function(x){}) # apply function to each review
ex_review <- paste(ex_review, collapse = "") # concatenate reviews into one paragraph
write(ex_review, "test.txt")
I think you were also missing a "tt" in the URL.

Related

How to use purrr and rvest in R to scrape transcripts from a webpage?

I am trying to extract all transcripts available on this webpage. I have been able to successfully extract the dates and titles of the speeches using the following code in R:
library(purr)
library(rvest)
url_kremlin <- "http://kremlin.ru/events/president/transcripts/page/"
map(1:10, safely(function(i) {
pg <- read_html(paste0(url_kremlin, i))
data.frame(date = html_text(html_nodes(pg, ".dt-published")),
title = html_text(html_nodes(pg, ".p-name")),
link = html_nodes(pg, ".p-name") %>%
html_node("p") %>% html_attr("href"))
})) -> kremlin_df
I am unable to extract the text of the transcripts, though. Does anyone know what I am doing wrong? What could should I use to successfully extract the transcripts?
Edit: When I run the code above, this is what I get: . The link should contain the text of the speeches (or at least that's what I want it to contain).

Scraping blog articles in R using rvest package

For a university project I want to scrape blog articles of the Instagram blog (https://about.instagram.com/blog/announcements/break-down-how-instagram-search-works). Getting the title, date and author of the articles is no problem but when I try to get the actual article text, it returns nothing. Does anybody have an idea what might be the problem?
This is my code:
require ("rvest")
require ("stringr")
require ("tidyverse")
library (tidyverse)
library (rvest)
library (stringr)
### set variable to save url ###
url <- 'https://about.instagram.com/blog/announcements/break-down-how-instagram-search-works'
### scrape title of blog entry ###
titles <- read_html(url) %>%
html_nodes('h1') %>%
html_text()
### scrape author and date into a vector ###
author_date <- read_html(url) %>%
html_nodes ('._8hlt') %>%
html_nodes ('._8hlu') #%>%
html_text()
### separate author and date from vector into single character variables ###
author <- author_date [1]
date <- author_date [2]
### scrape article text. does not work unfortunately. any idea why? ###
text <- read_html(url) %>%
html_nodes ("._8ig0 _8g86") %>%
html_nodes ("._8g86 _9g5w _8iq8 _8ipi") %>%
html_text()
Use this code to get all text in paragraphs.
text2 <- read_html(url) %>%
html_nodes (xpath="//p[.//text()]") %>%
html_text()

html_nodes return an empty list for complex table

I would like to webscraping the table in the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code but it is not working, thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page<- read_html(link)
rank<- page %>% html_nodes(".sorting_2") %>% html_text()
university<-page %>% html_nodes(".ranking-institution-title ") %>% html_text()
statistics<-page %>% html_nodes(".stats") %>% html_text()
The Terms and Services of this site state: "Use data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That being said, you can read the json file that #QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # give you the data frame with universities
Now you have a well structured R list. The $data element contains a data frame with the stats of each university in rows. The other 3 list elements only provide supplementary information.

Webscraping with R: How can I tidy my data?

I'm building a web scraper for some News websites in Switzerland. After some trial & error and a lot of help from StackOverflow (thx everyone!), I've gotten to a point where I can get text data from all articles.
#packages instalieren
install.packages("rvest")
install.packages("tidyverse")
install.packages("dplyr")
library(rvest)
library(stringr)
#seite einlesen
apisrf<- read_xml('https://www.srf.ch/news/bnf/rss/1646')
urls_srf <- apisrf %>% html_nodes('link') %>% html_text()
zeit_srf <- apisrf %>% html_nodes('pubDate') %>% html_text()
#data.frame basteln
dfsrf_titel_text <- data.frame(Text = character())
#scrape
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- artikel %>% html_nodes('p') %>% html_text()
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
running this gives me dfsrf_titel_text. (I'm going to combine it with the titles of the articles at some point but let that be my problem.)
however, now my data is pretty untidy and I can't really figure out how to clean it in a way so it works for me. Especially annoying is that the texts from the different articles are not really structured in that way but get a new line whenever there is a paragraph in the texts. I can't combine the paragraphs because all the texts have different lengths. (The first article, starting at point 3, is super long because it's a live ticker covering the corona crisis so don't get confused if you run my code.)
how can I get R to create a new row in my dataframe only if the text is from a new article (meaning from a new URL?
thx for your help!
can you provide a sample of your data? you can use the strsplit(string, pattern) function where the pattern you specify is something that only happens between articles. Perhaps the URL?
strsplit(dfsrf_text,"www.\\w+.ch")
That will split your text anytime a URL in the .ch domain is found. you can use this regular expression cheat sheet to help you identify the pattern that seperates your articles.
You should correct this while creating dataframe itself. Here I am binding this all the data for each article together using paste0 adding new line character between them (\n\n).
library(rvest)
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- paste0(artikel %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
However, growing data in a loop is highly inefficient and can slow the process terribly especially when you have large data to scrape like this. Try using sapply.
dfsrf_titel_text <- data.frame(text = sapply(urls_srf, function(x) {
paste0(read_html(x) %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
}))
So this will give you number of rows same as length of urls_srf .

R Web scraping from different URLs

I am web scraping a page at
http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
From this url, I have built up a dataframe through the following code:
dflist <- map(.x = 1:417, .f = function(x) {
Sys.sleep(5)
url <- ("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=")
read_html(url) %>%
html_nodes(".title a") %>%
html_text() %>%
as.data.frame()
}) %>% do.call(rbind, .)
I have repeated the same code in order to get all the data I was interested in and it seems to work perfectly, although is of course a little slow due to the Sys.sleep() thing.
My issue has raised once I have tried to scrape the single projects descriptions that should be included in the dataframe.
For instance, the first project description is at
http://catalog.ihsn.org/index.php/catalog/7118/study-description
the second project description is at
http://catalog.ihsn.org/index.php/catalog/6606/study-description
and so forth.
My problem is that I can't find a dynamic way to scrape all the projects' pages and insert them in the data frame, being the number in the URLs not progressive nor at the end of the link.
To make things clearer, this is the structure of the website I am scraping:
1.http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
1.1. http://catalog.ihsn.org/index.php/catalog/7118
1.1.a http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary
I have scraped successfully level 1. but cannot level 1.1.b. (study-description) , the one I am interested in, since the dynamic element of the URL (in this case: 7118) is not consistent in the website's above 6000 pages of that level.
You have to extract the deeper urls from the .title a and then scrape those as well. Here's a small example on how to do that using rvest and the tidyverse
library(tidyverse)
library(rvest)
scraper <- function(x) {
Sys.sleep(5)
url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
html <- read_html(url)
tibble(title = html_nodes(html, ".title a") %>% html_text(trim = TRUE),
project_url = html_nodes(html, ".title a") %>% html_attr("href"))
}
result <- map_df(1:2, scraper) %>%
mutate(study_description = map(project_url, ~read_html(sprintf("%s/study-description", .x)) %>% html_node(".xsl-block") %>% html_text()))
This isn't complete as to all the things you want to do, but should show you an approach.

Resources