Webscraping in R From a Dataframe

From the following data frame, I am trying to use the rvest package to scrape each word's part of speech and synonyms from the website https://www.thesaurus.com/browse/research?s=t into a csv.
I am not sure how to have R search for each word in the data frame and pull its part of speech and synonyms.
install.packages("rvest")
install.packages("xml2")
library(xml2)
library(rvest)
library(dplyr)
words <- data.frame("keywords" = c("research", "survey", "staff", "outpatient", "consent"))
html <- read_html("https://www.merriam-webster.com/thesaurus/research")
html %>% html_nodes(".mw-list") %>% html_text() %>%
head(n = 1) # take only the first record

If you search [your term] on Thesaurus.com, you end up on a page of the form "https://www.thesaurus.com/browse/[your term]". Knowing this, you can build the URLs for all the terms you're interested in. After that you should be able to iterate with the map() function from the purrr package to get the information you want:
# It makes more sense to just keep "words" as a vector for now
words <- c("research","survey","staff","outpatient","consent")
htmls <- paste0("https://www.thesaurus.com/browse/", words)
info_list <- map(htmls, ~ .x %>%
read_html() %>%
html_node(".mw-list") %>% # note: this selector comes from the Merriam-Webster example above; inspect thesaurus.com for the right one
html_text())
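If you then want the csv the question asks for, one option (a sketch: it assumes html_node() matched exactly one node per page, so unlist() yields one string per keyword) is to bind the results to the keywords and write them out:
# pair each keyword with its scraped text and save to csv
results <- data.frame(keyword = words, scraped = unlist(info_list), stringsAsFactors = FALSE)
write.csv(results, "synonyms.csv", row.names = FALSE)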

Related

How to use purrr and rvest in R to scrape transcripts from a webpage?

I am trying to extract all transcripts available on this webpage. I have been able to successfully extract the dates and titles of the speeches using the following code in R:
library(purrr)
library(rvest)
url_kremlin <- "http://kremlin.ru/events/president/transcripts/page/"
map(1:10, safely(function(i) {
pg <- read_html(paste0(url_kremlin, i))
data.frame(date = html_text(html_nodes(pg, ".dt-published")),
title = html_text(html_nodes(pg, ".p-name")),
link = html_nodes(pg, ".p-name") %>%
html_node("p") %>% html_attr("href"))
})) -> kremlin_df
I am unable to extract the text of the transcripts, though. Does anyone know what I am doing wrong? What code should I use to successfully extract the transcripts?
Edit: when I run the code above, this is what I get: [screenshot omitted]. The link should contain the text of the speeches (or at least that's what I want it to contain).
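One way forward (a sketch, untested: it assumes the href sits on an <a> inside each .p-name node, and .read__content is a hypothetical selector for the transcript body; inspect the page to confirm both):
library(purrr)
library(rvest)
url_kremlin <- "http://kremlin.ru/events/president/transcripts/page/"
# collect the relative links from the first 10 index pages
links <- unlist(map(1:10, function(i) {
read_html(paste0(url_kremlin, i)) %>%
html_nodes(".p-name a") %>% # assumed location of the link
html_attr("href")
}))
# visit each transcript page and pull its body text
texts <- map_chr(paste0("http://kremlin.ru", links), function(u) {
read_html(u) %>%
html_nodes(".read__content") %>% # hypothetical selector
html_text() %>%
paste(collapse = "\n")
})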

html_nodes returns an empty list for a complex table

I would like to scrape the table on the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code, but it is not working; thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page <- read_html(link)
rank <- page %>% html_nodes(".sorting_2") %>% html_text()
university <- page %>% html_nodes(".ranking-institution-title") %>% html_text()
statistics <- page %>% html_nodes(".stats") %>% html_text()
The Terms of Service of this site say you may not "[u]se data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That being said, you can read the JSON file that @QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # the data frame of universities
Now you have a well-structured R list. The $data element contains a data frame with the stats of each university in rows. The other three list elements only provide supplementary information.
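To get oriented before working with it (nothing assumed here beyond the structure just described):
str(x, max.level = 1) # the top-level elements of the list
dim(x$data) # one row per university, one column per stat field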

Scraping table of NBA stats with rvest

I'd like to scrape a table of NBA team stats with rvest. I've tried using:
the table element
library(rvest)
url_nba <- "http://stats.nba.com/teams/advanced/#!?sort=TEAM_NAME&dir=-1"
team_stats <- url_nba %>% read_html %>% html_nodes('table') %>% html_table
the xpath (via google chrome inspect)
team_stats <- url_nba %>%
read_html %>%
html_nodes(xpath="/html/body/main/div[2]/div/div[2]/div/div/nba-stat-table/div[1]/div[1]/table") %>%
html_table
the css selector (via mozilla inspect):
team_stats <- url_nba %>%
read_html %>%
html_nodes(".nba-stat-table__overflow > table:nth-child(1)") %>%
html_table
but with no luck. Any help would be greatly appreciated.
This question is very similar to this one: How to select a particular section of JSON Data in R?
The data you are requesting is not stored in the HTML, hence the failures with rvest. It is loaded via an XHR request and can be accessed directly:
library(httr)
library(jsonlite)
nba<-GET('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=' )
Once the data is loaded into the nba variable, use httr and jsonlite to clean up the data:
# access the data
out <- content(nba, as = "text") %>% fromJSON(flatten = FALSE)
# convert into a data frame
# str(out) to determine the structure
df <- data.frame(out$resultSets$rowSet)
names(df) <- out$resultSets$headers[[1]]
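A possible follow-up (a sketch: it assumes every column comes back as character and that the first two columns are the team id and name, which str(out) should confirm):
# convert the stat columns from character to numeric
df[, -(1:2)] <- lapply(df[, -(1:2)], as.numeric)
head(df)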
I highly recommend reading the answer to the question which I linked above.

Scraping Movie reviews from IMDB using rvest

I have extracted the reviews of a movie on IMDB, but the separate reviews have a lot of blank lines between them. The result is unstructured and very difficult to view.
I have to apply certain functions to each of them separately and then store them together as one for some text mining and other functions.
How can I structure (clean) them, access them one at a time, and then combine and store them together?
Here is my code for scraping the reviews
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
html_nodes("p") %>%
html_text()
I would suggest being more specific when you navigate the DOM. For instance, this code will deliver only the reviews and none of the other information you are presumably not looking to scrape:
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/tt", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>% html_nodes("#pagecontent") %>%
html_nodes("div+ p") %>%
html_text()
And here is a way to remove line breaks, applying a function to each review, and merging all reviews into one paragraph (also see this post on concatenating vector elements and this post on replacing line breaks):
ex_review <- gsub("[\r\n]", " ", ex_review) # replace line breaks with spaces
sapply(ex_review, function(x){ x }) # placeholder: apply your own function to each review
ex_review <- paste(ex_review, collapse = "") # concatenate reviews into one paragraph
write(ex_review, "test.txt")
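As a concrete stand-in for the placeholder above (word counting is just an arbitrary example of a per-review operation; run it before the reviews are collapsed into one paragraph):
word_counts <- sapply(ex_review, function(x) length(strsplit(x, "\\s+")[[1]])) # words per review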
I think you were also missing a "tt" in the URL.

Scrape website data using rvest

I am trying to scrape the data corresponding to Table 5 from the following link: https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls
As suggested, I used SelectorGadget to find the relevant CSS match; the one I found that contained all the data (as well as some extraneous information) was "#page_content".
I've tried the following code, which yields errors:
fbi <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
fbi %>%
html_node("#page_content") %>%
html_table()
Error: html_name(x) == "table" is not TRUE
# Try extracting only the first column:
fbi %>%
html_nodes(".group0") %>%
html_table()
Error: html_name(x) == "table" is not TRUE
# Directly feed fbi into html_table
data <- fbi %>% html_table(fill = TRUE)
# This creates a list of 3 elements; elements 1 and 3 contain many missing values.
Any help would be greatly appreciated!
You can download the Excel file directly, open it, pull the data you want into a csv file, and then work on that. Below is the code for downloading it:
library(rvest)
library(stringr)
page <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
pageAdd <- page %>%
html_nodes("a") %>% # find all links
html_attr("href") %>% # get the url
str_subset("\\.xls") %>% # keep the links containing .xls
.[[1]]
mydestfile <- "D:/Kumar/table5.xls" # change the path and file name as per your system
download.file(pageAdd, mydestfile, mode="wb")
The data in the sheet is not well formatted, so reading it straight into R would be more confusing than helpful. To me this appears to be the best way to solve your problem.
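That said, if you do want to pull the sheet straight into R, readxl can read the downloaded file (a sketch: skip = 3 is a guess at the number of header rows, check the sheet first):
library(readxl)
table5 <- read_excel(mydestfile, skip = 3) # skip = 3 assumes three header rows
head(table5)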
