Scraping IMDb user reviews using R, only got the first review back

I'm new to web scraping and hoping to use it for sentiment analysis. Here's the code I used, and it only returned the first review... thanks in advance!
library(rvest)
library(XML)
library(plyr)
HouseofCards_IMDb <- read_html("http://www.imdb.com/title/tt1856010/reviews?ref_=tt_urv")
#Used SelectorGadget to find the CSS selector
reviews <- HouseofCards_IMDb %>%
  html_nodes("#pagecontent") %>%
  html_nodes("div+p") %>%
  html_text()
#perform data cleaning on user reviews
reviews <- gsub("\r?\n|\r", " ", reviews)
reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))
reviews <- paste(reviews, collapse = "")
print(reviews)
write(reviews, "IMDb.CSV")

Using Chromium's developer tools (F12), the XPath of the second review is:
//*[@id="tn15content"]/p[2]/text()
and the third review is:
//*[@id="tn15content"]/p[5]/text()[1]
You can use the XML::htmlParse function to parse the page and the XML::xpathSApply function to extract the correct nodes of the DOM (apparently, for the review texts this is //*[@id="tn15content"]/p/text()).

Related

How to scrape NBA data?

I want to compare rookies across leagues with stats like Points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-reference), but I just found out that they're not stored in html, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and jsonlite for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
library(magrittr)  # for the %>% pipe
coby.white <- GET('https://www.nba.com/players/coby/white/1629632')
out <- content(coby.white, as = "text") %>%
  fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info (e.g. weight, height) from
https://www.nba.com/players/active_players.json
So, you could use jsonlite to parse it, e.g.
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these in the network tab when refreshing the page. It looks like you can use the player id in the URL to get different players' info for the season.
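From there you can index into the nested list; the exact path below is an assumption about how the endpoint structured its payload at the time, so verify it with str(data):
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
# assumed path: the current-season averages sat under league$standard$stats$latest
data$league$standard$stats$latest$ppg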
You actually can web scrape with rvest, here's an example of scraping White's totals table from Basketball Reference. Anything on Sports Reference's sites that is not the first table of the page is listed as a comment, meaning we must extract the comment nodes first and then extract the desired data table.
library(rvest)
library(dplyr)
cobywhite = 'https://www.basketball-reference.com/players/w/whiteco01.html'
totalsdf = cobywhite %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#totals") %>%
  html_table()

R - Using rvest to scrape Google+ reviews

As part of a project, I am trying to scrape the complete reviews from Google+ (in previous attempts on other websites, my reviews were truncated by a "More" link that hides the full review unless you click on it).
I have chosen the package rvest for this. However, I do not seem to be getting the results I want.
Here are my steps:
library(rvest)
library(xml2)
library(RSelenium)
queens <- read_html("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")
#Here I use the selectorgadget tool to identify the user review part that I wish to scrape
reviews = queens %>%
  html_nodes(".review-snippet") %>%
  html_text()
However this doesn't seem to be working. I do not get any output here.
I am quite new to this package and web scraping, so any inputs on this would be greatly appreciated.
Here is the workflow with RSelenium and rvest:
1. Scroll down as many times as needed to load the content you want; remember to pause once in a while to let the content load.
2. Click on all the "More" buttons to get the full reviews.
3. Get the page source and use rvest to collect all the reviews in a list.
What you want to scrape is not static, so you need the help of RSelenium. This should work:
library(rvest)
library(xml2)
library(RSelenium)
rmDr=rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
myclient= rmDr$client
myclient$navigate("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
#simulate scrolling down several times-------------
scroll_down_times = 20
for (i in 1:scroll_down_times) {
  webEle$sendKeysToActiveElement(sendKeys = list(key = "page_down"))
  # the content needs time to load; wait 1 second every 5 scroll-downs
  if (i %% 5 == 0) {
    Sys.sleep(1)
  }
}
#loop and simulate clicking on all "More" elements-------------
webEles <- myclient$findElements(using = "css", value = ".review-more-link")
for (webEle in webEles) {
  tryCatch(webEle$clickElement(), error = function(e) {print(e)}) # tryCatch to prevent one error from stopping the loop
}
pagesource = myclient$getPageSource()[[1]]
#this should get you the full reviews, including translation and original text-------------
reviews = read_html(pagesource) %>%
  html_nodes(".review-full-text") %>%
  html_text()
#number of stars
stars <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes("g-review-stars > span") %>%
  html_attr("aria-label")
#time posted
post_time <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes(".dehysf") %>%
  html_text()
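If the three vectors line up one-to-one you can bind them together; that alignment is an assumption (the selectors can match different numbers of nodes), so check the lengths first:
# check that each selector matched the same number of reviews
length(reviews); length(stars); length(post_time)
review_df <- data.frame(
  review = reviews,
  stars = stars,
  posted = post_time,
  stringsAsFactors = FALSE
)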

R Web scrape - Error

Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered URL. Here is an example of my code:
library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
  read_html() %>%
  html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}
Suggestions?
Aaron
edit:
I also tried scraping via an XHR request, but I think my issue is not knowing which CSS selector or XPath to use to find the appropriate data.
XHR code:
library(httr)
library(rvest)
get.morningstar.Table1 <- function(Symbol.i, htmlnode){
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))
  # return NA if the node extraction fails
  x <- tryCatch(content(res) %>%
                  html_nodes(htmlnode) %>%
                  html_text() %>%
                  trimws(),
                error = function(e) NA)
  return(x)
} #the htmlnode argument in this case is a vkey
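For reference, a hypothetical call (the "[vkey]" attribute selector below is an assumption, since the comment above only says the node is a vkey):
# hypothetical usage: grab every element carrying a vkey attribute for FBALX
get.morningstar.Table1("FBALX", "[vkey]")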
Still, the same question remains: am I using the correct CSS/XPath to look up? The XHR code works great for requests that have a clear CSS selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)
url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)
title <- page %>%
  html_node('title') %>%
  html_text()
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
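The capture group keeps whatever sits between the two occurrences of the ticker in the page title, which on this page is the fund name.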
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape (like Yahoo Finance).

Data scraping text without css path R

Hello, I am writing to you because I am racking my brain trying to find a way to scrape data out of a webpage ("https://nabtu.org/about-nabtu/official-directory/building-trades-local-councils-overview/").
I am doing this for practice and just to learn how to scrape data. I am trying to scrape the contact data of the above-mentioned webpage (Office, Fax, Email), but I am unable to do it since there is no definite CSS path I can get with SelectorGadget. I am using R and the script I am using looks like this:
library(rvest)
page_name <- read_html("page html")
page_name %>%
  html_node("selector gadget node") %>%
  html_text()
I scraped all the other data; I just can't scrape this contact information.
Any help will be appreciated because my head is gonna blow. Thanks in advance.
I don't see where the problem is. Each contact block has a .council-list class. Using that, you can extract the contact information separately. Afterwards, use some string/regex operations to extract the exact fields.
library(rvest)
page_name <- read_html('https://nabtu.org/about-nabtu/official-directory/building-trades-local-councils-overview/')
contact_strings = page_name %>%
  html_nodes('.council-list') %>%
  html_text()
# Filter out strings that don't contain contact information
contact_strings = grep('Email|Fax|office', x = contact_strings, ignore.case = TRUE, value = TRUE)
# Extract information
library(stringr)
library(magrittr)
office = str_extract(contact_strings, 'Office:[^[:alpha:]]*')
fax = str_extract(contact_strings, 'Fax:[^[:alpha:]]*')
email = str_extract(contact_strings, 'Email: [^ ]*')
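As a final step, you could strip the labels and collect the fields into a data frame (a sketch; the sub() patterns assume the regexes above captured each label together with its value):
contacts <- data.frame(
  office = trimws(sub('^Office:', '', office)),
  fax = trimws(sub('^Fax:', '', fax)),
  email = trimws(sub('^Email: ?', '', email)),
  stringsAsFactors = FALSE
)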

Clean Data Scraped from teambhp website using rvest in R

I am doing scraping in R using the rvest package.
I want to scrape user comments and reviews from teambhp.com car pages.
I am doing this for the link below:
Team BHP REVIEW
I am writing the following code in R:
library(rvest)
library(httr)
library(httpuv)
team_bhp <- read_html(httr::GET("http://www.team-bhp.com/forum/official-new-car-reviews/172150-tata-zica-official-review.html"))
all_tables <- team_bhp %>%
  html_nodes(".tcat:nth-child(1) , #posts strong , hr+ div") %>%
  html_text()
But I am getting all the text in one list, and it contains spaces and "\t \n" even after I apply the html_text() function. How do I clean this up and convert it to a data frame?
Also, I want to do this for all the car reviews available on the website. How can I recursively traverse all of them?
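For the cleaning part, a minimal sketch (assuming the selectors above return one text chunk per post; html_text(trim = TRUE) plus a whitespace regex takes care of the \t and \n runs):
library(rvest)
library(httr)
team_bhp <- read_html(httr::GET("http://www.team-bhp.com/forum/official-new-car-reviews/172150-tata-zica-official-review.html"))
posts <- team_bhp %>%
  html_nodes(".tcat:nth-child(1) , #posts strong , hr+ div") %>%
  html_text(trim = TRUE)  # drop leading/trailing whitespace on each node
# collapse internal runs of tabs, newlines, and repeated spaces into single spaces
posts <- gsub("[\t\r\n ]+", " ", posts)
# one row per scraped node
reviews_df <- data.frame(text = posts, stringsAsFactors = FALSE)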
