Clean data scraped from the Team-BHP website using rvest in R

I am scraping in R using the rvest package.
I want to scrape user comments and reviews from the car review pages on teambhp.com.
I am doing this for the link below:
Team BHP review: http://www.team-bhp.com/forum/official-new-car-reviews/172150-tata-zica-official-review.html
I am writing the following code in R:
library(rvest)
library(httr)
library(httpuv)
team_bhp <- read_html(httr::GET("http://www.team-bhp.com/forum/official-new-car-reviews/172150-tata-zica-official-review.html"))

all_tables <- team_bhp %>%
  html_nodes(".tcat:nth-child(1) , #posts strong , hr+ div") %>%
  html_text()
But I am getting all the text in one list, and it contains spaces and "\t" / "\n" characters even after applying html_text(). How do I clean it and convert it to a data frame?
Also, I want to do this for all car reviews available on the website. How can I recursively traverse all of the car reviews?
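One way to tackle both parts (a minimal sketch, not an answer from the thread; it assumes the all_tables vector produced above, and the index URL and "official-review" filter are inferred from the thread URL):

# Collapse tabs, newlines and runs of spaces, then trim the ends
cleaned <- trimws(gsub("[\t\r\n ]+", " ", all_tables))
cleaned <- cleaned[cleaned != ""]  # drop entries that were pure whitespace
reviews_df <- data.frame(text = cleaned, stringsAsFactors = FALSE)

# For the second question, one could collect thread links from the
# review-forum index and loop read_html() over them
index <- read_html("http://www.team-bhp.com/forum/official-new-car-reviews/")
links <- index %>% html_nodes("a") %>% html_attr("href")
review_links <- unique(links[grepl("official-review", links)])  # hypothetical filter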

Related

rvest and XPath return misleading information

I am struggling with some scraping issues using rvest and XPath.
The objective is to scrape the following page
https://www.barchart.com/futures/quotes/BT*0/futures-prices
and to extract the names of the futures
BTF21
BTG21
BTH21
etc for the full list of names.
The XPath for those variables seems to be xpath='//a'.
The following code provides no information of relevance, hence my question:
library(rvest)

url <- 'https://www.barchart.com/futures/quotes/BT*0'

valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//a')

value <- valuation_col %>% html_text()
Any hint on how to proceed to get the information would be much appreciated. Thanks in advance!
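The quote board on Barchart is most likely rendered by JavaScript, so the contract names never appear in the static HTML that read_html() downloads. A quick diagnostic (a sketch; "BTF21" is just one of the expected names from above):

library(rvest)
page <- read_html('https://www.barchart.com/futures/quotes/BT*0/futures-prices')
# If this prints FALSE, the names are injected client-side and you will need
# the site's underlying XHR/JSON request (browser network tab) instead
grepl("BTF21", html_text(page))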

How to scrape NBA data?

I want to compare rookies across leagues with stats like points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-Reference), but I just found out that they're not stored in the HTML, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and JSON for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
library(magrittr)  # for the %>% pipe

coby.white <- GET('https://www.nba.com/players/coby/white/1629632')

out <- content(coby.white, as = "text") %>%
  fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info, e.g. weight and height, from
https://www.nba.com/players/active_players.json
So you could use jsonlite to parse it, e.g.:
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these URLs in the network tab when refreshing the page. It looks like you can swap the player ID into the URL to get a different player's info for the season.
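Before pulling numbers out of the parsed object, it helps to inspect its shape; the field path in the comment below is an assumption to confirm against the str() output:

library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
str(data, max.level = 3)  # explore the nesting before extracting anything
# Per-game stats appear to live under data$league$standard$stats (an assumption)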
You actually can web scrape this with rvest; here's an example of scraping White's totals table from Basketball-Reference. Anything on Sports Reference's sites that is not the first table of the page is stored inside an HTML comment, meaning we must extract the comment nodes first and then extract the desired data table.
library(rvest)
library(dplyr)

cobywhite <- 'https://www.basketball-reference.com/players/w/whiteco01.html'

totalsdf <- cobywhite %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%  # pull every HTML comment node
  html_text() %>%
  paste(collapse = '') %>%               # stitch the comment text into one document
  read_html() %>%                        # re-parse that text as HTML
  html_node("#totals") %>%
  html_table()

How to scrape data from a website with similar "#" URLs in menu tabs using R?

I want to scrape stock data from the other tabs of the following website: http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35178706978554988 But all of the tabs share the same URL. When I use rvest functions such as read_html(), html_nodes() and html_text(), I can only scrape data from the main tab; switching between tabs gives the same result. I tried the following code, but still couldn't get appropriate results.
Previously I could extract some info, such as "InsCode" and "ZTitad", stored in the <script> section of the page source using rvest. But because the other tabs' data is not written into the HTML source, I had no idea what to do.
# Scraping libraries
library(rvest)
library(jsonlite)

# Target website
my_url <- "http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35178706978554988"

pagesource <- read_html(my_url)
content <- pagesource %>% html_node("script") %>% html_text()
data <- fromJSON(content)
Ultimately, I want to export the "حقیقی-حقوقی" (individual vs. institutional trades) tab data into a data frame to continue my other analysis.
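Since the tab content is fetched by JavaScript rather than shipped in the HTML, one workable approach is to open the browser's network tab, switch tabs on the page, and call the request that appears directly. A sketch, assuming a plain-text client-type endpoint of the form below (the URL pattern and the delimiters are assumptions to verify in the network tab):

library(httr)
# Hypothetical endpoint; `i` is the same instrument code as in the page URL
resp <- GET("http://www.tsetmc.com/tsev2/data/clienttype.aspx?i=35178706978554988")
raw  <- content(resp, as = "text", encoding = "UTF-8")
# The response looks like semicolon-separated rows of comma-separated fields
# (an assumption), so split twice before assembling a data frame
rows <- strsplit(strsplit(raw, ";")[[1]], ",")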

Web scraping with R using rvest for financial website

I am trying to scrape a data table from the following website using R, but it is not returning any values. I am using SelectorGadget to get the node details.
library(rvest)

url <- "http://www.bursamalaysia.com/market/derivatives/prices/"

text <- read_html(url) %>%
  html_nodes("td") %>%
  html_text()
output:
text
character(0)
I would appreciate any kind of help. Thank you!
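character(0) here usually means the td cells are injected by JavaScript after the page loads, so they never exist in the static HTML that read_html() fetches. A quick check (a sketch; the conclusion depends on what the server actually returns):

library(rvest)
page <- read_html("http://www.bursamalaysia.com/market/derivatives/prices/")
# Zero tables in the raw HTML would point to a JavaScript-rendered page;
# in that case, find the underlying JSON/XHR request in the browser's
# network tab and fetch it with httr, or drive a real browser with RSelenium
length(html_nodes(page, "table"))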

Scraping IMDb user reviews using R, only got the first review back

I'm new to web scraping and hoping to use it for sentiment analysis. Here's the code I used; it only returned the first review... thanks in advance!
library(rvest)
library(XML)
library(plyr)

HouseofCards_IMDb <- read_html("http://www.imdb.com/title/tt1856010/reviews?ref_=tt_urv")

# Used SelectorGadget as the CSS selector
reviews <- HouseofCards_IMDb %>%
  html_nodes("#pagecontent") %>%
  html_nodes("div+p") %>%
  html_text()

# Perform data cleaning on user reviews
reviews <- gsub("\r?\n|\r", " ", reviews)
reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))
reviews <- paste(reviews, collapse = "")
print(reviews)
write(reviews, "IMDb.CSV")
By pressing F12 in Chromium, the XPath of the second review is:
//*[@id="tn15content"]/p[2]/text()
And the third review is:
//*[@id="tn15content"]/p[5]/text()[1]
You can use the XML::htmlParse function to parse the page and the XML::xpathSApply function to extract the correct nodes of the DOM (apparently, for review texts, this is //*[@id="tn15content"]/p/text()).
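A minimal sketch of that approach (it assumes the page still serves the old tn15content layout; the XPath may need updating for current IMDb markup):

library(XML)
doc <- htmlParse("http://www.imdb.com/title/tt1856010/reviews?ref_=tt_urv")
# xmlValue extracts the text of every matched review paragraph, not just the first
reviews <- xpathSApply(doc, '//*[@id="tn15content"]/p/text()', xmlValue)
reviews <- trimws(reviews)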
