I am trying to scrape a website but it does not give me any data.
#Get the Data
require(tidyverse)
require(rvest)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#get data
url %>%
read_html() %>%
html_nodes(".green div:nth-child(1)") %>%
html_text()
character(0)
I have also tried to use the XPath '//*[contains(concat( " ", @class, " " ), concat( " ", "green", " " ))]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a', but this gives me the same result with 0 data.
I am expecting horse names. Shouldn't I at least get some JavaScript code, even if the data on the page is rendered by JavaScript?
I can't see what other CSS selector I should use here.
You can simply use the RSelenium package to scrape dynamic pages:
library(RSelenium)
library(rvest)   # for read_html() and html_nodes()
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#Create the remote driver / navigator
rsd <- rsDriver(browser = "chrome")
remDr <- rsd$client
#Go to your url
remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
#get your horses data by parsing Selenium page with Rvest as you know to do
page %>% html_nodes(".green div:nth-child(1)") %>% html_text()
Hope that helps.
Gottavianoni
I am attempting to scrape this URL to get the names of the top 50 SoundCloud artists in Canada.
Using SelectorGadget, I selected the artists' names, and it told me the path is '.sc-link-light'.
My first attempt was as follows:
library(rvest)
library(stringr)
library(reshape2)
soundcloud <- read_html("https://soundcloud.com/charts/top?genre=all-music&country=CA")
artist_name <- soundcloud %>% html_nodes('.sc-link-light') %>% html_text()
which yielded artist_name as a list of length 0.
My second attempt I changed the last line to:
artist_name <- soundcloud %>% html_node(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "sc-link-light", " " ))]') %>% html_text()
which again yielded the same result.
What exactly am I doing wrong? I believe this should give me the artists' names in a list.
Any help is appreciated, thank you.
The webpage you are attempting to scrape is dynamic. As a result you will need to use a library such as RSelenium. A sample script is below:
library(tidyverse)
library(RSelenium)
library(rvest)
library(stringr)
url <- "https://soundcloud.com/charts/top?genre=all-music&country=CA"
rD <- rsDriver(browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate(url)
pg <- read_html(remDr$getPageSource()[[1]])
artist_name <- pg %>% html_nodes('.sc-link-light') %>% html_text()
####clean up####
remDr$close()
rD$server$stop()
rm(rD, remDr)
gc()
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
I would like to get all the reviews that users leave on Google Play about an app. I have the code below, which was given in the answer to Web scraping in R through Google playstore, but the problem is that it only gets the first 40 reviews. Is there a way to get all of the app's reviews?
```
#Loading the rvest package
library(rvest)
library(magrittr)  # for the '%>%' pipe symbol
library(RSelenium) # to get the loaded html of the page
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) How much star they got
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = FALSE)
```
You can get all the reviews from the Google Play web store.
If you scroll through the reviews, you can see an XHR request is sent to:
https://play.google.com/_/PlayStoreUi/data/batchexecute
With form-data:
f.req: [[["rYsCDe","[[\"com.playrix.homescapes\",7]]",null,"55"]]]
at: AK6RGVZ3iNlrXreguWd7VvQCzkyn:1572317616250
And params of:
rpcids=rYsCDe
f.sid=-3951426241423402754
bl=boq_playuiserver_20191023.08_p0
hl=en
authuser=0
soc-app=121
soc-platform=1
soc-device=1
_reqid=839222
rt=c
After playing around with different parameters, I found that many are optional, and the request can be simplified as:
form-data:
f.req: [[["UsvDTd","[null,null,[2, $sort,[$review_size,null,$page_token]],[$package_name,7]]",null,"generic"]]]
params:
hl=$review_language
The response is cryptic, but it's essentially JSON data with the keys stripped, similar to protobuf. I wrote a parser for the response that translates it into a regular dict object:
https://gist.github.com/xlrtx/af655f05700eb76bb29aec876493ed90
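To make the simplified request concrete, here is a hedged sketch of building it in R. The endpoint, the `UsvDTd` template, and the parameter slots come from the description above; the placeholder values and the exact escaping of the inner JSON string are assumptions that may need adjusting against a live request captured in the network tab.

```r
# Build the f.req body for the simplified batchexecute request described
# above. The placeholder values are illustrative assumptions.
package_name <- "com.playrix.homescapes"
sort_order   <- 2      # sort mode (assumption)
review_size  <- 100    # reviews per page
page_token   <- "null" # no continuation token for the first page

f_req <- sprintf(
  '[[["UsvDTd","[null,null,[2,%d,[%d,null,%s]],[\\"%s\\",7]]",null,"generic"]]]',
  sort_order, review_size, page_token, package_name
)

# library(httr)  # for the live call, commented out here:
# resp <- POST(
#   "https://play.google.com/_/PlayStoreUi/data/batchexecute",
#   query  = list(hl = "en"),
#   body   = list(f.req = f_req),
#   encode = "form"
# )
# raw_text <- content(resp, as = "text", encoding = "UTF-8")
```

Each response page should contain a continuation token to feed back in as `$page_token` for the next request.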
I don't see the data/text I am looking for when scraping a web page.
I tried googling the issue without having any luck. I also tried using an XPath, but I get {xml_nodeset (0)}.
require(rvest)
url <- "https://www.nasdaq.com/market-activity/ipos"
IPOS <- read_html(url)
IPOS %>% xml_nodes("tbody") %>% xml_text()
Output:
[1] "\n \n \n \n \n \n "
I do not see any of the IPO data. The expected output should contain the table of "Priced" IPOs: Symbol, Company Name, etc.
No need for the expensive RSelenium. There is an API call, which you can find in the network tab, that returns everything as JSON.
For example,
library(jsonlite)
data <- jsonlite::read_json('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')
View(data$data$priced$rows)
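Building on that call, a hedged sketch of looping over several months and stacking the priced rows. The `date` query parameter and the `data$data$priced$rows` structure are taken from the call above; the month range is an illustrative assumption.

```r
# Build one URL per month; the `date` parameter follows the call above.
months <- c("2019-07", "2019-08", "2019-09")
urls   <- paste0("https://api.nasdaq.com/api/ipo/calendar?date=", months)

# library(jsonlite)  # for read_json(), used in the commented loop:
# priced <- lapply(urls, function(u) {
#   res <- jsonlite::read_json(u, simplifyVector = TRUE)
#   res$data$priced$rows
# })
# priced_all <- do.call(rbind, priced)
```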
It seems that the table data are loaded by scripts. You can use RSelenium package to get them.
library(rvest)
library(RSelenium)
rD <- rsDriver(port = 1210L, browser = "firefox", check = FALSE)
remDr <- rD$client
url <- "https://www.nasdaq.com/market-activity/ipos"
remDr$navigate(url)
IPOS <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_table(fill = TRUE)
str(IPOS)
PRICED <- IPOS[[3]]
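One caveat on the last line: `IPOS[[3]]` hard-codes the table's position, which breaks if the page layout changes. A hedged alternative is to select the table whose columns include "Symbol", the column the question expects in the "Priced" table:

```r
# Select the parsed table whose header contains "Symbol" instead of
# relying on its position in the list returned by html_table().
pick_priced <- function(tables) {
  has_symbol <- vapply(tables, function(t) "Symbol" %in% names(t), logical(1))
  tables[[which(has_symbol)[1]]]
}

# PRICED <- pick_priced(IPOS)
```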
I want to scrape review data from the Google Play store for several apps. For each review, I want:
name field
How much star they got
review they wrote
This is a snapshot of the scenario:
#Loading the rvest package
library('rvest')
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
#Using the CSS selector (from SelectorGadget) to scrape the name section
Name_data_html <- html_nodes(webpage,'.kx8XBd .X43Kjb')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
but it results in:
> head(Name_data)
character(0)
Digging further, I found that Name_data_html contains:
> Name_data_html
{xml_nodeset (0)}
I am new to web scraping; can anyone help me out with this?
You should use an XPath to select the object on the web page:
#Loading the rvest package
library('rvest')
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
# Using Xpath
Name_data_html <- webpage %>% html_nodes(xpath='/html/body/div[1]/div[4]/c-wiz/div/div[2]/div/div[1]/div/c-wiz[1]/c-wiz[1]/div/div[2]/div/div[1]/c-wiz[1]/h1/span')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
See how to get the path in this picture:
After analyzing your code and the source of the URL you posted, I think the reason you are unable to scrape anything is that the content is generated dynamically, so rvest cannot get at it.
Here is my solution:
#Loading the rvest package
library(rvest)
library(magrittr)  # for the '%>%' pipe symbol
library(RSelenium) # to get the loaded html of the page
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) How much star they got
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = F)
In my solution, I'm using RSelenium, which loads the webpage as if you were navigating to it (instead of just downloading it like rvest does). This way, all the dynamically-generated content is loaded, and once it is loaded, you can retrieve it with rvest and scrape it.
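One practical refinement, offered as a hedged sketch: dynamically-generated content may not be present the instant `navigate()` returns, so it can help to poll for the node you need before grabbing the page source. The CSS selector below is the one used above; the helper itself is illustrative.

```r
# Poll until at least one node matching `css` exists, or give up after
# roughly `timeout` seconds. `remDr` is the remoteDriver object.
wait_for_nodes <- function(remDr, css, timeout = 10) {
  for (i in seq_len(timeout)) {
    els <- remDr$findElements(using = "css selector", value = css)
    if (length(els) > 0) return(invisible(TRUE))
    Sys.sleep(1)
  }
  warning("timed out waiting for: ", css)
  invisible(FALSE)
}

# wait_for_nodes(remDr, ".kx8XBd .X43Kjb")
# html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
```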
If you have any doubts about my solution, just tell me!
Hope it helped!
I have a list of hospital names for which I need to extract the first google search URL. Here is the code I'm using
library(rvest)
library(urltools)
library(RCurl)
library(httr)
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  result <- results[1]
  return(as.character(result))
}
websites <- data.frame(Website = sapply(c,getWebsite))
View(websites)
For short URLs this code works fine, but when the link is long and appears in R with "..." (e.g. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html), it appears in the dataframe the same way, with "...". How can I extract the actual URLs without the "..."? I appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
html_nodes(".r a") %>% # get the a nodes with an r class
html_attr("href") # get the href attributes
#clean the text
links = gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
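As a follow-up, a hedged sketch that separates the href clean-up into a small pure function and adds a polite delay between queries. The ".r a" selector and the "/url?q=" prefix come from the answer above; Google's markup changes often, so treat both as assumptions.

```r
# Strip Google's redirect wrapper from a vector of hrefs, keeping only
# entries of the form "/url?q=<real-url>&...".
clean_google_hrefs <- function(hrefs) {
  hrefs <- grep("/url\\?q=", hrefs, value = TRUE)
  hrefs <- sub("^/url\\?q=", "", hrefs)
  vapply(strsplit(hrefs, "&", fixed = TRUE), `[`, character(1), 1)
}

# library(rvest)  # for the commented helper below:
# get_first_result <- function(query) {
#   url  <- paste0("https://www.google.com/search?q=", URLencode(query))
#   page <- read_html(url)
#   Sys.sleep(1)  # be polite between requests
#   clean_google_hrefs(html_attr(html_nodes(page, ".r a"), "href"))[1]
# }
```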