I'm new to R and web scraping. I'm trying to read a table from the World Bank website into R.
Here is the url link for one of the projects as an example (my goal is to read the left table under "Basic Information"): http://projects.worldbank.org/P156880/?lang=en&tab=details
I'm using Chrome's DevTools to identify the selector nodes that I need for that particular table.
Here is my code:
library(rvest)
url <- "http://projects.worldbank.org/P156880/?lang=en&tab=details"
details <- url %>%
read_html() %>%
html_nodes(css = '#projectDetails > div:nth-child(2) > div.column-left > table') %>%
html_table()
Unfortunately, I get an empty list:
> details
list()
Any help on how to resolve this would be greatly appreciated.
This site loads its tables via XML HTTP requests (XHR), which you can fetch directly with httr. Open Chrome's developer tools, go to the Network tab, and then load your url above. You will notice four other urls are requested when the page loads, so click on projectdetails? and you should see the html table in the Preview tab. Next, right-click on projectdetails?, choose Copy as cURL, paste it into a text editor, and copy the URL, Referer, and X-Requested-With values into the httr GET call below.
library(httr)
library(rvest)
res <- GET(
  url = "http://projects.worldbank.org/p2e/projectdetails?projId=P156880&lang=en",
  add_headers(Referer = "http://projects.worldbank.org/P156880/?lang=en&tab=details",
              `X-Requested-With` = "XMLHttpRequest")
)
content(res) %>% html_node("table") %>% html_table(header = TRUE)
Project ID P156880
1 Status Active
2 Approval Date December 14, 2017
3 Closing Date December 15, 2023
4 Country Colombia
5 Region Latin America and Caribbean
6 Environmental Category B
Or write a function to get any project ID:
get_project <- function(id) {
  res <- GET(
    url = "http://projects.worldbank.org",
    path = paste0("p2e/projectdetails?projId=", id, "&lang=en"),
    add_headers(Referer = paste0("http://projects.worldbank.org/", id, "/?lang=en&tab=details"),
                `X-Requested-With` = "XMLHttpRequest")
  )
  content(res) %>% html_node("table") %>% html_table(header = TRUE)
}
get_project("P156880")
I am using rvest to scrape a website and download all the data in its tables. Step 1 is working, but I am not getting Step 2 right:
Step 1:
library(rvest)
library(httr)
url <- 'http://www.ahw.gov.ab.ca/IHDA_Retrieval/ihdaData.do'
sess <- html_session(url)
sess %>% follow_link(css = '#content > div > p:nth-child(8) > a') -> sess
sess %>% follow_link(css = '#content > div > table:nth-child(3) > tbody > tr:nth-child(10) > td > a') -> sess
Step 2:
pg_form <- html_form(sess)[[2]]
filled_form <- set_values(pg_form, `displayObject.id` = "1006")
d <- submit_form(session = sess, form = filled_form)
I am not sure how to submit the selected form. Do I need to use Selenium instead of rvest?
You don't need to use RSelenium. You can scrape this particular site using rvest and httr, but it's a little tricky. You need to learn how to send forms in http requests. This requires a bit of exploration of the underlying html and the http requests sent by your web browser.
In your case, the form is actually pretty simple. It only has two fields: a command field, which is always "doSelect", and a displayObject.id, which is a unique number for each selection item, obtained from the "value" attributes of the "option" tags in the html.
Here's how we can look at the drop-downs and their associated ids:
library(tidyverse)
library(rvest)
library(httr)
url <- "http://www.ahw.gov.ab.ca/IHDA_Retrieval/"
paste0(url, "ihdaData.do") %>%
GET() %>%
read_html() %>%
html_node('#content > div > p:nth-child(8) > a') %>%
html_attr("href") %>%
{paste0(url, .)} %>%
GET() %>%
read_html() %>%
html_node('#content > div > table:nth-child(3) > tbody > tr:nth-child(10) > td > a') %>%
html_attr("href") %>%
{paste0(url, .)} %>%
GET() %>%
read_html() -> page
pages <- tibble(id = page %>% html_nodes("option") %>% html_attr("value"),
item = page %>% html_nodes("option") %>% html_text())
pages <- pages[which(pages$item != ""), ]
This gives us a listing of the available items on the page:
pages
#> # A tibble: 8 x 2
#> id item
#> <chr> <chr>
#> 1 724 Human Immunodeficiency Virus (HIV) Incidence Rate (Age Specific)
#> 2 723 Human Immunodeficiency Virus (HIV) Incidence Rate (by Geography)
#> 3 886 Human Immunodeficiency Virus (HIV) Proportion (Ethnicity)
#> 4 887 Human Immunodeficiency Virus (HIV) Proportion (Exposure Cateogory)
#> 5 719 Notifiable Diseases - Age-Sex Specific Incidence Rate
#> 6 1006 Sexually Transmitted Infections (STI) - Age-Sex Specific Case Counts (P~
#> 7 466 Sexually Transmitted Infections (STI) - Age-Sex Specific Rates of Repor~
#> 8 1110 Sexually Transmitted Infections (STI) - Quarterly Congenital Syphilis C~
Now, if we want to select the first one, we just post a list with the required parameters to the correct url, which you can find by checking the developer console in your browser (F12 in Chrome, Firefox or IE). In this case, it is the relative url "selectSubCategory.do".
params <- list(command = "doSelect", displayObject.id = pages$id[1])
next_page <- POST(paste0(url, "selectSubCategory.do"), body = params)
So now next_page contains the html of the page you were looking for. Unfortunately, in this case it is another drop-down selection page.
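For example, here is a rough, untested sketch of repeating the same pattern on that next page; it assumes the next form also uses the command/displayObject.id fields and declares a relative action url, as "selectSubCategory.do" did:
# Sketch only: parse the new drop-down page, list its options, and post the
# next selection to whatever action the form itself declares (assumed relative).
next_html <- content(next_page)
next_action <- next_html %>% html_node("form") %>% html_attr("action")
next_items <- tibble(id = next_html %>% html_nodes("option") %>% html_attr("value"),
                     item = next_html %>% html_nodes("option") %>% html_text())
params <- list(command = "doSelect", displayObject.id = next_items$id[1])
final_page <- POST(paste0(url, next_action), body = params)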
Hopefully by following the methods above, you will be able to navigate the pages well enough to get the data you need.
I'm scraping the reviews of a Google Play app in R, but I can't get the number of votes. Here is the code: likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label"), and I get no value. How can it be done?
Getting the vote counts
FULL CODE
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the fully loaded html of the page
url <- 'https://play.google.com/store/apps/details?id=com.gospace.parenteral&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label")
What it returns:
NA NA NA
What I want it to return:
3 3 2
Maybe you are using SelectorGadget to get the CSS selector. I tried that too, but the selector it returns is not the correct one.
Inspecting the html code of the page, I realized that the correct element is contained in the tag with class = "jUL89d y92BAb", as you can see in this image.
So the code you should use is this one:
html_obj %>% html_nodes('.jUL89d') %>% html_text()
My personal recommendation is to always check the page source to confirm the output of SelectorGadget.
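As a quick follow-up, a small sketch (assuming the scraped text contains just the bare counts) of coercing the result to integers:
# Assumption: html_text() on '.jUL89d' returns plain counts such as "3" "3" "2",
# so they can be coerced straight to integers.
likes <- html_obj %>% html_nodes('.jUL89d') %>% html_text() %>% as.integer()
likes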
I want to be able to get all the reviews that users leave about apps on Google Play. I am using the code given in Web scraping in R through Google playstore, but the problem is that it only gets the first 40 reviews. Is there a way to get all of an app's comments?
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the fully loaded html of the page
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) How many stars they got
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = FALSE)
You can get all the reviews from the Google Play web store.
If you scroll through the reviews, you can see an XHR request is sent to:
https://play.google.com/_/PlayStoreUi/data/batchexecute
With form-data:
f.req: [[["rYsCDe","[[\"com.playrix.homescapes\",7]]",null,"55"]]]
at: AK6RGVZ3iNlrXreguWd7VvQCzkyn:1572317616250
And params of:
rpcids=rYsCDe
f.sid=-3951426241423402754
bl=boq_playuiserver_20191023.08_p0
hl=en
authuser=0
soc-app=121
soc-platform=1
soc-device=1
_reqid=839222
rt=c
After playing around with different parameters, I found out that many are optional, and the request can be simplified as follows (a rough httr sketch of sending it is included after the params):
form-data:
f.req: [[["UsvDTd","[null,null,[2, $sort,[$review_size,null,$page_token]],[$package_name,7]]",null,"generic"]]]
params:
hl=$review_language
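For reference, a rough httr sketch of sending this simplified request from R (untested; the sort value, review size, and page token below are guesses substituted for the placeholders, and the package name is the one from the example above):
library(httr)
# Sketch only: f.req with placeholders filled in ($sort = 2, $review_size = 40,
# $page_token = null, $package_name = "com.playrix.homescapes").
f_req <- paste0('[[["UsvDTd","[null,null,[2,2,[40,null,null]],',
                '[\\"com.playrix.homescapes\\",7]]",null,"generic"]]]')
res <- POST("https://play.google.com/_/PlayStoreUi/data/batchexecute",
            query = list(hl = "en"),
            body = list(f.req = f_req),
            encode = "form")
# The body comes back as text, prefixed with Google's usual )]}' guard string.
raw_text <- content(res, as = "text", encoding = "UTF-8")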
The response is cryptic, but it's essentially JSON data with the keys stripped, similar to protobuf. I wrote a parser for the response that translates it into a regular dict object:
https://gist.github.com/xlrtx/af655f05700eb76bb29aec876493ed90
I'm trying to scrape & download csv files from a webpage with tons of csv's.
Code:
# Libraries
library(rvest)
library(httr)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# The csv's I want are nodes 14 through 378 (the 2018 files)
gdelt_nodes <- seq(from = 14, to = 378, by = 1)
# HTML read / rvest action
link <- url %>%
  read_html() %>%
  html_nodes(paste0("body > ul > li:nth-child(", (gdelt_nodes), ")> a")) %>%
  html_attr("href")
I get this error:
Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Expecting a single string value: [type=character; extent=365].
How do I correctly tell it that I want nodes 14 to 378?
Once I have those links assigned, I'm going to run a quick for loop and download all of the 2018 csv's.
See the comments in the code for the step-by-step solution.
library(rvest)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# Read the page in once then attempt to process it.
page <- url %>% read_html()
#extract the file list
filelist <- page %>% html_nodes("ul li a") %>% html_attr("href")
#filter for files from 2018
filelist <- filelist[grep("2018", filelist)]
#Loop would go here to download all of the pages (a sketch of it follows after this one-file example)
#pause between file downloads and then download a file
Sys.sleep(1)
download.file(paste0("http://data.gdeltproject.org/events/", filelist[1]), filelist[1])
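A minimal sketch of that loop (assuming you want every 2018 file saved under its original name):
# Sketch of the full loop: iterate over the filtered 2018 list, pausing a second
# between requests, and save each archive under its original file name.
for (f in filelist) {
  Sys.sleep(1)
  download.file(paste0("http://data.gdeltproject.org/events/", f), destfile = f, mode = "wb")
}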
I want to scrape review data from the Google Play store for several apps, specifically:
name field
How much star they got
review they wrote
This is a snapshot of the scenario:
#Loading the rvest package
library('rvest')
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
#Using the CSS selector to scrape the name section
Name_data_html <- html_nodes(webpage,'.kx8XBd .X43Kjb')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
but it results in
> head(Name_data)
character(0)
Later, when I tried to dig deeper, I found that Name_data_html has
> Name_data_html
{xml_nodeset (0)}
I am new to web scraping; can anyone help me out with this?
You should use XPath to select the object on the web page:
#Loading the rvest package
library('rvest')
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
# Using Xpath
Name_data_html <- webpage %>% html_nodes(xpath='/html/body/div[1]/div[4]/c-wiz/div/div[2]/div/div[1]/div/c-wiz[1]/c-wiz[1]/div/div[2]/div/div[1]/c-wiz[1]/h1/span')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
See how to get the path in this picture:
After analyzing your code and the source of the URL you posted, I think the reason you are unable to scrape anything is that the content is generated dynamically, so rvest cannot see it.
Here is my solution:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the fully loaded html of the page
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) How many stars they got
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = F)
In my solution, I'm using RSelenium, which is able to load the webpage as if you were navigating to it (instead of just downloading it like rvest does). This way, all the dynamically generated content is loaded, and once it is loaded, you can retrieve it with rvest and scrape it.
If you have any doubts about my solution, just tell me!
Hope it helped!