R: Scrape many zipped CSVs and download to local machine

I'm trying to scrape and download CSV files from a webpage that lists a large number of CSVs.
Code:
# Libraries
library(rvest)
library(httr)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# The CSVs I want are list items 14 through 378 (the 2018 files)
gdelt_nodes <- seq(from = 14, to = 378, by = 1)
# HTML read / rvest action
link <- url %>%
  read_html() %>%
  html_nodes(paste0("body > ul > li:nth-child(", gdelt_nodes, ") > a")) %>%
  html_attr("href")
I get this error:
Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Expecting a single string value: [type=character; extent=365].
How do I tell it I want the nodes 14 to 378 correctly?
Once I have those links assigned, I'm going to run a quick for loop and download all of the 2018 CSVs.

See the comments in the code for the step-by-step solution.
library(rvest)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# Read the page in once then attempt to process it.
page <- url %>% read_html()
# Extract the file list
filelist <- page %>% html_nodes("ul li a") %>% html_attr("href")
# Filter for files from 2018
filelist <- filelist[grep("2018", filelist)]
# A loop would go here to download all of the files
# Pause between file downloads, then download a file
Sys.sleep(1)
download.file(paste0("http://data.gdeltproject.org/events/", filelist[1]), filelist[1])
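A minimal sketch of what that download loop could look like, fetching every matched 2018 file into the working directory (the mode = "wb" argument and the one-second pause are my additions, not part of the original answer):
library(rvest)

page <- read_html("http://data.gdeltproject.org/events/index.html")
filelist <- page %>% html_nodes("ul li a") %>% html_attr("href")
filelist <- filelist[grep("2018", filelist)]

# Download each 2018 file, pausing between requests to be polite to the server
for (f in filelist) {
  Sys.sleep(1)
  download.file(
    url      = paste0("http://data.gdeltproject.org/events/", f),
    destfile = f,
    mode     = "wb"  # binary mode so the zipped CSVs are not corrupted on Windows
  )
}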

Related

How to skip an error in a for loop in R

I want to scrape the URLs of the pictures in a list of web pages. I tried the following code:
library(rvest)
pic_flat <- data.frame()
for (i in 7:60) {
  # creating a loop over the page urls
  link <- paste0("https://www.immobilienscout24.at/regional/wien/wien/wohnung-kaufen/seite-", i)
  page <- read_html(link)
  # scraping the href and creating a full url
  href <- page %>% html_elements("a.YXjuW") %>% html_attr('href')
  apt_link <- paste0("https://www.immobilienscout24.at", href)
  pic_flat <- rbind(pic_flat, data.frame(apt_link))
}
# get the link to the apartment picture
apt_pic <- data.frame()
apt <- pic_flat$apt_link
for (x in apt) {
  picture <- read_html(x) %>% html_element(".CmhTt") %>% html_attr("src")
  apt_pic <- rbind(apt_pic, data.frame(picture))
}
df_pic <- cbind(pic_flat, data.frame(apt_pic))
But some web pages crash in the middle of the iteration. For example:
Error in open.connection(x, "rb") : HTTP error 502.
So I want to skip those web pages and continue with the next web page, scraping the available picture URLs into my data frame. How can I use the tryCatch function, or any other method, to accomplish this task?
We can create a function and then use tryCatch or purrr::possibly to skip the errors.
First, create a function f1 that gets the picture link for one apartment page:
library(purrr)  # possibly() comes from purrr
# function f1
f1 <- function(x) {
  x %>% read_html() %>% html_element(".CmhTt") %>% html_attr("src")
}
apt <- pic_flat$apt_link
# now loop over the links, skipping errors: pages that fail return NA
apt_pic <- lapply(apt, possibly(f1, NA))
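For reference, a minimal sketch of the same loop using base R's tryCatch instead of purrr::possibly (safe_scrape is a hypothetical helper of mine, not from the original answer):
library(rvest)

# Hypothetical helper: returns NA for any page that errors (e.g. HTTP 502)
safe_scrape <- function(x) {
  tryCatch(
    read_html(x) %>% html_element(".CmhTt") %>% html_attr("src"),
    error = function(e) NA_character_
  )
}

apt_pic <- lapply(apt, safe_scrape)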

How to deal with HTTP error 504 when scraping data from hundreds of webpages?

I am trying to scrape voting data from the website of the Russian parliament. I am working with nearly 600 webpages, and I am trying to scrape data from within those pages as well. Here is the code I have written thus far:
# load packages
library(rvest)
library(purrr)
library(stringr)  # for str_trim()
library(tibble)   # for tibble()
library(writexl)
# base urls for all result pages
base_url <- sprintf("http://vote.duma.gov.ru/?convocation=AAAAAAA6&sort=date_asc&page=%d", 1:789)
# loop over pages
map_df(base_url, function(i) {
  pg <- read_html(i)
  tibble(
    title = html_nodes(pg, ".item-left a") %>% html_text() %>% str_trim(),
    link = html_elements(pg, ".item-left a") %>%
      html_attr("href") %>%
      paste0("http://vote.duma.gov.ru", .)
  )
}) -> duma_votes_data
The above code executed successfully. This results in a df containing the titles and links. I am now trying to extract the date information. Here is the code I have written for that:
# extract date of vote
duma_votes_data$date <- map(duma_votes_data$link, ~ {
  .x %>%
    read_html() %>%
    html_nodes(".date-p span") %>%
    html_text() %>%
    paste(collapse = " ")
})
After running this code, I receive the following error:
Error in open.connection(x, "rb") : HTTP error 504.
What is the best way to get around this issue? I have read about incorporating Sys.sleep() into my code, but I am not sure where it should go. Note that this code is for all 789 pages, as indicated in base_url. The code does work with around 40 pages, so worst case I could do everything in small chunks and combine the resulting dfs into a single df.
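A minimal sketch of where Sys.sleep() and some simple retry logic could go in the date-extraction step; get_date_safe, the pause length, and the retry count are illustrative assumptions of mine, not from the original post:
library(rvest)
library(purrr)

# Hypothetical helper: fetch the vote date for one link, retrying a few
# times on transient errors (e.g. HTTP 504) and pausing between requests
get_date_safe <- function(link, tries = 3, pause = 1) {
  for (attempt in seq_len(tries)) {
    Sys.sleep(pause)  # be gentle with the server between requests
    result <- tryCatch(
      link %>%
        read_html() %>%
        html_nodes(".date-p span") %>%
        html_text() %>%
        paste(collapse = " "),
      error = function(e) NULL
    )
    if (!is.null(result)) return(result)
  }
  NA_character_  # give up after the last attempt
}

duma_votes_data$date <- map_chr(duma_votes_data$link, get_date_safe)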

How to perform web scraping to get all the reviews of an app in Google Play?

I want to get all the reviews that users leave on Google Play about an app. I am using the code suggested in Web scraping in R through Google playstore, but the problem is that it only gets the first 40 reviews. Is there a way to get all the reviews of the app?
#Loading the rvest package
library(rvest)
library(magrittr)  # for the '%>%' pipe symbol
library(RSelenium) # to get the fully loaded html
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) how many stars they gave
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = FALSE)
You can get all the reviews from the Google Play web store.
If you scroll through the reviews, you can see an XHR request being sent to:
https://play.google.com/_/PlayStoreUi/data/batchexecute
With form-data:
f.req: [[["rYsCDe","[[\"com.playrix.homescapes\",7]]",null,"55"]]]
at: AK6RGVZ3iNlrXreguWd7VvQCzkyn:1572317616250
And params of:
rpcids=rYsCDe
f.sid=-3951426241423402754
bl=boq_playuiserver_20191023.08_p0
hl=en
authuser=0
soc-app=121
soc-platform=1
soc-device=1
_reqid=839222
rt=c
After playing around with different parameters, I found that many of them are optional and the request can be simplified to:
form-data:
f.req: [[["UsvDTd","[null,null,[2, $sort,[$review_size,null,$page_token]],[$package_name,7]]",null,"generic"]]]
params:
hl=$review_language
The response is cryptic, but it's essentially JSON data with the keys stripped, similar to protobuf. I wrote a parser for the response that translates it into a regular dict object:
https://gist.github.com/xlrtx/af655f05700eb76bb29aec876493ed90
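The linked parser is written in Python; as a rough sketch of how the simplified request described above could be issued from R with httr, something like the following might work (the concrete values filling the answer's $sort, $review_size, $page_token, and $package_name placeholders are assumptions for illustration, and the response still needs to be parsed):
library(httr)

# Simplified batchexecute request described above; placeholders filled with
# example values -- these are assumptions, not verified parameters
f_req <- '[[["UsvDTd","[null,null,[2,2,[40,null,null]],[\\"com.playrix.homescapes\\",7]]",null,"generic"]]]'

resp <- POST(
  "https://play.google.com/_/PlayStoreUi/data/batchexecute",
  query  = list(hl = "en"),
  body   = list(f.req = f_req),
  encode = "form"
)

# The payload is the cryptic stripped-key JSON mentioned above; it still has
# to be parsed, e.g. with jsonlite, along the lines of the linked gist
raw_txt <- content(resp, as = "text", encoding = "UTF-8")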

Web scraping in R through Google playstore

I want to scrape review data for several apps from the Google Play store. For each review I want:
the name field
how many stars they gave
the review they wrote
Here is a snapshot of the scenario:
#Loading the rvest package
library(rvest)
#Specifying the url for the desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
#Using a CSS selector to scrape the name section
Name_data_html <- html_nodes(webpage, '.kx8XBd .X43Kjb')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
but it results in:
> head(Name_data)
character(0)
Digging further, I found that Name_data_html contains:
> Name_data_html
{xml_nodeset (0)}
I am new to web scraping; can anyone help me out with this?
You should use XPath to select the object on the web page:
#Loading the rvest package
library('rvest')
#Specifying the url for the desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
# Using Xpath
Name_data_html <- webpage %>% html_nodes(xpath='/html/body/div[1]/div[4]/c-wiz/div/div[2]/div/div[1]/div/c-wiz[1]/c-wiz[1]/div/div[2]/div/div[1]/c-wiz[1]/h1/span')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
The XPath can be copied from the element in your browser's developer tools (right-click the element, Inspect, then Copy XPath).
After analyzing your code and the source of the URL you posted, I think the reason you are unable to scrape anything is that the content is generated dynamically, so rvest cannot see it.
Here is my solution:
#Loading the rvest package
library(rvest)
library(magrittr)  # for the '%>%' pipe symbol
library(RSelenium) # to get the fully loaded html
#Specifying the url for the desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) how many stars they gave
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = FALSE)
In my solution, I'm using RSelenium, which loads the webpage as if you were navigating to it (instead of just downloading it like rvest does). This way all the dynamically generated content is loaded, and once it is loaded you can retrieve it with rvest and scrape it.
If you have any doubts about my solution, just tell me!
Hope it helped!
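Related to the earlier question about getting more than the first 40 reviews: once RSelenium has the page open, you could also scroll before grabbing the source so that more reviews are lazily loaded. A rough sketch, assuming remDr is already connected as above (the scroll count and pause are arbitrary choices of mine):
# Scroll to the bottom a few times so additional reviews are loaded
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give the page time to fetch the next batch of reviews
}
# Then take the page source as before
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()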

Need help extracting the first Google search result using html_node in R

I have a list of hospital names for which I need to extract the first Google search result URL. Here is the code I'm using:
library(rvest)
library(urltools)
library(RCurl)
library(httr)
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  result <- results[1]
  return(as.character(result))
}

# c is the character vector of hospital names to search for
websites <- data.frame(Website = sapply(c, getWebsite))
View(websites)
For short URLs this code works fine, but when the link is long and appears in R with "..." (e.g. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html), it shows up in the data frame the same way, with the "...". How can I extract the actual URLs without the "..."? I appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
html_nodes(".r a") %>% # get the a nodes with an r class
html_attr("href") # get the href attributes
#clean the text
links = gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
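As a rough sketch, the same cleaning step could be folded back into the getWebsite() function from the question so that the full, untruncated URL of the first result is returned; hospital_names is a hypothetical stand-in for the question's vector of names:
library(rvest)

# Hypothetical rewrite of the question's getWebsite(): return the first full
# result URL instead of the truncated text shown in the <cite> elements
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  links <- read_html(url) %>%
    html_nodes(".r a") %>%
    html_attr("href")
  # Keep only redirect links, strip "/url?q=" and anything after the first "&"
  links <- gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
  links[1]
}

hospital_names <- c("Northwestern Memorial Hospital")  # example input, not from the post
websites <- data.frame(Website = sapply(hospital_names, getWebsite))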
