htmlParse errors when accessing Google search. Is there an alternative approach? - R

I am trying to obtain the number of results returned by specific Google searches.
For example, for stackoverflow there are "About 28,200,000 results (0.12 seconds)".
Normally I would use the xpathSApply function from the XML R package, but I am getting errors and am not sure how to solve them, or whether there is an alternative approach.
library(XML)
googleURL <- "https://www.google.ca/search?q=stackoverflow"
googleInfo <- htmlParse(googleURL, isURL = TRUE)
Error: failed to load external entity "https://www.google.ca/search?q=stackoverflow"
# use of RCurl, which I am not that familiar with
library(RCurl)
getURL(googleURL)
#Error in function (type, msg, asError = TRUE) :
#SSL certificate problem, verify that the CA cert is OK. Details:
#error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
# final effort
library(httr)
x <- GET(googleURL)
# no error but am not sure how to proceed
# the relevant HTML code to parse is
# <div id=resultStats>About 28,200,000 results<nobr> (0.12 seconds) </nobr></div>
Any help in solving the errors or parsing the httr object would be much appreciated.

You are asking for a secure (https) connection:
https://www.google.ca/search?q=stackoverflow
XML complains about this, as does RCurl; httr will download the page.
XML: ask for an unsecured (http) connection instead
library(XML)
googleURL <- "http://www.google.ca/search?q=stackoverflow"
googleInfo <- htmlParse(googleURL, isURL = TRUE)
xpathSApply(googleInfo, '//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
RCurl: use ssl.verifypeer = FALSE, though it worked without it for me
library(RCurl)
googleURL <- "https://www.google.ca/search?q=stackoverflow"
googleInfo <- getURL(googleURL, ssl.verifypeer = FALSE)
googleInfo <- htmlParse(googleInfo)
# or if you want to use a cert
# cert <- system.file("CurlSSL/cacert.pem", package = "RCurl")
# googleInfo <- getURL(googleURL, cainfo = cert)
# googleInfo <- htmlParse(googleInfo)
xpathSApply(googleInfo, '//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
httr: use content() to extract the page text
library(httr)
x <- GET(googleURL)
googleInfo <- htmlParse(content(x, as = 'text'))
xpathSApply(googleInfo, '//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
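To get the count itself, a minimal follow-up sketch (assuming the node text looks like the output above; resultText and resultCount are names introduced here for illustration):
# take the node's text, pull out the first run of digits and commas, drop the commas
resultText <- xpathSApply(googleInfo, '//*/div[@id="resultStats"]', xmlValue)
resultCount <- as.numeric(gsub(",", "", regmatches(resultText, regexpr("[0-9,]+", resultText))))
resultCount
#[1] 28200000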

Related

Adding a user agent when scraping an API using jsonlite / fromJSON

I've started receiving 429 errors from the script below. The API I'm scraping requires a user-agent to be specified.
I'm at a loss for how to specify a user-agent header with the package I am using. The attempts I made using RCurl::getURL produced errors as well.
Using options(HTTPUserAgent = "what google returns when I search my user agent") did not fix the 429 problem.
API documentation linked below.
https://docs.helium.com/api/blockchain/introduction/#specify-a-user-agent
library(jsonlite)
library(anytime)  # needed for anytime() in the loop below
blocks_api <- 'https://api.helium.io/v1/blocks'
blocks <- fromJSON(blocks_api)
endTime <- Sys.Date()
blockMax_api <- paste0(blocks_api,"/height","/?max_time=",endTime)
blockMax_ep <- fromJSON(blockMax_api)
blockMax <- max(blockMax_ep$data$height)
startTime <- Sys.Date() - 1
blockMin_api <- paste0(blocks_api,"/height","/?max_time=",startTime)
blockMin_ep <- fromJSON(blockMin_api)
blockMin <- blockMin_ep$data$height
period_blocks <- blockMax - blockMin
blockTimes <- data.frame()
oraclePrice <- 'https://api.helium.io/v1/oracle/prices'
for (i in blockMin:blockMax) {
  block_n <- fromJSON(paste0(blocks_api, "/", i))
  block_n <- as.data.frame(block_n)
  block_n$data.time <- anytime(block_n$data.time)
  block_n <- block_n[, c(2, 5, 6)]
  oracleBlockPrice <- fromJSON(paste0(oraclePrice, "/", i))
  block_n$HNTprice <- oracleBlockPrice$data$price / 100000000
  blockTimes <- rbind(blockTimes, block_n)
  Sys.sleep(1)
}
This is how the author of jsonlite sets the user-agent inside fromJSON. Change the useragent value to the text that you want:
h <- curl::new_handle(useragent = paste("jsonlite /", R.version.string))
curl::handle_setheaders(h, Accept = "application/json, text/*, */*")
txt <- curl::curl(url, handle = h)  # url is your API endpoint
And then call fromJSON on the connection:
fromJSON(txt)
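Applied to the Helium endpoint from the question, a minimal sketch might look like this (the user-agent string is a placeholder; the API docs ask you to identify your app and contact details):
library(jsonlite)
# a curl handle whose requests carry an identifying user-agent (placeholder value)
h <- curl::new_handle(useragent = "my-r-script/1.0 (you@example.com)")
curl::handle_setheaders(h, Accept = "application/json, text/*, */*")
con <- curl::curl("https://api.helium.io/v1/blocks", handle = h)
blocks <- fromJSON(con)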

Failed to load HTTP resource when parsing XML in R

I am trying to use the COVID-19 API at the URL below,
and the last line of code produces this error:
error 1: failed to load HTTP resource
Is this a problem with my code, or with the website's server?
apiURL <- "http://openapi.data.go.kr/openapi/service/rest/Covid19/getCovid19InfStateJson"
operation <- "Covid19InfStateJson"
api_key <- "apikey"
numOfRows <- 4
pageNo <- 1
startCreateDt <- 30
endCreateDt <- 30
url <- paste0(apiURL, operation,
              "?api_key=", api_key,
              "&numOfRows=", numOfRows,
              "&pageNo=", pageNo,
              "&startCreateDt=", startCreateDt,
              "&endCreateDt=", endCreateDt)
library(XML)
xmlFile <- xmlParse(url)
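One hedged way to tell the two cases apart is to fetch the raw response with httr first, so the HTTP status code is visible before any XML parsing:
library(httr)
library(XML)
resp <- GET(url)
status_code(resp)  # 4xx suggests the request (e.g. key or parameters); 5xx suggests the server
xmlFile <- xmlParse(content(resp, as = "text", encoding = "UTF-8"), asText = TRUE)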

jsonlite suddenly returning error: "Failure when receiving data from the peer"

Suddenly, over the weekend, my code is no longer working.
When I run it, I receive the following message:
Error in parse_con(txt, bigint_as_char) :
Failure when receiving data from the peer
The code is the following:
raiz <- "https://olinda.bcb.gov.br/olinda/servico/Expectativas/versao/v1/odata/"
tipo <- "ExpectativaMercadoMensais?%24format=json&%24select="
indicador <- "Indicador,Data,DataReferencia,Mediana,numeroRespondentes"
restricao <- "&%24orderby=Data%20desc&%24filter=Indicador%20eq%20'IPCA'&%24top=10"
library("jsonlite")
jsonlite::fromJSON(paste0(raiz,tipo,indicador,restricao), simplifyVector = FALSE)
There is a problem with the GET function that jsonlite uses to read the website. Use readLines instead.
raiz <- "https://olinda.bcb.gov.br/olinda/servico/Expectativas/versao/v1/odata/"
tipo <- "ExpectativaMercadoMensais?%24format=json&%24select="
indicador <- "Indicador,Data,DataReferencia,Mediana,numeroRespondentes"
restricao <- "&%24orderby=Data%20desc&%24filter=Indicador%20eq%20'IPCA'&%24top=10"
library("jsonlite")
web <- readLines(paste0(raiz,tipo,indicador,restricao), warn = FALSE)
df <- jsonlite::fromJSON(web, simplifyVector = FALSE)
I didn't understand your query, but here is one that works:
web <- readLines("https://olinda.bcb.gov.br/olinda/servico/Expectativas/versao/v1/odata/ExpectativasMercadoInflacao12Meses?$format=json", warn = FALSE)
df <- fromJSON(web)
df$value
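If readLines also misbehaves, a similar hedged workaround is to download with httr and hand the text to fromJSON:
library(httr)
library(jsonlite)
resp <- GET(paste0(raiz, tipo, indicador, restricao))
df <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), simplifyVector = FALSE)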

Not Found (HTTP 404) twitteR loop error

I am aiming to build a Twitter network by constructing an igraph from Twitter relationships.
After 10 iterations, R gives me this error:
Error in twInterfaceObj$doAPICall(paste("users", "show", sep = "/"), params = params, :
Not Found (HTTP 404).
library(twitteR)
# Grab latest tweets
tweets_galway <- searchTwitter('#galway', n=100)
# make into df
df <- do.call("rbind", lapply(tweets_galway, as.data.frame))
# extract users from galway hashtag tweets
galway_users = df$screenName
connectiondf=data.frame()
for (i in 1:100) {
  name <- getUser(galway_users[i])
  following <- name$getFriends()
  following <- twListToDF(following)
  connect1 <- cbind(follower = galway_users[i], following = following$screenName)
  follower <- name$getFollowers()
  follower <- twListToDF(follower)
  connect2 <- cbind(follower = follower$screenName, following = galway_users[i])
  connection <- rbind(connect1, connect2)
  connectiondf <- rbind(connectiondf, connection)
  print(i)
}
Why am I getting this error? Apologies if this is a silly query, but I am new to R.
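A hedged guess rather than a confirmed diagnosis: a 404 from the users/show endpoint usually means one of the screen names no longer resolves to an account (deleted, suspended, or renamed). A sketch that skips such users instead of stopping the loop:
# inside the loop, replace name <- getUser(galway_users[i]) with:
name <- tryCatch(getUser(galway_users[i]), error = function(e) NULL)
if (is.null(name)) next  # user lookup failed (e.g. 404); skip to the next screen name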

R - Parallel Processing and ldply error

I am trying to use the code below to make API calls in a parallel process to speed up the calls. (I know this isn't the best way to speed up API calls, but it works.)
It only fails when I try to run it in parallel; otherwise it works. In the ldply function I am getting the error below:
Error in do.ply(i) :
task 1 failed - "object of type 'closure' is not subsettable"
In addition:
Warning messages:
1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
Any help would be appreciated!
library(httr)      # GET()
library(jsonlite)  # fromJSON()
library(plyr)      # ldply()
library(doSNOW)    # registerDoSNOW(); also attaches snow for makeCluster()
library(dplyr)     # %>%
One <- 26
cl <- makeCluster(4)
registerDoSNOW(cl)
func.time <- Sys.time()
## API CALL ONE FOR "kline"
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=",pairs[1],"&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline <- data.frame(text_content %>% fromJSON())
kline$symbol <- pairs[1]
## API FUNCTION TO BE APPLIED FOR REST
loopfunction <- function(i) {
  url <- "https://api.binance.com"
  path <- paste("/api/v1/klines?symbol=", pairs[i], "&interval=1m&limit=1", sep = "")
  raw.results <- GET(url = url, path = path)
  text_content <- content(raw.results, as = "text", encoding = "UTF-8")
  kline_temp <- data.frame(text_content %>% fromJSON())
  kline_temp$symbol <- pairs[i]
  kline <- rbind(kline, kline_temp)
  return(kline)
}
## DPLY PARALLEL FUNCTION
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T, .paropts = c("httr", "jsonlite", "dplyr"))) ## "One" is a variable created earlier
stopCluster(cl)
func.end.time <- Sys.time()
func.tot.time <- func.end.time - func.time
Your question isn't fully reproducible, so the following is an educated guess.
Your loopfunction() references an object called pairs. It seems from your script that a variable called pairs is defined somewhere in your local environment. However, when loopfunction() is passed to ldply(), it no longer has access to that variable (ordinarily, it would, but parallelization requires fresh R environments to be created). Having failed to find an object called pairs in the environment, R continues searching, and finds a match in stats::pairs(). This is a plotting function, not a subsettable object like a vector or data frame. Hence the error message, "object of type 'closure' is not subsettable".
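You can reproduce the message in a fresh session, where pairs resolves only to the plotting function:
pairs[1]
# Error in pairs[1] : object of type 'closure' is not subsettable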
I'm not especially familiar with how ldply implements parallel processing, but you could probably modify your function definition like this:
loopfunction <- function(i, pairs) {
...[body of function]...
}
And pass pairs as an extra parameter in your ldply call:
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, pairs = pairs, .parallel = T, .paropts = list(.packages = c("httr", "jsonlite", "dplyr"))))
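Note the corrected .paropts as well: plyr forwards it as a list of arguments to foreach(), and foreach's .packages argument is what loads httr, jsonlite, and dplyr on each worker; a bare character vector, as in the original call, is not interpreted that way.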
