Failed to load HTTP resource when parsing XML in R

I am trying to use the COVID-19 API at the URL below, and the last line of code produces the following error:
error 1: failed to load HTTP resource
Is this a problem with my code, or with the website's server?
apiURL <- "http://openapi.data.go.kr/openapi/service/rest/Covid19/getCovid19InfStateJson"
operation <- "Covid19InfStateJson"
api_key <- "apikey"
numOfRows <- 4
pageNo <- 1
startCreateDt <- 30
endCreateDt <- 30
url <- paste0(apiURL,
              operation,
              paste0("?api_key=", api_key),
              paste0("&numOfRows=", numOfRows),
              paste0("&pageNo=", pageNo),
              paste0("&startCreateDt=", startCreateDt),
              paste0("&endCreateDt=", endCreateDt))
library(XML)
xmlFile <- xmlParse(url)
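One way to see what the server actually returns before xmlParse() gets involved is to fetch the URL with httr and check the status code first. This is only a diagnostic sketch: it assumes the service expects a serviceKey parameter (as is common for data.go.kr services), and note that operation is appended to apiURL even though apiURL already ends in getCovid19InfStateJson, so the request path may not be what you intend.

library(httr)
library(XML)

# Build the request with an explicit query list instead of string pasting
resp <- GET(apiURL,
            query = list(serviceKey    = api_key,   # parameter name is an assumption; check the API docs
                         numOfRows     = numOfRows,
                         pageNo        = pageNo,
                         startCreateDt = startCreateDt,
                         endCreateDt   = endCreateDt))

status_code(resp)   # anything other than 200 points at the key / server rather than the XML parsing
xmlFile <- xmlParse(content(resp, as = "text", encoding = "UTF-8"))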

Related

Getting an error using tuber to extract all comments from a vector of YouTube video IDs with map_df()

I'm trying to extract all comments from a list of YouTube videos using the tuber package.
I can extract comments from a single video ID using the following code:
library(tuber)
client_id <- "[my_client_id]"          # placeholder
client_secret <- "[my_client_secret]"  # placeholder
yt_oauth(app_id = client_id, app_secret = client_secret, token = "")
video_id <- "arjHXHHQkQs"
test_comments <- get_all_comments(video_id = video_id)
I have a vector of 29 YouTube IDs and I'm trying to use the map_df() function to iterate over every ID in the vector and apply get_all_comments(), but I keep getting an error:
id <- c("C5OLDKq_CfI", "Y26MWDh8u3Y", "0HQyjY8I830", "AGBX-AHKDfk", "YuA59DKabVs")
comments_getter <- function(id) {
  tuber::get_all_comments(video_id = id)
}
comments_raw <- purrr::map(.x = id, .f = comments_getter)
Error: HTTP failure: 401
Called from: tuber_check(req)
And from the debug viewer I get this
function (req)
{
  if (req$status_code < 400)
    return(invisible())
  stop("HTTP failure: ", req$status_code, "\n", call. = FALSE)
}
Is this related to a rate limit for the API, or is there an error in my code?
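One way to narrow this down, sketched under the assumption that a single video is rejecting the request (for example because its comments are disabled) rather than the whole quota being exhausted: wrap the call with purrr::possibly() so one failing ID does not abort the loop, then look at which IDs came back empty.

library(tuber)
library(purrr)

# possibly() returns NULL instead of stopping when get_all_comments() errors out
safe_getter <- possibly(function(id) tuber::get_all_comments(video_id = id),
                        otherwise = NULL)

comments_raw <- map(id, safe_getter)
names(comments_raw) <- id

# IDs that failed (e.g. with HTTP 401/403) show up as NULL entries
failed_ids <- names(comments_raw)[map_lgl(comments_raw, is.null)]
failed_ids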

Adding a user agent when scraping an API using jsonlite / fromJSON

I've started receiving 429 errors for the script below. The API I'm scraping requires a user agent to be specified.
I'm at a loss for how to specify a user-agent header with the package I am using. The attempts I made using RCurl::getURL produced errors as well.
Using options(HTTPUserAgent = "what google returns when I search my user agent") did not fix the 429 problem.
API documentation linked below.
https://docs.helium.com/api/blockchain/introduction/#specify-a-user-agent
library(jsonlite)
library(anytime)  # anytime() is used below to convert block timestamps

blocks_api <- 'https://api.helium.io/v1/blocks'
blocks <- fromJSON(blocks_api)

endTime <- Sys.Date()
blockMax_api <- paste0(blocks_api, "/height", "/?max_time=", endTime)
blockMax_ep <- fromJSON(blockMax_api)
blockMax <- max(blockMax_ep$data$height)

startTime <- Sys.Date() - 1
blockMin_api <- paste0(blocks_api, "/height", "/?max_time=", startTime)
blockMin_ep <- fromJSON(blockMin_api)
blockMin <- blockMin_ep$data$height

period_blocks <- blockMax - blockMin
blockTimes <- data.frame()
oraclePrice <- 'https://api.helium.io/v1/oracle/prices'

for (i in blockMin:blockMax) {
  block_n <- fromJSON(paste0(blocks_api, "/", i))
  block_n <- as.data.frame(block_n)
  block_n$data.time <- anytime(block_n$data.time)
  block_n <- block_n[, c(2, 5, 6)]
  oracleBlockPrice <- fromJSON(paste0(oraclePrice, "/", i))
  block_n$HNTprice <- oracleBlockPrice$data$price / 100000000
  blockTimes <- rbind(blockTimes, block_n)
  Sys.sleep(1)
}
This is how the author of jsonlite sets the user agent before calling fromJSON(). Change the useragent string to the text that you want:
h <- curl::new_handle(useragent = paste("jsonlite /", R.version.string))
curl::handle_setheaders(h, Accept = "application/json, text/*, */*")
txt <- curl::curl(blocks_api, handle = h)  # open a connection with the custom handle
Then pass the connection to fromJSON():
fromJSON(txt)
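An alternative sketch using httr instead of curl handles (not part of the original script, just one way to do it): user_agent() attaches the header to the request, and you can check for a 429 and back off before parsing.

library(httr)
library(jsonlite)

ua <- user_agent("my-helium-client/0.1 (contact@example.com)")  # hypothetical user-agent string

resp <- GET(blocks_api, ua)
if (status_code(resp) == 429) {
  Sys.sleep(10)        # back off briefly and retry once if rate limited
  resp <- GET(blocks_api, ua)
}
blocks <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))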

Page limit using rvest

I'm having an issue when using rvest to scrape 466 pages from a wiki. Each page represents a metric that I need further information about. I have the following code, which loops through each link (loaded from a CSV file) and extracts the information I need from an HTML table on each page.
library(rvest)

Metrics <- read.csv("C:\\Users\\me\\Documents\\WebScraping\\LONMetrics.csv")
Metrics$Theme  <- as.character(paste0(Metrics$Theme))
Metrics$Metric <- as.character(paste0(Metrics$Metric))
Metrics$URL    <- as.character(paste0(Metrics$URL))
n <- nrow(Metrics)
i <- 1
while (i <= n) {
  webPage   <- read_html(Metrics$URL[i])
  pageTable <- html_table(webPage)
  Metrics$Definition[i]    <- pageTable[[1]]$X2[1]
  Metrics$Category[i]      <- pageTable[[1]]$X2[2]
  Metrics$Calculation[i]   <- pageTable[[1]]$X2[3]
  Metrics$UOM[i]           <- pageTable[[1]]$X2[4]
  Metrics$ExpectedTrend[i] <- pageTable[[1]]$X2[6]
  Metrics$MinTech[i]       <- pageTable[[1]]$X2[7]
  i <- i + 1
}
The problem I'm having is that it stops returning data after 32 pages, giving this error:
Error in read_connection_(x, n) :
Evaluation error: Failure when receiving data from the peer
I'm wondering what the cause may be and how to get around this apparent limitation?
Thanks.
Rob
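A sketch of one common workaround, assuming the wiki is dropping connections under sustained load rather than enforcing a hard page limit: pause between requests and retry a page a few times before giving up. read_with_retry() is a hypothetical helper, not part of rvest.

library(rvest)

read_with_retry <- function(url, tries = 3, pause = 2) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(pause * attempt)   # back off a little longer after each failure
  }
  stop("Failed to read ", url, " after ", tries, " attempts")
}

# Inside the while loop, replace read_html(Metrics$URL[i]) with:
#   webPage <- read_with_retry(Metrics$URL[i])
# and add Sys.sleep(1) at the end of each iteration to be polite to the server.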

Using GET in a loop

I am using the following code. I create a list of first names, generate a link to an API for each name, and then try to capture the data from each link.
mydata$NameGenderURL2 <- paste("https://gender-api.com/get?name=", mydata$firstname,
                               "&key=suZrzhrNJRvrkWFXAG", sep = "")

mynamegenderfunction <- function(x) {
  GET(url = mydata$NameGenderURL2[x])
  this.raw.content <- genderdata$content
  this.raw.content <- rawToChar(genderdata$content)
  this.content <- fromJSON(this.raw.content)
  name1[x] <- this.content$name
  gender1[x] <- this.content$gender
}

namelist <- mydata$firstname[1:100]
genderdata <- lapply(namelist, mynamegenderfunction)
Oddly enough I receive the following message:
Error in curl::curl_fetch_memory(url, handle = handle) :
  Could not resolve host: NA
I tried another API and got the same issue. Any suggestions?
Here is a data sample:
namesurl
https://api.genderize.io/?name=kaan
https://api.genderize.io/?name=Joan
https://api.genderize.io/?name=homeblitz
https://api.genderize.io/?name=Flatmax
https://api.genderize.io/?name=BRYAN
https://api.genderize.io/?name=James
https://api.genderize.io/?name=Dion
https://api.genderize.io/?name=Flintu
https://api.genderize.io/?name=Adriana
The output that I need is the gender for each link, which would be: Male / Female / Null.
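A sketch of what the loop could look like if the issue is that lapply() passes the first names into the function, so mydata$NameGenderURL2[x] is indexed by a name rather than a position (which yields NA, hence "Could not resolve host: NA"), and that the GET() result is never actually used. get_gender() is a hypothetical helper based on the code above.

library(httr)
library(jsonlite)

get_gender <- function(url) {
  resp <- GET(url)                              # keep the response instead of discarding it
  parsed <- fromJSON(rawToChar(resp$content))   # parse the JSON body of this request
  data.frame(name = parsed$name, gender = parsed$gender,
             stringsAsFactors = FALSE)
}

# Iterate over the URLs themselves, so nothing is looked up by name
genderdata <- lapply(mydata$NameGenderURL2[1:100], get_gender)
genderdata <- do.call(rbind, genderdata)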

htmlParse errors when accessing a Google search. Is there an alternative approach?

I am trying to obtain the number of results returned by specific Google searches.
For example, for stackoverflow there are "About 28,200,000 results (0.12 seconds)".
Normally I would use the xpathSApply function from the XML R package, but I am having errors and am not sure how to solve them, or whether there is an alternative approach.
library(XML)
googleURL <- "https://www.google.ca/search?q=stackoverflow"
googleInfo <- htmlParse(googleURL, isURL = TRUE)
Error: failed to load external entity "https://www.google.ca/search?q=stackoverflow"
#use of RCurl which I am not that familiar with
library(RCurl)
getURL(googleURL)
#Error in function (type, msg, asError = TRUE) :
#SSL certificate problem, verify that the CA cert is OK. Details:
#error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
# final effort
library(httr)
x <- GET(googleURL)
# no error but am not sure how to proceed
# the relevant HTML code to parse is
# <div id=resultStats>About 28,200,000 results<nobr> (0.12 seconds) </nobr></div>
Any help in solving these errors or parsing the httr object would be much appreciated.
You are asking for a secure (https) connection:
https://www.google.ca/search?q=stackoverflow
XML complains about this, as does RCurl; httr will download the page.
XML: ask for an unsecured (http) connection instead
library(XML)
googleURL <- "http://www.google.ca/search?q=stackoverflow"
googleInfo <- htmlParse(googleURL, isURL = TRUE)
xpathSApply(googleInfo, '//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
RCurl: use ssl.verifypeer = FALSE, though it worked without it for me
library(RCurl)
googleURL <- "https://www.google.ca/search?q=stackoverflow"
googleInfo <- getURL(googleURL, ssl.verifypeer = FALSE)
googleInfo <- htmlParse(googleInfo)
# or if you want to use a cert
# system.file("CurlSSL/cacert.pem", package = "RCurl")
# googleInfo <- getURL(googleURL, cainfo = cert)
# googleInfo <- htmlParse(googleInfo)
xpathSApply(googleInfo, '//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
httr: use content()
library(httr)
x <- GET(googleURL)
googleInfo <- htmlParse(content(x, as = 'text'))
xpathSApply(googleInfo, '//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
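A small variant of the httr route, sketched with rvest/xml2 instead of XML (it assumes Google still serves a resultStats div to a plain request, which is not guaranteed):

library(httr)
library(rvest)

x <- GET(googleURL)
page <- read_html(content(x, as = 'text'))
html_text(html_nodes(page, "div#resultStats"))
# e.g. "About 28,200,000 results (0.12 seconds)"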
