Can the httr package make HTTPS calls?

Code that previously worked now throws an error because the server requires HTTPS and no longer accepts HTTP.
Can this code be modified to work, or is a new process required for encrypted HTTPS?
Thank you!
# Connecting to the EIA API
# install.packages(c("httr", "jsonlite"))
library(httr)
library(jsonlite)

key <- "e77e9bd3c8bc84927fad13088f4bff28"
padd_key <- list('PET.MCRRIP12.M', 'PET.MCRRIP22.M',
                 'PET.MCRRIP32.M', 'PET.MCRRIP42.M',
                 'PET.MCRRIP52.M')

startdate <- "2010-01-01" # YYYY-MM-DD
enddate   <- "2022-02-13" # YYYY-MM-DD

j <- 0
for (i in padd_key) {
  url <- paste('http://api.eia.gov/series/?api_key=', key, '&series_id=', i, sep = "")
  res <- GET(url)
  json_data <- fromJSON(rawToChar(res$content))
  data <- data.frame(json_data$series$data)
  data$Year  <- substr(data$X1, 1, 4)
  data$Month <- substr(data$X1, 5, 6)
  data$Day   <- 1
  data$Date  <- as.Date(paste(data$Year, data$Month, data$Day, sep = '-'))
  colnames(data)[2] <- json_data$series$name
  data <- data[-c(1, 3, 4, 5)]
  if (j == 0) {
    data_final <- data
  } else {
    data_final <- merge(data_final, data, by = "Date")
  }
  j <- j + 1
}
data_final <- subset(data_final, Date >= startdate & Date <= enddate)

Yes, it can - you just use https in the URL.
The shortest demonstration I could think of is:
httr::GET("https://httpbin.org/get")
In your code you just need to change the line where you define the url variable:
url <- paste('https://api.eia.gov/series/?api_key=',key,'&series_id=',i,sep="")
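If you also want the loop to fail loudly when a request is rejected (instead of handing fromJSON an error page), an optional addition inside the loop is to check the status before parsing; this sketch assumes the rest of the loop stays as it is:
res <- GET(url)
stop_for_status(res)  # raises an R error for any 4xx/5xx response
json_data <- fromJSON(rawToChar(res$content))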

Related

Polite webscraping with a jsonlite function

I have been trying to scrape a page, but after a few scrapes the page blocks my access for an hour.
I read about the R package {polite}, and maybe it can solve my problem.
But I'm failing to use the package's functions in my code.
I did:
# connection
url_ini <- paste0("https://www.instagram.com/instagram/?__a=1&__d=11")
document_ini <- jsonlite::fromJSON(txt = url_ini)

# extracting information
id <- document_ini$graphql$user$id
end_cursor <- document_ini$graphql$user$edge_owner_to_timeline_media$page_info$end_cursor

n1 <- 'https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={%22id%22:%22'
n2 <- '%22,%22first%22:12,%22after%22:%22'
n3 <- "%22}"

url <- noquote(paste0(n1, id, n2, end_cursor, n3))
document <- jsonlite::fromJSON(txt = url)
There is more code, but I think if I can do this part, I will be able to do the rest.
I tried, without success, things like this:
url_ini <- paste0("https://www.instagram.com/instagram/?__a=1&__d=11")
session <- polite::bow(url_ini)
document_ini <- jsonlite::fromJSON(txt = session) # doesn't work
responses <- map(session, ~polite::scrape(session,jsonlite::fromJSON)) # doesn't work
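For reference, the usual {polite} pattern is to bow() to the site once and then fetch through that session with scrape(), which respects robots.txt and waits between requests. A minimal sketch of that idea (untested here; the query parameters are just the ones from the question, and scrape() may simply refuse, returning NULL, if robots.txt disallows the path, which is likely for Instagram):
library(polite)

# bow once per host: reads robots.txt and fixes a crawl delay
session <- polite::bow("https://www.instagram.com/instagram/")

# scrape() goes through the session, so the delay is enforced between calls;
# accept = "json" asks for JSON and lets httr parse the response into a list
document_ini <- polite::scrape(session,
                               query = list(`__a` = 1, `__d` = 11),
                               accept = "json")
In other words, if {polite} refuses, that is it working as intended; the fallback is rate-limiting your jsonlite calls yourself with Sys.sleep().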

web-scraping from a website that does not change URL

I am very new to web scraping, and I am having some difficulty scraping this website's content. I basically would like to collect the pesticide name and active ingredient, but the URL does not change, and I could not find a way to click the grids. Any help?
library(RSelenium)
library(rvest)
library(tidyverse)
rD <- rsDriver(browser="firefox", port=4547L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("http://www.cdms.net/Label-Database")
This site calls an API to get the list of manufacturers: http://www.cdms.net/labelssds/Home/ManList?Keys=
On the products page, it also uses another API with the manufacturer ID, for example: http://www.cdms.net/labelssds/Home/ProductList?manId=537
You just need to loop through the Lst array and append the results to a data frame.
For instance, the following code gets all the products for the first 5 manufacturers:
library(httr)

manufacturers <- content(GET("http://www.cdms.net/labelssds/Home/ManList?Keys="),
                         as = "parsed", type = "application/json")

maxManufacturer <- 5
index <- 1
manufacturerCount <- 0
data <- list()

for (m in manufacturers$Lst) {
  print(m$label)
  productUrl <- modify_url("http://www.cdms.net/labelssds/Home/ProductList",
                           query = list("manId" = m$value))
  products <- content(GET(productUrl), as = "parsed", type = "application/json")
  for (p in products$Lst) {
    data[[index]] <- p
    index <- index + 1
  }
  manufacturerCount <- manufacturerCount + 1
  if (manufacturerCount == maxManufacturer) {
    break
  }
  Sys.sleep(0.500) # add a delay between requests
}

df <- do.call(rbind, data)
options(width = 1200)
print(df)
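Since each element stored in data is a named list, do.call(rbind, data) gives a matrix whose columns are lists; if you would rather end up with a regular data frame, one option (assuming each product entry is a flat named list and dplyr is available) is:
df <- dplyr::bind_rows(data)  # one row per product; missing fields become NA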

How to authenticate myself in Postman using API Key with Rscript

I'm trying to create an API connection that works in Postman, but I need to authenticate myself using Type = API Key. These are my credentials.
My problem is that I don't know how to add those credentials to my R script to be able to get access. This is my current code:
library(httr)
library(tidyverse)
library(plyr)

# Settings
.proxy <- list(url  = "gbiss-l-ss31.int.dir.witowa.com",
               user = "svc-g-gad",
               pwd  = "5vcGBgaGaSrf",
               port = 8090,
               Header = list(Key   <- 'X-EDS-USER',
                             Value <- 'B6E6685F-DB0C-438A-983F'))

format_url_data <- function(x) {
  raw <- httr::GET(url = x,
                   httr::use_proxy(
                     url = .proxy$url,
                     port = .proxy$port,
                     username = .proxy$user,
                     password = .proxy$pwd
                   ))
  raw <- intToUtf8(raw$content)
  jsonlite::fromJSON(raw)
}
############################ FOR FF (Don't change anything) ##################################################################
# Define URL
basehttr <- 'https://iat.eds.gateway-api.willistowerswatson.com/Clients/Search?query='
endhttr <- 'APPLE'
endhttr_backup <- endhttr
endhttr <- URLencode(endhttr)
url <- glue::glue('{basehttr}{endhttr}')
# Get the information and convert to a data frame
dt <- tryCatch({
  dt <- purrr::map(url, ~format_url_data(.))
},
  error = function(error_condition) {
    basehttr <- "https://qa.eds.gateway-api.willistowerswatson.com/gateway-api/Clients/Search?query="
    endhttr <- 'APPLE'
    endhttr_backup <- endhttr
    endhttr <- URLencode(endhttr)
    url <- glue::glue('{basehttr}{endhttr}')
    dt <- purrr::map(url, ~format_url_data(.))
  }
)

A <- dt %>% as.list.data.frame()
B <- ldply(A, data.frame)
Data <- B %>%
  drop_na(name) %>%
  as.data.frame()
Data$Name <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", Data$Name)
When I run my code I get an authentication error.
Could you help me connect with those credentials, as Postman does?
Thanks!
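For what it's worth, httr can attach an API key as a request header alongside use_proxy(); a minimal sketch of the idea, assuming the gateway expects the key in an X-EDS-USER header with the value shown in your settings (header name and value copied from the question, not verified against this API):
format_url_data <- function(x) {
  raw <- httr::GET(
    url = x,
    httr::add_headers(`X-EDS-USER` = 'B6E6685F-DB0C-438A-983F'),  # assumed header name/value
    httr::use_proxy(url = .proxy$url, port = .proxy$port,
                    username = .proxy$user, password = .proxy$pwd)
  )
  httr::stop_for_status(raw)               # surface authentication errors immediately
  jsonlite::fromJSON(intToUtf8(raw$content))
}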

Trying to webscrape an unchanging URL with data spread over pages

I am new to web scraping. The URL I am working with is this (https://tsmc.tripura.gov.in/doc_list). At present, I am able to extract data from the first page. Since the URL is unchanging, I don't have an identifier for the other pages to create a loop for data table extraction.
Here is my code:
install.packages("XML")
install.packages("RCurl")
install.packages("rlist")
install.packages("bitops")
library(bitops)
library(XML)
library(RCurl)
url1<- getURL("https://tsmc.tripura.gov.in/doc_list",.opts =
list(ssl.verifypeer = FALSE))
table1<- readHTMLTable(url1)
table1<- list.clean(table1, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(table1, function(t) dim(t)[1]))
table1[[which.max(n.rows)]]
View(table1)
table11= table1[["NULL"]]
Please help. Thanks!
Perhaps try this solution:
url <- "https://tsmc.tripura.gov.in/doc_list?page="
sq <- seq(1, 30) # There appears to be 30 pages so we create a sequence of 1:30 results
links <- paste0(url, sq) #Paste the sequence after the url "page="
store <- NULL
tbl <- NULL
library(rvest) #extract the tables
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
library(plyr)
df <- ldply(tbl, data.frame) #combine the list of data frames into one large data frame
df$`.id` <- gsub("https://tsmc.tripura.gov.in/doc_list?page=", " ", df$`.id`, fixed = TRUE)
Which gives 846 observations across 8 variables.
EDIT: I found that the first URL does not take a page number. In order to add the first page and rbind it with the rest of the data, use the following:
firsturl <- "https://tsmc.tripura.gov.in/doc_list"
first_store = read_html(firsturl)
first_tbl = html_table(first_store)
first_df <- as.data.frame(first_tbl)
first_df$`.id` <- 0
df2 <- rbind(first_df, df)
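If you also want the page number as a tidy column after combining (the .id column comes from ldply above), something along these lines should work:
df2$page <- as.integer(trimws(df2$`.id`))  # " 1".." 30" from the gsub above, plus 0 for the first page
df2$`.id` <- NULL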

NA's for blanks in web scraping

I want to scrape the page mentioned below, but there are some blanks in the '.trans-section' node. The '.trans-section' node captures the 'title' as well as the 'description'. In some entries the title is there but the description is missing. I want the data to be filled with NAs when the description is blank. Since the node is the same for both, I am not getting any blank lines. Please help with this.
Weblink: https://patentscope.wipo.int/search/en/result.jsf?currentNavigationRow=5&prevCurrentNavigationRow=2&query=FP:(Gaming)&office=&sortOption=Pub%20Date%20Desc&prevFilter=&maxRec=39316&viewOption=All&listLengthOption=200
library(rvest)
library(httr)
library(XML)

FinalD <- data.frame()
for (i in 1:10) {
  rm(Data)

  ## Creating web page
  Webpage <- paste0('https://patentscope.wipo.int/search/en/result.jsf?currentNavigationRow=', i, '&prevCurrentNavigationRow=1&query=&office=&sortOption=Pub%20Date%20Desc&prevFilter=&maxRec=64653917&viewOption=All&listLengthOption=100')
  Webpage <- read_html(Webpage)

  # Getting nodes
  Node_Intclass <- html_nodes(Webpage, '.trans-section')
  Intclass <- data.frame(html_text(Node_Intclass))
  Intclass$sequence <- seq(1:2)

  Node_Others <- html_nodes(Webpage, '.notranslate')
  Others <- data.frame(html_text(Node_Others))
  Others$sequence <- seq(1:9)

  #### Others
  Data <- data.frame(subset(Others$html_text.Node_Others., Others$sequence == 1))
  Data$ID        <- subset(Others$html_text.Node_Others., Others$sequence == 2)
  Data$Country   <- subset(Others$html_text.Node_Others., Others$sequence == 3)
  Data$PubDate   <- subset(Others$html_text.Node_Others., Others$sequence == 4)
  Data$IntClass  <- subset(Others$html_text.Node_Others., Others$sequence == 5)
  Data$ApplINo   <- subset(Others$html_text.Node_Others., Others$sequence == 7)
  Data$Applicant <- subset(Others$html_text.Node_Others., Others$sequence == 8)
  Data$Inventor  <- subset(Others$html_text.Node_Others., Others$sequence == 9)

  ### Content
  ifelse((nrow(Intclass) == 200),
         ((Data$Title   <- subset(Intclass$html_text.Node_Intclass., Intclass$sequence == 1)) &
          (Data$Content <- subset(Intclass$html_text.Node_Intclass., Intclass$sequence == 2))),
         ((Data$Title <- 0) & (Data$Content <- 0)))

  # Final data
  FinalD <- rbind(FinalD, Data)
}
write.csv(FinalD, 'FinalD.csv')
Well, I am not an expert at web scraping (I have just tried it a few times), but I have realized that it is a tiresome procedure with a lot of trial and error.
Maybe you can use the RSelenium package, as the page is dynamically generated. For me it works, but it creates a somewhat messy output; maybe that is better, though.
library(RSelenium)
library(rvest)
library(dplyr)
library(data.table)
library(stringr)

tables1 <- list()
for (i in 1:10) { # i <- 1; i
  ## Creating web page
  url <- paste0('https://patentscope.wipo.int/search/en/result.jsf?currentNavigationRow=', i, '&prevCurrentNavigationRow=1&query=&office=&sortOption=Pub%20Date%20Desc&prevFilter=&maxRec=64653917&viewOption=All&listLengthOption=100')

  rD <- rsDriver(browser = "chrome")
  remDr <- rD$client
  remDr$navigate(url)
  page <- remDr$getPageSource()
  remDr$close()

  table <- page[[1]] %>%
    read_html() %>%
    html_nodes(xpath = '//table[@id="resultTable"]') %>% # specify the table, as there is a div with the same id
    html_table(fill = TRUE)
  table <- table[[1]]
  tables1[[url]] <- table %>% as.data.table()

  rm(rD)
  gc()
}
I would also suggest that you create the list of pages you want to read outside the loop and keep an index, so that if the connection fails you can continue from the page where you left off (a short sketch follows below).
In addition, if the connection fails, run the
rm(rD)
gc()
lines to avoid an error that says that the port is already in use.
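A minimal sketch of that resume-from-where-you-left-off idea, reusing the URL pattern above (start_page is just a name I made up for illustration):
urls <- paste0('https://patentscope.wipo.int/search/en/result.jsf?currentNavigationRow=',
               1:10,
               '&prevCurrentNavigationRow=1&query=&office=&sortOption=Pub%20Date%20Desc&prevFilter=&maxRec=64653917&viewOption=All&listLengthOption=100')

start_page <- 1                 # after a failure, set this to the first page not yet collected
for (i in seq_along(urls)) {
  if (i < start_page) next      # skip pages already stored in tables1
  # ... same RSelenium body as above, using urls[i] instead of url ...
}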
I hope it helped.
(Not tested)
Can you try to add the option:
read_html(Webpage, options = c("NOBLANKS"))
