rvest: Could not find possible submission target when submitting form

I'm trying to scrape results from a site that requires a form submission; for this I'm using the rvest package.
The code fails after running the following commands:
require("rvest")
require(dplyr)
require(XML)
BasicURL <- "http://www.blm.mx/es/tiendas.php"
QForm <- html_form(read_html(BasicURL))[[1]]
Values <- set_values(QForm, txt_5 = 11850, drp_1="-1")
Session <- html_session(BasicURL)
submit_form(session = Session,form = Values)
Error: Could not find possible submission target.
I think it might be because rvest doesn't find the standard button targets for submitting. Is there a way to specify to rvest which tags or buttons to look for?
Any help greatly appreciated.

You can POST to the form directly with httr:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
res <- POST("http://www.blm.mx/es/tiendas.php",
            body = list(txt_5 = "11850",
                        drp_1 = "-1"),
            encode = "form")

pg <- read_html(content(res, as = "text", encoding = "UTF-8"))

# Each right-hand result block exposes its text in pairs of text nodes,
# so paste consecutive pairs together and spread them into columns x1..xN
map(html_nodes(pg, xpath = ".//div[@class='tiendas_resultado_right']"), ~html_nodes(., xpath = ".//text()")) %>%
  map_df(function(x) {
    map(seq(1, length(x), 2), ~paste0(x[.:(.+1)], collapse = "")) %>%
      trimws() %>%
      as.list() %>%
      setNames(sprintf("x%d", 1:length(.)))
  }) -> right

left <- html_nodes(pg, "div.tiendas_resultado_left") %>% html_text()

df <- bind_cols(data_frame(x0 = left), right)

glimpse(df)
## Observations: 7
## Variables: 6
## $ x0 <chr> "ABARROTES LA GUADALUPANA", "CASA ARIES", "COMERCIO RED QIUBO", "FERROCARRIL 4", "LA FLOR DE JALISCO", "LA MIGAJA", "VIA LACTEA"
## $ x1 <chr> "Calle IGNACIO ESTEVA", "Calle PARQUE LIRA", "Calle GENERAL JOSE MORAN No 74 LOCAL B", "Calle MELCHOR MUZQUIZ", "Calle MELCHOR M...
## $ x2 <chr> "Col. San Miguel Chapultepec I Sección", "Col. San Miguel Chapultepec I Sección", "Col. San Miguel Chapultepec I Sección", "Col....
## $ x3 <chr> "Municipio/Ciudad Miguel Hidalgo", "Municipio/Ciudad Miguel Hidalgo", "Municipio/Ciudad Miguel Hidalgo", "Municipio/Ciudad Migue...
## $ x4 <chr> "CP 11850", "CP 11850", "CP 11850", "CP 11850", "CP 11850", "CP 11850", "CP 11850"
## $ x5 <chr> "Estado Distrito Federal", "Estado Distrito Federal", "Estado Distrito Federal", "Estado Distrito Federal", "Estado Distrito Fed...
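If you would rather stay with rvest's submit_form(), a commonly suggested workaround is to append a dummy submit field to the parsed form so rvest can find a submission target. The sketch below assumes the legacy (pre-1.0) rvest form internals; the field layout is an assumption, not a documented API:

# Hypothetical dummy submit field, shaped like a legacy rvest "input" field
fake_submit <- structure(
  list(name = NULL, type = "submit", value = NULL, checked = NULL,
       disabled = NULL, readonly = NULL, required = FALSE),
  class = "input"
)

# Attach it to the form built with set_values() in the question, then submit
Values[["fields"]][["submit"]] <- fake_submit
submit_form(session = Session, form = Values)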

How to retrieve text below titles from google search using rvest

This is a follow-up question to this one:
How to retrieve titles from google search using rvest
This time I am trying to get the text that appears below the titles in a Google search (circled in red in the screenshot accompanying the original question):
Due to my lack of knowledge of web design, I do not know how to formulate the XPath to extract the text below the titles.
The answer by @AllanCameron is very useful, but I do not know how to modify it:
library(rvest)
library(tidyverse)

# Code
# url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

# Get data
first_page <- read_html(url)

titles <- html_nodes(first_page, xpath = "//div/div/div/a/h3") %>%
  html_text()
Many thanks for your help!
This can all be done without Selenium, using rvest. Unfortunately, Google works differently in different locales, so for example in my locale there is a consent page that has to be navigated before I can even send a request to Google.
It seems this is not required in the OP's locale, but for those of you in the UK, you might need to run the following code first for the rest to work:
library(rvest)
library(tidyverse)

url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

google_handle <- httr::handle('https://www.google.com')

httr::GET('https://www.google.com', handle = google_handle)

httr::POST(paste0('https://consent.google.com/save?continue=',
                  'https://www.google.com/',
                  '&gl=GB&m=0&pc=shp&x=5&src=2',
                  '&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
                  'set_eom=false&set_aps=true&set_sc=true'),
           handle = google_handle)

url <- httr::GET(url, handle = google_handle)
For the OP and those without a Google consent page, the set-up is simply:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
Next we define the XPaths we are going to use to extract the titles (as in the previous Q&A) and the text below the titles (pertinent to this question):
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
Now we can just apply these XPaths to get the correct nodes and extract the text from them:
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#> title text
#> <chr> <chr>
#> 1 "Mario García Torres - Wikipedia" "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (#mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~
The subtext you refer to appears to be rendered in JavaScript, which makes it difficult to access with conventional read_html() methods.
Here is a script using RSelenium which gets the necessary results. You can also click the next-page element if you want to get more results (a sketch of that step follows the output below).
library(wdman)
library(RSelenium)
library(tidyverse)
selServ <- selenium(
  port = 4446L,
  version = 'latest',
  chromever = '103.0.5060.134' # set to a version available on your system
)

remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4446L,
  browserName = 'chrome'
)

remDr$open()
remDr$navigate("insert URL here")

text_elements <- remDr$findElements("xpath", "//div/div/div/div[2]/div/span")

sapply(text_elements, function(x) x$getElementText()) %>%
  unlist() %>%
  as_tibble_col("results") %>%
  filter(str_length(results) > 15)
# # A tibble: 6 × 1
# results
# <chr>
# 1 "The meaning of HI is —used especially as a greeting. How to use hi in a sentence."
# 2 "Hi definition, (used as an exclamation of greeting); hello! See more."
# 3 "A friendly, informal, casual greeting said upon someone's arrival. quotations ▽synonyms △. Synonyms: hello, greetings, howdy.…
# 4 "Hi is defined as a standard greeting and is short for \"hello.\" An example of \"hi\" is what you say when you see someone. i…
# 5 "hi synonyms include: hola, hello, howdy, greetings, cheerio, whats crack-a-lackin, yo, how do you do, good morrow, guten tag,…
# 6 "Meaning of hi in English ... used as an informal greeting, usually to people who you know: Hi, there! Hi, how are you doing? …

Unable to encode csv to UTF-8

I am importing a CSV and want to encode it as UTF-8, since some columns appear like this:
Comentario Fecha Mes `Estaci\xf3n` Hotel Marca
<chr> <date> <chr> <chr> <chr> <chr>
1 "No todas las instalaciones del hotel se pudieron usar, estaban cerradas sin ~ 2020-02-01 feb. "Invierno" Sol Arona T~ Sol
2 "Me he ido con un buen sabor de boca, y ganas de volver. Me ha sorprendido to~ 2019-11-01 nov. "Oto\xf1o" Sol Arona T~ Sol
3 "Hotel normalito. Est\xe1 un poco viejo. Las habitaciones no tienen aire acon~ 2019-09-01 sep. "Verano" Sol Arona T~ Sol
I have tried the following:
df <- read.csv("SolArona.csv", sep = ",", encoding = "UTF-8")
But this returns Error in read_csv("SolArona.csv", sep = ",", encoding = "UTF-8") : unused arguments (sep = ",", encoding = "UTF-8") (note that the error comes from readr's read_csv(), which has neither a sep nor an encoding argument). So, I have also tried setting the Encoding of each column:
df <- read_csv("SolArona.csv")
Encoding(df$Comentario) <- "UTF-8"
Encoding(df$`Estaci\xf3n`) <- "UTF-8"
Thanks very much in advance!
Try this, and insert another encoding if necessary:
library(readr)
df <- read_csv("SolArona.csv", locale = locale(encoding = "UTF-8"))
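If UTF-8 still leaves the accents garbled, the \xf3 and \xf1 escapes in the preview suggest the file itself is Latin-1 rather than UTF-8. A small sketch (the encoding name is a guess; readr::guess_encoding() helps confirm it):

library(readr)

# Let readr guess the file's encoding, then read with the matching locale
guess_encoding("SolArona.csv")
df <- read_csv("SolArona.csv", locale = locale(encoding = "latin1"))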

Download files when exact URL is not known

From this link, I'm trying to download multiple PDF files, but I can't get the exact URL for each file.
To access one of the PDF files, you could click on "Región de Arica y Parinacota" and then click on "Arica". You can then check that the URL is http://cdn.servel.cl/padronesauditados/padron/A1501001.pdf; if you click on the next link, "Camarones", you will notice that the URL is now http://cdn.servel.cl/padronesauditados/padron/A1501002.pdf
I checked more URLs, and they all have a similar pattern:
"A" + "two digit number from 1 to 15" + "two digit number of unknown range" + "three digit number of unknown range"
Even though the URL examples I showed seem to suggest that the files are named sequentially, this is not always the case.
To download all the files despite not knowing the exact URLs, I did the following:
1) I made a for loop to write out all possible file names based on the pattern I describe above, i.e., A0101001.pdf, A0101002.pdf, ..., A1599999.pdf (a vectorized equivalent is sketched right after this code):
library(downloader)
library(stringr)

reg.ind <- 1:15
pro.ind <- 1:99
com.ind <- 1:999

reg  <- str_pad(reg.ind, width = 2, side = "left", pad = "0")
prov <- str_pad(pro.ind, width = 2, side = "left", pad = "0")
com  <- str_pad(com.ind, width = 3, side = "left", pad = "0")

file <- c()
for (i in 1:length(reg)) {
  reg.i <- reg[i]
  for (j in 1:length(prov)) {
    prov.j <- prov[j]
    for (k in 1:length(com)) {
      com.k <- com[k]
      file <- c(file, paste0("A", reg.i, prov.j, com.k))
    }
  }
}
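A vectorized equivalent of the nested loops above, for reference (it builds the same candidate names in one step):

# expand.grid varies its first argument fastest, matching the loop order above
combos <- expand.grid(com = com, prov = prov, reg = reg, stringsAsFactors = FALSE)
file <- paste0("A", combos$reg, combos$prov, combos$com)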
2) Then I used another for loop to download a file every time I hit a correct URL. I used tryCatch to ignore the cases where the URL was incorrect (most of the time):
for (i in 1:length(file)) {
  tryCatch({
    url <- paste0("http://cdn.servel.cl/padronesauditados/padron/", file[i], ".pdf")
    # change destfile accordingly if you decide to run the code
    download.file(url, destfile = paste0("./datos/comunas/", file[i], ".pdf"),
                  mode = "wb")
  }, error = function(e) {})
}
PROBLEM: In total I know there are no more than 400 PDF files, as each one corresponds to a commune in Chile, but I wrote a vector with 1,483,515 possible file names, and therefore my code, even though it works, takes much longer than it would if I could obtain the file names beforehand.
Does anyone know how to work around this problem?
You can re-create the "browser developer tools" experience in R with splashr:
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)
sp <- start_splash()
Sys.sleep(3) # give the docker container time to work
res <- render_har(url = "http://cdn.servel.cl/padronesauditados/padron.html",
                  response_body = TRUE)
map_chr(har_entries(res), c("request", "url"))
## [1] "http://cdn.servel.cl/padronesauditados/padron.html"
## [2] "http://cdn.servel.cl/padronesauditados/stylesheets/navbar-cleaned.min.css"
## [3] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue.min.css"
## [4] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue2.min.css"
## [5] "http://cdn.servel.cl/padronesauditados/stylesheets/custom.min.css"
## [6] "https://fonts.googleapis.com/css?family=Lato%3A400%2C700%7CRoboto%3A100%2C300%2C400%2C500%2C700%2C900%2C100italic%2C300italic%2C400italic%2C500italic%2C700italic%2C900italic&ver=1458748651"
## [7] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.css"
## [8] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/external/jquery/jquery.js"
## [9] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.js"
## [10] "http://cdn.servel.cl/padronesauditados/images/logo-txt-retina.png"
## [11] "http://cdn.servel.cl/assets/img/nav_arrows.png"
## [12] "http://cdn.servel.cl/padronesauditados/images/loader.gif"
## [13] "http://cdn.servel.cl/padronesauditados/archivos.xml"
## [14] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/images/ui-icons_444444_256x240.png"
## [15] "https://fonts.gstatic.com/s/roboto/v16/zN7GBFwfMP4uA6AR0HCoLQ.ttf"
## [16] "https://fonts.gstatic.com/s/roboto/v16/RxZJdnzeo3R5zSexge8UUaCWcynf_cDxXwCLxiixG1c.ttf"
## [17] "https://fonts.gstatic.com/s/roboto/v16/Hgo13k-tfSpn0qi1SFdUfaCWcynf_cDxXwCLxiixG1c.ttf"
## [18] "https://fonts.gstatic.com/s/roboto/v16/Jzo62I39jc0gQRrbndN6nfesZW2xOQ-xsNqO47m55DA.ttf"
## [19] "https://fonts.gstatic.com/s/roboto/v16/d-6IYplOFocCacKzxwXSOKCWcynf_cDxXwCLxiixG1c.ttf"
## [20] "https://fonts.gstatic.com/s/roboto/v16/mnpfi9pxYH-Go5UiibESIqCWcynf_cDxXwCLxiixG1c.ttf"
## [21] "http://cdn.servel.cl/padronesauditados/stylesheets/fonts/virtue_icons.woff"
## [22] "https://fonts.gstatic.com/s/lato/v13/v0SdcGFAl2aezM9Vq_aFTQ.ttf"
## [23] "https://fonts.gstatic.com/s/lato/v13/DvlFBScY1r-FMtZSYIYoYw.ttf"
Spotting the XML entry (#13) in the list above is easy, so we can focus on it:
har_entries(res)[[13]]$response$content$text %>%
  openssl::base64_decode() %>%
  xml2::read_xml() %>%
  xml2::xml_find_all(".//Region") %>%
  map_df(~{
    data_frame(
      id = xml2::xml_find_all(.x, ".//id") %>% xml2::xml_text(),
      nombre = xml2::xml_find_all(.x, ".//nombre") %>% xml2::xml_text(),
      nomcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/nomcomuna") %>% xml2::xml_text(),
      id_archivo = xml2::xml_find_all(.x, ".//comunas/comuna/idArchivo") %>% xml2::xml_text(),
      archcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/archcomuna") %>% xml2::xml_text()
    )
  })
## # A tibble: 346 x 5
## id nombre nomcomuna id_archivo archcomuna
## <chr> <chr> <chr> <chr> <chr>
## 1 1 Región de Arica y Parinacota Arica 1 A1501001.pdf
## 2 1 Región de Arica y Parinacota Camarones 2 A1501002.pdf
## 3 1 Región de Arica y Parinacota General Lagos 3 A1502002.pdf
## 4 1 Región de Arica y Parinacota Putre 4 A1502001.pdf
## 5 2 Región de Tarapacá Alto Hospicio 5 A0103002.pdf
## 6 2 Región de Tarapacá Camiña 6 A0152002.pdf
## 7 2 Región de Tarapacá Colchane 7 A0152003.pdf
## 8 2 Región de Tarapacá Huara 8 A0152001.pdf
## 9 2 Región de Tarapacá Iquique 9 A0103001.pdf
## 10 2 Región de Tarapacá Pica 10 A0152004.pdf
## # ... with 336 more rows
stop_splash(sp) # don't forget to clean up!
You can then programmatically download all the PDFs by prefixing each archcomuna value with the URL http://cdn.servel.cl/padronesauditados/padron/, for example:
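If the tibble built from archivos.xml above is stored in a variable (called comunas here purely for illustration), the downloads reduce to one call per row:

# comunas is assumed to hold the id/nombre/nomcomuna/id_archivo/archcomuna tibble
base_url <- "http://cdn.servel.cl/padronesauditados/padron/"

walk(comunas$archcomuna, ~download.file(
  url = paste0(base_url, .x),
  destfile = file.path("datos", "comunas", .x),
  mode = "wb"
))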

Web Scraping using rvest in R

I have been trying to scrape information from a URL in R using the rvest package:
url <-'https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D'
but I am not able to correctly identify the XPath even after using the selector plugin.
The code I am using to fetch the first table is as follows:
detail_data <- read_html(url)

detail_data_raw <- html_nodes(detail_data,
                              xpath = '//*[@id="edit-t-fullview"]/table[2]/tbody/tr[2]/td/table')

detail_data_fine <- html_table(detail_data_raw)
When I try the above code, detail_data_raw results in {xml_nodeset (0)} and consequently detail_data_fine is an empty list().
The information I am interested in scraping is under the headers:
Organisation Details
Tender Details
Critical Dates
Work Details
Tender Inviting Authority Details
Any help or ideas on what is going wrong and how to rectify it are welcome.
Your example URL isn't working for anyone, but if you're looking to get the data for a particular tender, then:
library(rvest)
library(stringi)
library(tidyverse)
pg <- read_html("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")
# Field labels from the first column of the details table, cleaned into
# snake_case column names
html_nodes(pg, xpath = ".//table[@class='viewtablebg']/tr/td[1]") %>%
  html_text(trim = TRUE) %>%
  stri_replace_last_regex(" +:$", "") %>%
  stri_replace_all_fixed(" ", "_") %>%
  stri_trans_tolower() -> tenders_cols

# Matching values from the second column, named with the labels above
html_nodes(pg, xpath = ".//table[@class='viewtablebg']/tr/td[2]") %>%
  html_text(trim = TRUE) %>%
  as.list() %>%
  set_names(tenders_cols) %>%
  flatten_df() %>%
  glimpse()
## Observations: 1
## Variables: 15
## $ organisation_name <chr> "Delhi Jal Board"
## $ organisation_type <chr> "State Govt. and UT"
## $ tender_reference_number <chr> "Short NIT. No.20 (Item no.1) EE ...
## $ tender_title <chr> "Short NIT. No.20 (Item no.1)"
## $ product_category <chr> "Civil Works"
## $ tender_fee <chr> "Rs.500"
## $ tender_type <chr> "Open/Advertised"
## $ epublished_date <chr> "18-Aug-2017 05:15 PM"
## $ document_download_start_date <chr> "18-Aug-2017 05:15 PM"
## $ bid_submission_start_date <chr> "18-Aug-2017 05:15 PM"
## $ work_description <chr> "Replacement of settled deep sewe...
## $ pre_qualification <chr> "Please refer Tender documents."
## $ tender_document <chr> "https://govtprocurement.delhi.go...
## $ name <chr> "EXECUTIVE ENGINEER (NORTH)-II"
## $ address <chr> "EXECUTIVE ENGINEER (NORTH)-II\r\...
This seems to work just fine without installing Python and using Selenium.
Have a look at "dynamic web scraping". Typically, when you enter a URL in your browser, it sends a GET request to the host server. The host server builds an HTML page with all the data in it and sends it back to you. On dynamic pages, the server just sends you an HTML template which, once opened, runs JavaScript in your browser, and that JavaScript then retrieves the data that populates the template.
I would recommend scraping this page using Python and the Selenium library. Selenium gives your program the ability to wait until the JavaScript has run in your browser and retrieved the data. See below a question I had on the same concept, and a very helpful reply (an R analogue using RSelenium is sketched after the link):
BeautifulSoup parser can't access html elements
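For completeness, the same browser-driven idea is available from R via RSelenium, so there is no need to switch to Python. A minimal sketch, assuming a Selenium server is already running locally on port 4445 (and note the answer above shows a browser is not actually required for this particular page):

library(RSelenium)

# Drive a real browser so the page's JavaScript runs, then parse the rendered HTML
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate(url) # url as defined in the question
pg <- rvest::read_html(remDr$getPageSource()[[1]])
remDr$close()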

How do I finish this R function for movie scraping?

I'm writing an R function. I would like it to be able to take a list of movies, download info about them, and then throw it into a data frame.
So far,
rottenrate <- function(movie){
  link <- paste("http://www.omdbapi.com/?t=", movie, "&y=&plot=short&r=json&tomatoes=true", sep = "")
  jsonData <- fromJSON(link)
  return(jsonData)
}
This will return info for one movie and won't convert to a data.frame.
Thanks for any help.
You could do it like this:
# First, vectorize the function
rottenrate <- function(movie){
  require(RJSONIO)
  link <- paste("http://www.omdbapi.com/?t=", movie, "&y=&plot=short&r=json&tomatoes=true", sep = "")
  jsonData <- fromJSON(link)
  return(jsonData)
}

vrottenrate <- Vectorize(rottenrate, "movie", SIMPLIFY = FALSE)

# Now, query and combine
movies <- c("inception", "toy story")

df <- do.call(rbind,
              lapply(vrottenrate(movies),
                     function(x) as.data.frame(t(x), stringsAsFactors = FALSE)))

dplyr::glimpse(df)
# Observations: 2
# Variables:
# $ Title (chr) "Inception", "Toy Story"
# $ Year (chr) "2010", "1995"
# $ Rated (chr) "PG-13", "G"
# $ Released (chr) "16 Jul 2010", "22 Nov 1995"
# $ Runtime (chr) "148 min", "81 min"
# $ Genre (chr) "Action, Mystery, Sci-Fi", "Animation,
# ...
Interesting database btw... :-)
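One practical note if you run this today: the OMDb API now requires a (free) API key passed as an apikey parameter. A minimal sketch of the adjusted function, with the key left as a placeholder:

rottenrate <- function(movie, apikey = "YOUR_KEY") { # placeholder key from omdbapi.com
  require(RJSONIO)
  link <- paste0("http://www.omdbapi.com/?t=", movie,
                 "&y=&plot=short&r=json&tomatoes=true&apikey=", apikey)
  fromJSON(link)
}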
