Download files when exact URL is not known - r

From this link, I'm trying to download multiple PDF files, but I can't get the exact URL for each file.
To access one of the PDF files, you can click on "Región de Arica y Parinacota" and then on "Arica". You can then check that the URL is http://cdn.servel.cl/padronesauditados/padron/A1501001.pdf; if you click on the next link, "Camarones", you will notice that the URL is http://cdn.servel.cl/padronesauditados/padron/A1501002.pdf
I checked more URLs, and they all follow a similar pattern:
"A" + a two-digit number from 01 to 15 + a two-digit number of unknown range + a three-digit number of unknown range
Even though the URL examples above seem to suggest that the files are named sequentially, this is not always the case.
To download all the files despite not knowing the exact URLs, I did the following:
1) I wrote a for loop to generate all possible file names based on the pattern described above, i.e. A0101001.pdf, A0101002.pdf, ..., A1599999.pdf
library(downloader)
library(stringr)

reg.ind <- 1:15
pro.ind <- 1:99
com.ind <- 1:999
reg  <- str_pad(reg.ind, width = 2, side = "left", pad = "0")
prov <- str_pad(pro.ind, width = 2, side = "left", pad = "0")
com  <- str_pad(com.ind, width = 3, side = "left", pad = "0")

file <- c()
for (i in 1:length(reg)) {
  reg.i <- reg[i]
  for (j in 1:length(prov)) {
    prov.j <- prov[j]
    for (k in 1:length(com)) {
      com.k <- com[k]
      file <- c(file, paste0("A", reg.i, prov.j, com.k))
    }
  }
}
2) Then I used another for loop to download a file every time I hit a correct URL. I used tryCatch to ignore the cases where the URL was incorrect (most of the time). (A vectorized sketch of both steps follows the code below.)
for (i in 1:length(file)) {
  tryCatch({
    url <- paste0("http://cdn.servel.cl/padronesauditados/padron/", file[i], ".pdf")
    # change destfile accordingly if you decide to run the code
    download.file(url, destfile = paste0("./datos/comunas/", file[i], ".pdf"),
                  mode = "wb")
  }, error = function(e) {})
}
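For reference, the same two steps can be written more compactly without the nested loops (a sketch of the same brute-force approach, not a fix for its slowness):
# Build every candidate name A<reg><prov><com> in one vectorized call
# (the same 15 * 99 * 999 = 1,483,515 combinations as the loops above).
grid <- expand.grid(reg = 1:15, prov = 1:99, com = 1:999)
file <- sprintf("A%02d%02d%03d", grid$reg, grid$prov, grid$com)
# Attempt each download, silently skipping names that do not exist.
for (f in file) {
  tryCatch(
    download.file(paste0("http://cdn.servel.cl/padronesauditados/padron/", f, ".pdf"),
                  destfile = paste0("./datos/comunas/", f, ".pdf"), mode = "wb"),
    error = function(e) {}
  )
}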
PROBLEM: I know there are no more than about 400 PDF files in total, as each one corresponds to a commune in Chile, but I wrote a vector with 1,483,515 possible file names, so my code, even though it works, takes much longer than it would if I could obtain the file names beforehand.
Does anyone know how to work around this problem?

You can re-create the "browser developer tools" experience in R with splashr:
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)
sp <- start_splash()
Sys.sleep(3) # give the docker container time to work
res <- render_har(url = "http://cdn.servel.cl/padronesauditados/padron.html",
                  response_body = TRUE)
map_chr(har_entries(res), c("request", "url"))
## [1] "http://cdn.servel.cl/padronesauditados/padron.html"
## [2] "http://cdn.servel.cl/padronesauditados/stylesheets/navbar-cleaned.min.css"
## [3] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue.min.css"
## [4] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue2.min.css"
## [5] "http://cdn.servel.cl/padronesauditados/stylesheets/custom.min.css"
## [6] "https://fonts.googleapis.com/css?family=Lato%3A400%2C700%7CRoboto%3A100%2C300%2C400%2C500%2C700%2C900%2C100italic%2C300italic%2C400italic%2C500italic%2C700italic%2C900italic&ver=1458748651"
## [7] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.css"
## [8] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/external/jquery/jquery.js"
## [9] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.js"
## [10] "http://cdn.servel.cl/padronesauditados/images/logo-txt-retina.png"
## [11] "http://cdn.servel.cl/assets/img/nav_arrows.png"
## [12] "http://cdn.servel.cl/padronesauditados/images/loader.gif"
## [13] "http://cdn.servel.cl/padronesauditados/archivos.xml"
## [14] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/images/ui-icons_444444_256x240.png"
## [15] "https://fonts.gstatic.com/s/roboto/v16/zN7GBFwfMP4uA6AR0HCoLQ.ttf"
## [16] "https://fonts.gstatic.com/s/roboto/v16/RxZJdnzeo3R5zSexge8UUaCWcynf_cDxXwCLxiixG1c.ttf"
## [17] "https://fonts.gstatic.com/s/roboto/v16/Hgo13k-tfSpn0qi1SFdUfaCWcynf_cDxXwCLxiixG1c.ttf"
## [18] "https://fonts.gstatic.com/s/roboto/v16/Jzo62I39jc0gQRrbndN6nfesZW2xOQ-xsNqO47m55DA.ttf"
## [19] "https://fonts.gstatic.com/s/roboto/v16/d-6IYplOFocCacKzxwXSOKCWcynf_cDxXwCLxiixG1c.ttf"
## [20] "https://fonts.gstatic.com/s/roboto/v16/mnpfi9pxYH-Go5UiibESIqCWcynf_cDxXwCLxiixG1c.ttf"
## [21] "http://cdn.servel.cl/padronesauditados/stylesheets/fonts/virtue_icons.woff"
## [22] "https://fonts.gstatic.com/s/lato/v13/v0SdcGFAl2aezM9Vq_aFTQ.ttf"
## [23] "https://fonts.gstatic.com/s/lato/v13/DvlFBScY1r-FMtZSYIYoYw.ttf"
Spotting the XML entry (#13) in the list above is easy, so we can focus on it:
har_entries(res)[[13]]$response$content$text %>%
  openssl::base64_decode() %>%
  xml2::read_xml() %>%
  xml2::xml_find_all(".//Region") %>%
  map_df(~{
    data_frame(
      id         = xml2::xml_find_all(.x, ".//id") %>% xml2::xml_text(),
      nombre     = xml2::xml_find_all(.x, ".//nombre") %>% xml2::xml_text(),
      nomcomuna  = xml2::xml_find_all(.x, ".//comunas/comuna/nomcomuna") %>% xml2::xml_text(),
      id_archivo = xml2::xml_find_all(.x, ".//comunas/comuna/idArchivo") %>% xml2::xml_text(),
      archcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/archcomuna") %>% xml2::xml_text()
    )
  })
## # A tibble: 346 x 5
## id nombre nomcomuna id_archivo archcomuna
## <chr> <chr> <chr> <chr> <chr>
## 1 1 Región de Arica y Parinacota Arica 1 A1501001.pdf
## 2 1 Región de Arica y Parinacota Camarones 2 A1501002.pdf
## 3 1 Región de Arica y Parinacota General Lagos 3 A1502002.pdf
## 4 1 Región de Arica y Parinacota Putre 4 A1502001.pdf
## 5 2 Región de Tarapacá Alto Hospicio 5 A0103002.pdf
## 6 2 Región de Tarapacá Camiña 6 A0152002.pdf
## 7 2 Región de Tarapacá Colchane 7 A0152003.pdf
## 8 2 Región de Tarapacá Huara 8 A0152001.pdf
## 9 2 Región de Tarapacá Iquique 9 A0103001.pdf
## 10 2 Región de Tarapacá Pica 10 A0152004.pdf
## # ... with 336 more rows
stop_splash(sp) # don't forget to clean up!
You can then programmatically download all the PDFs by prepending the URL prefix http://cdn.servel.cl/padronesauditados/padron/ to each archcomuna value.
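For example, a minimal sketch (assuming the tibble above has been saved as df and the destination folder already exists):
# df$archcomuna holds the real file names, so only ~346 requests are needed.
urls <- paste0("http://cdn.servel.cl/padronesauditados/padron/", df$archcomuna)
for (i in seq_along(urls)) {
  download.file(urls[i], destfile = file.path("./datos/comunas", df$archcomuna[i]),
                mode = "wb")
}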

Related

How to fix data reading and table formatting problem with web scraping in R

The first issue is how to programmatically access each of the 11 pages of this online table.
Since this is a simple HTML table, clicking the "Next" button takes us to a new page. If we look at the URL of that page, we can see the page number in the query parameters:
"... tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&estado ..."
We know that the pages are numbered starting with 0 (because "Next" takes us to page 1), and using the navigation bar we can see that there are 11 pages. The httr package offers many very useful tools for handling HTML requests. Among those tools is the httr::parse_url function, which returns a list with the components of a URL.
url_params <- httr::parse_url("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")
# Not all results returned were included below for the sake of keeping the example concise
$scheme
[1] "https"
$hostname
[1] "colombiacompra.gov.co"
$path
[1] "tienda-virtual-del-estado-colombiano/ordenes-compra"
$query
$query$page
[1] "1"
$query$number_order
[1] ""
Use the query parameters to construct a series of rvest::read_html() calls corresponding to each page number, simply using lapply and paste0 to substitute the page= value. We can also save a step by coercing the result to a data.frame inside the lapply call.
library(rvest)

pages <- lapply(0:11, function(x) {
  read_html(paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
                   x,
                   "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")) |>
    html_table() |>
    data.frame()
})
do.call(rbind, pages)
But I get the following error:
> url_params <- httr::parse_url("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")
>
> pages <- lapply(0:11, function(x) {
+ rvest::read_html(x = paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
+ x,
+ "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")) |>
Error: unexpected '>' in:
"                             x,
                             "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")) |>"
> html_table() |>
Error: unexpected '>' in " html_table() |>"
> data.frame()
data frame with 0 columns and 0 rows
> })
Error: unexpected '}' in " }"
>
> do.call(rbind, pages)
Error in do.call(rbind, pages) : object 'pages' not found
>
> do.ca
Error: object 'do.ca' not found
How can I solve this?
You can use map_df to read the HTML table from each URL and combine them into one table. (Incidentally, the "unexpected '>'" errors above typically mean the code was run in an R version older than 4.1, which does not understand the native pipe |>.)
library(rvest)

urls <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
               0:11,
               "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")
res <- purrr::map_df(urls, ~ .x %>% read_html() %>% html_table())
res
# A tibble: 578 x 6
# `Orden de Compra` `Entidad Estatal` `Fecha de la or… Estado Instrumento Total
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 72683 UNIDAD ADMINISTRA… 2021-07-16 20:0… Emiti… IAD Softwa… $453…
# 2 72670 SUPERINTENDENCIA … 2021-07-16 16:5… Emiti… IAD Softwa… $252…
# 3 72648 BOGOTA D.C - UNID… 2021-07-16 14:4… Emiti… IAD Softwa… $179…
# 4 72638 ESTABLECIMIENTO P… 2021-07-16 14:1… Emiti… IAD Softwa… $1,7…
# 5 72631 INPEC - ESTABLECI… 2021-07-16 12:5… Emiti… IAD Softwa… $1,6…
# 6 72605 POLICIA NACIONAL … 2021-07-15 20:0… Emiti… IAD Softwa… $2,9…
# 7 72524 FOGAFIN 2021-07-15 08:4… Emiti… IAD Softwa… $5,9…
# 8 72502 INSTITUTO DE FOME… 2021-07-14 22:2… Emiti… IAD Softwa… $1,9…
# 9 72471 ANTIOQUA - MUNICI… 2021-07-14 09:5… Emiti… IAD Softwa… $28,…
#10 72433 AGENCIA DE RENOVA… 2021-07-13 16:2… Emiti… IAD Softwa… $282…
# … with 568 more rows

removing special apostrophes from French article contractions when tokenizing

I am currently running an stm (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just fine, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for things like removing words, removing numbers, etc.
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article -le- contracts to -l'- before vowels. I've tried to remove -l'- (and similar things like -d'-) as words with removeWords:
lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))
but it only works with words that are separate from the rest of text, not with the articles that are attached to a word, such as in -l'arbre- (the tree).
Frustrated, I've tried a simple gsub:
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, possibly through a c(...) vector so that I can give it a series of expressions all at once?
Just for context, lmt67 is a "large character" with 30,000 elements/articles, obtained by using the texts() function on data imported from txt files.
Thanks to anyone that will want to help me.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method uses pattern matches for the simple ASCII 39 apostrophe plus a bunch of Unicode variants, matched through the "Pf" ("Punctuation: Final Quote") category. Note, however, that quanteda does its best to normalize the quotes at the tokenization stage - see, for instance, "l'indépendance" in the second document.
The second method uses a French part-of-speech tagger integrated with quanteda, which allows a similar selection after recognizing and separating the prefixes, and then removes determiners (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then we apply the pattern to match l', d', or s', using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
  toks,
  types(toks),
  stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
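For example (a minimal sketch; K and any other stm options are placeholders you would choose yourself):
library(quanteda)
library(stm)
dfmat <- dfm(toks)                   # document-feature matrix from the cleaned tokens
out   <- convert(dfmat, to = "stm")  # reshape for the stm package
mod   <- stm(documents = out$documents, vocab = out$vocab, K = 10)  # K = 10 is arbitrary here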
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
  as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
                       stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a scrape from the current page at Le Monde's website. Notice that the apostrophe they use is not the same character as the single-quote here "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a little angle and is not actually "straight down" when I view it. You need to copy that character into your gsub command:
sub("l’", "", text)
# [1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
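To strip several contractions at once, and both apostrophe characters, one option is a single regular expression built from a vector of prefixes (a sketch, not part of the original answers):
prefixes <- c("l", "d", "s", "n", "qu", "j", "m", "t")   # adjust to taste
pattern  <- paste0("\\b(", paste(prefixes, collapse = "|"), ")['’]")
gsub(pattern, "", text)
# [1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de indépendance catalane, actuellement en prison pour sédition."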

Extract address components from coordinates

I'm trying to reverse geocode with R. I first used ggmap but couldn't get it to work with my API key. Now I'm trying it with googleway.
newframe[,c("Front.lat","Front.long")]
Front.lat Front.long
1 -37.82681 144.9592
2 -37.82681 145.9592
newframe$address <- apply(newframe, 1, function(x){
  google_reverse_geocode(location = as.numeric(c(x["Front.lat"], x["Front.long"])),
                         key = "xxxx")
})
This extracts the variables as a list but I can't figure out the structure.
I'm struggling to figure out how to extract the address components listed below as variables in newframe
postal_code, administrative_area_level_1, administrative_area_level_2, locality, route, street_number
I would prefer each address component as a separate variable.
Google's API returns the response in JSON which, when translated into R, naturally forms nested lists. Internally in googleway this is done through jsonlite::fromJSON().
In googleway I've given you the choice of returning the raw JSON or a list, via the simplify argument.
I've deliberately returned ALL the data from Google's response and left it up to the user to extract the elements they're interested in through usual list-subsetting operations.
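For example, with the default simplify = TRUE the returned list can be inspected with ordinary subsetting (a sketch; the structure matches what is used further down this answer):
res <- google_reverse_geocode(location = c(-37.82681, 144.9592), key = "xxxx")
res$status                           # "OK" when the lookup succeeded
res$results$formatted_address        # character vector of matched addresses
res$results$address_components[[1]]  # data.frame of components for the best match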
Having said all that, in the development version of googleway I've written a few functions to help access elements of various API calls. Here are three of them that may be useful to you:
## Install the development version
# devtools::install_github("SymbolixAU/googleway")
res <- google_reverse_geocode(
location = c(df[1, 'Front.lat'], df[1, 'Front.long']),
key = apiKey
)
geocode_address(res)
# [1] "45 Clarke St, Southbank VIC 3006, Australia"
# [2] "Bank Apartments, 275-283 City Rd, Southbank VIC 3006, Australia"
# [3] "Southbank VIC 3006, Australia"
# [4] "Melbourne VIC, Australia"
# [5] "South Wharf VIC 3006, Australia"
# [6] "Melbourne, VIC, Australia"
# [7] "CBD & South Melbourne, VIC, Australia"
# [8] "Melbourne Metropolitan Area, VIC, Australia"
# [9] "Victoria, Australia"
# [10] "Australia"
geocode_address_components(res)
# long_name short_name types
# 1 45 45 street_number
# 2 Clarke Street Clarke St route
# 3 Southbank Southbank locality, political
# 4 Melbourne City Melbourne administrative_area_level_2, political
# 5 Victoria VIC administrative_area_level_1, political
# 6 Australia AU country, political
# 7 3006 3006 postal_code
geocode_type(res)
# [[1]]
# [1] "street_address"
#
# [[2]]
# [1] "establishment" "general_contractor" "point_of_interest"
#
# [[3]]
# [1] "locality" "political"
#
# [[4]]
# [1] "colloquial_area" "locality" "political"
After reverse geocoding into newframe$address, the address components can be extracted further as follows:
# Make a boolean array of the valid ("OK" status) responses (other statuses may be "NO_RESULTS", "REQUEST_DENIED" etc).
sel <- sapply(1:nrow(newframe), function(x){
  newframe$address[[x]]$status == 'OK'
})
# Get the address_components of the first result (i.e. best match) returned per geocoded coordinate.
address.components <- sapply(1:nrow(newframe[sel,]), function(x){
  newframe$address[[x]]$results[1,]$address_components
})
# Get all possible component types.
all.types <- unique(unlist(sapply(1:length(address.components), function(x){
  unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
})))
# Get "long_name" values of the address_components for each type present (the other option is "short_name").
all.values <- lapply(1:length(address.components), function(x){
  types <- unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
  matches <- match(all.types, types)
  values <- address.components[[x]]$long_name[matches]
})
# Bind results into a dataframe.
all.values <- do.call("rbind", all.values)
all.values <- as.data.frame(all.values)
names(all.values) <- all.types
# Add columns and update original data frame.
newframe[, all.types] <- NA
newframe[sel,][, all.types] <- all.values
Note that I've only kept the first type given per component, effectively skipping the "political" type as it appears in multiple components and is likely superfluous e.g. "administrative_area_level_1, political".
You can use ggmap::revgeocode easily; see below:
library(ggmap)

df <- cbind(df, do.call(rbind, lapply(1:nrow(df), function(i)
  revgeocode(as.numeric(df[i, 2:1]),
             output = "more")[c("administrative_area_level_1", "locality",
                                "postal_code", "address")])))
#output:
df
# Front.lat Front.long administrative_area_level_1 locality
# 1 -37.82681 144.9592 Victoria Southbank
# 2 -37.82681 145.9592 Victoria Noojee
# postal_code address
# 1 3006 45 Clarke St, Southbank VIC 3006, Australia
# 2 3833 Cec Dunns Track, Noojee VIC 3833, Australia
You can add "route" and "street_number" to the variables that you want to extract, but as you can see the second address does not have a street number, and that will cause an error.
Note: You may also use sub and extract the information from the address.
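For instance (a sketch based on the addresses shown above; the pattern is illustrative, not general):
# Pull the four-digit postal code out of the formatted address string.
sub(".*\\b(\\d{4}),.*", "\\1", "45 Clarke St, Southbank VIC 3006, Australia")
# [1] "3006"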
Data:
df <- structure(list(Front.lat = c(-37.82681, -37.82681),
                     Front.long = c(144.9592, 145.9592)),
                .Names = c("Front.lat", "Front.long"),
                class = "data.frame", row.names = c(NA, -2L))

Factor gives numeric output when I set string input

I'm working on a Shiny app and I need to make a factor from database values.
In all examples people do something like this:
selectInput("selectSectorAgrupado", "Sector agrupado",
            c("Cylinders" = "cyl",
              "Transmission" = "am",
              "Gears" = "gear"),
            selected = NULL, multiple = FALSE)
In this case the user sees the string "Cylinders" but the value is "cyl". I need to do the same, but the values aren't literal strings; they are data frame fields from a database. I tried to do something similar, but when I put a text field into the factor it returns numbers.
Using this:
sectorAgrupado$L0NOMBRE
The console returns this:
 [1] Actividades recreativas   Agencias de viaje
 [3] Alimentación              Bazares
 [5] Comercio electrónico      Gremios, vivienda
 [7] Electrodomésticos, SAT    Energia
 [9] Enseñanza                 Grandes Superficies
[11] Hosteleria                Informática
[13] Joyeria, Relojeria        Muebles
[15] Otro comercio por menor   Otros
[17] Publicidad                Seguros
[19] Servicios bancarios       Telefonía
[21] Textil, Calzado           Tintorerias
[23] Transportes               Venta domiciliaria
[25] Automóviles               Promoción inmobiliaria
But when I put the code into c():
c(sectorAgrupado$L0NOMBRE)
It returns this:
[1] 1 2 3 5 6 11 7 8 9 10 12 13 14 15 16 17 19 20 21 22
[21] 23 24 25 26 4 18
I'm new to R programming and maybe I don't understand factors well, but I need some help.
Thank you
EDITED (More problems)
OK, the first problem is solved. The solution is to store the values as character, like this:
c(as.character(sectorAgrupado$L0NOMBRE))
But the problem continues. When I try to associate each string with a value, R returns an error.
The code:
c(as.character(sectorAgrupado$L0NOMBRE)=as.character(sectorAgrupado$L0CODIGO))
returns the error:
Error: unexpected '=' in "c(as.character(sectorAgrupado$L0NOMBRE)="
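The = inside c() only accepts literal names, so to pair the two columns you can build a named vector first, for example with setNames (a sketch, assuming sectorAgrupado has the columns L0NOMBRE and L0CODIGO):
# Values come from L0CODIGO; the labels the user sees come from L0NOMBRE.
choices <- setNames(as.character(sectorAgrupado$L0CODIGO),
                    as.character(sectorAgrupado$L0NOMBRE))
selectInput("selectSectorAgrupado", "Sector agrupado",
            choices = choices, selected = NULL, multiple = FALSE)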

Web Scraping, extract table of a page

I have to extract the table that shows "R.U.T" and "Entidad" from the page
http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554
I wrote the following code:
library(rvest)
# read page
url <- paste("http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html", sep = "")
url <- read_html(url)
# extract table
table <- html_node(url, xpath = '//*[@id="listado_fiscalizados"]/table')  # xpath
table <- html_table(table)
# transform table to data.frame
table <- data.frame(table)
but R shows me the following result:
> a
{xml_nodeset (0)}
That is, it is not recognizing the table. Maybe it's because the table has hyperlinks?
If anyone knows how to extract the table, I would appreciate it.
Many thanks in advance and sorry for my English.
The page makes an XHR request to another resource, which is used to build the table.
library(rvest)
library(dplyr)
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")
html_nodes(pg, "table") %>%
  html_table() %>%
  .[[1]] %>%
  tbl_df() %>%
  select(1:2)
## # A tibble: 36 × 2
## R.U.T. Entidad
## <chr> <chr>
## 1 99588060-1 ACE SEGUROS DE VIDA S.A.
## 2 76511423-3 ALEMANA SEGUROS S.A.
## 3 96917990-3 BANCHILE SEGUROS DE VIDA S.A.
## 4 96933770-3 BBVA SEGUROS DE VIDA S.A.
## 5 96573600-K BCI SEGUROS VIDA S.A.
## 6 96656410-5 BICE VIDA COMPAÑIA DE SEGUROS S.A.
## 7 96837630-6 BNP PARIBAS CARDIF SEGUROS DE VIDA S.A.
## 8 76418751-2 BTG PACTUAL CHILE S.A. COMPAÑIA DE SEGUROS DE VIDA
## 9 76477116-8 CF SEGUROS DE VIDA S.A.
## 10 99185000-7 CHILENA CONSOLIDADA SEGUROS DE VIDA S.A.
## # ... with 26 more rows
You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.
This is the answer using RSelenium:
library(RSelenium)
library(XML)

# Start Selenium Server
RSelenium::checkForServer(beta = TRUE)
selServ <- RSelenium::startServer(javaargs = c("-Dwebdriver.gecko.driver=\"C:/Users/Mislav/Documents/geckodriver.exe\""))
remDr <- remoteDriver(extraCapabilities = list(marionette = TRUE))
remDr$open()  # silent = TRUE
Sys.sleep(2)

# Simulate a browser session and load the page
remDr$navigate("http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html")
Sys.sleep(2)
doc <- htmlParse(remDr$getPageSource()[[1]], encoding = "UTF-8")

# close and stop server
remDr$close()
selServ$stop()

tables <- readHTMLTable(doc)
head(tables)

Resources