I want to extract the table that shows "R.U.T" and "Entidad" from the page
http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554
I wrote the following code:
library(rvest)
#put page
url<-paste("http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html",sep="")
url<-read_html(url)
#extract table
table<-html_node(url,xpath='//*[@id="listado_fiscalizados"]/table') #xpath
table<-html_table(table)
#transform table to data.frame
table<-data.frame(table)
But R shows me the following result:
> table
{xml_nodeset (0)}
That is, it is not recognizing the table. Maybe it's because the table has hyperlinks?
If anyone knows how to extract the table, I would appreciate it.
Many thanks in advance and sorry for my English.
The page makes an XHR request to another resource, which is used to build the table.
library(rvest)
library(dplyr)
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")
html_nodes(pg, "table") %>%
  html_table() %>%
  .[[1]] %>%
  tbl_df() %>%
  select(1:2)
## # A tibble: 36 × 2
## R.U.T. Entidad
## <chr> <chr>
## 1 99588060-1 ACE SEGUROS DE VIDA S.A.
## 2 76511423-3 ALEMANA SEGUROS S.A.
## 3 96917990-3 BANCHILE SEGUROS DE VIDA S.A.
## 4 96933770-3 BBVA SEGUROS DE VIDA S.A.
## 5 96573600-K BCI SEGUROS VIDA S.A.
## 6 96656410-5 BICE VIDA COMPAÑIA DE SEGUROS S.A.
## 7 96837630-6 BNP PARIBAS CARDIF SEGUROS DE VIDA S.A.
## 8 76418751-2 BTG PACTUAL CHILE S.A. COMPAÑIA DE SEGUROS DE VIDA
## 9 76477116-8 CF SEGUROS DE VIDA S.A.
## 10 99185000-7 CHILENA CONSOLIDADA SEGUROS DE VIDA S.A.
## # ... with 26 more rows
You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.
This is the answer using RSelenium:
library(RSelenium)
library(XML)

# Start Selenium Server
RSelenium::checkForServer(beta = TRUE)
selServ <- RSelenium::startServer(javaargs = c("-Dwebdriver.gecko.driver=\"C:/Users/Mislav/Documents/geckodriver.exe\""))
remDr <- remoteDriver(extraCapabilities = list(marionette = TRUE))
remDr$open() # silent = TRUE
Sys.sleep(2)
# Simulate browser session and fill out form
remDr$navigate("http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html")
Sys.sleep(2)
doc <- htmlParse(remDr$getPageSource()[[1]], encoding = "UTF-8")
# close and stop server
remDr$close()
selServ$stop()
tables <- readHTMLTable(doc)
head(tables)
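To mirror the goal of the question (just the R.U.T. and Entidad columns), a short follow-up along these lines should work; it assumes the fiscalizados listing is the first table that readHTMLTable() returns:
# keep only the first two columns (R.U.T. and Entidad) of the first parsed table
fiscalizados <- tables[[1]][, 1:2]
head(fiscalizados)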
The first issue is how to programmatically access each of the 11 pages of this online table.
Since this is a simple HTML table, clicking the "Next" button takes us to a new page. If we look at the URL of that page, we can see the page number in the query parameters:
"... tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&estado ..."
We know that the pages are numbered starting from 0 (because "Next" takes us to page 1), and using the navigation bar we can see that there are 11 pages. The httr package offers many very useful tools for handling HTTP requests. Among them is httr::parse_url(), which returns a list with the components of a URL.
url_params <- httr::parse_url("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")
# Not all results returned were included below for the sake of keeping the example concise
$scheme
[1] "https"
$hostname
[1] "colombiacompra.gov.co"
$path
[1] "tienda-virtual-del-estado-colombiano/ordenes-compra"
$query
$query$page
[1] "1"
$query$number_order
[1] ""
Use the query parameters to construct a series of rvest::read_html() calls corresponding to each page number, simply using lapply() and paste0() to substitute the page= value. We can also save some time by coercing to data.frame inside the lapply().
library(rvest)

pages <- lapply(0:11, function(x) {
  read_html(paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
                   x,
                   "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")) |>
    html_table() |>
    data.frame()
})

do.call(rbind, pages)
But I get the following error:
> url_params <- httr::parse_url("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")
>
> pages <- lapply(0:11, function(x) {
+ rvest::read_html(x = paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
+ x,
+ "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")) |>
Error: unexpected '>' in:
" x,
"&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")) |>"
> html_table() |>
Error: unexpected '>' in " html_table() |>"
> data.frame()
data frame with 0 columns and 0 rows
> })
Error: unexpected '}' in " }"
>
> do.call(rbind, pages)
Error in do.call(rbind, pages) : object 'pages' not found
>
> do.ca
Error: object 'do.ca' not found
How can I solve this?
(The errors above come from the native pipe |>, which requires R >= 4.1; the magrittr pipe %>% works on older versions.) You can use map_df() to read the HTML table from each URL and combine them into one table.
urls <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
               0:11,
               "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")
library(rvest)
res <- purrr::map_df(urls, ~.x %>% read_html() %>% html_table)
res
# A tibble: 578 x 6
# `Orden de Compra` `Entidad Estatal` `Fecha de la or… Estado Instrumento Total
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 72683 UNIDAD ADMINISTRA… 2021-07-16 20:0… Emiti… IAD Softwa… $453…
# 2 72670 SUPERINTENDENCIA … 2021-07-16 16:5… Emiti… IAD Softwa… $252…
# 3 72648 BOGOTA D.C - UNID… 2021-07-16 14:4… Emiti… IAD Softwa… $179…
# 4 72638 ESTABLECIMIENTO P… 2021-07-16 14:1… Emiti… IAD Softwa… $1,7…
# 5 72631 INPEC - ESTABLECI… 2021-07-16 12:5… Emiti… IAD Softwa… $1,6…
# 6 72605 POLICIA NACIONAL … 2021-07-15 20:0… Emiti… IAD Softwa… $2,9…
# 7 72524 FOGAFIN 2021-07-15 08:4… Emiti… IAD Softwa… $5,9…
# 8 72502 INSTITUTO DE FOME… 2021-07-14 22:2… Emiti… IAD Softwa… $1,9…
# 9 72471 ANTIOQUA - MUNICI… 2021-07-14 09:5… Emiti… IAD Softwa… $28,…
#10 72433 AGENCIA DE RENOVA… 2021-07-13 16:2… Emiti… IAD Softwa… $282…
# … with 568 more rows
I have names, something like "Robin the Bruyne" or "Victor from the Loo".
These names are in a data frame in my session. I need to change these names into
<lastname, firstname middlename(s)>,
so they are turned around. But I don't know how to do this.
I know I can use things like separate() or map() from purrr (tidyverse).
Data:
dat <- tibble::tribble(
~nr, ~name, ~prodno,
2019001, "Piet de Boer", "lux_zwez",
2019002, "Elly Hamstra", "zuv_vla",
2019003, "Sue Ellen Schilder", "zuv_vla",
2019004, "Truus Janssen", "zuv_vmlk",
2019005, "Evelijne de Vries", "lux_zwez",
2019006, "Berend Boersma", "lux_gras",
2019007, "Marius van Asten", "zuv_vla",
2019008, "Corneel Jansen", "lux_gras",
2019009, "Joke Timmerman", "zuv_vla",
2019010, "Jan Willem de Jong", "lux_zwez",
2019011, "Frederik Janssen", "zuv_vmlk",
2019012, "Antonia de Jongh", "zuv_vmlk",
2019013, "Lena van der Loo", "zuv_qrk",
2019014, "Johanna Haanstra", "lux_gras"
)
We can try using sub here:
names <- c("Robin the Bruyne", "Victor from the Loo")
output <- sub("^(.*) ([A-Z][a-z]+)$", "\\2, \\1", names)
output
[1] "Bruyne, Robin the" "Loo, Victor from the"
This approach uses the following pattern:
^(.*) capture everything from the start until the last space
([A-Z][a-z]+)$ capture the last name, which starts with a capital
Then, we replace with the last name and first/middle names swapped, separated by a comma.
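Applied to the name column of the question's data (assuming it has been read into a tibble called dat, as in the answer below), the same call could look like:
library(dplyr)
# swap "First Middle Last" into "Last, First Middle" directly in the data frame
dat <- dat %>%
  mutate(name = sub("^(.*) ([A-Z][a-z]+)$", "\\2, \\1", name))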
If I understood you correctly, this should work.
dat = tibble::tribble(
~nr, ~name, ~prodno,
2019001, "Piet de Boer", "lux_zwez",
2019002, "Elly Hamstra", "zuv_vla",
2019003, "Sue Ellen Schilder", "zuv_vla",
2019004, "Truus Janssen", "zuv_vmlk",
2019005, "Evelijne de Vries", "lux_zwez",
2019006, "Berend Boersma", "lux_gras",
2019007, "Marius van Asten", "zuv_vla",
2019008, "Corneel Jansen", "lux_gras",
2019009, "Joke Timmerman", "zuv_vla",
2019010, "Jan Willem de Jong", "lux_zwez",
2019011, "Frederik Janssen", "zuv_vmlk",
2019012, "Antonia de Jongh", "zuv_vmlk",
2019013, "Lena van der Loo", "zuv_qrk",
2019014, "Johanna Haanstra", "lux_gras"
)
library(magrittr)
dat %>% dplyr::mutate(
lastname = stringr::str_extract(name,"(?<=[:blank:])[:alnum:]+$"),
firstname = stringr::str_extract(name,".*(?=[:blank:])"),
name = paste(lastname,firstname,sep = ", ")
) %>% dplyr::select(-firstname,-lastname)
#> # A tibble: 14 x 3
#> nr name prodno
#> <dbl> <chr> <chr>
#> 1 2019001 Boer, Piet de lux_zwez
#> 2 2019002 Hamstra, Elly zuv_vla
#> 3 2019003 Schilder, Sue Ellen zuv_vla
#> 4 2019004 Janssen, Truus zuv_vmlk
#> 5 2019005 Vries, Evelijne de lux_zwez
#> 6 2019006 Boersma, Berend lux_gras
#> 7 2019007 Asten, Marius van zuv_vla
#> 8 2019008 Jansen, Corneel lux_gras
#> 9 2019009 Timmerman, Joke zuv_vla
#> 10 2019010 Jong, Jan Willem de lux_zwez
#> 11 2019011 Janssen, Frederik zuv_vmlk
#> 12 2019012 Jongh, Antonia de zuv_vmlk
#> 13 2019013 Loo, Lena van der zuv_qrk
#> 14 2019014 Haanstra, Johanna lux_gras
Created on 2019-06-02 by the reprex package (v0.2.1)
I'm trying to perform lemmatization on a corpus, using the function lemmatize_strings() as an argument to tm_map() from the tm package.
But I want to use my own dictionary ("lexico": the first column has the full word form in lower case, while the second column has the corresponding replacement lemma).
I tried to use:
corpus<-tm_map(corpus, lemmatize_strings)
But it didn't work...
When I use:
lemmatize_strings(corpus[[1]], dictionary = lexico)
I have no problem!
How can I pass my dictionary "lexico" to the function tm_map()?
Sorry for this question; it's my first attempt at text mining, at the age of 48.
To make this clearer, my corpus is composed of 2000 documents; here is an extract from the first document:
corpus[[1]][[1]]
[9] "..."
[10] "Nos últimos dias da passada legislatura, a maioria de direita aprovou duas leis que significam enormes recuos nos direitos das cidadãs do país. Fizeram tábua rasa do pronunciamento das cidadãs e cidadãos do país em referendo, optando por humilhar e tentar culpabilizar as mulheres que abortam por sua livre escolha. Estas duas leis são a Lei n.º 134/2015 e a Lei n.º 136/2015, de setembro. A primeira prevê o pagamento de taxas moderadoras na interrupção de gravidez quando for realizada, por opção da mulher, nas primeiras 10 semanas de gravidez. A segunda representa a primeira alteração à Lei n.º 16/2007, de 17 de abril, sobre exclusão de ilicitude nos casos de interrupção voluntária da gravidez."
Then I worked on a dictionary file (lexico) with this structure:
lexico[1:10,]
termo lema pos.tag
1 aa a NCMP000
2 aais aal NCMP000
3 aal aal NCMS000
4 aaleniano aaleniano NCMS000
5 aalenianos aaleniano NCMP000
6 ab-rogação ab-rogação NCFS000
7 ab-rogações ab-rogação NCFP000
8 ab-rogamento ab-rogamento NCMS000
9 ab-rogamentos ab-rogamento NCMP000
10 ab-rogáveis ab-rogável AQ0CP0
When I use the function lemmatize_strings(corpus[[1]], dictionary = lexico), it works correctly and gives document nº 1 of the corpus lemmatized with lemmas from my dictionary.
The problem I have is with this function:
> corpus<-tm_map(corpus, lemmatize_strings, dictionary = lexico)
Warning messages:
1: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
2: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
> corpus[[1]][[1]]
[1] ""
This simply destroys all my documents in the corpus:
> corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2000
Thanks in advance for your reply!
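For what it's worth, a common pattern when passing a non-tm function to tm_map() is to wrap it in tm::content_transformer(); here is a minimal, untested sketch with the objects from the question:
library(tm)
library(textstem)  # lemmatize_strings() comes from textstem
# content_transformer() keeps the document structure intact while applying the function;
# extra arguments such as dictionary are passed through to lemmatize_strings()
corpus <- tm_map(corpus, content_transformer(lemmatize_strings), dictionary = lexico)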
You could, for example, use the quanteda package for this:
library("quanteda")
text <- "This is a test sentence. We can lemmatize it using quanteda."
dict <- data.frame(
word = c("is", "using"),
lemma = c("be", "use"),
stringsAsFactors = FALSE
)
toks <- tokens(text, remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
                             pattern = dict$word,
                             replacement = dict$lemma,
                             case_insensitive = TRUE,
                             valuetype = "fixed")
toks_lemma
tokens from 1 document.
text1 :
[1] "This" "be" "a" "test" "sentence" "We" "can" "lemmatize"
[9] "it" "use" "quanteda"
The function is very fast and, despite its name, was mainly created for lemmatization.
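With the dictionary from the question, the same idea would look roughly like this; it is a sketch that assumes the document texts have already been extracted into a character vector (here called textos) and that lexico has the termo and lema columns shown above:
library(quanteda)
# textos: a character vector with one element per raw document text
toks <- tokens(textos, remove_punct = TRUE)
# replace every word form (termo) with its lemma (lema) from the custom dictionary
toks_lemma <- tokens_replace(toks,
                             pattern = lexico$termo,
                             replacement = lexico$lema,
                             case_insensitive = TRUE,
                             valuetype = "fixed")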
I'm trying to scrape comments from this website:
http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/
And this is my code for this task.
library(rvest)
url <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
webpage <- read_html(url)
data_html <- html_nodes(webpage, "gig-comment-body")
Unfortunately it seems that rvest doesn't recognize the nodes through the CSS selector (gig-comment-body).
data_html comes out as an empty node set, so it's not scraping anything.
Here is another solution with RSelenium, without Docker:
install.packages("RSelenium")
library(RSelenium)
driver <- rsDriver()
remDr <- driver[["client"]]
remDr$navigate("http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/")
elem <- remDr$findElement( using = "id",value = "commentsDiv-779453")
#or
elem <- remDr$findElement( using = "class name", "gig-comments-comments")
elem$highlightElement() # just for interactive use in browser.
elemtxt <- elem$getElementAttribute("outerHTML") # gets us the HTML
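From there you can hand the returned HTML back to rvest to pull out the comment text; a small sketch, assuming elemtxt from the call above:
library(rvest)
# getElementAttribute() returns a list, so take the first element and parse it
comments <- read_html(elemtxt[[1]]) %>%
  html_nodes("div.gig-comment-body") %>%
  html_text(trim = TRUE)
head(comments)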
@r2evans is correct. The page builds the comment <div>s with JavaScript and it also requires a delay. I prefer Splash to Selenium (though I made splashr, so I'm not exactly impartial):
library(rvest)
library(splashr)
URL <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
# Needs Docker => https://www.docker.com/
# Then needs splashr::install_splash()
start_splash()
splash_local %>%
  splash_response_body(TRUE) %>%
  splash_go(URL) %>%
  splash_wait(10) %>%
  splash_html() -> pg
html_nodes(pg, "div.gig-comment-body")
## {xml_nodeset (10)}
## [1] <div class="gig-comment-body"><p><span>Algunosdesubicados comentan y se refieren a la UE<span> </span>como si en alguna forma Chil ...
## [2] <div class="gig-comment-body">Si buscan información se encontrarán que la unión Europea se está desmorona ndo por asunto de la inmi ...
## [3] <div class="gig-comment-body">Pocos inmigrantes tiene Chile en función de su población. En España hay 4.5 mill de inmigrantes. 800. ...
## [4] <div class="gig-comment-body">Chao chilenois idiotas tanto hablan y dicen que hacer cuando ni su pais les pertenece esta gobernado ...
## [5] <div class="gig-comment-body">\n<div> Victor Hugo Ramirez Lillo, de Conchalí, exiliado en Goiania, Brasil, pecha bono de exonerado, ...
## [6] <div class="gig-comment-body">Les escribo desde mi 2do pais, USA. Mi PDTE. TRUMP se bajó del TPP y Chile se va a la cresta. La o ...
## [7] <div class="gig-comment-body">En CHILE siempre fuimos muy cuidadosos con le emigración, solo lo MEJOR de Alemania, Francia, Suecia, ...
## [8] <div class="gig-comment-body"><span>Basta de inmigración!!! Santiago está lleno de vendedores ambulantes extranieros!!!¿¿esos son l ...
## [9] <div class="gig-comment-body">IGNOREN A JON LESCANO, ESE ES UN CHOLO QUE FUE DEPORTADO DE CHILE.<div>IGNOREN A LOS EXTRANJEROS MET ...
## [10] <div class="gig-comment-body">Me pregunto qué dirá el nacionalista promedio cuando agarre un libro de historia y se dé cuenta de qu ...
killall_splash()
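If you want the comment text itself rather than the nodes, an html_text() call on the same selection should do it (reusing pg from above):
# extract the visible text of each comment and trim stray whitespace
comments <- html_nodes(pg, "div.gig-comment-body") %>%
  html_text(trim = TRUE)
head(comments)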
From this link, I'm trying to download multiple PDF files, but I can't get the exact URL for each file.
To access one of the PDF files, you can click on "Región de Arica y Parinacota" and then click on "Arica". You can then check that the URL is http://cdn.servel.cl/padronesauditados/padron/A1501001.pdf; if you click on the next link, "Camarones", you will notice that the URL is http://cdn.servel.cl/padronesauditados/padron/A1501002.pdf.
I checked more URLs, and they all follow a similar pattern:
"A" + a two-digit number from 01 to 15 + a two-digit number of unknown range + a three-digit number of unknown range
Even though the URL examples I showed seem to suggest that the files are named sequentially, this is not always the case.
To be able to download all the files despite not knowing the exact URLs, I did the following:
1) I made a for loop to write out all possible file names based on the pattern described above, i.e. A0101001.pdf, A0101002.pdf, ..., A1599999.pdf
library(downloader)
library(stringr)
reg.ind <- 1:15
pro.ind <- 1:99
com.ind <- 1:999
reg <- str_pad(reg.ind, width=2, side="left", pad="0")
prov <- str_pad(pro.ind, width=2, side="left", pad="0")
com <- str_pad(com.ind, width=3, side="left", pad="0")
file <- c()
for (i in 1:length(reg)) {
  reg.i <- reg[i]
  for (j in 1:length(prov)) {
    prov.j <- prov[j]
    for (k in 1:length(com)) {
      com.k <- com[k]
      file <- c(file, paste0("A", reg.i, prov.j, com.k))
    }
  }
}
2) Then I used another for loop to download a file every time I hit a correct URL. I used tryCatch to ignore the cases where the URL was invalid (most of the time):
for (i in 1:length(file)) {
  tryCatch({
    url <- paste0("http://cdn.servel.cl/padronesauditados/padron/", file[i], ".pdf")
    # change destfile accordingly if you decide to run the code
    download.file(url, destfile = paste0("./datos/comunas/", file[i], ".pdf"),
                  mode = "wb")
  }, error = function(e) {})
}
PROBLEM: I know there are no more than 400 PDF files in total, since each one corresponds to a commune in Chile, but I wrote a vector of 1,483,515 possible file names, and therefore my code, even though it works, takes much longer than it would if I could obtain the file names beforehand.
Does anyone know how to work around this problem?
You can re-create the "browser developer tools" experience in R with splashr:
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)
sp <- start_splash()
Sys.sleep(3) # give the docker container time to work
res <- render_har(url = "http://cdn.servel.cl/padronesauditados/padron.html",
response_body=TRUE)
map_chr(har_entries(res), c("request", "url"))
## [1] "http://cdn.servel.cl/padronesauditados/padron.html"
## [2] "http://cdn.servel.cl/padronesauditados/stylesheets/navbar-cleaned.min.css"
## [3] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue.min.css"
## [4] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue2.min.css"
## [5] "http://cdn.servel.cl/padronesauditados/stylesheets/custom.min.css"
## [6] "https://fonts.googleapis.com/css?family=Lato%3A400%2C700%7CRoboto%3A100%2C300%2C400%2C500%2C700%2C900%2C100italic%2C300italic%2C400italic%2C500italic%2C700italic%2C900italic&ver=1458748651"
## [7] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.css"
## [8] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/external/jquery/jquery.js"
## [9] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.js"
## [10] "http://cdn.servel.cl/padronesauditados/images/logo-txt-retina.png"
## [11] "http://cdn.servel.cl/assets/img/nav_arrows.png"
## [12] "http://cdn.servel.cl/padronesauditados/images/loader.gif"
## [13] "http://cdn.servel.cl/padronesauditados/archivos.xml"
## [14] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/images/ui-icons_444444_256x240.png"
## [15] "https://fonts.gstatic.com/s/roboto/v16/zN7GBFwfMP4uA6AR0HCoLQ.ttf"
## [16] "https://fonts.gstatic.com/s/roboto/v16/RxZJdnzeo3R5zSexge8UUaCWcynf_cDxXwCLxiixG1c.ttf"
## [17] "https://fonts.gstatic.com/s/roboto/v16/Hgo13k-tfSpn0qi1SFdUfaCWcynf_cDxXwCLxiixG1c.ttf"
## [18] "https://fonts.gstatic.com/s/roboto/v16/Jzo62I39jc0gQRrbndN6nfesZW2xOQ-xsNqO47m55DA.ttf"
## [19] "https://fonts.gstatic.com/s/roboto/v16/d-6IYplOFocCacKzxwXSOKCWcynf_cDxXwCLxiixG1c.ttf"
## [20] "https://fonts.gstatic.com/s/roboto/v16/mnpfi9pxYH-Go5UiibESIqCWcynf_cDxXwCLxiixG1c.ttf"
## [21] "http://cdn.servel.cl/padronesauditados/stylesheets/fonts/virtue_icons.woff"
## [22] "https://fonts.gstatic.com/s/lato/v13/v0SdcGFAl2aezM9Vq_aFTQ.ttf"
## [23] "https://fonts.gstatic.com/s/lato/v13/DvlFBScY1r-FMtZSYIYoYw.ttf"
Spotting the XML entry is easy in ^^, so we can focus on it:
har_entries(res)[[13]]$response$content$text %>%
  openssl::base64_decode() %>%
  xml2::read_xml() %>%
  xml2::xml_find_all(".//Region") %>%
  map_df(~{
    data_frame(
      id = xml2::xml_find_all(.x, ".//id") %>% xml2::xml_text(),
      nombre = xml2::xml_find_all(.x, ".//nombre") %>% xml2::xml_text(),
      nomcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/nomcomuna") %>% xml2::xml_text(),
      id_archivo = xml2::xml_find_all(.x, ".//comunas/comuna/idArchivo") %>% xml2::xml_text(),
      archcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/archcomuna") %>% xml2::xml_text()
    )
  })
## # A tibble: 346 x 5
## id nombre nomcomuna id_archivo archcomuna
## <chr> <chr> <chr> <chr> <chr>
## 1 1 Región de Arica y Parinacota Arica 1 A1501001.pdf
## 2 1 Región de Arica y Parinacota Camarones 2 A1501002.pdf
## 3 1 Región de Arica y Parinacota General Lagos 3 A1502002.pdf
## 4 1 Región de Arica y Parinacota Putre 4 A1502001.pdf
## 5 2 Región de Tarapacá Alto Hospicio 5 A0103002.pdf
## 6 2 Región de Tarapacá Camiña 6 A0152002.pdf
## 7 2 Región de Tarapacá Colchane 7 A0152003.pdf
## 8 2 Región de Tarapacá Huara 8 A0152001.pdf
## 9 2 Región de Tarapacá Iquique 9 A0103001.pdf
## 10 2 Región de Tarapacá Pica 10 A0152004.pdf
## # ... with 336 more rows
stop_splash(sp) # don't forget to clean up!
You can then programmatically download all the PDFs by combining the archcomuna values with the URL prefix http://cdn.servel.cl/padronesauditados/padron/.
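A minimal sketch of that last step, assuming the tibble built above has been assigned to a variable (here called comunas) and that the destination folder already exists:
base_url <- "http://cdn.servel.cl/padronesauditados/padron/"
# download each commune's PDF, skipping any URL that fails
for (f in unique(comunas$archcomuna)) {
  tryCatch(
    download.file(paste0(base_url, f),
                  destfile = file.path("datos", "comunas", f),
                  mode = "wb"),
    error = function(e) {}
  )
}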