Web scraping PDF files from a map - r

I've been trying to download PDFs embedded in a map using the code below (the original can be found here). Each PDF corresponds to a Brazilian municipality (5,570 files).
library(XML)
library(RCurl)
url <- "http://simec.mec.gov.br/sase/sase_mapas.php?uf=RJ&tipoinfo=1"
page <- getURL(url)
parsed <- htmlParse(page)
links <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
inds <- grep("*.pdf", links)
links <- links[inds]
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
for(i in seq_along(links)){
  download.file(links[i], destfile=destination[i])
  Sys.sleep(runif(1, 1, 5))
}
I have used this code in other projects a few times and it worked, but it doesn't work for this specific case. I've tried many things to scrape these files without success. Recently I found the link below, which makes it possible to combine uf (state) and muncod (municipality code) to download a file, but I don't know how to work this into the code.
http://simec.mec.gov.br/sase/sase_mapas.php?uf=MT&muncod=5100102&acao=download
Thanks in advance!

devtools::install_github("ropensci/RSelenium")
library(rvest)
library(httr)
library(RSelenium)
# connect to selenium server from within r (REPLACE SERVER ADDRESS)
rem_dr <- remoteDriver(
  remoteServerAddr = "192.168.50.25", port = 4445L, browserName = "firefox"
)
rem_dr$open()
# get the two-digit state codes for brazil by scraping the below webpage
tables <- "https://en.wikipedia.org/wiki/States_of_Brazil" %>%
read_html() %>%
html_table(fill = T)
states <- tables[[4]]$Abbreviation
# for each state, we are going to go navigate to the map of that state using
# selenium, then scrape the list of possible municipality codes from the drop
# down menu present in the map
get_munip_codes <- function(state) {
  url <- paste0("http://simec.mec.gov.br/sase/sase_mapas.php?uf=", state)
  rem_dr$navigate(url)
  # have to wait until the drop down menu loads. 8 seconds should be enough time
  # for each state
  Sys.sleep(8)
  src <- rem_dr$getPageSource()
  out <- read_html(src[[1]]) %>%
    html_nodes(xpath = "//select[@id='muncod']/option[boolean(@value)]") %>%
    html_attr("value") %>%
    unlist(use.names = F)
  print(state)
  out
}
state_munip <- sapply(
states, get_munip_codes, USE.NAMES = TRUE, simplify = FALSE
)
# now you can download each pdf. first create a directory for each state, where
# the pdfs for that state will go:
lapply(names(state_munip), function(x) dir.create(file.path("brazil-pdfs", x), recursive = TRUE))
# ...then loop over each state/municipality code and download the pdf
lapply(
  names(state_munip), function(state) {
    lapply(state_munip[[state]], function(munip) {
      url <- sprintf(
        "http://simec.mec.gov.br/sase/sase_mapas.php?uf=%s&muncod=%s&acao=download",
        state, munip
      )
      file <- file.path("brazil-pdfs", state, paste0(munip, ".pdf"))
      this_one <- paste0("state ", state, ", munip ", munip)
      tryCatch({
        GET(url, write_disk(file, overwrite = TRUE))
        print(paste0(this_one, " downloaded"))
      },
      error = function(e) {
        print(paste0("couldn't download ", this_one))
        try(unlink(file, force = TRUE))
      }
      )
    })
  }
)
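The XPath step inside get_munip_codes() can be tried offline before pointing Selenium at the real site. This is only a minimal sketch: the HTML fragment is a hypothetical stand-in for the page source (the real drop-down may be structured differently), and the option labels and the second code are illustrative.

```r
library(rvest)

# Hypothetical stand-in for the page source returned by getPageSource();
# the real drop-down markup is an assumption here.
src <- '<select id="muncod">
  <option value="">Select a municipality</option>
  <option value="5100102">Municipality A</option>
  <option value="5100201">Municipality B</option>
</select>'

codes <- read_html(src) %>%
  html_nodes(xpath = "//select[@id='muncod']/option[boolean(@value)]") %>%
  html_attr("value")

# boolean(@value) only checks that the attribute exists, so an empty
# placeholder entry still comes through; drop it explicitly.
codes <- codes[nzchar(codes)]
print(codes)
```

If the real menu has a placeholder entry with an empty value, the same nzchar() filter would keep it out of the download loop.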
STEPS:
1. Get the IP address of your Windows machine (see https://www.digitalcitizen.life/find-ip-address-windows).
2. Start the Selenium server Docker container by running this:
docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1
3. Start the rocker/tidyverse Docker container by running this:
docker run -v `pwd`/brazil-pdfs:/home/rstudio/brazil-pdfs -dp 8787:8787 rocker/tidyverse
4. Go into your preferred browser and enter this address: http://localhost:8787. This will take you to the login screen for RStudio Server. Log in using the username "rstudio" and password "rstudio".
5. Copy/paste the code shown above into a new RStudio .R document. Replace the value for remoteServerAddr with the IP address you found in step 1.
6. Run the code. This should write the PDFs to a directory "brazil-pdfs" that is both inside the container and mapped to your Windows machine (in other words, the PDFs will show up in the brazil-pdfs dir on your local machine as well). Note that the code takes a while to run because there are a lot of PDFs.
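Because the full run is long, it may help to make it resumable, so that an interrupted run doesn't start again from scratch. The sketch below is not part of the original answer: pending_downloads() is a hypothetical helper that assumes the same brazil-pdfs/<state>/<code>.pdf layout used above and treats any non-empty file as already downloaded.

```r
# Return the (state, code) pairs whose PDF is not on disk yet.
# Assumes the directory layout brazil-pdfs/<state>/<code>.pdf used above.
pending_downloads <- function(state_munip, root = "brazil-pdfs") {
  jobs <- data.frame(
    state = rep(names(state_munip), lengths(state_munip)),
    munip = unlist(state_munip, use.names = FALSE),
    stringsAsFactors = FALSE
  )
  jobs$file <- file.path(root, jobs$state, paste0(jobs$munip, ".pdf"))
  # A file counts as done only if it exists and is not empty
  done <- file.exists(jobs$file) & file.size(jobs$file) > 0
  jobs[!done, , drop = FALSE]
}

# Demo in a temporary directory with one file already "downloaded"
root <- file.path(tempdir(), "brazil-pdfs-demo")
dir.create(file.path(root, "MT"), recursive = TRUE, showWarnings = FALSE)
writeLines("dummy", file.path(root, "MT", "5100102.pdf"))
todo <- pending_downloads(list(MT = c("5100102", "5100201")), root = root)
print(todo$munip)
```

The download loop could then iterate over the rows of the returned data frame instead of the full list.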

Related

web scraping RSelenium findElement

I feel this is supposed to be simple, but I have been struggling to get it right. I'm trying to extract the number of employees ("2,300,000") from this webpage: https://fortune.com/company/walmart/
I used Chrome's SelectorGadget extension to locate the number: "info__row--7f9lE:nth-child(13) .info__value--2AHH7"
```
library(RSelenium)
library(rvest)
library(netstat)
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees
```
An error says
> "Selenium message:no such element: Unable to locate element".
I have also tried:
```
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
```
But it does not return the data I want.
Can someone point out the problem? Really appreciate it!
Updated
I modified the code as suggested by Frodo below in the comment to apply to multiple webpages to save the statistics as a dataframe. But I still encountered an error.
library(RSelenium)
library(rvest)
library(netstat)
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=netstat::free_port())
remDr<-rs_driver_object$client
Data<-data.frame("url" = c("https://fortune.com/company/walmart/", "https://fortune.com/company/amazon-com/"
,"https://fortune.com/company/apple/"
,"https://fortune.com/company/cvs-health/"
,"https://fortune.com/company/jpmorgan-chase/"
,"https://fortune.com/company/verizon/"
,"https://fortune.com/company/ford-motor/"
, "https://fortune.com/company/general-motors/"
,"https://fortune.com/company/anthem/"
, "https://fortune.com/company/centene/"
,"https://fortune.com/company/fannie-mae/"
, "https://fortune.com/company/comcast/"
, "https://fortune.com/company/chevron/"
,"https://fortune.com/company/dell-technologies/"
,"https://fortune.com/company/bank-of-america-corp/"
,"https://fortune.com/company/target/") )
Data$numEmp<-"NA"
Data$numEmp <- numeric()
for (i in 1:length(Data$url))
{
remDr$navigate(url = Data$url[i])
pgSrc <- remDr$getPageSource()
pgCnt <- read_html(pgSrc[[1]])
Data$numEmp[i] <- pgCnt %>%
html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
html_text(trim = TRUE)
}
Data$numEmp
Selenium message:unknown error: unexpected command response
(Session info: chrome=103.0.5060.114)
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method
Can someone please take another look?
Use RSelenium to load up the webpage and get the page source
remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()
Use Rvest to read the contents of the webpage
pgCnt <- read_html(pgSrc[[1]])
Further, use rvest::html_nodes and rvest::html_text functions to extract the text using relevant xpath selectors. (this Chrome extension should help)
reqTxt <- pgCnt %>%
html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
html_text(trim = TRUE)
Output of reqTxt
> reqTxt
[1] "2,300,000"
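The extracted value is a character string with thousands separators. If a number is needed, a small base-R conversion along these lines should do (to_count is a hypothetical helper name, and it assumes commas are the only formatting present):

```r
# Turn a formatted count such as "2,300,000" into a number.
# Assumes commas are thousands separators and nothing else needs stripping.
to_count <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

print(to_count(c("2,300,000", "7,400")))
```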
UPDATE
The error "Selenium message:unknown error: unexpected command response" seems to occur specifically with version 103 of ChromeDriver. More info here. One of the answers there suggested a simple wait of 5 seconds before and after the driver navigates to the URL. I have also used tryCatch inside a while loop so that the code keeps retrying until the page loads. This seems to work.
# Function to fetch employee count
getEmployees <- function(myURL) {
  pagestatus <- 0
  # Keep retrying until the page loads without an error
  while (pagestatus == 0) {
    pagestatus <- tryCatch(
      {
        remDr$navigate(url = myURL)
        1
      },
      error = function(error) 0
    )
  }
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
Apply this function to every URL in your data frame.
for(i in 1:nrow(Data)) {
  Sys.sleep(5)
  Data[i, 2] <- getEmployees(Data[i, 1])
  Sys.sleep(5)
}
Now the second column contains:
> Data[, 2]
[1] "2,300,000" "1,608,000" "154,000" "258,000" "271,025" "118,400"
[7] "183,000" "157,000" "98,200" "72,500" "7,400" "189,000"
[13] "42,595" "133,000" "208,248" "450,000"
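One caveat with retrying in a while loop is that it never gives up if a page keeps failing. A bounded variant might look like the sketch below. with_retry() is a hypothetical helper, not part of RSelenium, and flaky() merely stands in for remDr$navigate() so the snippet runs without a browser.

```r
# Call f() until it succeeds or max_tries is reached, pausing between attempts.
with_retry <- function(f, max_tries = 5, wait = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("all ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Stand-in for remDr$navigate(): fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("unexpected command response")
  "loaded"
}

out <- with_retry(flaky, max_tries = 5, wait = 0)
print(out)
```

In the scraping loop, the call would be something like with_retry(function() remDr$navigate(url = myURL)), with a non-zero wait.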
Does it have to be with RSelenium only? In my experience, the most flexible approach is to use RSelenium to navigate to the required pages (where findElement helps you find boxes to enter text into or buttons to click) and then use rvest to extract what you need from the page.
Start with
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=netstat::free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])
How you then go about it depends on how specific you want the solution to be wrt this exact page. Here is one way:
rvest::html_elements(pg, "div.info__row--7f9lE") |>
rvest::html_text2()
or
rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |>
rvest::html_text2()
or
rvest::html_elements(pg, "div.info__row--7f9lE")[11] |>
rvest::html_children()
or
rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |>
rvest::html_text2()
et cetera. What you do in the rvest part would depend on how general you want the selection/extraction process to be.
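To illustrate the trade-off, here is a small offline comparison of a label-based XPath and a class-based CSS selector. The HTML fragment is a hypothetical imitation of the page layout (the "Country" row and the title class are made up), so treat it as a sketch rather than the site's actual markup.

```r
library(rvest)

# Hypothetical fragment imitating the layout discussed above; the real page
# uses generated class names that may change between site builds.
pg <- minimal_html('
  <div class="info__row--7f9lE">
    <div class="info__title--x">Country</div>
    <div class="info__value--2AHH7">U.S.</div>
  </div>
  <div class="info__row--7f9lE">
    <div class="info__title--x">Employees</div>
    <div class="info__value--2AHH7">2,300,000</div>
  </div>')

# Anchored on the visible label: returns only the value next to "Employees"
by_label <- pg |>
  html_elements(xpath = "//div[text()='Employees']/following-sibling::div") |>
  html_text2()

# Anchored on the class alone: returns every value cell on the page
by_class <- pg |>
  html_elements("div.info__value--2AHH7") |>
  html_text2()

print(by_label)
print(by_class)
```

The label-based selector does not depend on the row's position, which is why it tends to carry over to other company pages where the rows may be ordered differently.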

Always set working directory to Dropbox folder on any machine

I wanted a simple snippet at the beginning of my scripts that sets the working directory to my Dropbox folder, regardless of which machine I run my code on:
setdir <- function(){
  wandir <- paste(path.expand("~"), "/Dropbox/_R", sep = "")
  curdir <- getwd()
  if(curdir!=wandir){
    setwd(wandir)
  }
}
setdir()
The trick with the path.expand("~") works on Linux machines, but it doesn't on Windows machines, because it leads to C:/Users/username/Documents instead of C:/Users/username/. Is there a function that would work globally?
Here is a hacky workaround, which is far from a global one:
setdir <- function(){
  wandir <- paste(path.expand("~"), "/Dropbox/_R", sep = "")
  wandir <- sub("/Documents", "", wandir)
  curdir <- getwd()
  if(curdir!=wandir){
    setwd(wandir)
  }
}
setdir()
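A somewhat less hacky variant is to ask Windows for the profile directory directly through the USERPROFILE environment variable, and fall back to path.expand("~") elsewhere. This is a sketch under the assumption that Dropbox sits in its default location under the user's home directory; user_home() and dropbox_dir() are hypothetical helper names, and the sysname and env parameters exist only so the logic can be tried on any machine.

```r
# Return the user's profile directory on Windows and the home directory elsewhere.
# `sysname` and `env` are parameters so the logic can be tested on any OS.
user_home <- function(sysname = Sys.info()[["sysname"]], env = Sys.getenv) {
  if (identical(sysname, "Windows")) {
    home <- env("USERPROFILE")
    # Normalise backslashes so the result matches what getwd() reports
    if (nzchar(home)) return(gsub("\\\\", "/", home))
  }
  path.expand("~")
}

# Assumes Dropbox lives in its default location under the home directory
dropbox_dir <- function(...) file.path(user_home(...), "Dropbox", "_R")

# Simulated Windows environment, to show the result without being on Windows
fake_env <- function(x) c(USERPROFILE = "C:\\Users\\me")[[x]]
print(dropbox_dir("Windows", fake_env))
```

In a script this would be used as setwd(dropbox_dir()), ideally guarded by dir.exists() in case the Dropbox folder was moved.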

How can I change my system's IP through R?

I want to change my system's IP through R. Is there any way this can be done?
I have tried the answer below by Pablo Barbera, but it didn't work for me.
library(RCurl)
# check current IP address
print(getURL("http://ifconfig.me/ip"))
# proxy options
opts <- list(proxy="127.0.0.1", proxyport=8118)
# opening connection with TOR
con <- socketConnection(host="127.0.0.1",port=9051)
print(getURL("http://ifconfig.me/ip", .opts = opts))
for (i in 1:10)
{
writeLines('AUTHENTICATE \"password\"\r\nSIGNAL NEWNYM\r\n', con=con)
Sys.sleep(5)
print(getURL("http://ifconfig.me/ip", .opts = opts))
Sys.sleep(5)
}
LINK : Changing Tor identity in R
Can anybody help me understand what this code is doing and how it works?
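Reading the snippet line by line: the getURL() calls with .opts go through a local HTTP proxy on port 8118 (commonly Privoxy forwarding to Tor), while socketConnection() opens Tor's control port on 9051. Inside the loop, AUTHENTICATE logs in to the control port and SIGNAL NEWNYM asks Tor to switch to fresh circuits, so later proxied requests may leave through a different exit node. Note that this changes the address websites see for proxied requests, not the machine's own IP. As a small sketch of the string being written to the socket (tor_newnym_command is a hypothetical helper and the password is a placeholder):

```r
# Build the two control-port commands that the loop above writes to the socket.
# The password must match the one configured for Tor's control port.
tor_newnym_command <- function(password) {
  paste0('AUTHENTICATE "', password, '"\r\n', "SIGNAL NEWNYM\r\n")
}

cmd <- tor_newnym_command("password")
cat(cmd)
```

If Tor and the proxy are not running locally with these ports and this password, the snippet cannot work, which is a likely reason for it failing on a fresh machine.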

Specify download folder in RSelenium

I am using RSelenium to navigate to a webpage that contains a button to download a file. I use RSelenium to click this button, which downloads the file. However, files are downloaded to my 'downloads' folder by default, whereas I want the file to be downloaded to my working directory. I tried specifying a Chrome profile as below, but this did not seem to do the job:
wd <- getwd()
cprof <- getChromeProfile(wd, "Profile 1")
remDr <- remoteDriver(browserName= "chrome", extraCapabilities = cprof)
The file is still downloaded in the folder 'downloads', rather than my working directory. How can this be solved?
The solution involves setting the appropriate chromeOptions outlined at https://sites.google.com/a/chromium.org/chromedriver/capabilities . Here is an example on a windows 10 box:
library(RSelenium)
eCaps <- list(
chromeOptions =
list(prefs = list(
"profile.default_content_settings.popups" = 0L,
"download.prompt_for_download" = FALSE,
"download.default_directory" = "C:/temp/chromeDL"
)
)
)
rD <- rsDriver(extraCapabilities = eCaps)
remDr <- rD$client
remDr$navigate("http://www.colorado.edu/conflict/peace/download/")
firstzip <- remDr$findElement("xpath", "//a[contains(@href, 'zip')]")
firstzip$clickElement()
> list.files("C:/temp/chromeDL")
[1] "peace.zip"
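To target the working directory, as the question asks, the same capabilities list can be built from getwd(). The following is a sketch: chrome_download_caps() is a hypothetical helper, and normalising the path to an absolute one with native separators is a precaution on my part, not something the ChromeDriver documentation is quoted on here.

```r
# Build the extraCapabilities list for a given download directory.
chrome_download_caps <- function(dir = getwd()) {
  # Assumption: an absolute path with native separators is the safest input
  dir <- normalizePath(dir, winslash = "\\", mustWork = TRUE)
  list(
    chromeOptions = list(
      prefs = list(
        "profile.default_content_settings.popups" = 0L,
        "download.prompt_for_download" = FALSE,
        "download.default_directory" = dir
      )
    )
  )
}

# Demo with the session's temporary directory (no browser is started)
eCaps <- chrome_download_caps(tempdir())
str(eCaps)
# rD <- RSelenium::rsDriver(browser = "chrome", extraCapabilities = eCaps)
```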
I've been trying the alternatives, and it seems that @Bharath's first comment is the way to go: give up on fiddling with the prefs (it doesn't seem possible to do that) and instead move the file from the default download folder to the desired folder. The trick to making this a portable solution is finding where the default download directory is. It varies by OS (which you can get like so), and you need to find the user's username too:
desired_dir <- "~/Desktop/cool_downloads"
file_name <- "whatever_I_downloaded.zip"
# build path to chrome's default download directory
if (Sys.info()[["sysname"]]=="Linux") {
  default_dir <- file.path("", "home", Sys.info()[["user"]], "Downloads")
} else {
  default_dir <- file.path("", "Users", Sys.info()[["user"]], "Downloads")
}
# move the file to the desired directory
file.rename(file.path(default_dir, file_name), file.path(desired_dir, file_name))
Here is an alternative approach.
Your download folder should be empty before the download starts, so that only the new file gets moved.
# List the files inside the folder
down.list <- list.files(path = "E:/Downloads/",all.files = T,recursive = F)
# Move all files to specific folder
file.rename(from = paste0("E:/Downloads/",down.list),to = paste0("E:/1/scrape/",down.list))
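Whichever move approach is used, the file should only be moved once the browser has finished writing it. The sketch below wraps that in a hypothetical helper, move_download(). It assumes Chrome marks unfinished downloads with a .crdownload extension, and it falls back to copy and delete because file.rename() can fail across drives.

```r
# Wait until `file_name` has finished downloading, then move it to `to_dir`.
move_download <- function(file_name, from_dir, to_dir, timeout = 60) {
  src <- file.path(from_dir, file_name)
  deadline <- Sys.time() + timeout
  # Assumption: Chrome keeps a *.crdownload file around while a download runs
  in_progress <- function() {
    length(list.files(from_dir, pattern = "\\.crdownload$")) > 0
  }
  while (!file.exists(src) || in_progress()) {
    if (Sys.time() > deadline) stop("download did not finish: ", file_name)
    Sys.sleep(0.5)
  }
  dir.create(to_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(to_dir, file_name)
  # file.rename() can fail across drives, so fall back to copy + delete
  if (!suppressWarnings(file.rename(src, dest))) {
    if (!file.copy(src, dest, overwrite = TRUE)) stop("could not move ", file_name)
    unlink(src)
  }
  invisible(dest)
}

# Demo with temporary directories standing in for the real folders
from <- file.path(tempdir(), "dl-demo-from")
to <- file.path(tempdir(), "dl-demo-to")
dir.create(from, showWarnings = FALSE)
writeLines("dummy", file.path(from, "peace.zip"))
dest <- move_download("peace.zip", from, to)
print(dest)
```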

system open RStudio close connection

I'm attempting to use R to open a .Rproj file used in RStudio. I have succeeded with the code below (stolen from Ananda here). However, the connection to open RStudio called from R is not closed after the file is opened. How can I sever this "connection" after the .Rproj file is opened? (PS this has not been tested on Linux or Mac yet).
## Create dummy .Rproj
x <- c("Version: 1.0", "", "RestoreWorkspace: Default", "SaveWorkspace: Default",
"AlwaysSaveHistory: Default", "", "EnableCodeIndexing: Yes",
"UseSpacesForTab: No", "NumSpacesForTab: 4", "Encoding: UTF-8",
"", "RnwWeave: knitr", "LaTeX: pdfLaTeX")
loc <- file.path(getwd(), "Bar.rproj")
cat(paste(x, collapse = "\n"), file = loc)
## wheresRStudio function to find RStudio location
wheresRstudio <-
function() {
  myPaths <- c("rstudio", "~/.cabal/bin/rstudio",
    "~/Library/Haskell/bin/rstudio", "C:\\PROGRA~1\\RStudio\\bin\\rstudio.exe",
    "C:\\RStudio\\bin\\rstudio.exe")
  panloc <- Sys.which(myPaths)
  temp <- panloc[panloc != ""]
  if (identical(names(temp), character(0))) {
    ans <- readline("RStudio not installed in one of the typical locations.\n
Do you know where RStudio is installed? (y/n) ")
    if (ans == "y") {
      temp <- readline("Enter the (unquoted) path to RStudio: ")
    } else {
      if (ans == "n") {
        stop("RStudio not installed or not found.")
      }
    }
  }
  temp
}
## function to open .Rproj files
open_project <- function(Rproj.loc) {
action <- paste(wheresRstudio(), Rproj.loc)
message("Preparing to open project!")
system(action)
}
## Test it (it works but does not close)
open_project(loc)
It's not clear what you're trying to do exactly. What you've described doesn't really sound to me like a "connection" -- it's a system call.
I think what you're getting at is that after you run open_project(loc) in your above example, you don't get your R prompt back until you close the instance of RStudio that was opened by your function. If that is the case, you should add wait = FALSE to your system call.
You might also need to add an ignore.stderr = TRUE in there to get directly back to the prompt. I got some error about "QSslSocket: cannot resolve SSLv2_server_method" on my Ubuntu system, and after I hit "enter" it took me back to the prompt. ignore.stderr can bypass that (but might also mean that the user doesn't get meaningful errors in the case of serious errors).
In other words, I would change your open_project() function to the following and see if it does what you expect:
open_project <- function(Rproj.loc) {
  action <- paste(wheresRstudio(), Rproj.loc)
  message("Preparing to open project!")
  system(action, wait = FALSE, ignore.stderr = TRUE)
}
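One further thought, offered as a sketch rather than a tested replacement: pasting the executable and the project path into a single string breaks if the path contains spaces. Passing them separately to system2() with the path quoted avoids that. rstudio_call() is a hypothetical helper, and the project path below is made up for the demo.

```r
# Keep the command and its argument separate so paths with spaces stay intact.
rstudio_call <- function(rstudio, rproj) {
  list(
    command = rstudio,
    args = shQuote(normalizePath(rproj, mustWork = FALSE))
  )
}

# Demo with a made-up project path containing a space (nothing is launched)
cmd <- rstudio_call("rstudio", file.path(tempdir(), "My Project", "Bar.Rproj"))
print(cmd$args)
# system2(cmd$command, cmd$args, wait = FALSE)
```

As with system(), wait = FALSE is what returns the R prompt immediately.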
