Issue scraping page with "Load more" button with rvest

I want to obtain the links to the ATMs listed on this page: https://coinatmradar.com/city/345/bitcoin-atm-birmingham-uk/
Would I need to do something about the 'Load more' button at the bottom of the page?
I have been using the selector tool you can download for Chrome to select the CSS path.
I've written the code block below, but it only seems to retrieve the first ten links.
library(rvest)
base <- "https://coinatmradar.com/city/345/bitcoin-atm-birmingham-uk/"
base_read <- read_html(base)
atm_urls <- html_nodes(base_read, ".place > a")
all_urls_final <- html_attr(atm_urls, "href" )
print(all_urls_final)
I expected to be able to retrieve all links to the ATMs listed in the area, but my R code has not done so.
Any help would be great. Sorry if this is a really simple question.

You should give RSelenium a try. I'm able to get the links with the following code:
# install.packages("RSelenium")
library(RSelenium)
library(rvest)
# Download binaries, start driver, and get client object.
rd <- rsDriver(browser = "firefox", port = 4444L)
ffd <- rd$client
# Navigate to page.
ffd$navigate("https://coinatmradar.com/city/345/bitcoin-atm-birmingham-uk/")
# Find the load button and assign, then send click event.
load_btn <- ffd$findElement(using = "css selector", ".load-more .btn")
load_btn$clickElement()
# Wait for elements to load.
Sys.sleep(2)
# Get HTML data and parse
html_data <- ffd$getPageSource()[[1]]
html_data %>%
  read_html() %>%
  html_nodes(".place a:not(.operator-link)") %>%
  html_attr("href")
#### OUTPUT ####
# [1] "/bitcoin_atm/5969/bitcoin-atm-shitcoins-club-birmingham-uk-bitcoin-embassy/"
# [2] "/bitcoin_atm/7105/bitcoin-atm-general-bytes-northampton-costcutter/"
# [3] "/bitcoin_atm/4759/bitcoin-atm-general-bytes-birmingham-uk-costcutter/"
# [4] "/bitcoin_atm/2533/bitcoin-atm-general-bytes-birmingham-uk-londis-convenience/"
# [5] "/bitcoin_atm/5458/bitcoin-atm-general-bytes-coventry-agg-african-restaurant/"
# [6] "/bitcoin_atm/711/bitcoin-atm-general-bytes-coventry-bigs-barbers/"
# [7] "/bitcoin_atm/5830/bitcoin-atm-general-bytes-telford-bpred-lion-service-station/"
# [8] "/bitcoin_atm/5466/bitcoin-atm-general-bytes-nottingham-24-express-off-licence/"
# [9] "/bitcoin_atm/4615/bitcoin-atm-general-bytes-northampton-costcutter/"
# [10] "/bitcoin_atm/4841/bitcoin-atm-lamassu-worcester-computer-house/"
# [11] "/bitcoin_atm/3150/bitcoin-atm-bitxatm-leicester-keshs-wines-and-newsagents-braustone/"
# [12] "/bitcoin_atm/2948/bitcoin-atm-bitxatm-coventry-nisa-local/"
# [13] "/bitcoin_atm/4742/bitcoin-atm-bitxatm-birmingham-uk-custcutter-coventry-road-hay-mills/"
# [14] "/bitcoin_atm/4741/bitcoin-atm-bitxatm-derby-michaels-drink-store-alvaston/"
# [15] "/bitcoin_atm/4740/bitcoin-atm-bitxatm-birmingham-uk-nisa-local-crabtree-hockley/"
# [16] "/bitcoin_atm/4739/bitcoin-atm-bitxatm-birmingham-uk-nisa-local-subway-boldmere/"
# [17] "/bitcoin_atm/4738/bitcoin-atm-bitxatm-birmingham-uk-ashtree-convenience-store/"
# [18] "/bitcoin_atm/4737/bitcoin-atm-bitxatm-birmingham-uk-nisa-local-finnemore-road-bordesley-green/"
# [19] "/bitcoin_atm/3160/bitcoin-atm-bitxatm-birmingham-uk-costcutter/"
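One small addition (not part of the original answer): when you are done scraping, it is worth closing the browser and stopping the Selenium server so the port is freed for the next run.
# Clean up: close the browser session and stop the Selenium server.
ffd$close()
rd$server$stop()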

When you click "Load more", the page makes an XHR POST request for more results, using an offset of 10 from the current set (suggesting results come in batches of 10). You can mimic this as long as you include the following params in the POST body (I suspect only the bottom three are essential):
'direction' : 1
'sort' : 1
'offset' : 10
'pagetype' : 'city'
'pageid' : 345
And the following request header is required (at least in Python implementations):
'X-Requested-With' : 'XMLHttpRequest'
Send that correctly and you will get a response containing the additional content. Note: the content is wrapped in <![CDATA[ ]]> as an instruction that it should not be interpreted as XML, so you will need to extract the content inside the wrapper before parsing it.
The total number of ATMs is shown on the original page you already have, under the CSS selector
.atm-number
You can split that text, take the upper-bound value from the split, and convert it to an integer. From that total you can calculate each offset required (used in a loop as consecutive offset params until the total is reached), e.g. 19 results means 2 requests in total, with 1 request at offset 10 for the additional content.
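For reference, here is a minimal sketch of that request using httr. The endpoint URL is a placeholder (copy the actual XHR URL from the browser's network tab), and the selectors reuse the ones from the RSelenium answer above.
library(httr)
library(rvest)
# Placeholder: replace with the XHR endpoint observed in the browser's network tab.
xhr_url <- "https://coinatmradar.com/..."
resp <- POST(
  xhr_url,
  body = list(
    direction = 1,
    sort      = 1,
    offset    = 10,        # next batch of 10 results
    pagetype  = "city",
    pageid    = 345
  ),
  encode = "form",
  add_headers(`X-Requested-With` = "XMLHttpRequest")
)
# The fragment comes back wrapped in <![CDATA[ ... ]]>; strip that before parsing.
txt  <- content(resp, as = "text", encoding = "UTF-8")
frag <- gsub("^.*<!\\[CDATA\\[|\\]\\]>.*$", "", txt)
read_html(frag) %>%
  html_nodes(".place a:not(.operator-link)") %>%
  html_attr("href")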


Looking for recommendations on scraping specific data from this unstructured, paginated HTML website

As the title describes, I am trying to extract data from a website. Specifically, I'm trying to extract host susceptibility and host insusceptibility data from each of the species pages found here.
These data can be found on individual species-specific pages, for example for Abelia latent tymovirus at its respective URL.
I am struggling to extract these data as the HTML seems to be very unstructured. For example, host susceptibility/insusceptibility always appears in an h4 node, but alongside other varying headers and list items.
This is my first go at web scraping and I have been trying with the Chrome plugin Web Scraper, which seems very intuitive and flexible. I have been able to get the scraper to visit the multiple pages, but I can't seem to direct it to specifically collect the susceptibility/insusceptibility data. I attempted using SelectorGadget to identify exactly what my selector should be, but the lack of structure in the HTML made this ineffective.
Any advice on how I can change my plan of attack for this?
I am also open to trying to extract the data using R's rvest package. I have so far been able to read the HTML from a specific page, extract the h4 and li elements, and clean up the line breaks. Reproducible code:
library(rvest)
library(stringr)
pvo <- read_html("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")
pvo %>%
  html_elements("h4, li") %>%
  html_text() %>%
  str_replace_all("[\n]", "")
Which seems to provide me with what I want plus extraneous data:
...
[20] "Susceptible host species "
[21] "Chenopodium amaranticolor"
[22] "Chenopodium quinoa"
[23] "Cucumis sativus"
[24] "Cucurbita pepo"
[25] "Cynara scolymus"
[26] "Gomphrena globosa"
[27] "Nicotiana benthamiana"
[28] "Nicotiana clevelandii"
[29] "Nicotiana glutinosa"
[30] "Ocimum basilicum"
[31] "Vigna unguiculata"
[32] "Insusceptible host species"
[33] "Nicotiana rustica"
[34] "Nicotiana tabacum"
[35] "Phaseolus vulgaris "
...
From here, I am unfamiliar with how to specifically select/filter the desired information from the string. I have tried some stringr, gsub, and rm_between filter functions, but all attempts have been unsuccessful. I wouldn't know where to start to make this code visit the many species pages on the online database, or how to instruct it to save the aggregate data. What a road I have ahead of me!
Here is one trick.
You can get the indices of
'Susceptible host species'
'Insusceptible host species'
'Families containing susceptible hosts'
Everything between 1 and 2 is a susceptible species, and everything between 2 and 3 is an insusceptible species.
library(rvest)
library(stringr)
pvo <- read_html("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")
# Grab the h4 headers and li items in document order
all_values <- pvo %>% html_elements("h4, li") %>% html_text()
# The h4 text carries stray whitespace, so locate the headers with grep() rather than ==
sus_index <- grep('Susceptible host species', all_values, fixed = TRUE)
insus_index <- grep('Insusceptible host species', all_values, fixed = TRUE)
family_sus_index <- grep('Families containing susceptible hosts', all_values, fixed = TRUE)
# Everything between the first two headers is a susceptible host species
susceptible_species <- all_values[(sus_index+1):(insus_index-1)]
susceptible_species
# [1] "Chenopodium amaranticolor" "Chenopodium quinoa" "Cucumis sativus"
# [4] "Cucurbita pepo" "Cynara scolymus" "Gomphrena globosa"
# [7] "Nicotiana benthamiana" "Nicotiana clevelandii" "Nicotiana glutinosa"
#[10] "Ocimum basilicum" "Vigna unguiculata"
insusceptible_species <- all_values[(insus_index+1):(family_sus_index-1)]
insusceptible_species
#[1] "Nicotiana rustica" "Nicotiana tabacum" "Phaseolus vulgaris "
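To extend this to many species pages, you can wrap the same index trick in a function and apply it over a vector of page URLs. This is only a sketch: scrape_hosts is a hypothetical helper, and species_urls would have to be built by scraping the links on the index page mentioned in the question.
library(rvest)
# Hypothetical helper: extract susceptible/insusceptible hosts from one species page.
scrape_hosts <- function(url) {
  vals <- read_html(url) %>% html_elements("h4, li") %>% html_text(trim = TRUE)
  sus    <- grep("Susceptible host species",              vals, fixed = TRUE)[1]
  insus  <- grep("Insusceptible host species",            vals, fixed = TRUE)[1]
  famsus <- grep("Families containing susceptible hosts", vals, fixed = TRUE)[1]
  # Pages missing one of the three headers would need extra handling.
  data.frame(
    url     = url,
    species = vals[c((sus + 1):(insus - 1), (insus + 1):(famsus - 1))],
    status  = rep(c("susceptible", "insusceptible"),
                  c(insus - sus - 1, famsus - insus - 1)),
    stringsAsFactors = FALSE
  )
}
# species_urls would come from scraping the index page linked in the question.
species_urls <- c("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")
all_hosts <- do.call(rbind, lapply(species_urls, scrape_hosts))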

R: readLines on a URL leads to missing lines

When I run readLines() on a URL, I get missing lines or values. This might be due to spacing that the computer can't read.
When you open the URL in a browser, Ctrl + F finds 38 instances of text matching "TV-". On the other hand, when I run readLines() and grep("TV-", HTML), I only find 12.
So, how can I avoid encoding/spacing errors so that I can get complete lines of the HTML?
You can use rvest to scrape the data. For example, to get all the titles you can do :
library(rvest)
url <- 'https://www.imdb.com/search/title/?locations=Vancouver,%20British%20Columbia,%20Canada&start=1.json'
url %>%
  read_html() %>%
  html_nodes('div.lister-item-content h3 a') %>%
  html_text() -> all_titles
all_titles
# [1] "The Haunting of Bly Manor" "The Haunting of Hill House"
# [3] "Supernatural" "Helstrom"
# [5] "The 100" "Lucifer"
# [7] "Criminal Minds" "Fear the Walking Dead"
# [9] "A Babysitter's Guide to Monster Hunting" "The Stand"
#...
#...
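To check that read_html() really does retrieve the complete page (unlike the readLines() call in the question), you can count the "TV-" matches in the parsed text; the exact number depends on the live page, but it should line up with what Ctrl + F reports in the browser.
library(stringr)
url %>%
  read_html() %>%
  html_text() %>%
  str_count("TV-")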

html_nodes returning two results for a link

I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note: the use of download.file() is to get around my company's firewall, per this answer.
library(dplyr)
library(rvest)
myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")
links <- content %>%
  html_nodes("a") %>% # Note that I don't know the significance of "a"; this was trial and error
  html_attr("href") %>%
  data.frame()
# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the dataframe
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath vs CSS selectors and picks the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .) %>%
  head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
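If you would rather stick with CSS selectors, an attribute-contains selector does the same filtering as the XPath above. A sketch (the hrefs already start with a slash, so the prefix here omits the trailing one):
html_nodes(content, css = "a[href*='downfile']") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu%s", .) %>%
  head()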

Download all files on a webpage with R?

My question is almost the same as the one here. I want to download all files from this page. But the difference is that I do not have the same pattern that would let me download all the files.
Any idea how to do the download in R?
# use the FTP mirror link provided on the page
mirror <- "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/"
# read the file listing
pg <- readLines(mirror)
# take a look
head(pg)
## [1] "06-18-09 06:18AM 713075 srtm_01_02.zip"
## [2] "06-18-09 06:18AM 130923 srtm_01_07.zip"
## [3] "06-18-09 06:18AM 130196 srtm_01_12.zip"
## [4] "06-18-09 06:18AM 156642 srtm_01_15.zip"
## [5] "06-18-09 06:18AM 317244 srtm_01_16.zip"
## [6] "06-18-09 06:18AM 160847 srtm_01_17.zip"
# clean it up and make them URLs
fils <- sprintf("%s%s", mirror, sub("^.*srtm", "srtm", pg))
head(fils)
## [1] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_02.zip"
## [2] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_07.zip"
## [3] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_12.zip"
## [4] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_15.zip"
## [5] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_16.zip"
## [6] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_17.zip"
# test download
download.file(fils[1], basename(fils[1]))
# validate it worked before slamming the server (your job)
# do the rest whilst being kind to the mirror server
for (f in fils[-1]) {
  download.file(f, basename(f))
  Sys.sleep(5) # unless you have entitlement issues, space out the downloads by a few seconds
}
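One caveat the answer does not mention: on Windows, download.file() needs a binary transfer for zip archives or the files can be corrupted. Recent R versions pick binary mode automatically for common archive extensions, but it costs nothing to be explicit:
# Request a binary-safe transfer for the zip archives.
download.file(fils[1], basename(fils[1]), mode = "wb")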
If you don't mind using a non-base package, curl can help you just get the file names vs doing the sub above:
unlist(strsplit(rawToChar(curl::curl_fetch_memory(mirror, curl::new_handle(dirlistonly=TRUE))$content), "\n"))
This is not the most elegant solution, but it appears to be working when I try it on random subsets of helplinks.
library(rvest)
#Grab filenames from separate URL
helplinks <- read_html("http://rdf.muninn-project.org/api/elevation/datasets/srtm/") %>%
  html_nodes("a") %>%
  html_text(trim = T)
#Keep only filenames relevant for download
helplinks <- helplinks[grepl("srtm", helplinks)]
#Download files - make sure to adjust the `destfile` argument of the download.file function.
lapply(helplinks, function(x) download.file(sprintf("http://srtm.csi.cgiar.org/SRT-ZIP/SRTM_V41/SRTM_Data_GeoTiff/%s", x), sprintf("C:/Users/aud/Desktop/%s", x)))

R: Extract words from a website

I am attempting to extract all words that start with a particular phrase from a website. The website I am using is:
http://docs.ggplot2.org/current/
I want to extract all the words that start with "stat_". I should get 21 names like "stat_identity" in return. I have the following code:
stats <- readLines("http://docs.ggplot2.org/current/")
head(stats)
grep("stat_{1[a-z]", stats, value=TRUE)
I am returned every line containing the phrase "stat_". I just want to extract the "stat_" words. So I tried something else:
gsub("\b^stat_[a-z]+ ", "", stats)
I think the output I got was an empty string, " ", where a "stat_" phrase would be? So now I'm trying to think of ways to extract all the text and set everything that is not a "stat_" phrase to empty strings. Does anyone have any ideas on how to get my desired output?
rvest & stringr to the rescue:
library(xml2)
library(rvest)
library(stringr)
pg <- read_html("http://docs.ggplot2.org/current/")
unique(str_match_all(html_text(html_nodes(pg, "body")),
                     "(stat_[[:alnum:]_]+)")[[1]][,2])
## [1] "stat_bin" "stat_bin2dCount"
## [3] "stat_bindot" "stat_binhexBin"
## [5] "stat_boxplot" "stat_contour"
## [7] "stat_density" "stat_density2d"
## [9] "stat_ecdf" "stat_functionSuperimpose"
## [11] "stat_identity" "stat_qqCalculation"
## [13] "stat_quantile" "stat_smooth"
## [15] "stat_spokeConvert" "stat_sum"
## [17] "stat_summarySummarise" "stat_summary_hexApply"
## [19] "stat_summary2dApply" "stat_uniqueRemove"
## [21] "stat_ydensity" "stat_defaults"
Unless you need the links (then you can use other rvest functions), this removes all the markup for you and just gives you the text of the website.
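For instance, if you do want the links as well, here is a sketch that pulls them from the anchor tags instead (assuming the docs index links each stat_ page from an <a> tag; the page layout may have changed since):
links <- html_nodes(pg, "a")
stat_links <- data.frame(
  name = html_text(links),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
subset(stat_links, str_detect(name, "^stat_"))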
