I am trying to collect some area names from a website. To do so, I want to click the drop-down box and expand the downward-pointing arrows.
For example, on the following page, if I click on the "distritos" drop-down, I can see further nested drop-downs:
https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l
For Ciutat Vella, I see 4 additional items: Barri Gòtic, El Raval, La Barceloneta and Sant Pare, Sta...
I would like to collect these names as well. This is the code I have so far:
library(RSelenium)
library(rvest)
library(tidyverse)
# 1.a) Open URL, click on provincias
rD <- rsDriver(browser="firefox", port=4536L)
remDr <- rD[["client"]]
url2 = "https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l"
remDr$navigate(url2)
remDr$maxWindowSize()
# accept cookies
remDr$findElement(using = "xpath",'/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
#click on Distrito
remDr$findElement(using = "xpath", '/html/body/div[1]/div[2]/div[1]/div[3]/div/div[1]/div')$clickElement()
html_distrito_full_page = remDr$getPageSource()[[1]] %>%
read_html()
Distritos_Names = html_distrito_full_page %>%
html_nodes('.re-GeographicSearchNext-checkboxItem') %>%
html_nodes('.re-GeographicSearchNext-checkboxItem-literal') %>%
html_text()
Distritos_Names
Which gives:
[1] "Ciutat Vella" "Eixample" "Gràcia" "Horta - Guinardó" "Les Corts" "Nou Barris" "Sant Andreu" "Sant Martí"
[9] "Sants - Montjuïc" "Sarrià - Sant Gervasi"
However, this is missing the names of the regions inside the nested drop-downs.
How can I collect these as well? That is, use RSelenium to navigate to the page and expand all the downward-facing arrows, then use rvest to scrape the whole page once they have been expanded.
You could just use rvest to get the mappings by extracting the JavaScript variable that holds them (along with some other data). Use jsonlite to deserialize the extracted string into a nested list, then apply a custom function to extract the actual mappings for each dropdown. Wrap that function in a map_dfr() call to get a single combined data frame of all dropdown mappings.
TODO: Review the JSON to see whether the magic number 4 can be removed and the correct item in the parent list determined dynamically.
library(tidyverse)
library(rvest)
library(jsonlite)
# Build one tibble per location: its name plus the names of its sub-locations
extract_data <- function(x) {
  tibble(
    location = x$literal,
    sub_location = map(x$subLocations, "literal")
  )
}
p <- read_html("https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l") %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)"')[, 2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
location_data <- data$initialSearch$result$geographicSearch[4]
df <- map_dfr(location_data, extract_data)
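To see what the extract_data() and map_dfr() step does without hitting the site, here is a minimal sketch on a hand-built list. The shape of `districts` is an assumption that mimics the literal and subLocations fields used above, and the values are toy data, not the parsed page. It uses map_chr() instead of map(), so each sub-location gets its own row rather than sitting in a list column.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for the parsed JSON (toy values, assumed structure)
districts <- list(
  list(
    literal = "Ciutat Vella",
    subLocations = list(list(literal = "El Raval"), list(literal = "La Barceloneta"))
  ),
  list(
    literal = "Eixample",
    subLocations = list(list(literal = "Sant Antoni"))
  )
)

# One row per (location, sub_location) pair; the length-1 location is recycled
extract_data <- function(x) {
  tibble(
    location = x$literal,
    sub_location = map_chr(x$subLocations, "literal")
  )
}

# Row-bind the per-district tibbles into one data frame
toy_df <- map_dfr(districts, extract_data)
toy_df
```

If the real JSON nests the districts differently, only the argument passed to map_dfr() should need to change.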
Related
I am trying to get the store address of apple stores for multiple countries using Rselenium.
library(RSelenium)
library(tidyverse)
library(netstat)
# start the server
rs_driver_object <- rsDriver(browser = "chrome",
chromever = "100.0.4896.60",
verbose = F,
port = free_port())
# create a client object
remDr <- rs_driver_object$client
# maximise window size
remDr$maxWindowSize()
# navigate to the website
remDr$navigate("https://www.apple.com/uk/retail/storelist/")
# click on search bar
search_box <- remDr$findElement(using = "id", "dropdown")
country_name <- "United States" # for a single country. I can loop over multiple countries
# in the search box, pass on the country name and hit enter
search_box$sendKeysToElement(list(country_name, key = "enter"))
search_box$clickElement() # I am not sure if I need to click but I am doing anyway
The page now shows the location of each store. Each store has a hyperlink to its own page, which contains the full address I want to extract.
However, I am stuck on how to click on each individual store link in the last step.
I thought I could get the href for all the stores on the page:
store_address <- remDr$findElement(using = 'class', 'store-address')
store_address$getElementAttribute('href')
But it returns an empty list. Where do I go from here?
After loading the page with the list of stores, we can extract the links from the page source:
link = remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.state') %>% html_nodes('a') %>% html_attr('href') %>% paste0('https://www.apple.com', .)
[1] "https://www.apple.com/retail/thesummit/" "https://www.apple.com/retail/bridgestreet/"
[3] "https://www.apple.com/retail/anchorage5thavenuemall/" "https://www.apple.com/retail/chandlerfashioncenter/"
[5] "https://www.apple.com/retail/santanvillage/" "https://www.apple.com/retail/arrowhead/"
I'm trying to scrape just one value from this site, but I can't get it.
Here is my code:
library(RSelenium)
rD <- rsDriver(browser="chrome",port=0999L,verbose = F,chromever = "95.0.4638.54")
remDr <- rD[["client"]]
remDr$navigate("https://www.dailyfx.com/eur-usd")
html <- remDr$getPageSource()[[1]]
library(rvest)
page <- read_html(html)
nodes <- html_nodes(page, css = ".mt-2.text-black")
html_text(nodes)
My result is
html_text(nodes)
[1] "\n\nEUR/USD\nMixed\n\n\n\n\n\n\n\n\n\nNet Long\n\n\n\nNet Short\n\n\n\n\n\nDaily change in\n\n\n\nLongs\n5%\n\n\nShorts\n1%\n\n\nOI\n4%\n\n\n\n\n\nWeekly change in\n\n\n\nLongs\n13%\n\n\nShorts\n23%\n\n\nOI\n17%\n\n\n\n"
What do I need to do to get the value of Net Long?
I would use a more targeted css selector list, to target just the node of interest. Then extract the data-value attribute value from the single matched node to get the percentage:
webElem <- remDr$findElement(using = 'css selector', '.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]')
var <- webElem$getElementAttribute("data-value")[[1]]
Or,
page %>% html_element('.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]') %>% html_attr('data-value')
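The same selector can be tried offline. This is a minimal sketch that assumes the markup looks roughly like the snippet below: the class and attribute names come from the selector above, but the surrounding HTML and the numbers 62 and 38 are made up for illustration.

```r
library(rvest)

# Hypothetical fragment shaped like the sentiment card (values are invented)
page <- minimal_html('
  <div class="dfx-technicalSentimentCard__netLongContainer">
    <span data-type="long-value-info" data-value="62">Net Long</span>
  </div>
  <div class="dfx-technicalSentimentCard__netShortContainer">
    <span data-type="short-value-info" data-value="38">Net Short</span>
  </div>
')

# Descendant selector: the element with data-type="long-value-info"
# anywhere inside the net-long container; then read its data-value attribute
net_long <- page %>%
  html_element('.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]') %>%
  html_attr("data-value")

net_long
```

Reading the attribute instead of the text avoids the whitespace-heavy output that html_text() produced in the question.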
I'm very new to R, but trying to create a webscraper to help inspire my team.
The site I want to scrape is https://www.parl.ca/legisinfo/Home.aspx?Page=1 . If you visit that site, it redirects you to a page where the 43rd session is selected by default, but I want to include all sessions. If you deselect the session, the site sends you to the URL above.
My problem is finding the last page. By clicking >> at the bottom of the page, I know there are 328 pages in total. I want the code to remain relevant, so I want it to find the last page number so I can build a list. I don't know how to do that when the number for the "last page" isn't actually in the page text. The code below returns ">>" instead of "328".
url <- 'https://www.parl.ca/legisinfo/Home.aspx?Page=1'
url %>%
read_html()%>%
html_nodes('.resultPageCurrent') %>%
html_text() %>%
tail(1L)
When I look at the source code I do find the magic 328 number, I just don't know how to point to it. Thank you for your help!
An interesting problem. I decided to see if there was an API. There are a number of APIs. See: https://openparliament.ca/api/
Without spending much time reading the documentation, I opted for the bills API endpoint. I couldn't easily see a request that returned the total bill count, or all bills in one go. I did find I could call one of the endpoints in a loop, using a query string to return 500 bills at a time (the most I could get in one response) and using offset to get the next batch. I used a while loop (gasp), as I was unsure how else to stop making requests once next_url, in the response JSON, was NULL.
library(jsonlite)
library(data.table)
library(dplyr)
n <- 0
result <- list()
while (TRUE) {
url <- sprintf("https://api.openparliament.ca/bills/?limit=500&offset=%i&format=json", n)
data <- jsonlite::read_json(url, simplifyVector = T)
next_url <- data$pagination$next_url
df <- data$objects %>% mutate(name_en = name$en, name_fr = name$fr) %>% select(-c('name'))
result[[(n/500)+1]] <- df
if (is.null(next_url)) {
break
} else {
n <- n + 500
}
}
df <- rbindlist(result)
df %>% arrange(desc(introduced)) %>% head()
df contains the data you see on the page for all bills listed via API. Currently 6,620 rows.
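The stopping rule in that loop can be checked without network access. Below is a rough sketch in which fetch_page() is a hypothetical stand-in for the API call: it serves 1,200 fake records and returns a NULL next_url on the final page, which is the behaviour the loop above relies on.

```r
# Hypothetical stand-in for the API: serves `total` fake records in pages
fetch_page <- function(offset, limit = 500, total = 1200) {
  ids <- seq_len(total)
  idx <- ids[ids > offset & ids <= offset + limit]
  has_more <- offset + limit < total
  list(
    objects = data.frame(id = idx),
    pagination = list(
      next_url = if (has_more) sprintf("?offset=%d", offset + limit) else NULL
    )
  )
}

offset <- 0
pages <- list()
repeat {
  page <- fetch_page(offset)
  pages[[length(pages) + 1]] <- page$objects
  # Stop as soon as the response no longer advertises a next page
  if (is.null(page$pagination$next_url)) break
  offset <- offset + 500
}

# Combine all pages into one data frame
all_rows <- do.call(rbind, pages)
nrow(all_rows)
```

Appending with length(pages) + 1 avoids computing the list index from the offset, so the page size can change without touching the bookkeeping.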
Py:
import requests
import pandas as pd
n = 0
results = []
def get_dict_list(value: list)->list:
    output = []
    for i in value:
        d = {k:v for k,v in i.items() if k != 'name'}
        d['name_en'] = i['name']['en']
        d['name_fr'] = i['name']['fr']
        output.append(d)
    return output
with requests.Session() as s:
    while True:
        url = f'https://api.openparliament.ca/bills/?limit=500&offset={n}&%5D=&format=json'
        r = s.get(url)
        data = r.json()
        next_url = data['pagination']['next_url']
        results.extend(get_dict_list(data['objects']))
        if next_url is None:
            break
        n+=500
df = pd.DataFrame(results)
I decided to try using RSelenium to navigate to the last page, and grab the last page number. This did the trick, but was not very elegant or efficient. I allow myself this as someone who is new to R and webscraping.
#Step 1 - call packages
library(tidyverse)
library(rvest)
library(RSelenium)
#Step 2 - Identify the main webpage
url <- 'https://www.parl.ca/legisinfo/Home.aspx'
#Step 3 - Open browser
rD <- rsDriver(port = 4546L, version = "latest", browser=c("firefox"))
remDr <- rD[["client"]]
#Step 4 - Navigate to page
remDr$navigate(url)
#Step 5 - Select filters
webElem <- remDr$findElement(using = 'css selector',".filterTopRemoveImage")
webElem$clickElement() #Click to remove default filter so that all Parliamentary sessions are included
webElem <- remDr$findElement(using = 'css selector',".filterRefiner~ div:nth-child(5) .filterRefinement")
webElem$clickElement() #Click to select only House of Commons Bills
#Step 6 - Get last page by navigating >>
webElem <- remDr$findElement(using = 'link text',">>")
webElem$clickElement()
html <- remDr$getPageSource()[[1]] #Get everything on the page
html <-read_html(html) #Parse HTML
lastpage <- html %>% html_nodes(".resultPageCurrent:nth-child(12)") %>% html_text()
lastpage
I'm trying to scrape links to all the minutes and agendas provided on this website: https://www.charleston-sc.gov/AgendaCenter/
I've managed to scrape the section IDs associated with each category (and the years for each category) so that I can loop through the contents within each category-year (please see below). But I don't know how to scrape the hrefs that live inside the contents. In particular, because the links to the agendas live inside the drop-down menu under 'download', it seems I need extra clicks to scrape them.
How do I scrape the minutes and agenda links (inside the download drop-down) for each table I select? Ideally, I would like a table with the date, the title of the agenda, links to the minutes, and links to the agenda.
I'm using RSelenium for this. Please see the code I have so far below, which lets me click through each category and year, but not much else. Please help!
rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]
df <- data.frame(cbind(co, yr)) %>%
mutate_all(as.character) %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
tidyr::fill(c(co,id), .direction='down')%>% drop_na(co)
remDr <- remoteDriver(port=4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)
for (j in unique(df$id)){
  # expand the category heading
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat',j,'"]/h2'))$clickElement()
  for (k in unique(df[which(df$id==j),'yr'])){
    # select the year within that category
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="',k,'"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:
library(stringr) # provides str_extract_all()
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile!='']
library(data.table) # I use data.table because it's more convenient - but can be done without too
dt.viewfile <- data.table(origStr=viewfile)
# list the elements and patterns we will be looking for:
searchfor <- list(
Title='name=[^ ]+ title=\"(.+)\" href',
Date='<strong>(.+)</strong>',
href='href=\"([^\"]+)\"',
label= 'aria-label=\"([^\"]+)\"'
)
for (this.i in names(searchfor)){
this.full <- paste0('.*',searchfor[[this.i]],'.*');
dt.viewfile[grepl(this.full, origStr), (this.i):=gsub(this.full,'\\1',origStr)]
}
# Clean records:
dt.viewfile[, `:=`(Title=na.omit(Title),Date=na.omit(Date),label=na.omit(label)),
by=href]
dt.viewfile[,Date:=gsub('<abbr title=".*">(.*)</abbr>','\\1',Date)]
dt.viewfile <- unique(dt.viewfile[,.(Title,Date,href,label)]); # 690 records
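The core trick in the loop above is wrapping each pattern in '.*' on both sides, so that gsub() replaces the whole line with just the captured group. Here is a minimal base R sketch of that idea. The two input lines are made up in the style of the page source, and extract_first_group() is a hypothetical helper name, not part of the answer's code.

```r
# Made-up lines in the style of the page source (not copied from the site)
lines <- c(
  '<a title="Agenda" href="/AgendaCenter/ViewFile/Agenda/_01012020-1" aria-label="Agenda for January 1">',
  '<td><strong>Jan 1, 2020</strong></td>'
)

patterns <- list(
  href = 'href="([^"]+)"',
  Date = '<strong>(.+)</strong>'
)

# Return the first capture group where the pattern matches, NA elsewhere
extract_first_group <- function(x, pattern) {
  full <- paste0('.*', pattern, '.*')
  ifelse(grepl(full, x), gsub(full, '\\1', x), NA_character_)
}

# One column per pattern, one row per input line
parsed <- as.data.frame(
  lapply(patterns, function(p) extract_first_group(lines, p)),
  stringsAsFactors = FALSE
)
parsed
```

Lines that do not match a pattern stay NA in that column, which is why the answer then fills and de-duplicates the records by href.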
What you have as the result is a table with the links to all downloadable files. You can now download them using any tool you like, for example using download.file() or GET():
library(httr) # provides GET(), write_disk() and headers()
dt.viewfile[, full.url:=paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename:=fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]
for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
url <- dt.viewfile[i,full.url]
destfile <- dt.viewfile[i,filename]
cat('\nDownloading',url, ' to ', destfile)
fil <- GET(url, write_disk(destfile))
# our destination file doesn't have extension, we need to get it from the server:
serverFilename <- gsub("inline;filename=(.*)",'\\1',headers(fil)$`content-disposition`)
serverExtension <- tools::file_ext(serverFilename)
# Adding the extension to the file we just saved
file.rename(destfile,paste0(destfile,'.',serverExtension))
}
Now the only problem we have is that the original webpage was only showing records for the last 3 years. But instead of clicking View More through RSelenium, we can simply load the page with earlier dates, something like this:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')
then repeat the rest of the code as necessary.
I'm scraping the data from IMDB movie list.
I would like to scrape link for each movie, but not able to correctly identify where it is stored on the page.
This is how the part of the link is stored:
[Screenshot: the link as stored in the page source]
What I have tried:
link<-html_nodes(strona_int, '.lister-item-header+ a href')
link<-html_text(link)
Whole code
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open web browser (in my case Firefox, but can be Chrome or Internet Explorer)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#set the start number for page link
ile<-seq(from=1, by=250, length.out = 1)
#empty frame for data
filmy_df=data.frame()
#loop reading the data
for (j in ile){
#set the link for webpage
newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
startNumberURL<-paste0(newURL,j)
#open webpage
remDr$navigate(startNumberURL)
#read html code of the page
strona_int<-read_html(startNumberURL)
#rank section
rank_data<-html_nodes(strona_int,'.text-primary')
#convert rank to text
rank_data<-html_text(rank_data)
#convert to numeric
rank_data<-as.numeric(rank_data)
link<-html_nodes(strona_int, '.lister-item-header+ a href')
link<-html_text(link)
#release date
year<-html_nodes(strona_int,'.lister-item-year')
#convert to text
year<-html_text(year)
#remove non-numeric characters
year<-gsub("\\D","",year)
#set as factor
year<-as.factor(year)
#title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert title to text
title_data<-html_text(title_data)
#temporary data frame
filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year)
#temp df to target df
filmy_df<-rbind(filmy_df,filmy_df_temp)
}
#close browser
remDr$close()
#stop Selenium
rD[["server"]]$stop()
Expected solution:
A scraped link for each film, which could be used later if required.
Selenium is not required for gathering the links.
The links are a tags housed within a parent with class lister-item-header. You can match on those, then extract the href attribute. Because the href values are relative, you then need to prepend the protocol and domain, "https://www.imdb.com".
In the css selector:
.lister-item-header a
The dot is a class selector for the parent class; the space between is a descendant combinator; the final a is a type selector for the child a tags.
library(rvest)
library(magrittr)
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- read_html(url) %>% html_nodes(., ".lister-item-header a") %>% html_attr(., "href")
One way of adding protocol and domain:
library(rvest)
library(magrittr)
library(xml2)
base <- 'https://www.imdb.com'
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- url_absolute(read_html(url) %>% html_nodes(., ".lister-item-header a") %>% html_attr(., "href"), base)
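To check the selector and url_absolute() without requesting the live page, here is a small sketch on an inline fragment. The HTML is an assumption shaped to match the selector above, and the title IDs are placeholders rather than real films.

```r
library(rvest)
library(xml2)

# Hypothetical fragment: two header links plus one unrelated link
page <- minimal_html('
  <h3 class="lister-item-header"><a href="/title/tt0000001/">First film</a></h3>
  <h3 class="lister-item-header"><a href="/title/tt0000002/">Second film</a></h3>
  <p><a href="/help/">Not a film link</a></p>
')

# Only a tags that descend from .lister-item-header are matched
relative_links <- page %>%
  html_elements(".lister-item-header a") %>%
  html_attr("href")

# Resolve each relative href against the site root
absolute_links <- url_absolute(relative_links, "https://www.imdb.com")
absolute_links
```

The unrelated /help/ link is skipped because it has no ancestor with the lister-item-header class.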
Reference:
https://www.rdocumentation.org/packages/xml2/versions/1.2.0/topics/url_absolute
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors