I'm trying to scrape only one value from this site, but I cannot get it.
Here is my code:
library(RSelenium)
rD <- rsDriver(browser="chrome",port=0999L,verbose = F,chromever = "95.0.4638.54")
remDr <- rD[["client"]]
remDr$navigate("https://www.dailyfx.com/eur-usd")
html <- remDr$getPageSource()[[1]]
library(rvest)
page <- read_html(html)
nodes <- html_nodes(page, css = ".mt-2.text-black")
html_text(nodes)
My result is
html_text(nodes)
[1] "\n\nEUR/USD\nMixed\n\n\n\n\n\n\n\n\n\nNet Long\n\n\n\nNet Short\n\n\n\n\n\nDaily change in\n\n\n\nLongs\n5%\n\n\nShorts\n1%\n\n\nOI\n4%\n\n\n\n\n\nWeekly change in\n\n\n\nLongs\n13%\n\n\nShorts\n23%\n\n\nOI\n17%\n\n\n\n"
What do I need to do to get the value of Net Long?
I would use a more targeted css selector list to match just the node of interest, then extract the data-value attribute from the single matched node to get the percentage:
webElem <- remDr$findElement(using = 'css selector', '.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]')
var <- webElem$getElementAttribute("data-value")[[1]]
Or,
page %>% html_element('.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]') %>% html_attr('data-value')
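For completeness, the navigation and extraction can also be chained into one rvest pipeline and the attribute coerced to a number. A minimal sketch reusing the already-open remDr session from the question; it assumes data-value holds a bare figure such as "64", which I have not verified against the live page:
net_long <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_element('.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]') %>%
  html_attr("data-value") %>%
  as.numeric() # assumption: data-value is a plain number
net_long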
I am trying to collect some area names from a website, and in order to do so I want to click the drop-down box to expand the downward-pointing arrow.
i.e. on the following page, if I click on the "distritos" drop-down I can see further drop-down options:
https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l
For Ciutat Vella I see I have 4 additional items: Barri Gòtic, EL Raval, La Barceloneta and Sant Pare, Sta...
I would like to collect these names as well. I have the following code, which collects the district names:
library(RSelenium)
library(rvest)
library(tidyverse)
# 1.a) Open URL, click on provincias
rD <- rsDriver(browser="firefox", port=4536L)
remDr <- rD[["client"]]
url2 = "https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l"
remDr$navigate(url2)
remDr$maxWindowSize()
# accept cookies
remDr$findElement(using = "xpath",'/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
#click on Distrito
remDr$findElement(using = "xpath", '/html/body/div[1]/div[2]/div[1]/div[3]/div/div[1]/div')$clickElement()
html_distrito_full_page = remDr$getPageSource()[[1]] %>%
  read_html()
Distritos_Names = html_distrito_full_page %>%
  html_nodes('.re-GeographicSearchNext-checkboxItem') %>%
  html_nodes('.re-GeographicSearchNext-checkboxItem-literal') %>%
  html_text()
Distritos_Names
Which gives:
[1] "Ciutat Vella" "Eixample" "Gràcia" "Horta - Guinardó" "Les Corts" "Nou Barris" "Sant Andreu" "Sant Martí"
[9] "Sants - Montjuïc" "Sarrià - Sant Gervasi"
However, this is missing the names of the regions in the drop-down boxes.
How can I collect these drop-down names also? i.e. use RSelenium to navigate to the page, expand all of the "downwards facing arrows", then use rvest to scrape the whole page once they have been expanded.
You could just use rvest to get the mappings by extracting the JavaScript variable housing the mappings + some other data. Use jsonlite to deserialize the extracted string into a JSON object, then apply a custom function to extract the actual mappings for each dropdown. Wrap that function in a map_dfr() call to get a final combined dataframe of all dropdown mappings.
TODO: Review JSON to see if can remove magic number 4 and dynamically determine the correct item to retrieve from parent list.
library(tidyverse)
library(rvest)
library(jsonlite)
extract_data <- function(x) {
  tibble(
    location = x$literal,
    sub_location = map(x$subLocations, pluck, "literal")
  )
}
p <- read_html("https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l") %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)"')[, 2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
location_data <- data$initialSearch$result$geographicSearch[4]
df <- map_dfr(location_data, extract_data)
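One way to approach the TODO above (removing the magic number 4) might be to keep every entry of geographicSearch that actually carries sub-locations, rather than indexing a fixed position. This is only a sketch, assuming each entry is a list with literal and subLocations fields, as extract_data() already assumes; it may return more rows than the single-element version if other entries also have sub-locations:
# keep only the entries with a non-empty subLocations field (assumed structure)
location_data_all <- purrr::keep(
  data$initialSearch$result$geographicSearch,
  ~ length(.x$subLocations) > 0
)
df <- map_dfr(location_data_all, extract_data)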
I am web scraping information from this page: https://lsf.uni-heidelberg.de/qisserver/rds?state=change&type=6&moduleParameter=personalSelect&nextdir=change&next=SearchSelect.vm&target=personSearch&subdir=person&init=y&source=state%3Dchange%26type%3D5%26moduleParameter%3DpersonSearch%26nextdir%3Dchange%26next%3Dsearch.vm%26subdir%3Dperson%26menuid%3Dsearch%26_form%3Ddisplay%26topitem%3Dmembers%26subitem%3D%26field%3DNachname&targetfield=Nachname&_form=display
I would like to search for each individual to collect email addresses. I am doing the following, but I can't find a way to submit the search.
#url
uni<-"https://lsf.uni-heidelberg.de/qisserver/rds?state=change&type=6&moduleParameter=personalSelect&nextdir=change&next=SearchSelect.vm&target=personSearch&subdir=person&init=y&source=state%3Dchange%26type%3D5%26moduleParameter%3DpersonSearch%26nextdir%3Dchange%26next%3Dsearch.vm%26subdir%3Dperson%26menuid%3Dsearch%26_form%3Ddisplay%26topitem%3Dmembers%26subitem%3D%26field%3DNachname&targetfield=Nachname&_form=display"
#people's name
r<-read_html(uni)
name <- r %>%
html_nodes("a") %>%
html_text()
name<-name[40:length(name)]
name<-gsub("\n","",name ,fixed = T)
name<-gsub("\t","",name ,fixed = T)
#people's first link
link <- r %>%
html_nodes("a") %>%
html_attrs() %>%
as.character()
link<-link[40:length(link)]
link<-str_split(link, '"')
link<-sapply(link, "[", 6)
#create a loop: with R selenium, click on search for each link and get emails which are in the next page
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
#remDr$navigate("https://ki.se/en/research/professors-at-ki")
for (i in 1:length(link)) {
  #r <- read_html(link[i])
  remDr$navigate(link[i])
  webElem <- remDr$findElement(using = 'xpath', '//*[contains(concat( " ", @class, " " ), concat( " ", "abstand_search", " " ))]//font//input')
  webElem$clickElement()
  #here I get the error
}
Here are some pointers. I would go with css selectors, which are faster and more intuitive to read, to gather the links:
library(rvest)
links <- read_html('https://lsf.uni-heidelberg.de/qisserver/rds?state=change&type=6&moduleParameter=personalSelect&nextdir=change&next=SearchSelect.vm&target=personSearch&subdir=person&init=y&source=state%3Dchange%26type%3D5%26moduleParameter%3DpersonSearch%26nextdir%3Dchange%26next%3Dsearch.vm%26subdir%3Dperson%26menuid%3Dsearch%26_form%3Ddisplay%26topitem%3Dmembers%26subitem%3D%26field%3DNachname&targetfield=Nachname&_form=display') %>%
html_nodes('.regular[name]') %>%
html_attr('href')
Then, I would use the same strategy to target the search button:
webElem <- remDr$findElement(using = 'css selector', '.abstand_search + [value="Suche starten"]') # this matches for the element which is interactable
Finally, I would pick up the name and email from the destination page:
name <- remDr$findElement(using = 'css selector', '.regular')
email <- remDr$findElement(using = 'css selector', '[href*=mail]') # could also take 2nd match for .regular
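Once those elements are located, the values can be read with getElementText()/getElementAttribute(); a minimal sketch, not tested against the live pages:
name_text  <- name$getElementText()[[1]]
email_href <- email$getElementAttribute("href")[[1]]
email_addr <- sub("^mailto:", "", email_href)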
I got around it by using rvest in the following way in the loop:
#use RSelenium to download emails
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
emails<-list()
for (i in 1:length(links)) {
  #r<- read_html(link[i])
  remDr$navigate(links[i])
  webElem <- remDr$findElement(using = 'css selector', '.abstand_search + [value="Suche starten"]') # this matches for the element which is interactable
  webElem$clickElement()
  r <- read_html(unlist(webElem$getCurrentUrl()))
  mail <- r %>%
    html_nodes("a") %>%
    html_attrs() %>%
    as.character() %>%
    str_subset("mailto:") %>%
    str_remove("mailto:")
  if (length(mail) != 0) {
    a <- str_split(mail, "href")
    a <- unlist(a)
    w <- which(grepl("#", a, fixed = T))
    emails <- c(emails, a[w])
  } else {
    emails <- c(emails, NA)
  }
  rm(mail)
}
Not the most elegant code, but it works. Getting the names is more complex and I cannot find a way to get the right css or xpath. Let me know if you can think of more elegant and faster code, or if the issue can only be solved in a brute-force way.
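One slightly tidier variant of the mail extraction, keeping the same loop structure (a sketch, not tested against the live site): read the href attributes directly with html_attr() and filter on the mailto: prefix, which avoids the html_attrs()/as.character()/str_split() round trip:
r <- read_html(unlist(webElem$getCurrentUrl()))
hrefs <- r %>% html_nodes("a") %>% html_attr("href")
mail <- hrefs[!is.na(hrefs)] %>% str_subset("^mailto:") %>% str_remove("^mailto:")
emails <- c(emails, if (length(mail) > 0) mail else NA)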
I'm trying to scrape links to all minutes and agenda provided in this website: https://www.charleston-sc.gov/AgendaCenter/
I've managed to scrape the section IDs associated with each category (and the years for each category) to loop through the contents within each category-year (please see below). But I don't know how to scrape the hrefs that live inside the contents. Especially because the links to the agenda live inside the drop-down menu under 'Download', it seems like I need to go through extra clicks to scrape the hrefs.
How do I scrape the minutes and agenda (inside the download dropdown) for each table I select? Ideally, I would like a table with the date, title of the agenda, links to minutes, and links to agenda.
I'm using RSelenium for this. Please see the code I have so far below, which allows me to click through each category and year, but not much else. Please help!
rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]
df <- data.frame(cbind(co, yr)) %>%
mutate_all(as.character) %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
tidyr::fill(c(co,id), .direction='down')%>% drop_na(co)
remDr <- remoteDriver(port=4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)
for (j in unique(df$id)){
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat', j, '"]/h2'))$clickElement()
  for (k in unique(df[which(df$id == j), 'yr'])){
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="', k, '"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile!='']
library(data.table) # I use data.table because it's more convenient - but it can be done without it too
dt.viewfile <- data.table(origStr=viewfile)
# list the elements and patterns we will be looking for:
searchfor <- list(
  Title = 'name=[^ ]+ title=\"(.+)\" href',
  Date = '<strong>(.+)</strong>',
  href = 'href=\"([^\"]+)\"',
  label = 'aria-label=\"([^\"]+)\"'
)
for (this.i in names(searchfor)){
  this.full <- paste0('.*', searchfor[[this.i]], '.*')
  dt.viewfile[grepl(this.full, origStr), (this.i) := gsub(this.full, '\\1', origStr)]
}
# Clean records:
dt.viewfile[, `:=`(Title=na.omit(Title),Date=na.omit(Date),label=na.omit(label)),
by=href]
dt.viewfile[,Date:=gsub('<abbr title=".*">(.*)</abbr>','\\1',Date)]
dt.viewfile <- unique(dt.viewfile[,.(Title,Date,href,label)]); # 690 records
The result is a table with links to all the downloadable files. You can now download them using any tool you like, for example download.file() or GET():
dt.viewfile[, full.url:=paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename:=fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]
for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
  url <- dt.viewfile[i, full.url]
  destfile <- dt.viewfile[i, filename]
  cat('\nDownloading', url, ' to ', destfile)
  fil <- GET(url, write_disk(destfile))
  # our destination file doesn't have extension, we need to get it from the server:
  serverFilename <- gsub("inline;filename=(.*)", '\\1', headers(fil)$`content-disposition`)
  serverExtension <- tools::file_ext(serverFilename)
  # Adding the extension to the file we just saved
  file.rename(destfile, paste0(destfile, '.', serverExtension))
}
Now the only problem we have is that the original webpage was only showing records for the last 3 years. But instead of clicking View More through RSelenium, we can simply load the page with earlier dates, something like this:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')
then repeat the rest of the code as necessary.
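If both date ranges are needed, the raw-string extraction can be wrapped in a small helper and applied to each URL, with the searchfor/cleaning steps above then run once on the combined table. A sketch, assuming the Search URL accepts arbitrary startDate/endDate values as in the example above:
read_agenda <- function(u) {
  t <- readLines(u, encoding = 'UTF-8')
  viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
  data.table(origStr = viewfile[viewfile != ''])
}
urls <- c('https://www.charleston-sc.gov/AgendaCenter/',
          'https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017')
dt.viewfile <- unique(rbindlist(lapply(urls, read_agenda)))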
I'm scraping the data from IMDB movie list.
I would like to scrape the link for each movie, but I am not able to correctly identify where it is stored on the page.
This is how the part of the link is stored:
link screenshot
What I have tried:
link<-html_nodes(strona_int, '.lister-item-header+ a href')
link<-html_text(link)
Whole code
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open web browser (in my case Firefox, but it can be Chrome or Internet Explorer)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#set the start number for page link
ile<-seq(from=1, by=250, length.out = 1)
#empty frame for data
filmy_df=data.frame()
#loop reading the data
for (j in ile){
  #set the link for the webpage
  newURL <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
  startNumberURL <- paste0(newURL, j)
  #open the webpage
  remDr$navigate(startNumberURL)
  #read the html code of the page
  strona_int <- read_html(startNumberURL)
  #rank section
  rank_data <- html_nodes(strona_int, '.text-primary')
  #convert rank to text
  rank_data <- html_text(rank_data)
  #convert to numeric
  rank_data <- as.numeric(rank_data)
  link <- html_nodes(strona_int, '.lister-item-header+ a href')
  link <- html_text(link)
  #release date
  year <- html_nodes(strona_int, '.lister-item-year')
  #convert to text
  year <- html_text(year)
  #remove non-numeric characters
  year <- gsub("\\D", "", year)
  #set as factor
  year <- as.factor(year)
  #title
  title_data <- html_nodes(strona_int, '.lister-item-header a')
  #convert title to text
  title_data <- html_text(title_data)
  #temporary data frame
  filmy_df_temp <- data.frame(Rank = rank_data, Title = title_data, Release.Year = year)
  #temp df to target df
  filmy_df <- rbind(filmy_df, filmy_df_temp)
}
#close browser
remDr$close()
#stop Selenium
rD[["server"]]$stop()
Expected solution:
A scraped link for each film, which could be used later if required.
Selenium is not required for gathering the links.
The links are a tags housed within a parent with class lister-item-header. You can match on those and then extract the href attribute. You then need to prepend the protocol and domain, "https://www.imdb.com".
In the css selector:
.lister-item-header a
The dot is a class selector for the parent class; the space between is a descendant combinator; the final a is a type selector for the child a tags.
library(rvest)
library(magrittr)
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- read_html(url) %>% html_nodes(., ".lister-item-header a") %>% html_attr(., "href")
One way of adding protocol and domain:
library(rvest)
library(magrittr)
library(xml2)
base <- 'https://www.imdb.com'
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- url_absolute(read_html(url) %>% html_nodes(., ".lister-item-header a") %>% html_attr(., "href"), base)
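If the link should end up alongside the other columns in the question's loop, it can be added to the temporary data frame; a sketch, assuming the selector matches exactly one link per title so the vectors line up:
link_data <- html_nodes(strona_int, ".lister-item-header a") %>%
  html_attr("href") %>%
  url_absolute("https://www.imdb.com")
filmy_df_temp <- data.frame(Rank = rank_data, Title = title_data,
                            Release.Year = year, Link = link_data)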
Reference:
https://www.rdocumentation.org/packages/xml2/versions/1.2.0/topics/url_absolute
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
I would like to scrape the table’s data from this page http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx.
The page asks you to select multiple options like "Commodity", "State", "Year" and "Month", and then to press the submit button to get the table.
My attempt is to scrape the table associated with "Commodity" = "Tomato", "State" = "Karnataka", "Year" = "2016" and "Month" = data for all months. I am working with the following code in R:
url<-"http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx"
pgsession <- html_session(url)
pgform <-html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
                          "ctl00$cphBody$Commodit_list" = "Tomato",
                          "ctl00$cphBody$State_list" = "Karnataka",
                          "ctl00$cphBody$Yea_list" = "2016",
                          "ctl00$cphBody$Mont_list" = "January"
)
d <- submit_form(session=pgsession, form=filled_form)
y <- d %>%
html_nodes("table") %>%.[[2]] %>%
html_table(header=TRUE)
dim(y)
but I am getting an error message:
Submitting with 'ctl00$ddlDistrict'
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
  Internal Server Error (HTTP 500).
I am not able to scrape the required table from the web page. Please help me extract the table with the desired options from the page.
Here is a method that uses the RSelenium package to scrape data for all months of 2016.
library(RSelenium)
library(rvest)
library(tidyverse)
url <- "http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx"
rD <- rsDriver()
remDr <- rD$client
lst <- lapply(seq(2, 13), function(x) {
  remDr$navigate(url)
  webElem_commodity <- remDr$findElement(using = "css", "#cphBody_Commodit_list")
  opts_commodity <- webElem_commodity$selectTag() # get all the associated tags
  commodity_num <- which(opts_commodity$text == "Tomato") # find the required option
  opts_commodity$elements[[commodity_num]]$clickElement() # select the required option
  Sys.sleep(10) # for state names to load
  webElem_state <- remDr$findElement(using = "css", "#cphBody_State_list")
  opts_state <- webElem_state$selectTag()
  state_num <- which(opts_state$text == "Karnataka")
  opts_state$elements[[state_num]]$clickElement()
  Sys.sleep(10) # for years to load
  webElem_yr <- remDr$findElement(using = "css", "#cphBody_Yea_list")
  opts_yr <- webElem_yr$selectTag()
  yr_num <- which(opts_yr$text == "2016")
  opts_yr$elements[[yr_num]]$clickElement()
  Sys.sleep(10) # for months to load
  webElem_month <- remDr$findElement(using = "css", "#cphBody_Mont_list")
  opts_month <- webElem_month$selectTag()
  opts_month$elements[[x]]$clickElement() # select a different month in each lapply iteration
  Sys.sleep(10) # for submit button to become active
  webElem_submit <- remDr$findElement(using = "css", "#cphBody_But_Submit")
  webElem_submit$clickElement()
  page_source <- remDr$getPageSource()
  tdf <- read_html(page_source[[1]]) %>% # read table
    html_nodes("table") %>% .[[5]] %>%
    html_table(header = T, fill = T, trim = T) %>%
    head(-1) # remove the last row which contains average at the bottom of the scraped table
})
remDr$close()
rD$server$stop()
# lst is a list, with 12 elements. Each element corresponds to data for one month of 2016
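To get one combined table, the twelve monthly data frames can be stacked; a sketch, assuming options 2-13 of the month drop-down correspond to January-December and that the twelve tables share the same columns:
names(lst) <- month.name
all_2016 <- dplyr::bind_rows(lst, .id = "Month")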