I need to fill in the month and year fields on this page:
http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3
To do this, I have programmed the following in RSelenium, and it works:
# libraries
library(RSelenium)

# browser parameters
mybrowser <- remoteDriver(browserName = "chrome")
mybrowser$open(silent = TRUE)
mybrowser$setTimeout(type = "page load", milliseconds = 1000000)
mybrowser$setImplicitWaitTimeout(milliseconds = 1000000)

url <- "http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3"

# start navigation
mybrowser$navigate(url)

# fill in the month and year fields
wxbox <- mybrowser$findElement(using = "class", "bordeInput2")
wxbox$sendKeysToElement(list("09"))
wxbox <- mybrowser$findElement(using = "id", "aa")
wxbox$sendKeysToElement(list("2016"))

# submit the form
wxbutton <- mybrowser$findElement('xpath', "//*[@id='fm']/div[2]/input")
wxbutton$clickElement()
However, I'd like to see a solution using rvest or RCurl; I've tried, and it does not work for me. If anyone can help me with that, I would appreciate it.
An attempt I made was:
library(RCurl)
library(XML)
form <- postForm("http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3",
                 Year = 2010, Month = 2)
doc <- htmlParse(form)
pkids <- xpathSApply(doc, xmlAttrs)
pkids
data <- lapply(pkids)
tab <- readHTMLTable(data[[1]], which = 1)
First of all, thanks.
You can simply POST to the URL as follows:
require(rvest)
require(httr)
a <- POST("http://www.svs.cl/institucional/mercados/entidad.php",
          # body = the fields you fill in on the form (mm = month, aa = year)
          body = list(mm = "09", aa = "2016"),
          # query = the long URL broken into its parameters
          query = list(mercado = "S",
                       rut = "99588060",
                       grupo = "",
                       tipoentidad = "CSVID",
                       row = "AABaHEAAaAAAB7uAAT",
                       vig = "VI",
                       control = "svs",
                       pestania = "3"))
read_html(a) %>% html_nodes("dd") %>% html_text() %>%
  setNames(c("Business name", "RUT"))
Which gives you:
Business name RUT
"ACE SEGUROS DE VIDA S.A." "99588060-1"
My goal is to get EVERY tweet ever for any twitter account. I picked the NYTimes for this example.
The code below works, but it only pulls the last 100 tweets. max_results does not allow you to put a value over 100.
The code below is almost fully copy-paste-able; you would just have to supply your own bearer token.
How can I expand this to give me every tweet from an account?
One idea is that I can loop it for every day since the account was created, but that seems tedious if there is a faster way.
# NYT Example --------------------------------------------------------------------
library(httr)
library(jsonlite)
library(tidyverse)
bearer_token <- "insert your bearer token here"
headers <- c(`Authorization` = sprintf('Bearer %s', bearer_token))
params <- list(`user.fields` = 'description')
handle <- 'nytimes'
url_handle <- sprintf('https://api.twitter.com/2/users/by?usernames=%s', handle)
response <- httr::GET(url = url_handle,
httr::add_headers(.headers = headers),
query = params)
json_data <- fromJSON(httr::content(response, as = "text"), flatten = TRUE)
json_data %>%
as_tibble()
NYT_ID <- json_data$data$id
url_handle <- paste0("https://api.twitter.com/2/users/", NYT_ID, "/tweets")
params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
`max_results` = '100')
response <- httr::GET(url = url_handle,
httr::add_headers(.headers = headers),
query = params)
json_data <- fromJSON(httr::content(response, as = "text"), flatten = TRUE)
NYT_tweets <- json_data$data %>%
as_tibble() %>%
select(-id, -author_id, -9)
NYT_tweets
For anyone who finds this later on, I found a solution that works for me.
Using the start_time and end_time parameters, you can specify the date range the tweets must fall between. I was able to pull all tweets from November, for example, and then rbind those to the ones from December, etc. Sometimes I had to do two pulls (first half of March, second half of March) to get all of them, but it worked for this.
params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
`max_results` = '100',
`start_time` = '2021-11-01T00:00:01.000Z',
`end_time` = '2021-11-30T23:58:21.000Z')
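A more general approach (a sketch I have not run end to end) is to page through the timeline with the pagination token that the v2 endpoint returns in meta$next_token; it reuses url_handle and headers from the code above. Note that the users/:id/tweets endpoint only reaches back roughly 3,200 tweets, so for truly everything you would need the full-archive search endpoint instead.
# Sketch: loop over pages using pagination_token until no next_token remains.
all_pages <- list()
next_token <- NULL
repeat {
  params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
                 `max_results` = '100')
  if (!is.null(next_token)) params$pagination_token <- next_token
  response <- httr::GET(url = url_handle,
                        httr::add_headers(.headers = headers),
                        query = params)
  json_data <- fromJSON(httr::content(response, as = "text"), flatten = TRUE)
  all_pages[[length(all_pages) + 1]] <- as_tibble(json_data$data)
  next_token <- json_data$meta$next_token
  if (is.null(next_token)) break
  Sys.sleep(1)  # be gentle with the rate limit
}
NYT_all <- bind_rows(all_pages)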
I have used the rvest package in R to scrape unique URLs before.
However, I am now stuck with a particular website. The URL stays static, and I need to select several dropdowns and then scrape the resulting table that appears.
It would be helpful if someone could point me in the right direction for websites like these. Is R even capable of doing this?
Edit: I have done some research, and it seems RSelenium can handle such tasks. Unfortunately, I have no exposure to it. Can someone recommend an example/blog/material online on using Selenium specifically for clicking and scraping, for someone as new to this as I am?
I have made a blog post about an RSelenium example:
https://guillaumepressiat.github.io/blog/2021/04/RSelenium-paginated-tables
This website covers Selenium in general; you will have to map it onto the RSelenium API package (the verbs are almost the same in all languages: findElement, etc.): https://www.guru99.com/selenium-tutorial.html
But as an example based on your question, maybe something like this to begin with:
# https://stackoverflow.com/q/67021563/10527496
# java -jar selenium-server-standalone-3.9.1.jar
library(RSelenium)
library(tidyverse)
library(rvest)
library(httr)
remDr <- remoteDriver(
remoteServerAddr = "localhost",
port = 4444L, # change port according to terminal
browserName = "firefox"
)
remDr$open()
# remDr$getStatus()
url <- "https://fcainfoweb.nic.in/reports/Report_Menu_Web.aspx"
remDr$navigate(url)
Sys.sleep(5)
# first : radio buttons
u1 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_0')
u2 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_1')
u3 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_2')
u4 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Rbl_Rpt_type_3')
dynam <- remDr$mouseMoveToLocation(webElement = u1)
u1$click()
Sys.sleep(5)
# second : Select input
s1 <- remDr$findElement(using = "id", value = 'ctl00_MainContent_Ddl_Rpt_Option0')
# get available choices
s_choices <- read_html(s1$getElementAttribute('innerHTML')[[1]]) %>%
html_nodes('option') %>%
html_attrs() %>%
unlist() %>%
.[3:length(.)] %>%
as.vector()
dynam <- remDr$mouseMoveToLocation(webElement = s1)
s1$click()
s1$sendKeysToElement(sendKeys = list(s_choices[1], key = "enter"))
# s_choices[1] is "Daily Prices"
Sys.sleep(5)
# get date choices
s_date_choices <- remDr$findElement(using = "id", value = "ctl00_MainContent_Txt_FrmDate")
dynam <- remDr$mouseMoveToLocation(webElement = s_date_choices)
s_date_choices$click()
s_date_choices$sendKeysToElement(sendKeys = list('01/01/2021', key = "enter"))
Sys.sleep(5)
s_table <- remDr$findElement(using = "id", value = "Panel1")
# get first tables as an example
results_1 <- read_html(s_table$getElementAttribute('innerHTML')[[1]]) %>%
html_table(fill = TRUE) %>%
.[2:length(.)]
We get a list of three tables as a result.
Making a function from this code to loop over a vector of dates should be possible after that, I think (you will probably have to reload a fresh start page at the base URL for each date).
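A rough, untested sketch of that idea, reusing the element ids and waits from the code above ("Daily Prices", the sleep times, and the date range are assumptions to adjust):
# Sketch only: wrap the steps above into a function and loop over a date vector.
scrape_date <- function(date_str) {
  remDr$navigate(url)   # fresh start page for each date
  Sys.sleep(5)
  remDr$findElement(using = "id", value = "ctl00_MainContent_Rbl_Rpt_type_0")$clickElement()
  Sys.sleep(5)
  s1 <- remDr$findElement(using = "id", value = "ctl00_MainContent_Ddl_Rpt_Option0")
  s1$sendKeysToElement(sendKeys = list("Daily Prices", key = "enter"))
  Sys.sleep(5)
  d1 <- remDr$findElement(using = "id", value = "ctl00_MainContent_Txt_FrmDate")
  d1$sendKeysToElement(sendKeys = list(date_str, key = "enter"))
  Sys.sleep(5)
  panel <- remDr$findElement(using = "id", value = "Panel1")
  read_html(panel$getElementAttribute("innerHTML")[[1]]) %>%
    html_table(fill = TRUE) %>%
    .[2:length(.)]
}

dates <- format(seq(as.Date("2021-01-01"), as.Date("2021-01-05"), by = "day"), "%d/%m/%Y")
all_tables <- lapply(dates, scrape_date)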
I'm trying to scrape the links to all the minutes and agendas provided on this website: https://www.charleston-sc.gov/AgendaCenter/
I've managed to scrape the section IDs associated with each category (and the years for each category) so I can loop through the contents within each category-year (please see below). But I don't know how to scrape the hrefs that live inside the contents. In particular, the links to the agendas live inside the dropdown menu under 'Download', so it seems like I need to go through extra clicks to scrape the hrefs.
How do I scrape the minutes and agenda (inside the download dropdown) for each table I select? Ideally, I would like a table with the date, title of the agenda, links to minutes, and links to agenda.
I'm using RSelenium for this. Please see the code I have so far below, which allows me to click through each category and year, but not much else. Please help!
rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]
df <- data.frame(cbind(co, yr)) %>%
mutate_all(as.character) %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
tidyr::fill(c(co,id), .direction='down')%>% drop_na(co)
remDr <- remoteDriver(port=4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)
for (j in unique(df$id)){
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat', j, '"]/h2'))$clickElement()
  for (k in unique(df[which(df$id==j),'yr'])){
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="', k, '"]'))$clickElement()
# NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
}
}
Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile!='']
library(data.table) # I use data.table because it's more convenient - but it can be done without it too
dt.viewfile <- data.table(origStr=viewfile)
# list the elements and patterns we will be looking for:
searchfor <- list(
Title='name=[^ ]+ title=\"(.+)\" href',
Date='<strong>(.+)</strong>',
href='href=\"([^\"]+)\"',
label= 'aria-label=\"([^\"]+)\"'
)
for (this.i in names(searchfor)){
this.full <- paste0('.*',searchfor[[this.i]],'.*');
dt.viewfile[grepl(this.full, origStr), (this.i):=gsub(this.full,'\\1',origStr)]
}
# Clean records:
dt.viewfile[, `:=`(Title=na.omit(Title),Date=na.omit(Date),label=na.omit(label)),
by=href]
dt.viewfile[,Date:=gsub('<abbr title=".*">(.*)</abbr>','\\1',Date)]
dt.viewfile <- unique(dt.viewfile[,.(Title,Date,href,label)]); # 690 records
What you have as the result is a table with the links to all downloadable files. You can now download them using any tool you like, for example using download.file() or GET():
dt.viewfile[, full.url:=paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename:=fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]
for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
url <- dt.viewfile[i,full.url]
destfile <- dt.viewfile[i,filename]
cat('\nDownloading',url, ' to ', destfile)
fil <- GET(url, write_disk(destfile))
# our destination file doesn't have extension, we need to get it from the server:
serverFilename <- gsub("inline;filename=(.*)",'\\1',headers(fil)$`content-disposition`)
serverExtension <- tools::file_ext(serverFilename)
# Adding the extension to the file we just saved
file.rename(destfile,paste0(destfile,'.',serverExtension))
}
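If, as the question asks, you would rather end up with one row per meeting that has the agenda and minutes links side by side, here is a sketch; it assumes the href path itself distinguishes the document type (e.g. it contains /Agenda/ or /Minutes/), so check a few records first:
# Sketch: reshape to one row per Title/Date with separate Agenda and Minutes
# columns. The /Minutes/ vs /Agenda/ path test is an assumption about the hrefs.
dt.viewfile[, doctype := ifelse(grepl('/Minutes/', href), 'Minutes', 'Agenda')]
wide <- dcast(dt.viewfile, Title + Date ~ doctype,
              value.var = 'href', fun.aggregate = function(x) x[1])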
Now the only problem we have is that the original webpage was only showing records for the last 3 years. But instead of clicking View More through RSelenium, we can simply load the page with earlier dates, something like this:
t <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')
then repeat the rest of the code as necessary.
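For example (a sketch reusing the search URL shown above; the date windows here are arbitrary), you can build several such URLs and run the same parsing on each:
# Sketch: cover older records by looping over several startDate/endDate windows,
# then apply the same ViewFile parsing to each page.
starts <- c('10/14/2011', '10/14/2014', '10/14/2017')
ends   <- c('10/13/2014', '10/13/2017', '10/13/2020')
urls <- sprintf('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=%s&endDate=%s',
                starts, ends)
pages <- lapply(urls, readLines, encoding = 'UTF-8')
# then run the str_extract_all() / data.table steps above on each element of `pages`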
I have tried scraping data from a real estate site and arranging the data in a way that can then easily be filtered and checked using a spreadsheet. I'm actually a little embarrassed that I can't move this R code forward.
Now that I have all the links to the posts, I cannot work out how to loop through the previously compiled data frame and get the details from all the URLs.
Could you please help me with it? Thanks a lot.
#Loading the rvest package
library(rvest)
library(magrittr)  # for the '%>%' pipe operator
library(RSelenium) # to get the fully loaded (JavaScript-rendered) html of the page
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
# Specify the URL of the page to be scraped
main_link<- paste0(URL.base, i)
# go to website
remDr$navigate(main_link)
# get page source and save it as an html object with rvest
main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the data
name <- html_nodes(main_page, css=".name.ng-binding") %>% html_text()
locality <- html_nodes(main_page, css=".locality.ng-binding") %>% html_text()
norm_price <- html_nodes(main_page, css=".norm-price.ng-binding") %>% html_text()
sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
sreality_url2 <- sreality_url[c(4:24)]
name2 <- name[c(4:24)]
record <- data.frame(cbind(name2, locality, norm_price, sreality_url2))
complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")
I would do this differently:
I would create a function, say scraper, that groups together all the scraping functions you have already defined. Then I would build a vector of all the possible links with str_c (say 30 of them) and apply the function to it with a simple lapply. As said, I would not use RSelenium. (Libraries: rvest, stringr, tibble, dplyr.)
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
Here is the base URL; starting from it you should be able to build the URL strings for all the pages (1 to however many) you are interested in (and for all the possible URLs: for Praha, Olomouc, Ostrava, etc.).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
Here you create all the links according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
Then define a single function for each piece of data you are interested in; this way you are more precise, error debugging is easier, and data quality is better. (I assume your CSS selectors are right; otherwise you will get empty objects.)
For names:
name = function(html) {
  data = html_nodes(html, css=".name.ng-binding") %>%
    html_text()
  return(data)
}
For locality:
locality = function(html) {
  data = html_nodes(html, css=".locality.ng-binding") %>%
    html_text()
  return(data)
}
For the normalised price:
normprice = function(html) {
  data = html_nodes(html, css=".norm-price.ng-binding") %>%
    html_text()
  return(data)
}
For the hrefs:
sreality_url = function(html) {
  data = html_nodes(html, css=".title") %>%
    html_attr("href")
  return(data)
}
Those are the individual functions (the CSS selectors, even though I didn't test them, don't look quite right to me, but this gives you the right framework to work with). After that, combine them into a tibble object:
get.data.table = function(html){
  name = name(html)
  locality = locality(html)
  normprice = normprice(html)
  hrefs = sreality_url(html)
  combine = tibble(adtext = name,
                   loc = locality,
                   price = normprice,
                   URL = hrefs)
  combine = combine %>%
    select(adtext, loc, price, URL)
  return(combine)
}
Then the final scraper:
scrape.all = function(pages){
  pages %>%
    lapply(read_html) %>%        # parse each page before extracting nodes
    lapply(get.data.table) %>%
    bind_rows() %>%
    write.csv(file = 'MyData.csv')
}
scrape.all(list.of.pages)
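Going back to the original question of getting the details behind each link: once `complete` (or the scraper above) has been built, here is a sketch of looping over the individual listing pages with the RSelenium session from the question. The ".description" selector is only a placeholder I have not verified, and the hrefs are assumed to be site-relative (hence the paste0 with the domain); adjust both after inspecting a detail page.
# Sketch: visit each ad collected in `complete` and pull details from the page.
detail_urls <- paste0("https://www.sreality.cz", complete$sreality_url2)
details <- lapply(detail_urls, function(u) {
  remDr$navigate(u)
  Sys.sleep(2)  # give the JavaScript app time to render
  pg <- remDr$getPageSource()[[1]] %>% read_html()
  data.frame(url = u,
             description = paste(html_text(html_nodes(pg, ".description")), collapse = " "),
             stringsAsFactors = FALSE)
})
details <- dplyr::bind_rows(details)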
I am trying to get the "Team Offense" table into R. I have tried multiple techniques and I cannot seem to get it to work. It looks like R is only reading the first two tables. The link is below.
https://www.pro-football-reference.com/years/2018/index.htm
This is what I have tried...
library(RCurl)
library(XML)
TeamData = 'https://www.pro-football-reference.com/years/2018/index.htm'
URL = TeamData
URLdata = getURL(URL)
table = readHTMLTable(URLdata, stringsAsFactors=F, which = 5)
Scraping Sports Reference sites can be tricky (most of the tables beyond the first couple are embedded inside HTML comments, which is why the comment markers are stripped before parsing below), but they are great sources:
library(rvest)
library(httr)
link <- "https://www.pro-football-reference.com/years/2018/index.htm"
doc <- GET(link)
cont <- content(doc, "text") %>%
gsub(pattern = "<!--\n", "", ., fixed = TRUE) %>%
read_html %>%
html_nodes(".table_outer_container table") %>%
html_table()
# Team Offense table is the fifth one
df <- cont[[5]]
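If you would rather not hard-code the index, here is a sketch that picks the table by its caption instead; the exact caption text is an assumption, hence the pattern match:
# Sketch: select the "Team Offense" table by caption rather than by position.
tabs <- content(doc, "text") %>%
  gsub(pattern = "<!--\n", "", ., fixed = TRUE) %>%
  read_html() %>%
  html_nodes(".table_outer_container table")
captions <- html_text(html_node(tabs, "caption"))
df <- html_table(tabs[[grep("Team Offense", captions)[1]]])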