Scraping a page with a login in R

I made sure to look at other rvest questions on Stack Exchange before asking. In particular:
R - Using rvest to scrape a password protected website without logging in at each loop iteration
and
Using rvest to scrape a website w/ a login page
I am trying to scrape ex-dividend dates from dividend.com using rvest.
Following the code in those questions, I wrote the code below, but I seem to be having trouble.
# ADDRESS OF LOGIN PAGE:
URL <- "https://www.dividend.com/login/"
# CREATE PAGE SESSION
pgsession <- html_session(URL)
# CREATE PAGE FORM
pgform <- html_form(pgsession)[[2]] # Problem is here, I get a blank list
# FILL IN PASSWORD AND ID
filled_form <- set_values(pgform,
                          user_login = "my_email",     # email as a quoted string
                          password   = "my_password")  # password as a quoted string
# GO TO EX-DIVIDEND DATE PAGE AND SCRAPE
div_data <- submit_form(pgsession, filled_form) %>%
  jump_to("url of the page with the first table") %>%
  html_nodes("td a, td span, tr td") %>%
  html_table()
But html_form(pgsession) returns an empty list().
Can anyone provide some help?
Thanks

If everything else works as intended, I think the issue is with passing an html_session object to html_form(). It expects a node or a node set, so I think this should work:
# CREATE PAGE FORM
pgform <- html_form(read_html("https://www.dividend.com/login/"))
You still have to subset the correct form from the pgform object, but I think this should be it.
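For completeness, here is a minimal sketch of the whole flow built on that fix. The form index and the field names user_login/password come from the question, the credentials are placeholders, and the ex-dividend URL is purely hypothetical since the question did not give it; verify all of these against the html_form() output and the site itself.
library(rvest)
login_url <- "https://www.dividend.com/login/"
pgsession <- html_session(login_url)
# parse the page and pick the login form (index assumed; check html_form() output)
pgform <- html_form(read_html(login_url))[[2]]
filled_form <- set_values(pgform,
                          user_login = "you@example.com",  # placeholder email
                          password   = "your_password")    # placeholder password
pgsession <- submit_form(pgsession, filled_form)
# hypothetical URL: substitute the page that actually holds the ex-dividend table
div_data <- pgsession %>%
  jump_to("https://www.dividend.com/ex-dividend-dates/") %>%
  html_node("table") %>%
  html_table()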

Related

Extract table from web page by page

I have written code to scrape a table from a web page. This code extracts the table from page one (the URL ending in /page=0):
url <- "https://ss0.corp.com/auth/page=0"
login <- "john.johnson" (fake)
password <- "67HJL54GR" (fake)
res <- GET(url, authenticate(login, password))
content <- content(res, "text")
table <- fromJSON(content) %>%
as.data.farme()
I want to write code that extracts rows from the table page by page and then binds them together. I am doing this because the table is too large and I can't extract everything at once (it would break the system). I don't know how many pages there will be (it changes), so the loop must stop once the last page has been collected. How could I do that?
I cannot test this to guarantee it will work because the question is not reproducible, but you mainly need three steps:
Set up the URL and credentials:
url <- "http://someurl/auth/page="
login <- ""
password <- ""
Iterate over all pages (I'm assuming there are N) and store the results in a list. Note that we build the correct URL for each page.
tables <- lapply(1:N, function(page) {
  # Create the proper url and make the request
  this_url <- paste0(url, page)
  res <- GET(this_url, authenticate(login, password))
  # Extract the content just like you would in a single page
  content <- content(res, "text")
  table <- fromJSON(content) %>%
    as.data.frame()
  return(table)
})
Aggregate all the tables in the list into a single complete table using rbind:
complete <- do.call(rbind, tables)
I hope this at least helps give you a direction.
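Since the question also says the number of pages is unknown and the loop must stop once the last page has been collected, here is a hedged variant of the same idea using repeat. It assumes that a request past the last page returns an empty table; adjust the stopping condition to whatever the API actually does.
library(httr)
library(jsonlite)

url      <- "http://someurl/auth/page="
login    <- ""
password <- ""

tables <- list()
page   <- 0
repeat {
  res     <- GET(paste0(url, page), authenticate(login, password))
  content <- content(res, "text")
  table   <- as.data.frame(fromJSON(content))
  # stop once the server returns an empty page (assumed to signal the last page)
  if (nrow(table) == 0) break
  tables[[length(tables) + 1]] <- table
  page <- page + 1
}
complete <- do.call(rbind, tables)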

How do I find html_node on search form?

I have a list of names (first name, last name, and date of birth) that I need to search for on the Fulton County, Georgia (USA) Jail website to determine whether a person is in jail or has been released.
The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400
The site requires you enter a last name and first name, then it gives you a list of results.
I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow. I am using SelectorGadget to help figure out the CSS tags.
Here is the code I have so far. Right now I can't figure out what html_node to use.
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session(fc.url)
# Grab initial form
form.unfilled <- jail %>% html_node("form")
form.unfilled
The result I get from form.unfilled is {xml_missing} <NA> which I know isn't right.
I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.
Thanks.
It appears that on the initial call the website opens onto "http://justice.fultoncountyga.gov/PAJailManager/default.aspx". Once the session is started, you should be able to jump to the search page:
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
Note: Verify that your actions are within the terms of service for the website. Many sites have policies against scraping.
The website relies heavily on Javascript to render itself. When opening the link you provided in a fresh browser instance, you get redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit of Javascript to send you to the page with the form.
rvest is unable to execute arbitrary Javascript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example Firefox or Chrome), which executes the Javascript as intended.
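For reference, a minimal RSelenium sketch of that approach, assuming a local driver can be started with rsDriver() and that the element ids match the form field names used later in this thread (LastName, FirstName, SearchSubmit); verify both assumptions against the live page.
library(RSelenium)
library(rvest)

# start a local browser session (assumes a suitable driver is installed)
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client
remDr$navigate("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
# click the "Jail Records" link so the site's Javascript opens the search form
remDr$findElement(using = "link text", value = "Jail Records")$clickElement()
# element ids are assumed; check the page source
remDr$findElement(using = "id", value = "LastName")$sendKeysToElement(list("DOE"))
remDr$findElement(using = "id", value = "FirstName")$sendKeysToElement(list("JOHN"))
remDr$findElement(using = "id", value = "SearchSubmit")$clickElement()
# hand the rendered page back to rvest for parsing
results <- read_html(remDr$getPageSource()[[1]]) %>% html_nodes("table")
remDr$close()
rD$server$stop()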
Thanks to Dave2e.
Here is the code that works. This question is answered (but I'll post another one, because I'm not getting a table of data as a result).
Note: I cannot find any Terms of Service on this site that I'm querying
library(rvest)
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()
form.unfilled
#name values
lname <- "DOE"
fname <- "JOHN"
# Fill the form with name values
form.filled <- form.unfilled %>%
set_values("LastName" = lname,
"FirstName" = fname)
#Submit form
r <- submit_form(jail2, form.filled,
submit = "SearchSubmit")
#grab tables from submitted form
table <- r %>% html_nodes("table")
#grab a table with some data
table[[5]] %>% html_table()
# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."

Scrape data using spanID in R

Scraping error using span ID
I have successfully been able to log in to a website. I want to scrape the post status to see whether the item has been delivered or not. It is wrapped inside a span with ID "lblFinalStatus", which I am unable to scrape with the following code:
library(rvest)
url <-"http://203.115.106.5/PodTrackingnew/PodTracking.aspx"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, "txtPodNo" = "EM3352650")
submit_form(pgsession, filled_form)
memberlist <- jump_to(pgsession, "http://203.115.106.5/PodTrackingnew/PodTracking.aspx")
page <- read_html(memberlist)
#Error
usernames <- html_nodes(page,"#lblFinalStatus")
I want to obtain the statement "Packet of Mr./Ms./M/S. KISHAN HARISHANKAR SHARMA is Delivered" from the website in the usernames variable above.
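One likely issue (untested against this site) is that the response returned by submit_form() is discarded, and jump_to() then requests a fresh, unsubmitted copy of the page, so the span is still empty. A minimal sketch that reads the span from the submitted response instead:
library(rvest)
url <- "http://203.115.106.5/PodTrackingnew/PodTracking.aspx"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, "txtPodNo" = "EM3352650")
# keep the response returned by submit_form() instead of re-requesting the page
submitted <- submit_form(pgsession, filled_form)
status <- submitted %>%
  html_node("#lblFinalStatus") %>%
  html_text(trim = TRUE)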

Scraping website that requires login (password-protected)

I am scraping a newspaper website (http://politiken.dk) and was able to get all the titles of the news articles I need.
But I can't get the headlines + full text.
When I try without logging in, the code only gets the first headline of the day I am scraping (not even the one that comes first in my RData list).
I believe I need to log in to get the rest, right?
I have a user and a password, but I cannot make any code work.
I need to get the headlines from the articles in my RData, using the urls column, so the specific URLs for all the articles I need are already in the code below.
I saw this code for creating a login on a different website, but I cannot apply it to my case:
library(httr)
library(XML)
handle <- handle("http://subscribers.footballguys.com") # I DONT KNOW WHAT TO PUT HERE
path <- "amember/login.php"                             # I DONT KNOW WHAT TO PUT HERE
# fields found in the login form.
login <- list(
  amember_login = "username",
  amember_pass = "password",
  amember_redirect_url = "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)
response <- POST(handle = handle, path = path, body = login)
This is my code to get the headlines:
library(rvest)
headlines <- rep("", nrow(politiken.unique))
for (i in 1:nrow(politiken.unique)) {
  try({
    text <- read_html(as.character(politiken.unique$urls[i])) %>%
      html_nodes(".summary__p") %>%
      html_text(trim = TRUE)
    headlines[i] <- paste(text, collapse = " ")
  })
}
I tried this suggestion: Scrape password-protected website in R
But it did not work, or I don't know how to apply it.
Thanks in advance!
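I cannot test against politiken.dk, but one direction is to do the login once with an rvest session and then reuse that authenticated session inside the existing loop. The login URL and the field names below are placeholders; read the real ones from html_form() on the actual login page before filling anything in.
library(rvest)
# log in once and reuse the authenticated session for every article
login_url <- "https://politiken.dk/bruger/login/"   # placeholder login URL
pgsession <- html_session(login_url)
pgform <- html_form(pgsession)[[1]]                 # inspect this to find the real field names
filled <- set_values(pgform,
                     email    = "your_user",        # placeholder field name and value
                     password = "your_password")    # placeholder field name and value
pgsession <- submit_form(pgsession, filled)

headlines <- rep("", nrow(politiken.unique))
for (i in 1:nrow(politiken.unique)) {
  try({
    text <- pgsession %>%
      jump_to(as.character(politiken.unique$urls[i])) %>%
      html_nodes(".summary__p") %>%
      html_text(trim = TRUE)
    headlines[i] <- paste(text, collapse = " ")
  })
}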

R Web scrape - Error

Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered URL. Here is an example of my code:
library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
read_html() %>%
html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}
Suggestions?
Aaron
Edit:
I also tried scraping via an XHR request, but I think my issue is not knowing what CSS selector or XPath to use to find the appropriate data.
XHR code:
library(httr)
library(rvest)

get.morningstar.Table1 <- function(Symbol.i, htmlnode) {
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))
  tryCatch(x <- content(res) %>%
             html_nodes(htmlnode) %>%
             html_text() %>%
             trimws(),
           error = function(e) x <- NA)
  return(x)
}  # the HTML node in this case is a vkey
Still, the question remains: am I using the correct CSS/XPath to look this up? The XHR code works great for requests that have a clear CSS selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)
url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)
title <- page %>%
html_node('title') %>%
html_text()
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape (like Yahoo Finance).
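As an illustration of that last point, a small sketch against the Yahoo Finance quote page; the h1 selector is an assumption about that page's current markup and may need adjusting.
library(rvest)
symbol <- "FBALX"
# the quote page's h1 usually carries the fund name; the selector is an assumption
name <- read_html(paste0("https://finance.yahoo.com/quote/", symbol)) %>%
  html_node("h1") %>%
  html_text()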
