Scraping error using span ID
I have successfully been able to log in to a website. I want to scrape the postal status, i.e. whether the packet is delivered or not. It is wrapped inside a span with ID "lblFinalStatus", which I am unable to scrape with the following code:
library(rvest)

url <- "http://203.115.106.5/PodTrackingnew/PodTracking.aspx"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, "txtPodNo" = "EM3352650")
submit_form(pgsession, filled_form)

memberlist <- jump_to(pgsession, "http://203.115.106.5/PodTrackingnew/PodTracking.aspx")
page <- read_html(memberlist)

# Error: the node set comes back empty
usernames <- html_nodes(page, "#lblFinalStatus")
I want to obtain the statement "Packet of Mr./Ms./M/S. KISHAN HARISHANKAR SHARMA is Delivered" from the website in the usernames variable above.
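The likely problem is that submit_form() returns a new session holding the server's response to the POST, but the code above discards it and re-requests the page with jump_to(), so the freshly rendered status never appears. A minimal sketch of the fix, reusing the same form index and field name (untested against the site):

library(rvest)

pgsession <- html_session("http://203.115.106.5/PodTrackingnew/PodTracking.aspx")
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, "txtPodNo" = "EM3352650")

# Keep the session that submit_form() returns: it holds the POST response
result <- submit_form(pgsession, filled_form)

# The status span is in the response to the form submission itself
status <- read_html(result) %>%
  html_node("#lblFinalStatus") %>%
  html_text(trim = TRUE)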
I'm trying to get the product link from a customer's profile page using R's rvest package
I've referenced various questions on Stack Overflow, including here (could not read webpage with read_html using rvest package from r), but each time I try something I'm not able to return the correct result.
For example, on this profile page:
https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8
I'd like to be able to return this link, with the end goal of extracting the product ID B01A51S9Y2:
https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp
library(dplyr)
library(rvest)
library(stringr)
library(httr)

# get url
url <- 'https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)

page %>%
  html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
  html_text()
# I did a test to see if I could even find the href, with no luck
test <- page %>%
  html_nodes("#a-page") %>%
  html_text()
grepl("B01A51S9Y2", test)
Thanks for the tip on RSelenium, @Qharr. That is helpful, but I'm still unsure how to extract the link or ASIN.

library(RSelenium)

driver <- rsDriver(browser = c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")

prod <- rd$findElement(using = "css", '.profile-at-product-box-link')
prod$getElementText()  # note the parentheses: the method must be called
This doesn't really return anything.
Update: I added getElementAttribute("href") and was able to get the link:
prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')
for (link in 1:length(prod)) {
  print(prod[[link]]$getElementAttribute('href'))
}
That info is pulled in dynamically via a POST request the page makes, which your initial rvest request doesn't capture. This subsequent request returns, in JSON format, the content governing the ASINs, the product links, and so on.
You can find it in the network tab of the browser dev tools (F12). Press F5 to refresh the page, then examine the network traffic.
It is not a simple POST request to mimic, so I would just go with RSelenium: let the page render, then use the CSS selector
.profile-at-product-box-link
to gather a collection of webElements you can loop over, extracting the href attribute from each.
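To go from those hrefs to the product ID, the ASIN can be pulled out with a regular expression; a short sketch, assuming the /dp/<ASIN> path layout shown in the question's example link:

library(stringr)

# Example href in the format shown in the question
href <- "https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp"

# ASINs are the 10 alphanumeric characters following the /dp/ segment
asin <- str_match(href, "/dp/([A-Z0-9]{10})")[, 2]
asin  # "B01A51S9Y2"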
If you go to https://www.myfxbook.com/members/iseasa_public1/rush/2531687, click the Export dropdown and choose CSV, you will be taken to https://www.myfxbook.com/statements/2531687/statement.csv and the download will proceed automatically in the browser. The thing is, you need to be logged in to https://www.myfxbook.com in order to receive the information; otherwise, the downloaded file will contain the text "Please login to Myfxbook.com to use this feature".
I tried using read.csv to get the CSV file in R, but only got that "Please login" message. I believe R has to simulate an HTML session (whatever that is; I am not sure about this) so that access will be granted. I then tried some scraping tools to log in first, but to no avail.
library(rvest)

login <- "https://www.myfxbook.com"
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[1]]
# loginEmail and loginPassword are the names of the html elements
filled_form <- set_values(pgform, loginEmail = "*****", loginPassword = "*****")
submit_form(pgsession, filled_form)

url <- "https://www.myfxbook.com/statements/2531687/statement.csv"
# page will contain 48 bytes of data (in the 'content' element), which is the
# size of that warning message, though I could not access this content
page <- jump_to(pgsession, url)
From the try above, I found that page has an element called cookies, which in turn contains a JSESSIONID. From my research, it seems this JSESSIONID is what "proves" I am logged in to the website. Nonetheless, downloading the CSV does not work.
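For what it's worth, those 48 bytes are reachable: a session returned by jump_to() exposes the underlying httr response in its response element. A minimal sketch of reading it, assuming the login above actually succeeded:

library(httr)

# Read the body of the last response as text
csv_text <- content(page$response, as = "text")

# If the login took, this parses the statement; otherwise csv_text is
# just the "Please login to Myfxbook.com" warning
statement <- read.csv(text = csv_text)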
Then I tried:
library(RCurl)
h <- getCurlHandle(cookiefile = "")
ans <- getForm("https://www.myfxbook.com", loginEmail = "*****", loginPassword = "*****", curl = h)
data <- getURL("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
data <- getURLContent("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
It seems these libraries were built to scrape HTML pages and do not deal with files in other formats.
I would very much appreciate any help, as I've been trying to make this work for quite some time now.
Thanks.
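As for the RCurl attempt: getURLContent() can fetch non-HTML content as raw bytes over the same curl handle (and therefore the same cookies); whether it returns the real statement still hinges on the login having succeeded. A sketch:

library(RCurl)

h <- getCurlHandle(cookiefile = "")
# ... log in over the handle h first, as attempted above ...

# Fetch the statement as raw bytes and write them straight to disk
raw_csv <- getURLContent("https://www.myfxbook.com/statements/2531687/statement.csv",
                         curl = h, binary = TRUE)
writeBin(as.vector(raw_csv), "statement.csv")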
I made sure to look at other rvest questions on Stack Exchange before asking. In particular:
R - Using rvest to scrape a password protected website without logging in at each loop iteration
and
Using rvest to scrape a website w/ a login page
I am trying to scrape the ex-dividend dates from dividend.com using rvest.
Following the above posts, I wrote the code below, but I seem to have run into trouble.
# ADDRESS OF LOGIN PAGE
URL <- "https://www.dividend.com/login/"

# CREATE PAGE SESSION
pgsession <- html_session(URL)

# CREATE PAGE FORM
pgform <- html_form(pgsession)[[2]] # Problem is here: I get a blank list

# FILL IN ID AND PASSWORD
filled_form <- set_values(pgform,
                          user_login = "email within quotes",
                          password = "password within quotes")

# GO TO EX-DIVIDEND DATE PAGE AND SCRAPE
div_data <- submit_form(pgsession, filled_form) %>%
  jump_to("url from which to scrape first table, within quotes") %>%
  html_node(list("td a", "td span", "tr td")) %>%
  html_table()
But html_form(pgsession) returns an empty list().
Can anyone provide some help?
Thanks
If everything else works as intended, I think the issue is with using an html_session object inside html_form(). It is expecting a node or a node set, so I think this should work:
# CREATE PAGE FORM
pgform <- html_form(read_html("https://www.dividend.com/login/"))
You still have to subset the correct form from the pgform object, but I think this should be it.
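Putting that together with the session-based submission from the question, a sketch (the form index and field names are assumptions to check against the live page):

library(rvest)

URL <- "https://www.dividend.com/login/"
pgsession <- html_session(URL)

# Parse the forms from the page source rather than from the session object
pgform <- html_form(read_html(URL))[[1]]  # inspect the list to pick the right form

filled_form <- set_values(pgform,
                          user_login = "your email here",
                          password = "your password here")

# The session still carries the cookies, so submit through it
logged_in <- submit_form(pgsession, filled_form)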
I am having real difficulty with the rvest package in R. I am trying to navigate to a particular webpage after hitting an "I Agree" button on the first webpage; the link to the starting page is in the code below. The code attempts to reach the next webpage, which has a form to fill out in order to obtain the data I need to extract.
url <- "http://wonder.cdc.gov/mcd-icd10.html"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[3]]
new_session <- html_session(submit_form(pgsession, pgform)$url)
pgform_new <- html_form(new_session)
The last line does not obtain the HTML form for the next webpage and gives me the following error in R:
Error in read_xml.response(x$response, ..., as_html = as_html) :
server error: (500) Internal Server Error
I would very much appreciate any help, both with getting to the next webpage and with submitting a form to obtain data. Thanks so much for your time!
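One thing worth checking, offered as a sketch rather than a confirmed fix: html_session(submit_form(pgsession, pgform)$url) opens a brand-new session on the redirect URL and drops the cookies that the "I Agree" submission set, which is a plausible cause of the 500 error. Keeping the session that submit_form() returns avoids that:

library(rvest)

url <- "http://wonder.cdc.gov/mcd-icd10.html"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[3]]

# submit_form() already returns a session positioned on the next page,
# with the agreement cookie attached, so keep using it directly
new_session <- submit_form(pgsession, pgform)
pgform_new <- html_form(new_session)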
I have a problem with logging in in my script. Despite all the good answers I found on Stack Overflow, none of the solutions worked for me.
I am scraping a web forum for my PhD research; its URL is http://forum.axishistory.com.
The webpage I want to scrape is the memberlist, a page that lists the links to all member profiles. One can only access the memberlist if logged in. If you try to access it without logging in, it shows you the login form.
The URL of the memberlist is this: http://forum.axishistory.com/memberlist.php.
I tried the httr package:
library(httr)
members <- GET("http://forum.axishistory.com/memberlist.php", authenticate("username", "password"))
members_html <- html(members)
The output is the login form.
Then I tried RCurl:
library(RCurl)
members_html <- htmlParse(getURL("http://forum.axishistory.com/memberlist.php", userpwd = "username:password"))
members_html
The output is the login form, again.
Then I tried the list() approach from this topic, Scrape password-protected website in R:
handle <- handle("http://forum.axishistory.com/")
path <- "ucp.php?mode=login"
login <- list(
  amember_login = "username",
  amember_pass = "password",
  amember_redirect_url = "http://forum.axishistory.com/memberlist.php"
)
response <- POST(handle = handle, path = path, body = login)
And again, the output is the login form.
The next thing I am working on is RSelenium, but after all these attempts I am trying to figure out whether I am missing something (probably something completely obvious).
I have looked at other relevant posts on here, but couldn't figure out how to apply the code to my case:
How to use R to download a zipped file from a SSL page that requires cookies
Scrape password-protected website in R
https://stackoverflow.com/questions/27485311/scrape-password-protected-https-website-in-r
Web scraping password protected website using R
Thanks to Simon, I found the answer here: Using rvest or httr to log in to non-standard forms on a webpage
library(rvest)

url <- "http://forum.axishistory.com/memberlist.php"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[2]]
filled_form <- set_values(pgform,
                          "username" = "username",
                          "password" = "password")
submit_form(pgsession, filled_form)

memberlist <- jump_to(pgsession, "http://forum.axishistory.com/memberlist.php")
page <- read_html(memberlist)  # html() is deprecated in current rvest; read_html() replaces it
usernames <- html_nodes(x = page, css = "#memberlist .username")
data_usernames <- html_text(usernames, trim = TRUE)
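Since the goal was the links to the member profiles, the href attributes can be pulled from the same nodes; a small sketch, assuming the .username matches are anchor elements (worth verifying in the page source):

# Relative profile links from the username anchors
profile_links <- html_attr(usernames, "href")

# Resolve them against the forum's base URL
profile_urls <- xml2::url_absolute(profile_links, "http://forum.axishistory.com/")
head(profile_urls)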