I have a website that I'm trying to scrape. I've found the HTML tag that I want to scrape and copied its XPath, which looks like this:
/html/body/div[1]/div[2]/main/div/div[2]/div[1]/div[2]/div[3]/ul
Next I use rvest:
library(rvest)
my_website <- read_html("http://www...") # a full link to the website
links <- html_nodes(my_website, "/html/body/div[1]/div[2]/main/div/div[2]/div[1]/div[2]/div[3]/ul/")
and I get the following error:
Error in tokenize(css) : Unexpected character '/' found at position 1.
I'm not really fluent in web scraping, so could anyone please explain to me what I am doing wrong?
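The error message comes from the CSS tokenizer: the second positional argument of html_nodes() is interpreted as a CSS selector, so an XPath has to be passed through the named xpath argument. The trailing slash at the end of the expression is also not valid XPath. A minimal sketch, assuming a toy page in place of the real site (the markup and the shortened XPath below are made up for illustration):

```r
library(rvest)

# Toy page standing in for the real site (hypothetical markup).
page <- read_html(
  "<html><body><div><ul>
     <li><a href='/a'>A</a></li>
     <li><a href='/b'>B</a></li>
   </ul></div></body></html>"
)

# Name the argument: the second positional argument is a CSS selector.
# Note that the XPath does not end with a slash.
list_node <- html_nodes(page, xpath = "/html/body/div/ul")

# Pull the link targets out of the matched list using a CSS selector.
hrefs <- html_attr(html_nodes(list_node, "a"), "href")
```

With the real page, the full copied XPath (without the final "/") would go in the xpath argument instead.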
Related
I am trying to scrape the http://www.emedexpert.com/lists/brand-generic.shtml web page for brand and generic drug names
library(httr)
library(rvest)
session <- read_html("http://www.emedexpert.com/lists/brand-generic.shtml")
form1 <- html_form(session)[[2]]
form2 <- set_values(form1, brand = "tylenol")
submit_form(session, form2)
however this results in the error message:
Error in xml2::url_absolute(form$url, session$url) :
not compatible with STRSXP
Therefore, based on this answer to the same error message ("Error: not compatible with STRSXP" on submit_form with rvest) I added a session$url as follows:
session$url <- "http://www.emedexpert.com/lists/brand-generic.shtml" # added from S.Ov
but I still get the same error message. So I also tried setting form2$url to various values, such as these:
form2$url <- "http://www.emedexpert.com/lists/brand-generic.shtml"
form2$url <- ""
form2$url <- "/"
submit_form(session, form2)
At this point, the error message goes away and I obtain a web page which contains most of the desired content. However, it completely lacks the table of brand and generic names.
Any suggestions?
Yes @hackR, RSelenium is not always the answer.
library(rvest)
url<-"http://www.emedexpert.com/lists/bg.php?myc"
page<-html_session(url)
table<-html_table(read_html(page))[[1]]
This could help you I hope.
I'm trying to use the xml2 package to scrape a few tables from ESPN.com. As an example, I'd like to scrape the week 7 fantasy quarterback rankings into R. The URL is:
http://www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-quarterback-rankings
I'm trying to use the "read_html()" function to do this because it is what I am most familiar with. Here is my syntax and its error:
> wk.7.qb.rk = read_html("www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks", which = 1)
Error: 'www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks' does not exist in current working directory ('C:/Users/Brandon/Documents/Fantasy/Football/Daily').
I've also tried "read_xml()", only to get the same error:
> wk.7.qb.rk = read_xml("www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks", which = 1)
Error: 'www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks' does not exist in current working directory ('C:/Users/Brandon/Documents/Fantasy/Football/Daily').
Why is R looking for this URL in the working directory? I've tried this function with other URLs and had some success. What is it about this specific URL that makes it look in a different location than it does for others? And, how do I change that?
I got this error while running read_html in a loop to navigate through 20 pages. After the 20th page the loop kept running with no URLs left, so it started calling read_html with NA for the remaining iterations. Hope this helps!
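Both cases have the same root: as far as I know, read_html() and read_xml() treat a string that does not look like a URL as a local file path, which is why the working directory shows up in the error. A minimal sketch of a guard that filters out such inputs before fetching (is_web_url is a hypothetical helper, and the shortened URLs are only examples):

```r
# Hypothetical helper: TRUE only for non-missing strings that start
# with http:// or https://.
is_web_url <- function(x) !is.na(x) & grepl("^https?://", x)

urls <- c(
  "www.espn.com/fantasy/football/story",        # no scheme: read as a file path
  "http://www.espn.com/fantasy/football/story", # has a scheme: read as a URL
  NA                                            # e.g. a loop that ran past the last page
)

ok <- is_web_url(urls)

# Only these entries would be passed on to read_html().
to_fetch <- urls[ok]
```

For the ESPN example, prefixing the address with "http://" should be enough for it to be fetched as a URL.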
I am trying to scrape my trip history data from the Capital Bikeshare website. I have to log in and go to the trips menu to see the data, but I get this error:
No encoding supplied: defaulting to UTF-8.
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"xml_document"’
Here's my code.
> library(httr)
> library(XML)
> handle <- handle("https://www.capitalbikeshare.com/")
> path <-"profile/trips"
> login <- list( profile_login="myemail", profile_pass ="mypassword", profile_redirect_url="https://secure.capitalbikeshare.com/profile/trips/QNURCMF2Q6")
> response <- POST(handle = handle, path = path, body = login)
> readHTMLTable(content(response))
I also tried using rvest, but then I kept getting the error "Error: Unknown field names: _username, _password". Which field names should I use here? I tried id, name, etc., and it still didn't work.
For a start, the member login page is different from the intro page which you have listed above.
This may not be correct but try this as a possible rvest starting point:
login<-"https://secure.capitalbikeshare.com/profile/login"
library(rvest)
pgsession<-html_session(login)
pgform<-html_form(pgsession)[[1]]
#update user id and password in the next line
filled_form<-set_values(pgform, "_username"="myemail@gmail.com", "_password"="password")
submit_form(pgsession, filled_form)
Once you are logged in, you can use the jump_to function to move to the desired pages:
page<-jump_to(pgsession, newurl) #newurl will be the address where to go to next.
Hope this helps, if this does not work, leave a comment and I'll delete the post.
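On the readHTMLTable error in the question: httr's content() parses an HTML response into an xml2 document by default, and readHTMLTable from the XML package has no method for that class. One possible workaround is to ask for the body as text and parse it with the XML package itself. A minimal sketch, assuming a made-up table in place of the real response (with httr the string would come from content(response, as = "text")):

```r
library(XML)

# Toy HTML standing in for the text of the HTTP response (hypothetical data).
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Start</th><th>End</th></tr>",
  "<tr><td>A</td><td>B</td></tr>",
  "<tr><td>C</td><td>D</td></tr>",
  "</table></body></html>"
)

# Parse the text with the XML package so readHTMLTable gets a class it knows.
doc <- htmlParse(html_text, asText = TRUE)

# Returns a list with one data frame per table found in the document.
trips <- readHTMLTable(doc, stringsAsFactors = FALSE)
```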
I'm trying to use the set_values() function to insert a company name using this website:
https://www.unternehmensregister.de/ureg/search1.4.html
Unfortunately, after
search <- html_form(read_html("https://www.unternehmensregister.de/ureg/search1.4.html"))[[1]]
the command
set_values(search, searchRegisterForm:companyPublicationsCompanyName - "Daimler")
gives an error.
Error in set_values(search, searchRegisterForm:companyPublicationsCompanyName - :
object 'searchRegisterForm:companyPublicationsCompanyName' not found
It would be great if someone can help me with that!
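The error suggests R is evaluating searchRegisterForm:companyPublicationsCompanyName - "Daimler" as an expression (a sequence between two objects, minus a string) rather than as a named argument. A field name containing a colon has to be quoted in backticks, and the value is assigned with = rather than -. A minimal sketch using a stand-in function instead of set_values(), so it runs without the website (show_args is a hypothetical helper):

```r
# Stand-in for set_values(): simply returns its named arguments
# (hypothetical helper, used only to show how the name is parsed).
show_args <- function(...) list(...)

# Backticks quote the non-syntactic name; `=` (not `-`) assigns the value.
vals <- show_args(`searchRegisterForm:companyPublicationsCompanyName` = "Daimler")

# The argument name arrives intact, colon included.
field_names <- names(vals)
```

The same quoting should carry over to the real call, i.e. set_values(search, `searchRegisterForm:companyPublicationsCompanyName` = "Daimler"), assuming that is the field's name in the form.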
I am trying and failing to use RCurl to automate the process of fetching a spreadsheet from a web site, China Labour Bulletin's Strike Map.
Here is the URL for the spreadsheet with the options set as I'd like them:
http://strikemap.clb.org.hk/strikes/api.v4/export?FromYear=2011&FromMonth=1&ToYear=2015&ToMonth=6&_lang=en
Here is the code I'm using:
library(RCurl)
temp <- tempfile()
temp <- getForm("http://strikemap.clb.org.hk/strikes/api.v4/export",
FromYear="2011", FromMonth="1",
ToYear="2015", ToMonth="6",
_lang="en")
And here is the error message I get in response:
Error: unexpected input in:
" ToYear=2015, ToMonth=6,
_"
Any suggestions on how to get this to work?
Try enclosing _lang in backticks.
temp <- getForm("http://strikemap.clb.org.hk/strikes/api.v4/export",
FromYear="2011",
FromMonth="1",
ToYear="2015",
ToMonth="6",
`_lang`="en")
I think R has trouble with an argument name starting with an underscore: it is not a syntactically valid name unless it is quoted in backticks. This worked for me.
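Another way to sidestep the naming problem is to keep the parameters in a named list, where names are ordinary strings and may start with an underscore. A minimal sketch that rebuilds the export URL from the question in base R (the variable names are my own):

```r
base_url <- "http://strikemap.clb.org.hk/strikes/api.v4/export"

# Names given as strings inside list() do not have to be valid R identifiers.
params <- list(
  FromYear  = "2011",
  FromMonth = "1",
  ToYear    = "2015",
  ToMonth   = "6",
  "_lang"   = "en"
)

# Join name=value pairs with "&" to form the query string.
query <- paste0(names(params), "=", unlist(params), collapse = "&")
full_url <- paste0(base_url, "?", query)
```

This sketch does no URL encoding, which is fine for these plain values but would need handling for values with spaces or special characters.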