Error using XML package in R

I am gathering data about different universities and I have a question about the following error, which I get after executing the code below. The problem occurs when using htmlParse().
Code:
library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse()
url1 <- "http://nces.ed.gov/collegenavigator/?id=165015"
webpage1 <- getURL(url1)
doc1 <- htmlParse(webpage1)
Output:
Error in htmlParse(webpage1) : File
!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
html xmlns="http://www.w3.org/1999/xhtml" head id="ctl00_hd"meta http-equiv="Content-type" content="text/html;charset=UTF-8" /title
College Navigator - National Center for Education Statistics
/titlelink href="css/md0.css" type="text/css" rel="stylesheet" meta name="keywords" content="college navigator,college search,postsecondary education,postsecondary statistics,NCES,IPEDS,college locator"/meta meta name="description" content="College Navigator is a free consumer information tool designed to help students, parents, high school counselors, and others get information about over 7,000 postsecondary institutions in the United States - such as programs offered, retention and graduation rates, prices, aid available, degrees awarded, campus safety, and accreditation."meta>meta name="robots" content="index,nofollow"/metalink
I have scraped web pages with this package before and never had an issue. Does the name="robots" have anything to do with it? Any help would be greatly appreciated.

http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fnces.ed.gov%2Fcollegenavigator%2F%3Fid%3D165015
indicates the webpage is badly formed. Your browser can compensate for this but your R package is having problems.
If you are using Windows, you can get the Internet Explorer browser to fix it for you as follows:
library(rcom)
library(XML)
ie = comCreateObject('InternetExplorer.Application')
ie[["visible"]] = TRUE # true for debugging
comInvoke(ie,"Navigate2","http://nces.ed.gov/collegenavigator/?id=165015")
# wait until the browser has finished loading the page
while(comGetProperty(ie,"busy")||comGetProperty(ie,"ReadyState")<4){
  Sys.sleep(1)
  print(comGetProperty(ie,"ReadyState"))
}
myDoc<-comGetProperty(ie,"Document")
webpage1<-myDoc$getElementsByTagName('html')[[0]][['innerHTML']]
ie$Quit()
doc1 <- htmlParse(webpage1)
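A possible alternative, offered as a sketch rather than a confirmed fix for this page: the error message begins with "File" followed by the page content, which suggests htmlParse() treated the string as a file name. Passing asText = TRUE tells it that the argument is the HTML itself. The small HTML string below is made up for illustration and stands in for the result of getURL().

```r
library(XML)

# Made-up, slightly malformed HTML (leading newline, unclosed <p> tags);
# it stands in for the string returned by getURL().
html_text <- paste0(
  "\n<html><head><title>Demo page</title></head>",
  "<body><p>First paragraph<p>Second paragraph</body></html>"
)

# asText = TRUE: treat the argument as HTML content, not as a file name or URL
doc <- htmlParse(html_text, asText = TRUE)

# Pull a few values out of the parsed tree with XPath
page_title <- xpathSApply(doc, "//title", xmlValue)
paragraphs <- xpathSApply(doc, "//p", xmlValue)
```

With the real page, the equivalent call would be htmlParse(webpage1, asText = TRUE); whether that alone is enough for this particular site is an assumption.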

Related

Webscraping "How did you contribute to OpenStreetMap" with rvest

I would like to scrape all the information from "How did you contribute to OpenStreetMap?" (https://hdyc.neis-one.org/). It is necessary to log in to OSM in order to access a user profile.
Since there are quite a lot of profiles that need to be scraped, I want to automatically scrape the list using the rvest package (https://rvest.tidyverse.org/).
So far I attempted to do this:
> library(rvest)
> url <- "https://hdyc.neis-one.org/?mrsensible"
> pgsession <- session(url)
> pgsession
<session> https://hdyc.neis-one.org/?mrsensible
Status: 200
Type: text/html
Size: 4245
When I tried to read the information in my OSM record with read_html(url), this is what I got:
> read_html(url)
{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<meta ...
[2] <body onload="init();">\n <div class="copyright">Copyright © <a target="_bl ...
So it doesn't really capture the information.
Would it be possible to scrape the data using rvest?
Many thanks in advance!

How to use R (rvest or XML or RCurl) to scrape data from website like makemytrip

For instance, I want to scrape flight data for flights operating between Chicago (ORD) and New Delhi (DEL). I would search for the flights on makemytrip and this is the URL that gets generated - http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408
When I am trying to read this HTML page using rvest package, this is what I get -
htmlpage<-read_html("http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408")
htmlpage
{xml_document}
<html>
[1] <head>\n <meta http-equiv="Content-Type" cont ...
[2] <body onload="done_loading();">\n\n <div id= ...
myhtml<-html_nodes(htmlpage,".flight_info")
> myhtml
{xml_nodeset (0)}
I need help parsing/scraping this data and understanding what is going wrong here.
Thanks !

Not able to send the pander table intact in Gmail using sendmailR

Please find below my code that I am using to share my analysis (dataframe) with my friend in R. I am using sendmailR package and pander:
library(sendmailR)
from <- "<me@gmail.com>"
to <- "<friend@gmail.com>"
subject <- "Important Report of the Day!!"
body <- "This is the result of the test:"
mailControl=list(smtpServer="ASPMX.L.GOOGLE.COM")
#-----------------------------------------------------
msg_content <- mime_part(paste('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
</head>
<body><pre>', paste(pander_return(pander(vvv, style="multiline")), collapse = '\n'), '</pre></body>
</html>'))
msg_content[["headers"]][["Content-Type"]] <- "text/html"
sendmail(from=from,to=to,subject=subject,msg=msg_content,control=mailControl)
The problem is that in the email the table is split into two parts (an 8-column table and a 4-column table).
How do I change my code so that my table of 12 columns remains intact?
After adding this line
panderOptions('table.split.table', Inf)
this is the email that I am getting: [image]
You have to increase or disable the default maximum width of the resulting markdown table, either via the split.tables argument of pandoc.table (which also works in the pander call, since pander passes that argument on to pandoc.table) or by updating the global options via panderOptions.
Quick example on updating your pander call:
paste(pander_return(pander(vvv, split.tables = Inf)), collapse = '\n')
Or set that globally for all future pander calls:
panderOptions('table.split.table', Inf)
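To see the effect of split.tables without sending any mail, here is a minimal sketch. The data frame wide_df and its column names are invented for illustration; only pander_return, style and split.tables come from the code above.

```r
library(pander)

# Hypothetical 12-column data frame with long column names, so the table
# is wider than pander's default maximum table width
wide_df <- as.data.frame(matrix(1:24, nrow = 2, ncol = 12))
names(wide_df) <- paste0("measurement_", 1:12)

# Default behaviour: the table is split into several narrower tables
default_lines <- pander_return(wide_df, style = "multiline")

# split.tables = Inf: all 12 columns stay in a single table
intact_lines <- pander_return(wide_df, style = "multiline", split.tables = Inf)

# Compare how many lines of markdown each version produces
length(default_lines)
length(intact_lines)
```

The intact version is what you would then paste into the pre block of the HTML mail.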

Column widths not aligned with table data in pander tables sent from R with sendmailR

I'm working with the 'pander' and 'sendmailR' packages to send a small data frame in the body of an email, rather than as an attachment. I'd like to send it from and to a Gmail account.
I'm close, but the column headers won't align with the columns themselves in the email body the way they do in RStudio, for example. Basically, the column headers are too wide to line up with the data columns below them.
It seems the problem is the way the dashes and whitespaces are compressed in various email clients (I tried this in gmail, yahoo and hotmail through the web and through the email client that ships with OS X Mavericks). I was able to remedy the problem in my OS X email client by going to 'preferences' and checking the box labeled 'use fixed-width font for plain-text messages' but I'd like it to work on multiple devices, with multiple clients, etc for many of my coworkers so I'm wondering if there's a way that doesn't involve global email settings.
Here is the code to reproduce the problem:
library(sendmailR) # for emails from R
library(pander) # for table-formatting that does not require HTML
results <- head(iris)
pander(results) # widths look great so far...
a = pandoc.table.return(results)
strsplit(a, "\n") # widths still look great...
panderOptions('table.split.table', Inf) # show all columns on same line
msg_content <- mime_part(
pandoc.table.return(results, style = "multiline")
)
# I'm using my own gmail address for email_from and email_to
sendmail(from = email_from,
to = email_to,
subject = "test",
msg = msg_content
)
… and the email received has the problem described above.
Next you can see an image which illustrates the problem:
The problem with plain-text e-mails and markdown tables is that the e-mail client usually displays the text in a non-fixed-width font, and you have to use custom settings in all your e-mail clients to override that (like you did with your OS X e-mail client). On the other hand, that's why HTML mails are trending :)
So let's create a HTML mail and include the markdown table in a pre block:
msg_content <- mime_part(paste('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
</head>
<body><pre>', paste(pander_return(results, style = "multiline"), collapse = '\n'), '</pre></body>
</html>'))
Due to a bug in sendmailR, we have to override the Content-type to HTML:
msg_content[["headers"]][["Content-Type"]] <- "text/html"
And now it's ready to be sent via the command you used in your example, resulting in:
The table should look similarly fine in any other HTML-capable e-mail client. Please note that this way you could also use HTML tables instead of markdown if that would fit your needs better.
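If you want to try the HTML table route, here is a rough base-R sketch. The helper df_to_html_table is hypothetical (it is not part of pander or sendmailR) and only does minimal escaping, so treat it as a starting point.

```r
# Hypothetical helper: turn a data frame into a simple HTML <table> string
df_to_html_table <- function(df) {
  # Escape the characters that are special in HTML
  escape <- function(x) {
    x <- gsub("&", "&amp;", as.character(x), fixed = TRUE)
    x <- gsub("<", "&lt;", x, fixed = TRUE)
    gsub(">", "&gt;", x, fixed = TRUE)
  }
  header <- paste0("<th>", escape(names(df)), "</th>", collapse = "")
  # Build one string of <td> cells per row
  rows <- vapply(seq_len(nrow(df)), function(i) {
    cells <- unlist(lapply(df[i, , drop = FALSE], as.character))
    paste0("<td>", escape(cells), "</td>", collapse = "")
  }, character(1))
  paste0("<table border=\"1\">\n<tr>", header, "</tr>\n",
         paste0("<tr>", rows, "</tr>", collapse = "\n"),
         "\n</table>")
}

html_table <- df_to_html_table(head(iris))
cat(html_table)
```

The resulting string could go between the body tags in place of the pre block, with the Content-Type header overridden in the same way.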

Downloading a RDA file from Github [duplicate]

The package for O'Reilly's new Learning R book (called "learningr") does not work in R v3. Fortunately, the dataset I want from the package, english_monarchs.rda, is on the package's GitHub page.
However, for the life of me I cannot figure out how to download the rda file. This is my best attempt:
> library(RCurl)
>
> x <- getURL("https://github.com/richierocks/learningr/blob/master/data/english_monarchs.rda"); x
[1] "\n\n\n<!DOCTYPE html>\n<html>\n <head prefix=\"og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# githubog: http://ogp.me/ns/fb/githubog#\">\n <meta charset='utf-8'>\n <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n <title>learningr/data/english_monarchs.rda at master · richierocks/learningr · GitHub</title>\n <link rel=\"search\" type=\"application/opensearchdescription+xml\" href=\"/opensearch.xml\" title=\"GitHub\" />
It goes on like this through all the html of the page, I cut it short since you get the point. I get the html but not the file itself.
Any help would be much appreciated.
Did you try clicking on "View Raw"?
There may be a better way to do this, but if you want to do this entirely automatically/within R:
library(RCurl)
## paste URL to make it easier to read code (cosmetic!)
dat_url <- paste0("https://raw.github.com/richierocks/",
"learningr/master/data/english_monarchs.rda")
f <- getBinaryURL(dat_url)  # download the file as a raw vector
L <- load(rawConnection(f)) # L holds the names of the loaded objects
(To deal with the redirection, I downloaded the file in Firefox and then asked Firefox to copy the actual download link.)
By the way, are you sure learningr doesn't work with R 3.+ ? I followed the installation instructions at https://github.com/richierocks/learningr/blob/master/README.md with R-devel and they seemed to work fine ...
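If you want to convince yourself that load() really can read an .rda from raw bytes, here is a small offline sketch using base R only. The data frame monarchs_demo is made up and stands in for the downloaded dataset; readBin() on a local temp file stands in for getBinaryURL().

```r
# Made-up stand-in for the real dataset
monarchs_demo <- data.frame(name = c("A", "B"), start = c(1L, 2L))

# Write it to a temporary .rda file, then read the file back as raw bytes
rda_path <- tempfile(fileext = ".rda")
save(monarchs_demo, file = rda_path)
rda_bytes <- readBin(rda_path, what = "raw", n = file.info(rda_path)$size)

# Remove the original so we can see that load() restores it
rm(monarchs_demo)

# Load straight from the raw vector, as with the bytes from getBinaryURL()
con <- rawConnection(rda_bytes)
loaded_names <- load(con)
close(con)

loaded_names
```

As another option, download.file(dat_url, destfile, mode = "wb") followed by load(destfile) should also work, assuming the raw URL is reachable from your machine.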
