xmlTreeParse and html content - r

I can't get (web scrape) the HTML tree content of a common page with products using the R function xmlTreeParse.
I have loaded the libraries RCurl and XML.
myurln3 <- "www.amazon.com/s?k=router+hand+plane+cheap&i=arts-crafts-intl-ship&ref=nb_sb_noss"
html_page <- xmlTreeParse(myurln3, useInternalNodes = TRUE)
Error: XML content does not seem to be XML:
'www.amazon.com/s?k=router+hand+plane+cheap&i=arts-crafts-intl-ship&ref=nb_sb_noss'
I expect to scrape the page and get the full HTML structure.

I'm back to web scraping with R after some other projects, and still having problems.
> library(XML)
Warning message:
package 'XML' was built under R version 3.5.3
> my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"
> html_page99 <- htmlTreeParse(my_url99, useInternalNodes = TRUE)
Warning message:
XML content does not seem to be XML: 'https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2'
> head(html_page99)
Error in `[.XMLInternalDocument`(x, seq_len(n)) :
No method for subsetting an XMLInternalDocument with integer
> html_page99
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2</p></body></html>
But I need to scrape the above page with its full content, I mean the content with the $ sign on the left (maybe that's not the best description) and all the tags.
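What is going on: htmlTreeParse/xmlTreeParse do not fetch https URLs themselves, so the URL string is treated as literal content, which is exactly what the parsed output above shows (the first attempt also lacked the http:// scheme entirely). A minimal sketch of a workaround, downloading the raw HTML with RCurl first and handing it to the parser (note that Amazon may still block or throttle non-browser clients):
library(RCurl)
library(XML)
my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"
# pretend to be a browser; followlocation handles redirects
raw_html <- getURL(my_url99, followlocation = TRUE, useragent = "Mozilla/5.0")
html_page99 <- htmlTreeParse(raw_html, useInternalNodes = TRUE)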

Related

Using readtext to extract text from XML

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. I'm interested in the text within the tag "regtext" in this and other similar XML files.
I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:
regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Start tag expected, '<' not found [4]
I've tried to search the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated--I'm too much of a novice and all of the documentation on readtext does not give examples of how to work with XML.
Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) :
The xml format does not fit for the extraction without xPath
Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I tried to specify an xPath expression on a single file; this did not return any errors, but it didn't actually extract any text (even though there should be plenty of text within the "regtext" node):
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")
I end up with a dataframe with the correct doc_id, but no text.
From the error messages, the readtext function appears to be converting the XML file into a plain text document, which the XML package then does not accept as a valid document.
It is also likely that the XML parser is case-sensitive and distinguishes "regtext" from "REGTEXT".
Here is a solution using the xml2 package. (I find this package provides a simpler interface and is easier to use.)
library(xml2)
url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)
#parse out the nodes within the "REGTEXT" sections
regtext <- xml_find_all(page, ".//REGTEXT")
#convert the REGTEXT nodes into a vector of strings
xml_text(regtext)
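If you want a readtext-style result, you can then assemble the strings into a data frame yourself (a small sketch building on the code above; the doc_id value is just illustrative):
# one row per REGTEXT node, mirroring readtext's doc_id/text layout
regtext_df <- data.frame(doc_id = basename(url),
                         text = xml_text(regtext),
                         stringsAsFactors = FALSE)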

readHTMLTable() returning a List of 0

I am trying to read a table using the readHTMLTable() function in R, but I get the result List of 0 and the error message below:
"XML content does not seem to be XML:
'https://www.forbes.com/powerful-brands/list/#tab:rank' "
I have already loaded library(XML) and library(RCurl) before calling readHTMLTable().
These are the options I have tried so far:
library(XML)
Forbes=readHTMLTable("https://www.forbes.com/powerful-brands/list/#tab:rank",as.data.frame = TRUE)
Another way:
library(XML)
library(RCurl)
URL<- "https://www.forbes.com/powerful-brands/list/#tab:rank"
Forbeslist <- readHTMLTable(getURL(URL))
This gives the error message below:
"XML content does not seem to be XML:
'https://www.forbes.com/powerful-brands/list/#tab:rank' "
The table on the site is generated by a script. You can see this if you disable scripts in a browser, or just download the page with wget https://www.forbes.com/powerful-brands/list/#tab:rank. R doesn't execute scripts, so it doesn't see any table.
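You can confirm this from R (a minimal sketch using rvest; treat it as illustrative, since the site may also block the request outright):
library(rvest)
page <- read_html("https://www.forbes.com/powerful-brands/list/#tab:rank")
# the static HTML contains no table nodes, so readHTMLTable has nothing to find
html_nodes(page, "table")
To scrape a script-generated table you need something that executes JavaScript, such as RSelenium, or you can look in the browser's network tab for the JSON endpoint the script calls and request that directly.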

List XML files in web server directory and subdirectories

I'm trying to get a list of all the XML documents in a web server directory (and all its subdirectories).
I've tried these examples:
One:
library(XML)
url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"
getHTMLLinks(url)
Returns character(0) with the warning: XML content does not seem to be XML
Two:
readHTMLTable(url)
Returns the same error.
I've tried other sites as well, like those included in the examples. I saw some SO questions (example) about this error saying to change https to http. When I do that I get Error: failed to load external entity.
Is there a way I can get a list of all the XML files at that URL and all the subdirectories using R?
To get the raw html from the page:
require(rvest)
url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"
html <- read_html(url)
Then we'll get all the links using html_nodes. The link names displayed on the page are truncated, so we need to get the href attribute rather than just using html_table().
data <- html %>% html_nodes("a") %>% html_attr('href')
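From there you can keep only the .xml links and walk one level into the subdirectories (a sketch, assuming the server renders a standard directory index where folder links end in a trailing slash):
# links to XML files in the top-level directory
xml_files <- paste0(url, grep("\\.xml$", data, value = TRUE))
# links that look like subdirectories, minus the parent/root links
subdirs <- setdiff(grep("/$", data, value = TRUE), c("/", "../"))
# collect the .xml links from each subdirectory as well
sub_xml <- unlist(lapply(paste0(url, subdirs), function(u) {
  links <- read_html(u) %>% html_nodes("a") %>% html_attr("href")
  paste0(u, grep("\\.xml$", links, value = TRUE))
}))
all_xml <- c(xml_files, sub_xml)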

Parse kml file with R

Edit: In fact, it appears that htmlTreeParse does not parse KML files well. In that case, xmlTreeParse is what is needed.
I am trying to parse a huge KML file in R. My issue arises when I want to use XPath to "navigate" through the nodes of the tree. However I approach the problem, I can't manage to do it, as the functions are made for XML and HTML files.
My final goal is to get a list of strings of all the nodes under the Placemark node.
# parse kml file:
pc2 <- htmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml")
pc3 <- htmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml", useInternalNodes = T)
# doesn't work
pc2["//#Placemark"]
# doesn't work either
xpathApply(pc3, "//#Placemark")
Is there a way to do it, or does the KML file block everything?
So far, the only way I found was doing it manually with calls to the nodes, but that is not best practice.
pc4 <- htmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml")$doc$children$kml ....
+ for loop
Edit: There is a strange effect here: when I download the file, it is a KML file beginning with a kml tag. When I use htmlTreeParse, it adds an HTML level:
<!DOCTYPE html PUBLIC "-//EN" "http://www.w3">
<?xml version="1.0" encoding="UTF-8"?>
<!-- comment here-->
<html>
<body>
<kml xmlns="http://www.opengis.net/kml/2.2">
<document>
my document here
</document></kml></body></html>
And the HTML parser reacts strangely to this. To correct this, I used xmlTreeParse and it works fine in the end.
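For reference, here is how the XPath query can work once the file is parsed as XML (a sketch; KML declares a default namespace, so XPath needs an explicit prefix mapping, and Placemark is an element rather than an attribute, so the # in //#Placemark was also part of the problem):
library(XML)
pc5 <- xmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml", useInternalNodes = TRUE)
ns <- c(kml = "http://www.opengis.net/kml/2.2")
# all Placemark nodes
placemarks <- xpathApply(pc5, "//kml:Placemark", namespaces = ns)
# e.g. the name of each placemark, as a character vector
xpathSApply(pc5, "//kml:Placemark/kml:name", xmlValue, namespaces = ns)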

Web scraping Airbnb with R (rvest, XML) - hidden html \n?

I am scraping an Airbnb page using rvest.
My objective is to get the number of listings of a user (on the lower left-hand side of the web page) as well as the links for each listing.
However, it seems that Airbnb is blocking access to the source or something. I am a bit lost.
1) Using SelectorGadget and rvest, I have identified the node I'm interested in. Here is my entire code:
library(rvest)
URL = "https://www.airbnb.com/users/show/..."
# put any user id instead of ...
source = read_html(URL)
source %>% html_nodes(".row-space-3") %>% .[[1]] %>% html_text()
And here is my (disappointing) output:
[1] "\n "
Looking at the web page's source code, I should get "Listings (2)". Here it is:
<div class="listings row-space-2 row-space-top-4">
<h2 class="row-space-3">
Listings
<small>(2)</small>
</h2>
What is happening?
PS:
2) I noticed that when I try to get the source code by brute force with XML, a whole section is missing compared to the source code in Chrome or Firefox:
library(XML)
library(RCurl)
URL = "https://www.airbnb.com/users/show/..."
parsed <- htmlParse(getURL(URL), asText = TRUE, encoding = "UTF-8")
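If the missing section is built by JavaScript after the initial HTML loads (which would explain both the empty "\n" output and the missing chunk of source), a static fetch will always miss it. A sketch using RSelenium to drive a real browser so the scripts run before scraping (assumes a working Selenium/browser setup; the selector is the one from above):
library(RSelenium)
library(rvest)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://www.airbnb.com/users/show/...")  # your user id here
Sys.sleep(5)  # crude wait for the page scripts to finish rendering
page <- read_html(remDr$getPageSource()[[1]])
page %>% html_nodes(".row-space-3") %>% html_text()
remDr$close()
rD$server$stop()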
