XPath node determination in R

I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:
r <- htmlTreeParse(e)  ## e is the page source from getURL
g.k <- (r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]])
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {  ## ldply is from the plyr package
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow, referenced as above?
I've come to
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes / the structure of the document, besides using View Source? I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want, though still as XML with <br> tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it produces a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page (I'm trying to get the names, and only the names, from the page):
getURL("http://legeforeningen.no/id/1712")

I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(the subsetting is done in XPath), which leaves a final line that is not a name. One could do the text processing in XPath, too, but then one would iterate at the R level:
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
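If the comma-free lines matter, one workaround (a sketch on a fabricated vector standing in for the extracted text nodes) is to do the splitting in R, where lines without a comma pass through sub() unchanged:

```r
# Stand-in for unlist(xpathApply(xml, "//p[4]/text()", xmlValue))
lines <- c("Ola Nordmann, Oslo", "Kari Nordmann, Bergen", "NN uten komma")
names <- sub(",.*$", "", lines)  # strips from the first comma; no-op otherwise
```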

Use a mixture of XPath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use XPath to retrieve the <p> elements.
#Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
str_split_fixed(all_names, ", ", 2)
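As a sanity check, here are the same split-and-clean steps run on a fabricated fragment (the markup and names below are made up, standing in for the real page):

```r
# Fabricated <p> fragment mimicking the page's structure
p_html <- "<p>Ola Nordmann, Oslo<br />Kari Nordmann, Bergen<br /></p>"
all_names <- gsub("</?p>", "", p_html)           # strip the <p> wrapper
all_names <- strsplit(all_names, "<br />")[[1]]  # split on the line breaks
all_names <- all_names[nzchar(all_names)]        # drop empty pieces
```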

Related

Malformed JSON missing a comma separator, Insert comma in R

I'm new to R and have a JSON file, containing data I'm hoping to convert to an R data frame, that was scraped in the following format. The data was scraped incorrectly: no commas were inserted to separate the entries. I've tried reading the data in with scan and separating it into a list (to then read into a data frame) with this code:
indices <- grep(":[{", x, fixed = TRUE)
n <- length(indices)
l <- vector("list", n)
for (i in 1:n) {
  ps <- substr(x, indices[[i]], indices[i + 1])
  l[[i]] <- ps
}
But I am getting empty strings and NaN values. I've tried parsing with jsonlite, tidyjson, and rjson, without any luck (which makes sense, since the JSON is malformed). This article seems to match my JSON's structure, but the solution isn't working because of the missing commas. How would I insert a comma before every instance of {"entries":[ in R when the file is read in as one string?
UPDATE: here are the first, second, and third entries:
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/shiffman/Presentation-Manager","name":"Presentation-Manager","lang":"JavaScript","desc":"Simple web app to manage student presentation schedule.","stars":17,"forks":"15","updated":"2021-01-19T15:28:55Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/A2Z-F20","name":"A2Z-F20","lang":"JavaScript","desc":"ITP Course Programming from A to Z Fall 2020","stars":40,"forks":"31","updated":"2020-12-21T13:52:58Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/RunwayML-Object-Detection","name":"RunwayML-Object-Detection","lang":"JavaScript","desc":"Object detection model with RunwayML, node.js, and p5.js","stars":16,"forks":"2","updated":"2020-11-15T23:36:36Z","info":[]},{"url":"/shiffman/ShapeClassifierCNN","name":"ShapeClassifierCNN","lang":"JavaScript","desc":"test code for new tutorial","stars":11,"forks":"1","updated":"2020-11-06T15:02:26Z","info":[]},{"url":"/shiffman/Bot-Code-of-Conduct","name":"Bot-Code-of-Conduct","desc":"Code of Conduct to guide ethical bot making practices","stars":15,"forks":"1","updated":"2020-10-15T18:30:26Z","info":[]},{"url":"/shiffman/Twitter-Bot-A2Z","name":"Twitter-Bot-A2Z","lang":"JavaScript","desc":"New twitter bot examples","stars":26,"forks":"2","updated":"2020-10-13T16:17:45Z","info":["hacktoberfest","!DOCTYPE html \"\""],"repo_url":"/shiffman?tab=repositories"}
You can use
gsub('}{"entries":[', '},{"entries":[', x, fixed=TRUE)
So, this is a plain replacement of every }{"entries":[ with },{"entries":[.
Note the fixed=TRUE parameter, which disables the regex engine so the pattern is matched as a literal string.
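Putting it together on a tiny made-up string (the two one-entry objects below stand in for the real scraped lines): after the replacement, wrapping the whole thing in [ ... ] yields a single valid JSON array, which jsonlite::fromJSON could then parse.

```r
# Two top-level objects fused together, as in the scraped file
x <- '{"entries":[{"id":1}]}{"entries":[{"id":2}]}'
fixed <- gsub('}{"entries":[', '},{"entries":[', x, fixed = TRUE)
json <- paste0("[", fixed, "]")  # wrap so the objects form one array
```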

Function to list elements that are lists

Let's say I have a list in R that contains other lists, and I want to remove certain characters (the comma in the following example) from the elements of each list.
my.list <- list(c("hello , world ", "hello world,,," ),c("123,456", "1,234"))
The following gets the job done
gsub(",", "", my.list[[1]])
gsub(",", "", my.list[[2]])
but how do I do it more efficiently, as my actual list is long? I tried the following, but it gives me strange results:
lapply(my.list, function(x) gsub(",","",my.list))
any help? thx
You might be able to utilize the function filter_element from the textclean package.
I would suggest sanitizing your data before combining it into a list. Once the data has been sanitized, then start entering it into lists.
You can check out the documentation for filter_element on page 10 in the following PDF.
https://cran.r-project.org/web/packages/textclean/textclean.pdf
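For completeness: the lapply attempt in the question fails because the anonymous function refers to my.list instead of its argument x. A minimal base-R fix, no extra package needed:

```r
my.list <- list(c("hello , world ", "hello world,,,"), c("123,456", "1,234"))
# The function must operate on x, the element lapply hands it
cleaned <- lapply(my.list, function(x) gsub(",", "", x))
```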

Vector will not accept result (returns it as blank)

I have the following code:
library(rvest)  # read_html, html_nodes, html_text (and the %>% pipe)
webpage <- "https://www.dotabuff.com/heroes"
heroes <- read_html(webpage) %>%
  html_nodes("div.name") %>%
  html_text()
heroes <- sapply(gsub(" ", "-", heroes), tolower)
It pulls a list of names from this website. When I run this code, it correctly parses all the hero names as lowercase and with words separated by hyphens.
When I run this code:
cat(webpage,heroes[i],sep="/")
with i being the index of the element I want (I intend to use it in a for loop), it correctly prints the URL I expect. However, when I do
var <- cat(webpage,heroes[i],sep="/")
it tells me that var is NULL and does not have a value. It also will not assign that value to anything in the for loop either, presenting it as NULL.
I've also tried
var <- toString(cat(webpage,heroes[i],sep="/"))
but that didn't work either (same issue)
What am I missing here?
I'm running this all in https://rstudio.cloud, for context. Is it something with the environment? I would have thought this would be simple.
I think you need to use paste instead of cat. The cat function is intended for printing, whereas the paste function is for concatenating strings:
var <- paste(webpage, heroes[i], sep="/")
As you can see in the help page ?cat, cat does not return anything (NULL). So the behavior is as documented.
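A minimal sketch of the difference (the hero names here are stand-ins for the scraped vector): paste returns the string and is vectorized, so the for loop can often be dropped entirely, while the value of a cat call is always NULL.

```r
webpage <- "https://www.dotabuff.com/heroes"
heroes <- c("abaddon", "alchemist")        # stand-in for the scraped names
urls <- paste(webpage, heroes, sep = "/")  # returns a character vector
# cat(webpage, heroes[1], sep = "/")       # prints, but returns NULL
```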

R: Add paste() elements to file

I'm using base::paste in a for loop:
for (k in 1:length(summary$pro)) {
  if (k == 1)
    mp <- summary$pro[k]
  else
    mp <- paste(mp, summary$pro[k], sep = ",")
}
mp comes out as one big string, where the elements are separated by commas.
For example mp is "1,2,3,4,5,6"
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI
I've also tried converting mp to a list, and strsplit()-ing it, neither of which have worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers, quotes, and such, you can use the more direct cat. For example,
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is a bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste).
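To get the one-row-per-vector CSV layout asked about, write.table also works if you hand it a matrix instead of a pre-pasted string (summary here is a hypothetical stand-in with the same shape as in the question):

```r
summary <- list(pro = 1:6, me = 7:12)  # hypothetical data
# rbind makes each vector a row; each element lands in its own column
write.table(rbind(summary$pro, summary$me),
            file = "recompdata.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
```

As an aside, paste(summary$pro, collapse = ",") builds the same comma-separated string as the loop in the question, in one call.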

Extracting node information

Using the XML library, I have parsed a web page
basicInfo <- htmlParse(myURL, isURL = TRUE)
the relevant section of which is
<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship Pres'd by Dr Pep</h1><img class="tour-logo" alt="Nationwide Tour" src="http://a.espncdn.com/i/golf/leaderboard11/logo-nationwide-tour.png"/></div>
I can manage to extract the tournament name
tourney <- xpathSApply(basicInfo, "//*/div[@class='col-left']", xmlValue)
but I also wish to know which tour it is from, using the alt attribute. In this case I want to get the result "Nationwide Tour".
TIA, and apologies for the scrolling required.
I don't know R, but I'm pretty good with XPath. Try this:
tourney_name <- xpathSApply(basicInfo, "//*/div[@class='col-left']/h1/text()", xmlValue)
tourney_loc <- xpathSApply(basicInfo, "//*/div[@class='col-left']/img/@alt", xmlValue)
Note the use of "@" to extract attributes and text() to extract text nodes (it looks like R did this automatically for you); my revised tourney_name XPath should do the same thing, but it makes clearer which part is being extracted.
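The same attribute can also be pulled with xmlGetAttr; here is a self-contained sketch using a fragment of the markup from the question:

```r
library(XML)
html <- '<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship</h1><img class="tour-logo" alt="Nationwide Tour" src="logo.png"/></div>'
doc <- htmlParse(html, asText = TRUE)
# Apply xmlGetAttr to each matched <img> node to read its alt attribute
tour <- xpathSApply(doc, "//div[@class='col-left']/img", xmlGetAttr, "alt")
```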
