NodeSet as character - r

I want to get a NodeSet with the getNodeSet function from the XML package and write it as text to a file.
For example:
> getNodeSet(htmlParse("http://www.google.fr/"), "//div[@id='hplogo']")[[1]]
<div title="Google" align="left" id="hplogo" onload="window.lol&&lol()" style="height:110px;width:276px;background:url(/images/srpr/logo9w.png) no-repeat">
<div nowrap="" style="color:#777;font-size:16px;font-weight:bold;position:relative;top:70px;left:218px">France</div>
</div>
I want to save this whole node unchanged to a file.
The problem is that the object can't be written directly with:
writeLines(getNodeSet(...), file)
And as.character(getNodeSet(...)) returns a C pointer.
How can I do this? Thank you.

To save an XML object to a file, use saveXML, e.g.,
url = "http://www.google.fr/"
nodes = getNodeSet(htmlParse(url), "//div[@id='hplogo']")[[1]]
fl <- saveXML(nodes, tempfile())
readLines(fl)
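saveXML also accepts an explicit path, so the node can be written straight to a file of your choosing (the file name here is only illustrative):
saveXML(nodes, file = "hplogo.html")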

There has to be a better way; until then, you can capture what the print method for an XMLNode outputs:
nodes <- getNodeSet(...)
sapply(nodes, function(x)paste(capture.output(print(x)), collapse = ""))
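To get that into a file, the captured strings can go straight to writeLines. A minimal sketch, assuming nodes is a list returned by getNodeSet() as above (collapsing with "\n" to preserve line breaks; the output file name is only illustrative):
library(XML)
nodes <- getNodeSet(htmlParse("http://www.google.fr/"), "//div[@id='hplogo']")
# capture.output() grabs what the node's print method would show on screen
txt <- sapply(nodes, function(x) paste(capture.output(print(x)), collapse = "\n"))
writeLines(txt, "hplogo.html")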

I know this might be a bit outdated, but I ran into the same problem and wanted to leave this for future reference. After some searching and struggling, the answer is as simple as:
htmlnodes <- toString(nodes)
writeLines(htmlnodes, file)

Related

Find sequencing reads with insertions longer than a given number

I'm trying to isolate, from a BAM file, the sequencing reads that have insertions longer than some number (let's say 50 bp). I guess I can do that using the CIGAR string, but I don't know an easy way to parse it and keep only the reads that I want. This is what I need:
Read1 -> 2M1I89M53I2M
Read2 -> 2M1I144M
I should keep only Read1.
Thanks!
Most likely I'm late, but ...
Probably you want the MC tag, not CIGAR. I use BWA, and information on insertions is stored in the MC tag. But I may be mistaken (per the SAM spec, MC holds the mate's CIGAR string; a read's own insertions are in its regular CIGAR, so check which read carries the insertions you care about).
Use the pysam module to parse the BAM file, and regular expressions to parse the MC tags.
Example code:
import pysam
import re

input_file = pysam.AlignmentFile('input.bam', 'rb')
output_file = pysam.AlignmentFile('found.bam', 'wb', template=input_file)

for Read in input_file:
    # skip reads that carry no MC tag at all
    try:
        TagMC = Read.get_tag('MC')
    except KeyError:
        continue
    # pull out every insertion operation, e.g. '53I'
    InsertionsTags = re.findall(r'\d+I', TagMC)
    if not InsertionsTags:
        continue
    InsertionLengths = [int(Item[:-1]) for Item in InsertionsTags]
    # keep the read if its longest insertion exceeds 50 bp
    # (using min() here would wrongly drop Read1 from the example,
    # whose insertions are 1 and 53)
    if max(InsertionLengths) > 50:
        output_file.write(Read)

input_file.close()
output_file.close()
Hope that helps.

Loop works outside a function but inside a function it doesn't

I've been going around in circles for hours with this. This is my first question online about R. I'm trying to create a function that contains a loop. The function takes a vector that the user submits, as in pollutantmean(4:6), then loads a bunch of CSV files (in the directory mentioned) and binds them. What is strange (to me) is that if I assign the variable id and then run the loop without using a function, it works! When I put it inside a function, so that the user can supply the id vector, it does nothing. Can someone help? Thank you!!!
library(stringr)  # for str_pad

pollutantmean <- function(id = 1:332) {
  # read files
  allfiles <- data.frame()
  id <- str_pad(id, 3, pad = "0")
  direct <- "/Users/ped/Documents/LearningR/"
  for (i in id) {
    path <- paste(direct, "/", i, ".csv", sep = "")
    file <- read.csv(path)
    allfiles <- rbind(allfiles, file)
  }
}
Your function is missing a return value. (@Roland)
library(stringr)  # for str_pad

pollutantmean <- function(id = 1:332) {
  # read files
  allfiles <- data.frame()
  id <- str_pad(id, 3, pad = "0")
  direct <- "/Users/ped/Documents/LearningR/"
  for (i in id) {
    path <- paste(direct, "/", i, ".csv", sep = "")
    file <- read.csv(path)
    allfiles <- rbind(allfiles, file)
  }
  return(allfiles)
}
Edit:
Your mistake was that you did not specify in your function what you want to get out of it. In R, you create objects inside a function (you can think of it as a separate environment) and then specify which object the function should return.
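To make the point concrete, a small sketch (not from the original thread): a for loop evaluates to NULL, and a function returns its last evaluated expression, which is why the loop-only version appears to do nothing:
f_loop <- function() {
  x <- 0
  for (i in 1:3) x <- x + i   # the loop is the last expression -> invisible NULL
}
f_value <- function() {
  x <- 0
  for (i in 1:3) x <- x + i
  x                           # make x the last expression -> it gets returned
}
f_loop()    # prints nothing useful
f_value()   # [1] 6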
Consider even an lapply and do.call approach, which does not need return() as the last line of the function:
library(stringr)  # for str_pad

pollutantmean <- function(id = 1:332) {
  id <- str_pad(id, 3, pad = "0")
  direct_files <- paste0("/Users/ped/Documents/LearningR/", id, ".csv")
  # READ FILES INTO LIST AND ROW BIND; as the last expression, this is returned
  do.call(rbind, lapply(direct_files, read.csv))
}
OK, I got it. I was expecting the files that are built to actually be created and show up in the environment of R, but for some reason they don't, even though R still does all the calculations. Thanks a lot for the replies!!!!
library(stringr)  # for str_pad

pollutantmean <- function(directory, pollutant, id) {
  # read files
  allfiles <- data.frame()
  id2 <- str_pad(id, 3, pad = "0")
  direct <- paste("/Users/pedroalbuquerque/Documents/Learning R/", directory, sep = "")
  for (i in id2) {
    path <- paste(direct, "/", i, ".csv", sep = "")
    file <- read.csv(path)
    allfiles <- rbind(allfiles, file)
  }
  # average the pollutant column, ignoring missing values
  mean(allfiles[, pollutant], na.rm = TRUE)
}
pollutantmean("specdata","nitrate",23:35)

Splitting an htmlParse'd HTML document while preserving the class

I'd like to scrape phone numbers from this French public directory. The thing is, it can return multiple answers and I'd like to get them all, but I have a problem splitting the parsed HTML document.
Here is my code:
library(httr)
library(XML)
library(foreach)

# example url for reproducibility
url_ <- "http://www.pagesjaunes.fr/recherche/departement/zc-de-vignolles-beaune-21/pagot-&-savoie---espace-aubade"
response <- GET(url_)
doc <- content(response, type = "text/html", encoding = "UTF-8")
parseddoc <- htmlParse(doc)
# I think the problem lies in this next line, let's call it "line A":
boxes <- xpathSApply(parseddoc, "//article[@class='bi-bloc blocs clearfix bi-pro']")
foreach(box = boxes) %do% {
  # and also in this line, let's call it "line B":
  return_line$PJ_phone_number <- xpathApply(box, "//div[@class='item bi-contact-tel']", xmlValue)
}
I've tested line A: the xpathSApply() call gets all the nodes matching the XPath "//article[@class='bi-bloc blocs clearfix bi-pro']" (basically each result box from the search on the website) and puts them into a list, which I then loop over with foreach. (I've tested this.)
However, for line B to work, box needs to be of class "XMLInternalDocument" (parseddoc, for instance, has class "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument"). But in line A, xpathSApply() returns a list of objects of class "XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode".
So my question is: how can I have line A "split" the parts of parseddoc that I need while keeping the same class, XMLInternalDocument?
I hope I'm clear enough. Thanks.
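For reference, a hedged sketch of one way out (not part of the original thread): the XML package evaluates an XPath starting with ".//" relative to the node it is given, so there is no need to coerce each box back into a full document; querying each node with a relative path works directly:
boxes <- getNodeSet(parseddoc, "//article[@class='bi-bloc blocs clearfix bi-pro']")
phone_numbers <- lapply(boxes, function(box) {
  # ".//" keeps the search inside this box; "//" would search the whole document
  xpathSApply(box, ".//div[@class='item bi-contact-tel']", xmlValue)
})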

R xpathSApply: extracting an attribute gives an empty result

I'm trying to parse out the value of the "NAME" attribute in an XML document in R.
<NN ID_NAME="107232" ID_NTYP="6" NAME="dSpace_ECat1Error.STS" KOMMENTAR="dSpace_ECat1Error.STS" IS_SYSTEM="0" IS_LOCKED="0" DTYP="Ganzzahl" ADIM="" AFMT=""/><NN ID_NAME="107233" ID_NTYP="6" NAME="dSpace_ECat2Error.STS" KOMMENTAR="dSpace_ECat2Error.STS" IS_SYSTEM="0" IS_LOCKED="0" DTYP="Ganzzahl" ADIM="" AFMT=""/>
The result should be like this:
dSpace_ECat1Error.STS
dSpace_ECat2Error.STS
I use this function:
xpathSApply(root, "//NN[@NAME]", xmlValue)
But as a result I just get empty "" (quotes).
What have I done wrong?
Thanks in advance!
I just found out, by using:
erg <- xpathSApply(root, "//NN", xmlGetAttr, 'NAME')
The original query returned "" because xmlValue gives a node's text content, and the NN elements here are empty; attribute values need xmlGetAttr. There should be a better tutorial for this particular XML function in R....
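A self-contained version of that answer, assuming the two NN elements above are wrapped in some root element so the snippet parses:
library(XML)
txt <- '<ROOT>
  <NN ID_NAME="107232" ID_NTYP="6" NAME="dSpace_ECat1Error.STS"/>
  <NN ID_NAME="107233" ID_NTYP="6" NAME="dSpace_ECat2Error.STS"/>
</ROOT>'
root <- xmlRoot(xmlParse(txt))
xpathSApply(root, "//NN", xmlGetAttr, 'NAME')
# [1] "dSpace_ECat1Error.STS" "dSpace_ECat2Error.STS"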

Removing data with tags from a vector

I have a string vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
I want to remove these tags and get the following vector:
abc <- "welcome Have fun"
Try
> gsub("(<[^>]*>)", "", abc)
What this says is "substitute every instance of <, followed by anything that isn't a >, up to a >, with nothing".
You can't just do gsub("<.*>", "", abc) because regexps are greedy, and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
This solution might fail if you've got > inside your tags - but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
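To see the greedy/non-greedy difference on the example string:
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
gsub("(<[^>]*>)", "", abc)  # "welcome abc Have fun!"  (each tag removed separately)
gsub("<.*>", "", abc)       # "welcome  Have fun!"     (greedy match also eats 'abc')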
You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse. You can then convert it to text, i.e., strip all the tags, with xmlValue.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
library(XML)
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
xmlValue( xmlRoot(doc) )
If you also want to remove the contents of the tags (as in your expected output, which drops 'abc'), you can use xmlDOMApply to transform the XML tree.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)
