What's wrong with my XML for document ranking? - asp.net

I wrote a program in C# to calculate TF-IDF to rank documents.
I used the following XML to store the word frequencies within documents. I was criticised heavily for using this structure. Even though I use the text of the word as the tag name, to me it's efficient and consumes less space. Also, I can search it easily with XDocument since it's a nice tree structure. Can you help me understand why I was criticised so heavily?
Criticism: How can you put information inside the meta-data? (To me, it's innovative.)
<word>
<siddhartha>
<doc1> 4 </doc1>
<doc2> 5 </doc2>
</siddhartha>
<inspiration>
<doc1> 4 </doc1>
<doc6> 5 </doc6>
</inspiration>
....
</word>
Someone suggested something like this instead:
<word>
<text> siddhartha </text>
<doc1> 4 </doc1>
<text> inspiration </text>
<doc1> 4 </doc1>
...
</word>

Your structure, with the word name as a node, will be hard to parse with generic parsers. There is no defined structure: you need to read the whole document to know it.
I would have done something like this (I tried to stay close to your idea):
<words>
<word id="siddhartha">
<freq id="doc1"> 4 </freq>
<freq id="doc2"> 5 </freq>
</word>
....
</words>
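With the ids as attributes, one generic XDocument/LINQ query covers every word, which keeps the easy searching the original structure was aiming for. A minimal C# sketch, assuming the suggested layout is saved as a (hypothetical) frequencies.xml:

using System;
using System.Linq;
using System.Xml.Linq;

class TfIdfLookup
{
    static void Main()
    {
        var doc = XDocument.Load("frequencies.xml");

        // One query handles any word/document pair; no per-word tag names needed.
        int freq = (int)doc.Descendants("word")
            .Where(w => (string)w.Attribute("id") == "siddhartha")
            .Descendants("freq")
            .First(f => (string)f.Attribute("id") == "doc1");

        Console.WriteLine(freq); // 4, per the sample above
    }
}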

Related

How to loop through xml nodes in R

I have a requirement to split an xml document into multiple nodes, and then split each node separately into more sub-nodes. I am using the xpathSApply/getNodeSet functions in the XML package. But it seems like once the xml document is split into nodes, each node is considered of class "internal node" and hence I cannot perform XPath operations on it unless I save it as XML using saveXML(). Any ideas on how this can be worked out without having to do a saveXML()?
For example, consider sample xml below:
<array>
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11111</StreetNumber>
<StreetName>111th</StreetName>
<StreetSuffix>Avenue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pierce</County>
</GeographicData>
</ResidentialProperty>
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11211</StreetNumber>
<StreetName>11111334th</StreetName>
<StreetSuffix>Av1enue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2017-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pie2rce</County>
</GeographicData>
</ResidentialProperty>
</array>
I am intending to split the above into:
1. Two separate nodes with root ResidentialProperty
2. Then be able to perform XPATH operations on each of these nodes.
P.S: This is sample data and not similar to the actual data set I am working with. Just tried to use this to explain the problem I am trying to solve.
EDIT: I think I've misunderstood the question. New approach.
We use xpathApply, toString.XMLNode and xmlParseString to extract specific nodes into 2 objects.
Parse the XML file and extract the nodes:
library(XML)
doc=xmlParse("pathtoyourXML.xml")
result1=xmlParseString(toString.XMLNode(xpathApply(doc,"(//ResidentialProperty)[1]")))
result2=xmlParseString(toString.XMLNode(xpathApply(doc,"(//ResidentialProperty)[2]")))
We have 2 objects, we evaluate them with :
from.result1=xpathApply(result1,"//StreetAddress")
from.result2=xpathApply(result2,"//StreetAddress")
Sidenote : your XML is not valid. Listings elements are not closed.
EDIT 2: In fact, you can use xpathApply on a previously "extracted" node:
foo=xpathApply(doc,"(//ResidentialProperty)[2]")
xpathApply(foo[[1]],"//StreetAddress")
Note, however, that the inner //StreetAddress is evaluated against the whole document, not just the node in foo, because // starts from the document root.
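To scope the query to the extracted node itself, a relative XPath (".//" instead of "//") does the trick, with no saveXML() round trip. A short sketch with the XML package, assuming the sample file parses (see the validity note above):

library(XML)

doc <- xmlParse("pathtoyourXML.xml")

# getNodeSet returns one internal node per ResidentialProperty
props <- getNodeSet(doc, "//ResidentialProperty")

# ".//" keeps the search inside the given node instead of the whole document
xpathSApply(props[[1]], ".//StreetAddress/StreetName", xmlValue)
xpathSApply(props[[2]], ".//GeographicData/County", xmlValue)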

Obtaining the QAST of a Perl 6 file from another program

This is related to this question on accessing the POD, but it goes further than that. You can easily access the Abstract Syntax Tree of a Perl 6 program using:
perl6 --target=ast -e '"Þor is mighty!".say'
This will print the whole Q abstract syntax tree. It's not clear how to do this from your own program, though, or at least I haven't found how. In fact, the CoreHackers::Q module runs it as an external script. But being able to access it from your own program, like
use QAST; # It does not exist
my $this-qast = QAST::Load("some-external-file.p6") # Would want something like this
would be great. I'm pretty sure it should be possible, at the NQP level and probably in a Rakudo-dependent way. Does someone know how to do it?
Since QAST is not a part of the Perl 6 language specification, but an internal implementation detail of Rakudo, there's no official way to do this. Eventually there will be an AST form that is part of the language specification, but it doesn't exist yet (the 007 project is exploring this area).
It is, however, possible to obtain the QAST tree by using:
use nqp;
my $ast = nqp::getcomp("perl6").eval("say 42", :target<ast>);
say $ast.dump();
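To get closer to the QAST::Load("some-external-file.p6") idea from the question, you could presumably slurp the file and hand its source to the same call. An untested sketch along those lines:

use nqp;
my $source = slurp("some-external-file.p6");
my $ast = nqp::getcomp("perl6").eval($source, :target<ast>);
say $ast.dump();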

Atom data-grammar syntax for keybindings

Can someone give a full explanation of the syntax for Atom's data-grammar attribute (used in keybinding selectors)?
For instance, what is the difference between
[data-grammar='source example']
and
[data-grammar~='source example']
?
Also, how do you specify multiple grammars? For instance, how would you specify that a key binding should be limited to html or xml formats?
If there already exists documentation on this somewhere, I have not yet found it, but would appreciate being pointed to it.
Quick Example:
keymap.cson:
"atom-text-editor[data-grammar='text tex latex']":
'f5':'latex:build'
Grammar Information & Documentation
I began by looking at the file-types package. source and text categorize languages: source covers development languages, while text covers documentation and log formats.
You can add and customize language recognition by reading the flight manual. I've linked some specific sections below that are helpful for that.
Flight Manual | Basic Customization:
Language Recognition
Language Specific Settings
Working with [data-grammar]:
The little documentation given is listed under the Keymaps in Depth section.
Flight Manual | Keymaps in Depth
Selectors and Custom Packages.
This also describes the not([...]) functionality used below and how to manipulate various rules.
Although the grammars above are listed in dot format, i.e. source.c, to use them in the [data-grammar='<name>'] format you must replace the dots with spaces.
An example of how I might use the data grammar option in my keymap.cson config is as such (here I'm using the latex package):
"atom-text-editor[data-grammar='text tex latex']":
'f5':'latex:build'
The ~ does not negate the match. To exclude a grammar, use something like "atom-text-editor:not([data-grammar='<name>'])":
Note that you wouldn't use data-grammar in something like config.cson. The syntax for language specifics looks something like this instead:
# config.cson
".latex.tex.text":
  editor:
    softWrap: true
Extra useful information - List of registered Grammars
A dump of the output of Object.keys(atom.grammars.grammarsByScopeName).sort().join('\n') through the Dev Console (View > Developer > Toggle Developer Tools > Console)
source.c
source.cake
source.clojure
source.coffee
source.cpp
source.cs
source.css
source.css.less
source.css.scss
source.csx
source.diff
source.gfm
source.git-config
source.go
source.gotemplate
source.java
source.java-properties
source.js
source.js.rails
source.js.jquery
source.js.regexp
source.js.regexp.replacement
source.json
source.litcoffee
source.makefile
source.nant-build
source.objc
source.objcpp
source.perl
source.perl6
source.plist
source.python
source.python.django
source.regexp.python
source.ruby
source.ruby.gemfile
source.ruby.rails
source.ruby.rails.rjs
source.sass
source.shell
source.sql
source.sql.mustache
source.sql.ruby
source.strings
source.toml
source.verilog
source.yaml
text.bibtex
text.git-commit
text.git-rebase
text.html.basic
text.html.erb
text.html.gohtml
text.html.jsp
text.html.mustache
text.html.php
text.html.ruby
text.hyperlink
text.junit-test-report
text.log.latex
text.plain
text.plain.null-grammar
text.python.console
text.python.traceback
text.shell-session
text.tex
text.tex.latex
text.tex.latex.beamer
text.tex.latex.memoir
text.todo
text.xml
text.xml.plist
text.xml.xsl
To complement Mr G's answer: in atom-text-editor[data-grammar~='html'], the ~= means: match an <atom-text-editor> HTML element whose data-grammar attribute contains "html" among its whitespace-separated words.
For example, if the current language of the file is PHP, then the text-editor HTML element will look something like this:
<atom-text-editor data-grammar="text html php">
And atom-text-editor[data-grammar~='html'] will match this.
More info on attribute selectors: https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors
As for trying to select multiple grammars, I don't think it's possible unless they share a common word in the data-grammar attribute, e.g. both HTML and PHP share "html", and both C and JavaScript share "source" (but so do many other grammars). The only other way is to specify the keymap for each grammar individually, even if it's the same key combination, as in the sketch below.
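For instance, to cover the html-or-xml case from the question, a keymap.cson could repeat the binding once per grammar. A hypothetical sketch (my-package:do-something is a placeholder command):

"atom-text-editor[data-grammar~='html']":
  'f5': 'my-package:do-something'
"atom-text-editor[data-grammar~='xml']":
  'f5': 'my-package:do-something'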

R - read html files within a folder, count frequency, and export output

I'm planning to use R to do some simple text mining tasks. Specifically, I would like to do the following:
Automatically read each html file within a folder, then
For each file, do frequency count of some particular words (e.g., "financial constraint" "oil export" etc.), then
Automatically write output to a .csv file using the following data structure (e.g., file 1 has "financial constraint" showing 3 times and "oil export" 4 times, etc.):
file_name count_financial_constraint count_oil_export
1 3 4
2 0 3
3 4 0
4 1 2
Can anyone please let me know where I should start? So far I think I've figured out how to clean the html files and do the count, but I'm still not sure how to automate the process (I really need this, as I have around 5 folders containing about 1000 html files each). Thanks!
Try this:
gethtml <- function(path = ".") {
  files <- list.files(path, pattern = "\\.html$", full.names = TRUE)
  htmlcount <- vector()
  for (i in files) {
    htmlcount[i] <- NA  ##### add a function here that reads the html file and counts the words
  }
  return(sum(htmlcount))
}
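One possible way to fill that placeholder in and produce the per-file, per-phrase CSV from the question is sketched below. It counts fixed phrases with plain-text matching (assuming the phrases never span HTML tags; the folder and output names are made up):

count_phrases <- function(path, phrases) {
  files <- list.files(path, pattern = "\\.html$", full.names = TRUE)
  rows <- t(sapply(files, function(f) {
    text <- paste(readLines(f, warn = FALSE), collapse = " ")
    sapply(phrases, function(p) {
      m <- gregexpr(p, text, fixed = TRUE)[[1]]
      sum(m > 0)  # gregexpr returns -1 when there is no match
    })
  }))
  data.frame(file_name = basename(files), rows, row.names = NULL)
}

result <- count_phrases("folder1", c("financial constraint", "oil export"))
write.csv(result, "word_counts.csv", row.names = FALSE)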
R is not intended for rigorous text parsing, so the tools for such tasks are limited. If you insist on doing it in R, you had better get familiar with regular expressions and have a look at this.
However, I highly recommend using Python with the beautifulsoup library, which is specifically designed for this task.

XSL, comparing dates to exclude any past events

I have an RSS of an events feed. I would like to hide previous events.
Assuming XML data subset of
<Navigation Name="ItemList" Type="Children">
<Page ID="x32444" URL="..." Title="Class..."
EventStartDate="20090831T23:00:00" EventEndDate="20090904T23:00:00"
EventStartTime="20090830T15:30:00" EventEndTime="20090830T18:30:00" Changed="20090830T20:28:31" CategoryIds="" Schema="Event"
Name="Class of 2010 BAKE SALE"/>
<Page ID="x32443" URL="x32443.xml?Preview=true&Site=&UserAgent=&IncludeAllPages=true&tfrm=4" Title="Class of 2010 BAKE SALE"
Abstract="Treat yourself with our famous 10-star FRIED ICE CREAM!" EventStartDate="20090831T23:00:00" EventEndDate="20090904T23:00:00"
EventStartTime="20090830T15:30:00" EventEndTime="20090830T18:30:00" Changed="20090830T20:25:35" CategoryIds="" Schema="Event"
Name="Class of 2010 BAKE SALE"/>
<Page ID="x32426" URL="x32426.xml?Preview=true&Site=&UserAgent=&IncludeAllPages=true&tfrm=4" Title="Tribute to ..."
Abstract="Event to recognize and celebrate the lifetime of leadership and service ..."
EventStartDate="20091206T00:00:00" EventEndDate="20091206T00:00:00" EventStartTime="20090828T23:00:00" EventEndTime="20090828T04:00:00"
Changed="20090828T22:09:54" CategoryIds="" Schema="Event" Name="Tribute to ...."/>
</Navigation>
How would I exclude anything before today's date?
<xsl:apply-templates select="Page[@EventStartDate = notBeforeToday()]"/>
Easiest with XSL parameters that you set from outside.
<xsl:param name="today" select="'undefined'" />
<!-- time passes... -->
<xsl:apply-templates select="Page[translate(@EventStartDate, 'T:', '') >= translate($today, 'T:', '')]"/>
Your date format sorts correctly once the 'T' and ':' separators are stripped with translate(), because what is left is a plain number that the >= operator can compare (in XSLT 1.0, < and > work on numbers, not strings). This holds unless there are different timezones involved. You would simply set
20091001T00:00:00
as the param value for $today. Have a look into your XSLT processor's documentation to see how.
The alternative would be to use an extension function. Here it depends on which extension functions your XSLT processor supports, so this approach won't be portable.
For this purpose, I usually add an extra date attribute in the XML which contains the day number since the year 1900,
for example @dateid='9876543' or @seconds='9876675446545'.
Then I can easily compare it with today or another variable in the XSL.
You can also use this technique to compare times, using "Unix time" for example.
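A sketch of that comparison, assuming a @dateid attribute on each Page and a $todayid param carrying today's day number:

<xsl:param name="todayid" select="0" />
<xsl:apply-templates select="Page[@dateid >= $todayid]"/>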
