I'm just learning how to program in Ruby using the Nokogiri gem.
doc.xpath("//*[@class='someclass']//@href")
will return all href values under the "someclass" class anywhere in the HTML.
doc.xpath("//*[@class='someclass']").xpath("//@href")
will return every href in the entire HTML.
Could someone explain how to use an equivalent of //@ within previously parsed data, so that something like:
doc.xpath("//*[@class='someclass']").xpath(grab all the href within previously parsed)
is possible?
Using * and @ seems to be quite powerful, but I can't seem to narrow the search down; wherever I use them, they search through the entire HTML.
as a beginner, I just thought it would be.. intuitive? to be able to use "grab from everywhere" type of syntax limited to what has been parsed previously to narrow down my target, so I can do something like
xpath(whatever).css(whatever).xpath(whatever)
maybe this is not a good practice? maybe with more understanding of parsing concept I would never have to do this? sometimes I find using both xpath and CSS easier..
hopefully someone can enlighten me..
Try changing your second expression from
doc.xpath("//*[@class='someclass']").xpath("//@href")
to
doc.xpath("//*[@class='someclass']").xpath(".//@href")
// at the beginning of an XPath expression means "descendants of the root of the document," whereas .// means "descendants of the context node(s)."
You're right that XPath is powerful, and some major aspects of it are intuitive... but there are significant pieces that aren't intuitive, or depend on how your intuition is trained. Careful study reaps dividends, especially if you are going to use XPath much!
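The same scoping idea can be sketched outside Nokogiri. The example below uses Python's standard-library ElementTree rather than Ruby, purely as an illustration; the sample markup is made up, and because ElementTree supports only a subset of XPath (it cannot select attribute nodes directly), the href values are read with get() instead of //@href. The point is that a path beginning with "." is evaluated relative to the node it is called on.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup (well-formed, so ElementTree can parse it).
SAMPLE = """
<html><body>
  <a href="/outside">outside</a>
  <div class="someclass">
    <a href="/inside-1">one</a>
    <p><a href="/inside-2">two</a></p>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Step 1: select the context nodes.
containers = root.findall(".//*[@class='someclass']")

# Step 2: search relative to each context node. The leading "." keeps
# the search inside that node instead of restarting at the document root.
scoped_hrefs = [
    el.get("href")
    for container in containers
    for el in container.findall(".//*[@href]")
]

# For comparison: the same search started from the document root.
all_hrefs = [el.get("href") for el in root.findall(".//*[@href]")]

print(scoped_hrefs)
print(all_hrefs)
```

The scoped list contains only the links inside the "someclass" element, while the root-level search also picks up the link outside it.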
I have been trying for days to move forward with this little code for getting the headers and the links of the news from a journal website.
using HTTP
function website_parser(website_url::AbstractString)
r = readstring(get(website_url))
splitted = split(r, "\n")
end
website_parser("https://www.nature.com/news/newsandviews")
The problem is that I could not figure out how to proceed on once I got the text from the website. How can I retrieve specific elements (as header and link of the news in this case)?
Any help is very much appreciated, thank you
You need some kind of HTML parsing. For only extracting the header, you probably can get away with regex, which are built in.
If it gets more complicated than that, regular expressions don't generalize, and you should use a full-fledged HTML parser. Gumbo.jl seems to be state of the art in Julia and has a rather simple interface.
In the latter case, it's unnecessary to split the document; in the former, splitting at least makes things more complicated, since then you have to think about line breaks. So it is better to parse first, then split.
Specific elements can be extracted using the Cascadia library.
For instance, elements with a given class can be selected via qs = eachmatch(Selector(".classID"),h.root), so that every element such as <div class="classID"> is returned in the query result (qs).
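As a rough illustration of "parse first, then extract", here is a minimal sketch using Python's standard-library html.parser (not Gumbo.jl or Cascadia). It assumes, hypothetically, that each headline is an <h3> containing an <a>; the real page's structure may differ, so the tag names and the sample markup are assumptions.

```python
from html.parser import HTMLParser


class HeadlineLinkParser(HTMLParser):
    """Collect (text, href) pairs for links nested inside <h3> headings."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.current_href = None
        self.text_parts = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_heading = True
        elif tag == "a" and self.in_heading:
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a headline link.
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            # Collapse line breaks and repeated spaces inside the headline.
            text = " ".join("".join(self.text_parts).split())
            self.results.append((text, self.current_href))
            self.current_href = None
        elif tag == "h3":
            self.in_heading = False


# Hypothetical sample; note the headline that spans two lines.
SAMPLE = """
<div><h3><a href="/news/1">First story</a></h3>
<p><a href="/about">About</a></p>
<h3><a href="/news/2">Second
story</a></h3></div>
"""

parser = HeadlineLinkParser()
parser.feed(SAMPLE)
parser.close()
print(parser.results)
```

Because the whole document is fed to the parser, the headline that wraps across a line break is still captured as one item, which is the problem that splitting on "\n" first would have introduced.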
I have a template class that grabs HTML and basically returns HTML to the caller. How do I test the caller using PHPUnit? Do I just assertTrue(is_string(call_function))? It seems like a stupid test, and I thought I might be testing it improperly.
Is the returned HTML supposed to be well-formed? If so you could validate it.
And/or, if a certain node or string of text is always supposed to be present, you could check for its existence using strpos, regexes, or a proper DOM parser.
This StackOverflow question gives you some ideas for ways to parse and query your HTML: How do you parse and process HTML/XML in PHP?
More generally, the way I usually approach how to test a function that returns a string is to use:
$html=call_function();
$this->assertEquals("dummy",$html);
Then it fails, but tells me the correct output, so I paste that in:
$html=call_function();
$expected=<<<EOD
<html>
...
</html>
EOD;
$this->assertEquals($expected,$html);
If it fails again I then study the differences between the two correct answers I have. If this is a good unit test should they really even be different? Do I want to use a mock object to replace some uncontrollable aspect of the system? (E.g. if the HTML it is returning is google search results, then maybe I want a mock object to simulate calling google, but always return exactly the same search results page.)
If the only differences are timestamps I might use regexes to hunt-and-destroy them, to give me a string that should always be the same, e.g.
$html=preg_replace('/\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/','[TIMESTAMP]',$html);
ADDITION
If the HTML string is very big, one alternative is to use md5() to reduce it to a short string. This will still warn you when something breaks, but the (big) downside is when it breaks you won't know where. If you are concerned about that then it is better to use the DOM approach (or its poor cousin, regexes) to just cherry-pick a few key parts of the HTML to test.
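The normalize-then-compare idea is not specific to PHP. Below is a minimal sketch in Python, reusing the same timestamp pattern as the preg_replace call above; the function names and the two sample strings are made up for illustration.

```python
import hashlib
import re

# Same pattern as the preg_replace example: "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalize(html):
    """Replace volatile timestamps with a fixed placeholder."""
    return TIMESTAMP.sub("[TIMESTAMP]", html)


def fingerprint(html):
    """Short digest of the normalized HTML, for very large outputs."""
    return hashlib.md5(normalize(html).encode("utf-8")).hexdigest()


# Two hypothetical runs that differ only in their timestamps.
run_1 = "<p>Generated 2024-01-02 03:04:05</p>"
run_2 = "<p>Generated 2025-06-07 08:09:10</p>"

print(normalize(run_1))
print(fingerprint(run_1) == fingerprint(run_2))
```

Normalizing before hashing matters: hashing the raw output would make the digest change on every run, which defeats the purpose of the test.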
#test is the selector for id="test"
.test is the selector for class="test"
But how do you remember which way round they are (e.g. that . is not the one for id)?
Well, in truth these things are so common that most people don't need mnemonics to remember them, but here's something I came up with, if it helps:
In terms of a filename a . and then an extension denotes a type of thing. There can be many different things of this type. With CSS, using classes you can denote a single style for many elements of the same type.
In terms of a URL, a # denotes an anchor link to a specific spot in the document. It refers to one location only. With CSS, using IDs you denote a single style for a single specific element.
If a police officer catches you with "hash," he will ask to see your ID. If not you get to stay classy. It's really dumb, but that's how I remember.
CallerID shows a phone #.
Periods are round like pearls, and-- Pearls are classy.
(P.S.: What's with all the "you learn differently than me, so you suck" comments? Goodness. Repetition is OK for me, but if I can visualize something I pick things up more quickly. In fact, the weirder something is the easier it is to memorize!)
I learned it the same way I learned that quotes (rather than parentheses) are used for attributes' values — by typing them a couple of times.
If you or someone you know gets tripped up by # vs ., though, consider that many programming languages use a . to access the members of a class-typed object.
I see I'm a day late (and maybe a dollar short), but I had the same problem in the early days and the following helped:
for the dot (.) as the selector for Class, I remembered it as: "My class always starts on the dot, not a minute early or late."
for the number sign (#) for ID, I just reminded myself that an ID(entification) card is incomplete without its number.
Spend lots of time writing CSS. When you've got it wrong enough times, your brain will give in and retain it.
So, I've got a bunch of markup-pages delivered that I am supposed to style. Problem is that tags are all in uppercase, even though the doctype declares it as xhtml. Not only is it ugly and hurting my eyes, it's also wrong, isn't it?
Is there a good way, perhaps a Coda (my preferred tool) plug-in or an online service, that can do this for me? Or can you do a regexp search-and-replace in Coda, and if so, how? (I'll be the first to admit that regexp isn't my cup o'Java.)
You could try figuring out how to do it in RegExp (something a little like /<([^>]*)>/g would be a bit too simplistic, but it's a start), but that still presents problems. Coda's RegExp just does find-and-replace, so I don't think you'd be able to do a "toLowerCase()" on each item found before replacing with Coda; you'd have to use some scripting language.
Another option is to download the W3C's Tidy application which I believe has "to uppercase" and "to lowercase" options, though I've not used it. It's here: http://tidy.sourceforge.net/#binaries (hasn't been updated in a while)
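If you do go the scripting route, the "lowercase each match before replacing" step could look roughly like this in Python. This is only a sketch under some assumptions: it lowercases tag names only (uppercase attribute names are left alone), and it does not know about comments, CDATA, or script content, so a real tool such as Tidy is likely the safer choice for anything messy. The function name and sample markup are made up.

```python
import re

# Matches "<" or "</" followed by a tag name; only the name is lowercased.
# "<!DOCTYPE ...>" is skipped because "!" is not a letter.
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tags(markup):
    """Return markup with every tag name converted to lowercase."""
    return TAG_NAME.sub(lambda m: m.group(1) + m.group(2).lower(), markup)


sample = '<DIV CLASS="Box"><P>Hello <B>World</B></P></DIV>'
print(lowercase_tags(sample))
```

The replacement is a function rather than a fixed string, which is exactly what a plain find-and-replace dialog cannot do.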
I asked a question earlier today regarding using Perl to search in a CSS document. I have since refined my requirements a little bit, and have a better idea of what I am trying to do.
The document I am searching through is actually an .html doc with CSS as a style in the <head>, if that makes sense.
Basically, what I need to do though is find all the CSS elements that have a color or background color attribute, and record them. Here's my thought process.
open the file and set it as an array
read the array line-by-line until it comes to a "{"
make everything into a scalar variable or array until I get to the "}"
search the secondary variable or string for instances of "color" blah blah blah.
The issue I am having is finding a way to scour the document and turn everything between { and } into a variable of some sort. Anyone have any ideas?
Cheers!
No matter what, I wouldn't recommend writing your own code from the ground up for this. You should use a parser. A quick search on CPAN suggests this family of modules. On the other hand, if your css is in an html file rather than a separate css file (shame on you), then you might end up needing a different type of parser.
Either way, it's generally not a good idea to try to hand-roll your own quasi-parser out of regular expressions. Use a proper parser, and leverage someone else's work.
On a slightly different tack, if you only want to extract some of the information from a file of any kind, then in many cases you don't want to put the whole file into an array first. (It can be memory intensive if the file is very large, and it's unnecessary.) It's easy to open the file and then process the items as you work through it line by line.
#!/usr/bin/env perl
use strict;
use warnings;
open my $fh, '<', 'file-of-interest'
or die "Couldn't open 'file-of-interest': $!";
my @saved_items;
while (my $line = <$fh>) {
    # process $line
    # push @saved_items, $something
}
# Do more fun stuff with @saved_items
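Purely to illustrate the asker's "everything between { and }" idea, here is a rough sketch in Python rather than Perl. It is exactly the kind of regex-based quasi-parser cautioned against above, so treat it as a toy: it assumes a simple stylesheet inside <style> tags with no comments, no nested blocks such as @media, and no braces inside strings. The function name and sample document are made up.

```python
import re

# Contents of each <style> element.
STYLE_BLOCK = re.compile(r"<style[^>]*>(.*?)</style>", re.S | re.I)
# A selector followed by a brace-delimited declaration body.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")
# "color" or "background-color" at the start of a declaration
# (so "border-color" is not picked up).
COLOR_DECL = re.compile(r"(?:^|;)\s*((?:background-)?color)\s*:\s*([^;]+)", re.I)


def color_rules(html):
    """Return (selector, property, value) for each color declaration."""
    found = []
    for css in STYLE_BLOCK.findall(html):
        for selector, body in RULE.findall(css):
            for prop, value in COLOR_DECL.findall(body):
                found.append((selector.strip(), prop.lower(), value.strip()))
    return found


# Hypothetical sample document.
SAMPLE = """<html><head><style type="text/css">
body { color: #333; margin: 0; }
.box { background-color: red; border-color: blue }
p { font-size: 12px; }
</style></head><body></body></html>"""

for rule in color_rules(SAMPLE):
    print(rule)
```

Anything beyond this simple shape is where a real CSS parser earns its keep.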
You could use the CSS module, which is available on CPAN.
I think this is really just the same question that you asked previously, although this time you didn't mention, as you did in a previous comment, that you don't think you are allowed to use modules.
The CSS module already does this. You can look at the source to see how they do it. That's the same answer I gave you last time too.
There isn't really any magic or secret way that everyone is hiding from you. Most times, if the module you find on CPAN could be simpler, it would be. However, without any more information that constrains your problem, a general solution like CSS (http://search.cpan.org/dist/CSS) is the way to go. Study that source or just lift it wholly into your script, although you might try some arguments to get some modules installed. If you could use the module, you might already be done and onto the next project. That's often a convincing argument. :)