Scraping a page to retrieve prices, currency code messing things up - web-scraping

I'm scraping a page using PHP Simple HTML DOM Parser and I want to retrieve the price. It's been going well except for a page that I've encountered, where the html reads:
<p class="was-price">Was: &#163;220.00</p>
I want to scrape the part that reads 220.00 and I am very confused about how to retrieve it. Thus far I have been using preg_replace() with great success to strip out text from a string, yet this is the first time I have come across a currency symbol in numeric format.
Today is the first day I have used preg_replace() and it's confusing to say the least. Can it be used to remove currency symbols in this way? Or should I be looking at another method? Thanks

Use html_entity_decode() to decode encoded html entities. Then you apply preg_replace().
$str = '<p class="was-price">Was: &#163;220.00</p>';
$str = html_entity_decode($str);
echo $str;
preg_replace(...);
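As a rough sketch of the decode-then-match idea, here is one way it could look. It assumes the pound sign arrives as the numeric entity &#163; and that the first decimal number in the text is the price. The $html sample and the choice of preg_match() over preg_replace() are illustrative assumptions, not part of the original answer.

```php
<?php
// Sample markup; the &#163; entity is an assumption about how the page encodes the pound sign.
$html = '<p class="was-price">Was: &#163;220.00</p>';

// Drop the tags, then decode entities so &#163; becomes a literal pound sign.
// Without decoding, the digits inside "&#163;" would be matched as the price.
$text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

// Capture the first number with an optional decimal part.
$price = null;
if (preg_match('/\d+(?:\.\d+)?/', $text, $matches)) {
    $price = $matches[0];
}

echo $price, "\n"; // 220.00
```

Capturing the number directly tends to be less fragile than stripping every unwanted character, since new symbols or labels on the page then need no extra rules.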

Related

How secure is the simple use of addslashes() and stripslashes() on code content?

I'm making an ad manager plugin for WordPress, so the advertisement code can be almost anything, from good code to dirty or even evil code.
I'm using simple sanitization like:
$get_content = '<script>/*code to destroy the site*/</script>';
//insert into db
$sanitized_code = addslashes( $get_content );
When viewing:
$fetched_data = /*slashed code*/;
//show as it's inserted
echo stripslashes( $fetched_data );
I'm avoiding base64_encode() and base64_decode() as I learned their performance is a bit slow.
Is that enough?
If not, what else should I do to protect the site and/or the database from an attack using bad ad code?
I'd love to get an explanation of why you are suggesting something, as it will help me decide the right thing in future too. Any help would be greatly appreciated.
addslashes followed by stripslashes is a round trip. You are echoing the original string exactly as it was submitted to you, so you are not protected from anything at all. '<script>/*code to destroy the site*/</script>' will be output exactly as-is to your web page, allowing your advertisers to do whatever they like in your web page's security context.
Normally when including submitted content in a web page, you should be using htmlspecialchars so that everything comes out as plain text and < just means a less-than sign.
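A minimal sketch of that escaping step, using the string from the question as the submitted content (the variable names are placeholders):

```php
<?php
// Submitted ad code, taken from the question's example.
$submitted = '<script>/*code to destroy the site*/</script>';

// Escape at output time so the browser shows the markup as text instead of running it.
$safe = htmlspecialchars($submitted, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
// prints: &lt;script&gt;/*code to destroy the site*/&lt;/script&gt;
```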
If you want an advertiser to be able to include markup, but not dangerous constructs like <script> then you need to parse the HTML, only allowing tags and attributes you know to be safe. This is complicated and difficult. Use an existing library such as HTMLPurifier to do it.
If you want an advertiser to be able to include markup with scripts, then you should put them in an iframe served from a different domain name, so they can't touch what's in your own page. Ads are usually done this way.
I don't know what you're hoping to do with addslashes. It is not the correct form of escaping for any particular injection context and it doesn't even remove difficult characters. There is almost never any reason to use it.
If you are using it on string content to build a SQL query containing that content then STOP, this isn't the proper way to do that and you will also be mangling your strings. Use parameterised queries to put data in the database. (And if you really can't, the correct string literal escape function would be mysql_real_escape_string or other similarly-named functions for different databases.)

What is the difference between get_the_* and the_* template tags in wordpress?

I am confused about the get_the_* and the_* template tags. I have used them many times in my theme, but I am not clear on when to use get_the_* and when to use the_*. Would you please explain both concepts clearly?
Typically, there are two key differences between get_the_* and the_* functions.
get_the_* methods don't echo anything themselves. Instead, they return the value that you're interested in, normally as a string. For example, get_the_time() echoes nothing, and returns a string representation of the posting time of the current post. the_* methods directly output the same value, without you having to echo it; the_time() returns nothing, but directly echoes the posting time.
the_* methods are generally designed to be used inside the Loop, so they often don't take a parameter to specify which post you're asking about; for example, the_title() doesn't take a post_id parameter, and can therefore only act on the "current" post inside the Loop. It doesn't make sense to call it outside the Loop: which post would it be getting the title for? However, get_the_title() takes a post ID as a parameter, so you can use it from anywhere to get the title of any post, as long as you've got the post's ID. (Many of the get_the_* methods take an optional post ID parameter, and default to returning the value for the current post if they're used from inside the Loop, for convenience.)
Because WordPress has been in development for so many years, and things have gradually been added, these aren't guaranteed rules, and you'll find exceptions here and there. You should take this as general advice and check the documentation for each specific instance as you need it.
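The return-versus-echo convention can be sketched without WordPress at all. Every function below (demo_posts, demo_current_id, demo_get_the_title, demo_the_title) is a hypothetical stand-in written for this illustration; none of them is a WordPress API.

```php
<?php
// Hypothetical stand-ins, NOT WordPress functions: they only mimic the naming convention.
function demo_posts() {
    return [1 => 'Hello world', 2 => 'Second post'];
}

// Plays the role of the "current" post inside the Loop.
function demo_current_id() {
    return 1;
}

// get_the_* style: returns the value and accepts an optional post ID.
function demo_get_the_title($id = 0) {
    $posts = demo_posts();
    if ($id === 0) {
        $id = demo_current_id();
    }
    return $posts[$id];
}

// the_* style: echoes the current post's value and returns nothing.
function demo_the_title() {
    echo demo_get_the_title();
}

demo_the_title();                            // echoes the current post's title
$title = strtoupper(demo_get_the_title(2));  // a returned value can be processed before output
echo "\n", $title, "\n";
```

The second call shows why the get_the_* form is the one to reach for when the value needs further processing or belongs to a post other than the current one.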
The difference is that you can only use the_* inside your Loop, but you can use get_the_* inside or outside the Loop. Outside the Loop you should pass the post ID as a parameter.
By default, the_* echoes the value (the title, for example), while get_the_* just returns it for use in your PHP.
There is something more to it. I just tried the_content() and echo get_the_content(), which should do the same thing, but they don't. If you add a filter on 'the_content', it won't run with echo get_the_content(), but it works fine with the_content(). That is because the_content() passes the text through the 'the_content' filters before echoing it, while get_the_content() returns it unfiltered.

parse escaped HTML into node in xqilla

I'm trying to get text from an rss 2.0 feed (description tag) using XQilla. The address is here. This is fine but the tag contains escaped HTML like
"&lt;a href="some_address"&gt;..."
It would be useful to have this HTML in a node and further work with it, but I am at a loss here. I can get the tag contents with
let $desc := $item/*[name()='description']
but do not know how to unescape it. I tried parse-html, which only strips the text of tags and returns a string, like the data() function. Searching on the web suggests that extension functions exist for this, but in other parsers. Is there a way to do it in XQilla? By the way, the code I am working on is a JAWS ResearchIt lookup source.
XQilla has – like lots of other XQuery implementations – a proprietary function to load XML and HTML from a string (they don't have anchor tags, thus you need to scroll through the document, I'm sorry).
xqilla:parse-xml($xml as xs:string?) as document-node()?
xqilla:parse-html($html as xs:string?) as document-node()?
Given $desc contains the unparsed HTML, xqilla:parse-html($desc) will return the parse result.

Scraping variable names and values from html definition lists using R

I'm looking to extract some data from a definition list in some HTML code in R. So far I've done the following:
library(XML)  # provides htmlParse and xpathSApply
url <- "myurl"
doc <- htmlParse(url)
and then I (think I) want to use xpathSApply to extract the list data; however, I keep getting an error. I'm new to web scraping and HTML, so I'm not entirely sure how the function goes about locating the data to scrape.
How do I find the xpath to pass to xpathSApply?
An example URL would be http://opencorporates.com/companies/gb/06309283
and I would want to scrape the data regarding company name, number, address, directors etc. into one observation per query.
Firefox has an amazing plugin called Firebug, and an extension to that called FirePath. Using that, you can right-click on any element on a web page and click "Inspect". That will show you the XPath to be passed to xpathSApply.
If you can't use Firebug, there's a nifty bookmarklet called SelectorGadget that does much the same thing and should work in IE9.
Turns out the syntax that I was in need of was '//node[@class="myclass"]' for use in the xpathSApply function. Cheers all

&nbsp; makes RSS feed fail validation

I'm making an RSS feed from content that's been imported into my site. The content contains &nbsp;, which makes the RSS feed fail validation. I'm trying to strip out the &nbsp; but I'm not sure I'll be able to. The feed looks fine in Google Reader, but IE's reader just displays an error.
Are there any other solutions to my problem? Could I make the doctype less strict or something similar?
Thanks
You shouldn't need &nbsp; at all if the content will be visible in an RSS feed. If you're using PHP to generate the feed, you can use str_replace to strip it.
$stripped = str_replace("&nbsp;", " ", $original);
So if you're iterating through a loop, $original will be the variable with the original data; when outputting the data, use $stripped instead. It will replace the &nbsp; with a single space.
You could replace all &nbsp; entities with &#160;.
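A minimal sketch of both approaches, assuming the offending entity is &nbsp; and using a made-up $original string. XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so a named HTML entity such as &nbsp; is undefined in a feed, whereas the numeric reference &#160; is always accepted.

```php
<?php
// Hypothetical imported content containing the HTML-only &nbsp; entity.
$original = 'Price:&nbsp;10&nbsp;GBP';

// Option 1: swap each &nbsp; for a plain space.
$stripped = str_replace('&nbsp;', ' ', $original);

// Option 2: keep the non-breaking space, written as a numeric character reference.
$numeric = str_replace('&nbsp;', '&#160;', $original);

echo $stripped, "\n"; // Price: 10 GBP
echo $numeric, "\n";  // Price:&#160;10&#160;GBP
```

Option 2 preserves the non-breaking behaviour for readers that honour it; option 1 is simpler when that spacing does not matter.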

Resources