trying to understand how regex works - r

I'm learning about regular expressions and am confused by how this whole field works. I've taken an example from a tutorial here and pasted it into https://regexr.com/.
The regex below is supposed to capture email addresses but it doesn't seem to work, at least as is.
I'm posting here in the hopes that there's a simple explanation I might look further into.
From the tutorial website, I gleaned that there are different "flavors" of regex. From the regexr.com site, it seems I have the option to choose a JavaScript or PCRE engine (I assume engine is a synonym for flavor). It doesn't seem to make a difference.
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
Ultimately I'm working in R so have added the R tag to this post. I suspect R may use yet a different flavour from the one above.
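One likely reason the pattern appears not to work: its character classes list only uppercase letters, so it will not match lowercase addresses unless the engine's case-insensitive option is switched on. Here is a minimal sketch in Python (not R), assuming the separator in the pattern is "@"; the sample text is made up for illustration.

```python
import re

# The tutorial pattern, assuming "@" separates the local part from the domain.
EMAIL_PATTERN = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"

# Made-up sample text containing one lowercase and one uppercase address.
text = "Contact john.doe@example.com or SALES@EXAMPLE.ORG for details."

# Without a flag, [A-Z] matches uppercase letters only.
case_sensitive = re.findall(EMAIL_PATTERN, text)

# With IGNORECASE, the same character classes also match lowercase letters.
case_insensitive = re.findall(EMAIL_PATTERN, text, flags=re.IGNORECASE)

print(case_sensitive)    # ['SALES@EXAMPLE.ORG']
print(case_insensitive)  # ['john.doe@example.com', 'SALES@EXAMPLE.ORG']
```

In R the same idea should carry over through the ignore.case = TRUE argument of functions such as grepl() and gregexpr(), with each backslash doubled inside the R string literal. On regexr.com, the equivalent is the "i" flag.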

Related

Good whitelist for search terms

I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got the current regex:
preg_replace('/[^a-z0-9 -]/i', '', $s);
So, I'm removing anything that's not alphanumeric or a space or a hyphen.
Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.
What about 2010 (A space odyssey)? What about Giscard d'Estaing's autobiography? ... This is really impossible to answer generally, it will depend on your application and data structures.
You want to look into the fulltext search functions of the database of your choice, or even specialized search appliances like Sphinx.
Clarify what engine you will use first to actually perform your search, and the rules on what you need to strip out will become much clearer.
Google has some pretty advanced rules for searches, but their basic rule is this:
Generally, punctuation is ignored, including @#$%^&*()=+[]\ and other special characters.
However, Google makes exceptions for common search terms, like C++, C#, or $100.
If you want a search as sophisticated as Google's, you can make rules against the above punctuation and have some exceptions. However, for a simple search, just ignore the characters that Google generally ignores.
There's not a generic regular expression to solve this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of all of the titles in your database, you should be able to write a script that will construct a list of all characters found in all of your titles. If your regular expression strips out any of those characters, then you risk having problems (although passing this test doesn't mean that you won't run into problems).
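The character-inventory check described above could look roughly like this. It is a sketch in Python rather than PHP; the titles list is hypothetical sample data standing in for the real database, and NOT_ALLOWED is a Python rendering of the whitelist from the question.

```python
import re

# Hypothetical sample data standing in for the titles in the database.
titles = [
    "2010 (A Space Odyssey)",
    "Les Misérables",
    "Red-Haired Anne",
    "C++ Primer",
]

# Python rendering of the question's whitelist: letters, digits, space, hyphen.
NOT_ALLOWED = re.compile(r"[^a-z0-9 -]", re.IGNORECASE)

def stripped_characters(strings):
    """Return the sorted set of characters the whitelist would remove."""
    found = set()
    for s in strings:
        found.update(NOT_ALLOWED.findall(s))
    return sorted(found)

print(stripped_characters(titles))  # ['(', ')', '+', 'é']
```

Any character that shows up in this list is one that real titles contain and the whitelist would silently drop.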
Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant search results. In this case, you would want your expression to allow non-English characters (since you don't want to split a word) but you might be able to remove all punctuation marks that aren't inside of a quote-delimited phrase. For example, searching for red haired should give you all of the results you would get from searching for red-haired plus a few extra.

Aggregating from various sources

It could be a project well beyond my skills right now but I've got around one full month to spend on it so I think I can do it. What I want to build is this: Gather news about a specific subject from various sources. Easy, right? Just get the rss feeds and display them on a page. Well, I want something more advanced: Duplicates removed and customized presentation (that is, be able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide rss feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether they match by more than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
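The headline-comparison idea can be sketched with Python's standard difflib module. This is only an illustration: the function name is made up, and the 0.5 threshold is simply the 50% figure from the question, not a recommended value.

```python
from difflib import SequenceMatcher

def is_probable_duplicate(headline_a, headline_b, threshold=0.5):
    """Return True when two headlines are more similar than the threshold.

    ratio() is 1.0 for identical strings and 0.0 when nothing matches.
    """
    ratio = SequenceMatcher(None, headline_a.lower(), headline_b.lower()).ratio()
    return ratio > threshold

print(is_probable_duplicate("Apple unveils new iPhone",
                            "Apple unveils new iPhone today"))
```

Comparing every headline against every other one grows quadratically with the number of items, so this only suits small feeds.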
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Convert everything to a single case (e.g. lowercase)
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
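The four steps above can be sketched as follows. This is not the original code, just an illustration in Python using the standard html.parser module in place of the custom parsing mentioned; LinkKeepingStripper and fingerprint are made-up names.

```python
import hashlib
import re
from html.parser import HTMLParser

class LinkKeepingStripper(HTMLParser):
    """Collect text content, dropping every tag except <a href=...>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep only the link target; all other tags are discarded.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.parts.append(f"<a href={href}>")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        self.parts.append(data)

def fingerprint(html_text):
    """Steps 1-4: strip tags (except links), strip whitespace, lowercase, MD5."""
    parser = LinkKeepingStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    kept = "".join(parser.parts)
    normalized = re.sub(r"\s+", "", kept).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

a = fingerprint('<p>Yes, <a href="http://example.com/a">this sucks</a></p>')
b = fingerprint('<P>yes,   <a href="http://example.com/a">THIS SUCKS</a></P>')
c = fingerprint('<p>Yes, <a href="http://example.com/b">this sucks</a></p>')
print(a == b, a == c)  # True False
```

The first two inputs differ only in markup, spacing and case, so they hash the same; the third links to a different URL and so gets a different hash.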
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded: (I think) &amp;lt;
But it is not. It is encoded &lt;
But so too are HTML tags! : &lt;p&gt;
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.

Is there a good way to change UPPER CASE html tags to lower case, across a whole document?

So, I've got a bunch of markup-pages delivered that I am supposed to style. Problem is that tags are all in uppercase, even though the doctype declares it as xhtml. Not only is it ugly and hurting my eyes, it's also wrong, isn't it?
Is there a good way, perhaps a coda (my preferred tool), plug-in, or online service that can do this for me? Or can you do a regexp search-and-replace in coda, and if so, how? (I'll be the first to admit that regexp isn't my cup o'Java.)
You could try figuring out how to do it in RegExp (something a little like /<([^>]*)>/g would be a bit too simplistic, but it's a start) but that still presents problems. Coda's RegExp just does find-and-replace, so I don't think you'd be able to do a "toLowerCase()" on each item found before replacing with Coda - you'd have to use some scripting language.
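As a rough illustration of the scripting route, here is a Python sketch that lowercases each match through a replacement callback. It is deliberately simplistic: the pattern and function name are made up for this example, it only touches tag names (attribute names stay as they are), and it does not know about comments, scripts or CDATA sections.

```python
import re

# Matches "<" or "</" followed by a tag name; attributes are left untouched.
TAG_NAME = re.compile(r"</?[A-Za-z][A-Za-z0-9]*")

def lowercase_tag_names(markup):
    """Lowercase the name of every opening and closing tag."""
    return TAG_NAME.sub(lambda m: m.group(0).lower(), markup)

print(lowercase_tag_names('<DIV CLASS="Box"><P>Hello World</P></DIV>'))
# <div CLASS="Box"><p>Hello World</p></div>
```

For real documents a proper parser or a tool like Tidy is likely the safer choice.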
Another option is to download the W3C's Tidy application which I believe has "to uppercase" and "to lowercase" options, though I've not used it. It's here: http://tidy.sourceforge.net/#binaries (hasn't been updated in a while)

Where can one post nicely formatted code combined with LaTeX for mathematical expressions?

Admittedly not a programming question, but I don't really know where else to ask this...
I'm planning to start a blog to post the stuff I'm working on, which is mostly about Expression Trees & Mathematics. Hopefully this will help me focus on the problem at hand instead of going off every possible tangent that comes up.
I wonder if someone out there knows a good place to host a blog with the two following requirements:
(1) Nice support for code listings (as seen here, for example).
(2) Support for complex mathematical expressions, ideally in LaTeX (as seen here, for example).
For a while now I've been looking around for posts/articles combining both nicely formatted code and mathematical expressions, but I haven't found anything.
Thanks a lot!
PS - If there's another Q&A forum where this question would fit better, then please let me know and I'll move it there.
EDIT(1): While carrying out some additional research, I found this related SO question (see also resources therein), which then took me to here. Leaving the question open for now though in case someone wants to suggest alternatives.
Read here:
http://sixthform.info/steve/wordpress/?cat=2
http://fugato.net/2007/01/20/latex-in-wordpress/
about LaTeX on WordPress. Syntax highlighting is easy (hint: google for "syntax highlighting"), and you can go on any host with WP.
Good luck.
Edit: Okay, about that latex - it seems you need to have administrator rights, so any hosting with friendly administrator or virtual server or server hosting/housing :)
Edit: As for syntax highlighting:
PHP SH: http://xtractpro.com/articles/CSharp-Syntax-Highlighter.aspx
JS SH: http://wordpress.org/extend/plugins/google-syntax-highlighter/
For math expressions in LaTeX syntax, check out MathJax. It can probably be used with most blog hosting services. You just need to be able to add JavaScript to the page.

LaTeX equivalent to Google Chart API

I'm currently looking at different solutions for getting two-dimensional mathematical formulas into web pages. I think that the Wikipedia solution (generating PNG images from LaTeX source code) is good enough until we get support for MathML in web browsers.
I suddenly realized that it might be possible to create a Google Charts API equivalent for math formulas. Has this already been done? Is it even possible, given the strange characters involved in LaTeX code?
I would like to hit a URL like latex2png.org/api/?eq="E = mc^2" and get the following response:
edit:
Thanks for the answers so far. However, I am already aware of several tools to generate PNG images from LaTeX source code (both online and from my command line), but what I was looking for was a simple way to get the image via an HTTP GET request. Perhaps such a service does not exist.
Update
As @hughes (and others) pointed out, the previous Google Chart API has been deprecated.
The example I wrote still works as of Sept 2015, but a new one should be used now (documentation):
Old answer
Google Chart can do it (Documentation):
http://chart.apis.google.com/chart?cht=tx&chl=%5CLaTeX
I'm using this with Google Docs, because it doesn't support math yet.
chart.apis.google with background color changed
https://chart.apis.google.com/chart?cht=tx&chf=bg,s,FFFF00&chl=%0D%0A4x_0%5CDelta%28x%29%2B3%5CDelta%28x%29%2B2%5CDelta%28x%5E2%29%3E0%0D%0A
or chart.apis.google with background color transparent and resized
For better readability, the URL is shown decoded first:
https://chart.apis.google.com/chart?cht=tx&chs=428x35&chf=bg,s,FFFFFF00&chl=
4x_0\Delta(x)+3\Delta(x)+2\Delta(x^2)>0
Data structure looks like this
{
"cht":"tx",
"chs":"428x35",
"chf":"bg,s,FFFFFF00",
"chl":"4x_0\Delta(x)+3\Delta(x)+2\Delta(x^2)>0"
}
https://chart.apis.google.com/chart?cht=tx&chs=428x35&chf=bg,s,FFFFFF00&chl=%0D%0A4x_0%5CDelta%28x%29%2B3%5CDelta%28x%29%2B2%5CDelta%28x%5E2%29%3E0%0D%0A
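The encoded URL above can be produced from the plain LaTeX with a few lines of Python. This sketch only builds the string and makes no request, since the API is described above as deprecated; latex_chart_url is a made-up helper name, and the leading and trailing line breaks in the formula are there only to reproduce the %0D%0A in the example.

```python
from urllib.parse import quote

BASE_URL = "https://chart.apis.google.com/chart"

def latex_chart_url(latex, size="428x35", fill="bg,s,FFFFFF00"):
    """Build a chart URL: cht=tx selects formula rendering, chl carries the LaTeX."""
    return (
        f"{BASE_URL}?cht=tx"
        f"&chs={size}"
        f"&chf={quote(fill, safe=',')}"   # keep the commas readable
        f"&chl={quote(latex, safe='')}"   # percent-encode everything reserved
    )

# Backslashes are doubled only because this is a Python string literal.
formula = "\r\n4x_0\\Delta(x)+3\\Delta(x)+2\\Delta(x^2)>0\r\n"
print(latex_chart_url(formula))
```

Percent-encoding is what answers the worry about strange characters: the backslash, plus sign, parentheses and caret all become %XX sequences that are safe in a query string.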
You could try the Online image generator for mathematical formulas for a start.
mathurl is a mathematical version of TinyURL.com. It allows you to reference LaTeXed mathematical expressions using a short url. For example, http://mathurl.com/?5v4pjw will show [LaTeX output Image] which you can then edit. More details on mathurl’s help page
I just ran across MathJax on Ajaxian [via Wayback Machine]:
MathJax seems to have a chance at being a practical solution that offers a high quality display of LaTeX and MathML math notation in HTML pages.
The output is remarkably beautiful, and it's all pure HTML and CSS, which makes it scalable and selectable. Performance is currently a bit sluggish, but this is recognized.
As everyone has said, there are many services that do this already. Here is another easy one that I've used a number of times (and you can install it locally on your server if necessary):
http://www.codecogs.com/components/equationeditor/equationeditor.php
I'd take a good look at how the MediaWiki LaTeX support does it and borrow from there.
Please check out this site for a way to create TeX documents without any software installed. You can then grab the resulting image with any screen capture method and embed it into any website.
Go to http://sharelatex.com
The software is free to use, but you need to register to create documents.
