Good whitelist for search terms - whitelist

I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got the current regex:
preg_replace('/[^a-z0-9 -]/i', '', $s);
So, I'm removing anything that's not alphanumeric or a space or a hyphen.
Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.

What about 2010 (A space odyssey)? What about Giscard d'Estaing's autobiography? ... This is really impossible to answer generally; it will depend on your application and data structures.
You want to look into the fulltext search functions of the database of your choice, or even specialized search appliances like Sphinx.
Clarify what engine you will use first to actually perform your search, and the rules on what you need to strip out will become much clearer.

Google has some pretty advanced rules for searches, but their basic rule is this:
Generally, punctuation is ignored, including @#$%^&*()=+[]\ and other special characters.
However, Google makes exceptions for common search terms, like C++, C#, or $100.
If you want a search as sophisticated as Google's, you can make rules against the above punctuation and have some exceptions. However, for a simple search, just ignore the characters that Google generally ignores.
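A minimal sketch of the "ignore punctuation, with a few exceptions" idea, in Python rather than PHP. The exception set and the `normalize_query` name are made up for illustration; this is not how Google actually does it, just one way to approximate the rule described above.

```python
import re

# Hypothetical exception list: terms whose punctuation carries meaning.
EXCEPTIONS = {"c++", "c#"}

# Dollar amounts such as $100 or $9.99 are also kept verbatim.
DOLLAR_AMOUNT = re.compile(r"\$\d+(?:\.\d+)?")


def normalize_query(query):
    """Drop punctuation from each term unless the term is a known exception."""
    terms = []
    for raw in query.split():
        if raw.lower() in EXCEPTIONS or DOLLAR_AMOUNT.fullmatch(raw):
            terms.append(raw)  # keep the term exactly as typed
            continue
        # \w is Unicode-aware in Python 3, so accented letters survive;
        # hyphens are kept so "red-haired" stays one term.
        cleaned = re.sub(r"[^\w-]", "", raw)
        if cleaned:
            terms.append(cleaned)
    return " ".join(terms)
```

For example, `normalize_query("C++ (programming)!")` keeps `C++` and reduces the second term to `programming`.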

There's not a generic regular expression to solve this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of all of the titles in your database, you should be able to write a script that will construct a list of all characters found in all of your titles. If your regular expression strips out any of those characters, then you risk having problems (although passing this test doesn't mean that you won't run into problems).
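The character audit described above could look roughly like this (sketched in Python; the `stripped_characters` name and the sample titles are assumptions, and in practice the titles would come from your database). It counts every character in the titles that the whitelist from the question would remove.

```python
import re
from collections import Counter

# Same character class as the preg_replace call in the question:
# anything that is not a letter, digit, space, or hyphen gets stripped.
NOT_WHITELISTED = re.compile(r"[^a-z0-9 -]", re.IGNORECASE)


def stripped_characters(titles):
    """Count the characters in the titles that the whitelist would remove."""
    counts = Counter()
    for title in titles:
        counts.update(NOT_WHITELISTED.findall(title))
    return counts


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real title list.
    sample = ["2010 (A Space Odyssey)", "Am\u00e9lie"]
    for char, count in stripped_characters(sample).most_common():
        print(repr(char), count)
```

An empty result means the whitelist keeps every character that appears in your titles; anything it reports is a character that users might reasonably type.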
Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant search results. In this case, you would want your expression to allow non-English characters (since you don't want to split a word) but you might be able to remove all punctuation marks that aren't inside of a quote-delimited phrase. For example, searching for red haired should give you all of the results you would get from searching for red-haired plus a few extra.

Related

Is there a way in Windows 10 to convert a hexadecimal code to its symbol regardless of the program?

I've read many pages that point out that many office applications allow for this by typing the code followed by Alt + X, but frequently, I want to insert a symbol when I'm not in one of those applications. Is there a universal way to achieve this?
The character map is useless, unless you have time to manually search through all the characters available.
I posted the question at Super User, and basically, the response I got there was to use Alt codes for the symbols. However, I discovered that, on the whole, these only work for the first 256 Alt codes. So basically, the answer to my question is "No, there's not a good way."

trying to understand how regex works

I'm learning about regular expressions and am confused by how this whole field works. I've taken an example from a tutorial here and pasted it into https://regexr.com/.
The regex below is supposed to capture email addresses but it doesn't seem to work, at least as is.
I'm posting here in the hopes that there's a simple explanation I might look further into.
From the tutorial website, I gleaned that there are different "flavors" of regex. From the regexr.com site, it seems I have the option to choose a JavaScript or PCRE engine (I assume engine is a synonym for flavor). It doesn't seem to make a difference.
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
Ultimately I'm working in R so have added the R tag to this post. I suspect R may use yet a different flavour from the one above.
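One likely reason a pattern of this shape matches nothing: its character classes list only uppercase A-Z, so it depends on a case-insensitive flag being switched on. The sketch below (in Python, assuming the separator in the pattern is meant to be @) shows the same expression with and without that flag; the names `EMAIL`, `EMAIL_CASE_SENSITIVE` and `find_emails` are made up for the example.

```python
import re

PATTERN = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"

# With the flag, [A-Z] also matches lowercase letters.
EMAIL = re.compile(PATTERN, re.IGNORECASE)

# Without the flag, only all-uppercase addresses can match.
EMAIL_CASE_SENSITIVE = re.compile(PATTERN)


def find_emails(text):
    """Return every substring of text that matches the email pattern."""
    return EMAIL.findall(text)
```

On regexr.com the equivalent is enabling the `i` flag. R's base functions such as `grepl` and `gregexpr` take an `ignore.case` argument for the same purpose.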

Converting words to hashtags when posting to Twitter

I currently use LinqToTwitter to send posts to Twitter. I'd like to convert words in the title of the post to hashtags when it gets fired off as a tweet, so something like "Firefox is cool" is the blog post and becomes #Firefox is cool http://myshortu.rl/dhsgeh on Twitter.
So far, the way I see it, I need a database table with the words I want to convert to hashtags. I'd have to parse out the title, compare the words to those in the db, and add on the pound sign. Is the best way to use a db table? Or can I do it with an in-memory collection or keep the words in web.config? Thanks....
The decision on whether to use a database or file (such as web.config) might depend on whether you want to write code that allows you to maintain the list. e.g. Add, Modify, Remove. If so, then a DB sounds like the easiest option. If the list is small and doesn't change, then adding a delimited list to web.config would work fine.
Since you're using ASP.NET you can't hold it in a memory variable, but you can hold the list in Cache. This can make for some very fast lookups, rather than multiple file or DB queries.
Just to put this into perspective though, it's tough to recommend a proper design in a forum because there might be details that aren't known. So, it's best to take my answer as something that helps think about what the tradeoffs are, rather than a definitive recommendation on what you should do.
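Wherever the word list ends up living, the conversion step itself is small. Here is a rough sketch in Python rather than C#, with a hard-coded set standing in for whatever the database, web.config, or cache lookup would return; `HASHTAG_WORDS` and `add_hashtags` are names invented for the example.

```python
import re

# Hypothetical word list; in the real application this would be loaded
# once from the DB or config file and then held in the cache.
HASHTAG_WORDS = frozenset({"firefox", "linux"})


def add_hashtags(title, words=HASHTAG_WORDS):
    """Prefix every word of the title that is in the word list with '#'."""

    def tag(match):
        word = match.group(0)
        # Compare case-insensitively but keep the original casing.
        return "#" + word if word.lower() in words else word

    return re.sub(r"\b\w+\b", tag, title)
```

With that list, `add_hashtags("Firefox is cool")` gives `#Firefox is cool`. A set lookup per word is cheap, so the main design question remains how the list is maintained, not how it is searched.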

How should I sanitize urls so people don't put 漢字 or á or other things in them?

How should I sanitize urls so people don't put 漢字 or other things in them?
EDIT: I'm using java. The url will be generated from a question the user asks on a form. It seems StackOverflow just removed the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is slugify. There's no fixed mechanism for doing it; every framework handles it in their own way.
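As an illustration only (in Python, not Java, and not any particular framework's implementation), one common way to slugify is to decompose accented letters, drop whatever is still non-ASCII, and collapse the rest into hyphen-separated lowercase words. The `slugify` name is just a label for this sketch.

```python
import re
import unicodedata


def slugify(title):
    """Turn a title into a lowercase ASCII slug with hyphens between words."""
    # NFKD splits "á" into a plain "a" followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", title)
    # Dropping non-ASCII removes the accent and characters such as 漢字
    # that have no ASCII decomposition.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Any run of characters other than letters and digits becomes one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return slug.strip("-")
```

Note that distinct titles can produce the same slug (a title written entirely in 漢字 yields an empty string), so a slug on its own is not a safe unique key.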
Yes, I would sanitize/remove. It will either be inconsistent or look ugly encoded
If you're using Java, see the URLEncoder API docs.
Be careful! If you are removing elements such as odd chars, then two distinct inputs could yield the same stripped URL when they don't mean to.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set
This means it will get encoded. URLs should be readable. Standards tend to be English biased (what's that? Langist? Languagist?).
Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
StackOverflow seems to just remove those chars from the URL altogether :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks Mike Spross
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php

How to match URIs in text?

How would one go about spotting URIs in a block of text?
The idea is to turn such runs of texts into links. This is pretty simple to do if one only considered the http(s) and ftp(s) schemes; however, I am guessing the general problem (considering tel, mailto and other URI schemes) is much more complicated (if it is even possible).
I would prefer a solution in C# if possible. Thank you.
Regexes may prove a good starting point for this, though URIs and URLs are notoriously difficult to match with a single pattern.
To illustrate, the simplest of patterns looks fairly complicated (in Perl 5 notation):
\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*
This would match
http://example.com/foo/bar-baz
and
ftp://192.168.0.1/foo/file.txt
but would cause problems for at least these:
mailto:support@stackoverflow.com (no match - no //, but present @)
ftp://192.168.0.1.2 (match, but too many numbers, so it's not a valid URI)
ftp://1000.120.0.1 (match, but the IP address needs numbers between 0 and 255, so it's not a valid URI)
nonexistantscheme://obvious.false.positive
http://www.google.com/search?q=uri+regular+expression (match, but query isn't
I think this is a case of the 80:20 rule. If you want to catch most things, then I would do as suggested and find a decent regular expression if you can't write one yourself.
If you're looking at text pulled from fairly controlled sources (e.g. machine generated), then this will be the best course of action.
If you absolutely, positively have to catch every URI that you encounter, and you're looking at text from the wild, then I think I would look for any word with a colon in it, e.g. \s(\w+:\S+)\s. Once you have a suitable candidate for a URI, pass it to a real URI parser in the URI class of whatever library you're using.
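The two-stage idea (cheap pattern for candidates, then a real parser) might be sketched like this in Python, using the standard library's `urlsplit` as the stand-in for "the URI class of whatever library you're using". The scheme allowlist, the trailing-punctuation trimming, and the `find_uris` name are assumptions added for the example, and the lookbehind replaces the surrounding `\s` so that candidates at the very start or end of the text are still found.

```python
import re
from urllib.parse import urlsplit

# Stage 1: any whitespace-delimited token of the form word:something.
CANDIDATE = re.compile(r"(?<!\S)\w+:\S+")

# Hypothetical allowlist of schemes the application cares about.
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "tel"}


def find_uris(text):
    """Return tokens that look like URIs and parse with a known scheme."""
    uris = []
    for match in CANDIDATE.finditer(text):
        # Trim sentence punctuation that commonly trails a link in prose.
        candidate = match.group(0).rstrip(".,;!?)")
        try:
            parts = urlsplit(candidate)  # Stage 2: hand off to a real parser.
        except ValueError:
            continue  # the parser rejected the candidate outright
        if parts.scheme in KNOWN_SCHEMES:
            uris.append(candidate)
    return uris
```

Tokens such as `10:30` pass the first stage but are dropped in the second because they carry no recognised scheme.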
If you're interested in why it's so hard to write a URI pattern, then I guess it would be that the definition of a URI is done with a Type-2 grammar, while regular expressions can only parse languages from Type-3 grammars.
Whether or not something is a URI is context-dependent. In general the only thing they always have in common is that they start "scheme_name:". The scheme name can be anything (subject to legal characters). But other strings also contain colons without being URIs.
So you need to decide what schemes you're interested in. Generally you can get away with searching for "scheme_name:", followed by characters up to a space, for each scheme you care about. Unfortunately URIs can contain spaces, so if they're embedded in text they are potentially ambiguous. There's nothing you can do to resolve the ambiguity - the person who wrote the text would have to fix it. URIs can optionally be enclosed in <>. Most people don't do that, though, so recognising that format will only occasionally help.
The Wikipedia article for URI lists the relevant RFCs.
[Edit to add: using regular expressions to fully validate URIs is a nightmare - even if you somehow find or create one that's correct, it will be very large and difficult to comment and maintain. Fortunately, if all you're doing is highlighting links, you probably don't care about the odd false positive, so you don't need to validate. Just look for "http://", "mailto:\S*@", etc]
For a lot of the protocols you could just search for "://" without the quotes. Not sure about the others though.
Here is a code snippet with regular expressions for various needs:
http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/
That is not easy to do, if you want to also match "something.tld", because normal text will have many instances of that pattern, but if you want to match only URIs that begin with a scheme, you can try this regular expression (sorry, I don't know how to plug it in C#)
(http|https|ftp|mailto|tel):\S+[/a-zA-Z0-9]
You can add more schemes there, and it will match the scheme until the next whitespace character, taking into account that the last character is not invalid (for example as in the very usual string "http://www.example.com.")
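To see the trailing-character rule in action, here is the same expression tried out in Python (not C#). The only change is a non-capturing group, so that `findall` returns whole matches; `SCHEME_URI` and `find_scheme_uris` are names made up for this sketch.

```python
import re

# Scheme, colon, then non-whitespace characters, ending on a slash or an
# alphanumeric so that a sentence-ending period is left out of the match.
SCHEME_URI = re.compile(r"(?:http|https|ftp|mailto|tel):\S+[/a-zA-Z0-9]")


def find_scheme_uris(text):
    """Return every scheme-prefixed URI found in the text."""
    return SCHEME_URI.findall(text)
```

For the string "Visit http://www.example.com." the match stops at `com`, because the final period cannot satisfy the last character class.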
the URL Tool for Ubiquity does the following:
findURLs: function(text) {
  var urls = [];
  var matches = text.match(/(\S+\.{1}[^\s\,\.\!]+)/g);
  if (matches) {
    for each (var match in matches) {
      urls.push(match);
    }
  }
  return urls;
},
The following Perl regexp should do the trick. Does C# have Perl regexps?
/\w+:\/\/[\w][\w\.\/]*/

Resources