I have a large list (tens of thousands) of strings (needles) that follow the pattern [a-z0-9_#-]{2,45}, case-insensitive:
["needle-1", "needle-2", ..., "needle-25000"]
You have a large body of text (haystack):
<body>needle-1 and other text that may contain needle-9000</body>
The goal is to find out which strings are found in your body of text.
But: I don't want anyone to have my list of strings.
Is there a way I can encrypt and publish my list of strings, so that you can still check to see if any matches exist in your haystack?
The format I would like to allow in my text boxes is a set of comma-delimited lists, with a line break between each list. Here is an example of what I want from the user:
1,2,3
1,2,4
1,2,5
1,2,6
So far I have limited the user using this ValidationExpression:
^([1-9][0-9]*[]*[ ]*,[ ]*)*[1-9][0-9]*$
However with that expression, the user is only able to enter one row of comma delimited numbers.
How can I proceed to accept multiple rows by allowing line breaks?
It is possible to check if the input has the correct format. I would recommend using groups and repeating them:
((\d+,)+\d+\n?)+
But to check if the matrix is symmetric you have to use something other than regex.
Check it out here: https://regex101.com/r/GqtOuQ/2/
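As a rough illustration of the pattern above, here is a minimal Python sketch (Python rather than .NET, purely for demonstration). The names is_valid_rows and parse_rows are made up for this example. It uses re.fullmatch so that the whole input has to fit the pattern, then splits accepted text into rows of integers, on which any check that a regex cannot express could be run.

```python
import re

# The pattern from the answer: one or more rows of comma-separated integers,
# each row optionally followed by a line break.
ROWS = re.compile(r"((\d+,)+\d+\n?)+")

def is_valid_rows(text):
    """Return True if the entire text consists of comma-delimited rows."""
    return ROWS.fullmatch(text) is not None

def parse_rows(text):
    """Split already-validated text into lists of integers, one list per row."""
    return [[int(n) for n in line.split(",")] for line in text.splitlines()]

sample = "1,2,3\n1,2,4\n1,2,5\n1,2,6"
print(is_valid_rows(sample))        # True
print(parse_rows(sample)[0])        # [1, 2, 3]
print(is_valid_rows("1,2,\n3,4"))   # False: a row may not end with a comma
```

Note that if the input arrives with Windows-style \r\n line breaks, \n? alone will not match them; \r?\n? would be one way to allow both.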
If you want to be a bit more user-friendly, you can allow as many horizontal spaces as the user likes between the numbers and commas. This can be done with the regex class \h, which matches horizontal whitespace (such as spaces and tabs) but not \n.
The regex now looks a bit messier:
((\h*\d+\h*,\h*)+\h*\d+\h*\n?\h*)+
Check this out here: https://regex101.com/r/GqtOuQ/3
Here is the version that should work with .NET:
(([ \t]*\d+[ \t]*,[ \t]*)+[ \t]*\d+[ \t]*\n?[ \t]*)+
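The [ \t] form is also the one to use in Python, whose re module does not support \h either. The sketch below only demonstrates how the whitespace-tolerant pattern behaves; the function names are hypothetical.

```python
import re

# The .NET-compatible pattern from above; [ \t] stands in for \h.
ROWS_WS = re.compile(r"(([ \t]*\d+[ \t]*,[ \t]*)+[ \t]*\d+[ \t]*\n?[ \t]*)+")

def is_valid_rows_ws(text):
    """Return True if the entire text is rows of integers with optional padding."""
    return ROWS_WS.fullmatch(text) is not None

def parse_rows_ws(text):
    """Parse validated text; int() itself tolerates surrounding spaces and tabs."""
    return [[int(n) for n in line.split(",")] for line in text.splitlines()]

padded = "1 , 2,3\n 4,5 ,6"
print(is_valid_rows_ws(padded))      # True
print(parse_rows_ws(padded))         # [[1, 2, 3], [4, 5, 6]]
print(is_valid_rows_ws("1 2,3"))     # False: a space cannot replace a comma
```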
I am new to the Google Translate API, and during testing we noticed that some translations (I have not been able to find a pattern yet) contain \u200b characters in the response. This causes a lot of issues, and above all it does not seem to serve any purpose or make any sense. A simple example:
https://www.googleapis.com/language/translate/v2?key=YOURKEY&source=NL&target=EN&q=Hergeneer%20verkopen
returns:
{
  "data": {
    "translations": [
      {
        "translatedText": "Sell \u200b\u200bHerge Down"
      }
    ]
  }
}
Our software stumbles over these \u200b strings/characters and I have not found a way to prevent them or get rid of them.
Please read the documentation of the JSON format: https://json.org/
A string is a sequence of zero or more Unicode characters.
A char is either any Unicode character except " or \ or control-character,
[...]
or it is \u followed by four hex-digits.
We are in this last case, \u followed by four hex-digits, and it represents a Unicode character: Unicode Character 'ZERO WIDTH SPACE' (U+200B). It even has its own Wikipedia page: Zero-width space. And its Stack Overflow question: What's HTML character code 8203?.
Now, there are plenty of Unicode characters with special behaviors, and this is one of them: an invisible one, among others. So you need to be aware of how Unicode works, and you should sanitize input/output from third-party APIs (and from user input as well).
Just define the list of characters that you actually want to support, and be sure to strip or filter out all the other ones. For instance, if you desire to support NL and EN, then you could strip what is outside the Latin script in Unicode.
Stripping the U+200B that you're encountering, along with other undesirable characters, may save you from potential surprises such as:
big characters ⎲⎳
zalgo characters C̨̦̺̩̲̥͉̭͚̜̻̝̣̼͙̮̯̪o̴̡͇̘͎̞̲͇̦̲͞͡m̸̩̺̝̣̹̱͚̬̥̫̳̼̞̘̯͘ͅẹ͇̺̜́̕͢
invisible characters
emojis 👨👩👧👦#️⃣🏳️🌈
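One possible way to do the stripping, sketched in Python under the assumption that dropping every character in Unicode category Cf ("format") is acceptable for your texts. U+200B belongs to that category. The helper name strip_format_chars is made up for this example, and the JSON is the response body from the question.

```python
import json
import unicodedata

def strip_format_chars(text):
    """Drop characters in Unicode category Cf (format), which includes U+200B."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# The response body from the question, as raw JSON text.
raw = '{"data": {"translations": [{"translatedText": "Sell \\u200b\\u200bHerge Down"}]}}'

translated = json.loads(raw)["data"]["translations"][0]["translatedText"]
print(len(translated))                  # 17: the two zero-width spaces are real characters
print(strip_format_chars(translated))   # Sell Herge Down
```

Category Cf also covers characters such as the soft hyphen (U+00AD) and the zero-width joiner (U+200D). That is probably fine for Dutch and English, but it would break emoji sequences and scripts that rely on joiners, so a narrower list of characters to remove may be the safer choice elsewhere.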
I need to find a special character that, when placed in the middle of a word, is ignored by SQLite FTS MATCH as if it did not exist, e.g.:
Text Body: book's
If my match string is 'books', I need to get "book's" as a result.
Using either the porter or the simple tokenizer is fine.
I tried many characters for that, like: book!s, book?s, book|s, book,s, book:s…, but when searching with MATCH for 'books', none of these were returned.
I don't understand why.
I am using contentless FTS4 tables and external content FTS4 tables. My text body has many such characters inside words, and they should be ignored when searching.
I cannot change the match query because I do not know where the special character in the word is. Also, I need the original word length to stay equal to the length of the word in the FTS index so that I can use matchinfo() or snippet(); as such, I cannot remove these characters from the text body.
The default tokenizers do not ignore punctuation characters but treat them as word separators.
So the text body or match string book's will end up as two words, book and s.
These will never match a single word like books.
To ignore characters like ', you have to install your own custom tokenizer.
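To make the splitting behaviour concrete, here is a small Python model of it. This is not the SQLite API: actual FTS3/FTS4 custom tokenizers are implemented in C and registered with SQLite. The function tokenize and its ignored parameter are hypothetical names for this illustration, and the offsets are character offsets, whereas SQLite works with byte offsets.

```python
def is_separator(ch):
    # The built-in "simple" tokenizer treats every ASCII character that is not
    # alphanumeric as a separator; code points >= 128 count as term characters.
    return ch.isascii() and not ch.isalnum()

def tokenize(text, ignored=""):
    """Return (term, start, end) tuples; offsets refer to the original text.

    Characters listed in `ignored` are skipped inside a term instead of
    ending it, which is the behaviour a custom tokenizer would have to add.
    """
    tokens = []
    term, start = [], None
    for i, ch in enumerate(text):
        if ch in ignored and term:
            continue  # skip the character but keep the current term open
        if is_separator(ch):
            if term:
                tokens.append(("".join(term), start, i))
                term, start = [], None
        else:
            if not term:
                start = i
            # Fold ASCII letters to lower case, as the simple tokenizer does.
            term.append(ch.lower() if ch.isascii() else ch)
    if term:
        tokens.append(("".join(term), start, len(text)))
    return tokens

print(tokenize("book's"))               # [('book', 0, 4), ('s', 5, 6)]
print(tokenize("book's", ignored="'"))  # [('books', 0, 6)]
```

Because the second variant reports the term books together with the offsets of the untouched original text, the stored body keeps its length, which is what functions that rely on offsets would need.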
I am trying to use a regular expression for a name field in an ASP.NET application.
Condition: the name should be a minimum of 6 characters.
I tried the following:
"^(?=.*\d).{6}$"
I'm completely new to regex. Can anyone suggest what the regex for such a condition should be?
You could use this to match any alphanumeric character in length of 6 or more: ^[a-zA-Z0-9]{6,}$. You can tweak it to allow other characters or go the other route and just put in exclusions. The Regex Coach is a great environment for testing/playing with regular expressions (I wrote a blog post with some links to other tools too).
Look at the Expression library and choose a user name and/or password regex that suits you. You can also test your regex in online regex testers like RegexPlanet.
My regex suggestions are:
^[a-zA-Z][a-zA-Z0-9._\-]{5,}$
This regex accepts user names with minimum 6 characters, starting with a letter and containing only letters, numbers and ".","-","_" characters.
Next one:
^[a-zA-Z0-9._\-]{6,}$
Similar to above, but accepts ".", "-", "_" and 0-9 to be first characters too.
If you want to validate only string length (minimum 6 characters), this simple regex below will be enough:
^.{6,}$
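The three suggestions can be compared side by side with a short Python sketch. The constant names and sample inputs are invented for this illustration, and the hyphen is escaped with a single backslash in both character classes.

```python
import re

STARTS_WITH_LETTER = re.compile(r"^[a-zA-Z][a-zA-Z0-9._\-]{5,}$")
ANY_ALLOWED_FIRST = re.compile(r"^[a-zA-Z0-9._\-]{6,}$")
LENGTH_ONLY = re.compile(r"^.{6,}$")

def check(name):
    """Return which of the three patterns accept the given name."""
    return (
        STARTS_WITH_LETTER.search(name) is not None,
        ANY_ALLOWED_FIRST.search(name) is not None,
        LENGTH_ONLY.search(name) is not None,
    )

print(check("john_doe"))   # (True, True, True)
print(check("1johndoe"))   # (False, True, True): starts with a digit
print(check("ab cd!"))     # (False, False, True): space and "!" are not in the set
print(check("abc"))        # (False, False, False): shorter than 6 characters
```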
What about
^.{6,}$
What's all the stuff at the start of yours, and did you want to limit yourself to digits?
NRegex is a nice site for testing out regexes.
To just match 6 characters, ".{6}" is enough
In its simplest form, you can use the following:
.{6,}
This will match on 6 or more characters and fail on anything less. This will accept ANY character: Unicode, ASCII, whatever you are running through. If you have more requirements (e.g. only the Latin alphabet, must contain a number, etc.), the regex would obviously have to change.
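For comparison, here is how the pattern from the question and the simple length check behave on a few inputs, as a Python sketch with hypothetical function names. The pattern from the question insists on exactly six characters with at least one digit among them, which is stricter than the stated requirement.

```python
import re

def exactly_six_with_digit(text):
    # The pattern from the question: the lookahead (?=.*\d) demands a digit
    # somewhere, and .{6} between ^ and $ demands exactly six characters.
    return re.search(r"^(?=.*\d).{6}$", text) is not None

def six_or_more(text):
    # fullmatch makes the whole input fit the pattern, so no anchors are needed.
    return re.fullmatch(r".{6,}", text) is not None

for sample in ["abc123", "abcdef", "abc1234", "ab1"]:
    print(sample, exactly_six_with_digit(sample), six_or_more(sample))
# abc123 True True
# abcdef False True
# abc1234 False True
# ab1 False False
```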
Using ASP.NET syntax for the RegularExpressionValidator control, how do you specify restriction of two consecutive characters, say character 'x'?
You can provide a regex like the following:
(\\w)\\1+
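(\\w) will match any word character, and \\1+ will match one or more repetitions of whatever character was matched by (\\w). The backslashes are doubled because the pattern is written as a C# string literal.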
I do not have access to asp.net at the moment, but take this console app as an example:
var regex = new Regex("(\\w)\\1+"); // requires: using System.Text.RegularExpressions;
Console.WriteLine(regex.IsMatch("hello") ? "Not valid" : "Valid"); // "hello" contains two consecutive l's, hence not valid
Console.WriteLine(regex.IsMatch("Bar") ? "Not valid" : "Valid"); // "Bar" does not contain any consecutive repeated characters, so it's valid
Alexn is right, this is the way you match consecutive characters with a regex, i.e. (a)\1 matches aa.
However, I think this is a case of everything looking like a nail when you're holding a hammer. I would not use regex to validate this input. Rather, I suggest validating this in code (just looping through the string, comparing str[i] and str[i-1], checking for this condition).
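The loop described here is only a few lines long. A minimal sketch in Python (the function name is invented, and the same logic carries over directly to C#):

```python
def has_consecutive_repeat(text):
    """Return True if any character is immediately followed by itself."""
    for i in range(1, len(text)):
        if text[i] == text[i - 1]:
            return True
    return False

print(has_consecutive_repeat("hello"))  # True: "ll"
print(has_consecutive_repeat("Bar"))    # False
```

To restrict the check to a single character such as 'x', the comparison could additionally test that text[i] equals that character.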
This should work:
^((?<char>\w)(?!\k<char>))*$
It matches abc, but not abbc.
The key is to use a so-called "zero-width negative lookahead assertion" (syntax: (?! subexpression)).
Here we make sure that a group matched with (?<char>\w) is not followed by itself (expressed with (?!\k<char>)).
Note that \w can be replaced with any valid set of characters (\w does not match whitespace characters).
You can also do it without named group (note that the referenced group has number 2):
^((\w)(?!\2))*$
And it's important to start with ^ and end with $ to match the whole text.
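The same lookahead idea can be tried out in Python, with one syntax difference that is worth flagging: Python writes named groups as (?P<char>...) and refers back to them with (?P=char), instead of the .NET forms (?<char>...) and \k<char>. The numbered variant is identical in both. The constant names below are made up for this sketch.

```python
import re

# Numbered variant, same as the .NET pattern: group 2 is the inner (\w).
NUMBERED = re.compile(r"^((\w)(?!\2))*$")

# Named variant, rewritten in Python's named-group syntax.
NAMED = re.compile(r"^((?P<char>\w)(?!(?P=char)))*$")

for sample in ["abc", "abbc", "abab"]:
    print(sample, NUMBERED.search(sample) is not None, NAMED.search(sample) is not None)
# abc True True
# abbc False False
# abab True True
```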
If you want to only exclude text with consecutive x characters, you may use this
^((?<char>x)(?!\k<char>)|[^x\W])*$
or without backreferences
^(x(?!x)|[^x\W])*$
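A quick Python check of the backreference-free variant, which uses no engine-specific syntax. The sample inputs and the helper name are chosen for this illustration only.

```python
import re

# Either an x that is not followed by another x, or any word character except x.
NO_DOUBLE_X = re.compile(r"^(x(?!x)|[^x\W])*$")

def is_allowed(text):
    return NO_DOUBLE_X.search(text) is not None

print(is_allowed("axbxc"))  # True: every x stands alone
print(is_allowed("axxb"))   # False: "xx"
print(is_allowed("aabb"))   # True: only consecutive x is restricted
```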
All syntax elements for .NET Framework Regular Expressions are explained here.
You can of course use a regex to detect what's wrong as well as what's right. The regex (.)\1 matches any character that is immediately repeated, so you can simply reject any input that it matches. If this is the only validation you need, I think this is far easier than trying to come up with a regex that validates correct input instead.
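A sketch of this "match what is wrong, then invert" approach in Python, with invented names. The second pattern is an addition of this example rather than part of the answer: it folds the inversion into a single expression with a negative lookahead, for situations where a validator can only assert that the input matches.

```python
import re

REPEAT = re.compile(r"(.)\1")

def is_valid(text):
    """Valid means: no character is immediately repeated."""
    return REPEAT.search(text) is None

# Assumption of this sketch: the same rule expressed as a pattern that
# matches only valid input, using a negative lookahead at the start.
VALID_ONLY = re.compile(r"(?!.*(.)\1).*")

for sample in ["hello", "Bar", "a  b"]:
    print(repr(sample), is_valid(sample), VALID_ONLY.fullmatch(sample) is not None)
# 'hello' False False
# 'Bar' True True
# 'a  b' False False
```

Unlike the \w-based patterns, (.) also catches repeated spaces and punctuation, as the last sample shows.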