len function - counting the # of letters in words - python-3.4

In Python, the len() function gives the exact number of letters in a string that contains a single word.
But when I have a string with multiple words, it doesn't give the correct number of letters, because it also counts the spaces between the words.
What would be the correct way to use len() to count the letters in a string with multiple words?

Remove all spaces before counting the length:
string = string.replace(' ', '')
length = len(string)

You can use len([c for c in address if c.isalpha()]). Here I'm assuming that your string is named address. Here is the definition of isalpha from the Python 3.4 docs:
Return true if all characters in the string are alphabetic and there
is at least one character, false otherwise. Alphabetic characters are
those characters defined in the Unicode character database as
“Letter”, i.e., those with general category property being one of
“Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the
“Alphabetic” property defined in the Unicode Standard
We perform this test for each one-character string in the address. Since Python 3 strings are Unicode, this test also counts letters from other alphabets like Greek, Arabic, or Hebrew. I don't know if that's what you want, but if you only have letters from the English alphabet, it will work fine.
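To make the difference between the two suggestions concrete, here is a small sketch. The helper names count_letters and count_non_space and the sample string are made up for illustration; the first counts only letters, the second counts everything except spaces.

```python
def count_letters(text):
    # str.isalpha() is True for any Unicode letter, not only A-Z and a-z.
    return sum(1 for ch in text if ch.isalpha())


def count_non_space(text):
    # Removes only the space character; digits and punctuation still count.
    return len(text.replace(" ", ""))


address = "14th Street 456"  # hypothetical sample value
print(count_letters(address))    # 8
print(count_non_space(address))  # 13
```

Which one is "correct" depends on whether digits and punctuation should count as letters for your purposes.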

You can use a regular expression:
import re
# Raw string, so the backslash is kept literally and not read as an escape
s = r"14th Street 456 */\&^%$-+##!()[]{};.,:"
# Remove anything other than letters
n = re.sub(r'[^a-zA-Z]', "", s)
print(n)
print("length :" , len(n))
Output:
thStreet
length : 8

Related

keep only alphanumeric characters and space in a string using gsub

I have a string which has alphanumeric characters, special characters and non UTF-8 characters. I want to strip the special and non utf-8 characters.
Here's what I've tried:
gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
However, this removes the special characters (punctuation and non-UTF-8), but the output has no spaces.
gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
The result has spaces, but there are still non-UTF-8 characters present.
Any workaround?
For the sample string above, output should be:
Sample string here
You could use the classes [:alnum:] and [:space:] for this:
sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Alternatively you can use PCRE codes to refer to specific character sets:
gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Both cases clearly show that the characters that remain are considered letters. The E, B, H and P in between are also letters, so the condition you are replacing on is not correct. You don't want to keep all letters; you only want to keep A-Z, a-z and 0-9:
gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:
gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
Or, if you want to find a string, even not bound by spaces, use the word boundary \\b instead:
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
What happens here:
.*? matches any character (.) zero or more times (*), but non-greedily (?). This means that gsub will try to match as little as possible with this piece.
Everything between () is captured and can be referred to in the replacement as \\1.
\\b indicates a word boundary.
[A-Za-z0-9 ]+ matches, at least once (+), any character that is A-Z, a-z, 0-9 or a space. You have to write the ranges out that way, because other characters sit between the uppercase and lowercase letters in the code table, so using A-z would match more than the plain letters (and the special letters are valid UTF-8, by the way!).
After that sequence, .* matches anything zero or more times to consume the rest of the string.
The backreference \\1, combined with the .* in the regex, makes sure only the required part remains in the output.
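For readers who want to try the same two ideas outside R, here is a rough Python sketch using re.sub. It is an analogue, not the R code above: the sample string is hypothetical, with ï (written as \u00ef) standing in for the stray non-ASCII characters, and Python's regex engine is not the one gsub uses.

```python
import re

# Hypothetical sample; "\u00ef" (the letter ï) stands in for the stray characters.
sample = "\u00ef+ Sample 2 string here =\u00ef{\u00ef>E\u00efBH\u00efP<]\u00ef{\u00ef>"

# Whitelist: delete everything that is not an ASCII letter, a digit or a space.
whitelisted = re.sub(r"[^A-Za-z0-9 ]", "", sample)

# Capture: keep only the first run of allowed characters that sits between
# word boundaries, and replace the whole string with that captured group.
captured = re.sub(r".*?(\b[A-Za-z0-9 ]+\b).*", r"\1", sample)

print(repr(whitelisted))  # ' Sample 2 string here EBHP'
print(repr(captured))     # 'Sample 2 string here'
```

In Python the backreference is written r"\1" in a raw string, which plays the same role as "\\1" in R.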
stringr uses a different regex engine that supports POSIX character classes. Here ascii names the class, which must generally be written as [:ascii:] within the outer square brackets. The [^ indicates negation of the match.
library(stringr)
str_replace_all("�+ Sample string here =�{�>E�BH�P<]�{�>", "[^[:ascii:]]", "")
results in
[1] "+ Sample string here ={>EBHP<]{>"
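The same "drop everything outside ASCII" idea can be sketched in Python with an explicit code point range instead of a named class. This is only an analogue of the stringr call, and the sample string is hypothetical (\u00ef stands in for the stray characters).

```python
import re

# Hypothetical sample; "\u00ef" stands in for the stray non-ASCII characters.
sample = "\u00ef+ Sample string here =\u00ef{\u00ef>E\u00efBH\u00efP<]\u00ef{\u00ef>"

# [^\x00-\x7F] matches any character outside the 7-bit ASCII range.
ascii_only = re.sub(r"[^\x00-\x7F]", "", sample)

print(ascii_only)  # + Sample string here ={>EBHP<]{>
```

Note that ASCII punctuation survives, exactly as in the stringr result above.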

Regular expression for excluding some specific characters

I am trying to build a regular expression in Qt for the following set of strings:
The set contains all strings of length 1 other than r and z.
The set also includes the strings of length greater than 1 that start with z, followed by any number of z's, and end with a single character that is neither r nor z.
So far I have developed the following:
[a-qs-y]?|z+[a-qs-y]
But it does not work.
The question mark in your regular expression lets the first alternative match either a single lowercase letter other than r and z, or the empty string. Because the empty string can be matched within any string, the second alternative will never be matched against. The rest of your regular expression matches your specification, although you will probably want to make it match only entire strings by anchoring it:
QRegularExpression re("^[a-qs-y]$|^z+[a-qs-y]$");
QRegularExpressionMatch match = re.match("zzza");
if (match.hasMatch()) {
    QString matched = match.captured(0);
    // ...
}
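The effect of the stray question mark is easy to reproduce outside Qt. Below is a small Python sketch (not Qt code; the helper name in_set is made up) that shows the original pattern matching the empty string, and the corrected pattern applied to whole strings with fullmatch, which behaves like the ^...$ anchors.

```python
import re

# Original pattern: the optional first alternative succeeds on the empty string.
original = re.match(r"[a-qs-y]?|z+[a-qs-y]", "zzza")
print(repr(original.group(0)))  # ''

# Corrected pattern; fullmatch() requires the entire string to match.
CORRECTED = re.compile(r"[a-qs-y]|z+[a-qs-y]")


def in_set(text):
    return CORRECTED.fullmatch(text) is not None


print(in_set("a"))     # True
print(in_set("zzza"))  # True
print(in_set("zr"))    # False
```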

How to extract characters from a string based on the text surrounding them in R

(Edited to highlight the language I'm using.) I'm using the R language and I have many large lists of character strings that share a similar format. I am interested in the characters directly in front of a series of characters that is consistently in the string, but not in a consistent place within the string. For instance:
a <- "aabbccddeeff"
b <- "aabbddff"
c <- "aabbffgghhii"
d <- "bbffgghhii"
I am interested in extracting the two characters directly preceding the "ff" in each character string. I can't find any reasonable solution apart from breaking each character string down using grepl() and then processing them each independently, which seems like an inefficient way to do it.
You can match those two characters and capture them with sub and the right regular expression.
Strings = c("aabbccddeeff",
"aabbddff",
"aabbffgghhii",
"bbffgghhii")
sub(".*(\\w\\w)ff.*", "\\1", Strings)
[1] "ee" "dd" "bb" "bb"
Explanation: this replaces the entire string with the two characters before the "ff". If there are multiple "ff" in the string, this expression takes the two characters before the last "ff".
How this works: The three arguments to sub are:
1. a pattern to search for
2. What it will be replaced with
3. The strings to apply it to.
Most of the work is in the pattern part, .*(\\w\\w)ff.*. The ff part of the pattern should be obvious: we are targeting things near the specific string ff. What comes right before it is (\\w\\w). \w refers to a "word character", meaning any letter a-z or A-Z, any digit 0-9, or the one other character _. We want two characters, so we have \\w\\w. Enclosing \\w\\w in parentheses turns this pattern of two characters into a "capture group", a string that will be saved into a variable for later use. Since this is the first (and only) capture group in this expression, those two characters will be stored in a variable called \1.
Now we want only those two characters, so in order to blow away everything before and after them we put .* at the front and back. . matches any character and * means "zero or more times", so .* means zero or more copies of any character. Now we have broken the string into four parts: "ff", the two characters before "ff", everything before that, and everything after the ff. This covers the entire string.
sub will replace the part that was matched (everything) with whatever it says in the substitution pattern, in this case "\\1". That is just how you write a string that evaluates to \1, the name of the variable where we stored the two characters that we want. We write it that way because a backslash "escapes" whatever comes after it: we actually want the character \, so we write \\ to indicate \, and "\\1" evaluates to \1. So everything in the string is replaced by the targeted two characters. We apply this to every string in Strings.
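For comparison, the same capture-group trick can be sketched in Python with re.sub. This is an analogue of the R call, not a replacement for it; the extra string "xxffyyff" is a made-up case to show the "last ff wins" behaviour of the greedy leading .*.

```python
import re

strings = ["aabbccddeeff", "aabbddff", "aabbffgghhii", "bbffgghhii"]

# The whole string is matched and replaced by capture group 1,
# i.e. the two word characters directly before the last "ff".
pattern = r".*(\w\w)ff.*"
before_ff = [re.sub(pattern, r"\1", s) for s in strings]

print(before_ff)                           # ['ee', 'dd', 'bb', 'bb']
print(re.sub(pattern, r"\1", "xxffyyff"))  # yy
```

In Python the raw string r"\1" avoids the doubled backslash that R needs.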

Need Regular Expression for this(C#)

Updated:
Password strength:
Contain characters from three of the following four categories:
English uppercase characters (A through Z)
English lowercase characters (a through z)
Base 10 digits (0 through 9)
Non-alphabetic characters (for example, !, $, #, %)
Is it possible to compare the values entered in two fields with a regex? If yes, then please add one more condition to the above list:
the password must be different from the username entered
EDIT: This answer was written before the question was edited. It originally included the requirement to not include the user's account name, and be at least 8 characters long.
Given that you need to use the user's account name as part of it anyway, is there any reason you particularly want to do this as a regular expression? You may want to use regular expressions to express the patterns for the four categories (although there are other ways of doing it too) but I would write the rules out separately. For example:
// Categories is a list of regexes in this case. You could easily change
// it to anything else.
int categories = Categories.Count(regex => regex.IsMatch(password));
bool valid = password.IndexOf(name, StringComparison.OrdinalIgnoreCase) == -1
&& password.Length >= 8
&& categories >= 3;
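The same rule-by-rule idea can be sketched in Python. This is an illustration under assumptions, not the C# code above: the names CATEGORY_PATTERNS and is_valid_password are made up, and the fourth category is assumed to mean "anything that is not an ASCII letter or digit".

```python
import re

# One pattern per category. The last one is an assumption about what
# "non-alphabetic characters" is meant to cover.
CATEGORY_PATTERNS = [r"[A-Z]", r"[a-z]", r"[0-9]", r"[^A-Za-z0-9]"]


def is_valid_password(password, username):
    # Count how many categories occur at least once in the password.
    categories = sum(1 for p in CATEGORY_PATTERNS if re.search(p, password))
    # Case-insensitive "does not contain the account name" check.
    contains_name = username.lower() in password.lower()
    return not contains_name and len(password) >= 8 and categories >= 3


print(is_valid_password("Abcdef12", "bob"))   # True
print(is_valid_password("Bob12345!", "bob"))  # False
```

Keeping each rule as its own boolean makes it easy to report which rule failed, which is hard to do with a single combined regex.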
If you need to do it in one expression it should be something like this:
^(?:(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])|(?=.*[a-z])(?=.*[A-Z])(?=.*[!%,.;:])|(?=.*[a-z])(?=.*[0-9])(?=.*[!%,.;:])|(?=.*[A-Z])(?=.*[0-9])(?=.*[!%,.;:])).{8,}$
See it here on Regexr
Positive lookaheads (the (?=.*[a-z])) are used to check if the string contains the character group you want.
The problem here is, you want 3 out of 4, that means you have to make an alternation with all the allowed combinations.
The last part .{8,} is then matching the string and checking for at least 8 characters.
^ and $ are anchors, that anchor the pattern to the start and the end of the string.
[!%,.;:] is a character class; here you can add all the characters you want to include. Maybe it's simpler to use a Unicode property like \p{P} for all punctuation characters. For more details see here on regular-expressions.info
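If you want to experiment with the single-expression version, here is the same pattern in a short Python sketch. Python is used only for illustration (lookaheads work the same way there as in .NET); the names PASSWORD_RE and strong_enough are made up.

```python
import re

# Same pattern as above, split across lines: each alternative is one of the
# four allowed 3-out-of-4 combinations, expressed as positive lookaheads.
PASSWORD_RE = re.compile(
    r"^(?:(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])"
    r"|(?=.*[a-z])(?=.*[A-Z])(?=.*[!%,.;:])"
    r"|(?=.*[a-z])(?=.*[0-9])(?=.*[!%,.;:])"
    r"|(?=.*[A-Z])(?=.*[0-9])(?=.*[!%,.;:]))"
    r".{8,}$"
)


def strong_enough(password):
    return PASSWORD_RE.search(password) is not None


print(strong_enough("Abcdef12"))  # True  (lower, upper, digit)
print(strong_enough("abcdef1!"))  # True  (lower, digit, punctuation)
print(strong_enough("abcdefgh"))  # False (only one category)
print(strong_enough("Ab1!"))      # False (shorter than 8)
```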
Update
compare password with username entered they must be different
Normally you should be able to build up your regular expression using string concatenation. I have no idea where you put the regex in your case ...
Something like this (pseudo)
String Username = "FooBar";
regex = "^(?:(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])|(?=.*[a-z])(?=.*[A-Z])(?=.*[!%,.;:])|(?=.*[a-z])(?=.*[0-9])(?=.*[!%,.;:])|(?=.*[A-Z])(?=.*[0-9])(?=.*[!%,.;:]))(?i)(?!.*" + Username + ").+$";
Here I also used the inline modifier (?i) to make the match case-insensitive. The (?!.* is the start of a negative lookahead, meaning the string should not contain ...
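One thing to watch when concatenating a username into a pattern is that the name may itself contain regex metacharacters, so it should probably be escaped first. A minimal Python sketch of just the "must not contain the username" part follows; the function name is made up, and the sketch passes re.IGNORECASE as a flag rather than using a mid-pattern (?i), which newer Python versions reject.

```python
import re


def differs_from_username(password, username):
    # re.escape() makes characters such as "." in the name match literally.
    pattern = r"^(?!.*" + re.escape(username) + r").+$"
    return re.search(pattern, password, flags=re.IGNORECASE) is not None


print(differs_from_username("Secret123", "FooBar"))   # True
print(differs_from_username("xxFOOBARyy", "FooBar"))  # False
```

In .NET the corresponding escaping helper is Regex.Escape.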

How do I write a regular expression that will match if the 6th character of a string is one of two different letters?

I'm trying to write a validator for an ASP.NET TextBox.
How can I validate so the regular expression will only match if the 6th character is a "C" or a "P"?
^.{5}[CP] will match strings starting with any five characters and then a C or P.
Depending on exactly what you want, you are looking for something like:
^.{5}[CP]
The ^ says to start from the beginning of the string, the . defines any character, the {5} says that the . must match 5 times, then the [CP] says the next character must be part of the character class CP - i.e. either a C or a P.
^.{5}[CP] -- the trick is the {}, they match a certain number of characters.
^.{5}[CP] has a few important pieces:
^ = from the beginning
. = match anything
{5} = make the previous match the number of times in braces
[CP] = match any one of the specific items in brackets
so the regex spoken would be something like "from the beginning of the string, match anything five times, then match a 'C' or 'P'"
[a-zA-Z0-9]{5}[CP] will match any five letters or digits and then a C or P. Note that without a leading ^ it can match anywhere in the string, not only at the start.
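A quick way to check the difference between the anchored and unanchored patterns is to run both against a few inputs. The sketch below uses Python's re module purely for illustration (the pattern syntax used here is the same in .NET); the helper name and the test strings are made up.

```python
import re

SIXTH_IS_C_OR_P = re.compile(r"^.{5}[CP]")


def sixth_is_c_or_p(text):
    # Anchored at the start: exactly five characters, then C or P.
    return SIXTH_IS_C_OR_P.match(text) is not None


print(sixth_is_c_or_p("12345C"))     # True
print(sixth_is_c_or_p("12345P678"))  # True
print(sixth_is_c_or_p("12345X"))     # False
print(sixth_is_c_or_p("1234C"))      # False

# The unanchored alphanumeric variant can match further into the string:
unanchored = re.search(r"[a-zA-Z0-9]{5}[CP]", "!!abcdeCzz")
print(unanchored is not None)             # True
print(sixth_is_c_or_p("!!abcdeCzz"))      # False
```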
