Unable to extract (US) Zipcode from doc file

I need to get the ZIP code from the Resume.doc file, but I have not succeeded.
The pattern works with a static string (it validates the static string), but I am unable to parse the ZIP code from the .doc file.
Here is my code:
protected void zipcodeGetter()
{
    var path = "C:\\Users\\Jatinder\\Desktop\\LUCENE\\Resume\\Jeffrey.doc";
    Document doc = new Document();
    string html = File.ReadAllText(path);
    using (StreamReader sr = new StreamReader(path, System.Text.Encoding.Default))
    {
        html = sr.ReadToEnd();
    }
    const string MatchPhondePattern = @"^\d{5}(?:[-\s]\d{4})?$";
    Regex rx = new Regex(MatchPhondePattern, RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
    MatchCollection matches = rx.Matches(html);
    // Report the number of matches found.
    int noOfMatches = matches.Count;
    foreach (Match match in matches)
    {
        // Do something with the matches
        string tempPhoneNumber = match.Value.ToString();
    }
}
Can anyone help me with this?

Your code just won't work with that regular expression.
This problem is complicated and your best option is to use a service from a company that does this. They will have a robust system.
Here is a quote from an article on regex and addresses:
We get a lot of questions from programmers about parsing addresses. We see a lot of people trying to use regular expressions for street addresses, and as the address user experience experts, we cringe whenever another programmer falls prey to this trap. We hope that this information will save you some trouble, and if your searching is in vain, please feel free to ask us any questions you have about addresses. ...
Should you use regular expressions to parse street addresses? The short answer is, "Probably not." Because of the wide variance in address content and formatting, addresses aren't "regular"—an indispensable factor in using regular expressions to process information.
Now, some notes and hints about your regular expressions.
I used RegExr to make an example of the regular expression you used. As you can see, there are no highlighted regions, meaning your regular expression won't work.
If you just want to match five consecutive digits, the regular expression is: [0-9]{5}. Here is an example.
You can't just use ^ and $. Without RegexOptions.Multiline they anchor the match to the start and end of the whole input, and with it to the start and end of each line. Either way, a zip code with a space or a period before or after it will not match.
The problem with not having any other qualifiers, however, is you will match long numbers, too. In other words, with a string like 1234567890, you will match [0-9]{5} because there are five consecutive digits in that string.
It's hard to qualify the regular expression with possible punctuation or spaces before or after the match, because what if the match is at the beginning or end of the line? It will miss some.
Here is a regex that might be useful to you. It seems to work in a lot of cases. You can see the example here, with more explanation.
(?<=\W|^)\d{5}(-?\d{4})?(?=\W|$)
(Full disclosure: I work for SmartyStreets and we have an API that does this. Check out the API docs if you're interested.)
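As a rough illustration of the lookaround pattern above, here is a minimal Java sketch. It assumes the resume text has already been extracted into a plain string (reading a binary .doc file as text is a separate problem), and the class and method names are made up for this example. Java accepts the same pattern once the backslashes are doubled inside the string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
class ZipFinder {
    // Pattern from the answer: five digits, an optional ZIP+4 part, and
    // lookarounds so the digits are not part of a longer run of word characters.
    private static final Pattern ZIP =
            Pattern.compile("(?<=\\W|^)\\d{5}(-?\\d{4})?(?=\\W|$)");

    // Returns every ZIP-like token found in the text, in order of appearance.
    static List<String> findZipCodes(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ZIP.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

With the input "Springfield, IL 62704-1234. Call 1234567890 or write to 90210", findZipCodes should return 62704-1234 and 90210 and skip the ten-digit phone number, because no five-digit slice of it is bounded by non-word characters on both sides.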

Related

Regex to match for multiple strings anywhere within a string

I'm trying to add some validation in ASP.NET using Regex. Essentially I need to ensure a text box includes both ***ThisString*** and ***ThatString*** including the asterisks.
I can get it to work with one, or with one or the other, but not with both being present at the same time, anywhere in the string it's validating.
Thanks

As nanhydrin correctly pointed out, my solution will not work if one of the strings appears multiple times but the other does not appear at all. If that case may occur, you can check for each string separately, which is also more readable:
First regular expression- (?:\*{3}ThisString\*{3})
Second regular expression- (?:\*{3}ThatString\*{3})
If matches are found in both cases, you're good to go!
Original Answer:-
This is the regular expression you want: (?:\*{3}(?:ThisString|ThatString)\*{3})
Note: Make sure to have global match on and be sure to escape the asterisks correctly.
If the above expression finds 2 (or more) matches, it means you're good to go.
Explanation:-
The entire thing is in a non-capturing group, so everything within it is matched as a unit.
There are 3 stars on each side of the strings; having 3 stars on one side but not the other will not result in a match.
Both ThisString and ThatString are in a grouped alternative to reduce clutter. You could spell out every possible ordering, but this is simpler because position doesn't matter here: ***ThatString*** can come before ***ThisString*** or vice versa.
MAKE SURE to check the number of matches found; it must be 2 for your described condition to be satisfied.
Here's the live demo

Using @Chase's answer I was able to come up with the following:
protected override ValidationResult IsValid(object value, ValidationContext validationContext)
{
    // Verbatim strings (@"...") pass the backslashes through to the regex engine unchanged.
    Regex thisString = new Regex(@"(?:\*{3}ThisString\*{3})");
    Regex thatString = new Regex(@"(?:\*{3}ThatString\*{3})");
    if (!thisString.IsMatch(value.ToString()) || !thatString.IsMatch(value.ToString()))
    {
        return new ValidationResult("***ThisString*** and ***ThatString*** are used to generate the email text. Please ensure the text above has both ***ThisString*** and ***ThatString*** somewhere within the text.");
    }
    return ValidationResult.Success;
}
Now if either pattern fails to match anywhere in the text, it returns an error.

GA Search & Replace Filter

I wonder whether someone can help me please.
I have the following URI in GA: /invite/accept-invitation/accepted/B
Which I'd like to change to: /invite/accept-invitation/accepted
I've tried a 'Search and Replace' filter as follows:
Search String - /invite/accept-invitation/accepted/*
Replace String - /invite/accept-invitation/accepted
But the result I get is:
/inviteaccept-invitation/accepted/B
Could someone tell me where I've gone wrong with this please?
Many thanks and kind regards
Chris

Google Analytics "Search and replace" filter uses regular expressions. More precisely:
Replace string is either a regular string or it can refer to group
patterns in the search expression using backslash-escaped single
digits like (\0 to \9).
More details are available on the filter settings UI, which also refers to this link.
So in your case, the search string would be something like this.
\/invite\/accept-invitation\/accepted\/\w+
In this expression each / is escaped with a backslash. The last part of your URI is matched by \w+, which
matches any word character (equal to [a-zA-Z0-9_]), between one and unlimited times, as many times as possible.
The Replace string doesn't have to be a regular expression. So in your case, your original version could be used:
/invite/accept-invitation/accepted
Putting this together gives the desired output in my test view.
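To sanity-check the search expression outside of Google Analytics, the same search and replace can be approximated with an ordinary regex engine. The Java sketch below is only a stand-in: it assumes the GA filter behaves like a plain regex replace for this pattern, it uses the replace string from the question (no trailing slash), and the class and method names are invented for the example.

```java
// Hypothetical stand-in for the GA "Search and Replace" filter.
class UriRewrite {
    // Search string from the answer, with backslashes doubled for a Java literal.
    static final String SEARCH = "\\/invite\\/accept-invitation\\/accepted\\/\\w+";
    // Replace string from the question: a plain string, no group references.
    static final String REPLACE = "/invite/accept-invitation/accepted";

    // Applies the search and replace to one request URI.
    static String rewrite(String uri) {
        return uri.replaceAll(SEARCH, REPLACE);
    }
}
```

Here rewrite("/invite/accept-invitation/accepted/B") should give /invite/accept-invitation/accepted, while a URI that does not match the search string is returned unchanged.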

regular expression for date with Starting and Ending date

I am using the following regular expression for dates in the format "MM/DD/YYYY":
"^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$"
It's working fine. Now I want to limit the year to the range 1950 to 2050. How can I do this?

So the answer depends on how you want to accomplish the task.
Your current Regex search pattern is going to match on most dates in the format "MM/DD/YYYY" in the 20th and 21st century. So one approach is to loop through the resulting matches, which are represented as string values at this point, and parse each string into a DateTime. Then you can do some range validation checking.
(Note: I removed the beginning ^ and ending $ from your original to make my example work)
string input = "This is one date 07/04/1776 and this is another 12/07/1941. Today is 08/10/2019.";
string pattern = "(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\\d\\d";
List<DateTime> list = new List<DateTime>();
foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine(match.Value);
    DateTime result;
    if (DateTime.TryParse(match.Value, out result))
    {
        if (result.Year >= 1950 && result.Year <= 2050)
        {
            list.Add(result);
        }
    }
}
Console.WriteLine("Number of valid dates: {0}", list.Count);
This code outputs the following, noting that 1776 is not matched, the other two dates are, but only the last one is added to the list.
12/07/1941
08/10/2019
Number of valid dates: 1
Although this approach has some drawbacks, such as looping over the results a second time to try and do the range validation, there are some advantages as well.
The built-in DateTime methods in the framework are easier to deal with, rather than constantly adjusting the Regex search pattern as your acceptable range can move over time.
By range checking afterward, you could also simplify your Regex search pattern to be more inclusive, perhaps even getting all dates.
A simpler Regex search pattern is easier to maintain and makes the intent of the code clearer. Regex patterns can be confusing and tricky to decipher, especially for less experienced coders.
Complex Regex search patterns can introduce subtle bugs. Make sure you have good unit tests wrapped around your code.
Of course your other approach is to adjust the Regex search pattern so that you don't have to parse and check afterwards. In most cases this is going to be the best option. Your search pattern is not returning any values that are outside the range, so you don't have to loop or do any additional checking at that point. Just remember those unit tests!
As @skywalker pointed out in his answer, this pattern should work for you.
string pattern = "(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19[5-9][0-9]|20[0-4][0-9]|2050)";

Years 1950 through 2050, both inclusive, can be matched using 19[5-9][0-9]|20[0-4][0-9]|2050
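A quick way to convince yourself that the year alternation covers exactly 1950 to 2050 is to try the boundary values. The Java sketch below combines the question's month and day groups with that year alternation. The class and method names are made up for illustration, and like the original pattern it does not reject impossible dates such as 02/31.

```java
import java.util.regex.Pattern;

// Hypothetical checker for MM/DD/YYYY dates with the year limited to 1950-2050.
class DateRangeCheck {
    private static final Pattern DATE = Pattern.compile(
            "(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.]"
            + "(19[5-9][0-9]|20[0-4][0-9]|2050)");

    // matches() requires the whole string to fit the pattern,
    // so no explicit ^ and $ anchors are needed here.
    static boolean isInRange(String date) {
        return DATE.matcher(date).matches();
    }
}
```

The boundaries behave as expected: 01/01/1950 and 12/31/2050 are accepted, while 12/31/1949 and 01/01/2051 are rejected.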

ASP.NET Routing Regex to match specific pattern

I am trying to write a regular expression for ASP.NET MapPageRoute that matches a specific type of path.
I do not want to match anything with a file extension, so I used the regex ^[^.]*$, which worked fine except that it also matched when the default document was requested. I do not want it to pick up the default document, so I have been trying to change it to require at least one character. I tried adding .{1,} or .+ to the beginning of the working regex, but it stopped working altogether.
routes.MapPageRoute("content", "{*contentpath}", "~/Content.aspx", true, new RouteValueDictionary { }, new RouteValueDictionary { { "contentpath", @"^[^.]*$" } });
How can I change my regex to accomplish this?
Unfortunately my brain does not seem capable of learning regular expressions properly.

You want to change your * quantifier to +. * matches zero or more times, whereas + matches one or more. So, what you are asking for is this:
^[^.]+$
The regex is accomplishing this: "At the beginning of the string, match all characters that are not ., at least one time, up to the end of the string."

^[^.]+$
zero is to * as one is to +
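The difference between the two quantifiers is easy to see on the empty path, which is what the default document request looks like to the constraint. This is a small Java sketch using a general-purpose regex engine rather than the ASP.NET routing code itself; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo comparing the original constraint with the suggested one.
class RouteConstraintDemo {
    static final Pattern ZERO_OR_MORE = Pattern.compile("^[^.]*$");
    static final Pattern ONE_OR_MORE = Pattern.compile("^[^.]+$");

    // True when the constraint pattern accepts the given path.
    static boolean accepts(Pattern constraint, String path) {
        return constraint.matcher(path).find();
    }
}
```

Both patterns accept a path like about/team and reject styles/site.css, but only the * version accepts the empty string.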

Word count of a string

How can I count the words in a document and get the same result as MS Office?

In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.
Then it's simply a matter of counting the occurrences of the aforementioned word definition.
The hard part here will be the parsing of the Office document. Luckily for you, Microsoft has released their proprietary format specification!
It's a bit long-winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins).
Also, have a look at this blog post about the format of Office documents, which has some helpful information (courtesy of Google).

Without knowing your environment all I can tell you is that you would need to implement something like this:
Take the entire document as a string.
Split the string on whitespace.
The number of items in the resulting sequence will be the number of words in the document.

Basic word splitting uses whitespace and punctuation characters (.,?!"'- etc.; indeed, usually any non-alphanumeric character) to split the words.
Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.
You will have to decide whether numbers are "words" or not. And whether "$123,456.78" is one word or three.
You may also want to apply other rules. For example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase styles, you may want to take the "words" you have found and check whether they have uppercase characters in the middle of the words.
Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
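The basic splitting rule described above (treat any run of non-alphanumeric characters as a separator and never count the gaps) can be sketched in Java as follows. This is only one possible definition of a "word", so the count will not necessarily agree with MS Word; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical counter: a "word" is a maximal run of letters and digits.
class WordCounter {
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Counts runs of letters/digits; punctuation and whitespace only separate.
    static int countWords(String text) {
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Under this definition "$123,456.78" counts as three words and "don't" as two, which is exactly the kind of decision the answer says you have to make for yourself.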

If you take the whole document as a String, this code (in Java) may work for you:
private int wordCount(String str) {
    int count = 0;
    // Split on runs of whitespace, then count only the tokens that still
    // contain a letter, digit, or underscore once punctuation is stripped.
    for (String token : str.trim().split("\\s+")) {
        if (!token.replaceAll("[^\\w]", "").isEmpty()) {
            count++;
        }
    }
    return count;
}
