Find & replace text not already inside an <a> tag - Regex .NET - ASP.NET

I am working with XML data in .NET from the Federal Register, which contains many references to Executive Orders and chapters of the U.S. Code.
I'd like to hyperlink these references, unless they're already inside an <a> tag (which is determined by the XML, and often links within the document itself).
The pattern I've written matches the leading and trailing boundary characters and strips them from the output, even if I include the boundary character in the replacement string:
[?!]([0-9]{1,2})[ ]{0,1}(U\.S\.C\.|USC)[\s]{0,1}([0-9]{1,5})(\b)[^]
An example of the initial XML:
<p>The Regulatory Flexibility Act of 1980 (RFA), 5 U.S.C. 604(b), as amended, requires Federal agencies to consider the potential impact of regulations on small entities during rulemaking.</p>
<p>Small entities include small businesses, small not-for-profit organizations, and small governmental jurisdictions.</p>
<p>Section 605 of the RFA allows an agency to certify a rule, in lieu of preparing an analysis, if the rulemaking is not expected to have a significant economic impact on a substantial number of small entities. Reference: 13 USC 401</p>
<ul>
<li><em>Related laws from 14USC301-345 do not apply.</em></li>
<li>14 USC 301 does apply.</li>
</ul>
As you can see, some references include ranges of U.S. Code sections (e.g. 14 USC 301-345) or references to specific subsections (e.g. 5 U.S.C. 604(b)). I'd only want to link the first reference in a range, so the link should terminate at the - or the (.

If I'm understanding you correctly, I think the following should work.
var re = new Regex(@"\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)");
var matches = re.Matches(text);
// matches[0].Value = 5 U.S.C. 604
// matches[1].Value = 14USC301
You might even be able to simplify the regex to \d+\s?U\.?S\.?C\.?\s?\d+\b(?!</a>) – I'm not sure if the upper limits of 2 and 5 are significant.
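If you want to experiment with the pattern outside .NET first, here is a quick Python sketch; the pattern syntax is the same for this case, and the href value is only a placeholder for whatever URL scheme you end up using:
import re

# Same pattern as above; the (?!</a>) lookahead skips citations that are
# immediately followed by a closing anchor tag.
pattern = re.compile(r'\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)')

text = ('<p>..., 5 U.S.C. 604(b), as amended ...</p>'
        '<li><em>Related laws from 14USC301-345 do not apply.</em></li>')

# Wrap each match in a link; href="#" stands in for the real target.
linked = pattern.sub(lambda m: '<a href="#">{}</a>'.format(m.group(0)), text)
print(linked)
# <p>..., <a href="#">5 U.S.C. 604</a>(b), as amended ...</p>
# <li><em>Related laws from <a href="#">14USC301</a>-345 do not apply.</em></li>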

Related

Meaning of the numbers on the side of paragraphs in the Ada Reference Manual

The Reference Manual paragraphs have a "side number" (that is what I call them).
For example, in the attached screenshot of the Reference Manual Introduction, the first "side numbers" are 1, 2, 3/3, 4/1, 5/3, 6/3, 7.
What is the meaning of the number after the slash sign?
I could not find the explanation in http://www.ada-auth.org.
See the final paragraph of the Introduction of the latest Ada Reference Manual: www.ada-auth.org/standards/22rm/html/RM-0-2.html#p73
Copy-paste:
Using this version of the Ada Reference Manual
72/5
This document has been revised with the corrections specified in Technical Corrigendum 1 for Ada 2012 (which corresponds to ISO/IEC 8652:2012/COR.1:2016) and other changes specifically for Ada 2022. In addition, a variety of editorial errors have been corrected.
73/5
Changes to the original 1995 version of the Ada Reference Manual can be identified by the version number following the paragraph number. Paragraphs with a version number of /1 were changed by Technical Corrigendum 1 for Ada 95 or were editorial corrections at that time, while paragraphs with a version number of /2 were changed by Amendment 1 or were more recent editorial corrections, and paragraphs with a version number of /3 were changed by the 2012 edition of the Reference Manual or were still more recent editorial corrections. Paragraphs with a version number of /4 are changed by Technical Corrigendum 1 for Ada 2012 or were editorial corrections at that time. Paragraphs with a version number of /5 are changes or editorial corrections for Ada 2022. Paragraphs not so marked are unchanged since the original 1995 edition of the Ada Reference Manual, and have the same paragraph numbers as in that edition. In addition, some versions of this document include revision bars near the paragraph numbers. Where paragraphs are inserted, the paragraph numbers are of the form pp.nn, where pp is the number of the preceding paragraph, and nn is an insertion number. For instance, the first paragraph inserted after paragraph 8 is numbered 8.1, the second paragraph inserted is numbered 8.2, and so on. Deleted paragraphs are indicated by the text This paragraph was deleted. Deleted paragraphs include empty paragraphs that were numbered in the 1995 edition of the Ada Reference Manual. 

R: read data from a space-delimited txt file with quoted text

I'm trying to load a dataset into RStudio. The dataset itself is space-delimited, but it also contains spaces inside quoted text, as in CSV files. Here is the head of the data:
DOC_ID LABEL RATING VERIFIED_PURCHASE PRODUCT_CATEGORY PRODUCT_ID PRODUCT_TITLE REVIEW_TITLE REVIEW_TEXT
1 __label1__ 4 N PC B00008NG7N "Targus PAUK10U Ultra Mini USB Keypad, Black" useful "When least you think so, this product will save the day. Just keep it around just in case you need it for something."
2 __label1__ 4 Y Wireless B00LH0Y3NM Note 3 Battery : Stalion Strength Replacement 3200mAh Li-Ion Battery for Samsung Galaxy Note 3 [24-Month Warranty] with NFC Chip + Google Wallet Capable New era for batteries Lithium batteries are something new introduced in the market there average developing cost is relatively high but Stallion doesn't compromise on quality and provides us with the best at a low cost.<br />There are so many in built technical assistants that act like a sensor in their particular forté. The battery keeps my phone charged up and it works at every voltage and a high voltage is never risked.
3 __label1__ 3 N Baby B000I5UZ1Q "Fisher-Price Papasan Cradle Swing, Starlight" doesn't swing very well. "I purchased this swing for my baby. She is 6 months now and has pretty much out grown it. It is very loud and doesn't swing very well. It is beautiful though. I love the colors and it has a lot of settings, but I don't think it was worth the money."
4 __label1__ 4 N Office Products B003822IRA Casio MS-80B Standard Function Desktop Calculator Great computing! I was looking for an inexpensive desk calcolatur and here it is. It works and does everything I need. Only issue is that it tilts slightly to one side so when I hit any keys it rocks a little bit. Not a big deal.
5 __label1__ 4 N Beauty B00PWSAXAM Shine Whitening - Zero Peroxide Teeth Whitening System - No Sensitivity Only use twice a week "I only use it twice a week and the results are great. I have used other teeth whitening solutions and most of them, for the same results I would have to use it at least three times a week. Will keep using this because of the potency of the solution and also the technique of the trays, it keeps everything in my teeth, in my mouth."
6 __label1__ 3 N Health & Personal Care B00686HNUK Tobacco Pipe Stand - Fold-away Portable - Light Weight - For Single Pipe not sure I'm not sure what this is supposed to be but I would recommend that you do a little more research into the culture of using pipes if you plan on giving this as a gift or using it yourself.
7 __label1__ 4 N Toys B00NUG865W ESPN 2-Piece Table Tennis PING PONG TABLE GREAT FOR YOUTHS AND FAMILY "Pleased with ping pong table. 11 year old and 13 year old having a blast, plus lots of family entertainment too. Plus better than kids sitting on video games all day. A friend put it together. I do believe that was a challenge, but nothing they could not handle"
8 __label1__ 4 Y Beauty B00QUL8VX6 "Abundant Health 25% Vitamin C Serum with Vitamin E and Hyaluronic Acid for Youthful Looking Skin, 1 fl. oz." Great vitamin C serum "Great vitamin C serum... I really like the oil feeling, not too sticky. I used it last week on some of my recent bug bites and it helps heal the skin faster than normal."
9 __label1__ 4 N Health & Personal Care B004YHKVCM PODS Spring Meadow HE Turbo Laundry Detergent Pacs 77-load Tub wonderful detergent. "I've used tide pods laundry detergent for many years,its such a great detergent to use having a nice scent and leaver the cloths smelling fresh."
The problem is that it looks tab-delimited but it is not. For example, in the row with DOC_ID = 1, there are only two spaces between useful and "When least...". Because of this, passing sep = "\t" to read.table throws an error saying that line 1 did not have 10 elements, which for some reason is incorrect, because the number of elements should be 9. Here are the parameters that I'm passing (without the original path):
read.table(file = "path", sep ="\t", header = TRUE, strip.white = TRUE)
Relying on quotes is also not a good strategy, because some lines do not have their text quoted. The delimiter would ideally be something like a double space, which combined with strip.white should work properly, but read.table only accepts single-byte delimiters.
So the question is: how would you parse such a corpus in R, or with any other third-party software that could convert it adequately to a CSV or at least a tab-delimited file?
Parsing the data with Python's pandas.read_csv(filename, sep='\t', header=0, ...) seems to have parsed the data successfully, and from this point anything can be done with it. Closing this out.
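For reference, a minimal self-contained version of that pandas call (the filename is a placeholder):
import pandas as pd

# pandas honors the embedded double quotes as field quoting, which is
# what lets the mixed quoted/unquoted rows parse cleanly.
df = pd.read_csv("reviews.txt", sep="\t", header=0)

print(df.shape)  # expect 9 columns, matching the header row
print(df[["DOC_ID", "PRODUCT_CATEGORY", "REVIEW_TITLE"]].head())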

XPath to extract all text between two 'p' elements in Scrapy

I am trying to scrape a database using Scrapy and Splash. It requires a login, so unfortunately I am unable to share the full website. The database contains a list of companies showing their name and a short description.
I am struggling to find an XPath expression that would yield all the text between the two 'p' tags as shown:
<p class="pre-wrap ng-binding"
ng-bind-html="object._source.startup.general_information.project_public_description"
ng-click="listView.showDetail(object)" role="button" tabindex="0">
<div>With the vision of providing creative sustainable solutions for global food crisis,
AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
addressing the ever-growing demand for fish protein. Company’s additives improve both growth
performance and feed utilization, enabling the <strong><em>growth of more fish with less
feed</em></strong>. A unique peptide production system, enables large commercial
scale production at significant lower cost and carbon footprint. Growing more fish with less
feed also promote several SDG’s including the reduction of pressure on fish population in
the sea, providing food security and reducing hunger and poverty, climate change and
responsible production. </div>
</p>
All the company descriptions are in the same format (between two 'p' elements), but as shown in the HTML, there are <strong><em> elements as well. I would like help creating an XPath that gets all the text, including the text inside the <strong><em> element, as one single text block (that would be one description; when viewed on the website there is no separation in the text).
I tried the following, but that only gets the part before the <strong><em> element: //p[@class='pre-wrap ng-binding']//div//text()
I used the following code:
'the descript': ''.join(startup.xpath('//div//text()').getall()),
scrapy shell
In [1]: html = """<html>
...: <body>
...: <p class="pre-wrap ng-binding"
...: ng-bind-html="object._source.startup.general_information.project_public_description"
...: ng-click="listView.showDetail(object)" role="button" tabindex="0">
...: <div>With the vision of providing creative sustainable solutions for global food crisis,
...: AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,
...: addressing the ever-growing demand for fish protein. Company’s additives improve both growth
...: performance and feed utilization, enabling the <strong><em>growth of more fish with less
...: feed</em></strong>. A unique peptide production system, enables large commercial
...: scale production at significant lower cost and carbon footprint. Growing more fish with less
...: feed also promote several SDG’s including the reduction of pressure on fish population in
...: the sea, providing food security and reducing hunger and poverty, climate change and
...: responsible production. </div>
...: </p>
...: </body>
...: </html>"""
In [2]: selector = scrapy.Selector(text=html)
In [3]: ''.join(selector.xpath('//div//text()').getall())
Out[3]: 'With the vision of providing creative sustainable solutions for global food crisis,\n AquiNovo develops innovative, non-GMO, non-hormonal, peptide-based feed additives,\n addressing the ever-growing demand for fish protein. Company’s additives improve both growth\n performance and feed utilization, enabling the growth of more fish with less\n feed. A unique peptide production system, enables large commercial\n scale production at significant lower cost and carbon footprint. Growing more fish with less\n feed also promote several SDG’s including the reduction of pressure on fish population in\n the sea, providing food security and reducing hunger and poverty, climate change and\n responsible production.\xa0'
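As a side note, XPath's string() function would also collapse the whole div, including the <strong><em> fragment, into one string, so the manual join isn't needed:
# string() concatenates all descendant text nodes in document order.
selector.xpath('string(//p[@class="pre-wrap ng-binding"]/div)').get()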

Determine user location based on latitude-longitude

I am planning to build a system that allows a user to enter 10 values (digits, characters) from which I can determine their location.
I would like some mathematical method, or anything else, that allows me to convert a (latitude, longitude) pair into one string of digits and characters.
Is it possible to do that? If yes, please give me a hint on how I can do it!
Thanks.
At a code length of 10 characters, an Open Location Code (a.k.a. “Plus Code”) gives about 14m of resolution. Usually you'd have a + between the first 8 and the last 2 characters, but you can infer that. You can type and find these codes easily in Google Maps.
Geohash uses base 32 instead of base 20, so each character provides more information. 8 characters there already give you 19m resolution, the way I read Wikipedia. There is a chance you'd accidentally have obscenities in your code, though, which other codes try harder to avoid.
Geohash-36 uses a base of 36 characters, and avoids vowels (to prevent obscenities), but relies on character case. Wikipedia gives the accuracy of 10 characters as ⅙m.
All of these are well documented and probably have freely accessible reference implementations, too. You can also read about the design principles behind these.
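For example, with Google's reference Python implementation of Open Location Code (pip install openlocationcode; the coordinates are just an example):
from openlocationcode import openlocationcode as olc

# Encode a latitude/longitude pair into a 10-character code (~14 m cell).
code = olc.encode(38.8977, -77.0365, codeLength=10)
print(code)  # e.g. '87C4VXX7+39'

# Decode back to the cell's center point.
area = olc.decode(code)
print(area.latitudeCenter, area.longitudeCenter)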

Dictionary Training for Different Languages

I am working on a messaging system and got the idea of storing the messages in each inbox independently. While working on that idea, I asked myself: why not compress the messages? So I am looking for a good way to obtain dictionaries in different languages.
Since the messages are highly related to everyday talk (social chatter), I need a good source and method for that.
I need a large amount of text for this, like a bunch of millions of emails, books, etc. I would like to create a Huffman tree out of it, with the ability to inline and represent each message as a string within this Huffman tree, so decoding would be fast enough.
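For illustration, a minimal Python sketch of that Huffman step, trained here on a toy string rather than a real per-language corpus:
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries are (frequency, tiebreaker, tree); a tree is either a
    # character or a (left, right) pair. The tiebreaker keeps comparisons
    # away from the tree objects.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (t1, t2)))
        i += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2])
    return codes

print(huffman_codes("social bla bla everyday talking"))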
The languages I want to support are all over the place. Since newspapers and the like might not be sufficient, I need other sources.
[Update]
As I continued my research, I noticed that I actually create two dictionaries out of Wikipedia. The first is a dictionary containing typical characters, each with a certain probability. I also noticed that the special characters I use for each language seem to have an even distribution among Latin-based languages (well, actually Latin is just one member of the language family), and even Russian tends to have the same distribution despite its quite different alphabet.
I also noticed that in around 15 to 18% of cases a special character (like ' ', ',', ';') follows another special character. Beyond that, the 8 most frequent words yield 10% of all word occurrences, the next 16 words yield 9%, and so on; by around 128 (160 words) you reach a yield of 80% of all words. So storing the next 256 and more words becomes senseless in terms of analysis. This leaves me with three dictionaries (characters, words, special characters) per language of around 2 to 5KB (I use a special format with prefix compression). They save me between 30% and 60% in character reduction, and when you remember that in Java each character takes 16 bits, the overall reduction is even greater, making it 1:5 to 1:10 once the characters to be inserted are also compressed with the Huffman tree.
Using such a system, and compressing numbers as variable-length integers, one produces a byte array that can be used for string matching. It loads faster, and checking whether a word is contained is quicker than character-by-character comparison, since one can check for complete words directly without needing to tokenize or recognize words in the first place.
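The variable-length integer part can be sketched like this (a LEB128-style encoding, 7 bits per byte with a continuation bit; illustrative, not necessarily the exact format used above):
def encode_varint(n):
    # Emit 7 bits per byte, setting the high bit while more bytes follow.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    # Inverse of encode_varint; ignores any trailing bytes.
    result, shift = 0, 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return result

assert decode_varint(encode_varint(300)) == 300  # 300 -> b'\xac\x02'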
This also solved the problem of supporting string keys, since I can just transform the string in each language, and it results in a set of keys I can use for lookups.
