Pull text from website - web-scraping

I have a task in front of me: documenting many thousands of public land records, which I record in a spreadsheet.
There are three pieces of information I need from each record: SECTION, TOWNSHIP, and RANGE. That's all I care about.
http://i843.photobucket.com/albums/zz360/mattr1992/ndrin_zpsdc360ac8.png
Here is my source. As you can see, each entry has what I'm looking for (section/township/range), although the values differ from entry to entry.
I would like to pull the section/township/range of each entry into a spreadsheet. How would I do this?

If you can copy the webpage into a plain text file, you could use a regular expression like section: [0-9]* township: [0-9]* range: [0-9]* to capture all the information, and then import the result into Excel, which can easily separate the values into different columns.
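A minimal sketch of that approach in Python, assuming the page text has already been saved as plain text. The label wording ("Section:", "Township:", "Range:") and the digits-only values are guesses about the page layout, and the sample text is made up, so the pattern will likely need adjusting to the real records.

```python
import csv
import io
import re

# Assumed layout: "Section: 12 Township: 153 Range: 100".
# The labels and digits-only values are guesses, not taken from the real page.
PATTERN = re.compile(
    r"section:\s*([0-9]+)\s+township:\s*([0-9]+)\s+range:\s*([0-9]+)",
    re.IGNORECASE,
)


def extract_records(text):
    """Return a list of (section, township, range) string tuples found in text."""
    return PATTERN.findall(text)


def to_csv(records):
    """Render the records as CSV text that Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["section", "township", "range"])
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample standing in for the copied page text.
sample = """Well A  Section: 12 Township: 153 Range: 100
Well B  Section: 7 Township: 154 Range: 99"""

records = extract_records(sample)
print(to_csv(records))
```

Writing the output of to_csv to a .csv file gives one row per entry with the three values already in separate columns.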

Related

How to export a complex element in a doc showing just a property and keeping all other information?

I need to import and export some documents from my web app, written in .NET Core, to docx and vice versa: users should be able to export, modify offline, and import back. Currently I am using OpenXml-PowerTools to export.
The problem is that there is dynamic content that shows the current value of some fields in the database. I should be able to export the document showing a face value (for instance, an amount of money), and when importing it back I should be able to recover the original reference (which is an object containing an expression and operations, like "sum_db_1 + sum_db_2", plus information about number formatting and so on). Of course, if needed, everything can be treated as a String instead of a complex object.
In the original document the face value is shown (a text or an amount) while the original formula is stored like in this xml:
<reuse-link reuse="reuse-link">
<reuse-param name="figname" value="exp_sum_n"></reuse-param>
<reuse-param name="format" value="MC"></reuse-param>
</reuse-link>
In short, I need to export a complex object to Word so that it shows the face value and also keeps the additional fields of the original object somewhere, so they can be retrieved when the document is imported back. Editing the "complex" values offline does not need to be supported.
How can I achieve this?
I tried to negotiate with the customers, explaining that they should only edit online, but they are unwilling to change their internal workflow, which involves exchanging the document between various parties.
Thank you in advance for your help.
I suggest you use one or more Custom XML Parts to store whatever additional information you need. You will probably need to create a naming convention that will allow you to relate elements/attributes in those Parts to the "face values" (however they may be expressed).
A Custom XML Part can store any XML (the content does have to be valid XML). As long as you create them, and the necessary relationships, in the .docx or Flat OPC format .xml file, the Parts should survive normal use - i.e. the user would have to do something unusual to delete them.
You could also store the information in Word document variables, but Custom XML Parts look like a better fit to your scenario.
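As a rough illustration of the naming-convention idea, here is a Python sketch of one possible payload for such a Custom XML Part. It only builds and parses the XML. Adding the part and its relationships to the .docx package is not shown. The wrapper element reuse-links, the id attribute, and the value "link-001" are hypothetical. The id would have to match whatever marks the face value in the document body (for example a content control tag or a bookmark name), which is an assumption rather than something specified above.

```python
import xml.etree.ElementTree as ET


def build_payload(links):
    """Serialize {placeholder_id: {param_name: value}} into one XML string."""
    root = ET.Element("reuse-links")  # hypothetical wrapper element
    for link_id, params in links.items():
        # The id ties this entry to the face value in the document body.
        link = ET.SubElement(root, "reuse-link", {"id": link_id})
        for name, value in params.items():
            ET.SubElement(link, "reuse-param", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")


def parse_payload(xml_text):
    """Rebuild the {placeholder_id: {param_name: value}} mapping on import."""
    root = ET.fromstring(xml_text)
    return {
        link.get("id"): {
            param.get("name"): param.get("value")
            for param in link.findall("reuse-param")
        }
        for link in root.findall("reuse-link")
    }


# "link-001" is a made-up identifier; the parameters come from the question.
links = {"link-001": {"figname": "exp_sum_n", "format": "MC"}}
payload = build_payload(links)
print(payload)
print(parse_payload(payload) == links)
```

On import, the application would read the part, look up each id found in the body, and restore the original reference instead of the edited face value.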
(Sorry, I am not yet allowed to post comments or respond to them here).

How to create a ligature from a user dictionary in ABBYY FineReader?

I need to recognize complex chemical names from a scanned document (PDF). They contain special characters and are written in a table format. I also have an Excel document that contains ALL possible names (I would say rows, because there are no combinations) that I may encounter during scanning. Is there a way to create ligatures, so that FineReader recognizes an entire row instead of dissecting it into separate characters? I tried creating a user dictionary, but FineReader does not treat an entry as a single row.
The only way to create ligatures is to use "user pattern training". In FineReader, go to Tools -> Options -> Read tab (changes slightly depending on FR version) and enable User pattern training. During training extend your box to include several combined characters, thus creating a ligature.
Recognizing formulas with this method is tough but may be possible.
I have done this many times in my work at www.wisetrend.com. I am a former ABBYY support employee and current integrator and OCR consulting specialist. I will be glad to help if you need more specific assistance.

How to prevent ampersand from duplicating in Export to Word RTF in MS Access

We have an Access database that has a feature to print a notification letter from a report. Sometimes, we have to use an address different from the database, in which case we use the Export - Word RTF File command to edit the address in MS Word and print from there.
Here's the problem. The name of our department, "Compensation & Pension" or the abbreviation "C&P" appears just fine...until we export it to Word RTF. The output comes out as "Compensation && Pension" or "C&&P." Is there an easy fix for this in the Access report, other than simply editing out the additional ampersand when we do the export?
Well, I figured out a resolution...sort of. The person who created the report didn't do a very good job of a lot of things, not just with the ampersand, but with his use of bold and underlining. It looks fine when the letter is printed directly from the report, but when exported, the RTF document has some things hidden by spillover from other text covering fields such as a date pulled from the database. Since he is not willing to fix the formatting problems with the report, and I'm not allowed to play with the Access program, I decided to say "the heck with it" (or something along those lines) and just pasted the text into a Word document, reformatted it so it looks good, and saved it as a Word template. When I need this letter, I just paste in the name and address as shown in Access. Does that defeat the purpose of the Access program? Sort of, but since I only need this letter two or three times a week, I figured there is no point burning more bridges than I already have with my boss.

Datatype evaluation in Excel to ASP.NET copy/paste operation

I think the answer to my question may be that it is not possible, but I am working on a multinational ASP.NET application where users will want to copy a column of numbers from out of an Excel worksheet and into a web GridView. This could be from any client operating system and we want to avoid any client plug-ins, etc.
The problem is that in many countries, the delimiters for the decimal and thousands portions of a number are completely reversed. For instance, in Germany a value of 999.999.999,999 translates in the USA to 999,999,999.999. For the raw text "999,999", without knowledge of a format and/or location number preference, it is not known whether that should be (in USA format) 999,999.000 or 999.999.
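To make the ambiguity concrete, here is a small Python sketch (the function name parse_number is only illustrative) that parses the same text under each convention. It assumes the separators are supplied by the caller, which is exactly the information a plain-text paste does not carry.

```python
from decimal import Decimal


def parse_number(text, decimal_sep, group_sep):
    """Parse text using explicitly supplied decimal and grouping separators."""
    # Drop the grouping separator first, then normalize the decimal separator.
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


raw = "999,999"
as_us = parse_number(raw, decimal_sep=".", group_sep=",")
as_german = parse_number(raw, decimal_sep=",", group_sep=".")

print(as_us)      # 999999
print(as_german)  # 999.999
```

The same characters yield values that differ by a factor of one thousand, so the server cannot pick the right one without knowing the user's number format.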
As far as I have been able to ascertain, in copy/paste operations from an OS clipboard into a web page, there is no way to also transfer the underlying original Excel data and datatype, e.g., a number represented without these textual delimiters. The only way the data is transmitted is through the formatted text.
Does anyone know otherwise, or can anyone offer helpful advice?

How to save documents like PDF, DOCX, XLS in SQL Server 2008

I am developing a web application that lets users upload files such as images and documents. These files fall into two groups:
binary files
document files
I want to allow users to search the uploaded documents, specifically using full-text search. What data types should I use for these two file types?
You can store the data in binary and use full-text search to interpret the binary data and extract the textual information from formats such as .doc, .txt, .xls, .ppt, and .htm. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full-text search has to be enabled. I am not sure how adding a full-text index will affect your system, e.g., its size. You'll also need to look at the execution plan to ensure the index gets used at query time.
For more information look at this:
http://technet.microsoft.com/en-us/library/ms142499(SQL.90).aspx
Pros:
The main advantage of storing data in the database is that it makes the data "self-contained". Since all of the data is contained within the database, backing up the data, moving the data from one database server to another, replicating the database, and so on, is much easier.
You can also enable versioning of files, and it makes things easier for load-balanced web farms.
Cons:
You can read about them here: https://dba.stackexchange.com/questions/3924/sql-server-2005-large-binary-storage. But this is something that you have to do in order to search through the files efficiently.
Alternatively, you could store keywords in the database and link them to the file on a file share.
Here is an article discussing the use of FILESTREAM with a database: http://blogs.msdn.com/b/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx
You first need to convert the PDF to text. There are tools for this sort of thing (e.g., PowerGREP). Then I'd recommend storing the text of the PDF files in a database. If you need to do full-text searching with logic such as "on the same line", then you'll need to store one record per line of text. If you just want to search for text in a file, then you can change the structure of your SQL schema to match your needs.
For docx files, I would convert them to RTF and search them that way while stored in SQL.
For images, Microsoft has a program called Microsoft OneNote that does OCR (optical character recognition) so you can search for text within images. It doesn't matter what tool you use, just that it supports OCR.
Essentially, if you don't have a way to directly read the binary file, then you need to convert it to text with some library, then worry about doing your searching.
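The one-record-per-line idea can be sketched as follows in Python. This uses the standard-library sqlite3 module purely as a stand-in for SQL Server so the example runs on its own. The table name doc_lines, its columns, and the sample text are hypothetical, and the text is assumed to have already been extracted from the file by some other tool.

```python
import sqlite3


def line_records(file_id, text):
    """Yield (file_id, line_no, line_text) for each non-empty line of text."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            yield (file_id, line_no, line.strip())


# In-memory SQLite database as a stand-in for the real SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_lines (file_id INTEGER, line_no INTEGER, line_text TEXT)"
)

# Made-up text standing in for the output of a PDF-to-text conversion.
extracted = "Invoice 2008-17\n\nTotal due: 120 EUR\nPayable within 30 days"
conn.executemany(
    "INSERT INTO doc_lines VALUES (?, ?, ?)", line_records(1, extracted)
)

# "On the same line": both terms must match the same row.
rows = conn.execute(
    "SELECT file_id, line_no FROM doc_lines "
    "WHERE line_text LIKE ? AND line_text LIKE ?",
    ("%Total%", "%EUR%"),
).fetchall()
print(rows)  # [(1, 3)]
```

Because each row is one line, a query that requires two terms in the same row is the same as requiring them on the same line of the original document.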
The full-text index can be created for columns which use any of the following data types – CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT, VARBINARY, VARBINARY (MAX), IMAGE and XML.
In addition, to use full-text search you must create a full-text index on the table you want to run full-text queries against. A given SQL Server table or indexed view can have at most one full-text index.
These are two articles about it:
SQL SERVER - 2008 - Creating Full Text Catalog and Full Text Search
Using Full Text Search in SQL Server 2008
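As a rough sketch of how the pieces above could fit together, the Python below holds the T-SQL as strings and runs it through a DB-API connection. The table, column, and catalog names are hypothetical. The connection is assumed to come from a driver that uses "?" parameter markers (pyodbc, for example) and to be opened in autocommit mode, since CREATE FULLTEXT CATALOG cannot run inside a user transaction. For a VARBINARY(MAX) column, the full-text index needs a type column holding the file extension so the server knows which filter to apply. Formats such as PDF may need an extra filter installed on the server.

```python
# T-SQL held as strings; dbo.Documents, its columns, and DocumentCatalog
# are hypothetical names chosen for illustration.
SETUP_STATEMENTS = [
    """CREATE TABLE dbo.Documents (
        DocumentId INT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Documents PRIMARY KEY,
        FileName NVARCHAR(260) NOT NULL,
        FileExtension NVARCHAR(10) NOT NULL,  -- e.g. '.doc', selects the filter
        Content VARBINARY(MAX) NOT NULL
    )""",
    "CREATE FULLTEXT CATALOG DocumentCatalog",
    """CREATE FULLTEXT INDEX ON dbo.Documents
        (Content TYPE COLUMN FileExtension)
        KEY INDEX PK_Documents ON DocumentCatalog""",
]

SEARCH_QUERY = (
    "SELECT DocumentId, FileName FROM dbo.Documents "
    "WHERE CONTAINS(Content, ?)"
)


def create_schema(connection):
    """Run the setup statements on an autocommit DB-API connection."""
    cursor = connection.cursor()
    for statement in SETUP_STATEMENTS:
        cursor.execute(statement)


def search(connection, term):
    """Return (DocumentId, FileName) rows whose content contains the term."""
    # Wrap the term in double quotes so CONTAINS treats it as one phrase.
    # Terms that themselves contain double quotes are not handled here.
    phrase = '"' + term + '"'
    cursor = connection.cursor()
    cursor.execute(SEARCH_QUERY, (phrase,))
    return cursor.fetchall()
```

Passing the search term as a parameter rather than concatenating it into the SQL text keeps user input out of the statement itself.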

Resources