I need to recognize complex chemical names from a scanned document (PDF). They contain special characters and are written in a table format. I also have an Excel document that contains ALL possible names (I would say rows, because there are no combinations) that I may encounter during scanning. Is there a way to create ligatures, so that FineReader will recognize an entire row instead of dissecting it into separate characters? I tried creating a user dictionary, but FineReader does not treat an entry as a single row.
The only way to create ligatures is to use "user pattern training". In FineReader, go to Tools -> Options -> Read tab (the path changes slightly depending on the FR version) and enable User pattern training. During training, extend your box to include several combined characters, thus creating a ligature.
Recognizing formulas with this method is tough but may be possible.
I have done this many times in my work at www.wisetrend.com. I am a former ABBYY support employee and current integrator and OCR consulting specialist. I will be glad to help if you need more specific assistance.
It's a Drupal site with Solr for search. My main complaint is the quality of the current search results for Chinese. The tokenizer breaks the text into what it supposes are small words. Most of the tokens are reasonable, but it still makes mistakes, either by breaking a valid token into pieces or by failing to break text that should be split.
Suppose I am writing in Chinese now: big data analysis is one word which shouldn't be broken, so a search for it should find it. I also want people to get AI and big data analysis training as the first hit when they search for the exact phrase AI and big data analysis training.
So I want a way to intervene in, or compensate for, the current tokenization to make the search smarter.
Maybe there is a file in Solr that allows me to write these tokens down manually and relate them to certain phrases, so that Solr can use it as a reference every time it indexes?
You can take different steps to achieve what you want:
1) I don't see an extremely big problem with your "over-tokenization":
"big data analysis is one word which shouldn't be broken. So my search on it should find it." -> Your search will find it even if it is tokenized. I understand this was an example and the actual words are Chinese, but I suspect a different issue there.
2) You can use the edismax [1] query parser with phrase boosts at various levels to boost adjacent tokens or phrases (pf, pf2, pf3, ..., ps, ps2, ps3, ...).
[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html , https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-ThepsParameter
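As a rough illustration of point 2, here is a minimal Python sketch that builds an edismax request with phrase boosts. The core URL, the field names (title, content) and the boost values are assumptions made up for the example, not values from your Drupal setup; defType, q, qf, pf, pf2, pf3 and ps are the edismax parameters described in [1].

```python
from urllib.parse import urlencode


def build_edismax_query(user_query,
                        base_url="http://localhost:8983/solr/drupal/select"):
    """Return a Solr select URL that boosts documents matching the query as a phrase.

    base_url and the field names below are hypothetical placeholders.
    """
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "title^2 content",      # fields searched term by term
        "pf": "title^20 content^10",  # boost when all query terms occur as one phrase
        "pf2": "content^5",           # boost adjacent word pairs
        "pf3": "content^7",           # boost adjacent word triples
        "ps": "0",                    # phrase slop for pf: 0 requires exact adjacency
    }
    return base_url + "?" + urlencode(params)


url = build_edismax_query("AI and big data analysis training")
print(url)
```

With boosts like these, a document containing the exact phrase should score above documents that merely contain the individual tokens, even when the tokenizer has split the phrase.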
How do I handle misspelled words in the Watson Conversation API? The NLP technique/algorithm used in the Conversation API calculates a word ranking and matches the trained data based on that rank. But how do I handle misspelled words or short names in English?
At the moment there is nothing special to handle misspellings. The best process is to use the 'Synonyms' option within entities to add what you expect the user to type, including misspellings, short names, and acronyms.
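To make the synonyms approach concrete, here is a small Python sketch that assembles an entity definition with misspellings and short names listed as synonyms. The entity/values/synonyms layout is my understanding of how entities appear in an exported workspace JSON, so treat it as an assumption and check it against your own export; the entity name and the synonym list are invented for the example.

```python
import json


def make_entity(name, values):
    """Build an entity definition where each canonical value lists its synonyms.

    `values` maps a canonical value to the variants users may type
    (misspellings, short names, acronyms).
    """
    return {
        "entity": name,
        "values": [
            # Deduplicate and sort so the output is stable between runs.
            {"value": canonical, "synonyms": sorted(set(synonyms))}
            for canonical, synonyms in values.items()
        ],
    }


# Hypothetical example data.
entity = make_entity("city", {"New York": ["NYC", "new yrok", "Big Apple", "NY", "NYC"]})
print(json.dumps(entity, indent=2))
```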
There is a list of proper names of stars here: https://www.wikidata.org/wiki/Q1433418
How can I query this in the Wikidata Query Service so that all individual names of stars are listed, along with other data in the list, such as Constellation?
In other words, how do I get at the members of the list? "Instance of" doesn't seem to work.
The confusion here comes from the fact that List of proper names of stars (Q1433418) is an item that centralizes links to the Wikipedia pages playing this role in the different Wikipedia editions, but it doesn't play any meaningful role in Wikidata itself: no item in Wikidata is an instance of (P31) List of proper names of stars (Q1433418).
You would have more luck looking for items that are an instance of (P31) star (Q523), or an instance of an item that is a subclass of (P279) star, a pattern that you will find in many of the SPARQL query examples: ?star wdt:P31/wdt:P279* wd:Q523 .
That could give this query (json version).
And if you're into JS, you can parse the JSON result with this function I wrote: wdk.simplifySparqlResults
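For those scripting this outside JS, here is a minimal Python sketch that wraps the pattern above into a full query and builds the request URL for the Wikidata Query Service. The choice of P59 (constellation) as the extra column and the LIMIT are my own additions for illustration, and the function names are hypothetical. The code only builds the request; fetching it is left to whichever HTTP client you prefer.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def build_star_query(limit=100):
    """Return a SPARQL query listing stars and, where known, their constellation."""
    return f"""
SELECT ?star ?starLabel ?constellationLabel WHERE {{
  ?star wdt:P31/wdt:P279* wd:Q523 .               # instance of star, or of a subclass of star
  OPTIONAL {{ ?star wdt:P59 ?constellation . }}   # P59 = constellation
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def build_request_url(query):
    """Return the GET URL that asks the query service for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_star_query(limit=50)
url = build_request_url(query)
print(query)
```

The LIMIT is there because the subclass traversal over every star can be slow; raise it with care.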
I would not take official names of stars from there. Wikipedia is one of the most useful resources for getting first-hand, somewhat organised information on any topic. It is irreplaceable for this, and it would be a great mess not to have it. However, the information is very vulnerable to vandalism and to clumsy editors.
To get (the only) official proper names of stars, use the list that the IAU started compiling this year. I would use this as the reference. It is also stored in a text file which is easy to retrieve by a program, and it is updated as the Committee accepts more star names. It is here:
http://www.pas.rochester.edu/~emamajek/WGSN/IAU-CSN.txt
In fact, as you can see, the file is laid out in a format ready to be used by software applications. It was made to meet needs like yours.
I'm writing a Notes Client application. Web compatibility is a secondary concern. The language is LotusScript.
The specification: a form to enter lines from receipts. The lines are all saved as part of the same document so that they can be signed as an atomic unit.
When a line is added, it is to be formatted into a table for presentation. Ultimately, this architecture is like an input/datastore/presentation split.
I've managed to get the data stored and signed, and I think I've managed to get it deserializing properly (the LotusScript debugger makes it difficult to see, but it looks right). The problem now is the UI.
Looking at the Programmable Table, it is always a tabbed table with only one row shown per tab. I need a programmable table which can dynamically have rows added to it for display, without forcing new tabs to be created.
This suggests that I would need to use a Rich Text field to contain a table, but thus far my attempts to get anything to display when I try to update a Rich Text field in edit mode have failed. I am forced to conclude that it is impossible.
I cannot figure out how I'm supposed to do a dynamically-displayed list of tabular data like this. Any advice?
Most people just create a table with one row and N columns, with a multi-valued field in each column, and use code to append values to each of the fields in parallel. You don't get borders between rows this way or the ability to do variable formatting of cells, and you have to be careful to avoid letting data length exceed column widths in order to keep everything aligned properly.
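The parallel-fields idea can be sketched outside Notes as well. The Python below is only a language-neutral illustration of the data layout, not LotusScript: each column is a list standing in for a multi-valued field, and adding a receipt line appends one value to every list so the rows stay aligned. The column names and the fixed width are assumptions for the example.

```python
def append_row(columns, row):
    """Append one receipt line to every column list in parallel."""
    if set(row) != set(columns):
        raise ValueError("row must supply exactly one value per column")
    for name, values in columns.items():
        values.append(str(row[name]))
    return columns


def render(columns, width=12):
    """Return the rows as fixed-width text lines, truncating over-long values."""
    names = list(columns)
    lines = []
    # zip(*...) walks the parallel lists index by index, yielding one row at a time.
    for cells in zip(*(columns[name] for name in names)):
        lines.append(" ".join(cell[:width].ljust(width) for cell in cells))
    return lines


# Hypothetical column names standing in for the multi-valued fields.
columns = {"item": [], "qty": [], "price": []}
append_row(columns, {"item": "Coffee", "qty": 2, "price": "3.50"})
append_row(columns, {"item": "Bagel", "qty": 1, "price": "2.25"})
print("\n".join(render(columns)))
```

The truncation in render mirrors the caveat above: a value longer than its column would otherwise push the row out of alignment.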
If you truly want a dynamic table for presentation with all the bells and whistles that you can get in terms of cell formatting, then the Midas Rich Text API from Genii Software is a commercial solution that can do the job.
I blogged about this a couple of years ago: http://blog.texasswede.com/dynamic-tables-in-classic-notes/
This is a non-XPages solution, but of course you can also use XPages to achieve the same/similar result. It does not use tabs, as each row is a separate table.
Alternatively, you can build your Rich Text Table in another NotesDocument, which you then save. Then use NotesUIDocument.ImportItem (which is undocumented, but present in the R8.5 mail template) to update your NotesUIDocument.
Don't forget to delete the other NotesDocument when you're done.
Another option is to build the table in HTML in computed text, and re-open the document every time you modify it. I have inherited a system that does that, and I hate it...so be warned :)
I'm trying to model a business rule set in EA.
The rules are easily described in a decision table: a column is a matching condition, a row is a rule, if all the conditions are matched in a row then the rule matched. More info is available in the Drools docs, for example.
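To show what I mean by a decision table, here is a minimal Python sketch (not Drools, and not my real rules): each rule is a row, each key under "when" is a condition column, and a rule matches only if every condition in its row holds. The columns, thresholds and outcomes are invented for the illustration.

```python
def first_matching_rule(rules, facts):
    """Return the outcome of the first rule whose conditions all match, else None."""
    for rule in rules:
        if all(check(facts[column]) for column, check in rule["when"].items()):
            return rule["then"]
    return None


# Hypothetical decision table: columns are "age" and "member".
RULES = [
    {"when": {"age": lambda a: a < 18, "member": lambda m: True}, "then": "junior rate"},
    {"when": {"age": lambda a: a >= 18, "member": lambda m: m}, "then": "member rate"},
    {"when": {"age": lambda a: a >= 18, "member": lambda m: not m}, "then": "standard rate"},
]

print(first_matching_rule(RULES, {"age": 30, "member": True}))
```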
These rules are an integral part of the application, even if on a different level than the technology details (classes, database tables, etc.). So naturally I would like to add the decision table to my documentation in EA.
I found no way to do this. EA doesn't even know about a "table" or a "spreadsheet", let alone decision tables. I would be happy to simply insert my XLS as an "attachment" to the model, but I didn't find a way to do that either.
Any ideas are appreciated.
There currently seems to be no way to do this short of taking a screenshot of the decision table and pasting it into the generated report after the fact. I believe it is on Sparx Systems' roadmap to implement, but no time frame has been given.
You could try submitting a feature request via their official forms; it can only add more ammunition to the request. At the very least they should notify you when it's available.
Update1: You could always paste that screen shot into the linked document (Ctrl+Alt+D) of the parent element that contains the business rules matrix. This could then be automatically included in the auto generated report. At least then it is still contained in the model and can be used in many places.
Update2: Just rereading your OP: are you actually using EA's Business Rules engine, or are you just after a matrix that can be included in the reporting? If it is the latter, then you have two options.
The first is the Relationship Matrix (View -> Relationship matrix). It can be included automatically in RTF and HTML generated reports, and it can also be exported to CSV or saved as a PNG or metafile.
The second option is to shoehorn your rules into the State Machine Table (from a State Machine Diagram, right-click and select State Chart Editor - Table). Both of these options will allow you to lay out a grid-style table where you can compare your business rules.
I hope this helps