Sprache as a round-trip tool

Can Sprache be used as a round-trip tool? I understand that with it I can build a parser that extracts information from a given text. But can I reuse the same (simple*) pattern I came up with for parsing to also generate text from information?
Let me give you an example: I implement a parser which is capable of extracting the information "abc" from the text "[abc]". Now it would be handy if I could simply provide the value "abc" and it would know how to produce the resulting text "[abc]". Thus I would have a round-trip tool to go from text to information and back to text.
*) limited to a known number of appearances of symbols (i.e. no .AtLeastOnce(), etc.)
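For the concrete "[abc]" example, the parsing half is short in Sprache; the reverse direction has to be written by hand next to it. A minimal sketch (the type and member names below are made up for illustration):

```csharp
// Minimal sketch for the "[abc]" example; the names here are made up.
// Sprache gives us the text -> information direction; the information -> text
// direction is a hand-written companion function, not derived from the parser.
using Sprache;

static class BracketedValue
{
    // Parser: extracts "abc" from "[abc]"
    public static readonly Parser<string> Parser =
        Parse.CharExcept(']').Many().Text()
             .Contained(Parse.Char('['), Parse.Char(']'));

    // Hand-written inverse: produces "[abc]" from "abc"
    public static string Render(string value) => "[" + value + "]";
}

// Usage:
//   var info = BracketedValue.Parser.Parse("[abc]");  // "abc"
//   var text = BracketedValue.Render(info);           // "[abc]"
```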

Related

Barcode for identification of separator pages

I want to use a barcode as a way to identify a separator page in a stack of scanned documents.
I want to figure out the best type of barcode to use for that.
Here is the current situation: the user scans in a stack of paper (1-10 pages) that represents one document.
It would be much faster for them to scan in a bigger stack of paper.
To accommodate this I am going to create a page with a special pattern on it and write a C# program that will look for that pattern and split the stack into separate documents at those separator pages.
I am writing my own program because I will be looking for barcodes on the actual documents as well, so I need custom code.
My question is:
Which barcode technology will be the best for the separator page?
My gut tells me to use QR Code, but I would like to hear what others have to say.
As long as your scanning code can rely on your barcode being relatively level with the page and the amount of data that you want to scan is less than 50 or so characters, you don't need to go 2D with your symbology. I would recommend Code 128.
If you aren't relying on a library, it is much easier to write the code to spot and decode a raster with a predefined pattern of 1s and 0s. Using QR Code or any other 2D symbology (Data Matrix or PDF417) should only be considered necessary if you need a high volume of characters, as decoding a 2D symbol is much more complex.
This assumes that you also have control over the symbology that will be used within the documents and they follow the same constraints.
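If you do end up using a library, a minimal detection sketch might look like this, assuming the ZXing.Net package, System.Drawing bitmaps, and a made-up separator payload of "SEPARATOR":

```csharp
// Sketch only: assumes the ZXing.Net package and System.Drawing bitmaps.
// The "SEPARATOR" payload is hypothetical; use whatever you print on the page.
using System.Collections.Generic;
using System.Drawing;
using ZXing;

static class SeparatorPageDetector
{
    static BarcodeReader CreateReader()
    {
        var reader = new BarcodeReader();
        // Restrict decoding to Code 128 and try a bit harder on noisy scans.
        reader.Options.PossibleFormats = new List<BarcodeFormat> { BarcodeFormat.CODE_128 };
        reader.Options.TryHarder = true;
        return reader;
    }

    // Returns true if the scanned page carries the separator barcode.
    public static bool IsSeparatorPage(Bitmap scannedPage)
    {
        var result = CreateReader().Decode(scannedPage);
        return result != null && result.Text == "SEPARATOR";
    }
}
```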

How to add customized tokens into solr to change the indexing token behaviour

It's a Drupal site with Solr for search. Mainly, I am not satisfied with the current search results for Chinese. The tokenizer has broken the words into supposedly small pieces. Most of them are reasonable, but it still makes mistakes by not treating something as a valid token, either breaking it into pieces or failing to break it.
Assume I am writing in Chinese now: "big data analysis" is one word which shouldn't be broken, so my search on it should find it. I also want people to find "AI and big data analysis training" as the first hit when they search for the exact phrase "AI and big data analysis training".
So I want a way to intervene in, or compensate for, the current tokens to make the search smarter.
Maybe there is a file in Solr that allows me to manually write these tokens down and relate them to certain phrases? Then every time it indexes, Solr can use it as a reference.
There are different steps to achieve what you want:
1) I don't see an extremely big problem with your "over-tokenization":
"big data analysis is one word which shouldn't be broken. So my search on it should find it." -> your search will find it even if it is tokenized. I understand this was an example and the actual words are Chinese, but I suspect a different issue there.
2) You can use the edismax[1] query parser with phrase boosting at various levels to boost subsequent tokens or phrases (pf, pf2, pf3 ... ps, ps2, ps3 ...).
[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html, https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-ThepsParameter
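A sketch of option 2): the core name ("collection1"), field name ("content"), and boost values below are assumptions, while defType, qf, pf, pf2, and pf3 are standard edismax parameters:

```csharp
// Sketch only: core name, field name and boost values are assumptions;
// the edismax parameters themselves (defType, qf, pf, pf2, pf3) are standard.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class SolrPhraseBoostExample
{
    static async Task Main()
    {
        var url = "http://localhost:8983/solr/collection1/select"
                + "?defType=edismax"
                + "&q="   + Uri.EscapeDataString("AI and big data analysis training")
                + "&qf="  + Uri.EscapeDataString("content")
                + "&pf="  + Uri.EscapeDataString("content^10")  // whole-phrase match boost
                + "&pf2=" + Uri.EscapeDataString("content^5")   // adjacent word pairs
                + "&pf3=" + Uri.EscapeDataString("content^3");  // adjacent word triples

        using var http = new HttpClient();
        Console.WriteLine(await http.GetStringAsync(url));
    }
}
```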

Tokenizing SQL Injection strings

I have a data set of requests obtained from numerous PCAP files and have loaded these PCAP files into R. Each PCAP file effectively refers to a single observation (row).
In this data set there is a "Request" column that gives a string regarding the request of the source. For example a request may read:
http://111.22.33.1/ilove/usingR/extraextra/sqli/?id='or1=1--
I want to tokenize each request string in order to run some machine learning algorithms on it. What would be the best way to tokenize strings like these in order to run some analysis on them? I know packages such as tm exist, but I have had little experience with them.
I fear that you first have to examine your request variable and find similar patterns to help you work out rules for tokenizing it.
Then you could use str_split with the / pattern. If you keep each token's order of appearance in the string, some models may find co-occurrence patterns in your requests for you.
Then do some analysis, like frequency checks, on the IP addresses and on the text.
tm is more for text corpora. Here, as these are "automatically" created strings, you could probably find some useful information with more classical methods first.
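The str_split suggestion above refers to R's stringr package; purely as an illustration of the same split-and-keep-position step, here is a sketch (in C#, matching the other examples here) using the sample request from the question:

```csharp
// Illustration only of the split-on-"/" idea; the answer itself refers to
// R's stringr::str_split. Positions are kept so co-occurrence can be studied.
using System;
using System.Linq;

class RequestTokenizer
{
    static void Main()
    {
        var request = "http://111.22.33.1/ilove/usingR/extraextra/sqli/?id='or1=1--";

        // Split on "/" and "?", keeping each token's order of appearance.
        var tokens = request
            .Split(new[] { '/', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Select((token, position) => new { position, token });

        foreach (var t in tokens)
            Console.WriteLine($"{t.position}\t{t.token}");
    }
}
```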

Pull text from website

I have a task in front of me to document many, many thousands of public land records,
and I record them in a spreadsheet, basically. There are 3 pieces of information I need from these records: SECTION, TOWNSHIP, and RANGE. That's all I care about.
http://i843.photobucket.com/albums/zz360/mattr1992/ndrin_zpsdc360ac8.png
Here are my resources. As you can see, each entry has what I'm looking for (section/township/range), although they are all unique entries and not the same.
I would like to pull the section/township/range of each entry into a spreadsheet. How would I do this?
If you can copy the webpage into a plain text file, you could use a regular expression like section: [0-9]* township: [0-9]* range: [0-9]* to capture all the information, and then import it into Excel, which can easily separate the fields into different columns.
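A sketch of that approach, assuming the page text has been saved to records.txt and that the labels really read "section:", "township:", and "range:" (adjust the pattern to the real records):

```csharp
// Sketch only: the exact label text ("section:", "township:", "range:") and
// the file names are assumptions; adapt the pattern to the actual records.
using System;
using System.IO;
using System.Text.RegularExpressions;

class LandRecordExtractor
{
    static void Main()
    {
        var text = File.ReadAllText("records.txt");   // page copied into a plain text file
        var pattern = new Regex(
            @"section:\s*(\d+)\s+township:\s*(\d+)\s+range:\s*(\d+)",
            RegexOptions.IgnoreCase);

        // Write one CSV row per match; the CSV opens directly in a spreadsheet.
        using var csv = new StreamWriter("records.csv");
        csv.WriteLine("Section,Township,Range");
        foreach (Match m in pattern.Matches(text))
            csv.WriteLine($"{m.Groups[1].Value},{m.Groups[2].Value},{m.Groups[3].Value}");
    }
}
```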

Creating links to ontology nodes

Let's say that, abstracting away from any particular language, we have some ontology made of triples (e.g. subject (S) - predicate (P) - object (O)).
Now if I want to, for some reason, annotate any of these triples (nodes), then I'd like to keep links to them that I can use in web documents.
Here are some conditions:
1) Such a link must be in the form of one line of text
2) Such a link should be easily parseable by both machine and person
3) Sections of such links should be delimited
4) Such a link must be easy to grep, which IMO means it should be wrapped in some distinct letters or characters to make it easy to regex out of any web or other document
5) Such a link can be used in URL pathnames or query strings, thus it has to comply with URL syntax
6) Characters used in such a link must not be reserved for URL pathnames, query strings or hashes (e.g. not "/", ";", "?", "#")
My ideas so far were as follows:
a) Start and end such a link with some distinct, constant set of letters, e.g. STK_....._OVRFLW
b) Separate sections with dashes "-", e.g. Subject-Predicate-Object
So it would look like:
STK_S1234-P123-O1234_OVRFLW
Do you have better ideas?
I'm with #msalvadores on this one - this seems to be a classic use of semantic web / linked data (albeit in a rather complex form), and your example seems to be related to URI design more than anything else.
The # is dealt with extensively in the semantic web literature, and there are also JavaScript libraries for querying RDF through SPARQL - it just makes more sense to stick with the standard.
To link to a triple, the standard method is to use reification - essentially naming a triple (to keep with the triple model, it ends up creating 4 triples, but I would consider it the "correct" method in this situation). There is also the "named graph" method, which isn't a standard, but probably has more widespread adoption.
Taking your conditions in turn:
1) The link will be one line of text.
2) It will be easily machine-parsable; to make it human-parsable, it might be necessary to give some thought to URI design.
3) Delimitation is once again a matter of URI design.
4) Easy grepping - URI design.
5) URL syntax - tick.
6) No "/", ";", "?", "#" - I would try to incorporate the link into a URL rather than pushing those characters out.
I would consider www.stackoverflow.com/statement/S1234_P123_O123, where S1234 etc. are unique labels (I don't necessarily agree with human-readable URIs, but I guess they'll have to stay until humans no longer have to read URIs). The beautiful thing is that it should dereference and give a nice human- or machine-readable representation.
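A small sketch of both link shapes discussed here - the wrapped STK_..._OVRFLW form from the question and the /statement/ URL form suggested above - using the placeholder labels from the question:

```csharp
// Sketch only: S1234/P123/O1234 are placeholder labels from the question,
// and the host name is just the one used in the answer's example.
using System.Text.RegularExpressions;

static class StatementLinks
{
    // The question's wrapped form: easy to grep out of arbitrary documents.
    public static string Wrapped(string s, string p, string o) =>
        $"STK_{s}-{p}-{o}_OVRFLW";

    // The answer's suggestion: fold the statement name into a dereferenceable URL.
    public static string AsUrl(string s, string p, string o) =>
        $"http://www.stackoverflow.com/statement/{s}_{p}_{o}";

    // Pull every wrapped link out of a document with a single regex.
    public static readonly Regex WrappedPattern =
        new Regex(@"STK_([^-\s]+)-([^-\s]+)-([^_\s]+)_OVRFLW");
}

// Usage:
//   StatementLinks.Wrapped("S1234", "P123", "O1234") -> "STK_S1234-P123-O1234_OVRFLW"
//   StatementLinks.AsUrl("S1234", "P123", "O1234")   -> ".../statement/S1234_P123_O1234"
//   StatementLinks.WrappedPattern.Matches(document)  -> all wrapped links in a document
```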
