biztalk: identify message - biztalk

In my case, I need to parse a bunch of text files and search for specific strings in each. Each text file is formatted differently, so I can't create a generic flat file schema (or can I?).
Is there a way to simply parse the text in each file, and then use orchestration to make decisions based on the result of the search?

This thread answers my question:
MSDN Forum: Multiple flat files on single rcv location. It recommends using different receive locations and file masks to distinguish the different files.


Extracting relevant content from a blob

Daily, we get a 15+M XML dump that contains a bunch of superfluous content that masks the needed details. It is not a problem to extract the content from the XML tags; however, the blob has proven to be a problem.
I can extract the headers of the info that I am after using str_extract; however, I also need the character vector that follows. An example:
\n\nSubject:\n\tSecurity ID:\t\tS-1-5-21-1390067357-1580818891-1801674531-43388\n
Unfortunately, I cannot post a full copy of the blob, as it contains proprietary content. As you can see, the fields that I need are all separated with embedded new line and tab characters, which I am trying to trigger on, but I cannot find a way to configure str_extract to capture the additional content.
Any insight you might have would be greatly appreciated.
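One way to capture the value that follows a header is to put the header and the separating tabs in the pattern and wrap only the value in a capture group. The sketch below shows the idea in Python rather than R. It assumes the blob contains real newline and tab characters, as in the sample line, and the function name extract_field is made up for illustration.

```python
import re

# Sample line from the question; the real blob is assumed to use the same layout.
blob = "\n\nSubject:\n\tSecurity ID:\t\tS-1-5-21-1390067357-1580818891-1801674531-43388\n"

# Match the header, skip the tabs that follow it, then capture up to the end of the line.
pattern = re.compile(r"Security ID:\t+([^\n]+)")

def extract_field(text):
    """Return every value that follows a 'Security ID:' header."""
    return pattern.findall(text)

print(extract_field(blob))
```

The same pattern shape (literal header, separator, capture group) should carry over to other fields in the blob by swapping the header text.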

Possible to use .zip file with multiple .csv files?

Is it possible using U-SQL to unzip a zip folder with multiple .csv files and process them?
Each file has a different schema.
So you've got two problems here:
1. Extract from a ZIP file.
2. Deal with inner varying contents.
To answer your question: is it possible? Yes.
How? You'd need to write a user-defined extractor to do it.
First check out the MSDN extractors page:
The class for the extractor needs to inherit from IExtractor with methods that iterate over the archive contents.
Then, to output each inner file in turn, pass a file name to the extractor so you can define the columns for each dataset.
Another option would be to use Azure Data Factory to perform the UnZip operation in a custom activity and output the CSV contents to ADL Store. This would involve some more engineering though and an Azure Batch Service.
Hope this helps.
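The logic such an extractor would follow (iterate over the archive, read each inner file with its own columns) can be sketched outside U-SQL. The Python below is only an illustration of that loop, not a U-SQL extractor. The function name read_csvs_from_zip is hypothetical, and it assumes UTF-8 files whose first row is a header.

```python
import csv
import io
import zipfile

def read_csvs_from_zip(zip_source):
    """Map each inner .csv name to (header, rows); every file keeps its own columns."""
    result = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore anything in the archive that is not a CSV
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                rows = list(csv.reader(text))
            header, data = (rows[0], rows[1:]) if rows else ([], [])
            result[name] = (header, data)
    return result
```

Keying the result by inner file name mirrors the suggestion of passing a file name to the extractor, so each dataset can be handled with its own schema.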

Comment in aspell .dic files?

They look like this:
I am using this kind of dictionary with a Node.js application, but I will need it to be smarter. Specifically, I want to remember the occurrence probability of every word based on already processed text. I'd like to save this information in the existing .dic file, but how can I do that without making it invalid?
Is there any comment syntax that would allow me to store additional data next to the words in the file, such that a normal dictionary parser will ignore it?

Merge translation files (.ts) with existing .ts files using QT Utilities (lconvert)

Here's my problem: We've got .ts files for nine different languages for our product. We've added about 100 new strings that need to be translated, but some are for our next release, and some are for the release after that. We've run into problems with translators missing strings or translating strings ahead of time. We want to be able to send them a smaller .ts file containing only the strings we want translated now, and then merge that .ts file into the larger .ts file containing the rest of the translation.
Our translators are required to use QT Linguist (previously we let them edit the raw XML with less than stellar results).
One solution would be to use contexts, but our dev team is not very keen on that idea. Another would be to merge the .ts files by hand, but that seems like a recipe for cut & paste errors.
Is there a method with lupdate & the project file to add or merge secondary .ts files? I've read through the forums in QT-land w/o finding the answer, but the switches in lupdate allude to being able to point to other translation files. Specifically the -pro switch which says:
-pro <filename>
Name of a .pro file. Useful for files with .pro file syntax but
different file suffix. Projects are recursed into and merged.
Example1: we have a German .ts file, we want to add 20 strings from a separate German translation file such that the primary translation file contains all the strings including the 20 new ones.
Example2: we have a German .ts file, we want to add 20 strings from a separate German translation file such that the secondary translation file will be merged with the primary during lupdate so that the resultant .qm file contains all the strings including the 20 new ones.
Has anyone done either of these (and either would work) and can you give me some insight?
The answer doesn't use lupdate; it lies in another utility called lconvert. It's quite easy to create a secondary file that only contains the strings you're interested in (and delete those same strings from the primary file), then run:
lconvert -i primary.ts secondary.ts -o complete.ts
This will take all the strings from the two input files and put them together into the output file. Using this method I was able to create a zero difference file (other than time stamp) of the original file that I'd split the two primary & secondary files from.
This question didn't get a lot of attention, but maybe someone will have this same problem and this will help.
Thanks for this tip. It seems to work properly for my case:
I needed to extract the updated and new strings from my project. An older version/release is currently under translation, so I do not have its translated strings yet.
The problem was to send only the new/updated strings to the translators.
I marked the older strings as resolved, added the new strings using lupdate, searched in OxygenXML Editor with the XPath "/TS/context/message[not(translation/@type)]" to delete the older strings, and cleaned the file of useless blanks and carriage returns.
I then ran a merge using lconvert with your solution, in order to merge the translated strings, older and newer. The result passes lrelease correctly and the strings are displayed properly.
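The filtering step described above (deleting every message whose translation element has no type attribute, which leaves only entries such as type="unfinished") could also be scripted instead of done in an XML editor. This is a minimal Python sketch under that assumption; the function name keep_unfinished and the sample content are made up for illustration.

```python
import xml.etree.ElementTree as ET

def keep_unfinished(ts_xml):
    """Drop every <message> that has no <translation> carrying a type attribute."""
    root = ET.fromstring(ts_xml)
    for context in root.findall("context"):
        for message in list(context.findall("message")):
            translation = message.find("translation")
            if translation is None or "type" not in translation.attrib:
                context.remove(message)
    return ET.tostring(root, encoding="unicode")

# Hypothetical two-message file: one finished string, one still unfinished.
sample = """<TS version="2.1" language="de_DE">
<context>
  <name>MainWindow</name>
  <message><source>File</source><translation>Datei</translation></message>
  <message><source>Export</source><translation type="unfinished"></translation></message>
</context>
</TS>"""

print(keep_unfinished(sample))
```

Note that ElementTree does not preserve the DOCTYPE line of a .ts file, so it is worth checking that the result still opens in Linguist before sending it out.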

Generating keywords from a pdf automatically

My application allows user to upload pdf files and store them on the webserver for later viewing. I store the name of the file, location, size, upload date, user name etc in an SQL server database.
I'd like to be able to programmatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the SQL database as well, so that subsequent users can do keyword searches...
Suggestions on how to approach this task? Does this type of routine already exist?
EDIT: Just to clarify my requirements: I'm not concerned with doing OCR. I don't know the internals of PDFs, but I understand that if one was generated by an app, such as Word->PDF Print, the text of the document is in the file. Getting at that text is really my first task, and the intent of my question is: how do I access the text of a PDF file from an app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write it to the database.
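The stopword and duplicate removal step can be sketched as follows, assuming the PDF text has already been extracted into a string. The stopword list here is a tiny illustrative sample and the function name keywords is hypothetical.

```python
import re

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def keywords(text):
    """Return the distinct non-stopword words of text, in first-seen order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # dict.fromkeys removes duplicates while keeping the original order.
    seen = dict.fromkeys(w for w in words if w not in STOPWORDS and len(w) > 2)
    return list(seen)

print(keywords("The invoice for the server and the invoice archive"))
```

The resulting list can then be written to the database, one row per keyword and document.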
It has been mentioned that this does not work for scanned PDF documents, but this is only half the truth. On the one hand, there are lots of scanned PDFs which additionally have text embedded, because that is what some scanner drivers do (Canon CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you are describing, with users uploading files and people being able to view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to define one PDF. If you say:
3 to 10 - I would check methods of text categorization such as a Bayesian classifier or K-NN (that method will group PDF files into clusters which are similar). I know that similar algorithms are used to filter spam. But it is a system that needs input: for example, if you add keywords to 100 PDFs, the system will learn the patterns. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force -> filter common words -> get the most frequent words for a specific document.
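The brute-force route (filter common words, then take the most frequent remaining words per document) might look like the sketch below. The common-word list is a small placeholder and top_keywords is a hypothetical name; the input is assumed to be text already extracted from the PDF.

```python
from collections import Counter
import re

# Placeholder list of common words; a real list would be far longer.
COMMON_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def top_keywords(text, limit=10):
    """Return up to `limit` of the most frequent non-common words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    return [word for word, _ in counts.most_common(limit)]

print(top_keywords("the cat and the cat and the dog", limit=2))
```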
I would explore the first option. You should certainly look into methods such as "text categorization", "auto tagging", "text mining", and "automatic keyword extraction".
Some links :
Keyword Extraction Using Naive Bayes
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.
