Extracting relevant content from a blob - r

Daily, we get a 15+M XML dump that contains a bunch of superfluous content that masks the needed details. It is not a problem to extract the content from the XML tags; however, the blob has proven to be a problem.
I can extract the headers of the info that I am after using str_extract; however, I also need the character vector that follows. An example:
\n\nSubject:\n\tSecurity ID:\t\tS-1-5-21-1390067357-1580818891-1801674531-43388\n
Unfortunately, I cannot post a full copy of the blob, as it contains proprietary content. As you can see, the fields that I need are all separated by embedded newline and tab characters, which I am trying to trigger on, but I cannot find a way to configure str_extract to capture the additional content.
Any insight you might have would be greatly appreciated.
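One way to approach the pattern is to match the label literally, skip the run of tabs, and capture the value that follows. Below is a minimal sketch on the sample line above, written in Julia's PCRE-style regex syntax purely for illustration; only the label and sample string come from the question, and the pattern idea should carry over to str_extract/str_match.
# Sample line from the question. The regex matches the literal label, skips the
# embedded tabs, and captures everything up to the next tab or newline.
blob = "\n\nSubject:\n\tSecurity ID:\t\tS-1-5-21-1390067357-1580818891-1801674531-43388\n"
m = match(r"Security ID:\t+([^\t\n]+)", blob)
m === nothing || println(m.captures[1])  # prints the captured ID value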

Related

Julia: website scraping?

I have been trying for days to move forward with this little piece of code for getting the headers and the links of the news items from a journal website.
using HTTP
function website_parser(website_url::AbstractString)
    # Download the page, then split the HTML source into lines
    r = String(HTTP.get(website_url).body)
    splitted = split(r, "\n")
end
website_parser("https://www.nature.com/news/newsandviews")
The problem is that I could not figure out how to proceed once I had the text from the website. How can I retrieve specific elements (the header and link of each news item, in this case)?
Any help is very much appreciated, thank you
You need some kind of HTML parsing. If you only want to extract the headers, you can probably get away with regexes, which are built in.
If it gets more complicated than that, regular expressions don't generalize, and you should use a full-fledged HTML parser. Gumbo.jl seems to be the state of the art in Julia and has a rather simple interface.
In the latter case, it's unnecessary to split the document; in the former, it at least makes things more complicated, since you then have to think about line breaks. So: better to parse first, then split.
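A minimal sketch of the parse-first approach, assuming current HTTP.jl and Gumbo.jl APIs; the traversal below simply collects the href attribute of every <a> element and is illustrative rather than specific to the Nature page:
using HTTP, Gumbo
# Parse the HTML into a tree instead of splitting the raw text into lines.
doc = parsehtml(String(HTTP.get("https://www.nature.com/news/newsandviews").body))
# Illustrative traversal: collect the href attribute of every <a> element.
links = String[]
function collect_links!(out, node)
    if node isa HTMLElement
        tag(node) == :a && haskey(attrs(node), "href") && push!(out, attrs(node)["href"])
        foreach(child -> collect_links!(out, child), children(node))
    end
    return out
end
collect_links!(links, doc.root)
Once the document is a tree, the line-break question disappears, since you work with elements rather than lines.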
Specific elements can be extracted using the Cascadia library (git repo).
For instance, elements with a given class attribute can be selected from the parsed HTML via qs = eachmatch(Selector(".classID"), h.root), so that all elements such as <div class="classID"> get selected/extracted into the returned query result (qs).
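Putting the pieces together for the header-and-link case, here is a sketch assuming current HTTP.jl, Gumbo.jl and Cascadia.jl; the "article" and "a" selectors are placeholders, and the real class names have to be read off the page's source:
using HTTP, Gumbo, Cascadia
h = parsehtml(String(HTTP.get("https://www.nature.com/news/newsandviews").body))
# "article" is a placeholder selector; substitute whatever the site uses for its news teasers.
for item in eachmatch(Selector("article"), h.root)
    for link in eachmatch(Selector("a"), item)
        # getattr falls back to the default when the attribute is missing.
        println(getattr(link, "href", ""), "  ", nodeText(link))
    end
end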

biztalk: identify message

In my case, I need to parse a bunch of text files and search for specific strings in each. Each text file is formatted differently, so I can't create a generic flat file schema (or can I?).
Is there a way to simply parse the text in each file, and then use orchestration to make decisions based on the result of the search?
This thread answers my question:
MSDN Forum: Multiple flat files on single rcv location, which recommends using different receive locations and file masks to distinguish the different files.

Problem using Interop Range.Find when range contains table

I'm trying to write a Word add-in (with C#) that searches a document for all occurrences of certain pieces of text and makes some changes to the sections of text it finds.
I've created a loop that uses Range.Find to get all of the ranges in the document that contain a piece of text, and then uses the range objects it returns to do the manipulation later. A problem comes up when there is a table in the document, though.
In my first attempt at this, I just kept creating a new range, from the end of my last found occurrence to the end of the document, and then searching again in that new range until it returned no found values. When I did this with a document containing a table, it got stuck inside the table and created an infinite loop.
Then I found this article: http://www.codeproject.com/KB/office/wordaddinpart1.aspx. When I use the Find function the article describes, it successfully continues through a table, but unfortunately it doesn't grab all of the values within that table, which I need it to do.
Does anyone have any advice about getting around this problem? I've seen a couple people talk about having this problem, but no solutions.
I would suggest using the OpenXml SDK for this. The Office interop is a relic. Here's an article that explains how to use the OpenXml SDK to search a Word document:
http://msdn.microsoft.com/en-us/library/bb508261.aspx
Here's an SO question that discusses how to replace an image in a Word document using the OpenXml SDK:
Replace image in word doc using OpenXML

DBI_QUERY replacement in twiki for HTTP

I am currently replacing some functionality in a TWiki page that has been pulling data from a DB using the DBI_QUERY feature and generating a table, complete with hyperlinks on one of the table columns. Is there a way to generate a similar table from a comma-separated file pulled from an HTTP request that TWiki makes when the page is loaded? Alternatively, I can pull the data as JSON.
Thanks,
SetJmp
Answer is: apparently not.
However, using an iframe one can embed a table if the GET response is already pre-formatted appropriately.
Will look forward to better answers...

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a regexp, but rather some clever ideas or algorithms for how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that will just work, without any training, on arbitrary search output.
However, this task can be solved, and actually is solved in many applications, just with a different approach. First you have to define the general structure of a single search result item based on what you are actually going to do with it (it could be name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search output of particular web sites.
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science: writing parsers is actually extremely simple, and you can write a dozen per day. If you look at the HTML source of a search results page, you will notice that the results are typically very structured and marked with specific div sections or class attributes, so they are very easy to find in the document. You don't even have to use a complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
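As a sketch of how little machinery that needs (assuming, as above, that the marker appears only once and that the block contains no nested <div>s; real pages may still justify a proper parser):
# Grep-like extraction: grab the block between the opening marker and the first
# closing </div>, then strip tags and collapse whitespace.
function extract_post_text(html::AbstractString)
    m = match(r"<div class=\"post-text\">(.*?)</div>"s, html)
    m === nothing && return nothing
    no_tags = replace(m.captures[1], r"<[^>]+>" => " ")  # remove HTML formatting
    return strip(replace(no_tags, r"\s+" => " "))        # collapse extra spaces and "\n"
end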
Once you go to large scale with your retrieval application, you will find that there is not that big a variety of search engines across different sites, and you will be able to re-use already-created parsers for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for a while, you will need to include in your parsers some logic that checks the validity of their results and notifies you every time the search output has changed and is no longer compatible with your parser. Then you will have to modify the particular parser or write a new one.
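A sketch of that self-check, with a hypothetical per-site parser and a plain log warning standing in for the notification:
# `parser` is a hypothetical per-site function returning, e.g., (title, link) pairs.
function checked_parse(parser, html::AbstractString, site::AbstractString)
    results = parser(html)
    # A previously working parser that suddenly returns nothing usually means the
    # site's markup changed; flag it instead of silently passing empty output along.
    isempty(results) && @warn "No results parsed; markup may have changed" site
    return results
end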
Hope this helps.

Resources