DTD parsing error in R

I've got a bit of a problem with an XML tree in R. I have a treebank containing the corpus I need. I want to take the XML files, parse them with the help of the DTD on my computer, and then build a corpus from the result.
So far I've tried
xmlTreeParse(doc, options=XML::DTDLOAD)
and
xmlParse(doc)
and also
parseDTD(dtd)
but all of them throw an error. The first two report "entity not defined", and parseDTD reports "failed to load external entity "yaddayadda.dtd"". In a related question, xmlTreeParse was given as the answer, but it does not work for me. The XML files declare the DTD as SYSTEM "../yaddayadda.dtd".
My plan is to create a VCorpus object with the tm package from the parsed text, for use in later text-mining research.
Could you help me, please? I will provide further details if needed.

The parser, which you are telling to load the DTD, is seeing a reference to "../yaddayadda.dtd" and not finding it.
The most likely cause is that you have no file named "yaddayadda.dtd" on the appropriate file system, or that you have it in the wrong place; the parser should be looking for it in the directory one level up from the XML document which refers to it.
If you have it in what you think is the right location, then apparently you and the parser do not agree on what the right location is. Good luck.
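To check where the parser is likely to look, the relative lookup described above can be sketched in a few lines. This is written in Python rather than R purely as an illustration; the function name and the example paths are hypothetical, and it assumes the parser resolves the SYSTEM identifier against the directory of the XML document (if the document is parsed from a string, or from a relative path, the base may instead be the working directory).

```python
import posixpath


def resolve_system_id(xml_path: str, system_id: str) -> str:
    """Return the path a relative SYSTEM identifier points to,
    taking the directory of the XML document as the base."""
    base_dir = posixpath.dirname(xml_path)
    return posixpath.normpath(posixpath.join(base_dir, system_id))


# Hypothetical location of one treebank file.
print(resolve_system_id("/data/treebank/xml/doc001.xml", "../yaddayadda.dtd"))
# prints /data/treebank/yaddayadda.dtd
```

If no file exists at the printed location, either move the DTD there or adjust the SYSTEM identifier in the XML files.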


Trying to read multiple lines from txt file separated with :, but I'm getting imbRecoverableException caught from worker -> parseNext

As I'm new to IBM MQ and IIB, I'm experimenting with online tutorials. At the moment I'm trying to make a simple app that reads several lines from a txt file, each with fields separated by colons, and writes them to an XML file. Currently I'm stuck at reading multiple lines from the file. I know how to make it work with only one line, but not with more than one. I do know that there should be a parent-child relationship between two complex types, but I can't configure them properly. I'm also using RFHUtil to send the message file to the queue.
Since I can't find much by googling, I hope someone with the right knowledge can help.
Don't have any code, but got my message definition picture: http://prnt.sc/nv9npr
Here is the error I'm getting: http://prnt.sc/nv9nyi
There are two things I can see in your current screenshots.
In the first screenshot I can see \r\n, i.e. CRLF, which indicates that either your separator needs to be CRLF or your model needs to deal with the CRLF.
In the second you've got a partially parsed message. Try setting the Advanced Parser options on your Input node to ParseComplete. Things will still blow up, but you should get better diagnostic information in the ExceptionList.
It looks like you are trying to use the MRM parser, which has been replaced by the DFDL parser. I suggest you find some tutorials on the DFDL parser; it's much more efficient. There is also support built into the Toolkit that lets you debug the message model you create; see "Testing a DFDL schema by parsing test input data".
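The CRLF point can be illustrated outside IIB. The following is a minimal Python sketch, not message-model configuration; the sample data and the function name are made up. It shows that treating only \n as the record separator leaves a stray \r on the last field of every line, while using \r\n gives clean fields.

```python
# Made-up input: two records terminated by CRLF, fields separated by colons.
raw = "alice:30:Oslo\r\nbob:25:Bergen\r\n"


def parse_records(text, record_sep):
    """Split text into records on record_sep, then each record into fields on ':'."""
    records = [r for r in text.split(record_sep) if r]
    return [r.split(":") for r in records]


lf_only = parse_records(raw, "\n")   # last field of each record keeps a trailing '\r'
crlf = parse_records(raw, "\r\n")    # fields are clean

print(lf_only)
print(crlf)
```

The same parent-child shape applies in the message model: an outer repeating record terminated by CRLF, and inner fields separated by a colon.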

Ada `Gprbuild` Shorter File Names, Organized into Directories

Over the past few weeks I have been getting into Ada, for various different reasons. But there is no doubt that information regarding my personal reasons as to why I'm using Ada is out of scope for this question.
As of the other day I started using the gprbuild command that comes with the Windows version of GNAT, in order to get the benefits of a system for managing my applications in a project-related manner. That is, being able to define certain attributes on a per-project basis, rather than manually setting up the compile-phase myself.
Currently my file names follow what seems to be the standard convention for gprbuild, although I could very much be wrong. Each period in the package name becomes a - in the file name, and each underscore stays an _. As such, a package named App.Test.File_Utils lives in files named app-test-file_utils.ads and app-test-file_utils.adb.
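The mapping just described can be written down as a small function. This is only a Python sketch of the convention as stated in this question (lower-case the name, turn periods into hyphens); the function name is hypothetical and any special cases in GNAT's actual naming rules are ignored.

```python
def default_gnat_basename(unit_name: str) -> str:
    """Map an Ada unit name to the base file name described above:
    lower-cased, with '.' replaced by '-' and '_' left unchanged."""
    return unit_name.lower().replace(".", "-")


base = default_gnat_basename("App.Test.File_Utils")
print(base + ".ads")  # spec file
print(base + ".adb")  # body file
```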
In the .gpr project file I have specified:
for Source_Dirs use ("app/src/**");
so that I am allowed to use multiple directories for storing my files, rather than needing to have them all in the same directory.
The Problem
The problem that arises, however, is that file names tend to get very long. As I am already putting the files in a directory based on the package name contained by the file, I was wondering if there is a way to somehow make the compiler understand that the package name can be retrieved from the file's directory name.
That is, rather than having to name the App.Test.File_Utils' file name app-test-file_utils, I would like it to reside under the app/test directory by the name file_utils.
Is this doable, or will I be stuck with the horrors of eventually having to name my files along the lines of: app-test-some-then-one-has-more_files-another_package-knew-test-more-important_package.ads? That is, assuming I have not missed something about how an Ada application should actually be structured.
What I have tried
I looked for answers in the documentation of the Naming package for gpr files, but to no avail. I have also been browsing the web for information, but decided it would be better to ask on Stack Overflow, so that other people who struggle with this in the future (assuming it is a problem in the first place) can also get help.
Any pointers in the right direction would be very helpful!
In the top-secret GNAT documentation there is a description of how to use non-default file names. It's a great deal of effort. You will probably give up, use the default names, and put them all in a single directory.
You can also simplify much of the effort by using GPS and letting it build your project file as you add files to your source directories.

HTK ERROR [+5010] InitSource: Cannot open source file f-ihm+k

I believe that this error has something to do with a mismatch between my tiedlist and the hmmdefs (as pointed out here: http://www.ling.ohio-state.edu/~bromberg/htk_problems.html), but I cannot seem to solve it. All of the triphones in my corpus are present in my triphones1 list, and triphones1 only contains monophones, biphones and triphones from my corpus.
If I take said triphone out of the triphones1 list and recreate the tiedlist it passes but complains about another triphone down the road. Obviously manually taking out all of these triphones would take me years and it doesn't seem efficient which leads me to believe that I have missed something further back.
It is also important to note that all these triphones generating errors are in my corpus as well. To me this error would only make sense if I had unseen triphones somewhere, but where? I feel that I have left no stone unturned but surely someone can give me a fresh idea of where to look.
There was an extra AU command at the end of the tree.hed file. This was causing it to try to open another file after the tiedlist. I am not sure why this causes an issue when it has already accessed the tiedlist, but there you go.
Hopefully this serves as an extra check for future HTK users.

Interpreting edi mscons files

I have an EDI file in MSCONS format. I am trying to parse the file in R and save it as a CSV file. However, I have not found a good explanation of how to proceed. Has anyone out there worked with this sort of file?
Example:
UNA:+.? '
UNB+UNOC:3+7080005046091:14:TIMER+102953452626:82:TIMER+140312:2152+XGATE019452198++++1'
UNH+1+MSCONS:D:96A:ZZ:E2NO6A'BGM+7+1488136+9+NA'
DTM+137:201403121751:203'DTM+163:201403030000:203'
DTM+164:201403092400:203'DTM+ZZZ:1:805'
NAD+FR+7080005046053::9+++++++NO'
NAD+DO+953452626:NO3:82+++++++NO'UNS+D'
NAD+XX'LOC+90+707057500071137750::9'
RFF+MG:97645'RFF+LI:22446237_17506927'
LIN+1++1491:::SM'MEA+AAZ++KWH'QTY+136:1'
DTM+324:201403030000201403030100:Z13'QTY+136:1'
DTM+324:201403030100201403030200:Z13'QTY+136:2'
DTM+324:201403030200201403030300:Z13'QTY+136:1'
DTM+324:201403030300201403030400:Z13'QTY+136:1'
DTM+324:201403030400201403030500:Z13'QTY+136:2'
DTM+324:201403030500201403030600:Z13'QTY+136:1'
DTM+324:201403030600201403030700:Z13'QTY+136:1'
DTM+324:201403092300201403092400:Z13'CNT+1:167181'
UNT+6832+1'UNZ+1+XGATE019452198'
Download this application to start: EDI Notepad
Open your EDIFACT file in this tool. This will help you with context. What each segment / element is. It should also help give you context related to qualifiers and envelopes in the documents. You should find the source of the document and get an implementation guide, which will also explain their specific usage.
Once you apply context and understand what the elements are, parsing becomes easy. You can write your own parser, use an open source product like BOTS (mentioned in the comments above), or purchase commercial translation software (hundreds are available).
The elements within the MSCONS file are well documented. See here: http://www.edi-energy.de - the latest description (in German) is available here: http://www.edi-energy.de/files2/MSCONS_2_2b_Fehlerkorrektur_2014_02_27.pdf
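For the "write your own parser" route, here is a minimal sketch. It is in Python rather than R and is not a full EDIFACT implementation. It assumes the separators announced by the UNA header in the example (: for components, + for elements, ? as release character, ' as segment terminator), assumes each QTY segment is followed by the DTM+324 segment that gives its period (as in the example), and assumes that period is two 12-digit timestamps back to back. All function names are hypothetical.

```python
import csv
import io


def split_escaped(text, sep, release="?"):
    """Split on sep, ignoring any sep that is preceded by the release character."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == release and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair for later levels
            i += 2
        elif text[i] == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text, release="?"):
    """Drop release characters, keeping the character each one protects."""
    out, i = [], 0
    while i < len(text):
        if text[i] == release and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_segments(message):
    """Return a list of (tag, elements); each element is a list of components."""
    if message.startswith("UNA"):
        message = message[9:]  # "UNA" plus six service characters
    message = message.replace("\r", "").replace("\n", "")
    segments = []
    for raw in split_escaped(message, "'"):
        if not raw:
            continue
        elements = [[unescape(c) for c in split_escaped(e, ":")]
                    for e in split_escaped(raw, "+")]
        segments.append((elements[0][0], elements[1:]))
    return segments


def extract_readings(segments):
    """Pair each QTY with the DTM+324 period that follows it (an assumption)."""
    rows, pending_qty = [], None
    for tag, elements in segments:
        if tag == "QTY" and elements and len(elements[0]) > 1:
            pending_qty = elements[0][1]
        elif (tag == "DTM" and elements and len(elements[0]) > 1
              and elements[0][0] == "324" and pending_qty is not None):
            period = elements[0][1]
            rows.append((period[:12], period[12:24], pending_qty))
            pending_qty = None
    return rows


# A shortened excerpt in the shape of the example above.
sample = ("UNA:+.? 'LIN+1++1491:::SM'MEA+AAZ++KWH'QTY+136:1'\n"
          "DTM+324:201403030000201403030100:Z13'QTY+136:2'\n"
          "DTM+324:201403030100201403030200:Z13'")
rows = extract_readings(parse_segments(sample))

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["period_start", "period_end", "quantity"])
writer.writerows(rows)
print(buffer.getvalue())
```

The same three-level split (segments, then elements, then components) carries over to R; check the pairing of QTY and DTM against the implementation guide before relying on it.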

Handling raw PDF source in ActionScript

I am using the JODConverter web service to convert an ODT-document to a PDF-file. I have some working Ruby code, that will load up the ODT-file and convert it using the web service. The resulting PDF-file is then returned to me and I can easily save it.
When I try to do the same thing in ActionScript, I seem to be facing some issues with FlateDecode blocks in the PDF source. They are somehow altered (possibly because ActionScript strings are UTF-8). The result is that the resulting PDF-file is incomplete. Meta-information is correct, but the file appears to be blank.
I would appreciate any kind of feedback relating to this issue.
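The suspected cause can be reproduced in isolation. The sketch below is Python, not ActionScript, and uses a made-up payload; it only illustrates the hypothesis that passing compressed (FlateDecode-style) bytes through a UTF-8 string alters them, which would fit metadata surviving while page content is lost.

```python
import zlib

# Made-up stand-in for a compressed PDF content stream.
original = b"BT /F1 12 Tf (Hello) Tj ET\n" * 20
payload = zlib.compress(original)

# Treat the binary payload as UTF-8 text, then turn it back into bytes.
as_text = payload.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")

# The untouched bytes still decompress to the original stream.
intact = zlib.decompress(payload) == original

# The round-tripped bytes no longer form a valid compressed stream.
try:
    zlib.decompress(round_tripped)
    corrupted = False
except zlib.error:
    corrupted = True

print(intact, corrupted, round_tripped != payload)
```

If this is what happens here, the response would need to be handled as binary data (for example in a ByteArray) from download to save, never as a String.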
