Displaying XML using CSS: How to handle &nbsp? - css

I'm dealing with a huge number of .xml files (millions: an .xml-formatted dump of Wikipedia), and they're a lot less readable than I imagined.
For the time being, I've written a .css file to display them in a readable manner in a browser, and wrote a script to plug a reference to this .css into all the files.
(I know there are other solutions, like XSLT, but all the information I found made it seem document-level, which didn't suit; I'm really trying not to expand the size of these files if possible.)
The .css works fine for some of the files, but many contain entities like &nbsp; and I get errors like:
"XML Parsing Error: undefined entity" with a nice little illustration pointing to &nbsp; or its kin within a quote.
There is an articles.dtd file, which seems like it should connect the dots (keyword -> Unicode) for the browser. It is referenced in each file like:
<!DOCTYPE article SYSTEM "../article.dtd">
and contains a lot of entries like:
<!ENTITY nbsp " "> <!-- no-break space = non-breaking space,
U+00A0 ISOnum -->
but either I'm entirely misunderstanding what this file is for, or it's not working correctly.
In any case: how can I make these documents display? Either by:
displaying the entities as plain text (e.g. "&nbsp;" shown literally)
removing the entities altogether (by any means other than a linear search-and-removal in the actual files)
interpreting the entities as Unicode, as they were intended
Naturally, the last is preferable; absolutely ideally, by referencing some sort of external file that maps entities to Unicode (if that's not what the articles.dtd file is for...).
EDIT: I'm not working with a powerful machine here; extracting the .rars took days. Any sort of edit to each file would take a very long time.

It is not a very good way, just a workaround: try replacing &nbsp; with the numeric character reference &#160; (or with a literal U+00A0 character).

So, I've since solved my problem; in case it helps anyone in the future:
It turned out the root of my problem was that external .dtd files are effectively deprecated.
The function of the .dtd was indeed to declare the entities I was having trouble with (&nbsp; etc.), as I thought; but because browsers no longer support external .dtd files (they simply don't fetch or parse them, and the only way to force them to depends on the browser's installation on the client machine), the entities went undeclared.
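For what it's worth, the commonly cited workaround (had editing the files been cheap) is to declare the needed entities in the internal DTD subset; as far as I can tell, browsers still parse the internal subset even though they skip the external file:

```xml
<!DOCTYPE article SYSTEM "../article.dtd" [
  <!ENTITY nbsp "&#160;">
]>
```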
I had sourced an .XML collection that was simply too old to be up to standards, without realizing it.
The solution best for my circumstances turned out to be lazy processing of each file as it was requested, with a simple flag to differentiate processed files from unprocessed ones.
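In case it helps, here is a minimal sketch of the kind of per-file processing involved (the function name is mine, not from my actual script): it rewrites named entities as numeric character references, so an XML parser no longer needs the DTD.

```python
import re
from html.entities import name2codepoint

def numericize_entities(xml_text):
    """Replace named entities like &nbsp; with numeric references
    like &#160;, leaving the five XML built-ins alone."""
    xml_builtin = {"amp", "lt", "gt", "quot", "apos"}

    def repl(match):
        name = match.group(1)
        if name in xml_builtin:
            return match.group(0)       # keep &amp; &lt; etc. as-is
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            return match.group(0)       # unknown entity: leave untouched
        return "&#%d;" % codepoint

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, xml_text)
```
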

Related

How to detect wrong encoding declaration?

I am building an ASP.NET web service that loads other web pages and then hands them to clients.
I have been doing quite well with character-encoding handling: reading the meta tag from the HTML, then using that charset to read the file.
Nevertheless, some less careful authors just don't understand character sets. They declare a specific encoding, e.g. "gb2312", when in fact the page is plain UTF-8. When I use gb2312 to decode the text, everything turns into a mess.
How can I detect whether the text is properly decoded? I loaded such a page into my IE, which correctly used UTF-8 to decode it. How does it achieve that?
Based on the BOM you can tell what encoding is used.
BOM and encoding
If you want to detect character set you could use the C# port of mozilla's character set detector.
CharDetSharp
If you want to be extra sure that you are using the correct one, you could look for special characters that are not supposed to be there. A page is not very likely to include "óké". So you could look for such character sequences and try a different encoding/character set to process your file.
Actually it is really hard to make your application completely "fool-proof".
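The two checks above (trust a BOM if present, otherwise see whether the bytes decode as strict UTF-8) are easy to sketch; this is illustrative Python rather than C#, and the function name is made up:

```python
def sniff_encoding(raw: bytes) -> str:
    """Guess an encoding: trust a BOM if present, otherwise check
    whether the bytes are valid strict UTF-8."""
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"      # decodes cleanly; very likely UTF-8
    except UnicodeDecodeError:
        return "unknown"    # fall back to declared charset or a detector
```
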

How do I add a string to an ASP.NET global resource that only belongs to one language?

I have a global resources file for different languages:
Resource.resx
Resource.de-DE.resx
Resource.ro-RO.resx
For the most part, all the strings in Resource.resx have localized versions in other languages as well.
However, I have certain strings that should only exist in Resource.de-DE.resx but not Resource.resx. When I try to use them in my code:
GetGlobalResourceObject("Resource", "Personal Identification Number")
I get an error that says Cannot resolve resource item 'Personal Identification Number'. The string still gets localized properly when I view the page in German because it's present in Resource.de-DE.resx, but because it's not in Resource.resx, I get this error in Visual Studio, and I'd like to get rid of the error.
How do I work around this so that I don't get this error message? Should I move the locale-specific string to another resource file?
The whole resource fallback approach really assumes that all strings are present for the base language.
I imagine you have this scenario because you implemented some feature that only applies to German and you don't want to add unnecessary resources to your base language as these will increase the localization effort for languages that don't need it.
One solution would be to create a separate local resource file. And either only translate this one into German (and not other languages) or make it a base resource (without the de-DE language code but still with your German strings in it).
Another solution (if you can't create a local resource file and for some reason can only use global resources) would be to add those extra entries to your base global resources (Resource.resx) and make it obvious that you don't want these translated. For example make them all blank strings and use the Comment field to explain that these strings are for German only. Not very nice.
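For example, the blank-entry approach might look like this in Resource.resx (the resource name is taken from the question; the comment text is mine):

```xml
<data name="Personal Identification Number" xml:space="preserve">
  <value></value>
  <comment>German only; the real text lives in Resource.de-DE.resx</comment>
</data>
```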
I just replicated your scenario and it works fine. Just create another resource file containing the locale-specific strings. Hope this helps :)

Mass Thunderbird folder to Gnus nnfolder conversions

I'm pondering the idea of importing a few thousand Thunderbird folders, each folder containing many emails of course, as a set of Emacs' Gnus mailgroups. Each mailgroup name would be derived from the folder hierarchy. Because of the quantity, the work is going to be fairly tedious, so I would automate this massive import if possible.
Among the available backends, nnfolder seems the most promising in this case. I presume it would be better to populate the mailgroups from within Gnus. Otherwise, I would have to thoroughly understand the nnfolder format, and this might require many iterations before I really get it right. Moreover, as email continues to flow in, iterations may become difficult to organize properly without losing anything.
I guess I have to respool everything, under the constraint that the selected mailgroup is a function of the Thunderbird origin, overriding the standard Gnus selection mechanism. I did some Gnus coding in the past, but since I did not touch Emacs for a dozen years, it is all very rusty. I'm a bit lost about how to approach this task as efficiently and quickly as possible. So my question: how would you handle it? Or is there some clever Gnus hidden corner that I should explore more deeply? :-)
François
P.S. After I wrote this question, I found out that Gnus has a nice function that helps towards this goal. The idea is to first copy all Thunderbird folder files into the ~/Mail directory, contents as they are, but properly renamed. Once this is done, M-x nnfolder-generate-active-file, for each copied folder, edits the contents, leaves a ~ backup, generates NOV data, creates one mailgroup and, of course, adjusts the ~/Mail/active file, all at once.
To copy the folders underneath the ~/.thunderbird/LOGIN/Mail/Local Folders/ directory, I wrote a small Python script. It ignores all .msf files and recurses into .sbd directories. The folder path name, relative to Local Folders/, has all its .sbd/ strings turned into periods to produce the mailgroup name, also lowercasing, turning spaces and underscores into dashes, and handling other special characters appropriately. In particular, non-ASCII characters are not handled properly; nnfolder confuses UTF-8 and ISO-8859-1 here and there. The script also has to skip msgfilterrules.dat and likely drafts, junk and similar folders.
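The renaming rules boil down to something like this (a simplified sketch, not my actual script, which also skips the junk folders and wrestles with the encoding issues mentioned above):

```python
import re

def mailgroup_name(rel_path):
    """Turn a Thunderbird folder path (relative to Local Folders/)
    into a Gnus mailgroup name."""
    name = rel_path.replace(".sbd/", ".")   # hierarchy separator -> dot
    name = name.lower()                     # lower the case
    name = re.sub(r"[ _]", "-", name)       # spaces/underscores -> dashes
    return name
```
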
I notice two details requiring attention:
Thunderbird itself can be used to compact folders before copying them, otherwise one might unwillingly recover messages which were already deleted.
(setq nnmail-use-long-file-names t) is needed in ~/.emacs prior to the whole operation.
The batch transformation aborted, saying it was not able to decrypt one of the messages. I moved the offending folder out of the way, and then the lengthy operation succeeded.

lupdate - common single words in the ts file?

I'm learning how to use Qt's translation machinery for the first time at work. They already have things working to some degree, and it's my job to clean things up and get them working properly, as well as to use lupdate to keep things in sync as they change. We are also using QML, in which we wrote a wrapper function for all our strings, so lupdate does not find our function and cannot add its strings to the .ts XML file. The reason we use a wrapper is to centralize other per-string handling in one place. We also don't always use a string literal as our 'source' argument, but a defined property, such as:
property string buttonTxt: "ButtonText"
then: commonTRFunction(context, buttonTxt)
which of course lupdate does not find, for both reasons.
I've looked into the lupdate source very briefly, and I'm not sure whether it's worth trying to hack it to recognize our function, or whether we should write our own parser to find the standard Qt tags AND our new ones.
Secondly, and related to the first part, I'd like a way to make one context section that contains all the common words we use in our app, such as 'Back', 'Save', 'Ok', etc., without repeating them over and over throughout the .ts file. lupdate seems to repeat things in multiple contexts, which seems both inefficient and a waste of lines in the .ts file.
I haven't found any Qt docs that really explain the differences between tr(), qsTr(), qsTranslate(), QT_TR_NOOP() and QT_TRANSLATE_NOOP(). I know you sometimes need a context and the source, and other times just the source without a context. We don't use the disambiguation argument. Most of our code is in QML, not C++. Also, we are running lupdate from the command line.
Does anyone have thoughts, suggestions, or even a tool someone wrote that can be used for what we have? I appreciate your help.

How can I use Perl to get a list of CSS elements with a color or background color attribute?

I asked a question earlier today regarding using Perl to search in a CSS document. I have since refined my requirements a little bit, and have a better idea of what I am trying to do.
The document I am searching through is actually an .html doc with CSS as a style in the <head>, if that makes sense.
Basically, what I need to do though is find all the CSS elements that have a color or background color attribute, and record them. Here's my thought process.
open the file and read its lines into an array
read the array line by line until I come to a "{"
collect everything into a secondary variable or array until I reach the "}"
search that secondary variable or string for instances of "color", and so on.
The issue I am having is finding a way to scour the document and turn everything between { and } into a variable of some sort. Anyone have any ideas?
Cheers!
No matter what, I wouldn't recommend writing your own code from the ground up for this. You should use a parser. A quick search on CPAN suggests this family of modules. On the other hand, if your CSS is in an HTML file rather than a separate .css file (shame on you), then you might end up needing a different type of parser.
Either way, it's generally not a good idea to try to hand-roll your own quasi-parser out of regular expressions. Use a proper parser, and leverage someone else's work.
On a slightly different tack, if you only want to extract some of the information from a file of any kind, then in many cases you don't want to read the whole file into an array first. (That can be memory-intensive if the file is very large, and it's unnecessary.) It's easy to open the file and process the lines as you work through it one by one.
#!/usr/bin/env perl
use strict;
use warnings;

open my $fh, '<', 'file-of-interest'
    or die "Couldn't open 'file-of-interest': $!";

my @saved_items;
while (my $line = <$fh>) {
    # process $line
    # push @saved_items, $something
}

# Do more fun stuff with @saved_items
You could use the CSS module, which is available on CPAN.
I think this is really just the same question that you asked previously, although you didn't mention, as you did in a previous comment, that you don't think you are allowed to use modules.
The CSS module already does this. You can look at the source to see how they do it. That's the same answer I gave you last time too.
There isn't really any magic or secret way that everyone is hiding from you. Most times, if the module you find on CPAN could be simpler, it would be. However, without any more information that constrains your problem, a general solution like [CSS](http://search.cpan.org/dist/CSS) is the way to go. Study that source or just lift it wholly into your script, although you might try making some arguments to get modules installed. If you could use the module, you might already be done and onto the next project. That's often a convincing argument. :)