Can individual tags override the Character Set in the Specific Character Set (0008,0005) - dicom

If I create a DICOM object with a basic single byte Specific Character Set like (0008,0005) = ISO_IR 100, can one of the tags use a different 2-byte Character set? For example can Patient Name (0010,0010) be encoded in Simplified Chinese (ISO 2022 IR 58)?

The short answer is No. You cannot use a character set not defined in Specific Character Set.
The longer answer: you can use multiple character sets (Specific Character Set is multi-valued), but certain restrictions apply. Multiple character sets are implemented via Code Extensions (described in Chapter 6 of the DICOM Standard, starting with 6.1.2.4).
In your example, you can use the Specific Character Set value ISO 2022 IR 100\ISO 2022 IR 58, which allows you to use both Latin-1 and Simplified Chinese (even intermixed within the same tag, which is common for tags with value representation PN). The encodings are switched by specific escape sequences defined by the ISO 2022 standard. Common DICOM frameworks should handle this automatically (though you have to verify this for your framework).
Note that you have to use ISO 2022 IR 100 instead of ISO_IR 100 - only the ISO 2022 defined terms can be used when Specific Character Set is multi-valued.
Note also that the Chinese character set (GB18030) and the UTF-8 character set (ISO_IR 192) cannot be combined with other encodings.
If you don't want to handle multiple encodings, you could use UTF-8 instead (i.e. set Specific Character Set to ISO_IR 192). Note though that in this case you have to convert all non-ASCII tag values in the dataset to UTF-8.
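As an illustration only, here is a rough sketch of both options. It assumes the dcm4che3 toolkit (Attributes, Tag, VR and setString are from that library) and an illustrative patient name; adapt it to whatever framework you actually use and verify that the ISO 2022 escape sequences are emitted correctly when the dataset is written.

import org.dcm4che3.data.Attributes;
import org.dcm4che3.data.Tag;
import org.dcm4che3.data.VR;

public class SpecificCharacterSetSketch {
    public static void main(String[] args) {
        // Option 1: multi-valued Specific Character Set using ISO 2022 code extensions.
        // The toolkit is expected to insert the ISO 2022 escape sequences when serializing.
        Attributes latin1AndChinese = new Attributes();
        latin1AndChinese.setString(Tag.SpecificCharacterSet, VR.CS,
                "ISO 2022 IR 100", "ISO 2022 IR 58");
        latin1AndChinese.setString(Tag.PatientName, VR.PN, "Wang^XiaoDong=王^小东"); // illustrative value

        // Option 2: single-valued UTF-8 (ISO_IR 192); every non-ASCII value in the
        // dataset must then be stored as UTF-8.
        Attributes utf8 = new Attributes();
        utf8.setString(Tag.SpecificCharacterSet, VR.CS, "ISO_IR 192");
        utf8.setString(Tag.PatientName, VR.PN, "Wang^XiaoDong=王^小东");
    }
}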

Related

How to represent acute accents in ASCII?

I'm having an encoding problem related to cookies on one of my websites.
A user is inputting Usuário, which has an acute accent, and that's being put in a cookie. The raw HEX for the cookie response is (for the Usuário string):
55 73 75 C3 A1 72 69 6F
When I see it in the browser, it looks like this:
When I see it in the browser, it comes out garbled (something like UsuÃ¡rio), which is really messy. I need to fix this up.
Then I went to this website: http://www.rapidtables.com/convert/number/hex-to-ascii.htm and converted the HEX value to see how it would look. And I got the same garbled output.
Right. This means the HEX code is wrong. Then I tried to convert Usuário to ASCII to see how it should be. I used this website: http://www.asciitohex.com/. To my surprise, the HEX it produced is exactly the one that is showing up messy. Why???
And how do I represent Usuário in ASCII so I can put it in a cookie? Should I manually encode it?
PS: I'm using ASP.NET, just in case it matters.
As of 2015, the standard for storing character data on the web is UTF-8, not ASCII. ASCII only defines the first 128 characters of a codepage and does not include any accented characters. To add accented characters on top of those 128 characters there were many legacy solutions: codepages. Each of them added a different set of 128 characters to the basic ASCII list, allowing 256 characters to be represented in total.
The problem was that this didn't really solve the issue: the ASCII-based codepages were more or less incompatible with each other (except for the first 128 characters), and there was usually no way of programmatically knowing which codepage was in use.
One of the solutions was UTF-8, which is a way to encode the Unicode character set (containing most of the characters used around the world, and more) while trying to remain compatible with ASCII. The first 128 characters are the same in both cases, but beyond that UTF-8 characters become multi-byte: one character is encoded as a sequence of bytes (usually 2-3, depending on which character needs to be encoded).
The problem arises if you are using an ASCII-based single-byte codepage (like ISO-8859-1), which encodes every supported character as a single byte, but your input is actually UTF-8, which encodes accented characters as multiple bytes (you can see this in your HEX example: á is encoded as C3 A1, two bytes). If you try to read these two bytes in an ASCII-based codepage that uses a single byte for every character (in Western Europe this is usually ISO-8859-1), each of the two bytes will be rendered as a separate, different character.
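To see the mix-up concretely, here is a small self-contained Java sketch (independent of your ASP.NET code) that encodes the string as UTF-8 and then misreads those bytes as ISO-8859-1:

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Usuário";

        // UTF-8 encodes 'á' as the two bytes C3 A1 -> 55 73 75 C3 A1 72 69 6F
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Misinterpreting those bytes as ISO-8859-1 turns each byte into its own character
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(misread); // prints "UsuÃ¡rio" - the "messy" output

        // Reading them back as UTF-8 restores the original string
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8)); // "Usuário"
    }
}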
On the web the default encoding is UTF-8, so your clients will usually send their requests using UTF-8. ASP.NET is Unicode-aware, so it can handle these requests. However, somewhere in your code this UTF-8 is accidentally converted into ISO-8859-1 and then back into UTF-8. This can happen on various layers; given your symptoms it probably happens at the cookie layer, which is sometimes problematic (here is how it worked in 2009). If you want to properly support accented characters, you should also double-check that your application uses UTF-8 everywhere else (views, database, etc.).

CR/LF generated by PBEWithMD5AndDES encryption?

May the encryption string provided by PBEWithMD5AndDES and then Base64 encoded contain the CR and or LF characters?
Base64 output consists only of printable characters. However, when it is used as a MIME transfer encoding for email, the output is split into lines separated by CR LF.
PBEWithMD5AndDES returns binary data. PBE encryption is defined in the PKCS#5 standard, and that standard does not specify a dedicated Base64 encoding scheme. So the question becomes: for which system do you need to Base64-encode the binary data? Wikipedia has a nice section in the Base64 article that explains the various forms.
You may encounter a PBE implementation that returns Base64 without documenting which of those forms is used. In that case you need to figure it out somehow: search for it, ask the community, look at the source, or, if all else fails, run a set of tests on the output.
Fortunately you are pretty safe if you decode Base64 and ignore all whitespace. Note that some implementations leave out the padding, so add it back before decoding if applicable.
If you perform the Base64 encoding yourself, I would strongly suggest not outputting any whitespace, using only the default alphabet (with the '+' and '/' signs) and always adding padding when required. Afterwards you can still split the result, replace any character that causes trouble (especially the '+' and '/' signs of course), or remove the padding.
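With the standard java.util.Base64 API (available since Java 8), the difference between the basic and the MIME flavour can be seen directly; only the MIME encoder inserts CR LF line breaks:

import java.util.Base64;

public class Base64WrapDemo {
    public static void main(String[] args) {
        byte[] data = new byte[100]; // stand-in for the PBE ciphertext

        // Basic encoder: no line breaks, '+' and '/' alphabet, '=' padding
        String plain = Base64.getEncoder().encodeToString(data);
        System.out.println(plain.contains("\r\n")); // false

        // MIME encoder: wraps lines at 76 characters, separated by CR LF
        String mime = Base64.getMimeEncoder().encodeToString(data);
        System.out.println(mime.contains("\r\n")); // true

        // The MIME decoder ignores line breaks and other non-alphabet characters
        byte[] roundTrip = Base64.getMimeDecoder().decode(mime);
        System.out.println(roundTrip.length); // 100
    }
}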
I was using Java with the Android SDK. I found that the command:
String s = Base64.encodeToString(enc, Base64.DEFAULT);
did line wrapping. It put LF chars into the output string.
I found that:
String s = Base64.encodeToString(enc, Base64.NO_WRAP);
did not put the LF characters into the output string.

Printing ASCII value of BB (HEX) in Unix

When I am trying to paste the character » (right double angle quotes) in Unix from my Notepad, it's converting to /273. The corresponding Hex value is BB and the Decimal value is 187.
My actual requirement is to have this character as the file delimiter when I export a .dat file from a database table. So, this character was put in as the delimiter after each column name. But, while copy-pasting, it's getting converted to /273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh
ASCII only defines the character codes up to 127 (0x7F) - everything after that is another encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use: the locale command reports your current locale settings, and the locale(5) and environ(5) man pages cover how to change them. A much more in-depth introduction to the whole character encoding concept can be found in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
The character code 0xBB is shown as » in the ISO-8859-1 character chart, so that's probably the character set you want; the locale would then be something like en_US.ISO8859-1 for that character set with US/English messages, date formats, currency settings, etc.
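To see why the locale matters, here is a small self-contained Java sketch (not Solaris-specific) showing that » is the single byte 0xBB (octal 273) in ISO-8859-1 but two bytes in UTF-8, so a tool expecting a different encoding will not display or match it as intended:

import java.nio.charset.StandardCharsets;

public class DelimiterBytes {
    public static void main(String[] args) {
        String delimiter = "\u00BB"; // » - right double angle quotes

        // In ISO-8859-1 this is the single byte 0xBB (octal 273, decimal 187)
        for (byte b : delimiter.getBytes(StandardCharsets.ISO_8859_1)) {
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();

        // In UTF-8 the same character takes two bytes: 0xC2 0xBB
        for (byte b : delimiter.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();
    }
}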

What are the ISO 2375 and 2735 standards mentioned in the ECMA-119 specification?

Out of ECMA-119 specification:
8.5 Supplementary Volume Descriptor
...
8.5.3 Volume Flags (BP 8):
The bits of this field shall be numbered from 0 to 7 starting with the least significant bit.
This field shall specify certain characteristics of the volume as follows.
Bit 0:
if set to ZERO, shall mean that the Escape Sequences field specifies only escape sequences registered according to ISO 2735;
if set to ONE, shall mean that the Escape Sequences field specifies at least one escape sequence not registered according to ISO 2375.
On iso.org I found the ISO-2735 standard:
Hermetically sealed metal food containers -- Capacities and diameters of round open-top and vent hole cans for milk
And the ISO 2375 standard:
Data processing -- Procedure for registration of escape sequences
Could somebody confirm that "ISO 2735" is a typing error for "ISO 2375"? Is there an ECMA standard equivalent to ISO 2375?
Yes, I do believe you are correct in that there is a typographical error in Ecma-119: it should read ISO 2375 where it says ISO 2735. I will inform Ecma of this issue.
There is no Ecma equivalent to ISO 2375. However, ISO 2375 is referenced in some Ecma specifications, such as Ecma-35 (Character Code Structure and Extension Techniques).

Multiple Base64 encoded parameters that appear as 1 in a URL query string

I need to pass 2 parameters in a query string but would like them to appear as a single parameter to the user. At a low level, how can I concatenate these two values and then later separate them? Both values are Base64 encoded.
?Name=abcxyz
where both abc and xyz are separate Base64 encoded strings.
Why don't you just do something like this:
temp = base64_encode("var1=abc&var2=xyz")
and then call
?Name=temp
Later you can decode the whole string and split the vars.
(sry for pseudo code :P)
Edit: a small quote from wikipedia
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.
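In Java the pseudocode above could look roughly like the sketch below (class and variable names are made up for illustration; it uses the URL-safe variant of java.util.Base64 so the combined token survives the query string without extra escaping):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CombinedParam {
    public static void main(String[] args) {
        // "abc" and "xyz" stand in for the two already-Base64-encoded values
        String combined = "var1=abc&var2=xyz";

        // URL-safe Base64 ('-' and '_' instead of '+' and '/') avoids extra query-string escaping
        String token = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(combined.getBytes(StandardCharsets.UTF_8));
        System.out.println("?Name=" + token);

        // Later: decode the single parameter and split it back into the two values
        String decoded = new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
        String var1 = decoded.split("&")[0].split("=", 2)[1];
        String var2 = decoded.split("&")[1].split("=", 2)[1];
        System.out.println(var1 + " / " + var2);
    }
}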
You should either use some separator or store the length of the first item.
First of all, I would be curious as to why you can't just pass two parameters. But with that as a given, just choose any character that is valid in a URL query string but won't show up in your Base64 encoding, such as ~.
