Encoding issue asp.net - asp.net

I have a file in notepad, saves ad Ansi encoding with two URLs:
http://www.odinklik.ru/site​.aspx?​site=korney_​chukovsky
http://www.odinklik.ru/site.aspx?site=korney_chukovsky
As you can see from the paste - one of them is looking "weird".
It is changed to:
"http://www.odinklik.ru/site%E2%80%8B.aspx?%E2%80%8Bsite=korney_%E2%80%8Bchukovsky"
When i copy it in the browser. What is happening here?

The code E2 80 8B is the UTF-8 code for the character Zero Width Space, but this is not a character that exists in an ANSI character set.
Either you have some other obscure spacing character that is translated into a Zero Width Space when you copy the text, or the file is actually not saved as ANSI after all.

Related

convert comment string to an ASCII character list in sicstus-prolog

currently I am working on comparison between SICStus3 and SICStus4 but I got one issue that is SICStus4 will not consult any cases where the comment string has carriage controls or tab characters etc as given below.
Example case as given below.It has 3 arguments with comma delimiter.
case('pr_ua_sfochi',"
Response:
answer(amount(2370.09,usd),[[01AUG06SFO UA CHI Q9.30 1085.58FUA2SFS UA SFO Q9.30 1085.58FUA2SFS NUC2189.76END ROE1.0 XT USD 180.33 ZPSFOCHI 164.23US6.60ZP5.00AY XF4.50SFO4.5]],amount(2189.76,usd),amount(2189.76,usd),amount(180.33,usd),[[fua2sfs,fua2sfs]],amount(6.6,usd),amount(4.5,usd),amount(0.0,usd),amount(18.6,usd),lasttktdate([20061002]),lastdateafterres(200712282]),[[fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([])),fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([]))]],<>,<>,cat35(cat35info([])))
.
02/20/2006 17:05:10 Transaction 35 served by static.static.server1 (usclsefat002:7551) running E*Fare version $Name: build-2006-02-19-1900 $
",price(pnr(
user('atl','1y',<>,<>,dept(<>,'0005300'),<>,<>,<>),
[
passenger(adt,1,[ptconly(n)])
],
[
segment(1,sfo,chi,'ua','<>','100',20140901,0800,f,20140901,2100,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no)),
segment(2,chi,sfo,'ua','<>','101',20140906,1000,f,20140906,1400,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no))
]),[
rebook(n),
ticket(20140301,131659),
dbaccess(20140301,131659),
platingcarrier('ua'),
tax_exempt([]),
trapparm("trap:ffil"),
city(y)
])).
The below predicate will remove comment section in above case.
flatten-cases :-
getmessage(M1),
write_flattened_case(M1),
flatten-cases.
flatten-cases.
write_flattened_case(M1):-
M1 = case(Case,_Comment,Entry),!,
M2 = case(Case,Entry),
writeq(M2),write('.'),nl.
getmessage(M) :-
read(M),
!,
M \== end_of_file.
:- flatten-cases.
Now my requirement is to convert the comment string to an ASCII character list.
Layout characters other than a regular space cannot occur literally in a quoted atom or a double quoted list. This is a requirement of the ISO standard and is fully implemented in SICStus since 3.9.0 invoking SICStus 3 with the option --iso. Since SICStus 4 only ISO syntax is supported.
You need to insert \n and \t accordingly. So instead of
log('Response:
yes'). % BAD!
Now write
log('Response:\n\tyes').
Or, to make it better readable use a continuation escape sequence:
log('Response:\n\
\tyes').
Note that using literal tabs and literal newlines is highly problematic. On a printout you do not see them! Think of 'A \nB' which would not show the trailing spaces nor trailing tabs.
But there are also many other situations like: Making a screenshot of program text, making a photo of program text, using a 3270 terminal emulator and copying the output. In the past, punched cards. The text-mode when reading files (which was originally motivated by punched cards). Similar arguments hold for the tabulator which comes from typewriters with their manually settable tab stops.
And then on SO it is quite difficult to type in a TAB. The browser refuses to type it (very wisely), and if you copy it in, you get it rendered as spaces.
If I am at it, there is also another problem. The name flatten-case should rather be written flatten_case.

How to specify encoding while creating file?

I am using an R script to create and append a file. But I need the file to be saved in ANSI encoding,even though some characters are in Unicode format. How to ensure ANSI encoding?
newfile='\home\user\abc.ttl'
file.create(newfile)
text3 <- readLines('\home\user\init.ttl')
sprintf('readlines %d',length(text3))
for(k in 1:length(text3))
{
cat(text3[[k]],file=newfile,sep="\n",append=TRUE)
}
Encoding can be tricky, since you need to detect your encoding upon input, and then you need to convert it before writing. Here it sounds like your input file input.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you are probably going to lose some non-translatable characters, since there may be no mapping from the UTF-8 characters to ASCII outside of the 128-bit lower range. (Within this range the mappings of UTF-8 to ASCII are the same.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.

Handle utf 8 characters in unix

I was trying to find a solution for my problem and after looking at the forums I couldn't so I'll explain my problem here.
We receive a csv file from a client with some special characters and encoded as unknown-8bit. We convert this csv file to xml using an awk script. With the xml file we make an API call to our system using utf-8 as default encoding. The response is an error with following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as bellow:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road" cause when I use utf-8 encoding it automatically converts "'" character by a x92 value.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first, best would be asking the sender.
If you cannot do so, and also for sanity-checking, the unix command file is very useful for that (the linked page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII or replace them on encoding, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current set-up is using) -- I would guess Windows-1252 which uses 0x92 for a curvy right single quote.

CSS character encoding

According to W3C, CSS can set its character encoding by #charset in the first line, is it valid to to say that I should put #charset "UTF-8" in every CSS i made, even it only contains ASCII characters?
Will there be any performance penalty after I declare it using UTF-8 ?
p.s. I can't think of a way to test it out.
No, it is not valid to say so, as an unqualified statement. If your file contains only ASCII characters, it is very likely that its character encoding is ASCII compatible (EBCDIC is not much used there days), so the rule would be harmless, but also pointless as long as the file keeps being ASCII-only.
What matters is what happens when a non-ASCII character gets inserted into the CSS file, for whatever reason. It could be, for example, a innocent-look smart quote (”) inserted when editing the file with a program that produces smart quotes. It is more likely that the smart quote gets inserted in windows-1252 encoding than in UTF-8 encoding. So if the file has the #charset "UTF-8" rule, it probably becomes a bit more difficult to analyze the problem.
If, on the other hand, you know that your CSS file will be edited using software that uses UTF-8 encoding by default, then it is OK to declare it as UTF-8 encoded even if it only contains ASCII characters. For example, if you some day edit the file and add a declaration like content: "“foo”", you might forget to add the #charset rule.
There is no overhead in declaring the encoding as UTF-8. If the data contains ASCII characters only, any decent routine that reads UTF-8 will process the characters as fast as simple reading of ASCII. A routine that reads a UTF-8 bytestream will have to first check whether the byte is in the ASCII range and take it as standing for an ASCII character if it is.

Printing ASCII value of BB (HEX) in Unix

When I am trying to paste the character » (right double angle quotes) in Unix from my Notepad, it's converting to /273. The corresponding Hex value is BB and the Decimal value is 187.
My actual requirement is to have this character as the file delimiter when I export a .dat file from a database table. So, this character was put in as the delimiter after each column name. But, while copy-pasting, it's getting converted to /273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh
ASCII only defines the character codes up to 127 (0x7F) - everything after that is another encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use - the locale command will report your current locale settings, the locale(5) and environ(5) man pages cover how to set them. A much more in-depth introduction to the whole character encoding concept can be found in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The character code 0xBB is shown as » in the IS0-8859-1 character chart, so that's probably the character set you want, so the locale would be something like en_US.ISO8859-1 for that character set with US/English messages/date formats/currency settings/etc.

Resources