RST Inline markup broken on some lines - restructuredtext

Inline markup is working for some strings, not for others:
**Traffic protection rule** created for ​``example.com``
Data is aggregated for ​``example.com``, ``​anysubdomain.example.com`` and ``onemorelevel.anysubdomain.example.com`` and then the rule is applied on the aggregated data.
In the first paragraph and in the second, example.com is not converted.

There is a zero-width space (U+200B) immediately before each of the two instances of ``example.com``.
If you remove these two characters, the rendered output will be OK.
Zero-width space is not classified as a whitespace character in Unicode (see https://sourceforge.net/p/docutils/bugs/307/#ff33).
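A minimal Python sketch for locating and removing such characters before running the reStructuredText source through a renderer. The helper names `find_zero_width_spaces` and `strip_zero_width_spaces` are made up for this illustration; only the standard library is used.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def find_zero_width_spaces(text):
    """Return 1-based (line, column) positions of every U+200B in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == ZWSP:
                hits.append((line_no, col))
    return hits

def strip_zero_width_spaces(text):
    """Return text with every U+200B removed."""
    return text.replace(ZWSP, "")

sample = "created for \u200b``example.com``"
print(unicodedata.category(ZWSP))      # general category of U+200B
print(ZWSP.isspace())                  # Python does not treat it as whitespace
print(find_zero_width_spaces(sample))
print(strip_zero_width_spaces(sample))
```

Because the character is invisible, reporting its position is usually more useful than silently deleting it: the same stray character tends to come back with the next copy and paste.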

Related

R removes spaces in read.table

I came across some surprising behavior today that doesn't seem right to me. I have a CSV file with several columns, some numeric and some text. One of my text columns contains extra spaces between some words. When I read this file into R using read.csv (or more generally read.table), it removes the extra spaces. I am not talking about leading or trailing whitespace, but spaces inside the string.
I have looked through the docs and nowhere can I find an option to turn off this behavior. Surely there must be a way to tell R to read the data as it is and not remove these spaces. Or is there?

What special character is this space like thousand separator?

Sometimes, in an Excel file, I find thousand separators that look exactly like a space but are not one: if I try to replace them by typing a space, nothing matches. Instead, if I copy and paste this "space" from inside a number, then I can replace it.
I have always done it this way, and I still don't know what this mysterious space-like thousand separator is.
Now I have a problem because I have to do this in R, where copy and paste no longer works. I think it may be like the case with €.
When I do gsub("€","",Price), the euro symbol is not replaced.
Could you please help? Thank you.
That would be U+2009, "thin space".
There are multiple characters that look like a space character. Because they look similar, it is good to refer to them using the Unicode code.
U+0020   Space: this is the normal space character you get when pressing the spacebar on the keyboard
U+00A0   No-Break Space: this space character is meant to prevent breaking into a new line when word wrapping. In HTML, it is equivalent to &nbsp;
U+2007   Figure Space: this space character is meant to be as wide as a numerical digit, and it prevents line breaking
U+2009   Thin Space: this space character is meant to be slightly less wide than a normal space character
U+202F   Narrow No-Break Space: similar to U+00A0, but narrower in width
There may be others I'm missing.
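The question is about R, but the idea is easy to show in Python: identify the separator by its code point rather than by how it looks, then drop every character in the Unicode "space separator" category (Zs) before converting to a number. The helper names `describe` and `parse_grouped_int` are invented for this sketch.

```python
import unicodedata

# The space-like characters listed above.
SPACE_LIKE = ["\u0020", "\u00a0", "\u2007", "\u2009", "\u202f"]

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def parse_grouped_int(text):
    """Parse an integer whose digit groups are separated by any Zs character."""
    digits = "".join(ch for ch in text if unicodedata.category(ch) != "Zs")
    return int(digits)

for ch in SPACE_LIKE:
    print(describe(ch), unicodedata.category(ch))

print(parse_grouped_int("1\u2009234\u202f567"))
```

Calling `describe` on a character copied out of the spreadsheet tells you which of these separators the file actually contains.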

emphasis and not emphasis in the same word

In ReStructuredText, is it possible to have emphasis and no emphasis in the same word? For example:
*emph*not-emph
leading to "emph no-emph", but with no white space in between? I can't find a way to do it, not even with a substitution.
What you are looking for is Character-Level Inline Markup. The description from the reStructuredText specification is (emphasis mine):
It is possible to mark up individual characters within a word with backslash escapes [...] Backslash escapes can be used to allow arbitrary text to immediately follow inline markup.
The two examples provided in the specification are:
For a single character immediately following inline markup:
Python ``list``\s use square bracket syntax.
For arbitrary text immediately following inline markup:
Possible in *re*\ ``Structured``\ *Text*, though not encouraged.
So to achieve the output you want, you need to use the backslash-escaped whitespace pattern:
*emph*\ not-emph
This is required because the inline markup recognition rules state:
Inline markup end-strings must end a text block or be immediately followed by
whitespace,
one of the ASCII characters - . , : ; ! ? \ / ' " ) ] } > or
a non-ASCII punctuation character with Unicode category Pd (Dash), Po (Other), Pe (Close), Pf (Final quote), or Pi (Initial quote).
Note that the use of that pattern above is discouraged in the reStructuredText specification:
The use of backslash-escapes for character-level inline markup is not encouraged. Such use is ugly and detrimental to the unprocessed document's readability. Please use this feature sparingly and only where absolutely necessary.
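The end-string rule quoted above can be sketched as a small Python predicate. This is a simplified reading of the quoted rule only, not the actual docutils implementation, and the name `may_follow_end_string` is invented for the illustration.

```python
import unicodedata

# ASCII characters the quoted rule allows right after an end-string.
ASCII_FOLLOWERS = set("-.,:;!?\\/'\")]}>")
# Unicode punctuation categories the quoted rule allows for non-ASCII characters.
PUNCT_CATEGORIES = {"Pd", "Po", "Pe", "Pf", "Pi"}

def may_follow_end_string(next_char):
    """Return True if next_char may directly follow an inline markup end-string.

    Pass an empty string to represent the end of the text block.
    """
    if next_char == "":
        return True
    if next_char.isspace():
        return True
    if next_char in ASCII_FOLLOWERS:
        return True
    return ord(next_char) > 127 and unicodedata.category(next_char) in PUNCT_CATEGORIES

print(may_follow_end_string("n"))   # the "n" of *emph*not-emph
print(may_follow_end_string("\\"))  # the backslash of *emph*\ not-emph
```

Under this reading, the letter "n" directly after the closing asterisk fails the check, while the backslash of the escaped-whitespace pattern passes it, which is why `*emph*\ not-emph` is recognized as emphasis.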

How to fix prettytable to display chinese character properly

from prettytable import PrettyTable
header="乘客姓名,性别,出生日期".split(",")
x = PrettyTable(header)
x.align["乘客姓名"]="l"
table='''HuangTianhui,男,1948/05/28
姜翠云,女,1952/03/27
李红晶,女,1994/12/09
LuiChing,女,1969/08/02
宋飞飞,男,1982/03/01
唐旭东,男,1983/08/03
YangJiabao,女,1988/08/25
买买提江·阿布拉,男,1979/07/10
安文兰,女,1949/10/20
胡偲婠(婴儿),女,2011/02/25
(有待确定姓名),男,1985/07/20
'''
data=[row for row in table.split("\n") if row]
for row in data:
    x.add_row(row.strip().split(","))
print(x)
The output format I want is as follows.
In this example, prettytable.py cannot properly display the character · in 买买提江·阿布拉, because that character has an ambiguous width. How can I fix this bug in prettytable.py?
I have added two lines to def _char_block_width(char) in prettytable.py, but the problem remains.
if char == 0xb7:
    return 2
I have solved it: the file prettytable.py should be installed directly in d:\python33\Lib\site-packages on my computer, not as d:\python33\Lib\site-packages\prettytable\prettytable.py.
There are many characters with ambiguous width, and adding two lines like the following for each one does not scale: with 50 ambiguous characters, 100 lines would have to be added to prettytable.py. Is there a simpler way, changing only a few lines, to handle all ambiguous-width characters?
if char == 0xb7:
    return 2
The issue you're running into has to do with the dot character in the incorrectly padded line of your Python output. The dot is Unicode code point U+00B7 · middle dot. This character is considered to have an "ambiguous" width, as it is a narrow character in most non-East-Asian fonts, but is rendered as full-width in most Asian ones. Without context, a program can't tell how wide it will appear on the screen. Unfortunately, Python's Unicode system doesn't appear to have any way to provide that context.
One fix might be to replace the offending dot with one that has an unambiguous width, such as U+30FB katakana middle dot (which is always full width). This way the padding logic will be able to recognize that extra space is needed for that line.
Another solution could be to set your console to use a font with a more Western treatment of the middle dot character, rather than the current one, which follows the East Asian style of rendering it as full-width. This will mean that the existing padding is correct. Your output from R clearly uses a different font than the Python output does, and its font renders the dot as half-width.
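To treat every ambiguous-width character at once instead of special-casing each code point, one option is to look up the East Asian Width property with the standard library's `unicodedata.east_asian_width` and let the caller say how ambiguous characters should be counted. This is a sketch that is independent of prettytable's internals; `display_width`, `pad_to`, and the `ambiguous_is_wide` flag are names invented here, and combining characters are not handled.

```python
import unicodedata

def display_width(text, ambiguous_is_wide=True):
    """Estimate how many terminal columns text occupies.

    Wide (W) and fullwidth (F) characters count as 2 columns.
    Ambiguous (A) characters count as 2 only if ambiguous_is_wide is True,
    which matches a console font that renders them East Asian style.
    """
    width = 0
    for ch in text:
        kind = unicodedata.east_asian_width(ch)
        if kind in ("W", "F"):
            width += 2
        elif kind == "A" and ambiguous_is_wide:
            width += 2
        else:
            width += 1
    return width

def pad_to(text, target, ambiguous_is_wide=True):
    """Right-pad text with spaces until it fills target columns."""
    missing = target - display_width(text, ambiguous_is_wide)
    return text + " " * max(0, missing)

name = "买买提江·阿布拉"
print(unicodedata.east_asian_width("\u00b7"))        # property of the middle dot
print(display_width(name, ambiguous_is_wide=True))
print(display_width(name, ambiguous_is_wide=False))
```

The flag is the "context" the program cannot discover on its own: it has to match how the console font actually draws ambiguous characters.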

Can the HTML 'class' element attribute contain line breaks?

Can the 'class' attribute of HTML5 elements contain line breaks? Is it allowable in the specs and do browsers support it?
I ask because I have some code that dynamically inserts various classes into the element and this has created one very long line that is hard to manage. Normally I would build the class value using a variable but the CMS I'm using requires the template conditional tags to be positioned inline with the HTML. I can't use variables or PHP.
What I found in my research is that some HTML tag attributes need to be a single line, but I haven't been able to discover if the class attribute is one of those.
Does anyone know something about this?
Per the HTML 4 spec, the class attribute is CDATA:
User agents should interpret attribute values as follows:
o Replace character entities with characters
o Ignore line feeds
o Replace each carriage return or tab with a single space.
so you're in good shape there.
The HTML5 spec describes a class as a set of space separated tokens, where a 'space' includes newlines.
So you should be good there, too.
Can the [class] attribute of HTML5 elements contain line breaks?
Yes. The HTML5 spec says:
The attribute, if specified, must have a value that is a set of space-separated tokens representing the various classes that the element belongs to.
The link proceeds to say:
A set of space-separated tokens is a string containing zero or more words (known as tokens) separated by one or more space characters, where words consist of any string of one or more characters, none of which are space characters.
And space characters include:
space (' ')
tab (\t)
line feed (\n)
form feed (\f)
carriage return (\r)
The space characters, for the purposes of this specification, are U+0020 SPACE, "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR" (U+000D).
Newlines as you would add to UTF-8 documents are:
line feeds (\n)
carriage returns (\r)
a carriage return followed immediately by a line feed (\r\n)
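As an illustration of that definition, here is a small Python sketch that splits a class attribute value on exactly the five space characters listed above, so a value spread over several lines yields the same tokens as a single-line one. The function name `class_tokens` is made up for this example.

```python
import re

# The five characters the HTML5 spec counts as space characters.
HTML_SPACE_CHARS = " \t\n\f\r"

def class_tokens(value):
    """Split a class attribute value into its space-separated tokens."""
    stripped = value.strip(HTML_SPACE_CHARS)
    if not stripped:
        return []
    return re.split(r"[ \t\n\f\r]+", stripped)

multi_line = "btn\n  btn-primary\r\n\tactive"
print(class_tokens(multi_line))
print(class_tokens("btn btn-primary active"))
```

An explicit character set is used instead of Python's bare `str.split()`, because `str.split()` also splits on characters such as the no-break space, which this definition does not treat as a separator.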
