What special character is this space-like thousand separator? - r

Sometimes, in an Excel file, I find thousand separators that look exactly like a space but are not one. I know this because typing a space in the replace dialog does not match them. If I instead copy and paste this "space" from inside a number, the replacement works.
I have always done it this way, and I still don't know what this mysterious space-like thousand separator is.
Now I have a problem because I have to do this in R, and copy-paste no longer works. I think it may be similar to the case with €.
When I run gsub("€","",Price), the euro symbol is not replaced.
Could you please help? Thank you.

That would be U+2009, "thin space".

There are multiple characters that look like a space character. Because they look so similar, it is best to refer to them by their Unicode code points.
U+0020 Space: the normal space character you get when pressing the spacebar on the keyboard.
U+00A0 No-Break Space: this space character is meant to prevent breaking into a new line when word wrapping. In HTML, it is equivalent to &nbsp;
U+2007 Figure Space: this space character is meant to be as wide as a numerical digit, and it prevents line breaking.
U+2009 Thin Space: this space character is meant to be slightly narrower than a normal space character.
U+202F Narrow No-Break Space: similar to U+00A0, but narrower in width.
There may be others I'm missing.
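To find out which of these characters a string actually contains, it helps to print each code point with its Unicode name. The following is a minimal Python sketch of that idea (the question itself is about R, where the same code points can be written with "\u2009"-style escapes in string literals). The helper names describe_spaces and strip_space_separators and the sample value are made up for illustration.

```python
import unicodedata


def describe_spaces(text):
    """Return (code point, Unicode name) for every space-separator character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if unicodedata.category(ch) == "Zs"  # "Zs" = Separator, space
    ]


def strip_space_separators(text):
    """Remove every character in the Unicode 'Zs' category, not only U+0020."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Zs")


# Hypothetical value: a thin space and a narrow no-break space as thousand separators.
price = "1\u2009234\u202f567"
print(describe_spaces(price))
print(strip_space_separators(price))
```

Filtering on the "Zs" category covers all five characters listed above without naming each one, which is convenient when it is not known in advance which separator a spreadsheet export used.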

Related

How to remove characters between space and specific character in R

I have a question similar to this one but instead of having two specific characters to look between, I want to get the text between a space and a specific character. In my example, I have this string:
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
but if I were to do something like this: str_remove_all(myString, " .*jpg") I end up with
[1] "This"
I know that what's happening is R is finding the first instance of a space and removing everything between that space and ".jpg" but I want it to be the first space immediately before ".jpg". My final result I hope for looks like this:
[1] "This is my string I scraped from the web. I want to remove all instances of a picture. The text continues here."
NOTE: I know that a solution may arise which does what I want, but ends up putting two periods next to each other. I do not mind a solution like that because later in my analysis I am removing punctuation.
You can use
str_remove_all(myString, "\\S*\\.jpg")
Or, if you also want to remove optional whitespace before the "word":
str_remove_all(myString, "\\s*\\S*\\.jpg")
Details:
\s* - zero or more whitespaces
\S* - zero or more non-whitespaces
\.jpg - .jpg substring.
To make it case insensitive, add (?i) at the start of the pattern: "(?i)\\s*\\S*\\.jpg".
If you need to make sure there is no word char after jpg, add a word boundary: "(?i)\\s*\\S*\\.jpg\\b"
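The pattern can also be checked outside R. The sketch below applies the same regular expression with Python's re module, which interprets \s, \S, (?i) and \b the same way for this input; the variable names are illustrative only.

```python
import re

my_string = (
    "This is my string I scraped from the web. I want to remove all instances "
    "of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
)

# Optional whitespace, then a run of non-whitespace ending in ".jpg".
# (?i) makes the match case-insensitive; \b requires a word boundary after "jpg".
pattern = re.compile(r"(?i)\s*\S*\.jpg\b")

cleaned = pattern.sub("", my_string)
print(cleaned)
```

Because \S* cannot cross whitespace, each match starts at the space immediately before the file name. As the question anticipates, this leaves two periods next to each other after "picture".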

R removes spaces in read.table

I came across some surprising behavior today that doesn't seem right to me. I have a CSV file with several columns, some numeric and some text. One of my text columns contains extra spaces between some words. When I read this file into R using read.csv (or more generally read.table), it removes the extra spaces. I am not talking about leading or trailing whitespace, but spaces inside the string.
I have looked through the docs and nowhere can I find an option to turn off this behavior. Surely there must be a way to tell R to read the data as it is and not remove these spaces. Or is there?

How to fix prettytable to display chinese character properly

from prettytable import PrettyTable
header="乘客姓名,性别,出生日期".split(",")
x = PrettyTable(header)
x.align["乘客姓名"]="l"
table='''HuangTianhui,男,1948/05/28
姜翠云,女,1952/03/27
李红晶,女,1994/12/09
LuiChing,女,1969/08/02
宋飞飞,男,1982/03/01
唐旭东,男,1983/08/03
YangJiabao,女,1988/08/25
买买提江·阿布拉,男,1979/07/10
安文兰,女,1949/10/20
胡偲婠(婴儿),女,2011/02/25
(有待确定姓名),男,1985/07/20
'''
data=[row for row in table.split("\n") if row]
for row in data:
    x.add_row(row.strip().split(","))
print(x)
The output format I want is the following.
In this example, prettytable.py does not align the row containing 买买提江·阿布拉 correctly, because the character · has an ambiguous width. How can I fix this in prettytable.py?
I have added two lines to def _char_block_width(char) in prettytable.py, but the problem remains.
if char == 0xb7:
    return 2
I have solved it: the file prettytable.py should be installed directly in d:\python33\Lib\site-packages, not as d:\python33\Lib\site-packages\prettytable\prettytable.py.
There are many Chinese characters with ambiguous width, and adding two lines like the following for each of them does not scale: with 50 ambiguous characters, 100 lines would be added to prettytable.py. Is there a simpler way, changing just a few lines, to handle all ambiguous-width characters?
if char == 0xb7:
    return 2
The issue you're running into has to do with the dot character in the incorrectly padded line of your Python output. The dot is Unicode code point U+00B7 · middle dot. This character is considered to have an "ambiguous" width, as it is a narrow character in most non-East-Asian fonts, but is rendered as full-width in most Asian ones. Without context, a program can't tell how wide it will appear on the screen. Unfortunately, Python's Unicode system doesn't appear to have any way to provide that context.
One fix might be to replace the offending dot with one that has an unambiguous width, such as U+30FB katakana middle dot (which is always full width). This way the padding logic will be able to recognize that extra space is needed for that line.
Another solution could be to set your console to use a font with a more Western treatment of the middle dot character, rather than the current one that follows the East-Asian style of rendering it as full-width. This will mean that the existing padding is correct. Your output from R clearly uses a different font than the Python output does, and its font renders the dot as half-width.
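Although the console font cannot be detected, the standard library's unicodedata.east_asian_width does report which characters are ambiguous (class "A"), so the decision can be made once for the whole class instead of character by character. The sketch below is one possible way to do that; display_width and its ambiguous_wide parameter are hypothetical names, not part of prettytable, and combining characters are ignored for simplicity.

```python
import unicodedata


def display_width(text, ambiguous_wide=False):
    """Estimate terminal columns for text from East Asian Width classes.

    ambiguous_wide is a caller-supplied guess about the console font:
    True counts "A" (ambiguous) characters as 2 columns, False as 1.
    """
    width = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):      # wide or fullwidth: always 2 columns
            width += 2
        elif eaw == "A":           # ambiguous: depends on the font
            width += 2 if ambiguous_wide else 1
        else:
            width += 1
    return width


name = "买买提江\u00b7阿布拉"  # contains U+00B7 MIDDLE DOT
print(unicodedata.east_asian_width("\u00b7"))
print(display_width(name), display_width(name, ambiguous_wide=True))
print(display_width(name.replace("\u00b7", "\u30fb")))  # KATAKANA MIDDLE DOT
```

Replacing the dot with U+30FB makes the result independent of the ambiguous_wide guess, which is the first fix suggested above.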

Regular Expression to remove contents in string

I have a string as below:
4s: and in this <em>new</em>, 5s: <em>year</em> everybody try to make our planet clean and polution free.
Desired result:
4s: and in this <em>new</em>, <em>year</em> everybody try to make our planet clean and polution free.
What I want is this: if the string has two <em> tags, and the gap between them is just one word of the form ns: (where n is a numeric value 0 to 4 characters long), then I want to remove ns: from the string, while keeping any punctuation marks ('?', '.', ',') between the two <em> tags as they are.
I would also like to note that the input string may or may not have punctuation marks between these two <em> tags.
My regular expression is below:
Regex.Replace(txtHighlight, @"</em>.(\s*)(\d*)s:(\s*).<em", "</em> <em");
I hope my requirement is clear.
How can I do this using regular expressions?
Not really sure what you need, but how about:
Regex.Replace(txtHighlight, @"</em>(.)\s*\d+s:\s*(.)<em", "</em>$1$2<em");
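To see what this replacement does to the sample input, the same pattern can be tried in Python's re module, where \1 and \2 play the role of $1 and $2. This is only an illustrative sketch; the variable names are not from the question.

```python
import re

text = (
    "4s: and in this <em>new</em>, 5s: <em>year</em> "
    "everybody try to make our planet clean and polution free."
)

# Same pattern as the Regex.Replace call above: the two (.) groups capture the
# single characters on either side of the "ns:" word so they can be kept.
pattern = re.compile(r"</em>(.)\s*\d+s:\s*(.)<em")
result = pattern.sub(r"</em>\1\2<em", text)
print(result)
```

Note that when there is no punctuation between the tags, both captured characters are spaces, so the result keeps two spaces between </em> and <em>.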
If you just want to take out the 4s 5s bit you could do something like this:
Regex.Replace(txtHighlight, @"\s\d+s:", "");
This will match a space followed by one or more digits, the letter s, and a colon.
If that's not what you're after, my apologies. I hope it might help :)

regex to disallow whitespace

I'm looking for a regex that will allow alphanumeric characters and almost all special characters, but no whitespace. It should be usable in C#. It would be nice if .NET supported POSIX-style character classes, but I can't seem to get them to work. TIA
Pretty sure \S (note capitalization) is the non-whitespace character class.
Something along the lines of: [^\s]+ should do the trick.
This roughly translates as "match one or more consecutive characters that are not whitespace" (\s matches a space, tab, or line break).
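As a quick check of the idea, here is a small Python sketch (the question asks about C#, where \S is also available). The helper name has_no_whitespace is made up for illustration; fullmatch is used so the whole value must consist of non-whitespace characters.

```python
import re

# \S+ and [^\s]+ are equivalent: one or more non-whitespace characters.
NO_WHITESPACE = re.compile(r"\S+")


def has_no_whitespace(value):
    """True if value is non-empty and contains no whitespace at all."""
    return NO_WHITESPACE.fullmatch(value) is not None


print(has_no_whitespace("abc123!@#"))
print(has_no_whitespace("abc 123"))
```

Without anchoring (fullmatch here, or ^ and $ around the pattern), the expression would also succeed on any input that merely contains a run of non-whitespace characters.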
