I need to remove the space between the digits in a number.
Currently I have: "...... 36 191 39 128 ...... 10 (17) -".
And I need to get the following: "...... 36191 39128 ...... 10 (17) -".
I would be grateful for any help in this regard.
You could use regular expressions if you want to avoid dependencies, but this is made easy with the stringr package.
library(stringr)
x_string <- " 36 191 39 128 10 (17)"
str_replace_all(x_string, " ", "")
# [1] "361913912810(17)"
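For comparison, since the desired output keeps the space before "(17)", here is a base-R sketch using a lookaround regex that removes only the spaces flanked by digits on both sides (no packages needed):

```r
x_string <- "36 191 39 128 10 (17)"
# remove a space only when a digit precedes AND follows it
gsub("(?<=\\d) (?=\\d)", "", x_string, perl = TRUE)
# [1] "361913912810 (17)"
```

Note that this still joins "128" and "10", because nothing but a space separates them here; in the original string the "......" runs keep those groups apart.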
I have a web response being returned in raw format which I'm unable to properly encode. It contains the following values:
ef bc 86
The character is meant to be a Fullwidth Ampersand, as illustrated below:
> as.character("\uFF06")
[1] "＆"
> charToRaw("\uFF02")
[1] ef bc 82
However, no matter what I've tried it gets converted to ＂. To illustrate:
> rawToChar(charToRaw("\uFF02"))
[1] "＂"
Because of the equivalence of the raw values, I don't think there's anything I can do in my web call to influence the problem I'm having (happy to be corrected). I believe I need to work out how to properly do the character encoding.
I also took an extreme approach of trying all other encodings as follows but none converted to the fullwidth ampersand:
> x_raw <- charToRaw("\uFF02")
> x_raw
[1] ef bc 82
> sapply(
+ stringi::stri_enc_list()
+ ,function(encoding) stringi::stri_encode(str = x_raw, encoding)
+ ) |> # R's new native pipe
+ tibble::enframe(name = "encoding")
# A tibble: 1,203 x 2
encoding value
<chr> <chr>
1 037 "Õ¯b"
2 273 "Õ¯b"
3 277 "Õ¯b"
4 278 "Õ¯b"
5 280 "Õ¯b"
6 284 "Õ¯b"
7 285 "Õ~b"
8 297 "Õ¯b"
9 420 "\u001a\u001ab"
10 424 "\u001a\u001ab"
# ... with 1,193 more rows
My workaround at the moment is to replace the strings after decoding, but this character is just one example of many, and hard-coding every instance doesn't seem practical.
> rawToChar(x_raw)
[1] "＂"
> stringr::str_replace_all(rawToChar(x_raw), c("＂" = "\uFF06"))
[1] "＆"
The substitution workaround is further complicated by characters like the HYPHEN (not HYPHEN-MINUS), where the last two raw values somehow get converted to a string containing what appear to be octal escape values:
> as.character("\u2010") # HYPHEN
[1] "‐"
> as.character("\u2010") |> charToRaw() # As raw
[1] e2 80 90
> as.character("\u2010") |> charToRaw() |> rawToChar() # Converted back to string
[1] "â€\u0090"
> charToRaw("â\200\220") # string with equivalent raw
[1] e2 80 90
Any help appreciated.
I'm not totally clear on exactly what you are trying to do, but the problem with getting back your original character is that R cannot determine the encoding automatically from the raw bytes. I assume you are on Windows. If you do
val <- rawToChar(charToRaw("\uFF06"))
val
# [1] "＆"
Encoding(val)
# [1] "unknown"
Encoding(val) <- "UTF-8"
val
# [1] "＆"
Just make sure to set the encoding properly.
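The same idea can be wrapped in a small helper (a sketch; raw_to_utf8 is a made-up name, and it assumes the response bytes really are UTF-8):

```r
raw_to_utf8 <- function(bytes) {
  s <- rawToChar(bytes)   # bytes -> string; Encoding(s) is still "unknown"
  Encoding(s) <- "UTF-8"  # declare (not convert) the encoding
  s
}

# ef bc 86 is the UTF-8 encoding of U+FF06 (fullwidth ampersand)
raw_to_utf8(as.raw(c(0xef, 0xbc, 0x86))) == "\uFF06"
# [1] TRUE
```

The key point is that `Encoding<-` only marks the string; it does not re-encode the bytes, which is exactly what is wanted when the bytes are already valid UTF-8.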
I was doing some file merging, and two files wouldn't merge, despite having a key column that matches (I actually generated one key column by copy-pasting from the other). It's the damndest thing, and I worry that I'm either going crazy or missing something fundamental. As an example (and I cannot figure out how to make it reproducible, as when I copy and paste these strings into new objects, they compare just fine), here's my current console:
> q
[1] "1931 80th Anniversary"
> z
[1] "1931 80th Anniversary"
> q == z
[1] FALSE
I str-ed both, just in case I missed something, and...
> str(q)
chr "1931 80th Anniversary"
> str(z)
chr "1931 80th Anniversary"
What could be going on here?
This was a great puzzler. To diagnose the problem, charToRaw() was the answer.
> charToRaw(q)
[1] 31 39 33 31 c2 a0 38 30 74 68 c2 a0 41 6e 6e 69 76 65
[19] 72 73 61 72 79
> charToRaw(z)
[1] 31 39 33 31 20 38 30 74 68 20 41 6e 6e 69 76 65 72 73
[19] 61 72 79
Oh! Different! The difference seems to lie in the encoding, which I never would have guessed, given that both were loaded from plain ole' CSVs:
> Encoding(q)
[1] "UTF-8"
> Encoding(z)
[1] "unknown"
In the end, I used iconv() on q to make it work
> iconv(q, from = 'UTF-8', to = 'ASCII//TRANSLIT') == z
[1] TRUE
This has been a weird journey, and I hope this helps someone else who is as baffled as I was - and they learn a few new functions along the way.
It looks like you have non-breaking spaces in your string, which isn't really an encoding issue. This happens to me all the time because alt + space inserts a non-breaking space on a Mac, and I use alt on my German keyboard for all sorts of special characters, too. My pinkies are my slowest fingers and they don't always release alt fast enough when I transition from some special character to a space. I discovered this problem writing bash scripts, where <command> | <command> is common and | is alt + 7.
I think stringr::str_replace_all(q, "\\s", " ") should fix your current issue. Alternatively, you can try targeting specific non-printables, e.g. in your situation stringr::str_replace_all(q, "\uA0", " "). To expose the offending characters you can use stringi::stri_escape_unicode(q), which would return "1931\\u00a080th\\u00a0Anniversary". You can then just copy and paste to get the same results as above: stringr::str_replace_all(q, "\u00a0", " ")
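The mismatch can also be reproduced directly in base R (a sketch; \u00a0 is the no-break space):

```r
q <- "1931\u00a080th\u00a0Anniversary"  # no-break spaces between words
z <- "1931 80th Anniversary"            # ordinary spaces

q == z                       # FALSE: visually identical, byte-wise different
gsub("\u00a0", " ", q) == z  # TRUE after normalising the spaces
```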
I'm confused about why certain characters (e.g. "Ě", "Č", and "ŝ") lose their diacritical marks in a data frame, while others (e.g. "Š" and "š") do not. My OS is Windows 10, by the way. In my sample code below, a vector czechvec has 11 single-character strings, all Slavic accented characters. R displays those characters properly. Then a data frame mydf is created with czechvec as the second column (the function I() is used so it won't be converted to a factor). But then when R displays mydf or any row of mydf, it converts most of these characters to their plain-ASCII equivalents; e.g. mydf[3,] shows the character as "E" not "Ě". But subscripting with row and column, e.g. mydf[3,2], properly shows the accented character ("Ě").

Why should it make a difference whether R displays the whole row or just one cell? And why are some characters like "Š" completely unaffected? Also, when I write this data frame to a file, it completely loses the accents, even though I specify fileEncoding="UTF-8".
> charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
> hexvals <- as.hexmode(charvals)
> czechvec <- unlist(strsplit(intToUtf8(charvals), ""))
> czechvec
[1] "Á" "č" "Ě" "Č" "Ć" "ć" "Ŝ" "ŝ" "Ş" "Š" "š"
>
> mydf = data.frame(dec=charvals, char=I(czechvec), hex=I(format(hexvals, width=4, upper.case=TRUE)))
> mydf
dec char hex
1 193 Á 00C1
2 269 c 010D
3 282 E 011A
4 268 C 010C
5 262 C 0106
6 263 c 0107
7 348 S 015C
8 349 s 015D
9 350 S 015E
10 352 Š 0160
11 353 š 0161
> mydf[3,2]
[1] "Ě"
> mydf[3,]
dec char hex
3 282 E 011A
>
> write.table(mydf, file="myfile.txt", fileEncoding="UTF-8")
>
> df2 <- read.table("myfile.txt", stringsAsFactors=FALSE, fileEncoding="UTF-8")
> df2[3,2]
[1] "E"
Edited to add: Per Ernest A's answer, this behaviour is not reproducible in Linux. It must be a Windows issue. (I'm using R 3.4.1 for Windows.)
I cannot reproduce this behaviour, using R version 3.3.3 (Linux).
> data.frame(dec=charvals, char=I(czechvec), hex=I(format(hexvals, width=4, upper.case=TRUE)))
dec char hex
1 193 Á 00C1
2 269 č 010D
3 282 Ě 011A
4 268 Č 010C
5 262 Ć 0106
6 263 ć 0107
7 348 Ŝ 015C
8 349 ŝ 015D
9 350 Ş 015E
10 352 Š 0160
11 353 š 0161
Thanks to Ernest A's answer checking that the weird behaviour I observed does not occur in Linux, I Googled R WINDOWS UTF-8 BUG which led me to this article by Ista Zahn: Escaping from character encoding hell in R on Windows
The article confirms there is a bug in the data.frame print method on Windows, and gives some workarounds. (However, the article doesn't note the issue with write.table in Windows, for data frames with foreign-language text.)
One workaround suggested by Zahn is to change the locale to suit the particular language we are working with:
Sys.setlocale(category = "LC_CTYPE", locale = "czech")
charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
hexvals <- format(as.hexmode(charvals), width=4, upper.case=TRUE)
df1 <- data.frame(dec=charvals, char=I(unlist(strsplit(intToUtf8(charvals), ""))), hex=I(hexvals))
print.listof(df1)
dec :
[1] 193 269 282 268 262 263 348 349 350 352 353
char :
[1] "Á" "č" "Ě" "Č" "Ć" "ć" "Ŝ" "ŝ" "Ş" "Š" "š"
hex :
[1] "00C1" "010D" "011A" "010C" "0106" "0107" "015C" "015D" "015E" "0160"
[11] "0161"
df1
dec char hex
1 193 Á 00C1
2 269 č 010D
3 282 Ě 011A
4 268 Č 010C
5 262 Ć 0106
6 263 ć 0107
7 348 S 015C
8 349 s 015D
9 350 Ş 015E
10 352 Š 0160
11 353 š 0161
Notice that the Czech characters are now displayed correctly but not "Ŝ" and "ŝ", Unicode U+015C and U+015D, which apparently are used in Esperanto. But with the print.listof command, all the characters are displayed correctly. (By the way, dput(df1) lists the Esperanto characters incorrectly, as "S" and "s".)
write.table(df1, file="special characters example.txt", fileEncoding="UTF-8")
df2 <- read.table("special characters example.txt", stringsAsFactors=FALSE, fileEncoding="UTF-8")
print.listof(df2)
dec :
[1] 193 269 282 268 262 263 348 349 350 352 353
char :
[1] "Á" "č" "Ě" "Č" "Ć" "ć" "S" "s" "Ş" "Š" "š"
hex :
[1] "00C1" "010D" "011A" "010C" "0106" "0107" "015C" "015D" "015E" "0160"
[11] "0161"
When I write.table df1 and then read.table it back as df2, the "Ŝ" and "ŝ" characters have lost their circumflex. This must be a problem with the write.table command, as confirmed when I open the file with a different application such as OpenOffice Writer. The Czech characters are all there correctly, but the "Ŝ" and "ŝ" have been changed to "S" and "s".
For the time being, the best workaround for my purposes is, instead of putting the actual character in my data frame, to record its Unicode value, use write.table, and then use the UNICHAR function in OpenOffice Calc to put the character itself back into the file. But this is inconvenient.
I believe this same bug is relevant to this question: how to read data in utf-8 format in R?
Edited to add: Other similar questions I've now found on Stack Overflow:
Why do some Unicode characters display in matrices, but not data frames in R?
UTF-8 file output in R
Write UTF-8 files from R
And I found a workaround for the display issue by Peter Meissner here:
http://r.789695.n4.nabble.com/Unicode-display-problem-with-data-frames-under-Windows-tp4707639p4707667.html
It involves defining your own class unicode_df and print function print.unicode_df.
This still does not solve the issue I have with using write.table to write my data frame (which contains some columns with text in a variety of European languages) to a file that can be imported to a spreadsheet or any arbitrary application. But perhaps Meissner's solution can be adapted to work with write.table.
Here's a function write.unicode.csv that uses paste and writeLines (with useBytes=TRUE) to export a data frame containing foreign-language characters (encoded in UTF-8) to a csv file. All cells in the data frame will be enclosed in quote marks in the csv file.
#function that will create a CSV file for a data frame containing Unicode text
#this can be used instead of write.csv in R for Windows
#source: https://stackoverflow.com/questions/46137078/r-accented-characters-in-data-frame
#this is not elegant, and probably not robust
write.unicode.csv <- function(mydf, filename = "") { #mydf can be a data frame or a matrix
  linestowrite <- character(length = 1 + nrow(mydf))
  #first line will have the column names
  linestowrite[1] <- paste('"","', paste(colnames(mydf), collapse = '","'), '"', sep = "")
  if (nrow(mydf) < 1 | ncol(mydf) < 1) print("This is not going to work.") #a bit of error checking
  for (k1 in 1:nrow(mydf)) {
    r <- paste('"', k1, '"', sep = "") #each row will begin with the row number in quotes
    for (k2 in 1:ncol(mydf)) {
      r <- paste(r, paste('"', mydf[k1, k2], '"', sep = ""), sep = ",")
    }
    linestowrite[1 + k1] <- r
  }
  writeLines(linestowrite, con = filename, useBytes = TRUE)
} #end of function
Sys.setlocale(category = "LC_CTYPE", locale = "usa")
charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
hexvals <- format(as.hexmode(charvals), width=4, upper.case=TRUE)
df1 <- data.frame(dec=charvals, char=I(unlist(strsplit(intToUtf8(charvals), ""))), hex=I(hexvals))
print.listof(df1)
write.csv(df1, file="test1.csv")
write.csv(df1, file="test2.csv", fileEncoding="UTF-8")
write.unicode.csv(df1, filename="test3.csv")
dftest1 <- read.csv(file="test1.csv", encoding="UTF-8", colClasses="character")
dftest2 <- read.csv(file="test2.csv", encoding="UTF-8", colClasses="character")
dftest3 <- read.csv(file="test3.csv", encoding="UTF-8", colClasses="character")
print("CSV file written using write.csv with no fileEncoding parameter:")
print.listof(dftest1)
print('CSV file written using write.csv with fileEncoding="UTF-8":')
print.listof(dftest2)
print("CSV file written using write.unicode.csv:")
print.listof(dftest3)
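One way to verify what a writer actually put in a file, independent of how any particular application renders it, is to inspect the raw bytes directly. A throwaway sketch (the file name is made up):

```r
x <- intToUtf8(268)  # "Č" (U+010C); its UTF-8 encoding is c4 8c
writeLines(x, "bytes_check.txt", useBytes = TRUE)

# read back the first two bytes of the file
readBin("bytes_check.txt", what = "raw", n = 2)
# [1] c4 8c
```

If the bytes on disk differ from the expected UTF-8 sequence, the problem is in the writing step, not in the application used to open the file.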
How can I convert Ａb９８７６５４３２１０ into Ab9876543210? Is there a solution by regular expression?
test <- dput("Ａb９８７６５４３２１０")
Disclaimer: the following works on my machine, but since I can't replicate your fullwidth string purely from the example provided, this is a best guess based on my version of the problem (pasting the string into a text file, saving it with UTF-8 encoding, and loading it in with the encoding specified as UTF-8).
Step 1. Reading in the text (I added a half width version for comparison):
> test <- readLines("fullwidth.txt", encoding = "UTF-8")
> test
[1] "Ａb９８７６５４３２１０" "Ab9876543210"
Step 2. Verifying that the full & half width versions are not equal:
# using all.equal()
test1 <- test[1]
test2 <- test[2]
> all.equal(test1, test2)
[1] "1 string mismatch"
# compare raw bytes
> charToRaw(test1)
[1] ef bb bf ef bc a1 62 ef bc 99 ef bc 98 ef bc 97 ef bc 96 ef bc 95 ef
[24] bc 94 ef bc 93 ef bc 92 ef bc 91 ef bc 90
> charToRaw(test2)
[1] 41 62 39 38 37 36 35 34 33 32 31 30
For anyone interested, if you paste the raw byte version into a utf-8 decoder as hexadecimal input, you'll see that except for letter b (mapped from 62 in the 7th byte), the rest of the letters were formed by 3-byte sequences. In addition, the first 3-byte sequence maps to "ZERO WIDTH NO-BREAK SPACE character", so it's not visible when you print the string to console.
Step 3. Converting from full width to half width using the Nippon package:
library(Nippon)
test1.converted <- zen2han(test1)
> test1.converted
[1] "Ab9876543210"
# If you want to compare against the original test2 string, remove the zero
# width character in front
> all.equal(substring(test1.converted, 2), test2)
[1] TRUE
Here is a base R solution
Fullwidth ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 (65248) from their ASCII counterparts, so they can be shifted back like this.
x <- "Ａb９８７６５４３２１０"
iconv(x, to = "utf8") |>
  utf8ToInt() |>
  (\(.) ifelse(. >= 0xFF01 & . <= 0xFF5E, . - 65248, .))() |>
  intToUtf8()
[1] "Ab9876543210"
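A possible alternative sketch in base R is chartr(), which maps the fullwidth ASCII block onto plain ASCII one character at a time (the fullwidth test string is rebuilt from \u escapes here so the example is unambiguous):

```r
fullwidth <- intToUtf8(0xFF01:0xFF5E)  # fullwidth "!" .. "~" as one string
halfwidth <- intToUtf8(0x21:0x7E)      # ASCII     "!" .. "~" as one string

# "Ａb９８７６５４３２１０" built from escapes
x <- "\uFF21b\uFF19\uFF18\uFF17\uFF16\uFF15\uFF14\uFF13\uFF12\uFF11\uFF10"
chartr(fullwidth, halfwidth, x)
# [1] "Ab9876543210"
```

chartr() requires old and new to have the same number of characters (94 each here), and leaves any character outside the mapping, such as the halfwidth "b", untouched.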