find and replace strings with special characters? - r

I have a dataframe with a bunch of titles for classroom courses that include special characters. I'm trying to find and replace them, but it's not working.
Example:
Tank Walk Around – Round Portable Restroom Tanks
db$objectName[db$objectName == "Tank Walk Around – Round Portable Restroom Tanks"] <- "Tank Walk Around - Round Portable Restroom Tanks"
I also have other course titles with these special characters, which have been problematic as well:
` ’ “ „ ¢ € ®

Assuming you want to keep all alphanumeric characters, the following code can be used. It uses a regular expression to remove all non-alphanumeric characters.
str = "Tank Walk Around – Round Portable Restroom Tanks"
print(strsplit(gsub("[^[:alnum:] ]", "", str), " +")[[1]])
Result:
[1] "Tank" "Walk" "Around" "â" "Round" "Portable" "Restroom" "Tanks"
Source: R remove non-alphanumeric symbols from a string
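Note that the "â" in the result above is itself an encoding artifact of the en dash. If the goal is to replace specific characters rather than strip everything, here is a minimal sketch (not from the answer above) that writes the characters as Unicode escapes so the script itself stays ASCII-safe:
# Replace specific special characters with plain-text equivalents
db$objectName <- gsub("\u2013", "-", db$objectName)          # en dash
db$objectName <- gsub("[\u2018\u2019]", "'", db$objectName)  # curly apostrophes
db$objectName <- gsub("[\u201C\u201D]", '"', db$objectName)  # curly quotes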

Related

How to remove numbered newlines from string?

I'm cleaning some text data and I've come across a problem associated with removing newline text. For this data, there are not merely \n strings in the text, but \n\n strings, as well as numbered newlines such as: \n2 and \n\n2. The latter are my problem. How does one remove this using regex?
I'm working in R. Here is some sample text and what I've used, so far:
#string
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"
#code attempt
gsub("[\r\\n0-9]", '', string)
The problem with this regex is that it removes all digits, including the ones I want to keep, and also matches the literal letter n.
I would like to have the following output:
"There is a square in the apartment. Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten."
I'm using regexr for reference.
Writing the pattern as [\r\\n0-9] matches a carriage return, one of the characters \ or n, or a digit 0-9.
You could instead write a pattern matching one or more carriage returns or newlines, followed by optional digits:
[\r\n]+[0-9]*
Example:
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"
gsub("[\r\n]+[0-9]*", '', string)
Output
[1] "There is a square in the apartment. Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten."
See an R demo.

How to generate all possible unicode characters?

If we type letters in R we get all the lowercase letters of the English alphabet. However, there are many more possible characters, like ä, é and so on, and there are symbols like $ or (, too. I found this table of Unicode characters, which is exactly what I need. Of course I do not want to copy and paste hundreds of possible Unicode characters into one vector.
What I've tried so far: the table gives the decimals for (some of) the Unicode characters. For example, see the following small table:
Glyph Decimal Unicode Usage in R
! 33 U+0021 "\U0021"
So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width = 4, flag = "0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that the table does not give decimals for all Unicode characters (the decimals in the table end at 591).
Any idea how to generate a vector with all the Unicode characters listed in the linked table?
(The question started with a real world problem but now I am mostly simply eager to know how to do this.)
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of Unicode scripts and their block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the page needs updating. The characters are stored as integers, so to print them (assuming the glyphs are supported) we can use intToUtf8(); for example, printing Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

Remove whitespace before bracket " (" in R

I have nearly 100,000 rows of scraped data that I have converted to data frames. One column is a string of text characters but is behaving strangely. In the example below, there is text that has bracketed information that I want to remove, and I also want to remove " (c)". However, the space in front is not technically a space (is it considered whitespace?).
I am not sure how to reproduce the example here, because when I copy/paste a record it is treated like normal and works, but in the scraped data it does not. A gut check was to count spaces, which gave me 4, meaning the space in front of ( is not a true space. I do not know how to remove it!
The code I would usually run is as follows. Again, it works here, but does not work on my scraped data.
test<-c("Barry Windham (c) & Mike Rotundo (c)")
test<-gsub("[ ][(]c[)]","",test)
You can consider using:
test<-c("Barry Windham (c) & Mike Rotundo (c)")
gsub("(*UCP)\\s+\\(c\\)", "", test, perl=TRUE)
# => [1] "Barry Windham & Mike Rotundo"
See an online R demo
Details
(*UCP) - makes all shorthand character classes in the PCRE regex (it is PCRE due to perl=TRUE) Unicode-aware
\\s+ - one or more Unicode whitespace characters
\\(c\\) - a literal (c) substring.
If you need to keep (c), capture it and use a backreference in the replacement:
gsub("(*UCP)\\s+(\\(c\\))", "\\1", test, perl=TRUE)

R utf-8 and replace a word from a sentence based on ending character

I am working with a large dataset that contains double-byte characters (Korean text). I want to look for a character and replace it. In order to display the Korean text correctly in the browser I changed the locale settings in R, but I am not sure whether that also applies to the code. Below is my code to change the locale to Korean; the Korean text displays properly in the viewer, but the console prints junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table that contains a column with Korean text. Example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the first word, which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I tried gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to match the end of a word, I am using \\b, and since I want to replace any word ending with the "시" character I am writing it as \\b시. Is this not the way to do it? And how do I take care of the () at the end of the sentence?
What would be a good source to refer to for regular expressions?
Is a UTF-8 setting needed for the script as well? How do I do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
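Applied on its own, it might look like this (a sketch, not part of the original answer; (*UCP) is added because, as explained further below, PCRE needs it to treat Hangul as word characters):
# Remove the first word ending in 시
sub("(*UCP)\\s*\\b\\p{L}*시\\b", "", New, perl = TRUE)
# [1] " 동구 제봉로 49 (남동,(지하))"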
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
Now, if you need to combine both, you will see that R's PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with the (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See the PCRE man page for more details about the verb.
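On the UTF-8 question at the end: rather than relying on the locale, it is usually safer to declare the encoding when the data is read in. A minimal sketch, assuming the source is a UTF-8 encoded CSV (the file name here is hypothetical):
library(data.table)
# Declare the encoding at read time instead of depending on the locale
data <- fread("addresses.csv", encoding = "UTF-8")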

special characters in R

I'm working on being able to read transcripts of dialogue into R. However, I run into a bump with special characters like curly quotes, en and em dashes, etc. Typically I replace these special characters in a Microsoft product first with find-and-replace. Usually I replace special characters with plain text, but on some occasions I want to replace them with other characters (i.e. I replace “ ” with { }). This is tedious and not always thorough. If I could read the transcripts into R as is and then use Encoding to switch their encoding to a recognizable Unicode format, I could gsub them out and replace them with plain-text versions. However, the file is read in in some way I don't understand.
Here's an xlsx of what my data may look like:
http://dl.dropbox.com/u/61803503/test.xlsx
This is what is in the .xlsx file:
text num
“ ” curly quotes 1
en dash (–) and the em dash (—) 2
‘ ’ curly apostrophe-ugg 3
… ellipsis are uck in R 4
This can be read into R with:
URL <- "http://dl.dropbox.com/u/61803503/test.xlsx"
library(gdata)
z <- read.xls(URL, stringsAsFactors = FALSE)
The result is:
text num
1 “ †curly quotes 1
2 en dash (–) and the em dash (—) 2
3 ‘ ’ curly apostrophe-ugg 3
4 … ellipsis are uck in R 4
So I tried to use Encoding to convert to Unicode:
iconv(z[, 1], "latin1", "UTF-8")
This gives:
[1] "â\u0080\u009c â\u0080\u009d curly quotes" "en dash (â\u0080\u0093) and the em dash (â\u0080\u0094)"
[3] "â\u0080\u0098 â\u0080\u0099 curly apostrophe-ugg" "â\u0080¦ ellipsis are uck in R"
Which makes gsubing less useful.
What can I do to convert these special characters to distinguishable Unicode so I can gsub them out appropriately? To be more explicit, I was hoping to have z[1, 1] read:
\u201C \u201D curly quotes
To make my desired outcome even clearer: I will web-scrape the tables from a page like Wikipedia's http://en.wikipedia.org/wiki/Quotation_mark_glyphs and use the Unicode reference chart to replace characters appropriately. So I need the characters to be in Unicode or some other standard format that I can systematically go through to replace the characters. Maybe it already is and I'm missing it.
PS I don't save the files as .csv or plain text because the special characters are replaced with ?, hence the use of read.xls. I'm not attached to any particular method of reading in the file (i.e. read.xls) if you've got a better alternative.
Maybe this will help (I'll have access to a Windows machine tomorrow and can probably play with it more at that point if SO doesn't get you the answer first).
On my Linux system, when I do the following:
iconv(z$text, "", "cp1252")
I get:
[1] "\x93 \x94 curly quotes" "en dash (\x96) and the em dash (\x97)"
[3] "\x91 \x92 curly apostrophe-ugg" "\x85 ellipsis are uck in R"
This is not UTF, but (I believe) ISO hex entities. Still, if you are able to get to this point also, then you should be able to use gsub the way you intend to.
See this page (reserved section in particular) for conversions.
Update
You can also try converting to an encoding that doesn't have those characters, like ASCII, and set sub to "byte". On my machine, that gives me:
iconv(z$text, "", "ASCII", "byte")
# [1] "<e2><80><9c> <e2><80><9d> curly quotes"
# [2] "en dash (<e2><80><93>) and the em dash (<e2><80><94>)"
# [3] "<e2><80><98> <e2><80><99> curly apostrophe-ugg"
# [4] "<e2><80><a6> ellipsis are uck in R"
It's ugly, but UTF-8 (e2, 80, 9c) is a left curly quote (each character, I believe, is a set of three values in angled brackets). You can find conversions at this site, where you can search by punctuation mark name.
Try
> iconv(z, "UTF-8", "UTF-8")
[1] "c(\"“—” curly quotes\", \"en dash (–) and the em dash (—)\", \"‘—’ curly apostrophe-ugg\", \"… ellipsis are uck in R\")"
[2] "c(1, 2, 3, 4)"
Windows is very problematic with encodings. Maybe you can look at http://www.vmware.com/products/player/ and run Linux.
This works on my Windows box. The initial input was as you had it. You may have a different experience.
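Once the text is in a known encoding, the replacements described in the question can be written with Unicode escapes so the script itself stays ASCII-safe; a minimal sketch, not from the original answers:
# Replace the special characters with plain-text (or chosen) equivalents
z$text <- gsub("\u201C", "{", z$text)          # left curly quote  -> {
z$text <- gsub("\u201D", "}", z$text)          # right curly quote -> }
z$text <- gsub("[\u2018\u2019]", "'", z$text)  # curly apostrophes -> '
z$text <- gsub("[\u2013\u2014]", "-", z$text)  # en/em dashes      -> -
z$text <- gsub("\u2026", "...", z$text)        # ellipsis          -> ...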
