Remove accented characters, don't replace them

Remove accented characters, don't replace them - r

there's a lot of people asking how to remove accents from data, but I'm looking for how to remove the entire character. They're retained using [[:alnum:]], and [[A-Za-z]]. What would I have to do to get rid of them?
Thanks

You can do in base R:
gsub("[^A-Za-z ]", "", "à la volée d'où est-il")
[1] " la vole do estil"
Here exclude everything that is not letters and spaces. Have a look if you want to keep punctuation with [:punct:]

Without using regex, you could define a set of letters you would accept, which is available in base R as the letters and LETTERS vectors.
characters_to_keep <- c(letters, " ")
accentstring <- "éqodio diq ozàoih"
result <- unlist(strsplit(accentstring,""))
result <- result[result %in% characters_to_keep]
result <- paste0(result, collapse="")
> result
[1] "qodio diq ozoih"

Related

Get substring before the second capital letter

Is there an R function to get only the part of a string before the 2nd capital character appears?
For example:
Example <- "MonkeysDogsCats"
Expected output should be:
"Monkeys"

Maybe something like
stringr::str_extract("MonkeysDogsCats", "[A-Z][a-z]*")
#[1] "Monkeys"

Here is an alternative approach:
Here we first put a space before all uppercase and then extract the first word:
library(stringr)
word(gsub("([a-z])([A-Z])","\\1 \\2", Example), 1)
[1] "Monkeys"

A base solution with sub():
x <- "MonkeysDogsCats"
sub("(?<=[a-z])[A-Z].*", "", x, perl = TRUE)
# [1] "Monkeys"
Another way using stringr::word():
stringr::word(x, 1, sep = "(?=[A-Z])\\B")
# [1] "Monkeys"

If the goal is strictly to capture any string before the 2nd capital character, one might want pick a solution it'll also work with all types of strings including numbers and special characters.
strings <- c("MonkeysDogsCats",
"M4DogsCats",
"M?DogsCats")
stringr::str_remove(strings, "(?<=.)[A-Z].*")
Output:
[1] "Monkeys" "M4" "M?"

It depends on what you want to allow to match. You can for example match an uppercase char [A-Z] optionally followed by any character that is not an uppercase character [^A-Z]*
If you don't want to allow whitespace chars, you can exclude them [^A-Z\\s]*
library(stringr)
str_extract("MonkeysDogsCats", "[A-Z][^A-Z]*")
Output
[1] "Monkeys"
R demo
If there should be an uppercase character following, and there are only lowercase characters allowed:
str <- "MonkeysDogsCats"
regmatches(str, regexpr("[A-Z][a-z]*(?=[A-Z])", str, perl = TRUE))
Output
[1] "Monkeys"
R demo

R: How can I Split and Trim Data Successfully

I've successfully split the data and removed the "," with the following code:
s = MSA_data$area_title
str_split(s, pattern = ",")
Result
[1] "Albany" " GA"
I need to trim this data, removing white space, however this places the comma back into the data which was initially removed.
"Albany, GA"
How can I successfully split and trim the data so that the result is:
[1] "Albany" "GA"
Thank you

An alternative is to use trimws function to trim the whitespace at the beginning and end of the string.
Result <- trimws(Result)

We just need to use zero or more spaces (\\s*) (the question OP asked) and this can be done in a single step
strsplit(MSA_data$area_title, pattern = ",\\s*")
If we are using the stringr, then make use of the str_trim
library(stringr)
str_trim(str_split("Albany, GA", ",")[[1]])
#[1] "Albany" "GA"

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?

You could bundle the \s and the underscore only in a character class and use quantifier to repeat that 1+ times.
^[\s_]+
Regex demo
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or as #Tim Biegeleisen points out in the comment, if only the first occurrence is being replaced you could use sub instead:
txt <- sub("[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo

The stringr packages offers some task specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (\s and \t) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see Thefourthbird`s answer).

You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

Move location of special character

I have an entire vector of strings with the only special symbol in them being "-"
To be clear a sample string is like 23 C-Exam
I'd like to change it 23-C Exam
I essentially want R to find the location of "-" and move it 2 spaces back.
I feel this is a really simple task although I cant figure out how.
Assume that whenever R finds "-" , two spaces back is whitespace just like the example above.

regex attempt:
x <- c("23 C-Exam","45 D-Exam")
#[1] "23 C-Exam" "45 D-Exam"
sub(".(.)-", "-\\1 ", x)
#[1] "23-C Exam" "45-D Exam"
Find a character ., before a character (.), followed by a literal dash -.
Replace with a literal dash -, the saved character from above \\1, and overwrite the dash with a space

There is probably a sleek way of doing this with regular expressions, but one approach is to simply splice together the various pieces of the desired output. First, I find the index in the string containing the -, and then I use substr() to piece together the output.
pos <- regexpr("-", "23 C-Exam")
x <- "23 C-Exam"
x <- paste0(substr(x, 1, pos-3),
"-",
substr(x, pos-1, pos-1),
" ",
substr(x, pos+1, nchar(x)))
> x
[1] "23-C Exam"

We can also use chartr
chartr(" -", "- ", x)
#[1] "23-C Exam" "45-D Exam"
data
x <- c("23 C-Exam","45 D-Exam")

How to Convert "space" into "%20" with R

Referring the title, I'm figuring how to convert space between words to be %20 .
For example,
> y <- "I Love You"
How to make y = I%20Love%20You
> y
[1] "I%20Love%20You"
Thanks a lot.

Another option would be URLencode():
y <- "I love you"
URLencode(y)
[1] "I%20love%20you"

gsub() is one option:
R> gsub(pattern = " ", replacement = "%20", x = y)
[1] "I%20Love%20You"

The function curlEscape() from the package RCurl gets the job done.
library('RCurl')
y <- "I love you"
curlEscape(urls=y)
[1] "I%20love%20you"

I like URLencode() but be aware that sometimes it does not work as expected if your url already contains a %20 together with a real space, in which case not even the repeated option of URLencode() is doing what you want.
In my case, I needed to run both URLencode() and gsub consecutively to get exactly what I needed, like so:
a = "already%20encoded%space/a real space.csv"
URLencode(a)
#returns: "encoded%20space/real space.csv"
#note the spaces that are not transformed
URLencode(a, repeated=TRUE)
#returns: "encoded%2520space/real%20space.csv"
#note the %2520 in the first part
gsub(" ", "%20", URLencode(a))
#returns: "encoded%20space/real%20space.csv"
In this particular example, gsub() alone would have been enough, but URLencode() is of course doing more than just replacing spaces.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Remove accented characters, don't replace them - r

there's a lot of people asking how to remove accents from data, but I'm looking for how to remove the entire character. They're retained using [[:alnum:]], and [[A-Za-z]]. What would I have to do to get rid of them? Thanks

You can do in base R: gsub("[^A-Za-z ]", "", "à la volée d'où est-il") [1] " la vole do estil" Here exclude everything that is not letters and spaces. Have a look if you want to keep punctuation with [:punct:]

Related

Get substring before the second capital letter

R: How can I Split and Trim Data Successfully

How to throw out spaces and underscores only from the beginning of the string?

Move location of special character

How to Convert "space" into "%20" with R

Categories

Resources