Simplifying characters with ornaments in R [duplicate]

This question already has answers here:
Replace multiple letters with accents with gsub
(11 answers)
Closed 5 years ago.
I have the names of some music artists that I am working with via the Spotify API. I'm having issues dealing with some of the strings because of accented characters, and I don't have much understanding of character encoding.
I'll provide more context a bit further below, but essentially I am wondering if there is a way in R to "simplify" characters with ornaments.
Essentially, I am interested if there is a function which will take c("ë", "ö") as an input, and return c("e", "o"), removing the ornaments from the characters.
I don't think I can create a reproducible example because of the issues with API authentication, but for some context, when I try to run:
artistName <- "Tiësto"
GET(paste0("https://api.spotify.com/v1/search?q=",
artistName,
"&type=artist"),
config(token = token))
The following gets sent to the API:
https://api.spotify.com/v1/search?q=Tiësto&type=artist
This returns a 400 Bad Request error. I am trying to alter the strings I pass to the GET function so I can get some useful output.
Edit: I am not looking for a gsub type solution, as that relies on me anticipating the sorts of accented characters which might appear in my data. I'm interested whether there is a function already out there which does this sort of translation between different character encodings.
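As an aside on the 400 error itself: the request may be failing not because of the accent as such, but because the non-ASCII character is never percent-encoded in the URL. A minimal sketch with base R's URLencode, which keeps the artist name intact (the exact encoded bytes assume a UTF-8 locale):

```r
artistName <- "Tiësto"

# Percent-encode the query component so non-ASCII and reserved
# characters travel safely inside the URL
encoded <- URLencode(artistName, reserved = TRUE)
encoded
# [1] "Ti%C3%ABsto"

url <- paste0("https://api.spotify.com/v1/search?q=", encoded, "&type=artist")
```

This sidesteps the transliteration question entirely, which may matter when two artists differ only in their diacritics.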

Here is what I found, which may work for you; it is simple and convenient to apply to any form of data. (Note that the from-encoding passed to iconv should match how the string is actually stored — on most modern systems that is "UTF-8" rather than "latin1".)
> artistName <- "Tiësto"
> iconv(artistName, "latin1", "ASCII//TRANSLIT")
[1] "Tiesto"

Based on the answers to this question, you could do this:
artistName <- "Tiësto"
removeOrnaments <- function(string) {
  chartr(
    "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ",
    "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy",
    string
  )
}
removeOrnaments(artistName)
# [1] "Tiesto"
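If a ready-made, encoding-aware transliterator is acceptable, the stringi package (an assumption here — the asker never mentions it) wraps ICU's transliteration rules, so you don't have to enumerate accented characters yourself:

```r
library(stringi)

# ICU's "Latin-ASCII" rule strips diacritics from Latin-script characters
stri_trans_general(c("Tiësto", "Beyoncé", "Motörhead"), "Latin-ASCII")
# [1] "Tiesto"    "Beyonce"   "Motorhead"
```

Unlike a hand-built chartr table, this also covers ligatures and accented characters you didn't anticipate.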


Remove multiple instances with a regex expression, but not the text in between instances [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 1 year ago.
In long passages using bookdown, I have inserted numerous images. Having combined the passages into a single character string (in a data frame) I want to remove the markdown text associated with inserting images, but not any text in between those inserted images. Here is a toy example.
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"
str_remove_all(string = text.string, pattern = "!\\[.+\\)")
[1] "writing more writing"
The regex expression doesn't stop at the first closed parenthesis, it continues until the last one and deletes the "writing to keep" in between.
I tried to apply String manipulation in R: remove specific pattern in multiple places without removing text in between instances of the pattern, which uses gsubfn and gsub but was unable to get the solutions to work.
Please point me in the right direction to solve this problem of a regex removal of designated strings, but not the characters in between the strings. I would prefer a stringr solution, but whatever works. Thank you.
You have to use the following regex:
"!\\[[^\\)]+\\)"
Alternatively, you can also use this:
"!\\[.*?\\)"
Both solutions stop at the first closing parenthesis instead of the last one, which is the key to your question: the first uses a negated character class ([^\\)] matches anything except a closing parenthesis), while the second uses a lazy quantifier (.*? matches as little as possible) rather than a greedy one.
I think you could use the following solution too:
gsub("!\\[[^][]*\\]\\([^()]*\\)", "", text.string)
[1] "writing writing to keep more writing"
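Since the asker prefers stringr, the same negated-character-class idea carries over; str_squish then collapses the doubled spaces that removal leaves behind (a sketch using the toy example from the question):

```r
library(stringr)

text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"

# Match ![...](...) without crossing into the next image:
# [^\]]* stops at the first ], [^)]* stops at the first )
cleaned <- str_remove_all(text.string, "!\\[[^\\]]*\\]\\([^)]*\\)")
str_squish(cleaned)
# [1] "writing writing to keep more writing"
```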

Replace latex with r strings using gsub [duplicate]

This question already has an answer here:
"'\w' is an unrecognized escape" in grep
(1 answer)
Closed 1 year ago.
I would like to find and replace tabular instances with tabularx. I tried with gsub, but it seems to enter me into a world of escaping pain. Following other questions and answers I found fixed=TRUE, which is the best I have so far. The code snippet below almost works: \B is unrecognized, and if I escape it twice I get \BEGIN as output!
texText <- '\begin{tabular}{rl}\begin{tabular}{rll}'
texText <- gsub("\begin{tabular}{rl}", "\BEGIN{tabular}{rll}", texText, fixed=TRUE)
I'm using BEGIN as my test to see what is happening. This is before I get to tackling the question of what goes on in the brackets {rl} {ll} {rrl} etc. Ideally I'm looking for a regex that would output:
\begin{tabularx}{rX}\begin{tabularx}{rlX}
That is the final column is replaced by X.
Try using proper escaping:
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"
output <- gsub("\\\\begin\\{tabular\\}", "\\\\begin{tabularx}", texText)
output
[1] "\\begin{tabularx}{rl}\\begin{tabularx}{rll}"
In an R string literal, a literal backslash requires two backslashes ("\\" is one backslash), and in the regex pattern metacharacters such as { and } must be escaped as well — so matching a literal \begin{tabular} takes four backslashes in the pattern.
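For the follow-up goal of also replacing the final column letter with X (the desired output in the question), one hedged sketch with a capture group, assuming each column spec is a simple run of lowercase letters:

```r
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"

# Capture every column letter except the last one (\1), rename the
# environment to tabularx, and replace the final letter with X
output <- gsub("\\\\begin\\{tabular\\}\\{([a-z]*)[a-z]\\}",
               "\\\\begin{tabularx}{\\1X}",
               texText)
cat(output)
# \begin{tabularx}{rX}\begin{tabularx}{rlX}
```

Real-world column specs can contain widths like p{3cm}, which this simple pattern would not handle.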

How do I remove the extraneous printing of [1] in R? [duplicate]

This question already has answers here:
Output in R, Avoid Writing "[1]"
(4 answers)
R output without [1], how to nicely format?
(1 answer)
Closed 3 years ago.
Consider the following example:
library(digest)
hash <- digest("hello world", algo="md5", serialize=F)
hash
produces [1] "5eb63bbbe01eeed093cb22bb8f5acdc3"
For my purposes, I only want the raw string output with no embellishments or extras. The objective is to alter the script so it produces 5eb63bbbe01eeed093cb22bb8f5acdc3.
I've spent over an hour looking for any way to get rid of the [1] and the documentation has been absolutely terrible. Most of the search results are manipulation, clickbait, wrong, or scams.
Array indexing doesn't work:
hash[1]
produces [1] "5eb63bbbe01eeed093cb22bb8f5acdc3", because apparently an array is the first element of itself which makes no programmatic sense whatsoever.
typeof(hash)
produces [1] "character". Really?
substr(hash[1], 4, 1000)
produces [1] "63bbbe01eeed093cb22bb8f5acdc3".
How do I just make that [1] and preferably the quotes as well go away? There's absolutely no instructions searchable on the web as far as I know.
More generally, I'd like a function or procedure to convert anything to a string for manipulation and post-processing.
library(digest)
hash <- digest("hello world", algo="md5", serialize=F)
cat(hash)
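On the more general wish to "convert anything to a string for manipulation and post-processing": base R already covers this. A few hedged one-liners (the hash value is a stand-in for digest() output, so the example needs no packages):

```r
hash <- "5eb63bbbe01eeed093cb22bb8f5acdc3"  # stand-in for digest() output

cat(hash)          # prints the bare string: no [1] index, no quotes
writeLines(hash)   # same, but appends a trailing newline

# Converting other values to character for post-processing
as.character(42)       # "42"
toString(c(1, 2, 3))   # "1, 2, 3" — collapses a vector into one string
paste0("hash=", hash)  # string concatenation
```

The [1] and quotes are artifacts of R's print method for character vectors, not part of the value itself; cat and writeLines bypass printing and write the raw contents.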

Remove <U+00A0> from values in columns in R [duplicate]

This question already has answers here:
How to remove unicode <U+00A6> from string?
(4 answers)
Closed 4 years ago.
When I read my csv file using read.csv with the encoding parameter, I get some values with <U+00A0> in them.
application <- read.csv("application.csv", na.strings = c("N/A","","NA"), encoding = "UTF-8")
The dataset looks like
X Y
Met<U+00A0>Expectations Met<U+00A0>Expectations
Met<U+00A0>Expectations Met<U+00A0>Expectations
NA Met<U+00A0>Expectations
Met<U+00A0>Expectations Exceeded Expectations
Did<U+00A0>Not Meet Expectations Met<U+00A0>Expectations
Unacceptable Exceeded Expectations
How can I remove the <U+00A0> from these values? If I do not use the "encoding" parameter, when I show these values in the Shiny application, they are seen as:
Met<a0>Expectations and Did<a0>Not Meet Expectations
I have no clue on how to handle this.
PS: I have modified the original question with examples of the problem faced.
This problem bothered me for a long time. I searched all around the R communities, and no answer under the "r" tag worked in my situation; only when I expanded the search area did I find a working answer under the "java" tag.
Okay, for the data frame, the solution is:
application <- as.data.frame(lapply(application, function(x) {
  gsub("\u00A0", "", x)
}))
Two options:
application <- read.csv("application.csv", na.strings = c("N/A","","NA"), encoding = "ASCII")
or with {readr} (note that read_csv takes na rather than na.strings):
application <- read_csv("application.csv", na = c("N/A","","NA"), locale = locale(encoding = "ASCII"))
Converting UTF-8 to ASCII will remove the printed UTF-8 syntax, but the spaces will remain. Beware that if there are extra spaces at the beginning or end of a character string, you may get unwanted unique values. For example "Met Expectations<U+00A0>" converted to ASCII will read "Met Expectations ", which does not equal "Met Expectations".
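If the goal is a readable value rather than a glued-together one, a small sketch that swaps the no-break space (U+00A0) for a regular space and then trims the ends avoids both the "MetExpectations" gluing and the trailing-space edge case described above:

```r
x <- c("Met\u00A0Expectations", "Met Expectations\u00A0")

# Replace U+00A0 with a plain space, then trim leading/trailing whitespace
cleaned <- trimws(gsub("\u00A0", " ", x))
cleaned
# [1] "Met Expectations" "Met Expectations"

# Both values now compare equal to the plain-space version
unique(cleaned)
# [1] "Met Expectations"
```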
This isn't a great answer, but to get your csv back into UTF-8 you can open it in Google Sheets and then download it as a .csv. Then import with trim_ws = TRUE. This will solve the importing problems and won't create any weirdness.

Searching for an exact String in another String

I'm dealing with a very simple question and that is searching for a string inside of another string. Consider the example below:
bigStringList <- c("SO1.A", "SO12.A", "SO15.A")
strToSearch <- "SO1."
bigStringList[grepl(strToSearch, bigStringList)]
I'm looking for something that when I search for "SO1.", it only returns "SO1.A".
I saw many related questions on SO, but most of the answers use grepl(), which does not work in my case.
Thanks very much for your help in advance.
When searching for a simple string that doesn't include any metacharacters, you can set fixed=TRUE:
grep("SO1.", bigStringList, fixed=TRUE, value=TRUE)
# [1] "SO1.A"
Otherwise, as Frank notes, you'll need to escape the period (so that it'll be interpreted as an actual . rather than as a symbol meaning "any single character"):
grep("SO1\\.", bigStringList, value=TRUE)
# [1] "SO1.A"
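If the real intent is "entries that begin with SO1.", note that fixed matching still accepts a hit anywhere in the string (e.g. a hypothetical "XSO1.A" would match). Base R's startsWith anchors the comparison at the start with no regex involved (a sketch):

```r
# "XSO1.A" is a hypothetical extra entry to show the difference
bigStringList <- c("SO1.A", "SO12.A", "SO15.A", "XSO1.A")

# Literal prefix comparison: no metacharacters, no escaping
bigStringList[startsWith(bigStringList, "SO1.")]
# [1] "SO1.A"
```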
