Replace special character in data frame


I have a data frame in which several cells contain a known special string. An example of the structure:
df = data.frame(col_1 = c("21 myspec^ch2 12",NA),
col_2 = c("1 myspec^ch2 4","4 myspec^ch2 212"))
The string is myspec^ch2, and I would like to replace it, together with the surrounding spaces, with -. An example of the expected output:
df = data.frame(col_1 = c("21-12",NA),
col_2 = c("1-4","4-212"))
I tried this but it is not working:
df [ df == " myspec^ch2 " ] <- "-"

To make gsub work on the whole data frame, use apply (note that apply returns a character matrix rather than a data frame):
apply(df, 2, function(x) gsub(" myspec\\^ch2 ", "-", x))
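If you need the result to stay a data frame, one possible variation is to loop over the columns with lapply and assign back into df[]. This is only a sketch using base R; it also uses fixed = TRUE so the caret needs no escaping.

```r
df <- data.frame(
  col_1 = c("21 myspec^ch2 12", NA),
  col_2 = c("1 myspec^ch2 4", "4 myspec^ch2 212")
)

# apply() coerces the data frame to a character matrix first
m <- apply(df, 2, function(x) gsub(" myspec\\^ch2 ", "-", x))

# Assigning into df[] keeps the data frame structure;
# fixed = TRUE treats the pattern as a literal string, so ^ is not special
df[] <- lapply(df, function(x) gsub(" myspec^ch2 ", "-", x, fixed = TRUE))
df
```

NA values are left as NA in both versions.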

You really want a regex-style substitution here. However, in a regex, ^ anchors the match to the beginning of the line (rather than matching a literal caret). So you can do something like this (using the stringr package):
library(dplyr)
library(stringr)
# funs() is deprecated in current dplyr; a ~ lambda does the same job
fixed_df <- df %>%
  mutate_all(~ str_replace_all(., " myspec\\^ch2 ", "-"))
Note the double backslash in front of the caret: it escapes the caret and tells the regex engine to interpret it literally, rather than as the beginning of the line.
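To see the effect of the escape in isolation, here is a small base R check (a sketch, independent of stringr) comparing the unescaped pattern, the escaped pattern, and a literal match with fixed = TRUE:

```r
x <- "21 myspec^ch2 12"

# Unescaped: ^ is an anchor, so this asks whether x starts with "ch2"
unescaped <- grepl("^ch2", x)

# Escaped: the R string "\\^" is the two characters \^, a literal caret in regex
escaped <- grepl("\\^ch2", x)

# fixed = TRUE skips regex interpretation entirely
literal <- grepl("^ch2", x, fixed = TRUE)

c(unescaped = unescaped, escaped = escaped, literal = literal)
```

Only the unescaped version fails to match.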

Related

using tidyr separate function to split by \ backslash

I would like to split text in a column by '\' using the separate function in tidyr. Given this example data...
library(tidyr)
df1 <- structure(list(Parent.objectId = 1:2, Attachment.path = c("photos_attachments\\photos_image-20220602-192146.jpg",
"photos_attachments\\photos_image-20220602-191635.jpg")), row.names = 1:2, class = "data.frame")
And I've tried multiple variations of this...
df2 <- df1 %>%
separate(Attachment.path,c("a","b","c"),sep="\\",remove=FALSE,extra="drop",fill="right")
Which doesn't result in an error, but it doesn't split the string into two columns, likely because I'm not using the correct regular expression for the single backslash.

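We need to escape the backslash. A literal backslash in a regex is written \\, and each of those backslashes must itself be escaped inside an R string, so the pattern becomes "\\\\":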
library(tidyr)
separate(df1, Attachment.path,c("a","b","c"),
sep= "\\\\", remove=FALSE, extra="drop", fill="right")
According to ?separate
sep - ... The default value is a regular expression that matches any sequence of non-alphanumeric values.
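The two levels of escaping can be checked with base R's strsplit, without tidyr. This is a minimal sketch: the regex version needs four backslashes in the R source, while fixed = TRUE needs only the two that make up one literal backslash.

```r
path <- "photos_attachments\\photos_image-20220602-192146.jpg"

# "\\" in R source is a single backslash character
n_chars <- nchar("\\")

# Regex: the pattern must be \\ (an escaped backslash), written "\\\\" in R
parts_regex <- strsplit(path, "\\\\")[[1]]

# Literal: a single backslash, written "\\" in R
parts_fixed <- strsplit(path, "\\", fixed = TRUE)[[1]]

parts_regex
```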

Instead of splitting on \, assuming you are trying to get the folder and file names, try these two functions (the output below assumes Windows, where \ is treated as a path separator):
#get filenames
basename(df1$Attachment.path)
# [1] "photos_image-20220602-192146.jpg" "photos_image-20220602-191635.jpg"
#get foldernames
basename(dirname(df1$Attachment.path))
# [1] "photos_attachments" "photos_attachments"
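basename() and dirname() accept \ as a separator only on Windows. If the code may also run elsewhere, one option is to convert the backslashes to forward slashes first. A sketch under that assumption:

```r
paths <- c(
  "photos_attachments\\photos_image-20220602-192146.jpg",
  "photos_attachments\\photos_image-20220602-191635.jpg"
)

# Replace every literal backslash with a forward slash
normalized <- gsub("\\", "/", paths, fixed = TRUE)

files   <- basename(normalized)           # file names
folders <- basename(dirname(normalized))  # immediate parent folder
```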

replacing the same pattern in a string with new value each time

I have one string containing the pattern xy 25 times and a vector of length 25 whose elements should replace those 25 occurrences, in order.
This is not for programming or anything complicated; I just want a result that I can copy and paste into a forum that uses this BBCode inside the string to draw a colorful line.
string <- "[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]"
colrs <- c( "08070D", "100F1A", "191627", "211E34", "292541", "312D4E", "39345B", "413C68", "4A4375", "524A82", "5A528E", "62599B", "6C64A6", "7971AE", "827BB3", "8C85B9", "958FBF", "9F99C4", "A9A3CA", "B2AED0", "BCB8D6", "C5C2DC", "CFCCE2", "D9D6E8", "E2E0ED")
and want this as a result
[COLOR="#08070D"]_[COLOR="#100F1A"]_[COLOR="#191627"]_[COLOR="#211E34"]_[COLOR="#292541"]_[COLOR="#312D4E"]_[COLOR="#39345B"]_[COLOR="#413C68"]_[COLOR="#4A4375"]_[COLOR="#524A82"]_[COLOR="#5A528E"]_[COLOR="#62599B"]_[COLOR="#6C64A6"]_[COLOR="#7971AE"]_[COLOR="#827BB3"]_[COLOR="#8C85B9"]_[COLOR="#958FBF"]_[COLOR="#9F99C4"]_[COLOR="#A9A3CA"]_[COLOR="#B2AED0"]_[COLOR="#BCB8D6"]_[COLOR="#C5C2DC"]_[COLOR="#CFCCE2"]_[COLOR="#D9D6E8"]_[COLOR="#E2E0ED"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]

I have completely revised my answer given that you removed the previous iteration of your code example. Here's the revised solution:
string <- '[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]'
colrs <- c("08070D", "100F1A", "191627", "211E34", "292541", "312D4E", "39345B", "413C68", "4A4375", "524A82", "5A528E", "62599B", "6C64A6", "7971AE", "827BB3", "8C85B9", "958FBF", "9F99C4", "A9A3CA", "B2AED0", "BCB8D6", "C5C2DC", "CFCCE2", "D9D6E8", "E2E0ED")
library(stringr)
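# Split on the placeholder: 25 occurrences of xy give 26 pieces
string0 <- string |>
  str_split("xy") |>
  unlist()
# Append one colour to each of the first 25 pieces, glue them together, then add the last piece
string0[seq_along(colrs)] |>
  str_c(colrs, collapse = "") |>
  str_c(string0[length(colrs) + 1])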
[1] "[COLOR=\"#08070D\"]_[COLOR=\"#100F1A\"]_[COLOR=\"#191627\"]_[COLOR=\"#211E34\"]_[COLOR=\"#292541\"]_[COLOR=\"#312D4E\"]_[COLOR=\"#39345B\"]_[COLOR=\"#413C68\"]_[COLOR=\"#4A4375\"]_[COLOR=\"#524A82\"]_[COLOR=\"#5A528E\"]_[COLOR=\"#62599B\"]_[COLOR=\"#6C64A6\"]_[COLOR=\"#7971AE\"]_[COLOR=\"#827BB3\"]_[COLOR=\"#8C85B9\"]_[COLOR=\"#958FBF\"]_[COLOR=\"#9F99C4\"]_[COLOR=\"#A9A3CA\"]_[COLOR=\"#B2AED0\"]_[COLOR=\"#BCB8D6\"]_[COLOR=\"#C5C2DC\"]_[COLOR=\"#CFCCE2\"]_[COLOR=\"#D9D6E8\"]_[COLOR=\"#E2E0ED\"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]"

EDIT #2:
An easy solution to the new data problem is this:
library(stringr)
string0 <- unlist(str_split(gsub('"', "", string), "__?"))
str_c(str_replace(string0,'xy', colrs), collapse = "_")
[1] "[COLOR=#08070D]_[COLOR=#100F1A]_[COLOR=#191627]_[COLOR=#211E34]_[COLOR=#292541]_[COLOR=#312D4E]_[COLOR=#39345B]_[COLOR=#413C68]_[COLOR=#4A4375]_[COLOR=#524A82]_[COLOR=#5A528E]_[COLOR=#62599B]_[COLOR=#6C64A6]_[COLOR=#7971AE]_[COLOR=#827BB3]_[COLOR=#8C85B9]_[COLOR=#958FBF]_[COLOR=#9F99C4]_[COLOR=#A9A3CA]_[COLOR=#B2AED0]_[COLOR=#BCB8D6]_[COLOR=#C5C2DC]_[COLOR=#CFCCE2]_[COLOR=#D9D6E8]_[COLOR=#E2E0ED]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]"
EDIT:
Given this data:
string <- "AxyBxyCxyDxyExy"
vector <- c(1,2,3,4,5)
and this desired result:
"A1B2C3D4E5"
you can do this:
library(stringr)
First, we extract the character that's before xy using str_extract_all:
string0 <- unlist(str_extract_all(string, ".(?=xy)"))
Next we do two things: a) we replace the lone character with itself (\\1) AND the vector value, and b) we collapse the separate strings into one large string using str_c:
str_c(str_replace(string0, "(.)$", str_c("\\1", vector)), collapse = "")
[1] "A1B2C3D4E5"
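For comparison, base R can also do this kind of "n-th match gets n-th value" replacement with regmatches<-, which assigns one replacement per match. This is an alternative sketch, not the approach used above, shown on the short example data:

```r
string <- "AxyBxyCxyDxyExy"
values <- c(1, 2, 3, 4, 5)

result <- string
# gregexpr() locates every "xy"; the replacement list holds one
# character vector per input string, one element per match
regmatches(result, gregexpr("xy", result, fixed = TRUE)) <- list(as.character(values))
result
# [1] "A1B2C3D4E5"
```

The same call should work on the long BBCode string with colrs as the replacement vector, as long as the number of matches equals the number of values.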

conditional str_replace based on matching regex within mutate?

For any entries of the column "district" that match regex("[:alpha:]{2}AL"), I would like to replace the "AL" with "01".
For example:
df <- tibble(district = c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01"))
I tried:
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
str_replace(district,"AL","01")))
and
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
paste(str_sub(district, start = 1, end = 2),"01",sep = ""))
but there is a vectorization problem.

Is this ok?
str_replace_all(string=df$district,
pattern="(\\w{2})AL",
replacement="\\101")
I replaced the regex with \\w, which matches a word character (letter, digit or underscore): https://www.regular-expressions.info/shorthand.html
\\1 refers to the first capture group, (\\w{2}), so the replacement keeps the first two characters and then appends 01.
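The same capture-group idea works in base R. The sketch below keeps the asker's [:alpha:] class and adds ^ and $ anchors (my assumption about the intent) so that only whole four-character codes ending in AL are changed. In base R, backreferences are single digits, so "\\101" means group 1 followed by the text 01.

```r
district <- c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01")

# Group 1 captures the two-letter state code; "AL" after it becomes "01"
fixed_district <- sub("^([[:alpha:]]{2})AL$", "\\101", district)
fixed_district
```

Entries such as AL02, where AL is the state code rather than the suffix, are left untouched.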

You can swap replace() for ifelse(), which chooses between the two vectors element by element:
ifelse( str_detect(df$district, regex("[:alpha:]{2}AL")),
str_replace(df$district,"AL","01"),df$district)
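The vectorization problem in the question seems to come from replace() itself: it expects as many replacement values as there are TRUE positions, but the attempts pass a full-length vector. If you prefer to keep replace(), a possible fix is to subset the values with the same index. A base R sketch:

```r
district <- c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01")

# Logical index of the entries to change
is_at_large <- grepl("^[[:alpha:]]{2}AL$", district)

# Compute replacements only for the selected entries, so lengths match
fixed_district <- replace(
  district,
  is_at_large,
  sub("AL$", "01", district[is_at_large])
)
fixed_district
```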

Replace multiple strings comprising of a different number of characters with one gsubfn()

The question "Replace multiple strings in one gsub() or chartr() statement in R?" explains how to replace multiple single-character strings in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess this is because 'doremi' contains multiple characters while the metacharacter . in gsubfn matches one character at a time. I have no idea what to replace it with; I must confess I sometimes find metacharacters a bit difficult to understand. So, is there a way to replace '-' and 'doremi' at once?

You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"

Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is the pattern "doremi|-", which tells gsubfn to search for either "doremi" or "-". | is the regex "or" operator.

Just a more generic version of @RLave's solution:
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"
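If gsubfn is not available, a plain loop over a named vector gives the same result for this example. This is a base R sketch, not part of the answers above. It uses fixed = TRUE, so names containing regex metacharacters need no escaping, but the replacements are applied one after another, so the order can matter when one replacement produces text that a later pattern matches.

```r
x <- "doremi g-k"
toreplace <- c("-" = "_", "doremi" = "")

result <- x
for (pattern in names(toreplace)) {
  # Replace every literal occurrence of this pattern
  result <- gsub(pattern, toreplace[[pattern]], result, fixed = TRUE)
}
result
# [1] " g_k"
```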

gsub not working on colnames?

I have a dataframe called df with column names in the following format:
"A Agarwal" "A Agrawal" "A Balachandran"
"A.Brush" "A.Casavant" "A.Chakrabarti"
They consist of a first initial and a last name. However, some are separated with a space, while others use a period. I need to replace the space with a period. (The first column is called author.ID, and I excluded it from the following code.)
I have tried the following code, but the resulting column names still do not change.
colnames(df[, -1]) = gsub("\\s", "\\.", colnames(df[, -1]))
colnames(df[, -1]) = gsub(" ", ".", colnames(df[, -1]))
What am I doing wrong?
Thanks.

Note that df[, -1] gets you all rows and all columns except the first. To modify the column names, assign to colnames(df) directly.
To replace the first literal space with a dot, use
colnames(df) <- sub(" ", ".", colnames(df), fixed=TRUE)
If there can be more than one whitespace, use a regex:
colnames(df) <- sub("\\s+", ".", colnames(df))
If you need to replace every whitespace sequence in the column names with a single dot, use gsub:
colnames(df) <- gsub("\\s+", ".", colnames(df))
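If you do want to leave the first column name out of the substitution, as in the question, you can index the names vector rather than the data frame. A small sketch with made-up data (the values are placeholders; only the column names matter):

```r
# Hypothetical data; the values are placeholders
df <- data.frame(1:2, 3:4, 5:6, 7:8)
colnames(df) <- c("author.ID", "A Agarwal", "A.Brush", "A Balachandran")

# Subset the vector of names, not the data frame, on both sides
colnames(df)[-1] <- gsub(" ", ".", colnames(df)[-1], fixed = TRUE)
colnames(df)
```

Here colnames(df)[-1] selects every name except the first, so the assignment modifies df itself.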
