R remove special character and repeating underscores

I have a dataset that contains spaces and other punctuation characters. I'm trying to replace the spaces and special characters with "_". This creates spots with multiple "_" strung together, so I'd like to remove these too, using the following function:
removeSpace <- function(x){
class1 <- class(x)
x <- as.character(x)
x <- gsub(" |&|-|/|'|(|)",'_', x) # convert special characters to _
x <- gsub("([_])\\1+","\\1", x) # convert multiple _ to single _
if(class1 == 'character'){
return(x)
}
if(class1 == 'factor'){
return(as.factor(x))
}
}
The issue is that instead of replacing only the spaces and special characters with "_", it puts "_" between every character (i.e. "test" -> "t_e_s_t").
What am I doing wrong?

You don't need to run two separate replacements to accomplish this. Just put a + quantifier in your match pattern.
Match: [-/&'() ]+
Replace with: _
Also note that I used a character set instead of switching between each option with |. This is generally a better approach when matching one of multiple individual characters.
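As a minimal sketch of how the character-set approach could be dropped into the question's function: the unescaped ( and ) in the original pattern form a group, (|), that can match the empty string, which is likely why underscores show up between every character. Inside a character set, ( and ) are literal. The factor handling mirrors the question; the sample input is made up for illustration.

```r
# Sketch of removeSpace() using the character set from the answer above.
removeSpace <- function(x) {
  was_factor <- is.factor(x)
  x <- as.character(x)
  # One or more of: hyphen, slash, ampersand, apostrophe, parentheses, space
  x <- gsub("[-/&'() ]+", "_", x)
  if (was_factor) as.factor(x) else x
}

removeSpace("a & b - (c)/d")
# [1] "a_b_c_d"
removeSpace("test")
# [1] "test"
```

If the input can already contain underscores next to special characters, adding _ to the set (as in "[-/&'() _]+") would collapse those runs as well.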

Related

R Is there a way to remove special character from beginning of a string

A field in my dataset has some observations that start with a ".", e.g. ".TN34AB1336" instead of "TN34AB1336".
I've tried
truck_log <- truck_log %>%
filter(BookingID != "WDSBKTP49392") %>%
filter(BookingID != "WDSBKTP44502") %>%
mutate(GpsProvider = str_replace(GpsProvider, "NULL", "UnknownGPS")) %>%
mutate(vehicle_no = str_replace(vehicle_no, ".TN34AB1336", "TN34AB1336"))
The last mutate command in my code worked, but there are more such issues in another field, e.g. "###TN34AB1336" instead of "TN34AB1336".
So I need a way to automate the process, so that every observation that doesn't start with an alphanumeric character is trimmed from the left by a single command in R.

We can use a regular expression to remove everything that is not a letter or digit from the beginning of a string. Anchoring a negated character class at the start leaves values that already begin with an alphanumeric character untouched:
names <- c("###TN34AB1336",
           ".TN34AB1336",
           ",TN654835A34",
           ":+?%TN735345")
# Matches one or more leading non-alphanumeric characters and replaces them with ""
stringr::str_replace(names, "^[^[:alnum:]]+", "")
[1] "TN34AB1336" "TN34AB1336" "TN654835A34" "TN735345"

You can use sub.
s <- c(".TN34AB1336", "###TN34AB1336")
sub("^[^A-Z]*", "", s)
#[1] "TN34AB1336" "TN34AB1336"
Here ^ is the start of the string, [^A-Z] matches anything that is not an uppercase letter A to Z, and * matches it zero or more times.
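To apply this to a whole column in one step, something along these lines may work. This is only a sketch: the truck_log data frame below is a hypothetical stand-in for the question's data, and [[:alnum:]] is used instead of A-Z so that values starting with a digit or lowercase letter are also kept intact.

```r
# Hypothetical stand-in for the question's truck_log data
truck_log <- data.frame(
  vehicle_no = c(".TN34AB1336", "###TN34AB1336", "TN34AB1336"),
  stringsAsFactors = FALSE
)

# Drop any leading run of characters that are not letters or digits
truck_log$vehicle_no <- sub("^[^[:alnum:]]+", "", truck_log$vehicle_no)
truck_log$vehicle_no
# [1] "TN34AB1336" "TN34AB1336" "TN34AB1336"
```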

easy way to extract uppercase in string in R

I am a beginner programmer in R.
I have "cCt/cGt" and I want to extract C and G and write them as C>G.
test ="cCt/cGt"
str_extract(test, "[A-Z]+$")

Try this:
gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test )
[1] "C>G"
Here, we capture the two occurrences of the upper case letters in capturing groups given in parentheses (...). This enables us to refer to them (and only to them but not the rest of the string!) in gsub's replacement clause using backreferences \\1 and \\2. In the replacement clause we also include the desired >.

You seem to look for a mutation in two concatenated strings, this function should solve your problem:
extract_mutation <- function(text){
splitted <- strsplit(text, split = "/")[[1]]
pos <- regexpr("[[:upper:]]", splitted)
uppercases <- regmatches(splitted, pos)
mutation <- paste0(uppercases, collapse = ">")
return(mutation)
}
If the two base exchanges are always at the same index, you could also return the position if you're interested:
position <- pos[1]
return(list(mutation, position))
instead of the return(mutation)
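Put together, the variant that returns both values might look like the sketch below. The list element names mutation and position are chosen here for illustration and are not part of the original answer.

```r
extract_mutation <- function(text) {
  # Split "cCt/cGt" into its two codons
  parts <- strsplit(text, split = "/", fixed = TRUE)[[1]]
  # Position of the first uppercase letter in each codon
  pos <- regexpr("[[:upper:]]", parts)
  uppercases <- regmatches(parts, pos)
  list(
    mutation = paste0(uppercases, collapse = ">"),
    position = as.integer(pos[1])
  )
}

res <- extract_mutation("cCt/cGt")
res$mutation
# [1] "C>G"
res$position
# [1] 2
```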

You can also capture the two uppercase characters, each optionally surrounded by lowercase characters, with a / in between.
library(stringr)
test <- "cCt/cGt"
res <- str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[2], res[3])
Output
[1] "C>G"
An exact match for the whole string could be:
^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$
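For several codon pairs at once, a vectorized base R sketch could anchor the pattern and let the uppercase letter sit at any position. The codons vector is made-up example data, and perl = TRUE is used here so that [a-z] and [A-Z] are treated as plain ASCII ranges.

```r
codons <- c("cCt/cGt", "Gaa/Aaa", "acT/acC")  # made-up example input

# Whole-string match: one uppercase letter per codon, lowercase elsewhere
sub("^[a-z]*([A-Z])[a-z]*/[a-z]*([A-Z])[a-z]*$", "\\1>\\2", codons, perl = TRUE)
# [1] "C>G" "G>A" "T>C"
```

Strings that do not fit the pattern are returned unchanged by sub, so it may be worth checking for those separately.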

Insert a character as an item delimiter into string in R

In R, I have a single string with multiple entries such as:
mydata <- c("(first data entry) (second data entry) (third data entry) ")
I want to insert the pipe symbol "|" between the entries as an item delimiter, ending up with the following list:
"(first data entry)|(second data entry)|(third data entry)"
Not all mydata rows contain the same number of entries. If mydata contains 0 or just 1 entry, then no "|" pipe symbol is required.
I've tried the following without success:
newdata <- paste(mydata, collapse = "|")
Thanks for your help!

You do not need a regex if the entries are always separated by exactly one space, i.e. a ) followed by one space and a (.
You can simply use
gsub(") (", ")|(", mydata, fixed=TRUE)
If your strings contain variable amount of spaces, tabs, etc., you can use
gsub("\\)\\s*\\(", ")|(", mydata)
gsub("\\)[[:space:]]*\\(", ")|(", mydata)
stringr::str_replace_all(mydata, "\\)\\s*\\(", ")|(")
Here, \)\s*\( pattern matches a ) (escaped because ) is a special regex metacharacter), then zero or more whitespaces, and then a (.
If there is always one or more whitespaces between parentheses, use \s+ instead of \s*.
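The sample string in the question ends with a space, which these replacements leave in place. A small sketch that trims it first, and shows that rows with zero or one entry pass through without a pipe, might look like this (the extra rows are invented for illustration):

```r
mydata <- c("(first data entry) (second data entry) (third data entry) ",
            "(only entry) ",
            "")

# Trim leading/trailing whitespace, then replace the gaps between entries
newdata <- gsub("\\)\\s+\\(", ")|(", trimws(mydata))
newdata
# [1] "(first data entry)|(second data entry)|(third data entry)"
# [2] "(only entry)"
# [3] ""
```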

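This will replace the spaces between entries with "|" in your string. Note that fixed = TRUE is needed so that the parentheses are treated literally; without it, ") (" is not a valid regular expression. If you need more complex rules, use a regex with gsub.
gsub(") (", ")|(", yourString, fixed = TRUE)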

I think you could use the following solution too:
gsub("(?<=\\))\\s+(?=\\()", "|", mydata, perl = TRUE)
[1] "(first data entry)|(second data entry)|(third data entry) "

Remove ,"" and replacing '-' to '.'

I am working with single cell data.
I am trying to match the cell barcodes I extracted with another dataset, but the structure of the barcodes is different.
Barcode I extracted: ,"SAMPLE_AAAGCAAAGATACACA-1_1" (weirdly, it saved with a comma at the front)
Barcode I want: SAMPLE_AAAGCAAAGATACACA.1_1
Which function should I use to remove the leading ," and the trailing ", and to replace the hyphen with a dot?

Is this what you want?
Data:
x <- ',"SAMPLE_AAAGCAAAGATACACA-1_1"'
Solution:
cat(gsub(',', '', gsub('(?<=[A-Z])-(?=\\d)', '\\.', x, perl = TRUE)))
"SAMPLE_AAAGCAAAGATACACA.1_1"
Here we use 'nested' gsub to first change the hyphen into the period and then to delete the comma.
If you need it without double quote marks:
cat(gsub(',"|"$', '', gsub('(?<=[A-Z])-(?=\\d)', '\\.', x, perl = TRUE)))
SAMPLE_AAAGCAAAGATACACA.1_1

The following are some alternatives.
1) chartr/trimws. Assume the test data v below. We replace each dash with a dot using chartr, and then strip all commas and double quotes from both ends using trimws. If you have a very old version of R you will need to upgrade, since the whitespace= argument was added more recently. No packages are used.
Note that the double quotes shown in the output are not part of the strings but are just how R displays character vectors.
# test input
v <- c(',"SAMPLE_AAAGCAAAGATACACA-1_1"', ',"SAMPLE_AAAGCAAAGATACACA-1_1"')
trimws(chartr("-", ".", v), whitespace = '[,"]')
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
2) gsubfn. gsubfn, in the package of the same name, can map all minus characters to dot, and commas and double quotes to empty strings, in a single command. The second argument defines the mapping.
This substitutes all double quotes, commas and minus signs. If there are embedded double quotes and commas (i.e. not on the ends) that are not to be substituted, then use (1), which only trims commas and double quotes off the ends.
library(gsubfn)
gsubfn('.', list('"' = '', ',' = '', '-' = '.'), v)
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
3) read.table/chartr This also uses only base R. Read in the input using read.table separating fields on comma and keeping only the second field. This will also remove the double quotes. Then use chartr to replace minus signs with dot.
This assumes that the only double quotes are the ones surrounding the field and all minus signs are to be replaced by dot. Embedded commas will be handled properly.
chartr("-", ".", read.table(text = v, sep = ",")[[2]])
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
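Since the goal in the question is to match the cleaned barcodes against another dataset, a sketch of that last step using approach (1) could look as follows. Both vectors are hypothetical: extracted stands for the barcodes as they were saved, and reference for the barcodes in the other dataset.

```r
# Hypothetical inputs for illustration
extracted <- c(',"SAMPLE_AAAGCAAAGATACACA-1_1"', ',"SAMPLE_TTTGCAAAGATACACA-1_1"')
reference <- c("SAMPLE_AAAGCAAAGATACACA.1_1", "SAMPLE_CCCGCAAAGATACACA.1_1")

# Clean as in (1): dash to dot, then trim commas and double quotes from the ends
cleaned <- trimws(chartr("-", ".", extracted), whitespace = '[,"]')

# Which cleaned barcodes are present in the reference set?
cleaned %in% reference
# [1]  TRUE FALSE
```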

How to extract bracket from string into new columns

I need to export information from a string into different columns.
More specifically the content of the brackets within the string;
Let's say I have a string
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
What I am trying to output is a vector with the contents of the brackets. If a bracket contains a semicolon, the terms should be saved as separate bracketed strings, and the parentheses and their contents should be removed.
e.g.
tmp <- function(a)
Result
tmp
"[K89]" , "[K96]", "[N-Term]", "[S87]", "[S93]"
My approach so far:
pattern <- "(\\[.*?\\])"
hits <- gregexpr(pattern, a)
matches <- regmatches(a, hits)
unlisted_matches <- unlist(matches)
Results
"[K89; K96]" "[N-Term]" "[S87(100); S93(100)]"
This does give me the brackets but still doesn't split the terms, and for some reason I am not able to efficiently separate the terms at the ";".

Here's a way using the tidyverse :
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
library(tidyverse)
a %>%
# extract the text between square brackets, not keeping the brackets, and unlist
str_extract_all("(?<=\\[).*?(?=\\])") %>%
unlist() %>%
# remove round brackets and content
str_replace_all("\\(.*?\\)", "") %>%
# split by ";" and unlist
str_split("; ") %>%
unlist() %>%
# put the brackets back
str_c("[",.,"]")
#> [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"

You may use
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
pattern <- "(?:\\G(?!^)(?:\\([^()]*\\))?\\s*;\\s*|\\[)\\K[^][;()]+"
matches <- regmatches(a, gregexpr(pattern, a, perl=TRUE))
unlisted_matches <- paste0("[", unlist(matches),"]")
unlisted_matches
## => [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
Pattern details
(?:\G(?!^)(?:\([^()]*\))?\s*;\s*|\[) - either the end of the previous successful match (\G(?!^)) followed with any substring inside round parentheses (optional, see (?:\([^()]*\))?) and then a ; enclosed with optional 0+ whitespaces (see \s*;\s*) or a [ char
\K - match reset operator discarding all text matched so far
[^][;()]+ - one or more chars other than [, ], ;, ( and ).
The paste0("[", unlist(matches),"]") part wraps the matches with square brackets.
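If the single-pattern approach feels hard to maintain, the same result can probably be reached in base R with a few smaller steps. This is a sketch of an alternative, not the method from the answer above; the intermediate names inside and terms are chosen here for illustration.

```r
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"

# 1. Grab the text between each pair of square brackets
inside <- regmatches(a, gregexpr("(?<=\\[)[^\\]]*(?=\\])", a, perl = TRUE))[[1]]
# 2. Remove round brackets together with their contents
inside <- gsub("\\([^()]*\\)", "", inside)
# 3. Split on ";" and trim the surrounding whitespace
terms <- trimws(unlist(strsplit(inside, ";", fixed = TRUE)))
# 4. Put the square brackets back
result <- paste0("[", terms, "]")
result
# [1] "[K89]"    "[K96]"    "[N-Term]" "[S87]"    "[S93]"
```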
