How do I remove these characters from my vector of strings - r

I need a solution to how I can clean my vector of strings which has characters and symbols,
for example
[1]c("hiv3=0", "comdiab=0", "ppl=0")
[2]c("fxet3=1", "hiv3=0", "ppl=0")
[3]c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1")
[4]c("escol4=0", "alcool=0", "ppl=0", "tipores3=1")
The intended string will produce
[1]"hiv3=0, comdiab=0, ppl=0"
[2]"fxet3=1, hiv3=0, ppl=0"
[3]"fxet3=1, escol4=0, alcool=0, tipores3=1"
[4]"escol4=0, alcool=0, ppl=0, tipores3=1"
Any solution is acceptable, though I have tried using the gsub function
Regex solution would be very much acceptable also

Based on the post, it seems to be a list of vectors. We can use paste with collapse to turn each vector in the list into a single string
sapply(lst1, paste, collapse=", ")
#[1] "hiv3=0, comdiab=0, ppl=0"
#[2] "fxet3=1, hiv3=0, ppl=0"
#[3] "fxet3=1, escol4=0, alcool=0, tipores3=1"
#[4] "escol4=0, alcool=0, ppl=0, tipores3=1"
or, equivalently, use toString, which collapses with ", " by default
sapply(lst1, toString)
data
lst1 <- list(c("hiv3=0", "comdiab=0", "ppl=0"), c("fxet3=1", "hiv3=0",
"ppl=0"), c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))
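As a quick check of the two calls above, here is a minimal sketch showing that paste with collapse = ", " and toString produce the same strings. It uses vapply instead of sapply only so that the result type is fixed, and the names res_paste and res_tostring are made up for illustration.

```r
# Two of the vectors from the question, stored as a list
lst1 <- list(c("hiv3=0", "comdiab=0", "ppl=0"),
             c("fxet3=1", "hiv3=0", "ppl=0"))

# Collapse each vector into one comma-separated string
res_paste <- vapply(lst1, paste, character(1), collapse = ", ")

# toString() uses ", " as its separator by default
res_tostring <- vapply(lst1, toString, character(1))

print(res_paste)
print(identical(res_paste, res_tostring))
```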

tidyverse answer
library(tidyverse)
my_strings <- list(c("hiv3=0", "comdiab=0", "ppl=0"),
c("fxet3=1", "hiv3=0", "ppl=0"),
c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))
map_chr(.x = my_strings, .f = str_c, collapse = " ")
# [1] "hiv3=0 comdiab=0 ppl=0"
# [2] "fxet3=1 hiv3=0 ppl=0"
# [3] "fxet3=1 escol4=0 alcool=0 tipores3=1"
# [4] "escol4=0 alcool=0 ppl=0 tipores3=1"

Related

string split and interchange the position of string in R

I have a vector called myvec. I would like to split it at _ and interchange the position. What would be the simplest way to do this?
myvec <- c("08AD09144_NACC022453", "08AD8245_NACC657970")
Result I want:
NACC022453_08AD09144, NACC657970_08AD8245
You can do this with regex capturing data in two groups and interchanging them using back reference.
myvec <- c("A1_B1", "B2_C1", "D1_A2")
sub('(\\w+)_(\\w+)', '\\2_\\1', myvec)
#[1] "B1_A1" "C1_B2" "A2_D1"
We can use strsplit from base R
sapply(strsplit(myvec, "_"), function(x) paste(x[2], x[1], sep = "_"))
#[1] "NACC022453_08AD09144" "NACC657970_08AD8245"
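For completeness, here is the back-reference approach applied to the vector from the question. This is only a sketch: it assumes every element contains exactly one "_", and it uses [^_]+ instead of \\w+ so that each group cannot swallow the separator. The name swapped is made up for illustration.

```r
myvec <- c("08AD09144_NACC022453", "08AD8245_NACC657970")

# [^_]+ matches a run of non-underscore characters, so each capture group
# is exactly one side of the single "_" separator
swapped <- sub("^([^_]+)_([^_]+)$", "\\2_\\1", myvec)
print(swapped)
```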

Replacing given characters to new ones before a defined parameter in gsub function

I am not very experienced in R and I am struggling with a problem. I want to replace all the underscores that come before the "S11" pattern with dashes ("-"). S11 is just a number and it varies in my table, e.g. S29, S30. Here is the code that I am using, which fails:
foo <- c("H2_2months_S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "H2_2months_with_acetate_S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "Formate_3months_S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
Sample <- gsub(pattern="*(_S)", replacement="-", x=foo)
Getting:
[1] "H2_2months-11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2_2months_with_acetate-101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate_3months-99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
I also don't want "_S" to be deleted. I use "_S[0-9]" as the matching criterion, and every underscore before "_S" should be changed to "-".
Also, please recommend a good website where I can learn the "codes or signs" used in this function. Thanks in advance.
Expected output:
[1] "H2-2months-S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2-2months-with-acetate-S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate-3months-S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
This should work, as long as the string does not already contain a "-".
Basically we divide the job into three parts: first we match "_(S[0-9]+)" and turn its underscore into "-", then we split the resulting string at that "-", then we use gsub to fix all the "_" in the part before the split.
foo <- c("H2_2months_S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
foo <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=foo)
#foo
#[1] "H2_2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Then we split:
split <- unlist(strsplit(foo, "-")) # split using the new "-"
#split
#[1] "H2_2months"
#[2] "S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Now we can use simple gsub on everything except the last element in split.
split_1 <- split[-length(split)] # fix all the "_" before the match (exclude the last)
split_1 <- gsub("_", "-", split_1)
Then we paste the results:
paste0(split_1, "-", split[length(split)]) # paste back together
#[1] "H2-2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Here in a function and with another example:
foo <- c("H2_2months_abc_456_S123_L001_R1_001")
my_foo <- function(s) {
s <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=s)
split <- unlist(strsplit(s, "-"))
split_1 <- split[-length(split)]
split_1 <- gsub("_", "-", split_1)
paste0(split_1, "-", split[length(split)])
}
my_foo(foo)
#[1] "H2-2months-abc-456-S123_L001_R1_001"
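The function above works on one string at a time and relies on the input having no other "-". As a hedged alternative, here is a minimal vectorized sketch in base R that locates the first "_S" followed by digits with regexpr and only rewrites the part in front of it. The function name dash_before_sample and the shortened sample strings are made up for illustration.

```r
dash_before_sample <- function(x) {
  # Position of the first "_S<digits>" in each string, -1 if there is none
  pos <- regexpr("_S[0-9]+", x)
  hit <- !is.na(pos) & pos > 0
  # prefix runs up to and including the "_" in front of the sample number
  prefix <- substr(x[hit], 1, pos[hit])
  rest <- substr(x[hit], pos[hit] + 1, nchar(x[hit]))
  # Only the prefix has its underscores replaced; the rest is kept as-is
  x[hit] <- paste0(gsub("_", "-", prefix, fixed = TRUE), rest)
  x
}

foo <- c("H2_2months_S11_L001_R1_001",
         "H2_2months_with_acetate_S101_L001_R1_001",
         "no_sample_number_here")
result <- dash_before_sample(foo)
print(result)
```

Strings without an "_S<digits>" marker are returned unchanged, which may or may not be what you want for your table.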
This will match the "_S11" and save S11 to the group. Then replace this with a "-" followed by the captured group "S11".
Sample <- gsub("_(S[0-9]+)", "-\\1", foo)
Excellent place to learn more regex: https://www.regular-expressions.info/quickstart.html
Excellent place to test regex with explanations of the matching: https://regexr.com/
Edit: Thanks RLave, didn't realise it could be any digits after the S. Updated answer.

Vectorized stringr with fixed (literal) characters

I've got the following code, which I expect to give me a list of 3, since there are 3 elements in texts:
library(stringr)
texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)","(:",";)",":D")
str_extract_all(texts, fixed(smileys))
Instead, I get a list of four (the length of my "pattern" parameter, here the smileys). Additionally, I get the following warning message:
Warning message: In stri_extract_all_fixed(string, pattern, simplify =
simplify, : longer object length is not a multiple of shorter object
length
Well, I don't imagine length will match, as I'm looking for any hits on any of the smileys in each text. It's not like I want to match string 1 with pattern 1, string 2 with pattern 2, etc.
Aware that I am messing up stringi's understanding of vectorizing, I have tried this instead:
texts %>% map(~ str_extract_all(.x, fixed(smileys)))
This is much better, as it gives me a list of 3, but each element is in turn a list of four.
What I'm trying to get to is a list of 3 that is as little nested as possible. Someone, somewhere, has solved this, but I can't for the life of me figure it out or get how to google it. I could do a for loop over this, but I consider myself a citizen of the tidyverse...
Grateful for any assistance.
You can use paste to wrap each element of smileys with \\Q and \\E and collapse on the regex "or" metacharacter (|) to form a single pattern. As mentioned in the link Henrik shared and documented on ?regex and in the stringi manual, characters between \\Q and \\E are interpreted literally.
pattern <- paste("\\Q", smileys, "\\E", sep = "", collapse = "|")
# [1] "\\Q:)\\E|\\Q(:\\E|\\Q;)\\E|\\Q:D\\E"
library(stringi)
stri_extract_all_regex(texts, pattern)
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#[1] NA
Base R:
regmatches(texts, gregexpr(pattern, texts))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# If you want an NA, instead of a zero-length vector,
# then you could do something like:
# lapply(
# regmatches(texts, gregexpr(pattern, texts)),
# function(ii) ifelse(is.character(ii) & length(ii) == 0L, NA, ii))
And if you do want to use purrr and avoid regular expressions, one idea would be something like this:
library(purrr)
library(stringr)
texts %>%
map(~ unlist(str_extract_all(.x, fixed(smileys))))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# if you want NA, not a zero-length vector, you could add:
# %>% map(~ ifelse(is.character(.x) & length(.x) == 0L, NA, .x))
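If you only need to know which smileys occur in each text (rather than every single occurrence), a base R sketch with literal matching is another option. It assumes presence/absence is enough, so a smiley that appears twice in one text is reported once; the name found is made up for illustration.

```r
texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)", "(:", ";)", ":D")

# For each text, test every smiley as a literal (fixed) pattern
found <- lapply(texts, function(txt) {
  present <- vapply(smileys, function(s) grepl(s, txt, fixed = TRUE), logical(1))
  smileys[present]
})
print(found)
```

The result is a list with one element per text, and texts without any smiley give a zero-length character vector.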

Grep to subset in R

How can I grep all the gene names starting only with "Gm" from data1[,7]?
I tried data2[grep("^Gm",data2$Genes),]; but it extracts the entire row which starts with "Gm".
data1[,7] <-
[1] "Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5"
[2] "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5"
[3] "Arhgap15,Gm22867"
One option would be to split the string by "," with strsplit and then, since the output is a list, use lapply with grep to extract the words that begin with "Gm" (^ denotes the beginning of the string)
lapply(strsplit(Genes, ','), function(x) grep('^Gm', x, value=TRUE))
#[[1]]
#[1] "Gm23940"
#[[2]]
#[1] "Gm5852" "Gm5773" "Gm9116" "Gm9117"
#[[3]]
#[1] "Gm22867"
Or you could extract the words by stri_extract_all from stringi
library(stringi)
stri_extract_all_regex(Genes, 'Gm[[:alnum:]]+')
Or if you need it as a vector, you can use unlist on the above output, or use gsub to remove the words that don't begin with "Gm" (\\b(?!Gm)\\w+\\b) and the ",", then use scan.
scan(text=gsub('\\b(?!Gm)\\w+\\b|,', ' ',
Genes, perl=TRUE), what='', quiet=TRUE)
#[1] "Gm23940" "Gm5852" "Gm5773" "Gm9116" "Gm9117" "Gm22867"
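A regex-free way to get the same flat vector, sketched here under the assumption that the gene names are separated by plain commas with no surrounding spaces, is to split everything first and then filter with startsWith (available in base R since 3.3.0). The names all_genes and gm_genes are made up for illustration.

```r
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
           "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
           "Arhgap15,Gm22867")

# Split on literal commas and flatten into one character vector
all_genes <- unlist(strsplit(Genes, ",", fixed = TRUE))

# Keep only the names whose first two characters are "Gm"
gm_genes <- all_genes[startsWith(all_genes, "Gm")]
print(gm_genes)
```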
Update
If you need to remove all the words starting with Gm
scan(text=gsub('\\bGm\\w+\\b|,', ' ', Genes, perl=TRUE),
what='', quiet=TRUE)
# [1] "Ighmbp2" "Mrpl21" "Cpt1a" "Mtl5" "Gal" "Ppp6r3"
# [7] "Lrp5" "Tdpoz4" "Tdpoz3" "Tdpoz5" "Arhgap15"
data
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
"Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
"Arhgap15,Gm22867")

Splitting CamelCase in R

Is there a way to split camel case strings in R?
I have attempted:
string.to.split = "thisIsSomeCamelCase"
unlist(strsplit(string.to.split, split="[A-Z]") )
# [1] "this" "s" "ome" "amel" "ase"
string.to.split = "thisIsSomeCamelCase"
gsub("([A-Z])", " \\1", string.to.split)
# [1] "this Is Some Camel Case"
strsplit(gsub("([A-Z])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Looking at Ramnath's answer and mine, I can say that my initial impression that this was an underspecified question has been supported.
And give Tommy and Ramnath upvotes for pointing out [:upper:]
strsplit(gsub("([[:upper:]])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Here is one way to do it
split_camelcase <- function(...){
strings <- unlist(list(...))
strings <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", strings)
strings <- gsub("(?!^)(?=[[:upper:]])", " ", strings, perl = TRUE)
return(strsplit(tolower(strings), " ")[[1]])
}
split_camelcase("thisIsSomeGood")
# [1] "this" "is" "some" "good"
Here's an approach using a single regex (a Lookahead and Lookbehind):
strsplit(string.to.split, "(?<=[a-z])(?=[A-Z])", perl = TRUE)
## [[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
Here is a one-liner using the gsubfn package's strapply. The regular expression matches the beginning of the string (^) followed by one or more lower case letters ([[:lower:]]+) or (|) an upper case letter ([[:upper:]]) followed by zero or more lower case letters ([[:lower:]]*) and processes the matched strings with c (which concatenates the individual matches into a vector). As with strsplit it returns a list so we take the first component ([[1]]) :
library(gsubfn)
strapply(string.to.split, "^[[:lower:]]+|[[:upper:]][[:lower:]]*", c)[[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
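If you would rather not add a package, the same regular expression can be used with base R's gregexpr and regmatches. This is a sketch of an equivalent, not part of the gsubfn answer; the names camel, pattern and pieces are made up for illustration, and the second input string is an extra example.

```r
camel <- c("thisIsSomeCamelCase", "anotherOne")

# Leading run of lower case letters, or an upper case letter followed by
# any number of lower case letters
pattern <- "^[[:lower:]]+|[[:upper:]][[:lower:]]*"

# gregexpr finds every match; regmatches extracts them, one list element per input
pieces <- regmatches(camel, gregexpr(pattern, camel))
print(pieces)
```

Because both functions are vectorized, this returns one character vector per input string.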
I think my other answer is better than the following, but if only a one-liner to split is needed... here we go:
library(snakecase)
unlist(strsplit(to_parsed_case(string.to.split), "_"))
#> [1] "this" "Is" "Some" "Camel" "Case"
The beginning of an answer is to split all the characters:
sp.x <- strsplit(string.to.split, "")
Then find which string positions are upper case:
ind.x <- lapply(sp.x, function(x) which(!tolower(x) == x))
Then use that to split out each run of characters . . .
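The answer above stops before the last step. One possible way to finish it, offered only as a sketch and not as the original author's intended continuation, is to number the runs with cumsum so that every upper case letter starts a new group, and then paste each group back together. The names chars, is_upper, group and pieces are made up for illustration.

```r
string.to.split <- "thisIsSomeCamelCase"

# Individual characters of the string
chars <- strsplit(string.to.split, "")[[1]]

# TRUE where a character differs from its lower case form, i.e. is upper case
is_upper <- chars != tolower(chars)

# Running count of upper case letters: each one opens a new group
group <- cumsum(is_upper)

# Paste the characters of each group back into one word
pieces <- unname(vapply(split(chars, group), paste, character(1), collapse = ""))
print(pieces)
```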
Here is an easy solution via snakecase + some tidyverse helpers:
install.packages("snakecase")
library(snakecase)
library(magrittr)
library(stringr)
library(purrr)
string.to.split = "thisIsSomeCamelCase"
to_parsed_case(string.to.split) %>%
str_split(pattern = "_") %>%
purrr::flatten_chr()
#> [1] "this" "Is" "Some" "Camel" "Case"
GitHub link to snakecase: https://github.com/Tazinho/snakecase

Resources