Get data out of "c(\"a\", \"b\")" format - r

I have a string "c(\"AV\", \"IM\")", which I'm trying to transform into the string "AV IM".
My issue is that I can't unlist() or flatten() this, as it's a character, and neither paste() nor stringr::str_c() work, since it's technically still 1 character value.
Any ideas how I can do this?
Tidyverse solutions preferred, if possible.
EDIT: I know this can be solved via regex, but I feel like this is more a "fundamental" problem to be solved string-level than it is a regex problem, if that makes any sense.

Not sure how you got here, but this as presented would be an eval/parse situation. However, as noted in many other answers on this site, there's almost always a better way of preparing your data so you end up in a more R-friendly form. See, for starters, What specifically are the dangers of eval(parse(...))?.
> a <- "c(\"AV\", \"IM\")"
> (b <- eval(parse(text=a)))
[1] "AV" "IM"
> paste(b, collapse=" ")
[1] "AV IM"

You can also use a regular expression to strip all punctuation and the leading c.
s <- "c(\"AV\", \"IM\")"
s_vec <- strsplit(s, split = ",")[[1]]
gsub("[[:punct:]]|^c", "", s_vec)
# [1] "AV" " IM"

It is not clear how you got here. You can use eval/parse, but it is not vectorized and it is slow, so a regular expression is the better choice:
a <- "c(\"AV\", \"IM\")"
stringr::str_extract_all(a,"\\w+(?!\\()")
[[1]]
[1] "AV" "IM"

Other answers output a vector. My understanding is you want a space-delimited list of your strings.
library(dplyr)
a <- "c(\"AV\", \"IM\")"
a %>%
gsub("c(", "", ., fixed=TRUE) %>%
gsub("\"", "", ., fixed=TRUE) %>%
gsub(",", "", ., fixed=TRUE) %>%
gsub(")", "", ., fixed=TRUE)
Output
"AV IM"
EDIT Or simply (from @www's answer):
a %>%
gsub("[[:punct:]]|^c", "", .)
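For a whole vector of such strings, a base R sketch along the same lines might look like this. The input vector x is hypothetical, and the sketch assumes the quoted values themselves never contain a double quote.

```r
# Hypothetical input: several deparsed c(...) calls stored as strings
x <- c("c(\"AV\", \"IM\")", "c(\"SC\")")

# Grab every double-quoted token, one list element per input string
quoted <- regmatches(x, gregexpr('"[^"]*"', x))

# Strip the quotes and collapse each set of tokens with a space
out <- vapply(
  quoted,
  function(q) paste(gsub('"', "", q, fixed = TRUE), collapse = " "),
  character(1)
)
out
# [1] "AV IM" "SC"
```

Unlike eval(parse(...)), this never evaluates the string as code, which sidesteps the concerns raised above.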

Related

R. Remove everything between two delimiter characters [duplicate]

I have a data frame with this kind of expression in column C:
GT_rs9628326:N_rs9628326
GT_rs1111:N_rs1111
GT_rs8374:N_rs8374
Using R, I want to remove everything between the first "T" and ":", as well as everything after the "N". I know this can be done with gsub. I would get:
GT:N
GT:N
GT:N
Maybe you can try
gsub("_\\w+","",s)
giving
[1] "GT:N" "GT:N" "GT:N"
Data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
Another option is to split each string on :, strip the unneeded text from each piece, and then paste the pieces back together with the same separator (I have used @ThomasIsCoding's data, thanks):
#Data
v1 <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
#Code
unlist(lapply(lapply(strsplit(v1,split = ':'),
function(x) sub("_[^_]+$", "", x)),
function(x) paste0(x,collapse = ':')))
Output:
[1] "GT:N" "GT:N" "GT:N"
Using str_remove from stringr
library(stringr)
str_remove_all(s, "_\\w+")
#[1] "GT:N" "GT:N" "GT:N"
data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
Remove the word characters that follow either "T" or "N", using @ThomasIsCoding's data.
gsub('(?<=T|N)\\w+', '', s, perl = TRUE)
#[1] "GT:N" "GT:N" "GT:N"
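Since the question mentions a data frame column, here is a minimal sketch of applying the same substitution in place. The data frame df is made up for illustration; only the column name C comes from the question.

```r
# Hypothetical data frame holding the expressions in column C
df <- data.frame(
  C = c("GT_rs9628326:N_rs9628326", "GT_rs1111:N_rs1111", "GT_rs8374:N_rs8374"),
  stringsAsFactors = FALSE
)

# Remove each underscore together with the word characters that follow it
df$C <- gsub("_\\w+", "", df$C)
df$C
# [1] "GT:N" "GT:N" "GT:N"
```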

How to drop substring from variable names?

I have the following names of variables:
vars <- c("var1.caps(12, For]","var2(5,For]","var3.tree.(15, For]","var4.caps")
I need to clean these names in order to get the following result:
clean_vars <- c("var1.caps","var2","var3.tree.","var4.caps")
So, basically I would like to drop (..].
Is there any automated way to do it in R?
I was trying to adapt str_replace(vars, pattern, ""), but not sure how to make pattern flexible because it could have different values between ( and ].
gsub("\\(.*\\]","",vars)
[1] "var1.caps" "var2" "var3.tree." "var4.caps"
Using stringr and purrr:
library(magrittr)  # provides the %>% pipe
stringr::str_split(vars, "\\(") %>% purrr::map(., 1) %>% unlist()
[1] "var1.caps" "var2" "var3.tree." "var4.caps"
Another option of using gsub
> gsub("(?<=)\\(.*\\]","\\1",vars,perl = T)
[1] "var1.caps" "var2" "var3.tree."
[4] "var4.caps"
Eliminate the first ( (in regex, \\() in the string and everything that comes after it (.+), replacing it with nothing (""):
sub("\\(.+", "", vars)
# [1] "var1.caps" "var2" "var3.tree." "var4.caps"
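The patterns above differ in how much they consume. A rough sketch of the difference, assuming a name could contain more than one (..] interval (the last element below is hypothetical and not from the question):

```r
vars <- c("var1.caps(12, For]", "var4.caps", "a(1, 2].b(3, 4]")

# Greedy: runs from the first "(" to the LAST "]" in the string
greedy <- gsub("\\(.*\\]", "", vars)

# Bounded: each match stops at the first "]" after its "("
bounded <- gsub("\\([^]]*\\]", "", vars)

greedy
# [1] "var1.caps" "var4.caps" "a"
bounded
# [1] "var1.caps" "var4.caps" "a.b"
```

For the names in the question both give the same result; the bounded form only matters if text between two intervals has to survive.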

strsplit not behaving as expected R

I have a basic problem in R, everything I'm working with is familiar to me (data, functions) but for some reason I can't get the strsplit or the gsub function to work as expected. I also tried the stringr package. I'm not going to bother putting up code using that package because I know this problem is simple and can be done with the two functions mentioned above. Personally, I feel like putting up a page for this isn't even necessary but my patience is pretty thin at this point.
I am trying to remove the "." and the number following the '.' in an Ensembl gene ID. Simple, I know.
id <- "ENSG00000223972.5"
gsub(".*", "", id)
strsplit(id, ".")
The asterisk symbol was meant to catch anything after the '.' and remove it but I don't know for sure if that's what it does. The strsplit should definitely output a list of two items, the first being everything before the '.' and the second being the one digit after. All it returns is a list with 17 "" symbols, for no space and one for each character in the string. I think it's an obvious thing that I'm missing but I haven't been able to figure it out. Thanks in advance.
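Read the help file for ?strsplit: split is a regular expression, so you cannot use a bare ".", which matches any character. Put it in a character class instead: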
id <- "ENSG00000223972.5"
gsub("[.]", "", id)
strsplit(id, split = "[.]")
Output:
> gsub("[.]", "", id)
[1] "ENSG000002239725"
> strsplit(id, split = "[.]")
[[1]]
[1] "ENSG00000223972" "5"
Help:
unlist(strsplit("a.b.c", "."))
## [1] "" "" "" "" ""
## Note that 'split' is a regexp!
## If you really want to split on '.', use
unlist(strsplit("a.b.c", "[.]"))
## [1] "a" "b" "c"
## or
unlist(strsplit("a.b.c", ".", fixed = TRUE))
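Note that gsub("[.]", "", id) only deletes the dot itself. To drop the dot and the version number after it, as the question asks, something like the following sketch should work, assuming the suffix is always a dot followed by digits at the end of the ID:

```r
id <- "ENSG00000223972.5"

# "." alone matches any character, so ".*" swallows the whole string
everything_gone <- gsub(".*", "", id)

# Escape the dot and anchor at the end to remove only the version suffix
stripped <- sub("\\.[0-9]+$", "", id)

# Alternatively, split on a literal dot and keep the first piece
first_piece <- strsplit(id, ".", fixed = TRUE)[[1]][1]

stripped
# [1] "ENSG00000223972"
```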

Extracting Headers from a list [duplicate]

I have a character string and want to extract the information inside of multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How would I do it so it extracts multiple parentheses and returns as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"
Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
but the results includes parenthesis... why?
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks @MartinMorgan for the comment.
Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)
@kohske uses regmatches, but I'm currently using R 2.13, so I don't have access to that function at the moment. This adds the dependency on stringr, but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.
I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.
I like @kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a matrix.
str_match_all(j, "(?<=\\().+?(?=\\))")
[,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note that perl = TRUE is needed here, because the lookbehind and lookahead syntax is only supported by the Perl-compatible engine.
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.
Using rex may make this type of task a little simpler.
library(rex)
matches <- re_matches(j,
rex(
"(",
capture(name = "text", except_any_of(")")),
")"),
global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"

str_replace (package stringr) cannot replace brackets in r?

I have a string, say
fruit <- "()goodapple"
I want to remove the brackets in the string. I decide to use stringr package because it usually can handle this kind of issues. I use :
str_replace(fruit,"()","")
But nothing is replaced, and the following is returned:
[1] "()good"
If I only want to replace the right half bracket, it works:
str_replace(fruit,")","")
[1] "(good"
However, the left half bracket does not work:
str_replace(fruit,"(","")
and the following error is shown:
Error in sub("(", "", "()good", fixed = FALSE, ignore.case = FALSE, perl = FALSE) :
invalid regular expression '(', reason 'Missing ')''
Does anyone have an idea why this happens? How can I remove the "()" in the string, then?
Escaping the parentheses does it...
str_replace(fruit,"\\(\\)","")
# [1] "goodapple"
You may also want to consider exploring the "stringi" package, which has a similar approach to "stringr" but has more flexible functions. For instance, there is stri_replace_all_fixed, which would be useful here since your search string is a fixed pattern, not a regex pattern:
library(stringi)
stri_replace_all_fixed(fruit, "()", "")
# [1] "goodapple"
Of course, basic gsub handles this just fine too:
gsub("()", "", fruit, fixed=TRUE)
# [1] "goodapple"
The accepted answer works for your exact problem, but not for the more general problem:
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace(my_fruits,"\\(\\)","")
## "goodapple" "(bad)apple" "(funnyapple"
This is because the regex exactly matches a "(" followed by a ")".
Assuming you care only about bracket pairs, this is a stronger solution:
str_replace(my_fruits, "\\([^()]{0,}\\)", "")
## "goodapple" "apple" "(funnyapple"
Building off of MJH's answer, this removes all ( or ):
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace_all(my_fruits, "[\\(\\)]", "")
[1] "goodapple" "badapple" "funnyapple"
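For comparison, a base R sketch of the two behaviours discussed in this thread (removing every parenthesis versus removing only matched pairs together with their contents), using the same my_fruits vector. Inside a character class, ( and ) need no escaping:

```r
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")

# Remove every "(" and ")" wherever it appears
no_parens <- gsub("[()]", "", my_fruits)

# Remove only complete "(...)" pairs, including whatever is inside them
no_pairs <- gsub("\\([^()]*\\)", "", my_fruits)

no_parens
# [1] "goodapple"  "badapple"   "funnyapple"
no_pairs
# [1] "goodapple"   "apple"       "(funnyapple"
```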
