strsplit not behaving as expected R


I have a basic problem in R, everything I'm working with is familiar to me (data, functions) but for some reason I can't get the strsplit or the gsub function to work as expected. I also tried the stringr package. I'm not going to bother putting up code using that package because I know this problem is simple and can be done with the two functions mentioned above. Personally, I feel like putting up a page for this isn't even necessary but my patience is pretty thin at this point.
I am trying to remove the "." and the number following the "." in an Ensembl gene ID. Simple, I know.
id <- "ENSG00000223972.5"
gsub(".*", "", id)
strsplit(id, ".")
The asterisk was meant to match everything after the '.' so it could be removed, but I'm not sure that is what it does. I expected strsplit to return a list holding two items: everything before the '.' and the single digit after it. Instead it returns 17 empty strings (""), one for each character in the string. I think I'm missing something obvious but I haven't been able to figure it out. Thanks in advance.

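Read the help file, ?strsplit: the split argument is a regular expression, and an unescaped "." matches any character, so you cannot use "." on its own. Use "[.]" to match a literal dot: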
id <- "ENSG00000223972.5"
gsub("[.]", "", id)
strsplit(id, split = "[.]")
Output:
> gsub("[.]", "", id)
[1] "ENSG000002239725"
> strsplit(id, split = "[.]")
[[1]]
[1] "ENSG00000223972" "5"
Help:
unlist(strsplit("a.b.c", "."))
## [1] "" "" "" "" ""
## Note that 'split' is a regexp!
## If you really want to split on '.', use
unlist(strsplit("a.b.c", "[.]"))
## [1] "a" "b" "c"
## or
unlist(strsplit("a.b.c", ".", fixed = TRUE))
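Note that gsub("[.]", "", id) only deletes the dot itself. To drop the dot together with the version number after it, which is what the question asks for, something along these lines should work. This is a minimal sketch in base R; the variable names gene_id and gene_id_split are only illustrative.

```r
id <- "ENSG00000223972.5"

# Remove a literal dot and the digits that follow it at the end of the string.
gene_id <- sub("[.][0-9]+$", "", id)
gene_id
# [1] "ENSG00000223972"

# Alternative with a fixed (non-regex) split: keep the part before the first dot.
gene_id_split <- strsplit(id, ".", fixed = TRUE)[[1]][1]
gene_id_split
# [1] "ENSG00000223972"
```

The sub() version leaves IDs without a version suffix untouched, because the pattern simply does not match them.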

Related

beg2char function in R (qdap package)

I am trying to keep only the part of the string to the left of "keyword". Anything to the right of "keyword" should be removed. beg2char seems like the best choice, but it's not doing what I thought it would do.
Please advise:
x <-"/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword/A//"
beg2char(x,"keyword")
# [1] "/in"

We could use gsub, as below:
gsub("keyword.*", "", x)
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/"

If we want to keep the "keyword" in the output, then set include = TRUE:
library(qdap)
x <-"/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword/A//"
beg2char(x, "keyword", include = TRUE)
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword"
If we want to exclude "keyword", then we would do as you did, which doesn't work: the result stops at the first "d", and "d" is one of the letters in "keyword". It looks like a bug to me, so I submitted an issue at GitHub:qdap.
But this works:
beg2char(x, "k")
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/"
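If depending on qdap is not desirable, the same result can be had in base R by locating the literal keyword and cutting the string there. This is only a sketch; pos and left_of_keyword are illustrative names, and fixed = TRUE is used so that the keyword is not interpreted as a regular expression.

```r
x <- "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword/A//"

# Position of the first literal occurrence of "keyword" (-1 if absent).
pos <- as.integer(regexpr("keyword", x, fixed = TRUE))

# Keep everything before the match; return x unchanged if there is no match.
left_of_keyword <- if (pos > 0) substr(x, 1, pos - 1) else x
left_of_keyword
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/"
```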

Use gsub to replace curly apostrophe with straight apostrophe in R list of character vectors

Looking for some guidance on how to replace a curly apostrophe with a straight apostrophe in an R list of character vectors.
The reason I'm replacing the curly apostrophes - later in the script, I check each list item, to see if it's found in a dictionary (using qdapDictionary) to ensure it's a real word and not garbage. The dictionary uses straight apostrophes, so words with the curly apostrophes are being "rejected."
A sample of the code I have currently follows. In my test list, item #6 contains a curly apostrophe, and item #2 has a straight apostrophe.
Example:
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
func_ReplaceTypographicApostrophes <- function(x) {
gsub("’", "'", x, ignore.case = TRUE)
}
list_TestWords_Fixed <- lapply(list_TestWords, func_ReplaceTypographicApostrophes)
The result: No change. Item 6 still using curly apostrophe. See output below.
list_TestWords_Fixed
[[1]]
[1] "this"
[[2]]
[1] "isn't"
[[3]]
[1] "ideal"
[[4]]
[1] "but"
[[5]]
[1] "we"
[[6]]
[1] "can’t"
[[7]]
[1] "fix"
[[8]]
[1] "it"
Any help you can offer will be most appreciated!

This might work: gsub("[\u2018\u2019\u201A\u201B\u2032\u2035]", "'", x)
I found it over here: http://axonflux.com/handy-regexes-for-smart-quotes
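Writing the curly quotes as \u escapes also sidesteps any doubt about how the script file itself is encoded. A minimal sketch, assuming the words are held in a plain character vector (gsub is vectorised, so lapply is not required); words and fixed_words are illustrative names.

```r
# "\u2019" is the right single quotation mark (the curly apostrophe).
words <- c("this", "isn't", "ideal", "but", "we", "can\u2019t", "fix", "it")

# Replace left and right single curly quotes with a straight apostrophe.
fixed_words <- gsub("[\u2018\u2019]", "'", words)
fixed_words[6]
# [1] "can't"
```

For a list like the one in the question, lapply(list_TestWords, function(w) gsub("[\u2018\u2019]", "'", w)) applies the same replacement to each item.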

You might be running up against a bug in R on Windows. Try using utf8::as_utf8 on your input. Alternatively, this also works:
library(utf8)
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
lapply(list_TestWords, utf8_normalize, map_quote = TRUE)
This will replace the following characters with ASCII apostrophe:
U+055A ARMENIAN APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+FF07 FULLWIDTH APOSTROPHE
It will also convert your text to composed normal form (NFC).

I see a problem in your call to gsub:
gsub("/’", "/'", x, ignore.case = TRUE)
You are prefixing the curly single quote with a forward slash. I don't know why you are doing this. I could speculate that you are trying to escape the quote characters, but this is having the side effect that your pattern is now trying to match a forward slash followed by a quote. As this never occurs in your text, no replacements are being made. You should be doing this:
gsub("’", "'", x, ignore.case = TRUE)
Follow the link below for a demo which shows that using the above gsub calls works as you expect.
Demo

Was about to say the same thing.
Try using str_replace from the stringr package; you will not need any slashes.

I was facing a similar problem. Somehow none of the solutions worked for me, so I devised an indirect way of doing it: identify the apostrophe by its position between word characters and replace it with the required format.
gsub("(\\w)(\\W)(\\w\\s)", "\\1'\\3","sid’s bicycle")
[1] "sid's bicycle"
Hope it helps someone.

Extracting Headers from a list [duplicate]

I have a character string and want to extract the information inside multiple sets of parentheses. Currently I can extract the information from the last set of parentheses with the code below. How would I extract the contents of every set and return them as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"

Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
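but the result includes the parentheses: the lookahead (?=\\() only checks that the match starts at "(", and the lookbehind (?<=\\)) only checks that it ends just after ")", so both characters end up inside the match.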
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks @MartinMorgan for the comment.

Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)
@kohske uses regmatches, but I'm currently using R 2.13 so I don't have access to that function at the moment. This adds a dependency on stringr, but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.

I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.
I like @kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a matrix.
str_match_all(j, "(?<=\\().+?(?=\\))")
[[1]]
     [,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
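regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note that perl = TRUE is needed here, because the lookbehind and lookahead syntax is only understood by the Perl-compatible regex engine.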
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.
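For repeated use, the base R variant can be wrapped in a small helper. The sketch below is one possible arrangement, not part of any package: extract_parens is a made-up name, and it uses [^()]+ rather than the lazy .+? so that a match can never run across a parenthesis.

```r
# Hypothetical helper: returns, for each input string, a character vector
# with the contents of every "(...)" group found in it.
extract_parens <- function(x) {
  # perl = TRUE enables the lookbehind (?<=...) and lookahead (?=...) syntax.
  m <- gregexpr("(?<=\\()[^()]+(?=\\))", x, perl = TRUE)
  regmatches(x, m)
}

j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
res <- extract_parens(c(j, "no parentheses here"))

res[[1]]
# [1] "wonder" "groan"  "Laugh"
res[[2]]
# character(0)
```

Because the result is a list with one element per input string, strings without any parentheses come back as empty character vectors instead of being dropped.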

Using rex may make this type of task a little simpler.
library(rex)
matches <- re_matches(j,
  rex(
    "(",
    capture(name = "text", except_any_of(")")),
    ")"),
  global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"

Gsub transforming numbers

I have run into this problem.
I scrape some data from the web and, for instance, I obtain this:
"3.444.654" (as character)
If I use gsub("3.444.654", ".", "") in order to get 3444654...
R gives me
[1] ""
What could I do to get the integer?

> gsub(".", "", "3.444.654", fixed = TRUE)
[1] "3444654"
Maybe read the documentation for gsub: the argument order is gsub(pattern, replacement, x), so the string to be modified goes last. To then turn the string into a number, use as.numeric, as.integer etc.
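Putting the two steps together, a short sketch (raw, digits and n are illustrative names; the dots are assumed to be thousands separators, as in the question):

```r
raw <- "3.444.654"

# fixed = TRUE treats "." as a literal dot rather than "any character".
digits <- gsub(".", "", raw, fixed = TRUE)

# Convert the cleaned string to an integer.
n <- as.integer(digits)
n
# [1] 3444654
```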

str_replace (package stringr) cannot replace brackets in r?

I have a string, say
fruit <- "()goodapple"
I want to remove the brackets in the string. I decide to use stringr package because it usually can handle this kind of issues. I use :
str_replace(fruit,"()","")
But nothing is replaced, and the following is returned:
[1] "()goodapple"
If I only want to replace the right bracket, it works:
str_replace(fruit,")","")
[1] "(goodapple"
However, replacing the left bracket does not work:
str_replace(fruit,"(","")
and the following error is shown:
Error in sub("(", "", "()goodapple", fixed = FALSE, ignore.case = FALSE, perl = FALSE) :
invalid regular expression '(', reason 'Missing ')''
Anyone has ideas why this happens? How can I remove the "()" in the string, then?

Escaping the parentheses does it...
str_replace(fruit,"\\(\\)","")
# [1] "goodapple"
You may also want to consider exploring the "stringi" package, which has a similar approach to "stringr" but has more flexible functions. For instance, there is stri_replace_all_fixed, which would be useful here since your search string is a fixed pattern, not a regex pattern:
library(stringi)
stri_replace_all_fixed(fruit, "()", "")
# [1] "goodapple"
Of course, basic gsub handles this just fine too:
gsub("()", "", fruit, fixed=TRUE)
# [1] "goodapple"

The accepted answer works for your exact problem, but not for the more general problem:
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace(my_fruits,"\\(\\)","")
## "goodapple" "(bad)apple" "(funnyapple"
This is because the regex exactly matches a "(" followed by a ")".
Assuming you care only about bracket pairs, this is a stronger solution:
str_replace(my_fruits, "\\([^()]{0,}\\)", "")
## "goodapple" "apple" "(funnyapple"

Building off of MJH's answer, this removes all ( or ):
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace_all(my_fruits, "[()]", "")
[1] "goodapple" "badapple" "funnyapple"
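The same two behaviours can be reproduced in base R without stringr. This is a sketch using gsub; no_brackets and no_pairs are illustrative names.

```r
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")

# Remove every "(" or ")" wherever it occurs; inside [...] they need no escaping.
no_brackets <- gsub("[()]", "", my_fruits)
no_brackets
# [1] "goodapple"  "badapple"   "funnyapple"

# Remove only complete pairs, together with whatever they enclose.
no_pairs <- gsub("\\([^()]*\\)", "", my_fruits)
no_pairs
# [1] "goodapple"   "apple"       "(funnyapple"
```

Which of the two is appropriate depends on whether the text inside the brackets should be kept or discarded.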
