How to obtain character at a specific place? [duplicate] - r

This question already has answers here:
str_extract: Extracting exactly nth word from a string
(5 answers)
Closed 3 years ago.
example:
"A.B.C.D"
"apple.good.sad.sea"
"X1.AN2.ED3.LK8"
What I need is to obtain the string specifically between the second dot and the third dot.
result:
"C"
"sad"
"ED3"
How can I do this?

You can use base::strsplit and loop through the resulting list to extract the third element of each piece:
v <- c("A.B.C.D", "apple.good.sad.sea", "X1.AN2.ED3.LK8")
sapply(strsplit(v, "\\."), `[[`, 3L)
output:
[1] "C" "sad" "ED3"
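The one-liner above errors if any string has fewer than three pieces, because `[[` fails on an out-of-range index. A minimal sketch of a more defensive variant, assuming you would rather get NA in that case (nth_field is a hypothetical helper name, not a base R function):

```r
# Hypothetical helper: return the n-th field of each string, or NA if it is missing.
nth_field <- function(x, n, sep = ".") {
  # fixed = TRUE treats sep as a literal string, so "." needs no escaping
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

v <- c("A.B.C.D", "apple.good.sad.sea", "X1.AN2.ED3.LK8", "A.B")
nth_field(v, 3)
# "C" "sad" "ED3" NA
```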

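You can use unlist(strsplit(str, split = ".", fixed = TRUE))[3] to get the third substring of a single string. Note that fixed = TRUE (or the escaped pattern "\\.") is required: strsplit treats split as a regular expression by default, and an unescaped "." matches every character.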

I'd use
sub("^([^.]*\\.){2}([^.]*)\\..*", "\\2", x)
# [1] "C" "sad" "ED3"

Using regex in gsub.
v <- c("A.B.C.D", "apple.good.sad.sea", "X1.AN2.ED3.LK8", "A.B.C.D.E")
gsub("(.*?\\.){2}(.*?)(\\..*)", "\\2", v)
# [1] "C" "sad" "ED3" "C"

Related

How to get the most frequent character within a character string? [duplicate]

This question already has answers here:
Finding the most repeated character in a string in R
(2 answers)
Closed 1 year ago.
Suppose the next character string:
test_string <- "A A B B C C C H I"
Is there any way to extract the most frequent value within test_string?
Something like:
extract_most_frequent_character(test_string)
Output:
#C
We can use scan to read the string as a vector of individual elements by splitting at the spaces, get the frequency counts with table, find the named index with the highest count (which.max), and take its name:
extract_most_frequent_character <- function(x) {
names(which.max(table(scan(text = x, what = '', quiet = TRUE))))
}
Testing:
extract_most_frequent_character(test_string)
[1] "C"
Or with strsplit
extract_most_frequent_character <- function(x) {
names(which.max(table(unlist(strsplit(x, "\\s+")))))
}
Here is another base R option (not as elegant as @akrun's answer)
> intToUtf8(names(which.max(table(utf8ToInt(gsub("\\s", "", test_string))))))
[1] "C"
One possibility involving stringr could be:
names(which.max(table(str_extract_all(test_string, "[A-Z]", simplify = TRUE))))
[1] "C"
Or marginally shorter:
names(which.max(table(str_extract_all(test_string, "[A-Z]")[[1]])))
Here is a solution using the stringr package, table and which.max:
library(stringr)
test_string <- str_split(test_string, " ")
test_string <- table(test_string)
names(test_string)[which.max(test_string)]
[1] "C"
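All of the answers above rely on which.max, which returns only the first maximum, so ties are silently dropped. A minimal sketch that returns every tied value instead, assuming whitespace-separated tokens as in the question (most_frequent_all is a hypothetical name):

```r
# Hypothetical helper: return all tokens that share the highest count.
most_frequent_all <- function(x) {
  counts <- table(strsplit(x, "\\s+")[[1]])
  # as.vector drops the table attributes so we index with a plain logical vector
  names(counts)[as.vector(counts == max(counts))]
}

most_frequent_all("A A B B C C C H I")
# "C"
most_frequent_all("A A B B H")
# "A" "B"
```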

How to split words in R while keeping contractions [duplicate]

This question already has an answer here:
strsplit on all spaces and punctuation except apostrophes [duplicate]
(1 answer)
Closed 7 years ago.
I'm trying to turn a character vector novel.lower.mid into a list of single words. So far, this is the code I've used:
midnight.words.l <- strsplit(novel.lower.mid, "\\W")
This produces a list of all the words. However, it splits everything, including contractions. The word "can't" becomes "can" and "t". How do I make sure those words aren't separated, or that the function just ignores the apostrophe?
We can use
library(stringr)
str_extract_all(novel.lower.mid, "\\b[[:alnum:]']+\\b")
Or
strsplit(novel.lower.mid, "(?!')\\W", perl=TRUE)
If you just want your current "\W" split to not include apostrophes, negate \w and ':
novel.lower.mid <- c("I won't eat", "green eggs and", "ham")
strsplit(novel.lower.mid, "[^\\w']", perl=T)
# [[1]]
# [1] "I" "won't" "eat"
#
# [[2]]
# [1] "green" "eggs" "and"
#
# [[3]]
# [1] "ham"

strsplit in R not working for $ as split character [duplicate]

This question already has answers here:
How do I strip dollar signs ($) from data/ escape special characters in R?
(4 answers)
Closed 7 years ago.
> str = "a$b$c"
> astr <- strsplit(str,"$")
> astr
[[1]]
[1] "a$b$c"
Still trying to figure the answer out!
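You need to escape it, because $ is a regex metacharacter (the end-of-string anchor) and strsplit treats split as a regular expression by default: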
strsplit(str,"\\$")
Another option is to use fixed = TRUE, which treats the split string literally:
strsplit(str,"$",fixed=TRUE)
## [1] "a" "b" "c"
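As a quick check, the escaped pattern, the fixed = TRUE form, and a bracket expression should all give the same pieces. This is only a sketch; the variable names are illustrative:

```r
s <- "a$b$c"

escaped <- strsplit(s, "\\$")[[1]]             # regex with an escaped metacharacter
literal <- strsplit(s, "$", fixed = TRUE)[[1]] # literal match, no regex involved
bracket <- strsplit(s, "[$]")[[1]]             # inside [...] the $ is an ordinary character

identical(escaped, literal) && identical(literal, bracket)
# TRUE
```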

Extract text in parentheses in R

Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.
Text outside the parentheses:
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parentheses:
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=T)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) part matches the characters inside the parentheses, and the following (*SKIP)(*F) makes that match fail while stopping the regex engine from retrying at those positions. The engine then tries the alternative after the | symbol against the rest of the string, so the dot . matches every character that was not skipped, including the parentheses themselves. Replacing all the matched characters with an empty string leaves only the text inside the parentheses.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=T)
[1] "bp" "" "ijkl"
This regex captures the characters inside the parentheses in group 1 and, through the |. alternative, matches every remaining character one at a time. Replacing each match with the contents of group 1 (which is empty whenever the dot alternative matched) gives the desired output.
The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
To condense b and p use:
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"
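If you want the condensed result without an extra package, and with "" rather than "NA" for strings that have no parentheses, one base R sketch uses gregexpr and regmatches with the same lookaround pattern as the earlier answer (the variable names m and inside are illustrative):

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# Find every run of non-parenthesis characters that sits between ( and )
m <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

# Strings without a match yield character(0), which paste() collapses to ""
inside <- vapply(m, paste, character(1), collapse = "")
inside
# "bp" "" "ijkl"
```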

How to split a string in r by a delimiter and discard the last two items?

I have a string separated by _ and I want to get rid of the last two elements. For example, from A_B_C_D I want to return A_B, and from A_B_C_D_E I want A_B_C. I have tried str_split_fixed from stringr:
my_string <- "A_B_C_D"
x <- str_split_fixed(my_string,"_",3)
but it returns "A" "B" "C_D" instead of "A_B" "C" "D", otherwise I could have done head(x,-2) to get A_B
Is there a better way than
paste(head(unlist(strsplit(my_string,"_")),-2),collapse="_")
How about using a regex:
sub('(_[A-Z]){2}$', '', 'A_B_C_D')
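Here the number 2 is the count of trailing elements you want to drop. Note that [A-Z] only matches elements that are a single uppercase letter, as in the example.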
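For elements that are longer than one letter, the same idea can be generalized by matching "a separator followed by any run of non-separator characters". A minimal sketch, assuming the separator is a single character with no special meaning inside a regex character class (drop_last is a hypothetical helper name):

```r
# Hypothetical helper: drop the last n sep-delimited elements of each string.
drop_last <- function(x, n = 2, sep = "_") {
  # For sep = "_" and n = 2 this builds the pattern "(_[^_]*){2}$"
  pattern <- sprintf("(%s[^%s]*){%d}$", sep, sep, n)
  sub(pattern, "", x)
}

drop_last(c("A_B_C_D", "A_B_C_D_E", "foo_bar_baz_qux"))
# "A_B" "A_B_C" "foo_bar"
```

Strings with fewer than n separators do not match the pattern and are returned unchanged.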

Resources