Use strsplit with multiple delimiters [duplicate] - r

This question already has answers here:
R strsplit with multiple unordered split arguments?
(4 answers)
Closed 4 years ago.
How can I split this
Chr3:153922357-153944632(-)
Chr11:70010183-70015411(-)
in to
Chr3 153922357 153944632 -
Chr11 70010183 70015411 -
I tried strsplit(df$V1,"[[:punct:]]")), but the negative sign is not coming in the final result

You can also try str_split from stringr:
library(stringr)
lapply(str_split(df$V1, "(?<!\\()\\-|[:\\)\\(]"), function(x) x[x != ""])
Result:
[[1]]
[1] "Chr3" "153922357" "153944632" "-"
[[2]]
[1] "Chr11" "70010183" "70015411" "-"
Data:
df = read.table(text = " Chr3:153922357-153944632(-)
Chr11:70010183-70015411(-) ")

How about this in base R using stringsplit and gsub:
# Your sample strings
ss <- c("Chr3:153922357-153944632(-)",
"Chr11:70010183-70015411(-)")
# Split items as list of vectors
lst <- lapply(ss, function(x)
unlist(strsplit(gsub("(.+):(\\d+)-(\\d+)\\((.)\\)", "\\1,\\2,\\3,\\4", x), ",")))
# rbind to dataframe if necessary
do.call(rbind, lst);
# [,1] [,2] [,3] [,4]
#[1,] "Chr3" "153922357" "153944632" "-"
#[2,] "Chr11" "70010183" "70015411" "-"
This should work for other chromosome names and positive strand features as well.

The issue is that - is both a character you want to extract and a delimiter. Your best bet is using capture groups and specifying the full regex string:
stringr::str_match(x, "^(.{4}):(\\d+)-(\\d+)\\((.)\\)$")
EDIT: If you want to let the first capture group capture strings of arbitrary length (e.g. ChrX for any X), you can change first capture group from .{4} to Chr\\d+.

Related

How to get the most frequent character within a character string? [duplicate]

This question already has answers here:
Finding the most repeated character in a string in R
(2 answers)
Closed 1 year ago.
Suppose the next character string:
test_string <- "A A B B C C C H I"
Is there any way to extract the most frequent value within test_string?
Something like:
extract_most_frequent_character(test_string)
Output:
#C
We can use scan to read the string as a vector of individual elements by splitting at the space, get the frequency count with table, return the named index that have the max count (which.count), get its name
extract_most_frequent_character <- function(x) {
names(which.max(table(scan(text = x, what = '', quiet = TRUE))))
}
-testing
extract_most_frequent_character(test_string)
[1] "C"
Or with strsplit
extract_most_frequent_character <- function(x) {
names(which.max(table(unlist(strsplit(x, "\\s+")))))
}
Here is another base R option (not as elegant as #akrun's answer)
> intToUtf8(names(which.max(table(utf8ToInt(gsub("\\s", "", test_string))))))
[1] "C"
One possibility involving stringr could be:
names(which.max(table(str_extract_all(test_string, "[A-Z]", simplify = TRUE))))
[1] "C"
Or marginally shorter:
names(which.max(table(str_extract_all(test_string, "[A-Z]")[[1]])))
Here is solution using stringr package, table and which:
library(stringr)
test_string <- str_split(test_string, " ")
test_string <- table(test_string)
names(test_string)[which.max(test_string)]
[1] "C"

Str_split is returning only half of the string

I have a tibble and the vectors within the tibble are character strings with a mix of English and Mandarin characters. I want to split the tibble into two, with one column returning the English, the other column returning the Mandarin. However, I had to resort to the following code in order to accomplish this:
tb <- tibble(x = c("I我", "love愛", "you你")) #create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = T) #split string when R reads a character that is not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = T) #split string after R reads all the a-z characters
tb <- tb %>%
mutate(EN = en[,1],
CH = ch[,2]) %>%
select(-x)#subset the matrices created above, because the matrices create a column of blank/"" values and also remove x column
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
Or we can use
library(stringr)
tb$en <- str_extract(tb$x,"[[:alpha:]]+")
tb$ch <- str_extract(tb$x,"[^[:alpha:]]+")
We can use str_match and get data for English and rest of the characters separately.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
A simple solution using str_extract from package stringr:
library(stringr)
tb$en <- str_extract(tb$x,"[A-z]+")
tb$ch <- str_extract(tb$x,"[^A-z]")
In case there's more than one Chinese character, just add +to [^A-z].
Alternatively, use gsuband backreference:
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution macthes unicode characters with [ -~]+ and excludes them with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
x en ch
<chr> <chr> <chr>
1 I我 I 我
2 love愛 love 愛
3 you你 you 你

How to obtain character at a specific place? [duplicate]

This question already has answers here:
str_extract: Extracting exactly nth word from a string
(5 answers)
Closed 3 years ago.
example:
"A.B.C.D"
"apple.good.sad.sea"
"X1.AN2.ED3.LK8"
What I need is to obtain the string specifically between the second dot and the third dot.
result:
"C"
"sad"
"ED3"
How can I do this?
You can use base::strsplit, loop thr the elements to get the 3rd one
v <- c("A.B.C.D", "apple.good.sad.sea", "X1.AN2.ED3.LK8")
sapply(strsplit(v, "\\."), `[[`, 3L)
output:
[1] "C" "sad" "ED3"
You can use unlist(strsplit(str,split = "."))[3] to get the third sub-string, where the original string is split by "." when you apply strsplit
I'd use
sub("^([^.]*\\.){2}([^.]*)\\..*", "\\2", x)
# [1] "C" "sad" "ED3"
Using regex in gsub.
v <- c("A.B.C.D", "apple.good.sad.sea", "X1.AN2.ED3.LK8", "A.B.C.D.E")
gsub("(.*?\\.){2}(.*?)(\\..*)", "\\2", v)
# [1] "C" "sad" "ED3" "C"

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"

How to extract multiple substrings in a string using stringr regex

I have this string:
mystring <- "HMSC-bm_in_ALL_CELLTYPES.distal"
What I want to do is to extract the substring as defined
in this bracketing
[HMSC-bm]_in_ALL_CELLTYPES.[distal]
So in the end it will yield a vector with two values: HMSC-bm and distal. How can I do it? I tried this but failed:
> stringr::str_extract(base,"\\([\\w-]+\\)_in_ALL_CELLTYPES\\.\\([\\w+]\\)")
[1] NA
I'd use str_match:
library(stringr)
mymatch <- str_match(mystring, "^(.*?)_.*?\\.(.*?)$")
mymatch
[,1] [,2] [,3]
[1,] "HMSC-bm_in_ALL_CELLTYPES.distal" "HMSC-bm" "distal"
mymatch[, 2]
[1] "HMSC-bm"
mymatch[3, ]
[1] "distal"
We can split the string by _in_ALL_CELLTYPES..
strsplit(mystring, split = "_in_ALL_CELLTYPES.")[[1]]
[1] "HMSC-bm" "distal"

Resources