Split string according to ambiguous delimiter in R

I have pairs of strings in a data frame:
df <- data.frame(str = c("L_V1_ROI-L_MST_ROI",
"L_V6_ROI-L_V2_ROI",
"L_V3_ROI-L_V4_ROI",
"L_V8_ROI-L_4_ROI",
"L_p9-46v_ROI-L_a9-46v_ROI"))
Each pair is separated by a - symbol, except the last pair, which contains three - symbols and should be split into the substrings L_p9-46v_ROI and L_a9-46v_ROI.
The task is to split these pairs into substrings at the separator. To do this I simply use:
library(tidyr)
df %>% separate(col = str, into = c("str1", "str2"), sep = "-")
which gives the following result:
      str1      str2
1 L_V1_ROI L_MST_ROI
2 L_V6_ROI  L_V2_ROI
3 L_V3_ROI  L_V4_ROI
4 L_V8_ROI   L_4_ROI
5     L_p9   46v_ROI
Warning message:
Too many values at 1 locations: 5
As expected, the problem lies in the 5th pair, which contains more than one - symbol.
Question: what regex matches only the proper separator?
My partial solution is pasted below, but I hope there is a more elegant one.
library(stringr)
my_split <- function(string, pattern) {
  ## Match the start and end positions of "_ROI-"
  position <- str_locate(string = string, pattern = pattern)
  start <- position[1]
  end <- position[2]
  ## Extract the substrings: keep "_ROI" (start + 3) and drop the "-"
  substring1 <- substr(string, 1, start + 3)
  substring2 <- substr(string, end + 1, nchar(string))
  return(list(substring1, substring2))
}
## Toy example
my_str <- "L_p9-46v_ROI-L_a9-46v_ROI"
my_split(string = my_str, pattern = "_ROI-")
[[1]]
[1] "L_p9-46v_ROI"
[[2]]
[1] "L_a9-46v_ROI"
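
One possible approach, sketched here under the assumption that the first element of every pair ends in _ROI, is to split only on a - that is immediately preceded by _ROI, using a lookbehind. The example uses base R so that it runs without extra packages; the names parts and res are introduced here for illustration only.

```r
df <- data.frame(str = c("L_V1_ROI-L_MST_ROI",
                         "L_V6_ROI-L_V2_ROI",
                         "L_V3_ROI-L_V4_ROI",
                         "L_V8_ROI-L_4_ROI",
                         "L_p9-46v_ROI-L_a9-46v_ROI"),
                 stringsAsFactors = FALSE)

# Split on "-" only when it directly follows "_ROI" (lookbehind needs perl = TRUE)
parts <- strsplit(df$str, "(?<=_ROI)-", perl = TRUE)

# Each element has exactly two pieces, so they can be bound into two columns
res <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(res) <- c("str1", "str2")
res
```

The same pattern should also work as the sep argument of separate(), though that is not tested here.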

Related

If a row matches a criteria do paste in R

Suppose you have a data frame with two columns, ID and POSITION. I want to paste some text depending on the ID value.
I want to paste the ID value after GK0000 (when ID >= 10) or GK00000 (when ID < 10), followed by .2:, the POSITION value, .., and the next POSITION value (POSITION + 1).
For example, if ID = 1 and POSITION = 10, the result would be GK000001.2:10..11, and if ID = 10 and POSITION = 10, the result would be GK000010.2:10..11.
In Excel, with ID in column A and POSITION in column B, I can do this using =IF(A2<10,CONCATENATE("GK00000",A2,".2:",B2,"..",B2+1),CONCATENATE("GK0000",A2,".2:",B2,"..",B2+1)), but I want to do it in my R script.
Here is a short illustrative example of my input data:
ID <- c(1,5,9,10,12)
POSITION <- c(10,50,90,100,120)
df <- cbind(ID,POSITION)
and the result I'm expecting is
CONCAT <- c("GK000001.2:10..11","GK000005.2:50..51","GK000009.2:90..91",
"GK000010.2:100..101","GK000012.2:120..121")
dfResult <- cbind(ID,POSITION,CONCAT)

I believe the question asks for a string format given two arguments, A and B, and a number of digits.
concat <- function(A, B, digits = 6){
fmt <- paste0("%0", digits, "d")
fmt <- paste0("GK", fmt, ".2:%d..%d")
sprintf(fmt, A, B, B + 1)
}
concat(df[, 'ID'], df[, 'POSITION'], 6)
# Output shown for a 12-row input (ID = 1:12, POSITION = 10 * ID):
# [1] "GK000001.2:10..11" "GK000002.2:20..21" "GK000003.2:30..31"
# [4] "GK000004.2:40..41" "GK000005.2:50..51" "GK000006.2:60..61"
# [7] "GK000007.2:70..71" "GK000008.2:80..81" "GK000009.2:90..91"
#[10] "GK000010.2:100..101" "GK000011.2:110..111" "GK000012.2:120..121"
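
For the five-row example from the question, a minimal sketch of the same sprintf() idea looks like this. Zero-padding the ID to six digits makes the Excel-style IF on ID < 10 unnecessary. The as.integer() calls are a precaution added here, since %d expects whole numbers.

```r
ID <- c(1, 5, 9, 10, 12)
POSITION <- c(10, 50, 90, 100, 120)

# "%06d" left-pads the ID with zeros to a width of six digits
CONCAT <- sprintf("GK%06d.2:%d..%d",
                  as.integer(ID),
                  as.integer(POSITION),
                  as.integer(POSITION + 1))

dfResult <- data.frame(ID, POSITION, CONCAT, stringsAsFactors = FALSE)
dfResult
```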

How to split a sentence in two halves in R

I have a vector of strings, and I want each string to be cut roughly in half, at the nearest space.
For example, with the following data:
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
"qsdf",
"mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
"qsddddddddddddddddddddddddddddddd",
"qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"), stringsAsFactors = FALSE)
I want to get something like this :
first sec
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd
5 lmj mjjmjmjm lkj lmj mjjmjmjm lkj
Any solution that does not cut in halves but splits "so that the first part isn't longer than X characters" would also be great.

First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then, for each vector, we find the last word at which the cumulative character count (spaces excluded) is still at most half of the vector's total character count:
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
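
To see what the nested cumsum() call computes, here is a worked trace on the words of the first example string. The variable names are introduced here for illustration only.

```r
words <- c("qsdf", "mqsldkfop", "mqsdfmlk", "lksdfp", "pqpdfm", "mqsdfmj", "mlk")

lens <- nchar(words)               # 4 9 8 6 6 7 3
running <- cumsum(lens)            # 4 13 21 27 33 40 43
fits <- running <= sum(lens) / 2   # TRUE for the first three words (21 <= 21.5)

# cumsum(fits) stops growing after the last TRUE; which.max() returns the
# first position of that maximum, i.e. the index of the last word that fits
split_at <- which.max(cumsum(fits))
split_at                           # 3
```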
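Afterwards we combine the two halves, substituting NA for the second half if the vector has length 1 (only one word).
combined <- Map(function(x, y){
  # Test the number of words, not y: y is also 1 when a multi-word
  # string has to be split right after its first word
  if(length(x) == 1){
    return(c(x, NA))
  }else{
    return(c(paste(x[1:y], collapse = " "), paste(x[(y+1):length(x)], collapse = " ")))
  }
}, a, b)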
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
                              first                                  second
1           qsdf mqsldkfop mqsdfmlk               lksdfp pqpdfm mqsdfmj mlk
2                              qsdf                                    <NA>
3                        mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd                                    <NA>
5                       qsdfmlk mlk         mkljlmkjlmkjml lmj mjjmjmjm lkj
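
The question also asks for a variant in which the first part is no longer than X characters. A minimal base R sketch of that idea follows; split_at_width is a hypothetical helper written for this example, not part of the answer above, and it leaves a string unsplit when no space falls within the limit.

```r
# Hypothetical helper: split `s` at the last space that keeps the first
# part at or below `width` characters
split_at_width <- function(s, width) {
  # Short strings need no split
  if (nchar(s) <= width) {
    return(c(first = s, second = NA_character_))
  }
  spaces <- gregexpr(" ", s, fixed = TRUE)[[1]]
  # Drop the -1 that gregexpr() returns for "no match" and keep only
  # spaces that leave at most `width` characters in front of them
  spaces <- spaces[spaces > 0 & spaces <= width + 1]
  if (length(spaces) == 0) {
    # No usable space: leave the string unsplit
    return(c(first = s, second = NA_character_))
  }
  cut <- max(spaces)
  c(first = substr(s, 1, cut - 1), second = substr(s, cut + 1, nchar(s)))
}

x <- c("qsdf mqsldkfop mqsdfmlk lksdfp", "qsdf")
res <- t(sapply(x, split_at_width, width = 15, USE.NAMES = FALSE))
res
```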

You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = FALSE)
# Count the number of words in each row:
nwords = str_count(test$init, " ") + 1
# Position at which to break the line in each row:
break_here = ifelse(nwords %% 2 == 0, nwords/2, round(nwords/2) + 1)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
# map2_chr applies a two-argument function row by row:
# the string is "init" and n is "break_here":
test %>%
mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = FALSE))) %>%
separate(init, c("first", "second"), sep = "\n")
# first second
# 1 Phrase with four words
# 2 That phrase has five words

Extract only values with a decimal point in between from strings

I have a dataframe with strings such as:
id <- c(1,2)
x <- c("...14.....5.......................395.00.........................14.........1..",
"......114.99....................124.99................")
df <- data.frame(id,x)
df$x <- as.character(df$x)
How can I extract only values with a decimal point in between such as 395.00, 114.99 and 124.99 and not 14, 5, or 1 for each row, and put them in a new column separated by a comma?
The ideal result would be:
id x2
1 395.00
2 114.99,124.99
The number of periods separating the values is random.

library(stringr)
df$x2 = str_extract_all(df$x, "[0-9]+\\.[0-9]+")
df[c(1, 3)]
#   id             x2
# 1  1         395.00
# 2  2 114.99, 124.99
Explanation: [0-9]+ matches one or more digits, \\. matches a single decimal point. str_extract_all extracts all matches.
The new column is a list column, not a string with an inserted comma. This gives you access to the individual elements, if needed:
df$x2[2]
# [[1]]
# [1] "114.99" "124.99"
If you prefer a character vector as the column, do this:
df$x3 = sapply(str_extract_all(df$x, "[0-9]+\\.[0-9]+"), paste, collapse = ",")
df$x3[2]
#[1] "114.99,124.99"
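
If stringr is not available, the same extraction can be sketched in base R with gregexpr() and regmatches(), using the same pattern as above. The names m and x2 are introduced here for illustration only.

```r
x <- c("...14.....5.......................395.00.........................14.........1..",
       "......114.99....................124.99................")

# gregexpr() locates every "digits.digits" match; regmatches() extracts them
m <- regmatches(x, gregexpr("[0-9]+\\.[0-9]+", x))

# Collapse each element's matches into one comma-separated string
x2 <- vapply(m, paste, character(1), collapse = ",")
x2
```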

Extract 2 parts of a string

Assume I have the following string (filename):
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
which consists of several parts (here, part p1)
or another one
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
which consists of only one part (so no need to label any p)
How can I extract the Identifier, which is the three letters before the VARXXXXX (so in case one it would be TKN, in case two it would be ZHN) PLUS the part identifier, if available?
So the result should be:
case1 : TKN_p1
case2 : ZHN
I know how to extract the first identifier, but I cannot handle the second one at the same time.
My approach so far:
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", a)
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", b)
but this adds .tx incorrectly in the second case.

You are not using anchors and matching the last 3 characters right after timely without checking what these characters are (. matches any character).
I suggest
sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
Details:
^ - start of string
.*/ - part of string up to and including the last /
([A-Z]{3}) - 3 ASCII uppercase letters captured into Group 1
_VAR\\d+_timely - _VAR + 1 or more digits + _timely
(_[^_.]+)? - an optional Group 2 capturing _ + 1 or more chars other than _ and .
\\. - a dot
[^.]* - zero or more chars other than .
$ - end of string.
The replacement pattern contains backreferences to both capturing groups, so the result consists of their contents only.
R demo:
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
a2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
a2
[1] "TKN_p1"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
b2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", b)
b2
[1] "ZHN"
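
Because sub() is vectorised and returns non-matching input unchanged, it can be convenient to wrap the pattern in a small function that flags non-matches. The sketch below does that; extract_id is a hypothetical name, and returning NA for non-matching file names is a design choice made here, not something the answer prescribes.

```r
# Hypothetical wrapper around the pattern from the answer above
extract_id <- function(paths) {
  pattern <- "^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$"
  out <- sub(pattern, "\\1\\2", paths)
  # sub() leaves non-matching strings untouched, so mark them explicitly
  out[!grepl(pattern, paths)] <- NA_character_
  out
}

files <- c("X/ZHEB100/TKN_VAR29380_timely_p1.txt",
           "X/ZHEB100/ZHN_VAR29380_timely.txt",
           "no_match.txt")
ids <- extract_id(files)
ids
```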

Just another solution, for something different from Wiktor's already working solution:
library( magrittr )
data <- c( a, b )
First get the "ID" values by splitting on "/", taking the last value, and taking the first 3 characters of that:
ID <- strsplit( data, "/" ) %>%
sapply( tail, n = 1 ) %>%
substr( 1, 3 )
Then get the "part" values by splitting out both "timely" and ".txt", and taking the last element (which may be an empty string):
part <- strsplit( data, "timely|\\.txt" ) %>%
sapply( tail, n = 1 )
Now just paste them together for the result:
output <- paste0( ID, part )
output
[1] "TKN_p1" "ZHN"
Or, if you'd rather not create the intermediate objects:
output <- strsplit( data, "/" ) %>%
sapply( tail, n = 1 ) %>%
substr( 1, 3 ) %>%
paste0( strsplit( data, "timely|\\.txt" ) %>%
sapply( tail, n = 1 ) )
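
A variation on the same idea, sketched here without the pipe, uses basename() and tools::file_path_sans_ext() to strip the directory and the extension first. It assumes, as in the examples, that the identifier is always the first three characters of the file name and that the part label follows _timely.

```r
data <- c("X/ZHEB100/TKN_VAR29380_timely_p1.txt",
          "X/ZHEB100/ZHN_VAR29380_timely.txt")

# File name without directory and extension,
# e.g. "TKN_VAR29380_timely_p1" for the first path
stem <- tools::file_path_sans_ext(basename(data))

ID <- substr(stem, 1, 3)              # first three characters
part <- sub("^.*_timely", "", stem)   # whatever follows "_timely" (may be empty)

output <- paste0(ID, part)
output
```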

String match algorithms in R

I would like to find how many characters of one string match another, counting from the start. I have two strings, a <- "ABCDBADCABC" and b <- "ABC", and I want to check whether b occurs at the start of a. I am not interested in matches at any other position.
Another example: with b <- "ABCDBADCABC" and a <- "ABCDAB", only the first four characters of a match b from the start, so the output should be ABCD.
What options are available in R to do this?

I would keep it simple and turn a and b into vectors of individual characters. Then string matching is straightforward.
## Make a and b
b = "ABCDBADCABC"
a = "ABCDAB"
Find the length of the shortest vector
min_char = min(nchar(a), nchar(b))
Then split a and b up
a_split = strsplit(substr(a,1, min_char), "")[[1]]
b_split = strsplit(substr(b,1, min_char), "")[[1]]
Compare using standard operators
comp = a_split == b_split
Find the position of the first FALSE (or use the full length if every character matches)
match(FALSE, comp, nomatch = length(comp) + 1) - 1
With less code:
compare(a, b)
where
compare = function(a, b) {
  min_char = min(nchar(a), nchar(b))
  a_split = strsplit(substr(a,1, min_char), "")[[1]]
  b_split = strsplit(substr(b,1, min_char), "")[[1]]
  comp = a_split == b_split
  # Number of leading matches; equals min_char when everything matches
  match(FALSE, comp, nomatch = length(comp) + 1) - 1
}
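
Since the question asks both whether b occurs at the start of a and which characters match, here is a minimal sketch that covers the two cases. startsWith() is base R (3.3.0 and later); common_prefix is a hypothetical helper written for this example that returns the matching leading substring rather than its length.

```r
# Case 1: does one string start with the other?
startsWith("ABCDBADCABC", "ABC")

# Case 2: longest common leading substring (hypothetical helper)
common_prefix <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  if (n == 0) return("")
  a_chars <- strsplit(substr(a, 1, n), "")[[1]]
  b_chars <- strsplit(substr(b, 1, n), "")[[1]]
  # Index of the first mismatch minus one; n when all n characters match
  k <- match(FALSE, a_chars == b_chars, nomatch = n + 1) - 1
  substr(a, 1, k)
}

common_prefix("ABCDAB", "ABCDBADCABC")
```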
