R simplify gsub() to make sample names from longer string


I have a list of sample names:
name <- c("GOM_13M_TB-01_S.HM (Q)30",
"GOM_13M_PS-06_S.HM (Q)30",
"GOM_13O_PS-06_3C_HM (Q)30",
"GOM_14O_GI-02_B3 (Q)30",
"GOM_14O_PS-03_A3 (Q)30",
"GOM_12J_GI-01_MS (Q)30")
that need to be simplified into
13M_TB-01_MS (MS for consistency)
13M_PS-06_MS
13O_PS-06_3C (I am not too concerned about the last 2 digits order)
14O_GI-02_B3
14O_PS-03_A3
12J_GI-01_MS
I have tried the following uses of gsub(), but I'm trying to simplify the solution.
x <- gsub("GOM_", "", name)
x <- gsub("\\(Q\\)30", "", x)
x <- gsub("_S", "_MS", x)
x <- gsub(".HM", "", x)
Any suggestions?

Maybe you can try something like the following:
gsub("GOM_(.*) .*", "\\1", gsub("S.HM", "MS", name))
# [1] "13M_TB-01_MS" "13M_PS-06_MS" "13O_PS-06_3C_HM" "14O_GI-02_B3"
# [5] "14O_PS-03_A3" "12J_GI-01_MS"
Or, perhaps:
## I think this matches what you're expecting...
substr(gsub("S.HM", "MS", name), 5, 16)
# [1] "13M_TB-01_MS" "13M_PS-06_MS" "13O_PS-06_3C" "14O_GI-02_B3"
# [5] "14O_PS-03_A3" "12J_GI-01_MS"
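If the fixed positions in substr() feel fragile, the same result can probably be derived from the structure of the names instead. The sketch below is only one possible variant: it assumes every name starts with GOM_ and has at least three underscore-separated fields before the space, and the variable name short is made up for illustration.

```r
name <- c("GOM_13M_TB-01_S.HM (Q)30", "GOM_13M_PS-06_S.HM (Q)30",
          "GOM_13O_PS-06_3C_HM (Q)30", "GOM_14O_GI-02_B3 (Q)30",
          "GOM_14O_PS-03_A3 (Q)30", "GOM_12J_GI-01_MS (Q)30")

# Escape the dot so that it only matches a literal "."
short <- sub("S\\.HM", "MS", name)
# Drop "GOM_", keep the next three fields, discard everything after them
short <- sub("^GOM_([^_]+_[^_]+_[^_ ]+).*$", "\\1", short)
short
```

Because the third field stops at the next underscore or space, the trailing _HM of the third name is dropped without relying on character counts.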

Related

sub a pattern in a data frame in r except for when it includes another pattern

I have a df with rows in a specific column containing the following
temp1_01_100, temp2_01_100, temp2_02_100, s10_100, s11_100, s12_100, s21_100
I would like to replace the "_100" with "", e.g. with df$col <- sub("_100.*", "", df$col),
but I don't want the replacement when the value contains the pattern "temp", no matter whether it's temp1 or temp2.
the output I want is:
temp1_01_100, temp2_01_100, temp2_02_100, s10, s11, s12, s21

You can try gsub like below
> gsub("(?<=s\\d{2})_100", "", vec, perl = TRUE)
[1] "temp1_01_100" "temp2_01_100" "temp2_02_100" "s10" "s11"
[6] "s12" "s21"
Data
vec <- c("temp1_01_100", "temp2_01_100", "temp2_02_100", "s10_100", "s11_100", "s12_100", "s21_100")
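The lookbehind above only fires after an "s" plus two digits. If the rule is really "anything that does not start with temp", a negative lookahead at the start of the string may be closer to the intent. This is a hedged sketch, assuming the suffix is always at the end of the value; res is a name introduced here for illustration.

```r
vec <- c("temp1_01_100", "temp2_01_100", "temp2_02_100",
         "s10_100", "s11_100", "s12_100", "s21_100")

# (?!temp) rejects strings that begin with "temp";
# for all others, keep everything in front of the trailing "_100"
res <- sub("^(?!temp)(.*)_100$", "\\1", vec, perl = TRUE)
res
```

Lookarounds are not supported by R's default regex engine, so perl = TRUE is required here as well.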

Input
> vec
[1] "temp1_01_100" "temp2_01_100" "temp2_02_100" "s10_100" "s11_100" "s12_100" "s21_100"
Code
vec[!grepl("temp", vec)] <- gsub("_100", "", vec[!grepl("temp", vec)])
Output
> vec
[1] "temp1_01_100" "temp2_01_100" "temp2_02_100" "s10" "s11" "s12" "s21"
Addendum
For your situation, try directly with
df$col[!grepl("temp", df$col)] <- gsub("_100", "", df$col[!grepl("temp", df$col)])
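To avoid evaluating grepl() twice, the logical index can be stored first. A minimal sketch on a stand-in data frame (the df below is hypothetical and only mirrors the values from the question; keep is a name introduced for illustration):

```r
# Hypothetical data frame standing in for the asker's df
df <- data.frame(col = c("temp1_01_100", "temp2_01_100", "temp2_02_100",
                         "s10_100", "s11_100", "s12_100", "s21_100"),
                 stringsAsFactors = FALSE)

# TRUE for rows that do not contain the literal text "temp"
keep <- !grepl("temp", df$col, fixed = TRUE)
# Strip the trailing "_100" only in those rows
df$col[keep] <- sub("_100$", "", df$col[keep])
df$col
```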

How to truncate specific part of string if present

Consider the following vector:
x <- c("GDP_UK", "GDP_US", "GDP_UK_diff2_L2",
"INC","GDP_UK_L2", "GDP_US_level", "INC_UK", "INC_L1", "INC_diff1")
The vector contains some strings.
I want to find the elements that contain "_diff(number)", "_L(number)" or "_level" and truncate that part of the string.
The result should be the following vector:
c("GDP_UK", "GDP_US", "GDP_UK", "INC", "GDP_UK", "GDP_US", "INC_UK", "INC", "INC")
As you can see, all _diff, _L and _level parts were truncated to obtain the raw strings.
I'm not sure how to do it. I tried the code
x[grepl(paste(c("diff", "level", "_L"), collapse = "|"), x)]
to obtain only the elements that include diff, level or _L, but I have no idea how to cut them. I tried something with substring but wasn't sure how to specify up to which letter the string should be deleted. Do you have any idea how this can be done?
** EDIT **
We can use the following code:
x <- gsub(pattern = "_L", replacement = "", x)
x <- gsub(pattern = "_diff", replacement = "", x)
x <- gsub(pattern = "_level", replacement = "", x)
However we will end up with remaining numbers at the end of the strings:
"GDP_UK" "GDP_US" "GDP_UK22" "INC" "GDP_UK2" "GDP_US" "INC_UK" "INC2" "INC1"

What you are looking for is the regex "_L\\d*", etc. This matches an underscore, L and zero or more digits.
In full
x <- c("GDP_UK", "GDP_US", "GDP_UK_diff2_L2",
"INC","GDP_UK_L2", "GDP_US_level", "INC_UK", "INC_L1", "INC_diff1")
# apply the three replacements one after another
y <- gsub("_L\\d*", "", x)
y <- gsub("_diff\\d*", "", y)
y <- gsub("_level\\d*", "", y)
# or in one go:
library(stringr)
x %>%
str_replace_all("_L\\d*", "") %>%
str_replace_all("_diff\\d*", "") %>%
str_replace_all("_level\\d*", "")
#> [1] "GDP_UK" "GDP_US" "GDP_UK" "INC" "GDP_UK" "GDP_US" "INC_UK" "INC"
#> [9] "INC"
## or even in one go:
gsub("_(L|diff|level)\\d*", "", x)
#> [1] "GDP_UK" "GDP_US" "GDP_UK" "INC" "GDP_UK" "GDP_US" "INC_UK" "INC"
#> [9] "INC"
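One caveat with "_L\\d*": because the digits are optional, it also matches a plain "_L" inside an ordinary name. The following is a hedged, stricter sketch, assuming that L and diff are always followed by at least one digit and that each suffix ends at an underscore or at the end of the string. The extra element "GDP_LUX" is made up to show the difference.

```r
x <- c("GDP_UK", "GDP_US", "GDP_UK_diff2_L2",
       "INC", "GDP_UK_L2", "GDP_US_level", "INC_UK", "INC_L1", "INC_diff1")
x2 <- c(x, "GDP_LUX")  # hypothetical name that merely starts a field with "L"

# Require digits after L/diff and a field boundary after the suffix
strict <- gsub("_(L\\d+|diff\\d+|level)(?=_|$)", "", x2, perl = TRUE)
strict

# For comparison, the looser pattern also cuts into "GDP_LUX"
loose <- gsub("_(L|diff|level)\\d*", "", x2)
loose[10]
```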

Subset string by counting specific characters

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off each string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.

You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
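Note that sub() returns the input unchanged when the pattern does not match, whereas str_extract() returns NA. If the count and the character set need to vary, the pattern can be built from parameters. This is a rough sketch, not part of the answer above: cut_after_n and its arguments are hypothetical names, and chars is assumed to contain only plain letters (no ], ^ or -).

```r
cut_after_n <- function(strings, chars = "AGN", n = 3) {
  # Build e.g. "^((?:[^AGN]*[AGN]){3}).*$" from the arguments
  pattern <- sprintf("^((?:[^%s]*[%s]){%d}).*$", chars, chars, n)
  hit <- grepl(pattern, strings, perl = TRUE)
  # Strings with fewer than n occurrences stay NA
  out <- rep(NA_character_, length(strings))
  out[hit] <- sub(pattern, "\\1", strings[hit], perl = TRUE)
  out
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_after_n(strings)
cut_after_n(strings, n = 4)
```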

Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
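One thing to watch with which.max(): if a string never reaches the target count, the comparison is FALSE everywhere and which.max() returns 1, so only the first character is kept. A possible guard, sketched with match() instead (cut_at_count is a hypothetical helper name, and returning NA for short strings is an assumption about the desired behaviour):

```r
cut_at_count <- function(s, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(s, ""), function(x) {
    # Position of the first character at which the running count equals n
    idx <- match(n, cumsum(x %in% chars))
    if (is.na(idx)) NA_character_ else paste(x[seq_len(idx)], collapse = "")
  }, character(1))
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_at_count(strings)
cut_at_count(strings, n = 4)
```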

Identify the positions of the pattern using gregexpr, then extract the n-th position (here the 3rd) and take the substring from 1 to that position using substr.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.

This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})

Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

Replace multiple symbols in a string differently in r

I tried to recode values such as (5,10],(20,20] to 5-10%,20-20% using gsub. So the opening parenthesis should be removed, the comma should be changed to a dash and the closing bracket should become %. All I have managed so far is
x<-c("(5,10]","(20,20]")
gsub("\\,","-",x)
Then the comma is changed to the dash. How can I change others as well?
Thanks.

Keeping it very simple, a set of gsubs.
x <- c("(5,10]","(20,20]")
x <- gsub(",", "-", x) # replace comma by dash
x <- gsub("\\(", "", x) # remove opening parenthesis
x <- gsub("]", "%", x) # replace ] by %
x
"5-10%" "20-20%"

Here's another alternative:
> gsub("\\((\\d+),(\\d+)\\]", "\\1-\\2%", x)
[1] "5-10%" "20-20%"
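The capture-group version above expects integer bounds and the exact "(a,b]" shape. If the labels can also be closed on the left or contain decimals (an assumption; the question only shows the two values above), a slightly more permissive sketch could look like this. The vector x2 and its third element are made up for illustration.

```r
x2 <- c("(5,10]", "(20,20]", "[0,2.5)")  # last element is hypothetical

# Accept "(" or "[" at the start and ")" or "]" at the end,
# and capture whatever sits on either side of the comma
res <- sub("^[(\\[](.+),(.+)[)\\]]$", "\\1-\\2%", x2, perl = TRUE)
res
```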

Other solution.
Using regmatches we extract all the numbers. We then combine every first and second number.
nrs <- regmatches(x, gregexpr("[[:digit:]]+", x))
nrs <- as.numeric(unlist(nrs))
i <- 1:length(nrs); i <- i[(i%%2)==1]
for(h in i){print(paste0(nrs[h],'-',nrs[h+1],'%'))}
[1] "5-10%"
[1] "20-20%"

Just for fun, an ugly one-liner:
sapply(regmatches(x, gregexpr("\\d+", x)), function(x) paste0(x[1], "-", x[2], "%"))
[1] "5-10%" "20-20%"

Merging specific strings in character vector

I have a character vector where each element is a word. It has been generated from a text in which some segments are marked up with angle brackets. These segments vary in length. I need the marked-up segments to be merged in the vector.
The input looks like this:
c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")
I need the output to look like this:
c("This","is","some","text","with","<marked up chunks>[L]","in","it")
Thanks.

Here's an approach that also works with multiple chunks in a vector:
vec <- c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")
from <- grep("<", vec)
to <- grep(">", vec)
idx <- mapply(seq, from, to, SIMPLIFY = FALSE)
new_strings <- sapply(idx, function(x)
paste(vec[x], collapse = " "))
replacement <- unlist(mapply(function(x, y) c(y, rep(NA, length(x) - 1)),
idx, new_strings, SIMPLIFY = FALSE))
new_vec <- "attributes<-"(na.omit(replace(vec, unlist(idx), replacement)), NULL)
[1] "This" "is"
[3] "some" "text"
[5] "with" "<marked up chunks>[L]"
[7] "in" "it"
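The same merge can also be written by giving every word a group number and pasting each group together. This is an alternative sketch rather than the approach above; it assumes chunks are not nested and that every "<" has a matching ">". All helper names (opens, closes, in_chunk, continuing, group, merged) are introduced here for illustration.

```r
vec <- c("This", "is", "some", "text", "with",
         "<marked", "up", "chunks>[L]", "in", "it")

opens  <- grepl("<", vec, fixed = TRUE)
closes <- grepl(">", vec, fixed = TRUE)

# A word is inside a chunk if more chunks were opened up to and including it
# than were closed before it
in_chunk <- (cumsum(opens) - c(0, head(cumsum(closes), -1))) > 0
# Words that continue an already opened chunk do not start a new group
continuing <- in_chunk & !opens
group <- cumsum(!continuing)

# Paste the words of each group back together
merged <- unname(vapply(split(vec, group), paste, character(1), collapse = " "))
merged
```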
