Remove run-length duplicate numbers and NA in strings - r

I have columns with large strings of decimal numbers and NA:
df <- data.frame(
A_gsr =c("2.752,2.752,2.752,2.752,2.752,2.752,2.752,2.911,2.911,3.555",
"2.999,2.999,2.999,2.752,2.752,2.752,2.752"),
B_gsr = c("1.34,1.34,1.34,1.55,1.55,1.55,1.55,1.55,1.55,1.55",
"1.56,1.56,1.56,1.55,1.55,1.55,1.55,NA,NA,NA,NA,1.34,1.34,1.34"),
C_gsr = c("NA,NA,NA,0.147,0.147,0.147,0.147,0.147,NA",
"0.146,0.146,0.146,0.146,0.146,0.146,0.146,0.146,0.146,0.146")
)
I want to remove all run-length duplicates. Using gsub and backreference, I'm getting pretty close to what I want to have:
lapply(df[,1:3], function(x) gsub("((\\d\\.\\d+,)|(NA,))\\1+", "\\1", x))
$A_gsr
[1] "2.752,2.911,3.555" "2.999,2.752,2.752"
$B_gsr
[1] "1.34,1.55,1.55" "1.56,1.55,NA,1.34,1.34"
$C_gsr
[1] "NA,0.147,NA" "0.146,0.146"
However, not close enough - there are still some run-length dups, all at the end of the strings. The expected result is this:
$A_gsr
[1] "2.752,2.911,3.555" "2.999,2.752"
$B_gsr
[1] "1.34,1.55" "1.56,1.55,NA,1.34"
$C_gsr
[1] "NA,0.147,NA" "0.146"

You can use
lapply(df[,1:3], function(x) gsub("\\b(\\d+\\.\\d+|NA)(?:,\\1)+\\b", "\\1", x))
## => $A_gsr
## [1] "2.752,2.911,3.555" "2.999,2.752"
##
## $B_gsr
## [1] "1.34,1.55" "1.56,1.55,NA,1.34"
##
## $C_gsr
## [1] "NA,0.147,NA" "0.146"
See the regex demo and the R demo online.
Details:
\b - a word boundary
(\d+\.\d+|NA) - Group 1: one or more digits, ., one or more digits, OR NA string
(?:,\1)+ - one or more repetitions of a comma and the value in Group 1
\b - a word boundary

Related

regex strsplit expression in R so it only applies once to the first occurrence of a specific character in each string?

I have a list filled with strings:
string<- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
I need to split the strings so they appear like:
"SPG_L", "subgenual_ACC_R", "SPG_R", "MTG_L_pole", "MTG_L_pole", "CerebellumGM_L"
I tried using the following regex expression to split the strings:
str_split(string,'(?<=[[RL]|pole])_')
But this leads to:
"SPG_L", "subgenual" "ACC_R", "SPG_R", "MTG_L", "pole", "MTG_L", "pole", "CerebellumGM_L"
How do I edit the regex expression so it splits each string element at the "_" after the first occurrence of "R", "L" unless the first occurrence of "R" or "L" is followed by "pole", then it splits the string element after the first occurrence of "pole" and only splits each string element once?
I suggest a matching approach using
^(.*?[RL](?:_pole)?)_(.*)
See the regex demo
Details
^ - start of string
(.*?[RL](?:_pole)?) - Group 1:
.*? - any zero or more chars other than line break chars as few as possible
[RL](?:_pole)? - R or L optionally followed with _pole
_ - an underscore
(.*) - Group 2: any zero or more chars other than line break chars as many as possible
See the R demo:
library(stringr)
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L", "SFG_pole_R_IFG_triangularis_L", "SFG_pole_R_IFG_opercularis_L" )
res <- str_match_all(x, "^(.*?[RL](?:_pole)?)_(.*)")
lapply(res, function(x) x[-1])
Output:
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
[[4]]
[1] "SFG_pole_R" "IFG_triangularis_L"
[[5]]
[1] "SFG_pole_R" "IFG_opercularis_L"
split_again = function(x){
if(length(x) > 1){
return(x)
}
else{
str_split(
string = x,
pattern = '(?<=[R|L])_',
n = 2)
}
}
str_split(
string = string,
pattern = '(?<=pole)_',
n = 2) %>%
lapply(split_again) %>%
unlist()
you could use sub then strsplit as shown:
strsplit(sub("^.*?[LR](?:_pole)?\\K_",":",string,perl=TRUE),":")
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"

strsplit and keep part before first underscore

I would like to keep the part after the FIRST undescore. Please see example code.
colnames(df)
"EGAR00001341740_P32_1" "EGAR00001341741_PN32"
My try, but does not give P32_1 but only P32 which is wrong.
sapply(strsplit(colnames(df), split='_', fixed=TRUE), function(x) (x[2]))
desired output: P32_1, PN32
It could be done with a regex by matching zero or more characters that are not an underscore ([^_]*) from the start (^) of the string, followed by an underscore (_) and replace it with blanks ("")
colnames(df) <- sub("^[^_]*_", "", colnames(df))
colnames(df)
#[1] "P32_1" "PN32"
With strsplit, it will split whereever the split character occurs. One option is str_split from stringr where there is an option to specify the 'n' i.e. number of split parts. If we choose n = 2, we get 2 substrings as it will only split at the first _
library(stringr)
sapply(str_split(colnames(df), "_", n = 2), `[`, 2)
#[1] "P32_1" "PN32"
Here are a few ways. The first fixes the code in the question and the remaining ones are alternatives. All use only base except (6). (4) and (7) assume that the first field is fixed length, which is the case in the question.
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")
# 1 - using strsplit
sapply(strsplit(x, "_"), function(x) paste(x[-1], collapse = "-"))
## [1] "P32_1" "PN32"
# 2 - a bit easier using sub. *? is a non-greedy match
sub(".*?_", "", x)
## [1] "P32_1" "PN32"
# 3 - locate the first underscore and extract all after that
substring(x, regexpr("_", x) + 1)
## [1] "P32_1" "PN32"
# 4 - if the first field is fixed length as in the example
substring(x, 17)
## [1] "P32_1" "PN32"
# 5 - replace first _ with character that does not appear and remove all until it
sub(".*;", "", sub("_", ";", x))
## [1] "P32_1" "PN32"
# 6 - extract everything after first _
library(gsubfn)
strapplyc(x, "_(.*)", simplify = TRUE)
## [1] "P32_1" "PN32"
# 7 - like (4) assumes fixed length first field
read.fwf(textConnection(x), widths = c(16, 99), as.is = TRUE)$V2
## [1] "P32_1" "PN32"

removes part of string in r

I'm trying to extract ES at the end of a string
> data <- c("phrases", "phases", "princesses","class","pass")
> data1 <- gsub("(\\w+)(s)+?es\\b", "\\1\\2", data, perl=TRUE)
> gsub("(\\w+)s\\b", "\\1", data1, perl=TRUE)
[1] "phra" "pha" "princes" "clas" "pas"
I get this result
[1] "phra" "pha" "princes" "clas" "pas"
but in reality what I need to obtain is:
[1] "phras" "phas" "princess" "clas" "pas"
You can use a word boundary (\\b) if it is guaranteed that each word is followed by a punctuation or is at the end of the string:
data <- c("phrases, phases, princesses, bases")
gsub('es\\b', '', data)
# [1] "phras, phas, princess, bas"
With your method, just wrap everything till the second + with one set of parentheses:
gsub("(\\w+s+)es\\b", "\\1", data)
# [1] "phras, phas, princess, bas"
There is also no need to make + lazy with ?, since you are trying to match as many consecutive s's as possible.
Edit:
OP changed the data and the desired output. Below is a simple solution that removes either es or s at the end of each string:
data <- c("phrases", "phases", "princesses","class","pass")
gsub('(es|s)\\b', '', data)
# [1] "phras" "phas" "princess" "clas" "pas"
maybe you are looking for a lookbehind assertion (which is a 0 length match)
"(?<=s)es\\b"
or because lookbehind can't have a variable length perl \K construct to keep out of match left of \K
"\\ws\\Kes\\b"

Using str_view with a list of words in R

I want to use str_view from stringr in R to find all the words that start with "y" and all the words that end with "x." I have a list of words generated by Corpora, but whenever I launch the code, it returns a blank view.
Common_words<-corpora("words/common")
#start with y
start_with_y <- str_view(Common_words, "^[y]", match = TRUE)
start_with_y
#finish with x
str_view(Common_words, "$[x]", match = TRUE)
Also, I would like to find the words that are only 3 letters long, but no
ideas so far.
I'd say this is not about programming with stringr but learning some regex. Here are some sites I have found useful for learning:
http://www.regular-expressions.info/tutorial.html
http://www.rexegg.com/
https://www.debuggex.com/
Here the \\w or short hand class for word characters (i.e., [A-Za-z0-9_]) is useful with quantifiers (+ and {3} in these 2 cases). PS here I use stringi because stringr is using that in the backend anyway. Just skipping the middle man.
x <- c("I like yax because the rock to the max!",
"I yonx & yix to pick up stix.")
library(stringi)
stri_extract_all_regex(x, 'y\\w+x')
stri_extract_all_regex(x, '\\b\\w{3}\\b')
## > stri_extract_all_regex(x, 'y\\w+x')
## [[1]]
## [1] "yax"
##
## [[2]]
## [1] "yonx" "yix"
## > stri_extract_all_regex(x, '\\b\\w{3}\\b')
## [[1]]
## [1] "yax" "the" "the" "max"
##
## [[2]]
## [1] "yix"
EDIT Seems like these may be of use too:
## Just y starting words
stri_extract_all_regex(x, 'y\\w+\\b')
## Just x ending words
stri_extract_all_regex(x, 'y\\w+x')
## Words with n or more characters
stri_extract_all_regex(x, '\\b\\w{4,}\\b')

Resources