Remove Duplicate Characters from String in R

I've got a vector of strings which is being read in, but every entry has garbage characters at the start and end of the string that I want to remove. My problem is that I don't know which characters are the garbage until they appear in each entry.
For example, the vector contains:
nRsp ;A810SS-Q1D-01 "
nRsp ;C5A19A60WESD04 "
nRsp ;461961 "
In this case, nRsp ; is the garbage at the beginning and " is the garbage at the end. The garbage values should occur at the same place relative to the start and end of each string, but I need some way to first find them and then remove them.
Thanks!!

If you want to find the characters that all elements of your vector have in common both at the beginning and at the end before removing them, you could do:
library(purrr)
## Replicating the data
v = c("nRsp ;A810SS-Q1D-01 \"","nRsp ;C5A19A60WESD04 \"","nRsp ;461961 \"")
## Split each string into a vector
l = strsplit(v,"")
## Find the common parts at the start and end of all elements in the list
start = 1
while(every(l,function(x) sum(x[1:start]==l[[1]][1:start])==start)){start=start+1}
end = 1
while(every(l,function(x) sum(rev(x)[1:end]==rev(l[[1]])[1:end])==end)){end=end+1}
## Remove the common 'garbage' from each element of the list
v2 = sapply(l,function(x) paste(x[start:(length(x)-end+1)],collapse=""))
This returns:
[1] "A810SS-Q1D-01" "C5A19A60WESD04" "461961"
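For readers who find the two while loops dense, the same idea can be sketched outside R. The block below is a minimal C++ illustration, not the answer's code: the function name strip_common_affixes is made up for this sketch, and it assumes the garbage is exactly the longest prefix and longest suffix shared by every string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: removes the longest prefix and the longest suffix
// that all strings in v have in common.
std::vector<std::string> strip_common_affixes(const std::vector<std::string>& v) {
    if (v.empty()) return {};

    // No common prefix or suffix can be longer than the shortest string.
    std::size_t min_len = v[0].size();
    for (const auto& s : v) min_len = std::min(min_len, s.size());

    // Grow the prefix while every string agrees with the first one.
    std::size_t prefix = 0;
    while (prefix < min_len &&
           std::all_of(v.begin(), v.end(), [&](const std::string& s) {
               return s[prefix] == v[0][prefix];
           })) {
        ++prefix;
    }

    // Grow the suffix the same way, counting from the end, without
    // letting it overlap the prefix in the shortest string.
    std::size_t suffix = 0;
    while (suffix < min_len - prefix &&
           std::all_of(v.begin(), v.end(), [&](const std::string& s) {
               return s[s.size() - 1 - suffix] == v[0][v[0].size() - 1 - suffix];
           })) {
        ++suffix;
    }

    // Keep only the middle part of each string.
    std::vector<std::string> out;
    for (const auto& s : v) {
        out.push_back(s.substr(prefix, s.size() - prefix - suffix));
    }
    return out;
}
```

Bounding both loops by the length of the shortest string keeps the comparison from running past the end of any entry, which is a case the R loops above do not guard against.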

Related

Extract all substrings in string

I want to extract all substrings that begin with M and are terminated by a *.
Take the string below as an example:
vec<-c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
This would ideally return:
MGMTPRLGLESLLE
MTPRLGLESLLE
I have tried the code below:
regmatches(vec, gregexpr('(?<=M).*?(?=\\*)', vec, perl=T))[[1]]
but this drops the first M and only returns the first string rather than all substrings within:
"GMTPRLGLESLLE"
You can use
(?=(M[^*]*)\*)
See the regex demo. Details:
(?= - start of a positive lookahead that matches a location that is immediately followed with:
(M[^*]*) - Group 1: M, zero or more chars other than a * char
\* - a * char
) - end of the lookahead.
See the R demo:
library(stringr)
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- stringr::str_match_all(vec, "(?=(M[^*]*)\\*)")
unlist(lapply(matches, function(z) z[,2]))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
If you prefer a base R solution:
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- regmatches(vec, gregexec("(?=(M[^*]*)\\*)", vec, perl=TRUE))
unlist(lapply(matches, tail, -1))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
This could instead be done with a for loop over a character array converted from your string.
When you encounter an M, you start concatenating characters to a new string until you encounter a *. When you do encounter a *, you push the new string to an array of strings and start over from the first step, until you reach the end of the loop.
It's not quite as interesting as using a regex to do it, but it's failsafe.
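A minimal C++ sketch of such a loop follows. It is one possible reading of the description, not the answerer's code: the function name extract_m_to_star is made up, and the scan is restarted at every M (rather than only after a *) so that the overlapping substrings from the question are both reported.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: for every 'M' in s, collect the text from that 'M'
// up to (not including) the next '*'.
std::vector<std::string> extract_m_to_star(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] != 'M') continue;

        // Look for the terminating '*' after this 'M'.
        std::size_t star = s.find('*', i + 1);

        // No '*' left: neither this 'M' nor any later one can be terminated.
        if (star == std::string::npos) break;

        out.push_back(s.substr(i, star - i));
    }
    return out;
}
```

Called on the string from the question, this yields MGMTPRLGLESLLE and MTPRLGLESLLE; the final MIRVASQ is skipped because no * follows it.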
A pattern that consumes its matches cannot return overlapping matches, because the search for the next match resumes only after the end of the previous one.
For example, stringr::str_extract_all("abaca", "a[^a]*a") only gives you aba, but not the overlapping aca.
The first M was dropped because (?<=M) is a positive lookbehind, which by definition is not part of the match but sits just before it.

Tracing a recursive function for longest common substring

I am getting confused tracing the following recursive approach to finding the longest common substring. The last two lines are where my confusion is. Specifically, how does the count variable get the answer when the characters of both strings match? In the last line, which "count" is meant: the count from the function definition or the updated count from the function call? Are there any resources for a better understanding of recursion?
int recursive_substr(string a, string b, int m, int n, int count){
    if (m == -1 || n == -1) return count;
    if (a[m] == b[n]) {
        count = recursive_substr(a,b,m-1,n-1,++count);
    }
    return max(count,max(recursive_substr(a,b,m,n-1,0),recursive_substr(a,b,m-1,n,0)));
}
The first thing to understand is what values to use for the parameters the very first time you call the function.
Consider the two following strings:
std::string a = "helloabc";
std::string b = "hello!abc";
To figure out the length of the longest common substring, you can call the function this way:
int length = recursive_substr(a, b, a.length()-1, b.length()-1, 0);
So, m begins as the index of the last character in a, and n begins as the index of the last character in b. count begins as 0.
During execution, m represents the index of the current character in a, n represents the index of the current character in b, and count represents the length of the current common substring.
Now imagine we're in the middle of the execution, with m=4 and n=5 and count=3.
We are here:
a = "helloabc"
         ^ m
b = "hello!abc"    count = 3
          ^ n
We just saw the common substring "abc", which has length 3, and that is why count=3. Now, we notice that a[m] == 'o' != '!' == b[n]. So, we know that we can't extend the common substring "abc" into a longer common substring. We make a note that we have found a common substring of length 3, and we start looking for another common substring between "hello" and "hello!". Since 'o' and '!' are different, we know that we should exclude at least one of the two. But we don't know which one. So, we make two recursive calls:
count1 = recursive_substr(a,b,m,n-1,0); // length of longest common substring between "hello" and "hello"
count2 = recursive_substr(a,b,m-1,n,0); // length of longest common substring between "hell" and "hello!"
Then, we return the maximum of the three lengths we've collected:
the length count==3 of the previous common substring "abc" we had found;
the length count1==5 of the longest common substring between "hello" and "hello";
the length count2==4 of the longest common substring between "hell" and "hello!".
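To make those three candidates visible in code, here is a minimal sketch that keeps the logic of the function from the question but splits its last line into named locals. The name recursive_substr_traced is made up for this sketch, and the strings are passed by const reference only to avoid copying them on every call.

```cpp
#include <algorithm>
#include <string>

// Same logic as recursive_substr in the question, with the final
// expression split into named parts.
int recursive_substr_traced(const std::string& a, const std::string& b,
                            int m, int n, int count) {
    // Ran off the front of either string: report the run collected so far.
    if (m == -1 || n == -1) return count;

    if (a[m] == b[n]) {
        // Extend the current run by one character. The local variable
        // `count` is overwritten with whatever the deeper call returns.
        count = recursive_substr_traced(a, b, m - 1, n - 1, count + 1);
    }

    // Start a fresh run (count = 0), dropping one character from b or from a.
    int count1 = recursive_substr_traced(a, b, m, n - 1, 0);
    int count2 = recursive_substr_traced(a, b, m - 1, n, 0);

    // `count` here is the local variable, possibly updated by the branch above.
    return std::max({count, count1, count2});
}
```

With a = "helloabc" and b = "hello!abc", calling it with m=4, n=5, count=3 reproduces the situation traced above and returns 5, the larger of 3, 5 and 4. The full call with m=7, n=8, count=0 returns 5 as well. Like the original, this version runs in exponential time, so it is only suitable for short strings.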

How to match phonemic transcriptions with a single vowel except if a condition applies

I have phonemic transcriptions of English words such as these:
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
I'd like to match mono-syllabic words, i.e., words that contain a single vowel. My set of phonemic vowels is this:
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
Using str_count and the vector vowel as pattern, I'm able to match a fairly good set of words:
library(stringr)
test[str_count(test, vowel) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "wɒznt" "ðeər" "dɪdnt"
However, wɒznt and dɪdnt can be seen as bi-syllabic (as the n sound can replace a vowel, so that nt counts as a second vowel). So the question is, how can I match mono-syllabic words except those that end in nt?
What I've tried so far is this set operation, which works well but looks clumsy:
setdiff(test[str_count(test, vowel) == 1], test[str_count(test, paste0("[^", vowel, "]nt$")) == 1])
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
I'd much rather have a single more concise regex. Any ideas?
You can use
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
library(stringr)
p <- paste0("^(?!.*(?<!",vowel,")nt$)(?:(?!",vowel,").)*(?:",vowel,")(?:(?!",vowel,").)*$")
test[str_detect(test, p)]
## => [1] "kɑːnt" "ʧeɪnʤd" "ðeər"
See the online R demo. See the regex demo. The pattern means
^ - start of string
(?!.*(?<!",vowel,")nt$) - immediately to the right, there must not be any 0+ chars other than line break chars as many as possible followed with nt (not preceded with any of the specified vowel sound sequences) and end of string
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
(?:",vowel,") - any of the specified vowel sound sequences
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
$ - end of string.
This is a somewhat concise solution (thanks to @G5W for the decisive hint):
vowel_cc <- paste0(unique(unlist(strsplit(gsub("\\|", "", vowel), ""))), collapse = "")
vowel_cc
[1] "iːaɪɔəʊɛeuɑɜɒʌæ"
test[str_count(test, paste0(vowel, "|[^", vowel_cc, "]+nt$")) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
This solution uses a string vowel_cc consisting of all unique characters in vowel. These serve as input for a negated character class. The pattern specifies nt as one of the vowel alternatives, on the condition that it is preceded by one or more characters not in vowel_cc and occurs at the end of the string.

Match only parenthesis with text and numbers in R

I would like to replace the parentheses and the text between them in string variables. However, I only want to replace those parentheses with at least one number in them.
Example string:
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
I tried the following:
str_extract_all(text, " *\\(.*?\\d+.*?\\) *")
It does extract the text in parenthesis, but in the first one, it matches also the first parenthesis without any number.
The extraction should look like:
" (G3)"
" (3 Jahre)"
" (< 2 Jahre)"
If you want to replace these terms in parentheses, containing at least one number, then gsub is a good base R option:
text
[1] "Sekretär (dipl.) (G3)" "Zolldeklarant (3 Jahre)" "Grenzwächter (< 2 Jahre)"
sapply(text, function (x) {
    gsub("\\([^()]*\\d[^()]*\\)", "REMOVED", x)
})
[1] "Sekretär (dipl.) REMOVED" "Zolldeklarant REMOVED" "Grenzwächter REMOVED"
I have replaced with the literal text REMOVED just as a placeholder to show the replacement.
Edit:
If you just want to extract these terms, we can also use gsub for this:
sapply(text, function (x) {
    gsub(".*(\\([^()]*\\d[^()]*\\)).*", "\\1", x)
})
[1] "(G3)" "(3 Jahre)" "(< 2 Jahre)"
Here, we capture the term in parentheses, then replace the entire string with just the first (and only) capture group \\1.
You can use
\([^()]*\d+[^()]*\)
See a demo on regex101.com.
Backslashes need to be double escaped in R, so your expression would become
\\([^()]*\\d+[^()]*\\)
Broken down this is
\(       # (
[^()]*   # not ( nor ), 0+ times
\d+      # digits, 1+
[^()]*   # same as above
\)       # )
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
gsub(".*\\((.*[0-9].*)\\).*","(\\1)",text)
Basically, you ask gsub to match the whole string, but to capture as a group (\1) the text inside a pair of parentheses that includes a number. Only that group, wrapped in parentheses again, is kept.
Note that if you always want to extract the last parenthesized term, that would call for a different approach.

How to Flag Floats in a vector of Floats and Chars in Julia

I have a mixed vector of floats and characters that I'm streaming from a text file. This vector is being read in as a string. My problem is that I want to parse only the floats and ignore the characters. How can I do this?
v = "Float_or_Char"
if isblank(v) == false # <-- v might be blank as well
Parse(Float64,v) # <-- only if v is a Float (how do I do this?)
end
Supposing x is a vector of strings, some of which are floats-as-strings and the rest are actual strings, you could do something like
for i in 1:length(x)
    f = NaN
    try
        f = parse(Float64, x[i])  # on Julia 0.3 and earlier: float(x[i])
        println("$i is a float")
    catch
        println("$i isn't a float")
    end
end
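If you are using Julia 0.4 (not yet released at the time of writing), you could get really fancy if you just wanted the floats from x, using the new Nullable type and the new method tryparse:
maybe_floats = map(s->tryparse(Float64,s), x)
floats = map(get, filter(n->!isnull(n), maybe_floats))
Note that from Julia 1.0 on, Nullable is gone and tryparse returns nothing on failure, so the second line becomes:
floats = filter(f -> f !== nothing, maybe_floats)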

Resources