Rlibstree (getLongestCommonSubstring) generates redundant character - r

I am using the Rlibstree package, version 0.3-2, and its getLongestCommonSubstring function.
My character strings contain only the digits 0-9 and >; they look like this:
String A:
0113>0213>0212>0312>0411>0611>0711>0812>1012>1112>1212>1412>1313>1413>1412>1311>1211>1212>1012>1013>0912>0812>0712>0513>0612>0511>0410>0309>0209>0308>0207>0107>0007>0109>0010>0110>0010>0008>0007>0106>0105>0204>0304>0503>0603>0701>0801>0802>0803>0904>1003>1002>1001>1002>1103>1004>0904>0803>0802>0701>0702>0603>0503>0403>0303>0204>0105>0104>0203>0302>0401>0302>0203>0204>0104>0105>0106>0107>0307>0308>0409>0410>0311>0212>0113>0213>0113>0213
String B:
0113>0213>0212>0312>0411>0511>0410>0409>0308>0307>0207>0107>0108>0109>0010>0110>0010>0009>0107>0207>0307>0308>0309>0209>0309>0410>0411>0611>0711>0812>0912>1012>1112>1212>1412>1313>1412>1212>1112>1012>1013>0912>0812>0612>0613>0513>0612>0611>0511>0411>0312>0213>0113>0213>0113>0212>0311>0411>0312>0213>0212>0311>0312>0311>0411>0410>0409>0308>0307>0207>0107>0106>0105>0204>0304>0503>0604>0603>0602>0601>0701>0801>0802>0803>0804>0904>1004>1003>1002>1001>1002>1001>1003>1004>0904>0803>0802>0801>0701>0602>0604>0504>0404>0304>0104>0105>0107>0108>0109>0108>0107>0207>0308>0409>0410>0311>0212>0213
String C:
0113>0213>0113>0213>0113>0213>0212>0311>0411>0611>0812>0912>1012>1212>1312>1412>1413>1314>1313>1213>1413>1412>1411>1311>1212>1011>0911>0811>0712>0611>0411>0410>0409>0309>0209>0309>0408>0410>0510>0611>0712>0611>0511>0411>0311>0310>0409>0309>0307>0207>0108>0109>0110>0010>0109>0108>0107>0006>0106>0105>0204>0203>0303>0204>0203>0302>0401>0402>0401>0302>0203>0304>0404>0504>0503>0604>0705>0605>0705>0604>0505>0504>0603>0503>0403>0303>0203>0104>0105>0005>0107>0108>0109>0108>0107>0207>0107>0106>0104>0204>0304>0404>0504>0603>0604>0603>0503>0504>0603>0702>0701>0801>0802>0804>0904>1004>1003>1002>1001>1002>1003>1104>1205>1304>1303>1403>1404>1403>1304>1205>1104>0904>0804>0802>0801>0701>0602>0703>0604>0704>0602>0701>0601>0602>0603>0504>0404>0303>0203>0204>0105>0106>0107>0207>0308>0408>0409>0308>0309>0409>0410>0411>0511>0611>0812>0912>1012>1112>1012>0912>1013>1012>1112>1212>1312>1313>1213>1313>1312>1412>1313>1312>1413>1313>1213>1313>1312>1112>1012>0911>1011>1112>1312>1412>1312>1413>1313>1312>1212>1112>0911>0811>0711>0511>0411>0312>0212>0312>0411>0511>0611>0612>0413>0513>0612>0611>0411>0312>0212>0213>0212>0213>0113>0213>0113
I want to compare my input string with String A.
See the example below.
If I compare A and B there is no problem: it finds two longest common substrings.
getLongestCommonSubstring(c(A,B))
[1] "07>0106>0105>0204>0304>0503>060" "12>1012>1112>1212>1412>1313>141"
But if I compare A and C, something goes wrong. As you can see in the results below,
I get \xc1 or ! at the end, and these stray characters change from run to run.
First run:
getLongestCommonSubstring(c(A,C))
[1] "04>1003>1002>1001>1002>1\xc1" ">0603>0503>0403>0303>020!"
Second run:
getLongestCommonSubstring(c(A,C))
[1] "04>1003>1002>1001>1002>11" ">0603>0503>0403>0303>020!"
Third run:
getLongestCommonSubstring(c(A,C))
[1] "04>1003>1002>1001>1002>1\xc1" ">0603>0503>0403>0303>020\xc1"
With these stray characters (or escape sequences) in the string, I cannot rely on functions such as nchar(); the characters are redundant and annoying.
As far as I can tell, the only difference between B and C is their length. Their format is the same, so I really cannot figure out why this happens.
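No fix for the package itself is given in this thread. As a stopgap, one could check each returned candidate against the inputs and drop trailing bytes until the candidate really occurs in every input. The sketch below uses base R only; trim_to_common is a hypothetical helper (it is not part of Rlibstree), and it assumes the stray bytes only ever appear at the end of a result, as in the output above.

```r
# Hypothetical helper, not part of Rlibstree: drop trailing bytes from a
# candidate until it occurs literally in every reference string.
trim_to_common <- function(candidate, refs) {
  bytes <- charToRaw(candidate)
  while (length(bytes) > 0) {
    s <- rawToChar(bytes)
    # fixed = TRUE matches literally; useBytes = TRUE tolerates invalid bytes
    if (all(grepl(s, refs, fixed = TRUE, useBytes = TRUE))) {
      return(s)
    }
    bytes <- bytes[-length(bytes)]  # remove the last byte and try again
  }
  ""  # nothing in common
}

# Toy stand-ins for A and C (assumed values, much shorter than the real ones)
refs <- c("0403>0303>0204", "0403>0303>0203")
cleaned <- trim_to_common(">0303>020!", refs)
```

With the real data, something like vapply(res, trim_to_common, character(1), refs = c(A, C), USE.NAMES = FALSE) would clean every returned substring, where res is assumed to hold the output of getLongestCommonSubstring. This only hides the symptom; it does not explain why the extra bytes appear.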

Related

RegEx Replace Automatic SQL code creator

I built a script in R that automatically creates a very long and complex SQL query to create a view over similar tables in 5 databases.
Of course there were integration issues to solve. The only one remaining is the problem I am going to present now.
Consider one very long string like
'"/*NOTES*/", "/*TABLE_ID*/", "/*TABLE_SUB_ID*/", "/*TABLE_SUB_SUB_ID*/", "OTHER_COLUMNS",'
My objective is to replace
this string '"/*' with this string '/*'
this string '*/",' with this string '*/'
I tried with:
gsub('"/*', '/*', '"/*NOTES*/", "/*TABLE_ID*/", "/*TABLE_SUB_ID*/", "/*TABLE_SUB_SUB_ID*/", "OTHER_COLUMNS",')
but it returns the string
'/**NOTES*//*, /**TABLE_ID*//*, /**TABLE_SUB_ID*//*, /**TABLE_SUB_SUB_ID*//*, /*OTHER_COLUMNS/*,'
whereas my expected output is the following string:
'/*NOTES*/ /*TABLE_ID*/ /*TABLE_SUB_ID*/ /*TABLE_SUB_SUB_ID*/ "OTHER_COLUMNS",'
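Note that the * is meant literally: /* and */ mark the start and end of a comment when the string is run by a SQL compiler.
In a regex, * is a quantifier, so it has to be escaped to match literally, and inside an R string that escape takes two backslashes. The following strips the quotes around the comment markers (the commas are kept):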
gsub('"?(/\\*|\\*/)"?', '\\1', '"/*NOTES*/", "/*TABLE_ID*/", "/*TABLE_SUB_ID*/", "/*TABLE_SUB_SUB_ID*/", "OTHER_COLUMNS",')
# [1] "/*NOTES*/, /*TABLE_ID*/, /*TABLE_SUB_ID*/, /*TABLE_SUB_SUB_ID*/, \"OTHER_COLUMNS\","
FYI, double-backslashes are required for most, but the following are legitimate single-backslash special characters:
'\a\b\f\n\r\t\v'
# [1] "\a\b\f\n\r\t\v"
'\u0101' # unicode, numbers are variable
# [1] "ā"
'\x0A' # hex, hex-numbers are variable
# [1] "\n"
Perhaps there are more; I didn't find the authoritative list, though I'm sure it's in the documentation somewhere.
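Since both targets in the question are literal strings rather than patterns, another option is to skip regex escaping entirely with fixed = TRUE. The sketch below is one possible way to reproduce the expected output from the question, including the removal of the commas after the comment markers; the variable names x, step1, and result are only illustrative.

```r
x <- '"/*NOTES*/", "/*TABLE_ID*/", "/*TABLE_SUB_ID*/", "/*TABLE_SUB_SUB_ID*/", "OTHER_COLUMNS",'

# fixed = TRUE treats the pattern as a plain string, so * needs no escaping
step1  <- gsub('"/*', '/*', x, fixed = TRUE)        # drop the quote before /*
result <- gsub('*/",', '*/', step1, fixed = TRUE)   # drop the quote and comma after */

cat(result, "\n")
# /*NOTES*/ /*TABLE_ID*/ /*TABLE_SUB_ID*/ /*TABLE_SUB_SUB_ID*/ "OTHER_COLUMNS",
```

The quoted "OTHER_COLUMNS", entry is left alone because it contains neither literal target.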

How to extract characters from a string based on the text surrounding them in R

I'm using the R language. I have many large lists of character strings that share a similar format. I am interested in the characters directly in front of a sequence of characters that always occurs in the string, but not at a consistent position within it. For instance:
a <- "aabbccddeeff"
b <- "aabbddff"
c <- "aabbffgghhii"
d <- "bbffgghhii"
I am interested in extracting the two characters directly preceding the "ff" in each character string. I can't find any reasonable solution apart from breaking each character string down using grepl() and then processing them each independently, which seems like an inefficient way to do it.
You can match those two characters and capture them with sub and the right regular expression.
Strings = c("aabbccddeeff",
"aabbddff",
"aabbffgghhii",
"bbffgghhii")
sub(".*(\\w\\w)ff.*", "\\1", Strings)
[1] "ee" "dd" "bb" "bb"
Explanation: this replaces the entire string with the two characters before the "ff". If there are multiple "ff" in the string, this expression takes the two characters before the last "ff", because the leading .* is greedy.
How this works: The three arguments to sub are:
1. a pattern to search for
2. What it will be replaced with
3. The strings to apply it to.
Most of the work is in the pattern part, .*(\\w\\w)ff.*. The ff part of the pattern should be obvious: we are targeting things near the specific string ff. What comes right before it is (\\w\\w). \w refers to a "word character", meaning any letter a-z or A-Z, any digit 0-9, or the underscore _. We want two characters, so we have \\w\\w. Enclosing \\w\\w in parentheses turns this pattern of two characters into a "capture group", a string that is saved for later use. Since this is the first (and only) capture group in this expression, those two characters can be referred to as \1.
We want only those two characters, so to blow away everything before and after them we put .* at the front and back. . matches any character and * means zero or more times, so .* means zero or more copies of any character. Now we have broken the string into four parts: "ff", the two characters before "ff", everything before that, and everything after the ff. This covers the entire string.
sub will replace the part that was matched (everything) with whatever the substitution pattern says, in this case "\\1". That is just how you write an R string that evaluates to \1, the reference to the two characters that we captured. We write it that way because a backslash "escapes" whatever comes after it: we actually want the character \, so we write \\ to indicate \, and "\\1" evaluates to \1. So everything in the string is replaced by the targeted two characters. We apply this to every string in the vector Strings.
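To make the greedy behaviour concrete, the sketch below compares the sub() call with a lookahead-based alternative that returns the two characters before the first "ff" instead of the last. The alternative (regexpr with perl = TRUE plus regmatches) is not from the answer above, and the string "aaffbbff" is a made-up example. Note that regmatches silently drops strings with no match, whereas sub returns them unchanged.

```r
Strings <- c("aabbccddeeff", "aabbddff", "aabbffgghhii", "bbffgghhii")

# Greedy .* pushes the match to the LAST "ff"
before_last <- sub(".*(\\w\\w)ff.*", "\\1", Strings)

# Lookahead (?=ff) finds two word characters before the FIRST "ff"
before_first <- regmatches(Strings, regexpr("\\w\\w(?=ff)", Strings, perl = TRUE))

# The two approaches only differ when "ff" occurs more than once
twice <- "aaffbbff"
last_hit  <- sub(".*(\\w\\w)ff.*", "\\1", twice)                                # "bb"
first_hit <- regmatches(twice, regexpr("\\w\\w(?=ff)", twice, perl = TRUE))    # "aa"
```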

Subsetting different length strings by spaces in R

In R, I currently have a long vector of dates and times saved as strings. Depending on the date, a string can be 16, 17, or 18 characters long, so I cannot just take the first 8 or 10 characters, since that would not work for every date. But since there is a space between the date and time values, how can I subset each string so that I only get the characters before the space?
To show what the strings look like, here are a couple of examples:
"4/18/1950 0:00:00"
"6/8/1951 0:00:00"
"11/15/1951 0:00:00"
I'm not sure whether you are familiar with regular expressions; if not, they are worth learning, as they are extremely useful.
As akrun pointed out you can use the "sub" command to remove the space and everything after it like this:
sub(" .*","",stringVar)
First argument is the regular expression code which matches the space and everything that follows.
Second argument is what you want to replace the match with, in this case nothing
Third argument is the input string
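Alternatively, you can split each string at the space and keep the first piece using strsplit. Because strsplit returns a list with one element per input string, pick the first piece of each element rather than indexing the list with [1]:
sapply(strsplit(stringVar, " "), "[", 1)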
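Applied to the three sample strings from the question, the two approaches can be sketched as follows. Here stringVar is filled with those sample values, and vapply is used instead of sapply only as a stylistic choice to guarantee a character result.

```r
stringVar <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00", "11/15/1951 0:00:00")

# Approach 1: delete the first space and everything after it
dates_sub <- sub(" .*", "", stringVar)

# Approach 2: split on the space, then take the first piece of each element
pieces      <- strsplit(stringVar, " ")   # a list, one character vector per string
dates_split <- vapply(pieces, `[`, character(1), 1)

print(dates_sub)
# [1] "4/18/1950"  "6/8/1951"   "11/15/1951"
```

Indexing the strsplit result with a single bracket, as in pieces[1], returns a one-element list rather than the date text, which is why the per-element extraction is needed.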

how to use grep in R to get the specified character?

I have
str=c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
and I want to get
"00005.profit" "00006.profit"
How can I achieve this using grep in R?
Here is one way:
s <- c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
unique(gsub("([0-9]+\\.profit).*", "\\1", s))
[1] "00005.profit" "00006.profit"
We define a regular expression as digits followed by .profit, which we capture by enclosing the expression in parentheses. The \\1 then recalls the first such capture, and since we recall nothing else, that is what we get. The unique() then reduces the four items to two unique ones.
Dirk's answer is pretty much the ideal generalisable answer, but here are a couple of other options based on the fact that your example always has a - character starting the part you wish to chop off:
1: gsub to return everything prior to the -
gsub("(.+)-.+","\\1",str)
2: strsplit on - and keep only the first part.
sapply(strsplit(str,"-"),head,1)
Both return:
[1] "00005.profit" "00005.profit" "00006.profit" "00006.profit"
which you can then wrap in unique to not return duplicates like:
unique(gsub("(.+)-.+","\\1",str))
unique(sapply(strsplit(str,"-"),head,1))
These will then return:
[1] "00005.profit" "00006.profit"
Another non-generalisable solution would be to just take the first 12 characters (assuming string length for the part you want to keep doesn't change):
unique(substr(str,1,12))
[1] "00005.profit" "00006.profit"
I'm actually interpreting your question differently. I think you might want
grep("[0-9]+\\.profit$",str,value=TRUE)
That is, if you only want the strings that end with profit. The $ special character stands for "end of string", so it excludes cases that have additional characters at the end ... The \\. means "I really want to match a dot, not any character at all" (a . by itself would match any character). You weren't entirely clear about your target pattern -- you might prefer "0+[1-9]\\.profit$" (any number of zeros followed by a single non-zero digit), or even "0{4}[1-9]\\.profit$" (4 zeros followed by a single non-zero digit).
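The three candidate patterns from this answer can be compared directly on the sample data. The sketch below also shows why the dot is escaped; the vector is named s here (as in the first answer) to avoid masking the base function str, and "00005xprofit" is a made-up counterexample.

```r
s <- c("00005.profit", "00005.profit-in", "00006.profit", "00006.profit-in")

# value = TRUE returns the matching strings instead of their indices
any_digits <- grep("[0-9]+\\.profit$", s, value = TRUE)
zeros_then <- grep("0+[1-9]\\.profit$", s, value = TRUE)
four_zeros <- grep("0{4}[1-9]\\.profit$", s, value = TRUE)

# An unescaped dot matches any character, so this made-up string slips through
loose  <- grepl("[0-9]+.profit$", "00005xprofit")    # TRUE
strict <- grepl("[0-9]+\\.profit$", "00005xprofit")  # FALSE
```

On this sample all three patterns keep the same two strings; they would only differ on inputs such as "00015.profit" or "005.profit".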

Replacing all occurrences of a pattern in a string

I am used to running R with numbers and matrices, but when it comes to working with strings and characters I am lost. I want to analyze some data where the time is read into R as follows:
> my.time.char[1]
[1] "\"2011-10-05 15:55:00\""
I want to end up with a string containing only:
"2011-10-05 15:55:00"
Using the function sub() (which I barely understand...), I got the following result:
> sub("(\")","",my.time.char[1])
[1] "2011-10-05 15:55:00\""
This is closer to the format I am looking for, but I still need to get rid of the last two characters (\").
The second line from ?sub explains:
sub and gsub perform replacement of the first and all matches respectively.
which should tell you to use gsub instead.
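A minimal sketch of the difference, assuming my.time.char holds the value shown in the question:

```r
my.time.char <- "\"2011-10-05 15:55:00\""

# sub() replaces only the first match, so the trailing quote survives
first_only <- sub("\"", "", my.time.char[1])

# gsub() replaces every match, removing both quotes
all_quotes <- gsub("\"", "", my.time.char[1])

print(all_quotes)
# [1] "2011-10-05 15:55:00"
```

The \" that R prints is a single character, a double quote shown with its escape, so nchar(all_quotes) is 19 once both quotes are gone.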
