How to limit character repetition in a word to 2? - r

I want to remove characters that repeat more than twice in a word. For example
"hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day"
to
"hhaappyy mmoorning friendss, good good day"
I have tried something like this, but it is not reducing to exactly 2 repetitions.
gsub('([[:alpha:]])\\1{2}', '\\1',
'hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day')
#[1] "hhappyyy mmoorning friendsssss, good god day"
Thank you.

You need to use {2,} quantifier and use two \1 in the replacement:
s<-'hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day'
gsub('([[:alpha:]])\\1{2,}', '\\1\\1', s)
# => [1] "hhaappyy mmoorning friendss, good good day"
See the R demo.
The ([[:alpha:]])\\1{2,} pattern matches and captures a letter into Group 1 and then 2 or more repetitions of the same char are matched. Two \1 in the replacement pattern replace the whole match with 2 occurrences of the char. It is valid to use two \1 placeholders because every match is at least 3 identical chars.
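For comparison, the same capture-and-backreference idea can be sketched in Python, whose re module uses the same \1 syntax. This is only an illustrative port and not part of the answer above: the name limit_repeats is made up here, and [A-Za-z] stands in for [[:alpha:]] on the assumption that the input is plain ASCII.

```python
import re


def limit_repeats(text: str) -> str:
    """Collapse runs of 3 or more identical ASCII letters down to exactly 2."""
    # ([A-Za-z]) captures one letter; \1{2,} requires at least two more copies,
    # so every match is at least three identical letters long.
    return re.sub(r'([A-Za-z])\1{2,}', r'\1\1', text)


s = 'hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day'
print(limit_repeats(s))
```

Runs of one or two letters never match, so they pass through unchanged.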

Same as Wiktor Stribiżew's answer, but in JavaScript, and it limits every repeated character (digits and punctuation too), in case you need that.
var sInput = "hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day";
var sOutput = sInput.replace(/(.)\1{2,}/g, "$1$1");
console.log(sOutput);

fwiw, here is another solution:
f = function(x){
x = strsplit(x, '')[[1]]
x = rle(x)
x = rep(x$values, pmin(2, x$lengths))
paste(x, collapse='')
}
example:
x = "hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day"
f(x)
[1] "hhaappyy mmoorning friendss, good good day"
however, gsub is a little easier...
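The run-length idea above (split into characters, measure each run, cap it at 2) can also be sketched in Python with itertools.groupby, which plays the role of rle here. The function name cap_runs and its limit parameter are hypothetical, chosen only for this illustration.

```python
from itertools import groupby


def cap_runs(text: str, limit: int = 2) -> str:
    """Shorten every run of identical characters to at most `limit` characters."""
    # groupby yields (character, run) for each maximal run of equal characters.
    return ''.join(
        ch * min(limit, sum(1 for _ in run))
        for ch, run in groupby(text)
    )


x = "hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day"
print(cap_runs(x))
```

Like the rle version, this treats every character the same way, so repeated spaces or punctuation would be capped as well.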

package test.com;
public class limitCharCount {
    public static void main(String[] args) {
        String str = "gggkjjkjkjjjjjsssslklkkkkkk";
        char[] ch = str.toCharArray();
        StringBuilder result = new StringBuilder();
        // Visit every character, including the last one.
        for (int i = 0; i < ch.length; i++) {
            // Keep a character unless it equals both of the two characters before it.
            if (i < 2 || !(ch[i] == ch[i - 1] && ch[i] == ch[i - 2])) {
                result.append(ch[i]);
            }
        }
        System.out.println(result);
    }
}
Output: ggkjjkjkjjsslklkk

Related

Edit distance leetcode

I am working on the Edit Distance problem. Before moving to the DP approach I am trying to solve it recursively, but I have a logical error somewhere. Please help.
Here is my code:
class Solution {
public int minDistance(String word1, String word2) {
int n=word1.length();
int m=word2.length();
if(m<n)
return Solve(word1,word2,n,m);
else
return Solve(word2,word1,m,n);
}
private int Solve(String word1,String word2,int n,int m){
if(n==0||m==0)
return Math.abs(n-m);
if(word1.charAt(n-1)==word2.charAt(m-1))
return 0+Solve(word1,word2,n-1,m-1);
else{
//insert
int insert = 1+Solve(word1,word2,n-1,m);
//replace
int replace = 1+Solve(word1,word2,n-1,m-1);
//delete
int delete = 1+Solve(word1,word2,n-1,m);
int max1 = Math.min(insert,replace);
return Math.min(max1,delete);
}
}
}
Here I compare the last characters of both strings. If they are equal, I simply move both strings to n-1 and m-1 respectively.
Otherwise I have three cases (insertion, deletion and replacement), and I have to take the minimum of the three.
If I replace the character, I simply move to n-1 and m-1.
If I insert a character, my reasoning is that I should insert it at the end of the shorter string and move the pointers to n-1 and m.
To delete a character, I think I should delete it from the longer string, which is why I move the pointers to n-1 and m. I think I am making a mistake here, please help.
LeetCode gives me a wrong answer for word1 = "plasma" and word2 = "altruism".
The problem is that the recursive expression for the insert-case is the same as for the delete-case.
Reasoning further, it turns out the one for the insert-case is wrong. In that case we choose to resolve the letter in word2 (at index m-1) through insertion, so it should not be considered any more during the recursive process. On the other hand the considered letter in word1 could still be matched with another letter in word2, so that letter should still be considered during the recursive process.
That means that m should be decremented, not n.
So change:
int insert = 1+Solve(word1,word2,n-1,m);
to:
int insert = 1+Solve(word1,word2,n,m-1);
...and it will work. What remains is to add memoization to make it efficient.
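To make the memoization step concrete, here is a minimal Python sketch of the corrected recursion with a cache on the index pair (n, m). It mirrors the structure of the Java code but is not a translation of anyone's submitted solution; the names min_distance and solve are chosen for this example, and the plain recursion is only meant for short inputs, since long strings could hit Python's recursion limit.

```python
from functools import lru_cache


def min_distance(word1: str, word2: str) -> int:
    """Edit distance via top-down recursion on prefix lengths, memoized."""

    @lru_cache(maxsize=None)
    def solve(n: int, m: int) -> int:
        # One prefix is empty: the remaining characters are all inserts or deletes.
        if n == 0 or m == 0:
            return abs(n - m)
        if word1[n - 1] == word2[m - 1]:
            return solve(n - 1, m - 1)
        insert = 1 + solve(n, m - 1)       # resolve word2's last letter by insertion
        replace = 1 + solve(n - 1, m - 1)  # replace word1's last letter
        delete = 1 + solve(n - 1, m)       # drop word1's last letter
        return min(insert, replace, delete)

    return solve(len(word1), len(word2))


print(min_distance("plasma", "altruism"))
```

Caching on (n, m) keeps the number of distinct subproblems at most (len(word1) + 1) * (len(word2) + 1).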
A clean Python solution based on memoized recursion (top-down DP):
from functools import cache
class Solution:
    def minDistance(self, word1: str, word2: str) -> int:
        return self.edit_distance(word1, word2)
    @cache
    def edit_distance(self, s, t):
        # Edge conditions
        if len(s) == 0:
            return len(t)
        if len(t) == 0:
            return len(s)
        # If 1st char matches
        if s[0] == t[0]:
            return self.edit_distance(s[1:], t[1:])
        else:
            return min(
                1 + self.edit_distance(s[1:], t),  # delete
                1 + self.edit_distance(s, t[1:]),  # insert
                1 + self.edit_distance(s[1:], t[1:])  # replace
            )

find frequency of substring in a set of strings

I have as input a gene list where each gene has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG (number of occurrences divided by the length of the gene). The string should only be counted if it starts at position 1, 4, 7, 10, etc. (every third position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinking of something like this, but I don't know how to express the position requirement:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)
Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurrence. Then we create a column condition to test whether the start position is one of 1, 4, 7, 10, 13, 16, 19, 22, ...
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
  mutate(condition = start %in% seq(1, nchar(gene1), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following @Sobrique's comment, if "divided by length" means the number of occurrences that satisfy the condition divided by the total number of characters in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148
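As a cross-check of the approach, the in-frame counting can be sketched in Python by stepping through the sequence three characters at a time. This is an illustration under assumptions, not the method used above: the helper names parse_fasta and in_frame_frequency are invented here, and the record name is taken to be the first whitespace-separated token of the header line.

```python
def parse_fasta(text: str) -> dict:
    """Return {record name: sequence} from FASTA text (name = first header token)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line  # sequences may span several lines
    return records


def in_frame_frequency(seq: str, codon: str = "GTG") -> float:
    """Count `codon` at 1-based positions 1, 4, 7, ... and divide by len(seq)."""
    if not seq:
        return 0.0
    hits = sum(1 for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] == codon)
    return hits / len(seq)


example = ">g1 some description\nATGGTG\nAAA\n>g2\nAGTGAA\n"
for gene, seq in parse_fasta(example).items():
    print(gene, in_frame_frequency(seq))
```

Stepping by 3 from index 0 corresponds to the 1-based start positions 1, 4, 7, 10 from the question, so no separate position test is needed.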
Here's a Perl solution that does as you ask.
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like the 3 and 6.3 that you say you're expecting.
This program expects the path to the input file as a parameter on the command line.
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021
Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filtered out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.
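The capture-group trick is not specific to Perl. As a rough Python illustration (an analogue, not part of the answer above), re.split also returns the captured delimiters, so the same three-character chunking can be reproduced; the helper name count_triplet is made up for this sketch.

```python
import re

# Splitting on a captured three-character "delimiter" keeps the delimiters,
# with empty strings in between, just like the Perl example.
parts = re.split(r'(...)', '1234567890')
print(parts)  # ['', '123', '', '456', '', '789', '0']


def count_triplet(line: str, triplet: str = 'GTG') -> int:
    """Count in-frame pieces of `line` that are exactly `triplet`."""
    return sum(1 for piece in re.split(r'(...)', line) if piece == triplet)


print(count_triplet('ATGGTGAAA'))
```

Comparing each piece against the triplet discards the empty strings and any short trailing remainder automatically.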
Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my @hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar @hits), "\n";
}
This doesn't give the same results as your input though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my @hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . @hits . "/". length ( $gene_str ), " = ", @hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378
Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
input_gene_list <- list(gene1 = "GTGGGGGTTTGTGGGGGTG", gene2 = "GTGGGGGTTTGTGGGGGTG", gene3 = "GTGGGGGTTTGTGGGGGTG")
gene_counter <- function(gene) {
  # start position of every GTG, overlapping ones included; -1 if there is none
  starts <- unlist(gregexpr("(?=GTG)", gene, perl = TRUE))
  # count only the matches that start at position 1, 4, 7, ...
  counter <- sum(starts > 0 & starts %% 3 == 1)
  counter / nchar(gene)
}
output_list <- lapply(input_gene_list, gene_counter)
result <- t(as.data.frame(output_list))
result
[,1]
gene1 0.05263158
gene2 0.05263158
gene3 0.05263158
Also share your thoughts on it! Thanks!

How can I manipulate a string and eliminate the character * or #?

In R, how can I manipulate a string and eliminate the character * or #? For example, for "ALL8606#057R0" I tried RFC_corr[5] = str_split(RFC[5],split= "#",fixed=true)
As tospig suggested:
> sub("#", "", "ALL8606#057R0")
[1] "ALL8606057R0"
Edit for your comment below: to apply this to a vector you don't need a loop; you can just use the vector of interest when calling the function:
> x <- c("vect#or", "th-at#", "ha%s", "weir*d", "stu+ff")
> gsub("[-+%*#]", "", x)
[1] "vector" "that" "has" "weird" "stuff"
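For readers who want to see the character-class idea outside R, here is a small Python sketch of the same substitution. It is only an analogue of the gsub call above; the variable names are arbitrary.

```python
import re

x = ["vect#or", "th-at#", "ha%s", "weir*d", "stu+ff"]

# The leading '-' inside the brackets is a literal hyphen, as in the R pattern.
cleaned = [re.sub(r'[-+%*#]', '', word) for word in x]
print(cleaned)

# For fixed single characters, str.translate avoids regular expressions entirely.
table = str.maketrans('', '', '*#')
print("ALL8606#057R0".translate(table))
```

Either form processes a whole list in one pass, so no explicit index loop is needed.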
The simplest way to do this by hand is to loop over each character of the string, skip it when it is "#" or "*", and copy every other character into a new buffer:
string userstr = "ALL8606#057R0";
char[] copystr = new char[userstr.Length];
int i = 0;
foreach (char c in userstr)
{
    // keep the character only if it is neither '#' nor '*'
    if (c != '#' && c != '*')
    {
        copystr[i] = c;
        i++;
    }
}
string cleaned = new string(copystr, 0, i);
Hope this code helps you solve the problem. Note that Length is a property in C#, so it is written userstr.Length without parentheses.
Bye and Happy Coding.

Longest substring in alphabetical order [closed]

Write a program that prints the longest substring of s in which the letters occur in alphabetical order. For example, if s = 'azcbobobegghakl', then your program should print
Longest substring in alphabetical order is: beggh
In the case of ties, print the first substring. For example, if s = 'abcbcd', then your program should print
Longest substring in alphabetical order is: abc
Here you go, edX student. I had some help finishing this code:
from itertools import count
def long_sub(input_string):
    maxsubstr = input_string[0:0]  # empty slice (to accept subclasses of str)
    for start in range(len(input_string)):  # O(n)
        for end in count(start + len(maxsubstr) + 1):  # O(m)
            substr = input_string[start:end]  # O(m)
            if len(substr) != (end - start):  # ran past the end of the string
                break
            if sorted(substr) == list(substr):
                maxsubstr = substr
    return maxsubstr
sub = long_sub(s)
print("Longest substring in alphabetical order is: %s" % sub)
These all assume you have a string (s) and need to find the longest substring in alphabetical order.
Option A
test = s[0]  # seed with first letter in string s
best = ''  # empty var for keeping track of longest sequence
for n in range(1, len(s)):  # have s[0] so compare to s[1]
    if len(test) > len(best):
        best = test
    if s[n] >= s[n-1]:
        test = test + s[n]  # add s[1] to s[0] if greater or equal
    else:  # if not, start a new candidate at this letter
        test = s[n]
if len(test) > len(best):  # the last candidate may end at the end of s
    best = test
print("Longest substring in alphabetical order is:", best)
Option B
maxSub, currentSub, previousChar = '', '', ''
for char in s:
    if char >= previousChar:
        currentSub = currentSub + char
        if len(currentSub) > len(maxSub):
            maxSub = currentSub
    else:
        currentSub = char
    previousChar = char
print(maxSub)
Option C
matches = []
current = [s[0]]
for index, character in enumerate(s[1:]):
    if character >= s[index]:
        current.append(character)
    else:
        matches.append(current)
        current = [character]
matches.append(current)  # do not forget the final run
print("".join(max(matches, key=len)))
Option D
def longest_ascending(s):
    matches = []
    current = [s[0]]
    for index, character in enumerate(s[1:]):
        if character >= s[index]:
            current.append(character)
        else:
            matches.append(current)
            current = [character]
    matches.append(current)
    return "".join(max(matches, key=len))
print(longest_ascending(s))
The following code solves the problem using the reduce method:
from functools import reduce  # reduce is no longer a builtin in Python 3
solution = ''
def check(substr, char):
    global solution
    last_char = substr[-1]
    substr = (substr + char) if char >= last_char else char
    if len(substr) > len(solution):
        solution = substr
    return substr
def get_largest(s):
    global solution
    solution = s[:1]  # a single character is already a valid answer
    reduce(check, list(s))
    return solution

How many valid parenthesis combinations?

We have:
n1 pairs of {} brackets,
n2 pairs of () brackets,
n3 pairs of [] brackets.
How many different valid combinations of these brackets can we have?
What I tried: I wrote brute-force code in Java (shown below) that counts all possible combinations. I know it is the worst possible solution
(the code handles the general case, with any number of bracket types).
Is there a mathematical approach?
Note 1: a valid combination is defined as usual, e.g. {{()}} is valid and {(}){} is invalid.
Note 2: assume we have 2 pairs of {}, 1 pair of () and 1 pair of []; then the number of valid combinations would be 168, and the number of all possible (valid and invalid) combinations would be 840.
static void paranthesis_combination(char[] open , char[] close , int[] arr){
int l = 0;
for (int i = 0 ; i < arr.length ; i++)
l += arr[i];
l *= 2;
paranthesis_combination_sub(open , close , arr , new int[arr.length] , new int[arr.length], new StringBuilder(), l);
System.out.println(paran_count + " : " + valid_paran_count);
return;
}
static void paranthesis_combination_sub(char[] open , char[] close, int[] arr , int[] open_so_far , int[] close_so_far, StringBuilder strbld , int l){
if (strbld.length() == l && valid_paran(open , close , strbld)){
System.out.println(new String(strbld));
valid_paran_count++;
return;
}
for (int i = 0 ; i < open.length ; i++){
if (open_so_far[i] < arr[i]){
strbld.append(open[i]);
open_so_far[i]++;
paranthesis_combination_sub(open , close, arr , open_so_far , close_so_far, strbld , l);
open_so_far[i]--;
strbld.deleteCharAt(strbld.length() -1 );
}
}
for (int i = 0 ; i < open.length ; i++){
if (close_so_far[i] < open_so_far[i]){
strbld.append(close[i]);
close_so_far[i]++;
paranthesis_combination_sub(open , close, arr , open_so_far , close_so_far, strbld , l);
close_so_far[i]--;
strbld.deleteCharAt(strbld.length() -1 );
}
}
return;
}
C_n is the nth Catalan number, C(2n,n)/(n+1), and gives the number of valid strings of length 2n that use only (). So if we change all [] and {} into (), there would be C_{n1+n2+n3} ways. Then there are C(n1+n2+n3,n1) ways to change n1 of the () pairs back to {}, and C(n2+n3,n3) ways to change n3 of the remaining () pairs into []. Putting that all together, there are C(2n1+2n2+2n3,n1+n2+n3) * C(n1+n2+n3,n1) * C(n2+n3,n3) / (n1+n2+n3+1) ways.
As a check, when n1=2, n2=n3=1, we have C(8,4)C(4,2)C(2,1)/5=168.
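The closed form is easy to sanity-check numerically. The sketch below evaluates it with math.comb and compares it against a small brute-force enumeration that keeps a stack of open brackets; both function names are invented for this illustration, and the brute force is only practical for a handful of pairs.

```python
from math import comb


def count_formula(n1: int, n2: int, n3: int) -> int:
    """Catalan(n1+n2+n3) times the number of ways to assign bracket types to pairs."""
    n = n1 + n2 + n3
    return comb(2 * n, n) * comb(n, n1) * comb(n2 + n3, n3) // (n + 1)


def count_brute_force(n1: int, n2: int, n3: int) -> int:
    """Count valid strings by trying every legal next character."""
    remaining = [n1, n2, n3]  # unopened pairs of each bracket type

    def walk(stack: list) -> int:
        if not stack and not any(remaining):
            return 1  # everything opened and closed: one complete valid string
        total = 0
        for kind in range(3):  # open a new bracket of any type still available
            if remaining[kind]:
                remaining[kind] -= 1
                stack.append(kind)
                total += walk(stack)
                stack.pop()
                remaining[kind] += 1
        if stack:  # or close the most recently opened bracket
            top = stack.pop()
            total += walk(stack)
            stack.append(top)
        return total

    return walk([])


print(count_formula(2, 1, 1), count_brute_force(2, 1, 1))
```

Only the bracket on top of the stack may be closed, which is exactly the usual validity rule, so every path through walk corresponds to one distinct valid string.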
In general, infinitely many. However, I assume you meant to ask how many combinations there are for a limited string length. For simplicity, let's assume that the limit is an even number. Then let's create an initial string:
(((...()...))) with length equal to the limit.
Then we can switch any () pair to [] or {}. However, if we change an opening bracket, we must also change the matching closing bracket. So we can look only at the opening brackets, or at pairs. For each pair we have 4 options:
leave it unchanged
change it to []
change it to {}
remove it
So, for each of the (l/2) pairs we choose one of four labels, which gives:
4^(l/2) possibilities.
EDIT: this assumes only "concentric" bracket strings (each pair contained in another), as you've suggested in your edit. Intuitively, however, ()[]{} is also a valid combination, and this solution does not take that into account.

Resources