How to remove strings with exceptions in R?

I have a string. I want to (a) keep "/" in fractions, (b) insert whitespace around "/" that are between words, and (c) remove all other "/".
s = "/// // / 1/2 111/222 a/b abc/abc a / b / // ///"
The result should be as follows.
s = "1/2 111/222 a b abc abc a b"
I have tried a few things, but I cannot get everything right.

I'm not a regex expert, but this appears to work on your example.
s = "/// // / 1/2 111/222 a/b abc/abc a / b / // ///"
i <- gsub("/{2,}|/\\s", "", s)
i <- trimws(gsub("([[:alpha:]]{1,})(/)([[:alpha:]]{1,})", "\\1 \\3", i))
i <- gsub("\\s{2,}", " ", i)
identical(i, "1/2 111/222 a b abc abc a b")
[1] TRUE
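The answer does not say what each call does, so here is a minimal sketch that wraps the same three steps in a function with comments. The name clean_slashes and the second test string are made up for illustration; they are not from the original answer.

```r
# Hypothetical helper wrapping the three gsub() steps from the answer above.
clean_slashes <- function(s) {
  # 1. Drop runs of two or more "/" and any "/" followed by whitespace.
  s <- gsub("/{2,}|/\\s", "", s)
  # 2. Replace a "/" between letters with a space; digits are not matched,
  #    so fractions such as 1/2 are left alone.
  s <- gsub("([[:alpha:]]+)/([[:alpha:]]+)", "\\1 \\2", s)
  # 3. Trim both ends and collapse repeated whitespace.
  gsub("\\s{2,}", " ", trimws(s))
}

res1 <- clean_slashes("/// // / 1/2 111/222 a/b abc/abc a / b / // ///")
res2 <- clean_slashes("x/y 3/4 // z")  # made-up second input
print(res1)
print(res2)
```

The order matters: the stray slashes are removed first so that step 2 only ever sees a "/" that sits directly between two tokens.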

Related

Regex for substring matching with space and substitution

I want to join words in one string that are split by spaces, based on the unsplit words in another string (in R).
For example:
s1 = 'this is an example of an undivided string case here'
s2 = 'Please note th is is an un di vid ed case right he r e for you!'
s2 needs to be converted into
s2 = 'Please note this is an undivided case right here for you!'
based on the words in s1 that match consecutive space-separated fragments in s2.
I am new to R and tried gsub with different combinations of '\s', but could not get the desired result.
You may achieve what you need by
removing all whitespace from the string you want to search for (s1) (with gsub("\\s+", "", x)), then
inserting whitespace patterns (\s*) between each pair of chars (using something like sapply(strsplit(unspace(s1), ""), paste, collapse="\\s*")), and then
replacing all the matches with s1 using gsub(pattern, s1, s2).
See the R demo:
s2 = 'Please note th is is an un di vid ed case right he r e for you!'
s1 = 'this is an undivided case right here'
unspace <- function(x) { gsub("\\s+", "", x) }
pattern <- sapply(strsplit(unspace(s1), ""), paste, collapse="\\s*")
gsub(pattern, s1, s2)
## => [1] "Please note this is an undivided case right here for you!"
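To make the intermediate pattern visible, here is a small sketch on a shorter, made-up target string ("right here" and the test sentence are assumptions for illustration). It prints the pattern that the strsplit/paste step builds before it is handed to gsub.

```r
unspace <- function(x) gsub("\\s+", "", x)

target <- "right here"                      # toy value, chosen for illustration
chars <- strsplit(unspace(target), "")[[1]] # one element per character
pattern <- paste(chars, collapse = "\\s*")  # allow optional whitespace between chars
cat(pattern, "\n")

fixed <- gsub(pattern, target, "ri ght he r e now")
print(fixed)
```

Note that this treats every character of the target as a regex token, so a target containing metacharacters such as . or ( would need escaping first.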

Regular expression to select 2 kinds of substrings

I have strings like:
\n A vs B \n
\n C vs D (EF) \n
\n GH ( I vs J) \n
in a vector called myData.
The following is myData.
c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
I want to select A vs B from 1, C vs D from 2 and I vs J from 3.
I have the following code:
loc = regexpr(".*vs.*|\\(.*vs.*\\)",myData,ignore.case=TRUE,perl=T)
end = loc + attr(loc,"match.length")-1
substr(myData,loc,end)
which gives three outputs:
[1] " A vs B " " C vs D (EF) " " GH ( I vs J)"
The last match is incorrect. How can I fix this?
We can use str_extract
library(stringr)
str_extract(str1, "[A-Za-z]\\s*vs\\s*[A-Za-z]")
#[1] "A vs B" "C vs D" "I vs J"
Or if there are other lower case characters in place of 'vs'
str_extract(str1, "[A-Z]\\s*[a-z]+\\s*[A-Z]")
#[1] "A vs B" "C vs D" "I vs J"
Or with sub from base R
sub(".*([A-Z]\\s*[a-z]+\\s*[A-Z]).*", "\\1", str1)
#[1] "A vs B" "C vs D" "I vs J"
data
str1 <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
You may use the base R regmatches / gregexpr solution using a PCRE regex like yours, but using lookarounds, changing . to [^()] (to avoid the overflow across parentheses) and placing the longer alternative before the smaller one:
> myData <- c("\n A vs B \n", "\n C vs D (EF) \n", "\n GH ( I vs J)\n")
> res <- regmatches(myData, gregexpr("(?<=\\()[^()]*vs[^()]*(?=\\))|[^()]*vs[^()]*", myData, perl=TRUE))
> trimws(res)
[1] "A vs B" "C vs D" "I vs J"
See the R online demo
Details:
(?<=\\() - positive lookbehind making sure there is a ( immediately to the left of the current location
[^()]* - 0+ chars other than ( and )
vs - a literal substring
[^()]* - 0+ chars other than ( and )
(?=\\)) - positive lookahead making sure there is a ) immediately to the right of the current location
| - or
[^()]*vs[^()]* - a vs enclosed with 0+ chars other than ( and )
NOTE: If you need to prevent the overflow across lines, you need to add \r\n to the [^()] -> [^()\r\n].
See this regex demo.
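The effect of the NOTE above can be checked with a short sketch. The input string here is made up: it puts the "vs" pair on its own line so that the difference between [^()] and [^()\r\n] shows up.

```r
x <- "foo\nA vs B\nbar"  # made-up multi-line input

# [^()] also matches line breaks, so the match runs across all three lines.
wide <- regmatches(x, regexpr("[^()]*vs[^()]*", x, perl = TRUE))

# Excluding \r and \n keeps the match on the line that contains "vs".
narrow <- regmatches(x, regexpr("[^()\r\n]*vs[^()\r\n]*", x, perl = TRUE))

print(wide)
print(narrow)
```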
Throwing a non-regex approach into the mix. Basically, we split at vs and paste the last character of the first element with the first character of the second element.
# x is the input vector (myData in the question)
sapply(strsplit(x, ' vs '), function(i)
  paste0(substr(i[1], nchar(i[1]), nchar(i[1])), ' Vs ', substr(i[2], 1, 1)))
#[1] "A Vs B" "C Vs D" "I Vs J"

R - replacing strings using gsub()

I have a lot of unclean data in the form:
abc
abc/def
abc/de
abc/d
abc/def/i j k
abc/def/i
abc/def/i j
This is just the part of the data I would like to change. This is part of much bigger set of data.
I would like to change all the elements to abc/def/i j k.
I have used the gsub() function as follows:
gsub('abc[a-z/]', 'abc/def/i j k', str)
output :
abc/def/i j k
abc/def/i j k/def
abc/def/i j k/de
abc/def/i j k/d
The problem being that it replaces any occurrence of the pattern.
The only solution where i got decent enough results are where i hard code all the possible options like this:
gsub('abc$|abc/d$|abc/de$|abc/def/i$', 'abc/def/i j k', str)
However, this would not work if there is a variation in any new data.
So I was wondering if it was possible to get the result without hard coding the parameters.
You may use
x <- c("abc", "abc/def","abc/de","abc/d","abc/def/i j k","abc/def/i","abc/def/i j")
sub("^(abc)(?:/[^/]*)?", "\\1/def", x)
## => [1] "abc/def" "abc/def" "abc/def" "abc/def"
## [5] "abc/def/i j k" "abc/def/i" "abc/def/i j"
See R demo
Details:
^ - start of string
(abc) - Group 1: abc
(?:/[^/]*)? - an optional group matching a sequence of:
/ - a /
[^/]* - 0+ chars other than /
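If the goal is what the question states (turn every truncated variant into abc/def/i j k without listing the variants), one possible non-regex sketch is to test whether each value is a prefix of the full target. This is an alternative technique, not the answer's method; the variable names and the extra "xyz" entry are assumptions added to show that unrelated values are left alone.

```r
x <- c("abc", "abc/def", "abc/de", "abc/d", "abc/def/i j k",
       "abc/def/i", "abc/def/i j", "xyz")  # "xyz" added for illustration
target <- "abc/def/i j k"

# TRUE where the value is a non-empty leading part of the target.
is_prefix <- nzchar(x) & substring(target, 1, nchar(x)) == x

result <- ifelse(is_prefix, target, x)
print(result)
```

Because the check is derived from the target itself, new truncated variants in later data are handled without editing a pattern.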

find frequency of substring in a set of strings

I have as input a gene list where each gene has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG (number of occurrences divided by the length of the gene). The string should only be counted if it starts at position 1, 4, 7, 10, etc. (every third position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinking of something like this, but I don't know how to define the position requirement:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)
Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurrence. Then we create a column condition to test whether the start position is in 1, 4, 7, 10, 13, 16, 19, 22, ...
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
  mutate(condition = start %in% seq(1, nchar(gene1), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following @Sobrique's comment, if "divided by length" means the number of occurrences satisfying the condition divided by the total number of characters in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148
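The position requirement from the question (only count GTG starting at 1, 4, 7, ...) can also be sketched in base R by cutting each sequence into consecutive three-letter pieces and counting the pieces equal to GTG. The function name codon_freq and the toy sequences are assumptions for illustration, and the sketch assumes each gene is already a single string with no line breaks.

```r
# Hypothetical helper: in-frame frequency of a codon in one sequence.
codon_freq <- function(seq_str, codon = "GTG") {
  n <- nchar(seq_str)
  if (n < 3) return(0)
  starts <- seq(1, n - 2, by = 3)                   # positions 1, 4, 7, ...
  codons <- substring(seq_str, starts, starts + 2)  # consecutive triplets
  sum(codons == codon) / n                          # hits divided by length
}

genes <- c(toy1 = "ATGGTGAAA",   # GTG starts at position 4 (counted)
           toy2 = "AGTGAAAAA")   # GTG starts at position 2 (not counted)
freqs <- sapply(genes, codon_freq)
print(freqs)
```

Splitting into triplets avoids having to filter match positions afterwards, and it is not affected by overlapping occurrences such as GTGTG.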
Here's a Perl solution that does as you ask.
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like the 3 and 6.3 that you say you're expecting.
This program expects the path to the input file as a parameter on the command line.
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021
Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filtered out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.
Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my @hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar @hits), "\n";
}
This doesn't give the same results as your input though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my @hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . @hits . "/". length ( $gene_str ), " = ", @hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378
Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
require(stringr) # str_count() comes from stringr, not stringi
input_gene_list<- list(gene1= "GTGGGGGTTTGTGGGGGTG", gene2= "GTGGGGGTTTGTGGGGGTG", gene3= "GTGGGGGTTTGTGGGGGTG")
gene_counter<- function(gene){
x<- gene
y<- gsub(pattern = "GTG",replacement = "GTG ", x = x, perl=TRUE)
if(str_count(y,pattern = "GTG")) {
gene_count<- unlist(gregexpr(pattern = " ", y))
counter<- 0
for(i in 1:length(gene_count)){
if((gene_count[i] %% 3) == 1) counter=counter+1
}
return(counter/nchar(x))
}
}
output_list<- lapply(input_gene_list, function(x) gene_counter(x))
result<- t(as.data.frame(output_list))
result
[,1]
gene1 0.1052632
gene2 0.1052632
gene3 0.1052632
Also share your thoughts on it! Thanks!

Print an extra space if the value is positive

I'm trying to get a visually clear output of my program:
a = -1234
b = 1234
@printf "a = %s%1.2e" "" a
@printf "b = %s%1.2e" " " b
which gives:
a = -1.23e+03
b =  1.23e+03
(The point is to add an extra space for positive number)
Now I want to automate it. I tried to write a function:
function negspace(x::Number)
    if x < 0
        return "", x
    else
        return " ", x
    end
end
and print with
a = -1234
b = 1234
@printf "a = %s%1.2e" negspace( a )
@printf "b = %s%1.2e" negspace( b )
Even simpler is to use the printf format flag to do this directly by putting a space after the %:
julia> @sprintf("a = % 1.2e", -1234)
"a = -1.23e+03"
julia> @sprintf("b = % 1.2e", 1234)
"b =  1.23e+03"
Found a way:
#!/usr/bin/env julia
function signspace(x::Number)
    if x > 0
        return @sprintf(" %1.2e", x)
    else
        return @sprintf( "%1.2e", x)
    end
end
a = -1234
b = 1234
println("a = ", signspace(a))
println("b = ", signspace(b))
but I'm not sure it is optimal.
