LaTeX expression replacement in R

I'm working with a LaTeX document in R and I need to change {#1 \over #2} to \frac{#1}{#2}.
With simple expressions like:
{1\over 2}
{x^2+y^2\over \lambda}
I can do it with stringr::str_replace() or gsub base functions and the regex \\{([\\^a-z0-9\\\\\\s\\+\\-\\*/\(\)]+)\\s*\\\\over\\s*([\\^a-z0-9\\\\\\s\\+\\-\\*/\(\)]+)\\} (I guess there has to be a better approach to do this. I tried with \\{(.+)\\s*\\\\over\\s*(.*)\\} but it captured more than I wanted.)
But when I work with expressions like:
{e^{2c} \over x-1}
{2yz\over 1+x^{2} }
or a more complicated expression:
\\item $Dom\\left(Q\\right)\\ne {\\rm R}^{2} $ y uno de los puntos no pertenecientes al dominio es $\\left({1\\over 2} ,{1\\over 2} \\right).$
the above regex failed.
Is there a regex that can catch all the alternatives? Thanks

Given some sample strings:
> strings
[1] "{1\\over 2}" "{x^2+y^2\\over \\lambda}"
This monster:
> unlist(
lapply(
strsplit(
sub("\\}$","",
sub("^\\{","",strings)),"\\\\over"),
function(x){paste0("\\frac{",x[1],"}{",x[2],"}")}))
produces:
[1] "\\frac{1}{ 2}" "\\frac{x^2+y^2}{ \\lambda}"
This will break if there's more than one \over in the source string, and probably in many other cases too. It also doesn't work if there are spaces before the first { or after the closing }.
On your other examples you get this:
in out
[1,] "{1\\over 2}" "\\frac{1}{ 2}"
[2,] "{x^2+y^2\\over \\lambda}" "\\frac{x^2+y^2}{ \\lambda}"
[3,] "{e^{2c} \\over x-1}" "\\frac{e^{2c} }{ x-1}"
[4,] "{2yz\\over 1+x^{2} }" "\\frac{2yz}{ 1+x^{2} }"

I rather enjoyed this question.
At some point you have to parse the document. parse_tex from TeXCheckR was written with LaTeX, not plain TeX, in mind, but it seems to do okay here. For multi-line instances of \over the script would need to be changed, though I think the principle would be the same.
The challenge was for continued fractions.
library(data.table) # for shift
library(TeXCheckR) # for parse_tex
locate_over <- function(doc_parsed) {
lead <- function(x, n) data.table::shift(x, n = n, type = "lead", fill = "")
char <- .subset2(doc_parsed, "char")
which(char == "\\" &
lead(char == "o", 1L) &
lead(char == "v", 2L) &
lead(char == "e", 3L) &
lead(char == "r", 4L))
}
over2frac <- function(lines, verbose = FALSE) {
out <- lines
for (i in seq_along(lines)) {
if (grepl("\\over", lines[i], fixed = TRUE)) {
i_parsed <- parse_tex(lines[i])
# Find lhs
for (j in locate_over(i_parsed)) {
lhs_start <- max(which(.subset2(i_parsed, "char") %chin% c("$", "{") &
.subset2(i_parsed, "column") < j &
.subset2(i_parsed, "tex_group") == .subset2(i_parsed[j], "tex_group")))
rhs_end <- min(which(.subset2(i_parsed, "char") %chin% c("$", "}") &
.subset2(i_parsed, "column") > j + 4L &
.subset2(i_parsed, "tex_group") == .subset2(i_parsed[j], "tex_group")))
i_parsed[lhs_start, char := "{\\frac{"]
i_parsed[rhs_end, char := "}}"]
}
res <- paste0(i_parsed[["char"]], collapse = "")
res <- gsub("\\over", "}{", res, fixed = TRUE)
out[i] <- res
}
}
out
}
Test TeX document:
$5 \over 2$
This is another fraction: ${1 \over 2}$.
And another:
$$A = a \over b$$
What about:
$${{D \over C} \over H}$$
Finally:
$${e^{2c} \over x-1}$$
${2yz\over 1+x^{2} }$
$$\phi = 1 + {1 \over {1 + {1 \over {1 + {1 \over {1 + \ddots}}}}}}$$
\item $Dom\left(Q\right)\ne {\rm R}^{2} $ y uno de los puntos no pertenecientes al dominio es $\left({1\over 2} ,{1\over 2}\right).$
\bye
writeLines(over2frac(readLines("tex1.tex"), verbose = FALSE), "latex1.tex")
Resulting LaTeX document, with the necessary LaTeX-specific stuff added, plus mandatory math mode for inline fractions:
\documentclass{article}
\begin{document}
${\frac{5 }{ 2}}$
This is another fraction: ${\frac{1 }{ 2}}$.
And another:
${\frac{A = a }{ b}}$
What about:
$${\frac{{\frac{D }{ C}} }{ H}}$$
Finally:
$${\frac{e^{2c} }{ x-1}}$$
${\frac{2yz}{ 1+x^{2} }}$
$$\phi = 1 + {\frac{1 }{ {1 + {\frac{1 }{ {1 + {\frac{1 }{ {1 + \ddots}}}}}}}}}$$
\item $Dom\left(Q\right)\ne {\rm R}^{2} $ y uno de los puntos no pertenecientes al dominio es $\left({\frac{1}{ 2}} ,{\frac{1}{ 2}} \right).$
\end{document}

This gets you most of the way for your examples:
library(stringr)
s <- "Expression 1 is {1\\over 2}.
Expression 2 is {x^2+y^2\\over \\lambda}, yes it is.
Expression 3 is {e^{2c} \\over x-1}.
The last expression: {2yz\\over 1+x^{2} }, all done now."
s2 <- str_replace_all(s,
"\\{(.*?)\\s{0,}\\\\over\\s{0,}(.*?)\\}",
"\\frac\\{\\1\\}\\{\\2\\}")
s2
[1] "Expression 1 is frac{1}{2}.\n\nExpression 2 is frac{x^2+y^2}{\\lambda}, yes it is.\n\nExpression 3 is frac{e^{2c}}{x-1}.\n\nThe last expression: frac{2yz}{1+x^{2} }, all done now."
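The only issue with the captured groups is that a space remains in the last expression, which may not be a problem since it existed in the original. Note also that the backslash before frac is missing from the output above; escaping it in the replacement string (\\\\frac instead of \\frac) should restore it.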
frac{2yz}{1+x^{2} }

x=c("{e^{2c} \\over x-1}","{2yz\\over 1+x^{2} },,dty{k^4e{-rpi/3}\\over\\sqrt{2pik}}")
gsub("\\{(.*?)\\\\over(.*?)\\}","\\\\frac{\\1}{\\2}",x)
[1] "\\frac{e^{2c} }{ x-1}"
[2] "\\frac{2yz}{ 1+x^{2} },,dty\\frac{k^4e{-rpi/3}}{\\sqrt{2pik}}"
Explanation:
\{(.*?)\\over(.*?)\}
\{ matches the character { literally (case sensitive)
1st Capturing Group (.*?)
.*? matches any character (except for line terminators)
*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\\ matches the character \ literally (case sensitive)
over matches the characters over literally (case sensitive)
2nd Capturing Group (.*?)
.*? matches any character (except for line terminators)
*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\} matches the character } literally (case sensitive)

This approach can handle:
multiple {...} containing \over on the same line
{...} not containing \over
other text before, after and between occurrences of {...}
lines not having any {...} with \over
For example, note in the example below the {jjj} on the second input line before the first occurrence of a {...} with \over works as expected.
It makes use of gsubfn which can handle balanced parentheses. First create a proto object p similar to the one in my answer here. p initializes a counter k to 0 and increments it for each { and decrements it for each }. It replaces any { for which k=1 with ! and also any } for which k=0 with !.
Then replace !...\over...! with \frac{...}{...} and replace any remaining !...! with {...}.
We have assumed that ! does not appear in the input but if it does choose a different character.
library(gsubfn)
library(magrittr)
# test input
s <- c("abc {1\\over 2} def {x^2+y^2\\over \\lambda} ghi { 12 } XYZ",
"X {jjj} A {e^{2c} \\over x-1} jkl {2yz\\over 1+x^{2} } Z")
# processing
p <- proto(
pre = function(.) .$k <- 0,
fun = function(., x) {
if (x == "{") .$k <- .$k + 1 else if (x == "}") .$k <- .$k - 1
if (x == "{" && .$k == 1) "!" else if (x == "}" && .$k == 0) "!" else x
})
s %>%
gsubfn("[{}]", p, .) %>%
gsub("!([^!]*)\\\\over ([^!]*)!", "\\\\frac{\\1}{\\2}", .) %>%
gsub("!([^!]*)!", "{\\1}", .)
giving this result:
[1] "abc \\frac{1}{2} def \\frac{x^2+y^2}{\\lambda} ghi { 12 } XYZ"
[2] "X {jjj} A \\frac{e^{2c} }{x-1} jkl \\frac{2yz}{1+x^{2} } Z"
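The same depth-counting idea can be sketched in base R without gsubfn. This is only an illustration under the answer's own assumptions (the marker ! does not occur in the input, and each \over sits directly inside one top-level brace group); mark_top_level and over_to_frac are hypothetical helper names, not part of any package.

```r
# Hypothetical helper: replace every top-level { and } with a marker.
mark_top_level <- function(s, marker = "!") {
  chars <- strsplit(s, "")[[1]]
  # Brace depth after each character: +1 for "{", -1 for "}".
  depth <- cumsum((chars == "{") - (chars == "}"))
  # A top-level "{" leaves depth 1; a top-level "}" leaves depth 0.
  top <- (chars == "{" & depth == 1) | (chars == "}" & depth == 0)
  chars[top] <- marker
  paste(chars, collapse = "")
}

# Hypothetical helper: convert top-level {a \over b} groups to \frac{a}{b}.
over_to_frac <- function(s) {
  marked <- mark_top_level(s)
  marked <- gsub("!([^!]*)\\\\over\\s*([^!]*)!", "\\\\frac{\\1}{\\2}",
                 marked, perl = TRUE)
  # Restore the braces of marked groups that held no \over.
  gsub("!([^!]*)!", "{\\1}", marked)
}

over_to_frac("X {jjj} A {e^{2c} \\over x-1} jkl {2yz\\over 1+x^{2} } Z")
```

Like the gsubfn version, this sketch processes one string at a time and does not handle a \over nested inside another \over group (continued fractions, for instance).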

Related

Why does Base.rsplit not invert the order (compared to Base.split) of the data in Julia?

I am trying out Base.rsplit() for the first time and I was surprised to see that the order of the data does not change when I use split vs rsplit. See this example:
julia> my_string = "Hello.World.This.Is.A.Test"
"Hello.World.This.Is.A.Test"
julia> a = split(my_string, ".")
6-element Vector{SubString{String}}:
"Hello"
"World"
"This"
"Is"
"A"
"Test"
julia> b = rsplit(my_string, ".")
6-element Vector{SubString{String}}:
"Hello"
"World"
"This"
"Is"
"A"
"Test"
julia> a == b
true
This is a bit counterintuitive given that rsplit says:
Similar to split, but starting from the end of the string.
rsplit scans from the right side but still returns the pieces in their original left-to-right order, so the only practical difference is the limit parameter.
Try typing in Julia:
@less rsplit("txt",".")
You will find the following function:
function _rsplit(str::AbstractString, splitter, limit::Integer, keepempty::Bool, strs::Array)
n = lastindex(str)::Int
r = something(findlast(splitter, str)::Union{Nothing,Int,UnitRange{Int}}, 0)
j, k = first(r), last(r)
while j > 0 && k > 0 && length(strs) != limit-1
(keepempty || k < n) && pushfirst!(strs, @inbounds SubString(str,nextind(str,k)::Int,n))
n = prevind(str, j)::Int
r = something(findprev(splitter,str,n)::Union{Nothing,Int,UnitRange{Int}}, 0)
j, k = first(r), last(r)
end
(keepempty || n > 0) && pushfirst!(strs, SubString(str,1,n))
return strs
end

regex for replacement of specific character outside parenthesis only

I am looking for regex (preferably in R) which can replace (any number of) specific characters say ; with say ;; but only when not present inside parenthesis () inside the text string.
Note: 1. There may be more than one replacement character present inside parenthesis too
2. There are no nested parenthesis in the data/vector
Example
text;othertext to be replaced with text;;othertext
but text;other(texttt;some;someother);more to be replaced with text;;other(texttt;some;someother);;more. (i.e. ; only outside () to be replaced with replacement text)
Still if some clarification is needed I will try to explain
in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag", "zvc;dfasdf;asdga;asd(asd;hsfd)", "adsg;(asdg;ASF;DFG;ASDF;);sdafdf", "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")
in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"
Expected output (calculated manually)
[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
You can use gsub with ;(?![^(]*\\)):
gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
; finds ;, (?!) .. Negative Lookahead (make the replacement when it does not match), [^(] .. everything but not (, * repeat the previous 0 to n times, \\) .. followed by ).
Or
gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
; finds ;, (?=) .. Positive Lookahead (make the replacement when it does match), [^)] .. everything but not ), * repeat the previous 0 to n times, ($|\\() .. match end $ or (.
Or using gregexpr and regmatches extracting the part between ( and ) and making the replacement in the non-matched substrings:
x <- gregexpr("\\(.*?\\)", in_vec) #Find the part between ( and )
mapply(function(a, b) {
paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
}, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
But all of them will work only for simple open ( close ) combinations.
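A small check of that limitation, using made-up inputs rather than the question's data: the negative-lookahead pattern from above behaves as intended on a flat string, but with nested parentheses it also doubles a semicolon that sits inside the outer pair.

```r
pattern <- ";(?![^(]*\\))"

# Flat parentheses: only the semicolons outside ( ) are doubled.
flat <- gsub(pattern, ";;", "a;b(c;d);e", perl = TRUE)
print(flat)    # "a;;b(c;d);;e"

# Nested parentheses: the ; after b is inside the outer ( ) but is still
# doubled, because the lookahead meets "(" before it can reach a ")".
nested <- gsub(pattern, ";;", "a(b;(c));d", perl = TRUE)
print(nested)  # "a(b;;(c));;d"
```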
Though the problem can be tackled with regex, using a simple function might be more straightforward and easier to understand.
replace_semicolons_outside_parentheses <- function(raw_string) {
# Replace ; with ;; outside of parentheses
processed_string <- ""
n_open_parentheses <- 0
# Loops over characters in raw_string
for (char in strsplit(raw_string, "")[[1]]) {
# Update the net number of open parentheses
if (char == "(") {
n_open_parentheses <- n_open_parentheses + 1
} else if (char == ")") {
n_open_parentheses <- n_open_parentheses - 1
}
# Replace ; with ;; outside of parentheses
if (char == ";" && n_open_parentheses == 0) {
processed_string <- paste0(processed_string, ";;")
} else {
processed_string <- paste0(processed_string, char)
}
}
return(processed_string)
}
Note that the function above also works for nested parentheses: no semicolons inside nested parentheses are replaced! The desired output can be obtained in a single line:
out_vec <- lapply(in_vec, replace_semicolons_outside_parentheses)
# 1. 'abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag'
# 2. 'zvc;;dfasdf;;asdga;;asd(asd;hsfd)'
# 3. 'adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf'
# 4. 'asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa'
Use the following in case of no nested parentheses:
gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
\( '(' char
--------------------------------------------------------------------------------
[^()]* any character except: '(', ')' (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\) ')' char
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip current match, search for new one from here
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
; ';'
If there are nested parentheses:
gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
[^()]++ any character except: '(', ')' (1 or more times
(matching the most amount possible, no backtracking))
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
(?1) recursing first group pattern
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip the match, search for next
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
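To see the two (*SKIP)(*FAIL) patterns side by side, here is a short check on a made-up string with nested parentheses (not taken from the question). It assumes the PCRE engine that gsub uses with perl = TRUE, which the answer's code already relies on.

```r
flat_pattern   <- "\\([^()]*\\)(*SKIP)(*FAIL)|;"
nested_pattern <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;"

x <- "a(b;(c;d));e"

# The non-recursive pattern only skips the innermost (c;d),
# so the ; after b is still doubled.
flat_result <- gsub(flat_pattern, ";;", x, perl = TRUE)
print(flat_result)    # "a(b;;(c;d));;e"

# The recursive pattern skips the whole balanced group (b;(c;d)).
nested_result <- gsub(nested_pattern, ";;", x, perl = TRUE)
print(nested_result)  # "a(b;(c;d));;e"
```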

How to effectively visualize a recursive function?

I'm currently teaching recursion in a programming class, and I noticed how hard it is for my students to grasp the concept. Is there a nice way to visualize what the function does for pedagogical purposes?
As an example, here is an R function for getting the n'th Fibonacci number:
fib_r <- function(n) {
if(n <= 2) return(1)
fib_r(n-1) + fib_r(n-2)
}
Thanks.
This is how I would go about explaining recursive functions in R:
First, I agree with @AEBilgrau that the factorial is a good example for recursion. (Better than Fibonacci in my opinion.)
Then I would quickly go through the theoretical basis why the factorial can be defined as a recursive function, something simple like
4! = 4 * 3 * 2 * 1 = 4 * 3!
Then you could present them the respective recursive R function
fact = function(x) if (x == 0) return(1) else return(x * fact(x - 1))
fact(3)
#6
but present them also the following output
#|fact(3) called
#|fact(3) is calculated via 3*fact(2)
#|fact(2) is unknown yet. Therefore calling fact(2) now
#|Waiting for result from fact(2)
#| fact(2) called
#| fact(2) is calculated via 2*fact(1)
#| fact(1) is unknown yet. Therefore calling fact(1) now
#| Waiting for result from fact(1)
#| | fact(1) called
#| | fact(1) is calculated via 1*fact(0)
#| | fact(0) is unknown yet. Therefore calling fact(0) now
#| | Waiting for result from fact(0)
#| | | fact(0) called
#| | | fact(0)=1 per definition. Nothing to calculate.
#| | | fact(0) returning 1 to waiting fact(1)
#| | fact(1) received 1 from fact(0)
#| | fact(1) can now calculate 1*fact(0)=1*1=1
#| | fact(1) returning 1 to waiting fact(2)
#| fact(2) received 1 from fact(1)
#| fact(2) can now calculate 2*fact(1)=2*1=2
#|fact(3) received 2 from fact(2)
#|fact(3) can now calculate 3*fact(2)=3*2=6
#[1] 6
as derived from
# helper function for formatting
tabs = function(n) paste0("|", rep("\t", n), collapse="")
fact = function(x) {
# determine length of call stack
sfl = length(sys.frames()) - 1
# we need to define tmp and tmp1 here because they are used in on.exit
tmp = NULL
tmp1 = NULL
# on.exit will print the returned function value when we exit the function ...
# ... i.e., when one function call is removed from the stack
on.exit({
if (sfl > 1) {
cat(tabs(sfl), "fact(", x, ") returning ",
tmp, " to waiting fact(", x + 1, ")\n", sep="")
}
})
cat(tabs(sfl), "fact(", x, ") called\n", sep="")
if (x == 0) {
cat(tabs(sfl), "fact(0)=1 per definition. Nothing to calculate.\n", sep="")
# set tmp for printing in on.exit
tmp = 1
return(1)
} else {
# print some info for students
cat(tabs(sfl), "fact(", x,
") is calculated via ", x, "*fact(", x - 1, ")\n", sep="")
cat(tabs(sfl),"fact(",x - 1,
") is unknown yet. Therefore calling fact(",
x - 1, ") now\n", sep="")
cat(tabs(sfl), "Waiting for result from fact(",
x - 1, ")\n", sep="")
#call fact again
tmp1 = fact(x - 1)
#more info for students
cat(tabs(sfl), "fact(", x, ") received ", tmp1,
" from fact(", x - 1, ")\n", sep="")
tmp = x * tmp1
cat(tabs(sfl), "fact(", x, ") can now calculate ",
x, "*fact(", x - 1, ")=", x, "*", tmp1,
"=", tmp, "\n", sep="")
return(tmp)
}
}
fact(3)
Here's my example, probably used in quite a few textbooks:
recursive_sum <- function(n){
if(n == 1) {print("Remember 1, add everything together"); return(n)}
print(paste0("Remember ", n, ", pass ", n-1, " to recursive function"))
n + recursive_sum(n-1)
}
Output:
> recursive_sum(4)
[1] "Remember 4, pass 3 to recursive function"
[1] "Remember 3, pass 2 to recursive function"
[1] "Remember 2, pass 1 to recursive function"
[1] "Remember 1, add everything together"
[1] 10
I think the factorial function is a good example for recursion. Combining this with a printout (as others suggest) seems like a good way to describe what is going on:
factorial <- function(n) {
cat("factorial(", n, ") was called.\n", sep = "")
if (n == 0) {
return(1)
} else {
return(n * factorial(n - 1))
}
}
factorial(4)
#factorial(4) was called.
#factorial(3) was called.
#factorial(2) was called.
#factorial(1) was called.
#factorial(0) was called.
#[1] 24
You can then also implement a non-recursive factorial function and compare the computational efficiencies. Or maybe ask them what is problematic with the above implementation (e.g., what happens with factorial(-4)).
Regarding a more proper visualization (and not just easy examples), there are websites which illustrate the recursion tree.
Edit: Googling recursion is also a useful lesson.
Print the value of the variable n inside fib_r:
cat("iteration at:", n, "\n")
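Building on the idea of printing n, one possible sketch indents each call by its recursion depth so that the call tree of fib_r becomes visible. The name fib_trace and the depth argument are illustrative additions, not part of the question's code.

```r
fib_trace <- function(n, depth = 0) {
  # Indent by two spaces per level of recursion.
  cat(strrep("  ", depth), "fib_r(", n, ")\n", sep = "")
  if (n <= 2) return(1)
  fib_trace(n - 1, depth + 1) + fib_trace(n - 2, depth + 1)
}

result <- fib_trace(4)
# fib_r(4)
#   fib_r(3)
#     fib_r(2)
#     fib_r(1)
#   fib_r(2)
```

The indentation makes it easy to point out that fib_r(2) is computed twice, which is a natural lead-in to discussing why naive Fibonacci recursion is wasteful.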

find frequency of substring in a set of strings

I have as input a gene list where each gene has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG (number of occurrences divided by length of gene). The string should only be counted if it starts at position 1, 4, 7, 10, etc. (every third position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinking of something like this, but I don't know how to define the position requirements:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)
Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurrence. Then we create a column condition to test if the start position is in 1, 4, 7, 10, 13, 16, 19, 22, ...
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(gene1), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following @Sobrique's comment, if divided by length means the number of occurrences respecting the condition divided by the total number of characters in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148
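As a cross-check that needs no extra packages, the in-frame count can also be sketched in base R by cutting the sequence into consecutive three-letter codons. The function name codon_frequency and the short test sequences are made up for illustration.

```r
codon_frequency <- function(sequence, codon = "GTG") {
  n <- nchar(sequence)
  if (n < 3) return(0)
  # Codon start positions 1, 4, 7, ...; an incomplete trailing codon is dropped.
  starts <- seq(1, n - 2, by = 3)
  codons <- substring(sequence, starts, starts + 2)
  # In-frame hits divided by the total sequence length.
  sum(codons == codon) / n
}

codon_frequency("ATGGTGACC")  # codons ATG GTG ACC: 1 / 9
codon_frequency("AGTGAA")     # GTG starts at position 2, out of frame: 0
```

For the multi-line sequences in the question, the lines of each gene would first have to be pasted into one string so that the reading frame carries over line breaks.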
Here's a Perl solution that does as you ask.
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like the 3 and 6.3 that you say you're expecting.
This program expects the path to the input file as a parameter on the command line.
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021
Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filtered out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.
Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my @hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar @hits), "\n";
}
This doesn't give the same results as your example output, though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my @hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . @hits . "/". length ( $gene_str ), " = ", @hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378
Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
input_gene_list <- list(gene1 = "GTGGGGGTTTGTGGGGGTG", gene2 = "GTGGGGGTTTGTGGGGGTG", gene3 = "GTGGGGGTTTGTGGGGGTG")
gene_counter <- function(gene) {
# start positions of every GTG; the lookahead also finds overlapping matches
starts <- unlist(gregexpr("(?=GTG)", gene, perl = TRUE))
# gregexpr() returns -1 when there is no match
starts <- starts[starts > 0]
# in-frame starts are 1, 4, 7, ..., i.e. start %% 3 == 1
sum(starts %% 3 == 1) / nchar(gene)
}
output_list <- lapply(input_gene_list, gene_counter)
result <- t(as.data.frame(output_list))
result
[,1]
gene1 0.05263158
gene2 0.05263158
gene3 0.05263158
Also share your thoughts on it! Thanks!

Converting user input to a number

It might be a silly mistake but I'm not able to figure it out.
Here's the sample:
if(K<=50 & k<=K) {
cat("Message A", "\n")
} else {
if(K>50) {
cat("Error A","\n")
return(0)
} else {
cat("Error B","\n")
return(0)
}
}
If I enter K = 9 and k = 2, I still get Error A and the program stops.
Why?
EDIT:
If I take user input for K and k, it gives the "Error A" message. "Error B" works fine. I never get "Message A"
K<-readline("Enter K: ")
k<-readline("Enter k: ")
The reason you're seeing this error is that the readline function returns a string instead of a number. When R compares a string with a number, it converts both to strings and compares them alphabetically:
"9" <= 50
# [1] FALSE
"1" <= 50
# [1] TRUE
The solution is to convert the inputted values to the numeric type:
K <- as.numeric(readline("Enter K: "))
k <- as.numeric(readline("Enter k: "))
Note that if you enter something that's not a number then R will store that value as NA; you can check for this with the is.na function.
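A minimal sketch of that conversion plus the NA check; read_number is a hypothetical helper name, and it takes the text as an argument so that it can be tried without interactive input.

```r
read_number <- function(text) {
  # as.numeric() warns and returns NA for non-numeric text; silence the warning
  # and turn the NA into an explicit error instead.
  value <- suppressWarnings(as.numeric(text))
  if (is.na(value)) stop("not a number: ", text)
  value
}

"9" <= 50               # FALSE: 50 is coerced to "50" and compared as text
read_number("9") <= 50  # TRUE: numeric comparison
```

In the question's code this would become K <- read_number(readline("Enter K: ")), and likewise for k.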
