Count substring matches within start/stop positions of pattern in R

From a given string "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT", I want a matrix that counts triplets in substrings that
start with "AAA" or "GAA",
end with "AGT",
and
have at least 2 and at most n other triplets between the start and the end.
For my problem, n = 10.
I have the code and result below:
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
str_extract_all( dna, regex )
Result:
[[1]]
[1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
How can I modify the code so that it returns a matrix containing the starting position of each match along with the number of triplets (every 3 characters count as one triplet) between the start and the end of that match?
For the results above, the output should be:
1 7
31 3
Here 1 is the starting position of GAACCCACTAGTATAAAATTTGGGAGT, and 7 is the number of triplets between the starting pattern GAA and the ending pattern AGT.
The same goes for AAACCCTTTGGGAGT:
31 is the starting position of AAA, and 3 is the number of triplets between AAA and AGT.

Here's a function that might give you what I think you need, though it doesn't provide the matrix you said you wanted.
func <- function(dna, n = 10, starts, stops) {
if (length(n) == 1L) n <- c(2L, n)
startptn <- paste0("(", paste(starts, collapse = "|"), ")")
stopptn <- paste0("(", paste(stops, collapse = "|"), ")")
starts_ind <- gregexpr(startptn, dna)
stops_ind <- gregexpr(stopptn, dna)
# stops_ind ends on the first char of the triplet, so add 2
stops_ind <- lapply(stops_ind, `+`, 2L)
candidates <- Map(function(bgn, end, txt) {
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
vec <- mtx[nzchar(mtx)]
vec
}, starts_ind, stops_ind, dna)
# "6L" is the first/last triplet defined in starts/stops
cand_triplets <- lapply(candidates, function(z) nchar(z) - 6L)
lens <- lengths(candidates)
df <- data.frame( id = rep(seq_along(dna), lens), dna = rep(dna, lengths(candidates)) )
df$match <- unlist(candidates)
df$inner <- substring(df$match, 4, nchar(df$match) - 3)
df$ntriplets <- nchar(df$inner) / 3
if (nrow(df) > 0) {
df <- df[ abs(df$ntriplets %% 1) < 1e-5 &
n[1] <= df$ntriplets &
df$ntriplets <= n[2], , drop = FALSE ]
}
df
}
Demo:
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
func(c(dna, dna), 10, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
func(c(dna, dna), 20, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 4 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 10 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
In the input:
each input dna can produce 0 or more rows in the output; I use c(dna, dna) merely to demonstrate that this works on multiple dna strings, vectorized;
if n is length 1, then it defaults to ranging between 2 and n; if length 2, then both are used appropriately;
In the output:
id numbers each of the dna input strings;
dna is the actual string; I include both id and dna in case a string is accidentally repeated (minor);
match is the matching substring, including the start and stop triplets;
inner is the same substring with the start/stop triplets removed;
ntriplets is the number of triplets in $inner, i.e. nchar(inner) / 3;
the code keeps only rows where inner divides evenly into triplets (nchar / 3 is a whole number, so %% 1 is 0) and where the triplet count lies within the range given by n.
(If you want to see all matches, set n to Inf, though it'll still filter out non-triplet inner strings.)
As to your request of a matrix of lengths, if we insert a browser() in the Map and form mtx, we'll see that we are using matrices:
bgn
# [1] 1 15 31
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
end
# [1] 12 27 45
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
txt
# [1] "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
mtx
# [,1] [,2] [,3]
# [1,] "GAACCCACTAGT" "GAACCCACTAGTATAAAATTTGGGAGT" "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [2,] "" "AAAATTTGGGAGT" "AAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [3,] "" "" "AAACCCTTTGGGAGT"
nchar(mtx) - 6
# [,1] [,2] [,3]
# [1,] 6 21 39
# [2,] -6 7 25
# [3,] -6 -6 9
(The negatives come from my naïve subtraction of 6 from the empty strings in mtx; the function itself drops those empty strings with nzchar(), so this does not reflect a bug in the function.)
To me, this matrix suggests we have 2, 7, and 13 triplets in the top row (6, 21, and 39 characters, each divided by 3).
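If you do want a matrix of triplet counts rather than character counts, a minimal sketch (my own illustration; bgn, end, and txt are copied from the browser() session above, and trip is a made-up name) is to divide by 3 and mark the empty cells as NA:

```r
txt <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
bgn <- c(1, 15, 31)   # first character of each start triplet
end <- c(12, 27, 45)  # last character of each stop triplet
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
# inner length in triplets; NA where the stop comes before the start
trip <- ifelse(nzchar(mtx), (nchar(mtx) - 6) / 3, NA)
trip
```

Non-integer entries (such as 7/3 in the second row) are candidates whose inner part is not a whole number of triplets; the function filters those out with its %% 1 check.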

I have an ugly solution. You just need to analyze your pattern.
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
triplet_string <- str_extract_all( dna, regex )
triplet_matrix <- str_locate_all(dna, regex)
# first solution:
# difference between end and start position, plus 1, divided by 3 to count triplets; then subtract the first and last triplet
triplet_length_1 <- (triplet_matrix[[1]][,2] - (triplet_matrix[[1]][,1]-1))/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_1)
# second solution:
# get the length of each ORF in triplets and subtract 2 for the start and stop codons
triplet_length_2 <- nchar(triplet_string[[1]])/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_2)
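If you need an actual matrix and would rather avoid stringr, here is a minimal base R sketch of the same idea (the names m and res are mine; it assumes a single input string and that the greedy, non-overlapping matches of the regex are the ones you want):

```r
dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10
regex <- paste0("(", paste0(start, collapse = "|"), ")([A-Z]{3}){2,", n, "}", end)
m <- gregexpr(regex, dna)[[1]]
if (m[1] == -1) {
  # no match: return an empty matrix with the same columns
  res <- matrix(numeric(0), ncol = 2, dimnames = list(NULL, c("start", "ntriplets")))
} else {
  # match length in triplets, minus the start and stop triplets
  res <- cbind(start = as.integer(m), ntriplets = attr(m, "match.length") / 3 - 2)
}
res
```

For the sample string this gives the two rows asked for in the question: (1, 7) and (31, 3).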
I hope this helps.

Related

Regular expression in R to remove pairs

I have a code that outputs a pair of integers as "(1, 21)", as a string. The integers are always between 1 and 99.
I want to extract the integers into an array as numeric. How can I do this? I've done some research and it seems regex is the way to go, but I'm unsure exactly how to do this here.
Here are several base R one-liners. These each produce a data frame. Use as.matrix(...) on that if you want a matrix/array. (2) seems particularly compact.
1) trimws/read.table trim non-digits off the ends using trimws and then use read.table to read it in giving the data frame shown.
x <- c("(1, 21)", "(2, 22)", "(3, 33)") # input
read.table(text = trimws(x, white = "\\D"), sep = ",")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
2) gsub/read.table Another approach is to convert each non-digit to a space and then use read.table:
read.table(text = gsub("\\D", " ", x))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
3) strcapture Define a regular expression with captures to use with strcapture.
strcapture("(\\d+), (\\d+)", x, data.frame(V1 = integer(0), V2 = integer(0)))
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
4) chartr/read.table Use chartr to replace ( with a space and then use read.table defining the comment character as ).
read.table(text = chartr("(", " ", x), sep = ",", comment.char = ")")
## V1 V2
## 1 1 21
## 2 2 22
## 3 3 33
You can use gsub to remove ( and ) with [()], then use strsplit to split at ", ". unlist the returned list, convert it with as.integer, and create a matrix or array.
matrix(as.integer(unlist(strsplit(gsub("[()]", "", x), ", ", TRUE))), 2)
# [,1] [,2]
#[1,] 1 3
#[2,] 21 31
Data:
x <- c("(1, 21)", "(3, 31)")
Tidyverse way
x <- c("(1, 21)", "(33, 99)", "(1, 7)")
library(tidyverse)
map_dfr(str_split(str_replace(x, '\\((\\d+)\\,\\s(\\d+)\\)', '\\1 \\2'), ' '), ~ set_names(.x, c('A', 'B')))
#> # A tibble: 3 x 2
#> A B
#> <chr> <chr>
#> 1 1 21
#> 2 33 99
#> 3 1 7
Created on 2021-06-02 by the reprex package (v2.0.0)
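Note that the tibble above holds character columns. If numeric output is the goal, a minimal base R sketch (my own variation, not part of the answer above; digit_runs and pairs are made-up names) pulls out each run of digits and converts it directly:

```r
x <- c("(1, 21)", "(33, 99)", "(1, 7)")
# one character vector of digit runs per input string
digit_runs <- regmatches(x, gregexpr("[0-9]+", x))
# stack them into an integer matrix, one row per pair
pairs <- do.call(rbind, lapply(digit_runs, as.integer))
pairs
```

This assumes every string contains exactly two integers; strings with a different count would make the rows uneven.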

How to correct/standardize variable names if their format is not consistent

I am writing a script that loads RData files containing the results of earlier experiments and parses data frames saved in them. I've noticed that the names of variables are not consistent; for instance, symbol is sometimes called gene_name or gene_symbol. The order of variables also differs between data frames, so I can't just rename them all with colnames(df) <- c('a', 'b', ...)
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the below example, I'd like to find an elegant way to rename the variable b to D that I can use safely on data frames that lack a variable b
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function that replaces any of a set of column names with another string (both passed as parameters):
colnames_rep<-function(df,to_find,to_sub)
{
colnames(df)[which(colnames(df) %in% to_find)]<-to_sub
return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Thanks to divibisan for the suggestion
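One caveat with colnames_rep: if several names in to_find are present in the same data frame, they all receive the same to_sub. For many renames at once, a minimal sketch using a named lookup vector (rename_known and lookup are hypothetical names of mine, not part of the answer) could look like this:

```r
# old name -> new name; columns not listed here are left alone
lookup <- c(gene_name = "symbol", gene_symbol = "symbol", b = "D")
rename_known <- function(df, lookup) {
  hit <- names(df) %in% names(lookup)
  names(df)[hit] <- unname(lookup[names(df)[hit]])
  df
}
x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
y <- data.frame(a = c(1, 2, 3), c = c(4, 5, 6))
lapply(list(x, y), rename_known, lookup = lookup)
```

The first data frame ends up with columns a and D; the second is returned unchanged because none of its names appear in lookup.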
We can use rename_at with map (from dplyr and purrr)
library(dplyr); library(purrr)
map(dfs, ~ .x %>%
rename_at(vars(matches("^b$")), sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
c("symbol", "gene_name", "gene_symbol"),
c("D", "c", "quux"),
c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
c("symbol", "^gene_(name|symbol)$"),
c("D", "^D$", "^c$", "^quux$"),
c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
compare <- if (fixed) `%in%` else grepl
downcase <- if (ignore.case) tolower else c
newcn <- cn <- colnames(df)
newnames <- sapply(namemap, `[`, 1L)
matches <- sapply(namemap, function(nmap) {
apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
}) # dims: 1=cn; 2=map-to
for (j in seq_len(ncol(matches))) {
if (sum(matches[,j]) > 1) {
warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
paste(sQuote(cn[ matches[,j] ]), collapse=","))
matches[,j] <- FALSE
}
}
for (i in seq_len(nrow(matches))) {
rowmatches <- sum(matches[i,])
if (rowmatches == 1) {
newcn[i] <- newnames[ matches[i,] ]
} else if (rowmatches > 1) {
warning("column ", sQuote(cn[i]), " matches multiple rules: ",
paste(sQuote(newnames[ matches[i,]]), collapse=","))
matches[i,] <- FALSE
}
}
if (any(matches)) colnames(df) <- newcn
df
}
(You might extend it to ensure unique-ness, using make.names and/or make.unique. There's also ignore.case, not really tested here but easily done, I believe.)
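As a rough illustration of that extension (a sketch, not part of fixfunc; newcn here is a stand-in for the renamed column vector), make.unique appends a numeric suffix to repeated names:

```r
newcn <- c("symbol", "a", "symbol")
make.unique(newcn)
# [1] "symbol"   "a"        "symbol.1"
```

Whether a suffixed duplicate is preferable to a warning and no change is a design choice that depends on how the data frames are used downstream.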
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where the third data.frame has two columns ('cc' and 'ccc') that both match the string "c" in my second non-pattern vector when it is treated as a regular expression. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call

Difference between `names(df[1]) <- ` and `names(df)[1] <- `

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but then why does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
Also, it turns out this is "documented" in R Inferno Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2

How to extract a number into digits using R?

Suppose I have a number: 4321
and I want to extract it into digits: 4, 3, 2, 1
How do I do this?
Alternatively, with strsplit:
x <- as.character(4321)
as.numeric(unlist(strsplit(x, "")))
[1] 4 3 2 1
Use substring to extract character at each index and then convert it back to integer:
x <- 4321
as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))
[1] 4 3 2 1
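One thing to watch with the string-based approaches: as.character() and nchar() on a number can switch to scientific notation (for example, 1e5 typically becomes "1e+05"). A minimal sketch that sidesteps this, assuming a single non-negative whole number (to_digits is a made-up name):

```r
to_digits <- function(x) {
  # format() with scientific = FALSE keeps every digit of a whole number
  s <- format(x, scientific = FALSE, trim = TRUE)
  as.integer(strsplit(s, "")[[1]])
}
to_digits(4321)
to_digits(100000)
```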
For real fun, here's an absurd method:
digspl<-function(x){
x<-trunc(x) # just in case
mj<-trunc(log10(x))
y <- trunc(x/10^mj)
for(j in seq_len(mj)) { # seq_len() skips the loop for single-digit x, where 1:mj would be c(1, 0)
y[j+1]<- trunc((x-y[j]*10^(mj-j+1))/(10^(mj-j)))
x<- x - y[j]*10^(mj-j+1)
}
return(y)
}
For fun, here's an alternative:
x <- 4321
read.fwf(textConnection(as.character(x)), rep(1, nchar(x)))
# V1 V2 V3 V4
# 1 4 3 2 1
The only advantage I can think of is the possibility of exploding your input into varying widths, though I guess you can do that with substring too.
An alternative solution, using modulo operator:
get_digit <- function(x, d) {
# digits from the right
# i.e.: first digit is the ones, second is the tens, etc.
(x %% 10^d) %/% (10^(d-1))
}
# for one number
get_all_digit <- function(x) {
get_digit_x <- function(d) get_digit(x,d)
sapply(nchar(x):1, get_digit_x)
}
# for a vector of numbers
digits <- function(x) {
out <- lapply(x, get_all_digit)
names(out) <- x
out
}
Example:
> digits(100:104)
$`100`
[1] 1 0 0
$`101`
[1] 1 0 1
$`102`
[1] 1 0 2
$`103`
[1] 1 0 3
$`104`
[1] 1 0 4

Rows With Blank Entries in R

I have a 721 x 26 dataframe. Some rows have entries that are blank. They are not NULL or NA, just empty, like the following. How can I delete the rows that have these kinds of entries?
1 Y N Y N 86.8
2 N N Y N 50.0
3 76.8
4 N N Y N 46.6
5 Y Y Y Y 30.0
The answer to this question depends on how paranoid you want to be about the sort of things that might be in 'blank'-appearing character strings. Here's a fairly careful approach that will match the zero-length blank string "" as well as any string composed of one or more [[:space:]] characters (i.e. "tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters", according to the ?regex help page).
## An example data.frame containing all sorts of 'blank' strings
df <- data.frame(A = c("a", "", "\n", " ", " \t\t", "b"),
B = c("b", "b", "\t", " ", "\t\t\t", "d"),
C = 1:6)
## Test each element to see if it is either zero-length or contains just
## space characters
pat <- "^[[:space:]]*$"
subdf <- df[!names(df) %in% "C"] # drops columns not involved in the test; also safe when no name matches
matches <- data.frame(lapply(subdf, function(x) grepl(pat, x)))
## Subset df to remove rows fully composed of elements matching `pat`
df[!apply(matches, 1, all),]
# A B C
# 1 a b 1
# 2 b 2
# 6 b d 6
## OR, to remove rows with *any* blank entries
df[!apply(matches, 1, any),]
# A B C
# 1 a b 1
# 6 b d 6
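Another way to get the "any blank" behaviour, sketched here under the assumption that the blank-prone columns are character vectors (is_chr and out are made-up names), is to turn blank strings into NA and then lean on complete.cases:

```r
df <- data.frame(A = c("a", "", " ", "b"),
                 B = c("b", "b", "\t", "d"),
                 C = 1:4, stringsAsFactors = FALSE)
is_chr <- vapply(df, is.character, logical(1))
# replace empty or whitespace-only strings with NA in character columns
df[is_chr] <- lapply(df[is_chr], function(col) {
  replace(col, grepl("^[[:space:]]*$", col), NA)
})
# keep only rows with no NA in any column
out <- df[complete.cases(df), ]
out
```

Note that this also drops rows that already had NA in any column, which may or may not be what you want.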
