Need to find most common combination of letters - r

Let's say for simplicity that I have 24 rows of 5 characters, where each character can be A-Z.
E.g.:
KJGXI
GDGQT
JZKDC
YOTQD
SSDIQ
PLUWC
TORHC
PFJSQ
IIZMO
BRPOJ
WLMDX
AZDIJ
ARNUA
JEXGA
VFPIP
GXOXM
VIZEM
TFVQJ
OFNOG
QFNJR
ZGUBZ
CCTMB
HZPGV
ORQTJ
I want to know which 3-letter combination is most common. However, the letters of a combination do not need to be in order, nor next to each other. E.g.
ABCXY
CQDBA
=ABC
I could probably brute-force it with endless loops but I was wondering if there was a better way of doing it!

Here is a solution:
x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ",
"ARNUA", "JEXGA", "VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")
temp <- do.call(cbind, lapply(strsplit(x, ""), combn, m = 3))
temp <- apply(temp, 2, sort)
temp <- apply(temp, 2, paste0, collapse = "")
sort(table(temp), decreasing = TRUE)
This returns the number of times each combination appears. You can then use names(which.max(sort(table(temp), decreasing = TRUE))) to get the most common combination (in this case, "FJQ").
In this case two combinations appear 3 times, so to get every combination tied for the maximum you can do
result <- sort(table(temp), decreasing = TRUE)
names(which(result == max(result)))
# [1] "FJQ" "IMZ"
to get both combinations that appear most often.
The code works as follows:
split each element of x in five letters, then generate each possible combination of 3 elements from the 5 letters
sort each of those combination alphabetically
paste the 3 letters together
generate the count for each of those combinations, and sort the result
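One caveat with the steps above: when a row repeats a letter (for example IIZMO), combn treats the two copies as distinct elements, so "IMZ" is counted twice for that single row. If each row should contribute at most once per combination, a minimal sketch is to drop repeated letters before calling combn. The helper name count_combos is made up for illustration, and rows with fewer than m distinct letters are simply skipped, which is an assumption about the desired behaviour.

```r
x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ",
       "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA", "VFPIP", "GXOXM",
       "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")

# Hypothetical helper: count each m-letter combination at most once per row
count_combos <- function(strings, m = 3) {
  combos <- lapply(strsplit(strings, ""), function(chars) {
    chars <- sort(unique(chars))              # drop repeated letters, then sort
    if (length(chars) < m) return(character(0))
    combn(chars, m, FUN = paste0, collapse = "")
  })
  sort(table(unlist(combos)), decreasing = TRUE)
}

res <- count_combos(x)
res[c("FJQ", "IMZ")]
```

With the sample data this leaves "FJQ" at 3 and lowers "IMZ" to 2, because IIZMO no longer counts twice.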

I would split each string into letters, sort them, then use combn to get all combinations. Use paste0 to collapse these back into strings and count.
txt <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC",
"PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA",
"VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ",
"CCTMB", "HZPGV", "ORQTJ")
txt2 <- strsplit(txt, split = "")
txt2 <- lapply(txt2, sort)
txt3 <- lapply(txt2, combn, m = 3)
txt4 <- lapply(txt3, function(x){apply(x, 2, paste0, collapse = "")})
table(unlist(txt4))
Several steps here could be combined.
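As a rough illustration of combining the steps, combn accepts a FUN argument, so the paste0 call can be applied to each combination directly. The sketch below uses the two-string example from the question rather than the full data.

```r
txt <- c("ABCXY", "CQDBA")

# Split, sort, build all 3-letter combinations and collapse them in one pass
combos <- unlist(lapply(strsplit(txt, ""), function(chars) {
  combn(sort(chars), 3, FUN = paste0, collapse = "")
}))

counts <- table(combos)
names(counts)[counts == max(counts)]
# [1] "ABC"
```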

Related

Separating a column by the first 3 characters

I have a set of data below and I would like to separate the first three characters from the bm_id column into a separate column with the rest of the characters in another column.
  bm_id
1 popCL20TE
2 agrST20
3 agrST20-09SE
I have tried using solutions to a similar question asked on stack, however I end up making extra empty columns with my data remaining together.
bm_id[c('species', 'id')] <- tstrsplit(bm_id$bm_id, '(?<=.{3})', perl = TRUE)
The same happens with this code:
bm_id2 <- tidyr::separate(bm_id, bm_id, into = c("species", "id"), sep = 3)
How about substr?
df <- data.frame(vec= c("popCL20TE", "agrST20"))
df$first3 <- substr(df$vec, 1, 3)
df$last <- substr(df$vec, 4, nchar(df$vec))
df
        vec first3   last
1 popCL20TE    pop CL20TE
2   agrST20    agr   ST20
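The same idea applied to the three values from the question, including the hyphenated one, might look like the sketch below. The column names species and id are taken from the question's own attempt; substring() is used for the remainder because its end position defaults to the end of the string.

```r
bm_id <- data.frame(bm_id = c("popCL20TE", "agrST20", "agrST20-09SE"))

bm_id$species <- substr(bm_id$bm_id, 1, 3)   # first three characters
bm_id$id      <- substring(bm_id$bm_id, 4)   # everything from the 4th character on
bm_id
```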

Using regex to drop duplicated elements in columns of an R dataframe

I have a dummy dataframe df which has dimensions 6 X 4.
df <- data.frame(
Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5", "Hit6"),
GO = c("GO:0005634~nucleus,", "", "GO:0005737~cytoplasm,", "GO:0005634~nucleus,GO:0005737~cytoplasm,", "",
"GO:0005634~nucleus,GO:0005654~nucleoplasm,"),
KEGG = c("", "", "", "", "", ""),
SMART = c("SM00394:RIIa,", "SM00394:RIIa,", "", "SM00054:EFh,",
"", "SM00394:RIIa,SM00239:C2,"))
df looks like this
The elements in the columns consist of two parts:
an identifier (e.g. GO:0005634~, SM00394: etc.)
a term (e.g. nucleus, EFh etc.)
For each column I want to retain a row if it contains at least one term which is not present in any row above it. E.g. in the column GO, rows 1 and 3 contain unique terms, so these should be retained. Row 4 contains terms which are already present in rows 1 and 3, so it should be dropped. Row 6 has one term which is not present in any row above it, hence it should also be retained.
I have been able to come up with regular expressions to extract the terms from the columns GO and SMART
Regex for GO: (?<=~).*?(?=,(?:GO:\\d+~|$))
Regex for SMART: (?<=:).*?(?=,(?:\\w+\\d+:|$))
But I'm unable to figure out a way to integrate the regex and the conditions mentioned above into a solution. The output should look like this
Any suggestions on how to solve this?
Here is a general approach that will handle GO, SMART, and potentially KEGG, though it is impossible to say without any information about KEGG.
The function f below takes as arguments
x, a character vector
split, the delimiter separating items in lists
sep, the delimiter separating identifiers and terms within items
and returns a logical vector indexing the elements of x with at least one non-duplicated term.
f <- function(x, split, sep) {
l1 <- strsplit(x, split)
tt <- sub(paste0("^[^", sep, "]*", sep), "", unlist(l1))
l2 <- relist(duplicated(tt), l1)
!vapply(l2, all, NA)
}
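To see what happens inside f, here is a step-by-step sketch on the GO column alone. It is not the function itself: instead of relist it tracks which row each term came from (the names row_id and keep are made up for illustration), which should give the same logical index.

```r
go <- c("GO:0005634~nucleus,", "", "GO:0005737~cytoplasm,",
        "GO:0005634~nucleus,GO:0005737~cytoplasm,", "",
        "GO:0005634~nucleus,GO:0005654~nucleoplasm,")

items  <- strsplit(go, ",", fixed = TRUE)         # one character vector per row
terms  <- sub("^[^~]*~", "", unlist(items))       # strip "GO:...~" identifiers
row_id <- rep(seq_along(items), lengths(items))   # row each term belongs to

# A row is kept if at least one of its terms has not been seen before
keep <- seq_along(items) %in% row_id[!duplicated(terms)]
keep
```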
Applying f to GO and SMART:
nms <- c("GO", "SMART")
l <- Map(f, x = df[nms], split = ",", sep = c("~", ":"))
l
## $GO
## [1] TRUE FALSE TRUE FALSE FALSE TRUE
##
## $SMART
## [1] TRUE FALSE FALSE TRUE FALSE TRUE
Setting to "" elements of GO and SMART with zero non-duplicated terms, then filtering out empty rows, we obtain the desired result:
df2 <- df
df2[nms] <- Map(replace, df2[nms], lapply(l, `!`), "")
df2[Reduce(`|`, l), ]
## Hits GO KEGG SMART
## 1 Hit1 GO:0005634~nucleus, SM00394:RIIa,
## 3 Hit3 GO:0005737~cytoplasm,
## 4 Hit4 SM00054:EFh,
## 6 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm, SM00394:RIIa,SM00239:C2,
The following algorithm is applied to each term (GO, SMART, KEGG):
extract the identifier+term list as comma-separated. See stringr::str_split etc.
extract the term as regex
cumulate all the terms along the dataframe as they appear
extract the difference between each row and the row immediately preceding
replace the string with "" if no new term is introduced
filter rows where not all the terms are ""
library(dplyr)
library(stringr)
library(purrr)
termred <- function(terms, rx) {
terms |>
stringr::str_split(",") |>
purrr::map(stringr::str_trim) |>
purrr::map(~{.x[.x != ""]}) |>
purrr::map(~stringr::str_extract(.x, rx)) |>
purrr::accumulate(union) %>%
{mapply(setdiff, ., lag(., 1), SIMPLIFY = TRUE)} %>%
{ifelse(sapply(., length) > 0, terms, "")}
}
df |>
transform(GO = termred(GO, "~.*$")) |>
transform(SMART = termred(SMART, ":.*$")) |>
filter(GO != "" | SMART != ""| KEGG != "")
##> Hits GO KEGG SMART
##>1 Hit1 GO:0005634~nucleus, SM00394:RIIa,
##>2 Hit3 GO:0005737~cytoplasm,
##>3 Hit4 SM00054:EFh,
##>4 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm, SM00394:RIIa,SM00239:C2,

To find the length of the splitted sequence in r

I want to count the number of characters in between two patterns
For example:
seq="AATTGGCCATGCAATTGGCCATTAAA"
pattern="ATGC|CCAT"
I want the pieces to be
"AATTGGCC" "AATTGG" "TAAA"
I then want to find the lengths of these split pieces.
We can do a for loop
for(nm in pat){
seq <- gsub(nm, " ", seq)
}
res <- scan(text=seq, sep="", what="", quiet=TRUE)
res
#[1] "AATTGGCC" "AATTGG" "TAAA"
nchar(res)
#[1] 8 6 4
data
seq="AATTGGCCATGCAATTGGCCATTAAA"
pat <- c("ATGC", "CCAT")
Use this (str_split comes from the stringr package):
library(stringr)
split_seq <- unlist(str_split(str_split("AATTGGCCATGCAATTGGCCATTAAA",pattern="ATGC")[[1]],pattern = "CCAT"))
split_seq
Then use nchar to measure the length
nchar(split_seq)
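Note that both answers apply the patterns one after the other, and the order matters here because CCAT overlaps ATGC in the sequence (CCATGC). A single split on the regular expression "ATGC|CCAT" consumes the leftmost match first and therefore gives different pieces. The sketch below contrasts the two in base R; the use of Reduce is just one way to write the sequential version.

```r
seq <- "AATTGGCCATGCAATTGGCCATTAAA"
pat <- c("ATGC", "CCAT")

# Sequential: split on each pattern in turn, as in the answers above
sequential <- Reduce(function(s, p) unlist(strsplit(s, p, fixed = TRUE)), pat, seq)
sequential
# [1] "AATTGGCC" "AATTGG"   "TAAA"
nchar(sequential)
# [1] 8 6 4

# One pass with alternation: the overlapping CCAT at position 7 matches first
one_pass <- strsplit(seq, "ATGC|CCAT")[[1]]
one_pass
# [1] "AATTGG"   "GCAATTGG" "TAAA"
```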

How to delete everything after nth delimiter in R?

I have this vector myvec. I want to remove everything after second ':' and get the result. How do I remove the string after nth ':'?
myvec<- c("chr2:213403244:213403244:G:T:snp","chr7:55240586:55240586:T:G:snp" ,"chr7:55241607:55241607:C:G:snp")
result
chr2:213403244
chr7:55240586
chr7:55241607
We can use sub. From the start of the string (^), we match one or more characters that are not a : ([^:]+), followed by a :, followed by one or more characters that are not a : ([^:]+), and place all of this in a capture group, i.e. within the parentheses. The trailing .* matches the rest of the string, so replacing the whole match with the backreference to the capture group (\\1) keeps only the first two fields.
sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
The above works for the example posted. For general cases to remove after the nth delimiter,
n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Checking with a different 'n'
n <- 3
and repeating the same steps (pat has to be rebuilt with the new n)
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
#[3] "chr7:55241607:55241607"
Or another option would be to split by : and then paste the n number of components together.
n <- 2
vapply(strsplit(myvec, ':'), function(x)
paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
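One thing to watch with the split-and-paste option: if a string has fewer than n fields, x[seq.int(n)] pads with NA and the result contains literal "NA" pieces. A hedged variant using head() avoids that; the wrapper name keep_fields is made up for illustration.

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")

# Hypothetical wrapper: keep at most the first n delimiter-separated fields
keep_fields <- function(x, n, delim = ":") {
  vapply(strsplit(x, delim, fixed = TRUE),
         function(parts) paste(head(parts, n), collapse = delim),
         character(1L))
}

keep_fields(myvec, 2)
# [1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"
keep_fields("a:b", 5)   # fewer than n fields: returned unchanged
# [1] "a:b"
```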
Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.
1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again:
k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be ^((.*?:){2}.*?):.* ) and use it with sub:
k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 1: For k=1 this can be further simplified to sub(":.*", "", myvec) and for k=n-1 it can be further simplified to sub(":[^:]*$", "", myvec)
Note 2: For k equal to 3 the regular expression is:
^((.*?:){2}.*?):.*
3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above like this:
n <- 6 # number of fields
k <- 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)
If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this:
n <- count.fields(textConnection(myvec[1]), sep = ":")
4) locate position of kth colon Locate the positions of the colons using gregexpr and then extract the location of the kth subtracting one from it since we don't want the trailing colon. Use substr to extract that many characters from the respective strings.
k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using paste0(myvec, ":") as the input instead of myvec.
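As a small sketch of the padding idea from Note 3, here is solution (4) with k equal to the number of fields. The colon positions are looked up in the padded strings (the name padded is introduced here for illustration), while the characters are still extracted from the original myvec.

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")

k <- 6                              # keep all six fields
padded <- paste0(myvec, ":")        # guarantees a kth colon exists
pos <- sapply(gregexpr(":", padded, fixed = TRUE), "[", k)
res <- substr(myvec, 1, pos - 1)
res
```

With k = 6 this returns myvec unchanged, and smaller values of k behave as in solution (4).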
Note 4: We compare performance:
library(rbenchmark)
benchmark(
.read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
.sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
.for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)},
.gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
order = "elapsed", replications = 1000)[1:4]
giving:
          test replications elapsed relative
2 .sprintf.sub         1000    0.11    1.000
4    .gregexpr         1000    0.14    1.273
3         .for         1000    0.15    1.364
1  .read.table         1000    2.16   19.636
The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.

Importing one long line of data with spaces into R

This question is a followup to my previous question, Importing one long line of data into R.
I have a large data file consisting of a single line of text. The format resembles
Cat     14         15 Horse   16 
I'd eventually like to get it into a data.frame. In the above example I would end up with two variables, Animal and Number. The number of characters in each "line" is fixed, so in the example above each line contains 11 characters, animals being the first 7 and numbers being the next four.
So what I'd like is a data frame that looks like:
Animal Number
Cat 14
NA 15
Horse 16
You can read the file with read.fwf, specifying the column widths and the number of columns:
inp.fwf <- read.fwf("tmp.txt", widths = rep(c(7, 4), times = 3), as.is = TRUE)
Here the argument times = 3 works for your sample data; for your real file, you'll have to indicate how many pairs there are and change times accordingly. If you don't know how many entries you have, this might work:
inp.rl <- readLines("tmp.txt")
nchar(inp.rl)/11
This will give you a data.frame with one row and many columns. You need to break that into many rows and two columns:
inp.mat <- matrix(inp.fwf, byrow = TRUE, ncol = 2)
This will get you the correct shape for your data. The animal names are stored as character vectors, which you'll probably want to change into factors, but at this point all the data is in R, so you can easily tweak it.
Solution with vectorized substring function.
x <- readLines(textConnection("Cat     14         15 Horse   16 "))
idx <- seq.int(1,nchar(x),by=11)
vsubstr <- Vectorize(substr,vectorize.args=c("start","stop"))
dat <- data.frame(Animal= vsubstr(x,idx,idx+6),
Number= as.numeric(vsubstr(x,idx+7,idx+10)))
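A related sketch: base R's substring() is already vectorized over its start and stop arguments, so the Vectorize wrapper is optional. The version below also trims the padding and turns the blank animal into NA, as in the desired output. The exact layout of the input (7 characters for the animal, then a 4-character number field) follows the question's description, but where the spaces fall inside each field is an assumption.

```r
# Build the line from 7-character animal fields and 4-character number fields
x <- paste0("Cat    ", " 14 ",
            "       ", " 15 ",
            "Horse  ", " 16 ")

starts <- seq(1, nchar(x), by = 11)              # start of each 11-character record

animal <- trimws(substring(x, starts, starts + 6))
animal[animal == ""] <- NA                       # blank animal field becomes NA
number <- as.numeric(substring(x, starts + 7, starts + 10))

dat <- data.frame(Animal = animal, Number = number)
dat
```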
Not sure what the 15 is all about; from the way you described the data, it should be animal-space-count-space-animal...
Anyway if the 15 should not be there here is one approach.
list1<-"Cat 14 Horse 16"
x <- unlist(strsplit(list1, " "))
x <- as.data.frame(matrix(x, length(x)/2, 2, byrow = TRUE))
x[, 2] <- as.numeric(as.character(x[, 2]))
x[, 1] <- as.character(x[, 1])
names(x) <-c('animal', 'count')
x
Assume you have a text file, test.dat, with repeated Animal Number pairs.
x <- scan("test.dat", what=list("", 0))
my.df <- data.frame(Animal = x[[1]], Number = x[[2]])
Tyler's use of read.fwf is perhaps cleaner, but here's another possible method.
x <- readLines(textConnection("Cat     14         15 Horse   16 "))
x <- matrix(strsplit(x, "")[[1]], nrow=11)
d <- data.frame(Animal = apply(x[1:7,], 2, paste, collapse=""),
Number = as.numeric(apply(x[8:11,], 2, paste, collapse="")))

Resources