Count of multiple partially matching DNA sequences in R

I have a dataset of partially matching DNA sequences and want to assign different numerical indexes to the partially matching sequences.
e.g.:
sequences <- c("AAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAACCC",
"AAAAAAAAAAAAAAAAAAAAAAAAACC",
"CATTTTCAG",
"CATTTTCAGTCAAAATTT",
"CATG",
"CATGG",
"CATGGGTT",
"GATC")
The first one recurs in the 2nd, 3rd, and 4th, so they should all get the value 1; the 5th recurs in the 6th, so they should both get 2; the 7th recurs in the 8th and 9th, so they should all get 3; the 10th does not recur and should get 4 as its index. This is just an example, of course; sometimes the dataset could contain >3000 rows.
I tried several solutions including grepl and str_count. My latest attempt was to create a dictionary first to store all the sequences and the indices, create a list of prefixes, and then iterate over the prefixes to assign the indices. However, the result is not what I expect, as all the sequences get an index of 1.
# Create a dictionary to store the sequences and their indices
indices <- as.list(1:length(sequences))
names(indices) <- sequences
# Create a function that returns the first 7 characters of a sequence
get_prefix <- function(seq) {
  return(substring(seq, 1, 7))
}
# Create a list of unique prefixes
prefixes <- unique(sapply(sequences, get_prefix))
# Iterate over the prefixes and assign the same index to all sequences
# that start with the same prefix
for (i in 1:length(prefixes)) {
  prefix <- prefixes[i]
  seqs <- sequences[sapply(sequences, get_prefix) == prefix]
  indices[seqs] <- which.min(indices[seqs])
}
# Print the final indices
print(indices)
Any help is welcome! Thanks!

This problem relates to grouping using relational data. You can use grep + igraph to do so:
library(igraph)
sapply(sequences, grep, sequences, value = TRUE) |>
  stack() |>
  graph.data.frame() |>
  clusters() |>
  getElement("membership") |>
  stack()
values ind
1 1 AAAAAAAAAAAAAAA
2 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAACCC
4 1 AAAAAAAAAAAAAAAAAAAAAAAAACC
5 2 CATTTTCAG
6 2 CATTTTCAGTCAAAATTT
7 3 CATG
8 3 CATGG
9 3 CATGGGTT
10 4 GATC
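For reference, here is a dependency-free base R sketch of the same grouping, assuming (as in your example) that two sequences belong together when one contains the other as a substring. It makes a single greedy pass, so unlike the igraph solution it does not chase transitive links between groups:
group_ids <- integer(length(sequences))
next_id <- 1
for (i in seq_along(sequences)) {
  if (group_ids[i] == 0) {  # sequence not yet assigned to a group
    # sequences that contain sequences[i], plus those contained in it
    hits <- grepl(sequences[i], sequences, fixed = TRUE) |
      sapply(sequences, grepl, x = sequences[i], fixed = TRUE)
    group_ids[hits & group_ids == 0] <- next_id
    next_id <- next_id + 1
  }
}
group_ids
# [1] 1 1 1 1 2 2 3 3 3 4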


Is there an R function equivalent to Excel's $ for "keep reference cell constant" [duplicate]

This question already has answers here: Divide each data frame row by vector in R (5 answers). Closed 2 years ago.
I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation in R: holding a reference row constant while performing an operation on all other rows, then applying this across columns while still holding the reference row constant? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal <- (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 <- grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep(), which applies an operation (here division) across one margin of the data frame using a vector of values:
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal=c(34,30,10),
EXam3=c(20,22,26),
EXam4=c(10,20,30))
outof <- c(34,26,31)
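Note that in the original question outof comes from slice(grades, 1) and is therefore a one-row data frame rather than a vector; if you keep it that way, unlist() converts it for sweep() (a minimal sketch):
100 * sweep(grades, 2, unlist(outof), "/")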
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
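In general form, dropping the hard-coded 2:31 and scaling to percent (a sketch, assuming the totals sit in row 1 as in the question's original grades):
as.data.frame(mapply("/", grades[-1, ], grades[1, ]) * 100)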
The easiest way is to use some type of loop. In this case I am using the sapply function to divide all of the elements in each column by the corresponding total score.
# Example data
outof <- data.frame(q1 = c(3), q2 = c(5))
grades <- data.frame(q1 = c(1, 2, 3), q2 = c(4, 4, 5))
answermatrix <- sapply(1:ncol(grades), function(i) {
  # grades[, i] / outof[i]  # use this if "outof" is a vector
  grades[, i] / outof[, i]
})
answermatrix
A loop would probably be your best bet.
First, extract the total points possible, listed in the first row, then use that number to calculate the percentage in the remaining rows of each column:
j <- 2  # the first row of student scores (Row 1 holds the totals)
for (i in 1:ncol(df)) {
  a <- df[1, i]  # the total points possible for this column
  j <- 2         # restart at the first student row for each column
  while (j <= nrow(df)) {
    b <- df[j, i]                # the score that corresponds with each student
    df[j, i] <- ((b / a) * 100)  # replace that row,column with the percentage
    j <- j + 1                   # go to the next row
  }
}
The only drawback to this approach is that data frames produced inside functions aren't copied to the global environment, but that can be fixed by introducing a function like so:
f1 <- function(x, y) {  # x: the data frame; y: the name you want for the completed data frame
  for (i in 1:ncol(x)) {
    a <- x[1, i]
    j <- 2
    while (j <= nrow(x)) {
      x[j, i] <- ((x[j, i] / a) * 100)
      j <- j + 1
    }
  }
  arg_name <- deparse(substitute(y))       # gets the argument name
  assign(arg_name, x, envir = .GlobalEnv)  # produces a global data frame
}
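Hypothetical usage, with grades_pct standing in for whatever name you want the result to have:
f1(grades, grades_pct)  # creates `grades_pct` in the global environment
head(grades_pct)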

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe, and keeps only those rows whose values differ by at least, say, 10, starting with the first row. Thus it would start with the first row (and store it), then carry on until it finds a row with a value at least 10 higher than the first, store this row, then start again from this value looking for the next row that differs by more than 10.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
  # create new objects to store output
  new.df <- list()
  new.df1 <- data.frame()
  # iterate through each row of df
  for (i in 1:nrow(df)) {
    # keep if the value of the next row is >= the value of row i + pos.diff,
    # if the values are not ascending, or if this is the first row
    if (isTRUE(df$pos[i + 1] >= df$pos[i] + pos.diff | df$pos[i + 1] < df$pos[i] | i == 1)) {
      # add rows that meet the conditions to the list
      new.df[[i]] <- df[i, ]
    }
  }
  # bind all rows that met the conditions
  new.df1 <- bind_rows(new.df)
  return(new.df1)
}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier. Since we don't want to calculate the difference between consecutive rows, you can try:
nrows <- 1
previous_match <- 1
for (i in 2:nrow(df)) {
  if (df$pos[i] - df$pos[previous_match] > 10) {
    nrows <- c(nrows, i)
    previous_match <- i
  }
}
and then subset the selected rows :
df[nrows, ]
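Wrapped into a reusable function following the same logic (pos_apart is a hypothetical name; pos.diff generalizes the hard-coded 10):
pos_apart <- function(df, pos.diff) {
  nrows <- 1
  previous_match <- 1
  for (i in 2:nrow(df)) {
    if (df$pos[i] - df$pos[previous_match] > pos.diff) {
      nrows <- c(nrows, i)
      previous_match <- i
    }
  }
  df[nrows, ]
}
head(pos_apart(df, 10))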
Earlier answer
We can use diff to get the difference between consecutive rows and select the row which has difference of greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The leading TRUE selects the first row by default.
In dplyr, we can use lag to get value from previous row :
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)

How do I find the highest number of commas appearing in a row of a single column of a data frame in R?

I want to find the maximum number of commas appearing in a row of a single column.
For example,
Cars
1 Bugatti (4)","Ferrari (7)","Audi (10)
2 Toyota (6)
3 Tesla (9)","Mercedes(8)
4 Suzuki (11)","Mitsubishi (19)","Ford (7)","BMW (6)
For the column above, the maximum number of commas in a row is 3, and it occurs on row 4. How do I achieve this on much larger data (4000+ rows)?
You can use gregexpr() to return a vector of the positions of the comma(s) in each string. Then you can apply the length() function to count up the commas:
sapply(gregexpr(",", df$cars), length)
## 2 1 1 3
To answer the exact question asked, just wrap the above line of code in max() to determine the maximum number of times a comma appeared in one of your strings.
The above actually returns a "1" when a "0" is expected. There is probably a more elegant solution, but here's a function that will handle zeros correctly:
count_commas <- function(x) {
  y <- sapply(gregexpr(",", x), as.integer)  # get positions of commas (-1 when there are none)
  y <- lapply(y, function(y) if (y[1] == -1) NULL else y)  # drop the -1 no-match markers
  return(sapply(y, length))  # return count of commas
}
count_commas(df$cars)
# 2 0 1 3
My idea is to remove the non-comma characters and calculate the number of chars.
I have no clue which class of object you are using for cars. Assuming your input is
cars <- c(' Bugatti (4)","Ferrari (7)","Audi (10)','Toyota (6)','Tesla (9)","Mercedes(8)','Suzuki (11)","Mitsubishi (19)","Ford (7)","BMW (6)')
then you can use nchar(gsub("[^,]","", cars)) to get the number of commas of each row.
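For reference, base R's regmatches() gives the same zero-safe count in one line, and combined with max() and which.max() it answers both parts of the question (a sketch, assuming the cars vector defined above):
counts <- lengths(regmatches(cars, gregexpr(",", cars)))
counts             # 2 0 1 3
max(counts)        # 3: the most commas in a single row
which.max(counts)  # 4: the row where that maximum occurs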

How can I check for cluster patterns in a sequence of numbers and obtain the next value?

Given a set of sequences
seq1 <- c(3,3,3,7,7,7,4,4)
seq2 <- c(17,17,77,77,3)
seq3 <- c(5,5,23)
How can we create a function to check each sequence for cluster patterns and predict the next value of the sequence, which in this case would be 4, 3, and 23 respectively?
Edit: The sequence should first be checked for cluster patterns; if it does not contain this class of pattern, then the sequence should be ignored or passed on to another function.
Edit 2: A pattern is defined by more than 1 of the same consecutive number, always grouped consistently, e.g. 1,1,1,2,2,2,3,3,3 is a pattern but 1,1,2,2,2,3,3 is not.
Here's a way with rle in base R. It checks whether all run-lengths except the last are equal and, if TRUE, repeats the last value so that it follows the same pattern as the others:
rl <- rle(seq1)$lengths
# check if all run-lengths, except the last, are equal
if (all(head(rl, -1) == rl[1])) {
  c(seq1, rep(seq1[length(seq1)], diff(range(rl))))
} else {
  # do something else
}
# [1] 3 3 3 7 7 7 4 4 4
The same approach applies for seq2 and seq3.
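To apply this to every sequence, the same rle logic can be wrapped in a function (a sketch; predict_next is a hypothetical name). It returns the value(s) needed to complete the pattern, or NULL when the sequence does not contain this class of pattern:
predict_next <- function(s) {
  rl <- rle(s)$lengths
  # cluster pattern: every run-length except possibly the last is equal
  if (all(head(rl, -1) == rl[1])) {
    rep(s[length(s)], diff(range(rl)))  # value(s) that complete the last run
  } else {
    NULL  # not a cluster pattern; ignore or pass to another function
  }
}
predict_next(seq1)  # 4
predict_next(seq2)  # 3
predict_next(seq3)  # 23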

Quickly find value from data.frame in list with matching element names

I have a list with x number of named elements, with each element containing a series of numbers.
I also have a data.frame containing 2 columns:
column1: names matching those of the list elements (though not in order)
column2: a vector of numbers
For each row of my data.frame, I want to quickly determine the position of that row's value within the list element whose name matches the row's name column.
The end goal is actually to produce a vector containing, for each row of my data.frame, the list value immediately preceding the one I've matched.
My data has 200,000 rows so I'm trying to optimize this process.
Example
I have a list and data.frame:
a = 1:5; b = 6:10; c = 4:8; l1 <- list(a,b,c) # a list
d1 <- data.frame(name = c('c','a','b'), val = c(7,3,8)) #a data.frame
So first I want to know where each value occurs in the list (such that the element matches the name from the same row in the data.frame):
where <- ????
>where
[1] 4 3 3 # 7 = 4th number in c, 3 = 3rd # in a, and 8 = 3rd # in b
But ultimately I want the output to show me the value in the element preceding the one I've matched:
which <- ????
>which
[1] 6 2 7
To have a list with named items, you can use this syntax:
l1 <- list(a=a,b=b,c=c)
Then you can use mapply() to test each item:
mapply(function(n, v) which(l1[[n]] == v), d1$name, d1$val)
[1] 4 3 3
Then mapply() again to get values:
mapply(function(n, i) l1[[n]][i], d1$name,
       mapply(function(n, v) which(l1[[n]] == v) - 1, d1$name, d1$val))
[1] 6 2 7
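The two mapply() calls can also be combined into a single pass; match() does the same positional lookup as which() here, and as.character() guards against name being a factor in older versions of R (a sketch):
mapply(function(n, v) {
  idx <- match(v, l1[[n]])  # position of the value in the matching element
  l1[[n]][idx - 1]          # the value immediately preceding it
}, as.character(d1$name), d1$val)
#  c  a  b
#  6  2  7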
