Related
Before posting my question, I would just like to mention that I have looked through the "Similar questions" tab and have not quite found what I am looking for. I found something somewhat similar here, but it is in Python. There was also a nice idea here that may help as a last resort. In any case, I would first like to check whether there is a more straightforward way to do it.
To the problem:
Say there are 2 different data frames: (1) Ref_seq; and (2) Variants:
>Ref_seq
Seq_name AA_seq
1 Ref1 VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ
2 Ref2 SNFPHLVLEKILVSLTMKNCKAAMNFFQ
3 Ref3 RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS
4 Ref4 HCTSVSKEVEGTSYHESLYNALQSLRDR
5 Ref5 DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC
6 Ref6 HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS
7 Ref7 SNILLKDILSVRKYWCEISQQQWLELFSVY
8 Ref8 LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT
9 Ref9 EDQSSMNLFNDYPDSSVSDANEPGESQSTIG
10 Ref10 SLSEKSKEETGISLQDLLLEIYRSIGEPDSL
>Variants
peptideID AA_seq
1 Pep1 QEISALVKYF
2 Pep2 HTGERGNLVT
3 Pep3 NKMTTSVLIK
4 Pep4 SMNLKNDYPD
5 Pep5 NEPGYSQSTI
6 Pep6 NPQDVIMVKL
7 Pep7 MAAKFNKMTL
8 Pep8 RRQKDPSSGT
9 Pep9 QQQWTELFSV
The first data frame contains the amino acid (aa) sequences from a reference organism, whilst the second contains the aa sequences from a test organism. It is known that each sequence in the Variants object (a) contains at least one aa change, (b) shares at least 4 matching characters with its reference sequence from Ref_seq, and (c) may match forwards or backwards (e.g. the aa sequence on line 3 of Variants).
I am trying to find a way to look up which reference sequence (Seq_name) each peptideID belongs to. The result should look like this:
peptideID AA_seq Seq_name
1 Pep1 QEISALVKYF Ref1
2 Pep2 HTGERGNLVT Ref5
3 Pep3 NKMTTSVLIK Ref2
4 Pep4 SMNLKNDYPD Ref9
5 Pep5 NEPGYSQSTI Ref9
6 Pep6 NPQDVIMVKL Ref6
7 Pep7 MAAKFNKMTL Ref2
8 Pep8 RRQKDPSSGT Ref3
9 Pep9 QQQWTELFSV Ref7
I thought that maybe a regex coupled with a loop over each peptideID might work, since the pattern changes with each peptide, but I cannot wrap my head around it.
Any help will be very welcome!
Data from the example:
Ref_seq <- data.frame(Seq_name=paste0("Ref",1:10), AA_seq=c("VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ", "SNFPHLVLEKILVSLTMKNCKAAMNFFQ", "RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS", "HCTSVSKEVEGTSYHESLYNALQSLRDR", "DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC", "HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS", "SNILLKDILSVRKYWCEISQQQWLELFSVY", "LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT", "EDQSSMNLFNDYPDSSVSDANEPGESQSTIG", "SLSEKSKEETGISLQDLLLEIYRSIGEPDSL"))
Variants <- data.frame(peptideID=paste0("Pep",1:9), AA_seq=c("QEISALVKYF", "HTGERGNLVT", "NKMTTSVLIK", "SMNLKNDYPD", "NEPGYSQSTI", "NPQDVIMVKL", "MAAKFNKMTL", "RRQKDPSSGT", "QQQWTELFSV"))
You can try something like this. It takes the AA sequences from var and compares them with those in Ref_seq, including reverse matching, using agrep for fuzzy matching.
data.frame(var, Seq_name = unlist(sapply(var$AA_seq, function(x) {
  a <- !anyNA(agrep(x, Ref_seq$AA_seq)[1])
  ifelse(a,
         Ref_seq[agrep(x, Ref_seq$AA_seq)[1], ],
         Ref_seq[agrep(paste0(rev(strsplit(x, "")[[1]]), collapse = ""),
                       Ref_seq$AA_seq)[1], ])
})))
peptideID AA_seq Seq_name
1 Pep1 QEISALVKYF Ref1
2 Pep2 HTGERGNLVT Ref5
3 Pep3 NKMTTSVLIK Ref2
4 Pep4 SMNLKNDYPD Ref9
5 Pep5 NEPGYSQSTI Ref9
6 Pep6 NPQDVIMVKL Ref6
7 Pep7 MAAKFNKMTL Ref2
8 Pep8 RRQKDPSSGT Ref3
9 Pep9 QQQWTELFSV Ref7
Although this works for this example, I would suggest searching for a Bioconductor package (such as Biostrings) that does what you want. There are many tricky situations that those packages already solve.
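For reference, a minimal sketch of the two ingredients used above: agrep() does approximate matching (its default max.distance = 0.1 allows one edit for a 10-character peptide), and reversing the string handles the backwards matches:

```r
# Fuzzy match: the pattern differs from the reference by one substitution
agrep("QEISALVKYF", "VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ")  # 1 (match)
agrep("QEISALVKYF", "HCTSVSKEVEGTSYHESLYNALQSLRDR")           # integer(0)

# Reverse a peptide before matching to cover the backwards case (e.g. Pep3)
rev_pep <- paste(rev(strsplit("NKMTTSVLIK", "")[[1]]), collapse = "")
rev_pep                                         # "KILVSTTMKN"
agrep(rev_pep, "SNFPHLVLEKILVSLTMKNCKAAMNFFQ")  # 1 (matches Ref2)
```
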
Data
Ref_seq <- structure(list(Seq_name = c("Ref1", "Ref2", "Ref3", "Ref4", "Ref5",
"Ref6", "Ref7", "Ref8", "Ref9", "Ref10"), AA_seq = c("VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ",
"SNFPHLVLEKILVSLTMKNCKAAMNFFQ", "RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS",
"HCTSVSKEVEGTSYHESLYNALQSLRDR", "DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC",
"HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS", "SNILLKDILSVRKYWCEISQQQWLELFSVY",
"LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT", "EDQSSMNLFNDYPDSSVSDANEPGESQSTIG",
"SLSEKSKEETGISLQDLLLEIYRSIGEPDSL")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
var <- structure(list(peptideID = c("Pep1", "Pep2", "Pep3", "Pep4",
"Pep5", "Pep6", "Pep7", "Pep8", "Pep9"), AA_seq = c("QEISALVKYF",
"HTGERGNLVT", "NKMTTSVLIK", "SMNLKNDYPD", "NEPGYSQSTI", "NPQDVIMVKL",
"MAAKFNKMTL", "RRQKDPSSGT", "QQQWTELFSV")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
I believe this is fairly simple, although I am new to using R and code. I have a dataset which has a single row for each rodent trap site. There were, however, 8 occasions of trapping over 4 years. What I wish to do is expand the trap site data and append a number from 1 to 8 to each row.
Then I can label them with the trap visit for a subsequent join with the obtained trap data.
I have managed to replicate the rows with the following code. And while the rows are expanded in the data frame to 1, 1.1...1.7, 2, 2.1...2.7 etc., I cannot figure out how to convert this to a usable column-based ID.
structure(list(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA",
"IA5sA"), Y = c(-12.1355987315, -12.1356879776, -12.1357664998,
-12.1358823313, -12.1359720852), X = c(-69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532)), row.names = c(NA,
5L), class = "data.frame")
gps_1 <- gps_1[rep(seq_len(nrow(gps_1)), 3), ]
gives
"IA5sA", "IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA", "IA1sA",
"IA2sA", "IA3sA", "IA4sA", "IA5sA"), Y = c(-12.1355987315, -12.1356879776,
-12.1357664998, -12.1358823313, -12.1359720852, -12.1355987315,
-12.1356879776, -12.1357664998, -12.1358823313, -12.1359720852,
-12.1355987315, -12.1356879776, -12.1357664998, -12.1358823313,
-12.1359720852), X = c(-69.1335789865, -69.1335225279, -69.1334668485,
-69.1333847769, -69.1333226532, -69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532, -69.1335789865,
-69.1335225279, -69.1334668485, -69.1333847769, -69.1333226532
)), row.names = c("1", "2", "3", "4", "5", "1.1", "2.1", "3.1",
"4.1", "5.1", "1.2", "2.2", "3.2", "4.2", "5.2"), class = "data.frame")
I have a column with Trap_ID currently being a unique identifier. I hope that after the replication I could append an iteration number to this to keep it as a unique ID.
For example:
Trap_ID
IA1sA.1
IA1sA.2
IA1sA.3
IA2sA.1
IA2sA.2
IA2sA.3
Simply use a cross join (i.e., a join with no by columns, returning the Cartesian product of both sets):
mdf <- merge(data.frame(Trap_ID = 1:8), trap_side_df, by=NULL)
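A worked sketch on the five sites from the dput above (trap_side_df here is the same data frame the question calls gps_1), including building the requested Trap_ID:

```r
trap_side_df <- data.frame(
  TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA"),
  Y = c(-12.1355987315, -12.1356879776, -12.1357664998,
        -12.1358823313, -12.1359720852),
  X = c(-69.1335789865, -69.1335225279, -69.1334668485,
        -69.1333847769, -69.1333226532)
)

# Cross join: by = NULL returns the Cartesian product, 5 sites x 8 visits = 40 rows
mdf <- merge(data.frame(visit = 1:8), trap_side_df, by = NULL)

# Append the visit number to get a unique ID in the requested form
mdf$Trap_ID <- paste(mdf$TrapCode, mdf$visit, sep = ".")

nrow(mdf)                   # 40
head(sort(mdf$Trap_ID), 3)  # "IA1sA.1" "IA1sA.2" "IA1sA.3"
```
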
I need to know if each level of a factor provides increasing values. I've seen How to check if a sequence of numbers is monotonically increasing (or decreasing)? but I don't know how to apply it to the single levels only.
Let's say there is a data frame df which is divided by person. Each person has a height recorded over several years. Now I want to check whether the data set is correct. Therefore I need to know if the height values are increasing - per person:
I tried
Results<- by(df, df$person,
function(x) {data = x,
all(x == cummax(height))
}
)
but it does not work. And
Results<- by(df, df$person,
all(height == cummax(height))
}
)
also fails. I get an error that height cannot be found.
What am I doing wrong here?
A small data extraction:
Serial_number Amplification Voltage
1 608004648 111.997 379.980
2 608004648 123.673 381.968
3 608004648 137.701 383.979
4 608004648 154.514 385.973
5 608004648 175.331 387.980
6 608004648 201.379 389.968
7 608004649 118.753 378.080
8 608004649 131.739 380.085
9 608004649 147.294 382.082
10 608004649 166.238 384.077
11 608004649 189.841 386.074
12 608004649 220.072 388.073
13 608004650 115.474 382.066
14 608004650 127.838 384.063
15 608004650 142.602 386.064
16 608004650 160.452 388.056
17 608004650 182.732 390.060
18 608004650 211.035 392.065
Serial_number is the factor and I want to check each serial number if the corresponding amplification values are increasing.
We can do this with a group by operation
library(dplyr)
df %>%
group_by(Serial_number) %>%
summarise(index = all(sign(Amplification -
lag(Amplification, default = first(Amplification))) >= 0))
Or with by from base R. Because the complete dataset is passed in, x (the anonymous function's argument) is each group's subset of the data, from which we can extract the column of interest with $ or [[:
by(df, list(df$Serial_number), FUN = function(x) all(sign(diff(x$Amplification))>=0))
Or using data.table
library(data.table)
setDT(df)[, .(index = all(sign(Amplification - shift(Amplification,
fill = first(Amplification))) >=0)), .(Serial_number)]
data
df <- structure(list(Serial_number = c(608004648L, 608004648L, 608004648L,
608004648L, 608004648L, 608004648L, 608004649L, 608004649L, 608004649L,
608004649L, 608004649L, 608004649L, 608004650L, 608004650L, 608004650L,
608004650L, 608004650L, 608004650L), Amplification = c(111.997,
123.673, 137.701, 154.514, 175.331, 201.379, 118.753, 131.739,
147.294, 166.238, 189.841, 220.072, 115.474, 127.838, 142.602,
160.452, 182.732, 211.035), Voltage = c(379.98, 381.968, 383.979,
385.973, 387.98, 389.968, 378.08, 380.085, 382.082, 384.077,
386.074, 388.073, 382.066, 384.063, 386.064, 388.056, 390.06,
392.065)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18"))
What about something like
vapply(unique(df$person),
function (k) all(diff(df$height[df$person == k]) >= 0), # or '> 0' if strictly mon. incr.
logical(1))
# returns
[1] TRUE FALSE FALSE
with
set.seed(123)
df <- data.frame(person = c("A","B", "C","A","A","C","B"), height = runif(7, 1.75, 1.85))
df
person height
1 A 1.778758
2 B 1.828831
3 C 1.790898
4 A 1.838302
5 A 1.844047
6 C 1.754556
7 B 1.802811
I want to classify my data by the minimum distance to known centers.
How can I implement this in R?
The centers data:
> centers
X
1 -0.78998176
2 2.40331380
3 0.77320007
4 -1.64054294
5 -0.05343331
6 -1.14982180
7 1.67658736
8 -0.44575567
9 0.36314671
10 1.18697840
The data to be classified:
> Y
[1] -0.7071068 0.7071068 -0.3011463 -0.9128686 -0.5713978 NA
The result I expect:
1. Find the closest center (minimum absolute difference) for each item in Y.
2. Assign the row number of that center to each item in Y.
expected result:
> Y
[1] 1 3 8 1 8 NA
Y <- c(-0.707106781186548, 0.707106781186548, -0.301146296962689,
-0.912868615826101, -0.571397763410073, NA)
centers <- structure(c(-0.789981758587318, 2.40331380121291, 0.773200070034431,
-1.64054294268215, -0.0534333085941505, -1.14982180092619, 1.67658736336158,
-0.445755672120908, 0.363146708827924, 1.18697840480949), .Dim = c(10L,
1L), .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10"), "X"))
sapply(Y, function(y) {
  r <- which.min(abs(y - centers))
  ifelse(is.na(y), NA, r)
})
Essentially, you are applying which.min to each element of Y to determine which center has the smallest absolute distance. Ties go to the earlier element in the list. NA values need to be handled separately, which is why there is a second statement with ifelse.
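The NA handling can also be pulled out front, which keeps which.min() from ever seeing an all-NA distance vector (a variant sketch using the same Y and centers values as above):

```r
Y <- c(-0.7071068, 0.7071068, -0.3011463, -0.9128686, -0.5713978, NA)
centers <- c(-0.78998176, 2.40331380, 0.77320007, -1.64054294, -0.05343331,
             -1.14982180, 1.67658736, -0.44575567, 0.36314671, 1.18697840)

# Row number of the nearest center; NA inputs short-circuit to NA
nearest <- function(y, ctr) if (is.na(y)) NA_integer_ else which.min(abs(y - ctr))

vapply(Y, nearest, integer(1), ctr = centers)
# [1]  1  3  8  1  8 NA
```
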
This is not clustering but nearest-neighbour classification. See the knn function.
I'm new to R and I'm stuck with a problem I can't solve by myself.
A friend recommended that I use one of the apply functions; I just don't get how to use it in this case. Anyway, on to the problem! =)
Inside the inner while loop, I have an ifelse. That is the bottleneck: it takes on average 1 second to run each iteration. The slow part is marked with #slow part start/end in the code.
Given that we will run it 2000*100 = 200,000 times, it will take approximately 55.5 hours to finish each time we run this code. The bigger problem is that this will be reused a lot, so x*55.5 hours is just not doable.
Below is a fraction of the code relevant to the question
#dt is data.table with close to 1.5million observations of 11 variables
#rand.mat is a 110*100 integer matrix
j <- 1
while(j <= 2000)
{
#other code is executed here, not relevant to the question
i <- 1
while(i <= 100)
{
#slow part start
dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
#slow part end
i <- i + 1
}
#other code is executed here, not relevant to the question
j <- j + 1
}
Please, any advice would be greatly appreciated.
EDIT - Run below code to reproduce problem
library(data.table)
dt = data.table(datecolumn=c("20121101", "20121101", "20121104", "20121104", "20121130",
"20121130", "20121101", "20121101", "20121104", "20121104", "20121130", "20121130"), column2=c("5",
"3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"), column3=c("5",
"3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"), column4=c
("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2"))
unq_date <- c(20121101L,
20121102L, 20121103L, 20121104L, 20121105L, 20121106L, 20121107L,
20121108L, 20121109L, 20121110L, 20121111L, 20121112L, 20121113L,
20121114L, 20121115L, 20121116L, 20121117L, 20121118L, 20121119L,
20121120L, 20121121L, 20121122L, 20121123L, 20121124L, 20121125L,
20121126L, 20121127L, 20121128L, 20121129L, 20121130L
)
index <- as.numeric(dt$column4)
numberOfRepititions <- 2
set.seed(131107)
rand.mat <- replicate(numberOfRepititions, sample(unq_date, numberOfRepititions))
i <- 1
while(i <= numberOfRepititions)
{
dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
i <- i + 1
}
Notice that we won't be able to run the loop more than 2 times now unless dt grows in rows, so that we have the initial 100 values of column4 (which is just an integer value 1-100).
Here is one proposal based on your small example dataset. I tried to vectorize the operations. As in your example, numberOfRepititions represents the number of loop runs.
First, create matrices for all necessary evaluations. dt$datecolumn is compared with all columns of rand.mat:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
Here, dt$column4 is compared with all values of the vector index:
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
Both matrices are combined with logical and. Afterwards, we calculate whether there is at least one TRUE:
replace_idx <- rowSums(rmat & imat) != 0
Use the created index to replace corresponding values with NA:
is.na(dt$column2) <- replace_idx
Done.
The code in one chunk:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
replace_idx <- rowSums(rmat & imat) != 0
is.na(dt$column2) <- replace_idx
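The two building blocks on a tiny example: combining logical matrices elementwise, and using the is.na()<- replacement form to blank out the flagged rows:

```r
# Column k is TRUE where a row met condition k
rmat <- matrix(c(TRUE, FALSE, TRUE,  FALSE, TRUE, FALSE), nrow = 3)
imat <- matrix(c(TRUE, TRUE,  FALSE, TRUE,  TRUE, TRUE),  nrow = 3)

# A row qualifies if both conditions held together in at least one column
replace_idx <- rowSums(rmat & imat) != 0
replace_idx
# [1]  TRUE  TRUE FALSE

# is.na()<- sets the flagged positions to NA in place
v <- c(5, 3, 4)
is.na(v) <- replace_idx
v
# [1] NA NA  4
```
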
I think you can do it in one line like this:
dt[which(apply(dt, 1, function(x) x[1] %in% rand.mat[,as.numeric(x[4])])),]$column3<-NA
basically, the apply function works as follows, by argument:
1) uses the data from "dt"
2) "1" means apply by row
3) the function receives each row as 'x' and returns TRUE if your criteria are met
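One caveat worth knowing with this approach: apply() first coerces the data.frame to a character matrix, which is why the row arrives as a character vector and columns are picked by position (x[1], x[4]); a quick illustration:

```r
df <- data.frame(datecolumn = "20121101", column2 = 5, column3 = 4, column4 = 1)

# Each row is handed to the function as a character vector
apply(df, 1, function(x) is.character(x))  # TRUE
apply(df, 1, function(x) x[4])             # "1" (column4, as a string)
```
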