I have a data frame with sequences as columns and amino acid sites as rows. I would like to compare the difference between these sequences at each site.
   seq1 seq2 seq3 seq4 seq5 seq6 seq7 seq8
1     K    E    K    K    A    A    A    A
2     V    D    A    A    T    A    A    A
3     W    W    W    W    W    W    W    W
4     R    R    R    R    R    R    S    R
5     F    S    F    F    F    Y    F    F
6     P    P    P    P    P    P    P    P
7     N    N    N    C    N    N    N    N
8     V    I    D    D    Q    Q    Q    Q
9     Q    Q    Q    Q    Q    Q    Q    Q
10    E    E    G    G    L    I    S    F
11    L    L    Q    L    L    L    L    L
12    N    N    Y    Y    V    V    S    S
13    N    N    N    N    Q    Q    P    P
14    L    L    L    L    L    L    L    L
15    T    T    T    T    T    T    T    I
Ideally, I would like an additional column in my data frame that shows which sites are the same in all sequences and which are the same only within seq1-4 or within seq5-8.
I am not sure what the best way to do this is, and any help is greatly appreciated.
Also, is there a way to add another column that shows the types of amino acids observed at each site?
Thanks in advance!
First, I build a vector that flags the rows where all columns are the same:
allsame <- apply(df,1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
Next, I build a vector that flags the rows where either column set (seq1-4 or seq5-8) is the same:
startfour <- apply(df[,1:4],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
lastfour <- apply(df[,5:8],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
gen <- startfour + lastfour
eithersame <- ifelse(gen == 0,0,1)
Finally, you can create the required column vector from the two vectors above and join it to the data frame:
output <- character(length(allsame))
for(i in seq_along(allsame)){
if(allsame[i] == 1){
output[i] <- "all same"
}
else if(eithersame[i] == 1){
output[i] <- "either same"
}
else{
output[i] <- "none same"
}
}
df <- cbind(df,output)
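The loop above can also be written without explicit iteration. This is a minimal sketch, assuming the sequence columns hold character data; the three-site data frame and the helper name row_same are made up for illustration and are not part of the original answer.

```r
# Hypothetical three-site example; columns mirror seq1..seq8 from the question.
df <- data.frame(
  seq1 = c("K", "W", "N"), seq2 = c("E", "W", "N"),
  seq3 = c("K", "W", "C"), seq4 = c("K", "W", "N"),
  seq5 = c("A", "W", "Q"), seq6 = c("A", "W", "Q"),
  seq7 = c("A", "W", "P"), seq8 = c("A", "W", "P"),
  stringsAsFactors = FALSE
)

# TRUE for each row in which every value is identical (illustrative helper).
row_same <- function(d) apply(d, 1, function(x) length(unique(x)) == 1)

allsame    <- row_same(df)
eithersame <- row_same(df[, 1:4]) | row_same(df[, 5:8])

# Nested ifelse() replaces the for loop and the if/else chain.
output <- ifelse(allsame, "all same",
                 ifelse(eithersame, "either same", "none same"))
df$output <- output
```

Using logical vectors instead of 0/1 flags lets the two group checks be combined with | directly.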
Here is a quick and dirty way to create the flags that you mentioned. Assuming the dataframe is called amino:
amino$first_flag<-with(amino,ifelse(seq1==seq2 & seq2==seq3 & seq3 == seq4,"same","diff"))
amino$second_flag<-with(amino,ifelse(seq5==seq6 & seq6==seq7 & seq7 == seq8,"same","diff"))
amino$total_flag<-with(amino,ifelse(first_flag=="same" & second_flag=="same" & seq1==seq5,"same","diff"))
Hopefully that works.
edit: and for your last question, I'm not sure what you mean but if you just want the letters that appear in each row then something like this could work:
for(i in 1:nrow(amino)) amino$types[i]<-paste(unique(unlist(amino[i,1:8])),collapse=",")
It will give you a column containing a comma separated list of the letters that appeared in each row.
edit2: If you have significantly more than 8 sequences, then a modified form of Ganesh's solution might work better (his output code isn't actually necessary):
amino$first_flag <- apply(amino[,1:4],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$second_flag <- apply(amino[,5:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$total_flag <- apply(amino[,1:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$types <- apply(amino[,1:8],1,function(x) paste(unique(x),collapse=","))
And for your new question-
amino$one_diff <- apply(amino[,1:8],1,function(x){
ifelse(7 %in% as.data.frame(table(x))[,2,drop=TRUE],"1 diff",NA)
})
This uses the table() function which normally gives you a count based on a vector or a column like table(amino$seq1). Using apply, we instead stick a row of the 8 sequences into it, it returns the counts, then we use as.data.frame and the brackets [] to get rid of some extra table() output that we don't need. The "7 %in%" part means if there are 7 of the same letters then there must be 1 different one. Anything else (i.e., all 8 same or more than 1 difference) will get NA.
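To see what table() returns for a single site, here is a small sketch using the values of site 4 from the question. The variable names counts and has_one_diff are only illustrative.

```r
# Site 4 from the question: seven sequences agree, one differs.
x <- c("R", "R", "R", "R", "R", "R", "S", "R")

# table() counts how often each letter occurs in the row.
counts <- table(x)

# If any letter occurs exactly 7 times out of 8, exactly one sequence differs.
has_one_diff <- 7 %in% as.vector(counts)
```

The as.vector() call just drops the table attributes so that %in% compares plain integers.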
I have an ini-file, read as a list by R (in the example l). Now I want to add further sub-lists along a vector (m) and assign always the same constant to them. My attempt so far:
l <- list("A")
m <- letters[1:5]
n <- 5
for (i in 1:5){
assign(paste0("l$A$",m[i]), n)
}
# which does not work
# example of the desired outcome:
> l$A$e
[1] 5
I don't think that I have fully understood how lists work yet...
Try
L[["A"]][m] <- n
L$A$e
# [1] 5
Data:
L <- list(A = list())
m <- letters[1:5]
n <- 5
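As for why the assign() attempt does not work: assign() treats its first argument as a plain variable name, so it creates a new object with that literal name instead of following the $ path into the list. A minimal sketch of the difference, reusing the data above:

```r
L <- list(A = list())
m <- letters[1:5]
n <- 5

# assign() creates a separate variable literally named "L$A$a";
# the string is not parsed as a path into the list L.
assign("L$A$a", n)
odd_name_exists <- exists("L$A$a")
length_before   <- length(L$A)   # L$A is still empty at this point

# Indexing the sub-list by a character vector sets all five elements at once.
L[["A"]][m] <- n
```

After the last line, L$A has the elements a to e, each holding the value of n.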
I saw a loop in a demo code:
b <- 3
n <- 4
set.seed(1)
(i <- sample(rep(1:n,
b)) )
(g <- rep(1:b,
each=n) )
(x <- rnorm(n) )
m <- rep(NA, max(g))
for (j in 1:max(g) ) {
k <- i[ g == j ]
m[j] <- mean(x[k])
print (j)
print (k)
}
Here max(g) is 3, so the loop runs 3 times. But I don't understand the second line of the loop, k <- i[ g == j ]. What does it mean? Thank you!
i is a vector (created by sample(rep(1:n, b))).
i[<something>] indexes the elements of i for which <something> evaluates to TRUE (in this case when g is equal to j).
g is another vector (created by rep(1:b, each=n)).
So
k <- i[ g == j ]
creates, for each value of j as the for loop runs (these values are 1:max(g)), a vector k which is the subset of i for which the condition g == j is true.
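A small sketch with fixed values may make this easier to follow. The numbers here are made up and are not the ones produced by the demo code:

```r
i <- c(10, 20, 30, 40, 50, 60)
g <- c(1, 1, 2, 2, 3, 3)
j <- 2

# g == j is a logical vector, TRUE only where g equals 2 (positions 3 and 4).
mask <- g == j

# Indexing i with that logical vector keeps the elements at the TRUE positions.
k <- i[mask]
```

So k holds the elements of i that sit at the same positions as the 2s in g.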
I am looking for an idiomatic way to join a column, say named 'x', which exists in every data.frame element of a list. I came up with a solution with two steps by using lapply and Reduce. The second attempt trying to use only Reduce failed. Can I actually use only Reduce with one anonymous function to do this?
#data
xs <- replicate(5, data.frame(x=sample(letters, 10, T), y =runif(10)), simplify = FALSE)
# This works, but may be still unnecessarily long
otmap = lapply(xs, function(df) df$x)
jotm = Reduce(c, otmap)
# This does not count as another solution:
jotm = Reduce(c, lapply(xs, function(df) df$x))
# Try to use only Reduce function. This produces an error
jotr =Reduce(function(a,b){c(a$x,b$x)}, xs)
# Error in a$x : $ operator is invalid for atomic vectors
We can unlist after extracting the 'x' column
unlist(lapply(xs, `[[`, 'x'))
#[1] b y y i z o q w p d f f z b h m c u f s j e i v y b w j n q e w i r h p z q f x a b v z e x l c q f
#Levels: b d i o p q w y z c f h m s u e j n v r x a l
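If you do want a Reduce-only version: the attempt in the question fails because after the first step the accumulator a is already an atomic vector, so a$x is invalid. One possible sketch is to pass an initial value, so that the accumulator is always a vector and only the second argument is a data frame. The as.character() call is an assumption added here so the result is a plain character vector whether or not x is a factor.

```r
xs <- replicate(5,
                data.frame(x = sample(letters, 10, TRUE), y = runif(10)),
                simplify = FALSE)

# acc is always an atomic vector; df is the next data frame in the list.
# Starting from character(0) means $ is never applied to the accumulator.
jotr <- Reduce(function(acc, df) c(acc, as.character(df$x)), xs, character(0))
```

The unlist(lapply(...)) form is still shorter; this only shows that a single Reduce call is possible.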
Based on this post, I created the following matrix and for loops to loop through all regression combinations in my df:
all_lm <-data.frame(matrix(nrow=180, ncol=9))
names(all_lm)=c("col1", "col2", "Estimate", " Std. Error", " z value", " pValue", "2.5%", "97.5%", "r^2")
and to save the results, this:
for (i in c("A","B","C"))
for (j in c(1:10))
for (k in c("D","E"))
for (l in c("F", "G", "H")){
form <- formula(paste0(i,"_PC_AB_",k, " ~ ", l))
result<-lm(form, data = schools, subset=Decile==j)
all_lm[i,1]<-i
all_lm[i,2]<-j
all_lm[i,3]<-round(coef(summary(result))[2,1],3)
all_lm[i,4]<-round(coef(summary(result))[2,2],3)
all_lm[i,5]<-round(coef(summary(result))[2,3],3)
all_lm[i,6]<-round(coef(summary(result))[2,4],3)
all_lm[i,7]<-round(confint(result)[2,1],2)
all_lm[i,8]<-round(confint(result)[2,2],2)
all_lm[i,9]<-round(summary(result)$r.squared, 3)
}
This loop configuration works when I use it to export plots in Cairo, but I realise that all_lm[i,n] is the wrong approach, because i is a letter rather than a row number. I do not know enough about R to solve this. I've tried various combinations such as all_lm[i,j,k,n]. I have also tried adding { after each for, but this did not work. How can I loop through the 180 regressions and store the results in my matrix?
Most of the time in R, if you're being drawn to using a for loop (let alone nested for loops), you're probably on the wrong track.
The general approach to solving your problem is to use the expand.grid function to create all combinations of the inputs, then use mapply to repeatedly regress on each combination of inputs and return a list of results, then use do.call to combine the list of results into a data frame.
Your code should look something like this:
i <- c('A','B','C')
j <- 1:10
k <- c('D','E')
l <- c('F','G','H')
params <- expand.grid(i, j, k, l, stringsAsFactors = FALSE)
You now have a data frame of all combinations of inputs.
> head(params)
Var1 Var2 Var3 Var4
1 A 1 D F
2 B 1 D F
3 C 1 D F
4 A 2 D F
5 B 2 D F
6 C 2 D F
> tail(params)
Var1 Var2 Var3 Var4
175 A 9 E H
176 B 9 E H
177 C 9 E H
178 A 10 E H
179 B 10 E H
180 C 10 E H
Now set up a function that mapply will use for each row of the params data frame.
#
one_lm <- function(i, j, k, l) {
form <- formula(paste0(i,"_PC_AB_",k, " ~ ", l))
result <- lm(form, data = schools, subset=Decile==j)
list(
col1 = i,
col2 = j,
estimate = round(coef(summary(result))[2,1],3),
std_err = round(coef(summary(result))[2,2],3),
z_value = round(coef(summary(result))[2,3],3),
p_value = round(coef(summary(result))[2,4],3),
pct_2.5 = round(confint(result)[2,1],2),
pct_97.5 = round(confint(result)[2,2],2),
r_square = round(summary(result)$r.squared, 3)
)
}
Now use mapply to process each combination one at a time, and return a list of estimates, std_err, etc for each row.
result_list <- mapply(one_lm, params[,1], params[,2], params[,3], params[,4], SIMPLIFY = FALSE)
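You can then combine all those lists into a data frame using the do.call and rbind functions together. Each list is first converted to a one-row data frame, because rbind on plain lists returns a list matrix rather than a data frame.
results <- do.call(rbind, lapply(result_list, as.data.frame))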
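Since the schools data is not available here, the same pattern can be tried on the built-in mtcars data set. This is only a sketch: the choice of columns and the names fit_one, resp, pred and cyl_value are assumptions made for illustration. Each call returns a one-row data frame, so rbind produces a data frame directly.

```r
# All combinations of response, predictor and cylinder group (2 x 2 x 3 = 12).
params <- expand.grid(resp = c("mpg", "qsec"),
                      pred = c("wt", "hp"),
                      cyl_value = c(4, 6, 8),
                      stringsAsFactors = FALSE)

# Fit one regression on the rows of mtcars with the given cylinder count.
fit_one <- function(resp, pred, cyl_value) {
  form <- formula(paste(resp, "~", pred))
  fit  <- lm(form, data = mtcars[mtcars$cyl == cyl_value, ])
  data.frame(resp = resp, pred = pred, cyl = cyl_value,
             estimate = round(unname(coef(fit)[2]), 3),
             r_square = round(summary(fit)$r.squared, 3))
}

fits <- mapply(fit_one, params$resp, params$pred, params$cyl_value,
               SIMPLIFY = FALSE, USE.NAMES = FALSE)
results <- do.call(rbind, fits)
```

Storing the response and predictor names in each row keeps every regression identifiable in the combined table.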
I would like to calculate name number for a set of given names.
The name number is calculated by summing the values assigned to each letter. The values are given below:
a=i=j=q=y=1
b=k=r=2
c=g=l=s=3
d=m=t=4
h=e=n=x=5
u=v=w=6
o=z=7
p=f=8
Example: Name number of David can be calculated as follows:
D+a+v+i+d
4+1+6+1+4
16=1+6=7
Name number of David is 7.
I would like to write a function in R for doing this.
I am thankful for any directions or tips or package suggestions that I should look into.
This code snippet will accomplish what you want:
# Name for which the number should be computed.
name <- "David"
# Prepare letter scores array. In this case, the score for each letter will be the array position of the string it occurs in.
val <- c("aijqy", "bkr", "cgls", "dmt", "henx", "uvw", "oz", "pf")
# Convert name to lowercase.
lName <- tolower(name)
# Compute the sum of letter scores.
s <- sum(sapply(unlist(strsplit(lName,"")), function(x) grep(x, val)))
# Compute the "number" for the sum of letter scores. This is a repeated digit sum, which can be shortened to taking the sum mod 9, with a small correction when the sum is a multiple of 9 (the result is then 9, not 0).
n <- (s %% 9)
n <- ifelse(n==0, 9, n)
'n' is the result that you want for any 'name'
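If the mod-9 shortcut looks suspicious, it can be checked against the explicit repeated digit sum. A quick sketch, with function names digit_root and mod9_root chosen here only for illustration:

```r
# Repeatedly sum the decimal digits until a single digit remains.
digit_root <- function(s) {
  while (s > 9) {
    s <- sum(as.integer(strsplit(as.character(s), "")[[1]]))
  }
  s
}

# The shortcut: take the sum mod 9 and map a remainder of 0 to 9.
mod9_root <- function(s) ifelse(s %% 9 == 0, 9, s %% 9)

# Both approaches agree for every total from 1 to 200.
totals <- 1:200
agree <- all(sapply(totals, digit_root) == mod9_root(totals))
```

For the example in the question, digit_root(16) gives 7, the same value as 16 %% 9.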
You will want to create a vector of values, in alphabetical order, then use match to get their indices. Something like this:
a <- i <- j <- q <- y <- 1
b <- k <- r <- 2
c <- g <- l <- s <- 3
d <- m <- t <- 4
h <- e <- n <- x <- 5
u <- v <- w <- 6
o <- z <- 7
p <- f <- 8
vals <- c(a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z)
sum(vals[match(c("d","a","v","i","d"), letters)])
I'm sure there are several ways to do this, but here's an approach using a named vector:
x <- c(
"a"=1,"i"=1,"j"=1,"q"=1,"y"=1,
"b"=2,"k"=2,"r"=2,
"c"=3,"g"=3,"l"=3,"s"=3,
"d"=4,"m"=4,"t"=4,
"h"=5,"e"=5,"n"=5,"x"=5,
"u"=6,"v"=6,"w"=6,
"o"=7,"z"=7,
"p"=8,"f"=8)
##
name_val <- function(Name, mapping=x){
split <- tolower(unlist(strsplit(Name,"")))
total <-sum(mapping[split])
##
sum(as.numeric(unlist(strsplit(as.character(total),split=""))))
}
##
Names <- c("David","Betty","joe")
##
R> name_val("David")
[1] 7
R> sapply(Names,name_val)
David Betty joe
7 7 4
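Note that name_val sums the digits of the total only once, so a total such as 19 would give 10 rather than a single digit. If the intent is to keep reducing until one digit is left (an assumption, since the question only shows one step), a variant could look like the sketch below. The name name_val_full is made up, and the input is assumed to contain letters only.

```r
mapping <- c(a = 1, i = 1, j = 1, q = 1, y = 1,
             b = 2, k = 2, r = 2,
             c = 3, g = 3, l = 3, s = 3,
             d = 4, m = 4, t = 4,
             h = 5, e = 5, n = 5, x = 5,
             u = 6, v = 6, w = 6,
             o = 7, z = 7,
             p = 8, f = 8)

name_val_full <- function(Name, mapping) {
  chars <- tolower(unlist(strsplit(Name, "")))
  total <- sum(mapping[chars])
  # Keep summing the digits until a single digit remains.
  while (total > 9) {
    total <- sum(as.numeric(unlist(strsplit(as.character(total), ""))))
  }
  total
}

david <- name_val_full("David", mapping)
zzh   <- name_val_full("zzh", mapping)   # letters total 19, then 10, then 1
```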