apply function in a character dataframe - r

I have a data frame that looks like this:
GID7173723 GID4878677 GID88208 GID346403 GID268825 GID7399578
1 A A A A G A
2 T T T T C T
3 G G G G G G
4 A A A A A A
5 G G G G G G
6 G G G G G G
7 A A A A A A
8 G G G G G G
9 A A A A A A
10 A A A A A A
However, when I use the apply function to get the sum of all 'A' by row divided by the number of columns in the dataframe, I get the total sum of A's instead of getting row sums.
Here is the function I wrote:
myfun <- function(x){
out <- sum(x=='A')/ncol(x)
return(out)
}
apply(df,MARGIN = 1,FUN=myfun)
I cannot figure out why the apply function gives me the total sum of A and not by row.

We can use rowSums
rowSums(df1=="A")/ncol(df1)
Or use rowMeans
rowMeans(df1 == "A")
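With apply and MARGIN = 1, each row is passed to the function as a plain vector, so ncol(x) returns NULL and the division does not give a per-row fraction. Use length(x) instead: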
myfun <- function(x){
sum(x=='A')/length(x)
#or
# mean(x == "A")
}
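To see why ncol() fails inside apply, here is a minimal sketch. The data frame df1 below is made-up toy data, not the asker's file; it only illustrates that a row arrives as a vector and that the apply version agrees with rowMeans.

```r
# Hypothetical toy data standing in for the asker's data frame
df1 <- data.frame(
  GID1 = c("A", "T", "G"),
  GID2 = c("A", "T", "A"),
  GID3 = c("G", "C", "A"),
  stringsAsFactors = FALSE
)

# A single row, as apply(..., MARGIN = 1, ...) hands it to the function
row_vec <- unlist(df1[1, ])
ncol(row_vec)    # NULL, because a vector has no columns
length(row_vec)  # 3

# Per-row fraction of "A", computed two ways
frac_apply    <- apply(df1, 1, function(x) mean(x == "A"))
frac_rowmeans <- rowMeans(df1 == "A")
all.equal(unname(frac_apply), unname(frac_rowmeans))
```

rowMeans is usually the simpler choice here, since it avoids calling an R function once per row.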

Solution with apply()
apply(df, 1,FUN=function(rowVec) table(rowVec)['A'])
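table() gives the count of each base in the row, and ['A'] selects the count for A. Note that this returns NA for a row that contains no A, and that you still need to divide by ncol(df) to get the fraction.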
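One possible way around the NA for rows without an A is to fix the factor levels before calling table(). This is only a sketch: the vector bases and the toy data frame are assumptions, not part of the original question.

```r
# Assumed set of possible bases (hypothetical; adjust to your data)
bases <- c("A", "C", "G", "T")

# Toy data: the second row contains no "A"
df <- data.frame(
  s1 = c("A", "G"),
  s2 = c("A", "G"),
  s3 = c("G", "G"),
  stringsAsFactors = FALSE
)

# With fixed levels, table() reports a zero count instead of dropping "A"
count_a <- apply(df, 1, function(rowVec) {
  table(factor(rowVec, levels = bases))["A"]
})

# Fraction of "A" per row
frac_a <- count_a / ncol(df)
```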

Related

Number of differences between columns in a data frame in R

I have a data frame with sequences as columns and amino acid sites as rows. I would like to compare the difference between these sequences at each site.
seq1 seq2 seq3 seq4 seq5 seq6 seq7 seq8
1 K E K K A A A A
2 V D A A T A A A
3 W W W W W W W W
4 R R R R R R S R
5 F S F F F Y F F
6 P P P P P P P P
7 N N N C N N N N
8 V I D D Q Q Q Q
9 Q Q Q Q Q Q Q Q
10 E E G G L I S F
11 L L Q L L L L L
12 N N Y Y V V S S
13 N N N N Q Q P P
14 L L L L L L L L
15 T T T T T T T I
Ideally, I would like to be able to have an additional column in my data frame that shows me the sites that are the same in all sequences and those that are the same only between seq1-4 or seq 5-8.
I am not sure what the best way to do this is, and any help is greatly appreciated.
Also, is there a way to add another column that shows the types of amino acids observed at each site?
Thanks in advance!
I am first getting an array where all columns are same:
allsame <- apply(df,1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
Next I am getting an array where either of the column sets is the same
startfour <- apply(df[,1:4],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
lastfour <- apply(df[,5:8],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
gen <- startfour + lastfour
eithersame <- ifelse(gen == 0,0,1)
Finally you can just create a column vector as required and join it to the dataframe using the above 2 arrays
output <- character(length(allsame))
for(i in seq_along(allsame)){
if(allsame[i] == 1){
output[i] <- "all same"
}
else if(eithersame[i] == 1){
output[i] <- "either same"
}
else{
output[i] <- "none same"
}
}
df <- cbind(df,output)
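The same labelling can probably be done without the explicit loop by nesting ifelse(). A minimal sketch, using three rows copied from the question's table (sites 1, 3 and 12) as a stand-in for the full data; the helper n_unique is a name introduced here for illustration.

```r
# Rows 1, 3 and 12 of the question's table, as a small stand-in data set
df <- data.frame(
  seq1 = c("K", "W", "N"), seq2 = c("E", "W", "N"),
  seq3 = c("K", "W", "Y"), seq4 = c("K", "W", "Y"),
  seq5 = c("A", "W", "V"), seq6 = c("A", "W", "V"),
  seq7 = c("A", "W", "S"), seq8 = c("A", "W", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct residues in a row
n_unique <- function(x) length(unique(x))

allsame   <- apply(df[, 1:8], 1, n_unique) == 1
startfour <- apply(df[, 1:4], 1, n_unique) == 1
lastfour  <- apply(df[, 5:8], 1, n_unique) == 1

# Vectorised replacement for the for loop
output <- ifelse(allsame, "all same",
                 ifelse(startfour | lastfour, "either same", "none same"))
df$output <- output
```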
Here is a quick and dirty way to create the flags that you mentioned. Assuming the dataframe is called amino:
amino$first_flag<-with(amino,ifelse(seq1==seq2 & seq2==seq3 & seq3 == seq4,"same","diff"))
amino$second_flag<-with(amino,ifelse(seq5==seq6 & seq6==seq7 & seq7 == seq8,"same","diff"))
amino$total_flag<-with(amino,ifelse(first_flag=="same" & second_flag=="same" & seq1==seq5,"same","diff"))
Hopefully that works.
edit: and for your last question, I'm not sure what you mean but if you just want the letters that appear in each row then something like this could work:
for(i in 1:nrow(amino)) amino$types[i]<-paste(unique(amino[i,1:8,drop=TRUE]),collapse=",")
It will give you a column containing a comma separated list of the letters that appeared in each row.
edit2: If you have significantly more than 8 sequences, then a modified form of Ganesh's solution might work better (his output code isn't actually necessary):
amino$first_flag <- apply(amino[,1:4],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$second_flag <- apply(amino[,5:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$total_flag <- apply(amino[,1:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$types <- apply(amino[,1:8],1,function(x) paste(unique(x),collapse=","))
And for your new question-
amino$one_diff <- apply(amino[,1:8],1,function(x){
ifelse(7 %in% as.data.frame(table(x))[,2,drop=TRUE],"1 diff",NA)
})
This uses the table() function which normally gives you a count based on a vector or a column like table(amino$seq1). Using apply, we instead stick a row of the 8 sequences into it, it returns the counts, then we use as.data.frame and the brackets [] to get rid of some extra table() output that we don't need. The "7 %in%" part means if there are 7 of the same letters then there must be 1 different one. Anything else (i.e., all 8 same or more than 1 difference) will get NA.
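To make the "7 %in%" logic concrete, here is a small sketch on single rows. The function one_diff and the three example vectors are introduced here for illustration; the rows are taken from sites 3, 4 and 5 of the question's table. It generalises 7 to length(x) - 1 so it does not depend on having exactly 8 sequences.

```r
# Hypothetical helper: flag a row where exactly one sequence differs
one_diff <- function(x) {
  counts <- as.vector(table(x))  # plain integer counts per residue
  if ((length(x) - 1) %in% counts) "1 diff" else NA_character_
}

x_all_same <- rep("W", 8)                                   # site 3
x_one_diff <- c("R", "R", "R", "R", "R", "R", "S", "R")     # site 4
x_two_diff <- c("F", "S", "F", "F", "F", "Y", "F", "F")     # site 5

one_diff(x_all_same)  # NA: the only count is 8
one_diff(x_one_diff)  # "1 diff": counts are 7 and 1
one_diff(x_two_diff)  # NA: counts are 6, 1 and 1
```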

Concatenate a column from data.frame elements of a list

I am looking for an idiomatic way to join a column, say named 'x', which exists in every data.frame element of a list. I came up with a solution with two steps by using lapply and Reduce. The second attempt trying to use only Reduce failed. Can I actually use only Reduce with one anonymous function to do this?
#data
xs <- replicate(5, data.frame(x=sample(letters, 10, T), y =runif(10)), simplify = FALSE)
# This works, but may be still unnecessarily long
otmap = lapply(xs, function(df) df$x)
jotm = Reduce(c, otmap)
# This does not count as another solution:
jotm = Reduce(c, lapply(xs, function(df) df$x))
# Try to use only Reduce function. This produces an error
jotr =Reduce(function(a,b){c(a$x,b$x)}, xs)
# Error in a$x : $ operator is invalid for atomic vectors
We can unlist after extracting the 'x' column
unlist(lapply(xs, `[[`, 'x'))
#[1] b y y i z o q w p d f f z b h m c u f s j e i v y b w j n q e w i r h p z q f x a b v z e x l c q f
#Levels: b d i o p q w y z c f h m s u e j n v r x a l
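As for the Reduce-only attempt: it fails because after the first step the accumulator a is already an atomic vector, so a$x is no longer valid. One way that should work is to give Reduce an initial value and extract x only from the second argument. This sketch assumes x is a character column (stringsAsFactors = FALSE); set.seed is only there to make it reproducible.

```r
set.seed(1)  # only for reproducibility of the made-up data
xs <- replicate(
  3,
  data.frame(x = sample(letters, 4, TRUE), y = runif(4),
             stringsAsFactors = FALSE),
  simplify = FALSE
)

# The accumulator starts as an empty character vector;
# only the incoming data frame is indexed with $x
jotr <- Reduce(function(acc, df) c(acc, df$x), xs, character(0))

# Same result as the lapply/unlist approach
jotm <- unlist(lapply(xs, `[[`, "x"))
identical(jotr, jotm)
```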

R: extract regression results using two or more for loops

Based on this post, I created the following matrix and for loops to loop through all regression combinations in my df:
all_lm <-data.frame(matrix(nrow=180, ncol=9))
names(all_lm)=c("col1", "col2", "Estimate", " Std. Error", " z value", " pValue", "2.5%", "97.5%", "r^2")
and to save the results, this:
for (i in c("A","B","C"))
for (j in c(1:10))
for (k in c("D","E"))
for (l in c("F", "G", "H")){
form <- formula(paste0(i,"_PC_AB_",k, " ~ ", l))
result<-lm(form, data = schools, subset=Decile==j)
all_lm[i,1]<-i
all_lm[i,2]<-j
all_lm[i,3]<-round(coef(summary(result))[2,1],3)
all_lm[i,4]<-round(coef(summary(result))[2,2],3)
all_lm[i,5]<-round(coef(summary(result))[2,3],3)
all_lm[i,6]<-round(coef(summary(result))[2,4],3)
all_lm[i,7]<-round(confint(result)[2,1],2)
all_lm[i,8]<-round(confint(result)[2,2],2)
all_lm[i,9]<-round(summary(result)$r.squared, 3)
}
This loop configuration works when I use it to export plots in Cairo, but I realise that the all_lm[i,n] is an incorrect approach. I do not know enough about R to solve this. I've tried various combinations such as all_lm[i,j,k,n]. I have also tried { after each for but this did not work. How can i loop through the 180 regressions and store the results in my matrix?
Most of the time in R, if you're being drawn to using a for loop (let alone nested for loops), you're probably on the wrong track.
The general approach to solving your problem is to use the expand.grid function to create all combinations of the inputs, then use mapply to repeatedly regress on each combination of inputs and return a list of results, then use do.call to combine the list of results into a data frame.
Your code should look something like this:
i <- c('A','B','C')
j <- 1:10
k <- c('D','E')
l <- c('F','G','H')
params <- expand.grid(i, j, k, l, stringsAsFactors = FALSE)
You now have a data frame of all combinations of inputs.
> head(params)
Var1 Var2 Var3 Var4
1 A 1 D F
2 B 1 D F
3 C 1 D F
4 A 2 D F
5 B 2 D F
6 C 2 D F
> tail(params)
Var1 Var2 Var3 Var4
175 A 9 E H
176 B 9 E H
177 C 9 E H
178 A 10 E H
179 B 10 E H
180 C 10 E H
Now set up a function that mapply will use for each row of the params data frame.
# Fit one regression for a single combination of inputs and return its summary statistics
one_lm <- function(i, j, k, l) {
form <- formula(paste0(i,"_PC_AB_",k, " ~ ", l))
result <- lm(form, data = schools, subset=Decile==j)
list(
col1 = i,
col2 = j,
estimate = round(coef(summary(result))[2,1],3),
std_err = round(coef(summary(result))[2,2],3),
t_value = round(coef(summary(result))[2,3],3),  # summary(lm) reports a t value in column 3
p_value = round(coef(summary(result))[2,4],3),
pct_2.5 = round(confint(result)[2,1],2),
pct_97.5 = round(confint(result)[2,2],2),
r_square = round(summary(result)$r.squared, 3)
)
}
Now use mapply to process each combination one at a time, and return a list of estimates, std_err, etc for each row.
result_list <- mapply(one_lm, params[,1], params[,2], params[,3], params[,4], SIMPLIFY = FALSE)
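You can then combine all those lists into a data frame by converting each one to a one-row data frame and using the do.call and rbind functions together. (Calling rbind directly on the lists would give a list-matrix rather than a data frame.)
results <- do.call(rbind, lapply(result_list, as.data.frame))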
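Here is a cut-down, self-contained sketch of the whole expand.grid / mapply / do.call pipeline. The schools data below is simulated and the grid is reduced to 2 x 2 x 2 combinations, so the column names and sizes are assumptions for illustration only; each call returns a one-row data frame so that rbind produces a proper data frame.

```r
set.seed(42)  # reproducible made-up data
schools <- data.frame(
  Decile    = rep(1:2, each = 20),
  A_PC_AB_D = rnorm(40),
  B_PC_AB_D = rnorm(40),
  F         = rnorm(40),
  G         = rnorm(40)
)

# All combinations of outcome prefix, decile and predictor
params <- expand.grid(i = c("A", "B"), j = 1:2, l = c("F", "G"),
                      stringsAsFactors = FALSE)

# One regression per combination, returned as a one-row data frame
one_lm <- function(i, j, l) {
  form  <- as.formula(paste0(i, "_PC_AB_D ~ ", l))
  fit   <- lm(form, data = schools[schools$Decile == j, ])
  coefs <- coef(summary(fit))
  data.frame(
    col1      = i,
    col2      = j,
    predictor = l,
    estimate  = round(coefs[2, 1], 3),
    r_square  = round(summary(fit)$r.squared, 3),
    stringsAsFactors = FALSE
  )
}

result_list <- mapply(one_lm, params$i, params$j, params$l,
                      SIMPLIFY = FALSE, USE.NAMES = FALSE)
results <- do.call(rbind, result_list)
nrow(results)  # one row per combination
```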

How to add functions in R

I am trying to sum the functions in a list to create a new function. This is easy for a small number of functions. Here is an example:
f <- function(x){x}
g <- function(x){x+1}
Now we sum f and g.
fg <- function(x){f(x) + g(x)}
But if I have 100 functions that I want to sum, this method becomes clumsy. Is there a way to create a function like fg above automatically from a list?
I prefer Reduce:
f <- function(x){x}
g <- function(x){x+1}
h <- function(x){x*2}
funs<-list(f,g,h)
x <- 1:3
Reduce("+", lapply(funs, function(f, y) f(y), y=x))
#[1] 5 9 13
Of course, the return values of all functions must have the same length.
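If you want an actual function like fg rather than a one-off value, the Reduce call can be wrapped in a small function factory. A minimal sketch; sum_funs is a name made up here.

```r
f <- function(x) x
g <- function(x) x + 1
h <- function(x) x * 2
funs <- list(f, g, h)

# Hypothetical factory: returns a function that sums the outputs of all funs
sum_funs <- function(funs) {
  force(funs)  # evaluate the list now, not lazily at first call
  function(x) Reduce(`+`, lapply(funs, function(fn) fn(x)))
}

fgh <- sum_funs(funs)
fgh(1:3)
# [1]  5  9 13
```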
You could use sapply to loop over the functions and apply them
f <- function(x){x}
g <- function(x){x+1}
h <- function(x){x*2}
funs<-list(f,g,h)
x <- 2
rowSums(matrix(sapply(funs, function(f, z) f(z), z=x), nrow=length(x)))
# [1] 9
I use the matrix and rowSums functions just in case you want to be able to call it when x is a vector of values as well
x <- 1:3
rowSums(matrix(sapply(funs, function(f, z) f(z), z=x), nrow=length(x)))
# [1] 5 9 13
You can make it cleaner by making a helper function
getfunsum <- function(funs) {
force(funs)
function(x) {
rowSums(matrix(sapply(funs, function(f, z) f(z), z=x), nrow=length(x)))
}
}
fgh <- getfunsum(funs)
fgh(1:3)
# [1] 5 9 13
You may try:
fun1 <- function(i,a) {
eval(substitute(function(x, a) {x+i*a}, list(i=i)))}
n <- 0:3
lst <- lapply(n, fun1)
rowSums(sapply(lst, function(f) f(12:14, 3)))
#[1] 66 70 74

Recode dataframe based on one column - in reverse

I asked this question a while ago (Recode dataframe based on one column) and the answer worked perfectly. Now, however, I want to do almost the reverse. Namely, I have a (700k * 2000) matrix of 0/1/2 or NA. In a separate dataframe I have two columns (Ref and Obs). A 0 corresponds to two instances of Ref, 1 is one instance of Ref and one instance of Obs, and 2 is two instances of Obs. To clarify, a data snippet:
Genotype File ---
Ref Obs
A G
T C
G C
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
Current Data---
Sample.1 Sample.2 .... Sample.2000
0 1 2
0 0 0
0 NA 1
mat <- matrix(nrow=3, ncol=3)
mat[,1] <- c(0,0,0)
mat[,2] <- c(1,0,NA)
mat[,3] <- c(2,0,1)
Desired Data format---
Sample.1 Sample.1 Sample.2 Sample.2 Sample.2000 Sample.2000
A A A G G G
T T T T T T
G G 0 0 G C
I think that's right. The desired data format has two columns (space separated) for each sample. 0 in this format (plink ped file for the bioinformaticians out there) is missing data.
MAJOR ASSUMPTION: your data is in 3 element frames, i.e. you want to apply your mapping to the first 3 rows, then the next 3, and so on, which I think makes sense given DNA frames. If you want a rolling 3 element window this will not work (but code can be modified to make it work). This will work for an arbitrary number of columns, and arbitrary number of 3 row groups:
# Make up a matrix with your properties (4 cols, 6 rows)
col <- 4L
frame <- 3L
mat <- matrix(sample(c(0:2, NA_integer_), 2 * frame * col, replace=T), ncol=col)
# Mapping data
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
map.base <- cbind(Ref, Obs)
num.to.let <- matrix(c(1, 1, 1, 2, 2, 2), byrow=T, ncol=2) # how many from each of ref obs
# Function to map 0,1,2,NA to Ref/Obs
re_map <- function(mat.small) { # 3 row matrices, with col columns
t(
mapply( # iterate through each row in matrix
function(vals, map, num.to.let) {
vals.2 <- unlist(lapply(vals, function(x) map[num.to.let[x + 1L, ]]))
ifelse(is.na(vals.2), 0, vals.2)
},
vals=split(mat.small, row(mat.small)), # a row
map=split(map.base, row(map.base)), # the mapping for that row
MoreArgs=list(num.to.let=num.to.let) # general conversion of number to Obs/Ref
) )
}
# Split input data frame into 3 row matrices (assumes frame size 3),
# and apply mapping function to each group
mat.split <- split.data.frame(mat, sort(rep(1:(nrow(mat) / frame), frame)))
mat.res <- do.call(rbind, lapply(mat.split, re_map))
colnames(mat.res) <- paste0("Sample.", rep(1:ncol(mat), each=2))
print(mat.res, quote=FALSE)
# Sample.1 Sample.1 Sample.2 Sample.2 Sample.3 Sample.3 Sample.4 Sample.4
# 1 G G A G G G G G
# 2 C C 0 0 T C T C
# 3 0 0 G C G G G G
# 1 A A A A A G A A
# 2 C C C C T C C C
# 3 C C G G 0 0 0 0
I am not sure, but this could be what you need.
First, some simple data:
geno <- data.frame(Ref = c("A", "T", "G"), Obs = c("G", "C", "C"))
data <- data.frame(s1 = c(0,0,0),s2 = c(1, 0, NA))
then a couple of functions:
f <- function(i , x, geno){
x <- x[i]
if(!is.na(x)){
if (x == 0) {y <- geno[i , c(1,1)]}
if (x == 1) {y <- geno[i, c(1,2)]}
if (x == 2) {y <- geno[i, c(2,2)]}
}
else y <- c(0,0)
names(y) <- c("s1", "s2")
y
}
g <- function(x, geno){
Reduce(rbind, lapply(1:length(x), FUN = f , x = x, geno = geno))
}
The way f() is defined may not be the most elegant but it does the job
Then simply run it as a double for loop, in lapply fashion
as.data.frame(Reduce(cbind, lapply(data , g , geno = geno )))
hope it helps
Here's one way based on the sample data in your answer:
# create index
idx <- lapply(data, function(x) cbind((x > 1) + 1, (x > 0) + 1))
# list of matrices
lst <- lapply(idx, function(x) {
tmp <- apply(x, 2, function(y) geno[cbind(seq_along(y), y)])
replace(tmp, is.na(tmp), 0)
})
# one data frame
as.data.frame(lst)
# s1.1 s1.2 s2.1 s2.2
# 1 A A A G
# 2 T T T T
# 3 G G 0 0
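For a matrix as large as 700k x 2000, a fully vectorised version may be worth trying. The sketch below is one possible approach using ifelse() on the whole matrix, with the question's small example as input; it assumes row i of the matrix corresponds to element i of Ref and Obs, and the names allele1, allele2 and out are introduced here.

```r
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
mat <- cbind(c(0, 0, 0), c(1, 0, NA), c(2, 0, 1))

# Ref/Obs letters repeated to line up with every cell of mat (column-major)
ref_cell <- Ref[row(mat)]
obs_cell <- Obs[row(mat)]

# First allele is Obs only for code 2; second allele is Obs for codes 1 and 2
allele1 <- ifelse(mat == 2, obs_cell, ref_cell)
allele2 <- ifelse(mat >= 1, obs_cell, ref_cell)

# Interleave the two alleles: columns 1,3,5,... and 2,4,6,...
out <- matrix("0", nrow = nrow(mat), ncol = 2 * ncol(mat))
out[, seq(1, ncol(out), by = 2)] <- allele1
out[, seq(2, ncol(out), by = 2)] <- allele2
out[is.na(out)] <- "0"  # missing genotypes become 0, as in the ped format
colnames(out) <- paste0("Sample.", rep(seq_len(ncol(mat)), each = 2))

print(out, quote = FALSE)
```

On the example this reproduces the desired layout from the question (A A A G G G / T T T T T T / G G 0 0 G C).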
