I have a data frame and want, for each row, the sum of every second cell (starting with the second cell) whose left neighbor is greater than zero. Here's an example:
a <- c(-2,1,1,-2)
b <- c(1,2,3,4)
c <- c(-2,1,-1,2)
d <- c(5,6,7,8)
df <- data.frame(a,b,c,d)
This gives:
> df
   a b  c d
1 -2 1 -2 5
2  1 2  1 6
3  1 3 -1 7
4 -2 4  2 8
For the first row the correct sum is 0 (the left neighbor of 1 is -2 and the left neighbor of 5 is also -2); for the second it's 8; for the third it's 3; for the fourth it's again 8.
I want to do it without loops, so I tried it with sum() and which() like in Conditional Sum in R, but could not find a way through.
We subset the dataset for alternating columns using the recycled logical vector c(TRUE, FALSE) to get the 1st, 3rd, etc. columns, and convert the result to a logical matrix by checking whether each value is greater than 0 (> 0). We then multiply it by the second subset of alternating columns (the 2nd, 4th, etc.), selected with c(FALSE, TRUE). The idea is that wherever the left column is not greater than 0, the logical matrix holds FALSE, which is coerced to 0 in the multiplication and so removes the right-hand value from the sum. Finally, rowSums gives the expected output:
rowSums((df[c(TRUE, FALSE)]>0)*df[c(FALSE, TRUE)])
#[1] 0 8 3 8
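To see why this works, it can help to look at the intermediate pieces. The sketch below rebuilds the example data and names each step; the variable names left, right, mask and contrib are illustrative only and not part of the original answer.

```r
# Example data from the question
df <- data.frame(a = c(-2, 1, 1, -2), b = c(1, 2, 3, 4),
                 c = c(-2, 1, -1, 2), d = c(5, 6, 7, 8))

left  <- as.matrix(df[c(TRUE, FALSE)])  # columns 1, 3, ... (here a and c)
right <- as.matrix(df[c(FALSE, TRUE)])  # columns 2, 4, ... (here b and d)

mask    <- left > 0      # TRUE where the left neighbor is positive
contrib <- mask * right  # TRUE/FALSE act as 1/0, so unwanted cells become 0
result  <- rowSums(contrib)
```

One thing to keep in mind: FALSE * NA is NA in R, so a missing value in a right-hand column still propagates even when its left neighbor is not positive. Passing na.rm = TRUE to rowSums is one way to drop such cells, if that is the behavior you want.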
The recycled logical index can also be replaced with seq:
rowSums((df[seq(1, ncol(df), by = 2)]>0)*df[seq(2, ncol(df), by = 2)])
#[1] 0 8 3 8
Or another option is Reduce with Map
Reduce(`+`, Map(`*`, lapply(df[c(TRUE, FALSE)], `>`, 0), df[c(FALSE, TRUE)]))
#[1] 0 8 3 8
I am trying to determine sequence similarity.
I would like to create a function to compare df elements, for the following example:
V1 V2 V3 V4
1 C D A D
2 A A S E
3 V T T V
4 A T S S
5 C D R Y
6 C A D V
7 V T E T
8 A T A A
9 R V V W
10 W R D D
I want to compare the first element of the first column with the first element of the second column: if they match, the result is 1, otherwise 0. Then the second element of the first column is compared with the second element of the second column, and so on.
For example:
C != D -----0
A == A -----1
That way I would like to compare column 1 with column 2 then column 3 and column 4.
Then column 2 compare with column 3 and column 4.
Then column 3 with column 4.
The output would be just the numbers:
0
1
0
0
0
0
0
0
0
0
I tried the following but it doesn't work:
compared_df <- ifelse(df_trial$V1==df_trial$V2,1,ifelse(df_trial$V1==df_trial$V2,0,NA))
compared_df
As suggested, I tried the following:
compared_df1 <- df_trial$matches <- as.integer(df_trial$V1 == df_trial$V2)
This works well for small sample comparison. Is there a way to compare more globally? Like for the updated columns.
As @Ronak Shah said in the comments, the following is sufficient if you want to compare two columns:
df$matches <- as.integer(df$V1 == df$V2)
Another option, which also works for more than 2 columns, is to use apply to check whether each row contains only a single unique value (so the result is 1 only when all columns in that row agree):
df$matches = apply(df, 1, function(x) as.integer(length(unique(x)) == 1))
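The question also asks for every pair of columns (V1 with V2, V3 and V4, then V2 with V3 and V4, then V3 with V4), which neither snippet above covers directly. One possible sketch, assuming the data sits in a data frame called df_trial with character columns V1 to V4 as in the question, uses combn to enumerate the pairs. The names pairs and matches are made up for illustration.

```r
# Example data from the question
df_trial <- data.frame(
  V1 = c("C", "A", "V", "A", "C", "C", "V", "A", "R", "W"),
  V2 = c("D", "A", "T", "T", "D", "A", "T", "T", "V", "R"),
  V3 = c("A", "S", "T", "S", "R", "D", "E", "A", "V", "D"),
  V4 = c("D", "E", "V", "S", "Y", "V", "T", "A", "W", "D"),
  stringsAsFactors = FALSE
)

# All unordered column pairs, one pair per column of this 2 x 6 matrix
pairs <- combn(names(df_trial), 2)

# For each pair, compare the two columns element by element (1 = match, 0 = no match)
matches <- apply(pairs, 2, function(p) {
  as.integer(df_trial[[p[1]]] == df_trial[[p[2]]])
})
colnames(matches) <- apply(pairs, 2, paste, collapse = "_")
```

Here matches has one row per row of df_trial and one column per pair, so matches[, "V1_V2"] reproduces the 0/1 vector from the question and colSums(matches) gives a match count per pair.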
I need to select elements of a data frame using row indices stored in a vector. In other words, I have a vector of length equal to the number of columns in the data frame, and that vector contains the row numbers of the elements I need to extract (one element per column, in order).
How would I go about doing this?
Example:
vec <- c(1,2,1)
df <- data.frame(matrix(1:6, ncol = 3, nrow = 2))
That would look like this:
X1 X2 X3
1 1 3 5
2 2 4 6
And I would need to get elements (1,4,5) using the indices from vec = 1,2,1
We can use matrix indexing:
df[cbind(vec, 1:ncol(df))]
#[1] 1 4 5
Using cbind, we create a row and column index to subset values from df.
cbind(vec, 1:ncol(df))
#     vec
#[1,]   1 1
#[2,]   2 2
#[3,]   1 3
Using this matrix, we extract the values at (row 1, column 1), (row 2, column 2) and (row 1, column 3).
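As a quick sanity check, the matrix-index lookup can be compared with picking the elements one at a time. This is only a sketch on the example data; idx, by_matrix and by_loop are illustrative names.

```r
vec <- c(1, 2, 1)
df <- data.frame(matrix(1:6, ncol = 3, nrow = 2))

# One (row, column) pair per element we want
idx <- cbind(vec, seq_along(vec))
by_matrix <- df[idx]

# The same lookup spelled out element by element
by_loop <- vapply(seq_along(vec), function(j) df[vec[j], j], integer(1))
```

Both by_matrix and by_loop hold the elements 1, 4 and 5 from the example.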
How should I subset a matrix by specifying both the row and the column of each item? I'm currently using sapply, but I don't find that particularly elegant:
> mat <- data.frame(a=c(1,2,3),b=c(7,6,5))
> mat
a b
1 1 7
2 2 6
3 3 5
> rowSel <- 1:3
> colSel <- c(1,2,1)
> sapply(rowSel,function(i){mat[i,colSel[i]]})
[1] 1 6 3
A shorter way:
mat[cbind(rowSel, colSel)]
#[1] 1 6 3
This uses indexing by a two-column matrix. The first column contains the row index, the second contains the column index. Each row of the two-column matrix selects one element of mat.
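One caveat, sketched below: mat in the question is actually a data frame, and matrix indexing on a data frame goes through as.matrix, so columns of different types are coerced to a common type first. The data frame mixed is a made-up example to show the effect.

```r
rowSel <- 1:3
colSel <- c(1, 2, 1)

# All-numeric data frame from the question: the result stays numeric
mat <- data.frame(a = c(1, 2, 3), b = c(7, 6, 5))
picked <- mat[cbind(rowSel, colSel)]

# Hypothetical data frame with a character column: the result is character
mixed <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))
picked_mixed <- mixed[cbind(rowSel, colSel)]
```

If the columns have different types and the original types matter, an element-wise approach such as the sapply version from the question may be the safer choice.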
I have a matrix mat.
mat<-matrix(
c('a','a','b','a','b','b'),
nrow=3, ncol=2)
I want to make a vector of the number of matches in each row of the matrix. For example, let's say I wanted to count the number of occurrences of the letter a in each row. The first row of the matrix contains a,a: two matches of a. The second row contains a,b: one match of a.
I can count the number of matches of the character a in a row with this line of code:
sum(!is.na(charmatch(mat[1,c(1,2)],"a"))) # first row, returns 2
sum(!is.na(charmatch(mat[2,c(1,2)],"a"))) # second row, returns 1
I want to vectorize this counting procedure. In other words, I want to do something like this
as.vector(rowsum(!is.na(charmatch(mat[,c(1,2)], "a"))))
So that it returns a vector like this 2,1,0 which means 2 matches of a in row 1 of the matrix, 1 match of a in row 2 of the matrix, 0 matches of a in row 3 of the matrix.
You can just do
rowSums(mat=='a', na.rm=TRUE)
#[1] 2 1 0
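The reason this works is that mat == 'a' is itself a logical matrix of the same shape as mat, and rowSums counts each TRUE as 1. A small sketch with the example matrix (is_a and counts are illustrative names):

```r
mat <- matrix(c('a', 'a', 'b', 'a', 'b', 'b'), nrow = 3, ncol = 2)

# Element-wise comparison: TRUE wherever the cell equals "a"
is_a <- mat == "a"

# Summing a logical matrix by row counts the TRUE values per row
counts <- rowSums(is_a)
```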
For all unique values
Un <- sort(unique(c(mat)))
res <- sapply(Map(`==`, list(mat), Un), rowSums, na.rm=TRUE)
colnames(res) <- Un
res
# a b
#[1,] 2 0
#[2,] 1 1
#[3,] 0 2
Or, as contributed by @Ananda Mahto, a faster approach would be
lvl <- sort(unique(c(mat)))
vapply(lvl, function(x) rowSums(mat == x, na.rm = TRUE), numeric(nrow(mat)))
If you wanted to do this for all values, you can try one of the following:
table with factor in apply
levs <- unique(c(mat))
t(apply(mat, 1, function(x) table(factor(x, levs))))
# a b
# [1,] 2 0
# [2,] 1 1
# [3,] 0 2
melt and dcast with fun.aggregate = length from "reshape2"
library(reshape2)
dcast(melt(mat), Var1 ~ value, value.var = "Var2")
# Aggregation function missing: defaulting to length
# Var1 a b
# 1 1 2 0
# 2 2 1 1
# 3 3 0 2
Better yet would just be table after manually creating the values to tabulate:
table(rep(sequence(nrow(mat)), ncol(mat)), c(mat))
#
# a b
# 1 2 0
# 2 1 1
# 3 0 2
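A possible variation on the same idea, offered as a sketch rather than as part of the original answer: base R's row() returns a matrix of row indices with the same shape as its argument, which can stand in for the rep(sequence(...)) construction.

```r
mat <- matrix(c('a', 'a', 'b', 'a', 'b', 'b'), nrow = 3, ncol = 2)

# row(mat) holds each cell's row number; c() flattens both in column-major order,
# so the i-th row index lines up with the i-th value
tab <- table(c(row(mat)), c(mat))
```

The resulting table has one row per matrix row and one column per distinct value, the same layout as the output above.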
I would like to multiply several columns in my data frame by a vector of values. The specific vector of values changes depending on the value in another column.
--EDIT--
What if I make the data set more complicated, i.e., more than 2 conditions and the conditions are randomly shuffled around the data set?
Here is an example of my data set:
df=data.frame(
Treatment=(rep(LETTERS[1:4],each=2)),
Species=rep(1:4,each=2),
Value1=c(0,0,1,3,4,2,0,0),
Value2=c(0,0,3,4,2,1,4,5),
Value3=c(0,2,4,5,2,1,4,5),
Condition=c("A","B","A","C","B","A","B","C")
)
Which looks like:
Treatment Species Value1 Value2 Value3 Condition
A         1       0      0      0      A
A         1       0      0      2      B
B         2       1      3      4      A
B         2       3      4      5      C
C         3       4      2      2      B
C         3       2      1      1      A
D         4       0      4      4      B
D         4       0      5      5      C
If Condition=="A", I would like to multiply columns 3-5 by the vector c(1,2,3). If Condition=="B", I would like to multiply columns 3-5 by the vector c(4,5,6). If Condition=="C", I would like to multiply columns 3-5 by the vector c(0,1,0). The resulting data frame would therefore look like this:
Treatment Species Value1 Value2 Value3 Condition
A         1       0      0      0      A
A         1       0      0      12     B
B         2       1      6      12     A
B         2       0      4      0      C
C         3       16     10     12     B
C         3       2      2      3      A
D         4       0      20     24     B
D         4       0      5      0      C
I have tried subsetting the data frame and multiplying by the vector:
t(t(subset(df[,3:5],df[,6]=="A")) * c(1,2,3))
But I can't return the subsetted data frame to the original. Is there any way to perform this operation without subsetting the data frame, so that other columns (e.g., Treatment, Species) are preserved?
Here's a fairly general solution that you should be able to adapt to fit your needs.
Note the first argument in the outer call is a logical vector and the second is numeric, so before multiplication TRUE and FALSE are converted to 1 and 0, respectively. We can add the outer results because the conditions are non-overlapping and the FALSE elements will be zero.
multiples <-
outer(df$Condition=="A",c(1,2,3)) +
outer(df$Condition=="B",c(4,5,6)) +
outer(df$Condition=="C",c(0,1,0))
df[,3:5] <- df[,3:5] * multiples
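To make the role of outer more concrete, here is a small sketch on a made-up four-element condition vector (cond, is_b and outer_b are illustrative names, not from the question). Each outer call produces a matrix that carries the multiplier vector in the matching rows and zeros elsewhere.

```r
cond <- c("A", "B", "C", "A")   # hypothetical Condition column

# Logical vector times numeric vector: TRUE rows get c(4, 5, 6), FALSE rows get zeros
is_b <- cond == "B"
outer_b <- outer(is_b, c(4, 5, 6))

# Adding the three pieces gives one row of multipliers per element of cond
multiples <- outer(cond == "A", c(1, 2, 3)) +
  outer(cond == "B", c(4, 5, 6)) +
  outer(cond == "C", c(0, 1, 0))
```

Note that a Condition value matching none of the comparisons ends up with an all-zero row in multiples, so the corresponding data would be zeroed out rather than left unchanged.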
Here's a non-vectorized, but easy to understand solution:
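replaceFunction <- function(v){
  # v is one row of df; apply() passes it as a character vector
  m <- as.numeric(v[3:5])
  if (v[6]=="A")
    out <- m * c(1,2,3)
  else if (v[6]=="B")
    out <- m * c(4,5,6)
  else if (v[6]=="C")
    out <- m * c(0,1,0)
  else
    out <- m  # unknown condition: leave the values unchanged
  return(out)
}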
g <- apply(df, 1, replaceFunction)
df[3:5] <- t(g)
df
Edited to reflect some notes from the comments
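Assuming that Condition is a factor with levels A, B, C (since R 4.0.0, data.frame() keeps strings as character by default, so convert it with factor() first if needed), you could do this: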
#Modified to reflect OP's edit - the same solution works just fine
m <- matrix(c(1:6,0,1,0),3,3,byrow = TRUE)
df[,3:5] <- with(df,df[,3:5] * m[Condition,])
which makes use of fairly quick vectorized multiplication. And obviously, wrapping this in with isn't strictly necessary, it's just what popped out of my brain. Also note the subsetting comment below by Backlin.
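If Condition is a character column rather than a factor, a variation on the same lookup idea is to give the multiplier matrix row names and index by name. This is a sketch under that assumption; the row-named matrix m and the name multiples are illustrative choices.

```r
# Example data from the question
df <- data.frame(
  Treatment = rep(LETTERS[1:4], each = 2),
  Species   = rep(1:4, each = 2),
  Value1    = c(0, 0, 1, 3, 4, 2, 0, 0),
  Value2    = c(0, 0, 3, 4, 2, 1, 4, 5),
  Value3    = c(0, 2, 4, 5, 2, 1, 4, 5),
  Condition = c("A", "B", "A", "C", "B", "A", "B", "C")
)

# One row of multipliers per condition, looked up by row name
m <- rbind(A = c(1, 2, 3), B = c(4, 5, 6), C = c(0, 1, 0))

# as.character() makes the lookup work for both character and factor columns
multiples <- m[as.character(df$Condition), ]
df[, 3:5] <- df[, 3:5] * multiples
```

Indexing by name also avoids depending on the order of factor levels, at the cost of an error if a Condition value has no matching row in m.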
More generally, remember that every subsetting operation you can do with subset you can also do with [, and crucially, [ supports assignment via [<-. So if you want to alter a portion of a data frame or matrix, you can always use this type of idiom:
df[rowCondition,colCondition] <- <replacement values>
assuming of course that <replacement values> is the same dimension as your subset of df. It may work otherwise, but you will run afoul of R's recycling rules and R may kick back a warning.
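Applied to the question, that idiom might look like the sketch below: select the rows for one condition, scale columns 3 to 5, and assign the result back in place. The list multipliers and the use of sweep are my own choices for illustration, not something the answer above prescribes.

```r
# Example data from the question
df <- data.frame(
  Treatment = rep(LETTERS[1:4], each = 2),
  Species   = rep(1:4, each = 2),
  Value1    = c(0, 0, 1, 3, 4, 2, 0, 0),
  Value2    = c(0, 0, 3, 4, 2, 1, 4, 5),
  Value3    = c(0, 2, 4, 5, 2, 1, 4, 5),
  Condition = c("A", "B", "A", "C", "B", "A", "B", "C")
)

# Hypothetical lookup of multipliers per condition
multipliers <- list(A = c(1, 2, 3), B = c(4, 5, 6), C = c(0, 1, 0))

for (cond in names(multipliers)) {
  rows <- df$Condition == cond
  # sweep() multiplies column j of the selected block by the j-th multiplier;
  # the replacement has the same dimensions as the subset it overwrites
  df[rows, 3:5] <- sweep(as.matrix(df[rows, 3:5]), 2, multipliers[[cond]], `*`)
}
```

Because the assignment targets only df[rows, 3:5], the Treatment and Species columns are left untouched.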
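For the original two-condition version of the question (B versus everything else), a one-liner with sapply:
df[3:5] <- df[3:5] * t(sapply(df$Condition, function(x) if(x=="B") 4:6 else 1:3))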
Or by matrix multiplication:
df[3:5] <- df[3:5] * (3*(df$Condition == "B") %*% matrix(1, 1, 3)
                      + matrix(1:3, nrow(df), 3, byrow=TRUE))