Multiply various subsets of a data frame by different vectors - r

I would like to multiply several columns in my data frame by a vector of values. The specific vector of values changes depending on the value in another column.
--EDIT--
What if I make the data set more complicated, i.e., more than 2 conditions and the conditions are randomly shuffled around the data set?
Here is an example of my data set:
df=data.frame(
Treatment=(rep(LETTERS[1:4],each=2)),
Species=rep(1:4,each=2),
Value1=c(0,0,1,3,4,2,0,0),
Value2=c(0,0,3,4,2,1,4,5),
Value3=c(0,2,4,5,2,1,4,5),
Condition=c("A","B","A","C","B","A","B","C")
)
Which looks like:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 2 B
B 2 1 3 4 A
B 2 3 4 5 C
C 3 4 2 2 B
C 3 2 1 1 A
D 4 0 4 4 B
D 4 0 5 5 C
If Condition=="A", I would like to multiply columns 3-5 by the vector c(1,2,3). If Condition=="B", I would like to multiply columns 3-5 by the vector c(4,5,6). If Condition=="C", I would like to multiply columns 3-5 by the vector c(0,1,0). The resulting data frame would therefore look like this:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 12 B
B 2 1 6 12 A
B 2 0 4 0 C
C 3 16 10 12 B
C 3 2 2 3 A
D 4 0 20 24 B
D 4 0 5 0 C
I have tried subsetting the data frame and multiplying by the vector:
t(t(subset(df[,3:5],df[,6]=="A")) * c(1,2,3))
But I can't return the subsetted data frame to the original. Is there any way to perform this operation without subsetting the data frame, so that other columns (e.g., Treatment, Species) are preserved?

Here's a fairly general solution that you should be able to adapt to fit your needs.
Note the first argument in the outer call is a logical vector and the second is numeric, so before multiplication TRUE and FALSE are converted to 1 and 0, respectively. We can add the outer results because the conditions are non-overlapping and the FALSE elements will be zero.
multiples <-
outer(df$Condition=="A",c(1,2,3)) +
outer(df$Condition=="B",c(4,5,6)) +
outer(df$Condition=="C",c(0,1,0))
df[,3:5] <- df[,3:5] * multiples
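The same idea extends to any number of conditions. As a minimal sketch, assuming the multipliers are kept in a named list (the name multipliers is illustrative and not part of the answer above), one outer product per condition can be built and summed with Reduce:

```r
# Example data from the question
df <- data.frame(
  Treatment = rep(LETTERS[1:4], each = 2),
  Species   = rep(1:4, each = 2),
  Value1    = c(0, 0, 1, 3, 4, 2, 0, 0),
  Value2    = c(0, 0, 3, 4, 2, 1, 4, 5),
  Value3    = c(0, 2, 4, 5, 2, 1, 4, 5),
  Condition = c("A", "B", "A", "C", "B", "A", "B", "C")
)

# Hypothetical named list: one multiplier vector per condition
multipliers <- list(A = c(1, 2, 3), B = c(4, 5, 6), C = c(0, 1, 0))

# One matrix per condition (all-zero rows where the condition does not hold),
# summed into a single matrix of per-row multipliers
multiples <- Reduce(`+`, lapply(names(multipliers), function(k) {
  outer(df$Condition == k, multipliers[[k]])
}))

df[, 3:5] <- df[, 3:5] * multiples
df
```

Rows whose Condition has no entry in the list would be multiplied by zero, so the list needs one entry per condition.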

Here's a non-vectorized, but easy to understand solution:
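# apply() passes each row as a character vector, so convert columns 3-5 back to numeric
replaceFunction <- function(v){
  m <- as.numeric(v[3:5])
  if (v[6]=="A")
    out <- m * c(1,2,3)
  else if (v[6]=="B")
    out <- m * c(4,5,6)
  else if (v[6]=="C")
    out <- m * c(0,1,0)
  else
    out <- m  # any other condition: leave the values unchanged
  return(out)
}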
g <- apply(df, 1, replaceFunction)
df[3:5] <- t(g)
df

Edited to reflect some notes from the comments
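Assuming that Condition is a factor (since R 4.0.0, data.frame() keeps strings as character by default, so convert first with df$Condition <- factor(df$Condition)), you could do this: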
#Modified to reflect OP's edit - the same solution works just fine
m <- matrix(c(1:6,0,1,0),3,3,byrow = TRUE)
df[,3:5] <- with(df,df[,3:5] * m[Condition,])
which makes use of fairly quick vectorized multiplication. And obviously, wrapping this in with isn't strictly necessary, it's just what popped out of my brain. Also note the subsetting comment below by Backlin.
More globally, remember that any subsetting you can do with subset you can also do with [, and crucially, [ supports assignment via [<-. So if you want to alter a portion of a data frame or matrix, you can always use this type of idiom:
df[rowCondition,colCondition] <- <replacement values>
assuming of course that <replacement values> is the same dimension as your subset of df. It may work otherwise, but you will run afoul of R's recycling rules and R may kick back a warning.
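As a sketch of that idiom applied to the question's data (the lookup matrix mult and the loop are illustrative assumptions, not part of the answer above), each condition's rows can be updated in place, which leaves Treatment and Species untouched:

```r
# Example data from the question
df <- data.frame(
  Treatment = rep(LETTERS[1:4], each = 2),
  Species   = rep(1:4, each = 2),
  Value1    = c(0, 0, 1, 3, 4, 2, 0, 0),
  Value2    = c(0, 0, 3, 4, 2, 1, 4, 5),
  Value3    = c(0, 2, 4, 5, 2, 1, 4, 5),
  Condition = c("A", "B", "A", "C", "B", "A", "B", "C")
)

# Hypothetical lookup table: one row of multipliers per condition
mult <- rbind(A = c(1, 2, 3), B = c(4, 5, 6), C = c(0, 1, 0))

for (cond in rownames(mult)) {
  rows <- df$Condition == cond
  # t(t(x) * v) multiplies column j of x by v[j];
  # the replacement has the same dimensions as the subset being assigned
  df[rows, 3:5] <- t(t(as.matrix(df[rows, 3:5])) * mult[cond, ])
}
df
```

Because mult is indexed by row name, this version works whether Condition is a character vector or a factor.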

df[3:5] <- df[3:5] * t(sapply(df$Condition, function(x) if(x=="B") 4:6 else 1:3))
Or by matrix multiplication (like the sapply version above, this covers only conditions A and B, as in the original question):
df[3:5] <- df[3:5] * (3*(df$Condition == "B") %*% matrix(1, 1, 3)
+ matrix(1:3, nrow(df), 3, byrow=T))

Related

Comparing elements from different columns but from the same data frame with R

I am trying to determine sequence similarity.
I would like to create a function to compare df elements, for the following example:
V1 V2 V3 V4
1 C D A D
2 A A S E
3 V T T V
4 A T S S
5 C D R Y
6 C A D V
7 V T E T
8 A T A A
9 R V V W
10 W R D D
I want to compare the first element from the first column with a first element from the second column. If it matches == 1, else 0. Then the second element from the first column compared with the second element from the second column. and so on.
For example:
C != D -----0
A == A -----1
That way I would like to compare column 1 with column 2 then column 3 and column 4.
Then column 2 compare with column 3 and column 4.
Then column 3 with column 4.
The output would be just the numbers:
0
1
0
0
0
0
0
0
0
0
I tried the following but it doesn't work:
compared_df <- ifelse(df_trial$V1==df_trial$V2,1,ifelse(df_trial$V1==df_trial$V2,0,NA))
compared_df
As suggested, I tried the following:
compared_df1 <- df_trial$matches <- as.integer(df_trial$V1 == df_trial$V2)
This works well for small sample comparison. Is there a way to compare more globally? Like for the updated columns.
As @Ronak Shah said in the comments, the following is sufficient if you want to compare 2 columns:
df$matches <- as.integer(df$V1 == df$V2)
Another option, which also works for more than 2 columns, is to use apply to check whether each row contains only a single unique value:
df$matches = apply(df, 1, function(x) as.integer(length(unique(x)) == 1))
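To compare every pair of columns, as the question asks, one possible sketch uses combn to enumerate the column pairs and builds one 0/1 vector per pair. The data frame below is retyped from the question's example, and the names col_pairs and matches are illustrative:

```r
# Example data retyped from the question
df <- data.frame(
  V1 = c("C", "A", "V", "A", "C", "C", "V", "A", "R", "W"),
  V2 = c("D", "A", "T", "T", "D", "A", "T", "T", "V", "R"),
  V3 = c("A", "S", "T", "S", "R", "D", "E", "A", "V", "D"),
  V4 = c("D", "E", "V", "S", "Y", "V", "T", "A", "W", "D"),
  stringsAsFactors = FALSE
)

# 2 x 6 matrix: each column holds the names of one pair of columns
col_pairs <- combn(names(df), 2)

# One 0/1 column per pair: 1 where the two columns agree in that row
matches <- apply(col_pairs, 2, function(p) as.integer(df[[p[1]]] == df[[p[2]]]))
colnames(matches) <- apply(col_pairs, 2, paste, collapse = "_")

matches
colSums(matches)  # number of matching rows per pair
```

The column V1_V2 reproduces the output requested in the question; the remaining columns cover the other five pairs.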

R, conditional summing of every second cell in each row

I have a data frame and want for each row the sum of every second cell (beginning with the second cell), whose left neighbor is greater than zero. Here's an example:
a <- c(-2,1,1,-2)
b <- c(1,2,3,4)
c <- c(-2,1,-1,2)
d <- c(5,6,7,8)
df <- data.frame(a,b,c,d)
This gives:
> df
a b c d
1 -2 1 -2 5
2 1 2 1 6
3 1 3 -1 7
4 -2 4 2 8
For the first row the correct sum is 0 (the left neighbor of 1 is -2 and the left neighbor of 5 is also -2); for the second it's 8; for the third it's 3; for the fourth it's again 8.
I want to do it without loops, so I tried it with sum() and which() like in Conditional Sum in R, but could not find a way through.
We take the odd-numbered columns (1st, 3rd, ...) with the recycled logical index c(TRUE, FALSE) and test whether each value is greater than 0, which gives a logical matrix. We then multiply that matrix by the even-numbered columns (2nd, 4th, ...), selected with c(FALSE, TRUE). Wherever the left neighbor is not greater than 0, the logical value is FALSE, which is coerced to 0 in the multiplication and so drops that cell from the sum. Finally, rowSums gives the expected output:
rowSums((df[c(TRUE, FALSE)]>0)*df[c(FALSE, TRUE)])
#[1] 0 8 3 8
It can be also replaced with seq
rowSums((df[seq(1, ncol(df), by = 2)]>0)*df[seq(2, ncol(df), by = 2)])
#[1] 0 8 3 8
Or another option is Reduce with Map
Reduce(`+`, Map(`*`, lapply(df[c(TRUE, FALSE)], `>`, 0), df[c(FALSE, TRUE)]))
#[1] 0 8 3 8
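To see why the one-liner works, the intermediate steps can be written out separately. This is only an illustrative decomposition of the first solution; the names left, right, mask and kept are not from the answer:

```r
df <- data.frame(a = c(-2, 1, 1, -2), b = c(1, 2, 3, 4),
                 c = c(-2, 1, -1, 2), d = c(5, 6, 7, 8))

left  <- as.matrix(df[c(TRUE, FALSE)])  # columns a and c (the left neighbors)
right <- as.matrix(df[c(FALSE, TRUE)])  # columns b and d (the values to sum)

mask <- left > 0        # TRUE where the left neighbor is positive
kept <- mask * right    # TRUE/FALSE act as 1/0, zeroing the excluded cells

row_totals <- rowSums(kept)
row_totals
```

In row 3, for example, only b is kept (its left neighbor a is 1) while d is zeroed (its left neighbor c is -1), which gives the total of 3.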

How to get the sum shared values of all the randomly picked two columns in a dataframe

I'm quite new to R, so please forgive me; I don't even know how best to ask this question. The goal is to find out which two or three factors share the most values.
I have a dataframe like this:
mydata<-read.table(header=TRUE, text="
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")
A,B,C and D are four factors. Hopefully this table can be displayed well in your R.
I want to figure out the number of shared values (excluding 0) between every two columns. I'm expecting the results to be displayed like below:
myresuts<-read.table(header=TRUE, text = "
factor_1 factor_2 number_of_shared
A B 2
A C 2
A D 3
B C 1
B D 1
C D 2")
For this small table, I can do the intersection manually. But in fact I have a quite big table with more than 100 columns to do such calculation. I wonder how to write a function to solve this problem.
Also, I would like to find the number of values shared by every set of three columns (hopefully this can be solved in the same way).
Thanks!
A useful function for calculating combinations and permutations can be found in the gtools library.
library(gtools)
cbn <- data.frame(combinations(ncol(mydata),2,names(mydata)))
cbn$num_shared = apply(cbn, 1, function(i) sum(mydata[,i[1]] == mydata[,i[2]]))
cbn
X1 X2 num_shared
1 A B 2
2 A C 3
3 A D 4
4 B C 4
5 B D 3
6 C D 4
If you do not want to compare zeroes, convert them to NA using mydata[mydata == 0] <- NA and place na.rm = T inside the sum.
Your desired results suggest that you don't want to count zero values in the comparison. I'm doing this by converting zeros to NA first (I also convert to character so we can compare columns with non-overlapping values).
mydata <- lapply(mydata,
function(x) {
x[x==0] <- NA
as.character(x)
})
cc <- combn(names(mydata),2,
FUN=function(x) {
data.frame(matrix(x,nrow=1),
val=sum(mydata[[x[1]]]==mydata[[x[2]]],na.rm=TRUE))
},
simplify=FALSE)
do.call(rbind,cc)
This should work for 3 columns if you change the condition in the function appropriately ...
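One way the three-column case might look is sketched below. The helper count_shared is hypothetical (not from either answer); like the answers above, it compares values row by row and counts a row only when all selected columns hold the same non-zero value:

```r
mydata <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")

# Hypothetical helper: for every combination of k columns, count the rows
# in which all k columns hold the same non-zero value
count_shared <- function(data, k) {
  data[data == "0"] <- NA  # zeros never count as shared
  combos <- combn(names(data), k, simplify = FALSE)
  counts <- vapply(combos, function(cols) {
    first <- data[[cols[1]]]
    # TRUE only where every other column equals the first one
    same <- Reduce(`&`, lapply(data[cols[-1]], function(x) x == first))
    sum(same, na.rm = TRUE)
  }, integer(1))
  data.frame(do.call(rbind, combos), number_of_shared = counts)
}

pair_counts   <- count_shared(mydata, 2)
triple_counts <- count_shared(mydata, 3)
pair_counts
triple_counts
```

With k = 2 this reproduces the counts requested in the question; with k = 3 every triple shares only peak_8.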

Perform ifelse() on every element of a data frame, but different test for each column in R

I've got a large data frame [4000,600] and I'd like to set elements to 0 if they are more than three orders of magnitude smaller than their column's maximum. That is, each element is compared to the maximum value of its column: if element < 0.001*column_max it should be converted to 0, and otherwise it should remain the same.
I am having a tough time getting apply() to let me use an ifelse() function. Is there a better approach or function I am missing?? I'm fairly new to R.
Use lapply to loop over each column with a replace call:
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))
dat
# a b
#1 1 3
#2 2 4
#3 1001 3003
dat[] <- lapply(dat, function(x) replace(x, x < max(x)/10^3, 0) )
dat
# a b
#1 0 0
#2 2 4
#3 1001 3003
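This should work with ifelse if you use apply column-wise (note that apply returns a matrix, so wrap the result in as.data.frame if you need a data frame back):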
df <- data.frame(a = c(1:10, 4000), b = c(4:13, 7000))
apply(df, 2, function(x){ifelse(x < 0.001*max(x), 0, x)})
We could do this without using ifelse
library(dplyr)
# mutate_each() and funs() are deprecated; across() requires dplyr >= 1.0.0
dat %>%
mutate(across(everything(), ~ (.x >= 0.001*max(.x)) * .x))
# a b
#1 0 0
#2 2 4
#3 1001 3003
data
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))
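For an all-numeric data frame, another possible approach is to work on a matrix and compare every column against its own threshold in one step with sweep. This is a sketch of an alternative technique, not one of the answers above; the names thresholds, below and result are illustrative:

```r
dat <- data.frame(a = c(1, 2, 1001), b = c(3, 4, 3003))

m <- as.matrix(dat)                       # assumes all columns are numeric
thresholds <- apply(m, 2, max) / 1000     # one cutoff per column
below <- sweep(m, 2, thresholds, "<")     # logical matrix: element < its column's cutoff
m[below] <- 0

result <- as.data.frame(m)
result
```

Working on a single matrix avoids looping over 600 columns in R code, although whether it is faster in practice would need to be measured on the real data.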

Filling in missing rows/columns in distance matrices in R

I've two distance matrices.. but either of them can have items missing, and they can be out of order -- for example:
matrix #1 (missing item c)
a b d
a 0 2 3
b 2 0 4
d 3 4 0
matrix #2 (missing item b, and items out of order)
d c a
d 0 1 2
c 1 0 1
a 2 1 0
I want to find the difference between the matrices, while assuming that any missing items are 0. So, my resulting matrix should be:
a b c d
a 0 2 1 1
b 2 0 0 4
c 1 0 0 1
d 1 4 1 0
What's the best way to go about this? Should I be sorting both matrices and then filling in missing columns/rows so that I can then just abs(m1-m2), or is there a way to use row/column headings to have them automatically "match up" when subtracting?
These matrices are 5000x5000 or so, and I'll have about 1000 of them to compare pairwise, so I'd rather take a hit on preprocessing the data if that will make each computation significantly faster.
Any hints or suggestions are welcome. I'm usually a non-R programmer, so the iterative solution that I would normally come up with would take forever. I'm hoping for the "R way" of doing things that will be significantly faster.
We create a names index ('Un1') which is the union of names of the first ('m1') and second ('m2') matrix. Two new 0 matrices ('m1N', 'm2N') are created by specifying the dimensions and dim names based on 'Un1'. By row/column indexing, we change the 0 values in these matrices to the values in 'm1', 'm2', subtract and get the absolute.
Un1 <- sort(union(colnames(m1), colnames(m2)))
m1N <- matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1))
m2N <- m1N
m1N[rownames(m1), colnames(m1)] <- m1
m2N[rownames(m2), colnames(m2)] <- m2
abs(m1N-m2N)
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0
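The code above assumes that m1 and m2 already exist. A self-contained sketch with the question's two matrices follows; the helper pad_matrix is a hypothetical wrapper around the same zero-fill-and-index technique:

```r
# The two example matrices from the question
labs1 <- c("a", "b", "d")
m1 <- matrix(c(0, 2, 3,
               2, 0, 4,
               3, 4, 0), nrow = 3, byrow = TRUE, dimnames = list(labs1, labs1))
labs2 <- c("d", "c", "a")
m2 <- matrix(c(0, 1, 2,
               1, 0, 1,
               2, 1, 0), nrow = 3, byrow = TRUE, dimnames = list(labs2, labs2))

# Hypothetical helper: embed m in an all-zero matrix covering every label,
# matching rows and columns by name so the original order does not matter
pad_matrix <- function(m, labels) {
  out <- matrix(0, nrow = length(labels), ncol = length(labels),
                dimnames = list(labels, labels))
  out[rownames(m), colnames(m)] <- m
  out
}

all_labels <- sort(union(colnames(m1), colnames(m2)))
diff_mat <- abs(pad_matrix(m1, all_labels) - pad_matrix(m2, all_labels))
diff_mat
```

When many matrices are compared pairwise, padding each one once against the full label set and reusing the padded copies keeps the per-pair work down to a single subtraction.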
Update
If we have several matrices with object names m followed by numbers, we can place them in a list. We get the object names using ls and the values in a list with mget. Loop through the list with lapply to get the column names, use union as f in Reduce, sort to get the unique elements.
lst <- mget(ls(pattern='m\\d+')) #change the pattern accordingly
Un1 <- sort(Reduce(union, lapply(lst, colnames)))
We can create another list with matrix of 0s.
lst1 <- lapply(seq_along(lst), function(i)
matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1)))
We can change the corresponding elements of 'lst1' using the row/column index of corresponding matrices of 'lst' using Map.
lst2 <- Map(function(x,y) {x[rownames(y), colnames(y)] <- y; x}, lst1, lst)
If we need pairwise difference, combn may be an option
lst3 <- combn(seq_along(lst2),2, FUN=function(x)
list(abs(lst2[[x[1]]]-lst2[[x[2]]])))
names(lst3) <- combn(seq_along(lst2), 2, FUN=paste, collapse='_')
Another approach using match (the beginning is similar to @akrun's answer):
func = function(cols, m)
{
res = `dimnames<-`(m[match(cols,rownames(m)), match(cols,colnames(m))],
list(cols, cols))
ifelse(is.na(res), 0, res)
}
cols = sort(union(colnames(m1), colnames(m2)))
abs(func(cols,m1) - func(cols,m2))
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0
