R data.table - multiply two columns whose components are matrices - r

I have a data.table with some numeric columns and another column each entry of which is a matrix. Here is an example:
library(data.table)
dt = data.table(a = c(1,2,3), b = c(-1,4,2))
dt$c = vector("list",3)
for (ind in 1:3){dt$c[[ind]] = round(matrix(10*runif(8), nrow = 4))}
For each row, I want to multiply the numeric vector formed by columns a and b with the corresponding 4 x 2 matrix in c and store the resulting 4 numbers into columns V1, V2, V3 and V4. For instance, for the first row, I would take the 4 x 2 matrix dt$c[[1]], multiply it with the 2 x 1 vector rbind(dt$a[1],dt$b[1]) and assign the resulting 4 numbers into the first row of 4 new columns named V1, V2, V3 and V4.
I am looking for a native data.table way to do this. I am currently looping over all rows, and that is prohibitively slow for my actual problem size. I have tried a variety of data.table syntaxes, but I suspect I am missing something fundamental in the way column c is internally treated as a list, and so I am unable to get the matrix multiplication to work.
Any help on this would be greatly appreciated.

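We can use Map to walk over columns a, b and c in parallel, multiply each matrix in c by the matching c(a, b) vector, and bind the transposed results into four new columns: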
dt[, paste0("V", 1:4) := do.call(rbind.data.frame,
Map(function(x, y, z) t(z %*% c(x, y)) , a, b, c))]
dt
# a b c V1 V2 V3 V4
#1: 1 -1 1, 7, 4,10, 9, 7, 4, 1,... -8 0 0 9
#2: 2 4 1, 8, 8, 4, 7, 1, 5,10,... 30 20 36 48
#3: 3 2 6, 6, 9, 5, 0, 0,10, 2,... 18 18 47 19
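To see what the Map call computes for each row, the same calculation can be sketched in base R with fixed inputs. The matrices in mats are made-up values chosen for illustration (the question uses random ones), and mats and res are hypothetical names.

```r
# Fixed, made-up inputs so the result is reproducible
a <- c(1, 2, 3)
b <- c(-1, 4, 2)
mats <- list(
  matrix(1:8, nrow = 4),        # columns: 1..4 and 5..8
  matrix(8:1, nrow = 4),        # columns: 8..5 and 4..1
  matrix(rep(1, 8), nrow = 4)   # all ones
)

# For row i, multiply the 4 x 2 matrix by the 2-vector c(a[i], b[i]).
# mapply returns one column per row of the data, so transpose at the end.
res <- t(mapply(function(x, y, z) as.vector(z %*% c(x, y)), a, b, mats))
colnames(res) <- paste0("V", 1:4)
print(res)
```

Each row of res holds the four numbers that would go into V1 to V4 for that row of the table.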

Related

Assign value to a specific rows (R)

I have a df of 16k+ items. I want to assign values A, B and C to those items based on their row numbers.
Example:
I have the following df with 10 unique items
df <- c(1:10)
Now I have three separate vectors (A, B, C) that contain row numbers of the df with values A, B or C.
A <- c(3, 9)
B <- c(2, 6, 8)
C <- c(1, 4, 5, 7, 10)
Now I want to add a new category column to the df and assign values A, B and C based on the row numbers that are in the three vectors that I have. For example, I would like to assign value C to rows 1, 4, 5, 7 and 10 of the df.
I tried to experiment with for loops and if statements to match the value of the vector with the row number of the df but I didn't succeed. Can anybody help out?
Here is a way to assign the new column.
Create the data frame and a list of vectors:
df <- data.frame(n=1:10)
dat <- list( A=c(3, 9), B=c(2, 6, 8), C=c(1, 4, 5, 7, 10) )
Put the data in the desired rows:
df$new[unlist(dat)] <- sub("[0-9].*$","",names(unlist(dat)))
Result:
df
n new
1 1 C
2 2 B
3 3 A
4 4 C
5 5 C
6 6 B
7 7 C
8 8 B
9 9 A
10 10 C
You could iterate over the names of a list and assign those names to the positions indexed by the successive sets of numeric values:
dat <- list(A=A,B=B,C=C)
for(i in names(dat)){ df$new[ dat[[i]] ] <- i}
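A variant of the first approach, sketched here as one possibility, avoids the regular expression by repeating each list name once per row number. This also works if the category names themselves contain digits. The column name new follows the answers above.

```r
A <- c(3, 9)
B <- c(2, 6, 8)
C <- c(1, 4, 5, 7, 10)

df <- data.frame(n = 1:10)
dat <- list(A = A, B = B, C = C)

# Start with an empty character column, then fill the listed rows.
df$new <- NA_character_
# rep() repeats "A" twice, "B" three times, "C" five times,
# in the same order as the row numbers produced by unlist(dat).
df$new[unlist(dat)] <- rep(names(dat), lengths(dat))
print(df)
```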

Doing computations on lists inside data frames

I want to aggregate data frame f into a new data frame g so that the column g$z contains a list of all group-wise values from column f$z. At first sight, this seems to be working:
f = data.frame(x=c(1, 1, 1, 2), y=c(4, 4, 5, 6), z=c(11, 12, 13, 14))
g = aggregate(z ~ x + y, f, c)
x y z
1 1 4 11, 12
2 1 5 13
3 2 6 14
Now I want to do different computations on the lists in column z for all rows in the data frame and put the result in new columns in the same data frame. But this doesn't work!
g$m = sum(g$z)
g$n = g$z + 1
Error in sum(g$z) : invalid 'type' (list) of argument
How can I work with lists inside a data frame cell like attempted above? Or is this simply un-R-like / impossible? If so, what is the correct approach?
UPDATE
My underlying goal is to do a lot of group-wise operations on all combinations of X and Y in the original data set. What options do I have for this in R in general?
Use apply. Pro: Everything in one table. Con: Complex table structure, can't use sum etc.
for(y), for(x), subset. Pro: Can do sum etc. directly. Con: Lots of code, and possibly slow.
Work in parallel with the original and the aggregated table. Pro: Can do sum etc. Con: Data duplication.
Other options?
sum and vectorized arithmetic don't apply to lists directly, but you can simply use sapply and lapply to apply them to each list element:
g$m <- sapply(g$z, sum)
g$n <- lapply(g$z, `+`, 1)
g
# x y z m n
#1 1 4 11, 12 23 12, 13
#2 1 5 13 13 14
#3 2 6 14 14 15
Or if you use tidyverse, you can use map + mutate (from purrr and dplyr):
library(dplyr)
library(purrr)
g %>% mutate(m = map_dbl(z, sum), n = map(z, ~.x + 1))
# x y z m n
#1 1 4 11, 12 23 12, 13
#2 1 5 13 13 14
#3 2 6 14 14 15
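For the underlying goal of group-wise operations on every combination of x and y, another option is to skip the list column and compute per-group values directly on the original data frame with base R's ave. This is only a sketch of one alternative, and the column name group_sum is made up for illustration.

```r
f <- data.frame(x = c(1, 1, 1, 2), y = c(4, 4, 5, 6), z = c(11, 12, 13, 14))

# ave() applies FUN within each (x, y) group and returns a vector
# aligned with the original rows, so no list column is needed.
f$group_sum <- ave(f$z, f$x, f$y, FUN = sum)
print(f)
```

The first two rows share the group x = 1, y = 4, so both receive the group sum 23.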

Conditional statement in R dataframe

I have dataframe df as below.
dput(df)
structure(list(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2), Z = c(2,
8, 7, 4, 3), R = c(6, 6, 6, 6, 66)), .Names = c("X", "Y", "Z",
"R"), row.names = c(NA, -5L), class = "data.frame")
df
class(df)
I have to modify df under two conditions.
First:
Modify df so that, in each row, whichever of X, Y and Z is the minimum is replaced with the corresponding value of R.
Second case:
In each row, whichever of X, Y, Z and R is the minimum is replaced with the row maximum of X, Y, Z and R, and the result is stored in a new df.
How should I do that?
I tried ifelse and if/else but could not get what I want.
Any help would be appreciated.
You can create a new dataset "df1" from the first three columns of "df". Multiply "df1" by -1 so that the minimum values become the maximum (assuming that there are no negative values). In the example the values are all unique per row, so you can use max.col with ties.method='first' to get the column index of the maximum (here, the original minimum) in each row. cbind that with 1:nrow(df) to create a row/column index, use it to select elements of "df1" (df1[cbind(...)]) and replace them with the "R" column values (<- df$R). Then assign the new values back to the original "df" columns (df[1:3]). If a row has more than one minimum value, you could use the loop method described for the second case.
df1 <- df[1:3]
df1[cbind(1:nrow(df),max.col(-1*df1, 'first'))] <- df$R
df[1:3] <- df1
df
# X Y Z R
#1 6 3 2 6
#2 6 5 8 6
#3 6 8 7 6
#4 7 7 6 6
#5 8 66 3 66
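The same first-case replacement can also be sketched with which.min, which finds the position of the row minimum directly and so does not rely on negating the values. The names df1 and min_col are used here for illustration only.

```r
df <- data.frame(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2),
                 Z = c(2, 8, 7, 4, 3), R = c(6, 6, 6, 6, 66))

df1 <- df
# Column position (1 = X, 2 = Y, 3 = Z) of the first minimum in each row
min_col <- unname(apply(df[1:3], 1, which.min))
# Two-column (row, column) index: replace each row minimum with that row's R
df1[cbind(seq_len(nrow(df)), min_col)] <- df$R
print(df1)
```

Like max.col with 'first', which.min only picks the first minimum when a row has ties.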
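For the second case, create a copy of "df" (df2), get the max values per row using pmax, loop over the rows of "df2" (sapply(seq_len...)) and change the "minimum" values in each row to the corresponding "max" values ("MaxV"), transpose (t) and assign the result back to "df2" (df2[]). Note that "df" here is the data frame already modified in the first step, which is why the output below starts from those values.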
df2 <- df
#only use this if there is only a single "minimum" value per row
# and no negative values in the data
#df2[cbind(1:nrow(df), max.col(-1*df2, 'first'))] <-
# do.call(pmax, df2)
MaxV <- do.call(pmax, df2)
df2[] <- t(sapply(seq_len(nrow(df2)), function(i) {
  x <- unlist(df2[i,])
  ifelse(x==min(x), MaxV[i], x)}))
df2
# X Y Z R
#1 6 3 6 6
#2 6 8 8 6
#3 8 8 7 8
#4 7 7 7 7
#5 8 66 66 66
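A vectorized sketch of the second case, written against the original df from the question rather than the one modified in the first step, so its result differs from the output shown above. It works on a matrix copy, handles ties and negative values, and uses the hypothetical names m, row_min, row_max and mask.

```r
df <- data.frame(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2),
                 Z = c(2, 8, 7, 4, 3), R = c(6, 6, 6, 6, 66))

m <- as.matrix(df)
row_min <- apply(m, 1, min)
row_max <- apply(m, 1, max)

# row_min is recycled down the columns, so cell [i, j] is compared with row_min[i]
mask <- m == row_min
# row(m)[mask] gives the row number of every flagged cell
m[mask] <- row_max[row(m)[mask]]

df2 <- as.data.frame(m)
print(df2)
```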

comparing two files and outputting common elements

I have 2 files, each with 3 columns and hundreds of rows. I want to find the rows whose first two columns are common to both files, and then add the third column of the second file to that list. The third column should contain the values from the second file that correspond to the matched pairs in the first two columns.
For example, consider two files of 6 rows and 3 columns
First file -
1 2 3
2 3 4
4 6 7
3 8 9
11 10 5
19 6 14
second file -
1 4 1
2 1 4
4 6 10
3 7 2
11 10 3
19 6 5
As I said, I have to compare the first two columns and then add the third column of the second file to that list. Therefore, the output must be:
4 6 10
11 10 3
19 6 5
I have the following code, but it gives an "object not found" error, and I am also not able to add the third column. Please help :)
Here df2 holds the first file and df3 the second file. The code is in R.
s1 = 1
for(i in 1:nrow(df2)){
for(j in 1:nrow(df3)){
if(df2[i,1] == df3[j,1]){
if(df2[i,2] == df3[j,2]){
common.rows1[s1,1] <- df2[i,1]
common.rows1[s1,2] <- df2[i,2]
s1 = s1 + 1
}
}
}
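One likely cause of the "object not found" error in the loop above is that common.rows1 is never created before values are assigned into it. A minimal corrected sketch follows, assuming the files have been read into data frames with columns named V1, V2 and V3 (the data below is typed in from the example, and those column names are an assumption):

```r
# First file (df2) and second file (df3), typed in from the example
df2 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                  V2 = c(2, 3, 6, 8, 10, 6),
                  V3 = c(3, 4, 7, 9, 5, 14))
df3 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                  V2 = c(4, 1, 6, 7, 10, 6),
                  V3 = c(1, 4, 10, 2, 3, 5))

# Create the result object first: an empty frame with df3's columns
common.rows1 <- df3[0, ]

for (i in seq_len(nrow(df2))) {
  for (j in seq_len(nrow(df3))) {
    # Both key columns must match in the same pair of rows
    if (df2[i, 1] == df3[j, 1] && df2[i, 2] == df3[j, 2]) {
      # Keep the whole row of the second file, including its third column
      common.rows1 <- rbind(common.rows1, df3[j, ])
    }
  }
}
print(common.rows1)
```

Growing a data frame with rbind inside nested loops is slow for large inputs, so this is mainly useful for understanding the error.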
You can use the %in% operator twice to subset your second data.frame (I call it df2):
df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2,]
# V1 V2 V3
#3 4 6 10
#5 11 10 3
#6 19 6 5
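V1 and V2 in my example are the column names of df1 and df2. Note that the two %in% tests check each column independently, so a row of df2 is kept whenever its V1 appears anywhere in df1$V1 and its V2 appears anywhere in df1$V2, even if that pair never occurs together in one row of df1. On the example data this happens to give the expected result.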
It seems that this is the perfect use-case for merge, e.g.
merge(d1[c('V1','V2')],d2)
results in:
V1 V2 V3
1 11 10 3
2 19 6 5
3 4 6 10
In which 'V1' and 'V2' are the column names of interest.
data.table proposal
library(data.table)
setDT(df1)
setDT(df2)
setkey(df1, V1, V2)
setkey(df2, V1, V2)
df2[df1[, -3, with = FALSE], nomatch = 0]
## V1 V2 V3
## 1: 4 6 10
## 2: 11 10 3
## 3: 19 6 5
If your two tables are d1 and d2,
d1<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(2, 3, 6, 8, 10, 6),
V3 = c(3, 4, 7, 9, 5, 14)
)
d2<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(4, 1, 6, 7, 10, 6),
V3 = c(1, 4, 10, 2, 3, 5)
)
then you can subset d2 (in order to keep the third column) with
d2[interaction(d2$V1, d2$V2) %in% interaction(d1$V1, d1$V2),]
interaction() treats the first two columns as a single combined key, so rows match only when both values agree.
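The difference between matching the two columns independently and matching them as a pair can be seen on a small made-up example. The data frames and the names loose and strict are hypothetical, and paste with a separator is used here as a simple stand-in for a combined key.

```r
d1 <- data.frame(V1 = c(1, 2), V2 = c(3, 4))
d2 <- data.frame(V1 = c(1, 2, 1), V2 = c(3, 4, 4), V3 = c(10, 20, 30))

# Independent tests: row 3 of d2 passes because 1 occurs in d1$V1 and
# 4 occurs in d1$V2, although the pair (1, 4) is not a row of d1.
loose <- d2[d2$V1 %in% d1$V1 & d2$V2 %in% d1$V2, ]

# Combined key: both values must match within the same row of d1.
key1 <- paste(d1$V1, d1$V2, sep = "_")
key2 <- paste(d2$V1, d2$V2, sep = "_")
strict <- d2[key2 %in% key1, ]

print(nrow(loose))
print(strict)
```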

creating a sum matrix from counts

I have a matrix of mutation counts, say "counts". This matrix has column names V1, V2,...,Vi,...Vn, where not every "i" is present, so the names can jump, e.g. V1, V2, V5. Further, most of the columns contain a 0.
I need to create a sum matrix, called "answer", where element i, j is the sum of the counts at i and j. The i, i element just shows the count at i.
Here's a quick data set up. I already have the correct dimensioned matrix set up in my code called "answer". Thus what I would need to automate are the last several lines where I fill in the matrix.
counts <- matrix(data = c(0,2,0,5,0,6,0), nrow = 1, ncol = 7, dimnames=list("",c("V1","V2","V3","V4","V5","V6","V7")))
answer <- matrix(data =0, nrow = 3, ncol = 3, dimnames = list(c("V2","V4","V6"),c("V2","V4","V6")))
answer[1,1] <- 2
answer[1,2] <- 7
answer[1,3] <- 8
answer[2,1] <- 7
answer[2,2] <- 5
answer[2,3] <- 11
answer[3,1] <- 8
answer[3,2] <- 11
answer[3,3] <- 6
I understand I can do this with 2 nested for loops, but surely there must be a better way no? Thanks!
This could be done with the right use of expand.grid and rowSums:
n = counts[, counts > 0]
answer = matrix(rowSums(expand.grid(n, n)), nrow=length(n), dimnames=list(names(n), names(n)))
diag(answer) = n
To show how it works, n would end up being:
V2 V4 V6
2 5 6
and expand.grid(n, n) would be:
Var1 Var2
1 2 2
2 5 2
3 6 2
4 2 5
5 5 5
6 6 5
7 2 6
8 5 6
9 6 6
The last line (diag) is necessary because otherwise the diagonal would be twice the original vector (adding 2+2, 5+5, or 6+6).
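The same pairwise sums can also be sketched with base R's outer, which builds the i, j table of n[i] + n[j] in one call and carries the names over as dimnames. This is an alternative to the expand.grid approach, not the answer's own method.

```r
counts <- matrix(c(0, 2, 0, 5, 0, 6, 0), nrow = 1,
                 dimnames = list("", paste0("V", 1:7)))

# Keep only the non-zero counts as a named vector
n <- counts[1, counts[1, ] > 0]

# outer() applies "+" to every pair (n[i], n[j])
answer <- outer(n, n, "+")
# The diagonal would otherwise hold n[i] + n[i]; reset it to n[i]
diag(answer) <- n
print(answer)
```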
