I have a (5x4) matrix in R, named data, defined as follows:
set.seed(123)
data <- matrix(rnorm(5*4,mean=0,sd=1), 5, 4)
and I want to create 4 different matrices that follow this formula. Assume that data[,1] = [A1,A2,A3,A4,A5]. I want to create the following matrix:
     A1-A1 A1-A2 A1-A3 A1-A4 A1-A5
     A2-A1 A2-A2 A2-A3 A2-A4 A2-A5
G1 = A3-A1 A3-A2 A3-A3 A3-A4 A3-A5
     A4-A1 A4-A2 A4-A3 A4-A4 A4-A5
     A5-A1 A5-A2 A5-A3 A5-A4 A5-A5
Similarly, for the other columns I want to calculate all the G matrices (G1,G2,G3,G4) at once. How can I achieve that with the sapply function?
We may use outer to compute the difference between every pair of elements of a column
outer(data[,1], data[,1], `-`)
If it should be done on each column, split the matrix by column (asplit with MARGIN = 2 returns a list of columns), then loop over that list and apply outer to each element
lapply(asplit(data, 2), function(x) outer(x, x, `-`))
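Since the question asks for sapply specifically, here is a minimal sketch of one way to do it. It assumes that a single 3-dimensional array (one 5x5 slice per column) is an acceptable container for G1..G4; the name G is only illustrative.

```r
set.seed(123)
data <- matrix(rnorm(5 * 4, mean = 0, sd = 1), 5, 4)

# simplify = "array" stacks the four 5x5 results along a third dimension
G <- sapply(seq_len(ncol(data)),
            function(j) outer(data[, j], data[, j], `-`),
            simplify = "array")

dim(G)     # 5 5 4
G[, , 1]   # the matrix G1 built from data[, 1]
```

G[, , j] then plays the role of Gj, and G[r, c, j] equals data[r, j] - data[c, j].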
Related
I have two matrices: A (k rows, m columns) and B (k rows, n columns)
I want to operate on all pairs of columns (one from A and one from B). The result should be a matrix C (m rows, n columns) where C[i,j] = f(A[,i],B[,j])
Now, if the function f were the dot product, the whole thing would be a simple matrix multiplication (C = t(A) %*% B)
but my f is different (specifically, I count the number of equal entries):
f = function(x,y) sum(x==y)
My question is whether there is a simple (and fast, because my matrices are big) way to compute the result,
preferably in R, but possibly in Python (numpy). I thought about using outer(A,B,"==") but this results in a 4-dimensional array, and I haven't figured out what exactly to do with it.
Any help is appreciated
In R, we can split both matrices into lists of columns and apply the function f with a nested lapply/sapply
lapply(asplit(A, 2), function(x) sapply(asplit(B, 2), function(y) f(x, y)))
Or use outer after converting to data.frame. A data.frame is a list of columns, so outer treats each column as one unit, whereas for a matrix the unit is a single element (a matrix is a vector with a dim attribute)
outer(as.data.frame(A), as.data.frame(B), FUN = Vectorize(f))
data
A <- cbind(1:5, 6:10)
B <- cbind(c(1:3, 1:2), c(5:7, 6:7))
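If the result is wanted directly as the m x n matrix C, the following is a small sketch of an alternative. It is not the answer's method: it assumes f is exactly "count equal entries", and it relies on R recycling a length-k column of B down each column of A in the comparison.

```r
A <- cbind(1:5, 6:10)
B <- cbind(c(1:3, 1:2), c(5:7, 6:7))

# For each column j of B, compare it against every column of A at once,
# then count the matches per column of A.
# Note: when A has a single column, vapply returns a plain vector, not a matrix.
C <- vapply(seq_len(ncol(B)),
            function(j) colSums(A == B[, j]),
            FUN.VALUE = numeric(ncol(A)))
C
```

Here C[i, j] is the number of positions where A[, i] and B[, j] agree, so only the inner comparison is vectorized and the loop runs over the columns of B.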
df<- data.frame(a=c(1:10), b=c(21:30),c=c(1:10), d=c(14:23),e=c(11:20),f=c(-6:-15),g=c(11:20),h=c(-14:-23),i=c(4:13),j=c(1:10))
In this data frame, I have three block-diagonal matrices which are as shown in the image below
I want to apply two functions: the sine function to the elements of the diagonal blocks and the cosine function to all other elements, producing a data frame with the same structure.
sin(df[1:2,1:2])
sin(df[3:5,3:5])
sin(df[6:10,6:10])
# cos() for all remaining elements
1) outer/arithmetic Create a logical block diagonal matrix indicating whether the current cell is on the block diagonal or not and then use that to take a convex combination of the sin and cos values giving a data.frame as follows:
v <- rep(1:3, c(2, 3, 5))
ind <- outer(v, v, `==`)
ind * sin(df) + (!ind) * cos(df)
2) ifelse Alternately, this gives a matrix result (or use as.matrix on the above). ind is from above.
m <- as.matrix(df)
ifelse(ind, sin(m), cos(m))
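3) Matrix::bdiag Another approach is to use bdiag in the Matrix package (which comes with R -- no need to install it). bdiag returns a sparse Matrix object, so it is converted to an ordinary matrix before the comparison.
library(Matrix)
ones <- function(n) matrix(1, n, n)
ind <- as.matrix(bdiag(ones(2), ones(3), ones(5))) == 1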
Now proceed as in the last line of (1) or as in (2).
If it's okay for the result to be stored in a new data frame, you could change the order of your instructions: apply cos to everything first, then overwrite the diagonal blocks with sin:
ndf <- cos(df)
ndf[1:2,1:2] <- sin(df[1:2,1:2])
ndf[3:5,3:5] <- sin(df[3:5,3:5])
ndf[6:10,6:10] <- sin(df[6:10,6:10])
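As a quick sanity check, the block-assignment approach and the logical-mask approach can be compared on the question's data. This is only a sketch; the name masked is introduced here for illustration, and dimnames are dropped before comparing because ifelse returns a matrix without them.

```r
df <- data.frame(a = 1:10, b = 21:30, c = 1:10, d = 14:23, e = 11:20,
                 f = -6:-15, g = 11:20, h = -14:-23, i = 4:13, j = 1:10)

# Mask approach: TRUE inside the 2x2, 3x3 and 5x5 diagonal blocks
v <- rep(1:3, c(2, 3, 5))
ind <- outer(v, v, `==`)
m <- as.matrix(df)
masked <- ifelse(ind, sin(m), cos(m))

# Block-assignment approach: cos everywhere, then sin on the blocks
ndf <- cos(df)
ndf[1:2, 1:2] <- sin(df[1:2, 1:2])
ndf[3:5, 3:5] <- sin(df[3:5, 3:5])
ndf[6:10, 6:10] <- sin(df[6:10, 6:10])

all.equal(unname(as.matrix(ndf)), masked)
```

The mask version scales more easily if the block sizes change, since only the vector passed to rep needs updating.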
Assuming I have a dataframe consisting of four columns
df1 <- data.frame(a=runif(10),b=runif(10),c=runif(10),d=runif(10))
And I want to have one column for each product of two different columns, i.e. all pairwise combinations except a column multiplied by itself
a*b, a*c, a*d, b*c, b*d, c*d
The solution I'm looking for should work for any number of columns, not just four
We can use combn to generate all combinations of the dataframe's column names taken 2 at a time, and pass it a custom function that subsets the two columns and multiplies them together.
combn(names(df1), 2, function(x) df1[x[1]] * df1[x[2]], simplify = FALSE)
This command returns a list of 6 dataframes (a*b, a*c, a*d, b*c, b*d, c*d) for the given example.
We could also use combn directly on the dataset: specify m as 2 to select pairwise combinations of columns, and specify FUN as Reduce with its parameter f set to * to multiply the corresponding elements of each pair of columns. This returns a matrix with one column per pair
combn(df1, 2, FUN = Reduce, f = `*`)
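Neither call above labels the products. A small sketch of one way to get named columns follows; the seed and the "a*b" naming scheme are assumptions made for illustration only.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df1 <- data.frame(a = runif(10), b = runif(10), c = runif(10), d = runif(10))

# One column per pair of names; each column is the elementwise product
prods <- combn(names(df1), 2,
               FUN = function(p) df1[[p[1]]] * df1[[p[2]]])

# Build matching labels with the same combn ordering
colnames(prods) <- combn(names(df1), 2, FUN = paste, collapse = "*")

head(prods)
```

Because both combn calls enumerate the pairs in the same order, the labels line up with the product columns for any number of input columns.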
I have a large dataset, X with 58140 columns, filled with either 1 or 0
I would like to create a 58139 x 58139 matrix from the information of the 58139 columns in the dataset.
For each element Aij of the matrix, I would like to find the number of rows that contain the value 1 in both column i+1 and column j+1 of X.
I figured I can do this through sum(X[[2]]+X[[3]] == 2) for the A12 element of the matrix.
The only problem left is a way to code the matrix in.
You can use mapply over all (i, j) pairs of column indices. That returns a numeric vector, which you can wrap in a call to matrix; then drop the first row and column.
# sample data
set.seed(123)
X <- data.frame(matrix(rbinom(200, 1, .5), nrow=10))
#
A <- matrix(mapply(function(i, j) sum(rowSums(X[, c(i,j)])==2),
i=rep(1:ncol(X), ncol(X)),
j=rep(1:ncol(X), each=ncol(X))),
ncol=ncol(X))[-1, -1]
A
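For 0/1 data there is also a linear-algebra shortcut, sketched below as an alternative to the mapply call rather than as part of the original answer. Entry (i, j) of t(M) %*% M is the sum over rows of M[, i] * M[, j], which for 0/1 values is exactly the number of rows where both columns are 1. The names M and A_fast are illustrative.

```r
set.seed(123)
X <- data.frame(matrix(rbinom(200, 1, .5), nrow = 10))

M <- as.matrix(X)

# crossprod(M) computes t(M) %*% M; drop the first row and column as before
A_fast <- crossprod(M)[-1, -1]
A_fast[1:3, 1:3]
```

This avoids the explicit loop over column pairs, but the result is still a dense matrix, so with 58139 columns it would need roughly 27 GB as doubles.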
I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop that I wrote that does this, but it is very inefficient and I'd like something quicker and more efficient.
datanew <- c()
for (i in 1:51){
  for (k in 1:6){
    for (m in 1:4){
      sub <- subset(data,data$var1==i & data$var2==k)
      sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]
      sub$newvar <- m
      datanew <- rbind(datanew,sub)
    }
  }
}
Please let me know what you think and thanks for the help.
Below is some sample data with 2K observations instead of 200K
# SAMPLE DATA
#------------------------------------------------#
# 2000 observations (rows) by 100 variables (columns)
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=100, nrow=20e2))
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply.
Also, we are creating two vectors that we will combine for vectorized multiplication.
# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset
# 1 for the columns to scale (4 through ncol-1, as in the original loop), 0 elsewhere
multpVec <- c(rep(c(0, 1), times=c(3, ncol(mydf)-3-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors
(multpVec * filingstat0711[1, 2, 1] + invVec)
# Instead of for loops, we can use mapply.
newdf <-
  mapply(function(i, k)
    # The function being `mapply`ed is:
    #   rbind'ing a list of data frames, each subsetted by matching var1 & var2
    #   and then multiplied by a value in filingstat0711
    do.call(rbind,
      # iterating over m
      lapply(1:4, function(m)
        # the cbind is for adding newvar=m at the end of the subtable
        cbind(
          # we transpose twice: first the subset, so that our vector is applied per column;
          # then the result, to get back the original form
          t( t(subset(mydf, var1==i & var2==k)) *
              (multpVec * filingstat0711[i,k,m] + invVec)),
          # this is an argument to cbind
          "newvar"=m)
      )),
    # the two vectors passed as arguments are the columns of the expanded grid
    ixk$i, ixk$k, SIMPLIFY=FALSE
  )
# flatten the list of results into a single object
newdf <- do.call(rbind, newdf)
Two points to note:
Try not to use names like data, table, df, sub etc. for your objects, since they are also the names of commonly used functions.
In the above code I used mydf in place of data.
You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation
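Another option, sketched here on toy data rather than taken from the answer above, is to skip the subsetting entirely and look up one multiplier per row with matrix indexing into the 3-d array. The objects toy and scal are hypothetical stand-ins for the question's data and filingstat0711, and the layout (three leading columns left alone, last column left alone) is assumed from the original loop.

```r
set.seed(1)
# Hypothetical stand-in for `data`: two grouping columns, an id, two value columns, a trailing column
toy <- data.frame(var1 = sample(1:3, 8, replace = TRUE),
                  var2 = sample(1:2, 8, replace = TRUE),
                  id   = 1:8,
                  x1   = rnorm(8),
                  x2   = rnorm(8),
                  last = 1)
# Hypothetical stand-in for filingstat0711, indexed as [var1, var2, m]
scal <- array(seq_len(3 * 2 * 4), dim = c(3, 2, 4))

cols <- 4:(ncol(toy) - 1)  # columns to scale, as in the original loop

expanded <- do.call(rbind, lapply(1:4, function(m) {
  out <- toy
  # a 3-column index matrix picks scal[var1, var2, m] for every row at once
  mult <- scal[cbind(toy$var1, toy$var2, m)]
  out[cols] <- out[cols] * mult
  out$newvar <- m
  out
}))

dim(expanded)
```

This leaves only the loop over m, so rbind is called once on four pieces instead of growing the result inside nested loops. The rows come out ordered by m rather than by (var1, var2, m), which may or may not matter for the downstream use.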