Analyze the entropy of a matrix - r

Given two matrices MAT1 and MAT2 that contain a set of column vectors of different coordinates for three elements (F, L, G), I would like to test which of the two matrices has the higher entropy. In other words, the data points of each vector are sometimes very close to each other for the same element, and in other cases far from each other. When a vector is in balance, the points of one class (F, L, or G) are in the same position or very close to each other.
What is the right way to test this, given that the number of column vectors may vary between experiments? Some toy data:
MAT1 <- matrix(runif(32), 8, 4)
row.names(MAT1) <- c("F","G","G","L","F","F","L","G")
MAT2 <- matrix(runif(24), 8, 3)
row.names(MAT2) <- c("F","G","G","L","F","F","L","G")
I've been using the R function DescTools::MutInf(MAT1) but as you can see, I am not an expert in this field!
Thank you,

Related

Custom rounding to reference values (nDimensional)

I would like to extract the coordinate from a vector that is closest to a test coordinate.
The task is very similar to the previously posted question (Find the approximate value in the vector), but adapted to n-dimensional cases and with multiple inputs.
In other words, given:
test=t(data.frame(
c(0.9,1.1,1),
c(7.5,7.4,7.3),
c(11,11,11.2)
))
reference=t(data.frame(
c(1,0,0.5),
c(2,2,2),
c(3.3,3.3,3.3),
c(9,9,9),
c(10,11,12)
))
result <- approximate(test,reference)
1 0 0.5
9 9 9
10 11 12
I programmed a function using Euclidean distances and old-school loops, but when the input data frames are big it takes a very long time to run.
Can anyone figure out a more efficient way of doing it?
Thank you in advance.
PS: This is the function I created. It works but takes a while (in case someone finds it useful).
approximate_function <- function(approximate, reference) {
  # Returns, for each row of approximate, the closest row of reference.
  # It uses the Euclidean distance.
  # Each entry must be a row in the data frame;
  # the number of columns of the data frame gives the dimension of the points.

  # Sub-function to calculate the Euclidean distance
  distance_function <- function(a, b) {
    squaresum <- 0
    for (id in 1:length(a)) {
      squaresum <- squaresum + (a[id] - b[id])^2
    }
    result <- sqrt(squaresum)
    return(result)
  }
  result <- data.frame()
  # Take one item to approximate at a time
  for (id_approximate in 1:nrow(approximate)) {
    distance <- c()
    # Compare the value to approximate with the reference points and choose the one with the smallest distance
    for (id_reference in 1:nrow(reference)) {
      distance[id_reference] <- distance_function(approximate[id_approximate, ], reference[id_reference, ])
    }
    result <- rbind(
      result,
      reference[which.min(distance), ]
    )
  }
  return(result)
}
Update: computing the distances to all reference rows at once, as below, the calculation is done instantly.
approximate_function <- function(approximate, reference) {
  # Returns, for each row of approximate, the closest row of reference.
  # It uses the Euclidean distance.
  # Each entry must be a row in the data frame;
  # the number of columns of the data frame gives the dimension of the points.
  results <- data.frame()
  # Take one item to approximate at a time
  for (id in 1:nrow(approximate)) {
    # Calculate Euclidean distances regardless of the dimension
    sumsquares <- rep(0, nrow(reference))
    for (dim in 1:ncol(approximate)) {
      sumsquares <- sumsquares + (approximate[id, dim] - reference[, dim])^2
    }
    distances <- sqrt(sumsquares)
    results <- rbind(
      results,
      reference[which.min(distances), ]
    )
  }
  return(results)
}
You've got a few calculations that will be slow.
First:
test=t(data.frame(
c(0.9,1.1,1),
c(7.5,7.4,7.3),
c(11,11,11.2)
))
This one probably doesn't matter, but it would be better as
test=rbind(
c(0.9,1.1,1),
c(7.5,7.4,7.3),
c(11,11,11.2)
)
Same for setting up reference.
Second and third: You set up result as a dataframe, then add rows to it one at a time. Dataframes are much slower for row operations than matrices, and gradually growing structures in R is slow. So set it up as a matrix from the beginning at the right size, and assign results into specific rows.
EDITED to add:
Fourth: there's no need for the inner loop. You can calculate all the squared differences in one big matrix, then use rowSums or colSums to get the squared distances. This is easiest if you're working with matrix columns instead of rows, because vectors will be properly replicated automatically.
Fifth: There's no need to take the square root; if the squared distance is minimized, so is the distance.
Here's the result:
approximate <- function(test, reference) {
  # transpose the reference so that each reference point is a column
  reference <- t(reference)
  # set up the result at its final size, not transposed
  result <- test * NA
  # Take one row of test at a time
  for (id in seq_len(nrow(test))) {
    # test[id, ] is recycled down each column of reference
    squareddist <- colSums((test[id, ] - reference)^2)
    result[id, ] <- reference[, which.min(squareddist)]
  }
  return(result)
}
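If the remaining loop over test rows is still a bottleneck, one further option (not from the answer above, just a sketch) is to compute every squared distance at once using the identity |t - r|^2 = |t|^2 + |r|^2 - 2 t.r. The name nearest_rows is made up for illustration, and the approach assumes a matrix of nrow(test) by nrow(reference) distances fits in memory. Because of floating-point cancellation the squared distances can be slightly off for nearly identical points, so treat it as an approximation in that case.

```r
# Sketch: nearest reference row for every test row, without an explicit loop.
# nearest_rows is a hypothetical name, not part of the original answer.
nearest_rows <- function(test, reference) {
  test <- as.matrix(test)
  reference <- as.matrix(reference)
  # Squared distances via |t - r|^2 = |t|^2 + |r|^2 - 2 * t.r
  # (rows index test points, columns index reference points)
  d2 <- outer(rowSums(test^2), rowSums(reference^2), "+") -
    2 * tcrossprod(test, reference)
  # max.col on the negated matrix gives the column of each row minimum
  nearest <- max.col(-d2, ties.method = "first")
  reference[nearest, , drop = FALSE]
}

test <- rbind(c(0.9, 1.1, 1), c(7.5, 7.4, 7.3), c(11, 11, 11.2))
reference <- rbind(c(1, 0, 0.5), c(2, 2, 2), c(3.3, 3.3, 3.3),
                   c(9, 9, 9), c(10, 11, 12))
result <- nearest_rows(test, reference)
result
```

For the toy data from the question this picks the rows (1, 0, 0.5), (9, 9, 9) and (10, 11, 12), the same as the loop version.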

Select rows from two matrices iteratively and perform function in R

So I have a rather complex (at least for me) problem in R.
I want to calculate the distance between the two distributions in each pair, for nearly 10k pairs.
I have a distance function from package philentropy, which takes two vectors x y and calculates the distance between them such as:
d <- distance(x, y, method="desired_method")
Another option is to create a matrix with each row representing a distribution, so that the function will calculate all pairwise distances among all distributions in the matrix:
d <- distance(x, method="desired_method")
I have two correlation matrices a and b with nearly 10k rows each, corresponding to 10k correlation distributions. Both matrices have the same number of rows, and my goal is to compare the first row of matrix a with the first row of matrix b, the second row of a with the second row of b, and so on, iteratively.
I can select each pair of rows and use the first form of distance, or I can merge the two matrices with rbind and compute all pairwise distances with the second form.
The problem is that, with the first approach, I do not know how to write a for loop that takes the nth row of each matrix, performs the distance calculation, and stores the result in a vector.
Additionally, if I perform the second option, I do not want to get all pairwise distances, but just distances corresponding to:
d[i,i+nrow(a)]
And doing so iteratively to generate a corresponding vector of nrow(a) values.
Any help?
If you have two matrices, mat_x and mat_y, each with the same number of rows, then the for loop would be:
n <- nrow(mat_x)  # same as nrow(mat_y)
answer <- vector(mode = 'numeric', length = n)
for (i in seq_len(n)) {
  answer[i] <- distance(mat_x[i, ], mat_y[i, ], method = "desired_method")
}
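The same row-by-row pattern can be written with vapply, which allocates the result vector for you. The sketch below is self-contained, so it uses a made-up stand-in, pair_dist (plain Euclidean distance), in place of the philentropy call; swap in whatever distance you need. It also checks the result against the d[i, i + nrow(a)] entries of the all-pairs matrix mentioned in the question, here built with base R's dist.

```r
# Stand-in for the real distance function (an assumption for illustration):
# Euclidean distance between two vectors.
pair_dist <- function(x, y) sqrt(sum((x - y)^2))

set.seed(1)
a <- matrix(runif(20), nrow = 5)
b <- matrix(runif(20), nrow = 5)

# Distance between row i of a and row i of b, for every i
d <- vapply(seq_len(nrow(a)),
            function(i) pair_dist(a[i, ], b[i, ]),
            numeric(1))

# Cross-check: the same values sit at [i, i + nrow(a)] in the
# all-pairs distance matrix of rbind(a, b)
full <- as.matrix(dist(rbind(a, b)))
idx <- cbind(seq_len(nrow(a)), seq_len(nrow(a)) + nrow(a))
d_check <- unname(full[idx])
```

The all-pairs route computes (2n)^2 distances to keep only n of them, so for 10k rows the row-by-row version is likely the more economical choice.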

Merging Two Matrices

I've done a little bit of digging for this result but most of the questions on here have information in regards to the cbind function, and basic matrix concatenation. What I'm looking to do is a little more complicated.
Let's say, for example, I have an NxM matrix whose first column is a unique identifier for each of the rows (and luckily in this instance is sorted by that identifier). For reasons which are inconsequential to this inquiry, I'm splitting the rows of this matrix into (n_i)xM matrices such that the sum of n_i = N.
I'm intending to run separate analysis on each of these sub-matrices and then combine the data together again with the usage of the unique identifier.
An example:
Let's say I have matrix data which is 10xM. After my split, I'll receive matrices subdata1 and subdata2. If you were to look at the contents of the matrices:
data[,1] = 1:10
subdata1[,1] = c(1,3,4,6,7)
subdata2[,1] = c(2,5,8,9,10)
I then manipulate the columns of subdata1 and subdata2, but preserve the information in the first column. I would like to combine these matrices again such that finaldata[,1] = 1:10, where finaldata is the result of the combination.
I realize now that I could use rbind and then sort the matrix, but for large matrices that is very inefficient.
I know R has some great functions for data management. Is there a workaround for this problem?
I may not fully understand your question, but as an example of general use, I would typically convert the matrices to dataframes and then do something like this:
combi <- rbind(dataframe1, dataframe2)
If you know they are matrices, you can do this with multidimensional arrays:
X <- matrix(1:100, 10,10)
s1 <- X[seq(1, 9,2), ]
s2 <- X[seq(2,10,2), ]
XX <- array(NA, dim=c(2,5,10) )
XX[1, ,] <- s1 #Note two commas, as it's a 3D array
XX[2, ,] <- s2
dim(XX) <- c(10,10)
XX
This will copy each element of s1 and s2 into the appropriate slice of the array, then drop the extra dimension. There's a decent chance that rbind is actually faster, but this way you won't need to re-sort it.
Caveat: this approach needs equal-sized splits whose rows alternate regularly in the original matrix, as in this example.
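For uneven or irregular splits like the one in the question, another possibility (a sketch, not taken from the answers above) is to preallocate the final matrix and write each sub-matrix back into the rows named by its identifier column. This assumes the identifiers are exactly the integers 1..N; if they are not, match() against the sorted identifiers would give the row positions instead. The column names and the toy manipulations are made up for illustration.

```r
# Toy data: first column is the unique identifier 1..N (an assumption)
data <- cbind(id = 1:10, value = (1:10) * 10)

# Irregular split, as in the question
subdata1 <- data[c(1, 3, 4, 6, 7), , drop = FALSE]
subdata2 <- data[c(2, 5, 8, 9, 10), , drop = FALSE]

# Some separate manipulation of each part; the id column is left untouched
subdata1[, "value"] <- subdata1[, "value"] + 1
subdata2[, "value"] <- subdata2[, "value"] * 2

# Preallocate the result, then place each row by its identifier
finaldata <- matrix(NA_real_, nrow = nrow(data), ncol = ncol(data),
                    dimnames = list(NULL, colnames(data)))
finaldata[subdata1[, "id"], ] <- subdata1
finaldata[subdata2[, "id"], ] <- subdata2
finaldata
```

No sort is needed because each row lands directly in its final position.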

dataframe (product) correlations in R

I've got 2 dataframes, each with 150 rows and 10 columns plus column and row IDs. I want to correlate every row in one dataframe with every row in the other (i.e. 150x150 correlations) and plot the distribution of the resulting 22500 values. (Then I want to calculate p-values etc. from the distribution, but that's the next step.)
Frankly, I don't know where to start with this. I can read my data in and see how to correlate vectors or matching slices of two matrices etc., but I can't get a handle on what I'm trying to do here.
set.seed(42)
DF1 <- as.data.frame(matrix(rnorm(1500),150))
DF2 <- as.data.frame(matrix(runif(1500),150))
#transform to matrices for better performance
m1 <- as.matrix(DF1)
m2 <- as.matrix(DF2)
# use outer to get all combinations of row numbers and apply a function to them
# 22500 combinations is small enough to fit into RAM
cors <- outer(seq_len(nrow(DF1)), seq_len(nrow(DF2)),
              # outer needs a vectorized function;
              # Vectorize takes care of that, but is just a hidden loop (slow for huge row numbers)
              FUN = Vectorize(function(i, j) cor(m1[i, ], m2[j, ])))
hist(cors)
You can use cor with two arguments:
cor( t(m1), t(m2) )
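As a quick sanity check of the two-argument form (a sketch reusing the toy data above): cor correlates the columns of its first argument with the columns of its second, so transposing both matrices turns rows into columns, and entry [i, j] of the result is the correlation between row i of m1 and row j of m2.

```r
set.seed(42)
m1 <- matrix(rnorm(1500), 150)
m2 <- matrix(runif(1500), 150)

# All 150 x 150 row-against-row correlations in one call
cors <- cor(t(m1), t(m2))
dim(cors)

# Spot check one entry against the direct calculation
single <- cor(m1[3, ], m2[7, ])
all.equal(cors[3, 7], single)
```

From here hist(cors) plots the distribution of all 22500 values, as in the outer version.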

R: apply() type function for two 2-d arrays

I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function on each column (pixel) of data in this stack. One such stack contains whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status together for each pixel as a string, apply(strstack, 2, FUN) runs the classifier on each pixel, and inside the applied function I split each string again using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
n <- ncol(doy_stack)  # one column per pixel
out <- numeric(n)
for (i in seq_len(n)) {
  out[i] <- snow_free(doy_stack[, i], snow_stack[, i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
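To make the pattern concrete, here is a small self-contained sketch. The per-pixel function first_snow_free is a made-up stand-in for the real classifier (it just returns the earliest day of year without snow), and the two tiny stacks are invented toy data with rows as tiles and columns as pixels. vapply is used instead of sapply so the result type is checked.

```r
# Hypothetical stand-in for the per-pixel classifier:
# earliest day of year on which the pixel has no snow
first_snow_free <- function(doy, snow) {
  free <- doy[!snow]
  if (length(free) == 0) NA_real_ else min(free)
}

# Toy stacks: 3 tiles (rows) x 2 pixels (columns)
doy_stack  <- matrix(c(10, 20, 30,
                       10, 20, 30), nrow = 3)
snow_stack <- matrix(c(TRUE, TRUE,  FALSE,
                       TRUE, FALSE, FALSE), nrow = 3)

# Walk over the pixel columns of both stacks in parallel
out <- vapply(seq_len(ncol(doy_stack)),
              function(i) first_snow_free(doy_stack[, i], snow_stack[, i]),
              numeric(1))
out  # 30 20
```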
I've just encountered the same problem and, if I understood the question correctly, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, determine how operations between the matrices will be performed. If it is row-wise, you need to transpose X and Y then cast to data.frame. This is because a data.frame is a list with columns as list elements. mapply() assumes that you are passing a list. In this example I'll perform correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (i.e. they don't have to be exactly the same dimensions). For instance, one could require a statistical test row-wise but having differing numbers of columns in each matrix.
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (eg ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
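apply can work on higher-dimensional arrays, though not on a list of matrices (it needs an object with a dim attribute). Not sure how your data is set up, but if both stacks have the same dimensions, binding them into a 3-D array and applying over the column margin might be what you are looking for:
stacked <- array(c(doy_stack, snow_stack), dim = c(dim(doy_stack), 2))
apply(stacked, 2, function(x) r_part(x[, 1], x[, 2]))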

Resources