I have a simple array indexing question for multi-dimensional arrays in R. I am doing a lot of simulations that each give a result as a matrix, where the entries are classified into categories. So for example a result looks like
aresult <- array(sample(1:3, 10, replace = TRUE), dim = c(2, 5),
                 dimnames = list(
                   c("prey1", "prey2"),
                   c("predator1", "predator2", "predator3", "predator4", "predator5")))
Now I want to store the results of my experiments in a 3D-matrix, where the first two dimension are the same as in aresult and the third dimension holds the number of experiments that fell into each category. So my arrays of counts should look like
Counts <- array(0, dim = c(2, 5, 3),
                dimnames = list(
                  c("prey1", "prey2"),
                  c("predator1", "predator2", "predator3", "predator4", "predator5"),
                  c("n1", "n2", "n3")))
and after each experiment I want to increment the counts along the third dimension by 1, using the values in aresult as indexes.
How can I do that without using loops?
This sounds like a typical job for matrix indexing. By subsetting Counts with a three-column matrix, where each row specifies the indices of one element, we are free to extract and increment any elements we like.
# Create a map of all combinations of indices in the first two dimensions
i <- expand.grid(prey=1:2, predator=1:5)
# Add the indices of the third dimension
i <- as.matrix(cbind(i, as.vector(aresult)))
# Extract and increment
Counts[i] <- Counts[i] + 1
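As a quick, self-contained sanity check of this approach (dimnames omitted for brevity; the seed is arbitrary):

```r
# Simulated experiment: one category (1..3) per prey/predator cell
set.seed(42)
aresult <- array(sample(1:3, 10, replace = TRUE), dim = c(2, 5))
Counts <- array(0, dim = c(2, 5, 3))

# One row per cell: (prey index, predator index, observed category)
i <- cbind(as.matrix(expand.grid(1:2, 1:5)), as.vector(aresult))
Counts[i] <- Counts[i] + 1

sum(Counts)                            # 10: every cell was counted exactly once
all(apply(Counts, c(1, 2), sum) == 1)  # TRUE
```

Each repetition of the experiment only needs the last two lines rerun with the new aresult; the counts accumulate in Counts.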
This is a very trivial example, but I do have real data where I am experiencing this particular problem. For simplicity, let's say I have a matrix in R called x, 20 rows x 3 columns
x <- matrix(0, nrow=20, ncol=3)
Then I take a subset of the matrix using an index i, which can be a single integer (e.g. i <- 4) or several integers (e.g. i <- 4:7), depending on the algorithm iteration: in one iteration i may be a single integer, and in the next it may be multiple integers. I'd like to know the size of the resulting subset
xsubset <- x[i,]
Then I use the dim command
dim(xsubset)
and I get the result: NULL
What do I have to do to determine the number of rows and columns in xsubset?
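The usual culprit here is that subsetting with a single row index drops the result down to a plain vector, which has no dim attribute; passing drop = FALSE keeps the matrix structure regardless of how many rows are selected:

```r
x <- matrix(0, nrow = 20, ncol = 3)

dim(x[4, ])                 # NULL: a single row is dropped to a vector
dim(x[4, , drop = FALSE])   # 1 3: still a matrix

i <- 4:7
dim(x[i, , drop = FALSE])   # 4 3
```

With drop = FALSE, nrow() and ncol() then work for both the single-integer and multi-integer cases.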
I have not been able to get a solution to my problem yet despite discussing it with several people so hopefully the community here can be of help.
The data is stored as a list of arrays. Each component in the list represent data grouped by a specific factor. The arrays have three dimensions. The first dimension represents time, the second dimension represents number of constituents, and the third dimension represents data points. So in the example data below each constituent (second dimension) has three data points (third dimension) per time unit (first dimension).
The first and third dimensions have a fixed nrow (time) and ncol (data points), while the second dimension varies for each group (component in the list), which is why I stored the arrays in a list.
So the data would be structured as below.
data.list <- vector("list", 3)
numb <- c(2,3,4)
data.list[[1]] <- array(1:(numb[1] * 5 * 3), dim = c(5, numb[1], 3))
data.list[[2]] <- array(1:(numb[2] * 5 * 3), dim = c(5, numb[2], 3))
data.list[[3]] <- array(1:(numb[3] * 5 * 3), dim = c(5, numb[3], 3))
If I wanted to calculate the mean over time for the first data point across all constituents in group 1 (given it is a standalone array, say group1.array, rather than a list component) I would just do:
apply(group1.array[, , 1], 1, mean)
Now I want to apply mean over rows (time) across all groups (components in the list) and across constituents (second dimension in the arrays) - but I cannot find a solution.
I have tried to use sapply because essentially I would like a vector as the output, with length corresponding to nrow of the arrays (essentially the number of time periods in this problem). However, sapply applies the function to each component and returns a vector with the length of the list (at least that is what happened for me).
Can anybody see a good solution to this problem?
If not, is the fundamental problem then, that the data might be stored the wrong way for the type of computation I want to make?
I still don't understand OP, but this will generalize and slightly simplify the formula in the comments:
rowMeans(do.call(cbind, lapply(data.list, function(x) x[,,1])))
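Applied to the example data above (reconstructed here so the snippet is self-contained), this yields one mean per time unit, i.e. a vector of length nrow:

```r
data.list <- list(
  array(1:(2 * 5 * 3), dim = c(5, 2, 3)),
  array(1:(3 * 5 * 3), dim = c(5, 3, 3)),
  array(1:(4 * 5 * 3), dim = c(5, 4, 3))
)

# Bind the first data point of every group column-wise (5 x 9 matrix),
# then average across all constituents of all groups, row by row
res <- rowMeans(do.call(cbind, lapply(data.list, function(x) x[, , 1])))
length(res)  # 5, one mean per time period
```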
How can I improve the speed of following codes?
for (i in 1:nrow(training)) {
  score[training[i, 1], training[i, 2], training[i, 4]] <- training[i, 3]
}
training is a matrix with four columns. I just want to build an array in which the value at index (training[i, 1], training[i, 2], training[i, 4]) is training[i, 3], as in the loop above.
Thanks!
You can index using a matrix. Here is the relevant part of the documentation for "[" (see ?"["):
A third form of indexing is via a numeric matrix with the one
column for each dimension: each row of the index matrix then
selects a single element of the array, and the result is a vector.
So in your case, the for loop can be replaced with:
score[training[, c(1, 2, 4)]] <- training[, 3]
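A small check that the one-liner reproduces the loop, with made-up dimensions (score must be pre-allocated either way; the names here are just for illustration):

```r
set.seed(1)
# Columns 1, 2 and 4 are indices in 1..4; column 3 holds the values
training <- cbind(sample(1:4, 10, TRUE), sample(1:4, 10, TRUE),
                  runif(10), sample(1:4, 10, TRUE))

score1 <- array(0, dim = c(4, 4, 4))
for (i in 1:nrow(training)) {
  score1[training[i, 1], training[i, 2], training[i, 4]] <- training[i, 3]
}

score2 <- array(0, dim = c(4, 4, 4))
score2[training[, c(1, 2, 4)]] <- training[, 3]

identical(score1, score2)  # TRUE
```

Note that with duplicated index triples both versions keep the last value, since matrix-indexed assignment processes rows in order.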
I've got a data.frame that contains a field like this:
:6:Description_C
:3:Description_A:2:Description_B:1:Description_C
:2:Description_C:1:Description_B:1:Description_A:1:Description_D:1:Description_E
:3:Description_B:3:Description_A
The number in front, surrounded by colons, is the number of times, out of a total of 6, that the Description is seen in that entry of the data.frame. A :6:Description_X means that all 6 counts go to that description; otherwise the total is split into several counts, listed one after another.
I would like to turn this field into a key/value hash of number of counts for each description, so that I can then do a barplot of the total proportions for all counts, but also in a way that I can plot these proportions in combination with the other factors in the data.frame.
EDIT: looking a bit at the doc for colsplit, probably what people will tell me is that I need a new column for each description, since I only have about 8 descriptions in total. Still, haven't figured out how to do it.
How can I do that in R?
I'm not sure what structure you wanted for the "key:value hash" but this will extract the strings and their associated numeric reps:
inp <- readLines(textConnection(
":6:Description_C
:3:Description_A:2:Description_B:1:Description_C
:2:Description_C:1:Description_B:1:Description_A:1:Description_D:1:Description_E
:3:Description_B:3:Description_A")
)
inp2 <- sapply( strsplit(inp, ":"), "[", -1) # drop the leading empty strings
reps <- lapply(inp2, function(x) as.numeric(x[ seq( 1, length(x) , by=2)]))
values <- lapply(inp2, function(x) x[ seq( 2, length(x) , by=2)])
lapply(reps, barplot) # Probably needs more work, but this demonstrates feasibility
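If the goal really is a key/value structure, one option (building on the reps and values above; reconstructed here so it runs standalone) is a list of named count vectors via setNames:

```r
inp <- c(":6:Description_C",
         ":3:Description_A:2:Description_B:1:Description_C")

inp2 <- sapply(strsplit(inp, ":"), "[", -1)  # drop the leading empty strings
reps <- lapply(inp2, function(x) as.numeric(x[seq(1, length(x), by = 2)]))
values <- lapply(inp2, function(x) x[seq(2, length(x), by = 2)])

# Pair each count with its description name, one named vector per row
hashes <- Map(setNames, reps, values)
hashes[[1]]  # named vector: Description_C = 6
```

These named vectors can then be fed to barplot directly, which labels the bars with the description names.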
I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function on each column (pixel) of data in this stack. One such stack contains whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status together for each pixel as a string, apply(strstack, 2, FUN) runs the classifer on each pixel, and inside the apply function, I'm exploding each string using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
n <- ncol(doy_stack)  # one result per pixel (column)
out <- numeric(n)
for (i in 1:n) {
  out[i] <- snow_free(doy_stack[, i], snow_stack[, i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
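To make the pattern concrete, here is a toy stand-in for the classifier (snow_free and the tiny stacks below are invented for illustration; the real version would call rpart on each pixel's columns):

```r
# Toy stand-in for the per-pixel classifier: first snow-free day of year
snow_free <- function(doy, snow) min(doy[snow == 0])

# Rows are tiles, columns are pixels (as in the question)
doy_stack  <- rbind(rep(10, 4), rep(20, 4), rep(30, 4))
snow_stack <- rbind(c(1, 1, 0, 0),
                    c(1, 0, 0, 1),
                    c(0, 0, 0, 0))

n <- ncol(doy_stack)
result <- sapply(1:n, function(i) snow_free(doy_stack[, i], snow_stack[, i]))
result  # 30 20 10 10
```

No string pasting or strsplit is needed: each call receives the two aligned columns directly.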
I've just encountered the same problem and, if I understood the question correctly, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, determine how operations between the matrices will be performed. If it is row-wise, you need to transpose X and Y then cast to data.frame. This is because a data.frame is a list with columns as list elements. mapply() assumes that you are passing a list. In this example I'll perform correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (i.e. they don't have to be exactly the same dimensions). For instance, one could require a statistical test row-wise but having differing numbers of columns in each matrix.
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (eg ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
apply can work over higher-dimensional margins, but it won't accept a list. If you first bind the two matrices into a single 3-D array, something like this might be what you are looking for:
stacks <- array(c(doy_stack, snow_stack), dim = c(dim(doy_stack), 2))
apply(stacks, c(1, 2), function(x) r_part(x[1], x[2]))