Coding zero values into a data frame - r

I am working in R with a series of data values that have an x position (distance along a transect) and a z position (distance from the ground at a given x position). There is not a measurement at every (x, z) coordinate, and to do the analysis I need to perform, I have to code a 0 wherever a value is missing. Here is a short code example; the real data is usually 14,000-20,000 rows. In Matlab we solve this issue by creating an empty matrix and filling it. I need an x, z matrix dimensioned by max(x) and max(z). In the sample below, max(z) is 8 and max(x) is 4, so I need a 4 x 8 matrix where a 0 is entered wherever no value is present. I'm just not sure of the best, most efficient way to do this in R.
x <- c(1,1,1,1,1,2,2,3,3,4,4,4)
z <- c(1,4,5,6,7,1,4,2,8,1,2,5)
value <- rep(9, 12)
data.frame(x,z, value)
Thanks ahead of time!

In R you would do it much the same way as you describe in Matlab. First, create a matrix with all zeroes:
df <- data.frame(x, z, value)
mat <- matrix(0, nrow = max(df$x), ncol = max(df$z))  # 4 x 8 for this sample
Then the tricky part: index the matrix with a two-column matrix of positions and assign all the values in one step:
mat[cbind(df$x, df$z)] <- df$value
What the cbind is doing is creating a two-column matrix whose rows are (row, column) positions in mat; R's matrix indexing uses each row to identify one element, and the corresponding value is assigned to it.
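A quick check on the sample data (using the mat just filled):
mat[1, 4]  # 9: there is a measurement at (x=1, z=4)
mat[2, 3]  # 0: no measurement at that coordinate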

Related

Applying a specific formula to a large matrix/data frame in R, in order to normalize my data

I have a large "distance matrix" (actually a 170x170 data frame in R), for example:
            A           B           C
A 0.198395022 0.314012433  0.32704998
B 0.314012433 0.262514533 0.318539233
C  0.32704998 0.318539233 0.211224133
I am trying to apply a specific formula (which I already have) to bring this variation into the scale of 0-1, as required for my statistical modeling. I am expecting to obtain something like this across the whole data frame (expected output, when applying the formula):
            A           B           C
A           1 0.846050953 0.825897603
B 0.846050953           1 0.822548469
C 0.825897603 0.822548469           1
So, I need to re-calculate each cell relative to the respective diagonal values by applying this formula in R:
B[i,j] = H[i,j] / ((H[i,i] + H[j,j]) / 2)
where B is the matrix of normalized values, H is my matrix/data frame, and i and j are the rows and columns of my matrix/data frame, respectively. This normalization procedure systematically replaces the diagonal (i = j) with 1.
Thanks!
You can make a loop in order to replace each value according to your formula. One caveat: keep a copy of the original values, because if you overwrite the data frame in place, the diagonal entries become 1 as the loop runs and later cells get divided by the wrong denominators.
df <- data.frame(a = rnorm(3, 20, 15), b = rnorm(3, 10, 5), c = rnorm(3, 200, 100))
H <- as.matrix(df)  # original values; H[i,i] stays fixed during the loop
B <- H              # will hold the normalized values
for (i in 1:nrow(H)) {
  for (j in 1:ncol(H)) {
    B[i, j] <- H[i, j] / ((H[i, i] + H[j, j]) / 2)
  }
}
B  # check out the result: the diagonal is now 1
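If the double loop gets slow on a full 170x170 frame, a vectorized equivalent (a sketch reusing the H above) produces the same B without explicit loops:
denom <- outer(diag(H), diag(H), "+") / 2  # (H[i,i] + H[j,j]) / 2 for every i, j
B <- H / denom                             # the diagonal again becomes 1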

Bootstrapping two datasets in R

I have a dataframe as follows:
set.seed(1)
X <- data.frame(matrix(rnorm(2000), nrow=10))
where the rows represent the genes and the columns are the genotypes.
For each round of bootstrapping (n=1000), genotypes should be selected at random without replacement from this dataset (X) to form two groups (X' with 5 genotypes and Y' with 5 genotypes). In the end I will have a thousand such datasets X' and Y', each containing 5 random genotypes from the full expression dataset.
I tried using replicate and apply, but it did not work.
B <- 1000
replicate(B, apply(X, 2, sample, replace = FALSE))
I think it might make more sense for you to first select the column numbers, 10 from 200 without replacement (five for each X' and Y'):
colnums_boot <- replicate(1000, sample.int(200, 10))
From there, as you evaluate each iteration, i from 1 to 1000, you can grab
Xprime <- X[,colnums_boot[1:5,i]]
Yprime <- X[,colnums_boot[6:10,i]]
This saves you from making a 3-dimensional array (the generalization of a matrix in R).
Also, if speed is a concern, I think it would be much faster to leave X as a matrix instead of a data frame. Maybe someone else can comment on that.
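One quick way to check that (a sketch; timings vary by machine and data size):
Z <- as.matrix(X)
system.time(for (i in 1:1000) X[, colnums_boot[1:5, i]])  # data-frame indexing
system.time(for (i in 1:1000) Z[, colnums_boot[1:5, i]])  # matrix indexing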
EDIT: Here's a way to grab them all up-front (in a pair of three-dimensional arrays):
Z <- as.matrix(X)
Xprimes <- array(NA, dim = c(10, 5, 1000))
Xprimes[] <- Z[, colnums_boot[1:5, ]]
Yprimes <- array(NA, dim = c(10, 5, 1000))
Yprimes[] <- Z[, colnums_boot[6:10, ]]
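A quick sanity check (using the objects above): each slab of the arrays should match the corresponding per-iteration extraction.
all(Xprimes[, , 1] == Z[, colnums_boot[1:5, 1]])   # TRUE
all(Yprimes[, , 7] == Z[, colnums_boot[6:10, 7]])  # TRUE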

Reverse of aggregate / by?

I have a question and hope that some of you can help me. The issue is this: given a data frame that includes a vector y of length n and a factor f with k different levels, I want to add a new variable to the data frame by mapping a vector z of length k onto the rows via f.
Example:
df <- data.frame(y=rnorm(12), f=rep(1:3, length.out=12))
z <- c(-1,0,5)
Note that my real z has been constructed to correspond to the unique factor levels, which is why length(z) == length(unique(df$f)). I now want to create a vector of length n=12 that contains the value of z corresponding to each row's factor level f. (Note: my real factor values are not ordered like in the above example, so just repeating the vector z won't work.)
Now, an obvious solution would be to create a vector f outside the data frame, combine it with z in a new data frame, and then use merge. For instance,
newdf <- data.frame(z=z, f=c(1,2,3))
df <- merge(df, newdf, by="f")
However, I need to repeat this procedure several thousand times, and this merge solution seems like shooting at microbes with cannons. Hence my question: there almost surely is an easier and more efficient way to do this, but I just don't know how. Could anyone point me in the right direction? I am looking for something like the "inverse" of aggregate or by.
Assuming that the values in z correspond to the levels of f:
df <- data.frame(y = rnorm(12), f = factor(sample(c("a","b","c"), 12, replace = TRUE)))
z <- c(-1, 0, 5)
df$newz <- z[df$f]
(The explicit factor() matters in R 4.0 and later, where data.frame() no longer converts character columns to factors by default.)
In case this is not clear: this works because a factor is stored under the covers as an integer vector of level codes. When you index z with df$f you are effectively indexing with those underlying integers (1 for the first level, 2 for the second, and so on), so each element picks out the z value for that row's factor level. This also means z must be ordered to match levels(df$f).
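An equivalent that is a bit more explicit (a sketch; this variant also works when f is a plain character column rather than a factor) indexes a named vector by name:
zmap <- c(a = -1, b = 0, c = 5)  # names must match the levels of f
df$newz <- unname(zmap[as.character(df$f)])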

converting irregular grid to regular grid

I have a set of observations on an irregular grid. I want to have them on a regular grid with a resolution of 5. This is an example:
d <- data.frame(x=runif(1e3, 0, 30), y=runif(1e3, 0, 30), z=runif(1e3, 0, 30))
## interpolate xy grid to change irregular grid to regular
library(akima)
d2 <- with(d, interp(x, y, z, xo = seq(0, 30, length = 500),
                     yo = seq(0, 30, length = 500), duplicate = "mean"))
How can I get d2 into the SpatialPixelsDataFrame class, which has three columns: the coordinates and the interpolated values?
You can use code like this (thanks to the comment by @hadley):
d3 <- data.frame(x = d2$x[row(d2$z)],
                 y = d2$y[col(d2$z)],
                 z = as.vector(d2$z))
The idea here is that a matrix in R is just a vector with a bit of extra information about its dimensions. The as.vector call drops that information, turning the 500x500 matrix into a linear vector of length 500*500 = 250000. The subscript operator [ does the same, so although row and col each return a matrix, those are treated as linear vectors as well. So in total you have three matrices, turn them all into linear vectors in the same order, use two of them to index the x and y vectors, and combine the results into a single data frame.
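A tiny demonstration of that linearization (a toy 2x3 matrix for illustration):
m <- matrix(1:6, nrow = 2)
as.vector(row(m))  # 1 2 1 2 1 2: row indices in column-major order
as.vector(col(m))  # 1 1 2 2 3 3: column indices in column-major order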
My original solution didn't use row and col, but instead rep to formulate the x and y columns. It is a bit more difficult to understand and remember, but might be a bit more efficient, and give you some insight useful for more difficult applications.
d3 <- data.frame(x = rep(d2$x, times = 500),
                 y = rep(d2$y, each = 500),
                 z = as.vector(d2$z))
For this formulation, you have to know that a matrix in R is stored in column-major order. The second element of the linearized vector therefore is d2$z[2,1], so the row number changes between consecutive elements, while the column number stays the same for a whole column. Consequently, you want to repeat the x vector as a whole, but repeat each element of y by itself. That's what the two rep calls do.
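To get from d3 to the SpatialPixelsDataFrame class the question asks about, a minimal sketch with the sp package (assuming the d3 built above) is:
library(sp)
coordinates(d3) <- ~ x + y  # promotes the data frame to a SpatialPointsDataFrame
gridded(d3) <- TRUE         # snaps the points to a regular grid: a SpatialPixelsDataFrame
Since interp() returned values on a regular grid, gridded() accepts the coordinates as-is.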

R: apply() type function for two 2-d arrays

I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function to each column (pixel) of data in this stack. One stack records whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow-free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status together for each pixel as a string, apply(strstack, 2, FUN) runs the classifier on each pixel, and inside the apply function, I'm exploding each string using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
n <- ncol(doy_stack)  # one column per pixel
out <- numeric(n)
for (i in 1:n) {
  out[i] <- snow_free(doy_stack[, i], snow_stack[, i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
I've just encountered the same problem and, if I understood the question correctly, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, determine how operations between the matrices will be performed. If it is row-wise, you need to transpose X and Y and then cast them to data.frames. This is because a data.frame is a list with the columns as its elements, and mapply() iterates over the elements of the lists you pass it. In this example I'll perform the correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (they don't have to be exactly the same). For instance, one could run a statistical test row-wise on matrices with differing numbers of columns, as in the sketch below.
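For illustration (a hypothetical example; Y2 and its dimensions are made up here), a row-wise t-test between a 10x10 and a 10x8 matrix:
set.seed(3)
Y2 <- matrix(runif(80), 10, 8)
p.row <- mapply(function(x, y) t.test(x, y)$p.value,
                as.data.frame(t(X)), as.data.frame(t(Y2)))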
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (e.g. ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
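A minimal sketch of that idea (hypothetical object names; it assumes the two matrices can be wrapped as rasters and that r_part accepts two numeric vectors):
library(raster)
doy_ras  <- raster(doy_stack)   # a plain matrix can be wrapped as a RasterLayer
snow_ras <- raster(snow_stack)
out_ras  <- overlay(doy_ras, snow_ras, fun = function(d, s) r_part(d, s))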
apply can work on higher-dimensional arrays, so you can bind the two matrices into a 3-d array and apply over the first two margins. Not sure how your data is set up, but something like this might be what you are looking for:
arr <- array(c(doy_stack, snow_stack), dim = c(dim(doy_stack), 2))
apply(arr, c(1, 2), function(x) r_part(x[1], x[2]))
