problem with missing values for calculating the median - r

I'm having a problem handling NAs when calculating the median across multiple matrices.
This is an example of the code and data I'm working on:
#Data example
m1 = matrix(c(2, 4, 3, 1),nrow=2, ncol=2, byrow = TRUE)
m2 = matrix(c(NA, 5, 7, 9),nrow=2, ncol=2, byrow = TRUE)
m3 = matrix(c(NA, 8, 10, 14),nrow=2, ncol=2, byrow = TRUE)
#Median calculation
apply(abind::abind(m1, m2, m3, along = 3), 1:2, median)
[,1] [,2]
[1,] NA 5
[2,] 7 9
As expected, the function doesn't return a value for cells which contain NAs.
The problem is that if I replace NAs with 0, I get an output like this:
#Data example
m1 = matrix(c(2, 4, 3, 1),nrow=2, ncol=2, byrow = TRUE)
m2 = matrix(c(0, 5, 7, 9),nrow=2, ncol=2, byrow = TRUE)
m3 = matrix(c(0, 8, 10, 14),nrow=2, ncol=2, byrow = TRUE)
#Median calculation
apply(abind::abind(m1, m2, m3, along = 3), 1:2, median)
[,1] [,2]
[1,] 0 5
[2,] 7 9
Instead, I'm trying to get an output where NA cells are simply skipped, so that only actual values are taken into consideration. As in the example, for cells with NA, NA, 2 I would expect 2 as the result, while (outside the example) for cells with NA, 2, 5 I would expect 3.5.
[,1] [,2]
[1,] 2 5
[2,] 7 9
Do you have an idea of how I could get this result? Any suggestion would be appreciated, thanks.

Just pass na.rm = TRUE to apply, which forwards any extra arguments to median:
apply(abind::abind(m1, m2, m3, along = 3), 1:2, median, na.rm = TRUE)
Output:
[,1] [,2]
[1,] 2 5
[2,] 7 9
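For anyone who wants to check this without installing abind, here is a minimal sketch of the same idea in base R. It assumes all matrices have the same dimensions; the names arr and res are only illustrative.

```r
# Same data as in the question
m1 <- matrix(c(2, 4, 3, 1),    nrow = 2, ncol = 2, byrow = TRUE)
m2 <- matrix(c(NA, 5, 7, 9),   nrow = 2, ncol = 2, byrow = TRUE)
m3 <- matrix(c(NA, 8, 10, 14), nrow = 2, ncol = 2, byrow = TRUE)

# Stack the matrices into a 2 x 2 x 3 array (base-R stand-in for abind::abind)
arr <- array(c(m1, m2, m3), dim = c(2, 2, 3))

# apply() hands na.rm = TRUE on to median() for every cell
res <- apply(arr, 1:2, median, na.rm = TRUE)
res

# The other case from the question: NA, 2, 5 gives 3.5
median(c(NA, 2, 5), na.rm = TRUE)
```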

Perhaps you should drop the NAs first? Try adding na.rm = TRUE.

Related

issue with loop increments of lower than 1 when simulating data

I'm trying to simulate data using a for loop and storing it in a matrix with the following code:
m <- matrix(nrow = 500 , ncol = 7)
for(i in seq(from = 1, to = 4, by = 0.5)){
a <- 1 * i + rnorm(n = 500, mean = 0, sd = 1)
m[, i] <- a
}
But instead of giving me 7 columns with means of roughly 1, 1.5, 2, 2.5, 3, 3.5 and 4, matrix m contains 4 columns with means of roughly 1.5, 2.5, 3.5 and 4, and 3 columns of NA values.
If I change the increment to 1 and run the code below, everything behaves as expected, so the issue seems to be with the increments, but I can't figure out what I should do differently. Help would be most appreciated.
m <- matrix(nrow = 500 , ncol = 7)
for(i in seq(from = 1, to = 7, by = 1)){
a <- 1 * i + rnorm(n = 500, mean = 0, sd = 1)
m[, i] <- a
}
Column indices must be integers. In your case, you try to select column 1.5, which is not possible. You can fix this by mapping i to an integer column index with (i * 2) - 1, which turns 1, 1.5, ..., 4 into 1, 2, ..., 7:
# reduce number of rows for showcase
n <- 100
m <- matrix(nrow = n , ncol = 7)
for(i in seq(from = 1, to = 4, by = 0.5)){
# NOTE: 1*i does not change anything
a <- 1*i + rnorm(n = n, mean = 0, sd = 1)
# make column index integerish
m[, (i * 2) - 1] <- a
}
m[1:5, ]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 1.15699467 0.8917952 1.999899 2.330557 4.502607 4.469957 5.687460
#> [2,] -1.13634309 1.5394771 1.700148 1.669329 2.124019 3.472836 3.513351
#> [3,] 2.08584731 1.0591743 2.866186 3.192953 3.984286 3.593902 3.983265
#> [4,] 0.02211767 2.2222376 2.055832 2.927851 2.846376 3.411725 3.742966
#> [5,] 0.49167319 2.2244472 2.190050 3.525931 2.841522 5.722172 4.797856
colMeans(m)
#> [1] 0.8537568 1.6805235 1.9907633 2.6434843 2.8651140 3.5499583 3.9757984
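Another way to avoid doing arithmetic on the index is to loop over positions and look up the mean separately. This is only a sketch; the vector name means is made up for illustration, and set.seed is there just to make the run repeatable.

```r
set.seed(42)

# The target means, kept apart from the column index
means <- seq(from = 1, to = 4, by = 0.5)
m <- matrix(nrow = 500, ncol = length(means))

# j is always a whole number, so m[, j] addresses exactly one column
for (j in seq_along(means)) {
  m[, j] <- rnorm(n = 500, mean = means[j], sd = 1)
}

colMeans(m)
```

This also keeps working if the step size changes, since the number of columns is derived from length(means).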
When you use rnorm, it actually allows vectorized input for the mean value, so you can try the code below (but you should use matrix to reshape the output into the desired dimensions of your output matrix):
nr <- 500
nc <- 7
m <- t(matrix(rnorm(nr * nc, seq(1, 4, 0.5), 1), nc, nr))
where you can see, for example
> m[1:5, ]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 3.2157776 0.3805689 0.7550255 2.508356 3.567479 2.597378 4.122201
[2,] 0.8634009 0.4887092 2.5655513 1.710756 2.377790 3.733045 4.199812
[3,] -0.1786419 2.4471083 1.2138140 3.090687 2.763694 3.471715 4.676037
[4,] 1.2492511 2.3480447 2.2180039 1.965656 1.505342 3.832380 4.086075
[5,] -0.1301543 1.7463687 1.2467769 2.649525 4.795677 2.606623 4.318468
> colMeans(m)
[1] 0.901146 1.476423 1.900147 2.567463 2.996918 3.468140 4.025929
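To see why the transpose is needed, it may help to look at how rnorm recycles the mean vector. The sketch below uses sd = 0 purely as a trick so that every draw equals its mean; the small sizes and the names mu, m_t and m_by are illustrative only.

```r
nr <- 4
nc <- 3
mu <- c(1, 2, 3)

# With sd = 0 each draw equals its mean, which makes the recycling visible:
# the means cycle 1, 2, 3, 1, 2, 3, ...
x <- rnorm(nr * nc, mean = mu, sd = 0)
x

# Filling an nc x nr matrix column by column and transposing ...
m_t <- t(matrix(x, nc, nr))
# ... is the same as filling an nr x nc matrix row by row
m_by <- matrix(x, nrow = nr, ncol = nc, byrow = TRUE)

identical(m_t, m_by)
colMeans(m_by)
```

Either form puts all draws that share a mean into the same column.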
You're using i as a column index, but i takes non-integer values. Only whole numbers index a matrix/df properly. When i is, say, 1.5 and you use it in the m[, i] expression, it gets truncated to the integer 1, so the first 2 runs of your loop write to the same column and the second overwrites the first (likewise the 3rd and 4th, etc.).
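A quick way to convince yourself of the truncation is to index with fractional values directly. This is just a small demonstration; x and m here are throwaway objects, not the ones from the question.

```r
x <- c(10, 20, 30)
x[1.5]   # same element as x[1]
x[2.9]   # same element as x[2]: the index is truncated, not rounded

m <- matrix(1:6, nrow = 2)
identical(m[, 1.5], m[, 1])

# Columns that the loop values 1, 1.5, ..., 4 actually end up addressing
as.integer(seq(from = 1, to = 4, by = 0.5))
```

The last line shows why only four columns get filled and why each of the first three ends up holding the later of its two draws.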
You could simply use your second code and replace 1*i with 0.5 + 0.5*i:
m <- matrix(nrow = 5000 , ncol = 7)
for(i in seq(from = 1, to = 7, by = 1)){
a <- 0.5 + 0.5*i + rnorm(n = 5000, mean = 0, sd = 1)
m[,i] <- a
}
However, it may be better to use the params of the rnorm function to generate values with a specified mean/sd: currently, you are drawing from a normal distribution centered around 0 then shifting it sideways; you could simply tell it to use the mean you actually want.
m <- matrix(nrow = 5000 , ncol = 7)
for(i in seq(from = 1, to = 7, by = 1)){
m[,i] <- rnorm(n = 5000, mean = 0.5 + 0.5*i, sd = 1)
}

Creating 10 categorical and 10 continuous random variables and save them as a data frame

I would like to create a data frame with 10 categorical and 10 continuous random variables. I can do it using the following loop.
p_val=rbeta(10,1,1) #10 probabilities
n=20
library(truncnorm)
mu_val=rtruncnorm(length(p_val),0,Inf, mean = 100, sd=5)#rnorm(length(p))
d_mat_cat=matrix(NA, nrow = n, ncol = length(p_val))
d_mat_cont= matrix(NA, nrow = n, ncol = length(p_val))
for ( j in 1:length(p_val)){
d_mat_cat[,j]=rbinom(n,1,p_val[j]) #Binary RV
d_mat_cont[,j]=rnorm(n,mu_val[j]) #Cont. RV
}
d_mat=cbind(d_mat_cat, d_mat_cont)
Any alternative options are appreciated.
rbinom is vectorized over prob, and rnorm is vectorized over mean, so you can use this:
cbind(
matrix(rbinom(n * length(p_val), size = 1, prob = p_val),
ncol = length(p_val), byrow = TRUE),
matrix(rnorm(n * length(mu_val), mean = mu_val),
ncol = length(mu_val), byrow = TRUE)
)
We can be a little clever with rep to make the call much cleaner:
p_val = c(0, 0.5, 1)
mu_val = c(1, 10, 100)
n = 4
##
matrix(
c(
rbinom(n * length(p_val), size = 1, prob = rep(p_val, each = n)),
rnorm(n * length(mu_val), mean = rep(mu_val, each = n))
),
nrow = n
)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 1 1 1.1962718 9.373595 100.1739
# [2,] 0 0 1 -0.1854631 9.574706 100.0725
# [3,] 0 1 1 3.4873697 9.447363 100.1345
# [4,] 0 1 1 2.8467450 9.700975 101.3178
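The reason each = n lines up with the columns is that matrix fills column by column by default. Here is a small sketch using the same toy probabilities, where 0 and 1 make the result predictable; probs, draws and b are illustrative names.

```r
n <- 4
p_val <- c(0, 0.5, 1)

# Each probability is repeated n times in a row: 0 0 0 0 0.5 0.5 0.5 0.5 1 1 1 1
probs <- rep(p_val, each = n)
probs

# One draw per probability, then column-major fill: column j uses p_val[j]
draws <- rbinom(n * length(p_val), size = 1, prob = probs)
b <- matrix(draws, nrow = n)

# Column 1 (prob 0) is all zeros and column 3 (prob 1) is all ones
b
```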
You can try using sapply to run rbinom and rnorm and cbind the data.
cbind(sapply(p_val, rbinom, n = n, size = 1), sapply(mu_val, rnorm, n = n))
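Since the title asks for a data frame, here is one possible way to finish the job. It is a sketch under a few assumptions: rnorm stands in for rtruncnorm so that no extra package is needed, the column names cat1..cat10 and cont1..cont10 are invented, and the binary columns are turned into factors so that they are treated as categorical.

```r
set.seed(1)
n <- 20
p_val  <- rbeta(10, 1, 1)                 # 10 probabilities
mu_val <- rnorm(10, mean = 100, sd = 5)   # assumption: stand-in for rtruncnorm

# One column per probability / mean
cat_part  <- sapply(p_val,  function(p)  rbinom(n, size = 1, prob = p))
cont_part <- sapply(mu_val, function(mu) rnorm(n, mean = mu))

# Combine into a data frame and give the columns readable (made-up) names
d <- data.frame(cat_part, cont_part)
cat_names  <- paste0("cat",  seq_along(p_val))
cont_names <- paste0("cont", seq_along(mu_val))
names(d) <- c(cat_names, cont_names)

# Store the 0/1 columns as factors so they count as categorical
d[cat_names] <- lapply(d[cat_names], factor, levels = c(0, 1))

str(d)
```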

R: fill an empty matrix with lists and keep row and column names

I have an empty matrix of the following form:
Empty_Matrix = matrix( NA ,nrow = 3, ncol = 2, byrow = TRUE, dimnames = list(c("a","b","c"),c("aa","bb")) )
aa bb
a NA NA
b NA NA
c NA NA
I would like to fill each element of this matrix with another matrix, e.g.:
Empty_Matrix[,] = list(matrix(0,nrow = 4, ncol=1))
This actually works, although I lose the structure of the row and column names, as shown in the following console screenshot:
In contrast, if I use the following line of code
Empty_Matrix = matrix( list(matrix(0,nrow = 4, ncol=2)) ,nrow = 3, ncol = 2, byrow = TRUE, dimnames = list(c("a","b","c"),c("aa","bb")))
the desired output is retrieved:
My question is if it is possible to use a similar line of code such as
Empty_Matrix[,] = list(matrix(0,nrow = 4, ncol=1))
(where Empty_Matrix has already been created with NA elements) and have the console output of the second image.
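One possible workaround, offered only as a sketch: it assumes you are free to create the empty matrix as a list matrix rather than with NA elements, which is a change from the setup in the question. Assigning with empty brackets then replaces the contents and leaves dim and dimnames alone.

```r
# Assumption: pre-allocate with list elements instead of NA
Empty_Matrix <- matrix(vector("list", 6), nrow = 3, ncol = 2,
                       dimnames = list(c("a", "b", "c"), c("aa", "bb")))

# Empty brackets replace every element; the single list item is recycled
Empty_Matrix[] <- list(matrix(0, nrow = 4, ncol = 1))

dim(Empty_Matrix)
dimnames(Empty_Matrix)

# Each cell now holds its own 4 x 1 matrix
Empty_Matrix[["b", "bb"]]
```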

Use purrr to get quantile of corresponding matrix entries in a list

I have the following reprex list of 10 sample matrices:
# Sample of 10 2*2 matrices
z1 <- matrix(101:104, nrow = 2, ncol = 2)
z2 <- matrix(201:204, nrow = 2, ncol = 2)
z3 <- matrix(301:304, nrow = 2, ncol = 2)
z4 <- matrix(401:404, nrow = 2, ncol = 2)
z5 <- matrix(501:504, nrow = 2, ncol = 2)
z6 <- matrix(601:604, nrow = 2, ncol = 2)
z7 <- matrix(701:704, nrow = 2, ncol = 2)
z8 <- matrix(801:804, nrow = 2, ncol = 2)
z9 <- matrix(901:904, nrow = 2, ncol = 2)
z10 <- matrix(1001:1004, nrow = 2, ncol = 2)
# Combine all matrices into a single list
za <- list(z1, z2, z3, z4, z5, z6, z7, z8, z9, z10)
What we would like is to take za as an input and obtain two 2*2 matrices, called upper_quantile and lower_quantile.
Essentially, this takes the above list of 10 matrices and computes the 97.5% quantile of the corresponding entries, and likewise the 2.5% quantile.
In this case we can manually construct the upper_quantile matrix for this example as follows:
upper_quantile <- matrix(data = c(quantile(x = seq(101, 1001, by = 100), probs = 0.975),
c(quantile(x = seq(102, 1002, by = 100), probs = 0.975)),
c(quantile(x = seq(103, 1003, by = 100), probs = 0.975)),
c(quantile(x = seq(104, 1004, by = 100), probs = 0.975)))
, nrow = 2
, ncol = 2
, byrow = FALSE)
upper_quantile
#> [,1] [,2]
#> [1,] 978.5 980.5
#> [2,] 979.5 981.5
I would like to understand how to do this using purrr or tidyverse tools as I have been trying to avoid cumbersome loops on lists and would like to adjust to dimensions automatically.
Could anyone please assist?
Here's a slightly clunky method which at least keeps everything in one pipe. It assumes that all the matrices are the same dimension, which needs to be true else the desired output doesn't make much sense. Working with matrices in purrr is always a little odd. The approach is basically to use flatten to make it easy to group the cells in the order we want, which is one column per location. That lets us map across columns to produce another vector, and then put that vector back into the right matrix. Might need some testing for larger matrices than 2x2.
The other approach I thought about was using cross to make a list of all index combinations, and then mapping through and creating the matrix cell by cell analogous to your example. Can attempt that if desired.
library(tidyverse)
z1 <- matrix(101:104, nrow = 2, ncol = 2)
z2 <- matrix(201:204, nrow = 2, ncol = 2)
z3 <- matrix(301:304, nrow = 2, ncol = 2)
z4 <- matrix(401:404, nrow = 2, ncol = 2)
z5 <- matrix(501:504, nrow = 2, ncol = 2)
z6 <- matrix(601:604, nrow = 2, ncol = 2)
z7 <- matrix(701:704, nrow = 2, ncol = 2)
z8 <- matrix(801:804, nrow = 2, ncol = 2)
z9 <- matrix(901:904, nrow = 2, ncol = 2)
z10 <- matrix(1001:1004, nrow = 2, ncol = 2)
# Combine all matrices into a single list
za <- list(z1, z2, z3, z4, z5, z6, z7, z8, z9, z10)
quant_mat <- function(list, p){
dim = ncol(list[[1]]) * nrow(list[[1]])
list %>%
flatten_int() %>%
matrix(ncol = dim, byrow = TRUE) %>%
as_tibble() %>%
map_dbl(quantile, probs = p) %>%
matrix(ncol = ncol(list[[1]]))
}
quant_mat(za, 0.975)
#> [,1] [,2]
#> [1,] 978.5 980.5
#> [2,] 979.5 981.5
quant_mat(za, 0.025)
#> [,1] [,2]
#> [1,] 123.5 125.5
#> [2,] 124.5 126.5
Created on 2018-03-14 by the reprex package (v0.2.0).
This should do the trick for a single quantile using tidyverse:
tibble(za) %>%
mutate(za = map(za, ~ data.frame(t(flatten_dbl(list(.)))))) %>%
unnest(za) %>%
summarize_all(quantile, probs = .975) %>%
matrix(ncol = 2)
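For comparison, and not as a purrr solution, the same result can be sketched in base R by stacking the list into a 3D array and applying quantile per cell. It assumes all matrices share the same dimensions; quant_mat_base is a made-up name, and za is rebuilt compactly here so the snippet runs on its own.

```r
# Same values as z1..z10 in the question, built in one step
za <- lapply(1:10, function(k) matrix(100 * k + 1:4, nrow = 2, ncol = 2))

# Stack the list into an array of shape rows x cols x number of matrices
arr <- array(unlist(za), dim = c(dim(za[[1]]), length(za)))

# Quantile across the third dimension for every (row, col) position
quant_mat_base <- function(a, p) {
  apply(a, 1:2, quantile, probs = p, names = FALSE)
}

upper_quantile <- quant_mat_base(arr, 0.975)
lower_quantile <- quant_mat_base(arr, 0.025)
upper_quantile
lower_quantile
```

The output dimensions follow from dim(za[[1]]), so larger matrices need no changes.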

R raster::calc calculating quantile with na.rm = FALSE

I use raster::calc to compute quantile for each cell across different layers but I do not understand the behaviour when na.rm = FALSE, like in the example below.
Let's create a sample raster and remove 5 values from random cells.
library(raster)
r <- raster::raster(nrow = 2, ncol = 2)
r[] <- 1:4
s <- raster::stack(r, r*2, r * 3, r * 4, r * 5)
s[]
set.seed(1)
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[]
If I remove NAs, the code below works!
fun <- function(x) {quantile(x, probs = 0.50, na.rm = TRUE)}
p <- raster::calc(s, fun)
p[]
However, if I want to exclude the cells where there is at least one NA, the code does not work!
fun <- function(x) {quantile(x, probs = 0.50, na.rm = FALSE)}
p <- raster::calc(s, fun)
I was expecting a vector containing 4 NAs, but the code above throws this error instead:
Error in .calcTest(x[1:5], fun, na.rm, forcefun, forceapply) :
cannot use this function
Could anybody help me understand why this happens? And what should I do to get the behaviour I was expecting?
I think the error message is straightforward, and it may not be related to the raster package. The basic idea is that if you apply the quantile function to values containing any NA with na.rm = FALSE, quantile throws an error.
Consider the following example.
# Set na.rm = TRUE
quantile(c(1, NA, 3, 4), probs = 0.50, na.rm = TRUE)
50%
3
# Set na.rm = FALSE
quantile(c(1, NA, 3, 4), probs = 0.50, na.rm = FALSE)
Error in quantile.default(c(1, NA, 3, 4), probs = 0.5, na.rm = FALSE) :
missing values and NaN's not allowed if 'na.rm' is FALSE
When setting na.rm = FALSE the second example just returned an error. This is the same when applying quantile to raster. na.rm needs to be TRUE.
Update
To illustrate how to apply the quantile functions while some cells are NA, I modified the example dataset from the OP a little.
s <- raster::stack(r, r*2, r * 3, r * 4, r * 5)
s[]
set.seed(1)
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[]
layer.1 layer.2 layer.3 layer.4 layer.5
[1,] 1 2 3 4 NA
[2,] 2 NA 6 8 10
[3,] 3 6 9 12 NA
[4,] 4 8 12 16 20
See the last row, which has no NA.
We can then create a function. It returns NA if any layer at a location is NA; otherwise, it calculates the quantile.
# Design a function
quantile_fun <- function(x, probs = 0.50){
if (anyNA(x)){
return(NA)
} else {
return(quantile(x, probs = probs))
}
}
After that, we can apply this function using calc:
p <- raster::calc(s, quantile_fun)
p[]
[1] NA NA NA 12
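The function can also be checked without the raster package by applying it to the rows of a plain matrix holding the cell values printed above. This is only a sketch: vals is typed in by hand from that output, and the function is written with NA_real_ and unname so that every cell returns a plain number.

```r
# Cell values copied from the s[] output above (rows = cells, columns = layers)
vals <- rbind(c(1, 2, 3, 4, NA),
              c(2, NA, 6, 8, 10),
              c(3, 6, 9, 12, NA),
              c(4, 8, 12, 16, 20))

# NA if any layer is missing at a cell, otherwise the requested quantile
quantile_fun <- function(x, probs = 0.50) {
  if (anyNA(x)) {
    return(NA_real_)
  }
  unname(quantile(x, probs = probs))
}

# One result per cell; only the last row is free of NA
res <- apply(vals, 1, quantile_fun)
res
```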
