R raster::calc calculating quantile with na.rm = FALSE

I use raster::calc to compute a quantile for each cell across different layers, but I do not understand the behaviour when na.rm = FALSE, as in the example below.
Let's create a sample raster and remove 5 values from random cells.
library(raster)
r <- raster::raster(nrow = 2, ncol = 2)
r[] <- 1:4
s <- raster::stack(r, r * 2, r * 3, r * 4, r * 5)
s[]
set.seed(1)
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[]
If I remove NAs, the code below works!
fun <- function(x) {quantile(x, probs = 0.50, na.rm = TRUE)}
p <- raster::calc(s, fun)
p[]
However, if I want to exclude the cells where there is at least one NA, the code does not work!
fun <- function(x) {quantile(x, probs = 0.50, na.rm = FALSE)}
p <- raster::calc(s, fun)
I was expecting a vector containing 4 NAs, but the code above throws this error instead:
Error in .calcTest(x[1:5], fun, na.rm, forcefun, forceapply) :
cannot use this function
Could anybody help me understand why this happens? And what should I do to get the behaviour I was expecting?

I think the error message is straightforward, and it is not really specific to the raster package. The basic idea is that if you apply the quantile function to values containing any NA while na.rm = FALSE, quantile itself throws an error.
Consider the following example.
# Set na.rm = TRUE
quantile(c(1, NA, 3, 4), probs = 0.50, na.rm = TRUE)
50%
3
# Set na.rm = FALSE
quantile(c(1, NA, 3, 4), probs = 0.50, na.rm = FALSE)
Error in quantile.default(c(1, NA, 3, 4), probs = 0.5, na.rm = FALSE) :
missing values and NaN's not allowed if 'na.rm' is FALSE
With na.rm = FALSE, the second example simply throws an error. The same thing happens inside raster: calc() first test-runs the supplied function on a handful of cell values (the .calcTest in your traceback), and because quantile() errors as soon as those values include an NA, calc() rejects the function outright with "cannot use this function". So na.rm needs to be TRUE, and any NA handling has to happen inside your own function.
Update
To illustrate how to apply the quantile function while some cells are NA, I modified the example dataset from the OP a little.
s <- raster::stack(r, r * 2, r * 3, r * 4, r * 5)
s[]
set.seed(1)
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[][sample(1:4, 1), sample(1:5, 1)] <- NA
s[]
     layer.1 layer.2 layer.3 layer.4 layer.5
[1,]       1       2       3       4      NA
[2,]       2      NA       6       8      10
[3,]       3       6       9      12      NA
[4,]       4       8      12      16      20
Note the last row, which has no NAs.
We can then create a function that returns NA if any layer is NA at a location, and otherwise calculates the quantile.
# Design a function
quantile_fun <- function(x, probs = 0.50) {
  if (anyNA(x)) {
    return(NA)
  } else {
    return(quantile(x, probs = probs))
  }
}
After that, we can apply this function using calc:
p <- raster::calc(s, quantile_fun)
p[]
[1] NA NA NA 12
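Should you later migrate to the terra package, a minimal sketch of the same computation (assuming terra is installed, and reusing quantile_fun and the stack s from above; app() is terra's counterpart of raster::calc):
library(terra)
st <- rast(s)                # convert the RasterStack to a SpatRaster
p2 <- app(st, quantile_fun)  # returns NA wherever any layer is NA
values(p2)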

Related

Using app function from {terra} package on raster stacks? (in parallel)

I have four high resolution rasters for a country. I have split each raster into tiles and done some other processing to them. I now want to apply a function to each cell of each 'stack' of the raster tiles, to produce one set of output tiles. The function is a little complex.

I have tried to synthesise some data below to reproduce my current approach. It works (ish), but I'm convinced that there's a better way to do this. To use parallel processing on my unix box, I simply swap mapply for mcmapply, but I haven't done that in the example below as I presume many will be working on Windows machines. I'd welcome ideas on my approach, and particularly on optimisation.
library("terra")
library("glue")
## Make some toy data
dir.create("temp_folder")
dir.create("result_folder")
x <- rast(ncols = 10, nrows = 10)
a <- rast(ncol = 100, nrow = 100)
some_values <- as.integer(runif(10000, min = 1, max = 100))
ind <- which(some_values %in% sample(some_values, 15))
some_values[ind] <- NA
values(a) <- some_values
a_tiles <- makeTiles(a, x, glue("temp_folder/tile_a_{1:100}.tif"), overwrite = TRUE)
b <- rast(ncol = 100, nrow = 100)
some_values <- as.integer(runif(10000, min = 1, max = 100))
ind <- which(some_values %in% sample(some_values, 15))
some_values[ind] <- NA
values(b) <- some_values
b_tiles <- makeTiles(b, x, glue("temp_folder/tile_b_{1:100}.tif"), overwrite = TRUE)
c <- rast(ncol = 100, nrow = 100)
some_values <- as.integer(runif(10000, min = 1, max = 100))
ind <- which(some_values %in% sample(some_values, 15))
some_values[ind] <- NA
values(c) <- some_values
c_tiles <- makeTiles(c, x, glue("temp_folder/tile_c_{1:100}.tif"), overwrite = TRUE)
d <- rast(ncol = 100, nrow = 100)
some_values <- as.integer(runif(10000, min = 1, max = 100))
ind <- which(some_values %in% sample(some_values, 15))
some_values[ind] <- NA
values(d) <- some_values
d_tiles <- makeTiles(d, x, glue("temp_folder/tile_d_{1:100}.tif"), overwrite = TRUE)
## Outer function so that this can be used in parallel? But maybe this is a silly way to do it?
outer_function <- function(a_tiles, b_tiles, c_tiles, d_tiles, output_files) {
  one_a_tile <- rast(unlist(a_tiles))
  one_b_tile <- rast(unlist(b_tiles))
  one_c_tile <- rast(unlist(c_tiles))
  one_d_tile <- rast(unlist(d_tiles))
  output_file <- output_files
  # I replace any NAs with 0 as an NA will break my 'if' statement of the inner_function.
  # I get Error in if (z["a"] <= z["b"]) { : missing value where TRUE/FALSE needed
  one_a_tile[is.na(one_a_tile)] <- 0
  one_b_tile[is.na(one_b_tile)] <- 0
  one_c_tile[is.na(one_c_tile)] <- 0
  one_d_tile[is.na(one_d_tile)] <- 0
  z <- sds(one_a_tile, one_b_tile, one_c_tile, one_d_tile)
  ## Inner function that actually does the work I want doing
  inner_function <- function(z) {
    names(z) <- c('a', 'b', 'c', 'd')
    if (z['a'] <= z['b']) {
      y <- rowSums(cbind((z['c'] + z['a'] * 10),
                         (z['c'] + z['a'] * 20)))
    }
    if (z['a'] >= z['b']) {
      y <- rowSums(cbind((z['c'] + z['a'] * 40),
                         (z['c'] + z['a'] * 10)))
    }
    if (z['a'] == z['b']) {
      y <- rowSums(cbind((z['c'] + z['a'] * 60),
                         (z['c'] + z['a'] * 10)))
    }
    y <- ifelse(y == 0, NA, y)
    return(y)
  }
  app(z,
      inner_function,
      filename = output_file,
      overwrite = TRUE,
      wopt = list(datatype = "INT4U"))
  return(output_file)
}
results <- mapply(outer_function,
                  a_tiles = a_tiles,
                  b_tiles = b_tiles,
                  c_tiles = c_tiles,
                  d_tiles = d_tiles,
                  output_files = output_files <- glue("result_folder/result_tile_{1:length(d_tiles)}.tif"))
names(results) <- NULL
unlink("temp_folder", recursive = TRUE)
unlink("result_folder", recursive = TRUE)

Count with Varying Levels Parameters in R

I want R to count how many times my simulated ARIMA data conform to ARIMA(1,0,0) which I have achieved with:
library(forecast)
cnt <- 0
num <- 60
phi <- 0.8
for (i in 1:10) {
  epselon <- rnorm(num, mean = 0, sd = 1)
  ar1 <- arima.sim(n = num, model = list(ar = phi, order = c(1, 0, 0)),
                   sd = 1)
  ar2 <- auto.arima(ar1)
  if (all(arimaorder(ar2) == c(1, 0, 0))) cnt <- cnt + 1
}
cnt
The above is just for a single case, with sd = 1, n = 60, and ar = 0.8.
I want a case with varying levels of N <- c(15, 20), SD <- c(1, 2) ^ 2, and phi = c(0.8, 0.9) for the sample size, standard deviation and AR parameter respectively.
I have tried this:
library(forecast)
N <- c(15, 20)
SD <- c(1, 2) ^ 2
phi = c(0.8, 0.9)
## generate all combos
all_combos <- expand.grid(N = N, SD = SD, phi = phi)
epselon = function(n) rnorm(n, mean = 0, sd = SD)
## create function
## (note: as written, the body ignores its n, SD and phi arguments and always
## simulates with num = 60, phi = 0.8 and sd = 1; to really vary the levels,
## the arguments should be used inside the loop)
fx_arima <- function(n, SD, phi) {
  cnt <- 0
  num <- 60
  phi <- 0.8
  for (i in 1:10) {
    epselon <- rnorm(num, mean = 0, sd = 1)
    ar1 <- arima.sim(n = num, model = list(ar = phi, order = c(1, 0, 0)), sd = 1)
    ar2 <- auto.arima(ar1)
    if (all(arimaorder(ar2) == c(1, 0, 0))) cnt <- cnt + 1
  }
  cnt
}
## find arima for all combos using Map
set.seed(123L)
res = Map(fx_arima, all_combos[["N"]], all_combos[["SD"]],
          all_combos[["phi"]])
## or a little bit more work:
set.seed(123L)
res2 = by(all_combos, all_combos["N"],
          function(DF) {
            res = mapply(fx_arima, DF[["N"]], DF[["SD"]], DF[["phi"]])
            colnames(res) = paste("SD", DF[["SD"]], "phi", DF[["phi"]], sep = "_")
            res
          })
res2
## write to csv
Map(function(file, DF) write.csv(DF, paste0("N_", file, ".csv")),
    names(res2), res2)
which I adapted from the questions "arima.sim() function with varying: sample sizes, phi values and sd values" and "R Count How Many Time auto.arima() Confirm arima.sim() to be True".
I got this error message:
Error in `colnames<-`(`*tmp*`, value = c("SD_1_phi_0.2", "SD_4_phi_0.2", : attempt to set 'colnames' on an object with less than two dimensions
Traceback:
How can I solve this so that my results show up in the varying form, with the first row as the label and the second row as the count itself? The results should be in two sheets: the first for N = 15 and the second for N = 20.
If I understood your problem correctly, the error comes from the colnames function: mapply here returns a plain vector (one count per combination), not a matrix-like object, so there are no columns to name. If, instead, you use the function names in your last chunk of code as follows:
res2 = by(all_combos, all_combos["N"],
          function(DF) {
            res = mapply(fx_arima, DF[["N"]], DF[["SD"]], DF[["phi"]])
            names(res) = paste("SD", DF[["SD"]], "phi", DF[["phi"]], sep = "_")
            return(res)
          })
res2
You will get:
> res2
N: 15
SD_1_phi_0.8 SD_4_phi_0.8 SD_1_phi_0.9 SD_4_phi_0.9
           1            3            7            5
---------------------------------------------------------------------------
N: 20
SD_1_phi_0.8 SD_4_phi_0.8 SD_1_phi_0.9 SD_4_phi_0.9
           3            4            5            2
With elements accessible by name and index:
> res2$`15`["SD_1_phi_0.8"]
SD_1_phi_0.8
1
> res2$`15`[1]
SD_1_phi_0.8
1
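If you then want the two-sheet/CSV layout described in the question (labels in the first row, counts in the second), one sketch is to transpose each named vector before writing (file names follow the N_<value>.csv pattern used above):
for (nm in names(res2)) {
  out <- as.data.frame(t(res2[[nm]]))  # one-row data frame; the names become the header row
  write.csv(out, paste0("N_", nm, ".csv"), row.names = FALSE)
}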

loop through 2 dataframes

I am new to R and trying to loop through each row of df1 and search for rows in df2 that are close in distance (5 mi / 8046.72 m). I think df1 is looping as intended, but I don't think it is going through all of df2.
{for (i in 1:1452) {
  p1 <- df1[i, 4:5]
  p2 <- df2[1:11, 2:3]
  d <- distCosine(p1, p2, r = 6378137)
  return(d < 8046.72)
  i <- i + 1}
}
I get the output:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Your loop never gets past its first pass: return() exits as soon as it is reached, so d is only ever computed for row 1 of df1 against those 11 rows of df2 (and the manual i <- i + 1 is unnecessary, as for handles the increment). I would just use an apply function instead. First, let's make your problem reproducible by creating some "fake" data - I am making the lon/lat pairs artificially close so that we can get a few TRUEs back in the results:
library(geosphere)
df1 <- data.frame(x1 = sample(letters, 100, replace = T),
                  x2 = sample(letters, 100, replace = T),
                  x3 = sample(letters, 100, replace = T),
                  lon = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T),
                  lat = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T))
df2 <- data.frame(x1 = sample(letters, 100, replace = T),
                  lon = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T),
                  lat = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T))
We can then create two matrices containing the values of interest:
m1 <- as.matrix(df1[, c("lon", "lat")])
m2 <- as.matrix(df2[1:11, c("lon", "lat")])
Now we can use the apply function across the rows of m2, which returns a 100 x 11 matrix:
results <- apply(m2, 1, FUN = function(x) distCosine(x, m1))
To get the results less than 5 mi (~8046.72 m), we simply subset:
results[results < 8046.72]
# Showing the next two for alternative output
which(results < 8046.72)
which(results < 8046.72, arr.ind = T)
Note: In your question, it looks like you are interested in the first 1,452 rows -- this would mean the results would be a 1,452 x 11 matrix.
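If you also need to know which pairs of rows matched, the arr.ind form pairs each hit back to its source rows (a sketch; in results, rows index df1 and columns index the 11 df2 rows):
hits <- which(results < 8046.72, arr.ind = TRUE)
matched <- data.frame(df1_row = hits[, "row"],
                      df2_row = hits[, "col"],
                      dist_m  = results[hits])  # matrix indexing pulls the matching distances
matched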

Conditional change to data frame column(s) based on values in other columns

Within the simulated data set
n = 50
set.seed(378)
df <- data.frame(
  age = sample(c(20:90), n, rep = T),
  sex = sample(c("m", "f"), n, rep = T, prob = c(0.55, 0.45)),
  smoker = sample(c("never", "former", "active"), n, rep = T, prob = c(0.4, 0.45, 0.15)),
  py = abs(rnorm(n, 25, 10)),
  yrsquit = abs(rnorm(n, 10, 2)),
  outcome = as.factor(sample(c(0, 1), n, rep = T, prob = c(0.8, 0.2)))
)
I need to introduce some imbalance between the outcome groups (1=disease, 0=no disease). For example, subjects with the disease are older and more likely to be male. I tried
df1 <- within(df, sapply(length(outcome), function(x) {
  if (outcome[x] == 1) {
    age[x] <- age[x] + 15
    sex[x] <- sample(c("m", "f"), prob = c(0.8, 0.2))
  }
}))
but there is no difference as shown by
tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)
The use of sapply inside within does not work as you expect. within only uses the value returned by sapply, and in your code sapply returns NULL, so within does not modify the data frame. (Note also that sapply(length(outcome), ...) iterates over the single value 50; seq_along(outcome) was probably intended.)
Here is an easier way to modify the data frame, without a loop or sapply:
idx <- df$outcome == "1"
df1 <- within(df, {
  age[idx] <- age[idx] + 15
  sex[idx] <- sample(c("m", "f"), sum(idx),
                     replace = TRUE, prob = c(0.8, 0.2))
})
Now, the data frames are different:
> tapply(df$age, df$outcome, mean)
       0        1
60.46341 57.55556
> tapply(df1$age, df$outcome, mean)
       0        1
60.46341 72.55556
> tapply(df$sex, df$outcome, summary)
$`0`
 f  m
24 17

$`1`
f m
2 7

> tapply(df1$sex, df$outcome, summary)
$`0`
 f  m
24 17

$`1`
f m
1 8
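The same update can also be written without within(), using plain base R indexing (a sketch reusing the idx defined above; df1b is just an illustrative name):
df1b <- df
df1b$age[idx] <- df1b$age[idx] + 15
df1b$sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))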

Is there a weighted.median() function?

I'm looking for something similar in form to weighted.mean(). I've found some solutions via search that write out the entire function but would appreciate something a bit more user friendly.
The following packages all have a function to calculate a weighted median: 'aroma.light', 'isotone', 'limma', 'cwhmisc', 'ergm', 'laeken', 'matrixStats', 'PSCBS', and 'bigvis' (on github).
To find them I used the invaluable findFn() in the 'sos' package which is an extension for R's inbuilt help.
findFn('weighted median')
Or,
???'weighted median'
as ??? is a shortcut in the same way ?some.function is for help(some.function).
Some experience using the answers from #wkmor1 and #Jaitropmange.
I've checked 3 functions from 3 packages: isotone, laeken, and matrixStats. Only matrixStats works properly. The other two (just like the median(rep(x, times=w)) solution) give integer output. Since I was calculating the median age of populations, decimal places matter.
Reproducible example. Calculation of the median age of a population
df <- data.frame(age = 0:100,
                 pop = spline(c(4, 7, 9, 8, 7, 6, 4, 3, 2, 1), n = 101)$y)
library(isotone)
library(laeken)
library(matrixStats)
isotone::weighted.median(df$age,df$pop)
# [1] 36
laeken::weightedMedian(df$age,df$pop)
# [1] 36
matrixStats::weightedMedian(df$age,df$pop)
# [1] 36.164
median(rep(df$age, times=df$pop))
# [1] 35
Summary
matrixStats::weightedMedian() is the reliable solution
To calculate the weighted median of a vector x using a same-length vector of (integer) weights w:
median(rep(x, times=w))
This is just a simple solution, ready to use almost anywhere.
weighted.median <- function(x, w) {
  w <- w[order(x)]
  x <- x[order(x)]
  prob <- cumsum(w)/sum(w)
  ps <- which(abs(prob - .5) == min(abs(prob - .5)))
  return(x[ps])
}
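Applied to the population example above, this returns the age whose cumulative population weight is closest to 0.5 (a quick check; note the result always lies on the grid of x values, which is why this method too stays integer-valued here):
weighted.median(df$age, df$pop)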
Really old post, but I just came across it and did some testing of the different methods. spatstat::weighted.median() seemed to be about 14 times faster than median(rep(x, times=w)), and it's actually noticeable if you want to run the function more than a couple of times. Testing was with a relatively large survey, about 15,000 people.
One can also use stats::density to create a weighted PDF, then convert this to a CDF, as elaborated here:
my_wtd_q = function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n),
       x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
Then my_wtd_q(x, w, .5) will be the weighted median.
One could also be more careful to ensure that the total area under the density is one by re-normalizing.
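A minimal sketch of that re-normalization (my_wtd_q2 is a hypothetical name; the only change is dividing the approximate CDF by its final value so it ends at exactly 1):
my_wtd_q2 <- function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n), {
    cdf <- cumsum(y * (x[2L] - x[1L]))  # approximate CDF from the density grid
    cdf <- cdf / cdf[length(cdf)]       # re-normalize so the total area is 1
    x[which.max(cdf >= prob)]
  })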
A way in base R to get a weighted median is to order by the values, build the cumulative sum of the weights, and take the value(s) at half the total weight.
medianWeighted <- function(x, w, q = .5) {
  n <- length(x)
  i <- order(x)
  w <- cumsum(w[i])
  p <- w[n] * q
  j <- findInterval(p, w)
  Vectorize(function(p, j) if (w[n] <= 0) NA else
            if (j < 1) x[i[1]] else
            if (j == n) x[i[n]] else
            if (w[j] == p) (x[i[j]] + x[i[j+1]]) / 2 else
            x[i[j+1]])(p, j)
}
This gives the following results with simple input data.
medianWeighted(c(10, 40), c(1, 2))
#[1] 40
median(rep(c(10, 40), c(1, 2)))
#[1] 40
medianWeighted(c(10, 40), c(2, 1))
#[1] 10
median(rep(c(10, 40), c(2, 1)))
#[1] 10
medianWeighted(c(10, 40), c(1.5, 2))
#[1] 40
medianWeighted(c(10, 40), c(3, 4))
#[1] 40
median(rep(c(10, 40), c(3, 4)))
#[1] 40
medianWeighted(c(10, 40), c(1.5, 1.5))
#[1] 25
medianWeighted(c(10, 40), c(3, 3))
#[1] 25
median(rep(c(10, 40), c(3, 3)))
#[1] 25
medianWeighted(c(10, 40), c(0, 1))
#[1] 40
medianWeighted(c(10, 40), c(1, 0))
#[1] 10
medianWeighted(c(10, 40), c(0, 0))
#[1] NA
It can also be used for other quantiles:
medianWeighted(1:10, 10:1, seq(0, 1, 0.25))
#[1] 1 2 4 6 10
Compare with other methods.
#Functions from other Answers
weighted.median <- function(x, w) {
  w <- w[order(x)]
  x <- x[order(x)]
  prob <- cumsum(w)/sum(w)
  ps <- which(abs(prob - .5) == min(abs(prob - .5)))
  return(x[ps])
}
my_wtd_q = function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n),
       x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
                              na.rm = FALSE, names = TRUE) {
  if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
  if (length(w) == 1) w <- rep(w, length(x))
  if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
  if (isTRUE(na.rm)) {
    w <- w[!is.na(x)]  # keep the weights belonging to non-NA values
    x <- x[!is.na(x)]
  }
  w <- w[order(x)] / sum(w)
  x <- x[order(x)]
  cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
  res <- approx(x = cum_w, y = x, xout = probs)$y
  if (isTRUE(names)) {
    res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
  }
  res
}
Methods
M <- alist(
  medRep = median(rep(DF$x, DF$w)),
  isotone = isotone::weighted.median(DF$x, DF$w),
  laeken = laeken::weightedMedian(DF$x, DF$w),
  spatstat1 = spatstat.geom::weighted.median(DF$x, DF$w, type=1),
  spatstat2 = spatstat.geom::weighted.median(DF$x, DF$w, type=2),
  spatstat4 = spatstat.geom::weighted.median(DF$x, DF$w, type=4),
  survey = survey::svyquantile(~x, survey::svydesign(id=~1, weights=~w, data=DF), 0.5)$x[1],
  RAndres = weighted.median(DF$x, DF$w),
  matrixStats = matrixStats::weightedMedian(DF$x, DF$w),
  MichaelChirico = my_wtd_q(DF$x, DF$w, .5),
  Leonardo = weighted.quantile(DF$x, DF$w, .5),
  GKi = medianWeighted(DF$x, DF$w)
)
Results
DF <- data.frame(x=c(10, 40), w=c(1, 2))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 40.00000 40.00000 40.00000 40.00000 25.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 17.50000 40.00000 10.00000 30.00000 34.15005
# Leonardo.50% GKi
# 25.00000 40.00000
DF <- data.frame(x=c(10, 40), w=c(1, 1))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 25.00000 25.00000 40.00000 10.00000 10.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 10.00000 10.00000 10.00000 25.00000 25.05044
# Leonardo.50% GKi
# 25.00000 25.00000
In those two cases only isotone and GKi give results identical to what median(rep(x, w)) returns.
If you're working with the survey package, assuming you've defined your survey design and x is your variable of interest:
svyquantile(~x, mydesign, c(0.5))
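For completeness, a rough sketch of the design step (dat, w and x are placeholder names for your data frame, weight column and variable of interest):
library(survey)
mydesign <- svydesign(ids = ~1, weights = ~w, data = dat)  # simple weighted design
svyquantile(~x, mydesign, c(0.5))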
I got here looking for weighted quantiles, so I thought I might as well leave for future readers what I ended up with. Naturally, using probs = 0.5 will return the weighted median.
I started with MichaelChirico's answer, which unfortunately was off at the edges. Then I decided to switch from density() to approx(). Finally, I believe I nailed the correction factor to ensure consistency with the default algorithm of the unweighted quantile().
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
                              na.rm = FALSE, names = TRUE) {
  if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
  if (length(w) == 1) w <- rep(w, length(x))
  if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
  if (isTRUE(na.rm)) {
    w <- w[!is.na(x)]  # drop the weights belonging to NA values
    x <- x[!is.na(x)]
  }
  w <- w[order(x)] / sum(w)
  x <- x[order(x)]
  cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
  res <- approx(x = cum_w, y = x, xout = probs)$y
  if (isTRUE(names)) {
    res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
  }
  res
}
When weights are uniform, the weighted quantiles are identical to regular unweighted quantiles:
x <- rnorm(100)
stopifnot(identical(weighted.quantile(x, w = 1), quantile(x)))
Example using the same data as in the weighted.mean() man page.
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15
stopifnot(isTRUE(all.equal(
  weighted.quantile(x, w, 0:4/4, names = FALSE),
  c(2.8, 3.33611111111111, 3.46111111111111, 3.58157894736842, 3.7)
)))
And this is for whoever solely wants the weighted median value:
weighted.median <- function(x, w, ...) {
  weighted.quantile(x, w, probs = 0.5, names = FALSE, ...)
}
