I have a raster stack with 364 layers with a daily rate of change in NDVI values.
I want to scale these values in every cell: positive values from 0 to 1 and negative values from -1 to 0. So far I have only found a solution that scales values within single layers (see here: Replace specific value in each band of raster brick in R), not along the cells of multilayer objects. Additionally, a fair number of cells are NA for the entire time series, and I'm not sure how to deal with that either.
I took the code from the previously mentioned post and tried to get it working for my problem:
norm <- function(x){-1+(x-min)*((1-(-1))/(max-min))}
for(j in 1:ncell(tif)){
  cat(paste("Currently processing layer:", j,"/",ncell(tif), "\n"))
  min <- cellStats(tif[j],'min')
  max <- cellStats(tif[j],'max')
  #initialize cluster
  #number of cores to use for clusterR function (max recommended: ncores - 1)
  tif[j] <- clusterR(tif[j], calc, args=list(fun=norm), export=c('min',"max"))
  #end cluster
}
I'm not quite certain if this produces the desired output. Any help is very much appreciated!
Some example data
library(raster)
r <- raster(ncol=10, nrow=10)
s <- stack(lapply(1:5, function(i) setValues(r, runif(100, -1, 1))))
# adding NAs
s[[2]][sample(100, 25, TRUE)] <- NA
For scaling (or any other operation) by cell (as requested) you can use calc together with a function that works on a vector. For example:
ff <- function(i) {
  p <- which(i >= 0)
  n <- which(i <= 0)
  # positive values
  if (length(p) > 0) {
    i[p] <- i[p] - min(i[p], na.rm=TRUE)
    i[p] <- i[p] / max(i[p])
  }
  # negative values
  if (length(n) > 0) {
    i[n] <- i[n] - max(i[n], na.rm=TRUE)
    i[n] <- i[n] / abs(min(i[n]))
  }
  # return the rescaled vector
  i
}
Test it
ff(c(-.3, -.1, .1, .4, .8))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000
ff(c(-.3, -.1, .1, .4, .8, NA))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000 NA
ff(c(-.3, -.1))
#[1] -1 0
ff(c(NA, NA))
#[1] NA NA
And use it
z <- calc(s, ff)
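As a rough base-R illustration of what a per-cell function sees, the sketch below treats each row of a plain matrix as one cell's time series. The matrix `cells` and the helper name `scale_signed` are made up for this example and are not part of the raster package; the scaling logic mirrors the function above.

```r
# Hypothetical helper: positives are rescaled to [0, 1], negatives to [-1, 0]
scale_signed <- function(v) {
  p <- which(v >= 0)   # which() drops NA, so NA cells are left untouched
  n <- which(v <= 0)
  if (length(p) > 0) {
    v[p] <- v[p] - min(v[p])
    v[p] <- v[p] / max(v[p])
  }
  if (length(n) > 0) {
    v[n] <- v[n] - max(v[n])
    v[n] <- v[n] / abs(min(v[n]))
  }
  v
}

# Toy data: 3 cells (rows) x 5 layers (columns); the last cell is NA throughout
cells <- rbind(
  c(-0.3, -0.1,  0.1,  0.4,  0.8),
  c( 0.2,  0.6, -0.5, -0.25, 1.0),
  rep(NA_real_, 5)
)

# apply() returns one column per cell, so transpose back to cells x layers
scaled <- t(apply(cells, 1, scale_signed))
print(scaled)
```

A cell that is NA for the whole series simply stays NA, because neither branch of the function has anything to work on.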
See below for scaling by layer, based on the min and max of all cell values in that layer (I first thought that this was what was asked for). Note that the functions below scale values to the range -1 to 1, but do not map the lowest positive value and the highest negative value to zero.
minv <- abs(cellStats(s,'min'))
maxv <- cellStats(s,'max')
f1 <- function(i, mn, mx) {
  j <- i < 0
  j[is.na(j)] <- TRUE
  i[j] <- i[j] / abs(mn)
  i[!j] <- i[!j] / mx
  i
}
ss <- list()
for (i in 1:nlayers(s)) {
  ss[[i]] <- calc(s[[i]], fun=function(x) f1(x, minv[i], maxv[i]))
}
ss1 <- stack(ss)
Or without a loop
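f2 <- function(x, mn, mx) {
  # transpose so that layers are in rows and mn/mx recycle per layer
  x <- t(x)
  i <- which(x > 0)
  i[is.na(i)] <- FALSE
  mxx <- x / mx
  x <- x / mn
  x[i] <- mxx[i]
  # transpose back to the original orientation
  t(x)
}
ss2 <- calc(s, fun=function(x) f2(x, minv, maxv))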
For reference, to simply scale between 0 and 1
mnv <- cellStats(s,'min')
mxv <- cellStats(s,'max')
x <- (s - mnv) / (mxv - mnv)
To get values between -1 and 1 you can then do
y <- 2 * x - 1
But that way previously negative values can become positive and vice versa.
See ?raster::scale for other types of scaling.
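To make the sign-flipping caveat concrete, here is a small base-R sketch on a plain numeric vector (the vector `v` is invented for illustration). It assumes the usual linear map from [0, 1] to [-1, 1], namely 2x - 1.

```r
v <- c(-0.2, 0.1, 0.4, 1.0)

# min-max scaling to [0, 1]
x01 <- (v - min(v)) / (max(v) - min(v))

# linear map from [0, 1] to [-1, 1]
y <- 2 * x01 - 1

print(rbind(original = v, scaled = y))
```

The second value is positive in `v` (0.1) but negative after scaling, which is why this approach is not suitable when the sign has to be preserved.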
I have a matrix of 1s and 0s where the rows are individuals and the columns are events. A 1 indicates that an event happened to an individual and a 0 that it did not.
I want to find the set of (in this example) 5 columns/events that covers the most rows/individuals.
Test Data
#Make test data
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
My attempt
My initial attempt was just based on combining the set of 5 columns with the highest colMeans:
#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)
#Have a look at the set
col_set
197 199 59 80 76
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000
#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5
[1] 0.7
However, this set does not cover the most rows. I tested this by pseudo-randomly sampling 10,000 different sets of 5 columns and then finding the set with the highest coverage:
#Get 5 random columns using colMeans as prob in sample
##Random sample 10,000 times
result <- lapply(1:10000, function(x){
  col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
  cover <- sum(apply(d[,colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
  list(set = col_set2, cover = cover)
})
##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]
59 169 262 68 197
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667
[1] 0.7666667
The reason for supplying the colMeans to sample is that the columns with the highest coverages are the ones I am most interested in.
So, using pseudo-random sampling I can collect a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.
For the interested, I decided to microbenchmark the 3 solutions provided:
#Defining G. Grothendieck's coverage function outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
#G. Grothendieck top solution
solution1 <- function(d){
  cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}
#G. Grothendieck "Older solution"
library(lpSolve) # provides lp()
solution2 <- function(d){
  ones <- rep(1, 300)
  res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
  m <- matrix(res$solution[1:3000] == 1, 300)
  cols <- which(rowSums(m) > 0)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}
#user2554330 solution
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  result
}
microbenchmark::microbenchmark(solution1 = solution1(d),
                               solution2 = solution2(d),
                               solution3 = bestCols(d), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16 10
solution2 55252.890 71492.781 84613.301 84811.7210 93916.544 117451.35 10
solution3 425.922 517.843 3087.758 589.3145 641.551 25742.11 10
This looks like a relatively hard optimization problem, because of the ways columns interact. An approximate strategy would be to pick the column with the highest mean; then delete the rows with ones in that column, and repeat. You won't necessarily find the best solution this way, but you should get a fairly good one.
For example,
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
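bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    # pick the column covering the most remaining rows
    result[i] <- which.max(colMeans(d))
    # drop the rows that this column already covers
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  cat("final dim is ", dim(d))
  result
}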
col_set <- bestCols(d)
sum(apply(d[,colnames(d) %in% col_set], 1, sum) > 0) / 30 #top 5
This gives 90% coverage.
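One edge case of the greedy idea is worth spelling out: if every row is already covered before n columns have been picked, there is nothing left to rank. The sketch below is a variant under that assumption, not the code from the answer above; the name `greedy_cols` and the toy matrix `d_small` are made up, and it returns column indices rather than column names.

```r
# Hypothetical greedy variant that tracks uncovered rows and stops early
greedy_cols <- function(d, n = 5) {
  chosen <- integer(0)
  uncovered <- rep(TRUE, nrow(d))
  for (i in seq_len(n)) {
    if (!any(uncovered)) break                    # everything is covered already
    gain <- colSums(d[uncovered, , drop = FALSE]) # newly covered rows per column
    best <- unname(which.max(gain))               # ties go to the first column
    chosen <- c(chosen, best)
    uncovered <- uncovered & d[, best] != 1
  }
  chosen
}

# Toy data: column 1 covers rows 1-2, column 2 covers row 3, column 3 covers row 4
d_small <- rbind(c(1, 0, 0),
                 c(1, 0, 0),
                 c(0, 1, 0),
                 c(0, 0, 1))

two  <- greedy_cols(d_small, n = 2)
five <- greedy_cols(d_small, n = 5)
cover_two <- mean(rowSums(d_small[, two, drop = FALSE]) > 0)
print(two); print(five); print(cover_two)
```

Asking for five columns here returns only three, since three columns already cover every row.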
The following provides a heuristic to find an approximate solution. Find the N=20 columns, say, with the most ones, cols, and then use brute force to find every subset of 5 columns out of those 20. The subset having the highest coverage is shown below and its coverage is 93.3%.
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
N <- 20
cols <- tail(as.numeric(names(sort(colSums(d)))), N)
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
Repeating this for N=5, 10, 15 and 20 we get coverages of 83.3%, 86.7%, 90% and 93.3%. The higher the N the better the coverage but the lower the N the less the run time.
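The trade-off between N and coverage can be explored with a small loop. This is only a sketch: the seed is arbitrary, so the numbers will differ from the ones quoted above, the helper name `best_for_N` is invented, and N stops at 15 to keep the run time short.

```r
set.seed(1)  # arbitrary seed, chosen only to make the sketch reproducible
d <- sapply(1:300, function(x) sample(c(0, 1), 30, TRUE, c(0.9, 0.1)))

# share of rows with at least one 1 in the chosen columns
coverage <- function(ix) mean(rowSums(d[, ix, drop = FALSE]) > 0)

# best coverage of any k-subset among the N columns with the most ones
best_for_N <- function(N, k = 5) {
  cols <- tail(order(colSums(d)), N)  # indices of the N largest column sums
  co <- combn(cols, k)
  max(apply(co, 2, coverage))
}

Ns <- c(5, 10, 15)
covs <- sapply(Ns, best_for_N)
print(data.frame(N = Ns, coverage = covs))
```

Because the top-5 columns are a subset of the top-10, which are a subset of the top-15, the best coverage found can never decrease as N grows; only the run time does.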
Older solution
We can approximate the problem with a knapsack problem that chooses the 5 columns with largest numbers of ones using integer linear programming.
We take the 10 best solutions to this approximate problem and collect all columns that appear in at least one of them. There are 14 such columns, and we then use brute force to find which subset of 5 of the 14 columns has the highest coverage.
library(lpSolve)
ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
# each column of m is logical 300-vector defining possible soln
m <- matrix(res$solution[1:3000] == 1, 300)
# cols is the set of columns which are in any of the 10 solutions
cols <- which(rowSums(m) > 0)
length(cols)
## [1] 14
# use brute force to find the 5 best columns among cols
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
You can iteratively test whether there is a better column and swap it with the one currently in the selection.
n <- 5 #Number of columns / events
i <- rep(1, n)
for(k in 1:10) { #How many times to iterate
  tt <- i
  for(j in seq_along(i)) {
    # rows already covered by the other selected columns
    x <- +(rowSums(d[,i[-j]]) > 0)
    # column that covers the most still-uncovered rows
    i[j] <- which.max(colSums(x == 0 & d == 1))
  }
  if(identical(tt, i)) break
}
sort(i)
#[1] 90 123 197 199 286
mean(rowSums(d[,i]) > 0)
#[1] 0.9333333
Taking into account that the initial condition influences the result, you can use random starts.
n <- 5 #Number of columns / events
x <- apply(d, 2, function(x) colSums(x == 0 & d == 1))
diag(x) <- -1
idx <- which(!apply(x==0, 1, any))
x <- apply(d, 2, function(x) colSums(x != d))
diag(x) <- -1
x[upper.tri(x)] <- -1
idx <- unname(c(idx, which(apply(x==0, 1, any))))
res <- sample(idx, n)
for(l in 1:100) {
  i <- sample(idx, n)
  for(k in 1:10) { #How many times to iterate
    tt <- i
    for(j in seq_along(i)) {
      x <- +(rowSums(d[,i[-j]]) > 0)
      i[j] <- which.max(colSums(x == 0 & d == 1))
    }
    if(identical(tt, i)) break
  }
  # keep the best selection found so far
  if(sum(rowSums(d[,i]) > 0) > sum(rowSums(d[,res]) > 0)) res <- i
}
sort(res)
#[1] 90 123 197 199 286
mean(rowSums(d[,res]) > 0)
#[1] 0.9333333
I'm trying to build a function in R that calculates the percentage change between rows at an arbitrary lag, that is, between any given row and the preceding one, or between any given row and the one n rows before it.
perc_change <- function(x,n) {
  y <- c()
  z <- c()
  for (i in 1:length(x)) {
    z[i] <- (x[i]/(x[i-n])-1)*100
  }
  y <- c(rep(NA,n),z[(n+1):length(z)])
  y
}
When n is one the function works properly:
x <- c(2,3.5,4,6)
perc_change(x, 1)
[1] NA 75.00000 14.28571 50.00000
But when I change to 2 or other n, I receive this error:
Error in z[i] <- (x[i]/(x[i - n]) - 1) * 100 :
replacement has length zero
I just can't find why and where the logic of my function is wrong so I appreciate any comment or suggestion.
In the loop, i starts at 1, so i - n can be zero or negative (e.g. when n = 2, 1 - 2 = -1). To avoid that, an if/else condition can be added:
perc_change <- function(x,n) {
  y <- c()
  z <- c()
  for (i in 1:length(x)) {
    if(i > n) {
      z[i] <- (x[i]/(x[i-n])-1)*100
    } else z[i] <- NA
  }
  y <- c(rep(NA,n),z[(n+1):length(z)])
  y
}
perc_change(x, 1)
#[1] NA 75.00000 14.28571 50.00000
perc_change(x, 2)
#[1] NA NA 100.00000 71.42857
perc_change(x, 3)
#[1] NA NA NA 200
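To see why an unguarded index of zero or below causes trouble, it helps to look at what R returns for such indices. This is a minimal base-R sketch; the variable names are chosen for illustration only.

```r
x <- c(2, 3.5, 4, 6)

zero_idx <- x[0]    # a zero index selects nothing: numeric(0)
neg_idx  <- x[-1]   # a negative index drops that element: 3.5 4.0 6.0

# Dividing by an empty vector yields an empty result, and assigning an
# empty result to a single element is what triggers the error.
z <- c()
err <- tryCatch({
  z[2] <- (x[2] / x[0] - 1) * 100
  NULL
}, error = function(e) e)

print(zero_idx)
print(neg_idx)
print(conditionMessage(err))
```

So the error appears at the first i for which i - n is exactly zero, while negative indices fail more quietly by returning a vector of the wrong length.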
The following function lags the input vector and then computes the percent change with vectorized operations, so no for loop is needed. The lag function is copied from the last lines of dplyr::lag.
perc_change <- function(x, n = 1) {
  lag <- function(x, n = 1){
    if(n == 0)
      return(x)
    xlen <- length(x)
    n <- pmin(n, xlen)
    out <- c(rep(NA, n), x[seq_len(xlen - n)])
    attributes(out) <- attributes(x)
    out
  }
  y <- lag(x, n)
  (x/y - 1)*100
}
x <- c(2, 3.5, 4, 6)
perc_change(x)
#[1] NA 75.00000 14.28571 50.00000
perc_change(x, 2)
#[1] NA NA 100.00000 71.42857
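A further base-R possibility, offered here as a sketch rather than as part of the answers above, is to use `diff()` with its `lag` argument. The function name `perc_change_diff` is made up, and the sketch assumes 1 <= n < length(x).

```r
# Hypothetical alternative: x[i] - x[i - n] via diff(), divided by x[i - n]
perc_change_diff <- function(x, n = 1) {
  stopifnot(n >= 1, n < length(x))
  c(rep(NA_real_, n), diff(x, lag = n) / head(x, -n) * 100)
}

x <- c(2, 3.5, 4, 6)
r1 <- perc_change_diff(x, 1)
r2 <- perc_change_diff(x, 2)
print(r1)
print(r2)
```

This computes (x[i] - x[i - n]) / x[i - n] * 100, which is algebraically the same as (x[i] / x[i - n] - 1) * 100.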
I have a piece of working code that is taking too many hours (days?) to compute.
I have a sparse matrix of 1s and 0s. I need to subtract each row from every other row, in all possible combinations, multiply the resulting vector by another vector, and finally average the values in it so as to get a single scalar, which I need to insert into a matrix. What I have is:
m <- matrix(
c(0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0), nrow=4,ncol=4,
byrow = TRUE)
b <- c(1,2,3,4)
d <- matrix(NA, nrow(m), ncol(m)) # result matrix
for (j in 1:dim(m)[1]){
  for (i in 1:dim(m)[1]){
    a <- m[j,] - m[i,]
    a[i] <- 0L
    a[a < 0] <- 0L
    c <- a*b
    d[i,j] <- mean(c[c > 0])
  }
}
The desired output is a matrix with the same dimensions as m, where each entry is the result of these operations.
This loop works, but are there any ideas on how to make this more efficient? Thank you
My naive solution is to use apply and sapply instead of for loops to do the iterations:
sapply(1:dim(m)[1], function(k) {z <- t(apply(m, 1, function(x) m[k,]-x)); diag(z) <- 0; z[z<0] <- 0; apply(t(apply(z, 1, function(x) x*b)),1,function(x) mean(x[x>0]))})
I compared the running time of your solution and mine on my computer. Yours:
t1 <- Sys.time()
d1 <- m
for (j in 1:dim(m)[1]){
  for (i in 1:dim(m)[1]){
    a <- m[j,] - m[i,]
    a[i] <- 0L
    a[a < 0] <- 0L
    c <- a*b
    d1[i,j] <- mean(c[c > 0])
  }
}
Sys.time() - t1
Yours needs a time difference of 0.02799988 secs. Mine is a bit faster, but not by much, with a time difference of 0.01899815 secs when you run
t2 <- Sys.time()
d2 <- sapply(1:dim(m)[1], function(k) {z <- t(apply(m, 1, function(x) m[k,]-x)); diag(z) <- 0; z[z<0] <- 0; apply(t(apply(z, 1, function(x) x*b)),1,function(x) mean(x[x>0]))})
Sys.time() - t2
You can try it on your own computer with larger matrix, good luck!
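Another option, not taken from the answers here, is to replace both loops with matrix products. The sketch below rests on two assumptions that should be checked against the real data: m is square (the loop's a[i] <- 0 step already implies this) and every entry of b is strictly positive, so that "c > 0" is the same as "a > 0". The names `num`, `cnt` and `corr` are invented for this example.

```r
m <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 1,
              0, 1, 0, 0,
              1, 1, 0, 0), nrow = 4, byrow = TRUE)
b <- c(1, 2, 3, 4)

# Reference result from the original double loop
d_loop <- matrix(NA_real_, nrow(m), nrow(m))
for (j in seq_len(nrow(m))) {
  for (i in seq_len(nrow(m))) {
    a <- m[j, ] - m[i, ]
    a[i] <- 0
    a[a < 0] <- 0
    cc <- a * b
    d_loop[i, j] <- mean(cc[cc > 0])
  }
}

# Entry [i, j] sums over columns k where m[j, k] == 1 and m[i, k] == 0
nm   <- 1 - m
num  <- nm %*% (t(m) * b)      # weighted sum of b[k] over those columns
cnt  <- nm %*% t(m)            # number of those columns
corr <- (1 - diag(m)) * t(m)   # contribution of column k == i, which the loop zeroes
d_vec <- (num - corr * b) / (cnt - corr)

print(isTRUE(all.equal(d_loop, d_vec)))
```

Where no column qualifies, both versions give NaN (0/0 in the vectorized form, the mean of an empty vector in the loop). For large sparse inputs the same products could in principle be formed with sparse matrix classes, but that is not shown here.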
1) create test sparse matrix:
library(Matrix)     # sparse matrices: Matrix(), drop0()
library(data.table) # CJ(), data.table()
nc <- nr <- 100
p <- 0.001
M <- Matrix(0L, nr, nc, sparse = T) # 0 matrix
n1 <- ceiling(p * (prod(dim(M)))) # 1 count
M[1:n1] <- 1L # fill only first column, to approximate max non 0 row count
# (each row has at maximum 1 positive element)
b <- 1:ncol(M)
So, if the given proportion is correct, at most 10 rows contain nonzero elements.
Based on this fact and the calculations you supplied:
# a <- m[j, ] - m[i, ]
# a[i] <- 0L
# a[a < 0] <- 0L
# c <- a*b
# mean(c[c > 0])
we can see that the result is meaningful only for rows m[j, ] that have at least one nonzero element
==> we can skip the calculations for all m[j, ] that contain only 0s, so:
minem <- function() { # write as function
t1 <- proc.time() # timing
i <- CJ(1:nr, 1:nr) # generate all combinations
k <- rowSums(M) > 0L # get index where at least 1 element is greater than 0
i <- i[data.table(V1 = 1:nr, k), on = 'V1'] # merge
cat('at most', i[, sum(k)/.N*100], '% of rows needs to be calculated \n')
i[k == T, rowN := 1:.N] # add row nr for 0 subset
i2 <- i[k == T] # subset only those indexes who need calculation
a <- M[i2[[1]],] - M[i2[[2]],] # operate on all combinations at once
a <- drop0(a) # clean up 0
ids <- as.matrix(i2[, .(rowN, V2)]) # ids for 0 subset
a[ids] <- 0L # your line: a[i] <- 0L
a <- drop0(a) # clean up 0
a[a < 0] <- 0L # the same as your line
a <- drop0(a) # clean up 0
c <- t(t(a)*b) # multiply each row with vector
c <- drop0(c) # clean up 0
c[c < 0L] <- 0L # for mean calculation
c <- drop0(c) # clean up 0
r <- rowSums(c)/rowSums(c > 0L) # row means
i[k == T, result := r] # assign results to data.table
i[is.na(result), result := NaN] # set rest to NaN
d2 <- matrix(i$result, nr, nr, byrow = F) # create resulting matrix
t2 <- proc.time() # timing
cat(t2[3] - t1[3], 'sec \n')
d2 # return the result matrix
}
d2 <- minem()
# at most 10 % of rows needs to be calculated
# 0.05 sec
Test on the smaller example whether the results match:
d <- matrix(NA, nrow(M), ncol(M))
for (j in 1:dim(M)[1]) {
  for (i in 1:dim(M)[1]) {
    a <- M[j, ] - M[i, ]
    a[i] <- 0L
    a[a < 0] <- 0L
    c <- a*b
    d[i, j] <- mean(c[c > 0])
  }
}
all.equal(d, d2)
Can we get results for your real data size?:
# generate data:
nc <- nr <- 6663L
b <- 1:nr
p <- 0.0001074096 # proportion of 1s
M <- Matrix(0L, nr, nc, sparse = T) # 0 matrix
n1 <- ceiling(p * (prod(dim(M)))) # 1 count
M[1:n1] <- 1L
# storing this data in usual matrix uses 4000+ times more memory
# calculation:
d2 <- minem()
# at most 71.57437 % of rows needs to be calculated
# 28.33 sec
So you need to convert your matrix to a sparse one with
M <- Matrix(m, sparse = T)
I'm working on a function which will get rid of outliers in a given data set based on 3 sigma rule. My code is presented below. "data" is a data set to be processed.
rm.outlier <- function(data){
  apply(data, 2, function(var) {
    sigma3.plus <- mean(var) + 3 * sd(var)
    sigma3.min <- mean(var) - 3 * sd(var)
    sapply(var, function(y) {
      if (y > sigma3.plus){
        y <- sigma3.plus
      } else if (y < sigma3.min){
        y <- sigma3.min
      } else {y <- y}
    })
  })
  return(data)
}
In order to check if the function works I wrote a short test:
a <- data.frame("var1" = rnorm(10000, 0, 1))
b <- a
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of outliers in a
As a result, I get:
[1] 12
So the variable var1 in the data frame a has 12 outliers. Next, I try to apply my function on this object:
a2 <- rm.outlier(a)
sum(b$var1 - a2$var1)
Unfortunately, it gives 0 which clearly indicates that something does not work. I have already worked out that the implementation of sapply is correct so there must be a mistake in my apply. Any help would be appreciated.
If runtime is important to you, you might consider another approach. You could vectorize this filtering, e.g. by using pmin and pmax, which is equally readable and more than 15 times faster. If you don't mind a little more complexity, you could use findInterval and get even more speed:
rm.outlier2 <- function(x) {
  ## calculate -3/3 * sigma borders
  s <- mean(x) + c(-3, 3) * sd(x)
  pmin(pmax(x, s[1]), s[2])
}
rm.outlier3 <- function(x) {
  ## calculate -3/3 * sigma borders
  s <- mean(x) + c(-3, 3) * sd(x)
  ## sorts x into intervals 0 == left of s[1], 2 == right of s[2], 1
  ## between both s
  i <- findInterval(x, s)
  ## which values are left/right of the interval
  j <- which(i != 1L)
  ## add a value between s to directly use output of findInterval for subsetting
  s2 <- c(s[1], 0, s[2])
  ## replace all values that are left/right of the interval
  x[j] <- s2[i[j] + 1L]
  x
}
Benchmarking the stuff:
library(rbenchmark) # provides benchmark()
## slightly modified OP version
rm.outlier <- function(x) {
  sigma3 <- mean(x) + c(-3,3) * sd(x)
  sapply(x, function(y) {
    if (y > sigma3[2]){
      y <- sigma3[2]
    } else if (y < sigma3[1]){
      y <- sigma3[1]
    } else {y <- y}
  })
}
a <- rnorm(10000, 0, 1)
# check output
all.equal(rm.outlier(a), rm.outlier2(a))
all.equal(rm.outlier2(a), rm.outlier3(a))
benchmark(rm.outlier(a), rm.outlier2(a), rm.outlier3(a),
order = "relative",
columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
#3 rm.outlier3(a) 100 0.028 1.000
#2 rm.outlier2(a) 100 0.102 3.643
#1 rm.outlier(a) 100 1.825 65.179
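The functions above work on a single numeric vector, whereas the question starts from a data frame. One way to bridge the two, sketched here under the assumption that every column is numeric, is to apply the clamping column by column with `lapply`. The name `clamp_3sigma`, the seed and the second column are made up for illustration.

```r
# Hypothetical helper: same pmin/pmax idea as above, under a different name
clamp_3sigma <- function(x) {
  s <- mean(x) + c(-3, 3) * sd(x)
  pmin(pmax(x, s[1]), s[2])
}

set.seed(42)  # arbitrary seed for reproducibility
df <- data.frame(var1 = rnorm(10000), var2 = rnorm(10000, 5, 2))

# Assigning into clamped[] keeps the data frame structure and column names
clamped <- df
clamped[] <- lapply(df, clamp_3sigma)

# Number of values that were changed in var1
s1 <- mean(df$var1) + c(-3, 3) * sd(df$var1)
n_changed <- sum(clamped$var1 != df$var1)
print(n_changed)
```

The count of changed values should match the number of observations outside the 3 sigma band, which gives a quick sanity check that the clamping actually took effect.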
It seems like you just forgot to assign the result of the apply function to a new data frame. (Compare the 3rd line with your code.)
rm.outlier <- function(data){
  # Assign the result to a new dataframe
  data_new <- apply(data, 2, function(var) {
    sigma3.plus <- mean(var) + 3 * sd(var)
    sigma3.min <- mean(var) - 3 * sd(var)
    sapply(var, function(y) {
      if (y > sigma3.plus){
        y <- sigma3.plus
      } else if (y < sigma3.min){
        y <- sigma3.min
      } else {y <- y}
    })
  })
  # Return the new dataframe (apply gives a matrix, so convert it back)
  as.data.frame(data_new)
}
a <- data.frame("var1" = rnorm(10000, 0, 1))
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of too big outliers
# 15
sum(a$var1 < mean(a$var1) - 3 * sd(a$var1)) # number of too small outliers
# 13
# Overall 28 outliers
# Check the function for the number of outliers
a2 <- rm.outlier(a)
sum(a2$var1 == a$var1) - length(a$var1)
I wrote the following code and need to repeat it 100 times. I know I need to use another for loop, but I don't know how to do it. Here is the code:
mean <- c(5,5,10,10,5,5,5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL
for (i in 1:7 ) {
  x[i] <- rexp(1, rate = mean[i])
  u[i] <- (1/1.2)*runif(1, min=0, max=1)
}
y1 <- min(x,u)
if (y1 == min(x)) {
  delta1 <- 1
} else {
  delta1 <- 0
}
if (delta1 == 0) {
  w1 <- NULL
} else {
  if(y1== x[[1]])
    w1 <- "x1"
}
output <- cbind(delta1,w1)
I want the final output to be a 100-row by 3-column matrix representing run number, delta1, and w1.
Any thoughts will be truly appreciated.
Here's what I gather you're trying to achieve from your code:
1) Given two vectors drawn from different distributions (Exponential and Uniform),
2) find out which distribution the smallest number comes from.
3) Repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5) # 'mean' is an inbuilt function
# Initialise the output data frame:
output <- data.frame(number=rep(0, 100), delta1=rep(1, 100), w1=rep("x1", 100))
for (i in 1:100) {
  # Generating u doesn't require a for loop. Additionally, the (1/1.2)
  # factor can be folded into max (1/1.2 = 5/6).
  u <- runif(7, min=0, max=5/6)
  # Generating x doesn't need a loop either. It's better to use apply functions
  # when you can!
  x <- sapply(rates, function(x) { rexp(1, rate=x) })
  y1 <- min(x, u)
  # Now we can store the output
  output[i, "number"] <- y1
  # Two things here:
  # 1) use all.equal instead of == to compare floating point numbers
  # 2) We initialised the data frame to assume they always came from x.
  #    So we only need to overwrite it where it comes from u.
  if (isTRUE(all.equal(y1, min(u)))) {
    output[i, "delta1"] <- 0
    output[i, "w1"] <- NA # Can't use NULL in a character vector.
  }
}
Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100 # number of runs
res <- t(replicate(n, {
  x <- rexp(n = length(Mean), rate = Mean)
  u <- runif(n = length(Mean), min = 0, max = 1/1.2)
  mx <- min(x)
  delta1 <- mx <= min(u)
  w1 <- delta1 & mx == x[1]
  c(delta1, w1)
}))
output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
w1 = c(NA, "x1")[res[ , 2] + 1])
The result:
# run delta1 w1
# 1 1 1 <NA>
# 2 2 1 <NA>
# 3 3 1 <NA>
# 4 4 1 x1
# 5 5 1 <NA>
# 6 6 0 <NA>
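If the explicit repetition is not needed, all runs can also be drawn at once as matrices, one row per run. This is a sketch under the same reading of the problem as the replicate version above; the seed is arbitrary and the intermediate names (`min_x`, `min_u`) are invented.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible
rates <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100              # number of runs
k <- length(rates)

# n x k matrices; matrix() fills column-wise, so column j uses rates[j]
x <- matrix(rexp(n * k, rate = rep(rates, each = n)), nrow = n)
u <- matrix(runif(n * k, min = 0, max = 1 / 1.2), nrow = n)

min_x <- apply(x, 1, min)   # smallest exponential draw per run
min_u <- apply(u, 1, min)   # smallest uniform draw per run

delta1 <- as.integer(min_x <= min_u)
w1 <- ifelse(delta1 == 1 & min_x == x[, 1], "x1", NA_character_)

output <- data.frame(run = seq_len(n), delta1 = delta1, w1 = w1)
print(head(output))
```

Each row of `output` corresponds to one run, giving the requested 100 x 3 layout.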