"Approximating" the derivative of date points in R - r

So I have a time series of MODIS NDVI values (vegetation values from 0-1 for the non-geographic geeks), and I'm trying to approximate the derivative by using a for loop.
This is a sample of my data:
> m2001
date value valnorm
1 1 0.4834 0.03460912
2 17 0.4844 0.03664495
3 33 0.5006 0.06962541
4 49 0.4796 0.02687296
5 65 0.5128 0.09446254
6 81 0.4915 0.05109935
7 97 0.4664 0.00000000
8 113 0.5345 0.13864007
9 129 0.8771 0.83611564
10 145 0.9529 0.99043160
11 161 0.9250 0.93363192
12 177 0.9450 0.97434853
13 193 0.9491 0.98269544
14 209 0.9434 0.97109121
15 225 0.9576 1.00000000
16 241 0.8992 0.88110749
17 257 0.9115 0.90614821
18 273 0.8361 0.75264658
19 289 0.5725 0.21600163
20 305 0.5188 0.10667752
21 321 0.5467 0.16347720
22 337 0.5484 0.16693811
23 353 0.5427 0.15533388
Column 1 is the Julian day of the pixel value.
Column 2 is the raw NDVI value.
Column 3 is the NDVI stretched from 0 to 1 (it's a normalization technique, since NDVI rarely actually reaches 1 or 0).
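For reference, the valnorm column matches a standard min-max rescale of the raw NDVI over this series; a quick sketch to reproduce it (assuming the data frame is called m2001 as above):
# Min-max stretch of the raw NDVI to 0-1; reproduces the valnorm column above.
m2001$valnorm <- (m2001$value - min(m2001$value)) /
  (max(m2001$value) - min(m2001$value))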
I'm still very new to programming and R, but I think I've managed to piece together a tenuous grasp on it. What I'm trying to do is create a new column with values that would give me some idea of the local slope of data points.
The function I've come up with is this:
deriv <- function(x1=1:23, x2=1){
  for (i in x1){
    i1 <- c(x1[i-1], x1[i], x1[i+1])
    i2 <- c(x2[i-1], x2[i], x2[i+1])
    deriv.func <- lm(i2~i1, na.action=NULL)
  } return(deriv.func$coef[[2]])
}
What happens when I run it is this:
> deriv <- function(x1=1:23, x2=1){
+ for (i in x1){
+ i1 <- c(x1[i-1], x1[i], x1[i+1])
+ i2 <- c(x2[i-1], x2[i], x2[i+1])
+ deriv.func <- lm(i2~i1, na.action=NULL)
+ } return(deriv.func$coef[[2]])
Error: unexpected symbol in:
"deriv.func <- lm(i2~i1, na.action=NULL)
} return"
> }
Error: unexpected '}' in "}"
>
I'm not sure what I'm doing wrong, as I can get it to parse when I fill in a value for i:
> i=6
> x1=m2001$date
> x2=m2001$valnorm
> i1 <- c(x1[i-1], x1[i], x1[i+1])
> i2 <- c(x2[i-1], x2[i], x2[i+1])
> i1
[1] 33 49 65
> i2
[1] 0.06962541 0.02687296 0.09446254
> lm(i2 ~ i1)
Call:
lm(formula = i2 ~ i1)
Coefficients:
(Intercept) i1
0.0256218 0.0007762
> func <- lm(i2 ~ i1)
> func$coef[[2]]
[1] 0.0007761604
Any ideas? Thanks a ton.

Try putting 'return' on a new line.
}
return(deriv.func$coef[[2]])
}
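Written out in full, the question's function parses once return() sits on its own line (the logic is otherwise unchanged; as written it still returns only the slope from the last iteration, which the asker addresses below):
deriv <- function(x1 = 1:23, x2 = 1){
  for (i in x1){
    i1 <- c(x1[i - 1], x1[i], x1[i + 1])
    i2 <- c(x2[i - 1], x2[i], x2[i + 1])
    deriv.func <- lm(i2 ~ i1, na.action = NULL)
  }
  return(deriv.func$coef[[2]])  # only the last fit's slope is returned
}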

Well, after looking (a lot) more into the for loop, I got it to do what I want.
deriv <- function(x1=1:23, x2=1){
  n <- length(x1)
  deriv.func <- character(length = n)   # pre-allocate the output vector
  for (i in 1:n) {
    i1 <- c(x1[i-1], x1[i], x1[i+1])
    i2 <- c(x2[i-1], x2[i], x2[i+1])
    derivate <- lm(i2 ~ i1)
    deriv.func[i] <- derivate$coef[[2]]
  }
  return(deriv.func)
}
Thanks for the help, and the tip in the right direction, @dbaseman!
Ideas that made a difference:
- making sure I had space allocated for the output vector: deriv.func <- character(length = n);
- making sure the intermediate variables didn't overwrite the output.
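For completeness, a usage sketch with the sample data above (the column names m2001$date and m2001$valnorm are taken from the question); since a least-squares line through three equally spaced points has exactly the centred-difference slope, a plain centred difference gives a comparable estimate at the interior points:
# Attach the local slopes from deriv() as a new column; the values come back
# as character (because of character(length = n)), hence as.numeric().
m2001$slope <- as.numeric(deriv(m2001$date, m2001$valnorm))

# Cross-check interior points with a centred difference,
# (y[i+1] - y[i-1]) / (t[i+1] - t[i-1]); the endpoints are left as NA here.
slope_check <- c(NA,
                 diff(m2001$valnorm, lag = 2) / diff(m2001$date, lag = 2),
                 NA)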

Optimize a function using gradient descent

Growing degree days (GDD) is a concept in plant phenology where a given crop needs to accumulate a certain amount of thermal units each day in order to move from one stage to the next.
I have thermal units data available at daily resolution for a given site for 10 years as follows:
set.seed(1)
avg_temp <- data.frame(year_ref = rep(2001:2010, each = 365),
                       doy = rep(1:365, times = 10),
                       thermal.units = sample(0:40, 3650, replace = TRUE))
I also have a crop grown at this site that should take 110 days to mature if planted on day 152:
planting_date <- 152
observed_days_to_mature <- 110
I also have some initial random guess on how many thermal units this crop in general might accumulate in each stage from planting to full maturity. For example, in the example below, stage 1 needs to accumulate 50 thermal units since planting, stage 2 needs 120 thermal units since planting, stage 3 needs 190 thermal units since planting, and so on.
gdd_data <- data.frame(stage_id = 1:4,
                       gdd_required = c(50, 120, 190, 250))
So, given the GDD requirements, I can calculate, for each year, how many days this crop takes to mature.
library(dplyr)
library(data.table)
days_to_mature_func <- function(gdd_data_df, avg_temp_df, planting_date_d){
  gdd.vec <- gdd_data_df$gdd_required
  year_vec <- sort(unique(avg_temp_df$year_ref))
  temp_ls <- list()
  for(y in seq_along(year_vec)){
    year_id <- year_vec[y]
    weather_sub <- avg_temp_df %>%
      dplyr::filter(year_ref == year_id &
                      doy >= planting_date_d)
    stage_vec <- unlist(lapply(1:length(gdd.vec), function(x)
      planting_date_d - 1 + which.max(cumsum(weather_sub$thermal.units) >= gdd.vec[x])))
    stage_vec[length(stage_vec)] <- ifelse(stage_vec[length(stage_vec)] <= stage_vec[length(stage_vec) - 1],
                                           NA, stage_vec[length(stage_vec)])
    gdd_doy <- as.data.frame(t(as.data.frame(stage_vec)))
    names(gdd_doy) <- paste0('stage_doy', 1:length(stage_vec))
    gdd_doy$year_ref <- year_id
    temp_ls[[y]] <- gdd_doy
  }
  days_to_mature_mod <- rbindlist(temp_ls)
  return(days_to_mature_mod)
}
days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
days_to_mature_mod
stage_doy1 stage_doy2 stage_doy3 stage_doy4 year_ref
1: 154 160 164 167 2001
2: 154 157 159 163 2002
3: 154 157 160 162 2003
4: 155 157 163 165 2004
5: 154 156 160 164 2005
6: 154 161 164 168 2006
7: 154 156 159 161 2007
8: 155 158 161 164 2008
9: 154 156 160 163 2009
10: 154 158 160 163 2010
Since the crop should be taking 110 days to mature, I define the error as:
error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
My question is: how do I optimise gdd_required in gdd_data to minimise this error?
One method I have implemented is to loop over a sequence of factors that scale down the gdd_required values at each step and to calculate the error for each factor; the factor with the lowest error is the final factor that I apply to the gdd_required data. I have been reading about the gradient descent algorithm, which might make this process quicker, but unfortunately I don't yet have enough technical expertise to implement it.
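For reference, a minimal sketch of that factor-scan idea, using the objects defined above (the factor grid itself is made up for illustration):
# Try a grid of scaling factors on the initial gdd_required values and keep
# the factor giving the smallest error (same error definition as above).
factors <- seq(0.5, 1.5, by = 0.05)
errors <- sapply(factors, function(f) {
  gdd_try <- transform(gdd_data, gdd_required = gdd_required * f)
  mod <- days_to_mature_func(gdd_try, avg_temp, planting_date)
  mean(mod$stage_doy4 - observed_days_to_mature)^2
})
best_factor <- factors[which.min(errors)]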
From comment: I do have a condition that wasn't explicit - the x in the function that I am optimising are ordered i.e. x[1] < x[2] < x[3] < x[4] since they are cumulative.
Building on your example, you can define a function that takes arbitrary gdd_required and returns the fit:
optim_function <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = x)
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}
The function optim allows you to find the parameters that reach a minimum, starting from the initial set you used, e.g.
optim(c(50, 120, 190, 250), optim_function)
#$par
#[1] 266.35738 199.59795 -28.35870 30.21135
#
#$value
#[1] 1866.24
#
#$counts
#function gradient
# 91 NA
#
#$convergence
#[1] 0
#
#$message
#NULL
So a best fit of around 1866 is found with parameters 266.35738, 199.59795, -28.35870, 30.21135.
The help page gives some pointers on doing constrained optimisation if it is important that they are in a specific range.
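For instance, box constraints can be supplied through method = "L-BFGS-B" (the bounds below are illustrative placeholders, not values from the question); note that with a step-like objective such as this one, the derivative-free default method may still behave better, and box constraints alone do not enforce the ordering discussed next:
# Sketch: the same objective with box constraints on each parameter.
optim(c(50, 120, 190, 250), optim_function,
      method = "L-BFGS-B",
      lower = c(10, 60, 110, 160),
      upper = c(150, 250, 350, 450))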
Given your comment that the parameters should be strictly increasing, you can transform arbitrary values into increasing ones with cumsum(exp()), so your code would become:
optim_function_plus <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = cumsum(exp(x)))
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}
opt <- optim(log(c(50, 70, 70, 60)), optim_function_plus)
opt
# $par
# [1] 1.578174 2.057647 2.392850 3.241456
#
# $value
# [1] 1953.64
#
# $counts
# function gradient
# 57 NA
#
# $convergence
# [1] 0
#
# $message
# NULL
To get the parameters back on the scale you're interested in, you'd need to do:
cumsum(exp(opt$par))
# [1] 4.846097 12.673626 23.618263 49.189184
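As a quick sanity check (a sketch, not part of the original answer), the back-transformed requirements can be fed back into the helper from the question to inspect the fitted stage days per year:
# Rebuild gdd_data with the optimised, strictly increasing requirements and
# rerun the original helper.
gdd_fit <- data.frame(stage_id = 1:4, gdd_required = cumsum(exp(opt$par)))
days_to_mature_func(gdd_fit, avg_temp, planting_date)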

DFFITs for Beta Regression

I am trying to calculate DFFITS for a GLM where the responses follow a beta distribution, fitted using the betareg R package. But I think this package doesn't support influence.measures(), because when I use dffits() with the following code:
require(betareg)
df<-data("ReadingSkills")
y<-ReadingSkills$accuracy
n<-length(y)
bfit<-betareg(accuracy ~ dyslexia + iq, data = ReadingSkills)
DFFITS<-dffits(bfit, infl=influence(bfit, do.coef = FALSE))
DFFITS
it yields
Error in if (model$rank == 0) { : argument is of length zero
I am a newbie in R and don't know how to resolve this problem. Kindly help me solve this, or give me some tips, in R code, on how to calculate DFFITS manually.
Regards
dffits are not implemented for "betareg" objects, but you could try to calculate them manually.
According to this Stack Overflow Q/A we could write this function:
dffits1 <- function(x1, bres.type="response") {
  stopifnot(class(x1) %in% c("lm", "betareg"))
  sapply(1:length(x1$fitted.values), function(i) {
    x2 <- update(x1, data=x1$model[-i, ])  # leave one out
    h <- hatvalues(x1)
    nm <- rownames(x1$model[i, ])
    num_dffits <- suppressWarnings(predict(x1, x1$model[i, ]) -
                                     predict(x2, x1$model[i, ]))
    residx <- if (class(x1) == "betareg") {
      betareg:::residuals.betareg(x2, type=bres.type)
    } else {
      x2$residuals
    }
    denom_dffits <- sqrt(c(crossprod(residx)) / x2$df.residual * h[i])
    return(num_dffits / denom_dffits)
  })
}
It works well for lm:
fit <- lm(mpg ~ hp, mtcars)
dffits1(fit)
stopifnot(all.equal(dffits1(fit), dffits(fit)))
Now let's try betareg:
library(betareg)
data("ReadingSkills")
bfit <- betareg(accuracy ~ dyslexia + iq, data=ReadingSkills)
dffits1(bfit)
# 1 2 3 4 5 6 7
# -0.07590185 -0.21862047 -0.03620530 0.07349169 -0.11344968 -0.39255172 -0.25739032
# 8 9 10 11 12 13 14
# 0.33722706 0.16606198 0.10427684 0.11949807 0.09932991 0.11545263 0.09889406
# 15 16 17 18 19 20 21
# 0.21732090 0.11545263 -0.34296030 0.09850239 -0.36810187 0.09824013 0.01513643
# 22 23 24 25 26 27 28
# 0.18635669 -0.31192106 -0.39038732 0.09862045 -0.10859676 0.04362528 -0.28811277
# 29 30 31 32 33 34 35
# 0.07951977 0.02734462 -0.08419156 -0.38471945 -0.43879762 0.28583882 -0.12650591
# 36 37 38 39 40 41 42
# -0.12072976 -0.01701615 0.38653773 -0.06440176 0.15768684 0.05629040 0.12134228
# 43 44
# 0.13347935 0.19670715
That doesn't look bad.
Notes:
- Even if this works in code, you should check whether it meets your statistical requirements!
- I've used suppressWarnings() around the predict() calls in dffits1: predict(bfit, ReadingSkills) somehow drops the contrasts, whereas predict(bfit) does not (they should practically be the same). However, the results are identical, as all.equal(predict(bfit, ReadingSkills), predict(bfit)) confirms, so ignoring the warnings should be safe.
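As a usage note (a sketch, not part of the original answer): a common rule of thumb flags observations with |DFFITS| greater than 2*sqrt(k/n), where k is the number of estimated parameters.
d <- dffits1(bfit)
k <- length(coef(bfit))          # includes the precision (phi) coefficient here
n <- length(d)
which(abs(d) > 2 * sqrt(k / n))  # indices of potentially influential points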

Error in for loop when looping through a specific column in R

I'm a beginner in R programming. I was doing an exercise for an online course and I came up with a question (which does not concern the exercise itself). I have the following DF:
> str(DF)
'data.frame': 915 obs. of 3 variables:
$ sal: int 22000000 15714286 13650000 13571429 13350000 13050000 13000000 12600000 12500000 12500000 ...
$ AB : int 632 36 574 503 80 529 NA 614 577 364 ...
$ OBP: num 0.399 0.154 0.408 0.384 0.143 ...
I had to make a function that returned the first three consecutive values of the $sal column whose sum was at most 15 million. Here's what I did:
> for(i in DF){
+ x <- 1
+ y <- 3
+ while(sum(i[x:y]) > 15*10^6){
+ x <- x + 1
+ y <- y + 1
+ if(sum(i[x:y]) <= 15*10^6){
+ print(c(x:y))
+ }
+ }
+ }
[1] 138 139 140
This works, but of course the for loop runs through the entire DF. If I wanted it to run only through a specific column, I'd use for(i in DF$column). However, when I do this, I get the following error:
Error in while (sum(i[x:y]) > 15 * 10^6) { :
missing value where TRUE/FALSE needed
This is my question: why is this happening? Hope my question was clear.
The problem is that you have missing values in your data (see your column AB).
Try replacing sum(...) by sum(..., na.rm = TRUE).
By the way, what you could do instead of looping is something like:
test <- round(runif(100) * 1e7)
test[1] <- NA
test2 <- RcppRoll::roll_sum(test, n = 3, na.rm = TRUE)
sapply(which(test2 > 15e6), function(x) x + 0:2)
# outer(0:2, which(test2 > 15e6), '+') may be faster
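If you'd rather stay in base R, a sketch of the same rolling-window idea applied to the question's column (assuming DF$sal is available) could look like this:
# Each row of embed() holds one 3-wide window of sal (in reverse order, which
# doesn't matter for a sum); pick the first window whose sum is <= 15 million.
win  <- embed(DF$sal, 3)
sums <- rowSums(win, na.rm = TRUE)
idx  <- which(sums <= 15e6)[1]
idx:(idx + 2)   # positions of the three consecutive salaries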

Finding corresponding values from already subsetted vectors in R?

Background: I have my dataset as a CSV file called D (please load it into your R session):
D = read.csv("https://docs.google.com/uc?id=0B5V8AyEFBTmXWU40TUZGTURQWlU&export=download")
I use the following function to obtain 2 quantities from my data (please source it into your R session):
source("https://docs.google.com/uc?id=0B5V8AyEFBTmXWTk0LWhaMkY2b3M&export=download")
The 2 quantities are obtained as follows:
b = BF.d.pvalue(t = D$t.value, n1 = D$n1, n2 = D$n2)
BF = b[1, ] ; p.value = b[2, ]
Subsetting details: I have subsetted the p.values larger than .05 and their corresponding BFs:
pvalue.05_1 = p.value[p.value > .05]
BF.pvalue.05_1 = BF[p.value > .05]
I have further subsetted BF.pvalue.05_1 that are between 1/10 and 1/3:
BF.pvalue.05_1_.1_.33 = BF.pvalue.05_1[BF.pvalue.05_1 > 1/10 & BF.pvalue.05_1 <= 1/3]
Question: Now I'm wondering how I can find the corresponding p.value for BF.pvalue.05_1_.1_.33 above?
The preferred way to do this is to combine your data into a data frame and then use the subset command for filtering.
myDf = data.frame(p = p.value, BF = BF)
head(myDf)
# p BF
# 1 2.274873e-06 6.241835e+03
# 2 3.811612e-02 1.736017e+00
# 3 0.000000e+00 2.592434e+147
# 4 0.000000e+00 1.982820e+130
# 5 0.000000e+00 1.315152e+29
# 6 9.992007e-15 4.442134e+11
Now, whenever you subset your data rowwise, you will have access to both the p value and the BF value.
firstSubset = subset(myDf, p > .05)
dim(firstSubset)
# [1] 175 2
secondSubset = subset(firstSubset, BF > .1 & BF < 1/3)
dim(secondSubset)
# [1] 76 2
head(secondSubset)
# p BF
# 28 0.8518770 0.3131790
# 34 0.9358011 0.2910234
# 35 0.9302671 0.2911639
# 52 0.6825720 0.3101911
# 88 0.7201547 0.2770751
# 96 0.6472360 0.2868055
Alternatively, you can apply all the conditions simultaneously:
secondSubset = subset(myDf, (BF > .1) & (BF < 1/3) & (p > .05))
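For completeness, the same correspondence can also be kept with plain index vectors instead of a data frame (a sketch using the question's original objects; the final variable names are just illustrative):
# Record which positions survive each filter so BF and p.value stay aligned.
keep  <- which(p.value > .05)
keep2 <- keep[BF[keep] > 1/10 & BF[keep] <= 1/3]
BF.sub <- BF[keep2]
p.sub  <- p.value[keep2]   # the corresponding p values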

Probability heatmap in ggplot

I asked this question a year ago and got code for this "probability heatmap":
numbet <- 32
numtri <- 1e5
prob <- 5/6
# Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
                 idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
                ymin=c(0, head(seq_along(V1)/length(V1), -1)),
                ymax=seq_along(V1)/length(V1),
                fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", formatter="percent", low="red", high="blue") +
  scale_y_continuous(formatter="percent") +
  xlab("Bet")
print(p)
(May need to change this code slightly because of this)
This is almost exactly what I want, except each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), whereas shaft 7 should have 8 (N+1).
If I'm right, the reason the code does this is that it works from the observed data, and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequencies of heads for n=1:32 trials and plotted the graph now. It will be what you expect. I have read some of your earlier posts here on SO and on math.stackexchange. Still, I don't understand why you'd want to simulate the experiment rather than generate from a binomial R.V. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.
set.seed(42)
numbet <- 32
numtri <- 1e5
prob <- 5/6
require(plyr)
out <- ldply(1:numbet, function(idx) {
  outcome <- dbinom(idx:0, size=idx, prob=prob)
  bet <- rep(idx, length(outcome))
  N <- round(outcome * numtri)
  ymin <- c(0, head(seq_along(N)/length(N), -1))
  ymax <- seq_along(N)/length(N)
  data.frame(bet, fill=outcome, ymin, ymax)
})
require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", low="red", high="blue") +
  xlab("Bet")
The plot:
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather one way to look at it) is to use the idea that if you have two binomial distributions, X ~ B(n, p) and Y ~ B(m, p), where n, m = size and p = probability of success, then their sum X + Y ~ B(n + m, p) (1). So, the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become obvious, and it can be constructed in no time (without any need for a for-loop or for constructing a cumsum every time).
If you have followed me so far, then our idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability = 5/6 and 1/6 respectively. That is, if you have numtri = 1000, then you'll have ~ 834 0's and 166 1's for each of the numbet columns (= 32 here). Let's construct this and test it first.
numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
[22] 163 164 176 162 160 177 157 163 166 146 170
# So, the counts of 1's are "approximately" what we expect (around 166).
Now, each of these columns contains numtri samples from a binomial distribution with n = 1. If we were to add the first two columns and replace the second column with this sum, then, from (1), since the probabilities are equal, we'd end up with a binomial distribution with n = 2. Similarly, if you had instead added the first three columns and replaced the 3rd column by this sum, you would have obtained a binomial distribution with n = 3, and so on...
The concept is that if you cumulatively add each column, then you end up with numbet number of binomial distributions (1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) over each row, in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
this will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
Now on to why you have difficulties obtaining the graph I had attached (with n+1 bins):
For a binomial distribution with n=1:32 trials, 5/6 as probability of tails (failures) and 1/6 as the probability of heads (successes), the probability of k heads is given by:
nCk * (1/6)^k * (5/6)^(n-k) # where nCk is n choose k
For the test data we've generated, the observed proportions of k heads for n=7 and n=8 (trials) are:
# n=7
0 1 2 3 4 5
.278 .394 .233 .077 .016 .002
# n=8
0 1 2 3 4 5
.229 .375 .254 .111 .025 .006
Why do they both have 6 bins and not 8 and 9 bins? Of course, this has to do with the value of numtri=1000. Let's see what the probabilities of each of these 8 and 9 bins are, by generating the probabilities directly from the binomial distribution using dbinom, to understand why this happens.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities corresponding to k=6,7 (for n=7) and k=6,7,8 (for n=8) are ~ 0. They are very low in value. The minimum value here is actually about 5.95e-07 (n=8, k=8 heads), which means you'd expect to see that outcome roughly once in about 1.7 million (1/5.95e-07) simulations. If you check the same for n=32 and k=32, the value is about 1.26e-25. So, you'd have to simulate that many values to get at least one result where all 32 outcomes are heads for n=32.
This is why your results did not have values for certain bins: the probability of landing in them is very low for the given numtri. And for the same reason, generating the probabilities directly from the binomial distribution overcomes this problem/limitation.
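To see the unrounded tail probabilities quoted above directly (in dbinom's parameterisation here, "all tosses are heads" means zero successes at prob = 5/6):
dbinom(0, 8,  prob = 5/6)   # (1/6)^8  ~ 5.95e-07: all 8 tosses are heads
dbinom(0, 32, prob = 5/6)   # (1/6)^32 ~ 1.26e-25: all 32 tosses are heads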
I hope I've managed to write with enough clarity for you to follow. Let me know if you've trouble going through.
Edit 2:
When I ran the simulation code above with numtri=1e6, I get these counts of the number of heads (k=0:7 and k=0:8) for n=7 and n=8:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that there are now counts for k=6 and k=7 for n=7 and n=8. Also, for n=8, you have a value of 1 for k=8. With increasing numtri you'll obtain more of the other missing bins, but it'll require a huge amount of time/memory (if it's feasible at all).
