R - polynomial regression issue - model limited to finite number of output values

I'm trying to calculate point slopes from a series of x,y data. Because some of the x data repeats (...8, 12, 12, 16...), there will be a division-by-zero issue when using slope = (y2-y1)/(x2-x1).
My solution is to fit a polynomial regression to the data, then plug a new set of x values (xx) into the equation that increases monotonically between the limits of x. This eliminates the problem of equal x data points. As a result, (x) and (xx) have the same limits, but (xx) is always longer.
The problem I am having is that the fitted values for xx are limited to the length of x. When I evaluate the polynomial with an (xx) of length 20, the fitted yy results contain values for the first 10 points and NA for the next 10. What is wrong here?
x <- c(1,2,2,5,8,12,12,16,17,20)
y <- c(2,4,5,6,8,11,12,15,16,20)
df <- data.frame(x,y)
my_mod <- lm(y ~ poly(x,2,raw=T), data=df) # This creates the polynomial equation
xx <- x[1]:x[length(x)] # Creates monotonically increasing x using boundaries of original x
yy <- fitted(my_mod)[order(xx)]
plot(x,y)
lines(xx,yy)

If you look at fitted(my_mod), it outputs:
# 1 2 3 4 5 6 7 8 9 10
#3.241032 3.846112 3.846112 5.831986 8.073808 11.461047 11.461047 15.303305 16.334967 19.600584
This means the names of the output match the positions of x, not the values of x, so fitted(my_mod)[order(xx)] doesn't quite make sense.
You want to use predict here:
yy <- predict(my_mod, newdata = data.frame(x = xx))
plot(xx, yy)
# 1 2 3 4 5 6 7 8 9 10
# 3.241032 3.846112 4.479631 5.141589 5.831986 6.550821 7.298095 8.073808 8.877959 9.710550
# 11 12 13 14 15 16 17 18 19 20
# 10.571579 11.461047 12.378953 13.325299 14.300083 15.303305 16.334967 17.395067 18.483606 19.600584


Average Mean w/ Forecast Horizon > 1 in R

I use the updated greybox package in R to forecast the next 2 values (horizon = 2) with a moving-average scheme (see the first block of code below), where the window size is equal to 3.
For example, the overall goal is to take the average (1+2+3)/3 = 2 as the forecasted value at horizon 1 (h=1), and then make use of the predicted value from h=1 for h=2, where (2+3+2)/3 ≈ 2.3333.
The following forecast origin will make use of the window (2+3+4), where 4 is the actual value, to predict the next h1 and h2, which equal 3 and approximately 3.3333 respectively.
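To make the scheme concrete, here is a minimal sketch of that recursion in plain R (ma_forecast is a hypothetical helper written for illustration, not a greybox function):
ma_forecast <- function(window, h = 2, n = 3) {
  # recursive moving-average forecast: each step feeds the previous forecast back in
  out <- numeric(h)
  for (i in seq_len(h)) {
    out[i] <- mean(tail(window, n)) # average of the last n observed/forecast values
    window <- c(window, out[i])
  }
  out
}
ma_forecast(c(1, 2, 3)) # 2.000000 2.333333 -- matches the hand calculation above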
Yet the prediction result I want, ValuesMA[[3]], only emits one row, i.e. values for the first horizon. The number of rows should equal the predefined horizon, which is two.
I have code for an AR(1) process which works perfectly (second block of code). At the end I add an MAE test statistic to evaluate the model.
Can anyone help?
Thank you!
This is the underlying code I use:
#data
z <- c(1,2,3,4,5,6,7)
ourCall <- "mean(x=data,n.ahead=h)"
ourValue <- c("pred")
# Return a list for a forecasting horizon h
ValuesMA <- ro(z, h=2, origins=3, call=ourCall, ci=TRUE, co=TRUE)
ValuesMA[[3]]
**Which yields:**
origin3 origin4 origin5
[1,] 2 3 4
**But I want:**
origin3 origin4 origin5
[1,] 2 3 4
[2,] 2.3333 3.3333 4.3333
#data
z <- c(1,2,3,4,5,6,7)
# ci defines constant in-sample window size, co defines whether the holdout sample window size should be constant
ourCall <- "predict(arima(x=data,order=c(1,0,0)),n.ahead=h)"
# Ask for predicted values and the standard error
ourValue <- c("pred","se")
# Return a list for a forecasting horizon h with a rolling holdout size equal to origin
ValuesAR1 <- ro(z, h=2, origins=3, call=ourCall, value=ourValue, ci=TRUE, co=TRUE)
# calculate MAE
MAE_AR1 <- apply(abs(ValuesAR1$holdout - ValuesAR1$pred),1,mean,na.rm=TRUE) / mean(ValuesAR1$actuals)
ValuesAR1[[3]]
**Which yields:**
> ValuesAR1[[3]]
origin3 origin4 origin5
h1 2 3 4
h2 2 3 4
For further reading see: https://cran.r-project.org/web/packages/greybox/vignettes/ro.html

Obtain values from simulated mppm in spatstat

I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I generate simulated realisations of this model and obtain the x, y, and marks attributes of the simulations?
I fitted my model as such:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp and are independent realisations of the same experiment.
sim <- simulate(model) allows me to generate simulated realisations of this model, and plot(sim, axes = TRUE) to plot them. sim itself is a hyperframe object:
> sim
Hyperframe:
Sim1
1 (ppp)
2 (ppp)
3 (ppp)
How can I access the values (x, y, and marks) in this hyperframe? My goal is to generate a large number of independent realisations from my model and to use the simulated values for another task. Is there a practical way to obtain, retrieve and save these values?
Since you say you want to simulate this many times, the code here uses two simulations (rather than one as in the question):
library(spatstat)
data <- list(amacrine, amacrine, amacrine)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
sim <- simulate(model, nsim = 2)
#> Generating simulated realisations of 3 models..
#> 1, 2, 3.
Now sim is a hyperframe with 2 columns (one for each simulation). Each
column is a list of 3 point patterns. To get the three sets of coordinates
and marks for the first simulation use as.data.frame on each point pattern:
co1 <- lapply(sim$Sim1, as.data.frame)
Then co1 is a list of length 3, and we can print out the first few
coordinates with the head() command, e.g. the coordinates of the third
point pattern:
head(co1[[3]])
#> x y marks
#> 1 0.4942587 0.7889985 off
#> 2 0.6987270 0.7637359 on
#> 3 0.3926415 0.6819965 on
#> 4 0.7982686 0.9060733 off
#> 5 1.3507722 0.9731363 on
#> 6 0.6450985 0.6924126 on
We can extract the coordinates and marks for each simulation by another lapply that
runs over every simulation (in this case 2):
co <- lapply(sim, function(x) lapply(x, as.data.frame))
Now co is a list with 2 elements, and each element is a list of 3 sets of
coordinates:
length(co)
#> [1] 2
length(co[[2]])
#> [1] 3
head(co[[2]][[3]])
#> x y marks
#> 1 0.1660580 0.04180501 on
#> 2 0.7840025 0.71727782 on
#> 3 1.2011733 0.17109112 on
#> 4 1.0429867 0.49284639 on
#> 5 1.1411869 0.86711072 off
#> 6 1.0375942 0.06427601 on
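If you then want to save all the simulated values for later use, one possibility (a sketch; the file name is made up) is to flatten everything into a single data frame tagged with simulation and pattern indices:
# stack every simulation/pattern combination into one data frame
all_df <- do.call(rbind, lapply(seq_along(co), function(i) {
  do.call(rbind, lapply(seq_along(co[[i]]), function(j) {
    cbind(sim = i, pattern = j, co[[i]][[j]])
  }))
}))
write.csv(all_df, "simulated_points.csv", row.names = FALSE) # hypothetical file name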

Error in generating a Poisson point pattern with calculated lambda

I have a two-dimensional point pattern (no marks) and I am trying to test for clustering in the presence of spatial inhomogeneity, using envelopes and the inhomogeneous pair correlation function. I am estimating an inhomogeneous intensity function for the data using the density.ppp function. Here is some sample data:
x y
1 533.03 411.58
2 468.39 622.92
3 402.86 530.94
4 427.13 616.81
5 495.20 680.62
6 566.61 598.99
7 799.03 585.16
8 1060.09 544.23
9 144.66 747.40
10 138.14 752.92
11 449.49 839.15
12 756.45 713.72
13 741.01 728.41
14 760.22 740.28
15 802.34 756.21
16 799.04 764.89
17 773.81 771.97
18 768.41 720.07
19 746.14 754.11
20 815.40 765.14
There are ~1700 data points overall.
Here is my code:
library("spatstat")
WT <- read.csv("Test.csv")
colnames(WT) <- c("x","y")
#determine bounding window
win <- ripras(WT)
unitname(win) <- c("micrometer")
#convert to ppp data class
WT.ppp <- as.ppp(WT, win)
plot(WT.ppp)
#estimate intensity function using cross validation
I <- density.ppp(WT.ppp,sigma=bw.diggle(WT.ppp),adjust=0.3,kernal="epanechnikov")
plot(I)
#predetermined r values for PCF
radius <- seq(from = 0, to = 50, by = 0.5)
#use envelopes to test the null hypothesis (i.e. an inhomogeneous Poisson process)
PCF_envelopes <- envelope(WT.ppp,divisor="d", pcfinhom,r = radius,nsim=10,simulate=expression(rpoispp(I)) )
When I run rpoispp(I), I get the following error:
Error in sample.int(npix, size = ni, replace = TRUE, prob = lpix) :
negative probability
I can't seem to figure out what the issue is... any suggestions?
Thanks for your help!
This is happening because the image I contains some negative values, probably very small values but negative. You can check that by computing range(I) or min(I) or any(I < 0).
The help for density.ppp says that the result may contain negative values (very small ones) due to numerical error. To remove these, you need to set positive=TRUE in the call to density.ppp.
By the way, the argument kernel has been mis-spelt (kernal) in the code above. Also, the vector r is too coarsely spaced - you would be better off leaving this argument unspecified. Also, you don't need to type density.ppp, just density.
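For concreteness, a corrected call following the advice above might look like this (a sketch):
# 'kernel' spelled correctly; positive=TRUE clips the tiny negative values
I <- density(WT.ppp, sigma = bw.diggle(WT.ppp), adjust = 0.3,
             kernel = "epanechnikov", positive = TRUE)
range(I) # both values should now be positive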

Calculate furthest distance in given time or best time for given distance

I have imported data from my GPS tracker and I am trying to figure out how best to calculate the furthest distance run in a given time (e.g. 12 minutes) or the best time for a given distance (e.g. 5 miles). Given that the observations are taken at unequal intervals and my speed is not constant, I will have data like the table below:
x <- read.table(header=T, sep="", stringsAsFactors = FALSE,text="
time dist
4 3
5 4
5 6
3 2
5 5
4 5
4 3
4 2
5 6")
My best attempt so far is to generate a new dataset where each row represents one unit of time. It is then relatively easy to calculate the furthest distance in a given time. The downside of this is that a) I would need to repeat the same logic for best time (generate data with unit distance), and b) it seems to be quite a sub-optimal solution for data with thousands of data points.
# Generate data frame where each row represents one unit of time
z_adj <- data.frame(
  time = unlist(sapply(x$time, function(s) rep(s, each = s))),
  dist = unlist(sapply(seq_along(x$dist), function(s) rep(x$dist[s], each = x$time[s])))
)
z_adj$seq_time <- seq_along(z_adj$time)
z_adj$time_dist <- z_adj$dist / z_adj$time
# Furthest distance given time
# Time 10
z_adj$in_t10 <- sapply(z_adj$seq_time, function(s) sum(z_adj$dist[s:(s+9)]))
z_adj$in_t10[which(z_adj$in_t10 == max(z_adj$in_t10, na.rm = T))]
# Fastest time given distance
# ... would need to do the above again with constant distance :/
Is there a more straightforward approach to accomplish this?
You could use something like this:
x <- read.table(header=T, sep="", stringsAsFactors = FALSE,text="
time dist
4 3
5 4
5 6
3 2
5 5
4 5
4 3
4 2
5 6")
# Add starting point and cumulative time/distance
x <- rbind(c(0,0), x)
x$total_time <- cumsum(x$time)
x$total_dist <- cumsum(x$dist)
# function to interpolate and calculate lagging differences
foo <- function(x, y, n) {
  interpolation <- approx(x, y, xout = seq(min(x), max(x)))
  diff(interpolation$y, lag = n)
}
# Max distance in ten units of time
max(foo(x$total_time, x$total_dist, 10))
# Min time for ten units of distance
min(foo(x$total_dist, x$total_time, 10))
BTW, in your code you should sum over z_adj$time_dist instead of z_adj$dist to get the correct result.
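That is, the corrected line in the original approach would read:
z_adj$in_t10 <- sapply(z_adj$seq_time, function(s) sum(z_adj$time_dist[s:(s+9)]))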

Getting percentile values from gamlss centile curves

This question is related to: Selecting Percentile curves using gamlss::lms in R
I can get centile curve from following data and code:
age = sample(5:15, 500, replace=T)
yvar = rnorm(500, age, 20)
mydata = data.frame(age, yvar)
head(mydata)
age yvar
1 12 13.12974
2 14 -18.97290
3 10 42.11045
4 12 27.89088
5 11 48.03861
6 5 24.68591
h = lms(yvar, age, data=mydata, n.cyc=30)
centiles(h,xvar=mydata$age, cent=c(90), points=FALSE)
How can I now get the yvar on the curve for each x value (5:15), which would represent the 90th percentile for the data after smoothing?
I tried to read the help pages and found fitted(h) and fv(h) to get fitted values for the entire data. But how do I get the value at each age level on the 90th centile curve? Thanks for your help.
Edit: (A figure, omitted here, shows what I need.) I tried the following, but it is not correct since the values are wrong:
mydata$fitted = fitted(h)
aggregate(fitted~age, mydata, function(x) quantile(x,.9))
age fitted
1 5 6.459680
2 6 6.280579
3 7 6.290599
4 8 6.556999
5 9 7.048602
6 10 7.817276
7 11 8.931219
8 12 10.388048
9 13 12.138104
10 14 14.106250
11 15 16.125688
The values are very different from the 90th quantile taken directly from the data:
> aggregate(yvar~age, mydata, function(x) quantile(x,.9))
age yvar
1 5 39.22938
2 6 35.69294
3 7 25.40390
4 8 26.20388
5 9 29.07670
6 10 32.43151
7 11 24.96861
8 12 37.98292
9 13 28.28686
10 14 43.33678
11 15 44.46269
See if this makes sense. The 90th percentile of a normal distribution with mean 'smn' and sd 'ssd' is qnorm(.9, smn, ssd). So this seems to deliver (somewhat) sensible results, albeit not the full hack of centiles that I suggested:
plot(h$xvar, qnorm(.9, fitted(h), h$sigma.fv))
(Note the massive overplotting from only a few distinct xvars but 500 points. And you may want to set the ylim so that the full range can be appreciated.)
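If you want one value per age level rather than a plot, a short sketch along the same lines (assuming the plain Normal, constant-sigma model discussed in the caveat below):
mydata$p90 <- qnorm(0.9, fitted(h), h$sigma.fv)
aggregate(p90 ~ age, mydata, mean) # one 90th-percentile value per age level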
The caveat here is that you need to check the other parts of the model to see if it is really just an ordinary Normal model. In this case it seems to be:
> h$mu.formula
y ~ pb(x)
<environment: 0x10275cfb8>
> h$sigma.formula
~1
<environment: 0x10275cfb8>
> h$nu.formula
NULL
> h$tau.formula
NULL
So the model is just a mean estimate with a fixed variance (the ~1) across the range of the xvar, and there are no complications from higher-order parameters like a Box-Cox model. (And I'm unable to explain why this is not the same as the plotted centiles. For that you probably need to correspond with the package authors.)
