Linear interpolation of points in R

This may seem a really simple question, but here goes:
I have a data frame:
test_df <- data.frame(x1 = c(277422033,24118536.4,2096819.0,
182293.4,15905,1330,105,16,1),
x2 = c(2.496e-3,2.495e-2,2.496e-1,
2.496e0,2.47e1,2.48e2,2.456e3,
3.7978e4,3.781e5))
and I would like to linearly interpolate this to increase the number of points. The variables are linearly related on log scales, i.e.
plot(log10(test_df[,1]),log10(test_df[,2]))
So, my question is, how do I linearly interpolate these to increase the number of values?
Here is my attempt using a linear model (as opposed to the approx function):
I have defined a linear model as:
test.lm <- lm(log10(x1) ~ log10(x2), data = test_df)
and then define a new variable for the new points:
ss <- seq(min(test_df$x2), max(test_df$x2), length.out = 100) # new x2 values at which to predict
then predict the new values and plot the points
newY <- predict(test.lm, newdata = data.frame(x2 = ss)) # interpolated values
test_df2 <- data.frame(x1 = 10^newY,
x2 = ss)
points(newY,log10(ss),col = "red")
This works as I expect, i.e. the resulting graph looks as expected.
I would like to increase the number of points in test_df2, which can be done by increasing length.out, e.g.
ss <- seq(min(test_df$x2),max(test_df$x2),length.out = 10000000)
but this makes the running time very long on my machine, to the point that I have to restart R.
Is there a way to linearly interpolate at an evenly distributed set of points that still spans the entire range specified in ss?

Just use
ss <- 10^seq(log10(min(test_df$x2)),log10(max(test_df$x2)),length.out = 1000)
to have your new data evenly distributed on the log scale.
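For completeness, a minimal sketch reusing the model fitted above with this log-spaced grid:
newY <- predict(test.lm, newdata = data.frame(x2 = ss)) # predicted log10(x1)
test_df2 <- data.frame(x1 = 10^newY, x2 = ss)
plot(log10(test_df$x1), log10(test_df$x2))
points(newY, log10(ss), col = "red", pch = 20)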

Related

How to generate a multivariate spline basis in R?

I want to obtain a multivariate spline basis using R. I do not know how to do it properly or what the best approach is. According to my limited research, I think the package that can help me is mgcv and the functions ti and smooth.construct.tensor.smooth.spec, but I am not sure.
The structure of my data is simple. I have two vectors, xdata and alphadata (called tau in the code below), generated as
n = 200
T = 2
xdata = as.matrix(rnorm(T*n),T*n,1)
tau = seq(-2,2,by=0.1)
tau = as.matrix(tau,length(tau),1)
So basically I have two vectors, xdata and alphadata, of dimensions n*T and 41, respectively. My goal is then to obtain a spline basis (for example, a cubic spline) which should be a function of both: b(alphadata, xdata).
What I have tried so far is something like this:
xdata_data <- data.frame("xdata" = xdata[,1])
tau_data <- data.frame("tau" = tau[,1])
basisobj1 <- ti(tau_data, xdata_data, bs = 'cr', k = c(6, 6), fx = TRUE) #cr:cubic regression splines
xdata_data <- data.frame("xdata_data" = xdata[,1])
tau_data <- data.frame("tau_data" = tau[,1])
basisobj2 <- smooth.construct.tensor.smooth.spec(basisobj1, data = c(tau_data,xdata_data), knots = NULL)
basis <- basisobj2[["X"]]
Note that I had to rename my data this way; otherwise I get errors from smooth.construct.tensor.smooth.spec.
My questions are:
(1) With the previous approach, am I doing what I want?
(2) Is this a smart approach to do what I want?
(3) When I do the above, the number of rows of basis is 41, but shouldn't it equal the product of the dimensions of xdata and alphadata, since the basis is a function of both vectors?
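This is not a verified answer, but a hedged sketch of one way to evaluate a tensor-product basis on the full grid: mgcv::smoothCon() constructs a smooth object (including its design matrix $X) outside of a gam() fit, and evaluating it on expand.grid() of the two variables yields one row per (tau, xdata) combination, which is the row count question (3) expects.
library(mgcv)
# Evaluate the ti() tensor-product smooth on the full grid of the two
# variables; nrow(grid) = length(tau) * length(xdata).
grid <- expand.grid(tau = tau[, 1], xdata = xdata[, 1])
sm <- smoothCon(ti(tau, xdata, bs = "cr", k = c(6, 6), fx = TRUE),
                data = grid, knots = NULL)[[1]]
basis <- sm$X # one row per grid point, one column per basis function
dim(basis)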

Simulate in R the number of samples needed in order to achieve the true standard deviation

I want to recreate in R the figure above, which simulates the number of samples needed in order to achieve the true standard deviation.
How can I do it in R?
I suppose that the distribution is a t-distribution or normal.
So I have to generate numbers from these distributions, increasing the sample size each time, and plot the results in order to recreate the plot shown in the figure.
Any help?
set.seed(123)
x <- list(v1=rnorm(1,0,12),v2=rnorm(10,0,11),
v3=rnorm(20,0,10),v4=rnorm(30,0,9),
v5=rnorm(40,0,8),v6=rnorm(50,0,7),
v7=rnorm(60,0,6),v8=rnorm(70,0,5),
v9=rnorm(80,0,4),v10=rnorm(90,0,3),
v11=rnorm(100,0,2),v12=rnorm(110,0,2))
g = lapply(x,sd)
g
g1 = unlist(g)
plot(g1,type="l")
First, start with a random uniform distribution of suitable size, and select the sample sizes at which you want to compute your standard error of the mean.
set.seed(123)
x <- runif(1e6, 0, 1)
sample_size <- 5:120
You can define a function to compute this sigma_m: draw a sample of size n from x with replacement, take its standard deviation, and divide by sqrt(n).
calc_sigma_m <- function(n, x) {
sd(sample(x, n, replace = TRUE))/sqrt(n)
}
A data frame can neatly store the sample sizes and sigma_m values for plotting:
df <- data.frame(sample_size,
sigma_m = sapply(sample_size, calc_sigma_m, x))
Your initial plot will look like this:
library(ggplot2)
ggplot(df, aes(sample_size, sigma_m)) +
geom_line()
As expected, this is not smooth, especially at smaller sample sizes.
If you want a smoother curve for demonstration, repeat the sampling process and the sigma_m calculation many times, and take the mean.
calc_sigma_m_mean <- function(n, x) {
mean(replicate(1000, sd(sample(x, n, replace = TRUE))/sqrt(n)))
}
df <- data.frame(sample_size, sigma_m = sapply(sample_size, calc_sigma_m_mean, x))
Then you will get a smoother curve:
ggplot(df, aes(sample_size, sigma_m)) +
geom_line()
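As a sanity check (my addition, not part of the original answer), you can overlay the theoretical curve: a Uniform(0, 1) variable has standard deviation 1/sqrt(12), so sigma_m = 1/sqrt(12 * n).
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line() +
  # theoretical standard error of the mean for Uniform(0, 1)
  geom_line(aes(y = 1/sqrt(12 * sample_size)), colour = "red", linetype = "dashed")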

Is there a way to simulate time series data with a specific rolling mean and autocorrelation in R?

I have an existing time series (1000 samples) and calculated its rolling mean using the filter() function in R, averaging across 30 samples at a time. The goal of this was to create a "smoothed" version of the time series. Now I would like to create artificial data that "look like" the original time series, i.e., are somewhat noisy, but that would produce the same rolling mean if I applied the same filter() function to them. In short, I would like to simulate a time series with the same overall course but not the exact same values as the existing one. The overall goal is to investigate whether certain methods can detect similarity of trends between time series, even when the fluctuations around the trend differ.
To provide some data, my time series looks somewhat like this:
set.seed(576)
ts <- arima.sim(model = list(order = c(1,0,0), ar = .9), n = 1000) + 900
# save in dataframe
df <- data.frame("ts" = ts)
# plot the data
plot(ts, type = "l")
The filter function produces the rolling mean:
my_filter <- function(x, n = 30){filter(x, rep(1 / n, n), sides = 2, circular = T)}
df$rolling_mean <- my_filter(df$ts)
lines(df$rolling_mean, col = "red")
To simulate data, I have tried the following:
Adding random noise to the rolling mean.
df$sim1 <- df$rolling_mean + rnorm(1000, sd = sd(df$ts))
lines(df$sim1, col = "blue")
df$sim1_rm <- my_filter(df$sim1)
lines(df$sim1_rm, col = "green")
The problem is a) that the variance of the simulated values is higher than the variance of the original values, b) that the rolling average, although quite similar to the original, sometimes deviates quite a bit from it, and c) that the noise has no autocorrelation. An autocorrelation structure in the simulated data would be desirable, since it is supposed to resemble the original data.
Edit: Problem a) can be solved by using sd = sqrt(var(df$ts)-var(df$rolling_mean)) instead of sd = sd(df$ts).
I tried arima.sim(), which seems like an obvious choice to specify the autocorrelation that should be present in the data. I modeled the original data using arima(), using the model parameters as input for arima.sim().
ts_arima <- arima(ts, order = c(1,0,1))
my_ar <- ts_arima$coef["ar1"]
my_ma <- ts_arima$coef["ma1"]
my_intercept <- ts_arima$coef["intercept"]
df$sim2 <- arima.sim(model = list(order = c(1,0,1), ar = my_ar, ma = my_ma), n = 1000) + my_intercept
plot(df$ts)
lines(df$sim2, col = "blue")
The resulting time series is very different from the original. Maybe a higher order for ar and ma in arima.sim() would solve this, but I think a whole different method might be more appropriate.
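One hedged way to combine the two attempts above (an assumption on my part, not a tested solution): model only the residuals around the rolling mean as an AR(1) process, then add freshly simulated AR(1) noise back onto the rolling mean. This keeps the original trend, matches the residual variance, and gives the noise an autocorrelation structure.
# Fit an AR(1) model to the deviations from the rolling mean.
resid_fit <- arima(df$ts - df$rolling_mean, order = c(1, 0, 0), include.mean = FALSE)
# Simulate new noise with the same AR coefficient and innovation variance,
# then add it back onto the original rolling mean.
noise <- arima.sim(model = list(ar = coef(resid_fit)["ar1"]),
                   n = nrow(df), sd = sqrt(resid_fit$sigma2))
df$sim3 <- df$rolling_mean + as.numeric(noise)
plot(df$ts, type = "l")
lines(df$sim3, col = "blue")
lines(my_filter(df$sim3), col = "green") # rolling mean of the simulated series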

Is it possible to analyse a spatial point pattern given another, underlying, spatial point pattern in R

I want to analyse the type of spatial pattern shown by an animal (i.e. random, clustered, uniform) taking into consideration the underlying spatial pattern of its available habitat. The animals in question roost in trees, so a standard analysis of the animal locations will always show a clustered distribution (i.e. clustering around trees), but I want to test whether the animals cluster in particular trees or distribute randomly among trees. To provide a visual, I want to be able to differentiate between the following scenarios in the image:
https://imgur.com/a/iE3nAoh (image available through the link)
Here is a reproducible data frame. The scenario here is of uniform habitat (25 areas of habitat) and uniform animals (16 animals per habitat):
library(spatstat)
data <- data.frame(matrix(ncol = 4, nrow = 25))
x <- c("habitat", "x", "y", "animalcount")
colnames(data) <- x
data$habitat <- 1:25
data$x <- seq(from=2, to=20, by=4)
data$y[1:5] <- 2
data$y[6:10] <- 6
data$y[11:15] <- 10
data$y[16:20] <- 14
data$y[21:25] <- 18
data$animalcount <- 16
Set up conditions for the spatial analysis:
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
nS <- 499 # number of simulations
cd <- 5 # cluster distance
ed <- 50 # envelope distance
incr.dist <- 0.5 # increment distance for envelopes
Create the point pattern for the habitat:
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
Create the point pattern for the animals. To do this, first make a new data frame with each row repeated animalcount times, so that the points represent individual animals, then jitter x/y so that coordinates are not exactly identical:
data <-data[which(data$animalcount>0),]
duplicate_rows <- function(habitat, x, y, animalcount) {
expanded <- paste0("animal-", 1:animalcount)
repeated_rows <- data.frame("habitat" = habitat, "x" = x, "y" = y, "animalcount" = expanded)
repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, data$habitat, data$x, data$y, data$animalcount)
animal_data <- do.call(rbind, expanded_rows)
animal_data$xan <- jitter(animal_data$x)
animal_data$yan <- jitter(animal_data$y)
animal <- ppp(x = animal_data$xan, y = animal_data$yan, window = plot.win)
Now test Complete Spatial Randomness of animals regardless of habitat. This should come out as clustered:
an.csr <- envelope(animal, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction = "Ripley", verbose = FALSE) # fit CSR with the chosen number of simulations
an.dclf <- dclf.test(an.csr, rinterval = c(0, cd), verbose = FALSE) # summary statistic of the CSR null-model fit (dclf.test)
plot(an.csr, sqrt(./pi) - r ~ r, ylab = "L(r)-r", xlab = "r (meters)", xlim = c(0, ed), legend = FALSE, main = "Animal - CSR") # plot the 0-centered fit with confidence bounds
clarkevans(animal)[2] # R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animal, correction = "Donnelly")$p
Now test Complete Spatial Randomness of animals given the available habitat. This should not come out as clustered, but simply adding habitat as a covariate clearly isn't the appropriate way to do it:
an.csr <- envelope(animal, covariates = animal_data[, 2:3], Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction = "Ripley", verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0, cd), verbose = FALSE)
plot(an.csr, sqrt(./pi) - r ~ r, ylab = "L(r)-r", xlab = "r (meters)", xlim = c(0, ed), legend = FALSE, main = "Animal - CSR")
clarkevans(animal)[2]
clarkevans.test(animal, correction = "Donnelly")$p
I also tried running the test of Complete Spatial Randomness on a fitted point process model, where the animal point pattern is predicted by x and y, but this also did not change the outcome:
animalppm <- ppm(animal ~ x + y)
an.csr <- envelope(animalppm, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction = "Ripley", verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0, cd), verbose = FALSE)
plot(an.csr, sqrt(./pi) - r ~ r, ylab = "L(r)-r", xlab = "r (meters)", xlim = c(0, ed), legend = FALSE, main = "Animal - CSR")
clarkevans(animalppm)[2] # R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animalppm, correction = "Donnelly")$p
From there I would run tests of aggregation models, but the logic of adding the second point pattern should be similar.
I would appreciate any suggestions on ways to deal with this. I cannot think of an effective way to google this, and am coming up short on clever coding solutions in R. Thanks in advance!
You can model the intensity as depending on the distance to the habitat pattern. Here is a simple example where the animals follow a Poisson point process whose intensity decays log-linearly with distance to the habitat:
library(spatstat)
data <- expand.grid(x = seq(2, 18, by=4), y = seq(2, 18, by=4))
data$animalcount <- 16
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
d <- distmap(habitat)
plot(d)
lam <- exp(3-2*d)
plot(lam)
animal <- rpoispp(lam)
plot(animal)
fit <- ppm(animal ~ d)
fit
#> Nonstationary Poisson process
#>
#> Log intensity: ~d
#>
#> Fitted trend coefficients:
#> (Intercept) d
#> 2.952048 -1.974381
#>
#> Estimate S.E. CI95.lo CI95.hi Ztest Zval
#> (Intercept) 2.952048 0.07265533 2.809646 3.094450 *** 40.63085
#> d -1.974381 0.07055831 -2.112673 -1.836089 *** -27.98226
Taking the underlying non-homogeneous intensity into account, there is no sign of departure from the Poisson model in the (inhomogeneous) K-function:
plot(Kinhom(animal, lambda = fit))
#> Warning: The behaviour of Kinhom when lambda is a ppm object has changed
#> (in spatstat 1.37-0 and later). See help(Kinhom)
You don't have to assume a simple log-linear dependence on distance. You could also make a threshold model with one intensity within, say, distance 1 of the habitat and another intensity outside that distance, as sketched below. You can construct all kinds of derived covariates from, e.g., the distance for use in your model.
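A hedged sketch of the threshold idea, reusing the distance map d computed above:
near <- eval.im(as.integer(d <= 1)) # 0/1 covariate image: 1 within distance 1 of the habitat
fit_thresh <- ppm(animal ~ near)
fit_thresh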
If animals is the point pattern of animals, and trees is the point pattern of trees (both objects of class "ppp" in spatstat) then you could do
d <- distfun(trees)
f <- rhohat(animals, d)
plot(f)
to get an idea of how the concentration of animals depends on distance to nearest tree. You can use
berman.test(animals, d)
to perform a hypothesis test of dependence on the trees.

Maximum pseudo-likelihood estimator for soft-core point process

I am trying to fit a soft-core point process model to a point pattern using maximum pseudo-likelihood, following the instructions given in the paper by Baddeley and Turner.
Here is the R code I came up with:
library(deldir)
library(tidyverse)
library(fields)
#MPLE
# irregular parameter k
k <- 0.4
## Generate dummy points 50X50. "RA" and "DE" are x and y coordinates
dum.x <- seq(ramin, ramax, length = 50)
dum.y <- seq(demin, demax, length = 50)
dum <- expand.grid(dum.x, dum.y)
colnames(dum) <- c("RA", "DE")
## Combine with data and specify which is data point and which is dummy, X is the point pattern to be fitted
bind.x <- bind_rows(X, dum) %>%
mutate(Ind = c(rep(1, nrow(X)), rep(0, nrow(dum))))
## Calculate Quadrature weights using Voronoi cell area
w <- deldir(bind.x$RA, bind.x$DE)$summary$dir.area
## Response
y <- bind.x$Ind/w
# sufficient statistic: for each point, the sum of pairwise distances raised to the power -2/k
tmp <- cbind(bind.x$RA, bind.x$DE)
t1 <- rdist(tmp)^(-2/k)
t1[t1 == Inf] <- 0
t1 <- rowSums(t1)
t <- -t1
# fit the model using quasipoisson regression
fit <- glm(y ~ t, family = quasipoisson, weights = w)
However, the fitted parameter for t is negative, which is obviously not a correct value for a soft-core point process. Moreover, my point pattern was actually simulated from a soft-core process, so it makes no sense that the fitted parameter is negative. I tried my best to find bugs in the code but can't seem to find any. The only potential issue I see is that my sufficient statistic is extremely large (on the order of 10^14), which I fear may cause numerical issues. The statistic is large because my observation window spans a very small area and the average distance between a pair of points is around 0.006, so a sufficient statistic based on inverse distances will certainly be very large; my intuition tells me that this alone should not cause a numerical problem and make the fitted parameter negative.
Can anybody help and check if my code is correct? Thanks very much!
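As a possible cross-check (my suggestion, assuming X can be converted to a ppp object): spatstat's ppm() implements the same Baddeley-Turner pseudolikelihood device with the soft-core interaction built in, so its fit can be compared against the hand-rolled glm fit.
library(spatstat)
# Same window bounds as used for the dummy grid above.
W <- owin(c(ramin, ramax), c(demin, demax))
Xppp <- ppp(X$RA, X$DE, window = W)
fit2 <- ppm(Xppp ~ 1, interaction = Softcore(kappa = 0.4))
coef(fit2) # compare the interaction coefficient with the glm fit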
