Predict values from sinusoidal noise - r

Background
Using R to predict the next values in a series.
Problem
The following code generates and plots a model for a curve with some uniform noise:
library( mgcv ) # assumption: gam() and s() come from the mgcv package
slope = 0.55
offset = -0.5
amplitude = 0.22
frequency = 3
noise = 0.75
x <- seq( 0, 200 )
y <- offset + (slope * x / 100) + (amplitude * sin( frequency * x / 100 ))
yn <- y + (noise * runif( length( x ) ))
gam.object <- gam( yn ~ s( x ) + 0 )
plot( gam.object, col = rgb( 1.0, 0.392, 0.0 ) )
points( x, yn, col = rgb( 0.121, 0.247, 0.506 ) )
The model reveals the trend, as expected. The trouble is predicting subsequent values:
p <- predict( gam.object, data.frame( x=201:210 ) )
The predictions do not look correct when plotted:
df <- data.frame( fit=c( fitted( gam.object ), p ) )
plot( seq( 1:211 ), df[,], col="blue" )
points( yn, col="orange" )
The predicted values (from 201 onwards) appear to be too low.
Questions
Are the predicted values, as shown, actually the most accurate predictions?
If not, how can the accuracy be improved?
What is a better way to concatenate the two data sets (fitted.values( gam.object ) and p)?

The simulated data is weird, because all the errors you add to the "true" y are greater than 0. (runif creates numbers on [0,1], not [-1,1].)
The problem disappears when an intercept term in the model is allowed.
For example:
gam.object2 <- gam( yn ~ s( x ))
p2 <- predict( gam.object2, data.frame( x=201:210 ))
points( 1:211, c( fitted( gam.object2 ), p2), col="green")
The reason for the systematic underestimation in the model without intercept could be that gam uses a sum-to-zero constraint on the estimated smooth functions. I think allowing the intercept answers your first and second questions.
Your third question needs clarification because a gam-object is not a data.frame. The two data types do not mix.
A more complete example:
library( mgcv ) # assumption: gam() and s() come from the mgcv package
slope = 0.55
amplitude = 0.22
frequency = 3
noise = 0.75
x <- 1:200
y <- (slope * x / 100) + (amplitude * sin( frequency * x / 100 ))
ynoise <- y + (noise * runif( length( x ) ))
gam.object <- gam( ynoise ~ s( x ) )
p <- predict( gam.object, data.frame( x = 1:210 ) )
plot( p, col = rgb( 0, 0.75, 0.2 ) )
points( x, ynoise, col = rgb( 0.121, 0.247, 0.506 ) )
points( fitted( gam.object ), col = rgb( 1.0, 0.392, 0.0 ) )
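The remark about runif made above can be checked directly. This is a minimal base R sketch; the names n, noise.pos and noise.centred are illustrative and do not appear in the question.

```r
set.seed(42)      # arbitrary seed, only for reproducibility
noise <- 0.75     # same noise scale as in the question
n <- 201          # same number of points as x <- seq( 0, 200 )

# runif() draws from [0, 1], so every error is non-negative and the
# noisy series sits about noise / 2 above the true curve on average.
noise.pos <- noise * runif( n )

# Drawing from [-1, 1] instead gives errors centred on zero.
noise.centred <- noise * runif( n, min = -1, max = 1 )

range( noise.pos )
mean( noise.pos )
mean( noise.centred )
```

With one-sided noise, a model that is not allowed an intercept has no free constant to absorb that upward shift, which may be part of why the fit without an intercept looks biased.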

Related

How to draw these 2 functions on the same graph?

I want to draw 2 functions on the same graph, because the goal is to show whether one of them is uniformly higher than the other one.
Function 1 is:
f(x)
=0 if x <= -0.05,
=x+0.05 if -0.05<x<=0.95
=1 if x>0.95
Function 2 is: g(x)
=0 if x <= -0.16
=(2x+2-1.68)^2/2 if -0.16 <x<=0.34
=1-(1.68-2x)^2/2 if 0.34<x<=0.84
=1 if x>0.84
Just spell out the comparisons in detail, so you are sure to get them right:
f <- function(x) {
if( x <= -.05 )
return(0)
if( x <= .95 )
return( x + 0.05 )
## the rest:
return( 1 )
}
g <- function(x) {
if( x <= -.16 )
return(0)
if( x <= .34 )
return( (2*x+2-1.68)^2/2 )
if( x <= 0.84 )
return( 1-(1.68-2*x)^2/2 )
## the rest:
return(1)
}
curve( Vectorize(f)(x), from=-1, to=2 )
curve( Vectorize(g)(x), from=-1, to=2, add=TRUE, col="red" )
legend( "right", col=c("black","red"), legend=c("f(x)", "g(x)"), lwd=2 )
(The Vectorize bit of the code may look weird. It's there because the functions as written can only process one number at a time, while the curve function needs the function to process multiple values at once. Vectorize fixes that.)
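A small sketch of what Vectorize does here, using the same f as above. The vector xs and the names vf, vals and vals2 are made up for illustration.

```r
f <- function(x) {
  if( x <= -.05 )
    return(0)
  if( x <= .95 )
    return( x + 0.05 )
  return( 1 )
}

xs <- c( -1, 0, 0.5, 2 )

# Calling f(xs) directly does not work as intended: if() expects a single
# TRUE/FALSE, so depending on the R version it raises an error or a warning.

# Vectorize() wraps f so that it is applied to each element in turn.
vf <- Vectorize( f )
vals <- vf( xs )

# For a one-argument function this is equivalent to sapply().
vals2 <- sapply( xs, f )

vals
```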
Improvements to function 1:
The first function can be simplified greatly, since it is simply clamping your x value to a fixed range:
pmax(pmin(x+0.05,1),0) # would be one way
That expression makes sure the result never goes above 1, by taking the minimum of 1 and your variable value, and never goes below 0, by taking the maximum of 0 and that result.
If you have the raster package you can also use the clamp function from there:
library(raster)
clamp( x+0.05, 0, 1 )
It replaces any value outside the range with the nearest limit of the range.
You can add these to the graph yourself, one by one, to see that they end up right on top of the black line:
curve( pmax(pmin(x+0.05,1),0) , add=TRUE, col="green" )
curve( clamp(x+.05,0,1) , add=TRUE, col="blue" )
You can also use plot with add = TRUE, as below:
plot(f, -1.5, 1.5)
plot(g, add = TRUE)
where
f <- function(x) {
ifelse(x <= -0.05,
0, ifelse(x <= 0.95,
x + 0.05, 1
)
)
}
g <- function(x) {
ifelse(x <= -0.16,
0,
ifelse(x <= 0.34,
(2 * x + 2 - 1.68)^2 / 2,
ifelse(x <= 0.84,
1 - (1.68 - 2 * x)^2 / 2,
1
)
)
)
}
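Since the stated goal is to see whether one function is uniformly higher than the other, the comparison can also be done numerically on a grid. This is only a sketch: the grid spacing, the tolerance and the names grid, d, f.above and g.above are arbitrary choices, not part of the original answers.

```r
# Vectorised versions of the two functions from the question
f <- function(x) pmax( pmin( x + 0.05, 1 ), 0 )
g <- function(x) {
  ifelse( x <= -0.16, 0,
    ifelse( x <= 0.34, (2 * x + 2 - 1.68)^2 / 2,
      ifelse( x <= 0.84, 1 - (1.68 - 2 * x)^2 / 2, 1 ) ) )
}

grid <- seq( -1, 2, by = 0.001 )
d <- f( grid ) - g( grid )

# A small tolerance guards against floating-point noise where f and g agree.
f.above <- any( d > 1e-12 )    # is f strictly above g somewhere?
g.above <- any( d < -1e-12 )   # is g strictly above f somewhere?

c( f.above = f.above, g.above = g.above )
range( grid[ d > 1e-12 ] )     # where f is above g on this grid
```

With the definitions as given in the question, both flags come out TRUE on this grid, so neither function is uniformly higher: f is above g only on a short interval just to the right of 0.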

Plotting same equation but different graph.

I tried to plot the Taylor expansion of the exponential function f(x) = exp( x ) at a = 0, up to order n = 1.
At first I wrote the equation marked #### n = 1 and plotted it, but the line did not fit the graph of exp(x). Then I moved the '+' sign to the end of the upper line, as in the equation marked # n = 1, and it worked.
The two equations are exactly the same except for the location of the operator sign ( + ).
What is the difference between the two equations ( #### n = 1, # n = 1 ) for plotting?
Does the location of operator signs matter in plotting?
f1 <- function(x) exp( x )
x <- seq( -1, 1, by = 0.025 )
a <- 0
#### n = 1
f1.10 <- function( x ){
exp( a ) / factorial( 0 ) * ( x - a )^0
+ exp( a ) / factorial( 1 ) * ( x - a )^1
}
# n = 1
f1.1 <- function( x ){
exp( a ) / factorial( 0 ) * ( x - a )^0 +
exp( a ) / factorial( 1 ) * ( x - a )^1
}
plot( x, exp(x), ylab = "exp(x)", type = "l", lwd = 3 )
curve( f1.1, -0.5, 0.5, add = T, type = "l", lwd = 2, col = "yellow" )
curve( f1.10, -0.5, 0.5, add = T, type = "l", lwd = 2, col = "blue" )
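In R, a newline ends an expression whenever the line is already syntactically complete, so a + at the start of the next line begins a new expression (a unary plus) instead of continuing the previous one. If you put the + at the end of a line, the expression is still incomplete, and R treats the next line as its continuation. A function returns the value of its last expression, so in the first version the first term is evaluated and then discarded.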
So,
f1.10 <- function( x ){
exp( a ) / factorial( 0 ) * ( x - a )^0
+ exp( a ) / factorial( 1 ) * ( x - a )^1
}
is the same as
f1.11 <- function( x ){
exp( a ) / factorial( 1 ) * ( x - a )^1
}
The two functions produce different results, and they're not the same.
f1.10(seq(10))
#[1] 1 2 3 4 5 6 7 8 9 10
f1.1(seq(10))
#[1] 2 3 4 5 6 7 8 9 10 11
For x = 1, f1.10 returns 1 and f1.1 returns 2. We can also draw random numbers between -0.5 and 0.5 to check whether the two functions give the same result:
set.seed(777)
x <- runif(100, min = -0.5, max = 0.5)
min(f1.10(x))
#[1] -0.4940781
max(f1.10(x))
#[1] 0.4950499
min(f1.1(x))
#[1] 0.5059219
max(f1.1(x))
#[1] 1.49505
You can see that f1.10 is constrained between -0.5 and 0.5, whereas f1.1 is not.
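If you prefer to keep the + at the start of the line, one option is to wrap the whole sum in parentheses, because R keeps reading until the parenthesis is closed. A minimal sketch follows; f.lead and f.paren are illustrative names, not from the question.

```r
a <- 0

# Leading '+' without parentheses: two separate expressions,
# and only the second one is returned.
f.lead <- function( x ){
  exp( a ) / factorial( 0 ) * ( x - a )^0
  + exp( a ) / factorial( 1 ) * ( x - a )^1
}

# Leading '+' inside parentheses: one expression spanning two lines.
f.paren <- function( x ){
  ( exp( a ) / factorial( 0 ) * ( x - a )^0
    + exp( a ) / factorial( 1 ) * ( x - a )^1 )
}

f.lead( 1:3 )
f.paren( 1:3 )
```

Here f.lead returns only the linear term, while f.paren returns the full first-order expansion 1 + x.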

Given two Gaussian density curves, how do I identify v, such that v equally separate area under overlap?

Given two Gaussian density curves, how do I identify 'v' such that 'v' splits the area under the overlap into two equal parts?
The following code creates a visualisation of my problem. I am interested in calculating the area 'A' and then finding the x-value 'v' that splits the area exactly in two.
# Define Gaussian parameters
mu1 = 10
sd1 = 0.9
mu2 = 12
sd2 = 0.6
# Visualise, set values
sprd = 3
xmin = min(c(mu1-sprd*sd1,mu2-sprd*sd2))
xmax = max(c(mu1+sprd*sd1,mu2+sprd*sd2))
x = seq(xmin,xmax,length.out=1000)
y1 = dnorm(x,mean=mu1,sd=sd1)
y2 = dnorm(x,mean=mu2,sd=sd2)
ymin = min(c(y1,y2))
ymax = max(c(y1,y2))
# Visualise, plot
plot(x,y1,xlim=c(xmin,xmax),ylim=c(ymin,ymax),type="l",col=2,ylab="Density(x)")
lines(x,y2,col=3)
abline(v=c(mu1,mu2),lty=2,col=c(2,3))
abline(h=0,lty=2)
legend("topleft",legend=c("N(mu1,sd1)","N(mu2,sd2)","mu1","mu2"),lty=c(1,1,2,2),col=c(2,3))
text(11,0.05,"A",cex=2)
Based on the comments on this post, I have written my own proposal for a solution:
gaussIsect = function(mu1,mu2,sd1,sd2){
sd12 = sd1**2
sd22 = sd2**2
sqdi = sd12-sd22
x1 = (mu2 * sd12 - sd2*( mu1*sd2 + sd1*sqrt( (mu1-mu2)**2 + 2*sqdi * log(sd1/sd2) ) )) / sqdi
x2 = (mu2 * sd12 - sd2*( mu1*sd2 - sd1*sqrt( (mu1-mu2)**2 + 2*sqdi * log(sd1/sd2) ) )) / sqdi
return(c(x1,x2))
}
gaussSplitOlap = function(mu1,mu2,sd1,sd2){
if( mu1 > mu2 ){
tmp = c(mu1,mu2)
mu1 = tmp[2]
mu2 = tmp[1]
tmp = c(sd1,sd2)
sd1 = tmp[2]
sd2 = tmp[1]
}
isct = gaussIsect(mu1=mu1,mu2=mu2,sd1=sd1,sd2=sd2)
isct = isct[which(mu1 < isct & isct < mu2)]
a1 = 1-pnorm(isct,mean=mu1,sd=sd1)
a2 = pnorm(isct,mean=mu2,sd=sd2)
A = a1 + a2
v1 = qnorm(1-A/2,mean=mu1,sd=sd1)
v2 = qnorm(A/2,mean=mu2,sd=sd2)
results = list(isct=isct,A=A,v1=v1,v2=v2)
return(results)
}
test = gaussSplitOlap(mu1 = 10,sd1 = 0.9,mu2 = 12,sd2 = 0.6)
print(test)
The output from running this test is as follows
$isct
[1] 11.09291
$A
[1] 0.1775984
$v1
[1] 11.21337
$v2
[1] 11.19109
I would have assumed that the v1 and v2 values were equal?
First solve analytically for the point x where the two curves intersect (this is a degree-2 polynomial equation).
Then given this x the area is the sum of the two tails:
area = min(pnorm(x, mean = mu1, sd = sd1), 1 - pnorm(x, mean = mu1, sd = sd1)) +
min(pnorm(x, mean = mu2, sd = sd2), 1 - pnorm(x, mean = mu2, sd = sd2))
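One way to turn this into a value for v is sketched below, using the parameters from the question. It makes the same simplifying assumption as the question's code, namely that only the crossing between the two means matters (the second crossing far out in the tail is ignored). The names cross, overlap.cdf and the tolerance are illustrative choices.

```r
mu1 <- 10; sd1 <- 0.9
mu2 <- 12; sd2 <- 0.6

# Crossing point between the means: root of the log-density difference
cross <- uniroot(
  function(x) dnorm( x, mu1, sd1, log = TRUE ) - dnorm( x, mu2, sd2, log = TRUE ),
  lower = mu1, upper = mu2, tol = 1e-10 )$root

a1 <- 1 - pnorm( cross, mean = mu1, sd = sd1 )  # right tail of curve 1
a2 <- pnorm( cross, mean = mu2, sd = sd2 )      # left tail of curve 2
A  <- a1 + a2                                    # overlap area

# Overlap area to the left of v: bounded by curve 2 left of the crossing
# and by curve 1 right of it.
overlap.cdf <- function(v) {
  if( v <= cross )
    return( pnorm( v, mean = mu2, sd = sd2 ) )
  a2 + pnorm( v, mean = mu1, sd = sd1 ) - pnorm( cross, mean = mu1, sd = sd1 )
}

# v is the point where half of the overlap area lies to its left.
v <- uniroot( function(v) overlap.cdf( v ) - A / 2,
              lower = mu1, upper = mu2, tol = 1e-10 )$root

c( cross = cross, A = A, v = v )
```

Under these assumptions a1 is larger than a2, so the split point lies to the right of the crossing, where the overlap is bounded by curve 1. That would explain why v1 and v2 in the question differ: only the quantile taken from the curve that bounds the overlap at v (here v1) splits the area in half.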
Like I said in the comment, you can do this using a simple Monte Carlo simulation:
prob<-c()
med<-c()
for(i in 1:1000){
randomX<-runif(1000000,xmin,xmax)
randomY<-runif(1000000,0,0.3)
cond<-(randomY<dnorm(randomX,mean=mu1,sd=sd1) & randomY<dnorm(randomX,mean=mu2,sd=sd2))
prob<-c(prob,sum(cond)/1000000*(xmax-xmin)*0.3)
med<-c(med,median(randomX[which(cond==1)]))
}
cat("Area of A is equal to: ", mean(prob),"\n")
# Area of A is equal to: 0.1778459
cat("Value of v is equal to: ",mean(med),"\n")
# Value of v is equal to: 11.21008
plot(x,y1,xlim=c(xmin,xmax),ylim=c(ymin,ymax),type="l",col=2,ylab="Density(x)")
lines(x,y2,col=3)
abline(v=c(mu1,mu2,mean(med)),lty=2,col=c(2,3,4))
abline(h=0,lty=2)
legend("topleft",legend=c("N(mu1,sd1)","N(mu2,sd2)","mu1","mu2"),lty=c(1,1,2,2),col=c(2,3))
text(11,0.05,"A",cex=2)

MC algorithm for a non-conjugate model

The model is Poisson likelihood and Gaussian prior. I worked out the posterior for the model and I think that I have it coded correctly but I'm having a lot of trouble trying to implement the algorithm. I know that it's just a simple matter of not defining my variables properly but I'm not seeing where the problems lie. The code that I have so far is:
# Poisson model
#
#
# Log of the unnormalized posterior density:
log.post.dens = function( theta, n, sum.y, mu0, sig0 )
{
alpha = (log.dpois(x, lamda=exp(theta)))*dnorm(x, mu0, sig0)
}
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++
rw.sim = function( M, mu0, sig0, n, sum.y, sd.pd, theta.start )
{
# create theta array and initialize theta[1]
#
theta = rep( 0, M )
theta[1] = theta.start
acc.cnt = 0
for( ii in 2:M ) {
# Normal proposal distribution is centered at the current theta
#
theta.new = rnorm( 1, theta[ii-1], sd.pd )
log.alpha = log.post.dens( theta.new, n, sum.y, mu0, sig0 ) -
log.post.dens( theta[ii-1], n, sum.y, mu0, sig0 )
if( log.alpha > 0 || exp( log.alpha ) > runif( 1, 0, 1 ) )
{
theta[ii] = theta.new
acc.cnt = acc.cnt + 1
}
else
theta[ii] = theta[ii-1]
}
list( ac = acc.cnt, theta = theta )
}
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++
n = 200
mu0 = log(10)
sig0 = 3
yy = rpois( n, exp( mu0 ))
sd.pd = 1
theta.start = mu0
M = 100000
print( paste("M =", M, " mu0 =", mu0, " sig0 =", sig0, "sd.pd =", sd.pd,
"start", theta.start ))
res = rw.sim( M, mu0, sig0, length(yy), sum( yy ), sd.pd, theta.start )
theta = res$theta
acc.rate = res$ac / M
corr = cor( theta[1:(M-1)], theta[2:M])
print( paste("acceptance rate =", acc.rate ))
print( paste("correlation =", corr ))
m = 1
if( m )
{
theta0 = theta
thin.const = 40
theta = theta[ seq( .1*length(theta), length(theta), thin.const )]
}
par( mfrow=c(2,2))
hist( theta, prob=T, breaks=32 )
x = seq( min( theta ), max( theta ), len=200 )
lines( x, dnorm( x, mu0, sig0 ), col = 2)
plot( theta, type='l' )
acf( theta )
##pacf( theta )
#++++++++++++++++++++++++++++
#
# Posterior predictive density for data on a grid
#
hist( yy, prob=T)
lim1 = max(yy) + 2
xx = 0:lim1
ppd = rep( 0, lim1+1 )
for( ii in 1:(lim1+1) )
{
ppd[ii] = (1/M)*sum(yy)*((log.dpois(x, lamda=exp(theta)))*dnorm(x, mu0, sig0))
}
points( xx+.5, ppd, col=2 )
lines( xx+.5, ppd, col=2 )
As I said it's my defining of parameters that's off but I'm not sure how to fix it.
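For reference, here is a hedged sketch of how the pieces might fit together. It assumes that theta is the log of the Poisson rate (as exp( theta ) in the code above suggests), so that up to an additive constant the log posterior is sum.y * theta - n * exp( theta ) plus the log of the Gaussian prior density. The seed, the shorter chain, the burn-in of 500 and the proposal standard deviation of 0.05 are arbitrary choices for the illustration.

```r
# Log of the unnormalized posterior density (assumed form, see above)
log.post.dens <- function( theta, n, sum.y, mu0, sig0 ) {
  sum.y * theta - n * exp( theta ) + dnorm( theta, mu0, sig0, log = TRUE )
}

set.seed( 1 )
n    <- 200
mu0  <- log( 10 )
sig0 <- 3
yy   <- rpois( n, exp( mu0 ) )

M     <- 5000
sd.pd <- 0.05          # proposal sd, chosen small because the posterior is narrow
theta <- numeric( M )
theta[1] <- mu0
acc.cnt  <- 0

for( ii in 2:M ) {
  # Random-walk proposal centred at the current value
  theta.new <- rnorm( 1, theta[ii-1], sd.pd )
  log.alpha <- log.post.dens( theta.new, n, sum( yy ), mu0, sig0 ) -
               log.post.dens( theta[ii-1], n, sum( yy ), mu0, sig0 )
  if( log( runif( 1 ) ) < log.alpha ) {
    theta[ii] <- theta.new
    acc.cnt   <- acc.cnt + 1
  } else {
    theta[ii] <- theta[ii-1]
  }
}
acc.rate <- acc.cnt / ( M - 1 )
theta    <- theta[ -(1:500) ]   # drop burn-in

# Posterior predictive probabilities on a grid: average the Poisson
# pmf over the sampled rates.
xx  <- 0:( max( yy ) + 2 )
ppd <- sapply( xx, function(k) mean( dpois( k, exp( theta ) ) ) )

acc.rate
mean( theta )
```

Working with log( runif( 1 ) ) < log.alpha avoids exponentiating a possibly large log ratio; it accepts under the same rule as the condition used in rw.sim above.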

R help. I'm running a Simple Gibbs Sampler for mu and sig^2 for normal data using informative non-conjugate priors

Mu is distributed as N(0,1) and sig^2 is distributed as IGamma(a,b) with a = 1, b = 2. I'm trying to create a couple of graphs (histograms, scatterplots, ACF, PACF) but keep getting the error message "Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ".
It's probably a stupid question but I'm new to R. This is the program I've run so far:
# Gibbs sampler, MODEL 2: semi-conjugate normal model
#
# Density of inverse gamma distribution
#
dinvgamma = function( x, shape, rate )
{
exp( shape * log( rate ) - lgamma( shape ) -( shape+1)*log(x) - rate/x )
}
nn = 200
yy = rnorm( nn, 1, 4 )
mu.0 = 1.0
sig2.0 = 4^2
a = 2
b = 1
M = 10000
mu = rep( 0, 10000 )
sig2 = rep( 1, 10000 )
mean.y = mean( yy )
var.y = var( yy )
mu[1] = mean.y
sig2[1] = var.y
for( ii in 2:M ) {
mu.star <- ((1/sig2.0)*mu.0 + (nn/sig2 [ii-1])*mean(yy))/((1/sig2.0)+(nn/sig2 [ii- 1]))
sig2.star <- 1/((1/sig2.0)+(nn/sig2[ii-1]))
# sample mu from its full cond.
#
mu[ii] = rnorm( 1, mu.star, sqrt( sig2.star ))
a.star <- a + (nn/2)
b.star <- (sum((yy-mu [ii])^2)/2) + b
# sample sig2 from its full cond.
#
sig2[ii] = 1 / rgamma( 1, a.star, rate = b.star )
## print( c( mu[ii], sig2[ii] ))
}
mu <- mu[9001:length(mu)]
sig2 <- sig2[9001:length(mu)]
#
# Time series
#
par(mfrow=c(2,2))
plot( 1:M, mu, type="l", xlab="Iteration", xlim=c(0, 50), ylim=c(0, 100),
main= "Mu" )
hist( mu, prob = T )
x = seq( min(mu), max( mu ), len=200 )
lines( x, dnorm( x, mean=mu.0, sd=sqrt(sig2.0) ), col=2, lwd=2)
acf( mu )
pacf( mu )
#++++++++++++++++++++++++++++++++++
par(new=TRUE)
par(mfrow=c(2,2))
plot( 1:M, sig2, type="l", xlab="Iteration", ylab="sig^2",xlim=c(0, 50), ylim=c(0, 100), main= "sig^2" )
hist( sig2, prob = T )
x = seq( 0, max( sig2 ), len=200 )
lines( x, dinvgamma( x, shape=a, rate=b ), col=2, lwd=2)
acf( sig2 )
pacf( sig2 )
par(new=TRUE)
plot(mu, sig2, main="Scatterplot Example", xlim=c(0, 50), ylim=c(0, 100),xlab="mu ", ylab="sig2 ", pch=19)
Any help at all would be greatly appreciated. I'm under no illusion that my code is efficient or even approaching it so any changes you may deem necessary please feel free to tell me
Your problem seems to be coming from redefining these two variables:
mu <- mu[9001:length(mu)]
sig2 <- sig2[9001:length(mu)]
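I commented them out and the script ran fine, producing the plot (presumably) as desired. The trouble is in how mu and sig2 are redefined: once the first line has run, length(mu) is 1000, so 9001:length(mu) in the second line counts down from 9001 to 1000 and sig2 keeps 8002 values instead of the last 1000. The later plot( 1:M, mu, ... ) call also pairs 10000 x values with a mu that now holds only 1000, which is what triggers "'x' and 'y' lengths differ".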
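If you do want to drop the burn-in, one way is to compute the index once and reuse it for both vectors and for the x axis. A minimal sketch, with random stand-ins for the sampler output; keep, mu.kept and sig2.kept are illustrative names.

```r
M <- 10000
set.seed( 1 )
mu   <- rnorm( M )   # stand-in for the sampled mu values
sig2 <- rexp( M )    # stand-in for the sampled sig2 values

# What 9001:length(mu) turns into once mu has been shortened to 1000 values:
length( 9001:1000 )

# Compute the retained iterations once and reuse them everywhere.
keep      <- 9001:M
mu.kept   <- mu[ keep ]
sig2.kept <- sig2[ keep ]

c( length( keep ), length( mu.kept ), length( sig2.kept ) )
```

A call such as plot( keep, mu.kept, type = "l" ) then has x and y of equal length. The xlim and ylim arguments in the original plot calls would also need to match the retained iterations and the actual range of the samples.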

Resources