How to get rid of multiple outliers in a timeseries in R? - r

I'm using the "outliers" package to remove some undesirable values, but the rm.outlier() function does not seem to replace all outliers at once. It apparently cannot despike recursively, so I have to call the function many times to replace every outlier.
Here is a reproducible example of the issue I'm experiencing:
require(outliers)
# creating a timeseries:
set.seed(12345)
y = rnorm(10000)
# inserting some outliers:
y[4000:4500] = -11
y[4501:5000] = -10
y[5001:5100] = -9
y[5101:5200] = -8
y[5201:5300] = -7
y[5301:5400] = -6
y[5401:5500] = -5
# plotting the timeseries + outliers:
plot(y, type="l", col="black", lwd=6, xlab="Time", ylab="w'")
# trying to get rid of some outliers by replacing them by the series mean value:
new.y = outliers::rm.outlier(y, fill=TRUE, median=FALSE)
new.y = outliers::rm.outlier(new.y, fill=TRUE, median=FALSE)
# plotting the new timeseries "after removing the outliers":
lines(new.y, col="red")
# inserting a legend:
legend("bottomleft", c("raw", "new series"), col=c("black","red"), lty=c(1,1), horiz=FALSE, bty="n")
Does anyone know how to improve the code above, so that all outliers could be replaced by a mean value?

Best thought I could come up with is just to use a for loop, keeping track of the outliers as you find them.
plot(y, type="l", col="black", lwd=6, xlab="Time", ylab="w'")
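maxIter <- 100
outlierQ <- rep(FALSE, length(y))  # tracks every point flagged so far
for (i in 1:maxIter) {
  bad <- outlier(y, logical = TRUE)  # flags the value(s) farthest from the mean
  if (!any(bad)) break
  outlierQ[bad] <- TRUE
  y[bad] <- mean(y[!bad])
}
y[outlierQ] <- mean(y[!outlierQ])  # refill all flagged points with one common mean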
lines(y, col="blue")
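If repeated calls feel clumsy, another option is to flag every outlier in a single pass with a robust rule and then fill them all at once. The sketch below does not use the outliers package. It assumes that "outlier" means farther than k scaled MADs from the median; the cutoff k = 3 and the helper name replace_outliers are illustrative choices, not something from the question.

```r
# Hypothetical helper: flag points far from the median (in MAD units)
# and replace all of them with the mean of the unflagged points.
replace_outliers <- function(x, k = 3) {
  centre  <- median(x, na.rm = TRUE)
  spread  <- mad(x, na.rm = TRUE)          # scaled median absolute deviation
  flagged <- abs(x - centre) > k * spread
  flagged[is.na(flagged)] <- FALSE         # leave missing values untouched
  x[flagged] <- mean(x[!flagged], na.rm = TRUE)
  list(clean = x, flagged = flagged)
}

# Same test series as in the question
set.seed(12345)
y <- rnorm(10000)
y[4000:4500] <- -11
y[4501:5000] <- -10
y[5001:5100] <- -9
y[5101:5200] <- -8
y[5201:5300] <- -7
y[5301:5400] <- -6
y[5401:5500] <- -5

res <- replace_outliers(y)
plot(y, type = "l", xlab = "Time", ylab = "w'")
lines(res$clean, col = "red")
```

The median and MAD are much less affected by the spikes than the mean and standard deviation, which is why a single pass can be enough here. Whether k = 3 is appropriate depends on the data.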

Related

R: Plot lines are very thick

When using matplot to plot a matrix using:
matplot(t, X[,1:4], col=1:4, lty = 1, xlab="Time", ylab="Stock Value")
my graph comes out as:
How do I reduce the line thickness? I previously used a different method and my graph was fine:
I have tried manipulating lwd, but to no avail.
Even tried plot(t, X[1:4097,1]), yet the line being printed is very thick. Something wrong with my R?
EDIT: Here is the code I used to produce the matrix X:
####Inputs mean return, volatility, time period and time step
mu=0.25; sigma=2; T=1; n=2^(12); X0=5;
#############Generating trajectories for stocks
##NOTE: Seed is fixed. Changing seed will produce
##different trajectories
dt=T/n
t=seq(0,T,by=dt)
set.seed(201)
X <- matrix(nrow = n+1, ncol = 4)
for(i in 1:4){
  X[,i] <- c(X0,mu*dt+sigma*sqrt(dt)*rnorm(n,mean=0,sd=1))
  X[,i] <- cumsum(X[,i])
}
colnames(X) <- paste0("Stock", seq_len(ncol(X)))
Just needed to add type = "l" to matplot(....). Plots fine now.
matplot(t, X[,1:4], col=1:4, type = "l", xlab="Time", ylab="Stock Value")

Plotting several variables on the same scale in R

I've tried over and over to solve this issue but I can't get it down. I have estimated a Beta-t-EGARCH model and a GARCH-t model in R and now I need to plot the results over the same plot. The final result is horrible, since the variables don't share the same scale on the y axis. I'm new to R, so please don't blame me :).
Here's the code:
library(quantmod)
library(betategarch)
library(fGarch)
library(ggplot2)
getSymbols("GOOG",src="yahoo")
google_ret <- abs(periodReturn(GOOG, period="daily", subset=NULL, type="log"))-mean(abs(periodReturn(GOOG, period="daily", subset=NULL, type="log")))
googcomp <- tegarch(google_ret, asym=FALSE, skew=FALSE)
goog1stdev <- fitted(googcomp)
#now we try to fit a standard GARCH-t model
googgarch <- garchFit(data=google_ret, cond.dist="sstd")
googgarch2 <- garchFit(data=google_ret, cond.dist="sstd", include.mean = FALSE, include.delta = FALSE, include.skew = FALSE, include.shape = FALSE, leverage = FALSE, trace = TRUE)
volatility <- volatility(googgarch2, type = "sigma")
plot(google_ret)
par(new=TRUE)
plot(googgarch2, which=2)
par(new=TRUE)
plot(goog1stdev, col="red")
The final result is a plot completely out of scale on the y axis, with variables that have lower values plotted above higher ones. Thanks a lot to anybody that wants to help me!
The recommended approach is to plot them as different plots stacked on top of each other:
layout(matrix(1:3,3))
plot(google_ret)
plot(googgarch2, which=2)
plot(goog1stdev, col="red")
You can get rid of the whitespace with calls to par("mar") to adjust margin sizes:
opar=par(mar=par("mar") -c(1,0,3,0)) # opar then lets you restore the previous values
# ... plotting calls go here ...
par(opar)
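A self-contained version of the same idea, using simulated vectors as stand-ins for the three series (the names a, b and d are placeholders, not objects from the question):

```r
# Simulated stand-ins for the three series (not the GOOG data)
set.seed(1)
a <- abs(rnorm(200, sd = 0.02))
b <- cumsum(rnorm(200))
d <- runif(200)

opar <- par(mar = par("mar") - c(1, 0, 3, 0))  # shrink bottom and top margins
layout(matrix(1:3, nrow = 3))                   # three stacked panels
plot(a, type = "l", ylab = "a")
plot(b, type = "l", ylab = "b")
plot(d, type = "l", col = "red", ylab = "d")
layout(1)                                       # back to a single panel
par(opar)                                       # restore the saved margins
```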
I don't know your domain very well, but if you can use shifted y-ordinates then this produces a somewhat cleaned-up version with overlaid plots:
png()
plot(google_ret, ylim=c(0,1), ylab="Google Returns(black); GGarch x10 +0.5 (blue); STD + 0.3(red)" )
par(new=TRUE)
plot(googgarch2@data +.5, type="l", col="blue",axes=FALSE, ylab="", main="",ylim=c(0, 1)) ;abline(h=.5, col="blue")
par(new=TRUE);
plot( 10*coredata(goog1stdev) + .3, col="red", type="l", axes=FALSE, main="",ylim=c(0,1), ylab=""); abline(h=.3, col="red")
dev.off()
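If the goal is literally one shared y-axis, an alternative to par(new=TRUE) is to draw the first series with a ylim wide enough for all of them and add the others with lines(). Each new plot() call otherwise picks its own axis limits, which is what puts the overlaid series out of scale. The sketch uses simulated vectors (ret, vol_a, vol_b are made-up stand-ins), so it only illustrates the mechanics:

```r
# Simulated stand-ins for returns and two fitted volatility series
set.seed(42)
n     <- 300
ret   <- abs(rnorm(n, sd = 0.02))
vol_a <- 0.02 + 0.005 * sin(seq_len(n) / 20)
vol_b <- 0.015 + 0.004 * cos(seq_len(n) / 15)

ylim <- range(ret, vol_a, vol_b)   # one range covering every series

plot(ret, type = "l", ylim = ylim, xlab = "Index", ylab = "Value")
lines(vol_a, col = "blue")         # drawn in the same coordinate system
lines(vol_b, col = "red")
legend("topright", c("returns", "model A", "model B"),
       col = c("black", "blue", "red"), lty = 1, bty = "n")
```

With xts objects such as google_ret, converting to plain vectors with coredata() first may be the simplest way to apply this.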

quantile plot, two data - issues with fitting the line in R

I am trying to plot p-values from two different data frames and compare them to the normal distribution in a QQ plot in R.
Here is the code that I am using:
## Taking values from 1st dataframe to plot
Rlogp = -log10(trialR$PVAL)
Rindex <- seq(1, nrow(trialR))
Runi <- Rindex/nrow(trialR)
Rloguni <- -log10(Runi)
## Taking values from 2nd dataframe to plot on existing plot
Nlogp = -log10(trialN$PVAL)
Nlogp = sort(Nlogp)
Nindex <- seq(1, nrow(trialN))
Nuni <- Nindex/nrow(trialN)
Nloguni <- -log10(Nuni)
Nloguni <- sort(Nloguni)
qqplot(Rloguni, Rlogp, xlim=range(0,6), ylim=range(0,6), col=rgb(100,0,0,50,maxColorValue=255), pch=19, lwd=2, bty="l",xlab ="", ylab ="")
qqline(Rloguni, Rlogp,distribution=qnorm, lty="dashed")
par(new=TRUE, cex.main=4.8, col.axis="white")
plot(Nloguni, Nlogp, xlim=range(0,6), ylim=range(0,6), col=rgb(0,0,100,50,maxColorValue=255), pch=19, lwd=2, bty="l",xlab ="", ylab ="")
The code plots the graph, but I am not sure about the qqline, as it seems a bit offset. Can someone tell me whether I am doing this correctly, or whether something needs to change?
The target plot should look something like this, without the third data value.
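For what it's worth, qqline() draws a line through the first and third quartiles of a sample against normal quantiles, so it may not be the reference you want when the x axis holds expected uniform p-values. A common convention for this kind of plot (an assumption about what the target plot shows) is to plot the sorted observed -log10(p) against the expected -log10 quantiles and add the identity line. The p-values below are simulated, since trialR and trialN are not available here, and the helper name expected is made up for the sketch:

```r
# Simulated p-values standing in for trialR$PVAL and trialN$PVAL
set.seed(7)
p_r <- runif(1000)
p_n <- runif(800)

# Expected -log10(p) under the uniform null, largest first
expected <- function(n) -log10(ppoints(n))

obs_r <- sort(-log10(p_r), decreasing = TRUE)
obs_n <- sort(-log10(p_n), decreasing = TRUE)

plot(expected(length(p_r)), obs_r, xlim = c(0, 6), ylim = c(0, 6),
     col = rgb(100, 0, 0, 50, maxColorValue = 255), pch = 19,
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
points(expected(length(p_n)), obs_n,
       col = rgb(0, 0, 100, 50, maxColorValue = 255), pch = 19)
abline(0, 1, lty = "dashed")   # identity line: observed = expected
```

Using points() for the second data set keeps both on the same axes, so par(new=TRUE) and the white axis trick are not needed.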

Visualize data using histogram in R

I am trying to visualize some data and in order to do it I am using R's hist.
Below are my data
jancoefabs <- as.numeric(as.vector(abs(Janmodelnorm$coef)))
jancoefabs
[1] 1.165610e+00 1.277929e-01 4.349831e-01 3.602961e-01 7.189458e+00
[6] 1.856908e-04 1.352052e-05 4.811291e-05 1.055744e-02 2.756525e-04
[11] 2.202706e-01 4.199914e-02 4.684091e-02 8.634340e-01 2.479175e-02
[16] 2.409628e-01 5.459076e-03 9.892580e-03 5.378456e-02
Now as the more cunning of you might have guessed these are the absolute values of some model's coefficients.
What I need is a histogram that will have for axes:
x will be the number (count or length) of coefficients which is 19 in total, along with their names.
y will show values of each column (as breaks?) having a ylim="" set, according to min and max of those values (or something similar).
Note that Janmodelnorm$coef simply produces the following
(Intercept) LON LAT ME RAT
1.165610e+00 -1.277929e-01 -4.349831e-01 -3.602961e-01 -7.189458e+00
DS DSA DSI DRNS DREW
-1.856908e-04 1.352052e-05 4.811291e-05 -1.055744e-02 -2.756525e-04
ASPNS ASPEW SI CUR W_180_270
-2.202706e-01 -4.199914e-02 4.684091e-02 -8.634340e-01 -2.479175e-02
W_0_360 W_90_180 W_0_180 NDVI
2.409628e-01 5.459076e-03 -9.892580e-03 -5.378456e-02
So far, after consulting ?hist, I have been playing with the code below without success, so I am starting from scratch.
# hist(jancoefabs, col="lightblue", border="pink",
# breaks=8,
# xlim=c(0,10), ylim=c(20,-20), plot=TRUE)
When plot=FALSE is set, I get a bunch of somewhat useful info about the set. I also find it hard to use the breaks argument efficiently.
Any suggestion will be appreciated. Thanks.
Rather than using hist, why not use a barplot or a standard plot? For example,
## Generate some data
set.seed(1)
y = rnorm(19, sd=5)
names(y) = c("Inter", LETTERS[1:18])
Then plot the coefficients
barplot(y)
Alternatively, you could use a scatter plot
plot(1:19, y, axes=FALSE, ylim=c(-10, 10))
axis(2)
axis(1, 1:19, names(y))
and add error bars to indicate the standard errors (see for example Add error bars to show standard deviation on a plot in R)
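As a rough sketch of the error-bar idea on the barplot, using the same simulated coefficients as above: the standard errors here are invented (a constant 2), so in practice they would come from the fitted model.

```r
# Same simulated coefficients as above; the standard errors are made up
set.seed(1)
y  <- rnorm(19, sd = 5)
names(y) <- c("Inter", LETTERS[1:18])
se <- rep(2, length(y))            # hypothetical standard errors

# barplot() returns the x positions of the bar midpoints
mids <- barplot(y, ylim = range(y - se, y + se), las = 2)

# Flat-headed arrows in both directions act as error bars
arrows(mids, y - se, mids, y + se, angle = 90, code = 3, length = 0.03)
```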
Are you sure you want a histogram for this? A lattice barchart might be pretty nice. An example with the mtcars built-in data set.
> coef <- lm(mpg ~ ., data = mtcars)$coef
> library(lattice)
> barchart(coef, col = 'lightblue', horizontal = FALSE,
ylim = range(coef), xlab = '',
scales = list(y = list(labels = coef),
x = list(labels = names(coef))))
A base R dotchart might be good too,
> dotchart(coef, pch = 19, xlab = 'value')
> text(coef, seq(coef), labels = round(coef, 3), pos = 2)

Construct a specific plot of time series using R

I generate a time series from a normal distribution and plot it, but I want to color in red the positive area between the series and the x-axis, and likewise the negative area between the x-axis and the series.
This is the code I use, but it does not work:
x1<-rnorm(250,0.4,0.9)
x <- as.matrix(x1)
t <- ts(x[,1], start=c(1,1), frequency=30)
plot(t,main="Daily closing price of Walterenergie",ylab="Adjusted close Returns",xlab="Times",col="blue")
plot(t,xlim=c(2,4),main="Daily closing price of Walterenergie",ylab="Adjusted close Returns",xlab="Times",col="blue")
abline(0,0)
z1<-seq(2,4,0.001)
cord.x <- c(2,z1,4)
cord.y <- c(0,t(z1),0)
polygon(cord.x,cord.y,col='red')
Edit: In response to OP's additional query.
library(ggplot2)
df <- data.frame(t=1:nrow(x),y=x)
df$fill <- ifelse(x>0,"Above","Below")
ggplot(df)+geom_line(aes(t,y),color="grey")+
  geom_ribbon(aes(x=t,ymin=0,ymax=ifelse(y>0,y,0)),fill="red")+
  geom_ribbon(aes(x=t,ymin=0,ymax=ifelse(y<0,y,0)),fill="blue")+
  labs(title="Daily closing price of Walterenergie",
       y="Adjusted close Returns",
       x="Times")
Original response:
Is this what you had in mind?
library(ggplot2)
df <- data.frame(t=1:nrow(x),y=x)
ggplot(df)+geom_line(aes(t,y),color="grey")+
  geom_ribbon(aes(x=t,ymin=0,ymax=y),fill="red")+
  labs(title="Daily closing price of Walterenergie",
       y="Adjusted close Returns",
       x="Times")
This is some code I wrote a while ago for someone. In this case two different colors are used for the positive and negative areas. Although this is not exactly what you're after, I thought I'd share it.
# Set a seed to get a reproducible example
set.seed(12345)
num.points <- 100
# Create some data
x.vals <- 1:num.points
values <- rnorm(n=num.points, mean=0, sd=10)
# Plot the graph
plot(x.vals, values, t="o", pch=20, xlab="", ylab="", las=1)
abline(h=0, col="darkgray", lwd=2)
# We need to find the intersections of the curve with the x axis
# Those lie between positive and negative points
# When the sign changes the product between subsequent elements
# will be negative
crossings <- values[-length(values)] * values[-1]
crossings <- which(crossings < 0)
# You can draw the points to check (uncomment following line)
# points(x.vals[crossings], values[crossings], col="red", pch="X")
# We now find the exact intersections using a proportion
# See? Those high school geometry problems finally come in handy
intersections <- NULL
for (cr in crossings)
{
  new.int <- cr + abs(values[cr])/(abs(values[cr])+abs(values[cr+1]))
  intersections <- c(intersections, new.int)
}
# Again, let's check the intersections
# points(intersections, rep(0, length(intersections)), pch=20, col="red", cex=0.7)
last.intersection <- 0
for (i in intersections)
{
  ids <- which(x.vals<=i & x.vals>last.intersection)
  poly.x <- c(last.intersection, x.vals[ids], i)
  poly.y <- c(0, values[ids], 0)
  if (max(poly.y) > 0)
  {
    col="green"
  }
  else
  {
    col="red"
  }
  polygon(x=poly.x, y=poly.y, col=col)
  last.intersection <- i
}
And here's the result!
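The proportion step in the code above is a linear interpolation between the two points on either side of each sign change, and it can also be written without the loop. A small sketch, checking that the vectorised form agrees with the loop on the same simulated values (the names loop.int, vec.int, left and right are introduced here for the comparison):

```r
set.seed(12345)
values <- rnorm(n = 100, mean = 0, sd = 10)

# Indices where consecutive values have opposite signs
crossings <- which(values[-length(values)] * values[-1] < 0)

# Loop version, as in the answer
loop.int <- NULL
for (cr in crossings) {
  loop.int <- c(loop.int,
                cr + abs(values[cr]) / (abs(values[cr]) + abs(values[cr + 1])))
}

# Vectorised version: the same interpolation applied to all crossings at once
left    <- abs(values[crossings])
right   <- abs(values[crossings + 1])
vec.int <- crossings + left / (left + right)

all.equal(loop.int, vec.int)
```

This relies on the x values being 1, 2, 3, ... as in the answer, so that an index and its x coordinate coincide.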
Base plotting solution:
x1<-rnorm(250,0.4,0.9)
x <- as.matrix(x1)
# t <- ts(x[,1], start=c(1,1), frequency=30)
plot(x1,main="Daily closing price of Walterenergie",ylab="Adjusted close Returns",xlab="Times",col="blue", type="l")
polygon( c(0,1:250,251), c(0, x1, 0) , col="red")
Note this doesn't deal with the time-series plotting method which is rather difficult to understand because of differences in scaling by the frequency value and a starting x value of 1. The solution to that is below:
plot(t,main="Daily closing price of Walterenergie",
ylab="Adjusted close Returns",xlab="Times",col="blue", type="l")
polygon( c(1,1+(0:250)/30), c(0, t, 0) , col="red")
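One way to avoid working out the x positions by hand is to ask the series for its own time axis with time(). A sketch under the same setup; the name tsx is used here instead of t to avoid masking the transpose function, and the seed is only there to make the example repeatable:

```r
set.seed(1)
x1  <- rnorm(250, 0.4, 0.9)
tsx <- ts(x1, start = c(1, 1), frequency = 30)

tt <- as.numeric(time(tsx))        # time points of the series, used as x positions
plot(tsx, main = "Daily closing price of Walterenergie",
     ylab = "Adjusted close Returns", xlab = "Times", col = "blue")
abline(h = 0)

# Close the polygon on the zero line at both ends of the series
polygon(c(tt[1], tt, tt[length(tt)]), c(0, as.numeric(tsx), 0), col = "red")
```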
