Strange output from PM10 gstat spatiotemporal kriging in R

First post here :) The really simple questions are marked below. I want to krige daily PM10 data from 8 static stations in Santiago, Chile (1997-2012), onto 34 centroids representing different counties. I explain what I've done so far, with some questions in between. As a first experiment I'm using only the 2008 data, which has fewer missing values.
DESCRIPTION:
DATA: I have a column with days from 1997 to 2012 and 8 columns with PM10 station data. I imported this data into R and set up the time index:
time <- as.POSIXlt(data$date), with no error:
" 1997-04-05 UTC 1997-04-12 UTC.... 2012-12-27 UTC"
I imported the stations with their coordinates and set their projection:
coordinates(stations)=~longitud+latitud
proj4string(stations) <- CRS("+proj=longlat +ellps=WGS84")
In order to create the STFDF, I first built the PM10 data vector for the 8 stations, ordered:
PM10 <- as.vector(cbind(data$PM10bosque, data$PM10cerrillos, data$PM10cerronavia, data$PM10condes, data$PM10florida, data$PM10independencia, data$PM10parqueoh, data$PM10pudahuel))
PM10<-data.frame(PM10)
DFPM=STFDF(stations, time, PM10)
DFPM<-as(DFPM,"STSDF")
The conversion to STSDF is because I am working with missing data. Then the empirical variogram and its model (which I know is poor) were computed with:
varPM10 <- variogramST(PM10~1,data=DFPM,tunit="days",assumeRegular=F,na.omit=T)
sepVgm <- vgmST("separable",space=vgm(0,"Exp", 8, 700),time =vgm(200,"Exp", 15, 700), sill=100)
sepVgm <- fit.StVariogram(varPM10, sepVgm)
Which results in:
Variograms
Then I used krigeST this way:
gridPM10 <- STF(centroids, time)   # centroids defined previously the same way as stations
krigedPM10<-krigeST(PM10~1, DFPM, newdata=gridPM10,modelList=sepVgm)
The result of plotting one station's data and the kriged values for that station's county centroid is:
Kriging result for Cerrillos county and its station data
which looks as if the estimation occurs in time windows over sets of dates.
First question: Does anybody know why this kriging has this shape?
Then I wondered what would happen if I just used distance as a predictor, so instead I coded:
varPM10 <- variogramST(PM10~1,data=DFPM,tunit="days", tlags=0:0, assumeRegular=F,na.omit=T)
Second question: Is this a reasonable way to use just distance as a predictor? If not, any advice on how to adjust my code so I can do this is very much appreciated. Anyway, this is the result:
Variogram with tlag=0:0
using
sepVgm <- vgmST("separable", space = vgm(1, "Per", 8, 700), time = vgm(200, "Exp", 15, 700), sill = 100)
(by the way, how would you fit this?), the output really surprised me:
Kriging result with tlags=0:0
Third question: Why am I getting this result? I know the variogram modelling is poor, but even so, I understand the program should use the station data for the corresponding date, so at least the estimates should change over time.

Related

Timeseries Analysis: Observed Values do not Correspond with Input Data

I have generated a decomposition of an additive time series for METAR wind data at a Norwegian airport. I have noticed that the monthly average wind values do not correspond with the observed values shown in the decomposition chart. During January 2014, average winds were measured at 5.74 kts, yet the chart shows a dip to below 3 kts. I noticed, however, that when I separated each variable into its own dataset and ran the decomposition separately, the issue was resolved. Does this have something to do with the way the imported data is read? Sorry if it seems a silly question. Screenshots and code below. Thanks!
To define ts data:
RtestENGM_ts <- ts(test$Sknt, start=c(2012, 1), frequency=12)
To decompose ts data:
decomposed_test <- decompose(RtestENGM_ts, type="additive")
To plot decomposed data:
plot(decomposed_test)
To plot ts data
plot(RtestENGM_ts)
Input dataset:
Decomposition of additive time series 2012-22:
I tried importing each variable individually as part of its own dataset, and this allowed the correct observed values to be plotted. I still do not understand why R needs the imported variables to be separate. Do I really need to split my data across dozens of spreadsheets? Does R struggle to isolate a single column during decomposition?
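One minimal sanity check (using the simplified object names above) would be to compare the observed component stored by decompose() against the input series, and to look at the January 2014 value directly:
# the observed series is kept in the decomposition as component $x
all.equal(as.numeric(decomposed_test$x), as.numeric(RtestENGM_ts))
# inspect the January 2014 observation of the monthly series
window(RtestENGM_ts, start = c(2014, 1), end = c(2014, 1))
If these agree with the raw monthly averages, the problem is in how the chart is read rather than in the decomposition itself.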

Time Series Forecasting using Support Vector Machine (SVM) in R

I've tried searching but couldn't find a specific answer to this question. So far I've been able to work out that time series forecasting is possible using SVM. I've gone through a few papers/articles that perform it but give no code, instead explaining the algorithm (which I didn't quite understand), and some have done it in Python.
My problem is this: I have company sales data (say, univariate) from 2010 to 2017, and I need to forecast the sales value for 2018 using SVM in R.
Would you be kind enough to simply present and explain the R code to perform the same using a small example?
I really do appreciate your inputs and efforts!
Thanks!!!
Let's assume you have monthly data, for example derived from the AirPassengers data set. You don't need time-series-class data, just a data frame containing time steps and values; let's name them x and y. Next you develop an SVM model and specify the time steps you need to forecast, then use the predict function to compute the forecast for those time steps. That's it. However, a support vector machine is not commonly regarded as the best method for time series forecasting, especially for long series of data. It can perform well for a few observations ahead, but I wouldn't expect good results when forecasting, e.g., daily data for a whole year ahead (though it obviously depends on the data). Simple R code for an SVM-based forecast:
library(e1071)   # provides svm()
# prepare sample data in the form of a data frame with columns of timesteps (x) and values (y)
data(AirPassengers)
monthly_data <- unclass(AirPassengers)
months <- 1:144
DF <- data.frame(months,monthly_data)
colnames(DF)<-c("x","y")
# train an svm model, consider further tuning parameters for lower MSE
svmodel <- svm(y ~ x,data=DF, type="eps-regression",kernel="radial",cost=10000, gamma=10)
#specify timesteps for forecast, eg for all series + 12 months ahead
nd <- 1:156
#compute forecast for all the 156 months
prognoza <- predict(svmodel, newdata=data.frame(x=nd))
#plot the results
ylim <- c(min(DF$y), max(DF$y))
xlim <- c(min(nd),max(nd))
plot(DF$y, col="blue", ylim=ylim, xlim=xlim, type="l")
par(new=TRUE)
plot(prognoza, col="red", ylim=ylim, xlim=xlim)

issues using Spatial autocorrelation in R at specific lags (in m)

For a few days I have been struggling with a new, challenging spatial analysis involving spatial autocorrelation in R. Specifically, I am interested in verifying the autocorrelation between points set in a grid of roughly 50 m. My aim is to test the autocorrelation between these points (the locations where I collected the data) and to verify whether the autocorrelation decreases with increasing distance between them (this is expected). My idea is to generate radii of specific distances around each point (50 m, 100 m, 150 m and so on) and to compute Moran's I autocorrelation index at each. Finally, I would like to use ggplot to display the Moran's I results at each specific distance (but this is easy to get once I have the outputs).
My starting data frame contains 4 columns: the ID of the point where the data were collected, the value measured at that specific point (z), a column with longitude (x) and a column with latitude (y). The data look as follows:
# load libraries
library(sp)
library(spdep)
library(splm)
library(ape)
ID<- c(1,2,3,4,5,6)
x<-c(20.99984,20.99889, 20.99806,20.99800,20.99700,20.99732)
y<-c(52.21511,52.21489,52.21464,52.21410,52.21327,52.21278)
z<-c(1.16,0.54,0.89,0.60,1.27,1.45)
data <- data.frame(ID,x,y,z)
I read many things online and found this tutorial
https://mgimond.github.io/Spatial/spatial-autocorrelation-in-r.html#morans-i-as-a-function-of-a-distance-band
which actually shows what I'm interested in. However, it doesn't quite work from the very beginning and, starting from my coordinates, I think there is a problem: I don't know how to transform them into a proper format for R. This is the error message I get:
data <- data.frame(dataPOL$Long , dataPOL$Lat, dataPOL$Human_presence)
coordinates(data) <- c('x','y')
proj4string(data) <- "+init=epsg:4326"
S.dist <- dnearneigh(coordinates, 0, 50) #radius of 50 meters
Error in dnearneigh(coordinates, 0, 50) : Data non-numeric
I did not receive any answer, but I ended up finding a solution:
I have found that the most used packages to work with spatial autocorrelation in R (in my case, Moran I) are spdep and ape.
I tried both: spdep didn't work for me, but ape did. Here is the tutorial I followed for my specific case:
https://stats.idre.ucla.edu/r/faq/how-can-i-calculate-morans-i-in-r/
Before calculating the Moran index, you should generate a distance matrix; I did this with rdist.earth() from the package 'fields'.
This function measures the distance between each pair of data points based on their coordinates. It recognizes that the world is not flat, and as such calculates what are known as great-circle distances. I specified the distance in km for my specific case.
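For completeness, the distance matrix used below (popdists) could be built from the example data frame above like this, assuming the coordinates are in decimal degrees:
library(fields)
# great-circle distances between all pairs of points, in kilometres
popdists <- rdist.earth(cbind(data$x, data$y), miles = FALSE)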
To calculate Moran's I, I ran this:
library(ape)
# radius of 60 m (remember that the fields package works in km or miles)
pop.dists.1 <- (popdists > 0 & popdists <= .06)
Moran.I(mydataframe$myzvariable, pop.dists.1)
This is the output I got at this specific radius:
pop.dists.1 <- (popdists > 0 & popdists <= .06) #60m
Moran.I(dataPOL$Human_presence, pop.dists.1)
$observed
[1] 0.3841241   # Moran index: between -1 and 1; here, points within 60 m are autocorrelated
$expected
[1] -0.009615385
$sd
[1] 0.08767598
$p.value
[1] 7.094019e-06
I repeated this for the distances I am interested in: it works really well, and as the distance increases the Moran's I index approaches 0 (which is what I expected).
I am going to plot the individual outputs using ggplot, as usual, in order to follow the trend of spatial autocorrelation for my z variable.
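A minimal sketch of that last step, looping over a few distance bands and plotting Moran's I against distance (the band limits are just examples), could look like:
library(ape)
library(ggplot2)
radii_km <- c(0.05, 0.10, 0.15, 0.20)   # 50 m, 100 m, 150 m, 200 m
mi <- sapply(radii_km, function(r) {
  w <- 1 * (popdists > 0 & popdists <= r)   # binary weights: pairs within radius r
  Moran.I(data$z, w)$observed
})
ggplot(data.frame(distance_m = radii_km * 1000, moran = mi),
       aes(distance_m, moran)) +
  geom_line() + geom_point() +
  labs(x = "Distance band (m)", y = "Moran's I")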
Hope this will help if needed!

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input in this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean, and the profiles were collected 6 h apart. All 5 profiles are sampled vertically at 0.1 m intervals (and are 6 h apart in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example:
I find the R documentation on this not so great, so what I have done so far is use the package MTS with the ccm function to create cross-correlation matrices. However, interpreting the figures is rather difficult with the sparse documentation, and I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save it in the file cross_correlation_stack.csv, or change the name as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 = read.csv(d1)
# USING package MTS
mod1<-ccm(d2,lag=1000,level=T)
#USING base R
acf(d2,lag.max=1000)
# MQ plot also from MTS package
mq(d2,lag=1000)
The ccm command produces three figures (cross-correlation matrix plots, omitted here), and in parallel the acf command from above produces a fourth plot.
My question now is whether somebody can tell me if I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc., what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it "... calculates autocovariance or autocorrelation ...", which I assume is not what I want. But then again it's the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of 150 (15 meters) the p-values increase. How would you interpret that with regard to my data: 0.1 m intervals of species sightings, with many lags up to 100-150 being significant? Would that mean that peaks in sightings are stable over the 5 time steps at a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at. Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns in d1. Something like:
cc <- vector("list",choose(dim(d1)[2],2))
par(mfrow=c(ceiling(choose(dim(d1)[2],2)/2),2))
cnt <- 1
for(i in 1:(dim(d1)[2]-1)) {
for(j in (i+1):dim(d1)[2]) {
cc[[cnt]] <- ccf(d1[,i],d1[,j],main=paste0("Cross-correlation of ",colnames(d1)[i]," with ",colnames(d1)[j]))
cnt <- cnt + 1
}
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
All of that said, however, the ccf is only defined for data that are more-or-less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association" such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
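A rough sketch of that suggestion with infotheo (the discretisation here uses the package's default equal-frequency binning, which you may want to tune):
library(infotheo)
# mutual information between the first two profiles, after discretising the counts
mutinformation(discretize(d1[, 1]), discretize(d1[, 2]))
# or the full pairwise mutual-information matrix across all profiles
mutinformation(discretize(d1))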

How do I plot multiple data subset forecast predictions onto a single plot

I am new to R and have found this site extremely helpful, so here is my first posted question. I appreciate your assistance and acknowledge the wisdom on this site.
Background: I start with 5 years of weekly sales data and want to develop a forecast for future production based on weekly sales, which have a very strong yearly seasonality. I determined the starting point with:
auto.fit <- auto.arima(arima.ts, stepwise=FALSE, parallel=TRUE, num.cores=6, trace=TRUE )
> ARIMA(2,1,2)(0,0,1)[52] with drift.
Now I wish to check the accuracy by visually plotting multiple 'windows' into the data and comparing them to the actual values (this included logging the AIC values). In other words, the function loops through the data at programmed intervals, recomputing and plotting the forecast onto the same plot. It plotted correctly when my window started at the head of the data. Now I am looking at a moving 104-week window, and the results are all overlaid starting at the 104th observation.
require(forecast) ##[EDITED for simplified clarity]
data <- rep(cos(1:52*(3.1416/26)),5)*100+1000+c(1:26,25:0)
# Create the current fit on data and predict one year out
plot(data, type="l", xlab="weeks", ylab="counts",main="Overlay forecasts & actuals",
sub="green=FIT(1-105,by 16) wks back & PREDICT(26) wks, blue=52 wks")
result <- tryCatch({
  arima.fit  <- auto.arima(tail(data, 156))
  arima.pred <- predict(arima.fit, n.ahead = 52)
  lines(arima.pred$pred, col = "blue")
  lines(arima.pred$pred + 2 * arima.pred$se, col = "red")
  lines(arima.pred$pred - 2 * arima.pred$se, col = "red")
}, error = function(e) { return(e$message) })  ## Trap error
# Loop and perform comparison plotting of forecast to actuals
numtests <- 0   # iteration counter, used to cycle the line type
for (j in seq(1, 105, by = 16)) {
  numtests <- numtests + 1
  result <- tryCatch({
    ############## This plotted correctly as "Arima(head(data,-j),..."
    arima1.fit  <- auto.arima(head(tail(data, -j), 156))
    arima1.pred <- predict(arima1.fit, n.ahead = 52)
    lines(arima1.pred$pred, col = "green", lty = (numtests %% 6) + 1)
  }, error = function(e) { return(e$message) })  ## Trap errors
}
The plots were accurate when all the forecasting included the head of the file; however, the AIC was not comparable between forecast windows because the sample size kept shrinking.
Question: How do I show the complete 5 years of sales data and overlay forecasts at programmed intervals which are computed from a rolling window of 3 years (156 observations)?
The AIC values logged are comparable using the rolling window approach, but all the forecasts overlay starting at observation 157. I tried making the data into a time series and found the initial data plotted correctly on a time axis, but the forecasts were not time series, so they did not display.
This is answered in another post Is there an easy way to revert a forecast back into a time series for plotting?
This was initially posted as two unique questions, but they have the same answer.
The core question being addressed is "how to restore the original time stamps to the forecast data". What I have learned through trial and error is "configure, then never lose the time series attribute", by applying these steps:
1: Make a time series. Use the ts() command to create a time series.
2: Subset a time series. Use window() to create a subset of the time series in the for() loop; use start() and end() on the data to show the time-axis positions.
3: Forecast a time series. Use forecast() or predict(), which operate on time series.
4: Plot a time series. When you plot a time series, the time axis will align correctly for additional data added with the lines() command. (Plotting options are user preference.)
The forecasts will then plot over the historical data at the correct time-axis locations; a short sketch of the four steps follows below.
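A minimal sketch of that workflow, reusing the synthetic weekly data from the question (the window positions are only illustrative):
library(forecast)
# 1: make a weekly time series (52 observations per "year")
data <- rep(cos(1:52 * (3.1416 / 26)), 5) * 100 + 1000 + c(1:26, 25:0)
sales_ts <- ts(data, start = c(1, 1), frequency = 52)
plot(sales_ts, xlab = "year", ylab = "counts",
     main = "Rolling-window forecasts over actuals")
# 2-4: subset with window(), forecast, and overlay with lines();
#      fc$mean keeps the original time stamps, so it lands in the right place
for (endyr in c(3, 4)) {
  train <- window(sales_ts, end = c(endyr, 52))   # rolling training window
  fc <- forecast(auto.arima(train), h = 52)       # forecast one year ahead
  lines(fc$mean, col = "green")
}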
The code is here: Is there an easy way to revert a forecast back into a time series for plotting?
