I am trying to use the function variogramST from the R package gstat to calculate a spatio-temporal variogram.
There are 12 years of data with 20,000 data points at irregular locations in space and time (no full or partial grid), so I have to use the STIDF class from the spacetime package for irregular data sets. I would like a temporal semivariogram with reference points at 0, 90, 180, 270 days and so on, up to several years. Unfortunately, both computational and memory problems occur. When the command
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1)
is run without further arguments, the semivariogram only takes very short time periods into account as reference points, which does not seem to capture the inherent data structure appropriately.
There are more arguments for this function at the user's disposal, but I am not sure how to parameterize them correctly: tlag, tunit and twindow. Specifically, I am wondering how they interact and how I can achieve the goal described above. So I tried the following code
samplevariogram <- variogramST(formula=formula_gstat, data=STIDF1, tlag=..., tunit=..., twindow=...)
The following code is not working due to memory issues on my computer with 32 GB of RAM:
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1,tlag=90*(0:20), tunit="days")
and it might be flawed in other respects, too. Furthermore, the latter line of code also seems infeasible in terms of computation time.
Does someone know how to specify the variogramST function from the gstat package correctly, aiming at the desired time intervals?
Thanks
If I understand correctly, the twindow argument should be the number of observations to include when calculating the space-time variogram. Assuming your 20k points are distributed more or less evenly over the 12 years, you have about 1,600 points per year. Again, assuming I understand things correctly, if you wanted to include about two years of data in the temporal autocorrelation calculations, you would do:
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1,tlag=90*(0:20), tunit="days",twindow=2*1600)
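If memory remains a problem, it may help to use fewer temporal lags and explicitly coarse spatial bins, since both reduce the number of space-time lag classes that have to be computed and held in memory. A minimal sketch, assuming the argument names of recent gstat versions (check ?variogramST) and hypothetical values for cutoff and width:
library(gstat)
library(spacetime)
## Sketch only: coarser lags keep the set of space-time lag classes small.
samplevariogram <- variogramST(
  formula = formula_gstat,  # your model formula
  data    = STIDF1,         # irregular space-time data (STIDF)
  tlags   = 90 * (0:8),     # 0, 90, ..., 720 days: fewer lags than 0:20
  tunit   = "days",
  twindow = 2 * 1600,       # about two years of observations, as above
  cutoff  = 50000,          # hypothetical spatial cutoff (map units)
  width   = 10000           # hypothetical spatial bin width
)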
Related
I have a physical time series covering a range of two years, sampled at a frequency of 30 minutes, but there are multiple wide intervals of lost data, as you can see here:
I tried the function na.interp from the forecast package, with a bad result (shown above):
sapply(dataframeTS[2:10], na.interp)
I'm looking for a more useful method.
UPDATE:
Here is more info about the pattern I want to capture, specifically the raw data. This subsample belongs to May.
You might want to try the **imputeTS** package. It's an R package dedicated to time series missing value imputation.
The na_seadec(), na_seasplit() and na_kalman() functions might be interesting here.
There are many more algorithm options; you can find a list in the paper about the package.
In this specific case I would try:
na_seasplit(yourData)
or
na_kalman(yourData)
or
na_seadec(yourData)
Be aware that you might need to supply the seasonality information correctly with the time series (you have to create a ts object and set its frequency parameter).
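To illustrate the frequency point: with a 30-minute sampling interval there are 48 observations per day, so a daily seasonal pattern corresponds to frequency = 48. A minimal sketch, where the column name value is a placeholder for one of your numeric columns:
library(imputeTS)
## 48 half-hour observations per day -> frequency = 48 encodes a daily cycle
x <- ts(dataframeTS$value, frequency = 48)
x_imputed <- na_seadec(x)  # removes the seasonal component, imputes, adds it back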
Still, it might not work out at all; you will have to try.
(if you can provide the data I'll also give it a try)
I am trying to figure out how to approach a data problem that includes observations of multiple equipment units' pressure and temperature measures. The measures are available for a few years as daily or nearly daily values.
This seems like a (multivariate) time series problem, and I have found some good examples. However, because the data set consists of multiple measures taken for each equipment unit, I am a bit stumped on how to proceed. Should I fit a separate time series model for each piece of equipment? That seems intuitively wrong, but I am really not sure which package, or even which approach, I can use to work through this.
I would very much appreciate a recommendation or link to some resources.
I am working with an hourly dataset of air temperature, recorded at ~200 stations over a relatively small area. I chose a space-time variogram (e.g. sum-metric) to fit my data and am now trying to make predictions at my same stations in order to fill NA (missing value) gaps. When using the krigeST() function on daily aggregated data everything seems to go smoothly, but when I use it at the original hourly resolution I always get the following error:
Error in chol.default(A) : the leading minor of order 68 is not positive definite
I googled it and found that it is related to a matrix not being positive definite. However, I am not sure why this happens and was wondering if any of you knows a way of fixing this (or a workaround to avoid it).
There are several possibilities that lead to a singular covariance matrix. Two common ones are:
duplicate observations (identical location & time stamp),
a variogram model that does not discriminate observations sufficiently, leading to near-perfectly correlated observations.
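For the first case, here is a minimal sketch for spotting exact space-time duplicates, assuming your observations are in an STIDF object called stdata (the name is a placeholder):
library(sp)
library(spacetime)
## An STIDF stores one spatial point and one time stamp per record,
## so exact duplicates are records sharing both coordinates and time.
xy  <- coordinates(stdata@sp)   # coordinates of every record
tm  <- zoo::index(stdata@time)  # time stamp of every record
dup <- duplicated(cbind(as.data.frame(xy), time = tm))
sum(dup)  # if > 0, remove or aggregate these records before krigeST()
If duplicates are not the issue, refitting the variogram model, for example with a small nugget so that nearby observations are no longer almost perfectly correlated, is worth trying.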
I have a problem with clustering time series in R.
I googled a lot and found nothing that fits my problem.
I have made an STL decomposition of the time series.
The trend component is in a matrix with 64 columns, one for every series.
Now I want to cluster these series into similar groups, taking both the curve shapes and the time shift into account. I found some functions that address one of these aspects, but not both.
First I tried to calculate a distance matrix with the DTW distance; the resulting clusters were based on the values and accounted for the time shift, but not for the shape of the time series. After that I tried some correlation-based clustering, but then the time shift was not recognized and the results did not satisfy my requirements.
Is there a function that covers my problem, or do I have to build something of my own? I am thankful for every kind of help; after two days of tutorials and examples I am totally uninspired. I hope I could explain the problem well enough to you.
I attached a picture showing some example time series. There you can see the problem: the two series in the middle are assigned to one cluster, although the top series and the bottom series each have the same shape as one of the middle ones.
Have you tried the R package dtwclust?
https://cran.r-project.org/web/packages/dtwclust/index.html
(I'm just starting to explore this package, but it seems like it covers a lot of aspects of time series clustering and it has lots of good references.)
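As a starting point, here is a minimal sketch using tsclust() from dtwclust; the matrix name trend_mat (your 64-column trend matrix) and k = 4 are assumptions. The shape-based distance (SBD) is built on normalized cross-correlation, so it is insensitive to time shifts while still comparing curve shapes:
library(dtwclust)
## One series per column of the 64-column trend matrix
series <- as.list(as.data.frame(trend_mat))
## k-Shape style clustering: z-normalized series, shape-based distance
clus <- tsclust(series, type = "partitional", k = 4L,
                preproc = zscore, distance = "sbd", centroid = "shape")
plot(clus)  # plots the members of each cluster with their centroids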
You can use the kml package. It is designed specifically for longitudinal data. You can consult its help, which includes the following example:
### Generation of some data
cld1 <- generateArtificialLongData(25)
### We suspect 3, 4 or 6 clusters, and we want 3 redrawings.
### We want to "see" what happens (so toPlot='both')
kml(cld1,c(3,4,6),3,toPlot='both')
### 4 seems to be the best. We want 10 more redrawings.
### We don't want to watch again; we want the result as fast as possible.
kml(cld1,4,10)
(Image: example cluster result)
I am using R software (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying k-means clustering and plotting the result on a graph.
"'princomp' can only be used with more units than variables"
I then created a test data set of 10 rows and 10 columns which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my data set after running kmeans on it, I can see the extra results column showing which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help; I've been racking my brain over this for a week now.
Thanks guys.
The problem is that you have more variables than sample points, so the principal component analysis that is being done fails.
In the help file for princomp it explains (read ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than dimensions.
Every data point becomes its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful.
You can use prcomp instead of princomp.
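A minimal sketch of that workaround, assuming a 200 x 800 numeric matrix called dat (the names are placeholders):
## prcomp uses the SVD and works with fewer rows than columns
km <- kmeans(dat, centers = 3)  # cluster the rows
pc <- prcomp(dat)               # PCA that tolerates n < p
plot(pc$x[, 1:2], col = km$cluster,
     pch = 19, xlab = "PC1", ylab = "PC2")  # clusters in PC space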