I am working with an hourly dataset of air temperature recorded at ~200 stations over a relatively small area. I chose a space-time variogram (e.g. sum-metric) to fit my data and am now trying to make predictions at the same stations in order to fill NA (missing value) gaps. When using the krigeST() function on daily aggregated data everything seems to go smoothly, but when I use it at the original hourly resolution I always get the following error:
Error in chol.default(A)
the leading minor of order 68 is not positive definite
I googled it and found that it is related to a matrix not being positive definite. However, I am not sure why this happens and was wondering whether anyone knows a way of fixing it (or a workaround to avoid it).
There are several possibilities that lead to a singular covariance matrix. Two common ones:
duplicate observations (identical location and time stamp; a quick check is sketched below),
a variogram model that does not discriminate observations sufficiently, leading to near-perfectly correlated observations.
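For the duplicate case, a minimal sketch of a check, assuming your raw observations are still in a data frame obs with coordinate columns x and y and a timestamp column (these names are placeholders for whatever you used to build the spatio-temporal object):

dup <- duplicated(obs[, c("x", "y", "timestamp")])   # identical location & time stamp
sum(dup)                                             # > 0 means duplicates are present
obs_clean <- obs[!dup, ]                             # drop them before rebuilding the ST object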
I have time series data ranging from 0 to 30 million; it is basically weekly web traffic data. I am working on building a forecasting model with this data and want to understand how to deal with this range of values. I tried a Box-Cox transformation with a Prophet model. I am not sure what metrics I could use to evaluate the performance of the model. The data has a lot of 0s, and I can't remove them from the dataset. Is there a better way to deal with the 0s than the Box-Cox transformation? I had issues with the inverse transformation, but I added a small value (0.1) to the data to avoid negative values.
If your series has a lot of periodic zero data, the Croston method is one way to go. It is basically a forecasting strategy for products with intermittent demand. You can also try exponential smoothing and traditional ARIMA/SARIMA models and clip the negative values in the forecast (depending on your use case).
You can find the Croston method in the forecast package.
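As a minimal sketch (assuming your weekly counts are in a numeric vector called traffic; the name and the 12-week horizon are just placeholders):

library(forecast)

y   <- ts(traffic, frequency = 52)     # weekly data
fit <- croston(y, h = 12)              # Croston forecast, 12 weeks ahead
autoplot(fit)

# if you use ETS/ARIMA instead, you can clip negative point forecasts:
fc <- forecast(auto.arima(y), h = 12)
fc$mean <- pmax(fc$mean, 0)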
Also refer to these links:
https://stats.stackexchange.com/questions/8779/analysis-of-time-series-with-many-zero-values/8782
https://stats.stackexchange.com/questions/373689/forecasting-intermittent-demand-with-zeroes-in-times-series
https://robjhyndman.com/papers/foresight.pdf
I have a data set called Data with 30 scaled and centered features and 1 outcome (column name OUTCOME), covering 700k records, stored in data.table format. I computed its PCA and observed that the first 8 components account for 95% of the variance. I want to train a random forest in h2o, so this is what I do:
Data.pca <- prcomp(Data, retx = TRUE)                                # compute the PCA of Data
Data.rotated <- as.data.table(Data.pca$x)[, c(1:8)]                  # keep only the first 8 components
Data.dump <- cbind(Data.rotated, subset(Data, select = c(OUTCOME)))  # PCA scores plus outcomes for training
This way I have a dataset Data.dump with 8 features (the data rotated onto the first 8 principal components), each record paired with its outcome.
First question: is this reasonable, or do I have to somehow permute the outcome vector, or are the two things unrelated?
Then I split Data.dump into two sets, Data.train for training and Data.test for testing, both converted with as.h2o(). Then I feed them to a random forest:
rf <- h2o.randomForest(training_frame = Data.train, x = 1:8, y = 9,
                       stopping_rounds = 2, ntrees = 200,
                       score_each_iteration = TRUE, seed = 1000000)
rf.pred <- as.data.table(h2o.predict(rf, Data.test))
What happens is that rf.pred does not look much like the original outcomes Data.test$OUTCOME. I also tried to train a neural network, which did not even converge and crashed R.
Second question: is this because I am carrying over some mistake from the PCA treatment, or because I set up the random forest badly? Or am I just dealing with awkward data?
I do not know where to start, as I am new to data science, but the workflow seems correct to me.
Thanks a lot in advance.
The answer to your second question (i.e. "is it the data, or did I do something wrong?") is hard to know in advance. This is why you should always make a baseline model first, so you have an idea of how learnable the data is.
The baseline could be h2o.glm(), and/or it could be h2o.randomForest(), but either way without the PCA step. (You didn't say if you are doing a regression or a classification, i.e. if OUTCOME is a number or a factor, but both glm and random forest will work either way.)
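As a rough illustration, a baseline run might look like the sketch below, assuming you rebuild Data.train / Data.test as H2OFrames from the 30 original feature columns plus OUTCOME (i.e. without the PCA step); column names are taken from your question:

library(h2o)
h2o.init()

features <- setdiff(colnames(Data.train), "OUTCOME")

# for a classification OUTCOME you would add family = "binomial" (or "multinomial")
baseline_glm <- h2o.glm(x = features, y = "OUTCOME", training_frame = Data.train)
baseline_rf  <- h2o.randomForest(x = features, y = "OUTCOME",
                                 training_frame = Data.train, ntrees = 200, seed = 1)

h2o.performance(baseline_glm, newdata = Data.test)
h2o.performance(baseline_rf,  newdata = Data.test)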
Turning to your first question: yes, it is a reasonable thing to do, and no, you don't have to (in fact, should not) involve the outcome vector.
Another way to answer your first question is: no, it is unreasonable. It may be that a random forest can see all the relations itself without you needing to use a PCA. Remember that when you use a PCA to reduce the number of input dimensions you are also throwing away a bit of signal. You said that the 8 components only capture 95% of the variance, so you are giving up some signal in return for having fewer inputs, which means you are optimizing for simplicity at the expense of prediction quality.
By the way, concatenating the original inputs and your 8 PCA components is another approach: you might get a better model by giving it this hint about the data. (But you might not, which is why getting some baseline models first is essential before trying these more exotic ideas.)
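In code, that hint could be as simple as the line below (reusing the Data and Data.rotated objects from your question; whether it helps is an empirical question):

Data.combined <- cbind(Data.rotated, Data)  # 8 PCA scores + the 30 original features + OUTCOME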
I am trying to use the function variogramST from the R package gstat to calculate a spatio-temporal variogram.
There are 12 years of data with 20,000 data points at irregular locations in space and time (no full or partial grid). I have to use the STIDF class from the spacetime package for an irregular data set. I would like a temporal semivariogram with reference points at 0, 90, 180, 270 days, up to a few years, etc. Unfortunately, both computational and memory problems occur. When the command
samplevariogram <- variogramST(formula = formula_gstat, data = STIDF1)
is run without further arguments, the semivariogram takes into account only very short time periods in terms of reference points, which does not seem to capture the inherent data structure appropriately.
There are more arguments for this function at the user's disposal, but I am not sure how to parametrize them correctly: tlag, tunit, twindow. Specifically, I am wondering how they interact and how I can achieve my goal as described above. So I tried the following code:
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1,tlag= ...., tunit=... , twindow= ...)
The following code, in turn, is not working due to memory issues on my computer with 32 GB of RAM:
samplevariogram <- variogramST(formula = formula_gstat, data = STIDF1, tlag = 90 * (0:20), tunit = "days")
but it might also be flawed in other respects. Furthermore, the latter line of code seems infeasible in terms of computation time.
Does someone know how to specify the variogramST function from the gstat package correctly, aiming at the desired time intervals?
Thanks
If I understand correctly, the twindow argument should be the number of observations to include when calculating the space-time variogram. Assuming your 20k points are distributed more or less evenly over the 12 years, you have about 1,600 points per year. Again, assuming I understand things correctly, if you wanted to include about two years of data in the temporal autocorrelation calculations, you would do:
samplevariogram <- variogramST(formula = formula_gstat, data = STIDF1,
                               tlag = 90 * (0:20), tunit = "days",
                               twindow = 2 * 1600)
First, I gathered from this link Applying a function to multiple columns that using the function() construct would perhaps do what I'm looking for. However, I have not been able to make the leap from thinking about it in the way presented to making it actually work in my situation (or really even knowing where to start). I'm a beginner in R, so I apologize in advance if this is a really newbie question.

My data is a data frame that consists of an event variable (tumor recurrence) and a time variable (follow-up time / time to recurrence), as well as recurrence risk factors (T stage, tumor size, age at diagnosis, etc.). Some risk factors are categorical and some are continuous. I have been running my univariate analyses by hand, one at a time, like this example: univariateageatdx <- coxph(survobj ~ agedx), and then collecting the results. This gets very tedious for multiple factors, and for a few different recurrence types.

I figured there must be a way to write essentially one line of code that takes the coxph equation, applies it to all of my variables of interest, and spits out the univariate analysis results for each factor. I tried using cbind to bind variables (i.e. x <- cbind("agedx", "tumor size")) and then running coxph(recurrencesurvobj ~ x), but this of course just did a multivariate analysis on these variables and didn't split them out as true univariate analyses.
I also tried the following code based on a similar problem that I found on a different site, but it gave the error shown and I don't know quite what to make of it. Is this on the right track?
f <- as.formula(paste('regionalsurvobj ~', paste(colnames(nodcistradmasvssubcutmasR)[6:9], collapse = '+')))
I then ran it as coxph(f), which gave me the results of a multivariate Cox analysis.
Thanks!
Edit: I just fixed the error; I needed to use the column numbers, I suppose, not the names. The changes are reflected in the code above. However, it still runs the selected variables as a multivariate analysis and not as true univariate analyses...
If you want to go the formula route (which in your case, with multiple outcomes and multiple variables, is probably the most practical way to go about it), you need to create one formula per model you want to fit. I've split the steps here a bit (making formulas, making models, and extracting data); they can of course be combined, but keeping them separate allows you to inspect all your models.
# example using the transplant data from the survival package
library(survival)

# make a new event variable (death or no death) to have a dichotomous outcome
transplant$death <- transplant$event == "death"

# make one formula per predictor
univ_formulas <- sapply(c("age", "sex", "abo"),
                        function(x) as.formula(paste('Surv(futime, death) ~', x)))

# make a list of models, one per formula
univ_models <- lapply(univ_formulas, function(x) coxph(x, data = transplant))

# extract the data (here I've gone for the HR and its confidence interval)
univ_results <- lapply(univ_models, function(x) exp(cbind(coef(x), confint(x))))
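If you then want everything in one overview table, a small optional extra step (not part of the steps above) is:

# stack the per-model result matrices into one table
univ_table <- do.call(rbind, univ_results)
univ_table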
I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying to run k-means clustering and plot the result on a graph:
"'princomp' can only be used with more units than variables"
I then created a test data set of 10 rows and 10 columns which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my data set after performing k-means on it, I can see the extra results column which shows which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, I've been wrecking my head over this for a week now.
Thanks guys.
The problem is that you have more variables than sample points and the principal component analysis that is being done is failing.
In the help file for princomp it explains (read ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature
extraction of variables. If a data matrix is supplied (possibly
via a formula) it is required that there are at least as many
units as variables. For Q-mode PCA use ‘prcomp’.
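A tiny toy illustration of that constraint (made-up data, just to show the behaviour of the two functions):

set.seed(1)
m <- matrix(rnorm(10 * 11), nrow = 10, ncol = 11)  # 10 units (rows), 11 variables (columns)
# princomp(m)   # fails: 'princomp' can only be used with more units than variables
prcomp(m)       # works, since prcomp does not have this restriction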
Principal component analysis is underspecified if you have fewer samples than dimensions.
Every data point will be its own principal component. For PCA to work well, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful.
You can use prcomp instead of princomp.
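A minimal sketch of that workaround, assuming your 200 x 800 numeric data set is called Data and that 3 clusters is just a placeholder choice:

km  <- kmeans(Data, centers = 3)           # centers = 3 is an arbitrary example
pca <- prcomp(Data)                        # prcomp handles more columns than rows
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "k-means clusters on the first two principal components")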