I have used ctree() in the R party package to generate a regression tree fitting points sampled from a raster of my response variable to values sampled from a collection of rasters of possible driving variables. I would now like to apply the resulting BinaryTree object to a collection of equivalent rasters to predict a raster surface of the response variable.
Naively, I could make a data frame by extracting values from each of the input rasters with the raster functions values() or getValues() and pass the resulting data frame as the newdata= argument to treeresponse(), but unless I'm misunderstanding something, wouldn't this allocate memory for every cell in each raster? How can I do this while taking advantage of the capabilities of the raster package (or similar) to read raster data in blocks?
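For concreteness, here is a sketch of what I mean by the naive approach (hypothetical file and layer names; fit stands for the BinaryTree returned by ctree(), and the stack's layer names are assumed to match the variables in the model formula):

library(raster)
library(party)

# hypothetical predictor rasters; layer names must match the ctree() formula variables
predictors <- stack("var1.tif", "var2.tif", "var3.tif")

# naive approach: every cell of every layer is read into memory at once
newdat <- as.data.frame(getValues(predictors))
pred <- treeresponse(fit, newdata = newdat)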
This question is related, but it deals with fitting models rather than applying them, and the solution presented only works for linear models, not regression trees: https://gis.stackexchange.com/questions/72648/how-to-do-regression-analysis-out-of-memory-on-a-set-of-large-rasters-in-r
I have the following problem. I want to build a model for land cover classification. My data are multitemporal remote sensing data with several bands. For training, I created stratified, randomly distributed points and extracted the spectral data at their positions. With these data a random forest (rpart) was trained using the mlr3 package, and for accuracy assessment a repeated spatial cross-validation was performed using mlr3spatiotempcv. The model resulting from the training step is, after extraction, stored in an R object of class rpart.

The terms field of this object stores the variable names: all of the bands I used, but also the spatial x and y coordinates. This causes problems when predicting new data. I used the terra package and got an error that the x and y layers are missing from my input data, which makes sense, because they are stored in the terms field of the model. But as I understand it, the coordinates should not be variables of the model; they are only used for spatial resampling, not for prediction. I "solved" this by removing the x and y coordinates during training and performing an ordinary non-spatial cross-validation instead. After that, the prediction works perfectly.
So my question is: how can I train a model with the mlr3 package on data containing coordinates, so that I can perform spatial cross-validation, and then use this model to predict a new raster?
You have found a bug. When the task is created from a data.frame instead of an sf object, coords_as_features is set to TRUE; the default should be FALSE. You can install a fixed version of the package with remotes::install_github("mlr-org/mlr3spatiotempcv"). The fix should be included in the next CRAN release. Thanks for reporting.
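Until the fix is on CRAN, you can also set coords_as_features explicitly when creating the task. A minimal sketch, assuming a data.frame train_df with the band columns, a class target and coordinate columns x/y (the exact constructor arguments may differ between package versions):

library(mlr3)
library(mlr3spatiotempcv)

task <- as_task_classif_st(
  train_df,
  target = "class",
  coordinate_names = c("x", "y"),
  coords_as_features = FALSE,   # keep the coordinates out of the feature set
  crs = "EPSG:32632"            # hypothetical CRS of the training points
)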
This causes problems when predicting new data.
Why do you use the models from resampling to predict new data? Usually, you estimate the performance of the final model with (spatial) cross-validation, but the final model used to predict new data is fitted on the complete data set.
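In other words, something like this minimal sketch (assuming the task created above and a SpatRaster bands whose layer names match the task's feature names):

library(mlr3)
library(mlr3spatiotempcv)
library(terra)

learner <- lrn("classif.rpart")

# 1) estimate performance with repeated spatial cross-validation
resampling <- rsmp("repeated_spcv_coords", folds = 5, repeats = 10)
rr <- resample(task, learner, resampling)
rr$aggregate(msr("classif.ce"))

# 2) fit the final model on the complete data set and apply it to the raster
learner$train(task)
landcover <- terra::predict(bands, learner$model, type = "class")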
I have been using the variofit function in R's geoR package to fit semivariogram models to some spatial data, and I am confused by a couple of the models that have been generated. For these few models I get a fitted range for the autocorrelation but no partial sill. I was told that even without a sill the model should still have some shape reflecting the range, but plotting these models gives the flat lines shown in the attached screenshot. I do not think it is a matter of bad initial values, as I let variofit select the best initial values from a matrix of candidate values built with expand.grid. I would like to know whether this is being plotted correctly, contrary to what I've been told, and what exactly it means to have a range but no partial sill. When I used an alternative fitting function from gstat (fit.variogram), these models could be fitted with a periodic or wave model, though poorly so and probably overfitted, so is this some indication of that which variofit just cannot plot? I unfortunately can't share the data, but here is an example of the code I used to fit these models, in case it helps:
library(geoR)

# geodata object from the jittered coordinates and the log-transformed response
# (assuming jitteryPC holds the two coordinate columns)
geo.entPC <- as.geodata(cbind(jitteryPC, log.PC[, 5]), coords.col = 1:2, data.col = 3)

# grid of candidate initial values for (partial sill, range)
test.pc.grid2 <- expand.grid(seq(0, 2, 0.2), seq(0, 100, 10))

variog.function.col2 <- function(x) {
  # binned empirical variogram
  vario.cloud <- variog(x, estimator.type = "classical", option = "bin")
  # fit a model, letting variofit pick the best starting values from the grid
  variogram.mod <- variofit(vario.cloud, ini.cov.pars = test.pc.grid2,
                            fix.nugget = FALSE, weights = "equal")
  plot(vario.cloud)
  lines(variogram.mod, col = "red")
  summary(x)
}

variog.function.col2(geo.entPC)
From the attached plot showing the empirical variogram, I would not expect to find any sensible spatial correlation. This is consistent with the fitted variogram, which is essentially a pure nugget model. The reported spatial range might be an artifact of the numerical optimization, or the partial sill might (numerically) differ from 0 only at a digit that is not shown in the summary of the fitted variogram. In any case, no matter what the range is, with a vanishingly small partial sill the spatial correlation is negligible.
Depending on the data, it is sometimes beneficial to limit the maximum distance of pairs used to calculate the empirical variogram - but make sure to have "enough" pairs in each bin.
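A minimal sketch of that, reusing the geodata object from the question (capping the lag distance at half the largest inter-point distance is just a common rule of thumb, not a recommendation for these particular data):

library(geoR)

max.d <- 0.5 * max(dist(geo.entPC$coords))
vario.capped <- variog(geo.entPC, estimator.type = "classical",
                       option = "bin", max.dist = max.d)
plot(vario.capped)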
In a paper under review, I have a very large dataset with a relatively small number of imputations. The reviewer asked me to report how many nodes were in the tree I generated using the CART method within MICE. I don't know why this is important, but after hunting around for a while, my own interest is piqued.
Below is a simple example using this method to impute a single value. How many nodes are in the tree that the missing value is being chosen from? And how many members are in each node?
library(mice)

data(whiteside, package = "MASS")
data <- whiteside
data[1, 2] <- NA                      # introduce a single missing value in Temp

impute <- mice(data, m = 100, method = "cart")
impute2 <- complete(impute, "long")   # stack the 100 completed data sets
I guess whiteside is only used as an example here, so your actual data looks different.
I can't easily get the number of nodes for the trees generated in mice. The first problem is that it isn't just one tree: as the package name says, mice stands for Multivariate Imputation by Chained Equations, which means multiple CART trees are created sequentially, and each incomplete variable is imputed by a separate model.
From the mice documentation:
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
If you really want the number of nodes for each model used, you would probably have to modify the mice package itself and add some logging there.
Here is how you might approach this:
Calling impute <- mice(data, m = 100, method = "cart") gives you an S3 object of class mids that contains information about the imputation (but not the number of nodes for each tree).
But you can inspect impute$formulas, impute$method, and impute$nmis to see which formulas and methods were used and which variables actually had missing values.
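For the whiteside example above, these slots can be inspected directly:

impute$formulas$Temp   # formula used to impute Temp (the variable set to NA above)
impute$method          # imputation method chosen for each column
impute$nmis            # number of missing values per column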
From the mice.impute.cart documentation you can see that mice uses rpart internally to create the classification and regression trees.
Since the mids object does not contain information about the fitted trees, I'd suggest you run rpart manually with the formulas from impute$formulas.
Like this:
library("rpart")
rpart(Temp ~ 0 + Insul + Gas, data = data)
This will print the fitted tree and its nodes. It won't really be the tree used in mice; as I said, mice runs multiple chained equations, i.e. multiple, possibly different trees one after another (take a look at the algorithm description at https://stefvanbuuren.name/fimd/sec-cart.html for the univariate-missingness case with CART). But it can at least indicate whether applying rpart to your specific data gives a useful model and thus good imputation results.
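If you go that route, the node counts you were asked about can be read off the frame component of the rpart object (one row per node). A minimal sketch with the whiteside example from above:

library(rpart)

fit <- rpart(Temp ~ 0 + Insul + Gas, data = data)

nrow(fit$frame)                  # total number of nodes in the tree
sum(fit$frame$var == "<leaf>")   # number of terminal nodes (leaves)
fit$frame$n                      # number of observations falling into each node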
I'm looking for algorithms to create bins of variables in order to reduce the noise.
I have found several libraries for that, one of them being the chi2 function from the discretization package:
https://www.rdocumentation.org/packages/discretization/versions/1.0-1/topics/chi2
The documentation has the following example:
data(iris)
#---cut-points
chi2(iris,0.5,0.05)$cutp
#--discretized dataset using Chi2 algorithm
chi2(iris,0.5,0.05)$Disc.data
This works for this dataset, but if I train a model after transforming the data, then in order to make predictions on new records I will have to apply the same cuts that were used here. My question is: is there any method or library that stores the bin cuts in a way that can easily be applied to new data, similar to a predict method, without any custom function?
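For illustration, this is the kind of manual step I would like to avoid, re-applying the stored cut points to new values myself (assuming res$cutp[[1]] holds the numeric cut points for the first column, as the discretization functions usually return):

library(discretization)

data(iris)
res <- chi2(iris, 0.5, 0.05)

# re-apply the stored cut points of Sepal.Length to new, unseen values
new.sepal.length <- c(4.9, 6.1, 7.3)
cut(new.sepal.length, breaks = c(-Inf, res$cutp[[1]], Inf), labels = FALSE)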
I am currently trying to move from Matlab to R.
I have 2D measurements consisting of irradiance over time and wavelength, together with quality flags and uncertainty and error estimates.
In Matlab I extended the timeseries object to store both the wavelength array and the auxiliary data.
What is the best way in R to store this data?
Ideally I would like this data to be stored together such that e.g. window(...) keeps all data synchronized.
So far I have looked at the different time series classes such as ts, zoo, etc., and at some spatio-temporal classes. However, none of them lets me attach auxiliary data to observations or gives me a secondary axis.
Not totally sure what you want, but here is a simple tutorial covering R's "ts" and "zoo" time series classes:
http://faculty.washington.edu/ezivot/econ424/Working%20with%20Time%20Series%20Data%20in%20R.pdf
and here is a more comprehensive overview of many more classes (see the Time Series Classes section):
http://cran.r-project.org/web/views/TimeSeries.html
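Not sure this covers the secondary-axis requirement, but here is a minimal sketch of the zoo approach with made-up data: one multivariate series per quantity (columns = wavelengths), merged on a shared time index so that window() subsets everything together.

library(zoo)

times <- as.POSIXct("2024-01-01", tz = "UTC") + 3600 * 0:9
wavelengths <- c(400, 500, 600, 700)   # hypothetical wavelength grid

irr  <- zoo(matrix(runif(10 * 4), 10, 4), order.by = times)                     # irradiance
flag <- zoo(matrix(sample(0:1, 40, replace = TRUE), 10, 4), order.by = times)   # quality flags
colnames(irr) <- colnames(flag) <- paste0("wl", wavelengths)

# merge on the common time index; window() then keeps both quantities in sync
combined <- merge(irradiance = irr, flag = flag)
window(combined, start = times[3], end = times[6])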