I am using an xgboost model to predict onto a raster stack. I have successfully used the same approach with CART and Random Forest models:
library(raster)
library(xgboost)
# create a RasterStack or RasterBrick with a set of predictor layers
logo <- brick(system.file("external/rlogo.grd", package="raster"))
names(logo)
# known presence and absence points
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
# extract values for points
xy <- rbind(cbind(1, p), cbind(0, a))
v <- data.frame(cbind(pa=xy[,1], extract(logo, xy[,2:3])))
xgb <- xgboost(data = data.matrix(subset(v, select = -c(pa))), label = v$pa,
nrounds = 5)
raster::predict(model = xgb, logo)
But with xgboost I get the following error:
Error in xgb.DMatrix(newdata, missing = missing) :
xgb.DMatrix does not support construction from list
The problem is that predict.xgb.Booster does not accept a data.frame for argument newdata (see ?predict.xgb.Booster). That is unexpected (all common predict.* methods take a data.frame), but we can work around it. I show how to do that below, using the "terra" package instead of the obsolete "raster" package (but the solution is exactly the same for either package).
The example data
library(terra)
library(xgboost)
logo <- rast(system.file("ex/logo.tif", package="terra"))
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
xy <- rbind(cbind(1, p), cbind(0, a))
v <- extract(logo, xy[,2:3])
xgb <- xgboost(data = data.matrix(v), label=xy[,1], nrounds = 5)
The work-around is to write a prediction function that first coerces the data.frame with "new data" to a matrix. We can then use that function with predict<SpatRaster>:
xgbpred <- function(model, data, ...) {
predict(model, newdata=as.matrix(data), ...)
}
p <- predict(logo, model=xgb, fun=xgbpred)
plot(p)
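For reference, the same workaround should also work with the raster package, since raster::predict likewise accepts a fun argument. A minimal sketch, assuming the xgb model and xgbpred function from above:

# same fix with the "raster" package: pass the coercing wrapper as `fun`
logo_r <- raster::brick(system.file("external/rlogo.grd", package = "raster"))
pr <- raster::predict(logo_r, model = xgb, fun = xgbpred)
plot(pr)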
I have to analyze an experimental data set to find the most effective combination of conditions for a molecular biology reaction.
The experiment has four factors: Temperature, RPM, Time, and Catalytic activity. I am measuring the efficiency of the reaction (EE). How can I find the combination of the four factors that gives the highest efficiency (EE)?
There are no repeated measurements; all data points are independent experimental observations.
As I understand it, EE is parametric (continuous) data and the factors are categorical (fixed combinations).
Do I have to go for a four-way ANOVA?
If so, is this model correct for the analysis?
library(lsmeans)
model <- lm(EE ~ Temperature + RPM + Time + Catalytic +
              Temperature:RPM +
              Temperature:Time +
              Temperature:Catalytic +
              RPM:Time +
              RPM:Catalytic +
              Time:Catalytic +
              Temperature:RPM:Time +
              Temperature:RPM:Catalytic +
              Temperature:Time:Catalytic +
              RPM:Time:Catalytic +
              Temperature:RPM:Time:Catalytic, data = df)
And then, how can I get the significance values for each pairwise comparison?
Here is the sample data set for an example.
> dput(df)
structure(list(TEMPERATURE = c(40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 45, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
45, 45, 45), RPM = c(150, 150, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 200,
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,
150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 200, 200), TIME = c(24, 24, 24,
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96,
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72,
96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72,
72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48,
48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24,
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96,
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72,
96, 96, 96, 96, 96), CAT = c(4, 6, 8, 10, 12, 4, 6, 8, 10, 12,
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10,
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6,
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12,
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10,
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6,
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12),
EE = c(50, 53, 54, 57, 59, 53, 56, 59, 61, 64, 57, 58, 60,
62, 63, 56, 54, 52, 55, 55, 44, 48, 50, 50, 54, 49, 52, 56,
57, 56, 52, 56, 57, 58, 66, 46, 48, 48, 52, 49, 53, 57, 59,
62, 64, 54, 58, 60, 64, 66, 55, 59, 61, 63, 65, 54, 59, 64,
65, 67, 49, 51, 53, 54, 59, 50, 54, 63, 64, 64, 52, 56, 56,
59, 57, 52, 55, 58, 60, 63, 52, 56, 58, 61, 63, 54, 55, 58,
63, 63, 56, 58, 62, 62, 65, 57, 59, 62, 63, 66, 42, 42, 51,
54, 56, 46, 50, 52, 56, 58, 48, 51, 54, 55, 57, 48, 53, 56,
57, 61)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -120L), spec = structure(list(cols = list(
TEMPERATURE = structure(list(), class = c("collector_double",
"collector")), RPM = structure(list(), class = c("collector_double",
"collector")), TIME = structure(list(), class = c("collector_double",
"collector")), CAT = structure(list(), class = c("collector_double",
"collector")), EE = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Try something like this:
library(rsm)
mod = rsm(EE ~ SO(Temperature, RPM, Time, Catalytic), data = data)
summary(mod)
This will fit a second-order surface (model equation includes all predictors, two-way interactions, and squares). The summary shows a stationary point and related statistics. If all the eigenvalues are negative, then it is a peak. Otherwise you have some kind of saddle point.
This model differs from the one in the OP in that it has no three-way or four-way interactions, but includes the squares of the predictors, which are really essential for fitting a second-order response surface.
Lots more details
I had to revise this a bit to account for the fact that R is case-sensitive!
> mod = rsm(EE ~ SO(TEMPERATURE, RPM, TIME, CAT), data = df)
There is an issue here in that RPM has only two values, so we can't estimate a pure quadratic effect in it. There is therefore one NA coefficient, and that messes up the computation of the stationary point. However, we can still plot the fitted surface (despite a few warning messages):
> par(mfrow = c(2,3))
> contour(mod, ~TEMPERATURE+RPM+TIME+CAT)
It looks like we are best off with large CAT and lower RPM (see that plot), so look again:
> par(mfrow=c(1,1))
> contour(mod, ~ TEMPERATURE + TIME, at = list(CAT = 12, RPM = 150))
So visually, we seem to get the best response at around temperature 43.5, time 65, catalyst 12, and rpm 150.
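To check that read-off numerically, we can predict from the fitted surface at that point (a quick sketch; expect a warning about the rank-deficient fit because of the NA coefficient noted above):

> predict(mod, newdata = data.frame(TEMPERATURE = 43.5, TIME = 65, RPM = 150, CAT = 12))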
If you insist on modeling these as factors, it can be done, but you need to convert all the predictors to factors. This is a common error; you can have a designed experiment with only a few distinct values of a quantitative variable, but R does not read your mind and assume it's a factor; you have to convert it to one. In the following I have opted to fit a model with up to 2-way interactions.
> facmod = lm(EE ~ (factor(TEMPERATURE) + factor(TIME) + factor(RPM) + factor(CAT))^2, data = df)
> library(emmeans)
> emmip(facmod, TIME ~ TEMPERATURE | CAT*RPM)
The highest fitted response is at catalyst 12, RPM 150, temperature 42.5 or larger, and time 96. It is clear that 150 RPM is better (left vs. right comparisons) and that high CAT is better (comparing panels vertically). These are different models and somewhat different results. I like the rsm approach better, as it is more systematic.
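Coming back to the pairwise comparisons asked about in the question: with the factor model, emmeans can produce them directly. A minimal sketch (which factor to compare, and at which levels of the others, is a judgment call):

> emm <- emmeans(facmod, ~ TIME | CAT * RPM)
> pairs(emm)   # Tukey-adjusted p-values for each pairwise TIME comparison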
For a screening DOE, you collected more data than was needed.
Here is a starting point, I welcome additional comments.
I would model the linear combination of all of your factors:
model <- lm(EE ~ TEMPERATURE + RPM + TIME + CAT, data = df)
summary(model)
Call:
lm(formula = EE ~ TEMPERATURE + RPM + TIME + CAT, data = df)
Residuals:
Min 1Q Median 3Q Max
-10.1850 -1.5742 0.3383 1.7767 9.7033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.50833 6.47932 7.950 1.42e-12 ***
TEMPERATURE 0.27000 0.14245 1.895 0.06056 .
RPM -0.10533 0.01163 -9.056 4.10e-15 ***
TIME 0.03639 0.01084 3.358 0.00107 **
CAT 1.20417 0.10281 11.713 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.185 on 115 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6591
F-statistic: 58.52 on 4 and 115 DF, p-value: < 2.2e-16
Look at the signs of the slope estimates. While this simple model assumes a linear relationship, the physical significance of each slope is a comparison of the average response at the low values with the average at the high values.
For example, the slope of the Temperature term is positive. This means that when the temperature increases from the low value (40 C) to the high value (45 C), the efficiency will increase.
Since TEMPERATURE, TIME, and CAT have positive slopes, I would take the largest value available for each. RPM has a negative slope, so I would choose the lowest value available.
Thus my prediction from these experiments would yield EE = 66, while the highest result from the experiment was:
df[which.max(df$EE),]
# A tibble: 1 x 5
TEMPERATURE RPM TIME CAT EE
<dbl> <dbl> <dbl> <dbl> <dbl>
1 42.5 150 96 12 67
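As a quick check (a sketch using the model object fitted above), predicting at the chosen settings reproduces that figure:

predict(model, newdata = data.frame(TEMPERATURE = 45, RPM = 150, TIME = 96, CAT = 12))
# about 66, as read off from the coefficients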
Now you could investigate the non-linear relationships by looking at the results from this model:
Call:
lm(formula = EE ~ TEMPERATURE * RPM * TIME * CAT, data = df)
Here the slopes of the interaction terms are orders of magnitude smaller than the linear terms. This could be misleading if the variables are not normalized.
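One way to guard against that, sketched here, is to standardize the predictors before fitting, so the interaction coefficients are on comparable scales:

model_int <- lm(EE ~ scale(TEMPERATURE) * scale(RPM) * scale(TIME) * scale(CAT), data = df)
summary(model_int)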
Good luck.
I have a time series data set with 3 measurement variables and about 2000 samples. I want to classify samples into 1 of 4 categories using an RNN or 1D CNN model with Keras in R. My problem is that I am unable to successfully reshape the data with the k_reshape() function.
I am following along with Ch. 6 of Deep Learning with R by Chollet & Allaire, but their examples are sufficiently different from my data set that I'm now confused. I've tried to mimic the code from that chapter of the book, to no avail. Here's a link to the source code for the chapter.
library(keras)
df <- data.frame()
for (i in c(1:20)) {
time <- c(1:100)
var1 <- runif(100)
var2 <- runif(100)
var3 <- runif(100)
run <- data.frame(time, var1, var2, var3)
run$sample <- i
run$class <- sample(c(1:4), 1)
df <- rbind(df, run)
}
head(df)
#   time      var1      var2         var3 sample class
# 1    1 0.4168828 0.1152874 0.0004415961      1     4
# 2    2 0.7872770 0.2869975 0.8809415097      1     4
# 3    3 0.7361959 0.5528836 0.7201276931      1     4
# 4    4 0.6991283 0.1019354 0.8873193581      1     4
# 5    5 0.8900918 0.6512922 0.3656302236      1     4
# 6    6 0.6262068 0.1773450 0.3722923032      1     4
k_reshape(df, shape(10, 100, 3))
# Error in py_call_impl(callable, dots$args, dots$keywords) :
# TypeError: Failed to convert object of type <class 'dict'> to Tensor. Contents: {'time': [1, 2, 3, 4, 5, ...
I'm very new to reshaping arrays, but I would like to have an array with the shape (samples, time, features). I would love to hear suggestions on how to properly reshape this array, or guidance on how this data should be treated for a DL model if I'm off base on that front.
I found two solutions to my question. My confusion stemmed from the error message from k_reshape(), which I did not know how to interpret.
1. Use the array_reshape() function from the reticulate package.
2. Use the k_reshape() function from keras, but this time with the appropriate shape.
Here is the code I successfully executed:
# generate data frame
dat <- data.frame()
for (i in c(1:20)) {
  time <- c(1:100)
  var1 <- runif(100)
  var2 <- runif(100)
  var3 <- runif(100)
  run <- data.frame(time, var1, var2, var3)
  run$sample <- i
  run$class <- sample(c(1:4), 1)
  dat <- rbind(dat, run)
}
dat_m <- as.matrix(dat) # convert data frame to matrix
#   time      var1      var2         var3 sample class
# 1    1 0.4168828 0.1152874 0.0004415961      1     4
# 2    2 0.7872770 0.2869975 0.8809415097      1     4
# 3    3 0.7361959 0.5528836 0.7201276931      1     4
# 4    4 0.6991283 0.1019354 0.8873193581      1     4
# 5    5 0.8900918 0.6512922 0.3656302236      1     4
# 6    6 0.6262068 0.1773450 0.3722923032      1     4
# solution with reticulate's array_reshape function
dat_array <- reticulate::array_reshape(x = dat_m[,c(2:4)], dim = c(20, 100, 3))
dim(dat_array)
# [1] 20 100 3
class(dat_array)
# [1] "array"
# solution with keras's k_reshape
dat_array_2 <- keras::k_reshape(x = dat_m[,c(2:4)], shape = c(20, 100, 3))
dim(dat_array_2)
# [1] 20 100 3
class(dat_array_2)
# [1] "tensorflow.tensor" "tensorflow.python.framework.ops.Tensor"
# [3] "tensorflow.python.framework.ops._TensorLike" "python.builtin.object"
A few notes:
Conceptually, this reshaping makes more sense to me as a cast or spreading of the data in R parlance.
The output of array_reshape() is an array, but k_reshape() outputs a TensorFlow tensor object. Both worked for me in creating deep learning networks, but I find the array class much more interpretable.
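For what it's worth, once the data are in (samples, time, features) form, a small 1D CNN in the spirit of the book's Ch. 6 examples can consume them. This is only a sketch under the assumption that keras/TensorFlow is installed and configured; the layer sizes are arbitrary, and the class labels are shifted to 0-based as sparse_categorical_crossentropy expects:

library(keras)
# one 0-based label per sample (class is constant within each sample)
labels <- dat$class[!duplicated(dat$sample)] - 1
model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu",
                input_shape = c(100, 3)) %>%       # (time, features)
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 4, activation = "softmax")   # 4 classes
model %>% compile(optimizer = "rmsprop",
                  loss = "sparse_categorical_crossentropy",
                  metrics = "accuracy")
model %>% fit(dat_array, labels, epochs = 5, batch_size = 4)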