R: autoKrige.cv function in automap package generates NaNs

I'm fairly new to R and I am trying to interpolate temperature measurements that were gathered from different stations across the Netherlands. I have data for about 35 stations that take measurements every 10 minutes, covering a timespan of about two weeks. Accordingly, I figured it would be best to write a loop that takes care of this. To see how well the interpolation technique works I want to do a cross validation for every timestamp.
In order to do this I used the autoKrige.cv function from the automap package, and next I used the compare.cv function from the same package to get an overview of the most important statistics for all timestamps. Besides that, I made sure the cross validation is only done if at least 25 stations registered measurements.
The problem, however, is that my code as shown below works most of the time but gives the following warnings in 4 cases:
1. In sqrt(ret[[var.name]]) : NaNs produced
2. In sqrt(ret[[var.name]]) : NaNs produced
3. In sqrt(ret[[var.name]]) : NaNs produced
4. In sqrt(ret[[var.name]]) : NaNs produced
When I try to use the compare.cv command for the total list including all the cross validations it gives me the following error:
"Error in quantile.default(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, :
missing values and NaN's not allowed if 'na.rm' is FALSE"
I'm wondering what causes the autoKrige.cv function to generate NaNs in the cross validation and, more importantly, how I can remove them from results.cv so that I can use the compare.cv function.
rm(list=ls())
# load packages
require(sp)
require(gstat)
require(ggmap)
require(automap)
require(ggplot2)
#load data (download link provided below)
load("download path") # https://www.dropbox.com/s/qmi3loub29e55io/meassurements_aug.RDS?dl=0
# make data spatial and assign spatial coordinate system
coordinates(meassurements) = ~x+y
proj4string(meassurements) <- CRS("+init=epsg:4326")
meassurements_df <- as.data.frame(meassurements)
# loop for cross validation
timestamp <- meassurements$import_log_id
results.cv=list()
for (i in unique(timestamp)) {
  x = meassurements_df[which(meassurements$import_log_id == i), ]
  # only cross-validate when at least 25 stations registered a measurement
  if (sum(!is.na(x$temperature)) >= 25) {
    results.cv[[paste0(i)]] = autoKrige.cv(temperature ~ 1, meassurements[which(meassurements$import_log_id == i & !is.na(meassurements$temperature)), ])
  }
}
# calculate key statistics (RMSE MAE etc)
compare.cv(results.cv)
Thanks!

I came across the same problem and solved it by calling remove.duplicates() from package sp on the SpatialPointsDataFrame used for kriging. Before that, I replaced the relevant variables in the data frame by their mean within each group of duplicates.
library(dplyr)
SPDF@data <- SPDF@data %>%
  group_by(varx, vary, varz) %>%
  mutate_at(vars(one_of(relevant_var)), mean, na.rm = TRUE) %>%
  ungroup()
SPDF <- SPDF %>% remove.duplicates()
At the time I was encountering the same problem the Dropbox link above was not working anymore, so I could not check this specific example.
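On the second half of the question (dropping the affected timestamps so that compare.cv runs), here is a minimal sketch. It uses plain data frames as stand-ins for the cross-validation output, so the column names residual and zscore and the helper has_nan are assumptions made for illustration. With real autoKrige.cv objects the values should sit in the krige.cv_output element, which could not be checked against the original data.

```r
# Stand-in for results.cv: one data frame of cross-validation output per timestamp.
# Real autoKrige.cv objects are assumed to keep these columns in $krige.cv_output.
results.cv <- list(
  "101" = data.frame(residual = c(0.2, -0.1, 0.4), zscore = c(0.5, -0.3, 1.1)),
  "102" = data.frame(residual = c(0.1, 0.3, -0.2), zscore = c(0.2, NaN, -0.6)),
  "103" = data.frame(residual = c(-0.4, 0.0, 0.1), zscore = c(-1.0, 0.0, 0.3))
)

# TRUE if any numeric column of the output contains NA or NaN
has_nan <- function(cv) {
  any(vapply(cv, function(col) is.numeric(col) && anyNA(col), logical(1)))
}

bad <- vapply(results.cv, has_nan, logical(1))
names(results.cv)[bad]            # timestamps that deserve a closer look
results.clean <- results.cv[!bad] # list without the affected timestamps
```

Dropping the affected timestamps only hides the symptom. If duplicate locations are the cause, as in the answer above, it is probably worth inspecting the flagged timestamps before discarding them.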

Related

Fastshap summary plot - Error: can't combine <double> and <factor<919a3>>

I'm trying to get a summary plot using fastshap explain function as in the code below.
p_function_G<- function(object, newdata)
caret::predict.train(object,
newdata =
newdata,
type = "prob")[,"AntiSocial"] # select G class
# Calculate the Shapley values
#
# boostFit: is a caret model using catboost algorithm
# trainset: is the dataset used for bulding the caret model.
# The dataset contains 4 categories W,G,R,GM
# corresponding to 4 diferent animal behaviors
library(caret)
shap_values_G <- fastshap::explain(xgb_fit,
                                   X = game_train,
                                   pred_wrapper = p_function_G,
                                   nsim = 50,
                                   newdata = game_train[which(game_test == "AntiSocial"), ])
However I'm getting error
Error in 'stop_vctrs()':
Can't combine latitude <double> and gender <factor<919a3>>
What's the way out?
I see that you are adapting code from Julia Silge's Predict ratings for board games Tutorial. The original code used SHAPforxgboost for generating SHAP values, but you're using the fastshap package.
Because Shapley explanations are only recently starting to gain traction, there are not many standard data formats yet. fastshap does not like tidyverse tibbles; it only takes matrices or matrix-likes.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. This fails because a matrix can only hold one type (e.g. either double or factor, not both).
I ran into a similar issue and found that you can solve it by passing the data as a data.frame. I don't have access to your full code, but you could try replacing the shap_values_G code block like so:
shap_values_G <- fastshap::explain(xgb_fit,
                                   X = game_train,
                                   pred_wrapper = p_function_G,
                                   nsim = 50,
                                   newdata = as.data.frame(game_train[which(game_test == "AntiSocial"), ]))
Wrap newdata with as.data.frame. This converts the tibble to a dataframe and so shouldn't upset fastshap.
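The point about matrices holding only one type can be seen in base R without fastshap. The sketch below uses a small made-up data frame (game_like is a hypothetical stand-in, not the data from the question) with one numeric and one factor column.

```r
# Hypothetical stand-in for the training data: one numeric and one factor column
game_like <- data.frame(
  latitude = c(52.1, 48.9, 40.7),
  gender   = factor(c("f", "m", "f"))
)

# A matrix holds a single type, so mixing numeric and factor columns
# falls back to character
m_mixed <- as.matrix(game_like)
typeof(m_mixed)

# data.matrix() keeps the result numeric by replacing factors with their
# integer codes; whether those codes are meaningful depends on the model
m_codes <- data.matrix(game_like)
typeof(m_codes)
```

Whether integer codes are acceptable depends on how the model was trained, so this is only meant to show where the type conflict comes from.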

Error related to randomisation test within lapply() function in R

I have 30 datasets that are combined in a data list. I want to analyze the spatial point patterns with the L function along with a randomisation test. The code follows.
The first block works well for a single dataset (data1), but once it is applied to a list of datasets with lapply(), as shown in the second block, it gives me a very long error like so:
"Error in Kcross(X, i, j, ...) : No points have mark i = Acoraceae
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim =
nsim, : Exceeded maximum number of errors"
Can anybody tell me what is wrong with 2nd code?
grp <- factor(data1$species)
window <- ripras(data1$utmX, data1$utmY)
pp.grp <- ppp(data1$utmX, data1$utmY, window=window, marks=grp)
L.grp <- alltypes(pp.grp, Lest, correction = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
plot(L.grp)
plot(LE.grp)
L.LE.sp <- lapply(data.list, function(x) {
  grp <- factor(x$species)
  window <- ripras(x$utmX, x$utmY)
  pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
  L.grp <- alltypes(pp.grp, Lest, correction = "Ripley")
  LE.grp <- alltypes(pp.grp, Lcross, envelope = TRUE)
  result <- list(L.grp = L.grp, LE.grp = LE.grp)
  return(result)
})
# each list element holds both results, so pick the dataset first
plot(L.LE.sp[[1]]$LE.grp)
This question is about the R package spatstat.
It would help if you could add a minimal working example including data which demonstrate this problem.
If that is not available, please generate the error on your computer, then type traceback() and capture the output and post it here. This will trace the location of the error.
Without this information, my best guess is the following:
The error message says No points have mark i=Acoraceae. That means that the code is expecting a point pattern to include points of type Acoraceae but found that there were none. This can happen because in alltypes(... envelope=TRUE) the code generates random point patterns according to complete spatial randomness. In the simulated patterns, the number of points of type Acoraceae (say) will be random according to a Poisson distribution with a mean equal to the number of points of type Acoraceae in the observed data. If the number of Acoraceae in the actual data is small then there is a reasonable chance that the simulated pattern will contain no Acoraceae at all. This is probably what is causing the error message No points have mark i=Acoraceae.
If this interpretation is correct then you should be able to suppress the error by including the argument fix.marks=TRUE, that is,
alltypes(pp.grp, Lcross, envelope=TRUE, fix.marks=TRUE, nsim=99)
I'm not suggesting this is necessarily appropriate for your application, but this should remove the error message if my guess is correct.
In the latest development version of spatstat, available on github, the code for envelope has been tweaked to detect this error.
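To get a feeling for how likely the situation described above is, the following sketch computes the chance that a simulated pattern contains no points of a rare type. It takes the Poisson description from the answer at face value; the count n_rare = 3 is an assumed example value, not taken from the asker's data.

```r
# Assumed number of points of the rare type in the observed pattern
n_rare <- 3
# Number of simulated patterns used for the envelope
nsim <- 99

# Probability that one simulated pattern has no points of the rare type,
# if that count is Poisson with mean n_rare
p_zero_one <- dpois(0, lambda = n_rare)

# Probability that at least one of the nsim simulations has none
p_zero_any <- 1 - (1 - p_zero_one)^nsim

round(c(single = p_zero_one, any = p_zero_any), 4)
```

With only a handful of points of one species, an empty simulation becomes close to certain over 99 runs, which is consistent with the error appearing for some datasets in the list but not for data1.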

Testing Recommendation systems: How to specify how many items were given for the prediction. `calcPredictionAccuracy` function

I am trying to test a binary recommendation system I created with the recommenderlab package. When I run the calcPredictionAccuracy function I get the following error:
Error in .local(x, data, ...) :
You need to specify how many items were given for the prediction!
I have performed numerous searches and can't seem to find any solution to this issue. If I try to add the given argument the error changes to:
error.ubcf<-calcPredictionAccuracy(p.ubcf, getData(test_index, "unknown", given=3))
Error in .local(x, ...) : unused argument (given = 3)
Here is a quick look at my code:
# my data set is binary.watch.ratings
affinity.matrix <- as(binary.watch.ratings,"binaryRatingMatrix")
test_index <- evaluationScheme(affinity.matrix[1:1000], method="split",
train=0.9, given=1)
# creation of recommender model based on ubcf
Rec.ubcf <- Recommender(getData(test_index, "train"), "UBCF")
# creation of recommender model based on ibcf for comparison
Rec.ibcf <- Recommender(getData(test_index, "train"), "IBCF")
# making predictions on the test data set
p.ubcf <- predict(Rec.ubcf, getData(test_index, "known"), type="topNList")
# making predictions on the test data set
p.ibcf <- predict(Rec.ibcf, getData(test_index, "known"), type="topNList")
# obtaining the error metrics for both approaches and comparing them
##error occurs with the following two lines
error.ubcf<-calcPredictionAccuracy(p.ubcf, getData(test_index, "unknown"))
error.ibcf<-calcPredictionAccuracy(p.ibcf, getData(test_index, "unknown"))
error <- rbind(error.ubcf,error.ibcf)
rownames(error) <- c("UBCF","IBCF")
This produces the following error:
error.ubcf<-calcPredictionAccuracy(p.ubcf, getData(test_index, "unknown"))
Error in .local(x, data, ...) :
You need to specify how many items were given for the prediction!
My question is: at what point in my code must I specify how many items are given for the prediction? Is this issue related to the fact that my data is binary?
Thanks
Robert
For topNList, you must specify the number of items you want back. You add this in the predict() function call:
# making predictions on the test data set
p.ubcf <- predict(Rec.ubcf, getData(test_index, "known"), type="topNList", n=10)
# making predictions on the test data set
p.ibcf <- predict(Rec.ibcf, getData(test_index, "known"), type="topNList", n=10)
By varying n, you will be able to see how it impacts your TP/FP/TN/FN accuracy measures, as well as precision/recall. The calculation methodology for these values is at the bottom of this page:
https://github.com/mhahsler/recommenderlab/blob/master/R/calcPredictionAccuracy.R
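To make the effect of n concrete, here is a base R sketch of confusion counts for a single user's top-N list. The helper topn_confusion and its arguments are hypothetical, and the true-negative formula (all items minus the given ones minus the other three counts) is my reading of the linked source, so treat it as an assumption. It would also explain why the package asks how many items were given.

```r
# Hypothetical helper: confusion counts for one user's top-N list.
# recommended: item ids returned by the recommender
# relevant:    held-out ("unknown") items the user actually has
# n_items:     total number of items in the catalogue
# given:       number of items shown to the recommender for this user
topn_confusion <- function(recommended, relevant, n_items, given) {
  tp <- length(intersect(recommended, relevant))
  fp <- length(recommended) - tp
  fn <- length(relevant) - tp
  # the given items can be neither recommended nor held out, so they are
  # excluded from the pool of candidates
  tn <- n_items - given - tp - fp - fn
  c(TP = tp, FP = fp, FN = fn, TN = tn,
    precision = tp / (tp + fp), recall = tp / (tp + fn))
}

relevant <- c("a", "b", "c", "d")
ranked   <- c("a", "x", "b", "y", "z", "c")   # recommender's ranking, best first

top2 <- topn_confusion(ranked[1:2], relevant, n_items = 20, given = 1)
top6 <- topn_confusion(ranked[1:6], relevant, n_items = 20, given = 1)
rbind(top2, top6)
```

In this made-up example a longer list raises recall while precision stays the same, which is the kind of trade-off that varying n lets you inspect.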

"No missing values are allowed" error with kNN in R

I have a data set of 45212 elements with 17 columns and I want to predict the class label in the last column using the kNN algorithm. As far as I can tell everything is OK, but I always end up with this error:
"Error in knn(train = data_train, test = data_test, cl = data_train_labels, :
no missing values are allowed"
here is my code
> data_train <-data[1:25000,]
> data_test <-data[25001:45212,]
> data_train_labels <- data[1:25000, 17]
> data_test_labels <- data[25001:45212, 17]
> install.packages("class")
> library(class)
> data_test_pred <- knn(train=data_train, test=data_test, cl=data_train_labels, k=10)
here is how my data set looks like:
age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no
41,admin.,divorced,secondary,no,270,yes,no,unknown,5,may,222,1,-1,0,unknown,no
I think that your problem is all of the factors in your data. The knn documentation says that it uses Euclidean distance, which does not make sense for factors. Here is a possible solution if you really want to use knn. You can get a distance matrix between the points using daisy in the cluster package. There are several implementations of knn in R but I don't know of one that accepts a distance matrix. You could either write your own (not so difficult) or you could map the distance matrix to a Euclidean space using cmdscale. Then use knn on the projected space.
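As a simpler alternative to the daisy/cmdscale route described above (a different technique, named here plainly), the factors can be dummy-coded so that Euclidean distance is at least defined. The sketch below uses a small made-up stand-in with column names borrowed from the question. It also shows one possible way missing values can sneak in: indexing past the last row of a data frame.

```r
# Small made-up stand-in for the bank data (column names taken from the question)
bank <- data.frame(
  age     = c(58, 44, 33, 47, 35, 28),
  marital = factor(c("married", "single", "married", "married", "married", "single")),
  balance = c(2143, 29, 2, 1506, 231, 447),
  y       = factor(c("no", "no", "yes", "no", "yes", "no"))
)

# Indexing past the last row silently creates a row of NAs, which is one
# possible source of the "no missing values are allowed" error
bank_padded <- bank[1:(nrow(bank) + 1), ]
anyNA(bank_padded)

# Dummy-code the predictors (dropping the label y and the intercept column)
# and put them on a common scale so no single column dominates the distance
X <- model.matrix(y ~ . - 1, data = bank)
X <- scale(X)
labels <- bank$y
```

X and labels could then be split into training and test parts and passed to knn() as train, test and cl. Note that in the question the training matrix still contains the label column, which this sketch keeps out of the predictors.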
I believe that your mistake is: data_train <-data[1:25000,]
You are including your header that you have not normalized. I was able to reproduce the same error. But when I changed to data_train <-data[2:25000,] it ran fine.

Dynamic time-series prediction and rollapply

I am trying to get a rolling prediction of a dynamic timeseries in R (and then work out squared errors of the forecast). I based a lot of this code on this StackOverflow question, but I am very new to R so I am struggling quite a bit. Any help would be much appreciated.
require(zoo)
require(dynlm)
set.seed(12345)
#create variables
x<-rnorm(mean=3,sd=2,100)
y<-rep(NA,100)
y[1]<-x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int<-1:100
dummydata<-data.frame(int=int,x=x,y=y)
zoodata<-as.zoo(dummydata)
prediction<-function(series)
{
  mod<-dynlm(formula = y ~ L(y) + L(x), data = series) #get model
  nextOb<-nrow(series)+1
  #make forecast
  predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
  #strip timeseries information
  attributes(predicted)<-NULL
  return(predicted)
}
rolling<-rollapply(zoodata,width=40,FUN=prediction,by.column=FALSE)
This returns:
20 21 ..... 80
10.18676 10.18676 10.18676
Which has two problems I was not expecting:
Runs from 20->80, not 40->100 as I would expect (as the width is 40)
The forecasts it gives out are constant: 10.18676
What am I doing wrong? And is there an easier way to do the prediction than to write it all out? Thanks!
The main problem with your function is the data argument to dynlm. If you look in ?dynlm you will see that the data argument must be a data.frame or a zoo object. Unfortunately, I just learned that rollapply splits your zoo object into array objects. This means that dynlm, after noting that your data argument was not of the right form, searched for x and y in your global environment, where they were of course defined at the top of your code. That is why every window produced the same forecast. The solution is to convert series into a zoo object. There were a couple of other issues with your code, so I post a corrected version here:
prediction<-function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
# nextOb <- nrow(series)+1 # This will always be 21. I think you mean:
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# make forecast
# predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
# That would work, but there is a very nice function called predict
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
# I'm not sure why you used nextOb-1
attributes(predicted)<-NULL
# I added the square error as well as the prediction.
c(predicted=predicted,square.res=(predicted-zoodata[nextOb,'y'])^2)
}
}
rollapply(zoodata,width=20,FUN=prediction,by.column=F,align='right')
Your second question, about the numbering of your results, can be controlled by the align argument in rollapply. With width=40 on 100 observations, left would give you 1..61, center (the default) would give you 20..80 and right gets you 40..100.
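The effect of align can be reproduced with a few lines of base R. The helper window_labels below is hypothetical (it is not part of zoo), and the offset used for center is chosen so that it reproduces the 20..80 seen in the question for an even width.

```r
# Hypothetical helper: index reported for each rolling window of a series of
# length n, for the three alignments
window_labels <- function(n, width, align = c("center", "left", "right")) {
  align <- match.arg(align)
  starts <- seq_len(n - width + 1)          # first observation of each window
  offset <- switch(align,
                   left   = 0,
                   center = floor((width - 1) / 2),
                   right  = width - 1)
  starts + offset
}

# first and last label for each alignment, 100 observations, width 40
sapply(c("left", "center", "right"),
       function(a) range(window_labels(100, 40, a)))
```

For forecasting the next observation, right alignment is usually the convenient choice, because each result is then labelled with the last observation inside its window.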
