ERROR: 'numpy.ndarray' object has no attribute 'iloc' - jupyter-notebook

I am trying to run my K-fold cross-validation and got the error below.
from sklearn import model_selection

kFold = model_selection.KFold(n_splits=5, shuffle=True)

# use the split function of kfold to split the housing data set
i = 0
for trainIndex, testIndex in kFold.split(df):
    print("Fold: ", i)
    print(trainIndex.shape)
    print(trainIndex)
    i += 1
lRegPara = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1]
final_results = []
i = 0
for trainIndex, testIndex in kFold.split(df):
    # split the train test further
    trainX, validX, trainY, validY = train_test_split(np.array(X.iloc[trainIndex]),
                                                      np.array(Y.iloc[trainIndex]),
                                                      test_size=0.20, random_state=99)
    # optimise the linear regression
    lResults = []
    for regPara in lRegPara:
        polyLassoReg = Lasso(alpha=regPara, normalize=True)
        polyFitTrainX = polyreg.fit_transform(trainX)
        polyLassoReg.fit(polyFitTrainX, trainY)
        polyFitValidX = polyreg.fit_transform(validX)
        predictKY = polyLassoReg.predict(polyFitValidX)
        mse = mean_squared_error(predictKY, validY)
        lResults.append(mse)
    final_results.append(lResults)
    plt.plot(lRegPara, lResults)
Why? I keep getting the error 'numpy.ndarray' object has no attribute 'iloc'. I have searched everywhere but found no similar problem. I also tried 'loc', and the result is still the same.

Convert your NumPy array to a pandas DataFrame:
df = pd.DataFrame({'column0': numpy_array[:, 0],
                   'column1': numpy_array[:, 1],
                   'column2': numpy_array[:, 2],
                   'column3': numpy_array[:, 3],
                   'column4': numpy_array[:, 4]})
Then you can use iloc and the other DataFrame methods.

Related

Kriging interpolation using GeoStats package in Julia

I am trying to build a model for kriging interpolation using the GeoStats package in Julia.
I have tried an example of 2D interpolation, but the results are not accurate, as shown below.
Code for 2D interpolation:
using KrigingEstimators, DataFrames, Variography, Plots

OK = OrdinaryKriging(GaussianVariogram()) # interpolator object

f(x) = sin(x)

# fit it to the data:
x_train = range(0, 10.0, length = 9) |> collect
y_train = f.(x_train)
scatter(x_train, y_train, label = "train points")
x_train = reshape(x_train, 1, length(x_train))
krig = KrigingEstimators.fit(OK, x_train, y_train) # fit function

result = []
variance = []
test = range(0, 10, length = 101) |> collect
y_test = f.(test)
test = reshape(test, 1, length(test))
for i in test
    μ, σ² = KrigingEstimators.predict(krig, [i])
    push!(result, μ)
    push!(variance, σ²)
end

df_krig_vario = DataFrame(:predict => result, :real => y_test, :variance => variance)
println(first(df_krig_vario, 5))

mean_var = sum(variance) / length(variance)
println("")
println("mean variance is $mean_var")

test = reshape(test, length(test), 1)
plot!(test, y_test, label = "actual")
plot!(test, result, label = "predict", legend = :bottomright, title = "Gaussian Variogram")
As can be seen in the figure above, the interpolation prediction is not accurate. How can I improve this accuracy?

Error in h(simpleError(msg, call)) in Ridge/Lasso Regression

I am trying to run ridge/lasso regression with the glmnet and onehot packages and am getting an error.
library(glmnet)
library(onehot)

set.seed(123)
Sample <- HouseData[1:1460, ]
smp_size <- floor(0.5 * nrow(Sample))
train_ind <- sample(seq_len(nrow(Sample)), size = smp_size)
train <- Sample[train_ind, ]
test <- Sample[-train_ind, ]

############ Ridge & Lasso Regressions ################
# Define the response for the training + test set
y_train <- train$SalePrice
y_test <- test$SalePrice

# Define the x training and test
x_train <- train[, !names(train) == "SalePrice"]
x_test <- test[, !names(train) == "SalePrice"]
str(y_train)

## encoding information for training set
x_train_encoded_data_info <- onehot(x_train, stringsAsFactors = TRUE, max_levels = 50)
x_train_matrix <- predict(x_train_encoded_data_info, x_train)
x_train_matrix <- as.matrix(x_train_matrix)

# create encoding information for x test
x_test_encoded_data_info <- onehot(x_test, stringsAsFactors = TRUE, max_levels = 50)
x_test_matrix <- predict(x_test_encoded_data_info, x_test)
str(x_train_matrix)

### Calculate best lambda
cv.out <- cv.glmnet(x_train_matrix, y_train,
                    alpha = 0, nlambda = 100,
                    lambda.min.ratio = 0.0001)
best.lambda <- cv.out$lambda.min
best.lambda

model <- glmnet(x_train_matrix, y_train, alpha = 0, lambda = best.lambda)
results_ridge <- predict(model, newx = x_test_matrix)
I know my data is clean and my matrices are the same size, but I keep getting this error when I try to run my prediction:
Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
My professor has also told me to one-hot encode before I split my data, but that makes no sense to me.
It's hard to debug that specific error because it's not entirely clear where the onehot function in your code is coming from; it doesn't exist in base R or the glmnet package.
That said, I would recommend using the old built-in standby function model.matrix (or its sparse cousin, sparse.model.matrix, if you have larger datasets) for creating the x argument to glmnet. model.matrix will automatically one-hot encode factor or categorical variables for you. It requires a model formula as input, which you can create from your dataset as shown below.
# create the model formula
y_variable <- "SalePrice"
model_formula <- as.formula(paste(y_variable, "~",
                                  paste(names(train)[names(train) != y_variable], collapse = "+")))

# test & train matrices
x_train_matrix <- model.matrix(model_formula, data = train)[, -1]
x_test_matrix <- model.matrix(model_formula, data = test)[, -1]

### Calculate best lambda
cv.out <- cv.glmnet(x_train_matrix, y_train,
                    alpha = 0, nlambda = 100,
                    lambda.min.ratio = 0.0001)
A second, newer option is to use the built-in glmnet function makeX(), which builds the matrices from your test/train data frames. Its output can be fed straight into cv.glmnet as the x argument, as below.
## option 2: use glmnet built-in function to create x matrices
x_matrices <- glmnet::makeX(train = train[, !names(train) == "SalePrice"],
                            test = test[, !names(test) == "SalePrice"])

### Calculate best lambda
cv.out <- cv.glmnet(x_matrices$x, y_train,
                    alpha = 0, nlambda = 100,
                    lambda.min.ratio = 0.0001)
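To close the loop on option 2, here is a minimal sketch of refitting at the cross-validated lambda and predicting on the held-out rows; it assumes makeX() was called with both train and test as above, so its result carries an x (train) component and an xtest (test) component:
# refit the ridge model at the best lambda and predict on the test matrix
model <- glmnet(x_matrices$x, y_train, alpha = 0, lambda = cv.out$lambda.min)
results_ridge <- predict(model, newx = x_matrices$xtest)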

SHAP with Keras model : operands could not be broadcast together with shapes (2,6) (10,)

I am running SHAP from the shapper library in R for classification-model interpretation of a Keras CNN model:
library(keras)
library("shapper")
library("DALEX")
I made a simple reproducible example:
mdat.train <- cbind(rep(1:2, each = 5), matrix(c(1:30), ncol = 3, byrow = TRUE))
train.conv <- array_reshape(mdat.train[,-1], c(nrow(mdat.train[,-1]), ncol(mdat.train[,-1]), 1))
mdat.test <- cbind(rep(1:2, each = 3), matrix(c(1:18), ncol = 3, byrow = TRUE))
test.conv <- array_reshape(mdat.test[,-1], c(nrow(mdat.test[,-1]), ncol(mdat.test[,-1]), 1))
My CNN model
model.CNN <- keras_model_sequential()
model.CNN %>%
  layer_conv_1d(filters = 16L, kernel_initializer = initializer_he_normal(seed = NULL),
                kernel_size = 2L, input_shape = c(dim(train.conv)[[2]], 1)) %>%
  layer_batch_normalization() %>%
  layer_activation_leaky_relu() %>%
  layer_flatten() %>%
  layer_dense(50, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 2, activation = 'sigmoid')

model.CNN %>% compile(
  loss = loss_binary_crossentropy,
  optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08),
  metrics = c("accuracy"))

model.CNN %>% fit(
  train.conv, mdat.train[,1], epochs = 5, verbose = 1)
My SHAP commands:
p_function <- function(model, data) predict(model.CNN, test.conv, type = "prob")
exp_cnn <- explain(model.CNN, data = train.conv)
ive_cnn <- shap(exp_cnn, data = train.conv, new_observation = test.conv, predict_function = p_function)
I am getting this error:
Error in py_call_impl(callable, dots$args, dots$keywords) :
  ValueError: operands could not be broadcast together with shapes (2,6) (10,)

Detailed traceback:
  File "/.local/lib/python3.6/site-packages/shap/explainers/kernel.py", line 120, in __init__
    self.fnull = np.sum((model_null.T * self.data.weights).T, 0)
The problem you've presented has two parts. First, the error shown comes from a typo in your code: the p_function you define calls the global objects (model.CNN, test.conv) instead of the arguments passed to it. That is why you see that error.
To my surprise, however, I found the package still did not work even after fixing that mistake. Let me explain why, and the solution.
I have to say that 3D arrays are not common in R, so the shapper package does not support that type of training data. It assumes a data.frame format from the start of the task (because it iterates over variables). To be honest, it took me about two hours to find the reason it was not working, as well as a solution.
First of all, we need new variables in a format that shapper understands:
shapper_data <- as.data.frame(train.conv)
shapper_new_obs <- as.data.frame(test.conv)[1,]
as well as a new predict_function:
p_function <- function(model, data) {
  mat <- as.matrix(data)
  mat <- array_reshape(mat, c(nrow(data), ncol(data), 1))
  predict(model, mat, type = "prob")
}
The two new lines convert the data.frame back into a properly shaped array.
Then the line
ive_cnn <- individual_variable_effect(x = model.CNN, data = shapper_data, new_observation = shapper_new_obs, predict_function = p_function)
works perfectly fine for me.
Best
Szymon
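A small usage note on top of this answer (the plot() call is an addition based on shapper's documented generics, not part of the original post): the object returned by individual_variable_effect() can be plotted directly to visualize the attributions.
# visualize the SHAP attributions for the explained observation
plot(ive_cnn)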

Keras LSTM and multiple input feature: how to define parameters

I am discovering Keras and LSTMs in R. Following this blog post, I want to predict a time series, using several past time points (t-1, t-2) to predict the point at time t.
Here is what I tried so far:
library(data.table)
library(tensorflow)
library(keras)
Serie <- c(5.66333333333333, 5.51916666666667, 5.43416666666667, 5.33833333333333,
5.44916666666667, 6.2025, 6.57916666666667, 6.70666666666667,
6.95083333333333, 8.1775, 8.55083333333333, 8.42166666666667,
8.01333333333333, 8.99833333333333, 11.0025, 10.3116666666667,
10.51, 10.9916666666667, 10.6116666666667, 10.8475, 13.7841666666667,
16.2916666666667, 15.9975, 14.3683333333333, 13.4041666666667,
11.8666666666667, 9.11916666666667, 9.47862416666667, 9.08404666666667,
8.79606166666667, 9.93211091666667, 9.03834041666667, 8.58787275,
6.77499383333333, 7.21377583333333, 7.53497175, 6.31212966666667,
5.5825105, 4.64021041666667, 4.608787, 5.39446983333333, 4.93945983333333,
4.8612215, 4.13088808333333, 4.09916575, 3.40943183333333, 3.79573258333333,
4.30319966666667, 4.23431266666667, 3.64880758333333, 3.11700716666667,
3.321058, 2.53599408333333, 2.20433991666667, 1.66643905833333,
0.84187275, 0.467880658333333, 0.810507858333333, 0.795)
Npoints <- 2 # number of previous points to take into account
I then create a data frame with the lagged time series, and create a test and train set:
supervised <- data.table(x = diff(Serie, differences = 1))
supervised[,c(paste0("x-",1:Npoints)) := lapply(1:Npoints,function(i){c(rep(NA,i),x[1:(.N-i)])})] # create shifted versions
# take the non NA
supervised <- supervised[!is.na(get(paste0("x-",Npoints)))]
head(supervised)
# Split dataset into training and testing sets
N = nrow(supervised)
n = round(N *0.7, digits = 0)
train = supervised[1:n, ]
test = supervised[(n+1):N, ]
I rescale the data
scale_data = function(train, test, feature_range = c(0, 1)) {
  x = train
  fr_min = feature_range[1]
  fr_max = feature_range[2]
  std_train = (x - min(x, na.rm = T)) / (max(x, na.rm = T) - min(x, na.rm = T))
  std_test = (test - min(x, na.rm = T)) / (max(x, na.rm = T) - min(x, na.rm = T))
  scaled_train = std_train * (fr_max - fr_min) + fr_min
  scaled_test = std_test * (fr_max - fr_min) + fr_min
  return(list(scaled_train = as.vector(scaled_train),
              scaled_test = as.vector(scaled_test),
              scaler = c(min = min(x, na.rm = T), max = max(x, na.rm = T))))
}
Scaled = scale_data(train, test, c(-1, 1))
# define x and y train
y_train = as.vector(Scaled$scaled_train[, 1])
x_train = Scaled$scaled_train[, -1]
Following this post, I reshape the data into 3D:
x_train_reshaped <- array(NA,dim= c(1,dim(x_train)))
x_train_reshaped[1,,] <- as.matrix(x_train)
I define the following model and try to start the training:
model <- keras_model_sequential()
model %>%
  layer_lstm(units = 1, batch_size = 1, input_shape = dim(x_train), stateful = TRUE) %>%
  layer_dense(units = 1)

# compile model ####
model %>% compile(
  loss = 'mean_squared_error',
  optimizer = optimizer_adam(lr = 0.02, decay = 1e-6),
  metrics = c('accuracy')
)

# make a test
model %>% fit(x_train_reshaped, y_train, epochs = 1, batch_size = 1, verbose = 1, shuffle = FALSE)
but I get the following error:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: No data provided for "dense_11". Need data for each key in: ['dense_11']
Trying to reshape the data differently didn't help.
What am I doing wrong?
Keras and tensorflow in R cannot recognise the size of your input/target data when they are data frames.
y_train is both a data.table and a data.frame:
class(y_train)
[1] "data.table" "data.frame"
The keras fit documentation states: "y: Vector, matrix, or array of target (label) data (or list if the model has multiple outputs)." Similarly, for x.
Unfortunately, there still appears to be an input and/or target dimensionality mismatch when y_train is cast to a matrix:
model %>%
  fit(x_train_reshaped, as.matrix(y_train), epochs = 1, batch_size = 1, verbose = 1, shuffle = FALSE)

Error in py_call_impl(callable, dots$args, dots$keywords) :
  ValueError: Input arrays should have the same number of samples as target arrays.
  Found 1 input samples and 39 target samples.
Hope this answer helps you, or someone else, make further progress.
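Building on that, a minimal sketch of one way to resolve the remaining mismatch (an assumption about the intended framing, not the blog post's prescription): the reshape in the question packs all 39 rows into a single sample of shape (1, 39, 2), while y_train supplies 39 targets. Treating each row of x_train as its own sample of Npoints timesteps with one feature makes the sample counts agree:
# reshape to (samples, timesteps, features): one sample per row of x_train
x_train_mat <- as.matrix(x_train)
x_train_reshaped <- array(x_train_mat, dim = c(nrow(x_train_mat), ncol(x_train_mat), 1))

model <- keras_model_sequential()
model %>%
  layer_lstm(units = 1, batch_size = 1,
             input_shape = c(ncol(x_train_mat), 1), stateful = TRUE) %>%
  layer_dense(units = 1)

model %>% compile(loss = 'mean_squared_error',
                  optimizer = optimizer_adam(lr = 0.02, decay = 1e-6))

model %>% fit(x_train_reshaped, as.matrix(y_train),
              epochs = 1, batch_size = 1, verbose = 1, shuffle = FALSE)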

"The format of predictions is incorrect"

Implementation of a ROCR curve with kNN and 10-fold cross-validation.
I am using the Ionosphere dataset.
Here is the attribute information for your reference:
-- All 34 attributes are continuous, as described above.
-- The 35th attribute is either "good" or "bad" according to the definition summarized above. This is a binary classification task.
data1 <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data', header = FALSE)
kNN on its own works, and kNN with k-fold cross-validation also works, but when I add the ROCR code it fails.
I get the error: "The format of predictions is incorrect".
I checked pred and Class1, and their dimensions are the same. I also tried data.test$V35 instead of Class1 and got the same error.
install.packages("class")
library(class)
nrFolds <- 10
data1[,35]<-as.numeric(data1[,35])
# generate array containing fold-number for each sample (row)
folds <- rep_len(1:nrFolds, nrow(data1))
# actual cross validation
for(k in 1:nrFolds) {
# actual split of the data
fold <- which(folds == k)
data.train <- data1[-fold,]
data.test <- data1[fold,]
Class<-data.train[,35]
Class1<-data.test[,35]
# train and test your model with data.train and data.test
pred<-knn(data.train, data.test, Class, k = 5, l = 0, prob = FALSE, use.all = TRUE)
data<-data.frame('predict'=pred, 'actual'=Class1)
count<-nrow(data[data$predict==data$actual,])
total<-nrow(data.test)
avg = (count*100)/total
avg =format(round(avg, 2), nsmall = 2)
method<-"KNN"
accuracy<-avg
cat("Method = ", method,", accuracy= ", accuracy,"\n")
}
install.packages("ROCR")
library(ROCR)
rocrPred=prediction(pred, Class1, NULL)
rocrPerf=performance(rocrPred, 'tpr', 'fpr')
plot(rocrPerf, colorize=TRUE, text.adj=c(-.2,1.7))
Any help is appreciated.
This worked for me:
install.packages("class")
library(class)
library(ROCR)
nrFolds <- 10
data1[,35]<-as.numeric(data1[,35])
# generate array containing fold-number for each sample (row)
folds <- rep_len(1:nrFolds, nrow(data1))
# actual cross validation
for(k in 1:nrFolds) {
# actual split of the data
fold <- which(folds == k)
data.train <- data1[-fold,]
data.test <- data1[fold,]
Class<-data.train[,35]
Class1<-data.test[,35]
# train and test your model with data.train and data.test
pred<-knn(data.train, data.test, Class, k = 5, l = 0, prob = FALSE, use.all = TRUE)
data<-data.frame('predict'=pred, 'actual'=Class1)
count<-nrow(data[data$predict==data$actual,])
total<-nrow(data.test)
avg = (count*100)/total
avg =format(round(avg, 2), nsmall = 2)
method<-"KNN"
accuracy<-avg
cat("Method = ", method,", accuracy= ", accuracy,"\n")
pred <- prediction(Class1,pred)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=T, add=TRUE)
abline(0,1)
}
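As a side note beyond the answer above: the original "format of predictions is incorrect" error comes from passing a factor (the output of knn()) where ROCR's prediction() expects numeric scores. A minimal sketch of one common workaround, assuming a single train/test fold as in the loop above and that the positive class is coded as 2 after the as.numeric() conversion, is to ask knn() for its vote proportions and turn them into a numeric score:
# knn with prob = TRUE attaches the winning-class vote proportion to its output
pred <- knn(data.train, data.test, Class, k = 5, prob = TRUE, use.all = TRUE)
prob <- attr(pred, "prob")                    # proportion of votes for the predicted class
score <- ifelse(pred == "2", prob, 1 - prob)  # numeric score for class 2

rocrPred <- prediction(score, Class1)
rocrPerf <- performance(rocrPred, "tpr", "fpr")
plot(rocrPerf, colorize = TRUE)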
