EDIT: OK, so now I have my train, validation and test sets with all rows from the same patient kept in the same set. But, checking with a plot, I see that the class imbalance of the original dataset (outcome LesionResponse, 1: 70% and 0: 30%) is not well preserved. In the training data I get closer to a 55/45 split, which is not what I want. How can I correct this?
summary(train$LesionResponse)
# 0 1
# 159 487
summary(validation$LesionResponse)
# 0 1
# 33 170
summary(test$LesionResponse)
# 0 1
# 77 126
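One common way to correct this is to rebalance only the training set after the grouped split, leaving validation and test at their natural distribution. Below is a minimal sketch that downsamples the majority class with dplyr; downsampling is just one illustrative option (upsampling the minority class, e.g. via the themis package, or class weights in the model are common alternatives).
library(dplyr)
set.seed(2023)
n_min <- min(table(train$LesionResponse))  # size of the minority class (159 here)
train_bal <- train %>%
  group_by(LesionResponse) %>%
  slice_sample(n = n_min) %>%  # keep n_min rows of each class
  ungroup()
summary(train_bal$LesionResponse)
# 0 1
# 159 159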
Hi guys,
I have my dataset (an example below) and I must build a predictive model for an outcome: "LesionResponse".
I first split my data into train (60%), validation and test (20% each) sets.
But I have a big problem: many rows of my table belong to the same patients, so to avoid bias I must split my data while taking the PatientIDs into account.
I am stuck here because I don't know how to split my data in three while keeping all rows belonging to the same patient together.
Here are my data:
df <- structure(list(
  PatientID = c("P1", "P1", "P1", "P2", "P3", "P3", "P4", "P5", "P5", "P6"),
  LesionResponse = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L),
    .Label = c("0", "1"), class = "factor"),
  pyrad_tum_original_shape_LeastAxisLength = c(19.7842995242803, 15.0703960571122,
    21.0652247652897, 11.804125918871, 27.3980336338908, 17.0584330264122,
    4.90406343942677, 4.78480430022189, 6.2170232078547, 5.96309532740722),
  pyrad_tum_original_shape_Sphericity = c(0.652056853392657, 0.773719977240238,
    0.723869070051882, 0.715122964970338, 0.70796498824535, 0.811937882810929,
    0.836458991713367, 0.863337931630415, 0.851654860256904, 0.746212862162174),
  pyrad_tum_log.sigma.5.0.mm.3D_firstorder_Skewness = c(0.367453961973625,
    0.117673346718817, 0.0992025164349288, -0.174029385779302, -0.863570016875989,
    -0.8482193060411, -0.425424618080682, -0.492420174157913, 0.0105111292451967,
    0.249865833210199),
  pyrad_tum_log.sigma.5.0.mm.3D_glcm_Contrast = c(0.376932105256115, 0.54885738172596,
    0.267158344601612, 2.90094719958076, 0.322424096161189, 0.221356030145403,
    1.90012334870722, 0.971638740404501, 0.31547550396399, 0.653999340294952),
  pyrad_tum_wavelet.LHH_glszm_GrayLevelNonUniformityNormalized = c(0.154973213866752,
    0.176128379241556, 0.171129002059539, 0.218343919352019, 0.345985943932352,
    0.164905080489496, 0.104536489151874, 0.1280276816609, 0.137912385073012,
    0.133420904484894),
  pyrad_tum_wavelet.LHH_glszm_LargeAreaEmphasis = c(27390.2818110851, 11327.7931034483,
    51566.7948885976, 7261.68702290076, 340383.536555142, 22724.7792207792,
    45.974358974359, 142.588235294118, 266.744186046512, 1073.45205479452),
  pyrad_tum_wavelet.LHH_glszm_LargeAreaLowGrayLevelEmphasis = c(677.011907073653,
    275.281153810458, 582.131636238695, 173.747506476692, 6140.73990175018,
    558.277670638306, 1.81042257642817, 4.55724031114589, 6.51794350173746,
    19.144924585586),
  pyrad_tum_wavelet.LHH_glszm_SizeZoneNonUniformityNormalized = c(0.411899490603372,
    0.339216399209913, 0.425584323452468, 0.355165782879786, 0.294934042125209,
    0.339208410636982, 0.351742274819198, 0.394463667820069, 0.360735532720389,
    0.36911240382811)
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
I was thinking about a loop that would split a dataset of unique(PatientID) values in three, with 60% in the train set, and, if the outcome is not balanced across the sets, redo the split again and again, perhaps accepting any split whose proportions fall within some tolerance interval (see the sketch below).
How would you do it, guys?
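For what it's worth, the retry loop described above could be sketched as follows. This is illustrative only: the 0.70 target comes from the question, the 5-point tolerance and the 100-iteration cap are arbitrary choices, and group_initial_split() is the grouped splitter recommended in the answers below.
library(rsample)
target <- 0.70  # share of "1" in the full data, per the question
tol <- 0.05     # accept splits within +/- 5 percentage points (arbitrary)
set.seed(2023)
for (i in 1:100) {  # cap the retries so a strict tolerance cannot loop forever
  g <- group_initial_split(df, group = PatientID, prop = 0.8)
  tr <- training(g)
  if (abs(mean(tr$LesionResponse == "1") - target) < tol) break
}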
Edit: I misunderstood how you wished to handle PatientIDs. The original answer is at the bottom; however, note that stratification will aim to put equivalent proportions of each PatientID in each split. You should use the group_ splitting functions indicated by @Rui Barradas.
library(tidymodels)
set.seed(217)
df_split <- group_initial_split(df, PatientID, prop = 4/5)
df_training <- training(df_split)
df_testing <- testing(df_split)
df_validation <- group_validation_split(df_training, PatientID, prop = 3/4)
Original reply
In the tidymodels framework you can opt to stratify the sampling using your PatientID variable. The resulting resamples will have equivalent proportions.
To create your desired splits you could first split the data 80:20 training:testing, then split the training set 75:25 into training:validation.
library(tidymodels)
set.seed(217)
df_split <- initial_split(df, prop = 4/5, strata = PatientID)
df_training <- training(df_split)
df_testing <- testing(df_split)
df_validation <- validation_split(df_training, prop = 3/4, strata = PatientID)
Here is a way with package rsample.
First split into test and the remaining data (named train in the code below), keeping all rows of each PatientID in the same subset, then split train again.
library(rsample)
set.seed(2023)
g <- group_initial_split(df1, group = PatientID, prop = 0.8)
train <- training(g)
test <- testing(g)
g <- group_initial_split(train, group = PatientID, prop = 3/4)
train <- training(g)
validation <- testing(g)
# check data split proportions
df_list <- list(train = train, validation = validation, test = test)
sapply(df_list, nrow)
#> train validation test
#> 600 199 201
# this shows that all groups belong to one subset only
lapply(df_list, \(x) unique(x[[1]]))
#> $train
#> [1] "P5" "P9" "P8" "P3" "P10" "P4"
#>
#> $validation
#> [1] "P2" "P7"
#>
#> $test
#> [1] "P1" "P6"
Created on 2023-02-17 with reprex v2.0.2
Test data
set.seed(2023)
p <- sprintf("P%d", 1:10)
n <- 1e3
df1 <- data.frame(
PatientID = sample(p, n, TRUE),
x = rnorm(n)
)
Created on 2023-02-17 with reprex v2.0.2
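To also check the class-balance concern from the edit, the same df_list pattern can report outcome proportions per split. A small sketch, assuming the splits come from the real data carrying LesionResponse (the toy df1 above has no such column):
sapply(df_list, \(x) prop.table(table(x$LesionResponse)))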
You could use a one-liner that samples one of 1:3 for unique patient IDs and splits df by that.
set.seed(42)
res <- split(df, with(df, ave(id, id, FUN=\(x) sample.int(3, 1, prob=c(.6, .2, .2)))))
Tests:
## test proportions (should approx. be [.6, .2, .2])
proportions(sapply(res, \(x) length(unique(x$id)))) |> round(2)
# 1 2 3
# 0.53 0.25 0.22
## test uniqueness
stopifnot(length(Reduce(intersect, lapply(res, `[[`, 'id'))) == 0)
Update
To get more stable proportions, we could use fixed group sizes by repeating 1:3 according to the counts in vector p.
len <- length(u <- unique(df$id))
p1 <- c(.2, .2)
rlp <- round(len*p1)
p <- c(len - sum(rlp), rlp)
set.seed(42)
a <- setNames(rep.int(1:3, p), sample(u))
res <- split(df, a[match(df$id, names(a))]) ## this line splits the df
proportions(sapply(res, \(x) length(unique(x$id))))
# 1 2 3
# 0.6 0.2 0.2
## test uniqueness
stopifnot(length(Reduce(intersect, lapply(res, `[[`, 'id'))) == 0)
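To attach the usual split names to the three pieces (a small sketch; the order follows the 60/20/20 proportions encoded in p above):
names(res) <- c("train", "validation", "test")
sapply(res, nrow)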
Data:
set.seed(42)
n <- 200; np <- 100
df <- data.frame(id=paste0('P', as.integer(as.factor(sort(sample.int(np, n, replace=TRUE))))),
les=sample(0:1, n, replace=TRUE),
pyr=runif(n))
Related
I read that it is possible to store data frames in a column of a data frame with nest:
https://tidyr.tidyverse.org/reference/nest.html
Is it also possible to store tables in a column of a data frame?
The reason is that I would like to calculate the Kappa for every subgroup of a data frame with caret. However, caret::confusionMatrix(t) expects a table as input.
In the example code below this works fine if I calculate the Kappa for the complete data frame at once:
library(tidyverse)
library(caret)
# generate some sample data:
n <- 100L
x1 <- rnorm(n, 1.0, 2.0)
x2 <- rnorm(n, -1.0, 0.5)
y <- rbinom(n, 1L, plogis(1 * x1 + 1 * x2))
my_factor <- rep( c('A','B','C','D'), 25 )
df <- cbind(x1, x2, y, my_factor)
# fit a model and make predictions:
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
# confusion matrix
probs_round <- round(probs)
t <- table(factor(probs_round, c(1,0)), factor(y, c(1,0)))
ccm <- caret::confusionMatrix(t)
# extract Kappa:
ccm$overall[2]
> Kappa
> 0.5232
However, if I try to use group_by to generate the Kappa for every factor level as a subgroup (see code below), it does not succeed. I suppose I need to nest t in a certain way in df, although I don't know how:
# extract Kappa for every subgroup with same factor (NOT WORKING CODE):
df <- cbind(df, probs_round)
df <- as.data.frame(df)
output <- df %>%
dplyr::group_by(my_factor) %>%
dplyr::mutate(t = table(factor(probs_round, c(1,0)), factor(y, c(1,0)))) %>%
summarise(caret::confusionMatrix(t))
Expected output:
>my_factor Kappa
>1 A 0.51
>2 B 0.52
>3 C 0.53
>4 D 0.54
Is this correct and is this possible?
(the exact values for Kappa will be different due to the randomness in the sample data)
Thanks a lot!
You could skip the intermediate mutate() that's giving you trouble to do:
library(dplyr)
library(caret)
df %>%
group_by(my_factor) %>%
summarize(t = confusionMatrix(table(factor(probs_round, c(1,0)),
factor(y, c(1,0))))$overall[2])
Returns:
# A tibble: 4 x 2
my_factor t
<chr> <dbl>
1 A 0.270
2 B 0.513
3 C 0.839
4 D 0.555
The above approach is the easiest way to get the desired results. But just to show what's possible, we can use your approach with dplyr::nest_by(), which groups the data set rowwise.
In the approach below we fit a separate glm for each subgroup. I'm not sure if that's what you want to do.
library(tidyverse)
library(caret)
# generate some sample data:
n <- 1000L
df <- tibble(x1 = rnorm(n, 1.0, 2.0),
x2 = rnorm(n, -1.0, 0.5),
y = rbinom(n, 1L, plogis(x1 + 1 * x1 + 1 * x2)),
my_factor = rep( c('A','B','C','D'), 250))
output <- df %>%
nest_by(my_factor) %>%
mutate(y = list(data$y),
mod = list(glm(y ~ x1 + x2,
family = "binomial",
data = data)),
probs = list(predict(mod, type = "response")),
probs_round = list(round(probs)),
t = list(table(factor(probs_round, c(1, 0)),
factor(y, c(1, 0)))),
ccm = caret::confusionMatrix(t)$overall[2])
output %>%
pull(ccm)
#> Kappa Kappa Kappa Kappa
#> 0.7743682 0.7078112 0.7157761 0.7549340
Created on 2021-06-23 by the reprex package (v0.3.0)
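If you would rather have the my_factor/Kappa table from the question's expected output than a named vector, a small follow-up sketch on the output object above (ungroup() drops the rowwise grouping first):
output %>%
  ungroup() %>%
  select(my_factor, Kappa = ccm)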
I am using the pROC package in R to calculate and compare the AUCs of multiple tests, to see which test has the best ability to discriminate between patients and controls. However, I have a large number of tests and essentially want to run a series of pairwise comparisons of each test's AUC with every other test's, and then correct for multiple comparisons. This is as far as I've gotten with my code (example with a simulated, replicable dataset below):
#load pROC
library(pROC)
#generate df with random numbers
set.seed(123)
df <- data.frame(disease_status = rbinom(n=100, size=1, prob=0.20),
test1 = rnorm(100, mean=15, sd=4),
test2 = rnorm(100, mean=30, sd=2),
test3 = rnorm(100, mean=50, sd=3))
#create roc object for test1, test2, test3
roc.out_test1<-roc(df$disease_status, df$test1, plot=TRUE, smooth = FALSE)
roc.out_test2<-roc(df$disease_status, df$test2, plot=TRUE, smooth = FALSE)
roc.out_test3<-roc(df$disease_status, df$test3, plot=TRUE, smooth = FALSE)
#compare the AUC of test1 and test 2
roc.test(roc.out_test1, roc.out_test2, reuse.auc=TRUE, method="delong", na.rm=TRUE)
#DeLong's test for two correlated ROC curves
#data: roc.out_test1 and roc.out_test2
#Z = 0.60071, p-value = 0.548
#alternative hypothesis: true difference in AUC is not equal to 0
#sample estimates:
#AUC of roc1 AUC of roc2
#0.5840108 0.5216802
#create a function to do above for all comparisons
vec_ROCs1 <- c("roc.out_test1,", "roc.out_test2,", "roc.out_test3,")
vec_ROCs2 <- c("roc.out_test1", "roc.out_test2", "roc.out_test3")
ROCs2_specifications <- paste0(vec_ROCs2, ",", "reuse.auc=TRUE")
test <- unlist(lapply(ROCs2_specifications, function(x) paste0(vec_ROCs1, x)))
test2 <- lapply(test, function(x) roc.test(x))
#Error in roc.test.default(x) :
# argument "predictor1" is missing, with no default
Please let me know your thoughts and suggestions on how to fix this!
Thank you.
The following should work, please check it. I didn't write all the details, but you can ask me other questions if you don't understand the code.
#load pROC
library(pROC)
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
#generate df with random numbers
set.seed(123)
df <- data.frame(disease_status = rbinom(n=100, size=1, prob=0.20),
test1 = rnorm(100, mean=15, sd=4),
test2 = rnorm(100, mean=30, sd=2),
test3 = rnorm(100, mean=50, sd=3))
#create roc object for test1, test2, test3
roc.out_test1<-roc(df$disease_status, df$test1, plot=TRUE, smooth = FALSE)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
roc.out_test2<-roc(df$disease_status, df$test2, plot=TRUE, smooth = FALSE)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
roc.out_test3<-roc(df$disease_status, df$test3, plot=TRUE, smooth = FALSE)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
# compare the AUC of test1 and test 2
roc.test(roc.out_test1, roc.out_test2, reuse.auc = TRUE, method = "delong", na.rm = TRUE)
#>
#> DeLong's test for two correlated ROC curves
#>
#> data: roc.out_test1 and roc.out_test2
#> Z = 0.60071, p-value = 0.548
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2
#> 0.5840108 0.5216802
Now we generate a list of all possible combinations of the three tests and run the roc.test function using the same parameters that you set.
all_tests <- combn(
list(
"test1" = roc.out_test1,
"test2" = roc.out_test2,
"test3" = roc.out_test3
),
FUN = function(x, ...) roc.test(x[[1]], x[[2]], ...),
m = 2,
simplify = FALSE,
reuse.auc = TRUE,
method = "delong",
na.rm = TRUE
)
The output is a list of choose(3, 2) = 3 elements (i.e. the number of combinations of n elements taken 2 at a time), and each element of the list is a test. For example, this is the same as your previous test:
all_tests[[1]]
#>
#> DeLong's test for two correlated ROC curves
#>
#> data: x[[1]] and x[[2]]
#> Z = 0.60071, p-value = 0.548
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2
#> 0.5840108 0.5216802
The only problem here is that it's difficult to recognise which tests are used in the comparisons, so we can also add a list of names:
tests_names <- combn(
list("test1", "test2", "test3"),
m = 2,
FUN = paste,
simplify = TRUE,
collapse = "_"
)
all_tests <- setNames(all_tests, tests_names)
This is the result.
names(all_tests)
#> [1] "test1_test2" "test1_test3" "test2_test3"
The names of the objects flag the tests that are used in the comparison.
all_tests$test1_test2
#>
#> DeLong's test for two correlated ROC curves
#>
#> data: x[[1]] and x[[2]]
#> Z = 0.60071, p-value = 0.548
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2
#> 0.5840108 0.5216802
Created on 2020-03-14 by the reprex package (v0.3.0)
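Since the question also asks about correcting for multiple comparisons, the p-values can be pulled out of the named list and adjusted with p.adjust. A sketch; "holm" is just one of the available methods:
pvals <- sapply(all_tests, function(x) x$p.value)
p.adjust(pvals, method = "holm")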
The roc.test() function expects a roc object as input. The list test is just character strings of all the arguments, which the function does not know what to do with. The list also includes comparisons of the tests with themselves, i.e. "roc.out_test1,roc.out_test1,reuse.auc=TRUE". I assume you don't actually need those, and that there are only 3 comparisons you need: 1v2, 1v3, 2v3. The purrr package provides map functions similar to lapply, and map2 allows you to iterate over 2 lists at the same time. You need to create 2 lists of the actual roc objects and iterate over these.
#load pROC
library(pROC)
library(dplyr)
library(purrr) #For map2 function
#generate df with random numbers
set.seed(123)
df <- data.frame(disease_status = rbinom(n=100, size=1, prob=0.20),
test1 = rnorm(100, mean=15, sd=4),
test2 = rnorm(100, mean=30, sd=2),
test3 = rnorm(100, mean=50, sd=3))
#create roc object for test1, test2, test3
roc.out_test1<-roc(df$disease_status, df$test1, plot=TRUE, smooth = FALSE)
roc.out_test2<-roc(df$disease_status, df$test2, plot=TRUE, smooth = FALSE)
roc.out_test3<-roc(df$disease_status, df$test3, plot=TRUE, smooth = FALSE)
#compare the AUC of test1 and test 2
roc.test(roc.out_test1, roc.out_test2, reuse.auc=TRUE, method="delong", na.rm=TRUE)
roc_new <- function(test1, test2){
roc.test(test1, test2, reuse.auc=TRUE, method="delong", na.rm=TRUE)
}
#List of all tests
all_tests <- list(roc.out_test1,
roc.out_test2,
roc.out_test3)
#Create unique combos of tests
unique_combos <- expand.grid(1:3, 1:3) %>%
filter(Var1 < Var2) %>% #excludes duplicate comparisons,
#each col provides the index for the 2 lists to iterate over
mutate(names = paste(Var1, " V ", Var2)) #Create col to name final output list
#Create 2 lists to iterate over
#Create list 1
(test1 <- all_tests[as.numeric(unique_combos$Var1)])
#Create list 2
(test2 <- all_tests[as.numeric(unique_combos$Var2)])
#Iterate over both lists
output <- map2(test1, test2, roc_new)
names(output) <- unique_combos$names
I was wondering why lm() says 5 coefficients are not defined because of singularities and then gives all NAs for those 5 coefficients in the summary output.
Note that all my predictors are categorical.
Is there anything wrong with my data or my code for these 5 coefficients? How can I fix this?
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
summary(vv)
First 6 lines of output:
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.17835 0.63573 0.281 0.779330
Age1 -0.04576 0.86803 -0.053 0.958010
Age2 0.46431 0.87686 0.530 0.596990
Age99 -1.64099 1.04830 -1.565 0.118949
genre2 1.57015 0.55699 2.819 0.005263 **
genre4 NA NA NA NA ## For example, this one is all NAs; there are 4 more!
As others noted, one problem is that you seem to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for correlated variables, you should inspect your data to identify the collinearity and remove it. Deciding which variables to remove and which to retain is a very domain-specific question. However, you could, if you wish, use regularisation and fit a model while retaining all variables. This also allows you to fit a model when n (number of samples) is less than p (number of predictors).
I've shown code below that demonstrates how to examine the correlation structure within your data and identify the most correlated variables (thanks to this answer). I've also included an example of fitting such a model using L2 regularisation (commonly known as ridge regression).
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
df <- d
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")
library("gplots")
heatmap.2(cor_mat, trace = "none")
## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")
d2 <- df %>%
as.matrix() %>%
cor(use = "complete.obs") %>%
## Set diag (a vs a) to NA, then remove
(function(x) {
diag(x) <- NA
x
}) %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(!is.na(value)) %>%
## Sort by decreasing absolute correlation
arrange(-abs(value))
## 2 pairs of variables are almost exactly correlated!
head(d2)
#> var1 var2 value
#> 1 id study.name 0.9999430
#> 2 study.name id 0.9999430
#> 3 Location timed 0.9994082
#> 4 timed Location 0.9994082
#> 5 Age ed.level 0.7425026
#> 6 ed.level Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
library("glmnet")
## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
fit_ridge <- glmnet(
x = X,
y = y,
## alpha = 0 is ridge regression
alpha = 0)
plot(fit_ridge)
Created on 2019-11-08 by the reprex package (v0.3.0)
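As a follow-up to the ridge fit above: in practice the penalty is usually chosen by cross-validation rather than left on a whole path, e.g. with cv.glmnet. A sketch reusing the X and y objects built above:
cvfit <- cv.glmnet(x = X, y = y, alpha = 0)  # 10-fold CV over a lambda path
plot(cvfit)                                  # CV error as a function of log(lambda)
coef(cvfit, s = "lambda.min")                # coefficients at the CV-optimal penalty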
In such a situation you can use the "olsrr" package in R for stepwise regression analysis. Here is sample code for stepwise regression in R:
library("olsrr")
#Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)
# stepwise regression
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)
summary(vv)
k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)
# stepwise regression plot
plot(k)
# final model
k$model
It will provide output very similar to that of SPSS's stepwise procedure.
I have data of 11784 records split into test (2946) and train (8838) sets to run an h2o algorithm, but I got an error related to the data frame I'm trying to create as the final output, which links the predictions to the IDs the predictions were made for.
The error is for this line:
df_y_test <- data.frame(ID = df_labels, Status = df_y_test$predict)
Error in data.frame(ID = df_labels, Status = df_y_test$predict) :
arguments imply differing number of rows: 2946, 2950
I looked through the forums and understood that the number of rows in df_y_test being 2950 is what causes this, but I couldn't figure out why, since df_y_test is also derived from the same test variable, which has only 2946 rows. I would be happy for any guidance; the full script is posted below for reference.
data : 11784 obs of 46 variables
test: 2946 obs of 45 variables
train: 8838 obs of 46 variables
df_labels: 2946 obs of 1 variable
df_y_test: 2950 obs of 4 variables
# Load Data
data <- read.csv('Data.csv')
# Partition Data
library(caTools)
set.seed(75)
split <- sample.split(data$Status, SplitRatio = 0.75)
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
# Dropping the column to be predicted from Test
test <- subset(test[,-c(2)])
library(readr)
library(h2o)
# Init h2o
localh2o <- h2o.init(max_mem_size = '2g', nthreads = -1)
# convert status values (to be predicted) in second column to factors in h2o
train[,2] <- as.factor(train[,2])
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
# Running H2O
model <- h2o.deeplearning(x=c(1, 3:46),
y=2,
training_frame = train_h2o,
activation = "RectifierWithDropout",
input_dropout_ratio = 0.2,
hidden_dropout_ratios = c(0.5, 0.5),
balance_classes = TRUE,
hidden = c(100,100),
nesterov_accelerated_gradient = T,
epochs = 15 )
h2o_y_test <- h2o.predict(model, test_h2o)
# Converting to data frames from h2o
df_y_test <- as.data.frame(h2o_y_test)
df_labels <- as.data.frame(test[,1])
df_y_test <- data.frame(ID = df_labels, Status = df_y_test$predict)
write.csv(df_y_test, file="predictionsH2o.csv", row.names = FALSE)
h2o.shutdown(prompt = FALSE)
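One way to localize the mismatch (run this before the h2o.shutdown() call): compare row counts at each conversion step, since the four extra rows must appear somewhere between test and df_y_test. A diagnostic sketch using the objects above:
nrow(test)          # expected 2946
h2o.nrow(test_h2o)  # is the h2o frame already 2950 rows?
nrow(df_y_test)     # rows returned by h2o.predict
nrow(df_labels)     # expected 2946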
My question is very similar to this one, but the problem I am facing has a twist that those answers do not address. Specifically, I am estimating a spatial lag model, y = rho * W * y + X * beta, where W is the spatial weights matrix (lw in the code below). Because the observations are related through W, the model must be applied to the entire X matrix simultaneously; since those answers operate row-wise, they do not apply.
Here is MWE data, consisting of twenty points across three groups and a spatial weights matrix:
library(spdep)
#Coordinates
pointcoords <- data.frame(x = runif(n=20, min =10, max = 100), y = runif(n=20, min = 10, max = 100), ID = as.character(1:20))
pointsSP <- SpatialPoints(pointcoords[,1:2])
# Weights matrix
lw <- nb2listw(knn2nb(knearneigh(pointsSP, k = 4, RANN = FALSE),
row.names = pointcoords$ID))
# Data
MyData <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
DV = rnorm(60),IV = rnorm(60))
I can estimate the models by Group with dplyr
library(dplyr)
models <- MyData %>% group_by(Group) %>%
do(lm = lm(DV ~ IV, data = .),
sar = lagsarlm(DV ~ IV, data = ., listw = lw))
Predicting on new data following this answer works row-wise, which is fine for the lm objects:
MyData2 <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
IV = rnorm(60))
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(lmPred = predict(lm, newdata = list("IV" = IV))) %>% head()
#Joining by: "Group"
#Source: local data frame [6 x 6]
#Groups:
# ID Group IV lm sar lmPred
#1 1 1 -0.8930794 <S3:lm> <S3:sarlm> -0.21378814
#2 1 2 -1.6637963 <S3:lm> <S3:sarlm> 0.42547796
#3 1 3 0.5243841 <S3:lm> <S3:sarlm> -0.23372996
#4 2 1 -0.1956969 <S3:lm> <S3:sarlm> -0.20860280
#5 2 2 0.8149920 <S3:lm> <S3:sarlm> 0.14771431
#6 2 3 -0.3000439 <S3:lm> <S3:sarlm> 0.05082524
But not for the sar models:
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(sarPred = predict(sar, newdata = list("IV" = IV), listw=lw)) %>% head()
#Joining by: "Group"
#Error in if (nrow(newdata) != length(listw$neighbours)) stop("mismatch between newdata and spatial weights") :
argument is of length zero
I think there should be a better way of doing this, without joining the model to every row. Also, creating a list object for newdata won't work if you have several or changing predictor variables. It seems that the dplyr way should be something like this:
MyData2 %>% group_by(Group) %>%
mutate(sarPred = predict(models$sar[[Group]], newdata = ., listw=lw))
But the [[Group]] index isn't quite right.
I ended up doing this with do in dplyr, going through the models data.frame rowwise. I believe it does what you want, although the output doesn't contain the new data used for predictions. I did add in Group to the output, though, as it seemed necessary to keep groups separated.
models %>%
do(data.frame(Group = .$Group,
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
EDIT
Playing around with adding the explanatory variable into the output data.frame. The following works, although there is likely a better way to do this.
models %>%
do(data.frame(Group = .$Group, IV = select(filter(MyData2, Group == .$Group), IV),
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
I'm putting this out there because it does do what I want it to, even if it needs to use a for loop (gasp).
predictobj <- list()
for(i in models$Group){
predictobj[[i]] <- predict.sarlm(models$sar[[i]],
newdata = filter(MyData2, Group == i),
listw = lw)
}
Anybody have a dplyr solution?
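For what it's worth, a purrr sketch that mirrors the loop, iterating over the list-column directly (it assumes models has one row per Group, as built with do() above; predict() dispatches to predict.sarlm):
library(purrr)
library(dplyr)
sar_preds <- map2(models$sar, models$Group,
                  ~ predict(.x, newdata = filter(MyData2, Group == .y), listw = lw))
names(sar_preds) <- models$Group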