How to do a Leave One Out cross validation by group / subset? - r

This question is the second part of a previous question (Linear Regression prediction in R using Leave One out Approach).
I'm trying to build models for each country and generate linear regression predictions using the leave one out approach. In other words, in the code below when building model1 and model2 the "data" used should not be the entire data set. Instead it should be a subset of the dataset (country). Each country data should be evaluated using a model built with data specific to that country.
The code below returns an error. How can I modify/fix the code below to do that? Or is there a better way of doing that?
library(modelr)
install.packages("gapminder")
library(gapminder)
data(gapminder)
#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder, subset = country)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder, subset = country)
models <- list(fit_model1 = model1,fit_model2 = model2)
gapminder %>% nest_by(continent, country) %>%
bind_cols(
map(1:nrow(gapminder), function(i) {
map_dfc(models, function(model) {
training <- data[-i, ]
fit <- lm(model, data = training)
validation <- data[i, ]
predict(fit, newdata = validation)
})
}) %>%
bind_rows()
)

The most succinct and straightforward solution would be a nested for loop approach, where the outer loop is the two model formulae and the inner loop is the unity we want to leave out. This can also be done with outer, which I also show afterwards.
For sake of clarity I first show how to leave out one observation (i.e. one row) in each iteration (Part I). I show later how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris data set, which is smaller and thus easier to handle. It contains a "Species" column that is meant to correspond to the "countries" in your data.
Part I
First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
For the loop, we want to initialize a matrix im whose rows correspond to the number of rows we want to leave out, and columns to the number of model formulae.
im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE),
dimnames=list(NULL, names(FOAE)))
This would look like this:
head(im, n=3)
# fit1 fit2
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
Now we loop over formulas and rows as described above.
for (i in seq(FOAE)) {
for(j in seq(nrow(iris))) {
train <- iris[-j,]
test <- iris[j,]
fit <- lm(FOAE[[i]], data=train)
im[j, i] <- predict(fit, newdata=test)
}
}
im has now been filled, and we may cbind it to the original iris data set to get our result res1.
res1 <- cbind(iris, im)
head(res1)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
To alternatively follow the outer approach, we put the code inside the for loop into a formula which we Vectorize so that it can handle matrix columns (i.e. vectors).
FUN1 <- Vectorize(function(x, y) {
train <- iris[-x,]
test <- iris[x,]
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
Now we put FOAE and the rows 1:nrow(iris) to leave out subsequently, together with FUN1 into outer(). This already gives us the result that we can cbind to iris in the same way as above to get our result res2.
o1 <- outer(FOAE, 1:nrow(iris), FUN1)
res2 <- cbind(iris, o1)
head(res2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
## test if results are different is negative
stopifnot(all.equal(res1, res2))
Part II
We may follow a similar approach when leaving out a cluster (i.e. species or countries). I show here the outer method. The thing we want to change is that we now want to leave out observations belonging to a specific cluster, here "Species" (in your case "countries"), which unique values we put into a vector Species.u . Since the values are in "character" or "factor" format we subset the data using data[!data$cluster %in% x, ] instead of data[-x, ]. Because predict would yield multiple values in the clusters, but we want the same value in the respective clusters, we might want to use a statistic, e.g. the mean prediction of each cluster. We use rownames according to the cluster.
FUN2 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)
o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)
This now gives us a matrix which is smaller than our data set. Thanks to the rownames we may match the predictions tho the clusters to which they belong.
o2
# fit1 fit2
# setosa 3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica 4.911009 5.976922
res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])
head(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa 5.1 3.5 1.4 0.2 setosa 3.609943 2.662609
# setosa.1 4.9 3.0 1.4 0.2 setosa 3.609943 2.662609
# setosa.2 4.7 3.2 1.3 0.2 setosa 3.609943 2.662609
# setosa.3 4.6 3.1 1.5 0.2 setosa 3.609943 2.662609
# setosa.4 5.0 3.6 1.4 0.2 setosa 3.609943 2.662609
# setosa.5 5.4 3.9 1.7 0.4 setosa 3.609943 2.662609
tail(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# virginica.44 6.7 3.3 5.7 2.5 virginica 4.911009 5.976922
# virginica.45 6.7 3.0 5.2 2.3 virginica 4.911009 5.976922
# virginica.46 6.3 2.5 5.0 1.9 virginica 4.911009 5.976922
# virginica.47 6.5 3.0 5.2 2.0 virginica 4.911009 5.976922
# virginica.48 6.2 3.4 5.4 2.3 virginica 4.911009 5.976922
# virginica.49 5.9 3.0 5.1 1.8 virginica 4.911009 5.976922
Edit
In this version of FUN2, FUN3, the output of the models of each cluster are rbinded (in two columns of course, because of two models).
FUN3 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
(predict(fit, newdata=test))
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)
o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)
res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 3.706940 2.678255
# setosa.2 4.9 3.0 1.4 0.2 setosa 3.500562 2.547587
# setosa.3 4.7 3.2 1.3 0.2 setosa 3.294183 2.416919
# setosa.4 4.6 3.1 1.5 0.2 setosa 3.190994 2.351586
# setosa.5 5.0 3.6 1.4 0.2 setosa 3.603751 2.612921
# setosa.6 5.4 3.9 1.7 0.4 setosa 4.016508 3.073249
Edit 2
As I learned in your comment you want 1. a subset of your data along clusters. This would be ss in FUN4 below. Then the ss is also subsetted by leaving out one row z over the rows of subset ss.
FUN4 <- Vectorize(function(x, y) {
## subsets first by cluster then by row
ss <- iris[iris$Species %in% x,] ## cluster subset
sapply(1:nrow(ss), function(z) { ## subset rows using `sapply`
train <- ss[-z,] ## train data w/o row z
test <- ss[z,] ## test data for `predict`, just row z
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
}, SIMPLIFY=F)
## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
## unique cluster names
Species.u <- unique(iris$Species)
## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)
## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))
## result
head(res4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 1.476004 1.451029
# setosa.2 4.9 3.0 1.4 0.2 setosa 1.449120 1.431737
# setosa.3 4.7 3.2 1.3 0.2 setosa 1.426185 1.416492
# setosa.4 4.6 3.1 1.5 0.2 setosa 1.404040 1.398103
# setosa.5 5.0 3.6 1.4 0.2 setosa 1.462460 1.441295
# setosa.6 5.4 3.9 1.7 0.4 setosa 1.504990 1.559045

Related

Repeated Simulation of New Data Prediction with Tidymodels (Parsnip XGboost)

I have a model, called predictive_fit <- fit(workflow, training) that classifies the Iris dataset species using xgboost. The data are pivoted wide such that each species is a dummied column represented by a 0 or 1. Here, I am trying to predict Virginica based on the Sepal and Petal columns.
Currently, I have the following code which then takes the dataset after the model has been fit to test if it can accurately predict the Virginia species of iris. (Snippet below)
testing_data <-
test %>%
bind_cols(
predict(predictive_fit, test)
)
I cannot, however, figure out how to scale this up with simulation. If I have another dataset with exactly the same structure, I would like to predict whether it is Virginica 100 times. (Snippet below)
new_iris_data <-
new_iris_data %>%
bind_cols(
replicate(n = 100, predict(predictive_fit, new_iris_data))
)
However, it looks as if when I run the new data the same predictions are just being copied 100 times. What is the appropriate way to repeatedly predict the classification? I wouldn't expect that all 100 times the model would predict exactly the same thing, but I'd like some way to have the predictions run n number of times so each and every row of new data can have its own proportion calculated.
I have already tried using the replicate() function to try this. However, it appears as if it copies the same exact results 100 times. I considered having a for loop that iterated through a different seed and then ran the predictions, but I was hoping for a more performant solution out there.
You are replicating the prediction of you model, not the data.frame you call new_iris_data, and the result is exactly that. In order to replicate a (random) part of the iris dataset, try this:
> data("iris")
>
> sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
>
> train <- iris[sample,]
> test <- iris[-sample,]
>
> new_test <- replicate(100, test, simplify = FALSE)
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> nrow(new_test)
[1] 7500
The you can use the new_test in any prediction, independent of the model.
If you want 100 differents random parts of the data set, you need to drop the replicate function and do something like:
> new_test <- lapply(1:100, function(x) {
+ sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
+ iris[-sample,]
+ })
>
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
18 5.1 3.5 1.4 0.3 setosa
> nrow(new_test)
[1] 7500
>
Hope it helps.

save randomForest prediction as vector

I estimate a randomForest, then run the randomForest.predict function on some hold-out data.
What I would like to do is (preferably) append the prediction for each row to the dataframe containing the holdout data as a new column, or (second choice) save the (row number in test data, prediction for that row) as a .csv file.
What I can't do is access the internals of the results object in a way that lets me do that. I'm new to R so I appreciate your help.
I have:
res <-predict(forest_tst1,
test_d,
type="response")
which successfully gives me a bunch of predictions.
The following is not valid R, but ideally I would do something like:
test_d$predicted_value <- results[some_field_of_the_results]
or,
for i = 1:nrow(test_d)
test_d[i, new_column] = results[prediction_for_row_i]
end
Basically I just want a column of predicted 1's or 0's corresponding to rows in test_d. I've been trying to use the following commands to get at the internals of the res object, but I've not found anything that's helped me.
attributes(res)
names(res)
Finally - I'm a bit confused by the following if anyone can explain!
typeof(res) = "integer"
Edit: I can do
res != test_d$gold_label
which is if anything a little confusing, because I'm comparing a column and a non-column object (??), and
length(res) = 2053
and res appears to be indexable
attributes(res[1])
$names
[1] "6836"
$levels
[1] "0" "1"
$class
[1] "factor"
but I can't select out the sub-parts in a sensible way
> res[1][1]
6836
0
Levels: 0 1
> res[1]["levels"]
<NA>
<NA>
Levels: 0 1
If understand right, all you are trying to do is add predictions to your Test Data?
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
TestData = iris[ind == 2,] ## Generate Test Data
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,]) ## Build Model
iris.pred <- predict(iris.rf, iris[ind == 2,]) ## Get Predictions
TestData$Predictions <- iris.pred ## Append the Predictions Column
OutPut:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Predictions
9 4.4 2.9 1.4 0.2 setosa setosa
16 5.7 4.4 1.5 0.4 setosa setosa
17 5.4 3.9 1.3 0.4 setosa setosa
32 5.4 3.4 1.5 0.4 setosa setosa
42 4.5 2.3 1.3 0.3 setosa setosa
46 4.8 3.0 1.4 0.3 setosa setosa

Cycling through a list of dataframes with a for loop

New here and not very experienced, and I'm trying to get a project in R shinyapp to work.
I have a list of data frames which have a column labeled 'Gender' containing all/M/F. I want to filter all data frames based on the input, so that if the input is male, only rows containing M or all are kept.
list_tables <- list(adverb,adjective,simplenoun,verber,thingnoun,
personnoun,name_firstpart,name_secondpart)
input$gender <- "male
if(input$gender == "male"){
for (i in list_tables){
list_tables$i <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
}
Problem is, if I check the list afterwards, nothing has changed. If I do the same, but instead of using a for loop to cycle through the dataframes, I perform the same actions on only one dataframe, it does work. Theoretically, I could make a line of code for each dataframe separately, but it doesn't seem very neat and I have the feeling that the for loop should work but I'm just missing something. Would love to hear tips if anyone has them!
i is not a named-entry within list_tables, so list_tables$i doesn't work. Inside that loop, i is the data.frame you're trying to modify, but you don't update it.
Try either:
for (ind in seq_along(list_tables)) {
i <- list_tables[[ind]] # feels a little sloppt, but it's compact ...
list_tables[[ind]] <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
or even better
list_tables <- lapply(list_tables, function(i) i[which((i$Gender=="M")|(i$Gender=="all")),])
You could use lapply with subset:
example:
list_tables <- replicate(2,iris[c(1,51,101),],F)
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
solution:
lapply(list_tables,subset,Species %in% c("setosa","virginica"))
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
In your case that would be:
lapply(list_tables,subset,Gender %in% c("M","all"))

biglm finds the wrong data.frame to take the data from

I am trying to create chunks of my dataset to run biglm. (with fastLm I would need 350Gb of RAM)
My complete dataset is called res. As experiment I drastically decreased the size to 10.000 rows. I want to create chunks to use with biglm.
library(biglm)
formula <- iris$Sepal.Length ~ iris$Sepal.Width
test <- iris[1:10,]
biglm(formula, test)
And somehow, I get the following output:
> test <- iris[1:10,]
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Above you can see the matrix test contains 10 rows. Yet when running biglm it shows a sample size of 150
> biglm(formula, test)
Large data regression model: biglm(formula, test)
Sample size = 150
Looks like it uses iris instead of test.. how is this possible and how do I get biglm to use chunk1 the way I intend it to?
I suspect the following line is to blame:
formula <- iris$Sepal.Length ~ iris$Sepal.Width
where in the formula you explicitly reference the iris dataset. This will cause R to try and find the iris dataset when lm is called, which it finds in the global environment (because of R's scoping rules).
In a formula you normally do not use vectors, but simply the column names:
formula <- Sepal.Length ~ Sepal.Width
This will ensure that the formula contains only the column (or variable) names, which will be found in the data lm is passed. So, lm will use test in stead of iris.

Convert all column headers into regressors in R

I'm completely new to R and would like to turn each column label (header?) in my data set into a regressor without having to define each regressor one at a time, i.e. date -> data$Date
Is there a way to do this all at once?
Thank you in advance!
Is this what you want:
R > data(iris)
R > head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
R > lm(Sepal.Length ~ ., data = iris)
Call:
lm(formula = Sepal.Length ~ ., data = iris)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.1713 0.4959 0.8292 -0.3152
Speciesversicolor Speciesvirginica
-0.7236 -1.0235
If you want to choose specific column you can use this:
data is sample.data with dependent variable in col 3 and cols 1, 2, 4:8 are independent variables
yy<-lm(as.formula(paste(colnames(sample.data)[3], "~",paste(colnames(sample.data)[c(1, 2,4:8)], collapse = "+"), sep = "")), data=sample.data)
)
summary(yy)

Resources