I fit a randomForest model, then run predict() on some hold-out data.
What I would like to do is (preferably) append the prediction for each row to the dataframe containing the holdout data as a new column, or (second choice) save the (row number in test data, prediction for that row) as a .csv file.
What I can't do is access the internals of the results object in a way that lets me do that. I'm new to R so I appreciate your help.
I have:
res <- predict(forest_tst1,
               test_d,
               type = "response")
which successfully gives me a bunch of predictions.
The following is not valid R, but ideally I would do something like:
test_d$predicted_value <- results[some_field_of_the_results]
or,
for i = 1:nrow(test_d)
test_d[i, new_column] = results[prediction_for_row_i]
end
Basically I just want a column of predicted 1's or 0's corresponding to rows in test_d. I've been trying to use the following commands to get at the internals of the res object, but I've not found anything that's helped me.
attributes(res)
names(res)
Finally - I'm a bit confused by the following if anyone can explain!
typeof(res) = "integer"
Edit: I can do
res != test_d$gold_label
which is if anything a little confusing, because I'm comparing a column and a non-column object (??), and
length(res) = 2053
and res appears to be indexable
attributes(res[1])
$names
[1] "6836"
$levels
[1] "0" "1"
$class
[1] "factor"
but I can't select out the sub-parts in a sensible way
> res[1][1]
6836
0
Levels: 0 1
> res[1]["levels"]
<NA>
<NA>
Levels: 0 1
If I understand right, all you are trying to do is add predictions to your test data?
library(randomForest)                                          ## load the package
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
TestData <- iris[ind == 2, ]                                   ## Generate Test Data
iris.rf <- randomForest(Species ~ ., data = iris[ind == 1, ])  ## Build Model
iris.pred <- predict(iris.rf, iris[ind == 2, ])                ## Get Predictions
TestData$Predictions <- iris.pred                              ## Append the Predictions Column
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Predictions
9 4.4 2.9 1.4 0.2 setosa setosa
16 5.7 4.4 1.5 0.4 setosa setosa
17 5.4 3.9 1.3 0.4 setosa setosa
32 5.4 3.4 1.5 0.4 setosa setosa
42 4.5 2.3 1.3 0.3 setosa setosa
46 4.8 3.0 1.4 0.3 setosa setosa
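The question also mentions, as a second choice, saving (row number in test data, prediction) pairs to a .csv file; a minimal sketch of that, reusing the objects above (the filename is just an example):
out <- data.frame(row = rownames(TestData), prediction = iris.pred)  ## row names carry over from iris
write.csv(out, "predictions.csv", row.names = FALSE)
As for the typeof(res) = "integer" point in the question: predict() for a classification forest returns a factor, and factors are stored internally as integer codes plus a levels attribute, which is why typeof() reports "integer". as.character(res) recovers the "0"/"1" labels if you need them as plain text.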
I have a model, called predictive_fit <- fit(workflow, training) that classifies the Iris dataset species using xgboost. The data are pivoted wide such that each species is a dummied column represented by a 0 or 1. Here, I am trying to predict Virginica based on the Sepal and Petal columns.
Currently, I have the following code, which takes the test dataset after the model has been fit and checks whether it can accurately predict the Virginica species of iris. (Snippet below)
testing_data <-
test %>%
bind_cols(
predict(predictive_fit, test)
)
I cannot, however, figure out how to scale this up with simulation. If I have another dataset with exactly the same structure, I would like to predict whether it is Virginica 100 times. (Snippet below)
new_iris_data <-
new_iris_data %>%
bind_cols(
replicate(n = 100, predict(predictive_fit, new_iris_data))
)
However, it looks as if, when I run the new data, the same predictions are just copied 100 times. What is the appropriate way to repeatedly predict the classification? I wouldn't expect the model to predict exactly the same thing all 100 times; I'd like some way to run the predictions n times so each row of new data can have its own proportion calculated.
I have already tried using the replicate() function for this. However, it appears to copy the same exact results 100 times. I considered having a for loop that iterated through a different seed and then ran the predictions, but I was hoping for a more performant solution.
You are replicating the prediction of your model, not the data.frame you call new_iris_data, and the result is exactly that: the same prediction repeated. In order to replicate a (random) part of the iris dataset, try this:
> data("iris")
>
> sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
>
> train <- iris[sample,]
> test <- iris[-sample,]
>
> new_test <- replicate(100, test, simplify = FALSE)
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> nrow(new_test)
[1] 7500
Then you can use new_test in any prediction, independent of the model.
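For example, a minimal sketch of that prediction step, reusing the bind_cols() pattern from the question (this assumes the predictive_fit object from the question is available):
library(dplyr)
new_test_pred <- new_test %>%
  bind_cols(predict(predictive_fit, new_test))  ## one prediction per replicated row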
If you want 100 different random parts of the data set, you need to drop the replicate() function and do something like:
> new_test <- lapply(1:100, function(x) {
+ sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
+ iris[-sample,]
+ })
>
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
18 5.1 3.5 1.4 0.3 setosa
> nrow(new_test)
[1] 7500
>
Hope it helps.
I'm using the Ionosphere dataset in R and am trying to write a loop that will create new columns that are standardized iterations of existing columns and name them accordingly.
I've got the "cname" as the new column name and c as the original. The code is:
install.packages("mlbench")
library(mlbench)
data('Ionosphere')
library(robustHD)
col <- colnames(Ionosphere)
for (c in col[1:length(col)-1]){
cname <- paste(c,"Std")
Ionosphere$cname <- standardize(Ionosphere$c)
}
But I get the following error:
"Error in `$<-.data.frame`(`*tmp*`, "cname", value = numeric(0)) :
replacement has 0 rows, data has 351
In addition: Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA"
I feel like there's something super-simple I'm missing but I just can't see it.
Any help gratefully received.
We can use lapply, a custom-made standardization function, setNames, and cbind.
I do not have access to your dataset, so I am using the iris dataset as an example:
df <- iris
cbind(df, setNames(lapply(df[1:4],
                          \(x) (x - mean(x)) / sd(x)),
                   paste0(names(df[1:4]), '_Std')))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_Std Sepal.Width_Std Petal.Length_Std Petal.Width_Std
1 5.1 3.5 1.4 0.2 setosa -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 4.9 3.0 1.4 0.2 setosa -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 4.7 3.2 1.3 0.2 setosa -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 4.6 3.1 1.5 0.2 setosa -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 5.0 3.6 1.4 0.2 setosa -1.01843718 1.24503015 -1.33575163 -1.3110521482
...
I feel these transformations get easier with dplyr:
library(dplyr)
iris %>% mutate(across(where(is.numeric),
~ (.x - mean(.x))/sd(.x),
.names = "{col}_Std"))
Is there a way to extract the mean and p-value from a t.test output and create a table that includes the features, mean, and p-value? Say there are 10 columns put through t.test; that means there are 10 means and 10 p-values. How can I create a table that shows only those specific items?
Here is an example, using the iris data:
a. b. c. d. e.
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
t.test(a)
t.test(b) #...etc. we obtain the mean and p-value.
This is the output I'm looking for:
feature mean p-val
col1 0.01 0.95
col2 0.01 0.95
.
.
.
coln
hope it makes sense!
Using the built-in iris data set as an example:
t(sapply(iris[, 1:4], function(i){
t.test(i)[c(5,3)]
}))
The sapply() function iteratively applies that custom function - which performs a t-test on a variable and returns the estimate and p-value - to columns 1 to 4 of iris. The result is then transposed by t() to rotate the data into the desired shape. You can store that as a data.frame using data.frame() and use row.names() to get the variable names into a new column if you like.
values <- t(sapply(iris[, 1:4], function(i){
t.test(i)[c(5,3)]
}))
values <- data.frame("feature"=row.names(values), values)
row.names(values) <- NULL
values
Beware multiple testing though...
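A small variant of the same idea, selecting the htest components by name instead of by position, which can be easier to read (just a sketch on the same iris example):
values2 <- t(sapply(iris[, 1:4], function(i) t.test(i)[c("estimate", "p.value")]))
values2 <- data.frame(feature = row.names(values2), values2, row.names = NULL)
values2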
This question is the second part of a previous question (Linear Regression prediction in R using Leave One out Approach).
I'm trying to build models for each country and generate linear regression predictions using the leave-one-out approach. In other words, in the code below, when building model1 and model2 the "data" used should not be the entire data set; instead it should be a subset of the dataset (one country). Each country's data should be evaluated using a model built with data specific to that country.
The code below returns an error. How can I modify/fix the code below to do that? Or is there a better way of doing that?
library(modelr)
install.packages("gapminder")
library(gapminder)
data(gapminder)
#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder, subset = country)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder, subset = country)
models <- list(fit_model1 = model1,fit_model2 = model2)
gapminder %>% nest_by(continent, country) %>%
bind_cols(
map(1:nrow(gapminder), function(i) {
map_dfc(models, function(model) {
training <- data[-i, ]
fit <- lm(model, data = training)
validation <- data[i, ]
predict(fit, newdata = validation)
})
}) %>%
bind_rows()
)
The most succinct and straightforward solution would be a nested for loop approach, where the outer loop runs over the two model formulae and the inner loop runs over the units we want to leave out. This can also be done with outer, which I also show afterwards.
For the sake of clarity, I first show how to leave out one observation (i.e. one row) in each iteration (Part I). I show later how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris data set, which is smaller and thus easier to handle. It contains a "Species" column that is meant to correspond to the "countries" in your data.
Part I
First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
For the loop, we want to initialize a matrix im whose rows correspond to the number of rows we want to leave out, and columns to the number of model formulae.
im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE),
dimnames=list(NULL, names(FOAE)))
This would look like this:
head(im, n=3)
# fit1 fit2
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
Now we loop over formulas and rows as described above.
for (i in seq(FOAE)) {
for(j in seq(nrow(iris))) {
train <- iris[-j,]
test <- iris[j,]
fit <- lm(FOAE[[i]], data=train)
im[j, i] <- predict(fit, newdata=test)
}
}
im has now been filled, and we may cbind it to the original iris data set to get our result res1.
res1 <- cbind(iris, im)
head(res1)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
To follow the alternative outer() approach, we put the code from inside the for loop into a function, which we Vectorize so that it can handle matrix columns (i.e. vectors).
FUN1 <- Vectorize(function(x, y) {
train <- iris[-x,]
test <- iris[x,]
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
Now we put FOAE and the rows 1:nrow(iris) that are to be left out in turn, together with FUN1, into outer(). This already gives us a result that we can cbind to iris in the same way as above to get res2.
o1 <- outer(FOAE, 1:nrow(iris), FUN1)
res2 <- cbind(iris, o1)
head(res2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
## check that both approaches give identical results
stopifnot(all.equal(res1, res2))
Part II
We may follow a similar approach when leaving out a cluster (i.e. species, or countries in your case). I show the outer method here. What changes is that we now want to leave out observations belonging to a specific cluster, here "Species" (in your case "country"), whose unique values we put into a vector Species.u. Since the values are in "character" or "factor" format, we subset the data using data[!data$cluster %in% x, ] instead of data[-x, ]. Because predict would yield multiple values per cluster, but we want a single value for each cluster, we use a summary statistic, e.g. the mean prediction of each cluster. We set the rownames according to the cluster.
FUN2 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)
o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)
This now gives us a matrix that is smaller than our data set. Thanks to the rownames we can match the predictions to the clusters to which they belong.
o2
# fit1 fit2
# setosa 3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica 4.911009 5.976922
res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])
head(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa 5.1 3.5 1.4 0.2 setosa 3.609943 2.662609
# setosa.1 4.9 3.0 1.4 0.2 setosa 3.609943 2.662609
# setosa.2 4.7 3.2 1.3 0.2 setosa 3.609943 2.662609
# setosa.3 4.6 3.1 1.5 0.2 setosa 3.609943 2.662609
# setosa.4 5.0 3.6 1.4 0.2 setosa 3.609943 2.662609
# setosa.5 5.4 3.9 1.7 0.4 setosa 3.609943 2.662609
tail(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# virginica.44 6.7 3.3 5.7 2.5 virginica 4.911009 5.976922
# virginica.45 6.7 3.0 5.2 2.3 virginica 4.911009 5.976922
# virginica.46 6.3 2.5 5.0 1.9 virginica 4.911009 5.976922
# virginica.47 6.5 3.0 5.2 2.0 virginica 4.911009 5.976922
# virginica.48 6.2 3.4 5.4 2.3 virginica 4.911009 5.976922
# virginica.49 5.9 3.0 5.1 1.8 virginica 4.911009 5.976922
Edit
In this version of FUN2, called FUN3, the outputs of the models for each cluster are rbind-ed (in two columns, of course, because there are two models).
FUN3 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
(predict(fit, newdata=test))
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)
o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)
res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 3.706940 2.678255
# setosa.2 4.9 3.0 1.4 0.2 setosa 3.500562 2.547587
# setosa.3 4.7 3.2 1.3 0.2 setosa 3.294183 2.416919
# setosa.4 4.6 3.1 1.5 0.2 setosa 3.190994 2.351586
# setosa.5 5.0 3.6 1.4 0.2 setosa 3.603751 2.612921
# setosa.6 5.4 3.9 1.7 0.4 setosa 4.016508 3.073249
Edit 2
As I learned from your comment, you want (1) a subset of your data by cluster; this is ss in FUN4 below. Then ss is itself subset by leaving out one row z at a time over the rows of ss.
FUN4 <- Vectorize(function(x, y) {
## subsets first by cluster then by row
ss <- iris[iris$Species %in% x,] ## cluster subset
sapply(1:nrow(ss), function(z) { ## subset rows using `sapply`
train <- ss[-z,] ## train data w/o row z
test <- ss[z,] ## test data for `predict`, just row z
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
}, SIMPLIFY=F)
## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
## unique cluster names
Species.u <- unique(iris$Species)
## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)
## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))
## result
head(res4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 1.476004 1.451029
# setosa.2 4.9 3.0 1.4 0.2 setosa 1.449120 1.431737
# setosa.3 4.7 3.2 1.3 0.2 setosa 1.426185 1.416492
# setosa.4 4.6 3.1 1.5 0.2 setosa 1.404040 1.398103
# setosa.5 5.0 3.6 1.4 0.2 setosa 1.462460 1.441295
# setosa.6 5.4 3.9 1.7 0.4 setosa 1.504990 1.559045
I have a dataset titled nypd, which has a column titled OCCUR_TIME. This column contains various times (ex: 3:57:00, 10:31:00, 22:15:00, etc.).
I would like to create a custom TIME_OF_DAY column using R; I wrote this code below:
nypd$TIME_OF_DAY <- 'Night'
nypd[nypd$OCCUR_TIME >= 6:00:00 & nypd$OCCUR_TIME < 12:00:00,] <- 'Morning'
nypd[nypd$OCCUR_TIME >= 12:00:00 & nypd$OCCUR_TIME < 16:00:00,] <- 'Afternoon'
nypd[nypd$OCCUR_TIME >= 16:00:00 & nypd$OCCUR_TIME < 20:00:00,] <- 'Evening'
The error I am getting is Error in `[<-.data.frame`(`*tmp*`, nypd$OCCUR_TIME >= "6:00:00" & nypd$OCCUR_TIME < : missing values are not allowed in subscripted assignments of data frames.
I'm new to R so I am not too familiar with the error codes, but I'm thinking the error is due to my values in the OCCUR_TIME column not being read as a "time" type of value, so I can't use any operators.
Could someone please help me figure out where I'm going wrong? Thank you!
First, as the error is saying, you have missing values in your data. Since we don't have your data to work with, let's make up some data to use:
> data(iris)
> iris$Petal.Length[3:5] <- NA
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 NA 0.2 setosa
4 4.6 3.1 NA 0.2 setosa
5 5.0 3.6 NA 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Now, it has a problem with subsetting on Petal.Length because it isn't sure what to do when there are missing values.
> iris[iris$Petal.Length > 1.2 & iris$Petal.Length < 1.5, ] <- 50
Error in `[<-.data.frame`(`*tmp*`, iris$Petal.Length > 1.2 & iris$Petal.Length < :
missing values are not allowed in subscripted assignments of data frames
Also note that when you do this:
nypd[nypd$OCCUR_TIME >= 6:00:00 & nypd$OCCUR_TIME < 12:00:00,] <- 'Morning'
You aren't telling it what variable you want to assign 'Morning' to!
You can add a test for is.na to your boolean, and include the variable name you want to affect:
> iris[!is.na(iris$Petal.Length) & iris$Petal.Length > 1.2 & iris$Petal.Length < 1.5, 'Petal.Length'] <- 50
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 50.0 0.2 setosa
2 4.9 3.0 50.0 0.2 setosa
3 4.7 3.2 NA 0.2 setosa
4 4.6 3.1 NA 0.2 setosa
5 5.0 3.6 NA 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The advice about learning how to deal with dates and times in R is good advice; the way you are expressing them here is not valid. If they are being read in as a factor, then perhaps you need to add stringsAsFactors = FALSE to however you are reading your data?
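Purely as an illustrative sketch of one base-R way to get the intended result (assuming OCCUR_TIME is a character or factor column formatted like "22:15:00"): extract the hour as a number, name the column you are assigning to, and keep the is.na() guard discussed above.
hr <- as.integer(sub(":.*", "", as.character(nypd$OCCUR_TIME)))  ## hour part as a number
nypd$TIME_OF_DAY <- "Night"
nypd$TIME_OF_DAY[!is.na(hr) & hr >= 6  & hr < 12] <- "Morning"
nypd$TIME_OF_DAY[!is.na(hr) & hr >= 12 & hr < 16] <- "Afternoon"
nypd$TIME_OF_DAY[!is.na(hr) & hr >= 16 & hr < 20] <- "Evening"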
We could convert 'OCCUR_TIME' to a Time class with as.ITime from data.table, then do the comparison:
library(dplyr)
library(data.table)
nypd %>%
mutate(OCCUR_TIME = as.ITime(OCCUR_TIME),
TIME_OF_DAY = case_when(between(OCCUR_TIME, as.ITime("06:00:00"),
as.ITime("12:00:00")) ~ "Morning",
between(OCCUR_TIME, as.ITime("12:00:00"),
as.ITime("16:00:00")) ~ "Afternoon",
between(OCCUR_TIME, as.ITime("16:00:00"),
as.ITime("20:00:00")) ~ "Evening", TRUE ~ "Night"))
# OCCUR_TIME TIME_OF_DAY
#1 05:22:34 Night
#2 07:22:29 Morning
#3 12:20:05 Afternoon
#4 15:46:23 Afternoon
#5 19:32:42 Evening
data
nypd <- data.frame(OCCUR_TIME = c("05:22:34", "07:22:29", "12:20:05",
"15:46:23", "19:32:42"), stringsAsFactors = FALSE)