I am very new to R and would like to know how to store the classification error value that is reported with a confusion matrix:
confusion(predict(irisfit, iris), iris$Species)
##            Setosa Versicolor Virginica
## Setosa         50          0         0
## Versicolor      0         48         1
## Virginica       0          2        49
## attr(, "error"):
## [1] 0.02
I want to extract the classification error value 0.02 and store it in a variable. How can I do that?
Assuming that your code works, you should be able to do the following:
myconf<-confusion(predict(irisfit, iris), iris$Species)
myerr<-attr(myconf, "error")
which will put the value 0.02 in the variable myerr.
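This works because the error rate printed under the matrix is stored as an attribute of the returned object, and attr() reads an attribute by name. A minimal base-R sketch of the mechanism, in which a small hand-made table stands in for the result of confusion() (the toy data and the hand-set "error" attribute are assumptions for illustration only):

```r
# Hypothetical stand-in for the object returned by confusion()
myconf <- table(predicted = c("a", "a", "b", "b"),
                actual    = c("a", "b", "b", "b"))

# Attach an "error" attribute by hand (confusion() does this for you)
attr(myconf, "error") <- 1 - sum(diag(myconf)) / sum(myconf)

# Read it back into an ordinary numeric variable
myerr <- attr(myconf, "error")
myerr

# List all attribute names to see what else is stored on the object
names(attributes(myconf))
```

If attr() is asked for an attribute that does not exist it returns NULL, so names(attributes(x)) is a quick way to check what is available.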
I am trying to fit several models at once. The dependent variable (DV) is in the first column, and the remaining columns are independent variables (IVs). I want to run a separate logistic regression of the DV on each IV. Thank you very much for your help! Please let me know if anything else needs to be provided.
*** Some of the IVs are two-level (binary) variables, so they should be treated with as.factor in R.
*** After fitting each model, can I also compute a p-value for each model in one go?
*** Right now, I fit and summarize each model separately.
The data and my current code look like this:
[Two screenshots (data and current code) not reproduced here.]
Pictures of your data are not as helpful as providing a sample of your data with dput(). Also you should paste your code directly into your question and not paste a picture. Here is an example using the iris data set that is included with R:
iris.2 <- iris[iris$Species!="setosa", ]
iris.2 <- droplevels(iris.2)
iris.2$Species <- as.numeric(iris.2$Species) - 1
# Species: 0 == versicolor, 1 == virginica
str(iris.2)
# 'data.frame': 100 obs. of 5 variables:
# $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
# $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
# $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
# $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
# $ Species : num 0 0 0 0 0 0 0 0 0 0 ...
Now we compute the logistic regression in which Species is the dependent variable against each of the independent variables.
forms <- paste("Species ~", colnames(iris.2)[-5])
# [1] "Species ~ Sepal.Length" "Species ~ Sepal.Width" "Species ~ Petal.Length" "Species ~ Petal.Width"
iris.glm <- lapply(forms, function(x) glm(as.formula(x), data = iris.2, family = binomial))
Now iris.glm is a list containing all of the results. The results of the first logistic regression are iris.glm[[1]] and summary(iris.glm[[1]]) gives you the summary. To print all of the results use lapply():
lapply(iris.glm, print)
lapply(iris.glm, summary)
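The question also asks for a p-value per model in one go. A possible sketch, assuming a single predictor per model as above: the Wald p-value of the slope is read from the coefficient table of each summary, and a likelihood-ratio p-value for the whole model is computed from the deviances. The names wald.p and lr.p are made up for this illustration.

```r
# Rebuild the data and models so this block runs on its own
iris.2 <- droplevels(iris[iris$Species != "setosa", ])
iris.2$Species <- as.numeric(iris.2$Species) - 1
forms <- paste("Species ~", colnames(iris.2)[-5])
iris.glm <- lapply(forms, function(f) glm(as.formula(f), data = iris.2, family = binomial))
names(iris.glm) <- colnames(iris.2)[-5]

# Wald p-value of the slope: row 2 of the coefficient table (row 1 is the intercept)
wald.p <- sapply(iris.glm, function(m) coef(summary(m))[2, "Pr(>|z|)"])

# Likelihood-ratio p-value: each model against the intercept-only model
lr.p <- sapply(iris.glm, function(m) {
  pchisq(m$null.deviance - m$deviance,
         df = m$df.null - m$df.residual,
         lower.tail = FALSE)
})

data.frame(wald.p, lr.p)
```

For a factor predictor with more than two levels the coefficient table has one row per non-reference level, so the likelihood-ratio version is probably the more convenient single number per model.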
Are there any packages in R that can generate a random dataset given a pre-existing template dataset?
For example, let's say I have the iris dataset:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
I want some function random_df(iris) which will generate a data frame with the same columns as iris but with random data (preferably random data that preserves certain statistical properties of the original, e.g., the mean and standard deviation of the numeric variables).
What is the easiest way to do this?
[Comment from question author moved here. --Editor's note]
I don't want to sample random rows from an existing dataset. I want to generate truly random data with all the same columns (and types) as an existing dataset. Ideally, the statistical properties of the numeric variables would be preserved, but that is not required.
How about this for a start:
Define a function that simulates data from df by
drawing samples from a normal distribution for numeric columns in df, with the same mean and sd as in the original data column, and
uniformly drawing samples from the levels of factor columns.
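generate_data <- function(df, nrow = 10) {
  as.data.frame(lapply(df, function(x) {
    if (is.numeric(x)) {
      # numeric column: normal draws matching the column's mean and sd
      rnorm(nrow, mean = mean(x), sd = sd(x))
    } else if (is.factor(x)) {
      # factor column: uniform draws from the levels, returned as a factor
      factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
    }
  }))
}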
Then for example, if we take iris, we get
df <- generate_data(iris)
str(df)
#'data.frame': 10 obs. of 5 variables:
# $ Sepal.Length: num 6.45 5.42 4.49 6.6 4.79 ...
# $ Sepal.Width : num 2.95 3.76 2.57 3.16 3.2 ...
# $ Petal.Length: num 4.26 5.47 5.29 6.19 2.33 ...
# $ Petal.Width : num 0.487 1.68 1.779 0.809 1.963 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 2 1 2 3 2 1 1 2 3
It should be fairly straightforward to extend the generate_data function to account for other column types.
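One way such an extension might look is sketched below. The helper names generate_column and generate_data2 are hypothetical, and the per-type choices (rounded normal draws for integers, the observed share of TRUE for logicals, uniform draws from the observed values for character columns) are assumptions rather than the only sensible options.

```r
# Hypothetical helper: simulate one column of length n based on the type of x
generate_column <- function(x, n) {
  if (is.factor(x)) {
    factor(sample(levels(x), n, replace = TRUE), levels = levels(x))
  } else if (is.logical(x)) {
    p <- mean(x, na.rm = TRUE)                      # observed share of TRUE
    sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(p, 1 - p))
  } else if (is.integer(x)) {
    # integers are checked before is.numeric(), which is also TRUE for them
    as.integer(round(rnorm(n, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))))
  } else if (is.numeric(x)) {
    rnorm(n, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
  } else if (is.character(x)) {
    sample(unique(x), n, replace = TRUE)
  } else {
    rep(NA, n)                                      # unsupported type: fill with NA
  }
}

generate_data2 <- function(df, nrow = 10) {
  as.data.frame(lapply(df, generate_column, n = nrow), stringsAsFactors = FALSE)
}

# Small mixed-type example
demo <- data.frame(n = c(1.5, 2.5, 3.5), i = 1:3, l = c(TRUE, FALSE, TRUE),
                   s = c("a", "b", "a"), f = factor(c("x", "y", "x")),
                   stringsAsFactors = FALSE)
set.seed(1)
sim <- generate_data2(demo, nrow = 5)
str(sim)
```

Note that the columns are simulated independently, so correlations between variables in the original data are not preserved.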
I fit a randomForest model, then run predict() on some hold-out data.
What I would like to do is (preferably) append the prediction for each row to the dataframe containing the holdout data as a new column, or (second choice) save the (row number in test data, prediction for that row) as a .csv file.
What I can't do is access the internals of the results object in a way that lets me do that. I'm new to R so I appreciate your help.
I have:
res <-predict(forest_tst1,
which successfully gives me a bunch of predictions.
The following is not valid R, but ideally I would do something like:
test_d$predicted_value <- results[some_field_of_the_results]
for i = 1:nrow(test_d)
test_d[i, new_column] = results[prediction_for_row_i]
Basically I just want a column of predicted 1's or 0's corresponding to rows in test_d. I've been trying to use the following commands to get at the internals of the res object, but I've not found anything that's helped me.
Finally - I'm a bit confused by the following if anyone can explain!
typeof(res) = "integer"
Edit: I can do
res != test_d$gold_label
which is if anything a little confusing, because I'm comparing a column and a non-column object (??), and
length(res) = 2053
and res appears to be indexable
[1] "6836"
[1] "0" "1"
[1] "factor"
but I can't select out the sub-parts in a sensible way
> res[1][1]
Levels: 0 1
> res[1]["levels"]
Levels: 0 1
If I understand correctly, all you are trying to do is add the predictions to your test data?
library(randomForest)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
TestData = iris[ind == 2,] ## Generate Test Data
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,]) ## Build Model
iris.pred <- predict(iris.rf, iris[ind == 2,]) ## Get Predictions
TestData$Predictions <- iris.pred ## Append the Predictions Column
head(TestData)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Predictions
9           4.4         2.9          1.4         0.2  setosa      setosa
16          5.7         4.4          1.5         0.4  setosa      setosa
17          5.4         3.9          1.3         0.4  setosa      setosa
32          5.4         3.4          1.5         0.4  setosa      setosa
42          4.5         2.3          1.3         0.3  setosa      setosa
46          4.8         3.0          1.4         0.3  setosa      setosa
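On the typeof(res) confusion in the question: for a classification forest, predict() returns a factor, and a factor is stored internally as integer codes plus a levels attribute, which is why typeof() reports "integer" while class() reports "factor". The sketch below uses a small hand-made factor as a stand-in for res (the values and row names are invented) and also shows the second-choice option of writing row id and prediction to a .csv file.

```r
# Hypothetical stand-in for the predictions: a named factor with levels "0" and "1"
res <- factor(c("0", "1", "1"), levels = c("0", "1"))
names(res) <- c("6836", "6840", "6851")   # made-up row names of the test data

typeof(res)    # storage type: integer codes
class(res)     # the class that R dispatches on: factor
levels(res)
names(res)[1]

# Convert to plain 0/1 values; go via character, because as.integer(res)
# alone would return the internal codes (1, 2, 2) instead of the labels
pred.int <- as.integer(as.character(res))

# Save (row id, prediction) pairs as a .csv file in a temporary directory
out <- data.frame(row = names(res), prediction = pred.int)
csv.path <- file.path(tempdir(), "predictions.csv")
write.csv(out, csv.path, row.names = FALSE)
```

Because a factor is just a vector, res != test_d$gold_label in the question compares two vectors element by element, which is why it works.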
On this website:
it says that we can use this for prediction:
> prediction <- predict(svm1, test_iris)
> xtab <- table(test_iris$Species, prediction)
> xtab
            prediction
             setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         20         1
  virginica       0          0        19
and this to compute the accuracy:
(20+20+19)/nrow(test_iris) # Compute prediction accuracy
But when I have a very large data set, I cannot even view the table. How can I obtain this number (20+20+19) to compute the accuracy?
You can get the number of correctly classified observations with diag():
library(e1071)
svm1 <- svm(Species~., data=iris)
prediction <- predict(svm1, iris)
xtab <- table(iris$Species, prediction)
sum(diag(xtab))/sum(xtab) #Overall
#[1] 0.9733333
diag(xtab)/rowSums(xtab) #For each class, share of its observations predicted correctly
# setosa versicolor virginica
# 1.00 0.96 0.96
diag(xtab)/colSums(xtab) #For each class, share of its predictions that are correct
# setosa versicolor virginica
# 1.00 0.96 0.96
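If only the overall accuracy is needed, the table can be skipped altogether by comparing the two vectors directly. A small base-R sketch with made-up labels (the vectors actual and predicted are hypothetical, not the SVM output above):

```r
# Made-up true labels and predictions, for illustration only
actual    <- factor(c("setosa", "setosa", "versicolor",
                      "versicolor", "virginica", "virginica"))
predicted <- factor(c("setosa", "setosa", "versicolor",
                      "virginica", "virginica", "virginica"),
                    levels = levels(actual))

# Share of matching labels; comparing as character avoids an error
# when the two factors do not have identical level sets
accuracy <- mean(as.character(predicted) == as.character(actual))
accuracy

# Same number via the confusion table, as in the answer above
xtab <- table(actual, predicted)
sum(diag(xtab)) / sum(xtab)
```

The diag() approach assumes that the rows and columns of the table list the same classes in the same order; the mean() comparison does not depend on that.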
Hi, I'm currently trying to extract some of the inner node information stored in the constparty object that ctree in partykit returns, but I'm finding the objects a bit difficult to navigate. I'm able to display the information on a plot, but I'm not sure how to extract it. I think it requires nodeapply or another function in partykit?
library(partykit)
irisct <- ctree(Species ~ ., data = iris)
plot(irisct, inner_panel = node_barplot(irisct))
[Image: plot with inner node details]
All of the information is evidently accessible to the plotting functions, but I'm after a text output similar to:
[Image: example output]
The main trick (as previously pointed out by @G5W) is to take the [id] subset of the party object and then extract the data (either with $data or with the data_party() function), which contains the response. I would recommend building a table with absolute frequencies first and then computing the relative and marginal frequencies from that. Using the irisct object, the plain table can be obtained by
tab <- sapply(1:length(irisct), function(id) {
  y <- data_party(irisct[id])
  y <- y[["(response)"]]
  table(y)
})
tab
##            [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## setosa       50   50    0    0    0    0    0
## versicolor   50    0   50   49   45    4    1
## virginica    50    0   50    5    1    4   45
Then we can add a little bit of formatting to a nice table object:
colnames(tab) <- 1:length(irisct)
tab <- as.table(tab)
names(dimnames(tab)) <- c("Species", "Node")
And then use prop.table() and margin.table() to compute the frequencies we are interested in. The as.data.frame() method transforms the table layout into a "long" data.frame:
as.data.frame(prop.table(tab, 1))
## Species Node Freq
## 1 setosa 1 0.500000000
## 2 versicolor 1 0.251256281
## 3 virginica 1 0.322580645
## 4 setosa 2 0.500000000
## 5 versicolor 2 0.000000000
## 6 virginica 2 0.000000000
## 7 setosa 3 0.000000000
## 8 versicolor 3 0.251256281
## 9 virginica 3 0.322580645
## 10 setosa 4 0.000000000
## 11 versicolor 4 0.246231156
## 12 virginica 4 0.032258065
## 13 setosa 5 0.000000000
## 14 versicolor 5 0.226130653
## 15 virginica 5 0.006451613
## 16 setosa 6 0.000000000
## 17 versicolor 6 0.020100503
## 18 virginica 6 0.025806452
## 19 setosa 7 0.000000000
## 20 versicolor 7 0.005025126
## 21 virginica 7 0.290322581
as.data.frame(margin.table(tab, 2))
## Node Freq
## 1 1 150
## 2 2 50
## 3 3 100
## 4 4 54
## 5 5 46
## 6 6 8
## 7 7 46
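The margin argument of prop.table() decides what the proportions sum to: margin 1 normalizes each row (each species across the nodes, as in the output above), whereas margin 2 normalizes each column and so gives the class shares within each node. A self-contained sketch that re-enters the counts from the table above by hand, so that it runs without partykit:

```r
# Counts copied from the node table above (rows: Species, columns: Node)
tab <- as.table(matrix(
  c(50, 50, 50,   50, 0, 0,   0, 50, 50,   0, 49, 5,
    0, 45, 1,     0, 4, 4,    0, 1, 45),
  nrow = 3,
  dimnames = list(Species = c("setosa", "versicolor", "virginica"),
                  Node = as.character(1:7))))

# Class shares within each node: every column sums to 1
within.node <- prop.table(tab, 2)
round(within.node, 3)

# Number of observations per node
margin.table(tab, 2)
```

For node 4 the within-node shares agree with the per-node probabilities computed with table(...) / NObs[4] in the other answer.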
And the split information can be obtained with the (still unexported) .list.rules.party() function. You just need to ask for all node IDs (the default is to use just the terminal node IDs):
partykit:::.list.rules.party(irisct, i = nodeids(irisct))
## 1
## ""
## 2
## "Petal.Length <= 1.9"
## 3
## "Petal.Length > 1.9"
## 4
## "Petal.Length > 1.9 & Petal.Width <= 1.7"
## 5
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length <= 4.8"
## 6
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length > 4.8"
## 7
## "Petal.Length > 1.9 & Petal.Width > 1.7"
Most of the information that you want is accessible without much work.
I will show how to get the information, but leave you to format the
information into a pretty table.
Notice that your tree structure irisct is just a list of each of the nodes.
length(irisct)
[1] 7
Each node has a field data that contains the points that have made it down
this far in the tree, so you can get the number of observations at the node
by counting the rows.
dim(irisct[4]$data)
[1] 54 5
nrow(irisct[4]$data)
[1] 54
Or doing them all at once to get your table 2:
NObs = sapply(1:7, function(n) { nrow(irisct[n]$data) })
NObs
[1] 150 50 100 54 46 8 46
The first column of the data at a node is the class (Species),
so you can get the count of each class and the probability of each class
at a node
table(irisct[4]$data[1])
    setosa versicolor  virginica
         0         49          5
table(irisct[4]$data[1]) / NObs[4]
    setosa versicolor  virginica
0.00000000 0.90740741 0.09259259
The split information in your table 3 is a bit more awkward. Still,
you can get a text version of what you need just by printing out the
top level node
irisct[1]
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Length <= 1.9: setosa (n = 50, err = 0.0%)
| [3] Petal.Length > 1.9
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.8: versicolor (n = 46, err = 2.2%)
| | | [6] Petal.Length > 4.8: versicolor (n = 8, err = 50.0%)
| | [7] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 3
Number of terminal nodes: 4
To save the output for parsing and display:
TreeSplits = capture.output(print(irisct[1]))
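What the parsing step could look like is sketched below. To keep it runnable without partykit, the captured lines are typed in by hand from the printout above (the spacing of the real captured output may differ), and the column names id, depth and rule are made up for this illustration.

```r
# Hand-typed stand-in for capture.output(print(irisct[1]))
TreeSplits <- c(
  "Fitted party:",
  "[1] root",
  "| [2] Petal.Length <= 1.9: setosa (n = 50, err = 0.0%)",
  "| [3] Petal.Length > 1.9",
  "| | [4] Petal.Width <= 1.7",
  "| | | [5] Petal.Length <= 4.8: versicolor (n = 46, err = 2.2%)",
  "| | | [6] Petal.Length > 4.8: versicolor (n = 8, err = 50.0%)",
  "| | [7] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)",
  "Number of inner nodes: 3")

# Keep only the lines that describe a node, i.e. contain "[<number>]"
node.lines <- grep("\\[[0-9]+\\]", TreeSplits, value = TRUE)

# Node id: the number inside the square brackets
id <- as.integer(sub("^.*\\[([0-9]+)\\].*$", "\\1", node.lines))

# Depth in the tree: the number of "|" markers before the node id
depth <- lengths(regmatches(node.lines, gregexpr("|", node.lines, fixed = TRUE)))

# Split rule: text after the id, with any ": class (n = ..., err = ...)" part removed
rule <- sub(":.*$", "", sub("^.*\\[[0-9]+\\] *", "", node.lines))

splits <- data.frame(id, depth, rule, stringsAsFactors = FALSE)
splits
```

Unlike the output of .list.rules.party(), each rule here is only the last split leading to the node, so the full path has to be rebuilt from the depth column if it is needed.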