I have a list of what are essentially tables of different variables, with a reproducible dummy example below (it's a little ugly, but it gets the idea across).
results <- list()
for(ii in names(iris)[1:4]) {
mytab <- table(iris[,i] > mean(iris[,i]), iris$Species)
myp <- chisq.test(mytab)$p.value
results[[ii]] <- as.data.frame(cbind(mytab, P.value=myp))
results[[ii]] <- tibble::rownames_to_column(results[[ii]], ii)
}
In a previous version R (at least 4.0), I used to be able to do something like:
lapply(results, function(x) write.table(x, "myfile.txt", append=T, sep="\t", quote=F, row.names=F))
which would generate a file called myfile.txt and fill it with all of my tables, much like the list of printed tables from results. I've had this code (which was functioning as expected) since at least the end of 2021. However, I now get the error:
Error in write.table(x, "myfile.txt", append = T, sep = "\t", quote = T, :
(converted from warning) appending column names to file
And to some extent I get it -- the column names I'm using aren't identical to what I'm appending, but I don't really care for my purposes. I just want my printed list of tables. Is there a way to force appending irrespective of mismatched column names? I've tried using col.names=NA but then receive the error that using col.names=NA with row.names=F "makes no sense". Do I need to resign myself to using functions like sink for this? I'd really like everything to remain tab-separated if possible.
It appears to be baked-in, depending solely on the col.names and append arguments and no easy way to squelch it there.
In general it's just a warning, but since it was elevated to Error status, that suggests you've set options(warn = 2) or higher. It's not a factor for these resolutions (which result in no warning being emitted and therefore no escalation to an error).
Suppress it and all other warnings (for good or bad):
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F)
# Error in write.table(data.frame(a = 1, b = 2), "quux.csv", append = T, :
# (converted from warning) appending column names to file
suppressWarnings(write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F))
### nothing emitted, file appended
Suppress just that warning, allowing others (since suppressing all can hide other issues):
withCallingHandlers(
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F),
warning = function(w) {
if (grepl("appending column names to file", conditionMessage(w))) {
invokeRestart("muffleWarning")
}
})
### nothing emitted, file appended
withCallingHandlers(
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F),
warning = function(w) {
if (grepl("something else", conditionMessage(w))) {
invokeRestart("muffleWarning")
}
})
# Error in write.table(data.frame(a = 1, b = 2), "quux.csv", append = T, :
# (converted from warning) appending column names to file
Another potential solution is to use write.list() from the erer package:
library(erer)
#> Loading required package: lmtest
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
results <- list()
for(ii in names(iris)[1:4]) {
mytab <- table(iris[,ii] > mean(iris[,ii]), iris$Species)
myp <- chisq.test(mytab)$p.value
results[[ii]] <- as.data.frame(cbind(mytab, P.value=myp))
results[[ii]] <- tibble::rownames_to_column(results[[ii]], ii)
}
write.list(z = results, file = "myfile.txt", row.names = FALSE, quote = FALSE)
read.csv("~/Desktop/myfile.txt")
#> Result Sepal.Length setosa versicolor virginica P.value
#> 1 Sepal.Length FALSE 50 24 6 8.373761e-18
#> 2 Sepal.Length TRUE 0 26 44 8.373761e-18
#> 3
#> 4 Result Sepal.Width setosa versicolor virginica P.value
#> 5 Sepal.Width FALSE 8 42 33 1.24116e-11
#> 6 Sepal.Width TRUE 42 8 17 1.24116e-11
#> 7
#> 8 Result Petal.Length setosa versicolor virginica P.value
#> 9 Petal.Length FALSE 50 7 0 9.471374e-28
#> 10 Petal.Length TRUE 0 43 50 9.471374e-28
#> 11
#> 12 Result Petal.Width setosa versicolor virginica P.value
#> 13 Petal.Width FALSE 50 10 0 4.636126e-26
#> 14 Petal.Width TRUE 0 40 50 4.636126e-26
#> 15
# You can also specify the table names, e.g.
write.list(z = results, file = "myfile2.txt", row.names = FALSE, quote = FALSE, t.name = 1:4)
read.csv("~/Desktop/myfile2.txt")
#> Result Sepal.Length setosa versicolor virginica P.value
#> 1 1 FALSE 50 24 6 8.373761e-18
#> 2 1 TRUE 0 26 44 8.373761e-18
#> 3
#> 4 Result Sepal.Width setosa versicolor virginica P.value
#> 5 2 FALSE 8 42 33 1.24116e-11
#> 6 2 TRUE 42 8 17 1.24116e-11
#> 7
#> 8 Result Petal.Length setosa versicolor virginica P.value
#> 9 3 FALSE 50 7 0 9.471374e-28
#> 10 3 TRUE 0 43 50 9.471374e-28
#> 11
#> 12 Result Petal.Width setosa versicolor virginica P.value
#> 13 4 FALSE 50 10 0 4.636126e-26
#> 14 4 TRUE 0 40 50 4.636126e-26
#> 15
Created on 2022-07-19 by the reprex package (v2.0.1)
Related
In the below code, I've simulated dice rolls at increasing sample sizes and computed the average roll at each sample size. My lapply function works, but I'm uncomfortable with it since I know sample_n is not a dplyr function and has been superceded by slice_sample. I would like make my code better with a dplyr solution rather than sample_n() within the lapply. I think I may have other syntactical errors within the lapply. Here is the code:
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice role
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = sample_n(dice_df,var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
The final step is computing the difference compared to the expected value, 3.5. I want a column where that shows the difference between 3.5 and the sample mean. We should see the difference decreasing as the sample size increases.
output <- output %>%
mutate(difference = across(sample_mean, ~3.5 - .x))
When I run this, it's throwing this error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
I've tried using sapply but I get a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix', 'array', 'list')"
If it helps, here was my failed attempt at using slice_sample:
output <- lapply(X=sample_sizes, FUN = function(...){
obs = slice_sample(dice_df, ..., .preserve=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, ...)
return(new.df)
})
I got this error: Error: '...' used in an incorrect context
The output is just a single row data.frame element in a list. We can bind them with bind_rows and simply subtract once instead of doing this multiple times
library(dplyr)
bind_rows(output) %>%
mutate(difference = 3.5 - sample_mean )
sample_mean var difference
1 3.500000 10 0.00000000
2 2.800000 25 0.70000000
3 3.440000 50 0.06000000
4 3.510000 100 -0.01000000
5 3.495000 1000 0.00500000
6 3.502200 10000 -0.00220000
7 3.502410 100000 -0.00241000
8 3.498094 1000000 0.00190600
9 3.500183 100000000 -0.00018332
The n argument of slice_sample correspondes to sample_n's size argument.
And to calculate the difference of your output list we can use purrr::map instead of dplyr::across.
library(dplyr)
library(purrr)
set.seed(123)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice role
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = slice_sample(dice_df,n = var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
output %>%
map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#>
#> [[2]]
#> [1] 0.42
#>
#> [[3]]
#> [1] -0.04
#>
#> [[4]]
#> [1] -0.34
#>
#> [[5]]
#> [1] 0.025
#>
#> [[6]]
#> [1] 0.0317
#>
#> [[7]]
#> [1] 0.00416
#>
#> [[8]]
#> [1] -2.6e-05
#>
#> [[9]]
#> [1] -4.405e-05
Created on 2021-08-02 by the reprex package (v0.3.0)
Alternatively, we can use purrr::map_df and add a row diff inside each tibble as proposed by Martin Gal in the comments:
output %>%
map_df(~ tibble(.x, diff = 3.5 - .x$sample_mean))
#> # A tibble: 9 x 3
#> sample_mean var diff
#> <dbl> <dbl> <dbl>
#> 1 2.6 10 0.9
#> 2 3.28 25 0.220
#> 3 3.66 50 -0.160
#> 4 3.5 100 0
#> 5 3.53 1000 -0.0270
#> 6 3.50 10000 -0.00180
#> 7 3.50 100000 -0.00444
#> 8 3.50 1000000 -0.000226
#> 9 3.50 100000000 -0.0000669
Here is a base R way -
transform(do.call(rbind, output), difference = 3.5 - sample_mean)
# sample_mean var difference
#1 3.80 10 -0.300000
#2 3.44 25 0.060000
#3 3.78 50 -0.280000
#4 3.30 100 0.200000
#5 3.52 1000 -0.015000
#6 3.50 10000 -0.004200
#7 3.50 100000 -0.004370
#8 3.50 1000000 0.002696
#9 3.50 100000000 0.000356
If you just need the difference value you can do -
3.5 - sapply(output, `[[`, 'sample_mean')
I am using tidyverse to connect to multiple databases with the same data structure (clusters). Due to different database sources a union ist not possible without copy locally.
I can do everything with long coding, but now I try to shorten the code an run in the following problem. When defining the column names for the select statement dbplyr stores that with a looping variable into the connection rather than evaluating and store the result of the string.
Here is a minimal reproducible example:
library(tidyverse)
#reproducable example with two database and two tables in memory
con1 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
con2 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con1, mtcars)
copy_to(con1, iris)
copy_to(con2, mtcars)
copy_to(con2, iris)
#names of the tables
tables<-c("mtcars", "iris")
#specify which columns to select from which table
columns<-list("mtcars"=c("mpg", "hp"),
"iris"=c("Sepal.Length", "Sepal.Width"))
#list to put
data_list<-vector(mode="list", length=length(tables))
names(data_list)<-tables
#loop over tables
for(i in tables){
#loop over databases
for(j in 1:2)
data_list[[i]][[j]]<-tbl(get(paste0("con",j)), i)%>%select(columns[[i]])
}
This code works fine so far, but the problem is with accessing the data stored in the list (data_list).
If I try
data_list[[1]][[1]]
R still evaluate
select(columns[[i]])
After looping i ist "iris" and the statement give the error message:
Error: Unknown columns Sepal.Length and Sepal.Width
For the second list in data_list it works fine because i is set appropriate:
data_list[[2]][[1]]
How can I force the select statement to evaluate the expression and not to store the expression with the looping variable I?
In the next step, I would like to add a filter expression too so that I don't have to collect all the data and only the data needed.
If the union would work over databases without copy that would solve all the problems
Thx and best regards
Thomas
Hmm, you mean that you want to select columns interactively after you've queried the data base?
I edited your code to use use the tidyverse functions (since you've already loaded).
library(tidyverse)
# Reproducable example with two database and two tables in memory
con1 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
con2 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con1, mtcars)
copy_to(con1, iris)
copy_to(con2, mtcars)
copy_to(con2, iris)
# Specify which columns to select from which table
columns <-list("mtcars" = c("mpg", "hp"), "iris" = c("Sepal.Length", "Sepal.Width"))
# Loop over the table names (mtcars, iris) **and** the columns that belong to those datasets
data_list <-
map2(names(columns), columns, ~ {
# For each table/column combination, grab them from con1 and con2 and return them in a list
con1_db <- tbl(con1, .x) %>% select(.y)
con2_db <- tbl(con2, .x) %>% select(.y)
list(con1_db, con2_db)
}) %>%
setNames(names(columns))
# With this you can interactively select the columns you wanted for each data. Just replace the dataset that you're interested in.
data_list %>%
pluck("iris") %>%
map(select, columns[['iris']])
#> [[1]]
#> Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
#> Please use `eval_tidy()` with a data mask instead.
#> This warning is displayed once per session.
#> Warning: `overscope_clean()` is deprecated as of rlang 0.2.0.
#> This warning is displayed once per session.
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.30.1 [:memory:]
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> # … with more rows
#>
#> [[2]]
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.30.1 [:memory:]
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> # … with more rows
I am trying to retain an ID on the row when predicting using a Random Forest model to merge back on to the original dataframe. I am using step_naomit in the recipe that removes the rows with missing data when I bake the training data, but also removes the records with missing data on the testing data. Unfortunately, I don't have an ID to easily know which records were removed so I can accurately merge back on the predictions.
I have tried to add an ID column to the original data, but bake will remove any variable not included in the formula (and I don't want to include ID in the formula). I also thought I may be able to retain the row.names from the original table to merge on, but it appears the row.name is reset upon baking as well.
I realize I can remove the NA values prior to the recipe to solve this problem, but then what is the point of step_naomit in the recipe? I also tried skip=TRUE in the step_naomit, but then I get an error for missing data when fitting the model (only for random forest). I feel I am missing something here in tidymodels that would allow me to retain all the rows prior to baking?
See example:
## R 3.6.1 ON WINDOWS 10 MACHINE
require(tidyverse)
require(tidymodels)
require(ranger)
set.seed(123)
temp <- iris %>%
dplyr::mutate(Petal.Width = case_when(
round(Sepal.Width) %% 2 == 0 ~ NA_real_, ## INTRODUCE NA VALUES
TRUE ~ Petal.Width))
mySplit <- rsample::initial_split(temp, prop = 0.8)
myRecipe <- function(dataFrame) {
recipes::recipe(Petal.Width ~ ., data = dataFrame) %>%
step_naomit(all_numeric()) %>%
prep(data = dataFrame)
}
myPred <- function(mySplit,myRecipe) {
train_set <- training(mySplit)
test_set <- testing(mySplit)
train_prep <- myRecipe(train_set)
analysis_processed <- bake(train_prep, new_data = train_set)
model <- rand_forest(
mode = "regression",
mtry = 3,
trees = 50) %>%
set_engine("ranger", importance = 'impurity') %>%
fit(Sepal.Width ~ ., data=analysis_processed)
test_processed <- bake(train_prep, new_data = test_set)
test_processed %>%
bind_cols(myPrediction = unlist(predict(model,new_data=test_processed)))
}
getPredictions <- myPred(mySplit,myRecipe)
nrow(getPredictions)
## 21 ROWS
max(as.numeric(row.names(getPredictions)))
## 21
nrow(testing(mySplit))
## 29 ROWS
max(as.numeric(row.names(testing(mySplit))))
## 150
To be able to keep track of which observations were removed we need to give the original dataset an id variable.
temp <- iris %>%
dplyr::mutate(Petal.Width = case_when(
round(Sepal.Width) %% 2 == 0 ~ NA_real_, ## INTRODUCE NA VALUES
TRUE ~ Petal.Width),
id = row_number()) #<<<<
Then we use update_role() to first designate it as an "id variable", then remove it as a predictor so it doesn't become part of the modeling process. And that is it. Everything else should work like before. Below is the fully updated code with #<<<< to denote my changes.
require(tidyverse)
#> Loading required package: tidyverse
require(tidymodels)
#> Loading required package: tidymodels
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
#> ── Attaching packages ───────────────────── tidymodels 0.0.3 ──
#> ✔ broom 0.5.2 ✔ recipes 0.1.7
#> ✔ dials 0.0.3 ✔ rsample 0.0.5
#> ✔ infer 0.5.0 ✔ yardstick 0.0.4
#> ✔ parsnip 0.0.4
#> ── Conflicts ──────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard() masks purrr::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ recipes::fixed() masks stringr::fixed()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ dials::margin() masks ggplot2::margin()
#> ✖ dials::offset() masks stats::offset()
#> ✖ yardstick::spec() masks readr::spec()
#> ✖ recipes::step() masks stats::step()
require(ranger)
#> Loading required package: ranger
set.seed(1234)
temp <- iris %>%
dplyr::mutate(Petal.Width = case_when(
round(Sepal.Width) %% 2 == 0 ~ NA_real_, ## INTRODUCE NA VALUES
TRUE ~ Petal.Width),
id = row_number()) #<<<<
mySplit <- rsample::initial_split(temp, prop = 0.8)
myRecipe <- function(dataFrame) {
recipes::recipe(Petal.Width ~ ., data = dataFrame) %>%
update_role(id, new_role = "id variable") %>% #<<<<
update_role(-id, new_role = 'predictor') %>% #<<<<
step_naomit(all_numeric()) %>%
prep(data = dataFrame)
}
myPred <- function(mySplit,myRecipe) {
train_set <- training(mySplit)
test_set <- testing(mySplit)
train_prep <- myRecipe(train_set)
analysis_processed <- bake(train_prep, new_data = train_set)
model <- rand_forest(
mode = "regression",
mtry = 3,
trees = 50) %>%
set_engine("ranger", importance = 'impurity') %>%
fit(Sepal.Width ~ ., data=analysis_processed)
test_processed <- bake(train_prep, new_data = test_set)
test_processed %>%
bind_cols(myPrediction = unlist(predict(model,new_data=test_processed)))
}
getPredictions <- myPred(mySplit, myRecipe)
getPredictions
#> # A tibble: 23 x 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species id myPrediction
#> <dbl> <dbl> <dbl> <dbl> <fct> <int> <dbl>
#> 1 4.6 3.1 1.5 0.2 setosa 4 3.24
#> 2 4.3 3 1.1 0.1 setosa 14 3.04
#> 3 5.1 3.4 1.5 0.2 setosa 40 3.22
#> 4 5.9 3 4.2 1.5 versico… 62 2.98
#> 5 6.7 3.1 4.4 1.4 versico… 66 2.92
#> 6 6 2.9 4.5 1.5 versico… 79 3.03
#> 7 5.7 2.6 3.5 1 versico… 80 2.79
#> 8 6 2.7 5.1 1.6 versico… 84 3.12
#> 9 5.8 2.6 4 1.2 versico… 93 2.79
#> 10 6.2 2.9 4.3 1.3 versico… 98 2.88
#> # … with 13 more rows
# removed ids
setdiff(testing(mySplit)$id, getPredictions$id)
#> [1] 5 28 47 70 90 132
Created on 2019-11-26 by the reprex package (v0.3.0)
Using skip = TRUE in the step_naomit() recipe specification, and then including the recipe in a workflow might be the proper solution. For example,
myRecipe <- recipe(Petal.Width ~ ., data = dataFrame) %>%
step_naomit(all_numeric(), step = FALSE)`
# don't include the prep()
wflow <- workflow() %>%
add_model(model) %>%
add_recipe(myRecipe)
wflow_fit <- wflow %>%
fit(train_set)
preds <- predict(wflow_fit, new_data = (test_set))
How does one add metadata to a tibble?
I would like a sentence describing each of my variable names such that I could print out the tibble with the associated metadata and if I handed it to someone who hadn't seen the data before, they could make some sense of it.
as_tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
# Sepal.length. Measured from sepal attachment to stem
# Sepal.width. Measured at the widest point
# Petal.length. Measured from petal attachment to stem
# Petal.width. Measured at widest point
# Species. Nomenclature based on Integrated Taxonomic Information System (ITIS), January 2018.
thanks!
This seems tricky. In principle #hrbrmstr's comment is the way to go (i.e. use ?comment or ?attr to add attributes to any object), but these attributes will not be printed out by default. Attributes seem to be printed automatically for atomic objects:
> z <- 1:6
> attr(z,"hello") <- "goodbye"
> z
[1] 1 2 3 4 5 6
attr(,"hello")
[1] "goodbye"
... but not, alas, for data frames or tibbles:
dd <- tibble::tibble(x=1:4,y=2:5)
> attr(dd,"metadata") <- c("some stuff","some more stuff")
> dd
# A tibble: 4 x 2
x y
<int> <int>
1 1 2
2 2 3
3 3 4
4 4 5
You can wrap the object with its own S3 class to get this stuff printed:
class(dd) <- c("my_tbl",class(dd))
> print.my_tbl <- function(x) {
+ NextMethod(x)
+ print(attr(x,"metadata"))
+ invisible(x)
+ }
> dd
# A tibble: 4 x 2
x y
<int> <int>
1 1 2
2 2 3
3 3 4
4 4 5
[1] "some stuff" "some more stuff"
You could make the printing more elaborate or pretty, e.g.
cat("\nMETADATA:\n")
cat(sprintf("# %s",attr(x,"metadata")),sep="\n")
Nothing bad will happen if the other user hasn't defined print.my_tbl (the print method will fall back to the print method for tibbles), but the metadata will only be printed if they have your print.my_tbl definition ...
Sorry for the delayed response. But this topic has been bugging me since I first started learning R. In my work, assigning metadata to columns is not just common. It is required. That R didn't seem to have a nice way to do it was really bothering me. So much so, that I wrote some packages to do it.
The fmtr package has a function to assign the descriptions (plus other stuff). And the libr package has a dictionary function, so you can look at all the metadata you assign.
Here is how it works:
First, assign the descriptions to the columns. You just send a named list into to the descriptions() function.
library(fmtr)
library(libr)
# Create data frame
df <- iris
# Assign descriptions
descriptions(df) <- list(Sepal.Length = "Measured from sepal attachment to stem",
Sepal.Width = "Measured at the widest point",
Petal.Length = "Measured from petal attachment to stem",
Petal.Width = "Measured at the widest point",
Species = paste("Nomanclature based on Integrated Taxonomic",
"Information System (ITIS), January 2018."))
Then you can see all the metadata by calling the dictionary() function, like so:
dictionary(df)
# # A tibble: 5 x 10
# Name Column Class Label Description
# <chr> <chr> <chr> <chr> <chr>
# 1 df Sepal.Leng~ numer~ NA Measured from sepal attachment to stem
# 2 df Sepal.Width numer~ NA Measured at the widest point
# 3 df Petal.Leng~ numer~ NA Measured from petal attachment to stem
# 4 df Petal.Width numer~ NA Measured at the widest point
# 5 df Species factor NA Nomanclature based on Integrated Taxonomic Information Syst~
If you like, you can return the dictionary as its own data frame, then save it or print it or whatever.
d <- dictionary(df)
Here is the dictionary data frame:
This is not all that different than Ben Bolker's suggestions, but conceptually, if I want information to be related to the vectors in my data frame, I would prefer they be directly tied to the vectors. In other words, I'd prefer to add the attributes to the vectors themselves rather than to the data frame object.
I don't know that I would go so far as to add a custom class to the object, but perhaps a separate function you can call up for a data frame-like object would be adequate:
library(tibble)
library(ggplot2)
library(magrittr)
library(labelVector)
print_with_label <- function(dframe){
stopifnot(inherits(dframe, "data.frame"))
labs <- labelVector::get_label(dframe, names(dframe))
labs <- sprintf("%s: %s", names(dframe), labs)
print(dframe)
cat("\n")
cat(labs, sep = "\n")
}
iris <-
as_tibble(iris) %>%
set_label(Sepal.Length = "This is a user friendly label",
Petal.Length = "I much prefer reading human over computer")
print_with_label(iris)
mtcars <-
set_label(mtcars,
mpg = "Miles per Gallon",
qsec = "Quarter mile time",
hp = "Horsepower",
cyl = "Cylinders",
disp = "Engine displacement")
print_with_label(mtcars)
Hi I'm currently trying to extract some of the inner node information stored in the constant partying object in R using ctree in partykit but I'm finding navigating the objects a bit difficult, I'm able to display the information on a plot but I'm not sure how to extract the information - I think it requires nodeapply or another function in the partykit?
library(partykit)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct, inner_panel = node_barplot(irisct))
Plot with inner node details
All the information is accessible by the functions to plot, but I'm after a text output similar to:
Example output
The main trick (as previously pointed out by #G5W) is to take the [id] subset of the party object and then extract the data (by either $data or using the data_party() function) which contains the response. I would recommend to build a table with absolute frequencies first and then compute the relative and marginal frequencies from that. Using the irisct object the plain table can be obtained by
tab <- sapply(1:length(irisct), function(id) {
y <- data_party(irisct[id])
y <- y[["(response)"]]
table(y)
})
tab
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## setosa 50 50 0 0 0 0 0
## versicolor 50 0 50 49 45 4 1
## virginica 50 0 50 5 1 4 45
Then we can add a little bit of formatting to a nice table object:
colnames(tab) <- 1:length(irisct)
tab <- as.table(tab)
names(dimnames(tab)) <- c("Species", "Node")
And then use prop.table() and margin.table() to compute the frequencies we are interested in. The as.data.frame() method transform from the table layout to a "long" data.frame:
as.data.frame(prop.table(tab, 1))
## Species Node Freq
## 1 setosa 1 0.500000000
## 2 versicolor 1 0.251256281
## 3 virginica 1 0.322580645
## 4 setosa 2 0.500000000
## 5 versicolor 2 0.000000000
## 6 virginica 2 0.000000000
## 7 setosa 3 0.000000000
## 8 versicolor 3 0.251256281
## 9 virginica 3 0.322580645
## 10 setosa 4 0.000000000
## 11 versicolor 4 0.246231156
## 12 virginica 4 0.032258065
## 13 setosa 5 0.000000000
## 14 versicolor 5 0.226130653
## 15 virginica 5 0.006451613
## 16 setosa 6 0.000000000
## 17 versicolor 6 0.020100503
## 18 virginica 6 0.025806452
## 19 setosa 7 0.000000000
## 20 versicolor 7 0.005025126
## 21 virginica 7 0.290322581
as.data.frame(margin.table(tab, 2))
## Node Freq
## 1 1 150
## 2 2 50
## 3 3 100
## 4 4 54
## 5 5 46
## 6 6 8
## 7 7 46
And the split information can be obtained with the (still unexported) .list.rules.party() function. You just need to ask for all node IDs (the default is to use just the terminal node IDs):
partykit:::.list.rules.party(irisct, i = nodeids(irisct))
## 1
## ""
## 2
## "Petal.Length <= 1.9"
## 3
## "Petal.Length > 1.9"
## 4
## "Petal.Length > 1.9 & Petal.Width <= 1.7"
## 5
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length <= 4.8"
## 6
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length > 4.8"
## 7
## "Petal.Length > 1.9 & Petal.Width > 1.7"
Most of the information that you want is accessible without much work.
I will show how to get the information, but leave you to format the
information into a pretty table.
Notice that your tree structure irisct is just a list of each of the nodes.
length(irisct)
[1] 7
Each node has a field data that contains the points that have made it down
this far in the tree, so you can get the number of observations at the node
by counting the rows.
dim(irisct[4]$data)
[1] 54 5
nrow(irisct[4]$data)
[1] 54
Or doing them all at once to get your table 2
NObs = sapply(1:7, function(n) { nrow(irisct[n]$data) })
NObs
[1] 150 50 100 54 46 8 46
The first column of the data at a node is the class (Species),
so you can get the count of each class and the probability of each class
at a node
table(irisct[4]$data[1])
setosa versicolor virginica
0 49 5
table(irisct[4]$data[1]) / NObs[4]
setosa versicolor virginica
0.00000000 0.90740741 0.09259259
The split information in your table 3 is a bit more awkward. Still,
you can get a text version of what you need just by printing out the
top level node
irisct[1]
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Length <= 1.9: setosa (n = 50, err = 0.0%)
| [3] Petal.Length > 1.9
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.8: versicolor (n = 46, err = 2.2%)
| | | [6] Petal.Length > 4.8: versicolor (n = 8, err = 50.0%)
| | [7] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 3
Number of terminal nodes: 4
To save the output for parsing and display
TreeSplits = capture.output(print(irisct[1]))