How to add metadata to a tibble

How does one add metadata to a tibble?
I would like a sentence describing each of my variable names such that I could print out the tibble with the associated metadata and if I handed it to someone who hadn't seen the data before, they could make some sense of it.
as_tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
# Sepal.length. Measured from sepal attachment to stem
# Sepal.width. Measured at the widest point
# Petal.length. Measured from petal attachment to stem
# Petal.width. Measured at widest point
# Species. Nomenclature based on Integrated Taxonomic Information System (ITIS), January 2018.
thanks!

This seems tricky. In principle @hrbrmstr's comment is the way to go (i.e. use ?comment or ?attr to add attributes to any object), but these attributes will not be printed out by default. Attributes seem to be printed automatically for atomic objects:
> z <- 1:6
> attr(z,"hello") <- "goodbye"
> z
[1] 1 2 3 4 5 6
attr(,"hello")
[1] "goodbye"
... but not, alas, for data frames or tibbles:
dd <- tibble::tibble(x=1:4,y=2:5)
> attr(dd,"metadata") <- c("some stuff","some more stuff")
> dd
# A tibble: 4 x 2
x y
<int> <int>
1 1 2
2 2 3
3 3 4
4 4 5
You can wrap the object with its own S3 class to get this stuff printed:
> class(dd) <- c("my_tbl",class(dd))
> print.my_tbl <- function(x, ...) {
+ NextMethod()
+ print(attr(x,"metadata"))
+ invisible(x)
+ }
> dd
# A tibble: 4 x 2
x y
<int> <int>
1 1 2
2 2 3
3 3 4
4 4 5
[1] "some stuff" "some more stuff"
You could make the printing more elaborate or pretty, e.g.
cat("\nMETADATA:\n")
cat(sprintf("# %s",attr(x,"metadata")),sep="\n")
Nothing bad will happen if the other user hasn't defined print.my_tbl (the print method will fall back to the print method for tibbles), but the metadata will only be printed if they have your print.my_tbl definition ...
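To make the idea above concrete, here is a minimal base-R sketch that combines the attribute, the extra S3 class, and the prettier cat()-based printing in one place. The helper name add_metadata() and the example notes are assumptions made up for this illustration; a plain data frame is used so the sketch needs no packages, and the same pattern should carry over to a tibble.

```r
# Hypothetical helper (not from any package): store notes in an attribute
# and prepend the "my_tbl" class so that print() dispatches to print.my_tbl.
add_metadata <- function(df, notes) {
  attr(df, "metadata") <- notes
  class(df) <- c("my_tbl", setdiff(class(df), "my_tbl"))
  df
}

# Print the data with the next method in the class chain, then the notes.
print.my_tbl <- function(x, ...) {
  NextMethod()
  cat("\nMETADATA:\n")
  cat(sprintf("# %s", attr(x, "metadata")), sep = "\n")
  invisible(x)
}

dd <- add_metadata(data.frame(x = 1:4, y = 2:5),
                   c("x. some stuff", "y. some more stuff"))
dd                     # prints the data followed by the METADATA block
attr(dd, "metadata")   # the notes remain available programmatically
```

Keeping the notes in an attribute means they can still be read with attr() even in a session where print.my_tbl has not been defined.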

Sorry for the delayed response. But this topic has been bugging me since I first started learning R. In my work, assigning metadata to columns is not just common. It is required. That R didn't seem to have a nice way to do it was really bothering me. So much so, that I wrote some packages to do it.
The fmtr package has a function to assign the descriptions (plus other stuff). And the libr package has a dictionary function, so you can look at all the metadata you assign.
Here is how it works:
First, assign the descriptions to the columns. You just send a named list to the descriptions() function.
library(fmtr)
library(libr)
# Create data frame
df <- iris
# Assign descriptions
descriptions(df) <- list(Sepal.Length = "Measured from sepal attachment to stem",
Sepal.Width = "Measured at the widest point",
Petal.Length = "Measured from petal attachment to stem",
Petal.Width = "Measured at the widest point",
Species = paste("Nomenclature based on Integrated Taxonomic",
"Information System (ITIS), January 2018."))
Then you can see all the metadata by calling the dictionary() function, like so:
dictionary(df)
# # A tibble: 5 x 10
# Name Column Class Label Description
# <chr> <chr> <chr> <chr> <chr>
# 1 df Sepal.Leng~ numer~ NA Measured from sepal attachment to stem
# 2 df Sepal.Width numer~ NA Measured at the widest point
# 3 df Petal.Leng~ numer~ NA Measured from petal attachment to stem
# 4 df Petal.Width numer~ NA Measured at the widest point
# 5 df Species factor NA Nomenclature based on Integrated Taxonomic Information Syst~
If you like, you can return the dictionary as its own data frame, then save it or print it or whatever.
d <- dictionary(df)

This is not all that different from Ben Bolker's suggestions, but conceptually, if I want information to be related to the vectors in my data frame, I would prefer that it be tied directly to the vectors. In other words, I'd prefer to add the attributes to the vectors themselves rather than to the data frame object.
I don't know that I would go so far as to add a custom class to the object, but perhaps a separate function you can call up for a data frame-like object would be adequate:
library(tibble)
library(ggplot2)
library(magrittr)
library(labelVector)
print_with_label <- function(dframe){
  stopifnot(inherits(dframe, "data.frame"))
  labs <- labelVector::get_label(dframe, names(dframe))
  labs <- sprintf("%s: %s", names(dframe), labs)
  print(dframe)
  cat("\n")
  cat(labs, sep = "\n")
}
iris <-
  as_tibble(iris) %>%
  set_label(Sepal.Length = "This is a user friendly label",
            Petal.Length = "I much prefer reading human over computer")
print_with_label(iris)
mtcars <-
  set_label(mtcars,
            mpg = "Miles per Gallon",
            qsec = "Quarter mile time",
            hp = "Horsepower",
            cyl = "Cylinders",
            disp = "Engine displacement")
print_with_label(mtcars)
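For readers who want to see the "attributes on the vectors themselves" idea without any package, here is a rough base-R sketch. The attribute name "label" and the helper get_labels() are choices made for this illustration only, not the labelVector API.

```r
# Attach a description directly to individual columns (base R only).
df <- head(iris)
attr(df$Sepal.Length, "label") <- "Measured from sepal attachment to stem"
attr(df$Species, "label") <- "Nomenclature based on ITIS"

# Hypothetical helper: collect the "label" attribute of every column,
# returning NA for columns that have none.
get_labels <- function(dframe) {
  vapply(dframe, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  }, character(1))
}

labs <- get_labels(df)
cat(sprintf("%s: %s", names(labs), labs), sep = "\n")
```

Because each label lives on its column, it travels with that column when the column is extracted with $ or [[.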

Related

Unable to add columns to data table in R

I am trying to add columns with the value "" to the data table, but I am getting the following error. Can anyone help me here?
Since it has been converted to a data.table, I am unable to convert.
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[c("new", "New1")] <- ""
Error in `[.data.table`(x, i, which = TRUE) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

data.table uses a different syntax; please look into the documentation. For this case you can assign the new columns like this:
library(data.table)
iris_sam <- iris
iris_sam <- as.data.table(iris_sam)
iris_sam[j = c("new", "New1") := ""]
head(iris_sam, 5)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new New1
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
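As a sketch of two other common spellings of the same assignment: the first keeps the column names in a variable (the parentheses tell data.table to use the variable's contents as names), and the second uses set(). The variable new_cols and the extra column New2 are made up for this example.

```r
library(data.table)

iris_sam <- as.data.table(iris)

# Column names held in a character vector; wrap it in parentheses on the
# left-hand side of := so it is evaluated rather than taken as a name.
new_cols <- c("new", "New1")
iris_sam[, (new_cols) := ""]

# set() adds or overwrites a column by reference without going through [.
set(iris_sam, j = "New2", value = "")

head(iris_sam, 3)
```

Both forms modify iris_sam in place, so no reassignment with <- is needed.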

Hierarchical Dendrogram using both continuous and categorical data

So I have a data set of 90 or so specimens, each with the clade it belongs to and 5 data points calculated from the length of the line (which is not important here). My aim is to create a dendrogram that calculates the dissimilarity and displays it as a graph. However, I want the code to separate the specimens by their respective clades into their own clusters and then look at how similar/dissimilar they are, and I'm not sure how to approach this. I'm willing to use any of the available R tools, so I would appreciate some help. A short example of the data set is given below in the form of a table.
Name   Type  length_1  length_2  length_3  length_4  length_5
spec1  S     10        -15       -5        5         10
spec2  O     20        6         6         -5        -10
spec3  O     22        7         10        -3        -7
spec4  S     6         6         -10       -5        3
spec5  T     54        -20       -20       9         9
spec5  T     25        -20       -10       5         9
This table has been made up on the spot but represents what my data generally look like. The type refers to the clade, and the lengths have been calculated by splitting segmented graphs with breaks into 5 sections and finding the average hypotenuse of each line segment. I'd like to determine how similar these lines are for each species within their respective clades (grouping S with other S's and looking at the dissimilarity). Also, is there a way to use these dendrograms to convert the data and draw up a phylogeny? (This is just an additional question and does not need to be answered if unknown.) Thank you for reading through this; my R coding is still progressing, which is why these questions may seem elementary to some.

Your question seems to be about hierarchical clustering of groups defined by a categorical variable, not hierarchical clustering of both continuous and categorical data. Hierarchical clustering involves a series of decisions about how to scale the data, how to compute distances, and how to create clusters based on those distances, so this example will use the default options. Using the iris data set that comes with R:
data(iris)
iris.sub <- iris[c(1:15, 51:65, 101:115), ]
str(iris.sub)
# 'data.frame': 45 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The data consist of 4 measurements on three different species of iris. We'll use 15 of each species for the example. Now we will scale the data by converting to Z-scores so that larger measurements are not more influential than smaller ones and compute the species means:
iris.z <- scale(iris.sub[ , 1:4])
iris.agg <- aggregate(iris.z, list(iris.sub$Species), mean)
iris.agg
# Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 setosa -0.9871973 0.7967458 -1.2887187 -1.2543022
# 2 versicolor 0.2171329 -0.5487521 0.2754513 0.1735865
# 3 virginica 0.7700644 -0.2479937 1.0132674 1.0807157
Next the distances between the means:
iris.gdst <- dist(iris.agg[, -1])
iris.gdst
# 1 2
# 2 2.783212
# 3 3.864052 1.327948
Finally the cluster analysis and the dendrogram:
iris.hcl <- hclust(iris.gdst, members=table(iris.sub$Species))
plot(iris.hcl, labels=iris.agg$Group.1)
For details on the various functions, read the manual pages ?scale, ?aggregate, ?dist, and ?hclust.
You can compute the dendrogram for all of the numeric variables and label the result by species as follows:
iris.dst <- dist(iris.z)
iris.all.hcl <- hclust(iris.dst)
plot(iris.all.hcl, labels=iris.sub$Species)
rect.hclust(iris.all.hcl, 3)
Notice that setosa is well separated but not versicolor and virginica.
If we code setosa, versicolor, and virginica as -2, 0, 2 and add that code as an extra column before computing the distance matrix:
iris.sp.dst <- dist(cbind(iris.z, 2*(as.numeric(iris.sub$Species) - 2)))
iris.sp.hcl <- hclust(iris.sp.dst)
plot(iris.sp.hcl, labels=iris.sub$Species)
rect.hclust(iris.sp.hcl, 3)
Notice that virginica and versicolor are still intermixed.
If we arbitrarily increase the range of species to -3, 0, +3 then the species will be separated, but this is just because we increased the magnitude of the species variable.
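One way to check statements like "still intermixed" without reading the dendrogram by eye is to cut the tree into three groups and cross-tabulate the groups against species. This is a small sketch using the same subset and default options as above; the names groups and tab are introduced here for illustration.

```r
# Rebuild the subset and the all-specimen clustering from the answer.
data(iris)
iris.sub <- iris[c(1:15, 51:65, 101:115), ]
iris.z <- scale(iris.sub[, 1:4])
iris.all.hcl <- hclust(dist(iris.z))

# Cut the tree into 3 clusters and compare cluster membership to species.
groups <- cutree(iris.all.hcl, k = 3)
tab <- table(cluster = groups, species = iris.sub$Species)
tab
```

A species that is cleanly separated occupies a single row of the table by itself; species that share rows are the ones the clustering mixes together. The same check can be repeated on iris.sp.hcl to see the effect of the species coding.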

Why does adding by 1 actually add by 2 in Sparklyr when using dynamic variable names?

When I run the following code, I expect the value of the Sepal_Width_2 column to be Sepal_Width + 1, but it is in fact Sepal_Width + 2. What gives?
require(dplyr)
require(sparklyr)
Sys.setenv(SPARK_HOME='/usr/lib/spark')
sc <- spark_connect(master="yarn")
# for this example these variables are hard coded
# but in my actual code these are named dynamically
sw_name <- as.name('Sepal_Width')
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)
ir <- copy_to(sc, iris)
print(head(ir %>% mutate(!!sw2 := sw_name))) # so far so good
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 5.1 3.5 1.4 0.2 setosa 3.5
# 4.9 3 1.4 0.2 setosa 3
# 4.7 3.2 1.3 0.2 setosa 3.2
# 4.6 3.1 1.5 0.2 setosa 3.1
# 5 3.6 1.4 0.2 setosa 3.6
# 5.4 3.9 1.7 0.4 setosa 3.9
print(head(ir %>% mutate(!!sw2 := sw_name) %>% mutate(!!sw2 := sw2_name + 1))) # i guess 2+2 != 4?
# Source: spark<?> [?? x 6]
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Width_2
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 5.1 3.5 1.4 0.2 setosa 5.5
# 4.9 3 1.4 0.2 setosa 5
# 4.7 3.2 1.3 0.2 setosa 5.2
# 4.6 3.1 1.5 0.2 setosa 5.1
# 5 3.6 1.4 0.2 setosa 5.6
# 5.4 3.9 1.7 0.4 setosa 5.9
My use case requires that I use the dynamic variable naming you see above. In this example it is rather silly (compared to just using the variables directly), but in my use case I'm running the same function across hundreds of different spark tables. They all have the same "schema" in terms of the number of columns and what each column is (outputs from some machine learning models), but the names differ because each table contains the output for a different model. The names are predictable, but since they vary, I construct them dynamically as you see here instead of hardcoding them.
It appears that Spark knows how to add 2 and 2 together when the names are hardcoded, but when the names are dynamic it suddenly freaks out.

You might be misusing as.name which is leading sparklyr to misinterpret your input.
Note that your code errors when just working on a local table:
sw_name <- as.name('Sepal.Width') # swap "_" to "." to match variable names
sw2 <- "Sepal_Width_2"
sw2_name <- as.name(sw2)
data(iris)
print(head(iris %>% mutate(!!sw2 := sw_name)))
# Error: Problem with `mutate()` input `Sepal_Width_2`.
# x object 'Sepal.Width' not found
# i Input `Sepal_Width_2` is `sw_name`.
Note that you are using the !! operator from rlang alongside as.name from base R, but you are not using them together as demonstrated in this question.
I recommend you use sym and !! from the rlang package instead of as.name, and that you apply both to character strings that are column names. The following works locally, and is consistent with the non-standard evaluation guidance. So it should translate to spark:
library(dplyr)
data(iris)
sw <- 'Sepal.Width'
sw2 <- paste0(sw, "_2")
head(iris %>% mutate(!!sym(sw2) := !!sym(sw)))
head(iris %>% mutate(!!sym(sw2) := !!sym(sw)) %>% mutate(!!sym(sw2) := !!sym(sw2) + 1))
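An alternative sketch for dynamic column names uses the .data pronoun, which avoids building symbols altogether. The wrapper add_one() is a made-up name for this illustration, and the code has only been reasoned through for a local data frame; whether it translates identically on a Spark table is an assumption that would need checking.

```r
library(dplyr)

# Hypothetical wrapper: copy column `src` into "<src>_2", then add 1 to the copy.
add_one <- function(tbl, src) {
  dst <- paste0(src, "_2")
  tbl %>%
    mutate(!!dst := .data[[src]]) %>%      # new column named by the string in dst
    mutate(!!dst := .data[[dst]] + 1)      # refer to it again by string
}

res <- add_one(head(iris), "Sepal.Width")
res[, c("Sepal.Width", "Sepal.Width_2")]
```

Here the column names stay ordinary character strings throughout, which tends to be easier to generate when the same function runs across many tables with predictable but differing names.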

I'm not sure which package was the culprit (sparklyr, dplyr, R, who knows), but this was fixed when I upgraded from R 3.6.3/sparklyr 1.5 to R 4.0.2/sparklyr 1.7.0.

Change the column names of the data frame only for display purpose

Is there a way to change the column headers of a data frame for display purposes only? For example, the iris data has the default columns below:
ini_col <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width" ,"Species")
Suppose I need to change them to the new columns below, but only for display purposes, meaning the reference to the old columns should not go away. For example, I should still be able to execute iris[["Species"]] and not iris[["Category"]].
new_col <- c("Sep Len","Sep Wid","Pet Len","Pet Wid","Category")
head(iris)
Sep Len Sep Wid Pet Len Pet Wid Category
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
We can actually make use of the code below (Code 1):
mtcars2 <-
set_label(mtcars,
am = "Automatic",
mpg = "Miles per gallon",
cyl = "Cylinders",
qsec = "Quarter mile time")
But I have a data frame that holds the old and new column names. Can we use this data frame in Code 1 above?
df
Old COl  new COl
am       Automatic
mpg      Miles per gallon
cyl      Cylinders
qsec     Quarter mile time

As linked in a similar question, kable from the knitr package makes it easy to change the names for display without altering the data frame:
knitr::kable(head(iris),
col.names = c("Sepal Length", "Sepal Width",
"Petal Length", "Petal Width", "Species"))
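If the old and new names are held in a lookup data frame, as in the question, a base-R sketch along these lines renames only a copy that is used for printing. The helper display_names() and the column names old and new are assumptions for this example.

```r
# Lookup table of original names and their display names.
lookup <- data.frame(
  old = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
  new = c("Sep Len", "Sep Wid", "Pet Len", "Pet Wid", "Category"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return a copy whose names are swapped via the lookup;
# columns missing from the lookup keep their original name.
display_names <- function(dframe, lookup) {
  idx <- match(names(dframe), lookup$old)
  names(dframe) <- ifelse(is.na(idx), names(dframe), lookup$new[idx])
  dframe
}

print(display_names(head(iris), lookup))  # shown with the display names
iris[["Species"]][1:3]                    # the original names still work
```

Since R copies on modification, iris itself is untouched; only the object returned by display_names() carries the new headers. The same returned copy could also be passed to knitr::kable().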

R Mutate (Dataframe vs Tibble)

Working in R 3.6.1 (64-bit). I used readxl to get a data frame into R (named "RawShift"). I made 6 variables (class: "character") that are lists of user names. Each list is named for the team that user is on.
I want to use mutate to create a column that has the team the user is from.
INTeam = c("user1", "user2",...)
OFTeam = c("user3", "user4",...)
When I had been working with Data frame this code worked:
RawShift <- RawShift %>% mutate(Team =case_when(
`username` %in% OFTeam ~ "Office",
`username` %in% INTeam ~ "Industrial"
))
Now that I've called as_tibble on RawShift, it won't error, nor will it work. Is this a case of not understanding the tibble access methods ("", [], ., [[]])? Is it worth worrying about, or should I just do a hack job: work with the data frame and convert to a tibble later? I've looked into the benefits of tibbles over data frames and it seems I'd be better off using tibbles, but I can't seem to get this working. I have tried using "$", "." etc. before the %in% without luck so far. Thanks for any advice/help.

We may need to load the tibble package
library(dplyr)
library(tibble)
head(iris) %>%
as_tibble %>%
mutate(new = case_when(Species == "setosa" ~ "hello"))
# A tibble: 6 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#1 5.1 3.5 1.4 0.2 setosa hello
#2 4.9 3 1.4 0.2 setosa hello
#3 4.7 3.2 1.3 0.2 setosa hello
#4 4.6 3.1 1.5 0.2 setosa hello
#5 5 3.6 1.4 0.2 setosa hello
#6 5.4 3.9 1.7 0.4 setosa hello
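Applied to the shape of the original question, a minimal sketch looks like the following. The user names and the three-row RawShift tibble are made up here, since the real data was not posted.

```r
library(dplyr)
library(tibble)

# Made-up stand-ins for the team vectors and the imported data.
INTeam <- c("user1", "user2")
OFTeam <- c("user3", "user4")
RawShift <- tibble(username = c("user1", "user3", "user5"))

# The same mutate/case_when call works on a tibble; refer to the column by
# its bare name, with no "$" or "." prefix.
RawShift <- RawShift %>%
  mutate(Team = case_when(
    username %in% OFTeam ~ "Office",
    username %in% INTeam ~ "Industrial"
  ))
RawShift
```

Users that appear in neither vector get NA in Team, because case_when() returns NA when no condition matches.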