How to get labels from hclust result - r

Let's say I have a dataset like this:
dt <- data.frame(id = 1:4, X = sample(4), Y = sample(4))
and then I run a hierarchical clustering using the code below:
dis<-dist(dt[,-1])
clusters <- hclust(dis)
plot(clusters)
and it works well.
The point is that when I ask for
clusters$labels
it gives me NULL, whereas I expect to see the labels of the individuals in order, like
1, 4, 2, 3
It is important to have them in the order in which they appear in the plot.

Use clusters$order rather than clusters$labels if you have not assigned any labels.
In fact you can see all the components of the result by calling summary:
clusters <- hclust(dis)
plot(clusters)
summary(clusters)
clusters$order
You can compare with the result I got at my end; it is of course a little different from yours, because sample() is random.
My outcome:
> clusters$order
[1] 4 1 2 3
Content of summary command:
> summary(clusters)
Length Class Mode
merge 6 -none- numeric
height 3 -none- numeric
order 4 -none- numeric
labels 0 -none- NULL
method 1 -none- character
call 2 -none- call
dist.method 1 -none- character
You can see that labels is NULL, which is why you are not getting any labels. To get labels you need to assign them first, using clusters$labels <- c("A","B","C","D") or the row names of your data. Once labels are assigned, the plot shows the names/labels instead of the row numbers.
In my case I have not assigned any names, hence I get the numbers instead.
You can also pass the labels to the plot function itself.
From the documentation ?hclust
labels
A character vector of labels for the leaves of the tree. By
default the row names or row numbers of the original data are used. If
labels = FALSE no labels at all are plotted.
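As a minimal sketch of the plot-only route described above (the seed and the letter ids are illustrative assumptions, not part of the question), the labels can be passed to plot() without touching the hclust object, and the same vector indexed by clusters$order gives them in plotting order:

```r
set.seed(1)  # illustrative seed, only for reproducibility
dt <- data.frame(id = c("A", "B", "C", "D"), X = sample(4), Y = sample(4))
clusters <- hclust(dist(dt[, -1]))

# Labels are supplied for this plot only; clusters itself is not modified.
plot(clusters, labels = as.character(dt$id))

# The same labels, in the left-to-right order of the dendrogram leaves.
leaf_labels <- as.character(dt$id)[clusters$order]
leaf_labels
```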

You could use the following code:
# your data, I changed the id to characters to make it more clear
set.seed(1234) # for reproducibility
dt<-data.frame(id=c("A", "B", "C", "D"),X=sample(4),Y=sample(4))
dt
# your code, no labels
dis<-dist(dt[,-1])
clusters <- hclust(dis)
clusters$labels
# add labels, plot and check labels
clusters$labels <- dt$id
plot(clusters)
## labels in the order plotted
clusters$labels[clusters$order]
## [1] A D B C
## Levels: A B C D
Please let me know whether this is what you want.

Please make sure you use rownames(...) to ensure your data has labels
> rownames(dt) <- dt$id
> dt
id X Y
1 1 2 1
2 2 4 3
3 3 1 2
4 4 3 4
> dis<-dist(dt[,-1])
> clusters <- hclust(dis)
> str(clusters)
List of 7
$ merge : int [1:3, 1:2] -1 -2 1 -3 -4 2
$ height : num [1:3] 1.41 1.41 3.16
$ order : int [1:4] 1 3 2 4
$ labels : chr [1:4] "1" "2" "3" "4"
$ method : chr "complete"
$ call : language hclust(d = dis)
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
>
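A small sketch of the row-name route, assuming letter ids and fixed X/Y values chosen only for illustration: explicit row names survive dist() and end up in clusters$labels, so no manual assignment is needed afterwards.

```r
dt <- data.frame(id = c("A", "B", "C", "D"),
                 X = c(4, 1, 3, 2),
                 Y = c(1, 3, 4, 2))
rownames(dt) <- dt$id

# dist() keeps the row names, and hclust() picks them up as labels.
clusters <- hclust(dist(dt[, c("X", "Y")]))
clusters$labels

# Labels in the order in which the leaves are drawn.
labels_in_plot_order <- clusters$labels[clusters$order]
labels_in_plot_order
```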

Related

R- expand.grid given a data.frame of parameter names and sequence definitions

I have a data.frame that arbitrarily defines parameter names and sequence boundaries:
dfParameterValues <- data.frame(ParameterName = character(), seqFrom = integer(), seqTo = integer(), seqBy = integer())
row1 <- data.frame(ParameterName = "parameterA", seqFrom = 1, seqTo = 2, seqBy = 1)
row2 <- data.frame(ParameterName = "parameterB", seqFrom = 5, seqTo = 7, seqBy = 1)
row3 <- data.frame(ParameterName = "parameterC", seqFrom = 10, seqTo = 11, seqBy = 1)
dfParameterValues <- rbind(dfParameterValues, row1)
dfParameterValues <- rbind(dfParameterValues, row2)
dfParameterValues <- rbind(dfParameterValues, row3)
I would like to use this approach to create a grid of c parameter columns based on the number of unique ParameterNames that contain r rows of all possible combinations of the sequences given by seqFrom, seqTo, and seqBy. The result would therefore look somewhat like this or should have a content like the following:
ParameterA ParameterB ParameterC
1 5 10
1 5 11
1 6 10
1 6 11
1 7 10
1 7 11
2 5 10
2 5 11
2 6 10
2 6 11
2 7 10
2 7 11
Edit: Note that the parameter names and their numbers are not known in advance. The data.frame comes from elsewhere so I cannot use the standard static expand.grid approach and need something like a flexible function that creates the expanded grid based on any dataframe with the columns ParameterName, seqFrom, seqTo, seqBy.
I've been playing around with for loops (which is bad to begin with) and it hasn't led me to any elegant ideas. I can't seem to find a way to get the result with tidyr without constructing the sequences separately first, either. Do you have any elegant approaches?
Bonus kudos for extending this to include not only numerical sequences, but vectors/sets of characters / other factors, too.
Many thanks!
Going off CPak's answer, you could use
my_table <- expand.grid(apply(dfParameterValues, 1, function(x) seq(as.numeric(x['seqFrom']), as.numeric(x['seqTo']), as.numeric(x['seqBy']))))
names(my_table) <- c("ParameterA", "ParameterB", "ParameterC")
my_table <- my_table[order(my_table$ParameterA, my_table$ParameterB), ]
@smanski's answer is technically correct (and should arguably be accepted since it motivated this), but it is also a good example of why you need to be careful when using apply with data.frames. In this case, the frame contains at least one character column, so all columns are converted to character, which is why as.numeric is needed. The safer alternative is to pull only the columns needed, such as either of:
expand.grid(apply(dfParameterValues[,-1], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
I prefer the second, because it only pulls what it needs and therefore what it "knows" should be numeric. (I find explicit is often safer.)
The reason this is happening is that apply silently converts the data to a matrix, so to see the effects, try:
str(as.matrix(dfParameterValues))
# chr [1:3, 1:4] "parameterA" "parameterB" "parameterC" " 1" " 5" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:4] "ParameterName" "seqFrom" "seqTo" "seqBy"
str(as.matrix(dfParameterValues[c("seqFrom","seqTo","seqBy")]))
# num [1:3, 1:3] 1 5 10 2 7 11 1 1 1
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:3] "seqFrom" "seqTo" "seqBy"
(Note the chr on the first and the num on the second.)
Neither one preserves the parameter names. To do that, just sandwich the call with setNames:
setNames(
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) )),
dfParameterValues$ParameterName)
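One more caveat with the apply-based calls above: apply simplifies its result to a matrix when every row yields a sequence of the same length, so they only return a list (which expand.grid needs) when the lengths differ, as they happen to here. A possible sketch that avoids this uses Map, which always returns a list. The helper name make_grid is hypothetical, and the table is rebuilt in a single call only to keep the example self-contained:

```r
# Parameter table as in the question, built in one call.
dfParameterValues <- data.frame(
  ParameterName = c("parameterA", "parameterB", "parameterC"),
  seqFrom = c(1, 5, 10),
  seqTo   = c(2, 7, 11),
  seqBy   = c(1, 1, 1)
)

# make_grid() is a hypothetical helper name, not part of base R.
make_grid <- function(params) {
  # Map() always returns a list: one sequence per row of params.
  seqs <- Map(seq, params$seqFrom, params$seqTo, params$seqBy)
  names(seqs) <- as.character(params$ParameterName)
  expand.grid(seqs, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
}

grid <- make_grid(dfParameterValues)
dim(grid)
```

Note that expand.grid varies the first column fastest, so the rows come out in a different order than in the question unless you sort them afterwards.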

Setting variable attributes via subsetting a dataframe

I want to set an attribute ("full.name") of certain variables in a data frame by subsetting the dataframe and iterating over a character vector. I tried two solutions but neither works (varsToPrint is a character vector containing the variables, questionLabels is a character vector containing the labels of questions):
Sample data:
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3=seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")
Solution 1:
attrApply <- function(var, label) {
`<-`(attr(var, "full.name"), label)
}
mapply(attrApply, jtiPrint[varsToPrint], questionLabels)
Solution 2:
i <- 1
for (var in jtiPrint[varsToPrint]) {
attr(var, "full.name") <- questionLabels[i]
i <- i + 1
}
Desired output (for e.g. variable 1):
attr(jtiPrint$question1, "full.name")
[1] "question1Label"
The problem in solution 2 seems to be that R sets the attribute on a new data frame containing only one variable (the indexed variable). However, I don't understand why solution 1 does not work. Any ideas how to fix either of these two approaches?
Solution 1 :
The function is 'attr<-', not '<-'(attr(...)). You also need to set SIMPLIFY = FALSE (otherwise a matrix is returned instead of a list) and then call as.data.frame:
attrApply <- function(var, label) {
`attr<-`(var, "full.name", label)
}
df <- as.data.frame(mapply(attrApply,jtiPrint[varsToPrint],questionLabels,SIMPLIFY = FALSE))
> str(df)
'data.frame': 5 obs. of 2 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
Solution 2 :
You need to set the attribute on the column of the data.frame itself; your loop sets it on copies of the columns. Index by column name so that it also works when the variables are not the first columns:
for (i in seq_along(varsToPrint)) {
attr(jtiPrint[[varsToPrint[i]]], "full.name") <- questionLabels[i]
}
> str(jtiPrint)
'data.frame': 5 obs. of 3 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
$ question3: int 1 2 3 4 5
Anyway, note that the two approaches lead to a different result. In fact the mapply solution returns a subset of the previous data.frame (so no column 3) while the second approach modifies the existing jtiPrint data.frame.
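If you want the functional style of the first approach but the in-place result of the second, one possible sketch is to assign the labelled columns back into the same subset of the data frame. This is an illustration built on the sample data from the question, not something either answer proposes:

```r
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3 = seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")

# Map() returns a list of labelled columns, which is written back
# into the selected columns; question3 is left untouched.
jtiPrint[varsToPrint] <- Map(function(col, label) {
  attr(col, "full.name") <- label
  col
}, jtiPrint[varsToPrint], questionLabels)

attr(jtiPrint$question1, "full.name")
```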

simulate observations and calculate sample autocorrelation

Simulating rk (r stands for autocorrelation) for {et} where each et is iid N(0,1).
R code: simulate 100 observations of {et} and calculate r1.
Here is my code so far:
x=rnorm(100,0,1)
x
y=ts(x)
trial_r1=acf(y)[1]
trial_r1
Is my code right? How do I get r1 after running acf()?
(I'll post as an answer, both to close the question, plus to help with searching for answers to similarly-structured questions.)
When looking for what you believe is but one part of a structured return, it's useful to look at the return value in detail. One common way to do this is with str:
set.seed(42)
x <- rnorm(100, mean = 0, sd = 1)
ret <- acf(ts(x))
str(ret)
## List of 6
## $ acf : num [1:21, 1, 1] 1 0.05592 -0.00452 0.03542 0.00278 ...
## $ type : chr "correlation"
## $ n.used: int 100
## $ lag : num [1:21, 1, 1] 0 1 2 3 4 5 6 7 8 9 ...
## $ series: chr "ts(x)"
## $ snames: NULL
## - attr(*, "class")= chr "acf"
In this instance, you'll see two clusters of numbers in $acf and $lag. The latter "clearly" is just an array of incrementing integers so is not that interesting in this endeavor, but the former looks more interesting. Since the result is ultimately just a list, you can use dollar-sign subsetting (or [[, over to you) to extract what you need:
ret$acf
## , , 1
## [,1]
## [1,] 1.000000e+00
## [2,] 5.592310e-02
## [3,] -4.524017e-03
## [4,] 3.541639e-02
## [5,] 2.784590e-03
## ...snip...
In the case of your question, you should notice that the first element of this 3-dimensional array is the perfectly-autocorrelated 1, but your first real autocorrelation of concern is the second element, or 0.0559. So your first value is attainable with ret$acf[2,,] (or more formally ret$acf[2,1,1]).
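As a quick sanity check, here is a sketch that extracts r1 without drawing the plot and compares it with the lag-1 sample autocorrelation computed by hand, r1 = sum((x[t] - m) * (x[t+1] - m)) / sum((x[t] - m)^2), where m is the sample mean. The variable names are illustrative:

```r
set.seed(42)
x <- rnorm(100, mean = 0, sd = 1)

# plot = FALSE returns the acf object without plotting; row 2 is lag 1.
r1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2, 1, 1]

# Lag-1 sample autocorrelation computed directly from the definition.
xc <- x - mean(x)
r1_manual <- sum(xc[-1] * xc[-length(xc)]) / sum(xc^2)

all.equal(r1, r1_manual)
```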

Data frame and the very common mistake while using character columns

A very unexpected behavior of the useful data.frame in R arises from keeping character columns as factor. This causes many problems if it is not considered. For example suppose the following code:
foo=data.frame(name=c("c","a"),value=1:2)
# name value
# 1 c 1
# 2 a 2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
Then what do you expect of running bar[foo$name,]? It normally should return the rows of bar that are named according to the foo$name that means rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
The reason is this: foo$name is not a character vector but a factor, and indexing with a factor uses its underlying integer codes.
foo$name
# [1] c a
# Levels: a c
To have the expected behavior, I manually convert it to character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that it is easy to forget this conversion and end up with hidden bugs in our code. Is there any better solution?
This is a feature and R is working as documented. This can be dealt with generally in a few ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour so, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by @JoshuaUlrich in comments) a third option is to wrap character variables in I(....). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem as the object inherits (in this case) the class "character" so should work as before.
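You can check what the default for stringsAsFactors is on the currently running R process via the call below (the output shown is from an R version before 4.0.0, where the default was TRUE; since R 4.0.0 it is FALSE):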
> default.stringsAsFactors()
[1] TRUE
The issue is slightly wider than data.frame() in scope as this also affects read.table(). In that function, as well as the two options above, you can also tell R what all the classes of the variables are via argument colClasses and R will respect that, e.g.
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"
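To make the difference concrete, here is a small sketch that builds the question's data both ways. stringsAsFactors is passed explicitly so the result does not depend on the R version's default, and the names foo_f and foo_c are illustrative:

```r
bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

foo_f <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = TRUE)
foo_c <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = FALSE)

# Factor index: the integer codes (2, 1) are used, so rows "b" and "a" come back.
by_factor <- rownames(bar[foo_f$name, ])

# Character index: rows are matched by name, giving "c" and "a".
by_character <- rownames(bar[foo_c$name, ])

# Converting the factor at the point of use also gives "c" and "a".
by_converted <- rownames(bar[as.character(foo_f$name), ])
```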
In the example data below, author and title are automatically converted to factor (unless you add the argument stringsAsFactors = FALSE when you are creating the data). What if we forgot to change the default setting and don't want to set the options globally?
Some code I found somewhere (most likely SO) uses sapply() to identify factors and convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
author = c("author1", "author2", "author3"),
customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)],
as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading in the dataset with the stringsAsFactors = FALSE argument, but have never tested.

R: numeric vector becoming non-numeric after cbind of dates

I have a numeric vector (future_prices) in my case. I use a date vector from another object (here: pred_commodity_prices$futuredays) to create numbers for the months. After that I use cbind to bind the months to the numeric vector. However, what happened is that the numeric vector became non-numeric. Do you know what the reason for this is? When I use as.numeric(future_prices) I get strange values. What could be an alternative? Thanks
head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a
1 68.33907 62.37888
2 68.08553 62.32658
is.numeric(future_prices)
[1] TRUE
> month = format(as.POSIXlt.date(pred_commodity_prices$futuredays), "%m")
> future_prices <- cbind (future_prices, month)
> head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a month
1 "68.3390747063745" "62.3788824938719" "01"
is.numeric(future_prices)
[1] FALSE
The reason is that cbind returns a matrix, and a matrix can only hold one data type. You could use a data.frame instead:
n <- 1:10
b <- LETTERS[1:10]
m <- cbind(n,b)
str(m)
chr [1:10, 1:2] "1" "2" "3" "4" "5" "6" "7" "8" "9" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "n" "b"
d <- data.frame(n,b)
str(d)
'data.frame': 10 obs. of 2 variables:
$ n: int 1 2 3 4 5 6 7 8 9 10
$ b: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
See ?format. The format function returns:
An object of similar structure to ‘x’ containing character
representations of the elements of the first argument ‘x’ in a
common format, and in the current locale's encoding.
from ?cbind, cbind returns
... a matrix combining the ‘...’ arguments
column-wise or row-wise. (Exception: if there are no inputs or
all the inputs are ‘NULL’, the value is ‘NULL’.)
and all elements of a matrix must be of the same class, so everything is coerced to character.
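Putting both points together, a minimal sketch of one way to keep the prices numeric: convert the formatted month to an integer and combine the columns in a data.frame rather than a matrix. The dates and the names as_matrix and as_df are hypothetical; the prices are the two rows shown in the question.

```r
future_prices <- matrix(c(68.33907, 68.08553, 62.37888, 62.32658), ncol = 2,
                        dimnames = list(NULL, c("pred_peak_month_3a",
                                                "pred_peak_quarter_3a")))
futuredays <- as.Date(c("2015-01-01", "2015-02-01"))  # hypothetical dates

month <- format(futuredays, "%m")  # character: "01" "02"

# cbind() on a numeric matrix and a character vector gives a character matrix.
as_matrix <- cbind(future_prices, month)

# A data frame keeps one type per column; the month is stored as an integer.
as_df <- data.frame(future_prices, month = as.integer(month))
str(as_df)
```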
F.Y.I.
When a column is a factor, calling as.numeric on it directly returns the internal integer codes rather than the original values. The proper way (here df stands for your data frame) is:
df[, 2] <- as.numeric(as.character(df[, 2]))
Find more details: Converting values to numeric, stack overflow
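A short illustration of why the detour through as.character matters, using a made-up factor f:

```r
f <- factor(c("10", "2.5", "10"))

# Direct conversion returns the level codes ("10" is level 1, "2.5" is level 2).
codes <- as.numeric(f)

# Converting to character first recovers the actual numbers.
values <- as.numeric(as.character(f))

# An equivalent form that converts each level only once.
values_via_levels <- as.numeric(levels(f))[f]
```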
