dendrogram from pre-made linkage matrix - r

The problem:
In R, I need to plot a dendrogram and cut the associated tree from a linkage matrix created in a different language. Because of the nature of the dataset, the prior processing is only available in that other language, so I need to be able to work in R from an already-determined linkage matrix.
I have a linkage matrix and a correlation matrix created in the other language. I saved both as CSV files and can read either into R as a data frame.
My approach:
I wanted to convert the linkage matrix to an hclust object in R, so that I could pass it to as.dendrogram and then subsequently use cutree.
When I run as.hclust(df), I get the error:
Error in as.hclust.default(df) : argument 'x' cannot be coerced to class “hclust” Consider providing an as.hclust.data.frame() method
as.hclust only takes a dist, diana, or agnes object.
I've been unable to convert the data frame to any of these objects to proceed with my downstream analysis.
An alternative would be to work with the correlation matrix, but I'm not seeing a way to recover the original distances from which to build a meaningful dendrogram.
I could use scipy.cluster.hierarchy.cut_tree in Python, but there are documented issues with that function that remain unresolved, so I wanted to use R.
Many thanks.

I'm not sure what you would call the "linkage matrix" or whether there's a "standard" format for them across packages, but in these cases it helps to use str:
x <- matrix(rnorm(30), ncol = 3)
hc <- hclust(dist(x), method = "complete")
str(hc)
List of 7
$ merge : int [1:9, 1:2] -5 -6 -8 -4 -2 -3 -1 6 5 -7 ...
$ height : num [1:9] 0.714 0.976 1.381 1.468 2.065 ...
$ order : int [1:10] 2 6 10 3 8 5 7 1 4 9
$ labels : NULL
$ method : chr "complete"
$ call : language hclust(d = dist(x), method = "complete")
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
So, from this, one can deduce that it's a simple S3 structure, and it should be possible to create an imitation with your already-determined step-by-step data like this:
my_hc <- list(
  merge = <your data>,
  height = <your data>,
  order = <your data>,
  labels = NULL,
  method = "complete",
  call = "some_optional_string",
  dist.method = "your_custom_distance"
)
class(my_hc) <- "hclust"
Otherwise, you could let R re-do the clustering from a distance matrix if that's available or computationally feasible.
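For instance, if the linkage matrix came from SciPy (which the mention of scipy.cluster.hierarchy.cut_tree suggests), the conversion could look like the sketch below. This is my own illustration, not part of the original answer: the function name is made up, and it assumes the CSV holds the usual SciPy linkage layout (one row per merge, columns cluster1, cluster2, height, size, 0-based ids, rows sorted by increasing height).
# A sketch, assuming a SciPy-style linkage matrix: ids 0..n-1 are original
# observations, ids >= n refer to earlier merge steps.
scipy_to_hclust <- function(linkage, labels = NULL) {
  n <- nrow(linkage) + 1
  merge <- matrix(0L, nrow = n - 1, ncol = 2)
  for (i in seq_len(n - 1)) {
    for (j in 1:2) {
      id <- linkage[i, j]
      # hclust encoding: negative = observation, positive = earlier merge row
      merge[i, j] <- if (id < n) -(as.integer(id) + 1L) else as.integer(id) - n + 1L
    }
  }
  # recover a plotting order by walking the merge tree down from the root
  traverse <- function(k) {
    if (k < 0) return(-k)
    c(traverse(merge[k, 1]), traverse(merge[k, 2]))
  }
  structure(list(merge = merge, height = linkage[, 3],
                 order = traverse(n - 1), labels = labels,
                 method = "unknown", call = match.call(),
                 dist.method = "unknown"),
            class = "hclust")
}
# usage (hypothetical file name):
# hc <- scipy_to_hclust(as.matrix(read.csv("linkage.csv", header = FALSE)))
# plot(as.dendrogram(hc)); cutree(hc, k = 3)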

Related

R - Multi-level list indexing

What is the convention for assigning an object to a multi-level list?
So far I thought the convention for indexing is to use [[]] instead of $.
Hence, when saving results in loops I usually used the following approach:
> result <- matrix(2,2,2)
> result_list <- list()
> result_list[["A"]][["B"]][["C"]] <- result
> print(result_list)
$A
$A$B
$A$B$C
[,1] [,2]
[1,] 2 2
[2,] 2 2
Which works as intended with this matrix.
But when assigning a single number the list seems to skip the last level.
> result <- 2
> result_list <- list()
> result_list[["A"]][["B"]][["C"]] <- result
> print(result_list)
$A
B
2
At the same time, if I use $ instead of [[]] the list again is as intended.
> result_list$A$B$C <- result
> print(result_list)
$A
$A$B
$A$B$C
[1] 2
As mentioned here, you can also use list("A" = list("B" = list("C" = 2))).
Which of these methods should be used for indexing a multi-level list in R?
Although the title of the question refers to multi-level list indexing, and the syntax mylist[['a']][['b']][['c']] is the same one would use to retrieve an element of a multi-level list, the differences you're observing actually arise from using the same syntax for the creation (or not) of multi-level lists.
To show this, we can first explicitly create the multi-level (nested) lists, and then check that the indexing works as expected both for matrices and for single numbers.
mymatrix=matrix(1:4,nrow=2)
list_b=list(c=mymatrix)
list_a=list(b=list_b)
mynestedlist1=list(a=list_a)
str( mynestedlist1 )
# List of 1
# $ a:List of 1
# ..$ b:List of 1
# .. ..$ c: int [1:2, 1:2] 1 2 3 4
mynumber=2
list_e=list(f=mynumber)
list_d=list(e=list_e)
mynestedlist2=list(d=list_d)
str( mynestedlist2 )
# List of 1
# $ d:List of 1
# ..$ e:List of 1
# .. ..$ f: num 2
( Note that I've created the lists in sequential steps for clarity; they could all have been rolled together into a single line, like: mynestedlist2=list(d=list(e=list(f=mynumber))) )
Anyway, now we'll check that indexing works OK:
str(mynestedlist1[['a']][['b']][['c']])
# int [1:2, 1:2] 1 2 3 4
str(mynestedlist1$a$b$c)
# int [1:2, 1:2] 1 2 3 4
str(mynestedlist2[['d']][['e']][['f']])
# num 2
str(mynestedlist2$d$e$f)
# num 2
# and, just to check that we don't 'skip the last level':
str(mynestedlist2[['d']][['e']])
# List of 1
# $ f: num 2
So the direct answer to the question 'which of these methods should be used for indexing a multi-level list in R' is: 'any of them - they're all ok'.
So what's going on with the examples in the question, then?
Here, the same syntax is being used to try to implicitly create lists, and since the structure of the nested list is not specified explicitly, this relies on whether R can infer the structure that you want.
In the first and third examples, there's no ambiguity, but each for a different reason:
First example:
mynestedlist1=list()
mynestedlist1[['a']][['b']][['c']]=mymatrix
We've specified that mynestedlist1 is a list. But its elements could be any kind of object, until we assign them. In this case, we put into the element named 'a' an object with an element 'b' that contains an object with an element 'c' that is a matrix. Since there's no R object that can contain a matrix in a single element except a list, the only way to achieve this assignment is by creating a nested list.
Third example:
mynestedlist3=list()
mynestedlist3$g$h$i=mynumber
In this case, we've used the $ notation, which only applies to lists (or to data types that are similar/equivalent to lists, like dataframes). So, again, the only way to follow the instructions of this assignment is by creating a nested list.
Finally, the pesky second example, but starting with a simpler variant of it:
mylist2=list()
mylist2[['c']][['d']]=mynumber
Here there's an ambiguity. We've specified that mylist2 is a list, and we've put into the element named 'c' an object with an element 'd' that contains a single number. This element could have been a list, but it can also be a simple vector, and in this case R chooses this as the simpler option:
str(mylist2)
# List of 1
# $ c: Named num 2
# ..- attr(*, "names")= chr "d"
Contrast this to the behaviour when trying to assign a matrix using exactly the same syntax: in that case, the only way to follow the syntax would be by creating another, nested, list inside the first one.
What about the full second example mylist2[['c']][['d']][['e']]=mynumber, where we try to assign a number named 'e' to the just-created but still-empty object 'd'?
This seems rather unclear, and this may be the reason for the different behaviours of different versions of R (as reported in the comments to the question). In the question, the action taken by R has been to assign the number while dropping its name, similarly to:
myvec=vector(); myvec2=vector()
myvec[['a']]=1
myvec2[['b']]=2
myvec[['a']]=myvec2
str(myvec)
# Named num 2
# - attr(*, "names")= chr "a"
However, the syntax alone doesn't seem to force this behaviour, so it would be sensible to avoid relying on this behaviour when trying to create nested lists, or lists of vectors.
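So if you're building nested lists inside loops, a safer pattern (a small sketch based on the advice above, using the question's names) is to create each level explicitly before assigning into it:
result_list <- list()
result_list[["A"]] <- list()           # create each level explicitly...
result_list[["A"]][["B"]] <- list()
result_list[["A"]][["B"]][["C"]] <- 2  # ...then assign the leaf value
str(result_list)
# List of 1
#  $ A:List of 1
#   ..$ B:List of 1
#   .. ..$ C: num 2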

Converting from dgCMatrix/dgRMatrix to scipy sparse matrix

I am working on the Netflix data set and attempting to use the nmslibR package to do some KNN-type work on the sparse matrix that results from it. This package only accepts scipy sparse matrices as inputs, so I need to convert my R sparse matrix to that format. When I attempt to do so, I get the following error. dfm2 is a 1.1 GB dgCMatrix; I have also attempted it on a dgRMatrix with exactly the same error.
dfm3<-TO_scipy_sparse(dfm2)
Error in TO_scipy_sparse(dfm2) : attempt to apply non-function
I don't know how to provide a good sample dataset for my problem, since the sparse matrix I'm working with is 1.1 GB, so if someone has a suggestion on how I can make it easier to help me, please let me know. I would also be open to hearing about other packages that do KNN-type functions in R for sparse matrices.
Edit:
I used the following code to generate a sample sparse matrix in the dgCMatrix format, attempted to convert it to a SciPy sparse matrix, and got the following error.
library(Matrix)
library(nmslibR)
sparse <- Matrix(sample(c(1,2,3,4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), 10000,
                        replace = TRUE),
                 ncol = 50,
                 byrow = TRUE)
dfm3 <- TO_scipy_sparse(sparse)
Error in TO_scipy_sparse(sparse) : attempt to apply non-function
To answer a question about whether sparse is a dgCMatrix:
str(sparse)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..# i : int [1:2414] 0 6 9 10 13 20 22 23 25 49 ...
..# p : int [1:51] 0 45 92 146 185 227 277 330 383 435 ...
..# Dim : int [1:2] 200 50
..# Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..# x : num [1:2414] 4 1 1 2 5 3 2 5 3 5 ...
..# factors : list()
The 'attempt to apply non-function' error is a known issue when something is wrong with the Python configuration on the operating system. There are similar issues for other Python packages that I've ported from Python to R; you can have a look here.
You should also know that the nmslibR package uses the reticulate package for the interface between Python and R, so similar issues can arise there too. If the error persists, you can open an issue in the nmslibR repository providing some sample data.
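As a first diagnostic (my own sketch, not from the original answer), you can ask reticulate which Python it is bound to and whether scipy can be imported there, since a missing scipy module is one way a Python-backed function can end up being a non-function:
library(reticulate)
py_config()                    # shows which Python reticulate is using
py_module_available("scipy")   # should be TRUE for the conversion to work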

issue with predict with glmnetUtils

I'm trying to use the glmnetUtils package from GitHub for a formula interface to glmnet, but predict is not returning enough values.
library(nycflights13) # from GitHub
library(modelr)
library(dplyr)
library(glmnet)
library(glmnetUtils)
library(purrr)
fitfun = function(dF){
  cv.glmnet(arr_delay ~ distance + air_time + dep_time, data = dF)
}
gnetr2 = function(model, datavals){
  yvar = all.vars(formula(model)[[2]])
  print(paste('y variable:', yvar))
  print('observations')
  print(str(as.data.frame(datavals)[[yvar]]))
  print('predictions')
  print(str(predict(object = model, newdata = datavals)))
  stats::cor(stats::predict(object = model, newdata = datavals),
             as.data.frame(datavals)[[yvar]], use = 'complete.obs')^2
}
flights %>%
  group_by(carrier) %>%
  do({
    crossv_mc(., 4) %>%
      mutate(mdl = map(train, fitfun),
             r2 = map2_dbl(mdl, test, gnetr2))
  })
the output from gnetr2():
[1] "y variable: arr_delay"
[1] "observations"
num [1:3693] -33 -6 47 4 15 -5 45 16 0 NA ...
NULL
[1] "predictions"
num [1:3476, 1] 8.22 21.75 24.31 -7.96 -7.27 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:3476] "1" "2" "3" "4" ...
..$ : chr "1"
NULL
Error: incompatible dimensions
Any ideas what's going on? Your help is much appreciated!
This is an issue with the underlying glmnet package, but there's no reason that it can't be handled in glmnetUtils. I've just pushed an update that should let you use the na.action argument with the predict method for formula-based calls.
Setting na.action=na.pass (the default) will pad out the predictions to include NAs for rows with missing values.
na.action=na.omit or na.exclude will drop these rows.
Note that the missingness of a given row may change depending on how much regularisation is done: if the NAs are for variables that get dropped from the model, then the row will be counted as a complete case.
I also took the opportunity to fix a bug where the LHS of the formula contains an expression.
Give it a go with install_github("Hong-Revo/glmnetUtils") and tell me if anything breaks.
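With that update, a call along these lines (a sketch; the names follow the question's gnetr2 function) keeps the prediction vector the same length as the observations:
preds <- predict(model, newdata = datavals, na.action = na.pass)  # NAs padded in
stats::cor(preds, as.data.frame(datavals)[[yvar]], use = 'complete.obs')^2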
Turns out it's happening because there are NAs in the predictor variables, so predict() returns a shorter vector, since na.action=na.exclude.
Normally a solution would be to use predict(object, newdata, na.action=na.pass), but predict.cv.glmnet does not pass extra arguments on to predict.
Therefore the solution is to filter for complete cases before beginning:
flights = flights %>% filter(complete.cases(.))

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you're looking for, and it is not implemented in R. I don't know of a package where it is implemented, but it's not too difficult to code it yourself.
A workable way is to add a data frame to the attributes, containing the codes. To prevent doubling the whole data frame and to save space, I'd store the indices in that attribute data frame instead of reconstructing a complete data frame.
E.g.:
NACode <- function(x, code){
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  id <- which(is.na(Df))
  # 1-based row/column from the linear index (avoids the off-by-one that a
  # plain id %% nrow(x) gives when id is an exact multiple of nrow(x))
  rowid <- (id - 1) %% nrow(x) + 1
  colid <- (id - 1) %/% nrow(x) + 1
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the labels for the different values; see also this question. You can transform back with:
ChangeNAToCode <- function(x, code){
  NAval <- attr(x, "NAcode")
  for(i in which(NAval$value %in% code))
    x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
  x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This allows you to change only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; a sketch is given below.
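For example, a minimal extraction helper along those lines (my own sketch, not from the original answer) could be:
# return the masked positions and original values for a given code
GetNAByCode <- function(x, code){
  NAval <- attr(x, "NAcode")
  NAval[NAval$value %in% code, ]
}
GetNAByCode(DfwithNA, -2)
#   id rowid colid value
# 2 17     7     2    -2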
But in one line: using attributes and indices might be a nice way of doing it.
The most obvious way seems to be to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA).
Vector 2: a vector of factors indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
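A small sketch of that idea, using the illustrative codes from the example above:
responses <- c(2, 50, NA, NA)
na_type <- factor(c(1, 1, -1, -7))     # 1 = data, -1 = not asked, -7 = refused
mean(responses, na.rm = TRUE)          # standard NA handling still works
responses[na_type == "-7"]             # inspect one kind of missingness only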
Update following questions from @gsk3:
"Data storage will dramatically increase." The data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it." That's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies you will have to do something bespoke. If you just want to analyse data where the NAs are "Question not asked", then use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable." I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's." True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue; you should have a thorough statistical background in missing value analysis and imputation. One solution involves R and GGobi: you can assign extremely negative values to the several types of NA (putting the NAs into the margin) and do some diagnostics "manually". You should bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
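A brief sketch of that approach, using the codes from the codebook above:
codes <- c(-1, -5, -7, -9)               # the codebook's missingness codes
x <- c(23, -5, 87, -1, 42)
is_missing <- x %in% codes               # roll all codes up to one indicator
x_clean <- replace(x, is_missing, NA)    # convert to NA only when analysing
mean(x_clean, na.rm = TRUE)              # 50.66667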
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background" component here. The book Statistical Analysis with Missing Data is a very good read on this.

How can I count the number of times a value occurs in a column of a dataframe?

Is there a simple way of identifying the number of times a value is in a vector or column of dataframe? I essentially want the numerical values of a histogram but I do not know how to access it.
# sample vector
a <- c(1,2,1,1,1,3,1,2,3,3)
#hist
hist(a)
Thank you.
UPDATE:
On Dirk's suggestion I am using hist. Is there a better way than specifying the range as 1.9, 2.9, etc. when I know that all my values are integers?
hist(a, breaks=c(1,1.9,2.9,3.9,4.9,5.9,6.9,7.9,8.9,9.9), plot=FALSE)$counts
Use the table function.
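For the sample vector this gives the counts per distinct value directly:
a <- c(1,2,1,1,1,3,1,2,3,3)
table(a)
# a
# 1 2 3
# 5 2 3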
Try this:
R> a <- c(1,2,1,1,1,3,1,2,3,3)
R> b <- hist(a, plot=FALSE)
R> str(b)
List of 7
$ breaks : num [1:5] 1 1.5 2 2.5 3
$ counts : int [1:4] 5 2 0 3
$ intensities: num [1:4] 1 0.4 0 0.6
$ density : num [1:4] 1 0.4 0 0.6
$ mids : num [1:4] 1.25 1.75 2.25 2.75
$ xname : chr "a"
$ equidist : logi TRUE
- attr(*, "class")= chr "histogram"
R>
R is object-oriented and most methods give you meaningful results back. Use them.
If you want to use hist, you don't need to specify the breaks as you did; just use the seq function:
br <- seq(0.9, 9.9, 1)
num <- hist(a, br, plot=F)$counts
Also, if you're looking for a specific value, you can use which. For instance:
num <- length(which(a == 1))
In addition to the performance difference between hist and table in the case of many unique values, which Dirk and mbq already pointed out, I would also like to mention another difference in functionality.
hist(...)$counts will also give you zero counts for the bins that do not contain any cases. This can be very valuable when you want to be confident about the number of bins (bars on a barplot, for example) that will end up in a subsequent plot.
table, on the other hand, will only give you counts for values that actually occur.
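A quick illustration (my own toy example) where the value 2 is absent:
a2 <- c(1, 1, 3, 3)
table(a2)                                          # no entry for 2
hist(a2, breaks = 0:3 + 0.5, plot = FALSE)$counts  # 2 0 2: the zero bin is kept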
You might also want to check the right argument of hist, which controls whether your breaks (intervals) are right-closed or not.
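For example:
hist(c(1, 2), breaks = 0:2, plot = FALSE)$counts                 # 1 1 (right-closed bins)
hist(c(1, 2), breaks = 0:2, right = FALSE, plot = FALSE)$counts  # 0 2 (left-closed bins)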
