Hierarchical Dendrogram using both continuous and categorical data - r

So I have a data set of about 90 specimens, each with the clade it belongs to and 5 data points calculated from the length of a line (the details of which are not important here). My aim is to create a dendrogram that calculates the dissimilarity between specimens and displays it as a graph. However, I want the code to first separate the specimens by their respective clades into their own clusters and then look at how similar/dissimilar they are, and I'm not sure how to approach this. I'm willing to use any of the available R tools, so I would appreciate some help. A short example of the data set is formatted below as a table.
Name   Type  length_1  length_2  length_3  length_4  length_5
spec1  S           10       -15        -5         5        10
spec2  O           20         6         6        -5       -10
spec3  O           22         7        10        -3        -7
spec4  S            6         6       -10        -5         3
spec5  T           54       -20       -20         9         9
spec6  T           25       -20       -10         5         9
This is a table made up on the spot, but it represents what my data generally look like. The Type refers to the clade, and the lengths were calculated by splitting segmented graphs with breaks into 5 sections and finding the average hypotenuse of each line segment. I'd like to determine how similar these lines are for each species within their respective clades (grouping the S specimens with the other S specimens and looking at the dissimilarity). Also, is there a way to use these dendrograms to convert the data and draw up a phylogeny? (This is just an additional question and does not need to be answered if unknown.) Thank you for reading through this; my R coding is still progressing, hence why these questions may seem elementary to some.

Your question seems to be about hierarchical clustering of groups defined by a categorical variable, not hierarchical clustering of both continuous and categorical data. Hierarchical clustering involves a series of decisions about how to scale the data, how to compute distances, and how to form clusters from those distances, so this example uses the default options. Using the iris data set that comes with R:
data(iris)
iris.sub <- iris[c(1:15, 51:65, 101:115), ]
str(iris.sub)
# 'data.frame': 45 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The data consist of 4 measurements on three different species of iris; we'll use 15 specimens of each species for the example. Now scale the data by converting to Z-scores, so that variables measured on larger scales are not more influential than those on smaller scales, and compute the species means:
iris.z <- scale(iris.sub[ , 1:4])
iris.agg <- aggregate(iris.z, list(iris.sub$Species), mean)
iris.agg
# Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 setosa -0.9871973 0.7967458 -1.2887187 -1.2543022
# 2 versicolor 0.2171329 -0.5487521 0.2754513 0.1735865
# 3 virginica 0.7700644 -0.2479937 1.0132674 1.0807157
Next the distances between the means:
iris.gdst <- dist(iris.agg[, -1])
iris.gdst
# 1 2
# 2 2.783212
# 3 3.864052 1.327948
Finally the cluster analysis and the dendrogram:
iris.hcl <- hclust(iris.gdst, members=table(iris.sub$Species))
plot(iris.hcl, labels=iris.agg$Group.1)
For details on the various functions, read the manual pages ?scale, ?aggregate, ?dist, and ?hclust.
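The question also asks about phylogenies. A dendrogram from hclust() is not a phylogeny in any evolutionary sense, but if you just need the tree structure in a phylogenetics-friendly format, the ape package (an assumption here, not part of the approach above) can convert an hclust object:
# Sketch assuming the 'ape' package is installed.
library(ape)
iris.phylo <- as.phylo(iris.hcl)                  # convert the hclust result to a "phylo" object
plot(iris.phylo)                                  # plot it as a tree
write.tree(iris.phylo, file = "iris_clades.nwk")  # export in Newick format if needed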
You can compute the dendrogram for all of the numeric variables and label the result by species as follows:
iris.dst <- dist(iris.z)
iris.all.hcl <- hclust(iris.dst)
plot(iris.all.hcl, labels=iris.sub$Species)
rect.hclust(iris.all.hcl, 3)
Notice that setosa is well separated, but versicolor and virginica are not.
If we code setosa, versicolor, and virginica as -2, 0, and 2 and include that coding as an extra variable before computing the distance matrix:
iris.sp.dst <- dist(cbind(iris.z, 2*(as.numeric(iris.sub$Species) - 2)))
iris.sp.hcl <- hclust(iris.sp.dst)
plot(iris.sp.hcl, labels=iris.sub$Species)
rect.hclust(iris.sp.hcl, 3)
Notice that virginica and versicolor are still intermixed.
If we arbitrarily increase the range of the species coding to -3, 0, +3 (sketched below), then the species will be separated, but this is only because we increased the magnitude of the species variable.
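For completeness, a sketch of that last step; it is the same code as above with a larger multiplier on the species coding:
iris.sp3.dst <- dist(cbind(iris.z, 3*(as.numeric(iris.sub$Species) - 2)))
iris.sp3.hcl <- hclust(iris.sp3.dst)
plot(iris.sp3.hcl, labels=iris.sub$Species)
rect.hclust(iris.sp3.hcl, 3)
# The three species now fall into separate clusters, but only because the
# coded species variable dominates the distance calculation.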

Related

How to compute several logistic model at the same time and summary p-value

I am trying to compute several models at the same time. The dependent variable is in the first column, and the rest of the columns are independent variables. I want to run a logistic regression between the DV and each independent variable separately. Thank you very much for your help! Please let me know if anything else needs to be provided.
**** Some of the IVs are binary variables, so they should be treated with as.factor() in R.
*** After computing each model, can I also compute a p-value for each model in one step?
*** Right now, I just compute and summarize each model separately.
The data and my current code look like below.
[Screenshots of the data and code were posted here.]
Pictures of your data are not as helpful as providing a sample of your data with dput(). Also, you should paste your code directly into your question rather than posting a picture of it. Here is an example using the iris data set that is included with R:
data(iris)
iris.2 <- iris[iris$Species!="setosa", ]
iris.2 <- droplevels(iris.2)
iris.2$Species <- as.numeric(iris.2$Species) - 1
# Species: 0 == versicolor, 1== virginica
str(iris.2)
# 'data.frame': 100 obs. of 5 variables:
# $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
# $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
# $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
# $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
# $ Species : num 0 0 0 0 0 0 0 0 0 0 ...
Now we fit a logistic regression with Species as the dependent variable against each of the independent variables separately.
forms <- paste("Species ~", colnames(iris.2)[-5])
forms
# [1] "Species ~ Sepal.Length" "Species ~ Sepal.Width" "Species ~ Petal.Length" "Species ~ Petal.Width"
iris.glm <- lapply(forms, function(x) glm(as.formula(x), data=iris.2, family=binomial))
Now iris.glm is a list containing all of the results. The results of the first logistic regression are iris.glm[[1]] and summary(iris.glm[[1]]) gives you the summary. To print all of the results use lapply():
lapply(iris.glm, print)
lapply(iris.glm, summary)
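To address the sub-question about getting a p-value for each model in one step, here is a sketch that pulls the Wald p-value of the single predictor out of each coefficient table (the row and column indices are assumptions that hold for these one-predictor models):
# Row 2 of each coefficient table is the predictor; column 4 is Pr(>|z|).
p.values <- sapply(iris.glm, function(m) coef(summary(m))[2, 4])
names(p.values) <- colnames(iris.2)[-5]
p.values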

What is the best way to generate a random dataset from an existing dataset?

Are there any packages in R that can generate a random dataset given a pre-existing template dataset?
For example, let's say I have the iris dataset:
data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I want some function random_df(iris) which will generate a data frame with the same columns as iris but with random data (preferably random data that preserves certain statistical properties of the original, e.g., the mean and standard deviation of the numeric variables).
What is the easiest way to do this?
[Comment from question author moved here. --Editor's note]
I don't want to sample random rows from an existing dataset. I want to generate genuinely random data with all the same columns (and types) as an existing dataset. Ideally, if there is some way to preserve statistical properties of the data for the numeric variables, that would be preferable, but it's not required.
How about this for a start:
Define a function that simulates data from df by
drawing samples from a normal distribution for numeric columns in df, with the same mean and sd as in the original data column, and
uniformly drawing samples from the levels of factor columns.
generate_data <- function(df, nrow = 10) {
  as.data.frame(lapply(df, function(x) {
    if (class(x) == "numeric") {
      rnorm(nrow, mean = mean(x), sd = sd(x))
    } else if (class(x) == "factor") {
      sample(levels(x), nrow, replace = TRUE)
    }
  }))
}
Then for example, if we take iris, we get
set.seed(2019)
df <- generate_data(iris)
str(df)
#'data.frame': 10 obs. of 5 variables:
# $ Sepal.Length: num 6.45 5.42 4.49 6.6 4.79 ...
# $ Sepal.Width : num 2.95 3.76 2.57 3.16 3.2 ...
# $ Petal.Length: num 4.26 5.47 5.29 6.19 2.33 ...
# $ Petal.Width : num 0.487 1.68 1.779 0.809 1.963 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 2 1 2 3 2 1 1 2 3
It should be fairly straightforward to extend the generate_data function to account for other column types; one possible extension is sketched below.
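For example, a sketch of one way to handle integer, character, and other column types (the specific choices here, such as resampling values of unknown types, are assumptions rather than the only reasonable behaviour):
generate_data2 <- function(df, nrow = 10) {
  as.data.frame(lapply(df, function(x) {
    if (is.numeric(x)) {
      # covers both double and integer columns
      out <- rnorm(nrow, mean = mean(x), sd = sd(x))
      if (is.integer(x)) out <- as.integer(round(out))
      out
    } else if (is.factor(x)) {
      factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
    } else if (is.character(x)) {
      sample(unique(x), nrow, replace = TRUE)
    } else {
      # fall back to resampling the original values for anything else
      sample(x, nrow, replace = TRUE)
    }
  }))
}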

How to add metadata to a tibble

How does one add metadata to a tibble?
I would like a sentence describing each of my variable names, so that I could print out the tibble with the associated metadata, and if I handed it to someone who hadn't seen the data before, they could make some sense of it.
as_tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
# Sepal.length. Measured from sepal attachment to stem
# Sepal.width. Measured at the widest point
# Petal.length. Measured from petal attachment to stem
# Petal.width. Measured at widest point
# Species. Nomenclature based on Integrated Taxonomic Information System (ITIS), January 2018.
thanks!
This seems tricky. In principle @hrbrmstr's comment is the way to go (i.e. use ?comment or ?attr to add attributes to any object), but these attributes will not be printed out by default. Attributes seem to be printed automatically for atomic objects:
> z <- 1:6
> attr(z,"hello") <- "goodbye"
> z
[1] 1 2 3 4 5 6
attr(,"hello")
[1] "goodbye"
... but not, alas, for data frames or tibbles:
dd <- tibble::tibble(x=1:4,y=2:5)
> attr(dd,"metadata") <- c("some stuff","some more stuff")
> dd
# A tibble: 4 x 2
x y
<int> <int>
1 1 2
2 2 3
3 3 4
4 4 5
You can wrap the object with its own S3 class to get this stuff printed:
class(dd) <- c("my_tbl",class(dd))
> print.my_tbl <- function(x) {
+ NextMethod(x)
+ print(attr(x,"metadata"))
+ invisible(x)
+ }
> dd
# A tibble: 4 x 2
x y
<int> <int>
1 1 2
2 2 3
3 3 4
4 4 5
[1] "some stuff" "some more stuff"
You could make the printing more elaborate or pretty, e.g.
cat("\nMETADATA:\n")
cat(sprintf("# %s",attr(x,"metadata")),sep="\n")
Nothing bad will happen if the other user hasn't defined print.my_tbl (the print method will fall back to the print method for tibbles), but the metadata will only be printed if they have your print.my_tbl definition ...
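Putting those pieces together, a complete version of the print method with the prettier output might look like this (just a sketch combining the snippets above):
print.my_tbl <- function(x, ...) {
  NextMethod()                                    # print the tibble as usual
  cat("\nMETADATA:\n")
  cat(sprintf("# %s", attr(x, "metadata")), sep = "\n")
  invisible(x)
}
dd   # prints the tibble followed by the metadata lines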
Sorry for the delayed response, but this topic has been bugging me since I first started learning R. In my work, assigning metadata to columns is not just common; it is required. That R didn't seem to have a nice way to do it really bothered me, so much so that I wrote some packages to do it.
The fmtr package has a function to assign the descriptions (plus other stuff). And the libr package has a dictionary function, so you can look at all the metadata you assign.
Here is how it works:
First, assign the descriptions to the columns. You just send a named list into the descriptions() function.
library(fmtr)
library(libr)
# Create data frame
df <- iris
# Assign descriptions
descriptions(df) <- list(
  Sepal.Length = "Measured from sepal attachment to stem",
  Sepal.Width = "Measured at the widest point",
  Petal.Length = "Measured from petal attachment to stem",
  Petal.Width = "Measured at the widest point",
  Species = paste("Nomenclature based on Integrated Taxonomic",
                  "Information System (ITIS), January 2018."))
Then you can see all the metadata by calling the dictionary() function, like so:
dictionary(df)
# # A tibble: 5 x 10
# Name Column Class Label Description
# <chr> <chr> <chr> <chr> <chr>
# 1 df Sepal.Leng~ numer~ NA Measured from sepal attachment to stem
# 2 df Sepal.Width numer~ NA Measured at the widest point
# 3 df Petal.Leng~ numer~ NA Measured from petal attachment to stem
# 4 df Petal.Width numer~ NA Measured at the widest point
# 5 df Species factor NA Nomenclature based on Integrated Taxonomic Information Syst~
If you like, you can return the dictionary as its own data frame, then save it or print it or whatever.
d <- dictionary(df)
This is not all that different from Ben Bolker's suggestion, but conceptually, if I want information to be related to the vectors in my data frame, I would prefer that it be tied directly to the vectors. In other words, I'd prefer to add the attributes to the vectors themselves rather than to the data frame object.
I don't know that I would go so far as to add a custom class to the object, but perhaps a separate function you can call up for a data frame-like object would be adequate:
library(tibble)
library(ggplot2)
library(magrittr)
library(labelVector)
print_with_label <- function(dframe){
  stopifnot(inherits(dframe, "data.frame"))
  labs <- labelVector::get_label(dframe, names(dframe))
  labs <- sprintf("%s: %s", names(dframe), labs)
  print(dframe)
  cat("\n")
  cat(labs, sep = "\n")
}
iris <-
  as_tibble(iris) %>%
  set_label(Sepal.Length = "This is a user friendly label",
            Petal.Length = "I much prefer reading human over computer")
print_with_label(iris)
mtcars <-
  set_label(mtcars,
            mpg = "Miles per Gallon",
            qsec = "Quarter mile time",
            hp = "Horsepower",
            cyl = "Cylinders",
            disp = "Engine displacement")
print_with_label(mtcars)

Meaning of these R codes? Are they correlated?

I am exploring the iris data set in R and I would like some clarification on the following two lines of code:
cluster_iris<-kmeans(iris[,1:4], centers=3)
iris$ClusterM <- as.factor(cluster_iris$cluster)
I think the first one is performing a k-means cluster analysis using all the cases in the data file, only the first 4 columns, and a choice of 3 clusters.
However, I'm not sure what the second piece of code is doing. Is the first one just stating the preferences for the analysis and the second one actually executing it (i.e. performing the k-means)?
Any help is appreciated
The first line does the cluster analysis, and stores the cluster labels in a component called cluster_iris$cluster which is just a vector of numbers.
The second line puts that cluster number as a categorical label onto the rows of the original data set. So now your iris data has all the petal and sepal stuff and a cluster index in a column called "ClusterM".
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ClusterM
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 3
3 4.7 3.2 1.3 0.2 setosa 3
4 4.6 3.1 1.5 0.2 setosa 3
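If you want to see how the clusters line up with the actual species, a quick cross-tabulation helps (a sketch using the column created above; the cluster numbers are arbitrary labels and can change between runs):
# Each row shows how one species is split across the three clusters.
table(iris$Species, iris$ClusterM)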

R- reduce dimensionality LSA

I am following an example of svd, but I still don't know how to reduce the dimension of the final matrix:
a <- round(runif(10)*100)
dat <- as.matrix(iris[a,-5])
rownames(dat) <- c(1:10)
s <- svd(dat)
pc.use <- 1
recon <- s$u[,pc.use] %*% diag(s$d[pc.use], length(pc.use), length(pc.use)) %*% t(s$v[,pc.use])
But recon still has the same dimensions. I need to use this for latent semantic analysis.
The code you provided does not reduce the dimensionality. Instead, it takes the first principal component from your data, removes the rest of the principal components, and then reconstructs the data using only that one PC.
You can check that this is happening by inspecting the rank of the final matrix:
library(Matrix)
rankMatrix(dat)
as.numeric(rankMatrix(dat))
[1] 4
as.numeric(rankMatrix(recon))
[1] 1
If you want to reduce the dimensionality (the number of features), you can select a few principal components and compute the scores of your data on those components instead.
But first, let's make some things clear about your data: it seems you have 10 samples (rows) with 4 features (columns). Dimensionality reduction will reduce those 4 features to a smaller set of features.
So you can start by transposing your matrix for svd():
dat <- t(dat)
dat
1 2 3 4 5 6 7 8 9 10
Sepal.Length 6.7 6.1 5.8 5.1 6.1 5.1 4.8 5.2 6.1 5.7
Sepal.Width 3.1 2.8 4.0 3.8 3.0 3.7 3.0 4.1 2.8 3.8
Petal.Length 4.4 4.0 1.2 1.5 4.6 1.5 1.4 1.5 4.7 1.7
Petal.Width 1.4 1.3 0.2 0.3 1.4 0.4 0.1 0.1 1.2 0.3
Now you can repeat the svd. Centering the data before this procedure is advisable:
s <- svd(dat - rowMeans(dat))
The principal component scores can be obtained by projecting your data onto the PCs:
PCs <- t(s$u) %*% dat
Now, if you want to reduce dimensionality by eliminating PCs with low variance, you can do so like this:
dat2 <- PCs[1:2,] # would select first two PCs.
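As a quick check, you can confirm the reduced dimensions and see how much variance each PC carries (a sketch using the objects defined above):
dim(dat2)              # 2 x 10: two PC scores for each of the 10 samples
s$d^2 / sum(s$d^2)     # proportion of total variance carried by each PC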
