How to import a distance matrix for clustering in R

I have a text file containing 200 models, all compared to each other, with a molecular distance for each pair of models. It looks like this:
1 2 1.2323
1 3 6.4862
1 4 4.4789
1 5 3.6476
.
.
This continues all the way down to model 200. The first number is the first model, the second number is the second model, and the third number is the corresponding molecular distance when these two models are compared.
I can't think of a way to import this into R and create a nice 200x200 matrix to perform some clustering analyses on. I am still new to Stack and R but thanks in advance!

Since you don't have the distance between model1 and itself, you would need to insert that yourself, using the answer from this question:
(you can ignore the wrong numbering of the models compared to your input data, it doesn't serve a purpose, really)
# Create some dummy data that has the same shape as your data
# (200 models, so 200 * 199 rows without the self-distances):
df <- expand.grid(model1 = 1:200, model2 = 2:200)
df$distance <- runif(n = 199*200, min = 1, max = 10)
head(df)
# model1 model2 distance
# 1 2 7.958746
# 2 2 1.083700
# 3 2 9.211113
# 4 2 5.544380
# 5 2 5.498215
# 6 2 1.520450
# Positions after which a zero (distance of a model to itself) is inserted
inds <- seq(0, 200*199, by = 200)
val <- c(df$distance, rep(0, length(inds)))
inds <- c(seq_along(df$distance), inds + 0.5)
val <- val[order(inds)]
Once that's in place, you can use matrix() with ncol and nrow to "reshape" your vector of distances in the appropriate way:
matrix(val, ncol = 200, nrow = 200)
Edit:
When your data only contains the distance for one direction, e.g. only model1 - model5 and not model5 - model1, you will have to fill one triangle of a matrix and mirror it, like they do here. Forget about the data I generated in the first part of this answer. Also, forget about adding the zeros to your distance column; the matrix below starts out with zeros on the diagonal. Your file lists the pairs as 1-2, 1-3, ..., 1-200, 2-3, ..., which is the column-major order of the lower triangle, so fill the lower triangle first (this assumes all 19900 pairs are present in that order):
dist_mat <- matrix(0, nrow = 200, ncol = 200)
dist_mat[lower.tri(dist_mat)] <- your_data$distance
To copy the lower-triangular entries to above the diagonal, use:
dist_mat[upper.tri(dist_mat)] <- t(dist_mat)[upper.tri(dist_mat)]
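The steps above can also be sketched without relying on the row order of the file, by indexing the matrix with the two model columns directly. This is only a sketch under assumptions: the file name "distances.txt" and the column names model1, model2 and distance are made up for illustration, and a small 5-model stand-in is generated so the code runs as written.

```r
# Stand-in for the real file: all pairs i < j of 5 models, one direction only.
# With the real data you would read the file instead, for example
#   pairs <- read.table("distances.txt",
#                       col.names = c("model1", "model2", "distance"))
# (file name and column names are assumptions).
n_models <- 5
pairs <- as.data.frame(t(combn(n_models, 2)))
names(pairs) <- c("model1", "model2")
set.seed(1)
pairs$distance <- runif(nrow(pairs), min = 1, max = 10)

# Fill both triangles by indexing with (row, column) pairs;
# the diagonal keeps its initial value of 0.
dist_mat <- matrix(0, nrow = n_models, ncol = n_models)
dist_mat[cbind(pairs$model1, pairs$model2)] <- pairs$distance
dist_mat[cbind(pairs$model2, pairs$model1)] <- pairs$distance

# Convert to a "dist" object and run hierarchical clustering on it.
hc <- hclust(as.dist(dist_mat), method = "average")
clusters <- cutree(hc, k = 2)
```

For the real data, n_models would be 200 and the number of clusters k is whatever suits the analysis; plot(hc) draws the dendrogram.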

Since I cannot tell from your question what format your file is in, I will assume a common one, i.e., CSV.
In that case, look at the functions for reading files, such as read.csv or fread.
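Example code:
# sep = "" splits on any whitespace; your sample shows no header row,
# so the columns are named here (the names are placeholders)
dt <- read.csv(file, sep = "", header = FALSE, col.names = c("col1", "col2", "distance"))
I suggest using the data.table package. Then:
library(data.table)
setDT(dt)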
dt[, id := paste0(as.character(col1), "-", as.character(col2))]
This creates a new variable out of the first and the second model and serves as a unique id.
I would then remove this id and scale the numerical input.
After scaling, run the clustering algorithm.
Merge the result with the id to analyse your results.
Is that what you are looking for?

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the gaussian distribution, draw 50 random samples of 5 individuals from this distribution with replacement, calculate the means and standard deviation of x for each sample, create an interaction variable named y which is calculated by multiplying the mean and standard deviation of x for each sample, plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
  # draw 5 samples from population (without replacement)
  # I assume you want to replace for each turn of taking 5
  # If you want to replace between drawing each of the 5,
  # I think it should be obvious how to adapt the following code
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame currently has two columns. In each row i, we put mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
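A quick way to convince yourself of this is to reset the random seed before each call and compare the results. The seed value 42 is arbitrary, and the variable names a, b and d are only for illustration.

```r
set.seed(42)
a <- rnorm(5, mean = 0, sd = 1)  # all arguments named
set.seed(42)
b <- rnorm(5, 0, 1)              # same arguments, matched by position
set.seed(42)
d <- rnorm(5)                    # mean = 0 and sd = 1 are the defaults
identical(a, b) && identical(a, d)
```

The last line evaluates to TRUE, because all three calls draw the same five numbers from the same seed.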
Somewhat more efficient version
In R, explicit loops are often slower than vectorised alternatives and are commonly avoided where such an alternative exists. In the case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
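If you want to try both sampling schemes, or a larger number of draws to see the shape of the distribution more clearly, the code above can be wrapped in a small function. This is only a sketch: simulate_y is a hypothetical helper name, and the choice of 5000 draws is arbitrary.

```r
# Hypothetical helper: y = mean * sd for N samples of size n drawn from pop.
simulate_y <- function(pop, N = 50, n = 5, replace = FALSE) {
  # replicate() returns an n x N matrix, one sample per column
  smpls <- replicate(N, sample(pop, n, replace = replace))
  colMeans(smpls) * apply(smpls, 2, sd)
}

set.seed(1)
stat_pop <- rnorm(1000, mean = 0, sd = 1)

y_small <- simulate_y(stat_pop, N = 50)                   # as in the question
y_large <- simulate_y(stat_pop, N = 5000, replace = TRUE) # many more draws

# hist(y_large, breaks = 50) gives a smoother picture than 50 draws can
```

With only 50 samples the histogram changes noticeably from run to run, so a larger N is usually more informative about the shape.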
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

Stuck with a 2 data frames row copy

I have decided to learn R and am going through Introduction to Scientific programming in R book (http://www.ms.unimelb.edu.au/spuRs/)
I am currently stuck on chapter 7 question 3 of the book, the question is:
Consider the following very simple genetic model. A population consists of
equal numbers of two sexes: male and female. At each generation men and
women are paired at random, and each pair produces exactly two offspring,
one male and one female. We are interested in the distribution of height
from one generation to the next. Suppose that the height of both children
is just the average of the height of their parents, how will the distribution
of height change across generations?
Represent the heights of the current generation as a dataframe with two
variables, m and f, for the two sexes. The command rnorm(100, 160, 20)
will generate a vector of length 100, according to the normal distribution
with mean 160 and standard deviation 20 (see Section 16.5.1). We use it to
randomly generate the population at generation 1:
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
The command sample(x, size = length(x)) will return a random sample
of size size taken from the vector x (without replacement). (It will also
sample with replacement, if the optional argument replace is set to TRUE.)
The following function takes the dataframe pop and randomly permutes the
ordering of the men. Men and women are then paired according to rows,
and heights for the next generation are calculated by taking the mean of
each row. The function returns a dataframe with the same structure, giving
the heights of the next generation.
next.gen <- function(pop) {
pop$m <- sample(pop$m)
pop$m <- apply(pop, 1, mean)
pop$f <- pop$m
return(pop)
}
Use the function next.gen to generate nine generations, then use the lattice
function histogram to plot the distribution of male heights in each
generation, as in Figure 7.7. The phenomenon you see is called regression
to the mean.
Hint: construct a dataframe with variables height and generation, where
each row represents a single man.
I have constructed a blank data frame:
generations <- data.frame(gen="", height="")
For now I am trying to get just the first generation height information into it, so I run:
next.gen(pop)
generations$height <- pop$m
and I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "height", value = c(165.208323681597, :
replacement has 100 rows, data has 1
I understand that I'm trying to squeeze the 100 values of pop$m into the single row of generations$height and that this is causing the problem, but I do not know how to fix it. I thought that a blank data frame would be flexible enough to add rows as they are copied from the pop data frame?
I tried then to run this code:
generations <- pop$m
And I get 100 values but that just turns my generations dataframe into a vector I think and running
generations
Just lists the values copied in a vector only.
I think I am approaching the first step wrong, is my dataframe definition correct? Why can't I copy row information from 1 data frame into an empty one and just adjust the size of the empty data frame as needed?
Thank you
I am unsure of the exact output you are looking for, but here is an approach which should be simple enough to follow. Note: there are plenty of other workable approaches.
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
next.gen <- function(pop) {
pop$m <- sample(pop$m)
pop$m <- apply(pop, 1, mean)
pop$f <- pop$m
return(pop)
}
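# the code
test <- list()
for (i in 1:9) {
  # store the men of the current generation ...
  test[[i]] <- pop["m"]
  test[[i]]$generation <- paste0("g", i)
  # ... then advance the population by one generation
  pop <- next.gen(pop)
}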
library(data.table)
test2 <- rbindlist(test)
# result
m generation
1: 174.6558 g1
2: 143.2617 g1
3: 185.2829 g1
4: 168.9719 g1
5: 151.6948 g1
---
896: 159.6091 g9
897: 161.4546 g9
898: 171.8679 g9
899: 138.4982 g9
900: 152.7390 g9
Try:
> generations <- data.frame(gen="", height="", stringsAsFactors=FALSE)
> # c("", pop$m[i]) is a character vector, so height is stored as text here
> for(i in 1:length(pop$m)) generations[i,] = c("",pop$m[i])
> generations
gen height
1 136.70042632318
2 153.985392293761
3 122.077485676327
4 166.582538529591
5 170.751368839498
6 190.8894492681
...
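As for why the original attempt fails: assigning a column with $ requires the new values to match the number of rows the data frame already has, so a one-row data frame does not grow on its own. A common alternative, sketched below, is to build one small data frame per generation and bind them together at the end. The names gens and generations are only for illustration, and the seed is arbitrary.

```r
set.seed(1)
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
next.gen <- function(pop) {
  pop$m <- sample(pop$m)
  pop$m <- apply(pop, 1, mean)
  pop$f <- pop$m
  return(pop)
}

gens <- vector("list", 9)
for (i in 1:9) {
  # one row per man, as the hint in the exercise suggests
  gens[[i]] <- data.frame(generation = i, height = pop$m)
  pop <- next.gen(pop)  # advance to the next generation
}
generations <- do.call(rbind, gens)

# Plot with lattice, one panel per generation:
# lattice::histogram(~ height | factor(generation), data = generations)
```

The spread of height shrinks from panel to panel, which is the regression to the mean the exercise refers to.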

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks, over a rolling 36 month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100, one by one, on columns 101-116? Of course, this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for( i in window.len:ndates ) {
window.index = (i - window.len + 1) : i
fit = lm.constraint( hist.returns[window.index, -1], hist.returns[window.index, 1], constraints )
style.weights[i,] = fit$coefficients
style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random toy dataset, using lm(), but you can sub in whatever regression function you want:
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
models[[i]] <-
lm(formula=x~., data=data.frame(x=dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
lapply(1:100,
function(i) lm(formula=x~.,
data=data.frame(x=dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]
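Once the models are stored in a list, summary statistics can be pulled out of all of them in one go. The following sketch rebuilds the toy models and collects one R-squared per asset and the coefficients in a matrix; the names r_squared and coefs are only for illustration.

```r
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116 * 20), ncol = 116))

models <- lapply(1:100, function(i)
  lm(x ~ ., data = data.frame(x = dat[, i], dat[, 101:116])))

# one R-squared value per regressed column
r_squared <- sapply(models, function(m) summary(m)$r.squared)

# 100 x 17 matrix: intercept plus 16 benchmark coefficients per model
coefs <- t(sapply(models, coef))
```

The same pattern works for plots: loop over seq_along(models) and call the plotting function once per element.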

Select the most dissimilar individual using cluster analysis

I want to cluster my data to say 5 clusters, then we need to select 50 individuals with most dissimilar relationship from all the data. That means if cluster one contains 100, two contains 200, three contains 400, four contains 200, and five 100, I have to select 5 from the first cluster + 10 from the second cluster + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done until now is cluster the data and rank the individuals within each cluster, then export the result to Excel and go from there …
That has become a problem since my data has become really big.
I would appreciate any help or suggestions on how to do the above in R.
I'm not sure if it is exactly what you are searching for, but maybe it helps:
mydata<-matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100) # some id for identification
# cluster objects and calculate dissimilarity matrix
cl <- cutree(hclust(
sim <- dist(mydata, diag = TRUE, upper=TRUE)), 5)
# combine results, take sum to aggregate dissimilarity
res <- data.frame(id=rownames(mydata),
cluster=cl, dis_sim=rowSums(as.matrix(sim)))
# order, lowest overall dissimilarity will be first
res <- res[order(res$dis_sim), ]
# split object
reslist <- split(res, f=res$cluster)
## takes the three items with the highest overall dissimilarity (the last three rows)
lapply(reslist, tail, n=3)
## returns the ids with the highest overall dissimilarity, top 20% of each cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x)*p)), p=0.2)
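To end up with one table of selected individuals rather than a list, the per-cluster selection can be bound back together. The sketch below uses a sampling fraction of 5%, which corresponds to the 50 out of 1000 in the question. Taking at least one individual per cluster is an assumption added here, and because of rounding the total is not guaranteed to hit an exact target.

```r
set.seed(1)
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

d <- dist(mydata)
cl <- cutree(hclust(d), k = 5)
res <- data.frame(id = rownames(mydata), cluster = cl,
                  dis_sim = rowSums(as.matrix(d)))

p <- 0.05  # assumed sampling fraction per cluster
picked <- do.call(rbind, lapply(split(res, res$cluster), function(x) {
  k <- max(1, round(nrow(x) * p))  # at least one per cluster (an assumption)
  # sort by overall dissimilarity, largest first, and keep the top k rows
  x[order(x$dis_sim, decreasing = TRUE), ][seq_len(k), ]
}))
```

picked$id then holds the selected individuals, so no export to Excel is needed.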
Regarding your comment, find the code below.
Please note that the code can be improved in terms of beauty and efficiency.
Further, I used a second answer, because otherwise it would be too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper=TRUE))
cent <- tapply(mydata,
list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])
# add up cluster number and data and split by cluster
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5))
newdfl <- split(newdf, f=newdf$cluster)
# add centroids and drop cluster info
totaldf <- lapply(1:5,
function(i, li, cen) rbind(cen[i, ], li[[i]][ , -11]),
li=newdfl, cen=cent)
# calculate new distances to the centroids and sort them
dist_to_cent <- lapply(totaldf, function(x)
sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ]))
dist_to_cent
For the calculation of centroids from hclust output, see the R mailing list post linked in the code.
