UPGMA with hclust plotting branch lengths as raw distances - r

I'm working on a presentation about using UPGMA with the hclust() function in our research lab. According to the literature, the branch length calculated by UPGMA for any pair of elements should be 1/2 the pairwise distance between those two elements.
I'm noticing that the example dendrogram I'm building for the presentation doesn't show the branch lengths I expected. I can't find anything in ?hclust suggesting that I'm missing a function argument that would cause the UPGMA algorithm to use the raw distances as the branch lengths. I understand that in certain situations, due to the limits of computational accuracy, an exactly ultrametric dendrogram may not always be possible (from here and here, and I'm sure elsewhere as well). That still doesn't explain why I see the raw pairwise distances plotted as the branch length between two elements.
Using the data below, here's the code I used to plot an example dendrogram...
# hclust() expects a "dist" object; as.dist() reads the lower triangle of the matrix
demoDend <- hclust(d = as.dist(demoTable), method = "average") # make an hclust object
# use the ggdendro package to extract segments and labels for ggplot plotting
dendData <- ggdendro::dendro_data(demoDend)
dendSegs <- dendData$segments
dendLabs <- dendData$labels
library(ggplot2)
ggplot()+
geom_segment(data = dendSegs, aes(x = x, y = y, xend = xend, yend = yend))+
geom_text(data = dendLabs, aes(x = x, y = y-0.05, label = label, angle = 90))+
geom_hline(aes(yintercept = 0.333), linetype = 2, color = "blue")+
geom_hline(aes(yintercept = 0.2), linetype = 2, color = "red")+
theme_bw()
The two elements that stand out are 13195 and 13199, which have a distance of 0.2 and whose branch length is plotted as 0.2 (red line in the ggplot).
Even after examining the hclust object, some of the heights for the branches are the raw distances in the input matrix, and not 1/2 the distance. Do I need to manually halve the heights in the object before plotting? Maybe I don't understand UPGMA as well as I thought? Any help or insight into the implementation of UPGMA with hclust() would be greatly appreciated.
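One possible reading of the behaviour described above, offered as a sketch rather than a definitive answer: the height component of an hclust object records the dissimilarity at which two clusters are merged, not the length of each branch below that node. In an ultrametric UPGMA tree, each of the two branches under a node spans half of that merge distance, so halving the heights before plotting should give the branch lengths the literature describes. The 3-element matrix toy and its labels a, b, c are made up for illustration.

```r
# Hypothetical toy distances: a and b are 0.2 apart, both are 0.6 from c
toy <- matrix(c(0,   0.2, 0.6,
                0.2, 0,   0.6,
                0.6, 0.6, 0),
              nrow = 3,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

hc <- hclust(as.dist(toy), method = "average")

# Merge heights are the (average) distances between the merged clusters
merge_heights <- hc$height

# UPGMA node heights: each branch below a node covers half the merge distance
hc_half <- hc
hc_half$height <- hc$height / 2
upgma_heights <- hc_half$height
```

Plotting hc_half instead of hc (with plot() or ggdendro::dendro_data()) would then put the a/b node at 0.1 rather than 0.2.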
Here's the sample distance data that I'm working with, from dput()
demoTable <- structure(c(0, 0.333333333333333, 0.333333333333333, 0, 0, 0.333333333333333,
0.333333333333333, 1, 1, 1, 1, 1, 1, NA, 0, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, 0, 0,
0.333333333333333, 0.333333333333333, 1, 1, 1, 1, 1, 1, NA, NA,
NA, NA, 0, 0.333333333333333, 0.333333333333333, 1, 1, 1, 1,
1, 1, NA, NA, NA, NA, NA, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA,
NA, NA, NA, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA,
0, 0.6, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA,
NA, NA, 0, 0.6, 1, 0.5, 0.2, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 0, 0.5, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 0.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0
), .Dim = c(13L, 13L), .Dimnames = list(c("13187", "13188", "13189",
"13190", "13191", "13192", "13193", "13194", "13195", "13196",
"13197", "13198", "13199"), NULL))


Using for-loops in R to process several columns in a data frame

I am trying to convert 50 columns in my data frame into dummy variables, based on an exact match against a vector of 50 values, using a for loop.
I have never used loops before and can't figure out how to do it.
I first wanted to code this "by hand" for each of the 50 columns like this:
dBGK1a <- as.numeric(BGK1a == BGKright[1])
dBGK2a <- as.numeric(BGK2a == BGKright[2])
dBGK3a <- as.numeric(BGK3a == BGKright[3])
....
dBGK50a <- as.numeric(BGK50a == BGKright[50])
As this is very tedious, I tried to come up with a for loop that can handle it.
for(i in 1:50) {
for (j in seq(from = 348, to = 448, by = 2)){
data1[j] <- as.numeric(data1[j] == BGKright[i])
}
}
Somehow this doesn't work: I get the value 0 in every column for every observation.
data1 is my data frame. Here is a shorter version of the data frame:
dput(head(data1[348:354], 20))
structure(list(BGK1a = c(NA, NA, NA, NA, NA, NA, NA, NA, 2, NA,
NA, NA, NA, NA, 2, 2, 2, 2, 1, 2), BGK1b = c(NA, NA, NA, NA,
NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 100, 100, 100, 99, 89,
50), BGK2a = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA, NA, 1, 2, 1, 2, 1, 1), BGK2b = c(NA, NA, NA, NA, NA, NA,
NA, NA, 50, NA, NA, NA, NA, NA, 100, 50, 96, 62, 93, 50), BGK3a = c(NA,
NA, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, 1, 1, 1,
1, 2), BGK3b = c(NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA,
NA, NA, NA, 100, 100, 50, 85, 82, 74), BGK4a = c(NA, NA, NA,
NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 1, 2, 2, 2, 1, 1)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
What the loop should do is select the respective value of "BGKright" with "i" and the column to process with "j". Note that "j" needs to jump 2 steps on every iteration because I only need to process every second column (from column 348 to column 448).
I would appreciate any help with this loop, as well as other solutions for this task that do not use loops.
Thank you in advance.
OK, I used BGKa=select(data1[348:448],ends_with("a")) to make a new data frame with only the relevant columns.
Then I used the for loop to create the dummies.
for(i in 1:50) {
BGKa[i]=as.numeric(BGKa[i]==BGKright[i])
}
This seems to work. Thanks for the help.
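A likely reason the nested loop produced only zeros is that every selected column was overwritten once per value of i, so after the first pass the later comparisons ran against already recoded 0/1 values instead of the original answers. For the loop-free variant the question asks about, here is a minimal sketch using Map(), which pairs each "a" column with the matching element of the answer key. The small data1 and the two-element BGKright below are hypothetical stand-ins for the real objects.

```r
# Toy stand-in for data1; BGKright is a hypothetical answer key
data1 <- data.frame(
  BGK1a = c(NA, 2, 1), BGK1b = c(NA, 50, 89),
  BGK2a = c(NA, 1, 2), BGK2b = c(NA, 50, 62)
)
BGKright <- c(2, 1)

# Pick the columns whose names end in "a"
a_cols <- grep("a$", names(data1), value = TRUE)
stopifnot(length(a_cols) == length(BGKright))

# Compare column k with BGKright[k]; NA stays NA
dummies <- Map(function(col, right) as.numeric(col == right),
               data1[a_cols], BGKright)

# Store the results in new columns so the original answers are kept
data1[paste0("d", a_cols)] <- unname(dummies)
```

Writing to new d-prefixed columns mirrors the "by hand" version in the question and avoids the overwrite problem altogether.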

Progression of non-missing values that have missing values in-between

To continue on a previous topic:
Finding non-missing values between missing values
I would also like to find whether the value before the missing value is smaller than, equal to or larger than the one after it.
To use the same example as before:
df = structure(list(FirstYStage = c(NA, 3.2, 3.1, NA, NA, 2, 1, 3.2,
3.1, 1, 2, 5, 2, NA, NA, NA, NA, 2, 3.1, 1), SecondYStage = c(NA,
3.1, 3.1, NA, NA, 2, 1, 4, 3.1, 1, NA, 5, 3.1, 3.2, 2, 3.1, NA,
2, 3.1, 1), ThirdYStage = c(NA, NA, 3.1, NA, NA, 3.2, 1, 4, NA,
1, NA, NA, 3.2, NA, 2, 3.2, NA, NA, 2, 1), FourthYStage = c(NA,
NA, 3.1, NA, NA, NA, 1, 4, NA, 1, NA, NA, NA, 4, 2, NA, NA, NA,
2, 1), FifthYStage = c(NA, NA, 2, NA, NA, NA, 1, 5, NA, NA, NA,
NA, 3.2, NA, 2, 3.2, NA, NA, 2, 1)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -20L))
Rows 13, 14 and 16 have non-missing values in between missing values. The output this time should be "same", "larger" and "same" for rows 13, 14 and 16, and "N/A" for the other rows.
A straightforward approach would be to split, convert to numeric, take the last 2 values and compare with an ifelse statement, i.e.
sapply(strsplit(do.call(paste, df)[c(13, 14, 16)], 'NA| '), function(i){
v1 <- as.numeric(tail(i[i != ''], 2));
ifelse(v1[1] > v1[2], 'greater',
ifelse(v1[1] == v1[2], 'same', 'smaller'))
})
#[1] "same" "smaller" "same"
NOTE
I took the previous answer as a given (do.call(paste, df)[c(13, 14, 16)])
A more generic approach (as noted by Ronak, last 2 digits will fail in some cases) would be,
sapply(strsplit(gsub("([[:digit:]])+\\s+[NA]+\\s+([[:digit:]])", '\\1_\\2',
do.call(paste, df)[c(13, 14, 16)]), ' '), function(i) {
v1 <- i[grepl('_', i)];
v2 <- as.numeric(strsplit(v1, '_')[[1]]); # compare as numbers, not as strings
ifelse(v2[1] > v2[2], 'greater',
ifelse(v2[1] == v2[2], 'same', 'smaller')) })
#[1] "same" "smaller" "same"
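As an alternative to pasting rows into strings, the comparison can also be done directly on the numeric values. The sketch below is one way to do this, not the approach from the answer above: the helper compare_across_gap() is a hypothetical name, it only looks at the first NA gap in a row, and it labels the result from the point of view of the value after the gap (so 3.2, NA, 4 gives "larger", as in the question). The data frame df_toy is a cut-down stand-in for df.

```r
# Hypothetical helper: compare the values on either side of the first NA gap
compare_across_gap <- function(v) {
  idx <- which(!is.na(v))             # positions of non-missing values
  gaps <- which(diff(idx) > 1)        # neighbours separated by at least one NA
  if (length(gaps) == 0) return("N/A")
  before <- v[idx[gaps[1]]]
  after  <- v[idx[gaps[1] + 1]]
  if (after > before) "larger" else if (after < before) "smaller" else "same"
}

# Toy rows mirroring rows 13, 14, 16 and 10 of df
df_toy <- data.frame(
  y1 = c(2,   NA,  NA,  1),
  y2 = c(3.1, 3.2, 3.1, 1),
  y3 = c(3.2, NA,  3.2, 1),
  y4 = c(NA,  4,   NA,  1),
  y5 = c(3.2, NA,  3.2, NA)
)

result <- unname(apply(as.matrix(df_toy), 1, compare_across_gap))
```

Leading and trailing NAs are ignored because only gaps between two non-missing values count.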

R: constructing bootstrap t confidence interval for 3 parameter estimates

I am trying to construct bootstrap-t confidence intervals for 3 parameter estimates, but I am only able to construct one for the first parameter. My code is below:
beta0<--8
beta1<-0.03
gamma<-0.0105
alpha<-0.05
n<-100
N<-10
for (i in 1:N)
{
u<-runif(n)
x<-rnorm(n)
c<-rexp(n,1/1255)
t1<-(1/gamma)*log(1-((gamma/exp(beta0+beta1*x))*log(1-u)))
t<-pmin(t1,c)
delta<-1*(t1>c)
length(delta)
delta[delta==1]<-ifelse(rbinom(length(delta[delta==1]),1,0.75),1,2)
deltae<-ifelse(delta==0, 1,0)
deltar<-ifelse(delta==1, 1,0)
deltai<-ifelse(delta==2, 1,0)
dat=data.frame(t,delta, deltae,deltar,deltai,x)
dat$interval[delta==2] <- as.character(cut(dat$t[delta==2], breaks=seq(0, 600, 100)))
labs <- cut(dat$t[delta==2], breaks=seq(0, 600, 100))
dat$lower[delta==2]<-as.numeric( sub("\\((.+),.*", "\\1", labs) )
dat$upper[delta==2]<-as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs) )
beta0hat.boot <- function(data,j)
{
dat<-data[j,]
data0<-dat[which(dat$delta==0),]#uncensored data
data1<-dat[which(dat$delta==1),]#right censored data
data2<-dat[which(dat$delta==2),]#interval censored data
library(maxLik)
#without imputation
ll<-function(para)
{
b0<-para[1]
b1<-para[2]
g<-para[3]
e<-sum((b0+b1*data0$x)+g*data0$t+(1/g)*exp(b0+b1*data0$x)*(1-exp(g*data0$t)))
r<-sum((1/g)*exp(b0+b1*data1$x)*(1-exp(g*data1$t)))
i<-sum(log(exp((1/g)*exp(b0+b1*data2$x)*(1-exp(g*data2$lower)))-exp((1/g)*exp(b0+b1*data2$x)*(1-exp(g*data2$upper)))))
l<-e+r+i
return(l)
}
est<-maxLik(logLik=ll,start=c(para<-c(-8,0.03,0.0105)))
beta0hat<-est$estimate[1]
beta1hat<-est$estimate[2]
gammahat<-est$estimate[3]
observed<-solve(-est$hessian)
return(c(beta0hat,beta1hat,gammahat,observed[1,1],observed[2,2],observed[3,3]))
}
library(boot)
out<- boot(dat,beta0hat.boot,100)
ci<-boot.ci(out,type =c("stud","perc"),var.t0= out$t0[4],var.t=out$t[,4])
ci1<-boot.ci(out,type = c("stud","perc"),var.t0= out$t0[4],var.t=out$t[,4],index=1)
}
I am only able to construct the confidence interval for the first parameter, and only without using index=1, as follows:
ci<-boot.ci(out,type =c("stud","perc"),var.t0= out$t0[4],var.t=out$t[,4])
When I add index=1,
ci1<-boot.ci(out,type = c("stud","perc"),var.t0= out$t0[4],var.t=out$t[,4],index=1)
I get a warning:
In boot.ci(out, type = c("stud", "perc"), var.t0 = out$t0[4], var.t = out$t[, :
bootstrap variances needed for studentized intervals.
and the studentized interval is not built.
I want to add index=1 because I also need confidence intervals for the 2nd and 3rd parameters, using index=2 and index=3, but once I add the index I cannot get the studentized interval. Any idea how to get bootstrap-t confidence intervals for more than one statistic? With the code above I can get the percentile interval, but not the bootstrap-t interval.
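A possible way around the warning, based on my reading of the boot.ci() documentation: index may be a vector of length 2, where the first element is the position of the statistic and the second is the position of its variance in the output of the statistic function. The sketch below uses a deliberately simple, hypothetical statistic (two means and their estimated variances) instead of the maxLik fit, so only the indexing pattern carries over. For the function in the question, which returns three estimates followed by three variances, the analogous calls would use index = c(1, 4), c(2, 5) and c(3, 6).

```r
library(boot)

set.seed(1)
# Hypothetical toy data; stands in for dat
toy <- data.frame(x = rnorm(50), y = rnorm(50, mean = 2))

# Returns the statistics first, then their variance estimates, in matching order
stat_fun <- function(d, j) {
  s <- d[j, ]
  n <- nrow(s)
  c(mean(s$x), mean(s$y), var(s$x) / n, var(s$y) / n)
}

out <- boot(toy, stat_fun, R = 200)

# index = c(position of statistic, position of its variance)
ci_x <- boot.ci(out, type = c("stud", "perc"), index = c(1, 3))
ci_y <- boot.ci(out, type = c("stud", "perc"), index = c(2, 4))
```

With a length-2 index, var.t0 and var.t do not need to be passed explicitly, since they are taken from the indicated position.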
dput(head(dat, 30)):
structure(list(t = c(143.786081550783, 104.647251273501, 65.5655695306165,
322.260530195967, 307.324160556309, 123.725827237157, 143.083942557736,
426.646805286557, 360.799323547846, 246.295906287976, 315.638222801499,
421.155652813492, 374.160625875751, 123.570819609099, 389.553807438155,
449.110810924173, 162.547725691726, 296.674617375856, 229.680453578098,
343.823664337269, 268.797764971971, 205.704838761594, 14.8630247008987,
91.6607201565057, 260.886289739501, 193.278377859747, 143.269719508224,
27.4780640122481, 267.191708749538, 39.8027657018974), delta = c(1,
0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 1, 0, 1, 0, 1), deltae = c(0, 1, 1, 1, 0, 1, 1, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0
), deltar = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1), deltai = c(0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), x = c(-0.377643157724086, 0.218708630964153,
0.153162542263512, 1.28222373181234, 1.1423312203422, -1.04726335577352,
-0.360028857222088, 0.336098821643731, 0.758860540656158, 0.0533940581013279,
-0.0562132826178519, 0.0798656325287969, -0.748956798800072,
-0.235929730488004, -0.737049993834757, 1.05819046250488, 1.28776064495481,
0.457930197196181, -1.45563588128583, -1.1074384621643, -0.829026816498185,
-1.3824961444269, -1.58951008909026, -0.95052226776903, 0.0145909317659764,
-0.198775419436042, 0.0481467746529365, -0.136098038815233, -0.118319488454131,
-0.498263758291143), interval = c(NA, NA, NA, NA, "(300,400]",
NA, NA, NA, NA, NA, "(300,400]", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), lower = c(NA,
NA, NA, NA, 300, NA, NA, NA, NA, NA, 300, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), upper = c(NA,
NA, NA, NA, 400, NA, NA, NA, NA, NA, 400, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("t",
"delta", "deltae", "deltar", "deltai", "x", "interval", "lower",
"upper"), row.names = c(NA, 30L), class = "data.frame")

Stratify then impute in R - using mi()

I want to "stratify-then-impute" using the packages available in R.
That is, I am hoping to:
1) stratify my dataset using a binary variable called "arm". This variable has no missing data.
2) run an imputation model for the two subsets
3) combine the two imputed data sets
4) run a pooled analysis.
My dataset looks like:
dataSim <- structure(list(pid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), arm = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), X1 = c(0.1, NA, 0.51,
0.56, -0.82, NA, NA, NA, -0.32, 0.4, 0.58, NA, 0.22, -0.23, 1.49,
-1.88, -1.77, -0.94, NA, -1.34), X2 = c(NA, -0.13, NA, 1.2, NA,
NA, NA, 0.02, -0.04, NA, NA, 0.25, -0.81, -1.67, 1.01, 1.69,
-0.06, 0.07, NA, -0.11)), .Names = c("pid", "arm", "X1", "X2"
), row.names = c(NA, 20L), class = "data.frame")
To impute the data, I'm currently using the mi() function as follows:
library(mi)
data.1 <- dataSim[dataSim[,"arm"]==1,]
data.0 <- dataSim[dataSim[,"arm"]==0,]
data.miss.1 <- missing_data.frame(data.1)
data.miss.0 <- missing_data.frame(data.0)
imputations.1 <- mi(data.1, n.iter=5, n.chains=5, max.minutes=20, parallel=FALSE)
imputations.0 <- mi(data.0, n.iter=5, n.chains=5, max.minutes=20, parallel=FALSE)
complete(imputations.1) # viewing the imputed datasets
complete(imputations.0)
Then I don't know how to combine the 2 imputations in order to do a pooled analysis. I have unsuccessfully tried:
imputations <- rbind(imputations.0, imputations.1) # This doesn't work
# analysis.X1 <- pool(X1 ~ arm, data = imputations ) # This is what I want to run
I assume this method is a simplified version of including an interaction term when imputing, but I don't know how this is possible either.
Thanks
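The four steps listed above can be sketched in base R, independently of any particular imputation package. This is an illustration of the workflow only: impute_once() is a hypothetical stand-in imputer that draws missing values from the observed values of the same stratum, and in practice it would be replaced by the completed data sets from the real imputation model. The pooling step applies Rubin's rules by hand.

```r
set.seed(42)
# Hypothetical toy data standing in for dataSim
toy <- data.frame(arm = rep(0:1, each = 10),
                  X1  = c(rnorm(10), rnorm(10, mean = 1)))
toy$X1[c(2, 6, 12, 19)] <- NA

# Stand-in imputer (assumption): resample observed values within one stratum
impute_once <- function(d) {
  miss <- is.na(d$X1)
  d$X1[miss] <- sample(d$X1[!miss], sum(miss), replace = TRUE)
  d
}

m <- 5
# Steps 1-3: split by arm, impute each subset, stack the two completed subsets
completed <- lapply(seq_len(m), function(k) {
  do.call(rbind, lapply(split(toy, toy$arm), impute_once))
})

# Step 4: fit the analysis model on each completed data set
fits <- lapply(completed, function(d) lm(X1 ~ arm, data = d))
est  <- sapply(fits, function(f) unname(coef(f)["arm"]))
se2  <- sapply(fits, function(f) vcov(f)["arm", "arm"])

# Rubin's rules: total variance = within + (1 + 1/m) * between
pooled_est <- mean(est)
within_var <- mean(se2)
between_var <- var(est)
pooled_se <- sqrt(within_var + (1 + 1 / m) * between_var)
```

The key point for the "combine" step is that the k-th completed data set from arm 0 is stacked with the k-th completed data set from arm 1, so there are still m stacked data sets to analyse and pool.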

Mapping content of one matrix onto structure of another matrix

I have two matrices sourced from the same dataset but with different amounts of data available for each.
I want to create a dataset that replicates x in terms of column names and row names but contains the data values from y. If a value is not available in y, NA should be used for that coordinate.
Not all of the row names in x are present in y and vice versa. The same holds true for the column names.
For the example input data I've given below, the row names in x correspond to those in y through the part of the name before the | (I want to retain everything after the | for other mappings).
What is the most efficient way to do this?
DESIRED OUTPUT
z = structure(c(NA, 1, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, NA,
NA, 0, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Dim = c(11L, 5L), .Dimnames = list(
c("AACSL|729522", "AACS|65985", "AADACL2|344752", "AADACL3|126767",
"AADACL4|343066", "AADAC|13", "AADAT|51166", "AAGAB|79719",
"AAK1|22848", "AAK12|14", "AANAT|15"), c("S18", "S20", "S45",
"S95", "S100")))
EXAMPLE INPUT
x = structure(c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0), .Dim = c(11L,
5L), .Dimnames = list(c("AACSL|729522", "AACS|65985", "AADACL2|344752",
"AADACL3|126767", "AADACL4|343066", "AADAC|13", "AADAT|51166",
"AAGAB|79719", "AAK1|22848", "AAK12|14", "AANAT|15"), c("S18",
"S20", "S45", "S95", "S100")))
y = structure(c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0), .Dim = c(11L, 4L), .Dimnames = list(c("A1BG",
"A1CF", "A2ML1", "A4GALT", "AACS", "AAK1", "AARD", "AARS2", "AASDHPPT",
"AASS", "BAACS"), c("S18", "S10", "S45", "S95")))
I think there might be a slight problem with the example that you provided: I cannot see how z is derived from the x and y above. See this code:
intersect(sapply(rownames(x), #I am just extracting the letter codes here
function(i){
return(
strsplit(x=i,split="|",fixed=TRUE)[[1]][[1]])
}),rownames(y))
#[1] "AACS" "AAK1"
Weird, right? I mean, only 2 codes in y match those in x. However, I think the code below does what you are planning (with the exception of this inconsistency):
library(data.table)
library(reshape2)
library(dplyr)
x %>% as.data.frame %>% mutate(rownames=rownames(x)) %>%
mutate(nms=sapply(rownames(x),
function(i){
return(
strsplit(x=i,split="|",fixed=TRUE)[[1]][[1]])
})) %>%
melt(id.vars=c("nms","rownames")) %>%
merge(., y %>% as.data.frame %>% mutate(nms=rownames(y))%>% melt(id.vars="nms"), by=c("variable","nms"), all.x=TRUE) %>%
select(-nms, -value.x) %>% dcast(formula = rownames~variable, value.var="value.y") -> xy
#now put back the column names where they belong
rownames(xy)<-xy$rownames
#now the only thing left is to arrange the columns
xy[rownames(x),colnames(x)] -> xy
Or am I wrong in understanding some of your points?
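For comparison, the same mapping can be sketched in base R with match(), without reshaping. This is a minimal illustration on made-up 3-by-3 and 2-by-2 matrices (the values are hypothetical; only the naming scheme follows the question). Indexing y with the NA positions returned by match() yields NA rows and columns for names that y does not contain.

```r
# Toy x: only its dimnames matter here
x <- matrix(0, nrow = 3, ncol = 3,
            dimnames = list(c("AACS|65985", "AADAC|13", "AAK1|22848"),
                            c("S18", "S20", "S45")))
# Toy y: fewer rows and columns, row names without the "|id" suffix
y <- matrix(c(1, 0, 0, 1), nrow = 2,
            dimnames = list(c("AACS", "AAK1"), c("S18", "S45")))

# Part of each x row name before the "|"
keys <- sub("\\|.*$", "", rownames(x))

# Rows/columns missing from y become NA through the NA indices
z <- y[match(keys, rownames(y)), match(colnames(x), colnames(y)), drop = FALSE]
dimnames(z) <- dimnames(x)   # restore the full x names, including the "|id" part
```

Because the full row names of x are restored at the end, everything after the | stays available for other mappings.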
