I'm working on a presentation for our research lab about using UPGMA with the hclust() function. According to the literature, the branch length that UPGMA calculates for any pair of elements should be half the pairwise distance between those two elements.
The example dendrogram I'm building for the presentation isn't showing the branch lengths I expected. I can't find anything in ?hclust suggesting that I'm missing a function argument that would make the UPGMA algorithm use the raw distances as the branch lengths. I understand that in certain situations, due to the limits of floating-point accuracy, a dendrogram may not be exactly ultrametric (from here and here, and I'm sure elsewhere as well). That still doesn't explain why I see the raw pairwise distances being plotted as the branch length between two elements.
Using the data below, here's the code I used to plot an example dendrogram:
# hclust() expects a "dist" object; as.dist() takes the lower triangle of the matrix
demoDend <- hclust(d = as.dist(demoTable), method = "average") # make an hclust object
# use the ggdendro package to extract segments and labels for ggplot plotting
dendData <- ggdendro::dendro_data(demoDend)
dendSegs <- dendData$segments
dendLabs <- dendData$labels
library(ggplot2)
ggplot() +
  geom_segment(data = dendSegs, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = dendLabs, aes(x = x, y = y - 0.05, label = label, angle = 90)) +
  geom_hline(aes(yintercept = 0.333), linetype = 2, color = "blue") +
  geom_hline(aes(yintercept = 0.2), linetype = 2, color = "red") +
  theme_bw()
The two elements that stand out are 13195 and 13199, which have a distance of 0.2 and whose branch length is plotted as 0.2 (red line in the ggplot).
Even after examining the hclust object, some of the heights for the branches are the raw distances in the input matrix, not half the distance. Do I need to manually halve the heights in the object before plotting? Maybe I don't understand UPGMA as well as I thought? Any help or insight into the implementation of UPGMA with hclust() would be greatly appreciated.
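For reference, here is a minimal sketch of how hclust() reports merge heights, using a hypothetical three-element distance matrix (the names A, B, C and the values are made up for illustration and are not taken from the data here). As far as I can tell, the height stored for each merge is the average-linkage distance between the two clusters being joined, so a tip-to-node branch length in the UPGMA sense would be half of that height:

```r
# Hypothetical 3-element distance matrix (illustrative values only)
m <- matrix(c(0,   0.2, 0.6,
              0.2, 0,   0.5,
              0.6, 0.5, 0),
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

toy <- hclust(as.dist(m), method = "average")

# Merge heights: A and B join at their pairwise distance (0.2);
# C then joins at the mean of d(A, C) and d(B, C), i.e. (0.6 + 0.5) / 2
merge_heights <- toy$height

# Halving gives tip-to-node branch lengths in the UPGMA convention
upgma_branch <- merge_heights / 2

print(merge_heights)
print(upgma_branch)
```

If the plot should show UPGMA-style branch lengths, one option might be to halve the height component of the hclust object before plotting; that is an assumption about what the presentation needs, not something ?hclust prescribes.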
Here's the sample distance data that I'm working with, from dput()
demoTable <- structure(c(0, 0.333333333333333, 0.333333333333333, 0, 0, 0.333333333333333,
0.333333333333333, 1, 1, 1, 1, 1, 1, NA, 0, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, 0, 0,
0.333333333333333, 0.333333333333333, 1, 1, 1, 1, 1, 1, NA, NA,
NA, NA, 0, 0.333333333333333, 0.333333333333333, 1, 1, 1, 1,
1, 1, NA, NA, NA, NA, NA, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA,
NA, NA, NA, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA,
0, 0.6, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA,
NA, NA, 0, 0.6, 1, 0.5, 0.2, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 0, 0.5, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 0.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0
), .Dim = c(13L, 13L), .Dimnames = list(c("13187", "13188", "13189",
"13190", "13191", "13192", "13193", "13194", "13195", "13196",
"13197", "13198", "13199"), NULL))
I want to merge the rows for each record_id into one row based on the type column, except for volunteers who have two repeats in the repeat column; for those I would like a second row. Each record_id corresponds to one person who has come in for a test either once (repeat = 1) or twice, and therefore has one or two entries in the repeat column.
Here is what my data look like:
structure(list(record_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4), type = c(NA, "data_collection", "test", NA, "data_collection",
"test", NA, "data_collection", "test", "test", NA, "cata_collection",
"test", "test"), `repeat` = c(NA, 1, 1, NA, 1, 1, NA, 1, 1, 2,
NA, 1, 1, 2), dt_volunteer_reg = structure(c(1597246320, NA,
NA, 1599217080, NA, NA, 1596184500, NA, NA, NA, 1598192280, NA,
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), age = c(26,
NA, NA, 64, NA, NA, 51, NA, NA, NA, 39, NA, NA, NA), gender = c(0,
NA, NA, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA, NA), case_type = c(NA,
1, NA, NA, 2, NA, NA, 1, NA, NA, NA, 1, NA, NA), test_dis_dt = structure(c(NA,
NA, 1597250220, NA, NA, 1600012980, NA, NA, 1596382080, 1601980740,
NA, NA, 1598284020, 1603118700), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), test_dis_res = c(NA, NA, 1, NA, NA, 1, NA,
NA, 2, 2, NA, NA, 2, 2), test_dis_in = c(NA, NA, NA, NA, NA,
0.02, NA, NA, 6.13, 4.75, NA, NA, 7.23, 3.85), test_cont_dt = structure(c(NA,
NA, 1597250280, NA, NA, 1608636120, NA, NA, NA, 1601980740, NA,
NA, 1605704940, 1603205340), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
test_cont_res = c(NA, NA, 2, NA, NA, 1, NA, NA, NA, 2, NA,
NA, 2, 2), test_cont_val = c(NA, NA, 123, NA, NA, 0, NA,
NA, NA, 40000, NA, NA, 471.6, 306.5)), row.names = c(NA,
-14L), class = c("tbl_df", "tbl", "data.frame"))
And this is what I'm hoping to get
structure(list(record_id = c(1, 2, 3, 3, 4, 4), `repeat` = c(1,
1, 1, 2, 1, 2), dt_volunteer_reg = structure(c(1597246320, 1599217080,
1596184500, 1596184500, 1598192280, 1598192280), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), age = c(26, 64, 51, 51, 39, 39), gender = c(0,
1, 0, 0, 1, 1), case_type = c(1, 2, 1, 1, 1, 1), test_dis_dt = structure(c(1597250220,
1600012980, 1596382080, 1601980740, 1598284020, 1603118700), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), test_dis_res = c(1, 1, 2, 2, 2, 2),
test_dis_in = c(NA, 0.02, 6.13, 4.75, 7.23, 3.85), test_cont_dt = structure(c(1597250280,
1608636120, NA, 1601980740, 1605704940, 1603205340), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), test_cont_res = c(2, 1, NA, 2,
2, 2), test_cont_val = c(123, 0, NA, 40000, 471.6, 306.5)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Assuming the first data frame is called input and you are happy using the tidyverse, you can do it like this:
library(tidyverse)

input %>%
  nest(data = c(-record_id)) %>%
  mutate(
    data = map(data, ~replace_na(., as.list(head(., 1)))), # Fill in specimen details
    data = map(data, filter, !is.na(`repeat`)), # Remove specimen details row
    data = map(data, ~replace_na(., as.list(head(., 1)))), # Fill in test data with data collection details
    data = map(data, filter, type == "test") # Remove data collection rows
  ) %>%
  unnest(data) %>%
  select(-type)
There are ways to do this more concisely and/or faster, but this may be more readable.
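As one possible more concise variant, here is a sketch using tidyr::fill() on a small made-up tibble (toy and its columns age, case_type and test_res are hypothetical stand-ins for the real columns). It assumes, as in the question's data, that the registration row and the data collection row come before the test rows within each record_id, and it fills only the person-level columns so that values from one test never leak into another:

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the input: one registration row (type NA),
# one data collection row, then one or two test rows per record_id
toy <- tibble(
  record_id = c(1, 1, 1, 2, 2, 2, 2),
  type      = c(NA, "data_collection", "test",
                NA, "data_collection", "test", "test"),
  `repeat`  = c(NA, 1, 1, NA, 1, 1, 2),
  age       = c(26, NA, NA, 51, NA, NA, NA),
  case_type = c(NA, 1, NA, NA, 2, NA, NA),
  test_res  = c(NA, NA, 1, NA, NA, 2, NA)
)

merged <- toy %>%
  group_by(record_id) %>%
  fill(age, case_type, .direction = "down") %>% # carry person-level values down
  ungroup() %>%
  filter(type == "test") %>%                    # keep one row per test
  select(-type)

print(merged)
```

With the real data, the columns passed to fill() would be dt_volunteer_reg, age, gender and case_type.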
I am trying to convert 50 columns in my data frame into dummy variables, depending on an exact match with a given vector of 50 values, using a for loop.
I have never used loops before and can't figure out how to do it.
I first wanted to code this "by hand" for each of the 50 columns like this:
dBGK1a <- as.numeric(BGK1a == BGKright[1])
dBGK2a <- as.numeric(BGK2a == BGKright[2])
dBGK3a <- as.numeric(BGK3a == BGKright[3])
....
dBGK50a <- as.numeric(BGK50a == BGKright[50])
As this is very tedious, I tried to come up with a for loop that can handle this:
for(i in 1:50) {
for (j in seq(from = 348, to = 448, by = 2)){
data1[j] <- as.numeric(data1[j] == BGKright[i])
}
}
Somehow this doesn't work: I get the value 0 in every column for every observation.
data1 is my data frame. Here is a shorter version of it:
dput(head(data1[348:354], 20))
structure(list(BGK1a = c(NA, NA, NA, NA, NA, NA, NA, NA, 2, NA,
NA, NA, NA, NA, 2, 2, 2, 2, 1, 2), BGK1b = c(NA, NA, NA, NA,
NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 100, 100, 100, 99, 89,
50), BGK2a = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA, NA, 1, 2, 1, 2, 1, 1), BGK2b = c(NA, NA, NA, NA, NA, NA,
NA, NA, 50, NA, NA, NA, NA, NA, 100, 50, 96, 62, 93, 50), BGK3a = c(NA,
NA, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, 1, 1, 1,
1, 2), BGK3b = c(NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA,
NA, NA, NA, 100, 100, 50, 85, 82, 74), BGK4a = c(NA, NA, NA,
NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 1, 2, 2, 2, 1, 1)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The loop should select the respective value of "BGKright" with "i" and the column to process with "j". Note that "j" needs to jump 2 steps on every iteration because I only need to process every second column (from column 348 to column 448).
I would appreciate any help with this loop, as well as other solutions for this task that do not use loops.
Thank you in advance.
OK, I used BGKa = select(data1[348:448], ends_with("a")) to make a new data frame with only the relevant columns.
Then I used a for loop to create the dummies:
for (i in 1:50) {
  BGKa[i] <- as.numeric(BGKa[i] == BGKright[i])
}
This seems to work. Thanks for the help.
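For a version without an explicit loop, here is a minimal sketch using base R's Map(), which pairs each column with its own reference value one-to-one. The nested loop in the question instead compares every column against every value in turn and overwrites the column each time, which is a likely reason for the all-zero result. The small BGKa and BGKright below are made-up stand-ins for the real 50 columns and 50 values:

```r
# Hypothetical miniature: two "a" columns and their two reference values
BGKa <- data.frame(BGK1a = c(2, 1, NA),
                   BGK2a = c(1, 1, 2))
BGKright <- c(2, 1)

# Compare column k with BGKright[k] only; NA stays NA
dummies <- as.data.frame(
  Map(function(col, right) as.numeric(col == right), BGKa, BGKright)
)
names(dummies) <- paste0("d", names(BGKa))

print(dummies)
```

This keeps the original columns untouched and returns the dummies under the dBGK1a, dBGK2a, ... names from the "by hand" version.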
I am doing an analysis with groups and therefore need to make a grouping variable, for which I wanted to use gender (0 = male, 1 = female). I first created a vector of this variable (the manual told me to do this), but then I got the error "grouping variable must not contain purely numeric items". I then converted my vector to a logical (TRUE/FALSE), but I still get this error.
So my question is: does anyone know, in general terms, what the problem may be when I get this error?
Below is the dput() output for the head of my dataset:
structure(c(7, 8, 7, 5, 6, 6, 4.9, NA, 6.9, 5.1, 5.8, NA, NA,
NA, 7, 3, 7, NA, NA, NA, 6.7, 4.1, 5.9, NA, NA, NA, 5, 6, 7,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8, NA, NA, NA, 6.2,
4.3, 6.3, NA, NA, NA, 7, 5, 7, NA, NA, NA, 6.5, NA, NA, NA, NA,
NA, 6, NA, 7, NA, NA, NA, NA, NA, 5, NA, NA, NA, NA, NA, 7, NA,
NA, NA, NA, NA, 6.1, NA, NA, NA, NA, NA, 7, NA, NA, NA, NA, NA,
NA, NA, 16, 0.001, 12, 11, 11, 0.001, 0.001, 0.001, 12, 12, 12,
0.001, 0.001, 0.001, 12, 12, 12, 0.001, 0.001, 0.001, 15, 12,
12, 0.001, 0.001, 0.001, 16, 0.001, 12, 0.001, 0.001, 0.001,
0.001, 0.001, 15, 0.001, 0.001, 0.001, 0.001, 0.001, 16, 0.001,
0, 1, 0, 0, 1, 0), .Dim = c(6L, 24L), .Dimnames = list(c("800009",
"800012", "800015", "800033", "800042", "800045"), c("gener_sat_T0",
"sel_T0", "gener_sat_T1", "sel_T1", "gener_sat_T2", "sel_T2",
"gener_sat_T3", "sel_T3", "gener_sat_T4", "sel_T4", "gener_sat_T5",
"sel_T5", "gener_sat_T6", "sel_T6", "gener_sat_T7", "sel_T7",
"dT1", "dT2", "dT3", "dT4", "dT5", "dT6", "dT7", "female")))
Then what I am trying to do is fit a CT model (I have used it before on non-grouped data, and that worked fine).
CTMODEL <- ctModel(n.latent = 2, n.manifest = 2, Tpoints = 8,
manifestNames = c("gener_sat", "sel"),
latentNames = c("gener_sat", "sel"), LAMBDA = diag(2))
fit_CTMODEL <- ctMultigroupFit(datawide = data_wide, groupings=female, ctmodelobj = CTMODEL)
Thanks a bunch!
OK, I redid your computations. Your code was not directly reproducible, so I made some changes, and now it works:
# create the structure object (data_wide), and change it to remove the
# grouping:
data_wide = structure(c(7, 8, 7, 5, 6, 6, 4.9, NA, 6.9, 5.1, 5.8, NA, NA,
NA, 7, 3, 7, NA, NA, NA, 6.7, 4.1, 5.9, NA, NA, NA, 5, 6, 7,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8, NA, NA, NA, 6.2,
4.3, 6.3, NA, NA, NA, 7, 5, 7, NA, NA, NA, 6.5, NA, NA, NA, NA,
NA, 6, NA, 7, NA, NA, NA, NA, NA, 5, NA, NA, NA, NA, NA, 7, NA,
NA, NA, NA, NA, 6.1, NA, NA, NA, NA, NA, 7, NA, NA, NA, NA, NA,
NA, NA, 16, 0.001, 12, 11, 11, 0.001, 0.001, 0.001, 12, 12, 12,
0.001, 0.001, 0.001, 12, 12, 12, 0.001, 0.001, 0.001, 15, 12,
12, 0.001, 0.001, 0.001, 16, 0.001, 12, 0.001, 0.001, 0.001,
0.001, 0.001, 15, 0.001, 0.001, 0.001, 0.001, 0.001, 16, 0.001), .Dim =
c(6L, 23L),
.Dimnames = list(c("800009", "800012", "800015", "800033", "800042",
"800045"),
c("gener_sat_T0", "sel_T0", "gener_sat_T1", "sel_T1",
"gener_sat_T2",
"sel_T2", "gener_sat_T3", "sel_T3", "gener_sat_T4",
"sel_T4", "gener_sat_T5",
"sel_T5", "gener_sat_T6", "sel_T6", "gener_sat_T7",
"sel_T7",
"dT1", "dT2", "dT3", "dT4", "dT5", "dT6", "dT7")))
CTMODEL <- ctModel(n.latent = 2, n.manifest = 2, Tpoints = 8,
manifestNames = c("gener_sat", "sel"),
latentNames = c("gener_sat", "sel"), LAMBDA = diag(2))
fem = c("f", "m", "f", "f", "m", "f") # grouping, which needs to be a
# character vector
fit_CTMODEL <- ctMultigroupFit(dat = data_wide, groupings=fem, ctmodelobj =
CTMODEL) # dat instead of datawide
So in the end it's just a matter of making the grouping variable a character vector.
Addendum: the code runs but gives various errors:
Not all eigenvalues of Hessian are greater than 0
Fit attempt generated errors
Retry limit reached
I guess that's because of the model and leave the solution to you :)
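Rather than typing the labels by hand, the grouping vector could be derived from the 0/1 column. A minimal sketch, assuming the coding from the question (0 = male, 1 = female) and the six values in the female column of the posted data; the labels "f" and "m" are arbitrary, they just need to be non-numeric strings:

```r
# The 0/1 'female' values from the six rows in the question's dput() output
female <- c(0, 1, 0, 0, 1, 0)

# Map the numeric codes to letter labels; as.character(female) would give
# "0"/"1", which presumably still counts as "purely numeric"
fem <- ifelse(female == 1, "f", "m")

print(fem)
```

Note that the hand-typed fem above has the two labels swapped relative to this coding, which defines the same two groups and so should not change the grouping itself.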
Hello: I am getting slightly different medians for a data set like the one created below, depending on whether I compute them via dplyr/tidyr or via aggregate. Can anyone explain the difference? Thank you!
#dataset
out2<-structure(list(d3 = structure(c(1L, 2L, NA, NA, 1L, 1L, NA,
2L,NA,3L,1L, NA, NA, 1L, 3L, NA, 1L, 2L, 3L, 2L, 1L, 3L, 2L, 3L, 1L), .Label
= c("Professional journalist", "Elected politician", "Online blogger"),
class = "factor"), Accessible = c(3, 5, 2,NA, 1, 2, NA, 3, NA, 4, 2, 5, NA,
3, 4, NA, 2, NA, 3, 4, 4, 4,2, 2, 2), Information = c(1, 2, 1, NA, 4, 1, NA,
2, NA, 2, 1, 1, NA, 4, 1, NA, 1, 1, 1, 3, 1, 3, 3, 4, 1), Responsive = c(5,
4, 6, NA, 2, 3, NA, 1, NA, 5, 4, 4, NA, 6, 3, NA, 4, NA, 2, 2, 6, 2, 1, 1,
3), Debate = c(6, 3, 4, NA, 3, 4, NA, 5, NA, 6, 5,6, NA, 1, 5, NA, 5, 2, NA,
1, 5, 6, 5, 5, 7), Officials = c(2,1, 5, NA, 5, 5, NA, 6, NA, 3, 6, 2, NA, 2,
2, NA, 6, 3, NA, 5,2, 5, 4, 6, 5), Social = c(7, 6, 7, NA, 7, 7, NA, 4, NA,
7, 7,
7, NA, 7, 7, NA, 7, NA, NA, 7, 7, 1, 6, 7, 6), `Trade-Offs` = c(4,
7, 3, NA, 6, 6, NA, 7, NA, 1, 3, 3, NA, 5, 6, NA, 3, NA, NA,
6, 3, 7, 7, 3, 4)), .Names = c("d3", "Accessible", "Information",
"Responsive", "Debate", "Officials", "Social", "Trade-Offs"), row.names =
c(171L, 126L, 742L, 379L, 635L, 3L, 303L, 419L, 324L, 97L, 758L, 136L,
770L, 405L, 101L, 674L, 386L, 631L, 168L, 590L, 731L, 387L, 673L, 208L,
728L), class = "data.frame")
# Packages used below
library(dplyr)
library(tidyr)
library(ggplot2)
# Find medians via tidyr and dplyr
test <- out2 %>%
  gather(variable, value, -1) %>%
  filter(!is.na(d3)) %>%
  group_by(d3, variable) %>%
  summarise(value = median(value, na.rm = TRUE))
# Convert to a plain data frame
test <- data.frame(test)
# Find medians via aggregate
test2 <- aggregate(. ~ d3, data = out2, FUN = median, na.rm = TRUE)
# Gather for plotting
test2 <- test2 %>%
  gather(variable, value, -d3)
# Plot medians via tidyr
ggplot(test, aes(x = d3, y = value, group = d3)) +
  facet_wrap(~variable) +
  geom_bar(stat = 'identity') +
  labs(title = 'Medians via TidyR')
# Plot medians via aggregate
ggplot(test2, aes(x = d3, y = value, group = d3)) +
  facet_wrap(~variable) +
  geom_bar(stat = 'identity') +
  labs(title = 'Medians via Aggregate')
# Compare Debate, Information and Responsive
The results produced by aggregate are different because aggregate is dropping entire rows where any value is NA, even if some variables in that row contain data.
You can correct this by specifying a value for the na.action argument, as described in this accepted answer. Here it would be:
test2<-aggregate(.~d3, data=out2, FUN=median, na.rm = TRUE, na.action=NULL)
test2<-test2 %>%
gather(variable, value, -d3)
Confirm that the results are the same:
identical(as.data.frame(test %>% arrange(d3, variable, value)),
as.data.frame(test2 %>% arrange(d3, variable, value)))
[1] TRUE
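The row-dropping behaviour can be seen on a tiny example. This is a sketch with a made-up data frame (toy, g, x and y are hypothetical names): with the default na.action = na.omit, the formula method of aggregate() discards the whole second row because y is NA there, so the median of x for group "a" is computed from two values instead of three:

```r
# Hypothetical data: row 2 has an NA in y but a valid x
toy <- data.frame(g = c("a", "a", "a", "b", "b"),
                  x = c(1, 2, 9, 4, 6),
                  y = c(1, NA, 3, 5, 7))

# Default: rows with any NA are removed before FUN is applied
dropped <- aggregate(. ~ g, data = toy, FUN = median, na.rm = TRUE)

# na.action = NULL keeps all rows; na.rm = TRUE then handles NAs per column
kept <- aggregate(. ~ g, data = toy, FUN = median, na.rm = TRUE,
                  na.action = NULL)

print(dropped) # x for group "a" is median(c(1, 9))
print(kept)    # x for group "a" is median(c(1, 2, 9))
```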