cluster analysis with weight - r

I have a data frame 'heat' demonstrating people's performance across time.
'Var1' represents the code of persons.
'Var2' represents a time line (measured by number of days from the starting point).
'Variable' is the score they get at a given time point.
Var1 Var2 value
1 1 36 -0.6941826
2 2 36 -0.5585414
3 3 36 0.8032384
4 4 36 0.7973031
5 5 36 0.7536959
6 6 36 -0.5942059
....
54 10 73 0.7063218
55 11 73 -0.6949616
56 12 73 -0.6641516
57 13 73 0.6890433
58 14 73 0.6310124
59 15 73 -0.6305091
60 16 73 0.6809655
61 17 73 0.8957870
....
101 13 110 0.6495796
102 14 110 0.5990869
103 15 110 -0.6210600
104 16 110 0.6441960
105 17 110 0.7838654
....
Now I want to cluster their performance and reflect it on a heatmap. So I used the function dist() and hclust() to clustered the data frame and plotted it with ggplot2:
ggplot(data = heat) + geom_tile(aes(x = Var2, y = Var1 %>% as.character(),
fill = value)) +
scale_fill_gradient(low = "yellow",high = "red") +
geom_vline(xintercept = c(746, 2142, 2917))
It looks like this:
However, I am more interested in what happened around day 746, day 2142 and day 2917 (the black lines). I would like the scores around these days bearing more weight in the clustering. I want people demonstrating similar performance around these days to have more priority to be clustered together. Is there a way of doing this?

As long as your weights are integer, you supposedly can just replicate those days artificially.
If you want more control, just compute the distance matrix yourself, with whatever weighted distance you want to use.

Related

Add category mean value to faceted scatter plots in ggplot

I am using facet wrap to plot Weight Gain versus Caloric Intake for four different diets. Diet is a four-level factor, Weight Gain and Caloric Intake are numeric. I am adding a regression line to each plot facet. What I want to do is add a horizontal line for the group mean weight gain for each diet in the plot (4 different mean values). The problem is when I use the geom_hline function it puts the global mean on all of the plots, which is not what I want.
I tried using stat_summary(fun.y=mean,geom="line"), but it gives me line segments joining each of the points in every plot.
Below is the code I am using that is giving me the single global mean on all plots. Also the data set I am using. I've included the labeller code for completeness but I really just need help with drawing the group mean lines.
Thanks in advance for any help.
# Calculate slopes and means to use for facet labels
#
wgSlope<-rep(NA,nlevels(vitaminData$Diet))
dietMeans<-rep(NA,nlevels(vitaminData$Diet))
for (i in 1:nlevels(vitaminData$Diet)){
dietMeans[i]<-mean(filter(vitaminData,Diet==i)$WeightGain)
#
# Get regression lines and coefficients for each facet
#
lm<-lm(WeightGain~CaloricIntake,data=filter(vitaminData,Diet==i))
wgSlope[i]<-lm$coefficients[2]
}
#
# Build facet labels
#
dietLabel<-c(`1`=
paste("Diet 1, Slope=",round(wgSlope[1],2),", Mean=",round(dietMeans[1],1)),
`2`=paste("Diet 2, Slope=",round(wgSlope[2],2),", Mean=",round(dietMeans[2],1)),
`3`=paste("Diet 3, Slope =",round(wgSlope[3],2),", Mean=",round(dietMeans[3],1)),
`4`=paste("Diet 4, Slope =",round(wgSlope[4],2),", Mean=",round(dietMeans[4],1)))
#
# Draw the plots
#
ggplot(data=vitaminData,
aes(y=WeightGain,x=CaloricIntake,color=Diet))+
theme_bw()+
geom_point(aes(color=Diet,fill=Diet,shape=Diet))+
geom_smooth(method="lm",se=FALSE,linetype=2,alpha=0.5)+
labs(x="Caloric Intake",y="Weight Gain")+
scale_color_manual(values=c("red","blue","orange","darkgreen"))+
geom_hline(yintercept=mean(vitaminData$WeightGain))+
facet_wrap(~Diet,labeller=labeller(Diet=dietLabel))+
theme(legend.position="none")
Diet WeightGain CaloricIntake
<fct> <dbl> <dbl>
1 1 48 35
2 1 67 44
3 1 78 44
4 1 69 51
5 1 53 47
6 2 65 40
7 2 49 45
8 2 37 37
9 2 73 53
10 2 63 42
11 3 79 51
12 3 52 41
13 3 63 47
14 3 65 47
15 3 67 48
16 4 59 53
17 4 50 52
18 4 59 52
19 4 42 45
20 4 34 38
Here's an approach using dplyr. (Add library(dplyr) or library(tidyverse) if not already loaded.)
geom_hline(data = vitaminData %>%
group_by(Diet) %>%
summarize(mean = mean(WeightGain)),
aes(yintercept = mean)) +

fit a normal distribution to grouped data, giving expected frequencies

I have a frequency distribution of observations, grouped into counts within class intervals.
I want to fit a normal (or other continuous) distribution, and find the expected frequencies in each interval according to that distribution.
For example, suppose the following, where I want to calculate another column, expected giving the
expected number of soldiers with chest circumferences in the interval given by chest, where these
are assumed to be centered on the nominal value. E.g., 35 = 34.5 <= y < 35.5. One analysis I've seen gives the expected frequency in this cell as 72.5 vs. the observed 81.
> data(ChestSizes, package="HistData")
>
> ChestSizes
chest count
1 33 3
2 34 18
3 35 81
4 36 185
5 37 420
6 38 749
7 39 1073
8 40 1079
9 41 934
10 42 658
11 43 370
12 44 92
13 45 50
14 46 21
15 47 4
16 48 1
>
> # ungroup to a vector of values
> chests <- vcdExtra::expand.dft(ChestSizes, freq="count")
There are quite a number of variations of this question, most of which relate to plotting the normal density on top of a histogram, scaled to represent counts not density. But none explicitly show the calculation of the expected frequencies. One close question is R: add normal fits to grouped histograms in ggplot2
I can perfectly well do the standard plot (below), but for other things, like a Chi-square test or a vcd::rootogram plot, I need the expected frequencies in the same class intervals.
> bw <- 1
n_obs <- nrow(chests)
xbar <- mean(chests$chest)
std <- sd(chests$chest)
plt <-
ggplot(chests, aes(chest)) +
geom_histogram(color="black", fill="lightblue", binwidth = bw) +
stat_function(fun = function(x)
dnorm(x, mean = xbar, sd = std) * bw * n_obs,
color = "darkred", size = 1)
plt
here is how you could calculate the expected frequencies for each group assuming Normality.
xbar <- with(ChestSizes, weighted.mean(chest, count))
sdx <- with(ChestSizes, sd(rep(chest, count)))
transform(ChestSizes, Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))
chest count Expected
1 33 3 4.7600583
2 34 18 20.8822328
3 35 81 72.5129162
4 36 185 199.3338028
5 37 420 433.8292832
6 38 749 747.5926687
7 39 1073 1020.1058521
8 40 1079 1102.2356155
9 41 934 943.0970605
10 42 658 638.9745241
11 43 370 342.7971793
12 44 92 145.6089948
13 45 50 48.9662992
14 46 21 13.0351612
15 47 4 2.7465640
16 48 1 0.4579888

ggplot each group consists of only one observation

I'm trying to make a plot similar to this answer: https://stackoverflow.com/a/4877936/651779
My data frame looks like this:
df2 <- read.table(text='measurements samples value
1 4hours sham1 6
2 1day sham1 175
3 3days sham1 417
4 7days sham1 163
5 14days sham1 37
6 90days sham1 134
7 4hours sham2 8
8 1day sham2 402
9 3days sham2 482
10 7days sham2 67
11 14days sham2 16
12 90days sham2 31
13 4hours sham3 185
14 1day sham3 402
15 3days sham3 482
16 7days sham3 85
17 14days sham3 29
18 90days sham3 10',header=T)
And plot it with
ggplot(df2, aes(measurements, value)) + geom_line(aes(colour = samples))
No lines show in the plot, and I get the message
geom_path: Each group consist of only one observation.
Do you need to adjust the group aesthetic?
I don't see where what I'm doing is different from the answer I linked above. What should I change to make this work?
Add group = samples to the aes of geom_line. This is necessary since you want one line per samples rather than for each data point.
ggplot(df2, aes(measurements, value)) +
geom_line(aes(colour = samples, group = samples))

Barchart help in R

I am trying to set up a bar chart to compare control and experimental samples taken of specific compounds. The data set is known as 'hydrocarbon3' and contains the following information:
Exp. Contr.
c12 89 49
c17 79 30
c26 78 35
c42 63 3
pris 0.5 0.8
phy 0.5 0.9
nap 87 48
nap1 83 44
nap2 78 44
nap3 73 20
acen1 81 50
acen2 86 46
fluor 83 11
fluor1 68 13
fluor2 79 17
dibe 65 7
dibe1 67 6
dibe2 56 10
phen 82 13
phen1 70 12
phen2 65 15
phen3 53 14
fluro 62 9
pyren 48 11
pyren1 34 10
pyren2 19 8
chrys 22 3
chrys1 21 3
chrys2 21 3
When I create a bar chart with the formula:
barplot(as.matrix(hydrocarbon3),
main=c("Fig 1. Change in concentrations of different hydrocarbon compounds\nin sediments with and without the presence of bacteria after 21 days"),
beside=TRUE,
xlab="Oiled sediment samples collected at 21 days",
space=c(0,2),
ylab="% loss in concentration relative to day 0")
I receive this diagram, however I need the control and experimental samples of each chemical be next to each other allow a more accurate comparison, rather than the experimental samples bunched on the left and control samples bunched on the right: Is there a way to correct this on R?
Try transposing your matrix:
barplot(t(as.matrix(hydrocarbon3)), beside=T)
Basically, barplot will plot things in the order they show up in the matrix, which, since a matrix is just a vector wrapped colwise, means barplot will plot all the values of the first column, then all those of the second column, etc.
Check this question out: Barplot with 2 variables side by side
It uses ggplot2, so you'll have to use the following code before running it:
intall.packages("ggplot2")
library(ggplot2)
Hopefully this works for you. Plus it looks a little nicer with ggplot2!
> df
row exp con
1 a 1 2
2 b 2 3
3 c 3 4
> barplot(rbind(df$exp,df$con),
+ beside = TRUE,names.arg=df$row)
produces:

How can I get column data to be added based on a group designation using R?

The data set that I'm working with is similar to the one below (although the example is of a much smaller scale, the data I'm working with is 10's of thousands of rows) and I haven't been able to figure out how to get R to add up column data based on the group number. Essentially I want to be able to get the number of green(s), blue(s), and red(s) added up for all of group 81 and 66 separately and then be able to use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.
One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
Both of the answers above are perfect examples of how to address this type of problem. Two other options exist within reshape and plyr
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that #Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134

Resources