Adding confidence bands for log growth curve - r

I'm working with data that show a log growth curve. I was able to fit a non-linear mixed-effects regression nicely using the nlme package. However, I am uncertain how to add confidence bands around the estimated lines. Can anyone help?
Please find data and code below:
Data:
Harvest Plot Irrigation Graft Rep AwtRun
1 11b 1 b 1 0
2 11b 1 b 1 1.6
3 11b 1 b 1 7.67
4 11b 1 b 1 11.96
5 11b 1 b 1 18.82
6 11b 1 b 1 31.43
7 11b 1 b 1 41.84
8 11b 1 b 1 45.08
9 11b 1 b 1 48.09
10 11b 1 b 1 48.8
11 11b 1 b 1 51.73
12 11b 1 b 1 54.13
13 11b 1 b 1 60.56
14 11b 1 b 1 63.44
15 11b 1 b 1 65.44
16 11b 1 b 1 67.33
1 11c 1 c 1 0
2 11c 1 c 1 0.86
3 11c 1 c 1 1.6
4 11c 1 c 1 5.41
5 11c 1 c 1 10.17
6 11c 1 c 1 20.4
7 11c 1 c 1 23.32
8 11c 1 c 1 23.99
9 11c 1 c 1 25.23
10 11c 1 c 1 25.89
11 11c 1 c 1 27.71
12 11c 1 c 1 29.64
13 11c 1 c 1 30.81
14 11c 1 c 1 33.09
15 11c 1 c 1 35.66
16 11c 1 c 1 36.59
1 11s 1 s 1 0.82
2 11s 1 s 1 0.82
3 11s 1 s 1 1.19
4 11s 1 s 1 4.39
5 11s 1 s 1 11.77
6 11s 1 s 1 15.81
7 11s 1 s 1 21.9
8 11s 1 s 1 28.16
9 11s 1 s 1 33.63
10 11s 1 s 1 45.22
11 11s 1 s 1 49.45
12 11s 1 s 1 51.71
13 11s 1 s 1 54.82
14 11s 1 s 1 57.44
15 11s 1 s 1 57.61
16 11s 1 s 1 58.38
1 12b 2 b 1 0
2 12b 2 b 1 0.9
3 12b 2 b 1 2.19
4 12b 2 b 1 7.1
5 12b 2 b 1 10.98
6 12b 2 b 1 26.48
7 12b 2 b 1 32.08
8 12b 2 b 1 37.58
9 12b 2 b 1 40.45
10 12b 2 b 1 48.27
11 12b 2 b 1 53.03
12 12b 2 b 1 55.05
13 12b 2 b 1 55.05
14 12b 2 b 1 55.75
15 12b 2 b 1 56.57
16 12b 2 b 1 57.57
1 12c 2 c 1 0
2 12c 2 c 1 0
3 12c 2 c 1 5.05
4 12c 2 c 1 10.08
5 12c 2 c 1 13.65
6 12c 2 c 1 25.03
7 12c 2 c 1 26.9
8 12c 2 c 1 27.47
9 12c 2 c 1 28.66
10 12c 2 c 1 31.98
11 12c 2 c 1 34.79
12 12c 2 c 1 35.2
13 12c 2 c 1 36.65
14 12c 2 c 1 38.41
15 12c 2 c 1 38.68
16 12c 2 c 1 38.94
1 12s 2 s 1 0
2 12s 2 s 1 0
3 12s 2 s 1 0.39
4 12s 2 s 1 4.59
5 12s 2 s 1 8.02
6 12s 2 s 1 17.45
7 12s 2 s 1 25.83
8 12s 2 s 1 33.04
9 12s 2 s 1 35.87
10 12s 2 s 1 52.42
11 12s 2 s 1 57.91
12 12s 2 s 1 57.91
13 12s 2 s 1 57.91
14 12s 2 s 1 57.91
15 12s 2 s 1 57.91
16 12s 2 s 1 58.38
1 21b 1 b 2 0
2 21b 1 b 2 0
3 21b 1 b 2 1.36
4 21b 1 b 2 6.2
5 21b 1 b 2 10.08
6 21b 1 b 2 17.53
7 21b 1 b 2 21.36
8 21b 1 b 2 24.92
9 21b 1 b 2 31.62
10 21b 1 b 2 47.42
11 21b 1 b 2 50.85
12 21b 1 b 2 50.85
13 21b 1 b 2 53.27
14 21b 1 b 2 53.66
15 21b 1 b 2 53.93
16 21b 1 b 2 56.48
1 21c 1 c 2 0
2 21c 1 c 2 0
3 21c 1 c 2 0.74
4 21c 1 c 2 6.44
5 21c 1 c 2 13.8
6 21c 1 c 2 20.12
7 21c 1 c 2 20.75
8 21c 1 c 2 23.58
9 21c 1 c 2 23.58
10 21c 1 c 2 28.69
11 21c 1 c 2 30.4
12 21c 1 c 2 31.74
13 21c 1 c 2 33.86
14 21c 1 c 2 34.06
15 21c 1 c 2 35.15
16 21c 1 c 2 36
1 21s 1 s 2 0
2 21s 1 s 2 0
3 21s 1 s 2 1.67
4 21s 1 s 2 3.41
5 21s 1 s 2 8.36
6 21s 1 s 2 16.97
7 21s 1 s 2 23.85
8 21s 1 s 2 28.16
9 21s 1 s 2 30.54
10 21s 1 s 2 37.33
11 21s 1 s 2 40.11
12 21s 1 s 2 40.41
13 21s 1 s 2 42.03
14 21s 1 s 2 42.03
15 21s 1 s 2 42.03
16 21s 1 s 2 42.03
1 22b 2 b 2 0
2 22b 2 b 2 2.06
3 22b 2 b 2 3.99
4 22b 2 b 2 6.7
5 22b 2 b 2 9.67
6 22b 2 b 2 14.8
7 22b 2 b 2 20.64
8 22b 2 b 2 28.33
9 22b 2 b 2 34.15
10 22b 2 b 2 44.86
11 22b 2 b 2 53.06
12 22b 2 b 2 54.44
13 22b 2 b 2 57.14
14 22b 2 b 2 60.16
15 22b 2 b 2 61.32
16 22b 2 b 2 61.32
1 22c 2 c 2 0
2 22c 2 c 2 0
3 22c 2 c 2 1.55
4 22c 2 c 2 4.93
5 22c 2 c 2 13.63
6 22c 2 c 2 21.98
7 22c 2 c 2 26.7
8 22c 2 c 2 27.23
9 22c 2 c 2 30.56
10 22c 2 c 2 40.73
11 22c 2 c 2 42.01
12 22c 2 c 2 45.52
13 22c 2 c 2 51.7
14 22c 2 c 2 53.59
15 22c 2 c 2 53.59
16 22c 2 c 2 53.59
1 22s 2 s 2 0
2 22s 2 s 2 0
3 22s 2 s 2 1.15
4 22s 2 s 2 9.27
5 22s 2 s 2 13.5
6 22s 2 s 2 23.78
7 22s 2 s 2 24.38
8 22s 2 s 2 27.7
9 22s 2 s 2 33.63
10 22s 2 s 2 41.23
11 22s 2 s 2 44.84
12 22s 2 s 2 48.26
13 22s 2 s 2 51.96
14 22s 2 s 2 54.83
15 22s 2 s 2 54.83
16 22s 2 s 2 54.83
1 31b 1 b 3 0
2 31b 1 b 3 0
3 31b 1 b 3 0
4 31b 1 b 3 0
5 31b 1 b 3 1.32
6 31b ...
Code based on answer from Rob Hall # https://stats.stackexchange.com/questions/67049/non-linear-mixed-effects-regression-in-r
#nonlinear mixed effects model with self start logistic (SSlogis) for starting values
library(nlme)
library(ggplot2) #needed for the plotting code below
#base model: y = (Asym+u)/(1+exp((xmid-Harvest)/scal)), u ~ N(0,s2u); (no graft)
initVals <- getInitial(sqrtawtrun ~ SSlogis(Harvest, Asym, xmid, scal), data = Data)
#base model without graft, based on starting points found earlier
baseModel<- nlme(sqrtawtrun ~ SSlogis(Harvest, Asym, xmid, scal),
data = Data,
fixed = list(Asym ~ 1, xmid ~ 1, scal ~ 1),
random = Asym ~ 1|Plot,
start = initVals
)
#creating dummy variables for graft; releveling so 's' is the reference category to match SAS code
graft.dummy <- model.matrix(~ relevel(Data[["Graft"]], "s"))[, 2:3]
#updating to include graft in model -- same as above but allowing for diff asym, xmid, and scal vars for each graft
#starting values based on fitted base model values (for those in the base model) & zero for all new parameters
nestedModel <- update(baseModel,fixed=list(Asym ~graft.dummy, xmid ~graft.dummy, scal~graft.dummy),
start = c(fixef(baseModel)[1], 0, 0, fixef(baseModel)[2], 0, 0, fixef(baseModel)[3], 0, 0))
#growth curve plots -- line for each plot, colored on graft
#currently only predicted population level mean at each observed point, plotted as smooth line
ggplot(data = Data, aes(x = Harvest, y = sqrtawtrun, color = Graft)) +
  geom_point(size = 0.6) +
  geom_line(aes(y = fitted(nestedModel, level = 0)), size = 2)
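For the bands themselves, one possible approach (a sketch, not from the original post) is to simulate fixed-effect vectors from their approximate sampling distribution and take pointwise quantiles of the implied population-level curves. Shown here for baseModel; the 100-point grid and 1,000 draws are arbitrary choices, and the band reflects only fixed-effect uncertainty (not random effects or residual error):
library(MASS) #for mvrnorm()
#prediction grid over the observed Harvest range
newdat <- data.frame(Harvest = seq(min(Data$Harvest), max(Data$Harvest), length.out = 100))
#population-level (level 0) predictions
newdat$pred <- predict(baseModel, newdata = newdat, level = 0)
#draw fixed-effect vectors from N(beta-hat, vcov(beta-hat)) and evaluate the logistic curve for each draw
set.seed(1)
sims <- mvrnorm(1000, mu = fixef(baseModel), Sigma = vcov(baseModel))
curves <- apply(sims, 1, function(b) as.numeric(SSlogis(newdat$Harvest, b[1], b[2], b[3])))
#pointwise 95% band from the simulated curves
newdat$lwr <- apply(curves, 1, quantile, probs = 0.025)
newdat$upr <- apply(curves, 1, quantile, probs = 0.975)
ggplot(newdat, aes(x = Harvest, y = pred)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line()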

Related

How to order numeric values in a designed order in R?

My question is: given the target table, how can I order the rows of the original table (both shown below) to get exactly the target table with R? Thank you in advance.
Original table:
A B
1 1
1 2
5 12
2 6
5 14
3 6
3 7
5 13
6 2
3 10
5 11
2 5
6 14
2 7
5 15
6 1
3 8
6 3
2 4
1 3
2 10
4 11
2 8
1 4
1 5
2 9
4 12
4 13
3 9
6 15
Target table:
A B
1 1
1 2
1 3
1 4
1 5
3 6
3 7
3 8
3 9
3 10
5 11
5 12
5 13
5 14
5 15
6 1
6 2
6 3
2 4
2 5
2 6
2 7
2 8
2 9
2 10
4 11
4 12
4 13
6 14
6 15
This can be accomplished by ordering by an odd/even flag, and dat$B:
dat[order(-(dat$A %% 2), dat$B),]
## A B
##1 1 1
##2 1 2
##20 1 3
##24 1 4
##25 1 5
##6 3 6
##7 3 7
##17 3 8
##29 3 9
##10 3 10
##11 5 11
##3 5 12
##8 5 13
##5 5 14
##15 5 15
##16 6 1
##9 6 2
##18 6 3
##19 2 4
##12 2 5
##4 2 6
##14 2 7
##23 2 8
##26 2 9
##21 2 10
##22 4 11
##27 4 12
##28 4 13
##13 6 14
##30 6 15
If it's not an odd/even split then you can manually set the 1/3/5, and 2/4/6 groups:
dat[order(`levels<-`(factor(dat$A), list('1'=c(1,3,5), '2'=c(6,2,4))), dat$B),]
This collapsed version of the code with levels<- called directly as a function is a bit hard to read, but it is equivalent to:
grpord <- factor(dat$A)
levels(grpord) <- list('1'=c(1,3,5), '2'=c(6,2,4))
dat[order(grpord, dat$B),]
...where "1" is assigned to the groups 1, 3 and 5, and "2" to the groups 6, 2 and 4.

How can I find the average number of entries for a column in a set of long data

I am new to R and I surprisingly couldn't find an answer to this using the search function. Assuming I have a set of data as follows:
Plot Rate Rep Plant Tuber Weight
1 101 1 1 1 1 179.4
2 101 1 1 1 2 99.4
3 101 1 1 1 3 72.4
4 101 1 1 1 4 111.5
5 101 1 1 1 5 44.9
6 101 1 1 1 6 55.3
7 101 1 1 1 7 12.6
8 101 1 1 1 8 106.7
9 101 1 1 1 9 96.7
10 101 1 1 1 10 52.5
11 101 1 1 2 1 151.1
12 101 1 1 2 2 171.7
13 101 1 1 2 3 93.0
14 101 1 1 2 4 82.4
15 101 1 1 2 5 143.9
16 101 1 1 2 6 115.6
17 101 1 1 2 7 141.3
18 101 1 1 2 8 72.6
19 101 1 1 2 9 97.2
20 101 1 1 2 10 146.8
21 101 1 1 2 11 104.0
22 101 1 1 2 12 121.6
23 101 1 1 3 1 150.9
24 101 1 1 3 2 47.1
25 101 1 1 3 3 59.6
26 101 1 1 3 4 94.2
27 101 1 1 3 5 167.4
28 101 1 1 3 6 55.2
29 101 1 1 3 7 21.8
30 101 1 1 3 8 79.6
31 101 1 1 3 9 92.2
32 101 1 1 3 10 78.0
33 101 1 1 3 11 61.8
34 101 1 1 3 12 9.5
35 101 1 1 3 13 2.7
36 101 1 1 3 14 3.8
37 101 1 1 3 15 1.1
38 106 1 2 1 1 50.7
39 106 1 2 1 2 148.8
40 106 1 2 1 3 50.6
41 106 1 2 1 4 129.6
42 106 1 2 1 5 69.7
43 106 1 2 1 6 83.4
44 106 1 2 1 7 49.1
45 106 1 2 1 8 100.4
46 106 1 2 1 9 33.0
47 106 1 2 1 10 0.8
Here, there is a weight entry for each tuber collected from treatment combinations of Rate, Rep, and Plant.
How can I find the overall average number of tubers found in the Rate/Rep/Plant combos? For example, there are 10 tubers in 1/1/1 and 12 tubers in 1/1/2. I am looking for the average number of tubers found in a plant. The way that the tubers are expressed one at a time in a column makes this difficult for me. Any help would be hugely appreciated. Thanks in advance.
I'm going to add a little to what @akrun said here. You can use dplyr::group_by, and then find the number of tubers in a plant by taking the maximum value of Tuber within each Rate/Rep/Plant group. Then finding the average number of tubers per plant is easy:
df <- data.table::fread(
"Row Plot Rate Rep Plant Tuber Weight
1 101 1 1 1 1 179.4
2 101 1 1 1 2 99.4
3 101 1 1 1 3 72.4
4 101 1 1 1 4 111.5
5 101 1 1 1 5 44.9
6 101 1 1 1 6 55.3
7 101 1 1 1 7 12.6
8 101 1 1 1 8 106.7
9 101 1 1 1 9 96.7
10 101 1 1 1 10 52.5
11 101 1 1 2 1 151.1
12 101 1 1 2 2 171.7
13 101 1 1 2 3 93.0
14 101 1 1 2 4 82.4
15 101 1 1 2 5 143.9
16 101 1 1 2 6 115.6
17 101 1 1 2 7 141.3
18 101 1 1 2 8 72.6
19 101 1 1 2 9 97.2
20 101 1 1 2 10 146.8
21 101 1 1 2 11 104.0
22 101 1 1 2 12 121.6
23 101 1 1 3 1 150.9
24 101 1 1 3 2 47.1
25 101 1 1 3 3 59.6
26 101 1 1 3 4 94.2
27 101 1 1 3 5 167.4
28 101 1 1 3 6 55.2
29 101 1 1 3 7 21.8
30 101 1 1 3 8 79.6
31 101 1 1 3 9 92.2
32 101 1 1 3 10 78.0
33 101 1 1 3 11 61.8
34 101 1 1 3 12 9.5
35 101 1 1 3 13 2.7
36 101 1 1 3 14 3.8
37 101 1 1 3 15 1.1
38 106 1 2 1 1 50.7
39 106 1 2 1 2 148.8
40 106 1 2 1 3 50.6
41 106 1 2 1 4 129.6
42 106 1 2 1 5 69.7
43 106 1 2 1 6 83.4
44 106 1 2 1 7 49.1
45 106 1 2 1 8 100.4
46 106 1 2 1 9 33.0
47 106 1 2 1 10 0.8"
)
library(tidyverse)
tubers_per_plant <- df %>%
  group_by(Rate, Rep, Plant) %>%
  summarize(num_Tubers = max(Tuber))
tubers_per_plant
# A tibble: 4 × 4
# Groups: Rate, Rep [2]
Rate Rep Plant num_Tubers
<int> <int> <int> <int>
1 1 1 1 10
2 1 1 2 12
3 1 1 3 15
4 1 2 1 10
mean(tubers_per_plant$num_Tubers)
[1] 11.75
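Note that max(Tuber) assumes the tubers are numbered 1..n with no gaps. A slightly more robust variant (a sketch, assuming each row represents exactly one tuber) counts rows with n() instead:
df %>%
  group_by(Rate, Rep, Plant) %>%
  summarize(num_Tubers = n(), .groups = "drop") %>% #count rows per group
  summarize(avg_Tubers = mean(num_Tubers)) #overall average: 11.75 for the sample above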

Trouble visualizing K-means clusters with fviz_clusters()

I'm currently trying to visualize k-means clusters and running into a bit of trouble. I'm getting this error message when I run the code below:
Error in fviz_cluster(res.km, data = nci[, 5], palette = c("#2E9FDF", :
The dimension of the data < 2! No plot.
Here's my code:
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidyverse)
library(hrbrthemes)
library(factoextra)
library(ggpubr)
nci <- read.csv('/Users/KyleHammerberg/Desktop/ML Extra Credit/nci.datanames.csv')
names(nci)[1] <- "gene"
# Compute k-means with k = 3
set.seed(123)
res.km <- kmeans(scale(nci[,2]), 3, nstart = 25)
# K-means clusters showing the group of each individuals
res.km$cluster
fviz_cluster(res.km, data = nci[,5 ],
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)
res.km$cluster
[1] 1 2 1 2 3 1 1 3 3 3 3 3 1 1 1 3 3 3 1 3 3 3 3 1 1 1 3 3 3 3 1 3 3 1 3 3 1 1 1 1 1 3
[43] 1 3 3 3 1 1 1 1 3 3 3 3 3 3 3 1 1 3 3 1 1 1 1 1 1 1 3 1 3 1 1 1 3 3 1 2 1 1 3 2 1 3
[85] 1 1 1 1 1 1 1 2 3 1 1 1 3 3 1 1 1 1 1 1 1 3 2 1 2 1 3 3 1 1 1 1 3 3 1 3 3 3 3 1 1 1
[127] 3 3 1 3 1 1 1 3 1 1 1 2 2 2 1 2 2 2 3 1 1 3 3 1 3 1 2 1 3 3 3 3 3 3 1 1 3 1 1 3 3 3
[169] 1 3 3 3 3 1 1 3 1 1 1 1 1 3 1 1 1 1 1 3 1 1 1 1 2 3 3 3 1 3 3 1 1 3 3 1 3 1 1 3 3 1
[211] 3 1 3 1 3 3 1 3 3 1 1 1 1 3 3 1 3 1 3 3 3 3 1 1 1 1 1 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3
[253] 3 3 1 3 3 3 3 3 1 2 1 3 1 3 3 1 1 3 1 1 1 1 1 3 1 3 3 3 3 1 1 3 3 1 3 3 1 1 1 3 1 1
[295] 2 3 1 3 1 3 1 3 1 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 3 1 1 1 3 1 3 1 1 1 1 3 3 1 3 1 1 1
[337] 3 1 1 2 1 1 1 1 1 1 3 1 3 3 1 3 1 3 3 1 1 3 3 1 1 1 3 1 1 3 3 1 1 1 1 1 1 1 3 1 3 1
[379] 1 1 1 1 1 1 1 1 3 3 1 3 1 1 1 2 1 1 1 3 1 1 1 1 1 3 3 1 3 3 3 1 1 1 1 1 1 1 1 1 3 1
[421] 1 1 1 3 1 3 1 2 1 3 3 3 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 3 1 3 3 3 1 1 3 3 1 1 1 3
[463] 3 3 1 3 3 1 3 3 3 3 1 3 1 1 1 3 1 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 1 1 3 3 3 3 3 3 3 3
[505] 3 3 3 1 3 1 3 3 2 1 1 3 3 1 3 3 3 1 1 3 3 3 1 1 1 1 1 3 3 1 3 3 1 1 1 3 3 1 3 3 1 3
[547] 1 1 1 1 3 3 3 1 3 3 3 3 3 3 1 2 1 1 3 3 3 3 1 1 3 3 3 3 3 1 3 1 1 3 1 3 3 3 3 3 3 3
[589] 1 1 1 1 1 1 3 1 3 1 3 3 3 3 3 1 3 3 3 3 3 1 1 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 1 3 3 3
[631] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 1 3 3 1 3 3 3 1 3
[673] 1 3 3 1 1 1 3 1 3 3 3 3 1 3 3 1 3 1 1 1 1 3 1 3 1 3 3 3 1 1 1 3 1 1 1 1 3 3 3 3 3 3
[715] 1 1 1 1 1 1 1 3 1 1 1 3 1 1 3 3 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 3 1 3 1 1 3 3
[757] 1 1 1 1 1 1 1 3 3 3 3 1 3 1 1 3 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1
[799] 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 3 3 1 3 3 1 3 1 3 1 3 1 3 1 3 1 3 1 1 1 1 3 3 1 3 3
[841] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 1 3 3 1 2 1 1 1 3 3 1 3 1 1 1 1 1 1 3 1 3 1 1 1
[883] 1 1 1 1 1 1 3 1 1 1 1 3 3 1 1 3 3 3 3 3 3 1 1 2 1 3 1 1 1 1 1 1 1 3 1 3 1 3 1 1 1 1
[925] 1 1 1 3 3 1 1 3 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 3 1 3 1 1 1 1 1 1
[967] 1 1 1 3 1 1 3 1 3 1 3 1 1 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 1 3 1 1 1
[ reached getOption("max.print") -- omitted 5830 entries ]
Here's a look at the data if that helps:
head(nci)
gene CNS CNS.1 CNS.2 RENAL BREAST CNS.3 CNS.4 BREAST.1 NSCLC NSCLC.1
1 g1 0.300 0.679961 0.940 2.80e-01 0.485 0.310 -0.830 -0.190 0.460 0.760
2 g2 1.180 1.289961 -0.040 -3.10e-01 -0.465 -0.030 0.000 -0.870 0.000 1.490
3 g3 0.550 0.169961 -0.170 6.80e-01 0.395 -0.100 0.130 -0.450 1.150 0.280
4 g4 1.140 0.379961 -0.040 -8.10e-01 0.905 -0.460 -1.630 0.080 -1.400 0.100
5 g5 -0.265 0.464961 -0.605 6.25e-01 0.200 -0.205 0.075 0.005 -0.005 -0.525
6 g6 -0.070 0.579961 0.000 -1.39e-17 -0.005 -0.540 -0.360 0.350 -0.700 0.360
RENAL.1 RENAL.2 RENAL.3 RENAL.4 RENAL.5 RENAL.6 RENAL.7 BREAST.2 NSCLC.2 RENAL.8 UNKNOWN
1 0.270 -0.450 -0.030 0.710 -0.360 -0.210 -0.500 -1.060 0.150 -0.290 -0.200
2 0.630 -0.060 -1.120 0.000 -1.420 -1.950 -0.520 -2.190 -0.450 0.000 0.740
3 -0.360 0.150 -0.050 0.160 -0.030 -0.700 -0.660 -0.130 -0.320 0.050 0.080
4 -1.040 -0.610 0.000 -0.770 -2.280 -1.650 -2.610 0.000 -1.610 0.730 0.760
5 0.015 -0.395 -0.285 0.045 0.135 -0.075 0.225 -0.485 -0.095 0.385 -0.105
6 -0.040 0.150 -0.250 -0.160 -0.320 0.060 -0.050 -0.430 -0.080 0.390 -0.080
OVARIAN MELANOMA PROSTATE OVARIAN.1 OVARIAN.2 OVARIAN.3 OVARIAN.4 OVARIAN.5 PROSTATE.1
1 0.430 -0.490 -0.530 -0.010 0.640 -0.480 0.140 0.640 0.070
2 0.500 0.330 -0.050 -0.370 0.550 0.970 0.720 0.150 0.290
3 -0.730 0.010 -0.230 -0.160 -0.540 0.300 -0.240 -0.170 0.070
4 0.600 -1.660 0.170 0.930 -1.780 0.470 0.000 0.550 1.310
5 -0.635 -0.185 0.825 0.395 0.315 0.425 1.715 -0.205 0.085
6 -0.430 -0.140 0.010 -0.100 0.810 0.020 0.260 0.290 -0.620
NSCLC.3 NSCLC.4 NSCLC.5 LEUKEMIA K562B.repro X6K562B.repro LEUKEMIA.1 LEUKEMIA.2
1 0.130 0.320 0.515 0.080 0.410 -0.200 -0.36998050 -0.370
2 2.240 0.280 1.045 0.120 0.000 0.000 -1.38998000 0.180
3 0.640 0.360 0.000 0.060 0.210 0.060 -0.05998047 0.000
4 0.680 -1.880 0.000 0.400 0.180 -0.070 0.07001953 -1.320
5 0.135 0.475 0.330 0.105 -0.255 -0.415 -0.07498047 -0.825
6 0.300 0.110 -0.155 -0.190 -0.110 0.020 0.04001953 -0.130
LEUKEMIA.3 LEUKEMIA.4 LEUKEMIA.5 COLON COLON.1 COLON.2 COLON.3 COLON.4
1 -0.430 -0.380 -0.550 -0.32003900 -0.620 -4.90e-01 0.07001953 -0.120
2 -0.590 -0.550 0.000 0.08996101 0.080 4.20e-01 -0.82998050 0.000
3 -0.500 -1.710 0.100 -0.29003900 0.140 -3.40e-01 -0.59998050 -0.010
4 -1.520 -1.870 -2.390 -1.03003900 0.740 7.00e-02 -0.90998050 0.130
5 -0.785 -0.585 -0.215 0.09496101 0.205 -2.05e-01 0.24501950 0.555
6 0.520 0.120 -0.620 0.05996101 0.000 -1.39e-17 -0.43998050 -0.550
COLON.5 COLON.6 MCF7A.repro BREAST.3 MCF7D.repro BREAST.4 NSCLC.6 NSCLC.7
1 -0.290 -0.8100195 0.200 0.37998050 0.3100195 0.030 -0.42998050 0.160
2 0.030 0.0000000 -0.230 0.44998050 0.4800195 0.220 -0.38998050 -0.340
3 -0.310 0.2199805 0.360 0.65998050 0.9600195 0.150 -0.17998050 -0.020
4 1.500 0.7399805 0.180 0.76998050 0.9600195 -1.240 0.86001950 -1.730
5 0.005 0.1149805 -0.315 0.05498047 -0.2149805 -0.305 0.78501950 -0.625
6 -0.540 0.1199805 0.410 0.54998050 0.3700195 0.050 0.04001953 -0.140
NSCLC.8 MELANOMA.1 BREAST.5 BREAST.6 MELANOMA.2 MELANOMA.3 MELANOMA.4 MELANOMA.5
1 0.010 -0.620 -0.380 0.04998047 0.650 -0.030 -0.270 0.210
2 -1.280 -0.130 0.000 -0.72001950 0.640 -0.480 0.630 -0.620
3 -0.770 0.200 -0.060 0.41998050 0.150 0.070 -0.100 -0.150
4 0.940 -1.410 0.800 0.92998050 -1.970 -0.700 1.100 -1.330
5 -0.015 1.585 -0.115 -0.09501953 -0.065 -0.195 1.045 0.045
6 0.270 1.160 0.180 0.19998050 0.130 0.410 0.080 -0.400
MELANOMA.6 MELANOMA.7
1 -5.00e-02 0.350
2 1.40e-01 -0.270
3 -9.00e-02 0.020
4 -1.26e+00 -1.230
5 4.50e-02 -0.715
6 -2.71e-20 -0.340
nci[, 5] is data with only one column. fviz_cluster requires data with at least 2 columns. This check is performed in these lines: https://github.com/kassambara/factoextra/blob/master/R/fviz_cluster.R#L184-L203 .
Using mtcars as an example -
Passing a single column in data:
library(ggplot2) #theme_bw() below comes from ggplot2
res.km <- kmeans(scale(mtcars[, 2]), 3, nstart = 25)
factoextra::fviz_cluster(res.km, data = mtcars[, 5],
                         palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
                         geom = "point",
                         ellipse.type = "convex",
                         ggtheme = theme_bw())
Error in factoextra::fviz_cluster(res.km, data = mtcars[, 5], palette = c("#2E9FDF", :
The dimension of the data < 2! No plot.
Passing two columns in data:
factoextra::fviz_cluster(res.km, data = mtcars[, 5:6],
                         palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
                         geom = "point",
                         ellipse.type = "convex",
                         ggtheme = theme_bw())
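Note that the question's code also clusters one column (nci[, 2]) but plots a different one (nci[, 5]). A sketch of a consistent call (an assumption about the intent), clustering and plotting the same two scaled columns:
dat2 <- scale(mtcars[, 5:6])
res.km2 <- kmeans(dat2, 3, nstart = 25) #cluster the same data you will plot
factoextra::fviz_cluster(res.km2, data = dat2,
                         palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
                         geom = "point",
                         ellipse.type = "convex",
                         ggtheme = theme_bw())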

Sequentially number groups based on a condition

I need some help with my R code; I've been trying to get it to work for ages and I'm totally stuck.
I have a large dataset (~40,000 rows) and I need to assign group IDs to a new column based on a condition on another column: if df$flow.type == 1, then that [SITENAME, SAMPLING.YEAR, cluster] group should be assigned a unique group ID. An example is shown below.
This is a similar question, but for SQL: Assigning group number based on condition. I need a way to do this in R; sorry, I am a novice with if_else and loops. The code below is the best I could come up with, but it isn't working. Can anyone see what I'm doing wrong?
Thanks in advance for your help.
if(flow.type.test=="0"){
event.samp.num.test <- "1000"
} else (flow.type.test=="1"){
event.samp.num.test <- Sample_dat %>% group_by(SITENAME, SAMPLING.YEAR, cluster) %>% tally()}
Note the group ID '1000' is just a random impossible number for this dataset - it will be used to subset the data later on.
My subset df looks like this:
> str(dummydat)
'data.frame': 68 obs. of 6 variables:
$ SITENAME : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ SAMPLING.YEAR: Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 4 ...
$ DATE : Date, format: "2017-10-17" "2017-10-17" "2017-10-22" "2017-11-28" ...
$ TIME : chr "10:45" "15:00" "15:20" "20:59" ...
$ flow.type : int 1 1 0 0 1 1 0 0 0 1 ...
$ cluster : int 1 1 2 3 4 4 5 6 7 8 ...
Sorry, I tried dput but the output is horrendous. I have included 40 rows of the subset data below as an example; I hope this is okay.
> head(dummydat, n=40)
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster
1 A 3 2017-10-17 10:45 1 1
2 A 3 2017-10-17 15:00 1 1
3 A 3 2017-10-22 15:20 0 2
4 A 3 2017-11-28 20:59 0 3
5 A 3 2017-12-05 18:15 1 4
6 A 3 2017-12-06 8:25 1 4
7 A 3 2017-12-10 10:05 0 5
8 A 3 2017-12-15 15:12 0 6
9 A 3 2017-12-19 17:40 0 7
10 A 4 2018-12-09 18:10 1 8
11 A 4 2018-12-16 10:35 0 9
12 A 4 2018-12-26 6:47 0 10
13 A 4 2019-01-01 14:25 0 11
14 A 4 2019-01-05 16:40 0 12
15 A 4 2019-01-12 7:42 0 13
16 A 4 2019-01-20 16:15 0 14
17 A 4 2019-01-28 10:41 0 15
18 A 4 2019-02-03 16:30 1 16
19 A 4 2019-02-04 17:14 1 16
20 B 1 2015-12-24 6:21 1 16
21 B 1 2015-12-29 17:41 1 17
22 B 1 2015-12-29 23:33 1 17
23 B 1 2015-12-30 5:17 1 17
24 B 1 2015-12-30 17:23 1 17
25 B 1 2015-12-31 5:29 1 17
26 B 1 2015-12-31 11:35 1 17
27 B 1 2015-12-31 23:40 1 17
28 B 1 2016-02-09 10:53 0 18
29 B 1 2016-03-03 15:23 1 19
30 B 1 2016-03-03 17:37 1 19
31 B 1 2016-03-03 21:33 1 19
32 B 1 2016-03-04 3:17 1 19
33 B 2 2017-01-07 13:16 1 20
34 B 2 2017-01-07 22:24 1 20
35 B 2 2017-01-08 6:34 1 20
36 B 2 2017-01-08 11:42 1 20
37 B 2 2017-01-08 20:50 1 20
38 B 2 2017-01-31 11:39 1 21
39 B 2 2017-01-31 16:45 1 21
40 B 2 2017-01-31 22:53 1 21
Here is one approach with tidyverse:
library(dplyr)
library(tidyr)
left_join(df, df %>%
filter(flow.type == 1) %>%
group_by(SITENAME, SAMPLING.YEAR) %>%
mutate(group.ID = cumsum(cluster != lag(cluster, default = first(cluster))) + 1)) %>%
mutate(group.ID = replace_na(group.ID, 1000))
First, filter the rows that have flow.type of 1. Then, group_by both SITENAME and SAMPLING.YEAR so groups are numbered within each site/year combination. Next, use cumsum on changes in the cluster value: the cumulative sum increments each time cluster changes, which yields the group number. This result is then merged back with the original data (left_join). Finally, to have rows with flow.type 0 get 1000 for group.ID, use replace_na.
Output
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
1 A 3 2017-10-17 10:45 1 1 1
2 A 3 2017-10-17 15:00 1 1 1
3 A 3 2017-10-22 15:20 0 2 1000
4 A 3 2017-11-28 20:59 0 3 1000
5 A 3 2017-12-05 18:15 1 4 2
6 A 3 2017-12-06 8:25 1 4 2
7 A 3 2017-12-10 10:05 0 5 1000
8 A 3 2017-12-15 15:12 0 6 1000
9 A 3 2017-12-19 17:40 0 7 1000
10 A 4 2018-12-09 18:10 1 8 1
11 A 4 2018-12-16 10:35 0 9 1000
12 A 4 2018-12-26 6:47 0 10 1000
13 A 4 2019-01-01 14:25 0 11 1000
14 A 4 2019-01-05 16:40 0 12 1000
15 A 4 2019-01-12 7:42 0 13 1000
16 A 4 2019-01-20 16:15 0 14 1000
17 A 4 2019-01-28 10:41 0 15 1000
18 A 4 2019-02-03 16:30 1 16 2
19 A 4 2019-02-04 17:14 1 16 2
20 B 1 2015-12-24 6:21 1 16 1
21 B 1 2015-12-29 17:41 1 17 2
22 B 1 2015-12-29 23:33 1 17 2
23 B 1 2015-12-30 5:17 1 17 2
24 B 1 2015-12-30 17:23 1 17 2
25 B 1 2015-12-31 5:29 1 17 2
26 B 1 2015-12-31 11:35 1 17 2
27 B 1 2015-12-31 23:40 1 17 2
28 B 1 2016-02-09 10:53 0 18 1000
29 B 1 2016-03-03 15:23 1 19 3
30 B 1 2016-03-03 17:37 1 19 3
31 B 1 2016-03-03 21:33 1 19 3
32 B 1 2016-03-04 3:17 1 19 3
33 B 2 2017-01-07 13:16 1 20 1
34 B 2 2017-01-07 22:24 1 20 1
35 B 2 2017-01-08 6:34 1 20 1
36 B 2 2017-01-08 11:42 1 20 1
37 B 2 2017-01-08 20:50 1 20 1
38 B 2 2017-01-31 11:39 1 21 2
39 B 2 2017-01-31 16:45 1 21 2
40 B 2 2017-01-31 22:53 1 21 2
Here is a data.table approach. Every row starts with the placeholder group.ID of 1000; then, among the flow.type == 1 rows, .GRP assigns a sequential index to each distinct cluster within each SITENAME/SAMPLING.YEAR group:
library(data.table)
setDT(df)[
, group.ID := 1000
][
flow.type == 1, group.ID := copy(.SD)[, grp := .GRP, by = cluster]$grp,
by = .(SITENAME, SAMPLING.YEAR)
]
Output
> df[]
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
1: A 3 2017-10-17 10:45:00 1 1 1
2: A 3 2017-10-17 15:00:00 1 1 1
3: A 3 2017-10-22 15:20:00 0 2 1000
4: A 3 2017-11-28 20:59:00 0 3 1000
5: A 3 2017-12-05 18:15:00 1 4 2
6: A 3 2017-12-06 08:25:00 1 4 2
7: A 3 2017-12-10 10:05:00 0 5 1000
8: A 3 2017-12-15 15:12:00 0 6 1000
9: A 3 2017-12-19 17:40:00 0 7 1000
10: A 4 2018-12-09 18:10:00 1 8 1
11: A 4 2018-12-16 10:35:00 0 9 1000
12: A 4 2018-12-26 06:47:00 0 10 1000
13: A 4 2019-01-01 14:25:00 0 11 1000
14: A 4 2019-01-05 16:40:00 0 12 1000
15: A 4 2019-01-12 07:42:00 0 13 1000
16: A 4 2019-01-20 16:15:00 0 14 1000
17: A 4 2019-01-28 10:41:00 0 15 1000
18: A 4 2019-02-03 16:30:00 1 16 2
19: A 4 2019-02-04 17:14:00 1 16 2
20: B 1 2015-12-24 06:21:00 1 16 1
21: B 1 2015-12-29 17:41:00 1 17 2
22: B 1 2015-12-29 23:33:00 1 17 2
23: B 1 2015-12-30 05:17:00 1 17 2
24: B 1 2015-12-30 17:23:00 1 17 2
25: B 1 2015-12-31 05:29:00 1 17 2
26: B 1 2015-12-31 11:35:00 1 17 2
27: B 1 2015-12-31 23:40:00 1 17 2
28: B 1 2016-02-09 10:53:00 0 18 1000
29: B 1 2016-03-03 15:23:00 1 19 3
30: B 1 2016-03-03 17:37:00 1 19 3
31: B 1 2016-03-03 21:33:00 1 19 3
32: B 1 2016-03-04 03:17:00 1 19 3
33: B 2 2017-01-07 13:16:00 1 20 1
34: B 2 2017-01-07 22:24:00 1 20 1
35: B 2 2017-01-08 06:34:00 1 20 1
36: B 2 2017-01-08 11:42:00 1 20 1
37: B 2 2017-01-08 20:50:00 1 20 1
38: B 2 2017-01-31 11:39:00 1 21 2
39: B 2 2017-01-31 16:45:00 1 21 2
40: B 2 2017-01-31 22:53:00 1 21 2
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID

Taking means over `sam` and `dup`

I am trying to take the means over the columns sam and dup of the following dataset:
fat co lab sam dup
1 0.62 1 1 1 1
2 0.55 1 1 1 2
3 0.34 1 1 2 1
4 0.24 1 1 2 2
5 0.80 1 1 3 1
6 0.68 1 1 3 2
7 0.76 1 1 4 1
8 0.65 1 1 4 2
9 0.30 1 2 1 1
10 0.40 1 2 1 2
11 0.33 1 2 2 1
12 0.43 1 2 2 2
13 0.39 1 2 3 1
14 0.40 1 2 3 2
15 0.29 1 2 4 1
16 0.18 1 2 4 2
17 0.46 1 3 1 1
18 0.38 1 3 1 2
19 0.27 1 3 2 1
20 0.37 1 3 2 2
21 0.37 1 3 3 1
22 0.42 1 3 3 2
23 0.45 1 3 4 1
24 0.54 1 3 4 2
25 0.18 2 1 1 1
26 0.47 2 1 1 2
27 0.53 2 1 2 1
28 0.32 2 1 2 2
29 0.40 2 1 3 1
30 0.37 2 1 3 2
31 0.31 2 1 4 1
32 0.43 2 1 4 2
33 0.35 2 2 1 1
34 0.39 2 2 1 2
35 0.37 2 2 2 1
36 0.33 2 2 2 2
37 0.42 2 2 3 1
38 0.36 2 2 3 2
39 0.20 2 2 4 1
40 0.41 2 2 4 2
41 0.37 2 3 1 1
42 0.43 2 3 1 2
43 0.28 2 3 2 1
44 0.36 2 3 2 2
45 0.18 2 3 3 1
46 0.20 2 3 3 2
47 0.26 2 3 4 1
48 0.06 2 3 4 2
The output should be this:
lab co fat
1 1 1 0.58000
2 2 1 0.34000
3 3 1 0.40750
4 1 2 0.37625
5 2 2 0.35375
6 3 2 0.26750
These are both in the form of .RData files.
How can this be done?
An example with part of the data you posted:
dt = read.table(text = "
fat co lab sam dup
0.62 1 1 1 1
0.55 1 1 1 2
0.34 1 1 2 1
0.24 1 1 2 2
0.80 1 1 3 1
0.68 1 1 3 2
0.76 1 1 4 1
0.65 1 1 4 2
0.30 1 2 1 1
0.40 1 2 1 2
0.33 1 2 2 1
0.43 1 2 2 2
0.39 1 2 3 1
0.40 1 2 3 2
0.29 1 2 4 1
0.18 1 2 4 2
", header= T)
library(dplyr)
dt %>%
  group_by(lab, co) %>%           # for each lab and co combination
  summarise(fat = mean(fat)) %>%  # get the mean of fat
  ungroup()                       # forget the grouping
# # A tibble: 2 x 3
# lab co fat
# <int> <int> <dbl>
# 1 1 1 0.58
# 2 2 1 0.34
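For comparison, the same summary can be computed in base R with no extra packages (a sketch, assuming the full dataset is in dt):
aggregate(fat ~ lab + co, data = dt, FUN = mean) #mean of fat for each lab/co combination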
