Hi, I need to calculate cumulative insect days for my experiment. This is what my data frame looks like:
Rep trt date BLB
1 I 1 7/12/2017 3
2 I 2 7/12/2017 2
3 I 3 7/12/2017 4
4 I 4 7/12/2017 0
5 II 1 7/12/2017 1
6 II 2 7/12/2017 2
7 II 3 7/12/2017 2
8 II 4 7/12/2017 1
9 III 1 7/12/2017 3
10 III 2 7/12/2017 2
11 III 3 7/12/2017 1
12 III 4 7/12/2017 1
13 IV 1 7/12/2017 0
14 IV 2 7/12/2017 3
15 IV 3 7/12/2017 3
16 IV 4 7/12/2017 0
17 I 1 7/20/2017 12
18 I 2 7/20/2017 6
19 I 3 7/20/2017 7
20 I 4 7/20/2017 18
21 II 1 7/20/2017 17
22 II 2 7/20/2017 11
23 II 3 7/20/2017 25
24 II 4 7/20/2017 17
25 III 1 7/20/2017 18
26 III 2 7/20/2017 6
27 III 3 7/20/2017 48
28 III 4 7/20/2017 13
29 IV 1 7/20/2017 7
30 IV 2 7/20/2017 22
31 IV 3 7/20/2017 18
32 IV 4 7/20/2017 11
33 I 1 7/27/2017 1
34 I 2 7/27/2017 3
35 I 3 7/27/2017 4
36 I 4 7/27/2017 0
37 II 1 7/27/2017 1
38 II 2 7/27/2017 0
39 II 3 7/27/2017 1
40 II 4 7/27/2017 0
41 III 1 7/27/2017 1
42 III 2 7/27/2017 1
43 III 3 7/27/2017 0
44 III 4 7/27/2017 0
45 IV 1 7/27/2017 1
46 IV 2 7/27/2017 0
47 IV 3 7/27/2017 1
48 IV 4 7/27/2017 2
49 I 1 8/2/2017 0
50 I 2 8/2/2017 0
51 I 3 8/2/2017 1
52 I 4 8/2/2017 0
53 II 1 8/2/2017 0
54 II 2 8/2/2017 0
55 II 3 8/2/2017 0
56 II 4 8/2/2017 0
57 III 1 8/2/2017 1
58 III 2 8/2/2017 0
59 III 3 8/2/2017 0
60 III 4 8/2/2017 0
61 IV 1 8/2/2017 0
62 IV 2 8/2/2017 0
63 IV 3 8/2/2017 0
64 IV 4 8/2/2017 2
Structure would be:
'data.frame': 64 obs. of 4 variables:
$ Rep : Factor w/ 4 levels "I","II","III",..: 1 1 1 1 2 2 2 2 3 3 ...
$ trt : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
$ date: Factor w/ 4 levels "7/12/2017","7/20/2017",..: 1 1 1 1 1 1 1 1 1 1 ...
$ BLB : int 3 2 4 0 1 2 2 1 3 2 ...
To do it, I need to calculate the average number of insects for each pair of dates within each treatment. For example, I have to calculate the average between 7/12 and 7/20 for each treatment, then the average between 7/20 and 7/27, and so on. Does anyone know how to do this in R? I really appreciate the help!!
First, create the data (it would be nice if you had provided dput(data)...):
set.seed(123)
df = data.frame(Rep = rep(c("I","II","III","IV"), each = 4, times = 4),
trt = as.factor(rep(1:4, times = 16)),
date = as.Date(rep(c("7/12/2017", "7/20/2017", "7/27/2017", "8/2/2017"), each = 16),
format = "%m/%d/%Y"),
BLB = sample(0:50, 64, replace = TRUE))
> str(df)
'data.frame': 64 obs. of 4 variables:
$ Rep : Factor w/ 4 levels "I","II","III",..: 1 1 1 1 2 2 2 2 3 3 ...
$ trt : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
$ date: Date, format: "2017-07-12" "2017-07-12" "2017-07-12" ...
$ BLB : int 14 40 20 45 47 2 26 45 28 23 ...
Simple subsetting and aggregation:
# Create a subset for each pair of consecutive dates
date_group1 = subset(df, date %in% as.Date(c("2017-07-12", "2017-07-20")))
date_group2 = subset(df, date %in% as.Date(c("2017-07-20", "2017-07-27")))
date_group3 = subset(df, date %in% as.Date(c("2017-07-27", "2017-08-02")))
# Aggregate by treatment in each date_group
aggregate(BLB ~ trt, data = date_group1, mean)
aggregate(BLB ~ trt, data = date_group2, mean)
aggregate(BLB ~ trt, data = date_group3, mean)
# > aggregate(BLB ~ trt, data = date_group1, mean)
# trt BLB
# 1 1 28.375
# 2 2 21.750
# 3 3 27.875
# 4 4 41.500
# > aggregate(BLB ~ trt, data = date_group2, mean)
# trt BLB
# 1 1 23.875
# 2 2 19.875
# 3 3 21.625
# 4 4 31.250
# > aggregate(BLB ~ trt, data = date_group3, mean)
# trt BLB
# 1 1 22.375
# 2 2 21.250
# 3 3 17.875
# 4 4 17.500
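Since your end goal is cumulative insect days, here is a rough dplyr sketch (assuming the df built above) of the usual trapezoid-rule calculation: each interval contributes the mean of its two counts times the number of days between the samplings, and CID is the running sum within each Rep x trt:
library(dplyr)
df %>%
  arrange(Rep, trt, date) %>%
  group_by(Rep, trt) %>%
  mutate(insect_days = (BLB + lag(BLB)) / 2 * as.numeric(date - lag(date)),
         CID = cumsum(coalesce(insect_days, 0))) %>%  # first row of each group contributes 0
  ungroup()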
You have missed some date-combination groups, #useR. There are also (2017-07-12, 2017-07-27), (2017-07-12, 2017-08-02), and (2017-07-20, 2017-08-02).
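If you want all six pairs programmatically rather than by hand, a sketch (again assuming the df built above):
# Every pair of sampling dates, then the per-treatment mean within each pair
date_pairs <- combn(sort(unique(df$date)), 2, simplify = FALSE)
names(date_pairs) <- sapply(date_pairs, paste, collapse = " / ")
lapply(date_pairs, function(p)
  aggregate(BLB ~ trt, data = subset(df, date %in% p), FUN = mean))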
Related
I have a series of values that includes runs of values close to one another, for example the sequences below. Note that roughly around the places where I have categorized the values in V1 with distinct values in V2, the range of the values changes. That is, all the values labeled 1 in V2 are within 20 points of each other, all the values labeled 2 are within 20 points of each other, all the values labeled 3 are within 20 points of each other, etc. Notice that the values are not identical (they are all different); instead, they cluster around a common value.
I identified these clusters manually. How could I automate it?
V1 V2
1 399.710 1
2 403.075 1
3 405.766 1
4 407.112 1
5 408.458 1
6 409.131 1
7 410.477 1
8 411.150 1
9 412.495 1
10 332.419 2
11 330.400 2
12 329.054 2
13 327.708 2
14 326.363 2
15 325.017 2
16 322.998 2
17 319.633 2
18 314.923 2
19 288.680 3
20 285.315 3
21 283.969 3
22 281.950 3
23 279.932 3
24 276.567 3
25 273.875 3
26 272.530 3
27 271.857 3
28 272.530 3
29 273.875 3
30 274.548 3
31 275.894 3
32 275.894 3
33 276.567 3
34 277.240 3
35 278.586 3
36 279.932 3
37 281.950 3
38 284.642 3
39 288.007 3
40 291.371 3
41 294.063 4
42 295.409 4
43 296.754 4
44 297.427 4
45 298.100 4
46 299.446 4
47 300.792 4
48 303.484 4
49 306.848 4
50 327.708 5
51 309.540 6
52 310.213 6
53 309.540 6
54 306.848 6
55 304.156 6
56 302.811 6
57 302.811 6
58 304.156 6
59 305.502 6
60 306.175 6
61 306.175 6
62 304.829 6
I haven't tried anything yet, I don't know how to do this.
Use dist() and hclust() with cutree() to detect the clusters, then renumber them so that each contiguous run gets a unique level at the breaks.
hc <- hclust(dist(x))    # hierarchical clustering on the pairwise distances
cl <- cutree(hc, k = 6)  # cut the tree into 6 clusters
# Renumber so that every contiguous run gets its own label
data.frame(x, seq = cumsum(c(0, diff(cl)) != 0) + 1)
# x seq
# 1 399.710 1
# 2 403.075 1
# 3 405.766 1
# 4 407.112 1
# 5 408.458 1
# 6 409.131 1
# 7 410.477 1
# 8 411.150 1
# 9 412.495 1
# 10 332.419 2
# 11 330.400 2
# 12 329.054 2
# 13 327.708 2
# 14 326.363 2
# 15 325.017 2
# 16 322.998 2
# 17 319.633 3
# 18 314.923 3
# 19 288.680 4
# 20 285.315 4
# 21 283.969 4
# 22 281.950 4
# 23 279.932 4
# 24 276.567 5
# 25 273.875 5
# 26 272.530 5
# 27 271.857 5
# 28 272.530 5
# 29 273.875 5
# 30 274.548 5
# 31 275.894 5
# 32 275.894 5
# 33 276.567 5
# 34 277.240 5
# 35 278.586 6
# 36 279.932 6
# 37 281.950 6
# 38 284.642 6
# 39 288.007 6
# 40 291.371 6
# 41 294.063 7
# 42 295.409 7
# 43 296.754 7
# 44 297.427 7
# 45 298.100 7
# 46 299.446 7
# 47 300.792 7
# 48 303.484 7
# 49 306.848 7
# 50 327.708 8
# 51 309.540 9
# 52 310.213 9
# 53 309.540 9
# 54 306.848 9
# 55 304.156 9
# 56 302.811 9
# 57 302.811 9
# 58 304.156 9
# 59 305.502 9
# 60 306.175 9
# 61 306.175 9
# 62 304.829 9
However, the dendrogram suggests k = 4 clusters rather than 6; the choice is somewhat arbitrary.
plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)
Data:
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
304.156, 305.502, 306.175, 306.175, 304.829)
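A simpler hedged alternative, since the clusters come in contiguous runs: start a new group wherever the jump between consecutive values is large. This only catches the big breaks, so it won't reproduce every range-based split above, but it needs no clustering at all:
# New group whenever consecutive values differ by more than 20
data.frame(x, seq = cumsum(c(TRUE, abs(diff(x)) > 20)))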
This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.
maxrange <- 18   # maximum allowed spread within a group
grp_start <- 1   # index where the current group began
grp_num <- 1
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
  grp <- dat$V1[grp_start:i]
  if (max(grp) - min(grp) > maxrange) {  # range too wide: start a new group
    grp_num <- grp_num + 1
    grp_start <- i
  }
  V3[i] <- grp_num
}
cbind(dat, V3)
V1 V2 V3
1 399.710 1 1
2 403.075 1 1
3 405.766 1 1
4 407.112 1 1
5 408.458 1 1
6 409.131 1 1
7 410.477 1 1
8 411.150 1 1
9 412.495 1 1
10 332.419 2 2
11 330.400 2 2
12 329.054 2 2
13 327.708 2 2
14 326.363 2 2
15 325.017 2 2
16 322.998 2 2
17 319.633 2 2
18 314.923 2 2
19 288.680 3 3
20 285.315 3 3
21 283.969 3 3
22 281.950 3 3
23 279.932 3 3
24 276.567 3 3
25 273.875 3 3
26 272.530 3 3
27 271.857 3 3
28 272.530 3 3
29 273.875 3 3
30 274.548 3 3
31 275.894 3 3
32 275.894 3 3
33 276.567 3 3
34 277.240 3 3
35 278.586 3 3
36 279.932 3 3
37 281.950 3 3
38 284.642 3 3
39 288.007 3 3
40 291.371 3 4
41 294.063 4 4
42 295.409 4 4
43 296.754 4 4
44 297.427 4 4
45 298.100 4 4
46 299.446 4 4
47 300.792 4 4
48 303.484 4 4
49 306.848 4 4
50 327.708 5 5
51 309.540 6 6
52 310.213 6 6
53 309.540 6 6
54 306.848 6 6
55 304.156 6 6
56 302.811 6 6
57 302.811 6 6
58 304.156 6 6
59 305.502 6 6
60 306.175 6 6
61 306.175 6 6
62 304.829 6 6
A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
I'm doing a survival analysis of how long individual components remain in the source code of a software project, but some of these components are being dropped by the survfit function.
This is what I'm doing:
library(survival)
data <- read.table(text = "component_id weeks removed
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 2 0
9 2 0
10 2 0
11 2 0
12 2 1
13 2 1
14 2 0
15 2 0
16 2 0
17 2 0
18 2 0
19 2 0
20 2 1
21 2 1
22 2 0
23 2 0
24 3 1
25 3 1
26 3 1
27 3 1
28 7 1
29 7 1
30 14 1
31 14 1
32 14 1
33 14 1
34 14 1
35 14 1
36 14 1
37 14 1
38 14 1
39 14 1
40 14 1
41 14 1
42 14 1
43 14 1
44 14 1
45 14 1
46 14 1
47 14 1
48 40 1
49 40 1
50 40 1
51 40 1
52 48 1
53 48 1
54 48 1
55 48 1
56 48 1
57 48 1
58 48 1
59 48 1
60 56 1
61 56 1
62 56 1
63 56 1
64 56 1
65 56 1
66 56 1
67 56 1
68 56 1
69 56 1", header = TRUE)
fit <- survfit(Surv(data$weeks, data$removed) ~ 1)
summary(fit, censored=TRUE)
And this is the output
Call: survfit(formula = Surv(data$weeks, data$removed) ~ 1)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 69 7 0.899 0.0363 0.830 0.973
2 62 4 0.841 0.0441 0.758 0.932
3 46 4 0.767 0.0533 0.670 0.879
7 42 2 0.731 0.0567 0.628 0.851
14 40 18 0.402 0.0654 0.292 0.553
40 22 4 0.329 0.0629 0.226 0.478
48 18 8 0.183 0.0520 0.105 0.319
56 10 10 0.000 NaN NA NA
I was expecting the total number of events to be 69, but 12 subjects get dropped.
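For reference, tabulating the censoring indicator shows where the 12 comes from; those are exactly the rows with removed == 0, all observed at week 2:
table(data$removed)
#  0  1
# 12 57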
I initially thought I was misusing the package functions, and tried a type = "interval2" approach, following a similar situation, but the drops keep happening, now with oddly non-integer subject and event counts:
as.t2 <- function(i, data) if (data$removed[i] == 1) data$weeks[i] else NA
size <- length(data$weeks)
t1 <- data$weeks
t2 <- sapply(1:size, as.t2, data = data)
interval_fit <- survfit(Surv(t1, t2, type="interval2") ~ 1)
summary(interval_fit, censored=TRUE)
Next, I found what I'd call a mid-air explanation, which clarifies the situation a bit further. I understand this is caused by non-censored subjects appearing after a "constant censoring time", but again, why?
That somehow led me to dig deeper and read about right truncation, and I realized that that type of study maps very closely to the drops I'm experiencing. Here's Klein & Moeschberger:
Truncation of survival data occurs when only those individuals whose event time lies within a certain observational window $(Y_L,Y_R)$ are observed. An individual whose event time is not in this interval is not observed and no information on this subject is available to the investigator.
Right truncation occurs when $Y_L$ is equal to zero. That is, we observe the survival time $X$ only when $X \leq Y_R$.
From my perspective, these drops carry important information for my study regardless of their time of entry.
How can I stop the drops?
I have a categorical variable B with 3 levels (1, 2, 3), and another variable A with some values. Sample data is as follows:
A B
22 1
23 1
12 1
34 1
43 2
47 2
49 2
65 2
68 3
70 3
75 3
82 3
120 3
. .
. .
. .
. .
All I want is, for every level of B (say level 1), to calculate (Val(A) - Min) / (Max - Min); similarly, I need to reproduce the same for the other levels (2 and 3).
Solution using dplyr:
set.seed(1)
df=data.frame(A=round(rnorm(21,50,10)),B=rep(1:3,each=7))
library(dplyr)
df %>% group_by(B) %>% mutate(C= (A-min(A))/(max(A)-min(A)))
The output looks like:
# A tibble: 21 x 3
# Groups: B [3]
A B C
<dbl> <int> <dbl>
1 44 1 0.0833
2 52 1 0.417
3 42 1 0
4 66 1 1
5 53 1 0.458
6 42 1 0
7 55 1 0.542
8 57 2 0.784
9 56 2 0.757
10 47 2 0.514
# ... with 11 more rows
You could use the tapply function:
x = read.table(text="A B
22 1
23 1
12 1
34 1
43 2
47 2
49 2
65 2
68 3
70 3
75 3
82 3
120 3", header = TRUE)
y = tapply(x$A, x$B, function(z) (z - min(z)) / (max(z) - min(z)))
# Or using the scale() function
#y = tapply(x$A, x$B, function(z) scale(z, min(z), max(z) - min(z)))
cbind(x, unlist(y))
Not exactly sure how you want the output, but this should be a decent starting point.
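If you want the scaled values as a new column aligned with the original rows, a one-line sketch with base R's ave():
# Min-max scale A within each level of B, keeping the original row order
x$C <- ave(x$A, x$B, FUN = function(z) (z - min(z)) / (max(z) - min(z)))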
This question already has an answer here:
Blend of na.omit and na.pass using aggregate?
(1 answer)
Closed 5 years ago.
I am using aggregate to get the means of several variables by a specific category (cy), but there are a few NAs in my data frame. I am using aggregate rather than ddply because, from my understanding, it takes care of NAs similarly to using na.rm = TRUE. The problem is that it drops all rows containing an NA in the output, so the means are slightly off.
Dataframe:
bt cy cl pf ne YH YI
1 1 H 1 95 70.0 20 20
2 2 H 1 25 70.0 46 50
3 1 H 1 0 70.0 40 45
4 2 H 1 95 59.9 40 40
5 2 H 1 75 59.9 36 57
6 2 H 1 5 70.0 35 43
7 1 H 1 50 59.9 20 36
8 2 H 1 95 59.9 40 42
9 3 H 1 95 49.5 17 48
10 2 H 1 5 70.0 42 42
11 2 H 1 95 49.5 19 30
12 3 H 1 25 49.5 33 51
13 1 H 1 75 49.5 5 26
14 1 H 1 5 70.0 35 37
15 1 H 1 5 59.9 20 40
16 2 H 1 95 49.5 29 53
17 2 H 1 75 70.0 41 41
18 2 H 1 0 70.0 10 10
19 2 H 1 95 49.5 25 32
20 1 H 1 95 59.9 10 11
21 2 H 1 0 29.5 20 28
22 1 H 1 95 29.5 11 27
23 2 H 1 25 59.9 26 26
24 1 H 1 5 70.0 30 30
25 3 H 1 25 29.5 20 30
26 3 H 1 50 70.0 5 5
27 1 H 1 0 59.9 3 10
28 1 K 1 5 49.5 25 29
29 2 K 1 0 49.5 30 32
30 1 K 1 95 49.5 13 24
31 1 K 1 0 39.5 13 13
32 2 M 1 NA 70.0 45 50
33 3 M 1 25 59.9 3 34
The full dataframe has 74 rows, and there are NA's peppered throughout all but two columns (cy and cl).
My code looks like this:
meancnty <- aggregate(cbind(pf, ne, YH, YI) ~ cy, data = newChart, FUN = mean)
I double-checked in Excel, and the means this function produces are for a dataset of N = 69, after removing all rows containing NAs. Is there any way to tell R to ignore the NAs rather than remove the rows, other than taking the mean of each variable by county separately (I have a lot of variables to summarize by many different categories)?
Thank you
Using dplyr:
df %>%
group_by(cy) %>%
summarize_all(mean, na.rm = TRUE)
# cy bt cl pf ne YH YI
# 1 H 1.785714 0.7209302 53.41463 51.75952 21.92857 29.40476
# 2 K 1.333333 0.8333333 33.33333 47.83333 20.66667 27.33333
# 3 M 1.777778 0.4444444 63.75000 58.68889 24.88889 44.22222
# 4 O 2.062500 0.8750000 31.66667 53.05333 18.06667 30.78571
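In current dplyr (1.0+), summarize_all() is superseded; the same result can be written with across():
df %>%
  group_by(cy) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))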
I think this will work:
meancnty <- aggregate(with(newChart, cbind(pf, ne, YH, YI)),
                      by = list(cy = newChart$cy), FUN = mean, na.rm = TRUE)
I used the following test data:
> q<- data.frame(y = sample(c(0,1), 10, replace=T), a = runif(10, 1, 100), b=runif(10, 20,30))
> q$a[c(2, 5, 7)]<- NA
> q$b[c(1, 3, 4)]<- NA
> q
y a b
1 0 86.87961 NA
2 0 NA 22.39432
3 0 89.38810 NA
4 0 12.96266 NA
5 1 NA 22.07757
6 0 73.96121 24.13154
7 0 NA 22.31431
8 1 62.77095 21.46395
9 0 55.28476 23.14393
10 0 14.01912 28.08305
Using your code from above, I get:
> aggregate(cbind(a,b)~y, data=q, mean, na.rm=T)
y a b
1 0 47.75503 25.11951
2 1 62.77095 21.46395
which is wrong, i.e. it deletes all rows with any NAs and then takes the mean.
This however gave the right result:
> aggregate(with(q, cbind(a, b)), by = list(q$y), mean, na.rm=T)
Group.1 a b
1 0 55.41591 24.01343
2 1 62.77095 21.77076
It applied na.rm = TRUE column by column and then took the average by group. The reason is that the formula interface to aggregate() defaults to na.action = na.omit, which removes every row containing an NA before FUN is ever called, whereas the non-formula interface hands each column to FUN as-is, so na.rm = TRUE can do its job.
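If you want to keep the formula interface, a sketch that passes the NAs through and lets FUN remove them itself:
# na.action = na.pass keeps NA rows; the anonymous FUN drops NAs per column
aggregate(cbind(a, b) ~ y, data = q, na.action = na.pass,
          FUN = function(v) mean(v, na.rm = TRUE))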
I have two tables of data in R:
a = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 2 0 0 0 2
2 3 0 0 10 3
3 4 0 51 25 0
4 5 19 129 14 0
5 6 60 137 1 0
6 7 31 62 15 5
7 8 7 11 7 0
and
b = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 1 0 0 1 266
2 2 1 0 47 335
3 3 1 26 415 142
4 4 3 965 508 5
5 5 145 2535 103 0
6 6 939 2239 15 6
7 7 420 613 86 34
8 8 46 84 36 16
I would like to calculate b/a by matching the Duration column. I thought of something like ifelse(), but it does not work. Can someone please help me?
Thanks a lot
Match the rows of y to the rows of x by duration, then do the math (in my example, x and y stand in for a and b):
x <- data.frame(duration = 2:8, v = rnorm(7))
y <- data.frame(duration = 8:1, v = rnorm(8))
# For each duration in x, find the position of the matching row in y
m <- match(x$duration, y$duration)
x$v / y$v[m]
Durations in x that have no counterpart in y come out as NA, btw.
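A merge()-based sketch sidesteps the ordering issue entirely; for the real tables (assuming both have a Duration column and identically named count columns):
# Inner join on Duration, then divide the b columns by the a columns
ab <- merge(b, a, by = "Duration", suffixes = c(".b", ".a"))
ab[grep("\\.b$", names(ab))] / ab[grep("\\.a$", names(ab))]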
Do you want something like the following?
a <- a[-1]  # drop the leading index column
b <- b[-1]
a <- a[order(a$Duration), ]
b <- b[order(b$Duration), ]
durations <- intersect(a$Duration, b$Duration)
b[b$Duration %in% durations, ] / a[a$Duration %in% durations, ]
Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
2 1 Inf NaN Inf 167.50000
3 1 Inf Inf 41.500000 47.33333
4 1 Inf 18.921569 20.320000 Inf
5 1 7.631579 19.651163 7.357143 NaN
6 1 15.650000 16.343066 15.000000 Inf
7 1 13.548387 9.887097 5.733333 6.80000
8 1 6.571429 7.636364 5.142857 Inf
You may want to replace the NaN and Inf values (which come from 0/0 and x/0) with something else.
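For example, one way to blank them out (treating the result as all-numeric):
res <- b[b$Duration %in% durations, ] / a[a$Duration %in% durations, ]
res[!is.finite(as.matrix(res))] <- NA  # replace Inf and NaN with NA
res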