using nested group_by in dplyr - r

Given the following toy example:
set.seed(200)
h <- data.frame(T1 = sample(0:100, size = 20),
                ID = sample(c("A", "B", "C", "D"), size = 20, replace = TRUE),
                yr = sample(2006:2010, size = 20, replace = TRUE))
How can I:
1. Calculate the proportion of IDs having more than one instance per year?
2. Create a variable that increments for each ascending value of T1 per ID and year?
3. Compute the difference between consecutive T1 values (T1(2) - T1(1), T1(3) - T1(2), etc.) for each ID?
I figured out the first one:
h %>%
  group_by(yr, ID) %>%
  summarise(n = n()) %>%
  summarise(n2 = sum(n > 1), n3 = n(), n4 = n2 / n3)
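The same proportion can also be written with count(), if that reads more clearly (a sketch that should be equivalent):
library(dplyr)
h %>%
  count(yr, ID) %>%             # instances per ID within each year
  group_by(yr) %>%
  summarise(prop = mean(n > 1)) # share of IDs with more than one instance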
Now, to the last two questions - this is the desired output:
T1 ID yr Inc.var diff
1 92 A 2006 1 6
2 98 A 2006 2 0
3 41 B 2006 1 0
4 26 C 2006 1 71
5 97 C 2006 2 0
6 11 D 2006 1 56
7 67 D 2006 2 0
8 9 B 2008 1 44
9 53 B 2008 2 4
10 57 B 2008 3 19
11 76 B 2008 4 0
12 33 D 2008 etc etc
13 48 A 2009
14 58 A 2009
15 99 A 2009
16 52 B 2009
17 80 B 2009
18 13 B 2010
19 64 B 2010
20 21 C 2010

Here is how I solved the last two questions:
j <- h %>%
  group_by(ID, yr) %>%
  arrange(T1) %>%
  mutate(diff = lead(T1) - T1, inc.var = seq(length(T1))) %>%
  arrange(yr)
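One small mismatch with the desired output: lead(T1) - T1 leaves NA on the last row of each ID/year group, while the table above shows 0. A sketch of one way to get 0 instead, assuming that is the intent:
library(dplyr)
j <- h %>%
  group_by(ID, yr) %>%
  arrange(T1, .by_group = TRUE) %>%
  mutate(
    inc.var = row_number(),                    # 1, 2, ... within each ID/year
    diff = lead(T1, default = last(T1)) - T1   # last row becomes 0 instead of NA
  ) %>%
  arrange(yr)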

Related

loop to identify a sum and then get the position of that sum

I have this data.frame called EXAMPLE, with 4 variables:
date <- c(2010, 2011, 2012, 2013)
new_york <- c(10,20,22,28)
berlin <- c(0,51,45,12)
tokyo <- c(2,15,20,13)
EXAMPLE <- data.frame(date, new_york, berlin, tokyo)
I want to identify, for each column, the position at which the cumulative sum first reaches at least 50, and also to store that sum. For example, in the new_york column the cumulative sum reaches 52 at row 3.
I was thinking of something like this, but it didn't work:
x <- 1
while(sum(EXAMPLE$berlin[1:x]) <= 50) {
  a <- x
}
I'd appreciate it if someone could help.
# cumulative sums for every column except the date
out <- lapply(EXAMPLE[,-1], cumsum)
names(out) <- paste0(names(out), "_cumulative")
options(width=123, length=99999)
cbind(EXAMPLE, out)
# date new_york berlin tokyo new_york_cumulative berlin_cumulative tokyo_cumulative
# 1 2010 10 0 2 10 0 2
# 2 2011 20 51 15 30 51 17
# 3 2012 22 45 20 52 96 37
# 4 2013 28 12 13 80 108 50
Here's the equivalent tidy version of @r2evans' answer...
library(dplyr)
EXAMPLE %>%
  mutate(across(new_york:tokyo,
                cumsum,
                .names = "cumsum_{.col}"))
#> date new_york berlin tokyo cumsum_new_york cumsum_berlin cumsum_tokyo
#> 1 2010 10 0 2 10 0 2
#> 2 2011 20 51 15 30 51 17
#> 3 2012 22 45 20 52 96 37
#> 4 2013 28 12 13 80 108 50
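Neither version yet extracts the row at which the running total first reaches 50, which is what the question asked for; a small follow-up sketch (not from the original answers) using the same data:
# first position where each city's cumulative sum reaches at least 50,
# together with the value of that cumulative sum
sapply(EXAMPLE[, -1], function(v) {
  cs <- cumsum(v)
  i <- which(cs >= 50)[1]        # NA if the threshold is never reached
  c(position = i, total = cs[i])
})
#>          new_york berlin tokyo
#> position        3      2     4
#> total          52     51    50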

Creating an index for each subject in R

I'm working with some data on repeated measures of subjects over time. The data is in this format:
Subject <- as.factor(c(rep("A", 20), rep("B", 35), rep("C", 13)))
variable.A <- rnorm(n = length(Subject), mean = 300, sd = 50)
dat <- data.frame(Subject, variable.A)
dat
Subject variable.A
1 A 334.6567
2 A 353.0988
3 A 244.0863
4 A 284.8918
5 A 302.6442
6 A 298.3162
7 A 271.4864
8 A 268.6848
9 A 262.3761
10 A 341.4224
11 A 190.4823
12 A 297.1981
13 A 319.8346
14 A 343.9855
15 A 332.5318
16 A 221.9502
17 A 412.9172
18 A 283.4206
19 A 310.9847
20 A 276.5423
21 B 181.5418
22 B 340.5812
23 B 348.5162
24 B 364.6962
25 B 312.2508
26 B 278.9855
27 B 242.8810
28 B 272.9585
29 B 239.2776
30 B 254.9140
31 B 253.8940
32 B 330.1918
33 B 300.7302
34 B 237.6511
35 B 314.4919
36 B 239.6195
37 B 282.7955
38 B 260.0943
39 B 396.5310
40 B 325.5422
41 B 374.8063
42 B 363.1897
43 B 258.0310
44 B 358.8605
45 B 251.8775
46 B 299.6995
47 B 303.4766
48 B 359.8955
49 B 299.7089
50 B 289.3128
51 B 401.7680
52 B 276.8078
53 B 441.4852
54 B 232.6222
55 B 305.1977
56 C 298.4580
57 C 210.5164
58 C 272.0228
59 C 282.0540
60 C 207.8797
61 C 263.3859
62 C 324.4417
63 C 273.5904
64 C 348.4389
65 C 174.2979
66 C 363.4353
67 C 260.8548
68 C 306.1833
I've used the seq_along() function and the dplyr package to create an index of each observation for every subject:
dat <- as.data.frame(dat %>%
group_by(Subject) %>%
mutate(index = seq_along(Subject)))
Subject variable.A index
1 A 334.6567 1
2 A 353.0988 2
3 A 244.0863 3
4 A 284.8918 4
5 A 302.6442 5
6 A 298.3162 6
7 A 271.4864 7
8 A 268.6848 8
9 A 262.3761 9
10 A 341.4224 10
11 A 190.4823 11
12 A 297.1981 12
13 A 319.8346 13
14 A 343.9855 14
15 A 332.5318 15
16 A 221.9502 16
17 A 412.9172 17
18 A 283.4206 18
19 A 310.9847 19
20 A 276.5423 20
21 B 181.5418 1
22 B 340.5812 2
23 B 348.5162 3
24 B 364.6962 4
25 B 312.2508 5
26 B 278.9855 6
27 B 242.8810 7
28 B 272.9585 8
29 B 239.2776 9
30 B 254.9140 10
31 B 253.8940 11
32 B 330.1918 12
33 B 300.7302 13
34 B 237.6511 14
35 B 314.4919 15
36 B 239.6195 16
37 B 282.7955 17
38 B 260.0943 18
39 B 396.5310 19
40 B 325.5422 20
41 B 374.8063 21
42 B 363.1897 22
43 B 258.0310 23
44 B 358.8605 24
45 B 251.8775 25
46 B 299.6995 26
47 B 303.4766 27
48 B 359.8955 28
49 B 299.7089 29
50 B 289.3128 30
51 B 401.7680 31
52 B 276.8078 32
53 B 441.4852 33
54 B 232.6222 34
55 B 305.1977 35
56 C 298.4580 1
57 C 210.5164 2
58 C 272.0228 3
59 C 282.0540 4
60 C 207.8797 5
61 C 263.3859 6
62 C 324.4417 7
63 C 273.5904 8
64 C 348.4389 9
65 C 174.2979 10
66 C 363.4353 11
67 C 260.8548 12
68 C 306.1833 13
What I'm now looking to do is set up an analysis that looks at every 10 observations, so I'd like to create another column that basically gives me a number for every 10 observations. For example, Subject A would have a sequence of ten "1's" followed by a sequence of ten "2's" (i.e., two groupings of 10). I've tried to use the rep() function, but the issue I'm running into is that the other subjects don't have a number of observations that is divisible by 10.
Is there a way for the rep() function to just assign the next number to the final grouping, even if it doesn't have 10 total observations? For example, Subject B would have ten "1's", ten "2's", ten "3's" and then five "4's" (representing its last, partial group of observations)?
You can use integer division %/% to generate the ids:
dat %>%
group_by(Subject) %>%
mutate(chunk_id = (seq_along(Subject) - 1) %/% 10 + 1) -> dat1
table(dat1$Subject, dat1$chunk_id)
# 1 2 3 4
# A 10 10 0 0
# B 10 10 10 5
# C 10 3 0 0
For a plain vanilla base R solution, you could also try this:
dat$newcol <- 1
dat$index <- ave(dat$newcol, dat$Subject, FUN = cumsum)
dat$chunk_id <- (dat$index - 1) %/% 10 + 1
which, when you run the table command as above, gives you
table(dat$Subject, dat$chunk_id)
1 2 3 4
A 10 10 0 0
B 10 10 10 5
C 10 3 0 0
If you don't want the extra 'newcol' column, just use 'NULL' to get rid of it:
dat$newcol <- NULL
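To answer the rep() part of the question directly, here is a sketch (not from the original answers): rep() can produce the same chunk ids if you let length.out truncate the final group.
library(dplyr)
dat %>%
  group_by(Subject) %>%
  mutate(chunk_id = rep(seq_len(ceiling(n() / 10)),
                        each = 10, length.out = n()))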

How to delete observations in R based on the criterion that observations have the same value?

I have the following data frame, from which I would like to remove observations that meet three criteria: same x, same y, and z >= 60.
df <- data.frame(x=c(1,1,2,2,3,3,4,4),
y=c(2011,2012,2011,2011,2013,2014,2011,2012),
z=c(15,15,60,60,15,15,30,15))
> df
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 2 2011 60
5 3 2013 15
6 3 2014 15
7 4 2011 30
8 4 2012 15
The data frame I'm looking for is thus (which one of the x=2 observations is removed doesn't matter):
> df1
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 3 2013 15
5 3 2014 15
6 4 2011 30
7 4 2012 15
My first thoughts included using unique() or duplicated(), but I cannot seem to understand how to implement it in practice.
This should do the trick. Look for duplicated x and y entries where z is also greater than or equal to 60:
df[!(duplicated(df[,1:2]) & df$z >= 60), ]
# x y z
#1 1 2011 15
#2 1 2012 15
#3 2 2011 60
#5 3 2013 15
#6 3 2014 15
#7 4 2011 30
#8 4 2012 15
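For completeness, a dplyr sketch of the same logic (an alternative, not part of the original answer):
library(dplyr)
# keep the first row of each x/y pair; drop later rows only when z >= 60
df %>%
  group_by(x, y) %>%
  filter(row_number() == 1 | z < 60) %>%
  ungroup()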

fill the time gap in data frame in r

I have a data set including the following info:
id class year n
25 A63 2006 3
25 F16 2006 1
39 0901 2001 1
39 0903 2001 3
39 0903 2003 2
39 1901 2003 1
...
There are about 100k different ids and more than 300 classes. The year varies from 1998 to 2007.
What I want to do is fill the time gap, from the year each id and class first appears onward, with n = 0 by id and class.
Then calculate the sum of n and the quantity of classes.
For example, the above six lines of data should expand to the following table:
id class year n sum Qc Qs
25 A63 2006 3 3 2 2
25 F16 2006 1 1 2 2
25 A63 2007 0 3 0 2
25 F16 2007 0 1 0 2
39 0901 2001 1 1 2 2
39 0903 2001 3 3 2 2
39 0901 2002 0 1 0 2
39 0903 2002 0 3 0 2
39 0901 2003 0 1 2 3
39 0903 2003 2 5 2 3
39 1901 2003 1 1 2 3
39 0901 2004 0 1 0 3
39 0903 2004 0 5 0 3
39 1901 2004 0 1 0 3
...
39 0901 2007 0 1 0 3
39 0903 2007 0 5 0 3
39 1901 2007 0 1 0 3
I can solve it with an ugly for loop, but it takes about an hour to get the result. Is there a better way to do it, either vectorized or using data.table?
Using dplyr you could try:
library(dplyr)
df %>% group_by(class, id) %>% arrange(year) %>%
  do(merge(data.frame(year = .$year[1]:2007,
                      id = rep(.$id[1], 2007 - .$year[1] + 1),
                      class = rep(.$class[1], 2007 - .$year[1] + 1)),
           ., all.x = TRUE))
It groups the data by class and id, and merges each group with a data frame containing all the years from that group's first year to 2007, along with the group's id and class.
Edit: if you want to do this only after a certain id you could do:
as.data.frame(rbind(df[df$id<=25,],df%>% filter(id>25) %>% group_by(class,id) %>% arrange(year) %>%
do(merge(data.frame(year=c(.$year[1]:2007),id=rep(.$id[1],2007-.$year[1]+1),class=rep(.$class[1],2007-.$year[1]+1)),.,all.x=T))))
Use expand.grid to get the Cartesian product of class and year.
Then merge your current data frame with this new one, and do the classic subset replacement.
df <- data.frame(class = as.factor(c("A63","F16","0901","0903","0903","1901")),
year = c(2006,2006,2001,2001,2003,2003),
n=c(3,1,1,3,2,1))
df2 <- expand.grid(class = levels(df$class),
                   year = 1998:2007)
df2 <- merge(df2,df, all.x=TRUE)
df2$n[is.na(df2$n)] <- 0
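With current tidyr, complete() can do the expansion and the zero-filling in one step; a sketch along the lines of this answer, using the 1998-2007 range from the question:
library(tidyr)
# expand every class/year combination over 1998:2007 and fill n with 0
complete(df, class, year = 1998:2007, fill = list(n = 0))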

Identical quadratic and cubic predictions

For the following data:
require(dplyr)
require(ggplot2)
ds <- read.table(header = TRUE, text ="
obs id year attend
1 47 2000 1
2 47 2001 3
3 47 2002 5
4 47 2003 8
5 47 2004 6
6 47 2005 4
7 47 2006 2
8 47 2007 1
9 47 2008 2
10 47 2009 3
11 47 2010 4
12 47 2011 5
")
print(ds)
I would like to compute predicted values from linear, quadratic, and cubic models:
linear<- predict(lm(attend ~ year, ds))
quadratic<- predict(lm(attend ~ year + I(year^2),ds))
cubic<- predict(lm(attend ~ year + I(year^2) + I(year^3),ds))
ds<- ds %>% dplyr::mutate(linear=linear, quadratic=quadratic, cubic=cubic)
print(ds)
obs id year attend linear quadratic cubic
1 1 47 2000 1 3.820513 3.500000 3.500000
2 2 47 2001 3 3.792541 3.646853 3.646853
3 3 47 2002 5 3.764569 3.758741 3.758741
4 4 47 2003 8 3.736597 3.835664 3.835664
5 5 47 2004 6 3.708625 3.877622 3.877622
6 6 47 2005 4 3.680653 3.884615 3.884615
7 7 47 2006 2 3.652681 3.856643 3.856643
8 8 47 2007 1 3.624709 3.793706 3.793706
9 9 47 2008 2 3.596737 3.695804 3.695804
10 10 47 2009 3 3.568765 3.562937 3.562937
11 11 47 2010 4 3.540793 3.395105 3.395105
12 12 47 2011 5 3.512821 3.192308 3.192308
Question: Despite the fact that the time series has a clear cubic shape, the quadratic and cubic predictions are identical. Why? Is this a mistake?
This is because 2011^3 is a very big number, which makes the year, I(year^2) and I(year^3) columns nearly collinear on this scale, so lm() treats the model as rank-deficient and the I(year^3) coefficient is returned as NA. If you had inspected the models, you would have noticed this.
coef(lm(attend ~ year + I(year^2) + I(year^3),ds))
# (Intercept) year I(year^2) I(year^3)
# -7.025524e+04 7.009441e+01 -1.748252e-02 NA
It is more sensible to use poly() to create orthogonal polynomials:
linear<- predict(lm(attend ~ year, ds))
quadratic<- predict(lm(attend ~ poly(year,2),ds))
cubic<- predict(lm(attend ~ poly(year,3),ds))
ds<- (ds %>% dplyr::mutate(linear=linear, quadratic=quadratic, cubic=cubic))
ds
# obs id year attend linear quadratic cubic
# 1 1 47 2000 1 3.820513 3.500000 0.7435897
# 2 2 47 2001 3 3.792541 3.646853 3.8974359
# 3 3 47 2002 5 3.764569 3.758741 5.5128205
# 4 4 47 2003 8 3.736597 3.835664 5.9238539
# 5 5 47 2004 6 3.708625 3.877622 5.4646465
# 6 6 47 2005 4 3.680653 3.884615 4.4693085
# 7 7 47 2006 2 3.652681 3.856643 3.2719503
# 8 8 47 2007 1 3.624709 3.793706 2.2066822
# 9 9 47 2008 2 3.596737 3.695804 1.6076146
# 10 10 47 2009 3 3.568765 3.562937 1.8088578
# 11 11 47 2010 4 3.540793 3.395105 3.1445221
# 12 12 47 2011 5 3.512821 3.192308 5.9487179
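Another option, not from the original answer, is to center year before building the raw polynomials; a sketch assuming the goal is simply to avoid the rank deficiency:
# centering shrinks the magnitudes, so I(yr_c^3) is no longer dropped as collinear
ds <- ds %>% dplyr::mutate(yr_c = year - mean(year))
coef(lm(attend ~ yr_c + I(yr_c^2) + I(yr_c^3), ds))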

Resources