Sum grouped observations based on a time window of given days in R

Here is an example of my data.frame:
df = read.table(text = 'ID Day Episode Count
28047 6000 143 7
28049 6000 143 7
29002 6000 143 7
29003 6000 143 7
30003 6000 143 7
30004 6000 143 7
32010 6000 143 7
30001 7436 47 6
33021 7436 47 6
33024 7436 47 6
33034 7436 47 6
37018 7436 47 6
40004 7436 47 6
29003 7300 111 6
30003 7300 111 6
30004 7300 111 6
32010 7300 111 6
30001 7300 111 6
33021 7300 111 6
2001 7438 54 5
19007 7438 54 5
20002 7438 54 5
22006 7438 54 5
22007 7438 54 5
32010 7301 99 5
30001 7301 99 5
33021 7301 99 5
2001 7301 99 5
19007 7301 99 5
27021 5998 158 5
28015 5998 158 5
28047 5998 158 5
28049 5998 158 5
29001 5998 158 5
21009 7437 65 4
24001 7437 65 4
25005 7437 65 4
25009 7437 65 4
14001 7435 81 4
16004 7435 81 4
17001 7435 81 4
17005 7435 81 4
21009 7299 77 4
24001 7299 77 4
25005 7299 77 4
25009 7299 77 4
29002 5996 158 4
29003 5996 158 4
27002 5996 158 4
27003 5996 158 4
33014 5999 56 3
33023 5999 56 3
25005 5999 56 3
27021 5995 246 2
33006 5995 246 2
8876 7439 765 2
5421 7439 765 2
6678 7298 68 1
34001 5994 125 1
4432 7440 841 1', header = TRUE)
What I need to do is, for each unique Day observation, take its Count value and add to it the Count values of the previous 3 days (i.e. a 4-day time window).
e.g. 1) Day = 6000: add 7 (its Count value) to the Count values of Days 5999, 5998 and 5997 (the last one not present in the df), which are respectively 3, 5 and 0 -> 7 + 3 + 5 + 0 = new_Count 15;
2) next Day = 7436: add 6 to the Count values of Days 7435, 7434 and 7433 -> 6 + 4 + 0 + 0 = new_Count 10;
and so on up to the last Day within df.
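As a quick check of the day-level arithmetic described above, the rolling 4-day sums can be sketched in base R (one row per Day, since Count is constant within a Day):
# for each unique Day, add its Count to the Counts of the 3 preceding days present in df
days <- unique(df[, c("Day", "Count")])
sapply(days$Day, function(d) sum(days$Count[days$Day <= d & days$Day > d - 4]))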
Desired output:
ID Day new_Episode new_Count
2001 7438 1 19
19007 7438 1 19
20002 7438 1 19
22006 7438 1 19
22007 7438 1 19
21009 7437 1 19
24001 7437 1 19
25005 7437 1 19
25009 7437 1 19
30001 7436 1 19
33021 7436 1 19
33024 7436 1 19
33034 7436 1 19
37018 7436 1 19
40004 7436 1 19
14001 7435 1 19
16004 7435 1 19
17001 7435 1 19
17005 7435 1 19
8876 7439 2 17
5421 7439 2 17
2001 7438 2 17
19007 7438 2 17
20002 7438 2 17
22006 7438 2 17
22007 7438 2 17
21009 7437 2 17
24001 7437 2 17
25005 7437 2 17
25009 7437 2 17
30001 7436 2 17
33021 7436 2 17
33024 7436 2 17
33034 7436 2 17
37018 7436 2 17
40004 7436 2 17
32010 7301 3 16
30001 7301 3 16
33021 7301 3 16
2001 7301 3 16
19007 7301 3 16
29003 7300 3 16
30003 7300 3 16
30004 7300 3 16
32010 7300 3 16
30001 7300 3 16
33021 7300 3 16
21009 7299 3 16
24001 7299 3 16
25005 7299 3 16
25009 7299 3 16
6678 7298 3 16
28047 6000 4 15
28049 6000 4 15
29002 6000 4 15
29003 6000 4 15
30003 6000 4 15
30004 6000 4 15
32010 6000 4 15
33014 5999 4 15
33023 5999 4 15
25005 5999 4 15
27021 5998 4 15
28015 5998 4 15
28047 5998 4 15
28049 5998 4 15
29001 5998 4 15
21009 7437 5 14
24001 7437 5 14
25005 7437 5 14
25009 7437 5 14
30001 7436 5 14
33021 7436 5 14
33024 7436 5 14
33034 7436 5 14
37018 7436 5 14
40004 7436 5 14
14001 7435 5 14
16004 7435 5 14
17001 7435 5 14
17005 7435 5 14
4432 7440 6 12
8876 7439 6 12
5421 7439 6 12
2001 7438 6 12
19007 7438 6 12
20002 7438 6 12
22006 7438 6 12
22007 7438 6 12
21009 7437 6 12
24001 7437 6 12
25005 7437 6 12
25009 7437 6 12
33014 5999 7 12
33023 5999 7 12
25005 5999 7 12
27021 5998 7 12
28015 5998 7 12
28047 5998 7 12
28049 5998 7 12
29001 5998 7 12
29002 5996 7 12
29003 5996 7 12
27002 5996 7 12
27003 5996 7 12
29003 7300 8 11
30003 7300 8 11
30004 7300 8 11
32010 7300 8 11
30001 7300 8 11
33021 7300 8 11
21009 7299 8 11
24001 7299 8 11
25005 7299 8 11
25009 7299 8 11
6678 7298 8 11
27021 5998 9 11
28015 5998 9 11
28047 5998 9 11
28049 5998 9 11
29001 5998 9 11
29002 5996 9 11
29003 5996 9 11
27002 5996 9 11
27003 5996 9 11
27021 5995 9 11
33006 5995 9 11
30001 7436 10 10
33021 7436 10 10
33024 7436 10 10
33034 7436 10 10
37018 7436 10 10
40004 7436 10 10
14001 7435 10 10
16004 7435 10 10
17001 7435 10 10
17005 7435 10 10
29002 5996 11 7
29003 5996 11 7
27002 5996 11 7
27003 5996 11 7
27021 5995 11 7
33006 5995 11 7
34001 5994 11 7
21009 7299 12 5
24001 7299 12 5
25005 7299 12 5
25009 7299 12 5
6678 7298 12 5
14001 7435 13 4
16004 7435 13 4
17001 7435 13 4
17005 7435 13 4
27021 5995 14 3
33006 5995 14 3
34001 5994 14 3
6678 7298 15 1
34001 5994 16 1
Note that output_df is larger than df (but that's ok); it is ranked by -new_Count and -Day, with the new_Episode column assigned according to the -new_Count ranking.
Any suggestions?

So I'm not sure why output_df has more rows than the original data.frame, but we can use the by function along with subset to calculate new_Count. Note that I've called your data.frame df1 instead of df.
output_df1 <- do.call('rbind', by(df1, list(df1$Day, df1$ID), FUN = function(d){
  # grab the rows of df1 from the previous 3 days
  sub_df <- subset(df1, Day < d$Day & Day > (d$Day - 4))
  # keep one (Day, Episode, Count) row per previous day
  sub_df_u <- unique(sub_df[, -1])
  # new_Count = this row's Count plus the previous days' Counts
  d$new_Count <- sum(sub_df_u$Count) + d$Count
  d
}))
head(output_df1)
ID Day Episode Count new_Count
14 2001 7438 54 5 15
28 14001 7435 81 4 4
29 16004 7435 81 4 4
30 17001 7435 81 4 4
31 17005 7435 81 4 4
15 19007 7438 54 5 15
To get the new_Episode column, we can use the dense_rank function from the dplyr package:
output_df1$new_Episode <- dplyr::dense_rank(-output_df1$new_Count)
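If you also want the rows ordered as in the desired output (by decreasing new_Count and then decreasing Day), a possible final step is a plain order() call, e.g.:
# sort by decreasing new_Count, then decreasing Day
output_df1 <- output_df1[order(-output_df1$new_Count, -output_df1$Day), ]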


R: sum columns with two conditions

I have two data tables. The first table is a matrix with coordinates and precipitation. It consists of four columns: latitude, longitude, precipitation and day of monitoring. An example of the table is:
latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3
The second table consists of five columns: station number, latitude, longitude, day when monitoring began, and day when monitoring ended. It looks like:
station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93
I want to sum the precipitation values from table 1, but with two conditions:
1. day_begin < day_mon < day_end: the day of monitoring (day_mon from table 1) should be greater than the begin day and less than the end day (from table 2).
2. Sum precipitation from the closest point: the distance between the point of monitoring (coordinates longitude_1 and latitude_1) and the station (coordinates longitude_2 and latitude_2) should be minimal. The distance is calculated by the formula:
R = 6400*arccos(sin(latitude_1)*sin(latitude_2) + cos(latitude_1)*cos(latitude_2))*cos(longitude_1 - longitude_2)
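For reference, a direct R transcription of that formula could look like the sketch below; it uses the coordinates exactly as given (no degree-to-radian conversion), matching the answer that follows:
# distance formula as posted; latitude/longitude used as given, radius 6400 km
dist_station <- function(latitude_1, longitude_1, latitude_2, longitude_2) {
  6400 * acos(sin(latitude_1) * sin(latitude_2) +
              cos(latitude_1) * cos(latitude_2)) * cos(longitude_1 - longitude_2)
}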
In the end, I want to get the results as a table:
station latitude_2 longitude_2 day_begin day_end Sum
15 50 93.22 34 46 188
11 86.58 85.29 15 47 100
14 93.17 63.17 31 97 116
10 88.56 61.28 15 78 182
13 45.29 77.1 24 79 136
6 69.73 99.52 13 73 126
4 45.6 77.36 28 95 108
13 92.88 62.38 9 51 192
1 65.1 64.13 7 69 125
10 60.57 86.77 34 64 172
3 53.62 60.76 23 96 193
16 87.82 59.41 38 47 183
1 47.83 95.89 21 52 104
11 75.42 46.2 38 87 151
3 55.71 55.26 2 73 111
16 71.65 96.15 36 93 146
I know how to calculate it in C++. What function should I use in R?
Thank you for your help!
I'm not sure if I solved your problem correctly... but here it goes.
I used a data.table approach.
library( tidyverse )
library( data.table )
# step 1: join days as periods
# create dummy variables so each monitoring day becomes a one-day period in dt1
dt1[, point_id := .I]
dt1[, day_begin := day_mon]
dt1[, day_end := day_mon]
setkey(dt2, day_begin, day_end)
# overlap join: find, for each point, all stations whose monitoring period covers its day
dt <- foverlaps( dt1, dt2, type = "within" )
# step 2: calculate the point-to-station distance using the formula provided in the question
dt[, distance := 6400 * acos( sin( latitude_1 ) * sin( latitude_2 ) + cos( latitude_1 ) * cos( latitude_2 ) ) * cos( longitude_1 - longitude_2 ) ]
# step 3: keep, for each point, the station with the minimal (absolute) distance
dt[ , .SD[which.min( abs( distance ) )], by = point_id ]
# point_id station latitude_2 longitude_2 day_begin day_end latitude_1 longitude_1 precipitation day_mon i.day_begin i.day_end distance
# 1: 1 1 47.83 95.89 21 52 54.17 62.15 5 34 34 34 -248.72398
# 2: 2 6 69.73 99.52 13 73 69.61 48.65 3 62 62 62 631.89228
# 3: 3 14 93.17 63.17 31 97 73.48 90.16 7 96 96 96 -1519.84886
# 4: 4 11 86.58 85.29 15 47 66.92 90.27 7 19 19 19 1371.54757
# 5: 5 11 86.58 85.29 15 47 56.19 96.46 9 25 25 25 1139.46849
# 6: 6 14 93.17 63.17 31 97 72.23 74.18 5 81 81 81 192.99264
# 7: 7 14 93.17 63.17 31 97 88.00 95.20 7 97 97 97 5822.81529
# 8: 8 3 55.71 55.26 2 73 92.44 44.41 6 18 18 18 -899.71206
# 9: 9 3 53.62 60.76 23 96 95.83 52.91 9 88 88 88 45.16237
# 10: 10 3 55.71 55.26 2 73 99.68 96.23 8 6 6 6 -78.04484
# 11: 11 14 93.17 63.17 31 97 81.91 48.32 8 96 96 96 -5467.77459
# 12: 12 3 53.62 60.76 23 96 54.66 52.70 0 62 62 62 -1361.57863
# 13: 13 11 75.42 46.20 38 87 95.31 91.82 2 84 84 84 -445.18765
# 14: 14 14 93.17 63.17 31 97 60.32 96.25 9 71 71 71 -854.86321
# 15: 15 3 53.62 60.76 23 96 97.39 47.91 7 76 76 76 1304.41634
# 16: 16 3 55.71 55.26 2 73 65.21 44.63 9 3 3 3 -7015.57516
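If the goal is the per-station sums shown in the question's result table, a possible final step (a sketch reusing dt from above) is to keep each point's nearest station and then aggregate:
# keep each point's nearest station, then sum precipitation per station
nearest <- dt[ , .SD[ which.min( abs( distance ) ) ], by = point_id ]
nearest[ , .( Sum = sum( precipitation ) ),
         by = .( station, latitude_2, longitude_2, day_begin, day_end ) ]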
Sample data
dt1 <- read.table( text = "latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3", header = TRUE ) %>%
setDT()
dt2 <- read.table( text = "station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93", header = TRUE ) %>%
setDT()

Group rows based on time window and sum observations

Here is an example of my data.frame:
df_1 = read.table(text='ID Day Episode Count
1 30001 7423 47 16
2 33021 7423 47 16
3 33024 7423 47 16
4 33034 7423 47 16
5 37018 7423 47 16
6 40004 7423 47 16
7 40011 7423 47 16
8 41028 7423 47 16
9 42001 7423 47 16
10 42011 7423 47 16
11 45003 7423 47 16
12 45004 7423 47 16
13 45005 7423 47 16
14 46006 7423 47 16
15 46008 7423 47 16
16 47004 7423 47 16
17 2001 7438 54 13
18 19007 7438 54 13
19 20002 7438 54 13
20 22006 7438 54 13
21 22007 7438 54 13
22 29002 7438 54 13
23 29003 7438 54 13
24 30001 7438 54 13
25 30004 7438 54 13
26 33023 7438 54 13
27 33029 7438 54 13
28 41006 7438 54 13
29 41020 7438 54 13
30 21009 7428 65 12
31 24001 7428 65 12
32 25005 7428 65 12
33 25009 7428 65 12
34 27002 7428 65 12
35 27003 7428 65 12
36 27009 7428 65 12
37 30001 7428 65 12
38 33023 7428 65 12
39 33029 7428 65 12
40 33050 7428 65 12
41 34003 7428 65 12
42 14001 7427 81 10
43 16004 7427 81 10
44 17001 7427 81 10
45 17005 7427 81 10
46 19001 7427 81 10
47 19006 7427 81 10
48 19007 7427 81 10
49 19010 7427 81 10
50 20001 7427 81 10
51 21009 7427 81 10
52 28047 7424 143 9
53 28049 7424 143 9
54 29002 7424 143 9
55 29003 7424 143 9
56 30003 7424 143 9
57 30004 7424 143 9
58 32010 7424 143 9
59 33014 7424 143 9
60 33023 7424 143 9
104 30001 6500 111 9
105 33021 6500 111 9
106 33024 6500 111 9
107 33034 6500 111 9
108 37018 6500 111 9
109 40004 6500 111 9
110 40011 6500 111 9
111 41028 6500 111 9
112 42001 6500 111 9
61 25005 7422 158 8
62 27021 7422 158 8
63 28015 7422 158 8
64 28047 7422 158 8
65 28049 7422 158 8
66 29001 7422 158 8
67 29002 7422 158 8
68 29003 7422 158 8
69 27002 7425 246 6
70 27003 7425 246 6
71 27021 7425 246 6
72 33006 7425 246 6
73 34001 7425 246 6
74 37019 7425 246 6
75 33014 7429 979 5
76 33021 7429 979 5
77 33024 7429 979 5
78 34001 7429 979 5
79 35010 7429 979 5
80 28022 7426 1199 5
81 34006 7426 1199 5
82 37006 7426 1199 5
83 37008 7426 1199 5
84 37018 7426 1199 5
85 29001 7437 1756 4
86 30014 7437 1756 4
87 32010 7437 1756 4
88 45004 7437 1756 4
89 4003 7430 1757 4
90 15013 7430 1757 4
91 16004 7430 1757 4
92 43007 7430 1757 4
93 7002 7434 1570 4
94 8006 7434 1570 4
95 15006 7434 1570 4
96 94001 7434 1570 4
113 33024 6499 135 4
114 33034 6499 135 4
115 37018 6499 135 4
116 40004 6499 135 4
222 3005 7440 999 2
223 3400 7440 999 2
97 3002 7433 2295 2
98 4003 7433 2295 2
99 48005 7436 3389 2
100 49004 7436 3389 2
101 8006 7431 3390 2
102 15006 7431 3390 2
104 6780 7439 22 1
103 41020 7435 4511 1', header = TRUE)
The data.frame is ordered by -Count and -Day, and this order cannot be changed.
What I need to do is group the data.frame by Day and its previous 16 days (17 days in total) and sum the observations in the Count column. So in this case the first Episode is anchored at Day 7438 (the window with the largest total Count): take 7438, 7437, 7436, 7435, etc. down to 7422 and sum the Count values. The next Day observation that cannot be included in the 17-day time window of 7438 is 6500; then look at 6500, 6499, etc. and do the same as for the 7438 group. And do the same for Days 7440 and 7439.
The thing is that if we start counting the days backwards from the top of the Day column, Day = 7423 will include Day = 7422 in its Episode, and both of them will be thrown away instead of being included in the Day = 7438 Episode.
How can we write code that first analyses all the possible Episode combinations (based on the time window) and eventually selects the one that covers the largest number of days (no more than the time window)?
Expected output:
ID Day Episode Count
1 2001 7438 1 103
2 19007 7438 1 103
3 20002 7438 1 103
4 22006 7438 1 103
5 22007 7438 1 103
6 29002 7438 1 103
7 29003 7438 1 103
8 30001 7438 1 103
9 30004 7438 1 103
10 33023 7438 1 103
11 33029 7438 1 103
12 41006 7438 1 103
13 41020 7438 1 103
14 29001 7437 1 103
15 30014 7437 1 103
16 32010 7437 1 103
17 45004 7437 1 103
18 48005 7436 1 103
19 49004 7436 1 103
20 41020 7435 1 103
21 7002 7434 1 103
22 8006 7434 1 103
23 15006 7434 1 103
24 94001 7434 1 103
25 3002 7433 1 103
26 4003 7433 1 103
27 8006 7431 1 103
28 15006 7431 1 103
29 4003 7430 1 103
30 15013 7430 1 103
31 16004 7430 1 103
32 43007 7430 1 103
33 33014 7429 1 103
34 33021 7429 1 103
35 33024 7429 1 103
36 34001 7429 1 103
37 35010 7429 1 103
38 21009 7428 1 103
39 24001 7428 1 103
40 25005 7428 1 103
41 25009 7428 1 103
42 27002 7428 1 103
43 27003 7428 1 103
44 27009 7428 1 103
45 30001 7428 1 103
46 33023 7428 1 103
47 33029 7428 1 103
48 33050 7428 1 103
49 34003 7428 1 103
50 14001 7427 1 103
51 16004 7427 1 103
52 17001 7427 1 103
53 17005 7427 1 103
54 19001 7427 1 103
55 19006 7427 1 103
56 19007 7427 1 103
57 19010 7427 1 103
58 20001 7427 1 103
59 21009 7427 1 103
60 28022 7426 1 103
61 34006 7426 1 103
62 37006 7426 1 103
63 37008 7426 1 103
64 37018 7426 1 103
65 27002 7425 1 103
66 27003 7425 1 103
67 27021 7425 1 103
68 33006 7425 1 103
69 34001 7425 1 103
70 37019 7425 1 103
71 28047 7424 1 103
72 28049 7424 1 103
73 29002 7424 1 103
74 29003 7424 1 103
75 30003 7424 1 103
76 30004 7424 1 103
77 32010 7424 1 103
78 33014 7424 1 103
79 33023 7424 1 103
80 30001 7423 1 103
81 33021 7423 1 103
82 33024 7423 1 103
83 33034 7423 1 103
84 37018 7423 1 103
85 40004 7423 1 103
86 40011 7423 1 103
87 41028 7423 1 103
88 42001 7423 1 103
89 42011 7423 1 103
90 45003 7423 1 103
91 45004 7423 1 103
92 45005 7423 1 103
93 46006 7423 1 103
94 46008 7423 1 103
95 47004 7423 1 103
96 25005 7422 1 103
97 27021 7422 1 103
98 28015 7422 1 103
99 28047 7422 1 103
100 28049 7422 1 103
101 29001 7422 1 103
102 29002 7422 1 103
103 29003 7422 1 103
104 30001 6500 2 13
105 33021 6500 2 13
106 33024 6500 2 13
107 33034 6500 2 13
108 37018 6500 2 13
109 40004 6500 2 13
110 40011 6500 2 13
111 41028 6500 2 13
112 42001 6500 2 13
113 33024 6499 2 13
114 33034 6499 2 13
115 37018 6499 2 13
116 40004 6499 2 13
117 3005 7440 3 3
118 3400 7440 3 3
119 6780 7439 3 3
My real data.frame has got >40,000 rows and >1600 episodes.
There is probably a better way to accomplish this, but something like this may get you what you need.
library(dplyr)
# break points every 17 days, counted backwards from the largest Day
groups <- seq(max(df_1$Day) + 1, min(df_1$Day), by = -17)
groups <- rev(append(groups, min(df_1$Day)))
# assign each row to the 17-day window its Day falls into
df_1$group <- groups[cut(df_1$Day, breaks = groups, labels = FALSE, right = FALSE)]
# total Count per window
df_1 <- df_1 %>%
  group_by(group) %>%
  mutate(TotalPerGroup = sum(Count))
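This fixed-breaks grouping will not always reproduce the expected Episodes exactly (as noted, there is probably a better way), but if you also want an Episode-style number per group, ranked by descending total as in the expected output, a possible follow-up sketch is:
# number the groups by descending TotalPerGroup, mirroring the Episode column
df_1 <- df_1 %>%
  ungroup() %>%
  mutate(Episode = dplyr::dense_rank(dplyr::desc(TotalPerGroup)))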

How to fix "No appropriate likelihood could be inferred" for network meta-analysis in R?

I am currently learning network meta-analysis in R with "gemtc" and "netmeta".
As I tried to fit the GLM model for the analysis, I encountered the error message "No appropriate likelihood could be inferred".
My code is:
gemtc_network_numbers <- mtc.network(data.ab = diabetes_data, treatments = treatments)
mtcmodel <- mtc.model(network = gemtc_network_numbers, type = "consistency", factor = 2.5, n.chain = 4, linearModel = "random")
mtcresults <- mtc.run(mtcmodel, n.adapt = 20000, n.iter = 100000, thin = 10, sampler = "rjags")
# View results summary
print(summary(mtcresults))
My data are:
> diabetes_data
study treatment responder samplesize
1 1 1 45 410
2 1 3 70 405
3 1 4 32 202
4 2 1 119 4096
5 2 4 154 3954
6 2 5 302 6766
7 3 2 1 196
8 3 5 8 196
9 4 1 138 2800
10 4 5 200 2826
11 5 3 799 7040
12 5 4 567 7072
13 6 1 337 5183
14 6 3 380 5230
15 7 2 163 2715
16 7 6 202 2721
17 8 1 449 2623
18 8 6 489 2646
19 9 5 29 416
20 9 6 20 424
21 10 4 177 4841
22 10 6 154 4870
23 11 3 86 3297
24 11 5 75 3272
25 12 1 102 2837
26 12 6 155 2883
27 13 4 136 2508
28 13 5 176 2511
29 14 3 665 8078
30 14 4 569 8098
31 15 2 242 4020
32 15 3 320 3979
33 16 3 37 1102
34 16 5 43 1081
35 16 6 34 2213
36 17 3 251 5059
37 17 4 216 5095
38 18 1 335 3432
39 18 6 399 3472
40 19 2 93 2167
41 19 6 115 2175
42 20 5 140 1631
43 20 6 118 1578
44 21 1 93 1970
45 21 3 97 1960
46 21 4 95 1965
47 22 2 690 5087
48 22 4 845 5074
Thanks for your help.
Angel
You have two solutions:
1- Rename your responder variable to "responders" and your samplesize variable to "sampleSize",
or
2- Use, for example: mtc.model(..., likelihood = "poisson", link = "log").
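For option 1, the renaming could look like the sketch below; gemtc infers a binomial likelihood from columns named "responders" and "sampleSize":
# rename the columns so mtc.network can infer the likelihood
names(diabetes_data)[names(diabetes_data) == "responder"] <- "responders"
names(diabetes_data)[names(diabetes_data) == "samplesize"] <- "sampleSize"
gemtc_network_numbers <- mtc.network(data.ab = diabetes_data, treatments = treatments)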

Storing third value of given column in new column for each group in R

I have a dataset that looks like this:
USER.ID ISO_DATE
1 3 2014-05-02
2 3 2014-05-05
3 3 2014-05-06
4 3 2014-05-20
5 3 2014-05-21
6 3 2014-05-24
7 3 2014-06-09
8 3 2014-06-14
9 3 2014-06-18
10 3 2014-06-26
11 3 2014-07-11
12 3 2014-07-21
13 3 2014-07-22
14 3 2014-07-25
15 3 2014-07-27
16 3 2014-08-03
17 3 2014-08-07
18 3 2014-08-12
19 3 2014-08-13
20 3 2014-08-16
21 3 2014-08-17
22 3 2014-08-20
23 3 2014-08-22
24 3 2014-08-31
25 3 2014-10-22
26 3 2014-11-19
27 3 2014-11-20
28 3 2014-11-23
29 3 2014-11-25
30 3 2014-12-06
31 3 2014-12-09
32 3 2014-12-10
33 3 2014-12-12
34 3 2014-12-14
35 3 2014-12-14
36 3 2014-12-14
37 3 2014-12-15
38 3 2014-12-16
39 3 2014-12-17
40 3 2014-12-18
41 3 2014-12-20
42 3 2015-01-08
43 3 2015-01-09
44 3 2015-01-11
45 3 2015-01-12
46 3 2015-01-14
47 3 2015-01-15
48 3 2015-01-18
49 3 2015-01-18
50 3 2015-01-19
51 3 2015-01-21
52 3 2015-01-22
53 3 2015-01-22
54 3 2015-01-23
55 3 2015-01-26
56 3 2015-01-27
57 3 2015-01-28
58 3 2015-01-29
59 3 2015-01-30
60 3 2015-01-30
61 3 2015-02-01
62 3 2015-02-02
63 3 2015-02-03
64 3 2015-02-04
65 3 2015-02-08
66 3 2015-02-09
67 3 2015-02-10
68 3 2015-02-13
69 3 2015-02-15
70 3 2015-02-16
71 3 2015-02-19
72 3 2015-02-20
73 3 2015-02-21
74 3 2015-02-23
75 3 2015-02-26
76 3 2015-02-28
77 3 2015-03-01
78 3 2015-03-11
79 3 2015-03-18
80 3 2015-03-22
81 3 2015-03-28
82 3 2015-04-03
83 3 2015-04-07
84 3 2015-04-08
85 3 2015-04-08
86 3 2015-04-15
87 3 2015-04-19
88 3 2015-04-21
89 3 2015-04-22
90 3 2015-04-24
91 3 2015-04-28
92 3 2015-05-03
93 3 2015-05-03
94 3 2015-05-04
95 3 2015-05-06
96 3 2015-05-08
97 3 2015-05-15
98 3 2015-05-16
99 3 2015-05-16
100 3 2015-05-19
101 3 2015-05-21
102 3 2015-05-21
103 3 2015-05-22
104 5 2015-02-05
105 7 2015-01-02
106 7 2015-01-03
107 7 2015-01-25
108 7 2015-02-21
109 7 2015-02-28
110 7 2015-03-02
111 7 2015-03-02
112 7 2015-03-07
113 7 2015-03-14
114 7 2015-05-01
115 9 2014-03-12
116 9 2014-03-12
117 9 2014-03-19
118 9 2014-04-10
119 9 2014-04-10
120 9 2014-04-10
121 9 2014-04-11
122 9 2014-05-30
123 9 2014-05-30
124 9 2014-06-06
125 9 2014-06-07
126 9 2014-06-14
127 9 2014-10-17
128 9 2014-10-17
129 9 2014-10-17
130 9 2014-10-17
131 9 2014-10-17
132 9 2014-10-17
133 9 2014-10-17
134 9 2014-10-19
135 9 2014-10-20
136 9 2014-10-20
137 9 2014-12-20
138 13 2014-07-08
139 13 2014-07-08
140 13 2014-07-08
141 13 2014-07-11
142 13 2014-07-11
143 13 2014-07-18
144 13 2014-07-19
145 13 2014-07-23
146 13 2014-07-23
147 13 2014-07-27
148 13 2014-07-29
149 13 2014-07-31
150 13 2014-08-02
151 13 2014-08-03
152 13 2014-08-06
153 13 2014-08-14
154 13 2014-08-14
155 13 2014-08-18
156 13 2014-08-19
157 13 2014-08-26
158 13 2014-08-30
159 13 2014-09-02
160 13 2014-09-10
161 13 2014-09-12
162 13 2014-09-13
163 13 2014-09-18
164 13 2014-09-20
165 13 2014-09-21
166 13 2014-09-24
167 13 2014-09-28
168 13 2014-09-30
169 13 2014-10-04
170 13 2014-10-09
171 13 2014-10-15
172 13 2014-10-20
173 13 2014-10-20
174 13 2014-10-20
175 13 2014-10-20
176 13 2014-10-25
177 13 2014-10-26
178 13 2014-10-29
179 13 2014-11-10
180 13 2014-11-28
181 13 2014-11-28
182 13 2014-11-28
183 13 2014-11-28
184 13 2014-11-29
185 13 2014-12-03
186 13 2014-12-05
187 13 2014-12-05
188 13 2014-12-10
189 13 2015-01-03
190 13 2015-03-08
191 13 2015-03-22
192 13 2015-04-06
193 13 2015-04-16
194 13 2015-04-21
195 13 2015-04-22
196 13 2015-04-26
197 13 2015-05-05
198 13 2015-05-07
199 13 2015-05-15
200 13 2015-05-21
201 16 2014-03-11
202 16 2014-03-13
203 16 2014-03-15
204 16 2014-04-12
205 16 2014-04-14
206 16 2014-04-23
207 16 2014-05-26
208 16 2014-05-30
209 16 2014-05-31
210 16 2014-06-10
211 16 2014-06-26
212 16 2014-08-18
213 16 2014-08-21
214 16 2014-08-24
215 16 2014-08-29
216 16 2014-09-01
217 16 2014-09-07
218 16 2014-09-15
219 16 2014-09-17
220 16 2014-09-24
221 16 2014-09-29
222 16 2014-10-06
223 16 2014-10-07
224 16 2014-10-08
225 16 2014-10-20
226 16 2014-10-20
227 16 2014-10-20
228 16 2014-11-12
229 16 2014-11-12
I want to create two new columns that store the 3rd and 6th values of ISO_DATE for each USER.ID separately.
I tried this:
users <- users %>%
  arrange(USER.ID) %>%
  group_by(USER.ID) %>%
  mutate(third_date = head(ISO_DATE, 3)) %>%
  mutate(fifth_date = head(ISO_DATE, 6))
but it is not helping. Is there a way to do this in R?
You can convert the 'ISO_DATE' column to the 'Date' class (if it is not already), group_by 'USER.ID', arrange by 'ISO_DATE', and create new columns with the 3rd and 6th observations of 'ISO_DATE':
library(dplyr)
users1 <- users %>%
  mutate(ISO_DATE = as.Date(ISO_DATE)) %>%
  group_by(USER.ID) %>%
  arrange(ISO_DATE) %>%
  mutate(third_date = ISO_DATE[3L], sixth_date = ISO_DATE[6L])
Or using data.table
library(data.table)
setDT(users)[, ISO_DATE := as.Date(ISO_DATE)
             ][order(ISO_DATE),
               c('third_date', 'sixth_date') := list(ISO_DATE[3L], ISO_DATE[6L]),
               by = USER.ID]
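If one row per USER.ID is enough, a summarise-based variant is also possible (a sketch; dplyr::nth() returns NA when a group has fewer observations than the requested position):
library(dplyr)
users_summary <- users %>%
  mutate(ISO_DATE = as.Date(ISO_DATE)) %>%
  arrange(USER.ID, ISO_DATE) %>%
  group_by(USER.ID) %>%
  summarise(third_date = nth(ISO_DATE, 3), sixth_date = nth(ISO_DATE, 6))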

How to calculate difference between different dates of purchase for different users in R

I have a dataset which contains user.id and purchase date. I need to calculate the duration between successive purchases for each user in R.
Here is what my sample data looks like:
row.names USER.ID ISO_DATE
1 1067 3 2014-05-05
2 1079 3 2014-05-06
3 1571 3 2014-05-20
4 1625 3 2014-05-21
5 1759 3 2014-05-24
6 2387 3 2014-06-09
7 2683 3 2014-06-14
8 2902 3 2014-06-18
9 3301 3 2014-06-26
10 4169 3 2014-07-11
11 5361 3 2014-07-21
12 5419 3 2014-07-22
13 5921 3 2014-07-25
14 6314 3 2014-07-27
15 7361 3 2014-08-03
16 8146 3 2014-08-07
17 10091 3 2014-08-12
18 10961 3 2014-08-13
19 13296 3 2014-08-16
20 13688 3 2014-08-17
21 15672 3 2014-08-20
22 18586 3 2014-08-22
23 24304 3 2014-08-31
24 38123 3 2014-10-22
25 50124 3 2014-11-19
26 50489 3 2014-11-20
27 52201 3 2014-11-23
28 52900 3 2014-11-25
29 61564 3 2014-12-06
30 64351 3 2014-12-09
31 65465 3 2014-12-10
32 67880 3 2014-12-12
33 69363 3 2014-12-14
34 69982 3 2014-12-14
35 70040 3 2014-12-14
36 70351 3 2014-12-15
37 72393 3 2014-12-16
38 73220 3 2014-12-17
39 75110 3 2014-12-18
40 78827 3 2014-12-20
41 112447 3 2015-01-08
42 113903 3 2015-01-09
43 114723 3 2015-01-11
44 114760 3 2015-01-12
45 115464 3 2015-01-14
46 116095 3 2015-01-15
47 118406 3 2015-01-18
48 118842 3 2015-01-18
49 119527 3 2015-01-19
50 120774 3 2015-01-21
51 120853 3 2015-01-22
52 121284 3 2015-01-22
53 121976 3 2015-01-23
54 126256 3 2015-01-26
55 126498 3 2015-01-27
56 127776 3 2015-01-28
57 128537 3 2015-01-29
58 128817 3 2015-01-30
59 129374 3 2015-01-30
60 131604 3 2015-02-01
61 132150 3 2015-02-02
62 132557 3 2015-02-03
63 132953 3 2015-02-04
64 135514 3 2015-02-08
65 136058 3 2015-02-09
66 136965 3 2015-02-10
67 140787 3 2015-02-13
68 143113 3 2015-02-15
69 143793 3 2015-02-16
70 146344 3 2015-02-19
71 147669 3 2015-02-20
72 148397 3 2015-02-21
73 151196 3 2015-02-23
74 156014 3 2015-02-26
75 161235 3 2015-02-28
76 162521 3 2015-03-01
77 177878 3 2015-03-11
78 190178 3 2015-03-18
79 199679 3 2015-03-22
80 212460 3 2015-03-28
81 221153 3 2015-04-03
82 228935 3 2015-04-07
83 230358 3 2015-04-08
84 230696 3 2015-04-08
85 250294 3 2015-04-15
86 267469 3 2015-04-19
87 270947 3 2015-04-21
88 274882 3 2015-04-22
89 282252 3 2015-04-24
90 299949 3 2015-04-28
91 323336 3 2015-05-03
92 324847 3 2015-05-03
93 326284 3 2015-05-04
94 337381 3 2015-05-06
95 346498 3 2015-05-08
96 372764 3 2015-05-15
97 376366 3 2015-05-16
98 379325 3 2015-05-16
99 386458 3 2015-05-19
100 392200 3 2015-05-21
101 393039 3 2015-05-21
102 399126 3 2015-05-22
103 106789 7 2015-01-03
104 124929 7 2015-01-25
105 148711 7 2015-02-21
106 161337 7 2015-02-28
107 163738 7 2015-03-02
108 164070 7 2015-03-02
109 170121 7 2015-03-07
110 184856 7 2015-03-14
111 314891 7 2015-05-01
112 182 9 2014-03-12
113 290 9 2014-03-19
114 549 9 2014-04-10
115 553 9 2014-04-10
116 559 9 2014-04-10
117 564 9 2014-04-11
118 1973 9 2014-05-30
119 1985 9 2014-05-30
120 2243 9 2014-06-06
121 2298 9 2014-06-07
122 2713 9 2014-06-14
123 35352 9 2014-10-17
124 35436 9 2014-10-17
125 35509 9 2014-10-17
126 35641 9 2014-10-17
127 35642 9 2014-10-17
128 35679 9 2014-10-17
129 35750 9 2014-10-17
130 36849 9 2014-10-19
131 37247 9 2014-10-20
132 37268 9 2014-10-20
133 79630 9 2014-12-20
134 3900 13 2014-07-08
135 3907 13 2014-07-08
136 4125 13 2014-07-11
137 4142 13 2014-07-11
138 5049 13 2014-07-18
139 5157 13 2014-07-19
140 5648 13 2014-07-23
141 5659 13 2014-07-23
142 6336 13 2014-07-27
143 6621 13 2014-07-29
144 6971 13 2014-07-31
145 7221 13 2014-08-02
146 7310 13 2014-08-03
147 8036 13 2014-08-06
148 11437 13 2014-08-14
149 11500 13 2014-08-14
150 14627 13 2014-08-18
151 15260 13 2014-08-19
152 22417 13 2014-08-26
153 23837 13 2014-08-30
154 24668 13 2014-09-02
155 26481 13 2014-09-10
156 26788 13 2014-09-12
157 27116 13 2014-09-13
158 27959 13 2014-09-18
159 28304 13 2014-09-20
160 28552 13 2014-09-21
161 29069 13 2014-09-24
162 30041 13 2014-09-28
163 30349 13 2014-09-30
164 31352 13 2014-10-04
165 32189 13 2014-10-09
166 34163 13 2014-10-15
167 36946 13 2014-10-20
168 36977 13 2014-10-20
169 37042 13 2014-10-20
170 37266 13 2014-10-20
171 40117 13 2014-10-25
172 40765 13 2014-10-26
173 43418 13 2014-10-29
174 47691 13 2014-11-10
175 54971 13 2014-11-28
176 55275 13 2014-11-28
177 55297 13 2014-11-28
178 55458 13 2014-11-28
179 55908 13 2014-11-29
180 59925 13 2014-12-03
181 60722 13 2014-12-05
182 61178 13 2014-12-05
183 65547 13 2014-12-10
184 107202 13 2015-01-03
185 173010 13 2015-03-08
186 199791 13 2015-03-22
187 227003 13 2015-04-06
188 252548 13 2015-04-16
189 271845 13 2015-04-21
190 274804 13 2015-04-22
191 294579 13 2015-04-26
192 332205 13 2015-05-05
193 339695 13 2015-05-07
194 373554 13 2015-05-15
195 390934 13 2015-05-21
196 203 16 2014-03-13
197 228 16 2014-03-15
198 616 16 2014-04-12
199 664 16 2014-04-14
200 851 16 2014-04-23
201 1826 16 2014-05-26
202 1969 16 2014-05-30
203 2026 16 2014-05-31
204 2419 16 2014-06-10
205 3295 16 2014-06-26
206 14030 16 2014-08-18
207 16368 16 2014-08-21
208 21239 16 2014-08-24
209 23651 16 2014-08-29
210 24533 16 2014-09-01
211 25868 16 2014-09-07
212 27408 16 2014-09-15
213 27721 16 2014-09-17
214 29076 16 2014-09-24
215 30122 16 2014-09-29
216 31622 16 2014-10-06
217 31981 16 2014-10-07
I want to add one more column that gives the difference between successive purchases for each user. I am using the ddply function but it is throwing an error.
Here is what I tried:
users_frequency <- ddply(users_ordered, "USER.ID", summarize,
                         orderfrequency = as.numeric(diff(ISO_DATE)))
If you're comfortable with dplyr instead of plyr:
df %>%
  mutate(ISO_DATE = as.Date(ISO_DATE, "%Y-%m-%d")) %>%
  group_by(USER.ID) %>%
  arrange(ISO_DATE) %>%
  mutate(lag = lag(ISO_DATE), difference = ISO_DATE - lag)
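A small variation (a sketch) skips the helper lag column and expresses the gap directly as a number of days, using the orderfrequency name from the question:
library(dplyr)
df %>%
  mutate(ISO_DATE = as.Date(ISO_DATE, "%Y-%m-%d")) %>%
  group_by(USER.ID) %>%
  arrange(ISO_DATE) %>%
  mutate(orderfrequency = as.numeric(ISO_DATE - lag(ISO_DATE)))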
