My dataset has repeated observations for people who work on projects. I need a data frame with two columns listing 'combinations' of projects for each person and time point. Let me explain with an example:
This is my data:
ID Week Project
01 1 101
01 1 102
01 1 103
01 2 101
01 2 102
02 1 101
02 1 102
02 2 101
Person 1 (ID = 1) worked on three projects in week 1. This means that there are nine possible ordered pairs of projects (project_i, project_j), including self-pairs, for this person in this week.
This is what I need
ID Week Project_i Project_j
01 1 101 101
01 1 101 102
01 1 101 103
01 1 102 101
01 1 102 102
01 1 102 103
01 1 103 101
01 1 103 102
01 1 103 103
01 2 101 101
01 2 101 102
01 2 102 101
01 2 102 102
02 1 101 101
02 1 101 102
02 1 102 101
02 1 102 102
02 2 101 101
Losing cases that only have one project per week is not an issue.
I have tried base R and reshape2 for a bit, but I can't figure this out.
Here's one way:
library(data.table)
setDT(DT)
DT[, CJ(P1 = Project, P2 = Project)[P1 != P2], by=.(ID, Week)]
ID Week P1 P2
1: 1 1 101 102
2: 1 1 101 103
3: 1 1 102 101
4: 1 1 102 103
5: 1 1 103 101
6: 1 1 103 102
7: 1 2 101 102
8: 1 2 102 101
9: 2 1 101 102
10: 2 1 102 101
CJ is the Cartesian Join of two vectors, taking all combinations.
If you don't want both (101,102) and (102,101), use P1 > P2 instead of P1 != P2. (The question was later updated to include self-pairs, so in that case use P1 <= P2.)
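For reference, here is a self-contained sketch with the sample data (the names Project_i/Project_j are chosen to match the question); keeping the full cross join, with no filter at all, reproduces the self-pairs and both orderings shown in the desired output:

```r
library(data.table)

DT <- data.table(
  ID      = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Week    = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L),
  Project = c(101L, 102L, 103L, 101L, 102L, 101L, 102L, 101L)
)

# Full cross join per (ID, Week): self-pairs and both orderings included
res <- DT[, CJ(Project_i = Project, Project_j = Project), by = .(ID, Week)]

# Person 1 has 3 projects in week 1, so 3 * 3 = 9 ordered pairs
nrow(res[ID == 1 & Week == 1])  # 9
```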
Here is a solution that uses dplyr and tidyr. The key step is tidyr::complete() combined with dplyr::group_by().
library(dplyr)
library(tidyr)
d %>%
  rename(Project_i = Project) %>%
  mutate(Project_j = Project_i) %>%
  group_by(ID, Week) %>%
  complete(Project_i, Project_j) %>%
  filter(Project_i != Project_j)
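If the self-pairs from the desired output should be kept, the same idea works without the final filter step; a minimal sketch using tidyr::expand() on a small piece of the example data:

```r
library(dplyr)
library(tidyr)

d <- data.frame(ID      = c(1, 1, 1, 2),
                Week    = c(1, 1, 1, 1),
                Project = c(101, 102, 103, 101))

# expand() builds all combinations of the given vectors within each group
out <- d %>%
  group_by(ID, Week) %>%
  expand(Project_i = Project, Project_j = Project)

nrow(out)  # 9 ordered pairs for ID 1 plus 1 self-pair for ID 2 = 10
```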
Here's a base option using expand.grid:
do.call(rbind, lapply(split(df, paste(df$ID, df$Week)), function(x){
  x2 <- expand.grid(ID = unique(x$ID),
                    Week = unique(x$Week),
                    Project_i = unique(x$Project),
                    Project_j = unique(x$Project))
  # omit this filter if (101, 102) differs from (102, 101); use `<` if (101, 101) is not possible
  x2[x2$Project_i <= x2$Project_j, ]
}))
# ID Week Project_i Project_j
# 1 1.1 1 1 101 101
# 1 1.4 1 1 101 102
# 1 1.5 1 1 102 102
# 1 1.7 1 1 101 103
# 1 1.8 1 1 102 103
# 1 1.9 1 1 103 103
# 1 2.1 1 2 101 101
# 1 2.3 1 2 101 102
# 1 2.4 1 2 102 102
# 2 1.1 2 1 101 101
# 2 1.3 2 1 101 102
# 2 1.4 2 1 102 102
# 2 2 2 2 101 101
I have a big dataset with over 1000 subjects; a small piece of it looks like:
mydata <- read.table(header=TRUE, text="
Id DAYS QS Event
01 50 1 1
01 57 4 1
01 70 1 1
01 78 2 1
01 85 3 1
02 70 2 1
02 92 4 1
02 98 5 1
02 105 6 1
02 106 7 0
")
I would like to get the row number of the observation 28 or more days prior to the last observation. E.g. for id = 01, the last observation is 85; 85 minus 28 is 57, which is row number 2. For id = 02, the last observation is 106; 106 minus 28 is 78, and because 78 does not exist we use the row number of 70, which is 1 (I will be getting the row number for each observation separately), that is, the first observation for id = 02.
This should work:
mydata %>%
  group_by(Id) %>%
  mutate(max = max(DAYS) - 28,
         row_number = last(which(DAYS <= max)))
# A tibble: 10 x 6
# Groups: Id [2]
Id DAYS QS Event max row_number
<int> <int> <int> <int> <dbl> <int>
1 1 50 1 1 57 2
2 1 57 4 1 57 2
3 1 70 1 1 57 2
4 1 78 2 1 57 2
5 1 85 3 1 57 2
6 2 70 2 1 78 1
7 2 92 4 1 78 1
8 2 98 5 1 78 1
9 2 105 6 1 78 1
10 2 106 7 0 78 1
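The same logic can also be sketched in base R with ave() (illustrative only; it assumes every Id has at least one observation 28 or more days before its last one, otherwise which() returns an empty vector):

```r
mydata <- read.table(header = TRUE, text = "
Id DAYS QS Event
01 50 1 1
01 57 4 1
01 70 1 1
01 78 2 1
01 85 3 1
02 70 2 1
02 92 4 1
02 98 5 1
02 105 6 1
02 106 7 0
")

# For each Id, take the last row whose DAYS is <= max(DAYS) - 28
mydata$row_number <- ave(mydata$DAYS, mydata$Id,
                         FUN = function(d) max(which(d <= max(d) - 28)))
mydata$row_number
#> [1] 2 2 2 2 2 1 1 1 1 1
```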
I am looking for a method to find the associations between words in a table (or list). Each cell of the table contains several words separated by ";".
Let's say I have a table as below; some words, e.g. 'af' or 'aa', belong to one cell.
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
I want to find associations between all the words in the entire dataset, across cells (I am not interested in within-cell associations). For example, how many times do aa and dd appear in one row, or which words have the highest association (e.g. aa with bb, aa with dd, ...).
Expected output (the numbers may be inaccurate, and the associations do not have to be shown with '--'):
2-pair associations (the numbers can be counts, probabilities, or normalized associations)
association number of associations
aa--dd 3
aa--c 3
bb--dd 2
...
3-pair associations
aa--bb--dd 3
aa--bb--c 3
...
4-pair associations
aa--bb--c--dd 2
aa--bf--c--dd 2
...
Can you help me implement this in R? Thanks!
I am not sure if you have something like the approach below in mind. It is basically a custom function used in a nested purrr::map call. The outer call loops over the number of pairs (2, 3, 4), and the inner call uses combn to create all possible column combinations as input, then applies the custom function to create the desired output.
library(tidyverse)
count_pairs <- function(x) {
  s <- seq(x)
  df[, x] %>%
    reduce(s, separate_rows, .init = ., sep = ";") %>%
    group_by(across()) %>%
    count() %>%
    rename(set_names(s))
}
map(2:4,
    ~ map_dfr(combn(1:4, .x, simplify = FALSE),
              count_pairs) %>%
      arrange(-n))
#> [[1]]
#> # A tibble: 50 x 3
#> # Groups: 1, 2 [50]
#> `1` `2` n
#> <chr> <chr> <int>
#> 1 aa dd 4
#> 2 aa bf 3
#> 3 aa c 3
#> 4 bf dd 3
#> 5 c dd 3
#> 6 aa bb 2
#> 7 af bf 2
#> 8 az bf 2
#> 9 aa cc 2
#> 10 af cc 2
#> # ... with 40 more rows
#>
#> [[2]]
#> # A tibble: 70 x 4
#> # Groups: 1, 2, 3 [70]
#> `1` `2` `3` n
#> <chr> <chr> <chr> <int>
#> 1 aa bf dd 3
#> 2 aa c dd 3
#> 3 aa bb c 2
#> 4 aa bf c 2
#> 5 aa bf cc 2
#> 6 af bf cc 2
#> 7 az bf c 2
#> 8 aa bb dd 2
#> 9 af bf dd 2
#> 10 az bf dd 2
#> # ... with 60 more rows
#>
#> [[3]]
#> # A tibble: 35 x 5
#> # Groups: 1, 2, 3, 4 [35]
#> `1` `2` `3` `4` n
#> <chr> <chr> <chr> <chr> <int>
#> 1 aa bb c dd 2
#> 2 aa bf c dd 2
#> 3 aa bf cc dd 2
#> 4 af bf cc dd 2
#> 5 az bf c dd 2
#> 6 aa bb c df 1
#> 7 aa bb cc dd 1
#> 8 aa bb cc df 1
#> 9 aa bb cd dd 1
#> 10 aa bc c dc 1
#> # ... with 25 more rows
# the data
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
Created on 2021-08-11 by the reprex package (v2.0.1)
You can try the base R code below
lst <- apply(df, 1, function(x) sort(unlist(strsplit(x, ";"))))
res <- lapply(
2:4,
function(k) {
setNames(
data.frame(
table(
unlist(
sapply(
lst,
function(v) combn(v, k, paste0, collapse = "--")
)
)
)
), c("association", "count")
)
}
)
and you will see that
> res
[[1]]
association count
1 aa--af 2
2 aa--ah 1
3 aa--al 1
4 aa--az 2
5 aa--bb 2
6 aa--bc 1
7 aa--bf 3
8 aa--c 3
9 aa--cc 2
10 aa--cd 1
11 aa--dc 1
12 aa--dd 4
13 aa--df 1
14 af--az 1
15 af--bb 1
16 af--bf 2
17 af--c 1
18 af--cc 2
19 af--dd 2
20 af--df 1
21 ah--al 1
22 ah--bb 1
23 ah--c 1
24 ah--cd 1
25 ah--dd 1
26 al--bb 1
27 al--c 1
28 al--cd 1
29 al--dd 1
30 az--bb 1
31 az--bc 1
32 az--bf 2
33 az--c 2
34 az--cc 1
35 az--dc 1
36 az--dd 2
37 az--df 1
38 bb--bf 1
39 bb--c 2
40 bb--cc 1
41 bb--cd 1
42 bb--dd 2
43 bb--df 1
44 bc--bf 1
45 bc--c 1
46 bc--dc 1
47 bc--dd 1
48 bf--c 2
49 bf--cc 2
50 bf--dc 1
51 bf--dd 3
52 bf--df 1
53 c--cc 1
54 c--cd 1
55 c--dc 1
56 c--dd 3
57 c--df 1
58 cc--dd 2
59 cc--df 1
60 cd--dd 1
61 dc--dd 1
62 dd--df 1
[[2]]
association count
1 aa--af--az 1
2 aa--af--bb 1
3 aa--af--bf 2
4 aa--af--c 1
5 aa--af--cc 2
6 aa--af--dd 2
7 aa--af--df 1
8 aa--ah--al 1
9 aa--ah--bb 1
10 aa--ah--c 1
11 aa--ah--cd 1
12 aa--ah--dd 1
13 aa--al--bb 1
14 aa--al--c 1
15 aa--al--cd 1
16 aa--al--dd 1
17 aa--az--bb 1
18 aa--az--bc 1
19 aa--az--bf 2
20 aa--az--c 2
21 aa--az--cc 1
22 aa--az--dc 1
23 aa--az--dd 2
24 aa--az--df 1
25 aa--bb--bf 1
26 aa--bb--c 2
27 aa--bb--cc 1
28 aa--bb--cd 1
29 aa--bb--dd 2
30 aa--bb--df 1
31 aa--bc--bf 1
32 aa--bc--c 1
33 aa--bc--dc 1
34 aa--bc--dd 1
35 aa--bf--c 2
36 aa--bf--cc 2
37 aa--bf--dc 1
38 aa--bf--dd 3
39 aa--bf--df 1
40 aa--c--cc 1
41 aa--c--cd 1
42 aa--c--dc 1
43 aa--c--dd 3
44 aa--c--df 1
45 aa--cc--dd 2
46 aa--cc--df 1
47 aa--cd--dd 1
48 aa--dc--dd 1
49 aa--dd--df 1
50 af--az--bb 1
51 af--az--bf 1
52 af--az--c 1
53 af--az--cc 1
54 af--az--dd 1
55 af--az--df 1
56 af--bb--bf 1
57 af--bb--c 1
58 af--bb--cc 1
59 af--bb--dd 1
60 af--bb--df 1
61 af--bf--c 1
62 af--bf--cc 2
63 af--bf--dd 2
64 af--bf--df 1
65 af--c--cc 1
66 af--c--dd 1
67 af--c--df 1
68 af--cc--dd 2
69 af--cc--df 1
70 af--dd--df 1
71 ah--al--bb 1
72 ah--al--c 1
73 ah--al--cd 1
74 ah--al--dd 1
75 ah--bb--c 1
76 ah--bb--cd 1
77 ah--bb--dd 1
78 ah--c--cd 1
79 ah--c--dd 1
80 ah--cd--dd 1
81 al--bb--c 1
82 al--bb--cd 1
83 al--bb--dd 1
84 al--c--cd 1
85 al--c--dd 1
86 al--cd--dd 1
87 az--bb--bf 1
88 az--bb--c 1
89 az--bb--cc 1
90 az--bb--dd 1
91 az--bb--df 1
92 az--bc--bf 1
93 az--bc--c 1
94 az--bc--dc 1
95 az--bc--dd 1
96 az--bf--c 2
97 az--bf--cc 1
98 az--bf--dc 1
99 az--bf--dd 2
100 az--bf--df 1
101 az--c--cc 1
102 az--c--dc 1
103 az--c--dd 2
104 az--c--df 1
105 az--cc--dd 1
106 az--cc--df 1
107 az--dc--dd 1
108 az--dd--df 1
109 bb--bf--c 1
110 bb--bf--cc 1
111 bb--bf--dd 1
112 bb--bf--df 1
113 bb--c--cc 1
114 bb--c--cd 1
115 bb--c--dd 2
116 bb--c--df 1
117 bb--cc--dd 1
118 bb--cc--df 1
119 bb--cd--dd 1
120 bb--dd--df 1
121 bc--bf--c 1
122 bc--bf--dc 1
123 bc--bf--dd 1
124 bc--c--dc 1
125 bc--c--dd 1
126 bc--dc--dd 1
127 bf--c--cc 1
128 bf--c--dc 1
129 bf--c--dd 2
130 bf--c--df 1
131 bf--cc--dd 2
132 bf--cc--df 1
133 bf--dc--dd 1
134 bf--dd--df 1
135 c--cc--dd 1
136 c--cc--df 1
137 c--cd--dd 1
138 c--dc--dd 1
139 c--dd--df 1
140 cc--dd--df 1
[[3]]
association count
1 aa--af--az--bb 1
2 aa--af--az--bf 1
3 aa--af--az--c 1
4 aa--af--az--cc 1
5 aa--af--az--dd 1
6 aa--af--az--df 1
7 aa--af--bb--bf 1
8 aa--af--bb--c 1
9 aa--af--bb--cc 1
10 aa--af--bb--dd 1
11 aa--af--bb--df 1
12 aa--af--bf--c 1
13 aa--af--bf--cc 2
14 aa--af--bf--dd 2
15 aa--af--bf--df 1
16 aa--af--c--cc 1
17 aa--af--c--dd 1
18 aa--af--c--df 1
19 aa--af--cc--dd 2
20 aa--af--cc--df 1
21 aa--af--dd--df 1
22 aa--ah--al--bb 1
23 aa--ah--al--c 1
24 aa--ah--al--cd 1
25 aa--ah--al--dd 1
26 aa--ah--bb--c 1
27 aa--ah--bb--cd 1
28 aa--ah--bb--dd 1
29 aa--ah--c--cd 1
30 aa--ah--c--dd 1
31 aa--ah--cd--dd 1
32 aa--al--bb--c 1
33 aa--al--bb--cd 1
34 aa--al--bb--dd 1
35 aa--al--c--cd 1
36 aa--al--c--dd 1
37 aa--al--cd--dd 1
38 aa--az--bb--bf 1
39 aa--az--bb--c 1
40 aa--az--bb--cc 1
41 aa--az--bb--dd 1
42 aa--az--bb--df 1
43 aa--az--bc--bf 1
44 aa--az--bc--c 1
45 aa--az--bc--dc 1
46 aa--az--bc--dd 1
47 aa--az--bf--c 2
48 aa--az--bf--cc 1
49 aa--az--bf--dc 1
50 aa--az--bf--dd 2
51 aa--az--bf--df 1
52 aa--az--c--cc 1
53 aa--az--c--dc 1
54 aa--az--c--dd 2
55 aa--az--c--df 1
56 aa--az--cc--dd 1
57 aa--az--cc--df 1
58 aa--az--dc--dd 1
59 aa--az--dd--df 1
60 aa--bb--bf--c 1
61 aa--bb--bf--cc 1
62 aa--bb--bf--dd 1
63 aa--bb--bf--df 1
64 aa--bb--c--cc 1
65 aa--bb--c--cd 1
66 aa--bb--c--dd 2
67 aa--bb--c--df 1
68 aa--bb--cc--dd 1
69 aa--bb--cc--df 1
70 aa--bb--cd--dd 1
71 aa--bb--dd--df 1
72 aa--bc--bf--c 1
73 aa--bc--bf--dc 1
74 aa--bc--bf--dd 1
75 aa--bc--c--dc 1
76 aa--bc--c--dd 1
77 aa--bc--dc--dd 1
78 aa--bf--c--cc 1
79 aa--bf--c--dc 1
80 aa--bf--c--dd 2
81 aa--bf--c--df 1
82 aa--bf--cc--dd 2
83 aa--bf--cc--df 1
84 aa--bf--dc--dd 1
85 aa--bf--dd--df 1
86 aa--c--cc--dd 1
87 aa--c--cc--df 1
88 aa--c--cd--dd 1
89 aa--c--dc--dd 1
90 aa--c--dd--df 1
91 aa--cc--dd--df 1
92 af--az--bb--bf 1
93 af--az--bb--c 1
94 af--az--bb--cc 1
95 af--az--bb--dd 1
96 af--az--bb--df 1
97 af--az--bf--c 1
98 af--az--bf--cc 1
99 af--az--bf--dd 1
100 af--az--bf--df 1
101 af--az--c--cc 1
102 af--az--c--dd 1
103 af--az--c--df 1
104 af--az--cc--dd 1
105 af--az--cc--df 1
106 af--az--dd--df 1
107 af--bb--bf--c 1
108 af--bb--bf--cc 1
109 af--bb--bf--dd 1
110 af--bb--bf--df 1
111 af--bb--c--cc 1
112 af--bb--c--dd 1
113 af--bb--c--df 1
114 af--bb--cc--dd 1
115 af--bb--cc--df 1
116 af--bb--dd--df 1
117 af--bf--c--cc 1
118 af--bf--c--dd 1
119 af--bf--c--df 1
120 af--bf--cc--dd 2
121 af--bf--cc--df 1
122 af--bf--dd--df 1
123 af--c--cc--dd 1
124 af--c--cc--df 1
125 af--c--dd--df 1
126 af--cc--dd--df 1
127 ah--al--bb--c 1
128 ah--al--bb--cd 1
129 ah--al--bb--dd 1
130 ah--al--c--cd 1
131 ah--al--c--dd 1
132 ah--al--cd--dd 1
133 ah--bb--c--cd 1
134 ah--bb--c--dd 1
135 ah--bb--cd--dd 1
136 ah--c--cd--dd 1
137 al--bb--c--cd 1
138 al--bb--c--dd 1
139 al--bb--cd--dd 1
140 al--c--cd--dd 1
141 az--bb--bf--c 1
142 az--bb--bf--cc 1
143 az--bb--bf--dd 1
144 az--bb--bf--df 1
145 az--bb--c--cc 1
146 az--bb--c--dd 1
147 az--bb--c--df 1
148 az--bb--cc--dd 1
149 az--bb--cc--df 1
150 az--bb--dd--df 1
151 az--bc--bf--c 1
152 az--bc--bf--dc 1
153 az--bc--bf--dd 1
154 az--bc--c--dc 1
155 az--bc--c--dd 1
156 az--bc--dc--dd 1
157 az--bf--c--cc 1
158 az--bf--c--dc 1
159 az--bf--c--dd 2
160 az--bf--c--df 1
161 az--bf--cc--dd 1
162 az--bf--cc--df 1
163 az--bf--dc--dd 1
164 az--bf--dd--df 1
165 az--c--cc--dd 1
166 az--c--cc--df 1
167 az--c--dc--dd 1
168 az--c--dd--df 1
169 az--cc--dd--df 1
170 bb--bf--c--cc 1
171 bb--bf--c--dd 1
172 bb--bf--c--df 1
173 bb--bf--cc--dd 1
174 bb--bf--cc--df 1
175 bb--bf--dd--df 1
176 bb--c--cc--dd 1
177 bb--c--cc--df 1
178 bb--c--cd--dd 1
179 bb--c--dd--df 1
180 bb--cc--dd--df 1
181 bc--bf--c--dc 1
182 bc--bf--c--dd 1
183 bc--bf--dc--dd 1
184 bc--c--dc--dd 1
185 bf--c--cc--dd 1
186 bf--c--cc--df 1
187 bf--c--dc--dd 1
188 bf--c--dd--df 1
189 bf--cc--dd--df 1
190 c--cc--dd--df 1
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
library(dplyr)  # for bind_rows()

df <- apply(df, 2, function(x) gsub(";", " ", x))
df2 <- as.data.frame(apply(df, 2, function(x) paste(x[1:4], sep = ",")))
df3 <- splitstackshape::cSplit(df2, c("A", "B", "C", "D"), " ")
df4 <- as.data.frame(t(df3))
A <- expand.grid(df4[, 1], df4[, 1])
B <- expand.grid(df4[, 2], df4[, 2])
C <- expand.grid(df4[, 3], df4[, 3])
D <- expand.grid(df4[, 4], df4[, 4])
DF_list <- bind_rows(A, B, C, D)
DF_list$filter <- DF_list$Var1 == DF_list$Var2
DF_list <- DF_list[DF_list$filter == FALSE, ]
DF_list$Var1 <- as.character(DF_list$Var1)
DF_list$Var2 <- as.character(DF_list$Var2)
DF_list <- DF_list[nchar(DF_list$Var1) > 0, ]
DF_list <- DF_list[nchar(DF_list$Var2) > 0, ]
pair_df <- data.frame(pairs = paste(DF_list[["Var1"]], "--", DF_list[["Var2"]], sep = ""))
pairs <- data.frame(combos = table(pair_df$pairs))
####output###
combos.Var1 combos.Freq
1 aa--af 2
2 aa--ah 1
3 aa--al 1
4 aa--az 2
5 aa--bb 2
6 aa--bc 1
7 aa--bf 3
8 aa--c 3
9 aa--cc 2
10 aa--cd 1
11 aa--dc 1
12 aa--dd 4
13 aa--df 1
14 af--aa 2
15 af--az 1
16 af--bb 1
17 af--bf 2
18 af--c 1
19 af--cc 2
20 af--dd 2
21 af--df 1
22 ah--aa 1
23 ah--al 1
24 ah--bb 1
25 ah--c 1
26 ah--cd 1
27 ah--dd 1
28 al--aa 1
29 al--ah 1
30 al--bb 1
31 al--c 1
32 al--cd 1
33 al--dd 1
34 az--aa 2
35 az--af 1
36 az--bb 1
37 az--bc 1
38 az--bf 2
39 az--c 2
40 az--cc 1
41 az--dc 1
42 az--dd 2
43 az--df 1
44 bb--aa 2
45 bb--af 1
46 bb--ah 1
47 bb--al 1
48 bb--az 1
49 bb--bf 1
50 bb--c 2
51 bb--cc 1
52 bb--cd 1
53 bb--dd 2
54 bb--df 1
55 bc--aa 1
56 bc--az 1
57 bc--bf 1
58 bc--c 1
59 bc--dc 1
60 bc--dd 1
61 bf--aa 3
62 bf--af 2
63 bf--az 2
64 bf--bb 1
65 bf--bc 1
66 bf--c 2
67 bf--cc 2
68 bf--dc 1
69 bf--dd 3
70 bf--df 1
71 c--aa 3
72 c--af 1
73 c--ah 1
74 c--al 1
75 c--az 2
76 c--bb 2
77 c--bc 1
78 c--bf 2
79 c--cc 1
80 c--cd 1
81 c--dc 1
82 c--dd 3
83 c--df 1
84 cc--aa 2
85 cc--af 2
86 cc--az 1
87 cc--bb 1
88 cc--bf 2
89 cc--c 1
90 cc--dd 2
91 cc--df 1
92 cd--aa 1
93 cd--ah 1
94 cd--al 1
95 cd--bb 1
96 cd--c 1
97 cd--dd 1
98 dc--aa 1
99 dc--az 1
100 dc--bc 1
101 dc--bf 1
102 dc--c 1
103 dc--dd 1
104 dd--aa 4
105 dd--af 2
106 dd--ah 1
107 dd--al 1
108 dd--az 2
109 dd--bb 2
110 dd--bc 1
111 dd--bf 3
112 dd--c 3
113 dd--cc 2
114 dd--cd 1
115 dd--dc 1
116 dd--df 1
117 df--aa 1
118 df--af 1
119 df--az 1
120 df--bb 1
121 df--bf 1
122 df--c 1
123 df--cc 1
124 df--dd 1
I am trying to find a proper way, in R, to find duplicated values and add 1 cumulatively to each subsequent duplicated value, grouped by id. For example:
library(data.table)
library(dplyr)  # for lag()

data <- data.table(id = c('1','1','1','1','1','2','2','2'),
                   value = c(95,100,101,101,101,20,35,38))

# my (non-working) attempt:
data$new_value <- ifelse(data$value == lag(data$value, 1),
                         lag(data$value, 1) + 1, data$value)
data$desired_value <- c(95,100,101,102,103,20,35,38)
Produces:
id value new_value desired_value
1: 1 95 NA 95
2: 1 100 100 100
3: 1 101 101 101 # first 101 in id 1: add 0
4: 1 101 102 102 # second 101 in id 1: add 1
5: 1 101 102 103 # third 101 in id 1: add 2
6: 2 20 20 20
7: 2 35 35 35
8: 2 38 38 38
I tried doing this with ifelse, but it doesn't work recursively, so it only applies to the following row and not to any subsequent rows. Also, the lag function leaves the first element of value as NA.
I've seen examples with character variables with make.names or make.unique, but haven't been able to find a solution for a duplicated numeric value.
Background: I am doing a survival analysis and I am finding that with my data there are stop times that are the same, so I need to make it unique by adding a 1 (stop times are in seconds).
Here's an attempt. You're essentially grouping by id and value and adding 0:(length(value)-1). So:
data[, onemore := value + (0:(.N-1)), by=.(id, value)]
# id value new_value desired_value onemore
#1: 1 95 96 95 95
#2: 1 100 101 100 100
#3: 1 101 102 101 101
#4: 1 101 102 102 102
#5: 1 101 102 103 103
#6: 2 20 21 20 20
#7: 2 35 36 35 35
#8: 2 38 39 38 38
With base R we can use ave: take the first value of each group and add each row's zero-based position within that group.
data$value1 <- ave(data$value, data$id, data$value, FUN = function(x)
x[1] + seq_along(x) - 1)
# id value new_value desired_value value1
#1: 1 95 96 95 95
#2: 1 100 101 100 100
#3: 1 101 102 101 101
#4: 1 101 102 102 102
#5: 1 101 102 103 103
#6: 2 20 21 20 20
#7: 2 35 36 35 35
#8: 2 38 39 38 38
Here is one option with tidyverse
library(dplyr)
data %>%
group_by(id, value) %>%
mutate(onemore = value + row_number()-1)
# id value onemore
# <chr> <dbl> <dbl>
#1 1 95 95
#2 1 100 100
#3 1 101 101
#4 1 101 102
#5 1 101 103
#6 2 20 20
#7 2 35 35
#8 2 38 38
Or we can use base R without an anonymous function call:
data$onemore <- with(data, value + ave(value, id, value, FUN =seq_along)-1)
data$onemore
#[1] 95 100 101 102 103 20 35 38
To avoid (a potentially costly) by, you may use rowid:
data[, res := value + rowid(id, value) - 1]
# data
# id value new_value desired_value res
# 1: 1 95 96 95 95
# 2: 1 100 101 100 100
# 3: 1 101 102 101 101
# 4: 1 101 102 102 102
# 5: 1 101 102 103 103
# 6: 2 20 21 20 20
# 7: 2 35 36 35 35
# 8: 2 38 39 38 38
I have data that looks like this:
df <- read.table(textConnection(
"ID DATE UNIT
100 1/5/2005 4
100 2/6/2006 4
100 3/7/2007 5
100 4/7/2008 5
100 5/9/2009 6
101 1/5/2005 1
101 2/6/2006 1
101 3/7/2007 1
101 4/7/2008 1
102 1/3/2010 3
102 4/5/2010 4
102 5/9/2011 3
102 6/7/2011 5
102 10/10/2012 5
103 1/5/2005 1
103 1/6/2010 2"),header=TRUE)
I want to group by ID, sort each group by DATE, and create another column that is a running count of the number of times the UNIT variable has changed for each given ID variable. So I want an output that looks like this:
ID DATE UNIT CHANGES
100 1/5/2005 4 0
100 2/6/2006 4 0
100 3/7/2007 5 1
100 4/7/2008 5 1
100 5/9/2009 6 2
101 1/5/2005 1 0
101 2/6/2006 1 0
101 3/7/2007 1 0
101 4/7/2008 1 0
102 1/3/2010 3 0
102 4/5/2010 4 1
102 5/9/2011 3 2
102 6/7/2011 5 3
102 10/10/2012 5 3
103 1/5/2005 1 0
103 1/6/2010 2 1
You could also do this in base R, using order to sort the observations and ave to compute the grouped values:
df$DATE <- as.Date(df$DATE, "%m/%d/%Y")
df <- df[order(df$ID, df$DATE),]
df$CHANGES <- ave(df$UNIT, df$ID, FUN=function(x) c(0, cumsum(diff(x) != 0)))
df
# ID DATE UNIT CHANGES
# 1 100 2005-01-05 4 0
# 2 100 2006-02-06 4 0
# 3 100 2007-03-07 5 1
# 4 100 2008-04-07 5 1
# 5 100 2009-05-09 6 2
# 6 101 2005-01-05 1 0
# 7 101 2006-02-06 1 0
# 8 101 2007-03-07 1 0
# 9 101 2008-04-07 1 0
# 10 102 2010-01-03 3 0
# 11 102 2010-04-05 4 1
# 12 102 2011-05-09 3 2
# 13 102 2011-06-07 5 3
# 14 102 2012-10-10 5 3
# 15 103 2005-01-05 1 0
# 16 103 2010-01-06 2 1
Using dplyr.
First I'm converting your DATE column to a date, assuming it's in format m/d/y (if not, change the "%m/%d/%Y" to "%d/%m/%Y"):
df$DATE <- as.Date(df$DATE, "%m/%d/%Y")
Now the code:
library(dplyr)
df %>% group_by(ID) %>%
arrange(DATE) %>%
mutate(CHANGES=c(0,cumsum(na.omit(UNIT!=lag(UNIT,1)))))
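For completeness, the same idea (a cumulative sum of "did UNIT change from the previous row" per ID) can be sketched in data.table, shown here on a small subset of the example data with DATE already parsed:

```r
library(data.table)

df <- data.table(
  ID   = c(100, 100, 100, 103, 103),
  DATE = as.Date(c("2005-01-05", "2006-02-06", "2007-03-07",
                   "2005-01-05", "2010-01-06")),
  UNIT = c(4, 4, 5, 1, 2)
)

setorder(df, ID, DATE)                                   # sort by ID, then DATE
df[, CHANGES := cumsum(c(0, diff(UNIT) != 0)), by = ID]  # running count of changes
df$CHANGES
#> [1] 0 0 1 0 1
```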
I am trying to rank multiple numeric variables (700+ variables) in my data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the ranked values in the same variable, so I need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions along with rank may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit, and ysales, the output dataset needs to contain the variables xcount_rank, xvisit_rank, and ysales_rank holding the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
  lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as (101, 230], (230, 450], etc., whereas I would like the rank variable to be populated as 1, 2, and so on, up to the 10 categories per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like to see which group each record falls under when I rank the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
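Note that mutate_each() has since been deprecated; in current dplyr the same result can be sketched with across() (illustrative, using the updated input from the question):

```r
library(dplyr)

input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

# cut(..., labels = FALSE) returns the bin index (1, 2, 3) directly
out <- input %>%
  mutate(across(xcount:ysales,
                ~ cut(.x, breaks = 3, labels = FALSE),
                .names = "rank_{.col}"))
out$rank_xcount
#> [1] 1 1 1 1 2 1 3
```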