I'm having a hard time finding a way to program this in R. Since it is very easy to do in Excel, I'm hoping this is just my n00b lack of knowledge.
In the example table below, my objective is to create the last column (Balance).
Each TRA (101 and 102) has a number of IDAs (the order of all entries in that TRA, from 1 to last).
Balance in the 1st IDA is the total sum of the Principal. For each subsequent IDA, the Balance is reduced by the Principal of the previous row, until the last Balance is simply equal to the last Principal.
In other words,
the Balance value of one row is the sum of the Principal value in that same row plus the Balance value of the next IDA row, until the last one of each TRA.
So, for instance:
For TRA 101, we have four rows (IDA from 1 to 4). The Balance value of the 1st row is the Principal of the 1st row plus the Balance of the 2nd (-4,799,471 + -14,398,412 = -19,197,882).
For the last IDA of each TRA (4 in 101, 9 in 102), I just need the value of the Principal.
We tried this option, but it isn't working when we have different Principal values within a TRA.
df %<>%
group_by(TRA)%>%
arrange(desc(IDA))%>%
mutate(saldo = cumsum(Principal))%>%
ungroup()%>%
arrange(TRA)
Can someone point me to the best approach, please?
ROW TRA IDA IDB Principal Balance
1 101 1 1011 -4,799,471 -19,197,882
2 101 2 1012 -4,799,471 -14,398,412
3 101 3 1013 -4,799,471 -9,598,941
4 101 4 1014 -4,799,471 -4,799,471
5 102 1 1021 -5,248,583 -47,237,250
6 102 2 1022 -5,248,583 -41,988,667
7 102 3 1023 -5,248,583 -36,740,084
8 102 4 1024 -5,248,583 -31,491,500
9 102 5 1025 -5,248,583 -26,242,917
10 102 6 1026 -5,248,583 -20,994,334
11 102 7 1027 -5,248,583 -15,745,750
12 102 8 1028 -5,248,583 -10,497,167
13 102 9 1029 -5,248,584 -5,248,584
If your posted data is the data frame you're working with, you need to convert your Principal column to numeric (strip the thousands separators, then coerce), e.g.
df %>%
group_by(TRA) %>%
arrange(desc(IDA)) %>%
mutate(saldo = cumsum(as.numeric(gsub(",", "", Principal)))) %>%
ungroup() %>%
arrange(TRA)
# A tibble: 13 × 7
ROW TRA IDA IDB Principal Balance saldo
<int> <int> <int> <int> <chr> <chr> <dbl>
1 4 101 4 1014 -4,799,471 -4,799,471 -4799471
2 3 101 3 1013 -4,799,471 -9,598,941 -9598942
3 2 101 2 1012 -4,799,471 -14,398,412 -14398413
4 1 101 1 1011 -4,799,471 -19,197,882 -19197884
5 13 102 9 1029 -5,248,584 -5,248,584 -5248584
6 12 102 8 1028 -5,248,583 -10,497,167 -10497167
7 11 102 7 1027 -5,248,583 -15,745,750 -15745750
8 10 102 6 1026 -5,248,583 -20,994,334 -20994333
9 9 102 5 1025 -5,248,583 -26,242,917 -26242916
10 8 102 4 1024 -5,248,583 -31,491,500 -31491499
11 7 102 3 1023 -5,248,583 -36,740,084 -36740082
12 6 102 2 1022 -5,248,583 -41,988,667 -41988665
13 5 102 1 1021 -5,248,583 -47,237,250 -47237248
It works fine, no?
df <- read_table(
"ROW TRA IDA IDB Principal Balance
1 101 1 1011 -4,799,471 -19,197,882
2 101 2 1012 -4,799,471 -14,398,412
3 101 3 1013 -4,799,471 -9,598,941
4 101 4 1014 -4,799,471 -4,799,471
5 102 1 1021 -5,248,583 -47,237,250
6 102 2 1022 -5,248,583 -41,988,667
7 102 3 1023 -5,248,583 -36,740,084
8 102 4 1024 -5,248,583 -31,491,500
9 102 5 1025 -5,248,583 -26,242,917
10 102 6 1026 -5,248,583 -20,994,334
11 102 7 1027 -5,248,583 -15,745,750
12 102 8 1028 -5,248,583 -10,497,167
13 102 9 1029 -5,248,584 -5,248,584"
)
df %>%
group_by(TRA) %>%
arrange(TRA, desc(IDA)) %>%
mutate(saldo = cumsum(Principal)) %>%
ungroup()
# A tibble: 13 × 7
ROW TRA IDA IDB Principal Balance saldo
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 101 4 1014 -4799471 -4799471 -4799471
2 3 101 3 1013 -4799471 -9598941 -9598942
3 2 101 2 1012 -4799471 -14398412 -14398413
4 1 101 1 1011 -4799471 -19197882 -19197884
5 13 102 9 1029 -5248584 -5248584 -5248584
6 12 102 8 1028 -5248583 -10497167 -10497167
7 11 102 7 1027 -5248583 -15745750 -15745750
8 10 102 6 1026 -5248583 -20994334 -20994333
9 9 102 5 1025 -5248583 -26242917 -26242916
10 8 102 4 1024 -5248583 -31491500 -31491499
11 7 102 3 1023 -5248583 -36740084 -36740082
12 6 102 2 1022 -5248583 -41988667 -41988665
13 5 102 1 1021 -5248583 -47237250 -47237248
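For comparison, here is a minimal base R sketch of the same idea that keeps the original row order: a reverse cumulative sum within each TRA. It assumes Principal is already numeric, and the small data frame below is made up purely for illustration.

```r
# Toy data (hypothetical values, not the table from the question)
df <- data.frame(
  TRA = c(101, 101, 101, 101, 102, 102, 102),
  IDA = c(1, 2, 3, 4, 1, 2, 3),
  Principal = c(-10, -10, -10, -10, -5, -5, -6)
)

# Make sure rows are ordered by IDA within each TRA
df <- df[order(df$TRA, df$IDA), ]

# Reverse cumulative sum per TRA: each Balance is the row's own Principal
# plus the Principal of every later row in the same TRA
df$Balance <- ave(df$Principal, df$TRA,
                  FUN = function(x) rev(cumsum(rev(x))))

print(df)
```

Because the sum runs from the last IDA backwards, the last row of each TRA ends up with just its own Principal, which is the rule described in the question.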
I have 4 data frames that all look like this:
Product 2018  Number  Minimum  Maximum
1             56      1        5
2             42      12       16
3             6523    23       56
4             123     23       102
5             56      23       64
6             245623  56       87
7             546     25       540
8             54566   253      560
Product 2019  Number  Minimum  Maximum
1             56      32       53
2             642     423      620
3             56423   432      560
4             3       431      802
5             2       2        6
6             4523    43       68
7             555     23       54
8             55646   3        6
Product 2020  Number  Minimum  Maximum
1             23      2        5
2             342     4        16
3             223     3        5
4             13      4        12
5             2       4        7
6             223     7        8
7             5       34       50
8             46      3        6
Product 2021  Number  Minimum  Maximum
1             234     3        5
2             3242    4        16
3             2423    43       56
4             123     43       102
5             24      4        6
6             2423    4        18
7             565     234      540
8             5646    23       56
I want to join all the tables so I get a table that looks like this:
Products  Number 2021  Min-Max 2021  Number 2020  Min-Max 2020  Number 2019  Min-Max 2019  Number 2018  Min-Max 2018
1         234          3 to 5        23           2 to 5        ...          ...           ...          ...
2         3242         4 to 16       342          4 to 16       ...          ...           ...          ...
3         2423         43 to 56      223          3 to 5        ...          ...           ...          ...
4         123          43 to 102     13           4 to 12       ...          ...           ...          ...
5         24           4 to 6        2            4 to 7        ...          ...           ...          ...
6         2423         4 to 18       223          7 to 8        ...          ...           ...          ...
7         565          234 to 540    5            34 to 50      ...          ...           ...          ...
8         5646         23 to 56      46           3 to 6        ...          ...           ...          ...
The products are the same for all years, so I would like a data frame that contains the Number for each year as its own column and combines the Minimum and Maximum columns of each year into one.
Any help is welcome!
How about something like this? You are trying to join several data frames by a single column, which is relatively straightforward using full_join. The difficulty is that you are also trying to extract information from the column names and combine several columns at the same time. I would map everything you want to do over the list of data frames and then reduce the list at the end. Here is an example with two data frames, but you could add as many as you want to the list at the beginning.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product ", "", names(.x)[1])) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
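For readers without the tidyverse, the same reshaping can be sketched in base R with Reduce() and merge(). This is only an illustration, not either answerer's code: the helper name tidy_year is hypothetical, and the two small data frames reuse the first two rows of the 2020 and 2021 tables from the question.

```r
# Hypothetical helper: turn one yearly table into Product / Number / Min-Max columns
tidy_year <- function(x) {
  # The year is taken from the first column name, e.g. "Product 2021" -> "2021"
  year <- sub("^Product ", "", names(x)[1])
  out <- data.frame(
    Product = x[[1]],
    Number  = x$Number,
    MinMax  = paste(x$Minimum, "to", x$Maximum)
  )
  names(out)[2:3] <- paste(c("Number", "Min-Max"), year)
  out
}

# First two rows of the 2020 and 2021 tables from the question
df2020 <- data.frame(`Product 2020` = 1:2, Number = c(23, 342),
                     Minimum = c(2, 4), Maximum = c(5, 16),
                     check.names = FALSE)
df2021 <- data.frame(`Product 2021` = 1:2, Number = c(234, 3242),
                     Minimum = c(3, 4), Maximum = c(5, 16),
                     check.names = FALSE)

# Clean every table, then merge them pairwise on Product
wide <- Reduce(function(a, b) merge(a, b, by = "Product"),
               lapply(list(df2021, df2020), tidy_year))
print(wide)
```

The order of the list passed to lapply() decides the column order, so putting the most recent year first gives the layout asked for in the question.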
I have a data frame with thousands of rows looking like this:
time Unique_ID Unix_Time Event Version
<dbl> <dbl> <dbl> <lgl> <dbl>
1 1404 4961657804 1565546745 FALSE 6
2 2534 4453645779 1550934792 FALSE 5
3 2114 3602935494 1512593418 TRUE 3
4 2605 5343699852 1586419012 TRUE 6
5 1246 5095942046 1572689498 FALSE 6
6 2519 3206995213 1495881898 TRUE 3
7 1419 4958551504 1565434177 TRUE 6
8 2262 5441937631 1590754817 TRUE 6
9 1650 3024892331 1488210079 TRUE 2
10 1880 3163703804 1494173662 FALSE 2
I manipulate the data frame using the following command:
df <- df %>%
group_by(minute = findInterval(time, seq(min(0), max(9000), 60))) %>%
summarise(Number= n(),
Won = sum(Event))
Now my data frame looks like this:
minute Number Won
<int> <int> <int>
1 55 264 128
2 71 34 17
3 31 1427 728
4 80 9 5
5 24 1197 673
6 141 1 1
7 53 326 163
8 30 1572 802
9 77 14 9
10 97 1 1
I would want something like this though:
minute Number Won Version
<int> <int> <int> <int>
1 55 264 128 1
2 55 34 17 2
3 55 1427 728 3
4 80 9 5 1
5 24 1197 673 1
6 141 1 1 2
7 53 326 163 3
8 53 1572 802 4
9 77 14 9 2
10 97 1 1 6
Is it possible to keep the rows with different Versions separated while grouping by time?
I think you can group by 2 columns, minute and Version, and then summarise as before:
df <- df %>%
group_by(minute = findInterval(time, seq(0, 9000, 60)), Version) %>%
summarise(Number = n(),
Won = sum(Event))
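To see what grouping by two keys does without dplyr, here is a small base R sketch using aggregate(). The data frame is invented for illustration; only the column names time, Event and Version come from the question.

```r
# Toy data (hypothetical values)
dat <- data.frame(
  time    = c(10, 20, 30, 70, 80, 130),
  Event   = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
  Version = c(1, 1, 2, 1, 1, 3)
)

# Same binning as in the question: 60-second intervals
dat$minute <- findInterval(dat$time, seq(0, 9000, 60))

# Helper columns so that sum() yields a row count and a count of TRUE events
dat$Number <- 1
dat$Won <- as.integer(dat$Event)

# One output row per (minute, Version) combination that occurs in the data
res <- aggregate(cbind(Number, Won) ~ minute + Version, data = dat, FUN = sum)
print(res)
```

Each minute now appears once per Version observed in it, which is the shape of the desired output.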
I'm pretty fresh to R (like 2 days old). I have a set of data that is a time series sampled every 200 msecs over a few hours. Here's
head(dat):
Date Time MSec Sample Pat1 Pat2 Pat3
1 8/7/~ 14:34 411 0 100 13 13
2 8/7/~ 14:34 615 1 13 13 143
3 8/7/~ 14:34 814 2 13 13 13
4 8/7/~ 14:34 12 3 130 13 13
5 8/7/~ 14:34 216 4 13 13 130
6 8/7/~ 14:34 417 5 139 13 13
It goes on for 2 hours, so several thousand points, and there are several hundred patients. The 13 is our baseline, and what we are interested in is spikes in activity over, say, 100. I have been trying to create a new column for each Patient column that flags every time a signal is over 100. I've worked out the following code:
dat$Pat1exc <- as.numeric(dat$Pat1 >=100)
This works and gives me the new column and my data looks like below:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1exc
1 8/7/~ 14:34 411 0 100 13 13 1
2 8/7/~ 14:34 615 1 13 13 143 0
3 8/7/~ 14:34 814 2 13 13 13 0
4 8/7/~ 14:34 12 3 130 13 13 1
5 8/7/~ 14:34 216 4 13 13 130 0
6 8/7/~ 14:34 417 5 139 13 13 1
This is exactly what I want, but I don't know how to iterate through each column to create Pat2exc, Pat3exc, etc. I figured I could use sapply or vapply after I create a function. However, I can't get the function to work.
excite <- function(x, y) {y <- as.numeric(x >=100)}
excite(x=dat$Pat2, y=dat$Pat2exc)
This gives me no errors, but doesn't modify the dat data frame. Essentially, in the end I just want to sum up all the excited columns (>=100). If there is an easier way to count the samples over 100 for each patient then I'd be happy to learn how to do that as well.
Sorry if this is unclear. Thanks in advance.
P.S.: I am also looking for a good way to combine the Time and Msec columns.
Edit: Added in unabbreviated data:
Date Time Msecs
8/7/2018 14:34:07 411
8/7/2018 14:34:07 615
8/7/2018 14:34:07 814
8/7/2018 14:34:08 12
8/7/2018 14:34:08 216
8/7/2018 14:34:08 417
8/7/2018 14:34:08 619
8/7/2018 14:34:08 816
8/7/2018 14:34:09 15
We can use mutate_at from dplyr to create the binary variables and mutate + rowSums to add them all up:
library(dplyr)
df %>%
mutate_at(vars(starts_with("Pat")), funs(exc = (. >= 100)*1)) %>%
mutate(exc_total = rowSums(.[grepl('_exc', names(.))]))
Result:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1_exc Pat2_exc Pat3_exc exc_total
1 8/7/~ 14:34 411 0 100 13 13 1 0 0 1
2 8/7/~ 14:34 615 1 13 13 143 0 0 1 1
3 8/7/~ 14:34 814 2 13 13 13 0 0 0 0
4 8/7/~ 14:34 12 3 130 13 13 1 0 0 1
5 8/7/~ 14:34 216 4 13 13 130 0 0 1 1
6 8/7/~ 14:34 417 5 139 13 13 1 0 0 1
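Two side notes on the question, sketched in base R. First, excite() leaves dat unchanged because R functions work on copies of their arguments, so the result has to be returned and assigned. If only the per-patient counts are needed, colSums() on the comparison gets there directly. Second, the Date, Time and MSec columns can be combined into one timestamp. The data below copies the six rows shown in the question; the month/day/year reading of the date and the UTC time zone are assumptions.

```r
# The six rows from the question, with the unabbreviated Date and Time
dat <- data.frame(
  Date = rep("8/7/2018", 6),
  Time = c("14:34:07", "14:34:07", "14:34:07",
           "14:34:08", "14:34:08", "14:34:08"),
  MSec = c(411, 615, 814, 12, 216, 417),
  Pat1 = c(100, 13, 13, 130, 13, 139),
  Pat2 = c(13, 13, 13, 13, 13, 13),
  Pat3 = c(13, 143, 13, 13, 130, 13)
)

# Pick every patient column by name, then count samples >= 100 per patient
pat_cols <- grep("^Pat[0-9]+$", names(dat), value = TRUE)
exc_counts <- colSums(dat[pat_cols] >= 100)
print(exc_counts)

# Combine Date, Time and MSec into a single timestamp
# (assumes month/day/year order and UTC)
dat$timestamp <- as.POSIXct(paste(dat$Date, dat$Time),
                            format = "%m/%d/%Y %H:%M:%S", tz = "UTC") +
  dat$MSec / 1000
```

To see the fractional seconds when printing, something like format(dat$timestamp, "%H:%M:%OS3") can be used.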
I did an RFM analysis using the package "rfm". The results are in a tibble and I can't figure out how to export it to .csv. I tried the write.csv call at the end of the session below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of rfm_table_order is not just a tibble. Looking at this already solved question, and using its data, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you select this element:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")
OP asks for a CSV output.
Being very picky, write.table(rfm_result$rfm , file = "your_path\\df.csv") does not create a CSV: the default separator of write.table is a single space, so the file is space-separated.
If you want a CSV, add the sep="," argument; you'll likely also want to leave out the row names, so use row.names=FALSE as well.
write.table(rfm_result$rfm , file = "your_path\\df.csv", sep=",", row.names=FALSE)
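As a quick check of that advice, here is a small sketch that writes a data frame with write.csv() and reads it back. write.csv() is the wrapper around write.table() with sep = ",", so only row.names = FALSE needs to be set. The toy data frame stands in for rfm_result$rfm and its values are placeholders.

```r
# Stand-in for rfm_result$rfm (hypothetical values)
toy <- data.frame(customer_id = c(1, 2), rfm_score = c(412, 213))

# Write to a temporary file so the example leaves nothing behind
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)

# Read it back to confirm the file is a proper comma-separated table
back <- read.csv(out_file)
print(back)
```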
Starting from this SO question.
Example data.frame:
df = read.table(text = 'ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 5
12 118 2 5
90 117 1 3
22 115 2 2
99 115 2 2', header = TRUE)
The Count column shows the total number of ID values for each Day, and the Count_group column shows the total number of ID values over Day, Day - 1, Day - 2, Day - 3 and Day - 4.
e.g. Day 1933 has Count_group 15 because Count 6 (1933) + Count 5 (1932) + Count 3 (1931) + Count 1 (1930) + Count 0 (1929) = 15.
What I need to do is duplicate observations so that each Count_group is shown together with the rows of its Day, Day - 1, Day - 2, Day - 3 and Day - 4.
e.g. Count_group = 15 is composed of the Count values of Days 1933, 1932, 1931 and 1930 (1929 is not present in the df), so the rows of all those days need to be included under Count_group = 15. The next one is Count_group = 9, composed of 1932, 1931, 1930, 1929 and 1928; etc.
Desired output:
ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 15
34 1932 5 15
60 1932 5 15
88 1932 5 15
18 1932 5 15
33 1931 3 15
13 1931 3 15
56 1931 3 15
23 1930 1 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 9
13 1931 3 9
56 1931 3 9
23 1930 1 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 12
6 1799 4 12
52 1799 4 12
133 1799 4 12
112 1798 2 12
677 1798 2 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 6
677 1798 2 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 8
26 887 2 8
8 886 1 8
56 885 1 8
37 887 2 4
26 887 2 4
8 886 1 4
56 885 1 4
8 886 1 2
56 885 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
22 115 2 6
99 115 2 6
99 118 2 5
12 118 2 5
90 117 1 5
22 115 2 5
99 115 2 5
90 117 1 3
22 115 2 3
99 115 2 3
22 115 2 2
99 115 2 2
(note that different group of 5 days each one have been separated by a blank line in order to make them clearer)
I have got different data.frames which are grouped by n days and therefore I would like to adapt the code (by changing it a little) specifically for each of them.
Thanks
A generalised version of my previous answer...
#first add grouping variables
days <- 5 #grouping no of days
df$smalldaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]!=df$Day[i-1]))) #individual days
df$bigdaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]<df$Day[i-1]-days+1))) #blocks of linked days
#duplicate days in each big group
df2 <- lapply(split(df,df$bigdaygroup),function(x) {
n <- max(x$Day)-min(x$Day)+1 #number of consecutive days in big group
dayvec <- (max(x$Day):min(x$Day)) #possible days in range
daylog <- dayvec[dayvec %in% x$Day] #actual days in range
pattern <- data.frame(base=rep(dayvec,each=days))
pattern$rep <- sapply(1:nrow(pattern),function(i) pattern$base[i]+1-sum(pattern$base[1:i]==pattern$base[i])) #indices to repeat
pattern$offset <- match(pattern$rep,daylog)-match(pattern$base,daylog) #offsets (used later)
pattern <- pattern[(pattern$base %in% x$Day) & (pattern$rep %in% x$Day),] #remove invalid elements
#store pattern in list as offsets needed in next loop
return(list(df=split(x,x$smalldaygroup)[match(pattern$rep,daylog)],pat=pattern))
})
#change the Count_group to previous value in added entries
df2 <- lapply(df2,function(L) lapply(1:length(L$df),function(i) {
x <- L$df[[i]]
offset <- L$pat$offset #pointer to day to copy Count_group from
x$Count_group <- L$df[[i-offset[i]]]$Count_group[1]
return(x)
}))
df2 <- do.call(rbind,unlist(df2,recursive=FALSE)) #bind back together
df2[,5:6] <- NULL #remove grouping variables
head(df2,30) #ignore rownames!
ID Day Count Count_group
01.1 18 1933 6 15
01.2 33 1933 6 15
01.3 37 1933 6 15
01.4 18 1933 6 15
01.5 16 1933 6 15
01.6 11 1933 6 15
02.7 111 1932 5 15
02.8 34 1932 5 15
02.9 60 1932 5 15
02.10 88 1932 5 15
02.11 18 1932 5 15
03.12 33 1931 3 15
03.13 13 1931 3 15
03.14 56 1931 3 15
04 23 1930 1 15
05.7 111 1932 5 9
05.8 34 1932 5 9
05.9 60 1932 5 9
05.10 88 1932 5 9
05.11 18 1932 5 9
06.12 33 1931 3 9
06.13 13 1931 3 9
06.14 56 1931 3 9
07 23 1930 1 9
08.12 33 1931 3 4
08.13 13 1931 3 4
08.14 56 1931 3 4
09 23 1930 1 4
010 23 1930 1 1
11.16 6 1800 6 12
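A more compact way to get the same kind of result, offered here only as a sketch and not as the answerer's method: for every distinct Day, take all rows that fall in the window ending at that Day and relabel them with that Day's Count_group. The function name expand_groups is hypothetical, and the example uses only the 1930 to 1933 rows from the question.

```r
# Hypothetical helper: for each anchor day d, copy the rows in (d - days, d]
# and give them the Count_group of day d
expand_groups <- function(df, days = 5) {
  anchors <- unique(df$Day)            # keeps the order in which days appear
  pieces <- lapply(anchors, function(d) {
    block <- df[df$Day <= d & df$Day > d - days, ]
    block$Count_group <- df$Count_group[match(d, df$Day)]
    block
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL                # drop the de-duplicated row names
  out
}

# Rows for days 1933 to 1930 from the question
df <- data.frame(
  ID = c(18, 33, 37, 18, 16, 11, 111, 34, 60, 88, 18, 33, 13, 56, 23),
  Day = c(rep(1933, 6), rep(1932, 5), rep(1931, 3), 1930),
  Count = c(rep(6, 6), rep(5, 5), rep(3, 3), 1),
  Count_group = c(rep(15, 6), rep(9, 5), rep(4, 3), 1)
)

out <- expand_groups(df, days = 5)
print(out)
```

Changing the days argument adapts the window length for data frames grouped by a different number of days.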
I attach a rather mechanical method, but I believe it is a good starting point.
I have noticed that in your original table the entry
ID Day Count Count_group
18 1933 6 14
is duplicated; I have left it untouched for the sake of clarity.
Structure of the approach:
1. Read original data
2. Generate list of data frames, for each Day
3. Generate final data frame, collapsing the list in 2.
1. Read original data
We start with
df = read.table(text = 'ID Day Count Count_group
18 1933 6 14
33 1933 6 14
37 1933 6 14
18 1933 6 14
16 1933 6 14
11 1933 6 14
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 7
111 888 4 7
88 888 4 7
10 888 4 7
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1', header = TRUE)
# ordered vector of unique values for "Day"
ord_day <- unique(df$Day[order(df$Day)])
ord_day
[1] 885 886 887 888 1798 1799 1800 1930 1931 1932 1933
2. Generate list of data frames, for each Day
For each element in ord_day we introduce a data.frame as element of a list called df_new_aug.
Such data frames are defined through a for loop for all values in ord_day except ord_day[2] and ord_day[1] which are treated separately.
Idea behind the looping: for each unique ord_day[i] with i > 2 we check which days between ord_day[i-1] and ord_day[i-2] (or both!) contribute (through the variable "Count") to the value "Count_Group" at ord_day[i].
We therefore introduce if else statements in the loop.
Here we go
# Recursive generation of the list of data.frames (for days > 886)
#-----------------------------------------------------------------
df_new <- list()
df_new_aug <- list()
# we exclude cases i=1, 2: they are manually treated below
for ( i in 3: length(ord_day) ) {
# is "Count_Group" for ord_day[i] equal to the sum of "Count" at ord_day[i-1] and ord_day[i-2]?
if ( unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) + unique(df[df$Day == ord_day[i-2], "Count"])
) {
# we create columns ID | Day | Count
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1] | df$Day == ord_day[i-2],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame( df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else if (unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) ) #only "Count" at i and i-1 contribute to "Count_group" at i
{
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else { #only "Count" at i contributes to "Count_group" at i
df_new[[i]] <- data.frame(df[df$Day == ord_day[i],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
}
#closing the for loop
}
# for ord_day[2] = "886" (both "Count" at i =2 and i = 1 contribute to "Count_group" at i=2)
#-------------------------------------------------------------------------------------
df_new[[2]] <- data.frame(df[df$Day == ord_day[2] | df$Day == ord_day[1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[2]
df_new_aug[[2]] <- data.frame(df_new[[2]],
Count_group = rep(unique(df[df$Day == ord_day[2], "Count_group"]), nrow(df_new[[2]]) ) )
# for ord_day[1] = "885" (only "count" at i = 1 contributes to "Count_group" at i =1)
#------------------------------------------------------------------------------------
df_new[[1]] <- data.frame(df[df$Day == ord_day[1], c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[1]
df_new_aug[[1]] <- data.frame(df_new[[1]], Count_group = rep(unique(df[df$Day == ord_day[1], "Count_group"]), nrow(df_new[[1]]) ) )
# produced list
df_new_aug
3. Generate final data frame, collapsing the list in 2.
We collapse df_new_aug through an ugly loop, but other solutions are possible (for example with Reduce() and merge()):
# merging the list (mechanically): final result
df_result <- df_new_aug[[1]]
# append the remaining list elements one by one
for (i in 2:length(df_new_aug)){
df_result <- rbind(df_result, df_new_aug[[i]])
}
This yields df_result, which completes the analysis.
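As an aside, the collapsing loop can usually be replaced by a single do.call(rbind, ...) over the list. The sketch below uses a made-up two-element list in place of df_new_aug to show the pattern.

```r
# Stand-in for df_new_aug: a list of data frames with identical columns
pieces <- list(
  data.frame(ID = 56, Day = 885, Count = 1, Count_group = 1),
  data.frame(ID = c(8, 56), Day = c(886, 885), Count = c(1, 1),
             Count_group = c(2, 2))
)

# rbind() is called once with every list element as an argument
combined <- do.call(rbind, pieces)
print(combined)
```

This does not depend on knowing the list length in advance, so it carries over unchanged to data with a different number of days.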