group_by one column but keep multiples based on another column - r

I have a data frame with thousands of rows looking like this:
    time  Unique_ID  Unix_Time Event Version
   <dbl>      <dbl>      <dbl> <lgl>   <dbl>
 1  1404 4961657804 1565546745 FALSE       6
 2  2534 4453645779 1550934792 FALSE       5
 3  2114 3602935494 1512593418 TRUE        3
 4  2605 5343699852 1586419012 TRUE        6
 5  1246 5095942046 1572689498 FALSE       6
 6  2519 3206995213 1495881898 TRUE        3
 7  1419 4958551504 1565434177 TRUE        6
 8  2262 5441937631 1590754817 TRUE        6
 9  1650 3024892331 1488210079 TRUE        2
10  1880 3163703804 1494173662 FALSE       2
I manipulate the data frame using the following command:
df <- df %>%
  group_by(minute = findInterval(time, seq(min(0), max(9000), 60))) %>%
  summarise(Number = n(),
            Won = sum(Event))
Now my data frame looks like this:
   minute Number   Won
    <int>  <int> <int>
 1     55    264   128
 2     71     34    17
 3     31   1427   728
 4     80      9     5
 5     24   1197   673
 6    141      1     1
 7     53    326   163
 8     30   1572   802
 9     77     14     9
10     97      1     1
I would want something like this though:
   minute Number   Won Version
    <int>  <int> <int>   <int>
 1     55    264   128       1
 2     55     34    17       2
 3     55   1427   728       3
 4     80      9     5       1
 5     24   1197   673       1
 6    141      1     1       2
 7     53    326   163       3
 8     53   1572   802       4
 9     77     14     9       2
10     97      1     1       6
Is it possible to keep the rows with different Versions separated while grouping by time?

I think you can group by two columns, minute and Version:
df <- df %>%
  group_by(minute = findInterval(time, seq(min(0), max(9000), 60)), Version)
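Keeping the summarise() step from the question, the full pipeline would then look like this (a sketch assembled from the question's own code):
df <- df %>%
  group_by(minute = findInterval(time, seq(min(0), max(9000), 60)), Version) %>%
  summarise(Number = n(),
            Won = sum(Event))
This produces one row per (minute, Version) combination, which is the layout shown in the desired output.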

Related

Rounded averages by group that sum to the same as the group total

I have data that looks like this:
library(dplyr)
Data <- tibble(
  ID = c("Code001", "Code001", "Code001", "Code002", "Code002", "Code002",
         "Code002", "Code002", "Code003", "Code003", "Code003", "Code003"),
  Value = c(107, 107, 107, 346, 346, 346, 346, 346, 123, 123, 123, 123))
I need to work out the average value per group per row. However, the values need to be rounded (no decimal places), and their group sum still needs to equal the group's total Value. So plain rounding like this won't work:
Data %>%
  add_count(ID) %>%
  group_by(ID) %>%
  mutate(Prop_Value_1 = Value / n,
         Prop_Value_2 = round(Value / n))
Is there a solution that can produce an output like this:
Data %>%
  mutate(Prop_Value = c(35, 36, 36, 69, 69, 69, 69, 70, 30, 31, 31, 31))
You can use ceiling() and then row_number() to get there:
Data %>%
  group_by(ID) %>%
  mutate(count = n(),
         ceil_avg = ceiling(Value / count)) %>%
  mutate(sum_ceil_avg = sum(ceil_avg),
         diff_sum = sum_ceil_avg - Value,
         rn = row_number()) %>%
  mutate(new_avg = ifelse(rn <= diff_sum,
                          ceil_avg - 1,
                          ceil_avg))
# A tibble: 12 × 8
# Groups:   ID [3]
   ID      Value count ceil_avg sum_ceil_avg diff_sum    rn new_avg
   <chr>   <dbl> <int>    <dbl>        <dbl>    <dbl> <int>   <dbl>
 1 Code001   107     3       36          108        1     1      35
 2 Code001   107     3       36          108        1     2      36
 3 Code001   107     3       36          108        1     3      36
 4 Code002   346     5       70          350        4     1      69
 5 Code002   346     5       70          350        4     2      69
 6 Code002   346     5       70          350        4     3      69
 7 Code002   346     5       70          350        4     4      69
 8 Code002   346     5       70          350        4     5      70
 9 Code003   123     4       31          124        1     1      30
10 Code003   123     4       31          124        1     2      31
11 Code003   123     4       31          124        1     3      31
12 Code003   123     4       31          124        1     4      31
A first solution is to use integer division: every row gets Value %/% n(), and the remainder Value %% n() is handed out one unit at a time to the first rows, so each group still sums to its Value (for example, 107 = 3 × 35 + 2, so two rows get 36 and one row gets 35):
Data %>%
  group_by(ID) %>%
  mutate(Prop_Value = ifelse(row_number() <= Value %% n(),
                             Value %/% n() + 1,
                             Value %/% n()))
# A tibble: 12 × 3
# Groups:   ID [3]
   ID      Value Prop_Value
   <chr>   <dbl>      <dbl>
 1 Code001   107         36
 2 Code001   107         36
 3 Code001   107         35
 4 Code002   346         70
 5 Code002   346         69
 6 Code002   346         69
 7 Code002   346         69
 8 Code002   346         69
 9 Code003   123         31
10 Code003   123         31
11 Code003   123         31
12 Code003   123         30
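As a quick sanity check (an addition, not part of the original answer), you can verify that the integer-division result still reproduces each group's total:
Data %>%
  group_by(ID) %>%
  mutate(Prop_Value = ifelse(row_number() <= Value %% n(),
                             Value %/% n() + 1,
                             Value %/% n())) %>%
  summarise(total = sum(Prop_Value), Value = first(Value))
# total equals Value (107, 346, 123) in every group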

Join data frames into one in R

I have 4 data frames that all look like this:
Product 2018  Number  Minimum  Maximum
           1      56        1        5
           2      42       12       16
           3    6523       23       56
           4     123       23      102
           5      56       23       64
           6  245623       56       87
           7     546       25      540
           8   54566      253      560
Product 2019  Number  Minimum  Maximum
           1      56       32       53
           2     642      423      620
           3   56423      432      560
           4       3      431      802
           5       2        2        6
           6    4523       43       68
           7     555       23       54
           8   55646        3        6
Product 2020  Number  Minimum  Maximum
           1      23        2        5
           2     342        4       16
           3     223        3        5
           4      13        4       12
           5       2        4        7
           6     223        7        8
           7       5       34       50
           8      46        3        6
Product 2021  Number  Minimum  Maximum
           1     234        3        5
           2    3242        4       16
           3    2423       43       56
           4     123       43      102
           5      24        4        6
           6    2423        4       18
           7     565      234      540
           8    5646       23       56
I want to join all the tables so I get a table that looks like this:
Products  Number 2021  Min-Max 2021  Number 2020  Min-Max 2020  Number 2019  Min-Max 2019  Number 2018  Min-Max 2018
       1          234  3 to 5                 23  2 to 5        ...          ...           ...          ...
       2         3242  4 to 16               342  4 to 16       ...          ...           ...          ...
       3         2423  43 to 56              223  3 to 5        ...          ...           ...          ...
       4          123  43 to 102              13  4 to 12       ...          ...           ...          ...
       5           24  4 to 6                  2  4 to 7        ...          ...           ...          ...
       6         2423  4 to 18               223  7 to 8        ...          ...           ...          ...
       7          565  234 to 540              5  34 to 50      ...          ...           ...          ...
       8         5646  23 to 56               46  3 to 6        ...          ...           ...          ...
The Products are the same for all years, so I would like a data frame that contains the Number for each year as a column and combines the Minimum and Maximum columns into one.
Any help is welcome!
How about something like this. You are trying to join several dataframes by a single column, which is relatively straightforward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everything you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the beginning.
library(tidyverse)

# test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~ sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~ sample(.x:1000, 1)))

list(df1, df2) |>
  map(\(x) {
    year <- str_extract(colnames(x)[1], "\\d+?$")
    mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum)) |>
      rename(!!quo_name(paste0("Number ", year)) := Number) |>
      rename_with(~ gsub("\\s\\d+?$", "", .), 1) |>
      select(-c(Minimum, Maximum))
  }) |>
  reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#>   Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#>     <int>         <int> <chr>                  <int> <chr>
#> 1       1            29 21 to 481                 50 93 to 416
#> 2       2            28 17 to 314                 78 7 to 313
#> 3       3            72 40 to 787                  1 91 to 205
#> 4       4            43 36 to 557                 47 55 to 542
#> 5       5            45 70 to 926                 52 76 to 830
#> 6       6            34 96 to 645                 70 20 to 922
#> 7       7            48 31 to 197                 84 6 to 716
#> 8       8            17 86 to 951                 99 75 to 768
This is a similar answer, but it uses bind_rows to combine the data frames and then pivot_wider to end in a wide format.
The first step strips the year from the Product XXXX column name, since that name carries the year information for its data frame. Once that column is renamed to Product, the frames are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)

list(df1, df2, df3, df4) %>%
  map(~ .x %>%
        mutate(Year = gsub("Product", "", names(.x)[1])) %>%
        rename(Product = !!names(.[1]))) %>%
  bind_rows() %>%
  mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
  pivot_wider(id_cols = Product, names_from = Year,
              values_from = c(Number, Min_Max), names_vary = "slowest")
Output
  Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
    <int>       <int> <chr>              <int> <chr>              <int> <chr>              <int> <chr>
1       1          56 1 to 5                56 32 to 53              23 2 to 5               234 3 to 5
2       2          42 12 to 16             642 423 to 620           342 4 to 16             3242 4 to 16
3       3        6523 23 to 56           56423 432 to 560           223 3 to 5              2423 43 to 56
4       4         123 23 to 102              3 431 to 802            13 4 to 12              123 43 to 102
5       5          56 23 to 64               2 2 to 6                 2 4 to 7                24 4 to 6
6       6      245623 56 to 87            4523 43 to 68             223 7 to 8              2423 4 to 18
7       7         546 25 to 540            555 23 to 54               5 34 to 50             565 234 to 540
8       8       54566 253 to 560         55646 3 to 6                46 3 to 6              5646 23 to 56
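One small difference from the asker's target layout: the target lists 2021 first. If the year order matters, the columns can be reordered afterwards (result is a hypothetical name for the output of the pipe above):
result %>%
  select(Product, ends_with("2021"), ends_with("2020"),
         ends_with("2019"), ends_with("2018"))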

R: take one record per 30 days within each group

I have a dataset with over 1,000 unique IDs and, for each ID, about 15 surgery codes performed on different dates (recorded as Days.diff).
I want to take only one record per 30 days within each surgery code group for each ID.
Here is some demo data:
   ID Age Diag.Date Surgery.Code Days.diff
1   1  67  4/8/2011         A364       421
2   1  67  4/8/2011         A364      1197
3   1  67  4/8/2011         A364      2207
4   1  67  4/8/2011         A364      2226
5   1  67  4/8/2011         A364      2247
6   1  67  4/8/2011         A364      2254
7   1  67  4/8/2011         A364      2331
8   1  67  4/8/2011         A364      2367
9   1  67  4/8/2011         A364      2905
10  1  67  4/8/2011         A364      2918
11  1  67  4/8/2011         D365      2200
12  1  67  4/8/2011         D441       308
13  1  67  4/8/2011         D443       218
14  1  67  4/8/2011         A446       308
15  2  56  6/4/2018         A453      2260
16  2  56  6/4/2018         D453       645
17  2  56  6/4/2018         D453      3095
18  2  56  6/4/2018         B453       645
The difference between 2226 and 2207 is 19 days, so row 4 gets deleted; the difference between 2247 and 2207 is 40 days, so row 5 gets recorded.
The difference between 2254 and 2247 is 7 days, so row 6 gets deleted.
Similarly, row 10 gets deleted.
Any help is appreciated!
1. Use dplyr::group_by(ID, Surgery.Code) to filter within individuals and surgeries;
2. Within each group, use Days.diff - dplyr::lag(Days.diff) <= 30 to test for adjacent rows within 30 days;
3. Because the results of (2) may change when rows are removed, you'll want to iterate by removing one row at a time per group, then re-testing. You can use while to iterate until no more cases are detected.
library(dplyr)

# Flag rows that fall within 30 days of the immediately preceding row
filtered <- surgeries %>%
  group_by(ID, Surgery.Code) %>%
  mutate(within30 = if_else(
    Days.diff - lag(Days.diff) <= 30,
    row_number(),
    NA_integer_
  ))

# Remove the first flagged row per group, re-test, and repeat until
# no rows remain flagged
while (any(!is.na(filtered$within30))) {
  filtered <- filtered %>%
    mutate(within30 = if_else(
      Days.diff - lag(Days.diff) <= 30,
      row_number(),
      NA_integer_
    )) %>%
    filter(is.na(within30) | within30 != min(within30, na.rm = TRUE))
}

filtered %>%
  select(!within30) %>%
  ungroup()
#> # A tibble: 15 x 5
#>       ID   Age Diag.Date Surgery.Code Days.diff
#>    <int> <int> <chr>     <chr>            <int>
#>  1     1    67 4/8/2011  A364               421
#>  2     1    67 4/8/2011  A364              1197
#>  3     1    67 4/8/2011  A364              2207
#>  4     1    67 4/8/2011  A364              2247
#>  5     1    67 4/8/2011  A364              2331
#>  6     1    67 4/8/2011  A364              2367
#>  7     1    67 4/8/2011  A364              2905
#>  8     1    67 4/8/2011  D365              2200
#>  9     1    67 4/8/2011  D441               308
#> 10     1    67 4/8/2011  D443               218
#> 11     1    67 4/8/2011  A446               308
#> 12     2    56 6/4/2018  A453              2260
#> 13     2    56 6/4/2018  D453               645
#> 14     2    56 6/4/2018  D453              3095
#> 15     2    56 6/4/2018  B453               645
Created on 2022-03-01 by the reprex package (v2.0.1)
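If each group is sorted by Days.diff and the values are distinct within a group, the same keep-if-more-than-30-days-since-last-kept rule can also be written in a single pass with purrr::accumulate(), which carries the last kept value forward (a sketch, not part of the original answer):
library(dplyr)
library(purrr)

surgeries %>%
  arrange(ID, Surgery.Code, Days.diff) %>%
  group_by(ID, Surgery.Code) %>%
  # accumulate() yields the last *kept* Days.diff at each position;
  # a row survives when it equals its own accumulated value
  filter(Days.diff == accumulate(Days.diff, ~ if (.y - .x > 30) .y else .x)) %>%
  ungroup()
This avoids the while loop because the running "last kept" value is updated as the vector is scanned, so rows inside a 30-day window never reset the reference point.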

Str_count() for a dataframe Error: argument is not an atomic vector; coercing

I have a dataset where I need to count any values that contain the string "E9" and what I have so far seems to... sort of work. Here's an example dataset I'm working with:
    ColA  ColB  ColC ColD  ColE  ColF  ColG  ColH  ColI  ColJ  ColK  ColL
   <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
 1   407   345   381 0     E9    E9    E9       12 E9    E9    E9    E9
 2   328   301   314 0     6     6     7        10 7     8     7     6
 3   295   261   267 0     7     8     8         8 8     7     7     6
 4   163   199   298 0     2     6     3         6 7     6     6     3
 5   599   499   576 0     E9    E9    E9       17 E9    E9    E9    E9
 6   566   436   545 0     12    16    16       16 17    15    15    11
 7   200   168   170 0     5     5     5         5 5     5     5     5
 8   617   507   435 0     13    18    17       18 18    17    16    12
 9   624     0   629 18    13    0     0        18 18    17    16    14
10   177   163   161 0     4     5     4         5 5     5     4     3
Now, if I run this code:
df2$Exclusions <- str_count(df1, "E9")
I get the following warning:
Warning message:
In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing
However, it does give me the end result I want, which looks like this:
   Device Exclusions
   <chr>       <int>
 1 ColA            0
 2 ColB            0
 3 ColC            0
 4 ColD           16
 5 ColE           19
 6 ColF           19
 7 ColG           19
 8 ColH            0
 9 ColI           19
10 ColJ           19
From what I understand, str_count() is just mad that I'm using a data frame rather than a vector. For some reason it works fine anyway each time I use it like this, but when I try to put it inside a loop, it stops dead in its tracks. How can I achieve the same result with a function that is meant to work with data frames rather than vectors?
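One vector-safe approach (a sketch, not an answer from the original thread; df1 and the "E9" pattern are taken from the question) is to count per column with across() and then pivot to the long Device/Exclusions layout, which avoids the coercion warning entirely:
library(dplyr)
library(stringr)
library(tidyr)

# Count "E9" in each column (coercing each column to character first),
# then reshape to one row per column
df2 <- df1 %>%
  summarise(across(everything(), ~ sum(str_count(as.character(.x), "E9")))) %>%
  pivot_longer(everything(), names_to = "Device", values_to = "Exclusions")
Because every column is handled by across() as an atomic vector, this also works inside loops or functions.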

How to export tibble to .csv

I did an RFM analysis using the "rfm" package. The results are in a tibble and I can't seem to figure out how to export it to .csv. I tried the code below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM, 5)
  customer_ID        sales_date     sales
1           0 6/30/2017 0:00:00 101540.70
2           0  7/1/2017 0:00:00 110543.35
3           0  7/2/2017 0:00:00  60932.20
4           0  7/3/2017 0:00:00  75471.93
5           0  7/4/2017 0:00:00  43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
   customer_id date_most_recent recency_days transaction_count    amount recency_score frequency_score monetary_score rfm_score
         <dbl> <date>                  <dbl>             <dbl>     <dbl>         <int>           <int>          <int>     <dbl>
 1           0 2018-06-30                 12               366 42462470.             5               5              5       555
 2           1 2018-06-30                 12                20     2264.             5               5              5       555
 3           2 2018-01-12                181                24     1689              3               5              5       355
 4           3 2018-05-04                 69                27     1984.             4               5              5       455
 5           6 2017-12-07                217                12      922.             2               5              5       255
 6           7 2018-01-15                178                19     1680.             3               5              5       355
 7           9 2018-01-05                188                19     2106              2               5              5       255
 8          20 2018-04-11                 92                 4      414.             4               5              5       455
 9          26 2018-02-10                152                 1       72              3               1              2       312
10          48 2017-12-20                204                 1       90              2               1              3       213
11          68 2017-09-30                285                 1       37              1               1              1       111
12          70 2017-12-17                207                 1       18              2               1              1       211
13         104 2017-08-11                335                 1       90              1               1              3       113
14         120 2017-07-27                350                 1       19              1               1              1       111
15         134 2018-01-13                180                 1      275              3               1              4       314
16         153 2018-06-24                 18                10     1677              5               5              5       555
17         155 2018-05-28                 45                 1      315              5               1              4       514
18         171 2018-06-11                 31                 6     3485.             5               5              5       555
19         172 2018-05-24                 49                 1       93              5               1              3       513
20         174 2018-06-06                 36                 3      347.             5               4              5       545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of rfm_table_order is not only a tibble. Looking at an already-solved question and using its data, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you choose this:
> rfm_result$rfm
# A tibble: 325 x 9
   customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
         <int> <date>                  <dbl>             <dbl>  <int>         <int>           <int>          <int>     <dbl>
 1           1 2017-08-06                353                 1    145             4               1              2       412
 2           2 2016-10-15                648                 1    268             2               1              3       213
 3           5 2016-12-14                588                 1    119             3               1              1       311
 4           7 2017-04-27                454                 1    290             3               1              3       313
 5           8 2016-12-07                595                 3    835             2               5              5       255
 6          10 2017-07-31                359                 1    192             4               1              2       412
 7          11 2017-08-16                343                 1    278             4               1              3       413
 8          12 2017-10-14                284                 2    294             5               4              3       543
 9          15 2016-07-12                743                 1    206             2               1              2       212
10          17 2017-05-22                429                 2    405             4               4              4       444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm, file = "your_path\\df.csv")
The OP asks for CSV output.
Being very picky, write.table(rfm_result$rfm, file = "your_path\\df.csv") does not produce a CSV: write.table's default separator is a space, not a comma.
If you want a CSV, add the sep = "," parameter, and you'll also likely want row.names = FALSE so the row names are not written out.
write.table(rfm_result$rfm, file = "your_path\\df.csv", sep = ",", row.names = FALSE)
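For completeness (an addition to the answers above), base R's write.csv() and readr's write_csv() both write comma-separated files directly:
# Base R: the comma separator is built in; drop the row names explicitly
write.csv(rfm_result$rfm, file = "your_path\\df.csv", row.names = FALSE)

# readr (part of the tidyverse): comma-separated, no row names by default
readr::write_csv(rfm_result$rfm, "your_path\\df.csv")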
