dplyr Error: cannot modify grouping variable even when first applying ungroup

I'm getting this error, but the fixes in related posts don't seem to apply. I'm using ungroup, though it's no longer needed (see "Can I switch the grouping variable in a single dplyr statement?" and "Format column within dplyr chain"). I also have no quotes in my group_by call, and I'm not applying any functions that act on the grouped-by columns (see "R dplyr summarize_each --> 'Error: cannot modify grouping variable'"), but I'm still getting this error:
> games2 = baseball %>%
+ ungroup %>%
+ group_by(id, year) %>%
+ summarize(total=g+ab, a = ab+1, id = id)%>%
+ arrange(desc(total)) %>%
+ head(10)
Error: cannot modify grouping variable
This is the baseball set that comes with plyr:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
I loaded plyr before dplyr. Are there other bugs I should check for? Thanks for any corrections/suggestions.

It's not clear what you are doing. I think the following is what you are looking for:
games2 = baseball %>%
group_by(id, year) %>%
mutate(total=g+ab, a = ab+1)%>%
arrange(desc(total)) %>%
head(10)
> games2
Source: local data frame [10 x 24]
Groups: id, year
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp total a
1 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2 28 39 NA 3 6 4 13 590 469
2 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1 49 61 5 3 7 4 20 755 603
3 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4 37 54 6 2 5 7 21 762 610
4 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1 57 58 15 0 0 3 13 766 616
5 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1 59 49 16 1 0 3 21 754 602
6 aaronha01 1959 1 ML1 NL 154 629 116 223 46 7 39 123 8 0 51 54 17 4 0 9 19 783 630
7 aaronha01 1960 1 ML1 NL 153 590 102 172 20 11 40 126 16 7 60 63 13 2 0 12 8 743 591
8 aaronha01 1961 1 ML1 NL 155 603 115 197 39 10 34 120 21 9 56 64 20 2 1 9 16 758 604
9 aaronha01 1962 1 ML1 NL 156 592 127 191 28 6 45 128 15 7 66 73 14 3 0 6 14 748 593
10 aaronha01 1963 1 ML1 NL 161 631 121 201 29 4 44 130 31 5 78 94 18 0 0 5 11 792 632

The problem is that you are trying to edit id in the summarize call, but you have grouped on id.
From your example, it looks like you want mutate anyway. You would use summarize if you were applying a function that returns a single value per group, such as sum or mean.
games2 = baseball %>%
dplyr::group_by(id, year) %>%
dplyr::mutate(
total = g + ab,
a = ab + 1
) %>%
dplyr::select(id, year, total, a) %>%
dplyr::arrange(desc(total)) %>%
head(10)
Source: local data frame [10 x 4]
Groups: id, year
id year total a
1 aaronha01 1954 590 469
2 aaronha01 1955 755 603
3 aaronha01 1956 762 610
4 aaronha01 1957 766 616
5 aaronha01 1958 754 602
6 aaronha01 1959 783 630
7 aaronha01 1960 743 591
8 aaronha01 1961 758 604
9 aaronha01 1962 748 593
10 aaronha01 1963 792 632
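To make the summarize/mutate distinction concrete, here is a minimal sketch (the column names total_g and total_ab are made up for illustration): summarise collapses each (id, year) group to a single row, so every column it creates must be a per-group aggregate.

```r
library(plyr)    # provides the baseball data; load before dplyr
library(dplyr)

# summarise() returns one row per (id, year) group, so each column it
# creates must be a single value per group -- an aggregate, not a vector:
tot <- baseball %>%
  group_by(id, year) %>%
  summarise(total_g = sum(g), total_ab = sum(ab)) %>%
  arrange(desc(total_ab))
head(tot, 3)
```

Because the grouping columns id and year are never assigned to inside summarise, this avoids the "cannot modify grouping variable" error entirely.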

Related

Group by sum specific column in R

df <- data.frame(items=sample(LETTERS,replace= T),quantity=sample(1:100,26,replace=FALSE),price=sample(100:1000,26,replace=FALSE))
I want to group rows so that each group's quantity sums to about 500 (ballpark).
When the running count gets close to 500, the rows should go in the same group, like below.
Any help would be appreciated.
Updated
Because the condition needed to change, I reset the threshold to 250.
I use summarise to find the max running total for each group, and then:
how could I merge group 6, whose total is < 200, into group 5?
I thought about using ifelse but couldn't make it work.
set.seed(123)
df <- data.frame(items=sample(LETTERS,replace= T),quantity=sample(1:100,26,replace=FALSE),price=sample(100:1000,26,replace=FALSE))
df$group=cumsum(c(1,ifelse(diff(cumsum(df$quantity)%% 250) < 0,1,0)))
df$total=ave(df$quantity,df$group,FUN=cumsum)
df %>% group_by(group) %>% summarise(max = max(total, na.rm=TRUE))
# A tibble: 6 × 2
group max
<dbl> <int>
1 1 238
2 2 254
3 3 256
4 4 246
5 5 237
6 6 101
I want to get output like this:
> df
items quantity price group total
1 O 36 393 1 36
2 S 78 376 1 114
3 N 81 562 1 195
4 C 43 140 1 238
5 J 76 530 2 76
6 R 15 189 2 91
7 V 32 415 2 123
8 K 7 322 2 130
9 E 9 627 2 139
10 T 41 215 2 180
11 N 74 705 2 254
12 V 23 873 3 23
13 Y 27 846 3 50
14 Z 60 555 3 110
15 E 53 697 3 163
16 S 93 953 3 256
17 Y 86 138 4 86
18 Y 88 258 4 174
19 I 38 851 4 212
20 C 34 308 4 246
21 H 69 473 5 69
22 Z 72 917 5 141
23 G 96 133 5 237
24 J 63 615 5 300
25 I 13 112 5 376
26 S 25 168 5 477
Thank you for any help.
Base R
set.seed(123)
df <- data.frame(items=sample(LETTERS,replace= T),quantity=sample(1:100,26,replace=FALSE),price=sample(100:1000,26,replace=FALSE))
df$group=cumsum(c(1,ifelse(diff(cumsum(df$quantity)%%500)<0,1,0)))
df$total=ave(df$quantity,df$group,FUN=cumsum)
items quantity price group total
1 O 36 393 1 36
2 S 78 376 1 114
3 N 81 562 1 195
4 C 43 140 1 238
5 J 76 530 1 314
6 R 15 189 1 329
7 V 32 415 1 361
8 K 7 322 1 368
9 E 9 627 1 377
10 T 41 215 1 418
11 N 74 705 1 492
12 V 23 873 2 23
13 Y 27 846 2 50
14 Z 60 555 2 110
15 E 53 697 2 163
16 S 93 953 2 256
17 Y 86 138 2 342
18 Y 88 258 2 430
19 I 38 851 2 468
20 C 34 308 2 502
21 H 69 473 3 69
22 Z 72 917 3 141
23 G 96 133 3 237
24 J 63 615 3 300
25 I 13 112 3 313
26 S 25 168 3 338
You could use Reduce(..., accumulate = TRUE) to reset the running sum each time the cumulative quantity reaches 500, and derive the group numbers from those resets.
set.seed(123)
df <- data.frame(items=sample(LETTERS,replace= T),quantity=sample(1:100,26,replace=FALSE),price=sample(100:1000,26,replace=FALSE))
library(dplyr)
df %>%
group_by(group = lag(cumsum(Reduce(\(x, y) {
z <- x + y
if(z < 500) z else 0
}, quantity, accumulate = TRUE) == 0) + 1, default = 1)) %>%
mutate(total = sum(quantity)) %>%
ungroup()
# A tibble: 26 × 5
items quantity price group total
<chr> <int> <int> <dbl> <int>
1 O 36 393 1 515
2 S 78 376 1 515
3 N 81 562 1 515
4 C 43 140 1 515
5 J 76 530 1 515
6 R 15 189 1 515
7 V 32 415 1 515
8 K 7 322 1 515
9 E 9 627 1 515
10 T 41 215 1 515
11 N 74 705 1 515
12 V 23 873 1 515
13 Y 27 846 2 548
14 Z 60 555 2 548
15 E 53 697 2 548
16 S 93 953 2 548
17 Y 86 138 2 548
18 Y 88 258 2 548
19 I 38 851 2 548
20 C 34 308 2 548
21 H 69 473 2 548
22 Z 72 917 3 269
23 G 96 133 3 269
24 J 63 615 3 269
25 I 13 112 3 269
26 S 25 168 3 269
Here is a base R solution. The groups break after the cumulative sum passes a threshold. The output of aggregate shows that all cumulative sums are above thres except for the last one.
set.seed(2022)
df <- data.frame(items=sample(LETTERS,replace= T),
quantity=sample(1:100,26,replace=FALSE),
price=sample(100:1000,26,replace=FALSE))
f <- function(x, thres) {
grp <- integer(length(x))
run <- 0
current_grp <- 0L
for(i in seq_along(x)) {
run <- run + x[i]
grp[i] <- current_grp
if(run > thres) {
current_grp <- current_grp + 1L
run <- 0
}
}
grp
}
thres <- 500
group <- f(df$quantity, thres)
aggregate(quantity ~ group, df, sum)
#> group quantity
#> 1 0 552
#> 2 1 513
#> 3 2 214
ave(df$quantity, group, FUN = cumsum)
#> [1] 70 133 155 224 235 327 347 409 481 484 552 29 95 129 224 263 294 377 433
#> [20] 434 453 513 50 91 182 214
Created on 2022-09-06 by the reprex package (v2.0.1)
Edit
Groups and total quantities can be assigned to the data as follows.
df$group <- f(df$quantity, thres)
df$total_quantity <- ave(df$quantity, df$group, FUN = cumsum)
head(df)
#> items quantity price group total_quantity
#> 1 D 70 731 0 70
#> 2 S 63 516 0 133
#> 3 N 22 710 0 155
#> 4 W 69 829 0 224
#> 5 K 11 887 0 235
#> 6 D 92 317 0 327
Created on 2022-09-06 by the reprex package (v2.0.1)
Edit 2
To assign only the total quantity per group, use sum instead of cumsum.
df$total_quantity <- ave(df$quantity, df$group, FUN = sum)

How to test for p-value with groups/filters in dplyr

My data looks like the example below. (Sorry if it's too long; I'm not sure what's acceptable/needed.)
I have used the following code to calculate the median and IQR of each time difference (tdif) between tests (testno):
data %>% group_by(testno) %>% filter(type ==1) %>%
summarise(Median = median(tdif), IQR= IQR(tdif), n= n(), .groups = 'keep') -> result
I have done this for each category of 'type' (coded as 1 - 10), which brought me to the added table (bottom).
My question is whether it is possible to:
Do this an easier way (without the filters, so I can do it all in one run), and
Run a test for a p-value across all the groups/filters?
data <- read.table(header=T, text= '
PID time tdif testno type
3 205 0 1 1
4 77 0 1 1
4 85 8 2 1
4 126 41 3 1
4 165 39 4 1
4 202 37 5 1
4 238 36 6 1
4 272 34 7 1
4 277 5 8 1
4 370 93 9 1
4 397 27 10 1
4 452 55 11 1
4 522 70 12 1
4 529 7 13 1
4 608 79 14 1
4 651 43 15 1
4 655 4 16 1
4 713 58 17 1
4 804 91 18 1
4 900 96 19 1
4 944 44 20 1
4 979 35 21 1
4 1015 36 22 1
4 1051 36 23 1
4 1077 26 24 1
4 1124 47 25 1
4 1162 38 26 1
4 1222 60 27 1
4 1334 112 28 1
4 1383 49 29 1
4 1457 74 30 1
4 1506 49 31 1
4 1590 84 32 1
4 1768 178 33 1
4 1838 70 34 1
4 1880 42 35 1
4 1915 35 36 1
4 1973 58 37 1
4 2017 44 38 1
4 2090 73 39 1
4 2314 224 40 1
4 2381 67 41 1
4 2433 52 42 1
4 2484 51 43 1
4 2694 210 44 1
4 2731 37 45 1
4 2792 61 46 1
4 2958 166 47 1
5 48 0 1 3
5 111 63 2 3
5 699 588 3 3
5 1077 378 4 3
6 -43 0 1 3
8 67 0 1 1
8 168 101 2 1
8 314 146 3 1
8 368 54 4 1
8 586 218 5 1
10 639 0 1 6
13 -454 0 1 3
13 -384 70 2 3
13 -185 199 3 3
13 193 378 4 3
13 375 182 5 3
13 564 189 6 3
13 652 88 7 3
13 669 17 8 3
13 718 49 9 3
14 704 0 1 8
15 -165 0 1 3
15 -138 27 2 3
15 1335 1473 3 3
16 168 0 1 6
18 -1329 0 1 3
18 -1177 152 2 3
18 -1071 106 3 3
18 -945 126 4 3
18 -834 111 5 3
18 -719 115 6 3
18 -631 88 7 3
18 -497 134 8 3
18 -376 121 9 3
18 -193 183 10 3
18 -78 115 11 3
18 -13 65 12 3
18 100 113 13 3
18 196 96 14 3
18 552 356 15 3
18 650 98 16 3
18 737 87 17 3
18 804 67 18 3
18 902 98 19 3
18 983 81 20 3
18 1119 136 21 3
19 802 0 1 1
19 1593 791 2 1
26 314 0 1 8
26 389 75 2 8
26 597 208 3 8
33 639 0 1 6
Added table (values differ from example data, because this isn't the complete set).
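A sketch of one way to do both at once, shown on a small subset of the question's data: grouping by type as well as testno removes the need for per-type filters, and kruskal.test is one option for a p-value comparing tdif across types (whether it is the appropriate test depends on your design).

```r
library(dplyr)

# A few rows of the question's data, covering two types
data <- read.table(header = TRUE, text = '
PID time tdif testno type
4     77    0      1    1
4     85    8      2    1
4    126   41      3    1
5     48    0      1    3
5    111   63      2    3
5    699  588      3    3
')

# One run, no per-type filter: group by type as well as testno
result <- data %>%
  group_by(type, testno) %>%
  summarise(Median = median(tdif), IQR = IQR(tdif), n = n(), .groups = "drop")

# Non-parametric omnibus test of tdif across the type groups
kruskal.test(tdif ~ factor(type), data = data)
```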

Putting several rows into one column in R

I am trying to run a time series analysis on the following data set:
Year 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780
Number 101 82 66 35 31 7 20 92 154 125
Year 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790
Number 85 68 38 23 10 24 83 132 131 118
Year 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800
Number 90 67 60 47 41 21 16 6 4 7
Year 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810
Number 14 34 45 43 48 42 28 10 8 2
Year 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820
Number 0 1 5 12 14 35 46 41 30 24
Year 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
Number 16 7 4 2 8 17 36 50 62 67
Year 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840
Number 71 48 28 8 13 57 122 138 103 86
Year 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850
Number 63 37 24 11 15 40 62 98 124 96
Year 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860
Number 66 64 54 39 21 7 4 23 55 94
Year 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870
Number 96 77 59 44 47 30 16 7 37 74
My problem is that the data is spread across multiple rows. I am trying to make two columns from the data, one for Year and one for Number, so that it is easily readable in R. I have tried
> library(tidyverse)
> sun.df = data.frame(sunspots)
> Year = filter(sun.df, sunspots == "Year")
to isolate the Year data, and it works, but I am unsure of how to then place it in a column.
Any suggestions?
Try this:
library(tidyverse)
df <- read_csv("test.csv",col_names = FALSE)
df
# A tibble: 6 x 4
# X1 X2 X3 X4
# <chr> <dbl> <dbl> <dbl>
# 1 Year 123 124 125
# 2 Number 1 2 3
# 3 Year 126 127 128
# 4 Number 4 5 6
# 5 Year 129 130 131
# 6 Number 7 8 9
# Remove the first column and transpose to get a dataframe of numbers
df_number <- as.data.frame(as.matrix(t(df[,-1])),row.names = FALSE)
df_number
# V1 V2 V3 V4 V5 V6
# 1 123 1 126 4 129 7
# 2 124 2 127 5 130 8
# 3 125 3 128 6 131 9
# Keep the first two columns (V1,V2) and assign column names
df_new <- df_number[1:2]
colnames(df_new) <- c("Year","Number")
# Iterate and rbind with subsequent columns (2 by 2) to df_new
for(i in 1:((ncol(df_number) - 2 )/2)) {
df_mini <- df_number[(i*2+1):(i*2+2)]
colnames(df_mini) <- c("Year","Number")
df_new <- rbind(df_new,df_mini)
}
df_new
# Year Number
# 1 123 1
# 2 124 2
# 3 125 3
# 4 126 4
# 5 127 5
# 6 128 6
# 7 129 7
# 8 130 8
# 9 131 9
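An alternative sketch that avoids the rbind loop: since the "Year" rows and "Number" rows alternate, you can subset each kind of row and flatten it row by row (shown here with inline data standing in for test.csv).

```r
# Inline stand-in for test.csv
df <- read.csv(text = "Year,123,124,125
Number,1,2,3
Year,126,127,128
Number,4,5,6", header = FALSE)

# c(t(...)) flattens each subset row by row, keeping the
# Year/Number pairs aligned in the same order as the file
df_new <- data.frame(
  Year   = c(t(df[df$V1 == "Year",   -1])),
  Number = c(t(df[df$V1 == "Number", -1]))
)
df_new
```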

Pivot / Reshape data [closed]

My sample data looks like this:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time (days) Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked into wide-to-long reshaping, but it doesn't seem to do the trick.
Kind regards, and thank you in advance.
Here is a base R option
u <- cbind(
data[1],
do.call(
rbind,
lapply(
split.default(data[-1], ceiling(seq_along(data[-1]) / 2)),
setNames,
c("Value", "Time")
)
)
)
out <- `row.names<-`(
subset(
x <- u[order(u$pid), ],
complete.cases(x)
), NULL
)
such that
> out
pid Value Time
1 1 1356 1435
2 1 1483 1405
3 1 1563 1374
4 2 943 1848
5 2 1173 1818
6 2 1300 1785
7 3 1590 185
8 3 1585 294
9 4 130 72
10 4 140 82
11 4 220 126
12 4 166 159
13 4 380 189
14 4 353 231
15 4 180 268
16 4 571 334
17 4 443 70
18 4 266 124
19 4 213 156
20 4 583 173
21 4 510 222
22 4 596 303
23 4 476 145
24 4 656 217
25 4 816 289
26 4 136 79
27 4 756 89
28 4 703 128
29 4 776 166
30 4 586 203
31 4 526 240
32 4 580 278
33 4 483 371
An option with pivot_longer
library(dplyr)
library(tidyr)
names(data)[8] <- "measurement4"
data %>%
pivot_longer(cols = -pid, names_to = c('.value', 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>% select(-grp)
# A tibble: 33 x 3
# pid measurement Tdays
# <int> <int> <int>
# 1 1 1356 1435
# 2 1 1483 1405
# 3 1 1563 1374
# 4 2 943 1848
# 5 2 1173 1818
# 6 2 1300 1785
# 7 3 1590 185
# 8 3 1585 294
# 9 4 130 72
#10 4 443 70
# … with 23 more rows
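If you also want the column names and order from the desired output, the same pipe can finish with rename() and relocate() (relocate() needs dplyr >= 1.0); sketched here on the first rows of the data:

```r
library(dplyr)
library(tidyr)

data <- read.table(header = TRUE, text = '
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
3 1590  185   NA   NA   NA   NA 1585 294
')
names(data)[8] <- "measurement4"

out <- data %>%
  pivot_longer(cols = -pid, names_to = c(".value", "grp"),
               names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>%
  select(-grp) %>%
  rename(PID = pid, Value = measurement, `Time (days)` = Tdays) %>%
  relocate(PID, `Time (days)`, Value)
out
```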

Add values to dataframe when condition met in R

I have a dataframe like this:
ID1 ID2 Position Grade Day
234 756 2 87 27
245 486 4 66 26
321 275 1 54 20
768 656 6 51 7
421 181 1 90 14
237 952 8 68 23
237 553 4 32 30
And I have another dataframe like this:
ID1 ID2 Day Count
234 756 2 3
245 486 2 1
209 706 2 1
124 554 2 2
237 553 2 4
I need to add the Counts to the first dataframe where ID1, ID2 and Day match. However, if there is no match (no Count in the second dataframe for a given ID1, ID2 and Day in the first dataframe), then a zero should be put in that place. The final dataframe would look something like:
ID1 ID2 Position Grade Day Count
234 756 2 87 27 3
245 486 4 66 26 1
321 275 1 54 20 0
768 656 6 51 7 0
421 181 1 90 14 0
237 952 8 68 23 0
237 553 4 32 30 4
This can be useful:
> # First, merge df1 and df2
> df3 <- merge(df1, df2, by=c("ID1", "ID2"), all.x=TRUE)
> # Replace NA with 0's
> transform(df3[, -6], Count=ifelse(is.na(Count), 0, Count))
ID1 ID2 Position Grade Day.x Count
1 234 756 2 87 27 3
2 237 553 4 32 30 4
3 237 952 8 68 23 0
4 245 486 4 66 26 1
5 321 275 1 54 20 0
6 421 181 1 90 14 0
7 768 656 6 51 7 0
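A dplyr sketch of the same idea, with two small stand-in dataframes: left_join() keeps every row of the first table, and coalesce() turns the NAs from non-matching rows into zeros. Like the merge() answer above, this joins on ID1 and ID2 only, since Day does not actually line up between the two tables.

```r
library(dplyr)

df1 <- data.frame(ID1 = c(234, 321), ID2 = c(756, 275),
                  Position = c(2, 1), Grade = c(87, 54), Day = c(27, 20))
df2 <- data.frame(ID1 = 234, ID2 = 756, Day = 2, Count = 3)

res <- df1 %>%
  left_join(df2 %>% select(ID1, ID2, Count), by = c("ID1", "ID2")) %>%
  mutate(Count = coalesce(Count, 0))
res
```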
