I have a data frame in which each ID belongs to a unique group. I wish to create a summary table that tells me the number of observations for each ID and which group it belongs to.
library(dplyr)
dat <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4),
                  group = c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0))
count <- dat %>% group_by(id) %>% tally()
count
## A tibble: 4 x 2
id n
<dbl> <int>
1 1 3
2 2 4
3 3 1
4 4 5
With the code above I can count the number of observations, but I have no idea how to create a third column for the group. The desired result is:
# A tibble: 4 x 3
id n group
<dbl> <int> <dbl>
1 1 3 1
2 2 4 0
3 3 1 1
4 4 5 0
When I do
dat %>% group_by(id) %>% summarise(n=count(id), group = unique(group))
I got an error: Error in quickdf(.data[names(cols)]) : length(rows) == 1 is not TRUE
However, when I do
dat %>% group_by(id) %>% summarise( group = unique(group))
It works. I was confused about why the summarise command could not take multiple arguments.
Update: the error was caused by another package called "plyr". summarise works well after I detached plyr.
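If detaching plyr is not an option, explicitly namespacing the call should also work around the masking. A minimal sketch, assuming plyr was attached after dplyr (which is what masks summarise()):

library(dplyr)
library(plyr)  # attaching plyr after dplyr masks dplyr::summarise()

dat %>%
  group_by(id) %>%
  dplyr::summarise(n = n(), group = unique(group))  # call dplyr's version explicitly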
We can use count
library(dplyr)
dat %>%
count(id, group)
# A tibble: 4 x 3
# id group n
# <dbl> <dbl> <int>
#1 1 1 3
#2 2 0 4
#3 3 1 1
#4 4 0 5
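If a different name than n is wanted for the count column, count() also takes a name argument (in recent dplyr versions), e.g.:

dat %>%
  count(id, group, name = "n_obs")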
akrun's answer is more elegant, but as an alternative you can simply add the group variable to your group_by() call:
library(dplyr)
dat <- tibble(id = c(1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4),
group = c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0))
dat %>%
group_by(id, group) %>%
tally()
# A tibble: 4 x 3
# Groups: id [4]
id group n
<dbl> <dbl> <int>
1 1 1 3
2 2 0 4
3 3 1 1
4 4 0 5
Note that if your id and group do not have a straightforward correspondence like in your example (id = 1 -> group = 1, id = 2 -> group = 0, and so on), it will generate a row for each combination (which obviously is very useful). For example,
dat2 <- tibble(id = c(1, 1, 1, 2, 2), group = c(1, 0, 0, 1, 0))
dat2 %>%
group_by(id, group) %>%
tally()
# A tibble: 4 x 3
# Groups: id [2]
id group n
<dbl> <dbl> <int>
1 1 0 2
2 1 1 1
3 2 0 1
4 2 1 1
Related
My dataframe contains data about political careers, such as a unique identifier (called: ui) column for each politician and the electoral term(called: electoral_term) in which they were elected. Since a politician can be elected in multiple electoral terms, there are multiple rows that contain the same ui.
Now I would like to add another column to my dataframe, that counts how many times the politician got re-elected.
So e.g. the politician with ui = 1 was re-elected 2 times, since they appear in 3 electoral_terms.
I already tried
df %>% count(ui)
But that only returns a summary table, which can't be added to my dataframe.
Thanks in advance!
We may use base R
df$reelected <- with(df, ave(ui, ui, FUN = length)-1)
-output
> df
ui electoral reelected
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
data
df <- structure(list(ui = c(1, 1, 1, 2, 3, 3), electoral = c(1, 2,
3, 2, 7, 9)), class = "data.frame", row.names = c(NA, -6L))
df <- tibble::tribble(~ui, ~electoral, 1, 1, 1, 2, 1, 3, 2, 2, 3, 7, 3, 9)
library(dplyr)
df |>
add_count(ui, name = "re_elected") |>
mutate(re_elected = re_elected - 1)
# A tibble: 6 × 3
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
library(tidyverse)
df %>%
group_by(ui) %>%
mutate(re_elected = n() - 1)
# A tibble: 6 × 3
# Groups: ui [3]
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
A similar question was asked here... however, I can't get it to work in my case and I'm not sure why.
I am trying to arrange a tibble based on 2 columns. For example, in my data, I am trying to arrange by the value and count columns. To begin, I show a working example:
library(dplyr)
dat <- tibble(
value = c("B", "D", "D", "E", "A", "A", "B", "C", "B", "E"),
ids = c(1:10),
count = c(3, 2, 1, 2, 2, 1, 2, 1, 1, 1)
)
dat %>%
group_by(value) %>%
mutate(valrank = min(ids)) %>%
ungroup() %>%
arrange(valrank, value, desc(count))
looking at the output:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 B 1 3 1
2 B 7 2 1
3 B 9 1 1
4 D 2 2 2
5 D 3 1 2
6 E 4 2 4
7 E 10 1 4
8 A 5 2 5
9 A 6 1 5
10 C 8 1 8
We can see that the code worked... the tibble is arranged by the value column, and the order is based on how many times each element appears in the tibble (i.e., the count).
However, when I try the following example, the same code doesn't work:
dat_1 <- tibble(
value = c("x2....", "x5...." , "x5....", "x3...." , "x3....", "x4....", "x3....", "x3....", "x4....", "x2...." ),
ids = c(1:10),
count = c(2, 2, 1, 4, 3, 2, 2, 1, 1, 1)
)
dat_1 %>%
group_by(value) %>%
mutate(valrank = min(ids)) %>%
ungroup() %>%
arrange(valrank, value, desc(count))
Looking at this output, we get:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 x2.... 1 2 1
2 x2.... 10 1 1
3 x5.... 2 2 2
4 x5.... 3 1 2
5 x3.... 4 4 4
6 x3.... 5 3 4
7 x3.... 7 2 4
8 x3.... 8 1 4
9 x4.... 6 2 6
10 x4.... 9 1 6
So we can see this has failed to reorder the tibble based on the count. In the 2nd example, x3 appears the most (i.e., has the highest count), so it should appear at the top of the tibble.
I'm not sure what I'm doing wrong here!
UPDATE:
I think I may have solved this problem with:
dat_1 %>%
group_by(value) %>%
mutate(valrank = max(count)) %>%
ungroup() %>%
arrange(-valrank, value, -count)
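For what it's worth, the original code ranked by min(ids), i.e. by first appearance, which only coincidentally matched the counts in the first example. An equivalent way to get the frequency-based ordering without building valrank from count is add_count(). This is just a sketch, and it assumes (as in the example data) that count runs down from the number of occurrences of each value, so the number of rows per value is what actually drives the ranking:

library(dplyr)

dat_1 %>%
  add_count(value, name = "valrank") %>%   # valrank = number of rows per value
  arrange(desc(valrank), value, desc(count))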
I have data on hospital admissions per patient. I am trying to add up the price of care for patients who were re-admitted to hospital within 5 days.
This is an example dataset:
(
dt <- data.frame(
id = c(1, 1, 2, 2, 3, 4),
admit_date = c(1, 9, 5, 9, 10, 20),
price = c(10, 20, 20, 30, 15, 16)
)
)
# id admit_date price
# 1 1 1 10
# 2 1 9 20
# 3 2 5 20
# 4 2 9 30
# 5 3 10 15
# 6 4 20 16
And this is what I have tried so far:
library(dplyr)
# 5-day readmission:
dt %>%
group_by(id) %>%
arrange(id, admit_date)%>%
mutate(
duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)
) %>%
group_by(id, readmit) %>% # this is where I get stuck
summarize(sumprice = sum(price))
# # A tibble: 6 × 3
# # Groups: id [4]
# id readmit sumprice
# <dbl> <dbl> <dbl>
# 1 1 0 20
# 2 1 NA 10
# 3 2 1 30
# 4 2 NA 20
# 5 3 NA 15
# 6 4 NA 16
And this is what I would like to have:
# id sum_price
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
If the difference in days between adjacent visits is greater than 5, return TRUE; if not, return FALSE (the lag's default is Inf, so the first visit gives -Inf > 5, which is FALSE). After that, for each individual we take a cumulative sum of these values to label the groups. We finally summarize within each individual, using this cumsum as a grouping variable for by:
dt |>
group_by(id) |>
arrange(id, admit_date) |>
summarise(
sum_price = by(
price,
cumsum((admit_date - lag(admit_date, , Inf)) > 5),
sum
)
) |>
ungroup()
# # A tibble: 5 × 2
# id sum_price
# <dbl> <by>
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
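The by() call leaves sum_price as a <by> column. If a plain numeric column is preferred, the same cumulative-sum idea can be written with an explicit stay grouping variable and a regular summarise (a sketch, assuming dt as defined in the question):

library(dplyr)

dt %>%
  arrange(id, admit_date) %>%
  group_by(id) %>%
  # a new stay starts whenever the gap to the previous admission exceeds 5 days
  mutate(stay = cumsum(coalesce(admit_date - lag(admit_date), Inf) > 5)) %>%
  group_by(id, stay) %>%
  summarise(sum_price = sum(price), .groups = "drop") %>%
  select(-stay)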
So, you want (at most) one row per patient in the final dataframe, so you should group on just id.
Then, for each patient, you should calculate whether that patient has any row with readmit == 1.
Finally, you filter out any patient that wasn't readmitted from your summarized dataframe.
Putting it all together, it might look like:
dt %>%
group_by(id) %>%
arrange(id, admit_date) %>%
mutate(duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)) %>%
group_by(id) %>% # group by just 'id' to get one row per patient
summarize(sumprice = sum(price, na.rm = T),
is_readmit = any(readmit == 1)) %>% # If patient has any 'readmit' rows, count the patient as a readmit patient
filter(is_readmit) %>% # Filter out any non-readmit patients
select(-is_readmit) # get rid of the `is_readmit` column
Which should result in:
# A tibble: 1 x 2
     id sumprice
  <dbl>    <dbl>
1     2       50
I am trying to aggregate a dataset of 12,000 obs. with 37 variables, in which I want to group by 2 variables and sum 1 variable.
All other variables must remain, as these contain important information for later analysis.
Most remaining variables contain the same value within a group; from the others I want to select the first value.
To get a better feeling for what is happening, I created a small random test dataset (10 obs., 5 variables).
row <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
set1 <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3)
set2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)
set3 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(row, y, set1, set2, set3)
df
row y set1 set2 set3
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 2
4 4 2 1 2 2
5 5 2 1 2 3
6 6 2 1 2 3
7 7 3 2 1 4
8 8 3 2 1 4
9 9 3 2 2 5
10 10 4 3 1 5
I want to aggregate the data based on set1 and set2, summing the y values, whilst keeping the other columns (here row and set3) by selecting the first value within each group, resulting in the following aggregated data frame (or tibble):
# row y set1 set2 set3
# 1 3 1 1 1
# 4 6 1 2 2
# 7 6 2 1 4
# 9 3 2 2 5
# 10 4 3 1 5
I have checked other questions for a possible solution, but have not been able to solve mine.
The most important questions and websites I have looked into and tried are:
Combine rows and sum their values
https://community.rstudio.com/t/combine-rows-and-sum-values/41963
https://datascienceplus.com/aggregate-data-frame-r/
R: How to aggregate some columns while keeping other columns
Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?
I have figured out that using summarise in dplyr always results in removal of remaining variables.
I thought I had found a solution with R: How to aggregate some columns while keeping other columns, as reproducing the example gave satisfying results.
Using
library(dplyr)
df_aggr1 <-
df %>%
group_by(set1, set2) %>%
slice(which.max(y))
resulted in:
# A tibble: 5 x 5
# Groups: set1, set2 [5]
row y set1 set2 set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 4 2 1 2 2
3 7 3 2 1 4
4 9 3 2 2 5
5 10 4 3 1 5
However, using
library(dplyr)
df_aggr2 <-
df %>%
group_by(set1, set2) %>%
slice(sum(y))
resulted in:
# A tibble: 1 x 5
# Groups: set1, set2 [1]
row y set1 set2 set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 1 1 2
Here y apparently is not even summed, so I do not understand what is happening.
What am I missing?
Thanks in advance!
It works for me when literally specifying that you want the first value, i.e.:
library(tidyverse)
df %>%
group_by(set1, set2) %>%
summarize(y = sum(y),
row = row[1],
set3 = set3[1])
# A tibble: 5 x 5
# Groups: set1 [3]
set1 set2 y row set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 3 1 1
2 1 2 6 4 2
3 2 1 6 7 4
4 2 2 3 9 5
5 3 1 4 10 5
Edit: To keep every other column without specifying, you can make use of across() and indicate that you want to apply this aggregation to every column except one.
df %>%
group_by(set1, set2) %>%
summarize(
across(!y, ~ .x[1]),
y = sum(y)
)
# A tibble: 5 x 5
# Groups: set1 [3]
set1 set2 row set3 y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 3
2 1 2 4 2 6
3 2 1 7 4 6
4 2 2 9 5 3
5 3 1 10 5 4
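dplyr's first() can also be used in place of the anonymous function, which reads slightly more directly (same idea, just a stylistic variant):

df %>%
  group_by(set1, set2) %>%
  summarize(across(!y, first),
            y = sum(y))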
I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual) but contain different data. I want to use the identifier to move the data for rows that share an identifier from separate rows into columns. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R, and have tried the reshape (cast) package, but I can't really follow it, and am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
 1   3   1    5    1    6    1    7    1
 2   1   2   12    2    5    2
 3   3   1
UPDATE 2:
So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast (x, ID ~ time, value.var = "value")
and here is the output using that code:
ID  A  B  C  D  E  F  G  H    (for clarity I have labelled these)
 1  3  5  6  7  1  1  1  1
 2  1 12  5  2  2  2
 3  3  1
So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example, column D has the value '7' for person 1 (Age4) but '2' for person 2 (Sex1). I can see that my code is not specifying where the numerical values should be cast to, but I do not know how to code that part. Any ideas?
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups:   ID [3]
#>      ID  Age1  Age2  Age3  Age4  Sex1  Sex2  Sex3  Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     1     3     5     6     7     1     1     1     1
#> 2     2     1    12     5           2     2     2
#> 3     3     3                       1
If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I have recently come across a similar issue in my data, and wanted to provide an update using the tidyr 1.0 functions, as gather and spread have been retired. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr, so I hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA
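For completeness, the same wide shape can also be produced with a single pivot_wider() call by passing both Age and Sex to values_from; names_vary = "slowest" interleaves the columns as Age1, Sex1, Age2, Sex2, ... This needs a reasonably recent tidyr (names_vary was added in 1.2.0), so treat it as a sketch:

library(dplyr)
library(tidyr)

x %>%
  group_by(ID) %>%
  mutate(grp = row_number()) %>%   # running index of each person's rows
  ungroup() %>%
  pivot_wider(names_from = grp,
              values_from = c(Age, Sex),
              names_sep = "",
              names_vary = "slowest")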