I have a longitudinal dataset, where the same subjects are measured on different occasions in time.
For instance:
dd=data.frame(subject_id=c(1,1,1,2,2,2,3,3,4,5,6,7,8,8,9,9),income=c(rnorm(16,50000,250)))
I need to write something that tells me how many subjects have been observed only once, twice, three times, and so on. In the example above, the number of subjects measured on only one occasion in time is 4, the number of subjects measured twice is 3, and so on.
Here is my attempt at counting, for instance, how many subjects have been measured exactly twice:
library(dplyr)
s.two=dd %>% group_by(subject_id) %>% filter(n() == 2) %>% ungroup()
length(s.two$subject_id)/2
But since I have very heterogeneous clusters (ranging from 1 to 24 observations per subject), this would mean writing plenty of such lines. Is there something more efficient I can do?
The objective is to have a summary of the cluster sizes (where the cluster is the subject_id). For instance, let's say I have 1,000 clusters. I want to know how many of them are made up of subjects observed just once, twice, and so on: for example, 50 out of 1,000 clusters are subjects observed on just one occasion in time; 300 out of 1,000 clusters are subjects observed on just two occasions in time...
With this info, I will construct a table to add to my report.
You should use summarize. After this you can still filter with filter(n == 2).
library(dplyr)
dd <- data.frame(
subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9),
income = c(rnorm(16, 50000, 250))
)
dd |>
group_by(subject_id) |>
summarise(n = n())
#> # A tibble: 9 × 2
#> subject_id n
#> <dbl> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 2
#> 4 4 1
#> 5 5 1
#> 6 6 1
#> 7 7 1
#> 8 8 2
#> 9 9 2
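To illustrate the filter(n == 2) mentioned above, you can filter that summary directly (a small sketch; the counts do not depend on the simulated income values):
dd |>
  group_by(subject_id) |>
  summarise(n = n()) |>
  filter(n == 2)
#> # A tibble: 3 × 2
#>   subject_id     n
#>        <dbl> <int>
#> 1          3     2
#> 2          8     2
#> 3          9     2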
If you instead use mutate() and then filter(), you will get:
dd |>
group_by(subject_id) |>
mutate(n = n()) |>
filter(n == 2)
subject_id income n
<dbl> <dbl> <int>
1 3 49675. 2
2 3 50306. 2
3 8 49879. 2
4 8 50202. 2
5 9 49783. 2
6 9 49834. 2
Edit: Maybe you mean this:
dd |>
group_by(subject_id) |>
summarise(n = n()) |>
mutate(info = glue::glue(
'There are {n} times {subject_id} out of {max(subject_id)} groups')) |>
select(info)
# A tibble: 9 × 1
info
<glue>
1 There are 3 times 1 out of 9 groups
2 There are 3 times 2 out of 9 groups
3 There are 2 times 3 out of 9 groups
4 There are 1 times 4 out of 9 groups
5 There are 1 times 5 out of 9 groups
6 There are 1 times 6 out of 9 groups
7 There are 1 times 7 out of 9 groups
8 There are 2 times 8 out of 9 groups
9 There are 2 times 9 out of 9 groups
Next, here is what would be @Ritchie Sacramento's solution:
dd |>
group_by(subject_id) |>
summarise(no_of_occurences = n()) |>
count(no_of_occurences)
# A tibble: 3 × 2
no_of_occurences n
<int> <int>
1 1 4
2 2 3
3 3 2
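For completeness, base R produces the same counts-of-counts in one line (assuming dd as defined above):
table(table(dd$subject_id))
#
# 1 2 3
# 4 3 2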
I'm trying to randomize the order for the receipt of 6 drinks (each on a different day) for 40 participants. I want to ensure that every participant gets each drink once, and that every drink has roughly the same number of occurrences across participants on each day.
I create the data, with participants in columns and days in rows.
library(ggplot2)
set.seed(123)
random_order <- as.data.frame(replicate(40, sample(1:6, 6,
replace = F)))
random_order$trial <- c(1:6)
random_order
Then I check the number of occurrences of each drink within each row / trial, which shows that the frequency of different drinks within trials is not uniform:
tidyr::pivot_longer(random_order, cols = c(1:40),
names_to = "participant", values_to = "drink_order") |>
dplyr::group_by(trial, drink_order) |>
dplyr::summarise(count = dplyr::n())
# # A tibble: 36 × 3
# # Groups: trial [6]
# trial drink_order count
# <int> <int> <int>
# 1 1 1 9
# 2 1 2 8
# 3 1 3 8
# 4 1 4 4
# 5 1 5 5
# 6 1 6 6
# 7 2 1 7
# 8 2 2 4
# 9 2 3 10
# 10 2 4 7
# # … with 26 more rows
and look at it with a density plot:
tidyr::pivot_longer(random_order, cols = c(1:40),
names_to = "participant", values_to = "drink_order") |>
dplyr::group_by(trial, drink_order) |>
dplyr::summarise(count = dplyr::n()) |>
ggplot(aes(count)) +
geom_density()
Basically, I want to have a very thin normal curve. How can I make it so that the count column above has a small range when creating the data?
Thanks!
You’re looking for a variation on a Latin square, which is a set of ordered elements such that each element occurs exactly once per column and once per row. You can generate random Latin squares using agricolae::design.lsd(). In your case, instead of once per row, you want each element to occur the same number of times per row, which you can do by binding together multiple Latin squares.
library(agricolae)
set.seed(123)
# to get 40 columns, first get 7 Latin squares
# (7 squares x 6 columns per square = 42 columns)
orders <- replicate(
7,
design.lsd(1:6)$sketch,
simplify = FALSE
)
# then column-bind and subset to 40 columns
random_order <- data.frame(do.call(cbind, orders))[, 1:40]
random_order$trial <- c(1:6)
Using the code from your question, we can see that all trials include 6 or 7 of each drink:
# A tibble: 36 × 3
# Groups: trial [6]
trial drink_order count
<int> <chr> <int>
1 1 1 7
2 1 2 7
3 1 3 7
4 1 4 6
5 1 5 6
6 1 6 7
7 2 1 7
8 2 2 6
9 2 3 6
10 2 4 7
# … with 26 more rows
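As a quick sanity check of the per-participant constraint (a sketch, assuming the random_order object built above, with drinks coded 1 to 6), every participant column should contain each drink exactly once:
# each of the 40 participant columns should be a permutation of the 6 drinks
all(apply(random_order[, 1:40], 2, function(x) setequal(as.integer(x), 1:6)))
# [1] TRUE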
I have a dataset with financial data. Sometimes, a product gets refunded, resulting in a negative count of the product (so the money gets returned). I want to conditionally filter these rows out of the dataset.
Example:
library(tidyverse)
set.seed(1)
df <- tibble(
count = sample(c(-1,1),80,replace = TRUE,prob=c(.2,.8)),
id = rep(1:4,20)
)
df %>%
group_by(id) %>%
summarize(total = sum(count))
# A tibble: 4 x 2
id total
<int> <dbl>
1 1 10
2 2 14
3 3 16
4 4 10
id = 1 has 15 positive counts and 5 negatives (15 - 5 = 10). I want to keep 10 rows in df with id = 1, all with positive values.
id = 2 has 17 positive counts and 3 negatives (17 - 3 = 14). I want to keep 14 rows in df with id = 2, all with positive values.
In the end, this condition should be TRUE: nrow(df) == sum(df$count).
Unfortunately, a filtering join such as anti_join() will remove all the rows. For some reason I cannot think of another option to filter the tibble.
Thanks for helping me!
You can "uncount" using the total column to get the number of repeats of each row.
df %>%
group_by(id) %>%
summarize(total = sum(count)) %>%
uncount(total) %>%
mutate(count = 1)
#> # A tibble: 50 x 2
#> id count
#> <int> <dbl>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> # ... with 40 more rows
Created on 2022-10-21 with reprex v2.0.2
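Storing the result (df_kept is just a hypothetical name) lets you confirm the condition from the question:
df_kept <- df %>%
  group_by(id) %>%
  summarize(total = sum(count)) %>%
  uncount(total) %>%
  mutate(count = 1)

# one row remains per net unit of count, so the condition holds
nrow(df_kept) == sum(df$count)
#> [1] TRUE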
I have data on hospital admissions per patient. I am trying to add up the price of care for patients who were re-admitted to hospital within 5 days.
This is an example dataset:
(
dt <- data.frame(
id = c(1, 1, 2, 2, 3, 4),
admit_date = c(1, 9, 5, 9, 10, 20),
price = c(10, 20, 20, 30, 15, 16)
)
)
# id admit_date price
# 1 1 1 10
# 2 1 9 20
# 3 2 5 20
# 4 2 9 30
# 5 3 10 15
# 6 4 20 16
And this is what I have tried so far:
library(dplyr)
# 5-day readmission:
dt %>%
group_by(id) %>%
arrange(id, admit_date)%>%
mutate(
duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)
) %>%
group_by(id, readmit) %>% # this is where I get stuck
summarize(sumprice = sum(price))
# # A tibble: 6 × 3
# # Groups: id [4]
# id readmit sumprice
# <dbl> <dbl> <dbl>
# 1 1 0 20
# 2 1 NA 10
# 3 2 1 30
# 4 2 NA 20
# 5 3 NA 15
# 6 4 NA 16
And this is what I would like to have:
# id sum_price
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
If the difference in days between adjacent visits is greater than 5, the comparison returns TRUE; if not, FALSE (lag()'s default is set to Inf, so the first visit's difference is -Inf, and -Inf > 5 is FALSE). After that, for each individual we take a cumulative sum of these values to label the admission groups. We finally summarize within each individual, using this cumsum as the grouping variable for by():
dt |>
group_by(id) |>
arrange(id, admit_date) |>
summarise(
sum_price = by(
price,
cumsum((admit_date - lag(admit_date, default = Inf)) > 5),
sum
)
) |>
ungroup()
# # A tibble: 5 × 2
# id sum_price
# <dbl> <by>
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
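If you prefer sum_price as a plain numeric column rather than a <by> column, here is a sketch of the same episode logic with an explicit grouping variable (episode is a name introduced here for illustration):
dt |>
  arrange(id, admit_date) |>
  group_by(id) |>
  # start a new episode whenever the gap to the previous admission exceeds 5 days
  mutate(episode = cumsum(coalesce(admit_date - lag(admit_date), Inf) > 5)) |>
  group_by(id, episode) |>
  summarise(sum_price = sum(price), .groups = "drop") |>
  select(-episode)
This returns the same five rows as above, with sum_price as a regular <dbl> column.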
You want (at most) one row per patient in the final data frame, so you should group on just id.
Then, for each patient, calculate whether that patient has any row with readmit == 1.
Finally, you filter out any patient that wasn't readmitted from your summarized dataframe.
Putting it all together, it might look like:
dt %>%
group_by(id) %>%
arrange(id, admit_date) %>%
mutate(duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)) %>%
group_by(id) %>% # group by just 'id' to get one row per patient
summarize(sumprice = sum(price, na.rm = T),
is_readmit = any(readmit == 1)) %>% # If patient has any 'readmit' rows, count the patient as a readmit patient
filter(is_readmit) %>% # Filter out any non-readmit patients
select(-is_readmit) # get rid of the `is_readmit` column
Which should result in:
# A tibble: 1 x 2
     id sumprice
  <dbl>    <dbl>
1     2       50
I am trying to aggregate a dataset of 12,000 obs. with 37 variables, in which I want to group by 2 variables and sum 1 variable.
All other variables must remain, as these contain important information for later analysis.
Most remaining variables contain the same value within a group; from the others I want to select the first value.
To get a better feeling for what is happening, I created a small random test dataset (10 obs., 5 variables).
row <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
set1 <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3)
set2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)
set3 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(row, y, set1, set2, set3)
df
row y set1 set2 set3
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 2
4 4 2 1 2 2
5 5 2 1 2 3
6 6 2 1 2 3
7 7 3 2 1 4
8 8 3 2 1 4
9 9 3 2 2 5
10 10 4 3 1 5
I want to aggregate the data based on set1 and set2, getting sum(y)-values, whilst keeping the other columns (here row and set3) by selecting the first value within the remaining columns, resulting in the following aggregated dataframe (or tibble):
# row y set1 set2 set3
# 1 3 1 1 1
# 4 6 1 2 2
# 7 6 2 1 4
# 9 3 2 2 5
# 10 4 3 1 5
I have checked other questions for a possible solution, but have not been able to solve mine.
The most important questions and websites I have looked into and tried are:
Combine rows and sum their values
https://community.rstudio.com/t/combine-rows-and-sum-values/41963
https://datascienceplus.com/aggregate-data-frame-r/
R: How to aggregate some columns while keeping other columns
Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?
I have figured out that using summarise in dplyr always results in removal of remaining variables.
I thought I had found a solution with "R: How to aggregate some columns while keeping other columns", as reproducing that example gave satisfying results.
Using
library(dplyr)
df_aggr1 <-
df %>%
group_by(set1, set2) %>%
slice(which.max(y))
resulted in:
# A tibble: 5 x 5
# Groups: set1, set2 [5]
row y set1 set2 set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 4 2 1 2 2
3 7 3 2 1 4
4 9 3 2 2 5
5 10 4 3 1 5
However, using
library(dplyr)
df_aggr2 <-
df %>%
group_by(set1, set2) %>%
slice(sum(y))
resulted in:
# A tibble: 1 x 5
# Groups: set1, set2 [1]
row y set1 set2 set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 1 1 2
Here y apparently is not even summed, so I do not understand what is happening.
What am I missing?
Thanks in advance!
As for what you are missing: slice(sum(y)) does not sum anything; within each group it picks the row at position sum(y), and that position only exists in the first group (sum(y) = 3, which has 3 rows), which is why you got a single, unsummed row back. Aggregating works when literally specifying that you want the first value, i.e.:
library(tidyverse)
df %>%
group_by(set1, set2) %>%
summarize(y = sum(y),
row = row[1],
set3 = set3[1])
# A tibble: 5 x 5
# Groups: set1 [3]
set1 set2 y row set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 3 1 1
2 1 2 6 4 2
3 2 1 6 7 4
4 2 2 3 9 5
5 3 1 4 10 5
Edit: To keep every other column without specifying each one, you can make use of across() and indicate that you want to apply this aggregation to every column except y.
df %>%
group_by(set1, set2) %>%
summarize(
across(!y, ~ .x[1]),
y = sum(y)
)
# A tibble: 5 x 5
# Groups: set1 [3]
set1 set2 row set3 y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 3
2 1 2 4 2 6
3 2 1 7 4 6
4 2 2 9 5 3
5 3 1 10 5 4
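Equivalently (a sketch), first() can replace the [1] indexing, and relocate() restores the column order from your desired output:
df %>%
  group_by(set1, set2) %>%
  summarize(across(!y, first), y = sum(y), .groups = "drop") %>%
  relocate(row, y, set1, set2, set3)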
I need to calculate the Euclidean distance between the first and current row in a dataframe. Each row is keyed by (group, month) and has a list of values. In the toy example below the key is c(month, student) and the values are in c(A, B). I want to create a distance column C, that's equal to sqrt((A_i-A_1)^2 + (B_i - B_1)^2).
So far I managed to spread my data and pull each group's first values into new columns. While I could create the formula by hand in the toy example, in my actual data I have very many columns instead of just 2. I believe I could create the squared differences within the mutate_all, and then do a row sum and take the square root of that, but no luck so far.
df <- data.frame(month=rep(1:3,2),
student=rep(c("Amy", "Bob"), each=3),
A=c(9, 6, 6, 8, 6, 9),
B=c(6, 2, 8, 5, 6, 7))
# Pull in each column's first values for each group
df %>%
group_by(student) %>%
mutate_all(list(first = first)) %>%
# TODO: Calculate the distance, i.e. SQRT(sum_i[(x_i - x_1)^2]).
#Output:
month student A B month_first A_first B_first
1 1 Amy 9 6 1 9 6
2 2 Amy 6 2 1 9 6
...
Desired output:
#Output:
month student A B month_first A_first B_first dist_from_first
1 1 Amy 9 6 1 9 6 0
2 2 Amy 6 2 1 9 6 5
...
Here is another way using compact dplyr code; it can be used for any number of columns (mutate_each() is deprecated in current dplyr, so across() is used here):
df %>%
  select(-month) %>%
  group_by(student) %>%
  # squared difference from each group's first value, for every non-grouping column
  mutate(across(everything(), ~ (first(.x) - .x)^2)) %>%
  ungroup() %>%
  mutate(euc.dist = sqrt(rowSums(select(., -1))))
# A tibble: 6 x 4
student A B euc.dist
<chr> <dbl> <dbl> <dbl>
1 Amy 0 0 0
2 Amy 9 16 5
3 Amy 9 4 3.61
4 Bob 0 0 0
5 Bob 4 1 2.24
6 Bob 1 4 2.24
Edit: added alternative formulation using a join. I expect that approach will be much faster for a very wide data frame with many columns to compare.
Approach 1: To get the euclidean distance for a large number of columns, one way is to rearrange the data so each row shows one month, one student, and one original column (e.g. A or B in the OP), plus two columns holding the current value and the first value. Then we can square the differences, group by student and month, and take the square root of the sum (labeled RMS in the code below) for each student-month.
library(tidyverse)
df %>%
group_by(student) %>%
mutate_all(list(first = first)) %>%
ungroup() %>%
# gather into long form; make col show variant, col2 show orig column
gather(col, val, -c(student, month, month_first)) %>%
mutate(col2 = col %>% str_remove("_first")) %>%
mutate(col = if_else(col %>% str_ends("_first"),
"first",
"comparison")) %>%
spread(col, val) %>%
mutate(square_dif = (comparison - first)^2) %>%
group_by(student, month) %>%
summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
Approach 2. Here, a long version of the data is joined to a version that is just the earliest month for each student.
library(tidyverse)
df_long <- gather(df, col, val, -c(month, student))
df_long %>% left_join(df_long %>%
group_by(student) %>%
top_n(-1, wt = month) %>%
rename(first_val = val) %>%
select(-month),
by = c("student", "col")) %>%
mutate(square_dif = (val - first_val)^2) %>%
group_by( student, month) %>%
summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
Instead of the mutate_all call, it'd be easier to directly calculate the dist_from_first. The only thing I'm unclear about is whether month should be included in the group_by() statement.
library(tidyverse)
df <- tibble(month=rep(1:3,2),
student=rep(c("Amy", "Bob"), each=3),
A=c(9, 6, 6, 8, 6, 9),
B=c(6, 2, 8, 5, 6, 7))
df%>%
group_by(student)%>%
mutate(dist_from_first = sqrt((A - first(A))^2 + (B - first(B))^2))%>%
ungroup()
# A tibble: 6 x 5
# month student A B dist_from_first
# <int> <chr> <dbl> <dbl> <dbl>
#1 1 Amy 9 6 0
#2 2 Amy 6 2 5
#3 3 Amy 6 8 3.61
#4 1 Bob 8 5 0
#5 2 Bob 6 6 2.24
#6 3 Bob 9 7 2.24
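For the many-column case in the question, the same group-wise calculation can be written once with across() and rowSums() (a sketch; on wide data, swap c(A, B) for a tidyselect helper such as where(is.numeric) & !month):
df %>%
  group_by(student) %>%
  mutate(dist_from_first = sqrt(rowSums(
    # squared difference of each column from its first value within the group
    across(c(A, B), ~ (.x - first(.x))^2)
  ))) %>%
  ungroup()
This gives the same dist_from_first values as above.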