Hi everyone, I have a data frame like this:
value count
<dbl> <dbl>
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
I would like to group my observations into intervals. The first and last intervals must be open-ended so that they also capture any observations that fall outside the range (for example, with an interval width of 2):
interval count
<???> <dbl>
1 [<1, 2] 30
2 [3, 4] 70
3 [5, >6] 110
Is it possible to do this with dplyr?
You can use cut() to create a grouping variable with which to summarise count.
library(dplyr)
df %>%
group_by(grp = cut(value, c(-Inf, 2, 4, Inf))) %>%
summarise(count = sum(count))
# A tibble: 3 x 2
grp count
<fct> <int>
1 (-Inf,2] 30
2 (2,4] 70
3 (4, Inf] 110
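If the interval labels should read more like the ones in the question, cut() also takes a labels argument. Below is a minimal sketch, assuming the data frame from the question (rebuilt here as df so the example is self-contained); the label strings are chosen purely for illustration and assume integer values.

```r
library(dplyr)

# Data from the question, rebuilt so the example runs on its own
df <- tibble(value = 1:6, count = seq(10, 60, by = 10))

res <- df %>%
  group_by(interval = cut(value,
                          breaks = c(-Inf, 2, 4, Inf),
                          labels = c("<=2", "3-4", ">=5"))) %>%  # illustrative labels
  summarise(count = sum(count))

res
```

The breaks are unchanged, so the grouping is the same as above; only the factor levels are renamed.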
I have a longitudinal dataset, where the same subjects are measured on different occasions in time.
For instance:
dd=data.frame(subject_id=c(1,1,1,2,2,2,3,3,4,5,6,7,8,8,9,9),income=c(rnorm(16,50000,250)))
I need to write something that tells me how many subjects have been measured only once, twice, three times, and so on. In the example above, 4 subjects were measured on only one occasion, 3 subjects were measured twice, and so on.
This is my attempt at counting, for instance, how many subjects have been measured exactly twice:
library(dplyr)
s.two=dd %>% group_by(subject_id) %>% filter(n() == 2) %>% ungroup()
length(s.two$subject_id)/2
But since my clusters are very heterogeneous (ranging from 1 to 24 observations per subject), this would mean writing plenty of lines. Is there something more efficient I can do?
The objective is to have a summary of the cluster sizes (where a cluster is a subject_id). For instance, say I have 1,000 clusters. I want to know how many of them consist of subjects observed just once, twice, and so on. For example, 50 out of 1,000 clusters consist of subjects observed on just one occasion; 300 out of 1,000 clusters consist of subjects observed on just two occasions, and so on.
With this information, I will construct a table to include in my report.
You should use summarize. After this you can still filter with filter(n == 2).
library(dplyr)
dd <- data.frame(
subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9),
income = c(rnorm(16, 50000, 250))
)
dd |>
group_by(subject_id) |>
summarise(n = n())
#> # A tibble: 9 × 2
#> subject_id n
#> <dbl> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 2
#> 4 4 1
#> 5 5 1
#> 6 6 1
#> 7 7 1
#> 8 8 2
#> 9 9 2
If you use mutate instead of summarise and then filter, you keep the original rows:
dd |>
group_by(subject_id) |>
mutate(n = n()) |>
filter(n == 2)
subject_id income n
<dbl> <dbl> <int>
1 3 49675. 2
2 3 50306. 2
3 8 49879. 2
4 8 50202. 2
5 9 49783. 2
6 9 49834. 2
NEW EDIT
Maybe you mean this:
dd |>
group_by(subject_id) |>
summarise(n = n()) |>
mutate(info = glue::glue(
'There are {n} times {subject_id} out of {n_distinct(subject_id)} groups')) |>
select(info)
# A tibble: 9 × 1
info
<glue>
1 There are 3 times 1 out of 9 groups
2 There are 3 times 2 out of 9 groups
3 There are 2 times 3 out of 9 groups
4 There are 1 times 4 out of 9 groups
5 There are 1 times 5 out of 9 groups
6 There are 1 times 6 out of 9 groups
7 There are 1 times 7 out of 9 groups
8 There are 2 times 8 out of 9 groups
9 There are 2 times 9 out of 9 groups
Next, this would be @Ritchie Sacramento's solution:
dd |>
group_by(subject_id) |>
summarise(no_of_occurences = n()) |>
count(no_of_occurences)
# A tibble: 3 × 2
no_of_occurences n
<int> <int>
1 1 4
2 2 3
3 3 2
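For the report table described in the question ("50 out of 1,000 clusters..."), the same two-step count can be extended with a share column. This is only a sketch: the column names n_obs, n_subjects, and share are my own choices, not something from the question.

```r
library(dplyr)

dd <- data.frame(
  subject_id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 9),
  income = rnorm(16, 50000, 250)
)

cluster_sizes <- dd |>
  count(subject_id, name = "n_obs") |>           # observations per subject
  count(n_obs, name = "n_subjects") |>           # subjects per cluster size
  mutate(share = n_subjects / sum(n_subjects))   # fraction of all clusters

cluster_sizes
```

count(x, name = ...) is shorthand for group_by(x) followed by summarise(n = n()), so this should give the same counts as the solution above, plus the proportion of clusters at each size.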
I have data on hospital admissions per patient. I am trying to add up the price of care for patients that were re-admitted to hospital within 5 days.
This is an example dataset:
(
dt <- data.frame(
id = c(1, 1, 2, 2, 3, 4),
admit_date = c(1, 9, 5, 9, 10, 20),
price = c(10, 20, 20, 30, 15, 16)
)
)
# id admit_date price
# 1 1 1 10
# 2 1 9 20
# 3 2 5 20
# 4 2 9 30
# 5 3 10 15
# 6 4 20 16
And this is what I have tried so far:
library(dplyr)
# 5-day readmission:
dt %>%
group_by(id) %>%
arrange(id, admit_date)%>%
mutate(
duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)
) %>%
group_by(id, readmit) %>% # this is where i get stuck
summarize(sumprice = sum(price))
# # A tibble: 6 × 3
# # Groups: id [4]
# id readmit sumprice
# <dbl> <dbl> <dbl>
# 1 1 0 20
# 2 1 NA 10
# 3 2 1 30
# 4 2 NA 20
# 5 3 NA 15
# 6 4 NA 16
And this is what I would like to have:
# id sum_price
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
For each patient, check whether the gap in days between adjacent visits is greater than 5: TRUE if so, FALSE otherwise. The first visit has no predecessor, so lag()'s default is set to Inf; the difference is then -Inf, and -Inf > 5 is FALSE. Taking the cumulative sum of these flags within each patient labels the groups of visits. Finally, we summarise within each patient, using this cumsum as the grouping variable for by():
dt |>
group_by(id) |>
arrange(id, admit_date) |>
summarise(
sum_price = by(
price,
cumsum((admit_date - lag(admit_date, default = Inf)) > 5),
sum
)
) |>
ungroup()
# # A tibble: 5 × 2
# id sum_price
# <dbl> <by>
# 1 1 10
# 2 1 20
# 3 2 50
# 4 3 15
# 5 4 16
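The cumulative-sum labelling can also be kept as an explicit column, which may be easier to inspect than the by() call. Here is a rough sketch of that variant; the column name episode is hypothetical, and the second group_by() replaces by() rather than reproducing the code above exactly.

```r
library(dplyr)

dt <- data.frame(
  id = c(1, 1, 2, 2, 3, 4),
  admit_date = c(1, 9, 5, 9, 10, 20),
  price = c(10, 20, 20, 30, 15, 16)
)

episodes <- dt %>%
  arrange(id, admit_date) %>%
  group_by(id) %>%
  # A gap of more than 5 days starts a new episode; the first visit
  # compares against Inf, giving -Inf > 5, i.e. FALSE.
  mutate(episode = cumsum(admit_date - lag(admit_date, default = Inf) > 5)) %>%
  group_by(id, episode) %>%
  summarise(sum_price = sum(price), .groups = "drop")

episodes
```

Visits that share an id and an episode number are summed together, so a readmission within 5 days is merged with the visit before it.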
You want (at most) one row per patient in the final data frame, so you should group on just id.
Then, for each patient, you should calculate whether that patient has any row with readmit == 1.
Finally, you filter out any patient that wasn't readmitted from the summarized data frame.
Putting it all together, it might look like:
dt %>%
group_by(id) %>%
arrange(id, admit_date) %>%
mutate(duration = admit_date - lag(admit_date),
readmit = ifelse(duration < 6, 1, 0)) %>%
group_by(id) %>% # group by just 'id' to get one row per patient
summarize(sumprice = sum(price, na.rm = TRUE),
is_readmit = any(readmit == 1, na.rm = TRUE)) %>% # If patient has any 'readmit' rows, count the patient as a readmit patient
filter(is_readmit) %>% # Filter out any non-readmit patients
select(-is_readmit) # get rid of the `is_readmit` column
Which should result in:
# A tibble: 1 x 2
id sumprice
<dbl> <dbl>
1 2 50
In my attempt to learn dplyr, I want to divide each row by another row that holds the corresponding group's total.
I generated test data with
library(dplyr)
# building test data
data("OrchardSprays")
totals <- OrchardSprays %>% group_by(treatment) %>%
summarise(decrease = sum(decrease))
totals$decrease <- totals$decrease + seq(10, 80, 10)
totals$rowpos = totals$colpos <- "total"
df <- rbind(OrchardSprays, totals)
Note the line totals$decrease <- totals$decrease + seq(10, 80, 10): for the sake of the question, I assumed there was an additional decrease for each treatment, which was not observed in the single lines of the data frame but only in the "total" lines for each group.
What I now want to do is add another column, treatment_decrease, to the data frame, where each row's decrease value is divided by the corresponding treatment group's total decrease value.
So, for head(df) I would expect an output like this
> head(df)
decrease rowpos colpos treatment treatment_decrease
1 57 1 1 D 0.178125
2 95 2 1 E 0.1711712
3 8 3 1 B 0.09876543
4 69 4 1 H 0.08603491
5 92 5 1 G 0.1488673
6 90 6 1 F 0.1470588
My real world example is a bit more complex (more group variables and also more levels), therefore I am looking for a suitable solution in dplyr.
Here's a pure dplyr approach:
library(dplyr) #version >= 1.0.0
OrchardSprays %>%
group_by(treatment) %>%
summarise(decrease = sum(decrease)) %>%
mutate(decrease = decrease + seq(10, 80, 10),
rowpos = "total",
colpos = "total") %>%
bind_rows(mutate(OrchardSprays, across(rowpos:colpos, as.character))) %>%
group_by(treatment) %>%
mutate(treatment_decrease = decrease / decrease[rowpos == "total"])
# A tibble: 72 x 5
# Groups: treatment [8]
treatment decrease rowpos colpos treatment_decrease
<fct> <dbl> <chr> <chr> <dbl>
1 A 47 total total 1
2 B 81 total total 1
3 C 232 total total 1
4 D 320 total total 1
5 E 555 total total 1
6 F 612 total total 1
7 G 618 total total 1
8 H 802 total total 1
9 D 57 1 1 0.178
10 E 95 2 1 0.171
# … with 62 more rows
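If the totals are easier to keep in a separate lookup table than as extra "total" rows, a join is another option, and it extends to several grouping variables by listing them in by. This is a sketch under that assumption, not the approach above; the column name total_decrease is made up for illustration.

```r
library(dplyr)

data("OrchardSprays")

# One row per treatment, including the extra unobserved decrease from the question
totals <- OrchardSprays %>%
  group_by(treatment) %>%
  summarise(total_decrease = sum(decrease)) %>%
  mutate(total_decrease = total_decrease + seq(10, 80, 10))

shares <- OrchardSprays %>%
  left_join(totals, by = "treatment") %>%
  mutate(treatment_decrease = decrease / total_decrease)

head(shares)
```

Unlike the bind_rows() version, this leaves rowpos and colpos numeric and adds no "total" rows to the result.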
I want to create a cumulative sum by id. However, it should not include the value of the row for which it is being calculated.
I've already tried cumsum, but I do not know how to exclude the amount of the current row from the sum. The result I am looking for is the third column, called "sum".
For example, for id 1, the first row has sum = 0, because the row's own amount should not be added. For id 1 and row 2, sum = 100, because the total amount for id 1 before row 2 was 100, and so on.
id amount sum
1: 1 100 0
2: 1 20 100
3: 1 150 120
4: 2 60 0
5: 2 100 60
6: 1 30 270
7: 2 40 160
This is what I've tried:
df[, sum := cumsum(amount), by = "id"]
Data:
df <- data.table(
id = c(1, 1, 1, 2, 2, 1, 2),
amount = c(100, 20, 150, 60, 100, 30, 40),
sum = c(0, 100, 120, 0, 60, 270, 160),
stringsAsFactors = FALSE
)
You can do this without using lag:
df %>%
group_by(id) %>%
mutate(sum = cumsum(amount) - amount)
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
#1 1 100 0
#2 1 20 100
#3 1 150 120
#4 2 60 0
#5 2 100 60
#6 1 30 270
#7 2 40 160
With dplyr -
df %>%
group_by(id) %>%
mutate(sum = lag(cumsum(amount), default = 0)) %>%
ungroup()
# A tibble: 7 x 3
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
Thanks to @thelatemail, here's the data.table version -
df[, sum := cumsum(shift(amount, fill=0)), by=id]
Here is an option in base R
df$Sum <- with(df, ave(amount, id, FUN = cumsum) - amount)
df$Sum
#[1] 0 100 120 0 60 270 160
Or drop the last observation in each group, take the cumsum, and prepend 0:
with(df, ave(amount, id, FUN = function(x) c(0, cumsum(x[-length(x)]))))
You can shift the values you're summing by using the lag function.
library(tidyverse)
df <- data.frame(
id = c(1, 1, 1, 2, 2, 1, 2),
amount = c(100, 20, 150, 60, 100, 30, 40),
sum = c(0, 100, 120, 0, 60, 270, 160),
stringsAsFactors = FALSE
)
df %>%
group_by(id) %>%
mutate(sum = cumsum(lag(amount, 1, default=0)))
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
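As a quick sanity check, the three dplyr formulations from the answers above can be computed side by side on the sample data. This is just a sketch; the column names by_subtraction, by_lag_after, and by_lag_before are my own labels.

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 1, 2),
  amount = c(100, 20, 150, 60, 100, 30, 40)
)

res <- df %>%
  group_by(id) %>%
  mutate(
    by_subtraction = cumsum(amount) - amount,            # running total minus own row
    by_lag_after   = lag(cumsum(amount), default = 0),   # shift the running total
    by_lag_before  = cumsum(lag(amount, default = 0))    # shift the input, then sum
  ) %>%
  ungroup()

res
```

On this data all three columns agree with the expected sum column from the question.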
I am trying to come up with a sum for each task in a dataset that only uses the largest value observed for the id once in the sum. If that's not clear I've provided an example of the desired output below.
Sample Data
dat <- data.frame(task = rep(LETTERS[1:3], each=3),
id = c(rep(1:2, 4) , 3),
value = c(rep(c(10,20), 4), 5))
dat
task id value
1 A 1 10
2 A 2 20
3 A 1 10
4 B 2 20
5 B 1 10
6 B 2 20
7 C 1 10
8 C 2 20
9 C 3 5
I've found an answer that works, but it requires two separate group_by() functions. Is there a way to get the same output with a single group_by()? The reason is I have other summarized metrics that are sensitive to the grouping and I can't run two different group_by functions in the same pipeline.
dat %>%
group_by(task, id) %>%
summarize(v = max(value)) %>%
group_by(task) %>%
summarize(unique_ids = n_distinct(id),
value_sum = sum(v))
# A tibble: 3 × 3
task unique_ids value_sum
<chr> <int> <dbl>
1 A 2 30
2 B 2 30
3 C 3 35
I've found something that works using tapply().
dat %>%
group_by(task) %>%
summarize(unique_ids = length(unique(id)),
value_sum = sum(tapply(value, id, FUN = max)))
# A tibble: 3 × 3
task unique_ids value_sum
<chr> <int> <dbl>
1 A 2 30
2 B 2 30
3 C 3 35
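To see what tapply() contributes inside summarize(), it may help to run it by hand on a single task. Here is a small base R sketch for task A, where id 1 appears twice; the variable names task_a and per_id_max are only for illustration.

```r
dat <- data.frame(task = rep(LETTERS[1:3], each = 3),
                  id = c(rep(1:2, 4), 3),
                  value = c(rep(c(10, 20), 4), 5))

# Rows for one task only, as summarize() would see them within a group
task_a <- subset(dat, task == "A")

# One maximum per id, so the repeated id 1 contributes only once
per_id_max <- tapply(task_a$value, task_a$id, FUN = max)
per_id_max

sum(per_id_max)
```

tapply() returns one value per distinct id, which is why a single group_by(task) is enough: the per-id step happens inside the summary expression.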