So far I have this data frame (the result column is what I want to compute):
df <- data.frame(number = c(1,1,1,1,2,2,2,2,3,3,3,3),
value1 = c(5,7,6,9,3,5,6,3,4,5,5,6),
group = c("control", "Treated1", "Treated2", "Treated3","control", "Treated1", "Treated2", "Treated3","control", "Treated1", "Treated2", "Treated3"),
result = c(1,1.4,1.2,1.8,1.0,1.67,2,1,1,1.25,1,1.2))
number value1 group result
1 1 5 control 1.00
2 1 7 Treated1 1.40
3 1 6 Treated2 1.20
4 1 9 Treated3 1.80
5 2 3 control 1.00
6 2 5 Treated1 1.67
7 2 6 Treated2 2.00
8 2 3 Treated3 1.00
9 3 4 control 1.00
10 3 5 Treated1 1.25
11 3 5 Treated2 1.00
12 3 6 Treated3 1.20
I want to group the data by number (and within that by group), and then divide each group's value1 by the control value of the same number group, but I'm struggling to achieve this.
e.g.
Line1: 5/5 = 1.0
Line2: 7/5 = 1.40
Line3: 6/5 = 1.20
Line4: 9/5 = 1.80
Line5: 3/3 = 1.0
I tried something like this (which obviously does not work):
library(dplyr)
df <- df %>%
group_by(number) %>%
mutate(result = value1[group == contains("Treated")] / value1[group == control)
Do you have any ideas?
You can index the value1 where group == "control" and divide all the other value1 values by it.
library(dplyr)
df %>% group_by(number) %>% mutate(result = value1/value1[group == "control"])
Or you can arrange by the group column, so that "control" is always the first value within each number group:
df %>% group_by(number) %>%
arrange(number, group) %>%
mutate(result = value1/first(value1))
Output
# A tibble: 12 × 4
# Groups: number [3]
number value1 group result
<dbl> <dbl> <chr> <dbl>
1 1 5 control 1
2 1 7 Treated1 1.4
3 1 6 Treated2 1.2
4 1 9 Treated3 1.8
5 2 3 control 1
6 2 5 Treated1 1.67
7 2 6 Treated2 2
8 2 3 Treated3 1
9 3 4 control 1
10 3 5 Treated1 1.25
11 3 5 Treated2 1.25
12 3 6 Treated3 1.5
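If a number group could lack a "control" row, value1[group == "control"] would return a zero-length vector and the mutate() would fail. A defensive sketch using match() (my variation, not part of the question) yields NA instead:

library(dplyr)

# match() returns NA when "control" is absent from a number group,
# so result becomes NA instead of erroring out
df %>%
  group_by(number) %>%
  mutate(result = value1 / value1[match("control", group)])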
Consider the following data:
df <- structure(list(date = structure(c(10904, 10613, 10801, 10849,
10740, 10680, 10780, 10909, 10750, 10814), class = "Date"), group = c(1L,
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-10L))
which gives:
date group
1 1999-11-09 1
2 1999-01-22 2
3 1999-07-29 1
4 1999-09-15 2
5 1999-05-29 1
6 1999-03-30 1
7 1999-07-08 1
8 1999-11-14 2
9 1999-06-08 2
10 1999-08-11 2
I now want to calculate
a) the months that have passed between two neighbouring dates per group (I know how to do this)
b) flag rows once a certain time period (3 months) has passed; when it has, reset and again watch for 3 months to pass from that date.
So for a) I'm doing this:
library(tidyverse)
library(lubridate)
df %>%
group_by(group) %>%
arrange(group, date) %>%
mutate(months_passed = time_length(interval(lag(date), date), "months"))
which gives:
# A tibble: 10 x 3
# Groups: group [2]
date group months_passed
<date> <int> <dbl>
1 1999-03-30 1 NA
2 1999-05-29 1 1.97
3 1999-07-08 1 1.3
4 1999-07-29 1 0.677
5 1999-11-09 1 3.35
6 1999-01-22 2 NA
7 1999-06-08 2 4.55
8 1999-08-11 2 2.10
9 1999-09-15 2 1.13
10 1999-11-14 2 1.97
But for b) I'm lost. What I want to do is:
Look at each group separately.
Calculate the months_passed between row 1 and row 2 (here: 1.97 months for group 1)
If it is < 3 months, continue with the next row and add the current time difference to the months already accumulated (here: 1.97 + 1.3 months).
Now that the cumulative difference is >= 3 months, I want to flag row 3.
Then I reset the cumulative time difference and start accumulating again from the next row (here: 0.677 months), and so on.
Expected outcome would be:
# A tibble: 10 x 4
# Groups: group [2]
date group months_passed time_flag
<date> <int> <dbl> <int>
1 1999-03-30 1 NA 0
2 1999-05-29 1 1.97 0
3 1999-07-08 1 1.3 1
4 1999-07-29 1 0.677 0
5 1999-11-09 1 3.35 1
6 1999-01-22 2 NA 0
7 1999-06-08 2 4.55 1
8 1999-08-11 2 2.10 0
9 1999-09-15 2 1.13 1
10 1999-11-14 2 1.97 0
Any ideas?
You can write a helper function:
assign_1 <- function(x) {
  y <- numeric(length(x))
  sum <- 0
  for (i in seq_along(x)) {
    sum <- sum + x[i]   # accumulate the months passed
    if (sum >= 3) {
      y[i] <- 1         # flag this row once 3 months have accumulated
      sum <- 0          # and reset the running total
    }
  }
  y
}
and use it in your already existing pipe:
library(dplyr)
library(lubridate)
df %>%
group_by(group) %>%
arrange(group, date) %>%
mutate(months_passed = time_length(interval(lag(date, default = first(date)),
date), "months"),
time_flag = assign_1(months_passed)) %>%
ungroup
# date group months_passed time_flag
# <date> <int> <dbl> <dbl>
# 1 1999-03-30 1 0 0
# 2 1999-05-29 1 1.97 0
# 3 1999-07-08 1 1.3 1
# 4 1999-07-29 1 0.677 0
# 5 1999-11-09 1 3.35 1
# 6 1999-01-22 2 0 0
# 7 1999-06-08 2 4.55 1
# 8 1999-08-11 2 2.10 0
# 9 1999-09-15 2 1.13 1
#10 1999-11-14 2 1.97 0
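As a quick sanity check, you can call the helper directly on group 1's gaps (with the leading NA replaced by 0, which is what lag(date, default = first(date)) produces):

assign_1(c(0, 1.97, 1.3, 0.677, 3.35))
# [1] 0 0 1 0 1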
By the way, with some better searching (now that I knew which terms to search for), I found an approach that works without any additional helper function, based on:
dplyr / R cumulative sum with reset
df %>%
group_by(group) %>%
arrange(group, date) %>%
mutate(months_passed = time_length(interval(lag(date), date), "months"),
months_passed = if_else(is.na(months_passed), 0, months_passed),
time_flag = if_else(accumulate(months_passed, ~if_else(.x >= 3, .y, .x + .y)) >= 3, 1, 0))
# A tibble: 10 x 4
# Groups: group [2]
date group months_passed time_flag
<date> <int> <dbl> <dbl>
1 1999-03-30 1 0 0
2 1999-05-29 1 1.97 0
3 1999-07-08 1 1.3 1
4 1999-07-29 1 0.677 0
5 1999-11-09 1 3.35 1
6 1999-01-22 2 0 0
7 1999-06-08 2 4.55 1
8 1999-08-11 2 2.10 0
9 1999-09-15 2 1.13 1
10 1999-11-14 2 1.97 0
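For intuition, the accumulate() call carries a running sum of the gaps and restarts it once it reaches 3. A minimal sketch on group 1's gaps:

library(purrr)

gaps <- c(0, 1.97, 1.3, 0.677, 3.35)
accumulate(gaps, ~ if (.x >= 3) .y else .x + .y)
# [1] 0.000 1.970 3.270 0.677 4.027
# the values >= 3 are exactly the rows flagged with 1 above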
I observe 12 responses from each of 2 survey participants.
data <- data.frame(
  id = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
  response = c(2,2,3,3,6,3,6,7,3,1,4,3,3,3,6,4,2,6,7,3,2,1,5,6)
)
data
id response
1 1 2
2 1 2
3 1 3
4 1 3
5 1 6
6 1 3
7 1 6
8 1 7
9 1 3
10 1 1
11 1 4
12 1 3
13 2 3
14 2 3
15 2 6
16 2 4
17 2 2
18 2 6
19 2 7
20 2 3
21 2 2
22 2 1
23 2 5
24 2 6
Now I want to add 2 things to the data of each survey participant:
a) The most frequent value of this survey participant
b) the relative frequency of the most frequent value
How can I add these things using dplyr:
data %>%
group_by(id) %>%
mutate(most_frequent_value = ?,
relative_frequency_of_most_frequent_value = ?)
I'd probably use a two-step solution. First, create a data.frame of frequencies/relative frequencies. Then join to it. We use slice(which.max()) because it always returns one row per group, whereas slice_max() may return multiple rows on ties.
library(tidyverse)
# count by id, response, calculate rel frequency
# rename columns to make inner_join easier
freq_table <- data %>%
count(id, response) %>%
group_by(id) %>%
mutate(rel_freq = n / sum(n)) %>%
select(id, most_frequent_response = response, rel_freq)
# inner join to sliced freq_table (grouping by id is preserved)
data %>%
inner_join(freq_table %>% slice(which.max(rel_freq)))
# id response most_frequent_response rel_freq
# 1 1 2 3 0.4166667
# 2 1 2 3 0.4166667
# 3 1 3 3 0.4166667
# 4 1 3 3 0.4166667
# 5 1 6 3 0.4166667
# ...
You could try:
table(data$id, data$response) %>%
as.data.frame() %>%
setNames(c("id", "response", "n")) %>%
group_by(id) %>%
slice_max(n, n = 1) %>%
group_by(response) %>%
filter(n() > 1) %>%
mutate(ratio = n / sum(n))
#> # A tibble: 2 x 4
#> # Groups: response [1]
#> id response n ratio
#> <fct> <fct> <int> <dbl>
#> 1 1 3 5 0.625
#> 2 2 3 3 0.375
Does this work:
data %>% group_by(id, response) %>% mutate(n = n()) %>%
ungroup() %>% group_by(id) %>%
mutate(most_frequent_value = response[n == max(n)][1],
relative_frequency_of_most_frequent_value = max(n)/n())
# A tibble: 24 x 5
# Groups: id [2]
id response n most_frequent_value relative_frequency_of_most_frequent_value
<dbl> <dbl> <int> <dbl> <dbl>
1 1 2 2 3 0.417
2 1 2 2 3 0.417
3 1 3 5 3 0.417
4 1 3 5 3 0.417
5 1 6 2 3 0.417
6 1 3 5 3 0.417
7 1 6 2 3 0.417
8 1 7 1 3 0.417
9 1 3 5 3 0.417
10 1 1 1 3 0.417
11 1 4 1 3 0.417
12 1 3 5 3 0.417
13 2 3 3 3 0.25
14 2 3 3 3 0.25
15 2 6 3 3 0.25
16 2 4 1 3 0.25
17 2 2 2 3 0.25
18 2 6 3 3 0.25
19 2 7 1 3 0.25
20 2 3 3 3 0.25
21 2 2 2 3 0.25
22 2 1 1 3 0.25
23 2 5 1 3 0.25
24 2 6 3 3 0.25
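If you only need one summary row per participant instead of the values repeated on every response, the same logic works with summarise() (a small variation on the pipe above):

data %>% group_by(id, response) %>% mutate(n = n()) %>%
  ungroup() %>% group_by(id) %>%
  summarise(most_frequent_value = response[n == max(n)][1],
            relative_frequency_of_most_frequent_value = max(n)/n())
# A tibble: 2 x 3
#      id most_frequent_value relative_frequency_of_most_frequent_value
# 1     1                   3                                      0.417
# 2     2                   3                                      0.25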
I am trying to classify the temp variable into groups such that each group ends at a row where Dur > 5.
Further, I want to find the maximum temp value within each group, as shown in the expected outcome.
Dur=c(2.75,0.25,13,0.25,45.25,0.25,0.25,4.25,0.25,0.25,14)
temp=c(2.54,5.08,0,2.54,0,5,2.54,0,2.54,0,2.54)
df=data.frame(Dur,temp)
Expected Outcome:
group=c(1,1,1,2,2,3,3,3,3,3,3)
Colnew=c(5.08,5.08,5.08,2.54,2.54,5,5,5,5,5,5)
(output=data.frame(df,group,Colnew))
We create a grouping variable by taking the cumsum of a logical vector (Dur > 5), then get the max of 'temp' per group:
library(dplyr)
df %>%
group_by(group = as.integer(factor(lag(cumsum(Dur > 5), default = 0)))) %>%
mutate(Max = max(temp))
# A tibble: 11 x 4
# Groups: group [3]
# Dur temp group Max
# <dbl> <dbl> <int> <dbl>
# 1 2.75 2.54 1 5.08
# 2 0.25 5.08 1 5.08
# 3 13 0 1 5.08
# 4 0.25 2.54 2 2.54
# 5 45.2 0 2 2.54
# 6 0.25 5 3 5
# 7 0.25 2.54 3 5
# 8 4.25 0 3 5
# 9 0.25 2.54 3 5
#10 0.25 0 3 5
#11 14 2.54 3 5
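To see what the grouping expression does, here are its intermediate steps on this data (a sketch of the evaluation):

Dur <- c(2.75, 0.25, 13, 0.25, 45.25, 0.25, 0.25, 4.25, 0.25, 0.25, 14)
cumsum(Dur > 5)
# [1] 0 0 1 1 2 2 2 2 2 2 3
dplyr::lag(cumsum(Dur > 5), default = 0)
# [1] 0 0 0 1 1 2 2 2 2 2 2
# lagging shifts each increment down one row, so a row with Dur > 5 still
# closes its own group; factor()/as.integer() then renumbers the groups from 1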
Suppose I have a tibble tbl_
tbl_ <- tibble(id = c(1,1,2,2,3,3), dta = 1:6)
tbl_
# A tibble: 6 x 2
id dta
<dbl> <int>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
There are 3 id groups. I want to resample entire id groups 3 times with replacement. For example, the resulting tibble could be:
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 1 2
5 3 5
6 3 6
but not
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 2 4
5 3 5
6 3 6
or
id dta
<dbl> <int>
1 1 1
2 1 1
3 2 3
4 2 4
5 3 5
6 3 6
Here is one option with sample_n and distinct:
library(tidyverse)
distinct(tbl_, id) %>%
sample_n(nrow(.), replace = TRUE) %>%
pull(id) %>%
map_df( ~ tbl_ %>%
filter(id == .x)) %>%
arrange(id)
# A tibble: 6 x 2
# id dta
# <dbl> <int>
#1 1.00 1
#2 1.00 2
#3 1.00 1
#4 1.00 2
#5 3.00 5
#6 3.00 6
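An equivalent sketch using tidyr::nest() (my variation): with one nested row per id, sampling rows of the nested tibble with replacement resamples whole id groups.

library(tidyverse)

tbl_ %>%
  nest(data = dta) %>%                           # one row per id
  slice_sample(n = nrow(.), replace = TRUE) %>%  # resample whole ids
  unnest(data)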
An option is to get the minimum row number for each id. These row numbers are then sampled with replace = TRUE.
library(dplyr)
min_row <- tbl_ %>%
  mutate(rn = row_number()) %>%
  group_by(id) %>%
  summarise(minrow = min(rn))
indx <- rep(sample(min_row$minrow, nrow(min_row), replace = TRUE), each = 2) +
rep(c(0,1), 3)
tbl_[indx,]
# # A tibble: 6 x 2
# id dta
# <dbl> <int>
# 1 1.00 1
# 2 1.00 2
# 3 3.00 5
# 4 3.00 6
# 5 2.00 3
# 6 2.00 4
Note: the above assumes that each id has exactly 2 rows, although it can handle any number of ids. The hard-coded each = 2 and c(0, 1) need to be modified to handle more than 2 rows per id; see the generalised sketch below.
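A generalised sketch of the same idea that drops that assumption: split the row numbers by id, sample ids with replacement, and stack the corresponding index blocks.

# split() keeps the row numbers of each id together, so groups may have
# any (even unequal) sizes
idx <- split(seq_len(nrow(tbl_)), tbl_$id)
sampled <- sample(names(idx), length(idx), replace = TRUE)
tbl_[unlist(idx[sampled]), ]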
I have the following data frame
d2
# A tibble: 10 x 2
ID Count
<int> <dbl>
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
which states how many counts each person (ID) had.
I would like to calculate the cumulative percentage of each count: 1 - 50%, up to 2: 80%, up to 3: 100%.
I tried
> d2 %>% mutate(cum = cumsum(Count)/sum(Count))
# A tibble: 10 x 3
ID Count cum
<int> <dbl> <dbl>
1 1 0.05882353
2 1 0.11764706
3 1 0.17647059
4 1 0.23529412
5 1 0.29411765
6 2 0.41176471
7 2 0.52941176
8 2 0.64705882
9 3 0.82352941
10 3 1.00000000
but this result is obviously incorrect because I would expect that the count of 1 would correspond to 50% rather than 29.4%.
What is wrong here? How do I get the correct answer?
We get the count of 'Count', create 'Cum' by taking the cumulative sum of 'n' and dividing it by the sum of 'n', then right_join with the original data:
d2 %>%
count(Count) %>%
mutate(Cum = cumsum(n)/sum(n)) %>%
select(-n) %>%
right_join(d2) %>%
select(names(d2), everything())
# A tibble: 10 x 3
# ID Count Cum
# <int> <int> <dbl>
# 1 1 1 0.500
# 2 2 1 0.500
# 3 3 1 0.500
# 4 4 1 0.500
# 5 5 1 0.500
# 6 6 2 0.800
# 7 7 2 0.800
# 8 8 2 0.800
# 9 9 3 1.00
#10 10 3 1.00
If we need the output @LAP mentioned:
d2 %>%
mutate(Cum = row_number()/n())
# ID Count Cum
#1 1 1 0.1
#2 2 1 0.2
#3 3 1 0.3
#4 4 1 0.4
#5 5 1 0.5
#6 6 2 0.6
#7 7 2 0.7
#8 8 2 0.8
#9 9 3 0.9
#10 10 3 1.0
This works:
d2 %>%
mutate(cum = cumsum(rep(1/n(), n())))
ID Count cum
1 1 1 0.1
2 2 1 0.2
3 3 1 0.3
4 4 1 0.4
5 5 1 0.5
6 6 2 0.6
7 7 2 0.7
8 8 2 0.8
9 9 3 0.9
10 10 3 1.0
One option could be:
library(dplyr)
d2 %>%
group_by(Count) %>%
summarise(proportion = n()) %>%
mutate(Perc = cumsum(100*proportion/sum(proportion))) %>%
select(-proportion)
# # A tibble: 3 x 2
# Count Perc
# <int> <dbl>
# 1 1 50.0
# 2 2 80.0
# 3 3 100.0