How to randomise order of group within group in R/dplyr?

I have a group nested within another group in my data. I would like to randomise the order of the nested groups while preserving the order of the rows within each nested group. (This will be a step within an existing pipe, so a tidyverse solution would be ideal.)
In the example below, how do I randomise the order of block within participant_id, while also preserving the order of both participant_id and trial?
library(dplyr)
set.seed(123)
# dummy data
data <- tibble::tribble(
  ~participant_id, ~block, ~trial,
  1L, "a", 1L,
  1L, "a", 2L,
  1L, "a", 3L,
  1L, "b", 1L,
  1L, "b", 2L,
  1L, "b", 3L,
  2L, "a", 1L,
  2L, "a", 2L,
  2L, "a", 3L,
  2L, "b", 1L,
  2L, "b", 2L,
  2L, "b", 3L
)
# something along the lines of...
new_data <- data %>%
  group_by(participant_id) %>%
  # ? step here to randomise order within 'block', while preserving order within 'trial'.
Thanks.

Here's one approach, using group_map():
# Randomise within one participant
randomiseGroup <- function(.x, .y) {
  # In group_map(), .x is the group's data (without participant_id) and .y is the one-row key
  # Generalise so that any number of blocks can be handled
  r <- .x %>%
    distinct(block) %>%
    mutate(random = runif(nrow(.)))
  # Randomise: sort blocks by their random draw, keep trial order, and reattach the key
  .y %>%
    bind_cols(
      .x %>%
        ungroup() %>%
        left_join(r, by = "block") %>%
        arrange(random, trial) %>%
        select(-random)
    )
}
# Randomise all participants
data %>%
  group_by(participant_id) %>%
  group_map(randomiseGroup) %>%
  bind_rows()
# A tibble: 12 × 3
   participant_id block trial
            <int> <chr> <int>
 1              1 a         1
 2              1 a         2
 3              1 a         3
 4              1 b         1
 5              1 b         2
 6              1 b         3
 7              2 b         1
 8              2 b         2
 9              2 b         3
10              2 a         1
11              2 a         2
12              2 a         3
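As an aside, a more compact route to the same result (my own sketch, not one of the posted answers; it assumes tidyr is available alongside dplyr) is to nest the trial rows per block and shuffle the nested blocks within each participant:
library(dplyr)
library(tidyr)
new_data <- data %>%
  nest(trials = trial) %>%     # one row per participant_id/block pair
  group_by(participant_id) %>%
  slice_sample(prop = 1) %>%   # shuffle the block rows within each participant
  ungroup() %>%
  unnest(trials)               # trial order inside each block is preserved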

One option could be:
data %>%
  group_by(participant_id) %>%
  mutate(rleid = cumsum(block != lag(block, default = first(block))),
         block_random = sample(n())) %>%
  group_by(participant_id, rleid) %>%
  mutate(block_random = min(block_random)) %>%
  ungroup()
   participant_id block trial rleid block_random
            <int> <chr> <int> <int>        <int>
 1              1 a         1     0            2
 2              1 a         2     0            2
 3              1 a         3     0            2
 4              1 b         1     1            1
 5              1 b         2     1            1
 6              1 b         3     1            1
 7              2 a         1     0            2
 8              2 a         2     0            2
 9              2 a         3     0            2
10              2 b         1     1            1
11              2 b         2     1            1
12              2 b         3     1            1
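To actually put the rows into the randomised block order, block_random still has to be used for sorting; a minimal sketch of that final step (my addition, not shown in the answer above):
new_data <- data %>%
  group_by(participant_id) %>%
  mutate(rleid = cumsum(block != lag(block, default = first(block))),
         block_random = sample(n())) %>%
  group_by(participant_id, rleid) %>%
  mutate(block_random = min(block_random)) %>%
  ungroup() %>%
  arrange(participant_id, block_random, trial) %>%  # reorder blocks, keep trial order
  select(-rleid, -block_random)                     # drop the helper columns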


How to keep only the truly unique rows? [duplicate]

This question already has answers here: How can I remove all duplicates so that NONE are left in a data frame? (3 answers). Closed 12 months ago.
Here is an example of a matrix:
A B C
1 1 1
1 1 4
1 2 4
2 1 1
3 1 1
3 1 2
I would like to extract only the rows that are unique in A and B.
I can't use unique(), duplicated(), etc. because they always retain one of the duplicated rows.
The final result I want to obtain is:
A B C
1 2 4
2 1 1
How can I do it?
Thank you
Here are a couple of options -
Base R -
cols <- c('A', 'B')
res <- df[!(duplicated(df[cols]) | duplicated(df[cols], fromLast = TRUE)), ]
res
# A B C
#3 1 2 4
#4 2 1 1
dplyr -
library(dplyr)
df %>% group_by(A, B) %>% filter(n() == 1) %>% ungroup
# A tibble: 2 x 3
# A B C
# <int> <int> <int>
#1 1 2 4
#2 2 1 1
data.table
df <- data.frame(
  A = c(1L, 1L, 1L, 2L, 3L, 3L),
  B = c(1L, 1L, 2L, 1L, 1L, 1L),
  C = c(1L, 4L, 4L, 1L, 1L, 2L)
)
library(data.table)
setDT(df)[, .SD[.N == 1], by = list(A, B)]
#> A B C
#> 1: 1 2 4
#> 2: 2 1 1
Created on 2022-02-28 by the reprex package (v2.0.1)
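Another compact dplyr option (my own sketch, not one of the posted answers) attaches the group size with add_count() and keeps only the singleton combinations:
library(dplyr)
df %>%
  add_count(A, B) %>%   # n = number of rows sharing this A/B combination
  filter(n == 1) %>%    # keep combinations that occur exactly once
  select(-n)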

Add a mean column to a table dataframe in R

I have a dataframe such as:
COL1 VALUE1 VALUE2
1 A,A 1 5
2 A,A,B 1 3
3 C 1 1
4 D 1 2
5 D 1 2
6 A,A 1 10
7 A,B,A 1 2
and I can manage to remove the duplicated letters within COL1 and count the number of rows for each deduplicated COL1 group by using:
as.data.frame(table(tab$COL1)) %>%
  group_by(Var1 = sapply(strsplit(as.character(Var1), ","), function(x) toString(unique(x)))) %>%
  summarise(Freq = sum(Freq))
And then I get:
# A tibble: 4 × 2
  Var1   Freq
  <chr> <int>
1 A         2
2 A, B      2
3 C         1
4 D         2
But I wondered if someone had an idea for adding a new column called Mean which would contain, for each COL1 group, the mean of the VALUE2 values, so that I get:
  Var1  Freq  Mean
1 A        2   7.5   < because (5+10)/2 = 7.5
2 A, B     2   2.5   < because (3+2)/2 = 2.5
3 C        1   1     < because 1/1 = 1
4 D        2   2     < because (2+2)/2 = 2
Here is the dataframe, if it helps:
structure(list(COL1 = structure(c(1L, 2L, 4L, 5L, 5L, 1L, 3L), .Label = c("A,A",
"A,A,B", "A,B,A", "C", "D"), class = "factor"), VALUE1 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), VALUE2 = c(5L, 3L, 1L, 2L, 2L, 10L,
2L)), class = "data.frame", row.names = c(NA, -7L))
You can calculate the frequency table directly in the dplyr chain, and then just add a Mean = mean(VALUE2) in the summarise() call.
I.e.
tab %>%
  group_by(Var1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
  summarise(Freq = sum(VALUE1), Mean = mean(VALUE2))
# # A tibble: 4 x 3
# Var1 Freq Mean
# <chr> <int> <dbl>
# 1 A 2 7.5
# 2 A, B 2 2.5
# 3 C 1 1
# 4 D 2 2
Is this what you want?
library(dplyr)
tab %>%
  mutate(COL1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
  group_by(COL1) %>%
  summarise(Freq = sum(VALUE1),
            Mean = mean(VALUE2))
# A tibble: 4 x 3
COL1 Freq Mean
* <chr> <int> <dbl>
1 A 2 7.5
2 A, B 2 2.5
3 C 1 1
4 D 2 2
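Since VALUE1 is 1 on every row of this example, sum(VALUE1) is simply the group size, so a variant (my sketch, valid under that assumption) can use n() instead:
tab %>%
  mutate(COL1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
  group_by(COL1) %>%
  summarise(Freq = n(),           # row count per deduplicated COL1 group
            Mean = mean(VALUE2),  # mean of VALUE2 within the group
            .groups = "drop")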

counting number of times an id has duplicated years

I have the following data frame:
df =
id Year Value
1 1 3
1 2 4
2 1 6
2 2 2
2 2 3
3 1 7
3 2 3
I want to count the number of times an individual id has a duplicating year.
Desired Outcome:
1
Id 2 has year 2 twice, that's why 1 is the outcome
So far I have tried:
library("dplyr")
df %>% group_by(id, Year) %>% summarize(count=n())
but I cannot get a single number with the count
Cheers
We can use table() to create counts of observations for each id and Year, and then count the combinations which occur more than once.
sum(table(df$id, df$Year) > 1)
#[1] 1
Just for completion, if we want to do this in dplyr
library(dplyr)
df %>%
  group_by(id, Year) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  summarise(new_count = sum(count > 1))
# new_count
# <int>
#1 1
Just for fun:
data.table solution:
data:
dt <- fread("id Year Value
1 1 3
1 2 4
2 1 6
2 2 2
2 2 3
3 1 7
3 2 3")
code:
dt[, .N > 1, by = c("id", "Year")]$V1 %>% sum
A (fast) alternative:
sum(sapply(split(df$Year, df$id), function(x) any(duplicated(x))))
Where:
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Year = c(1L, 2L, 1L, 2L, 2L, 1L, 2L),
  Value = c(3L, 4L, 6L, 2L, 3L, 7L, 3L)
)
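A still shorter dplyr variant (my sketch, not one of the posted answers) counts id/Year combinations with count() and sums those appearing more than once:
library(dplyr)
df %>%
  count(id, Year) %>%               # n = rows per id/Year combination
  summarise(new_count = sum(n > 1)) # combinations with a duplicated Year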

Counting occurrence of a variable without taking account duplicates

I have a big data frame, called data, with 1,004,490 observations, and I want to analyse the success of a treatment.
ID POSITIONS TREATMENT
1 0 A
1 1 A
1 2 B
2 0 C
2 1 D
3 0 B
3 1 B
3 2 C
3 3 A
3 4 A
3 5 B
So firstly, I want to count the number of patients (IDs) that received each treatment, but one treatment can be given several times to the same ID. Do I need to first delete all the duplicates and then count, or is there a function that does not take the duplicates into account?
What I want to have :
A : 2
B : 2
C : 2
D : 1
Then, I want to know how many times each treatment was given at the last position, but the last position differs from one ID to another.
What I want to have :
A : 0
B : 2 (for ID = 1 and 3)
C : 0
D : 1 (for ID = 2)
Thanks for your help, I am a new user of R!
Using base R, we can do,
merge(aggregate(ID ~ TREATMENT, df, FUN = function(i) length(unique(i))),
      aggregate(ID ~ TREATMENT, df[!duplicated(df$ID, fromLast = TRUE), ], toString),
      by = 'TREATMENT', all = TRUE)
Which gives,
TREATMENT ID.x ID.y
1 A 2 <NA>
2 B 2 1, 3
3 C 2 <NA>
4 D 1 2
Here is a tidyverse approach, where we get the distinct rows based on 'ID', 'TREATMENT' and get the count of 'TREATMENT'
library(tidyverse)
df1 %>%
  distinct(ID, TREATMENT) %>%
  count(TREATMENT)
# A tibble: 4 x 2
# TREATMENT n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
For the second output, after grouping by 'ID', slice the last row (n()), create a column 'ind' set to 1, fill it with 0 for all missing 'TREATMENT' combinations using complete(), and then get the sum of 'ind' after grouping by 'TREATMENT':
df1 %>%
  group_by(ID) %>%
  slice(n()) %>%
  mutate(ind = 1) %>%
  complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind = 0)) %>%
  group_by(TREATMENT) %>%
  summarise(n = sum(ind))
# A tibble: 4 x 2
# TREATMENT n
# <chr> <dbl>
#1 A 0
#2 B 2
#3 C 0
#4 D 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L), POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L
), TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A",
"A", "B")), .Names = c("ID", "POSITIONS", "TREATMENT"),
class = "data.frame", row.names = c(NA, -11L))

R - add column that counts sequentially within groups but repeats for duplicates

I'm looking for a solution to add the column "desired_result", preferably using dplyr and/or ave(). See the data frame here, where the group is "section" and the unique instances I want my "desired_result" column to count sequentially are in "exhibit":
structure(list(section = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), exhibit = structure(c(1L,
2L, 3L, 3L, 1L, 2L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"),
desired_result = c(1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L)), .Names = c("section",
"exhibit", "desired_result"), class = "data.frame", row.names = c(NA,
-8L))
dense_rank it is
library(dplyr)
df %>%
  group_by(section) %>%
  mutate(desire = dense_rank(exhibit))
# section exhibit desired_result desire
#1 1 a 1 1
#2 1 b 2 2
#3 1 c 3 3
#4 1 c 3 3
#5 2 a 1 1
#6 2 b 2 2
#7 2 b 2 2
#8 2 c 3 3
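Since the question mentions ave() as acceptable, here is a base-R sketch of the same dense-rank idea (my addition; it assumes exhibit is a factor or character vector):
df$desire <- ave(as.integer(factor(df$exhibit)),  # stable integer codes for exhibit
                 df$section,                      # grouped by section
                 FUN = function(x) match(x, sort(unique(x))))  # dense rank within group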
I've recently pushed a function rleid() to data.table (currently available on the development version, 1.9.5), which does exactly this. If you're interested, you can install it by following this.
require(data.table) # 1.9.5, for `rleid()`
require(dplyr)
DF %>%
  group_by(section) %>%
  mutate(desired_results = rleid(exhibit))
# section exhibit desired_result desired_results
# 1 1 a 1 1
# 2 1 b 2 2
# 3 1 c 3 3
# 4 1 c 3 3
# 5 2 a 1 1
# 6 2 b 2 2
# 7 2 b 2 2
# 8 2 c 3 3
If exact enumeration is necessary and you need the desired result to be consistent (so that the same exhibit in a different section will always have the same number), you can try:
library(dplyr)
df <- data.frame(section = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                 exhibit = c('a', 'b', 'c', 'c', 'a', 'b', 'b', 'c'))
if (is.null(saveLevels <- levels(df$exhibit)))
  saveLevels <- sort(unique(df$exhibit)) ## or levels(factor(df$exhibit))
df %>%
  group_by(section) %>%
  mutate(answer = as.integer(factor(exhibit, levels = saveLevels)))
## Source: local data frame [8 x 3]
## Groups: section
## section exhibit answer
## 1 1 a 1
## 2 1 b 2
## 3 1 c 3
## 4 1 c 3
## 5 2 a 1
## 6 2 b 2
## 7 2 b 2
## 8 2 c 3
If/when a new exhibit appears in subsequent sections, they should get newly enumerated results. (Notice the last exhibit is different.)
df2 <- data.frame(section = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                  exhibit = c('a', 'b', 'c', 'c', 'a', 'b', 'b', 'd'))
if (is.null(saveLevels2 <- levels(df2$exhibit)))
  saveLevels2 <- sort(unique(df2$exhibit))
df2 %>%
  group_by(section) %>%
  mutate(answer = as.integer(factor(exhibit, levels = saveLevels2)))
## Source: local data frame [8 x 3]
## Groups: section
## section exhibit answer
## 1 1 a 1
## 2 1 b 2
## 3 1 c 3
## 4 1 c 3
## 5 2 a 1
## 6 2 b 2
## 7 2 b 2
## 8 2 d 4
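One caveat worth noting (my addition): dense_rank() and rleid() agree on this data only because equal exhibits are consecutive within each section; if the same exhibit reappears after a different one, the two diverge:
library(dplyr)
library(data.table)
x <- c("a", "b", "a")
dense_rank(x)  # 1 2 1 -> equal values share a rank
rleid(x)       # 1 2 3 -> every new run gets a new id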
