Randomize two sets of number in R, not repeating values between groups - r

I have this file:
ID
1
1
1
3
3
3
7
7
7
And I need to assign two sets randomly, (1,2,3) and (5,15,25).
To do this I used this:
set.seed(1109201)
df %>%
group_by(ID) %>%
dplyr::mutate(set1=sample(c(1,2,3), size=n(), replace=F),set2=sample(c(5,15,25), size=n(), replace=F))
and I obtained this:
ID set1 set2
1 1 15
3 1 25
7 1 25
1 2 5
3 2 15
7 2 5
1 3 25
3 3 5
7 3 15
but I need different values for set2 in set1 and ID, like this:
ID set1 set2
1 1 15
3 1 25
7 1 5
1 2 5
3 2 15
7 2 25
1 3 25
3 3 5
7 3 15
Set2 cannot be repeated into ID or set1
some suggestion to control these 2 sets?

Change your dplyr code to the following. Using a 'group_by()` step will have the second sampling occur only within the group.
set.seed(1109201)
df %>%
group_by(ID) %>%
dplyr::mutate(set1=sample(c(1,2,3), size=n(), replace=F)) %>%
group_by(set1) %>%
mutate(set2=sample(c(5,15,25), size=n(), replace=F)) %>%
ungroup()
# A tibble: 8 x 3
ID set1 set2
<dbl> <dbl> <dbl>
1 1 2 15
2 1 3 5
3 1 1 25
4 3 3 15
5 3 2 5
6 3 1 5
7 7 2 25
8 7 3 25

Related

How to create another column in a data frame based on repeated observations in another column?

So basically I have a data frame that looks like this:
BX
BY
1
12
1
12
1
12
2
14
2
14
3
5
I want to create another colum ID, which will have the same number for the same values in BX and BY. So the table would look like this then:
BX
BY
ID
1
12
1
1
12
1
1
12
1
2
14
2
2
14
2
3
5
3
Here is a base R way.
Subset the data.frame by the grouping columns, find the duplicated rows and use a standard cumsum trick.
df1<-'BX BY
1 12
1 12
1 12
2 14
2 14
3 5'
df1 <- read.table(textConnection(df1), header = TRUE)
cumsum(!duplicated(df1[c("BX", "BY")]))
#> [1] 1 1 1 2 2 3
df1$ID <- cumsum(!duplicated(df1[c("BX", "BY")]))
df1
#> BX BY ID
#> 1 1 12 1
#> 2 1 12 1
#> 3 1 12 1
#> 4 2 14 2
#> 5 2 14 2
#> 6 3 5 3
Created on 2022-10-12 with reprex v2.0.2
You can do:
transform(dat, ID = as.numeric(interaction(dat, drop = TRUE, lex.order = TRUE)))
BX BY ID
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
Or if you prefer dplyr:
library(dplyr)
dat %>%
group_by(across()) %>%
mutate(ID = cur_group_id()) %>%
ungroup()
# A tibble: 6 × 3
BX BY ID
<dbl> <dbl> <int>
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3

Converting time-dependent variable to long format using one variable indicating day of update

I am trying to convert my data to a long format using one variable that indicates a day of the update.
I have the following variables:
baseline temperature variable "temp_b";
time-varying temperature variable "temp_v" and
the number of days "n_days" when the varying variable is updated.
I want to create a long format using the carried forward approach and a max follow-up time of 5 days.
Example of data
df <- structure(list(id=1:3, temp_b=c(20L, 7L, 7L), temp_v=c(30L, 10L, NA), n_days=c(2L, 4L, NA)), class="data.frame", row.names=c(NA, -3L))
# id temp_b temp_v n_days
# 1 1 20 30 2
# 2 2 7 10 4
# 3 3 7 NA NA
df_long <- structure(list(id=c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
days_cont=c(1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5),
long_format=c(20,30,30,30,30,7,7,7,10,10,7,7,7,7,7)),
class="data.frame", row.names=c(NA, -15L))
# id days_cont long_format
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
You could repeat each row 5 times with tidyr::uncount():
library(dplyr)
df %>%
tidyr::uncount(5) %>%
group_by(id) %>%
transmute(days_cont = 1:n(),
temp = ifelse(row_number() < n_days | is.na(n_days), temp_b, temp_v)) %>%
ungroup()
# # A tibble: 15 × 3
# id days_cont temp
# <int> <int> <int>
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
Here's a possibility using tidyverse functions. First, pivot_longer and get rid of unwanted values (that will not appear in the final df, i.e. values with temp_v == NA), then group_by id, and mutate the n_days variable to match the number of rows it will have in the final df. Finally, uncount the dataframe.
library(tidyverse)
df %>%
replace_na(list(n_days = 6)) %>%
pivot_longer(-c(id, n_days)) %>%
filter(!is.na(value)) %>%
group_by(id) %>%
mutate(n_days = case_when(name == "temp_b" ~ n_days - 1,
name == "temp_v" ~ 5 - (n_days - 1))) %>%
uncount(n_days) %>%
mutate(days_cont = row_number()) %>%
select(id, days_cont, long_format = value)
id days_cont long_format
<int> <int> <int>
1 1 1 20
2 1 2 30
3 1 3 30
4 1 4 30
5 1 5 30
6 2 1 7
7 2 2 7
8 2 3 7
9 2 4 10
10 2 5 10
11 3 1 7
12 3 2 7
13 3 3 7
14 3 4 7
15 3 5 7

Filtering out a grade variable that is not meeting consecutive order in r

I have a combined dataset that consists of three years of data for the same ids. When I merged the dataset, I see some of the students' grades are not consecutive in the following years.
Here is sample dataset looks like:
df <- data.frame( id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
category = c("A","A","A","B","B","B","A","A","A","B","B","B","A","A","A","B","B","B"),
year = c(18,19,20,18,19,20,18,19,20,18,19,20,18,19,20,18,19,20),
grade = c(3,4,5,3,4,5,5,6,8,5,6,8,3,4,6,3,4,6))
> df
id category year grade
1 1 A 18 3
2 1 A 19 4
3 1 A 20 5
4 1 B 18 3
5 1 B 19 4
6 1 B 20 5
7 2 A 18 5
8 2 A 19 6
9 2 A 20 8
10 2 B 18 5
11 2 B 19 6
12 2 B 20 8
13 3 A 18 3
14 3 A 19 4
15 3 A 20 6
16 3 B 18 3
17 3 B 19 4
18 3 B 20 6
In this sample dataset, id=2 and id=3 have those grades not in order as 5,6,7 and 3,4,5. id=2 has 5,6,8 instead of 5,6,7 and id=3 has 3,4,6 instead of 3,4,5. I would like remove those students from the dataset. My desired output would include only id=1 who has the grades are in order for the consecutive years.
My desired output file would be:
> df
id category year grade
1 1 A 18 3
2 1 A 19 4
3 1 A 20 5
4 1 B 18 3
5 1 B 19 4
6 1 B 20 5
Any ideas?
Thanks!
Get the diff and check if all of them is equal to 1, grouped by 'id', and 'category' to filter the groups
library(dplyr)
df %>%
group_by(id, category) %>%
filter(all(diff(grade) == 1)) %>%
ungroup
-output
# A tibble: 6 × 4
id category year grade
<dbl> <chr> <dbl> <dbl>
1 1 A 18 3
2 1 A 19 4
3 1 A 20 5
4 1 B 18 3
5 1 B 19 4
6 1 B 20 5

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
id = 1:18,
class = rep(c(rep(1,3),rep(2,2),3),3),
var_a = rep(c("a","b"),9)
)
# A tibble: 18 x 3
id cluster var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced. In the sample above we can see, that only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance that dataset so that all classes are of the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such a sampling.
I achieved to do so by first calculating the smallest class size..
min_length <- as.numeric(df %>%
group_by(class) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(min = min(n)))
..and then apply the sample_n function:
set.seed(1)
df %>% group_by(cluster) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: cluster [3]
id cluster var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered If it's possible to do that (calculating the smallest class size and then sampling) in one go?
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
group_by(class) %>%
sample_n(min(table(df$class))) %>%
ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because they property we're looking for is of the whole frame but the table function only sees one group at a time, we need to side-step that a little.
One could do
df %>%
mutate(mn = min(table(class))) %>%
group_by(class) %>%
sample_n(mn[1]) %>%
ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that that is any more elegant/readable.

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
You need to create a data frame, then use group_by and mutate to add columns:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or given Frank's comment you could run the ntile function on distinct/unique values of the id - then join the original table back on id:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16

Resources