Create new column with shared ID to randomly link two rows in R - r

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1", "2", "3"))
df_sim <-
  tibble::tibble(StudentID = 1:10,
                 Condition = sample(conditions,
                                    size = 10,
                                    replace = TRUE),
                 XP = stats::rpois(n = 10,
                                   lambda = 15))
This creates the following tibble.
StudentID Condition XP
1 2 8
2 3 11
3 3 16
4 3 12
5 1 22
6 3 16
7 1 18
8 3 8
9 2 14
10 1 17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to hold a value that is shared by exactly two students in the dataframe. In other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID Condition XP DyadID
1 2 8 4
2 3 11 1
3 3 16 2
4 3 12 1
5 1 22 3
6 3 16 NA
7 1 18 3
8 3 8 2
9 2 14 4
10 1 17 NA
Note that two students did not receive a pairing, because there was an odd number in condition 1 and condition 3. If there is an odd number, the DyadID can be NA.
Thank you for your help with this!

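Use match to get an id according to Condition, and sample to randomize which id each condition receives. Note that this assigns one id per Condition value, so every student in a condition shares the same id; it does not split a condition into pairs.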
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition,sample(unique(Condition))))
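To get actual dyads, one option is to shuffle students within each condition and pair them off two at a time. The following is a minimal sketch, not a definitive solution: it hard-codes the tibble shown in the question, the helper columns slot and pair are names made up for illustration, and it assumes dplyr 1.0.0 or later for cur_group_id().

```r
library(dplyr)

set.seed(111)

# The example data from the question, typed in by hand.
df_sim <- tibble(
  StudentID = 1:10,
  Condition = factor(c(2, 3, 3, 3, 1, 3, 1, 3, 2, 1)),
  XP        = c(8, 11, 16, 12, 22, 16, 18, 8, 14, 17)
)

df_dyads <- df_sim %>%
  group_by(Condition) %>%
  mutate(
    # Random position of each student within their condition.
    slot = sample(n()),
    # Positions 1-2 form pair 1, 3-4 form pair 2, and so on.
    # In an odd-sized condition, the last position is left unpaired.
    pair = if_else(slot == n() & n() %% 2 == 1, NA_real_, ceiling(slot / 2))
  ) %>%
  # One id per (Condition, pair) combination; unpaired students get NA.
  group_by(Condition, pair) %>%
  mutate(DyadID = if (is.na(pair[1])) NA_integer_ else cur_group_id()) %>%
  ungroup() %>%
  select(-slot, -pair)

df_dyads
```

The resulting DyadID values are unique per pair but not necessarily consecutive, because the unpaired groups also consume a group id before being set to NA.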

Related

How to create a dataframe by sampling 1 case (row) from each group in R

I would like to randomly select 1 case (so 1 row from a dataframe) from each group in R, but I cannot work out how to do it.
My data is structured in long format: 400 cases (rows) clustered within 250 groups (some groups contain only a single case, others 2, 3, 4, 5, or even 6). So what I would like to end up with is a dataframe containing 250 rows (with each row representing 1 randomly selected case from each of the 250 groups).
I have the idea that I should use the sample function for this, but I could not work out how to do it. Does anyone have any ideas?
Suppose your data frame X indicates group membership with a variable named "Group," as in this synthetic example:
G <- 8
set.seed(17)
X <- data.frame(Group=sort(sample.int(G, G, replace=TRUE)),
Case=1:G)
Here is a printout of X:
Group Case
1 2 1
2 2 2
3 2 3
4 4 4
5 4 5
6 5 6
7 7 7
8 8 8
Pick up the first instance of each value of "Group" using the duplicated function after randomly permuting the rows of X:
Y <- X[sample.int(nrow(X)), ]
Y[!duplicated(Y$Group), ]
Group Case
8 8 8
1 2 1
4 4 4
6 5 6
7 7 7
A comparison to X indicates random cases in each group were selected. Repeat these last two steps to confirm this if you like.
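The repeat-and-compare check can be sketched as follows. The helper name pick_one_per_group is made up for illustration; it simply wraps the two steps above (shuffle the rows, then keep the first row seen for each group).

```r
set.seed(17)
G <- 8
X <- data.frame(Group = sort(sample.int(G, G, replace = TRUE)),
                Case = 1:G)

# Hypothetical helper: shuffle the rows, then keep the first row per group.
pick_one_per_group <- function(d) {
  shuffled <- d[sample.int(nrow(d)), ]
  shuffled[!duplicated(shuffled$Group), ]
}

# Repeat the selection a few times; each result has one row per group,
# but the chosen Case within a multi-row group can differ between runs.
picks <- replicate(5, pick_one_per_group(X), simplify = FALSE)
sapply(picks, nrow)
lapply(picks, function(p) p[order(p$Group), ])
```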

what is this function doing? replication [closed]

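library(dplyr)  # provides %>%, bind_rows(), mutate(), select(), group_by()

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1) {
  # Draw `size` random rows from tbl, `reps` times; each draw is a list element
  rep_tbl <- replicate(reps,
                       tbl[sample(1:nrow(tbl), size, replace = replace), ],
                       simplify = FALSE) %>%
    bind_rows() %>%                                   # stack the draws
    mutate(replicate = rep(1:reps, each = size)) %>%  # label each draw
    select(replicate, everything()) %>%
    group_by(replicate)
  return(rep_tbl)
}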
Hey, can anyone help me there? What is this function doing? Is the first line setting the variables of the function? And then what is this "replicate" doing? Thanks!
This function replicates your data. Let's say we have a dataset of 10 observations. To come up with additional datasets like your current one, you can replicate it by randomly sampling from it.
You can check out the Wikipedia page on statistical replication if you're curious to learn more.
Let's take a simple dataframe:
df <- data.frame(x = 1:10, y = 1:10)
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
If we want to take a random sample of this, we can use the function rep_sample_n, which takes two required arguments, tbl and size, and two optional arguments, replace = FALSE and reps = 1.
Here is an example of taking 4 randomly selected rows from our data.
rep_sample_n(df, 4)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 1 1
2 1 3 3
3 1 4 4
4 1 10 10
Now if we want to randomly sample 15 observations from a 10 observation dataset, it will throw an error. The default replace = FALSE doesn't allow that, because each time a row is chosen it is removed from the pool for the next draw. In the example above, once the 1st observation had been chosen, only observations 2 through 10 were left for the next draw, which picked the 3rd; the remaining draws picked the 4th and the 10th. If we set replace = TRUE, each draw chooses an observation from the full dataset.
Notice how in this example the 5th observation was chosen twice. That wouldn't happen with replace = FALSE.
rep_sample_n(df, 4, replace = TRUE)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 5 5
2 1 3 3
3 1 2 2
4 1 5 5
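The difference between the two settings can be seen with sample() itself, which rep_sample_n calls internally. This is a small sketch on a plain vector rather than a data frame; the variable names are made up for illustration.

```r
pool <- 1:10

# Asking for 15 draws from 10 values without replacement is an error;
# tryCatch captures the error message instead of stopping the script.
too_many <- tryCatch(sample(pool, 15),
                     error = function(e) conditionMessage(e))
too_many

# With replacement, the same request works, and repeats are unavoidable
# because 15 draws cannot all be distinct among 10 values.
with_replacement <- sample(pool, 15, replace = TRUE)
with_replacement
```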
Lastly, and most importantly, we have the reps argument, which is really the basis for this function. It allows you to randomly sample your dataset multiple times and then combine all those samples together.
Below, we sample 4 observations from our original dataset of 10 and repeat that 5 times. This gives 5 different samples of 4 observations each, combined into one 20 observation dataframe. Each of the 5 samples is tagged with a replicate number, so the replicate column shows which 4 observations go with which sample.
rep_sample_n(df, 4, reps = 5)
# A tibble: 20 x 3
# Groups: replicate [5]
replicate x y
<int> <int> <int>
1 1 8 8
2 1 4 4
3 1 3 3
4 1 1 1
5 2 4 4
6 2 5 5
7 2 8 8
8 2 3 3
9 3 6 6
10 3 1 1
11 3 3 3
12 3 2 2
13 4 5 5
14 4 7 7
15 4 10 10
16 4 3 3
17 5 7 7
18 5 10 10
19 5 3 3
20 5 9 9
I hope this provided some clarity
This function takes a data frame as input (and several input preferences). It takes a random sample of size rows from the table, with or without replacement as set by the replace input. It repeats that random sampling reps times.
Then, it binds all the samples together into a single data frame, adding a new column called "replicate" indicating which repetition of the sampling produced each row.
Finally, it "groups" the resulting table, preparing it for future group-wise operations with dplyr.
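The sample-then-bind steps can be tried in isolation with base R. This is only an illustrative sketch on a plain vector; the names draws and stacked are made up here.

```r
set.seed(1)

# replicate() evaluates the expression 3 times; simplify = FALSE keeps the
# results as a list with one element per repetition.
draws <- replicate(3, sample(1:10, 4), simplify = FALSE)
length(draws)

# Stack the draws and label each one, mirroring the "replicate" column.
stacked <- data.frame(replicate = rep(1:3, each = 4),
                      x = unlist(draws))
stacked
```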
For general questions about specific functions, like "What is this "replicate" doing?", you should look at the function's help page: type ?replicate or help("replicate") to get there. It includes a description of the function and examples of how to use it. If you read the description, run the examples, and are still confused, feel free to come back with a specific question and example illustrating what you are confused by.
Similarly, for "Is the first line setting the variables of the function?", the arguments to function() are the inputs to the function. If you have basic questions about R like "How do functions work", have a look at An Introduction to R, or one of the other sources in the R Tag Wiki.

How to calculate variance in a data table

I am a newbie to R. I have a data table DT:
id time day type
1 1 9 10
2 2 3 10
1 3 6 12
3 8 9 10
6 9 9 10
8 2 6 18
9 3 5 10
9 1 4 12
From this I initially wanted the count grouped by day, time, and type, so I did
DT[,.N,by=list(day,time,type)]
which gives the count for each group.
Now I need to calculate the variance for each group. So I tried
DT[,var(.N),by=list(day,time,type)]
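But this gave NA for all fields. Any help is appreciated.
var(.N) returns NA because .N is a single number within each group, and the variance of a single value is undefined. Beyond that, in the example given many of the combinations contain only one row, so there is no variance to compute for those groups. To get a variance, pass a column of values rather than the count, for example per id: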
DT <- data.frame(id   = c(1, 2, 1, 3, 6, 8, 9, 9),
                 time = c(1, 2, 3, 8, 9, 2, 3, 1),
                 day  = c(9, 3, 6, 9, 9, 6, 5, 4),
                 type = c(10, 10, 12, 10, 10, 18, 10, 12))
# Variance of every column within each id; ids with one row give NA
aggregate(DT, list(DT$id), FUN = var)
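In data.table syntax, the same idea can be sketched by computing the count and the variance of a column side by side. Grouping by day alone and taking the variance of time are choices made here purely for illustration, since the question does not say which column's variance is wanted.

```r
library(data.table)

DT <- data.table(id   = c(1, 2, 1, 3, 6, 8, 9, 9),
                 time = c(1, 2, 3, 8, 9, 2, 3, 1),
                 day  = c(9, 3, 6, 9, 9, 6, 5, 4),
                 type = c(10, 10, 12, 10, 10, 18, 10, 12))

# .N is the group size; var(time) is the variance of the time values in the group.
# Groups with a single row have N == 1 and an NA variance.
res <- DT[, .(N = .N, var_time = var(time)), by = day]
res
```

Days with a single row still show NA, which is expected: at least two values are needed for a sample variance.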

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know if the last group of rows in the dataset has fewer than x observations.
For example:
Suppose I use a dataset with 10 observations and 2 variables, and I want to group every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3. Every three rows will get tagged in serial numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when nrow is divided by the group size, using the modulo operator: nrow(df) %% 3 #change the divisor to your group size
Assuming your data is df, you can do
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
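Both pieces can be combined in a short sketch. It assumes the example data are the first 10 rows of the built-in cars dataset (which match the values shown in the question) and uses ceiling(seq_len(n) / size) as an equivalent way to build the labels; group_size is a name made up for illustration.

```r
df <- head(cars, 10)   # built-in dataset; first 10 rows match the example
group_size <- 3

# Row i belongs to group ceiling(i / group_size).
df$newcol <- ceiling(seq_len(nrow(df)) / group_size)

# Remainder 0 means the last group is complete; otherwise it is the
# number of rows in the final, incomplete group.
leftover <- nrow(df) %% group_size
last_group_size <- if (leftover == 0) group_size else leftover

df
last_group_size
```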

Return rows of data frame that meet multiple criteria in R (panel data random sample)

I am hoping to create a random sample from panel data based on the unique id.
For instance if you start with:
e = data.frame(id=c(1,1,1,2,2,3,3,3,4,4,4,4), data=c(23,34,45,1,23,45,6,2,9,39,21,1))
And you want a random sample of 2 unique ids:
out = data.frame(id=c(1,1,1,3,3,3), data=c(23,34,45,45,6,2))
Although sample gives me random unique ids
sample( e$id ,2) # give c(1,3)
I can't figure out how to use logical calls to return all the desired data.
I have tried a number of things including:
e[ e$id == sample( e$id ,2) ] # only returns 1/2 the data
Any ideas? It's killing me.
I'm not entirely sure what your expected result should be, but does this work for what you're trying to do?
> e[e$id %in% sample(e$id, 2), ]
id data
6 3 45
7 3 6
8 3 2
9 4 9
10 4 39
11 4 21
12 4 1
Or maybe you want this:
> e[e$id %in% sample(unique(e$id), 2), ]
id data
1 1 23
2 1 34
3 1 45
9 4 9
10 4 39
11 4 21
12 4 1
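The reason == returns only part of the data is that it compares element by element, recycling the two sampled ids along the id column, whereas %in% tests each id for membership in the sampled set. A minimal sketch of the second approach, with the seed and the name chosen_ids added here for illustration:

```r
e <- data.frame(id   = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
                data = c(23, 34, 45, 1, 23, 45, 6, 2, 9, 39, 21, 1))

set.seed(42)

# Sample from the distinct ids so that two different ids are always drawn.
chosen_ids <- sample(unique(e$id), 2)

# Keep every row whose id is one of the chosen ids.
out <- e[e$id %in% chosen_ids, ]
out
```

Sampling from unique(e$id) also gives every id the same chance of selection, regardless of how many rows it has.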

Resources