How to create a dataframe by sampling 1 case (row) from each group in R

I would like to randomly select 1 case (one row of a data frame) from each group in R, but I cannot work out how to do it.
My data are in long format: 400 cases (rows) clustered within 250 groups (some groups contain only a single case; others contain 2, 3, 4, 5, or even 6). I would like to end up with a data frame of 250 rows, each row representing one randomly selected case from one of the 250 groups.
I have the idea that I should use the sample function for this, but I could not work out how to do it. Does anyone have any ideas?

Suppose your data frame X indicates group membership with a variable named "Group," as in this synthetic example:
G <- 8
set.seed(17)
X <- data.frame(Group = sort(sample.int(G, G, replace = TRUE)),
                Case = 1:G)
Here is a printout of X:
Group Case
1 2 1
2 2 2
3 2 3
4 4 4
5 4 5
6 5 6
7 7 7
8 8 8
Pick out the first instance of each value of "Group" using the duplicated function, after randomly permuting the rows of X:
Y <- X[sample.int(nrow(X)), ]
Y[!duplicated(Y$Group), ]
Group Case
8 8 8
1 2 1
4 4 4
6 5 6
7 7 7
A comparison with X shows that a random case was selected within each group. Repeat the last two steps a few times to confirm this if you like.
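An equivalent one-liner with dplyr, as a sketch (this assumes the dplyr package is available and is not part of the original answer):
# A dplyr sketch of the same idea: keep one randomly chosen row per group.
# Assumes dplyr >= 1.0 for slice_sample().
library(dplyr)

Y2 <- X %>%
  group_by(Group) %>%
  slice_sample(n = 1) %>%
  ungroup()
Y2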

Create new column with shared ID to randomly link two rows in R

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1","2","3"))
df_sim <- tibble::tibble(StudentID = 1:10,
                         Condition = sample(conditions, size = 10, replace = TRUE),
                         XP = stats::rpois(n = 10, lambda = 15))
This creates the following tibble.
StudentID Condition XP
        1         2  8
        2         3 11
        3         3 16
        4         3 12
        5         1 22
        6         3 16
        7         1 18
        8         3  8
        9         2 14
       10         1 17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to create a variable that is uniquely shared by two students in the dataframe — in other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID Condition XP DyadID
        1         2  8      4
        2         3 11      1
        3         3 16      2
        4         3 12      1
        5         1 22      3
        6         3 16     NA
        7         1 18      3
        8         3  8      2
        9         2 14      4
       10         1 17     NA
Note that two students did not receive a pairing, because there was an odd number of students in condition 1 and in condition 3. If a condition has an odd number of students, the leftover DyadID can be NA.
Thank you for your help with this!
Using match to get a unique id according to Condition, and sample for randomness:
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition,sample(unique(Condition))))
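Note that this gives every row of a Condition the same id rather than pairing students two at a time. If actual dyads are needed, one possible sketch (my own, not part of the original answer) is to shuffle the rows within each Condition, let consecutive shuffled rows form a pair, and leave a leftover student as NA when a condition has an odd count:
# A possible sketch (not the original answer): randomly pair students two at a
# time within each Condition, leaving one student NA when the count is odd.
library(dplyr)

set.seed(111)
df_dyads <- df_sim %>%
  group_by(Condition) %>%
  mutate(ord  = sample(n()),                            # random order within condition
         pair = ifelse(ord > 2 * (n() %/% 2),           # leftover row when n() is odd
                       NA_integer_,
                       ceiling(ord / 2))) %>%           # shuffled rows 1-2 form pair 1, 3-4 pair 2, ...
  ungroup() %>%
  mutate(DyadID = dense_rank(ifelse(is.na(pair), NA,    # NA stays NA; otherwise a unique
                                    paste(Condition, pair)))) %>%  # id per (Condition, pair)
  select(-ord, -pair)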

what is this function doing? replication [closed]

library(dplyr)  # needed for %>%, bind_rows, mutate, select and group_by

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1) {
  # draw `reps` random samples of `size` rows each from `tbl`
  rep_tbl <- replicate(reps,
                       tbl[sample(1:nrow(tbl), size, replace = replace), ],
                       simplify = FALSE) %>%
    bind_rows() %>%                                   # stack the samples into one table
    mutate(replicate = rep(1:reps, each = size)) %>%  # tag each row with its sample number
    select(replicate, everything()) %>%               # move the tag to the first column
    group_by(replicate)                               # group by sample for later operations
  return(rep_tbl)
}
Hey, can anyone help me here? What is this function doing? Is the first line setting the variables of the function? And then what is this replicate doing? Thanks!
This function replicates your data. Let's say we have a dataset of 10 observations. To come up with additional datasets like your current one, you can replicate it by introducing random sampling of your dataset.
You can check out the Wikipedia page on statistical replication if you're curious.
Let's take a simple data frame:
df <- data.frame(x = 1:10, y = 1:10)
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
If we want to take a random sample of this, we can use the function rep_sample_n, which takes two required arguments, tbl and size, and two optional arguments, replace = FALSE and reps = 1.
Here is an example of taking 4 randomly selected rows from our data.
rep_sample_n(df, 4)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 1 1
2 1 3 3
3 1 4 4
4 1 10 10
Now if we want to randomly sample 15 observations from a 10-observation dataset, it will throw an error. The replace = FALSE argument doesn't allow that, because each time a sample row is chosen it is removed from the pool for the next draw. In the example above, it chose the 1st observation first; for the next draws (we asked for 4 in total) only observations 2 through 10 were left, and it went on to choose the 3rd, then the 4th, then the 10th. If we allow replace = TRUE, it will choose an observation from the full dataset each time.
Notice how in the next example the 5th observation is chosen twice. That wouldn't happen with replace = FALSE.
rep_sample_n(df, 4, replace = TRUE)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 5 5
2 1 3 3
3 1 2 2
4 1 5 5
Lastly and most importantly, we have the reps argument, which is really the basis for this function. It allows you to randomly sample your dataset multiple times and then combine all those samples together.
Below, we sample our original dataset of 10 observations by selecting 4 of them at a time and replicating that 5 times, so we get 5 different sample data frames of 4 observations each, combined into one 20-observation data frame. Each of the 5 samples is tagged with a replicate number, and the replicate column tells you which 4 observations belong to which replicated sample.
rep_sample_n(df, 4, reps = 5)
# A tibble: 20 x 3
# Groups: replicate [5]
replicate x y
<int> <int> <int>
1 1 8 8
2 1 4 4
3 1 3 3
4 1 1 1
5 2 4 4
6 2 5 5
7 2 8 8
8 2 3 3
9 3 6 6
10 3 1 1
11 3 3 3
12 3 2 2
13 4 5 5
14 4 7 7
15 4 10 10
16 4 3 3
17 5 7 7
18 5 10 10
19 5 3 3
20 5 9 9
I hope this provided some clarity.
This function takes a data frame as input (along with several options). It takes a random sample of size rows from the table, with or without replacement as set by the replace argument, and repeats that random sampling reps times.
Then, it binds all the samples together into a single data frame, adding a new column called "replicate" indicating which repetition of the sampling produced each row.
Finally, it "groups" the resulting table, preparing it for future group-wise operations with dplyr.
For general questions about specific functions, like "What is replicate doing?", you should look at the function's help page: type ?replicate or help("replicate") to get there. It includes a description of the function and examples of how to use it. If you read the description, run the examples, and are still confused, feel free to come back with a specific question and an example illustrating what you are confused by.
Similarly, for "Is the first line setting the variables of the function?", the arguments to function() are the inputs to the function. If you have basic questions about R like "How do functions work", have a look at An Introduction to R, or one of the other sources in the R Tag Wiki.
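For instance, a minimal illustration of what replicate() does on its own (purely illustrative, not part of the function above):
# replicate() runs an expression several times and collects the results;
# simplify = FALSE keeps each run as a separate element of a list.
set.seed(1)
replicate(3, sample(1:5, 2), simplify = FALSE)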

sequential counting with input from more than one variable in r

I want to create a column with sequential values, but it gets its value from two other columns in the df. I want the value to increment whenever either Team changes (between 1 and 2) or Event = x. Any help would be appreciated! See the example below:
Team Event Value
1 1 a 1
2 1 a 1
3 2 a 2
4 2 x 3
5 2 a 3
6 1 a 4
7 1 x 5
8 1 a 5
9 2 x 6
10 2 a 6
This will do it...
df$Value <- cumsum(df$Event=="x" | c(1, diff(df$Team))!=0)
It takes the cumulative sum (i.e. of TRUE values) of those elements where either Event=="x" or the difference in successive values of Team is non-zero. An extra element is added at the start of the diff term to keep it the same length as the original.
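To check this against the example (rebuilding the data frame from the table above; the name df is assumed):
# Rebuild the example data and verify the cumulative-sum approach.
df <- data.frame(Team  = c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
                 Event = c("a", "a", "a", "x", "a", "a", "x", "a", "x", "a"))
df$Value <- cumsum(df$Event == "x" | c(1, diff(df$Team)) != 0)
df$Value
# [1] 1 1 2 3 3 4 5 5 6 6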

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know whether the last group of rows in the dataset has fewer than x observations.
For example:
If I use a dataset with 10 observations and 2 variables and I want to group by every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3: every three rows get tagged with consecutive serial numbers.
A quick and dirty way to check how incomplete the final group is: look at the remainder when nrow is divided by the group size, nrow(df) %% 3  # change the divisor to your group size
Assuming your data is df, you can do:
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
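As a quick check (a sketch; the example table above appears to be the first ten rows of the built-in cars data set):
# Ten rows is not an exact multiple of 3, so the last group is incomplete.
df <- head(cars, 10)
df$newcol <- rep(1:ceiling(nrow(df) / 3), each = 3)[1:nrow(df)]
df            # newcol is 1 1 1 2 2 2 3 3 3 4
nrow(df) %% 3 # remainder 1: the final group has only one row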

From a set of pairs, find all subsets s.t. no pair in the subset shares any element with a pair not in the subset

I have a set of pairs. Each pair is represented as [i, 1:2]; that is, the ith pair consists of the numbers in the first and second columns of the ith row.
I need to sort these pairs into distinct groups, such that no element of any pair in the jth group appears in any other group. For example:
EXAMPLE 1: DATA
> col1 <- c(3, 4, 6, 7, 10, 8)
> col2 <- c(6, 7, 3, 4, 3, 1)
>
> dat <- cbind(col1, col2)
> rownames(dat) <- 1:nrow(dat)
>
> dat
col1 col2
1 3 6
2 4 7
3 6 3
4 7 4
5 10 3
6 8 1
For all pairs, it doesn't matter whether a number is in column 1 or column 2; the pairs should be sorted into groups such that every number in every pair appears in only one group. The solved example would look like this:
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 3
Rows 1, 3, and 5 are grouped together because rows 1 and 3 contain the same numbers, and row 5 shares the number 3, so it must be grouped with them. Rows 2 and 4 share the same distinct numbers, so they are grouped together, and row 6 has numbers seen nowhere else, so it is left on its own.
If we change the data slightly, note the following.
EXAMPLE 2: NEW DATA
Note what happens when we add a row that shares an element with row 6 and row 5.
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 1
7 1 10 1
The 10 in the 7th row connects it to the first group because it shares an element with the 5th row. It also shares an element with the 6th row (the number 1), so the 6th row now ends up in group 1 as well.
PROBLEM
Is there a simple way to form the groups? A vector operation? A sorting algorithm? It gets very nasty very quickly if you try to do it with a loop, since each subsequent row can change the membership of earlier rows, as demonstrated in the example.
To draw on the old answer at "identify groups of linked episodes which chain together", which assigns a group to each individual value, you could try this to assign a group to each linked pair:
library(igraph)
g <- graph_from_data_frame(dat)
links <- data.frame(col1=V(g)$name,group=components(g)$membership)
merge(dat,links,by="col1",all.x=TRUE,sort=FALSE)
# col1 col2 group
#1 3 6 1
#2 4 7 2
#3 6 3 1
#4 7 4 2
#5 10 3 1
#6 8 1 3
Your elements can be regarded as vertices in an undirected graph, and your pairs can be regarded as edges, and then (assuming that you want to find groups of minimal size -- if you don't, then e.g. the entire set of pairs could be labelled "Group 1") the groups you're looking for are the connected components in this graph. They can all be found in linear time with a depth-first or breadth-first search.
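If you would rather not depend on igraph, here is a sketch (mine, for illustration only) of the same idea with a plain breadth-first search over the pairs:
# A plain-R sketch: label connected components of the pairs with a
# breadth-first search, then give each row the label of its first element.
components_from_pairs <- function(pairs) {
  verts <- unique(as.vector(pairs))
  comp  <- setNames(rep(NA_integer_, length(verts)), verts)
  k <- 0L
  for (v in verts) {
    if (!is.na(comp[as.character(v)])) next   # already labelled
    k <- k + 1L                               # start a new component
    queue <- v
    while (length(queue) > 0) {
      u <- queue[1]; queue <- queue[-1]
      if (!is.na(comp[as.character(u)])) next
      comp[as.character(u)] <- k
      # enqueue every element that shares a pair with u
      hits  <- pairs[, 1] == u | pairs[, 2] == u
      queue <- c(queue, setdiff(as.vector(pairs[hits, , drop = FALSE]), u))
    }
  }
  comp
}

comp   <- components_from_pairs(dat)
groups <- unname(comp[as.character(dat[, 1])])
cbind(dat, groups)   # reproduces the groups column shown above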
