group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know whether the last group of rows in the dataset has fewer than x observations.
For example:
Suppose I use a dataset with 10 observations and 2 variables and I want to group every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4

df$group <- rep(1:(nrow(df)/3), each = 3)
This works only if the number of rows is an exact multiple of 3. Every three rows get tagged with consecutive numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when nrow is divided by the group size (a result of 0 means the last group is complete):
nrow(df) %% 3 # change the divisor to your group size

Assuming your data is df, you can do:
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
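As a minimal sketch of both parts of the question (labelling every x rows and checking the last group), the following rebuilds the example data by hand. The names x, last_size and last_incomplete are assumptions made for illustration, not part of the original answers.

```r
# Example data from the question (10 rows, 2 variables)
df <- data.frame(
  speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11),
  dist  = c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17)
)
x <- 3  # group size (assumed name)

# Row index divided by the group size, rounded up: 1,1,1,2,2,2,...
df$newcol <- ceiling(seq_len(nrow(df)) / x)

# Number of rows in the last group, and whether it is incomplete
last_size <- sum(df$newcol == max(df$newcol))
last_incomplete <- last_size < x
```

Because the label is computed from the row index, this works whether or not the number of rows is a multiple of the group size.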

Related

R: Creating Random Samples From Entries in Neighboring Row

I am working with the R programming language.
I have the following data set:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
I want to create a new variable that generates a single random integer for each row based on the corresponding value of "n". I tried to do this with the following code:
my_data$rand = sample.int(my_data$n,1)
But this is not working (the same random number is repeated 5 times).
I also tried to define a function to this:
my_function <- function(x){sample.int(x,1)}
transform(my_data, new_column= my_function(my_data$n) )
But this is also not working (the same random number is again repeated 5 times).
In the end, I am trying to achieve something like this :
my_data$rand = c(sample.int(15,1), sample.int(3,1), sample.int(51,1), sample.int(8,1), sample.int(75,1))
Can someone please show me how to do this for larger datasets without having to manually specify each "sample.int" command?
Thanks!
When you say "based on value of n" what do you mean by that exactly? Based on n how?
Guess#1: at each row, you want to draw one random number with possible values being 1 to n.
Guess#2: at each row, you want to draw n random numbers with possible values between 0 and 1.
The second option is harder, but option #1 can be done with a loop:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
my_data$rand = NA
set.seed(123)
for(i in seq_len(nrow(my_data))){
  my_data$rand[i] = sample(1:(my_data$n[i]), size = 1)
}
my_data
id n rand
1 1 15 15
2 2 3 3
3 3 51 51
4 4 8 6
5 5 75 67
We can use sapply to go over all rows in my_data, and generate one sample.int per iteration.
my_data$rand <- sapply(1:nrow(my_data), function(x) sample.int(my_data[x, 2], 1))
id n rand
1 1 15 7
2 2 3 2
3 3 51 28
4 4 8 6
5 5 75 9
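You can do this efficiently with a single call to runif(), multiplying by n, and rounding up (runif(length(n)) draws one uniform number per row, even when the data frame has only one row):
transform(my_data, rand = ceiling(runif(length(n)) * n))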
id n rand
1 1 15 13
2 2 3 1
3 3 51 41
4 4 8 1
5 5 75 9
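The per-row idea can also be sketched with vapply(), which applies sample.int() to each value of n separately and checks that every call returns exactly one integer. This is only an illustrative variant of the answers above; the seed value is arbitrary.

```r
my_data <- data.frame(id = c(1, 2, 3, 4, 5), n = c(15, 3, 51, 8, 75))

set.seed(123)  # arbitrary seed, only for reproducibility

# One independent draw from 1..n for each row; integer(1) enforces
# that each call returns a single integer value.
my_data$rand <- vapply(my_data$n,
                       function(k) sample.int(k, 1),
                       integer(1))
```

Each element of rand lies between 1 and the n of its own row, which is the behaviour the question asks for.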

Create new column with shared ID to randomly link two rows in R

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1","2","3"))
df_sim <-
  tibble::tibble(StudentID = 1:10,
                 Condition = sample(conditions,
                                    size = 10,
                                    replace = TRUE),
                 XP = stats::rpois(n = 10,
                                   lambda = 15))
This creates the following tibble.
StudentID Condition XP
1 2 8
2 3 11
3 3 16
4 3 12
5 1 22
6 3 16
7 1 18
8 3 8
9 2 14
10 1 17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to create a variable that is uniquely shared by two students in the dataframe. In other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID Condition XP DyadID
1 2 8 4
2 3 11 1
3 3 16 2
4 3 12 1
5 1 22 3
6 3 16 NA
7 1 18 3
8 3 8 2
9 2 14 4
10 1 17 NA
Note that two students did not receive a pairing, because there was an odd number in condition 1 and condition 3. If there is an odd number, the DyadID can be NA.
Thank you for your help with this!
Use match to get a unique id for each Condition, and sample for randomness.
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition,sample(unique(Condition))))
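Note that the match() call gives one id per Condition value rather than one id per pair of students. A possible way to get random pairs within each Condition is sketched below in base R (not tidyverse). The data are typed in from the printed table, and the helper pair_within and the intermediate names local_pair and key are hypothetical, introduced only for this illustration.

```r
# Data typed in from the table in the question
df_sim <- data.frame(
  StudentID = 1:10,
  Condition = factor(c(2, 3, 3, 3, 1, 3, 1, 3, 2, 1)),
  XP        = c(8, 11, 16, 12, 22, 16, 18, 8, 14, 17)
)

set.seed(111)  # arbitrary seed, only for reproducibility

# For a group of k students: shuffle 1..k, then map 1,2 -> 1; 3,4 -> 2; ...
# With an odd k the leftover student gets NA.
pair_within <- function(k) {
  slot <- ceiling(sample.int(k) / 2)
  slot[slot > k %/% 2] <- NA
  slot
}

# Pair number within each Condition
local_pair <- ave(seq_len(nrow(df_sim)), df_sim$Condition,
                  FUN = function(i) pair_within(length(i)))

# Combine Condition and pair number into one id that is unique overall
key <- ifelse(is.na(local_pair), NA, paste(df_sim$Condition, local_pair))
df_sim$DyadID <- match(key, unique(key[!is.na(key)]))
```

With this data, conditions 1 and 3 have an odd number of students, so one student in each of them ends up with DyadID equal to NA, as in the desired outcome.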

How to create a dataframe by sampling 1 case (row) from each group in R

I would like to randomly select 1 case (that is, 1 row from a data frame) from each group in R, but I cannot work out how to do it.
My data is structured in long format: 400 cases (rows) clustered within 250 groups (some groups contain only a single case, others 2, 3, 4, 5, or even 6). So what I would like to end up with is a data frame containing 250 rows (with each row representing 1 randomly selected case from each of the 250 groups).
I have the idea that I should use the sample function for this, but I could not work out how to do it. Does anyone have any ideas?
Suppose your data frame X indicates group membership with a variable named "Group," as in this synthetic example:
G <- 8
set.seed(17)
X <- data.frame(Group=sort(sample.int(G, G, replace=TRUE)),
                Case=1:G)
Here is a printout of X:
Group Case
1 2 1
2 2 2
3 2 3
4 4 4
5 4 5
6 5 6
7 7 7
8 8 8
Pick up the first instance of each value of "Group" using the duplicated function after randomly permuting the rows of X:
Y <- X[sample.int(nrow(X)), ]
Y[!duplicated(Y$Group), ]
Group Case
8 8 8
1 2 1
4 4 4
6 5 6
7 7 7
A comparison to X indicates random cases in each group were selected. Repeat these last two steps to confirm this if you like.
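An alternative sketch of the same idea splits the row indices by group and draws one index per group. The data frame is typed in from the printout of X above, and the name pick is an assumption. Indexing with sample.int(length(idx), 1) is used instead of sample(idx, 1) because sample() treats a single number as the range 1..number, which would go wrong for groups that contain only one row.

```r
# Data typed in from the printout of X above
X <- data.frame(Group = c(2, 2, 2, 4, 4, 5, 7, 8), Case = 1:8)

set.seed(17)  # arbitrary seed, only for reproducibility

# For each group, choose one of its row indices at random
pick <- vapply(split(seq_len(nrow(X)), X$Group),
               function(idx) idx[sample.int(length(idx), 1)],
               integer(1))

Y <- X[pick, ]  # one randomly chosen row per group
```

Y has one row for each distinct value of Group, in the order of the group values.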

Creating a new variable in a data frame and changing its values in one step [duplicate]

I have a column which is part of a data frame, df. It is full of integers. Let's say it is the number of houses sold in a day by a realty company. Let's call it df$houses. I want to make a second column called df$quant where the number of houses is categorized, with 0 being 0-2 houses sold in a day, 1 being 3-5 houses, 2 being 6-9 houses and 3 being 10 or more houses. I could do this in two steps.
1) Create the new column df$quant from df$houses:
df$quant <- df$houses
2) Change the values of df$quant:
df$quant[which(df$quant <= 2)] <- 0
etc.
I would like to do this in one step though, making the new variable and filling it with the proper values. Mostly, so I don't have to worry about getting the order of the lines of code in the second step right. It would be more robust.
Could this be done with an if statement?
Thanks a lot.
I would do something like this, using cut:
x <- 1:11
df <- data.frame(x)
# open-ended outer breaks so every value is binned, even when max(x) <= 9
myFunction <- function(x) as.integer(cut(x, c(-Inf, 2, 5, 9, Inf))) - 1
df$new <- myFunction(df$x)
df
x new
1 1 0
2 2 0
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
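For integer counts, the same one-step coding can be sketched with findInterval(), which returns how many of the break points are less than or equal to each value. The break points 3, 6 and 10 are the lower bounds of categories 1, 2 and 3 from the question; the column name houses follows the question.

```r
df <- data.frame(houses = 1:11)

# 0 for values below 3, 1 for 3-5, 2 for 6-9, 3 for 10 and above
df$quant <- findInterval(df$houses, c(3, 6, 10))
```

This creates and fills the new column in a single assignment, so there is no ordering of replacement steps to get wrong.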

working with data in tables in R

I'm a newbie at working with R. I've got some data with multiple observations (i.e., rows) per subject. Each subject has a unique identifier (ID) and has another variable of interest (X) which is constant across each observation. The number of observations per subject differs.
The data might look like this:
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I'd like to find some code that would:
a) Identify the number of observations per subject
b) Identify subjects with at least a certain number of observations (e.g., >= 15 observations)
c) For subjects with at least that number of observations, I'd like to manipulate the X value for each observation (e.g., I might want to subtract 1 from their X value, so I'd like to modify X for each observation to be X-1)
I might want to identify subjects with at least three observations and reduce their X value by 1. In the above, individuals #1 and #3 (ID) have at least three observations, and their X values--which are constant across all observations--are 3 and 8, respectively. I want to find code that would identify individuals #1 and #3 and then let me recode all of their X values into a different variable. Maybe I just want to subtract 1 from each X value. In that case, the code would then give me X values of (3-1=)2 for #1 and 7 for #3, but #2 would remain at X = 4.
Any suggestions appreciated, thanks!
You can use the aggregate function to do this.
a) Say your table is named temp. You can count the observations for each combination of ID and X by using the length function in aggregate (sum would add up the Observation values instead of counting the rows):
tot = aggregate(Observation ~ ID + X, temp, FUN = length)
The output will look like this:
ID X Observation
1 1 3 4
2 2 4 2
3 3 8 3
b) To see the IDs that have at least a certain number of observations (here 3), you can subset the table tot.
vals = tot$ID[tot$Observation >= 3]
Output is:
[1] 1 3
c) To change the values for the IDs found in (b), select the rows whose ID is in vals and update those values.
tot$X[tot$ID %in% vals] = tot$X[tot$ID %in% vals] - 1
The final output for the table is
ID X Observation
1 1 2 4
2 2 4 2
3 3 7 3
To change the original table, you can subset it by the IDs you found
temp$X[temp$ID %in% vals] = temp$X[temp$ID %in% vals] - 1
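For (a), identifying the number of observations per subject, you can use summary on each variable. Note that summary returns per-value counts only for factors, so the ID column needs to be converted first:
summary(factor(temp$ID))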
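The three steps (a) to (c) can also be sketched without an intermediate summary table, by attaching the per-subject count to every row with ave(). The data frame is typed in from the question; the names temp, n_obs, min_obs and X_new are assumptions made for this illustration.

```r
# Data typed in from the question
temp <- data.frame(
  ID          = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  Observation = c(1, 2, 3, 4, 1, 2, 1, 2, 3),
  X           = c(3, 3, 3, 3, 4, 4, 8, 8, 8)
)
min_obs <- 3  # threshold on the number of observations (assumed name)

# (a) number of observations per subject, repeated on each of its rows
temp$n_obs <- ave(temp$Observation, temp$ID, FUN = length)

# (b) subjects with at least min_obs observations
big_ids <- unique(temp$ID[temp$n_obs >= min_obs])

# (c) recode X into a new variable for those subjects only
temp$X_new <- ifelse(temp$n_obs >= min_obs, temp$X - 1, temp$X)
```

Writing the result to X_new keeps the original X intact, which matches the request to recode the values into a different variable.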
