R: Creating Random Samples From Entries in Neighboring Row

I am working with the R programming language.
I have the following data set:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
I want to create a new variable that generates a single random integer for each row based on the corresponding value of "n". I tried to do this with the following code:
my_data$rand = sample.int(my_data$n,1)
But this is not working (the same random number is repeated 5 times).
I also tried to define a function to do this:
my_function <- function(x){sample.int(x,1)}
transform(my_data, new_column= my_function(my_data$n) )
But this is also not working (the same random number is again repeated 5 times).
In the end, I am trying to achieve something like this :
my_data$rand = c(sample.int(15,1), sample.int(3,1), sample.int(51,1), sample.int(8,1), sample.int(75,1))
Can someone please show me how to do this for larger datasets without having to manually specify each "sample.int" command?
Thanks!

When you say "based on value of n" what do you mean by that exactly? Based on n how?
Guess#1: at each row, you want to draw one random number with possible values being 1 to n.
Guess#2: at each row, you want to draw n random numbers for possible values between 0 and 1.
Second option is harder, but option #1 can be done with a loop:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
my_data$rand = NA
set.seed(123)
for(i in 1:nrow(my_data)){
my_data$rand[i] = sample(1:(my_data$n[i]), size = 1)
}
my_data
id n rand
1 1 15 15
2 2 3 3
3 3 51 51
4 4 8 6
5 5 75 67

We can use sapply to go over all rows in my_data, and generate one sample.int per iteration.
my_data$rand <- sapply(1:nrow(my_data), function(x) sample.int(my_data[x, 2], 1))
id n rand
1 1 15 7
2 2 3 2
3 3 51 28
4 4 8 6
5 5 75 9

You can do this efficiently with a single call to runif(), multiplying by n and rounding up. Because n is a vector here, runif(n) returns length(n) values (one per row) rather than n draws:
transform(my_data, rand = ceiling(runif(n) * n))
id n rand
1 1 15 13
2 2 3 1
3 3 51 41
4 4 8 1
5 5 75 9
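The row-by-row idea from the loop and sapply answers can also be sketched with vapply, which checks that every call returns exactly one integer. This is only an illustration on the question's own data; the seed is arbitrary and the drawn values will differ between runs and R versions.

```r
my_data <- data.frame(id = c(1, 2, 3, 4, 5), n = c(15, 3, 51, 8, 75))

set.seed(123)  # arbitrary seed, only so the draw is repeatable

# One call to sample.int() per element of n; each call returns
# a single integer between 1 and k (inclusive).
my_data$rand <- vapply(my_data$n, function(k) sample.int(k, 1L), integer(1))

my_data
```

Iterating over my_data$n directly avoids indexing the data frame by row and column position inside the function.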

Related

Populate column by adding onto row above using lag() in R

I want to populate an existing column with values that continually add onto the row above.
This is easy in Excel, but I haven't figured out a good way to automate it in R.
If we had 2 columns in Excel, A and B, we want cell B2 to =B1+A2, and cell B3 would = B2+A3. How can I do this in R?
#example dataframe
df <- data.frame(A = 0:9, B = c(50,0,0,0,0,0,0,0,0,0))
#desired output
desired <- data.frame(A = 0:9, B = c(NA,51,53,56,60,65,71,78,86,95))
I tried using the lag() function, but it didn't give the correct output.
df <- df %>%
mutate(B = B + lag(A))
So I made a for loop that works, but I feel like there's a better solution.
for(i in 2:nrow(df)){
df$B[i] <- df$B[i-1] + df$A[i]
}
Eventually, I want to iterate this function over every n rows of the whole dataframe, essentially so the summation resets every n rows. (any tips on how to do that would be greatly appreciated!)
This might be close to what you need, and uses tidyverse. Specifically, it uses accumulate from purrr.
Say you want to reset to zero every n rows, you can also use group_by ahead of time.
It was not entirely clear how you'd like to handle the first row; here, it will just use the first B value and ignore the first A value, which looked similar to what you had in the post.
n <- 5
library(tidyverse)
df %>%
group_by(grp = ceiling(row_number() / n)) %>%
mutate(B = accumulate(A[-1], sum, .init = B[1]))
Output
A B grp
<int> <dbl> <dbl>
1 0 50 1
2 1 51 1
3 2 53 1
4 3 56 1
5 4 60 1
6 5 0 2
7 6 6 2
8 7 13 2
9 8 21 2
10 9 30 2
cumsum() can be used to get the result you need.
df$B <- cumsum(df$B + df$A)
df
A B
1 0 50
2 1 51
3 2 53
4 3 56
5 4 60
6 5 65
7 6 71
8 7 78
9 8 86
10 9 95
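For the "reset every n rows" part of the question, a base R sketch along the same lines as the cumsum() answer is possible. It assumes, like the accumulate answer, that each block starts from its own first B value and ignores that row's A; the helper names grp, first and inc are made up for this illustration.

```r
df <- data.frame(A = 0:9, B = c(50, 0, 0, 0, 0, 0, 0, 0, 0, 0))
n <- 5

# Block label for each row: rows 1..n get 0, rows n+1..2n get 1, and so on.
grp <- (seq_len(nrow(df)) - 1) %/% n

# Increments to accumulate: A everywhere, except the first row of each
# block, which contributes its starting B value instead.
first <- !duplicated(grp)
inc <- df$A
inc[first] <- df$B[first]

# Running total that restarts at the beginning of every block.
df$B <- ave(inc, grp, FUN = cumsum)
df
```

With n <- 5 this reproduces the B column shown in the accumulate output (50, 51, 53, 56, 60, then 0, 6, 13, 21, 30).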

Create new column with shared ID to randomly link two rows in R

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1","2","3"))
df_sim <-
tibble::tibble(StudentID = 1:10,
Condition = sample(conditions,
size = 10,
replace = TRUE),
XP = stats::rpois(n = 10,
lambda = 15))
This creates the following tibble.
StudentID Condition XP
1 2 8
2 3 11
3 3 16
4 3 12
5 1 22
6 3 16
7 1 18
8 3 8
9 2 14
10 1 17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to create a variable that is uniquely shared by two students in the dataframe. In other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID Condition XP DyadID
1 2 8 4
2 3 11 1
3 3 16 2
4 3 12 1
5 1 22 3
6 3 16 NA
7 1 18 3
8 3 8 2
9 2 14 4
10 1 17 NA
Note that two students did not receive a pairing, because there was an odd number in condition 1 and condition 3. If there is an odd number, the DyadID can be NA.
Thank you for your help with this!
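Use match to get an id per Condition, with sample for randomness. Note that this gives every student in a condition the same id (the conditions are simply numbered in random order); it does not split students into pairs within a condition.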
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition,sample(unique(Condition))))
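To get actual pairs within each condition, with NA for a leftover student, one possible base R sketch is shown below. It is not the tidyverse approach the question asks for, and the helper names slot, size, pair and key are invented for this illustration. The data frame is typed in from the table in the question rather than regenerated, since sample() results for a given seed can differ between R versions.

```r
# Data copied from the table in the question.
df_sim <- data.frame(
  StudentID = 1:10,
  Condition = factor(c(2, 3, 3, 3, 1, 3, 1, 3, 2, 1)),
  XP = c(8, 11, 16, 12, 22, 16, 18, 8, 14, 17)
)

set.seed(111)  # arbitrary seed, only so the pairing is repeatable

rows <- seq_len(nrow(df_sim))
# Random position of each student within their condition (1..group size).
slot <- ave(rows, df_sim$Condition, FUN = function(i) sample(length(i)))
# Number of students in each row's condition.
size <- ave(rows, df_sim$Condition, FUN = length)

# Positions 1-2 form pair 1, positions 3-4 form pair 2, and so on.
pair <- (slot + 1) %/% 2
# In an odd-sized condition the student in the last position has no partner.
pair[slot == size & size %% 2 == 1] <- NA

# Turn each (Condition, pair) combination into one running id.
key <- ifelse(is.na(pair), NA, paste(df_sim$Condition, pair))
df_sim$DyadID <- match(key, unique(key[!is.na(key)]))
df_sim
```

Each non-missing DyadID then appears exactly twice, always within a single Condition, and the odd student out in conditions 1 and 3 gets NA.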

For Loop Adding Extra Rows to The Data Frame

Hello I am very new to the programming world and data science as well, and I am trying to work my way through it.
I am trying to assign values to a column in a data frame using a for loop, such that the data frame is divided into ten groups and every row in a group is assigned the same rank: rows 1 to 10 get rank 1, rows 11 to 20 get rank 2, and so on. The original dimension of the subset data set is 100 * 6.
My data frame looks like
[Image: Data Frame]
The codes I have written are:
x <- round(nrow(subset) / 10)
a=1
for(j in 1:10){
for(i in a:x){
subset[i, "rank"] = j
}
j = j + 1
a = x + 1
x = x * j
}
However, the loop runs infinitely and keeps on adding additional rows to the data frame. I had to manually stop the loop and the resulting dimension of the subset data frame was 17926 * 6.
Please help me understand where I am going wrong in writing the loop.
P.S. subset is a data frame name and not the subset function in R
Thanks in Advance !!
It might be better for you to start working with vectorized calculations instead of loops. This will help you in the future.
For example:
df <- data.frame(x = 1:100)
df$rank <- (df$x-1)%/%10 + 1
df
results in:
x rank
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 3
22 22 3
23 23 3
24 24 3
25 25 3
How about something like this:
subset$Rank <- ceiling(as.numeric(rownames(subset))/10)
The as.numeric converts the rowname into a number, dividing it by 10 and rounding up should give you what you need? Let me know if I've misunderstood.
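As for why the original loop misbehaves: x = x * j makes the upper bound grow as 10, 20, 60, 240, 1200, ... so from the fourth pass onward the inner loop indexes rows beyond 100, and assigning to a row that does not exist appends it to the data frame. The loop is not strictly infinite, but the bound grows factorially. If a loop is still preferred, a minimal corrected sketch could compute both bounds from j on each pass; subset_df, size, first and last are hypothetical names, and the value column is made-up stand-in data.

```r
# Stand-in for the asker's 100-row data frame.
subset_df <- data.frame(value = rnorm(100))
size <- 10  # rows per rank

subset_df$rank <- NA_integer_
for (j in seq_len(ceiling(nrow(subset_df) / size))) {
  first <- (j - 1) * size + 1               # first row of block j
  last <- min(j * size, nrow(subset_df))    # never run past the last row
  subset_df$rank[first:last] <- j
}

table(subset_df$rank)
```

Because last is capped at nrow(subset_df), no rows are added even when the row count is not a multiple of size.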

group and label rows in data frame by numeric in R

I need to group and label every x observations(rows) in a dataset in R.
I need to know if the last group of rows in the dataset has less than x observations
For example:
If I use a dataset with 10 observations and 2 variables and I want to group by every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3. Every three rows will get tagged in serial numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when nrow is divided by the group size: nrow(df) %% 3 #change the divisor to your group size
assuming your data is df you can do
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
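Both parts of the question (labelling and checking the last group) can be put together in a short sketch. It assumes the example rows are the first 10 rows of R's built-in cars dataset, which matches the speed and dist values shown; x and last_size are names made up for the illustration.

```r
df <- head(cars, 10)  # built-in dataset; first 10 rows match the example
x <- 3                # group size

# Integer division of the zero-based row index gives the group label.
df$newcol <- (seq_len(nrow(df)) - 1) %/% x + 1

# Size of the final group, and whether it is short of x rows.
last_size <- sum(df$newcol == max(df$newcol))
last_is_short <- last_size < x

df
```

Here the fourth group holds a single row, so last_is_short is TRUE; this is the same information as a non-zero nrow(df) %% x.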

How to bin ordered data by percentile for each id in R dataframe

I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id# (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins. I want the 1st bin to be their fastest 20 percent of RTs, the 2nd bin to be their next fastest 20 percent of RTs, and so on. Each bin should have the same number of trials in it (unless the total # of trials is odd).
My current dataframe looks like this:
id RT
7000 225
7000 250
7000 253
7001 189
7001 201
7001 225
I'd like my new dataframe to look like this:
id RT Bin
7000 225 1
7000 250 1
After getting my data to look like this, I will aggregate by id and bin
The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and I would greatly appreciate any help in how to streamline this process. Thanks.
The answer @Chase gave splits the range into 5 groups of equal length (difference of endpoints). What you seem to want is quintiles (5 groups with an equal number in each group). For that, you need the cut2 function in Hmisc
library("plyr")
library("Hmisc")
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))
tmp now has what you want
> tmp
id value hists
1 1 0.19016791 3
2 1 0.27795226 4
3 1 0.74350982 5
4 1 0.43459571 4
5 1 -2.72263322 1
....
95 10 -0.10111905 3
96 10 -0.28251991 2
97 10 -0.19308950 2
98 10 0.32827137 4
99 10 -0.01993215 4
100 10 -1.04100991 1
With the same number in each hists for each id
> table(tmp$id, tmp$hists)
1 2 3 4 5
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 2 2 2
5 2 2 2 2 2
6 2 2 2 2 2
7 2 2 2 2 2
8 2 2 2 2 2
9 2 2 2 2 2
10 2 2 2 2 2
Here's a reproducible example using package plyr and the cut function:
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
ddply(dat, "id", transform, hists = cut(value, breaks = 5))
id value hists
1 1 -1.82080027 (-1.94,-1.41]
2 1 0.11035796 (-0.36,0.166]
3 1 -0.57487134 (-0.886,-0.36]
4 1 -0.99455189 (-1.41,-0.886]
....
96 10 -0.03376074 (-0.233,0.386]
97 10 -0.71879488 (-0.853,-0.233]
98 10 -0.17533570 (-0.233,0.386]
99 10 -1.07668282 (-1.47,-0.853]
100 10 -1.45170078 (-1.47,-0.853]
Pass in labels = FALSE to cut if you want simple integer values returned instead of the bins.
Here's an answer in plain old R.
#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], each = 20) )
#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep( unlist( tapply( df$rt, df$id, quantile )), each = 4)
You'll note that the quantile command can be set to use any quantiles. The defaults return five values (the minimum, the three quartiles, and the maximum), but if you want decile cut points then use
quantile(x, seq(0, 1, 0.1))
in the function above.
The answer above is a bit fragile. It requires equal numbers of RTs per id, and I didn't tell you how to get to the magic number 4 (20 RTs per id divided by the 5 values quantile returns). But it will also run very fast on a large dataset. For a more robust solution, use cut2 from Hmisc:
library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply( unique(df$id), function(x) as.numeric(cut2(df$rt[df$id==x], g = 5)) )) #as.numeric gives bin numbers 1-5 instead of per-id interval labels
This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice.
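If installing Hmisc is not an option, equal-count bins per id can also be sketched in base R from within-id ranks. This is an alternative to the answers above rather than a restatement of them; the data below is simulated with made-up ids and unequal numbers of rows, and ties are broken by order of appearance.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Three hypothetical ids with 10, 12 and 15 response times each.
dat <- data.frame(
  id = rep(7000:7002, times = c(10, 12, 15)),
  RT = round(runif(37, min = 150, max = 600))
)

# Within each id: rank the RTs (1 = fastest), scale the rank to (0, 5],
# and round up, so the fastest fifth lands in bin 1 and the slowest in bin 5.
dat$Bin <- ave(dat$RT, dat$id, FUN = function(rt) {
  ceiling(5 * rank(rt, ties.method = "first") / length(rt))
})

table(dat$id, dat$Bin)
```

When the number of rows for an id is not a multiple of 5, bin sizes within that id differ by at most one trial.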
