How to set a maximum number of replacements when sampling in R?

I'm trying to sample a list of numbers with replacement. I would like to have a hard cap on the number of times a number is chosen. For instance:
x=sample(1:20, 10, replace = TRUE)
[1] 17 5 11 13 5 11 14 11 10 11
In this case the number 11 has a frequency of 4.
Is there a way that I can force that frequency to be 2 or less?

It sounds like you are actually looking for a random assignment of people to houses. This can be done by putting two IDs for each house into a vector, generating a random permutation of that vector, and then assigning each entry to a person.
houseIds <- c(1,2,3,4,5)
houseSamples <- sample(rep(houseIds, 2), 8) #where 8 is the number of people
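The same idea can be mapped back onto the example in the question (values 1:20, a cap of 2, 10 draws). This is only a minimal sketch, and the names max_times and pool are illustrative rather than taken from the answer:

```r
max_times <- 2                          # hard cap on how often a value may appear
pool <- rep(1:20, times = max_times)    # every value appears exactly max_times times
x <- sample(pool, 10)                   # draw without replacement from the pool
table(x)                                # no count can exceed max_times
```

Because the draw is made without replacement from a pool that holds each value only twice, no value can come up more than twice.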

Related

How do I simulate choosing a random player at first, and then repeating that sequence?

I am trying to simulate a game in R. For that I need to choose a random player out of n_players who begins in the first round. Then the other n_players follow in a random order in the first round. However, in the next rounds the same order of players as in the first round must be kept. Does anyone have an idea on how to do this?
Create a sequence of numbers, say n=10, from 1 up to n.
x<-1:10
Think of these as the tag numbers of the players. You can then use R's sample function (read the documentation with ?sample) to create another sequence of the same numbers whose order has been shuffled randomly.
y<-sample(x,10,replace=FALSE)
Now your y variable is the order in which your players are selected one by one.
Also, you can access each individual chosen player just like you choose an element from a vector.
Finally, the vector y is the sequence in which these players are selected in the subsequent rounds.
Test run:
x<-1:10
#[1] 1 2 3 4 5 6 7 8 9 10
y<-sample(x,10,replace=FALSE)
#[1] 2 4 1 8 9 7 5 6 10 3
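To make the "same order in every round" part concrete, here is a small sketch that draws the order once and reuses it. The names n_players, n_rounds, turn_order and schedule are assumptions for illustration, and the seed is set only so the sketch is reproducible:

```r
set.seed(123)                      # only for reproducibility of the sketch
n_players <- 5
n_rounds <- 3
turn_order <- sample(n_players)    # random permutation of 1:n_players, drawn once
# One row per round; the drawn order is recycled, so every row is identical
schedule <- matrix(turn_order, nrow = n_rounds, ncol = n_players, byrow = TRUE)
schedule
```

The first element of turn_order is the randomly chosen starting player, and every later round simply reads the same row again.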

How to create a list of numbers which are multiples of 3 and fall between a specific range? [duplicate]

I am new to R and would appreciate any help with this two-step task.
I need to write R code to create a list of numbers that are multiples of 3 between 1 and 40.
The second part needs code to randomly select 6 numbers from that list.
To generate the multiples of 3, use seq with by=3, then use sample to pick 6 random values from that sequence. I have used set.seed(1) to get reproducible output:
set.seed(1)
sample(seq(3,40,by=3), 6)
#[1] 12 15 21 30 6 24
Here is a step-by-step solution:
# 1. Numbers between 1 and 40
list_numbers <- 1:40
# 2. Logical filter: TRUE where the number is a multiple of 3 (%% is vectorized)
list_filter <- list_numbers %% 3 == 0
# 3. Numbers that are multiples of 3
list_numbers_multiple_3 <- list_numbers[list_filter]
# 4. Select 6 random numbers
sample(list_numbers_multiple_3, 6)

Split data frame into ntiles based on a value equal to the column sum divided by the number of ntiles

I have a data frame with about 45k rows and 3 columns: weight, persons and population, where population is weight*persons. I want to be able to split the data frame into ntiles (deciles, centiles, etc.) as needed. The data frame has to be split so that each ntile contains the same amount of population.
This means the data frame needs to be split at multiples of sum(population)/ntile. For example, if ntile = 10, then sum(population)/10 = a. I then need to add up the values in the population column until the sum reaches a, split at that point, and continue until I have run through all 45k rows. A sample of the data is below.
     weight persons population
 1 3687.926       9  33191.337
 2 3687.926      16 59006.8217
 3 3687.926       7 25815.4847
 4 4420.088       5  22100.447
 5 4420.088       7 30940.6167
 6 4420.088       6 26520.5287
 7 3687.926      15 55318.8927
 8 3687.926       9 33191.3357
 9 3687.926       6 22127.5577
10 4452.829       8 35622.6367
11 4452.829       3 13358.4887
12 4452.829       4 17811.3187
I have been trying to use loops, but I am stuck on splitting the data frame into the n splits needed. I am new to R, so any help is appreciated.
x= df$population
break_point = sum(x)/10
ntile_points = 0
for(i in 1:length(x))
{
while(ntile_points != break_point)
{
ntile_points = ntile_points+x[i]
}
}
I'm not sure this is what you want. Note that your break point is not necessarily an integer, so a running sum will rarely hit it exactly; instead, check which pair of consecutive break points each cumulative sum falls between:
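ntile <- 10
# Running total of population, stored as an extra column
df$Cumsum <- cumsum(df$population)
# Break points: 0, total/ntile, 2*total/ntile, ..., total
s <- seq(0, sum(df$population), sum(df$population)/ntile)
subdfs <- list()
for (i in 2:length(s)) {
  # Keep the rows whose cumulative population lies in (s[i-1], s[i]]
  subdfs <- c(subdfs, list(df[df$Cumsum > s[i-1] & df$Cumsum <= s[i], ]))
}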
subdfs is then a list containing 10 data frames split as you wanted. Access the first data frame with subdfs[[1]], and so on.
The first data frame contains the rows whose cumulative population lies in the interval (0, sum(population)/10], the second contains the following rows, whose cumulative population lies in (sum(population)/10, 2*sum(population)/10], and so on.
Is that what you wanted? If I have misunderstood, let me know.
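As a possible loop-free variant of the same cumulative-sum idea, the group of each row can be computed directly and passed to split. This is a sketch on a small made-up data frame (the values, ntile = 4 and the names cum, total and grp are all assumptions for illustration), not the answerer's code:

```r
# Toy data in the same shape as the question's data frame
df <- data.frame(weight = rep(10, 6), persons = c(1, 2, 3, 4, 5, 5))
df$population <- df$weight * df$persons

ntile <- 4
cum <- cumsum(df$population)        # running total of population
total <- cum[length(cum)]           # overall population
# Row i goes to group k when cum[i] lies in ((k-1)*total/ntile, k*total/ntile];
# pmin guards against a rounding error pushing the last row past ntile
grp <- pmin(ceiling(cum / (total / ntile)), ntile)
# Fixing the factor levels keeps empty ntiles as zero-row data frames
subdfs <- split(df, factor(grp, levels = seq_len(ntile)))
```

Here the break value is 200/4 = 50, so the running totals 10, 30, 60, 100, 150, 200 fall into groups 1, 1, 2, 2, 3, 4.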

Extract 100 sections from a vector

I have a vector of length 1000. It contains (numeric) survey answers of 100 participants, thus 10 answers per participant. I would like to drop the first three values for every participant to create a new vector of length 700 (including only the answers to questions 4-10).
I only know how to extract every n-th value of the vector, but cannot figure how to solve the above problem.
vector <- seq(1,1000,1)
Expected output:
4 5 6 7 8 9 10 14 15 16 17 18 19 20 24 ...
Using a matrix to first structure and then flatten is one method. Another somewhat similar method is to use what I am calling a "logical pattern index":
head( # just showing the first couple of "segments"
vector[ c( rep(FALSE, 3), rep(TRUE, 10-3) ) ],
15)
[1] 4 5 6 7 8 9 10 14 15 16 17 18 19 20 24
This method can also be used inside the two-argument version of [ to select rows or columns using a logical pattern index. This works because R recycles logical indices.
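A small sketch of the two-argument case, using a made-up 20-row matrix (the names m, keep and m_kept are illustrative): the 10-element pattern is recycled down the rows, so rows 1-3 and 11-13 are dropped.

```r
m <- matrix(1:40, nrow = 20)                 # 20 rows, 2 columns
keep <- c(rep(FALSE, 3), rep(TRUE, 10 - 3))  # pattern for one block of 10 rows
m_kept <- m[keep, ]                          # pattern is recycled over all 20 rows
m_kept[, 1]
```

The number of rows should be a multiple of the pattern length, otherwise the last block is only partially covered by the pattern.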
Thanks for providing example data, which makes this thread reproducible. Here is one solution:
c(matrix(vector, 10)[4:10, ])
We first convert the vector to a matrix with 10 rows, so that each column corresponds to one participant. We then use row subsetting to drop the first three rows. Finally, the matrix is flattened back into a vector.
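As a quick sanity check, the two approaches can be run side by side on the example vector from the question; the names by_pattern and by_matrix are introduced here only for the comparison:

```r
vector <- seq(1, 1000, 1)
# Logical pattern index: drop 3, keep 7, recycled over the whole vector
by_pattern <- vector[c(rep(FALSE, 3), rep(TRUE, 10 - 3))]
# Matrix approach: one column per participant, keep rows 4 to 10, then flatten
by_matrix <- c(matrix(vector, 10)[4:10, ])
length(by_pattern)
identical(by_pattern, by_matrix)
```

Both rely on the answers being stored participant by participant, 10 values each.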

Extract multiple data.frames from one with selection criteria

Let this be my data set:
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
split = sample( c('SPLITMEHERE', 'OBS'), 1000, replace=TRUE, prob=c(0.04, 0.96) ))
So, I have some variables (in my case, 15), and criteria by which I want to split the data.frame into multiple data.frames.
My criterion is the following: every other time 'SPLITMEHERE' appears, I want to take all the 'OBS' rows below it and build a data.frame from just these observations. So, if there are 20 'SPLITMEHERE's in the starting data.frame, I want to end up with 10 data.frames.
I know it sounds confusing and as if it doesn't make much sense, but this is the result of extracting raw numbers from an awfully dirty .txt file to obtain meaningful data. Basically, every 'SPLITMEHERE' denotes a new table in this .txt file, but each county is divided into two tables, so I want one table (data.frame) for each county.
In the hope of making it clearer, here is an example of exactly what I need. Let's say the first 20 observations are:
x1 x2 x3 split
1 0.307379064 0.400526799 0.2898194543 SPLITMEHERE
2 0.465236674 0.915204924 0.5168274657 OBS
3 0.063814420 0.110380201 0.9564822116 OBS
4 0.401881416 0.581895095 0.9443995396 OBS
5 0.495227871 0.054014926 0.9059893533 SPLITMEHERE
6 0.091463620 0.945452614 0.9677482590 OBS
7 0.876123151 0.702328031 0.9739113525 OBS
8 0.413120761 0.441159673 0.4725571219 OBS
9 0.117764512 0.390644966 0.3511555807 OBS
10 0.576699384 0.416279417 0.8961428872 OBS
11 0.854786077 0.164332814 0.1609375612 OBS
12 0.336853841 0.794020157 0.0647337821 SPLITMEHERE
13 0.122690541 0.700047133 0.9701538396 OBS
14 0.733926139 0.785366852 0.8938749305 OBS
15 0.520766503 0.616765349 0.5136788010 OBS
16 0.628549288 0.027319848 0.4509875809 OBS
17 0.944188977 0.913900539 0.3767973795 OBS
18 0.723421337 0.446724318 0.0925365961 OBS
19 0.758001243 0.530991725 0.3916394396 SPLITMEHERE
20 0.888036748 0.862066601 0.6501050976 OBS
What I would like to get is this:
data.frame1:
1 0.465236674 0.915204924 0.5168274657 OBS
2 0.063814420 0.110380201 0.9564822116 OBS
3 0.401881416 0.581895095 0.9443995396 OBS
4 0.091463620 0.945452614 0.9677482590 OBS
5 0.876123151 0.702328031 0.9739113525 OBS
6 0.413120761 0.441159673 0.4725571219 OBS
7 0.117764512 0.390644966 0.3511555807 OBS
8 0.576699384 0.416279417 0.8961428872 OBS
9 0.854786077 0.164332814 0.1609375612 OBS
And
data.frame2:
1 0.122690541 0.700047133 0.9701538396 OBS
2 0.733926139 0.785366852 0.8938749305 OBS
3 0.520766503 0.616765349 0.5136788010 OBS
4 0.628549288 0.027319848 0.4509875809 OBS
5 0.944188977 0.913900539 0.3767973795 OBS
6 0.723421337 0.446724318 0.0925365961 OBS
7 0.888036748 0.862066601 0.6501050976 OBS
So the split column only shows me where to split; the data in the rows marked 'SPLITMEHERE' are meaningless. That is no bother, as I can delete these rows later. The point is to separate the data into multiple data.frames based on this criterion.
Obviously, split() on its own or filter() from dplyr wouldn't suffice here. The real problem is that the lines that are supposed to separate the data.frames (i.e. every other 'SPLITMEHERE') do not appear at regular intervals, just as in my example above. Sometimes there is a gap of 3 lines, and other times it could be 10 or 15 lines.
Is there any way to extract this efficiently in R?
The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use a split to get your result.
With that said, you can use cumsum to build the groups. Here I divide the cumsum by 2 and take the ceiling so that each pair of consecutive SPLITMEHERE markers is collapsed into one group. I also use ifelse to assign the SPLITMEHERE rows themselves to group 0:
df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split=="SPLITMEHERE")/2), 0)
res <- split(df, df$group)
The result is a list with a data frame for each group. The element for group 0 holds the SPLITMEHERE rows, which you will want to throw out.
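A variation on the same grouping that never creates the group-0 element is to drop the marker rows before splitting. The sketch below uses a small made-up data frame, and the names is_marker and group are illustrative:

```r
# Toy data: four markers, so two pairs and therefore two resulting data frames
df <- data.frame(
  x1 = 1:10,
  split = c("SPLITMEHERE", "OBS", "OBS", "SPLITMEHERE", "OBS",
            "SPLITMEHERE", "OBS", "OBS", "SPLITMEHERE", "OBS")
)
is_marker <- df$split == "SPLITMEHERE"
# Markers 1 and 2 map to group 1, markers 3 and 4 to group 2, and so on
group <- ceiling(cumsum(is_marker) / 2)
# Split only the OBS rows, so no throwaway group is produced
res <- split(df[!is_marker, ], group[!is_marker])
names(res)
```

With this toy input, rows 2, 3 and 5 end up in the first data frame and rows 7, 8 and 10 in the second.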
