What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a data frame of names and salaries: how do I select 3 rows such that the sum of their salaries does not exceed some value? I'm just using a while loop, but that seems pretty inefficient.
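For reference, the naive rejection loop probably looks something like this (my reconstruction, not the asker's code; it assumes the salry data frame with a sals column that is built in the answer below, and a limit of 150):
## redraw triples until the constraint holds
repeat {
  i <- sample(nrow(salry), 3)
  if (sum(salry$sals[i]) < 150) break
}
salry[i, ]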
You could face a combinatorial explosion. The simulation below selects combinations of 3 EEs (employees) from a set of 20, with salaries at a mean of 60 and an SD of 20. Enumerating all 1140 combinations shows that only 263 of them have a sum of salaries below 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20,
                        function(x) paste(sample(letters[1:20], 6),
                                          collapse = "")),
                      sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum(apply(combn(1:NROW(salry), 3), 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EEs, then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
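A sketch of this reject-and-restart procedure (the function and its name are my own illustration; it reuses the salry data from above, with 150 as the example limit):
sample_constrained <- function(df, n = 3, limit = 150) {
  ## note: loops forever if no qualifying group of n exists
  repeat {
    pool <- df
    chosen <- df[0, ]                      # empty frame with the same columns
    budget <- limit
    while (nrow(chosen) < n) {
      ## keep only the cases that still fit under the remaining budget
      pool <- pool[pool$sals < budget, , drop = FALSE]
      if (nrow(pool) == 0) break           # dead end: reject and restart
      j <- sample(nrow(pool), 1)
      budget <- budget - pool$sals[j]
      chosen <- rbind(chosen, pool[j, ])
      pool <- pool[-j, , drop = FALSE]
    }
    if (nrow(chosen) == n) return(chosen)  # sum(chosen$sals) < limit
  }
}
sample_constrained(salry)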
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create data frame
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum of each group
sx <- colSums(x)
## keep only the sums below the constraint (here 1, since the salaries are standard normal)
sxc <- sx[sx < 1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
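If you need the groups themselves rather than just their sums, you can enumerate index triples instead (a variation of my own, not part of the original answer):
## enumerate triples of row indices so the rows can be recovered
idx <- combn(N, 3)
sums <- apply(idx, 2, function(i) sum(salary[i]))
ok <- idx[, sums < 1, drop = FALSE]
## draw one qualifying group of three indices
ok[, sample(ncol(ok), 1)]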
I have a data set of 12.5 million records and I need to randomly select about 2.5 million. However, these individuals are in 55284 groups and I want to keep groups intact.
So basically I want to remove groups until I've got 2.5 million records left OR select groups until I have about 2.5 million individuals.
If this is my data:
data <- data.frame(
id = c(1, 2, 3, 4, 5),
group = c(1, 1, 2, 2, 3)
)
I wouldn't want to remove id 1 and keep id 2; I'd like to either keep them both or discard them both, because they are in the same group (group 1).
So ideally, this function randomly selects a group, counts the individuals and puts them in a data set, then does the same thing again, and keeps counting the individuals until it has about 2.5 million (it is okay to say: if n exceeds 2.5 million, stop putting groups into the new data set).
I haven't been able to find a function and I am not yet skilled enough to put something together myself, unfortunately.
Hope someone can help me out!
Thanks
Too long for a comment, hence answering. Do you need something like this?
#Order data by group so rows with same groups are together
data1 <- data[order(data$group), ]
#Get all the groups in first 2.5M entries
selected_group <- unique(data1$group[1:2500000])
#Subset those groups so you have all groups intact
final_data <- data1[data1$group %in% selected_group, ]
For a random approach, we can use a while loop:
#Get all the groups in the data
all_groups <- unique(data$group)
#Variable to hold row indices
rows_to_sample <- integer()
#While the number of rows to subset is less than 2.5M
while (length(rows_to_sample) <= 2500000) {
#Select one random group; index into all_groups so sample() is never
#fed a single number (which it would treat as 1:n) once one group remains
select_group <- all_groups[sample(length(all_groups), 1)]
#Get rows indices of that group
rows_to_sample <- c(rows_to_sample, which(data$group == select_group))
#Remove that group from the all_groups
all_groups <- setdiff(all_groups, select_group)
}
data[rows_to_sample, ]
Here is a possibility. I demonstrate it using toy data and a threshold of 33 (instead of 2.5 million). First I create the toy group vector:
threshold <- 33
set.seed(111)
mygroups <- rep(1:10, rpois(10, 10))
In this toy example group 1 has 10 individuals, group 2 has 8 individuals and so on.
Now I put the groups in random order and use cumsum to determine when the threshold is exceeded:
x <- cumsum(table(mygroups)[sample(1:10)])
randomgroups <- as.integer(names(x[x <= threshold]))
randomgroups
[1] 1 7 5
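To actually pull out the sampled individuals, keep every element whose group id made the cut (mygroups is the toy vector from above):
sampled <- mygroups[mygroups %in% randomgroups]
length(sampled)   # total individuals; at most the threshold of 33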
I am trying to get random samples of varying sizes from a data frame.
For example, the first sample should have only 8 observations,
the 2nd sample can have 10 observations,
and the 3rd can have 12 observations.
df[sample(nrow(df),10 ), ]
This gives me a fixed 10 observations every time I take a sample.
In the ideal case, I have 100 observations that should be placed in 3 groups without replacement, where each group can have any number of observations. For example, group 1 has 45 observations, group 2 has 20 observations, and group 3 has 35 observations.
Any help will be appreciated
You could try using replicate:
times_to_sample = 5L
NN = nrow(df)
replicate(times_to_sample, df[sample(NN, sample(5:10, 1L)), ], simplify = FALSE)
This will return a list of length times_to_sample, the ith element of which will give you a data.frame with the result for the ith replication.
simplify=FALSE prevents simplify2array from mangling the results into a not-particularly-useful matrix.
You should also consider adding some robustness checks -- for example, you said you want between 5 and 10 rows, but in generalizing this to be from a to b rows, you'll want to ensure a >= 1, b <= nrow(df).
If times_to_sample is going to be large, it'll be more efficient to get all of the samples from 5:10 up front instead:
idx = sample(5:10, times_to_sample, replace = TRUE)
lapply(idx, function(i) df[sample(NN, i), ])
A little less readable, but surely more efficient than repeatedly calling sample(5:10, 1), i.e. drawing only one size at a time and not leveraging vectorization.
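For the ideal case in the question (partitioning all rows into 3 groups without replacement), one possible sketch is to shuffle a label vector and split the data frame on it; the 45/20/35 sizes are the question's example:
sizes <- c(45, 20, 35)                 # must sum to nrow(df)
labels <- sample(rep(seq_along(sizes), times = sizes))
groups <- split(df, labels)            # a list of three data frames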
Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(200,200,3000), x2=c(200,200,1000), y=c(100,100,3))
So I have 3 observations.
Now I want to select 2 observations randomly out of d, three times (without repetition; there are three combinations in total). For each of these three draws I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get three numbers, which I want to calculate an average of. Can someone show how to do this with a loop function in R?
Example:
There are three combinations in total, so in this case I can only get the same set of results. If I do the calculation manually, I will get the following three results:
0.977 0.977 1
(The results could of course come in another order.)
And the mean of these three numbers is:
0.984
This is a simple example. In my case I have a lot of combinations, where I don't select all of the combinations (e.g. there could be say 1,000,000 combinations, where I only select 1,000 of them).
I think it's better if you use sample.int and replicate instead of enumerating all the combinations; see my example:
nsample <- 2 # Number of selected observations
nboot <- 10 # Number of times you repeat the process
replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
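Since the question ultimately asks for the average of these numbers, the replicate result can simply be wrapped in mean():
boot_vals <- replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
  weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
mean(boot_vals)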
I have also checked the link you brought up regarding this issue. If I understood correctly, you want to extract two rows (observations) each time without replacement; you can use sample:
SelObs <- sample(1:nrow(d),2)
# for getting the selected observations just
dSel <- d[SelObs,]
And then do your calculations
If you want the already-selected observations to be excluded from the next random selection, it is similar, but you need an index:
Obs <- 1:nrow(d)
SelObs <- sample(Obs, 2)
dSel <- d[SelObs, ]
# and now, for removing those already selected
Obs <- Obs[-SelObs]
# and keep going with next random selections and the above code
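Wrapped in a loop, the same idea exhausts the data in successive non-overlapping draws (my own wrapper around the code above):
Obs <- 1:nrow(d)
draws <- list()
while (length(Obs) >= 2) {
  SelObs <- sample(Obs, 2)
  draws[[length(draws) + 1]] <- d[SelObs, ]
  Obs <- setdiff(Obs, SelObs)
}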
I have numbers from 1 to 6000 and I want them separated in the manner listed below.
1-10 as "Range1"
10-20 as "Range2"
20-30 as "Range3"
.
.
.
5990-6000 as "Range600".
I want to bin the numbers into ranges with an equal interval of 10, and at the end I want to calculate which range occurs most frequently.
How can we solve this in R?
You can use the cut function; table can then determine the counts in each category, and sort can order them by prevalence.
x <- 1:6000
x2 <- cut(x, breaks=seq(1,6000,by=10), labels=paste0('Range', 1:599))
sort(table(x2), decreasing = TRUE)
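Note that with breaks = seq(1, 6000, by = 10), the value 1 and the values 5992-6000 fall outside the 599 intervals and become NA. A variant that covers the whole 1-6000 range in exactly 600 bins (my adjustment, not part of the original answer):
x2 <- cut(x, breaks = seq(0, 6000, by = 10), labels = paste0("Range", 1:600))
sort(table(x2), decreasing = TRUE)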
There is a maths trick for your question. If you want categories of length 10, round(x/10) will create a category in which 0-5 become 0, 6 to 14 become 1, 15 to 24 become 2, etc. If you instead want the categories 1-10, 11-20, etc., you can use round((x+4.1)/10).
(R rounds halves to even, so round(0.5) = 0 but round(1.5) = 2; using 4.1 instead of 5 keeps boundary values such as 10 or 20 off the .5 mark, so they round into the right category.)
Not the most elegant code but maybe the easiest to understand, here is an example:
# Create randomly 50 numbers between 1 and 60
x = sample(1:60, 50)
# Regroup in a data.frame and add a column count containing the value one for each row
df <- data.frame(x, count=1)
df
# create a new column with the category
df$cat <- round((df$x+4.1)/10)
# If you want it as text:
df$cat2 <- paste("Range",round((df$x+4.1)/10), sep="")
str(df)
# Calculate the number of values in each category
freq <- aggregate(count~cat2, data=df, FUN=sum)
# Get the maximum number of values in the most frequent category(ies)
max(freq$count)
# Get the category(ies) name(s)
freq[freq$count == max(freq$count), "cat2"]
I need to generate a data set which contains 20 observations in each of 3 classes (60 in total) with 50 variables. I have tried to achieve this with the code below, but it throws a warning and I end up creating 2 observations of 50 variables.
data = matrix(rnorm(20*3), ncol = 50)
Warning message:
In matrix(rnorm(20 * 3), ncol = 50) :
data length [60] is not a sub-multiple or multiple of the number of columns [50]
I would like to know where I am going wrong, or even if this is the best way to generate a data set, and some explanations of possible solutions so I can better understand how to do this in the future.
The below can probably be done in fewer than my 3 lines of code, but I want to keep it simple, and I also want to use the matrix function, with which you seem to be familiar:
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20 ) #could use sample instead if you want this to be random as in docendo's answer
#for the matrix of variables x
#you need a matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50 )
#bind the 2 - y will be the first column
mymatrix <- cbind(y,x)
> dim(x) #60 rows , 50 columns
[1] 60 50
> dim(mymatrix) #60 rows, 51 columns after the addition of the y variable
[1] 60 51
Update
I just wanted to be a bit more specific about the warning you get when you call matrix as in your question.
First of all rnorm(20*3) is identical to rnorm(60) and it will produce a vector of 60 values from the standard normal distribution.
When you use matrix it fills it up with values column-wise unless otherwise specified with the byrow argument. As it is mentioned in the documentation:
If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.
And the logical way to infer it is from the equation n * m = number_of_elements_in_matrix, where n and m are the number of rows and columns of the matrix respectively. In your case the number_of_elements_in_matrix was 60 and the column number was 50, so the number of rows would have to be 60/50 = 1.2. A fractional number of rows doesn't make sense, so R rounds up to 2 rows, recycles the 60 values to fill the 100 cells, and issues the warning; that is why you ended up with 2 observations of 50 variables. Since you chose 50 columns, only multiples of 50 will be accepted silently as the number_of_elements_in_matrix. Hope that's clear!
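To see the inference rule concretely (a quick check using the numbers from the question):
dim(matrix(rnorm(60), ncol = 50))    # 2 x 50, with a recycling warning
dim(matrix(rnorm(3000), ncol = 50))  # 60 x 50, since 3000 / 50 = 60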