I am trying to get random samples of different sizes from a data frame.
For example, the first sample should have only 8 observations,
the 2nd sample can have 10 observations,
and the 3rd can have 12 observations.
df[sample(nrow(df), 10), ]
This gives me a fixed 10 observations every time I take a sample.
In an ideal case, I have 100 observations, and these observations should be placed in 3 groups without replacement, where each group can have any number of observations. For example, group 1 has 45 observations, group 2 has 20 observations, and group 3 has 35 observations.
Any help will be appreciated
You could try using replicate:
times_to_sample = 5L
NN = nrow(df)
replicate(times_to_sample, df[sample(NN, sample(5:10, 1L)), ], simplify = FALSE)
This will return a list of length times_to_sample, the ith element of which will give you a data.frame with the result for the ith replication.
simplify=FALSE prevents simplify2array from mangling the results into a not-particularly-useful matrix.
You should also consider adding some robustness checks -- for example, the code above draws between 5 and 10 rows, but in generalizing this to be from a to b rows, you'll want to ensure a >= 1 and b <= nrow(df).
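A hedged sketch of such a check (a and b here stand in for whatever generalized bounds you pick; they are not defined in the answer above) could be:
# illustrative only: a and b are the generalized lower/upper sample sizes
stopifnot(a >= 1, b <= NN, a <= b)
replicate(times_to_sample, df[sample(NN, sample(a:b, 1L)), ], simplify = FALSE)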
If times_to_sample is going to be large, it'll be more efficient to get all of the samples from 5:10 up front instead:
idx = sample(5:10, times_to_sample, replace = TRUE)
lapply(idx, function(i) df[sample(NN, i), ])
A little less readable, but more efficient than repeatedly calling sample(5:10, 1), i.e. one draw at a time (not leveraging vectorization).
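The question also mentions splitting all 100 observations into 3 groups without replacement; a minimal sketch of that idea (assuming df has 100 rows and the chosen group sizes sum to nrow(df)) could be:
# sketch only: every row of df ends up in exactly one of the 3 groups
sizes <- c(45, 20, 35)              # must sum to nrow(df); could themselves be drawn at random
shuffled <- df[sample(nrow(df)), ]  # shuffle the rows once
groups <- split(shuffled, rep(seq_along(sizes), times = sizes))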
Related
I have 400 rows that have a bunch of columns, with the last five being: a,b,c,d,e
For each row, I want to randomly select three of the above 5 variables and take their row mean to make a column trio_average, with the other two making pair_average.
For example, one row might be the mean of b,d,e for column "trio_average" and the mean of a,c for "pair_average", and the next might be the mean of a,c,e and b,d.
I did this in a pretty roundabout way... I used the randomizr package to generate a variable called trio_set with 400 random (conditionally random, to keep the counts equal) trios of the 5 variables. There are 10 possible combinations of the 5 variables, so I have 40 each of, for example, "a_c_e", "b_c_d", etc.
Then, I used a series of 10 if_else statements:
data <- transform(data, trio_average = ifelse(trio_set == "a_b_c", rowMeans(data[c("a","b","c")]),
                                       ifelse(trio_set == "a_b_d", rowMeans(data[c("a","b","d")]), ....
I would then do this another 10 times for the pairs.
This does get the job done but in reality, my column names are much longer than e.g. "a" so my code in the end is pretty bad looking and inefficient. Is there a better way to do this?
Using base R, we can use a row-wise apply:
cols <- c('a', 'b', 'c', 'd', 'e')
df$trio_average <- apply(df[cols], 1, function(x) mean(sample(x, 3), na.rm = TRUE))
Select the specific columns you are interested in and for each row randomly select 3 values and take their mean.
To get the mean of the values which were not selected, we can store the indices of the sampled values and use them to compute both means for each row.
df[c('chosen', 'remaining')] <- t(apply(df[cols], 1, function(x) {
  inds <- sample(seq_along(x), 3)
  c(mean(x[inds]), mean(x[-inds]))
}))
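For a self-contained illustration, here is the same idea on made-up data (the column names a-e and the 400 rows mirror the question; trio_average/pair_average are just the requested output names):
# toy data only, to make the snippet above reproducible end to end
set.seed(42)
df <- data.frame(matrix(rnorm(400 * 5), ncol = 5,
                        dimnames = list(NULL, c('a', 'b', 'c', 'd', 'e'))))
cols <- c('a', 'b', 'c', 'd', 'e')
df[c('trio_average', 'pair_average')] <- t(apply(df[cols], 1, function(x) {
  inds <- sample(seq_along(x), 3)    # the randomly chosen trio
  c(mean(x[inds]), mean(x[-inds]))   # trio mean, then pair mean
}))
head(df)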
Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(200,200,3000), x2=c(200,200,1000), y=c(100,100,3))
So I have 3 observations.
Now I want to select 2 observations randomly out of d three times (without repetition - there are three combinations in total). For each of these three times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get three numbers, which I want to calculate an average of. Can someone show how to do this with a loop function in R?
Example:
There are three combinations in total, so in this case I can only get the same results. If I do the calculation manually, I will get the following three results:
0.977 0.977 1
(The result could of course be in a another order).
And the mean of these three numbers is:
0.984
This is a simple example. In my case I have a lot of combinations, where I don't select all of the combinations (e.g. there could be say 1,000,000 combinations, where I only select 1,000 of them).
I think it's better if you use sample.int and replicate instead of enumerating all the combinations; see my example:
nsample <- 2 # Number of selected observations
nboot <- 10 # Number of times you repeat the process
replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
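Since the question ultimately wants the average of those numbers, the replicate() result can simply be stored and averaged:
res <- replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
                             weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
mean(res)  # average over the nboot resamples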
I have also checked the link you mention regarding this issue. If I got it right, you want to extract two rows (observations) each time without replacement; for that you can use sample:
SelObs <- sample(1:nrow(d),2)
# for getting the selected observations just
dSel <- d[SelObs,]
And then do your calculations
If you want the already selected observations to be excluded from the next random selection, it is similar, but you need an index:
Obs <- 1:nrow(d)
SelObs <- sample(Obs, 2)
dSel <- d[SelObs, ]
# and now, remove those already selected from the pool
Obs <- setdiff(Obs, SelObs)
# and keep going with next random selections and the above code
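Putting those pieces together, a purely illustrative loop (reusing the dea() calculation from the question) that keeps drawing pairs until the pool is exhausted might look like:
# sketch only: repeatedly draw 2 rows, never reusing an observation
Obs <- 1:nrow(d)
results <- c()
while (length(Obs) >= 2) {
  SelObs <- sample(Obs, 2)
  dSel <- d[SelObs, ]
  results <- c(results, weighted.mean(eff(dea(dSel[c('x1', 'x2')], dSel$y)), dSel$y))
  Obs <- setdiff(Obs, SelObs)  # drop the rows already used
}
mean(results)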
I need to generate a data set which contains 20 observations in each of 3 classes (60 in total) with 50 variables. I have tried to achieve this by using the code below, however it throws a warning and I end up creating only 2 observations of 50 variables.
data = matrix(rnorm(20*3), ncol = 50)
Warning message:
In matrix(rnorm(20 * 3), ncol = 50) :
data length [60] is not a sub-multiple or multiple of the number of columns [50]
I would like to know where I am going wrong, or even if this is the best way to generate a data set, and some explanations of possible solutions so I can better understand how to do this in the future.
The below can probably be done in less than my 3 lines of code but I want to keep it simple and I also want to use the matrix function with which you seem to be familiar:
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20 ) #could use sample instead if you want this to be random as in docendo's answer
#for the matrix of variables x
#you need a matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50 )
#bind the 2 - y will be the first column
mymatrix <- cbind(y,x)
> dim(x) #60 rows , 50 columns
[1] 60 50
> dim(mymatrix) #60 rows, 51 columns after the addition of the y variable
[1] 60 51
Update
I just wanted to be a bit more specific about the warning that you get when you try matrix in your question.
First of all rnorm(20*3) is identical to rnorm(60) and it will produce a vector of 60 values from the standard normal distribution.
When you use matrix it fills it up with values column-wise unless otherwise specified with the byrow argument. As it is mentioned in the documentation:
If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.
And the logical way to infer it is by the equation n * m = number_of_elements_in_matrix, where n and m are the number of rows and columns of the matrix respectively. In your case, your number_of_elements_in_matrix was 60 and the column number was 50. Therefore, the number of rows had to be 60/50 = 1.2 rows. However, a decimal number of rows doesn't make any sense, and thus you get the warning. Since you chose 50 columns, only multiples of 50 will be accepted as the number_of_elements_in_matrix. Hope that's clear!
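To make that concrete, compare the two calls below; the first reproduces the warning and the 2 x 50 result you saw, the second supplies a compatible number of values:
# 60 values into 50 columns: 60/50 = 1.2 rows, so R rounds up to 2 rows, recycles the data and warns
dim(matrix(rnorm(60), ncol = 50))
# [1]  2 50   (with the warning from the question)
# 3000 values into 50 columns: 3000/50 = 60 rows, no warning
dim(matrix(rnorm(3000), ncol = 50))
# [1] 60 50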
I have a set of genetic SNP data that looks like:
Founder1 Founder2 Founder3 Founder4 Founder5 Founder6 Founder7 Founder8 Sample1 Sample2 Sample3 Sample...
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
The size of the matrix is 56 columns by 46482 rows. I need to first bin the matrix by every 20 rows, then compare each of the first 8 columns (founders) to each of columns 9-56, and divide the total number of matching letters/alleles by the total number of rows (20). Ultimately I need 48 matrices (one per sample column), each 8 columns by 2342 rows, which are essentially similarity matrices. I have tried to extract each pair separately with something like:
"length(cbind(odd[,9],odd[,1])[cbind(odd[,9],cbind(odd[,9],odd[,1])[,1])[,1]=="T" & cbind(odd[,9],odd[,1])[,2]=="T",])/nrow(cbind(odd[,9],odd[,1]))"
but this is nowhere near efficient, and I do not know of a faster way of applying the function to every 20 rows and across multiple pairs.
In the example given above, if the rows were all identical like shown across the 20 rows, then the first row of the matrix for Sample1 would be:
1 1 1 0 0 0 0 0
I think this is what you want? It helps to break the problem down into smaller pieces and then repeatedly apply a function to those pieces. My solution takes a few minutes to run on my laptop, but I think it should give you or others a start. If you're looking for better speed, I'd recommend looking at the data.table package. I'm sure there are other ways to make the code below a little faster too.
# Make a data set of random sequences
rows = 46482
cols = 56
binsize = 20
founder.cols = 1:8
sample.cols = setdiff(1:cols,founder.cols)
data = as.data.frame( matrix( sample( c("A","C","T","G"),
rows * cols, replace=TRUE ),
ncol=cols ) )
# Split the data into bins
binlevels = gl(n = ceiling(rows/binsize), k = binsize, length = rows)
databins = split(data,binlevels)
# A function for making a similarity matrix
compare_cols = function(i,j,mat) mean(mat[,i] == mat[,j])
compare_group_cols = function(mat, group1.cols, group2.cols) {
outer( X=group1.cols, Y=group2.cols,
Vectorize( function(X,Y) compare_cols(X,Y,mat) ) )
}
# Apply the function to each bin
mats = lapply( databins, compare_group_cols, sample.cols, founder.cols )
# And just to check. Random sequences should match 25% of the time. Right?
hist( vapply(mats,mean,1), n=30 ) # looks like this is the case
What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of names and salaries; how do I select 3 people such that the sum of their salaries does not exceed some value? I'm just using a while loop, but that seems pretty inefficient.
You could face a combinatorial explosion. This simulates the selection of combinations of 3 EEs from a set of 20, with salaries at a mean of 60 and sd of 20. It shows that, out of the 1140 possible combinations, only 263 have a sum of salaries less than 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
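A rough sketch of that sequential approach (assuming a data frame df with a salary column, a cap limit, and that at least one admissible triple exists) might be:
# illustrative only: draw n cases whose salaries sum to at most `limit`,
# restarting whenever no admissible case is left to pick
sample_constrained <- function(df, n = 3, limit = 150) {
  repeat {
    chosen <- integer(0)
    remaining <- limit
    pool <- seq_len(nrow(df))
    ok <- TRUE
    for (k in seq_len(n)) {
      eligible <- pool[df$salary[pool] <= remaining]
      if (length(eligible) == 0) { ok <- FALSE; break }  # dead end: restart
      pick <- if (length(eligible) == 1) eligible else sample(eligible, 1)
      chosen <- c(chosen, pick)
      remaining <- remaining - df$salary[pick]
      pool <- setdiff(pool, pick)
    }
    if (ok) return(df[chosen, ])
  }
}
As noted above, the inclusion probabilities this produces are generally not uniform across admissible triples.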
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create data frame
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum of each group of 3
sx <- colSums(x)
## keep only the sums below the constraint (here 1)
sxc <- sx[sx < 1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
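If you also need to know which cases make up each admissible group (not just their sums), one variation on the same idea is to enumerate index triples instead; the names below are illustrative:
## all index triples (a 3 x choose(N, 3) matrix of row indices)
idx <- combn(N, 3)
## the salary sum of each group
sums <- apply(idx, 2, function(i) sum(salary[i]))
## groups meeting the constraint
ok <- which(sums < 1)
## sample 10 admissible groups (with replacement); their members are the columns of idx
pick <- sample(ok, 10, replace = TRUE)
idx[, pick]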