Using rnorm() to generate data sets - r

I need to generate a data set which contains 20 observations in 3 classes (20 observations in each class, 60 in total) with 50 variables. I have tried to achieve this using the code below; however, it throws a warning and I end up creating 2 observations of 50 variables.
data = matrix(rnorm(20*3), ncol = 50)
Warning message:
In matrix(rnorm(20 * 3), ncol = 50) :
data length [60] is not a sub-multiple or multiple of the number of columns [50]
I would like to know where I am going wrong, or even if this is the best way to generate a data set, and some explanations of possible solutions so I can better understand how to do this in the future.

The below can probably be done in fewer than my 3 lines of code, but I want to keep it simple, and I also want to use the matrix function, which you already seem familiar with:
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3), 20) #20 of each class, interleaved; use rep(1:3, each=20) for contiguous blocks, or sample as in docendo's answer if you want the order random
#for the matrix of variables x
#you need a matrix of 50 variables, i.e. 50 columns, and 60 rows (60x50 dimensions = 3000 table cells)
x <- matrix( rnorm(3000), ncol=50 )
#bind the 2 - y will be the first column
mymatrix <- cbind(y,x)
> dim(x) #60 rows, 50 columns
[1] 60 50
> dim(mymatrix) #60 rows, 51 columns after the addition of the y variable
[1] 60 51
Update
I just wanted to be a bit more specific about the warning that you get when you call matrix in your question.
First of all rnorm(20*3) is identical to rnorm(60) and it will produce a vector of 60 values from the standard normal distribution.
When you use matrix, it fills the matrix with values column-wise unless otherwise specified with the byrow argument. As mentioned in the documentation:
If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.
And the logical way to infer it is via the equation n * m = number_of_elements_in_matrix, where n and m are the number of rows and columns of the matrix respectively. In your case number_of_elements_in_matrix was 60 and the column number was 50, so the number of rows would have to be 60/50 = 1.2. A fractional number of rows doesn't make sense, so R rounds the row count up to 2, recycles your 60 values to fill the 2x50 = 100 cells, and issues the warning; that is why you end up with 2 observations of 50 variables. Since you chose 50 columns, only a multiple (or sub-multiple) of 50 will be accepted as the number_of_elements_in_matrix without a warning. Hope that's clear!
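A minimal illustration of that recycling behavior with a smaller example (the values here are chosen just for the demo):
data_small = matrix(1:6, ncol = 4)  #6 values, 4 columns -> rows rounded up to 2, values recycled
Warning message:
In matrix(1:6, ncol = 4) :
data length [6] is not a sub-multiple or multiple of the number of columns [4]
data_small
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    1
[2,]    2    4    6    2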

Related

How to sample with various sample size in R?

I am trying to take random samples of different sizes from a dataframe.
For example, the first sample should have only 8 observations,
the 2nd sample can have 10 observations,
and the 3rd can have 12 observations.
df[sample(nrow(df),10 ), ]
but this gives me a fixed 10 observations every time I take a sample.
Ideally, I have 100 observations that should be placed into 3 groups without replacement, where each group can have any number of observations; for example, group 1 has 45 observations, group 2 has 20, and group 3 has 35.
Any help will be appreciated
You could try using replicate:
times_to_sample = 5L
NN = nrow(df)
replicate(times_to_sample, df[sample(NN, sample(5:10, 1L)), ], simplify = FALSE)
This will return a list of length times_to_sample, the ith element of which will give you a data.frame with the result for the ith replication.
simplify=FALSE prevents simplify2array from mangling the results into a not-particularly-useful matrix.
You should also consider adding some robustness checks -- for example, you said you want between 5 and 10 rows, but in generalizing this to be from a to b rows, you'll want to ensure a >= 1, b <= nrow(df).
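For example, a minimal guard (a and b here are hypothetical names for the bounds, not names from the code above):
a <- 5L; b <- 10L                   # hypothetical bounds on the sample size
stopifnot(a >= 1L, b <= nrow(df))   # fail early if the requested range is impossible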
If times_to_sample is going to be large, it'll be more efficient to get all of the samples from 5:10 up front instead:
idx = sample(5:10, times_to_sample, replace = TRUE)
lapply(idx, function(i) df[sample(NN, i), ])
A little less readable, but surely more efficient than repeatedly calling sample(5:10, 1), i.e. drawing only one size at a time (not leveraging vectorization).
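For the partition case mentioned in the question (every row assigned to one of 3 groups without replacement, with varying group sizes), one possible sketch is to shuffle a vector of group labels and split on it; group_sizes is an assumed input:
group_sizes <- c(45L, 20L, 35L)                  # assumed sizes; must sum to nrow(df)
stopifnot(sum(group_sizes) == nrow(df))
grp <- rep(seq_along(group_sizes), times = group_sizes)
groups <- split(df, sample(grp))                 # list of 3 data.frames, no row reused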

Why is the matrix function showing the number of columns 100

In this example code I define X1 = matrix(rnorm(length(y)*100), nrow = length(y)); I get 97 rows, which is correct, but 100 columns.
When I multiply by 10 instead of 100, as in X1 = matrix(rnorm(length(y)*10), nrow = length(y)), the number of columns is then 10.
I don't know why that is, since I didn't assign any value for the columns.
library(glmnet)
library(ncvreg)
data("prostate");
X=prostate[,1:8];
y=prostate$lpsa; #97 values
X1=matrix(rnorm(length(y)*100), nrow = length(y)); #97x100
nrow(X1); ncol(X1);
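(This is the same dimension inference described in the Update above: rnorm(length(y)*100) produces 97*100 = 9700 values, and with nrow = 97 given, matrix() infers ncol from the data length. A quick check, not part of the original question:)
len <- length(rnorm(97 * 100))  # 9700 values
len / 97                        # inferred column count: 100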

Using R to count patterns in columns

I have a matrix in R containing 1000 columns and 4 rows. Each cell in the matrix contains an integer between 1 and 4. I want to know two things:
1) What is the number of columns that contain a "1", "2", "3", and "4" in any order? Ideally, I would like the code to not require that I input each possible combination of 1,2,3,4 to perform its count.
2) What is the number of columns that contain 3 of the possible integers, but not all 4?
Solution 1
The most obvious approach is to run apply() over the columns and test for the required tabulation of the column vector using tabulate(). This requires first building a factor() out of the column vector to normalize its storage representation to a 1-based integer vector. And since you don't care about order, we must run sort() on the tabulation before comparing it against the expected one.
For the "4 of 4" problem the expected tabulation will be four 1s, while for the "3 of 4" problem the expected tabulation will be two 1s and one 2.
## generate data
set.seed(1L); NR <- 4L; NC <- 1e3L; m <- matrix(sample(1:4,NR*NC,T),NR);
sum(apply(m,2L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x))))));
## [1] 107
sum(apply(m,2L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x))))));
## [1] 545
Solution 2
v <- c(1L,2L,4L,8L);
sum(colSums(matrix(v[m],nrow(m)))==15L);
## [1] 107
v <- c(1L,3L,9L,27L);
s3 <- c(14L,32L,38L,16L,34L,22L,58L,46L,64L,42L,48L,66L);
sum(colSums(matrix(v[m],nrow(m)))%in%s3);
## [1] 545
This is a slightly weird solution; here's how it came about.
I was looking into how to use colSums() or colMeans() to try to find a quick test for columns that have 4 of 4 or 3 of 4 of the possible cell values. The problem is, there are multiple combinations of the 4 values that sum to the same total. For example, 1+2+3+4 == 10, but 1+1+4+4 == 10 as well, so just getting a column sum of 10 is not enough.
I realized that one possible solution would be to change the set of values that we're summing, such that our target combinations would sum to unambiguous values. We can achieve this by spreading out the original set from 1:4 to something more diffuse. Furthermore, the original set of values of 1:4 is perfect for indexing a precomputed vector of values, so this seemed like a particularly logical approach for your problem.
I wasn't sure what degree of diffusion would be required to make unique the sums of the target combinations. Some ad hoc testing seemed to indicate that multiplication by a fixed multiplier would not be sufficient to disambiguate the sums, so I moved up to exponentiation. I wrote the following code to facilitate the testing of different bases to identify the minimal bases necessary for this disambiguation.
tryBaseForTabulation <- function(N,tab,base) {
## make destination value set, exponentiating from 0 to N-1
x <- base^(seq_len(N)-1L);
## make a matrix of unique combinations of the original set (note: the four x arguments hardcode N=4)
g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
## get the indexes of combinations that match the required tabulation
good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
## get the sums of good and bad combinations
hs <- rowSums(g[good,,drop=F]);
ns <- rowSums(g[-good,,drop=F]);
## return the number of ambiguous sums; we need to get zero!
sum(hs%in%ns);
}; ## end tryBaseForTabulation()
The function takes the size of the set (4 for us), the required tabulation (as returned by tabulate()) in sorted order (as revealed earlier, this is four 1s for the "4 of 4" problem, two 1s and one 2 for the "3 of 4" problem), and the test base. This is the result for a base of 2 for the "4 of 4" problem:
tryBaseForTabulation(4L,rep(1L,4L),2L);
## [1] 0
So we get the result we need right away; a base of 2 is sufficient for the "4 of 4" problem. But for the "3 of 4" problem, it takes one more attempt:
tryBaseForTabulation(4L,c(1L,1L,2L),2L);
## [1] 7
tryBaseForTabulation(4L,c(1L,1L,2L),3L);
## [1] 0
So we need a base of 3 for the "3 of 4" problem.
Note that, although we are using exponentiation as the tool to diffuse the set, we don't actually need to perform any exponentiation at solution run-time, because we can simply index a precomputed vector of powers to transform the value space. Unfortunately, indexing a vector with a matrix returns a flat vector result, losing the matrix structure. But we can easily rebuild the matrix structure with a call to matrix(), thus we don't lose very much with this idiosyncrasy.
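A quick illustration of that indexing behavior with toy values (not part of the original solution):
v <- c(1L,2L,4L,8L);             ## precomputed powers of 2
m2 <- matrix(c(1L,3L,2L,4L),2L); ## toy 2x2 matrix of codes
v[m2];                           ## flat vector result; matrix structure lost
## [1] 1 4 2 8
matrix(v[m2],nrow(m2));          ## structure rebuilt
##      [,1] [,2]
## [1,]    1    2
## [2,]    4    8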
The last step is to derive the destination value space and the set of sums that satisfy the problem condition. The value spaces are easy; we can just compute the power sequence as done within tryBaseForTabulation():
2L^(1:4-1L);
## [1] 1 2 4 8
3L^(1:4-1L);
## [1] 1 3 9 27
The set of sums was computed as hs in the tryBaseForTabulation() function. Hence we can write a new similar function for these:
getBaseSums <- function(N,tab,base) {
## make destination value set, exponentiating from 0 to N-1
x <- base^(seq_len(N)-1L);
## make a matrix of unique combinations of the original set (again, the four x arguments hardcode N=4)
g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
## get the indexes of combinations that match the required tabulation
good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
## return the sums of good combinations
rowSums(g[good,,drop=F]);
}; ## end getBaseSums()
Giving:
getBaseSums(4L,rep(1L,4L),2L);
## [1] 15
getBaseSums(4L,c(1L,1L,2L),3L);
## [1] 14 32 38 16 34 22 58 46 64 42 48 66
Now that the solution is complete, I realize that the cost of the vector index operation, rebuilding the matrix, and the %in% operation for the second problem may render it inferior to other potential solutions. But in any case, it's one possible solution, and I thought it was an interesting idea to explore.
Solution 3
Another possible solution is to precompute an N-dimensional lookup table that stores which combinations match the problem condition and which don't. The input matrix can then be used directly as an index matrix into the lookup table (well, almost directly; we'll need a single t() call, since its combinations are laid across columns instead of rows).
For a large set of values, or for long vectors, this could easily become impractical. For example, if we had 8 possible cell values with 8 rows then we would need a lookup table of size 8^8 == 16777216. But fortunately for the sizing given in the question we only need 4^4 == 256, which is completely manageable.
To facilitate the creation of the lookup table, I wrote the following function, which stands for "N-dimensional combinations":
NDcomb <- function(N,f) {
x <- seq_len(N);
g <- do.call(expand.grid,rep(list(x),N));
array(apply(g,1L,f),rep(N,N));
}; ## end NDcomb()
Once the lookup table is computed, the solution is easy:
v <- NDcomb(4L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 107
v <- NDcomb(4L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 545
We can use colSums. Loop over 1:4, convert the matrix to a logical matrix for each value, get the colSums, check whether each is not equal to 0, and sum. This gives the number of columns containing each value:
sapply(1:4, function(i) sum(colSums(m1==i)!=0))
#[1] 6 6 9 5
If we need the number of columns that contain at least one 3 and at least one value that is not 4
sum(colSums(m1!=4)!=0 & colSums(m1==3)!=0)
#[1] 9
data
set.seed(24)
m1 <- matrix(sample(1:4, 40, replace=TRUE), nrow=4)

Using aggregate to get the mean of duplicate rows in a data.frame in r

I have a matrix B that is 10 rows x 2 columns:
B = matrix(c(1:20), nrow=10, ncol=2)
Some of the rows are technical duplicates, and they correspond to the same
number in a list of length 20 (list1).
list1 = c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)
list1 = as.list(list1)
I would like to use this list (list1) to take the mean of any duplicate values for all columns in B such that I end up with a matrix or data.frame with 8 rows and 2 columns (all the duplicates are averaged).
Here is my code:
aggregate.data.frame(B, by=list1, FUN=mean)
And it generates this error:
Error in aggregate.data.frame(B, by = list1, FUN = mean) :
arguments must have same length
What am I doing wrong?
Thank you!
Your data have 2 variables (2 columns), each with 10 observations (10 rows). The function aggregate.data.frame expects the elements in the list to have the same length as the number of observations in your variables. You are getting an error because the vector in your list has 20 values, while you only have 10 observations per variable. So, for example, the following works, because now you have 1 variable with 20 observations, and list1 holds a vector with 20 elements.
B <- 1:20
list1 <- list(B=c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8))
aggregate.data.frame(B, by=list1, FUN=mean)
The code will also work if you give it a matrix with 2 columns and 20 rows.
aggregate.data.frame(cbind(B,B), by=list1, FUN=mean)
I think this answer addresses why you are getting an error. However, I am not sure that it addresses what you are actually trying to do. How do you expect to end up with 8 rows and 2 columns? What exactly would the cells in that matrix represent?

Sample with constraint, vectorized

What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries, how do I select 3 such that their sum does not exceed some value. I'm just using a while loop but that seems pretty inefficient.
You could face a combinatorial explosion. This simulates the selection of combinations of 3 EEs from a set of 20, with salaries at a mean of 60 and sd of 20. It shows that enumerating all 1140 combinations yields only 263 whose sum of salaries is less than 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
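A sketch of that sequential procedure, reusing the salry example data from the answer above (sample_constrained is a hypothetical helper name; it returns NULL on a dead end so the caller can reject and restart):
sample_constrained <- function(df, k = 3L, cap = 150) {
  picked <- integer(0L)
  for (i in seq_len(k)) {
    budget <- cap - sum(df$sals[picked])
    pool <- setdiff(which(df$sals < budget), picked)  # cases still affordable
    if (length(pool) == 0L) return(NULL)              # dead end: reject and restart
    picked <- c(picked, if (length(pool) == 1L) pool else sample(pool, 1L))
  }
  picked
}
repeat { idx <- sample_constrained(salry); if (!is.null(idx)) break }
salry[idx, ]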
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create example data
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum of each group
sx <- colSums(x)
## keep only the sums that satisfy the constraint (here, sum < 1)
sxc <- sx[sx<1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
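If you need the sampled groups themselves rather than just their sums, a small variation on the above (same salary vector and same constraint) is to enumerate index combinations instead of value combinations:
ix <- combn(N, 3)                                        # each column is one group of 3 row indices
ok <- which(colSums(matrix(salary[ix], nrow = 3)) < 1)   # groups meeting the constraint
picks <- ix[, sample(ok, 10, replace = TRUE)]            # 10 sampled groups, one per column (assumes length(ok) > 1)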
