R code to generate random pairs of rows and do simulation

I have a data matrix in R with 45 rows. Each row represents the value of an individual sample. I need to do a trial simulation: I want to pair up samples randomly, calculate their differences, and take a large sampling (maybe 10000) from all the possible permutations and combinations.
This is how I have managed to do it so far:
My data matrix ("data") has 45 rows and 2 columns. I selected 45 rows randomly and subtracted another randomly selected set of 45 rows from them:
n1 <- data[sample(nrow(data), size = 45, replace = FALSE), ] - data[sample(nrow(data), size = 45, replace = FALSE), ]
This gave me a random set of differences of size 45.
I made 50 such vectors (n1 to n50) and combined them with rbind, which gave me a big matrix ("new") of random differences.
Of course, in many rows a sample was paired with itself, so the first and second random draws cancelled out to zero. I removed those rows with the following code:
row_sub <- apply(new, 1, function(row) all(row != 0))  # TRUE for rows with no zero entries
new.remove.zero <- new[row_sub, ]
BUT, is there a cleaner way to do this? A simpler way to generate random pairs of rows, calculate their differences, and bind them together as a new matrix?
Thanks in advance.
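One possible cleaner approach, sketched below under the question's assumptions (a 45 x 2 matrix called data, 10000 pairs): draw the two index vectors directly and redraw any position where a row got paired with itself, so zero-difference rows never appear and no filtering is needed afterwards. This is only a sketch, not a definitive answer.

# Stand-in with the question's dimensions; in practice "data" is the asker's real 45 x 2 matrix
data <- matrix(rnorm(45 * 2), nrow = 45, ncol = 2)

n_pairs <- 10000

# Two independent index draws; repeated pairs are possible, which is fine for a simulation
i <- sample(nrow(data), n_pairs, replace = TRUE)
j <- sample(nrow(data), n_pairs, replace = TRUE)

# Redraw the second index wherever a row was paired with itself
while (any(i == j)) {
  clash <- which(i == j)
  j[clash] <- sample(nrow(data), length(clash), replace = TRUE)
}

# All 10000 differences bound together as one matrix (10000 x 2)
diffs <- data[i, ] - data[j, ]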

Related

Sorry for a dumb question: how do I create a random sample of, say, 10 individuals in R?

How do I create a random sample df of 10 PEOPLE in R?
You can simply use sample. The first argument is the range or list of values from which it will randomly pick; the second is how many values to pick:
sample(1:100, 10)
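If the goal is 10 whole rows of a data frame rather than 10 numbers, the same idea applies to row indices. A small sketch, with a hypothetical people data frame standing in for the real one:

# Hypothetical data frame standing in for the real table of people
people <- data.frame(id = 1:100, age = sample(18:80, 100, replace = TRUE))

# Pick 10 row indices at random (without replacement) and subset
people_sample <- people[sample(nrow(people), 10), ]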

Sample n rows in permutations table resulting in similar element frequencies by column in R

I am working with R and am faced with the following combinatorial problem. The initial situation is a data frame with 512 rows containing all possible triple combinations of the digits 1 to 8:
expand.grid(rep(list(1:8), 3))
Now I would like to sample 420 rows from this data frame so that the frequency of each digit in each column is as similar as possible.
A randomly produced table, generated as below, has digit frequencies that fluctuate considerably depending on chance:
library(dplyr)

expand.grid(rep(list(1:8), 3)) %>%
  filter(row_number() %in% sample(1:nrow(.), 420))
Is there some sort of constraint I can impose in order to obtain frequencies that are as equal as possible?
Edit:
However, the result doesn't have to be random. Is there a way to filter 420 rows with maximally equal frequencies?
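One crude way to get reasonably equal frequencies, sketched below, is a random-restart search: draw many candidate samples of 420 rows, score each by how far the per-column digit counts are from the ideal 420/8 = 52.5, and keep the best. This is only an approximation; a proper constrained or combinatorial design could do better.

grid <- expand.grid(rep(list(1:8), 3))   # the 512 rows from the question
target <- 420 / 8                        # ideal count of each digit per column (52.5)

# Sum of squared deviations of the per-column digit counts from the target
imbalance <- function(rows) {
  sum(sapply(grid[rows, ], function(col) sum((tabulate(col, nbins = 8) - target)^2)))
}

best_rows <- sample(nrow(grid), 420)
best_score <- imbalance(best_rows)
for (k in 1:2000) {                      # more iterations give a more balanced result
  cand <- sample(nrow(grid), 420)
  s <- imbalance(cand)
  if (s < best_score) { best_rows <- cand; best_score <- s }
}
balanced_sample <- grid[best_rows, ]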

Vectorized Euclidean distance between 2 datasets in R

I have 2 data frames (temp_inp & temp) with 5 and 50k rows respectively, both having around 3k columns.
I want to calculate the squared Euclidean distance [(x1-x2)^2; no need to take square roots or divide by nrow, I just need the rows with minimum distance] between all the rows of data frame 1 and data frame 2.
The final output required will have:
rows = rows from temp (50k)
columns = rows from temp_inp (5)
After that, I want to take the 5 rows with minimum distances from data frame 1.
Note that something like rbind & the dist function will not work due to the size of the data. What I tried is this:
temp1 <- matrix(NA, nrow = nrow(temp), ncol = nrow(temp_inp))  # preallocate the result
for (i in 1:nrow(temp_inp)) {
  for (j in 1:nrow(temp)) {
    temp1[j, i] <- sum((temp[j, ] - temp_inp[i, ])^2)
  }
}
This is taking an insane amount of time (8 hours).
I have been racking my brain to find a vectorized version of this code. Please help me if you have any idea, or if you know of any built-in function/package that would help me do so.
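One standard way to vectorize this, assuming both objects can be converted to numeric matrices, uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b, so the whole distance table comes from a single matrix product. A sketch with small stand-in matrices (the real 50k x 3k input needs roughly a gigabyte, so it may have to be processed in row blocks of temp):

# Small stand-ins for the real data (temp: 50k x 3k, temp_inp: 5 x 3k)
temp     <- matrix(rnorm(1000 * 30), nrow = 1000)
temp_inp <- matrix(rnorm(5 * 30), nrow = 5)

A <- as.matrix(temp)
B <- as.matrix(temp_inp)

# Squared Euclidean distances, rows of temp against rows of temp_inp
d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)   # nrow(temp) x nrow(temp_inp)

# Index of the closest row of temp for each row of temp_inp
closest <- apply(d2, 2, which.min)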

Get all possible combinations on a large dataset in R

I have a large dataset with more than 10 million records and 20 variables. I need to get every possible combination of 11 of these 20 variables, and for each combination the frequency should also be displayed.
I have tried count() in the plyr package and the table() function, but both of them are unable to get all possible combinations, since the number of combinations is very high (greater than 2^32) and the size is huge.
Assume the following dataset with 5 variables and 6 observations, and suppose I want all possible combinations of the first three variables whose frequencies are greater than 0.
Is there any other function to achieve this? I am only interested in combinations whose frequency is non-zero.
Thanks!
OK, I think I have an idea of what you require. If you want the count by N categories of rows in your table, you can do so with the data.table package. It will give you the count of all combinations that exist in the table. Simply list the required categories in the by argument:
library(data.table)
DT <- data.table(val = rnorm(1e7), cat1 = sample.int(10, 1e7, replace = TRUE),
                 cat2 = sample.int(10, 1e7, replace = TRUE), cat3 = sample.int(10, 1e7, replace = TRUE))
DT_count <- DT[, .N, by = .(cat1, cat2, cat3)]
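A small follow-up sketch, assuming the DT and DT_count objects from the code above (with data.table already loaded): the result only contains combinations that actually occur in the data, so the non-zero-frequency requirement is met automatically, and for the real problem you would simply list the 11 variable names inside by = .(...).

setorder(DT_count, -N)   # most frequent combinations first; every row already has N > 0
head(DT_count, 10)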

Sampling according to a distribution from a large vector in R

I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be loaded into a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: the values are just integers, and many of them are repeated multiple times.
large.vec <- c(1, 2, 3, 4, 1, 1, 8, 7, 4, 1, ..., 216280)
To create the probabilities of 500k samples across the distribution I will first create the probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to position in the original sequence.
position.vec <- prob.vec*11034432564
The reason I created the position vector is so that I can pick the data point at a specific position after I order the population data.
Now I count the occurrences of each integer value in the population, create a data frame with the integer values and their counts, and also create the interval for each of these values:
integer.values       counts  lw.interval  up.interval
             0  300,000,034            0  300,000,034
             1  169,345,364  300,000,034  469,345,398
             2  450,555,321  469,345,399  919,900,719
...
Now, using the position vector, I identify which interval each position value falls into and, based on that, take the value of that interval.
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference:
Calculate quantiles for large data.
I wanted to know whether there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population.
This process does take a fair amount of time, as the position vector has to be checked against all possible intervals in the data frame. I have parallelized it using RHIPE.
I understand that I am only able to do this because the data can be ordered.
I am not trying to randomly sample here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly reducing 11 billion values to 500k.
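A sketch of the interval lookup itself, under the assumption that the per-value counts have already been collected into a data frame like the one shown above (the names counts_df, value, and counts are hypothetical): cumulative counts give the upper interval bounds, and findInterval maps every position to its value in one vectorized call instead of walking through the intervals.

# Hypothetical counts table, one row per distinct integer (only the first few values shown)
counts_df <- data.frame(value  = c(0, 1, 2),
                        counts = c(300000034, 169345364, 450555321))

total_n <- sum(counts_df$counts)
upper   <- cumsum(counts_df$counts)                    # upper interval bound for each value

# 500k evenly spaced positions across the ordered population (the role of position.vec)
positions <- round(seq(1, total_n, length.out = 500000))

# Vectorized interval lookup: which value does each position fall on?
idx        <- findInterval(positions - 1, upper) + 1
sample.vec <- counts_df$value[idx]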
