R unique multivariate integer sample

I want to generate two integer vectors of length 50, where the values of the first vector range from 0 to 20 and the values of the second from 0 to 100.
Each pair formed by the two vectors must be unique (50 distinct couples of values), but individual values may repeat within a vector: for example, 1 can appear twice in the first vector, or 100 twice in the second.
Bad solution:
Vector A (1, 1, ....)
Vector B (100, 100, ...)
Good solution:
Vector A (1, 1, 2, ... )
Vector B (100 , 99 , 100, ...)
At the moment I'm using:
a = sample(1:20, 50, replace = T)
b = sample(1:100, 50, replace = T)
But of course, I have many non-unique values.

You could increase your sample a bit and only keep the rows that have unique combinations:
library(dplyr)

a = sample(1:20, 100, replace = T)
b = sample(1:100, 100, replace = T)
df <- tibble(a = a, b = b)
df %>%
  distinct() %>%
  slice_head(n = 50)
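If the oversampled draw happens to contain fewer than 50 distinct combinations, a more direct sketch (my own variation, not part of the answer above) is to sample 50 rows without replacement from the full grid of 20 x 100 = 2000 possible pairs, which guarantees uniqueness:

grid <- expand.grid(a = 1:20, b = 1:100)   # every possible (a, b) pair
picked <- grid[sample(nrow(grid), 50), ]   # 50 distinct pairs, base R
# or, with dplyr:
picked <- expand.grid(a = 1:20, b = 1:100) %>% slice_sample(n = 50)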

Related

Two Random Numbers Without Repeating

I'm looking to make a set of two random numbers (e.g., [1,2], [3,12]) with the first number between 1-12, and the second between 1-4. I know how to sample the two numbers independently using:
sample(1:12, 1, replace = T)
sample(1:4, 1, replace = T)
but don't know how to create a system to determine if the pairing of the two numbers has already been rolled, and if so, roll again. Any tips!?
Thanks :)
While this doesn't scale happily (in case you need large-scale simulation), you can do this:
set.seed(42)
di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size = 1)), size = 1)
c(di1, di2)
# [1] 1 2
The inner (di1) assignment draws the first number from 1:12, so far so good.
We then remove di1 from 1:4 (the setdiff) so that the second sampling only has candidates that are not equal to di1;
The outer (di2) assignment samples from 1:4 with di1 excluded, if di1 happened to fall within 1-4.
While not an authoritative proof of correctness,
rand <- replicate(100000, local({ di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size=1)), size = 1); c(di1, di2); }))
dim(rand)
# [1] 2 100000
any(rand[1,] == rand[2,])
# [1] FALSE
Are you looking for something like:
library(tidyverse)
expand.grid(1:12, 1:4) %>%
  as.data.frame() %>%
  slice_sample(n = 5, replace = FALSE)
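Neither answer shows the "roll again if the pairing has already been rolled" loop from the question, so here is a minimal base R sketch of that idea; the seen bookkeeping vector and the roll_unique_pair name are my own, purely illustrative:

seen <- character(0)            # pairings already rolled, stored as "a-b" strings
roll_unique_pair <- function() {
  repeat {
    pair <- c(sample(1:12, 1), sample(1:4, 1))
    key  <- paste(pair, collapse = "-")
    if (!key %in% seen) {
      seen <<- c(seen, key)     # remember this pairing for next time
      return(pair)
    }
  }
}
roll_unique_pair()              # each call returns a pairing not rolled before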

R Calculate row average, different columns for each row as indicated by another column

I want to calculate the row average of different columns for each row, as indicated by another column. In this example data frame, the column "number" ranges from 1 to 11, and the other 12 columns are named "block_1" through "block_12". For example, if "number" is 5, I want to calculate the row average of columns block_6 through block_12; in other words, the average of the block columns after the one indicated in the "number" column, through block_12. Each ID represents a unique entry, and all rows should be preserved.
How can I achieve this?
n <- 11 ; m <- 11 ; reps <- 12
dff <- as.data.frame(cbind(matrix(sample.int(11, n, replace = TRUE), n, m/n),
                           replicate(reps, sample(1:9, n, replace = TRUE)/10)))
myFun <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
dff$ID <- myFun(11)
dff <- data.table::setnames(dff,
                            old = c('V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12','V13'),
                            new = c('number','block_1','block_2','block_3','block_4','block_5','block_6',
                                    'block_7','block_8','block_9','block_10','block_11','block_12'))
Here is one base R option:
cols <- grep('block', names(dff), value = TRUE)
n <- length(cols)
dff$mean_value <- mapply(function(x, y) mean(unlist(dff[x, cols[y:n]])),
                         seq(nrow(dff)), dff$number + 1)
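For comparison, a tidyverse sketch of the same per-row mean, reusing the cols vector defined above (an alternative I am adding, not part of the answer):

library(dplyr)

dff %>%
  rowwise() %>%
  mutate(mean_value = mean(c_across(all_of(cols[(number + 1):length(cols)])))) %>%
  ungroup()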

Replacing Values in Randomly Generated Matrix with Additional Random Numbers

I'm working with a matrix where I need to replace values coded as 1 with new randomly generated numbers.
The starting point is a matrix like this:
set.seed(38921)
p <- matrix(nrow = 10, ncol = 25)
for(i in 1:10){
  p[i, seq(1, floor(runif(1, min = 1, max = 25)), 1)] = 1
}
In the resulting p matrix, on each row I need the value of 1 to be replaced with a randomly generated integer bound between 1 and 25, where the numbers cannot repeat.
For example, on the first row of the matrix, there should be 6 randomly drawn numbers between 1 and 25, where none of the numbers are repeated, and 19 NA columns. On row two, there should be 12 randomly drawn numbers between 1 and 25 with no repeats and 13 NA columns.
Any help is greatly appreciated.
You can simply multiply your matrix by another matrix of random numbers. NA's will remain as NA.
p*matrix(sample(1:25), 10, 25)
Or if the dimensions change:
p*matrix(sample(1:25), nrow(p), ncol(p))
Where you have:
p[i, seq(1, floor(runif(1, min = 1, max = 25)), 1)] = 1
You're assigning to a range of inputs. So instead of assigning 1, you need to assign an appropriately sized vector with the elements you want. This can be generated with: sample(1:25, desiredLength, replace=F)
set.seed(38921)
p <- matrix(nrow = 10, ncol = 25)
for(i in 1:10){
  n = floor(runif(1, min = 1, max = 25))
  p[i, seq(1, n, 1)] = sample(1:25, n, replace=F)
}
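A quick sanity check of the no-repeat requirement, row by row (my own addition, not part of the answer):

# TRUE if no row of p contains a duplicated non-NA value
all(apply(p, 1, function(x) !any(duplicated(na.omit(x)))))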
Assuming you've created your initial matrix, here's one way to do it.
t(apply(p, 1, function(x) ifelse(x == 1, sample(1:25, sum(x[x == 1], na.rm = T), replace = F), NA)))

How can I delete values by column in a data frame?

I need to take abundance values by column without zeros, which is why I used an empty list and a for loop. When I delete [i] in the first line of my loop I get the desired result, but only for the column of total values (the sum per Object); written the way I learned it below, I only obtain an undesired result.
set.seed(1000)
df <- data.frame(Category = sample(LETTERS[1:10]),
                 Object = sample(letters[1:10]),
                 A = sample(0:20, 10, rep = TRUE),
                 B = sample(0:20, 10, rep = TRUE),
                 C = sample(0:20, 10, rep = TRUE))
sincero <- list()
for (i in colnames(df[ , 3:5])){
  sincero[i] = df[df[ , i] != 0, ]
  sincero
}
sincero
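No answer is quoted for this question, so here is a minimal sketch of one way to read the intent: build a named list that holds, for each of the columns A, B and C, the rows of df where that column is non-zero. Using [[ stores each filtered data frame as its own list element (the interpretation of the desired output is my assumption):

sincero <- list()
for (i in colnames(df)[3:5]) {
  sincero[[i]] <- df[df[, i] != 0, ]   # drop rows where column i is zero
}
sincero
# or, more compactly:
sincero <- lapply(df[3:5], function(col) df[col != 0, ])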

for each row in a data frame, find whether there is a "close" row in another data frame

I have the following data frame:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(seq(0, 1, 0.1), 5, replace = T), y = sample(seq(0, 1, 0.1), 5, replace = T), z= sample(seq(0, 1, 0.1), 5, replace = T) )
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- data_frame(x = sample(seq(0, 1, 0.1), 10, replace = T), y = sample(seq(0, 1, 0.1), 10, replace = T), z= sample(seq(0, 1, 0.1), 10, replace = T) )
In this case there is no "neighbor", so I'm supposed to get "FALSE" for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows), so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs, for example I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But got an error.
Any ideas?
You can use package fuzzyjoin
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2, df[1:2,] + 0.01)
df %>%
  fuzzy_left_join(df2, match_fun = function(x, y) y < x + 0.1 & y > x - 0.1) %>%
  mutate(found = !is.na(x.y)) %>%
  select(-4:-6)
# # A tibble: 5 x 4
# x.x y.x z.x found
# <dbl> <dbl> <dbl> <lgl>
# 1 1 0.5 0.5 TRUE
# 2 1 0.8 0.7 TRUE
# 3 0.3 0.1 1 FALSE
# 4 0.9 0.7 0.2 FALSE
# 5 0.7 0.7 0.5 FALSE
Find more info here: Joining/matching data frames in R
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidean distance.
The general approach is to normalize all the attributes: make the range for each column the same, zero to one or negative one to one. That equalizes the effect of columns with large and small values. In more advanced approaches one would also center the adjusted column values on zero. The test criterion is scaled the same way.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidean distance from observation 1 (row 1) to observation 2 (row 2) is sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + ...). Choose your criterion and select.
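As a small illustration of those two steps (min-max scaling and a pairwise Euclidean distance), here is a sketch; it is background only, not the solution to the 0.1-per-column criterion used below, and it assumes no column is constant:

# scale each column of a data frame to [0, 1]
normalize01 <- function(x) (x - min(x)) / (max(x) - min(x))
df_s  <- as.data.frame(lapply(df,  normalize01))
df2_s <- as.data.frame(lapply(df2, normalize01))

# Euclidean distance between row 1 of df_s and row 1 of df2_s
sqrt(sum((df_s[1, ] - df2_s[1, ])^2))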
In your case, the selection criterion is simpler. Two observations are close if no attribute differs by more than 0.1 from the other observation. I assume that df and df2 have the same number of columns in the same order, and I assume that close observations are relatively rare. My approach stops investigating a pair as soon as we discover it is distant. If you have hundreds of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each and four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do the comparisons, with tearful results: R wanted matrices of the same size, and the kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
Here is how I intended to present the results before transferring the results to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix describes the distance from each observation in df (rows of dfclose) to each observation in df2 (columns of dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that all observations in df are close to all observations in df2; the total distance is zero. To that we add a score based on the per-column Manhattan distance: once any column's distance reaches 0.1, the pair is no longer close and we needn't evaluate it any more.
closeCriterion <- function(origin, dest) {
  # distance between the two values in this column
  manhattanDistance <- abs(origin - dest)
  #print(paste("manhattanDistance =", manhattanDistance))
  if (manhattanDistance < .1) ret <- 0 else ret <- 1   # 0 = close, 1 = distant
}
convertScore <- function(x) if (x > 0) FALSE else TRUE
for (j in 1:ncol(df)) {
  print(paste("col =", j))
  for (i in 1:nrow(df)) {
    print(paste("df row =", i))
    for (k in 1:nrow(df2)) {
      # print(paste("df2 row (and dfclose column) =", k))
      distantScore <- closeCriterion(df[i, j], df2[k, j])
      #print(paste("df and dfdist row =", i, " df2 row (and dfclose column) =", k, " distantScore = ", distantScore))
      dfdist[i, k] <- dfdist[i, k] + distantScore
    }
  }
}
dfdist # have a look at the numerical results
dfclose <- matrix(lapply(dfdist, convertScore), ncol = nrow(df2))
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <-matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might consider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row only has a 2.2% chance of matching. The example data is not such a good test case for the algorithm.
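The arithmetic behind that figure, for reference:

0.9 ^ 36   # about 0.0225, i.e. roughly a 2.2% chance that all 36 columns match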
Best of luck
Here's one way to calculate that column without fuzzyjoin:
library(tidyverse)
found <-
  expand.grid(row.df = seq(nrow(df)),
              row.df2 = seq(nrow(df2))) %>%
  mutate(in.range = pmap_lgl(., ~ all(abs(df[.x, ] - df2[.y, ]) <= 0.1))) %>%
  group_by(row.df) %>%
  summarise_at('in.range', any) %>%
  select(in.range)
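If you then want the flag next to the original rows, a short follow-up (my own addition, assuming the found object above):

df %>% mutate(found = found$in.range)
# or, keeping the in.range name from found:
bind_cols(df, found)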
