Conditionally assign a value to a random subset of a vector - r

I want to assign a defined value (let's say 1) to a random sample of a subset of a vector that meets certain conditions. I can't seem to make it work.
I have tried this code:
a <- c(1:50)
df <- as.data.frame(a)
df$c <- 0
df$c[sample(x=(df$c[df$a>25]), size = round(NROW(df$c[df$a>25])/5), replace = F)] <- 1
I would just like to randomly set some values of the df$c vector to 1: specifically, a random sample of one fifth of the values of df$c for which df$a is greater than 25 (that would be 5 observations switched to 1).
But so far all of them remain 0 :/
Thanks!

Here's a way with base R -
df$c[sample(which(df$a > 25), sum(df$a > 25)/5)] <- 1
Be aware that this will fail if there is only one value with df$a > 25: sample(x, size) treats a single number x as the range 1:x.
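A quick illustration of that gotcha (my addition, not part of the answer):
sample(30, 1)         # draws from 1:30, not necessarily 30
sample(c(30, 40), 1)  # with two values, draws from the set {30, 40}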
The approach below will not fail in any case but is a bit verbose. Feel free to use whichever suits your needs best, depending on the expected values in df$a -
df$c[which(df$a > 25)[sample(length(which(df$a > 25)), sum(df$a > 25)/5)]] <- 1
Also, note that since replace = F, the sample size sum(df$a > 25)/5 must be <= length(which(df$a > 25)). You can include this condition in your code if you want to make it even safer.
Also, there will be no change if sum(df$a > 25)/5 < 1, so you may want to use size = max(sum(df$a > 25)/5, 1) if you want at least one change.
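Putting both guards together, a minimal sketch (the clamping with max() and min() is my own suggestion, not from the answer):
idx <- which(df$a > 25)
n_avail <- length(idx)
if (n_avail > 0) {
  # at least 1 change, at most as many as are available
  size <- min(n_avail, max(1, floor(n_avail / 5)))
  df$c[idx[sample.int(n_avail, size)]] <- 1   # sample.int avoids the scalar gotcha
}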
Here's a nicer version of my first solution, thanks to @Frank -
df$c <- replace(df$c, sample(w <- which(df$a > 25), length(w)*.2), 1)

Not as elegant as the other solution you have, but here's another way (note this works because the values of df$a coincide with the row numbers in this example):
df <- data.frame('a' = c(1:50), 'c' = rep(0,50))
df$c[sample(
  # subset to sample
  df$a[df$a > 25],
  # sample size
  size = round(length(df$a[df$a > 25])/5, 0),
  # no replacement
  replace = F)] <- 1
Yours didn't work because you sampled from df$c rather than df$a:
df$c[sample(x=( df$c [df$a>25]), size = round(NROW(df$c[df$a>25])/5), replace = F)] <- 1
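Concretely, df$c[df$a > 25] is a vector of 25 zeros, so sample() returns zeros, and a zero index selects nothing. A small demonstration of that no-op (an illustration, not from the answer):
x <- 1:5
x[0] <- 99   # a zero index selects no elements, so nothing is assigned
x
# [1] 1 2 3 4 5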

Related

How to take values in every other row (odd) and shift them to be in every row (even and odd) or switch them to other rows (even)?

In a dataframe, I have values in every other row. I would like them to be in every row of the column. How do I achieve this? An alternative solution would be to switch them so that they are only in even rows, not odd ones.
Example of what it currently looks like:
set.seed(5)
output <- data.frame(matrix("", nrow = 500, ncol = 1))
for (i in 1:nrow(output)) {
  if (i %% 2 == 0) {
    output[i, 1] <- sample(c("A", "B", "C"), 1, replace = T)
  }
}
colnames(output) <- "work"
I would like it to appear as:
set.seed(5)
output <- data.frame(matrix("", nrow = 500, ncol = 1))
for (i in 1:nrow(output)) {
  output[i, 1] <- sample(c("A", "B", "C"), 1, replace = T)
}
colnames(output) <- "work"
Here is what I tried, but data_row_even is not defined:
row_odd <- seq_len(nrow(output)) %% 2
output[row_odd == 1, 1] <- data_row_even
output[row_odd == 0, 1] <- NA
If I understand correctly, create an index that selects the values you'd like
> idx = seq(2, nrow(output), by = 2)
and use idx - 1 to select the rows to be replaced
> output[idx - 1, "work"] = output[idx, "work"]
> head(output)
  work
1    B
2    B
3    C
4    C
5    A
6    A
It's not really clear what you'd like to happen to the last row if there is an odd number of rows.
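If the row count can be odd, one possible convention (a sketch of my own, not part of the answer) is to reuse the previous value for the unpaired last row:
n <- nrow(output)
idx <- seq(2, n, by = 2)
output[idx - 1, "work"] <- output[idx, "work"]
if (n %% 2 == 1) output[n, "work"] <- output[n - 1, "work"]  # carry the last pair forward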
Note that a much more efficient / more idiomatic R way to generate the original data is perhaps
df = data.frame(work = character(500))
idx = seq(2, nrow(df), by = 2)
df[idx, "work"] = sample(c("A", "B", "C"), length(idx), replace = TRUE)

Two Random Numbers Without Repeating

I'm looking to make a set of two random numbers (e.g., [1,2], [3,12]) with the first number between 1-12, and the second between 1-4. I know how to sample the two numbers independently using:
sample(1:12, 1, replace = T)
sample(1:4, 1, replace = T)
but don't know how to create a system to determine if the pairing of the two numbers has already been rolled, and if so, roll again. Any tips!?
Thanks :)
While this doesn't scale happily (in case you need large-scale simulation), you can do this:
set.seed(42)
di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size = 1)), size = 1)
c(di1, di2)
# [1] 1 2
The inner (di1) assignment draws the first number from 1:12, so far so good.
We then remove di1 from 1:4 with setdiff, so that the second draw only has candidates that are not equal to di1.
The outer (di2) assignment therefore samples from 1:4 excluding di1, if di1 happened to fall within 1-4.
While not an authoritative proof of correctness,
rand <- replicate(100000, local({ di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size=1)), size = 1); c(di1, di2); }))
dim(rand)
# [1] 2 100000
any(rand[1,] == rand[2,])
# [1] FALSE
Are you looking for something like:
library(tidyverse)
expand.grid(1:12, 1:4) %>%
  as.data.frame() %>%
  slice_sample(n = 5, replace = FALSE)
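The same idea works in base R, if you prefer (my sketch, not part of the answer above): enumerate all 48 possible pairs once, then sample rows without replacement, so no pairing can ever repeat.
pairs <- expand.grid(first = 1:12, second = 1:4)  # all 48 possible pairs
pairs[sample(nrow(pairs), 5), ]                   # 5 pairs, none repeated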

Subsetting data frame rows in R

I have a data frame with 30 rows and 4 columns (namely, x, y, z, u). It is given below.
mydata = data.frame(x = rnorm(30,4), y = rnorm(30,2,1), z = rnorm(30,3,1), u = rnorm(30,5))
Further, I have a sequence of values, which represent row numbers in my data frame.
myseq = c(seq(1, 30, by = 5))
myseq
[1] 1 6 11 16 21 26
Now, I want to compute a prob value for each segment of rows starting at the values of myseq. For the first segment:
filt = subset(mydata[1:6,], mydata[1:6,]$x < mydata[1:6,]$y & mydata[1:6,]$z < mydata[1:6,]$u)
filt
prob = length(filt$x)/30
prob
Then I need to compute the above prob for rows 1:6, 6:11, ..., 26:30. Here, I have only 6 prob values, so I can do it one by one. If I had 100 values it would be tedious. Is there any way to compute all the prob values?
Thank you in advance.
BTW: in subset(mydata[1:6,], ...), the data frame only needs to appear in the first argument; the condition can use bare column names instead of repeating mydata[1:6,]$, ala
subset(mydata[1:6,], x < y & z < u)
Think about how to do this in a list.
The first step is to break your data into segments at the myseq starting points. I'll start with a list of the index vectors to break it into:
inds <- mapply(seq, myseq, c(myseq[-1], nrow(mydata)), SIMPLIFY = FALSE)
this is now a list of sequences, starting with 1:6, then 6:11, etc. See str(inds) to verify.
Now we can subset a portion of the data based on each element's vector of indices:
filts <- lapply(inds, function(ind) subset(mydata[ind,], x < y & z < u))
We now have a list of data frames; let's summarize it:
results <- sapply(filts, function(filt) length(filt$x)/nrow(mydata))
Bottom line, it helps to think about how to break this problem into lists; examples at http://stackoverflow.com/a/24376207/3358272.
BTW: instead of initially making a list of indices, we could just break up the data in that first step, ala
mydata2 <- mapply(function(a, b) mydata[a:b,], myseq, c(myseq[-1], nrow(mydata)), SIMPLIFY = FALSE)
filts <- lapply(mydata2, function(d) subset(d, x < y & z < u))
results <- sapply(filts, function(filt) length(filt$x)/nrow(mydata))
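Either way, a quick sanity check (my addition) is that results has one proportion per starting point in myseq:
length(results) == length(myseq)
# [1] TRUE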

Generating binary matrix with number of 1's fixed within a range [duplicate]

I want to generate an n x m matrix. Suppose it's 100x3. I want each row to sum to 1 (so two 0's and one 1).
sample(c(0,0,1),3)
will give me 1 row but is there a very fast way to generate the whole matrix without an rbind?
Thank you!
No loops, no transposition. Just create a matrix of zeros and replace one entry per row with 1 by sampling the rows.
m <- matrix(0, 100, 3)
nr <- nrow(m)
m[cbind(1:nr, sample(ncol(m), nr, TRUE))] <- 1
all(rowSums(m) == 1)
# [1] TRUE
# Draw random values and mark each row's maximum with a 1
mat <- matrix(runif(300), ncol = 3)
mat[] <- as.numeric(t(apply(mat, 1, function(r) r == max(r))))
# Or build 100 rows of c(0, 0, 1) and shuffle each row independently
t(apply(t(matrix(rep(c(0, 0, 1), 100), nrow = 3)), 1, function(x) sample(x)))
Since you want a single 1 per row, the problem can be restated as randomly selecting, for each row, the column that gets the 1.
So you can do it like this,
m <- 3; n <- 100
rand_v <- floor(runif(n) * m) + 1   # a random column index for each row
mat <- matrix(0, n, m)
idx <- cbind(1:n, rand_v)           # (row, column) pairs to set to 1
mat[idx] <- 1
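As with the first answer, you can verify the construction (my addition):
all(rowSums(mat) == 1)
# [1] TRUE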
Hope this helps.

Split vector such that means of two groups are nearly equal

I have a data frame df with 2 variables, A and B. I would like to split the rows into groups 1 and 2 by A so that mean(df$B[df$group==1]) is as close as possible to mean(df$B[df$group==2]).
Or, to express it otherwise, what I would like is to find a cut point (cutp) in df$A that minimizes abs(mean(df$B[df$A<cutp]) - mean(df$B[df$A>=cutp]))
Any ideas?
If you want to find a threshold on variable A, to split the data into two groups, so that the means of B in those two groups be similar, you can compute these means for all possible cut-points, and check when the distance between those means is minimal.
# Sample data
n <- 10
d <- data.frame(
  A = rnorm(n),
  B = rnorm(n)
)
# The quantity to minimize
# (You can use a loop instead of apply.)
d$differences <- apply(
  d, 1,
  # Compute the difference of the means for each value of A
  function(u) {
    i <- d$A <= u[1]
    abs(mean(d$B[which(i)]) - mean(d$B[which(!i)]))
  }
)
# The mean of an empty vector is NaN: discard those values
d$differences[!is.finite(d$differences)] <- Inf
# Take the minimum
threshold <- d$A[which.min(d$differences)]
# Build the groups
d$group <- ifelse(d$A <= threshold, "group 1", "group 2")
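To check the result, you can compare the two group means directly (my addition):
tapply(d$B, d$group, mean)  # the closer these two numbers, the better the split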
I'm still not sure how column A factors into it. It seems you want to create a new column with two levels that produce approximately equal mean values of column B. Column A is obviously associated with the new column, but it does not directly factor into the calculation needed. Am I missing something?
Regardless, here's a start (note this can be made much more robust, but as a proof of concept it should work). Define a tolerance that you find acceptable, and then set up a while loop to create new groups until the condition is met, i.e.
FUN <- function(tol){
  df$groups <- sample(1:2, nrow(df), TRUE)
  while (abs(mean(df$B[df$groups == 1]) - mean(df$B[df$groups == 2])) > tol) {
    df$groups <- sample(1:2, nrow(df), TRUE)
  }
  return(df)
}
set.seed(101)
df <- data.frame(A=runif(20),B=runif(20))
# Test it. Means should differ by less than .02 and the groups should have roughly equal sample sizes.
set.seed(101)
out <- FUN(.02)
library(plyr)
> ddply(out, "groups", summarize, n = length(B), mean = mean(B))
  groups  n      mean
1      1 11 0.5229024
2      2  9 0.5037279
I should note that you could create a runaway function if you set tol super low, so don't blame me if your computer crashes.
