I want to assign a defined value (let's say 1) to a random sample of a subset of a vector that meets certain conditions. I can't seem to make it work.
I have tried this code:
a <- c(1:50)
df <- as.data.frame(a)
df$c <- 0
df$c[sample(x=(df$c[df$a>25]), size = round(NROW(df$c[df$a>25])/5), replace = F)] <- 1
I would just like to randomly set some of the df$c values to 1: specifically, a random sample of one fifth of the values of df$c for which df$a is greater than 25 (that would be 5 observations switched to 1).
But so far all of them remain 0 :/
Thanks!
Here's a way with base R -
df$c[sample(which(df$a > 25), sum(df$a > 25)/5)] <- 1
Be aware that this will misbehave if only one value satisfies df$a > 25, because sample(x) with a single number x draws from 1:x instead of from x itself.
The approach below will not fail in any case but is a bit verbose. Feel free to use whichever suits your needs best, depending on the expected values in df$a -
df$c[which(df$a > 25)[sample(length(which(df$a > 25)), sum(df$a > 25)/5)]] <- 1
Also note that since replace = F, the sample size sum(df$a > 25)/5 must be <= length(which(df$a > 25)). You can include this condition in your code if you want to make it even safer.
Also, there will be no change if sum(df$a > 25)/5 < 1, so you may want to use size = max(sum(df$a > 25)/5, 1) if you want at least one change.
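For example, a guarded version of the first one-liner that applies both checks (a sketch; idx and n_pick are names I'm introducing):
idx <- which(df$a > 25)                                     # candidate row positions
n_pick <- min(length(idx), max(1, round(length(idx) / 5)))  # at least 1, at most all of them
if (length(idx) > 0) {
  df$c[idx[sample.int(length(idx), n_pick)]] <- 1           # sample.int avoids the single-number surprise
}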
Here's a nicer version of my first line, thanks to @Frank -
df$c <- replace(df$c, sample(w <- which(df$a > 25), length(w)*.2), 1)
Not as elegant as the other solutions you have, but here's another way (note it samples the values of df$a, which happen to coincide with the row positions here because a = 1:50):
df <- data.frame('a' = c(1:50), 'c' = rep(0,50))
df$c[sample(
  # subset to sample
  df$a[df$a > 25],
  # sample size
  size = round(length(df$a[df$a > 25]) / 5, 0),
  # no replacement
  replace = F)] <- 1
Yours didn't work because sample(x = df$c[df$a > 25]) draws from the values of df$c, which are all 0, rather than from row positions, so the assignment becomes df$c[0] <- 1, which changes nothing:
df$c[sample(x=( df$c [df$a>25]), size = round(NROW(df$c[df$a>25])/5), replace = F)] <- 1
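A minimal fix of that line is to sample row positions instead, essentially the which() approach shown above:
idx <- which(df$a > 25)
df$c[sample(idx, size = round(length(idx) / 5), replace = FALSE)] <- 1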
Related
How do you take values that sit in every other row and either fill every row with them, or shift them to the alternate rows? In a data frame, I have values in every other row of a column (the even rows, given the code below). I would like them to be in every row of the column. How do I achieve this? An alternative solution would be to switch them so that they sit only in the odd rows instead.
Example of what it currently looks like:
set.seed(5)
output <- data.frame(matrix("", nrow = 500, ncol = 1))
for (i in 1:nrow(output)) {
  if (i %% 2 == 0) {
    output[i, 1] <- sample(c("A", "B", "C"), 1, replace = T)
  }
}
colnames(output) <- "work"
I would like it to appear as:
set.seed(5)
output <- data.frame(matrix("", nrow = 500, ncol = 1))
for (i in 1:nrow(output)) {
  output[i, 1] <- sample(c("A", "B", "C"), 1, replace = T)
}
colnames(output) <- "work"
This was my attempt, but data_row_even is not defined and I don't know how to create it:
row_odd <- seq_len(nrow(output)) %% 2
output[row_odd == 1, 1] <- data_row_even
output[row_odd == 0, 1] <- NA
If I understand correctly, create an index that selects the values you'd like
> idx = seq(2, nrow(output), by = 2)
and use idx - 1 to select the rows to be replaced
> output[idx - 1, "work"] = output[idx, "work"]
> head(output)
work
1 B
2 B
3 C
4 C
5 A
6 A
It's not really clear what you'd like to happen to the last row if there is an odd number of rows.
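If the row count is odd, one option (an assumption about the desired behavior) is to leave the final odd row empty:
n <- nrow(output)
idx <- seq(2, n, by = 2)                        # even rows carry the values
output[idx - 1, "work"] <- output[idx, "work"]  # copy each value up one row
if (n %% 2 == 1) output[n, "work"] <- NA        # the last odd row has no source below it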
Note that a much more efficient and idiomatic R way to generate the original data is perhaps
df = data.frame(work = character(500))
idx = seq(2, nrow(df), by = 2)
df[idx, "work"] = sample(c("A", "B", "C"), length(idx), replace = TRUE)
I'm looking to make a set of two random numbers (e.g., [1,2], [3,12]) with the first number between 1-12, and the second between 1-4. I know how to sample the two numbers independently using:
sample(1:12, 1, replace = T)
sample(1:4, 1, replace = T)
but don't know how to create a system to determine if the pairing of the two numbers has already been rolled, and if so, roll again. Any tips!?
Thanks :)
While this doesn't scale happily (in case you need large-scale simulation), you can do this:
set.seed(42)
di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size = 1)), size = 1)
c(di1, di2)
# [1] 1 2
The inner (di1) assignment takes the first from 1:12, so far so good.
We then setdiff() di1 out of 1:4 so that the second sampling only has candidates that are not equal to di1;
The outer (di2) assignment samples from 1:4 without di1 if it was within 1-4.
While not an authoritative proof of correctness,
rand <- replicate(100000, local({ di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size=1)), size = 1); c(di1, di2); }))
dim(rand)
# [1] 2 100000
any(rand[1,] == rand[2,])
# [1] FALSE
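Note the above guarantees the two numbers within a single pair differ. If you instead need several rolls with no repeated pair across them, a minimal rejection-sampling sketch of the "roll again" idea (the count of 5 draws is my assumption):
seen <- character(0)
draws <- list()
while (length(draws) < 5) {
  pair <- c(sample(1:12, 1), sample(1:4, 1))   # roll one pair
  key <- paste(pair, collapse = "-")
  if (!key %in% seen) {                        # keep it only if not rolled before
    seen <- c(seen, key)
    draws[[length(draws) + 1]] <- pair
  }
}
do.call(rbind, draws)                          # 5 distinct pairs, one per row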
Are you looking for something like:
library(tidyverse)
expand.grid(1:12, 1:4) %>%
  as.data.frame() %>%
  slice_sample(n = 5, replace = FALSE)
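For reference, a base-R sketch of the same idea: enumerate all 48 possible pairs and sample rows without replacement, which guarantees no duplicate pairs.
pairs <- expand.grid(first = 1:12, second = 1:4)   # all 48 possible pairs
pairs[sample(nrow(pairs), 5), ]                    # 5 distinct pairs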
I have a data frame with 30 rows and 4 columns (namely, x, y, z, u). It is given below.
mydata = data.frame(x = rnorm(30,4), y = rnorm(30,2,1), z = rnorm(30,3,1), u = rnorm(30,5))
Further, I have a sequence of values, which represent row numbers in my data frame.
myseq = c(seq(1, 30, by = 5))
myseq
[1] 1 6 11 16 21 26
Now, I want to compute the prob values for each segment of rows (99 rows per segment in my actual data; the toy example above uses shorter segments).
filt = subset(mydata[1:6,], mydata[1:6,]$x < mydata[1:6,]$y & mydata[1:6,]$z < mydata[1:6,]$u)
filt
prob = length(filt$x)/30
prob
Then I need to compute the above prob for 1:6, ..., 27:30, and so on. Here I have only 6 prob values, so I can do them one by one, but if I had 100 values it would be tedious. Is there any way to compute the prob values programmatically?
Thank you in advance.
BTW: in subset(DF[1:99,], ...), the condition is evaluated within the first argument DF[1:99,], so you don't need to repeat it inside the condition, ala
subset(DF[1:99,], cumsuml < inchivaluel & cumsumr < inchivaluer)
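That is, subset() evaluates its condition inside the data frame you pass it, so bare column names are enough (a minimal illustration using the built-in mtcars data):
# Both return the same rows; the second repeats the data frame needlessly
subset(mtcars[1:10, ], mpg > 20)
subset(mtcars[1:10, ], mtcars[1:10, ]$mpg > 20)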
Think about how to do this with lists.
The first step is to break your data into segments at the va starting points (here va plays the role of your myseq, DF of your mydata, and the cumsuml/inchivaluel condition stands in for your x < y & z < u). I'll start with a list of the indices to break it into:
inds <- mapply(seq, va, c(va[-1] - 1, nrow(DF)), SIMPLIFY=FALSE)
this is now a list of sequences, starting with 1:99, then 100:198, etc. See str(inds) to verify.
Now we can subset a portion of the data based on each element's vector of indices:
filts <- lapply(inds, function(ind) subset(DF[ind,], cumsuml < inchivaluel & cumsumr < inchivaluer))
We now have a list of data frames; let's summarize it:
results <- sapply(filts, function(filt) nrow(filt) / nrow(DF))  # proportion relative to all rows
Bottom line, it helps to think about how to break this problem into lists, examples at http://stackoverflow.com/a/24376207/3358272.
BTW: instead of initially making a list of indices, we could just break up the data in that first step, ala
DF2 <- mapply(function(a, b) DF[a:b,], va, c(va[-1] - 1, nrow(DF)), SIMPLIFY=FALSE)
filts <- lapply(DF2, function(x) subset(x, cumsuml < inchivaluel & cumsumr < inchivaluer))
results <- sapply(filts, function(filt) nrow(filt) / nrow(DF))
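Tying this back to the question's own objects (mydata, myseq), under the assumption that each segment runs from one myseq value up to just before the next, a self-contained sketch:
mydata <- data.frame(x = rnorm(30, 4), y = rnorm(30, 2, 1),
                     z = rnorm(30, 3, 1), u = rnorm(30, 5))
myseq <- seq(1, 30, by = 5)
ends <- c(myseq[-1] - 1, nrow(mydata))              # 5, 10, ..., 30
inds <- mapply(seq, myseq, ends, SIMPLIFY = FALSE)  # 1:5, 6:10, ..., 26:30
probs <- sapply(inds, function(ind) {
  filt <- subset(mydata[ind, ], x < y & z < u)      # rows meeting both conditions
  nrow(filt) / nrow(mydata)                         # proportion out of all 30 rows
})
probs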
I want to generate an n x m matrix. Suppose it's 100x3. I want each row to sum to 1 (so two 0's and one 1 in each row).
sample(c(0,0,1),3)
will give me 1 row but is there a very fast way to generate the whole matrix without an rbind?
Thank you!
No loops, no transposition. Just create a matrix of zeros and replace one entry per row with 1, using matrix indexing with a sampled column for each row.
m <- matrix(0, 100, 3)
nr <- nrow(m)
m[cbind(1:nr, sample(ncol(m), nr, TRUE))] <- 1
all(rowSums(m) == 1)
# [1] TRUE
# Start from random noise and mark each row's maximum with 1, everything else 0
mat <- matrix(runif(300), ncol = 3)
mat[] <- as.numeric(t(apply(mat, 1, function(r) r == max(r))))
# Shuffle the fixed pattern c(0, 0, 1) independently within each of the 100 rows
t(apply(t(matrix(rep(c(0, 0, 1), 100), nrow = 3)), 1, function(x) sample(x)))
Since you want a single 1 in each row, the problem can be restated as picking, for each row, a random column to hold the 1.
So you can do it like this,
m <- 3; n <- 100
rand_v <- floor(runif(n) * 3) + 1   # a random column index in 1..3 for each row
mat <- matrix(0, n, m)
idx <- cbind(1:n, rand_v)           # (row, column) positions to set to 1
mat[idx] <- 1
Hope this helps.
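As a quick self-contained sanity check applicable to any of these constructions (a sketch; sample.int is just an equivalent way to pick one column per row):
m <- 3; n <- 100
mat <- matrix(0, n, m)
mat[cbind(seq_len(n), sample.int(m, n, replace = TRUE))] <- 1  # one 1 per row
stopifnot(all(rowSums(mat) == 1))  # every row sums to exactly 1
colSums(mat)                       # should be roughly n/m per column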
I have a data frame df with 2 variables, A and B. I would like to split it into groups 1 and 2 based on A, so that mean(df$B[df$group==1]) is as close as possible to mean(df$B[df$group==2]).
Or, to express it otherwise: I would like to find a cut point (cutp) in df$A that minimizes abs(mean(df$B[df$A<cutp]) - mean(df$B[df$A>=cutp])).
Any ideas?
If you want to find a threshold on variable A, to split the data into two groups, so that the means of B in those two groups be similar, you can compute these means for all possible cut-points, and check when the distance between those means is minimal.
# Sample data
n <- 10
d <- data.frame(
A = rnorm(n),
B = rnorm(n)
)
# The quantity to minimize
# (You can use a loop instead of apply.)
d$differences <- apply(
  d, 1,
  # Compute the difference of the two group means for each candidate cut point
  function(u) {
    i <- d$A <= u[1]
    abs(mean(d$B[i]) - mean(d$B[!i]))
  }
)
# The mean of an empty vector is NaN: discard those values
d$differences[ ! is.finite( d$differences ) ] <- Inf
# Take the minimum
threshold <- d$A[ which.min( d$differences ) ]
# Build the groups
d$group <- ifelse( d$A <= threshold, "group 1", "group 2" )
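To check the result (a usage sketch):
tapply(d$B, d$group, mean)   # the two group means should be close
table(d$group)               # group sizes, for reference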
I'm still not sure how column A factors into it. It seems you want to create a new column with two levels that yield approximately equal mean values of column B. Column A is obviously associated with the new column created, but does not directly factor into the calculation needed. Am I missing something?
Regardless, here's a start (note this can be made much more robust, but as a proof of concept it should work). Define a tolerance that you find acceptable, then set up a while loop that keeps creating new groups until the condition is met, i.e.
FUN <- function(tol){
df$groups <- sample(1:2, nrow(df), TRUE)
while(abs(mean(df$B[df$groups == 1]) - mean(df$B[df$groups == 2])) > tol) {
df$groups <- sample(1:2, nrow(df), TRUE)
}
return(df)
}
set.seed(101)
df <- data.frame(A=runif(20),B=runif(20))
#Test it. Means should be less than .02 different and have roughly equivalent sample sizes.
set.seed(101)
out <- FUN(.02)
library(plyr)
> ddply(out, "groups", summarize, n = length(B), mean = mean(B))
groups n mean
1 1 11 0.5229024
2 2 9 0.5037279
I should note that you could create a runaway function if you set tol super low, so don't blame me if your computer crashes.
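To avoid that runaway case, a hedged variant with an iteration cap that keeps the best split found so far (FUN_safe and maxit are names I'm introducing):
FUN_safe <- function(df, tol, maxit = 10000) {
  best <- NULL
  best_diff <- Inf
  for (i in seq_len(maxit)) {
    g <- sample(1:2, nrow(df), TRUE)
    diff_i <- abs(mean(df$B[g == 1]) - mean(df$B[g == 2]))
    if (is.finite(diff_i) && diff_i < best_diff) {  # is.finite() guards all-one-group draws
      best <- g
      best_diff <- diff_i
    }
    if (best_diff <= tol) break                     # stop early once within tolerance
  }
  df$groups <- best
  df
}
out <- FUN_safe(df, .02)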