Generate random Poisson values in R

How do I randomly generate 50% 1s, 30% 2s, 15% 3s, and the remaining 5% 4s?
Finally, when I do table(x, useNA = "ifany"), it should be:
 1  2  3  4
50 30 15  5
I am not sure how to use rpois to generate this.

Using sample.int.
set.seed(42)
x <- sample.int(n=4, size=1e4, replace=TRUE, prob=c(.5, .3, .15, 1 - sum(c(.5, .3, .15))))
proportions(table(x))
# x
# 1 2 3 4
# 0.5001 0.3001 0.1506 0.0492
If you truly depend on rpois, you would probably need to devise something with optimize; sample.int is the natural tool here.
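If the table must come out exactly as 50/30/15/5 rather than only in expectation, a simple alternative (my suggestion, not part of the original answer) is to shuffle a fixed pool of values:

```r
# Exact counts rather than expected proportions:
# build a pool with exactly 50 1s, 30 2s, 15 3s, 5 4s, then shuffle it.
x <- sample(rep(1:4, times = c(50, 30, 15, 5)))
table(x)
#> x
#>  1  2  3  4
#> 50 30 15  5
```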

Related

Random assignment 1s and 0s with a maximum

I have a dataset with two columns (X and Y) that both contain exactly the same numbers of 0s and 1s:
0 1
3790 654
Now I want column Y to contain exactly 1733 1s and 2711 0s, but the 1079 extra 1s (1733 - 654) must be assigned randomly. I have already tried the following:
ind <- which(df$X == 0)
ind <- ind[rbinom(length(ind), 1, prob = 1079/3790) > 0]
df$Y[ind] <- 1
But every time I run this code, I get a different number of 1s; I want it to be exactly 1733 on each run. How do I do this?
You have this vector:
x <- sample(c(rep(0, 3790), rep(1, 654)))
#> table(x)
#> x
#> 0 1
#> 3790 654
What you need to do is randomly select the positions of 1079 elements of your vector that equal 0 and assign them the value 1:
s <- sample(which(x == 0), 1079)
x[s] <- 1
#> table(x)
#> x
#> 0 1
#> 2711 1733
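Mapped back onto the data frame in the question (a sketch; df$X here is a stand-in rebuilt to match the question's counts):

```r
# Rebuild the question's setup: X has 3790 zeros and 654 ones.
df <- data.frame(X = sample(rep(0:1, c(3790, 654))))
df$Y <- df$X                                # start Y from X's values
df$Y[sample(which(df$Y == 0), 1079)] <- 1   # promote exactly 1079 zeros to 1
table(df$Y)
#>    0    1
#> 2711 1733
```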

Creating balanced groups based on three categorical variables

I'm creating a group assignment for a college class (~180 students) I'm instructing. It's important that these groups be as heterogeneous as possible across three variables: field of study (FOS), sex, and division (i.e., newer/older students).
FOS has 5 levels, sex has 2, division has 2. Given the project, I'd like to create about 8-9 groups. In other words, I'd like groups of approximately 6 with a "good" balance of different fields of study, males/females, and new and older students. I'd then simply post the names with the automated assignments.
The previous instructor did it all by hand. I've tried playing around with R to see if there's a more systematic way of doing this, but have only come up with repeated (and clunky) sorting. I expect the 5 FOS levels to vary in size, so I recognize it will not be a perfect solution. Interested in people's clever solutions. Here's a reproducible sample:
dat <- data.frame(
  student = 1:180,
  gender = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.52, .48)),
                  labels = c("female", "male")),
  division = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.6, .4)),
                    labels = c("lower", "upper")),
  field = factor(sample(LETTERS[1:5], 180, replace = TRUE,
                        prob = c(.26, .21, .35, .07, .11)),
                 labels = c("humanities", "natural science",
                            "social science", "engineer", "other")))
This is what I was playing with, but it mostly increases the randomness of the assignment rather than the balance, as can be seen:
library(dplyr)
dat$rand <- sample(1:180,180)
dat1 <- arrange(dat, field, division, gender, rand)
dat1$grp <- 1:(nrow(dat1)/6) #issue if not divisible
Which does not result in adequate balance:
with(dat1, table(gender, grp)) #as a check
with(dat1, table(field, grp))
with(dat1, table(division, grp))
I know this is an old question, but I had a similar problem today; here's the solution I came up with. Basically, you assign groups randomly, then use a chi-square test (for categorical variables) or ANOVA (for continuous variables) to test for group differences on each variable. You set a threshold below which you do not want the p-values to drop. The code reshuffles the groups until all p-values are above that threshold. If it goes through 10,000 iterations without reaching a grouping solution, it stops and suggests lowering the threshold.
set.seed(905)
#let's say you have a continuous variable you would also like to keep steady across groups
dat$age <- sample(18:35, nrow(dat), replace = TRUE)
dat$group <- rep_len(1:20, length.out = nrow(dat)) #if you wanted to make 20 groups
dat$group <- as.factor(dat$group)
a <- 0.1; b <- 0.1; c <- 0.1; d <- 0.1
thresh <- 0.85 #Minimum threshold for p value
z <- 1
while (a < thresh | b < thresh | c < thresh | d < thresh) {
  dat <- transform(dat, group = sample(group))  # shuffle the groups
  x <- summary(aov(age ~ group, dat))           # ANOVA for continuous variables
  a <- x[[1]]['group','Pr(>F)']
  x <- summary(table(dat$group, dat$gender))    # chi-square for categorical variables
  b <- x[['p.value']]
  x <- summary(table(dat$group, dat$division))
  c <- x[['p.value']]
  x <- summary(table(dat$group, dat$field))
  d <- x[['p.value']]
  z <- z + 1
  if (z > 10000) {
    print('10,000 tries, no solution, reduce threshold')
    break
  }
}
With enough datapoints per combination of the variables, you should be able to do this:
dat <- groupdata2::fold(dat, k = 8,
cat_col = c("gender", "division", "field"))
with(dat, table(gender, .folds))
## .folds
## gender 1 2 3 4 5 6 7 8
## female 11 12 11 12 12 11 12 12
## male 10 11 11 11 11 11 11 11
with(dat, table(field, .folds))
## .folds
## field 1 2 3 4 5 6 7 8
## humanities 5 8 9 7 9 6 6 5
## natural science 2 3 4 6 3 9 2 4
## social science 9 7 6 8 5 6 9 6
## engineer 3 3 2 1 3 0 2 4
## other 2 2 1 1 3 1 4 4
with(dat, table(division, .folds))
## .folds
## division 1 2 3 4 5 6 7 8
## lower 11 15 13 14 10 13 11 15
## upper 10 8 9 9 13 9 12 8
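For completeness, a dependency-free baseline (my sketch, not from either answer): sort by the stratifying variables and deal rows round-robin into the groups, which spreads each category across groups:

```r
# Base-R sketch: sort by strata, then deal 1..8 repeatedly down the rows.
set.seed(1)
dat <- data.frame(
  gender   = sample(c("female", "male"), 180, replace = TRUE),
  division = sample(c("lower", "upper"), 180, replace = TRUE),
  field    = sample(LETTERS[1:5], 180, replace = TRUE)
)
dat <- dat[order(dat$field, dat$division, dat$gender), ]
dat$grp <- rep_len(1:8, nrow(dat))  # round-robin assignment
table(dat$grp)                      # group sizes differ by at most 1
```

Because each sorted stratum forms a contiguous block, round-robin dealing gives every group a nearly equal share of it.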

R: scale one value based on another

Is there a way to scale one column to a fixed value, with the other column scaled correspondingly? I was not able to use the scale function in R, as it needs a range.
For example:
ID X Y
1 25 25
2 20 40
3 10 50
4 50 20
I need to scale the Y values to a fixed value of 100, with the X values scaled correspondingly:
ID X Y
1 100 100
2 50 100
3 20 100
4 250 100
If your data is
id <- c(1,2,3,4)
X <- c(25,20,10,50)
Y <- c(25,40,50,20)
df <- data.frame(id,X,Y)
you could try
df$X <- df$X*100/df$Y
df$Y <- 100
# > df
# id X Y
# 1 1 100 100
# 2 2 50 100
# 3 3 20 100
# 4 4 250 100
You may also want to take a quick look at the scale function available in base R. For instance, if you wish to scale part of a data frame, you could:
data("mtcars")
test <- scale(x = mtcars[,2:ncol(mtcars)], scale = TRUE)
You can modify the syntax further, setting scale and center, so that the obtained values match your requirements:
tst_mpg <- scale(x = mtcars$mpg, scale = 1, center = 0.5)
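Wrapping the X * 100 / Y idea into a small helper (the function name is hypothetical; the zero check is my addition):

```r
# Hypothetical helper: rescale so Y becomes a fixed target value row-wise,
# scaling X by the same per-row factor; guards against zeros in Y.
scale_to_fixed <- function(df, target = 100) {
  stopifnot(all(df$Y != 0))   # avoid division by zero
  df$X <- df$X * target / df$Y
  df$Y <- target
  df
}
scale_to_fixed(data.frame(id = 1:4, X = c(25, 20, 10, 50), Y = c(25, 40, 50, 20)))
#>   id   X   Y
#> 1  1 100 100
#> 2  2  50 100
#> 3  3  20 100
#> 4  4 250 100
```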

aggregate data frame by equal buckets

I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce the x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: is there a better (more efficient) approach than the code above? The second, more important, question is how to correct the code above to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately, as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (18th) contains 5 units of v and is fully included in the next bucket. What I would like instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p value v times. The way I understand it, the place where the cumulative sum crosses 100 (see below):
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Install IRanges from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# build the new (split) rows from the "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on #geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10) # FN1 wraps the steps above
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the 2 "split" units of v;
i.e., in your example, the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With the above assumption, you can do the following for datasets that are not too large:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector in which each value of p is repeated v times per row; the results are combined into a single vector by unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
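A small side note (my simplification): rep is itself vectorized over its times argument, so the mapply/unlist step can be dropped, and tapply gives the same bucket means:

```r
# Tiny deterministic example: each p repeated v times, buckets of 3 units
# (use n = 100 for the data above).
xx <- data.frame(p = c(1, 2, 3), v = c(2, 3, 1))
z2 <- rep(xx$p, xx$v)              # 1 1 2 2 2 3
grp <- (seq_along(z2) - 1) %/% 3   # 0 0 0 1 1 1
tapply(z2, grp, mean)              # bucket means: 4/3 and 7/3
```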

Smoothing a sequence without using a loop in R

I am implementing a statistical method from an academic paper (see the end for a citation) in R. I think there's a way to do one of the steps without using a loop, but I'm having trouble deciding how to attack it.
This method operates on a data frame with three variables: x, n, and p. It can only operate if p[i] <= p[i+1] for all i. If a pair of points violates that, they are smoothed out by setting both p[i] and p[i+1] equal to their weighted average
(n[i]*p[i]+n[i+1]*p[i+1])/(n[i]+n[i+1])
This smoothing is iterated until the p_i are a nondecreasing sequence.
The problem with this smooth is that (a) loops are bad form in R, and (b) if there are multiple points in a row such that p_i > p_(i+1) >= p_(i+2), the method can fail to terminate or take a very long time to converge. For instance, given a sequence like this:
x n p
2 10 0.6
5 10 0.5
10 10 0.5
the smooth will set the first two values of p to 0.55, then the second two to 0.525, then the first two to 0.5375, and so on, looping forever (or, if I'm lucky, reaching the limit of numerical significance after a huge number of iterations). There should be a mathematically equivalent but more efficient way to do this by identifying adjacent decreasing data points and averaging them as a group, but I'm not sure how to approach that in R.
If you need more background, the paper in question is Martin A. Hamilton, Rosemarie C. Russo, Robert V. Thurston.
"Trimmed Spearman-Karber method for estimating median lethal concentrations in toxicity bioassays." Environ. Sci. Technol., 1977, 11 (7), pp 714–719. I'm referring to the "first step" section on page 716.
As I understand the algorithm, you need to locate positions where p is decreasing and, starting from each of these, find out for how long the (cumulative) weighted average is decreasing, so that p can be updated block by block. I do not see how this can be done without a loop of some sort. Some solutions might hide the loop under lapply or an equivalent, but IMHO this is one of those algorithms complex enough that I prefer a good old loop. You may lose a bit in efficiency, but the code reads nicely. My attempt, using a while loop:
smooth.p <- function(df) {
  while (any(diff(df$p) < 0)) {
    # where does it start decreasing
    idx <- which(diff(df$p) < 0)[1]
    # from there, compute the cumulative weighted average
    sub <- df[idx:nrow(df), ]
    cuml.wavg <- cumsum(sub$n * sub$p) / cumsum(sub$n)
    # and see for how long it is decreasing
    bad.streak.len <- rle(diff(cuml.wavg) <= 0)$lengths[1]
    # these are the indices for the block to average
    block.idx <- seq(from = idx, length = bad.streak.len + 1)
    # compute and apply the average p
    df$p[block.idx] <- sum(df$p[block.idx] * df$n[block.idx]) /
      sum(df$n[block.idx])
  }
  return(df)
}
Here is some data, including a rough patch like you suggested:
df <- data.frame(x = 1:9,
                 n = rep(1, 9),
                 p = c(0.1, 0.3, 0.2, 0.6, 0.5, 0.5, 0.8, 1.0, 0.9))
df
# x n p
# 1 1 1 0.1
# 2 2 1 0.3
# 3 3 1 0.2
# 4 4 1 0.6
# 5 5 1 0.5
# 6 6 1 0.5
# 7 7 1 0.8
# 8 8 1 1.0
# 9 9 1 0.9
And the output:
smooth.p(df)
# x n p
# 1 1 1 0.1000000
# 2 2 1 0.2500000
# 3 3 1 0.2500000
# 4 4 1 0.5333333
# 5 5 1 0.5333333
# 6 6 1 0.5333333
# 7 7 1 0.8000000
# 8 8 1 0.9500000
# 9 9 1 0.9500000
Following Glen_b's observation, what's described in Hamilton's paper is equivalent to gpava from the CRAN package isotone.
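For reference, the block-pooling idea can also be written directly in base R as a pool-adjacent-violators pass (the standard algorithm that gpava implements); this is a sketch under that framing, not code from the paper:

```r
# Pool-adjacent-violators sketch: keep a stack of blocks (weighted mean,
# weight, size); merge the last two blocks while they violate monotonicity.
pava <- function(p, n = rep(1, length(p))) {
  means <- p; wts <- n; sizes <- rep(1L, length(p))
  k <- 1L  # number of blocks currently on the stack
  for (j in seq_along(p)[-1]) {
    k <- k + 1L
    means[k] <- p[j]; wts[k] <- n[j]; sizes[k] <- 1L
    while (k > 1L && means[k - 1L] > means[k]) {
      w <- wts[k - 1L] + wts[k]
      means[k - 1L] <- (means[k - 1L] * wts[k - 1L] + means[k] * wts[k]) / w
      wts[k - 1L] <- w
      sizes[k - 1L] <- sizes[k - 1L] + sizes[k]
      k <- k - 1L
    }
  }
  rep(means[seq_len(k)], sizes[seq_len(k)])
}
pava(c(0.1, 0.3, 0.2, 0.6, 0.5, 0.5, 0.8, 1.0, 0.9))  # matches smooth.p above
```

Each element is pushed and merged at most once, so this runs in linear time and cannot cycle the way pairwise averaging can.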
