R_Sample with probabilities - r

I am having trouble understanding the prob argument of sample. For example, I want to create a sample data set of size 100 containing the integers 1, 2, 3 and 4, with probabilities 0.1, 0.2, 0.3 and 0.4 respectively.
sample1<-sample(1:4,100,replace=T,prob=seq(0.1,0.4,0.1))
So I am expecting a sample in which the integers 1, 2, 3 and 4 appear 10, 20, 30 and 40 times respectively. But the result is different:
> table(sample1)
sample1
1 2 3 4
7 24 33 36
Can anyone explain this? And what should I do if I want to get the expected result, which is
> table(sample1)
sample1
1 2 3 4
10 20 30 40

sample takes a sample with the specified probabilities. That implies randomness - you won't get the same result every time. To do what you want just use rep
rep(1:4, 100*seq(0.1,0.4,0.1))

sample(...) takes a random sample with probabilities given in prob=..., so you will not get exactly that proportion every time. On the other hand, the proportions get closer to those specified in prob as n increases:
f <- function(n)sample(1:4,n,replace=T,prob=(1:4)/10)
samples <- lapply(10^(2:6),f)
t(sapply(samples,function(x)c(n=length(x),table(x)/length(x))))
# n 1 2 3 4
# [1,] 1e+02 0.090000 0.220000 0.260000 0.430000
# [2,] 1e+03 0.076000 0.191000 0.309000 0.424000
# [3,] 1e+04 0.095300 0.200200 0.310100 0.394400
# [4,] 1e+05 0.099720 0.199800 0.302250 0.398230
# [5,] 1e+06 0.099661 0.199995 0.300223 0.400121
If you need a random sample with exactly those proportions, use rep(...) and randomize the order.
g <- function(n) rep(1:4,n*(1:4)/10)[sample(1:n,n)]
samples <- lapply(10^(2:6),g)
t(sapply(samples,function(x)c(n=length(x),table(x)/length(x))))
# n 1 2 3 4
# [1,] 1e+02 0.1 0.2 0.3 0.4
# [2,] 1e+03 0.1 0.2 0.3 0.4
# [3,] 1e+04 0.1 0.2 0.3 0.4
# [4,] 1e+05 0.1 0.2 0.3 0.4
# [5,] 1e+06 0.1 0.2 0.3 0.4
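The rep-then-shuffle idea can be wrapped in a small helper. This is only a sketch: the name exact_sample and its arguments are hypothetical, and it assumes that n * prob works out to whole numbers that add up to n.

```r
# Hypothetical helper: returns a shuffled vector in which each value
# appears exactly n * prob times (assumes those products are whole numbers).
exact_sample <- function(values, prob, n) {
  counts <- round(n * prob)      # round() guards against 30.000000000000004 and the like
  stopifnot(sum(counts) == n)    # fail loudly if the proportions do not fit n
  pool <- rep(values, times = counts)
  pool[sample.int(n)]            # random permutation of the pool
}

set.seed(42)
s <- exact_sample(1:4, (1:4) / 10, 100)
table(s)
```

The order of s changes with the seed, but the counts in table(s) are always 10, 20, 30 and 40.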

Related

Create fully antithetic draws using R

Say I have the matrix d, which holds two different realizations (rows) of a sampling procedure in two dimensions (columns). I want to write a function that creates the fully antithetic draws from this original matrix.
c1 <- c(0.1, 0.6);c2 <- c(0.3, 0.8);d <- rbind(c1,c2)
# [,1] [,2]
# c1 0.1 0.6
# c2 0.3 0.8
That is to say, for the first realization (c(0.1, 0.6)), for example, I want to obtain the mirror images of this random draw in two dimensions, which generates 4 (2^2) possible combinations as follows:
d1_anthi = matrix(
c( d[1,1] , d[1,2],
1 - d[1,1], d[1,2],
d[1,1] , 1 - d[1,2],
1 - d[1,1], 1 - d[1,2]), nrow=2,ncol=4)
t(d1_anthi)
# [,1] [,2]
# [1,] 0.1 0.6
# [2,] 0.9 0.6
# [3,] 0.1 0.4
# [4,] 0.9 0.4
Analogously, for the second realization the result is the following:
d2_anthi = matrix(
c( d[2,1] , d[2,2],
1 - d[2,1], d[2,2],
d[2,1] , 1 - d[2,2],
1 - d[2,1], 1 - d[2,2]), nrow=2, ncol=4)
t(d2_anthi)
# [,1] [,2]
# [1,] 0.3 0.8
# [2,] 0.7 0.8
# [3,] 0.3 0.2
# [4,] 0.7 0.2
Accordingly, my desired object should look like this:
anthi_draws <- rbind(t(d1_anthi),t(d2_anthi))
# [,1] [,2]
# [1,] 0.1 0.6 <- original first realization
# [2,] 0.9 0.6
# [3,] 0.1 0.4
# [4,] 0.9 0.4
# [5,] 0.3 0.8 <- original second realization
# [6,] 0.7 0.8
# [7,] 0.3 0.2
# [8,] 0.7 0.2
Finally, I would like to create a function that, given a matrix of random numbers, creates this expanded matrix of antithetic draws. For example, for a sampling in three dimensions, the total number of draws per original draw is 2^3 = 8.
In particular, I am having problems creating the full set of combinations, which depends on the dimension of the original sampling (the number of columns of the matrix). I was planning to use expand.grid() but I couldn't get it to produce all the combinations. Any hints or help with writing such a function are welcome. Thank you in advance.
You can try this
do.call(
  rbind,
  apply(
    d,
    1,
    function(x) {
      expand.grid(data.frame(rbind(x, 1 - x)))
    }
  )
)
which gives
X1 X2
c1.1 0.1 0.6
c1.2 0.9 0.6
c1.3 0.1 0.4
c1.4 0.9 0.4
c2.1 0.3 0.8
c2.2 0.7 0.8
c2.3 0.3 0.2
c2.4 0.7 0.2
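The same idea can be packaged as a function that works for any number of columns. The following is a sketch rather than a definitive implementation: the name antithetic_draws is hypothetical, and it assumes every entry of d lies in [0, 1] so that 1 - x is its mirror image.

```r
# Hypothetical wrapper: for each row of d, build all 2^k combinations of
# (x_j, 1 - x_j) across the k columns and stack the resulting blocks.
antithetic_draws <- function(d) {
  d <- as.matrix(d)
  blocks <- lapply(seq_len(nrow(d)), function(i) {
    # one list element per dimension: the draw and its mirror image
    pairs <- lapply(d[i, ], function(v) c(v, 1 - v))
    as.matrix(expand.grid(pairs, KEEP.OUT.ATTRS = FALSE))
  })
  out <- do.call(rbind, blocks)
  dimnames(out) <- NULL
  out
}

d <- rbind(c(0.1, 0.6), c(0.3, 0.8))
antithetic_draws(d)

# three dimensions: 2^3 = 8 rows per original draw
d3 <- rbind(c(0.1, 0.6, 0.2), c(0.3, 0.8, 0.5))
dim(antithetic_draws(d3))
```

Because expand.grid varies its first factor fastest, the first row of each block is the original draw, matching the layout requested in the question.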

vector specification to "col", "lwd" and "lty" not working when drawing lines?

I have matrix dsts with 3 columns; third is a factor. I want my linear plot to be colored by the factor but this command is not working:
plot(dsts[ ,'x'],dsts[,'dist'],col=dsts[,'i'],type='l')
and,
plot(dsts[ ,'x'],dsts[,'dist'],col=dsts[,'i'],type='n')
lines(dsts[ ,'x'],dsts[,'dist'],col=dsts[,'i'])
is not working either!!!
I want to avoid using matplot which accepts matrices.
The col option, though able to take vector input, effectively controls only point colour, not line colour, so type = "p" works but type = "l" does not. For type = "b", only the points will have the correct colours.
If you want several lines with different colours, you have to plot them with separate plot or lines calls. A better way is to reshape your data and then use matplot, which takes a matrix and plots its columns one by one via a for loop.
Since you've already got a function to reshape the data, that is the right way to go.
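As a rough illustration of the reshape-then-matplot route, here is a sketch on made-up data. The object names long and wide and the three curves are hypothetical, and the reshape assumes every group shares the same x grid in the same order.

```r
# Hypothetical long-format data: three groups evaluated on a common x grid
x <- seq(0, 10, length.out = 50)
long <- data.frame(
  x    = rep(x, times = 3),
  dist = c(sin(x), cos(x), sin(2 * x)),
  i    = rep(1:3, each = 50)
)

# Reshape to wide: one column of y values per group
wide <- do.call(cbind, split(long$dist, long$i))

# matplot draws each column as its own line, so col, lty and lwd can be vectors
matplot(x, wide, type = "l", col = 1:3, lty = 1, lwd = 2,
        xlab = "x", ylab = "dist")
```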
The reason plot and lines ignore vector values in col for line display is that they cannot tell whether the vector has a reasonable, non-random pattern. They do the safe thing and use only col[1]. I will elaborate on this in two steps.
Firstly, consider this example to see that plot will always use col[1] when type = "l":
set.seed(0); mat1 <- round(cbind(rnorm(9),rnorm(9),rep(1:3, each = 3)), 1)
# [,1] [,2] [,3]
# [1,] 1.3 2.4 1
# [2,] -0.3 0.8 1
# [3,] 1.3 -0.8 1
# [4,] 1.3 -1.1 2
# [5,] 0.4 -0.3 2
# [6,] -1.5 -0.3 2
# [7,] -0.9 -0.4 3
# [8,] -0.3 0.3 3
# [9,] 0.0 -0.9 3
Then we reorder the rows of mat1:
mat2 <- mat1[c(4:9,1:3), ]
# [,1] [,2] [,3]
# [1,] 1.3 -1.1 2
# [2,] 0.4 -0.3 2
# [3,] -1.5 -0.3 2
# [4,] -0.9 -0.4 3
# [5,] -0.3 0.3 3
# [6,] 0.0 -0.9 3
# [7,] 1.3 2.4 1
# [8,] -0.3 0.8 1
# [9,] 1.3 -0.8 1
We use the 3rd column for col, now compare:
par(mfrow = c(1,2))
plot(mat1[,1], mat1[,2], col = mat1[,3], type = "l")
plot(mat2[,1], mat2[,2], col = mat2[,3], type = "l")
mat1[, 3] starts with 1, so the line colour is black; mat2[,3] starts with 2, so the line colour is red.
Now it is time to say why plot and lines ignore a vector col when type = "l". Consider a random row shuffle of mat1:
set.seed(0); mat3 <- mat1[sample(9), ]
# [,1] [,2] [,3]
# [1,] 0.0 -0.9 3
# [2,] 1.3 -0.8 1
# [3,] -0.3 0.3 3
# [4,] 1.3 -1.1 2
# [5,] 0.4 -0.3 2
# [6,] 1.3 2.4 1
# [7,] -0.9 -0.4 3
# [8,] -0.3 0.8 1
# [9,] -1.5 -0.3 2
plot(..., type = "l") joins the points one by one in row order. A line of a single colour can only be drawn if the data points along its path share the same colour specification. Here the 3rd column is in random order, so there is no sensible way to join the points under such a colour specification.
The best and safest assumption plot and lines can make is that the col vector is completely random, so they retain only col[1] and produce a single-colour plot. The full vector is used only when type = "p".
Note, the same logic applies to lwd and lty, too. Any argument associated with line display will take only the first vector element. As I said earlier, if you do want to draw several different lines in different styles, do them one by one.
Building on @Zheyuan Li's valuable insight into the problem at hand, I wrote a simple function to overcome it:
plot_line_color <- function(x, y, fact, lwd = 2, ...)
{
  # set up an empty plot, then add one line per group
  plot(x, y, type = 'n')
  groups <- unique(fact)   # works even if the groups are not coded 1, 2, 3, ...
  invisible(
    lapply(seq_along(groups), function(j) {
      keep <- fact == groups[j]
      lines(x[keep], y[keep], col = j, lwd = lwd, ...)
    })
  )
}
A simple simulation:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),fact=i) )
dsts <- do.call(rbind,dsts)
plot_line_color(x=dsts[,1],y=dsts[,2],fact=dsts[,3])

Create single list from two data frames while preserving class of columns and its names

I have two data frames:
DF1
e l u
1 0.5 1.5
2 1 3
3 2 4
DF2
e l u
0.1 0.01 0.15
0.2 0.1 0.3
0.3 0.2 0.4
I want to combine these two data frames into single list like so:
L
[[1]]
$e: [(1 0.1);(2 0.2);(3 0.3)] #numeric
$l: [(0.5 0.01);(1 0.1);(2 0.2)] #numeric
$u: [(1.5 0.15);(3 0.3);(4 0.4)] #numeric
I have tried to rbind the two data frames and then split by column; I was also advised to use Map, but I end up with multiple lists instead of a single one, or all the variables become factors.
Thank you for any suggestions.
It looks like you want a list of arrays.
> mapply(cbind, DF1, DF2, SIMPLIFY=FALSE)
$e
[,1] [,2]
[1,] 1 0.1
[2,] 2 0.2
[3,] 3 0.3
$l
[,1] [,2]
[1,] 0.5 0.01
[2,] 1.0 0.10
[3,] 2.0 0.20
$u
[,1] [,2]
[1,] 1.5 0.15
[2,] 3.0 0.30
[3,] 4.0 0.40
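For a reproducible check, the two data frames can be rebuilt from the values in the question. This is a small sketch: the construction of DF1 and DF2 below is an assumption about how they were created, and Map(cbind, ...) is used as the equivalent of mapply(..., SIMPLIFY = FALSE).

```r
# Data frames rebuilt from the values shown in the question
DF1 <- data.frame(e = c(1, 2, 3),       l = c(0.5, 1, 2),      u = c(1.5, 3, 4))
DF2 <- data.frame(e = c(0.1, 0.2, 0.3), l = c(0.01, 0.1, 0.2), u = c(0.15, 0.3, 0.4))

# Column-wise pairing: one two-column matrix per shared column name
L <- Map(cbind, DF1, DF2)

names(L)              # the names e, l, u are preserved
sapply(L, is.numeric) # every element is still numeric, no factors
```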

Generate combinations of values which sum to one, sorted in descending order

Do you know a more efficient way to generate a matrix holding all unique combinations of "weights" (let weights be w and 0 <= w <= 1, and values of w are separated by steps of 0.1), such that the weights sum to one AND the first is the highest, the last the lowest weight.
Here is code that does the job, but it seems inefficient to delete rows:
# generate combinations of weights such that w1 >= w2 >= w3 ...
w = seq(0, 1, 0.1) #weights 0, 0.1, ..., 0.9, 1
w = expand.grid(w, w, w, KEEP.OUT.ATTRS = FALSE) #all combinations of 3 weights
w = w[abs(rowSums(w) - 1) < 1e-8, ] #make sure the weights sum to one (with a tolerance: an exact == 1 test drops 0.6, 0.3, 0.1 through floating-point rounding)
w = w[!(w[, 1] < w[, 2] | w[, 2] < w[, 3]),] #make sure w1 >= w2 >= w3 ...
w
# Var1 Var2 Var3
# 11 1.0 0.0 0.0
# 21 0.9 0.1 0.0
# 31 0.8 0.2 0.0
# 41 0.7 0.3 0.0
# 51 0.6 0.4 0.0
# 61 0.5 0.5 0.0
# 141 0.8 0.1 0.1
# 151 0.7 0.2 0.1
# 161 0.6 0.3 0.1
# 171 0.5 0.4 0.1
# 271 0.6 0.2 0.2
# 281 0.5 0.3 0.2
# 291 0.4 0.4 0.2
# 401 0.4 0.3 0.3
Let me add some more general info:
In this problem (3 weights in the above order) the upper limits for the first, second, third values are as follows:
the first number can maximally be 1 for the combination (1, 0, 0)
the second number can maximally be 1/2 for the combination (1/2, 1/2, 0)
the third number can maximally be 1/3 for the combination (1/3, 1/3, 1/3)
A non-base possibility:
library(partitions)
step <- 0.1
n_weights <- 3
t(restrictedparts(n = 1/step, m = n_weights) * step)
# [1,] 1.0 0.0 0.0
# [2,] 0.9 0.1 0.0
# [3,] 0.8 0.2 0.0
# [4,] 0.7 0.3 0.0
# [5,] 0.6 0.4 0.0
# [6,] 0.5 0.5 0.0
# [7,] 0.8 0.1 0.1
# [8,] 0.7 0.2 0.1
# [9,] 0.6 0.3 0.1
# [10,] 0.5 0.4 0.1
# [11,] 0.6 0.2 0.2
# [12,] 0.5 0.3 0.2
# [13,] 0.4 0.4 0.2
# [14,] 0.4 0.3 0.3
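If an extra package is not an option, the filtering approach can be run on integer step counts instead of decimals. This is only a sketch (the variable names steps, grid and keep are hypothetical), and it still builds the full grid, so it is not more efficient than the original. Its one advantage is that integer sums can be compared exactly, whereas an exact equality test on floating-point sums can silently drop valid rows.

```r
steps     <- 10L   # weights are multiples of 1 / steps, i.e. 0.1
n_weights <- 3L

# all combinations of integer step counts 0..steps for each weight
grid <- expand.grid(rep(list(0:steps), n_weights), KEEP.OUT.ATTRS = FALSE)

# the counts must sum to `steps` (exact for integers) ...
keep <- rowSums(grid) == steps
# ... and be non-increasing from left to right
for (j in seq_len(n_weights - 1)) {
  keep <- keep & grid[[j]] >= grid[[j + 1]]
}

# convert back to weights only at the very end
w_int <- as.matrix(grid[keep, ]) / steps
nrow(w_int)
```

For three weights in steps of 0.1 this yields the same 14 rows as the restrictedparts() call.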
General-purpose function using only standard packages. Note that it draws random weight vectors rather than enumerating every combination, and that rounding can leave a row summing to slightly more or less than 1:
# Generate a weights matrix with noWeights columns and noRows rows.
# Each row contains weights sorted in decreasing order that sum to 1.0 before rounding.
generateWeights = function(noWeights,
                           noRows,
                           distribution = runif,
                           rounding = function(x){ round(x, 1) })
{
  generator = function()
  {
    x = distribution(noWeights)
    x = x / sum(x)
    sort(rounding(x), decreasing = TRUE)
  }
  t(replicate(noRows, generator()))
}
# example of use
generateWeights(3, 10)

Automatically creating derived variables in a dataframe [duplicate]

Possible Duplicate:
Processing the list of data.frames with “apply” family of functions
I have a dataframe with six numeric variables V1, V2, V3 and V1.lag, V2.lag, V3.lag.
NOTE: My real dataset has many more variables; I use 3 for illustration only!
I would like to automatically (without hardcoding anything) run through all V variables (not the lag variables) and create V1.over.V1.lag variables by dividing each V variable by its corresponding lag variable.
df<-data.frame(matrix(rnorm(216),72,6));
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag");
Thanks in advance
EDIT: I figured out how to identify the "raw" columns and the "lag" columns:
raws <- sapply( names(df), function(x){ unlist(strsplit(x, "[.]"))[2] == "raw" } ); ## which are raw factors
lags <- sapply( names(df), function(x){ unlist(strsplit(x, "[.]"))[2] == "lag" } ); ## which are lagged factors
but I still can't figure out how to divide all raw factors by their lag counterparts
which(raws);
will give me indices, but how do I combine them with lags into a new factor?
df[which(raws)] / df[which(lags)]
doesn't work
Assuming you have only v.raw and v.lag columns in your data.frame, this should work:
mm <- colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
df[,gregexpr('.raw',mm) > 0] /df[,gregexpr('.*lag',mm) > 0]
Edit: some explanation of the solution:
gregexpr('.raw',mm) > 0
[1] TRUE TRUE TRUE FALSE FALSE FALSE
head(df[,gregexpr('.raw',mm) > 0],1)
v1.raw v2.raw v3.raw
1 0.7719037 -0.2078197 -1.223753
regexpr('.lag',mm) > 0
[1] FALSE FALSE FALSE TRUE TRUE TRUE
head(df[,gregexpr('.lag',mm) > 0],1)
v1.lag v2.lag v3.lag
1 0.7719037 -0.2078197 -1.223753
Then we use the vectorized / operator to do the division in one operation.
Here is an example:
df <- matrix(rep(c(1,2,3,4,5,6),each = 5),ncol=6)
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
v1.raw v2.raw v3.raw v1.lag v2.lag v3.lag
[1,] 1 2 3 4 5 6
[2,] 1 2 3 4 5 6
[3,] 1 2 3 4 5 6
[4,] 1 2 3 4 5 6
[5,] 1 2 3 4 5 6
mm <- colnames(df)
df[,which(gregexpr('.raw',mm) > 0)] /df[,which(gregexpr('.lag',mm) > 0)]
v1.raw v2.raw v3.raw #as expected 1/4 2/5 3/6
[1,] 0.25 0.4 0.5
[2,] 0.25 0.4 0.5
[3,] 0.25 0.4 0.5
[4,] 0.25 0.4 0.5
[5,] 0.25 0.4 0.5
Edit 2: guard against division by zero
df <- matrix(rep(c(1,2,3,4,5,6),each = 5),ncol=6)
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
df[1,4] <- 0 ## I introduce a 0 here
mm <- colnames(df)
## I use ifelse because it is vectorized too!
## Where the lag value is 0, skip the division and return the original raw value
## You can do other things here
ifelse(df[,which(gregexpr('.lag',mm) > 0)] != 0 ,
df[,which(gregexpr('.raw',mm) > 0)] /df[,which(gregexpr('.lag',mm) > 0)],
df[,which(gregexpr('.raw',mm) > 0)])
v1.lag v2.lag v3.lag ## ifelse keeps the attributes of its test argument, hence the lag (not raw) column names
[1,] 1.00 0.4 0.5
[2,] 0.25 0.4 0.5
[3,] 0.25 0.4 0.5
[4,] 0.25 0.4 0.5
[5,] 0.25 0.4 0.5
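To pair each raw column with its lag counterpart by name and to get the V1.over.V1.lag style names asked for, something along these lines may help. It is a sketch under stated assumptions: every raw column is assumed to end in ".raw" and to have a matching ".lag" column, and the helper names raw_cols, stems, lag_cols and ratios are hypothetical.

```r
set.seed(1)
df <- data.frame(matrix(rnorm(216), 72, 6))
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")

# find the raw columns, then derive the matching lag names from their stems
raw_cols <- grep("\\.raw$", names(df), value = TRUE)
stems    <- sub("\\.raw$", "", raw_cols)
lag_cols <- paste0(stems, ".lag")
stopifnot(all(lag_cols %in% names(df)))   # every raw column needs a lag partner

# divide pairwise and name the results v1.over.v1.lag, v2.over.v2.lag, ...
ratios <- Map(`/`, df[raw_cols], df[lag_cols])
names(ratios) <- paste0(stems, ".over.", stems, ".lag")

df <- cbind(df, as.data.frame(ratios))
names(df)
```

Matching on the name stem instead of on column position means the result does not depend on the raw and lag columns appearing in the same order.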
