Create samples with different range and weights - r

I want to create a sample of 3000 entries in total, subject to these rules:
Category-1 (Low): 0.1 - 0.3
Category-2 (Medium): 0.4 - 0.7
Category-3 (High): 0.7 - 0.9
Each category should make up a given share of the sample, for example:
Category-1 (Low): 20% of the dataset
Category-2 (Medium): 30% of the dataset
Category-3 (High): 50% of the dataset
I have not been able to find any pointers on how to do this. Can anyone help? Thanks a lot in advance.

We can use Map to build the sequence of values for each range shown in the question and sample from it, passing the per-category sample size (proportion times 3000) as the third argument to Map. Note that seq(x, y, by = 0.1) restricts the draws to grid values such as 0.1, 0.2 and 0.3; use runif instead if continuous values are needed.
lst1 <- Map(function(x, y, z) sample(seq(x, y, by = 0.1), z, replace = TRUE),
            c(0.1, 0.4, 0.7), c(0.3, 0.7, 0.9), c(0.2, 0.3, 0.5) * 3000)
names(lst1) <- c("low", "medium", "high")
lengths(lst1)
#    low medium   high
#    600    900   1500
out <- unlist(lst1)
length(out)
#[1] 3000
If we need the result as a two-column data.frame:
dat <- stack(lst1)[2:1]

I like to use the simstudy package for data generation. Here I first generate the categories and then back-fill values that conform to the category rules. simstudy returns a data.table object, but I'm more familiar with tidyverse syntax, so I convert the result to a tibble:
library(simstudy)
library(dplyr)
set.seed(1724)
# define data
def <- defData(varname = "category", formula = "0.2;0.3;0.5", dist = "categorical", id = "id")
def <- defData(def, varname = "value", dist = "nonrandom", formula = NA)
# generate data
df <- genData(3000, def) %>% as_tibble()
# add in values that conform to category rules
df[df$category == 1,]$value <- runif(nrow(df[df$category == 1,]), min = 0.1, max = 0.3)
df[df$category == 2,]$value <- runif(nrow(df[df$category == 2,]), min = 0.4, max = 0.7)
df[df$category == 3,]$value <- runif(nrow(df[df$category == 3,]), min = 0.7, max = 0.9)
# A tibble: 3,000 x 3
#       id category value
#    <int>    <int> <dbl>
#  1     1        3 0.769
#  2     2        2 0.691
#  3     3        3 0.827
#  4     4        3 0.729
#  5     5        2 0.474
#  6     6        3 0.818
#  7     7        2 0.635
#  8     8        2 0.552
#  9     9        3 0.794
# 10    10        3 0.792
# ... with 2,990 more rows

A rather simple approach:
1. Fix the number of draws per category up front. The category sizes are then not random, but depending on the application this may suffice:
out <- c(runif(600, 0.1, 0.3), runif(900, 0.4, 0.7), runif(1500, 0.7, 0.9))
2. Draw the category of each entry as well, so that the category sizes are also random:
sam <- sample(1:3, size = 3000, prob = c(0.2, 0.3, 0.5), replace = TRUE)
x1 <- sum(sam == 1)
x2 <- sum(sam == 2)
x3 <- sum(sam == 3)
out <- c(runif(x1, 0.1, 0.3), runif(x2, 0.4, 0.7), runif(x3, 0.7, 0.9))
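The second approach returns the values grouped by category and without labels. A minimal sketch of a variant that keeps the label next to each value is shown here; the helper name sample_by_category and the seed are assumptions made for illustration, and it relies on runif being vectorised over min and max.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical helper: pick a category for each entry with probability
# `prob`, then draw the value uniformly from that category's [lo, hi] range.
sample_by_category <- function(n, lo, hi, prob, labels) {
  idx <- sample(seq_along(prob), size = n, replace = TRUE, prob = prob)
  data.frame(
    category = factor(labels[idx], levels = labels),
    value    = runif(n, min = lo[idx], max = hi[idx])  # one range per entry
  )
}

dat <- sample_by_category(
  n      = 3000,
  lo     = c(0.1, 0.4, 0.7),
  hi     = c(0.3, 0.7, 0.9),
  prob   = c(0.2, 0.3, 0.5),
  labels = c("low", "medium", "high")
)

# Shares are close to 0.2 / 0.3 / 0.5 but not exact, because the
# category of each entry is itself random.
prop.table(table(dat$category))
```

Because the rows are generated entry by entry, the result is already in random order and needs no extra shuffling.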

Related

How to create a dataframe from the return of the function?

model <- function(alpha,n,m){
  ybar <- numeric()
  for(i in 1:m){
    y <- arima.sim(model=list(ar=alpha),n)
    ybar[i] <- mean(y)
  }
  CI <- mean(ybar) + c(1,-1)*qnorm(0.025)*sqrt(1/n)*(1/(1-alpha))
  width <- abs(abs(CI[1])-abs(CI[2]))
  list("Confidence Interval"=CI, Width=width)
}
model(-0.8,1000,1000)
model(-0.4,1000,1000)
model(-0.3,1000,1000)
model(0.2,1000,1000)
model(0.8,1000,1000)
I want to create a data frame in which the first column is the list of alpha values (e.g. -0.8, -0.4, ..., 0.8), the second column is the confidence interval, and the third column is the width of the CI. Each column should have its own name (alpha, confidence interval, width).
How can I do that?
Not sure if this is what you need, but in base R:
do.call(rbind, lapply(c(-0.8, -0.4, -0.3),
function(x) data.frame(alpha = x, model(x, 1000, 1000))))
# alpha Confidence.Interval Width
#1 -0.8 -0.03474170 0.0006172874
#2 -0.8 0.03412441 0.0006172874
#3 -0.4 -0.04439685 0.0002515509
#4 -0.4 0.04414530 0.0002515509
#5 -0.3 -0.04777081 0.0001885317
#6 -0.3 0.04758228 0.0001885317
If we need the lower and upper bounds as separate columns (note that this call uses m = 100):
do.call(rbind, lapply(c(-0.8, -0.4, -0.3), function(x) {
out <- model(x, 1000, 100)
data.frame(alpha = x, lower_bound = out$`Confidence Interval`[1],
upper_bound = out$`Confidence Interval`[2], Width = out$Width)}))
# alpha lower_bound upper_bound Width
#1 -0.8 -0.03163379 0.03723232 0.005598532
#2 -0.4 -0.04186212 0.04668002 0.004817898
#3 -0.3 -0.04833423 0.04701885 0.001315380
Or with tidyverse:
library(dplyr)
library(purrr)
library(tidyr) # unnest_wider() and unnest_longer() come from tidyr
tibble(alpha = c(-0.8, -0.4, -0.3),
out = map(alpha, model, n = 1000, m = 1000)) %>%
unnest_wider(c(out)) %>%
unnest_longer(c(`Confidence Interval`))
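Another option is to let the function itself return a one-row data frame, so the results can be bound directly. The sketch below is an assumption-laden variant, not the asker's function: model_row is a hypothetical name, the seed and the small n and m are chosen only to keep the run short, and width is computed as upper minus lower, which differs from the abs(abs(...)) expression in the question.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Hypothetical variant of model() that returns one row per alpha.
model_row <- function(alpha, n, m) {
  # mean of each of m simulated AR(1) series of length n
  ybar <- replicate(m, mean(arima.sim(model = list(ar = alpha), n = n)))
  # half-width of the interval, same formula as in the question
  half <- qnorm(0.975) * sqrt(1 / n) / (1 - alpha)
  data.frame(
    alpha = alpha,
    lower = mean(ybar) - half,
    upper = mean(ybar) + half,
    width = 2 * half  # upper - lower (assumption: this is the intended width)
  )
}

alphas <- c(-0.8, -0.4, -0.3, 0.2, 0.8)
res <- do.call(rbind, lapply(alphas, model_row, n = 200, m = 50))
res
```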

How to fill a matrix by proportion?

I'm trying to create a 20x20 matrix filled with numbers from -1:2. However, I don't want the values to be uniformly random; I want them to follow proportions that I choose.
For example, I would want 0.10 of the cells to be -1, 0.60 to be 0, 0.20 to be 1, 0.10 to be 2.
This code was able to get me a matrix with all of the values I want, but I don't know how to edit it to specify the proportion of each value I want.
r <- 20
c <- 20
mat <- matrix(sample(-1:2,r*c, replace=TRUE),r,c)
We can use the prob argument of sample (the proportions then hold in expectation, not exactly):
matrix(sample(-1:2, r*c, replace = TRUE, prob = c(0.1, 0.6, 0.2, 0.1)), r, c)
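To get the proportions exactly rather than only in expectation, build the vector of values with the required counts and shuffle it:
r <- 20
c <- 20
ncell = r * c
val = c(-1, 0, 1, 2)
p = c(0.1, 0.6, 0.2, 0.1)
fill = rep(val, ceiling(p * ncell))[1:ncell]
mat <- matrix(data = sample(fill), nrow = r, ncol = c)
prop.table(table(mat))
#> mat
#>  -1   0   1   2
#> 0.1 0.6 0.2 0.1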
Created on 2019-09-20 by the reprex package (v0.3.0)
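The difference between the two answers can be checked side by side. This is a minimal sketch; the variable names nr, nc, approx_mat and exact_mat and the seed are assumptions made here, and round() is used for the counts on the assumption that the proportions multiply out to whole numbers of cells.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

nr  <- 20
nc  <- 20
val <- -1:2
p   <- c(0.1, 0.6, 0.2, 0.1)

# prob-based sampling: each cell is drawn independently, so the
# realised proportions only approximate p
approx_mat <- matrix(sample(val, nr * nc, replace = TRUE, prob = p), nr, nc)

# count-based filling: fix the number of cells per value, then shuffle
counts    <- round(p * nr * nc)
exact_mat <- matrix(sample(rep(val, times = counts)), nr, nc)

table(approx_mat)  # counts vary from run to run
table(exact_mat)   # counts match p * nr * nc
```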

Filter_all with differing condition for each column

I have the following vector
vec1 = c(0.001, 0.05, 0.003, 0.1)
and a data frame
df = data_frame( x = seq(0.001, 0.1, length.out = 10), y = seq(0.03, 0.07, length.out = 10), z = seq(0, 0.005, length.out = 10), w = seq(0.05, 0.25, length.out = 10))
I would like to filter df such that the output would contain the rows of df for which, in each column, the minimum value would be the corresponding value of vec1 - 0.05, and the maximum would be vec1 + 0.05.
So in this example, only the first 4 rows satisfy this condition (in x I allow -0.049 to 0.051 based on the first entry of vec1, in y I allow 0 to 0.1 based on the second entry, and so on).
I am sure this can be done with filter_all and (.), something along the lines of
filter_all(df, all_vars(. >= (vec1(.) - 0.05) & . <= (vec1(.) + 0.05))))
But this doesn't work.
What am I doing wrong?
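We can use mapply to pair each column of the data frame with the corresponding element of vec1, check which values satisfy the criteria, and keep only the rows where every column is TRUE. Note that the comparison below uses strict inequalities, so a value lying exactly on a bound (such as w = 0.05 in the first row) is excluded; use >= and <= to include the bounds.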
df[rowSums(mapply(function(x, y) x > (y-0.05) & x < (y+0.05),
df, vec1)) == ncol(df), ]
# x y z w
# <dbl> <dbl> <dbl> <dbl>
#1 0.0120 0.0344 0.000556 0.0722
#2 0.0230 0.0389 0.00111 0.0944
#3 0.0340 0.0433 0.00167 0.117
#4 0.0450 0.0478 0.00222 0.139
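The same row filter can be sketched in base R with sweep(), which subtracts vec1 from the columns in one step. Two choices here are assumptions rather than part of the answer above: the data is held in a plain data.frame instead of a tibble, and the bounds are inclusive (<=), so rows whose values sit exactly on a bound may be kept where the strict version drops them.

```r
vec1 <- c(0.001, 0.05, 0.003, 0.1)
df <- data.frame(
  x = seq(0.001, 0.1, length.out = 10),
  y = seq(0.03, 0.07, length.out = 10),
  z = seq(0, 0.005, length.out = 10),
  w = seq(0.05, 0.25, length.out = 10)
)

# absolute distance of every cell from its column's target value;
# sweep() subtracts by default (FUN = "-")
dev <- abs(sweep(as.matrix(df), 2, vec1))

# keep rows where every column is within 0.05 of its target
keep <- rowSums(dev <= 0.05) == ncol(df)
df[keep, ]
```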

Approximate match (analogue of all.equal for identical)?

Consider:
(tmp1 <- seq(0, 0.2, 0.01)[16])
# [1] 0.15
(tmp2 <- seq(0, 0.2, 0.05)[4])
# [1] 0.15
and
identical(tmp1, tmp2)
# [1] FALSE
all.equal(tmp1, tmp2) # test for 'near' equality
# [1] TRUE
The underlying reason is to do with floating point precision. However, this leads to a problem when trying to identify subsequences within sequences using match, for example:
match(seq(0, 0.2, 0.05), seq(0, 0.2, 0.01))
# [1] 1 6 11 NA 21
Is there an alternative to match that is the analogue of all.equal for identical?
We can write a custom match called near.match, inspired by dplyr::near:
near.match <- function(x, y, tol = .Machine$double.eps^0.5){
  sapply(x, function(i){
    res <- which(abs(y - i) < tol, arr.ind = TRUE)[1]
    if(length(res)) res else NA_integer_
  })
}
near.match(seq(0, 0.2, 0.05), seq(0, 0.2, 0.01))
# [1] 1 6 11 16 21
near.match(c(seq(0, 0.2, 0.05), 0.3), seq(0, 0.2, 0.01))
# [1] 1 6 11 16 21 NA
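The tolerance-based matching above can also be sketched with outer(), which compares every element of x with every element of y at once. The function name near_match_outer is hypothetical, and the approach builds a length(x) by length(y) matrix, so it is only an option when both vectors are reasonably small.

```r
# Hypothetical alternative to near.match(): compare all pairs at once.
near_match_outer <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  # logical matrix: entry [i, j] is TRUE when x[i] and y[j] differ by less than tol
  is_close <- abs(outer(x, y, "-")) < tol
  # position of the first TRUE in each row; NA when a row has none
  apply(is_close, 1, function(row) which(row)[1])
}

near_match_outer(seq(0, 0.2, 0.05), seq(0, 0.2, 0.01))
# [1]  1  6 11 16 21
near_match_outer(c(seq(0, 0.2, 0.05), 0.3), seq(0, 0.2, 0.01))
# [1]  1  6 11 16 21 NA
```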

Interpolate missing values of a data frame

I have a dataset like this:
x y z
1 1 0.954
1 3 0.134
1 30 0.123
2 1 0.425
2 3 0.123
2 30 0.865
5 1 0.247
5 3 0.654
5 30 0.178
Let's think of this as the height of a surface sampled at 9 points over a 4x29 field. Suppose I want to fill in the missing values by interpolating (linear is fine), so that I end up with a z value for every (integer) x in [1,5] and every y in [1,30]. I want the result to still be a data frame with the same structure.
How can I do this in R?
I'll take the previous lack of answer as a gift :)
#akima_0.5-12
library(akima)
my_df <- data.frame(
x = c(rep(1, 3), rep(2, 3), rep(5, 3)),
y = rep(c(1, 3, 30), 3),
z = c(0.954, 0.134, 0.123, 0.425, 0.123, 0.865, 0.247, 0.654, 0.178)
)
my_op <- interp(
x = my_df$x,
y = my_df$y,
z = my_df$z,
xo = 1:5, # vector of x coordinates to use in interpolation
yo = 1:30, # vector of y coordinates to use in interpolation
linear = TRUE # default interpolation method
)
my_op$z # matrix of interpolated z values; (row, col) correspond to (x, y)
ind <- which(!is.na(my_op$z), arr.ind = TRUE) # grid cells that could be interpolated
desired_output <- data.frame(
  x = my_op$x[ind[, 1]], # look up coordinates instead of relying on index == value
  y = my_op$y[ind[, 2]],
  z = my_op$z[ind] # same column-by-column order as ind
)
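Because the nine samples happen to lie on a complete 3x3 grid (x in 1, 2, 5 and y in 1, 3, 30), the fill can also be sketched in base R with two passes of approx(). This is bilinear interpolation, which is a different technique from akima's triangulation-based linear interpolation, so values between the sampled points will generally not match the output of interp(). The names zy, zfull and filled are assumptions made for illustration.

```r
my_df <- data.frame(
  x = rep(c(1, 2, 5), each = 3),
  y = rep(c(1, 3, 30), 3),
  z = c(0.954, 0.134, 0.123, 0.425, 0.123, 0.865, 0.247, 0.654, 0.178)
)
xo <- 1:5
yo <- 1:30
xs <- sort(unique(my_df$x))  # sampled x positions: 1, 2, 5

# Pass 1: for each sampled x, interpolate z along y.
# Result is a 30 x 3 matrix (rows = y, columns = sampled x).
zy <- sapply(split(my_df, my_df$x), function(d) approx(d$y, d$z, xout = yo)$y)

# Pass 2: for each y, interpolate across x.
# Result is a 5 x 30 matrix (rows = x, columns = y).
zfull <- apply(zy, 1, function(zrow) approx(xs, zrow, xout = xo)$y)

# Back to long format; expand.grid varies x fastest, matching as.vector(zfull).
filled <- expand.grid(x = xo, y = yo)
filled$z <- as.vector(zfull)
head(filled)
```

The original nine points are reproduced exactly, and every other cell is a weighted average of its neighbouring samples.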
