Creating mock data with natural decreasing numbers - r

I want to create random mock data that looks like this:
ID Amount
1 20
1 14
1 9
1 3
2 11
2 5
2 2
Each ID starts from a random number, but the second number with the same ID should be less than the first one, and the third number has to be less than the second one. The maximum starting number should be 20.

You can just create the data first and then sort it as needed, using the tidyverse:
library(dplyr)
set.seed(0)
df <- data.frame(id = rep(1:3, 10), amt = sample(1:20, 30, replace = TRUE))
# sort by id, then by decreasing amount within each id
df %>%
group_by(id) %>%
arrange(id, desc(amt))

This is a tricky one. If you want the Amount column to hold truly random values, you can write a function that calls sample recursively:
## Recursively sampling from a uniform distribution
recursive.sample <- function(start, end, length, results = NA, counter =0) {
## To enter the recursion, counter must be smaller than the requested length
## and the last result must be larger than the starting point (except the first time)
if(counter < length && ifelse(counter != 0, results[counter] > start, TRUE)){
## Increment the counter
counter <- counter + 1
## Sample between start and the last result or the start and the end of the vector
results[counter] <- ifelse(counter != 1, sample(start:results[counter-1], 1), sample(start:end, 1))
## Recursive call
return(recursive.sample(start = start, end = end, length = length, results = results, counter = counter))
} else {
## Exit the recursion
return(results)
}
}
## Example
set.seed(0)
recursive.sample(start = 1, end = 20, length = 3, results = NA, counter = 0)
#[1] 18 5 2
Alternatively (and way easier) you can use sort(sample()):
set.seed(0)
sort(sample(1:20, 3), decreasing = TRUE)
#[1] 18 7 6
Note that the results differ due to the lower probability of sampling higher values in the recursive function.
You can then easily create your table with your chosen function as follows:
set.seed(123)
## The ID column
ID <- c(rep(1, 4), rep(2,3))
## The Amount column
Amount <- c(recursive.sample(1, 20, 4, NA, 0), recursive.sample(1, 11, 3, NA, 0))
## The table
cbind(ID, Amount)
# ID Amount
#[1,] 1 18
#[2,] 1 5
#[3,] 1 2
#[4,] 1 2
#[5,] 2 10
#[6,] 2 3
#[7,] 2 3
Or, again, with the simple sort(sample()) function for a higher probability of picking larger numbers.
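As a minimal sketch of that sort(sample()) route, the whole table can be built in one pass. The sizes vector (rows per ID) and the name mock are assumptions made for illustration. Because sample() draws without replacement by default, the values within each ID are strictly decreasing.

```r
set.seed(42)
sizes <- c(4, 3)  # hypothetical: number of rows wanted for ID 1 and ID 2
ID <- rep(seq_along(sizes), times = sizes)
# one sorted draw without replacement per ID, capped at 20
Amount <- unlist(lapply(sizes, function(n) sort(sample(1:20, n), decreasing = TRUE)))
mock <- data.frame(ID, Amount)
mock
```

Sampling without replacement limits each ID to at most 20 rows here, which fits the stated maximum of 20.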

Here are two methods, one using dplyr and one using only base R functions. They differ slightly from the two previous solutions.
I used a sorted ID column, but this is not necessary.
Method 1
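rm(list = ls())
library(dplyr)
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
# replace = TRUE allows ties within an ID; drop it for strictly decreasing values
df %>% group_by(ID) %>%
mutate(Amount = sort(sample(1 : 20, n(), replace = TRUE), decreasing = TRUE))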
Method 2
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
uniq_ID <- unique(df$ID)
index_lst <- lapply(uniq_ID, function(x) which(df$ID == x))
res <- lapply(index_lst, function(x) sort(sample(1 : 20, length(x)),
decreasing = TRUE))
df$Amount[unlist(index_lst)] <- unlist(res)
Method 2.5
This is more convoluted than the 2nd method.
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
tab <- as.data.frame(table(df$ID))
lapply(1 : nrow(tab), function(x) df$Amount[which(df$ID == tab$Var1[x])] <<-
sort(sample(1 : 20, tab$Freq[x]), decreasing = TRUE))
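Another base R possibility, not part of the methods above, is ave(), which applies a function within each group and writes the results back in the original row order. This is only a sketch under the same setup (three IDs with five rows each).

```r
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
# ave() splits df$ID by group, applies FUN to each piece, and reassembles the result
df$Amount <- ave(df$ID, df$ID,
                 FUN = function(x) sort(sample(1:20, length(x)), decreasing = TRUE))
df
```

Since the values are assigned by group membership, this also works when the ID column is not sorted.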


Replace integers in a data frame column with other integers in R?

I want to replace the values of a data frame column that contains only 4 distinct numbers with specific new numbers, as shown below.
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to map 1 to 10, 2 to 200, 3 to 458, and 4 to -0.1.
You could use recode from dplyr. Note that the old values are written as character strings, and the new values are numeric (double), like the original column:
library(tidyverse)
df %>%
mutate(tt = recode(tt, '1'= 10, '2' = 200, '3' = 458, '4' = -0.1))
tt
1 10.0
2 10.0
3 200.0
4 200.0
5 458.0
6 458.0
7 -0.1
8 -0.1
To correct the error in the code in the question and provide for a shorter example we use the input in the Note at the end. Here are several alternatives. nos defined in (1) is used in some of the others too. No packages are used.
1) indexing To get the result since the input is 1 to 4 we can use indexing. This is probably the simplest solution given that the original values of tt are in 1:4.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization
transform(df, tt = nos[match(tt, 1:4)])
2) arithmetic Another approach is to use arithmetic:
transform(df, tt = 10 * (tt == 1) +
200 * (tt == 2) +
458 * (tt == 3) +
-0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
factor(levels = 1:4, labels = nos) |>
format() |>
as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
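To illustrate the match()-based generalization in (1a) when the old values are not 1:4, here is a small sketch. The keys vector and df2 are hypothetical names and values chosen for the example. Any value that is not among the keys comes back as NA.

```r
keys <- c(5, 7, 9, 11)          # hypothetical old values
nos  <- c(10, 200, 458, -0.1)   # replacements, in the same order as keys
df2  <- data.frame(tt = c(5, 5, 7, 9, 11, 3))
# match() returns the position of each tt in keys (NA when absent)
df2$new <- nos[match(df2$tt, keys)]
df2
```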

Efficient Monte Carlo simulation over a grid in R

I am running a Monte Carlo simulation of a multinomial logit. I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values, changing both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far I have managed to solve it, but I ended up with a nested for-loop over a grid of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am quite worried about the efficiency of this last bit of code, with the double for-loop running over all combinations of the possible values. There is probably a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I managed to produce multiple objects with informative names that contain the models from each combination in the grid. It would be great if I could collect all of them in a single list, but with the current structure I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can replace the nested loops with the map2 function of the purrr package, which iterates over two vectors in parallel:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
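result <- map2(v1, v2, function(n_i, n_c) {
# mlogit_sim_data() reads n.indiv and n.choices from the global environment
n.indiv <<- n_i
n.choices <<- n_c
lapply(X = 1:nreps, FUN = mlogit_sim_data)
})
names(result) <- paste0("m_ind_", v1, "_choices_", v2)
result will be a named list with one element per (n.indiv, n.choices) pair, each holding the nreps outputs of your function.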
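The pairs can also be generated with expand.grid(), and the grid values can be passed as arguments instead of being read from global variables. The sketch below uses a hypothetical stand-in, sim_one(), in place of the real model-fitting function so that it runs without mlogit. The grid values and nreps are shortened for illustration.

```r
# Hypothetical stand-in for the data-generating and estimation function
sim_one <- function(n.indiv, n.choices) {
  mean(rnorm(n.indiv * n.choices))
}

# All combinations of the two grid dimensions (first column varies fastest)
grid <- expand.grid(n.indiv = c(1, 15, 100), n.choices = c(1, 2, 4))
nreps <- 3
set.seed(777)

# One list element per grid row, each holding nreps replications
results <- Map(function(ni, nc) replicate(nreps, sim_one(ni, nc), simplify = FALSE),
               grid$n.indiv, grid$n.choices)
names(results) <- paste0("m_ind_", grid$n.indiv, "_choices_", grid$n.choices)
names(results)
```

This iterates rather than vectorizes, so it is unlikely to be faster than the double loop. The gain is a single named list instead of many separately assigned objects.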

Test if a given numeric is comprised in a set of intervals defined in a dataframe

I have a dataframe which includes 2 columns, let's say "left" and "right", which define intervals. I want to test if a given numeric "x" is part of any interval defined by the dataframe (if it is, it should be only once, those intervals don't overlap). Expected behaviour:
> df <- data.frame(id = c("A", "B", "C"), left = c(0, 50, 150), right = c(15, 78, 190))
> df
id left right
1 A 0 15
2 B 50 78
3 C 150 190
> my_function(7)
TRUE
> my_function(20)
FALSE
So I did it this way, but it's terribly slow and I'm pretty sure this could be optimized:
my_function <- function(x) {
test <- df %>% dplyr::rowwise() %>% dplyr::mutate(test = (x >= left) && (x <= right)) %>% ungroup()
test <- test %>% filter(test == T)
nrow(test) == 1
}
Then I'd be interested in getting the matching row in case the output is TRUE, but with the current function it'll take forever (the actual dataframe has ~5,000 rows, and I want to test/get coordinates for thousands of x values).
I found a library that manages interval objects, but it seems to be tailored for time intervals. Any suggestions?
Here is a simple way with an example:
z <- 567 # single dummy value
left <- x1 <- seq(100, 900, 200)
right <- seq(200, 1000, 200)
df <- data.frame(left, right) # dummy intervals
lo <- z >= df$left
hi <- z <= df$right
check <- lo * hi
introw <- which(check == 1)
introw
3
z2 <- c(356, 934, 134, 597, 771) # vector of values to check
lo2 <- sapply(z2, function(x) x >= df$left)
hi2 <- sapply(z2, function(x) x <= df$right)
check2 <- lo2 * hi2
introws <- apply(check2, 2, function(x) which(x ==1))
introws # vector of interval indices, one per input value
2 5 1 3 4
final <- cbind(value = z2, interval = introws)
final
value interval
[1,] 356 2
[2,] 934 5
[3,] 134 1
[4,] 597 3
[5,] 771 4
Try this approach using between():
#Code
my_function <- function(x) {
test <- df %>% dplyr::rowwise() %>%
dplyr::mutate(test = between(x,left,right)) %>% ungroup()
test <- test %>% filter(test == T)
nrow(test) == 1
}
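Since the intervals do not overlap, a vectorized lookup with base R's findInterval() avoids the row-wise pass entirely. The sketch below assumes the left column is sorted in increasing order. The function name match_interval is made up for this example. It returns the matching row index for each x, or NA when x falls in no interval.

```r
df <- data.frame(id = c("A", "B", "C"), left = c(0, 50, 150), right = c(15, 78, 190))

match_interval <- function(x, intervals) {
  # index of the last interval whose left bound is <= x (0 if there is none)
  idx <- findInterval(x, intervals$left)
  hit <- idx > 0
  # keep only candidates whose right bound is not exceeded
  hit[hit] <- x[hit] <= intervals$right[idx[hit]]
  ifelse(hit, idx, NA_integer_)
}

rows <- match_interval(c(7, 20, 60, 190, 200), df)
rows           # row index per value, NA when no interval matches
!is.na(rows)   # TRUE/FALSE membership test
df[rows[!is.na(rows)], ]  # the matching rows themselves
```

Because findInterval() uses a binary search, this should scale comfortably to thousands of intervals and thousands of x values.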

R: find consecutive occurrence of a number

First, define some functions to bind the elements of a list row-wise and column-wise:
# a function to append vectors row wise
rbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- rbind(res, list[[i]])
return(res)
}
cbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- cbind(res, list[[i]])
return(res)
}
# generate sample data
sample.dat <- list()
set.seed(123)
for(i in 1:365){
vec1 <- sample(c(0,1), replace=TRUE, size=5)
sample.dat[[i]] <- vec1
}
dat <- rbindlist(sample.dat)
dat has five columns. Each column is a location and has 365 days of the year (365 rows) with values 1 or 0.
I have another dataframe (see below) which has certain days of the year for each column (location) in dat.
# generate second sample data
set.seed(123)
sample.dat1 <- list()
for(i in 1:5){
vec1 <- sort(sample(c(258:365), replace=TRUE, size=4), decreasing = F)
sample.dat1[[i]] <- vec1
}
dat1 <- cbindlist(sample.dat1)
I need to use dat1 to subset days in dat to do a calculation. An example below:
1) For location 1 (first column in both dat1 and dat):
In column 1 of dat, select the days from 289 till 302 (using dat1), find the longest consecutive occurrence of 1.
Repeat it and this time select the days from 303 (302 + 1) till 343 from dat, find the longest consecutive occurrence of 1.
Repeat it for 343 till 353: select the days from 344 (343 + 1) till 353, find the longest consecutive occurrence of 1.
2) Do this for all the columns
If I want to do sum of 1s, I can do this:
library(tibble) # as_tibble() replaces the deprecated as.tibble()
library(purrr) # pmap()
dat <- as_tibble(dat)
dat1 <- as_tibble(dat1)
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
sum.range1 <- sum(..1[range1:range2]) # this will generate sum between range 1 and range 2
sum.range2 <- sum(..1[range2:range3]) # this will generate sum between range 2 and range 3
sum.range3 <- sum(..1[range3:range4]) # this will generate sum between range 3 and range 4
c(sum.range1=sum.range1,sum.range2=sum.range2,sum.range3=sum.range3)
})
For the longest consecutive occurrence of 1 within each range, I thought of using the rle function. Example below:
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
spell.range1 <- rle(..1[range1:range2]) # run-length encoding: lengths of runs of ANY value (0 OR 1)
spell.1.range1 <- tapply(spell.range1$lengths, spell.range1$values, max)[2] # this should select the maximum consecutive run of 1
spell.range2 <- rle(..1[range2:range3]) # run-length encoding: lengths of runs of ANY value (0 OR 1)
spell.1.range2 <- tapply(spell.range2$lengths, spell.range2$values, max)[2] # this should select the maximum consecutive run of 1
spell.range3 <- rle(..1[range3:range4]) # run-length encoding: lengths of runs of ANY value (0 OR 1)
spell.1.range3 <- tapply(spell.range3$lengths, spell.range3$values, max)[2] # this should select the maximum consecutive run of 1
c(spell.1.range1 = spell.1.range1, spell.1.range2 = spell.1.range2, spell.1.range3 = spell.1.range3)
})
I get an error, which I think is because I am not using the rle function properly here. I would really like to keep the code as above, since my other code follows the same pattern and the format of the output suits my needs, so I would appreciate it if someone could suggest how to fix it.
OP's code does work for me. So, without a specific error message it is impossible to understand why the code is not working for the OP.
However, the sample datasets created by the OP are matrices (before they were coerced to tibble) and I felt challenged to find a way to solve the task in base R without using purrr:
To find the length of the longest run of consecutive occurrences of a particular value val in a vector x, we can use the following function:
max_rle <- function(x, val) {
y <- rle(x)
len <- y$lengths[y$values == val]
if (length(len) > 0) max(len) else NA
}
Examples:
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 1)
[1] 4
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 0)
[1] 2
# find consecutive occurrences in column batches
lapply(seq_len(ncol(dat1)), function(col_num) {
start <- head(dat1[, col_num], -1L)
end <- tail(dat1[, col_num], -1L) - 1
sapply(seq_along(start), function(range_num) {
max_rle(dat[start[range_num]:end[range_num], col_num], 1)
})
})
[[1]]
[1] 8 4 5
[[2]]
[1] 4 5 2
[[3]]
[1] NA 3 4
[[4]]
[1] 5 5 4
[[5]]
[1] 3 2 3
The outer lapply() loops over the columns of dat and dat1. The inner sapply() loops over the row ranges stored in dat1 and subsets dat accordingly.
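To see what max_rle() relies on, it can help to inspect the two components that rle() returns for the example vector used above. This is only an illustrative sketch. The variable names runs and longest_ones are not from the original answer.

```r
x <- c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1)
runs <- rle(x)
runs$lengths  # length of each run, in order of appearance
runs$values   # the value repeated in each run
# keep only the runs of 1 and take the longest
longest_ones <- max(runs$lengths[runs$values == 1])
longest_ones
```

Filtering lengths by values before taking the maximum is what restricts the result to runs of the requested value.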

How to use R for multiple select questions?

I am trying to figure out how to analyze multiple select/multiple responses (i.e., 'select all that apply') questions in a survey I recently conducted.
SPSS has nice capabilities for analyzing online survey data and these types of questions, so I am guessing that R has that and more. Dealing with these survey answers is a bit tricky in Excel. For example: show me a histogram/distribution by age of everyone who likes strawberry and chocolate ice cream.
How do I structure the data set and what would be the commands to perform some basic tabulations of frequency, pareto, and logical AND OR functions?
I've not found anything that is quite as convenient as the multiple response sets in SPSS. However, you can create groups relatively easily based on common column names, and then use any of the apply() function or friends to iterate through each group. Here's one approach using adply() from the plyr package:
library(plyr)
set.seed(1)
#Fake data with three "like" questions. 0 = not selected, 1 = selected
dat <- data.frame(resp = 1:10,
like1 = sample(0:1, 10, TRUE),
like2 = sample(0:1, 10, TRUE),
like3 = sample(0:1, 10, TRUE)
)
adply(dat[grepl("like", colnames(dat))], 2, function(x)
data.frame(Count = as.data.frame(table(x))[2,2],
Perc = as.data.frame(prop.table(table(x)))[2,2]))
#-----
X1 Count Perc
1 like1 6 0.6
2 like2 5 0.5
3 like3 3 0.3
I recently wrote a quick function to deal with these. You can easily modify it to add proportion of total responses too.
set.seed(1)
dat <- data.frame(resp = 1:10,
like1 = sample(0:1, 10, TRUE),
like2 = sample(0:1, 10, TRUE),
like3 = sample(0:1, 10, TRUE))
The function:
multi.freq.table = function(data, sep="", dropzero=FALSE, clean=TRUE) {
# Takes boolean multiple-response data and tabulates it according
# to the possible combinations of each variable.
#
# See: http://stackoverflow.com/q/11348391/1270695
counts = data.frame(table(data))
N = ncol(counts)
counts$Combn = apply(counts[-N] == 1, 1,
function(x) paste(names(counts[-N])[x],
collapse=sep))
if (isTRUE(dropzero)) {
counts = counts[counts$Freq != 0, ]
}
if (isTRUE(clean)) {
counts = data.frame(Combn = counts$Combn, Freq = counts$Freq)
}
counts
}
Apply the function:
multi.freq.table(dat[-1], sep="-")
# Combn Freq
# 1 1
# 2 like1 2
# 3 like2 2
# 4 like1-like2 2
# 5 like3 1
# 6 like1-like3 1
# 7 like2-like3 0
# 8 like1-like2-like3 1
Hope this helps! Otherwise, show some examples of desired output or describe some features, and I'll see what can be added.
Update
After looking at the output of SPSS for this online, it seems like the following should do it for you. This is easy enough to wrap into a function if you need to use it a lot.
data.frame(Freq = colSums(dat[-1]),
Pct.of.Resp = (colSums(dat[-1])/sum(dat[-1]))*100,
Pct.of.Cases = (colSums(dat[-1])/nrow(dat[-1]))*100)
# Freq Pct.of.Resp Pct.of.Cases
# like1 6 42.85714 60
# like2 5 35.71429 50
# like3 3 21.42857 30
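For the logical AND/OR part of the question, ordinary logical vectors on the 0/1 columns are usually enough. The sketch below uses a small made-up data set. The column names like1, like2 and age, and all the values, are assumptions for illustration.

```r
# Hypothetical survey: two "select all that apply" items plus age
survey <- data.frame(like1 = c(1, 0, 1, 1, 0, 1),
                     like2 = c(1, 1, 0, 1, 0, 0),
                     age   = c(23, 35, 41, 29, 52, 38))

both   <- survey$like1 == 1 & survey$like2 == 1  # logical AND
either <- survey$like1 == 1 | survey$like2 == 1  # logical OR

n_both    <- sum(both)
n_either  <- sum(either)
ages_both <- survey$age[both]  # ages of respondents who selected both items
c(n_both = n_both, n_either = n_either)
```

The ages_both vector can then be passed to hist() or table() to get the distribution by age.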
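# Usage (run after defining the function below): data_set is the data frame, "Banner" the question prefix
multfreqtable(data_set, "Banner")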
multfreqtable = function(data, question.prefix) {
z = length(question.prefix)
temp = vector("list", z)
for (i in 1:z) {
a = grep(question.prefix[i], names(data))
b = sum(data[, a] != 0)
d = colSums(data[, a] != 0)
e = sum(rowSums(data[,a]) !=0)
f = as.numeric(c(d, b))
temp[[i]] = data.frame(question = c(sub(question.prefix[i],
"", names(d)), "Total"),
freq = f,
percent_response = (f/b)*100,
percent_cases = (f/e)*100 )
names(temp)[i] = question.prefix[i]
}
temp
}
This function does a very good job of reporting counts, along with percentages at the level of cases and at the level of responses. That makes it well suited to analyzing multi-response questions.
