R behavior of mutate and rnorm

Hello, I have the following code from a course:
library(tidyverse)
library(dslabs)
data("polls_us_election_2016")
head(results_us_election_2016)
results_us_election_2016 %>% arrange(desc(electoral_votes)) %>% top_n(5, electoral_votes)
# Computing the average and standard deviation for each state
polls <- polls_us_election_2016 %>%
  filter(state != "U.S." &
           !grepl("CD", state) &
           enddate >= "2016-10-31" &
           (grade %in% c("A+", "A", "A-", "B+") | is.na(grade))) %>%
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
  group_by(state) %>%
  summarize(avg = mean(spread), sd = sd(spread), n = n()) %>%
  mutate(state = as.character(state))
# joining electoral college votes and results
results <- left_join(polls, results_us_election_2016, by="state")
head(results)
# states with no polls: note Rhode Island and District of Columbia = Democrat
results_us_election_2016 %>% filter(!state %in% results$state)
# assigns sd to states with just one poll as median of other sd values
results <- results %>%
mutate(sd = ifelse(is.na(sd), median(results$sd, na.rm = TRUE), sd))
#Calculating the posterior mean and posterior standard error
mu <- 0
tau <- 0.02
results %>% mutate(sigma = sd/sqrt(n),
B = sigma^2/ (sigma^2 + tau^2),
posterior_mean = B*mu + (1-B)*avg,
posterior_se = sqrt( 1 / (1/sigma^2 + 1/tau^2))) %>%
arrange(abs(posterior_mean))
#Monte Carlo simulation of Election Night results (no general bias)
mu <- 0
tau <- 0.02
clinton_EV <- replicate(1000, {
results %>% mutate(sigma = sd/sqrt(n),
B = sigma^2/ (sigma^2 + tau^2),
posterior_mean = B*mu + (1-B)*avg,
posterior_se = sqrt( 1 / (1/sigma^2 + 1/tau^2)),
simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se),
clintonvotes = ifelse(simulated_result > 0, electoral_votes, 0)) %>% # award votes if Clinton wins state
summarize(clinton = sum(clintonvotes)) %>% # total votes for Clinton
.$clinton + 7 # 7 votes for Rhode Island and DC
})
mean(clinton_EV > 269) # over 269 votes wins election
I don't understand how this line works:
simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se)
length(posterior_mean) = 47, so rnorm should return a vector of size 47.
When I replace length(posterior_mean) with 1, each state gets the same result from rnorm, although posterior_mean and posterior_se are different for each state. When I change it to 46, I get an error.
So it seems to me that this line fills the whole column simulated_result (perhaps 47 times with the same results?). I would have expected that mutate uses only the values of each row to manipulate that particular row.
Can someone explain this behavior to me or point me to a resource where it is explained?

For the rnorm function, if you check the documentation (?rnorm):
rnorm(n, mean = 0, sd = 1)
Arguments (x, q, and p belong to dnorm, pnorm, and qnorm on the same help page; rnorm uses n, mean, and sd):
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If length(n) > 1, the length is taken to be the number required.
mean: vector of means.
sd: vector of standard deviations.
There are two ways to use it. In the first, you generate a vector of length n drawn from a normal distribution with the same mean and sd for every entry, for example:
set.seed(111)
rnorm(10,0,1)
[1] 0.2352207 -0.3307359 -0.3116238 -2.3023457 -0.1708760 0.1402782 -1.4974267 -1.0101884
[9] -0.9484756 -0.4939622
If you provide a vector that is as long as n, you are specifying the mean and sd for each entry, for example:
set.seed(111)
rnorm(10,1:10,1:10)
[1] 1.23522071 1.33852826 2.06512853 -5.20938263 4.14561978 6.84166935 -3.48198659 -0.08150735
[9] 0.46371956 5.06037783
In this case, you generate a vector of 10 random normal variables: the first entry comes from mean = 1, sd = 1, the 2nd entry from mean = 2, sd = 2, and so on. We can also do something in between:
set.seed(111)
rnorm(10, 1:10, 1)
[1] 1.235221 1.669264 2.688376 1.697654 4.829124 6.140278 5.502573 6.989812 8.051524 9.506038
In this case, it returns a vector of length 10, the first entry coming from mean = 1, sd = 1, the 2nd from mean = 2, sd = 1, and we can visualize this by re-running:
t(replicate(10,rnorm(10,1:10,1)))
It's not entirely clear what you replaced with 1, but essentially mutate assigns a whole column at once: the expression is evaluated against the full column vectors, not row by row. The simulated_result column works like the examples above, with one draw per row using that row's posterior_mean and posterior_se.
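For a concrete illustration, here is a minimal sketch with a toy tibble (the column names mirror the course code, but the data are made up):
library(tidyverse)
# Toy data: one row per "state" with its own posterior mean and se
toy <- tibble(state = c("A", "B", "C"),
              posterior_mean = c(-0.02, 0.00, 0.05),
              posterior_se   = c(0.01, 0.02, 0.03))
set.seed(1)
toy %>%
  mutate(simulated_result = rnorm(n(), posterior_mean, posterior_se))
# n() is the number of rows (3 here, 47 in the course data), which is why
# length(posterior_mean) works: the whole column is passed to rnorm, so each
# row gets one draw from its own mean/sd. Passing 1 instead produces a single
# draw that mutate recycles to every row, and 46 errors because the result
# must have length 1 or the number of rows.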

Related

How to Perform Statistical Two-Sided Test for Independence (on Proportion) in R?

I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi Square in which I supply a dataset column for a numerator, another column for denominator, and a categorical variable to distinguish distributions (A/B).
However, I am getting unexpected values in R using some example sets. When I test two similar populations with low sample sizes, I am getting p-values of (approximately) zero, where I would expect the p-values to be very high (~ 1).
My test set is below, where I went with sugar content in a batch of water: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar, where this isn't a pass-fail type test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
set.seed(108)
group_A = tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B = tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then do a summarize to calculate the overall sugar percentage tendency between groups:
# SUMMARY TOTALS
totals <- batches %>%
group_by(group) %>%
summarize(batch_count = n(),
batch_lbs_sum = sum(batch_lbs),
sugar_lbs_sum = sum(sugar_lbs),
sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
glimpse()
I then supply the sugar percentage between groups to a prop.test, expecting a high p-value:
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
rowwise() %>%
summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
mutate(p_val = round(p_val, digits = 3)) %>%
mutate(conf_level = 1 - p_val) %>%
select(p_val, conf_level) %>%
glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
glimpse()
Unfortunately the final table gives me a p-value of 0, suggesting the two nearly-identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using this recommended sample size. This gave me the reverse problem -- I was expecting a low p-value, since I am using the recommended sample size, but instead get a p-value of ~1.
# COMPARE POWER.PROP.TEST NEEDED COUNTS WITH AN ACTUAL PROP.TEST
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS -- is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
---------------------------------
EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE
I think my choice of using normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask, which is how to use prop.test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, and prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but is not easily replicated via code.
STANFORD EXAMPLE (pages 37-50)
- https://web.stanford.edu/class/psych10/schedule/P10_W7L1
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
mutate(city_total = washed + not_washed,
city_percent = washed / city_total) %>%
select(-washed, -not_washed) %>%
glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
The documentation for prop.test says:
Usage:
prop.test(x, n, p = NULL,
          alternative = c("two.sided", "less", "greater"),
          conf.level = 0.95, correct = TRUE)
Arguments:
x: a vector of counts of successes, a one-dimensional table with two entries, or a two-dimensional table (or matrix) with 2 columns, giving the counts of successes and failures, respectively.
n: a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
But from a statistical perspective this is completely the wrong way of doing things. You are directly inducing spurious correlation in a test of the difference between two quantities by arbitrarily dividing by their sum. If the samples (sugar_lbs_sum) are independent but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal 1993 "Spurious correlation and the fallacy of the ratio" covers this.
The data you generated are independent normal, so don't sum them; rather, test for a difference with a t-test.
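As a concrete illustration of that suggestion, a minimal sketch (assuming the batches tibble from the question is still in memory):
# Two-sample t-test comparing mean sugar_lbs between groups A and B;
# the same call with batch_lbs ~ group would compare the batch weights.
t.test(sugar_lbs ~ group, data = batches)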
The Stanford link I added to my original post answered my question. I modified the Stanford example to simply rename the variables from city to group, and the washed counts to sugar_lbs. I also doubled one batch (analogous to comparing a small versus a large city). I now get the expected high p-value (0.65), indicating no statistically significant evidence that the proportions are different.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero indicating one of the groups is different, as expected.
Finally, when doing the prop.test within a dplyr pipe, I found I should not have used the rowwise() step, which causes my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't yet know which group is different until I compare only 2 groups at a time iteratively.
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCHED
#--------------------------------------------------------
df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
mutate(batch_lbs = sugar_lbs + water_lbs,
sugar_percent = sugar_lbs / batch_lbs) %>%
glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = F)
# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value
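As a possible shortcut for the "which group is different" follow-up, base R's stats package has pairwise.prop.test(), which runs every pairwise two-sample proportion test with a multiplicity adjustment. A minimal sketch with hypothetical counts (not the data above):
# Hypothetical successes and totals for three groups, for illustration only
successes <- c(495, 970, 500)
totals    <- c(1476, 2951, 1480)
pairwise.prop.test(x = successes, n = totals, p.adjust.method = "holm")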

R: How do I simulate time series data in replicate until some conditions are met

I want to simulate time series data that follow an AR(1) model with phi = 0.6. After the first simulated trial I check whether the series follows AR(1); if not, I make a second trial and average the two trials to form the series. I keep testing the order of the averaged series, adding one trial at a time, until it conforms to an AR(1) model.
After that, I check whether the estimated AR(1) coefficient equals phi = 0.6; if not, I again add one trial at a time until it does.
MWE:
library(FitAR)
n = 50
a = 0.6
count = 0
e <- rnorm(n + 100)
x <- double(n + 100)
x[1] <- rnorm(1)
for (i in 2:(n + 100)) {
  x[i] <- a * x[i-1] + e[i]
}
x <- ts(x[-(1:100)])
p = SelectModel(x, lag.max = 14, Criterion = "BIC", Best = 1)
if (p >= 2) {
  count <- count + 1
  mat <- replicate(count, x)
  x <- as.ts(rowMeans(mat))
}
fit = arima(x, order = c(p, 0, 0))
my_coef = fit$coef
if (my_coef != 0.6) {
  mat <- replicate(count + 1, x)
  x <- as.ts(rowMeans(mat))
}
my_coefficients = my_coef[!names(my_coef) == 'intercept']
print(my_coefficients)
print(paste0("AR(2) model count is: ", count_coef))
We can generate time series data for 100 time points of a white noise ARIMA(0, 0, 0) process with zero mean and standard deviation sd = 2 in the following way
set.seed(2020)
ts <- arima.sim(model = list(), n = 100, sd = 2)
This is explained in the documentation, ?arima.sim:
Usage:
arima.sim(model, n, rand.gen = rnorm, innov = rand.gen(n, ...),
          n.start = NA, start.innov = rand.gen(n.start, ...),
          ...)
...: additional arguments for 'rand.gen'. Most usefully, the standard deviation of the innovations generated by 'rnorm' can be specified by 'sd'.
To generate 50 time series, we can use replicate
set.seed(2020)
mat <- replicate(50, arima.sim(model = list(), n = 100, sd = 2))
The resulting object is a matrix with dimensions 100 x 50.
We can confirm that the standard deviation is indeed sd = 2
summary(apply(mat, 2, sd))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#1.669 1.899 2.004 2.006 2.107 2.348
library(tidyverse)
apply(mat, 2, sd) %>%
enframe() %>%
ggplot(aes(value)) +
geom_histogram(bins = 10)
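The same machinery covers the AR(1) case the question asks about; a minimal sketch with phi = 0.6 (my own example, not part of the original answer):
# Simulate an AR(1) series with phi = 0.6 and check the fitted coefficient
set.seed(2020)
x <- arima.sim(model = list(ar = 0.6), n = 500)
fit <- arima(x, order = c(1, 0, 0))
coef(fit)["ar1"]  # should be close to 0.6; sampling error shrinks as n grows
# And 50 replicates, as above
mat <- replicate(50, arima.sim(model = list(ar = 0.6), n = 500))
summary(apply(mat, 2, function(s) coef(arima(s, order = c(1, 0, 0)))["ar1"]))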

Allocating tasks to parallel workers so that expected cost is roughly equal

I have an assignment problem where I'm trying to allocate a number of tasks with a known expected cost (runtime in seconds) to X parallel workers, subject to the constraint that each worker receives the same number of tasks (save for remainders), so that the total expected runtime per worker is roughly equal.
I'm using a data frame that defines the tasks to be executed, and for each task I can calculate a pretty accurate expected cost (runtime in seconds). E.g. something like this:
library("tibble")
set.seed(1232)
tasks <- tibble(task = 1:20, cost = runif(20, min = 1, max = 5)^2)
head(tasks)
#> # A tibble: 6 x 2
#> task cost
#> <int> <dbl>
#> 1 1 22.5
#> 2 2 20.0
#> 3 3 21.3
#> 4 4 8.13
#> 5 5 18.3
#> 6 6 19.6
Created on 2019-11-21 by the reprex package (v0.3.0)
This is then used with foreach::foreach(...) %dopar% ... to execute the tasks in parallel. foreach() splits the tasks into roughly equal-sized groups of size nrow(tasks)/X, where X is the number of parallel workers (cores).
I'm currently shuffling the task list so that the cost is roughly equal for each worker, but there can still be substantial deviations, i.e. some workers get finished much earlier than others and thus it would have been better if they had had some more costly tasks. E.g.:
# shuffle tasks (in the original application cost is not random initially)
tasks <- tasks[sample(1:nrow(tasks)), ]
# number of workers
X <- 4
tasks$worker <- rep(1:X, each = nrow(tasks)/X)
# expected total cost (runtime in s) per worker
sapply(split(tasks$cost, tasks$worker), sum)
#> 1 2 3 4
#> 77.25278 35.25026 66.09959 64.05435
Created on 2019-11-21 by the reprex package (v0.3.0)
The second worker finishes in half the time of the other workers, so its capacity is wasted and the whole job takes longer to finish.
What I'd like to do instead is have a way of re-ordering the task data frame so that when foreach splits it into X groups the total expected cost per group is more even.
I imagine this is a super-well known kind of problem and I just don't know the right verbiage to google (nor how to do it in R). Thanks for any help.
(EDIT) Mostly better alternative
For now, a relatively simple alternative that seems to do better than random shuffling. This orders the tasks by cost, assigns the first X tasks to workers 1 to X, then assigns the next chunk of X tasks in reverse order to workers X to 1, etc (this is "alt1" below).
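A minimal standalone sketch of that snake-draft assignment (the same logic appears as alt1 inside race() further down), using the tasks and X objects from the example above:
# Order by cost, then deal workers out as 1..X, X..1, 1..X, ... down the list
tasks <- tasks[order(tasks$cost), ]
tasks$worker <- rep(c(1:X, X:1), length.out = nrow(tasks))
sapply(split(tasks$cost, tasks$worker), sum)  # totals are now much closer together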
(EDIT2) Added the RcppAlgos method
By Joseph Wood below.
library("tibble")
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("ggplot2")
library("tidyr")
library("RcppAlgos")
getPartitions <- function(df, nWorkers, tol_ratio = 0.0001) {
nSections <- nrow(df) / nWorkers
avg <- sum(df$cost) / nWorkers
tol <- avg * tol_ratio
vec <- df$cost
cond <- TRUE
part <- list()
for (i in 1:(nWorkers - 1)) {
while (cond) {
vals <- comboGeneral(vec, nSections,
constraintFun = "sum",
comparisonFun = "==",
limitConstraints = avg + (tol / 2),
tolerance = tol,
upper = 1)
cond <- nrow(vals) == 0
if (cond) {
tol <- tol * 2
} else {
v <- match(as.vector(vals), df$cost)
}
}
part[[i]] <- v
vec <- df$cost[-(do.call(c, part))]
avg <- sum(vec) / (nWorkers - i)
tol <- avg * tol_ratio
cond <- TRUE
}
part[[nWorkers]] <- which(!1:nrow(df) %in% do.call(c, part))
part
}
race <- function() {
N_TASKS = 100
X = 4
tasks <- tibble(task = 1:N_TASKS, cost = runif(N_TASKS, min = 1, max = 10)^2)
# random shuffle
tasks$worker <- rep(1:X, each = nrow(tasks)/X)
rando <- max(sapply(split(tasks$cost, tasks$worker), sum))
# alternative 1
tasks <- tasks[order(tasks$cost), ]
tasks$worker <- rep(c(1:X, X:1), length.out = nrow(tasks))
alt1 <- max(sapply(split(tasks$cost, tasks$worker), sum))
# modified version of ivan100sic's answer
# sort by descending cost, after initial allocation, allocate costly tasks
# first to workers with lowest total cost so far
group <- factor(rep(1:(ceiling(nrow(tasks)/4)), each = X))
tasks <- tasks[order(tasks$cost, decreasing = TRUE), ]
tasks$worker <- c(1:X, rep(NA, length.out = nrow(tasks) - X))
task_sets <- split(tasks, group)
task_sets[[1]]$worker <- 1:X
for (i in 2:length(task_sets)) {
# get current total cost by worker
total <- task_sets %>%
bind_rows() %>%
filter(!is.na(worker)) %>%
group_by(worker) %>%
summarize(cost = sum(cost)) %>%
arrange(cost)
task_sets[[i]]$worker <- total[["worker"]]
}
tasks <- bind_rows(task_sets)
alt2 <- max(sapply(split(tasks$cost, tasks$worker), sum))
# RcppAlgos by Joseph Wood below
setParts <- getPartitions(tasks, X)
worker <- rep(1:4, each = N_TASKS/X)
row_num <- unsplit(setParts, worker)
tasks$worker <- worker[order(row_num)]
algo <- max(sapply(split(tasks$cost, tasks$worker), sum))
c(ref = sum(tasks$cost) / X, rando = rando, alt1 = alt1, alt2 = alt2, algo = algo)
}
set.seed(24332)
sims <- replicate(1e3, race())
sims <- sims %>%
t() %>%
as_tibble() %>%
pivot_longer(rando:algo, names_to = "Method")
ggplot(sims, aes(x = value, color = Method)) +
geom_density() +
scale_x_continuous(limits = c(0, max(sims$value))) +
labs(x = "Total runtime (s)")
# this shows the estimated runtime relative to average total cost
# per worker (which may be unobtainable)
sims %>%
group_by(Method) %>%
summarize(time_relative_to_ref = mean(value - ref)) %>%
arrange(time_relative_to_ref)
#> # A tibble: 4 x 2
#> Method time_relative_to_ref
#> <chr> <dbl>
#> 1 algo 0.0817
#> 2 alt2 0.307
#> 3 alt1 4.97
#> 4 rando 154.
Created on 2020-02-04 by the reprex package (v0.3.0)
"rando": randomly shuffle the task list
"alt1": sort tasks by cost and alternate assigning to worker 1 to X, X to 1, etc.
"alt2": based on ivan100sic's answer below, after the first allocation to workers 1 to X, allocate based on total cost per worker so far
"algo": based on Joseph Woods's answer below
The following heuristic might give you good results:
Sort all the tasks by cost in descending order. For each task, assign it to the worker which has the minimum total assigned cost so far.
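For illustration, a minimal base R sketch of this greedy rule (my own sketch, operating on the tasks tibble from the question; note it does not enforce the equal-number-of-tasks constraint, which the alt2 variant in the benchmark adds):
X <- 4
ord <- order(tasks$cost, decreasing = TRUE)  # most expensive tasks first
load <- numeric(X)                           # running total cost per worker
worker <- integer(nrow(tasks))
for (i in ord) {
  w <- which.min(load)                       # currently least-loaded worker
  worker[i] <- w
  load[w] <- load[w] + tasks$cost[i]
}
tasks$worker <- worker
sapply(split(tasks$cost, tasks$worker), sum)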
As #JohnColeman points out, this essentially boils down to partitioning. We are trying to partition the tasks equally such that the sum of the cost doesn't vary wildly.
The algorithm below does just that. The main idea is to successively find a set of tasks whose sum is close to the average. Once we find one, we remove them, and continue selecting.
The workhorse of the algorithm below is comboGeneral from RcppAlgos*. This function allows one to find combinations of a vector meeting a constraint. In this case, we are looking for 5 numbers whose sum is close to sum(tasks$cost) / (number of workers) ~ 60.66425. Since we are looking for numbers close to and not exactly equal to this value, we can bound our constraint. That is, we can look for combinations such that the sum is within a given tolerance.
library(RcppAlgos)
getPartitions <- function(df, nWorkers, tol_ratio = 0.0001) {
nSections <- nrow(df) / nWorkers
avg <- sum(df$cost) / nWorkers
tol <- avg * tol_ratio
vec <- df$cost
cond <- TRUE
part <- list()
for (i in 1:(nWorkers - 1)) {
while (cond) {
vals <- comboGeneral(vec, nSections,
constraintFun = "sum",
comparisonFun = "==",
limitConstraints = avg + (tol / 2),
tolerance = tol,
upper = 1)
cond <- nrow(vals) == 0
if (cond) {
tol <- tol * 2
} else {
v <- match(as.vector(vals), df$cost)
}
}
part[[i]] <- v
vec <- df$cost[-(do.call(c, part))]
avg <- sum(vec) / (nWorkers - i)
tol <- avg * tol_ratio
cond <- TRUE
}
part[[nWorkers]] <- which(!1:nrow(df) %in% do.call(c, part))
part
}
The output for the example given by the OP is as follows:
getPartitions(tasks, 4)
[[1]]
[1] 11 13 8 14 10
[[2]]
[1] 12 4 20 2 16
[[3]]
[1] 19 9 18 5 6
[[4]]
[1] 1 3 7 15 17
These are the rows from tasks that are to be passed to each worker. It runs instantly and returns a pretty even workload. Here are the estimated times for each worker:
sapply(getPartitions(tasks, 4), function(x) {
sum(tasks$cost[x])
})
[1] 60.67292 60.66552 60.80399 60.51455
This is pretty good given that the ideal time would be mean(tasks$cost) * 5 ~= 60.66425.
Let's see how it performs. Below is a modified script for plotting that takes into account how varied each result is for a given method. We measure this with sd (standard deviation). It also returns the ideal solution for reference.
library("tibble")
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("ggplot2")
library("tidyr")
race <- function() {
N_TASKS = 100
X = 4
tasks <- tibble(task = 1:N_TASKS, cost = runif(N_TASKS, min = 1, max = 10)^2)
ideal_soln <- sum(tasks$cost) / X
# random shuffle
tasks$worker <- rep(1:X, each = nrow(tasks)/X)
rando_mx <- max(sapply(split(tasks$cost, tasks$worker), sum))
rando_sd <- sd(sapply(split(tasks$cost, tasks$worker), sum))
# alternative 1
tasks <- tasks[order(tasks$cost), ]
tasks$worker <- rep(c(1:X, X:1), length.out = nrow(tasks))
alt1_mx <- max(sapply(split(tasks$cost, tasks$worker), sum))
alt1_sd <- sd(sapply(split(tasks$cost, tasks$worker), sum))
# modified version of ivan100sic's answer
# sort by descending cost, after initial allocation, allocate costly tasks
# first to workers with lowest total cost so far
group <- factor(rep(1:(ceiling(nrow(tasks)/4)), each = X))
tasks <- tasks[order(tasks$cost, decreasing = TRUE), ]
tasks$worker <- c(1:X, rep(NA, length.out = nrow(tasks) - X))
task_sets <- split(tasks, group)
task_sets[[1]]$worker <- 1:X
for (i in 2:length(task_sets)) {
# get current total cost by worker
total <- task_sets %>%
bind_rows() %>%
filter(!is.na(worker)) %>%
group_by(worker) %>%
summarize(cost = sum(cost)) %>%
arrange(cost)
task_sets[[i]]$worker <- total[["worker"]]
}
tasks <- bind_rows(task_sets)
alt2_mx <- max(sapply(split(tasks$cost, tasks$worker), sum))
alt2_sd <- sd(sapply(split(tasks$cost, tasks$worker), sum))
## RcppAlgos solution
setParts <- getPartitions(tasks, X)
algos_mx <- max(sapply(setParts, function(x) sum(tasks$cost[x])))
algos_sd <- sd(sapply(setParts, function(x) sum(tasks$cost[x])))
c(target_soln = ideal_soln,rando_max = rando_mx, alt1_max = alt1_mx,
alt2_max = alt2_mx, algos_max = algos_mx, rando_std_dev = rando_sd,
alt1_std_dev = alt1_sd, alt2_std_dev = alt2_sd, algos_std_dev = algos_sd)
}
set.seed(24332)
system.time(sims <- replicate(1e3, race()))
sims %>%
t() %>%
as_tibble() %>%
pivot_longer(rando_std_dev:algos_std_dev, names_to = "Method") %>%
ggplot(aes(x = value, color = Method)) +
geom_density() +
scale_x_continuous(limits = c(0, 100)) +
labs(x = "Standard Deviation (s)")
Warning message:
Removed 719 rows containing non-finite values (stat_density).
It is hard to tell what is going on because the standard deviation for the rando method is so large. If we just look at alt1, alt2, and the algos approach we have:
sims %>%
t() %>%
as_tibble() %>%
pivot_longer(alt1_std_dev:algos_std_dev, names_to = "Method") %>%
ggplot(aes(x = value, color = Method)) +
geom_density() +
scale_x_continuous(limits = c(0, 5)) +
labs(x = "Standard Deviation (s)")
Warning message:
Removed 335 rows containing non-finite values (stat_density)
And now alt2 and algos:
sims %>%
t() %>%
as_tibble() %>%
pivot_longer(alt2_std_dev:algos_std_dev, names_to = "Method") %>%
ggplot(aes(x = value, color = Method)) +
geom_density() +
scale_x_continuous(limits = c(0, 1.7)) +
labs(x = "Standard Deviation (s)")
As you can see, the RcppAlgos solution gives the most balanced load every time.
And finally, here is an illustration that demonstrates how close each method is to the target solution:
summary(abs(t(sims)[, "algos_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.003147 0.057913 0.081986 0.081693 0.106312 0.179099
summary(abs(t(sims)[, "alt2_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01175 0.14321 0.23916 0.30730 0.40949 2.03156
summary(abs(t(sims)[, "alt1_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4979 2.9815 4.4725 4.9660 6.3220 16.5716
summary(abs(t(sims)[, "rando_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.16 98.14 143.64 154.10 200.41 427.81
We see that the RcppAlgos solution is around 3-4 times closer on average to the target solution than the second best method (alt2 in this case).
Update
For the most part, the alt2/alt1 methods perform relatively well and are very simple, which is a huge plus. However, there are many cases where they will fail. For example, given X workers and X - 1 tasks that you know take substantially longer than the other tasks, since those methods rely on sorting, they will predictably allocate too much to X - 1 of the workers. Simply change the following line in the function race():
## Original
tasks <- tibble(task = 1:N_TASKS, cost = runif(N_TASKS, min = 1, max = 10)^2)
## Modified
tasks <- tibble(task = 1:N_TASKS, cost = c(runif(X - 1, 15, 25),
runif(N_TASKS - X + 1, min = 1, max = 10))^2)
Now rerun and observe:
set.seed(24332)
sims <- replicate(1e3, race())
sims <- sims %>%
t() %>%
as_tibble() %>%
pivot_longer(rando:algo, names_to = "Method")
ggplot(sims, aes(x = value, color = Method)) +
geom_density() +
scale_x_continuous(limits = c(0, max(sims$value))) +
labs(x = "Total runtime with Large Gap (s)")
sims %>%
group_by(Method) %>%
summarize(time_relative_to_ref = mean(value - ref)) %>%
arrange(time_relative_to_ref)
# A tibble: 4 x 2
Method time_relative_to_ref
<chr> <dbl>
1 algo 0.109
2 alt2 150.
3 alt1 184.
4 rando 839.
Although this is a contrived example, it shows that since the alt1/alt2 solutions make assumptions about the underlying data, they will inevitably fail when presented with a more general problem.
* Disclosure: I am the author of RcppAlgos

record linear regression results repeatedly

As shown in the following example, what I want to achieve is to run the regression many times, with R recording the estimates of did in one data.frame each time.
Each time, I change the year condition in ifelse, i.e. ifelse(mydata$year >= 1993, 1, 0), so each time I run a different regression.
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
Can anyone help? My basic code is below (the data can be downloaded through a browser if R returns errors):
library(foreign)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
mydata$treated = ifelse(mydata$country == "E" | mydata$country == "F" | mydata$country == "G", 1, 0)
mydata$did = mydata$time * mydata$treated
didreg = lm(y ~ treated + time + did, data = mydata)
summary(didreg)
Generally, if you want to repeat a process many times with some different input each time, you need a function. The following function takes a scalar value year_value as its input, creates local variables for the regression, and returns the estimates for the model term did.
foo <- function (year_value) {
  ## create local variables from `mydata`
  y <- mydata$y
  treated <- as.numeric(mydata$country %in% c("E", "F", "G")) ## use `%in%`
  time <- as.numeric(mydata$year >= year_value) ## use `year_value`
  did <- time * treated
  ## run regression using local variables
  didreg <- lm(y ~ treated + time + did)
  ## return estimate for model term `did`
  coef(summary(didreg))["did", ]
}
foo(1993)
# Estimate Std. Error t value Pr(>|t|)
#-2.784222e+09 1.504349e+09 -1.850782e+00 6.867661e-02
Note there are several places where your original code can be improved, for example using %in% instead of multiple |, and using as.numeric instead of ifelse to coerce a logical to numeric.
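For example, these two ways of building the treated indicator are equivalent (a small illustrative check, assuming mydata has been loaded as above):
treated_ifelse <- ifelse(mydata$country == "E" | mydata$country == "F" | mydata$country == "G", 1, 0)
treated_in <- as.numeric(mydata$country %in% c("E", "F", "G"))
all.equal(treated_ifelse, treated_in)  # TRUE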
Now you need something like a loop to iterate this function over several different values of year_value. I would use lapply.
## raw list of result from `lapply`
year_of_choice <- 1993:1994 ## taken for example
result <- lapply(year_of_choice, foo)
## rbind them into a matrix
data.frame(year = year_of_choice, do.call("rbind", result), check.names = FALSE)
# year Estimate Std. Error t value Pr(>|t|)
#1 1993 -2784221881 1504348732 -1.850782 0.06867661
#2 1994 -2519511630 1455676087 -1.730819 0.08815711
Note, don't include year 1990 (the minimum of variable year) as a choice, otherwise time will be a vector of 1s, the same as the intercept. The resulting model is rank-deficient and you will get a "subscript out of bounds" error. R versions since 3.5.0 have a new complete argument to the generic function coef. So for stability we may use
coef(summary(didreg), complete = TRUE)["did", ]
But you should see all NA or NaN for year 1990.
Here is another option: we create a matrix for all the years, join it to mydata, gather to long format, nest by grouping, then run the regression and extract the estimates. Note that "gt_et_*" stands for "greater than or equal to ...".
library(foreign)
library(dplyr)
library(tidyr)
library(purrr)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mtrx <- matrix(0, length(min(mydata$year):max(mydata$year)), length(min(mydata$year):max(mydata$year)))
mtrx[lower.tri(mtrx, diag = TRUE)] <- 1
df <- mtrx %>% as.data.frame() %>% mutate(year = min(mydata$year):max(mydata$year))
colnames(df) <- c(paste0("gt_et_", df$year), "year")
models <- df %>%
full_join(., mydata, by = "year") %>%
gather(mod, time, gt_et_1990:gt_et_1999) %>%
nest(-mod) %>%
mutate(data = map(data, ~mutate(.x, treated = ifelse(country == "E"|country == "F"|country == "G", 1, 0),
did = time * treated)),
mods = map(data, ~lm(y ~ treated + time + did, data = .x) %>% summary() %>% coef())) %>%
unnest(mods %>% map(broom::tidy)) %>%
filter(.rownames == "did") %>%
select(-.rownames)
models
#> mod Estimate Std..Error t.value Pr...t..
#> 1 gt_et_1991 -2309823993 2410140350 -0.95837738 0.34137018
#> 2 gt_et_1992 -2036098728 1780081308 -1.14382344 0.25682856
#> 3 gt_et_1993 -2784221881 1504348732 -1.85078222 0.06867661
#> 4 gt_et_1994 -2519511630 1455676087 -1.73081886 0.08815711
#> 5 gt_et_1995 -2357323806 1455203186 -1.61992760 0.11001662
#> 6 gt_et_1996 250180589 1511322882 0.16553749 0.86902697
#> 7 gt_et_1997 405842197 1619653548 0.25057346 0.80292231
#> 8 gt_et_1998 -75683039 1852314277 -0.04085864 0.96753194
#> 9 gt_et_1999 2951694230 2452126428 1.20372840 0.23299421
Created on 2018-09-01 by the reprex
package (v0.2.0).

correlogram with reliabilities in the main diagonal

I'm rather new to R; here is something I encountered in my first steps with it.
In some papers it is required to present a correlogram with the reliability (Cronbach's alpha) of the correlated variables in the main diagonal (where the correlations are 1).
An example might be 5 correlated psychometric measures:
Job_ins (an average of 4 items)
Employability (an average of 4 items)
INT_to_quit (an average of 4 items)
Mobility_pref (an average of 5 items)
Career_self_mgmt (an average of 8 items)
Note that with R's native cor() function the main diagonal (the correlations of the measures with themselves) shows 1.
What I would like to do is present the internal reliability (Cronbach's alpha) in the main diagonal instead.
Any ideas?
Saar
If I understood you correctly, this is my (long) solution.
#Loading pkgs
require(tidyverse)
require(Hmisc)
require(psych)
#Creating example data
set.seed(123) #making the random data reproducible
#Creating the items for each subject
job <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
Employability <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
Career <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
#Arranging the data to one data.frame
df <- data.frame(as.data.frame(job$observed) %>%
                   set_names(c("job1", "job2", "job3", "job4")),
                 as.data.frame(Employability$observed) %>%
                   set_names(c("Employability1", "Employability2",
                               "Employability3", "Employability4")),
                 as.data.frame(Career$observed) %>%
                   set_names(c("Career1", "Career2", "Career3", "Career4")))
#Creating a vector with Cronbach's alpha for each scale
CronAlpha <- c(
alpha(df %>%
select(job1, job2, job3, job4))$total$std.alpha,
alpha(df %>%
select(Employability1, Employability2,
Employability3, Employability4))$total$std.alpha,
alpha(df %>%
select(Career1, Career2,
Career3, Career4))$total$std.alpha)
#Calculating the mean of each scale, then the correlations
Correlation <- df %>%
#Calculating the means
mutate(job = rowMeans(data.frame(job1, job2, job3, job4), na.rm = TRUE),
Employability =rowMeans(data.frame(Employability1, Employability2,
Employability3, Employability4), na.rm = TRUE),
Career =rowMeans(data.frame(Career1, Career2,
Career3, Career4), na.rm = TRUE)) %>%
#Selecting only the vars that I want for the correlation matrix
select(job, Employability, Career) %>%
as.matrix() %>%
rcorr()
#Extracting the Pearson's r
CorrelationRs <- Correlation$r
#Looping through the correlation matrix and replacing the diagonal
# with Cronbach's alpha
i <- 1
for (i in 1:nrow(CorrelationRs)) {
CorrelationRs[i, i] <- CronAlpha[i]
}
CorrelationRs
Edit
Instead of using a loop, I should use diag():
diag(CorrelationRs) <- CronAlpha
