How do I calculate an average like this in R?

Probably a stupid question but I have no idea how to do this.
Consider the following game, in which a balanced die with six sides numbered from 1 to 6 is rolled. If a 4 or 1 is rolled, you lose 50 euros. If you roll 2 or 3, nothing happens. If you roll 5, you win 50 euros. If you roll 6, you win 16×50 euros.
We would like to know how much money you can expect to win per game on average. Setting the seed to 990, simulate 5649 repetitions of the game.
Calculate the average of the winnings in these repetitions as an estimate of the expected value of the winnings in the game. Report this value rounded to 2 decimal places.

Here is a base R way with a logical index on the die side.
set.seed(990)
rolls <- sample(6, 5649, TRUE)  # 5649 rolls of a fair six-sided die
win <- integer(5649)            # payout per game; stays 0 for rolls of 2 or 3
win[rolls %in% c(1, 4)] <- -50
win[rolls == 5] <- 50
win[rolls == 6] <- 16*50
mean(win)
#> [1] 121.4728
Created on 2022-11-27 with reprex v2.0.2
A simpler way: create a vector of prizes and index it with the roll values.
prizes <- c(-50, 0, 0, -50, 50, 16*50)
win <- prizes[rolls]
mean(win)
#> [1] 121.4728
Created on 2022-11-27 with reprex v2.0.2
To output the result with 2 decimal places, just use
round(mean(win), 2)
#> [1] 121.47

# Simulation of the dice rolls
set.seed(990)
dice_roll <- sample(1:6, 5649, replace = TRUE)
library(dplyr)
df <- tibble(dice_roll = dice_roll)
df %>%
  mutate(
    # Map each dice roll to its respective payout
    result = case_when(
      dice_roll == 6 ~ 16 * 50,
      dice_roll == 5 ~ 50,
      dice_roll == 2 | dice_roll == 3 ~ 0,
      dice_roll == 1 | dice_roll == 4 ~ -50
    )
  ) %>%
  # The overall average, rounded to 2 decimal places
  summarise(average = round(mean(result), 2)) %>%
  pull(average)
[1] 121.47

Could just get the analytical solution:
P(X=-50) = 1/3, P(X=0) = 1/3, P(X=50) = 1/6, P(X=16*50) = 1/6.
E[X] = -50/3 + 0/3 + 50/6 + 16*50/6 = 125.
-50/3 + 0/3 + 50/6 + 16*50/6
[1] 125
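For reference, the simulated average (121.47) differs from the theoretical 125 only because of Monte Carlo error. A quick sketch (not part of the original exercise; the repetition count of 1e6 is arbitrary) shows the simulation moving towards 125 as the number of games grows:
# with many more repetitions the simulated mean should approach 125
set.seed(990)
prizes <- c(-50, 0, 0, -50, 50, 16 * 50)
big_rolls <- sample(6, 1e6, replace = TRUE)
mean(prizes[big_rolls])  # close to, but not exactly, 125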

Related

How to determine the best cutoff for an easy question

Here is the outline of my data. There are 500 students. Each student has a final grade in math, physics, chemistry, music, and history, and each grade ranges from 0 to 100. For each subject, if a student's grade is below a cutoff, the student fails that subject. However, the teacher of each subject may change a few students' (less than 5%) assessments from fail to pass because of their good performance in class activities. If a student fails any subject, the overall assessment is fail; if a student passes all 5 subjects, the overall assessment is pass.
Now suppose the cutoffs for math, physics, chemistry, music, and history are 45, 45, 45, 60, and 60, respectively. Then we get the demo table below. The second student passed history because the history teacher was satisfied with his class performance.
ID math physics chemistry music history overall_assessment
1 95 96 70 65 75 pass
2 46 61 72 86 59 pass
3 55 32 21 95 96 fail
My question is: given only a table like the one above, how can I work out the cutoff for each subject? Here is the data in R.
set.seed(1)
math <- sample(30:100, 500, replace=T)
physics <- sample(30:100, 500, replace=T)
chemistry<- sample(30:100, 500, replace=T)
music<- sample(30:100, 500, replace=T)
history<- sample(60:100, 500, replace=T)
grade <- as.data.frame(cbind(math,physics,chemistry,music,history))
grade$assess <- ifelse(grade$math > 45 & grade$physics >55 & grade$chemistry > 60 & grade$music > 50 & grade$history > 80, "pass","fail")
grade$ID <- seq(1,500,1)
change_grade <- sample(1:500, 25, replace=F)
grade$assess[grade$ID %in% change_grade] <- "pass"
Because there is randomness in who is selected to pass for good class activity, it is not possible to find the exact cutoff values. But we can find upper and lower bounds for each cutoff. Note that I slightly adjusted the data generation, but you can change it and confirm that this method gives correct bounds no matter what the true cutoffs are.
library(tidyverse)
n <- 500
prop <- 0.05
set.seed(1)
math <- sample(30:100, n, replace = T)
physics <- sample(30:100, n, replace = T)
chemistry <- sample(30:100, n, replace = T)
music <- sample(30:100, n, replace = T)
history <- sample(30:100, n, replace = T)
grade <-
  as.data.frame(cbind(math, physics, chemistry, music, history))
grade$assess <- ifelse(
  grade$math >= 45 &
    grade$physics >= 45 &
    grade$chemistry >= 45 &
    grade$music >= 60 &
    grade$history >= 60,
  "pass", "fail")
grade$ID <- seq(1, n, 1)
change_grade <- sample(1:n, n * prop, replace = F)
grade$assess[grade$ID %in% change_grade] <- "pass"
grade$assess <- factor(grade$assess)
To find the upper bound for a subject, consider all individuals who passed the assessment and look at their grades in that subject. At most 25 of them (n times the proportion of exceptions) can have been granted an exception for that subject, so the grade of the 26th-lowest individual is an upper bound for the cutoff score.
# upper bound
get_upper_bound <- function(var, n, prop) {
  var <- var[order(var)]
  var[ceiling(n * prop) + 1]
}
upper_bound <- grade %>%
  subset(assess == "pass") %>%
  summarise(
    math = get_upper_bound(math, n = n, prop = prop),
    physics = get_upper_bound(physics, n = n, prop = prop),
    chemistry = get_upper_bound(chemistry, n = n, prop = prop),
    music = get_upper_bound(music, n = n, prop = prop),
    history = get_upper_bound(history, n = n, prop = prop))
upper_bound
#> math physics chemistry music history
#> 1 57 53 58 68 67
Having found the upper bounds, we can look at the lower bounds. Consider all individuals who reached at least the upper bound in Math, Physics, Chemistry, and Music but still failed the assessment. They must have failed History, so the highest History grade among those students, plus one, is a lower bound for the History cutoff. The same reasoning applies to every subject.
This code is inelegant, but I believe it works.
# lower bound
get_lower_bound <- function(varnum, data, upper_bound) {
  varnames = c("math", "physics", "chemistry", "music", "history")
  vars_using <- c(1:5)
  vars_using <- vars_using[-varnum]
  # keep students who reached the upper bound in every other subject...
  indexes <- rep(TRUE, nrow(data))
  for (i in vars_using) {
    indexes <-
      indexes & (data[, varnames[i]] >= as.numeric(upper_bound[i]))
  }
  # ...but who still failed the assessment: they must have failed this subject
  indexes <- indexes & (data$assess == "fail")
  ifelse(is.finite(max(data[indexes, varnum])),
         max(data[indexes, varnum]) + 1,
         min(data[, varnum]))
}
lower_bound <- data.frame(
  "math" = get_lower_bound(1, grade, upper_bound),
  "physics" = get_lower_bound(2, grade, upper_bound),
  "chemistry" = get_lower_bound(3, grade, upper_bound),
  "music" = get_lower_bound(4, grade, upper_bound),
  "history" = get_lower_bound(5, grade, upper_bound))
lower_bound
#> math physics chemistry music history
#> 1 45 44 45 58 60
Then the final bounds for the cutoff scores are:
rbind("lower" = lower_bound,
"upper" = upper_bound)
#> math physics chemistry music history
#> lower 45 44 45 58 60
#> upper 57 53 58 68 67
Created on 2022-08-30 by the reprex package (v2.0.1)
Note that by increasing n and decreasing prop, eventually the lower bound and upper bound are equal, and we have found the cutoff score exactly.
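As a rough illustration (not part of the original answer; the values of n and prop below are arbitrary, and the exact numbers depend on the seed), the data generation and both bound calculations can simply be rerun with more students and fewer exceptions, reusing get_upper_bound() and get_lower_bound() from above; the two rows of bounds should then come out much closer together:
n <- 5000
prop <- 0.01
set.seed(1)
grade <- data.frame(math      = sample(30:100, n, replace = TRUE),
                    physics   = sample(30:100, n, replace = TRUE),
                    chemistry = sample(30:100, n, replace = TRUE),
                    music     = sample(30:100, n, replace = TRUE),
                    history   = sample(30:100, n, replace = TRUE))
grade$assess <- ifelse(grade$math >= 45 & grade$physics >= 45 &
                         grade$chemistry >= 45 & grade$music >= 60 &
                         grade$history >= 60, "pass", "fail")
grade$ID <- seq_len(n)
grade$assess[grade$ID %in% sample(1:n, n * prop)] <- "pass"
upper_bound <- grade %>%
  subset(assess == "pass") %>%
  summarise(math = get_upper_bound(math, n = n, prop = prop),
            physics = get_upper_bound(physics, n = n, prop = prop),
            chemistry = get_upper_bound(chemistry, n = n, prop = prop),
            music = get_upper_bound(music, n = n, prop = prop),
            history = get_upper_bound(history, n = n, prop = prop))
lower_bound <- data.frame(math = get_lower_bound(1, grade, upper_bound),
                          physics = get_lower_bound(2, grade, upper_bound),
                          chemistry = get_lower_bound(3, grade, upper_bound),
                          music = get_lower_bound(4, grade, upper_bound),
                          history = get_lower_bound(5, grade, upper_bound))
rbind("lower" = lower_bound, "upper" = upper_bound)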

R - Categorize a dataset

Morning folks,
I'm trying to categorize a set of numerical values (Days Left divided by 365.2, which gives approximately the number of years left until maturity).
The results of this first calculation give me a vector of 3560 values (example: 0.81, 1.65, 3.26 [...], 0.2).
I'd like to categorise these results into intervals, [Between 0 and 1 Year, 0 and 2 Years, 0 and 3 years, 0 and 4 years, Over 4 years].
#Set the Data Frame
dfMaturity <- data.frame(Maturity = DATA$Maturity)
#Call the library and Run the function
MaturityX = ddply(df, .(Maturity), nrow)
#Set the Data Frame
dfMaturityID <- data.frame(testttto = DATA$Security.Name)
#Calculation of the remaining days
MaturityID = ddply(df, .(dfMaturityID$testttto), nrow)
survey <- data.frame(date=c(DATA$Maturity),tx_start=c("1/1/2022"))
survey$date_diff <- as.Date(as.character(survey$date), format="%m/%d/%Y")-
as.Date(as.character(survey$tx_start), format="%m/%d/%Y")
# Data for the table
MaturityName <- MaturityID$`dfMaturityID$testttto`
MaturityZ <- survey$date
TimeToMaturity <- as.numeric(survey$date_diff)
# /!/ HERE IS WHERE I NEED HELP /!/ I'M TRYING TO CATEGORISE THE RESULTS OF THIS CALCULATION
Multiplier <- TimeToMaturity /365.2
cx <- cut(Multiplier, breaks=0:5)
The original data source comes from an Excel file (DATA$Maturity).
In case it helps, print(Multiplier) gives:
[1] 0.4956188 1.4950712 1.9989047 0.2464403 0.9994524 3.0010953 5.0000000 7.0016429 9.0005476
[10] 21.0021906 4.1621030 13.1626506 1.1610077 8.6664841 28.5377875 3.1626506 6.7497262 2.0920044
[19] 2.5602410 4.6495071 0.3368018 6.3225630 8.7130340 10.4956188 3.9019715 12.7957284 5.8378970
I copied only the first three lines; there are 3,560 values in total.
I'm open to any kind of help, I just want it to work :) thank you !
The cut function does that:
example <- c(0.81, 1.65, 3.26, 0.2)
cut(example, breaks = c(0, 1, 2, 3, 4),
    labels = c("newborn", "one year old", "two", "three"))
Edit:
From the comment
I'd like then to create a table with for example: 30% of the objects has a maturity between 0 and 1 year
You could compute that using the function below:
example <- c(0.81, 1.65, 3.26, 0.2)
share <- function(x, lower = 0, higher = 1){
  x <- na.omit(x)
  sum((lower <= x) & (x < higher))/length(x)
}
share(1:10, lower = 0,higher = 3.5) # true for 1:3 out of 1:10 so 30%
share(1:10, lower = 4.5, higher = 5.5) # true only for 5, so 10%
share(example, 0, 3)
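If disjoint buckets are acceptable, a short sketch (the break points and labels below are only illustrative) combines cut() with prop.table() to get the share of observations per interval in one step:
cx <- cut(example, breaks = c(0, 1, 2, 3, 4, Inf),
          labels = c("0-1 yr", "1-2 yrs", "2-3 yrs", "3-4 yrs", "over 4 yrs"))
round(100 * prop.table(table(cx)), 1)  # percentage of observations per bucket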

Graph learning in R, igraph, tidygraph

I have a graph with each node having a value (value in red).
I would like to do the following two things (I guess 1 is a special case of 2):
Each node should be assigned the mean of the values of the direct peers pointing to it, for example node #5: (1+2)/2 = 1.5, or node #3: (0+2+0)/3 = 2/3.
Instead of only direct neighbours, include all connected nodes, but weight them by 1/n, with n being the distance to the node: the further away the information comes from, the weaker the signal.
I looked into the functions of igraph but could not find anything that does this (I might have overlooked it, though). How could I do this computation?
Below is the code for a sample network with random values.
library(tidyverse)
library(tidygraph)
library(ggraph)
set.seed(6)
q <- tidygraph::play_erdos_renyi(6, p = 0.2) %>%
  mutate(id = row_number(),
         value = sample(0:3, size = 6, replace = TRUE))
q %>%
  ggraph(layout = "with_fr") +
  geom_edge_link(arrow = arrow(length = unit(0.2, "inches"),
                               type = "closed")) +
  geom_node_label(aes(label = id)) +
  geom_node_text(aes(label = value), color = "red", size = 7,
                 nudge_x = 0.2, nudge_y = 0.2)
Edit, found a solution to 1
q %>%
  mutate(value_smooth = map_local_dbl(order = 1, mindist = 1, mode = "in",
                                      .f = function(neighborhood, ...) {
    mean(as_tibble(neighborhood, active = 'nodes')$value)
  }))
Edit 2, solution to 2, not the most elegant I guess
q %>%
  mutate(value_smooth = map_local_dbl(order = 1, mindist = 0, mode = "in",
                                      .f = function(neighborhood, node, ...) {
    ne <- neighborhood
    ne <- ne %>%
      mutate(d = node_distance_to(which(as_tibble(ne,
                                                  active = "nodes")$id == node)))
    as_tibble(ne, active = 'nodes') %>%
      filter(d != 0) %>%
      mutate(helper = value/d) %>%
      summarise(m = mean(helper)) %>%  # average of the 1/distance-weighted values
      pull(m)
  }))
Edit 3, a faster alternative to map_local_dbl
map_local loops through all nodes of the graph. For large graphs, this takes very long. For just computing the means, this is not needed. A much faster alternative is to use the adjacency matrix and some matrix multiplication.
q_adj <- q %>%
  igraph::as_adjacency_matrix()
# out
(q_adj %*% as_tibble(q)$value) / Matrix::rowSums(q_adj)
# in
(t(q_adj) %*% as_tibble(q)$value) / Matrix::colSums(q_adj)
The square of the adjacency matrix is the second order adjacency matrix, and so forth. So a solution to problem 2 could also be created.
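As a sketch of what that could look like (not from the original answer), one can instead take the shortest-path distance matrix from igraph and weight each upstream node's value by 1/distance, matching the definition in the question:
vals <- as_tibble(q)$value
# d_in[i, j] is the length of the shortest directed path from j to i
d_in <- igraph::distances(q, mode = "in")
w <- ifelse(is.finite(d_in) & d_in > 0, 1 / d_in, 0)  # 1/distance; 0 for self/unreachable
# mean of value/distance over all upstream nodes (NaN if a node has none)
as.numeric(w %*% vals) / rowSums(w > 0)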
Edit 4, direct weighted mean
Say the original graph has weights associated to each edge.
q <- q %>%
  activate(edges) %>%
  mutate(w = c(1, 0.5, 1, 0.5, 1, 0.5, 1)) %>%
  activate(nodes)
We would like to compute the weighted mean of the direct peers' value.
q_adj_wgt <- q %>%
  igraph::as_adjacency_matrix(attr = "w")
# out
(q_adj_wgt %*% as_tibble(q)$value) / Matrix::rowSums(q_adj_wgt)
# in
(t(q_adj_wgt) %*% as_tibble(q)$value) / Matrix::colSums(q_adj_wgt)
Probably you can try the code below
q %>%
  set_vertex_attr(
    name = "value",
    value = sapply(
      ego(., mode = "in", mindist = 1),
      function(x) mean(x$value)
    )
  )
which gives
# A tbl_graph: 6 nodes and 7 edges
#
# A directed simple graph with 1 component
#
# Node Data: 6 x 2 (active)
id value
<int> <dbl>
1 1 0.5
2 2 NaN
3 3 0.667
4 4 NaN
5 5 1.5
6 6 NaN
#
# Edge Data: 7 x 2
from to
<int> <int>
1 3 1
2 6 1
3 1 3
# ... with 4 more rows
Each node should be assigned the mean of the value of the direct peers
directing to it.
Guessing that you really mean
Each node should be assigned the mean of the values of the direct peers directing to it, before any node values were changed
This seems trivial - maybe I am missing something?
1. Loop over the nodes; for each node, average the values of the nodes pointing to it and store that mean in a vector indexed by node.
2. Loop over the nodes again and assign each node the mean stored in the first pass.
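A minimal sketch of that two-pass loop, assuming the tbl_graph q from the question (igraph functions also work on tbl_graph objects):
library(igraph)
vals <- vertex_attr(q, "value")
new_vals <- sapply(seq_len(vcount(q)), function(i) {
  peers <- as.integer(neighbors(q, i, mode = "in"))  # nodes pointing at i
  mean(vals[peers])                                   # NaN if i has no incoming peers
})
q <- set_vertex_attr(q, "value", value = new_vals)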

Simulating data sample

I have the following probabilities for each group, and each group represents a certain range of values. My goal is to simulate 1,234 rows of data that corresponds with the groups and percentages:
ages = c(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99)
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01) / 100
age_bins = sapply(list(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99), length)
age_weighted = rep(age_probs/age_bins, age_bins)
set.seed(1)
n = 1234
data = data.frame(ID = sample(n),
                  Age = sample(ages, size = n, prob = age_weighted, replace = TRUE))
However, the percentages in the resulting data don't match the targets and are sometimes quite different (I assume because the data isn't big enough). I found another post which mentions that this happens because our "view" of the randomness is effectively "one cell at a time" instead of "one column at a time". This is in reference to the sample() function.
How can I change my sample function to better represent the population percentages?
Oh and here is how I checked the columns of my data frame
to_export = data[order(data$ID),]
for (i in (1:length(to_export$Age))) {
  if (to_export$Age[i] >= 21 & to_export$Age[i] <= 29) to_export$block[i] = "21-29"
  if (to_export$Age[i] >= 30 & to_export$Age[i] <= 39) to_export$block[i] = "30-39"
  if (to_export$Age[i] >= 40 & to_export$Age[i] <= 49) to_export$block[i] = "40-49"
  if (to_export$Age[i] >= 50 & to_export$Age[i] <= 59) to_export$block[i] = "50-59"
  if (to_export$Age[i] >= 60 & to_export$Age[i] <= 69) to_export$block[i] = "60-69"
  if (to_export$Age[i] >= 70 & to_export$Age[i] <= 79) to_export$block[i] = "70-79"
  if (to_export$Age[i] >= 80 & to_export$Age[i] <= 89) to_export$block[i] = "80-89"
  if (to_export$Age[i] >= 90) to_export$block[i] = "90+"
}
#to_export
age_table = to_export %>% group_by(block) %>% summarise(percentage = round(n()/1234 * 100,2))
age_table
I suggest a small redesign. I'm using dplyr and ggplot2, but they aren't strictly needed:
library(dplyr)    # only needed for the percentage check below
library(ggplot2)  # only needed for the bar chart
set.seed(1)
n = 1234
# Definition of the age buckets
ages = c("21:29", "30:39", "40:49", "50:59", "60:69", "70:79", "80:89", "90:99")
# probability for each bucket
age_probs = c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)
# normalise the probabilities since they don't add up to 1
c_age_probs = cumsum(age_probs)/sum(age_probs)
# create the data.frame
data = data.frame(ID = 1:n,
                  Age = ages[findInterval(runif(n), c_age_probs) + 1])
# plotting the data
ggplot(data, aes(x = Age)) +
  geom_bar()
The plot of the data looks okay according to the given probabilities. Let's take a look at the percentages:
# getting the percentage
data %>%
  group_by(Age) %>%
  summarise(percentage = n()/n)
# A tibble: 7 x 2
# Age percentage
# <chr> <dbl>
# 1 21:29 0.0989
# 2 30:39 0.105
# 3 40:49 0.133
# 4 50:59 0.269
# 5 60:69 0.198
# 6 70:79 0.126
# 7 80:89 0.0705
The key part is ages[findInterval(runif(n), c_age_probs) + 1]. I created some uniform distributed numbers and used the cumulated (and normalised) probabilities to get the corresponding age bucket. By doing so I didn't even need to create multiple case_when-statements.
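An alternative sketch (not in the original answer): sample the bucket labels directly with sample(), which normalises prob internally. For an n of this size the observed shares should track the targets reasonably well, though they remain subject to sampling noise:
set.seed(1)
alt <- data.frame(ID  = 1:n,
                  Age = sample(ages, size = n, replace = TRUE, prob = age_probs))
round(100 * prop.table(table(alt$Age)), 2)  # observed percentage per bucket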

Building a life table in R with for loops

I'm new to R and programming in general, and I'm struggling with a for-loop for building the lx function in a life table.
I have the age variable x, the death probability qx (the probability that someone aged exactly x will die before reaching age x+1), and the survival probability px = 1 - qx.
I want to write a function that returns a vector with all the lx values from first to last age in my table. The function is simple...
I've defined cohort = 1000000. The first age in my table is x = 5, so, considering x = 5...
l_(x) = cohort
And, from now on, l_(x+n) = l_(x+n-1)*p_(x+n-1)
I've read about for-loops, but I can only get my code working for lx[1] and lx[2]; I get nothing for lx[n] when n > 2.
I wrote that function:
living_x <- function(px, cohort){
  result <- vector("double", length(px))
  l_x <- vector("double", length(px))
  for (i in 1:length(px)){
    if (i == 1){
      l_x[i] = cohort
    }
    else l_x[i] = l_x[i-1]*px[i-1]
    result[i] = l_x
    print(result)
  }
}
When I run it, I get several outputs (more than length(px)) and "There were 50 or more warnings (use warnings() to see the first 50)".
When I run warnings(), I get "In result[i] <- l_x : number of items to replace is not a multiple of replacement length" for every number.
Also, everything else I try either gives me different errors or only calculates lx[1] and lx[2]. I know there's something really wrong with my code, but I still couldn't identify it. I'd be glad if someone could give me a hint about what to change.
Thank you!
Here's an approach using dplyr from the tidyverse packages, using px to calculate lx. The same calculation can be done without the pipeline via excerpt$lx <- 100000 * cumprod(c(1, head(1 - excerpt$qx, -1))) (note that the lag() used below is dplyr::lag(); base R's stats::lag() behaves differently).
The lifetables data in the babynames package already provides lx, so we can check our work:
library(tidyverse)
library(babynames)
# Get excerpt with age, qx, and lx.
excerpt <- lifetables %>%
  filter(year == 2010, sex == "F") %>%
  select(x, qx_given = qx, lx_given = lx)
excerpt
# A tibble: 120 x 3
x qx_given lx_given
<dbl> <dbl> <dbl>
1 0 0.00495 100000
2 1 0.00035 99505
3 2 0.00022 99471
4 3 0.00016 99449
5 4 0.00012 99433
6 5 0.00011 99421
7 6 0.00011 99410
8 7 0.0001 99399
9 8 0.0001 99389
10 9 0.00009 99379
# ... with 110 more rows
Using that data to estimate lx_calc:
est_lx <- excerpt %>%
  mutate(px = 1 - qx_given,
         cuml_px = cumprod(lag(px, default = 1)),
         lx_calc = cuml_px * 100000)
Finally, compare the given lx with the one calculated from px visually; they match exactly.
est_lx %>%
  gather(version, val, c(lx_given, lx_calc)) %>%
  ggplot(aes(x, val, color = version)) + geom_line()
I could do it in a very simple way after thinking about it for a few more minutes.
lx <- c()
lx[1] <- 10**6                 # the initial cohort
for (i in 2:length(px)){
  lx[i] <- lx[i-1]*px[i-1]
}
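For what it's worth, the same recursion can also be written without a loop; a small sketch with made-up px values (not real mortality data):
px <- c(0.95, 0.90, 0.85)                  # toy survival probabilities
lx <- 10**6 * cumprod(c(1, head(px, -1)))  # l[1] = cohort, l[i] = l[i-1] * p[i-1]
lx
#> [1] 1000000  950000  855000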
