I have the following probabilities for each group, and each group represents a certain range of values. My goal is to simulate 1,234 rows of data that correspond to these groups and percentages:
ages = c(21:29, 30:39, 40:49, 50:59, 60:69, 70:79, 80:89, 90:99)
age_probs = c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01) / 100
age_bins = sapply(list(21:29, 30:39, 40:49, 50:59, 60:69, 70:79, 80:89, 90:99), length)
age_weighted = rep(age_probs/age_bins, age_bins)
set.seed(1)
n = 1234
data = data.frame(ID = sample(n),
Age = sample(ages, size = n, prob = age_weighted, replace = TRUE))
However, the percentages in the simulated data don't match the targets and are sometimes quite different (I assume because the sample isn't big enough). I found another post which mentions that this happens because our "view" of the randomness is effectively "one cell at a time" instead of "one column at a time". This is in reference to the sample() function.
How can I change my sample function to better represent the population percentages?
Oh, and here is how I checked the columns of my data frame:
to_export = data[order(data$ID),]
for (i in (1:length(to_export$Age))) {
if (to_export$Age[i] >= 21 & to_export$Age[i] <= 29) to_export$block[i] = "21-29"
if (to_export$Age[i] >= 30 & to_export$Age[i] <= 39) to_export$block[i] = "30-39"
if (to_export$Age[i] >= 40 & to_export$Age[i] <= 49) to_export$block[i] = "40-49"
if (to_export$Age[i] >= 50 & to_export$Age[i] <= 59) to_export$block[i] = "50-59"
if (to_export$Age[i] >= 60 & to_export$Age[i] <= 69) to_export$block[i] = "60-69"
if (to_export$Age[i] >= 70 & to_export$Age[i] <= 79) to_export$block[i] = "70-79"
if (to_export$Age[i] >= 80 & to_export$Age[i] <= 89) to_export$block[i] = "80-89"
if (to_export$Age[i] >= 90) to_export$block[i] = "90+"
}
#to_export
library(dplyr)
age_table = to_export %>% group_by(block) %>% summarise(percentage = round(n()/1234 * 100, 2))
age_table
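(As an aside, the block column can be built in one step with cut() instead of the if chain; a minimal sketch using the same to_export data frame:)
# one-step alternative to the if chain, same bins as above
to_export$block <- cut(to_export$Age,
                       breaks = c(21, 30, 40, 50, 60, 70, 80, 90, Inf),
                       labels = c("21-29", "30-39", "40-49", "50-59",
                                  "60-69", "70-79", "80-89", "90+"),
                       right = FALSE)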
I suggest a small redesign. I'm using dplyr and ggplot2 below, but strictly speaking they aren't needed:
library(dplyr)
library(ggplot2)
set.seed(1)
n = 1234
# Definition of the age buckets
ages = c("21:29", "30:39","40:49", "50:59", "60:69", "70:79", "80:89", "90:99")
# probability for each bucket
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01)
# normalise the probabilities since they don't add up to 1
c_age_probs = cumsum(age_probs)/sum(age_probs)
# create the data.frame
data = data.frame(ID = 1:n,
Age = ages[findInterval(runif(n), c_age_probs) + 1])
# plotting the data
ggplot(data, aes(x=Age)) +
geom_bar()
The plot of the data looks fine, matching the given probabilities. Let's take a look at the percentages:
# getting the percentage
data %>%
group_by(Age) %>%
summarise(percentage = n()/n)
# A tibble: 7 x 2
# Age percentage
# <chr> <dbl>
# 1 21:29 0.0989
# 2 30:39 0.105
# 3 40:49 0.133
# 4 50:59 0.269
# 5 60:69 0.198
# 6 70:79 0.126
# 7 80:89 0.0705
The key part is ages[findInterval(runif(n), c_age_probs) + 1]. I created some uniformly distributed numbers and used the cumulative (and normalised) probabilities to get the corresponding age bucket. By doing so I didn't even need to create multiple case_when statements.
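Note that any independent random draw (findInterval on runif, or sample() with a prob argument) will show sampling noise of order 1/sqrt(n) around the target percentages. If the goal is to match the targets almost exactly, one option (a sketch, not part of the redesign above) is to fix each bucket's count deterministically and only shuffle the order:
# deterministic bucket counts: round the expected count, absorb rounding drift
counts <- round(n * age_probs / sum(age_probs))
counts[1] <- counts[1] + (n - sum(counts))
data_fixed <- data.frame(ID = sample(n),
                         Age = sample(rep(ages, times = counts)))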
Probably a stupid question but I have no idea how to do this.
Consider the following game, in which a balanced die with six sides numbered from 1 to 6 is rolled. If a 4 or 1 is rolled, you lose 50 euros. If you roll 2 or 3, nothing happens. If you roll 5, you win 50 euros. If you roll 6, you win 16×50 euros.
We would like to know how much money you can expect to win per game on average. Setting the seed to 990, simulate 5649 repetitions of the game.
Calculate the average of the winnings in these repetitions, as an estimate of the expected value of the winnings in the game. Give this value rounded to 2 decimal places.
Here is a base R way using logical indexing on the die faces.
set.seed(990)
rolls <- sample(6, 5649, TRUE)
win <- integer(5649)
win[rolls %in% c(1, 4)] <- -50
win[rolls == 5] <- 50
win[rolls == 6] <- 16*50
mean(win)
#> [1] 121.4728
Created on 2022-11-27 with reprex v2.0.2
A simpler way: create a vector of prizes and index it with the roll values.
prizes <- c(-50, 0, 0, -50, 50, 16*50)
win <- prizes[rolls]
mean(win)
#> [1] 121.4728
Created on 2022-11-27 with reprex v2.0.2
To output the result with 2 decimal places, just
round(mean(win), 2)
#> 121.47
# Simulation of the dice roll
set.seed(990)
dice_roll <- sample(1:6, 5649, replace = TRUE)
library(dplyr)
df <- tibble(dice_roll = dice_roll)
df %>%
mutate(
#Setting each dice roll to their respective result
result = case_when(
dice_roll == 6 ~ (16*50),
dice_roll == 5 ~ 50,
(dice_roll == 2 | dice_roll == 3) ~ 0,
(dice_roll == 1 | dice_roll == 4) ~ -50
)
) %>%
# The global average
summarise(average = round(mean(result),2)) %>%
pull(average)
[1] 121.47
You could also just compute the analytical solution:
P(X=-50) = 1/3, P(X=0) = 1/3, P(X=50) = 1/6, P(X=16*50) = 1/6.
E[X] = -50/3 + 0/3 + 50/6 + 16*50/6 = 125.
-50/3 + 0/3 + 50/6 + 16*50/6
[1] 125
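Equivalently in R, since all six faces are equally likely, the expected value is just the mean of the prize vector from the base R answer above:
prizes <- c(-50, 0, 0, -50, 50, 16*50)
mean(prizes)
# [1] 125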
Here is the outline of my data. There are 500 students. Each student has a final grade for math, physics, chemistry, music, and history. The range of the final grade for each subject is 0 to 100. For each subject, if a student's grade is below a cutoff, the student fails that subject. However, the teacher of each subject may change a few students' (less than 5%) assessments from fail to pass due to their good performance in class activities. If a student fails any subject, the overall assessment is fail. If a student passes all 5 subjects, the overall assessment is pass.
Now suppose the cutoffs for math, physics, chemistry, music, and history are 45, 45, 45, 60, and 60, respectively. Then we get the demo table below. The second student passed history because the history teacher was satisfied with his class performance.
ID math physics chemistry music history overall_assessment
1 95 96 70 65 75 pass
2 46 61 72 86 59 pass
3 55 32 21 95 96 fail
Now my question is: given a table like the one above, how can I recover the cutoff for each subject? I have the data below in R.
set.seed(1)
math <- sample(30:100, 500, replace=T)
physics <- sample(30:100, 500, replace=T)
chemistry<- sample(30:100, 500, replace=T)
music<- sample(30:100, 500, replace=T)
history<- sample(60:100, 500, replace=T)
grade <- as.data.frame(cbind(math,physics,chemistry,music,history))
grade$assess <- ifelse(grade$math > 45 & grade$physics >55 & grade$chemistry > 60 & grade$music > 50 & grade$history > 80, "pass","fail")
grade$ID <- seq(1,500,1)
change_grade <- sample(1:500, 25, replace=F)
grade$assess[grade$ID %in% change_grade] <- "pass"
Because there is randomness in who is selected to pass for good class activity, it is not possible to find the exact cutoff values. But we can find upper and lower bounds for each cutoff. Note that I slightly adjusted the data generation, but you can change it and confirm that this method gives correct bounds no matter the true cutoffs.
library(tidyverse)
n <- 500
prop <- 0.05
set.seed(1)
math <- sample(30:100, n, replace = T)
physics <- sample(30:100, n, replace = T)
chemistry <- sample(30:100, n, replace = T)
music <- sample(30:100, n, replace = T)
history <- sample(30:100, n, replace = T)
grade <-
as.data.frame(cbind(math, physics, chemistry, music, history))
grade$assess <- ifelse(
grade$math >= 45 &
grade$physics >= 45 &
grade$chemistry >= 45 &
grade$music >= 60 &
grade$history >= 60,
"pass", "fail")
grade$ID <- seq(1, n, 1)
change_grade <- sample(1:n, n * prop, replace = F)
grade$assess[grade$ID %in% change_grade] <- "pass"
grade$assess <- factor(grade$assess)
To find the upper bound for a subject, we will consider all individuals who passed the assessment, and look at their grades in that subject. We know that at most 25 individuals were granted an exception for that subject (n * proportion of exceptions), so the grade of the 26th worst individual is an upper bound for the cutoff score.
# upper bound
get_upper_bound <- function(var, n, prop) {
var <- var[order(var)]
var[ceiling(n * prop) + 1]
}
upper_bound <- grade %>%
subset(assess == "pass") %>%
summarise(
math = get_upper_bound(math, n = n, prop = prop),
physics = get_upper_bound(physics, n = n, prop = prop),
chemistry = get_upper_bound(chemistry, n = n, prop = prop),
music = get_upper_bound(music, n = n, prop = prop),
history = get_upper_bound(history, n = n, prop = prop))
upper_bound
#> math physics chemistry music history
#> 1 57 53 58 68 67
Having now found an upper bound, we can look at the lower bounds. Consider all individuals who passed Math, Physics, Chemistry, and Music by achieving at least the upper bound in those subjects, but who also failed the assessment. Then we know that they must have failed the History subject. Looking at the maximum History grade in those students gives us a lower bound for the cutoff score for History. We can apply this for all different subjects.
This code is inelegant, but I believe it works.
# lower bound
get_lower_bound <- function(varnum, data, upper_bound) {
varnames = c("math", "physics", "chemistry", "music", "history")
vars_using <- c(1:5)
vars_using <- vars_using[-varnum]
indexes <- rep(TRUE, nrow(data))
for (i in vars_using) {
indexes <-
indexes & (data[, varnames[i]] >= as.numeric(upper_bound[i]))
}
indexes <- indexes & (data$assess == "fail")
ifelse(is.finite(max(data[indexes, varnum])),
max(data[indexes, varnum]) + 1,
min(data[, varnum]))
}
lower_bound <- data.frame(
"math" = get_lower_bound(1, grade, upper_bound),
"physics" = get_lower_bound(2, grade, upper_bound),
"chemistry" = get_lower_bound(3, grade, upper_bound),
"music" = get_lower_bound(4, grade, upper_bound),
"history" = get_lower_bound(5, grade, upper_bound))
lower_bound
#> math physics chemistry music history
#> 1 45 44 45 58 60
Then the final bounds for the cutoff scores are:
rbind("lower" = lower_bound,
"upper" = upper_bound)
#> math physics chemistry music history
#> lower 45 44 45 58 60
#> upper 57 53 58 68 67
Created on 2022-08-30 by the reprex package (v2.0.1)
Note that by increasing n and decreasing prop, eventually the lower bound and upper bound are equal, and we have found the cutoff score exactly.
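A quick sketch of that convergence, simplified to a single subject with a known cutoff (hypothetical code, not part of the answer above):
set.seed(1)
n <- 50000; prop <- 0.001; true_cutoff <- 45
g <- sample(30:100, n, replace = TRUE)
pass <- g >= true_cutoff
pass[sample(n, n * prop)] <- TRUE              # grant up to n * prop exceptions
upper <- sort(g[pass])[ceiling(n * prop) + 1]  # grade of the (n * prop + 1)-th worst passer
lower <- max(g[!pass]) + 1                     # every genuine failer is below the cutoff
c(lower = lower, upper = upper)                # with these settings both should equal 45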
I have a data set of real estate data. I'm trying to create a new column of days on market groups (labeled DOM_Groups) and group them into 15-day intervals (i.e. 0-14, 15-29, etc.). Then I'm trying to summarize() these groupings by the count of observations and the average sale price for each 15-day group.
I'm using the cut() function attempting to break my DOM_Groups into these 15-day intervals. In the base spreadsheet that I imported, the column containing the days on market has a unique observation in each cell, and the data in that column are numeric whole numbers...no decimals, no negative numbers.
When I run the following code, the tibble output is not grouping correctly, and it is including a negative number with a decimal, which does not exist in my data set. I'm not sure what to do to correct this.
gibbsMkt %>%
mutate(DOM_Groups = cut(DOM, breaks = 15, dig.lab = 2)) %>%
filter(Status == "SOLD") %>%
group_by(DOM_Groups) %>%
summarize(numDOM = n(),
avgSP = mean(`Sold Price`, na.rm = TRUE))
The tibble output I get is this:
DOM_Groups numDOM avgSP
<fct> <int> <dbl>
1 (-0.23,16] 74 561675.
2 (16,31] 18 632241.
3 (31,47] 11 561727.
4 (47,63] 8 545862.
5 (63,78] 7 729286.
6 (78,94] 6 624167.
7 (1.4e+02,1.6e+02] 2 541000
8 (1.6e+02,1.7e+02] 1 535395
Also, for rows 7 & 8 in the tibble, the largest number is 164, so I also don't understand why these rows are being converted to scientific notation.
When I use an Excel pivot table, I get the output that I want to reproduce in R (screenshot of the pivot table not included here).
How can I reproduce this in R with the correct code?
cut(x, breaks = 15) means x will be cut into 15 intervals--it cannot guess that you want 15-unit intervals starting with 0 and ending with 150. This is in the docs for ?cut:
breaks either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You will need to define your own start and end to each interval such as:
seq(0, max(x), 15)
# [1] 0 15 30 45 60 75 90 105 120 135 150
cut(x, seq(0, max(x), 15))
Note that with those breaks, anything outside them becomes NA: a 0 itself (intervals are right-closed by default) and any value above 150. If you set it up correctly, however, you can extend the breaks and define your intervals and labels at the same time.
set.seed(1)
x <- floor(runif(500, 0, 164))
from <- seq(0, max(x), 15)
to <- from + 15 - 1
labs <- sprintf('%s-%s', from, to)
# [1] "0-14" "15-29" "30-44" "45-59" "60-74" "75-89" "90-104" "105-119" "120-134" "135-149" "150-164"
data.frame(table(cut(x, c(from, Inf), right = FALSE)), labels = labs)
# Var1 Freq labels
# 1 [0,15) 35 0-14
# 2 [15,30) 57 15-29
# 3 [30,45) 45 30-44
# 4 [45,60) 44 45-59
# 5 [60,75) 57 60-74
# 6 [75,90) 55 75-89
# 7 [90,105) 33 90-104
# 8 [105,120) 47 105-119
# 9 [120,135) 40 120-134
# 10 [135,150) 39 135-149
# 11 [150,Inf) 48 150-164
DOM_Groups <- cut(x, c(from, Inf), labs, right = FALSE)
data.frame(table(DOM_Groups))
# DOM_Groups Freq
# 1 0-14 35
# 2 15-29 57
# 3 30-44 45
# 4 45-59 44
# 5 60-74 57
# 6 75-89 55
# 7 90-104 33
# 8 105-119 47
# 9 120-134 40
# 10 135-149 39
# 11 150-164 48
As for your other question of why you are getting negative numbers: as I mentioned, this does not mean that you have negatives in your data--these are just labels generated by using breaks = 15 with your data.
These are the relevant lines in cut.default:
if (length(breaks) == 1L) {
if (is.na(breaks) || breaks < 2L)
stop("invalid number of intervals")
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] +
dx/1000)
}
Using the x from before and breaks = 15, you can see how negatives are introduced:
breaks <- 15
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
} else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)
}
breaks
# [1] -0.16300 10.86667 21.73333 32.60000 43.46667 54.33333 65.20000 76.06667 86.93333 97.80000 108.66667 119.53333 130.40000
# [14] 141.26667 152.13333 163.16300
levels(cut(x, breaks = 15))
# [1] "(-0.163,10.9]" "(10.9,21.7]" "(21.7,32.6]" "(32.6,43.5]" "(43.5,54.3]" "(54.3,65.2]" "(65.2,76.1]" "(76.1,86.9]"
# [9] "(86.9,97.8]" "(97.8,109]" "(109,120]" "(120,130]" "(130,141]" "(141,152]" "(152,163]"
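As for the scientific notation in rows 7 and 8 of your tibble: that comes from dig.lab = 2 in your cut() call. dig.lab sets how many significant digits are used to format the break numbers (the default is 3), and a three-digit break like 140 cannot be shown with only 2 significant digits, so it falls back to scientific notation. Increasing dig.lab avoids this:
levels(cut(x, breaks = 15, dig.lab = 2))  # three-digit breaks collapse to scientific notation
levels(cut(x, breaks = 15, dig.lab = 4))  # plain labels such as "(152.1,163.2]"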
Here's a simple solution with my santoku package:
library(santoku)
gibbsMkt %>%
mutate(DOM_Groups = chop_width(DOM, 15, labels = lbl_dash("-")))
# then proceed as before
You can use the start argument to chop_width if you want to start the intervals at a particular number.
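For example, something like this should anchor the first interval at 0 (a sketch; check the santoku documentation for the exact argument details):
gibbsMkt %>%
  mutate(DOM_Groups = chop_width(DOM, 15, start = 0, labels = lbl_dash("-")))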
I have a data frame of 222 observations and 2 variables: landslide_z_prediction and y (occurrences).
landslide_z_prediction takes values from 0 to 1, while ytest takes on the integers 0 or 1.
My task is to find out how many of the predicted positives/negatives were predicted correctly.
if z < 0.5 --> predicted negative
if z > 0.5 --> predicted positive
if y = 0 --> observed negative
if y = 1 --> observed positive
The scenarios are:
a) if z < 0.5 and y = 0 --> prediction is correct
b) if z <0.5 and y = 1 --> prediction is wrong
c) if z > 0.5 and y = 1 --> prediction is correct
d) if z > 0.5 and y = 0 --> prediction is wrong
I have placed my 222 observations in a data.frame format
combined_predicitons <- data.frame(landslide_z_predicted, ytest)
How can I extract the number of occurrences of each scenario?
Hi, I found a solution to your problem with the dplyr package. Here is the code:
library(dplyr)
# generate sample data
df <- data.frame(landslide_z_predicted=runif(75), y=sample(c(0, 1), 75, replace=T))
# add is_correct and case variables to the data frame
df <- df %>%
mutate(is_correct = case_when((landslide_z_predicted < 0.5) & !y ~ TRUE,
(landslide_z_predicted >= 0.5) & y ~ TRUE,
TRUE ~ FALSE)) %>%
mutate(case = case_when((landslide_z_predicted < 0.5) & !y & is_correct ~ "case_01",
(landslide_z_predicted < 0.5) & y & !is_correct ~ "case_02",
(landslide_z_predicted >= 0.5) & y & is_correct ~ "case_03",
(landslide_z_predicted >= 0.5) & !y & !is_correct ~ "case_04"))
# count the occurrences of the cases
df %>% select(case) %>% group_by(case) %>% summarize(count=n())
First I generate a variable called is_correct, which is TRUE/FALSE depending on whether the prediction was correct. In the second mutate I list all the cases from your question and name them "case_01", "case_02", ... With that in hand I can group the data frame by case and count the occurrences.
If I understand correctly, you want to make a confusion matrix. In order to build one, I can suggest this:
1 - Change the z values into 1 or 0 according to your threshold:
Since your threshold is at 0.5 you could use round.
combined_predicitons$landslide_z_predicted_dicotomy = round(combined_predicitons$landslide_z_predicted)
Otherwise, use ifelse: if the predicted value is at or above the threshold, it outputs 1, and 0 otherwise. (Note that round() in R rounds an exact 0.5 down to 0, since it rounds halves to even, so the ifelse version is safer at the boundary.)
threshold = 0.5
combined_predicitons$landslide_z_predicted_dicotomy = ifelse(combined_predicitons$landslide_z_predicted >= threshold, 1, 0)
2 - Create the table
table(combined_predicitons$ytest, combined_predicitons$landslide_z_predicted_dicotomy)
With this, you'll have the number of occurrences of each scenario.
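As a small optional touch (not in the original answer), naming the arguments to table() labels the dimensions of the matrix, which makes it easier to read:
table(observed  = combined_predicitons$ytest,
      predicted = combined_predicitons$landslide_z_predicted_dicotomy)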
I have data called data_v, and one of the columns is salaries. The values range between 0 and 140,000. I want to define ranges (range 1: 0-10000, range 2: 10000-20000, ...), calculate the median of each range, and replace every value in a range with that range's median.
Using this I am able to get the desired output:
first = data_v$salaries[data_v$salaries>=0 & data_v$salaries<10000]
data_v$salaries[data_v$salaries>=0 & data_v$salaries<10000] = median(first)
second = data_v$salaries[data_v$salaries>=10000 & data_v$salaries<20000]
data_v$salaries[data_v$salaries>=10000 & data_v$salaries<20000] = median(second)
.............
ten=data_v$salaries[data_v$salaries>=90000 & data_v$salaries<=100000]
data_v$salaries[data_v$salaries >= 90000 & data_v$salaries <= 100000] = median(ten)
Output:
table(data_v$salaries)
# medians:  7949  17523  25939  34302  42827  56840  65423  73292  81900  95479.75
# counts:    130   2022   8481   9233   2661   1270   3864   2232    176         4
I tried to implement the same thing with a while loop, without success:
i <- 0
while (i <= 140000) {
  m = data_v$salaries[data_v$salaries >= i & data_v$salaries < (i + 10000)]
  data_v$salaries[data_v$salaries >= i & data_v$salaries < (i + 10000)] = median(m)
  i <- i + 10000
}
Any help/suggestions are more than welcome.
data(mtcars) # data for test
step = 10 # interval length, 10000 for your data
n = ceiling(max(mtcars$mpg)/step) # number of intervals
mtcars$mpg_interval = cut(mtcars$mpg, step*(0:n))
mtcars$mpg_median = ave(mtcars$mpg, mtcars$mpg_interval, FUN = median)
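Applied to the question's data, the same pattern would look like this (a sketch, assuming data_v$salaries as described):
step <- 10000                                   # interval length
n <- ceiling(max(data_v$salaries)/step)         # number of intervals
# include.lowest = TRUE keeps a salary of exactly 0 in the first interval
data_v$salary_interval <- cut(data_v$salaries, step*(0:n), include.lowest = TRUE)
data_v$salaries <- ave(data_v$salaries, data_v$salary_interval, FUN = median)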