Suppose I have a monthly income of 1200.
The interest rate is 1%, so after each year the annual income increases by 1%.
I would like to find out how many years it will take for each investment to break even.
Suppose an investment costs 200,000 with a monthly income of 1200.
My first-year and subsequent years' annual income will be:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:5)
c(firstYear, additionalYears)
14400.00 14544.00 14689.44 14836.33 14984.70 15134.54
I would like to make the "5" in the above example dynamic until it finds the break-even point.
In this example I have:
sum(c(firstYear, additionalYears))
198854.3
So the investment has not broken even yet. Adjusting it to "12" gives me the break-even point:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:12)
sum(c(firstYear, additionalYears))
If possible, I would like to determine the month of the year in which it breaks even (so in this example it breaks even in month 12 of year 12, but others might break even in month 8 of year 6, etc.).
Instead of using a loop, you can use vectors. Set the maximum years to 100, create a vector of annual incomes (one per year, years 0 to 100) and a matching vector of growth factors. Multiply the two and take a cumulative sum, starting from the (negative) cost. The number of years where that sum is still negative is your break-even year.
cost = -200000 # negative cost
income = 1200*12 # annual income
i = 0.01 # interest rate to grow income after year 0.
# repeat 14400 101 times, multiply it by (1+i)^n - R is vectorised
income100 = rep(income, 101) * ((1+i) ^ seq(0,100))
# subtract the cost from the cumulative sum of income
cumincome = cost + cumsum(income100)
# how many are negative?
yrs = sum(cumincome < 0)
# how much left to recover in next year
yrs + (-cumincome[yrs] / income100[yrs+1])
[1] 13.06991
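To get the month within that final year (the second part of the question), a minimal sketch, assuming the year's income accrues evenly over its 12 months:
be <- yrs + (-cumincome[yrs] / income100[yrs + 1])  # 13.06991 from above
ceiling(be)             # break-even happens during the 14th year of income
ceiling((be %% 1) * 12) # in month 1 of that year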
Putting this into a function
break_even_years <- function(cost, income, interest = 0, period = "monthly") {
  if (cost >= 0) cost <- -cost
  if (period == "monthly") income <- income * 12
  income100 <- rep(income, 101) * ((1 + interest) ^ seq(0, 100))
  cumincome <- cost + cumsum(income100)
  # how many are negative?
  yrs <- sum(cumincome < 0)
  # how much left to recover in next year
  yrs + (-cumincome[yrs] / income100[yrs + 1])
}
Using the function (with the cost and investment vectors from the asker's own attempt below):
purrr::map2(
  .x = cost,
  .y = investment,
  ~ break_even_years(.x, .y, interest = 0.03, "annual"))
[[1]]
[1] 28.90435
[[2]]
[1] 23.75858
[[3]]
[1] 6.391264
[[4]]
[1] 3.505453
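A small usage note (my suggestion, not part of the original answer): purrr::map2_dbl() returns the same results as a plain numeric vector instead of a list:
purrr::map2_dbl(cost, investment,
                ~ break_even_years(.x, .y, interest = 0.03, "annual"))
# same four values as above, in a numeric vector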
Benchmarking
library(microbenchmark)
microbenchmark(break_even_years(200000, 1000, 0.01),
               find_break_even_year(1000, 200000, 0.01, 100), times = 1000)
Unit: microseconds
                                          expr   min      lq      mean median      uq      max neval
         break_even_years(200000, 1000, 0.01)  50.9   87.10  257.4185  119.0  159.05  36938.0  1000
 find_break_even_year(1000, 200000, 0.01, 100) 853.5 1247.05 3432.5157 1556.2 2391.35 145980.6  1000
I think this answers my question. If anybody can help with avoiding the for loop, that would be very helpful.
library(tidyverse)
investment = c(1000, 2000, 5000, 27000)
interest_rate = 0.03
cost = c(45000, 67900, 34678, 98367)
max_years = 100
find_break_even_year <- function(CF, investment, interest_rate, max_years) {
  future_value <- list()
  for (year in 1:max_years) {
    future_value[[year]] <- CF * (1 + interest_rate)^year
    future_value_sums <- sum(unlist(future_value))
    if (future_value_sums >= investment)
      return(year)
  }
}
purrr::map2(
  .x = investment,
  .y = cost,
  ~ find_break_even_year(.x, .y, interest_rate = 0.03, max_years = 100)
)
Related
I have a dataframe with participants and I want to randomly assign them to a group (0 or 1). Each group should have approximately the same number of participants.
My problem: I will keep adding participants. So, when I calculate a new random number for a participant, it should take into account the distribution of the random numbers I already have.
This is my code:
groupData <- data.frame(participant = c(1), Group = floor(runif(1, min=0, max=2)))
groupData[nrow(groupData) + 1,] = c(2,floor(runif(1, min=0, max=2))) # with this I will be adding participants
I think what you're saying is that when iteratively adding participants to groupData, you want to randomly assign them to a group such that over time, the groups will be evenly distributed.
N.B., iteratively adding rows to a frame scales horribly, so if you're doing this with a lot of data, it will slow down a lot. See "Growing Objects" in The R Inferno; a preallocated variant is sketched after the balancing loop below.
We can weight the different groups in inverse proportion to their relative sizes, so that a new participant has a slightly higher likelihood of being assigned to an under-populated group.
For instance, if we already have 100 participants with unbalanced groups:
set.seed(42)
groupData <- data.frame(participant = 1:100, Group = sample(c(rep(0, 70), rep(1, 30))))
head(groupData)
# participant Group
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 0
# 6 6 1
table(groupData$Group)
# 0 1
# 70 30
then we can prioritize the under-filled group using
100 / (table(c(0:1, groupData$Group))-1)
# 0 1
# 1.428571 3.333333
which can be used with sample as in
sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1) )
I use table(c(0:1, ..)) - 1 because I want this to work when there may not yet be participants in one of the groups; by concatenating 0:1 to it, I ensure each group has at least one, and the "minus one" compensates for this artificiality, trying to keep the ratios unbiased.
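A quick demonstration of that edge case (a minimal demo, not part of the original answer):
table(c(0, 0, 0))             # a bare table drops group 1 entirely
table(c(0:1, c(0, 0, 0))) - 1 # padding keeps both groups: 0 -> 3, 1 -> 0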
To "prove" that this eventually rounds out ...
for (pa in 101:400) {
  newgroup <- sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1))
  groupData <- rbind(groupData, data.frame(participant = pa, Group = newgroup))
}
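As flagged above, growing groupData with rbind() on every iteration is slow for large inputs. A minimal preallocated rewrite of the same loop (the counts bookkeeping is my own; the + 1 padding mirrors the table(c(0:1, ..)) trick and also avoids division by zero):
n_new  <- 300
groups <- integer(n_new)
counts <- table(factor(groupData$Group, levels = 0:1))  # running group sizes
for (k in seq_len(n_new)) {
  groups[k] <- sample(0:1, size = 1, prob = 100 / (counts + 1))
  counts[groups[k] + 1] <- counts[groups[k] + 1] + 1    # update the tally
}
groupData <- rbind(groupData,
                   data.frame(participant = nrow(groupData) + seq_len(n_new),
                              Group = groups))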
library(ggplot2)
transform(groupData, GroupDiff = cumsum(Group == 0) - cumsum(Group == 1)) |>
  ggplot(aes(participant, y = GroupDiff)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 100) +
  geom_text(
    data = data.frame(
      participant = 101,
      GroupDiff = c(-Inf, -1, 1),
      vjust = c(-0.5, 0.5, -0.5),
      label = c("Start of group-balancing", "Group0-heavy", "Group1-heavy")
    ),
    hjust = 0, aes(label = label, vjust = vjust)
  )
It is possible (even likely) that the balance will sway from side-to-side, but in general (asymptotically) it should stay balanced.
It occurs to me that the simplest method is just to assign people in pairs. Draw a random number (0 or 1), assign person N to the group associated with that value, and assign person N+1 to the other group. That guarantees random assignment as well as perfectly equal group sizes.
Whether this properly simulates the situation you want to analyze is a separate issue.
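A minimal sketch of that pairing idea (the n_pairs name and demo size are mine):
set.seed(42)
n_pairs <- 150
# each replicate() draw is a random permutation of c(0, 1), so every
# consecutive pair of participants contributes one member to each group
Group <- as.vector(replicate(n_pairs, sample(0:1)))
groupData <- data.frame(participant = seq_len(2 * n_pairs), Group = Group)
table(groupData$Group)  # exactly 150 in each group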
Here is the outline of my data. There are 500 students, and each student has a final grade for math, physics, chemistry, music, and history. The grade for each subject ranges from 0 to 100. For each subject, if a student's grade is below a cutoff, the student fails that subject. However, the teacher of each subject may change a few students' (less than 5%) assessments from fail to pass due to their good performance in class activity. If a student fails any subject, the overall assessment is fail; if a student passes all 5 subjects, the overall assessment is pass.
Now suppose the cutoffs for math, physics, chemistry, music, and history are 45, 45, 45, 60, and 60, respectively. Then we will have the demo table below. The second student passed history because the history teacher was satisfied with his class performance.
ID math physics chemistry music history overall_assessment
 1   95      96        70    65      75               pass
 2   46      61        72    86      59               pass
 3   55      32        21    95      96               fail
Now my question: given the table above, how can I recover the cutoff for each subject? I have the data below in R.
set.seed(1)
math <- sample(30:100, 500, replace=T)
physics <- sample(30:100, 500, replace=T)
chemistry<- sample(30:100, 500, replace=T)
music<- sample(30:100, 500, replace=T)
history<- sample(60:100, 500, replace=T)
grade <- as.data.frame(cbind(math,physics,chemistry,music,history))
grade$assess <- ifelse(grade$math > 45 & grade$physics >55 & grade$chemistry > 60 & grade$music > 50 & grade$history > 80, "pass","fail")
grade$ID <- seq(1,500,1)
change_grade <- sample(1:500, 25, replace=F)
grade$assess[grade$ID %in% change_grade] <- "pass"
Because there is randomness in who is selected to pass for good class activity, it is not possible to find the exact cutoff values. But we can find upper and lower bounds for each cutoff. Note that I slightly adjust the data generation, but you can change it and confirm this method gives correct bounds no matter the true cutoffs.
library(tidyverse)
n <- 500
prop <- 0.05
set.seed(1)
math <- sample(30:100, n, replace = T)
physics <- sample(30:100, n, replace = T)
chemistry <- sample(30:100, n, replace = T)
music <- sample(30:100, n, replace = T)
history <- sample(30:100, n, replace = T)
grade <- as.data.frame(cbind(math, physics, chemistry, music, history))
grade$assess <- ifelse(
  grade$math >= 45 &
    grade$physics >= 45 &
    grade$chemistry >= 45 &
    grade$music >= 60 &
    grade$history >= 60,
  "pass", "fail")
grade$ID <- seq(1, n, 1)
change_grade <- sample(1:n, n * prop, replace = F)
grade$assess[grade$ID %in% change_grade] <- "pass"
grade$assess <- factor(grade$assess)
To find the upper bound for a subject, we will consider all individuals who passed the assessment, and look at their grades in that subject. We know that at most 25 individuals were granted an exception for that subject (n * proportion of exceptions), so the grade of the 26th worst individual is an upper bound for the cutoff score.
# upper bound
get_upper_bound <- function(var, n, prop) {
  var <- var[order(var)]
  var[ceiling(n * prop) + 1]
}
upper_bound <- grade %>%
  subset(assess == "pass") %>%
  summarise(
    math = get_upper_bound(math, n = n, prop = prop),
    physics = get_upper_bound(physics, n = n, prop = prop),
    chemistry = get_upper_bound(chemistry, n = n, prop = prop),
    music = get_upper_bound(music, n = n, prop = prop),
    history = get_upper_bound(history, n = n, prop = prop))
upper_bound
#> math physics chemistry music history
#> 1 57 53 58 68 67
Having now found an upper bound, we can look at the lower bounds. Consider all individuals who passed Math, Physics, Chemistry, and Music by achieving at least the upper bound in those subjects, but who also failed the assessment. Then we know that they must have failed the History subject. Looking at the maximum History grade in those students gives us a lower bound for the cutoff score for History. We can apply this for all different subjects.
This code is inelegant, but I believe it works.
# lower bound
get_lower_bound <- function(varnum, data, upper_bound) {
  varnames <- c("math", "physics", "chemistry", "music", "history")
  vars_using <- c(1:5)
  vars_using <- vars_using[-varnum]
  indexes <- rep(TRUE, nrow(data))
  for (i in vars_using) {
    indexes <- indexes & (data[, varnames[i]] >= as.numeric(upper_bound[i]))
  }
  indexes <- indexes & (data$assess == "fail")
  ifelse(is.finite(max(data[indexes, varnum])),
         max(data[indexes, varnum]) + 1,
         min(data[, varnum]))
}
lower_bound <- data.frame(
  "math" = get_lower_bound(1, grade, upper_bound),
  "physics" = get_lower_bound(2, grade, upper_bound),
  "chemistry" = get_lower_bound(3, grade, upper_bound),
  "music" = get_lower_bound(4, grade, upper_bound),
  "history" = get_lower_bound(5, grade, upper_bound))
lower_bound
#> math physics chemistry music history
#> 1 45 44 45 58 60
Then the final bounds for the cutoff scores are:
rbind("lower" = lower_bound,
"upper" = upper_bound)
#> math physics chemistry music history
#> lower 45 44 45 58 60
#> upper 57 53 58 68 67
Created on 2022-08-30 by the reprex package (v2.0.1)
Note that by increasing n and decreasing prop, eventually the lower bound and upper bound are equal, and we have found the cutoff score exactly.
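To see that claim in action, here is a minimal single-subject simulation (the bound_width helper is my own illustration, reusing the bound logic from above):
bound_width <- function(n, prop, cutoff = 45) {
  g <- sample(30:100, n, replace = TRUE)
  pass <- g >= cutoff
  pass[sample(n, ceiling(n * prop))] <- TRUE              # teacher exceptions
  upper <- sort(g[pass])[ceiling(n * prop) + 1]           # (n * prop + 1)-th worst passer
  lower <- if (any(!pass)) max(g[!pass]) + 1 else min(g)  # highest definite fail + 1
  upper - lower
}
set.seed(1)
bound_width(500, 0.05)    # typically a gap of a few points
bound_width(50000, 0.001) # typically 0: the bounds coincide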
I essentially have two columns (vectors) with speed and accel in a data.frame as such:
speed acceleration
1 3.2694444 2.6539535522
2 3.3388889 2.5096979141
3 3.3888889 2.2722134590
4 3.4388889 1.9815256596
5 3.5000000 1.6777544022
6 3.5555556 1.3933215141
7 3.6055556 1.1439051628
8 3.6527778 0.9334115982
9 3.6722222 0.7561602592
I need to find, for each interval of speed on the x-axis, the top 10% of values of acceleration on the y-axis. The intervals should be specific, for example speed 3.2-3.4, 3.4-3.6, and so on. Can you please show me what a for loop would look like in this situation?
As @alistaire already pointed out, you have provided a very limited amount of data, so we first have to simulate a bit more data on which we can test our code.
set.seed(1)
# your data
speed <- c(3.2694444, 3.3388889, 3.3888889, 3.4388889, 3.5,
           3.5555556, 3.6055556, 3.6527778, 3.6722222)
acceleration <- c(2.6539535522, 2.5096979141, 2.2722134590,
                  1.9815256596, 1.6777544022, 1.3933215141,
                  1.1439051628, 0.9334115982, 0.7561602592)
df <- data.frame(speed, acceleration)
# expand data.frame and add a little bit of noise to all values
# to make them 'unique'
df <- as.data.frame(do.call(
  rbind,
  replicate(15L, apply(df, 2, \(x) (x + runif(length(x), -1e-1, 1e-1))),
            simplify = FALSE)
))
The function create_intervals, as the name suggests, creates user-defined intervals. The rest of the code does the 'heavy lifting' and stores the desired result in out.
If you would like to have intervals of speed with equal widths, simply specify the number of groups (n_groups) you would like to have and leave the rest of the arguments (i.e. lwr, upr, and interval_span) unspecified.
# Cut speed into user-defined intervals
create_intervals <- \(n_groups = NULL, lwr = NULL, upr = NULL, interval_span = NULL) {
  if (!is.null(lwr) & !is.null(upr) & !is.null(interval_span) & is.null(n_groups)) {
    speed_low <- subset(df, speed < lwr, select = speed)
    first_interval <- with(speed_low, c(min(speed), lwr))
    middle_intervals <- seq(lwr + interval_span, upr - interval_span, interval_span)
    speed_upp <- subset(df, speed > upr, select = speed)
    last_interval <- with(speed_upp, c(upr, max(speed)))
    intervals <- c(first_interval, middle_intervals, last_interval)
  } else {
    step <- with(df, c(max(speed) - min(speed)) / n_groups)
    intervals <- array(0L, dim = n_groups)
    for (i in seq_len(n_groups)) {
      intervals[i] <- min(df$speed) + i * step
    }
  }
  return(intervals)
}
# three intervals with equal width
my_intervals <- create_intervals(n_groups = 3L)
# Compute values of speed when acceleration is greater than
# or equal to the 90th percentile
out <- lapply(1:(length(my_intervals) - 1L), \(i) {
  x <- subset(df, speed >= my_intervals[i] & speed <= my_intervals[i + 1L])
  x[x$acceleration >= quantile(x$acceleration, 0.9), ]
})
# function to round values to two decimal places
r <- \(x) format(round(x, 2), nsmall = 2L)
# assign names to each element of out
for (i in seq_along(out)) {
  names(out)[i] <- paste0(r(my_intervals[i]), '-', r(my_intervals[i + 1L]))
}
Output 1
> out
$`3.38-3.57`
speed acceleration
11 3.394378 2.583636
21 3.383631 2.267659
57 3.434123 2.300234
83 3.394886 2.580924
101 3.395459 2.460971
$`3.57-3.76`
speed acceleration
6 3.635234 1.447290
41 3.572868 1.618293
51 3.615017 1.420020
95 3.575412 1.763215
We could also compute the desired values of speed based on intervals that make more 'sense' than just equally spaced speed intervals, e.g. [min(speed), 3.3), [3.3, 3.45), [3.45, 3.6), and [3.6, max(speed)).
This can be accomplished by leaving n_groups unspecified and instead specifying lwr, upr, and an interval_span that makes sense. For instance, an interval span of 0.15 makes sense when the lower limit is 3.3 and the upper limit is 3.6.
# custom boundaries based on a lower limit and upper limit
my_intervals <- create_intervals(lwr = 3.3, upr = 3.6, interval_span = 0.15)
Output 2
> out
$`3.18-3.30`
speed acceleration
37 3.238781 2.696456
82 3.258691 2.722076
$`3.30-3.45`
speed acceleration
11 3.394378 2.583636
19 3.328292 2.711825
73 3.315306 2.644580
83 3.394886 2.580924
$`3.45-3.60`
speed acceleration
4 3.520530 2.018930
40 3.517329 2.032943
58 3.485247 2.079893
67 3.458031 2.078545
$`3.60-3.76`
speed acceleration
6 3.635234 1.447290
34 3.688131 1.218969
51 3.615017 1.420020
78 3.628465 1.348873
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
I am stuck with this problem.
I want to multiply my data, but with each portion of each observation multiplied by a different percentage.
For example: suppose the first observation of my Var_1 has the value 5000.
I want to multiply the portion between 100 and 1000 by 2% (in this particular case 900 x 2%),
the portion between 1000 and 2000 by 3% (in this case 1000 x 3%),
and the portion >2000 by 5% (in this case 3000 x 5%),
and add the sum of this process for each observation in a new variable.
Any idea how to proceed?
I have these data like example:
library(tidyverse)
my_data <- tibble(Var_1 = c(5000, 1500, 350, 1200, 750, 1000,1250, 2500))
We can use case_when
library(dplyr)
my_data %>%
  mutate(prop = case_when(between(Var_1, 100, 1000) ~ Var_1 * 0.02,
                          between(Var_1, 1000, 2000) ~ Var_1 * 0.03,
                          Var_1 > 2000 ~ Var_1 * 0.05))
You don't need to load a package just for this one operation. It may be tempting to write an ifelse statement, and that would probably work, but R excels when you use vectorised operations rather than loops; here, plain logical indexing from base R does the job.
# 1000 randomly selected numbers between 0 and 3000
data <- sample(c(0:3000), 1000, replace = TRUE)
# Multiply the data by a percentage based on its value.
# It is tempting to do this with ifelse statements, but R
# is best at vectorised operations, so use logical indexing.
# Multiply values between 100 and 1000 by 0.02
idx <- data >= 100 & data < 1000
data[idx] <- data[idx] * 0.02
# Multiply values between 1000 and 2000 by 0.03
idx <- data >= 1000 & data < 2000
data[idx] <- data[idx] * 0.03
# Multiply values greater than or equal to 2000 by 0.05
idx <- data >= 2000
data[idx] <- data[idx] * 0.05
Here's a more generic way:
# First create a table of intervals and multipliers
bins <- data.frame(from = c(0, 100, 1000, 2000),
                   to = c(100, 1000, 2000, Inf),
                   multiplier = c(0, 0.02, 0.03, 0.05))
# Join that table to *every* row of your starting data
df <- merge(bins, my_data, all = TRUE) %>% mutate(
  # Calculate the overlap between each interval and your value
  interval_length = pmax(pmin(Var_1, to) - from, 0),
  # Multiply the amount of overlap by the given percentage
  amount = interval_length * multiplier
)
# Add up the calculated amounts for each observation
df %>% group_by(Var_1) %>% summarise(amount = sum(amount))
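As a sanity check against the worked example in the question, the bracketed amounts for the first observation (Var_1 = 5000) add up as expected:
# 900 in the 2% bracket, 1000 in the 3% bracket, 3000 above 2000 at 5%
900 * 0.02 + 1000 * 0.03 + 3000 * 0.05
#> [1] 198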
I have a long time series (zoo) of precipitation data. I know how to obtain the monthly average of the values:
library(hydroTSM)
ma= monthlyfunction(data, mean, na.rm=TRUE)
I also know how to obtain the monthly sum of the values:
su= monthlyfunction(data, sum, na.rm=TRUE)
but with the last one I get, for each month, the sum over the whole period of the time series. I would like instead a monthly average of the yearly sums; for example:
jan 1980 (sum)= 150
jan 1981 (sum)= 180
jan 1982 (sum)= 90
expected value for january = average(150,180,90)= 140
Is there a function for this instead of mean and sum?
library(hydroTSM)
#This data is daily streamflows, but is similar to Precipitation
data(OcaEnOnaQts)
x <- OcaEnOnaQts
# In case you want monthly precipitation in "precipitation / 30 days" (which is common), you can use
monthlyfunction(x, FUN = mean, na.rm = TRUE) * 30
# In case you want the precipitation scaled by the number of days in each specific month, you can use
monthlyfunction(x, FUN = mean, na.rm = TRUE) * as.vector(dwi(x, out.unit = "months") * mean(dwi(x)) / sum(dwi(x)))
# or approximately
monthlyfunction(x, FUN = mean, na.rm = TRUE) * c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
# Addendum: some ways to arrive at the mean monthly precipitation
p1980 <- c(rep(0,28), 50, 50, 50) #sum = 150
p1981 <- c(rep(0,28), 60, 60, 60) #sum = 180
p1982 <- c(rep(0,28), 30, 30, 30) #sum = 90
#
mean(c(sum(p1980), sum(p1981), sum(p1982)))         # = 140; this is how you want it to be calculated
mean(c(p1980, p1981, p1982)) * 31                   # = 140; this is how I suggested arriving at the result
# Some other ways to arrive at the mean monthly precipitation
mean(c(mean(p1980), mean(p1981), mean(p1982))) * 31 # = 140
sum(c(p1980, p1981, p1982)) / 3                     # = 140
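For completeness, a package-agnostic sketch of the same "average of the monthly sums" computation directly in zoo (a minimal sketch, assuming data is the daily zoo series from the question):
library(zoo)
# one total per calendar month of each year
monthly_sums <- aggregate(data, as.yearmon, sum, na.rm = TRUE)
# average those totals across years, grouped by month of year
tapply(coredata(monthly_sums), format(index(monthly_sums), "%m"), mean)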