I am stuck with this problem.
I want to multiply my data, but apply a different percentage to each portion of each observation.
For example, suppose the first observation of my Var_1 has the value 5000.
I want to multiply the portion between 100 and 1000 by 2% (in this particular case 900 x 2%),
the portion between 1000 and 2000 by 3% (in this case 1000 x 3%),
and the portion above 2000 by 5% (in this case 3000 x 5%),
and then store the sum of these amounts for each observation in a new variable.
Any idea how to proceed?
Here is some example data:
library(tidyverse)
my_data <- tibble(Var_1 = c(5000, 1500, 350, 1200, 750, 1000,1250, 2500))
We can use case_when
library(dplyr)
my_data %>%
mutate(prop = case_when(between(Var_1, 100, 1000) ~ Var_1 * 0.02,
between(Var_1, 1000, 2000) ~ Var_1 * 0.03,
Var_1 > 2000 ~ Var_1 * 0.05))
You don't need to load a package just for this one operation. Logical indexing in base R is enough. ifelse would also work (it is vectorized too), but nested ifelse calls get hard to read, so here each band gets its own logical mask.
# 1000 randomly selected numbers between 0 and 3000
data <- sample(0:3000, 1000, replace = TRUE)
# Multiply the data by a percentage based on its value.
# Build the logical masks first, from the original values, so that an
# already-scaled value can never fall into a later band.
low  <- data >= 100 & data < 1000
mid  <- data >= 1000 & data < 2000
high <- data >= 2000
# Multiply values between 100 and 1000 by 0.02
data[low] <- data[low] * 0.02
# Multiply values between 1000 and 2000 by 0.03
data[mid] <- data[mid] * 0.03
# Multiply values of 2000 or more by 0.05
data[high] <- data[high] * 0.05
Here's a more generic way:
# First create a table of intervals and multipliers
bins <- data.frame(from = c(0, 100, 1000, 2000),
to = c(100, 1000, 2000, Inf),
multiplier = c(0, 0.02, 0.03, 0.05))
# Join that table to *every* row of your starting data
df <- merge(bins, my_data, all=TRUE) %>% mutate(
  # Calculate the overlap between each interval and your value
  # (refer to the columns directly; df does not exist yet inside mutate)
  interval_length = pmax(pmin(Var_1, to) - from, 0),
  # Multiply the amount of overlap by the given percentage
  amount = interval_length * multiplier
)
# Add up all the calculated amounts
sum(df$amount)
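For a result per observation rather than one grand total, the same overlap idea can be wrapped in a small function. This is only a sketch: tiered_amount is a made-up helper name, and the band edges and rates are the ones from the question.

```r
# Hypothetical helper: for each value in x, sum the overlap with every band
# multiplied by that band's rate.
tiered_amount <- function(x, from, to, rate) {
  vapply(x, function(v) sum(pmax(pmin(v, to) - from, 0) * rate), numeric(1))
}

my_data <- data.frame(Var_1 = c(5000, 1500, 350, 1200, 750, 1000, 1250, 2500))
my_data$amount <- tiered_amount(my_data$Var_1,
                                from = c(100, 1000, 2000),
                                to   = c(1000, 2000, Inf),
                                rate = c(0.02, 0.03, 0.05))
my_data
```

For the first observation (5000) this gives 900 x 2% + 1000 x 3% + 3000 x 5% = 198.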
Suppose I have an income of 1200 every month.
The interest rate is 1%, so after each year the income increases by 1%.
I would like to find out how many years it will take until each investment breaks even.
Suppose an investment costs 200,000 with a monthly income of 1200.
My annual income in the first and subsequent years will be:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:5)
c(firstYear, additionalYears)
14400.00 14544.00 14689.44 14836.33 14984.70 15134.54
I would like to make the "5" in the above example dynamic, so that it increases until the breakeven point is found.
In this example I have:
sum(c(firstYear, additionalYears))
198854.3
So the investment has not broken even yet. Adjusting it to "12" gives me the breakeven point:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:12)
sum(c(firstYear, additionalYears))
If possible, I would also like to determine the month of that year in which it breaks even (in this example it breaks even in month 12 of year 12, but others might break even in month 8 of year 6, etc.).
Instead of using a loop, you can use vectors. Set the maximum number of years to 100, create a vector of 101 incomes (years 0 to 100) and a vector of 101 growth factors, and multiply the two. Then take the cumulative sum of the incomes and add the (negative) cost. The number of negative entries in that running total is the number of full years before break-even.
cost = -200000 # negative cost
income = 1200*12 # annual income
i = 0.01 # interest rate to grow income after year 0.
# repeat 14400 101 times, multiply it by (1+i)^n - R is vectorised
income100 = rep(income, 101) * ((1+i) ^ seq(0,100))
# subtract the cost from the cumulative sum of income
cumincome = cost + cumsum(income100)
# how many are negative?
yrs = sum(cumincome < 0)
# how much left to recover in next year
yrs + (-cumincome[yrs] / income100[yrs+1])
[1] 13.06991
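To turn that fractional answer into a year and a month, as the question asks, one option is to split it as below. This is a rough sketch: break_even_month is a hypothetical helper, and it assumes the income arrives evenly over the year.

```r
# Hypothetical helper: split a fractional break-even time into the number of
# completed years and the month of the following year in which the cost is
# recovered. A fraction of exactly 0 gives month 0, meaning the break-even
# falls right at the end of the last completed year.
break_even_month <- function(years_frac) {
  full_years <- floor(years_frac)
  month <- ceiling((years_frac - full_years) * 12)
  c(full_years = full_years, month = month)
}

break_even_month(13.06991)
```

For 13.06991 this returns 13 full years and month 1, i.e. the first month after 13 completed years.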
Putting this into a function
break_even_years <- function(cost, income, interest=0, period = "monthly"){
  if(cost >= 0) cost = -cost
  if(period == "monthly") income = income * 12
  income100 = rep(income, 101) * ((1+interest) ^ seq(0,100))
  cumincome = cost + cumsum(income100)
  # how many are negative?
  yrs = sum(cumincome < 0)
  # how much left to recover in next year
  yrs + (-cumincome[yrs] / income100[yrs+1])
}
Using the function
purrr::map2(
.x = cost,
.y = investment,
~ break_even_years(.x, .y, interest = 0.03, "annual"))
[[1]]
[1] 28.90435
[[2]]
[1] 23.75858
[[3]]
[1] 6.391264
[[4]]
[1] 3.505453
Benchmarking
library(microbenchmark)
microbenchmark(break_even_years(200000,1000,0.01),
find_break_even_year(1000, 200000, 0.01, 100), times = 1000)
Unit: microseconds
                                          expr   min      lq      mean median      uq      max neval
          break_even_years(200000, 1000, 0.01)  50.9   87.10  257.4185  119.0  159.05  36938.0  1000
 find_break_even_year(1000, 200000, 0.01, 100) 853.5 1247.05 3432.5157 1556.2 2391.35 145980.6  1000
I think this answers my question. If anybody can help with avoiding the for loop, that would be very helpful.
library(tidyverse)
investment = c(1000, 2000, 5000, 27000)
interest_rate = 0.03
cost = c(45000, 67900, 34678, 98367)
max_years = 100
future_value = list()
find_break_even_year <- function(CF, investment, interest_rate, max_years){
for (year in 1:max_years){
#print(year)
future_value[[year]] <- CF * (1 + interest_rate)^year
future_value_sums = sum(unlist(future_value))
if(future_value_sums >= investment)
return(year)
}
}
purrr::map2(
.x = investment,
.y = cost,
~ find_break_even_year(.x, .y, interest_rate = 0.03, max_years = 100)
)
Morning folks,
I'm trying to categorize a set of numerical values (days left divided by 365.2, which gives approximately the number of years left until maturity).
The result of this first calculation is a vector of 3560 values (example: 0.81, 1.65, 3.26 [...], 0.2).
I'd like to categorise these results into intervals: [between 0 and 1 year, 0 and 2 years, 0 and 3 years, 0 and 4 years, over 4 years].
#Set the Data Frame
dfMaturity <- data.frame(Maturity = DATA$Maturity)
#Call the library and Run the function
MaturityX = ddply(df, .(Maturity), nrow)
#Set the Data Frame
dfMaturityID <- data.frame(testttto = DATA$Security.Name)
#Calculation of the remaining days
MaturityID = ddply(df, .(dfMaturityID$testttto), nrow)
survey <- data.frame(date=c(DATA$Maturity),tx_start=c("1/1/2022"))
survey$date_diff <- as.Date(as.character(survey$date), format="%m/%d/%Y")-
as.Date(as.character(survey$tx_start), format="%m/%d/%Y")
# Data for the table
MaturityName <- MaturityID$`dfMaturityID$testttto`
MaturityZ <- survey$date
TimeToMaturity <- as.numeric(survey$date_diff)
# /!/ HERE IS WHERE I NEED HELP /!/ I'M TRYING TO CATEGORISE THE RESULTS OF THIS CALCULATION
Multiplier <- TimeToMaturity /365.2
cx <- cut(Multiplier, breaks=0:5)
The original datasource comes from an excel file (DATA$Maturity)
If it helps, print(Multiplier) gives:
[1] 0.4956188 1.4950712 1.9989047 0.2464403 0.9994524 3.0010953 5.0000000 7.0016429 9.0005476
[10] 21.0021906 4.1621030 13.1626506 1.1610077 8.6664841 28.5377875 3.1626506 6.7497262 2.0920044
[19] 2.5602410 4.6495071 0.3368018 6.3225630 8.7130340 10.4956188 3.9019715 12.7957284 5.8378970
I copied the first three lines, but there are 3560 values in total.
I'm open to any kind of help, I just want it to work :) thank you !
The cut function does that:
example <- c(0.81, 1.65, 3.26, 0.2)
cut(example, breaks = c(0, 1, 2, 3, 4),
labels = c("newborn", "one year old", "two", "three"))
Edit:
From the comment
I'd like then to create a table with for example: 30% of the objects has a maturity between 0 and 1 year
You could compute that using the function below:
example <- c(0.81, 1.65, 3.26, 0.2)
share <- function(x, lower = 0, higher= 1){
x <- na.omit(x)
sum((lower <= x) & (x < higher))/length(x)
}
share(1:10, lower = 0,higher = 3.5) # true for 1:3 out of 1:10 so 30%
share(1:10, lower = 4.5, higher = 5.5) # true for 5 so 10%
share(example, 0, 3)
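To get the whole table of shares in one go, cut can be combined with table and prop.table. A small sketch, assuming an open-ended last bin for "over 4 years" (the Inf break and the label texts are my own choice, not from the question):

```r
example <- c(0.81, 1.65, 3.26, 0.2)

# Bin the values; Inf as the last break catches everything above 4 years,
# and include.lowest = TRUE keeps a value of exactly 0 in the first bin.
bins <- cut(example,
            breaks = c(0, 1, 2, 3, 4, Inf),
            labels = c("0-1 year", "1-2 years", "2-3 years",
                       "3-4 years", "Over 4 years"),
            include.lowest = TRUE)

shares <- prop.table(table(bins))  # share of values in each bin
cumulative <- cumsum(shares)       # share with maturity up to each upper bound
round(100 * shares, 1)
```

With these four example values, half of them fall in the first bin. The cumulative vector covers the "between 0 and 2 years" style of category from the question.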
I have a dataframe, and I want to add some random missingness to the data as well as random noise.
22RV1 23132-87 639-V
exp_TSPAN6 7.951917 3.524705 12.04370
exp_TNMD 8.079243 3.580134 12.20077
exp_DPM1 8.509788 6.853905 4.406769
exp_SCYL3 7.642608 7.613985 8.741316
exp_C1orf112 3.231196 7.695874 10.668163
Desired Output
22RV1 23132-87 639-V
exp_TSPAN6 7.951917 2.524705 12.04370
exp_TNMD 7.079243 Nan 13.20077
exp_DPM1 8.509788 6.853905 4.406769
exp_SCYL3 Nan 6.613985 Nan
exp_C1orf112 3.231196 7.695874 9.668163
I tried this function, but it's too slow for a dataset with 1000 columns and 500 rows. Is there an efficient way to add both random missingness and noise?
insertNA <- function(df,NAratio) {
sel <- sample( nrow(df)*ncol(df), size = NAratio*nrow(df)*ncol(df) )
for (i in c(1:length(sel))) {
a <- as.integer((sel[i]-1)/ncol(df)+1)
b <- sel[i] - (a-1)*ncol(df)
df[a,b] <- NA
}
return(df)
}
Although you want to add noise and missingness randomly, there should still be some rules defined.
To add noise I use the rnorm function, and I change roughly 20% of the values to NA here. You can change the conditions based on your requirements.
n <- nrow(df)
df[] <- lapply(df, function(x) x + rnorm(n) *
sample(c(1, NA),n, replace = TRUE, prob = c(0.8, 0.2)))
df
# X22RV1 X23132.87 X639.V
#exp_TSPAN6 7.172716 5.179046 10.417945
#exp_TNMD 7.568185 3.794910 11.130833
#exp_DPM1 7.701909 6.799382 3.003337
#exp_SCYL3 8.034432 NA 8.510112
#exp_C1orf112 NA NA 10.981735
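If you need an exact share of missing cells across the whole table, as in the original insertNA, the loop can be replaced with linear indexing on a matrix. This is a sketch under assumptions: add_noise_and_na, na_ratio and noise_sd are names I made up, all columns are taken to be numeric, and the noise is standard normal by default.

```r
# Hypothetical helper (not from the original posts): adds Gaussian noise to
# every cell, then sets a fixed share of randomly chosen cells to NA.
add_noise_and_na <- function(df, na_ratio = 0.2, noise_sd = 1) {
  m <- as.matrix(df)                        # assumes all columns are numeric
  m <- m + rnorm(length(m), sd = noise_sd)  # noise for every cell in one call
  na_cells <- sample(length(m), size = round(na_ratio * length(m)))
  m[na_cells] <- NA                         # linear indexing, no loop needed
  df[] <- m                                 # keep the original row/column names
  df
}

set.seed(1)
big <- as.data.frame(matrix(runif(500 * 1000, 3, 12), nrow = 500))
out <- add_noise_and_na(big, na_ratio = 0.2)
mean(is.na(out))  # share of missing cells
```

Because the cells are sampled without replacement, the share of NA values equals na_ratio (up to rounding) instead of varying from column to column.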
I am trying to simulate a discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("real_R", "real_G", "real_B")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have a conditional probability table that gives the distribution of the indicator for each of the "realities":
ri_matrix <- matrix(c(0.7, 0.3, 0,
0.2, 0.6, 0.2,
0.05,0.15,0.8), byrow=TRUE,nrow = 3)
dimnames(ri_matrix) <- list(r_names,
                            sub("real", "ind", r_names))
ri_matrix
#>        ind_R ind_G ind_B
#> real_R  0.70  0.30   0.0
#> real_G  0.20  0.60   0.2
#> real_B  0.05  0.15   0.8
Since base::sample() is not vectorized for prob argument, I have to:
sample_cond <- function(r, rim){
unlist(lapply(r, function(x)
sample(seq_len(ncol(rim)), 1, prob = rim[x,], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#> indicator
#> reality 1 2 3
#> 1 0.70043610 0.29956390 0.00000000
#> 2 0.19976124 0.59331476 0.20692400
#> 3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, this is at least 50x faster, because it reuses the probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
il <- mapply(function(x,y){sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE)},
x=split(r, r),
y=split(rim, seq(nrow(rim))))
unsplit(il, r)
}
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this
simFun <- function(N, r_probs, ri_matrix) {
stopifnot(length(r_probs) == nrow(ri_matrix))
ind <- sample.int(length(r_probs), N, prob = r_probs, replace=TRUE)
grp <- split(data.frame(ind), ind)
unsplit(Map(function(data, r) {
draw <-sample.int(ncol(ri_matrix), nrow(data), replace=TRUE, prob=ri_matrix[r, ])
data.frame(data, draw)
}, grp, as.numeric(names(grp))), ind)
}
Then you can call it with
simFun(10000, r_probs, ri_matrix)
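Another option that avoids splitting altogether is inverse-CDF sampling: draw one uniform number per observation and compare it against the cumulative probabilities of that observation's row. This is a sketch of that swapped-in technique, not what the answers above do, and sample_cond_vec is a hypothetical name.

```r
r_probs <- c(0.3, 0.5, 0.2)
ri_matrix <- matrix(c(0.7,  0.3,  0,
                      0.2,  0.6,  0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)

# Hypothetical helper: one runif() call for all observations.
sample_cond_vec <- function(r, rim) {
  k <- ncol(rim)
  cum <- t(apply(rim, 1, cumsum))  # row-wise cumulative probabilities
  u <- runif(length(r))
  # The category is 1 plus the number of cumulative thresholds that u exceeds;
  # the last threshold (always 1) is dropped.
  rowSums(u > cum[r, -k, drop = FALSE]) + 1
}

set.seed(100)
reality <- sample(seq_along(r_probs), 10000, prob = r_probs, replace = TRUE)
indicator <- sample_cond_vec(reality, ri_matrix)
prop.table(table(reality, indicator), margin = 1)
```

The resulting row proportions should be close to the rows of ri_matrix, the same check as in the question.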
I have a data.table which has a balance. The balance is based on deposits/withdrawals each period, and each period has an interest rate that should be applied. However, I am not able to compound the interest on the balances, i.e. apply the interest rate to the balance and then use the updated balance in the next period to calculate the new value.
Balance_t1 = (0 + Deposit_t1)*(1+Interest_t1)
Balance_t2 = (Balance_t1 + Deposit_t2)*(1+Interest_t2)
Balance_t3 = (Balance_t2 + Deposit_t3)*(1+Interest_t3)
I have the following data.table
dtCash <- data.table(
Deposit = c(100, 100, -300, 0),
Balance = c(100, 200, -100, -100),
Interest=c(0.1, 0.01, 0.2, 0.1)
)
The result would be:
dtCash <- data.table(
Deposit = c(100, 100, -300, 0),
Balance = c(100, 200, -100, -100),
Interest=c(0.1, 0.01, 0.2, 0.1),
BalanceWithInterest = c(110, 212.1, -105.48, -116.028)
)
How do I update and reference the updated Balance column in every period?
It seems like you're looking for a "cumulative sum and product," which I don't know of a way to do in R (other than, for instance, with @dynamo's for loop).
That being said, this can be done efficiently with a relatively simple Rcpp solution:
library(Rcpp)
getBalance <- cppFunction(
"NumericVector getBalance(NumericVector deposit,
NumericVector interest) {
NumericVector result(deposit.size());
double prevResult = 0.0;
for (int i=0; i < deposit.size(); ++i) {
result[i] = (prevResult + deposit[i]) * (1.0 + interest[i]);
prevResult = result[i];
}
return result;
}")
Deposit <- c(100, 100, -300, 0)
Interest <- c(0.1, 0.01, 0.2, 0.1)
getBalance(Deposit, Interest)
# [1] 110.000 212.100 -105.480 -116.028
To give a sense of the efficiency improvements of Rcpp vs. base R:
# Base R solution
f2 = function(Deposit, Interest) {
Balance <- c(0, rep(NA, length(Deposit)))
for (i in 2:length(Balance)) {
Balance[i] = (Balance[i-1] + Deposit[i-1]) * (1+Interest[i-1])
}
return(Balance[-1])
}
set.seed(144)
Deposit <- runif(1000000, -1, 2)
Interest = runif(1000000, 0, 0.05)
system.time(getBalance(Deposit, Interest))
# user system elapsed
# 0.008 0.000 0.008
system.time(f2(Deposit, Interest))
# user system elapsed
# 4.701 0.008 4.730
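If Rcpp is not an option, base R's Reduce with accumulate = TRUE can express the same recurrence without writing the loop by hand. It still iterates at the R level, so I would not expect it to get near the Rcpp timings. This is just a sketch using the small example from the question.

```r
Deposit <- c(100, 100, -300, 0)
Interest <- c(0.1, 0.01, 0.2, 0.1)

# Carry the previous balance forward; i indexes the period.
# The third argument (0) is the starting balance before the first period.
balance <- Reduce(
  function(prev, i) (prev + Deposit[i]) * (1 + Interest[i]),
  seq_along(Deposit),
  0,
  accumulate = TRUE
)

balance[-1]  # drop the starting balance
# [1]  110.000  212.100 -105.480 -116.028
```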
Not enough rep to comment yet:
Can you give an indication of what data you have at each point/ when you wish to update? Do you wish to calculate, say, balance_after_interest(t) based on interest, balance(t-1) and deposits(t)?
A somewhat messy answer:
library(data.table)
dtCash <- data.table(
Deposit = c(100, 100, -300, 0),
Balance = c(100, 200, -100, -100),
Interest=c(0.1, 0.01, 0.2, 0.1)
)
# Add row for t = 0
dtCash <- rbind(rep(0, ncol(dtCash)), dtCash)
# Add "dummy" column for interest-accrued balance
dtCash$Balance.1 <- c(0, rep(NA, nrow(dtCash)-1))
for ( i in seq(nrow(dtCash))[-1] ) {
dtCash$Balance.1[i] <- (dtCash$Balance.1[i - 1] + dtCash$Deposit[i]) *
(1 + dtCash$Interest[i])
}
dtCash
# Deposit Balance Interest Balance.1
# 1: 0 0 0.00 0.000
# 2: 100 100 0.10 110.000
# 3: 100 200 0.01 212.100
# 4: -300 -100 0.20 -105.480
# 5: 0 -100 0.10 -116.028
Is this what you mean? This isn't super efficient, but it does give you what you are looking for. With some clever re-parameterisation you might be able to work around the explicit loop.
Also, if your problem size is small, you could just as well use data.frame rather than data.table. In this case, the notation would be identical. (And in this case, there is no advantage from using data.table.)
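One possible re-parameterisation, as a sketch: each deposit ends up multiplied by every growth factor from its own period onwards, so the balance can be written as cumprod(1 + Interest) times a cumulative sum of deposits discounted by the growth up to the previous period. The names growth, growth_lag and balance_vec are mine. Note the assumptions: no interest rate may be -1 (division by zero), and for very long series the cumulative product can overflow or lose precision.

```r
Deposit <- c(100, 100, -300, 0)
Interest <- c(0.1, 0.01, 0.2, 0.1)

growth <- cumprod(1 + Interest)          # growth from period 1 up to period t
growth_lag <- c(1, head(growth, -1))     # growth up to period t - 1

# Discount each deposit to "time 0", accumulate, then grow back to period t.
balance_vec <- growth * cumsum(Deposit / growth_lag)
balance_vec
# [1]  110.000  212.100 -105.480 -116.028
```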
I think you need to pull out the data, work it out with lapply(), and update it. I don't think there's any vector way to do it:
interest<-dtCash[,Interest]
balance<-dtCash[,Balance]
lapply(1:(length(interest)-1), # leave the last entry - nothing to add it to
function(x)
{balance[x+1]<<-balance[x+1]+balance[x]*interest[x]} # remember the double arrow
) # because you're in a function
dtCash[,rollBal:=balance]
Deposit Balance Interest rollBal
1: 100 100 0.10 100.00
2: 100 200 0.01 220.00
3: -300 -100 0.20 -95.70
4: 0 -100 0.10 -138.72