I am testing whether sending consumers information about a promotion convinces them to buy anything. Out of 100k consumers, we randomly selected 90% of them and sent them catalogs. After some time, we tracked who had bought.
To recreate the problem, let's use:
set.seed(1)
got <- rbinom(n=100000, size=1, prob=0.1)
bought <- rbinom(n=100000, size=1, prob=0.05)
table(got, bought)
   bought
got     0     1
  0 85525  4448
  1  9567   460
As I read here, I should use the prop.test(table(got, bought), correct=FALSE) function, but I want to check not only whether the proportions are equal, but whether the proportion of those who bought during the promotion was greater in the group that got the leaflet than in the group that didn't get it.
Should I use the argument alternative = "less" or alternative = "greater"? And does the order of got and bought matter?
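You usually want to use a two-sided alternative (for all you know, sending a promotion annoys people and makes them less likely to purchase).
prop.test runs a chi-squared test, and the chi-squared statistic by itself does not indicate which group is bigger. When exactly two groups are compared, prop.test does honor alternative = "less" or "greater", with the direction read as the first row's proportion relative to the second row's.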
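For a comparison of exactly two groups, a one-sided prop.test is possible, and the layout of the table matters: prop.test reads the first column as successes and the second as failures, and alternative = "greater" asks whether the first row's proportion exceeds the second row's. A minimal sketch using the counts from the table in the question (the names counts and res are only for illustration):

```r
# Rows: group 1 = got the leaflet, group 2 = did not.
# Columns: successes (bought) first, failures (did not buy) second.
counts <- matrix(c(460,  9567,
                   4448, 85525),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(got = c("1", "0"),
                                 outcome = c("bought", "not_bought")))

# H1: P(bought | leaflet) > P(bought | no leaflet)
res <- prop.test(counts, alternative = "greater", correct = FALSE)
res$estimate  # observed purchase rate in each group
res$p.value   # one-sided p-value
```

Building the matrix by hand, rather than passing table(got, bought) directly, makes it explicit which group and which outcome the direction of the test refers to.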
You could do a t.test like this
t.test(bought ~ got, data = data.frame(got = got, bought = bought))
Depending on your typical conversion rate, sample size and alpha, you can get confidence intervals that imply negative conversion rates, so a bootstrap or Bayesian approach may be better suited.
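As a rough illustration of the bootstrap option, here is a percentile-bootstrap sketch for the difference in purchase rates, using the counts from the question's table. The variable names and the choice of 10,000 resamples are assumptions made for this example.

```r
set.seed(1)  # only to make the resampling reproducible

# Counts taken from the table in the question
n_got  <- 460 + 9567     # received the leaflet
n_none <- 4448 + 85525   # did not receive it
p_got  <- 460 / n_got
p_none <- 4448 / n_none
obs_diff <- p_got - p_none

# Resampling 0/1 outcomes with replacement within a group is equivalent to
# drawing the number of buyers from a binomial with the observed rate.
B <- 10000
boot_diff <- rbinom(B, n_got, p_got) / n_got -
             rbinom(B, n_none, p_none) / n_none

# 95% percentile interval for the difference in purchase rates
ci <- quantile(boot_diff, c(0.025, 0.975))
ci
```

Each resampled rate stays between 0 and 1 by construction, so the resampled differences cannot imply an impossible conversion rate.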
Related
I am running an impulse response function in R, using the package vars.
My data has 3 variables, the inflation (Brazilian CPI, or IPCA), the exchange rate and the output gap.
My goal is to calculate the exchange rate pass-through (both the maximum impact and the lag), and I am following an academic recommendation to add the output gap (as the monthly industrial production with HP filter).
The pass-through I am interested in is exchange rate -> CPI. The output gap is of my interest only in the way it impacts this pass-through relation. So I wrote the code as:
model_irf <- vars::irf(model_var,
                       impulse = "Exchange Rate",
                       response = "CPI",
                       n.ahead = 12,
                       cumulative = TRUE)
This gives me the expected response of variable “CPI” t+12 to a unit change in variable “Exchange Rate”.
I imagine (from macroeconomic theory) that the output gap affects the magnitude of the pass-through, so that in periods of a larger output gap companies have less room to increase prices; that relation is not visible in the model I wrote.
My question is: How is the output gap related to the IRF I calculated? (Or if the model is wrong and I should write it differently to test this assumption)
Thank you very much for your time!
My code for getting a proper plot in R does not seem to work (I am new to R and I am having difficulties with coding).
Basically, using the concept of temporal discounting in the form of beta-delta model, we are supposed to calculate the subjective value for $10 at every delay from 0 to 365.
The context of the homework is that we have to account for the important exception that if a reward is IMMEDIATE, there’s no discount, but if it occurs at any delay there’s both an exponential discount and a delay penalty.
I created a variable called BetaDeltaValuesOf10, which is 366 elements long and represents the subjective value for $10 at each delay from 0 to 365.
The code needs to have the following properties using for-loops and an if-else statement:
1) IF the delay is 0, the subjective value is the objective magnitude (and should be saved to the appropriate element of BetaDeltaValuesOf10).
2) OTHERWISE, calculate the subjective value at the exponentially discounted rate, assuming 𝛿 = .98 and apply a delay penalty of .8, then save it to the appropriate element of BetaDeltaValuesOf10.
The standard code given to us to help us in creating the code is as follows:
BetaDeltaValuesOf10 = 0
Delays = 0:365
Code(Equation) to get subjective value/preference using exponential discounting model:
ExponentialDecayValuesOf10 = .98^Delays*10
0.98 is the discount rate which ranges between 0 and 1.
Delays is the number of time periods in the future when the later reward will be delivered
10 is the subjective value of $10
Code(Equation) to get subjective value using beta-delta model:
0.8*0.98^Delays*10
0.8 is the delay penalty
The code I came up with in trying to satisfy the above mentioned properties is as follows:
for(t in 1:length(Delays)){BetaDeltaValuesOf10 = 0.98^0*10
if(BetaDeltaValuesOf10 == 0){0.98^t*10}
else {0.8*0.98^t*10}
}
So, I tried the code and did not get any error. But, when I try to plot the outcome of the code, my plot comes up blank.
To plot I used the code:
plot(BetaDeltaValuesOf10,type = 'l', ylab = 'DiscountedValue')
I believe that my code is actually faulty and that is why I am not getting a proper outcome for my plot.
Please let me know of the amendments to the code and if the community needs any clarification, I will try to clarify as soon as I can.
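result <- double(length=366)   # one slot per delay, 0 to 365
delays <- 0:365
val <- 10                      # objective magnitude of the reward
delta <- 0.98                  # exponential discount factor
penalty <- 0.8                 # delay penalty, applied only when delay > 0
for(t in seq_along(delays)) {
  # (delays[t] > 0) is TRUE/FALSE, i.e. 1/0, so the penalty is skipped at delay 0
  result[t] <- val * delta^delays[t] * penalty^(delays[t]>0)
}
plot(x=delays, y=result, pch=20)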
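Since the assignment asks for a for-loop with an explicit if-else, the same calculation can also be sketched that way, keeping the variable names from the question. Note that the loop indexes into Delays, so element 1 corresponds to a delay of 0.

```r
Delays <- 0:365
BetaDeltaValuesOf10 <- numeric(length(Delays))  # 366 slots, one per delay

for (i in seq_along(Delays)) {
  if (Delays[i] == 0) {
    # immediate reward: no discount at all
    BetaDeltaValuesOf10[i] <- 10
  } else {
    # delayed reward: delay penalty times exponential discount
    BetaDeltaValuesOf10[i] <- 0.8 * 0.98^Delays[i] * 10
  }
}

plot(Delays, BetaDeltaValuesOf10, type = "l", ylab = "DiscountedValue")
```

The original attempt never assigned into BetaDeltaValuesOf10[i] inside the loop, so the variable stayed a single number and the line plot had nothing to draw.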
I am using the CRRBinomialTreeOption function in the fOptions package to price American options. For example:
CRRBinomialTreeOption("pa",24.5,27.01,0.7479452,r = 0.02,0,0.235999,n=100,NULL,NULL)
However, this is producing results that are significantly different from what I expect. For example using those inputs I get a price of 3.071618 compared to an expected price of 4.04. I have also tried using b=0 and get 3.546099, which is still reasonably different from what I expect.
All my inputs have been verified and are valid, however I suspect that I have misunderstood the b parameter. I have interpreted the parameter as the dividend yield, however in the documentation it is described as the annualized cost of carry.
Have I misunderstood this parameter? If so, how should I interpret the parameter for equity options? If I haven't misunderstood that parameter, can anyone suggest another reason as to why I am not getting my expected output?
For anyone looking at this in the future: the b parameter in the CRRBinomialTreeOption function is the cost of carry, i.e. the risk-free rate less the income earned on the asset. It can look as though you are entering the risk-free rate twice, once in the r parameter and once in the b parameter, which is the source of the confusion. If there is no income earned on the asset, b = r; otherwise, for a stock option, r = risk-free rate and b = risk-free rate - dividend yield. E.g.
CRRBinomialTreeOption(TypeFlag = "pa",
                      S = 24.5,
                      X = 27.01,
                      Time = 0.7479452,
                      r = 0.02,
                      b = 0.02 - 0.0635,   # cost of carry = r - dividend yield
                      sigma = 0.235999,
                      n = 100)  # price
This gives me 4.05 versus the expected 4.04. The difference of 1 cent is likely due to slightly different market expectations for the forward dividend curve and/or the risk-free rate reflecting the forward curve. In this particular case, a dividend yield of 0.0625 instead of 0.0635 solves for the price almost exactly.
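To see where r and b enter separately, here is a base-R sketch of a Cox-Ross-Rubinstein tree for an American put with a cost-of-carry term. This is an illustration written for this answer under the textbook formulation, not the fOptions source; the function name crr_american_put is hypothetical.

```r
# CRR binomial tree for an American put (illustrative sketch).
# b drives the risk-neutral growth of the asset, r drives the discounting.
crr_american_put <- function(S, X, Time, r, b, sigma, n = 100) {
  dt <- Time / n
  u  <- exp(sigma * sqrt(dt))
  d  <- 1 / u
  p  <- (exp(b * dt) - d) / (u - d)   # up-probability uses the cost of carry b
  df <- exp(-r * dt)                  # one-step discount uses the risk-free rate r

  # Payoffs at expiry; element i + 1 corresponds to i up-moves
  value <- pmax(X - S * u^(0:n) * d^(n:0), 0)

  # Roll back through the tree, allowing early exercise at every node
  for (j in (n - 1):0) {
    i <- 0:j
    continuation <- df * (p * value[i + 2] + (1 - p) * value[i + 1])
    exercise     <- X - S * u^i * d^(j - i)
    value <- pmax(exercise, continuation)
  }
  value[1]
}

# Same inputs as above: no income (b = r) versus a 6.35% dividend yield
no_income <- crr_american_put(24.5, 27.01, 0.7479452, r = 0.02, b = 0.02,
                              sigma = 0.235999)
with_div  <- crr_american_put(24.5, 27.01, 0.7479452, r = 0.02, b = 0.02 - 0.0635,
                              sigma = 0.235999)
c(no_income = no_income, with_div = with_div)
```

A lower cost of carry lowers the expected growth of the stock in the tree, which is why the put is worth more once the dividend yield is subtracted from b.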
Given the results for a simple A / B test...
          A     B
clicked   8    60
ignored 192  1940
(i.e. a conversion rate of 4% for A and 3% for B)
... Fisher's exact test in R quite rightly says there's no significant difference
> fisher.test(data.frame(A=c(8,192), B=c(60,1940)))
...
p-value = 0.3933
...
But what function is available in R to tell me how much I need to increase my sample size to get to a p-value of say 0.05?
I could just increase the A values (in their proportion) until I get to it but there's got to be a better way? Perhaps pwr.2p2n.test [1] is somehow usable?
[1] http://rss.acs.unt.edu/Rdoc/library/pwr/html/pwr.2p2n.test.html
power.prop.test() should do this for you. In order to get the math to work I converted your 'ignored' data to impressions by summing up your columns.
> power.prop.test(p1=8/200, p2=60/2000, power=0.8, sig.level=0.05)
     Two-sample comparison of proportions power calculation

              n = 5300.739
             p1 = 0.04
             p2 = 0.03
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
That gives 5301 for each group, so your total sample size needs to be about 10,600. Subtracting the 2,200 that have already run, you have about 8,400 "tests" to go.
In this case:
sig.level is the significance threshold (alpha) that your p-value is compared against.
power is the probability of detecting a difference of the assumed size when it really exists. The choice is somewhat arbitrary; 80% is common. Note that choosing 80% means that 20% of the time you won't find significance when you should. Increasing the power means you'll need a larger sample size at the same significance level.
If you want to estimate how much longer it will take to reach significance, divide 8,400 by the number of impressions per day. That can help determine whether it's worthwhile to continue the test.
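The arithmetic can be scripted as follows. The impressions_per_day figure is a made-up placeholder; substitute your own traffic. Keep in mind that power.prop.test assumes two groups of equal size.

```r
res <- power.prop.test(p1 = 8/200, p2 = 60/2000, power = 0.8, sig.level = 0.05)

n_per_group  <- ceiling(res$n)        # round up to whole impressions
total_needed <- 2 * n_per_group       # both groups together
already_run  <- 200 + 2000            # impressions collected so far
remaining    <- total_needed - already_run

impressions_per_day <- 500            # hypothetical traffic figure
days_to_go <- ceiling(remaining / impressions_per_day)

c(n_per_group = n_per_group, remaining = remaining, days_to_go = days_to_go)
```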
You can also use this function to determine required sample size before testing begins. There's a nice write-up describing this on the 37 Signals blog.
This is a native R function (it ships in the stats package), so you won't need to install or load any packages. Other than that, I can't say how similar this is to pwr.2p2n.test().
I want to rank a set of sellers. Each seller is defined by parameters var1,var2,var3,var4...var20. I want to score each of the sellers.
Currently I calculate the score by assigning weights to these parameters (say 10% to var1, 20% to var2 and so on), and these weights are based on my gut feeling.
My score equation looks like
score = w1* var1 +w2* var2+...+w20*var20
score = 0.1*var1+ 0.5 *var2 + .05*var3+........+0.0001*var20
My score equation could also look like
score = w1^2* var1 +w2* var2+...+w20^5*var20
where var1,var2,..var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine, what weights to assign?
I want to optimize these weights to revamp the scoring mechanism using some data oriented approach to achieve a more relevant score.
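For concreteness, the current linear scoring can be sketched as below. The seller data, the three features and the weights are all made up for illustration, and min-max scaling is just one possible way of normalizing.

```r
# Hypothetical data: three sellers, three features
sellers <- data.frame(
  fulfillment_rate = c(0.95, 0.80, 0.99),  # higher is better
  cancel_rate      = c(0.02, 0.10, 0.05),  # lower is better
  confirm_hours    = c(2, 12, 6),          # lower is better
  row.names = c("A", "B", "C")
)

# Min-max normalization to [0, 1]
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
norm <- as.data.frame(lapply(sellers, minmax))
rownames(norm) <- rownames(sellers)

# Flip the lower-is-better features so that 1 always means "best"
norm$cancel_rate   <- 1 - norm$cancel_rate
norm$confirm_hours <- 1 - norm$confirm_hours

# Gut-feeling weights (they sum to 1)
w <- c(fulfillment_rate = 0.5, cancel_rate = 0.3, confirm_hours = 0.2)

# score = w1 * var1 + w2 * var2 + w3 * var3
score   <- drop(as.matrix(norm[, names(w)]) %*% w)
ranking <- names(sort(score, decreasing = TRUE))
ranking
```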
example
I have following features for sellers
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
5] Price competitiveness
Are there better algorithms or approaches for calculating the score? At the moment I add the features linearly, and I want to know whether there is a better approach for building the ranking system.
How do I come up with the values for the weights?
Apart from the features above, a few more that I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How would these fit into my score equation?
Unfortunately stackoverflow doesn't have latex so images will have to do:
Also as a disclaimer, I don't think this is a concise answer but your question is quite broad. This has not been tested but is an approach I would most likely take given a similar problem.
As a possible direction to go, below is the multivariate gaussian. The idea would be that each parameter is in its own dimension and therefore could be weighted by importance. Example:
Sigma = [1,0,0;0,2,0;0,0,3] for a vector [x1,x2,x3] the x1 would have the greatest importance.
The covariance matrix Sigma takes care of scaling in each dimension. To achieve this, simply place the weights on the diagonal of an n x n diagonal matrix. You are not really concerned with the cross terms.
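For reference, the standard form of the multivariate Gaussian density, written with the symbols used in this answer (x, Mu as \mu, Sigma as \Sigma, n dimensions), is:

```latex
f(x) = \frac{1}{\sqrt{(2\pi)^{n}\,\lvert\Sigma\rvert}}
       \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\,\Sigma^{-1}\,(x-\mu)\right)
```

With a diagonal Sigma the quadratic form reduces to a sum of squared deviations, each divided by its own diagonal entry.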
Mu is the average of all logs in your data for your sellers and is a vector.
x is the mean of every category for a particular seller, as a vector x = {x1,x2,x3...,xn}. This is a continuously updated value as more data are collected.
The parameters of the function, which are based on the total dataset, should evolve as well. That way biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup the evaluation of the function f_x can be played with to give the desired results. This is a probability density function, but its utility is not restricted to stats.
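A small base-R sketch of evaluating f_x with a diagonal Sigma follows. As with the rest of this answer it is untested as a ranking method; the function name mvn_density_diag and all the numbers are hypothetical.

```r
# Multivariate Gaussian density with a diagonal covariance matrix.
# sigma_diag holds the diagonal entries of Sigma (the per-dimension weights).
mvn_density_diag <- function(x, mu, sigma_diag) {
  n    <- length(x)
  quad <- sum((x - mu)^2 / sigma_diag)   # (x - mu)' Sigma^-1 (x - mu)
  exp(-0.5 * quad) / sqrt((2 * pi)^n * prod(sigma_diag))
}

mu         <- c(0.8, 0.1, 3.5)   # made-up averages over all sellers
sigma_diag <- c(1, 2, 3)         # diagonal of Sigma from the example above
seller_x   <- c(0.9, 0.05, 4.2)  # made-up feature means for one seller

f_x <- mvn_density_diag(seller_x, mu, sigma_diag)
f_x
```

With no cross terms the result is the product of three one-dimensional normal densities, which gives a quick way to sanity-check it against dnorm.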