I have a list of purchases for every customer and I am trying to determine brand loyalty. Based on this list I have calculated each customer's brand entropy, which I am using as a proxy for brand loyalty. For example, if a customer only purchases brand_a, then their entropy will be 0 and they are very brand loyal. However, if the customer purchases brand_a, brand_b and others, then their entropy will be high and they are not very brand loyal.
# Dummy Data
CUST_ID <- c("c_X", "c_X", "c_X", "c_Y", "c_Y", "c_Z")
BRAND <- c("brand_a", "brand_a", "brand_a", "brand_a", "brand_b", "brand_a")
PURCHASES <- data.frame(CUST_ID, BRAND)

# Casting from PURCHASES to one row per CUST_ID
library(plyr)
library(dplyr)
library(data.table)
ENTROPY <- PURCHASES %>%
  group_by(CUST_ID, BRAND) %>%
  summarise(count = n()) %>%
  dcast(CUST_ID ~ BRAND, value.var = "count")
ENTROPY[is.na(ENTROPY)] <- 0

# Calculating Entropy
library(entropy)
ENTROPY$entropy <- NA
for (i in 1:nrow(ENTROPY)) {
  ENTROPY[i, 4] <- entropy(as.numeric(as.vector(ENTROPY[i, 2:3])), method = "ML")
}

# Calculating Frequency (total purchases per customer)
ENTROPY$frequency <- ENTROPY$brand_a + ENTROPY$brand_b
ENTROPY
However, my problem is that entropy does not account for the quantity of purchases of each customer. Consider the following cases:
1) Customer_X has made 3 purchases, each time it is brand_a. Their entropy is 0.
2) Customer_Z has made 1 purchase, it is brand_a. Their entropy is 0.
Naturally, we are more sure that Customer_X is brand loyal than that Customer_Z is. Therefore, I would like to weight the entropy calculations by the frequency. However, Customer_X: 0/3 = 0 and Customer_Z: 0/1 = 0.
Essentially, I want a clever way for Customer_X to end up with a low value on my brand-loyalty measure and Customer_Z with a higher value. One thought was to use a CART/decision tree/random forest model, but if it can be done with clever math, that would be ideal.
I think the index that you want is entropy normalised by some expectation for the entropy given the number of purchases. Essentially, fit a curve to the graph of entropy vs number of purchases, and then divide each entropy by the expectation given by the curve.
Now this doesn't solve your problem with super-loyal customers who have 0 entropy. But I think the question there is subtly different: is the apparent loyalty due to chance (low count) or is it real? This is a distinct question from how loyal that customer is. Essentially, you want to know the probability of observing such a data point.
If the zero-entropy events are your only pain point, you could compute from your data the probability of only having bought a single brand given the number of purchases.
Alternatively, you could determine the full joint probability distribution of entropy and number of purchases (instead of just the mean), e.g. by density estimation, and then compute the conditional probability of observing a given entropy given the number of purchases.
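For illustration, here is a rough sketch of both ideas in R, assuming a table shaped like ENTROPY above but with many more customers; the loess fit, the small offset, and the names expected_entropy, loyalty_index and single_brand are my own additions rather than anything from the question:

# Fit a smooth curve of expected entropy vs. number of purchases
fit <- loess(entropy ~ frequency, data = ENTROPY)
ENTROPY$expected_entropy <- predict(fit, newdata = ENTROPY)

# Divide each customer's entropy by the expectation for their purchase count
# (the small offset avoids dividing by zero where the expected entropy is ~0)
ENTROPY$loyalty_index <- ENTROPY$entropy / (ENTROPY$expected_entropy + 1e-6)

# For the zero-entropy cases: empirical probability of seeing only one brand
# given the number of purchases
ENTROPY$single_brand <- as.numeric(ENTROPY$entropy == 0)
p_single <- aggregate(single_brand ~ frequency, data = ENTROPY, FUN = mean)
p_single

With only a handful of customers (as in the dummy data) the loess fit will not be meaningful, so treat this purely as a template for the full purchase table.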
A hard drive manufacturer is required to ensure that the mean time between failures for its new hard drive is 1 million hours. A stress test is designed that can simulate the workload at a much faster rate; it is set up so that a test lasting 10 days is equivalent to the hard drive lasting 1 million hours. In stress tests of 15 hard drives, the average is 9.5 days with a standard deviation of 1 day. Does a 90% confidence interval include 10 days?
t <- (9.5 - 10) / (1 / sqrt(15))   # -1.94
I think the next step is finding the critical t value, but I am not sure how to do that with a confidence level of 90%.
The critical value for a two-tailed t-test at alpha = 0.1 (90% confidence level) with (say) 15 degrees of freedom is:
alpha <- 0.1
df <- 15
qt(1-alpha/2, df = df)
## 1.753
Compute 1 - alpha/2 to get the upper bound such that the probability of lying in the upper tail is alpha/2 (the division by two is because we're doing a two-tailed test).
Then apply the quantile function (inverse CDF) of the t distribution with the appropriate number of degrees of freedom.
You could look up the number in the back of a stats book, too, but these days it might be easier to find a computer ... you could also get this from an online calculator (use 0.05 as your alpha level since it gives a one-tailed critical value).
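Putting the pieces together for this particular problem, a minimal sketch assuming a one-sample, two-sided interval with df = n - 1 = 14 (the variable names are mine):

xbar  <- 9.5   # sample mean (days)
mu0   <- 10    # hypothesised mean (days)
s     <- 1     # sample standard deviation (days)
n     <- 15
alpha <- 0.1

t_stat <- (xbar - mu0) / (s / sqrt(n))   # about -1.94
t_crit <- qt(1 - alpha/2, df = n - 1)    # about 1.76

# 90% confidence interval for the mean test duration
ci <- xbar + c(-1, 1) * t_crit * s / sqrt(n)
ci   # check whether this interval contains 10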
I am working on a problem in R where the demand for a product is normally distributed with mean a and standard deviation b, and is estimated to increase by c% each year. The market share for year 1 is anywhere between d% and e%, with every value in between equally likely, and it is expected to grow at f% each year.
I know I need the rnorm function for the demand, rnorm(1000, a, b), and the runif function for the market share, runif(1000, 1.d, 1.e) (adding the 1 in front for easier calculation), to get the year 1 values.
The problem asks what the market share and demand will be over 3 years. I know year 1, but I am not sure how to set up the calculation in R for years 2 and 3 given the growth rates c and f. I currently have something like MarketSizeGrowth <- cumprod(c(runif(1000, 1.d, 1.e), rep(1.f, 2))) for the market size, but this is definitely wrong.
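One way years 2 and 3 could be handled is to apply the growth rates multiplicatively to the simulated year-1 draws. Below is a sketch only, with made-up placeholder numbers standing in for a, b, c, d, e and f, since the question leaves them symbolic:

set.seed(1)
n <- 1000
a <- 50000   # demand mean (placeholder)
b <- 5000    # demand standard deviation (placeholder)
c <- 5       # demand growth in % per year (placeholder)
d <- 10      # lower bound of year-1 market share in % (placeholder)
e <- 20      # upper bound of year-1 market share in % (placeholder)
f <- 2       # market share growth in % per year (placeholder)

demand_y1 <- rnorm(n, a, b)
share_y1  <- runif(n, d / 100, e / 100)

# Each later year scales the year-1 draw by the compounded growth rate
demand <- sapply(0:2, function(y) demand_y1 * (1 + c / 100)^y)
share  <- sapply(0:2, function(y) share_y1  * (1 + f / 100)^y)
colnames(demand) <- colnames(share) <- paste0("year", 1:3)

sales <- demand * share   # simulated sales for each of the three years
head(sales)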
I am constructing a GAMM model (for the first time) to compare longitudinal slopes of cognitive performance in a Bipolar Disorder (BD) sample against a control (HC) sample. The study design is referred to as an "accelerated longitudinal study": participants across a large span of ages (25-60) are followed for 2 years (HC group) or 4 years (BD group).
Hypothesis (1): the BD group's yearly rate of change on processing speed will be higher overall than that of the healthy control group, suggesting a more rapid cognitive decline in BD than in HC.
Here is my R code formula, which I think is a bit off:
RUN2 <- gamm4(BACS_SC_R ~ group + s(VISITMONTH, bs = "cc") +
                s(VISITMONTH, bs = "cc", by = group),
              random = ~ (1 | SUBNUM), data = Df, REML = TRUE)
The VISITMONTH variable is coded as "months from first visit": visit 1 equals 0, and the following visits (3 per year) are coded as months elapsed since visit 1. Is a cyclic smooth correct in this case?
I plan on adding additional variables (e.g. peripheral inflammation) to the model to predict individual slopes of cognitive trajectories in BD.
If you have any other suggestions, it would be greatly appreciated. Thank you!
If VISITMONTH runs over the whole study (i.e. for a BD observation we would have VISITMONTH in {0, 1, 2, ..., 48} for the four years), then no, you don't want a cyclic smooth unless there is some periodicity that would mean months 0 and 48 should be constrained to be the same.
The default thin plate spline bs = 'tp' should suffice.
I'm also assuming that there are many possible values for VISITMONTH as not everyone was followed up at the same monthly intervals? Otherwise you're not going to have many degrees of freedom available for the temporal smooth.
Is group coded as an ordered factor here? If so that's great; the by smooth will encode the difference between the reference level (be sure to set HC as the reference level) and the other level so you can see directly in the summary a test for a difference of the BD group.
It's not clear how you are dealing with the fact that the HC group are followed up over fewer months than the BD group. It looks like the model has VISITMONTH representing the full time of the study, not just a within-year term. So how do you intend to compare the BD group with the HC group for the 2 years where the HC group are not observed?
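To make the suggestions above concrete, here is a sketch of the kind of specification being described, reusing the names from the question; the ordered-factor coding follows how mgcv/gamm4 handle factor "by" smooths, but it is untested against your data:

library(gamm4)

# Make HC the reference level and use an ordered factor so the 'by' smooth
# becomes a difference smooth (BD minus HC); assumes the levels are labelled "HC" and "BD"
Df$group  <- factor(Df$group, levels = c("HC", "BD"))
Df$ogroup <- as.ordered(Df$group)

RUN2 <- gamm4(BACS_SC_R ~ ogroup +
                s(VISITMONTH, bs = "tp") +             # common trend over months
                s(VISITMONTH, bs = "tp", by = ogroup), # BD - HC difference smooth
              random = ~ (1 | SUBNUM), data = Df, REML = TRUE)

summary(RUN2$gam)   # the s(VISITMONTH):ogroupBD row tests the group difference

With ogroup ordered, the second smooth is estimated as the deviation of the BD group from the HC reference curve, which is usually the quantity of interest for Hypothesis (1).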
I am looking for a package in R that can help me calculate the posterior probability of an event. Is there one?
Alright, I am working with a data set like this:
age     education  grade  pass
group1  primary       50  no
group2  tertiary      20  no
group1  secondary     70  yes
group2  secondary     67  yes
group1  secondary     55  yes
group1  secondary     49  no
group1  secondary     76  yes
The prior probability of a student passing the exam is 0.6. Now I need to get the posterior probability of a student passing given his age, education level, and grade.
I know I should first get P(age = group1 | pass = yes) * P(education = primary | pass = yes) * P(grade > 50 | pass = yes).
But this should be done for each case (row), and I have a data set with 1000 rows.
So I thought a function could help me with this!
There are many packages/functions available; you should check out the Bayesian Inference Task View.
Also, if your prior isn't a full distribution and is only a point estimate of a probability, you're probably not actually doing Bayesian inference, just using Bayes' rule in a frequentist framework. Very different. But now we're getting into Cross Validated territory.
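If you just want something off the shelf, one concrete option (my suggestion, not something the question or Task View specifically names) is e1071::naiveBayes, which returns per-row posterior class probabilities; note that it estimates the class prior from the data rather than letting you impose your 0.6 directly, whereas klaR::NaiveBayes accepts an explicit prior argument:

library(e1071)

# mydata: data frame with age, education, grade and pass, as in the question
fit  <- naiveBayes(pass ~ age + education + grade, data = mydata)
post <- predict(fit, mydata, type = "raw")   # one row of posterior probabilities per case
head(post)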
This is the answer for only one predictor (education) and the outcome (pass):
# prior probabilities for pass = no and pass = yes (must match the row order of the table below)
prior <- c(prior_no, prior_yes)
# contingency table of pass by education for mydata
edu_table <- with(mydata, table(pass, education))
# row totals for each level of pass
tots <- apply(edu_table, 1, sum)
# matrices of zeros with the same shape as the table
ppn  <- edu_table * 0
post <- edu_table * 0
# loop to get the conditional probabilities and the posterior probabilities
for (i in 1:length(tots)) {
  for (j in 1:ncol(edu_table)) {
    ppn[i, j]  <- edu_table[i, j] / tots[i]
    post[i, j] <- prior[i] * ppn[i, j] / (prior[1] * ppn[1, j] + prior[2] * ppn[2, j])
  }
}
ppn   # probability of education = j given pass = i
post  # posterior probability of pass = i given education = j
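For completeness, the inputs that the snippet above assumes could be set up like this, using the 0.6 prior from the question and the small example table (mydata would be your full 1000-row data frame):

prior_yes <- 0.6
prior_no  <- 1 - prior_yes

mydata <- data.frame(
  age       = c("group1", "group2", "group1", "group2", "group1", "group1", "group1"),
  education = c("primary", "tertiary", "secondary", "secondary", "secondary", "secondary", "secondary"),
  grade     = c(50, 20, 70, 67, 55, 49, 76),
  pass      = c("no", "no", "yes", "yes", "yes", "no", "yes")
)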
I want to rank a set of sellers. Each seller is defined by parameters var1, var2, var3, ..., var20, and I want to score each of the sellers.
Currently I am calculating the score by assigning weights to these parameters (say 10% to var1, 20% to var2, and so on), and these weights are determined based on my gut feeling.
My score equation looks like
score = w1*var1 + w2*var2 + ... + w20*var20
score = 0.1*var1 + 0.5*var2 + 0.05*var3 + ... + 0.0001*var20
My score equation could also look like
score = w1^2*var1 + w2*var2 + ... + w20^5*var20
where var1, var2, ..., var20 are normalized.
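For concreteness, the linear version might look like the following in R; the weights, the scale() normalisation, and the sellers data frame are purely illustrative:

# sellers: data frame with columns var1 ... var20, one row per seller
w <- c(0.1, 0.5, 0.05, rep(0.35 / 17, 17))      # hand-picked weights, summing to 1
vars  <- scale(sellers[, paste0("var", 1:20)])  # normalise each feature
score <- as.vector(vars %*% w)                  # weighted linear score
rank(-score)                                    # rank sellers, best first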
Which equation should I use?
What are the methods to scientifically determine what weights to assign?
I want to optimize these weights using some data-oriented approach, to revamp the scoring mechanism and achieve a more relevant score.
For example, I have the following features for sellers:
1] Order fulfillment rate [numeric]
2] Order cancellation rate [numeric]
3] User rating [1-5] {1-2: worst, 3: average, 5: good} [categorical]
4] Time taken to confirm the order (the shorter the time taken, the better the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches for calculating the score? So far I have linearly added the various features; I want to know a better approach to building the ranking system.
How do I come up with the values for the weights?
Apart from the features above, a few more that I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How will these fit into my score equation?
As a disclaimer, I don't think this is a concise answer (your question is quite broad), and it has not been tested, but it is the approach I would most likely take given a similar problem.
As a possible direction to go, consider the multivariate Gaussian density
f(x) = (2π)^(-n/2) |Sigma|^(-1/2) exp(-(1/2) (x - Mu)' Sigma^(-1) (x - Mu))
The idea would be that each parameter is in its own dimension and therefore could be weighted by importance. For example, with Sigma = [1,0,0; 0,2,0; 0,0,3] and a vector [x1, x2, x3], x1 would have the greatest importance (its variance is the smallest, so deviations in x1 change the density the most).
The covariance matrix Sigma takes care of the scaling in each dimension. To achieve this, simply place the weights on the diagonal elements of an n×n diagonal matrix; you are not really concerned with the cross terms.
Mu is the average over all records in your data for your sellers, and is a vector.
x is the mean of every category for a particular seller, as a vector x = (x1, x2, x3, ..., xn). This is a continuously updated value as more data are collected.
The parameters of the function, based on the total dataset, should evolve as well. That way, biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup, the evaluation of the density f(x) can be played with to give the desired results. It is a probability density function, but its utility is not restricted to statistics.
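A rough sketch of how this could look in R, using mvtnorm::dmvnorm; the column names, the diagonal weights, and the use of the log-density as the score are illustrative assumptions on my part rather than part of the answer above:

library(mvtnorm)

# sellers: one row per seller, columns holding that seller's per-category means
X  <- as.matrix(sellers[, c("fulfillment", "cancel_rate", "rating", "confirm_time")])
Mu <- colMeans(X)              # average over all sellers, one value per category

weights <- c(1, 2, 3, 4)       # smaller variance = deviations in that category matter more
Sigma   <- diag(weights)       # diagonal covariance: no cross terms

# Log-density of each seller under the multivariate Gaussian; higher values mean
# the seller sits closer to the "typical" seller in this weighted metric
score <- dmvnorm(X, mean = Mu, sigma = Sigma, log = TRUE)
rank(-score)

Note that this scores sellers by how close they are to the weighted average profile; depending on what "good" means for you, you may want to centre Mu on an ideal seller rather than the mean one.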