Likert Rank ordering optimization heuristic possible? - math

I can't find the type of problem I have and I was wondering if someone knew the type of statistics it involves. I'm not sure it's even a type that can be optimized.
I have three variables per item and I'd like to optimize over the combination of two of them. The first is a Likert scale average, the second is the number of times that item was rated on that Likert scale, and the third is the item ID. The Likert scale is [1,2,3,4].
So:
3.25, 200, item1. Would mean that item1 was rated 200 times and got an average of 3.25 in rating.
I have a bunch of items and I'd like to find the high value items. For instance, an item that is 4,1 would suck because while it is rated highest, it is rated only once. And a 1,1000 would also suck for the inverse reason.
Is there a way to optimize with a simple heuristic? Someone told me to look into confidence bands but I am not sure how that would work. Thanks!

Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the variance (standard deviation) of your data and putting a threshold on your standard error, then translating that error into the minimum number of samples required to produce that bound with 95% confidence. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
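For example, if your data has standard deviation 0.5 and you want to be 95% sure your score is within 0.1 of the current estimate, then you need (1.96 * 0.5/0.1)^2 = 96.04, so 97 ratings. Without the 1.96 multiplier, (0.5/0.1)^2 = 25 ratings only brings the standard error itself down to 0.1, which corresponds to roughly 68% confidence.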
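A minimal R sketch of the threshold idea, plus one possible way to turn it into a ranking. The function name min_ratings, the three example items, and the shared standard deviation of 0.5 are all assumptions made for illustration. Ranking by the lower end of a normal-approximation confidence interval is only one heuristic in the spirit of the "confidence bands" suggestion, not the only option.

```r
# Minimum number of ratings so that the mean is within `margin`
# of the estimate at the given confidence level (normal approximation).
min_ratings <- function(sd, margin, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # about 1.96 for 95%
  ceiling((z * sd / margin)^2)
}

n_needed <- min_ratings(sd = 0.5, margin = 0.1)
print(n_needed)  # 97

# Hypothetical items: Likert average and number of ratings
items <- data.frame(
  item = c("item1", "item2", "item3"),
  mean = c(3.25, 4, 1),
  n    = c(200, 1, 1000)
)

assumed_sd <- 0.5                  # assumption: same spread for every item
z <- qnorm(0.975)

# Lower end of the 95% interval: penalises items with few ratings
items$lower <- items$mean - z * assumed_sd / sqrt(items$n)

ranked <- items[order(-items$lower), ]
print(ranked)
```

With these made-up numbers the item rated 200 times at 3.25 ranks above the item rated once at 4, which matches the intuition in the question. If per-item standard deviations are available, they can replace assumed_sd.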

Related

Uniquenesses component of exploratory factor analysis

I am applying an exploratory factor analysis to a dataset using the factanal() function in R. After applying a scree test I found that 2 factors need to be retained from 20 features.
Trying to find what this uniqueness represents, I found the following from here
"A high uniqueness for a variable usually means it doesn’t fit neatly into our factors. ..... If we subtract the uniquenesses from 1, we get a quantity called the communality. The communality is the proportion of variance of the ith variable contributed by the m common factors. ......In general, we’d like to see low uniquenesses or high communalities, depending on what your statistical program returns."
I understand that if the value of uniquenesses is high, it could be better represented in a single factor. But what is a good threshold for this uniquenesses measure? All of my features show a value greater than 0.3, and most of them range from 0.3 to 0.7. Does this mean that my factor analysis doesn't work well on my data? I have tried rotation, but the results are not very different. What else should I try then?
You can partition an indicator variable's variance into its...
Uniqueness (u2 = 1-h2): Variance that is not explained by the common factors
Communality (h2): The variance that is explained by the common factors
Which values can be considered "good" depends on your context. You should look for examples in your application domain to know what you can expect. In the package psych, you can find some examples from psychology:
library(psych)
m0 <- fa(Thurstone.33,2,rotate="none",fm="mle")
m0
m0$loadings
When you run the code, you can see that the communalities are around 0.6. The absolute factor loadings of the unrotated solution vary between 0.27 and 0.85.
An absolute value of 0.4 is often used as an arbitrary cutoff for acceptable factor loadings in psychological domains.
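As a small illustration with factanal() itself, the sketch below fits two factors to ability.cov, a covariance list that ships with base R and is used in the factanal() help page. It is not the asker's data, and the variable names uniqueness and communality are chosen here for readability.

```r
# ability.cov is a built-in example dataset (6 ability tests),
# used here only as a stand-in for the asker's 20 features.
fit <- factanal(factors = 2, covmat = ability.cov)

uniqueness  <- fit$uniquenesses   # variance not explained by the common factors
communality <- 1 - uniqueness     # variance explained by the common factors

# Side-by-side view, one row per variable
print(round(cbind(uniqueness, communality), 2))
```

Reading the two columns together makes it easier to see which variables the two factors describe poorly.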

Interpreting results from emmeans comparison

I have a glm model with two fixed effects, Treatment and Date, to estimate Temperature from data collected in a time series. Within Treatment there are three different categories: Fucus, Terrycloth or Control, and temperature is measured beneath those canopies. The model is created like so mod1 <- glm(Temp ~ Treatment * Date, data = aveTerry.df )
I am trying to tell if Terrycloth has a similar effect as Fucus canopy (i.e. replicates it).
I found the emmeans package and believe it could help me compare between these levels within treatment by using my model, and have used it as so to find the estimated marginal means terry.emmeans <- emmeans(modAllTerry, poly ~ Treatment | Date) and plotted the comparisons via plot(terry.emmeans.average, comparison = TRUE) +theme_bw()
Giving me this output linked here.
I am looking for some help understanding what this graphical output is, especially what exactly the comparisons are (which are shown by the red arrows). I somewhat understand that the blue boxes are the confidence intervals for the mean value of temperature for each treatment on one day (based on the model), but am wondering how the comparison is made. And why do some days only have a one-sided arrow?
As described in the documentation for plot.emmGrid, the comparison arrows are created in such a way that two arrows are disjoint if and only if their respective means are significantly different at the stated level.
The lowest mean in the set has only a right-pointing arrow because that mean will not be compared with anything smaller, obviating the need for a left-pointing arrow. For similar reasons, the highest mean has only a left-pointing arrow. These arrows do not define intervals; their only purpose is depicting comparisons.
In situations where the SEs of pairwise comparisons vary widely, it may not be possible to construct comparison arrows. If that happens, an error message is displayed.
Confidence intervals are available as well, but those CIs should not be used for comparing means.
More information and examples may be found via vignette("comparisons", "emmeans"). Also, details of how the arrows are actually constructed are given in vignette("xplanations", "emmeans")

Calculate basic statistics in R

I am a noob when using R.
My experiment: I have 300 genotypes, each one planted in 6 different locations. For every genotype in every location I have a measure of the yield.
What I would like to do: I would like to calculate the mean, standard deviation and standard error for every genotype, first using the yield data of the 6 locations. Later, I want to calculate the same statistical parameters for only 5 locations and then 4 locations.
This is an example of my desired output:
I have been searching for days, but I cannot find how to do it.
Let's say this is your data:
library(data.table)
# simulated example: 20 yield records over 10 genotypes and 6 locations
dt = data.table(genotype=sample(1:10,size=20,replace=TRUE),
location=sample(1:6,size=20,replace=TRUE),
yield=round(runif(20,1000,1500)))
Then, first thing to do is to take the mean of yield, by genotype:
m1 = dt[,.(mean_6_locations=mean(yield)),by=genotype]
After that, assuming that you know which locations to exclude, here is the mean of 5 and 4 locations respectively:
m2 = dt[!location %in% c(6),.(mean_5_locations=mean(yield)),by=genotype]
m3 = dt[!location %in% c(5,6),.(mean_4_locations=mean(yield)),by=genotype]
Note that location 6 is excluded for the mean of 5 locations; similarly, locations 5 and 6 are excluded for the mean of 4 locations. (The example data only has locations 1 to 6, so excluding a location such as 10 would drop nothing.)
Lastly, you need to merge everything into a single table:
m12 = merge(m1,m2)
m123 = merge(m12,m3)
print(m123)
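The question also asks for the standard deviation and standard error, which the steps above do not compute. Here is a minimal sketch of one way to get all three. It uses base R's aggregate() rather than data.table so that it runs without extra packages. The balanced toy data (4 genotypes, 6 locations) and the helper name summarise_yield are assumptions for illustration.

```r
# Hypothetical balanced data: every genotype measured once per location
set.seed(1)
yield_df <- expand.grid(genotype = 1:4, location = 1:6)
yield_df$yield <- round(runif(nrow(yield_df), 1000, 1500))

# Mean, SD and SE of yield per genotype, restricted to the given locations
summarise_yield <- function(df, locations) {
  sub <- df[df$location %in% locations, ]
  out <- aggregate(yield ~ genotype, data = sub, FUN = function(y) {
    c(mean = mean(y), sd = sd(y), se = sd(y) / sqrt(length(y)), n = length(y))
  })
  # aggregate() stores the four statistics as a matrix column; flatten it
  cbind(genotype = out$genotype, as.data.frame(out$yield))
}

stats_6 <- summarise_yield(yield_df, 1:6)  # all six locations
stats_5 <- summarise_yield(yield_df, 1:5)  # location 6 dropped
stats_4 <- summarise_yield(yield_df, 1:4)  # locations 5 and 6 dropped
print(stats_6)
```

Which locations to drop for the 5- and 4-location summaries is a choice the asker has to make. Here the last ones are dropped purely as an example.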
This is an interesting problem, and I would approach it with Monte Carlo "like" methods. I would definitely encourage nonparametric methods, because the dimensionality of the data doesn't support distributional assumptions.
Assume genotype doesn't matter, and aggregate over the six locations [or 5 or 4] to make a distribution of means. The quantile at which one specific genotype's mean (over the corresponding number of locations) falls in that distribution tells you a lot more about the genotype than the mean itself. The standard error of the means also falls out of that distribution.
The standard deviation of this distribution similarly lets you know the standard deviation among means and allows for significance testing.
I know this answer is a little tangential, but building a distribution from six locations for a single genotype and taking its standard deviation doesn't tell you much.
Similarly, if you take the standard deviation of all rows and build a distribution of standard deviations, you can see how tight a given genotype's standard deviation is relative to the population, again just by using a quantile.
I assume the optimal genotype would sit at a high quantile in the mean distribution and a low quantile in the standard deviation distribution, for a given location or among all locations, depending of course on the specific question being addressed.

Test if regression weights are different from 1

I am doing an lm() regression in R on stock quotations. I used exponential weights for the regression: the older the data, the less weight. My weights formula is alpha^(seq(685,1,by=-1)) (the data length is 685). To find alpha I tried every value between 0.9 and 1.1 with a step of 0.0001 and chose the alpha which minimizes the difference between the predicted values and the real values. This alpha is equal to 0.9992, so I would like to know if it is statistically different from 1.
In other words, I would like to know if the weights are different from 1. Is it possible to achieve that and, if so, how could I do this?
I don't really know whether this question should be asked on stats.stackexchange, but it involves R so I hope it is not misplaced.

mathematical model to build a ranking/ scoring system

I want to rank a set of sellers. Each seller is defined by parameters var1,var2,var3,var4...var20. I want to score each of the sellers.
Currently I am calculating the score by assigning weights to these parameters (say 10% to var1, 20% to var2 and so on), and these weights are determined based on my gut feeling.
my score equation looks like
score = w1* var1 +w2* var2+...+w20*var20
score = 0.1*var1+ 0.5 *var2 + .05*var3+........+0.0001*var20
My score equation could also look like
score = w1^2* var1 +w2* var2+...+w20^5*var20
where var1,var2,..var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine, what weights to assign?
I want to optimize these weights to revamp the scoring mechanism using some data oriented approach to achieve a more relevant score.
example
I have following features for sellers
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches to solve this problem of calculating the score? That is, I linearly added the various features, and I want to know a better approach to build the ranking system.
How do I come up with the values for the weights?
Apart from using above features, few more that I can think of are ratio of positive to negative reviews, rate of damaged goods etc. How will these fit into my Score equation?
Unfortunately stackoverflow doesn't have latex so images will have to do:
Also as a disclaimer, I don't think this is a concise answer but your question is quite broad. This has not been tested but is an approach I would most likely take given a similar problem.
As a possible direction to go, below is the multivariate gaussian. The idea would be that each parameter is in its own dimension and therefore could be weighted by importance. Example:
Sigma = [1,0,0;0,2,0;0,0,3] for a vector [x1,x2,x3]: here x1 would have the greatest importance, because it has the smallest variance and so a deviation in x1 changes the function value the most.
The covariance matrix Sigma takes care of scaling in each dimension. To achieve this, simply build an n x n diagonal matrix and put the weights on its diagonal elements. You are not really concerned with the cross terms.
Mu is the average of all logs in your data for your sellers and is a vector.
x is the mean of every category for a particular seller and is a vector x = {x1,x2,x3...,xn}. This is a continuously updated value as more data are collected.
The parameters of the function based on the total dataset should evolve as well. That way biased voting, especially in the "feelings" based categories, can be weeded out.
After that setup the evaluation of the function f_x can be played with to give the desired results. This is a probability density function, but its utility is not restricted to stats.
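A minimal sketch of how f_x could be evaluated when Sigma is diagonal, as described above. The function name gaussian_score, the zero mean vector and the example points are assumptions for illustration. The Sigma values are the ones from the example, and the density is the standard multivariate Gaussian with no cross terms.

```r
# Multivariate Gaussian density with a diagonal covariance matrix.
# sigma_diag holds the diagonal of Sigma (one variance per feature).
gaussian_score <- function(x, mu, sigma_diag) {
  k <- length(x)
  d <- x - mu
  log_density <- -0.5 * (k * log(2 * pi) +
                         sum(log(sigma_diag)) +
                         sum(d^2 / sigma_diag))
  exp(log_density)
}

mu         <- c(0, 0, 0)   # hypothetical average over all sellers
sigma_diag <- c(1, 2, 3)   # diagonal of Sigma from the example above

at_mean <- gaussian_score(mu, mu, sigma_diag)            # seller exactly at the average
off_x1  <- gaussian_score(c(1, 0, 0), mu, sigma_diag)    # one unit away in x1
off_x3  <- gaussian_score(c(0, 0, 1), mu, sigma_diag)    # one unit away in x3

print(c(at_mean = at_mean, off_x1 = off_x1, off_x3 = off_x3))
```

A one-unit deviation in x1 lowers the value more than the same deviation in x3, which is the sense in which x1 carries the most weight. Note that the density peaks at Mu, so on its own it measures closeness to the average seller rather than "better" or "worse"; turning it into a ranking still needs a choice of direction for each feature.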
