Proportion in cohorts vs overall proportion - math

I have a maths problem that I need help with. It is a medical study, but I will use orchards as an example.
Let's say a study conducted in an orchard looked at the proportion of bad apples. The study showed the following results:
A) Red apples (bad 10; good 90, therefore probability of apple being bad is 0.10)
B) Green apples (bad 20; good 80, probability of bad apple 0.20)
Thus
C) All apples - green and red (bad 30; good 170, probability of bad apple 0.15)
Now, based on that study I want to estimate how many bad apples I might expect in my orchard.
I have 1,000 apples in my orchard: 600 red and 400 green. Can somebody tell me why the following two estimations don't match, and also suggest which one is correct?
Option 1
1,000 apples x probability of bad apple 0.15 = 150 bad apples
Option 2
600 red apples x probability of bad red apple 0.10 = 60 bad red apples
400 green apples x probability of bad green apple 0.20 = 80 bad green apples
This sums to 60 + 80 = 140 bad apples.
So why is there a difference and which estimation is correct?

The second is correct. It is correct because it applies the correct probability to each case and sums the results; E[X + Y] = E[X] + E[Y]. The first is wrong because it uses the unweighted average of the two probabilities and applies it to a population that does not contain an equal number of apples of each type. If the orchard had 500 apples of each kind, both procedures would have worked.
EDIT: Based on comments, a clarification:
The question asks how many bad apples you'd expect. The number of bad apples is a random variable Z. The number of bad apples is the sum of bad red and bad green apples. These, in turn, are random variables X and Y, and Z = X + Y. We want the expected value of Z, E[Z], and this is E[X + Y]. We know by properties of the expected value that this is E[X] + E[Y]. That is, we can figure the expected number of bad red apples separately from the expected number of bad green apples and then add them together to get the total number of bad apples we expect. This method is basically the same as using conditional probability except that we have skipped dividing by the total number of apples; had we done that, we'd have been using conditional probabilities to find the probability of getting a bad apple, which is correct but not requested.
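For concreteness, here is a minimal R sketch of both calculations using the numbers from the question (the variable names are mine, chosen just for this illustration):
n_red <- 600; n_green <- 400           # composition of the orchard
p_red <- 0.10; p_green <- 0.20         # per-colour bad-apple probabilities from the study
# Option 1: unweighted average of the two probabilities applied to every apple
(n_red + n_green) * mean(c(p_red, p_green))              # 150
# Option 2: expected bad apples per colour, summed (E[X] + E[Y])
n_red * p_red + n_green * p_green                        # 140
# The two agree only when the averaging weights match the orchard's composition:
weighted.mean(c(p_red, p_green), w = c(n_red, n_green))  # 0.14, the correct overall rate for this orchard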

Related

How to perform a Chi^2 Test with two data tables where x and y have different lengths

So I have the following tables (simplified here):
This is Ost_data:
Raumeinheit    Langzeitarbeitslose
Hamburg        22
Koln           45
This is West_data:
Raumeinheit    Langzeitarbeitslose
Hamburg        42
Koln           11
Ost_data has 76 rows and West_data has 324 rows.
I am tasked with proving my hypothesis that the variable "Langzeitarbeitslose" (long-term unemployed) is statistically significantly higher in Ost_data than in West_data. Because that variable is not normally distributed, I am trying to use Pearson's chi-square test.
I tried
chisq.test(Ost_data$Langzeitarbeitslose, West_data$Langzeitarbeitslose)
but that just returns an error saying the test can't be performed because x and y differ in length.
Is there a way to get around that problem and perform the chi-square test regardless, with my two tables of different lengths?
Pearson's chi-square test is for when the rows are measuring the same thing. It sounds like here your rows are just measuring some quantity on repeated samples, so you should use a t-test instead.
t.test(Ost_data$Langzeitarbeitslose, West_data$Langzeitarbeitslose)
The most important aspect of your variable "Langzeitarbeitslose" (long-term unemployed) is not whether it is normally distributed but its level of measurement. I assume it is a dichotomous variable (either yes or no).
A t-test needs interval scale.
A Wilcoxon test needs ordinal scale.
A chi-square test works for nominal (and therefore also for dichotomous) data.
If you have both the number of long-term unemployed and the number of not long-term unemployed per city, you can compare the probability of being unemployed in the east and in the west.
# fill in these four counts from your data:
# l_west - absolute number of long-term unemployed in the west
# l_ost  - absolute number of long-term unemployed in the east
# n_west - absolute number of observed people (unemployed or not) in the west
# n_ost  - absolute number of observed people (unemployed or not) in the east
N <- n_west + n_ost  # absolute number of observations
chisq.test(c(l_west, l_ost), p = c(n_west/N, n_ost/N))
# this tests whether the relative frequency of unemployment in the east (l_ost / n_ost)
# differs from the equivalent relative frequency in the west (l_west / n_west)
# while taking into account the absolute number of observations in east and west
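To illustrate the call with concrete numbers (the counts below are purely hypothetical, not taken from your data):
l_ost  <- 300; n_ost  <- 5000    # hypothetical counts for the east
l_west <- 800; n_west <- 20000   # hypothetical counts for the west
N <- n_west + n_ost
chisq.test(c(l_west, l_ost), p = c(n_west/N, n_ost/N))
# under equal rates the expected counts would be 1100 * c(0.8, 0.2) = c(880, 220)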
I happen to know the words Ost (east), West, and Langzeitarbeitslose (long-term unemployed). I also know that Hamburg and Köln are in the west and not in the east (therefore they should not appear in your "Ost_data"). Somebody who does not know this cannot help you, so please bear this in mind in the future.
Best,
ajj

How to distribute reward points among winners giving more weight to higher ranks

Let's say we have a competition and a reward of 100 points that will be distributed among the winners ranked 1 to 20.
The 100 points will be distributed even if there are only 10 winners or 5 winners.
The distribution should give more weight to the first position than the second, more weight to the second than the third, and significantly less weight starting from the fourth, and so on.
I am not sure how to calculate this.
Thanks in advance.
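One possible scheme (just a sketch; neither the exact weights nor the decay rate are specified in the question) is to give rank r a geometrically decaying weight and rescale the weights of the actual winners so they always sum to 100. In R:
# Hypothetical sketch: rank r gets weight ratio^(r - 1), rescaled so the payouts sum to `total`
distribute_points <- function(n_winners, total = 100, ratio = 0.6) {
  w <- ratio^(seq_len(n_winners) - 1)   # 1, 0.6, 0.36, ... for ranks 1, 2, 3, ...
  total * w / sum(w)
}
round(distribute_points(5), 1)    # all 100 points split among 5 winners
round(distribute_points(20), 1)   # all 100 points split among 20 winners
A smaller ratio makes the drop-off after the top ranks steeper, which is one way to get the "significantly less from the fourth position onward" behaviour.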

Post-hoc test for lmer Error message

I am running a Linear Mixed Effect Model in R and I was able to successfully run my code and get results.
My code is as follows:
library(lme4)
library(multcomp)
read.csv(file="bh_new_all_woas.csv")
whb=read.csv(file="bh_new_all_woas.csv")
attach(whb)
head(whb)
whb.model = lmer(Density ~ distance + (1|Houses) + Cats, data = whb)
summary(whb.model)
However, I would like to do a comparison of my distance fixed factor, which has 4 levels. I tried running lsmeans as follows:
lsmeans(whb.model, pairwise ~ distance, adjust = "tukey")
This error popped up:
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
I also tried glht using this code:
glht(whb.model, linfct=mcp(distance="tukey"))
and got the same error. A sample of my data is as follows:
Houses distance abund density
House 1 20 0 0
House 1 120 6.052357 0.00077061
House 1 220 3.026179 0.000385305
House 1 320 7.565446 0.000963263
House 2 20 0 0
House 2 120 4.539268 0.000577958
House 2 220 6.539268 0.000832606
House 2 320 5.026179 0.000639953
House 3 20 0 0
House 3 120 6.034696 0.000768362
House 3 220 8.565446 0.001090587
House 3 320 5.539268 0.000705282
House 4 20 0 0
House 4 120 6.052357 0.00077061
House 4 220 8.052357 0.001025258
House 4 320 2.521606 0.000321061
House 5 20 4.513089 0.000574624
House 5 120 6.634916 0.000844784
House 5 220 4.026179 0.000512629
House 5 320 5.121827 0.000652131
House 6 20 2.513089 0.000319976
House 6 120 9.308185 0.001185155
House 6 220 7.803613 0.000993587
House 6 320 6.130344 0.00078054
House 7 20 3.026179 0.000385305
House 7 120 9.052357 0.001152582
House 7 220 7.052357 0.000897934
House 7 320 6.547785 0.00083369
House 8 20 5.768917 0.000734521
House 8 120 4.026179 0.000512629
House 8 220 4.282007 0.000545202
House 8 320 7.537835 0.000959747
House 9 20 3.513089 0.0004473
House 9 120 5.026179 0.000639953
House 9 220 8.052357 0.001025258
House 9 320 9.573963 0.001218995
House 10 20 2.255828 0.000287221
House 10 120 5.255828 0.000669193
House 10 220 10.060874 0.001280991
House 10 320 8.539268 0.001087254
Does anyone have any suggestions on how to fix this problem?
So which problem is it that needs fixing? One issue is the model, and another is the follow-up to it.
The model displayed is fitted using the fixed effects ~ distance + Cats. Now, Cats is not in the dataset provided, so that's an issue. But aside from that, distance enters the model as a quantitative predictor (if I am to believe the read.csv statements etc.). This model implies that changes in the expected Density are proportional to changes in distance. Is that a reasonable model? Maybe, maybe not. But is it reasonable to follow that up with multiple comparisons for distance? Definitely not. From this model, the change between distances of 20 to 120 will be exactly the same as the change between distances of 120 and 220. The estimated slope of distance, from the model summary, embodies everything you need to know about the effect of distance. Multiple comparisons should not be done.
Now, one might guess from the question that what you really wanted to do was to fit a model where each of the four distances has its own effect, separate from the other distances. That would require a model with factor(distance) as a predictor; in that case, factor(distance) will account for 3 degrees of freedom rather than 1 d.f. for distance as a quantitative predictor. For such a model, it is appropriate to follow it up with multiple comparisons (unless possibly distance also interacts with some other predictors). If you were to fit such a model, I believe you will find there will be no errors in your lsmeans call (though you need a library("lsmeans") statement, not shown in your code).
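A minimal sketch of that alternative model, assuming the lsmeans package and setting aside the other caveats noted elsewhere in this answer (Cats, the case of Density):
library(lsmeans)
whb$distance <- factor(whb$distance)   # treat the four distances as discrete levels
whb.model2 <- lmer(Density ~ distance + (1 | Houses) + Cats, data = whb)
summary(whb.model2)
# pairwise comparisons among the four distance levels are now meaningful
lsmeans(whb.model2, pairwise ~ distance, adjust = "tukey")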
Ultimately, getting programs to run without error is not necessarily the same as producing sensible or meaningful answers. So my real answer is to consider carefully what is a reasonable model for the data. I might suggest seeking one-on-one help from a statistical consultant to make sure you understand the modeling issues. Once that is settled, then appropriate interpretation of that model is the next step; and again, that may require some advice.
Additional minor notes about the code provided:
The first read.csv call accomplishes nothing because it doesn't store the data.
R is case-sensitive, so technically, Density isn't in your dataset either.
When the data frame is attached, you don't also need the data argument in the lmer call.
The apparent fact that Houses has levels "House 1", "House 2", etc. is obscured in your listing because the comma delimiters in your data file are not shown.

Calculating Odds of winning?

So I'm trying to calculate the odds of someone winning. For example, there are 1,000 tickets and 4 people buy 250 tickets each, which would mean they each have a 25% chance.
So I looked at http://www.calculatorsoup.com/calculators/games/odds.php but it says 20% chance...
I'm not sure if I'm being stupid here or not... I'm unsure how to do this. So far I've done the following:
p = 250 / (1000 + 250) * 100
but it gives me 20...
Do not think too much; the answer is obvious here:
P = 250 / 1000
Hence the probability to win is 25%.
The only situation where you should be using the formula P = A / (A + B) is the following:
You have A red balls and B blue balls and you want to know the probability of picking a red one amongst all of them. Then the probability is P = A / (A + B).
Your A and B values are 750 and 250, respectively. That gives you 25%.
Reason: if the winning ticket is among the 250 tickets you bought, then you will win; this is why B = 250. But if it is among the other 750, you won't win, so your A is 750. 250 / (250 + 750) = 1/4 = 0.25.
The link you provided is an odds calculator for sports betting, but your question is about probability.
You are correct that each of the 4 people has a 25% chance: each person has a 250/1000 or 1/4 chance of winning.
but it says 20% chance...
I assume you're entering 1000 and 250 in that site's input fields? Note how the input is defined: it's asking for A:B odds, which means the total is both numbers combined and the B value is your number of chances to win.
Consider the example by default on that page: 3:2 odds of winning. That is, out of a total of 5, 3 will lose and 2 will win. So you have a 40% chance of winning.
So if you put 1000 and 250 into the inputs, then out of a total of 1250 there will be 250 winners, so any given player has a 20% chance of winning.
p = 250 / (1000 + 250) * 100 but it gives me 20...
That's correct.
If there is a total of 1000 and there will be 250 winners, then it would be 750:250 odds, or 3:1 odds, which is 25%. To verbalize it, you might say something like:
In this contest, any given player has 250 chances to win and 750 chances to lose. So their odds are 3:1.
If the tickets were distributed more oddly, each person would have individual odds which may not reduce to smaller numbers. For example, if someone bought 3 tickets, their odds would be 997:3.
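In R terms, the two conventions look like this (a small sketch of the arithmetic only):
250 / 1000                  # probability of winning: 0.25
# "A:B odds against" converted to a probability of winning: B / (A + B)
odds_to_prob <- function(lose, win) win / (lose + win)
odds_to_prob(750, 250)      # 0.25 -- the 3:1 case above
odds_to_prob(3, 2)          # 0.40 -- the calculator's default 3:2 example
odds_to_prob(1000, 250)     # 0.20 -- what you get by typing the two totals straight into the calculator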

Distribution of money following different beta-distributions

I am trying to find a methodology (or even better, the code) to do the following in Netlogo. Any help is appreciated (I could always try to rewrite the code from R or Matlab to Netlogo):
I have $5,000, which I want to distribute among 10,000 actors following different beta distributions. The maximum amount an actor may receive is $1.
Basically, I am looking for a way to generate random numbers in the [0, 1] interval for the 10,000 actors, following different beta distributions, where the mean of the distributed values remains equal to 0.5. This way the purchasing power of the population (10,000 actors with a mean of 0.5 is $5,000) remains equal for beta(1,1) (a uniform population) as well as, for example, beta(4,1) (a rich population).
An example with 5 actors distributing 2.5 dollars:
beta(1,1): 0.5 - 0.5 - 0.5 - 0.5 - 0.5 (mean 0.5)
beta(4,1): 0.1 - 0.2 - 0.5 - 0.7 - 1.0 (mean 0.5)
I've been thinking: if there is no apparent solution to this, maybe the following could work. We can write the shape of the frequency distribution of beta(4,1) as y = ax^2 + b for some values of a and b (both increasing exponentially).
In my case, the integral of y = ax^2 + b over [0, 1] should be 5000. Playing around with values for a and b should give me the shape of beta(4,1).
The number of actors having 0.1 should then be the integral of y = ax^2 + b over [0, 0.1], where a and b are the parameters of the shape resembling beta(4,1).
Is my reasoning clear enough? Could someone extend it? Is there a link between the beta distribution and a function of a, b, and x?
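On the last point: the Beta(α, β) density on [0, 1] is f(x) = x^(α-1) (1-x)^(β-1) / B(α, β), so beta(4,1) has density 4x^3 rather than ax^2 + b, and its mean is α / (α + β), which equals 0.5 only when α = β. A minimal R sketch of the direct-sampling route (one possible approach; porting it to Netlogo would mean constructing beta draws from gamma draws, since Netlogo has no beta primitive as far as I know):
n <- 10000
x_uniform <- rbeta(n, 1, 1)        # beta(1,1): uniform on [0,1], mean 0.5
x_peaked  <- rbeta(n, 4, 4)        # beta(4,4): still mean 0.5, but concentrated around it
mean(x_uniform); sum(x_uniform)    # ~0.5 and ~5000 total purchasing power
mean(x_peaked);  sum(x_peaked)     # ~0.5 and ~5000 as well
mean(rbeta(n, 4, 1))               # beta(4,1) has mean 0.8, so its total would be near 8000, not 5000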
