Calculate basic statistics in R

I am a noob when using R.
My experiment: I have 300 genotypes, each one planted in 6 different locations. For every genotype in every location I have a measure of the yield.
What I would like to do: I would like to calculate the mean, standard deviation and standard error for every genotype, first using the yield data of the 6 locations. Later, I want to calculate the same statistical parameters for only 5 locations and then 4 locations.
This is an example of my desired output:
I have been searching for days, but I cannot find how to do it.

Let's say this is your data:
library(data.table)
set.seed(1)  # make the toy data reproducible
dt <- data.table(genotype = sample(1:10, size = 20, replace = TRUE),
                 location = sample(1:6, size = 20, replace = TRUE),
                 yield = round(runif(20, 1000, 1500)))
Then the first thing to do is to take the mean of yield, by genotype:
m1 <- dt[, .(mean_6_locations = mean(yield)), by = genotype]
After that, assuming that you know which locations to exclude, here is the mean of 5 and 4 locations respectively:
m2 <- dt[!location %in% c(6), .(mean_5_locations = mean(yield)), by = genotype]
m3 <- dt[!location %in% c(5, 6), .(mean_4_locations = mean(yield)), by = genotype]
Note that location 6 is excluded for the mean of 5 locations; similarly, locations 5 and 6 are excluded for the mean of 4 locations.
Lastly, you need to merge everything into a single table:
m12 <- merge(m1, m2)
m123 <- merge(m12, m3)
print(m123)
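The question also asks for the standard deviation and standard error. A minimal sketch extending the same pattern, taking the standard error as sd/sqrt(n) with n the number of locations observed per genotype:
stats6 <- dt[, .(mean_6_locations = mean(yield),
                 sd_6_locations = sd(yield),
                 se_6_locations = sd(yield) / sqrt(.N)),  # .N = rows (locations) per genotype
             by = genotype]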

This is an interesting problem, and I would approach it with Monte Carlo-like methods. I would definitely encourage nonparametric methods, because six observations per genotype are too few to support distributional assumptions.
Assume genotype doesn't matter and aggregate over the six locations (or 5, or 4) to build a distribution of means. The quantile at which one specific genotype's mean falls within that distribution tells you a lot more about the genotype than the mean itself. The standard error of the means also falls out of that distribution.
The standard deviation of this distribution similarly tells you the spread among means and allows for significance testing.
I know this answer is a little tangential, but building a distribution from six locations alone and taking its standard deviation doesn't tell you much.
Similarly, if you take the standard deviation of each genotype's rows and build a distribution of standard deviations, you can see how tight a given genotype's standard deviation is relative to the population, again just by using a quantile.
I assume the optimal genotype would sit at a high quantile in the mean distribution and a low quantile in the standard-deviation distribution, for a given location or across all locations, depending of course on the specific question being addressed.
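A rough sketch of this idea, reusing the dt from the first answer (the number of resamples is arbitrary):
set.seed(1)
# reference distribution: means of 6 yields resampled from the pooled data
boot_means <- replicate(10000, mean(sample(dt$yield, 6, replace = TRUE)))
# empirical quantile of genotype 1's observed mean within that distribution
g1_mean <- dt[genotype == 1, mean(yield)]
mean(boot_means <= g1_mean)
# the standard error of a 6-location mean falls out as well
sd(boot_means)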


R: how to calculate a confidence interval based on a proportion

I'm new to R and trying to learn stats.
Here is one practice question that I'm trying to figure out:
How should I use R code to create a function based on this math equation?
I have a dataframe like this:
The "exposed" column of the df contains two groups: one is called "Test Group (Exposed)" and the other is called "Control Group". The math equation refers to these two groups.
In another practice exercise I have this code to calculate a confidence interval:
# sample size
# (a normal approximation is OK for non-normal data if n > 30)
n <- 150
# calculate the mean & standard deviation
# (will_sample is the data vector from the exercise)
will_mean <- mean(will_sample)
will_s <- sd(will_sample)
# normal quantile function, assuming the mean has a normal distribution:
qnorm(p = 0.975, mean = 0, sd = 1)  # 97.5th percentile of a N(0,1) distribution
# a.k.a. Z = 1.96 from the standard normal distribution
# margin of error = critical value x (s / sqrt(n));
# "q" functions in R give the value of the statistic at a given quantile
critical_value <- qt(p = 0.975, df = n - 1)
error <- critical_value * will_s / sqrt(n)
# confidence interval = mean +/- margin of error
will_mean - error
will_mean + error
but I'm not sure how to do this for the two groups in the "exposed" column.
Don't worry, it's quite easy if you have experience in at least one programming language; R is quite approachable.
The only notable difference between R and most other programming languages is that R was developed for statistical purposes.
You can compute the quantile for a given significance level α (remember to divide it by 2 for the two-sided formula) using the function qnorm(). By default it uses the standardized normal distribution, as in your case, but you can find more details in the documentation, reachable via the command ?qnorm.
In the exercise itself you are not required to compute it, since it is passed in as an argument, but in practice you would need to.
The code should be something like:
conf <- function(p1, p2, n1, n2, z) {
  # half-width of the interval: z times the standard error
  # of the difference between two proportions
  part <- z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(p1 - p2 - part,
    p1 - p2 + part)
}
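For example (the proportions and sample sizes here are made up), you would compute z with qnorm() and call it like this:
z <- qnorm(0.975)  # two-sided 95% interval, so alpha/2 = 0.025
conf(0.30, 0.25, 150, 150, z)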

How to plot the difference between two density distributions

I've trained a model to predict a certain variable. When I now use this model to predict said value and compare these predictions to the actual values, I get the following two distributions.
The corresponding R Data Frame looks as follows:
x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted
These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.
Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.
I've tried just plotting the difference predicted - actual, and also the difference relative to the mean of the respective group. However, neither approach has produced my desired result. For the plotted distribution it is especially important to be able to make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?
Let's consider the two distributions as df_actual and df_predicted, then calculate
# dataframe with the difference between the two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)
Then find the relative % difference by:
x_diff <- mean((df_diff$x - df_actual$x) / df_actual$x) * 100
y_diff <- mean((df_diff$y - df_actual$y) / df_actual$y) * 100
This gives you the mean % difference, positive or negative, in x as well as y. This is my opinion; also follow this thread for displaying and measuring the area between two distribution curves.
I hope this helps.
ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, otherwise the actual - predicted differences will be overshadowed by the variance of the actual (and predicted) values alone. I.e., if you have something like:
x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...
you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
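A minimal sketch of that pairing, assuming a data frame df with columns x, y and type as above (the suffixes argument just gives the merged columns clearer names than the default y.x/y.y):
paired <- merge(df[df$type == "actual", ], df[df$type == "predicted", ],
                by = "x", suffixes = c(".actual", ".predicted"))
# relative error of each prediction, in percent
rel_err <- (paired$y.predicted - paired$y.actual) / paired$y.actual * 100
plot(density(rel_err), main = "Relative prediction error (%)")
# 50% of the predictions fall between these two bounds
quantile(rel_err, c(0.25, 0.75))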
To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test.

Compare variances between two populations with different means

I would like to compare two populations which have different means. I want to find a way to compare their variances, to get an idea of which of the two populations has values that disperse further from the mean.
The issue is that I think I need a variance standardized/normalized by the mean value of each distribution.
Suggestions?
The next step would be to find an R function that can do that.
You don't need to standardise/normalise, because variance is calculated as distance from the mean and so is already normalised around the sample mean.
To demonstrate this, run the following code:
x <- runif(10000, min = 100, max = 101)
y <- runif(10000, min = 1, max = 2)
mean(x)
mean(y)
var(x)  # both ~ 1/12, the variance of a unit-width uniform
var(y)
You'll see that while the means are different, the variances of the two samples are essentially identical (allowing for some difference due to pseudo-random number generation and finite sample size).

Plotting multiple mean returns vs. corresponding standard deviations

I'm currently working with 7 different datasets, so I have seven independent variables that represent the mean return of each dataset.
The same goes for the standard deviations.
I know how to plot one mean return against its corresponding standard deviation:
plot(meanR1, stdR1)
How do I plot all the mean returns versus their corresponding standard deviations,
from meanR1 to meanR7 (mean returns)
and stdR1 to stdR7 (standard deviations)?
A minimal solution (here for three means and stds) is
plot(c(meanR1, meanR2, meanR3), c(stdR1, stdR2, stdR3))
You may want to place the means and standard deviations in a data frame first, e.g.
df <- data.frame(means = c(meanR1, meanR2, meanR3), stds = c(stdR1, stdR2, stdR3))
and then plot them using
plot(df$means, df$stds)
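If the values only exist as fourteen separate variables, as in the question, you can also collect them programmatically; a small sketch assuming meanR1..meanR7 and stdR1..stdR7 exist in the workspace:
means <- unlist(mget(paste0("meanR", 1:7)))
stds <- unlist(mget(paste0("stdR", 1:7)))
plot(means, stds, xlab = "mean return", ylab = "standard deviation")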

Likert Rank ordering optimization heuristic possible?

I can't find the type of problem I have, and I was wondering if someone knew what type of statistics it involves. I'm not sure it's even a type that can be optimized.
I'd like to optimize over three variables, or more precisely over the combination of the first two. The first is a Likert-scale average, the second is the frequency of that item being rated on that Likert scale, and the third is the item ID. The Likert scale is [1,2,3,4].
So:
3.25, 200, item1 would mean that item1 was rated 200 times and got an average rating of 3.25.
I have a bunch of items and I'd like to find the high-value items. For instance, an item that is 4,1 would suck because, while it is rated highest, it was rated only once. And a 1,1000 would also suck, for the inverse reason.
Is there a way to optimize this with a simple heuristic? Someone told me to look into confidence bands, but I am not sure how that would work. Thanks!
Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the variance (standard deviation) of your data and putting a threshold on your standard error, then translating that error into the minimum number of samples required to achieve that bound with 95% confidence. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
For example, if your data have standard deviation 0.5 and you want to be 95% sure each score is within 0.1 of the current estimate, then you need roughly (1.96 x 0.5 / 0.1)^2 ≈ 96 ratings.
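In R, that threshold could be computed like this (0.5 and 0.1 are the illustrative values from above):
sd_est <- 0.5  # estimated standard deviation of the ratings
margin <- 0.1  # desired 95% half-width around each item's mean
n_min <- ceiling((qnorm(0.975) * sd_est / margin)^2)
n_min  # 97; items with fewer ratings than this are ignored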
