I have two datasets:
sims = c(2,5,3,5,5,3)
obs = c(1,4,NA,NA,7,4)
Using the hydroGOF R package I can calculate the percentage bias as
pbias(sims, obs, na.rm=TRUE)
However, is there a way to output the sum of the sims values actually used in the pbias calculation (i.e. 2+5+5+3, since the hydroGOF manual states that "When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value of obs AND sim are removed before the computation"), rather than the sum of all of sims (i.e. what sum(sims) would return)?
You can do this with any two vectors like this:
sum(sims[!is.na(obs)])
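For the example vectors above, a quick check (just reusing sims and obs) shows the difference:
sum(sims[!is.na(obs)])                  # 2 + 5 + 5 + 3 = 15, the values pbias() actually uses
sum(sims)                               # 23, the sum over all of sims
sum(sims[!is.na(obs) & !is.na(sims)])   # more general form, in case sims could also contain NAs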
I'm reproducing a question that I couldn't find an answer to.
"I got some surprising results when using the svytotal routine from the survey package with data containing missing values.
Some example code demonstrating the behaviour is included below.
I have a stratified sampling design where I want to estimate the total
income. In some strata some of the incomes are missing. I want to
ignore these missing incomes. I would have expected that
svytotal(~income, design=mydesign, na.rm=TRUE) would do the trick.
However, when calculating the estimates 'by hand' the estimates were
different from those obtained from svytotal. The estimated mean
incomes do agree with each other. It seems that using the na.rm option
with svytotal is the same as replacing the missing values with zeros,
which is not what I would have expected, especially since this
behaviour seems to differ from that of svymean. Is there a reason for
this behaviour?
I can of course remove the missing values myself before creating the
survey object. However, with many different variables with different
missing values, this is not very practical. Is there an easy way to
get the behaviour I want?"
library(survey)
library(plyr)
# generate some data
data <- data.frame(
  id      = 1:20,
  stratum = rep(c("a", "b"), each = 10),
  income  = rnorm(20, 100),
  n       = rep(c(100, 200), each = 10)
)
data$income[5] <- NA
# calculate mean and total income for every stratum using survey package
des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n)
svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE)
mn <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE)
mn
n <- svyby(~n, by=~stratum, FUN=svymean, design=des)
# total does not equal mean times number of persons in stratum
mn[2] * n[2]
# calculate mean and total income 'by hand'. This does not give the same total
# as svytotal, but it does give the same mean
ddply(data, .(stratum), function(d) {
  data.frame(
    mean  = mean(d$income, na.rm = TRUE),
    n     = mean(d$n),
    total = mean(d$income, na.rm = TRUE) * mean(d$n)
  )
})
# when we set income to 0 for missing cases and repeat the previous estimation
# we get the same answer as svytotal (but not svymean)
data2 <- data
data2$income[is.na(data2$income)] <- 0
ddply(data2, .(stratum), function(d) {
  data.frame(
    mean  = mean(d$income, na.rm = TRUE),
    n     = mean(d$n),
    total = mean(d$income, na.rm = TRUE) * mean(d$n)
  )
})
Yes, there is a reason for this behaviour!
The easiest way to think about the answer the survey package is trying to give here is that it sets the weights for the missing observations to zero. That is, the package gives population estimates for the subdomain of non-missing values. This is important for getting the right standard errors. [Note: it doesn't actually do this by just setting the weights to zero, there are some optimisations, but that's the answer it gives.]
If you set the weights to zero in svytotal, you get the sum of the non-missing values, which is the same as you get if you set the missing values to 0 or if they weren't ever sampled. When you come to compute standard errors it matters exactly which one you did, but not for point estimates.
If you set the weights to zero in svymean you get the mean of the non-missing values, which is not the same as you get if you set the missing values to zero (though it is the same as if they just weren't ever sampled).
I don't know exactly what you mean when you say you want to 'ignore' the missing incomes, but if you want to multiply mn[2] and n[2] meaningfully, they need to be computed on the same subdomain: you have one of them computed only where income is not missing and the other computed on all observations.
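For example, if what you want is the total, mean, and count all over the people whose income was observed, a minimal sketch (reusing the des object from above; treat it as an illustration of the subdomain idea rather than the only way to do it) is to restrict the design once and compute everything on that restricted design:
des_obs <- subset(des, !is.na(income))   # keep only cases with observed income
svyby(~income, by = ~stratum, FUN = svytotal, design = des_obs)
svyby(~income, by = ~stratum, FUN = svymean,  design = des_obs)
svytable(~stratum, des_obs)   # estimated subdomain size; the subdomain mean times this count reproduces the subdomain total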
How do I calculate the pooled standard deviation in R?
Below is the code for my dataset (as my dataset contains many entries, I cannot copy-paste it here):
install.packages("Sleuth3")
library(Sleuth3)
View(ex0126)
To find the mean and standard deviation for each group individually (i.e., the individual groups are parties R and D), I used the R code below.
library(Sleuth3)
ex0126
View(ex0126)
#Average of each group individually for party (R,D)
meanOfR <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), mean, na.rm=TRUE), Group.1=='R')
meanOfR
meanOfD <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), mean, na.rm=TRUE), Group.1=='D')
meanOfD
#Sample standard deviation for party (R,D)
sdOfR <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), sd, na.rm=TRUE), Group.1=='R')
sdOfR
sdOfD <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), sd, na.rm=TRUE), Group.1=='D')
sdOfD
But how do I find the pooled standard deviation from the above sample standard deviations for parties R and D?
It depends which pooled estimate you want. Using the most general estimate, which allows unequal group sizes:
data(ex0126, package = "Sleuth3")
library(dplyr)
#' Calculate pooled variance given a data.frame with columns (var, n) for each group.
#' All other columns are ignored
pooled_var <- function(df){
  # use the sd column (squared) if present, otherwise the var column
  var <- if ('sd' %in% names(df)) df$sd^2 else df$var
  n_groups <- nrow(df)
  d <- dim(var)
  if (length(d) == 2) {
    # one row of variances per group, one column per variable
    if (d[1] != n_groups) stop('inconsistent size of variance and n')
    colSums(sweep(var, 1, df$n - 1, '*')) / (sum(df$n) - n_groups)
  } else {
    # a single vector of variances, one per group
    if (length(var) != n_groups) stop('inconsistent size of variance and n')
    sum(var * (df$n - 1)) / (sum(df$n) - n_groups)
  }
}
ex0126 %>%
  select(4:10, Party) %>%
  group_by(Party) %>%
  na.omit() %>%
  summarise(var = across(1:6, var), n = n()) %>%
  pooled_var() %>%
  sqrt()
Note that
select chooses the columns I want to use.
na.omit is used to avoid including missing values in the variance calculations.
group_by tells my pipe that everything needs to be done within each group of Party.
summarise/summarize is used to aggregate a function across rows.
across is used to perform the same action over multiple columns.
The output of across is itself a tibble (a data.frame-like structure), so df$var becomes a tibble in pooled_var.
By default summarize calls ungroup at the end, so the calls that follow no longer operate within each "group".
In pooled_var I assume columns var and n exist, and simply use the standard formula to calculate the pooled variance.
Within pooled_var I handle both single vectors and multiple columns, based on whether df$var has multiple dimensions or not.
And sqrt is called at the end to go from the pooled variance to a pooled standard deviation.
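As a quick sanity check of the same formula, here is a minimal base-R sketch on made-up data (g and x are hypothetical stand-ins for Party and one of the numeric columns):
set.seed(1)
g <- rep(c("R", "D"), times = c(8, 12))   # hypothetical group labels
x <- rnorm(20)                            # hypothetical numeric column
vars <- tapply(x, g, var)                 # per-group sample variances
ns   <- tapply(x, g, length)              # per-group sizes
sqrt(sum((ns - 1) * vars) / (sum(ns) - length(ns)))   # pooled standard deviation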
Use the sample.decomp function in the utilities package
Since you have access to the underlying dataset, it is possible to compute the pooled standard deviation directly on the underlying pooled data. However, you can also compute the pooled standard deviation from the pooled moments and group sizes. This is implemented in the sample.decomp function in the utilities package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and pooled moments. It works for decompositions up to fourth order, i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.
How to use the function: I am going to assume that, in addition to computing the moments, you can also compute the sizes of the two groups, which I will designate as sizeR and sizeD. You can then use the sample.decomp function to obtain the pooled sample moments from the subgroup sample moments.
#Input the sample statistics for subgroups
N    <- c(sizeR, sizeD)
MEAN <- c(meanOfR, meanOfD)
SD   <- c(sdOfR, sdOfD)
#Compute sample decomposition
library(utilities)
sample.decomp(n = N, sample.mean = MEAN, sample.sd = SD, include.sd = TRUE)
Since you have not given the values of your moments and group sizes, I cannot show you the pooled standard deviation you get as your output. However, the above code will give you a table showing the moments of the input groups and the pooled sample. This will include the pooled standard deviation.
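If you prefer not to add a package, here is a minimal base-R sketch of the same combined-sample decomposition, treating sizeR, sizeD, meanOfR, meanOfD, sdOfR and sdOfD as single numbers for one variable (they are placeholders you would fill in):
N_total   <- sizeR + sizeD
mean_pool <- (sizeR * meanOfR + sizeD * meanOfD) / N_total
# within-group and between-group sums of squares combine into the variance of the pooled sample
var_pool  <- ((sizeR - 1) * sdOfR^2 + (sizeD - 1) * sdOfD^2 +
              sizeR * (meanOfR - mean_pool)^2 + sizeD * (meanOfD - mean_pool)^2) /
             (N_total - 1)
sqrt(var_pool)   # standard deviation of the combined sample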
I am running some simulations for a selection experiment I am doing.
As part of this, I want to select from a dataset I've already made using probabilities to simulate selection.
I start by making an initial population using starting frequencies where the probability of getting a 1 is 0.25, a 2 is 0.5, and a 3 is 0.25. Here 1, 2 and 3 represent the three different genotypes.
N <- 400
my_prob <- c(0.25, 0.5, 0.25)
N1 <- sample(1:3, N, replace = TRUE, prob = my_prob)
P1 <- data.frame(N1)
I now want to simulate selection in my population where one homozygote is selected against and there is partial selection against heterozygotes so probabilities of ((1-s)^2, (1-s), 1) where s=0.2 in this example.
Initially I was sampling each group individually using the sample_frac() function and then recombining the datasets.
library(dplyr)   # for filter(), sample_frac() and %>%
s <- 0.2
S1homo   <- filter(P1, N1 == 1) %>%
  sample_frac((1 - s)^2, replace = FALSE)
S1hetero <- filter(P1, N1 == 2) %>%
  sample_frac((1 - s), replace = FALSE)
S1others <- filter(P1, N1 == 3)
S1 <- rbind(S1homo, S1hetero, S1others)
The problem with this is that there isn't any variability in the numbers it returns, which is unrealistic; for example, S1homo will always return exactly 64% of the 1 values when I set s=0.2, whereas in my initial populations there is some variability in the counts you get for each value.
So I was wondering if there is a way to select from my P1 population using the set probabilities of ((1-s)^2,(1-s), 1) for the different genotypes so that I don't always get the exact same numbers being returned for each group being selected against.
I tried doing this using the sample() function I used before but I couldn't get it to work.
# sel gives the expected fraction of the population surviving selection, so N*sel is the expected size of the new population
sel <-((1-s)^2 + 2*(1-s)+1)/4
S1 <-sample(P1, N*sel, replace=FALSE, prob=c((1-s)^2,(1-s),1))
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
I am not 100% sure what you are trying to do, but if you want (1-s)^2 to be the probability that a randomly chosen element is included in the sample, rather than the exact fraction chosen, you can use sample_n rather than sample_frac, with an n that is itself randomly drawn to reflect that rate:
S1homo <- filter(P1, N1 == 1) %>%
  sample_n(rbinom(1, sum(N1 == 1), (1 - s)^2))
Using rbinom like that is perhaps a bit indirect, but I don't see another way to easily do it with %>%.
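If you would rather avoid the rbinom detour entirely, a per-row Bernoulli draw inside filter is a closely related sketch (not from the original answer) that also keeps the surviving group size random:
S1homo <- filter(P1, N1 == 1) %>%
  filter(rbinom(n(), 1, (1 - s)^2) == 1)   # each row is kept independently with probability (1-s)^2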
I want to calculate the error rate by interval, where 0 is good and 1 is bad. If I have a sample of observations, with levels drawn from 1 to 100 and divided into intervals, as follows:
X <- 10; q <- sample(c(0,1), replace=TRUE, size=X)
l <- sample(c(1:100), replace=TRUE, size=10)
bornes <- seq(min(l), max(l), 5)
v <- cut(l, breaks=bornes, include.lowest=TRUE)
table(v)
How can I get a table or function that calculates the error rate for each interval, i.e. the number of bad observations divided by the total number of observations in that interval?
tx_erreur <- function(x){
  t <- table(x, q)
  return(sum(t[,2]) / sum(t))
}
I already tried this code above and tapply.
Thank you!
I think you want this:
tapply(q,               # the variable to be summarized
       v,               # the variable that defines the bins
       function(x)      # the function that computes the summary statistic within each bin
         sum(x)/length(x))
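Since sum(x)/length(x) is just the mean of the 0/1 values, equivalent versions (reusing the same q and v) are:
tapply(q, v, mean)                                                      # same result, one value per interval
aggregate(list(error_rate = q), by = list(interval = v), FUN = mean)   # the same thing as a data.frame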
In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while the "improved" one has been used on the plants the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x and leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that you're not going to get even close to a correct estimate of your p.value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset.
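For reference, a minimal sketch of such a permutation test (not part of the original answer; it reuses x, y and total from above) could look like this:
obs_diff  <- mean(y) - mean(x)              # observed difference in means
perm_diff <- replicate(10000, {
  idx <- sample(length(total), length(y))   # randomly reassign the "improved" labels
  mean(total[idx]) - mean(total[-idx])
})
mean(perm_diff >= obs_diff)                 # one-sided permutation p-value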
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, function(d, i) sum(d[i]), R = 10000)   # a boot statistic takes (data, indices)
b_y <- boot(y, function(d, i) sum(d[i]), R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null hypothesis that they are the same population. I may have missed a degrees-of-freedom adjustment, and I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.