I have samples of observation counts for 4 genotypes in a single copy region. What I want to do is calculate the allele frequencies of these genotypes, and then test whether these frequencies deviate significantly from the expected values of 25%:25%:25%:25% using a chi-squared test in R.
So far, I got:
> a <- c(do.call(rbind, strsplit(as.character(gdr18[1,9]), ",")), as.character(gdr18[1,8]))
> a
[1] "27" "30" "19" "52"
Next I get total count:
> sum <- as.numeric(a[1]) + as.numeric(a[2]) + as.numeric(a[3]) + as.numeric(a[4])
> sum
[1] 128
Now frequencies:
> af1 <- as.numeric(a[1])/sum
> af2 <- as.numeric(a[2])/sum
> af3 <- as.numeric(a[3])/sum
> af4 <- as.numeric(a[4])/sum
> af1
[1] 0.2109375
> af2
[1] 0.234375
> af3
[1] 0.1484375
> af4
[1] 0.40625
Here I am lost now. I want to know if af1, af2, af3 and af4 deviate significantly from 0.25, 0.25, 0.25 and 0.25
How do I do this in R?
Thank you,
Adrian
EDIT:
Alright, I am trying chisq.test() as suggested:
> p <- c(0.25,0.25,0.25,0.25)
> chisq.test(af, p=p)
Chi-squared test for given probabilities
data: af
X-squared = 0.146, df = 3, p-value = 0.9858
Warning message:
In chisq.test(af, p = p) : Chi-squared approximation may be incorrect
What is the warning message trying to tell me? Why would the approximation be incorrect?
To test this methodology, I picked values far from expected 0.25:
> af=c(0.001,0.200,1.0,0.5)
> chisq.test(af, p=p)
Chi-squared test for given probabilities
data: af
X-squared = 1.3325, df = 3, p-value = 0.7214
Warning message:
In chisq.test(af, p = p) : Chi-squared approximation may be incorrect
In this case the H0 is still not rejected, even though the values are pretty far off from the expected 0.25 values.
The problem is that chisq.test() works on the observed counts, not on the proportions: when you pass proportions, the expected counts are tiny (well below 5), which is what the approximation warning is telling you, and it is also why even wildly unbalanced proportions are not rejected. Run the test on the raw counts instead:
observed <- c(27,30,19,52)
chisq.test(observed)
which indicates that counts as extreme as these, or more so, would arise by chance alone about 0.03% of the time (p = 0.0003172).
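To see concretely why the warning appeared when proportions were supplied, you can inspect the expected counts stored on the test object (a quick check of my own, not part of the original answer):
af <- c(0.2109375, 0.234375, 0.1484375, 0.40625)
p <- rep(0.25, 4)
chisq.test(af, p = p)$expected                 # all about 0.25, far below 5, hence the warning
chisq.test(c(27, 30, 19, 52), p = p)$expected  # with counts, every expected cell is 32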
If your null hypothesis is not a 25:25:25:25 distribution across the four categories, but, say, the question is whether these data depart significantly from a 3:3:1:9 expectation, you can either calculate the expected counts explicitly or pass the ratio to p and let it be rescaled:
expected <- sum(observed)*c(3,3,1,9)/16
chisq.test(observed, p=c(3,3,1,9), rescale.p=TRUE)
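As a minimal check (my sketch, not from the original answer), the rescale.p call is equivalent to supplying the probabilities yourself:
chisq.test(observed, p = c(3, 3, 1, 9)/16)   # same statistic and p-value as the rescale.p call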
library(haven)
library(survey)
library(dplyr)
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
# Select the necessary columns
nhanesAnalysis <- nhanesDemo %>%
select(fpl, age, gender, persWeight, psu, strata)
# Set up the design
nhanesDesign <- svydesign(id = ~psu,
strata = ~strata,
weights = ~persWeight,
nest = TRUE,
data = nhanesAnalysis)
# Select those between the ages of 18 and 79 with a non-missing fpl
ageDesign <- subset(nhanesDesign, age > 17 & age < 80 & !is.na(fpl))
quantile_results <- svyquantile(~fpl, ageDesign, quantiles=c(0.1, 0.5, 0.9))
print(quantile_results)
The default rounding of svyquantile appears to be two digits past the decimal place. How can I change this? I couldn't find anything in the documentation.
svyquantile does no rounding.
In this example, the two digit precision is the precision of the data: fpl is given to only two decimal places and by default svyquantile returns the left quantile, which is always one of the observed values. In fact, most of the distinct values of fpl occur multiple times: there are 20 observations equal to the 10th percentile, 29 equal to the median, and 1220 equal to the 90th percentile, so the quantile will be equal to one of the observed values in this example no matter what you specify for the qrule argument.
If you make fpl noisier, you'll get more digits:
> ageDesign<-update(ageDesign, fpl_noisy=fpl+runif(nrow(ageDesign),0,0.005))
> svyquantile(~fpl_noisy, ageDesign, quantiles=c(0.1, 0.5, 0.9))
$fpl_noisy
quantile ci.2.5 ci.97.5 se
0.1 0.8027744 0.7128426 0.8841695 0.04019022
0.5 2.9711470 2.5921659 3.3747105 0.18357099
0.9 5.0031355 5.0027002 5.0035307 0.00019482
attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"
I'd like to get the same answer from binom.test or prop.test in R for the following question. How can I get the same answer as my manual calculation (0.009903076)?
n=475, H0:p=0.05, H1:p>0.05
What is the probability of phat>0.0733?
n <- 475
p0 <- 0.05
p <- 0.0733
(z <- (p - p0)/sqrt(p0*(1 - p0)/n))
# [1] 2.33
(ans <- 1 - pnorm(z))
# [1] 0.009903076
You can get this from prop.test():
prop.test(n*p, n, p0, alternative="greater", correct=FALSE)
# data: n * p out of n, null probability p0
# X-squared = 5.4289, df = 1, p-value = 0.009903
# alternative hypothesis: true p is greater than 0.05
# 95 percent confidence interval:
# 0.05595424 1.00000000
# sample estimates:
# p
# 0.0733
#
You can't get the result from binom.test() so far as I can tell, because n*p is not an integer: it's 34.8175. The binom.test() function only takes an integer number of successes, so when you round this to 35, p effectively becomes 0.07368421, which makes the rest of your results not match. Even if you had a situation where n*p was an integer, binom.test() would still not produce the same answer, because it does not use the normal approximation your original code does; it uses the binomial distribution to calculate the probability above p0.
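For completeness, this is what the exact test would look like after rounding to a whole number of successes (a sketch; its p-value will not match 0.009903076 for the reasons above):
# 0.0733 * 475 = 34.8175 successes must be rounded before binom.test will accept it
binom.test(35, 475, p = 0.05, alternative = "greater")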
I can run a permutation test for the two-sample difference in means, but not for the median test, the Wilcoxon test, or the Hodges-Lehmann test.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143)) / length(null_dist))
I think this is what you're trying to do; I altered your code as little as possible. There are packages like infer that will do this for you, and the for loop is not the most efficient approach, but it's plenty good enough and may help you learn. Since we're looping anyway, I computed the mean and median at the same time, as all other parts of the code are identical. ifelse is an easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
Following up on your question about HL test. Quoting Wikipedia
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code...
Do NOT run it 100,000 times; the answer is the same every time because you're already making all 49 possible pairings.
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
[1] 49
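If you also want a permutation p-value built on the Hodges-Lehmann statistic itself, here is my own sketch reusing the shuffling pattern above (not something from the original answer):
# Permutation null distribution of the HL statistic (median of all pairwise differences)
hl_stat <- function(x, y) median(outer(x, y, "-"))
delta_hl <- hl_stat(data_2019, data_2000)   # the observed value, 49 here
hl_diff <- numeric(trials)
set.seed(123)
for (i in 1:trials) {
  shuffled_data <- sample(combined_data)
  hl_diff[i] <- hl_stat(shuffled_data[8:14], shuffled_data[1:7])
}
(p_hl <- mean(hl_diff >= delta_hl | hl_diff <= -delta_hl))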
You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0
I'll update this when I figure out how to do the other tests.
When I print my p value from my t.test by doing:
ttest_bb[3]
It returns the full p-value. How can I make it print only the first two decimal places, i.e. .03 instead of .034587297?
The output from t.test is a list. If you only use [ to grab the p-value, what is returned is a list with one element. You want to use [[ to extract the element contained at that spot in the list if you want to treat it as a numeric vector.
> ttest_bb <- t.test(rnorm(20), rnorm(20))
> ttest_bb
Welch Two Sample t-test
data: rnorm(20) and rnorm(20)
t = -2.5027, df = 37.82, p-value = 0.01677
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.4193002 -0.1498456
sample estimates:
mean of x mean of y
-0.3727489 0.4118240
> # Notice that what is returned when subsetting like this is
> # a list with the name p.value
> ttest_bb[3]
$`p.value`
[1] 0.01676605
> # If we use the double parens then it extracts just the vector contained
> ttest_bb[[3]]
[1] 0.01676605
> # What you're seeing is this:
> round(ttest_bb[3])
Error in round(ttest_bb[3]) :
non-numeric argument to mathematical function
> # If you use double parens you can use that value
> round(ttest_bb[[3]],2)
[1] 0.02
> # I prefer using the named argument to make it more clear what you're grabbing
> ttest_bb$p.value
[1] 0.01676605
> round(ttest_bb$p.value, 2)
[1] 0.02
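If the goal is only to control how the p-value prints, rather than to round the stored number, base R's formatting helpers work too (a small aside, not part of the original answer):
sprintf("p = %.2f", ttest_bb$p.value)      # fixed two decimal places, e.g. "p = 0.02" here
format.pval(ttest_bb$p.value, digits = 2)  # base R's p-value-aware formatter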
Update:
The following code should be reproducible:
someFrameA = data.frame(label="A", amount=rnorm(10000, 100, 20))
someFrameB = data.frame(label="B", amount=rnorm(1000, 50000, 20))
wholeFrame = rbind(someFrameA, someFrameB)
fit <- e1071::naiveBayes(label ~ amount, wholeFrame)
wholeFrame$predicted = predict(fit, wholeFrame)
nrow(subset(wholeFrame, predicted != label))
In my case, this gave 243 misclassifications.
Note these rows:
(row num, label, amount, prediction)
10252 B 50024.81895 A
2955 A 100.55977 A
10678 B 50010.26213 B
While the two B amounts differ by only about 14.6, the classification changes. It's curious that the posterior probabilities for rows like this are so close:
> predict(fit, wholeFrame[10683, ], type="raw")
A B
[1,] 0.5332296 0.4667704
Original Question:
I am trying to classify some bank transactions using the transaction amount. I had many other text based features in my original model, but noticed something fishy when using just the numeric one.
> head(trainingSet)
category amount
1 check 688.00
2 non-businesstransaction 2.50
3 non-businesstransaction 36.00
4 non-businesstransaction 243.22
5 payroll 302.22
6 non-businesstransaction 16.18
fit <- e1071::naiveBayes(category ~ amount, data=trainingSet)
fit
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
bankfee check creditcardpayment e-commercedeposit insurance
0.029798103 0.189613233 0.054001459 0.018973486 0.008270494
intrabanktransfer loanpayment mcapayment non-businesstransaction nsf
0.045001216 0.015689613 0.011432741 0.563853077 0.023351982
other payroll taxpayment utilitypayment
0.003405497 0.014838239 0.005716371 0.016054488
Conditional probabilities:
amount
Y [,1] [,2]
bankfee 103.58490 533.67098
check 803.44668 2172.12515
creditcardpayment 819.27502 2683.43571
e-commercedeposit 42.15026 59.24806
insurance 302.16500 727.52321
intrabanktransfer 1795.54065 11080.73658
loanpayment 308.43233 387.71165
mcapayment 356.62755 508.02412
non-businesstransaction 162.41626 951.65934
nsf 44.92198 78.70680
other 9374.81071 18074.36629
payroll 1192.79639 2155.32633
taxpayment 1170.74340 1164.08019
utilitypayment 362.13409 1064.16875
According to the e1071 docs, the first column for "conditional probabilities" is the mean of the numeric variable, and the other is the standard deviation. These means and stdevs are correct, as are the apriori probabilities.
So, it's troubling that this row:
> thatRow
category amount
40 other 11268.53
receives these posteriors:
> predict(fit, newdata=thatRow, type="raw")
bankfee check creditcardpayment e-commercedeposit insurance intrabanktransfer loanpayment mcapayment
[1,] 4.634535e-96 7.28883e-06 9.401975e-05 0.4358822 4.778703e-51 0.02582751 1.103762e-174 1.358662e-101
non-businesstransaction nsf other payroll taxpayment utilitypayment
[1,] 1.446923e-29 0.5364704 0.001717378 1.133719e-06 2.059156e-18 2.149142e-24
Note that "nsf" has about 300X the score than "other" does. Since this transaction has an amount of 11.2k dollars, if it were to follow that "nsf" distribution, it would be over 100 standard deviations from the mean. Meanwhile, since "other" transactions have a sample mean of about 9k dollars with a large standard deviation, I would think that this transaction is much more probable as an "other". While "nsf" is more likely wrt the prior probabilities, they aren't so different as to outweigh that tail observation, and there are plenty of other viable candidates besides "other" as well.
I was assuming that this package just evaluated the Normal(mu = sample mean, sd = sample sd) pdf and used that value to multiply, but is that not the case? I can't quite figure out how to see the source.
Datatypes seem to be fine too:
> class(trainingSet$amount)
[1] "numeric"
> class(trainingSet$category)
[1] "factor"
The "naive bayes classifier for discrete predictors" in the printout is maybe odd, since this is a continuous predictor, but I assume this package can handle continuous predictors.
I had similar results with the klaR package. Maybe I need to set the kernel option on that?
The threshold argument is a large part of this. The code in the package has a bit like this:
L <- sapply(1:nrow(newdata), function(i) {
    ndata <- newdata[i, ]
    L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
        function(v) {
            nd <- ndata[attribs[v]]
            if (is.na(nd)) rep(1, length(object$apriori)) else {
                prob <- if (isnumeric[attribs[v]]) {
                    msd <- object$tables[[v]]
                    msd[, 2][msd[, 2] <= eps] <- threshold
                    dnorm(nd, msd[, 1], msd[, 2])
                } else object$tables[[v]][, nd]
                prob[prob <= eps] <- threshold
                prob
            }
The threshold (and this is documented) will replace any probabilities less than eps. So, if the normal pdf for the continuous variable is 0.000000000, it will become .001 by default.
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.001)
> nrow(subset(wholeFrame, predicted != label))
[1] 249
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.0001)
> nrow(subset(wholeFrame, predicted != label))
[1] 17
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.00001)
> nrow(subset(wholeFrame, predicted != label))
[1] 3
Now, I believe that the quantities returned by the sapply are incorrect, since when "debugging" it, I got something like .012 for what should have been dnorm(49990, 100, 20). I think something gets left out or mixed up with the mean and standard deviation matrix, but in any case, setting the threshold will help with this.
For A to end up with a higher posterior than B in this situation, we need .001*(10/11) > pdfB*(1/11), which means that pdfB has to fall below .01 by chance.
> dnorm(49977, 50000, 20)
[1] 0.01029681
> 2*pnorm(49977, 50000, 20)
[1] 0.2501439
And since there were 1000 observations in class B, we should expect about 250 misclassifications, which is pretty close to the original 243.
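To make the boundary arithmetic concrete, here is a hand computation of the two-class posterior at an amount 23 below the class-B mean, mirroring the package's threshold logic (a sketch using the nominal simulation parameters from above; the fitted sample means and sds will differ slightly):
amount  <- 49977
prior_A <- 10000 / 11000                      # 10,000 A rows vs 1,000 B rows
prior_B <- 1000 / 11000
lik_A <- max(dnorm(amount, 100, 20), 0.001)   # underflows to 0 and is floored at the threshold
lik_B <- max(dnorm(amount, 50000, 20), 0.001) # about 0.0103, above the threshold
post_B <- prior_B * lik_B / (prior_A * lik_A + prior_B * lik_B)
post_B   # just over 0.5: the row is still (barely) classified as B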