I have found a solution in Python, but was looking for a solution in R. Is there an R equivalent of chi2.isf(p, df)? I know the R equivalent of chi2.sf(x, df) is 1 - pchisq(x, df).
There is a family of functions in R for the Chi-square distribution: dchisq(), pchisq(), qchisq(), rchisq(). In your case, you need qchisq() to get a Chi-square statistic from a p-value and degrees of freedom:
qchisq(p = 0.01, df = 7)
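Note that scipy's chi2.isf(p, df) is the upper-tail (inverse survival) quantile, so the closest one-to-one equivalent uses lower.tail = FALSE; a quick sketch:
# Upper-tail quantile: the value x with P(X > x) = 0.01 for df = 7
qchisq(0.01, df = 7, lower.tail = FALSE)
# 18.475307 (compare the df = 7 row, 0.99 column of the table below)
# The same value via the complementary lower-tail probability
qchisq(1 - 0.01, df = 7)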
To build a matrix with qchisq, I would do something like this. Feel free to change p-values and degrees of freedom as you need.
# Set p-values
p <- c(0.995, 0.99, 0.975, 0.95, 0.90, 0.10, 0.05, 0.025, 0.01, 0.005)
# Set degrees of freedom
df <- seq(1,20)
# Calculate a matrix of chisq statistics
m <- outer(p, df, function(x,y) qchisq(x,y))
# Transpose for a better view
m <- t(m)
# Set column and row names
colnames(m) <- p
rownames(m) <- df
m
0.995 0.99 0.975 0.95 0.9 0.1 0.05 0.025 0.01 0.005
1 7.879439 6.634897 5.023886 3.841459 2.705543 0.01579077 0.00393214 0.0009820691 0.0001570879 0.00003927042
2 10.596635 9.210340 7.377759 5.991465 4.605170 0.21072103 0.10258659 0.0506356160 0.0201006717 0.01002508365
3 12.838156 11.344867 9.348404 7.814728 6.251389 0.58437437 0.35184632 0.2157952826 0.1148318019 0.07172177459
4 14.860259 13.276704 11.143287 9.487729 7.779440 1.06362322 0.71072302 0.4844185571 0.2971094805 0.20698909350
5 16.749602 15.086272 12.832502 11.070498 9.236357 1.61030799 1.14547623 0.8312116135 0.5542980767 0.41174190383
6 18.547584 16.811894 14.449375 12.591587 10.644641 2.20413066 1.63538289 1.2373442458 0.8720903302 0.67572677746
7 20.277740 18.475307 16.012764 14.067140 12.017037 2.83310692 2.16734991 1.6898691807 1.2390423056 0.98925568313
8 21.954955 20.090235 17.534546 15.507313 13.361566 3.48953913 2.73263679 2.1797307473 1.6464973727 1.34441308701
9 23.589351 21.665994 19.022768 16.918978 14.683657 4.16815901 3.32511284 2.7003895000 2.0879007359 1.73493290500
10 25.188180 23.209251 20.483177 18.307038 15.987179 4.86518205 3.94029914 3.2469727802 2.5582121602 2.15585648130
11 26.756849 24.724970 21.920049 19.675138 17.275009 5.57778479 4.57481308 3.8157482522 3.0534841066 2.60322189052
12 28.299519 26.216967 23.336664 21.026070 18.549348 6.30379606 5.22602949 4.4037885070 3.5705689706 3.07382363809
13 29.819471 27.688250 24.735605 22.362032 19.811929 7.04150458 5.89186434 5.0087505118 4.1069154715 3.56503457973
14 31.319350 29.141238 26.118948 23.684791 21.064144 7.78953361 6.57063138 5.6287261030 4.6604250627 4.07467495740
15 32.801321 30.577914 27.488393 24.995790 22.307130 8.54675624 7.26094393 6.2621377950 5.2293488841 4.60091557173
16 34.267187 31.999927 28.845351 26.296228 23.541829 9.31223635 7.96164557 6.9076643535 5.8122124701 5.14220544304
17 35.718466 33.408664 30.191009 27.587112 24.769035 10.08518633 8.67176020 7.5641864496 6.4077597777 5.69721710150
18 37.156451 34.805306 31.526378 28.869299 25.989423 10.86493612 9.39045508 8.2307461948 7.0149109012 6.26480468451
19 38.582257 36.190869 32.852327 30.143527 27.203571 11.65091003 10.11701306 8.9065164820 7.6327296476 6.84397144548
20 39.996846 37.566235 34.169607 31.410433 28.411981 12.44260921 10.85081139 9.5907773923 8.2603983325 7.43384426293
If you want to create a contingency table, you can use the function table().
With this function you get a table of this type:
table(var1, var2)
        var2$1 var2$2
var1$1  n      n
var1$2  n      n
where $1 and $2 denote the levels (modalities) of variables 1 and 2; the levels of the first argument label the rows and those of the second label the columns.
If you have a data frame containing the variables, you call it as table(namedf$var1, namedf$var2).
If you want to do the chi-squared test, the function is chisq.test():
chisq.test(namedf$var1, namedf$var2, correct=FALSE)
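For instance, with a made-up data frame (namedf, var1 and var2 are placeholder names):
# Hypothetical example data
namedf <- data.frame(
  var1 = c("yes", "no", "yes", "no", "yes", "no", "yes", "no"),
  var2 = c("A",   "A",  "B",   "B",  "A",   "B",  "A",   "B")
)
# Contingency table: levels of var1 label the rows, levels of var2 the columns
table(namedf$var1, namedf$var2)
# Chi-squared test of independence without Yates' continuity correction
chisq.test(namedf$var1, namedf$var2, correct = FALSE)
With such a tiny made-up table R will warn that the chi-squared approximation may be inaccurate; real data with larger counts will not trigger this.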
I'd like to get the same answer from binom.test() or prop.test() in R for the following question. How can I get the same answer as my manual calculation (0.009903076)?
n=475, H0:p=0.05, H1:p>0.05
What is the probability of phat>0.0733?
n <- 475
p0 <- 0.05
p <- 0.0733
(z <- (p - p0)/sqrt(p0*(1 - p0)/n))
# [1] 2.33
(ans <- 1 - pnorm(z))
# [1] 0.009903076
You can get this from prop.test():
prop.test(n*p, n, p0, alternative="greater", correct=FALSE)
# data: n * p out of n, null probability p0
# X-squared = 5.4289, df = 1, p-value = 0.009903
# alternative hypothesis: true p is greater than 0.05
# 95 percent confidence interval:
# 0.05595424 1.00000000
# sample estimates:
# p
# 0.0733
#
You can't get the result from binom.test() so far as I can tell, because n*p is not an integer; it's 34.8175. The binom.test() function only takes an integer number of successes, so when you convert this to 35 by rounding, p effectively becomes 0.07368421, which makes the rest of your results not match. Even if you had a situation where n*p was an integer, binom.test() would still not produce the same answer, because it does not use a normal approximation as your original code does: it uses the binomial distribution to calculate the probability above p0.
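For completeness, this is roughly what the exact test looks like after rounding the count (a sketch; the p-value will not be 0.009903076 for the reasons above):
# Exact binomial test; 35 successes comes from rounding n*p = 34.8175
binom.test(round(475 * 0.0733), 475, p = 0.05, alternative = "greater")
# The exact upper-tail probability it reports, computed directly
pbinom(34, size = 475, prob = 0.05, lower.tail = FALSE)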
Is it possible to get the multiple comparison adjustment in pairwise.prop.test() to use fewer than the full number of comparisons? For example, if I only care about 4 vs 1, 2, 3 (3 comparisons) below, I would multiply the p-values in the bottom row by 3 instead of 6 (which is the full number of pairwise comparisons) to do the Bonferroni adjustment. p.adjust() has the n argument, but I can't figure out how to pass it through by doing something like
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", p.adjust.n = 3, alternative="two.sided", correct = FALSE)
With Bonferroni, it's trivial, but much more involved with the other types of corrections.
Here's the result (with code below):
> b <- data.frame(
+ s=c(18,53,49,30),
+ pop=c(29,100,88,73),
+ reg=c("1","2","3","4")
+ )
> pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="none",alternative="two.sided", correct = FALSE)
Pairwise comparisons using Pairwise comparison of proportions
data: b$s out of b$pop
1 2 3
2 0.387 - -
3 0.547 0.713 -
4 0.056 0.122 0.065
P value adjustment method: none
> pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", alternative="two.sided", correct = FALSE)
Pairwise comparisons using Pairwise comparison of proportions
data: b$s out of b$pop
1 2 3
2 1.00 - -
3 1.00 1.00 -
4 0.33 0.73 0.39
P value adjustment method: bonferroni
Code:
b <- data.frame(
s=c(18,53,49,30),
pop=c(29,100,88,73),
reg=c("1","2","3","4")
)
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="none",alternative="two.sided", correct = FALSE)
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", alternative="two.sided", correct = FALSE)
Based on @27 ϕ 9's comment:
b <- data.frame(
s=c(18,53,49,30),
pop=c(29,100,88,73),
reg=c("1","2","3","4")
)
unadj <- pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="none",alternative="two.sided")
# Row "4" (the third row of the p-value matrix) holds the comparisons of group 4 vs groups 1-3
p.adjust(unadj$p.value[3, ], method = "holm")
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", alternative="two.sided")
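If the goal is specifically Bonferroni (or Holm) over only the 3 comparisons of interest, the n argument of p.adjust() can also be set directly; a sketch reusing the unadj object defined above:
# Unadjusted p-values for group 4 vs groups 1-3 (third row of the p-value matrix)
p4 <- unadj$p.value[3, ]
# Bonferroni over 3 comparisons instead of all 6
p.adjust(p4, method = "bonferroni", n = 3)
# Holm over 3 comparisons (the default n is length(p4), which is already 3)
p.adjust(p4, method = "holm")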
I'm a beginner with R. I would need your help to automate these analyses and to get a summary output with the results.
I have 4 different data frames like this (see below), with the same headers and the same values in the Threshold column:
Set Threshold R2 P Coefficient Standard.Error Num_SNP
Base 0.0001 0.000233304 0.66047 0.0332613 0.0757204 47
Base 0.001 0.000387268 0.571772 -0.0438782 0.0775996 475
Base 0.05 0.00302399 0.114364 0.129474 0.082004 14164
Base 0.1 0.00252797 0.14897 0.117391 0.0813418 24616
Base 0.2 0.00481908 0.0465384 0.163571 0.0821767 41524
Base 0.3 0.00514761 0.0398082 0.170058 0.0827237 55307
Base 0.4 0.00699506 0.0166685 0.200571 0.083783 66943
Base 0.5 0.00634181 0.0226301 0.192314 0.0843623 76785
For each matching value in the Threshold columns, I would like to use the package metafor in R to meta-analyse the corresponding effect sizes (in the Coefficient column) and standard errors over the 4 data frames.
Using the metafor package:
rma.uni(yi=c(Coefficient_1,Coefficient_2,Coefficient_3,Coefficient_4),sei=c(Standard.Error_1,Standard.Error_2,Standard.Error_3,Standard.Error_4), measure="GEN", method='FE',intercept=T,weights=c(sample_size1,sample_size2,sample_size3,sample_size4))
How could I automate the analyses and get a summary data frame with the results for each Threshold?
Hi there, this should get you started. Essentially, you can loop over all thresholds, extract the rows matching each threshold from all four data frames into a new data frame, and run your meta-analysis:
library(metafor)
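# Optional: fix the random seed so that the fake data generated below (and hence
# the meta-analysis results) are reproducible
set.seed(1)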
# Make some fake data resembling your own
df1 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
df2 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
df3 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
df4 = data.frame(Set=rep("Base",8), Threshold=c(0.0001,0.001,0.05,seq(0.1,0.5,0.1)),
R2=runif(8,0.001,0.005),P=runif(8,0.001,1),Coefficient=runif(8,-0.1,0.2),
Standard.Error=runif(8,0.07,0.08),Num_SNP=sample(1:1000,8))
Thresholds = unique(df1$Threshold)
Results <- NULL
for(i in 1:length(Thresholds)){
idf = rbind(df1[df1$Threshold==Thresholds[i],],
df2[df2$Threshold==Thresholds[i],],
df3[df3$Threshold==Thresholds[i],],
df4[df4$Threshold==Thresholds[i],])
i.meta <- rma.uni(yi=idf$Coefficient,sei=idf$Standard.Error, measure="GEN", method='FE',intercept=T,
weights=idf$Num_SNP)
Results <- rbind(Results, c(Threshold=Thresholds[i],beta=i.meta$beta,se=i.meta$se,
zval=i.meta$zval,pval=i.meta$pval,ci.lb=i.meta$ci.lb,
ci.ub=i.meta$ci.ub,QEp=i.meta$QEp))
}
Results <- data.frame(Results)
Results
This should give you something like the following (your exact numbers will differ, since the fake data are generated randomly):
Threshold beta se zval pval ci.lb ci.ub QEp
1 1e-04 -0.012079013 0.04715546 -0.2561530 0.79783270 -0.104502022 0.0803440 0.08700919
2 1e-03 0.068932388 0.04006086 1.7206917 0.08530678 -0.009585452 0.1474502 0.22294419
3 5e-02 0.050069503 0.04094881 1.2227340 0.22143020 -0.030188694 0.1303277 0.07342661
4 1e-01 0.102598016 0.04188183 2.4497022 0.01429744 0.020511132 0.1846849 0.07380669
5 2e-01 0.069482160 0.04722693 1.4712401 0.14122619 -0.023080930 0.1620452 0.95494364
6 3e-01 0.009793206 0.05098346 0.1920859 0.84767489 -0.090132542 0.1097190 0.12191340
7 4e-01 0.030432884 0.03967771 0.7670021 0.44308028 -0.047333994 0.1081998 0.86270334
8 5e-01 0.073511575 0.03997485 1.8389458 0.06592316 -0.004837683 0.1518608 0.12333557
I am looking for a simulation package in R that can find the ideal weights, i.e. the weights that allocate the maximum number of my data points into the top bucket.
Basically, I want to tune my weights in such a way that I achieve this goal.
Below is an example.
Score1,Score2,Score3,Final,Group
0.87,0.73,0.41,0.63,"60-100"
0.82,0.73,0.85,0.796,"70-80"
0.82,0.37,0.85,0.652,"60-65"
0.58,0.95,0.42,0.664,"60-65"
1,1,0.9,0.96,"90-100"
Weight1,Weight2,Weight3
0.2,0.4,0.4
Final Score = Score1*Weight1 + Score2*Weight2 + Score3*Weight3
The sum of my weights is 1: Weight1 + Weight2 + Weight3 = 1.
I want to tune my weights in such a way that most of my cases fall into the "90-100" bucket. I know there won't be a perfect combination, but I want to capture the maximum number of cases. I am currently doing this manually in Excel using a pivot table, but I want to know if there is any package in R that helps me achieve my goal.
The group allocation ("70-80", "80-90", etc.) is something I have made in Excel using an if-else condition.
R Pivot Result:
"60-100",1
"60-65",2
"70-80",1
"90-100",1
I would appreciate it if someone could help me with this.
Thanks,
Here's an approach that tries to get all the final scores as close as possible to 0.9 using nested optimisation.
Here's your original data:
# Original data
df <- read.table(text = "Score1, Score2, Score3
0.87,0.73,0.41
0.82,0.73,0.85
0.82,0.37,0.85
0.58,0.95,0.42
1,1,0.9", header = TRUE, sep = ",")
This is the cost function for the first weight.
# Outer cost function
cost_outer <- function(w1){
# Run nested optimisation
res <- optimise(cost_nested, lower = 0, upper = 1 - w1, w1 = w1)
# Store the second weight in a global variable
res_outer <<- res$minimum
# Return the cost function value
res$objective
}
This is the cost function for the second weight.
# Nested cost function
cost_nested <- function(w2, w1){
# Calculate final weight
w <- c(w1, w2, 1 - w2 -w1)
# Distance from desired interval
res <- 0.9 - rowSums(w*df)
# Zero if negative distance, square distance otherwise
res <- sum(ifelse(res < 0, 0, res^2))
}
Next, I run the optimisation.
# Repackage weights
weight <- c(optimise(cost_outer, lower = 0, upper = 1)$minimum, res_outer)
weight <- c(weight, 1 - sum(weight))
Finally, I show the results.
# Final scores
cbind(df, Final = rowSums(weight * df))
# Score1 Score2 Score3 Final
# 1 0.87 0.73 0.41 0.7615286
# 2 0.82 0.73 0.85 0.8229626
# 3 0.82 0.37 0.85 0.8267400
# 4 0.58 0.95 0.42 0.8666164
# 5 1.00 1.00 0.90 0.9225343
Notice, however, that this code gets the final scores as close as possible to the interval, which is different from getting the most scores in that interval. That can be achieved by switching out the nested cost function with something like:
# Nested cost function
cost_nested <- function(w2, w1){
# Calculate final weight
w <- c(w1, w2, 1 - w2 -w1)
# Number of instances in desired interval
res <- sum(rowSums(w*df) < 0.9)
}
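One caveat on the weighting step: w * df multiplies element-wise, recycling the vector over the data frame in column-major order, which is not the same as applying weight j to column j. If one weight per Score column is what is intended, a small helper along these lines makes it explicit (a sketch, not the code that produced the output above):
# Weighted sum with one weight per Score column, applied explicitly
weighted_final <- function(w, scores){
  as.vector(as.matrix(scores) %*% w)   # row i gives sum_j scores[i, j] * w[j]
}
# e.g. inside the nested cost function:
# res <- 0.9 - weighted_final(w, df)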
This can be formulated as a Mixed Integer Programming (MIP) problem. The mathematical model can look like:
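A hedged reconstruction of that model from the description below (the double inequality is the "sandwich" constraint referred to in PS2):
$$
\begin{aligned}
\max \;& \sum_i \delta_i \\
\text{s.t.}\;& F_i = \sum_j a_{i,j}\, w_j && \forall i\\
& 0.9\,\delta_i \;\le\; F_i \;\le\; 1 + M\,(1 - \delta_i) && \forall i\\
& \sum_j w_j = 1,\qquad w_j \ge 0,\qquad \delta_i \in \{0,1\}
\end{aligned}
$$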
The binary variable δ_i indicates whether the final score F_i is inside the interval [0.9, 1]. M is a "large" value (if all your data is between 0 and 1 we can choose M = 1), and a_{i,j} is your data.
The objective function and all constraints are linear, so we can use standard MIP solvers to solve this problem. MIP solvers for R are readily available.
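As an illustration, here is a minimal sketch of that model using the Rglpk package (the solver choice and variable layout are my own assumptions; any R MIP interface such as ompr or lpSolve would work too), keeping both sides of the sandwich constraint with M = 1:
library(Rglpk)
# Score matrix a[i, j] from the question
a <- matrix(c(0.87, 0.73, 0.41,
              0.82, 0.73, 0.85,
              0.82, 0.37, 0.85,
              0.58, 0.95, 0.42,
              1.00, 1.00, 0.90), ncol = 3, byrow = TRUE)
n <- nrow(a); m <- ncol(a); M <- 1
# Decision variables: w_1..w_m (continuous, >= 0 by default), delta_1..delta_n (binary)
obj <- c(rep(0, m), rep(1, n))            # maximise the number of points in [0.9, 1]
A1  <- c(rep(1, m), rep(0, n))            # sum_j w_j = 1
A2  <- cbind(a, -0.9 * diag(n))           # a_i . w - 0.9 * delta_i >= 0
A3  <- cbind(a,  M  * diag(n))            # a_i . w + M * delta_i <= 1 + M
mat <- rbind(A1, A2, A3)
dir <- c("==", rep(">=", n), rep("<=", n))
rhs <- c(1, rep(0, n), rep(1 + M, n))
types <- c(rep("C", m), rep("B", n))
sol <- Rglpk_solve_LP(obj, mat, dir, rhs, types = types, max = TRUE)
sol$solution[1:m]                 # the weights
sol$solution[(m + 1):(m + n)]     # delta_i: which points end up in [0.9, 1]
sol$optimum                       # how many points that is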
PS: In the example the groups overlap, which does not make much sense to me. I think if we have "90-100" we should not also have "60-100".
PS2: If all the data are between 0 and 1, we can simplify the sandwich constraint a bit: we can drop its right-hand part.
For the small example data set I get:
---- 56 PARAMETER a
j1 j2 j3
i1 0.870 0.730 0.410
i2 0.820 0.730 0.850
i3 0.820 0.370 0.850
i4 0.580 0.950 0.420
i5 1.000 1.000 0.900
---- 56 VARIABLE w.L weights
j1 0.135, j2 0.865
---- 56 VARIABLE f.L final scores
i1 0.749, i2 0.742, i3 0.431, i4 0.900, i5 1.000
---- 56 VARIABLE delta.L selected
i4 1.000, i5 1.000
---- 56 VARIABLE z.L = 2.000 objective
(zeros are not printed)
I am generating random variables with a specified range and dimension. I have written the following code for this.
generateRandom <- function(size, scale){
  result <- round(runif(size, 1, scale), 1)
  return(result)
}
flag <- TRUE
x <- generateRandom(300, 6)
y <- generateRandom(300, 6)
while(flag){
  corrXY <- cor(x, y)
  if(corrXY >= 0.2){
    flag <- FALSE
  } else {
    x <- generateRandom(300, 6)
    y <- generateRandom(300, 6)
  }
}
I want 6 such variables of size 300, all on a scale of 1 to 6 except for one variable which would be on a scale of 1 to 7, with the following correlation structure among them.
1 0.45 -0.35 0.46 0.25 0.3
1 0.25 0.29 0.5 -0.3
1 -0.3 0.1 0.4
1 0.4 0.6
1 -0.4
1
But when I try to increase the threshold value, my program gets very slow. Moreover, I want more than 7 variables of size 300, and between each pair of those variables I want some specific correlation threshold. How would I do this efficiently?
This answer is directly inspired by here and there.
We would like to generate 300 samples of a 6-variate uniform distribution with correlation structure equal to
Rhos <- matrix(0, 6, 6)
Rhos[lower.tri(Rhos)] <- c(0.450, -0.35, 0.46, 0.25, 0.3,
0.25, 0.29, 0.5, -0.3, -0.3,
0.1, 0.4, 0.4, 0.6, -0.4)
Rhos <- Rhos + t(Rhos)
diag(Rhos) <- 1
From this target correlation structure we first derive the correlation structure of the Gaussian copula:
Copucov <- 2 * sin(Rhos * pi/6)
This matrix is not positive definite, so we use the nearest positive definite matrix instead:
library(Matrix)
Copucov <- cov2cor(nearPD(Copucov)$mat)
This correlation structure can be used as one of the inputs of MASS::mvrnorm:
library(MASS)
G <- mvrnorm(n=300, mu=rep(0,6), Sigma=Copucov, empirical=TRUE)
We then transform G into a multivariate uniform sample whose values range from 1 to 6, except for the last variable which ranges from 1 to 7:
U <- matrix(NA, 300, 6)
U[, 1:5] <- 5 * pnorm(G[, 1:5]) + 1
U[, 6] <- 6 * pnorm(G[, 6]) + 1
After rounding (and after taking the nearest positive definite matrix to the copula's correlation matrix, etc.), the correlation structure is not changed much:
Ur <- round(U, 1)
cor(Ur)
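To see how close the result is to the target, a quick check:
# Largest absolute deviation from the target correlation structure Rhos
max(abs(cor(Ur) - Rhos))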