how to apply fisher exact test on matrices - r

I want to perform Fisher's exact test between these two matrices, comparing the columns of one matrix with the columns of the other.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A 0.1200480 0.07189073 0.0000000 0.1016260 0.1128205 0.10200927 0.2961319 0.3020383 0.02524866 0.0000000
C 0.0300120 0.19769950 0.2012802 0.2815041 0.2358974 0.48686244 0.4724160 0.3749228 0.65340474 0.3294118
G 0.6302521 0.52120776 0.6273115 0.4085366 0.4179487 0.35548686 0.1122384 0.1247684 0.29609793 0.6705882
T 0.2196879 0.20920201 0.1714083 0.2083333 0.2333333 0.05564142 0.1192137 0.1982705 0.02524866 0.0000000
Here is the other matrix:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A 0.3143147 0.79432422 0.03440895 0.07098851 0.0004179104 0.0002388202 0.9988065402 0.69191708 0.181764 0.00000000
C 0.1690764 0.02235194 0.89570290 0.92901149 0.0000000000 0.0004179354 0.0007757489 0.01218711 0.000000 0.48599247
G 0.3406620 0.14882308 0.06988816 0.00000000 0.9992835821 0.9993432444 0.0004177109 0.00000000 0.818236 0.02478944
T 0.1759469 0.03450076 0.00000000 0.00000000 0.0002985075 0.0000000000 0.0000000000 0.29589581 0.000000 0.48921809
How can I do this?
I tried this:
Ref <- read.table("Ref_PPM.txt", sep=" ", header=T, stringsAsFactors=F)
Pred <- read.table("Pre_PPM.txt", sep=" ", header=T, stringsAsFactors=F)
output_df_forward <- data.frame()
for(i in 1:ncol(Ref)) {
Ref_vec <- as.numeric(unlist(as.data.frame(Ref[,1:i])))
Pred_vec <- as.numeric(unlist(as.data.frame(Pred[,(ncol(Pred)-i+1):ncol(Pred)])))
res <- Fisher.test(Pred_vec, Ref_vec)
output <- as.data.frame(cbind(as.numeric(res$p.value), as.numeric(res$estimate)))
ifelse(i == 1, output_df_forward <- output, output_df_forward <- rbind(output_df_forward, output))
}
But this loop does not work. I want to apply Fisher's test through the loop shown above.
I also tried this:
FP<-table(Pre_PPM)
FR<-table(Ref_PPM)
fisher.test(FP, FR)
Error in fisher.test(FP, FR) : 'x' and 'y' must have the same length
Kindly guide how can I do this. Thanks for your time in advance.

Considering your second example, I believe your matrices do not have the same size in terms of rows and columns. Here is an example of Fisher test:
a <- data.frame(foo=c(1,2), bar=c(3, 4), spam=c(5, 6))
b <- data.frame(eggs=c(7,10), ham=c(15, 30), spar=c(35, 40))
Which works perfectly fine:
fisher.test(a, b)
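Consider checking the dimensions of your matrices. Also note that when x is a matrix or data frame, fisher.test() ignores y and treats x itself as the contingency table, so the call above only tests a.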
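Since fisher.test() works on counts rather than proportions, one way to compare the columns of two probability matrices is to convert each pair of columns into a 4 x 2 table of counts first. The following is only a sketch under stated assumptions: the matrices Ref and Pred are small made-up examples, and n_seq (the number of sequences behind each matrix) is a hypothetical value that you would need to replace with the real one.

```r
# Hypothetical position probability matrices (rows A, C, G, T; columns = positions).
nucs <- c("A", "C", "G", "T")
Ref  <- matrix(c(0.10, 0.20, 0.60, 0.10,
                 0.25, 0.25, 0.25, 0.25,
                 0.05, 0.45, 0.30, 0.20),
               nrow = 4, dimnames = list(nucs, NULL))
Pred <- matrix(c(0.30, 0.20, 0.30, 0.20,
                 0.25, 0.25, 0.25, 0.25,
                 0.70, 0.10, 0.10, 0.10),
               nrow = 4, dimnames = list(nucs, NULL))

n_seq <- 20  # ASSUMPTION: number of sequences each matrix was built from

# Compare column i of Ref with column i of Pred.
p_values <- sapply(seq_len(ncol(Ref)), function(i) {
  # 4 x 2 contingency table of (approximate) nucleotide counts
  counts <- round(cbind(Ref = Ref[, i], Pred = Pred[, i]) * n_seq)
  fisher.test(counts)$p.value
})
p_values
```

Each p-value tests whether the nucleotide distribution at that position differs between the two matrices; the result depends strongly on the assumed n_seq.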

Related

maximize the value of a function using a for loop

I'm a bit new to R and I'm trying to maximize a simple value function, for a given parameter space.
The idea is to solve for c given different values of a.
The function is
2*(c^2)-(7.8*c)-(4*c*a)+(5*a)+(a^2)+6=0
And I would like to solve for the different values of c, given values of a from 0 to 100.
Is there any way to do this simply using a for loop?
Thanks a lot!
The equation can be solved analytically for c (since it's a simple degree 2 polynomial equation). Using e.g. Wolfram Alpha gives the two solutions c = a + (39 ± sqrt(200*a^2 + 560*a + 321)) / 20.
We can roll this up into a function to calculate c for different a
func_c <- function(a) a + 1/20 * (39 + c(-1, 1) * sqrt(200 * a^2 + 560 * a + 321))
sapply(0:10, func_c)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#[1,] 1.054176 1.306072 1.58304 1.867387 2.154937 2.44417 2.734398 3.025264 3.316562 3.608168 3.9
#[2,] 2.845824 4.593928 6.31696 8.032613 9.745063 11.45583 13.165602 14.874736 16.583438 18.291832 20.0
If you must use a numerical root finder (and I don't recommend doing this here since this has a simple closed-form analytical solution) you can use polyroot. The coefficients can be read off from the equation you give
# coefficients in increasing order of power: constant term, c, c^2
pol_coef <- function(a) c(5 * a + a^2 + 6, -(7.8 + 4 * a), 2)
sapply(0:10, \(x) Re(polyroot(pol_coef(x))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#[1,] 1.054176 1.306072 1.58304 1.867387 2.154937 2.44417 2.734398 3.025264 3.316562 3.608168 3.9
#[2,] 2.845824 4.593928 6.31696 8.032613 9.745063 11.45583 13.165602 14.874736 16.583438 18.291832 20.0
Solutions from both approaches are identical.
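Since the question asks for a for loop over a from 0 to 100, here is a minimal sketch of that version, together with a check that the closed-form roots really satisfy the original equation. The names f, roots and residuals are introduced here for illustration only.

```r
# Left-hand side of the original equation, as a function of c and a
f <- function(c_val, a) 2 * c_val^2 - 7.8 * c_val - 4 * c_val * a + 5 * a + a^2 + 6

# Closed-form roots from the answer above
func_c <- function(a) a + 1/20 * (39 + c(-1, 1) * sqrt(200 * a^2 + 560 * a + 321))

a_vals <- 0:100
roots <- matrix(NA_real_, nrow = length(a_vals), ncol = 2,
                dimnames = list(NULL, c("c_low", "c_high")))
for (i in seq_along(a_vals)) {
  roots[i, ] <- func_c(a_vals[i])  # both solutions for this value of a
}

# Plug the roots back in; a_vals is recycled down each column of roots
residuals <- f(roots, a_vals)
max(abs(residuals))  # should be zero up to floating-point error
```

The loop is not needed for speed (sapply does the same job), but it makes the one-row-per-a layout explicit.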

R: creating a distance matrix quickly (like using mapply() or similar)

I am looking to create a distance matrix for any arbitrary non-standard distance function.
I can do this the slow way as follows:
set.seed(1000)
DF <- data.frame(x=rnorm(10),y=rnorm(10)) # ten random points on the x y plane
L <- dim(DF)[1] # length of DF
F <- function(P1,P2,y){sqrt((P2$x-P1$x)^2 + (P2$y-P1$y)^2 + 1)}
# Almost the euclidean distance but with an added 1 to make it nonstandard
M <- matrix(nrow=L,ncol=L)
# Find the distances between every point in DF and every other point in DF
for(i in 1:L){
for(j in 1:L){
M[i,j] <- F(DF[i,],DF[j,])
}
}
M
which gives:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
[2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
[3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
[4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
[5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
[6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
[7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
[8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
[9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000
Obviously, with 2 nested for loops in R, this will be very slow for datasets of any size.
I would like to speed this up by using a function such as mapply() or outer() but am unsure of how to do it.
I've had a good look for similar questions but I can't find one that gives an adequate answer without involving Rcpp:
Create a distance matrix in R using parallelization
Create custom distance matrix function in R
Speed Up Distance Calculations
Trying the advice given in this link below gives me:
pairwise comparison with all vectors of a list
outer(DF,DF,FUN=Vectorize(F))
Error: $ operator is invalid for atomic vectors
or
outer(DF,DF,FUN=F)
Error in dim(robj) <- c(dX, dY) :
dims [product 4] do not match the length of object [10]
Here is how to use the outer() function to replace a nested loop and calculate a custom distance function:
set.seed(1000)
L <- 10
DF <- data.frame(x=rnorm(L), y=rnorm(L))
FN <- function (P1,P2) {sqrt((P2$x - P1$x)^2 + (P2$y - P1$y)^2 + 1)}
outer(1:L, 1:L, FUN=function(x, y) FN(DF[x,], DF[y,]))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
[2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
[3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
[4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
[5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
[6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
[7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
[8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
[9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000
Benchmark with DF <- data.frame(x=rnorm(100), y=rnorm(100)), i.e. a 100 x 100 distance matrix:
Unit: milliseconds
expr min lq mean median uq max neval
loop 647.080268 681.283754 720.842738 695.972994 728.078378 1057.16015 100
outer 7.892903 8.145765 8.661221 8.307392 8.710785 14.07253 100
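For this particular distance function, the row indexing can be avoided altogether by calling outer() on the coordinate vectors directly. This is a sketch that assumes the distance can be written in terms of per-axis differences, as it can here; M_fast and M_loop are names introduced for the comparison.

```r
set.seed(1000)
L <- 10
DF <- data.frame(x = rnorm(L), y = rnorm(L))

# Pairwise coordinate differences: one L x L matrix per axis
dx <- outer(DF$x, DF$x, "-")
dy <- outer(DF$y, DF$y, "-")
M_fast <- sqrt(dx^2 + dy^2 + 1)

# Reference result from the nested loop, for comparison
M_loop <- matrix(NA_real_, L, L)
for (i in seq_len(L)) {
  for (j in seq_len(L)) {
    M_loop[i, j] <- sqrt((DF$x[j] - DF$x[i])^2 + (DF$y[j] - DF$y[i])^2 + 1)
  }
}
all.equal(M_fast, M_loop)
```

This only works when the custom distance decomposes this way; for a truly arbitrary function of two rows, the index-based outer() call above remains the more general option.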
If a standard distance is enough, base R includes a simple function to calculate distances between the rows of a data frame of points (2D or 3D):
dist(DF, method = "euclidean", diag = TRUE, upper = TRUE)
If you only want the lower triangle, leave out upper = TRUE; if you do not want the zeros on the diagonal shown, set diag = FALSE.
This function can also compute Manhattan, Minkowski and Canberra distances.
For a custom distance function, there is an R package called usedist, which offers helpers for defining distance matrices and applying distance measures.
It has a function dist_make() which applies a function to each pair of rows in a matrix (not a data frame).
You will need to adapt your function so that it works on rows of a matrix of your data.
See the usedist documentation for details.
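When the custom distance is a simple transformation of a standard one, as in the question (Euclidean distance with 1 added under the square root), dist() can still do the heavy lifting. A minimal sketch, assuming that relationship holds for your function; M_dist and M_ref are illustrative names.

```r
set.seed(1000)
DF <- data.frame(x = rnorm(10), y = rnorm(10))

# Full square matrix of Euclidean distances, then apply the transformation
D <- as.matrix(dist(DF, method = "euclidean"))
M_dist <- sqrt(D^2 + 1)

# Reference computed directly from the coordinates
M_ref <- sqrt(outer(DF$x, DF$x, "-")^2 + outer(DF$y, DF$y, "-")^2 + 1)
all.equal(unname(M_dist), M_ref)
```

as.matrix() is needed because dist() returns a compact "dist" object that stores only the lower triangle.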

A for loop with rowSums function

I am a beginner in R and I have written a double-for loop for calculating chi2 values for selecting features among 6610 terms and 10 classes.
Here is my for loops:
library(raster)
#for x^2 [n,r] = term n, class r. n starts from col #7 and r starts from col #6617
chi2vals <- matrix(0:0,6610,10)
chi2avgs <- vector("numeric",6610L)
for(r in 1:10){
for(n in 1:6610){
A = sum(data1.sub.added[,6+n]==1 & data1.sub.added[,6616+r]==1)
M = sum(data1.sub.added[,6+n]==1)
P = sum(data1.sub.added[,6616+r]==1)
N = nrow(data1.sub.added)
E = ((A*N)-(M*P))**2
F = (N-P)*(N-M)
chi2vals[n,r] = (N/(P*M))*(E/F) # for term n
}
Prcj = sum(data1.sub.added[,6616+r]==1)/sum(data1.sub.added[,6616:6626]==1) #probability of class c_r
pchi <- Prcj * chi2vals
chi2avgs[n] = rowSums(pchi)[n]
}
The code correctly calculates everything up to the line pchi <- Prcj * chi2vals. The result is a nice matrix of p*chi2 values:
> head(pchi)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 128.36551442 0.239308113 0.683517530 1.5038665 0.6145058 3.656857e-01 1.3311564 2.6977448 0.410702803
[2,] 0.06632758 0.067970859 0.019178551 0.2900692 1.5300639 4.430705e-08 0.2599859 0.6362953 0.098745147
[3,] 1.85641330 1.411925435 3.590747764 7.3018416 38.8044465 4.102248e-01 6.4118078 13.0164994 1.709506238
[4,] 0.11063892 0.005039029 0.244964758 0.1622654 0.1156411 8.274468e+00 0.2564959 0.0577651 0.242946022
[5,] 0.04788648 0.049072885 0.001420669 0.2094211 1.7200152 2.045923e-01 0.1877019 0.1468187 0.005493183
[6,] 5.39946188 6.899336618 60.735646913 7.4351538 10.7005784 9.946261e+00 35.8868899 178.7112406 11.382740754
[,10]
[1,] 0.26436516
[2,] 0.14414444
[3,] 0.90292073
[4,] 0.01168997
[5,] 0.06641298
[6,] 19.68599142
But the final chi2avgs values mostly turn out to be zeros:
> head(chi2avgs)
[1] 0.000000 0.000000 0.000000 0.000000 2.638835 0.000000
However, when I run the last line outside the loop, replacing n with a specific number, it works well:
chi2avgs[1] = rowSums(pchi)[1]
chi2avgs[2] = rowSums(pchi)[2]
chi2avgs[3] = rowSums(pchi)[3]
chi2avgs[4] = rowSums(pchi)[4]
chi2avgs[5] = rowSums(pchi)[5]
> head(chi2avgs)
[1] 136.476367 3.112781 75.416334 9.481914 2.638835 0.000000
I wonder what causes this problem. Do you have an idea how I can fix it?
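By the time that line runs, the inner loop has finished and n is stuck at its last value (6610), so chi2avgs[n] only ever updates one element. Use rowSums directly, without [n]: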
chi2avgs = rowSums(pchi)
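Both loops can also be replaced by matrix operations. The sketch below uses small hypothetical 0/1 matrices terms and classes in place of the relevant columns of data1.sub.added, and it assumes the intent is an average of the chi2 values weighted by each class's probability; adjust it if your definition differs.

```r
# Hypothetical binary indicators: 8 documents, 3 terms, 2 classes
terms <- cbind(t1 = c(1, 1, 0, 0, 1, 0, 1, 0),
               t2 = c(0, 1, 1, 0, 0, 1, 0, 0),
               t3 = c(1, 0, 0, 1, 1, 1, 0, 1))
classes <- cbind(c1 = c(1, 1, 1, 0, 0, 0, 0, 0),
                 c2 = c(0, 0, 0, 1, 1, 1, 1, 1))

N <- nrow(terms)
A <- crossprod(terms, classes)  # A[n, r]: rows where term n and class r are both 1
M <- colSums(terms)             # rows containing each term
P <- colSums(classes)           # rows belonging to each class

E     <- (A * N - outer(M, P))^2
denom <- outer(N - M, N - P)
chi2vals <- (N / outer(M, P)) * (E / denom)  # same formula as in the loop

# ASSUMPTION: weight each class column by its class probability, then sum per term
class_prob <- P / sum(P)
chi2avgs <- as.vector(chi2vals %*% class_prob)
chi2avgs
```

Note that this weights every column by its own class probability, whereas the loop in the question multiplies the whole matrix by the probability of the current class only.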

Nested apply function

I have two matrices and want to apply a linear regression. Briefly, I want to get the p-value for the regression between each column of a and each column of b (a factor).
So I want to calculate the association between (the number displays the column):
a1 ~ b1
a1 ~ b2
...
a2 ~ b1
a2 ~ b2
...
a3 ~ b1
...
Both datasets:
set.seed(1232)
a <- matrix(runif(100,min=6,max=14),10)
b <- matrix(sample(c(0,1),100,replace = T),10)
I tried to use a loop, this works but it is too slow:
res <- NULL
for( i in 1:ncol(a)){
tmp <- apply(b,2,function(y,x) summary(lm(x~y))$coefficients[2,4],a[,i])
res <- cbind(res,tmp)
print(i)
}
So I tried to use a nested apply function like this but it does not work. Do you have an idea?
apply(b, 2, function(y,x) apply(x,2,function(x,y) summary(lm(x~y))$coefficients[2,4]), a)
Try this:
apply(b, 2, function(x) apply(a, 2, function(y) summary(lm(x~y))$coefficients[2,4]))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.8625602 0.2411563 0.7612476 0.509436082 0.3577235 0.45874974 0.360808572 0.05617883
[2,] 0.4136582 0.1186081 0.5161454 0.819513642 0.7813467 0.75912163 0.908950370 0.80584144
[3,] 0.5388209 0.8055687 0.7549796 0.935750996 0.0646300 0.79523596 0.973429634 0.55095667
[4,] 0.9930913 0.9622925 0.3753466 0.552830253 0.9636374 0.75312925 0.997724288 0.39246169
[5,] 0.2238149 0.4628420 0.5969530 0.007438859 0.4620672 0.42912431 0.007249279 0.50000827
[6,] 0.8022590 0.2352531 0.8990588 0.760952083 0.5794459 0.06170874 0.459247551 0.94625005
[7,] 0.4928149 0.1462937 0.5620584 0.554991195 0.6128055 0.55182670 0.874316882 0.55169689
[8,] 0.6631461 0.6260493 0.7679088 0.714076186 0.1254954 0.14316276 0.961166356 0.55342849
[9,] 0.9449110 0.2703502 0.5328246 0.533630873 0.2036671 0.87532137 0.402796595 0.24040106
[10,] 0.4151634 0.3584605 0.6923008 0.599701142 0.4649529 0.98238156 0.628130071 0.17310254
[,9] [,10]
[1,] 0.38552290 0.03078476
[2,] 0.03466566 0.64135540
[3,] 0.44603945 0.57578621
[4,] 0.47220820 0.45735156
[5,] 0.33202974 0.63330763
[6,] 0.09964719 0.19571414
[7,] 0.72649867 0.61591287
[8,] 0.22911914 0.84239810
[9,] 0.30766378 0.12782897
[10,] 0.01139275 0.46489123
You don't need regression:
res <- outer(seq_len(ncol(a)),
seq_len(ncol(b)),
FUN = Vectorize(function(k, l) cor.test(a[, k], b[, l])$p.value))
res[4, 3]
#[1] 0.3753466
summary(lm(a[,4] ~ b[, 3]))$coefficients[2,4]
#[1] 0.3753466
You can of course use lm instead of cor.test in the same way.
Alternatively, you could use package psych:
library(psych)
corr.test(as.data.frame(a), as.data.frame(b), adjust = "none")$p
Of course, its default of adjusting for multiple testing is a reminder that you really should do that.
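With base R only, the matrix of raw p-values can be adjusted afterwards with p.adjust(). This is a sketch that picks the Holm method as an example; which correction is appropriate is a judgment call for your analysis. res_adj is a name introduced here.

```r
set.seed(1232)
a <- matrix(runif(100, min = 6, max = 14), 10)
b <- matrix(sample(c(0, 1), 100, replace = TRUE), 10)

# Raw p-values: rows index columns of a, columns index columns of b
res <- outer(seq_len(ncol(a)), seq_len(ncol(b)),
             FUN = Vectorize(function(k, l) cor.test(a[, k], b[, l])$p.value))

# p.adjust() returns a plain vector, so restore the matrix shape afterwards
res_adj <- matrix(p.adjust(res, method = "holm"), nrow = nrow(res))
res_adj[1:3, 1:3]
```

Adjusted p-values are never smaller than the raw ones, so fewer pairs will pass a given threshold.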

Replicate each time with different standard deviation

I have a vector of standard deviations:
sd_vec <- runif(10, 0, 20)  # 10 values between 0 and 20
[1] 11.658106 9.693493 12.695608 4.091922 5.761061 18.410951 14.710990 12.095944 18.023123
[10] 13.294963
I would like to replicate the following process:
a<-rnorm(10,0,30)
[1] -21.265083 85.557147 23.958170 -32.843328 6.629831 -23.745339 46.094324 51.020059
[9] 1.041724 13.757235
n_columns=50
replicate(n_columns, a+rnorm(length(a), mean=0,sd=sd_vec))
The result should be 10 columns each of which are:
column 1: a + rnorm(length(a),0,11.658106)
column 2: a + rnorm(length(a),0,9.693493)
column 3: a + rnorm(length(a),0,12.695608)
.
.
.
column 10:a + rnorm(length(a),0,13.294963)
Will this use different values of sd_vec for each replication or will it use it for each random number generation?
According to your edit, you may want to try
a + sapply(sd_vec, rnorm, n=length(a), mean=0)
# example
> set.seed(1)
> sd_vec <-runif(10,0,20)
> set.seed(1)
> a<-rnorm(100,0,30)
> n_columns=10
> head(a+sapply(sd_vec, rnorm, n=100, mean=0))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] -22.087869 -15.746650 -8.554735 0.7226986 -18.481801 -24.921835 -32.16206 -33.158153 -38.187974
[2,] 5.732942 18.078702 -6.489666 39.9422684 4.311839 32.504554 42.75921 -18.624133 7.954302
[3,] -29.906010 -13.260709 -2.483113 -36.0217953 -29.841630 -15.576334 -26.76925 -11.915258 -21.741820
[4,] 48.697584 45.395650 43.463125 40.7586401 47.903975 57.600406 47.59359 47.701659 33.782184
[5,] 6.409275 -7.122582 28.836887 2.3249113 13.884993 7.429514 -11.34081 1.960571 18.075706
[6,] -15.229450 -6.025260 -7.288529 -31.4375515 -18.184563 -45.038651 -50.00938 -26.965804 -37.610292
[,10]
[1,] -17.391109
[2,] 6.883342
[3,] -26.144900
[4,] 48.118830
[5,] 9.970987
[6,] -26.668629
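Your current solution passes the whole sd_vec to every rnorm() call, so within each column element i is drawn with sd_vec[i]; it does not use one sd per column.
If you want one column per sd, you can work with matrices. Create a matrix of rnorm draws with the desired sd as follows (this relies on n_columns being equal to length(sd_vec), so that the recycled sd_vec lines up with the columns when filling by row):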
X <- rnorm(length(a)*n_columns, mean=0, sd=sd_vec)
X <- matrix(X, nrow=length(a), ncol=n_columns, byrow=TRUE)
Then add it to a, converted to a matrix:
matrix(a, nrow=length(a), ncol=n_columns) + X
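To convince yourself that each column really uses its own standard deviation, a quick empirical check helps. The sketch below uses made-up values for sd_vec and a long vector a of zeros, chosen only so that the column standard deviations are easy to read off.

```r
set.seed(42)
sd_vec <- c(1, 5, 20)   # hypothetical standard deviations, one per column
a <- rep(0, 20000)      # zeros, so each column contains only the added noise

# One column per element of sd_vec
noisy <- a + sapply(sd_vec, function(s) rnorm(length(a), mean = 0, sd = s))

# Empirical standard deviation of each column; should be close to sd_vec
col_sds <- apply(noisy, 2, sd)
col_sds
```

With the replicate() call from the question, by contrast, every column would show roughly the same spread, because each column mixes all values of sd_vec.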
