Using R to find several Fisher's exact test p-values in sequence

I have over 5000 data sets that I want to find p-values for using Fisher's exact test in R. They are saved in a CSV file and look something like this:
100 5000 400 500
250 400 600 400
... ... ... ...
where each row represents a contingency table.
Right now I'm doing one contingency table at a time, which will take me forever.
This is the code I have used so far:
alltables<-read.table("untitled1.csv") ##to read my data
apply(alltables,1, function(x) fisher.test(matrix(x,nr=2))$p.value)
But then I get the error "Error in fisher.test(matrix(x, nr = 2)) : 'x' must have at least 2 rows and columns"

You can do something like the following. But since you didn't really give a reproducible example, I first create some toy-data:
set.seed(1)
print(dat <- matrix(rbinom(n = 40, size = 1000, prob = 0.5), ncol = 4))
# [,1] [,2] [,3] [,4]
# [1,] 500 526 494 505
# [2,] 497 500 512 493
# [3,] 480 488 500 512
# [4,] 464 513 498 497
# [5,] 527 503 518 508
# [6,] 504 517 511 483
# [7,] 519 493 522 471
# [8,] 486 490 497 507
# [9,] 492 499 475 509
#[10,] 530 486 488 501
# Function to be applied row-wise
rowFisher <- function(x, ...) {
  # Reshape one row into a 2 x 2 table; extra arguments are passed on to fisher.test()
  return(fisher.test(matrix(x, nrow = 2), ...)$p.value)
}
# Apply the function row-wise
apply(dat, 1, rowFisher)
# [1] 0.7557946 0.6548804 0.9641424 0.2603181 0.7912598 0.3729036 0.5916955 0.9283668 0.5585135
#[10] 0.2111895
Edit: I didn't see your comments, but this should do the trick. If not, you probably have some NAs or other non-numeric values somewhere in your data.
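One possible cause of the error in the question, assuming the file really is comma-separated and has no header row: read.table() splits on whitespace by default, so each line is read as a single column and matrix(x, nr = 2) becomes a 2 x 1 matrix. The sketch below writes a small hypothetical file (a stand-in for "untitled1.csv", using the two rows shown in the question) and reads it with read.csv() instead.

```r
# Hypothetical stand-in for "untitled1.csv": comma-separated, no header row
csv_path <- tempfile(fileext = ".csv")
writeLines(c("100,5000,400,500", "250,400,600,400"), csv_path)

# read.table() splits on whitespace by default, so each line ends up in one column
n_cols_read_table <- ncol(read.table(csv_path))

# read.csv() splits on commas; header = FALSE keeps the first line as data
alltables <- read.csv(csv_path, header = FALSE)

# One p-value per row, each row reshaped into a 2 x 2 contingency table
p_values <- apply(alltables, 1, function(x) fisher.test(matrix(x, nrow = 2))$p.value)
print(p_values)
```

If the real file is whitespace-separated after all, read.table() is fine and the problem lies elsewhere, for example in non-numeric values as mentioned above.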

Related

Adding the subsequent numbers of list containing random numbers, to the subsequent indices

I have a list with some random numbers. For each number, I want to insert the next two consecutive integers right after it in the list, without using a for loop.
So, let's say I have this list: v <- c(238,1002,569,432,6,1284)
Then the output I want is:
v <- c(238,239,240,1002,1003,1004,569,570,571,432,433,434,6,7,8,1284,1285,1286)
I am still pretty new to R, so I don't really know what I'm doing, and I've tried for hours now with no results. I have made it work using a for loop, but I know R isn't too happy with loops, so I really need to vectorize it somehow.
Does anybody know how I can implement this in my R code in an efficient manner?
You can just use outer to calculate the outer sum:
res <- outer(0:2, v, "+")
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 238 1002 569 432 6 1284
#[2,] 239 1003 570 433 7 1285
#[3,] 240 1004 571 434 8 1286
You can then turn the resulting matrix into a vector:
res <- as.vector(res)
#[1] 238 239 240 1002 1003 1004 569 570 571 432 433 434 6 7 8 1284 1285 1286
This works because matrices are stored in "column-major" order in R, so as.vector() reads the matrix column by column.
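As an alternative sketch that skips the intermediate matrix, one could repeat each element three times and rely on R recycling the offsets 0, 1, 2 along the result. This is not from the original answer; it only assumes the same v as in the question.

```r
v <- c(238, 1002, 569, 432, 6, 1284)

# Repeat every element 3 times, then add 0, 1, 2 (recycled over the whole vector)
res <- rep(v, each = 3) + 0:2
print(res)
```

Both approaches give the same vector; outer() is arguably easier to read when the offsets are not a simple 0:2.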

How do I automatically populate a matrix with intervals given the size and number of the intervals?

bucket_size <- 30
bucket_amount <- 24
matrix(???, bucket_amount, 2)
I'm trying to populate a (bucket_amount x 2) matrix using the interval size given by bucket_size. Here is what it would look like with the current given values of bucket_size and bucket_amount.
[1 30]
[31 60]
[61 90]
[91 120]
.
.
.
[691 720]
I can obviously hard code this specific example out, but I'm wondering how I can do this for different values of bucket_size and bucket_amount and have the matrix populate automatically.
We can use seq, specifying from and by as 'bucket_size' and length.out as 'bucket_amount', to create a sequence of upper bounds ('v1'). The lower bounds ('v2') are 1 followed by 'v1' without its last element, plus 1. We then cbind these two vectors to create the matrix:
v1 <- seq(bucket_size, length.out = bucket_amount , by = bucket_size)
v2 <- c(1, v1[-length(v1)] + 1)
m1 <- cbind(v2, v1)
Output:
> head(m1)
v2 v1
[1,] 1 30
[2,] 31 60
[3,] 61 90
[4,] 91 120
[5,] 121 150
[6,] 151 180
> tail(m1)
v2 v1
[19,] 541 570
[20,] 571 600
[21,] 601 630
[22,] 631 660
[23,] 661 690
[24,] 691 720
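To reuse this for other values, the same idea can be wrapped in a small function. The name make_buckets and the column names lower and upper are hypothetical, chosen only for this sketch.

```r
# Hypothetical helper, not part of the original answer
make_buckets <- function(bucket_size, bucket_amount) {
  upper <- seq_len(bucket_amount) * bucket_size  # upper bound of each bucket
  lower <- upper - bucket_size + 1               # lower bound of each bucket
  cbind(lower, upper)
}

m <- make_buckets(bucket_size = 30, bucket_amount = 24)
print(head(m, 3))
print(tail(m, 1))
```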

How to automatically multiply and add some coefficient to a data frame in R?

I have this data set
obs <- data.frame(replicate(8,rnorm(10, 0, 1)))
and this coefficients
coeff <- data.frame(replicate(8,rnorm(2, 0, 1)))
For each column of obs, I need to multiply by the first element of the corresponding column of coeff, and then add the second element of that same coeff column. I need to do the same for all 8 columns. I read somewhere that if you copy and paste code more than once you are doing something wrong... and that's exactly what I did.
obs.transformed.X1 <-(obs[1]*coeff[1,1])+coeff[2,1]
obs.transformed.X2 <-(obs[2]*coeff[1,2])+coeff[2,2]
.
.
.
.
.
obs.transformed.X8 <-(obs[8]*coeff[1,8])+coeff[2,8]
I know there is a smarter way to do this (loop?), but I just couldn't figure it out. Any help will be appreciated.
This is what I've tried but I am only getting the last column
for (i in 1:length(obs)) {
results=(obs[i]*coeff[1,i])+coeff[2,i]
}
If you coerce to matrix class, you can use the sweep function twice in sequence: first multiplying the columns by the first row of coeff, and then adding the second row, again column-wise:
obs <- data.frame(matrix(1:60, 10)) # I find checking with random numbers difficult
coeff <- data.frame(matrix(1:12,2))
sweep(
sweep(as.matrix(obs), 2, as.matrix(coeff)[1,], "*"), # first operation is "*"
2, as.matrix(coeff)[2,], "+" ) # arguments for the addition
#--------------------------------
X1 X2 X3 X4 X5 X6
[1,] 3 37 111 225 379 573
[2,] 4 40 116 232 388 584
[3,] 5 43 121 239 397 595
[4,] 6 46 126 246 406 606
[5,] 7 49 131 253 415 617
[6,] 8 52 136 260 424 628
[7,] 9 55 141 267 433 639
[8,] 10 58 146 274 442 650
[9,] 11 61 151 281 451 661
[10,] 12 64 156 288 460 672
I decreased the number of columns because your original code was too wide for my RStudio console, but this should be very general. I suspect there's an equivalent matrix operator method, but it didn't come to me.
I came up with this solution:
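One way such a matrix-operator version might look, as a sketch rather than the answerer's method: transpose the matrix so that R's recycling runs along the original columns, apply the multiplication and addition, then transpose back. The names slope and intercept are introduced here for illustration only.

```r
obs <- data.frame(matrix(1:60, 10))
coeff <- data.frame(matrix(1:12, 2))

slope <- unlist(coeff[1, ])      # one multiplier per column of obs
intercept <- unlist(coeff[2, ])  # one offset per column of obs

# After t(), each original column is a row, so the length-6 vectors recycle correctly
res <- t(t(as.matrix(obs)) * slope + intercept)
print(res)
```

With the toy data above, this should match the result of the nested sweep() calls.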
results = list()
for (i in 1:length(obs)) {
results[[i]]=(obs[i]*coeff[1,i])+coeff[2,i]
}
results <- as.data.frame(results)
Is there any efficient way to do this?
I used Map
results <- as.data.frame(Map(`+`, Map(`*`, obs, coeff[1,]), coeff[2,]))
This should also give what you are looking for.

R Conditional summing

I've just started my adventure with programming in R. I need to create a program summing numbers divisible by 3 and 5 in the range of 1 to 1000, using the '%%' operator. I came up with an idea to create two matrices with the numbers from 1 to 1000 in one column and their remainders in the second one. However, I don't know how to sum the proper elements (kind of "sum if" function in Excel). I attach all I've done below. Thanks in advance for your help!
s1<-1:1000
in1<-s1%%3
m1<-matrix(c(s1,in1), 1000, 2, byrow=FALSE)
s2<-1:1000
in2<-s2%%5
m2<-matrix(c(s2,in2),1000,2,byrow=FALSE)
Mathematically, the best way is probably to find the least common multiple of the two numbers and check the remainder vs that:
# borrowed from Roland Rau
# http://r.789695.n4.nabble.com/Greatest-common-divisor-of-two-numbers-td823047.html
gcd <- function(a,b) if (b==0) a else gcd(b, a %% b)
lcm <- function(a,b) abs(a*b)/gcd(a,b)
s <- seq(1000)
s[ (s %% lcm(3,5)) == 0 ]
# [1] 15 30 45 60 75 90 105 120 135 150 165 180 195 210
# [15] 225 240 255 270 285 300 315 330 345 360 375 390 405 420
# [29] 435 450 465 480 495 510 525 540 555 570 585 600 615 630
# [43] 645 660 675 690 705 720 735 750 765 780 795 810 825 840
# [57] 855 870 885 900 915 930 945 960 975 990
Since your s is every number from 1 to 1000, you could instead do
seq(lcm(3,5), 1000, by=lcm(3,5))
Just use sum on either result if that's what you want to do.
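For completeness, a short sketch of the summing step, assuming "divisible by 3 and 5" means divisible by both (that is, by 15). The variable names are illustrative only.

```r
s <- 1:1000

# Filter with the remainder operator, then sum
total_filter <- sum(s[s %% 3 == 0 & s %% 5 == 0])

# Generate the multiples of 15 directly, then sum
total_seq <- sum(seq(15, 1000, by = 15))

print(c(total_filter, total_seq))  # both are 33165
```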
Props to @HoneyDippedBadger for figuring out what the OP was after.
See if this helps
x <- 1:1000 ## store the numbers 1 to 1000 in x
x ## print x
Div <- x[x %% 3 == 0 & x %% 5 == 0] ## extract the numbers from 1 to 1000 divisible by both 3 and 5
Div ## print the numbers divisible by both 3 and 5
length(Div) ## how many such numbers there are
table(x %% 3 == 0 & x %% 5 == 0) ## count how many values are TRUE or FALSE for the condition
sum(Div) ## sum of the numbers from 1 to 1000 divisible by both 3 and 5

Clustering Large Data Matrix using R

I have a large data matrix (33183x1681), each row corresponding to one observation and each column corresponding to the variables.
I applied k-medoids clustering using the pam() function in R, and I tried to visualize the clustering results using the built-in plots available for pam() objects. I got this error:
Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
cannot use cor=TRUE with a constant variable
I think this problem is because of the high dimensionality of the data matrix I'm trying to cluster.
Any thoughts/ideas how to tackle this issue?
Check out the clara() function in package cluster which is shipped with all versions of R.
library("cluster")
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2, samples=50)
clarax
> clarax
Call: clara(x = x, k = 2, samples = 50)
Medoids:
[,1] [,2]
[1,] -1.15913 0.5760027
[2,] 50.11584 50.3360426
Objective function: 10.23341
Clustering vector: int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Cluster sizes: 200 300
Best sample:
[1] 10 17 45 46 68 90 99 150 151 160 184 192 232 238 243 250 266 275 277
[20] 298 303 304 313 316 327 333 339 353 358 398 405 410 411 421 426 429 444 447
[39] 456 477 481 494 499 500
Available components:
[1] "sample" "medoids" "i.med" "clustering" "objective"
[6] "clusinfo" "diss" "call" "silinfo" "data"
Note that you should study the help for clara() (?clara) in some detail as well as the references cited in order to make the clustering performed by clara() as close to or identical to pam().
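Separately, the error message in the question points at a constant variable, that is, a column whose values are all identical, rather than at the dimensionality as such. A minimal sketch for finding and dropping such columns before plotting, using made-up toy data and hypothetical names (is_constant, x_clean):

```r
set.seed(42)

# Toy data: the second column is constant, the others vary
x <- cbind(rnorm(20), rep(1, 20), rnorm(20))

# A column is constant if it contains only one distinct value
is_constant <- apply(x, 2, function(col) length(unique(col)) == 1)

# Keep only the non-constant columns; drop = FALSE preserves the matrix shape
x_clean <- x[, !is_constant, drop = FALSE]
print(which(is_constant))
```

Whether dropping these columns is appropriate depends on the data; it only removes the condition that the error message complains about.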
