Let me describe my problem with a simplified example. I have a contingency table:
datatable <- array(c(1,2,3,4,5,6), dim=c(3,2))
datatable<-cbind(datatable,rowSums(datatable))
datatable<-rbind(datatable,colSums(datatable))
Here, m=3 and n=2, so this is an (m+1)x(n+1) = 4x3 table. Now I want a new m x n array whose ij'th entry is a confidence interval (a list of size 2, if you will) for the estimate datatable[i,j]. Below is a function that computes such an interval for the estimate datatable[i,j] from the arguments a <- datatable[i,j] and b <- datatable[i,n+1]:
CIfunction<-function(a,b) c(-1,1)+a*b
I hope it is not too messy; I couldn't think of an easier example.
How do I create such a table in an elegant way? In the real application, this should be a function that takes arbitrary contingency tables and returns the corresponding CI table.
I already found a way using a double loop over i and j, but it really smells like something that could be done more elegantly with mapply/outer or the like.
The code below involves a little trickery because it relies on how R recycles vectors (see: under what circumstances does R recycle?). array and matrix objects are just folded vectors (stored column-wise), so it is not hard to come up with tricks like this once you understand that behaviour:
datatable <- array (1:6, dim=c(3,2))
datatable <-cbind(datatable,rowSums(datatable))
datatable <-rbind(datatable,colSums(datatable))
# the last column (row totals) is recycled as necessary
lower <- datatable[1:3, 1:2] * datatable[1:3, 3] - 1
upper <- lower + 2
CIlist <- list(lower, upper)
I do suggest, however, that you store datatable without the row and column totals and compute them only on print.
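For example, a minimal sketch of that suggestion, assuming addmargins() from the stats package is an acceptable way to add the totals only when displaying the table:

datatable <- array(1:6, dim = c(3, 2))   # store only the raw counts
addmargins(as.table(datatable))          # row and column sums added only for display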
Thanks for your comments and answers. I had a bit of trouble generalizing the (elegant) method provided by ilir to arbitrary CI functions, so I ended up just doing the double loop mentioned above. My solution to the above problem would be something like:
CItable <- function(datatable, CIfunction)
{
  m <- dim(datatable)[1] - 1
  n <- dim(datatable)[2] - 1
  CItable <- array(NA, dim = c(m, 2 * n))
  for (i in 1:m)
  {
    for (j in 1:n)
    {
      tempint <- CIfunction(datatable[i, j], datatable[i, n + 1])
      CItable[i, 2 * j - 1] <- tempint[1]
      CItable[i, 2 * j]     <- tempint[2]
    }
  }
  return(CItable)
}
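For reference, here is a sketch of how the double loop might be replaced with mapply over an index grid (the same datatable layout is assumed; CItableVec is just a hypothetical name, and it returns the lower and upper bounds as two separate m x n matrices rather than interleaving them):

CItableVec <- function(datatable, CIfunction) {
  m <- nrow(datatable) - 1
  n <- ncol(datatable) - 1
  idx <- expand.grid(i = 1:m, j = 1:n)   # every (i, j) cell of the inner table
  ints <- mapply(function(i, j) CIfunction(datatable[i, j], datatable[i, n + 1]),
                 idx$i, idx$j)           # 2 x (m*n) matrix: row 1 = lower, row 2 = upper
  list(lower = matrix(ints[1, ], m, n),
       upper = matrix(ints[2, ], m, n))
}
CItableVec(datatable, CIfunction)        # using the datatable and CIfunction defined above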
I need to do quality control on a dataset with more than 3000 variables (columns). However, I only want to apply some conditions to a couple of them. A first step would be to replace outliers with NA: I want to replace observations that are more than 3 standard deviations from the mean with NA. I got this working column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions, but I haven't managed to get it working from them yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x   <- sd(x, na.rm = TRUE)
  x_outside_3s <- which(abs(x - mean_x) > 3 * sd_x)  # positions more than 3 SD from the mean
  x[x_outside_3s] <- NA  # no need for ifelse here
  x
}
Of course, you can choose any function name you want; more descriptive is better.
Then, if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
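For a quick check, here is how that would look applied to the sample data from the question (renamed my_data to avoid masking the data function, as noted above):

name    <- factor(c("A","B","C","D","E","F","G","H","H"))
height  <- c(120, NA, 150, 170, NA, 146, 132, 210, NA)
age     <- c(10, 20, 0, 30, 40, 50, 60, NA, 130)
mark    <- c(100, 0.5, 100, 50, 90, 100, NA, 50, 210)
my_data <- data.frame(name = name, mark = mark, age = age, height = height)

cols_to_loop_over <- c("age", "height", "mark")  # numeric columns only
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
my_data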
I am new to R and this site. My aim with the following, assuredly unnecessarily arcane code is to create an R function that produces a special type of box plot in ggplot2. I first need to process the potential input to it by calculating the variables that I shall later wish to have plotted.
I start by generating some random data, called datos:
c1=rnorm(98,47,23)
c2=rnorm(98,56,13)
c3=rnorm(98,52,7)
fila1=as.matrix(t(c(-2,15,30)))
colnames(fila1)=c("c1","c2","c3")
fila2=as.matrix(t(c(-20,5,20)))
colnames(fila2)=c("c1","c2","c3")
datos=rbind(data.frame(c1,c2,c3),fila1,fila2)
rm(c1,c2,c3,fila1,fila2)
Then I calculate the variables to later be plotted, which include, for each of the columns present in datos, the mean (puntoMedio), the first and third quartiles (cuar1, cuar3), the interquartile range (iqr), the lower bound of potential submean whiskers (limInf), the upper bound of potential supermean whiskers (limSup) and the outliers (submean outliers vAtInf and supermean outliers vAtSup, to be combined in vAt):
puntoMedio=apply(datos,MARGIN=2,FUN=mean)
cuar1=apply(datos,MARGIN=2,FUN=quantile,probs=.25)
cuar3=apply(datos,MARGIN=2,FUN=quantile,probs=.75)
cuar=rbind(cuar1,cuar3)
iqr=apply(cuar,MARGIN=2,FUN=diff)
cuar=rbind(cuar,iqr,puntoMedio)
limInf=array(dim=ncol(datos))
for(i in 1:ncol(datos)){
limInf0=as.matrix(t(cuar[1,]-1.5*cuar[3,]))
if(length(datos[datos[,i]<limInf0[,i],i])>0){
limInf[i]=limInf0[,i]
}else{limInf[i]=min(datos[,i])}
}
limSup=array(dim=ncol(datos))
for(i in 1:ncol(datos)){
limSup0=as.matrix(t(cuar[2,]+1.5*cuar[3,]))
if(length(datos[datos[,i]>limSup0[,i],i])>0){
limSup[i]=limSup0[,i]
}else{limSup[i]=max(datos[,i])}
}
d=data.frame(t(rbind(cuar,limInf,limSup)))
rm(cuar)
vAtInf=datos
for(i in 1:ncol(vAtInf)){
vAtInf[vAtInf[,i]>limInf0[,i],i]=NA
}
colnames(vAtInf)=c("vAtInfc1","vAtInfc2","vAtInfc3")
vAtSup=datos
for(i in 1:ncol(vAtSup)){
vAtSup[vAtSup[,i]<limSup0[,i],i]=NA
}
colnames(vAtSup)=c("vAtSupc1","vAtSupc2","vAtSupc3")
datos=cbind(datos,vAtInf,vAtSup)
rm(limInf0,limSup0,cuar1,cuar3,i,iqr,limInf,limSup,puntoMedio)
Everything works as desired up until here. I have two data frames d and datos, the former of no interest here, and the latter, which in this specific case comprises nine columns: three of all values, three of the corresponding submean outliers and three of the corresponding supermean outliers (these latter six padded with NA). I now wish to extract all outliers by column, wherefore I have tried formulating the following loop. While it does work giving neither error nor warning, it also does not give the desired output in vAt (again, the by-column [columns 4:9] outliers from datos). The problem, then, as far as I have been able to discern, occurs in the nested for-loop, upon attempting to input i into vAt: each iteration of the loop erases the last, such that upon completion of the entire loop, vAt only contains NA and the outliers from the last column/of the last iteration.
for(i in ((ncol(datos)/3)+1):ncol(datos)){
vAt=matrix(nrow=.25*nrow(datos),ncol=ncol(datos)-(ncol(datos)/3))
colnames(vAt)=c(((ncol(datos)/3)+1):ncol(datos))
if(length(datos[,i][is.na(datos[,i])==F])>0){
for(j in 1:(length(datos[,i][is.na(datos[,i])==F]))){
nom=as.character(i)
vAt[j,nom]=datos[,i][is.na(datos[,i])==F][j]
}
}else{next}
}
I have not been able to find any existent thread that answers my question. Thanks for any help.
The problem is that you are initialising vAt inside the loop here.
Moving the initialisation statements outside the for loop will fix the problem that you are facing:
vAt=matrix(nrow=.25*nrow(datos),ncol=ncol(datos)-(ncol(datos)/3))
colnames(vAt)=c(((ncol(datos)/3)+1):ncol(datos))
for(i in ((ncol(datos)/3)+1):ncol(datos)){
if(length(datos[,i][is.na(datos[,i])==F])>0){
for(j in 1:(length(datos[,i][is.na(datos[,i])==F]))){
nom=as.character(i)
vAt[j,nom]=datos[,i][is.na(datos[,i])==F][j]
}
}else{next}
}
However, there are various improvements which you can make to the code as it stands:
Using vectorisation and the *apply family of functions instead of for loops.
Not comparing logical vectors with == F; just use !is.na(...) directly.
Using sum(!is.na(...)) instead of length(datos[,i][!is.na(datos[,i])]).
And some more. These will not change the correctness of the code, but will make it more efficient and more idiomatic.
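For example, the nested loop could in principle be replaced by something along these lines (just a sketch, assuming the outlier columns are the last six columns of datos and that a list of per-column outlier vectors is acceptable as output; a list avoids having to guess how many rows the matrix needs, since different columns can have different numbers of outliers):

outlier_cols <- (ncol(datos)/3 + 1):ncol(datos)               # columns 4:9 in this example
vAt <- lapply(datos[outlier_cols], function(x) x[!is.na(x)])  # drop the NA padding per column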
I am trying to set up a Gibbs sampler in R where I update my value at each step.
I have a function in R that I want to maximise over 2 values: my previous value and a new one.
So I know the maximum outcome of the function applied to both values, but how do I then select the best input without doing it manually? (I need to do a lot of iterations.) Here is an idea of the code and the variables:
g0<-function(k){sample(0:1,k,replace=T)}
This generates a k-dimensional vector with entries 0 or 1, uniformly at random; it is the initial starting point for my chain. If the i'th entry is 1, then the i'th variable is included in the design matrix.
X1 is the full design matrix.
Xg<-function(g){
Xg<-cbind(X1[,1]*g[1],X1[,2]*g[2],X1[,3]*g[3],X1[,4]*g[4],X1[,5]*g[5],X1[,6]*g[6],X1[,7]*g[7])
return(Xg[,which(!apply(Xg,2,FUN = function(x){all(x == 0)}))])
}
Xg0<-Xg(g0)
This is the reduced design matrix for g0.
c<-1:100000
mp<-function(g){
mp<-sum((1/(c*(c+1)^-((q+1)/2)))*
(t(Y)%*%Y-(c/(c+1))*t(Y)%*%Xg(g)%*%solve(t(Xg(g))%*%Xg(g))%*%t(Xg(g))%*%Y)^(-27/2))
return(mp)
}
This is my function.
Therefore, if I have mp(g) and mp(g*) for two inputs g and g*, such that the maximum is mp(g*), how can I return g*?
Thanks for any help, and if you have any queries just ask. Sorry about the messy code as well; I have not used this site before.
Like this:
inputs <- list(g, g2)
outputs <- sapply(inputs, mp)
best.input <- inputs[which.max(outputs)]
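Note that inputs[which.max(outputs)] is itself a one-element list; use inputs[[which.max(outputs)]] if you want the vector directly. Inside an iteration loop the same pattern might look roughly like this (only a sketch: propose_g and n_iter are hypothetical names standing in for however you generate the candidate vector and however many iterations you run):

g_current <- g0(7)                            # assuming k = 7, matching the 7 columns used in Xg()
for (iter in 1:n_iter) {
  g_new     <- propose_g(g_current)           # hypothetical: propose a new 0/1 vector
  inputs    <- list(g_current, g_new)
  outputs   <- sapply(inputs, mp)
  g_current <- inputs[[which.max(outputs)]]   # keep whichever input gives the larger mp()
}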
I've written a short 'for' loop to find the minimum euclidean distance between each row in a dataframe and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, while not that much is being saved in memory, it is very very slow for large matrices (my use case of ~150K rows is still running).
I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.
Thanks in advance (and for your patience).
require(proxy)
df<-data.frame(matrix(runif(10*10),nrow=10,ncol=10), row.names=paste("site",seq(1:10)))
min.dist<-function(df) {
#df for results
all.min.dist<-data.frame()
#set up for loop
for(k in 1:nrow(df)) {
#calculate dissimilarity between each row and all other rows
df.dist<-dist(df[k,],df[-k,])
# find minimum distance
min.dist<-min(df.dist)
# get rowname for minimum distance (id of nearest point)
closest.row<-row.names(df)[-k][which.min(df.dist)]
#combine outputs
all.min.dist<-rbind(all.min.dist,data.frame(orig_row=row.names(df)[k],
dist=min.dist, closest_row=closest.row))
}
#return results
return(all.min.dist)
}
#example
min.dist(df)
This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.
min.dist <- function(df) {
  which.closest <- function(k, tdf) {
    # tdf is the data transposed, so each column of tdf is one of the original rows
    d <- colSums((tdf[, -k] - tdf[, k]) ^ 2)
    m <- which.min(d)
    data.frame(orig_row    = colnames(tdf)[k],
               dist        = sqrt(d[m]),
               closest_row = colnames(tdf)[-k][m])
  }
  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}
If this is still too slow, as a suggested improvement, you could compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage.
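A rough sketch of that blocking idea (not part of the original answer; min.dist.blocked and block_size are hypothetical names, and the identity |a - b|^2 = |a|^2 + |b|^2 - 2*a.b is used so each block can be handled with matrix operations):

min.dist.blocked <- function(df, block_size = 1000) {
  X  <- as.matrix(df)
  n  <- nrow(X)
  sq <- rowSums(X ^ 2)                       # |x_i|^2 for every row
  out <- vector("list", ceiling(n / block_size))
  for (b in seq_along(out)) {
    idx <- ((b - 1) * block_size + 1):min(b * block_size, n)
    # squared distances from the block's rows to all rows: |a|^2 + |b|^2 - 2*a.b
    D2 <- outer(sq[idx], sq, "+") - 2 * X[idx, , drop = FALSE] %*% t(X)
    D2[cbind(seq_along(idx), idx)] <- Inf    # ignore each point's distance to itself
    m <- max.col(-D2)                        # column index of the row-wise minimum
    out[[b]] <- data.frame(orig_row    = row.names(df)[idx],
                           dist        = sqrt(pmax(D2[cbind(seq_along(idx), m)], 0)),
                           closest_row = row.names(df)[m])
  }
  do.call(rbind, out)
}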
Edit: Also read https://stackoverflow.com/a/16670220/1201032
Usually, built-in functions are faster than coding it yourself (because they are coded in Fortran or C/C++ and optimized).
It seems that the function dist {stats} answers your question spot on:
Description
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
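For example, a sketch of using the full distance matrix for the nearest-neighbour lookup (stats::dist is called explicitly here because require(proxy) above masks dist; also note this materialises the whole n x n matrix, which is exactly what becomes infeasible at ~150K rows):

dm <- as.matrix(stats::dist(df))  # full matrix of pairwise Euclidean distances
diag(dm) <- Inf                   # so a point is never its own nearest neighbour
data.frame(orig_row    = row.names(df),
           dist        = apply(dm, 1, min),
           closest_row = row.names(df)[apply(dm, 1, which.min)])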
I want to make a loop which contains two variables, i and j. For each i in 1:24, j can also range over 1:24,
but I don't know how to make this loop:
i=1
while(i<=24)
{
j=seq(1,24,by=1)
for (j in j)
{
cor[i,j]
}
}
i=i+1
Is this right? My desired output is cor[i,j].
In order to accomplish your final goal try...
cor(myMatrix)
The result is a matrix containing all of the correlations of all of the columns in myMatrix.
If you want to try to go about it the way you were, it's probably best to generate a matrix of all of the possible combinations of your items using combn. Try combn(1:4,2) and see what it looks like for a small example. For your example with 24 columns, the best way to cycle through all combinations using a for loop is...
myMatrix <- matrix(rnorm(240), ncol = 24)
myIndex <- combn(1:24,2)
for(i in 1:ncol(myIndex)){
temp <- cor(myMatrix[,myIndex[1,i]],myMatrix[,myIndex[2,i]])
print(c(myIndex[,i],temp))
}
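If you want to keep the results instead of just printing them, one variation (again just a sketch) is to fill a data frame as you go:

results <- data.frame(col1 = myIndex[1, ], col2 = myIndex[2, ], cor = NA)  # one row per column pair
for (i in 1:ncol(myIndex)) {
  results$cor[i] <- cor(myMatrix[, myIndex[1, i]], myMatrix[, myIndex[2, i]])
}
head(results)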
So, while it's possible to do it with a for loop in R, you'd never do it that way.
(and this whole answer is based on a wild guess about what you're actually trying to accomplish because the question, and your comments, are very hard to figure out)