Chi Square Analysis using for loop in R

I'm trying to do a chi-square analysis for all combinations of variables in the data, and my code is:
Data <- esoph[, 1:3]
OldStatistic <- NA
for (i in 1:(ncol(Data) - 1)) {
  for (j in (i + 1):ncol(Data)) {
    Statistic <- data.frame("Row" = colnames(Data)[i], "Column" = colnames(Data)[j],
                            "Chi.Square" = round(chisq.test(Data[, i], Data[, j])$statistic, 3),
                            "df" = chisq.test(Data[, i], Data[, j])$parameter,
                            "p.value" = round(chisq.test(Data[, i], Data[, j])$p.value, 3),
                            row.names = NULL)
    temp <- rbind(OldStatistic, Statistic)
    OldStatistic <- Statistic
    Statistic <- temp
  }
}
str(Data)
'data.frame': 88 obs. of 3 variables:
$ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
$ tobgp: Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
Statistic
Row Column Chi.Square df p.value
1 agegp tobgp 2.400 15 1
2 alcgp tobgp 0.619 9 1
My code gives me the chi-square output for variable 1 vs variable 3 and for variable 2 vs variable 3, but the result for variable 1 vs variable 2 is missing. I tried hard but could not fix the code. Any comments and suggestions will be highly appreciated. I'd like to do this cross tabulation for all possible combinations. Thanks in advance.
EDIT
I used to do this kind of analysis in SPSS but now I want to switch to R.

A sample of your data would be appreciated, but I think this will work for you. First, create all pairwise combinations of columns with combn. Then write a function to use with an apply-style function to iterate through the combos. I like to use plyr since it is easy to specify the data structure you want on the back end. Also note you only need to compute the chi-square test once for each combination of columns, which should speed things up quite a bit as well.
library(plyr)
combos <- combn(ncol(Data), 2)
adply(combos, 2, function(x) {
  test <- chisq.test(Data[, x[1]], Data[, x[2]])
  out <- data.frame("Row" = colnames(Data)[x[1]],
                    "Column" = colnames(Data)[x[2]],
                    "Chi.Square" = round(test$statistic, 3),
                    "df" = test$parameter,
                    "p.value" = round(test$p.value, 3))
  return(out)
})
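For completeness, here is a base-R sketch of the same idea without plyr (my addition, not part of the answer above), redefining the Data object from the question and binding the per-pair results with do.call(rbind, ...):
# Base-R alternative: run each test once per column pair and row-bind the results
Data <- esoph[, 1:3]
pairs <- combn(ncol(Data), 2, simplify = FALSE)
do.call(rbind, lapply(pairs, function(x) {
  test <- chisq.test(Data[, x[1]], Data[, x[2]])
  data.frame(Row = colnames(Data)[x[1]],
             Column = colnames(Data)[x[2]],
             Chi.Square = round(unname(test$statistic), 3),
             df = unname(test$parameter),
             p.value = round(test$p.value, 3))
}))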

I wrote my own function. It creates a matrix in which all nominal variables are tested against each other. It can also save the results as an Excel file. It displays all the p-values that are smaller than 5%.
funMassChi <- function(x, delFirst = 0, xlsxpath = FALSE) {
  options(scipen = 999)
  start <- (delFirst + 1)
  ds <- x[, start:ncol(x)]                # drop the first 'delFirst' columns
  cATeND <- ncol(ds)
  catID <- 1:cATeND
  resMat <- ds[1:cATeND, 1:(cATeND - 1)]  # container for the pairwise results
  resMat[, ] <- NA
  for (nCc in 1:(length(catID) - 1)) {
    for (nDc in (nCc + 1):length(catID)) {
      tryCatch({
        chiRes <- chisq.test(ds[, catID[nCc]], ds[, catID[nDc]])
        resMat[nDc, nCc] <- chiRes[[3]]   # store the p-value
      }, error = function(e) {
        cat(paste("ERROR :", "at", nCc, nDc, sep = " "), conditionMessage(e), "\n")
      })
    }
  }
  resMat[resMat > 0.05] <- ""             # blank out p-values above 5%
  Ergebnis <- cbind(CatNames = names(ds), resMat)
  Ergebnis <<- Ergebnis[-1, ]             # assign the result to the global environment
  if (!(xlsxpath == FALSE)) {
    write.xlsx(x = Ergebnis, file = paste(xlsxpath, "ALLChi-", Sys.Date(), ".xlsx", sep = ""),
               sheetName = "Tabelle1", row.names = FALSE)  # requires the xlsx package
  }
}
funMassChi(categorialDATA, delFirst = 3, xlsxpath = "C:/folder1/folder2/")
delFirst drops the first n columns, which is useful if you have a count index or some other column you don't want to test.
I hope this helps someone else.

Related

User Defined function is not responding in R

I tried some code in R, but the user-defined function does not seem to do anything.
factor_func_gender <- function(x) {
  x$Gender = factor(x$Gender, labels = c(0, 1))
}
mixed_data = my_data$depressiondummy[2:30]
factor_func_gender(mixed_data)
I ran this code, and it does not show any error. What can I do?
I was just typing up an answer when @RuiBarradas beat me to it.
For what it's worth, and to reiterate, the two issues with your code are:
1. factor_func_gender does not return the modified data.frame.
2. When calling factor_func_gender you are not storing the output of the function (i.e. your modified data.frame) in a new variable.
Here is what you can do to fix those issues:
# Let's generate some sample data
set.seed(2017);
df <- data.frame(
  Gender = rep(c("Male", "Female"), each = 5),
  Value = runif(10));

# Define the function that returns a data.frame
factor_func_gender <- function(x) {
  x$Gender <- factor(x$Gender, labels = c(0, 1));
  return(x); # Return the dataframe
}

# Apply the function to a data.frame and store output in new data.frame
df.new <- factor_func_gender(df);
str(df.new);
#'data.frame': 10 obs. of 2 variables:
# $ Gender: Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 1 1 1
# $ Value : num 0.924 0.537 0.469 0.289 0.77 ...
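As a side note (a small sketch on top of the sample data above, not part of the original answer), if the conversion is only needed once, the column can also be modified in place without writing a function at all:
# Modify the Gender column of df directly, no wrapper function needed
df$Gender <- factor(df$Gender, labels = c(0, 1));
str(df);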

Function works well on dummy data, on real data "Error: grouping factor must have exactly 2 levels"?

I have created a function which works well on dummy data. But when I run this function on real data, I get back an error
Error in wilcox.test.formula(tab[[dependent]] ~ as.factor(tab$group), :
grouping factor must have exactly 2 levels
and warning messages:
In wilcox.test.default(x = c(11.2558701380866, 31.8401548036613, : cannot compute exact p-value with ties
So, "thresholding" in my function seems not correctly split real data in two groups. Also, the sub-setting of the real data is not correct. But I don't understand why?? The dummy and real tables structure seem the same:
Structure of dummy and real data:
Dummy:
> str(tab)
'data.frame': 80 obs. of 3 variables:
$ infGrad : num 14.15 12.53 3.03 9.21 16.36 ...
$ distance : int 1 1 1 1 1 1 1 1 1 1 ...
$ uniqueGroup: Factor w/ 2 levels "x","y": 1 2 1 2 1 2 1 2 1 2 ...
Real:
> str(tab)
'data.frame': 142 obs. of 10 variables:
$ distance : num 100 100 100 100 100 100 100 100 100 100 ...
$ infGrad : num 11.3 17.4 31.8 11.1 47.8 ...
$ uniqueGroup: Factor w/ 6 levels "x",..: 5 2 5 2 5 5 5 5 3 6 ...
I have found that NAs might cause these problems, or the specification of the formula in wilcox.test(y ~ x).
So I tried adding na.omit to my function, and using wilcox.test(y, x) instead of wilcox.test(y ~ x). Neither of these has worked.
Do you have any ideas on how to make my function work, or how to make it more robust so it accepts my real data? Your help is highly appreciated.
What the code does:
- classify the data into two groups by a "moving threshold"
- test the statistical differences between those groups
I run the function with a nested lapply to vary my thresholds and the different data subsets.
My dummy data:
set.seed(10)
infGrad <- c(rnorm(20, mean = 14, sd = 8),
             rnorm(20, mean = 13, sd = 5),
             rnorm(20, mean = 8, sd = 2),
             rnorm(20, mean = 7, sd = 1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab <- data.frame(infGrad, distance, uniqueGroup)

# Create moving threshold function
movThreshold <- function(th, tab, dependent, ...) {
  tab <- na.omit(tab)
  # Classify data
  tab$group <- ifelse(tab$distance < th, "a", "b") # does not WORK on REAL data
  # Calculate Wilcoxon test
  test <- wilcox.test(tab[[dependent]] ~ as.factor(tab$group), # specify column name
                      data = tab)
  # Put results in a vector
  c(th, dependent, round(test$p.value, 3))
}
# Define two vectors to run through
# unique groups
gr.list <- unique(tab$uniqueGroup)
# unique thresholds
th.list <- c(2, 3, 4)
# apply the function over thresholds and subsets
res <- lapply(gr.list, function(x) lapply(th.list,
                                          movThreshold,
                                          tab = tab[uniqueGroup == x, ], # does not work on REAL data
                                          dependent = "infGrad"))
What seems not to be working on the real data:
Group classification within the function:
tab$group <- ifelse(tab$distance < th, "a", "b")
Data subsetting in the nested lapply loop:
tab = tab[uniqueGroup == x, ]
The issue probably happens because the threshold classifies all observations into a single group.
You can reproduce the error, for instance, by adding a high value to th.list:
# unique thresholds
th.list <- c(2, 3, 4, 100)
The easiest way to avoid this is to check how many distinct groups tab$group contains before performing the test.
This change in the function should suffice:
movThreshold <- function(th, tab, dependent, ...) {
  tab <- na.omit(tab)
  # Classify data
  tab$group <- ifelse(tab$distance < th, "a", "b")
  # Check there are two groups
  if (length(unique(tab$group)) < 2) { return(NA) }
  # Calculate Wilcoxon test
  test <- wilcox.test(tab[[dependent]] ~ as.factor(tab$group), # specify column name
                      data = tab)
  # Put results in a vector
  c(th, dependent, round(test$p.value, 3))
}
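As a quick check (a small sketch reusing the dummy data, gr.list and th.list from the question; the tab$ prefix in the subset is my addition to keep the example self-contained), the extended threshold list now returns NA instead of throwing an error:
# Re-run with a threshold (100) that puts everything into one group
th.list <- c(2, 3, 4, 100)
res <- lapply(gr.list, function(x)
  lapply(th.list, movThreshold,
         tab = tab[tab$uniqueGroup == x, ],
         dependent = "infGrad"))
res[[1]][[4]]  # NA: the test was skipped because only one group was formed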

Gnu R: Rename variable in loop

I would like to create a loop in order to create 15 crosstables from one data.frame (var1), which consists of 15 variables, and another variable (var2); see the data, which can be downloaded here.
The code is now able to give results, but I would like to know how I can rename the variable "mytable" so that I get mytable1, mytable2, etc.
Code:
library(vcd) # for Cramer's V
var1 <- read.csv("~/example.csv", dec = ",")
var2 <- sample(1:43)
i <- 1
while (i <= ncol(var1)) {
  mytable[[i]] <- table(var2, var1[, i])
  assocstats(mytable[[i]])
  print(mytable[[i]])
  i <- i + 1
}
As suggested in the comments, using names like mytable1, mytable2, etc. for a collection of objects is actively discouraged in R. Collecting them all in a list is more useful and cleaner.
One way to do what you want would be this:
library(vcd) # for Cramer's V
data(mtcars)
var1 <- mtcars[, c(2, 8:11)] ## OP's CSV no longer available
var2 <- sample(1:5, 32, TRUE)
mytable <- myassoc <- list() ## store output in a list
## a `for` loop looks simpler than `while`
for (i in 1:ncol(var1)) {
  mytable[[i]] <- table(var2, var1[, i])
  myassoc[[i]] <- assocstats(mytable[[i]])
}
So now to access "mytable2" and "myassoc2" you would simply do:
> mytable[[2]]
var2 0 1
1 4 2
2 6 6
3 1 1
4 2 3
5 5 2
> myassoc[[2]]
X^2 df P(> X^2)
Likelihood Ratio 1.7079 4 0.78928
Pearson 1.6786 4 0.79460
Phi-Coefficient : NA
Contingency Coeff.: 0.223
Cramer's V : 0.229
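If access by name is preferred over access by position, one small addition (an assumption on top of the answer above, not part of it) is to name the list elements after the columns of var1:
# Name the list elements after the columns they were built from
names(mytable) <- names(myassoc) <- colnames(var1)
mytable[["vs"]]  # same table as mytable[[2]] in the example above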

cor.test into data.frame in R

consider the following example:
require(MuMIn)
data(Cement)
d <- data.frame(Cement)
idx <- seq(11, 13)
cor1 <- list()
for (i in 1:length(idx)) {
  d2 <- d[1:idx[i], ]
  cor1[[i]] <- cor.test(d2$X1, d2$X2, method = "pearson")
}
out <- lapply(cor1, function(x) c(x$estimate, x$conf.int, x$p.value))
Here I calculate the correlation for a dataset within an iteration loop.
I now want to generate one data.frame made up of the values in the list 'out'. I tried using
df <- do.call(rbind.data.frame, out)
but the result does not seem right:
> df
c.0.129614123011664..0.195326511912326..0.228579470307565.
1 0.1296141
2 0.1953265
3 0.2285795
c..0.509907346173941...0.426370467476045...0.368861726657293.
1 -0.5099073
2 -0.4263705
3 -0.3688617
c.0.676861607564929..0.691690831088494..0.692365536706126.
1 0.6768616
2 0.6916908
3 0.6923655
c.0.704071702633775..0.542941653020805..0.452566184329491.
1 0.7040717
2 0.5429417
3 0.4525662
This is not what I am after.
How can I generate a data.frame whose first column indicates which list element the cor.test was calculated for (i.e. 1 to 3 in this case), with the second column holding the $estimate, followed by $conf.int and $p.value, resulting in a five-column data.frame?
Is this what you're trying to do? Your question is a bit hard to understand. Is a column of indices from the list really necessary? The whole first column will be exactly the same as the row names (which appear on the left-hand side).
> D <- data.frame(cbind(index = seq(length(out)), do.call(rbind, out)))
> names(D)[2:ncol(D)] <- c('estimate', paste0('conf.int', 1:2), 'p.value')
> D
index estimate conf.int1 conf.int2 p.value
1 1 0.1296141 -0.5099073 0.6768616 0.7040717
2 2 0.1953265 -0.4263705 0.6916908 0.5429417
3 3 0.2285795 -0.3688617 0.6923655 0.4525662
It's not entirely clear what you're asking... you already have such a data frame, just without reasonable column names. You can simplify your code to:
ctests <- lapply(idx, function(x) cor.test(d[1:x,"X1"], d[1:x, "X2"]))
ctests <- lapply(ctests, "[", c("estimate", "conf.int", "p.value"))
as.data.frame(do.call(rbind, lapply(ctests, unlist)))
# estimate.cor conf.int1 conf.int2 p.value
# 1 0.1296141 -0.5099073 0.6768616 0.7040717
# 2 0.1953265 -0.4263705 0.6916908 0.5429417
# 3 0.2285795 -0.3688617 0.6923655 0.4525662
Is this what you need?
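If the index column from the question is still wanted, it can be added on top of the result of the previous snippet (a small sketch reusing the ctests object defined above):
# Prepend a column indicating which list element each row came from
res <- as.data.frame(do.call(rbind, lapply(ctests, unlist)))
res <- cbind(index = seq_along(ctests), res)
res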

R Matrix process with conditional additions

I have to pre-process a big matrix. To make my example easier to understand I will use the following matrix:
Raw data, where columns = people and rows = skills. In R my matrix is:
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
Aim
In my case I need to process the matrix row by row, so there are 3 steps. For each skill row I have to:
- put 0 if i = j (so all diagonal entries equal zero)
- put 0 if one of the two values is 0
- otherwise add the two values together
I will show the 3 steps to be more clear: Step 1 uses row 1, Step 2 uses row 2, and Step 3 uses row 3; each intermediate result was shown as an image in the original post. A maximum matrix is then created by taking the maximum matching across the steps; the final matrix was also shown as an image.
Question
Can someone tell me how to achieve this in R?
And of course the same process should work if my matrix has more rows and columns...
Thanks a lot :)
Here is my implementation in R. The code doesn't execute the steps exactly in the way you specified them. I focused on your final matrix and assumed that this is the main result you're interested in.
test <- matrix(c(18, 12, 15, 0, 13, 0, 14, 0, 12), ncol = 3, nrow = 3)
rownames(test) <- paste("Skill", 1:dim(test)[1], sep = "")
colnames(test) <- paste("People", 1:dim(test)[2], sep = "")
test

# Pairwise combinations
comb.mat <- combn(1:dim(test)[2], 2)
pairwise.mat <- data.frame(matrix(t(comb.mat), ncol = 2))
pairwise.mat$max.score <- 0
names(pairwise.mat) <- c("Person1", "Person2", "Max.Score")

for (i in 1:dim(comb.mat)[2]) { # Loop over the pairs (rows of pairwise.mat)
  first.person <- comb.mat[1, i]
  second.person <- comb.mat[2, i]
  temp.mat <- test[, c(first.person, second.person)]
  temp.mat[temp.mat == 0] <- NA                     # a zero in either column blanks that skill
  temp.rowSums <- rowSums(temp.mat, na.rm = FALSE)  # sums involving NA stay NA
  temp.rowSums[is.na(temp.rowSums)] <- 0
  max.sum <- max(temp.rowSums)
  previous.val <- pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person &
                                         pairwise.mat$Person2 == second.person]
  pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person &
                         pairwise.mat$Person2 == second.person] <- max.sum * (max.sum > previous.val)
}
pairwise.mat
Person1 Person2 Max.Score
1 1 2 25
2 1 3 32
3 2 3 0
person.mat <- matrix(NA, nrow = dim(test)[2], ncol = dim(test)[2])
rownames(person.mat) <- colnames(person.mat) <- paste("People", 1:dim(test)[2], sep = "")
diag(person.mat) <- 0
person.mat[cbind(pairwise.mat[, 1], pairwise.mat[, 2])] <- pairwise.mat$Max.Score
person.mat[lower.tri(person.mat, diag = F)] <- t(person.mat)[lower.tri(person.mat, diag = F)]
person.mat
People1 People2 People3
People1 0 25 32
People2 25 0 0
People3 32 0 0
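For what it's worth, a more compact sketch of the same computation (my alternative, not the author's code) can be written with combn and matrix indexing; on the test matrix above it reproduces person.mat:
# Max of the pairwise skill sums, where a zero in either column zeroes that skill
pair.max <- combn(ncol(test), 2, FUN = function(p) {
  sums <- ifelse(test[, p[1]] == 0 | test[, p[2]] == 0, 0,
                 test[, p[1]] + test[, p[2]])
  max(sums)
})
# Fill a symmetric people-by-people matrix with those maxima
person.mat2 <- matrix(0, ncol(test), ncol(test),
                      dimnames = list(colnames(test), colnames(test)))
person.mat2[t(combn(ncol(test), 2))] <- pair.max
person.mat2 <- person.mat2 + t(person.mat2)
person.mat2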
