I would like to create a loop that builds 15 crosstables from one data.frame (var1), which consists of 15 variables, and another variable (var2); see the data which can be downloaded here.
The code already gives results, but I would like to know how I can rename the variable "mytable" so that I get mytable1, mytable2, etc.
Code:
library(vcd) # for Cramer's V
var1 <- read.csv("~/example.csv", dec=",")
var2 <- sample(1:43)
mytable <- list() # the tables are collected here
i <- 1
while(i <= ncol(var1)) {
  mytable[[i]] <- table(var2, var1[,i])
  assocstats(mytable[[i]])
  print(mytable[[i]])
  i <- i + 1
}
As suggested in the comments, using names like mytable1, mytable2, etc. for a collection of objects is actively discouraged in R. Collecting everything in a list is more useful and cleaner.
One way to do what you want would be this:
library(vcd) # for Cramer's V
data(mtcars)
var1 <- mtcars[ , c(2, 8:11)] ##OP's CSV no longer available
var2 <- sample(1:5, 32, TRUE)
mytable <- myassoc <- list() ##store output in a list
##a `for` loop looks simpler than `while`
for(i in 1:ncol(var1)){
mytable[[i]] <- table(var2, var1[ , i])
myassoc[[i]] <- assocstats(mytable[[i]])
}
So now to access "mytable2" and "myassoc2" you would simply do:
> mytable[[2]]
var2 0 1
1 4 2
2 6 6
3 1 1
4 2 3
5 5 2
> myassoc[[2]]
X^2 df P(> X^2)
Likelihood Ratio 1.7079 4 0.78928
Pearson 1.6786 4 0.79460
Phi-Coefficient : NA
Contingency Coeff.: 0.223
Cramer's V : 0.229
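Since everything is stored in a list, you can also pull a single statistic out of all tables at once; for example, Cramér's V for every variable (a small sketch using the cramer component that assocstats() returns):
## Cramer's V for each of the crosstables
sapply(myassoc, function(a) a$cramer)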
In R, I would like to know how I can find the index/indices of the value(s) sampled, for example using the function sample.
In Matlab, it appears this is quite easily done by requesting the output argument idx in the function datasample. Explicitly, taken from Matlab's documentation page for the function datasample:
[y,idx] = datasample(data,k,...) returns an index vector indicating
which values datasample sampled from data.
I would like to know if such a thing can be accomplished in R, and how.
Example:
set.seed(12)
sample(c(0.3,78,45,0.8,0.3,0.8,77), size=1, replace=TRUE)
0.3
How can I know which of the two 0.3's was that one?
We can create a named vector and then sample from it:
v1 <- c(LETTERS[1:10], LETTERS[1])
names(v1) <- seq_along(v1)
v2 <- sample(v1, 20, replace=TRUE)
as.integer(names(v2))
#[1] 10 11 4 2 1 4 6 9 1 1 2 9 2 2 2 3 4 7 3 6
Using the OP's data
set.seed(12)
v1 <- c(0.3,78,45,0.8,0.3,0.8,77)
names(v1) <- seq_along(v1)
set.seed(12)
sample(v1, size=1, replace=TRUE)
# 1
#0.3
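Alternatively, you can sample the positions themselves and then index into the vector, which gives you the drawn index directly (a sketch on the same data):
## sample an index instead of a value, then look the value up
set.seed(12)
v1 <- c(0.3, 78, 45, 0.8, 0.3, 0.8, 77)
idx <- sample(seq_along(v1), size=1, replace=TRUE)
idx      # which element was sampled
v1[idx]  # the sampled value itself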
I'm having trouble with a particular task in my code. I have a data frame as
n <- 6
set.seed(123)
df <- data.frame(x=paste0("x",seq_along(1:n)), A=sample(c(-2:2),n,replace=TRUE), B=sample(c(-1:3),n,replace=TRUE))
#
# x A B
# 1 x1 -1 1
# 2 x2 1 3
# 3 x3 0 1
# 4 x4 2 1
# 5 x5 2 3
# 6 x6 -2 1
and a decision tree as
A>0;Y;Y;N;N
B>1;Y;N;Y;N
C;1;2;2;1
that I load by
dt <- read.csv2("tmp.csv", header=FALSE)
I'd like to create a loop over all the possible combinations of (A>0) and (B>1) and set the C value for the subset of rows that satisfies each condition. So, here's what I did:
nr <- 3
nc <- 5
cond <- dt[1:(nr-1),1,drop=FALSE]
rule <- dt[nr,1,drop=FALSE]
subdf <- vector(mode="list",2^(nr-1))
for (i in 2:nc) {
check <- paste0("")
for (j in 1:(nr-1)) {
case <- paste0(dt[j,1])
if (dt[j,i]=="N")
case <- paste0("!",case)
check <- paste0(check, "(", case, ")" )
if (j<(nr-1))
check <- paste0(check, "&")
}
subdf[i] <- subset(df,check)
subdf[i]$C <- dt[nr,i]
}
unlist(subdf)
Unfortunately, I get an error from subset because it cannot parse the conditions from my string statements. What should I do?
Your issue is how you create the subset: subset expects a logical vector and you gave it a string (check). So the simplest solution here is to add a parse/eval step. I feel there is a more elegant way to solve this problem and I hope someone will come along and post it, but you can fix the final part of your code with the following:
mysubset <- subset(df,with(df,eval(parse(text=check))))
if(nrow(mysubset)>0){
mysubset$C <- dt[nr,i]
}
subdf[[i]]<-mysubset
I have added the parse/eval part to generate a logical vector so that only the TRUE cases are kept, and added a check for whether C can be added (assigning it would give an error if there are no rows).
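For completeness, slotting that fix into the original loop could look roughly like this (a sketch reusing the question's variable names; empty combinations get no C column, so only the non-empty pieces are combined at the end):
subdf <- vector(mode="list", 2^(nr-1))
for (i in 2:nc) {
  check <- ""
  for (j in 1:(nr-1)) {
    case <- paste0(dt[j,1])
    if (dt[j,i]=="N")
      case <- paste0("!", case)
    check <- paste0(check, "(", case, ")")
    if (j<(nr-1))
      check <- paste0(check, "&")
  }
  ## evaluate the condition string against df
  mysubset <- subset(df, with(df, eval(parse(text=check))))
  if (nrow(mysubset)>0) mysubset$C <- dt[nr,i]
  subdf[[i]] <- mysubset
}
## bind the non-empty subsets back into one data.frame
do.call(rbind, subdf[sapply(subdf, function(d) !is.null(d) && nrow(d)>0)])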
Based on the previous answer, I came up with a more elegant/practical way of generating a vector of combined rules, and then applying them all to the data, using apply/lapply.
##create list of formatted rules
#format each 'building' block separately,
#based on rows in 'dt'.
part_conditions <- apply(dt[-nrow(dt),],MARGIN=1,FUN=function(x){
res <- sprintf("(%s%s)", ifelse(x[-1]=="Y","","!"), x[1])
})
# > part_conditions
# 1 2
# [1,] "(A>0)" "(B>1)"
# [2,] "(A>0)" "(!B>1)"
# [3,] "(!A>0)" "(B>1)"
# [4,] "(!A>0)" "(!B>1)"
#combine to vector of conditions
conditions <- apply(part_conditions, MARGIN=1,FUN=paste, collapse="&")
# > conditions
# [1] "(A>0)&(B>1)" "(A>0)&(!B>1)" "(!A>0)&(B>1)" "(!A>0)&(!B>1)"
#for each condition, test in the data whether the condition is TRUE
temp <- sapply(conditions, function(rule){
return(with(df, eval(parse(text=rule))))
}
)
rules <- as.numeric(t(dt[nrow(dt),-1]))
#then find which of the (in this case) four conditions is TRUE, and put the appropriate rule
#in df
df$C <- rules[apply(temp,1,which)]
> df
x A B C
1 x1 -1 1 1
2 x2 1 3 1
3 x3 0 1 1
4 x4 2 1 2
5 x5 2 3 1
6 x6 -2 1 1
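Each row of df should satisfy exactly one of the four conditions; as a quick sanity check on the logical matrix temp built above, you could add something like:
## every row should be matched by exactly one rule
stopifnot(all(rowSums(temp) == 1))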
Consider the following example:
require(MuMIn)
data(Cement)
d <- data.frame(Cement)
idx <- seq(11,13)
cor1 <- list()
for (i in 1:length(idx)){
d2 <- d[1:idx[i],]
cor1[[i]] <- cor.test(d2$X1,d2$X2, method = "pearson")
}
out <- lapply(cor1, function(x) c(x$estimate, x$conf.int, x$p.value))
Here I calculate the correlation for a dataset within an iteration loop.
I now want to generate one data.frame made up of the values in the list 'out'. I tried using
df <- do.call(rbind.data.frame, out)
but the result does not seem right:
> df
c.0.129614123011664..0.195326511912326..0.228579470307565.
1 0.1296141
2 0.1953265
3 0.2285795
c..0.509907346173941...0.426370467476045...0.368861726657293.
1 -0.5099073
2 -0.4263705
3 -0.3688617
c.0.676861607564929..0.691690831088494..0.692365536706126.
1 0.6768616
2 0.6916908
3 0.6923655
c.0.704071702633775..0.542941653020805..0.452566184329491.
1 0.7040717
2 0.5429417
3 0.4525662
This is not what I am after.
How can I generate a data.frame whose first column expresses which list element the cor.test was calculated in (i.e. 1 to 3 in this case), with the second column holding the $estimate, followed by $conf.int and $p.value, resulting in a five-column data.frame?
Is this what you're trying to do? Your question is a bit hard to understand. Is a column of indices from the list really necessary? The whole first column will be exactly the same as the row names (which appear on the left-hand side).
> D <- data.frame(cbind(index = seq(length(out)), do.call(rbind, out)))
> names(D)[2:ncol(D)] <- c('estimate', paste0('conf.int', 1:2), 'p.value')
> D
index estimate conf.int1 conf.int2 p.value
1 1 0.1296141 -0.5099073 0.6768616 0.7040717
2 2 0.1953265 -0.4263705 0.6916908 0.5429417
3 3 0.2285795 -0.3688617 0.6923655 0.4525662
It's not entirely clear what you're asking ... you already have such a data frame, just without reasonable column names. You can simplify your code to:
ctests <- lapply(idx, function(x) cor.test(d[1:x,"X1"], d[1:x, "X2"]))
ctests <- lapply(ctests, "[", c("estimate", "conf.int", "p.value"))
as.data.frame(do.call(rbind, lapply(ctests, unlist)))
# estimate.cor conf.int1 conf.int2 p.value
# 1 0.1296141 -0.5099073 0.6768616 0.7040717
# 2 0.1953265 -0.4263705 0.6916908 0.5429417
# 3 0.2285795 -0.3688617 0.6923655 0.4525662
Is this what you need?
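And if you do want the explicit index column from the question, you can prepend it to that result (a small sketch; res is just a name for the data frame built above):
res <- as.data.frame(do.call(rbind, lapply(ctests, unlist)))
res <- cbind(index = seq_along(ctests), res)
res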
Given this data.frame
x y z
1 1 3 5
2 2 4 6
I'd like to add the values of columns x and z plus a coefficient of 10, for every row in dat.
The intended result is this
x y z result
1 1 3 5 16 #(1+5+10)
2 2 4 6 18 #(2+6+10)
But why doesn't this code produce the desired result?
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(v1,v2,cf) {
return(v1+v2+cf)
}
# It breaks here
sm <- apply(dat[,c('x','z')], 1, process.xz(dat$x,dat$y,Coeff ))
# Later I'd do this:
# cbind(dat,sm);
I wouldn't use an apply here. Since the addition + operator is vectorized, you can get the sum using
> process.xz(dat$x, dat$z, Coeff)
[1] 16 18
To write this in your data.frame, don't use cbind, just assign it directly:
dat$result <- process.xz(dat$x, dat$z, Coeff)
The reason it fails is that apply doesn't work like that: you must pass the name of a function plus any additional parameters. Each row of the data frame is then passed (as a single vector) as the first argument to the named function.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(x,cf) {
return(x[1]+x[2]+cf)
}
sm <- apply(dat[, c('x','z')], 1, process.xz, cf = Coeff)
I completely agree that there's no point in using apply here though - but it's good to understand anyway.
I'm trying to do a chi-square analysis for all combinations of variables in the data and my code is:
Data <- esoph[ , 1:3]
OldStatistic <- NA
for(i in 1:(ncol(Data)-1)){
for(j in (i+1):ncol(Data)){
Statistic <- data.frame("Row"=colnames(Data)[i], "Column"=colnames(Data)[j],
"Chi.Square"=round(chisq.test(Data[ ,i], Data[ ,j])$statistic, 3),
"df"=chisq.test(Data[ ,i], Data[ ,j])$parameter,
"p.value"=round(chisq.test(Data[ ,i], Data[ ,j])$p.value, 3),
row.names=NULL)
temp <- rbind(OldStatistic, Statistic)
OldStatistic <- Statistic
Statistic <- temp
}
}
str(Data)
'data.frame': 88 obs. of 3 variables:
$ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
$ tobgp: Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
Statistic
Row Column Chi.Square df p.value
1 agegp tobgp 2.400 15 1
2 alcgp tobgp 0.619 9 1
My code gives me the chi-square analysis output for variable 1 vs variable 3, and variable 2 vs variable 3, but it is missing variable 1 vs variable 2. I tried hard but could not fix the code. Any comments and suggestions will be highly appreciated. I'd like to do cross tabulation for all possible combinations. Thanks in advance.
EDIT
I used to do this kind of analysis in SPSS but now I want to switch to R.
A sample of your data would be appreciated, but I think this will work for you. First, create all combinations of the columns with combn. Then write a function to use with an apply-type function to iterate through the combos. I like to use plyr since it is easy to specify what you want as the data structure on the back end. Also note you only need to compute the chi-square test once for each combination of columns, which should speed things up quite a bit as well.
library(plyr)
combos <- combn(ncol(Data), 2)
adply(combos, 2, function(x) {
  test <- chisq.test(Data[, x[1]], Data[, x[2]])
  out <- data.frame("Row" = colnames(Data)[x[1]]
                    , "Column" = colnames(Data)[x[2]]
                    , "Chi.Square" = round(test$statistic, 3)
                    , "df" = test$parameter
                    , "p.value" = round(test$p.value, 3)
                    )
  return(out)
})
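If you would rather avoid the plyr dependency, the same idea works in base R with lapply over the columns of combos (a sketch, assuming the Data frame defined in the question):
combos <- combn(ncol(Data), 2)
out <- lapply(seq_len(ncol(combos)), function(k) {
  i <- combos[1, k]
  j <- combos[2, k]
  test <- chisq.test(Data[, i], Data[, j])  # run each test only once
  data.frame(Row = colnames(Data)[i],
             Column = colnames(Data)[j],
             Chi.Square = round(unname(test$statistic), 3),
             df = unname(test$parameter),
             p.value = round(test$p.value, 3))
})
do.call(rbind, out)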
I wrote my own function. It creates a matrix in which all nominal variables are tested against each other. It can also save the results as an Excel file. It displays all the p-values that are smaller than 5%.
library(xlsx) # provides write.xlsx

funMassChi <- function (x, delFirst=0, xlsxpath=FALSE) {
  options(scipen = 999)
  start  <- (delFirst+1)
  ds     <- x[, start:ncol(x)]
  cATeND <- ncol(ds)
  catID  <- 1:cATeND
  resMat <- ds[1:cATeND, 1:(cATeND-1)] # matrix to hold the pairwise p-values
  resMat[,] <- NA
  for(nCc in 1:(length(catID)-1)){
    for(nDc in (nCc+1):length(catID)){
      tryCatch({
        chiRes <- chisq.test(ds[, catID[nCc]], ds[, catID[nDc]])
        resMat[nDc, nCc] <- chiRes[[3]] # store the p-value
      }, error=function(e){cat(paste("ERROR :", "at", nCc, nDc, sep=" "), conditionMessage(e), "\n")})
    }
  }
  resMat[resMat > 0.05] <- "" # blank out p-values above 5%
  Ergebnis <- cbind(CatNames=names(ds), resMat)
  Ergebnis <<- Ergebnis[-1, ] # also assign the result to the global environment
  if (!(xlsxpath==FALSE)) {
    write.xlsx(x = Ergebnis, file = paste(xlsxpath, "ALLChi-", Sys.Date(), ".xlsx", sep=""),
               sheetName = "Tabelle1", row.names = FALSE)
  }
}
funMassChi(categorialDATA,delFirst=3,xlsxpath="C:/folder1/folder2/")
delFirst can delete the first n columns, in case you have a count index or something else you don't want to test.
I hope this can help someone else.