My questions concern the calculation of Cramér's V to detect correlation between categorical variables. I've got a dataset with missing values, but I created a fake dataset for illustration with two variables, a and b, one of them containing two NAs.
a <- factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'),labels =c('Male','Female'))
b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels=c("y","n"),labels=c("yes","no"))
df<-cbind(a2,b2)
The assocstats function gives me the result for Cramér's V:
require(vcd)
> tab <-table(a,b)
> assocstats(tab)
X^2 df P(> X^2)
Likelihood Ratio 1.7261 4 0.78597
Pearson 1.3333 4 0.85570
Phi-Coefficient : 0.408
Contingency Coeff.: 0.378
Cramer's V : 0.289
Now I want to drop the NAs from the levels:
a[a==""]<-NA
a3 <- droplevels(a)
levels(a3)
tab <-table(a,b)
assocstats(tab)
But every time I remove the NAs, the result looks like this:
X^2 df P(> X^2)
Likelihood Ratio 0.13844 2 0.93312
Pearson NaN 2 NaN
Phi-Coefficient : NaN
Contingency Coeff.: NaN
Cramer's V : NaN
Also, because I have a large dataset, I would like to calculate a matrix of Cramér's V results. I found this code here on Stack Overflow and it seems to work...
get.V <- function(y){
  col.y <- ncol(y)
  V <- matrix(ncol=col.y, nrow=col.y)
  for(i in 1:col.y){
    for(j in 1:col.y){
      V[i,j] <- assocstats(table(y[,i], y[,j]))$cramer
    }
  }
  return(V)
}
get.V(tab)
But the result is different from the one given by the assocstats function:
[,1] [,2] [,3]
[1,] 1.0 0.5 1
[2,] 0.5 1.0 1
[3,] 1.0 1.0 1
This cannot be right, because I get this result every time, even when changing the number of observations... What is wrong with this code?
Conclusion: I don't know which of the results is right. I have a large dataset with a lot of NAs in it. The first assocstats result and the code give different results, although there is no big difference, because the code only creates a matrix. The second assocstats call gives only NaN. I can't detect any errors... Can somebody help me?
You don't have to replace the "" with NA if you are using factors -- any value that you don't define in levels will be converted to NA by factor:
a <- factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'),labels =c('Male','Female'))
a
# [1] M F F M F F
# Levels: F M
a2
# [1] Male <NA> Female Female <NA> Male Female Female
# Levels: Male Female
b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels=c("y","n"),labels=c("yes","no"))
(df <- cbind(a2,b2))
# a2 b2
# [1,] 1 1
# [2,] NA 1
# [3,] 2 NA
# [4,] 2 1
# [5,] NA 2
# [6,] 1 2
# [7,] 2 2
# [8,] 2 1
Above, you're creating a matrix which loses all the labels that you created with factor. I think you want a data frame:
(df <- data.frame(a2,b2))
# a2 b2
# 1 Male yes
# 2 <NA> yes
# 3 Female <NA>
# 4 Female yes
# 5 <NA> no
# 6 Male no
# 7 Female no
# 8 Female yes
require('vcd')
(tab <- table(a2,b2, useNA = 'ifany'))
# b2
# a2 yes no <NA>
# Male 1 1 0
# Female 2 1 1
# <NA> 1 1 0
(tab <- table(a2,b2))
# b2
# a2 yes no
# Male 1 1
# Female 2 1
You need to explicitly tell table if you want to see NA values in the table. Otherwise, it will drop them by default so that you are already "excluding" them when you use assocstats:
assocstats(tab)
# X^2 df P(> X^2)
# Likelihood Ratio 0.13844 1 0.70983
# Pearson 0.13889 1 0.70939
#
# Phi-Coefficient : 0.167
# Contingency Coeff.: 0.164
# Cramer's V : 0.167
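As an aside on the NaN result in the question: a[a == ""] <- NA replaces the values but does not drop the "" level, so table(a, b) still contains an all-zero row for that level; the zero margin makes the expected counts 0, the Pearson chi-square becomes 0/0 = NaN, and that propagates into Phi and Cramér's V. A minimal sketch of a fix along those lines (a_na, b_na and tab0 are just illustrative copies, so a and b stay untouched):
a_na <- a; a_na[a_na == ""] <- NA
b_na <- b; b_na[b_na == ""] <- NA
tab0 <- table(droplevels(a_na), droplevels(b_na))  # 2x2, no zero margins left
assocstats(tab0)                                   # same Cramer's V (0.167) as above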
For get.V just pass the data frame or matrix, not the table:
get.V <- function(y) {
col.y <- ncol(y)
V <- matrix(ncol=col.y,nrow=col.y)
for(i in 1:col.y){
for(j in 1:col.y){
V[i,j] <- assocstats(table(y[,i],y[,j]))$cramer
}
}
return(V)
}
get.V(df)
# [,1] [,2]
# [1,] 1.0000000 0.1666667
# [2,] 0.1666667 1.0000000
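Incidentally, this also explains the odd 3x3 matrix from get.V(tab) in the question: a table is itself just a matrix of counts, so passing it to get.V makes the function cross-tabulate columns of counts instead of the underlying observations. A small sketch, using the a and b defined above (tab3 is just an illustrative name):
tab3 <- table(a, b)            # 3x3, since "" is still a level of both factors
ncol(tab3)                     # 3, so get.V(tab3) would return a 3x3 matrix
table(tab3[, 1], tab3[, 2])    # this is what assocstats ends up seeing inside get.V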
I'm very new to R. I have two matrices of different dimensions, C (3 rows, 79 columns) and T (3 rows, 215 columns). I want my code to calculate the Spearman correlation between the first column of C and all the columns of T, and return the maximum correlation together with the index of the corresponding column. Then the second column of C against all the columns of T, and so on. In short, I want to find the pairs of columns across the two matrices that are most correlated. Hope that is clear.
What I did was a nested for loop, but the result is not what I'm looking for.
for (i in 1:79){
  for(j in 1:215){
    print(max(cor(C[,i], T[,j], method = c("spearman"))))
  }
}
You don't have to loop over the columns.
x <- cor(C,T,method = c("spearman"))
out <- data.frame(MaxCorr = apply(x,1,max), T_ColIndex=apply(x,1,which.max),C_ColIndex=1:nrow(x))
head(out)
which gives:
MaxCorr T_ColIndex C_ColIndex
1 1 8 1
2 1 1 2
3 1 2 3
4 1 1 4
5 1 11 5
6 1 4 6
Fake Data:
C <- matrix(rnorm(3*79),nrow=3)
T <- matrix(rnorm(3*215),nrow=3)
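As a quick sanity check (a small sketch using the fake data above), one entry of the vectorized result can be compared against a direct computation for the first column of C:
all.equal(max(cor(C[, 1], T, method = "spearman")), out$MaxCorr[1])
# [1] TRUE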
Maybe something like the function below can solve the problem.
pairwise_cor <- function(x, y, method = "spearman"){
ix <- seq_len(ncol(x))
iy <- seq_len(ncol(y))
t(sapply(ix, function(i){
m <- sapply(iy, function(j) cor(x[,i], y[,j], method = method))
setNames(c(i, which.max(m), max(m)), c("col_x", "col_y", "max"))
}))
}
set.seed(2021)
C <- matrix(rnorm(3*5), nrow=3)
T <- matrix(rnorm(3*7), nrow=3)
pairwise_cor(C, T)
# col_x col_y max
#[1,] 1 1 1.0
#[2,] 2 2 1.0
#[3,] 3 2 1.0
#[4,] 4 3 0.5
#[5,] 5 5 1.0
I have a dataframe with the following structure:
> str(data_l)
'data.frame': 800 obs. of 5 variables:
$ Participant: int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: Factor w/ 4 levels "35","37","39",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Region : Factor w/ 5 levels "Eyes","Front",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Time : Factor w/ 5 levels "0","15","30",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Rating : num 5 5 5 4 5 5 5 5 5 5 ...
I want to run a one-sample t-test for each combination of the levels of all factors, for a total of 4*5*5 = 100 t-tests, with Rating as the dependent variable (y).
I am stuck at looping through the combinations and performing a t-test for each one.
I tried splitting the dataframe by the factors and then running t.test() over the list with lapply, but to no avail.
Does anyone have a better approach? Cheers!
Edit
My ultimate intention is to calculate a confidence interval for the Rating values within every combination of factor levels. For instance, I was able to do this:
subset1 <- data_l$Rating[data_l$Temperature == 35 & data_l$Region == "Front" & data_l$Time == 0]
Then,
t.test(subset1)$conf.int
But the problem is I will have to do this 100 times.
Edit 2
I am recreating the dataframe.
Temperature <- rep(seq(35, 41, 2), 10)
Region <- rep(c("Front", "Back", "Eyes", "Left", "Right"), 8)
Time <- rep(seq(0, 60, 15), 8)
Rating <- sample(1:5, 40, replace = TRUE)
data_l <- data.frame(Region = factor(Region), Temperature = factor(Temperature), Time = factor(Time), Rating = as.numeric(Rating))
Two things.
Can this be done? Certainly. Should it? Many of your combinations may have insufficient data to find a reasonable confidence interval. While your data sample is certainly reduced and simplified, I have no assurance that every combination of your factor levels is sufficiently populated.
table(sapply(split(data_l$Rating, data_l[,c("Temperature","Region","Time")]), length))
# 0 2
# 80 20
(There are 80 "empty" combinations of your factor levels.)
Let's try this:
outs <- aggregate(data_l$Rating, data_l[,c("Temperature","Region","Time")],
function(x) if (length(unique(x)) > 1) t.test(x)$conf.int else c(NA, NA))
nrow(outs)
# [1] 20
head(outs)
# Temperature Region Time x.1 x.2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
Realize that this is not five columns; the fourth is really a matrix embedded in a frame column:
head(outs$x)
# [,1] [,2]
# [1,] NA NA
# [2,] -9.706205 15.706205
# [3,] -2.853102 9.853102
# [4,] -15.559307 22.559307
# [5,] -15.559307 22.559307
# [6,] -4.853102 7.853102
It's easy enough to extract:
outs$conf1 <- outs$x[,1]
outs$conf2 <- outs$x[,2]
outs$x <- NULL
head(outs)
# Temperature Region Time conf1 conf2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
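If the combinations for which no interval could be computed are not of interest, they can be filtered out afterwards, e.g. (outs_ok is just an illustrative name):
outs_ok <- subset(outs, !is.na(conf1))  # keep only rows where a CI was computed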
(If you're wondering why I have a conditional on length(unique(x)) > 1, then see what happens without it:
aggregate(data_l$Rating, data_l[,c("Temperature","Region","Time")],
function(x) t.test(x)$conf.int)
# Error in t.test.default(x) : data are essentially constant
This is because there are combinations with empty data. You'll likely see something similar with not-empty but still invariant data.)
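For instance, a constant vector triggers the same error:
t.test(rep(5, 3))
# Error: data are essentially constant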
I am stuck at looping through the combinations and performing a t-test for each one.
I'm not sure if this is what you wanted.
N <- 800
df <- data.frame(Participant=1:N,
Temperature=gl(4,200),
Region=sample(1:5, 800, TRUE),
Time=sample(1:5, 800, TRUE),
Rating=sample(1:5, 800, TRUE))
head(df)
t_test <- function(data, y, x){
x <- eval(substitute(x), data)
y <- eval(substitute(y), data)
comb <- combn(levels(x), m=2) # this gives all pair-wise combinations
n <- dim(comb)[2]
t <- vector(n, mode="list")
for(i in 1:n){
xlevs <- comb[,i]
DATA <- subset(data, subset=x %in% xlevs)
x2 <- factor(x, levels=xlevs)
tt <- t.test(y~x2, data=DATA)
t[[i]] <- tt
names(t)[i] <- toString(xlevs)
}
t
}
T.test <- t_test(df, Rating, Temperature)
T.test[1]
$`1, 2`
Welch Two Sample t-test
data: y by x2
t = -1.0271, df = 396.87, p-value = 0.305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4079762 0.1279762
sample estimates:
mean in group 1 mean in group 2
2.85 2.99
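If, as in the original question, mainly the confidence intervals are of interest, they can be pulled out of the returned list afterwards (a small usage sketch; ci is just an illustrative name):
ci <- t(sapply(T.test, function(tt) tt$conf.int))
colnames(ci) <- c("lower", "upper")
ci["1, 2", ]  # same interval as in the printout above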
I have a matrix with a large number of duplicate rows and would like to obtain a matrix of the unique rows together with a frequency count for each unique row.
The example shown below solves this problem but is painfully slow.
rowsInTbl <- function(tbl,row){
sum(apply(tbl, 1, function(x) all(x == row) ))
}
colFrequency <- function(tblall){
tbl <- unique(tblall)
results <- matrix(nrow = nrow(tbl),ncol=ncol(tbl)+1)
results[,1:ncol(tbl)] <- as.matrix(tbl)
dimnames(results) <- list(c(rownames(tbl)),c(colnames(tbl),"Frequency"))
freq <- apply(tbl,1,function(x)rowsInTbl(tblall,x))
results[,"Frequency"] <- freq
return(results)
}
m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=T)
dimnames(m) <- list(letters[1:nrow(m)],c("c1","c2"))
print("Matrix")
print(m)
[1] "Matrix"
c1 c2
a 1 2
b 3 4
c 3 4
d 1 2
e 3 4
print("Duplicate frequency table")
print(colFrequency(m))
[1] "Duplicate frequency table"
c1 c2 Frequency
a 1 2 2
b 3 4 3
Here are the speed measurements for the answers of #Heroka and #m0h3n compared to my example; the matrix shown above was repeated 1000 times. data.table is clearly the fastest solution.
[1] "Duplicate frequency table - my example"
user system elapsed
0.372 0.000 0.371
[1] "Duplicate frequency table - data.table"
user system elapsed
0.008 0.000 0.008
[1] "Duplicate frequency table - aggregate"
user system elapsed
0.092 0.000 0.089
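The timing harness itself is not shown; a rough sketch of how such a comparison might be run (mBig and mdf are just illustrative names), assuming the 5-row matrix m is simply stacked 1000 times:
mBig <- m[rep(seq_len(nrow(m)), 1000), ]
system.time(colFrequency(mBig))                                   # "my example"
library(data.table)
system.time(as.data.table(mBig)[, .N, by = c("c1", "c2")])        # data.table answer
mdf <- as.data.frame(mBig)
system.time(aggregate(mdf, by = mdf, length)[1:(ncol(mdf) + 1)])  # aggregate answer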
Looks like a job for data.table, as you need something that can aggregate quickly.
library(data.table)
m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=T)
mdt <- as.data.table(m)
res <- mdt[,.N, by=names(mdt)]
res
# > res
# V1 V2 N
# 1: 1 2 2
# 2: 3 4 3
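If a plain matrix like the one in the question is preferred over a data.table, the result can simply be converted back:
as.matrix(res)
#      V1 V2 N
# [1,]  1  2 2
# [2,]  3  4 3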
How about this using base R for extracting unique rows:
mat <- matrix(c(2,5,3,5,2,3,4,2,3,5,4,2,1,5,3,5), ncol = 2, byrow = T)
mat[!duplicated(mat),]
# [,1] [,2]
# [1,] 2 5
# [2,] 3 5
# [3,] 2 3
# [4,] 4 2
# [5,] 1 5
Extracting unique rows along with their frequencies:
m <- as.data.frame(mat)
aggregate(m, by=m, length)[1:(ncol(m)+1)]
# V1 V2 V1.1
# 1 4 2 2
# 2 2 3 1
# 3 1 5 1
# 4 2 5 1
# 5 3 5 3
I found this function to detect proportions of missing values for each column in any given dataframe:
propmiss <- function(dataframe) {
  lapply(dataframe, function(x) {
    data.frame(nmiss = sum(is.na(x)), n = length(x), propmiss = sum(is.na(x))/length(x))
  })
}
I assign it to a variable like this:
propmissdf <- propmiss(df)
Then I loop through the dataframe to NULL out variables in my data like this:
for(i in (1:length(df))){
  var = names(df)[i]
  if((propmissdf[[var]][[3]]) > 0.3) { # the 3rd element is the proportion inside propmissdf
    df[var] <- NULL
  }
}
This gives me an error:
Error in if ((propmissdf[[var]][[3]]) > 0.3) { :argument is of length zero
But it works, somehow. It gets rid of several variables with missing value proportions greater than 0.3, but if I run the for loop again, it gets rid of more, and after 3 or 4 more runs it has gotten rid of all of them. Why is this happening? Please feel free to correct my problem, or to come up with a better way to remove variables with over 30% NAs.
You can use something like this:
df <- df[colSums(is.na(df))/nrow(df) < .3]
colSums(is.na(df)) would calculate how many NA values there are in each column.
Divide that output by the number of rows in the data.frame to get the proportion.
Use < .3 to create a logical comparison that can be used to subset the relevant columns.
Sample data and example:
set.seed(2)
df <- data.frame(matrix(sample(c(NA, 1:4), 20, TRUE), nrow = 4))
df
# X1 X2 X3 X4 X5
# 1 NA 4 2 3 4
# 2 3 4 2 NA 1
# 3 2 NA 2 2 2
# 4 NA 4 1 4 NA
colSums(is.na(df))/nrow(df)
# X1 X2 X3 X4 X5
# 0.50 0.25 0.00 0.25 0.25
df[colSums(is.na(df))/nrow(df) < .3]
# X2 X3 X4 X5
# 1 4 2 3 4
# 2 4 2 NA 1
# 3 NA 2 2 2
# 4 4 1 4 NA
For reference, here's a quick timing comparison:
set.seed(1)
df <- data.frame(matrix(sample(c(NA, 1:4), 4000, TRUE), ncol = 1000))
akfun <- function() {
i1 <-sapply(df, function(x) {
pr <- prop.table(table(factor(is.na(x), levels=c(TRUE, FALSE))))
pr[as.logical(names(pr))]< 0.3
})
df[i1]
}
amfun <- function() df[colSums(is.na(df))/nrow(df) < .3]
identical(amfun(), akfun())
# [1] TRUE
system.time(akfun())
# user system elapsed
# 0.172 0.000 0.173
system.time(amfun())
# user system elapsed
# 0.000 0.000 0.001
We can loop over the columns with sapply, get the count of NA values with table, use prop.table to find the proportion, and create a logical vector.
i1 <-sapply(df, function(x) {
pr <- prop.table(table(factor(is.na(x), levels=c(TRUE, FALSE))))
pr[as.logical(names(pr))]< 0.3
})
This vector can be used for subsetting the columns.
df[i1]
If we need to remove the columns
df[!i1] <- list(NULL) #contributed by #Ananda Mahto
df
# X2 X3 X4 X5
#1 4 2 3 4
#2 4 2 NA 1
#3 NA 2 2 2
#4 4 1 4 NA
NOTE: df taken from #Ananda Mahto's post
I have two data.frames: editCounts and nonEditCounts. These data frames have the same dimensions and the same column and row names, but the actual data varies. Below are the heads of each:
> head(editCounts)
Samp0 Samp1 Samp2
chr10_101992307 0 4 3
chr10_101992684 4 0 1
chr10_127480585 0 3 0
chr10_16479385 3 3 3
chr10_73979859 0 3 2
chr10_73979940 0 3 8
> head(nonEditCounts)
Samp0 Samp1 Samp2
chr10_101992307 0 4 3
chr10_101992684 15 0 4
chr10_127480585 0 6 0
chr10_16479385 7 7 4
chr10_73979859 0 13 7
chr10_73979940 0 21 10
The ultimate goal here is to perform pairwise Fisher tests (using fisher.test()) on each column and row between the two data.frames. As output I would like to create a table containing the resulting p-values from each pairwise comparison for each row name, e.g.:
Samp0_vs_Samp1 Samp0_vs_Samp2 Samp1_vs_Samp2
chr10_101992307 pval pval pval
chr10_101992684 pval pval pval
chr10_127480585 pval pval pval
chr10_16479385 pval pval pval
chr10_73979859 pval pval pval
... ... ... ...
So, taking Samp0 and Samp1 for example, the first Fisher test would consist of a matrix looking something like this:
> tempMat=matrix(c(editCounts$Samp0[1], nonEditCounts$Samp0[1],
+ editCounts$Samp1[1], nonEditCounts$Samp1[1]), 2, 2)
> tempMat
[,1] [,2]
[1,] 0 4
[2,] 0 4
These values correspond to the first row (chr10_101992307). In this case the Fisher test would result in a p-value of 1.
I know I can use combn() to compute the column combinations, but I'm not exactly sure how to loop through each pair of columns, create a contingency table from the 4 values, and run Fisher's test. The code I have written thus far is listed below; however, it throws an error when trying to create the tempMat.
editCounts <- read.table("editCountMatrix.txt", sep="\t", header=TRUE, row.names=1)
nonEditCounts <- read.table("nonEditCountMatrix.txt", sep="\t", header=TRUE, row.names=1)
pairwiseComb <- combn(names(editCounts),2)
for (j in seq(1,length(pairwiseComb),2)){
tempCol1 = pairwiseComb[[j]]
tempCol2 = pairwiseComb[[j+1]]
cat("Processing: ",tempCol1," vs. ",tempCol2, "\n", sep="") # Prints correctly
for (i in 1:nrow(editCounts)){
tempMat=matrix(c(editCounts$tempCol1[i], nonEditCounts$tempCol1[i],
editCounts$tempCol2[i], nonEditCounts$tempCol2[i]), 2, 2)
tempFisher=fisher.test(tempMat, alternative="two.sided")
pval=tempFisher$p.value
pvalAdj=p.adjust(pval,method="fdr")
}
}
The error this produces is shown below:
Error in matrix(c(editCounts$tempCol1[i], nonEditCounts$tempCol1[i], editCounts$tempCol2[i], :
'data' must be of a vector type, was 'NULL'
Any help would be greatly appreciated.
Thanks!
Here is a proposed solution where I have corrected some minor indexing problems with your code, and suggested using a pre-allocated matrix to store the Fisher Exact test results.
# Create data.frames using your sample data.
editCounts <- read.table(header=TRUE,
text=" Samp0 Samp1 Samp2
chr10_101992307 0 4 3
chr10_101992684 4 0 1
chr10_127480585 0 3 0
chr10_16479385 3 3 3
chr10_73979859 0 3 2
chr10_73979940 0 3 8")
nonEditCounts <- read.table(header=TRUE,
text=" Samp0 Samp1 Samp2
chr10_101992307 0 4 3
chr10_101992684 15 0 4
chr10_127480585 0 6 0
chr10_16479385 7 7 4
chr10_73979859 0 13 7
chr10_73979940 0 21 10")
pairwiseComb <- combn(names(editCounts), 2)
# Create a matrix to hold results.
results <- matrix(NA, ncol=ncol(pairwiseComb), nrow=nrow(editCounts))
# Create row and column names to use for indexing/assignment of results.
rownames(results) <- rownames(editCounts)
colnames(results) <- apply(pairwiseComb, 2,
function(x) {paste(x[1], "_vs_", x[2], sep="")})
# Loop over number of column pairs.
for (j in seq(ncol(pairwiseComb))) {
tempCol1 <- pairwiseComb[1, j]
tempCol2 <- pairwiseComb[2, j]
resultsCol <- paste(tempCol1, "_vs_", tempCol2, sep="")
cols <- c(tempCol1, tempCol2)
# Loop over rownames.
for (row in rownames(results)) {
tempMat <- rbind( editCounts[row, cols], # Grab values using row and
nonEditCounts[row, cols]) # column names. Use rbind to
# create two-row matrix.
tempFisher <- fisher.test(tempMat, alternative="two.sided")
results[row, resultsCol] <- tempFisher$p.value # Use row and column name
# indexing to assign
# p-value to results.
}
}
# Compute adjusted p-values using all of the computed p-values, outside of loop.
padj <- results # First make copy of results matrix.
padj[] <- p.adjust(results, method="fdr") # Trick to retain shape and attributes.
results
# Samp0_vs_Samp1 Samp0_vs_Samp2 Samp1_vs_Samp2
# chr10_101992307 1 1.0000000 1.00000000
# chr10_101992684 1 1.0000000 1.00000000
# chr10_127480585 1 1.0000000 1.00000000
# chr10_16479385 1 0.6436652 0.64366516
# chr10_73979859 1 1.0000000 1.00000000
# chr10_73979940 1 1.0000000 0.03290832
padj
# Samp0_vs_Samp1 Samp0_vs_Samp2 Samp1_vs_Samp2
# chr10_101992307 1 1 1.0000000
# chr10_101992684 1 1 1.0000000
# chr10_127480585 1 1 1.0000000
# chr10_16479385 1 1 1.0000000
# chr10_73979859 1 1 1.0000000
# chr10_73979940 1 1 0.5923497
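As a quick sanity check on a single entry, one of the tests can be redone by hand; this mirrors what the loop does for the last row:
fisher.test(rbind(editCounts["chr10_73979940", c("Samp1", "Samp2")],
                  nonEditCounts["chr10_73979940", c("Samp1", "Samp2")]))$p.value
# [1] 0.03290832   (matches the Samp1_vs_Samp2 entry in results above)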