Handling zeros in Cramer's V for a contingency table in R

I'm following the vcd docs, where assocstats is called on the result of an xtabs call on multiple subsets of a data frame. However, I get NaNs for a specific subset, because the expected count for many cells is 0:
        factor.2
factor.1  0  1  2  3  4  5 or more
      0   0 12  7  1  0  1
      1   0  2  1  1  0  0
      2   0  8  2  1  0  0
      3   0  5  4  0  0  0
      4   0  1  2  2  0  0
      5   0  6  8  0  0  0
      6   0  5  3  0  0  0
      7   0  5  1  0  0  0
      8   0  5  4  0  1  0
      9   0  1  1  0  1  0
      10  0  5  6  0  0  1
temp.table <- structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 12L,
2L, 8L, 5L, 1L, 6L, 5L, 5L, 5L, 1L, 5L, 7L, 1L, 2L, 4L, 2L, 8L,
3L, 1L, 4L, 1L, 6L, 1L, 1L, 1L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L), .Dim = c(11L, 6L), .Dimnames = structure(list(
factor.1 = c("0", "1", "2", "3", "4", "5", "6", "7", "8",
"9", "10"), factor.2 = c("0", "1", "2", "3", "4", "5 or more"
)), .Names = c("factor.1", "factor.2")), class = c("xtabs",
"table"), call = xtabs(data = cases.limited, na.action = na.omit))
library(vcd)
assocstats(temp.table)
                    X^2 df P(> X^2)
Likelihood Ratio 35.004 50  0.94676
Pearson             NaN 50      NaN

Phi-Coefficient   : NA
Contingency Coeff.: NaN
Cramer's V        : NaN
Is there a way to quickly and efficiently exclude these cases from the analysis without extensively rewriting what assocstats or xtabs do? I understand that this arguably reduces statistical power, but Cramer's V is already an optimistic estimator, so the results will still be useful to me.
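One quick workaround (a sketch, not something the vcd docs prescribe, and assuming the temp.table from the dput above): drop any rows or columns whose marginal total is zero before calling assocstats. In this table the "0" level of factor.2 is never observed, which forces its expected counts to 0 and makes the chi-square components NaN.

```r
library(vcd)

# Keep only rows/columns with at least one observation; with the all-zero
# "0" column removed, every expected count is positive and Cramer's V is finite.
trimmed <- temp.table[rowSums(temp.table) > 0, colSums(temp.table) > 0, drop = FALSE]
assocstats(trimmed)
```

Note this changes the table (and the degrees of freedom), so the statistics are computed on the trimmed margins, not the original ones.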


Choice experiment data: mlogit Exercise 3, "Error in reshapeLong...: 'varying' arguments must be the same length"

I am following Exercise 3 of the mlogit package (https://cran.r-project.org/web/packages/mlogit/vignettes/e3mxlogit.html), but attempting to use my own data (see below):
structure(list(Choice.Set = c(4L, 5L, 7L, 8L, 10L, 12L), Alternative = c(2L,
1L, 1L, 2L, 2L, 2L), respondent = c(1L, 1L, 1L, 1L, 1L, 1L),
code = c(7L, 9L, 13L, 15L, 19L, 23L), Choice = c(1L, 1L,
1L, 1L, 1L, 1L), price1 = c(0L, 0L, 1L, 1L, 0L, 0L), price2 = c(0L,
1L, 0L, 0L, 1L, 1L), price3 = c(0L, 0L, 0L, 0L, 0L, 0L),
price4 = c(1L, 0L, 0L, 0L, 0L, 0L), price5 = c(0L, 0L, 0L,
0L, 0L, 0L), zone1 = c(0L, 0L, 0L, 1L, 1L, 1L), zone2 = c(0L,
0L, 0L, 0L, 0L, 0L), zone3 = c(1L, 0L, 1L, 0L, 0L, 0L), zone4 = c(0L,
1L, 0L, 0L, 0L, 0L), lic1 = c(0L, 0L, 0L, 0L, 0L, 0L), lic2 = c(1L,
0L, 1L, 0L, 1L, 1L), lic3 = c(0L, 1L, 0L, 1L, 0L, 0L), enf1 = c(0L,
0L, 1L, 0L, 1L, 0L), enf2 = c(0L, 0L, 0L, 1L, 0L, 1L), enf3 = c(1L,
1L, 0L, 0L, 0L, 0L), chid = 1:6), row.names = c(4L, 5L, 7L,
8L, 10L, 12L), class = "data.frame")
I have run into an error when running the code:
dfml <- dfidx(df, idx=list(c("chid", "respondent")),
choice="Alternative", varying=6:20, sep ="")
"Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
'varying' arguments must be the same length"
I have checked the data, and each column from 6:20 is the same length; however, some respondents chose some of the options more than others. Can someone possibly point out where I have gone wrong? This is my first attempt at analyzing choice experiment data.
The error means that your price attribute has five levels, whereas the others (zone, lic, enf) have fewer. dfidx can't handle that: every varying attribute needs the same number of columns. You need to provide the missing ones, at least as NA columns:
df <- transform(df, zone5=NA, lic4=NA, lic5=NA, enf4=NA, enf5=NA)
library(mlogit)
dfml <- dfidx(df, idx=list(c("chid","respondent")), choice="Alternative",
varying=grep('^price|^zone|^lic|^enf', names(df)), sep="")
dfml
# ~~~~~~~
# first 10 observations out of 30
# ~~~~~~~
# Choice.Set Alternative code Choice price zone lic enf idx
# 1 4 FALSE 7 1 0 0 0 0 1:1
# 2 4 TRUE 7 1 0 0 1 0 1:2
# 3 4 FALSE 7 1 0 1 0 1 1:3
# 4 4 FALSE 7 1 1 0 NA NA 1:4
# 5 4 FALSE 7 1 0 NA NA NA 1:5
# 6 5 TRUE 9 1 0 0 0 0 2:1
# 7 5 FALSE 9 1 1 0 0 0 2:2
# 8 5 FALSE 9 1 0 0 1 1 2:3
# 9 5 FALSE 9 1 0 1 NA NA 2:4
# 10 5 FALSE 9 1 0 NA NA NA 2:5
#
# ~~~ indexes ~~~~
# chid respondent id2
# 1 1 1 1
# 2 1 1 2
# 3 1 1 3
# 4 1 1 4
# 5 1 1 5
# 6 2 1 1
# 7 2 1 2
# 8 2 1 3
# 9 2 1 4
# 10 2 1 5
# indexes: 1, 1, 2
I use grep here to identify the varying= columns. Get out of the habit of lazily specifying variables by position; it's dangerous, since the column order can easily change with small edits to the script.
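A quick diagnostic for this class of error (a sketch, assuming the df from the dput above): count how many columns each attribute prefix has. Unequal counts are exactly what trips up reshapeLong inside dfidx.

```r
# Strip the trailing digits from the attribute column names and tabulate:
# every prefix must appear the same number of times for the reshape to work.
attr.cols <- grep("^price|^zone|^lic|^enf", names(df), value = TRUE)
table(sub("[0-9]+$", "", attr.cols))
# Before adding the NA columns, this shows price appearing 5 times but
# zone only 4 and lic/enf only 3 each.
```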

How do I convert this adjacency matrix into a graph object?

I have a matrix representing social interaction data, read from a CSV, which looks like this:
    `0`   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
0     0    29     1     0     1     9     3     0     1     4
1     1     0     0     1     3     1     0     1     1     1
2     1     1     0    13     4     0     1     1    15     0
3     3     0     1     0     1     1     7     1     1     1
4     1     0     1    98     0     1     1     1     1     2
5     2     5     1     1     3     0     2     0     1     5
6     1     1     0     0    12     1     0     2     1     1
7     1     1     0     1     0     1     9     0     1     2
8     1     1    17    13   145     1    39     1     0     1
9    88    23     1     5     1     2     1     7     1     0
I am new to social network analysis, so I am not sure of my terminology, but this seems like a weighted adjacency matrix to me, since from it we can say that student 1 has had 29 interactions with student 0 in the last year. I had this object stored as a data frame in RStudio, but when I ran the following code, I received the error below:
> fn <- graph_from_adjacency_matrix(output, weighted = T)
Error in mde(x) : 'list' object cannot be coerced to type 'double'
I've tried converting it to a matrix, but that does not seem to work either. Any help with this would be much appreciated.
You need to convert your data.frame to a matrix first and then apply graph_from_adjacency_matrix, e.g.,
g <- graph_from_adjacency_matrix(as.matrix(df), weighted = TRUE)
and plot(g) then draws the weighted graph (plot omitted here).
Data
> dput(df)
structure(list(``0`` = c(0L, 1L, 1L, 3L, 1L, 2L, 1L, 1L, 1L,
88L), ``1`` = c(29L, 0L, 1L, 0L, 0L, 5L, 1L, 1L, 1L, 23L), ``2`` = c(1L,
0L, 0L, 1L, 1L, 1L, 0L, 0L, 17L, 1L), ``3`` = c(0L, 1L, 13L,
0L, 98L, 1L, 0L, 1L, 13L, 5L), ``4`` = c(1L, 3L, 4L, 1L, 0L,
3L, 12L, 0L, 145L, 1L), ``5`` = c(9L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 1L, 2L), ``6`` = c(3L, 0L, 1L, 7L, 1L, 2L, 0L, 9L, 39L, 1L
), ``7`` = c(0L, 1L, 1L, 1L, 1L, 0L, 2L, 0L, 1L, 7L), ``8`` = c(1L,
1L, 15L, 1L, 1L, 1L, 1L, 1L, 0L, 1L), ``9`` = c(4L, 1L, 0L, 1L,
2L, 5L, 1L, 2L, 1L, 0L)), class = "data.frame", row.names = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9"))

Convert a dataset to a longitudinal data structure in R

I have a dataset that looks something like this:
> head(BurnData)
Treatment Gender Race Surface head buttock trunk up.leg low.leg resp.tract type ex.time excision antib.time antibiotic
1 0 0 0 15 0 0 1 1 0 0 2 12 0 12 0
2 0 0 1 20 0 0 1 0 0 0 4 9 0 9 0
3 0 0 1 15 0 0 0 1 1 0 2 13 0 13 0
4 0 0 0 20 1 0 1 0 0 0 2 11 1 29 0
5 0 0 1 70 1 1 1 1 0 0 2 28 1 31 0
6 0 0 1 20 1 0 1 0 0 0 4 11 0 11 0
inf.time infection
1 12 0
2 9 0
3 7 1
4 29 0
5 4 1
6 8 1
I want to run a Cox regression on the variables Surface, ex.time, antib.time, and Treatment. Treatment is an indicator variable, Surface denotes the % of body burned, and ex.time and antib.time both record time to event in days.
I am aware that to run a time-dependent Cox regression I need to convert the data to a longitudinal structure, but how can I do that in R?
I will then use the formula:
coxph(formula = Surv(tstart, tstop, infection) ~ covariate)
DATA
> dput(head(BurnData))
structure(list(Treatment = c(0L, 0L, 0L, 0L, 0L, 0L), Gender = c(0L,
0L, 0L, 0L, 0L, 0L), Race = c(0L, 1L, 1L, 0L, 1L, 1L), Surface = c(15L,
20L, 15L, 20L, 70L, 20L), head = c(0L, 0L, 0L, 1L, 1L, 1L), buttock = c(0L,
0L, 0L, 0L, 1L, 0L), trunk = c(1L, 1L, 0L, 1L, 1L, 1L), up.leg = c(1L,
0L, 1L, 0L, 1L, 0L), low.leg = c(0L, 0L, 1L, 0L, 0L, 0L), resp.tract = c(0L,
0L, 0L, 0L, 0L, 0L), type = c(2L, 4L, 2L, 2L, 2L, 4L), ex.time = c(12L,
9L, 13L, 11L, 28L, 11L), excision = c(0L, 0L, 0L, 1L, 1L, 0L),
antib.time = c(12L, 9L, 13L, 29L, 31L, 11L), antibiotic = c(0L,
0L, 0L, 0L, 0L, 0L), inf.time = c(12L, 9L, 7L, 29L, 4L, 8L
), infection = c(0L, 0L, 1L, 0L, 1L, 1L), Surface_discr = structure(c(1L,
1L, 1L, 1L, 2L, 1L), .Label = c("1", "2"), class = "factor"),
ex.time_discr = c(1L, 1L, 1L, 1L, 2L, 1L), antib.time_discr = c(1L,
1L, 1L, 2L, 2L, 1L)), .Names = c("Treatment", "Gender", "Race",
"Surface", "head", "buttock", "trunk", "up.leg", "low.leg", "resp.tract",
"type", "ex.time", "excision", "antib.time", "antibiotic", "inf.time",
"infection", "Surface_discr", "ex.time_discr", "antib.time_discr"
), row.names = c(NA, 6L), class = "data.frame")
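One common approach (a sketch, assuming the BurnData frame from the dput above, with inf.time as the follow-up time and infection as the event indicator): survival::tmerge builds the (tstart, tstop] counting-process format, splitting each subject's follow-up where a time-dependent covariate changes.

```r
library(survival)

BurnData$id <- seq_len(nrow(BurnData))

# Baseline interval per subject, ending at the infection/censoring time.
long <- tmerge(BurnData, BurnData, id = id,
               infect = event(inf.time, infection))

# Time-dependent indicators that switch from 0 to 1 at the excision
# and antibiotic times, splitting rows where needed.
long <- tmerge(long, BurnData, id = id,
               excised  = tdc(ex.time),
               on.antib = tdc(antib.time))

coxph(Surv(tstart, tstop, infect) ~ Treatment + Surface + excised + on.antib,
      data = long)
```

The variable names infect, excised, and on.antib are illustrative; pick whatever names suit your analysis.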

Create a new indicator variable based on values of existing variables

I have a dataset like this:
> dput(head(BurnData))
structure(list(Treatment = c(0L, 0L, 0L, 0L, 0L, 0L), Gender = c(0L,
0L, 0L, 0L, 0L, 0L), Race = c(0L, 1L, 1L, 0L, 1L, 1L), Surface = c(15L,
20L, 15L, 20L, 70L, 20L), head = c(0L, 0L, 0L, 1L, 1L, 1L), buttock = c(0L,
0L, 0L, 0L, 1L, 0L), trunk = c(1L, 1L, 0L, 1L, 1L, 1L), `upper leg` = c(1L,
0L, 1L, 0L, 1L, 0L), `lower leg` = c(0L, 0L, 1L, 0L, 0L, 0L),
`respiratory tract` = c(0L, 0L, 0L, 0L, 0L, 0L), type = c(2L,
4L, 2L, 2L, 2L, 4L), `excision time` = c(12L, 9L, 13L, 11L,
28L, 11L), excision = c(0L, 0L, 0L, 1L, 1L, 0L), `antibiotic time` = c(12L,
9L, 13L, 29L, 31L, 11L), antibiotic = c(0L, 0L, 0L, 0L, 0L,
0L), infection_t = c(12L, 9L, 7L, 29L, 4L, 8L), infection = c(0L,
0L, 1L, 0L, 1L, 1L)), .Names = c("Treatment", "Gender", "Race",
"Surface", "head", "buttock", "trunk", "upper leg", "lower leg",
"respiratory tract", "type", "excision time", "excision", "antibiotic time",
"antibiotic", "infection_t", "infection"), row.names = c(NA,
6L), class = "data.frame")
I am trying to create a new variable that combines the indicators head, buttock, trunk, upper leg, lower leg, and respiratory tract into ONE new indicator variable, where 0 means all indicators are zero, 1 means only head, 2 only buttock, 3 ..., 7 only respiratory tract, and 8 a combination of any of them.
I have been trying to do this with mutate from dplyr, but I cannot get it right. I am not very good at this.
Here is an approach with base R using nested ifelse statements:
ifelse(rowSums(d1[5:10]) > 1, 8,
       ifelse(rowSums(d1[5:10]) == 0, 0, max.col(d1[5:10])))
#1 2 3 4 5 6
#8 3 8 8 8 8
You can also try case_when from the tidyverse:
library(tidyverse)
d %>%
  select(head:`respiratory tract`) %>%
  mutate(res = case_when(rowSums(.) == 0 ~ 0,
                         rowSums(.) > 1 ~ 8,
                         head == 1 ~ 1,
                         buttock == 1 ~ 2,
                         trunk == 1 ~ 3,
                         `upper leg` == 1 ~ 4,
                         `lower leg` == 1 ~ 5,
                         `respiratory tract` == 1 ~ 6)) %>%
  select(res) %>%
  bind_cols(d, .)
Treatment Gender Race Surface head buttock trunk upper leg lower leg respiratory tract type
1 0 0 0 15 0 0 1 1 0 0 2
2 0 0 1 20 0 0 1 0 0 0 4
3 0 0 1 15 0 0 0 1 1 0 2
4 0 0 0 20 1 0 1 0 0 0 2
5 0 0 1 70 1 1 1 1 0 0 2
6 0 0 1 20 1 0 1 0 0 0 4
excision time excision antibiotic time antibiotic infection_t infection res
1 12 0 12 0 12 0 8
2 9 0 9 0 9 0 3
3 13 0 13 0 7 1 8
4 11 1 29 0 29 0 8
5 28 1 31 0 4 1 8
6 11 0 11 0 8 1 8
Or, more compactly, using Sotos' max.col idea for the final step (this replaces the mutate() call in the pipeline above):
mutate(res = case_when(rowSums(.) == 0 ~ 0L,
                       rowSums(.) > 1 ~ 8L,
                       TRUE ~ max.col(.)))
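For completeness, the combined version as one runnable pipeline (a sketch, assuming d is the data frame from the dput above):

```r
library(dplyr)

d %>%
  select(head:`respiratory tract`) %>%
  mutate(res = case_when(rowSums(.) == 0 ~ 0L,    # no site burned
                         rowSums(.) > 1  ~ 8L,    # more than one site
                         TRUE ~ max.col(.))) %>%  # exactly one: its column index
  select(res) %>%
  bind_cols(d, .)
```

Note the integer literals (0L, 8L): case_when requires all branches to return the same type, and max.col returns integers.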

How does one look up the column with the max value in each row of a matrix?

I have a table that looks like this:
       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
 586   0  0  0  1  0  0  0  1  3  1  0  1  0  0  0  0  0  1  0  2  0  3  0  0  0  4  0  1  2  0
 637   0  0  0  0  0  0  2  3  2  2  0  4  0  0  0  0  1  0  0  2  0  1  1  1  0  0  0  0  0  1
 989   0  0  1  0  0  0  2  1  0  0  0  2  1  0  0  1  2  1  0  3  0  2  0  1  1  0  1  0  1  0
1081   0  0  0  1  0  0  1  0  1  1  0  0  2  0  0  0  0  0  0  3  0  5  0  0  2  1  0  1  1  1
2922   0  1  1  1  0  0  0  2  1  0  0  0  2  0  0  0  1  1  0  1  0  3  1  1  2  0  0  1  0  1
3032   0  1  0  0  0  0  0  3  0  0  1  0  2  1  0  1  0  1  1  0  0  3  1  1  1  1  0  0  1  1
Numbers 1 to 30 in the first row are my labels, and the rows are my items. For each item I would like to find the label with the most counts. E.g. row 586 has its highest count, 4, under label 26, so for 586 I would like to assign 26.
I am able to get the maximum value for each row with max(table1[1, ]), which gets me the maximum value for the first row but not the label it corresponds to, and I don't know how to proceed. All help is appreciated!
dput:
structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 1L, 0L, 0L, 1L, 3L, 1L,
0L, 2L, 3L, 3L, 2L, 0L, 1L, 1L, 0L, 1L, 2L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 4L, 2L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 2L,
2L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 1L, 2L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L,
0L, 0L, 0L, 0L, 1L, 2L, 2L, 3L, 3L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 3L, 1L, 2L, 5L, 3L, 3L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L,
0L, 1L, 1L, 0L, 0L, 1L, 2L, 2L, 1L, 4L, 0L, 0L, 1L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 2L, 0L, 1L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 1L, 1L), .Dim = c(6L, 30L), .Dimnames = structure(list(
c("586", "637", "989", "1081", "2922", "3032"), c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "29", "30")), .Names = c("",
"")))
max.col will give you a vector of the column numbers holding the maximum value in each row (ties.method = "first" picks the first such column when there are ties):
> max.col(df, ties.method = "first")
[1] 26 12 20 22 22 8
You can use that vector to index the column names:
> colnames(df)[max.col(df, ties.method = "first")]
[1] "26" "12" "20" "22" "22" "8"
Perhaps you are looking for which.max. Assuming your matrix is called "temp":
> apply(temp, 1, which.max)
586 637 989 1081 2922 3032
26 12 20 22 22 8
apply with MARGIN = 1 (the second argument) applies the function to each row.
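One caveat worth a sketch: the two answers break ties differently. which.max always returns the first maximum in a row, while max.col defaults to ties.method = "random", which picks one of the tied columns at random on each call.

```r
# Row 2 has a tie between columns 1 and 2.
m <- matrix(c(3, 1,
              3, 3), nrow = 2, byrow = TRUE)

apply(m, 1, which.max)             # 1 1  (first maximum in each row)
max.col(m, ties.method = "first")  # 1 1  (deterministic; the default "random"
                                   #       could return 1 or 2 for row 2)
```

If reproducibility matters, prefer which.max or set ties.method explicitly.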
