I have the following dataframe:
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15), var1 = c(1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1), var2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
var3 = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1), var4 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1), outcome = c(1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
I would like to write a script that calculates the odds ratio (using chi-square), with 95% CI and p-value, between each var column and the column outcome.
How can I do that?
I installed epitools, but it seems to need a 2x2 contingency table, and I cannot work out how to apply the function to the columns of a data frame.
With mapply, you can use the fisher.test function, which doesn't fail when the odds ratio cannot be calculated.
mapply(fisher.test, x=df1[, grep("var", names(df1))], y=df1[,"outcome"])
But the output is a 7x4 matrix, which cannot easily be tidied into a nice format. Instead, we can use lapply to perform Fisher's test for each column and then tidy the results with the broom package.
library(broom)
cols <- df1[,grep("var", names(df1))]
res_list <- lapply(as.list(cols), function(x) fisher.test(x, y=df1$outcome))
do.call(rbind, lapply(res_list, broom::tidy))
# A tibble: 4 x 6
estimate p.value conf.low conf.high method alternative
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0 1 0 77.9 Fisher's Exact Test ~ two.sided
2 Inf 0.505 0.204 Inf Fisher's Exact Test ~ two.sided
3 2.13 0.608 0.160 37.2 Fisher's Exact Test ~ two.sided
4 Inf 0.505 0.204 Inf Fisher's Exact Test ~ two.sided
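If you would rather not depend on broom, a similar table can be assembled in base R from the components that fisher.test returns (estimate, conf.int and p.value). This is only a minimal sketch: the helper name tidy_fisher is made up for illustration, and df1 is rebuilt here as a plain data frame from the values in the question (without the id column).

```r
# Data from the question, rebuilt as a plain data.frame
df1 <- data.frame(
  var1 = c(1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  var2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  var3 = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1),
  var4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1),
  outcome = c(1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Hypothetical helper: one row of results per predictor
tidy_fisher <- function(x, y) {
  ft <- fisher.test(x, y)
  data.frame(
    estimate  = unname(ft$estimate),
    conf.low  = ft$conf.int[1],
    conf.high = ft$conf.int[2],
    p.value   = ft$p.value
  )
}

vars <- grep("^var", names(df1), value = TRUE)
# lapply over a named list of columns, so rbind keeps the variable names as row names
res <- do.call(rbind, lapply(df1[vars], tidy_fisher, y = df1$outcome))
res
```

Unlike the broom output above, the rows here are labelled with the variable names, which makes it easier to see which test belongs to which column.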
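Or using tidyr and purrr: reshape to long format first, split on the variable name, and then map over the pieces.
library(dplyr)
library(tidyr)   # pivot_longer
library(purrr)   # map, map_df
library(broom)   # tidy
df1 %>%
  pivot_longer(cols=starts_with("var")) %>%
  split(.$name) %>%
  map(~fisher.test(x=.$value, y=.$outcome)) %>%
  map(tidy) %>%
  map_df(~as_tibble(.))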
Data:
df1 <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15), var1 = c(1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1), var2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
var3 = c(1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1), var4 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1), outcome = c(1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
The following code performs the computations as described in the question, but 3 of the 4 calls give errors.
library(epitools)
cols <- grep("var", names(df1), value = TRUE)
res_list <- lapply(cols, function(v){
tbl <- table(df1[, c(v, "outcome")])
tryCatch(oddsratio(x = tbl), error = function(e) e)
})
ok <- !sapply(res_list, inherits, "error")
res_list[ok]
The errors are all this:
simpleError in uniroot(function(or) { 1 - midp(a1, a0, b1, b0, or)
- alpha/2}, interval = interval): f() values at end points not of opposite sign
which can be seen with
res_list[!ok]
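Since the question asks for chi-square-based results, one possible way around the uniroot failure is to compute the sample odds ratio and a Wald interval by hand, adding 0.5 to every cell when a zero occurs (the Haldane-Anscombe correction), and to take the p-value from chisq.test. The sketch below is an assumption about what is wanted, not a description of what epitools does; or_wald is a hypothetical helper, and the toy vectors are invented.

```r
# Hypothetical helper: sample odds ratio, Wald CI and chi-square p-value
or_wald <- function(x, y, conf.level = 0.95) {
  # Fix the levels so the table is always 2x2, even if a value is absent
  tbl <- table(factor(x, levels = c(0, 1)), factor(y, levels = c(0, 1)))
  # p-value from the chi-square test on the uncorrected counts
  p <- suppressWarnings(chisq.test(tbl)$p.value)
  # Assumption: apply the Haldane-Anscombe correction when any cell is zero
  if (any(tbl == 0)) tbl <- tbl + 0.5
  log_or <- log((tbl[1, 1] * tbl[2, 2]) / (tbl[1, 2] * tbl[2, 1]))
  se <- sqrt(sum(1 / tbl))                 # standard error of the log odds ratio
  z <- qnorm(1 - (1 - conf.level) / 2)
  c(estimate  = exp(log_or),
    conf.low  = exp(log_or - z * se),
    conf.high = exp(log_or + z * se),
    p.value   = p)
}

# Invented toy data
x <- c(1, 1, 1, 0, 0, 0, 1, 0)
y <- c(1, 1, 0, 0, 0, 1, 1, 0)
res <- or_wald(x, y)
res
```

With the data from the question this could be applied as sapply(df1[cols], or_wald, y = df1$outcome). Keep in mind that with counts this small the chi-square approximation is rough, which is why chisq.test warns (the warning is suppressed here).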
I am working on a project that requires me to one-hot encode a single variable, and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2 and 3 are separated into their own columns in the data frame, each equal to either 0 or 1. E.g., if data$Ratings = 3, then the dummy for 3 would be 1. All the other columns should not change.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )
Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
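If installing plm is not an option, base R's model.matrix can build the same kind of dummy columns. This is a minimal sketch on an invented toy data frame; the names toy and the Ratings_ prefix are illustrative assumptions, not part of the question.

```r
# Invented toy data with a three-level rating
toy <- data.frame(ID = 1:5, Ratings = c(3, 3, 2, 1, 2))

# "- 1" drops the intercept, so every level gets its own 0/1 column
dummies <- model.matrix(~ factor(Ratings) - 1, data = toy)
colnames(dummies) <- paste0("Ratings_", levels(factor(toy$Ratings)))

# Append the dummy columns; all original columns stay unchanged
toy_encoded <- cbind(toy, dummies)
toy_encoded
```

Only levels that actually occur in the column get a dummy, so with the ten rows shown in the question (Ratings of 2 and 3 only) there would be no column for 1 unless the factor levels are set explicitly.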
I have a data frame with 20 columns (items) and 593 rows (the number of rows doesn't matter, though); a sample is shown further below.
For this data, the reliability of the test is 0.94, obtained with psych::alpha from the psych package. The output also gives me the new value of Cronbach's alpha if I drop one of the items. However, I want to know how many items I can drop while retaining an alpha of at least 0.8. I used a brute-force approach: I create every combination of the items in my data frame and check whether its alpha is in the range (0.7, 0.9). Is there a better way of doing this? It takes forever to run because the number of items is too large to check every combination. Below is my current code:
numberOfItems <- 20
for(i in 2:(2^numberOfItems - 1)){
# ignoring the first case i.e. i=1, as it doesn't represent any model
# convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
# using the binaryLogic package
combination <- as.binary(i, n=numberOfItems)
model <- c()
for(j in 1:length(combination)){
# choose which columns to consider depending on the combination
if(combination[j])
model <- c(model, j)
}
itemsToUse <- itemResponses[, c(model)]
#cat(model)
if(length(model) > 13){
alphaVal <- psych::alpha(itemsToUse)$total$raw_alpha
if(alphaVal > 0.7 && alphaVal < 0.9){
cat(alphaVal)
print(model)
}
}
}
A sample output from this code is as follows:
0.8989831 1 4 5 7 8 9 10 11 13 14 15 16 17 19 20
0.899768 1 4 5 7 8 9 10 11 12 13 15 17 18 19 20
0.899937 1 4 5 7 8 9 10 11 12 13 15 16 17 19 20
0.8980605 1 4 5 7 8 9 10 11 12 13 14 15 17 19 20
Here are the first 10 rows of data:
dput(itemResponses)
structure(list(CESD1 = c(1, 2, 2, 0, 1, 0, 0, 0, 0, 1), CESD2 = c(2,
3, 1, 0, 0, 1, 1, 1, 0, 1), CESD3 = c(0, 3, 0, 1, 1, 0, 0, 0,
0, 0), CESD4 = c(1, 2, 0, 1, 0, 1, 1, 1, 0, 0), CESD5 = c(0,
1, 0, 2, 1, 2, 2, 0, 0, 0), CESD6 = c(0, 3, 0, 1, 0, 0, 2, 0,
0, 0), CESD7 = c(1, 2, 1, 1, 2, 0, 1, 0, 1, 0), CESD8 = c(1,
3, 1, 1, 0, 1, 0, 0, 1, 0), CESD9 = c(0, 1, 0, 2, 0, 0, 1, 1,
0, 1), CESD10 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1), CESD11 = c(0,
2, 1, 1, 1, 1, 2, 3, 0, 0), CESD12 = c(0, 3, 1, 1, 1, 0, 2, 0,
0, 0), CESD13 = c(0, 3, 0, 2, 1, 2, 1, 0, 1, 0), CESD14 = c(0,
3, 1, 2, 1, 1, 1, 0, 1, 1), CESD15 = c(0, 2, 0, 1, 0, 1, 0, 1,
1, 0), CESD16 = c(0, 2, 2, 0, 0, 1, 1, 0, 0, 0), CESD17 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 0), CESD18 = c(0, 2, 0, 0, 0, 0, 0, 0,
0, 1), CESD19 = c(0, 3, 0, 0, 0, 0, 0, 1, 1, 0), CESD20 = c(0,
3, 0, 1, 0, 0, 0, 0, 0, 0)), .Names = c("CESD1", "CESD2", "CESD3",
"CESD4", "CESD5", "CESD6", "CESD7", "CESD8", "CESD9", "CESD10",
"CESD11", "CESD12", "CESD13", "CESD14", "CESD15", "CESD16", "CESD17",
"CESD18", "CESD19", "CESD20"), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The idea is to replace the computation of alpha with the so-called discrimination for each item from classical test theory (CTT). The discrimination is the correlation of the item score with a "true score" (which we would assume to be the row sum).
Let the data be
dat <- structure(list(CESD1 = c(1, 2, 2, 0, 1, 0, 0, 0, 0, 1), CESD2 = c(2, 3, 1, 0, 0, 1, 1, 1, 0, 1),
CESD3 = c(0, 3, 0, 1, 1, 0, 0, 0, 0, 0), CESD4 = c(1, 2, 0, 1, 0, 1, 1, 1, 0, 0),
CESD5 = c(0, 1, 0, 2, 1, 2, 2, 0, 0, 0), CESD6 = c(0, 3, 0, 1, 0, 0, 2, 0, 0, 0),
CESD7 = c(1, 2, 1, 1, 2, 0, 1, 0, 1, 0), CESD8 = c(1, 3, 1, 1, 0, 1, 0, 0, 1, 0),
CESD9 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1), CESD10 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1),
CESD11 = c(0, 2, 1, 1, 1, 1, 2, 3, 0, 0), CESD12 = c(0, 3, 1, 1, 1, 0, 2, 0, 0, 0),
CESD13 = c(0, 3, 0, 2, 1, 2, 1, 0, 1, 0), CESD14 = c(0, 3, 1, 2, 1, 1, 1, 0, 1, 1),
CESD15 = c(0, 2, 0, 1, 0, 1, 0, 1, 1, 0), CESD16 = c(0, 2, 2, 0, 0, 1, 1, 0, 0, 0),
CESD17 = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0), CESD18 = c(0, 2, 0, 0, 0, 0, 0, 0, 0, 1),
CESD19 = c(0, 3, 0, 0, 0, 0, 0, 1, 1, 0), CESD20 = c(0, 3, 0, 1, 0, 0, 0, 0, 0, 0)),
.Names = c("CESD1", "CESD2", "CESD3", "CESD4", "CESD5", "CESD6", "CESD7", "CESD8", "CESD9",
"CESD10", "CESD11", "CESD12", "CESD13", "CESD14", "CESD15", "CESD16", "CESD17",
"CESD18", "CESD19", "CESD20"), row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame"))
We compute (1) the discrimination and (2) the alpha coefficient.
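stat <- t(sapply(seq_len(ncol(dat)), function(ii){
  # extract the item as a plain vector (dat is a tibble, so dat[, ii] would stay a tibble)
  dd <- dat[[ii]]
  # discrimination is the correlation of the item with the row sum of the remaining items;
  # return NA for constant items so that every result has length 2
  disc <- if(var(dd, na.rm = TRUE) > 0) cor(dd, rowSums(dat[, -ii]), use = "pairwise") else NA
  # alpha that would be obtained when we skip this item
  alpha <- psych::alpha(dat[, -ii])$total$raw_alpha
  c(disc, alpha)
}))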
dimnames(stat) <- list(colnames(dat), c("disc", "alpha^I"))
stat <- data.frame(stat)
Observe that the discrimination (which is cheaper to compute) is inversely related to the alpha obtained when the item is deleted. In other words, alpha is highest when there are many highly discriminating items (which correlate with each other).
plot(stat, pch = 19)
Use this information to choose the order in which items are deleted relative to a benchmark (say .9, since the toy data doesn't allow for a lower mark):
1) delete as many items as possible to stay above the benchmark; that is, start with the least discriminating items.
stat <- stat[order(stat$disc), ]
this <- sapply(1:(nrow(stat)-2), function(ii){
ind <- match(rownames(stat)[1:ii], colnames(dat))
alpha <- psych::alpha(dat[, -ind, drop = FALSE])$total$raw_alpha
})
delete_these <- rownames(stat)[which(this > .9)]
psych::alpha(dat[, -match(delete_these, colnames(dat)), drop = FALSE])$total$raw_alpha
length(delete_these)
2) delete as few items as possible to stay above the benchmark; that is, start with the highest discriminating items.
stat <- stat[order(stat$disc, decreasing = TRUE), ]
this <- sapply(1:(nrow(stat)-2), function(ii){
ind <- match(rownames(stat)[1:ii], colnames(dat))
alpha <- psych::alpha(dat[, -ind, drop = FALSE])$total$raw_alpha
})
delete_these <- rownames(stat)[which(this > .9)]
psych::alpha(dat[, -match(delete_these, colnames(dat)), drop = FALSE])$total$raw_alpha
length(delete_these)
Note that 1) is consistent with classical item selection procedures in (psychological/educational) diagnostics and assessment: remove the items that fall below a benchmark in terms of discriminatory power.
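A large part of the running time in the brute-force search comes from calling psych::alpha for every subset. Raw Cronbach's alpha depends only on the item covariance matrix, alpha = k/(k-1) * (1 - sum of item variances / sum of all covariances), so the covariance matrix can be computed once and subset repeatedly. The sketch below combines this with a greedy backward elimination. The function names cronbach_alpha and greedy_drop are made up, the data is simulated, and a greedy search is a heuristic that is not guaranteed to find the largest possible set of droppable items.

```r
# Raw Cronbach's alpha from a covariance matrix C of k items
cronbach_alpha <- function(C) {
  k <- ncol(C)
  k / (k - 1) * (1 - sum(diag(C)) / sum(C))
}

# Hypothetical greedy search: repeatedly drop the item whose removal
# leaves the highest alpha, as long as alpha stays at or above the target
greedy_drop <- function(dat, target = 0.8) {
  C <- cov(dat, use = "pairwise.complete.obs")  # computed only once
  keep <- colnames(C)
  dropped <- character(0)
  while (length(keep) > 2) {
    # alpha after removing each remaining item in turn
    cand <- sapply(keep, function(item) {
      idx <- setdiff(keep, item)
      cronbach_alpha(C[idx, idx, drop = FALSE])
    })
    best <- names(which.max(cand))
    if (cand[[best]] < target) break
    keep <- setdiff(keep, best)
    dropped <- c(dropped, best)
  }
  list(kept = keep, dropped = dropped,
       alpha = cronbach_alpha(C[keep, keep, drop = FALSE]))
}

# Simulated items: a common latent score plus independent noise
set.seed(1)
n <- 200
latent <- rnorm(n)
items <- sapply(1:10, function(i) latent + rnorm(n, sd = 0.7))
colnames(items) <- paste0("item", 1:10)

res <- greedy_drop(items, target = 0.8)
res$dropped
res$alpha
```

For the question's data this would be called as greedy_drop(itemResponses, target = 0.8). It evaluates at most on the order of 20^2 subsets instead of 2^20, at the cost of possibly missing a better combination.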
I changed the code as follows: I now drop a fixed number of items and change the value of numberOfItemsToDrop from 1 to 20 manually. Although it is a little better, it still takes too long to run.
I hope there is a better way of doing this.
numberOfItemsToDrop <- 13
combinations <- combinat::combn(20, numberOfItemsToDrop)
timesToIterate <- ncol(combinations)  # one column per combination
for(i in 1:timesToIterate){
model <- combinations[,i]
itemsToUse <- itemResponses[, -c(model)]
alphaVal <- psych::alpha(itemsToUse)$total$raw_alpha
if(alphaVal < 0.82){
cat("Cronbach's alpha =",alphaVal, ", number of items dropped = ", length(model), " :: ")
print(model)
}
}
I have a question about PCA using the caret package and an error message I'm getting, "cannot rescale a constant/zero column to unit variance".
Consider two sets of similar code. The first works just fine:
a = c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0)
df = data.frame(a, b, c)
trans = preProcess(df, method = c("center", "scale", "pca"))
The variance of each column can be seen as:
apply(df, 2, var, na.rm=TRUE)
Note that the variance of column "c" is 0.11
Let's say I change the second to last integer in column "c" to 1 instead of 0, and then run the same code:
a = c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
df = data.frame(a, b, c)
trans = preProcess(df, method = c("center", "scale", "pca"))
I get an error message:
Error in prcomp.default(x, scale = TRUE, retx = FALSE) :
cannot rescale a constant/zero column to unit variance
If you look at the variance for column c, it's 0.059:
apply(df, 2, var, na.rm=TRUE)
Can anyone please help me understand the difference between these two sets of code and why the second gives an error when the first does not?
Thank you
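PCA only uses complete observations. In your second definition of df above, the PCA drops the last row because of the missing values, and column c is constant within the remaining rows, so it cannot be rescaled to unit variance. In the first definition, the 0 in the second-to-last row of c survives, so the column still varies.
Note: my answer is about PCA in general and not specific to the caret package.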
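This can be checked directly by comparing the column variances before and after dropping incomplete rows. The sketch below uses the second data set from the question; the third vector is called c_col here only to avoid masking the function c, and the names var_all and var_complete are illustrative.

```r
a     <- c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, -1, -1, NA)
b     <- c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1, NA)
c_col <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
df <- data.frame(a, b, c = c_col)

# Variance per column, ignoring NAs column by column (as in the question)
var_all <- sapply(df, var, na.rm = TRUE)

# Variance per column after removing every row that contains an NA
complete <- df[complete.cases(df), ]
var_complete <- sapply(complete, var)

var_all
var_complete
```

The only 0 in column c sits in the row where a and b are NA, so once that row is removed, c has zero variance.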
What's the simplest way to compute the percentage of rows (1) containing ones and (2) containing zeros, per group?
Here's some small example data:
dat <- structure(list(rs = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), group = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("rs", "group"), row.names = c(NA,
-62L), class = "data.frame")
Here's what I've got so far (don't laugh!):
require(plyr)
require(reshape2)  # provides dcast
tab <- as.data.frame(table(dat))
dc <- dcast(tab, group ~ rs)
dc <- dc[,-1]
dc[] <- lapply(dc, as.numeric)
data.frame(prop.table(as.matrix(dc), 1))
Which works fine:
X0 X1
1 1.0000000 0.00000000
2 0.8787879 0.12121212
3 0.9285714 0.07142857
But I'm sure there's a method that requires less typing.
Solutions with plyr and data.table most welcome.
table almost does what you want. Convert to ratios by dividing each set of values by its sum:
t(apply(table(dat), 2, function(x) x/sum(x)))
## group 0 1
## 1 1.0000000 0.00000000
## 2 0.8787879 0.12121212
## 3 0.9285714 0.07142857
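A closely related base R option is prop.table with margin = 1, which normalises each row of a table. The sketch below uses an invented toy data frame with the same column names as in the question, and puts group first in the table call so that the groups end up as rows.

```r
# Invented toy data with the same column names as in the question
toy <- data.frame(
  rs    = c(0, 1, 0, 0, 1, 1, 0, 0, 0),
  group = c(1, 1, 1, 1, 2, 2, 2, 3, 3)
)

# Rows are groups, columns are the values of rs; margin = 1 makes each row sum to 1
props <- prop.table(table(group = toy$group, rs = toy$rs), margin = 1)
props

# If only the share of ones is needed, the group mean of a 0/1 column gives it directly
ones <- tapply(toy$rs, toy$group, mean)
ones
```

For the data in the question the equivalent call would be prop.table(table(dat$group, dat$rs), 1).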