How to make contingency tables over a list of variables in R

The main exposure variable is aff. I want to get contingency tables for aff and all variables in the varlist. Then I want to run a chi-square test on each of these contingency tables. My code is as follows:
name=names(data)
varlist=name[11:40]
models=lapply(varlist, function(x) {
chisq.test(table(substitute(data$i,list(i = as.name(x))),data$aff))
})
lapply(models, summary)
But I got this error:
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
How to fix this?

I think you're over-complicating things by using substitute and such. Without your data, I'll try with mtcars, using cyl as the exposure variable.
data <- mtcars
name <- names(data)
ev <- "cyl"
varlist <- name[ name != ev ]
models <- lapply(varlist, function(nm) {
chisq.test(table(data[[nm]], data[[ev]]))
})
# Warning messages:
# 1: In chisq.test(table(data[[nm]], data[[ev]])) :
# Chi-squared approximation may be incorrect
(mtcars is not a good dataset for this test, so it produces a lot of warnings here; they can be ignored for the purposes of this example.)
summaries <- lapply(models, summary)
str(summaries[1:2])
# List of 2
# $ : 'summaryDefault' chr [1:9, 1:3] " 1" " 1" " 1" " 1" ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:9] "statistic" "parameter" "p.value" "method" ...
# .. ..$ : chr [1:3] "Length" "Class" "Mode"
# $ : 'summaryDefault' chr [1:9, 1:3] " 1" " 1" " 1" " 1" ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:9] "statistic" "parameter" "p.value" "method" ...
# .. ..$ : chr [1:3] "Length" "Class" "Mode"
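As the output above shows, summary() on an htest object only describes its list structure. A minimal sketch for pulling out the numbers that usually matter, reusing the mtcars setup from this answer (the names extract_chisq and results are made up for illustration):

```r
data <- mtcars
ev <- "cyl"
varlist <- setdiff(names(data), ev)

# Hypothetical helper: run the test for one variable and keep the key numbers.
extract_chisq <- function(nm) {
  # suppressWarnings() hides the small-count warning that mtcars triggers
  res <- suppressWarnings(chisq.test(table(data[[nm]], data[[ev]])))
  data.frame(variable  = nm,
             statistic = unname(res$statistic),
             df        = unname(res$parameter),
             p.value   = res$p.value,
             stringsAsFactors = FALSE)
}

# One row per variable in varlist
results <- do.call(rbind, lapply(varlist, extract_chisq))
head(results)
```

With real data, drop suppressWarnings() so that the approximation warning stays visible where it matters.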

Supposing your data is like mtcars, where vs, am, gear, and carb are categorical variables, you can create a function like so:
df2 <- mtcars[, 8:11] # df2contains the columns vs, am, gear and carb
# cyl is not one of the columns of df2, so take it from mtcars
df_list_f <- function(x) chisq.test(table(mtcars$cyl, x))
lapply(df2, df_list_f)

Related

mk.test() results to table/matrix R

I want to apply mk.test() to the large dataset and get results in a table/matrix.
My data look something like this:
Column A   Column B   ...   ColumnXn
1          2          ...   5
...        ...        ...   ...
3          4          ...   7
So far I managed to perform mk.test() for all columns and print the results:
for(i in 1:ncol(data)) {
print(mk.test(as.numeric(unlist(data[ , i]))))
}
I got all the results printed:
.....
Mann-Kendall trend test
data: as.numeric(unlist(data[, i]))
z = 4.002, n = 71, p-value = 6.28e-05
alternative hypothesis: true S is not equal to 0
sample estimates:
S varS tau
7.640000e+02 3.634867e+04 3.503154e-01
Mann-Kendall trend test
data: as.numeric(unlist(data[, i]))
z = 3.7884, n = 71, p-value = 0.0001516
alternative hypothesis: true S is not equal to 0
sample estimates:
S varS tau
7.240000e+02 3.642200e+04 3.283908e-01
....
However, I was wondering if it is possible to get the results in a table/matrix format that I could save as an Excel file.
Something like this:
Column     z        p-value     S              varS           tau
Column A   4.002    0.0001516   7.640000e+02   3.642200e+04   3.283908e-01
...        ...      ...         ...            ...            ...
ColumnXn   3.7884   6.28e-05    7.240000e+02   3.642200e+04   3.283908e-01
Is it possible to do so?
I would really appreciate your help.
Instead of printing the test results, you can store them in a variable. This variable holds the various test statistics and values. To find the names of its elements, you can run the test on the first column and inspect the structure of the result with str():
testres = mk.test(as.numeric(unlist(data[ , 1])))
str(testres)
List of 9
$ data.name : chr "as.numeric(unlist(data[, 1]))"
$ p.value : num 0.296
$ statistic : Named num 1.04
..- attr(*, "names")= chr "z"
$ null.value : Named num 0
..- attr(*, "names")= chr "S"
$ parameter : Named int 3
..- attr(*, "names")= chr "n"
$ estimates : Named num [1:3] 3 3.67 1
..- attr(*, "names")= chr [1:3] "S" "varS" "tau"
$ alternative: chr "two.sided"
$ method : chr "Mann-Kendall trend test"
$ pvalg : num 0.296
- attr(*, "class")= chr "htest"
Here you see that for example the z-value is called testres$statistic and similar for the other properties. The values of S, varS and tau are not separate properties but they are grouped together in the list testres$estimates.
In the code you can create an empty dataframe, and in the loop add the results of that run to this dataframe. Then at the end you can convert to csv using write.csv().
library(trend)
# sample data
mydata = data.frame(ColumnA = c(1,3,5), ColumnB = c(2,4,1), ColumnXn = c(5,7,7))
# empty data frame to store results
results <- data.frame()
for (i in seq_len(ncol(mydata))) {
  # store test results in variable
  testres <- mk.test(as.numeric(unlist(mydata[ , i])))
  # build a one-row data frame so that the numeric values stay numeric
  # (combining them with the column name via c() would coerce everything to character)
  testvars <- data.frame(Column = colnames(mydata)[i],          # column
                         z = unname(testres$statistic),         # z
                         `p-value` = testres$p.value,           # p-value
                         S = unname(testres$estimates[1]),      # S
                         varS = unname(testres$estimates[2]),   # varS
                         tau = unname(testres$estimates[3]),    # tau
                         check.names = FALSE)
  # add to results data frame
  results <- rbind(results, testvars)
}
write.csv(results, "mannkendall.csv", row.names=FALSE)
The resulting csv file can be opened in Excel.

R - Alteryx - All columns in a tibble must be vectors

I'm using R on Alteryx to perform some statistical analysis on my data.
I get the error message "! All columns in a tibble must be vectors."
Can anybody help me?
Below is my entire code:
library("tibble")
# Calling Data from Connection #1
data <- read.Alteryx("#1")
average_wilcox <- c("1","1","1","1","1","1","1")
# Creating data frame for in case it comes an empty table
df <- data.frame(average_wilcox)
#Verify if p-value is empty
# In case is different that empty, executes the steps for the Hypothesis Test for non-normal data
if (length(data$p.value) == 0) {
write.Alteryx(df, 1)
} else if (data$p.value != '') {
Week1 <- read.Alteryx("#2", mode="data.frame")
"&"
Week2 <- read.Alteryx("#3", mode="data.frame")
# MANN WHITNEY TEST (AVERAGE TEST FOR NON NORMAL)
Week1_data <- Week1$Wk1_feature_value
Week2_data <- Week2$Wk2_feature_value
# DEFINE VECTORS
week1 <- c(Week1_data)
week2 <- c(Week2_data)
merge(cbind(Week1, X=1:length(week1)),
cbind(Week2, X=1:length(week2)), all.y =T) [-1]
# MANN WHITNEY TEST (MEAN TEST FOR NON NORMAL)
average_wilcox <- wilcox.test(week1,week2, alternative='two.sided', conf.level=.95)
average_test <- tibble(average_wilcox)
average_test[] <- lapply(average_test, as.character)
write.Alteryx(average_test, 1)
}
#### NORMAL HYPOTHESIS TEST ####
# Calling Data from Connection #4
data1 <- read.Alteryx("#4")
df1 <- data.frame(Date=as.Date(character()),"p.value"=character(),User=character(),stringsAsFactors=FALSE)
# Verify if p-value is empty
# In case if different than empty, executes the steps for the Hypothesis Test for normal data
if(length(data1$p.value) == 0) {
write.Alteryx(df1, 3)
} else if (data1$p.value != '') {
Week1 <- read.Alteryx("#2", mode="data.frame")
"&"
Week2 <- read.Alteryx("#3", mode="data.frame")
# T TEST (MEAN TEST FOR NORMAL)
Week1_data <- Week1$Wk1_feature_value
Week2_data <- Week2$Wk2_feature_value
# DEFINE VECTORS
week1 <- c(Week1_data)
week2 <- c(Week2_data)
# T TEST (MEAN TEST FOR NORMAL)
t_test <- t.test(week1,week2, alternative='two.sided',conf.level=.95)
write.Alteryx(t_test,3)
}
Does anybody know what I have to do?
Many thanks,
Wil
The reason is that both wilcox.test and t.test return a list of vectors, which may differ in length. Passing that list to write.Alteryx triggers the error, as it expects a data.frame/tibble/data.table. For example:
> str(t.test(1:10, y = c(7:20)))
List of 10
$ statistic : Named num -5.43
..- attr(*, "names")= chr "t"
$ parameter : Named num 22
..- attr(*, "names")= chr "df"
$ p.value : num 1.86e-05
$ conf.int : num [1:2] -11.05 -4.95
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num [1:2] 5.5 13.5
..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
$ null.value : Named num 0
..- attr(*, "names")= chr "difference in means"
$ stderr : num 1.47
$ alternative: chr "two.sided"
$ method : chr "Welch Two Sample t-test"
$ data.name : chr "1:10 and c(7:20)"
- attr(*, "class")= chr "htest"
> x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
> y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> str(wilcox.test(x, y, alternative = "g") )
List of 7
$ statistic : Named num 35
..- attr(*, "names")= chr "W"
$ parameter : NULL
$ p.value : num 0.127
$ null.value : Named num 0
..- attr(*, "names")= chr "location shift"
$ alternative: chr "greater"
$ method : chr "Wilcoxon rank sum exact test"
$ data.name : chr "x and y"
- attr(*, "class")= chr "htest"
An option is to convert the output from both t.test and wilcox.test to a data.frame/tibble. tidy/glance from broom does this
...
library(broom)
average_wilcox <- tidy(wilcox.test(week1,week2, alternative='two.sided', conf.level=.95))
write.Alteryx(average_wilcox, 1)
...
t_test <- tidy(t.test(week1,week2, alternative='two.sided',conf.level=.95))
write.Alteryx(t_test,3)
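If broom is not available, the same idea can be sketched in base R by copying the scalar elements of the htest list into a one-row data frame. The helper name htest_to_df is hypothetical, and the choice of fields is an assumption about what is needed downstream; the x and y vectors are the ones used in the str() example above.

```r
# Hypothetical helper: keep only the length-one elements of an htest object,
# so that every column of the resulting data frame is a plain vector.
htest_to_df <- function(h) {
  data.frame(method      = h$method,
             alternative = h$alternative,
             statistic   = unname(h$statistic),
             p.value     = h$p.value,
             stringsAsFactors = FALSE)
}

x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
y <- c(1.15, 0.88, 0.90, 0.74, 1.21)

wilcox_df <- htest_to_df(wilcox.test(x, y, alternative = "two.sided"))
ttest_df  <- htest_to_df(t.test(x, y, alternative = "two.sided", conf.level = .95))
str(wilcox_df)
```

A data frame built this way could then be passed to write.Alteryx in place of the raw test object. Elements such as conf.int or estimate have length two and would need to be split into separate columns first.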

Extract a column from lme4 summary in R

I was wondering what is the most efficient way to extract (not print like HERE) only the Std.Dev. column from the vc object below as a vector?
library(lme4)
library(nlme)
data(Orthodont, package = "nlme")
fm1 <- lmer(distance ~ age + (age|Subject), data = Orthodont)
vc <- VarCorr(fm1) ## extract only the `Std.Dev.` column as a vector
The structure of 'vc' suggests it is a list with single element 'Subject' and the 'stddev' is an attribute
str(vc)
#List of 1
# $ Subject: num [1:2, 1:2] 6.3334 -0.3929 -0.3929 0.0569
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:2] "(Intercept)" "age"
# .. ..$ : chr [1:2] "(Intercept)" "age"
# ..- attr(*, "stddev")= Named num [1:2] 2.517 0.239 ####
So, extract the attribute directly
attr(vc$Subject, "stddev")
and the residual standard deviation is an outside attribute
attr(vc, "sc")
#[1] 1.297364
If we combine them with c, we get a single vector
c(attr(vc$Subject, "stddev"), attr(vc, "sc"))
# (Intercept) age
# 2.5166317 0.2385853 1.2973640
Wrap with as.numeric/as.vector to remove the names, as it is a named vector.
Or use attributes (note that the residual standard deviation comes first here):
c(attributes(vc)$sc, attributes(vc$Subject)$stddev)
If you want the three elements in the column, you can use:
as.numeric(c(attr(vc[[1]], "stddev"), attr(vc, "sc")))

cluster prototypes of som results in object cannot be coerced to type 'double'

I am trying to follow this tutorial, more precisely this code:
groups = 3
iris.hc = cutree(hclust(dist(iris.som$codes)), groups)
# plot
plot(iris.som, type="codes", bgcol=rainbow(groups)[iris.hc])
#cluster boundaries
add.cluster.boundaries(iris.som, iris.hc)
However, the bit:
dist(iris.som$codes)
gives me:
(list) object cannot be coerced to type 'double'
Any ideas?
The command dist needs a numeric matrix as input, but the object iris.som$codes is a list, not a matrix:
str(iris.som$codes)
List of 1
$ : num [1:25, 1:4] -1.353 -0.933 -0.523 0.321 0.569 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:25] "V1" "V2" "V3" "V4" ...
.. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
Hence you must use dist(iris.som$codes[[1]]) in your code:
library("kohonen")
iris.sc = scale(iris[, 1:4])
iris.grid = somgrid(xdim = 5, ydim=5, topo="hexagonal")
iris.som = som(iris.sc, grid=iris.grid, rlen=100, alpha=c(0.05,0.01))
groups = 3 # number of clusters, as in the question
iris.hc = cutree(hclust(dist(iris.som$codes[[1]])), groups)
plot(iris.som, type="codes", bgcol=rainbow(groups)[iris.hc])
add.cluster.boundaries(iris.som, iris.hc)
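Tutorial code written for a version of the package where codes was a plain matrix (which is presumably why the tutorial worked as written) can be made tolerant of both layouts with a small helper. This is a base-R sketch only: first_codes is a hypothetical name, and the toy objects below stand in for iris.som$codes rather than being real SOM output.

```r
# Hypothetical helper: return the codebook matrix whether `codes`
# is a plain matrix or a list holding the matrix as its first element.
first_codes <- function(codes) {
  if (is.list(codes)) codes[[1]] else codes
}

# Toy stand-ins for iris.som$codes (25 units, 4 variables)
set.seed(1)
m <- matrix(rnorm(100), nrow = 25, ncol = 4)
codes_as_list   <- list(m)
codes_as_matrix <- m

d1 <- dist(first_codes(codes_as_list))    # list layout
d2 <- dist(first_codes(codes_as_matrix))  # matrix layout
clusters <- cutree(hclust(d1), 3)
table(clusters)
```

In the answer's code this would correspond to dist(first_codes(iris.som$codes)).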

Data imputation with preProcess in caret returns fewer observations than expected

I wonder why the preProcess function from R's caret package, used here to impute a dataset's missing values, returns fewer observations than the original dataset has?
For example:
library(caret)
t <- data.frame(seq_len(100000),seq_len(100000))
for (i in 1:100000)
{
if (i %% 10 == 0) t[i,1] <- NA;
if (i %% 100 == 0) t[i,2] <- NA
}
preProcValues <- preProcess(t, method = c("knnImpute"))
preProcValues will contain only 90000 observations of 2 variables while 100000 is expected.
From the documentation:
The function preProcess estimates the required parameters for each
operation and predict.preProcess is used to apply them to specific
data sets.
Here, preProcValues is not t after imputation; it contains the parameters required to perform the imputation on t using predict.preProcess.
You should not be expecting 100K observations in preProcValues.
Hint: Have a look at the source code to see what is going on under the hood with NA values
Using your example (modified to use method = "medianImpute" - See this question (and the above-mentioned source code) for why what you are trying to do wouldn't work with "knnImpute")
preProcValues <- preProcess(t, method = "medianImpute")
> preProcValues$dim[1]
#[1] 90000
Here we replace the NA values in t with the median (50K)
t2 <- predict(preProcValues, t)
> dim(t2)[1]
#[1] 100000
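The split between estimating parameters and applying them can be illustrated without caret. The following is a base-R analogy, not caret's implementation: fit_median and apply_median are hypothetical names, and the data is a smaller version of the question's example.

```r
# Toy data with missing values (same pattern as the question, but smaller)
t <- data.frame(V1 = as.numeric(1:100), V2 = as.numeric(1:100))
t$V1[seq(10, 100, by = 10)] <- NA
t$V2[100] <- NA

# "Fit" step: only estimates the parameters (here, the column medians)
fit_median <- function(df) vapply(df, median, numeric(1), na.rm = TRUE)

# "Apply" step: uses the stored parameters to fill in a data set
apply_median <- function(params, df) {
  for (nm in names(params)) {
    df[[nm]][is.na(df[[nm]])] <- params[[nm]]
  }
  df
}

params <- fit_median(t)         # a named vector of two medians, not a data set
t2 <- apply_median(params, t)   # the imputed data, same number of rows as t
dim(t2)
```

Here params plays the role of preProcValues and apply_median the role of predict: the fitted object is small, and the full-size result only appears after the apply step.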
preProcess does not return values, it simply sets up the whole preprocess model based on the provided data. So, you need to run predict (requiring also the RANN package), but even if you do so with your artificial data you'll get an error:
Error in FUN(newX[, i], ...) :
cannot impute when all predictors are missing in the new data point
as k-nn imputation cannot work on rows where both of your predictors are NA.
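To see how many rows are affected, one can count the rows in which every predictor is missing. This base-R sketch rebuilds the question's data without the loop; the column names V1 and V2 are an assumption, since the original columns were left with automatically generated names.

```r
# Rebuild the question's data: NA in column 1 for every 10th row,
# NA in column 2 for every 100th row
n <- 100000
t <- data.frame(V1 = seq_len(n), V2 = seq_len(n))
t$V1[seq(10, n, by = 10)]   <- NA
t$V2[seq(100, n, by = 100)] <- NA

# Rows where all predictors are missing: k-nn has nothing to compute a distance on
all_missing <- rowSums(is.na(t)) == ncol(t)
sum(all_missing)

# Rows with no missing value at all
sum(complete.cases(t))
```

The second count corresponds to the 90000 reported in the question, and removing the rows flagged by all_missing before calling predict is one way to avoid the error.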
Here's a demonstration with only 20 rows, for clarity and easy inspection:
library(caret)
t <- data.frame(seq_len(20),seq_len(20))
for (i in 1:20)
{
if (i %% 3 == 0) t[i,1] <- NA;
if (i %% 7 == 0) t[i,2] <- NA
}
names(t) <- c('V1', 'V2')
preProcValues <- preProcess(t, method = c("knnImpute"))
library(RANN)
t_imp <- predict(preProcValues, t)
When viewing the result, keep in mind that the methods "center" and "scale" have been automatically added to your preprocessing, even if you did not invoke them explicitly:
> str(preProcValues)
List of 19
$ call : language preProcess.default(x = t, method = c("knnImpute"))
$ dim : int [1:2] 12 2
$ bc : NULL
$ yj : NULL
$ et : NULL
$ mean : Named num [1:2] 10.5 10.5
..- attr(*, "names")= chr [1:2] "V1" "V2"
$ std : Named num [1:2] 6.25 6.14
..- attr(*, "names")= chr [1:2] "V1" "V2"
$ ranges : NULL
$ rotation : NULL
$ method : chr [1:3] "knnImpute" "scale" "center"
$ thresh : num 0.95
$ pcaComp : NULL
$ numComp : NULL
$ ica : NULL
$ k : num 5
$ knnSummary:function (x, ...)
$ bagImp : NULL
$ median : NULL
$ data : num [1:12, 1:2] -1.434 -1.283 -0.981 -0.83 -0.377 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "1" "2" "4" "5" ...
.. ..$ : chr [1:2] "V1" "V2"
..- attr(*, "scaled:center")= Named num [1:2] 10.5 10.5
.. ..- attr(*, "names")= chr [1:2] "V1" "V2"
..- attr(*, "scaled:scale")= Named num [1:2] 6.63 6.63
.. ..- attr(*, "names")= chr [1:2] "V1" "V2"
- attr(*, "class")= chr "preProcess"
