R - describe() output to a data frame


I want to create a data frame using the describe() function from the Hmisc package. The dataset under consideration is iris. The data frame should look like this:
Variable n missing unique Info Mean 0.05 0.1 0.25 0.5 0.75 0.9 0.95
Sepal.Length 150 0 35 1 5.843 4.6 4.8 5.1 5.8 6.4 6.9 7.255
Sepal.Width 150 0 23 0.99 3.057 2.345 2.5 2.8 3 3.3 3.61 3.8
Petal.Length 150 0 43 1 3.758 1.3 1.4 1.6 4.35 5.1 5.8 6.1
Petal.Width 150 0 22 0.99 1.199 0.2 0.2 0.3 1.3 1.8 2.2 2.3
Species 150 0 3
Is there a way to coerce the output of describe() to a data.frame? When I try, I get the error shown below:
library(Hmisc)
statistics <- describe(iris)
statistics[1]
first_vec <- statistics[1]$Sepal.Length
as.data.frame(first_vec)
#Error in as.data.frame.default(first_vec) : cannot coerce class "describe" to a data.frame
Thanks

The way to figure this out is to examine the objects with str():
data(iris)
library(Hmisc)
di <- describe(iris)
di
# iris
#
# 5 Variables 150 Observations
# -------------------------------------------------------------
# Sepal.Length
# n missing unique Info Mean .05 .10 .25 .50 .75 .90 .95
# 150 0 35 1 5.843 4.600 4.800 5.100 5.800 6.400 6.900 7.255
#
# lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
# -------------------------------------------------------------
# ...
# -------------------------------------------------------------
# Species
# n missing unique
# 150 0 3
#
# setosa (50, 33%), versicolor (50, 33%)
# virginica (50, 33%)
# -------------------------------------------------------------
str(di)
# List of 5
# $ Sepal.Length:List of 6
# ..$ descript : chr "Sepal.Length"
# ..$ units : NULL
# ..$ format : NULL
# ..$ counts : Named chr [1:12] "150" "0" "35" "1" ...
# .. ..- attr(*, "names")= chr [1:12] "n" "missing" "unique" "Info" ...
# ..$ intervalFreq:List of 2
# .. ..$ range: atomic [1:2] 4.3 7.9
# .. .. ..- attr(*, "Csingle")= logi TRUE
# .. ..$ count: int [1:100] 1 0 3 0 0 1 0 0 4 0 ...
# ..$ values : Named chr [1:10] "4.3" "4.4" "4.5" "4.6" ...
# .. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
# ..- attr(*, "class")= chr "describe"
# $ Sepal.Width :List of 6
# ...
# $ Species :List of 5
# ..$ descript: chr "Species"
# ..$ units : NULL
# ..$ format : NULL
# ..$ counts : Named num [1:3] 150 0 3
# .. ..- attr(*, "names")= chr [1:3] "n" "missing" "unique"
# ..$ values : num [1:2, 1:3] 50 33 50 33 50 33
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:2] "Frequency" "%"
# .. .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
# ..- attr(*, "class")= chr "describe"
# - attr(*, "descript")= chr "iris"
# - attr(*, "dimensions")= int [1:2] 150 5
# - attr(*, "class")= chr "describe"
We see that di is a list of lists. We can take it apart by looking at just the first sublist. You can convert that into a vector:
unlist(di[[1]])
# descript counts.n
# "Sepal.Length" "150"
# counts.missing counts.unique
# "0" "35"
# counts.Info counts.Mean
# "1" "5.843"
# counts..05 counts..10
# "4.600" "4.800"
# counts..25 counts..50
# "5.100" "5.800"
# counts..75 counts..90
# "6.400" "6.900"
# counts..95 intervalFreq.range1
# "7.255" "4.3"
# intervalFreq.range2 intervalFreq.count1
# "7.9" "1"
# ...
# values.H3 values.H2
# "7.6" "7.7"
# values.H1
# "7.9"
str(unlist(di[[1]]))
# Named chr [1:125] "Sepal.Length" "150" "0" "35" ...
# - attr(*, "names")= chr [1:125] "descript" "counts.n" "counts.missing" "counts.unique" ...
It is very long (125 elements). All elements have been coerced to a single, most inclusive type, namely character. It seems you want the 2nd through 13th elements (n through the .95 quantile):
unlist(di[[1]])[2:13]
# counts.n counts.missing counts.unique counts.Info
# "150" "0" "35" "1"
# counts.Mean counts..05 counts..10 counts..25
# "5.843" "4.600" "4.800" "5.100"
# counts..50 counts..75 counts..90 counts..95
# "5.800" "6.400" "6.900" "7.255"
Now you have something you can start to work with. But notice that this only seems to be the case for numerical variables; the factor variable Species is different:
unlist(di[[5]])
# descript counts.n counts.missing counts.unique
# "Species" "150" "0" "3"
# values1 values2 values3 values4
# "50" "33" "50" "33"
# values5 values6
# "50" "33"
In that case, it seems you only want elements two through four.
Using this process of discovery and problem solving, you can see how you'd take the output of describe apart and put the information you want into a data frame. However, this will take a lot of work. You'll presumably need to use loops and lots of if(){ ... } else{ ... } blocks. You might just want to code your own dataset description function from scratch.
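As a rough illustration of the "code your own description function" route, here is a minimal base R sketch. The function name describe_df and its column layout are hypothetical (they are not part of Hmisc), and it only covers the counts, the mean and the quantiles shown in the question, not the Info column:

```r
# Hypothetical helper (not part of Hmisc): one row of statistics per column.
describe_df <- function(data,
                        probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)) {
  rows <- lapply(names(data), function(v) {
    x <- data[[v]]
    present <- x[!is.na(x)]              # non-missing values only
    row <- data.frame(
      Variable = v,
      n        = length(present),
      missing  = sum(is.na(x)),
      unique   = length(unique(present)),
      stringsAsFactors = FALSE
    )
    if (is.numeric(x)) {
      row$Mean <- mean(present)
      q <- quantile(present, probs = probs, names = FALSE)
    } else {
      # Factors and characters get NA for the numeric summaries
      row$Mean <- NA_real_
      q <- rep(NA_real_, length(probs))
    }
    row[as.character(probs)] <- as.list(q)  # one column per quantile
    row
  })
  do.call(rbind, rows)
}

iris_summary <- describe_df(iris)
iris_summary
```

Because every row is built with the same columns, rbind() can stack them directly, and the factor column simply carries NA where a statistic does not apply.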

You can do this by using the stat.desc function from the pastecs package:
library(pastecs)
summary_df <- stat.desc(mydata)
summary_df is then a data frame of descriptive statistics, with one column per variable of mydata. See ?stat.desc for more information.

If you are looking for the R counterpart of Python's (pandas) describe(), base R's summary(iris) gives a similar overview.

Related

Predict function convert Probabilities between 0 to 1 and store in a dataframe

The Predict function code returns output like this.
library(e1071)
model <- svm(Species ~ ., data = iris, probability=TRUE)
pred <- predict(model, iris, probability=TRUE)
head(attr(pred, "probabilities"))
# setosa versicolor virginica
# 1 0.9803339 0.01129740 0.008368729
# 2 0.9729193 0.01807053 0.009010195
# 3 0.9790435 0.01192820 0.009028276
# 4 0.9750030 0.01531171 0.009685342
# 5 0.9795183 0.01164689 0.008834838
# 6 0.9740730 0.01679643 0.009130620
So I converted the predictions to a data frame:
pred_df <- as.data.frame(pred)
This returns output like the following (the values are made up):
1 Setosa
2 Versicolor
3 Virginica
4 Setosa
5 Setosa
But my preferred output would be something like this (have just made up the values),
Setosa Versicolor Virginica
1 0.62 0.11 0.27
2 0.41 0.55 0.04
pred is a factor and pred_df is a data frame.
I would like the data frame to hold the probabilities (decimals between 0 and 1) rather than the predicted class. Kindly help me with this.

To see what's going on you need to look inside the structure.
str(pred)
Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
- attr(*, "probabilities")= num [1:150, 1:3] 0.979 0.971 0.978 0.973 0.978 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
So
as.data.frame(attr(pred,"probabilities"))
should do what you want.
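To show the mechanics without fitting a model, here is a small sketch that builds a stand-in for pred by hand. The object pred_mock and its probability values are made up for illustration (they reuse the invented numbers from the question); only the "probabilities" attribute name comes from the str() output above:

```r
# Made-up probability matrix standing in for attr(pred, "probabilities")
probs <- matrix(c(0.62, 0.11, 0.27,
                  0.41, 0.55, 0.04),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("1", "2"),
                                c("setosa", "versicolor", "virginica")))

# A factor of predicted classes carrying the matrix as an attribute
pred_mock <- factor(c("setosa", "versicolor"), levels = colnames(probs))
attr(pred_mock, "probabilities") <- probs

# Pull the attribute out and convert it; the factor itself is not converted
prob_df <- as.data.frame(attr(pred_mock, "probabilities"))

# Optionally keep the predicted class next to the probabilities
prob_df$predicted <- as.character(pred_mock)
prob_df
```

The point is that as.data.frame(pred) only sees the factor values, while the numbers you want live in an attribute that has to be extracted explicitly.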

Display the lambda used in caret::BoxCoxTrans function

I'm using BoxCoxTrans function from the caret package:
library(caret)
library(purrr)
model1 <- apply(X = my.df, 2, BoxCoxTrans)
model2 <- purrr::map2(model1, my.df, function(x,y) predict(x,y))
trans.df <- as.data.frame(do.call(cbind, model2))
library(rcompanion)
plotNormalHistogram(trans.df)
print(trans.df)
It is working correctly and transforming the data, but I have no way of knowing which lambda value is used for the transformation.

You can find these values in model1. I'll show you how to get them using the iris data.
library(caret)
fudge <- 0.2
out <- lapply(iris[1:2], BoxCoxTrans, fudge = fudge) # instead of apply(..., margin = 2, ...)
Now look at the structure of out
str(out, 2)
#List of 2
# $ Sepal.Length:List of 6
# ..$ lambda : num -0.1
# ..$ fudge : num 0.2
# ..$ n : int 150
# ..$ summary :Classes 'summaryDefault', 'table' Named num [1:6] 4.3 5.1 5.8 5.84 6.4 ...
# .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
# ..$ ratio : num 1.84
# ..$ skewness: num 0.309
# ..- attr(*, "class")= chr "BoxCoxTrans"
# $ Sepal.Width :List of 6
# ..$ lambda : num 0.3
# ..$ fudge : num 0.2
# ..$ n : int 150
# ..$ summary :Classes 'summaryDefault', 'table' Named num [1:6] 2 2.8 3 3.06 3.3 ...
# .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
# ..$ ratio : num 2.2
# ..$ skewness: num 0.313
# ..- attr(*, "class")= chr "BoxCoxTrans"
Using base R you can use sapply and `[[` now as follows
sapply(out, `[[`, "lambda")
#Sepal.Length Sepal.Width
# -0.1 0.3
Since you use purrr, you might consider map and pluck
map_dbl(out, pluck, "lambda")
#Sepal.Length Sepal.Width
# -0.1 0.3
Thanks to @missuse's mindful comments, we can get the lambda actually used for the transformation as
library(dplyr)
lambda <- sapply(out, `[[`, "lambda") # the estimated lambdas extracted above
real_lambda <- case_when(between(lambda, -fudge, fudge) ~ 0,
between(lambda, 1 - fudge, 1 + fudge) ~ 1,
TRUE ~ lambda)
real_lambda <- setNames(real_lambda, names(lambda))
real_lambda
#Sepal.Length Sepal.Width
# 0.0 0.3
This is necessary because the function BoxCoxTrans has the argument fudge which is
a tolerance value: lambda values within +/-fudge will be coerced to 0 and within 1+/-fudge will be coerced to 1.
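The fudge rule can also be written in base R without dplyr. This is only a sketch: snap_lambda is a hypothetical helper name, and it treats the interval bounds as inclusive (like between() above), which may differ from how caret handles values exactly on the boundary:

```r
# Hypothetical helper: snap lambdas near 0 or 1 to exactly 0 or 1
snap_lambda <- function(lambda, fudge = 0.2) {
  snapped <- lambda
  snapped[abs(lambda) <= fudge] <- 0       # within +/- fudge of 0
  snapped[abs(lambda - 1) <= fudge] <- 1   # within +/- fudge of 1
  snapped                                  # names are preserved
}

# The two estimated lambdas from the str() output above
est_lambda <- c(Sepal.Length = -0.1, Sepal.Width = 0.3)
used_lambda <- snap_lambda(est_lambda, fudge = 0.2)
used_lambda
```

With fudge = 0.2, the -0.1 for Sepal.Length is snapped to 0 while the 0.3 for Sepal.Width is left alone, matching the real_lambda result shown above.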

Extract all model statistics from rms fits?

The rms package contains a wealth of useful statistical functions. However, I cannot find a proper way to extract certain fit statistics from the fitted object. Consider an example:
library(pacman)
p_load(rms, stringr, readr)
#fit
> (fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris))
Linear Regression Model
rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length +
Petal.Width + Species, data = iris)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 150 LR chi2 302.96 R2 0.867
sigma0.3068 d.f. 5 R2 adj 0.863
d.f. 144 Pr(> chi2) 0.0000 g 0.882
Residuals
Min 1Q Median 3Q Max
-0.794236 -0.218743 0.008987 0.202546 0.731034
Coef S.E. t Pr(>|t|)
Intercept 2.1713 0.2798 7.76 <0.0001
Sepal.Width 0.4959 0.0861 5.76 <0.0001
Petal.Length 0.8292 0.0685 12.10 <0.0001
Petal.Width -0.3152 0.1512 -2.08 0.0389
Species=versicolor -0.7236 0.2402 -3.01 0.0031
Species=virginica -1.0235 0.3337 -3.07 0.0026
So, the print function for the fit prints a lot of useful stuff including standard errors and adjusted R2. Unfortunately, if we inspect the model fit object, the values don't seem to be present anywhere.
> str(fit)
List of 19
$ coefficients : Named num [1:6] 2.171 0.496 0.829 -0.315 -0.724 ...
..- attr(*, "names")= chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ residuals : Named num [1:150] 0.0952 0.1432 -0.0731 -0.2894 -0.0544 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -71.5659 -1.1884 9.1884 -1.3724 -0.0587 ...
..- attr(*, "names")= chr [1:150] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ rank : int 6
$ fitted.values : Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign :List of 4
..$ Sepal.Width : int 2
..$ Petal.Length: int 3
..$ Petal.Width : int 4
..$ Species : int [1:2] 5 6
$ qr :List of 5
..$ qr : num [1:150, 1:6] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
..$ qraux: num [1:6] 1.08 1.02 1.11 1.02 1.02 ...
..$ pivot: int [1:6] 1 2 3 4 5 6
..$ tol : num 1e-07
..$ rank : int 6
..- attr(*, "class")= chr "qr"
$ df.residual : int 144
$ var : num [1:6, 1:6] 0.07828 -0.02258 -0.00198 0.01589 -0.02837 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ stats : Named num [1:6] 150 302.964 5 0.867 0.882 ...
..- attr(*, "names")= chr [1:6] "n" "Model L.R." "d.f." "R2" ...
$ linear.predictors: Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ call : language rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
$ terms :Classes 'terms', 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
.. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. .. .. ..$ : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
.. ..- attr(*, "term.labels")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
.. ..- attr(*, "order")= int [1:4] 1 1 1 1
.. ..- attr(*, "intercept")= num 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
.. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
.. .. ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. ..- attr(*, "formula")=Class 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
$ Design :List of 12
..$ name : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ label : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ units : Named chr [1:4] "" "" "" ""
.. ..- attr(*, "names")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ colnames : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Species=versicolor" ...
..$ mmcolnames : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
..$ assume : chr [1:4] "asis" "asis" "asis" "category"
..$ assume.code : int [1:4] 1 1 1 5
..$ parms :List of 1
.. ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
..$ limits : list()
..$ values : list()
..$ nonlinear :List of 4
.. ..$ Sepal.Width : logi FALSE
.. ..$ Petal.Length: logi FALSE
.. ..$ Petal.Width : logi FALSE
.. ..$ Species : logi [1:2] FALSE FALSE
..$ interactions: NULL
$ non.slopes : num 1
$ na.action : NULL
$ scale.pred : chr "Sepal.Length"
$ fail : logi FALSE
$ sformula :Class 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
- attr(*, "class")= chr [1:3] "ols" "rms" "lm"
There is a seven-year-old question on R-help in which the package author explains how to get these:
On Wed, 11 Aug 2010, david dav wrote:
Hi,
I would like to extract the coefficients of a logistic regression
(estimates and standard error as well) in lrm as in glm with
summary(fit.glm)$coef
Thanks
David
coef(fit)
sqrt(diag(vcov(fit)))
But these will not be very helpful except in the trivial case where
everything is linear, nothing interacts, and factors have two levels.
Frank
According to the package author, then, this solution is not optimal. This leaves one wondering how the displayed values are calculated. Tracing the code leads to a hunt through the undocumented package source (which is on GitHub). We begin with print.ols():
> rms:::print.ols
function (x, digits = 4, long = FALSE, coefs = TRUE, title = "Linear Regression Model",
...)
{
latex <- prType() == "latex"
k <- 0
z <- list()
if (length(zz <- x$na.action)) {
k <- k + 1
z[[k]] <- list(type = paste("naprint", class(zz)[1],
sep = "."), list(zz))
}
stats <- x$stats
...
Reading further we do find that e.g. R2 adj. is calculated in the print function:
rsqa <- 1 - (1 - r2) * (n - 1) / rdf
We also find some standard error calculations, though no p values.
se <- sqrt(diag(x$var))
z[[k]] <- list(type='coefmatrix',
list(coef = x$coefficients,
se = se,
errordf = rdf))
All the results are passed down further to prModFit(). We can look it up and find the p value calculation etc. Unfortunately, the print command returns NULL so these values are not available anywhere for programmatic reuse:
> x = print((fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)))
#printed output...
> x
NULL
How does one get all the statistics?

Here is a hack solution where we capture the output of the print command:
#parser
get_model_stats = function(x, precision=60) {
# remember old number formatting function
# (which would round and transforms p-values to formats like "<0.01")
old_format_np = rms::formatNP
# substitute it with a function which will print out as many digits as we want
assignInNamespace("formatNP", function(x, ...) formatC(x, format="f", digits=precision), "rms")
# remember old width setting
old_width = options('width')$width
# substitute it with a setting making sure the table will not wrap
options(width=old_width + 4 * precision)
# actually print the data and capture it
cap = capture.output(print(x))
# restore original settings
options(width=old_width)
assignInNamespace("formatNP", old_format_np, "rms")
#model stats
stats = c()
stats$R2.adj = str_match(cap, "R2 adj\\s+ (\\d\\.\\d+)") %>% na.omit() %>% .[, 2] %>% as.numeric()
#coef stats lines
coef_lines = cap[which(str_detect(cap, "Coef\\s+S\\.E\\.")):(length(cap) - 1)]
#parse
coef_lines_table = suppressWarnings(readr::read_table(coef_lines %>% stringr::str_c(collapse = "\n")))
colnames(coef_lines_table)[1] = "Predictor"
list(
stats = stats,
coefs = coef_lines_table
)
}
Example:
> get_model_stats(fit)
$stats
$stats$R2.adj
[1] 0.86
$coefs
# A tibble: 6 x 5
Predictor Coef S.E. t `Pr(>|t|)`
<chr> <dbl> <dbl> <dbl> <chr>
1 Intercept 2.17 0.280 7.8 <0.0001
2 Sepal.Width 0.50 0.086 5.8 <0.0001
3 Petal.Length 0.83 0.069 12.1 <0.0001
4 Petal.Width -0.32 0.151 -2.1 0.0389
5 Species=versicolor -0.72 0.240 -3.0 0.0031
6 Species=virginica -1.02 0.334 -3.1 0.0026
This still has issues. For example, the p-values are returned as character strings rather than numerics, and the example output above shows only 4 digits, which can cause problems in some situations. The updated code should extract digits up to arbitrary precision.
Be extra careful when using this with long variable names as those could wrap the table into multiple rows and introduce missing values (NA) in output even though the stats are in there!

Package broom is a great way to extract model info.
library(pacman)
library(rms)
library(broom)
fit = ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
data = iris)
tidy(summary.lm(fit))
# term estimate std.error statistic p.value
# 1 Intercept 2.1712663 0.27979415 7.760227 1.429502e-12
# 2 Sepal.Width 0.4958889 0.08606992 5.761466 4.867516e-08
# 3 Petal.Length 0.8292439 0.06852765 12.100867 1.073592e-23
# 4 Petal.Width -0.3151552 0.15119575 -2.084418 3.888826e-02
# 5 Species=versicolor -0.7235620 0.24016894 -3.012721 3.059634e-03
# 6 Species=virginica -1.0234978 0.33372630 -3.066878 2.584344e-03
glance(fit)
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC df.residual
# 1 0.8673123 0.862705 0.3068261 188.251 2.666942e-61 6 -32.55801 79.11602 100.1905 144
The object fit also contains some easily accessible info that you can get and store in a dataframe:
fit$coefficients
# Intercept Sepal.Width Petal.Length Petal.Width Species=versicolor Species=virginica
# 2.1712663 0.4958889 0.8292439 -0.3151552 -0.7235620 -1.0234978
fit$stats
# n Model L.R. d.f. R2 g Sigma
# 150.0000000 302.9635115 5.0000000 0.8673123 0.8820479 0.3068261
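The quantities that the print method computes on the fly (standard errors, t statistics, p-values, adjusted R2) can also be recomputed by hand from the stored pieces. The sketch below uses base lm() as a stand-in so that it runs without rms; for an ols fit the analogous inputs would presumably be fit$var, fit$stats["R2"] and fit$df.residual, as seen in the str() output of the question. The object names are illustrative:

```r
# Stand-in fit with base R; the formulas mirror those quoted from print.ols()
fit_lm <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
             data = iris)

est <- coef(fit_lm)
se  <- sqrt(diag(vcov(fit_lm)))   # standard errors from the covariance matrix
rdf <- df.residual(fit_lm)        # residual degrees of freedom
n   <- nobs(fit_lm)

t_val <- est / se
p_val <- 2 * pt(abs(t_val), df = rdf, lower.tail = FALSE)  # two-sided p-value

coef_table <- data.frame(term = names(est), estimate = est, std.error = se,
                         statistic = t_val, p.value = p_val, row.names = NULL)

r2     <- summary(fit_lm)$r.squared
r2_adj <- 1 - (1 - r2) * (n - 1) / rdf   # same formula as rsqa in print.ols()

coef_table
r2_adj
```

Everything stays numeric here, so the p-values are not truncated to strings such as "<0.0001".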

converting string to numeric in R

I have a problem with data conversion in R.
I have two data sets stored in variables named lung.X and lung.y; their structure is shown below.
> str(lung.X)
chr [1:86, 1:7129] " 170.0" " 104.0" " 53.7" " 119.0" " 105.5" " 130.0" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:86] "V3" "V4" "V5" "V6" ...
..$ : chr [1:7129] "A28102_at" "AB000114_at" "AB000115_at" "AB000220_at" ...
and
> str(lung.y)
num [1:86] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
lung.X is a matrix (86 rows, 7129 columns) and lung.y is a numeric vector (86 entries).
Does anyone know how to convert the data above into the format below?
> str(lung.X)
num [1:86, 1:7129] 170 104 53.7 119 105.5 130...
I thought I should do like this
lung.X <- as.numeric(lung.X)
but I got this instead
> str(lung.X)
num [1:613094] 170 104 53.7 119 105.5 130...
I am doing this because I need lung.X to be numeric only.
Thank you.

You could change the mode of your matrix to numeric:
## example data
m <- matrix(as.character(1:10), nrow=2,
dimnames = list(c("R1", "R2"), LETTERS[1:5]))
m
# A B C D E
# R1 "1" "3" "5" "7" "9"
# R2 "2" "4" "6" "8" "10"
str(m)
# chr [1:2, 1:5] "1" "2" "3" "4" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
mode(m) <- "numeric"
str(m)
# num [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
m
# A B C D E
# R1 1 3 5 7 9
# R2 2 4 6 8 10

Give this a try: m <- matrix(as.numeric(lung.X), nrow = 86, ncol = 7129)
If you need it in dataframe/list format, df <- data.frame(m)
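One detail worth checking with either approach is whether the row and column names survive. The sketch below uses a tiny made-up matrix (lung_like, with a few values and names borrowed from the question) and compares changing the storage mode in place with rebuilding the matrix while passing the dimnames along explicitly:

```r
# Tiny made-up character matrix imitating the layout of lung.X
lung_like <- matrix(c(" 170.0", " 104.0", " 53.7", " 119.0"),
                    nrow = 2,
                    dimnames = list(c("V3", "V4"),
                                    c("A28102_at", "AB000114_at")))

# Option 1: convert in place; dim and dimnames are kept
in_place <- lung_like
storage.mode(in_place) <- "double"

# Option 2: rebuild the matrix, re-attaching the dimnames explicitly
rebuilt <- matrix(as.numeric(lung_like),
                  nrow = nrow(lung_like),
                  dimnames = dimnames(lung_like))

str(in_place)
identical(in_place, rebuilt)
```

Without the dimnames argument, the rebuilt matrix would be numeric but would lose the sample and probe names.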

How to do the p-value to print only 4 digits instead of the scientific notation in R?

How can I make the p-value print with only 4 digits instead of in scientific notation?
I tried options(digits = 3, scipen = 12), but it didn't work.
Here's an example:
options(digits=3, scipen=12)
Oi <- c(A=321, B=712, C=44)
Ei <- c(A=203, B=28, C=6)
chisq.test(Oi, p=Ei, rescale.p=TRUE)

Not quite sure what you want here.
Thanks for the reproducible example: the output was
cc <- chisq.test(Oi,p=Ei,rescale.p=TRUE)
print(cc)
Chi-squared test for given probabilities
data: Oi
X-squared = 3090, df = 2, p-value < 0.00000000000000022
Inspecting the structure of the object with str(cc) reveals that the p-value in this case has underflowed to exactly zero:
List of 9
$ statistic: Named num 3090
..- attr(*, "names")= chr "X-squared"
$ parameter: Named num 2
..- attr(*, "names")= chr "df"
$ p.value : num 0
$ method : chr "Chi-squared test for given probabilities"
$ data.name: chr "Oi"
$ observed : Named num [1:3] 321 712 44
..- attr(*, "names")= chr [1:3] "A" "B" "C"
$ expected : Named num [1:3] 922.5 127.2 27.3
..- attr(*, "names")= chr [1:3] "A" "B" "C"
$ residuals: Named num [1:3] -19.8 51.8 3.2
..- attr(*, "names")= chr [1:3] "A" "B" "C"
$ stdres : Named num [1:3] -52.29 55.2 3.25
..- attr(*, "names")= chr [1:3] "A" "B" "C"
- attr(*, "class")= chr "htest"
I think if you want the exact p-value from this test you have to go a bit out of your way:
(pval <- pchisq(3090,2,lower.tail=FALSE,log.p=TRUE))
[1] -1545
So this is approximately 10^(pval/log(10)) = 10^(-671), far below R's minimum representable value, which is typically around 1e-308 (see .Machine$double.xmin).
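The conversion from the natural-log p-value to a base-10 exponent can be sketched as follows. The variable names are illustrative, and the mantissa/exponent split is just one way to print a number that cannot be stored as a double:

```r
# Natural log of the upper-tail probability, computed without underflow
log_p <- pchisq(3090, df = 2, lower.tail = FALSE, log.p = TRUE)

# Change of base: log10(p) = ln(p) / ln(10)
log10_p <- log_p / log(10)

# Split into an integer exponent and a mantissa in [1, 10)
exponent <- floor(log10_p)
mantissa <- 10^(log10_p - exponent)

# Assemble a printable string; the value itself is too small for a double
p_string <- sprintf("%.2fe%d", mantissa, exponent)
p_string
```

The exponent comes out as -671, in line with the estimate above.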

I'm not sure what you want either, but I'm guessing you're looking for something like <0.0001, similar to what SAS outputs. I'd use the format.pval function for that, or perhaps printCoefmat, depending on how many tests you have. eps is a tolerance; values below that are printed as < [eps].
Oi <- c(A=321, B=712, C=44)
Ei <- c(A=203, B=28, C=6)
tt <- chisq.test(Oi, p=Ei,rescale.p=T)
format.pval(tt$p.value, eps=0.0001)
# [1] "<0.0001"
ttp <- data.frame(Chisq=tt$statistic, p.value=tt$p.value)
rownames(ttp) <- "Oi vs Ei"
printCoefmat(ttp, has.Pvalue=TRUE, eps.Pvalue=0.0001)
# Chisq p.value
# Oi vs Ei 3090 <0.0001 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
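If the goal is literally four decimal places with a "<0.0001" floor, a small wrapper around sprintf() is another option. This is a sketch under that assumption; format_p4 is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: fixed 4-decimal p-values with a "<0.0001" floor
format_p4 <- function(p) {
  ifelse(p < 1e-4, "<0.0001", sprintf("%.4f", p))
}

formatted <- format_p4(c(0.5, 0.03888826, 0))
formatted
# "0.5000" "0.0389" "<0.0001"
```

Unlike options(scipen = ...), this does not depend on global settings, but it returns character strings, so it is meant for display only.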
