Access elements in summary() in R

I would like to access some elements of an Anova summary in R. I've been trying things like in this question Access or parse elements in summary() in R.
When I convert the summary to a string it shows something like this:
str(summ)
List of 1
$ :Classes 'anova' and 'data.frame': 2 obs. of 5 variables:
..$ Df : num [1:2] 3 60
..$ Sum Sq : num [1:2] 0.457 2.647
..$ Mean Sq: num [1:2] 0.1523 0.0441
..$ F value: num [1:2] 3.45 NA
..$ Pr(>F) : num [1:2] 0.022 NA
- attr(*, "class")= chr [1:2] "summary.aov" "listof"
How can I access the F value?
I've been trying things like summ[c('F value')] and I still can't get it to work.
Any help would be greatly appreciated!

The anova object is inside a list (the first line of the str output is List of 1), so you first need to extract that single element and then its "F value" column:
summ[[1]][["F value"]]
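For instance, with a model built on the built-in mtcars data (a stand-in for the asker's unspecified model):

```r
# fit a one-way ANOVA on built-in data (a stand-in for the asker's model)
fit  <- aov(mpg ~ factor(cyl), data = mtcars)
summ <- summary(fit)              # a list containing one 'anova' data frame

# drill into the single list element, then its "F value" column
f_vals <- summ[[1]][["F value"]]
f_vals[1]                         # the F statistic for the factor term
```

The column has one entry per table row, so the residuals row contributes an NA.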

As an addition to the answer above, I'd recommend the broom package when you want to access or reuse elements of a model object.
First, str does not convert the summary into a string; it shows the structure of your summary, which is a list (str is short for "structure").
The broom package enables you to save the info of your model object as a data frame, which is easier to manipulate. Check my simple example:
library(broom)
fit <- aov(mpg ~ vs, data = mtcars)
# check the summary of the ANOVA (its elements are awkward to access directly)
fit2 = summary(fit)
fit2
# Df Sum Sq Mean Sq F value Pr(>F)
# vs 1 496.5 496.5 23.66 3.42e-05 ***
# Residuals 30 629.5 21.0
# create a data frame of the ANOVA
fit3 = tidy(fit)
fit3
# term df sumsq meansq statistic p.value
# 1 vs 1 496.5279 496.52790 23.66224 3.415937e-05
# 2 Residuals 30 629.5193 20.98398 NA NA
# get F value (or any other values)
fit3$statistic[1]
#[1] 23.66224
I think for the specific example you provided you don't really need the broom method, but it will be really useful when you deal with more complicated model objects.

Related

Class of Data and how to do some data manipulation in R

I have a subset of a genetic dataset in which I want to run some correlations between the CpG markers.
I have inspected the class of this subset with class(data), and it shows that it's a
[1] "matrix" "array"
The structure str(data) also shows an output of the form
num [1:64881, 1:704] 0.0149 NA 0.0558 NA NA ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:64881] "cg11223003" NA "cg22629907" NA ...
..$ : chr [1:704] "200357150075_R01C01" "200357150075_R02C01" "200357150075_R03C01" "200357150075_R04C01" ...
It actually looks as though it were a data frame, but the class of the variable says otherwise, which is kind of confusing.
I need help manipulating the dataset into a matrix or data frame of the markers so that I can run the correlations.
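A minimal sketch of one usual approach, assuming (as the dimnames suggest) markers in rows and samples in columns, with toy data standing in for the real matrix:

```r
# toy stand-in for the 64881 x 704 methylation matrix:
# CpG markers in rows, samples in columns, with some NAs
set.seed(1)
data <- matrix(rnorm(50), nrow = 5,
               dimnames = list(paste0("cg", 1:5), paste0("sample", 1:10)))
data[1, 2] <- NA

# an object of class "matrix" "array" already is a matrix; cor() correlates
# columns, so transpose to put markers in columns and let use= handle the NAs
marker_cor <- cor(t(data), use = "pairwise.complete.obs")
dim(marker_cor)                   # 5 x 5 marker-by-marker correlations

# if a data frame is preferred for downstream work:
df <- as.data.frame(t(data))
```

With the real 64881-row matrix, correlating all markers at once would need a very large result, so subsetting rows before the cor() call is advisable.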

Using .mat data to do multiple linear regression in R

I have a dataset in a .mat file. Because most of my project is going to be in R, I want to analyze the dataset in R rather than Matlab. I have used the R.matlab library to read it into R, but I am struggling to convert the data to a data frame for further processing.
library(R.matlab)
> data <- readMat(paste(dataDirectory, 'Data.mat', sep=""))
> str(data)
List of 1
$ Data: num [1:32, 1:5, 1:895] 0.999 0.999 1 1 1 ...
- attr(*, "header")=List of 3
..$ description: chr "MATLAB 5.0 MAT-file, Platform: PCWIN, Created on: Fri Oct 18 11:36:04 2013 "
..$ version : chr "5"
..$ endian : chr "little"
I have tried the following code, found in other questions, but it does not do exactly what I want.
data = lapply(data, unlist, use.names=FALSE)
df <- as.data.frame(data)
> str(df)
'data.frame': 32 obs. of 4475 variables:
I want to convert it into a data frame with 5 variables (Y, X1, X2, X3, X4), but right now I get 32 observations of 4475 variables.
I do not know how to go further from here, as I have never worked with such a large dataset and couldn't find a relevant post. I am also new to R and coding, so please excuse me if I have some trouble with the answers. Any help would be greatly appreciated.
Thanks
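One possible reshaping, assuming the 5-long second dimension holds the variables (Y, X1..X4) and the other two dimensions both index observations; toy data stands in for the real array:

```r
# toy stand-in for the 32 x 5 x 895 array returned by readMat()
arr <- array(rnorm(32 * 5 * 895), dim = c(32, 5, 895))

# move the 5-long variable dimension last, then flatten the other two:
# each of the 32 * 895 cells becomes one row, each variable one column
m  <- aperm(arr, c(1, 3, 2))                 # now 32 x 895 x 5
df <- as.data.frame(matrix(m, ncol = 5))
names(df) <- c("Y", "X1", "X2", "X3", "X4")
str(df)                                      # 28640 obs. of 5 variables
```

Which dimension holds the variables depends on how the .mat file was written, so check a few known values against the Matlab original before trusting the layout.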

collapse data frame with embedded matrices [duplicate]

This question already has answers here:
aggregate() puts multiple output columns in a matrix instead
(1 answer)
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 4 years ago.
Under certain conditions, R generates data frames that contain matrices as elements. This requires some determination to do by hand, but happens e.g. with the results of an aggregate() call where the aggregation function returns multiple values:
set.seed(101)
d0 <- data.frame(g=factor(rep(1:2,each=20)), x=rnorm(20))
d1 <- aggregate(x~g, data=d0, FUN=function(x) c(m=mean(x), s=sd(x)))
str(d1)
## 'data.frame': 2 obs. of 2 variables:
## $ g: Factor w/ 2 levels "1","2": 1 2
## $ x: num [1:2, 1:2] -0.0973 -0.0973 0.8668 0.8668
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "m" "s"
This makes a certain amount of sense, but can make trouble for downstream processing code (for example, ggplot2 doesn't like it). The printed representation can also be confusing if you don't know what you're looking at:
d1
## g x.m x.s
## 1 1 -0.09731741 0.86678436
## 2 2 -0.09731741 0.86678436
I'm looking for a relatively simple way to collapse this object to a regular three-column data frame (either with names g, m, s, or with names g, x.m, x.s ...).
I know this problem won't arise with tidyverse (group_by + summarise), but am looking for a base-R solution.
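A common base-R fix is to rebuild the frame through do.call(data.frame, ...), which splits the embedded matrix into ordinary columns; using the same construction as in the question:

```r
set.seed(101)
d0 <- data.frame(g = factor(rep(1:2, each = 20)), x = rnorm(20))
d1 <- aggregate(x ~ g, data = d0, FUN = function(x) c(m = mean(x), s = sd(x)))

# data.frame() expands a matrix argument into one column per matrix column,
# so re-building the frame flattens the embedded matrix
d2 <- do.call(data.frame, d1)
str(d2)    # 'data.frame': 2 obs. of 3 variables: g, x.m, x.s
```

The names come out as g, x.m, x.s, matching one of the naming schemes the question asks for.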

issue with predict with glmnetUtils

I'm trying to use the glmnetUtils package from GitHub for a formula interface to glmnet, but predict is not estimating enough values.
library(nycflights13) # from GitHub
library(modelr)
library(dplyr)
library(glmnet)
library(glmnetUtils)
library(purrr)
fitfun = function(dF){
  cv.glmnet(arr_delay ~ distance + air_time + dep_time, data = dF)
}
gnetr2 = function(model, datavals){
  yvar = all.vars(formula(model)[[2]])
  print(paste('y variable:', yvar))
  print('observations')
  print(str(as.data.frame(datavals)[[yvar]]))
  print('predictions')
  print(str(predict(object = model, newdata = datavals)))
  stats::cor(stats::predict(object = model, newdata = datavals),
             as.data.frame(datavals)[[yvar]], use = 'complete.obs')^2
}
flights %>%
  group_by(carrier) %>%
  do({
    crossv_mc(., 4) %>%
      mutate(mdl = map(train, fitfun),
             r2 = map2_dbl(mdl, test, gnetr2))
  })
the output from gnetr2():
[1] "y variable: arr_delay"
[1] "observations"
num [1:3693] -33 -6 47 4 15 -5 45 16 0 NA ...
NULL
[1] "predictions"
num [1:3476, 1] 8.22 21.75 24.31 -7.96 -7.27 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:3476] "1" "2" "3" "4" ...
..$ : chr "1"
NULL
Error: incompatible dimensions
Any ideas what's going on? Your help is much appreciated!
This is an issue with the underlying glmnet package, but there's no reason that it can't be handled in glmnetUtils. I've just pushed an update that should let you use the na.action argument with the predict method for formula-based calls.
Setting na.action=na.pass (the default) will pad out the predictions to include NAs for rows with missing values; na.action=na.omit or na.exclude will drop these rows.
Note that the missingness of a given row may change depending on how much regularisation is done: if the NAs are for variables that get dropped from the model, then the row will be counted as being a complete case.
Also took the opportunity to fix a bug where the LHS of the formula contains an expression.
Give it a go with install_github("Hong-Revo/glmnetUtils") and tell me if anything breaks.
Turns out it's happening because there are NAs in the predictor variables, so predict() returns a shorter vector since na.action=na.exclude is applied.
Normally a solution would be predict(object, newdata, na.action=na.pass), but predict.cv.glmnet does not pass extra arguments on to predict.
Therefore the solution is to filter for complete cases before beginning:
flights = flights %>% filter(complete.cases(.))
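The same complete-cases filter can be sketched in base R on a toy frame (column names standing in for the flights variables):

```r
# toy frame with an NA in a predictor, standing in for the flights data
df <- data.frame(arr_delay = c(10, -5, 3, 22),
                 distance  = c(500, NA, 800, 650))

# drop rows with any NA up front, so that model fitting and predict()
# see the same observations and the vector lengths stay aligned
df2 <- df[complete.cases(df), ]
nrow(df2)   # 3
```

Filtering before the train/test split avoids the length mismatch entirely, at the cost of discarding partially observed rows.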

How to convert 4d array to 3d array subsetting on specific elements of one of the dimensions

Here is probably an easy question, but I am really struggling, so help is very much appreciated.
I have 4d data that I wish to transform into 3d data. The data has the following attributes:
lon <- 1:96
lat <- 1:73
lev <- 1:60
tme <- 1:12
data <- array(runif(96*73*60*12),
dim=c(96,73,60,12) ) # fill with random test values
What I would like to do is calculate the mean of the first few levels (say 1:6). The new data would be of the form:
new.data <- array(runif(96*73*12), dim=c(96,73,12)) # again just test data
But would contain the mean of the first 5 levels of data. At the moment the only way I have been able to make it work is to write a rather inefficient loop which extracts each of the first 5 levels and divides the sum of those by 5 to get the mean.
I have tried:
new.data <- apply(data, c(1,2,4), mean)
Which nicely gives me the mean of ALL the vertical levels but can't understand how to subset the 3rd dimension to get an average of only a few! e.g.
new.data <- apply(data, c(1,2,3[1:5],4), mean) # which returns
Error in ds[-MARGIN] : only 0's may be mixed with negative subscripts
I am desperate for some help!
apply with indexing (the proper use of "[") should be enough for the mean of the first six levels of the third dimension if I understand your terminology:
> str(apply(data[,,1:6,] , c(1,2,4), FUN=mean) )
num [1:96, 1:73, 1:12] 0.327 0.717 0.611 0.388 0.47 ...
This returns a 96 x 73 x 12 array.
In addition to the answer of @DWin, I would recommend the plyr package. The package provides apply-like functions. The analogue of apply is the plyr function aaply. The first two letters of a plyr function specify the input and output type; aa in this case means array in, array out.
> system.time(str(apply(data[,,1:6,], c(1,2,4), mean)))
num [1:96, 1:73, 1:12] 0.389 0.157 0.437 0.703 0.61 ...
user system elapsed
2.180 0.004 2.184
> library(plyr)
> system.time(str(aaply(data[,,1:6,], c(1,2,4), mean)))
num [1:96, 1:73, 1:12] 0.389 0.157 0.437 0.703 0.61 ...
- attr(*, "dimnames")=List of 3
..$ X1: chr [1:96] "1" "2" "3" "4" ...
..$ X2: chr [1:73] "1" "2" "3" "4" ...
..$ X3: chr [1:12] "1" "2" "3" "4" ...
user system elapsed
40.243 0.016 40.262
In this example it is slower than apply, but there are a few advantages. The package supports parallel processing, it supports outputting the results to a data frame or list (nice for plotting with ggplot2), and it can show a progress bar (nice for long-running processes). Although in this case I'd still go for apply because of performance.
More information regarding the plyr package can be found in this paper. Maybe someone can comment on the poor performance of aaply in this example?
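On the performance point, a vectorised alternative worth knowing (not mentioned in the answers above) is colMeans() after an aperm() that moves the level dimension to the front; sketched on the same toy array as the question:

```r
# same toy array as in the question
data <- array(runif(96 * 73 * 60 * 12), dim = c(96, 73, 60, 12))

# put the level dimension first, then colMeans() collapses it:
# colMeans() on an array averages over the first dimension only
fast <- colMeans(aperm(data[, , 1:6, ], c(3, 1, 2, 4)))
slow <- apply(data[, , 1:6, ], c(1, 2, 4), mean)
all.equal(fast, slow)    # TRUE, and colMeans avoids the per-cell function calls
```

Both give a 96 x 73 x 12 array; the colMeans version is typically much faster because it loops in C rather than calling mean() once per output cell.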