I've used the function lme() on a dataset of mine. After that, I used glht() to run multiple comparisons of means. The summary of glht() gives me four columns in the linear hypotheses, the last of which (Pr(>|z|)) is the one I'm interested in at the moment.
Because I'm dealing with 20 parameters, 4 different treatments, and 10 different groups, my plan is to create a "heat map" of the p-values obtained in the last column of summary(glht()).
My original plan was to copy and paste these values into Excel and build the heat map there. However, summary() in R separates the columns with variable runs of spaces, which does not paste cleanly into Excel.
I've got two questions, then: (1) is it possible to change summary() so that all the values are tab-delimited? (2) does anyone have a better/faster way of preparing this heat map? For this last question, I've thought about populating a matrix with the p-values in R as a starting point. However, I'm not sure how to extract this Pr(>|z|) column from the summary() output.
Many thanks in advance for the help!
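For what it's worth, the p-values can be pulled straight out of the summary object instead of being copied from the printed table. A minimal sketch, using the built-in Orthodont data as a stand-in for the real model (the fit, the contrast, and the file name are all placeholders):

library(nlme)
library(multcomp)
fit <- lme(distance ~ Sex, random = ~ 1 | Subject, data = Orthodont)
g <- glht(fit, linfct = mcp(Sex = "Tukey"))
pv <- as.numeric(summary(g)$test$pvalues)  # the Pr(>|z|) column as a plain numeric vector
write.table(data.frame(p.value = pv), "pvalues.txt", sep = "\t", row.names = FALSE)

From there the vector can be arranged into a parameters-by-treatments matrix with matrix() and either exported tab-delimited as above or plotted directly in R with image() or heatmap().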
I'm having a weird issue with printing the results of a summary function in R Markdown. I'm compiling an HTML document of the results of my analysis, and what I'm noticing is that the column of the summary that displays statistical significance is wrapped around below the rest of the table rather than staying in line with it. This seems to be driven by the length of a coefficient name that represents the interaction between a grouping variable and a numeric, log-transformed independent variable raised to a fractional exponent. Because my grouping variable has 27 levels, this results in a huge amount of wasted space.
The strange thing is that, looking at the actual printed summary table, there seems to be plenty of room to keep the significance column in line with the estimate, std. error, t value, and p value. I don't know what is driving it to wrap around. (The original post attached a figure of the knitted output showing the wrapped column.)
Below is some code that replicates the issue using the mtcars dataset. I've tested this and it should work with a plain html_document format.
data(mtcars)
# give two variables names as long as those in my real data
mtcars$supercalifragilisticexpialidocious1234a <- mtcars$cyl
mtcars$supercalifragilisticexpialidocious1234b <- mtcars$mpg
summary(lm(supercalifragilisticexpialidocious1234b ~ supercalifragilisticexpialidocious1234a, data = mtcars))
The reason the names are so long is that in the actual dataset they represent interaction coefficients between the scientific name of a taxonomic group (the longest of which is 16 characters) and a term raising the variable to a fractional exponent. So in reality the coefficient name looks like "groupTaxonname:I(log(var)^(1/2))"; it's not a single variable with a monstrously long name. This means I cannot simply shorten the coefficient names to make the table narrower: there is no real abbreviation for the group names, and omitting the rest of the coefficient name could misrepresent the variables used.
Given this, I'm wondering if there is some way to adjust the output of the summary function in R Markdown so that all the columns stay on one line.
Adding options(width=1000) in the same code chunk did the trick for me.
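For completeness, a minimal sketch of that fix applied to the mtcars example above; the only assumption is that the option is set in the same chunk, before the summary() call runs:

options(width = 1000)  # widen the printed console output so summary() doesn't wrap
summary(lm(supercalifragilisticexpialidocious1234b ~ supercalifragilisticexpialidocious1234a, data = mtcars))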
I am working with the multipatt function from the 'indicspecies' package and am unable to extract the summary values. Unfortunately I can't print the full summary, so I'm left with incomplete information for my model. The reason is the huge amount of output the summary needs to print (300,000 different species, 3 groups, 6 comparable combinations).
This is what happens when the summary is saved (preceding code included):

x <- multipatt(data, ...)
sumx <- summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. The function seems to be based on the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it appears unresolved (and it concerns the base indval function rather than multipatt).
I was wondering if anyone has experience with the indicspecies package and knows a way to extract the info from the summary.
It is possible to extract significance and other information from the other components saved in the model object, but it would be nice to just get a quick, complete overview of the data.
PS: I tried
options(max.print=1000000)
but this didn't solve it for me.
I used to capture the summary output for a multipatt object, but I don't any more, because the p-values reported are not corrected for multiple testing. To answer the OP's question: you can capture the summary output using capture.output,
e.g.
dat.multipatt.summary <- capture.output(summary(dat.multipatt, indvalcomp=TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the summary output by itself isn't helpful. To be clear, ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer for how to correct the p-values here https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
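As a rough illustration of the idea (the linked answer has the details), the raw permutation p-values live in the sign table of the fitted object, as the ?multipatt excerpt above describes, so a standard correction can be applied there. A minimal sketch, assuming the fitted multipatt object is called x:

x$sign$p.value.adj <- p.adjust(x$sign$p.value, method = "holm")  # correct for multiple testing
head(x$sign[order(x$sign$p.value.adj), ])  # strongest associations after correction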
I don't have any experience with this package, and since you haven't provided the data, it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check object.size, class, or something else on x to see whether it actually has any content.
Also, instead of accessing all the contents of summary(x) together, you can use @ to access slots of it (similar to $ in a data frame).
If you need further assistance, it'd be better to provide at least a small subset or some other sample data so that the community can work with it.
I want to ask some general questions on the possibility of regression in R.
For instance, I have data on two variables for 58 regions. I want to conduct the whole regression process, including assumption checks, model fitting, and diagnostics, for each region, but get the overall result with one command, i.e. without a loop.
I already know that I can use the lmList function to fit all the models in one go. However, I do not know whether it is possible to also get a normal Q-Q residual plot for all 58 regressions in one go.
Does anyone have an idea whether this is feasible? If so, what kind of functions might I need?
Depends what you mean by "one command", and why you want to avoid loops. How about:
library(nlme)
L <- lmList(y ~ x | region, data = yourData)
lapply(L, plot, which = 2)  # which = 2 requests the normal Q-Q plot
should work; however, it will spit out 58 plots in sequence. If you try to capture them all on a single page you'll probably get errors about too-small margins.
You have lots of other choices based on working with the list of regressions that lmList returns. For example,
library(plyr)
# plot.it = FALSE stops qqnorm() from drawing 58 plots as a side effect
qqDat <- ldply(L, function(x) as.data.frame(qqnorm(residuals(x), plot.it = FALSE)))
will give you a data frame containing the Q-Q plot information (expected and observed values) for each group in the data.
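Building on that, qqDat lends itself to a single faceted figure instead of 58 separate plots; a minimal sketch, assuming ldply's default grouping column name .id:

library(ggplot2)
ggplot(qqDat, aes(x = x, y = y)) +  # x: theoretical quantiles, y: sample quantiles
  geom_point(size = 0.8) +
  facet_wrap(~ .id) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")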
I am trying to automate logistic regression in R.
Basically, my source code will generate a new equation every day as the input data is updated (variables, data format, etc. are the same) and print out the significant variables with their corresponding coefficients.
When I use the step function, sometimes the resulting coefficients are not significant. Therefore, I want to update my set of coefficients and get rid of all the ones that are not significant enough.
Is there a function or automated way of doing it?
If not, the only way I can think of is writing a script in another language that takes the coefficients and corresponding p-values, checks significance, and reruns R accordingly. But even for that, do you know how I can get only the p-values and coefficients of the variables? I can print the whole summary of the regression result with the summary function, but I can't get at just the p-values.
Thank you very much
It's a bit hard for me without sample code and data, but you can subset based on variable values like this:
newdata <- data[which(data$p.value < 0.05), ]
You can inspect your R object using str (see ?str) to figure out how to select whatever you want to use in your subset, e.g. $p.value or $residuals.
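For instance, for a fitted glm the full coefficient table, p-values included, comes straight out of coef(summary(...)). A minimal sketch using the built-in mtcars data; the model and the 0.05 cutoff are only placeholders:

fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
ctab <- coef(summary(fit))  # columns: Estimate, Std. Error, z value, Pr(>|z|)
sig <- ctab[ctab[, "Pr(>|z|)"] < 0.05, , drop = FALSE]  # keep only significant rows
sig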
If this doesn't answer your question try submitting some sample code and data.
Best,
Eric
I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying to run k-means clustering and plot the result:
"'princomp' can only be used with more units than variables"
I then created a test set of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my dataset after running k-means on it, I can see the extra results column showing which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, I've been racking my brain over this for a week now.
Thanks guys.
The problem is that you have more variables than sample points, so the principal component analysis being done under the hood is failing.
In the help file for princomp it explains (read ?princomp):

‘princomp’ only handles so-called R-mode PCA, that is feature extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than variables. Every data point becomes its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this: if you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or have at most a single 1. And this is optimal, so PCA will do exactly that! But it is not very helpful.
You can use prcomp instead of princomp.
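To see the difference, a minimal sketch on random data with more columns than rows (the dimensions are arbitrary):

set.seed(1)
m <- matrix(rnorm(10 * 11), nrow = 10)  # 10 units (rows), 11 variables (columns)
# princomp(m)  # would fail: "'princomp' can only be used with more units than variables"
p <- prcomp(m)  # SVD-based, so it handles more variables than units
summary(p)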