R glm function changing my column names

I have what I think is a relatively simple question, but I can't seem to find the answer.
I have a 200 x 8 matrix temp and a 200 x 1 response matrix, BinomialVector.
When I run the following line:
CLog=glm(BinomialVector~temp,family= binomial(logit))
I am able to run the logistic regression. What I think this is really doing is BinomialVector ~ tempcol1 + tempcol2 + tempcol3 and so on.
However, when I call summary(CLog), the names of my factors have changed. If the first column was called trees, then it has changed to temptrees. Is there a way to prevent this?
As requested:
BinomialVector
[,1]
[1,] 0
[2,] 1
[3,] 1
[4,] 0
[5,] 0
[6,] 0
[7,] 1
temp
Net.Income.Y06. Return.on.Assets.Y06.
A 0.1929241 27.947
AA 1.1405694 12.427
AAP 1.0302481 17.117
ABT 2.1006512 13.826
Return.on.Investment.Y06. Total.Current.Assets.Y06.
A 39.844 0.9274886
AA 20.003 0.8830403
AAP 30.927 1.0439536
ABT 21.376 1.2447154
Total.Current.Liabilities.Y06. IntersectionMostAdmired.2006.
A 1.0812744 0.000
AA 0.9842055 7.255
AAP 1.1010472 0.000
ABT 0.7617044 6.715
This is what possible columns of my temp matrix look like. The reason I don't like using the additive notation is that the number of columns changes, as I am using this inside a user-defined function into which I feed the temp matrix. As for using a data frame, I was under the impression that a data frame is indeed the correct thing to use, but I seem to get an error when it is not as.matrix.

Can you post a representative subset of your data and also the actual output glm gives you for that subset?
Then it will be easier to diagnose/replicate.
In the meantime, I suggest you use a data frame instead of a matrix. Here is how:
mydf <- data.frame(y=BinomialVector, temp)
CLog <- glm(y ~ col1 + col2 + col3, data=mydf, family=binomial(logit))  # replace col1, col2, ... with the column names of temp
Matrices are a bad format to use as data sources for regression models (for one thing, they coerce all columns to the same data type, which may or may not be part of the problem here), so I never use them. If I had to guess, your model might be converting the matrix into one long vector, or perhaps there's a variable somewhere in there that has the value "tree". Without example data and output it's all guesswork, though; when you run the commands above, the nature of the problem will likely reveal itself.
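To illustrate the coercion point with a toy example (not your data):
m <- cbind(net_income = c(0.19, 1.14), roa = c(27.9, 12.4))
str(m)                                  # numeric matrix

m2 <- cbind(m, ticker = c("A", "AA"))   # add a character column
str(m2)                                 # every column is now character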

Using a data frame is the way to go. For one, it'll make getting predictions on new data much easier; and it'll also let you use nominal predictors (factors) without having to code up the dummy variables yourself. If the number of predictors is not fixed and you want to fit a model on all of them, use . in the formula.
df <- data.frame(y=BinomialVector, temp)
glm(y ~ ., family=binomial, data=df)
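As a sketch of the prediction point (assuming newtemp is a matrix or data frame of new observations with the same column names as temp):
fit <- glm(y ~ ., family=binomial, data=df)

# predicted probabilities for the new observations; the column names of
# newtemp must match the predictor names used when fitting
predict(fit, newdata=data.frame(newtemp), type="response")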

Related

How to label CCA-Plot with row.names in R

I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a dataframe (ls_Treat1) with cast(), showing plot treatments (AB, DB, DL etc.) as row names and species occurrences. The dataframe looks as follows:
     species 1  species 2  species 3
AB           0          3          1
DB           1          6          0
DL           3          4          2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1)<- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot, instead of the actual row names, even though I have performed it this way before and the plots normally showed the right labels. Does this have anything to do with my creating the data frame? I tried to change the treatments (characters) into numbers (integers or factors) but still, the plot won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!
The problem is that reshape::cast() does not produce a data.frame but something else. It claims to be a data.frame, but it is not. We do matrix algebra in cca and therefore cast the input to a matrix, which works for a standard data.frame but not for the object you supplied as input. In particular, when you remove the first column with ls_Treat1 <- ls_Treat1[,-1], you also remove the attributes that allow the names to be preserved – it would have worked without removing this column (if the reshape package was still loaded). Upgrading to the reshape2 package and using reshape2::acast() should solve it.
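A sketch of that suggestion (assuming fungi_ls has the Treatment and species columns used in the cast() call; you may need to pass value.var explicitly if the value column is not the last one):
library(reshape2)
library(vegan)

# acast() returns a proper matrix with Treatment as row names,
# so cca()/plot() keep those labels instead of "sit1", "sit2", ...
ls_Treat1 <- acast(fungi_ls, Treatment ~ species)
ca <- cca(ls_Treat1)
plot(ca, display = "sites")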

Does dummyVars predict really return a data frame?

The predict method for dummyVars from the caret library has documentation that clearly states:
"The predict function produces a data frame."
However, every example that I've produced appears to be only a matrix. The following code is an example of this:
>input<-data.frame(id=c(1, 2, 3), direction=c('up', 'down', 'down'))
>dmy<-dummyVars(" ~ .",input)
>output<-predict(dmy, newdata=input)
>output
id direction.down direction.up
1 1 0 1
2 2 1 0
3 3 1 0
>class(output)
[1] "matrix"
>is.data.frame(output)
[1] FALSE
>is.matrix(output)
[1] TRUE
Everything that I can see indicates that the documentation is wrong and that the predict function is really returning a matrix rather than a data frame. What's going on?
I think you are right and the documentation is wrong. If you look at the source code, the object that is returned is created at line 19 in the function body like this:
x <- model.matrix(Terms, m)
This is a matrix object. There is some more code in the function body after x is created, but all it does is alter the column names and drop the (Intercept) column. At no point is it converted to a data frame before it is returned.
Yes, I ran into the same thing. To coerce the result into a data frame, this worked for me:
output<-data.frame(predict(dmy, newdata = input))
print(class(output))

Chi-squared test of independence on all combinations of columns in a dataframe in R

This is my first time posting here and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging, and I'm very much a beginner on the programming/data-manipulation side of R.
I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):
> head(plots[,1:4])
Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1 1 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
8 0 0 0 0
I want to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that is a positive or negative association. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each species-by-species comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species-by-species tests that shows whether each combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only shows an association as positive if all expected values were greater than 5.
I have made a start by writing the following function:
CHI <- function(sppx, sppy) {
  test <- chisq.test(table(sppx, sppy))
  result <- c(test$statistic, test$p.value,
              sign((table(sppx, sppy) - test$expected)[2, 2]))
  return(result)
}
This returns the following:
> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)
X-squared
1.095869e-27 1.000000e+00 -1.000000e+00
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect
Now I'm trying to figure out a way to apply this function to each species-by-species combination in the data frame. I essentially want R to take each column, apply the CHI function to that column and each other column in sequence, and so on through all the columns, dropping each column from the data frame once it has been processed so that the same species pair is not tested twice. I have tried various approaches using for loops or apply functions, but have not been able to figure this out.
I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.
You need the combn function to find all the combinations of the columns and then apply them to your function, something like this:
apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))
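If you want to keep track of which species pair each result belongs to, a small extension of the same idea (a sketch reusing the CHI function and the plots data frame):
pairs <- combn(ncol(plots), 2)
res <- apply(pairs, 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))

# label the results: each column of res corresponds to one species pair
rownames(res) <- c("statistic", "p.value", "sign")
colnames(res) <- paste(colnames(plots)[pairs[1, ]],
                       colnames(plots)[pairs[2, ]], sep=" x ")
t(res)   # one row per species pair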
I think you are looking for something like this. I used the iris dataset.
require(datasets)
ind<-combn(NCOL(iris),2)
lapply(1:NCOL(ind), function (i) CHI(iris[,ind[1,i]],iris[,ind[2,i]]))
The R code below runs a chi-squared test for every categorical variable (every factor) of a data frame against a given variable (the x or y chisq.test parameter is kept fixed, i.e. explicitly defined):
Define your variable
Change df$variable1 to your desired factor variable and df to the data frame that contains all the factor variables to be tested against the given df$variable1.
Define your data frame
A new data frame (df2) is created that will contain the chi-squared statistic, degrees of freedom, and p-value for each variable-vs-data-frame comparison.
The code was assembled and adapted from similar Stack Overflow posts, none of which produced my desired outcome on its own.
Chi-squared statistic / df / p-value for variable vs. data frame; the "2" parameter defines column-wise comparisons (check the MARGIN argument of apply).
df2 <- t(round(cbind(apply(df, 2, function(x) {
  ch <- chisq.test(df$variable1, x)
  c(unname(ch$statistic), ch$parameter, ch$p.value)
})), 3))

Way to extract data from lm-object before function is applied?

let me directly dive into an example to show my problem:
rm(list=ls())
n <- 100
df <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n) )
fm <- lm(y ~ x1 + poly(x2, 2), data=df)
Now, I would like to have a look at the previously used data. This is almost available by using
temp.data <- fm$model
However, x2 will have been split up into poly(x2, 2), which will itself be a data frame as it contains a value for x2 and x2^2. Note that it may seem as if x2 is contained here, but since the polynomial uses orthogonal components, temp.data$x2 is not the same as df$x2. This can also be seen if you compare the variables visually after, say, the following: new.dat <- cbind(df, fm$model).
Now, to some questions:
First, and most importantly, is there a way to retrieve x2 from the lm object in its original form? Or more generally, if some function f has been applied to some variable in the lm formula, can the underlying variables be extracted from the lm object (without doing case-specific math)? Note that I know I could retrieve the data by other means, but I wonder if I can extract it from the lm object itself.
Second, on a more general note, since I did explicitly not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?
Third, the command head(new.dat) shows me that x2 has been split up into two components. What I see when I type View(new.dat) is, however, only one column. This strikes me as puzzling and mind-boggling. How can two columns be represented as one, and why is there a difference between head and View? If anyone can explain, I would be highly indebted!
If these questions are too basic, please apologize. In this case, I would appreciate any pointers to relevant manuals where this is explained.
Thanks in advance!
Good question, but this is difficult. fm$model is a weird data frame, of a type that would be hard for a user to construct, but which R sometimes generates internally. Check out the first few lines of str(fm$model), which show you that it's a data frame whose third component is an object of class poly with dimensions (100,2) -- i.e. something like a matrix:
## 'data.frame': 100 obs. of 3 variables:
## $ y : num -0.5952 -1.9561 1.8467 -0.2782 -0.0278 ...
## $ x1 : num 0.423 -1.539 -0.694 0.254 -0.13 ...
## $ poly(x2, 2): poly [1:100, 1:2] 0.0606 -0.0872 0.0799 -0.1068 -0.0395 ...
If you're still working in the environment from which lm was called in the first place, and if lm was called using the data argument, you can use eval(getCall(fm)$data) to get the original data. If things are being passed in and out of functions, or if someone used lm on independent objects in the environment, you're probably out of luck. If you get in trouble you can try
eval(getCall(fm)$data, environment(formula(fm)))
but things rapidly start getting harder.
I don't fully understand the logic of storing the processed model frame rather than the raw data, but I think it has to do with the construction of the terms object for the linear model -- each element in the stored model frame corresponds to an element of the terms object. I don't really understand the distinction between factors -- which are post-processed by model.matrix into sets of columns of dummy variables -- and transformed data (e.g. log(x)) or special objects like polynomial or spline bases ...
The question is how badly you need it. If you look at the structure of fm$model$poly, at the end you will see something like this:
attr(,"coefs")
attr(,"coefs")$alpha
[1] 0.06738858 0.10887048
attr(,"coefs")$norm2
[1] 1.00000 100.00000 93.96666 155.01387
I suppose these coefficients could be used to restore your original data from poly. See the source code of the poly function (either page(poly) or just type poly in the console); it looks like computing the polynomials might be reversible. But why bother doing it? I can think of two reasons: (1) you have lost the original data and this is the only way to restore it; (2) you want to understand how R computes orthogonal polynomials.
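If you really do need it, here is a sketch of the inversion for the linear term, based on how stats:::predict.poly applies the stored coefficients (worth verifying against your own data):
p  <- fm$model[["poly(x2, 2)"]]
cf <- attr(p, "coefs")

# the first orthogonal column is (x2 - alpha[1]) / sqrt(norm2[3]),
# so undoing that scaling recovers the original variable
x2_recovered <- p[, 1] * sqrt(cf$norm2[3]) + cf$alpha[1]

all.equal(as.numeric(x2_recovered), df$x2)   # should be TRUE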
Second, on a more general note, since I did explicitly not ask for
model.matrix(fm), why do I get data that has been manipulated? What is
the underlying philosophy behind that? Does anyone know?
Do you mean, why is data saved with the lm object at all? Just in case, I suppose. You can easily switch it off:
fm <- lm(y ~ x1 + poly(x2, 2), data=df, model=FALSE)
Or why are the data "manipulated"? I.e., why is poly(x2, 2) saved with the data instead of the original x2? My understanding is that you requested this yourself. The poly(x2, 2) part is first evaluated and then passed to lm, so lm never even sees the original x2.
edit - to answer the comment below in a more convenient way
For instance, using factor(f) for some additional factor variable does
not get translated into a data frame being stored in fm$model. Only
the actual variable f is being stored in fm$model, whereas in this
case with poly, some transformation is stored. This puzzles me.
I think you've missed something here; the behaviour is the same for both poly and factor.
> df <- data.frame(a=1:5, b=2:6, c=rnorm(5))
> fm <- lm(c~ a + factor(b), df)
> fm$model
c a factor(b)
1 0.5397541 1 2
2 0.9108087 2 3
3 0.1819442 3 4
4 -0.9293893 4 5
5 0.1404305 5 6
> fm$model$factor
[1] 2 3 4 5 6
Levels: 2 3 4 5 6
Warning message:
In `$.data.frame`(fm$model, factor) : Name partially matched in data frame
You can see that fm$model has factor(b) instead of b, and fm$model$factor is indeed a factor, not the original integer variable. (The warning is because the column name is actually factor(b); I used fm$model$factor to avoid typing something as ugly as fm$model$`factor(b)`.)

R unexpected NA output from RandomForest

I'm working with a data set that has a lot of NA's. I know that the first 6 columns do NOT have any NA's. Since the first column is an ID column I'm omitting it.
I run the following code to select only lines that have values in the response column:
sub1 <- TrainingData[which(!is.na(TrainingData[,70])),]
I then use sub1 as the data set in a randomForest using this code:
set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
                   do.trace=TRUE, importance=TRUE, ntree=10, forest=TRUE)
then I run this code to check the output for NA's:
> length(which(is.na(RF$predicted)))
[1] 65
I can't figure out why I'd be getting NA's if the data going in is clean.
Any suggestions?
I think you should use more trees. The predicted values are predictions for the out-of-bag set, and if the number of trees is very small, some cases are never part of any out-of-bag set, because that set is formed randomly.
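A quick way to check this (a sketch reusing the same sub1 object): with more trees every row should end up out-of-bag at least once, so RF$predicted should no longer contain NAs; alternatively, predict() on the fitted forest does not rely on out-of-bag votes at all.
library(randomForest)

set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
                   importance=TRUE, ntree=500, forest=TRUE)

sum(is.na(RF$predicted))      # should now be 0

# ordinary (non-out-of-bag) predictions for the training rows
head(predict(RF, sub1[, 2:6]))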
