I'm trying to use the glmnetUtils package from GitHub for a formula interface to glmnet, but predict() is not estimating enough values.
library(nycflights13) # from GitHub
library(modelr)
library(dplyr)
library(glmnet)
library(glmnetUtils)
library(purrr)
fitfun = function(dF) {
  cv.glmnet(arr_delay ~ distance + air_time + dep_time, data = dF)
}
gnetr2 = function(model, datavals) {
  yvar = all.vars(formula(model)[[2]])
  print(paste('y variable:', yvar))
  print('observations')
  print(str(as.data.frame(datavals)[[yvar]]))
  print('predictions')
  print(str(predict(object = model, newdata = datavals)))
  stats::cor(stats::predict(object = model, newdata = datavals),
             as.data.frame(datavals)[[yvar]], use = 'complete.obs')^2
}
flights %>%
  group_by(carrier) %>%
  do({
    crossv_mc(., 4) %>%
      mutate(mdl = map(train, fitfun),
             r2 = map2_dbl(mdl, test, gnetr2))
  })
The output from gnetr2():
[1] "y variable: arr_delay"
[1] "observations"
num [1:3693] -33 -6 47 4 15 -5 45 16 0 NA ...
NULL
[1] "predictions"
num [1:3476, 1] 8.22 21.75 24.31 -7.96 -7.27 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:3476] "1" "2" "3" "4" ...
..$ : chr "1"
NULL
Error: incompatible dimensions
Any ideas what's going on? Your help is much appreciated!
This is an issue with the underlying glmnet package, but there's no reason that it can't be handled in glmnetUtils. I've just pushed an update that should let you use the na.action argument with the predict method for formula-based calls.
Setting na.action=na.pass (the default) will pad out the predictions to include NAs for rows with missing values; na.action=na.omit or na.exclude will drop these rows.
Note that the missingness of a given row may change depending on how much regularisation is done: if the NAs are for variables that get dropped from the model, then the row will be counted as being a complete case.
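For example (a quick sketch, assuming the updated GitHub version and a model fitted via the formula interface as in your fitfun):
mdl <- cv.glmnet(arr_delay ~ distance + air_time + dep_time, data = flights)
p_pad  <- predict(mdl, newdata = flights, na.action = na.pass)  # one row per observation, NA where predictors are missing
p_drop <- predict(mdl, newdata = flights, na.action = na.omit)  # complete cases only, hence shorter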
Also took the opportunity to fix a bug where the LHS of the formula contains an expression.
Give it a go with install_github("Hong-Revo/glmnetUtils") and tell me if anything breaks.
Turns out it's happening because there are NAs in the predictor variables, so predict() returns a shorter vector, since na.action=na.exclude.
Normally a solution would be to use predict(object, newdata, na.action=na.pass), but predict.cv.glmnet does not accept other arguments to predict.
Therefore the solution is to filter for complete cases before beginning:
flights <- flights %>% filter(complete.cases(.))
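To see the mechanism in isolation, the same length mismatch can be reproduced with lm(), whose predict method does accept na.action (a minimal sketch with made-up data):
d <- data.frame(y = rnorm(10), x = c(rnorm(9), NA))
m <- lm(y ~ x, data = d)
length(predict(m, newdata = d))                       # 10: na.pass (the default) pads with NA
length(predict(m, newdata = d, na.action = na.omit))  # 9: the incomplete row is dropped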
This question has been asked before, but the solutions posed only partially solve my problem, and I've been working on this for days now. I felt it was time to seek help, even if the topic has been addressed previously. I apologize for any inconvenience.
I have a very large data.frame in R with 6288 observations of 11 variables. I want to run a Shapiro test by group on each variable, but grouped by two different factors (Number and Treatment). A much reduced sample data set with one variable is provided for example:
data <- data.frame(Number = c(1,1,1,1,1,1,1,1,1,1,1,1,
                              2,2,2,2,2,2,2,2,2,2,2,2),
                   Treatment = c("High","High","High","High","High","High",
                                 "Low","Low","Low","Low","Low","Low",
                                 "High","High","High","High","High","High",
                                 "Low","Low","Low","Low","Low","Low"),
                   FW = c(746,500,498,728,626,580,1462,738,1046,568,320,578,
                          654,664,660,596,1110,834,486,548,688,776,510,788))
I want to run a Shapiro test on FW by Number and by Treatment, so I'd have a test for 1High, 1Low, 2High, 2Low, etc. I'd like to have data for both the W statistic and the P-value. The original dataset contains 16 observations per group (1High,1Low,etc.; total groups=400), and an occasional NA; this sample dataset contains 6 observations per group (1High, 1Low, 2High, 2Low; groups=4).
The following code was previously posted as a solution to this problem of shapiro tests by groups:
res <- aggregate(cbind(P.value = data$FW) ~ data$Number + data$Treatment, data, FUN = shapiro.test)
I've also experimented with a number of other ways of grouping, but nothing seems to work. The above code comes closest.
The code above using aggregate groups my data appropriately, and gives me the W statistic, but it won't give me the P value (the column header says "P.value", but this is not the P value, it's the W statistic, I've confirmed this several ways). It also gives me the following warning message:
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
When I did a Google search for this warning, the results suggest it is a bug in the data.frame, but I can't figure out how to solve it. I'm not even sure it really is a bug in this case.
Can anyone help by providing some insight into the warning message, or another way to do the Shapiro test by group?
You're getting that error because shapiro.test returns a list and aggregate expects the result of the aggregation to be a vector or a single number.
aggregate sees the list, takes the first element of the list by default, and tells you why it's unhappy (in admittedly vague terms). But it still gives you the Shapiro-Wilk statistic since that's the first element of the list returned from shapiro.test.
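You can check this for yourself by looking at what shapiro.test() returns (a quick sketch on random data):
res <- shapiro.test(rnorm(20))
names(res)     # "statistic" "p.value" "method" "data.name"
res$statistic  # W, the first element -- this is what aggregate keeps
res$p.value    # what you actually want as well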
You can make a slight modification to your existing code that will get you what you want without issue:
aggregate(formula = FW ~ Number + Treatment,
          data = data,
          FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
# Number Treatment FW.W FW.V2
# 1 1 High 0.88995051 0.31792857
# 2 2 High 0.78604502 0.04385663
# 3 1 Low 0.93305840 0.60391888
# 4 2 Low 0.86456934 0.20540230
Note that the rightmost columns correspond to the statistic and p-value.
This is directly extracting the statistic and p-value from the list, thereby making the result of aggregation a single vector, which makes aggregate happy.
Another option would be to use the data.table package, available from CRAN.
library(data.table)
DT <- data.table(data)
DT[,
.(W = shapiro.test(FW)$statistic, P.value = shapiro.test(FW)$p.value),
by = .(Number, Treatment)]
# Number Treatment W P.value
# 1: 1 High 0.8899505 0.31792857
# 2: 1 Low 0.9330584 0.60391888
# 3: 2 High 0.7860450 0.04385663
# 4: 2 Low 0.8645693 0.20540230
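If you'd rather call shapiro.test only once per group, the j expression can be a block (a stylistic variant with the same result):
DT[, {
     st <- shapiro.test(FW)
     .(W = st$statistic, P.value = st$p.value)
   },
   by = .(Number, Treatment)]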
The dplyr package is handy for groupwise operations:
library(dplyr)
data %>%
  group_by(Number, Treatment) %>%
  summarise(statistic = shapiro.test(FW)$statistic,
            p.value = shapiro.test(FW)$p.value)
Number Treatment statistic p.value
1 1 High 0.8899505 0.31792857
2 1 Low 0.9330584 0.60391888
3 2 High 0.7860450 0.04385663
4 2 Low 0.8645693 0.20540230
The simple dplyr answer didn't do it for me, as it did not run the Shapiro test within each group but only once on the whole data, so here's my own solution using nesting (shapiro_test() comes from the rstatix package, and groupvar / quantvar below hold the relevant column names as strings):
library(dplyr); library(purrr); library(tidyr)
library(rstatix)  # for shapiro_test()

groupvar <- "Treatment"  # grouping column
quantvar <- "FW"         # variable to test

shapiro <- data %>%
  group_by(!!sym(groupvar)) %>%
  group_nest() %>%
  mutate(shapiro = map(.data$data, ~ shapiro_test(.x, !!sym(quantvar)))) %>%
  select(-data) %>%
  unnest(cols = shapiro) %>%
  print()
I just made up a data set to test the function mlogit, which fits a multinomial logistic regression model.
The data is simply:
head(dat)
y x1 x2 x3
1 4 1 18 4
2 5 1 20 5
3 2 1 25 3
4 3 0 26 6
5 4 0 26 8
6 3 1 27 4
Then when I type
fit <- mlogit(y ~ x1 + x2 + x3, data=dat)
the following message appears:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
invalid 'row.names' length
Does anyone know why or how to solve it?
The help states:
The ‘data’ argument may be an ordinary ‘data.frame’. In this case,
some supplementary arguments should be provided and are passed to
‘mlogit.data’.
You have not given any supplementary arguments. Note that I consider this poor documentation because it does not state which supplementary arguments should be provided.
From the examples, it seems that "shape" and "choice" should at least be set:
# a data.frame in wide format with two missing prices
Fishing2 <- Fishing
Fishing2[1, "price.pier"] <- Fishing2[3, "price.beach"] <- NA
mlogit(mode~price+catch|income, Fishing2, shape="wide", choice="mode", varying = 2:9)
# a data.frame in long format with three missing lines
data("TravelMode", package = "AER")
Tr2 <- TravelMode[-c(2, 7, 9),]
mlogit(choice~wait+gcost|income+size, Tr2, shape = "long",
chid.var = "individual", alt.var="mode", choice = "choice")
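As a rough, untested sketch for your own data (your full data isn't shown, so treat these arguments as assumptions): with one row per individual, y as the chosen alternative, and only individual-specific covariates, something along these lines should work:
library(mlogit)
dat$y <- factor(dat$y)
fit <- mlogit(y ~ 0 | x1 + x2 + x3, data = dat, shape = "wide", choice = "y")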
By the way, welcome to stackoverflow! Here are some tips on writing a better question and thus increasing the chance of a good answer.
you should state the package from which your command comes. I'm assuming it is from the mlogit package, but a command of that name could come from another package.
you should give a minimal example. You give the output of the head command, but it's not clear if the error can be reproduced with that. library(mlogit) should also be given in your minimal example.
you should read the help for the command. Help files can be intimidating and very technical, but you don't have to understand everything in them. In your example, I'm guessing that "some supplementary arguments should be provided" would have jumped out at you. In case you're not sure how to access help for the command mlogit, you can use ?mlogit or help(mlogit).
Let me dive directly into an example to show my problem:
rm(list=ls())
n <- 100
df <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n) )
fm <- lm(y ~ x1 + poly(x2, 2), data=df)
Now, I would like to have a look at the previously used data. This is almost available by using
temp.data <- fm$model
However, x2 will have been split up into poly(x2, 2), which is itself a matrix-like object, as it contains one column per polynomial term. Note that it may seem as if x2 is contained here, but since the polynomial uses orthogonal components, temp.data$x2 is not the same as df$x2. This can also be seen if you compare the variables visually after, say, the following: new.dat <- cbind(df, fm$model).
Now, to some questions:
First, and most importantly, is there a way to retrieve x2 from the lm object in its original form? Or more generally, if some function f has been applied to some variable in the lm formula, can the underlying variables be extracted from the lm object (without doing case-specific math)? Note that I know I could retrieve the data by other means, but I wonder if I can extract it from the lm object itself.
Second, on a more general note, since I explicitly did not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?
Third, the command head(new.dat) shows me that x2 has been split up into two components. What I see when I type View(new.dat) is, however, only one column. This strikes me as puzzling and mind-boggling. How can two columns be represented as one, and why is there a difference between head and View? If anyone can explain, I would be highly indebted!
If these questions are too basic, I apologize. In that case, I would appreciate any pointers to relevant manuals where this is explained.
Thanks in advance!
Good question, but this is difficult. fm$model is a weird data frame, of a type that would be hard for a user to construct, but which R sometimes generates internally. Check out the first few lines of str(fm$model), which show you that it's a data frame whose third component is an object of class poly with dimensions (100,2) -- i.e. something like a matrix:
## 'data.frame': 100 obs. of 3 variables:
## $ y : num -0.5952 -1.9561 1.8467 -0.2782 -0.0278 ...
## $ x1 : num 0.423 -1.539 -0.694 0.254 -0.13 ...
## $ poly(x2, 2): poly [1:100, 1:2] 0.0606 -0.0872 0.0799 -0.1068 -0.0395 ...
If you're still working in the environment from which lm was called in the first place, and if lm was called using the data argument, you can use eval(getCall(fm)$data) to get the original data. If things are being passed in and out of functions, or if someone used lm on independent objects in the environment, you're probably out of luck. If you get in trouble you can try
eval(getCall(fm)$data, environment(formula(fm)))
but things rapidly start getting harder.
I don't fully understand the logic of storing the processed model frame rather than the raw data, but I think it has to do with the construction of the terms object for the linear model -- each element in the stored model frame corresponds to an element of the terms object. I don't really understand the distinction between factors -- which are post-processed by model.matrix into sets of columns of dummy variables -- and transformed data (e.g. log(x)) or special objects like polynomial or spline bases ...
The question is how badly you need it. If you look at the structure of fm$model$poly, then at the end you will see something like this:
attr(,"coefs")
attr(,"coefs")$alpha
[1] 0.06738858 0.10887048
attr(,"coefs")$norm2
[1] 1.00000 100.00000 93.96666 155.01387
I suppose these coefficients could be used to restore your original data from poly. See the source code for the poly function (either page(poly) or just type poly in the console); it looks like computing the polynomials might be reversible. But why bother doing it? I can think of two reasons: (1) you have lost the original data and this is the only way to restore it; (2) you want to understand how R computes orthogonal polynomials.
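For what it's worth, the first-degree column is easy to invert (a sketch that relies on the internals of stats::poly, so treat it as illustrative rather than guaranteed):
p  <- fm$model[["poly(x2, 2)"]]
cf <- attr(p, "coefs")
# the first column is (x2 - alpha[1]) / sqrt(norm2[3]), so invert that:
x2.restored <- p[, 1] * sqrt(cf$norm2[3]) + cf$alpha[1]
all.equal(as.numeric(x2.restored), df$x2)  # TRUE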
Second, on a more general note, since I explicitly did not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?
Do you mean, why is data saved with the lm object at all? Just in case, I suppose. You can easily switch it off:
fm <- lm(y ~ x1 + poly(x2, 2), data=df, model=FALSE)
Or why are the data "manipulated"? I.e., why is poly(x2, 2) saved with the data instead of the original x2? My understanding is that you requested this yourself: the poly(x2, 2) part is first evaluated and then passed to lm, so lm never even sees the original x2.
Edit: to answer the comment below in a more convenient way.
For instance, using factor(f) for some additional factor variable does not get translated into a data frame being stored in fm$model. Only the actual variable f is being stored in fm$model, whereas in this case with poly, some transformation is stored. This puzzles me.
I think you've missed something here; the behaviour is the same for both poly and factor.
> df <- data.frame(a=1:5, b=2:6, c=rnorm(5))
> fm <- lm(c~ a + factor(b), df)
> fm$model
c a factor(b)
1 0.5397541 1 2
2 0.9108087 2 3
3 0.1819442 3 4
4 -0.9293893 4 5
5 0.1404305 5 6
> fm$model$factor
[1] 2 3 4 5 6
Levels: 2 3 4 5 6
Warning message:
In `$.data.frame`(fm$model, factor) : Name partially matched in data frame
You can see that fm$model has factor(b) instead of b, and fm$model$factor is indeed a factor, not the original integer variable. (The warning appears because the column is actually named factor(b); I used fm$model$factor to avoid typing something as ugly as fm$model$`factor(b)`.)
I used qda from MASS to find the classifier for my data, and it always reports "some group is too small for 'qda'". Is it due to the size of the test data I used for the model? I increased the test sample size from 30 to 100, and it reported the same error. Help!
set.seed(1345)
AllMono <- AllData[AllData$type == "monocot", ]
MonoSample <- sample(1:nrow(AllMono), size = 100, replace = FALSE)
set.seed(1355)
AllEudi <- AllData[AllData$type == "eudicot", ]
EudiSample <- sample(1:nrow(AllEudi), size = 100, replace = FALSE)
testData <- rbind(AllMono[MonoSample, ], AllEudi[EudiSample, ])
plot(testData$mono_score, testData$eudi_score, col = as.numeric(testData$type),
     xlab = "mono_score", ylab = "eudi_score", pch = 19)
qda(type ~ mono_score + eudi_score, data = testData)
Here is my data example
>head (testData)
sequence mono_score eudi_score type
PhHe_4822_404_76 DTRPTAPGHSPGAGH 51.4930 39.55000 monocot
SoBi_10_265860_58 QTESTTPGHSPSIGH 33.1408 2.23333 monocot
EuGr_5_187924_158 AFRPTSPGHSPGAGH 27.0000 54.55000 eudicot
LuAn_AOCW01152859.1_2_79 NFRPTEPGHSPGVGH 20.6901 50.21670 eudicot
PoTr_Chr07_112594_90 DFRPTAPGHSPGVGH 43.8732 56.66670 eudicot
OrSa.JA_3_261556_75 GVRPTNPGHSPGIGH 55.0986 45.08330 monocot
PaVi_contig16368_21_57 QTDSTTPGHSPSIGH 25.8169 2.50000 monocot
>testData$type <- as.factor (testData$type)
> dim (testData)
[1] 200 4
> levels (testData$type)
[1] "eudicot" "monocot" "other"
> table (testData$type)
eudicot monocot other
100 100 0
> packageDescription("MASS")
Package: MASS
Priority: recommended
Version: 7.3-29
Date: 2013-08-17
Revision: $Rev: 3344 $
Depends: R (>= 3.0.0), grDevices, graphics, stats, utils
My R version is R 3.0.2.
tl;dr my guess is that your predictor variables got made into factors or character vectors by accident. This can easily happen if you have some minor glitch in your data set, such as a spurious character in one row.
Here's a way to make up a data set that looks like yours:
set.seed(101)
mytest <- data.frame(type = rep(c("monocot", "dicot"), each = 100),
                     mono_score = runif(100, 0, 100),
                     dicot_score = runif(100, 0, 100))
Some useful diagnostics:
str(mytest)
## 'data.frame': 200 obs. of 3 variables:
## $ type       : Factor w/ 2 levels "dicot","monocot": 2 2 2 2 2 2 2 2 2 2 ...
## $ mono_score : num 37.22 4.38 70.97 65.77 24.99 ...
## $ dicot_score: num 12.5 2.33 39.19 85.96 71.83 ...
summary(mytest)
## type mono_score dicot_score
## dicot :100 Min. : 1.019 Min. : 0.8594
## monocot:100 1st Qu.:24.741 1st Qu.:26.7358
## Median :57.578 Median :50.6275
## Mean :52.502 Mean :52.2376
## 3rd Qu.:77.783 3rd Qu.:78.2199
## Max. :99.341 Max. :99.9288
##
with(mytest,table(type))
## type
## dicot monocot
## 100 100
Importantly, the first two (str() and summary()) show us what type each variable is. Update: it turns out the third test is actually the important one in this case, since the problem was a spurious extra level; the droplevels() function should take care of this problem ...
This made-up example seems to work fine, so there must be something you're not showing us about your data set ...
library(MASS)
qda(type~mono_score+dicot_score,data=mytest)
Here's a guess. If your score variables were actually factors rather than numeric, then qda would automatically attempt to create dummy variables from them, which would make the model matrix much wider (101 columns in this example) and provoke the error you're seeing ...
bad <- transform(mytest,mono_score=factor(mono_score))
qda(type~mono_score+dicot_score,data=bad)
## Error in qda.default(x, grouping, ...) :
## some group is too small for 'qda'
I had this error as well, so I'll explain what went wrong on my side for anyone stumbling upon this in the future.
You might have a factor as the variable you want to predict. Every level of this factor must have some observations; if a group doesn't have enough observations, you will get this error.
In my case, I had removed all cases of one level, but the level itself was still left in the factor.
To remove this you have to do this
df$var %<>% factor
NB. %<>% requires magrittr
However, even when I did this, it still failed. When I debugged further, it appeared that if you subset a data frame that had factor applied, you somehow have to re-apply factor afterwards.
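A small illustration of the subsetting pitfall (made-up data; droplevels() is the base-R way to drop the empty levels):
d   <- data.frame(g = factor(c("a", "a", "b", "c")), x = rnorm(4))
sub <- d[d$g != "c", ]
levels(sub$g)           # still "a" "b" "c" -- the empty level survives the subset
sub$g <- droplevels(sub$g)
levels(sub$g)           # "a" "b"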
Your grouping variable has 3 levels, including 'other' with no cases. Since the number of predictor variables (2, i.e. mono_score and eudi_score) is larger than the number of cases in one of the group levels (100, 100 and 0 for eudicot, monocot and other, respectively), the analysis cannot be performed.
One way to get rid of unnecessary group levels is by redefining the grouping variable as a factor after converting it to character:
testData$type <- as.factor(as.character(testData$type))
Another alternative is to define the levels of the grouping variable explicitly:
testData$type <- factor(testData$type, levels = c("eudicot", "monocot"))
If your dataset were so unbalanced that it had, for example, only 2 cases of 'other', it would probably make sense to exclude them from the analysis.
This message can still appear whenever the number of predictor variables is larger than the number of cases in some group level. Since you have 100 cases for each remaining group level (i.e. eudicot, monocot) and only two predictor variables (i.e. mono_score, eudi_score), this should not be a problem anymore.