Issue with predict() in glmnetUtils (R)

I'm trying to use the glmnetUtils package from GitHub, which provides a formula interface to glmnet, but predict() is not returning enough values.
library(nycflights13)
library(modelr)
library(dplyr)
library(glmnet)
library(glmnetUtils)  # from GitHub
library(purrr)
fitfun <- function(dF){
  cv.glmnet(arr_delay ~ distance + air_time + dep_time, data = dF)
}
gnetr2 <- function(model, datavals){
  yvar <- all.vars(formula(model)[[2]])
  print(paste('y variable:', yvar))
  print('observations')
  print(str(as.data.frame(datavals)[[yvar]]))
  print('predictions')
  print(str(predict(object = model, newdata = datavals)))
  stats::cor(stats::predict(object = model, newdata = datavals),
             as.data.frame(datavals)[[yvar]], use = 'complete.obs')^2
}
flights %>%
  group_by(carrier) %>%
  do({
    crossv_mc(., 4) %>%
      mutate(mdl = map(train, fitfun),
             r2 = map2_dbl(mdl, test, gnetr2))
  })
The output from gnetr2():
[1] "y variable: arr_delay"
[1] "observations"
num [1:3693] -33 -6 47 4 15 -5 45 16 0 NA ...
NULL
[1] "predictions"
num [1:3476, 1] 8.22 21.75 24.31 -7.96 -7.27 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:3476] "1" "2" "3" "4" ...
..$ : chr "1"
NULL
Error: incompatible dimensions
Any ideas what's going on? Your help is much appreciated!

This is an issue with the underlying glmnet package, but there's no reason that it can't be handled in glmnetUtils. I've just pushed an update that should let you use the na.action argument with the predict method for formula-based calls.
Setting na.action=na.pass (the default) will pad out the predictions to include NAs for rows with missing values.
na.action=na.omit or na.exclude will drop these rows.
Note that the missingness of a given row may change depending on how much regularisation is done: if the NAs are for variables that get dropped from the model, then the row will be counted as being a complete case.
Also took the opportunity to fix a bug where the LHS of the formula contains an expression.
Give it a go with install_github("Hong-Revo/glmnetUtils") and tell me if anything breaks.
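A minimal sketch of the call after that update (model and datavals named as in the question's gnetr2 function; the na.action behaviour is as described in this answer):
# with na.action = na.pass, rows with missing predictors get NA
# predictions instead of being dropped, so dimensions line up
preds <- predict(model, newdata = datavals, na.action = na.pass)
nrow(preds) == nrow(as.data.frame(datavals))  # TRUE: one prediction per row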

It turns out this is happening because there are NAs in the predictor variables, so predict() returns a shorter vector, since na.action=na.exclude.
Normally the solution would be to use predict(object, newdata, na.action=na.pass), but predict.cv.glmnet does not pass other arguments on to predict.
Therefore the solution is to filter for complete cases before beginning:
flights <- flights %>% filter(complete.cases(.))

Why is the lm() function not working for me?

I am attempting to run a multiple regression on different columns in a data frame. As it is a homework project, the main point is to show that some things have less correlation than others, or none at all.
That said, I created the data frame as follows:
df.Part3 <- df %>%
  filter(NEIGHBORHOOD_NAME == "BRONXDALE", YEAR > 2008,
         SALE_PRICE > 0, TYPE == "RESIDENTIAL") %>%
  mutate(TOTAL_UNITS = RESIDENTIAL_UNITS + COMMERCIAL_UNITS) %>%
  mutate(SALE_YEAR = year(SALE_DATE)) %>%
  ungroup() %>%
  select(SALE_PRICE, SALE_YEAR, YEAR_BUILT, GROSS_SQUARE_FEET, TOTAL_UNITS, TYPE)
Once I do that, the data frame is created with no errors. However, when I try to run the lm() function as follows:
z <- lm(formula=SALE_PRICE~.,data=df.Part3)
I get the error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I am following the lecture video and doing almost everything the same way the teacher does, yet he is able to complete his regression and I am unable to complete mine.
How do I fix this issue?
For context surrounding the data frame:
SALE_PRICE: num 260000
SALE_YEAR: num 2009
YEAR_BUILT: num 1965
GROSS_SQUARE_FEET: num 1152
TOTAL_UNITS: num 1
TYPE: chr "RESIDENTIAL"
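For what it's worth, this error typically means a factor (or character) predictor has only one distinct level, which is the case for TYPE here after filtering on TYPE == "RESIDENTIAL". A minimal reproduction with synthetic data (d, y, x, and grp are made up for illustration):
# a character predictor with a single unique value triggers the error
d <- data.frame(y = rnorm(10), x = rnorm(10), grp = rep("RESIDENTIAL", 10))
lm(y ~ ., data = d)
# Error in `contrasts<-`(`*tmp*`, ...) :
#   contrasts can be applied only to factors with 2 or more levels
Dropping the constant column (e.g. leaving TYPE out of the select()) avoids the error.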

Dendrogram from a pre-made linkage matrix

The problem:
In R, I need to plot a dendrogram and cut the associated tree from a linkage matrix created in a different language. Based on the nature of the dataset, the prior processing is only available in that other language, so I need to be able to work in R from an already-determined linkage matrix.
I have a linkage matrix and a correlation matrix created in the other language. I saved both as CSV files and can read either into R as a data frame.
My approach:
I wanted to convert the linkage matrix to an hclust object in R, so that I could pass it to as.dendrogram and subsequently use cutree.
When I run as.hclust(df), I get the error:
Error in as.hclust.default(df) : argument 'x' cannot be coerced to class “hclust” Consider providing an as.hclust.data.frame() method
as.hclust only takes a dist, diana, or agnes object, and I have been unable to convert the data frame to any of these to proceed with my downstream analysis.
An alternative would be to work with the correlation matrix, but I don't see a way to back out the distances from which to build a meaningful dendrogram.
I could use scipy.cluster.hierarchy.cut_tree in Python, but there are documented issues with that function that remain unresolved, so I wanted to use R.
Many thanks!
I'm not sure what you would call the "linkage matrix" or whether there's a "standard" format for them across packages, but in these cases it helps to use str():
x <- matrix(rnorm(30), ncol = 3)
hc <- hclust(dist(x), method = "complete")
str(hc)
List of 7
$ merge : int [1:9, 1:2] -5 -6 -8 -4 -2 -3 -1 6 5 -7 ...
$ height : num [1:9] 0.714 0.976 1.381 1.468 2.065 ...
$ order : int [1:10] 2 6 10 3 8 5 7 1 4 9
$ labels : NULL
$ method : chr "complete"
$ call : language hclust(d = dist(x), method = "complete")
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
So, from this, one can deduce that it's a simple S3 structure, and it should be possible to create an imitation with your already-determined step-by-step data like this:
my_hc <- list(
  merge       = <your data>,
  height      = <your data>,
  order       = <your data>,
  labels      = NULL,
  method      = "complete",
  call        = "some_optional_string",
  dist.method = "your_custom_distance"
)
class(my_hc) <- "hclust"
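As a sanity check, rebuilding hc from its own components (standing in for your imported data) confirms that as.dendrogram() and cutree() accept such a hand-made object:
my_hc <- unclass(hc)[c("merge", "height", "order", "labels",
                       "method", "call", "dist.method")]
class(my_hc) <- "hclust"      # tag the plain list as an hclust object
plot(as.dendrogram(my_hc))    # the dendrogram plots as expected
cutree(my_hc, k = 3)          # cluster assignments for the 10 points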
Otherwise, you could let R re-do the clustering from a distance matrix if that's available or computationally feasible.

Nulls in a data frame: how to remove them if that is better for logistic regression models

I have a data frame with two columns that contain a number of "NULL" values:
'data.frame': 31337 obs. of 16 variables:
 $ ID     : int 1 2 3 5 6 7 8 9 10 11 ...
 $ Target : int 0 0 0 0 0 0 0 0 0 0 ...
 $ band   : chr "3. 35 to 44" "NULL" "NULL" "NULL" ...
 $ gender : chr "Male" "NULL" "Male" "NULL" ...
(a) Do I remove the rows with "NULL" in R, or
(b) do I leave "NULL" as a separate category for logistic regression in R?
If the answer to (a) is yes, then how do I do it?
There are several things going on here with your question.
"NULL" in your data frame is a character value. It is not NULL.
E.g.,
is.null(NULL)
[1] TRUE
is.null("NULL")
[1] FALSE
In R there is a difference between NULL and NA. NULL represents a null or empty object; it is often returned by functions whose value is undefined. NA represents a missing value. Based on your context, I would replace your "NULL" values with NA. For a quick way to do this, see dplyr::na_if().
If you are using glm() to carry out your logistic regression, there are several ways glm() can handle missing data (NAs). You control this with the argument na.action. Run ?glm in the console to pull up the help page for this function; it describes each of the argument's possible values.
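For instance, a sketch using the columns from your str() output (the formula is illustrative only, and I'm assuming your data frame is called data, as in your later comment):
# na.action controls what glm() does with rows containing NA
fit <- glm(Target ~ band + gender, data = data, family = binomial,
           na.action = na.omit)  # drop incomplete rows (the usual default)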
To answer your question about removing NAs or using a dummy indicator for missing values, that's a matter of model intent. It is difficult to provide a general answer to such a broad topic without more details.
@jordan: Fantastic advice. The data frame shrunk to 14% of its size:
data <- na_if(data, "NULL")
data <- data[!is.na(data$age_band) & !is.na(data$gender), ]

How to convert 4d array to 3d array subsetting on specific elements of one of the dimensions

This is probably an easy question, but I am really struggling, so help is very much appreciated.
I have 4d data that I wish to transform into 3d data. The data has the following attributes:
lon <- 1:96
lat <- 1:73
lev <- 1:60
tme <- 1:12
data <- array(runif(96*73*60*12),
              dim = c(96, 73, 60, 12))  # fill with random test values
What I would like to do is calculate the mean of the first few levels (say 1:6). The new data would be of the form:
new.data <- array(runif(96*73*12), dim = c(96, 73, 12))  # again just test data
But it would contain the mean of the first six levels of data. At the moment the only way I have been able to make it work is a rather inefficient loop that extracts each of the first six levels and divides their sum by six to get the mean.
I have tried:
new.data <- apply(data, c(1,2,4), mean)
This nicely gives me the mean over ALL the vertical levels, but I can't understand how to subset the third dimension to get an average of only a few, e.g.
new.data <- apply(data, c(1,2,3[1:5],4), mean) # which returns
Error in ds[-MARGIN] : only 0's may be mixed with negative subscripts
I am desperate for some help!
apply with indexing (the proper use of "[") should be enough for the mean of the first six levels of the third dimension if I understand your terminology:
> str(apply(data[,,1:6,] , c(1,2,4), FUN=mean) )
num [1:96, 1:73, 1:12] 0.327 0.717 0.611 0.388 0.47 ...
This returns a 96 × 73 × 12 array.
In addition to the answer of @DWin, I would recommend the plyr package, which provides apply-like functions. The analogue of apply is the plyr function aaply. The first two letters of a plyr function specify the input and output types; aa in this case means array in, array out.
> system.time(str(apply(data[,,1:6,], c(1,2,4), mean)))
num [1:96, 1:73, 1:12] 0.389 0.157 0.437 0.703 0.61 ...
user system elapsed
2.180 0.004 2.184
> library(plyr)
> system.time(str(aaply(data[,,1:6,], c(1,2,4), mean)))
num [1:96, 1:73, 1:12] 0.389 0.157 0.437 0.703 0.61 ...
- attr(*, "dimnames")=List of 3
..$ X1: chr [1:96] "1" "2" "3" "4" ...
..$ X2: chr [1:73] "1" "2" "3" "4" ...
..$ X3: chr [1:12] "1" "2" "3" "4" ...
user system elapsed
40.243 0.016 40.262
In this example it is slower than apply, but there are a few advantages. The package supports parallel processing, it can output the results as a data.frame or list (nice for plotting with ggplot2), and it can show a progress bar (nice for long-running processes). Although in this case I'd still go for apply because of performance.
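For example, the progress bar is a one-argument change (a sketch; .parallel = TRUE would additionally need a registered parallel backend):
# show a text progress bar during a long aaply run
res <- aaply(data[,,1:6,], c(1,2,4), mean, .progress = "text")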
More information regarding the plyr package can be found in this paper. Maybe someone can comment on the poor performance of aaply in this example?

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness: it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands that look at missingness report answers for all the missing entries, however they are specified, but you can still sort out the various kinds of missingness later on. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does a question not being asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you're looking for, and that is not implemented in R. I don't know of a package where it is implemented, but it's not too difficult to code it yourself.
A workable way is to add a data frame to the attributes, containing the codes. To prevent doubling the whole data frame and to save space, I'd store indices in that attribute data frame instead of reconstructing a complete data frame.
For example:
NACode <- function(x, code){
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  id <- which(is.na(Df))
  # recover row/column from the linear index (the -1/+1 keeps the last
  # row of a column from mapping to row 0)
  rowid <- (id - 1) %% nrow(x) + 1
  colid <- (id - 1) %/% nrow(x) + 1
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute giving the labels for the different values; see also this question. You can transform back with:
ChangeNAToCode <- function(x, code){
  NAval <- attr(x, "NAcode")
  for(i in which(NAval$value %in% code))
    x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
  x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This allows you to change only the codes you want, if that ever becomes necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; one possible shape is sketched below.
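For instance (GetRowsWithCode is a hypothetical name, reusing the "NAcode" attribute layout from NACode above):
GetRowsWithCode <- function(x, code){
  NAval <- attr(x, "NAcode")
  # rows whose recorded special value matches the requested code(s)
  x[unique(NAval$rowid[NAval$value %in% code]), , drop = FALSE]
}
GetRowsWithCode(DfwithNA, -1)   # the row originally coded "Missing"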
But in one line: using attributes and indices might be a nice way of doing it.
The most obvious way seems to be to use two vectors:
Vector 1: a data vector, where all missing values are represented by NA. For example, c(2, 50, NA, NA).
Vector 2: a vector of factors indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure gives you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, while you can use more complex concepts with the factor vector.
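A tiny illustration of this layout, reusing the example values above:
values <- c(2, 50, NA, NA)          # data vector: every kind of missing is NA
types  <- factor(c(1, 1, -1, -7))   # missingness type per entry
mean(values, na.rm = TRUE)          # standard NA handling still works
values[types != "-7"]               # e.g. keep everything except refusals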
Update following questions from @gsk3
"Data storage will dramatically increase." The data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it." That's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, so that implies you will have to do something bespoke. If you want to analyse only the data where the NAs mean "Question not asked", just use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable." I envisaged a data frame of the two vectors; I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's." True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One approach involves playing with R and GGobi: you can assign extremely negative values to the several types of NA (putting the NAs into the margins) and do some diagnostics "manually". You should bear in mind that there are three missingness mechanisms:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
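A small sketch of the coded-values approach Ralph describes, using the codes from the question's codebook:
x <- c(23, 57, -1, -5, -7)          # special codes kept as ordinary values
x_rolled <- replace(x, x < 0, NA)   # roll every missing code up to one global NA
mean(x_rolled, na.rm = TRUE)        # analyse only the genuine responses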
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data in its own right. But on one or two occasions where I mainly wanted it for documentation, I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background" component here. Statistical Analysis with Missing Data is a very good read on this.
