What does a proportional matrix look like for glmnet response variable in R? - r

I'm trying to use glmnet to fit a GLM that has a proportional response variable (using the family="binomial").
The help file for glmnet says that the response variable:
"For family="binomial" should be either a factor with
two levels, or a two-column matrix of counts or proportions (the second column
is treated as the target class"
But I don't really understand how I would have a two column matrix. My variable is currently just a single column with values between 0 and 1. Can someone help me figure out how this needs to be formatted so that glmnet will run it properly? Also, can you explain what the target class means?

It is a matrix of positive label and negative label counts, for example in the example below we fit a model for proportion of Claims among Holders :
data = MASS::Insurance
y_counts = cbind(data$Holders - data$Claims,data$Claims)
x = model.matrix(~District+Age+Group,data=data)
fit1 = glmnet(x=x,y=y_counts,family="binomial",lambda=0.001)
If possible, so you should go back to before your calculation of the response variable and retrieve these counts. If that is not possible, you can provide a matrix of proportion, 2nd column for success but this assumes the weight or n is same for all observations:
y_prop = y_counts / rowSums(y_counts)
fit2 = glmnet(x=x,y=y_prop,family="binomial",lambda=0.001)

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individuals. Based on observations, many results are zero where the lm tries to calculate them.
To overcome this I created a new variable that results in zero if X and Y are true.
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so it correctly results in zeros when the proposition is true and the calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running that last lines gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
fit your model like this
mod <- lm(log(Abs550nm) ~ ng_mL, data = standards)
Isn't that some much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note you don't need to include the log() bit in the name. R recognises that as a function and applies to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.

wrapnls: Error: singular gradient matrix at initial parameter estimates

I have created a loop to fit a non-linear model to six data points by participants (each participant has 6 data points). The first model is a one parameter model. Here is the code for that model that works great. The time variable is defined. The participant variable is the id variable. The data is in long form (one row for each datapoint of each participant).
Here is the loop code with 1 parameter that works:
1_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {wrapnls(indiff ~ 1/(1+k*time), data = discounting_long, start = c(k=0))})
However, when I try to fit a two parameter model, I get this error "Error: singular gradient matrix at initial parameter estimates" while still using the wrapnls function. I realize that the model is likely over parameterized, that is why I am trying to use wrapnls instead of just nls (or nlsList). Some in my field insist on seeing both model fits. I thought that the wrapnls model avoids the problem of 0 or near-0 residuals. Here is my code that does not work. The start values and limits are standard in the field for this model.
2_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {nlxb(indiff ~ 1/(1+k*time^s), data = discounting_long, lower = c (s = 0), start = c(k=0, s=.99), upper = c(s=1))})
I realize that I could use nlxb (which does give me the correct parameter values for each participant) but that function does not give predictive values or residuals of each data point (at least I don't think it does) which I would like to compute AIC values.
I am also open to other solutions for running a loop through the data by participants.
You mention at the end that 'nlxb doesn't give you residuals', but it does. If your result from your call to nlxbis called fit then the residuals are in fit$resid. So you can get the fitted values using just by adding them to the original data. Honestly I don't know why nlxb hasn't been made to work with the predict() function, but at least there's a way to get the predicted values.

Is it possible to specify a range for numbers randomly generated by mvrnorm( ) in R?

I am trying to generate a random set of numbers that exactly mirror a data set that I have (to test it). The dataset consists of 5 variables that are all correlated with different means and standard deviations as well as ranges (they are likert scales added together to form 1 variable). I have been able to get mvrnorm from the MASS package to create a dataset that replicated the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and std. dev. through z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale whose score I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample matches those of the original dataset.
A simple way to achieve this is with resampling
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5) # Some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)
my.sample <- my.data[my.ind, ]
This will ensure that the margins and the dependence structure of the sample (closely) matches those of the original data.
An alternative is to use a parametric model for the margins and/or the dependence structure (copula). But as staded by #dickoa, this will require serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicity) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.

How to perform least squares regression in R given training and testing data with class labels?

I have a 63*62 training set and the class labels are also present. The test data is a 25*62 dimensions and has the class labels too. Given this how would I perform least squares regression? I am using the code:
res = lm(height~age)
what does height and age correspond to? When I have 61 features + 1 class (making it 62 columns for the training data) how would I input parameters?
Also how do I apply the model on the testing data?
If you have 62 columns you may want to use the more general formula
res = lm(height ~ . , data = mydata)
Notice how the period '.' represent the rest of the variables. But the previous answer is completely right in the sense that there are more variables than observations and therefore the answer (if there's any which shouldn't be) is completely useless.
height and age would be simply the labels of columns in your data frame. height is a predicted variable. You can have as many variables there as you wish: res = lm(height~age+wight+gender)
However, I must say that the question seems a bit strange to me because if you are performing a regression with 62 variables having 62 points in training set it will simply mean that you will always have an exact solution. Training set should always be (significantly) larger than the number of variables used.

Resources