Inputting data to a matrix and predicting the response variable in R

So I'm having a problem combining a vector with a matrix:
require(faraway)
x<-lm(lpsa~lcavol+lweight+age+svi+lcp+gleason+pgg45,prostate)
y<-model.matrix(x)
I have been given new data that I need to predict lpsa with. So I was thinking that I could just add the data in using a vector and go about the regression analysis from there.
z<-c(1.44692,3.62301,65,.30010,0,-.79851,7,15)
rbind(y,z)
Not only does this give me 100 rows, but I'm not sure how to predict lpsa using this method. Can anybody give me advice?

Try:
require(faraway)
x<-lm(lpsa~lcavol+lweight+age+svi+lcp+gleason+pgg45,prostate)
z<-c(1.44692,3.62301,65,.30010,0,-.79851,7,15)
z<-z[-length(z)]                    # drop the stray last value (15)
names(z)<-names(x$coefficients)[-1] # name the entries after the model's predictors
z<-as.list(z)                       # predict() accepts a named list of new values
predict(x,z)
1
2.036906
Explanation: when you create x, you then have to use predict to predict lpsa for new values of your variables. You create a list z with as many variables as there are in the model (except lpsa, as you wish to "find" it). You then run the command, and 2.036906 is the predicted value of lpsa for the new variables. As for the last value of z (i.e. 15), I don't know what it is.
unlist(z) # this shows that z is coherent as age is 65 (only value that makes sense for it)
lcavol lweight age svi lcp gleason pgg45
1.44692 3.62301 65.00000 0.30010 0.00000 -0.79851 7.00000
If you want to know the coefficients calculated by the regression you can do:
coefficients(x)
(Intercept) lcavol lweight age svi lcp gleason pgg45
-0.130150643 0.577486444 0.576247172 -0.014687934 0.698386394 -0.100954503 0.055762175 0.004769619
If you want to be sure that predict is correct, do:
unname(sum(unlist(z)*coefficients(x)[-1])+coefficients(x)[1])
[1] 2.036906 # same estimated value for z
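For reference, the same prediction can also be obtained by passing a one-row data.frame as newdata, which is the more common idiom for predict (a minimal sketch reusing the same values that ended up in z):
newdat <- data.frame(lcavol = 1.44692, lweight = 3.62301, age = 65,
                     svi = 0.30010, lcp = 0, gleason = -0.79851, pgg45 = 7)
predict(x, newdata = newdat)  # 2.036906, matching the list-based call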

Related

R: Find cutoff point for continuous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have the two species A and B and want to find out which is the best cutoff point for Value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar data sets and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there maybe a different/better approach to solving my problem? I have several values I need to do this for, so I really do not want to just randomly test values until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your cutoff change as you adjust the threshold, letting you pick the threshold at which overall accuracy is optimal. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21
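If you specifically want the cutoff that maximizes raw accuracy, a brute-force sweep over candidate cutoffs is easy in base R (a minimal sketch assuming the D data frame from the question; as in the table above, Value below the cutoff predicts "B"):
thresholds <- sort(unique(D$Value))
correct <- sapply(thresholds, function(t)
  sum((D$Value < t) == (D$Species == "B")))  # correct assignments at cutoff t
thresholds[which.max(correct)]  # the cutoff giving the most correct assignments
max(correct)                    # how many it gets right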

predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression using predict.lm when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically this isn't a problem, but I don't know an efficient way to do it in R. Take for example this fake data frame and regression model. I attempt to assign the predictions to the source data frame, but the one missing Y value gives me an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(model)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions with algebra, df$y_ip <- B0 + B1*df$x, or by calling the model coefficients, df$y_ip <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1] * df$x; however, I am now working with a model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R (though not necessarily obvious): the na.action argument and the na.exclude function (see ?na.exclude). With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
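Because the na.exclude predictions have the same length as the data frame, they can be assigned straight back (continuing from mod2 above):
df$y_ip <- predict(mod2)  # row 5 is NA, all other rows get fitted values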
Actually, you are not using the predict.lm function correctly.
Either way you have to pass the model itself, here model, as the first argument, with or without new data. Without new data it predicts only on the training data, thus excluding your NA row, and you need this workaround to fit the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or you can explicitly specify new data. Since the new x has one more row than the training x, the missing row is filled in with a fresh prediction:
df$y_ip <- predict.lm(model, newdata = df)

Why does multinom() predict many rows of probabilities for each level of the outcome?

I have a multinomial logistic regression (fit with multinom from the nnet package) and the outcome variable has 6 levels: 10, 20, 60, 70, 80, 90
test<-multinom(y ~ x1 + x2 + as.factor(x3) ,data=data1)
I want to predict the probabilities associated with each level of y for each set of given input values. So I run this:
dfin <- data.frame( ses = c(10,20,60,70,80,90), x1=2.1, x2=4, x3=40)
predict(test, todaydata = dfin, type = "probs")
But instead of getting 6 probabilities (one for each level of the outcome), I got many, many rows of probabilities. Each row has 6 probabilities (summing to 1), but I don't know why I get so many rows or which row I should trust.
5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
5705 5.642322e-01 1.830575e-01 5.134181e-02 8.952808e-04 8.796467e-03 1.916767e-01
5706 6.161694e-01 1.094046e-01 1.979044e-01 1.095385e-02 7.254592e-03 5.831323e-02
....
Am I missing anything in the code, or do I need to set some parameter?
It returns, for each observation, the probability of it being in each of the classes. That is how multinomial logistic regressions work. You can think of it as a series of binomial logistic regressions (one for each class), followed by choosing the class that has the highest probability; this is called the one-vs-all approach.
In your example, observation 5541 is predicted to be class 1 because the first column has the highest value (probability). Observation 5670 is class 4 because that's the column with the highest probability. The matrix will have dimensions # of observations x # of classes.
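Note also that predict() has no todaydata argument: the misspelled argument is silently absorbed by ..., so predict() falls back to the fitted training data, returning one row per training observation. Passing the data as newdata gives one row of probabilities per row of new data (a minimal sketch reusing the test model and the input values from the question):
newcase <- data.frame(x1 = 2.1, x2 = 4, x3 = 40)  # x3 must be a level seen in training
predict(test, newdata = newcase, type = "probs")  # a single row of 6 probabilities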

Error with sem function in R: differences in factors

I wanted to use the function sem (with the package lavaan) on my data in R:
Model1<- 'Transfer~Amotivation+Gender+Age
Amotivation~Gender+Age'
Transfer: 4 questions on a 5-point Likert scale
Amotivation: 4 questions on a 5-point Likert scale
Gender: 0 (= male) and 1 (= female)
Age: just the different ages
And I got the following warning:
In getDataFull(data = data, group = group, group.label = group.label, ...):
lavaan WARNING: some observed variances are (at least) a factor 100 times larger than others; please rescale
Is anybody familiar with this warning? Does it influence my results? Do I have to change anything? I really don't know what it means.
Your scales are not equivalent. Your gender variable is constrained to be either 0 or 1, and amotivation is constrained to be between 1 and 5, but age is far less constrained. I created some sample data for gender, age, and amotivation. You can see that the variance of the age variable is over 4,000 times larger than the variance of gender, and about 500 times larger than that of the sample amotivation data.
gender <- c(0,1,1,1,0,0,1,1,0,1,1,0,0,1,1,1)
age <- c(18,42,87,12,24,26,98,84,23,12,95,44,54,23,10,16)
set.seed(42)
amotivation <- rnorm(16, 3, 1.5)
var(gender) # 0.25 variance
var(age) # 1017.27 variance
var(amotivation) # 2.21 variance
I'm not sure how the unequal variances influence your results, or if you need to do anything at all. To make your age variable more closely match the amotivation scale, you could transform the data so that it is also on a 5-point scale.
newage <- age/max(age)*5
var(newage) # 2.65 variance
You could try running the analysis both ways (using your original data and the transformed data) and see if there are differences.
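A common alternative to ad-hoc rescaling is to standardize the wide-ranging variable before fitting, which should bring the variances onto a comparable footing (a minimal sketch using the sample age data above):
age_std <- as.numeric(scale(age))  # center to mean 0, rescale to variance 1
var(age_std)  # 1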

Doing linear prediction with R: How to access the predicted parameter(s)?

I am new to R and I am trying to do linear prediction. Here is some simple data:
test.frame<-data.frame(year=8:11, value= c(12050,15292,23907,33991))
Say I want to predict the value for year=12. This is what I am doing (experimenting with different commands):
lma=lm(test.frame$value~test.frame$year) # let's get a linear fit
summary(lma) # let's see some parameters
attributes(lma) # let's see what parameters we can call
lma$coefficients # I get the intercept and gradient
predict(lm(test.frame$value~test.frame$year))
newyear <- 12 # new value for year
predict.lm(lma, newyear) # predicted value for the new year
Some queries:
If I issue the command lma$coefficients, for instance, a vector of two values is returned. How do I pick only the intercept value?
I get lots of output with predict.lm(lma, newyear) but cannot understand where the predicted value is. Can someone please clarify?
Thanks a lot...
Intercept:
lma$coefficients[1]
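Extracting by name rather than by position is slightly more robust; R labels the intercept term "(Intercept)":
coef(lma)["(Intercept)"]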
For the prediction, fit the model with a data argument so that predict can look the variables up in newdata, then pass the new year as a one-row data.frame:
lma <- lm(value ~ year, data = test.frame)
predict(lma, newdata = data.frame(year = 12))  # 39919.5
