When I run the code below, I can calculate the regression coefficients for each category of c. Now I am wondering how I can apply these estimated coefficients to calculate the residuals of all observations. For example, only 25 observations here belong to c = 1, but I need to calculate the fitted values/residuals of all 50 observations based on the estimated coefficients for this category.
A <- cars$speed
B <- cars$dist
c <- rep(1:2, 25)
S <- data.frame(A, B, c)
library(plyr)
lmodel <- dlply(S, "c", function(d) lm(B ~ A, data = d))
I'm not 100% sure I understand what you mean, but the following code will give you a list of residuals. The first element of the list contains the residuals for all 50 observations using the coefficients for c = 1, and the second for c = 2.
# Apply each category's coefficients to all 50 observations
residuals <- lapply(lmodel, function(x) B - coef(x)[1] - coef(x)[2] * A)
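If you prefer not to write the coefficients out by hand, an equivalent sketch (assuming the data frame S from the question is still in scope) lets predict() apply each per-category model to the full data set:

# predict() applies each model's coefficients to every row of S, so each
# list element again holds residuals for all 50 observations
residuals_list <- lapply(lmodel, function(m) S$B - predict(m, newdata = S))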
Let's work with the classic iris dataset:
data(iris)
When I conduct a Pearson correlation analysis, I get these correlation coefficients:
          SEPALLEN  SEPALWID  PETALLEN  PETALWID
SEPALLEN  1.000000 -0.117570  0.871754  0.817941
SEPALWID -0.117570  1.000000 -0.428440 -0.366126
PETALLEN  0.871754 -0.428440  1.000000  0.962865
PETALWID  0.817941 -0.366126  0.962865  1.000000
So is there a way to perform the inverse transformation, namely from the correlation coefficients back to the initial values of the variables?
You cannot recover the underlying data from correlation coefficients, only the general character of the relationship between two columns. If Pearson's coefficient is positive there is an increasing tendency; if negative, a decreasing one. We can visualize this with a correlation plot:
data(iris)
library(PerformanceAnalytics)
chart.Correlation(iris[, 1:4], histogram = TRUE, pch = 19)
As the chart shows, each number in the upper triangle matches a graph in the lower triangle. In fact, the cor function condenses the 600 entries of iris (columns 1-4) into just 6 unique numbers, so an unambiguous inverse transformation from 6 numbers back to 600 is not possible.
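A quick sketch of why the inversion is ambiguous: a standardized, rescaled, and shifted copy of the data yields exactly the same correlation matrix.

# Two different data sets, identical correlation matrices
x <- iris[, 1:4]
y <- as.data.frame(scale(x) * 10 + 100)  # rescaled and shifted copy
all.equal(cor(x), cor(y))                # TRUE, yet x and y differ everywhere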
I have a data set [1000 x 80] of 1000 data points, each with 80 variable values. I have to linearly regress two variables, price and area, and identify the 5 data points with the highest squared residuals. For these identified data points, I have to display 4 of the 80 variable values.
I do not know how to use the residuals to identify the original data points. All I have at the moment is:
model_lm <- lm(log(price) ~ log(area), data = ames)
Can I please get some guidance on how I can approach the above problem?
The model_lm object will contain a variable called 'residuals' that holds the residuals in the same order as the original observations. If I'm understanding the question correctly, an easy way to do this in base R is:
ames$residuals <- model_lm$residuals            ## Add the residuals to the data.frame
o <- order(ames$residuals^2, decreasing = TRUE) ## Reorder to put the largest first
ames[o[1:5], ]                                  ## Return the top-5 rows
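To display only the four variables of interest for those five points, subset the columns as well; the column names below are placeholders, so substitute the four you actually need:

# Hypothetical column names -- replace with the four variables you must report
ames[o[1:5], c("price", "area", "year_built", "neighborhood")]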
I would like to know how I can run a regression n times, each time with a different set of variables, and extract a data.frame where each column is a regression and each row represents a variable.
In my case I have a data.frame of:
dt_deals <- data.frame(Premium = c(1, 3, 4, 5),
                       Liquidity = c(0.2, 0.3, 1.5, 0.8),
                       Leverage = c(1, 3, 0.5, 0.7))
But I have another explanatory dummy variable called hubris, drawn from a binomial distribution with success probability 0.25, like this:
n <- 10
hubris_dataset <- data.frame(replicate(n, rbinom(4, 1, 0.25)))
So what I need is to run n simulations of hubris, so that I can fit n regressions, each with a different random binomial draw, and collect the output of each regression in a data.frame.
So far I have reached this:
# define n as the number of simulations I want
n <- 10
# define beta as an object to collect every coefficient from the lm regressions
beta <- NULL
for (i in 1:n) {
  dt_deals2 <- dt_deals
  beta[[i]] <- coef(lm(dt_deals$Premium ~ dt_deals$Liquidity + dt_deals$Leverage + hubris_dataset[, i], data = dt_deals2))
  beta <- cbind(reg$coefficients)
}
But this way it only generates the first set of coefficients and doesn't produce the other ten columns for the data.frame.
@jogo suggested replacing the for-loop with sapply and initialising the object beta as a list(). This was the result:
beta <- sapply(1:n, function(i) coef(lm(Premium ~ Liquidity + Leverage + hubris_dataset[, i], data = dt_deals2)))
And it worked.
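Since sapply returns the coefficients as a matrix with one column per regression, you can convert it to the requested data.frame; the column names here are just one illustrative choice:

beta_df <- as.data.frame(beta)                   # one column per regression
colnames(beta_df) <- paste0("reg_", seq_len(n))  # e.g. reg_1 ... reg_10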
I have a data frame x with 2 columns, a and b; a is of class character and b is of class numeric.
I fitted a Gaussian distribution to b using the fitdist function (fitdistrplus package).
data.fit <- fitdist(x$b, "norm", "mle")
I want to extract the elements in column a that fall in the 5% right tail of the fitted Gaussian distribution.
I am not sure how to proceed because my knowledge on fitting distribution is limited.
Do I need to retain the corresponding elements in column a for which b is greater than the value obtained for the 95th percentile?
Or does the fitting imply that new values have been created for each value in b, and should I use those values instead?
Thanks
By calling unclass(data.fit) you can see all the parts that make up the data.fit object, which include:
$estimate
mean sd
0.1125554 1.2724377
which means you can access the estimated mean and standard deviation via:
data.fit$estimate['sd']
data.fit$estimate['mean']
To calculate the cutoff for the upper 5% tail of the fitted distribution, you can use the qnorm() function (q is for quantile, BTW) like so:
threshold <- qnorm(p = 0.95,
                   mean = data.fit$estimate['mean'],
                   sd = data.fit$estimate['sd'])
and you can subset your data.frame x like so:
x[x$b > threshold, # an indicator of the rows to return
  'a']             # the column to return
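Putting it all together, here is a minimal end-to-end sketch with invented data (the column names a and b follow the question; the values are made up):

library(fitdistrplus)
set.seed(42)
x <- data.frame(a = letters[1:26], b = rnorm(26))  # invented example data
data.fit <- fitdist(x$b, "norm", "mle")
threshold <- qnorm(p = 0.95,
                   mean = data.fit$estimate['mean'],
                   sd = data.fit$estimate['sd'])
x[x$b > threshold, 'a']  # elements of a whose b falls in the upper 5% tail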
I have a matrix Expr with rows representing variables and columns representing samples.
I have a categorical vector called groups (containing either "A", "B", or "C").
I want to test which of the variables in Expr can be explained by the fact that the sample belongs to a group.
My strategy would be to model the problem with a generalized additive model (with a negative binomial distribution).
Then I want to use a likelihood-ratio test in a variable-wise way to get a p-value for each variable.
I do:
require(VGAM)
m <- vgam(Expr ~ groups, family = negbinomial)
m_alternative <- vgam(Expr ~ 1, family = negbinomial)
and then:
lr <- lrtest(m, m_alternative)
The last step is wrong because it tests the overall likelihood ratio of the two models, not the variable-wise ratios.
Instead of a single p-value I would like to get a vector of p-values, one for each variable.
How should I do it?
(I am very new to R, so forgive my stupidity.)
It sounds like you want to use Expr as your predictors. I think you may have your formula backwards. The response should be on the left, so I guess that's groups in your case.
If Expr is a data.frame, you can do regression on all variables with
m <- vgam(groups ~ ., Expr, family = negbinomial)
If class(Expr)=="matrix", then
m <- vgam(groups ~ Expr, family = negbinomial)
probably should work, but you may just get slightly odd-looking coefficient labels.
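If, on the other hand, you do want one test per variable as the question describes (expression modelled by group), here is a hedged sketch assuming Expr has variables in rows and samples in columns; it uses vglm with a manual likelihood-ratio test per row:

library(VGAM)
# One LR test per row of Expr: full model (expression ~ groups)
# versus an intercept-only null model
pvals <- apply(Expr, 1, function(y) {
  m_full <- vglm(y ~ groups, family = negbinomial)
  m_null <- vglm(y ~ 1, family = negbinomial)
  stat <- as.numeric(2 * (logLik(m_full) - logLik(m_null)))  # LR statistic
  df <- df.residual(m_null) - df.residual(m_full)            # difference in df
  pchisq(stat, df = df, lower.tail = FALSE)                  # p-value for this variable
})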