Trouble using the lm function for regression in R

No Issuer LR1 LR2 LR3 LR4 LR5 DR1
1 CompanyA 1.41470 1.32430 -0.16422 139.30633 8.49702 0.85071
2 CompanyB 1.44627 0.42427 0.40415 8.77173 6.66632 0.53576
3 CompanyC 1.54267 1.52505 0.81449 261.21500 35.86433 0.53681
4 CompanyD 3.64603 2.70640 2.32230 107.33922 1.79202 0.48101
5 CompanyE 1.00592 0.98415 0.78911 82.44725 27.00442 0.68071
6 CompanyF 2.59738 1.70374 0.92933 145.01431 1.81996 0.43577
DR2 DR3 AR1 AR2 AR3 AR4 AR5 PR1 PR2 PR3
5.84882 0.60382 2.62012 8.49702 4.68022 0.51531 0.00822 0.06236 0.05199 0.01595
1.15546 0.33039 41.61093 6.66632 4.04257 2.24779 0.00677 0.06957 0.00083 0.00301
1.16084 0.40417 1.39732 35.86433 0.32469 0.21293 0.04110 0.33770 0.25534 0.19301
0.92684 0.38246 3.40043 1.79202 1.10595 0.46242 0.03522 0.41886 0.14047 0.07617
2.13194 0.60695 4.42707 27.00442 0.23780 0.19290 0.05958 0.42816 0.39135 0.30883
1.00352 0.33506 2.51699 1.81996 1.07226 0.46796 0.04559 0.24596 0.16839 0.09742
PR4 PR5 PR6 PR7 RR1 RR2 Rating
-0.26783 0.00822 0.05651 -0.13802 0.00822 0.05651 4
0.03071 0.00677 0.01460 0.06903 0.00677 0.01460 3
0.02213 0.04110 0.08887 0.00471 0.04110 0.08887 3
0.23080 0.03522 0.06787 0.10673 0.03522 0.06787 3
0.09979 0.05958 0.18659 0.01925 0.05958 0.18659 3
0.10664 0.04559 0.10498 0.04990 0.04559 0.10498 3
Above is the output of head(data) in R. I want to use SVM, but before doing so I want to run a regression on the data. The Y is the "Rating" variable, located in the last column; the rest are the X variables LR1, LR2, ..., RR1, RR2. Here are my steps:
x <- data[,3:24]
y <- data[,25]
lm(y ~ x)
but this is the error I get:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
invalid type (list) for variable 'x'
I have tried a couple of times, including wrapping x in data.frame(x) first, but the result is the same. The "Rating" variable measures the performance of the company: Rating 1 is the best performance and 4 is the worst.
Why am I getting this error? Please help, thank you.

You can regress one variable against all the others by using the 'dot' notation like below:
fit <- lm(Rating ~ ., data = data)
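Note that with the data shown above, Rating ~ . would also pick up the No and Issuer columns as predictors. Assuming you only want LR1 through RR2 (columns 3 to 24 in your indexing) plus Rating, a minimal sketch is to subset first:
fit <- lm(Rating ~ ., data = data[, 3:25])   # keep only LR1..RR2 and Rating
summary(fit)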

attach(data)
reg <- lm(Rating ~ LR1 + LR2 + ... + RR2, data=data)
Or you can separate the X and Y.
x <- cbind(LR1, LR2, ... , RR2)
y <- Rating
reg <- lm(y~x, data=data)
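The original error occurs because data[, 3:24] is a data frame, i.e. a list, which lm() cannot use as a single right-hand-side variable. If you prefer to keep x and y separate, converting x to a numeric matrix also works (this assumes all predictor columns are numeric, as in your head() output):
x <- as.matrix(data[, 3:24])   # matrix of predictors LR1..RR2
y <- data[, 25]                # Rating
fit <- lm(y ~ x)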

Related

Error when creating a geom_table in R showing abnormal returns development of several industries

[Image: the desired graph]
Hello, I am quite new to R and I'm facing a problem when trying to create a ggplot geom_line to visualise the development of the abnormal return of the selected 5 companies, but I am getting an error saying "Error in is.list(x) : incorrect number of dimensions" that I don't understand.
Could anyone help me understand this error and how can I fix it?
What I have done so far is:
Plotting all rows of the financial industry:
df <- read.table(text="AR-5 AR-4 AR-3 AR-2 AR-1 AR0 AR1 AR2 AR3 AR4 AR5 AR6 AR7 AR8 AR9 AR10
-0,0069 0,0157 0,0175 -0,0087 -0,0108 -0,0038 -0,0136 -0,0077 0,0135 -0,0024 -0,0190 0,0119 0,0100 0,0041 0,0044 -0,0287
-0,0008 0,0012 -0,0088 0,0032 -0,0017 0,0088 -0,1461 -0,0968 0,0208 -0,1597 -0,0234 -0,0413 0,0128 0,0034 0,0105 0,0254
-0,0032 0,0128 0,0029 0,0014 0,0010 -0,0059 -0,0074 -0,0855 0,0001 -0,0011 0,0111 0,0045 0,0002 0,0024 -0,0146 0,0007
-0,0637 -0,0043 0,0003 0,0208 -0,0246 -0,0890 -0,0630 -0,0534 -0,0071 0,0239 -0,0151 0,0054 -0,0083 0,0078 0,0327 -0,0541
-0,0054 -0,0029 -0,0007 0,0019 0,0077 -0,0088 0,0119 0,0000 0,0025 -0,0009 0,0021 0,0039 0,0131 -0,0046 -0,0338 -0,0081", header = T)
df <- melt(df)
df$company <- 1:5
head(df, 11)
ggplot(df, aes(x=df$company[1:16, 0], y=df$company)[1:16, 1:5], group=factor(company)) +
geom_line(aes(color=factor(company)
))
Error in is.list(x) : incorrect number of dimensions
The y axis should be the values and the x axis only the titles of the abnormal returns for each day, i.e. AR-2, AR-1, AR0, AR1.
Commas are used as decimal points, so you would include dec=',' in the read.table call (otherwise the values are read in as factors):
df <- read.table(text="AR-5 AR-4 AR-3 AR-2 AR-1 AR0 AR1 AR2 AR3 AR4 AR5 AR6 AR7 AR8 AR9 AR10
-0,0069 0,0157 0,0175 -0,0087 -0,0108 -0,0038 -0,0136 -0,0077 0,0135 -0,0024 -0,0190 0,0119 0,0100 0,0041 0,0044 -0,0287
-0,0008 0,0012 -0,0088 0,0032 -0,0017 0,0088 -0,1461 -0,0968 0,0208 -0,1597 -0,0234 -0,0413 0,0128 0,0034 0,0105 0,0254
-0,0032 0,0128 0,0029 0,0014 0,0010 -0,0059 -0,0074 -0,0855 0,0001 -0,0011 0,0111 0,0045 0,0002 0,0024 -0,0146 0,0007
-0,0637 -0,0043 0,0003 0,0208 -0,0246 -0,0890 -0,0630 -0,0534 -0,0071 0,0239 -0,0151 0,0054 -0,0083 0,0078 0,0327 -0,0541
-0,0054 -0,0029 -0,0007 0,0019 0,0077 -0,0088 0,0119 0,0000 0,0025 -0,0009 0,0021 0,0039 0,0131 -0,0046 -0,0338 -0,0081", header = T, dec = ',')
Then if you include company as factor 1 to 5, you can use your melt to make the data long:
df$company <- as.factor(1:5)
df.melt <- melt(df, id = "company")
> head(df.melt, 10)
company variable value
1 1 AR.5 -0.0069
2 2 AR.5 -0.0008
3 3 AR.5 -0.0032
4 4 AR.5 -0.0637
5 5 AR.5 -0.0054
6 1 AR.4 0.0157
7 2 AR.4 0.0012
8 3 AR.4 0.0128
9 4 AR.4 -0.0043
10 5 AR.4 -0.0029
And then use ggplot. Note: no need to include df a second time in aes() as you had previously.
ggplot(df.melt, aes(x = variable, y = value, group = company)) +
geom_line(aes(color=company))
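If you prefer tidyr over reshape2, pivot_longer gives the same long format (this assumes the same df with the company factor already added):
library(tidyr)
# pivot all AR columns into variable/value pairs, keeping company as the id
df.long <- pivot_longer(df, cols = -company, names_to = "variable", values_to = "value")
ggplot(df.long, aes(x = variable, y = value, group = company)) +
  geom_line(aes(color = company))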

ERROR in R: `[.data.frame`(m.data, , treat) : undefined columns selected - running mediation

#Create subset of a dataset
df <- subset(dat,select = c(id,obs,day_clos,posaff,er89,qol1))
### remove rows with missing values on a variable
df <- subset(df, !is.na(day_clos))
df <- subset(df, !is.na(er89))
df <- subset(df, !is.na(qol1))
df <- subset(df,!is.na(posaff))
any(is.na(df)) ## returns FALSE
Then my data looks like this
id obs day_clos posaff er89 qol1
1 0 16966.61 2.000000 2.785714 3
1 1 16967.79 1.666667 2.785714 4
1 2 16968.82 1.666667 3.142857 3
1 3 16969.76 1.166667 3.071429 4
1 4 16970.95 2.083333 3.000000 4
1 5 16971.75 1.416667 2.857143 4
model.Y <- lm(qol1 ~ posaff,df)
summary(model.Y)
model.M <- lm(qol1 ~ er89, df)
summary(model.M)
#### There is no problem running the regression analyses, however:
results <- mediate(model.M, model.Y, treat="posaff", mediator="er89", boot=TRUE, sims=500)
Returns error message: [.data.frame(m.data, , treat) : undefined columns selected
Does anyone know how to fix this?
Variables used in treat and mediator must be present in both models:
treat a character string indicating the name of the treatment variable used in the models.
The treatment can be either binary (integer or a two-valued factor) or continuous
(numeric).
mediator a character string indicating the name of the mediator variable used in the models
Source
A trivial working example:
library("mediation")
db<-data.frame(y=c(1,2,3,4,5,6,7,8,9),x1=c(9,8,7,6,5,4,3,2,1),x2=c(9,9,7,7,5,5,3,3,1),x3=c(1,1,1,1,1,1,1,1,1))
model.M <- lm(x2 ~ x1+x3,db)
model.Y <- lm(y ~ x1+x2+x3, db)
results <- mediate(model.M, model.Y, treat="x1", mediator="x2", boot=TRUE, sims=500)
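Applied to the question's variables, both models presumably need to contain the treatment (posaff) and the mediator (er89), roughly like this:
model.M <- lm(er89 ~ posaff, df)          # mediator modelled on the treatment
model.Y <- lm(qol1 ~ posaff + er89, df)   # outcome modelled on treatment + mediator
results <- mediate(model.M, model.Y, treat = "posaff", mediator = "er89", boot = TRUE, sims = 500)
summary(results)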
I think I have done what you suggested, but it is still giving the same error message.
model.mediator <- lmer(PercAccuracy~factor(Rep1) +
(factor(Rep1)| ParticipantPublicID),
data = data, REML=FALSE , control = control_params)
summary(model.mediator)
model.outcome <- lmer(Sharing~factor(Rep1) +PercAccuracy+
(factor(Rep1)+PercAccuracy| ParticipantPublicID),
data = data, REML=FALSE , control = control_params)
summary(model.outcome )
effectModel<-mediate(model.mediator, model.outcome, treat = "Rep1", mediator="PercAccuracy")
summary(effectModel)
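A guess at what is going wrong here: mediate() looks for a column literally named "Rep1", but factor(Rep1) inside the formulas makes the model-frame column "factor(Rep1)", which would explain the undefined-columns error. Converting the variable to a factor in the data first and using it unwrapped might help:
data$Rep1 <- factor(data$Rep1)   # convert once, so the column name stays "Rep1"
model.mediator <- lmer(PercAccuracy ~ Rep1 + (Rep1 | ParticipantPublicID),
                       data = data, REML = FALSE, control = control_params)
model.outcome <- lmer(Sharing ~ Rep1 + PercAccuracy + (Rep1 + PercAccuracy | ParticipantPublicID),
                      data = data, REML = FALSE, control = control_params)
effectModel <- mediate(model.mediator, model.outcome, treat = "Rep1", mediator = "PercAccuracy")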

Performing lm() and segmented() on multiple columns in R

I am trying to perform lm() and segmented() in R using the same independent variable (x) and multiple dependent response variables (Curve1, Curve2, etc.) one by one. I wish to extract the estimated break point and model coefficients for each response variable. I include an example of my data below.
x Curve1 Curve2 Curve3
1 -0.236422 98.8169 95.6828 101.7910
2 -0.198083 98.3260 95.4185 101.5170
3 -0.121406 97.3442 94.8899 100.9690
4 0.875399 84.5815 88.0176 93.8424
5 0.913738 84.1139 87.7533 93.5683
6 1.795530 73.3582 78.1278 82.9956
7 1.833870 72.8905 77.7093 82.7039
8 1.872200 72.4229 77.3505 82.4123
9 2.907350 59.2070 67.6652 74.5374
10 3.865810 46.4807 58.5158 65.0220
11 3.904150 45.9716 58.1498 64.7121
12 3.942490 45.4626 57.8099 64.4022
13 4.939300 33.3040 48.9742 56.3451
14 4.977640 32.9641 48.6344 56.0352
15 5.936100 24.4682 36.4758 47.0485
16 5.936100 24.4682 36.4758 47.0485
17 6.012780 23.7885 35.9667 46.5002
18 6.971250 20.7387 29.6035 39.6476
19 7.009580 20.6167 29.3490 39.3930
20 8.006390 18.7209 22.7313 32.7753
21 8.121410 18.5022 22.3914 32.1292
22 9.041530 16.4722 19.6728 26.9604
23 9.079870 16.3877 19.5595 26.7450
I am able to do this one curve at a time using the below code. However, my full data set has over 1000 curves, so I would like to be able to repeat this code over every column somehow. I have not been at all successful trying to loop it over every column, so if anyone could show me how to do something like that and create a summary data frame similar to that generated by the below code, but with every column included, I would be extremely grateful. Thanks!
model <- lm(Curve1~x, dat) # Linear model
seg_model <- segmented(model, seg.Z = ~x) # Segmented model
breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
coefficients <- as.matrix(seg_model$coefficients) # Extract coefficients
summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients)) # combine breakpoint and coefficients
colnames(summary_curve1) <- "Curve_1" # header name
summary_curve1 # display summary
Here's an approach using tidyverse and broom to return a data frame containing the results for each Curve column:
library(broom)
library(tidyverse)
model.results = setNames(names(dat[,-1]), names(dat[,-1])) %>%
  map(~ lm(paste0(.x, " ~ x"), data=dat) %>%
        segmented(seg.Z=~x) %>%
        list(model=tidy(.),
             psi=data.frame(term="breakpoint", estimate=.[["psi.history"]][[5]]))) %>%
  map_df(~.[2:3] %>% bind_rows, .id="Curve")
model.results
Curve term estimate std.error statistic p.value
1 Curve1 (Intercept) 95.866127 0.14972382 640.286416 1.212599e-42
2 Curve1 x -12.691455 0.05220412 -243.112130 1.184191e-34
3 Curve1 U1.x 10.185816 0.11080880 91.922447 1.233602e-26
4 Curve1 psi1.x 0.000000 0.02821843 0.000000 1.000000e+00
5 Curve1 breakpoint 5.595706 NA NA NA
6 Curve2 (Intercept) 94.826309 0.45750667 207.267599 2.450058e-33
7 Curve2 x -9.489342 0.11156425 -85.057193 5.372730e-26
8 Curve2 U1.x 6.532312 1.17332640 5.567344 2.275438e-05
9 Curve2 psi1.x 0.000000 0.23845241 0.000000 1.000000e+00
10 Curve2 breakpoint 7.412087 NA NA NA
11 Curve3 (Intercept) 100.027990 0.29453941 339.608175 2.069087e-37
12 Curve3 x -8.931163 0.08154534 -109.523900 4.447569e-28
13 Curve3 U1.x 2.807215 0.36046013 7.787865 2.492325e-07
14 Curve3 psi1.x 0.000000 0.26319757 0.000000 1.000000e+00
15 Curve3 breakpoint 6.362132 NA NA NA
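If only the breakpoints are needed, they can be filtered out of that data frame, for example:
model.results %>%
  filter(term == "breakpoint") %>%   # keep only the breakpoint rows
  select(Curve, estimate)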
You can wrap the whole thing in a function, taking as the arguments the column name and the data, and use lapply on the column names, like this:
library(segmented)
run_mod <- function(varname, data){
  data$Y <- data[,varname]
  model <- lm(Y ~ x, data)                              # Linear model
  seg_model <- segmented(model, seg.Z = ~x)             # Segmented model
  breakpoint <- as.matrix(seg_model$psi.history[[5]])   # Extract breakpoint
  coefficients <- as.matrix(seg_model$coefficients)     # Extract coefficients
  summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients))
  colnames(summary_curve1) <- varname
  return(summary_curve1)
}
lapply(names(dat)[2:ncol(dat)], function(x) run_mod(x, dat))
Which gives the summary for each fitted curve (not sure which output you actually want).
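If a single summary table is wanted, the list returned by lapply can presumably be bound column-wise, since each per-curve data frame shares the same row layout:
res_list <- lapply(names(dat)[2:ncol(dat)], function(x) run_mod(x, dat))
summary_all <- do.call(cbind, res_list)   # one column per curve
summary_all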
I had the same issue and I'm trying to adapt the suggested answer, but the following error appears:
Error in model.frame.default(formula = Y ~ Prof, data = data, drop.unused.levels = TRUE) :
invalid type (list) for variable 'Y'
I ran this code:
run_mod <- function(varname, data){
  data$Y <- data[,varname]
  model <- lm(Y ~ Prof, data)                           # Linear model
  seg_model <- segmented(model, seg.Z = ~ Prof)         # Segmented model
  breakpoint <- as.matrix(seg_model$psi.history[[5]])   # Extract breakpoint
  coefficients <- as.matrix(seg_model$coefficients)     # Extract coefficients
  summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients))
  colnames(summary_curve1) <- varname
  return(summary_curve1)
}
lapply(names(DATApiv)[3:ncol(DATApiv)], function(Prof) run_mod(Prof, DATApiv))
Note: Prof is the column in my data frame that corresponds to the independent variable (like the x column in this example), and DATApiv is my data frame.
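One likely cause of that "invalid type (list)" error is that DATApiv is a tibble rather than a plain data frame: DATApiv[, varname] then returns a one-column tibble (internally a list), which lm() rejects for Y. Extracting the column with [[ gives a plain vector, so a minimal change inside run_mod would be:
data$Y <- data[[varname]]   # returns a vector even when data is a tibble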

R dynamic equation

Using R and estimating a simple equation by least squares that has the lagged dependent variable as an independent (explanatory, right-hand-side) variable, I want to forecast out of sample and use the forecasts of the dependent variable as the lag at each step ahead.
I.e., I want to extend forecasts of y beyond the data period:
a <- lm( y ~ x + lag(y,1), data= dset1)
b <- forecast(a,newdata=dset2)
where dset2 has the full period of extra x variables, but not the lagged y.
Here is an example using the AirPassengers data set, where dset2 was created with some missing ap data. The results below show that only row 143 gets filled in, not 144, because forecast did not have the value at 143 to use as the lag.
I looked at the dyn, dynlm, and forecast packages, but none seem to work with this type of model. (I do not want to restate it as an ARMA or a VAR.)
What package can easily do this, or am I using forecast incorrectly?
I can loop and step ahead one period at a time, but would rather not do that.
##Example case using airline data
data("AirPassengers", package = "datasets")
ap <- log(AirPassengers)
ap <- as.ts(ap)
d1 <- data.frame(ap, index= as.Date(ap))
m1 <- lm(ap ~ lag(ap,1), data=d1)
m2 <- dynlm(ap ~ lag(ap,1), data=d1)
m3 <- dyn(lm(ap ~ lag(ap,1), data=d1))
summary(m3)
## Neither lm, dyn, nor dynlm objects worked as I want
## Try forecasting missing values, 2 steps, rows 143 and 144
d2 <- d1
d2$apx = d2$ap
d2$apx[143:144]= NA
mx <- lm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Results:
> b
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
1 NA NA NA NA NA
2 4.756850 4.619213 4.894488 4.545513 4.968188
3 4.807218 4.669783 4.944653 4.596191 5.018245
....
140 6.411559 6.273546 6.549572 6.199644 6.623474
141 6.386407 6.248507 6.524306 6.174667 6.598146
142 6.216154 6.078941 6.353368 6.005467 6.426841
143 6.122453 5.985553 6.259354 5.912247 6.332659
144 NA NA NA NA NA
Other lm-like objects produced errors with forecast:
mx <- dynlm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Error in forecast.lm(mx, newdata = d2) : invalid type/length
(symbol/0) in vector allocation
mx <- dyn(lm(apx ~ lag(apx,1), data=d2))
b <- forecast(mx,newdata=d2)
Error in predict.lm(object, newdata = newdata, se.fit = TRUE, interval
= "prediction", : formal argument "se.fit" matched by multiple actual arguments

Error: variables were specified with different types from the fit

In the car package, I am trying to predict the response variable prestige in a dataset named Prestige from income, education, and the factor type, using the lm function. But before I fit the data, I want to scale education and income. If you copy the code below and run it in RStudio, the console says: Error: variables ‘income’, ‘I(income^2)’, ‘education’, ‘I(education^2)’ were specified with different types from the fit
library(car)
summary(Prestige)
Prestige$education <- scale(Prestige$education)
Prestige$income <- scale(Prestige$income)
fit <- lm(prestige ~ income + I(income^2) + education + I(education^2)
+ income:education + type + type:income + type:I(income^2)
+ type:education + type:I(education^2)+ type:income:education, Prestige)
summary(fit)
pred <- expand.grid(income = c(1000, 20000), education = c(10, 20), type = levels(Prestige$type))
pred$prestige.pred <- predict(fit, newdata = pred)
pred
Without scaling the predictors, it works successfully, so the error is definitely due to the scaling before prediction. How can I fix this issue?
Note that scale() actually changes the class of your columns. See
class(car::Prestige$education)
# [1] "numeric"
class(scale(car::Prestige$education))
# [1] "matrix"
You would be safe simplifying them to numeric vectors. You can use the dimension-stripping property of c() for this:
Prestige$education <- c(scale(Prestige$education))
Prestige$income <- c(scale(Prestige$income))
Then I was able to run your model with
fit <- lm(prestige ~ income + I(income^2) + education + I(education^2)
+ income:education + type + type:income + type:I(income^2)
+ type:education + type:I(education^2)+ type:income:education,
Prestige, na.action="na.omit")
and the prediction returned
income education type prestige.pred
1 1000 10 bc -1352364.5
2 20000 10 bc -533597423.4
3 1000 20 bc -1382361.7
4 20000 20 bc -534229639.3
5 1000 10 prof 398464.2
6 20000 10 prof 155567014.1
7 1000 20 prof 409271.3
8 20000 20 prof 155765754.7
9 1000 10 wc -7661464.3
10 20000 10 wc -3074382169.9
11 1000 20 wc -7634693.8
12 20000 20 wc -3073902696.6
Also note you can simplify your formula somewhat with
fit<-lm(prestige ~ (income + I(income^2) + education + I(education^2))*type +
income:education + type:income:education, Prestige, na.action="na.omit")
This uses * to create many of the interaction terms.
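A possible caveat: the very large predicted values above suggest that pred is still in raw units (income = 20000, education = 10) while the model was fit on scaled predictors. If predictions on the original scale are wanted, newdata presumably needs the same transformation, e.g. using the centre and scale that scale() stores as attributes:
inc_sc <- scale(car::Prestige$income)       # keeps "scaled:center"/"scaled:scale" attributes
edu_sc <- scale(car::Prestige$education)
pred <- expand.grid(income = c(1000, 20000), education = c(10, 20), type = levels(Prestige$type))
pred$income <- (pred$income - attr(inc_sc, "scaled:center")) / attr(inc_sc, "scaled:scale")
pred$education <- (pred$education - attr(edu_sc, "scaled:center")) / attr(edu_sc, "scaled:scale")
pred$prestige.pred <- predict(fit, newdata = pred)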
scale() adds attributes that seem to create problems with lm(). Using
Prestige$education <- as.numeric(scale(Prestige$education))
Prestige$income <- as.numeric(scale(Prestige$income))
makes everything work.
