Output from scatter3d R script - how to read the equation
I am using scatter3d to find a fit in my R script. I did so, and here is the output:
Call:
lm(formula = y ~ (x + z)^2 + I(x^2) + I(z^2))
Residuals:
Min 1Q Median 3Q Max
-0.78454 -0.02302 -0.00563 0.01398 0.47846
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.051975 0.003945 -13.173 < 2e-16 ***
x 0.224564 0.023059 9.739 < 2e-16 ***
z 0.356314 0.021782 16.358 < 2e-16 ***
I(x^2) -0.340781 0.044835 -7.601 3.46e-14 ***
I(z^2) 0.610344 0.028421 21.475 < 2e-16 ***
x:z -0.454826 0.065632 -6.930 4.71e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05468 on 5293 degrees of freedom
Multiple R-squared: 0.6129, Adjusted R-squared: 0.6125
F-statistic: 1676 on 5 and 5293 DF, p-value: < 2.2e-16
Based on this, what is the equation of the best fit line? I'm not really sure how to read this output. Can someone explain? Thanks!
This is a basic regression output table. The parameter estimates (the "Estimate" column) are the best-fit coefficients corresponding to the different terms in your model. If you aren't familiar with this terminology, I would suggest working through a linear models and regression tutorial; there are thousands around the web. I would also encourage you to play with some simpler 2D simulations.
For example, let's make some data with an intercept of 2 and a slope of 0.5:
# Simulate data
set.seed(12345)
x = seq(0, 10, len=50)
y = 2 + 0.5 * x + rnorm(length(x), 0, 0.1)
data = data.frame(x, y)
Now when we look at the fit, you'll see that the Estimate column shows these same values:
# Fit model
fit = lm(y ~ x, data=data)
summary(fit)
> summary(fit)
Call:
lm(formula = y ~ x, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.26017 -0.06434 0.02539 0.06238 0.20008
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.011759 0.030856 65.20 <2e-16 ***
x 0.501240 0.005317 94.27 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1107 on 48 degrees of freedom
Multiple R-squared: 0.9946, Adjusted R-squared: 0.9945
F-statistic: 8886 on 1 and 48 DF, p-value: < 2.2e-16
Pulling these out, we can then plot the best-fit line:
# Make plot
dev.new(width=4, height=4)
plot(x, y, ylim=c(0,10))
abline(fit$coef[1], fit$coef[2])
Coming back to your model: it's not a plane but rather a paraboloid surface (using 'y' as the third dimension since you used 'z' already):
y = -0.051975 + 0.224564 * x + 0.356314 * z
    - 0.340781 * x^2 + 0.610344 * z^2 - 0.454826 * x * z
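If you'd rather not copy the coefficients by hand, you can pull them straight out of the fitted model. A minimal sketch, assuming you refit the model yourself as fit <- lm(y ~ (x + z)^2 + I(x^2) + I(z^2)) (scatter3d fits this internally; the coefficient names below are R's standard names for this formula):
b <- coef(fit)
# evaluate the fitted surface at arbitrary (x, z)
surface <- function(x, z) {
  b["(Intercept)"] + b["x"] * x + b["z"] * z +
    b["I(x^2)"] * x^2 + b["I(z^2)"] * z^2 + b["x:z"] * x * z
}
# or let predict() do the same arithmetic:
# predict(fit, newdata = data.frame(x = 0.5, z = 0.5))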
Related
How to formulate time period dummy variable in lm()
I am analysing whether the effects of x_t on y_t differ during and after a specific time period. I am trying to regress the following model in R using lm():
$y_t = b_0 + [b_1(1-D_t) + b_2 D_t]x_t$
where $D_t$ is a dummy variable with the value 1 over the time period and 0 otherwise. Is it possible to use lm() for this formula?
observationNumber <- 1:80
obsFactor <- cut(observationNumber, breaks = c(0, 55, 81), right = FALSE)
fit <- lm(y ~ x * obsFactor)
For example:
y <- runif(80)
x <- rnorm(80) + c(rep(0, 54), rep(1, 26))
fit <- lm(y ~ x * obsFactor)
summary(fit)

Call:
lm(formula = y ~ x * obsFactor)

Residuals:
Min 1Q Median 3Q Max
-0.48375 -0.29655 0.05957 0.22797 0.49617

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50959 0.04253 11.983 <2e-16 ***
x -0.02492 0.04194 -0.594 0.554
obsFactor[55,81) -0.06357 0.09593 -0.663 0.510
x:obsFactor[55,81) 0.07120 0.07371 0.966 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3116 on 76 degrees of freedom
Multiple R-squared: 0.01303, Adjusted R-squared: -0.02593
F-statistic: 0.3345 on 3 and 76 DF, p-value: 0.8004

obsFactor[55,81) is zero if observationNumber < 55 and one if it is greater or equal; its coefficient lets the intercept shift in the second period (drop that main effect from the formula if your model forces a single common $b_0$). The coefficient for $x_t$ is your $b_1$. x:obsFactor[55,81) is the product of the dummy and the variable $x_t$; its coefficient is the slope difference $b_2 - b_1$, so $b_2$ is the sum of the x and x:obsFactor[55,81) coefficients.
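If you want the paper's parameters directly, you can assemble them from the fitted coefficients. A minimal sketch, assuming the fit object from above:
b0 <- coef(fit)["(Intercept)"]
b1 <- coef(fit)["x"]                                    # slope outside the period
b2 <- coef(fit)["x"] + coef(fit)["x:obsFactor[55,81)"]  # slope during the period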
Why I() "AsIs" is necessary when making a linear polynomial model in R?
I'm trying to understand the role of the I() base function in R when fitting a linear polynomial model, compared with the function poly(). When I calculate the model using
q + q^2
q + I(q^2)
poly(q, 2)
I get different answers. Here is an example:
set.seed(20)
q <- seq(from=0, to=20, by=0.1)
y <- 500 + .1 * (q-5)^2
noise <- rnorm(length(q), mean=10, sd=80)
noisy.y <- y + noise
model3 <- lm(noisy.y ~ poly(q,2))
model1 <- lm(noisy.y ~ q + I(q^2))
model2 <- lm(noisy.y ~ q + q^2)
I(q^2)==I(q)^2
I(q^2)==q^2
summary(model1)
summary(model2)
summary(model3)
Here is the output:

> summary(model1)

Call:
lm(formula = noisy.y ~ q + I(q^2))

Residuals:
Min 1Q Median 3Q Max
-211.592 -50.609 4.742 61.983 165.792

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 489.3723 16.5982 29.483 <2e-16 ***
q 5.0560 3.8344 1.319 0.189
I(q^2) -0.1530 0.1856 -0.824 0.411
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79.22 on 198 degrees of freedom
Multiple R-squared: 0.02451, Adjusted R-squared: 0.01466
F-statistic: 2.488 on 2 and 198 DF, p-value: 0.08568

> summary(model2)

Call:
lm(formula = noisy.y ~ q + q^2)

Residuals:
Min 1Q Median 3Q Max
-219.96 -54.42 3.30 61.06 170.79

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 499.5209 11.1252 44.900 <2e-16 ***
q 1.9961 0.9623 2.074 0.0393 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79.16 on 199 degrees of freedom
Multiple R-squared: 0.02117, Adjusted R-squared: 0.01625
F-statistic: 4.303 on 1 and 199 DF, p-value: 0.03933

> summary(model3)

Call:
lm(formula = noisy.y ~ poly(q, 2))

Residuals:
Min 1Q Median 3Q Max
-211.592 -50.609 4.742 61.983 165.792

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 519.482 5.588 92.966 <2e-16 ***
poly(q, 2)1 164.202 79.222 2.073 0.0395 *
poly(q, 2)2 -65.314 79.222 -0.824 0.4107
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79.22 on 198 degrees of freedom
Multiple R-squared: 0.02451, Adjusted R-squared: 0.01466
F-statistic: 2.488 on 2 and 198 DF, p-value: 0.08568

Why is I() necessary when fitting a polynomial model in R? Also, is it normal that the poly() function doesn't give the same result as q + I(q^2)?
The formula syntax in R is described in the ?formula help page. Inside a formula, the ^ symbol does not have its usual meaning of exponentiation. Rather, it expands interactions among the terms at its base. For example
y ~ (a+b)^2
is the same as
y ~ a + b + a:b
But if you do
y ~ a + b^2
y ~ a + b  # same as above; there is no way to "interact" b with itself
that caret just includes the b term, because a variable can't interact with itself. So ^ and * inside formulas have nothing to do with multiplication, just as + doesn't really mean addition for variables in the usual sense. If you want the "usual" meaning of ^2, you need to wrap the term in I(), the "as is" function; otherwise the model isn't fitting a squared term at all.
And the poly() function by default returns orthogonal polynomials, as described on its help page. This helps to reduce collinearity among the covariates. But if you don't want the orthogonal versions and just want the "raw" polynomial terms, then just pass raw=TRUE to your poly() call. For example
lm(noisy.y ~ poly(q, 2, raw=TRUE))
will return the same estimates as model1.
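A quick way to see what each formula actually fits is to inspect the columns of the design matrix it generates. A minimal sketch with a small made-up data frame (the names d, a, b are hypothetical):
d <- data.frame(y = rnorm(5), a = rnorm(5), b = rnorm(5))
colnames(model.matrix(y ~ (a + b)^2, d))   # "(Intercept)" "a" "b" "a:b"
colnames(model.matrix(y ~ a + b^2, d))     # "(Intercept)" "a" "b" -- no squared term
colnames(model.matrix(y ~ a + I(b^2), d))  # "(Intercept)" "a" "I(b^2)"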
General Linear Model interpretation of parameter estimates in R
I have a data set that looks like
"","OBSERV","DIOX","logDIOX","OXYGEN","LOAD","PRSEK","PLANT","TIME","LAB"
"1",1011,984.06650389,6.89169348002254,"L","H","L","RENO_N","1","KK"
"2",1022,1790.7973641,7.49041625445373,"H","H","L","RENO_N","1","USA"
"3",1031,661.95870145,6.4952031694744,"L","H","H","RENO_N","1","USA"
"4",1042,978.06853583,6.88557974511529,"H","H","H","RENO_N","1","KK"
"5",1051,270.92290942,5.60183431332639,"N","N","N","RENO_N","1","USA"
"6",1062,402.98269729,5.99889362626069,"N","N","N","RENO_N","1","USA"
"7",1071,321.71945701,5.77367991426247,"H","L","L","RENO_N","1","KK"
"8",1082,223.15260359,5.40785585845064,"L","L","L","RENO_N","1","USA"
"9",1091,246.65350151,5.507984523849,"H","L","H","RENO_N","1","USA"
"10",1102,188.48323034,5.23900903921703,"L","L","H","RENO_N","1","KK"
"11",1141,267.34994025,5.58855843790491,"N","N","N","RENO_N","1","KK"
"12",1152,452.10355987,6.11391126834609,"N","N","N","RENO_N","1","KK"
"13",2011,2569.6672555,7.85153169693888,"N","N","N","KARA","1","USA"
"14",2021,604.79620572,6.40489155123453,"N","N","N","KARA","1","KK"
"15",2031,2610.4804449,7.86728956188212,"L","H",NA,"KARA","1","KK"
"16",2032,3789.7097503,8.24004471210954,"L","H",NA,"KARA","1","USA"
"17",2052,338.97054188,5.82591320649553,"L","L","L","KARA","1","KK"
"18",2061,391.09027375,5.96893841249289,"H","L","H","KARA","1","USA"
"19",2092,410.04420258,6.01626496505788,"N","N","N","KARA","1","USA"
"20",2102,313.51882368,5.74785940190679,"N","N","N","KARA","1","KK"
"21",2112,1242.5931417,7.12495571830002,"H","H","H","KARA","1","KK"
"22",2122,1751.4827969,7.46821802066524,"H","H","L","KARA","1","USA"
"23",3011,60.48026048,4.10231703874031,"N","N","N","RENO_S","1","KK"
"24",3012,257.27729731,5.55015448107691,"N","N","N","RENO_S","1","USA"
"25",3021,46.74282552,3.84466077914493,"N","N","N","RENO_S","1","KK"
"26",3022,73.605375516,4.29871805996994,"N","N","N","RENO_S","1","KK"
"27",3031,108.25433812,4.68448344109116,"H","H","L","RENO_S","1","KK"
"28",3032,124.40704234,4.82355878915293,"H","H","L","RENO_S","1","USA"
"29",3042,123.66859296,4.81760535031397,"L","H","L","RENO_S","1","KK"
"30",3051,170.05332632,5.13611207209694,"N","N","N","RENO_S","1","USA"
"31",3052,95.868704018,4.56297958887925,"N","N","N","RENO_S","1","KK"
"32",3061,202.69261215,5.31169060558111,"N","N","N","RENO_S","1","USA"
"33",3062,70.686307069,4.25825187761015,"N","N","N","RENO_S","1","USA"
"34",3071,52.034715526,3.95191110210073,"L","H","H","RENO_S","1","KK"
"35",3072,93.33525462,4.53619789950355,"L","H","H","RENO_S","1","USA"
"36",3081,121.47464906,4.79970559129829,"H","H","H","RENO_S","1","USA"
"37",3082,94.833869239,4.55212661590867,"H","H","H","RENO_S","1","KK"
"38",3091,68.624596439,4.22865101914209,"H","L","L","RENO_S","1","USA"
"39",3092,64.837097371,4.17187792984139,"H","L","L","RENO_S","1","KK"
"40",3101,32.351569811,3.47666254561192,"L","L","L","RENO_S","1","KK"
"41",3102,29.285124102,3.37707967726539,"L","L","L","RENO_S","1","USA"
"42",3111,31.36974463,3.44584388158928,"L","L","H","RENO_S","1","USA"
"43",3112,28.127853881,3.33676032670116,"L","L","H","RENO_S","1","KK"
"44",3121,91.825330102,4.51988818660262,"H","L","H","RENO_S","1","KK"
"45",3122,136.4559307,4.91600171048243,"H","L","H","RENO_S","1","USA"
"46",4011,126.11889968,4.83722511024933,"H","L","H","RENO_N","2","KK"
"47",4022,76.520259821,4.33755554003153,"L","L","L","RENO_N","2","KK"
"48",4032,93.551979795,4.53851721545715,"L","L","H","RENO_N","2","USA"
"49",4041,207.09703422,5.33318744777751,"H","L","L","RENO_N","2","USA"
"50",4052,383.44185307,5.94918798759058,"N","N","N","RENO_N","2","USA"
"51",4061,156.79345897,5.05492939129363,"N","N","N","RENO_N","2","USA"
"52",4071,322.72413197,5.77679787769979,"L","H","L","RENO_N","2","USA"
"53",4082,554.05710342,6.31726775620079,"H","H","H","RENO_N","2","USA"
"54",4091,122.55552697,4.80856420867156,"N","N","N","RENO_N","2","KK"
"55",4102,112.70050456,4.72473389805434,"N","N","N","RENO_N","2","KK"
"56",4111,94.245481423,4.54590288271731,"L","H","H","RENO_N","2","KK"
"57",4122,323.16498582,5.77816298482521,"H","H","L","RENO_N","2","KK"
I define a linear model in R using lm as
lm1 <- lm(logDIOX ~ 1 + OXYGEN + LOAD + PLANT + TIME + LAB, data=data)
and I want to interpret the estimated coefficients. However, when I extract the coefficients I get multiple NAs (I'm assuming it's due to linear dependencies among the variables). How can I then interpret the coefficients? I only have one intercept that somehow represents one of the levels of each of the included factors in the model. Is it possible to get an estimate for each factor level?

> summary(lm1)

Call:
lm(formula = logDIOX ~ OXYGEN + LOAD + PLANT + TIME + LAB, data = data)

Residuals:
Min 1Q Median 3Q Max
-0.90821 -0.32102 -0.08993 0.27311 0.97758

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2983 0.2110 34.596 < 2e-16 ***
OXYGENL -0.4086 0.1669 -2.449 0.017953 *
OXYGENN -0.7567 0.1802 -4.199 0.000113 ***
LOADL -1.0645 0.1675 -6.357 6.58e-08 ***
LOADN NA NA NA NA
PLANTRENO_N -0.6636 0.2174 -3.052 0.003664 **
PLANTRENO_S -2.3452 0.1929 -12.158 < 2e-16 ***
TIME2 -0.9160 0.2065 -4.436 5.18e-05 ***
LABUSA 0.3829 0.1344 2.849 0.006392 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5058 on 49 degrees of freedom
Multiple R-squared: 0.8391, Adjusted R-squared: 0.8161
F-statistic: 36.5 on 7 and 49 DF, p-value: < 2.2e-16
For the NA part of your question, have a look at: linear regression "NA" estimate just for last coefficient. It means one of your variables can be described as a linear combination of the rest. For factors and their levels, the way R works is to fold the first level of each factor into the intercept and report the other levels as differences from it. I think it will be clearer with a single-factor regression:
lm1 <- lm(logDIOX ~ 1 + OXYGEN, data=df)

> summary(lm1)

Call:
lm(formula = logDIOX ~ 1 + OXYGEN, data = df)

Residuals:
Min 1Q Median 3Q Max
-1.7803 -0.7833 -0.2027 0.6597 3.1229

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5359 0.2726 20.305 <2e-16 ***
OXYGENL -0.4188 0.3909 -1.071 0.289
OXYGENN -0.1896 0.3807 -0.498 0.621
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.188 on 54 degrees of freedom
Multiple R-squared: 0.02085, Adjusted R-squared: -0.01542
F-statistic: 0.5749 on 2 and 54 DF, p-value: 0.5662

What this result says is that for OXYGEN="H" the intercept is 5.5359, for OXYGEN="L" the intercept is 5.5359-0.4188=5.1171, and for OXYGEN="N" the intercept is 5.5359-0.1896=5.3463. Hope this helps.
UPDATE: Following your comment, I generalize to your model:
when OXYGEN="H", LOAD="H", PLANT="KARA", TIME=1, LAB="KK", then: logDIOX = 7.2983
when OXYGEN="L", LOAD="H", PLANT="KARA", TIME=1, LAB="KK", then: logDIOX = 7.2983-0.4086 = 6.8897
when OXYGEN="L", LOAD="L", PLANT="KARA", TIME=1, LAB="KK", then: logDIOX = 7.2983-0.4086-1.0645 = 5.8252
etc.
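If you'd rather not add up coefficients by hand, predict() does the same arithmetic. A minimal sketch, assuming the lm1 fit on the full model from the question (R will warn that the fit is rank-deficient because of the dropped LOADN term, but it still returns fitted values):
newdat <- expand.grid(OXYGEN = c("H", "L", "N"), LOAD = "H",
                      PLANT = "KARA", TIME = "1", LAB = "KK")
cbind(newdat, logDIOX = predict(lm1, newdata = newdat))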
Need help modeling data with a log() function
I have some data that Excel will fit pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficients and the intercept. What form should the data be in, and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function:
y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent the % of our user base that is retained on a given day.
The question omitted the values of x, but working backwards it seems you were using 1, 2, 3, ..., so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept)      log(x)
     0.7521     -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
You can get the same results with:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq(along=y)

> summary(lm(y~log(t)))

Call:
lm(formula = y ~ log(t))

Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16

For larger projects I recommend encapsulating the data in a data frame, like
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)
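Since the question mentions projecting into the future, note that either fit extrapolates with predict(). A minimal sketch, assuming the fm object from the first answer (the future day numbers are hypothetical):
future <- data.frame(x = 12:30)  # days beyond the observed range
predict(fm, newdata = future)    # projected retention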
Two stage least square in R
I want to run a two-stage probit least-squares regression in R. Does anyone know how to do this? Is there any package out there? I know it's possible to do it using Stata, so I imagine it's possible with R.
You might want to be more specific when you say 'two-stage probit least squares'. Since you refer to a Stata program that implements this, I am guessing you are talking about the CDSIMEQ package, which implements the Amemiya (1978) procedure for the Heckit model (a.k.a. generalized Tobit, a.k.a. Tobit type II model, etc.). As Grant said, systemfit will do a Tobit for you, but not with two equations. The micEcon package did have a Heckit (but the package has split so many times I don't know where it is now). If you want what CDSIMEQ does, it can easily be implemented in R. I wrote a function that replicates CDSIMEQ:
tspls <- function(formula1, formula2, data) {
  # The continuous model
  mf1 <- model.frame(formula1, data)
  y1 <- model.response(mf1)
  x1 <- model.matrix(attr(mf1, "terms"), mf1)

  # The dichotomous model
  mf2 <- model.frame(formula2, data)
  y2 <- model.response(mf2)
  x2 <- model.matrix(attr(mf2, "terms"), mf2)

  # The matrix of all the exogenous variables
  X <- cbind(x1, x2)
  X <- X[, unique(colnames(X))]

  J1 <- matrix(0, nrow = ncol(X), ncol = ncol(x1))
  J2 <- matrix(0, nrow = ncol(X), ncol = ncol(x2))
  for (i in 1:ncol(x1)) J1[match(colnames(x1)[i], colnames(X)), i] <- 1
  for (i in 1:ncol(x2)) J2[match(colnames(x2)[i], colnames(X)), i] <- 1

  # Step 1:
  cat("\n\tNOW THE FIRST STAGE REGRESSION")
  m1 <- lm(y1 ~ X - 1)
  m2 <- glm(y2 ~ X - 1, family = binomial(link = "probit"))
  print(summary(m1))
  print(summary(m2))

  yhat1 <- m1$fitted.values
  yhat2 <- X %*% coef(m2)

  PI1 <- m1$coefficients
  PI2 <- m2$coefficients
  V0 <- vcov(m2)
  sigma1sq <- sum(m1$residuals ^ 2) / m1$df.residual
  sigma12 <- 1 / length(y2) * sum(y2 * m1$residuals / dnorm(yhat2))

  # Step 2:
  cat("\n\tNOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS")
  m1 <- lm(y1 ~ yhat2 + x1 - 1)
  m2 <- glm(y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))
  sm1 <- summary(m1)
  sm2 <- summary(m2)
  print(sm1)
  print(sm2)

  # Step 3:
  cat("\tNOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS\n\n")
  gamma1 <- m1$coefficients[1]
  gamma2 <- m2$coefficients[1]

  cc <- sigma1sq - 2 * gamma1 * sigma12
  dd <- gamma2 ^ 2 * sigma1sq - 2 * gamma2 * sigma12

  H <- cbind(PI2, J1)
  G <- cbind(PI1, J2)

  XX <- crossprod(X)                           # X'X
  HXXH <- solve(t(H) %*% XX %*% H)             # (H'X'XH)^(-1)
  HXXVXXH <- t(H) %*% XX %*% V0 %*% XX %*% H   # H'X'X V0 X'X H
  Valpha1 <- cc * HXXH + gamma1 ^ 2 * HXXH %*% HXXVXXH %*% HXXH

  GV <- t(G) %*% solve(V0)                     # G'V0^(-1)
  GVG <- solve(GV %*% G)                       # (G'V0^(-1)G)^(-1)
  Valpha2 <- GVG + dd * GVG %*% GV %*% solve(XX) %*% solve(V0) %*% G %*% GVG

  ans1 <- coef(sm1)
  ans2 <- coef(sm2)
  ans1[,2] <- sqrt(diag(Valpha1))
  ans2[,2] <- sqrt(diag(Valpha2))
  ans1[,3] <- ans1[,1] / ans1[,2]
  ans2[,3] <- ans2[,1] / ans2[,2]
  ans1[,4] <- 2 * pt(abs(ans1[,3]), m1$df.residual, lower.tail = FALSE)
  ans2[,4] <- 2 * pnorm(abs(ans2[,3]), lower.tail = FALSE)

  cat("Continuous:\n")
  print(ans1)
  cat("Dichotomous:\n")
  print(ans2)
}
For comparison, we can replicate the example from the author of CDSIMEQ in their article about the package:
> library(foreign)
> cdsimeq <- read.dta("http://www.stata-journal.com/software/sj3-2/st0038/cdsimeq.dta")
> tspls(continuous ~ exog3 + exog2 + exog1 + exog4,
+       dichotomous ~ exog1 + exog2 + exog5 + exog6 + exog7,
+       data = cdsimeq)

	NOW THE FIRST STAGE REGRESSION

Call:
lm(formula = y1 ~ X - 1)

Residuals:
Min 1Q Median 3Q Max
-1.885921 -0.438579 -0.006262 0.432156 2.133738

Coefficients:
Estimate Std. Error t value Pr(>|t|)
X(Intercept) 0.010752 0.020620 0.521 0.602187
Xexog3 0.158469 0.021862 7.249 8.46e-13 ***
Xexog2 -0.009669 0.021666 -0.446 0.655488
Xexog1 0.159955 0.021260 7.524 1.19e-13 ***
Xexog4 0.316575 0.022456 14.097 < 2e-16 ***
Xexog5 0.497207 0.021356 23.282 < 2e-16 ***
Xexog6 -0.078017 0.021755 -3.586 0.000352 ***
Xexog7 0.161177 0.022103 7.292 6.23e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6488 on 992 degrees of freedom
Multiple R-squared: 0.5972, Adjusted R-squared: 0.594
F-statistic: 183.9 on 8 and 992 DF, p-value: < 2.2e-16

Call:
glm(formula = y2 ~ X - 1, family = binomial(link = "probit"))

Deviance Residuals:
Min 1Q Median 3Q Max
-2.49531 -0.59244 0.01983 0.59708 2.41810

Coefficients:
Estimate Std. Error z value Pr(>|z|)
X(Intercept) 0.08352 0.05280 1.582 0.113692
Xexog3 0.21345 0.05678 3.759 0.000170 ***
Xexog2 0.21131 0.05471 3.862 0.000112 ***
Xexog1 0.45591 0.06023 7.570 3.75e-14 ***
Xexog4 0.39031 0.06173 6.322 2.57e-10 ***
Xexog5 0.75955 0.06427 11.818 < 2e-16 ***
Xexog6 0.85461 0.06831 12.510 < 2e-16 ***
Xexog7 -0.16691 0.05653 -2.953 0.003152 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1386.29 on 1000 degrees of freedom
Residual deviance: 754.14 on 992 degrees of freedom
AIC: 770.14

Number of Fisher Scoring iterations: 6

	NOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS

Call:
lm(formula = y1 ~ yhat2 + x1 - 1)

Residuals:
Min 1Q Median 3Q Max
-2.32152 -0.53160 0.04886 0.53502 2.44818

Coefficients:
Estimate Std. Error t value Pr(>|t|)
yhat2 0.257592 0.021451 12.009 <2e-16 ***
x1(Intercept) 0.012185 0.024809 0.491 0.623
x1exog3 0.042520 0.026735 1.590 0.112
x1exog2 0.011854 0.026723 0.444 0.657
x1exog1 0.007773 0.028217 0.275 0.783
x1exog4 0.318636 0.028311 11.255 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7803 on 994 degrees of freedom
Multiple R-squared: 0.4163, Adjusted R-squared: 0.4128
F-statistic: 118.2 on 6 and 994 DF, p-value: < 2.2e-16

Call:
glm(formula = y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))

Deviance Residuals:
Min 1Q Median 3Q Max
-2.49610 -0.58595 0.01969 0.59857 2.41281

Coefficients:
Estimate Std. Error z value Pr(>|z|)
yhat1 1.26287 0.16061 7.863 3.75e-15 ***
x2(Intercept) 0.07080 0.05276 1.342 0.179654
x2exog1 0.25093 0.06466 3.880 0.000104 ***
x2exog2 0.22604 0.05389 4.194 2.74e-05 ***
x2exog5 0.12912 0.09510 1.358 0.174544
x2exog6 0.95609 0.07172 13.331 < 2e-16 ***
x2exog7 -0.37128 0.06759 -5.493 3.94e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1386.29 on 1000 degrees of freedom
Residual deviance: 754.21 on 993 degrees of freedom
AIC: 768.21

Number of Fisher Scoring iterations: 6

	NOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS

Continuous:
Estimate Std. Error t value Pr(>|t|)
yhat2 0.25759209 0.1043073 2.46955009 0.01369540
x1(Intercept) 0.01218500 0.1198713 0.10165068 0.91905445
x1exog3 0.04252006 0.1291588 0.32920764 0.74206810
x1exog2 0.01185438 0.1290754 0.09184073 0.92684309
x1exog1 0.00777347 0.1363643 0.05700519 0.95455252
x1exog4 0.31863627 0.1367881 2.32941597 0.02003661
Dichotomous:
Estimate Std. Error z value Pr(>|z|)
yhat1 1.26286574 0.7395166 1.7076909 0.0876937093
x2(Intercept) 0.07079775 0.2666447 0.2655134 0.7906139867
x2exog1 0.25092561 0.3126763 0.8025092 0.4222584495
x2exog2 0.22603717 0.2739307 0.8251618 0.4092797527
x2exog5 0.12911922 0.4822986 0.2677163 0.7889176766
x2exog6 0.95609385 0.2823662 3.3860070 0.0007091758
x2exog7 -0.37128221 0.3265478 -1.1369920 0.2555416141
systemfit will also do the trick.
There are several packages available in R to do two-stage least squares. Here are a few:
sem: Two-Stage Least Squares
Zelig: Link removed, no longer functional (28.07.11)
Let me know if these serve your purpose.
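For ordinary (linear) two-stage least squares, as opposed to the two-stage probit procedure above, the tsls() function in the sem package is a common starting point. A minimal sketch using the Kmenta example that ships with the sem package:
library(sem)
data(Kmenta)  # classic supply/demand data included with sem
# demand equation; D, F and A are the exogenous instruments
fit2sls <- tsls(Q ~ P + D, instruments = ~ D + F + A, data = Kmenta)
summary(fit2sls)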