I have a variable x that takes values in (0,1], i.e. greater than 0 and at most 1.
I want to generate 10 dummy variables for the 10 deciles of x. For example, x_0_10 takes the value 1 if x is between 0 and 0.1, x_10_20 takes the value 1 if x is between 0.1 and 0.2, and so on.
The Stata code to do this is something like the following:
forval p = 0(10)90 {
    local Next = `p' + 10
    gen x_`p'_`Next' = 0
    replace x_`p'_`Next' = 1 if x <= `Next'/100 & x > `p'/100
}
Now, I am new to R and I wonder how I can do the above in R?
cut is your friend here; its output is a factor, which R will automatically expand into the 10 dummy variables when you use it in a model.
set.seed(2932)
x = runif(1e4)
y = 3 + 4 * x + rnorm(1e4)
x_cut = cut(x, 0:10/10, include.lowest = TRUE)
summary(lm(y ~ x_cut))
# Call:
# lm(formula = y ~ x_cut)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.7394 -0.6888 0.0028 0.6864 3.6742
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.16385 0.03243 97.564 <2e-16 ***
# x_cut(0.1,0.2] 0.43932 0.04551 9.654 <2e-16 ***
# x_cut(0.2,0.3] 0.85555 0.04519 18.933 <2e-16 ***
# x_cut(0.3,0.4] 1.26441 0.04588 27.556 <2e-16 ***
# x_cut(0.4,0.5] 1.66181 0.04495 36.970 <2e-16 ***
# x_cut(0.5,0.6] 2.04538 0.04574 44.714 <2e-16 ***
# x_cut(0.6,0.7] 2.44771 0.04533 53.999 <2e-16 ***
# x_cut(0.7,0.8] 2.80875 0.04591 61.182 <2e-16 ***
# x_cut(0.8,0.9] 3.22323 0.04545 70.919 <2e-16 ***
# x_cut(0.9,1] 3.60092 0.04564 78.897 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.011 on 9990 degrees of freedom
# Multiple R-squared: 0.5589, Adjusted R-squared: 0.5585
# F-statistic: 1407 on 9 and 9990 DF, p-value: < 2.2e-16
See ?cut for more customization options.
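For instance, cut's labels argument can reproduce the Stata-style names directly (a quick sketch; the sprintf pattern is my own choice):
# Stata-style level labels instead of the default (0.1,0.2] form
x_cut2 = cut(x, 0:10/10, include.lowest = TRUE,
             labels = sprintf("x_%d_%d", 0:9 * 10, 1:10 * 10))
table(x_cut2)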
You can also pass cut directly in the RHS of the formula, which would make using predict a bit easier:
reg = lm(y ~ cut(x, 0:10/10, include.lowest = TRUE))
idx = sample(length(x), 500)
plot(x[idx], y[idx])
x_grid = seq(0, 1, length.out = 500L)
lines(x_grid, predict(reg, data.frame(x = x_grid)),
col = 'red', lwd = 3L, type = 's')
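And if you literally need the 10 separate 0/1 columns, as in the Stata loop, one way (a sketch; the column renaming is my own choice) is model.matrix:
# Expand the factor into explicit 0/1 dummy columns, dropping the intercept
dummies = model.matrix(~ x_cut - 1)
colnames(dummies) = sprintf("x_%d_%d", 0:9 * 10, 1:10 * 10)
head(dummies)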
This won't fit well into a comment, but for the record, the Stata code can be simplified down to
forval p = 0/9 {
    gen x_`p' = x > `p'/10 & x <= (`p' + 1)/10
}
Note that -- contrary to the OP's claim -- values of x that are exactly zero will be mapped to zero for all of these variables, both in their code and in mine (which is intended as a simplification of their code, not a correction of it, modulo a difference of taste in variable names). That follows from the fact that 0 is not greater than 0. Similarly, values that are exactly 0.1, 0.2, 0.3, ... will in principle go in the lower bin, not the higher bin, but that is complicated by the fact that most multiples of 0.1 have no exact binary representation (0.5 is a clear exception).
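A quick R illustration of that representation point (a minimal demo of my own, not part of the original comment):
# 0.3 has no exact binary representation, so arithmetic drifts;
# 0.5 is exactly representable, so it compares cleanly
0.1 + 0.2 == 0.3        # FALSE
0.5 == 5/10             # TRUE
sprintf("%.20f", 0.3)   # shows the stored binary approximation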
Indeed, depending on details of their set-up that the OP doesn't tell us, indicator variables (dummy variables, in their terminology) may well be obtainable in Stata without a loop, or made quite unnecessary by factor-variable notation. In that respect Stata is closer to R than may at first appear.
While this does not answer the question directly, the signal to Stata and R users alike is that Stata need not be as awkward as the code in the question might suggest.
For Y = % of the population with income below the poverty level and X = per capita income of the population, I have constructed a Box-Cox plot and found that lambda = 0.02020:
bc <- boxcox(lm(Percent_below_poverty_level ~ Per_capita_income, data=tidy.CDI), plotit=T)
bc$x[which.max(bc$y)] # gives lambda
Now I want to fit a simple linear regression using the transformed data, so I've entered this code
transform <- lm((Percent_below_poverty_level**0.02020) ~ (Per_capita_income**0.02020))
transform
But all I get is the error message
'Error in terms.formula(formula, data = data) : invalid power in formula'. What is my mistake?
You could use bcPower() from the car package.
## make sure you do install.packages("car") if you haven't already
library(car)
data(Prestige)
p <- powerTransform(prestige ~ income + education + type,
                    data = Prestige,
                    family = "bcPower")
summary(p)
# bcPower Transformation to Normality
# Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
# Y1 1.3052 1 0.9408 1.6696
#
# Likelihood ratio test that transformation parameter is equal to 0
# (log transformation)
# LRT df pval
# LR test, lambda = (0) 41.67724 1 1.0765e-10
#
# Likelihood ratio test that no transformation is needed
# LRT df pval
# LR test, lambda = (1) 2.623915 1 0.10526
mod <- lm(bcPower(prestige, 1.3052) ~ income + education + type, data=Prestige)
summary(mod)
#
# Call:
# lm(formula = bcPower(prestige, 1.3052) ~ income + education +
# type, data = Prestige)
#
# Residuals:
# Min 1Q Median 3Q Max
# -44.843 -13.102 0.287 15.073 62.889
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.736e+01 1.639e+01 -2.279 0.0250 *
# income 3.363e-03 6.928e-04 4.854 4.87e-06 ***
# education 1.205e+01 2.009e+00 5.999 3.78e-08 ***
# typeprof 2.027e+01 1.213e+01 1.672 0.0979 .
# typewc -1.078e+01 7.884e+00 -1.368 0.1746
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.25 on 93 degrees of freedom
# (4 observations deleted due to missingness)
# Multiple R-squared: 0.8492, Adjusted R-squared: 0.8427
# F-statistic: 131 on 4 and 93 DF, p-value: < 2.2e-16
Powers (more often written ^ than ** in R, FWIW) have a special meaning inside formulas: they represent interactions among variables rather than mathematical operations. So if you did want to power-transform both sides of your equation, you would use the I() or "as-is" operator:
I(Percent_below_poverty_level^0.02020) ~ I(Per_capita_income^0.02020)
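A minimal demonstration of that formula behavior (throwaway data, just to show the expansion):
# In a formula, x^2 means 'x crossed with itself', which collapses to x;
# I(x^2) requests the literal square
x <- 1:5; y <- rnorm(5)
colnames(model.matrix(y ~ x^2))     # "(Intercept)" "x"
colnames(model.matrix(y ~ I(x^2)))  # "(Intercept)" "I(x^2)"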
However, I think you should do what @DaveArmstrong suggested anyway (a sketch follows below):
it's only the response variable that gets transformed, and
the Box-Cox transformation is actually (y^lambda - 1)/lambda (although the shift and scale might not matter for your results).
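Putting that together, a hedged sketch with the question's own variable names (pov_bc is a name I made up; this assumes the columns exist in tidy.CDI):
lambda <- 0.02020
# Full Box-Cox transform of the response only; the predictor stays untransformed
tidy.CDI$pov_bc <- (tidy.CDI$Percent_below_poverty_level^lambda - 1) / lambda
summary(lm(pov_bc ~ Per_capita_income, data = tidy.CDI))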
R uses significance codes to flag statistical significance. In the sample output below, for example, a dot (.) indicates significance at the 10% level.
Dots can be very hard to see, especially when I copy-paste to Excel and display it in Times New Roman.
I'd like to change it such that:
* = significant at 10%
** = significant at 5%
*** = significant at 1%
Is there a way I can do this?
> y = c(1,2,3,4,5,6,7,8)
> x = c(1,3,2,4,5,6,8,7)
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.0714 -0.3333 0.0000 0.2738 1.1191
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2143 0.6286 0.341 0.74480
x 0.9524 0.1245 7.651 0.00026 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8067 on 6 degrees of freedom
Multiple R-squared: 0.907, Adjusted R-squared: 0.8915
F-statistic: 58.54 on 1 and 6 DF, p-value: 0.0002604
You can create your own formatting function with
mystarformat <- function(x) symnum(x, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", " "))
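A quick sanity check, with p-values chosen to straddle each cutpoint:
# expect one symbol per value: "***", "**", "*", then blank
mystarformat(c(0.005, 0.03, 0.07, 0.5))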
And you can write your own coefficient formatter
show_coef <- function(mm) {
    mycoef <- data.frame(coef(summary(mm)), check.names = FALSE)
    mycoef$signif <- mystarformat(mycoef$`Pr(>|t|)`)
    mycoef$`Pr(>|t|)` <- format.pval(mycoef$`Pr(>|t|)`)
    mycoef
}
And then with your model, you can run it with
mm <- lm(y~x)
show_coef(mm)
# Estimate Std. Error t value Pr(>|t|) signif
# (Intercept) 0.2142857 0.6285895 0.3408993 0.7447995
# x 0.9523810 0.1244793 7.6509206 0.0002604 ***
One should be aware that the stargazer package reports significance levels on a different scale than other statistical software such as Stata.
In R (stargazer) you get * p<0.1, ** p<0.05, *** p<0.01, whereas in Stata you get * p<0.05, ** p<0.01, *** p<0.001.
This means that what is significant with one * in R output may appear not to be significant to a Stata user.
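If you want stargazer itself to report on Stata's scale, it exposes the cutoffs directly (star.cutoffs is a real stargazer argument; your_regression is a placeholder for your fitted model):
library(stargazer)
# Match Stata's convention: * p<0.05, ** p<0.01, *** p<0.001
stargazer(your_regression, type = "text",
          star.cutoffs = c(0.05, 0.01, 0.001))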
Sorry for the late response, but I found a great solution to this.
Just do the following:
install.packages("stargazer")
library(stargazer)
stargazer(your_regression, type = "text")
This displays everything in a beautiful way with your desired format.
Note: If you leave type = "text" out, then you'll get the LaTeX code.
I have some data that Excel will fit pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficients and intercept. What form should I have the data in, and what function should I call to get the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the values of x, but working backwards it seems you were using 1, 2, 3, ..., so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept) log(x)
0.7521 -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
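And since the stated goal is projecting into the future, predict() handles that once the model is fitted (a sketch; the 30-day horizon is my own choice):
# Project retention beyond the 11 observed days
future <- data.frame(x = 12:30)
predict(fm, newdata = future)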
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
       0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq_along(y)  # 1, 2, ..., 11
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))
Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16
For larger projects I recommend encapsulating the data in a data frame, like
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)
Hi, I'm new to R and would like to ask a more general question: how do I simulate or create an example data set that is suitable for posting here and at the same time reproducible? I would like, for instance, to create a numeric example that abstracts my data set properly. One condition would be to introduce some correlation between my dependent and independent variables.
For instance, how do I introduce some correlation between my count and my in.var1 and in.var2?
set.seed(1122)
count<-rpois(1000,30)
in.var1<- rnorm(1000, mean = 25, sd = 3)
in.var1<- rnorm(1000, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
You can introduce dependence by adding in some portion of the "information" in the two variables to the construction of the count variable:
set.seed(1222)
in.var1<- rnorm(1000, mean = 25, sd = 3)
# Corrected the spelling of in.var2 (the question assigns in.var1 twice)
in.var2<- rnorm(1000, mean = 12, sd = 2)
count<-rpois(1000,30) + 0.15*in.var1 + 0.3*in.var2
# Avoid using `data` as an object name; it masks the base data() function
dat<-data.frame(count,in.var1,in.var2)
> library(Hmisc)   # spearman() comes from the Hmisc package
> spearman(count, in.var1)
rho
0.06859676
> spearman(count, in.var2)
rho
0.1276568
> spearman(in.var1, in.var2)
rho
-0.02175273
> summary( glm(count ~ in.var1 + in.var2, data=dat) )
Call:
glm(formula = count ~ in.var1 + in.var2, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-16.6816 -3.6910 -0.4238 3.4435 15.5326
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.05034 1.74084 16.688 < 2e-16 ***
in.var1 0.14701 0.05613 2.619 0.00895 **
in.var2 0.35512 0.08228 4.316 1.74e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If you want count to be a function of in.var1 and in.var2, try this. Note that count is already used as a function name in some common packages, so I am changing it to Count.
set.seed(1122)
in.var1<- rnorm(1000, mean = 4, sd = 3)
in.var2<- rnorm(1000, mean = 6, sd = 2)
Count<-rpois(1000, exp(3+ 0.5*in.var1 - 0.25*in.var2))
Data<-data.frame(Count=Count, Var1=in.var1, Var2=in.var2)
You now have a Poisson count based on in.var1 and in.var2. A Poisson regression will show an intercept of 3 and coefficients of 0.5 for Var1 and -0.25 for Var2:
summary(glm(Count~Var1+Var2,data=Data, family=poisson))
Call:
glm(formula = Count ~ Var1 + Var2, family = poisson, data = Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.84702 -0.76292 -0.04463 0.67525 2.79537
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.001390 0.011782 254.7 <2e-16 ***
Var1 0.499789 0.001004 498.0 <2e-16 ***
Var2 -0.250949 0.001443 -173.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 308190.7 on 999 degrees of freedom
Residual deviance: 1063.3 on 997 degrees of freedom
AIC: 6319.2
Number of Fisher Scoring iterations: 4
As I understand it, you want to add some pattern to your data.
# Basic info taken from Data Science Exploratory Analysis Course
# http://datasciencespecialization.github.io/courses/04_ExploratoryAnalysis/
set.seed(1122)
rowNumber = 1000
count<-rpois(rowNumber,30)
in.var1<- rnorm(rowNumber, mean = 25, sd = 3)
in.var2<- rnorm(rowNumber, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
dataNew <- data
for (i in 1:rowNumber) {
    # flip a coin
    coinFlip <- rbinom(1, size = 1, prob = 0.5)
    # if the coin comes up heads, add a common pattern to that row
    if (coinFlip) {
        dataNew[i, "count"] <- 2 * data[i, "in.var1"] + 10 * data[i, "in.var2"]
    }
}
Basically, I am adding the pattern count = 2*in.var1 + 10*in.var2 to randomly chosen rows, selected here via the coinFlip variable. Of course you should vectorize it for more rows; see the sketch below.
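A vectorized version of the same idea might look like this (a sketch, using the same pattern and probability as above):
# Draw all the coin flips at once instead of looping
coinFlips <- rbinom(rowNumber, size = 1, prob = 0.5) == 1
dataNew2 <- data
dataNew2[coinFlips, "count"] <- 2 * data[coinFlips, "in.var1"] +
    10 * data[coinFlips, "in.var2"]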
I am trying to get an lm fit for my data. The problem I am having is that I want to fit a linear model (1st-order polynomial) when the factor is "true" and a 2nd-order polynomial when the factor is "false". How can I get that done using only one lm?
a=c(1,2,3,4,5,6,7,8,9,10)
b=factor(c("true","false","true","false","true","false","true","false","true","false"))
c=c(10,8,20,15,30,21,40,25,50,31)
DumbData<-data.frame(cbind(a,c))
DumbData<-cbind(DumbData,b=b)
I have tried
Lm2<-lm(c~a + b + b*I(a^2), data=DumbData)
summary(Lm2)
that results in:
summary(Lm2)
Call:
lm(formula = c ~ a + b + b * I(a^2), data = DumbData)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.74483 1.12047 -0.665 0.535640
a 4.44433 0.39619 11.218 9.83e-05 ***
btrue 6.78670 0.78299 8.668 0.000338 ***
I(a^2) -0.13457 0.03324 -4.049 0.009840 **
btrue:I(a^2) 0.18719 0.01620 11.558 8.51e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7537 on 5 degrees of freedom
Multiple R-squared: 0.9982, Adjusted R-squared: 0.9967
F-statistic: 688 on 4 and 5 DF, p-value: 4.896e-07
Here I have I(a^2) in both fits, whereas I want one fit to be first order and the other second order.
If one tries with:
Lm2<-lm(c~a + b + I(b*I(a^2)), data=DumbData)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(b, I(a^2)) : * not meaningful for factors
How can I get the proper interaction terms here?
Thanks Andrie, there are still some things I am missing here. In this example the variable b is a logical one; if it is a factor with two levels, this does not work, so I guess I have to convert the factor variable to a logical one. The other thing I am missing is the negation in the condition, I(!b*a^2); without the ! I get:
Call:
lm(formula = c ~ a + I(b * a^2), data = dat)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2692     1.8425   3.945 0.005565 **
a             2.3222     0.3258   7.128 0.000189 ***
I(b * a^2)    0.3005     0.0355   8.465 6.34e-05 ***
I cannot relate the formulas with and without the !, which is a bit strange to me.
Try something along the following lines:
dat <- data.frame(
a=c(1,2,3,4,5,6,7,8,9,10),
b=c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE),
c=c(10,8,20,15,30,21,40,25,50,31)
)
fit <- lm(c ~ a + I(!b * a^2), dat)
summary(fit)
This results in:
Call:
lm(formula = c ~ a + I(!b * a^2), data = dat)
Residuals:
Min 1Q Median 3Q Max
-4.60 -2.65 0.50 2.65 4.40
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.5000 2.6950 3.896 0.005928 **
a 3.9000 0.4209 9.266 3.53e-05 ***
I(!b * a^2)TRUE -13.9000 2.4178 -5.749 0.000699 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.764 on 7 degrees of freedom
Multiple R-squared: 0.9367, Adjusted R-squared: 0.9186
F-statistic: 51.75 on 2 and 7 DF, p-value: 6.398e-05
Note:
I made use of the logical values TRUE and FALSE.
These coerce to 1 and 0, respectively, in arithmetic.
I used the negation !b inside the formula (see the precedence check below).
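One subtlety worth spelling out, because it explains the confusion above: unary ! binds more loosely than * and ^ in R, so !b * a^2 parses as !(b * a^2), a single logical value, which is why the output shows the term as I(!b * a^2)TRUE. With explicit parentheses, (!b) * a^2 instead multiplies a 0/1 indicator by a^2. A tiny check (throwaway scalars):
# Precedence check: ! is lower precedence than * and ^
bb <- TRUE; aa <- 3
!bb * aa^2      # FALSE, parsed as !(bb * aa^2)
(!bb) * aa^2    # 0, the 0/1 indicator times aa^2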
Ummm ...
Lm2<-lm(c~a + b + b*I(a^2), data=DumbData)
You say that "I want to fit a linear model (1st-order polynomial) when the factor is "true" and a 2nd-order polynomial when the factor is "false". How can I get that done using only one lm?"
From that I infer that you don't want b to be directly in the model? In addition, a^2 should be included only if b is false.
So that would be...
lm(c ~ a + I((!b) * a^2))
If b is true (that is, !b equals FALSE), then a^2 is multiplied by zero and drops out of the equation.
The only problem is that you have defined b as a factor instead of a logical. That can be cured.
# b = factor(c("true","false","true","false","true","false","true","false","true","false"))
# could use TRUE and FALSE instead of "true" and "false"
# alternatively, after defining b as above, do
# b <- b == "true"   # converts b to logical (i.e. boolean TRUE and FALSE values)
OK, to be exact, you defined b as a factor (you wrapped the character strings in factor()), so it was already a factor before being added to the data frame ("DumbData").
Another minor point about the way you defined the data frame.
a=c(1,2,3,4,5,6,7,8,9,10)
b=factor(c("true","false","true","false","true","false","true","false","true","false"))
c=c(10,8,20,15,30,21,40,25,50,31)
DumbData<-data.frame(cbind(a,c))
DumbData<-cbind(DumbData,b=b)
Here, cbind is unnecessary. You could have it all on one line:
Dumbdata<- data.frame(a,b,c)
# shorter and cleaner!!
In addition, to convert b to logical use:
Dumbdata<- data.frame(a,b=b=="true",c)
Note: you need to write b = b=="true". It seems redundant, but the LHS (b) gives the name of the variable in the data frame, whereas the RHS (b=="true") is an expression that evaluates to a logical (boolean) value.
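Putting all of the pieces together, a minimal end-to-end version (a sketch built from the question's data):
# Logical b, a single data.frame() call, and the intended model:
# the quadratic term enters only where b is FALSE
a <- c(1,2,3,4,5,6,7,8,9,10)
b <- c("true","false","true","false","true","false","true","false","true","false")
c <- c(10,8,20,15,30,21,40,25,50,31)
Dumbdata <- data.frame(a, b = b == "true", c)
summary(lm(c ~ a + I((!b) * a^2), data = Dumbdata))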