I'm trying to understand how to use Formula objects. Let's say I want to make my own 2SLS function and divide the objects I'm working with into four main groups: y = response; X = exogenous variables; E = endogenous variables; Z = instruments.
I want to be able to construct these objects without making extra copies of the data unnecessarily (say, large N and large number of instruments would make this prohibitively costly in memory usage/time). I also want to take into account NAs from across the data.
Let's use a formula syntax similar to felm (I tried looking at the parsing code there, but couldn't follow it).
frml = y ~ x1 + x2 + x3*x4 | (e1 | e2 ~ z1 + z2)
library(Formula)
N = 12 # must be divisible by 6
data = data.frame(y=rnorm(N), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N),
x4=factor(rep(1:2, N/2)), e1=rnorm(N), e2=rnorm(N),
z1=rnorm(N), z2=factor(rep(1:3, N/3)))
data[2,'y'] = data[3,'x1'] = data[4,'e1'] = data[5,'z2'] = NA
parse_frml = function(frml, data, subset=NULL) {
  frml = as.Formula(frml)
  # does not take into account NAs at all
  y = model.part(frml, data=data, subset=subset, lhs=1)
  # does not take into account NAs in other variables (y, Z, E)
  X = model.matrix(frml, data=data, subset=subset, lhs=0, rhs=1)
  Z = model.matrix(frml, data=data, subset=subset, lhs=0, rhs=2)
  # E =   # I can't figure this out at all
  return(list(y=y, X=X, E=E, Z=Z))
}
Now, I can do something like
mf = model.frame(frml, data=data, subset=subset, lhs=1, rhs=1)
which will take into account NAs in y and X, but ignores E and Z. Further, this copies the data into mf, and then copies it again into y and X.
So, I have 2 questions and 1 constraint
How do I get E? (a matrix for the LHS of the 2nd equation)
How do I take into account NAs from across the data used by frml in all matrices?
While minimizing the number of copies of the data (ideally just copied into the matrices)
More generally, what's a good resource for understanding Formula, formula, terms, and the like? I've not found, e.g., the Formula package's documentation to be super helpful.
This isn't perfect, but it works. It's a shame that there is almost no information on how to actually handle and manipulate formulas in R code. My solution depends on formula.tools.
library(formula.tools)
parse_frml = function(frml, data, subset=NULL) {
  frml = as.Formula(frml)
  vars = all.vars(frml)
  other_vars = c(all.vars(formula(frml, lhs=1, rhs=1)),
                 rhs.vars(formula(frml, lhs=0, rhs=2)))
  e_vars = setdiff(vars, other_vars)
  valid = which(complete.cases(data[, vars]))
  if (!is.null(subset)) {
    if (is.logical(subset)) {
      subset = which(subset)
    }
    valid = intersect(valid, subset)
  }
  y = model.part(frml, data=data[valid,], lhs=1)
  X = model.matrix(frml, data=data[valid,], lhs=0, rhs=1)
  Z = model.matrix(frml, data=data[valid,], lhs=0, rhs=2)
  E = data.matrix(data[valid, e_vars])
  return(list(y=y, X=X, E=E, Z=Z))
}
I suspect that subsetting data with valid each time is rather expensive. But in the above test case, it seems to work.
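For what it's worth, the NA handling above boils down to one complete.cases() call over just the columns the formula mentions, and every matrix is then built from that single row subset. A minimal sketch with the mock data from the question:
vars  <- all.vars(as.Formula(frml))
valid <- which(complete.cases(data[, vars]))
length(valid)  # 8 of the 12 rows survive, since rows 2-5 each had one NA injected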
The basic gist is that I have a set of housing data and need to build a model that minimizes the difference between predicted and actual house prices. So I created this bit of code to test a range of different values for the numerators (the coef_1 entries) and find the ones that minimize that difference. I'm using the median instead of the mean, as the data isn't exactly normal.
Since I only have experience with lm(), I'm using that to get the coefficients and constants. But since the model likes exponents, I also have to test various exponents: the loop does this for each variable in turn, then goes back to the first and re-evaluates it given the other exponents. The search starts with all exponents equal to 1, i.e., the same as the basic linear model. I know this is probably horribly inefficient and somewhat wasteful in the amount of code it uses, but I'm in my first R class, so sorry about the mess and/or convoluted logic.
Is there any way to do the same thing more efficiently? Also, I can't really drop variables: the model likes having more of them and produces a greater margin of error when they aren't present.
library(dplyr)   # for %>% and mutate
library(modelr)  # for add_predictions

w <- seq(1,10000,1)
r <- seq(1,10000,1)
t <- seq(1,10000,1)
z <- seq(1,10000,1)
s <- seq(1,10000,1)
coef_1 <- c(6000,6000,6000,6000,6000,6000,6000,6000)
v <- rep(6000, each = 8)
for(l_1 in 1:10){
  for(t_1 in 1:8){
    for(i in 1:10000){
      t = t_1   # note: this reuses t (one of the results vectors above) as an alias for the loop index
      coef_1[t] = i
      mod5 <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^((coef_1[1]-5000)/1000)) +
                   I(LND_SQFOOT^((coef_1[2]-5000)/1000)) +
                   I(RAIL_DIST^((coef_1[3]-5000)/1000)) +
                   I(OCEAN_DIST^((coef_1[4]-5000)/1000)) +
                   I(CNTR_DIST^((coef_1[5]-5000)/1000)) +
                   I(HWY_DIST^((coef_1[6]-5000)/1000)) +
                   I(structure_quality^((coef_1[7]-5000)/1000)) +
                   SUBCNTR_DI + SPEC_FEAT_VAL +
                   (exp(((coef_1[8]-5000)/1000)*SPECIAL_RATIO)) + age,
                 data = kaggle_transform_final)
      kaggle_new <- kaggle_transform_final %>%
        add_predictions(model = mod5, var = "prediction") %>%
        mutate(new_predict = exp(prediction)) %>%
        mutate(new_difference = abs((new_predict-SALE_PRC))/SALE_PRC) %>%
        mutate(average_percent_difference = median(new_difference)) %>%
        mutate(mean_percent_difference = mean(new_difference)) %>%
        mutate(quart_75 = quantile(new_difference, .75))
      w[i] = kaggle_new$average_percent_difference[1]
      r[i] = kaggle_new$mean_percent_difference[1]
      t[i] = kaggle_new$quart_75[1]
      z[i] = i
      s[i] = (i-5000)/1000
      if(i %% 100 == 0){show(i)}
    }
    u <- data.frame(median_diff = w, mean_diff = r, quart_75 = t, actual = s, number = z) %>%
      arrange(median_diff)
    coef_1[t_1] <- u$number[1]
    v[t_1] <- u$actual[1]
    show(coef_1)
  }
  coef_1 <- coef_1
}
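For reference, here is an untested sketch of the same coordinate-wise search with the model fit and scoring pulled into a helper. It assumes kaggle_transform_final and its columns from above; score_exponents, exps, and grid are made-up names for illustration:
# Score one exponent vector by the median absolute percent error of the back-transformed predictions
score_exponents <- function(exps, data = kaggle_transform_final) {
  mod <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^exps[1]) + I(LND_SQFOOT^exps[2]) +
              I(RAIL_DIST^exps[3]) + I(OCEAN_DIST^exps[4]) + I(CNTR_DIST^exps[5]) +
              I(HWY_DIST^exps[6]) + I(structure_quality^exps[7]) +
              SUBCNTR_DI + SPEC_FEAT_VAL + exp(exps[8]*SPECIAL_RATIO) + age,
            data = data)
  pred <- exp(predict(mod, data))
  median(abs(pred - data$SALE_PRC)/data$SALE_PRC)
}
exps <- rep(1, 8)              # start from the plain linear model
grid <- seq(-5, 5, by = 0.1)   # much coarser than the original 10,000-step grid
for (pass in 1:10) {
  for (j in 1:8) {
    scores  <- vapply(grid, function(g) { e <- exps; e[j] <- g; score_exponents(e) }, numeric(1))
    exps[j] <- grid[which.min(scores)]
  }
}
exps
With the scoring isolated like this, score_exponents could also be handed to a general-purpose optimizer such as optim() instead of a grid search.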
I have a fairly simple equation, in which I have direct measurements of the variables through time, and two different unknown parameters I need to solve for, but which I know can be considered constants over the time periods I'm studying.
Both of these "constants" have fairly narrow ranges of variability in nature. In principle, it seems like some kind of optimization procedure/function should be able to do this easily, by finding the pair of values that minimizes the standard deviation of each of the constant values across the time series.
However, I am new to optimization and parameter fitting. Any help figuring out how to use r code to find the pair (or pairs) of values in this situation would be greatly appreciated.
Below is a simplified form of the equation I'm dealing with:
A * x + B * z - B * d = c + e
A and B are the constants I need to solve for.
Possible real-world values of A are 0.4-0.8
Possible real-world values of B are 0.85-0.99
To create a reasonable mock data set, assuming perfect measurements of all variables, and known values of A and B:
library(dplyr)    # for %>% and mutate
library(tidyr)    # for gather
library(ggplot2)  # for plotting

### Generate mock data
### Variables all have a daily cycle and are strongly autocorrelated,
# and so can be approximated via a sin function,
# with unique noise added to each to simulate variability:
# Variability for each variable
n <- 1000 # number of data points
t <- seq(0,4*pi,length.out = 1000)
a <- 3
b <- 2
x.unif <- runif(n)
z.norm <- rnorm(n)
c.unif <- runif(n)
d.norm <- rnorm(n)
d.unif <- runif(n)
e.norm <- rnorm(n)
amp <- 1
# Create reasonable mock data for all variables except e,
# which I will calculate below from known fixed values of A and B.
x <- a*sin(b*t)+x.unif*amp + 10 # uniform error
z <- a*sin(b*t)+z.norm*amp + 10 # Gaussian/normal error
c <- ((a*sin(b*t)+c.unif*amp) + 10)/4
d <- ((a*sin(b*t)+d.norm*amp)+(a*sin(b*t)+d.unif*amp)+10)/2
# Put vectors in dataframe
dat <- data.frame("t" = t, "x" = x, "z" = z, "c" = c, "d" = d)
# Equation: A*x + B*z - B*d = c + e
# Solve for e:
# e = A*x + B*z - B*d - c
# Specify "true" values for A and B:
A = 0.6
B = 0.9
# Solve for e:
dat <- dat %>%
mutate(e = A*x + B*z - B*d - c)
# Gather data for easy visualizing of results for e:
dat_gathered <- dat %>%
gather(-t, value = "value", key = "key")
# Plot all variables
ggplot(dat_gathered, aes(x = t, y = value, color = key)) + geom_line()
# Add small error (to simulate measurement error) to all variables except A and B:
dat <- dat %>%
mutate(x_j = x + rnorm(x, sd=0.02)/(1/x)) %>%
mutate(z_j = z + rnorm(z, sd=0.02)/(1/z)) %>%
mutate(c_j = c + rnorm(c, sd=0.02)/(1/c)) %>%
mutate(d_j = d + rnorm(d, sd=0.02)/(1/d)) %>%
mutate(e_j = e + rnorm(e, sd=0.02)/(1/e))
The variables in dat with the _j suffix represent real world data (since they have measurement error added). Knowing the constraint that:
A is within 0.4-0.8
B is within 0.85-0.99
Is it possible to use the noisy "_j" data to optimize for the pair of constant values that minimize deviation of A and B across the entire time series?
A little bit of algebra and setting this up as a linear regression problem with no intercept seems to work fine:
m1 <- lm(e_j+c_j ~ 0 + x_j + I(z_j-d_j), data=dat)
coef(m1) ## A =0.6032, B = 0.8916
It doesn't do anything to constrain the solution, though.
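If the box constraints matter, one option is to minimize the sum of squared residuals directly with optim() using L-BFGS-B bounds. A sketch using the _j columns from above (obj and fit are just illustration names):
# Sum of squared residuals of A*x + B*(z - d) - c - e as a function of (A, B)
obj <- function(par) {
  A <- par[1]; B <- par[2]
  with(dat, sum((A*x_j + B*(z_j - d_j) - c_j - e_j)^2))
}
fit <- optim(c(A = 0.6, B = 0.92), obj,   # start at roughly the midpoints of the allowed ranges
             method = "L-BFGS-B",
             lower = c(0.4, 0.85), upper = c(0.8, 0.99))
fit$par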
I have a CRM data set used for an experiment, where the dummy W corresponds to the treatment/control group (see code below). When I tested for the independence of W from the other features, I realized two things:
When using model.matrix, some coefficients (1 in this dummy dataset) were not defined because of singularities. This did not happen when feeding the DT straight to lm().
The models obtained in the two cases produce different results, i.e., the p-values of the individual features change.
I (think that I) understand the concept of multicollinearity, but in this particular case I don't quite understand a) why it comes up, and b) why it affects model.matrix and lm differently.
What am I missing?
Thanks a lot!
library(data.table)

set.seed(1)
n = 302
DT = data.table(
  zipcode = factor(sample(seq(1,52), n, replace=TRUE)),
  gender = factor(sample(c("M","F"), n, replace=TRUE)),
  age = sample(seq(1,95), n, replace=TRUE),
  days_since_last_purchase = sample(seq(1,259), n, replace=TRUE),
  W = sample(c(0,1), n, replace=TRUE)
)
summary(DT)
m = model.matrix(W ~ . + 0, DT)
f1 = lm(DT$W ~ m)
f2 = lm(W ~ ., DT)
p_value_ratio <- function(lm)
{
  summary_randomization = summary(lm)
  p_values_randomization = summary_randomization$coefficients[, 4]
  L = length(p_values_randomization)
  return(sum(p_values_randomization <= 0.05)/(L-1))
}
all.equal(p_value_ratio(f1), p_value_ratio(f2))
alias(f1)
alias(f2)
Your problem is the + 0 in model.matrix. The second fit includes the intercept in the model matrix. If you exclude it, fewer factor levels (which are normally represented by the intercept) get excluded:
colnames(model.matrix(W ~ ., DT))
# excludes zipcode1 and genderF since these define the intercept
colnames(model.matrix(W ~ . + 0, DT))
# excludes only genderF
Note that f1 includes an intercept, which is added by lm (I believe by an internal call to model.matrix, but haven't checked):
m = model.matrix(W ~ . + 0, DT);
f1 = lm(DT$W ~ m );
model.matrix(f1)
You might want this:
m = model.matrix(W ~ ., DT);
f1 = lm(DT$W ~ m[,-1]);
(Usually you construct the model matrix manually only if you want to use lm.fit directly.)
f2= lm(W~ ., DT);
all.equal(unname(coef(f1)), unname(coef(f2)))
#[1] TRUE
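For completeness, the lm.fit route mentioned above would look roughly like this (a sketch; X and fit are just illustration names):
X <- model.matrix(W ~ ., DT)   # hand-built design matrix, including the intercept column
fit <- lm.fit(X, DT$W)
coef(fit)                      # same coefficients as lm(W ~ ., DT), without the formula bookkeeping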
In the end, this boils down to your understanding of treatment contrasts. Usually, you shouldn't exclude the intercept from the model matrix.
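A tiny illustration of what treatment contrasts do with and without the intercept (a sketch with a made-up factor g):
g <- factor(c("F", "M", "F"))
model.matrix(~ g)      # "(Intercept)" and "gM": level "F" is absorbed into the intercept as the reference
model.matrix(~ g + 0)  # "gF" and "gM": every level gets its own column, no intercept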
I have a function that takes a data.frame and outputs a residualized version of it, with some chosen variable as the predictor.
residuals.DF = function(data, resid.var, suffix="") {
  lm_f = function(x) {
    x = residuals(lm(data=data, formula= x ~ eval(parse(text=resid.var))))
  }
  resid = data.frame(apply(data, 2, lm_f))
  colnames(resid) = paste0(colnames(data), suffix)
  return(resid)
}
set.seed(31233)
df = data.frame(Age = c(1,3,6,7,3,8,4,3,2,6),
Var1 = c(19,45,76,34,83,34,85,34,27,32),
Var2 = round(rnorm(10)*100))
df.res = residuals.DF(df, "Age", ".test")
df.res
Age.test Var1.test Var2.test
1 -1.696753e-17 -25.1351351 -90.20582
2 -1.318443e-19 -0.8108108 31.91892
3 -5.397735e-18 27.6756757 84.10603
4 -5.927747e-18 -15.1621622 -105.83160
5 -3.807699e-18 37.1891892 -57.08108
6 -6.457759e-18 -16.0000000 -25.76923
7 5.117344e-17 38.3513514 -65.01871
8 -3.807699e-18 -11.8108108 35.91892
9 -3.277687e-18 -17.9729730 97.85655
10 -5.397735e-18 -16.3243243 94.10603
This works fine; however, I often need to use the eval/parse combo when working with variable inputs to lm(), so I decided to write a wrapper function:
# Wrapper function for convenience when evaluating strings
evalparse = function(string) {
  eval(parse(text=string))
}
This works fine when used alone, e.g.:
> evalparse("5+5")
[1] 10
However, if one uses it in the above function, one gets:
> df.res = residuals.DF(df, "Age", ".test")
Error in eval(expr, envir, enclos) : object 'Age' not found
I figure this is because the wrapper function means the string gets evaluated in its own environment, where the chosen variable is missing. This does not happen when using the eval/parse combo directly, because then the evaluation happens in the lm() environment, where the chosen variable is not missing.
Is there some clever solution to this problem? A better way of using dynamic formulas in lm()? Otherwise I will have to keep typing eval(parse(text=object)).
Anytime you're trying to perform operations that modify the contents of a formula, you should use update because it is designed for this purpose.
In your case, you want to modify your function as follows:
residuals.DF = function(data, resid.var, suffix="") {
  lm_f = function(x) {
    x = residuals(lm(data=data, formula= update(x ~ 0, paste0("~", resid.var))))
  }
  resid = data.frame(apply(data, 2, lm_f))
  colnames(resid) = paste0(colnames(data), suffix)
  return(resid)
}
Basically, update (or the update.formula method specifically) takes a formula as its first argument, and then allows for modifications based on its second argument. To get a handle on it, check out the following examples:
f <- y ~ x
f
# y ~ x
update(f, ~ z)
# y ~ z
update(f, x ~ y)
# x ~ y
update(f, "~ x + y")
# y ~ x + y
update(f, ~ . + z + w)
# y ~ x + z + w
x <- "x"
update(f, paste0("~",x))
# y ~ x
As you can see, the second argument can be a formula or character string containing one or more variables. This greatly simplifies the creation of a dynamically modified formula where you are only trying to change one part of the formula.
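As a side note, base R's reformulate() also builds formulas from character strings and can sometimes replace the eval/parse pattern entirely; a small sketch using df from the question:
resid.var <- "Age"
f <- reformulate(resid.var, response = "Var1")
f
# Var1 ~ Age
lm(f, data = df)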
I have a large data set that has older and newer data. I created two data frames, EarlyYears with the older data and LaterYears with the new data, so they have the same columns.
What I want to do is regress the data from EarlyYears to determine an equation and apply it to LaterYears to test the equation's strength. A and B are constants, Input is what I am testing (I change it for different runs of the code), and Dummy is 1 if there is no data for the input. However, I want to split both the EarlyYears and LaterYears data into quintiles of one of the variables, and apply the equation fitted in quintile 1 of EarlyYears to the LaterYears data in quintile 1, and so on. I am fairly new to R, and so far have:
Model <- data.frame(Date = rep(c("3/31/09","3/31/11"), each = 20),
                    InputRating = rep(c(1:5), 8), Dummy = rep(c(rep(0,9),1), 4),
                    Y = rep(c(1,3,5,7,11,13,17,19), 5), A = 1:40, B = 1:40*3+7)
newer<-as.numeric(grep("/11",Model$Date))
later<-as.numeric(grep("/11",Model$Date,invert = TRUE))
LaterYears<-Model[newer,]
EarlyYears<-Model[later,]
newModel<-EarlyYears
DataSet.Input<-data.frame(Date = newModel$Date, InputRating = newModel$InputRating,
Dummy = newModel$Dummy, Y = newModel$Y, A = newModel$A,B = newModel$B)
quintiles<-quantile(DataSet.Input$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
VarQuint<-findInterval(DataSet.Input$A,quintiles,rightmost.closed=TRUE)+1L
regressionData <- do.call(rbind, lapply(split(DataSet.Input, VarQuint),
  FUN = function(SplitData) {
    SplitRegression <- lm(Y ~ A + B + InputRating + Dummy, data = SplitData, na.action = na.omit)
    c(coef.Intercept = coef(summary(SplitRegression))[1],
      coef.A = coef(summary(SplitRegression))[2],
      coef.B = coef(summary(SplitRegression))[3],
      coef.Input = coef(summary(SplitRegression))[4],
      coef.Dummy = coef(summary(SplitRegression))[5])
  }))
i = 0
quintiles.LY<-quantile(LaterYears$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
Quint.LY<-findInterval(LaterYears$A,quintiles,rightmost.closed=TRUE)+1L
LaterYears$ExpectedValue <- apply(split(LaterYears, Quint.LY), 1,
  FUN = function(SplitData) {
    i = i + 1
    regressionData[i,1] + regressionData[i,2]*SplitData$A +
      regressionData[i,3]*SplitData$B + regressionData[i,4]*SplitData$Input +
      regressionData[i,5]*SplitData$Dummy
  })
The first part works great to get the data into regressionData. I want the results of applying the equation to be held in a column within the LaterYears dataset, but I get an error -
Error in apply(split(LaterYears, Quint.LY), 1, FUN = function(SplitData) { :
dim(X) must have a positive length
when running this with apply, and a blank result when running with lapply, which is what I originally tried.
Any help with how to fix this would be greatly appreciated!
Thanks!
Perhaps something like this, using predict, would be better. It doesn't work very well for your example data, but it may work on the real data.
# by() splits a dataset by a factor and applies a function to each piece
regressionData <- by(DataSet.Input, VarQuint,
  function(d) {
    lm1 <- lm(Y ~ A + B + InputRating + Dummy, d)
  })
quintiles.LY <- quantile(LaterYears$A, probs = seq(0, 1, 0.2))
Quint.LY <- findInterval(LaterYears$A, quintiles, rightmost.closed = TRUE) + 1L
LaterYearsPredict <- split(LaterYears, Quint.LY)
# lapply's first argument can be anything that is a sequence
LaterYears$ExpectedValue <- unlist(lapply(1:length(LaterYearsPredict),
  function(x)
    predict(regressionData[[x]], LaterYearsPredict[[x]])
))
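Note that split() groups the rows by quintile, so the unlist()ed predictions come back in quintile order rather than in LaterYears' original row order; unless the rows happen to be sorted by quintile already, something like unsplit() puts them back in place (a sketch; preds is just an illustration name):
preds <- lapply(seq_along(LaterYearsPredict),
                function(x) predict(regressionData[[x]], LaterYearsPredict[[x]]))
LaterYears$ExpectedValue <- unsplit(preds, Quint.LY)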