R - Splitting Data, regression and applying equation to new split data set

I have a large data set that contains both older and newer data. I created two data frames with the same columns: EarlyYears with the older data and LaterYears with the newer data.
What I want to do is regress the data from EarlyYears to determine an equation, then apply it to LaterYears to test the equation's strength. A and B are constants, Input is what I am testing (I change it for different runs of the code), and Dummy is 1 if there is no data for the input. However, I want to split both the EarlyYears and LaterYears data by quintiles of one of the variables, and apply the equation found in quintile 1 of EarlyYears to the data from LaterYears that is in quintile 1. I am fairly new to R, and so far have:
Model<-data.frame(Date = rep(c("3/31/09","3/31/11"),each = 20),
InputRating = rep(c(1:5), 8), Dummy = rep(c(rep(0,9),1),4),
Y = rep(c(1,3,5,7,11,13,17,19), 5), A = 1:40,B = 1:40*3+7)
# rows dated 2011 are the newer data; everything else is the older data
newer<-grep("/11",Model$Date)
older<-grep("/11",Model$Date,invert = TRUE)
LaterYears<-Model[newer,]
EarlyYears<-Model[older,]
newModel<-EarlyYears
DataSet.Input<-data.frame(Date = newModel$Date, InputRating = newModel$InputRating,
Dummy = newModel$Dummy, Y = newModel$Y, A = newModel$A,B = newModel$B)
quintiles<-quantile(DataSet.Input$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
VarQuint<-findInterval(DataSet.Input$A,quintiles,rightmost.closed=TRUE)+1L
regressionData<-do.call(rbind,lapply(split(DataSet.Input,VarQuint),
FUN = function(SplitData) {
SplitRegression<-lm(Y ~ A + B + InputRating + Dummy, data = SplitData, na.action = na.omit)
c(coef.Intercept = coef(summary(SplitRegression))[1],
coef.A = coef(summary(SplitRegression))[2],
coef.B = coef(summary(SplitRegression))[3],
coef.Input = coef(summary(SplitRegression))[4],
coef.Dummy= coef(summary(SplitRegression))[5])
}))
i = 0
quintiles.LY<-quantile(LaterYears$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
Quint.LY<-findInterval(LaterYears$A,quintiles.LY,rightmost.closed=TRUE)+1L
LaterYears$ExpectedValue <-apply(split(LaterYears,Quint.LY),1,
FUN = function(SplitData) {
i=i+1
regressionData[i,1]+regressionData[i,2]*SplitData$A +
regressionData[i,3]*SplitData$B + regressionData[i,4]*SplitData$Input +
regressionData[i,5]*SplitData$Dummy
})
The first part works great and gives me the coefficients in regressionData. I want the results of applying the equation to be held in a column within the LaterYears dataset, but I get an error:
Error in apply(split(LaterYears, Quint.LY), 1, FUN = function(SplitData) { :
dim(X) must have a positive length
when running this with apply, and a blank result when running it with lapply, which is what I originally tried.
Any help with how to fix this would be greatly appreciated!
Thanks!

Perhaps something like this, using predict would be better. It doesn't work very well for your example data but it may work on the real data.
# by() splits a data frame by a factor and applies a function to each piece
regressionData <- by(DataSet.Input,VarQuint,
function(d) {
lm(Y ~ A + B + InputRating + Dummy, data = d)
})
# probs starts at 0 here, so findInterval() already returns 1..5 (no +1L needed)
quintiles.LY<-quantile(LaterYears$A,probs=seq(0,1,0.2))
Quint.LY<-findInterval(LaterYears$A,quintiles.LY,rightmost.closed=TRUE)
LaterYearsPredict <- split(LaterYears,Quint.LY)
# lapply's first argument can be anything that is a sequence;
# unsplit() puts the predictions back in the original row order
LaterYears$ExpectedValue <- unsplit(lapply(seq_along(LaterYearsPredict),
function(x)
predict(regressionData[[x]],LaterYearsPredict[[x]])
),Quint.LY)
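The fit-per-quintile, predict-per-quintile pattern can also be tried on data that is less degenerate than the example. The following is a minimal sketch under stated assumptions: the data frames early and later, the helper quintile_of, and the single predictor A are all made up for illustration and are not part of the question's data.

```r
set.seed(42)
# Hypothetical stand-in data: two periods, one predictor, one response
early <- data.frame(A = runif(100, 0, 10))
early$Y <- 2 + 3 * early$A + rnorm(100, sd = 0.5)
later <- data.frame(A = runif(60, 0, 10))
later$Y <- 2 + 3 * later$A + rnorm(60, sd = 0.5)

# Assign each row to a quintile (1..5) of x, computed within its own period
quintile_of <- function(x) {
  breaks <- quantile(x, probs = seq(0, 1, 0.2))
  findInterval(x, breaks, rightmost.closed = TRUE)
}
early$q <- quintile_of(early$A)
later$q <- quintile_of(later$A)

# One model per early-period quintile, stored in a list named "1".."5"
fits <- lapply(split(early, early$q), function(d) lm(Y ~ A, data = d))

# Predict each later-period quintile with the model from the matching quintile
pieces <- split(later, later$q)
preds <- Map(function(fit, d) predict(fit, newdata = d),
             fits[names(pieces)], pieces)

# unsplit() restores the original row order of `later`
later$ExpectedValue <- unsplit(preds, later$q)
```

Matching the models to the pieces by name (fits[names(pieces)]) rather than by position avoids silently pairing the wrong quintiles if one of them happens to be empty in the later period.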

Related

Is there a way to test a range of exponents in a lm() model in the same way as the code below more efficiently?

The basic gist is that I have a set of housing data for which I need to build a model that minimizes the difference between the predicted and actual price of each house. So I wrote this bit of code to test a range of different numerators and find the one that minimizes that difference. I'm using the median instead of the mean because the data isn't exactly normal.
Since I only have experience with lm(), I'm using that to create the coefficients and C values. But since the model likes exponents, I also have to test various exponents. The code does this for each of the variables in turn, then goes back to the first and re-evaluates it based on the other exponents. The model starts out with all the exponents equal to 1, so it is the same as the basic linear model. I know that this is probably horribly inefficient and uses a lot of code in a somewhat wasteful way, but I'm in my first R class, so sorry about the mess and the convoluted coding logic.
Is there any way to do this same thing more efficiently? Also, I can't really reduce the number of variables, as the model likes having more variables and produces a greater margin of error when they aren't present.
library(dplyr)  # %>%, mutate, arrange
library(modelr) # add_predictions
w <- seq(1,10000,1)
r <- seq(1,10000,1)
t <- seq(1,10000,1)
z <- seq(1,10000,1)
s <- seq(1,10000,1)
coef_1 <- c(6000,6000,6000,6000,6000,6000,6000,6000)
v <- rep(6000, each = 8)
for(l_1 in 1:10){
for(t_1 in 1:8){
for(i in 1:10000){
# index with t_1 directly; assigning to t here would overwrite the quart_75 vector t
coef_1[t_1] = i
mod5 <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^((coef_1[1]-5000)/1000)) + I(LND_SQFOOT^((coef_1[2]-5000)/1000)) + I(RAIL_DIST^((coef_1[3]-5000)/1000)) + I(OCEAN_DIST^((coef_1[4]-5000)/1000)) + I(CNTR_DIST^((coef_1[5]-5000)/1000)) + I(HWY_DIST^((coef_1[6]-5000)/1000)) + I(structure_quality^((coef_1[7]-5000)/1000)) + SUBCNTR_DI + SPEC_FEAT_VAL + (exp(((coef_1[8]-5000)/1000)*SPECIAL_RATIO)) + age, data = kaggle_transform_final)
kaggle_new <- kaggle_transform_final %>%
add_predictions(model = mod5, var = "prediction") %>%
mutate(new_predict = exp(prediction)) %>%
mutate(new_difference = abs((new_predict-SALE_PRC))/SALE_PRC) %>%
mutate(average_percent_difference = median(new_difference)) %>%
mutate(mean_percent_difference = mean(new_difference)) %>%
mutate(quart_75 = quantile(new_difference,.75))
w[i] = kaggle_new$average_percent_difference[1]
r[i] = kaggle_new$mean_percent_difference[1]
t[i] = kaggle_new$quart_75[1]
z[i] = i
s[i] = (i-5000)/1000
if(i%%100 ==0){show(i)}
}
u <- data.frame(median_diff = w, mean_diff = r, quart_75 = t, actual = s, number = z) %>%
arrange(median_diff)
coef_1[t_1] <- u$number[1]
v[t_1] <- u$actual[1]
show(coef_1)
}
coef_1 <- coef_1
}
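One possible direction, offered as a sketch rather than a tested replacement for the loop above: wrap the fit in a function that takes a vector of exponents and returns the median relative error, then hand that function to optim() so all exponents are searched together instead of one 10000-point grid at a time. The example uses the built-in mtcars data with two predictors as a stand-in for the housing data; the function name median_rel_error and the choice of columns are hypothetical.

```r
# Median relative error of a log-linear model for a given pair of exponents.
# mtcars stands in for the housing data; mpg plays the role of SALE_PRC.
median_rel_error <- function(p, data = mtcars) {
  fit <- lm(log(mpg) ~ I(wt^p[1]) + I(hp^p[2]), data = data)
  pred <- exp(fitted(fit))
  median(abs(pred - data$mpg) / data$mpg)
}

# Start from exponents of 1 (the plain linear model) and let the default
# Nelder-Mead method search both exponents at once.
start_exponents <- c(1, 1)
res <- optim(start_exponents, median_rel_error)

res$par    # exponents found by the search
res$value  # median relative error at those exponents
```

Nelder-Mead needs no derivatives, which suits a median-based objective, but it only finds a local optimum, so it may be worth restarting it from a few different starting exponents.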

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
First, I create the data:
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?

Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
1. It loops over all imputed datasets.
2. It uses complete to get one dataset.
3. It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
The problems
This is where both your problems derive from:
1. In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
2. In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
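The environment-versus-data distinction can be reproduced in base R without mice. This is only a sketch of the lookup mechanism, not the actual with.mids code; the data frames original and imputed are made-up stand-ins for df and one completed dataset.

```r
# Two small data frames standing in for the original data and one completed
# (imputed) data set; the values are invented so the slopes clearly differ
original <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))
imputed  <- data.frame(x = 1:5, y = c(1, 1, 1, 1, 1))

# With no data argument, the formula's variables are looked up in the data
# frame that eval() uses as the environment
fit_env <- eval(quote(lm(y ~ x)), envir = imputed, enclos = parent.frame())

# An explicit data argument takes precedence over that environment
fit_data <- eval(quote(lm(y ~ x, data = original)),
                 envir = imputed, enclos = parent.frame())

coef(fit_env)[["x"]]   # slope estimated from `imputed`
coef(fit_data)[["x"]]  # slope estimated from `original`
```

The second fit never touches imputed at all, which is the same thing that happens when data = df is passed inside with().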
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))

Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients: you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), in that it applies a function to the elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})

Generate missing value in dataset using `ampute` function from `mice` library in R

I originally posted this on CrossValidated but now realize this website is more appropriate for my question.
Following this link, I am trying to generate missing values to simulate a real-world situation
First I generated explanatory variables using the following code:
n = 50
x1 = rnorm(n,mean = 0,sd = 1)
x2 = rnorm(n,mean = 0,sd = 1)
x3 = rnorm(n,mean = 0,sd = 1)
x4 = rnorm(n,mean = 0,sd = 1)
Then I generate the response variable with the following code.
z = -1 + .5*x1 + .5*x2 + .5*x3 + .5*x4 + rnorm(1,0,0.1)
pr = 1/(1+exp(z)) # pass through an inv-logit function
y = rbinom(n,1,pr) # bernoulli response variable
data_mat <- as.data.frame(cbind(x1,x2,x3,x4,y))
I am trying to use the ampute function from the mice library to generate missing data based on the binary response variable.
The missing not at random case I would like to generate is as follows: when Y = 0, the independent variables are four times more likely to have missing data than the independent variables when Y = 0.
Can this be done using ampute function or is there an alternative way? Thank you.
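As one alternative to ampute, the missingness can be drawn by hand in base R. The sketch below rests on two assumptions that are not in the question: the intended contrast is Y = 0 versus Y = 1, and the base probability of a missing cell (p_base) is 0.1. The name data_mis is also made up.

```r
set.seed(123)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
z <- -1 + 0.5 * x1 + 0.5 * x2 + 0.5 * x3 + 0.5 * x4
y <- rbinom(n, 1, 1 / (1 + exp(-z)))
data_mat <- data.frame(x1, x2, x3, x4, y)

# Assumed probabilities: a cell is missing with probability p_base when
# y == 1 and with four times that probability when y == 0
p_base <- 0.1
p_miss <- ifelse(data_mat$y == 0, 4 * p_base, p_base)

# Blank out each explanatory variable independently; y itself stays complete
data_mis <- data_mat
for (v in c("x1", "x2", "x3", "x4")) {
  data_mis[[v]][runif(n) < p_miss] <- NA
}

# Share of missing cells among the explanatory variables, by value of y
tapply(rowMeans(is.na(data_mis[, 1:4])), data_mis$y, mean)
```

With only 50 rows the observed shares will be noisy, so the four-to-one ratio shows up clearly only with a larger n.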

Calculating and indexing mcmc chains in coda

There are two things I need to do. Firstly I would like to be able to create new variables in a coda mcmc object that have been calculated from existing variables so that I can run chain diagnostics on the new variable. Secondly I would like to be able to index single variables in some of the coda plot functions while still viewing all chains.
Toy data. Bayesian t-test on the sleep data using JAGS and rjags.
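library(rjags)   # jags.model, coda.samples (also loads coda)
library(ggplot2) # histogram further down
data(sleep)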
# read in data
y <- sleep$extra
x <- as.numeric(as.factor(sleep$group))
nTotal <- length(y)
nGroup <- length(unique(x))
mY <- mean(y)
sdY <- sd(y)
# make dataList
dataList <- list(y = y, x = x, nTotal = nTotal, nGroup = nGroup, mY = mY, sdY = sdY)
# model string
modelString <- "
model{
for (oIdx in 1:nTotal) {
y[oIdx] ~ dnorm(mu[x[oIdx]], 1/sigma[x[oIdx]]^2)
}
for (gIdx in 1:nGroup) {
mu[gIdx] ~ dnorm(mY, 1/sdY)
sigma[gIdx] ~ dunif(sdY/10, sdY*10)
}
}
"
writeLines(modelString, con = "tempModel.txt")
# chains
# 1. adapt
jagsModel <- jags.model(file = "tempModel.txt",
data = dataList,
n.chains = 3,
n.adapt = 1000)
# 2. burn-in
update(jagsModel, n.iter = 1000)
# 3. generate
codaSamples <- coda.samples(model = jagsModel,
variable.names = c("mu", "sigma"),
thin = 15,
n.iter = 10000*15/3)
Problem one
If I convert the coda object to a dataframe I can calculate the difference between the estimates for the two groups and plot this new variable, like so...
df <- as.data.frame(as.matrix(codaSamples))
names(df) <- gsub("\\[|\\]", "", names(df), perl = T) # remove brackets
df$diff <- df$mu1 - df$mu2
ggplot(df, aes(x = diff)) +
geom_histogram(bins = 100, fill = "skyblue") +
geom_vline(xintercept = mean(df$diff), colour = "red", size = 1, linetype = "dashed")
...but how do I get a traceplot? I can get one for existing variables within the coda object like so...
traceplot(codaSamples[[1]][,1])
...but I would like to be able to get them for the new diff variable.
Problem Two
Which brings me to the second problem. I would like to be able to get a traceplot (among other things) for individual variables. As I have shown above I can get them for a single variable if I only want to see one chain but I'd like to see all chains. I can see all chains for all variables in the model with the simple
plot(codaSamples)
...but what if I don't want or need to see all variables? What if I just want to see the trace and/or density plots for one, or even two, variables (but not all variables) but with all chains in the plot?
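Both problems can be approached by working on the mcmc.list chain by chain. The sketch below uses simulated chains instead of the JAGS output so that it runs with only the coda package installed; make_chain, with_diff and diff_only are hypothetical names, and the same steps should carry over to codaSamples.

```r
library(coda)

set.seed(1)
# Three simulated chains with the same column names as the JAGS output
make_chain <- function() {
  draws <- cbind(rnorm(200, 0.75, 0.5), rnorm(200, 2.3, 0.5))
  colnames(draws) <- c("mu[1]", "mu[2]")
  mcmc(draws, start = 1, thin = 1)
}
samples <- mcmc.list(make_chain(), make_chain(), make_chain())

# Problem one: add a derived column to every chain, keeping each chain's
# start iteration and thinning interval
with_diff <- mcmc.list(lapply(samples, function(chain) {
  m <- as.matrix(chain)
  mcmc(cbind(m, diff = m[, "mu[1]"] - m[, "mu[2]"]),
       start = start(chain), thin = thin(chain))
}))

# Problem two: selecting columns by name on the mcmc.list keeps all chains
diff_only <- with_diff[, "diff", drop = FALSE]

nchain(diff_only)    # number of chains retained
varnames(diff_only)  # variables retained

# traceplot(diff_only)                   # one trace per chain, one variable
# plot(with_diff[, c("mu[1]", "diff")])  # trace and density for a subset
```

The plotting calls are left commented out so that the block runs without opening a graphics device; uncomment them in an interactive session.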

multipart formula parsing: handling NAs and minimizing object copies

I'm trying to understand how to use Formula objects. Let's say I wanted to make my own 2SLS function and want to divide the objects I'm working with into 4 main groups: y = response; X = exogenous variables; E = endogenous variables; Z = instruments.
I want to be able to construct these objects without making extra copies of the data unnecessarily (say, large N and large number of instruments would make this prohibitively costly in memory usage/time). I also want to take into account NAs from across the data.
Let's use a formula syntax similar to felm (I tried looking at the parsing code there, but couldn't follow it).
frml = y ~ x1 + x2 + x3*x4 | (e1 | e2 ~ z1 + z2)
library(Formula)
N = 12 # be divisible by 6
data = data.frame(y=rnorm(N), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N),
x4=factor(rep(1:2, N/2)), e1=rnorm(N), e2=rnorm(N),
z1=rnorm(N), z2=factor(rep(1:3, N/3)))
data[2,'y'] = data[3,'x1'] = data[4,'e1'] = data[5,'z2'] = NA
parse_frml = function(frml, data, subset=NULL) {
frml = as.Formula(frml)
# does not take into account NAs at all
y = model.part(frml, data=data, subset=subset, lhs=1)
# does not take into account NAs in other variables (y, Z, E)
X = model.matrix(frml, data=data, subset=subset, lhs=0, rhs=1)
Z = model.matrix(frml, data=data, subset=subset, lhs=0, rhs=2)
#E = # I can't figure this out at all
return(list(y=y, X=X, E=E, Z=Z))
}
Now, I can do something like
mf = model.frame(frml, data=data, subset=subset, lhs=1, rhs=1)
which will take into account NAs in y and X, but ignores E and Z. Further, this copies the data into the mf, and then copies again into y and X.
So, I have 2 questions and 1 constraint:
1. How do I get E? (a matrix for the LHS of the 2nd equation)
2. How do I take into account NAs from across the data used by frml in all matrices?
Constraint: do both while minimizing the number of copies of the data (ideally it is just copied into the matrices).
More generally, what's a good resource for understanding Formula, formula, terms, and the like? I've not found, e.g., the Formula package's documentation to be super helpful.

This isn't perfect, but it works. It's a shame how there is almost no information on how to actually handle and manipulate formulas in R code. My solution depends on formula.tools
library(formula.tools)
parse_frml = function(frml, data, subset=NULL) {
frml = as.Formula(frml)
vars = all.vars(frml)
other_vars = c(all.vars(formula(frml, lhs=1, rhs=1)),
rhs.vars(formula(frml, lhs=0, rhs=2)))
e_vars = setdiff(vars, other_vars)
valid = which(complete.cases(data[, vars]))
if (!is.null(subset)) {
if (is.logical(subset)) {
subset = which(subset)
}
valid = intersect(valid, subset)
}
y = model.part(frml, data=data[valid,], lhs=1)
X = model.matrix(frml, data=data[valid,], lhs=0, rhs=1)
Z = model.matrix(frml, data=data[valid,], lhs=0, rhs=2)
E = data.matrix(data[valid, e_vars])
return(list(y=y, X=X, E=E, Z=Z))
}
I suspect that subsetting data with valid each time is rather expensive. But in the above test case, it seems to work.
