How to change the reference level of a meta-regression in metafor?

I'm running a univariate metaregression with the package metafor using the following code:
resMeta <- rma(measure="IR",xi=xi, ti=ti, mods = ~ factor(pop), data=metaAAS)
resMeta
confint(resMeta)
The levels of the moderator 'pop' are labelled as "0", "1", "2", and "3".
The problem is that the function automatically defines the first level ("0") of the moderator 'pop' as the reference level.
How do I change the reference level to "3"?
Thank you.

You can use the relevel() function for that. So, for example:
resMeta <- rma(measure="IR",xi=xi, ti=ti, mods = ~ relevel(factor(pop), ref="3"), data=metaAAS)
resMeta
confint(resMeta)
to make level 3 the reference level.
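What relevel() does to the coding can be checked without fitting the meta-regression. A minimal sketch in base R, assuming a made-up pop vector (not the asker's data):

```r
# Hypothetical moderator with the same four labels as in the question
pop <- c("0", "1", "2", "3", "1", "3", "0", "2")

f_default <- factor(pop)
f_ref3    <- relevel(factor(pop), ref = "3")

levels(f_default)  # "0" comes first, so it is the reference
levels(f_ref3)     # "3" has been moved to the front

# Dummy columns a model would receive; the reference level gets no column
colnames(model.matrix(~ f_ref3))
```

The other levels keep their original order, so only the baseline of the comparisons changes.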

Related

R GLM Predict Error - factor has new levels

I'm doing a basic logistic regression using glm()
I split the data into train and test, built the model using glm, and then tried running predict() using the test data.
Here is the code
data = read.csv('2022_data.csv')
data$A= as.factor(data$A)
data$B= as.factor(data$B)
# split train and test
df = sort(sample(nrow(data), nrow(data)*.8))
df_train = data[df,]
df_test = data[-df,]
# create model
model1 = glm(attrition ~ A+ B + C + D + E, data = df_train, family = binomial)
predict1 = predict(model1, df_test1, type='response')
I encountered
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor A has new levels
I understand that this error message means there is a value in column A that is not accounted for in the model. But I checked the unique values for column A in training and testing data, and both have the exact same values
levels(as.factor(df_test1$A))
levels(as.factor(df_train$A))
Both return
[1] "" "N" "Y"
I'm not sure what I'm missing here
Update:
I checked the summary of the model and it shows only 1 dummy variable for A (i.e. AY, with AN being the reference). It seems that value "" is automatically being excluded by glm(). I changed "" to "no data", but this still occurs.
The thing about factors is that all the levels are stored in the metadata for the column whether or not the value is actually reflected in the data after subsetting.
So you may have trained on data containing two of the three levels, while the third shows up only in the test data (without seeing the data and basic descriptive statistics I cannot be sure).
However, you can test this by running the following code to see what I mean:
x <- as.factor(c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"))
y<-x[1:2]
When you look at y, this is what you see
y
[1] A B
Levels: A B C
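If the leftover level is not wanted, it can be dropped explicitly. A short sketch on a similar toy vector, using base R's droplevels():

```r
x <- factor(c("A", "B", "C", "A", "B", "C"))
y <- x[1:2]

levels(y)            # still "A" "B" "C": the subset remembers level "C"
table(y)             # "C" is listed with a count of 0

y2 <- droplevels(y)  # remove levels that no longer occur in the data
levels(y2)           # "A" "B"
```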
If you want to be sure that all values of the levels are reflected in your coefficients from training you should use a stratified sampling method to account for all levels in the data.
I would check before you go too far to see that there are enough of each level to be meaningful.
> table(x)
x
A B C
4 4 4
If you only have a couple of one level you have bigger problems to consider.
You said that you changed "" to "No Data". If you did that on the values as character strings, you need to re-level the factor so that it takes the new category into account.
It might be best to use:
library(plyr)
x <- revalue(x, setNames("no_data", ""))
This way of converting works on the existing levels: it renames the level itself rather than individual values. A factor's levels persist even if you change some of its values, until you re-level it.
I'd try
library(forcats)
df_test1$A <- df_test1$A |> fct_drop(c(""))
Your error refers to model.frame.default. I am wondering if the "" levels aren't used in the model, and then found in test. Or you might want to assign "" levels to "Y" or "N".
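One way to check this is to compare the levels the fitted model actually kept (stored in its xlevels component, which the error message refers to) with the values in the test set. A rough sketch on simulated data; dat, train and test are hypothetical stand-ins for the asker's objects:

```r
set.seed(1)
# Hypothetical data: A has three declared levels, but "" is very rare
dat <- data.frame(
  attrition = rbinom(200, 1, 0.4),
  A = factor(sample(c("N", "Y"), 200, replace = TRUE), levels = c("", "N", "Y"))
)
dat$A[1] <- ""              # the only row carrying the "" level

test  <- dat[1:20, ]        # the "" row lands in the test set
train <- dat[-(1:20), ]     # training rows contain only "N" and "Y"

levels(train$A)             # "" "N" "Y": levels() still lists the unused level

fit <- glm(attrition ~ A, data = train, family = binomial)
fit$xlevels$A               # levels the model actually used

# Values in the test set that the model has never seen
unseen <- setdiff(unique(as.character(test$A)), fit$xlevels$A)
unseen
```

In this setup levels() reports the same three levels for both subsets, yet predict(fit, test) stops with the "factor A has new levels" error, because glm() discards levels that have no rows in the training data.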

Fitting the right regression for categorical data in R

I am quite new to R! I am a student in econometrics, but still quite new to data science, and I am currently trying to manipulate a dataset and see what interesting facts I can get out of it. It is mostly on the econometrics part, finding an appropriate model, that I am looking for help (I will be happy to receive any kind of advice on any part of my code though, as I am still learning how to use R!).
My dataset is one of employees of a firm, which has levels (hierarchy) and different wages. It covers one year, but I have a similar dataset for each of the previous years, and ultimately I would like to group them to work on the evolution of wages in each level.
I started by computing basic statistics on the dataset, following the advice I had been given on a previous post :
df%>%
select(level, sex)%>%
group_by(level)%>%
summarise(mean = mean(sex == "F"))
And cleaning my dataset :
df <- transform(df,
sex=factor(ifelse(sex=="NA", NA, sex)),
age=as.numeric(age),
time_passed=as.numeric(time_passed),
level=factor(ifelse(level=="", NA, level), ordered = TRUE),
wage=as.numeric(wage))
summary(df)
I plotted age across levels :
plot(age ~ level, data=df)
library(ggplot2)
ggplot(df, aes(x=level, y=age, fill=sex)) +
geom_boxplot()
I tried to remove the NA, but failed when coding this (it did not remove the NA) :
ggplot(df[!is.na(df$level),], aes(x=level, y=age, fill=sex)) +
geom_boxplot()
The linear regression showed level as a significant covariate, yet I would like to do it the other way round (level may be explained by age, not age by level), but "level" is a categorical variable...
lm(age ~ level, data=df)
I then ran an alternative test for correlation of ranked (ordinal) data, the Spearman rank test, to establish whether there is a significant correlation :
library(pspearman)
with(df, spearman.test(age, level))
I also tried to look at the differences in the mean age for each level (I mention this so you can see the different things I have tried; I do not really have a pre-established workflow, so it may seem a bit messy, as I am quite new to data science) :
library(dplyr)
df %>%
filter(sex == "F" | sex == "H") %>%
ggplot(aes(level, age)) +
geom_point() +
facet_wrap(~sex) +
ggtitle ("age per level, for each sex")
I have also tried to follow this workflow, but I think it is limited for my dataset :
https://data.library.virginia.edu/understanding-ordered-factors-in-a-linear-model/
I think a categorical regression model would be nice; I am thinking about a logistic model. I have tried the following method, yet it gives me non-significant coefficients, which seems wrong (wages are in fact an increasing function of level, which the fitted model did not reflect) :
https://www.geeksforgeeks.org/regression-with-categorical-variables-in-r-programming/
So that you can see how I translated the model in the above article to my dataset, I will put it here :
df$sex = as.factor(df$sex)
df$level = as.factor(df$level)
# two-way table of factor variable
xtabs(~level + age, data = df)
# partitioning the data
str(df)
set.seed(1234)
data1<-sample(2, nrow(df),
replace = T,
prob = c(0.6, 0.4))
train<-df[data1 == 1,]
test<-df[data1 == 2,]
# Now build a logistic regression model for our data. glm() fits a
# generalized linear model to the data.
mymodel<-glm(wage ~ age + sex + time_passed + level,
data = train,
family = 'binomial')
summary(mymodel)
So my main question is : what do you recommend to tackle this dataset (I am also grateful for any reading advice, and not afraid to tackle more elaborate models!) ?
Also, I have the same dataset for other years, and I wonder if there is a function in R that would let me visualise and manipulate the levels over the years (for example, I would like to see whether the mean wage, or the standard deviation, of the people in one level evolves throughout the years).
My dataset has the following structure :
structure(list(sex = c("F", "H", "F", "F", "H", "F"), age = c("24",
"33", "53", "32", "38", "21"), time_passed = c("0", "3", "4",
"0", "2", "0"), level = c("N7 ", "N7 ", "N9 ", "N7 ", "N8 ",
" "), wage = c("2605", "4931", "11123", "3750", "6180", "858,31"
)), row.names = c(NA, 6L), class = "data.frame")
I hope I haven't been too long, and thank you in advance for the help you might provide!
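Two details in the sample above are worth checking before any modelling: the level strings carry trailing blanks ("N7 ", " "), so a comparison with "" never matches, and one wage is written as "858,31", which as.numeric() turns into NA. A possible cleaning sketch on the six posted rows, assuming the comma is a decimal separator:

```r
# The six rows posted in the question
df <- structure(list(sex = c("F", "H", "F", "F", "H", "F"),
                     age = c("24", "33", "53", "32", "38", "21"),
                     time_passed = c("0", "3", "4", "0", "2", "0"),
                     level = c("N7 ", "N7 ", "N9 ", "N7 ", "N8 ", " "),
                     wage = c("2605", "4931", "11123", "3750", "6180", "858,31")),
                row.names = c(NA, 6L), class = "data.frame")

df$level <- trimws(df$level)           # "N7 " -> "N7", " " -> ""
df$level[df$level == ""] <- NA         # a blank level becomes missing
df$level <- factor(df$level, ordered = TRUE)

# Assumption: "858,31" uses a decimal comma
df$wage <- as.numeric(sub(",", ".", df$wage, fixed = TRUE))
df$age <- as.numeric(df$age)
df$time_passed <- as.numeric(df$time_passed)

str(df)
```

With the blanks trimmed, the missing level is a real NA, so a filter such as !is.na(df$level) has something to remove.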

Writing a loop in R for regression replacing independent variable for robustness check

I want to run a simple logit regression in R, where my dependent variable is whether a firm charges a positive price or not, and the key independent variable is number of competitors within an x mile radius of the firm. To operationalize the competition variable, I am looking at 1, 5, 10 and 50 miles radius.
I am not sure how to write the loop though, and I get Error in eval(predvars, data, env) : object 'radius_i' not found when I run the loop below.
circle_radius = list("1", "5", "10", "15", "50")
for (i in seq_along(circle_radius)){
my_logit_4_r[i] <- glm(price_b1 ~ radius_i ,
data=data1,
family = binomial(link='logit'))
summary(my_logit_4_r[i])
}
So I am not sure how to specify the loop, as I do not want to use brute force and write the 4 regressions separately. Would appreciate help on what error I am making.
You have to change your code a bit. First, use get() to turn the name you are building from radius_ and i into a covariate in your model, i.e. get(paste0("radius_",i)) (assuming you have covariates named radius_1, radius_5, and so on in your data1 data frame). Also, you might want to remove the seq_along(circle_radius) and just loop over circle_radius, since seq_along will define i as 1, 2, 3, 4, and removing it will define it as "1", "5", "10", and "50". You also need to define my_logit_4_r as a list and use double brackets [[i]] when assigning to the list in the loop.
Below I have made the changes to make this clearer.
Since you didn't provide sample data, I am assuming your data look like this:
circle_radius <- list("1", "5", "10", "50")
data1 <- data.frame(price_b1 = runif(100),
radius_1 = runif(100),
radius_5 = runif(100),
radius_10 = runif(100),
radius_50 = runif(100))
Try the following code:
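my_logit_4_r <- vector(mode = "list", length = length(circle_radius))
# name the slots so that [["1"]], [["5"]], ... fill them instead of appending new elements
names(my_logit_4_r) <- unlist(circle_radius)
for (i in circle_radius){
my_logit_4_r[[i]] <- glm(price_b1 ~ get(paste0("radius_",i)) ,
data=data1,
family = binomial(link='logit'))
# summary() is not auto-printed inside a for loop
print(summary(my_logit_4_r[[i]]))
}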
The models won't converge with my sample data, but they attempt to run. If this doesn't work, please provide sample data and I will update my answer.
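An alternative to get() is to build each formula from the column name, which also keeps the real variable name in the coefficient table. A sketch using base R's reformulate() and lapply(); the simulated data1 (a 0/1 outcome and one count per radius) is made up for illustration:

```r
set.seed(42)
radii <- c("1", "5", "10", "50")

# Hypothetical data: a binary outcome and one competitor count per radius
data1 <- data.frame(
  price_b1  = rbinom(100, 1, 0.5),
  radius_1  = rpois(100, 2),
  radius_5  = rpois(100, 5),
  radius_10 = rpois(100, 9),
  radius_50 = rpois(100, 30)
)

fits <- lapply(radii, function(r) {
  # builds price_b1 ~ radius_1, price_b1 ~ radius_5, ...
  f <- reformulate(paste0("radius_", r), response = "price_b1")
  glm(f, data = data1, family = binomial(link = "logit"))
})
names(fits) <- paste0("radius_", radii)

# Coefficient table of each model, one per radius
lapply(fits, function(m) coef(summary(m)))
```

Each model can then be retrieved by name, for example fits[["radius_10"]].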

InvMillsRatio & Heckit correction

I am currently writing my thesis and relatively new to R.
I need to perform a heckit two stage model (invMillsRatio plus heckit) since I have so much missing data. However I have no clue how to do that. I have 3 main models (2 linear regressions (one lm and one log-linear) and 1 censor regression), but how can I now perform this heckman correction?
I would really appreciate your help a lot; I have absolutely no idea at all!
I have struggled for a while with the same problem, and I think I have found a solution. It is certainly not the most elegant way to proceed, but it runs fine. Additional feedback and friendly suggestions will always be welcome!
First step: create a dummy out of your dependent variable. My dependent variable was FDI outstock, so it took the value of 1 if it was different from 0 or NA, and 0 otherwise. Here is the code I used to create a new column for my dummy DV:
outstock <- outstock %>%
mutate(
outstock4 = as.numeric(
case_when(
log_outstock == 0 ~ "0",
is.na(log_outstock) ~ "0",
log_outstock > 0 ~ "1",
log_outstock < 0 ~ "1",
TRUE ~ as.character(log_outstock)
)
)
)
Second step (if applicable to your dataset): declare as panel data, using the plm.data() function from the plm package. I used this code:
outstock4 <- outstock %>%
plm.data(index = c("CP", "year"))
Third step: In the sampleSelection package, the heckit function works as this:
heckit(dummy_DV ~ IV,
DV ~ IV, data)
Note that it is 2 separate regressions (first your selection equation, then your estimation regression equation), and that there is no mention of lm, glm, log-linear or whatever.
Now, for me, this works as long as I do not have fixed effects to include. With fixed effects, the high number of NAs makes it impossible to run. Therefore I removed the NAs, not with na.omit() (that would remove too many observations, since I do not use all of the variables present in the database in each model), but by filtering them out as such:
model1 <- summary(heckit(dummy_DV ~ IV,
DV ~ IV + as.factor(year) + as.factor(iso_o) + as.factor(iso_d),
data = filter(database, !is.na(IV1), !is.na(IV2), !is.na(IV3), !is.na(IV4), !is.na(IV5))))
I hope it helps. Good luck!
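To see what the two stages do, here is a hand-rolled sketch of the two-step idea in base R on simulated data: a probit for selection, the inverse Mills ratio computed from its linear predictor, then the outcome regression with that ratio added. All variable names are made up, and the standard errors of the second lm() are not corrected for the estimated ratio, which is one reason to prefer heckit() in practice.

```r
set.seed(123)
n <- 2000

# Simulated data: z affects selection only, x affects both equations
x <- rnorm(n)
z <- rnorm(n)
u <- rnorm(n)                        # selection error
e <- 0.7 * u + rnorm(n, sd = 0.5)    # outcome error, correlated with u

observed <- as.numeric(0.5 + z + 0.5 * x + u > 0)   # selection dummy
y <- ifelse(observed == 1, 1 + 2 * x + e, NA)       # outcome seen only if selected
dat <- data.frame(y, x, z, observed)

# Step 1: probit model for being observed
probit <- glm(observed ~ z + x, family = binomial(link = "probit"), data = dat)

# Inverse Mills ratio from the probit linear predictor
xb <- predict(probit, type = "link")
dat$imr <- dnorm(xb) / pnorm(xb)

# Step 2: outcome regression on the selected rows, with the ratio as a regressor
step2 <- lm(y ~ x + imr, data = dat, subset = observed == 1)
coef(step2)
```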

Classification - Usage of factor levels

I am currently working on a predictive model for a churn problem.
Whenever I try to run the following model, I get this error: At least one of the class levels is not a valid R variable name. This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1. Please use factor levels that can be used as valid R variable names.
fivestats <- function(...) c( twoClassSummary(...), defaultSummary(...))
fitControl.default <- trainControl(
method = "repeatedcv"
, number = 10
, repeats = 1
, verboseIter = TRUE
, summaryFunction = fivestats
, classProbs = TRUE
, allowParallel = TRUE)
set.seed(1984)
rpartGrid <- expand.grid(cp = seq(from = 0, to = 0.1, by = 0.001))
rparttree.fit.roc <- train(
churn ~ .
, data = training.dt
, method = "rpart"
, trControl = fitControl.default
, tuneGrid = rpartGrid
, metric = 'ROC'
, maximize = TRUE
)
In the attached picture you see my data, I already transformed some data from chr to factor variable.
I do not get what my problem is. If I transformed the entire data into factors, then for instance the variable total_airtime_out would probably have around 9000 levels.
Thanks for any kind of help!
It's not exactly possible for me to reproduce your error, but my educated guess is that the error message tells you everything you need to know:
At least one of the class levels is not a valid R variable name. This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1. Please use factor levels that can be used as valid R variable names.
Emphasis mine. Looking at your response variable, its levels are "0" and "1", these aren't valid variable names in R (you can't do 0 <- "my value"). Presumably this problem will go away if you rename the levels of the response variable with something like
levels(training.dt$churn) <- c("first_class", "second_class")
as per this Q.
How about this base function:
make.names(churn) ~ .,
to "make syntactically valid names out of character vectors"?
Source
Adding to the correct answer of @einar, here's the dplyr syntax of converting the factor levels:
training.dt %>%
mutate(churn = factor(churn,
levels = make.names(levels(churn))))
I slightly prefer to change only the labels of the factor levels, because passing the new names as levels changes the underlying data (values that no longer match a level become NA), like this:
training.dt %>%
mutate(churn = factor(churn,
labels = make.names(levels(churn))))
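The difference between the two arguments is easy to see on a toy factor. A small base R sketch, with a made-up churn vector:

```r
churn <- factor(c("0", "1", "1", "0"))

make.names(levels(churn))   # "X0" "X1": syntactically valid names

# labels = renames the existing levels and keeps the data
by_labels <- factor(churn, labels = make.names(levels(churn)))
by_labels

# levels = looks for values called "X0" / "X1", finds none, and returns NA
by_levels <- factor(churn, levels = make.names(levels(churn)))
by_levels
```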
I had the same issue and fixed it by setting classProbs = FALSE in trainControl(). This solved the issue and kept the levels 0 and 1.
I got the same problem,
class(iris$Species); levels(iris$Species)
iris.lvls <- factor(iris, levels = c("1", "2", "3"))
class(iris.lvls); levels(iris.lvls)
