How to express membership in multiple categories in R?

How does one express a linear model where observations can belong to multiple categories and the number of categories is large?
For example, using time dummies as the categories, here is a problem that is easy to set up since the number of categories (time periods) is small and known:
tmp <- "day 1, day 2
0,1
1,0
1,1"
periods <- read.csv(text = tmp)
y <- rnorm(3)
print(lm(y ~ day.1 + day.2 + 0, data=periods))
Now suppose that instead of two days there were 100. Would I need to create a formula like the following?
y ~ day.1 + day.2 + ... + day.100 + 0
Presumably such a formula would have to be created programmatically. This seems inelegant and un-R-like.
What is the right R way to tackle this? For example, aside from the formula problem, is there a better way to create the dummies than creating a matrix of 1s and 0s (as I did above)? For the sake of concreteness, say that the actual data consists (for each observation) of a start and end date (so that tmp would contain a 1 in each column between start and end).
Update:
Based on the answer of @jlhoward, here is a larger example:
num.observations <- 1000
# Manually create 100 columns of dummies called x1, ..., x100
periods <- data.frame(1*matrix(runif(num.observations*100) > 0.5, nrow = num.observations))
y <- rnorm(num.observations)
print(summary(lm(y ~ ., data = periods)))
It illustrates the manual creation of a data frame of dummies (1s and 0s). I would be interested in learning whether there is a more R-like way of dealing with this "multiple dummies per observation" issue.

You can use the . notation to include all variables other than the response in a formula, and -1 to remove the intercept. Also, put everything in your data frame; don't make y a separate vector.
set.seed(1) # for reproducibility
df <- data.frame(y=rnorm(3),read.csv(text=tmp))
fit.1 <- lm(y ~ day.1 + day.2 + 0, df)
fit.2 <- lm(y ~ -1 + ., df)
identical(coef(fit.1),coef(fit.2))
# [1] TRUE
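The question also asks how to build the dummies when each observation has a start and an end date. A minimal sketch, assuming the dates have already been converted to integer day indices; the names start, end, n_days and membership are hypothetical and the data are simulated:

```r
set.seed(1)
n      <- 200   # number of observations (made up)
n_days <- 10    # number of time periods (made up)

# Hypothetical start/end day indices for each observation
start <- sample(n_days, n, replace = TRUE)
end   <- pmin(start + sample(0:3, n, replace = TRUE), n_days)

days <- seq_len(n_days)
# membership[i, j] is 1 when day j lies in [start[i], end[i]], else 0
membership <- 1 * (outer(start, days, "<=") & outer(end, days, ">="))
colnames(membership) <- paste0("day.", days)

# Put the response and the dummies in one data frame, then use the . notation
df  <- data.frame(y = rnorm(n), membership)
fit <- lm(y ~ . - 1, data = df)
length(coef(fit))   # one coefficient per day column
```

The two outer() calls compare every start and end value against every day at once, so no loop over observations or columns is needed.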


How to change the order of rows in ggforest without changing the reference

Imagine the following dataset (made-up data):
K<- c(2,2.2,2.4,2.6,2.8,3,3.5,3.8,4,4.2,4.4,4.8,5,5.2,5.4,5.6,5.8,6)
event <- c(1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1)
t<- c(8,10,25,10,8,22,30,16,32,30,32,20,8,12,14,22,10,6)
df<- data.frame(K,event,t)
I split the variable K (potassium) into a categorical variable with 3 levels (< 3, >= 3 and <5, >=5)
df$K_cut <- cut(K, c(0,3,5,6.5), right = F)
levels(df$K_cut) # [1] "[0,3)" "[3,5)" "[5,6.5)"
We perform a cox regression and represent it with ggforest
The reference category is potassium < 3
library(survival) # provides coxph() and Surv()
fit3 <- coxph(Surv(t, event) ~ K_cut, data = df)
fit3
library(survminer)
ggforest(fit3, data=df, fontsize=0.8)
We changed the reference category to be a normal potassium (3-5)
And when plotting it is now the correct reference, but it is plotted on the first line.
df$K_cut <- relevel(df$K_cut, ref = "[3,5)")
fit4<- coxph(Surv(t,event) ~ K_cut, data=df)
fit4
library(survminer)
ggforest(fit4, data=df, fontsize=0.8)
I would like to keep K 3-5 as the reference category but have it plotted on the middle row, so that the graph shows, from top to bottom, K < 3, K between 3 and 5, and K >= 5.
The result should look like this (made by pasting and retouching the figure ...)
Is there a way to do this with ggforest or with another function/package?
That is, to change the order of the rows and put the reference wherever you want?
In addition, can you change the spacing so that the intervals and N= ... are not so close together, or modify the name of the variable?
In the ggforest documentation, I have not seen that such options exist.
Regards and thanks
One option to achieve your desired result would be to relevel your factor after estimating your model and before calling ggforest:
library(survminer)
library(survival)
fit4 <- coxph(Surv(t, event) ~ K_cut, data = df)
df$K_cut <- factor(df$K_cut, levels = c("[0,3)", "[3,5)", "[5,6.5)"))
ggforest(fit4, data = df, fontsize = 0.8)
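To see what the two factor operations do to the level order, here is a small base R sketch with made-up potassium values. It only inspects the levels and does not call ggforest; it assumes, as the answer above implies, that ggforest takes the row order from the factor levels in the data argument while the reference category comes from the fitted model:

```r
# Made-up potassium values, cut into the same three intervals as in the question
K     <- c(2, 3.5, 5.2, 4, 2.8, 6)
K_cut <- cut(K, c(0, 3, 5, 6.5), right = FALSE)

# relevel() moves the chosen reference level to the front
K_ref <- relevel(K_cut, ref = "[3,5)")
levels(K_ref)

# factor(..., levels = ) restores the natural display order
# without changing which interval each observation belongs to
K_disp <- factor(K_ref, levels = c("[0,3)", "[3,5)", "[5,6.5)"))
levels(K_disp)
```

The values themselves are unchanged by either step; only the order of the levels differs.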

Why does just one (of 8) numeric predictor variables return NA when I run lm()?

I'm trying to build a linear regression model using eight independent variables, but when I run lm() one variable--what I anticipate being my best predictor!--keeps returning NA. I'm still new to R, and I cannot find a solution.
Here are my independent variables:
TEMPERATURE
HUMIDITY
WIND_SPEED
VISIBILITY
DEW_POINT_TEMPERATURE
SOLAR_RADIATION
RAINFALL
SNOWFALL
My df is training_set and looks like:
I'm not sure whether this matters, but training_set is 75% of my original df, and testing_set is 25%. Created thusly:
set.seed(1234)
split_bike_sharing <- sample(c(rep(0, round(0.75 * nrow(bike_sharing_df))), rep(1, round(0.25 * nrow(bike_sharing_df)))))
This gave me table(split_bike_sharing):
   0    1
6349 2116
And then I did:
training_set <- bike_sharing_df[split_bike_sharing == 0, ]
testing_set <- bike_sharing_df[split_bike_sharing == 1, ]
The structure of training_set is like:
To create the model I run the code:
lm_model_weather=lm(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE +
SOLAR_RADIATION + RAINFALL + SNOWFALL, data = training_set)
However, as you can see the resultant model returns RAINFALL as NA. Here is the resultant model:
My first thought was to check RAINFALL datatype, which is numeric with range 0-1 (because at an earlier step I performed min-max normalization). But SNOWFALL also is numeric, and I've done nothing (that I know of!) to the one but not the other. My second thought was to confirm that RAINFALL contains enough values to work, and that does not appear to be an issue: summary(training_set$RAINFALL):
So, how do I correct the NAs in RAINFALL? Truly I will be most grateful for your guidance to a solution.
UPDATE 10 MARCH 2022
I've now checked for collinearity:
X <- model.matrix(RENTED_BIKE_COUNT ~ ., data = training_set)
X2 <- caret::findLinearCombos(X)
print(X2)
This gave me:
I believe this means certain columns are jointly multicollinear. As you can see, columns 8, 13, and 38 are:
[8] is RAINFALL
[13] is SEASONS_WINTER
[38] is HOUR_23
Question: if I want to preserve RAINFALL as a predictor variable (viz., return proper values rather than NAs when I run lm()), what do I do? Remove columns [13] and [38] from the dataset?
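For context, lm() reports NA for a coefficient when its column is an exact linear combination of columns that come before it in the design matrix. A minimal sketch with simulated data (the column names a, b and c are hypothetical and unrelated to training_set, so this does not diagnose the bike-sharing data itself):

```r
set.seed(1)
d   <- data.frame(a = rnorm(20), b = rnorm(20))
d$c <- d$a + d$b          # c is an exact linear combination of a and b
d$y <- rnorm(20)

fit <- lm(y ~ a + b + c, data = d)
coef(fit)                 # the coefficient for c is NA
alias(fit)                # lists c in terms of the other columns

# Which term ends up NA depends on column order:
# with c first, b (the last column in the dependency) is the one dropped
fit2 <- lm(y ~ c + a + b, data = d)
coef(fit2)
```

Removing any one of the columns involved in a dependency lets the remaining ones be estimated, so the choice of which to drop is a modelling decision rather than a technical one.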

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. Usually I used revoScaleR::rxGlm() for this because I work with large data sets and use quite complex model formulae - and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1)) # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
data = df_reprex,
pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many others seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly
Two things to try:
Expand data, get rid of weights statement
use cbind(y,x-y)~1 in either rxLogit or rxGlm without weights and without expanding data
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to each 1 or 0 response and then this expanded data is run in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
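For larger data the expansion can be done programmatically rather than by hand. A minimal sketch in base R, checked here only with glm() (whether rxLogit accepts the expanded data the same way is untested); the names successes, failures, expanded and row_id are hypothetical:

```r
df_reprex <- data.frame(x = c(1, 1, 2, 2),   # number of trials
                        y = c(0, 1, 0, 1))   # number of successes

successes <- df_reprex$y
failures  <- df_reprex$x - df_reprex$y

# For each original row emit `successes` ones followed by `failures` zeros
expanded <- data.frame(
  y      = rep(rep(c(1, 0), times = nrow(df_reprex)),
               times = c(rbind(successes, failures))),
  row_id = rep(seq_len(nrow(df_reprex)), times = df_reprex$x)
)

fit <- glm(y ~ 1, family = binomial, data = expanded)
plogis(coef(fit)[[1]])   # fitted overall success rate, should be close to 2/6
```

The row_id column keeps a link back to the original row so that any covariates can be merged onto the expanded data.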
Does rxLogit allow a cbind representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size. From the rxLogit page, I'm guessing not for rxLogit, but rxGlm might allow it, given the following note in the formula page:
A formula typically consists of a response, which in most RevoScaleR
functions can be a single variable or multiple variables combined
using cbind, the "~" operator, and one or more predictors, typically
separated by the "+" operator. The rxSummary function typically
requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1), # number of successes
trial=c("first", "second", "third", "fourth")) # trial label
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
df_reprex_expanded <- data.frame(y=c(0,1,0,0,1,0),
trial=c("first","second","third", "third", "fourth", "fourth"))
## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
family = binomial,
data = df_reprex_expanded)
exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1])) # overall fitted average 0.333 - correct
## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
glm_2a <- rxLogit(y ~ 1,
data = df_reprex_expanded)
exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1])) # overall fitted average ???
# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y,x-y)~1,
data=df_reprex)
exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1])) # overall fitted average ???
# cbind() + rxGlm + family=binomial FTW(?)
glm_2c <- rxGlm(cbind(y,x-y)~1,
family=binomial,
data=df_reprex)
exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1])) # overall fitted average ???

Nonlinear model with many independent variables (fixed effects) in R

I'm trying to fit a nonlinear model with nearly 50 variables (since there are year fixed effects). The problem is I have so many variables that I cannot write the complete formula down like
nl_exp = as.formula(y ~ t1*year.matrix[,1] + t2*year.matrix[,2]
+... +t45*year.matrix[,45] + g*(x^d))
nl_model = gnls(nl_exp, start=list(t=0.5, g=0.01, d=0.1))
where y is the binary response variable, year.matrix is a matrix of 45 columns (indicating 45 different years) and x is the independent variable. The parameters that need to be estimated are t1, t2, ..., t45, g, d.
I have good starting values for t1, ..., t45, g, d. But I don't want to write a long formula for this nonlinear regression.
I know that if the model is linear, the expression can be simplified using
l_model = lm(y ~ factor(year) + ...)
I tried factor(year) in gnls function but it does not work.
Besides, I also tried
nl_exp2 = as.formula(y ~ t*year.matrix + g*(x^d))
nl_model2 = gnls(nl_exp2, start=list(t=rep(0.2, 45), g=0.01, d=0.1))
It also returns an error message.
So, is there any easy way to write down the nonlinear formula and the starting values in R?
Since you have not provided any example data, I wrote my own - it is completely meaningless and the model actually doesn't work because it has bad data coverage but it gets the point across:
library(nlme) # provides gnls()
y <- 1:100
x <- 1:100
year.matrix <- matrix(runif(4500, 1, 10), ncol = 45)
start.values <- c(rep(0.5, 45), 0.01, 0.1) #you could also use setNames here and do this all in one row but that gets really messy
names(start.values) <- c(paste0("t", 1:45), "g", "d")
start.values <- as.list(start.values)
nl_exp2 <- as.formula(paste0("y ~ ", paste(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"), collapse = " + "), " + g*(x^d)"))
gnls(nl_exp2, start=start.values)
This may not be the most efficient way to do it, but since you can pass a string to as.formula it's pretty easy to use paste commands to construct what you are trying to do.
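To make the paste construction easier to inspect, here is the same idea with only three year terms, so the resulting formula and start list can be printed and read. A sketch only: k, term_strings, small_formula and small_start are hypothetical names, and no model is fitted:

```r
k <- 3   # number of year columns in this reduced illustration

# One "t<i>*year.matrix[,<i>]" string per year column
term_strings <- paste0("t", 1:k, "*year.matrix[,", 1:k, "]")

# Join the terms with " + " and append the nonlinear part
small_formula <- as.formula(
  paste0("y ~ ", paste(term_strings, collapse = " + "), " + g*(x^d)")
)
print(small_formula)

# Named list of starting values, built in one step with setNames()
small_start <- as.list(setNames(c(rep(0.5, k), 0.01, 0.1),
                                c(paste0("t", 1:k), "g", "d")))
str(small_start)
```

Changing k to 45 gives a formula and start list of the same shape as in the answer above.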

Get predicted values for next period

Please consider the following data:
y<- c(2,2,6,3,2,23,5,6,4,23,3,4,3,87,5,7,4,23,3,4,3,87,5,7)
x1<- c(3,4,6,3,3,23,5,6,4,23,6,5,5,1,5,7,2,23,6,5,5,1,5,7)
x2<- c(7,3,6,3,2,2,5,2,2,2,2,2,6,5,4,3,2,3,2,2,6,5,4,3)
type <- c("a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","c","c","c","c","c","c","c","c")
generation<- c(1,1,1,1,2,2,3,3,1,2,2,2,3,3,4,4,1,2,2,2,3,3,4,4)
year<- c(2004,2005,2006,2007,2008,2009,2010,2011,2004,2005,2006,2007,2008,2009,2010,2011,2004,2005,2006,2007,2008,2009,2010,2011)
data <- data.frame(y,x1,x2,type,generation,year)
I would now like to run an analysis that uses only a single year's data and predicts for the following year. In essence, this means running several separate analyses, each using only the data up to one point in time and then predicting for the next (only the directly following) period.
I tried to set up an example for the three models:
data2004 <- subset(data, year==2004)
data2005 <- subset(data, year==2005)
m1 <- lm(y~x1+x2, data=data2004)
preds <- predict(m1, data2005)
How can I do this automatically? My preferred output would be a predicted value for each type that indicates what the value would have been for each of the values that exist in the following period (the original data has 200 periods).
Thanks in advance, help very much appreciated!
The following may be more like what you want.
uq.year <- sort(unique(dat$year)) ## sorting so that i+1 element is the year after ith element
year <- dat$year
dat$year <- NULL ## we want everything in dat to be either the response or a predictor
model <- rep(c("a", "b", "c"), times = length(year) / 3) ## identifies the separate people per year
predlist <- vector("list", length(uq.year) - 1) ## there is 1 prediction fewer than the number of unique years
for(i in 1:(length(uq.year) - 1))
{
mod <- lm(y ~ ., data = subset(dat, year == uq.year[i]))
predlist[[i]] <- predict(mod, subset(dat, subset = year == uq.year[i + 1], select = -y))
names(predlist[[i]]) <- model[year == uq.year[i + 1]] ## labeling each prediction
}
The reason that we want dat to only have modeling variables (rather than year, for example) is because then we can easily use the y ~ . notation and avoid having to spell out all of the predictors in the lm call.
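The same fit-on-one-year, predict-the-next idea can also be written with lapply and collected into a single data frame. A minimal sketch on simulated data; it assumes each model uses only one year's rows (as in the question's 2004/2005 example), and the names sim, years and preds are hypothetical:

```r
set.seed(42)
# Simulated panel: 5 years with 10 observations each (made-up data)
sim <- data.frame(year = rep(2004:2008, each = 10),
                  x1   = rnorm(50),
                  x2   = rnorm(50))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(50, sd = 0.1)

years <- sort(unique(sim$year))

# For each year except the last: fit on that year, predict the next one
preds <- lapply(seq_len(length(years) - 1), function(i) {
  fit      <- lm(y ~ x1 + x2, data = sim[sim$year == years[i], ])
  next_dat <- sim[sim$year == years[i + 1], ]
  data.frame(year      = next_dat$year,
             observed  = next_dat$y,
             predicted = predict(fit, newdata = next_dat))
})
preds <- do.call(rbind, preds)
head(preds)
```

Keeping year as a column in the output makes it straightforward to compare observed and predicted values per period afterwards.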
