How do we run a linear regression with the given data? - r

We have a large data set with 26 brands, sold in 93 stores, over 399 weeks. The brands are further divided into 556 sub-brands (e.g. brand = Colgate, with sub-brands such as Colgate premium white, Colgate extra, etc.).
For each sub-brand we calculated a brand-share-weighted price at the weekly store level:
Calculation: (move per ounce for each sub-brand, per store and week) DIVIDED BY (sum of move per ounce over all sub-brands of the same brand, per store and week) * (log price per ounce for each sub-brand, per store and week)
Everything worked! We created a data frame with all the detailed calculations (data = tooth4). Our final aim is to run a linear regression to estimate the influence of price on the move variable.
--> The problem now is that the sale variable (a dummy that indicates whether there is a promotion in a specific week for a specific sub-brand in a specific store) is at the sub-brand level.
--> We tried to run a regression at the sub-brand level (variable = descrip), but it doesn't work because the data set is too large.
lm(formula = logmove_ounce ~ log_wei_price_ounce + descrip - 1 *
(log_wei_price_ounce) + sale - 1, data = tooth4)
logmove_ounce = log of weekly subbrand based move on store level
log_wei_price_ounce = weighted subbrand based price for each store for each week
sale-1 = fixed effect for promotion
descrip-1 = fixed effect for subbrand
Does anyone know how to run a regression at the brand level only while still including the promotion variable?
We got a hint that we could calculate a shared value of promotion for each brand on each store ? But how?
Another question: assuming my regression is right or partly right, how can I weight the results so that they are at the store level rather than the weekly store level?
Thank you in advance !!!

We got a hint that we could calculate a shared value of promotion for each brand on each store ? But how?
This is variously called a multilevel model, a nested model, a hierarchical model, a mixed model, or a random-effects model; these are all names for essentially the same mathematical model. It is widely used to analyze the kind of longitudinal panel data you describe. A serious book on the subject is Gelman.
The most common approach in R is to use the lmer() function from the lme4 package. If you're using lme4 on uncomfortably large data, you should read their performance tips.
lmer() models accept a slightly different formula syntax, which I'll describe only briefly so that you can see how it can solve the problems you're having.
For example, let's assume we're modeling future salary as a function of the GPA and IQ of certain students. We know that students come from certain schools, so all students who attend the same school are part of a group, and schools are in turn grouped into counties and states. Furthermore, students graduate in different years, which may have an effect. This is a generic example, but I chose it because it shares many of the same characteristics as your own longitudinal panel data.
We can use the generalized formula syntax to specify groups with a varying intercept:
lmer(salary ~ gpa + iq + (1|school), data=df)
A nested hierarchy of such groups:
lmer(salary ~ gpa + iq + (1|state/county/school), data=df)
Or group-varying slopes to capture changes over time:
lmer(salary ~ gpa + iq + (1 + year|school), data=df)
You'll have to make your own decisions about how to model your data, but lme4::lmer() will give you a larger toolbox than lm() for dealing with groups and levels. I'd recommend asking on https://stats.stackexchange.com/ if you have questions about the modeling side.
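As for the "shared value of promotion" hint in the question, one possible reading is a share-weighted promotion measure per brand, store and week, built the same way as the weighted price. The sketch below uses base R on a toy data frame; the column names brand, store, week, move_ounce and sale are assumptions standing in for the real columns of tooth4, and weighting by movement share is only one of several defensible choices.

```r
# Toy stand-in for tooth4; all column names here are hypothetical.
toy <- data.frame(
  store      = c(1, 1, 1, 1),
  week       = c(1, 1, 1, 1),
  brand      = c("Colgate", "Colgate", "Crest", "Crest"),
  descrip    = c("Colgate premium white", "Colgate extra", "Crest A", "Crest B"),
  move_ounce = c(30, 10, 5, 15),
  sale       = c(1, 0, 0, 0)
)

# Share of each sub-brand within its brand, per store and week.
brand_total <- ave(toy$move_ounce, toy$brand, toy$store, toy$week, FUN = sum)
toy$share <- toy$move_ounce / brand_total

# Promotion dummy weighted by that share.
toy$sale_w <- toy$share * toy$sale

# Collapse to one row per brand, store and week.
brand_level <- aggregate(cbind(move_ounce, sale_w) ~ brand + store + week,
                         data = toy, FUN = sum)
print(brand_level)
```

The resulting sale_w lies between 0 and 1 and can be read as the fraction of a brand's movement that was on promotion in that store and week. A brand-level model could then use it as a regressor in place of the sub-brand dummy.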

Related

Use of svyglm and svydesign with R for multistage stratified cluster design

I have a complicated data set that was collected using a multistage stratified cluster design. I originally analysed it using glm, but I now realise that I have to use svyglm. I'm not quite sure how best to model the data using svyglm, and I was wondering if anyone could help shed some light.
I am attempting to see the effect that a variety of covariates taken at time 1 have on a binary outcome taken at time 2.
The sampling strategy was as follows: state -> urban/rural -> district -> subdistrict -> village. Within each village, individuals were randomly selected, with each of these having an id (uniqid).
I have a variable in the df for each of these stages of the sampling strategy. I also have the following variables: outcome, age, sex, income, marital_status, urban_or_rural_area, uniqid, weights. The formula that I want for my regression equation is outcome ~ age + sex + income + marital_status + urban_or_rural_area . Weights are coded by the weights variable. I had set the family to binomial(link = logit).
If anyone has any idea how such an approach could be coded in R with svyglm, I would be most appreciative. I'm quite confused as to what should be supplied as id, fpc and nest. Do I have to specify all levels of the stratified design or just some?
Any direction, or resources which explain this well would be massively appreciated.
You don't really give enough information about the design: which of the geographical units are strata and which are clusters. For example, my guess is that you sample both urban and rural in all states, and you don't sample all villages, but I don't know whether you sample all districts or subdistricts. I also don't know whether your overall sampling fraction is large or small (and so whether the with-replacement approximation is OK).
Let's pretend you sample just some districts, so districts are your Primary Sampling Units, and that the overall sampling fraction of people is small. The design command is
your_design <- svydesign(id=~district, weights=~weights,
                         strata=~interaction(state, urban_or_rural_area, drop=TRUE),
                         data=your_data_frame)
That is, the strata are combinations of state and urban/rural and any combinations that aren't in your data set don't exist in the population (maybe some states are all-rural or all-urban). Within each stratum you have districts, and only some of these appear in the sample. In your geographical hierarchy, districts are then the first level that is sampled rather than exhaustively enumerated.
You don't need fpc unless you want to specify the full multistage design without replacement.
The nest option is not about how the survey was done but is about how variables are coded. The US National Center for Health Statistics (bless their hearts) set up a lot of designs that have many strata and two primary sampling units per stratum. They call these primary sampling units 1 and 2; that is, they reuse the names 1 and 2 in every stratum. The svydesign function is set up to expect different sampling unit names in different strata, and to verify that each sampling unit name appears in just one stratum, as a check against data errors. This check has to be disabled for NCHS surveys and perhaps some others that also reuse sampling unit names. You can always leave out the nest option at first; svydesign will tell you if it might be needed.
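To make the nest behaviour concrete, here is a small sketch with made-up data in which the sampling unit labels 1 and 2 are reused in both strata. The data frame and its column names (stratum, psu, w, y) are invented for illustration and are not taken from the question. It assumes the survey package is installed.

```r
library(survey)

# Made-up data: PSU labels 1 and 2 are reused in stratum A and stratum B.
d <- data.frame(
  stratum = rep(c("A", "B"), each = 4),
  psu     = rep(c(1, 2, 1, 2), each = 2),
  w       = rep(10, 8),
  y       = c(1, 0, 1, 1, 0, 0, 1, 0)
)

# Without nest = TRUE the nesting check fails, so capture the error.
res <- try(svydesign(id = ~psu, strata = ~stratum, weights = ~w, data = d),
           silent = TRUE)
failed_without_nest <- inherits(res, "try-error")

# With nest = TRUE the PSU labels are treated as nested within strata.
des <- svydesign(id = ~psu, strata = ~stratum, weights = ~w, data = d,
                 nest = TRUE)
print(failed_without_nest)
print(des)
```

If the sampling unit identifiers are already unique across strata, the first call succeeds and nest can be left out.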
Finally, the model:
svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
design=your_design, family=quasibinomial)
Using binomial or quasibinomial will give identical answers, but using binomial will give you a harmless warning about non-integer weights. If you use quasibinomial, the harmless warning is suppressed.
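The warning and the identical point estimates can be seen with plain glm() and non-integer weights, which is, as far as I know, where the warning originates when svyglm() fits the model. The sketch below uses simulated data with invented names (x, y, w). Its standard errors are not design-based, so it only illustrates the family choice, not a survey analysis.

```r
set.seed(7)

# Simulated binary outcome with non-integer weights (all names invented).
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(0.5 * d$x))
d$w <- runif(100, 0.5, 3)

# family = binomial: record whether a warning is raised, then muffle it.
warned <- FALSE
fit_b <- withCallingHandlers(
  glm(y ~ x, family = binomial, weights = w, data = d),
  warning = function(cond) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# family = quasibinomial: same estimating equations, no such warning.
fit_q <- glm(y ~ x, family = quasibinomial, weights = w, data = d)

print(warned)
print(all.equal(coef(fit_b), coef(fit_q)))
```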

Diagnostic plots fail with LMMs

I've been working on the following problem recently: We sent 18 people, 9 each, several times to two different clubs "N" and "O". These people arrived at the club either between 8 and 10 am (10) or between 10 and 12 pm (12). Each club consists of four sectors with ascending price classes. At the end of each test run, the subjects filled out a questionnaire reflecting a score for their satisfaction depending on the different parameters. The aim of the study is to find out how satisfaction can be modelled as a function of the club. You can download the data as csv for one week with this link (without spaces): https: // we.tl/t-I0UXKYclUk
After some trial and error, I fitted the following model using the lme4 package in R (the other models were singular, had too strong internal correlations or higher AIC/BIC):
mod <- lmer(Score ~ Club + (1|Sector:Subject) + (1|Subject), data = dl)
Now I wanted to create some diagnostic plots as indicated here.
plot(resid(mod), dl$Score)
plot(mod, col=dl$Club)
library(lattice)
qqmath(mod, id=0.05)
Unfortunately, it turns out that there are still patterns in the residuals that can be attributed to the club but are not captured by the model. I have already tried to incorporate the club into the random effects, but this leads to singularities. Does anyone have a suggestion on how I can deal with these patterns in the residuals? Thank you!

GLMM: Needing overall advice on selecting model terms for glmm modelling in R

I would like to create a model to understand how habitat type affects the abundance of bats found; however, I am struggling to understand which terms I should include. I wish to use lme4 to fit a GLMM. I have chosen a GLMM because the response is a count, so I assume a Poisson distribution (you can't have half a bat), and because the distribution is right-skewed, with lots of single bats.
My dataset is very big and is comprised of abundance counts recorded by an individual on a bat survey (bat survey number is not included as it's public data). My data set includes abundance, year, month, day, environmental variables (temp, humidity, etc.), recorded_habitat, surrounding_habitat, latitude and longitude, and is structured like the set shown below. P.S Occurrence is an anonymous recording made by an observer at a set location, at a location a number of bats will be recorded - it's not relevant as it's from a greater dataset.
occurrence  abundance  latitude  longitude  year  month  day  (environmental variables)
3456        45         53.56     3.45       2000  5      3    34.6

surrounding_hab  recorded_hab
A                B
Recorded habitat and surrounding habitat range in letters (A-I) corresponding to a habitat type. Also, the table is split as it wouldn't fit in the box.
These models shown below are the models I think are a good choice.
rhab1 <- glmer(individual_count ~ recorded_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab1)
rhab2 <- glmer(individual_count ~ surrounding_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab2)
I'll now explain my questions in regards to the models I have chosen, with my current thinking/justification.
Firstly, I am confused about the mix of categorical and numeric variables, is it wise to include the environmental variables as they are numeric? My current thinking is scaling the environmental variables allowed the model to converge so including them is okay?
Secondly, I am confused about the mix of spatial and temporal variables, primarily if I should include temporal variables as the predictor is a temporal variable. I'd like to include year as a random variable as bat populations from one year directly affect bat populations the next year, and also latitude and longitude, does this seem wise?
I am also unsure if latitude and longitude should be random? The confusion arises because latitude and longitude do have some effect on the land use.
Additionally, is it wise to include recorded_habitat and surrounding_habitat in the same model? When I tried this, it produced a massive output with a huge correlation matrix, so I'm thinking I should run two models (year ~ recorded_hab) and (year ~ surrounding_hab) and then discuss them separately, hence the two models.
Sorry this question is so broad! Any help or thinking is appreciated - including data restructuring or model term choice. I'm also new to stack overflow so please do advise on question lay out/rules etc if there are glaringly obvious mistakes.

Interacting regressors in a BQ ML Linear Regression Model

I'm trying to work out how to get two regressors to interact when using BigQuery ML.
In the example below (apologies for the rough fake data!), I'm trying to predict total_hire_duration using trip_count as well as the month of the year. BQ tends to treat the month part as a constant added on to the linear regression equation, but I actually want it to grow with trip_count. For my real dataset I can't just supply the timestamp, as BQML seems to over-parameterise.
I should add that if I supply month as a numeric value I just get a single coefficient, which doesn't really work for my dataset (patterns form around parts of the academic year rather than the calendar year).
If the month part is a constant, then as trip_count gets very large, the constant in the equation y = ax + b becomes inconsequential. It's almost as if I want something like y = ax + bx + c, where x is trip_count, a is its coefficient, and b is a coefficient that depends on the value of month.
This is quite easy to do in R, I'd just run
glm(bike$totalHireDuration ~ bike$tripCount:bike$month)
Here's some fake data to reproduce:
CREATE OR REPLACE MODEL
  my_model_name OPTIONS (model_type='linear_reg',
    input_label_cols=['total_hire_duration']) AS (
  SELECT
    CAST(EXTRACT(MONTH FROM DATE(start_date)) AS STRING) month,
    COUNT(*) trip_count,
    SUM(duration_sec) total_hire_duration
  FROM
    `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
  GROUP BY
    date)
Any help would be greatly appreciated!
First, note that it is almost always a bad idea to fit a model such as:
glm(bike$totalHireDuration ~ bike$tripCount:bike$month)
which fits only the interaction without the main effects.
But getting to the point of the question: I can't help with BigQuery in particular, but in any software you can fit an interaction between two variables simply by creating a new variable that is the product of the two and then using that as a regressor.
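As a quick check of the product-variable idea, the sketch below compares a hand-built product column with R's own interaction syntax on simulated data. The data frame and the column names trip_count, month_num and trip_x_month are invented for illustration, and month is treated as numeric to keep the example short. A categorical month would need one product column per month dummy.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the bike data (all names hypothetical).
bike <- data.frame(
  trip_count = runif(n, 10, 100),
  month_num  = sample(1:12, n, replace = TRUE)
)
bike$total_hire_duration <- 5 + 2 * bike$trip_count + 3 * bike$month_num +
  0.5 * bike$trip_count * bike$month_num + rnorm(n)

# Hand-built interaction: the product of the two regressors.
bike$trip_x_month <- bike$trip_count * bike$month_num

fit_manual  <- lm(total_hire_duration ~ trip_count + month_num + trip_x_month,
                  data = bike)
fit_formula <- lm(total_hire_duration ~ trip_count * month_num, data = bike)

# The two fits should agree coefficient by coefficient.
same_fit <- all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))
print(same_fit)
```

The same kind of product column could presumably be computed in the SELECT that feeds the model, so the approach does not depend on formula support in the modeling tool.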
It looks like month in your model is a string, so the BQML linear regression model will treat it as a categorical feature. If you want month as an integer, you can try CAST(EXTRACT(MONTH FROM DATE(start_date)) AS INT64) month.

Incorporating time series into a mixed effects model in R (using lme4)

I've had a search for similar questions and come up short so apologies if there are related questions that I've missed.
I'm looking at the amount of time spent on feeders (dependent variable) across various conditions with each subject visiting feeders 30 times.
Subjects are exposed to feeders of one type which will have a different combination of being scented/unscented, having visual patterns/being blank, and having these visual or scented patterns presented in one of two spatial arrangements.
So far my model is:
mod<-lmer(timeonfeeder ~ scent_yes_no + visual_yes_no +
pattern_one_or_two + (1|subject), data=data)
How can I incorporate the visit numbers into the model to see if these factors have an effect on the time spent on the feeders over time?
You have a variety of choices (this question might be marginally better for CrossValidated).
As @Dominix suggests, you can allow for a linear increase or decrease in time on feeder over time. It probably makes sense to allow this change to vary across birds:
timeonfeeder ~ time + ... + (time|subject)
You could allow for an arbitrary pattern of change over time (i.e. not just linear):
timeonfeeder ~ factor(time) + ... + (1|subject)
This probably doesn't make sense in your case, because you have a large number of observations per subject, so it would require many parameters (it would be more sensible if you had, say, 3 time points per individual).
You could allow for a more complex pattern of change over time via an additive model, i.e. modeling change over time with a cubic spline. For example:
library(mgcv)
gamm(timeonfeeder ~ s(time) + ... , random = list(subject = ~1))
Note that (1) this assumes the temporal pattern is the same across subjects; (2) because gamm() uses lme rather than lmer under the hood, you have to specify the random effect as a separate argument, in list form. (You could also use the gamm4 package, which uses lmer under the hood.)
You might want to allow for temporal autocorrelation. For example,
lme(timeonfeeder ~ time + ... ,
random = ~ time|subject,
correlation = corAR1(form= ~time|subject) , ...)
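A runnable version of the autocorrelation idea, on simulated data, might look like the sketch below. The data, the variable names (feed, visit, subject) and the parameter values are all invented. To keep the fit simple it uses a random intercept only, rather than the random slope shown above, and it assumes the nlme package is available.

```r
library(nlme)
set.seed(42)

n_subj  <- 12
n_visit <- 30

# One row per subject and visit; visit varies fastest within subject.
feed <- expand.grid(visit = 1:n_visit, subject = factor(1:n_subj))

# Subject-specific intercepts plus AR(1) errors within each subject.
subj_int <- rnorm(n_subj, sd = 2)
ar_err <- unlist(lapply(seq_len(n_subj), function(i) {
  as.numeric(arima.sim(list(ar = 0.5), n = n_visit))
}))
feed$timeonfeeder <- 20 - 0.1 * feed$visit +
  subj_int[as.integer(feed$subject)] + ar_err

# Random intercept per subject, AR(1) correlation over visits within subject.
fit <- lme(timeonfeeder ~ visit,
           random = ~ 1 | subject,
           correlation = corAR1(form = ~ visit | subject),
           data = feed)

# Estimated AR(1) parameter on its natural scale.
phi <- coef(fit$modelStruct$corStruct, unconstrained = FALSE)
print(phi)
```

Comparing this fit with one that omits the correlation argument, for example via anova(), is one way to judge whether the autocorrelation term is worth keeping.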

Resources