Why is this the output for my linear model and how can I fix it? - r

I am trying to set up a multivariable linear regression model using R, but the model keeps creating new variables in the output.
Essentially I am trying to find correlations between air quality and different factors such as population, time of day, weather readings, and a few others. For this example, I am looking at multiple different sensor locations over a month's time. I have data on the actual AQI and the weather, and I assumed the population in the area surrounding each sensor doesn't change over time (which might be my problem). The population therefore varies between the different sensors but remains constant over the month. I then combined each sensor's data into a single data frame to fit the regression. The code for my model is below:
model = lm(AQI ~ Time.of.Day + Temp + Humidity + Pressure + pop + ind + rd_dist, data = Krakdata)
The output is given in the picture below. I do not know why it doesn't report a single coefficient for population; instead, it outputs each population value as a separate factor level. Thanks!
Linear Model Output:
Krakdata example. Note how the population does not change until the data moves on to the next sensor:

pop is being treated as a categorical variable. You need to convert it to a numeric type, otherwise each value is treated as a separate category and therefore a separate variable.

pop is stored as a categorical variable (a factor), so R treats it as such and turns it into dummy variables, hence the output. You have to convert it to numeric if this variable is supposed to be numeric in nature/in your analysis.
As to how to convert it:
Krakdata$pop <- as.numeric(as.character(Krakdata$pop))
As to why pop was read in as a factor when it looks numeric, you need to look at your earlier code or at the data itself (quoted numbers or a stray non-numeric value are common causes).
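For illustration, here is a minimal sketch of checking the column type and refitting, assuming the data frame and columns are named as in the question:
# a factor or character class here would explain the dummy-coded output
str(Krakdata$pop)

# after the as.numeric(as.character(...)) conversion above, refit the model:
model <- lm(AQI ~ Time.of.Day + Temp + Humidity + Pressure + pop + ind + rd_dist,
            data = Krakdata)
summary(model)   # pop should now appear as a single coefficient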

Related

GLMM: Needing overall advice on selecting model terms for glmm modelling in R

I would like to create a model to understand how habitat type affects the abundance of bats found; however, I am struggling to understand which terms I should include. I wish to use lme4 to fit a GLMM; I have chosen a GLMM because the response is a Poisson-type count (you can't have half a bat) and the distribution is skewed (lots of single bats).
My dataset is very large and comprises abundance counts recorded by an individual on a bat survey (the bat survey number is not included as it's public data). The dataset includes abundance, year, month, day, environmental variables (temp, humidity, etc.), recorded_habitat, surrounding_habitat, latitude and longitude, and is structured like the example shown below. P.S. occurrence is an anonymous recording made by an observer at a set location, where a number of bats will be recorded; it's not relevant here as it comes from a larger dataset.
occurrence | abundance | latitude | longitude | year | month | day | (environmental variables)
3456       | 45        | 53.56    | 3.45      | 2000 | 5     | 3   | 34.6

surrounding_hab | recorded_hab
A               | B
Recorded habitat and surrounding habitat take letter values (A-I), each corresponding to a habitat type. Also, the table is split in two because it wouldn't fit in the box.
The models shown below are the ones I think are a good choice.
rhab1 <- glmer(individual_count ~ recorded_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab1)
rhab2 <- glmer(individual_count ~ surrounding_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab2)
I'll now explain my questions regarding the models I have chosen, with my current thinking/justification.
Firstly, I am confused about the mix of categorical and numeric variables: is it wise to include the environmental variables given that they are numeric? My current thinking is that scaling the environmental variables allowed the model to converge, so including them is okay (see the sketch just below)?
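(Purely as an illustrative sketch of the kind of scaling meant here; the raw column names temp and humidity are assumptions, not taken from the actual data:)
library(lme4)
# centre each environmental variable and divide by its standard deviation
BLE$temp_sc     <- as.numeric(scale(BLE$temp))
BLE$humidity_sc <- as.numeric(scale(BLE$humidity))
rhab1_sc <- glmer(individual_count ~ recorded_hab + temp_sc + humidity_sc +
                    latitude + longitude + (1|year),
                  family = poisson, data = BLE)
summary(rhab1_sc)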
Secondly, I am confused about the mix of spatial and temporal variables, primarily whether I should include temporal variables when the predictor is itself temporal. I'd like to include year as a random effect, since bat populations in one year directly affect bat populations the next year, and also include latitude and longitude; does this seem wise?
I am also unsure whether latitude and longitude should be random effects. The confusion arises because latitude and longitude do have some effect on the land use.
Additionally, is it wise to include recorded_habitat and surrounding_habitat in the same model? When I have tried this it produces a massive output with a huge correlation matrix, so I'm thinking I should run two models (year ~ recorded_hab) and (year ~ surrounding_hab) and then discuss them separately - hence the two models.
Sorry this question is so broad! Any help or thinking is appreciated, including data restructuring or model term choice. I'm also new to Stack Overflow, so please do advise on question layout/rules etc. if there are glaringly obvious mistakes.

Interacting regressors in a BQ ML Linear Regression Model

I'm trying to work out how to get two regressors to interact when using BigQuery ML.
In the example below (apologies for the rough fake data!), I'm trying to predict total_hire_duration using trip_count as well as the month of the year. BQ tends to treat the month part as a constant added on to the linear regression equation, but I actually want its effect to grow with trip_count. For my real dataset I can't just supply the timestamp, as BQML seems to over-parameterise.
I should add that if I supply month as a numeric value I just get a single coefficient, which doesn't really work for my dataset (patterns form around parts of the academic year rather than the calendar year).
If the month part is a constant, then as trip_count gets very large, the constant in the equation y = ax + b becomes inconsequential. It's almost as if I want something like y = ax + bx + c, where x is trip_count, a is its overall slope, and b is a slope adjustment that depends on the value of month.
This is quite easy to do in R; I'd just run
glm(bike$totalHireDuration ~ bike$tripCount:bike$month)
Here's some fake data to reproduce:
CREATE OR REPLACE MODEL
my_model_name OPTIONS (model_type='linear_reg',
input_label_cols = ['total_hire_duration']) AS (
SELECT
CAST(EXTRACT(MONTH FROM DATE(start_date)) AS STRING) month,
COUNT(*) trip_count,
SUM(duration_sec) total_hire_duration
FROM
`bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY
DATE(start_date), month)
Any help would be greatly appreciated!
First, note that it is almost always a bad idea to fit a model such as:
glm(bike$totalHireDuration ~ bike$tripCount:bike$month)
which fits only the interaction without the main effects.
But getting to the point of the question: I can't help with BigQuery in particular, but in any software you can fit an interaction between two variables simply by creating a new variable that is the product of the two and then using that as a regressor.
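In R terms, a minimal sketch of both routes (column names are borrowed from the question, except month_num, a hypothetical numeric month column):
# formula route: main effects plus interaction; with month as a factor this
# fits a separate tripCount slope adjustment for each month
fit_formula <- glm(totalHireDuration ~ tripCount * month, data = bike)

# hand-built route: a product column used as an ordinary regressor
# (straightforward when the second variable is numeric; a categorical month
# would need one product column per dummy level)
bike$trip_by_month <- bike$tripCount * bike$month_num
fit_product <- glm(totalHireDuration ~ tripCount + month_num + trip_by_month, data = bike)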
It looks like month in your model is a string, so the BQML linear regression model will treat it as a categorical feature. If you want month as an integer, you can try CAST(EXTRACT(MONTH FROM DATE(start_date)) AS INT64) month.

How can I perform a simple linear regression on this data?

So I would like to create a linear regression model of rocket price (written as rocket) against the date of launch (datum). I believe I can do this with lm(Y ~ X). However, how would I be able to convert the prices from chr to num, and likewise for the dates?
Thank you!
Data: https://www.kaggle.com/agirlcoding/all-space-missions-from-1957
Effectively you are asking three different but very basic questions, the answers to which would be better learned from an introductory text than by posting a question on Stack Overflow.
How do I convert character data to numeric data for the Rocket column?
Depending on what version of R you are using, the column spaceData$Rocket will be either a character vector or a factor vector. To cover both eventualities, you can do:
spaceData$Rocket <- as.numeric(as.character(spaceData$Rocket))
This will give you a warning that some NA values were produced. That's OK - there are some blank cells in the column, so you want these to be NA.
How do I convert the column spaceData$Datum from text to actual date times?
In this case, you can use strptime, and specify how the date string is formatted. We will also wrap this in as.POSIXct to ensure that the data is formatted in a way that is easier to plot:
spaceData$Datum <- as.POSIXct(strptime(spaceData$Datum, "%a %b %d, %Y %H:%M"))
How do I do a linear regression using these two variables?
Before you attempt a linear regression, it is a good idea to make sure it is sensible to do a linear regression. For a linear regression to make sense, you should know that there is an approximately linear relationship between the two variables, and that the residuals are approximately normally distributed. An easy way to examine these assumptions is to plot the two variables:
plot(spaceData$Datum, spaceData$Rocket)
You don't need to be a statistician to see that any straight line through these points is going to be pretty hopeless as a description of the relationship. If we try it, we can see that:
abline(lm(Rocket ~ Datum, data = spaceData), col = "red")
So, by running a linear regression on this data, we can predict that the price of rockets will fall to zero on the 13th May 2036. Clearly this is nonsense.
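For the curious, a rough sketch of how that date falls out of the fitted coefficients (the exact result will depend on the snapshot of the Kaggle data you download):
fit <- lm(Rocket ~ Datum, data = spaceData)
# Datum is POSIXct, which lm() treats as seconds since 1970-01-01,
# so the fitted line crosses zero at Datum = -intercept / slope
zero_price_time <- -coef(fit)[1] / coef(fit)[2]
as.POSIXct(zero_price_time, origin = "1970-01-01")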

Syntax for survival analysis with late-entry

I am trying to fit a survival model with left-truncated data using the survival package; however, I am unsure of the correct syntax.
Let's say we are measuring the effect of age when hired (age) and job type (parttime) on the duration of employment of doctors in public health clinics. Whether the doctor quit or was censored is indicated by the censor variable (0 for quitting, 1 for censoring). This behaviour was measured in an 18-month window. Time to either quitting or censoring is indicated by two variables, entry (start time) and exit (stop time), indicating how long, in years, the doctor was employed at the clinic. If doctors commenced employment after the window 'opened', their entry time is set to 0. If they commenced employment prior to the window 'opening', their entry time represents how long they had already been employed in that position when the window 'opened', and their exit time is how long from when they were initially hired until they either quit or were censored by the window 'closing'. We also postulate a two-way interaction between age and duration of employment (exit).
This is the toy data set. It is much smaller than a normal dataset would be, so the estimates themselves are not as important as whether the syntax and the variables included (using the survival package in R) are correct, given the structure of the data. The toy data has the exact same structure as a dataset discussed in Chapter 15 of Singer and Willett's Applied Longitudinal Data Analysis. I have tried to match the results they report, without success. There is not a lot of explicit information online on how to conduct survival analyses on left-truncated data in R, and the website that provides code for the book (here) does not provide R code for the chapter in question. The methods for modeling time-varying covariates and interaction effects are quite complex in R, and I just wonder if I am missing something important.
Here is the toy data
id <- 1:40
entry <- c(2.3,2.5,2.5,1.2,3.5,3.1,2.5,2.5,1.5,2.5,1.4,1.6,3.5,1.5,2.5,2.5,3.5,2.5,2.5,0.5,rep(0,20))
exit <- c(5.0,5.2,5.2,3.9,4.0,3.6,4.0,3.0,4.2,4.0,2.9,4.3,6.2,4.2,3.0,3.9,4.1,4.0,3.0,2.0,0.2,1.2,0.6,1.9,1.7,1.1,0.2,2.2,0.8,1.9,1.2,2.3,2.2,0.2,1.7,1.0,0.6,0.2,1.1,1.3)
censor <- c(1,1,1,1,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,rep(1,20))
parttime <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0)
age <- c(34,28,29,38,33,33,32,28,40,30,29,34,31,33,28,29,29,31,29,29,30,37,33,38,34,37,37,40,29,38,49,32,30,27,35,34,35,30,35,34)
doctors <- data.frame(id,entry,exit,censor,parttime,age)
Now for the model.
coxph(Surv(entry, exit, 1-censor) ~ parttime + age + age:exit, data = doctors)
Is this the correct way to specify the model given the structure of the data and what we want to know? An answer here suggests it is correct, but I am not sure whether, for example, the interaction variable is correctly specified.
As is often the case, it's not until I post a question about a problem on SO that I work out how to do it myself. If there is an interaction with a time predictor, we need to convert the dataset into a counting-process, person-period format (i.e. a long format). This is because each participant needs an interval that tracks their status with respect to the event for every time point at which the event occurred to anyone else in the data set, up to the point when they exited the study.
First let's make an event variable
doctors$event <- 1 - doctors$censor
Before we run the Cox model we need to use the survSplit function in the survival package. To do this we need to make a vector of all the time points at which an event occurred
cutPoints <- sort(unique(doctors$exit[doctors$event == 1]))  # sort(), not order(): we want the event times themselves, not their positions
Now we can pass this into the survSplit function to create a new dataset...
docNew <- survSplit(Surv(entry, exit, event)~.,
data = doctors,
cut = cutPoints,
end = "exit")
... which we then run our model on
coxph(Surv(entry,exit,event) ~ parttime + age + age:exit, data = docNew)
Voila!
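As a quick, purely illustrative sanity check that the reshape did what we expect:
# each doctor should now have one row per risk interval overlapping their follow-up
head(docNew[docNew$id == 1, ])
nrow(docNew)   # many more rows than the original 40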

How to structure stratified data for Poisson regression

I'm trying to use R to conduct Poisson regression on some data that I have. The current structure of the data is as follows:
Data is stratified based on three occupations. There are four levels of income in the data. Within each stratum, for each level of income, there is the number of workplace accidents that have occurred and the total man-months observed.
Here's an example of the setup. The number in parentheses is the total man months observed and the number not in parentheses is the number of workplace accidents.
My question is how do I set up this data and perform a Poisson regression on the effect of income level on the occurrence of workplace accidents? Ideally I would like to adjust for occupation and find out the effect of only income, but as a starting point, I'm not sure how to set it up as a Poisson regression problem at all. I thought about doing something like dividing the number of injuries by the months of observation, but then that gives non-integer values so I assume that's not the right thing to do.
To reiterate, predictor: income level; response variable: workplace accidents.
BTW, it would be very easy to separate the parentheses numbers and put them into their own column, if that would make sense to do.
I'd really appreciate any suggestions on how to set this up. I am sure other statisticians are working with similarly structured data and might like to gain some insight as well. Thanks so much!
@thelatemail might be correct in thinking this is better suited for stats.stackexchange.com, but here is some R code. That data is in wide format and you need to restructure it to long format (and you will not want to include the totals columns). After converting the first four columns to a long format where you have 'occupation' and 'level' as factor-class variables, and accident 'counts' and exposure 'months' as numeric columns, you could use this call to glm:
fit <- glm( counts ~ level + occup + offset(log(months)), data=dfrm, family="poisson")
The offset needs to be log()-ed because the default link for the poisson family is the log link: the model works on the scale of the log of the expected count, so the exposure (months) has to enter on that same log scale.
(You cannot really expect us to redo that data entry task, now can you?)
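Purely as an illustration of the wide-to-long restructuring (every name and number below is made up rather than taken from the question's table, and only two income levels are shown for brevity):
# hypothetical wide table: one row per occupation, one pair of columns per income level
wide <- data.frame(
  occup    = c("admin", "driver", "labourer"),
  counts_1 = c(2, 5, 9),  months_1 = c(120, 300, 400),
  counts_2 = c(3, 4, 7),  months_2 = c(150, 280, 350)
)

# reshape to long format: one row per occupation x income level
long <- reshape(wide, direction = "long",
                varying = list(c("counts_1", "counts_2"),
                               c("months_1", "months_2")),
                v.names = c("counts", "months"),
                timevar = "level", idvar = "occup")
long$level <- factor(long$level)
long$occup <- factor(long$occup)

fit <- glm(counts ~ level + occup + offset(log(months)),
           data = long, family = "poisson")
summary(fit)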

Resources