Account for different variable lengths after differenciation in a linear Model in R - r

Struggling with R.
My linear lm() Model contains Variables that are differenced via diff() and variables which are not differenced. The differenced variables are one observation shorter due to differencing.
Therefore, the lm() gives the error message of different variable lengths.
My idea of a solution to this error is, to somehow define the variables as time series (which they are anyway, but R doesnt know that) and then tell the lm-Model exactly, which years to use (yearly data).
In my understanding, after differencing, a time-series looses its first observation, so, as I use the ts()-Funktion, I shall set a starting year one year later for the differenced function.
More concrete:
lets say I imported the variables x and y
then I go
dx<-diff(x, lag=1, differences=1)
while y remains the same
lm(y~dx)
will then produce the said error.
Lets say x and y both start in 1900. Then dx starts in 1901, so that lm has to start in 1901 for all Variables.
My idea is, as stated above, to explicitly make both variables a time-series
tsdx<-ts(dx, frequency=1, start=1901)
tsy<-ts(y, frequency=1, start:1900)
and then somehow tell lm() to start with the year 1901.
Is this a good way to deal with these Problems? And how would I code the last step? Thanks alot!

Related

How to create and analyze a time series with variable test frequency in R

Here is a short description of the problem I am trying to solve: I have test data for multiple variables (weight, thickness, absorption, etc.) that are taken at varying intervals over time - no set schedule, sometimes a test a day, sometimes days might go between tests. I want to detect trends in each of these and alert stake holders when any parameter is trending up/down more than a certain amount. I first did a linear model between each variable's raw data and test time (I converted the test time to days or weeks since a fixed date) and create a table with slopes for each variable - so the stake holders can view one table for all variables and quickly see if any of them is raising concern. The issue was that the data for most variables is very noisy. Someone suggested using time series functions, separating noise and seasonality from the trends, and studying the trend component for a cleaner analysis. I started to look into this and see a couple concerns/questions already:
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (ie. take out particularly the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable? Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
Finally, is there a better way of accomplishing what I explained above? I am only looking for general trends over time - say over 3 months of data, not day to day trends.
Time series are generally used to see if previous observations of a variable have influence on future observations. You would model under the assumption that the previous observations are able to predict the future observations. That is the reason for that most (not all) time series models require evenly spaced instances of training data. If your data is not only very noisy, but also not collected on a regular basis, then you should seriously consider if time series is the appropriate choice of modelling.
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals.
What you can do, is creating an aggregate by increasing the time bucket (shift from daily data to a weekly average for instance) such that every unit of time has an instance of training data. Following your final comment, you could create the average of the observations of the last 3 months of data instead from the observations.
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (ie. take out particularly the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable?
In the simplest case of a linear model, the independent variable is the unit of time corresponding to the prediction you are trying to make. However this is not always regarded a time series model.
In the case of an autoregressive model, this would be the previous observation of what you are trying to predict, something similar to y(t) = x(t-1), for instance multiplied by a smoothing factor. I encourage you to read Forecasting: principles and practice which is an excellent book on the matter.
Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
The function decompose.ts returns a list which includes trend. Trend is a vector of the estimated trend components corresponding to it's respective time value.
Let's create an example time series with linear trend
df <- data.frame(
date = seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-10"), by=1)
)
df$value <- jitter(seq(from = 1, to = nrow(df), by=1))
time_series <- ts(df$value, frequency = 5)
df$trend <- decompose(time_series)$trend
> df
date value trend
1 2021-01-01 0.9170296 NA
2 2021-01-02 1.8899565 NA
3 2021-01-03 3.0816892 2.992256
4 2021-01-04 4.0075589 4.042486
5 2021-01-05 5.0650478 5.046874
6 2021-01-06 6.1681775 6.051641
7 2021-01-07 6.9118942 7.074260
8 2021-01-08 8.1055282 8.041628
9 2021-01-09 9.1206522 NA
10 2021-01-10 9.9018900 NA
As you see, the trend component already is an estimate of the dependent variable at the corresponding time. In decompose the estimate of trend is based on a moving average.

How can I peform a simple linear regression model on this data?

So I would like to create a linear regression model, with rocket price (written as rocket) against the data of launch (datum). I believe I can do this by doing: lm(Y ~ X). However, how would I be able to convert the prices from chr to num, and likewise for the dates?
Thank you!
Data: https://www.kaggle.com/agirlcoding/all-space-missions-from-1957
Effectively you are asking 3 different but very basic questions, which would be better learned by reading an introductory text than by posting a question on Stack Overflow.
How do I convert character data to numeric data for the Rocket column?
Depending on what version of R you are using, the column spaceData$Rocket will be either a character vector or a factor vector. To cover both eventualities, you can do:
spaceData$Rocket <- as.numeric(as.character(spaceData$Rocket))
This will give you a warning that some NA values were produced. That's OK - there are some blank cells in the column, so you want these to be NA.
How do I convert the column spaceData$Datum from text to actual date times?
In this case, you can use strptime, and specify how the date string is formatted. We will also wrap this in as.POSIXct to ensure that the data is formatted in a way that is easier to plot:
spaceData$Datum <- as.POSIXct(strptime(spaceData$Datum, "%a %b %d, %Y %H:%M"))
How do I do a linear regression using these two variables?
Before you attempt a linear regression, it is a good idea to make sure it is sensible to do a linear regression. For a linear regression to make sense, you should know that there is an approximately linear relationship between the two variables, and that the residuals are approximately normally distributed. An easy way to examine these assumptions is to plot the two variables:
plot(spaceData$Datum, spaceData$Rocket)
You don't need to be a statistician to see that any straight line through these points is going to be pretty hopeless as a description of the relationship. If we try it, we can see that:
abline(lm(Rocket ~ Datum, data = spaceData), col = "red")
So, by running a linear regression on this data, we can predict that the price of rockets will fall to zero on the 13th May 2036. Clearly this is nonsense.

Why is this the output for my linear model and how can I fix it?

I am trying to setup a multivariable linear programming model using R but the model keeps creating new variables in the output.
Essentially I am trying to find correlations between air quality and different factors such as population, time of day, weather readings, and a few others. For this example, I am looking at multiple different sensor locations over a months time. I have data on the actual AQI, weather data, and assumed the population in the area surrounding the sensor doesn't change over time (which might be my problem). Therefore, the population varies between the different sensor, however remains constant over the months. I then combined each sensors data into a data frame to conduct the linear programming. The code for my model is below:
model = lm(AQI ~ Time.of.Day + Temp + Humidity + Pressure + pop + ind + rd_dist, data = Krakdata)
The output is given in the picture below. I do not know why it doesn't come up with just population as an output. Instead, it outputs each population reading as another factor. Thanks!
Linear Model Output:
Krakdata example. Note how the population will not change until the next sensor comes up:
pop is a categorical variable. You need to convert it to an integer, otherwise each value will be treated as a separated category and therefore separate variable.
pop is a categorical variable, hence R treats it as such. R turns the pop variable into dummy variable, therefore the output. You have to convert it to numeric if this variable is supposed to be numeric in nature/in your analysis.
As to how to convert it:
Krakdata$pop <- as.numeric(as.character(Krakdata$pop))
As to how pop is read as factors while it resembles numbers, you need to look into your previous code or to the data itself.

How to structure stratified data for Poisson regression

I'm trying to use R to conduct Poisson regression on some data that I have. The current structure of the data is as follows:
Data is stratified based on three occupations. There are four levels of income in the data. Within each stratum, for each level of income there is
the number of workplace accidents that have occurred, and
the total man months observed.
Here's an example of the setup. The number in parentheses is the total man months observed and the number not in parentheses is the number of workplace accidents.
My question is how do I set up this data and perform a Poisson regression on the effect of income level on the occurrence of workplace accidents? Ideally I would like to adjust for occupation and find out the effect of only income, but as a starting point, I'm not sure how to set it up as a Poisson regression problem at all. I thought about doing something like dividing the number of injuries by the months of observation, but then that gives non-integer values so I assume that's not the right thing to do.
To reiterate, predictor: income level; response variable: workplace accidents.
BTW, it would be very easy to separate the parentheses numbers and put them into their own column, if that would make sense to do.
I'd really appreciate any suggestions on how to set this up. I am sure other statisticians are working with similarly structured data and might like to gain some insight as well. Thanks so much!
#thelatemail might be correct in think this to be better suited for stats.stackexchange.com but here is some R code. That data is in wide format and you need to re-structure it to long format. (And you will not want to include the totals columns. After converting the first four columns to a long format where you had 'occupation' and 'level' as factor-class variables, and accident 'counts' and exposure 'months' as numeric columns, you could use this call to glm.
fit <- glm( counts ~ level + occup + offset(log(months)), data=dfrm, family="poisson")
The offset needs to be log()-ed to agree with the logged counts created by the default link function for the poisson-family.
(You cannot really expect us to redo that data entry task, now can you?)

Evolution of linear regression coefficients over time

I would like to observe the evolution of the linear regression coefficients over time. To be more precise, let's have a time frame of 2 years where the linear regression will always use the data set with a range of 1 year. After the first regression, we move one week further (i.e. we add a new week, but one is also subtracted from the beginning) and do the regression again as long as we reach the final date: altogether, there will be 52 regressions.
My problem is that there are some holidays in the data set and we can not simply add 7 days as one would easily suggest. I would like to have some wrapper function that would do aforementioned for many other functions from different packages, for example forecast.lm() from the forecast package or any function that one can think of: the objective in every case would be to find the evolution of the linear regression parameters week-by-week.
I think you might get more answers if you edit/subdivide your question in a clear way. (1) how do I find holidays (it's not clear what your definition of holidays is)? (2) how do I slice up a data set accordingly? (3) how do I run a linear regression in each chunk?
(1) find holidays: can't really help here, as I don't know how they're defined/coded in your data set. library(sos); findFn("holiday") finds some options
(2) partition the data set according to inter-holiday/weekend intervals. The example below supposes holidays are coded as 1 and non-holidays are coded as zero.
(3) run the linear regression for each chunk and extract the coefficients.
d <- data.frame(holiday=c(0,0,0,1,1,0,0,0,0,1,0,0,0,0),
x=runif(14),y=runif(14))
per <- cumsum(c(1,diff(d$holiday)==-1)) ## maybe use rle() instead
dd <- with(d,split(subset(d,!holiday),per[!holiday]))
t(sapply(lapply(dd,lm,formula=y~x),coef))

Resources