I'm trying to work out how to get two regressors to interact when using BigQuery ML.
In the example below (apologies for the rough fake data!), I'm trying to predict total_hire_duration using trip_count as well as the month of the year. BQ tends to treat the month part as a constant added on to the linear regression equation, but I actually want it to grow with trip_count. For my real dataset I can't just supply the timestamp, as BQML seems to over-parameterise.
I should add that if I supply month as a numeric value I just get a single coefficient, which doesn't really work for my dataset (patterns form around parts of the academic year rather than the calendar year).
If the month part is a constant, then as trip_count gets very large, the constant in the equation y = ax + b becomes inconsequential. It's almost as if I want something like y = ax + bx + c, where x is trip_count and b is a coefficient that depends on the value of month.
This is quite easy to do in R; I'd just run
glm(bike$totalHireDuration ~ bike$tripCount:bike$month)
Here's some fake data to reproduce:
CREATE OR REPLACE MODEL
  my_model_name
OPTIONS (model_type = 'linear_reg',
         input_label_cols = ['total_hire_duration']) AS (
SELECT
  CAST(EXTRACT(MONTH FROM DATE(start_date)) AS STRING) AS month,
  COUNT(*) AS trip_count,
  SUM(duration_sec) AS total_hire_duration
FROM
  `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY
  DATE(start_date), month)
Any help would be greatly appreciated!
First, note that it is almost always a bad idea to fit a model such as:
glm(bike$totalHireDuration ~ bike$tripCount:bike$month)
which fits only the interaction without the main effects.
But getting to the point of the question: I can't help with BigQuery in particular, but in any software you can fit an interaction between two variables simply by creating a new variable that is the product of the two and then using that as a regressor.
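In R, for example, a minimal sketch (the bike data frame and its column names are taken from the question; treating month as a factor is an assumption): the * operator in a formula expands to the main effects plus their interaction, so each month gets its own slope on tripCount.

bike$month <- factor(bike$month)
# * expands to tripCount + month + tripCount:month, i.e. main effects
# plus the interaction, which is usually what you want
fit <- glm(totalHireDuration ~ tripCount * month, data = bike)
summary(fit)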
It looks like month in your model is a string, so the BQML linear regression model will treat it as a categorical feature. If you want month as an integer, you can try CAST(EXTRACT(MONTH FROM DATE(start_date)) AS INT64) month
Related
I am trying to create (as the title suggests) a rolling linear regression on a set of data: daily returns of two variables, 257 observations each, joined by date, with a rolling window of 100 observations. I have searched for rolling regression packages, but I have not found one that works on my data. The two series are stored within one data frame.
Also, I am pretty new to programming, so any advice would help.
Some of the code I have used is below.
library(dplyr)

WeightedIMV_VIX_returns_combined_ID100896 <- left_join(ID100896_WeightedIMV_daily_returns,
                                                       ID100896_VIX_daily_returns,
                                                       by = c("Date"))
head(WeightedIMV_VIX_returns_combined_ID100896, n = 20)
# The data seem correlated enough to run a regression (order doesn't matter here)
plot(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896)
ID100896.lm <- lm(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896)
summary(ID100896.lm)  # estimated intercept is 1.2370, estimated slope is 5.8266
termplot(ID100896.lm)
Again, sorry if this code is poor, or if I am missing any information that some of you may need to help. This is my first time on here! Just let me know what I can do better. Thanks!
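For what it's worth, a minimal sketch of one way to do this with rollapply from the zoo package (the package choice is an assumption; the data frame and column names are taken from the code above):

library(zoo)

df <- WeightedIMV_VIX_returns_combined_ID100896
# Slide a 100-observation window over the row indices; refit the
# regression in each window and keep the intercept and slope
roll_coefs <- rollapply(
  seq_len(nrow(df)),
  width = 100,
  FUN = function(idx) coef(lm(WeightedIMV_returns ~ VIX_returns, data = df[idx, ])),
  by.column = FALSE,
  align = "right")
head(roll_coefs)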
I am trying to set up a multivariable linear regression model using R, but the model keeps creating new variables in the output.
Essentially I am trying to find correlations between air quality and different factors such as population, time of day, weather readings, and a few others. For this example, I am looking at multiple sensor locations over a month's time. I have data on the actual AQI and the weather, and I assumed the population in the area surrounding each sensor doesn't change over time (which might be my problem). The population therefore varies between sensors but remains constant over the month. I then combined each sensor's data into one data frame to fit the linear model. The code for my model is below:
model = lm(AQI ~ Time.of.Day + Temp + Humidity + Pressure + pop + ind + rd_dist, data = Krakdata)
The output is given in the picture below. I do not know why it doesn't come up with just population as an output; instead, it outputs each population reading as another factor. Thanks!
Linear Model Output:
Krakdata example. Note how the population will not change until the next sensor comes up:
pop is a categorical variable. You need to convert it to an integer; otherwise each value will be treated as a separate category and therefore a separate variable.
pop is a categorical variable, hence R treats it as such: R turns the pop variable into dummy variables, hence the output. You have to convert it to numeric if this variable is supposed to be numeric in nature/in your analysis.
As to how to convert it:
Krakdata$pop <- as.numeric(as.character(Krakdata$pop))
As to why pop is read as a factor while it resembles numbers, you need to look into your previous code or at the data itself.
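A hypothetical illustration of why the as.character() step matters: calling as.numeric() directly on a factor returns its internal level codes, not the original values.

f <- factor(c("100", "250", "100"))
as.numeric(f)                # returns 1 2 1 (the level codes, not the data)
as.numeric(as.character(f))  # returns 100 250 100 (the actual values)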
We have a large data set with 26 brands, sold in 93 stores, over 399 weeks. The brands are further divided into 556 sub-brands (e.g. brand = Colgate, with sub-brands Colgate Premium White, Colgate Extra, etc.).
We calculated for each sub-brand a brand-shared (share-weighted) price on a weekly store level:
weighted price = (move per ounce for the sub-brand, per store per week) / (sum of move per ounce over all sub-brands of that brand, per store per week) × (log price per ounce for the sub-brand, per store per week)
Everything worked! We created a data frame with all the detailed calculations (data = tooth4). Our final interest is to run a linear regression to predict the influence of price on the move variable.
--> the problem now is that the sale variable (a dummy indicating whether there is a promotion in a specific week for a specific sub-brand in a specific store) is on sub-brand level
--> we tried to run a regression on sub-brand level (variable = descrip), but it doesn't work because the data set is too big
lm(formula = logmove_ounce ~ log_wei_price_ounce + descrip - 1 *
(log_wei_price_ounce) + sale - 1, data = tooth4)
logmove_ounce = log of weekly sub-brand-based move on store level
log_wei_price_ounce = weighted sub-brand-based price for each store for each week
sale - 1 = fixed effect for promotion
descrip - 1 = fixed effect for sub-brand
Does anyone have a solution for how to run a regression on brand level only but still include the promotion variable?
We got a hint that we could calculate a shared value of promotion for each brand in each store. But how?
Another question, assuming my regression is right/partly right: how can I weight the results to get them on store level only, not weekly store level?
Thank you in advance!
This is variously called a multilevel model, a nested model, a hierarchical model, a mixed model, or a random-effects model; they are all the same mathematical model. It is widely used to analyze the kind of longitudinal panel data you describe. A serious book on the subject is Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models.
The most common approach in R is to use the lmer() function from the lme4 package. If you're using lme4 on uncomfortably large data, you should read their performance tips.
lmer() models accept a slightly different formula syntax, which I'll describe only briefly so that you can see how it can solve the problems you're having.
For example, let's assume we're modeling future salary as a function of the GPA and IQ of certain students. We know that students come from certain schools, so all students who go to the same school form a group, and schools are in turn grouped into counties and states. Furthermore, students graduate in different years, which may have an effect. This is a generic example, but I chose it because it shares many characteristics with your own longitudinal panel data.
We can use the generalized formula syntax to specify groups with a varying intercept:
lmer(salary ~ gpa + iq + (1|school), data=df)
A nested hierarchy of such groups:
lmer(salary ~ gpa + iq + (1|state/county/school), data=df)
Or group-varying slopes to capture changes over time:
lmer(salary ~ gpa + iq + (1 + year|school), data=df)
You'll have to make your own decisions about how to model your data, but lme4::lmer() will give you a larger toolbox than lm() for dealing with groups and levels. I'd recommend asking on https://stats.stackexchange.com/ if you have questions about the modeling side.
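To make that concrete for your data, a hedged sketch (the column names brand, store, and week are assumptions; tooth4 and the other names come from your question). It also shows one way to build the hinted "shared promotion" value: the share of a brand's sub-brands on promotion in each store-week.

library(dplyr)
library(lme4)

# Shared promotion per brand/store/week: the fraction of that brand's
# sub-brands on promotion (sale is the 0/1 promotion dummy)
tooth4 <- tooth4 %>%
  group_by(brand, store, week) %>%
  mutate(brand_promo_share = mean(sale)) %>%
  ungroup()

# Price and promotion share as fixed effects; random intercepts for
# sub-brand (descrip) nested within brand
fit <- lmer(logmove_ounce ~ log_wei_price_ounce + brand_promo_share +
              (1 | brand/descrip), data = tooth4)
summary(fit)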
I have a series of algorithms I am running on financial data. For the purposes of this question I have financial market data for a stock with 1226 rows of data.
I run the following code to fit the model and generate predictions:
strat.fit <- glm(DirNDay ~ l_UUP.Close + l_FXE.Close + MA50 + MA10 + RSI06 + BIAS10 + BBands05,
                 data = STCK.df, family = "binomial")
strat.probs <- predict(strat.fit, STCK.df, type = "response")
I get probability predictions up to row 1226. I am interested in making a prediction for a new day, which would be 1227. I get the following response when I attempt a prediction for day 1227:
strat.probs[1227]
NA
Any help/suggestions would be appreciated
The predict function is going to predict the value of DirNDay based on the value of the other variables for that day. If you want it to predict DirNDay for a new day, then you need to provide it with all the other relevant variables for that new day.
It sounds like that's not what you're trying to do, and you need to create a totally different model which uses time (or day) to predict the values. Then you can provide predict with a new time and it can use that to predict a new DirNDay.
There's a free online textbook about forecasting using R by Rob Hyndman if you don't know where to start: https://www.otexts.org/fpp
(But if I totally misunderstood that glm model then nevermind those last two paragraphs.)
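For illustration, a minimal sketch of that time-series route with Hyndman's forecast package (stck_series is an assumed stand-in for your daily series):

library(forecast)

# Fit an automatically selected ARIMA model to the daily series and
# forecast one step ahead, i.e. day 1227
fit <- auto.arima(stck_series)
forecast(fit, h = 1)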
In order to make a prediction for the 1227th day, you'll need to know what the values of your explanatory variables (MA50, MA10, etc.) will be on that day. Store those as a new data frame (say STCK.df.new) and pass that to your predict function:
STCK.df.new <- data.frame(l_UUP.Close = .4, l_FXE.Close = 2, ... )
strat.probs <- predict(strat.fit, STCK.df.new, type = "response")
I would like to observe the evolution of linear regression coefficients over time. To be more precise, take a time frame of two years, where each linear regression uses a data window of one year. After the first regression, we move one week further (i.e. a new week is added at the end and one week is dropped from the beginning) and run the regression again, until we reach the final date: altogether there will be 52 regressions.
My problem is that there are holidays in the data set, so we cannot simply add 7 days as one might suggest. I would like a wrapper function that does the above for many other functions from different packages, for example forecast.lm() from the forecast package, or any function one can think of: the objective in every case is to trace the evolution of the linear regression parameters week by week.
I think you might get more answers if you edit/subdivide your question in a clear way. (1) how do I find holidays (it's not clear what your definition of holidays is)? (2) how do I slice up a data set accordingly? (3) how do I run a linear regression in each chunk?
(1) find holidays: can't really help here, as I don't know how they're defined/coded in your data set. library(sos); findFn("holiday") finds some options
(2) partition the data set according to inter-holiday/weekend intervals. The example below supposes holidays are coded as 1 and non-holidays as 0.
(3) run the linear regression for each chunk and extract the coefficients.
# Toy data: 14 days, with holidays coded as 1
d <- data.frame(holiday = c(0,0,0,1,1,0,0,0,0,1,0,0,0,0),
                x = runif(14), y = runif(14))
# Number the inter-holiday periods: increment whenever a holiday run ends
per <- cumsum(c(1, diff(d$holiday) == -1))  ## maybe use rle() instead
# Drop the holidays themselves and split the remaining rows by period
dd <- with(d, split(subset(d, !holiday), per[!holiday]))
# Fit y ~ x in each chunk and collect the coefficients as rows
t(sapply(lapply(dd, lm, formula = y ~ x), coef))
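And for the weekly-sliding one-year window described in the question, a minimal date-based sketch (a data frame df with columns date, x, and y is an assumption); selecting each window by date rather than by adding 7 rows sidesteps the holiday-gap problem:

# One window start per week, leaving room for a full year of data
starts <- seq(min(df$date), max(df$date) - 365, by = "1 week")
# Refit the regression on each one-year window; each row of coefs holds
# the intercept and slope for one window
coefs <- t(sapply(starts, function(s) {
  w <- df[df$date >= s & df$date < s + 365, ]
  coef(lm(y ~ x, data = w))
}))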