SVM in R (e1071): Give more recent data higher influence (weights for support vector machine?) - r

I'm working with Support Vector Machines from the e1071 package in R. This is my first project using SVM.
I have a dataset containing the order histories of ~1k customers over 1 year and I want to predict customer purchases. For every customer I have the information whether a certain item (out of ~50) was bought or not in a certain week (for all 52 weeks, i.e. 1 year).
My goal is to predict next month's purchases for every single customer.
I believe that a purchase, say, 1 month ago is more meaningful for my prediction than a purchase 10 months ago.
My question now is how I can give more recent data a higher impact. There is a 'weight' option in the svm function but I'm not sure how to use it.
Can anyone give me a hint? It would be much appreciated!
This is my code:
# Fit model using Support Vector Machines
# install.packages("e1071")
library(e1071)
response <- train[, 5]  # purchases
formula <- response ~ .
tuned.svm <- tune.svm(train, response, probability = TRUE,
                      gamma = 10^(-6:-3), cost = 10^(1:2))
gamma.k <- tuned.svm$best.parameter[[1]]
cost.k <- tuned.svm$best.parameter[[2]]
svm.model <- svm(formula, data = train,
                 type = 'eps-regression', probability = TRUE,
                 gamma = gamma.k, cost = cost.k)
svm.pred <- predict(svm.model, test, probability = TRUE)
Side notes: I'm fitting a model for every single customer. Also, since I'm interested in the probability that customer i buys item j in week k, I set
probability = TRUE

The weights option in the R SVM model is more for assigning weights to deal with imbalanced classes: it is the class.weights parameter and is used to assign weights to the different classes (1/0) in an imbalanced dataset.
To answer your question: to give recent data more weight in an SVM model, a simple trick in the absence of a built-in weighting functionality at the observation level is to repeat the recent rows (i.e. create duplicate rows for recent data), hence indirectly assigning them a higher weight, as sketched below.
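For example, a rough sketch of this duplication trick (the week column and the cut-off weeks below are purely illustrative placeholders, not taken from the original data):
# Hypothetical sketch: replicate recent rows so they count more in the fit
reps <- ifelse(train$week > 48, 3,        # roughly the last month: 3 copies
        ifelse(train$week > 39, 2, 1))    # roughly the last quarter: 2 copies
train.weighted <- train[rep(seq_len(nrow(train)), times = reps), ]
# then fit the SVM on train.weighted instead of train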

Try this package: https://CRAN.R-project.org/package=WeightSVM
It uses a modified version of 'libsvm' and is able to deal with instance weighting. You can assign higher weights to recent data.
For example, suppose you have simulated data (x, y):
library(WeightSVM)
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)
This is an unweighted SVM (99 observations, each with weight 1):
model1 <- wsvm(x, y, weight = rep(1, 99))
In the resulting plot, the blue dots are the unweighted SVM fit and do not fit the first instances well. We want to put more weight on the first several instances.
So we can use a weighted SVM, with weights decreasing along the sequence:
model2 <- wsvm(x, y, weight = seq(99, 1, length.out = 99))
The green dots are the weighted SVM fit and fit the first instances better.
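Transferred to the original question (a sketch only; the weighting schemes and the decay rate are arbitrary choices, not something prescribed by WeightSVM), you could let the instance weight grow with the week index so that recent weeks count more:
# Hypothetical recency weights for 52 weekly observations of one customer
week <- 1:52
w.linear <- week / max(week)                  # linearly increasing weights
w.exp <- exp(-0.1 * (max(week) - week))       # exponentially decaying weights for older weeks
# model <- wsvm(x = X, y = y, weight = w.exp) # X, y: one customer's predictors and response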

Related

survfit for stratified cox-model

I have a stratified cox-model and want predicted survival-curves for certain profiles, based on that model.
Now, because I'm working with a large dataset with a lot of strata, I want predictions for very specific strata only, to save time and memory.
The help-page of survfit.coxph states: ... If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.
When I run the code below, where newdata does contain the stratum variable, I still get predictions for both strata, which contradicts the help page:
library(survival)
df <- data.frame(X1 = runif(200),
                 X2 = sample(c("A", "B"), 200, replace = TRUE),
                 Ev = sample(c(0, 1), 200, replace = TRUE),
                 Time = rexp(200))
testfit <- coxph(Surv(Time, Ev) ~ X1 + strata(X2), df)
out <- survfit(testfit, newdata = data.frame(X1 = 0.6, X2 = "A"))
Is there anything I fail to see or understand here?
I'm not sure if this is a bug or a feature in survival:::survfit.coxph. It looks like the intended behaviour in the code is that only requested strata are returned. In the function:
strata(X2) is evaluated in an environment containing newdata, and the result, A, is returned.
The full curve is then created.
There is then some logic to split the curve into strata, but only if result$surv is a matrix.
In your example it is not a matrix. I can't find any documentation on the expected usage of this if it's not a bug. Perhaps it would be worth dropping the author/maintainer a note.
maintainer("survival")
# [1] "Terry M Therneau <xxxxxxxx.xxxxx#xxxx.xxx>"
Some comments that may be helpful:
My example was not big enough (and I seem not to have read the related GitHub post very well, but that was after I posted my question here): if newdata has at least two rows (and of course the strata variable), predictions are returned only for the requested strata.
There is an inefficiency inside survfit.coxph, where the baseline hazard is calculated for every stratum in the original dataset, not only for the requested strata (see my contribution to the same GitHub post). However, that doesn't seem to be a big issue: a test on a dataset with roughly half a million observations, 50% events and 1000 strata takes less than a minute.
The problem is memory allocation somewhere during calculations (in the above example, things collapse once I want predictions for 100 observations - 1 stratum each - while the final output of predictions for 80 is only a few MB)
My work-around:
Select all observations you want predictions for
use lp <- predict(..., type='lp') to get the linear predictor for all these observations
use survfit only on the first observation: survfit(fit, newdata = expand_grid(newdf, strat = strata_list))
Store the resulting survival estimates in a data.frame (or not, that's up to you)
To calculate predicted survival for other observations, use the PH assumption (see the formula below). This invokes the overhead of survfit.coxph only once, and if you only need survival at a few time points (e.g. 5- and 10-year survival), you can reduce the computing time even more.
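The PH relation referred to above: within a stratum, if S_ref(t) is the predicted survival curve for a reference observation with linear predictor lp_ref, then an observation with linear predictor lp_i has S_i(t) = S_ref(t)^exp(lp_i - lp_ref). A rough sketch of that last step (object names are illustrative):
# Hypothetical sketch of the PH-based rescaling:
# 'base.fit' is the survfit result for the reference observation,
# 'lp' the linear predictors from predict(fit, ..., type = 'lp')
S.ref <- base.fit$surv                        # survival curve of the reference observation
S.all <- outer(S.ref, exp(lp - lp[1]), `^`)   # one column per observation of interest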

Adding a Moving Average component to a GAM model

I have a simple model for which the residuals exhibit auto-correlation beyond one order, and I want to include a moving average component up to third order.
My model is this:
m1 <- gamm(y ~ s(x, k = 5), data = Training)
The time series properties of y show that it follows an MA(3) process (i.e. ARIMA(0,0,3)).
Because the residuals of m1 are auto-correlated, I want to include a moving average component in m1.
The answers for similar questions talk only about an AR(1) process, which is not my case.
You can use the corARMA(p, q) correlation structure in package nlme for this. corAR1() is just a special case, as certain computational efficiencies exist for that particular model.
You have to pass q and/or p for the order of the ARMA(p, q) process, with p specifying the order of the AR terms and q the order of the MA terms. You also need to pass in a variable that orders the observations. Assuming you have a single time series and want the MA process to operate over the entire series (rather than, say, within years but not between them), you should create a time variable that indexes the order of the observations; here I assume this variable is called time.
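For a single, regularly ordered series, such a time index could be created as simply as (a sketch, assuming the rows of Training are already sorted in time order):
Training$time <- seq_len(nrow(Training))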
Then the call is:
library(mgcv)
m1 <- gamm(y ~ s(x, k = 5), data = Training,
           correlation = corARMA(q = 3, form = ~ time))
When looking at the residuals, be sure to extract the normalized residuals from the $lme component of the fitted gamm object, as those will include the effect of the estimated MA process:
resid(m1$lme, type = "normalized")
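As a quick check (a sketch, not part of the original answer), you can look at the ACF of these normalized residuals to see whether the MA(3) structure has absorbed the autocorrelation:
# residual autocorrelation of the corARMA fit; the lme component of the
# gamm object holds the estimated correlation structure
acf(resid(m1$lme, type = "normalized"))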

Find the nearest neighbor using caret

I'm fitting a k-nearest neighbor model using R's caret package.
library(caret)
set.seed(0)
y = rnorm(20, 100, 15)
predictors = matrix(rnorm(80, 10, 5), ncol=4)
data = data.frame(cbind(y, predictors))
colnames(data)=c('Price', 'Distance', 'Cost', 'Tax', 'Transport')
I left one observation as the test data and fit the model using the training data.
id = sample(nrow(data)-1)
train = data[id, ]
test = data[-id,]
knn.model = train(Price~., method='knn', train)
predict(knn.model, test)
When I display knn.model, it tells me it uses k=9. I would love to know which 9 observations are actually the "nearest" to the test observation. Besides manually calculating the distances, is there an easier way to display the nearest neighbors?
Thanks!
When you use kNN you are grouping points that are close to each other based on the independent variables. Normally this is done with train(Price ~ ., method = 'knn', train), so that the model chooses the best prediction based on some criterion (taking the dependent variable into account as well). Since I have not checked whether the fitted R object stores the predicted price for each training observation, I simply used the trained model to predict the expected price for them (i.e. where each expected price is located in that space).
In the end, the dependent variable is just a representation of all the other variables in a common space, where the associated prices are assumed to be similar, since you group based on proximity.
As a summary of steps, you need to calculate the following:
Get the predicted price for each of the training data points; this is done by predicting over them.
Calculate the distance between these training predictions and the prediction for your observation of interest (in absolute value, since you do not care about the sign, just about the absolute distance).
Take the indexes of the N smallest ones (e.g. N = 9); with these you can retrieve the observations corresponding to the smallest distances.
TestPred<-predict(knn.model, newdata = test)
TrainPred<-predict(knn.model, train)
Nearest9neighbors<-order(abs(TestPred-TrainPred))[1:9]
train[Nearest9neighbors,]
Price Distance Cost Tax Transport
15 95.51177 13.633754 9.725613 13.320678 12.981295
7 86.07149 15.428847 2.181090 2.874508 14.984934
19 106.53525 16.191521 -1.119501 5.439658 11.145098
2 95.10650 11.886978 12.803730 9.944773 16.270416
4 119.08644 14.020948 5.839784 9.420873 8.902422
9 99.91349 3.577003 14.160236 11.242063 16.280094
18 86.62118 7.852434 9.136882 9.411232 17.279942
11 111.45390 8.821467 11.330687 10.095782 16.496562
17 103.78335 14.960802 13.091216 10.718857 8.589131
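If you instead want the neighbours that are nearest in predictor space (which is how kNN itself measures proximity), a minimal sketch is to compute the Euclidean distances directly; this assumes no preProcess (centering/scaling) step was used in train():
# Euclidean distance from the test row to every training row, on the raw predictors
predictor.cols <- c("Distance", "Cost", "Tax", "Transport")
d <- sqrt(colSums((t(train[, predictor.cols]) - unlist(test[, predictor.cols]))^2))
train[order(d)[1:9], ]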

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival for this purpose, but I cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various Stack-related websites, but none of them provided an answer. Also, I have noticed a lot of confusion regarding what a multinomial logit regression with fixed effects is (people use different names) and about which R packages implement it.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent takes one choice.
Respondents are asked the same question every year. There is no a priori assumption about the extent to which the choice at time t is affected by the choice at t-1.
Now imagine having panel data recording these choices. The data would look like this:
set.seed(123)
# number of observations
n <- 100
# possible choices
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = F)
# actual choice at time 1
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = T, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
                 x1 = x1, x2 = x2,
                 individual_fixed_effect = as.character(rep(1:n, years)),
                 time_fixed_effect = as.character(rep(1:years, each = n)),
                 stringsAsFactors = F)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# reshape the dataset into mlogit format (the input df is in "wide" shape)
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
                           group.var = "time_fixed_effect",
                           choice = "choice",
                           shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed in through the function strata (see this tutorial for additional details). However, it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect),
              data = as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something that I am missing about the way these models are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems able to explicitly deal with panel structures using an appropriate estimation strategy. For this reason, I decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = "individual_fixed_effect") and the group to which individuals belong (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables that are correlated with a specific choice and randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help for the function mlogit, it is specified that one can use the argument panel to use panel techniques. Setting panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables do not exist in my case, I can't use the argument rpar and therefore can't set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly attached to mlogit; any efficient way to perform this task would be fine for me (e.g., I am OK with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

Limiting Number of Predictions - R

I'm trying to create a model that predicts whether a given team makes the playoffs in the NHL, based on a variety of available team stats. However, I'm running into a problem.
I'm using R, specifically the caret package, and I'm having fairly good success so far, with one issue: I can't limit the number of teams that are predicted to make the playoffs.
I'm using a categorical variable as the prediction -- Y or N.
For example, using the random forest method from the caret package,
rf_fit <- train(playoff ~ ., data = train_set, method = "rf")
rf_predict <- predict(rf_fit,newdata = test_data_playoffs)
mean(rf_predict == test_data_playoffs$playoff)
gives an accuracy of approximately 90% for my test set, but that's because it's overpredicting. In the NHL, 16 teams make the playoffs, but this predicts 19 teams to make the playoffs. So I want to limit the number of "Y" predictions to 16.
Is there a way to limit the number of possible responses for a categorical variable? I'm sure there is, but google searching has given me limited success so far.
EDIT: Providing sample data, which can be created with the following code:
set.seed(100) # For reproducibility
data <- data.frame(Y = sample(1:10,32,replace = T)/10, N = rep(NA,32))
data$N <- 1-data$Y
This creates a data frame similar to what you get by using the "prob" option, where you have the probabilities for Y and N:
pred <- predict(fit,newdata = test_data_playoffs, "prob")
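For illustration, one possible way (a sketch on the simulated data above, not taken from a fitted model) to enforce exactly 16 "Y" labels is to rank teams by their predicted playoff probability and label only the 16 highest:
# force exactly 16 'Y' predictions by taking the 16 highest Y probabilities;
# ties are broken arbitrarily by order()
playoff_pred <- rep("N", nrow(data))
playoff_pred[order(data$Y, decreasing = TRUE)[1:16]] <- "Y"
table(playoff_pred)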
