Multidimensional scaling with missing data - r

I've got a survey where a lot of people are randomly asked 8 out of 20 policy-based questions. I want to use MDS to bring these questions down to a single dimension to get the ideology of each of these respondents. However, because people are only asked a few questions, I can't get a dissimilarity matrix between each respondent because very few are asked the same 8 questions. I also can't remove rows with NA, because every row has 12 NAs. I have two options:
Create a regression on every one of the 20 variables with values of other questions in the survey asked to every participant (age, gender, etc), and impute the NAs based on those variables.
Use some sort of MDS method that doesn't require a complete matrix.
So far, I've been working with the first one, but the resulting models aren't always good. Since the policy questions are yes-no, I fit a binomial glm for each question:
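complete <- function(x) {
  # Fit a logistic regression of one policy question on the always-asked predictors
  q_and_predictors <- data.frame(question = x, predictors)
  logistic_reg <- glm(question ~ ., data = q_and_predictors, family = "binomial")
  # Predicted probabilities for every respondent, including those not asked
  predictions <- predict(logistic_reg, newdata = predictors, type = "response")
  # Keep observed answers; fill NAs with the predicted probability
  ifelse(is.na(x), predictions, x)
}
complete_questions <- apply(questions, 2, complete)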
The questions dataframe contains all the policy questions, and the predictors dataframe contains all non-policy questions.
I found the McFadden R^2 value for each logistic model, and some were very good (>0.35), but some were not (<0.1). Ideally, I'd like to find a way to either impute the missing values with greater accuracy, or use an MDS algorithm that works with missing values.
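For reference, the McFadden R^2 mentioned above is one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. A minimal sketch on simulated data follows; the variable names (age, female) and the simulated effect sizes are assumptions for illustration only, not the survey's actual predictors.

```r
set.seed(42)
n <- 500

# Hypothetical always-asked predictors (stand-ins for age, gender, etc.)
predictors <- data.frame(age = rnorm(n, 45, 15), female = rbinom(n, 1, 0.5))

# One simulated yes/no policy question, then blank out 60% of the answers
question <- rbinom(n, 1, plogis(-2 + 0.04 * predictors$age + 0.5 * predictors$female))
question[sample(n, 300)] <- NA

d <- data.frame(question = question, predictors)

# glm() drops the rows with a missing response by default (na.action = na.omit)
full <- glm(question ~ ., data = d, family = binomial)
null <- update(full, . ~ 1)  # intercept-only model on the same rows

# McFadden pseudo R^2 = 1 - logLik(full) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))
print(mcfadden)
```

Computing this per question makes it easy to flag which questions are poorly predicted by the demographic variables before trusting the imputed values.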

Related

survfit for stratified cox-model

I have a stratified cox-model and want predicted survival-curves for certain profiles, based on that model.
Now, because I'm working with a large dataset with a lot of strata, I want predictions for very specific strata only, to save time and memory.
The help-page of survfit.coxph states: ... If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.
When I run the code below, where newdata does contain the stratum variable, I still get predictions for both strata, which seems to contradict the help page:
library(survival)
df <- data.frame(X1 = runif(200),
                 X2 = sample(c("A", "B"), 200, replace = TRUE),
                 Ev = sample(c(0, 1), 200, replace = TRUE),
                 Time = rexp(200))
testfit <- coxph(Surv(Time, Ev) ~ X1 + strata(X2), df)
out <- survfit(testfit, newdata = data.frame(X1 = 0.6, X2 = "A"))
Is there anything I fail to see or understand here?
I'm not sure if this is a bug or a feature in survival:::survfit.coxph. It looks like the intended behaviour in the code is that only requested strata are returned. In the function:
strata(X2) is evaluated in an environment containing newdata and the result, A is returned.
The full curve is then created.
There is then some logic to split the curve into strata, but only if result$surv is a matrix.
In your example it is not a matrix. I can't find any documentation on the expected usage of this if it's not a bug. Perhaps it would be worth dropping the author/maintainer a note.
maintainer("survival")
# [1] "Terry M Therneau <xxxxxxxx.xxxxx#xxxx.xxx>"
Some comments that may be helpful:
My example was not big enough (and I seem not to have read the related github post very well, but that was after I posted my question here): if newdata has at least two rows (and of course the strata variable), predictions are returned only for the requested strata.
There is an inefficiency inside survfit.coxph, where the baseline hazard is calculated for every stratum in the original dataset, not only for the requested strata (see my contribution to the same github post). However, that doesn't seem to be a big issue: a test on a dataset with roughly half a million observations, 50% events and 1000 strata takes less than a minute.
The problem is memory allocation somewhere during calculations (in the above example, things collapse once I want predictions for 100 observations - 1 stratum each - while the final output of predictions for 80 is only a few MB)
My work-around:
Select all observations you want predictions for
Use lp <- predict(..., type='lp') to get the linear predictor for all these observations
Use survfit only on the first observation: survfit(fit, newdata = expand_grid(newdf, strat = strata_list))
Store the resulting survival estimates in a data.frame (or not, that's up to you)
To calculate predicted survival for the other observations, use the PH assumption: within a stratum, two observations' survival curves differ only through the difference in their linear predictors. This incurs the overhead of survfit.coxph only once, and if you only need survival at a few time points (e.g. 5- and 10-year survival), you can reduce the computing time even more.
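The proportional-hazards relation that the last step relies on can be written out as follows. This is a sketch in notation that is an assumption, not taken from the original post: \hat\eta_i = x_i^\top \hat\beta is the linear predictor of observation i, observation 1 is the reference observation passed to survfit, s is a stratum, and \hat S_{0s} is the baseline survival of that stratum.

```latex
\hat S(t \mid x_i, s) = \hat S_{0s}(t)^{\exp(\hat\eta_i)}
\quad\Longrightarrow\quad
\hat S(t \mid x_i, s) = \hat S(t \mid x_1, s)^{\exp(\hat\eta_i - \hat\eta_1)}
```

Any centering constant applied to the linear predictors cancels in the difference, provided both values are centered in the same way; the curve for the reference observation must come from the same stratum s as observation i.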

How do I create a simple linear regression function in R that iterates over the entire dataframe?

I am working through ISLR and am stuck on a question. Basically, I am trying to create a function that iterates through an entire dataframe. It is question 3.7, 15a.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
So my thinking is like this:
y = Boston$crim
x = Boston[, -crim]
TestF1 = lm(y ~ x)
summary(TestF1)
But this is nowhere near the right answer. I was hoping to break it down by:
Iterate over the entire dataframe with crim as my response and the others as predictors
Extract the p values that are statistically significant (or extract the ones insignificant)
Move on to the next question (which is considerably easier)
But I am stuck. I've googled but can't find anything. I tried this combn(Boston) thing but it didn't work either. Please help, thank you.
If your problem is to iterate over a data frame, here is an example for mtcars (mpg is the target variable and the rest are predictors, assuming models with a single predictor). The idea is to generate strings and convert them to formulas:
lms <- vector(mode = "list", length = ncol(mtcars) - 1)
for (i in seq_along(lms)) {
  lms[[i]] <- lm(as.formula(paste0("mpg~", names(mtcars)[-1][i])), data = mtcars)
}
If you want to look at each and every variable combination, start with a model with all variables and then eliminate non-significant predictors to find the best model.
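To get at the "which predictors are significant" part of the exercise, the same per-predictor loop can return the slope's p-value from each fit. A minimal sketch, again on mtcars rather than Boston; the 0.05 cutoff is an assumed conventional threshold, not something the exercise prescribes.

```r
# One simple regression of mpg on each remaining column of mtcars
predictors <- setdiff(names(mtcars), "mpg")

pvals <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  # Row 2 of the coefficient table is the slope of the single predictor
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})

# Predictors whose slope is significant at the (assumed) 5% level
significant <- names(pvals)[pvals < 0.05]

print(sort(pvals))
print(significant)
```

For the Boston data, replacing mtcars with Boston and "mpg" with "crim" should give the analogous table, assuming the MASS package (which provides Boston) is loaded.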

SVM in R (e1071): Give more recent data higher influence (weights for support vector machine?)

I'm working with Support Vector Machines from the e1071 package in R. This is my first project using SVM.
I have a dataset containing order histories of ~1k customers over 1 year and I want to predict customer purchases. For every customer I have the information whether a certain item (out of ~50) was bought or not in a certain week (for 52 weeks, i.e. 1 year).
My goal is to predict next month's purchases for every single customer.
I believe that a purchase let's say 1 month ago is more meaningful for my prediction than a purchase 10 months ago.
My question is now how I can give more recent data a higher impact? There is a 'weight' option in the svm-function but I'm not sure how to use it.
Anyone who can give me a hint? Would be much appreciated!
That's my code
# Fit model using Support Vector Machines
# install.packages("e1071")
library(e1071)
response <- train[,5]; # purchases
formula <- response ~ .;
tuned.svm <- tune.svm(train, response, probability=TRUE,
gamma=10^(-6:-3), cost=10^(1:2));
gamma.k <- tuned.svm$best.parameter[[1]];
cost.k <- tuned.svm$best.parameter[[2]];
svm.model <- svm(formula, data = train,
type='eps-regression', probability=TRUE,
gamma=gamma.k, cost=cost.k);
svm.pred <- predict(svm.model, test, probability=TRUE);
Side notes: I'm fitting a model for every single customer. Also, since I'm interested in the probability, that customer i buys item j in week k, I put
probability=TRUE
The weights option in e1071's svm() is meant for class imbalance: the class.weights parameter assigns different weights to the classes (e.g. 1/0) in an imbalanced dataset.
To answer your question: in the absence of built-in observation-level weights, a simple trick for giving recent data more weight in an SVM model is to repeat the recent rows (i.e. create duplicate rows for recent data), which indirectly gives them a higher weight.
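The row-duplication trick can be sketched in base R as follows. The data frame, its column names (week, bought), and the choice to triple the last four weeks are all hypothetical and only illustrate the mechanics.

```r
set.seed(1)

# Hypothetical order history for one customer and one item: 52 weekly records
orders <- data.frame(week = 1:52, bought = rbinom(52, 1, 0.3))

# Assumed weighting scheme: the last 4 weeks count three times, the rest once
reps <- ifelse(orders$week > 48, 3L, 1L)

# Repeat each row index according to its weight to build the training set
weighted <- orders[rep(seq_len(nrow(orders)), times = reps), ]

print(nrow(weighted))          # 48 single rows + 4 rows repeated 3 times
print(table(weighted$week > 48))
```

The weighted data frame can then be passed to svm() in place of the original training data. Note that duplicated rows also end up in cross-validation folds, so tuning results on such data may look optimistic.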
Try this package: https://CRAN.R-project.org/package=WeightSVM
It uses a modified version of 'libsvm' and is able to deal with instance weighting. You can assign higher weights to recent data.
For example, suppose you have simulated data (x, y):
library(WeightSVM)
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)
This is an unweighted SVM:
model1 <- wsvm(x, y, weight = rep(1,99))
The blue dots show the unweighted SVM fit, which does not fit the first instances well. We want to put more weight on the first several instances.
So we can use a weighted SVM:
model2 <- wsvm(x, y, weight = seq(99,1,length.out = 99))
The green dots show the weighted SVM fit, which fits the first instances better.

PRC analysis with paired observations in vegan

This message is a copy of one that I posted on R-Forge. I would like to run a principal response curve (PRC) analysis on my data. I have several pairs of plots where deer browse the vegetation on Anticosti Island, Québec. Each plot was observed repeatedly over the course of 4 years. At each site, one plot is inside the enclosure (without deer, called "exclosure") and the other is outside the enclosure (with deer, called "control"). I would like to take the pairing of observations inside and outside each enclosure into account in the PRC analysis. I would like either to add another conditioning term to the PRC (as in partial RDA) to account for the paired observations, or to extract values from a partial RDA computed with the PRC formula and plot them as is done in a PRC.
Moreover, I would like to use permutation tests to assess the significance of the difference between the two treatments. My hypothesis is that vegetation composition differs between the exclosure and the control over the years. So I would like to know whether there is a difference between the two treatments and, if there is, after how many years.
Does anybody know how to do this?
Here is the code for my PRC (without taking paired observations into account):
levels (treat)
[1] "controle" "exclosure"
levels (years)
[1] "0" "3" "5" "8"
prc.out <- prc(data.prc.spe.hell, treat, years)
species <- colSums(data.prc.spe.hell)
plot(prc.out, select = species > 5)
ctrl <- how(plots = Plots(strata = site,type = "free"),
within = Within(type = "series"), nperm = 99)
anova(prc.out, permutations = ctrl, first=TRUE)
Here is the result.
Thank you very much for your help!
I may have an answer for the first part of your question:"I would like to add an other condition term to the PRC (like in partial RDA) to consider the paired observations".
I am currently working on a similar case and this is what I came up with. Since principal response curves (PRC) are a special case of RDA, and the objective is to do a kind of "partial PRC", I read the R documentation of the function rda() and found this: "If matrix Z is supplied, its effects are removed from the community matrix, and the residual matrix is submitted to the next stage."
So if I understand well, when you do a partial RDA with X, Y, Z (X=community matrix, Y=Constraining matrix, Z=Conditioning matrix), the first thing done by the function is to remove the effect of Z by using the residuals matrix of the RDA of X ~ Z.
If that is true, it is easy to do this step alone, and then to use the residual matrix in your PRC:
library(vegan)
rda.out = rda(X ~ Z) # equivalent of "rda.out = rda(X ~ Condition(Z))"
rda.res = residuals(rda.out)
prc.out = prc(rda.res, treatment, time)
If you coded a dummy variable for your pairing effect, I think it should be as.factor() and NOT as.numeric().
I am not a stats expert, but it looks right to me. Even though it looks simple, I would appreciate it if someone could validate my answer.
Cheers

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1, family=binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
# sample data
set.seed(16)
data <- data.frame(
  mating = sample(0:1, 200, replace = TRUE),
  pop = sample(letters[1:4], 200, replace = TRUE),
  behv = scale(rpois(200, 10)),
  condition = scale(rnorm(200, 5))
)
data1 <- data[1:150, ]    # for model fitting
data2 <- data[51:200, -1] # for predicting
Then this will fit the model using data1 and predict into data2:
model <- glm(mating ~ behv * pop +
               I(behv^2) * pop + condition,
             data = data1,
             family = binomial(logit))
predict(model, newdata = data2, type = "response")
Using type="response" will give you the predicted probabilities.
Now to make predictions, you don't have to use a subset of the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that population, and since I have to supply condition as well, I fix it at the mean of the original data. (I could have just put in 0, since the data are centered and scaled.) Then I specify the range of behv values I'm interested in: here I just took the range of the original data and split it into 100 points. This gives me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
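The same idea extends to drawing one curve per population, which is a common way to visualise the behv-by-pop interaction. A minimal sketch on fake data of the same shape as above; the object names dat and grid are my own, and condition is held at 0 on the assumption that it is centered.

```r
set.seed(16)
dat <- data.frame(
  mating = sample(0:1, 200, replace = TRUE),
  pop = factor(sample(letters[1:4], 200, replace = TRUE)),
  behv = as.numeric(scale(rpois(200, 10))),
  condition = as.numeric(scale(rnorm(200, 5)))
)
model <- glm(mating ~ behv * pop + I(behv^2) * pop + condition,
             data = dat, family = binomial(logit))

# Every combination of 100 behv values and the 4 populations,
# with condition held at its (centered) mean of 0
grid <- expand.grid(
  behv = seq(min(dat$behv), max(dat$behv), length.out = 100),
  pop = levels(dat$pop),
  condition = 0
)
grid$prob <- predict(model, newdata = grid, type = "response")

# Empty plot frame, then one predicted-probability line per population
plot(range(grid$behv), c(0, 1), type = "n",
     xlab = "behv", ylab = "P(mating)")
pops <- levels(dat$pop)
for (i in seq_along(pops)) {
  with(grid[grid$pop == pops[i], ], lines(behv, prob, col = i))
}
legend("topright", legend = pops, col = seq_along(pops), lty = 1)
```

Because newdata only needs columns whose names match the formula, expand.grid is a convenient way to build it for any combination of factor levels and numeric ranges.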
