Regressing out or Removing age as confounding factor from experimental result - r

I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased. I want to check if the age (exact age values) are impacting the CT values. And if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources for confounding variable adjustment, but they all deal with categorical confounding factors (like batch effect). I can't get how to do it for age.
I have done the following:
modcombat = model.matrix(~1, data=data.frame(data_val))
modcancer = model.matrix(~Age, data=data.frame(data_val))
combat_edata = ComBat(dat=t(data_val), batch=Age, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
pValuesComBat = f.pvalue(combat_edata,mod,mod0)
qValuesComBat = p.adjust(pValuesComBat,method="BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how to correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking p-val of correlation of residuals and confounding factors. I don't get why to test correlation of residuals with age.
And how to apply a correction to CT values using regression?
Please guide if what I have done is correct.
In case it's incorrect, kindly tell me how to obtain data_val with no age effect.

There are several ways to approach this.
Basic statistical approach
A very basic way to account for Age and make the final dataset age-agnostic is:
Centre and scale your data by Age. That is, group your data by age, compute the mean of each group, and then standardise the observations within each group using that group's statistics.
For standardising you can use one of three methods:
1) z-score normalisation: change each data point to (x - mean(x)) / sd(x), using the group mean and the group standard deviation.
2) mean normalisation: simply subtract the group mean from every observation.
3) min-max normalisation: rescale each observation by the group's range instead of its standard deviation, i.e. (x - min(x)) / (max(x) - min(x)).
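The group-wise z-score option can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, genes are assumed to be in rows and samples in columns, and the age bins (`breaks`) are arbitrary choices that do not come from the question. Because the healthy samples are younger than the diseased ones, standardising within age groups may also remove part of the disease signal.

```r
set.seed(1)

# Simulated stand-ins for the real objects (assumption: genes in rows)
n_genes   <- 5
n_samples <- 40
Age       <- round(runif(n_samples, min = 20, max = 80))
data_val  <- matrix(rnorm(n_genes * n_samples, mean = 25, sd = 2),
                    nrow = n_genes,
                    dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Bin the exact ages into groups; the cut points are arbitrary
age_group <- cut(Age, breaks = c(0, 40, 60, Inf),
                 labels = c("young", "middle", "old"))

# z-score one gene within each age group
zscore_by_group <- function(x, g) {
  ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))
}

# Apply to every gene (row); apply() returns samples x genes, so transpose
data_z <- t(apply(data_val, 1, zscore_by_group, g = age_group))

# Group means of each gene are now (numerically) zero
round(tapply(data_z[1, ], age_group, mean), 10)
```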
On to more complex statistics:
You can gauge how much each feature/column contributes to the variation in your dataset with PCA (principal component analysis) (https://en.wikipedia.org/wiki/Principal_component_analysis). Although it is generally used for dimensionality reduction, it can also show how the variance in the whole dataset is distributed and which features contribute most to the leading components.
Below is a simple example explaining it:
I have plotted the contributions using a variable plot and a biplot, using the decathlon2 dataset from the factoextra package:
library("factoextra")
data(decathlon2)
data <- decathlon2[, 1:10] # taking only 10 variables/columns for simplicity
colnames(data)
res.pca <- prcomp(data, scale. = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale=TRUE)
biplot(hep.PC.cor)
output
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
On these similar lines you can use PCA on your data to get the importance of the age parameter in your data.
I hope this helps, if I find more such methods I will share.
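The regression route raised in the question can be sketched as below. It is a minimal illustration on simulated data, assuming genes are in rows of `data_val`; the object names `fit` and `data_adj` are made up for this example. Each gene is regressed on Age and the residuals, shifted back to the gene's mean, serve as the age-adjusted values. Note that ordinary least-squares residuals are uncorrelated with Age by construction, so a `cor.test()` of residuals against Age cannot detect an age effect; the p-value of the Age coefficient is the relevant test.

```r
set.seed(2)

# Simulated stand-ins (assumption: genes in rows, samples in columns)
n_samples <- 40
Age       <- round(runif(n_samples, min = 20, max = 80))
data_val  <- matrix(rnorm(3 * n_samples, mean = 25, sd = 2), nrow = 3)
data_val[1, ] <- data_val[1, ] + 0.05 * Age   # gene 1 drifts with age

# One linear model per gene, fitted in a single call (matrix response)
fit <- lm(t(data_val) ~ Age)

# p-value of the Age slope for each gene
age_p <- sapply(summary(fit), function(s) s$coefficients["Age", "Pr(>|t|)"])

# Age-adjusted values: residuals plus the original per-gene mean
data_adj <- t(residuals(fit)) + rowMeans(data_val)

# Correlation of the adjusted values with Age is zero up to rounding
round(cor(t(data_adj), Age), 10)
```

If disease status should be preserved, one option is to include it in the model as well and remove only the fitted Age component, since Age and disease status are confounded in this design.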

Related

GAM smooths interaction differences - calculate p value using mgcv and gratia 0.6

I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether, and where, the CI excludes the y = 0 line, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is outputted for a smooth fit that tests the null hypothesis that the coefficients are all = 0, based on a chi square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
The release notes for gratia 0.4 explain the difference_smooths() function, but gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by factor levels is to vary the ci_level argument of difference_smooths(). The default is 0.95. The ci_level can be varied to find the level at which y = 0 is no longer within the CI bands. If, for example, this occurred when ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. For example, it would take a little manual experimentation and it may be difficult to discern accurately when zero drops out of the CI. Although, a function could be written to search the accompanying data frame that is outputted with difference_smooths() as the ci_level is varied. This is not totally satisfactory either because the detection of a non-zero CI would be dependent on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM using mgcv, so that shouldn't be too much of a problem.
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approximate p value would be 1 - 0.88 = 0.12.
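The manual search over ci_level can be replaced by a direct calculation, sketched below. It assumes the pointwise interval has the form difference ± z × standard error with a normal quantile z, and that the differences and standard errors have been pulled out of the data frame returned by difference_smooths() (the column names vary between gratia versions, so plain vectors are used here). The function name `zero_dropout_level` and the numbers are invented for illustration. The result remains a pointwise heuristic that depends on the evaluation points, not a formal simultaneous test.

```r
# Largest confidence level at which some pointwise interval still excludes
# zero, plus the heuristic p value 1 - ci_level described above.
# diff: estimated differences between the two smooths
# se:   their standard errors
zero_dropout_level <- function(diff, se) {
  z_max    <- max(abs(diff / se))      # most extreme standardised difference
  ci_level <- 2 * pnorm(z_max) - 1     # interval at this level just touches 0
  c(ci_level = ci_level, p_approx = 1 - ci_level)
}

# Toy values standing in for the difference_smooths() output
res <- zero_dropout_level(diff = c(0.10, 0.50, -0.20),
                          se   = c(0.20, 0.25, 0.20))
res
```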
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure if using the criterion, >= 0 (for negative diffs), is a good way to go. Because of the draws from the posterior, there are likely to be many diffs that meet this criterion. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, and calculate the percentage, which is the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
(vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
                      group = rep(seq_len(ncol(sim)), each = no.draws))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if all coefficients are not equal to each other then they all can't be equal to zero but that isn't what we want here.
The basis of the smooth tests of fit given by summary(gam.model1) is a joint test of all coefficients == 0. This would be from a type of likelihood ratio test where model fit with and without a term are compared.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I have got this far, I had a rethink of your original suggestion of using the criterion, >= 0 (for negative diffs). I reinterpreted this as meaning: for each simulated coefficient difference distribution (in my case 8), count when this occurs and make a table where each row (in my case, 8) is for one of these distributions, with two columns holding this count and (number of simulation draws minus count). Then run a chi square test on this table. When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>% mutate(interact.var =
ifelse(factor.2levels == "yes", 1, 0)*cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interactive variable.
Then we place this interactive variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.
@GavinSimpson actually posted a method of how to get the difference between two smooths and assess its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested, see the main post under the heading, Follow up thought Feb 24, where the interaction variable is created, gives an almost identical result for the p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept and the linear term for the by categorical variable which also both changed with the ordered variable approach.
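The ordered-factor approach described above can be sketched as follows. The data are simulated and every variable name (`dat`, `fac`, `ofac`, `x`, `y`) is an assumption made for this example. With an ordered by factor, mgcv fits a reference smooth s(x) plus a difference smooth for each non-reference level, and the approximate p value for the difference smooth appears in the usual summary table.

```r
library(mgcv)
set.seed(1)

# Simulated data: the "yes" level has an extra curved effect of x
n   <- 400
x   <- runif(n)
fac <- factor(sample(c("no", "yes"), n, replace = TRUE),
              levels = c("no", "yes"))
y   <- sin(2 * pi * x) + ifelse(fac == "yes", 2 * x^2, 0) +
       rnorm(n, sd = 0.3)
dat <- data.frame(y = y, x = x, fac = fac)

# Ordered version of the factor; "no" is the reference level
dat$ofac <- factor(dat$fac, levels = c("no", "yes"), ordered = TRUE)

# s(x) is the smooth for the reference level,
# s(x, by = ofac) is the difference smooth for "yes" relative to "no"
m <- gam(y ~ ofac + s(x) + s(x, by = ofac), data = dat, method = "REML")

# Approximate tests for the smooth terms, including the difference smooth
smooth_tab <- summary(m)$s.table
smooth_tab
```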

How to change the y-axis for a multivariate GAM model from smoothed to actual values?

I am using multivariate GAM models to learn more about fog trends in multiple regions. Fog is determined by visibility going below a certain threshold (< 400 meters). Our GAM model is used to determine the response of visibility to a range of meteorological variables.
However, my challenge right now is that I'd really like the y-axis to be the actual visibility observations rather than the centered smoothed. It is interesting to see how visibility is impacted by the covariates relative to the mean visibility in that location, but it's difficult to compare this for multiple locations where the mean visibility is different (and thus the 0 point in which visibility is enhanced or diminished has little comparable meaning).
In order to compare the results of multiple locations, I'm trying to make the y-axis actual visibility observations, and then I'll put a line at the visibility threshold we're interested in looking at (400 m) to evaluate what the predictor variables values are like below that threshold (eg what temperatures are associated with visibility below 400 m).
I'm still a beginner when it comes to GAMs and R in general, but I've figured out a few helpful pieces so far.
Helpful things so far:
Attempt 1. how to extract gam fit for each variable in model
Extracting data used to make a smooth plot in mgcv
Attempt 2. how to use predict function to reconstruct a univariable model
http://zevross.com/blog/2014/09/15/recreate-the-gam-partial-regression-smooth-plots-from-r-package-mgcv-with-a-little-style/
Attempt 3. how to get some semblance of a y-axis that looks like visibility observations using "fitted" -- though I don't think this is the correct approach since I'm not taking the intercept into account
http://gsp.humboldt.edu/OLM/R/05_03_GAM.html
simulated data
install.packages("mgcv") #for gam package
require(mgcv)
install.packages("pspline")
require(pspline)
#simulated GAM data for example
dataSet <- gamSim(eg=1,n=400,dist="normal",scale=2)
visibility <- dataSet[[1]]
temperature <- dataSet[[2]]
dewpoint <- dataSet[[3]]
windspeed <- dataSet[[4]]
#Univariable GAM model
gamobj <- gam(visibility ~ s(dewpoint))
plot(gamobj, scale=0, page=1, shade = TRUE, all.terms=TRUE, cex.axis=1.5, cex.lab=1.5, main="Univariable Model: Dew Point")
summary(gamobj)
AIC(gamobj)
abline(h=0)
Univariable Model of Dew Point
https://imgur.com/1uzP34F
ATTEMPT 2 -- predict function with univariable model, but didn't change y-axis
#dummy var that spans length of original covariate
maxDP <-max(dewpoint)
minDP <-min(dewpoint)
DPtrial.seq <-seq(minDP,maxDP,length=3071)
DPtrial.seq <-data.frame(dewpoint=DPtrial.seq)
#predict only the DP term
preds <- predict(gamobj, type="terms", newdata=DPtrial.seq, se.fit=TRUE)
#determine confidence intervals
DPplot <-DPtrial.seq$dewpoint
fit <-preds$fit
fit.up95 <-fit+1.96*preds$se.fit
fit.low95 <-fit-1.96*preds$se.fit
#plot
plot(DPplot, fit, lwd=3,
main="Reconstructed Dew Point Covariate Plot")
#plot confidence intervals
polygon(c(DPplot, rev(DPplot)),
c(fit.low95,rev(fit.up95)), col="grey",
border=NA)
lines(DPplot, fit, lwd=2)
rug(dewpoint)
Reconstructed Dew Point Covariate Plot
https://imgur.com/VS8QEcp
ATTEMPT 3 -- changed y-axis using "fitted" but without taking intercept into account
plot(dewpoint,fitted(gamobj), main="Fitted Response of Y (Visibility) Plotted Against Dew Point")
abline(h=mean(visibility))
rug(dewpoint)
Fitted Response of Y Plotted Against Dew Point https://imgur.com/RO0q6Vw
Ultimately, I want a horizontal line where I can investigate the predictor variable relative to 400 meters, rather than just the mean of the response variable. This way, it will be comparable across multiple sites where the mean visibility is different. Most importantly, it needs to be for multiple covariates!
Gavin Simpson has explained the method in a couple of posts but unfortunately, I really don't understand how I would hold the mean of the other covariates constant as I use the predict function:
Changing the Y axis of default plot.gam graphs
Any deeper explanation into the method for doing this would be super helpful!!!
I'm not sure how helpful this will be as your Q is a little more open ended than we'd typically like on SO, but, here goes.
Firstly, I think it would help to think about modelling the response variable, which I assume is currently visibility. This is going to be a continuous variable, bounded at 0 (perhaps the data never reach zero?) which suggests modelling the data as conditionally distributed either
gamma (family = Gamma(link = 'log')) for visibility that never takes a value of zero.
Tweedie (family = tw()) for data that do have zeroes.
An alternative approach would be to model the occurrence of fog; if this is defined as an event <400m visibility then you could turn all your observations into 0/1 values for being a fog event or otherwise. Then you'd model the data as conditionally distributed Bernoulli, using family = binomial().
Having decided on a modelling approach, we need to model the response. This should be done using a multiple regression type of approach, with a GAM including multiple predictors. This way you get to estimate the effect of each potential predictor variable on the response while controlling for the effects of the other predictors. If you just do this using a single predictor at a time, say dewpoint, that variable could well "explain" variation in the data that might be due to another predictor, windspeed say, and you wouldn't know it.
Furthermore, there may well be interactions between predictors that you'll want to control for if they exist, which can only be done in a model that includes multiple predictors.
Then, to finally get to the crux of your problem, having fitted the multi-predictor model to "explain" visibility, you will need to predict from the model for sets of likely conditions. To look at how the visibility varies with dewpoint in a model where other predictor variables have effects, you need to fix the other variables at some reasonable values; one option is to set them to their mean (or modal value in the case of any factor predictor variables), or some other value indicative of typical values for that variable. You'll have to use your domain knowledge for this.
If you have interactions in the model, then you'll need to vary the two variables in the interaction, whilst holding all other variables fixed at some values.
Let's assume you don't have interactions and are interested in dewpoint but the model also includes windspeed. The mean windspeed for the values used to fit the model can be found from the cmX component of the fitted model. Or you could just calculate this from the observed windspeed values or set it to some known number you want to use. Denote the fitted model by m, and the data frame with your data in it by df, then we can create new data to predict at over the range of dewpoint, whilst holding windspeed fixed.
mn.windspd <- m$cmX['windspeed']
## or
mn.windspd <- with(df, mean(windspeed))
## or set it to some value
mn.windspd <- 10 # say
Then you can do
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 300),
windspeed = mn.windspd))
Then you use this to predict from the fitted model:
pred <- predict(m, newdata = preddata, type = "link", se.fit = TRUE)
pred <- as.data.frame(pred)
Now we want to put these predictions back on to the response scale, and we want a confidence interval so we have to create that first before back transforming:
ilink <- family(m)$linkinv
pred <- transform(pred,
                  Fitted = ilink(fit),
                  Upper = ilink(fit + (2 * se.fit)),
                  Lower = ilink(fit - (2 * se.fit)),
                  dewpoint = preddata$dewpoint)
Now you can visualise the effect of dewpoint on the response whilst keeping windspeed fixed.
In your case, you will have to extend this to keeping temperature constant also, but that is done in the same way
mn.windspd <- m$cmX['windspeed']
mn.temp <- m$cmX['temperature']
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 300),
windspeed = mn.windspd,
temperature = mn.temp))
and then follow the steps above to do the prediction.
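Putting the steps together, a minimal end-to-end sketch might look like the following. Everything about the data is an assumption: the values are simulated, the Gamma family with log link is just one of the options mentioned above, and the other covariates are held at their observed means computed directly from the data. The 400 m threshold is drawn as a horizontal line on the response scale.

```r
library(mgcv)
set.seed(42)

# Simulated stand-in for the real observations (all values are made up)
n  <- 400
df <- data.frame(dewpoint    = runif(n, -5, 20),
                 windspeed   = runif(n, 0, 15),
                 temperature = runif(n, 0, 30))
mu <- exp(6 + 0.04 * df$dewpoint - 0.03 * df$windspeed +
          0.3 * sin(df$temperature / 5))
df$visibility <- rgamma(n, shape = 5, rate = 5 / mu)   # strictly positive

# Multi-predictor GAM on the log link
m <- gam(visibility ~ s(dewpoint) + s(windspeed) + s(temperature),
         data = df, family = Gamma(link = "log"), method = "REML")

# Vary dewpoint, hold the other covariates at their observed means
preddata <- with(df,
                 expand.grid(dewpoint    = seq(min(dewpoint), max(dewpoint),
                                               length = 300),
                             windspeed   = mean(windspeed),
                             temperature = mean(temperature)))

# Predict on the link scale, build the interval, then back-transform
pred  <- as.data.frame(predict(m, newdata = preddata,
                               type = "link", se.fit = TRUE))
ilink <- family(m)$linkinv
pred  <- transform(pred,
                   Fitted   = ilink(fit),
                   Upper    = ilink(fit + (2 * se.fit)),
                   Lower    = ilink(fit - (2 * se.fit)),
                   dewpoint = preddata$dewpoint)

# Response-scale plot with the visibility threshold of interest
plot(Fitted ~ dewpoint, data = pred, type = "l",
     ylim = range(pred$Lower, pred$Upper, 400),
     ylab = "Predicted visibility (m)")
lines(Upper ~ dewpoint, data = pred, lty = 2)
lines(Lower ~ dewpoint, data = pred, lty = 2)
abline(h = 400, col = "red")
```

Because the y-axis is now in metres rather than centred on the site mean, the same plot can be drawn for several locations and compared against the same 400 m line.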
For one or two variables varying I have a function data_slice() in my gratia package which will do the above expand.grid() stuff for you so you don't have to specify the mean values of the other covariates:
preddata <- data_slice(m, 'dewpoint', n = 300)
technically this finds the value in the data closest to the median value (for the covariates not varying). If you want means, then do
fixdf <- data.frame(windspeed = mn.windspd, temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', data = fixdf, n = 300)
If you have an interaction, say between dewpoint and windspeed then you need to vary two variables. This is pretty easy again with expand.grid():
mn.temp <- m$cmX['temperature']
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 100),
windspeed = seq(min(windspeed),
max(windspeed),
length = 100),
temperature = mn.temp))
This will create a 100 x 100 grid of values of the covariates to predict at, whilst holding temperature constant.
For data_slice() you'd need to do:
fixdf <- data.frame(temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', 'windspeed',
data = fixdf, n = 300)
And extending this on to more covariates you want to vary, is also easy following this pattern with expand.grid(); I have yet to implement more than 2 variables varying in data_slice.

PRC analysis with paired observations in vegan

This message is a copy of a message that I wrote on R-Forge. I would like to compute a Principal Response Curve (PRC) analysis on my data. I have several pairs of plots where deer browse the vegetation on Anticosti island, Québec. There are repeated observations of each plot over the course of 4 years. At each site, one plot is inside the enclosure (without deer, called "exclosure") and the other plot is outside the enclosure (with deer, called "control"). I would like to take into account the pairing of observations in and out of each enclosure in the PRC analysis. I would like to add another condition term to the PRC (like in partial RDA) to account for the paired observations, or extract values from a partial RDA computed with the PRC formula and plot them as is done in a PRC.
Moreover, I would like to test the significance of the difference between the two treatments with permutation tests. My hypothesis is that vegetation composition is different in the exclosure than in the control throughout the years. So, I would like to know if there is a difference between the two treatments and, if there is, after how many years.
Does anybody know how to do this?
So here the code of my prc (without taking paired observations into account):
levels (treat)
[1] "controle" "exclosure"
levels (years)
[1] "0" "3" "5" "8"
prc.out <- prc(data.prc.spe.hell, treat, years)
species <- colSums(data.prc.spe.hell)
plot(prc.out, select = species > 5)
ctrl <- how(plots = Plots(strata = site,type = "free"),
within = Within(type = "series"), nperm = 99)
anova(prc.out, permutations = ctrl, first=TRUE)
Here is the result.
Thank you very much for your help!
I may have an answer for the first part of your question:"I would like to add an other condition term to the PRC (like in partial RDA) to consider the paired observations".
I am currently working on a similar case, and this is what I came up with. Since Principal Response Curves (PRC) are a special case of RDA, and the objective is to do a kind of "partial PRC", I read the R documentation of the function rda(), where I found: "If matrix Z is supplied, its effects are removed from the community matrix, and the residual matrix is submitted to the next stage."
So if I understand well, when you do a partial RDA with X, Y, Z (X=community matrix, Y=Constraining matrix, Z=Conditioning matrix), the first thing done by the function is to remove the effect of Z by using the residuals matrix of the RDA of X ~ Z.
If that is true, it is easy to do this step alone, and then to use the residual matrix in your PRC:
library(vegan)
rda.out = rda(X ~ Z) # equivalent of "rda.out = rda(X ~ Condition(Z))"
rda.res = residuals(rda.out)
prc.out = prc(rda.res, treatment, time)
If you coded a dummy variable for your pairing effect, I think it should be a factor (as.factor()) and NOT numeric (as.numeric()).
I am not a stats expert, but it looks right to me. Even though it looks simple, I would appreciate it if someone could validate my answer.
Cheers

R Trouble getting correlation coefficient

I'm having difficulty getting a correlation coefficient for my data set.
I started by using ggpairs and then the cor function.
It might reflect a lack of knowledge, but I didn't realize that I can't calculate the correlation for columns whose type is not numeric.
For example, I would like to know the correlation between AGE and CITY. What alternatives do I have in situations like this? Or what data transformations should I do?
Thank you.
As thelatemail put it, sometimes graphs speak more than a stat...
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(city = sample(cities,size = 200, replace = TRUE), age = rnorm(n = 200, mean = 40, sd = 20))
dat$city <- as.factor(dat$city)
plot(age ~ city, data = dat)
Then for proper analysis you have several options... anova, or regression with cities as an explanatory variable (factor)... Although your question might have better responses on Cross Validated!
Btw: pls just ignore negative ages, this has been done quickly.
I think you first need to answer the question of what it is you are trying to do. The correlation coefficient (Pearson's r) is a specific statistic that can be calculated on two numerical values (where a dichotomous variable can be considered numeric). It has some special characteristics, including that it is bounded by -1 and 1 and that it does not have a concept of dependent or independent variable. Also it does not represent the proportion of variance explained; you need to square it to get the usual measure of that. What it does do is give you an estimate of the size and direction of the association between two variables.
These characteristics make it inappropriate to use r when you have a variable such as city as one of the two variables. If you want to know the proportion of variance in age explained by city, you can run a regression of age on a set of dummy variables for city and look at the overall R squared for the model. However unlike r, you won't have a simple direction (just direction for each city) and it won't necessarily be the same as if you built a model predicting city based on age.
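The regression on city dummy variables can be sketched as below, using simulated data of the same shape as in the earlier answer; the object names are illustrative. R creates the dummy variables automatically when city is a factor, and the model R squared equals the share of the total sum of squares in age attributed to city in the ANOVA table.

```r
set.seed(1)

# Simulated data: age is unrelated to city here, so R squared will be small
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(city = factor(sample(cities, size = 200, replace = TRUE)),
                  age  = rnorm(n = 200, mean = 40, sd = 10))

# Regression of age on city; the factor is expanded into dummy variables
fit <- lm(age ~ city, data = dat)

# Proportion of variance in age explained by city
r2 <- summary(fit)$r.squared

# The same quantity from the ANOVA table: SS(city) / SS(total)
aov_tab <- anova(fit)
eta2    <- aov_tab["city", "Sum Sq"] / sum(aov_tab[["Sum Sq"]])

c(r.squared = r2, eta.squared = eta2)
```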
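For qualitative data you can use Spearman's correlation, provided the categories are ordinal (have a natural order) and are coded as numeric ranks. For an unordered variable such as City the ranks would be arbitrary, so the result would not be meaningful.
You can find more information about this correlation here
It can be used in R with this command:
cor(x, use=, method= )
So, for a simple example in which CITY holds numeric ranks:
cor(AGE, CITY, method = "spearman")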
I hope that helps you

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1, family=binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
#sample data
set.seed(16)
data <- data.frame(
  mating = sample(0:1, 200, replace = TRUE),
  pop = sample(letters[1:4], 200, replace = TRUE),
  behv = scale(rpois(200, 10)),
  condition = scale(rnorm(200, 5))
)
data1<-data[1:150,] #for model fitting
data2<-data[51:200,-1] #for predicting
Then this will fit the model using data1 and predict into data2
model<-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1,
family=binomial(logit))
predict(model, newdata=data2, type="response")
Using type="response" will give you the predicted probabilities.
Now to make predictions, you don't have to use a subset from the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that population, and since I have to supply condition as well, I fix that at the mean of the original data. (I could have just put in 0 since the data is centered and scaled.) Now I specify a range of behv values I'm interested in. Here I just took the range of the original data and split it into 100 points. This will give me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
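The same idea extends to all populations at once and to an approximate confidence band, as sketched below. This is only an illustration on fake data of the same kind as above: the names `dat`, `newdat` and `pop_b` are made up, and the band is a Wald-type interval built on the link scale and back-transformed with plogis(), which is one common choice rather than the only one.

```r
set.seed(16)

# Fake data; as.numeric() keeps the scaled columns as plain vectors
n   <- 200
dat <- data.frame(
  mating    = sample(0:1, n, replace = TRUE),
  pop       = factor(sample(letters[1:4], n, replace = TRUE)),
  behv      = as.numeric(scale(rpois(n, 10))),
  condition = as.numeric(scale(rnorm(n, 5)))
)

model <- glm(mating ~ behv * pop + I(behv^2) * pop + condition,
             data = dat, family = binomial(link = "logit"))

# Every predictor in the formula needs a column in newdata:
# vary behv within each population, hold condition at its mean
newdat <- expand.grid(
  behv      = seq(min(dat$behv), max(dat$behv), length.out = 100),
  pop       = levels(dat$pop),
  condition = mean(dat$condition)
)

# Predict on the link scale, then back-transform fit and interval limits
lp <- predict(model, newdata = newdat, type = "link", se.fit = TRUE)
newdat$prob  <- plogis(lp$fit)
newdat$lower <- plogis(lp$fit - 1.96 * lp$se.fit)
newdat$upper <- plogis(lp$fit + 1.96 * lp$se.fit)

# Plot one population; repeat or loop for the others
pop_b <- subset(newdat, pop == "b")
plot(prob ~ behv, data = pop_b, type = "l", ylim = c(0, 1),
     ylab = "Predicted probability of mating")
lines(lower ~ behv, data = pop_b, lty = 2)
lines(upper ~ behv, data = pop_b, lty = 2)
```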
