Combining 3 separate line plots into one line plot display - r

We are looking at pattern recognition and creating several different variables.
unusualsubjects <- rtaverages$subject_id[rtaverages$count < 5] # make a list of subjects without enough data.
rtaverages <- filter(rtaverages,!(subject_id %in% unusualsubjects)) # only include data from good subjects. ! = not. Put data from acceptable subjects right back in the same data frame
# Another example of filtering subjects: let's say we only wanted to analyze subjects with accuracies over 95%
accurateSubjects <- averages$subject_id[averages$accuracy > .95] #returns all of the subject_ids for subjects meeting an accuracy criterion
length(accurateSubjects) # tells us how many accurate subjects there are
goodSubjectdata<-filter(data,subject_id %in% accurateSubjects) # make a new data frame that contains only the data from accurate subjects
Code to conduct the actual ANOVA of the response time results
model<-ezANOVA(data=Data,dv=rt,within=c(set_size,target_presence,task),wid=subject_id) # conduct a repeated-measures ANOVA on the correct-RT data. dv = dependent variable; within = list of all of the within-subject variables; wid = variable used to group data by subject
model # show results of the ANOVA model
table1 <- tapply(X=Data$rt,INDEX=list(Data$task,Data$set_size),FUN=mean,trim=0.1) # break the means down by task and set size only - less broken down than the earlier tapply call, obtained simply by dropping target_presence from the INDEX list
table1 #show means so that one can begin to interpret the data. You'll break down rtaverages in different ways to get the different mean RTs that you need for your report
par(mar = c(4,4,4,0),mfrow=c(1, 2) ) # mfrow=c(1,2) creates two plots side by side
lineplot.CI(data=filter(rtaverages,task=="conjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Conjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
lineplot.CI(data=filter(rtaverages,task=="disjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Disjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
I am currently attempting to put the 3 lines onto one plot in the following way:
# The next bit of code is to reproduce Treisman and Gelade's Figure 1, including best lines of fit
rtaverages$set_size_num<-sizes[rtaverages$set_size] # add a new column to the rtaverages data frame containing the numeric/continuous version of the nominal/categorical set_size factor, which will be useful for predicting RT from set_size
bySetSize<-group_by(rtaverages,set_size_num,task,target_presence) #collapse even more, so all subjects' data are combined together
collapsed<-summarize(bySetSize,rt=mean(rt,trim=0.1)) # make RT summary
collapsed # show what collapsed data look like. Note that there are now only 4 (set sizes) X 2 (tasks) X 2 (present/absent trials)=16 rows
cp<-filter(collapsed,task=="conjunctive" & target_presence=="present") # plot each of the four lines separately, filtering for the right task/target combination each time
cpf<-lm(data=cp,rt ~ set_size_num) # use a linear model to predict RT from set size. Use this to get out best fitting slope (estimate for set size) and intercept
summary(cpf) # make a summary of the linear regression model. cpf stands for: conjunctive, present fit
cp3<-filter(collapsed,task=="conjunctive" & target_presence=="absent")
caf<-lm(data=cp3,rt ~ set_size_num)
summary(caf)
cp1<-filter(collapsed,task=="disjunctive" & target_presence=="present")
dpf<-lm(data=cp1,rt ~ set_size_num)
summary(dpf)
cp2<-filter(collapsed,task=="disjunctive" & target_presence=="absent")
daf<-lm(data=cp2,rt ~ set_size_num)
summary(daf)
plot(cp$set_size_num,cp$rt,ylim=c(0,4000),xlim=c(0,30),pch=19,col="green",xlab="Set Size",ylab="Response Time (msec.)") # use a big enough range to capture all of the data
abline(cpf, col="green") # add the line with the slope and intercept derived from linear model
lines(cp3$set_size_num,cp3$rt,col="green")
abline(caf, col="green")
lines(cp1$set_size_num,cp1$rt,col="red")
abline(dpf, col="red")
lines(cp2$set_size_num,cp2$rt,col="red")
abline(daf, col="red")
legend(x=0,y=4000,pch=c(19,1,19,1),col=c("green","green","red","red"),cex=0.7,legend=c("Conjunctive present","Conjunctive absent","Disjunctive present","Disjunctive absent")) # The legend should only be plotted once. pch sets the 4 symbols and col sets the 4 colors; cex < 1 keeps the legend box from being too big
I got them to combine, but now the lines have lost their formatting.
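One way to keep each line's format when combining them is to draw the raw means with points() so each subset keeps its own plotting symbol, and to give each fitted line its own line type. This is only a sketch based on the code above; the specific pch and lty choices are assumptions meant to match the legend:
plot(cp$set_size_num,cp$rt,ylim=c(0,4000),xlim=c(0,30),pch=19,col="green",xlab="Set Size",ylab="Response Time (msec.)") # conjunctive present: filled green circles
points(cp3$set_size_num,cp3$rt,pch=1,col="green") # conjunctive absent: open green circles
points(cp1$set_size_num,cp1$rt,pch=19,col="red") # disjunctive present: filled red circles
points(cp2$set_size_num,cp2$rt,pch=1,col="red") # disjunctive absent: open red circles
abline(cpf,col="green",lty=1) # best-fitting lines from the lm() fits, each with its own line type
abline(caf,col="green",lty=2)
abline(dpf,col="red",lty=1)
abline(daf,col="red",lty=2)
legend(x=0,y=4000,pch=c(19,1,19,1),lty=c(1,2,1,2),col=c("green","green","red","red"),cex=0.7,legend=c("Conjunctive present","Conjunctive absent","Disjunctive present","Disjunctive absent"))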

Related

How to remove two data points from a data set that have a large influence on the regression model

I have found two outlier data points in my data set, but I don't know how to remove them. All of the guides I have found online seem to emphasize plotting the data, but my question does not require plotting, only regression model fitting. I am having great difficulty figuring out how to remove the two data points from my data set and then fit a new model to the reduced data.
Here is the code that I have written and the outliers that I found:
library(alr4)
library(MASS)
data(lathe1)
head(lathe1)
y=lathe1$Life
x1=lathe1$Speed
x2=lathe1$Feed
x1_square=(x1)^2
x2_square=(x2)^2
#part A (Box-Cox method shows a log transformation is appropriate)
y.regression=lm(y~x1+x2+x1_square+x2_square+x1:x2) # note: writing (x1)^2 inside a formula does not create a squared term, so use the precomputed squares (or I(x1^2))
mod=boxcox(y.regression, data=lathe1, lambda = seq(-1, 1, length=10))
best.lam=mod$x[which(mod$y==max(mod$y))]
best.lam
#part B (null-hypothesis F-test)
y.regression1_Reduced=lm(log(y)~1)
y.regression1=lm(log(y)~x1+x2+x1_square+x2_square+(x1*x2))
anova(y.regression1_Reduced, y.regression1)
#part D (F-test of log(Y) without beta1)
y.regression2=lm(log(y)~x2+x2_square)
anova(y.regression1_Reduced, y.regression2)
#part E (Cook's distance and refit)
cooks.distance(y.regression1)
Outliers:
9 10
0.7611370235 0.7088115474
I think you may be able (if execution time / corpus size allows it) to loop through your data and copy or remove elements according to your criteria to obtain the desired result, e.g.:
corpus_list_without_outliers = []
for elem in corpus_list:
    if elem.speed <= 10000:  # elem.<any_param_name> <= arbitrary_outlier_value
        # keep it because it is OK :)
        corpus_list_without_outliers.append(elem)
print(corpus_list_without_outliers)
# regression algorithm after
This is how I'd see the situation, but you can invert the if into a remove statement to avoid creating a second list, e.g.:
for elem in list(corpus_list):   # iterate over a copy so removal is safe
    if elem.speed > 10000:       # elem.<any_param_name>
        corpus_list.remove(elem) # drop it from the corpus because it is an outlier :(
print(corpus_list)
# regression algorithm after
Hope this helps!
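For the R workflow actually asked about, a minimal sketch (reusing the lathe1 variables from the question and simply dropping the two observations flagged by Cook's distance) would be:
drop.rows <- c(9, 10)                # the two high Cook's distance observations found above
lathe1.clean <- lathe1[-drop.rows, ] # remove them from the data set
refit <- lm(log(Life) ~ Speed + Feed + I(Speed^2) + I(Feed^2) + Speed:Feed, data = lathe1.clean)
summary(refit)                       # inspect the refitted model
cooks.distance(refit)                # check that no remaining point is overly influential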

r qqp function - why is the 'perfect fit' a flat line on 0?

This may be more of a statistical question than a programming one. I just wanted to make sure I was getting the programming right first.
I have a large count dataset (108 sites with 31 species = 3348 observations), but a lot of these are 0 counts because not every species was present at every site. Log transformation has been suggested to me, but others have also said that you shouldn't log transform count data. Here is my data for the first 8 species (this also contains the very abundant species with the highest counts):
example.abund <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,
0,0,1,0,8,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,2,0,3,1,0,0,0,0,0,0,0,0,0,
2,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,
0,1,0,0,0,28,1,0,1,0,0,1,0,2,0,0,2,0,0,0,1,0,0,0,1,0,0,0,2,0,0,1,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,2,0,1,0,0,8,7,7,1,1,13,0,8,0,3,0,1,1,
1,4,4,0,1,0,1,0,0,0,0,6,5,2,0,2,58,4,2,47,4,0,0,0,2,59,2,0,0,6,1,36,28,2,
1,1,0,6,0,0,2,5,0,0,0,0,87,7,0,1,1,1,0,0,1,1,0,6,11,0,0,0,3,0,4,0,7,2,
0,5,0,4,1,0,1,12,0,2,0,9,0,1,0,0,0,24,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,3,1,0,1,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,15,0,2,
81,0,1,32,26,13,2,61,0,66,2,2,0,17,43,43,0,25,19,2,25,26,91,61,0,13,0,62,186,1,4,22,1,50,3,67,86,11,56,26,74,0,6,8,7,0152,8,14,1,97,1,0,12,11,3,1,1,112,2,35,36,5,61,26,211,15,8,173,17,97,22,18,88,11,1,66,15,3,3,3,2,0,1,0,41,9,14,1,0,38,0,0,51,27,11,38,31,1,0,221,68,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,29,0,0,0,0,
0,82,12,0,0,3,0,9,0,0,164,0,0,0,0,1,0,15,0,0,0,6,56,0,0,0,6,0,0,1,0,5,5,8,
0,4,0,0,6,0,0,2,0,0,3,0,0,0,0,683,0,0,0,0,3,149,252,11,13,195,19,0,59,0,0,1,28,0,
0,0,0,0,0,0,0,0,0,0,31,55,85,0,142,0,44,52,0,0192,0,45,0,0,0,0,0,0,11,2,0,0,6,
0,0,0,0,0,0,0,0,0,0,0,0,0,19,3,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0)
I need to make a mixed model to fit the data, but first I am trying to figure out the most appropriate distribution to use. I was following the steps in this blog, but all of the red lines (meant to represent the 'perfect fit' for each distribution) are coming up as 0 along the entire plot.
My question is: have I coded this correctly, and are there so many 0s in my data that the perfect fit really is 0? Or is there something wrong with the way I have coded it?
Code example:
library(car)   # provides qqp()
library(MASS)  # provides fitdistr()
# add 1 so that the distributions that cannot handle 0s can still be fit
example.abund.1 <- example.abund + 1
hist(example.abund)
qqp(example.abund, "norm")
qqp(example.abund.1, "lnorm") #lognorm
#have to generate estimates of parameters:
nbinom <- fitdistr(example.abund.1, "Negative Binomial")
qqp(example.abund.1, "nbinom", size = nbinom$estimate[[1]], mu = nbinom$estimate[[2]])
poisson <- fitdistr(example.abund.1, "Poisson")
qqp(example.abund.1, "pois", poisson$estimate)
gamma <- fitdistr(example.abund.1, "gamma")
qqp(example.abund.1, "gamma", shape = gamma$estimate[[1]], rate = gamma$estimate[[2]])

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm now trying to analyze the effects of covariates in my model. I find the help file for mlogit's effects() method not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see index(mod1) below).
Can anyone clarify why my data are producing those NAs?
Can anyone help me get effects() to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data are a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Responsive and Officials). Would I just model whatever respondents picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
#Load packages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample (this is the sample code from the help file for effects)
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling
log[ Pr(Item_k is 1st) / Pr(Item_K is 1st) ] = α_k + x^T β_k
for k = 1, ..., K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (The same goes for the reference levels of the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. The structure for ranked data in mlogit is that the first set of records for a person contains all of the options, the second set contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, hence the bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

plot multiple fit and predictions for logistic regression

I am running a logistic regression multiple times over more than 1000 samples taken from a dataset. My question is: what is the best way to show my results? How can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create vector replacing the League (A or N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model whether the player is in league A as a function of Hits, Runs, Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
sp=sample(seq(1:length(PA)),30,replace=FALSE)
fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response when using the default family (gaussian); it should be changed to a factor and the binomial family used.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
sp <- sample(nrow(dataset), 30, replace=FALSE)
fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
family="binomial", data=dataset[sp[1:15],])
preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
truths[[k]] <- dataset$League[sp[16:30]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.
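If you also want a single summary number per set of predictions, a quick sketch (assuming the usual 0.5 cutoff, and that the predicted probabilities are for league "N" as noted above):
# convert predicted probabilities into class labels and compute overall accuracy
pred.class <- ifelse(unlist(preds) > 0.5, "N", "A")
mean(pred.class == as.character(unlist(truths)))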

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer was used, while the "improved" one was used on the plants from which the second sample (y) comes.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted as it was a wrong solution. Instead one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that you're not going to get anywhere close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset.
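A minimal sketch of such a permutation test, reusing x, y and total from above (the one-sided alternative being that the improved fertilizer gives higher yields):
set.seed(1)                             # for reproducibility
obs.diff <- mean(y) - mean(x)           # observed difference in group means
perm.diffs <- replicate(10000, {
  shuffled <- sample(total)             # permute the group labels over the complete dataset
  mean(shuffled[6:11]) - mean(shuffled[1:5])
})
p.value <- mean(perm.diffs >= obs.diff) # one-sided p-value
p.value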
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratification variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
