I have used the specaccum() command to develop species accumulation curves for my samples.
Here is some example data:
site1<-c(0,8,9,7,0,0,0,8,0,7,8,0)
site2<-c(5,0,9,0,5,0,0,0,0,0,0,0)
site3<-c(5,0,9,0,0,0,0,0,0,6,0,0)
site4<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site5<-c(5,0,9,0,0,6,6,0,0,0,0,0)
site6<-c(5,0,9,0,0,0,6,6,0,0,0,0)
site7<-c(5,0,9,0,0,0,0,0,7,0,0,3)
site8<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site9<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site10<-c(5,0,9,0,0,0,0,0,0,0,1,6)
site11<-c(5,0,9,0,0,0,5,0,0,0,0,0)
site12<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site13<-c(5,1,9,0,0,0,0,0,0,0,0,0)
species_counts<-rbind(site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,site11,site12,site13)
accum <- specaccum(species_counts, method="random", permutations=100)
plot(accum)
In order to ensure I have sampled sufficiently, I need to make sure the curve of the species accumulation plot reaches an asymptote, defined as a slope of <0.3 between the last two points (i.e., between sites 12 and 13).
results <- with(accum, data.frame(sites, richness, sd))
Produces this:
sites richness sd
1 1 3.46 0.9991916
2 2 4.94 1.6625403
3 3 5.94 1.7513054
4 4 7.05 1.6779918
5 5 8.03 1.6542263
6 6 8.74 1.6794660
7 7 9.32 1.5497149
8 8 9.92 1.3534841
9 9 10.51 1.0492422
10 10 11.00 0.8408750
11 11 11.35 0.7017295
12 12 11.67 0.4725816
13 13 12.00 0.0000000
I feel like I'm getting there. I could generate an lm with site vs richness and extract the exact slope (tangent?) between sites 12 and 13. Going to search a bit longer here.
Streamlining your data generation process a little bit:
species_counts <- matrix(c(0,8,9,7,0,0,0,8,0,7,8,0,
5,0,9,0,5,0,0,0,0,0,0,0, 5,0,9,0,0,0,0,0,0,6,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,0,9,0,0,6,6,0,0,0,0,0,
5,0,9,0,0,0,6,6,0,0,0,0, 5,0,9,0,0,0,0,0,7,0,0,3,
5,0,9,0,0,0,0,0,0,0,1,0, 5,0,9,0,0,0,0,0,0,0,1,0,
5,0,9,0,0,0,0,0,0,0,1,6, 5,0,9,0,0,0,5,0,0,0,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,1,9,0,0,0,0,0,0,0,0,0),
byrow=TRUE,nrow=13)
It's always a good idea to call set.seed() before running randomization tests (and to let us know that specaccum() is in the vegan package):
set.seed(101)
library(vegan)
accum <- specaccum(species_counts, method="random", permutations=100)
Extract the richness and sites components from within the returned object and compute d(richness)/d(sites). (Note that the slope vector is one element shorter than the original sites/richness vectors: be careful if you're trying to match up slopes with particular numbers of sites.)
(slopes <- with(accum,diff(richness)/diff(sites)))
## [1] 1.45 1.07 0.93 0.91 0.86 0.66 0.65 0.45 0.54 0.39 0.32 0.31
In this case the slope never actually drops below 0.3, so this code, which finds the first point at which the slope falls below 0.3:
which(slopes<0.3)[1]
returns NA.
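Since the stated criterion is specifically the slope between the last two points (sites 12 and 13), you can also check that single value directly. A minimal follow-up sketch (my addition, building on the slopes vector above, not part of the original answer):
last_slope <- tail(slopes, 1)  # slope between the last two sites (12 and 13)
last_slope < 0.3               # TRUE would mean the curve has flattened by that definition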
OK, so I have a pretty large data set of around 500 observations and 3 variables. The first column refers to time.
For a test data set I am using:
dat=as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
1,1.8,3.5,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
colnames(dat)=c("Time","Var1","Var2")
Time Var1 Var2
1 1 1.0 2.0
2 2 1.8 4.8
3 3 3.5 6.5
4 4 3.8 8.8
5 5 5.6 10.6
6 6 6.2 12.2
7 7 7.8 14.8
8 8 8.2 16.2
9 9 9.8 18.8
10 10 10.1 20.1
So what I need to do is create a new column in which each observation is the slope, with respect to time, over some number of past points. For example, taking 3 past points it would be something like:
slopeVar1[i]=slope(Var1[(i-2):i],Time[(i-2):i]) #Not real code
slopeVar2[i]=slope(Var2[(i-2):i],Time[(i-2):i]) #Not real code
Time Var1 Var2 slopeVar1 slopeVar2
1 1 1 2 NA NA
2 2 1.8 4.8 NA NA
3 3 3.5 6.5 1.25 2.25
4 4 3.8 8.8 1.00 2.00
5 5 5.6 10.6 1.05 2.05
6 6 6.2 12.2 1.20 1.70
7 7 7.8 14.8 1.10 2.10
8 8 8.2 16.2 1.00 2.00
9 9 9.8 18.8 1.00 2.00
10 10 10.1 20.1 0.95 1.95
I actually got as far as using a for() loop, but for really large data sets (>100,000 rows) it starts taking too long.
The for() loop that I used is shown below:
#CREATE DATA FRAME
rm(dat)
dat=as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
1,1.8,3.333,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
colnames(dat)=c("Time","Var1","Var2")
dat
plot(dat)
#CALCULATE SLOPE OF n POINTS FROM i TO i-n.
#In this case I am taking just 3 points, but it should
#be possible to change the number of points taken.
attach(dat)
n=3 #number for points to take slope
l=nrow(dat) #number of iterations (rows)
y=0
x=0
slopeVar1=NA
slopeVar2=NA
for (i in 1:l) {
if (i<n) {slopeVar1[i]=NA} #For the rows where there are not enough previous observations, it outputs NA
if (i>=n) {
y1=Var1[(i-n+1):i] #y data sets for calculating slope of Var1
y2=Var2[(i-n+1):i]#y data sets for calculating slope of Var2
x=Time[(i-n+1):i] #x data sets for calculating slope of Var1&Var2
z1=lm(y1~x) #Temporal value of slope of Var1
z2=lm(y2~x) #Temporal value of slope of Var2
slope1=as.data.frame(z1[1]) #Temporal value of slope of Var1
slopeVar1[i]=slope1[2,1] #Populating string of slopeVar1
slope2=as.data.frame(z2[1])#Temporal value of slope of Var2
slopeVar2[i]=slope2[2,1] #Populating string of slopeVar2
}
}
slopeVar1 #Checking results.
slopeVar2
(result=cbind(dat,slopeVar1,slopeVar2)) #Binds original data with new calculated slopes.
This code actually outputs what I want; but again, for really large data sets it is quite inefficient.
This quick rollapply implementation seems to speed it up somewhat:
library("zoo")
slope_func <- function(period) {
  x  <- period[, 1]   # Time values in the window
  y1 <- period[, 2]   # Var1 values in the window
  y2 <- period[, 3]   # Var2 values in the window
  c(slopeVar1 = unname(coef(lm(y1 ~ x))[2]),  # slope of Var1 vs Time
    slopeVar2 = unname(coef(lm(y2 ~ x))[2]))  # slope of Var2 vs Time
}
start = Sys.time()
rollapply(dat[1:3], FUN=slope_func, width=3, by.column=FALSE)
end=Sys.time()
print(end-start)
Time difference of 0.04980111 secs
The OP's previous implementation was taking Time difference of 0.2666121 secs for the same data.
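If calling lm() on every window is still too slow for >100,000 rows, a fully vectorized alternative is to compute the closed-form least-squares slope, (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2), from rolling sums. This is a hedged sketch of my own (not from the thread), reusing dat and a window width of 3:
library(zoo)
roll_slope <- function(x, y, n = 3) {
  sx  <- rollsumr(x, n, fill = NA)        # rolling sum of x over the last n points
  sy  <- rollsumr(y, n, fill = NA)        # rolling sum of y
  sxy <- rollsumr(x * y, n, fill = NA)    # rolling sum of x*y
  sxx <- rollsumr(x * x, n, fill = NA)    # rolling sum of x^2
  (n * sxy - sx * sy) / (n * sxx - sx^2)  # closed-form simple-regression slope
}
dat$slopeVar1 <- roll_slope(dat$Time, dat$Var1)
dat$slopeVar2 <- roll_slope(dat$Time, dat$Var2)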
I have plotted the conditional density distribution of my variables using cdplot() in R. My independent variable and my dependent variable are not independent of each other. The independent variable is discrete (it takes only certain values between 0 and 3) and the dependent variable is also discrete (11 levels from 0 to 1 in steps of 0.1).
Some data:
dat <- read.table( text="y x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3 0.3
2 0.3
2.25 0.4
1.75 0.4
1.75 0.5
2 0.5
1.75 0.6
1.75 0.6
1.75 0.7
1 0.7
0.54 0.8
0 0.8
0.54 0.9
0 0.9
0 1.0
0 1.0", header=TRUE, colClasses="factor")
I wonder whether my variables are appropriate for this kind of analysis.
Also, I'd like to know how to report these results in an elegant way that makes academic and statistical sense.
This is a run using the rms package's `lrm` function, which is typically used for binary outcomes but also handles ordered categorical variables:
library(rms) # also loads Hmisc
# first get data in the form you described
dat[] <- lapply(dat, ordered) # makes both columns ordered factor variables
?lrm
#read help page ... Also look at the supporting book and citations on that page
lrm( y ~ x, data=dat)
# --- output------
Logistic Regression Model
lrm(formula = y ~ x, data = dat)
Frequencies of Responses
0 0.54 1 1.75 2 2.25 2.75 3 3.00
4 2 1 5 2 2 4 1 1
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 22 LR chi2 51.66 R2 0.920 C 0.869
max |deriv| 0.0004 d.f. 10 g 20.742 Dxy 0.738
Pr(> chi2) <0.0001 gr 1019053402.761 gamma 0.916
gp 0.500 tau-a 0.658
Brier 0.048
Coef S.E. Wald Z Pr(>|Z|)
y>=0.54 41.6140 108.3624 0.38 0.7010
y>=1 31.9345 88.0084 0.36 0.7167
y>=1.75 23.5277 74.2031 0.32 0.7512
y>=2 6.3002 2.2886 2.75 0.0059
y>=2.25 4.6790 2.0494 2.28 0.0224
y>=2.75 3.2223 1.8577 1.73 0.0828
y>=3 0.5919 1.4855 0.40 0.6903
y>=3.00 -0.4283 1.5004 -0.29 0.7753
x -19.0710 19.8718 -0.96 0.3372
x=0.2 0.7630 3.1058 0.25 0.8059
x=0.3 3.0129 5.2589 0.57 0.5667
x=0.4 1.9526 6.9051 0.28 0.7773
x=0.5 2.9703 8.8464 0.34 0.7370
x=0.6 -3.4705 53.5272 -0.06 0.9483
x=0.7 -10.1780 75.2585 -0.14 0.8924
x=0.8 -26.3573 109.3298 -0.24 0.8095
x=0.9 -24.4502 109.6118 -0.22 0.8235
x=1 -35.5679 488.7155 -0.07 0.9420
There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression. The quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
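For reference, here is a minimal sketch of the MASS::polr alternative mentioned above (a proportional-odds model); it assumes dat has already been converted to ordered factors as shown earlier:
library(MASS)
fit_polr <- polr(y ~ x, data = dat, Hess = TRUE)  # Hess = TRUE so summary() can report standard errors
summary(fit_polr)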
I would like to regress the first column, the market return (as y), against the rest of the columns (as X) and create a data frame with the monthly slope coefficients. My data frame looks like this:
Date Market return AFARAK GROUP PLC AFFECTO OYJ
1/3/2007 -0.45 0.00 0.85
1/4/2007 -0.92 2.47 -0.85
1/5/2007 -1.98 3.98 -1.14
The expected output of the slope coefficient data frame looks like this:
Date AFARAK GROUP PLC AFFECTO OYJ
Jan-07 1 0.5
Feb-07 2 1.5
Mar-07 2 1
Apr-07 3 2
Could someone help me in this regard?
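One rough sketch of an approach (my own, heavily hedged: the object name returns and the column names Date and Market.return are assumptions about your data layout): split the data by month and regress the market return on each stock column within each month.
returns$Month <- format(as.Date(returns$Date, "%m/%d/%Y"), "%b-%y")
stock_cols <- setdiff(names(returns), c("Date", "Market.return", "Month"))
slopes_by_month <- do.call(rbind, lapply(split(returns, returns$Month), function(d) {
  # slope of market return (y) against each stock (x) within one month
  sapply(stock_cols, function(s) unname(coef(lm(d$Market.return ~ d[[s]]))[2]))
}))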
I'm working with biochemical data from subjects, analysing the results by sex. I have 19 biochemical tests to analyse for each sex, for each of two drugs (haematology and anatomy tests coming later).
For reasons of reproducibility of results and for preventing transcription errors, I am trying to summarise each test into one table. Included in the table output, I need a column for the Dunnett post hoc comparison p-values. Because the Dunnett test compares to the control results, with a control and 3 drug levels I only get 3 p-values. However, I have 4 mean and sd values.
Using ddply to get the mean and sd results (having limited the number of significant figures), I get a dataset that looks like this:
Sex<- c(rep("F",4), rep("M",4))
Druglevel <- c(rep(0:3,2))
Sample <- c(rep(10,8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)
I have used glht in the package multcomp to perform the Dunnett tests on the aov object I constructed from undertaking a normal aov. I've extracted the p-values from the glht summary (I've rounded these to three decimal places). The male and female analyses have been run using separate ANOVA so I have one set of output for each sex. The female results are:
femaleR <- c(0.371, 0.973, 0.490)
and the male results are:
maleR <- c(0.862, 0.999, 0.738)
How can I append a column for the p-values to my original dataframe (Drug1Biochem1) so that both femaleR and maleR are in that final column, with row 1 and row 5 of that column empty (i.e. no p-values for the control)?
I wish to output the resulting combination to html, which can be inserted into a Word document so no transcription errors occur. I have set a seed value so that the results of the program are reproducible (when I finally stop debugging).
In summary, I would like a data frame (or table, or whatever I can output to html) that has the following format:
Sex Druglevel Sample Mean sd p-value
F 0 10 0.44 0.07
F 1 10 0.50 0.07 0.371
F 2 10 0.46 0.09 0.973
F 3 10 0.49 0.12 0.490
M 0 10 0.48 0.18
M 1 10 0.55 0.19 0.862
M 2 10 0.47 0.13 0.999
M 3 10 0.57 0.41 0.738
For each test, I wish to reproduce this exact table. There will always be 4 groups per sex, and there will never be a p-value for the control, which will always be summarised in row 1 (F) and row 5 (M).
You could try merge
dN <- data.frame(Sex=rep(c('M', 'F'), each=3), Druglevel=1:3,
pval=c(maleR, femaleR))
merge(Drug1Biochem1, dN, by=c('Sex', 'Druglevel'), all=TRUE)
# Sex Druglevel Sample Mean sd pval
#1 F 0 10 0.44 0.07 NA
#2 F 1 10 0.50 0.07 0.371
#3 F 2 10 0.46 0.09 0.973
#4 F 3 10 0.49 0.12 0.490
#5 M 0 10 0.48 0.18 NA
#6 M 1 10 0.55 0.19 0.862
#7 M 2 10 0.47 0.13 0.999
#8 M 3 10 0.57 0.41 0.738
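As a small follow-up on the HTML output mentioned in the question (my suggestion, not part of the original answer), the merged table can be written to HTML for pasting into Word, for example with the xtable package:
library(xtable)
res <- merge(Drug1Biochem1, dN, by = c("Sex", "Druglevel"), all = TRUE)
print(xtable(res, digits = 3), type = "html", file = "Drug1Biochem1_summary.html")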
I'm having a few issues I'd appreciate some help with.
head(new.data)
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB seasons
1 2 3 1996 1 30.7 0.35 0.5000750 0.75 7.4 0.055 winter
2 6 1 1996 2 24.8 0.25 0.5001375 0.75 6.9 0.200 winter
3 7 4 1996 2 60.4 0.05 0.5001375 0.75 7.1 0.055 winter
4 7 4 1996 2 58.1 0.15 0.5001570 0.75 7.5 0.055 winter
5 7 4 1996 3 62.2 0.20 0.5003881 2.00 7.6 0.055 spring
6 5 2 1996 3 40.3 0.15 0.5003500 2.00 7.7 0.055 spring
library(nlme)
> mod3 <- lme(TTHM ~ CL2_FREE, random= ~ 1| Treatment_Code/WSZ_Code, data=new.data, method ="ML")
> mod3
Linear mixed-effects model fit by maximum likelihood
Data: new.data
Log-likelihood: -1401.529
Fixed: TTHM ~ CL2_FREE
(Intercept) CL2_FREE
54.45240 -40.15033
Random effects:
Formula: ~1 | Treatment_Code
(Intercept)
StdDev: 0.004156934
Formula: ~1 | WSZ_Code %in% Treatment_Code
(Intercept) Residual
StdDev: 10.90637 13.52372
Number of Observations: 345
Number of Groups:
Treatment_Code WSZ_Code %in% Treatment_Code
4 8
> plot(augPred(mod3))
Error in plot(augPred(mod3)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Error in sprintf(gettext(fmt, domain = domain), ...) :
invalid type of argument[1]: 'symbol'
I'm not sure why I get this error. The ranef plot seems OK
plot(ranef(mod3))
But that only gives the value of the random intercepts, no TTHM predictions.
I'm looking for a way to plot the predictions like a typical augPred plot, which would show all the random effects for each zone. Hope that makes sense.
You need a groupedData object to use augPred. I hope this helps.
Best wishes, @CSJCampbell
con <- textConnection("
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB seasons
2 3 1996 1 30.7 0.35 0.5000750 0.75 7.4 0.055 winter
6 1 1996 2 24.8 0.25 0.5001375 0.75 6.9 0.200 winter
7 4 1996 2 60.4 0.05 0.5001375 0.75 7.1 0.055 winter
7 4 1996 2 58.1 0.15 0.5001570 0.75 7.5 0.055 winter
7 4 1996 3 62.2 0.20 0.5003881 2.00 7.6 0.055 spring
5 2 1996 3 40.3 0.15 0.5003500 2.00 7.7 0.055 spring
")
new.data <- read.table(con, header = TRUE)
library(nlme)
new.data.grp <- groupedData(TTHM ~ CL2_FREE | Treatment_Code/WSZ_Code, data = new.data)
mod3 <- lme(TTHM ~ CL2_FREE, random= ~ 1| Treatment_Code/WSZ_Code, data=new.data.grp, method ="ML")
mod3
ap3 <- augPred(mod3)
plot(ap3)
I realize most are probably using ggplot2 and lme4 at this point, but I'm a bit crufty.
Here are a couple of things that I've found working with lists of response variables that are fit using lme().
So, I've been working with a number of response variables that I want to fit to a particular set of inputs. In short, my code looks something like this:
mymodels = list()
for(resp in my_response_vars){
f = as.formula(paste(resp,paste(my_input_vars,collapse='+'),sep='~'))
mymodels[[resp]] = lme(fixed=f,random=~wave|group,method="ML",
data=mydata, na.action=na.exclude)
}
I've been successful in treating the entries in the resulting list as normal lme() objects. The problem comes when I want to plot predictions via augPred(). Specifically, I get the following error:
Error in tapply(object[[nm]], groups, FUN[["numeric"]], ...) :
arguments must have same length
So, after much searching, I decided to have a look under the hood of augPred() via debug(). Here are some of the insights I came to... I'm not sure that these qualify as bugs or if they would require a patch, but I hope they can help others with similar problems.
When calling augPred(), the function looks for the name of the data that was used in the original lme() call, then inherits this object from the parent.frame() via a call to eval(). I'm not sure if this defaults to the object frame or the global environment, but when I change this to data = object$data in the debugger, things work. So, ostensibly, if you have used a subset of these data in your model, it will call on the full set of data.
The above causes issues if one response has missing values and you are interested in one that does not. Since it includes everything in the data.frame as part of an eventual call to gsummary(), the missing values in the non-response variable will throw a wrench into things.
So, missing values mess things up. I have defaulted to making a temporary data.frame with the columns of interest, then running complete.cases() on this prior to fitting the lme() model.
mymods = list()
for(resp in my_response_vars){
f = as.formula(paste(resp,paste(my_input_vars,collapse='+'),sep='~'))
v2keep = all.vars(f) # grab terms
smdat = mydata[,c(v2keep,'group')] # include group
smdat=smdat[complete.cases(smdat),] # scrub missing
tmpmod = lme(fixed=f, random=~wave|group,
method='ML', data=smdat)
mymods[[resp]] = tmpmod
# include augPred() call here
}
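For the augPred() call referenced in the comment above, a minimal sketch of what that step might look like (the choice of primary = ~wave is an assumption based on the random-effects formula, not the author's exact code):
mypreds <- list()
for (resp in names(mymods)) {
  # supplying a primary covariate avoids the groupedData requirement noted below
  mypreds[[resp]] <- augPred(mymods[[resp]], primary = ~wave)
  plot(mypreds[[resp]])
}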
If you are not including a primary argument in your call to augPred(), it will require that your data.frame is a groupedData object.
So, if you are running into the "arguments must have same length" error, try subsetting your data first under a different name, and make sure to clear out rows with missing values explicitly prior to fitting your model.