Resampling and replacement while sequentially adding levels of a factor - r

I am trying to run a function while sequentially adding sites (x+i) to a dataframe, which would result in the statistic plus the confidence intervals. For example, if I want to run a linear model with which I sequentially add a site to every iteration to better understand how the additional data from every site influences the fit. However, I want to include every possible site in each iteration to obtain the confidence interval for each iteration. In its current form, I am able to randomly sample a site, but not all possible sites for a given "x + i" iteration.
I know this particular issue could be addressed with the 'dredge' function. However, ideally I would set this up in a way so that I could easily [with some adjustment] replace the current linear model function with any other function (e.g., metaMDS, diversity).
I am sure there is a better way to perform this, but I am a relative newbie to these types of analyses. Any suggestions would be greatly appreciated!
Edit: I have been considering passing the below function through 'boot' although I haven't quite been able to get this loop to function in boot.
# data
set.seed(45)
dat <- data.frame(site=rep(LETTERS[1:6],3),mean=sample(1:20,18),rich=sample(5:32,18))
model<-lm(mean~rich,dat) # the full model
summary(model)
my_vec <- character() # Create empty character vector
my_site <- character() # Create empty character vector
for(i in seq(from=1, to=6, by=1)){ # increase number of sites at each iteration
dat_seq<-dat %>% subset(site %in% sample(levels(as.factor(site)), i)) # subset data based on number of sites
model<-lm(mean~rich,dat_seq)
result<-summary(model)$r.squared
my_out<-result
my_vec<-c(my_vec,my_out)
my_site<-c(my_site,i)
lm_results<-data.frame(sync=my_vec, site_no = my_site)
}

Something like this might help? Here I generate every combination of sites in the dataset (the combs list) then I lapply the model to the subset of the data corresponding to each element. The upper and lower CI and R^2 are returned.
x <- unique(dat$site)
combs <- do.call(c, lapply(seq_along(x), combn, x = x, simplify = FALSE))
do.call(rbind, lapply( combs , function(x) {
dat2 = dat[dat$site %in% x,]
mod = lm(mean~rich, dat2)
data.frame(sites=paste(x, collapse=""),
lci=confint(mod)["rich",1],
uci=confint(mod)["rich",2],
r2=summary(mod)$r.squared)
})
)
sites lci uci r2
1 A -8.3174474 7.221600752 0.4453499992
2 B -5.5723683 5.818599482 0.0701472479
3 C -1.8397082 1.928749330 0.0826810176
4 D -3.5504781 2.253774792 0.8895987733
5 E -1.9782218 0.783889792 0.9679338880
6 F -0.3642690 0.202676480 0.9291569087
7 AB -1.0726850 0.631838143 0.1141900799
8 AC -1.0156746 0.486238667 0.1932050717
9 AD -1.3744991 0.089962986 0.5972134174
10 AE -1.3425429 0.359346030 0.3914262598
11 AF -1.2542336 1.094735972 0.0088070439
12 BC -0.3148719 0.536493520 0.1155061842
13 BD -0.8115027 0.263460008 0.3337377806
14 BE -1.0264258 0.376744253 0.2923566879
15 BF -1.1047222 0.961865064 0.0091250127
16 CD -0.9745928 0.341039802 0.3088694252
17 CE -0.9413738 0.549038074 0.1178103209
18 CF -0.8967742 1.165648399 0.0317149663
19 DE -0.8081655 -0.063530819 0.7253472880
20 DF -0.4928491 0.673804531 0.0443092831
21 EF -0.9565739 0.524655918 0.1407909531
22 ABC -0.5962015 0.353999681 0.0493374108
23 ABD -0.8365224 0.110852413 0.3191087122
24 ABE -0.8760695 0.210841908 0.2303024575
25 ABF -0.8266745 0.633602031 0.0137712837
26 ACD -0.9065180 0.066518021 0.3731538462
27 ACE -0.8472338 0.235549937 0.2031338155
28 ACF -0.7522162 0.720252734 0.0003762516
29 ADE -0.9661169 -0.041025998 0.4863258317
30 ADF -0.7657306 0.559208857 0.0190378530
31 AEF -0.8971295 0.489083497 0.0647322193
32 BCD -0.5771897 0.206912590 0.1511964736
33 BCE -0.5802808 0.341276672 0.0509875519
34 BCF -0.5806002 0.737926299 0.0112444750
35 BDE -0.6864459 0.004527069 0.4375645756
36 BDF -0.5930715 0.460544893 0.0124799554
37 BEF -0.8077064 0.411788016 0.0776553121
38 CDE -0.7399438 0.108174895 0.3071099077
39 CDF -0.5535068 0.623295610 0.0028013813
40 CEF -0.6905084 0.598692027 0.0040352416
41 DEF -0.5691343 0.342877359 0.0468583354
42 ABCD -0.6438371 0.095450002 0.2145588181
43 ABCE -0.6248798 0.195737009 0.1195408994
44 ABCF -0.5714679 0.519529413 0.0011238991
45 ABDE -0.7459710 -0.015192501 0.3500598278
46 ABDF -0.6397934 0.354865639 0.0391438801
47 ABEF -0.7297368 0.343203399 0.0605325928
48 ACDE -0.7739688 0.003126375 0.3281841191
49 ACDF -0.6236834 0.433241141 0.0158627591
50 ACEF -0.6696598 0.429949692 0.0230490498
51 ADEF -0.6839477 0.287476657 0.0763805047
52 BCDE -0.5735044 0.083072486 0.2169111169
53 BCDF -0.4853537 0.426339044 0.0020758928
54 BCEF -0.5621108 0.444630022 0.0067151679
55 BDEF -0.5715836 0.240391871 0.0762941714
56 CDEF -0.5364817 0.363030081 0.0181252387
57 ABCDE -0.6208064 0.020647714 0.2391257190
58 ABCDF -0.5292293 0.315066335 0.0225784375
59 ABCEF -0.5621816 0.333684980 0.0228222717
60 ABDEF -0.6093804 0.195345360 0.0867885013
61 ACDEF -0.5890752 0.262323665 0.0502230537
62 BCDEF -0.4898635 0.265972273 0.0305394982
63 ABCDEF -0.5239122 0.198342387 0.0539903463

Related

scatter3d(): chol.default(shape) error: leading minor of order not positive definite

I am currently trying to do a principle coordinate analysis (PCoA) in R. I am very new to R and am still trying to learn syntax and code. I was successful in running the PCoA and got it to plot, and am now trying to visualize the PCoA in a 3D space using the scatter3d() function.
I successfully ran the PCoA with the following code
#Running the PCoA
library(vegan)
library(labdsv)
Gowerdist <- vegdist(data.frame,method="gower", na.rm= TRUE)
pcotest <- pco(Gowerdist,k=4)
summary(pcoaTESTplot)
write.csv(pcotest$points,'pcotestPOINTS.csv')
#Plotting PcoA
library(ggplot2)
pcoaTESTplot <- read.csv("pcotestPOINTS.csv")
ggplot(pcoaTESTplot, aes(x=V1, y=V2, color=Species)) + geom_point() + geom_text(aes(label=Species),hjust=0, vjust=0)
The pcotestPOINTS.csv plots normally in a 2D plane and I group it by Species, and has the following values (I am new to R, and do not know how to write this out as code; suggestions would be helpful):
Species V1 V2 V3 V4
1 cf_M -0.031781895 -0.014792286 -0.004503777 -0.012610220
2 C -0.091464004 -0.134006338 -0.017100030 0.049538102
3 C -0.142280811 -0.071970920 0.057220986 0.015636930
4 G 0.127901175 -0.056155450 -0.018575333 0.015381534
5 G 0.116318613 -0.125552537 0.036418773 -0.098754726
6 G 0.212966778 -0.097406669 -0.023185002 0.081309634
7 G 0.063114834 -0.052422944 -0.027281979 -0.013183572
8 G 0.164193441 -0.145067313 0.047893500 -0.075261012
9 G 0.125573983 -0.030635914 -0.003522366 0.055693725
10 C -0.175866887 -0.049829963 -0.032233067 0.033557543
11 cf_M -0.135541377 0.055739251 -0.089503580 0.048764398
12 C -0.177278483 -0.022729224 -0.036536839 0.056107016
13 C -0.213010465 -0.048179837 -0.066925006 0.044377553
14 C -0.150118314 -0.011262976 0.052875986 0.078814272
15 C -0.052938204 -0.032302610 0.031115540 0.041222419
16 cf_M -0.060527464 0.047843822 -0.032686702 -0.116874986
17 cf_M -0.104463064 -0.056349285 0.031957309 -0.059974654
18 C -0.110412784 -0.023630954 0.005149408 0.044280367
19 cf_M -0.120946082 0.060083837 -0.085371294 -0.130249238
20 cf_M -0.052607412 -0.035729934 0.034557754 0.039291800
21 M -0.098428805 0.227005817 0.012707286 0.015943080
22 G 0.111732258 -0.105793117 -0.078062124 0.018757562
23 G 0.104440727 -0.043103550 -0.054803773 0.040568053
24 G 0.114630615 -0.102812853 0.029796076 -0.025098120
25 cf_G 0.041189558 -0.109686712 -0.081449510 0.012694654
26 G 0.139372615 -0.073429675 -0.035514832 -0.021797285
27 cf_G 0.049630172 -0.120238042 -0.082500823 -0.025354457
28 G 0.131962913 -0.079345351 -0.038031678 0.032418512
29 G 0.145388151 -0.073033647 -0.006097915 0.016838026
30 G 0.153083521 -0.080719015 0.009411666 0.013890614
31 G 0.163658995 -0.056128193 0.014838792 0.019248676
32 G 0.175740848 -0.055809349 -0.085783874 0.042118869
33 M 0.122374853 0.121760579 0.000972723 -0.048284135
34 M 0.073623753 0.083966711 -0.048553107 0.014595662
35 cf_M 0.002493609 -0.019775472 0.048228606 -0.107557856
36 cf_M -0.142542791 -0.048504297 -0.033862597 0.014891024
37 M 0.073067507 0.175692122 -0.032429380 -0.013033796
38 M 0.049394837 0.048055305 -0.048492332 0.024362833
39 M 0.043374473 0.148914450 -0.071568319 0.076386040
40 M 0.100479924 0.101136266 -0.000714071 0.069775037
41 C -0.095274095 -0.066087291 0.126446794 -0.054039041
42 C -0.050515560 -0.075369130 0.075846115 0.004257934
43 cf_C -0.120209368 -0.044737012 -0.015814314 0.029790605
44 M 0.033819722 0.077098451 0.103200615 0.001797658
45 M 0.099041728 0.127793360 0.123679516 -0.092233055
46 C -0.119684548 -0.071573066 0.020774450 0.045440300
47 M 0.080064569 0.158117147 0.050984478 0.049517871
48 M 0.073061563 0.179736841 0.061438231 -0.085872914
49 M 0.066196996 0.126650019 -0.073256733 0.050736463
50 M -0.017180859 0.092915512 -0.062340826 0.030966866
51 M 0.007313941 0.030544171 0.034107786 -0.008451064
52 M 0.030077136 0.091946729 0.019021861 -0.037148376
53 M 0.181104379 0.154261866 0.184970234 0.152371966
54 cf_M -0.076461621 0.038913381 -0.094850112 -0.075737783
55 cf_M -0.077452675 0.058624603 -0.104210238 -0.028904142
56 C -0.136410016 -0.068696015 0.032681381 0.027559673
57 cf_M -0.084262114 0.025497711 -0.046012632 -0.090147470
58 C -0.099403208 -0.049318827 0.047823149 -0.074616210
59 cf_C -0.151949338 0.003355951 -0.074866137 0.026535190
60 M -0.048272207 0.035885684 -0.036572954 -0.024464274
61 M 0.035272332 0.137994016 0.048921034 -0.033152910
62 M 0.061062726 0.088220032 0.027235884 0.006511185
63 cf_M -0.022678804 0.096566014 -0.089668642 -0.032362149
64 M 0.100783139 0.070006730 0.086195185 -0.022204185
65 cf_C -0.009137953 0.017062431 -0.050115368 -0.133785442
66 cf_M -0.107810732 -0.068024004 0.021125172 0.021052237
67 G 0.095668772 -0.138675431 -0.028579849 -0.076913412
68 M -0.027020841 0.069674169 -0.021508615 0.032142949
69 C -0.226937501 -0.080085817 0.216765725 0.015425306
70 G 0.203314776 -0.110344554 0.079133253 0.040076830
71 C -0.153490987 -0.013755267 0.165370191 -0.036327947
72 G 0.113580066 -0.166450142 -0.014627538 -0.018557855
73 M -0.132917211 0.008685202 0.031339457 0.058982043
74 cf_M -0.000375639 0.030195173 -0.024656948 0.018778677
75 C -0.159551518 -0.026830563 -0.020288912 0.049217439
76 M 0.057460058 0.096136625 0.006413249 -0.029953721
77 cf_M -0.066324419 0.070271569 -0.083959037 0.025280882
When I use the scatter3d() function:
library(scatterplot3d)
library(plot3D)
scatter3d(x = pcoaTESTplot$V1, y = pcoaTESTplot$V2, z = pcoaTESTplot$V3,
point.col = "blue", groups = pcoaTESTplot$Species, ellipsoid = TRUE, grid = TRUE, surface = FALSE)
I get all species plotted but only the "C" species gets an ellipse, along with the error
Error in chol.default(shape) :
the leading minor of order 3 is not positive definite
I have tried changing rearrange values or see if it has to do with values being close to zero, but I know someone who ran the same code with similar numbers and all groups had an ellipse. I also tried taking the "cf_G" which only has 2 points and grouping it with "G" to see if there was an error due to an ellipse trying to be formed on two points, but I still got the same error. Does anyone know what the error is coming from? Thank you! (Apologies for any rough code/syntax...)
Making this answer in case it will help others:
I have found the solution. I thought by combining the "cf_G" with the "G" would fix the issue, as "cf_G" only has two points. This didn't fix the issue because this was only part of the problem. I had to do this, as well as combine the "cf_C" with the "cf_M" because "cf_C" only had 3 points. So some of my groupings did not have enough points to plot the ellipsoid. And fortunately, this makes biological sense with how I made these species groupings.
So the aforementioned changes that were made were done by editing column 1 in the PCoA file with the gower distances.

the data has 101 values why result has 303 values?

Data here is "sales" data with two columns price and sales and 16 rows
for eg,
price sales
49 81996
46 91735
50 70830
45 101192
51 78319
47 105369
47 68564
46 95523
49 88834
46 89511
45 107836
52 81410
50 67817
54 59207
50 83310
46 71431
in the code below, Using “dim(my.boot.price)” it can be shown that we have 101 values in the input data
But
Results
Shows us we have 303 rows. Why? Please correct the script attached.
#=== bootstrapping prediction price range
library(boot)
# bootstrap function
my.boot <- function(formula, data, indices, price) {
d <- data[indices,]
fit <- lm(formula, data=d)
my.new.data<-data.frame(price)
pred_interval <- predict(fit, newdata=my.new.data, interval="prediction",
level = 0.95)
colnames(pred_interval)[2:3]<-c("pred.lwr","pred.upr")
# return the prediction
return(pred_interval)
}
###############################################
# run the bootstrap
# determine the single value to bootstrap
my.boot.price=data.frame(price=seq(45,55,.1))
dim(my.boot.price)
results <- boot(data=sales, statistic=my.boot,
R=2000, formula=sales~price, price=my.boot.price)
# view results
results

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds so there IDs repeat in Vehicle ID column. In total there are 2169 vehicles. I filtered the 'Vehicle velocity' column for every vehicle (using for loop) which resulted in a new column with first and last 30 values removed (per vehicle) . In order to bind it with original data frame, I removed the first and last 30 values of table too and then using cbind() combined them. This works for one last vehicle. I want this smoothing and column binding for all vehicles and finally I want to combine all the data frames of vehicles into one single table. That means rowbinding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r
}
for (i in unique(traj1$'Vehicle ID')){
veh <- subset (traj1, traj1$'Vehicle ID'==i)
svel <- smooth(veh$'Vehicle velocity',30,10)
svel <- data.frame(svel)
veh <- head(tail(veh, -30), -30)
fta <- cbind(veh,svel)
}
'fta' now only shows the data frame for last vehicle. But I want all data frames (for all vehicles 'i') combined by row. May be for loop is not the right way to do it but I don't know how can I use tapply (or any other apply function) to do so many things same time.
EDIT
I can't reproduce my dataset here but 'Orange' data set in R could provide good analogy. Using the same smoothing function, the for loop would look like this (if 'age' column is smoothed and 'Tree' column is equivalent to my 'Vehicle ID' coulmn):
for (i in unique(Orange$Tree)){
tre <- subset (Orange, Orange$'Tree'==i)
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb <- cbind(tre,age2)}
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that you could save all the results in a list and then do.call an rbind:
comb <- list() ### create list to save the results
length(comb) <- length(unique(Orange$Tree))
##Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))){
tre <- subset (Orange, Tree==unique(Orange$Tree)[i])
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb[[i]] <- cbind(tre,age2) ### save results in the list
}
final.data<-do.call("rbind", comb) ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data<-ddply(Orange, .(Tree), tail, n=-2)
data<-ddply(data, .(Tree), head, n=-2)
data<- cbind(data,
age2=matrix(sapply(split(Orange$age, Orange$Tree), smooth, D=2, delta=0.67), ncol=1, byrow=FALSE))

plot new values for best fit nonlinear curve

I have created the best fit for a non linear function. It seems to be working correctly:
#define a function
fncTtr <- function(n,d) (d/n)*((sqrt(1+2*(n/d))-1))
#fit
dFit <- nls(dData$ttr~fncTtr(dData$n,d),data=dData,start=list(d=25),trace=T)
summary(dFit)
plot(dData$ttr~dData$n,main="Fitted d value",pch=19,)
xl <- seq(min(dData$n),max(dData$n), (max(dData$n) - min(dData$n))/1000)
lines(xl,predict(dFit,newdata=xl,col=blue)
The plot for my observations are coming out correctly. I am having problems to display the best fit curve on my plot. I create the xl independent variable with 1000 values and I want to define the new values using the best fit. When I call the "lines" procedure, I get the error message:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
If I try to execute only the predict function:
a <-predict(dFit,newdata=xl)
str(a)
I can see that xl has 1000 components but "a" has only 16 components. Shouldn't I have the same number of values in a?
data used:
n ttr d
1 35 0.6951 27.739
2 36 0.6925 28.072
3 37 0.6905 28.507
4 38 0.6887 28.946
5 39 0.6790 28.003
6 40 0.6703 27.247
7 41 0.6566 25.735
8 42 0.6605 26.981
9 43 0.6567 27.016
10 44 0.6466 26.026
11 45 0.6531 27.667
12 46 0.6461 27.128
13 47 0.6336 25.751
14 48 0.6225 24.636
15 49 0.6214 24.992
16 50 0.6248 26.011
Ok, I think I found the solution, however I'm not sure I would be able to explain it.
When calling predict.nls, what you're inputting to argument newdata has to be named according to the variable with which you're predicting (here n) and the name has to match that given in the original call to nls.
#Here I replaced dData$n with n
dFit <- nls(ttr~fncTtr(n,d),data=dData,start=list(d=25),trace=T)
plot(dData$ttr~dData$n,main="Fitted d value",pch=19,)
xl <- seq(min(dData$n),max(dData$n), (max(dData$n) - min(dData$n))/1000)
a <- predict(dFit,newdata=list(n=xl))
length(a)==length(xl)
[1] TRUE
lines(xl,a,col="blue")

get a surface plot in R

I'm trying to obtain a surface plot from data frame AAA:
j a m p o f
13929 0.86739583 19 165.83 0.1588727 13.24444
13930 0.63166667 19 178.19 0.6105804 12.68333
13932 0.90212963 17 157.77 0.3345627 12.52222
13933 0.80152778 68 146.19 0.1219885 12.35000
13934 0.75784722 62 134.88 0.1531627 12.36667
13935 0.57763889 66 123.80 0.4093869 12.47500
13936 0.56201389 88 112.87 0.9095722 12.45833
13937 0.51680556 26 102.03 0.8494420 12.37500
13938 0.46093333 28 91.20 0.9153419 12.21111
13939 0.16645833 24 80.30 0.8309784 12.04444
13940 0.15451389 36 69.23 2.2847927 12.15556
13941 0.51347222 134 57.92 2.9551087 12.42500
13942 0.33763889 128 46.31 3.5784096 12.53333
13943 0.12937500 38 34.33 3.7371723 12.47778
13944 0.42760870 63 22.00 4.7831677 12.46667
13945 0.09962121 8 9.36 4.8281897 12.30000
13950 0.97901515 18 57.70 0.0000000 12.15833
13951 0.85333333 14 71.07 0.0000000 12.48333
13952 0.92811594 14 84.28 10.0444672 12.49167
13953 0.84812500 42 97.29 7.8020987 12.51667
My code:
require(fields)
fitx <- Tps( AAA[, 4:6], AAA$a)
out.p <- predict.surface(fitx, xy = c(4,5))
plot.surface(out.p, type="p")
However, it doesn't run through. Apparently, the grid is insufficient to represent the data and it's not able to get the predict.surface.
In the Tps function, your x matrix is AAA[, 4:6] and hence has three columns.
But in the predict.surface function, you specified xy = c(4,5). The values passed to the xyparamter are relative to the matrix in your fitx object. Since the matrix used for creating fitx with the predict.surface function has three columns, you can't refer to the 4th and 5th column. Instead, the columns 4 and 5 of your original data.frame AAA correspond to columns 1 and 2 in fitx.
You might wish to try:
library(fields)
fitx <- Tps(AAA[, 4:6], AAA$a)
out.p <- predict.surface(fitx, xy = c(1,2)) # Note the different argument passed to `xy`
plot.surface(out.p, type="p")

Resources