The data here is a "sales" data set with two columns, price and sales, and 16 rows, for example:
price sales
49 81996
46 91735
50 70830
45 101192
51 78319
47 105369
47 68564
46 95523
49 88834
46 89511
45 107836
52 81410
50 67817
54 59207
50 83310
46 71431
In the code below, dim(my.boot.price) shows that we have 101 values in the input data, but printing results shows 303 rows. Why? Please correct the attached script.
#=== bootstrapping prediction price range
library(boot)
# bootstrap function
my.boot <- function(formula, data, indices, price) {
d <- data[indices,]
fit <- lm(formula, data=d)
my.new.data<-data.frame(price)
pred_interval <- predict(fit, newdata=my.new.data, interval="prediction",
level = 0.95)
colnames(pred_interval)[2:3]<-c("pred.lwr","pred.upr")
# return the prediction
return(pred_interval)
}
###############################################
# run the bootstrap
# determine the single value to bootstrap
my.boot.price=data.frame(price=seq(45,55,.1))
dim(my.boot.price)
results <- boot(data=sales, statistic=my.boot,
R=2000, formula=sales~price, price=my.boot.price)
# view results
results
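A likely explanation (a sketch, not a definitive fix): predict() with interval="prediction" returns a 101 x 3 matrix (fit, pred.lwr, pred.upr), and boot() flattens each replicate into a plain vector, so the printed results list 101 * 3 = 303 statistics. If the intent is one bootstrap statistic per price, the function could return a single column instead:
# sketch: return only the point predictions, so boot() sees 101 statistics
my.boot <- function(formula, data, indices, price) {
  d <- data[indices, ]
  fit <- lm(formula, data = d)
  pred <- predict(fit, newdata = data.frame(price),
                  interval = "prediction", level = 0.95)
  pred[, "fit"]  # a vector of 101 values, not a 101 x 3 matrix
}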
I am trying to run a function while sequentially adding sites (x + i) to a data frame, returning the statistic plus its confidence interval. For example, I want to run a linear model, adding one site at each iteration, to better understand how the additional data from each site influences the fit. However, I want to include every possible set of sites in each iteration so that I can obtain a confidence interval for each iteration. In its current form, my code randomly samples a site, but does not cover all possible sites for a given "x + i" iteration.
I know this particular issue could be addressed with the 'dredge' function. However, ideally I would set this up so that, with some adjustment, I could replace the current linear model with any other function (e.g., metaMDS, diversity).
I am sure there is a better way to do this, but I am a relative newbie to these types of analyses. Any suggestions would be greatly appreciated!
Edit: I have been considering passing the function below through 'boot', although I haven't quite been able to get this loop to work inside boot.
# data
library(dplyr) # for the %>% pipe
set.seed(45)
dat <- data.frame(site=rep(LETTERS[1:6],3), mean=sample(1:20,18), rich=sample(5:32,18))
model <- lm(mean~rich, dat) # the full model
summary(model)
my_vec <- numeric()  # r-squared from each iteration
my_site <- numeric() # number of sites used in each iteration
for(i in seq(from=1, to=6, by=1)){ # increase the number of sites at each iteration
  dat_seq <- dat %>% subset(site %in% sample(levels(as.factor(site)), i)) # subset data to a random draw of i sites
  model <- lm(mean~rich, dat_seq)
  my_vec <- c(my_vec, summary(model)$r.squared)
  my_site <- c(my_site, i)
}
lm_results <- data.frame(sync=my_vec, site_no=my_site)
Something like this might help? Here I generate every combination of sites in the dataset (the combs list), then lapply the model over the subset of the data corresponding to each element. The lower and upper CIs and R^2 are returned.
x <- unique(dat$site)
# every combination of 1 to 6 sites, collected into a single list
combs <- do.call(c, lapply(seq_along(x), combn, x = x, simplify = FALSE))
do.call(rbind, lapply(combs, function(x) {
  dat2 <- dat[dat$site %in% x, ]  # subset to this combination of sites
  mod <- lm(mean ~ rich, dat2)
  data.frame(sites = paste(x, collapse=""),
             lci = confint(mod)["rich", 1],
             uci = confint(mod)["rich", 2],
             r2 = summary(mod)$r.squared)
}))
sites lci uci r2
1 A -8.3174474 7.221600752 0.4453499992
2 B -5.5723683 5.818599482 0.0701472479
3 C -1.8397082 1.928749330 0.0826810176
4 D -3.5504781 2.253774792 0.8895987733
5 E -1.9782218 0.783889792 0.9679338880
6 F -0.3642690 0.202676480 0.9291569087
7 AB -1.0726850 0.631838143 0.1141900799
8 AC -1.0156746 0.486238667 0.1932050717
9 AD -1.3744991 0.089962986 0.5972134174
10 AE -1.3425429 0.359346030 0.3914262598
11 AF -1.2542336 1.094735972 0.0088070439
12 BC -0.3148719 0.536493520 0.1155061842
13 BD -0.8115027 0.263460008 0.3337377806
14 BE -1.0264258 0.376744253 0.2923566879
15 BF -1.1047222 0.961865064 0.0091250127
16 CD -0.9745928 0.341039802 0.3088694252
17 CE -0.9413738 0.549038074 0.1178103209
18 CF -0.8967742 1.165648399 0.0317149663
19 DE -0.8081655 -0.063530819 0.7253472880
20 DF -0.4928491 0.673804531 0.0443092831
21 EF -0.9565739 0.524655918 0.1407909531
22 ABC -0.5962015 0.353999681 0.0493374108
23 ABD -0.8365224 0.110852413 0.3191087122
24 ABE -0.8760695 0.210841908 0.2303024575
25 ABF -0.8266745 0.633602031 0.0137712837
26 ACD -0.9065180 0.066518021 0.3731538462
27 ACE -0.8472338 0.235549937 0.2031338155
28 ACF -0.7522162 0.720252734 0.0003762516
29 ADE -0.9661169 -0.041025998 0.4863258317
30 ADF -0.7657306 0.559208857 0.0190378530
31 AEF -0.8971295 0.489083497 0.0647322193
32 BCD -0.5771897 0.206912590 0.1511964736
33 BCE -0.5802808 0.341276672 0.0509875519
34 BCF -0.5806002 0.737926299 0.0112444750
35 BDE -0.6864459 0.004527069 0.4375645756
36 BDF -0.5930715 0.460544893 0.0124799554
37 BEF -0.8077064 0.411788016 0.0776553121
38 CDE -0.7399438 0.108174895 0.3071099077
39 CDF -0.5535068 0.623295610 0.0028013813
40 CEF -0.6905084 0.598692027 0.0040352416
41 DEF -0.5691343 0.342877359 0.0468583354
42 ABCD -0.6438371 0.095450002 0.2145588181
43 ABCE -0.6248798 0.195737009 0.1195408994
44 ABCF -0.5714679 0.519529413 0.0011238991
45 ABDE -0.7459710 -0.015192501 0.3500598278
46 ABDF -0.6397934 0.354865639 0.0391438801
47 ABEF -0.7297368 0.343203399 0.0605325928
48 ACDE -0.7739688 0.003126375 0.3281841191
49 ACDF -0.6236834 0.433241141 0.0158627591
50 ACEF -0.6696598 0.429949692 0.0230490498
51 ADEF -0.6839477 0.287476657 0.0763805047
52 BCDE -0.5735044 0.083072486 0.2169111169
53 BCDF -0.4853537 0.426339044 0.0020758928
54 BCEF -0.5621108 0.444630022 0.0067151679
55 BDEF -0.5715836 0.240391871 0.0762941714
56 CDEF -0.5364817 0.363030081 0.0181252387
57 ABCDE -0.6208064 0.020647714 0.2391257190
58 ABCDF -0.5292293 0.315066335 0.0225784375
59 ABCEF -0.5621816 0.333684980 0.0228222717
60 ABDEF -0.6093804 0.195345360 0.0867885013
61 ACDEF -0.5890752 0.262323665 0.0502230537
62 BCDEF -0.4898635 0.265972273 0.0305394982
63 ABCDEF -0.5239122 0.198342387 0.0539903463
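As a sanity check on the output above: with six sites there are 2^6 - 1 = 63 non-empty combinations, which matches the 63 rows returned.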
I am trying to create a loop to run compare_means (from the ggpubr library in R) across all columns in a data frame and then select only the significant p.adjusted values, but it does not work well.
Here is some code:
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data<-list()
for (i in 2:length(df3)){
data<-compare_means(df3[[i]] ~ Cluster2, data=df3, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to collect the output and convert it into a data frame with all the information contained in the compare_means output.
Thanks a lot
Try this:
library(ggpubr)
data <- list()
for (i in 2:(length(df3)-1)){  # columns 2 to 5; skips sampleID and Cluster2
  new <- df3[, c(colnames(df3)[i], "Cluster2")]
  colnames(new) <- c("interest", "Cluster2")
  data[[colnames(df3)[i]]] <- compare_means(interest ~ Cluster2, data=new, paired=FALSE,
                                            p.adjust.method="bonferroni", method="wilcox.test")
}
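To get the single data frame the question asks for, the list entries can then be bound together (a small sketch, assuming the loop above ran):
# combine the per-column compare_means results into one data frame
results_df <- do.call(rbind, data)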
I use the caret package with a multi-layer perceptron (MLP).
My dataset consists of a labelled output value, which can be either A,B or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device~., data=dataframetrain[1:100,], method="mlp",
             trControl=trainControl(classProbs=TRUE))
p <- predict(fit, newdata=dataframetest, type="prob")
I thought that the class probabilities for each record must sum to one, but I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me? I don't know what I did wrong.
There's probably nothing wrong; it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret, the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function, the outputs can be converted to probabilities by dividing the predictions by the respective row sums; see this question.
This behavior seems to be true when using method = "mlp" or method = "mlpWeightDecay" but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
# shuffle the rows
iris <- iris[sample(nrow(iris)), ]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
size=5, learnFuncParams=c(0.1), maxit=50,
inputsTest=iris2$inputsTest, targetsTest=iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = F)
modelCaret <- train(y = irisTargets[training],
x = irisValues[training, ],
method = "nnet")
predictionsCaret <- predict(modelCaret,
newdata = irisValues[-training, ],
type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs that sum to one is to use the softmax function as the activation function in the output layer.
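For illustration, here is a minimal sketch of that normalization applied after the fact (the softmax helper is ours, not caret's; note that post-hoc softmax is not equivalent to training with a softmax output layer):
softmax <- function(x) exp(x) / sum(exp(x))
probs_sm <- t(apply(predictions, 1, softmax))  # apply softmax to each row
head(rowSums(probs_sm))  # all 1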
I am a beginner in R. I want to plot the numerical relations between different columns of a data frame.
Currently I have the following data frame:
topN Precision Recall F1Score udim idim tdim
10 50 0.02712121 0.2843955 0.04951998 67 78 50
40 50 0.02515152 0.2584113 0.04584124 67 156 50
70 50 0.02539924 0.2585877 0.04625516 67 234 50
100 50 0.02608365 0.2735997 0.04762680 133 78 50
130 50 0.02431818 0.2504262 0.04433146 133 156 50
160 50 0.02425856 0.2448997 0.04414439 133 234 50
190 50 0.02418251 0.2498824 0.04409746 200 78 50
220 50 0.02342205 0.2436125 0.04273533 200 156 50
250 50 0.02136882 0.2179636 0.03892181 200 234 50
I want to plot the 3D relation between udim, idim and F1Score. I am using the persp() function in R, and I want to make sure that applying t() to z is the right thing to do.
So
x is udim: 67 133 200
y is idim: 78 156 234
z is their corresponding F1Score value in the data frame.
I use the following code:
plot.data <- read.table(plot.file, sep=",", header=T)
# plot.file is the data frame file location
udim <- as.factor(plot.data$udim)
u <-as.integer(levels(udim))
idim <- as.factor(plot.data$idim)
i <- as.integer(levels(idim))
t <- as.integer(levels(as.factor(plot.data$tdim))) # note: this masks t, but t(z) below still finds base::t
# the anonymous function ignores u and i and returns all nine F1Score values,
# relying on the row order of plot.data to fill the 3 x 3 grid
z <- outer(u, i, FUN = function(u, i){
  ss <- subset(plot.data, tdim == 50 & topN == 50) #udim == u & idim == i &
  ss$F1Score
})
persp(u, i, t(z), theta=45, phi=45, shade = 0.45, xlab="user dim",
ylab="item dim", zlab="F1 Score", scale=TRUE)
I got the following plot:
Am I plotting it right?
Is this the easiest/normal way to tackle such a task?
Actually my data frame has more rows with different values of topN and tdim, so is it possible to add one or two more dimensions, say tdim and topN, to reflect the numerical relations between more columns in one plot?
Your graph already looks nice, and I cannot answer your second question.
However, I want to present another option for three-way graphs.
Although they are usually quite confusing, I found an appealing way to make use of 3D scatterplots.
Using scatterplot3d and animation, as well as some third-party software like ImageMagick (http://imagemagick.org), you can create animated 3D scatterplots, which are certainly an option for presenting data on a computer.
A sample for your data (I don't have the animation package installed right now, so I can only give you the syntax for the plot):
library(scatterplot3d)
F1Score <- c(0.04951998,0.04584124,0.04625516,0.04762680,0.04433146,0.04414439,0.04409746,0.04273533,0.03892181)
udim <- c(67,67,67,133,133,133,200,200,200)
idim <- c(78,156,234,78,156,234,78,156,234)
for (j in seq(5, 175, by = 5)) {
scatterplot3d(udim, idim, F1Score, angle = j)
Sys.sleep(0.042) # for 24 fps when looking at it in R
}
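For the animated version alluded to above, something like this should work (a sketch, assuming the animation package and ImageMagick are installed; the file name is arbitrary):
library(animation)
saveGIF({
  for (j in seq(5, 175, by = 5)) {
    scatterplot3d(udim, idim, F1Score, angle = j)
  }
}, interval = 0.042, movie.name = "f1score_rotation.gif")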
I have created a best fit for a non-linear function, and it seems to be working correctly:
#define a function
fncTtr <- function(n,d) (d/n)*((sqrt(1+2*(n/d))-1))
#fit
dFit <- nls(dData$ttr~fncTtr(dData$n,d),data=dData,start=list(d=25),trace=T)
summary(dFit)
plot(dData$ttr~dData$n,main="Fitted d value",pch=19)
xl <- seq(min(dData$n),max(dData$n), (max(dData$n) - min(dData$n))/1000)
lines(xl,predict(dFit,newdata=xl),col="blue")
The plot of my observations comes out correctly, but I am having problems displaying the best-fit curve on the plot. I create the independent variable xl with 1000 values and want to compute the corresponding fitted values from the best fit. When I call lines, I get the error message:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
If I try to execute only the predict function:
a <-predict(dFit,newdata=xl)
str(a)
I can see that xl has 1000 components but "a" has only 16 components. Shouldn't a have the same number of values as xl?
data used:
n ttr d
1 35 0.6951 27.739
2 36 0.6925 28.072
3 37 0.6905 28.507
4 38 0.6887 28.946
5 39 0.6790 28.003
6 40 0.6703 27.247
7 41 0.6566 25.735
8 42 0.6605 26.981
9 43 0.6567 27.016
10 44 0.6466 26.026
11 45 0.6531 27.667
12 46 0.6461 27.128
13 47 0.6336 25.751
14 48 0.6225 24.636
15 49 0.6214 24.992
16 50 0.6248 26.011
OK, I think I found the solution, though I'm not sure I can fully explain it.
When calling predict.nls, whatever you pass to the newdata argument has to be named after the variable you are predicting with (here n), and the name has to match the one used in the original call to nls. Otherwise the predictor isn't found in newdata and predict appears to fall back to the data used for the fit, which would explain the 16 values above.
#Here I replaced dData$n with n
dFit <- nls(ttr~fncTtr(n,d),data=dData,start=list(d=25),trace=T)
plot(dData$ttr~dData$n,main="Fitted d value",pch=19)
xl <- seq(min(dData$n),max(dData$n), (max(dData$n) - min(dData$n))/1000)
a <- predict(dFit,newdata=list(n=xl))
length(a)==length(xl)
[1] TRUE
lines(xl,a,col="blue")