I'm trying to fit a model using loess, and I'm getting errors such as "pseudoinverse used at 3", "neighborhood radius 1", and "reciprocal condition number 0". Here's a MWE:
x = 1:19
y = c(NA,71.5,53.1,53.9,55.9,54.9,60.5,NA,NA,NA
,NA,NA,178.0,180.9,180.9,NA,NA,192.5,194.7)
fit = loess(formula = y ~ x,
control = loess.control(surface = "direct"),
span = 0.3, degree = 1)
x2 = seq(0,20,.1)
library(ggplot2)
qplot(x=x2
,y=predict(fit, newdata=data.frame(x=x2))
,geom="line")
I realize I can fix these errors by choosing a larger span value. However, I'm trying to automate this fit, as I have about 100,000 time series (each of length about 20) similar to this. Is there a way that I can automatically choose a span value that will prevent these errors while still providing a fairly flexible fit to the data? Or, can anyone explain what these errors mean? I did a bit of poking around in the loess() and simpleLoess() functions, but I gave up at the point when C code was called.
Compare fit$fitted to y and you'll notice that something is wrong with your regression. Choose an adequate bandwidth, otherwise loess will just interpolate the data. With too few data points, the local linear function behaves like a constant on a small bandwidth and triggers collinearity; hence the warnings about pseudoinverses and singularities. You won't see such errors if you use degree = 0 or ksmooth. One intelligible, data-driven choice of span is cross-validation (see the sketch at the end of this answer), about which you can ask at Cross Validated.
> fit$fitted
[1] 71.5 53.1 53.9 55.9 54.9 60.5 178.0 180.9 180.9 192.5 194.7
> y
[1] NA 71.5 53.1 53.9 55.9 54.9 60.5 NA NA NA NA NA 178.0
[14] 180.9 180.9 NA NA 192.5 194.7
You see an over-fit (a perfect fit) because the number of parameters in your model is as large as the effective sample size.
fit
#Call:
#loess(formula = y ~ x, span = 0.3, degree = 1, control = loess.control(surface = "direct"))
#Number of Observations: 11
#Equivalent Number of Parameters: 11
#Residual Standard Error: Inf
Or you might as well just use the automated geom_smooth (though setting geom_smooth(span = 0.3) again throws warnings):
ggplot(data=data.frame(x, y), aes(x, y)) +
geom_point() + geom_smooth()
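For reference, here is a minimal sketch of the cross-validation idea, assuming the x and y from the question; spans whose loess fit errors or warns are simply skipped:
leave-one-out CV over a grid of spans
spans <- seq(0.3, 1, by = 0.05)
ok <- which(!is.na(y))
cv_err <- sapply(spans, function(s) {
  errs <- sapply(ok, function(i) {
    yi <- y; yi[i] <- NA   # hold out one observation
    f <- tryCatch(loess(yi ~ x, span = s, degree = 1,
                        control = loess.control(surface = "direct")),
                  warning = function(w) NULL, error = function(e) NULL)
    if (is.null(f)) return(NA)
    (predict(f, newdata = data.frame(x = x[i])) - y[i])^2
  })
  mean(errs, na.rm = TRUE)
})
spans[which.min(cv_err)]   # data-driven span choice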
Is there an efficient way in R to calculate relative risk and a 95% CI for multiple rows of data?
Data
test <- data.frame(Class = c("Grade 1", "Grade 2", "Grade 3"),
Male = c(39815, 0, 39815),
Pass_m = c(7743, 0, 4993),
Pct_male = c(19, 0, 12),
Female = c(26462, 0, 26462),
Pass_f = c(3929, 0, 2530),
Pct_female = c(14, 0, 9))
Currently, my approach has been to do one set at a time.
riskratio.wald(table(Pass_m,Pass_f))
This is painstakingly inefficient: I have over 500 rows of data and would appreciate it if anyone could suggest an efficient way to solve this.
Also, is the following calculation and interpretation correct?
risk = number of males who pass divided by the population total in each group.
relative risk = risk of one group / risk of the other group.
For the first row, the relative risk is 19/14 = 1.36, so males are 1.36 times as likely to pass in Grade 1 compared to females (RR = 1.36).
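As a quick sanity check of the first row in R, using the raw counts rather than the rounded percentage columns:
risk_m <- 7743 / 39815   # risk for males, ~0.194
risk_f <- 3929 / 26462   # risk for females, ~0.148
risk_m / risk_f          # ~1.31
The rounded percentages give 19/14 = 1.36, while the unrounded counts give an RR closer to 1.31.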
Looks like a job for apply. It passes each margin element (1 for rows) successively to the function.
apply(test[, c(3, 6)], 1, riskratio.wald)
This of course assumes that riskratio.wald returns a single value. Since you didn't say what package it's from, that's my hope, but you could improve the question by being more complete.
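For what it's worth, if riskratio.wald is the one from the epitools package, it expects a 2x2 table of counts rather than two raw columns, so a per-row sketch might look like this (assuming Pass_m and Pass_f are pass counts out of the Male and Female totals, with females as the reference group):
library(epitools)

rr_by_row <- t(apply(test[, -1], 1, function(r) {
  tab <- matrix(c(r["Female"] - r["Pass_f"], r["Pass_f"],    # reference: fail, pass
                  r["Male"]   - r["Pass_m"], r["Pass_m"]),   # comparison: fail, pass
                nrow = 2, byrow = TRUE)
  out <- try(riskratio.wald(tab)$measure[2, ], silent = TRUE)   # estimate, lower, upper
  if (inherits(out, "try-error")) rep(NA_real_, 3) else out
}))
The try() guards against degenerate rows such as the all-zero Grade 2 row.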
Is this what you are looking for? See also this very good answer by Ben Bolker:
How to calculate relative risk in R?
epitab(as.matrix(test[ , c(3,6)]),method="riskratio")
Output:
$tab
Pass_m p0 Pass_f p1 riskratio lower upper p.value
[1,] 7743 0.6633825 3929 0.3366175 1.0000000 NA NA NA
[2,] 0 NaN 0 NaN NaN NaN NaN 1.000000
[3,] 4993 0.6636980 2530 0.3363020 0.9990626 0.9592189 1.040561 0.975038
$measure
[1] "wald"
$conf.level
[1] 0.95
$pvalue
[1] "fisher.exact"
This free Excel calculator does the trick, albeit manually. I wish someone would make an R package or Shiny app from it:
https://www.scalestatistics.com/relative-risk.html
You can use the epitools package if you're doing a simple risk ratio calculation with only 2 variables, but if you intend to put multiple variables into your regression model then you probably need a log-binomial regression, which can be achieved with the glm function.
# this builds the log-binomial model and saves it in rrmodel
rrmodel <- glm(dependent_var ~ pred1 + pred2 + ... + pred_n,
               data = dataset, family = binomial(link = "log"))
rrsumm <- summary(rrmodel)   # the estimates are on the log scale
Remember to exponentiate to get the risk ratio. Here is sample code for returning the risk ratio, the lower and upper confidence bounds, and the p-value:
cbind(
RiskRatio=exp(rrsumm$coefficients[, 1]),
LCB=exp(rrsumm$coefficients[, 1]-1.96*rrsumm$coefficients[, 2]),
UCB=exp(rrsumm$coefficients[, 1]+1.96*rrsumm$coefficients[, 2]),
pvalue=rrsumm$coefficients[, 4]
)
I have a growth dataset based on recaptures. There are columns for the capture length, the recapture length, and the time (in years) between capture and recapture.
> str(data)
'data.frame': 60 obs. of 3 variables:
$ sizecapture : num 40.3 43 38.3 41.5 37.6 ...
$ sizerecapture: num 43 48.7 39.5 42 46.7 43.5 43.5 47.2 45.7 59.9 ...
$ timeinterval : num 0.945 1.036 0.997 0.997 2.471 ...
I am following the Ogle (2013) vignette (http://derekogle.com/fishR/examples/oldFishRVignettes/VonBertalanffyExtra.pdf) in R for the Fabens method of deriving size at age. For this method I don't need an initial age (as I don't know age at all). I am not interested in extrapolating, only in estimating the age of individuals that I have sizes for.
I can easily follow the instructions for calculating the two parameters needed to inform the nls model: the k and the Linf. My aim is to create an age-at-length curve from the growth data, but I get errors when I try fitPlot: "Error in mdl$model[[gpos[2]]] : subscript out of bounds". I have also tried curve() and get "Error in FVB1(x) : could not find function "FVB1"".
I also can't figure out how to extract the confidence intervals that fit with the predicted data.
I have searched and have found some similar cases but nothing that has worked. I'll continue to research, but am I missing something very basic? Below is a subsample of the data. I'd appreciate any help.
Thank you
install.packages("FSA")
install.packages("FSAdata")
install.packages("nlstools")
install.packages("car")
library(FSA)
library(FSAdata)
library(nlstools)
library(car)
sizecapture <- c(40.30,43.00,38.30,41.50,37.60,41.63,41.80,38.40,40.00,41.20,37.70,41.70,43.70,41.80,42.70,44.60,45.50,44.50,45.60,44.80,47.00,49.20,44.50,45.20,46.40,46.90,49.40,61.00,36.50,42.10,43.90,43.90,46.40,45.50,47.20,64.30,43.00,59.90,39.60,36.80)
sizerecapture <- c(43.0,48.7,39.5,42.0,46.7,43.5,43.5,47.2,45.7,59.9,48.1,46.5,45.7,49.1,48.7,47.1,46.9,48.3,47.2,53.7,52.0,51.2,56.2,56.3,57.5,57.7,55.4,74.5,45.6,44.9,46.7,51.0,49.4,58.0,56.8,71.6,43.8,44.6,43.7,41.9)
timeinterval <-c(0.9452055,1.0356164,0.9972603,0.9972603,2.4712329,0.9534247,1.1945205,2.0027397,1.3178082,4.5342466,2.1863014,0.9178082,1.1315068,2.3698630,2.0575342,1.3835616,1.1726027,1.1972603,3.1698630,1.9589041,1.0712329,0.9150685,2.5671233,2.7780822,3.2000000,2.2246575,1.9150685,4.1753425,0.9287671,1.0328767,1.3945205,2.6739726,0.9205479,3.1479452,1.9506849,1.7178082,1.0520548,3.0767123,1.3726027,1.2520548)
data <- data.frame(sizecapture, sizerecapture, timeinterval)
### using Ogle 2013 to calculate Linf and k
# k and Linf
with(data,mean((log(sizerecapture)-log(sizecapture))/timeinterval))
# 0.0676
max(data$sizerecapture) # largest size is 74.5
Fabens.sv <- list(Linf=74.5, K=0.0676)
# declare the model
fvb <- vbFuns("Fabens")
# fit and summarize
FVB1<- nls(sizerecapture~
fvb(sizecapture,timeinterval,Linf,K),start=Fabens.sv,data=data)
summary(FVB1,correlation=TRUE)
# confidence intervals through bootstrapping
boot <- nlsBoot(FVB1, niter=500)
confint(boot,plot=TRUE)
# plotting a fitted line plot
ages2plot <- 0:40
LCI <- UCI <- numeric(length(ages2plot))
fitPlot(FVB1, xlim=range(ages2plot))
ests <- boot$coefboot
for (i in 1:length(ages2plot)) {
  pv <- ests[, "Linf"] * (1 - exp(-ests[, "K"] * ages2plot[i]))
  LCI[i] <- quantile(pv, 0.025)
  UCI[i] <- quantile(pv, 0.975)
}
lines(UCI ~ ages2plot, type = "l")
lines(LCI ~ ages2plot, type = "l")
# tried to just get a visual and errors arise
curve(FVB1)
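As an aside, curve() expects a function of x rather than an nls fit object, and fitPlot() expects simpler one-predictor models than the two-argument Fabens function, which is likely why both fail here. A minimal base-R sketch of visualizing the fit manually, assuming the FVB1 fit from above and a hypothetical reference capture size:
Lc <- mean(data$sizecapture)   # hypothetical reference capture size
tgrid <- seq(0, 5, by = 0.1)
cf <- coef(FVB1)
pred <- Lc + (cf["Linf"] - Lc) * (1 - exp(-cf["K"] * tgrid))

plot(data$timeinterval, data$sizerecapture,
     xlab = "Time at liberty (yrs)", ylab = "Recapture size")
lines(tgrid, pred)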
I have been presented with a problem regarding the minimization of absolute error, known as LAD (least absolute deviation), but since each regressor is the result of an expensive test with an associated cost, one should refrain from using regressors that don't explain variance to a high degree. The objective takes the following form:
minimize sum_{i=1..N} |E_i| + lambda * sum_{j=1..S} C_j * z_j
where N is the total number of observations, E_i the deviation associated with observation i, S the number of independent variables, lambda a penalty coefficient for the cost, and C_j the cost associated with performing test j.
So far, I have proceeded as usual. To make it linear, I split the absolute value into two errors, e^+ and e^-, where e_i = y_i - (B_0 + sum_j(B_j * X_ij)), with the following constraints (a minimal sketch follows the list):
z_j in {0, 1}, a binary value indicating whether regressor j enters the model;
B_j <= M * z_j and B_j >= -M * z_j (big-M constraints);
e^+_i, e^-_i >= 0.
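For illustration, here is a minimal lpSolve sketch of the plain LAD core on hypothetical simulated data, without the cost penalty or the binary z_j yet; each coefficient is split into positive and negative parts so that all decision variables stay non-negative (lp() assumes variables >= 0):
library(lpSolve)

set.seed(1)
n <- 20; p <- 2
X <- cbind(1, rnorm(n))                     # intercept plus one regressor
y <- X %*% c(2, 3) + rnorm(n)

# variables: B+ (p), B- (p), e+ (n), e- (n); B = B+ - B-
obj <- c(rep(0, 2 * p), rep(1, 2 * n))      # minimize sum(e+ + e-)
constr <- cbind(X, -X, diag(n), -diag(n))   # X B+ - X B- + e+ - e- = y
sol <- lp("min", obj, constr, rep("=", n), as.vector(y))
B <- sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]
B   # LAD estimates of intercept and slope
Extending this with the cost term would mean appending S binary columns for the z_j (with objective coefficients lambda * C_j, declared via binary.vec in lp()) and adding the big-M rows linking each B_j to its z_j.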
A toy subset of the data I'm working with has the following structure:
For y
quality
1 5
2 5
3 5
4 6
5 7
6 5
For the regressors
fixed.acidity volatile.acidity citric.acid
1 7.5 0.610 0.26
2 5.6 0.540 0.04
3 7.4 0.965 0.00
4 6.7 0.460 0.24
5 6.1 0.400 0.16
6 9.7 0.690 0.32
And for the cost
fixed.acidity volatile.acidity citric.acid
1 0.26 0.6 0.52
So far, my code looks like this:
# loading the matrices
y <- read.csv(file = "PATH\\y.csv", header = TRUE, sep = ",")                   # dim = 100 x 1
regresores <- read.csv(file = "PATH\\regressors.csv", header = TRUE, sep = ",") # dim = 100 x 11
cost <- read.csv(file = "PATH\\cost.csv", header = TRUE, sep = ",")             # dim = 1 x 11
for (i in seq(0, 1, by = 0.1)) {   # a collection of models with different penalties
  obj.fun <- c(1, 1, i * cost)
  constr <- matrix(
    c(y, regresores, -regresores),
    c(-y, -regresores, regresores),
    sum(regresores), ncol = , byrow = TRUE)
  constr.dir <- c("<=", ">=", "<=", "==")
  rhs <- c(regresores, -regresores, 1, binary)
  sol <- lp("min", obj.fun, constr, constr.dir, rhs)
  sol$objval
  sol$solution
}
I know there is a LAD function in R, but for consistency's sake with my colleagues, as well as a pretty annoying PhD tutor, I have to perform this using lpSolve in R. I have just started with R for this project and I don't know exactly why this won't run. Is there something wrong with the syntax or with my formulation of the model? Right now, the main problem I have is:
"Error in matrix(c(y, regressors, -regressors), c(-y, -regressors, regressors), : non-numeric matrix extent".
Mainly, I intended it to create said weighted LAD model and return the results for the different values of lambda, from 0 to 1 in steps of 0.1.
Thanks in advance, and sorry for any inconvenience; neither English nor R is my native language.
I have been trying to understand opera ("Online Prediction by Expert Aggregation") by Pierre Gaillard and Yannig Goude. I read two posts by Pierre Gaillard (http://pierre.gaillard.me/opera.html) and Rob Hyndman (https://robjhyndman.com/hyndsight/forecast-combinations/). However, I do not understand how to predict future values. In Pierre's example, newY = Y represents the test data set (Y <- data_test$Load), which contains weekly observations of the French electric load. As shown below, the data end at December 2009. Now, how can I forecast, say, 2010 values? What will be the newY here?
> tail(electric_load,5)
Time Day Month Year NumWeek Load Load1 Temp Temp1 IPI IPI_CVS
727 727 30 11 2009 0.9056604 63568.79 58254.42 7.220536 10.163839 91.3 88.4
728 728 7 12 2009 0.9245283 63977.13 63568.79 6.808929 7.220536 90.1 87.7
729 729 14 12 2009 0.9433962 78046.85 63977.13 -1.671280 6.808929 90.1 87.7
730 730 21 12 2009 0.9622642 66654.69 78046.85 4.034524 -1.671280 90.1 87.7
731 731 28 12 2009 0.9811321 60839.71 66654.69 7.434115 4.034524 90.1 87.7
I noticed that by multiplying the weights of MLpol0 by X, we get values similar to the online predictions.
> weights <- predict(MLpol0, X, Y, type='weights')
> w<-weights[,1]*X[,1]+weights[,2]*X[,2]+weights[,3]*X[,3]
> predValues <- predict(MLpol0, newexpert = X, newY = Y, type='response')
Test_Data predValues w
620 65564.29 65017.11 65017.11
621 62936.07 62096.12 62096.12
622 64953.83 64542.44 64542.44
623 61580.44 60447.63 60447.63
624 71075.52 67622.97 67622.97
625 75399.88 72388.64 72388.64
626 65410.13 67445.63 67445.63
627 65815.15 62623.64 62623.64
628 65251.90 64271.97 64271.97
629 63966.91 61803.77 61803.77
630 64893.42 65793.14 65793.14
631 69226.32 67153.80 67153.80
But I am still not sure how to generate the weights without newY. Maybe we can use the final coefficients that MLpol outputs to predict future values?
(c<-summary(MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square"))$coefficients)
[1] 0.585902 0.414098 0.000000
I am sorry, I may be way off on this and my question may not make sense at all, but I really appreciate any help or insight.
The idea of the opera package is a bit different from classical batch machine learning methods with a training set and a testing set. The goal is to make sequential predictions. At each round t = 1, ..., n:
1) the algorithm receives the experts' predictions for round t,
2) it makes a prediction for this time step by combining the experts,
3) it updates the weights used for the combination by using the new observation.
If you have out-of-sample forecasts (i.e., expert forecasts of future values without the corresponding observations), the best you can do is to take the last coefficients and make a prediction with:
newexperts %*% model$coefficients
In practice, you may also want to use the averaged coefficients. You can obtain the same result by using
predict(object,        # for example, mixture(model = 'FS', loss.type = "square")
        newexperts = , # matrix of out-of-sample expert predictions
        online = FALSE,
        type = 'response')
With the parameter online = FALSE, the model does not need any newY and it will not update the weights. When you do provide newY, the algorithm does not cheat: it does not use the value at round t to make the prediction at round t. The values of newY are only used to update the coefficients step by step, as if the predictions were made sequentially.
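For instance, here is a hedged sketch of both routes, assuming X and Y as in the question and a hypothetical matrix newX of out-of-sample expert forecasts (one row per future week):
library(opera)

MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square")

# route 1: freeze the final coefficients
f1 <- newX %*% MLpol$coefficients

# route 2: predict() without newY; online = FALSE means no weight updates
f2 <- predict(MLpol, newexperts = newX, online = FALSE, type = "response")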
I hope this helped.
I have a data frame trainData which contains 198 rows and looks like
Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
1 2010_1115_1457 1 1115 1457 0.531 5 16 0.567 4 16
2 2010_1124_1358 1 1124 1358 0.774 5 3 0.75 5 14
...
The testData is similar.
In order to use SVM, I have to change the response variable Win to a factor. I tried the below:
trainDataSVM <- data.frame(Win=as.factor(trainData$Win), A_WST6=trainData$A_WST6, A_SEED=trainData$A_SEED, B_WST6=trainData$B_WST6, B_SEED= trainData$B_SEED,
Matchup=trainData$Matchup, HomeID=trainData$HomeID, AwayID=trainData$AwayID)
I then want to fit an SVM and predict the probabilities, so I tried the below:
svmfit <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
              kernel = "linear", cost = 10, scale = FALSE)
# use CV with a range of cost values
set.seed(1)
tune.out <- tune(svm, Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
                 kernel = "linear", ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
bestmod <- tune.out$best.model
testDataSVM <- data.frame(Win = as.factor(testData$Win), A_WST6 = testData$A_WST6,
                          A_SEED = testData$A_SEED, B_WST6 = testData$B_WST6,
                          B_SEED = testData$B_SEED, Matchup = testData$Matchup,
                          HomeID = testData$HomeID, AwayID = testData$AwayID)
predictions_SVM <- predict(bestmod, testDataSVM, type = "response")
However, when I try to print out predictions_SVM, I get the message
factor(0)
Levels: 0 1
instead of a column of probability values. What is going on?
I haven't used this much myself, but I know that the SVM algorithm itself does not produce class probabilities, only the response function (distance from the hyperplane). If you look at the documentation for the svm function, the argument probability ("logical indicating whether the model should allow for probability predictions") is FALSE by default, and you did not set it to TRUE. The documentation for predict.svm says similarly that its probability argument is a "Logical indicating whether class probabilities should be computed and returned. Only possible if the model was fitted with the probability option enabled." Hope that's helpful.
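For illustration, a minimal sketch of enabling probability estimates with e1071, reusing the formula from the question (and assuming trainDataSVM and testDataSVM as defined there):
library(e1071)

# refit with probability estimates enabled (fitted via an internal
# cross-validation step, so it costs a bit more time)
svmfit_prob <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED,
                   data = trainDataSVM, kernel = "linear", cost = 10,
                   probability = TRUE)

pred <- predict(svmfit_prob, testDataSVM, probability = TRUE)
head(attr(pred, "probabilities"))   # per-class probability matrix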