What function or functions can you use to redirect console output to a data frame in R? As an example, the following code, associated with the mgcv package, produces a set of diagnostics used to assist in model selection for GAMs:
gam.check(gamout, type = c("deviance"))
It produces the following output:
Method: GCV Optimizer: magic
Smoothing parameter selection converged after 7 iterations.
The RMS GCV score gradient at convergence was 1.988039e-07 .
The Hessian was positive definite.
Model rank = 10 / 11
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(year) 9.00 3.42 1.18 0.79
I'm interested in redirecting this output to a data frame that I can process into a table I can output and actually use, rather than read off the console. I don't need specifics, just functions I might be able to use to start solving the problem. Once I have the function, I can work my way through the specifics.
sink() apparently writes to a txt file, so I suppose I could use that function and then re-import the output, but that seems like a pretty clumsy solution.
The functions I would start with are class(gam.check(gamout, type="deviance")) and names(gam.check(gamout, type="deviance")). These should help you figure out what the data structure is and then how to extract elements of it.
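As a minimal sketch of that inspection step (assuming gamout is an already fitted gam object; note gam.check() may return NULL if it only prints and plots, in which case there is nothing to extract this way):
library(mgcv)
chk <- gam.check(gamout, type = "deviance")  # gamout assumed to be a fitted gam
class(chk)   # what kind of object, if any, is returned
names(chk)   # named components you could pull into a data frame
str(chk)     # full structure, in case names() comes back NULL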
I am using rdrobust to estimate RDDs, and for a journal submission the journal demands that I report tables with covariates and their estimates. I don't think these should be reported in designs like these, and I don't really know how informative they are, but anyway: I can't find them anywhere in the output of the rdrobust call, so I was wondering whether there is any way of actually obtaining them.
Here's my code:
library(rdrobust)
rd <- rdrobust(y = full_data$share_female,
               x = full_data$running,
               c = 0,
               cluster = full_data$constituency.name,
               covs = cbind(full_data$income, full_data$year_fct,
                            full_data$population, as.factor(full_data$constituency.name)))
I then call the object
rd
And get:
Call: rdrobust
Number of Obs. 1812
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 1452 360
Eff. Number of Obs. 566 170
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 0.145 0.145
BW bias (b) 0.221 0.221
rho (h/b) 0.655 0.655
Unique Obs. 1452 360
So, as you can see, there seems to be no information on this in the output, nor in the object the function returns. I don't really know what to do.
Thanks!
Unfortunately, I do not believe rdrobust() allows you to recover the coefficients introduced through the covs option.
In your case, running the code as you provided and then running:
rd$coef
will only give you the point estimate for the rd estimator.
Josh McCrain has written up a nice vignette here to replicate rdrobust using lfe that also allows you to recover the coefficients on covariates.
It involves some modification on your part and is of course not as user friendly, but does allow recovery of covariates.
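For orientation only, here is a rough sketch of the general idea (not the vignette's code): take the MSE-optimal bandwidth reported by rdrobust above (0.145), build triangular kernel weights by hand, and fit a weighted local linear regression with lfe::felm that includes the covariates. Variable names are the ones from the question.
library(lfe)
h <- 0.145                                  # BW est. (h) reported by rdrobust above
d <- subset(full_data, abs(running) <= h)   # keep observations inside the bandwidth
d$treat <- as.numeric(d$running >= 0)       # right-of-cutoff indicator
d$w <- 1 - abs(d$running) / h               # triangular kernel weights
# local linear RD with covariates; SEs clustered on constituency.name
# (constituency fixed effects could go in the second part of the formula instead)
fit <- felm(share_female ~ treat * running + income + year_fct + population
            | 0 | 0 | constituency.name,
            data = d, weights = d$w)
summary(fit)   # the covariate coefficients now show up in a standard regression table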
This might be beside the point by now, but the journal requirement in an RD design is odd.
Use summary(rd). This will print the coefficient estimates.
Objective:
I would like to use a classification Support Vector Machine to model three outcomes: Win=1, Loss=0, or Draw=2. The inputs are a total of 50 interval variables and 2 categorical variables: isHome and isAway. The dataset comprises 23,324 instances or rows.
What the data looks like:
Outcome isHome isAway Var1 Var2 Var3 ... Var50
1 1 0 0.23 0.75 0.5 ... 0.34
0 0 1 0.66 0.51 0.23 ... 0.89
2 1 0 0.39 0.67 0.15 ... 0.45
2 0 1 0.55 0.76 0.17 ... 0.91
0 1 0 0.35 0.81 0.27 ... 0.34
The interval variables are within the range 0 to 1, hence I believe they do not require scaling, given they are percentages. The categorical inputs are coded 0/1: isHome is 1 for a home game and 0 otherwise, and isAway is 1 for an away game and 0 otherwise.
Summary
Create Support Vector Machine Model
Correct for gamma and cost
Questions
I will be honest, this is my first time using SVM. I have practiced using the Titanic dataset from Kaggle, but I am trying to expand off of that and try new things.
Does the data have to be transformed into a scale of [0,1]? I do not believe it does.
I have found some literature stating it is possible to predict with 3 categories, but this is outside of my scope of knowledge. How would I implement this in R?
Are there too many features that I am looking at in order for this to work, or could there be a problem with noise? I know this is not a yes or no question, but I'm curious to hear people's thoughts.
I understand SVM can split data with a linear, radial, or polynomial kernel. How does one make the best choice for their data?
Reproducible Code
library(e1071)
library(caret)
# set up data
set.seed(500)
isHome<-c(1,0,1,0,1)
isAway<-c(0,1,0,1,0)
Outcome<-c(1,0,2,2,0)
Var1<-abs(rnorm(5,0,1))
Var2<-abs(rnorm(5,0,1))
Var3<-abs(rnorm(5,0,1))
Var4<-abs(rnorm(5,0,1))
Var5<-abs(rnorm(5,0,1))
df<-data.frame(Outcome,isHome,isAway,Var1,Var2,Var3,Var4,Var5)
# split data into train and test
inTrain<-createDataPartition(y=df$Outcome,p=0.50,list=FALSE)
traindata<-df[inTrain,]
testdata<-df[-inTrain,]
# Train the model
svm_model<-svm(Outcome ~.,data=traindata,type='C',kernel="radial")
summary(svm_model)
# predict
pred <- predict(svm_model,testdata[-1])
# Confusion Matrix
table(pred,testdata$Outcome)
# Tune the model to find the best cost and gamma
# (train.x = predictors, train.y = response; here taken from traindata)
svm_tune <- tune(svm, train.x = traindata[, -1], train.y = as.factor(traindata$Outcome),
                 kernel = "radial",
                 ranges = list(cost = 10^(-1:2), gamma = c(.5, 1, 2)))
print(svm_tune)
I'll try to answer each point. In my opinion you could get different solutions for your problem(s); as it is now it's a bit "broad". You could also get answers by searching for similar topics on Cross Validated.
Does the data have to be transformed into a scale of [0,1]?
It depends; usually, yes, it would be better to scale var1, var2, ... One good approach would be to build two pipelines: one where you scale each var, one where you leave them be; the best model on the validation set will win.
Note, you'll frequently find this kind of approach used to decide "the best way".
Often what you're really interested in is the performance, so checking via cross-validation is a good way of evaluating your hypothesis.
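A minimal sketch of those two pipelines with caret (assuming the full df from the question rather than the 5-row toy data; 5-fold CV and the kernlab-backed "svmRadial" method are my choices for illustration):
library(caret)
df$Outcome <- as.factor(df$Outcome)            # classification target must be a factor
ctrl <- trainControl(method = "cv", number = 5)
# pipeline 1: variables left as they are
fit_raw <- train(Outcome ~ ., data = df, method = "svmRadial", trControl = ctrl)
# pipeline 2: variables centered and scaled before fitting
fit_scaled <- train(Outcome ~ ., data = df, method = "svmRadial",
                    trControl = ctrl, preProcess = c("center", "scale"))
# compare the cross-validated accuracy of the two pipelines
summary(resamples(list(raw = fit_raw, scaled = fit_scaled)))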
I have found some literature stating it is possible to predict with 3
categories, but this is outside of my scope of knowledge. How would I
implement this in R?
Yes it is; some functions implement this right away, in fact. See the example linked down below.
Note, you could always do a multi-class classification by building more models. This is usually called the one-vs-all approach (more here).
In general you could:
First train a model to detect Wins, your labels will be [0,1], so Draws and Losses will both be counted as "zero" class, and Wins will be labeled as "one"
Repeat the same principle for the other two classes
Of course, now you'll have three models, and for each observation, you'll have at least two predictions to make.
You'll assign each obs to the class with the highest probability or by majority vote.
Note, there are other ways; it really depends on the model you choose. A rough sketch of this one-vs-all idea follows below.
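Purely for illustration, a minimal one-vs-all sketch with e1071 (again assuming the full df from the question; the probability machinery needs more than a handful of rows):
library(e1071)
classes <- c(0, 1, 2)
models <- lapply(classes, function(k) {
  d <- df
  d$target <- as.factor(as.numeric(d$Outcome == k))   # 1 = class k, 0 = everything else
  svm(target ~ . - Outcome, data = d, kernel = "radial", probability = TRUE)
})
# for each observation, keep the class whose model is most confident
probs <- sapply(seq_along(classes), function(i) {
  attr(predict(models[[i]], df, probability = TRUE), "probabilities")[, "1"]
})
pred_ova <- classes[max.col(probs)]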
Good news is you can avoid this. You can look here to start.
e1071::svm() generalizes to your problem easily; it handles this automatically, so there is no need to fit multiple models.
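A minimal sketch of that direct route (same assumed df): with a factor response, e1071::svm() fits the three-class problem itself, using one-against-one voting internally.
library(e1071)
df$Outcome <- as.factor(df$Outcome)            # 0, 1, 2 become class labels
fit <- svm(Outcome ~ ., data = df, kernel = "radial")
pred <- predict(fit, df)                       # returns one of the three labels
table(pred, df$Outcome)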
Are there too many features that I am looking at in order for this to work, or could there be a problem with noise? I know this is not a yes or no question, but curious to hear people's thoughts.
Could be or could not be the case; again, look at the performance you get via CV. Do you have reason to suspect that var1, ..., var50 are too many variables? Then you could build a pipeline where, before the fit, you use PCA to reduce those dimensions, say to 95% of the variance.
How do you know this works? You guessed it: by looking at the performance once again, on the validation set you get via CV.
My suggestion is to follow both solutions and keep the best performing one.
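A minimal sketch of that PCA pipeline with caret (assumed df again, with Outcome already a factor; the 95% variance threshold is passed through preProcOptions, whose default is 0.95 anyway):
library(caret)
ctrl_pca <- trainControl(method = "cv", number = 5,
                         preProcOptions = list(thresh = 0.95))
fit_pca <- train(Outcome ~ ., data = df, method = "svmRadial",
                 preProcess = c("center", "scale", "pca"),
                 trControl = ctrl_pca)
fit_pca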
I understand SVM can split data with a linear, radial, or polynomial kernel. How does one
make the best choice for their data?
You can treat the kernel choice, again, as a hyperparameter to be tuned. Here, once again, you need to look at performance.
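For example, a minimal sketch that folds the kernel into the e1071 tuning grid (using the traindata from the reproducible code above; the cost and gamma values are just the ones already used there):
library(e1071)
kern_tune <- tune(svm, as.factor(Outcome) ~ ., data = traindata,
                  ranges = list(kernel = c("linear", "radial", "polynomial"),
                                cost = 10^(-1:2), gamma = c(.5, 1, 2)))
summary(kern_tune)
kern_tune$best.parameters    # winning kernel / cost / gamma combination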
This is what I'd follow, based on the fact that you seem to have already selected svm as the model of choice. I suggest looking at the package caret; it should simplify all the evaluations you need to do (Example of CV with caret).
Scale Data vs not Scale Data
Perform PCA vs keep all variables
Fit all the models on the training set and evaluate via CV
Take your time to test all these pipelines (4 so far)
Evaluate again via CV which kernel is best, along with the other hyperparameters (C, gamma, ...)
You should find which path led you to the best result.
If you're familiar with the classic Confusion Matrix, you can use accuracy even for a multi-class classification problem as a performance metric.
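As a tiny sketch of that metric (pred and testdata are the hypothetical objects from the question's code; with the real data the table is a 3x3 confusion matrix):
cm <- table(pred = pred, truth = testdata$Outcome)
acc <- sum(diag(cm)) / sum(cm)    # accuracy = correct predictions / all predictions
acc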
In my experiments I tried different setups to balance the distribution between two task types. Each setup was run 32 times. I got the following task distributions [ratio from 0 to 1 of tasktype1/(tasktype1+tasktype2)]: http://oi63.tinypic.com/2cf6szb.jpg
This is what (part of) the data frame looks like in R: http://oi67.tinypic.com/2z9fg28.jpg
I think ANOVA is not suitable, as the data do not appear to be normally distributed. (Is there a quick way to verify a low level of normality? Is there a standard at which point ANOVA is no longer suitable?)
Therefore I decided to do the Kruskal-Wallis test. Reading up on the test, I figured the data needs to be ranked. But how can I choose the method of ranking when computing the Kruskal-Wallis test in R? In my case the "desired" outcome is a balanced population (ratio of 0.5). So the ranks would be:
rank: ratio:
1 0.5
2 0.4 and 0.6
[...] [...]
Can kruskal.test() be adjusted accordingly? (Maybe I am misunderstanding the function...)
My best guess would be just to try: kruskal.test(ratio ~ Method, data = ds)
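A minimal sketch of that call on made-up data (the real ds with columns ratio and Method is assumed from the question; note that kruskal.test() ranks the response values itself, by magnitude):
set.seed(1)
ds <- data.frame(Method = rep(c("A", "B", "C"), each = 32),   # 3 hypothetical setups
                 ratio  = runif(96))                          # 32 runs per setup
kruskal.test(ratio ~ Method, data = ds)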
I have a list of 282 items that has been classified by 6 independent coders into 20 categories.
The 20 categories are defined by words (example "perceptual", "evaluation" etc).
The 6 coders have different status: 3 of them are experts, 3 are novices.
I calculated all the kappas (and alphas) between each pair of coders, and the overall kappas among the 6 coders, and the kappas between the 3 experts and between the 3 novices.
Now I would like to check whether there is a significant difference between the interrater agreements achieved by the experts vs those achieved by the novices (whose kappa is indeed lower).
How would you approach this question and report the results?
Thanks!
You can at least easily obtain Cohen's Kappa and its sd in R (by far the best option, in my opinion).
The PresenceAbsence package has a Kappa (see ?Kappa) function.
You can get the package with the regular install.packages("PresenceAbsence"), then pass it a confusion matrix, e.g.:
# we load the package
library(PresenceAbsence)
# a dummy confusion matrix
cm <- matrix(round(runif(16, 0, 10)), nrow=4)
Kappa(cm)
You will obtain the Kappa and its sd. As far as I know there are limitations about testing using the Kappa metric (e.g. see https://en.wikipedia.org/wiki/Cohen's_kappa#Significance_and_magnitude).
Hope this helps.