I'd like to write an R function that accepts a formula as its first argument, similar to lm() or glm() and friends. In this case, it's a function that takes a data frame and writes out a file in SVMLight format, which has this general form:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
For example, the following data frame:
result qid f1 f2 f3 f4 f5 f6 f7 f8
1 -1 1 0.0000 0.1253 0.0000 0.1017 0.00 0.0000 0.0000 0.9999
2 -1 1 0.0098 0.0000 0.0000 0.0000 0.00 0.0316 0.0000 0.3661
3 1 1 0.0000 0.0000 0.1941 0.0000 0.00 0.0000 0.0509 0.0000
4 -1 2 0.0000 0.2863 0.0948 0.0000 0.34 0.0000 0.7428 0.0608
5 1 2 0.0000 0.0000 0.0000 0.4347 0.00 0.0000 0.9539 0.0000
6 1 2 0.0000 0.7282 0.9087 0.0000 0.00 0.0000 0.0000 0.0355
would be represented as follows:
-1 qid:1 2:0.1253 4:0.1017 8:0.9999
-1 qid:1 1:0.0098 6:0.0316 8:0.3661
1 qid:1 3:0.1941 7:0.0509
-1 qid:2 2:0.2863 3:0.0948 5:0.3400 7:0.7428 8:0.0608
1 qid:2 4:0.4347 7:0.9539
1 qid:2 2:0.7282 3:0.9087 8:0.0355
The function I'd like to write would be called something like this:
write.svmlight(result ~ f1+f2+f3+f4+f5+f6+f7+f8 | qid, data=mydata, file="out.txt")
Or even
write.svmlight(result ~ . | qid, data=mydata, file="out.txt")
But I can't figure out how to use model.matrix() and/or model.frame() to know what columns it's supposed to write. Are these the right things to be looking at?
Any help much appreciated!
Partial answer. You can subscript a formula object to get a parse tree of the formula:
> f<-a~b+c|d
> f[[1]]
`~`
> f[[2]]
a
> f[[3]]
b + c | d
> f[[3]][[1]]
`|`
> f[[3]][[2]]
b + c
> f[[3]][[3]]
d
Now all you need is code to walk this tree.
UPDATE: Here's an example of a function that walks the tree.
walker <- function(formu) {
  if (!is(formu, "formula"))
    stop("Want formula")
  lhs   <- formu[[2]]          # left-hand side, e.g. y
  formu <- formu[[3]]          # right-hand side, e.g. a + b | c
  if (formu[[1]] != '|')
    stop("Want conditional part")
  condi <- formu[[3]]          # conditioning variable, e.g. c
  # recursively flatten a tree of '+' calls into a list of variable names
  flattener <- function(f) {
    if (length(f) < 3) return(f)
    c(Recall(f[[2]]), Recall(f[[3]]))
  }
  vars <- flattener(formu[[2]])
  list(lhs = lhs, condi = condi, vars = vars)
}
walker(y ~ a + b | c)
Also look at the documentation for terms.formula and terms.object. Looking at the code of other functions that take conditional formulas can help too, e.g. the lmer function in the lme4 package.
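To connect this back to the original question, here is a rough, untested sketch of write.svmlight() built on the same formula-subscripting idea plus model.frame(). It assumes the explicit form result ~ f1+...+f8 | qid with purely numeric feature columns; the result ~ . | qid form and the exact number formatting of the example output are not handled.

write.svmlight <- function(formula, data, file = "") {
  lhs <- formula[[2]]                                    # response symbol, e.g. result
  rhs <- formula[[3]]                                    # e.g. f1 + ... + f8 | qid
  if (!identical(rhs[[1]], as.name("|")))
    stop("expected a formula of the form y ~ features | qid")
  y    <- eval(lhs, data)                                # target column
  qidv <- eval(rhs[[3]], data)                           # qid column
  X    <- model.frame(eval(call("~", rhs[[2]])), data)   # feature columns, in formula order
  out  <- vapply(seq_len(nrow(X)), function(i) {
    v    <- unlist(X[i, ])
    keep <- which(v != 0)                                # SVMLight convention: omit zero features
    paste(y[i], paste0("qid:", qidv[i]),
          paste(keep, v[keep], sep = ":", collapse = " "))
  }, character(1))
  writeLines(out, file)
}
# write.svmlight(result ~ f1+f2+f3+f4+f5+f6+f7+f8 | qid, data = mydata, file = "out.txt")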
I used
formu.names <- all.vars(formu)
Y.name <- formu.names[1]
X.name <- formu.names[2]
block.name <- formu.names[3]
in the code I wrote for a post-hoc analysis of a Friedman test:
http://www.r-statistics.com/2010/02/post-hoc-analysis-for-friedmans-test-r-code/
But it will only work for formulas of the form Y ~ X | block.
I hope others will give a better answer.
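For example, here is what all.vars() returns on a formula like the one in the original question (a quick illustration; with more than one predictor the positions shift, which is why indexing formu.names by 1, 2, 3 only fits the Y ~ X | block case):

all.vars(result ~ f1 + f2 + f3 | qid)
# [1] "result" "f1"     "f2"     "f3"     "qid"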
The acf function in the stats package returns a complex object. For example
x = rnorm(1000, mean=100, sd=10)
acf(x)
returns a plot. If I do
acf_x = acf(x)
acf_x
it returns
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.000 -0.031 -0.002 -0.052 0.017 -0.014 0.030 0.011 0.002 -0.044 0.000
12 13 14 15 16 17 18 19 20 21 22 23
0.055 -0.007 0.049 0.025 -0.027 -0.048 0.033 0.027 0.043 -0.007 -0.010 0.025
24 25 26 27 28 29 30
-0.083 0.045 -0.074 0.016 0.041 -0.046 0.010
If I look at class(acf_x) it returns 'acf'.
How do I extract the autocorrelation versus lag into a data_frame?
More generally, when presented with a function that returns a complex object, how do I extract the data from it, i.e. is there a general pattern for this type of function?
If you look at the help page for acf via ?acf, you'll see under "Value" what the output will look like.
In this case, the acf object is a list with several elements.
If you want, for example, the lags, you can access them via:
my_lags <- acf_x$lag
Deschen's answer to the original question gives the general response to "how do I discover the elements in a complex model object": str(). One can also use the names() function for S3 objects; the result lists the names one can use to extract elements from the list with the $ or [[ forms of the extract operator.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
names(acf_x)
> names(acf_x)
[1] "acf" "type" "n.used" "lag" "series" "snames"
Since the acf and lag elements are stored as three-dimensional arrays, we'll need to extract just the first column of the first slice to obtain a simple vector. We can accomplish this by chaining the [ form of the extract operator onto the object that is generated by the [[ extract on the model object.
head(acf_x[["acf"]][,1,1]) # second extract returns a simple vector
> head(acf_x[["acf"]][,1,1])
[1] 1.000000000 -0.034863150 0.037745441 -0.020464290 -0.004974406
[6] 0.016770363
In this case R performs the extraction left to right - first acf_x[["acf"]] is evaluated, and then [,1,1] is applied to the result.
For the concrete part of the question ("how do I create a data frame with this data?"), one can build a data frame from the output of the acf() function as follows.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
results <- data.frame(acf_value = acf_x$acf[, 1, 1],
                      acf_lag   = acf_x$lag[, 1, 1])
head(results)
...and the output:
> head(results)
acf_value acf_lag
1 1.000000000 0
2 -0.034863150 1
3 0.037745441 2
4 -0.020464290 3
5 -0.004974406 4
6 0.016770363 5
Try
str(acf_x)
or
print.default(acf_x)
This will give you an idea of how the object looks internally and how to access the elements in it.
I'm attempting to identify the key covariates affecting detection probability.
I'm currently using this code:
model1 <- glm(P ~ Width + MBL + DFT + SGP + SGC + Depth,
              family = binomial("logit"),
              data = dframe2, na.action = na.exclude)
summary.lm(model1)
My data is structured like this:
Site Transect Q ID P Width DFT Depth Substrate SGP SGC MBL
1 Vr1 Q1 1 0 NA NA 0.5 Sand 0 0 0.00000
2 Vr1 Q2 2 0 NA NA 1.4 Sand&Searass 1 30 19.14286
3 Vr1 Q3 3 0 NA NA 1.7 Sand&Searass 1 15 16.00000
4 Vr1 Q4 4 1 17 0 2.0 Sand&Searass 1 95 35.00000
5 Vr1 Q5 5 0 NA NA 2.4 Sand 0 0 0.00000
6 Vr1 Q6 6 0 NA NA 2.9 Sand&Searass 1 50 24.85714
My sample size is really small (n=12) and I only have ~70 rows of data.
When I run the code it returns:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.457e+01 4.519e+00 5.437 0.00555 **
Width 1.810e-08 1.641e-01 0.000 1.00000
MBL -2.827e-08 9.906e-02 0.000 1.00000
DFT 2.905e-07 1.268e+00 0.000 1.00000
SGP 1.064e-06 2.691e+00 0.000 1.00000
SGC -2.703e-09 3.289e-02 0.000 1.00000
Depth 1.480e-07 9.619e-01 0.000 1.00000
SubstrateSand&Searass -8.516e-08 1.626e+00 0.000 1.00000
Does this mean my data set is just too small to assess detection probability, or am I doing something wrong?
According to Hair (author of the book Multivariate Data Analysis), you need at least 15 observations for each feature (column) of your data. If you have 12, you could only select one feature.
So, run a t-test comparing the means of each feature across the two classes (0 and 1 of the target, i.e. the dependent variable) and choose the feature (independent variable) whose mean difference between the classes is the biggest. That variable is the one best able to create a boundary that splits the two classes.
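A rough sketch of that screening step (illustrative only; it assumes dframe2 and the predictor names from the question, and compares raw mean differences, which implicitly treats the features as being on comparable scales):

features <- c("Width", "MBL", "DFT", "SGP", "SGC", "Depth")
screen <- sapply(features, function(v) {
  # Welch t-test of each candidate feature against the binary response P;
  # features recorded for only one class (e.g. Width in the excerpt) need handling first
  tt <- t.test(dframe2[[v]] ~ dframe2$P)
  c(mean_diff = unname(diff(tt$estimate)),   # mean in class 1 minus mean in class 0
    p_value   = tt$p.value)
})
res <- t(screen)
res[order(-abs(res[, "mean_diff"])), ]       # biggest between-class difference first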
So essentially I have two matrices containing the excess returns of stocks (R) and the expected excess return (ER).
R<-matrix(runif(47*78),ncol = 78)
ER<-matrix(runif(47*78),ncol = 78)
I then combine these removing the first row of R and adding the first row of ER to form a new matrix R1.
I then do this for R2, i.e. removing the first two rows of R and rbinding it with the first two rows of ER.
I do this until I have n-1 new matrices from R1 to R47.
I then find the Var-Cov matrix of each of the Return matrices using cov() i.e. Var-Cov1 to Var-Cov47.
n <- 47
switch_matrices <- function(mat1, mat2, nrows){
  rbind(mat1[(1+nrows):nrow(mat1),], mat2[1:nrows,])
}
l <- lapply(1:n-1, function(nrows) switch_matrices(R, ER, nrows))
list2env(setNames(l, paste0("R", seq_along(l))), envir = parent.frame())
b <- lapply(l, cov)
list2env(setNames(b, paste0("VarCov", seq_along(b))), envir = parent.frame())
I am now trying to find the asset allocation using quadprog. So for example:
D_mat <- 2*VarCov1
d_vec <- rep(0,78)
A_mat <- cbind(rep(1,78),diag(78))
b_vec <- c(1,d_vec)
library(quadprog)
output <- solve.QP(Dmat = D_mat, dvec = d_vec,Amat = A_mat, bvec = b_vec,meq =1)
# The asset allocation
(round(output$solution, 4))
For some reason, when running solve.QP with any of the Var-Cov matrices found, I get this error:
Error in solve.QP(Dmat = D_mat, dvec = d_vec, Amat = A_mat, bvec = b_vec, :
matrix D in quadratic function is not positive definite!
I'm wondering what I am doing wrong or even why this is not working.
The input matrix isn't positive definite, which is a necessary condition for the optimization algorithm.
Why your matrix isn't positive definite will have to do with your specific data (the real data, not the randomly generated example) and will be both a statistical and subject matter specific question.
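One quick way to confirm the diagnosis (just a diagnostic on the D_mat built in the question) is to look at the smallest eigenvalue, since a positive definite matrix has all eigenvalues strictly positive. Note also that with roughly 47 return observations on 78 assets, the sample covariance matrix is necessarily rank deficient, so it cannot be positive definite:

# Smallest eigenvalue of the quadratic-form matrix; a value <= 0 (or numerically
# indistinguishable from 0) is exactly what solve.QP is complaining about.
min(eigen(D_mat, symmetric = TRUE, only.values = TRUE)$values)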
However, from a programming perspective there is a workaround. We can use nearPD from the Matrix package to find the nearest positive definite matrix as a viable alternative:
# Data generated by code in the question using set.seed(123)
library(quadprog)
library(Matrix)
pd_D_mat <- nearPD(D_mat)
output <- solve.QP(Dmat = as.matrix(pd_D_mat$mat),
                   dvec = d_vec,
                   Amat = A_mat,
                   bvec = b_vec,
                   meq  = 1)
# The asset allocation
(round(output$solution, 4))
[1] 0.0052 0.0000 0.0173 0.0739 0.0000 0.0248 0.0082 0.0180 0.0000 0.0217 0.0177 0.0000 0.0000 0.0053 0.0000 0.0173 0.0216 0.0000
[19] 0.0000 0.0049 0.0042 0.0546 0.0049 0.0088 0.0250 0.0272 0.0325 0.0298 0.0000 0.0160 0.0000 0.0064 0.0276 0.0145 0.0178 0.0000
[37] 0.0258 0.0000 0.0413 0.0000 0.0071 0.0000 0.0268 0.0095 0.0326 0.0112 0.0381 0.0172 0.0000 0.0179 0.0000 0.0292 0.0125 0.0000
[55] 0.0000 0.0000 0.0232 0.0058 0.0000 0.0000 0.0000 0.0143 0.0274 0.0160 0.0000 0.0287 0.0000 0.0000 0.0203 0.0226 0.0311 0.0345
[73] 0.0012 0.0004 0.0000 0.0000 0.0000 0.0000
I'm working in healthcare and I need help on how to use R.
Let me explain: I have a data set like this:
S1 S2 S3 S4 S5
0.498 1.48 1.43 0.536 0.548
2.03 1.7 3.74 2.13 2.02
0.272 0.242 0.989 0.534 0.787
0.986 2.03 2.53 1.65 2.31
0.307 0.934 0.633 0.36 0.281
0.78 0.76 0.706 0.81 1.11
0.829 2.03 0.667 1.48 1.42
0.497 1.27 0.952 1.23 1.73
0.553 0.286 0.513 0.422 0.573
Here are my objectives:
Do correlation between every column
Calculate p-values
Calculate R-squared
Only show when R2>0.5 and p-values <0.05
Here is my code so far (it's not the most efficient, but it works):
> e<-read.table('Workbook8nm.csv', header=TRUE, sep=",", dec=".", na.strings="NA")
> f<-data.frame(e)
> M<-cor(f, use="complete") # Do the correlation like I want
> library('psych')
> N<-corr.test(f) # Give me p-values
So, so far I have my correlation in M and my p-values in N.
I need help on how to get R2.
And for the second part, how do I make R only show me the entries where R2>0.5 and p-values<0.05, for example? I used this line:
P<-M[which(M>0.9)]
to show me only the entries where the Pearson coefficient is more than 0.9, as practice. But it just gives me a list of every value greater than 0.9... so I don't know which pair of columns each coefficient comes from. The best would be for it to show me the significant values in a table with the column names, so I can easily identify them afterwards.
The reason I want to do that is that my table is 570 by 570, so I can't look at every p-value to keep only the significant ones.
I hope I was clear! It's my first post here, tell me if I made any mistakes!
Thanks for your help!
I'm sure there is a function somewhere in the R space to do this quicker, but I wrote a quick function to expand a matrix into a data.frame with the "row" and "column" as columns, and the value as a third column.
matrixToFrame <- function(m, name) {
  # expand.grid varies `row` fastest, matching as.vector(m)'s column-major order
  e <- expand.grid(row = rownames(m), col = colnames(m))
  e[name] <- as.vector(m)
  e
}
We can transform the correlation matrix into a data frame like so:
> matrixToFrame(cor(f), "cor")
row col cor
1 S1 S1 1.0000000
2 S2 S1 0.5322052
3 S3 S1 0.8573687
4 S4 S1 0.8542438
5 S5 S1 0.6820144
6 S1 S2 0.5322052
....
And we can merge the result of corr.test and cor, because the row/col columns match up:
> b <- merge(matrixToFrame(corr.test(f)$p, "p"), matrixToFrame(cor(f), "cor"))
> head(b)
row col p cor
1 S1 S1 0.0000000000 1.0000000
2 S1 S2 0.2743683745 0.5322052
3 S1 S3 0.0281656707 0.8573687
4 S1 S4 0.0281656707 0.8542438
5 S1 S5 0.2134783039 0.6820144
6 S2 S1 0.1402243214 0.5322052
Then we can just filter for the elements that we want
> b[b$cor > .5 & b$p > .2,]
row col p cor
2 S1 S2 0.2743684 0.5322052
5 S1 S5 0.2134783 0.6820144
8 S2 S3 0.2743684 0.5356585
10 S2 S5 0.2134783 0.6724486
15 S3 S5 0.2134783 0.6827349
EDIT: I found R matrix to rownames colnames values, which provides a couple of attempts at matrixToFrame; nothing particularly more elegant than what I have here, though.
EDIT2: Make sure to read the docs carefully for corr.test -- it looks like different information gets encoded in the upper and lower diagonal (?), so the results here may be deceptive. You may want to do some filtering with lower.tri or upper.tri before the final filtering step.
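Building on that note, here is a small sketch of the filtering the question actually asks for (R2 > 0.5 and p < 0.05). It assumes the data frame f from the question and takes only one triangle of the matrices so that each pair appears once; per ?corr.test, the p-values below the diagonal are unadjusted and those above are adjusted for multiple tests, so pick the triangle you want.

library(psych)
ct    <- corr.test(f)                    # as in the question; ct$r and ct$p are square matrices
keep  <- lower.tri(ct$r)                 # one copy of each pair, unadjusted p-values
pairs <- data.frame(row = rownames(ct$r)[row(ct$r)[keep]],
                    col = colnames(ct$r)[col(ct$r)[keep]],
                    r2  = ct$r[keep]^2,  # R-squared of a simple regression is the squared correlation
                    p   = ct$p[keep])
subset(pairs, r2 > 0.5 & p < 0.05)       # only the pairs the question asks for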
> lenss
xd yd zd
1 0.0000 0.0000 2.44479
2 0.0937 0.0000 2.73183
3 0.3750 0.0000 2.97785
4 0.8437 0.0000 3.18626
5 1.5000 0.0000 3.36123
6 2.3437 0.0000 3.50624
7 3.3750 0.0000 3.62511
8 4.5937 0.0000 3.72124
9 5.9999 0.0000 3.79778
10 7.5936 0.0000 3.85744
11 9.3749 0.0000 3.90241
12 11.3436 0.0000 3.93590
13 13.4998 0.0000 3.96011
14 15.8435 0.0000 3.97648
15 18.3748 0.0000 3.98236
16 21.0935 0.0000 3.99406
17 23.9997 0.0000 3.99732
18 27.0934 0.0000 3.99911
19 30.3746 0.0000 4.00004
20 33.8433 0.0000 4.00005
21 37.4995 0.0000 4.00006
22 0.0663 0.0663 3.99973
23 0.2652 0.2652 3.99988
24 0.5966 0.5966 3.99931
25 1.0606 1.0606 3.99740
26 1.6573 1.6573 3.99375
27 2.3865 2.3865 3.98732
28 3.2482 3.2482 3.97640
29 4.2426 4.2426 3.95999
30 5.3695 5.3695 3.93598
31 6.6290 6.6290 3.90258
32 8.0211 8.0211 3.85171
33 9.5458 9.5458 3.79754
34 11.2031 11.2031 3.72156
35 12.9929 12.9929 3.62538
36 14.9153 14.9153 3.50636
37 16.9703 16.9703 3.36129
38 19.1579 19.1579 3.18622
39 21.4781 21.4781 2.97802
40 23.9308 23.9308 2.73206
41 26.5162 26.5162 2.44464
> rd=sqrt(xd^2+yd^2)
> fit=nls(zd~(rd^2/R)/(1+sqrt(1-(1+k)*rd^2/R^2))+d,start=list(R=75,k=-1,d=1))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: Warning message:
In sqrt(1 - (1 + k) * rd^2/R^2) : NaNs produced
The model function was given above. The question states that there are a few inaccurate measurements in the data and I need to find them. I was going to fit the model first and then work out the residual for every measurement.
The argument of sqrt must be non-negative, but there is no assurance that it is in the setup shown in the question. Furthermore, even if that is fixed, it seems unlikely that the model can be fit in the way shown in the question, since the data consist of two distinct curves (see the plot produced below) which will likely have to be fit separately.
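To see where the NaN warning comes from (an illustration only; the R and k values below are hypothetical trial values, not ones taken from the question or the fit):

# The square root is real only when (1 + k) * rd^2 / R^2 <= 1. During the nls
# search, trial values of R and k can violate this for the larger rd values.
R_try <- 30; k_try <- 0                      # hypothetical trial values
range(1 - (1 + k_try) * rd^2 / R_try^2)      # a negative minimum is where sqrt() produces NaN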
Using the drc package we can get a reasonable fit using its LL2.5 model like this:
library(drc)
plot(zd ~ rd)
g <- rep(1:2, c(21, 20))
fit1 <- drm(zd ~ rd, fct = LL2.5(), subset = g == 1)
fit2 <- drm(zd ~ rd, fct = LL2.5(), subset = g == 2)
lines(fitted(fit1) ~ rd[g == 1])
lines(fitted(fit2) ~ rd[g == 2])
This involves 10 parameters (5 for each curve). You might try the other models available there, passing them as the fct argument, to see if you can find something more parsimonious.
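Since the original goal was to find the inaccurate measurements, one possible follow-up (a sketch building on the two fits above, not part of the original answer, assuming residuals() works on the drm fits the same way fitted() does) is to inspect the residuals of each piecewise fit and flag the unusually large ones:

# Collect residuals from both fits in the original row order, then flag points
# whose residual sits far from the rest (3 MADs is an arbitrary cutoff).
res <- numeric(length(zd))
res[g == 1] <- residuals(fit1)
res[g == 2] <- residuals(fit2)
suspect <- abs(res - median(res)) > 3 * mad(res)
lenss[suspect, ]                             # candidate inaccurate measurements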