R for loop - estimates and values for forecasting

I'm trying to do some regression forecasting based on my estimates and actual values. I have the following:
estimates=s1$coefficients[,1]
values = data.frame(cbind(sd_rgdpg,DISSIM,TRADE,SIZE,OPEN,TF,INFL,INT,NIIP))
where 'estimates' holds my coefficients and 'values' my actual values. 'estimates' is a vector of length ten with the intercept as the first item. 'values' is a data frame with 9 columns and 21 rows. The columns of 'values' correspond to the rows of 'estimates'. I need to multiply 'estimates' and 'values' together to form an equation like y = intercept + b1x1 + b2x2 + ... + b9x9.
I'm not quite sure how to do this in a for loop. Can anyone help me out?
Here is the 'values' dataframe:
sd_rgdpg     DISSIM     TRADE      SIZE         OPEN     TF        INFL  INT   NIIP
0.3905156169 0.39590508 0.00000000 0.0000000000 2.629159 0.5474359 -0.40 1.43 -13.68144000
1.4482896523 0.37227806 0.03102011 0.0007919784 2.493771 0.5837563 -0.07 0.16 1.19404188
0.1698460561 0.35884028 0.10907448 0.0386795080 2.342112 0.6075000 0.22 -0.76 0.93052249
0.0020363597 0.04812418 0.24478591 0.0856910910 2.085918 0.6554404 -0.40 -1.22 0.94020757
0.3148110593 0.02315404 0.28936211 0.1649356627 2.094957 0.6589744 -3.16 -1.88 0.85515135
0.0279017603 0.02906603 0.31283051 0.2369223964 2.033051 0.6938776 -1.29 -1.36 0.57801452
0.0192319055 0.05513982 0.35421769 0.3050570794 2.137967 0.8312958 -0.02 -0.85 0.34994832
0.0358535769 0.07426063 0.48108389 0.4014364697 2.326611 0.8333333 -1.50 -0.35 -0.11022762
1.4919556927 0.05297878 0.60639908 0.4873392510 2.608321 0.8096886 -5.94 -0.76 -0.49419490
1.6980146354 0.03063955 0.75594659 0.5018749374 2.795147 0.8380282 1.27 -0.25 -0.28853577
(first 10 of 21 rows shown)
and here are the 'estimates'
(Intercept) sd_rgdpg TRADE OPEN
-1.048798e-04 -7.023954e-06 5.159287e-06 2.467633e-04
DISSIM SIZE TF INFL
-5.867023e-04 -3.927840e-04 -3.241606e-04 -2.520122e-05
INT NIIP
1.668813e-06 8.409097e-06

Just recall that your coefficients are a vector and your values form a matrix (a data frame can be converted with as.matrix). Hence, your result y is also a vector, and you don't need a for loop. As an example:
intercept <- -1.05
coef <- c(-7.02, 5.16, 2.47,-5.87,-3.93,-3.23,-2.52, 1.67, 8.41)
values <- matrix(runif(27), ncol = 9) # values is a matrix with 9 columns and 3 rows of Unif[0,1] values as an example
Then you can just use matrix multiplication, which pairs each coefficient with its column and sums across each row:
> drop(intercept + values %*% coef)
(Be careful with rowSums(coef * values) here: R would recycle coef down the columns rather than across the rows, pairing coefficients with the wrong variables.)
But after fitting a regression model, for instance with the lm() function, you would generally use the predict() function to produce forecasts.
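As a sketch applied to the objects from the question (names taken from the printed output above; note that 'estimates' is not in the same order as the columns of 'values', so matching by name matters):
## Align the predictor columns with the coefficient names, then multiply
X <- as.matrix(values[, names(estimates)[-1]])   # reorder columns by name
y_hat <- unname(estimates[1]) + drop(X %*% estimates[-1])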

Related

Obtaining mean phylogenetic tree branch lengths from an ensemble of phylogenetic trees

I have a set of phylogenetic trees, some with different topologies and different branch lengths. Here is an example set:
(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);
(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);
(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);
(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);
(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);
(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);
(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);
(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);
In this example there are 2 unique topologies - using R's ape unique.multiPhylo shows that (assuming the example above is saved to a file tree.fn):
tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = F, use.edge.length = F)
> length(tree)
[1] 8
> length(unique.tree)
[1] 2
My question is: how do I get a list of trees, one per unique topology in the input list, where the branch lengths are a summary statistic (such as the mean or median) across all trees with that topology?
In the example above, this would return the first tree as is, because its topology is unique, plus one tree with the shared topology of the other trees and mean or median branch lengths.
If I understand correctly, you want to sort the trees into groups by unique topology (e.g. in your example, the first group contains one tree, etc.) and then compute some statistics for each group?
You can do that by first grouping the topologies into a list:
library(ape)  ## rmtree() and the other tree utilities come from ape
set.seed(5)
## Generating 20 four-tip trees (hopefully some will share a topology!)
tree_list <- rmtree(20, 4)
## How many unique topologies?
length(unique(tree_list))
## Sorting the trees by topologies
tree_list_tmp <- tree_list
sorted_tree_list <- list()
counter <- 0
while(length(tree_list_tmp) != 0) {
    counter <- counter + 1
    ## Is the first tree equal to any of the trees in the list?
    equal_to_tree_one <- unlist(lapply(tree_list_tmp, function(x, base)
        all.equal(x, base, use.edge.length = FALSE), base = tree_list_tmp[[1]]))
    ## Saving the identical trees
    sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
    ## Removing them from the list
    tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
    ## Repeat while there are still some trees!
}
## The list of topologies should be equal to the number of unique trees
length(sorted_tree_list) == length(unique(tree_list))
## Giving them names for fanciness
names(sorted_tree_list) <- paste0("topology", 1:length(sorted_tree_list))
Then, for all the trees in each unique topology group, you can extract different summary statistics by writing a function. Here, for example, I will compute the mean, the standard deviation, and the 5% and 95% quantiles of the branch lengths.
## function for getting some stats
get.statistics <- function(unique_topology_group) {
    ## Extract the branch lengths of all the trees
    branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))
    ## Apply some statistics
    return(c(n = length(unique_topology_group),
             mean = mean(branch_lengths),
             sd = sd(branch_lengths),
             quantile(branch_lengths, prob = c(0.05, 0.95))))
}
## Getting all the stats
all_stats <- lapply(sorted_tree_list, get.statistics)
## and making it into a nice table
round(do.call(rbind, all_stats), digits = 3)
# n mean sd 5% 95%
# topology1 3 0.559 0.315 0.113 0.962
# topology2 2 0.556 0.259 0.201 0.889
# topology3 4 0.525 0.378 0.033 0.989
# topology4 2 0.489 0.291 0.049 0.855
# topology5 2 0.549 0.291 0.062 0.882
# topology6 1 0.731 0.211 0.443 0.926
# topology7 3 0.432 0.224 0.091 0.789
# topology8 1 0.577 0.329 0.115 0.890
# topology9 1 0.473 0.351 0.108 0.833
# topology10 1 0.439 0.307 0.060 0.795
Of course you can tweak it to get your own desired stats, or even get the stats per tree within each group (using a double lapply: lapply(sorted_tree_list, lapply, get.statistics) or something like that).
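And to produce what the question literally asks for, one tree per topology with averaged branch lengths, here is a minimal sketch. It assumes that trees within a group list their edges in the same order, which is worth checking on real data:
## One representative tree per group, with edge-wise mean branch lengths
mean.tree <- function(unique_topology_group) {
    template <- unique_topology_group[[1]]
    ## Matrix of branch lengths: one column per tree in the group
    edges <- sapply(unique_topology_group, function(x) x$edge.length)
    template$edge.length <- rowMeans(as.matrix(edges))
    template
}
mean_trees <- lapply(sorted_tree_list, mean.tree)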

Build Logistic Regression Model for shares

The data I am working with contains the closing prices of 10 shares from the S&P 500 index.
Data:
> dput(head(StocksData))
structure(list(ACE = c(56.86, 56.82, 56.63, 56.39, 55.97, 55.23
), AMD = c(8.47, 8.77, 8.91, 8.69, 8.83, 9.19), AFL = c(51.83,
50.88, 50.78, 50.5, 50.3, 49.65), APD = c(81.59, 80.38, 80.03,
79.61, 79.76, 79.77), AA = c(15.12, 15.81, 15.85, 15.66, 15.71,
15.78), ATI = c(53.54, 52.37, 52.53, 51.91, 51.32, 51.45), AGN = c(69.77,
69.53, 69.69, 69.98, 68.99, 68.75), ALL = c(29.32, 29.03, 28.99,
28.66, 28.47, 28.2), MO = c(20.09, 20, 20.07, 20.16, 20, 19.88
), AMZN = c(184.22, 185.01, 187.42, 185.86, 185.49, 184.68)), row.names = c(NA,
6L), class = "data.frame")
In the following part, I am calculating the daily percentage changes of the 10 shares:
perc_change <- (StocksData[-1, ] - StocksData[-nrow(StocksData), ])/StocksData[-nrow(StocksData), ] * 100
perc_change
Output :
# ACE AMD AFL APD AA ATI AGN ALL MO AMZN
#2 -0.07 3.5 -1.83 -1.483 4.56 -2.19 -0.34 -0.99 -0.45 0.43
#3 -0.33 1.6 -0.20 -0.435 0.25 0.31 0.23 -0.14 0.35 1.30
#4 -0.42 -2.5 -0.55 -0.525 -1.20 -1.18 0.42 -1.14 0.45 -0.83
#5 -0.74 1.6 -0.40 0.188 0.32 -1.14 -1.41 -0.66 -0.79 -0.20
#6 -1.32 4.1 -1.29 0.013 0.45 0.25 -0.35 -0.95 -0.60 -0.44
With the above code I find the latest N rates of change (N should be in [1,10]).
I want to build a logistic regression model in order to predict the change on the next day (N + 1), i.e., "increase" or "decrease".
First, I split the data into two chunks, a training set and a testing set:
(NOTE: as the test set I must take the last 40 sessions, and as the training set the 85 sessions preceding the test set!)
trainset <- head(StocksData, 870)
testset <- tail(StocksData, 40)
Continued with the fitting of the model:
model <- glm(Here???,family=binomial(link='logit'),data=trainset)
The problem I am facing is that I don't understand what to include in the glm function. I have studied many logistic regression models, and I think my data lacks the response variable that needs to go there.
Can anyone help with this confusing part of my code?
Based on what you shared, you need to predict an increase or decrease in the portfolio when new data arrives. In that case, you first need to define the target variable. We can do that by counting the number of positive and negative changes in each row. With those counts, we can create a target variable that is 1 if there are more positive than negative changes (an increase) and 0 otherwise (no increase). The data shared is pretty small, but I have sketched the code so that you can apply the training/test approach for the modeling. Here is the code:
We will start from perc_change and compute the positive and negative variables:
#Build variables
#Store the number of positive and negative changes per row
i <- names(perc_change)
perc_change$Neg <- apply(perc_change[,i],1,function(x) length(which(x<0)))
perc_change$Pos <- apply(perc_change[,i],1,function(x) length(which(x>0)))
Now, we create the target variable with a conditional:
#Build target variable
perc_change$Target <- ifelse(perc_change$Pos>perc_change$Neg,1,0)
We make a copy of the data and remove the helper variables we no longer need:
#Replicate data
perc_change2 <- perc_change
perc_change2$Neg <- NULL
perc_change2$Pos <- NULL
With perc_change2 the input is ready, and you should split it into train/test data. I will not do that here as the data is too small; I will go directly to the model:
#Train the model, few data for train/test in example but you can adjust that
model <- glm(Target~.,family=binomial(link='logit'),data=perc_change2)
With that model you can evaluate performance and so on. Please do not hesitate to tell me if more details are needed.
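For instance, a minimal evaluation sketch on the same data (with a proper split you would call predict() on the held-out test rows instead):
#Predicted probability of an increase, then a 0.5 cutoff
p <- predict(model, newdata = perc_change2, type = "response")
pred_class <- ifelse(p > 0.5, 1, 0)
#Confusion matrix of predicted vs. observed target
table(predicted = pred_class, actual = perc_change2$Target)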

How are the threshold or cutoff points in {Epi} R package selected?

In the R package {Epi}, the ROC() function can generate an ROC plot from the aSAH dataset in the {pROC} package with the following commands:
require(Epi)
require(pROC)
data(aSAH)
rock = ROC(form = outcome ~ s100b, data=aSAH, plot = "ROC", MX = T)
The sensitivity and specificity were calculated for the 51 points in rock$res (nrow(rock$res) is 51). By contrast, nrow(aSAH) is 113.
Which points were used to generate rock$res?
If we were using the function roc() from the package {pROC} instead, we could get this via roc(aSAH$outcome, aSAH$s100b)$thresholds. But being different packages, they probably select the points differently.
The answer is... of course... in the package documentation:
res dataframe with variables sens, spec, pvp, pvn and name of the test
variable. The latter is the unique values of test or linear predictor
from the logistic regression in ascending order with -Inf prepended.
So what are the unique values?
points = unique(aSAH$s100b)
> length(points)
[1] 50
That's 50 values, plus the prepended -Inf, makes 51!
Nice inkling, but can we prove it... I think so:
require(Epi)
require(pROC)
data(aSAH)
rock = ROC(form = outcome ~ s100b, data=aSAH, plot = "ROC", MX = T)
d = aSAH
> head(d)
gos6 outcome gender age wfns s100b ndka
29 5 Good Female 42 1 0.13 3.01
30 5 Good Female 37 1 0.14 8.54
31 5 Good Female 42 1 0.10 8.09
points = sort(unique(d$s100b))
> head(points)
[1] 0.03 0.04 0.05 0.06 0.07 0.08
> length(points)
[1] 50
## Logistic regression coefficients:
beta.0 = as.numeric(rock$lr$coefficients[1])
beta.1 = as.numeric(rock$lr$coefficients[2])
## Sigmoid function:
sigmoid = 1 / (1 + exp(-(beta.0 + beta.1 * points)))
sigmoid = c(-Inf, sigmoid)  ## prepend -Inf, as ROC() does
lr.eta = rock$res$lr.eta
length(lr.eta)
head(lr.eta)
head(sigmoid)
> head(lr.eta)
[1] -Inf 0.1663429 0.1732556 0.1803934 0.1877585 0.1953526
> head(sigmoid)
[1] -Inf 0.1663429 0.1732556 0.1803934 0.1877585 0.1953526
## Trying to get the lr.eta number 0.304 on the plot:
> which.max(rowSums(rock$res[, c("sens", "spec")]))
# 0.30426295405785 18
## What do we find in row 18 or res?
> rock$res[18,]
sens spec pvp pvn lr.eta
0.30426295405785 0.6341463 0.8055556 0.2054795 0.35 0.304263
## Yet, lr.eta is not the Youden's J statistic or index:
> rock$res[18,"sens"] + rock$res[18,"spec"] - 1
[1] 0.4397019
## Instead, it is the Probability of the outcome at the input with max Youden's index:
## Excluding the "-Inf" introduced by the ROC function (position 17 as opposed to 18):
max.sens.sp.cut = points[17]
> 1 / (1 + exp(-(beta.0 + beta.1 * max.sens.sp.cut)))
[1] 0.304263
Done!
The lr.eta shown on the plot is, therefore, the predicted probability of the outcome at the threshold corresponding to the maximum Youden's index.
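Conversely, you can recover the cutoff on the original s100b scale from the plotted lr.eta by inverting the logistic transform. A quick sketch using the objects defined above:
## Invert the sigmoid: logit(lr.eta) = beta.0 + beta.1 * cutoff
lr.eta.max = rock$res$lr.eta[which.max(rowSums(rock$res[, c("sens", "spec")]))]
(log(lr.eta.max / (1 - lr.eta.max)) - beta.0) / beta.1  ## should match points[17]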

Cox proportional hazard model in R vs Stata

I'm trying to replicate in R a Cox proportional hazards model estimated in Stata, using the following data: http://iojournal.org/wp-content/uploads/2015/05/FortnaReplicationData.dta
The command in Stata is the following:
stset enddate2009, id(VPFid) fail(warends) origin(time startdate)
stcox HCTrebels o_rebstrength demdum independenceC transformC lnpop lngdppc africa diffreligion warage if keepobs==1, cluster(js_country)
Cox regression -- Breslow method for ties
No. of subjects = 104 Number of obs = 566
No. of failures = 86
Time at risk = 194190
Wald chi2(10) = 56.29
Log pseudolikelihood = -261.94776 Prob > chi2 = 0.0000
(Std. Err. adjusted for 49 clusters in js_countryid)
-------------------------------------------------------------------------------
| Robust
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
HCTrebels | .4089758 .1299916 -2.81 0.005 .2193542 .7625165
o_rebstrength | 1.157554 .2267867 0.75 0.455 .7884508 1.699447
demdum | .5893352 .2353317 -1.32 0.185 .2694405 1.289027
independenceC | .5348951 .1882826 -1.78 0.075 .268316 1.066328
transformC | .5277051 .1509665 -2.23 0.025 .3012164 .9244938
lnpop | .9374204 .0902072 -0.67 0.502 .7762899 1.131996
lngdppc | .9158258 .1727694 -0.47 0.641 .6327538 1.325534
africa | .5707749 .1671118 -1.92 0.055 .3215508 1.013165
diffreligion | 1.537959 .4472004 1.48 0.139 .869834 2.719275
warage | .9632408 .0290124 -1.24 0.214 .9080233 1.021816
-------------------------------------------------------------------------------
In R, I'm using the following:
library(foreign)   # for read.dta()
library(survival)  # for Surv() and coxph()
data <- read.dta("FortnaReplicationData.dta")
data4 <- subset(data, keepobs==1)
data4$end_date <- data4$`_t`
data4$start_date <- data4$`_t0`
levels(data4$o_rebstrength) <- c(0:4)
data4$o_rebstrength <- as.numeric(levels(data4$o_rebstrength))[data4$o_rebstrength]
data4 <- data4[,c("start_date", "end_date","HCTrebels", "o_rebstrength", "demdum", "independenceC", "transformC", "lnpop", "lngdppc", "africa", "diffreligion", "warage", "js_countryid", "warends")]
data4 <- na.omit(data4)
surv <- coxph(Surv(start_date, end_date, warends) ~ HCTrebels+ o_rebstrength +demdum + independenceC+ transformC+ lnpop+ lngdppc+ africa +diffreligion+ warage+cluster(js_countryid), data = data4, robust = TRUE, method="breslow")
coef exp(coef) se(coef) robust se z p
HCTrebels -0.8941 0.4090 0.3694 0.3146 -2.84 0.0045
o_rebstrength 0.1463 1.1576 0.2214 0.1939 0.75 0.4505
demdum -0.5288 0.5893 0.4123 0.3952 -1.34 0.1809
independenceC -0.6257 0.5349 0.3328 0.3484 -1.80 0.0725
transformC -0.6392 0.5277 0.3384 0.2831 -2.26 0.0240
lnpop -0.0646 0.9374 0.1185 0.0952 -0.68 0.4974
lngdppc -0.0879 0.9158 0.2060 0.1867 -0.47 0.6377
africa -0.5608 0.5708 0.3024 0.2898 -1.94 0.0530
diffreligion 0.4305 1.5380 0.3345 0.2878 1.50 0.1347
warage -0.0375 0.9632 0.0405 0.0298 -1.26 0.2090
Likelihood ratio test=30.1 on 10 df, p=0.000827
n= 566, number of events= 86
I get the same hazard ratios, but the standard errors do not look the same. The z and p values are close but not exactly the same. What might explain the difference between the results in R and Stata?
As user20650 noticed, when you include the "nohr" option in Stata you get almost exactly the same standard errors as in R. The small remaining difference when using clusters arises, as user20650 again noticed, because Stata by default multiplies the clustered variance by g/(g − 1), where g is the number of clusters, while R does not apply this adjustment. So a solution is to include noadjust in Stata, or to adjust the standard errors in R by doing:
sqrt(diag(vcov(surv))* (49/48))
If we still want R to reproduce the Stata standard errors reported when nohr is not specified, we need to know that when nohr is left off, Stata reports $\exp(\beta)$ with standard errors estimated on that scale, obtained by applying the delta method to the original standard-error estimate. "The delta method obtains the standard error of a transformed variable by calculating the variance of the corresponding first-order Taylor expansion, which for the transform $\exp(\beta)$ amounts to multiplying the original standard error by $\exp(\hat{\beta})$. This trick of calculation yields identical results as does transforming the parameters prior to estimation and then reestimating" (Cleves et al. 2010). In R we can do it by using:
library(msm)
## Cluster-adjusted variances of the coefficients (g/(g - 1), with g = 49)
se <- diag(vcov(surv) * (49/48))
## Delta method for exp(beta): pass each coefficient and its variance
sapply(se, function(x) deltamethod(~ exp(x1), coef(surv)[which(se == x)], x))
HCTrebels o_rebstrength demdum independenceC transformC lnpop lngdppc africa diffreligion warage
0.1299916 0.2267867 0.2353317 0.1882826 0.1509665 0.0902072 0.1727694 0.1671118 0.4472004 0.02901243
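Equivalently, since by the quoted delta-method identity the transformed standard error is just the original one multiplied by $\exp(\hat{\beta})$, you can skip msm entirely:
## Same hazard-ratio standard errors without msm
exp(coef(surv)) * sqrt(diag(vcov(surv)) * (49/48))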

R Linear Regression Data in Single Column

I have the following data as an example:
InputName InputValue Output
===================================
Oxide 35 0.4
Oxide 35.2 0.42
Oxide 34.6 0.38
Oxide 35.9 0.46
CD 0.5 0.42
CD 0.48 0.4
CD 0.56 0.429
I want to do a linear regression of InputValue vs. Output treating different InputName as independent predictors.
If I want to use lm(Output ~ Oxide + CD) in R, it assumes a separate column for each predictor. In the example above, that would mean making a separate column each for Oxide and CD. I can do that with the cast function from the reshape package, but it might introduce NAs in the data.
However, is there a way to tell the lm function directly that the input predictors are grouped according to the column InputName, with their values given in the column InputValue?
It seems to me you are describing a form of dummy variable coding. This is not necessary in R at all, since any factor column in your data will automatically be dummy coded for you.
Recreate your data:
dat <- read.table(text="
InputName InputValue Output
Oxide 35 0.4
Oxide 35.2 0.42
Oxide 34.6 0.38
Oxide 35.9 0.46
CD 0.5 0.42
CD 0.48 0.4
CD 0.56 0.429
", header=TRUE)
Now build the model you described, but drop the intercept to make things a little bit more explicit:
fit <- lm(Output ~ InputValue + InputName - 1, dat)
summary(fit)
Call:
lm(formula = Output ~ InputValue + InputName - 1, data = dat)
Residuals:
1 2 3 4 5 6 7
-0.003885 0.003412 0.001519 -0.001046 0.004513 -0.014216 0.009703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
InputValue 0.063512 0.009864 6.439 0.00299 **
InputNameCD 0.383731 0.007385 51.962 8.21e-07 ***
InputNameOxide -1.819018 0.346998 -5.242 0.00633 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.009311 on 4 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9995
F-statistic: 4662 on 3 and 4 DF, p-value: 1.533e-07
Notice how all of your factor levels for InputName appear in the output, giving you a separate estimate of the effect of each level.
Concisely, the information you need is in these two lines:
InputNameCD 0.383731 0.007385 51.962 8.21e-07 ***
InputNameOxide -1.819018 0.346998 -5.242 0.00633 **
Here are two ways of doing this: split the data and run the regressions separately, or use interaction terms to specify that the different levels of InputName should have separate slopes:
Split
lapply(split(dat,dat$InputName),lm,formula=Output~InputValue)
$CD
Call:
FUN(formula = ..1, data = X[[1L]])
Coefficients:
(Intercept) InputValue
0.2554 0.3135
$Oxide
Call:
FUN(formula = ..1, data = X[[2L]])
Coefficients:
(Intercept) InputValue
-1.78468 0.06254
Interaction
lm(Output~InputName + InputName:InputValue - 1,dat)
Call:
lm(formula = Output ~ InputName + InputName:InputValue - 1, data = dat)
Coefficients:
InputNameCD InputNameOxide InputNameCD:InputValue InputNameOxide:InputValue
0.25542 -1.78468 0.31346 0.06254
For comparison purposes I've also removed the intercept. Note that the estimated coefficients are the same in each case.
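Either fit can then be used with predict(); for example, with the interaction model (the new rows here are made up for illustration):
fit2 <- lm(Output ~ InputName + InputName:InputValue - 1, dat)
newdat <- data.frame(InputName = c("Oxide", "CD"), InputValue = c(35.5, 0.52))
predict(fit2, newdat)  # each row uses its own group's intercept and slope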
