Boxplot factor across many samples with R

Boxplot factor across many samples with R - r

Given the data
Step A B C D E F G I J
1 1 0.158 0.011 0.099 6.504 5.914 0.000 0.100 0.330 0.000
2 2 0.345 0.016 0.102 6.050 5.285 0.000 0.102 0.316 0.001
3 1 0.324 0.015 0.100 7.146 6.426 0.000 0.101 0.293 0.000
4 2 0.264 0.015 0.099 5.864 5.202 0.000 0.101 0.296 0.000
5 1 0.346 0.022 0.101 5.889 5.027 0.000 0.101 0.411 0.000
6 2 0.397 0.022 0.130 6.061 5.311 0.000 0.131 0.220 0.000
7 1 0.337 0.015 0.048 7.417 6.839 0.000 0.110 0.129 0.000
8 2 0.362 0.016 0.143 5.726 4.951 0.001 0.144 0.268 0.000
9 1 0.178 0.011 0.099 5.831 5.290 0.000 0.100 0.261 0.000
d < - read.table('sample.txt', header=T) gives me a data frame, and boxplot(d$A ~ d$Step) yields a reasonable graph, but I cannot seem to get all plots on the same graph. Something like boxplot(d ~ d$Step) is what I expected to work, but I get the following error:
Error in model.frame.default(formula = d ~ d$Step) :
invalid type (list) for variable 'd'
I've tried making Step a factor d$Step <- as.factor(d$Step) but that seems to have no effect.

An alternative is to plot these in base R each on their own scale, like this
par(mfrow=c(3,3))
for(i in 2:10) {
boxplot(d[,i] ~ d$Step, main=names(d)[i]) }

We can do this with tidyverse
library(tidyverse)
gather(d, Var, Val, -Step) %>%
mutate(Step=factor(Step)) %>%
ggplot(., aes(x=Var, y = Val, fill=Step)) +
geom_boxplot() +
scale_fill_manual(values = c("red", "blue"))

Related

Subset a dataframe to include samples from another file

I currently have a count matrix data.frame where the rownames are the genes and the colnames are the sample names
head(colnames(countmatrix_clean_cl_mouse))
[1] "UB01.31YE" "UT38.78EE" "YW49.74CE" "OB13.46DD" "OT35.78PE" "KE51.98JE"
head(rownames(countmatrix_clean_cl_mouse))
[1] "Gnai3" "Pbsn" "Cdc45" "H19" "Scml2" "Apoh"
head(countmatrix_clean_cl_mouse[,1:10])
UB01.31YE UT38.78EE YW49.74CE OB13.46DD OT35.78PE KE51.98JE YB40.88ZA UI68.54DC GB09.27EE QI98.56TC
Gnai3 88.608 67.174 104.042 103.504 80.314 81.985 104.550 58.628 70.957 89.278
Pbsn 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cdc45 10.121 6.637 12.057 5.356 13.340 3.309 7.987 83.508 8.491 93.227
H19 43.613 2.044 152.882 0.095 0.455 0.325 1.660 0.278 0.313 0.037
Scml2 0.342 0.000 0.283 0.517 0.000 0.000 0.000 2.262 0.684 4.787
Apoh 0.000 0.781 0.204 0.000 0.000 0.000 0.000 0.071 0.000 0.059
The above data.frame includes 963 samples but I want to subset the samples from that data/frame to the samples that I have in a separate excel sheet. Which looks like below. The sample names are the same but have a "-" instead of ".".
> head(pdac_samples)
V1
1 GT34-87JE
2 QT33-82OE
3 KT30-82ZE
4 UT38-78EE
5 SO33-16DD
6 CD10-05ZE
How would I go about subsetting countmatrix_clean_cl_mouse?

You can use sub to replace the - with ., then find the names in common, and use standard data[row, column] subsetting:
dot_names = sub(pattern = "-", replacement = ".", pdac_samples$V1, fixed = TRUE)
names_in_common = intersect(names(countmatrix_clean_cl_mouse), dot_names)
countmatrix_subset = countmatrix_clean_cl_mouse[, names_in_common, drop = FALSE]
# UT38.78EE
# Gnai3 67.174
# Pbsn 0.000
# Cdc45 6.637
# H19 2.044
# Scml2 0.000
# Apoh 0.781
Using this sample data:
countmatrix_clean_cl_mouse = read.table(text = ' UB01.31YE UT38.78EE YW49.74CE OB13.46DD OT35.78PE KE51.98JE YB40.88ZA UI68.54DC GB09.27EE QI98.56TC
Gnai3 88.608 67.174 104.042 103.504 80.314 81.985 104.550 58.628 70.957 89.278
Pbsn 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cdc45 10.121 6.637 12.057 5.356 13.340 3.309 7.987 83.508 8.491 93.227
H19 43.613 2.044 152.882 0.095 0.455 0.325 1.660 0.278 0.313 0.037
Scml2 0.342 0.000 0.283 0.517 0.000 0.000 0.000 2.262 0.684 4.787
Apoh 0.000 0.781 0.204 0.000 0.000 0.000 0.000 0.071 0.000 0.059', header = T)
pdac_samples = read.table(text = ' V1
1 GT34-87JE
2 QT33-82OE
3 KT30-82ZE
4 UT38-78EE
5 SO33-16DD
6 CD10-05ZE', header = T)

selectiveinference package: fixedLassoInf after lasso Logit

Following Taylor and Tibshirani (2015), I'm applying the selectiveinference package in R after a Lasso Logit fit with glmnet. Specifically, I'm interested in inference for the lasso with a fixed lambda.
Below I report the code:
First, I standardized the X matrix (as suggested https://cran.r-project.org/web/packages/selectiveInference/selectiveInference.pdf).
Then, I fit glmnet and after I extracted the beta coefficient for a lambda previously picked with LOOCV.
X.std <- std(X1[1:2833,])
fit = glmnet(X.std, Y[1:2833],alpha=1, family=c("binomial"))
fit$lambda
lambda=0.00431814
n=2833
beta_hat = coef(fit, x=X.std, y=Y[1:2833], s=lambda/n, exact=TRUE)
beta_hat
out = fixedLassoInf(X.std, Y[1:2833],beta_hat,lambda,family="binomial")
out
After I run the code, this is what I get. I understood that there is something related to KKT conditions, and that is a problem specific to Lasso Logit, as when I try with family=gaussian, I do not get any warnings or mistakes.
Warning message:
In fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = type, :
Solution beta does not satisfy the KKT conditions (to within specified tolerances)
> res
Call:
fixedLassoInf(x = X.std, y = Y[1:2833], beta = b, lambda = lam,
family = c("binomial"))
Testing results at lambda = 0.004, with alpha = 0.100
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 58.558 6.496 0.000 46.078 124.807 0.049 0.050
2 -8.008 -2.815 0.005 -13.555 -3.106 0.049 0.049
3 -18.514 -6.580 0.000 -31.262 -14.153 0.049 0.048
4 -1.070 -0.390 0.447 -22.976 19.282 0.050 0.050
5 -0.320 -1.231 0.610 -0.660 1.837 0.050 0.000
6 -0.448 -1.906 0.619 -2.378 5.056 0.050 0.050
7 -47.732 -9.624 0.000 -161.370 -44.277 0.050 0.050
8 -39.023 -8.378 0.000 -54.988 -31.510 0.050 0.048
10 23.827 1.991 0.181 -20.151 42.867 0.049 0.049
11 -2.454 -0.522 0.087 -269.951 9.345 0.050 0.050
12 0.045 0.018 0.993 -Inf -14.962 0.000 0.050
13 -18.647 -1.143 0.156 -149.623 25.464 0.050 0.050
14 -3.508 -1.140 0.305 -8.444 7.000 0.049 0.049
15 -0.620 -0.209 0.846 -3.486 46.045 0.050 0.050
16 -3.960 -1.288 0.739 -6.931 47.641 0.049 0.050
17 -8.587 -3.010 0.023 -42.700 -2.474 0.050 0.049
18 2.851 0.986 0.031 2.745 196.728 0.050 0.050
19 -6.612 -1.258 0.546 -14.967 37.070 0.049 0.000
20 -11.621 -2.291 0.021 -29.558 -2.536 0.050 0.049
21 -76.957 -0.980 0.565 -186.701 483.180 0.049 0.050
22 -13.556 -5.053 0.000 -126.367 -13.274 0.050 0.049
23 -4.836 -0.388 0.519 -109.667 125.933 0.050 0.050
24 11.355 0.898 0.492 -55.335 30.312 0.050 0.049
25 -1.118 -0.146 0.919 -4.439 232.172 0.049 0.050
26 -7.776 -1.298 0.200 -17.540 8.006 0.050 0.049
27 0.678 0.234 0.515 -42.265 38.710 0.050 0.050
28 32.938 1.065 0.335 -77.314 82.363 0.050 0.049
Does someone know how to solve this warning?
I would like to understand which kind of "tolerances" should I specify.
Thank for the help.

Value at Risk for a Portfolio Backtest

I am trying to calculate measures for my portfolio backtest. I am using R package PerformanceAnalytics, and I want to apply/use its function VaR for every year where I've actually rebalanced my portfolio. This seems not to work, though I am pretty sure there must be a simple solution for it, as I have my table with all the logreturns needed, and a table with all the portfolio weights/year.
What I need is the VaR/year after the optimize.portfolio.rebalancing step.
port_ret <- portfolio.spec(assets=funds)
port_ret <- add.constraint(portfolio=port_ret, type="full_investment")
port_ret <- add.constraint(portfolio=port_ret, type="long_only")
port_ret <- add.constraint(portfolio=port_ret, type="box", min=0.0, max=0.2)
port_ret <- add.objective(portfolio=port_ret, type="quadratic_utility", risk_aversion=(4.044918))
port_ret <- add.objective(portfolio=port_ret, type="risk", name="StdDev")
port_ret <- add.objective(portfolio=port_ret, type="return", name="mean")
opt_rent<- optimize.portfolio(R=R, portfolio=port_ret, optimize_method="ROI", trace=TRUE)
plot(opt_rent, risk.col="StdDev", return.col="mean", main="Quadratic Utility Optimization", chart.assets=TRUE, xlim=c(0, 0.03), ylim=c(0, 0.002085))
extractStats(opt_rent)
bt_port_rent <- optimize.portfolio.rebalancing(R=R, portfolio= port_ret, optimize_method="ROI", rebalance_on="years", trace=TRUE, training_period= NULL)
chart.Weights(bt_port_rent, ylim=c(0, 1))
extractStats(bt_port_rent)
weights_rent <- round(extractWeights(bt_port_rent),3)
VaR(R, weights= weights_rent, portfolio_method="component",method="historical")
The current VaR calculaction gives me an error (R are the daily returns of the indices used, and weights_rent are the weights for the rebalancing, see below). Important to add is that the weights_rent are yearly, wheareas R is a daily data:
requires numeric/complex matrix/vector arguments
I assume this is because the VaR calculation requires a vector of weights and not a table with 20 rows providing different weights, see the weights table below:
> weights_rent
SPX RUA FTSE DAX NKY MSCI EM GOLD ASIA50 SSE BBAG REX GSCI
1998-12-31 0.200 0.200 0.198 0.002 0 0.000 0.000 0.000 0.000 0.200 0.200 0.000
1999-12-31 0.200 0.159 0.000 0.188 0 0.000 0.000 0.200 0.076 0.177 0.000 0.000
2000-12-29 0.179 0.000 0.000 0.150 0 0.000 0.000 0.071 0.200 0.200 0.000 0.200
2001-12-31 0.147 0.000 0.000 0.045 0 0.000 0.077 0.122 0.200 0.200 0.200 0.010
2002-12-31 0.013 0.000 0.000 0.000 0 0.000 0.200 0.106 0.109 0.200 0.200 0.172
2003-12-31 0.000 0.053 0.000 0.000 0 0.000 0.200 0.137 0.071 0.200 0.200 0.140
2004-12-31 0.000 0.080 0.000 0.000 0 0.000 0.200 0.161 0.000 0.200 0.200 0.160
2005-12-30 0.000 0.070 0.000 0.000 0 0.000 0.200 0.193 0.000 0.200 0.145 0.191
2006-12-29 0.000 0.097 0.000 0.000 0 0.015 0.200 0.196 0.193 0.200 0.000 0.098
2007-12-31 0.000 0.008 0.000 0.017 0 0.130 0.200 0.125 0.200 0.200 0.000 0.120
2008-12-31 0.000 0.055 0.000 0.025 0 0.000 0.200 0.129 0.130 0.200 0.200 0.061
2009-12-31 0.000 0.051 0.000 0.010 0 0.007 0.200 0.145 0.162 0.200 0.200 0.024
2010-12-31 0.000 0.064 0.000 0.015 0 0.012 0.200 0.158 0.129 0.200 0.200 0.023
2011-12-30 0.000 0.098 0.000 0.000 0 0.000 0.200 0.149 0.119 0.200 0.200 0.035
2012-12-31 0.000 0.099 0.000 0.014 0 0.000 0.200 0.161 0.109 0.200 0.200 0.018
2013-12-31 0.000 0.134 0.000 0.025 0 0.000 0.200 0.146 0.095 0.200 0.200 0.000
2014-12-31 0.000 0.138 0.000 0.016 0 0.000 0.200 0.117 0.130 0.200 0.200 0.000
2015-12-31 0.000 0.129 0.000 0.041 0 0.000 0.200 0.102 0.127 0.200 0.200 0.000
2016-12-30 0.000 0.148 0.000 0.036 0 0.000 0.200 0.119 0.098 0.200 0.200 0.000
2017-12-29 0.000 0.151 0.000 0.018 0 0.000 0.200 0.146 0.085 0.200 0.200 0.000
2018-12-31 0.000 0.179 0.000 0.004 0 0.000 0.200 0.150 0.066 0.200 0.200 0.000
I would really appreciate some help. Thanks in advance.
Edit Test Data:
#fake data
data(edhec)
ticker1 <- c("ConA","CTA","DisE","EM","EQN","EvD", "FIA", "GM", "LSE","MA", "RV", "SS","FF")
colnames(edhec) <- ticker1
fund.names <- colnames(edhec)
port_test <- portfolio.spec(assets=fund.names)
port_test <- add.constraint(portfolio=port_test, type="full_investment")
port_test <- add.constraint(portfolio=port_test, type="long_only")
port_test <- add.constraint(portfolio=port_test, type="box", min=0.0, max=0.2)
port_test <- add.objective(portfolio=port_test, type="quadratic_utility", risk_aversion=(4.044918))
port_test <- add.objective(portfolio=port_test, type="risk", name="StdDev")
port_test <- add.objective(portfolio=port_test, type="return", name="mean")
bt_port_test <- optimize.portfolio.rebalancing(R=edhec, portfolio= port_test, optimize_method="ROI", rebalance_on="years", trace=TRUE, training_period= NULL)
chart.Weights(bt_port_test, ylim=c(0, 1))
extractStats(bt_port_test)
weights_test <- round(extractWeights(bt_port_test),3)
weights_test
head(edhec)
#split data per year (result in list)
ret.year <- split(edhec, f="years")
#calculating yearly VaR
VaRs = rollapply(data = edhec, width = 20, FUN = function(x) VaR(x, p = 0.95, weights= weights_test, portfolio_method="component",method = "historical", by.column = TRUE))
I am getting the following error code:
Error in VaR(x, p = 0.95, weights = weights_test, portfolio_method = "component", :
number of items in weights not equal to number of columns in R
If tried to create a function:
ret.year2 <- ret.year[-c(1,2)]
VAR <- function(p, ret.year2, weights.year){
a <- for(i in 1:ret.year2)
b <- for(j in 1:weights.year)
VaR(a,p=0.95,weights= b, portfolio_method="component",method = "historical")
}
resultat <- VAR(p=0.95,ret.year2=ret.year2, weights.year= weights.year)
which unfortunately didn't work out as expected:
Error in 1:ret.year2 : NA/NaN argument
In addition: Warning message:
In 1:ret.year2 : numerical expression has 11 elements: only the first used

Based on the function documentation, it seems like the reason for the error could be the one you´ve mentioned yourself: weights argument requires a vector of weights - not a zoo object or something else. You could try to give VaR function what it wants - a vector of numeric values.
And, if you want to get 20 VaR function values (one for each year in R) it would seem logical to feed VaR one year of data/R at a time, which ultimately would give you the wanted 20 function values.
If you want, you can automate the process and in a loop subset the data by year, 1 year at a time, and feed it to the VaR, then either print the results or store the in some data structure.
EDIT: With your fake data you can analyse it like this:
library(ROI)
library(ROI.plugin.quadprog)
library(ROI.plugin.glpk)
library(PerformanceAnalytics)
library(PortfolioAnalytics)
# your code here
#split data per year (result in list)
ret.year <- split(edhec, f="years")
# split weights per year
weights.year <- split(weights_test, f="years")
# loop over the list of weights, find corresponding data from edhec and run the analysis
for (i in 1:length(weights.year)){
weight <- weights.year[[i]]
year_weight <- as.numeric(format(start(weight), "%Y"))
weight <- as.vector(weight)
for (j in 1:length(ret.year)){
YearlyR <- ret.year[[j]]
year_R <- as.numeric(format(start(YearlyR), "%Y"))
if (year_R==year_weight){
print(paste("BINGO - years match: ", year_R, year_weight, sep=" "))
result <- VaR(YearlyR, weights= weight, portfolio_method="component",method="historical")
print(result)
}
}
}

Extract rows from a matrix based of values from another matrix

I need your help!
I am trying to pull out rows of the second matrix based on IDs from the first matrix. To check that my function (which is not provided here) works correctly, I run the following code (CritMat is the second matrix and parms is the first):
results <- matrix(0, nrow = 15, ncol = 8)
colnames(results) <- c("alpha", "beta", "omega", "T=64", "T=128", "T=256", "T=512", "T=1024")
for (r in 1:15) {
results [r,] <- CritMat[CritMat[, 1] == parms[r, 2] & CritMat[, 2] ==
parms[r, 1] & CritMat[, 3] == parms[r, 3] , ]
print(results[r,])
}
The loop works for the first 4 iterations followed by the following error message for the fifth:
*Error in results[r, ] <- CritMat[CritMat[, 1] == parms[r, 2] & CritMat[, :
replacement has length zero*
Any idea why this happens and solution.
Many thanks
AA
****parms matrix****
beta alpha omega
1 0.005 0.005 0.990
2 0.240 0.005 0.755
3 0.490 0.005 0.505
4 0.740 0.005 0.255
5 0.990 0.005 0.005
6 0.005 0.250 0.745
7 0.240 0.250 0.510
8 0.490 0.250 0.260
9 0.740 0.250 0.010
10 0.005 0.500 0.495
11 0.240 0.500 0.260
12 0.490 0.500 0.010
13 0.005 0.750 0.245
14 0.240 0.750 0.010
15 0.005 0.990 0.005
****CritMat matrix****
alpha beta omega T.64 T.128 T.256 T.512 T.1024
1 0.005 0.005 0.990 -2.956420 -2.919654 -2.921704 -2.886429 -2.879443
2 0.005 0.240 0.755 -2.959242 -2.917744 -2.923356 -2.885018 -2.881905
3 0.005 0.490 0.505 -2.959395 -2.915798 -2.927405 -2.886637 -2.885186
4 0.005 0.740 0.255 -2.957763 -2.912088 -2.934518 -2.890182 -2.889484
5 0.005 0.990 0.005 -2.937999 -2.857668 -2.864637 -2.819950 -2.820588
6 0.250 0.005 0.745 -2.987160 -2.986864 -2.897846 -2.865875 -2.911572
7 0.250 0.240 0.510 -3.034868 -2.979375 -2.924888 -2.875446 -2.898752
8 0.250 0.490 0.260 -3.052279 -2.995942 -2.969414 -2.926178 -2.918958
9 0.250 0.740 0.010 -3.197169 -3.263336 -3.258011 -3.202253 -3.248068
10 0.500 0.005 0.495 -3.031267 -3.038585 -2.936348 -2.921126 -2.908868
11 0.500 0.240 0.260 -3.142031 -3.086536 -3.026555 -3.079825 -2.871080
12 0.500 0.490 0.010 -3.383052 -3.410789 -3.431221 -3.367462 -3.332024
13 0.750 0.005 0.245 -3.209441 -3.170385 -3.112472 -3.141569 -2.925559
14 0.750 0.240 0.010 -3.452131 -3.517234 -3.428402 -3.477691 -3.178128
15 0.990 0.005 0.005 -3.427804 -3.491805 -3.298037 -3.290127 -3.087541

Evaluating a matrix by row for a condition being met in R

I've got data in the following format.
P10_neg._qn P11_neg._qn P12_neg._qn P14_neg._qn P17_neg._qn P24_neg._qn P25_neg._qn
1 -0.025 -0.037 -0.032 -0.061 -0.176 0.033 -0.011
2 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
3 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
4 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
5 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
6 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
What is the best way by which I can check, for every row, how many entries are greater than 0.1, for instance and return a vector of counts?

You can use the rowSum function for this task. Assuming that dat is you matrix then :
rowSum(dat > 0.1)
Using the sample data provided we have :
dat <- read.table(text = ' P10_neg._qn P11_neg._qn P12_neg._qn P14_neg._qn P17_neg._qn P24_neg._qn P25_neg._qn
1 -0.025 -0.037 -0.032 -0.061 -0.176 0.033 -0.011
2 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
3 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
4 0.033 -0.127 0.042 0.014 0.097 0.105 0.048
5 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087
6 -0.029 -0.125 0.003 -0.098 0.117 0.039 0.087',
row.names = 1, header = TRUE)
rowSums(dat > 0.1)
## 1 2 3 4 5 6
## 0 1 1 1 1 1

apply(dat, 1, function(x) sum(x>.1))
# [1] 0 1 1 1 1 1

here an Rcpp version:
// [[Rcpp::export]]
IntegerVector countGreaterThan2(NumericMatrix M,double val) {
IntegerVector res;
for (int i=0; i<M.nrow(); i++) {
NumericVector row = M( i, _);
double num = std::count_if(row.begin(), row.end(),
[&val](const double& x) -> bool {return x>val;});
res.push_back(num);
}
return res;
}
But rowSum is unbeatable:
system.time(rowSums(dfx>0.2))
user system elapsed
0.01 0.00 0.02
> system.time(countGreaterThan2(dfx,0.2))
user system elapsed
0.06 0.00 0.06

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Boxplot factor across many samples with R - r

An alternative is to plot these in base R each on their own scale, like this par(mfrow=c(3,3)) for(i in 2:10) { boxplot(d[,i] ~ d$Step, main=names(d)[i]) }

We can do this with tidyverse library(tidyverse) gather(d, Var, Val, -Step) %>% mutate(Step=factor(Step)) %>% ggplot(., aes(x=Var, y = Val, fill=Step)) + geom_boxplot() + scale_fill_manual(values = c("red", "blue"))

Related

Subset a dataframe to include samples from another file

selectiveinference package: fixedLassoInf after lasso Logit

Value at Risk for a Portfolio Backtest

Extract rows from a matrix based of values from another matrix

Evaluating a matrix by row for a condition being met in R

Categories

Resources