Subset a dataframe to include samples from another file - r

I currently have a count matrix data.frame where the rownames are the genes and the colnames are the sample names
head(colnames(countmatrix_clean_cl_mouse))
[1] "UB01.31YE" "UT38.78EE" "YW49.74CE" "OB13.46DD" "OT35.78PE" "KE51.98JE"
head(rownames(countmatrix_clean_cl_mouse))
[1] "Gnai3" "Pbsn" "Cdc45" "H19" "Scml2" "Apoh"
head(countmatrix_clean_cl_mouse[,1:10])
UB01.31YE UT38.78EE YW49.74CE OB13.46DD OT35.78PE KE51.98JE YB40.88ZA UI68.54DC GB09.27EE QI98.56TC
Gnai3 88.608 67.174 104.042 103.504 80.314 81.985 104.550 58.628 70.957 89.278
Pbsn 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cdc45 10.121 6.637 12.057 5.356 13.340 3.309 7.987 83.508 8.491 93.227
H19 43.613 2.044 152.882 0.095 0.455 0.325 1.660 0.278 0.313 0.037
Scml2 0.342 0.000 0.283 0.517 0.000 0.000 0.000 2.262 0.684 4.787
Apoh 0.000 0.781 0.204 0.000 0.000 0.000 0.000 0.071 0.000 0.059
The above data.frame includes 963 samples, but I want to subset it to only the samples I have in a separate Excel sheet, shown below. The sample names are the same, except that they use a "-" instead of a ".".
> head(pdac_samples)
V1
1 GT34-87JE
2 QT33-82OE
3 KT30-82ZE
4 UT38-78EE
5 SO33-16DD
6 CD10-05ZE
How would I go about subsetting countmatrix_clean_cl_mouse?

You can use sub to replace the - with ., then find the names in common, and use standard data[row, column] subsetting:
dot_names = sub(pattern = "-", replacement = ".", pdac_samples$V1, fixed = TRUE)
names_in_common = intersect(names(countmatrix_clean_cl_mouse), dot_names)
countmatrix_subset = countmatrix_clean_cl_mouse[, names_in_common, drop = FALSE]
# UT38.78EE
# Gnai3 67.174
# Pbsn 0.000
# Cdc45 6.637
# H19 2.044
# Scml2 0.000
# Apoh 0.781
Using this sample data:
countmatrix_clean_cl_mouse = read.table(text = ' UB01.31YE UT38.78EE YW49.74CE OB13.46DD OT35.78PE KE51.98JE YB40.88ZA UI68.54DC GB09.27EE QI98.56TC
Gnai3 88.608 67.174 104.042 103.504 80.314 81.985 104.550 58.628 70.957 89.278
Pbsn 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cdc45 10.121 6.637 12.057 5.356 13.340 3.309 7.987 83.508 8.491 93.227
H19 43.613 2.044 152.882 0.095 0.455 0.325 1.660 0.278 0.313 0.037
Scml2 0.342 0.000 0.283 0.517 0.000 0.000 0.000 2.262 0.684 4.787
Apoh 0.000 0.781 0.204 0.000 0.000 0.000 0.000 0.071 0.000 0.059', header = T)
pdac_samples = read.table(text = ' V1
1 GT34-87JE
2 QT33-82OE
3 KT30-82ZE
4 UT38-78EE
5 SO33-16DD
6 CD10-05ZE', header = T)
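If you also want the subset columns in the order of the Excel sheet rather than the column order of the count matrix, match() is an alternative to intersect(). A minimal sketch with tiny stand-ins for the two objects (the stand-in values are made up, only the names mirror the data above):

```r
# Tiny stand-ins for countmatrix_clean_cl_mouse and pdac_samples above
countmatrix_clean_cl_mouse <- data.frame(UB01.31YE = 1:2, UT38.78EE = 3:4,
                                         row.names = c("Gnai3", "Pbsn"))
pdac_samples <- data.frame(V1 = c("UT38-78EE", "GT34-87JE"))

dot_names <- sub("-", ".", pdac_samples$V1, fixed = TRUE)
# match() keeps the sheet's order; NA marks sheet samples missing from the matrix
idx <- match(dot_names, names(countmatrix_clean_cl_mouse))
countmatrix_subset <- countmatrix_clean_cl_mouse[, idx[!is.na(idx)], drop = FALSE]
```

With the real data you would simply skip the two stand-in assignments.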

Related

R: Create a new data frame from an old one by checking if the old data frame's first column value matches a third data frame's first column value

For example, let's say I have a first data frame that looks like this:
1 -0.123 -0.306 inf 1.043 0.000 0.010 0.000 0.653 0.000 0.091 0.000 0.009 0.000 3.097 0.000 0.137 0.002
2 -0.142 -0.170 inf 1.035 0.000 0.064 0.000 0.538 0.000 0.560 0.000 0.289 0.000 3.168 0.000 6.182 0.000
3 -0.160 -0.143 inf 1.027 0.000 0.086 0.000 0.401 0.000 0.631 0.000 0.400 0.000 3.348 0.000 0.130 0.000
4 -0.176 -0.117 inf 1.020 0.000 0.107 0.000 0.249 0.000 0.592 0.000 0.435 0.000 3.526 0.000 0.402 0.001
5 -0.191 -0.110 inf 1.014 0.000 0.133 0.000 0.091 0.000 0.514 0.000 0.425 0.000 3.644 0.001 0.598 0.001
6 -0.206 -0.099 inf 1.008 0.000 0.162 0.000 6.247 0.000 0.435 0.001 0.392 0.001 3.675 0.001 0.707 0.002
7 -0.220 -0.093 0.976 1.003 0.000 0.194 0.000 6.168 0.001 0.377 0.001 0.352 0.001 3.602 0.003 0.740 0.003
8 -0.233 -0.092 inf 0.999 0.000 0.226 0.000 6.137 0.001 0.353 0.001 0.302 0.001 3.445 0.004 0.712 0.005
9 -0.246 -0.124 inf 0.996 0.000 0.258 0.000 6.145 0.001 0.363 0.001 0.252 0.001 3.242 0.004 0.620 0.006
10 -0.259 -0.119 inf 0.994 0.000 0.289 0.000 6.172 0.001 0.393 0.001 0.206 0.001 3.028 0.005 0.456 0.008
I want to create a new second data frame from this first data frame - on one condition. I have a third frame that looks like this:
1 0.00038 0.75053 0.50 35 6000 0.75346
7 0.00038 0.75053 0.50 35 6300 0.60313
10 0.00038 0.75053 0.50 35 6450 0.55122
and I want to keep only the rows of the first data frame whose first-column value also appears in the third data frame's first column.
Ultimately, I want the second data frame to look like this:
1 -0.123 -0.306 inf 1.043 0.000 0.010 0.000 0.653 0.000 0.091 0.000 0.009 0.000 3.097 0.000 0.137 0.002
7 -0.220 -0.093 0.976 1.003 0.000 0.194 0.000 6.168 0.001 0.377 0.001 0.352 0.001 3.602 0.003 0.740 0.003
10 -0.259 -0.119 inf 0.994 0.000 0.289 0.000 6.172 0.001 0.393 0.001 0.206 0.001 3.028 0.005 0.456 0.008
This can be done like this:
df1[ df1[,1] %in% df3[,1], ]
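The one-liner above can be seen in action on toy data (the column names here are made up for illustration):

```r
# df1: ten rows keyed by an id in the first column
df1 <- data.frame(id = 1:10, val = (1:10) / 10)
# df3: the ids we want to keep
df3 <- data.frame(id = c(1, 7, 10), other = c(0.75, 0.60, 0.55))

# keep rows of df1 whose first-column value appears in df3's first column
df2 <- df1[df1[, 1] %in% df3[, 1], ]
```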

Value at Risk for a Portfolio Backtest

I am trying to calculate risk measures for my portfolio backtest. I am using the R package PerformanceAnalytics, and I want to apply its VaR function to every year in which I actually rebalanced my portfolio. This doesn't seem to work, though I am pretty sure there must be a simple solution, as I have a table with all the log returns needed and a table with all the portfolio weights per year.
What I need is the VaR/year after the optimize.portfolio.rebalancing step.
port_ret <- portfolio.spec(assets=funds)
port_ret <- add.constraint(portfolio=port_ret, type="full_investment")
port_ret <- add.constraint(portfolio=port_ret, type="long_only")
port_ret <- add.constraint(portfolio=port_ret, type="box", min=0.0, max=0.2)
port_ret <- add.objective(portfolio=port_ret, type="quadratic_utility", risk_aversion=(4.044918))
port_ret <- add.objective(portfolio=port_ret, type="risk", name="StdDev")
port_ret <- add.objective(portfolio=port_ret, type="return", name="mean")
opt_rent<- optimize.portfolio(R=R, portfolio=port_ret, optimize_method="ROI", trace=TRUE)
plot(opt_rent, risk.col="StdDev", return.col="mean", main="Quadratic Utility Optimization", chart.assets=TRUE, xlim=c(0, 0.03), ylim=c(0, 0.002085))
extractStats(opt_rent)
bt_port_rent <- optimize.portfolio.rebalancing(R=R, portfolio= port_ret, optimize_method="ROI", rebalance_on="years", trace=TRUE, training_period= NULL)
chart.Weights(bt_port_rent, ylim=c(0, 1))
extractStats(bt_port_rent)
weights_rent <- round(extractWeights(bt_port_rent),3)
VaR(R, weights= weights_rent, portfolio_method="component",method="historical")
The current VaR calculation gives me an error (R holds the daily returns of the indices used, and weights_rent the weights for the rebalancing, see below). Important to add: weights_rent is yearly, whereas R is daily data:
requires numeric/complex matrix/vector arguments
I assume this is because the VaR calculation requires a vector of weights and not a table with 20 rows providing different weights, see the weights table below:
> weights_rent
SPX RUA FTSE DAX NKY MSCI EM GOLD ASIA50 SSE BBAG REX GSCI
1998-12-31 0.200 0.200 0.198 0.002 0 0.000 0.000 0.000 0.000 0.200 0.200 0.000
1999-12-31 0.200 0.159 0.000 0.188 0 0.000 0.000 0.200 0.076 0.177 0.000 0.000
2000-12-29 0.179 0.000 0.000 0.150 0 0.000 0.000 0.071 0.200 0.200 0.000 0.200
2001-12-31 0.147 0.000 0.000 0.045 0 0.000 0.077 0.122 0.200 0.200 0.200 0.010
2002-12-31 0.013 0.000 0.000 0.000 0 0.000 0.200 0.106 0.109 0.200 0.200 0.172
2003-12-31 0.000 0.053 0.000 0.000 0 0.000 0.200 0.137 0.071 0.200 0.200 0.140
2004-12-31 0.000 0.080 0.000 0.000 0 0.000 0.200 0.161 0.000 0.200 0.200 0.160
2005-12-30 0.000 0.070 0.000 0.000 0 0.000 0.200 0.193 0.000 0.200 0.145 0.191
2006-12-29 0.000 0.097 0.000 0.000 0 0.015 0.200 0.196 0.193 0.200 0.000 0.098
2007-12-31 0.000 0.008 0.000 0.017 0 0.130 0.200 0.125 0.200 0.200 0.000 0.120
2008-12-31 0.000 0.055 0.000 0.025 0 0.000 0.200 0.129 0.130 0.200 0.200 0.061
2009-12-31 0.000 0.051 0.000 0.010 0 0.007 0.200 0.145 0.162 0.200 0.200 0.024
2010-12-31 0.000 0.064 0.000 0.015 0 0.012 0.200 0.158 0.129 0.200 0.200 0.023
2011-12-30 0.000 0.098 0.000 0.000 0 0.000 0.200 0.149 0.119 0.200 0.200 0.035
2012-12-31 0.000 0.099 0.000 0.014 0 0.000 0.200 0.161 0.109 0.200 0.200 0.018
2013-12-31 0.000 0.134 0.000 0.025 0 0.000 0.200 0.146 0.095 0.200 0.200 0.000
2014-12-31 0.000 0.138 0.000 0.016 0 0.000 0.200 0.117 0.130 0.200 0.200 0.000
2015-12-31 0.000 0.129 0.000 0.041 0 0.000 0.200 0.102 0.127 0.200 0.200 0.000
2016-12-30 0.000 0.148 0.000 0.036 0 0.000 0.200 0.119 0.098 0.200 0.200 0.000
2017-12-29 0.000 0.151 0.000 0.018 0 0.000 0.200 0.146 0.085 0.200 0.200 0.000
2018-12-31 0.000 0.179 0.000 0.004 0 0.000 0.200 0.150 0.066 0.200 0.200 0.000
I would really appreciate some help. Thanks in advance.
Edit Test Data:
#fake data
data(edhec)
ticker1 <- c("ConA","CTA","DisE","EM","EQN","EvD", "FIA", "GM", "LSE","MA", "RV", "SS","FF")
colnames(edhec) <- ticker1
fund.names <- colnames(edhec)
port_test <- portfolio.spec(assets=fund.names)
port_test <- add.constraint(portfolio=port_test, type="full_investment")
port_test <- add.constraint(portfolio=port_test, type="long_only")
port_test <- add.constraint(portfolio=port_test, type="box", min=0.0, max=0.2)
port_test <- add.objective(portfolio=port_test, type="quadratic_utility", risk_aversion=(4.044918))
port_test <- add.objective(portfolio=port_test, type="risk", name="StdDev")
port_test <- add.objective(portfolio=port_test, type="return", name="mean")
bt_port_test <- optimize.portfolio.rebalancing(R=edhec, portfolio= port_test, optimize_method="ROI", rebalance_on="years", trace=TRUE, training_period= NULL)
chart.Weights(bt_port_test, ylim=c(0, 1))
extractStats(bt_port_test)
weights_test <- round(extractWeights(bt_port_test),3)
weights_test
head(edhec)
#split data per year (result in list)
ret.year <- split(edhec, f="years")
#calculating yearly VaR
VaRs = rollapply(data = edhec, width = 20, FUN = function(x) VaR(x, p = 0.95, weights= weights_test, portfolio_method="component",method = "historical", by.column = TRUE))
I am getting the following error code:
Error in VaR(x, p = 0.95, weights = weights_test, portfolio_method = "component", :
number of items in weights not equal to number of columns in R
I tried to create a function:
ret.year2 <- ret.year[-c(1,2)]
VAR <- function(p, ret.year2, weights.year){
a <- for(i in 1:ret.year2)
b <- for(j in 1:weights.year)
VaR(a,p=0.95,weights= b, portfolio_method="component",method = "historical")
}
resultat <- VAR(p=0.95,ret.year2=ret.year2, weights.year= weights.year)
which unfortunately didn't work out as expected:
Error in 1:ret.year2 : NA/NaN argument
In addition: Warning message:
In 1:ret.year2 : numerical expression has 11 elements: only the first used
Based on the function documentation, it seems the reason for the error could be the one you've mentioned yourself: the weights argument requires a vector of weights, not a zoo object or anything else. You could try to give the VaR function what it wants: a vector of numeric values.
And, if you want to get 20 VaR values (one for each year in R), it seems logical to feed VaR one year of data at a time, which ultimately gives you the 20 wanted values.
If you want, you can automate the process with a loop that subsets the data one year at a time, feeds it to VaR, and then either prints the results or stores them in some data structure.
EDIT: With your fake data you can analyse it like this:
library(ROI)
library(ROI.plugin.quadprog)
library(ROI.plugin.glpk)
library(PerformanceAnalytics)
library(PortfolioAnalytics)
# your code here
#split data per year (result in list)
ret.year <- split(edhec, f="years")
# split weights per year
weights.year <- split(weights_test, f="years")
# loop over the list of weights, find corresponding data from edhec and run the analysis
for (i in 1:length(weights.year)){
  weight <- weights.year[[i]]
  year_weight <- as.numeric(format(start(weight), "%Y"))
  weight <- as.vector(weight)
  for (j in 1:length(ret.year)){
    YearlyR <- ret.year[[j]]
    year_R <- as.numeric(format(start(YearlyR), "%Y"))
    if (year_R == year_weight){
      print(paste("BINGO - years match: ", year_R, year_weight, sep=" "))
      result <- VaR(YearlyR, weights= weight, portfolio_method="component", method="historical")
      print(result)
    }
  }
}
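The year-matching idea in the double loop can also be sketched without an inner loop, by looking elements up by name. Below, plain named lists stand in for ret.year and weights.year, and a toy weighted sum stands in for the VaR(...) call, so the structure runs without PerformanceAnalytics:

```r
# Stand-ins for ret.year and weights.year, keyed by year
ret.year     <- list(`1998` = c(0.01, -0.02), `1999` = c(0.03, 0.00))
weights.year <- list(`1998` = c(0.5, 0.5),    `1999` = c(0.2, 0.8))

# One result per weight year; the name lookup replaces the inner for-loop
results <- lapply(names(weights.year), function(yr) {
  sum(ret.year[[yr]] * weights.year[[yr]])   # replace with VaR(...) in practice
})
names(results) <- names(weights.year)
```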

R "increasing 'x' and 'y' values expected" error

I'm trying to create a perspective graph in R and keep getting the increasing 'x' and 'y' error. I've tried numerous options but I can't seem to figure this out. Any help would be appreciated!
fit.A <- data.frame(Temp.f="Ambient",x,y)
fit.A$pred <- predict(model=lrNH4,newdata=fit.A)
x
[1] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
[17] 0.619 0.626 0.630 0.635 0.649 0.656 1.902 1.947 1.967 2.056 2.689 2.707 2.758 2.760 2.943 2.978
[33] 2.992 3.020 4.564 4.854 5.893 6.029 6.051 6.067
y
[1] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
[17] 0.479 0.530 0.566 0.605 0.635 0.726 1.909 1.916 2.128 2.195 2.636 2.645 2.747 2.777 2.943 3.057
[33] 3.169 3.203 4.657 4.813 5.956 5.986 6.154 6.157
persp(x,y,z=matrix(fit.A$pred,nrow=length(x),ncol=length(y),byrow=TRUE),
zlim=c(140,700))
Error in persp.default(x, y, z = matrix(fit.A$pred, nrow = length(x),
: increasing 'x' and 'y' values expected
Your x and y values aren't strictly increasing: they stay stuck at 0 for the first 16 entries.
To create a perspective plot, remove all rows where x and/or y are duplicated. This could be done with:
fit.A <- fit.A[!duplicated(fit.A$x), ]
(In different data it's possible you'd need to filter for duplicated y as well).
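For reference, here is a minimal working persp() call on toy data (not the asker's model), showing the two requirements the error is about: strictly increasing grid vectors, and a z matrix of dimension length(x) by length(y):

```r
# Drop duplicates, then sort, so the grid vectors are strictly increasing
x <- sort(unique(c(0, 0, 0, 0.5, 1, 1.5)))
y <- sort(unique(c(0, 0, 0.25, 0.75, 1.25)))
z <- outer(x, y, function(a, b) a^2 + b^2)   # toy surface evaluated on the grid
persp(x, y, z, theta = 30, phi = 30)
```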

Hierarchical Clustering in R - 'pvclust' Issues

I have made a reproducible example where I am having trouble with pvclust. My goal is to pick the ideal clusters in a hierarchical cluster dendrogram. I've heard of pvclust but can't figure out how to use it. If anyone has other suggestions for determining the ideal clusters, that would also be really helpful.
My code is provided.
library(pvclust)
employee<- c('A','B','C','D','E','F','G','H','I',
'J','K','L','M','N','O','P',
'Q','R','S','T',
'U','V','W','X','Y','Z')
salary<-c(20,30,40,50,20,40,23,05,56,23,15,43,53,65,67,23,12,14,35,11,10,56,78,23,43,56)
testing90<-cbind(employee,salary)
testing90<-as.data.frame(testing90)
head(testing90)
testing90$salary<-as.numeric(testing90$salary)
row.names(testing90)<-testing90$employee
testing91<-data.frame(testing90[,-1])
head(testing91)
row.names(testing91)<-testing90$employee
d<-dist(as.matrix(testing91))
hc<-hclust(d,method = "ward.D2")
hc
plot(hc)
par(cex=0.6, mar=c(5, 8, 4, 1))
plot(hc, xlab="", ylab="", main="", sub="", axes=FALSE)
par(cex=1)
title(xlab="Publishers", main="Hierarchical Cluster of Publishers by eCPM")
axis(2)
fit<-pvclust(d, method.hclust="ward.D2", nboot=1000, method.dist="eucl")
An error came up stating:
Error in names(edges.cnt) <- paste("r", 1:rl, sep = "") :
'names' attribute [2] must be the same length as the vector [0]
A solution would be to force your object d into a matrix.
From the helpfile of pvclust:
data numeric data matrix or data frame.
Note that by coercing an object of class dist into a matrix, the lower triangle gets mirrored across the diagonal into a full symmetric matrix; you can inspect the object that will actually be used with the call:
as.matrix(d)
This would be the call you are looking for:
pvclust(as.matrix(d), method.hclust="ward.D2", nboot=1000, method.dist="eucl")
#Bootstrap (r = 0.5)... Done.
#Bootstrap (r = 0.58)... Done.
#Bootstrap (r = 0.69)... Done.
#Bootstrap (r = 0.77)... Done.
#Bootstrap (r = 0.88)... Done.
#Bootstrap (r = 1.0)... Done.
#Bootstrap (r = 1.08)... Done.
#Bootstrap (r = 1.19)... Done.
#Bootstrap (r = 1.27)... Done.
#Bootstrap (r = 1.38)... Done.
#
#Cluster method: ward.D2
#Distance : euclidean
#
#Estimates on edges:
#
# au bp se.au se.bp v c pchi
#1 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#2 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#3 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#4 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#5 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#6 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#7 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#8 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#9 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#10 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#11 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#12 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#13 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#14 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#15 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#16 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#17 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#18 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#19 0.853 0.885 0.022 0.003 -1.126 -0.076 0.058
#20 0.854 0.885 0.022 0.003 -1.128 -0.073 0.069
#21 0.861 0.897 0.022 0.003 -1.176 -0.090 0.082
#22 0.840 0.886 0.024 0.003 -1.100 -0.106 0.060
#23 0.794 0.690 0.023 0.005 -0.658 0.162 0.591
#24 0.828 0.686 0.020 0.005 -0.716 0.232 0.704
#25 1.000 1.000 0.000 0.000 0.000 0.000 0.000
Note that this fixes your call, but the validity of the clustering method and the quality of your data are for you to judge; I took your reproducible example at face value.
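The 'reflection' mentioned above is symmetry: as.matrix() on a dist object mirrors the lower-triangle distances across a zero diagonal into a full square matrix. A quick base-R check on toy data:

```r
set.seed(1)
d <- dist(matrix(rnorm(12), nrow = 4))  # lower-triangle distances between 4 rows
m <- as.matrix(d)                       # full symmetric matrix with zero diagonal
isSymmetric(m)                          # TRUE
all(diag(m) == 0)                       # TRUE
```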

How to filter rows based on certain criteria?

I have an example file as follows:
GENES Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8
g1 0.000 0.000 0.000 0.000 0.010 0.000 0.022 0.344
g2 0.700 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g4 0.322 0.782 0.000 0.023 0.000 0.000 0.000 0.345
g5 0.010 0.000 0.333 0.000 0.000 0.000 0.011 0.000
g6 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000
I need to retrieve the rows (genes) that have two or more samples with values of 0.010 or more, so I should get the resulting column as follows:
GENES
g1
g4
g5
Can anyone help me with this?
Here's one possible way:
DF <- read.table(text=
"GENES Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8
g1 0.000 0.000 0.000 0.000 0.010 0.000 0.022 0.344
g2 0.700 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g4 0.322 0.782 0.000 0.023 0.000 0.000 0.000 0.345
g5 0.010 0.000 0.333 0.000 0.000 0.000 0.011 0.000
g6 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000",header=T,sep=' ')
rows <- sapply(1:nrow(DF),FUN=function(i){sum(DF[i,2:ncol(DF)] >= 0.01) >= 2})
subSet <- DF[rows,]
> subSet
GENES Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8
1 g1 0.000 0.000 0.000 0.000 0.01 0 0.022 0.344
4 g4 0.322 0.782 0.000 0.023 0.00 0 0.000 0.345
5 g5 0.010 0.000 0.333 0.000 0.00 0 0.011 0.000
or similarly this:
subSet <- DF[apply(DF,1,function(x){sum(tail(x,-1) >= 0.01) >= 2}),]
or this:
subSet <- DF[rowSums(DF[,2:ncol(DF)] >= 0.01) >= 2,]
As you can see, there are many ways to accomplish that :)
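A fourth variant uses subset() with the select argument, which returns exactly the GENES column the question asked for; sketched here on a small stand-in for DF:

```r
# Small stand-in for the DF above: only g1 has two values >= 0.01
DF <- data.frame(GENES = c("g1", "g2", "g3"),
                 Samp1 = c(0.020, 0.700, 0.000),
                 Samp2 = c(0.011, 0.000, 0.009))
# keep rows with at least two samples >= 0.01, return only the GENES column
res <- subset(DF, rowSums(DF[-1] >= 0.01) >= 2, select = GENES)
```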
