Projecting a function between two vectors into a data frame in R

The distance correlation coefficient between two vectors X, Y can be calculated in R with the dcor(X, Y) function of the energy package. I have a data frame Z like this
library(energy)
Z <- data.frame(Z1 = c(0.2, 3, 0.01, 3.4, 6),
                Z2 = c(2.2, 3.5, 0.5, 0.3, 7.0),
                Z3 = c(4.7, 0.003, 1.4, 0, 0.6))
and I want to create the distance correlation matrix for all pairwise combinations of its columns. How can I write a function that does this calculation? The final result should look like this
Z1 Z2 Z3
Z1 1 0.76 0.76
Z2 0.76 1 0.48
Z3 0.76 0.48 1
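For reference, a single pair can be computed directly with dcor(); the value below is the one that appears off-diagonal in the answers that follow (a minimal sketch, not part of the original question):
library(energy)
dcor(Z$Z1, Z$Z2)
# [1] 0.7632896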

It seems that energy::dcor cannot do cross-computation on a single data frame the way stats::cor can. A workaround is to use outer:
outer(Z, Z, Vectorize(energy::dcor))
# Z1 Z2 Z3
# Z1 1.0000000 0.7632896 0.7647835
# Z2 0.7632896 1.0000000 0.4783923
# Z3 0.7647835 0.4783923 1.0000000
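Since the question asks for a function, the outer() call can also be wrapped in a small helper; the name dcor_matrix is only illustrative, not part of the original answer:
library(energy)
# sketch: distance correlation matrix for every pair of columns of a data frame
dcor_matrix <- function(df) {
  outer(df, df, Vectorize(dcor))
}
dcor_matrix(Z)  # same matrix as shown above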

We can use combn to calculate only the needed pairwise combinations and then convert the result to a matrix, so the same calculations are not repeated:
library(energy)
v1 <- combn(Z, 2, FUN = \(x) dcor(x[[1]], x[[2]]))
as.matrix(as.dist(c(0, v1)))[-1, -1]
2 3 4
2 0.0000000 0.7632896 0.7647835
3 0.7632896 0.0000000 0.4783923
4 0.7647835 0.4783923 0.0000000
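To match the layout asked for in the question, the labels can be restored from names(Z) and the diagonal set to 1 (the distance correlation of a column with itself is 1); a small follow-up sketch reusing v1 from above:
m <- as.matrix(as.dist(c(0, v1)))[-1, -1]
dimnames(m) <- list(names(Z), names(Z))
diag(m) <- 1
m
#           Z1        Z2        Z3
# Z1 1.0000000 0.7632896 0.7647835
# Z2 0.7632896 1.0000000 0.4783923
# Z3 0.7647835 0.4783923 1.0000000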

Related

Find edges of intervals in dataframe column and use them for geom_rect xmin-xmax in ggplot

I have a data frame consisting of two columns
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
df
positionx pvalue
1 1 0.100
2 2 0.040
3 3 0.030
4 4 0.020
5 5 0.001
6 6 0.200
7 7 0.500
8 8 0.600
9 9 0.001
10 10 0.002
I would like to find the intervals of positionx values in which my pvalue is below a certain threshold, let's say 0.05.
Using which I can find the indices of the rows, and from those I could go back to the values of positionx.
which(df[,2]<0.05)
[1] 2 3 4 5 9 10
However, what I would like are the edges of the intervals; by that I mean a result like: 2-5, 9-10.
I also tried to use the findInterval function as below
int <- c(-10, 0.05, 10)
separation <- findInterval(pvalue,int)
separation
[1] 2 1 1 1 1 2 2 2 1 1
df_sep <- data.frame(cbind(df, separation))
df_sep
positionx pvalue separation
1 1 0.100 2
2 2 0.040 1
3 3 0.030 1
4 4 0.020 1
5 5 0.001 1
6 6 0.200 2
7 7 0.500 2
8 8 0.600 2
9 9 0.001 1
10 10 0.002 1
However I am stuck again with a column of numbers, while I want the edges of the intervals that contain 1 in the separation column.
Is there a way to do that?
This is a simplified example; in reality I have many plots and, for each plot, a data frame of this type (just much longer and with p-values not as easy to judge at a glance).
The reason I think I need the edges of my intervals is that I would like to colour the background of my ggplot according to the p-value. I know I can use geom_rect for this, but I think I need the edges of the intervals to build the coloured rectangles.
Is there a way to do this automatically instead of manually?
This seems like a great use case for run-length encoding. Example below:
library(ggplot2)
# Data from question
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
# Sort data (just to be sure)
df <- df[order(df$positionx),]
# Do run length encoding magic
threshold <- 0.05
rle <- rle(df$pvalue < threshold)
ends <- cumsum(rle$lengths)        # last index of each run
starts <- ends - rle$lengths + 1   # first index of each run
df2 <- data.frame(
  xmin = df$positionx[starts],
  xmax = df$positionx[ends],
  type = rle$values
)
# Filter on type
df2 <- df2[df2$type == TRUE, ] # keep the runs that satisfied the threshold criterion
ggplot(df2, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1)) +
geom_rect()
Created on 2020-05-22 by the reprex package (v0.3.0)
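To use this for the original goal of colouring the plot background, the rectangles can be layered underneath the p-values themselves. A sketch building on df and df2 from above; the fill colour, transparency and the use of -Inf/Inf to span the full y range are arbitrary choices:
ggplot() +
  geom_rect(data = df2,
            aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf),
            fill = "red", alpha = 0.2) +
  geom_point(data = df, aes(x = positionx, y = pvalue))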

Replace value and previous value if condition is met in R

I have a dataset that looks like this:
df <- matrix(c(0.2, 0.5, 1, 3.1, 0.5, 0.3, 0.1, 4, 0.3, 1.2), nrow=5, ncol=2)
(This is a simplified example)
I would like to write a function or loop that checks whether each value (t) or its previous value (t-1) is bigger than 3, and replaces both t and t-1 with NA if either of them is.
The desired outcome would thus look something like this:
Thanks in advance.
Here is a base R solution which should yield your desired outcome. Note that since there is no "wrap-around" in your desired output, I turned the matrix into a data.frame.
# your data
df <-matrix(c(0.2, 0.5, 1, 3.1, 0.5, 0.3, 0.1, 4, 0.3, 1.2), nrow=5, ncol=2)
# needs to be converted to a data.frame
df <- as.data.frame(df)
# recode function
recode_df <- function(x) {
  x2 <- c(NA, x[-length(x)])   # value at t-1 (previous)
  x3 <- c(x[-1], NA)           # value at t+1 (next)
  replace(x, (x > 3 | x2 > 3 | x3 > 3), NA)
}
# apply recode function on all columns
as.data.frame(lapply(df, recode_df))
#> V1 V2
#> 1 0.2 0.3
#> 2 0.5 NA
#> 3 NA NA
#> 4 NA NA
#> 5 NA 1.2
Created on 2020-05-23 by the reprex package (v0.3.0)
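The same idea can be written with dplyr's lag() and lead() instead of building the shifted vectors by hand; a sketch (not part of the original answer) that yields the same result:
library(dplyr)
# NA out any value whose own, previous or next value exceeds 3, column by column
as.data.frame(df) |>
  mutate(across(everything(),
                ~ replace(.x, .x > 3 | lag(.x) > 3 | lead(.x) > 3, NA)))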

R Conditional standard deviation

I have a large data set and, for each of the other columns, I need the standard deviation of the Main column over the rows where that column is not NA. Here is a sample data set:
df1 <- data.frame(
  Main = c(0.33, 0.57, 0.60, 0.51),
  B = c(NA, NA, 0.09, 0.19),
  C = c(NA, 0.05, 0.07, 0.05),
  D = c(0.23, 0.26, 0.23, 0.26)
)
View(df1)
# Main B C D
# 1 0.33 NA NA 0.23
# 2 0.57 NA 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
Take column B as an example: since rows 1 and 2 are NA, its standard deviation will be sd(df1[3:4, 1]); columns C and D give sd(df1[2:4, 1]) and sd(df1[1:4, 1]), respectively. Therefore, the result will be:
# B C D
# 1 0.06 0.05 0.12
I did the following, but it only returned one number, 0.0636:
df2 <- df1[,-1]!=0
sd(df1[df2,1], na.rm = T)
My data set has many more columns, and I'm wondering if there is a more efficient way to get it done? Many thanks!
Try:
sapply(df1[,-1], function(x) sd(df1[!is.na(x), 1]))
# B C D
# 0.06363961 0.04582576 0.12093387
Another option computes the standard deviation of every column directly; note that this gives the sd of each column itself (with NAs removed), not the conditional sd of Main:
x <- colnames(df1) # all columns you want to calculate the sd of
value <- sapply(seq_along(x), function(i) sd(df1[, x[i], drop = TRUE], na.rm = TRUE))
names(value) <- x
value
#       Main          B          C          D
# 0.12093387 0.07071068 0.01154701 0.01732051
We can get this with colSds from matrixStats
library(matrixStats)
colSds(`dim<-`(df1[,1][NA^is.na(df1[-1])*row(df1[-1])], dim(df1[,-1])), na.rm = TRUE)
#[1] 0.06363961 0.04582576 0.12093387
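That one-liner is compact but dense; broken into steps for readability (a sketch that gives the same result):
library(matrixStats)
# 1. for every non-Main cell, keep its row number if the value is observed, NA otherwise
idx <- NA^is.na(df1[-1]) * row(df1[-1])
# 2. index Main with those row numbers (NA positions stay NA) and reshape to the same dimensions
m <- `dim<-`(df1[, 1][idx], dim(df1[, -1]))
# 3. column-wise sd of Main restricted to the observed rows of each column
colSds(m, na.rm = TRUE)
#[1] 0.06363961 0.04582576 0.12093387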

Combine same-name columns and apply Johansen test in R

I have two data sets (multicolumn data, before and after treatment):
Before treatment
Data1<-read.csv("before.csv")
X1 X2 X3
1 0.21 0.32 0.42
2 0.34 0.23 0.33
3 0.42 0.14 0.11
4 0.35 0.25 0.35
5 0.25 0.41 0.44
After treatment
Data2<-read.csv("after.csv")
X1 X2 X3
1 0.33 0.43 0.7
2 0.28 0.51 0.78
3 0.11 0.78 0.34
4 0.54 0.34 0.34
5 0.42 0.64 0.22
I would like to combine the data by column (i.e. X1 in Data1 with X1 in Data2, X2 in Data1 with X2 in Data2, and so on) and perform the Johansen cointegration test for each pair.
What I tried is this:
library("urca")
x1 <- cbind(Data1$X1, Data2$X1)
Jo1 <- ca.jo(x1, type = "trace", K = 2, ecdet = "none", spec = "longrun")
summary(Jo1)
x2 <- cbind(Data1$X2, Data2$X2)
Jo2 <- ca.jo(x2, type = "trace", K = 2, ecdet = "none", spec = "longrun")
summary(Jo2)
This gives me what I want but I would like to automate the process, i.e. instead of manually combining data, to have all pair-wise combinations.
Based on krishna's answer, but with the loop modified:
for (i in 1:ncol(Data1)) {
  col <- paste0("X", as.character(i))
  data <- cbind(Data1[, col], Data2[, col])
  colnames(data) <- c(paste0("Data1_", col), paste0("Data2_", col)) # add column names
  Jo <- ca.jo(data, type = "trace", K = 2, ecdet = "none", spec = "longrun")
  print(summary(Jo)) # print the summary to the console
}
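If the results need to be kept rather than just printed, the same loop can be written with lapply, collecting the ca.jo objects in a named list. A sketch along the same lines, assuming both data sets share their column names:
library(urca)
common_cols <- intersect(colnames(Data1), colnames(Data2))
jo_list <- lapply(common_cols, function(col) {
  data <- cbind(Data1[, col], Data2[, col])
  colnames(data) <- c(paste0("Data1_", col), paste0("Data2_", col))
  ca.jo(data, type = "trace", K = 2, ecdet = "none", spec = "longrun")
})
names(jo_list) <- common_cols
summary(jo_list$X1)  # inspect the result for the X1 pair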
You can loop through the column names and compute the Johansen cointegration test as follows:
# Create a sample data frame
Data1<- data.frame(X1 = rnorm(10, 0, 1), X2 = rnorm(10, 0, 1), X3 = rnorm(10, 0, 1))
Data2 <-data.frame(X1 = rnorm(10, 0, 1), X2 = rnorm(10, 0, 1), X3 = rnorm(10, 0, 1))
library("urca")
# loop over column indices
# note: ncol(Data1) is a single number, so this loop runs only for the last column;
# the modified answer above uses 1:ncol(Data1) to visit every column
for (i in ncol(Data1)) {
  col <- paste0("X", as.character(i))       # build the column name
  data <- cbind(Data1[, col], Data2[, col]) # all rows of column col from Data1 and Data2
  # Your method for finding ca.jo ...
  Jo <- ca.jo(data, type = "trace", K = 2, ecdet = "none", spec = "longrun")
  summary(Jo)
}
You can also use colnames for looping as:
for (col in colnames(Data1)) {
  print(col)
  data <- cbind(Data1[, col], Data2[, col])
  print(data)
  # Jo <- ca.jo(data, type = "trace", K = 2, ecdet = "none", spec = "longrun")
  # summary(Jo)
}
Hope this will help you.

Returns to prices in R

Here is sample data:
set.seed(13)
x1 <- runif(10, -0.05, 0.05)
x2 <- runif(10, -0.05, 0.05)
x3 <- runif(10, -0.05, 0.05)
x4 <- runif(10, -0.05, 0.05)
df <- as.data.frame(cbind(x1, x2, x3, x4))
Let's say these are returns and I would like to convert them to prices with a starting value of 100. There is an answer for turning one vector of returns into prices here: How to convert returns to prices? I have tried the following:
index <- as.data.frame(Reduce(function(x,y) {x * exp(y)}, df, init=100, accumulate=T))
but that won't work for a data frame. I also tried the apply function, but couldn't get anything reasonable out of it.
Expand the answer to your data frame by running it column-wise.
index <- sapply(colnames(df), function(col) {
  Reduce(function(x, y) x * exp(y),
         df[[col]], init = 100, accumulate = TRUE)
})
index
#x1 x2 x3 x4
#[1,] 100.00000 100.0000 100.00000 100.00000
#[2,] 102.12550 101.6243 96.43574 99.23404
#[3,] 99.56554 105.5431 96.88956 98.29784
#[4,] 98.47272 109.7467 98.62877 102.50103
#[5,] 94.53007 110.4766 98.90613 105.71522
#[6,] 99.00045 111.5149 94.90222 106.13989
#[7,] 94.27516 110.0142 96.04782 102.05241
#[8,] 94.97819 108.4567 91.65382 101.58857
#[9,] 97.52289 109.4531 91.30083 97.13752
#[10,] 101.23305 113.5271 89.76203 99.68356
#[11,] 96.69209 115.5952 90.96857 95.62000
Use cumsum, which works column-wise on data frames.
R> index <- exp(cumsum(df)) * 100
x1 x2 x3 x4
1 102.12550 101.6243 96.43574 99.23404
2 99.56554 105.5431 96.88956 98.29784
3 98.47272 109.7467 98.62877 102.50103
4 94.53007 110.4766 98.90613 105.71522
5 99.00045 111.5149 94.90222 106.13989
6 94.27516 110.0142 96.04782 102.05241
7 94.97819 108.4567 91.65382 101.58857
8 97.52289 109.4531 91.30083 97.13752
9 101.23305 113.5271 89.76203 99.68356
10 96.69209 115.5952 90.96857 95.62000
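The two answers agree: the cumsum version is the Reduce/sapply version without the initial row of 100s. A quick check, as a sketch with the intermediate objects named explicitly:
prices_reduce <- sapply(colnames(df), function(col) {
  Reduce(function(x, y) x * exp(y), df[[col]], init = 100, accumulate = TRUE)
})
prices_cumsum <- exp(cumsum(df)) * 100
all.equal(as.matrix(prices_cumsum), prices_reduce[-1, ], check.attributes = FALSE)
# [1] TRUE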
