I am trying to figure out how to calculate the average, median and standard deviation for each value of each variable. Here is some of the data (thanks to @Barranka for providing the data in an easy-to-copy format):
df <- data.frame(
gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1),
NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40),
NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30),
NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10),
sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15),
lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33),
PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5),
NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1),
Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333),
AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437),
AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908),
Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)
To save the results to CSV later, I need to turn them into data frames that look like this:
Table for gama
Value Average Median Standard Deviation
10 (Avg of 10) (median of 10) (Stdev of 10)
1 (Avg of 1) (median of 1) (Stdev of 1)
0.1 (Avg of 0.1) (median of 0.1) (Stdev of 0.1)
Table for theta
Value Average Median Standard Deviation
1 (Avg of 1) (median of 1) (Stdev of 1)
0.65 (Avg of 0.65) (median of 0.65) (Stdev of 0.65)
Table for detectl
Value Average Median Standard Deviation
3 (Avg of 3) (median of 3) (Stdev of 3)
5 (Avg of 5) (median of 5) (Stdev of 5)
...
The columns to be used as ID's are:
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
Summary statistics should be computed over the following columns:
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
I have tried using the data.table package, but I cannot figure out how to develop an approach with it without renaming values one by one; also, when pursuing this approach, my code gets very complicated.
Clever use of melt() and tapply() can help you. I made the following assumptions:
You have to get the mean, median and standard deviation of the last four columns (Acc, AvgWin, AvgLoss, Return)
You need to group the data for each of the first ten columns (gama, theta, ..., NegTrades)
For reproducibility, here's the input:
# Your example data
df <- data.frame(
gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1),
NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40),
NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30),
NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10),
sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15),
lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33),
PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5),
NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1),
Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333),
AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437),
AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908),
Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)
And here's my proposed solution:
library(reshape)
md <- melt(df, id=colnames(df)[1:10]) # This will create one row for each
# 'id' combination, and will store
# the rest of the column headers
# in the `variable` column, and
# each value corresponding to the
# variable. Like this:
head(md)
##   gama theta detectl NSMOOTH NREF NOBS sma lma PosTrades NegTrades variable    value
## 1   10  1.00       3      10   50   10  15  33        11         2      Acc 0.846154
## 2    1  1.00       5       5   80   40  15  33         7         2      Acc 0.777778
## 3    1  1.00       1      20   80   40  15  33         6         1      Acc 0.857143
## 4    1  1.00       1      20   50   20  15  33         3         0      Acc 1.000000
## 5    1  0.65       5       5   80   20  15  33         9         1      Acc 0.900000
## 6    1  1.00       3      20   50   20  15  33         3         0      Acc 1.000000
results <- list()                       # Prepare the results list
for(i in unique(md$variable)) {         # For each variable you have...
  results[[i]] <- list()                # ... create a new list to hold the 'summary'
  tmp_data <- subset(md, variable==i)   # Filter the data you'll use
  for(j in colnames(tmp_data)[1:10]) {  # For each variable, use tapply()
                                        # to get what you need, and
                                        # store it into a data frame
                                        # inside the results
    results[[i]][[j]] <- as.data.frame(
      t(
        rbind(
          tapply(tmp_data$value, tmp_data[,j], mean),
          tapply(tmp_data$value, tmp_data[,j], median),
          tapply(tmp_data$value, tmp_data[,j], sd))
      )
    )
    colnames(results[[i]][[j]]) <- c('average', 'median', 'sd')
  }
  rm(tmp_data)                          # You'll no longer need this
}
Now what? Check out the summary for results:
summary(results)
## Length Class Mode
## Acc 10 -none- list
## AvgWin 10 -none- list
## AvgLoss 10 -none- list
## Return 10 -none- list
You have a list for each variable. Now, if you check out the summary for any results "sublist", you'll see this:
summary(results$Acc)
## Length Class Mode
## gama 3 data.frame list
## theta 3 data.frame list
## detectl 3 data.frame list
## NSMOOTH 3 data.frame list
## NREF 3 data.frame list
## NOBS 3 data.frame list
## sma 3 data.frame list
## lma 3 data.frame list
## PosTrades 3 data.frame list
## NegTrades 3 data.frame list
See what happens when you peek into the results$Acc$gama list:
results$Acc$gama
## average median sd
## 0.1 0.5454550 0.545455 NA
## 1 0.9069842 0.900000 0.09556548
## 10 0.8455433 0.846154 0.01191674
So, for each variable and each "id" column, you have the data summary you want.
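Since the question asks for something that can be written to CSV, here is a minimal base R sketch of how per-id summaries could be stacked into one flat data frame. It is not part of the answer above: `summarise_by`, `df_small`, `flat` and the CSV file name are hypothetical names introduced for illustration, and only three of the original columns are used to keep it short.

```r
# Reduced copy of the question's data: two id columns and one value column
df_small <- data.frame(
  gama  = c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
  theta = c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
  Acc   = c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333)
)

# Hypothetical helper: summarise one value column for each level of one id column
summarise_by <- function(id, data, val) {
  groups <- split(data[[val]], data[[id]])
  data.frame(
    id      = id,
    value   = names(groups),
    average = vapply(groups, mean,   numeric(1), na.rm = TRUE),
    median  = vapply(groups, median, numeric(1), na.rm = TRUE),
    sd      = vapply(groups, sd,     numeric(1), na.rm = TRUE),
    row.names = NULL
  )
}

# One block of rows per id column, stacked into a single data frame
flat <- do.call(rbind, lapply(c("gama", "theta"), summarise_by,
                              data = df_small, val = "Acc"))
print(flat)

# write.csv(flat, "summary_Acc.csv", row.names = FALSE)  # placeholder file name
```

Groups with a single observation (such as gama = 0.1) get NA for the standard deviation, as in the tapply() output above.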
Hope this helps.
I have an approach involving data.table.
EDIT: I tried to submit an edit to the question, but I took some liberties so it'll probably get rejected. I made assumptions about which columns were to be used as "id" columns (columns whose values subset data), and which should be "measure" columns (columns whose values are used to calculate the summary statistics). See here for these designations:
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
Setup
# Load the package and convert to data.table
library(data.table)
df <- data.table(df)
# Helper function to convert a string to a call
# useful in a data.table j
s2c <- function (x, type = "list"){
as.call(lapply(c(type, x), as.symbol))
}
# Function to compute the desired summary stats
smry <- function(x) list(Average=mean(x, na.rm=TRUE), Median=median(x, na.rm=TRUE), StandardDeviation=sd(x, na.rm=TRUE))
# Define some names to use later
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
usenames <- paste(rep(c("Average","Median","StdDev"),each=length(vals)), vals,sep="_")
Calculations in data.table
# Compute the summary statistics
df2 <- df[,j={
for(i in 1:length(ids)){ # loop through each id
t.id <- ids[i]
t.out <- .SD[,j={
t.vals <- .SD[,eval(s2c(vals))] # this line returns a data.table with each vals as a column
sapply(t.vals, smry) # apply summary statistics
},by=t.id] # this by= loops through each value of the current id (t.id)
setnames(t.out, c("id.val", usenames)) # fix the names of the data.table to be returned for this i
t.out <- cbind(id=t.id, t.out) # add a column indicating the variable name (t.id)
if(i==1){big.out <- t.out}else{big.out <- rbind(big.out, t.out)} # accumulate the output data.table
}
big.out
}]
Formatting
df2 <- data.table::melt(df2, id.vars=c("id","id.val")) # melt into "long" format
df2[,c("val","metric"):=list(gsub(".*_","",variable),gsub("_.*","",variable))] # splice names to create id's
df2[,variable:=NULL] # delete old column that had the names we just split up
df2 <- data.table::dcast(df2, id+id.val+val~metric) # go a bit wider, so stats in diff columns
# reshape2:::acast(df2, id+id.val~metric~val) # maybe replace the above line with this
Result
id id.val val Average Median StdDev
1: NOBS 10 Acc 3.214550 0.01191674 0.006052701
2: NOBS 10 AvgLoss 1.000000 0.06300610 1.409930000
3: NOBS 10 AvgWin 1.333333 0.06100090 1.447786667
4: NOBS 10 NegTrades 6.000000 0.84615400 -0.019449800
5: NOBS 10 PosTrades 7.333333 0.84554333 -0.021165467
---
128: theta 1 AvgLoss 1.000000 0.06897450 1.447160000
129: theta 1 AvgWin 1.571429 0.08320849 1.455691429
130: theta 1 NegTrades 6.000000 0.84615400 -0.017465300
131: theta 1 PosTrades 5.857143 0.83712329 -0.017420860
132: theta 1 Return 1.718249 0.03285638 0.068957635
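For comparison, the same kind of table can be sketched with two melt() calls followed by a grouped summary. This is an alternative formulation, not the code used to produce the result above; the names `dt`, `long` and `stats_dt` are made up for the example, and only two id columns and two value columns are included.

```r
library(data.table)

# Reduced copy of the question's data
dt <- data.table(
  gama   = c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
  theta  = c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
  Acc    = c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333),
  Return = c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)
ids  <- c("gama", "theta")
vals <- c("Acc", "Return")

# First melt: one row per original row and value column
long <- melt(dt, id.vars = ids, measure.vars = vals,
             variable.name = "val", value.name = "x")

# Second melt: one row per original row, value column and id column
long <- melt(long, id.vars = c("val", "x"), measure.vars = ids,
             variable.name = "id", value.name = "id.val")

# Summary statistics for every (id column, id value, value column) group
stats_dt <- long[, .(Average = mean(x, na.rm = TRUE),
                     Median  = median(x, na.rm = TRUE),
                     StdDev  = sd(x, na.rm = TRUE)),
                 by = .(id, id.val, val)]
print(stats_dt)
```

Because the statistics are named directly in j, no renaming or name-splitting step is needed afterwards.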
Let's say I've got a dataframe with multiple columns, some of which I want to transform. The column names define what transformation needs to be used.
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100, B = runif(n = 100, 0, 1), log10 = runif(n = 100, 10, 100), log2 = runif(n = 100, 10, 100), log1p = runif(n = 100, 10, 100), sqrt = runif(n = 100, 10, 100))
trans <- list()
trans$log10 <- log10
trans$log2 <- log2
trans$log1p <- log1p
trans$sqrt <- sqrt
Ideally, I would like to use an across call where the column names were matched up with the trans function names and the transformations would be performed on the fly.
The desired output is the following:
df_trans <- df %>%
dplyr::mutate(log10 = trans$log10(log10),
log2 = trans$log2(log2),
log1p = trans$log1p(log1p),
sqrt = trans$sqrt(sqrt))
df_trans
However, I don't want to manually specify each transformation separately. In the representative example I only have 4 but this number could vary and be significantly higher making manual specification cumbersome and error prone.
I have managed to match up the column names with the functions by turning the trans list into a data frame and left-joining but am then unable to call the function in the trans_function column.
trans_df <- enframe(trans, value = "trans_function")
df %>%
pivot_longer(cols = everything()) %>%
left_join(trans_df) %>%
dplyr::mutate(value = trans_function(value))
Error: Problem with mutate() column value.
i value = trans_function(value).
x could not find function "trans_function"
I think I either need to find a way of calling the functions from the list columns or another way of matching up the function names with the column names. All ideas are welcome.
We can use cur_column() in across to get the column name and use it to subset trans.
library(dplyr)
df %>%
mutate(across(all_of(names(trans)), ~trans[[cur_column()]](.x))) %>%
head
# A B log10 log2 log1p sqrt
#1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
Comparing it with output of df_trans.
head(df_trans)
# A B log10 log2 log1p sqrt
#1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
One way can be to use lapply:
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100, B = runif(n = 100, 0, 1), log10 = runif(n = 100, 10, 100), log2 = runif(n = 100, 10, 100), log1p = runif(n = 100, 10, 100), sqrt = runif(n = 100, 10, 100))
trans <- list()
trans$log10 <- log10
trans$log2 <- log2
trans$log1p <- log1p
trans$sqrt <- sqrt
df_trans <- setNames(
  lapply(names(df), function(x) {
    if (x %in% names(trans)) trans[[x]](df[, x]) else df[, x]
  }),
  names(df)
) %>%
  bind_cols() %>%
  as.data.frame()
head(df_trans)
which gives:
A B log10 log2 log1p sqrt
1 1 0.1365052 1.739051 6.301896 4.530600 4.318942
2 2 0.1771364 1.549601 5.793220 4.521715 3.649834
3 3 0.5195605 1.902438 4.819125 3.343266 6.788565
4 4 0.8111208 1.572253 6.219991 4.075945 3.322401
5 5 0.1153620 1.751276 6.306097 4.060292 7.817301
6 6 0.8934218 1.724403 6.201123 3.235938 9.749128
The original dataframe being:
head(df)
A B log10 log2 log1p sqrt
1 1 0.1365052 54.83409 78.89684 91.81428 18.65326
2 2 0.1771364 35.44878 55.45401 90.99323 13.32129
3 3 0.5195605 79.88006 28.22936 27.31143 46.08461
4 4 0.8111208 37.34675 74.54249 57.90612 11.03835
5 5 0.1153620 56.39961 79.12693 56.99123 61.11019
6 6 0.8934218 53.01557 73.57393 24.43022 95.04549
In base R, we may use Map
df[names(trans)] <- Map(function(x, y) x(y), trans, df[names(trans)])
Checking:
> identical(df, df_trans)
[1] TRUE
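If the function list and the data frame do not line up exactly, the Map() call can be restricted to the names they share. The following is a small sketch under that assumption, using toy data (`toy_df`, `toy_trans` and `shared` are illustrative names, not from the question):

```r
# Toy data: 'exp' has no matching column, and 'A' has no matching function
toy_df    <- data.frame(A = 1:3, log10 = c(10, 100, 1000), sqrt = c(4, 9, 16))
toy_trans <- list(log10 = log10, sqrt = sqrt, exp = exp)

# Only transform columns that have a function of the same name
shared <- intersect(names(toy_trans), names(toy_df))
toy_df[shared] <- Map(function(f, col) f(col), toy_trans[shared], toy_df[shared])

print(toy_df)
```

Indexing both lists by `shared` keeps functions and columns in the same order, so each function is applied to the column that carries its name.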
Another possibility is the following:
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100, B = runif(n = 100, 0, 1), log10 = runif(n = 100, 10, 100), log2 = runif(n = 100, 10, 100), log1p = runif(n = 100, 10, 100), sqrt = runif(n = 100, 10, 100))
df %>%
mutate(across(-(A:B), ~ getFunction(cur_column())(.x))) %>% head
#> A B log10 log2 log1p sqrt
#> 1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#> 2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#> 3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#> 4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#> 5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#> 6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
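The same name-based lookup can be sketched in base R with match.fun(), which finds a function from its name. This only works when every selected column is named after an existing function; `toy_df` and `cols` are illustrative names.

```r
toy_df <- data.frame(A = 1:3, log10 = c(10, 100, 1000), sqrt = c(4, 9, 16))

# Columns to transform: everything except the untouched column A
cols <- setdiff(names(toy_df), "A")

# Look up the function named like each column and apply it to that column
toy_df[cols] <- lapply(cols, function(nm) match.fun(nm)(toy_df[[nm]]))

print(toy_df)
```

Unlike the trans list, this gives no control over which function is used for a column, so a typo in a column name results in an error from match.fun().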
I am not totally sure how to describe my problem, so I might just need help to find the right keywords to search for.
Here are some dummy data that resembles my own data, there are x and y co-ordinates and a z value:
require(data.table)
example <- data.table(x = c(-3, -4, -2, -1, -1, 0, 0, 0, 1, 4, 4, 5),
y = c(2, -2, -2, -3, -0, 3, 4, 4, -1, 4, 4, 4),
z = c(10, 10, 20, 30, 40, 40, 50, 70, 70, 80, 90, 90))
There are some duplicate co-ordinates in there, e.g. at (4,4) so the next step is to average the z values for the duplicate points:
example <- as.data.table(aggregate(z ~ x + y, data = example, FUN = "mean"))
Next, I would like to add z = 0 values to all of the coordinates that I don't have data for, e.g. (x = 0, y = 0), (x = 1, y = 1) etc. for the range -5:5 in both x and y axes.
How do I go about this?
To clarify: I have z values for specific x and y coordinates, I'd like to create a data table (or matrix) which has all x,y coordinates from -5,-5 to 5,5 with z = 0 except for the specific z values I already have.
Thanks!
Maybe this is what you are looking for.
example[, .(z=mean(z)), by=.(x, y)][CJ(x=-5:5, y=-5:5), on=c("x", "y")][is.na(z), z:=0][]
x y z
1: -5 -5 0
2: -5 -4 0
3: -5 -3 0
4: -5 -2 0
5: -5 -1 0
---
117: 5 1 0
118: 5 2 0
119: 5 3 0
120: 5 4 90
121: 5 5 0
Here, example[, .(z=mean(z)), by=.(x, y)] performs the data.table equivalent of your aggregate call. The second link in the chain, [CJ(x=-5:5, y=-5:5), on=c("x", "y")], joins that result to the Cartesian product of -5:5 with itself, CJ(x=-5:5, y=-5:5), which has 11^2 = 121 rows. The join fills in NA for x, y combinations not present in the aggregated data, so the third link, [is.na(z), z:=0], sets those NA values of z to 0. The trailing [] prints the output.
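The same steps can be sketched in base R with expand.grid() and merge(), in case data.table is not a requirement. The data here is a shortened version of the example, and `pts`, `agg`, `grid`, `full` and `zmat` are names chosen for this sketch.

```r
# Shortened example data with one duplicated coordinate, (4, 4)
pts <- data.frame(x = c(-3, 4, 4, 5),
                  y = c( 2, 4, 4, 4),
                  z = c(10, 80, 90, 90))

# Average z over duplicated coordinates
agg <- aggregate(z ~ x + y, data = pts, FUN = mean)

# All coordinates from (-5, -5) to (5, 5)
grid <- expand.grid(x = -5:5, y = -5:5)

# Left join: keep every grid point, NA where there is no data
full <- merge(grid, agg, by = c("x", "y"), all.x = TRUE)
full$z[is.na(full$z)] <- 0

# Optional matrix view with x as rows and y as columns
zmat <- xtabs(z ~ x + y, data = full)
```

With all.x = TRUE the merge keeps all 121 grid points, which plays the role of joining on CJ() in the data.table version.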
I'm trying to extract the matrices from the markovchainListFit but am unable to.
library(markovchain)
mat <- data.frame(A = c(rep(0, 10)),
B = c(40 ,37, 35 ,30, 27, 21, 15, 16, 21, 19),
C = c(10, 15, 20, 23, 44, 34, 47, 22, 37, 29),
D = c(1, 2, 3, 5, 9, 21, 8, 12, 17, 12))
mat$A <- apply(mat, 1, function(x) 100 - sum(x))
# Build sequence from mat
tseq <- apply(t(mat), 2, function(x) rep(row.names(t(mat)), x))
# Fit Markov Matrices to sequences
mcListFit <- markovchainListFit(data = tseq)
What I've tried:
> mcListFit$estimate[[1]]
Unnamed Markov chain
A 4 - dimensional discrete Markov Chain defined by the following states:
A, B, C, D
The transition matrix (by rows) is defined as follows:
A B C D
A 0.9387755 0.06122449 0.00 0.0
B 0.0000000 0.85000000 0.15 0.0
C 0.0000000 0.00000000 0.90 0.1
D 0.0000000 0.00000000 0.00 1.0
> as.matrix(mcListFit$estimate[[1]])
Error in as.vector(data) :
no method for coercing this S4 class to a vector
> as.matrix(unlist(mcListFit$estimate[[1]]))
Error in as.vector(data) :
no method for coercing this S4 class to a vector
But I'm still not able to extract any of the matrices. How would I go about doing this?
This code could help:
#allocate a generic list
matrixList<-list()
#sequentially fill the list with the matrices
#using dim method to get the length of the estimates
for (i in 1:dim(mcListFit$estimate)) {
  # the transition matrix is stored in the S4 slot 'transitionMatrix'
  myMatr <- mcListFit$estimate[[i]]@transitionMatrix
  matrixList[[i]] <- myMatr
}
matrixList
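The key point is that the fitted chains are S4 objects, so their contents are reached with the @ operator (or slot()) rather than with as.matrix(). The sketch below illustrates this with a stand-in class, so that it runs without the markovchain package; `ToyChain` and `toy_list` are hypothetical and only mimic the relevant slot.

```r
library(methods)

# Stand-in S4 class with a slot named like the one used above
setClass("ToyChain",
         representation(states = "character", transitionMatrix = "matrix"))

states <- c("A", "B")
toy_list <- list(
  new("ToyChain", states = states,
      transitionMatrix = matrix(c(0.9, 0.1, 0, 1), nrow = 2, byrow = TRUE,
                                dimnames = list(states, states))),
  new("ToyChain", states = states,
      transitionMatrix = matrix(c(0.5, 0.5, 0, 1), nrow = 2, byrow = TRUE,
                                dimnames = list(states, states)))
)

# Extract the matrix slot from every object in the list
mats <- lapply(toy_list, function(ch) ch@transitionMatrix)

# slot() is the functional equivalent of @
same <- identical(mats[[1]], slot(toy_list[[1]], "transitionMatrix"))
```

Each element of `mats` is an ordinary matrix, so the usual matrix functions apply to it.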
Due to the nature of my specification, the results of my regression coefficients provide the slope (change in yield) between two points; therefore, I would like to plot these coefficients using the slope of a line between these two points with the first point (0, -0.7620) as the intercept. Please note this is a programming question; not a statistics question.
I'm not entirely sure how to implement this in base graphics or ggplot and would appreciate any help. Here is some sample data.
Sample Data:
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29), y = c(-0.762,-0.000434, 0.00158, 0.0000822, -0.00294, 0.00246, -0.000521, -0.00009287, -0.01035) )
Output:
x y
1 0 -7.620e-01
2 5 -4.340e-04
3 8 1.580e-03
4 10 8.220e-05
5 12 -2.940e-03
6 15 2.460e-03
7 20 -5.210e-04
8 25 -9.287e-05
9 29 -1.035e-02
Example:
You can use cumsum, the cumulative sum, to calculate intermediate values
df <- data.frame(x=c(0, 5, 8, 10, 12, 15, 20, 25, 29), y=cumsum(c(-0.762, -0.000434, 0.00158, 0.0000822, -0.00294, 0.00246, -0.000521, -0.00009287, -0.01035)))
plot(df$x,df$y)
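To make the effect of cumsum() concrete, here is a small sketch on the first three values of the sample data: each element of the result is the sum of all coefficients up to that point. The names `coefs`, `running` and `x3` are illustrative, and type = "b" (points joined by lines) is a choice made for this sketch rather than something the answer specifies.

```r
coefs <- c(-0.762, -0.000434, 0.00158)   # first three y values from the sample data
x3    <- c(0, 5, 8)                      # matching x values

# running[1] = coefs[1]; running[2] = coefs[1] + coefs[2]; and so on
running <- cumsum(coefs)
print(running)

# Draw the accumulated values as points connected by line segments
plot(x3, running, type = "b", xlab = "x", ylab = "cumulative y")
```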
I have a variable, e.g. c(0, 8, 7, 15, 85, 12, 46, 12, 10, 15, 15).
How can I calculate a mean value out of random maximal values in R?
For example, I would like to calculate the mean of three maximal values.
First step: You draw a sample of 3 from your data and store it in x
Second step: You calculate the mean of the sample
try
dat <- c(0,8,7,15, 85, 12, 46, 12, 10, 15,15)
x <- sample(dat,3)
x
mean(x)
possible output:
> x <- sample(dat,3)
> x
[1] 85 15 0
> mean(x)
[1] 33.33333
If you mean the three highest values, just sort your vector and subset:
> mean(sort(c(0,8,7,15, 85, 12, 46, 12, 10, 15,15), decreasing=TRUE)[1:3])
[1] 48.66667
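Both readings of the question can be wrapped up in a short sketch. `top_n_mean` is a hypothetical helper name; head() is used instead of [1:3] so that a vector shorter than n does not introduce NA values.

```r
dat <- c(0, 8, 7, 15, 85, 12, 46, 12, 10, 15, 15)

# Mean of the n largest values
top_n_mean <- function(v, n = 3) mean(head(sort(v, decreasing = TRUE), n))
top3 <- top_n_mean(dat)      # mean of 85, 46 and 15
print(top3)

# Mean of n randomly drawn values; the result changes from run to run
rand3 <- mean(sample(dat, 3))
print(rand3)
```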