Get a custom frequency table in R

Suppose I have the following vector.
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
Is there a way to get a frequency table over custom ranges, such as the following?
Frequency between 0 and 0.1
Frequency between 0.2 and 0.4
Frequency between 0.5 and 0.8
Frequency between 0.9 and 1
Thanks

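There are a few extra, unnecessary groups in here, but you can ignore them or subset them out. Note that cut uses right-closed intervals by default, so the value 0 falls outside the first interval and 0.2 is counted in (0.1,0.2].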
table(cut(test, breaks = c(0,0.1,0.2,0.4,0.5,0.8,0.9,1)))

I'm not aware of a dedicated function, but you could write your own:
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
mapply(function (start, end) { sum(test >= start & test <= end) },
c(0, 0.2, 0.5, 0.9), # starts
c(0.1, 0.4, 0.8, 1)) # ends
# [1] 3 11 10 4
The use of mapply is purely to vectorise over the starts and ends that you supply. Note that test is hard-coded into this function and the endpoints are inclusive, so adjust as necessary.
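The note about test being hard-coded can be addressed by passing the data in as an argument. A minimal sketch, where count_between is a hypothetical helper name (not a base R function) and both endpoints are treated as inclusive, as in the answer above:

```r
# Hypothetical helper: count the values of x inside each inclusive [start, end] range
count_between <- function(x, starts, ends) {
  stopifnot(length(starts) == length(ends))
  mapply(function(start, end) sum(x >= start & x <= end), starts, ends)
}

test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,
          0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)

counts <- count_between(test, starts = c(0, 0.2, 0.5, 0.9), ends = c(0.1, 0.4, 0.8, 1))
counts
# [1]  3 11 10  4
```

Because the ranges are given explicitly, values that fall between two ranges (for example 0.15) are simply not counted.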

Something like this maybe:
labs <- c("0 and 0.1", "0.2 and 0.4", "0.5 and 0.8", "0.9 and 1")
table(cut(test, c(0, .2, .5, .9, 1.1), right = FALSE, labels = labs))
##   0 and 0.1 0.2 and 0.4 0.5 and 0.8   0.9 and 1
##           3          11          10           4

Assuming that you really want to bin these as tenths, and there are no missing intervals, findInterval is made for the task.
Here, 1.0 is in a group by itself:
table(findInterval(test, c(0,.2, .5, .9, 1)))
## 1 2 3 4 5
## 3 11 10 3 1
With this statement, 1.0 is in the last interval, with .9:
table(findInterval(test, c(0,.2, .5, .9, 1), rightmost.closed=TRUE))
## 1 2 3 4
## 3 11 10 4
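findInterval returns interval indices rather than labels. One way to attach the labels from the question is to wrap the indices in a factor. This is only a sketch; the label strings are taken from the cut answer above, and idx and freq are illustrative names:

```r
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,
          0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
labs <- c("0 and 0.1", "0.2 and 0.4", "0.5 and 0.8", "0.9 and 1")

# Interval index (1 to 4) for each value; 1.0 is placed in the last interval
idx <- findInterval(test, c(0, .2, .5, .9, 1), rightmost.closed = TRUE)

# Fixing the levels keeps empty intervals in the table with a count of 0
freq <- table(factor(idx, levels = seq_along(labs), labels = labs))
freq
```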

Related

Find edges of intervals in dataframe column and use them for geom_rect xmin-xmax in ggplot

I have a data frame made up of two columns
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
df
   positionx pvalue
1          1  0.100
2          2  0.040
3          3  0.030
4          4  0.020
5          5  0.001
6          6  0.200
7          7  0.500
8          8  0.600
9          9  0.001
10        10  0.002
I would like to find the intervals of positionx in which my pvalue is below a certain threshold, let's say 0.05.
Using 'which' I can find the indices of the rows, and from those I could go back to the values of positionx.
which(df[,2]<0.05)
[1] 2 3 4 5 9 10
However, what I would like are the edges of the intervals, by which I mean a result like: 2-5, 9-10
I also tried to use the findInterval function as below
int <- c(-10, 0.05, 10)
separation <- findInterval(pvalue,int)
separation
[1] 2 1 1 1 1 2 2 2 1 1
df_sep <- data.frame(cbind(df, separation))
df_sep
   positionx pvalue separation
1          1  0.100          2
2          2  0.040          1
3          3  0.030          1
4          4  0.020          1
5          5  0.001          1
6          6  0.200          2
7          7  0.500          2
8          8  0.600          2
9          9  0.001          1
10        10  0.002          1
However I am stuck again with a column of numbers, while I want the edges of the intervals that contain 1 in the separation column.
Is there a way to do that?
This is a simplified example; in reality I have many plots, and for each plot one data frame of this type (just much longer and with pvalues not as easy to judge at a glance).
The reason why I think I need the information of the edges of my intervals, is that I would like to colour the background of my ggplot according to the pvalue. I know I can use geom_rect for it, but I think I need the edges of the intervals in order to build the coloured rectangles.
Is there a way to do this in an automated way instead of manually?
This seems like a great use case for run length encoding.
Example as below:
library(ggplot2)
# Data from question
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
# Sort data (just to be sure)
df <- df[order(df$positionx),]
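# Run length encoding: consecutive rows on the same side of the threshold form one run
threshold <- 0.05
runs <- rle(df$pvalue < threshold)  # named 'runs' so the rle() function is not masked
ends <- cumsum(runs$lengths)        # row index of the last element of each run
starts <- ends - runs$lengths + 1   # row index of the first element of each run
df2 <- data.frame(
  xmin = df$positionx[starts],
  xmax = df$positionx[ends],
  type = runs$values
)
# Keep only the runs that satisfy the threshold criterion
df2 <- df2[df2$type, ]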
ggplot(df2, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1)) +
geom_rect()
Created on 2020-05-22 by the reprex package (v0.3.0)
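The same run length encoding can also produce the "2-5, 9-10" style result asked for in the question. A minimal sketch, assuming positionx is already sorted; the names below and edges are illustrative:

```r
positionx <- 1:10
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
threshold <- 0.05

runs <- rle(pvalue < threshold)
ends <- cumsum(runs$lengths)       # last index of each run
starts <- ends - runs$lengths + 1  # first index of each run
below <- runs$values               # TRUE for runs under the threshold

# One "start-end" string per run of consecutive values below the threshold
edges <- paste(positionx[starts[below]], positionx[ends[below]], sep = "-")
edges
# [1] "2-5"  "9-10"
```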

How to use dplyr::mutate to multiply pairs of columns specified by parts of the variable name

I have the following example:
df <- data.frame(
id = c(1,2,3),
fix_01.2012 = c(2,5,7),
fix_02.2012 = c(5,1,7),
fix_03.2012 = c(6,1,5),
fox_01.2012 = c(0.4, 0.5, 0.7),
fox_02.2012 = c(0.6, 0.5, 0.8),
fox_03.2012 = c(0.7, 0.5, 0.9)
)
  id fix_01.2012 fix_02.2012 fix_03.2012 fox_01.2012 fox_02.2012 fox_03.2012
1  1           2           5           6         0.4         0.6         0.7
2  2           5           1           1         0.5         0.5         0.5
3  3           7           7           5         0.7         0.8         0.9
The table below is what I want to get.
I want to create a new column for each date (e.g. "01.2012"):
res_date = fix_date * fox_date
As I have many dates / pairs of dates, I guess this needs to be done by looping through the names.
  id fix_01.2012 fix_02.2012 fix_03.2012 fox_01.2012 fox_02.2012 fox_03.2012 res_01.2012 res_02.2012 res_03.2012
1  1           2           5           6         0.4         0.6         0.7         0.8         3.0         4.2
2  2           5           1           1         0.5         0.5         0.5         2.5         0.5         0.5
3  3           7           7           5         0.7         0.8         0.9         4.9         5.6         4.5
Can anyone help? Thanks very much in advance!
Here is an idea that uses split.default to split the data frame into groups of columns that share the same date suffix. We then loop over that list and multiply the columns in each group. We use Reduce (rather than i[1]*i[2]) for the multiplication so that it also works with more than two columns per date:
do.call(cbind,
lapply(split.default(df[-1], gsub('.*_', '', names(df[-1]))), function(i) Reduce(`*`, i)))
# 01.2012 02.2012 03.2012
#[1,] 0.8 3.0 4.2
#[2,] 2.5 0.5 0.5
#[3,] 4.9 5.6 4.5
Bind them back to the original with cbind.data.frame()
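The binding step could look like the following sketch. The res_ prefix follows the naming asked for in the question; res and out are illustrative variable names:

```r
df <- data.frame(
  id = c(1,2,3),
  fix_01.2012 = c(2,5,7),
  fix_02.2012 = c(5,1,7),
  fix_03.2012 = c(6,1,5),
  fox_01.2012 = c(0.4, 0.5, 0.7),
  fox_02.2012 = c(0.6, 0.5, 0.8),
  fox_03.2012 = c(0.7, 0.5, 0.9)
)

# Product of all columns sharing a date suffix, as in the answer above
res <- do.call(cbind,
               lapply(split.default(df[-1], gsub('.*_', '', names(df[-1]))),
                      function(i) Reduce(`*`, i)))

# Prefix the date suffixes so the new columns are named res_<date>
colnames(res) <- paste0("res_", colnames(res))
out <- cbind.data.frame(df, res)
names(out)
```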
If you want a tidyverse approach, it will take using a bit of tidy evaluation to get what you want.
library(tidyverse)
df <- data.frame(
id = c(1,2,3),
fix_01.2012 = c(2,5,7),
fix_02.2012 = c(5,1,7),
fix_03.2012 = c(6,1,5),
fox_01.2012 = c(0.4, 0.5, 0.7),
fox_02.2012 = c(0.6, 0.5, 0.8),
fox_03.2012 = c(0.7, 0.5, 0.9)
)
# colnames with "fix"
fix <- names(df)[grepl("fix",names(df))]
# colnames with "fox"
fox <- names(df)[grepl("fox",names(df))]
# Iterate over the two vectors of names and column bind the results (map2_dfc).
# Since these are strings, we need to have them evaluated as symbols
# Creating the column name just requires the string to be evaluated.
map2_dfc(fix, fox, ~transmute(df, !!paste0("res", str_extract(.x, "_(0\\d)")) := !!sym(.x) * !!sym(.y)))
#> res_01 res_02 res_03
#> 1 0.8 3.0 4.2
#> 2 2.5 0.5 0.5
#> 3 4.9 5.6 4.5
Much more verbose than the other answers, but to my eye easier to read/edit/adapt, is a heavy gather-spread approach (the way I'd reason the problem if I was solving it step-by-step):
library(tidyr)
library(dplyr)
df %>%
gather(-id, key=colname, value=value) %>%
separate(colname, c('fixfox', 'date'), sep='_') %>%
spread(key=fixfox, value=value) %>%
mutate(res=fix*fox) %>%
gather(-id, -date, key=colname, value=value) %>%
unite(new_colname, colname, date, sep='_') %>%
spread(key=new_colname, value=value)

Sum data points in the rows from data frame if they meet criteria from another data frame in R

I have two data frames both with 220 obs and 80 variables. The first data frame, df1, has only the data points 1, 2, and 3. The second data frame, df2, has different numeric values consisting of decimals, such as 0.12, -0.03, 0.01 etc. (supposed to portray market cap weighted stock returns for a given month). PS: The length of the original data set is 80.
For example
df1 = data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
a b c d
1 2 3 1 3
2 2 2 1 3
3 1 3 2 1
df2 = data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6), c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))
a b c d
1 0.1 0.3 0.2 0.1
2 0.1 0.4 0.3 0.5
3 0.2 0.6 0.5 0.6
How can I sum the rows of df2, based on the indicator values in df1, to create a third data frame df3 with 220 obs and 3 variables? Note that df1 and df2 have the same column names in the same order. This is the df3 I want:
df3 =
X1 X2 X3
1 0.2 0.1 0.4
2 0.3 0.5 0.5
3 0.8 0.5 0.6
Let's first look at (X1,1). Row 1 in df1 contains only one data point with value 1, which is (c,1). Thus, we take the matching value in row 1 of df2 and get 0.2. Now look at (X1,3) (last value of column X1). Row 3 in df1 has two data points with value 1. In df2 those two values are 0.2 (a,3) and 0.6 (d,3), which sum to 0.8.
Here is an explanation of how df3 is calculated:
calculation = data.frame("1" = c("0+0+0.2+0", "0+0+0.3+0", "0.2+0+0+0.6"), "2" = c("0.1+0+0+0", "0.1+0.4+0+0", "0+0+0.5+0"), "3" = c("0+0.3+0+0.1", "0+0+0+0.5", "0+0.6+0+0"))
X1 X2 X3
1 0 + 0 + 0.2 + 0 0.1 + 0 + 0 + 0 0 + 0.3 + 0 + 0.1
2 0 + 0 + 0.3 + 0 0.1 + 0.4 + 0 + 0 0 + 0 + 0 + 0.5
3 0.2 + 0 + 0 + 0.6 0 + 0 + 0.5 + 0 0 + 0.6 + 0 + 0
More practical explanation based on stocks. Assume df1 is a matrix that describes buy, hold, and sell recommendations. df2 describes the market weighted stock returns. All variables/columns are different stocks. df3 creates a matrix with three different portfolios. If the stock is "buy", I want to put it in a "buy" portfolio. If the stock is "hold", I want to put it in a "hold" portfolio, etc. This is easily done in Excel with nested IF,AND,OR functions, but I do not know how to do it in R.
We could use tapply after converting the datasets to matrices, using the row index of the data and the values of 'df1' as the grouping variables
tapply(as.matrix(df2), list(row(df2), as.matrix(df1)), FUN = sum)
# 1 2 3
#[1,] 0.2 0.1 0.4
#[2,] 0.3 0.5 0.5
#[3,] 0.8 0.5 0.6
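For readers who find the tapply grouping hard to follow, the same sums can be written as one masked row sum per indicator value. This is a sketch of an alternative, not the answer's method; it assumes the indicator values are exactly 1, 2 and 3, and m1, m2 and df3 are illustrative names:

```r
df1 <- data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
df2 <- data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6),
                  c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# For each indicator value k, zero out the entries of df2 whose indicator
# is not k, then sum across each row
df3 <- sapply(1:3, function(k) rowSums(m2 * (m1 == k)))
df3
```

Each column of df3 corresponds to one portfolio (1, 2 or 3) and each row to one observation.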
Or with tidyverse, bind the datasets after gathering the two into 'long' data, and then do a group by sum
library(tidyverse)
gather(df1) %>%
bind_cols(gather(df2)) %>%
group_by(key) %>%
group_by(rn = row_number(), value) %>%
summarise(value1 = sum(value1)) %>%
spread(value, value1) %>%
ungroup %>%
select(-rn)
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 0.2 0.1 0.4
#2 0.3 0.5 0.5
#3 0.8 0.5 0.6
Here is another base R method that uses rowsum to perform group sums and loops through the rows with mapply.
t(mapply(rowsum, as.data.frame(t(df2)), as.data.frame(t(df1))))
[,1] [,2] [,3]
V1 0.2 0.1 0.4
V2 0.3 0.5 0.5
V3 0.8 0.5 0.6
Note that I am using R 3.4.4. As far as I know, t returns a matrix even when it is fed a data.frame, so as.data.frame is what lets mapply loop over the columns (the original rows) rather than over single elements.

Simulation of forecast having lower and upper levels

I am trying to simulate the forecast (rather than taking mean) using this code:
set.seed(100)
df <- data.frame("lower" = c(-1, 0.5, 0), "upper" = c(0, 2, 4), "mean" = c(-0.5, 1.2, 2.5))
df$simulation <- rnorm(1, df$mean, (df$upper - df$lower) / 2 / 1.96)
df
# lower upper mean simulation
#1 -1.0 0 -0.5 -0.6281103
#2 0.5 2 1.2 -0.6281103
#3 0.0 4 2.5 -0.6281103
I get the same value in every row of the simulation column.
However, if I pass the number of rows as the n parameter, I get a result that looks better:
df$simulation <- rnorm(nrow(df), df$mean, (df$upper - df$lower) / 2 / 1.96)
df
# lower upper mean simulation
#1 -1.0 0 -0.5 -0.6281103
#2 0.5 2 1.2 1.2503308
#3 0.0 4 2.5 2.4194724
Is the latter solution the right way of doing this?
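The difference between the two calls comes from how rnorm treats its arguments: n is the number of draws, and mean and sd are recycled to that length. With n = 1 only the first mean and sd are used, and the single result is then recycled down the column. A small sketch of this behaviour follows; sd_est is an illustrative name, and the formula for it (taken from the question) assumes lower and upper are the bounds of a symmetric 95% interval of a normal distribution:

```r
df <- data.frame(lower = c(-1, 0.5, 0), upper = c(0, 2, 4), mean = c(-0.5, 1.2, 2.5))
sd_est <- (df$upper - df$lower) / 2 / 1.96

set.seed(100)
one_draw <- rnorm(1, df$mean, sd_est)          # one value, from mean[1] and sd_est[1] only

set.seed(100)
row_draws <- rnorm(nrow(df), df$mean, sd_est)  # draw i uses mean[i] and sd_est[i]

length(one_draw)   # 1
length(row_draws)  # 3
```

With the same seed, the first element of row_draws equals one_draw, which is why the first row is identical in both outputs above.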

How can I vectorize this task in R?

For a specific task, I have written the following R script:
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))
cut <- unique(pred)
cut_n <- length(cut)
n <- length(pred)
class_1 <- numeric(cut_n)
class_2 <- numeric(cut_n)
curr_cut <- cut[1]
class_1_c <- 0
class_2_c <- 0
j <- 1
for (i in 1:n){
if (curr_cut != pred[i]) {
j <- j + 1
curr_cut <- pred[i]
}
if (grp[i] == levels(grp)[1])
class_1_c <- class_1_c + 1
else
class_2_c <- class_2_c + 1
class_1[j] <- class_1_c
class_2[j] <- class_2_c
}
cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
My goal above was to compute the cumulative number of times the factors in grp appear for each unique value in pred. For example, I get the following output for above:
index: 0.1 0.2 0.3
class1: 2 3 5
class2: 1 2 2
I am a beginner in R and I have a few questions about this:
How can I make this code faster and simpler?
Is it possible to vectorize this and avoid the for loop?
Is there a different "R-esque" way of doing this?
Any help would be greatly appreciated. Thanks!
You can start by getting the counts for each group/pred combination using a table
table(grp, pred)
# pred
# grp 0.1 0.2 0.3
# 1 2 1 2
# 2 1 1 0
Of course this isn't exactly what you wanted. You want cumulative totals, so we can adjust this result by applying a cumulative sum across each row (transposed to better match your data layout)
t(apply(table(grp, pred), 1, cumsum))
# grp 0.1 0.2 0.3
# 1 2 3 5
# 2 1 2 2
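To recover the class_1 and class_2 vectors from the question, the rows of that matrix can be pulled out by group level. A short sketch, with cum as an illustrative name; note that table orders the columns by sorted pred value, whereas the original loop relies on pred already being in order:

```r
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))

# Rows are the levels of grp, columns are the unique values of pred
cum <- t(apply(table(grp, pred), 1, cumsum))

class_1 <- cum[levels(grp)[1], ]
class_2 <- cum[levels(grp)[2], ]

cat("index:", colnames(cum), "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
```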
