I tried to solve this problem with existing questions here on SO, but I could not find a satisfying answer. My data frame has the following structure:
X = data_frame(
  treat = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
  id = 1:16,
  x = rnorm(16),
  y = rnorm(16),
  z = rnorm(16)
)
It looks like this:
# A tibble: 16 x 5
treat id x y z
<dbl> <int> <dbl> <dbl> <dbl>
1 1 1 -0.0724 1.26 0.317
2 1 2 -0.486 -0.628 0.392
3 1 3 -0.406 -0.706 1.18
4 1 4 -1.35 -1.27 2.36
5 2 5 -0.0751 -0.0394 0.568
6 2 6 0.243 0.873 0.132
7 2 7 0.138 0.611 -0.700
8 2 8 -0.732 1.02 -0.811
9 3 9 -0.0278 1.78 0.568
10 3 10 0.526 1.18 1.03
11 3 11 1.43 0.0937 -0.0825
12 3 12 -0.299 -0.117 0.367
13 4 13 1.05 2.04 0.678
14 4 14 -1.93 0.201 0.250
15 4 15 0.624 1.09 0.852
16 4 16 0.502 0.119 -0.843
Every fourth row in treat is a control, and now I want to calculate the difference in x, y and z between each treated row and its control. For example, for the first row of the first treatment I would like to calculate
-0.0724 - (-1.35) # x
1.26 - (-1.27) # y
0.317 - 2.36 # z
For the second row accordingly:
-0.486 - (-1.35) # x
-0.628 - (-1.27) # y
0.392 - 2.36 # z
... and so on.
I would like to use a dplyr / tidyverse solution, but I have no idea how to do this in a "smooth" way. I already found a solution using joins, but it seems rather tedious compared to the "smooth" solutions dplyr usually offers.
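For reference, the join-based approach looks roughly like the following sketch (the controls object and the helper columns cx, cy, cz are just illustrative names):
library(dplyr)
# pull out the control row (the 4th row of each treatment) ...
controls <- X %>%
  group_by(treat) %>%
  slice(4) %>%
  select(treat, cx = x, cy = y, cz = z)
# ... then join it back and subtract it from every row
X %>%
  left_join(controls, by = "treat") %>%
  mutate(x = x - cx, y = y - cy, z = z - cz) %>%
  select(-cx, -cy, -cz)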
With dplyr, we can group_by treat, use mutate_at to select the specific columns (x:z), and subtract the 4th value from each value using the nth function.
library(dplyr)
X %>%
group_by(treat) %>%
mutate_at(vars(x:z), funs(. - nth(., 4)))
#treat id x y z
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1 -0.631 0.971 0.206
# 2 1 2 -0.301 -1.49 0.189
# 3 1 3 1.49 1.17 0.133
# 4 1 4 0 0 0
# 5 2 5 1.39 -0.339 0.934
# 6 2 6 2.98 0.511 0.319
# 7 2 7 1.73 -0.297 0.0745
# 8 2 8 0 0 0
# 9 3 9 -1.05 -0.778 -2.86
#10 3 10 -0.805 -1.84 -2.38
#11 3 11 0.864 0.684 -3.43
#12 3 12 0 0 0
#13 4 13 -1.39 -0.843 1.67
#14 4 14 -1.68 1.55 -0.656
#15 4 15 -2.34 0.722 0.0638
#16 4 16 0 0 0
This can also be written as
X %>%
group_by(treat) %>%
mutate_at(vars(x:z), funs(. - .[4]))
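Note that mutate_at() and funs() have since been deprecated. Assuming dplyr >= 1.0, the same logic can be sketched with across():
library(dplyr)
X %>%
  group_by(treat) %>%
  mutate(across(x:z, ~ .x - .x[4])) %>%
  ungroup()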
Data
set.seed(123)
X = data_frame(
  treat = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
  id = 1:16,
  x = rnorm(16),
  y = rnorm(16),
  z = rnorm(16)
)
Related
Suppose we have the following data:
tib <- tibble::tibble(x = 1:10)
Then, suppose we want to make a function that takes a column name as input and returns a tibble with several added columns, such as:
library(dplyr)
generate_transformations <- function(data, column){
transform <- sym(column)
data %>%
mutate(
sqrt = sqrt(!!transform),
recip = 1 / !!transform,
log = log(!!transform)
)
}
# Usage is great:
tib %>%
generate_transformations('x')
# A tibble: 10 x 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
Now my question is: is there a way to avoid unquoting (!!) transform repeatedly?
Yes, I could, e.g., temporarily rename the column and then rename it back when I am done, but that is not what I am interested in here.
I am interested if there is a way to produce a variable that does not need the !!.
While it does not work, I was looking for something like:
generate_transformations <- function(data, column){
transform <- !!sym(column) # cannot unquote here :(
data %>%
mutate(
sqrt = sqrt(transform),
recip = 1 / transform,
log = log(transform)
)
}
Convert the column name to a string, subset that column from the data, and use transform directly:
generate_transformations <- function(data, column){
transform <- data[[rlang::as_string(ensym(column))]]
data %>%
mutate(
sqrt = sqrt(transform),
recip = 1 / transform,
log = log(transform)
)
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
Or create a temporary column and remove it later
generate_transformations <- function(data, column){
data %>%
mutate(transform = !! rlang::ensym(column),
sqrt = sqrt(transform),
recip = 1 / transform,
log = log(transform),
transform = NULL
)
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
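Another option worth noting, as a sketch: subset with the .data pronoun (available since dplyr 0.7 / rlang), which needs no unquoting at all, although now the pronoun is what gets repeated:
library(dplyr)
generate_transformations <- function(data, column){
  data %>%
    mutate(
      sqrt = sqrt(.data[[column]]),
      recip = 1 / .data[[column]],
      log = log(.data[[column]])
    )
}
tib %>%
  generate_transformations('x')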
You can do it in one step if you swap !! for {{ }} and use across():
data_transformations <- function(d, col, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
d %>% mutate(across({{col}}, .fns=funs))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x x_sqrt x_log x_recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To restore your original column names, use
data_transformations <- function(d, col, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
d %>% mutate(across({{col}}, .fns=funs, .names="{.fn}"))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x sqrt log recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To handle multiple columns:
data_transformations <- function(d, cols, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
d %>% mutate(across({{cols}}, .fns=funs))
}
d1 <- tibble(x=1:10, y=seq(2, 20, 2))
d1 %>% data_transformations(c(x, y), list(sqrt=sqrt, log=log))
# A tibble: 10 × 6
x y x_sqrt x_log y_sqrt y_log
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 1.41 0.693
2 2 4 1.41 0.693 2 1.39
3 3 6 1.73 1.10 2.45 1.79
4 4 8 2 1.39 2.83 2.08
5 5 10 2.24 1.61 3.16 2.30
6 6 12 2.45 1.79 3.46 2.48
7 7 14 2.65 1.95 3.74 2.64
8 8 16 2.83 2.08 4 2.77
9 9 18 3 2.20 4.24 2.89
10 10 20 3.16 2.30 4.47 3.00
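Since .names takes a glue specification, you can also keep both the column and the function name when transforming several columns, e.g. separated by a dot (a sketch):
data_transformations <- function(d, cols, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
  d %>% mutate(across({{cols}}, .fns=funs, .names="{.col}.{.fn}"))
}
d1 %>% data_transformations(c(x, y), list(sqrt=sqrt, log=log))
# gives the columns x, y, x.sqrt, x.log, y.sqrt, y.log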
It's not so hard to backtest a portfolio with given weights and a set rebalancing frequency (e.g. daily/weekly...). There are R packages that do this, for example PerformanceAnalytics, or tidyquant's tq_portfolio, which uses that function.
I would like to backtest a portfolio that is rebalanced when the weights deviate by a certain threshold given in percentage points.
Say I have two equally-weighted stocks and a threshold of +/-15 percentage points: I would then rebalance to the initial weights as soon as one of the weights exceeds 65%.
For example, I have 3 stocks with equal weights (we should also be able to set other weights).
library(dplyr)
set.seed(3)
n <- 6
rets <- tibble(period = rep(1:n, 3),
stock = c(rep("A", n), rep("B", n), rep("C", n)),
ret = c(rnorm(n, 0, 0.3), rnorm(n, 0, 0.2), rnorm(n, 0, 0.1)))
target_weights <- tibble(stock = c("A", "B", "C"), target_weight = 1/3)
rets_weights <- rets %>%
left_join(target_weights, by = "stock")
rets_weights
# # A tibble: 18 x 4
# period stock ret target_weight
# <int> <chr> <dbl> <dbl>
# 1 1 A -0.289 0.333
# 2 2 A -0.0878 0.333
# 3 3 A 0.0776 0.333
# 4 4 A -0.346 0.333
# 5 5 A 0.0587 0.333
# 6 6 A 0.00904 0.333
# 7 1 B 0.0171 0.333
# 8 2 B 0.223 0.333
# 9 3 B -0.244 0.333
# 10 4 B 0.253 0.333
# 11 5 B -0.149 0.333
# 12 6 B -0.226 0.333
# 13 1 C -0.0716 0.333
# 14 2 C 0.0253 0.333
# 15 3 C 0.0152 0.333
# 16 4 C -0.0308 0.333
# 17 5 C -0.0953 0.333
# 18 6 C -0.0648 0.333
Here are the actual weights without rebalancing:
rets_weights_actual <- rets_weights %>%
group_by(stock) %>%
mutate(value = cumprod(1+ret)*target_weight[1]) %>%
group_by(period) %>%
mutate(actual_weight = value/sum(value))
rets_weights_actual
# # A tibble: 18 x 6
# # Groups: period [6]
# period stock ret target_weight value actual_weight
# <int> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 A -0.289 0.333 0.237 0.268
# 2 2 A -0.0878 0.333 0.216 0.228
# 3 3 A 0.0776 0.333 0.233 0.268
# 4 4 A -0.346 0.333 0.153 0.178
# 5 5 A 0.0587 0.333 0.162 0.207
# 6 6 A 0.00904 0.333 0.163 0.238
# 7 1 B 0.0171 0.333 0.339 0.383
# 8 2 B 0.223 0.333 0.415 0.437
# 9 3 B -0.244 0.333 0.314 0.361
# 10 4 B 0.253 0.333 0.393 0.458
# 11 5 B -0.149 0.333 0.335 0.430
# 12 6 B -0.226 0.333 0.259 0.377
# 13 1 C -0.0716 0.333 0.309 0.349
# 14 2 C 0.0253 0.333 0.317 0.335
# 15 3 C 0.0152 0.333 0.322 0.371
# 16 4 C -0.0308 0.333 0.312 0.364
# 17 5 C -0.0953 0.333 0.282 0.363
# 18 6 C -0.0648 0.333 0.264 0.385
So if in any period any stock's weight goes over or under the threshold (for example 0.33 +/- 0.1), the portfolio weights should be reset to the initial weights.
This has to be done dynamically: we could have many periods and many stocks, and rebalancing could be necessary several times.
What I tried: I worked with lag and set the initial weights whenever the actual weights exceeded the threshold, but I was unable to do this dynamically, since the weights depend on the returns given the rebalanced weights.
Rebalancing upon deviation by more than a certain threshold is called percentage-of-portfolio rebalancing.
My solution iterates period by period and checks whether the upper or lower threshold was crossed. If so, we reset to the initial weights.
library(tidyverse)
library(tidyquant)
rets <- FANG %>%
group_by(symbol) %>%
mutate(ret = adjusted/lag(adjusted)-1) %>%
select(symbol, date, ret) %>%
pivot_wider(names_from = "symbol", values_from = ret)
weights <- rep(0.25, 4)
threshold <- 0.05
r_out <- tibble()
i0 <- 1
trade_rebalance <- 1
pf_value <- 1
for (i in 1:nrow(rets)) {
  r <- rets[i0:i,]  # returns since the last rebalancing
  j <- 0
  r_i <- r %>%
    mutate_if(is.numeric, replace_na, 0) %>%
    # cumulative value of each position since the last rebalancing
    mutate_if(is.numeric, list(v = ~ pf_value * weights[j <<- j + 1] * cumprod(1 + .))) %>%
    # total portfolio value, then each position's current weight
    mutate(pf = rowSums(select(., contains("_v")))) %>%
    mutate_at(vars(ends_with("_v")), list(w = ~ ./pf))
  # check whether any weight crossed a band in the current period
  touch_upper_band <- any(r_i[nrow(r_i),] %>% select(ends_with("_w")) %>% unlist() > weights + threshold)
  touch_lower_band <- any(r_i[nrow(r_i),] %>% select(ends_with("_w")) %>% unlist() < weights - threshold)
  if (touch_upper_band | touch_lower_band | i == nrow(rets)) {
    # rebalance: store this leg and restart the cumulative products next period
    i0 <- i + 1
    r_out <- bind_rows(r_out, r_i %>% mutate(trade_rebalance = trade_rebalance))
    pf_value <- r_i[[nrow(r_i), "pf"]]
    trade_rebalance <- trade_rebalance + 1
  }
}
r_out %>% head()
# # A tibble: 6 x 15
# date FB AMZN NFLX GOOG FB_v AMZN_v NFLX_v GOOG_v pf FB_v_w AMZN_v_w NFLX_v_w GOOG_v_w trade_rebalance
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2013-01-02 0 0 0 0 0.25 0.25 0.25 0.25 1 0.25 0.25 0.25 0.25 1
# 2 2013-01-03 -0.00821 0.00455 0.0498 0.000581 0.248 0.251 0.262 0.250 1.01 0.245 0.248 0.259 0.247 1
# 3 2013-01-04 0.0356 0.00259 -0.00632 0.0198 0.257 0.252 0.261 0.255 1.02 0.251 0.246 0.255 0.249 1
# 4 2013-01-07 0.0229 0.0359 0.0335 -0.00436 0.263 0.261 0.270 0.254 1.05 0.251 0.249 0.257 0.243 1
# 5 2013-01-08 -0.0122 -0.00775 -0.0206 -0.00197 0.259 0.259 0.264 0.253 1.04 0.251 0.250 0.255 0.245 1
# 6 2013-01-09 0.0526 -0.000113 -0.0129 0.00657 0.273 0.259 0.261 0.255 1.05 0.261 0.247 0.249 0.244 1
r_out %>% tail()
# # A tibble: 6 x 15
# date FB AMZN NFLX GOOG FB_v AMZN_v NFLX_v GOOG_v pf FB_v_w AMZN_v_w NFLX_v_w GOOG_v_w trade_rebalance
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2016-12-22 -0.0138 -0.00553 -0.00727 -0.00415 0.945 1.10 1.32 1.08 4.45 0.213 0.247 0.297 0.243 10
# 2 2016-12-23 -0.00111 -0.00750 0.0000796 -0.00171 0.944 1.09 1.32 1.08 4.43 0.213 0.246 0.298 0.243 10
# 3 2016-12-27 0.00631 0.0142 0.0220 0.00208 0.950 1.11 1.35 1.08 4.49 0.212 0.247 0.301 0.241 10
# 4 2016-12-28 -0.00924 0.000946 -0.0192 -0.00821 1.11 1.12 1.10 1.11 4.45 0.250 0.252 0.247 0.250 11
# 5 2016-12-29 -0.00488 -0.00904 -0.00445 -0.00288 1.11 1.11 1.10 1.11 4.42 0.250 0.252 0.248 0.251 11
# 6 2016-12-30 -0.0112 -0.0200 -0.0122 -0.0140 1.09 1.09 1.08 1.09 4.36 0.251 0.250 0.248 0.251 11
Here we would have rebalanced 11 times.
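To double-check, the number of rebalancing legs can be counted directly (a quick sketch):
r_out %>% distinct(trade_rebalance) %>% nrow()
# [1] 11
Plotting the portfolio against the individual stocks: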
r_out %>%
mutate(performance = pf-1) %>%
ggplot(aes(x = date, y = performance)) +
geom_line(data = FANG %>%
group_by(symbol) %>%
mutate(performance = adjusted/adjusted[1L]-1),
aes(color = symbol)) +
geom_line(size = 1)
The approach is slow, and the loop is far from elegant. If anyone has a better solution, I will happily upvote and accept.
I have a data frame and I want to get every 11 rows: not just every 11th row, but a chunk of 11 rows each time. For example:
Subject Wt Dose Time conc
1 1 79.6 4.02 0.00 0.74
2 1 79.6 4.02 0.25 2.84
3 1 79.6 4.02 0.57 6.57
4 1 79.6 4.02 1.12 10.50
5 1 79.6 4.02 2.02 9.66
6 1 79.6 4.02 3.82 8.58
7 1 79.6 4.02 5.10 8.36
8 1 79.6 4.02 7.03 7.47
9 1 79.6 4.02 9.05 6.89
10 1 79.6 4.02 12.12 5.94
11 1 79.6 4.02 24.37 3.28
and then the next 11 rows, and then the following 11 rows, and so on.
I tried this
for (i in 1:nrow(Theoph)) {
everyEleven = Theoph[11,i]
everyEl
}
But it just gives me the first 11 rows, not the second chunk of 11 rows and so on.
Maybe you can try split like below
everyEleven <- split(Theoph,ceiling(seq(nrow(Theoph))/11))
Try this, as adapted from "Split into multiple subsets of data frames with dplyr::group_by":
library(tibble)
library(dplyr)
library(tidyr)
Make an illustrative data frame, since your data in the question is only 11 rows:
tib <- tibble(sub = rep(1:3, each = 11),
              var = runif(33))
tib1 <- tib %>%
  # create a grouping variable every 11 rows, unless a variable in your data already does the same
  mutate(grp = rep(1:3, each = 11)) %>%
  group_by(grp) %>%
  nest() %>%
  select(data) %>%
  unlist(recursive = FALSE)
Gives you:
$data1
# A tibble: 11 x 2
sub var
<int> <dbl>
1 1 0.258
2 1 0.337
3 1 0.463
4 1 0.856
5 1 0.466
6 1 0.701
7 1 0.548
8 1 0.999
9 1 0.454
10 1 0.292
11 1 0.173
$data2
# A tibble: 11 x 2
sub var
<int> <dbl>
1 2 0.148
2 2 0.487
3 2 0.246
4 2 0.279
5 2 0.130
6 2 0.730
7 2 0.312
8 2 0.935
9 2 0.968
10 2 0.745
11 2 0.485
$data3
# A tibble: 11 x 2
sub var
<int> <dbl>
1 3 0.141
2 3 0.200
3 3 0.00000392
4 3 0.993
5 3 0.644
6 3 0.334
7 3 0.567
8 3 0.817
9 3 0.0342
10 3 0.718
11 3 0.527
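With newer dplyr (>= 0.8), group_split() gets you the same list more directly; a quick sketch with the same grouping variable:
library(dplyr)
tib %>%
  mutate(grp = rep(1:3, each = 11)) %>%
  group_split(grp, .keep = FALSE)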
Since the sample data you provided has a column Subject, which I assume represents the subject IDs, and there are only 11 rows per value of Subject, you can use
split(Theoph, Theoph$Subject)
I will assume your data frame is 11 * N rows long. Then:
N <- nrow(Theoph) %/% 11
everyEleven = vector(mode = "list", length = N)
for(i in 1:N){
  start = (i - 1) * 11 + 1
  end = i * 11
  everyEleven[[i]] = Theoph[start:end, ]
}
We can use gl to create the grouping index
split(Theoph, as.integer(gl(nrow(Theoph), 11, nrow(Theoph))))
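For intuition, gl(n, k, length) generates a factor with n levels, each repeated k times, recycled or truncated to length, so the call above labels rows 1-11 as group 1, rows 12-22 as group 2, and so on:
as.integer(gl(3, 11, 33))
# [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3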
My question:
How do I calculate the average (mean) per sample (A, B, C) per day (day 3 separately from day 5) and then add a line of best fit through the means from one day to the next?
I want to add this to a dot plot (ggplot2 geom_point). Example data is below; the R script I used follows the data.
Data below:
Day Sample Measurement
3 A 0.648
3 A 0.661
3 A 0.65
3 A 0.594
3 A 0.548
3 A 0.653
3 A 0.648
3 A 0.672
3 A 0.661
3 A 0.66
3 A 0.647
3 A 0.629
3 A 0.691
3 A 0.534
3 A 0.567
3 A 0.634
3 A 0.579
3 B 0.689
3 B 0.598
3 B 0.658
3 B 0.662
3 B 0.599
3 B 0.678
3 B 0.65
3 B 0.617
3 B 0.673
3 B 0.67
3 B 0.666
3 B 0.595
3 B 0.604
3 B 0.59
3 B 0.569
3 B 0.614
3 C 0.624
3 C 0.623
3 C 0.606
3 C 0.66
3 C 0.623
3 C 0.669
3 C 0.642
3 C 0.658
3 C 0.645
3 C 0.653
3 C 0.501
3 C 0.552
3 C 0.663
3 C 0.589
3 C 0.602
5 A 0.811
5 A 0.822
5 A 0.811
5 A 0.824
5 A 0.773
5 A 0.823
5 A 0.815
5 A 0.819
5 A 0.754
5 A 0.81
5 A 0.796
5 A 0.818
5 A 0.797
5 A 0.811
5 A 0.812
5 A 0.817
5 A 0.821
5 B 0.827
5 B 0.798
5 B 0.819
5 B 0.81
5 B 0.826
5 B 0.821
5 B 0.805
5 B 0.821
5 B 0.825
5 B 0.821
5 B 0.816
5 B 0.814
5 B 0.823
5 B 0.81
5 B 0.823
5 B 0.762
5 B 0.825
5 B 0.821
5 B 0.825
5 B 0.812
R Code for ggplot:
p2 <- ggplot(data=data1, aes(x=Day, y=Fv.Fm..XE..Mean)) +
geom_point(aes(colour= Sample),
position = position_jitterdodge(dodge.width=0.75 , jitter.width=0.250)) +
# geom_line(aes(colour=Sample),
# position = position_jitterdodge(dodge.width=0.75)) +
scale_x_discrete(labels=c(3, 5, 7, 10, 14)) +
scale_y_continuous(limits=c(0.3 , 1.0))
p2
ggsave("p2.jpg")
First, calculate the mean for each Sample and Day:
library(tidyverse)
library(ggpmisc)
data1 <- read.table(text = txt, header = TRUE)
mean_data1 <- data1 %>%
group_by(Day, Sample) %>%
summarise(Mean = mean(Measurement, na.rm = TRUE))
mean_data1
#> # A tibble: 5 x 3
#> # Groups: Day [?]
#> Day Sample Mean
#> <int> <fct> <dbl>
#> 1 3 A 0.628
#> 2 3 B 0.633
#> 3 3 C 0.621
#> 4 5 A 0.808
#> 5 5 B 0.815
Then plot all Measurement values, faceted by Sample with facet_grid. Linear fits are added using geom_smooth, and the stat_poly_eq function from the ggpmisc package displays the equation and R². Finally, we overlay the Mean values.
p2 <- ggplot(data = data1, aes(x = Day, y = Measurement)) +
geom_point(aes(colour= Sample),
alpha = 0.7,
position = position_jitterdodge(dodge.width=0.75,
jitter.width=0.250)) +
scale_x_continuous(breaks=c(3, 5)) +
scale_y_continuous(limits=c(0.3 , 1.0))
formula <- y ~ x
p2 +
facet_grid(~ Sample) +
geom_smooth(method = "lm", formula = formula, se = FALSE) +
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~")),
label.x.npc = "left", label.y.npc = "top",
formula = formula, parse = TRUE, size = 4) +
geom_point(data = mean_data1, aes(Day, Mean, color = "Mean"),
size = 3) +
theme_bw() +
theme(aspect.ratio = 1)
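If you instead want the fitted line to pass through the group means only, rather than through all points, a sketch reusing the mean_data1 object from above (note that Sample C only has day 3 here, so no line can be drawn for it):
p2 +
  facet_grid(~ Sample) +
  geom_smooth(data = mean_data1, aes(x = Day, y = Mean),
              method = "lm", formula = y ~ x, se = FALSE) +
  geom_point(data = mean_data1, aes(Day, Mean), size = 3)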
Data used:
txt <- "Day Sample Measurement
3 A 0.648
3 A 0.661
3 A 0.65
3 A 0.594
3 A 0.548
3 A 0.653
3 A 0.648
3 A 0.672
3 A 0.661
3 A 0.66
3 A 0.647
3 A 0.629
3 A 0.691
3 A 0.534
3 A 0.567
3 A 0.634
3 A 0.579
3 B 0.689
3 B 0.598
3 B 0.658
3 B 0.662
3 B 0.599
3 B 0.678
3 B 0.65
3 B 0.617
3 B 0.673
3 B 0.67
3 B 0.666
3 B 0.595
3 B 0.604
3 B 0.59
3 B 0.569
3 B 0.614
3 C 0.624
3 C 0.623
3 C 0.606
3 C 0.66
3 C 0.623
3 C 0.669
3 C 0.642
3 C 0.658
3 C 0.645
3 C 0.653
3 C 0.501
3 C 0.552
3 C 0.663
3 C 0.589
3 C 0.602
5 A 0.811
5 A 0.822
5 A 0.811
5 A 0.824
5 A 0.773
5 A 0.823
5 A 0.815
5 A 0.819
5 A 0.754
5 A 0.81
5 A 0.796
5 A 0.818
5 A 0.797
5 A 0.811
5 A 0.812
5 A 0.817
5 A 0.821
5 B 0.827
5 B 0.798
5 B 0.819
5 B 0.81
5 B 0.826
5 B 0.821
5 B 0.805
5 B 0.821
5 B 0.825
5 B 0.821
5 B 0.816
5 B 0.814
5 B 0.823
5 B 0.81
5 B 0.823
5 B 0.762
5 B 0.825
5 B 0.821
5 B 0.825
5 B 0.812"
Created on 2018-03-17 by the reprex package (v0.2.0).
I have a data frame like:
DATE x y ID
06/10/2003 7.21 0.651 1
12/10/2003 5.99 0.428 1
18/10/2003 4.68 1.04 1
24/10/2003 3.47 0.363 1
30/10/2003 2.42 0.507 1
02/05/2010 2.72 0.47 2
05/05/2010 2.6 0.646 2
08/05/2010 2.67 0.205 2
11/05/2010 3.57 0.524 2
12/05/2010 0.428 4.68 3
13/05/2010 1.04 3.47 3
14/05/2010 0.363 2.42 3
18/10/2003 0.507 2.52 3
24/10/2003 0.418 4.68 3
30/10/2003 0.47 3.47 3
29/04/2010 0.646 2.42 4
18/10/2003 3.47 2.52 4
I have the count of the number of rows per group of column ID as an integer vector, e.g. 5 4 6 2.
Is there a way to replace the group values in column ID with this integer vector?
The output I am expecting is:
DATE x y ID
06/10/2003 7.21 0.651 5
12/10/2003 5.99 0.428 5
18/10/2003 4.68 1.04 5
24/10/2003 3.47 0.363 5
30/10/2003 2.42 0.507 5
02/05/2010 2.72 0.47 4
05/05/2010 2.6 0.646 4
08/05/2010 2.67 0.205 4
11/05/2010 3.57 0.524 4
12/05/2010 0.428 4.68 6
13/05/2010 1.04 3.47 6
14/05/2010 0.363 2.42 6
18/10/2003 0.507 2.52 6
24/10/2003 0.418 4.68 6
30/10/2003 0.47 3.47 6
29/04/2010 0.646 2.42 2
18/10/2003 3.47 2.52 2
I am quite new to R and tried to find a replace function for this, but I am having a hard time. Any help is much appreciated.
The above data is just an example to illustrate my requirement.
A compact solution with the data.table-package:
library(data.table)
setDT(mydf)[, ID := .N, by = ID][]
which gives:
> mydf
DATE x y ID
1: 06/10/2003 7.210 0.651 5
2: 12/10/2003 5.990 0.428 5
3: 18/10/2003 4.680 1.040 5
4: 24/10/2003 3.470 0.363 5
5: 30/10/2003 2.420 0.507 5
6: 02/05/2010 2.720 0.470 4
7: 05/05/2010 2.600 0.646 4
8: 08/05/2010 2.670 0.205 4
9: 11/05/2010 3.570 0.524 4
10: 12/05/2010 0.428 4.680 6
11: 13/05/2010 1.040 3.470 6
12: 14/05/2010 0.363 2.420 6
13: 18/10/2003 0.507 2.520 6
14: 24/10/2003 0.418 4.680 6
15: 30/10/2003 0.470 3.470 6
16: 29/04/2010 0.646 2.420 2
17: 18/10/2003 3.470 2.520 2
What this does:
setDT(mydf) converts the dataframe to a data.table
by = ID groups by ID
ID := .N replaces the original value of ID with the count by group
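If you would rather keep the original ID and store the count in a new column instead, a sketch:
library(data.table)
setDT(mydf)[, ID2 := .N, by = ID][]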
You can use the ave() function to calculate how many rows each ID takes up. In the example below I created a new variable ID2, but you could replace the original ID if you want.
(I included code to create your data in R below, but when you ask questions in the future please include your data in the question by using the dput() function on the data object. That's what I did to make the code below.)
mydata <- structure(list(DATE = c("06/10/2003", "12/10/2003", "18/10/2003",
"24/10/2003", "30/10/2003", "02/05/2010", "05/05/2010", "08/05/2010",
"11/05/2010", "12/05/2010", "13/05/2010", "14/05/2010", "18/10/2003",
"24/10/2003", "30/10/2003", "29/04/2010", "18/10/2003"),
x = c(7.21, 5.99, 4.68, 3.47, 2.42, 2.72, 2.6, 2.67, 3.57, 0.428, 1.04, 0.363,
0.507, 0.418, 0.47, 0.646, 3.47),
y = c(0.651, 0.428, 1.04, 0.363, 0.507, 0.47, 0.646, 0.205, 0.524, 4.68, 3.47,
2.42, 2.52, 4.68, 3.47, 2.42, 2.52),
ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4)),
.Names = c("DATE", "x", "y", "ID"),
class = c("data.frame"),
row.names = c(NA, -17L))
# ave() takes an input object, an object of group IDs of the same length
# as the input object, and a function to apply to the input object split across groups
mydata$ID2 <- ave(mydata$ID, mydata$ID, FUN = length)
mydata
DATE x y ID ID2
1 06/10/2003 7.210 0.651 1 5
2 12/10/2003 5.990 0.428 1 5
3 18/10/2003 4.680 1.040 1 5
4 24/10/2003 3.470 0.363 1 5
5 30/10/2003 2.420 0.507 1 5
6 02/05/2010 2.720 0.470 2 4
7 05/05/2010 2.600 0.646 2 4
8 08/05/2010 2.670 0.205 2 4
9 11/05/2010 3.570 0.524 2 4
10 12/05/2010 0.428 4.680 3 6
11 13/05/2010 1.040 3.470 3 6
12 14/05/2010 0.363 2.420 3 6
13 18/10/2003 0.507 2.520 3 6
14 24/10/2003 0.418 4.680 3 6
15 30/10/2003 0.470 3.470 3 6
16 29/04/2010 0.646 2.420 4 2
17 18/10/2003 3.470 2.520 4 2
# if you want to replace the original ID variable, you can assign to it
# instead of adding a new variable
mydata$ID <- ave(mydata$ID, mydata$ID, FUN = length)
A solution with dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(ID2 = n()) %>%
ungroup() %>%
mutate(ID = ID2) %>%
select(-ID2)
Edit:
I've just found a solution that's a bit cleaner than the above:
df %>%
group_by(ID2 = ID) %>%
mutate(ID = n()) %>%
select(-ID2)
Result:
# A tibble: 17 x 4
DATE x y ID
<fctr> <dbl> <dbl> <int>
1 06/10/2003 7.210 0.651 5
2 12/10/2003 5.990 0.428 5
3 18/10/2003 4.680 1.040 5
4 24/10/2003 3.470 0.363 5
5 30/10/2003 2.420 0.507 5
6 02/05/2010 2.720 0.470 4
7 05/05/2010 2.600 0.646 4
8 08/05/2010 2.670 0.205 4
9 11/05/2010 3.570 0.524 4
10 12/05/2010 0.428 4.680 6
11 13/05/2010 1.040 3.470 6
12 14/05/2010 0.363 2.420 6
13 18/10/2003 0.507 2.520 6
14 24/10/2003 0.418 4.680 6
15 30/10/2003 0.470 3.470 6
16 29/04/2010 0.646 2.420 2
17 18/10/2003 3.470 2.520 2
Notes:
The reason behind ungroup() %>% mutate(ID = ID2) %>% select(-ID2) is that dplyr doesn't allow mutating grouping variables. So this would not work:
df %>%
group_by(ID) %>%
mutate(ID = n())
Error in mutate_impl(.data, dots) : Column ID can't be modified
because it's a grouping variable
If you don't care about replacing the original ID column, you can just do:
df %>%
group_by(ID) %>%
mutate(ID2 = n())
Alternative Result:
# A tibble: 17 x 5
# Groups: ID [4]
DATE x y ID ID2
<fctr> <dbl> <dbl> <int> <int>
1 06/10/2003 7.210 0.651 1 5
2 12/10/2003 5.990 0.428 1 5
3 18/10/2003 4.680 1.040 1 5
4 24/10/2003 3.470 0.363 1 5
5 30/10/2003 2.420 0.507 1 5
6 02/05/2010 2.720 0.470 2 4
7 05/05/2010 2.600 0.646 2 4
8 08/05/2010 2.670 0.205 2 4
9 11/05/2010 3.570 0.524 2 4
10 12/05/2010 0.428 4.680 3 6
11 13/05/2010 1.040 3.470 3 6
12 14/05/2010 0.363 2.420 3 6
13 18/10/2003 0.507 2.520 3 6
14 24/10/2003 0.418 4.680 3 6
15 30/10/2003 0.470 3.470 3 6
16 29/04/2010 0.646 2.420 4 2
17 18/10/2003 3.470 2.520 4 2
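As a final note, a sketch with add_count() (available since dplyr 0.7), which sidesteps the grouping-variable restriction without an explicit group_by()/ungroup():
library(dplyr)
df %>%
  add_count(ID) %>% # appends the group size as column n
  mutate(ID = n) %>%
  select(-n)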