ggplot2: bubbles representing proportions by category?

I have this data:
# A tibble: 19 x 8
country Prop_A Prop_B Prop_C Prop_D Prop_E Prop_F Prop_G
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Austria 1 1 0.912 0.912 0.518 0.999 0.567
2 Belgium 1 1 0.821 1 0.687 0.0990 0.925
3 Denmark NA NA NA NA NA NA NA
4 France 0.750 1 0.361 0.345 0 0.0658 0.563
5 Germany 0.928 1 0.674 0.783 0.128 0.635 0.0828
6 Greece 0 1 0 0 0 1 0
7 Hungary 0.812 1 0.812 0.812 0 0.375 0.188
8 Israel 1 1 1 0.755 0.450 0.241 0.292
9 Italy 0.962 1 0.881 0.516 0.533 0 0.0230
10 Latvia 0 1 1 0 0 0 0
11 Lithuania 0.507 1 1 0.507 0 0 0
12 Malta 1 1 1 1 0 1 0
13 Netherlands 0.818 1 1 0.682 0.5 0.182 0.682
14 Portugal 0.829 1 1 0.829 0 0.610 0.509
15 Romania 1 1 1 1 0 0.273 1
16 Spain 1 1 1 0.787 0.215 0.191 0.653
17 Sweden 0.792 1 0.792 0.167 0.375 0 0
18 Switzerland 0.697 1 1 0.547 0.126 0.724 0.210
19 Turkey 1 1 0.842 0.775 0.585 0.810 0.117
Here 0.812 represents 81% for proposal A in Hungary (row 7).
What I want is this kind of graphic:
https://zupimages.net/viewer.php?id=20/13/ob6z.png
I want to have "81%" in the bubble, countries in rows, and the different "Props" in columns.
I've tried geom_tile, but it doesn't work. I can't tell whether my data are badly structured or whether I just haven't found the right command.
Thanks for your help!

Here is one approach to making a bubble plot.
library(tidyverse)

df %>%
  mutate_at(vars(starts_with("Prop")), list(~ . * 100)) %>%        # proportions -> percentages
  pivot_longer(cols = starts_with("Prop"),
               names_to = c("Prop", "Type"), names_sep = "_") %>%
  ggplot(aes(x = Type, y = country, size = value, label = value)) +
  geom_point(shape = 21, fill = "white") +                         # hollow bubbles sized by value
  geom_text(size = 3) +                                            # print the value inside the bubble
  scale_size(range = c(5, 15), guide = "none")
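To show the literal "81%" the question asks for, you can build the label string before plotting. A small variation on the code above (label is a column name introduced here for illustration, and the all-NA Denmark row is dropped):

df %>%
  pivot_longer(cols = starts_with("Prop"),
               names_to = c("Prop", "Type"), names_sep = "_") %>%
  drop_na(value) %>%                                   # Denmark has only NAs
  mutate(label = paste0(round(value * 100), "%")) %>%  # 0.812 -> "81%"
  ggplot(aes(x = Type, y = country, size = value, label = label)) +
  geom_point(shape = 21, fill = "white") +
  geom_text(size = 3) +
  scale_size(range = c(5, 15), guide = "none")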

Scatterplot with multiple variables

(Reference images: scatterplot example and data set.)
Can someone help me create three scatter plots as in the first picture? Ideally using the plot() function.
library(tidyverse)  # loads ggplot2 as well

df <- tibble(
  image = 1:18,
  m_r_exsal = rnorm(18, 5, 2),
  m_r_sal = rnorm(18, 6, 2),
  female = c(rep(1, 18/2), rep(0, 18/2)),
  lg_salary = rnorm(18, 5, 1.5),
  deviation = rnorm(18, 1, 1),
  chinese = c(rep(1, 6), rep(0, 18/3*2)),
  european = c(rep(0, 6), rep(1, 6), rep(0, 6)),
  american = c(rep(0, 18/3*2), rep(1, 6))
)
Example data:
# A tibble: 18 x 9
image m_r_exsal m_r_sal female lg_salary deviation chinese european american
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6.19 9.56 1 5.48 1.54 1 0 0
2 2 10.1 5.17 1 3.77 -0.755 1 0 0
3 3 4.96 1.91 1 6.75 0.381 1 0 0
4 4 5.10 4.57 1 4.61 -0.207 1 0 0
5 5 -1.25 6.57 1 2.33 0.880 1 0 0
6 6 6.77 9.10 1 3.07 1.03 1 0 0
7 7 4.04 4.84 1 4.56 1.95 0 1 0
8 8 3.72 4.72 1 5.32 1.17 0 1 0
9 9 7.59 7.05 1 6.24 -0.224 0 1 0
10 10 4.09 3.94 0 5.60 2.52 0 1 0
11 11 4.15 6.05 0 7.08 -0.152 0 1 0
12 12 6.07 5.27 0 5.79 -0.323 0 1 0
13 13 4.49 4.64 0 5.97 0.457 0 0 1
14 14 6.74 4.70 0 3.38 0.377 0 0 1
15 15 7.46 9.02 0 6.65 1.85 0 0 1
16 16 4.29 5.26 0 4.07 2.18 0 0 1
17 17 2.33 1.58 0 8.43 1.06 0 0 1
18 18 4.78 8.75 0 5.03 0.101 0 0 1
Making the plot:
df %>%
  mutate(chinese = case_when(chinese == 1 ~ "chinese"),    # 0 becomes NA
         european = case_when(european == 1 ~ "european"),
         american = case_when(american == 1 ~ "american"),
         female = case_when(female == 1 ~ "female",
                            TRUE ~ "male")) %>%
  unite(country, chinese:american, remove = TRUE, sep = "") %>%  # e.g. "chineseNANA"
  mutate(country = str_remove_all(country, "NA")) %>%            # strip the NA parts
  ggplot(aes(lg_salary, deviation, col = female)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ country)
The output:
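Since the question asks for the plot() function, here is a rough base-R sketch of the same three-panel figure (a sketch only: the colours, layout, and the reconstruction of country are my assumptions, not taken from the reference picture):

country <- with(df, ifelse(chinese == 1, "chinese",
                    ifelse(european == 1, "european", "american")))
op <- par(mfrow = c(1, 3))                  # three panels side by side
for (ctry in unique(country)) {
  sub <- df[country == ctry, ]
  plot(sub$lg_salary, sub$deviation,
       col = ifelse(sub$female == 1, "red", "blue"),
       pch = 19, main = ctry, xlab = "lg_salary", ylab = "deviation")
  for (sex in c(0, 1)) {
    sel <- sub$female == sex
    if (sum(sel) >= 2) {                    # need at least 2 points to fit a line
      abline(lm(deviation ~ lg_salary, data = sub[sel, ]),
             col = ifelse(sex == 1, "red", "blue"))
    }
  }
}
par(op)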

How do I create a dummy variable that depends on values in multiple columns?

I am trying to create a treatment dummy for the states whose legal1820 in 1970 differs from their legal1820 in 1979. So I need the proper syntax for something like this: treat = ifelse((legal1820 when (year == 1970)) != (legal1820 when (year == 1979)), 1, 0)
This is the data I am using:
mlda <- read_dta("http://masteringmetrics.com/wp-content/uploads/2015/01/deaths.dta")

dft <- mlda %>%
  filter(year <= 1990) %>%
  mutate(dtype = as_factor(dtype, levels = "labels"),
         age_cat = agegr,
         agegr = as_factor(agegr, levels = "labels"))

library(tidycensus)
data("fips_codes")
fips_codes <- fips_codes %>%
  mutate(state_code = as.numeric(state_code)) %>%
  select(state, state_code) %>%
  distinct()

dft <- dft %>%
  rename(state_code = state) %>%
  right_join(fips_codes, by = "state_code") %>%
  select(-state_code) %>%
  group_by(state) %>%
  filter(agegr == "18-20 yrs", year <= 1983) %>%
  pivot_wider(names_from = dtype, values_from = mrate) %>%
  mutate(post = ifelse(year >= 1975, 1, 0))
These are the libraries I am using (most of them are for other parts of my code):
library(tidyverse)
library(AER)
library(stargazer)
library(haven)
library(lfe)
library(estimatr)
library(stringr)
library(dplyr)
library(modelsummary)
library(ggplot2)
Is this what you are looking for?
library(dplyr)

mlda %>%
  group_by(state) %>%
  # compare the 1970 value with the 1979 value within each state;
  # the leading `+` turns the logical result into a 0/1 dummy
  mutate(treat = +(first(legal1820[year == 1970] != legal1820[year == 1979])))
Output
# A tibble: 24,786 x 16
# Groups: state [51]
year state legal1820 dtype agegr count pop age legal beertaxa beerpercap winepercap spiritpercap totpercap mrate treat
<dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1970 1 0 1 [all] 1 [15-17 yrs] 224 213574 16.0 0 1.37 0.600 0.0900 0.700 1.38 105. 1
2 1971 1 0 1 [all] 1 [15-17 yrs] 241 220026 16.0 0 1.32 0.660 0.0900 0.760 1.52 110. 1
3 1972 1 0 1 [all] 1 [15-17 yrs] 270 224877 16.0 0 1.28 0.740 0.0900 0.780 1.61 120. 1
4 1973 1 0 1 [all] 1 [15-17 yrs] 258 227256 16.0 0 1.20 0.790 0.100 0.790 1.69 114. 1
5 1974 1 0 1 [all] 1 [15-17 yrs] 224 229025 16.0 0 1.08 0.830 0.160 0.810 1.80 97.8 1
6 1975 1 0.294 1 [all] 1 [15-17 yrs] 207 229739 16.0 0 0.991 0.880 0.160 0.850 1.88 90.1 1
7 1976 1 0.665 1 [all] 1 [15-17 yrs] 231 230696 16.0 0 0.937 0.890 0.150 0.860 1.89 100. 1
8 1977 1 0.668 1 [all] 1 [15-17 yrs] 219 230086 16.0 0 0.880 0.990 0.130 0.840 1.96 95.2 1
9 1978 1 0.667 1 [all] 1 [15-17 yrs] 234 229519 16.0 0 0.817 0.980 0.120 0.880 1.97 102. 1
10 1979 1 0.668 1 [all] 1 [15-17 yrs] 176 227140 16.0 0 0.734 0.980 0.120 0.840 1.94 77.5 1
# ... with 24,776 more rows
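An equivalent, more explicit spelling of the same idea (a sketch; like the one-liner above, it assumes every state has rows for both 1970 and 1979):

mlda %>%
  group_by(state) %>%
  mutate(treat = ifelse(first(legal1820[year == 1970]) !=
                          first(legal1820[year == 1979]), 1, 0)) %>%
  ungroup()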

Dynamic portfolio re-balancing if PF weights deviate by more than a threshold

It's not so hard to backtest a portfolio with given weights and a set rebalancing frequency (e.g. daily or weekly). There are R packages for this, for example PerformanceAnalytics, or tidyquant's tq_portfolio, which builds on it.
I would like to backtest a portfolio that is re-balanced when the weights deviate by a certain threshold given in percentage points.
Say I have two equally-weighted stocks and a threshold of +/-15 percentage points; I would then rebalance to the initial weights as soon as one of the weights exceeds 65%.
For example, I have 3 stocks with equal weights (it should also be possible to set other weights).
library(dplyr)

set.seed(3)
n <- 6
rets <- tibble(period = rep(1:n, 3),
               stock = c(rep("A", n), rep("B", n), rep("C", n)),
               ret = c(rnorm(n, 0, 0.3), rnorm(n, 0, 0.2), rnorm(n, 0, 0.1)))
target_weights <- tibble(stock = c("A", "B", "C"), target_weight = 1/3)

rets_weights <- rets %>%
  left_join(target_weights, by = "stock")

rets_weights
# # A tibble: 18 x 4
# period stock ret target_weight
# <int> <chr> <dbl> <dbl>
# 1 1 A -0.289 0.333
# 2 2 A -0.0878 0.333
# 3 3 A 0.0776 0.333
# 4 4 A -0.346 0.333
# 5 5 A 0.0587 0.333
# 6 6 A 0.00904 0.333
# 7 1 B 0.0171 0.333
# 8 2 B 0.223 0.333
# 9 3 B -0.244 0.333
# 10 4 B 0.253 0.333
# 11 5 B -0.149 0.333
# 12 6 B -0.226 0.333
# 13 1 C -0.0716 0.333
# 14 2 C 0.0253 0.333
# 15 3 C 0.0152 0.333
# 16 4 C -0.0308 0.333
# 17 5 C -0.0953 0.333
# 18 6 C -0.0648 0.333
Here are the actual weights without rebalancing:
rets_weights_actual <- rets_weights %>%
  group_by(stock) %>%
  mutate(value = cumprod(1 + ret) * target_weight[1]) %>%
  group_by(period) %>%
  mutate(actual_weight = value / sum(value))

rets_weights_actual
# # A tibble: 18 x 6
# # Groups: period [6]
# period stock ret target_weight value actual_weight
# <int> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 A -0.289 0.333 0.237 0.268
# 2 2 A -0.0878 0.333 0.216 0.228
# 3 3 A 0.0776 0.333 0.233 0.268
# 4 4 A -0.346 0.333 0.153 0.178
# 5 5 A 0.0587 0.333 0.162 0.207
# 6 6 A 0.00904 0.333 0.163 0.238
# 7 1 B 0.0171 0.333 0.339 0.383
# 8 2 B 0.223 0.333 0.415 0.437
# 9 3 B -0.244 0.333 0.314 0.361
# 10 4 B 0.253 0.333 0.393 0.458
# 11 5 B -0.149 0.333 0.335 0.430
# 12 6 B -0.226 0.333 0.259 0.377
# 13 1 C -0.0716 0.333 0.309 0.349
# 14 2 C 0.0253 0.333 0.317 0.335
# 15 3 C 0.0152 0.333 0.322 0.371
# 16 4 C -0.0308 0.333 0.312 0.364
# 17 5 C -0.0953 0.333 0.282 0.363
# 18 6 C -0.0648 0.333 0.264 0.385
So if, in any period, any stock's weight moves above or below the threshold (for example 0.33 +/- 0.1), the portfolio weights should be reset to the initial weights.
This has to be done dynamically, since there could be many periods and many stocks, and rebalancing could be necessary several times.
What I tried: I worked with lag() and set the initial weights whenever the actual weights exceeded the threshold, but I was unable to do this dynamically, because the weights depend on the returns, which in turn depend on the rebalanced weights.
Rebalancing when the weights deviate by more than a certain threshold is known as percentage-of-portfolio rebalancing.
My solution is to iterate period by period and check whether the upper or lower band was crossed. If so, we reset the weights to the initial ones.
library(tidyverse)
library(tidyquant)

rets <- FANG %>%
  group_by(symbol) %>%
  mutate(ret = adjusted / lag(adjusted) - 1) %>%
  select(symbol, date, ret) %>%
  pivot_wider(names_from = "symbol", values_from = ret)

weights <- rep(0.25, 4)   # target weights
threshold <- 0.05         # rebalancing band: +/- 5 percentage points

r_out <- tibble()
i0 <- 1                   # first period of the current rebalancing window
trade_rebalance <- 1
pf_value <- 1

for (i in 1:nrow(rets)) {
  r <- rets[i0:i, ]
  j <- 0
  r_i <- r %>%
    mutate_if(is.numeric, replace_na, 0) %>%
    # grow each position from the last rebalancing (j steps through the weights)
    mutate_if(is.numeric, list(v = ~ pf_value * weights[j <<- j + 1] * cumprod(1 + .))) %>%
    mutate(pf = rowSums(select(., contains("_v")))) %>%      # total portfolio value
    mutate_at(vars(ends_with("_v")), list(w = ~ . / pf))     # actual weights
  touch_upper_band <- any(r_i[nrow(r_i), ] %>% select(ends_with("_w")) %>% unlist() > weights + threshold)
  touch_lower_band <- any(r_i[nrow(r_i), ] %>% select(ends_with("_w")) %>% unlist() < weights - threshold)
  if (touch_upper_band | touch_lower_band | i == nrow(rets)) {
    # a band was touched (or we hit the last row): record the window and rebalance
    i0 <- i + 1
    r_out <- bind_rows(r_out, r_i %>% mutate(trade_rebalance = trade_rebalance))
    pf_value <- r_i[[nrow(r_i), "pf"]]
    trade_rebalance <- trade_rebalance + 1
  }
}
r_out %>% head()
# # A tibble: 6 x 15
# date FB AMZN NFLX GOOG FB_v AMZN_v NFLX_v GOOG_v pf FB_v_w AMZN_v_w NFLX_v_w GOOG_v_w trade_rebalance
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2013-01-02 0 0 0 0 0.25 0.25 0.25 0.25 1 0.25 0.25 0.25 0.25 1
# 2 2013-01-03 -0.00821 0.00455 0.0498 0.000581 0.248 0.251 0.262 0.250 1.01 0.245 0.248 0.259 0.247 1
# 3 2013-01-04 0.0356 0.00259 -0.00632 0.0198 0.257 0.252 0.261 0.255 1.02 0.251 0.246 0.255 0.249 1
# 4 2013-01-07 0.0229 0.0359 0.0335 -0.00436 0.263 0.261 0.270 0.254 1.05 0.251 0.249 0.257 0.243 1
# 5 2013-01-08 -0.0122 -0.00775 -0.0206 -0.00197 0.259 0.259 0.264 0.253 1.04 0.251 0.250 0.255 0.245 1
# 6 2013-01-09 0.0526 -0.000113 -0.0129 0.00657 0.273 0.259 0.261 0.255 1.05 0.261 0.247 0.249 0.244 1
r_out %>% tail()
# # A tibble: 6 x 15
# date FB AMZN NFLX GOOG FB_v AMZN_v NFLX_v GOOG_v pf FB_v_w AMZN_v_w NFLX_v_w GOOG_v_w trade_rebalance
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2016-12-22 -0.0138 -0.00553 -0.00727 -0.00415 0.945 1.10 1.32 1.08 4.45 0.213 0.247 0.297 0.243 10
# 2 2016-12-23 -0.00111 -0.00750 0.0000796 -0.00171 0.944 1.09 1.32 1.08 4.43 0.213 0.246 0.298 0.243 10
# 3 2016-12-27 0.00631 0.0142 0.0220 0.00208 0.950 1.11 1.35 1.08 4.49 0.212 0.247 0.301 0.241 10
# 4 2016-12-28 -0.00924 0.000946 -0.0192 -0.00821 1.11 1.12 1.10 1.11 4.45 0.250 0.252 0.247 0.250 11
# 5 2016-12-29 -0.00488 -0.00904 -0.00445 -0.00288 1.11 1.11 1.10 1.11 4.42 0.250 0.252 0.248 0.251 11
# 6 2016-12-30 -0.0112 -0.0200 -0.0122 -0.0140 1.09 1.09 1.08 1.09 4.36 0.251 0.250 0.248 0.251 11
Here we would have rebalanced 11 times.
r_out %>%
  mutate(performance = pf - 1) %>%
  ggplot(aes(x = date, y = performance)) +
  geom_line(data = FANG %>%
              group_by(symbol) %>%
              mutate(performance = adjusted / adjusted[1L] - 1),
            aes(color = symbol)) +
  geom_line(size = 1)
The approach is slow and using a loop is far from elegant. If anyone has a better solution, I would happily upvote and accept.
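One way to speed this up, sketched under the same inputs (rets, weights, threshold from above; ret_mat and w_out are names introduced here), is to carry the position values forward instead of recomputing cumprod over the whole window at every step:

ret_mat <- as.matrix(rets[, -1])            # drop the date column
ret_mat[is.na(ret_mat)] <- 0
hold <- weights                             # position values; starting portfolio value = 1
w_out <- matrix(NA_real_, nrow(ret_mat), ncol(ret_mat))
for (i in seq_len(nrow(ret_mat))) {
  hold <- hold * (1 + ret_mat[i, ])         # grow each position by today's return
  w_out[i, ] <- hold / sum(hold)            # actual weights after the move
  if (any(abs(w_out[i, ] - weights) > threshold)) {
    hold <- sum(hold) * weights             # band touched: reset to target weights
  }
}
sum(hold) - 1                               # total performance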

Calculating group differences in a "badly" partitioned data set

I tried to solve this with existing questions here on SO, but I could not find a satisfying answer. My data frame has the following structure:
X <- data_frame(
  treat = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
  id = 1:16,
  x = rnorm(16),
  y = rnorm(16),
  z = rnorm(16)
)
It looks like:
# A tibble: 16 x 5
treat id x y z
<int> <int> <dbl> <dbl> <dbl>
1 1 1 -0.0724 1.26 0.317
2 1 2 -0.486 -0.628 0.392
3 1 3 -0.406 -0.706 1.18
4 1 4 -1.35 -1.27 2.36
5 2 5 -0.0751 -0.0394 0.568
6 2 6 0.243 0.873 0.132
7 2 7 0.138 0.611 -0.700
8 2 8 -0.732 1.02 -0.811
9 3 9 -0.0278 1.78 0.568
10 3 10 0.526 1.18 1.03
11 3 11 1.43 0.0937 -0.0825
12 3 12 -0.299 -0.117 0.367
13 4 13 1.05 2.04 0.678
14 4 14 -1.93 0.201 0.250
15 4 15 0.624 1.09 0.852
16 4 16 0.502 0.119 -0.843
Every fourth row in treat is a control, and I now want to calculate the difference in x, y and z between each treated row and its control. For example, for the first treated row I would like to calculate
-0.0724 - (-1.35) #x
1.26 - (-1.27) #y
0.317 - 2.36 #z
and for the second treated row accordingly,
-0.486 - (-1.35) #x
-0.628 - (-1.27) #y
0.392 - 2.36 #z
... and so on.
I would like to use a dplyr/tidyverse solution, but I have no idea how to do that in a "smooth" way. I already found a solution using joins, but it seems rather tedious compared to the "smooth" solutions dplyr usually offers.
With dplyr, we can group_by treat, use mutate_at to select the columns x:z, and subtract each group's 4th value from every value using the nth() function.
library(dplyr)

X %>%
  group_by(treat) %>%
  mutate_at(vars(x:z), list(~ . - nth(., 4)))   # subtract each group's 4th value
#treat id x y z
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1 -0.631 0.971 0.206
# 2 1 2 -0.301 -1.49 0.189
# 3 1 3 1.49 1.17 0.133
# 4 1 4 0 0 0
# 5 2 5 1.39 -0.339 0.934
# 6 2 6 2.98 0.511 0.319
# 7 2 7 1.73 -0.297 0.0745
# 8 2 8 0 0 0
# 9 3 9 -1.05 -0.778 -2.86
#10 3 10 -0.805 -1.84 -2.38
#11 3 11 0.864 0.684 -3.43
#12 3 12 0 0 0
#13 4 13 -1.39 -0.843 1.67
#14 4 14 -1.68 1.55 -0.656
#15 4 15 -2.34 0.722 0.0638
#16 4 16 0 0 0
This can also be written as:
X %>%
  group_by(treat) %>%
  mutate_at(vars(x:z), list(~ . - .[4]))
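With dplyr 1.0 or later, where funs() is deprecated, the idiomatic spelling of the same operation uses across():

X %>%
  group_by(treat) %>%
  mutate(across(x:z, ~ . - .[4])) %>%
  ungroup()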
data
set.seed(123)
X <- data_frame(
  treat = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
  id = 1:16,
  x = rnorm(16),
  y = rnorm(16),
  z = rnorm(16)
)

Replace ID variable values with counts of value occurrences

I have a data frame like:
DATE x y ID
06/10/2003 7.21 0.651 1
12/10/2003 5.99 0.428 1
18/10/2003 4.68 1.04 1
24/10/2003 3.47 0.363 1
30/10/2003 2.42 0.507 1
02/05/2010 2.72 0.47 2
05/05/2010 2.6 0.646 2
08/05/2010 2.67 0.205 2
11/05/2010 3.57 0.524 2
12/05/2010 0.428 4.68 3
13/05/2010 1.04 3.47 3
14/05/2010 0.363 2.42 3
18/10/2003 0.507 2.52 3
24/10/2003 0.418 4.68 3
30/10/2003 0.47 3.47 3
29/04/2010 0.646 2.42 4
18/10/2003 3.47 2.52 4
I have the count of rows per group for column ID as an integer vector: 5 4 6 2.
Is there a way to replace the group values in column ID with this integer vector (5 4 6 2)?
The output I am expecting is:
DATE x y ID
06/10/2003 7.21 0.651 5
12/10/2003 5.99 0.428 5
18/10/2003 4.68 1.04 5
24/10/2003 3.47 0.363 5
30/10/2003 2.42 0.507 5
02/05/2010 2.72 0.47 4
05/05/2010 2.6 0.646 4
08/05/2010 2.67 0.205 4
11/05/2010 3.57 0.524 4
12/05/2010 0.428 4.68 6
13/05/2010 1.04 3.47 6
14/05/2010 0.363 2.42 6
18/10/2003 0.507 2.52 6
24/10/2003 0.418 4.68 6
30/10/2003 0.47 3.47 6
29/04/2010 0.646 2.42 2
18/10/2003 3.47 2.52 2
I am quite new to R and tried to find a suitable replace function, but I am having a hard time. Any help is much appreciated.
The above data is just an example to illustrate my requirement.
A compact solution with the data.table package:
library(data.table)
setDT(mydf)[, ID := .N, by = ID][]
which gives:
> mydf
DATE x y ID
1: 06/10/2003 7.210 0.651 5
2: 12/10/2003 5.990 0.428 5
3: 18/10/2003 4.680 1.040 5
4: 24/10/2003 3.470 0.363 5
5: 30/10/2003 2.420 0.507 5
6: 02/05/2010 2.720 0.470 4
7: 05/05/2010 2.600 0.646 4
8: 08/05/2010 2.670 0.205 4
9: 11/05/2010 3.570 0.524 4
10: 12/05/2010 0.428 4.680 6
11: 13/05/2010 1.040 3.470 6
12: 14/05/2010 0.363 2.420 6
13: 18/10/2003 0.507 2.520 6
14: 24/10/2003 0.418 4.680 6
15: 30/10/2003 0.470 3.470 6
16: 29/04/2010 0.646 2.420 2
17: 18/10/2003 3.470 2.520 2
What this does:
setDT(mydf) converts the data frame to a data.table
by = ID groups by ID
ID := .N replaces the original value of ID with the count by group
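If you would rather keep the original ID and store the count in a new column instead, a one-line variant (ID2 is just an illustrative name):

setDT(mydf)[, ID2 := .N, by = ID]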
You can use the ave() function to calculate how many rows each ID takes up. In the example below I created a new variable ID2, but you could replace the original ID if you want.
(I included code to create your data in R below, but when you ask questions in the future please include your data in the question by using the dput() function on the data object. That's what I did to make the code below.)
mydata <- structure(list(
  DATE = c("06/10/2003", "12/10/2003", "18/10/2003", "24/10/2003", "30/10/2003",
           "02/05/2010", "05/05/2010", "08/05/2010", "11/05/2010", "12/05/2010",
           "13/05/2010", "14/05/2010", "18/10/2003", "24/10/2003", "30/10/2003",
           "29/04/2010", "18/10/2003"),
  x = c(7.21, 5.99, 4.68, 3.47, 2.42, 2.72, 2.6, 2.67, 3.57, 0.428, 1.04, 0.363,
        0.507, 0.418, 0.47, 0.646, 3.47),
  y = c(0.651, 0.428, 1.04, 0.363, 0.507, 0.47, 0.646, 0.205, 0.524, 4.68, 3.47,
        2.42, 2.52, 4.68, 3.47, 2.42, 2.52),
  ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4)),
  .Names = c("DATE", "x", "y", "ID"),
  class = "data.frame",
  row.names = c(NA, -17L))
# ave() takes an input object, an object of group IDs of the same length
# as the input object, and a function to apply to the input object split across groups
mydata$ID2 <- ave(mydata$ID, mydata$ID, FUN = length)
mydata
DATE x y ID ID2
1 06/10/2003 7.210 0.651 1 5
2 12/10/2003 5.990 0.428 1 5
3 18/10/2003 4.680 1.040 1 5
4 24/10/2003 3.470 0.363 1 5
5 30/10/2003 2.420 0.507 1 5
6 02/05/2010 2.720 0.470 2 4
7 05/05/2010 2.600 0.646 2 4
8 08/05/2010 2.670 0.205 2 4
9 11/05/2010 3.570 0.524 2 4
10 12/05/2010 0.428 4.680 3 6
11 13/05/2010 1.040 3.470 3 6
12 14/05/2010 0.363 2.420 3 6
13 18/10/2003 0.507 2.520 3 6
14 24/10/2003 0.418 4.680 3 6
15 30/10/2003 0.470 3.470 3 6
16 29/04/2010 0.646 2.420 4 2
17 18/10/2003 3.470 2.520 4 2
# if you want to replace the original ID variable, you can assign to it
# instead of adding a new variable
mydata$ID <- ave(mydata$ID, mydata$ID, FUN = length)
A solution with dplyr:
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(ID2 = n()) %>%
  ungroup() %>%
  mutate(ID = ID2) %>%
  select(-ID2)
Edit:
I've just found a solution that's a bit cleaner than the above:
df %>%
  group_by(ID2 = ID) %>%
  mutate(ID = n()) %>%
  select(-ID2)
Result:
# A tibble: 17 x 4
DATE x y ID
<fctr> <dbl> <dbl> <int>
1 06/10/2003 7.210 0.651 5
2 12/10/2003 5.990 0.428 5
3 18/10/2003 4.680 1.040 5
4 24/10/2003 3.470 0.363 5
5 30/10/2003 2.420 0.507 5
6 02/05/2010 2.720 0.470 4
7 05/05/2010 2.600 0.646 4
8 08/05/2010 2.670 0.205 4
9 11/05/2010 3.570 0.524 4
10 12/05/2010 0.428 4.680 6
11 13/05/2010 1.040 3.470 6
12 14/05/2010 0.363 2.420 6
13 18/10/2003 0.507 2.520 6
14 24/10/2003 0.418 4.680 6
15 30/10/2003 0.470 3.470 6
16 29/04/2010 0.646 2.420 2
17 18/10/2003 3.470 2.520 2
Notes:
The reason behind ungroup() %>% mutate(ID = ID2) %>% select(-ID2) is that dplyr doesn't allow mutating grouping variables. So this would not work:
df %>%
  group_by(ID) %>%
  mutate(ID = n())
Error in mutate_impl(.data, dots) : Column ID can't be modified
because it's a grouping variable
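(As far as I know, dplyr 1.0 and later lifted this restriction, so on a current installation the direct form works:

df %>%
  group_by(ID) %>%
  mutate(ID = n()) %>%
  ungroup()

The workarounds above are still needed on older versions.)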
If you don't care about replacing the original ID column, you can just do:
df %>%
  group_by(ID) %>%
  mutate(ID2 = n())
Alternative Result:
# A tibble: 17 x 5
# Groups: ID [4]
DATE x y ID ID2
<fctr> <dbl> <dbl> <int> <int>
1 06/10/2003 7.210 0.651 1 5
2 12/10/2003 5.990 0.428 1 5
3 18/10/2003 4.680 1.040 1 5
4 24/10/2003 3.470 0.363 1 5
5 30/10/2003 2.420 0.507 1 5
6 02/05/2010 2.720 0.470 2 4
7 05/05/2010 2.600 0.646 2 4
8 08/05/2010 2.670 0.205 2 4
9 11/05/2010 3.570 0.524 2 4
10 12/05/2010 0.428 4.680 3 6
11 13/05/2010 1.040 3.470 3 6
12 14/05/2010 0.363 2.420 3 6
13 18/10/2003 0.507 2.520 3 6
14 24/10/2003 0.418 4.680 3 6
15 30/10/2003 0.470 3.470 3 6
16 29/04/2010 0.646 2.420 4 2
17 18/10/2003 3.470 2.520 4 2
