How to count subsequent instances of a level in each group in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 1 year ago.
I have the following dataframe:
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  name = c("J", "Z", "X", "A", "J", "B", "R", "J", "X")
)
I would like to group_by(id), then create a counter column which increments for each subsequent instance/level of name. The desired output would look like this...
id name count
1 J 1
1 Z 1
1 X 1
2 A 1
2 J 2
2 B 1
3 R 1
3 J 3
3 X 2
I assume it would be something that starts like this...
library(tidyverse)
df %>%
  group_by(id) %>%
But I'm not sure how I would implement that kind of counter...
Any help much appreciated.

Actually, you have to group by name, since you are looking to count it independently of the id:
library(dplyr)
df %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(count = dplyr::row_number()) %>%
  dplyr::ungroup()
# A tibble: 9 x 3
id name count
<dbl> <chr> <int>
1 1 J 1
2 1 Z 1
3 1 X 1
4 2 A 1
5 2 J 2
6 2 B 1
7 3 R 1
8 3 J 3
9 3 X 2
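For reference, the same running counter can be built in base R with ave (a sketch, not part of the original answer):

# Base R sketch: seq_along within each name yields the occurrence number
df$count <- ave(seq_len(nrow(df)), df$name, FUN = seq_along)
df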

Related

Filter groups based on the difference between the two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and their second-highest value. The difference should be less than or equal to 2 (<= 2). This means that group B should be removed, because its highest value is 8 and its second-highest value is 3, a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
  group_by(group) %>%
  filter(abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2) %>%
  ungroup()
# A tibble: 8 × 2
group value
<chr> <int>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
  abs(diff(sort(x, decreasing = TRUE)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
  list(df[df$group == g & v, ]), grp$group, grp$value), make.row.names = FALSE))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
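A more direct base R route (a sketch, not from the original answers) computes the top-two gap per group with tapply and subsets on it:

# TRUE for groups whose two largest values differ by at most 2
keep <- tapply(df$value, df$group,
               function(x) abs(diff(sort(x, decreasing = TRUE)[1:2])) <= 2)
df[df$group %in% names(which(keep)), ]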
One possibility would be to first create a vector with the groups that meet your condition, and then filter the original data.frame. Here is how I approached it:
library(dplyr)
group_to_keep <- df %>%
  group_by(group) %>%
  slice_max(value, n = 2) %>%
  filter(abs(diff(value)) <= 2) %>%
  pull(group) %>%
  unique()

df %>%
  filter(group %in% group_to_keep)
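With the example data, group_to_keep comes out as c("A", "C"): A's two highest values are 5 and 5 (difference 0), C's are 10 and 8 (difference 2), while B's 8 and 3 differ by 5, so B is dropped.

group_to_keep
# [1] "A" "C"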
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
If you can be sure that every group always has at least two values, you can use:
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]
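A data.table version of the same idea (a sketch, assuming the data.table package; not from the original answers): keep all of a group's rows when its top-two gap is at most 2, and none otherwise.

library(data.table)
setDT(df)  # convert to a data.table in place
# A length-1 logical inside .SD[...] keeps either all rows of the group or none
df[, .SD[abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2], by = group]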

Removing rows based on column conditions

Suppose we have a data frame:
Event <- c("A", "A", "A", "B", "B", "C", "C", "C")
Model <- c(1, 2, 3, 1, 2, 1, 2, 3)
df <- data.frame(Event, Model)
Which looks like this:
Event Model
A     1
A     2
A     3
B     1
B     2
C     1
C     2
C     3
We can see that event B only has 2 models of data. As the actual data frame I am using has thousands of rows and 17 columns, how can I remove all events that do not have 3 models? My guess is to use a subset, however I am not sure how to do that when there is more than one condition.
I tried the suggested code from YH Jang below:
df %>%
  group_by(Event) %>%
  filter(max(Model) == 3)
However, this would incorrectly keep entries in the data that looked like this:
Event Model
A     1
A     3
Example:
# A tibble: 5 × 2
# Groups:   Event [2]
  Event Model
  <chr> <dbl>
1 A         1
2 A         3
3 C         1
4 C         2
5 C         3
Using dplyr,
df %>%
  group_by(Event) %>%
  filter(max(Model) == 3)
the result would be
# A tibble: 6 × 2
# Groups: Event [2]
Event Model
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 C 1
5 C 2
6 C 3
or using data.table,
library(data.table)
setDT(df)
df[df[, .I[max(Model) == 3], by = Event]$V1]
the result is the same as below:
Event Model
1: A 1
2: A 2
3: A 3
4: C 1
5: C 2
6: C 3
EDIT
I misunderstood the question.
Here's the edited answer.
# with dplyr
df %>%
  group_by(Event) %>%
  filter(length(Model) >= 3)
or
# with data.table
df[df[, .I[length(Model) >= 3], by = Event]$V1]
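Since length(Model) here is just the group size, data.table's built-in .N expresses the same condition more idiomatically (a sketch):

# .N is the number of rows in the current group
df[df[, .I[.N >= 3], by = Event]$V1]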
Try this:
library(dplyr)
df %>%
  group_by(Event) %>%
  filter(length(Model) >= 3)
or, more concisely:
df %>%
  group_by(Event) %>%
  filter(n() >= 3)
This removes rows belonging to events that have fewer than three Model entries.
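For completeness, a base R equivalent (a sketch, not from the original answers) keeps only events that occur at least three times:

# table() counts rows per Event; keep events whose count is >= 3
df[df$Event %in% names(which(table(df$Event) >= 3)), ]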

Estimating the percentage of common set members over time in a panel

I have a time-series panel dataset that is structured in the following way: There are 2 funds that each own different stocks at each time period.
df <- data.frame(
  fund_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  time_Q = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 1, 1, 2, 2),
  stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")
)
> df
fund_id time_Q stock_id
1 1 1 A
2 1 1 B
3 1 1 C
4 1 2 A
5 1 2 C
6 1 2 D
7 1 2 E
8 1 3 D
9 1 3 E
10 2 1 A
11 2 1 B
12 2 2 B
13 2 2 C
For each fund, I would like to calculate the percentage of stocks held in the current time_Q that were also held one and two quarters earlier. So, for every fund and every time_Q, I would like two columns, past_1Q and past_2Q, which show what percentage of stocks held at that time were also present in each of those past quarters.
Here is what the result should look like:
result <- data.frame(
  fund_id = c(1, 1, 1, 2, 2),
  time_Q = c(1, 2, 3, 1, 2),
  past_1Q = c(NA, 0.5, 1, NA, 0.5),
  past_2Q = c(NA, NA, 0, NA, NA)
)
> result
fund_id time_Q past_1Q past_2Q
1 1 1 NA NA
2 1 2 0.5 NA
3 1 3 1 0
4 2 1 NA NA
5 2 2 0.5 NA
I'm currently thinking about using either the setdiff or intersect function, but I'm not sure how to apply them to the panel structure. I'm looking for a scalable dplyr or data.table solution that can cover multiple funds, stocks and time periods, and can also look at common elements in up to 12 lagged time periods. I would appreciate any help, as I've been stuck on this problem for quite a while.
We can use dplyr and purrr to programmatically build up lagged ownership variables and then summarize() across all of them using across(). First, we just need a dummy variable for ownership, and to group our data by fund and stock.
library(dplyr)
library(purrr)
df_grouped <- df %>%
  mutate(owned = TRUE) %>%
  group_by(fund_id, stock_id)
Then we can generate lagged ownership for each stock based on time_Q, join all of them together, and, for each fund and time_Q, calculate the proportion of ownership.
map(
  1:2,
  ~ df_grouped %>%
    mutate(
      "past_{.x}Q" := lag(owned, n = .x, order_by = time_Q)
    )
) %>%
  reduce(left_join, by = c("fund_id", "stock_id", "time_Q", "owned")) %>%
  group_by(fund_id, time_Q) %>%
  summarize(
    across(
      starts_with("past"),
      ~ if (all(is.na(.x))) NA else sum(.x, na.rm = TRUE) / n()
    )
  )
#> # A tibble: 5 × 4
#> fund_id time_Q past_1Q past_2Q
#> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1 NA NA
#> 2 1 2 0.5 NA
#> 3 1 3 1 NA
#> 4 2 1 NA NA
#> 5 2 2 0.5 NA
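Because the question asks for up to 12 lagged periods, the same pipeline scales by widening the map() range; everything after the join stays unchanged (a sketch, reusing df_grouped from above):

# Build past_1Q .. past_12Q in one pass
map(
  1:12,
  ~ df_grouped %>%
    mutate("past_{.x}Q" := lag(owned, n = .x, order_by = time_Q))
) %>%
  reduce(left_join, by = c("fund_id", "stock_id", "time_Q", "owned"))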
Here's a dplyr-only solution:
library(dplyr)
df %>%
  group_by(fund_id, time_Q) %>%
  summarise(new = list(stock_id)) %>%
  mutate(past_1Q = lag(new, 1),
         past_2Q = lag(new, 2)) %>%
  rowwise() %>%
  transmute(time_Q,
            across(past_1Q:past_2Q, ~ length(intersect(new, .x)) / length(new)))
Output:
fund_id time_Q past_1Q past_2Q
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 2 0.5 0
3 1 3 1 0
4 2 1 0 0
5 2 2 0.5 0

Arrange a tibble based on 2 columns in R?

A similar question was asked here... however, I can't get it to work in my case and I'm not sure why.
I am trying to arrange a tibble based on 2 columns. For example, in my data, I am trying to arrange by the value and count columns. To begin, I show a working example:
library(dplyr)
dat <- tibble(
  value = c("B", "D", "D", "E", "A", "A", "B", "C", "B", "E"),
  ids = 1:10,
  count = c(3, 2, 1, 2, 2, 1, 2, 1, 1, 1)
)
dat %>%
  group_by(value) %>%
  mutate(valrank = min(ids)) %>%
  ungroup() %>%
  arrange(valrank, value, desc(count))
Looking at the output:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 B 1 3 1
2 B 7 2 1
3 B 9 1 1
4 D 2 2 2
5 D 3 1 2
6 E 4 2 4
7 E 10 1 4
8 A 5 2 5
9 A 6 1 5
10 C 8 1 8
We can see that the code worked: the tibble is arranged by the value column, and the order is based on how many times each element appears in the tibble (i.e., the count).
However, when I try the following example, the same code doesn't work:
dat_1 <- tibble(
  value = c("x2....", "x5....", "x5....", "x3....", "x3....", "x4....", "x3....", "x3....", "x4....", "x2...."),
  ids = 1:10,
  count = c(2, 2, 1, 4, 3, 2, 2, 1, 1, 1)
)
dat_1 %>%
  group_by(value) %>%
  mutate(valrank = min(ids)) %>%
  ungroup() %>%
  arrange(valrank, value, desc(count))
Looking at this output, we get:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 x2.... 1 2 1
2 x2.... 10 1 1
3 x5.... 2 2 2
4 x5.... 3 1 2
5 x3.... 4 4 4
6 x3.... 5 3 4
7 x3.... 7 2 4
8 x3.... 8 1 4
9 x4.... 6 2 6
10 x4.... 9 1 6
So we can see this has failed to reorder the tibble based on the count. In the second example, x3 appears the most (i.e., has the highest count), so it should appear at the top of the tibble.
I'm not sure what I'm doing wrong here!
UPDATE:
I think I may have solved this problem with:
dat_1 %>%
  group_by(value) %>%
  mutate(valrank = max(count)) %>%
  ungroup() %>%
  arrange(-valrank, value, -count)
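An alternative that avoids computing the helper column by hand (a sketch, not from the original post): because count enumerates occurrences within each value, the per-value group size from add_count() equals max(count) here and can serve as the ranking variable.

dat_1 %>%
  add_count(value, name = "valrank") %>%  # rows per value group
  arrange(desc(valrank), value, desc(count))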

Group by and count NAs as zeros [duplicate]

This question already has answers here:
Count number of non-NA values by group
(3 answers)
Count non-NA values by group [duplicate]
(3 answers)
Closed 1 year ago.
I'm trying to count values in a group_by when one column of the data frame contains NA. I have data like this:
> df <- data.frame(id = c(1, 2, 3, NA, 4, NA),
                   group = c("A", "A", "B", "C", "D", "E"))
> df
id group
1 1 A
2 2 A
3 3 B
4 NA C
5 4 D
6 NA E
I want groups whose only id values are NA to be counted as 0, but with an approach like this
> df %>% group_by(group) %>% summarise(n = n())
# A tibble: 5 x 2
group n
* <chr> <int>
1 A 2
2 B 1
3 C 1
4 D 1
5 E 1
I get 1 in rows C and E, not the 0 that I want.
The expected result looks like this:
# A tibble: 5 x 2
group n
* <chr> <int>
1 A 2
2 B 1
3 C 0
4 D 1
5 E 0
How can I do this?
We can take the sum of a logical vector created with !is.na: TRUE becomes 1 and FALSE becomes 0, so the sum returns the count of non-NA elements.
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(n = sum(!is.na(id)))
# A tibble: 5 x 2
# group n
# * <chr> <int>
#1 A 2
#2 B 1
#3 C 0
#4 D 1
#5 E 0
Or use length after subsetting:
df %>%
  group_by(group) %>%
  summarise(n = length(id[!is.na(id)]))
Note that n() returns the total number of rows, including those with missing values.
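The same zero-aware count works in base R (a sketch): pass the logical vector to aggregate directly, since the formula interface would silently drop the NA rows.

# Sum of !is.na per group; groups whose ids are all NA get 0
aggregate(data.frame(n = !is.na(df$id)), by = list(group = df$group), FUN = sum)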
