Arrange a tibble based on 2 columns in R?

A similar question was asked here... however, I can't get it to work in my case and I'm not sure why.
I am trying to arrange a tibble based on 2 columns. For example, in my data, I am trying to arrange by the value and count columns. To begin, I show a working example:
library(dplyr)
dat <- tibble(
  value = c("B", "D", "D", "E", "A", "A", "B", "C", "B", "E"),
  ids = c(1:10),
  count = c(3, 2, 1, 2, 2, 1, 2, 1, 1, 1)
)

dat %>%
  group_by(value) %>%
  mutate(valrank = min(ids)) %>%
  ungroup() %>%
  arrange(valrank, value, desc(count))
Looking at the output:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 B 1 3 1
2 B 7 2 1
3 B 9 1 1
4 D 2 2 2
5 D 3 1 2
6 E 4 2 4
7 E 10 1 4
8 A 5 2 5
9 A 6 1 5
10 C 8 1 8
We can see that the code worked... the tibble is arranged by the value column, and the order is based on how many times each element appears in the tibble (i.e., the count).
However, when I try the following example, the same code doesn't work:
dat_1 <- tibble(
  value = c("x2....", "x5....", "x5....", "x3....", "x3....", "x4....", "x3....", "x3....", "x4....", "x2...."),
  ids = c(1:10),
  count = c(2, 2, 1, 4, 3, 2, 2, 1, 1, 1)
)

dat_1 %>%
  group_by(value) %>%
  mutate(valrank = min(ids)) %>%
  ungroup() %>%
  arrange(valrank, value, desc(count))
Looking at this output, we get:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 x2.... 1 2 1
2 x2.... 10 1 1
3 x5.... 2 2 2
4 x5.... 3 1 2
5 x3.... 4 4 4
6 x3.... 5 3 4
7 x3.... 7 2 4
8 x3.... 8 1 4
9 x4.... 6 2 6
10 x4.... 9 1 6
So we can see that this has failed to reorder the tibble based on the count: in the second example, x3.... appears the most (i.e., has the highest count), so it should appear at the top of the tibble.
I'm not sure what I'm doing wrong here!
UPDATE:
I think I may have solved this problem with:
dat_1 %>%
  group_by(value) %>%
  mutate(valrank = max(count)) %>%
  ungroup() %>%
  arrange(-valrank, value, -count)
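This works because valrank = max(count) ranks each group by its frequency, whereas the original min(ids) ranked groups by first appearance, which only coincidentally matched the frequency order in the first example. As a minimal alternative sketch (assuming, as in both examples here, that count counts occurrences of each value, so the per-group maximum equals the group size), dplyr's add_count() builds the ranking column in one step:

library(dplyr)

# add_count(value) appends the number of rows per value; arranging by it
# in descending order puts the most frequent values first.
dat_1 %>%
  add_count(value, name = "valrank") %>%
  arrange(desc(valrank), value, desc(count))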

Related

Filter groups based on the difference between the two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second-highest value. The difference should be smaller than or equal to 2 (<= 2); this means that group B should be removed, because its highest value is 8 and its second-highest value is 3, a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df <- structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
  "C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
  1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
  group_by(group) %>%
  filter(abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2) %>%
  ungroup()
# A tibble: 8 × 2
group value
<chr> <int>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
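One caveat worth noting (an observation, not part of the original answer): for a group with a single row, sort(value, decreasing = TRUE)[1:2] returns c(x, NA), so the filter condition evaluates to NA and filter() silently drops the group. If single-member groups should be kept, add an explicit escape such as n() < 2 | abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2.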
A base R alternative:

# For each group, test whether the gap between the two largest values is <= 2
grp <- na.omit(aggregate(. ~ group, df, function(x)
  abs(diff(sort(x, decreasing = TRUE)[1:2])) <= 2))
# Keep the rows of every group whose test returned TRUE
do.call(rbind, c(mapply(function(g, v)
  list(df[df$group == g & v, ]), grp$group, grp$value), make.row.names = FALSE))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
One possibility would be to first create a vector with the groups that meet your condition and then filter the original data.frame. Here is how I approached it:
library(dplyr)

group_to_keep <-
  df %>%
  group_by(group) %>%
  slice_max(value, n = 2) %>%
  filter(abs(diff(value)) <= 2) %>%
  pull(group) %>%
  unique()

df %>%
  filter(group %in% group_to_keep)
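A defensive tweak worth considering (my addition, not part of the original answer): slice_max() keeps ties by default, so a group with tied values at the cutoff can return more than two rows, and diff(value) then yields more than one value inside filter(). Passing with_ties = FALSE guarantees exactly two rows per group:

library(dplyr)

# with_ties = FALSE caps each group at exactly two rows, so
# abs(diff(value)) is always a single comparison per group.
group_to_keep <-
  df %>%
  group_by(group) %>%
  slice_max(value, n = 2, with_ties = FALSE) %>%
  filter(abs(diff(value)) <= 2) %>%
  pull(group) %>%
  unique()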
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
In case you are sure that every group always has at least two values, you can use:
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]

Removing rows based on column conditions

Suppose we have a data frame:
Event <- c("A", "A", "A", "B", "B", "C", "C", "C")
Model <- c(1, 2, 3, 1, 2, 1, 2, 3)
df <- data.frame(Event, Model)
Which looks like this:
Event Model
A     1
A     2
A     3
B     1
B     2
C     1
C     2
C     3
We can see that event B only has 2 models of data. As the actual data frame I am using has thousands of rows and 17 columns, how can I remove all events that do not have 3 models? My guess is to use subset(); however, I am not sure how to do it when there is more than one condition.
I tried the suggested code from YH Jang below:
df %>% group_by(Event) %>%
  filter(max(Model) == 3)
However, this would fail on entries in the data that looked like this:
Event Model
A     1
A     3
example:
# A tibble: 5 × 2
# Groups: Event [2]
Event Model
<chr> <dbl>
1 A 1
2 A 3
3 C 1
4 C 2
5 C 3
Using dplyr,
df %>% group_by(Event) %>%
  filter(max(Model) == 3)
the result would be
# A tibble: 6 × 2
# Groups: Event [2]
Event Model
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 C 1
5 C 2
6 C 3
or using data.table,
library(data.table)
setDT(df)
df[df[, .I[max(Model) == 3], by = Event]$V1]
the result is the same as below.
Event Model
1: A 1
2: A 2
3: A 3
4: C 1
5: C 2
6: C 3
EDIT
I misunderstood the question.
Here's the edited answer.
# with dplyr
df %>% group_by(Event) %>%
  filter(length(Model) >= 3)
or
# with data.table
df[df[, .I[length(Model) >= 3], by = Event]$V1]
Try this:
library(dplyr)

df %>% group_by(Event) %>%
  filter(length(Model) >= 3)
or, more concisely:
df %>% group_by(Event) %>%
  filter(n() >= 3)
This removes the rows of any event that has fewer than three Model entries.
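For completeness, a base R sketch of the same idea (my addition, assuming "3 models" means at least three rows per Event): ave() computes each row's group size, which can index the data frame directly.

# Keep only rows whose Event occurs at least three times.
df[ave(df$Model, df$Event, FUN = length) >= 3, ]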

Estimating the percentage of common set members over time in a panel

I have a time-series panel dataset that is structured in the following way: There are 2 funds that each own different stocks at each time period.
df <- data.frame(
  fund_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  time_Q = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 1, 1, 2, 2),
  stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")
)
> df
fund_id time_Q stock_id
1 1 1 A
2 1 1 B
3 1 1 C
4 1 2 A
5 1 2 C
6 1 2 D
7 1 2 E
8 1 3 D
9 1 3 E
10 2 1 A
11 2 1 B
12 2 2 B
13 2 2 C
For each fund, I would like to calculate the percentage of stocks held in the current time_Q that were also held in the previous one or two quarters. So basically, for every fund and every time_Q, I would like two columns, past_1Q and past_2Q, which show what percentage of the stocks held at that time were also present in each of those past time_Qs.
Here is what the result should look like:
result <- data.frame(
  fund_id = c(1, 1, 1, 2, 2),
  time_Q = c(1, 2, 3, 1, 2),
  past_1Q = c(NA, 0.5, 1, NA, 0.5),
  past_2Q = c(NA, NA, 0, NA, NA)
)
> result
fund_id time_Q past_1Q past_2Q
1 1 1 NA NA
2 1 2 0.5 NA
3 1 3 1 0
4 2 1 NA NA
5 2 2 0.5 NA
I'm currently thinking about using either the setdiff or intersect function, but I'm not sure how to apply it to the panel dataset. I'm looking for a scalable dplyr or data.table solution that can cover multiple funds, stocks, and time periods, and also look at common elements in up to 12 lagged time periods. I would appreciate any help, as I've been stuck on this problem for quite a while.
We can use dplyr and purrr to programmatically build up a lagged ownership variable and then summarize() across all of them using across(). First, we just need a dummy variable for ownership and group our data by fund and stock.
library(dplyr)
library(purrr)
df_grouped <- df %>%
  mutate(owned = TRUE) %>%
  group_by(fund_id, stock_id)
Then we can generate lagged ownership for each stock, based on time_Q, join all of them together, and for each fund and time_Q, calculate proportion of ownership.
map(
  1:2,
  ~ df_grouped %>%
    mutate(
      "past_{.x}Q" := lag(owned, n = .x, order_by = time_Q)
    )
) %>%
  reduce(left_join, by = c("fund_id", "stock_id", "time_Q", "owned")) %>%
  group_by(fund_id, time_Q) %>%
  summarize(
    across(
      starts_with("past"),
      ~ if (all(is.na(.x))) NA else sum(.x, na.rm = TRUE) / n()
    )
  )
#> # A tibble: 5 × 4
#> fund_id time_Q past_1Q past_2Q
#> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1 NA NA
#> 2 1 2 0.5 NA
#> 3 1 3 1 NA
#> 4 2 1 NA NA
#> 5 2 2 0.5 NA
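Since the question asks about up to 12 lagged periods: this approach scales by changing map(1:2, ...) to map(1:12, ...), which builds past_1Q through past_12Q in exactly the same way (assuming the panel contains enough quarters of history per fund for the longer lags to be meaningful).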
Here's a dplyr-only solution:
library(dplyr)
df %>%
  group_by(fund_id, time_Q) %>%
  summarise(new = list(stock_id)) %>%
  mutate(past_1Q = lag(new, 1),
         past_2Q = lag(new, 2)) %>%
  rowwise() %>%
  transmute(time_Q,
            across(past_1Q:past_2Q, ~ length(intersect(new, .x)) / length(new)))
output
fund_id time_Q past_1Q past_2Q
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 2 0.5 0
3 1 3 1 0
4 2 1 0 0
5 2 2 0.5 0

How to count subsequent instance of level in each group R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 1 year ago.
I have the following dataframe:
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  name = c("J", "Z", "X", "A", "J", "B", "R", "J", "X")
)
I would like to group_by(id), then create a counter column which increases the value for each subsequent instance/level of name. The desired output would look like this:
id name count
1 J 1
1 Z 1
1 X 1
2 A 1
2 J 2
2 B 1
3 R 1
3 J 3
3 X 2
I assume it would be something that starts like this...
library(tidyverse)
df %>%
group_by(id) %>%
But I'm not sure how I would implement that kind of counter...
Any help much appreciated.
Actually, you have to group by name, since you are looking to count occurrences of name independently of the id:
library(dplyr)
df %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(count = dplyr::row_number()) %>%
  dplyr::ungroup()
# A tibble: 9 x 3
id name count
<dbl> <chr> <int>
1 1 J 1
2 1 Z 1
3 1 X 1
4 2 A 1
5 2 J 2
6 2 B 1
7 3 R 1
8 3 J 3
9 3 X 2
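A base R equivalent sketch (my addition): ave() with seq_along() numbers each occurrence of a name in its order of appearance.

# For each name, number its rows 1, 2, 3, ... in order of appearance.
df$count <- ave(seq_len(nrow(df)), df$name, FUN = seq_along)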

substitute value in dataframe based on conditional

I have the following data set
library(dplyr)
df <- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
                 c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
                 c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df) <- c("name", "year", "val")
We summarize this by grouping df by name and year and then computing the average and the number of entries:
asd <- df %>%
  group_by(name, year) %>%
  summarize(average = mean(val), ave_number = n())
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now, I would like to substitute all entries of asd$average where asd$ave_number < 2 according to the following lookup table, based on year:
replacer <- data.frame(c(1, 2, 3),
                       c(100, 200, 300))
colnames(replacer) <- c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>% operator, something like this (not working code):
asd %>%
  group_by(name, year) %>%
  summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") # to avoid a duplicated variable name

asd <- left_join(asd, replacer, by = "year") %>%
  mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
  select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
I guess I have to use the %>%-operator
You don't ever have to use the pipe operator. It is there for convenience because you can string (or "pipe") functions one after another, as you would with a train of thought. It's kind of like having a flow in your code.
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins.
replacer <- setNames(c(100, 200, 300), c(1, 2, 3))

asd <- df %>%
  group_by(name, year) %>%
  summarize(average = mean(val),
            ave_number = n()) %>%
  mutate(average = if_else(ave_number < 2, replacer[year], average))
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
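One caveat worth noting (my addition): because year is numeric, replacer[year] indexes the vector by position rather than by name, which works here only because the years happen to be 1, 2, 3 in that order. For arbitrary years, indexing by name is more robust, i.e. replace the final mutate() step with:

# Look up the replacement by name rather than by position; unname() keeps
# if_else() from attaching the lookup's names to the result.
mutate(average = if_else(ave_number < 2,
                         unname(replacer[as.character(year)]),
                         average))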
