I want to create an assign column based on rank, limited per group.
In particular, each group has a priority rank (e.g., 1, 2, 3 or 1, 3, 6 or 3, 4, 5), where a smaller number means higher priority. Based on the rank, I want to allocate the resource given in the limit column. Currently I do this by hand, but I want to express the exercise with the tidyverse. How do I allocate using mutate and group_by (or other methods)?
Using the tidyverse, you can apply top_n after grouping. This filters the top rows by rank, where the number to keep in each group is given by limit; since smaller ranks have priority, desc(rank) is used as the ordering weight. The kept rows are assigned 1 and then joined back to your original data.
Let me know if this provides the desired result.
library(tidyverse)

df %>%
  group_by(group) %>%
  top_n(limit[1], desc(rank)) %>%
  mutate(assign = 1) %>%
  right_join(df) %>%
  replace_na(list(assign = 0)) %>%
  arrange(group, rank)
Output
group rank limit assign
<chr> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 1 0
3 A 3 1 0
4 B 1 1 1
5 B 3 1 0
6 B 6 1 0
7 C 3 2 1
8 C 4 2 1
9 C 5 2 0
10 C 6 2 0
Data
df <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "C"), rank = c(1, 2, 3, 1, 3, 6, 3, 4, 5, 6), limit = c(1,
1, 1, 1, 1, 1, 2, 2, 2, 2)), class = "data.frame", row.names = c(NA,
-10L))
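As an aside, top_n() has been superseded in recent dplyr releases. If that matters to you, the same assignment can be done without the join at all, as a sketch (assumes dplyr >= 1.0; the variable names match the data above):

```r
library(dplyr)

df <- data.frame(group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
                 rank  = c(1, 2, 3, 1, 3, 6, 3, 4, 5, 6),
                 limit = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2))

# row_number(rank) orders rows within each group by ascending rank,
# so a row gets assign = 1 exactly when it falls within the group's limit
result <- df %>%
  group_by(group) %>%
  mutate(assign = as.integer(row_number(rank) <= limit[1])) %>%
  ungroup()
```

Because nothing is filtered out, there is no need to re-join the unassigned rows or call replace_na().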
Related
I have a dataset with two distinct groups (A and B) belonging to 3 different categories (1, 2, 3):
library(tidyverse)
set.seed(100)
df <- tibble(Group = sample(c(1, 2, 3), 20, replace = T),Company = sample(c('A', 'B'), 20, replace = T))
I want to come up with a metric that characterizes group composition across the timespan.
Thus far, I have used an index based on Shannon's Index, which gives a measure of heterogeneity varying between 0 and 1, with 1 being perfectly heterogeneous (equal representation of each group) and 0 being completely homogeneous (only one group represented):
df %>%
  group_by(Group, Company) %>%
  summarise(n = n()) %>%
  mutate(p = n / sum(n)) %>%
  mutate(Shannon = -(p * log2(p) + (1 - p)))
Yielding:
Group Company n p Shannon
<dbl> <chr> <int> <dbl> <dbl>
1 A 2 0.6666667 0.05664167
1 B 1 0.3333333 -0.13834583
2 A 4 0.5000000 0.00000000
2 B 4 0.5000000 0.00000000
3 A 1 0.1111111 -0.53667500
3 B 8 0.8888889 0.03993333
However, I am looking for an index in [-1, +1], where the index yields -1 when only group A is present at a time point, +1 when only group B is present, and 0 for equal representation.
How can I create such an index? I have looked at measures such as Moran's I as inspiration, but they do not seem to suit the need.
A simple solution might be to calculate the mean.
I transformed Company into a numeric value, with A = -1 and B = 1, and calculated the mean by Group.
The result is an index for each Group: -1 when Company contains only "A"s, 1 when it contains only "B"s.
Data
df <- structure(list(Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1,
2, 2, 3, 2, 2, 1, 1, 3), Company = c("A", "A", "A", "A", "B",
"B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "A", "B", "B",
"A", "B")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
Code
df %>%
  mutate(value = ifelse(Company == "A", -1, 1)) %>%
  group_by(Group) %>%
  summarise(index = mean(value))
Output
# A tibble: 3 x 2
Group index
<dbl> <dbl>
1 1 0.333
2 2 -0.429
3 3 0.429
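As a sanity check on the endpoints of this index (using a small made-up tibble, not the data above): a group with only "A" yields -1, only "B" yields +1, and a 50/50 split yields 0.

```r
library(dplyr)

# hypothetical check data: Group 1 is all-A, Group 2 is all-B, Group 3 is split
chk <- tibble(Group = c(1, 1, 2, 2, 3, 3),
              Company = c("A", "A", "B", "B", "A", "B")) %>%
  mutate(value = ifelse(Company == "A", -1, 1)) %>%
  group_by(Group) %>%
  summarise(index = mean(value))
```

This confirms the index hits the requested -1 / +1 / 0 values exactly at the extremes and the midpoint.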
I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
  filter(pop == 3) %>%
  distinct(group) %>%
  pull(group) -> groups

df4 %>%
  filter(group %in% groups)
or if you want to combine the two steps (the inner pipeline needs parentheses, because %in% and %>% have the same operator precedence):
df4 %>%
  filter(group %in% (df4 %>%
                       filter(pop == 3) %>%
                       distinct(group) %>%
                       pull(group)))
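Not mentioned in the answers here, but semi_join() is another standard dplyr way to express "keep rows whose group matches the filtered set" in a single step, as a sketch:

```r
library(dplyr)

df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                  pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
                  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5))

# keep every row of df4 whose group also appears among the pop == 3 rows
threes <- df4 %>%
  semi_join(filter(df4, pop == 3), by = "group")
```

Like the %in% approach, this returns all b and d rows without needing an intermediate vector.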
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do it this way using dplyr's group_by(), filter() and any() combined. any() returns TRUE if the condition matches any row, and group_by() applies the operation to each subgroup of the grouping variable.
Follow these steps:
First pipe the data to group_by() to group by your group variable.
Then pipe to filter() to filter by if any group pop is equal to 3 using any() function.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
  group_by(group) %>%
  filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is subset + ave:
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
   group pop value
6      b   1  2.0
7      b   2  3.0
8      b   3  4.0
9      b   4  3.5
10     b   5  3.0
16     d   1  0.5
17     d   2  1.5
18     d   3  6.0
19     d   4  2.0
20     d   5  1.5
Use dplyr:
df4 %>%
  group_by(group) %>%
  filter(any(pop == 3))
I have a dataset (actually I have more than 100 groups), and I want to use dplyr to create a variable y for each group, with the first value of y filled as 1 and each later value built from the previous row:
second y = 1 * first x + 2 * first y
and so on down the group. The desired result is the dput below.
I tried creating a column y of all 1s and then using
df %>% group_by(group) %>% mutate(var = shift(x) + 2 * shift(y)) %>% ungroup()
(shift() is from data.table), but the formula then always uses the initial y value of 1 instead of the value computed on the previous row:
second y = 1 * first x + 2 * 1
Could someone give me some ideas about this? Thank you!
The dput of my result data is:
structure(list(group = c("a", "a", "a", "a", "a", "b", "b", "b"),
x = c(1, 2, 3, 4, 5, 6, 7, 8), y = c(1, 3, 8, 19, 42, 1, 8, 23)),
row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
To perform such a calculation we can use accumulate from purrr or Reduce in base R.
Since you are already using dplyr, we can use accumulate:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(y1 = purrr::accumulate(x[-n()], ~ .x * 2 + .y, .init = 1))
# group x y y1
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 3 3
#3 a 3 8 8
#4 a 4 19 19
#5 a 5 42 42
#6 b 6 1 1
#7 b 7 8 8
#8 b 8 23 23
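The equivalent with Reduce in base R, mentioned above, can be sketched like this (using ave() in place of group_by()):

```r
df <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b"),
                 x = c(1, 2, 3, 4, 5, 6, 7, 8))

# Reduce(..., accumulate = TRUE) builds the running values: each step takes
# the previous y (prev) and the previous row's x (xi) and returns 2 * prev + xi,
# starting from init = 1, which is exactly the recurrence in the question
recur <- function(x) Reduce(function(prev, xi) 2 * prev + xi,
                            x[-length(x)], init = 1, accumulate = TRUE)

df$y <- ave(df$x, df$group, FUN = recur)
```

ave() applies recur() to each group's x and returns the results in the original row order.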
I am trying to compute a balance column.
So, to show an example, I want to go from this:
df <- data.frame(group = c("A", "A", "A", "A", "A"),
start = c(5, 0, 0, 0, 0),
receipt = c(1, 5, 6, 4, 6),
out = c(4, 5, 3, 2, 5))
> df
group start receipt out
1 A 5 1 4
2 A 0 5 5
3 A 0 6 3
4 A 0 4 2
5 A 0 6 5
to creating a new balance column like the following
> dfb
group start receipt out balance
1 A 5 1 4 2
2 A 0 5 5 2
3 A 0 6 3 5
4 A 0 4 2 7
5 A 0 6 5 8
I tried the following attempt, but it isn't working:
dfc <- df %>%
  group_by(group) %>%
  mutate(balance = if_else(row_number() == 1,
                           start + receipt - out,
                           lag(balance) + receipt - out)) %>%
  ungroup()
Would really appreciate some help with this. Thanks!
You could use base R's cumsum() inside dplyr's mutate(). Note: I had to change your initial df table to match the one in your required result, because the "out" values differed.
df <- data.frame(group = c("A", "A", "A", "A", "A"),
start = c(5, 0, 0, 0, 0),
receipt = c(1, 5, 6, 4, 6),
out = c(4, 5, 3, 2, 5))
dfc <- df %>%
  group_by(group) %>%
  mutate(balance = cumsum(start + receipt - out))
Source: local data frame [5 x 5]
Groups: group [1]
group start receipt out balance
<fctr> <dbl> <dbl> <dbl> <dbl>
1 A 5 1 4 2
2 A 0 5 5 2
3 A 0 6 3 5
4 A 0 4 2 7
5 A 0 6 5 8
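The same idea works in base R without dplyr; as a sketch, ave() can apply cumsum() per group to each row's net flow (start + receipt - out):

```r
df <- data.frame(group = c("A", "A", "A", "A", "A"),
                 start = c(5, 0, 0, 0, 0),
                 receipt = c(1, 5, 6, 4, 6),
                 out = c(4, 5, 3, 2, 5))

# the running balance is the cumulative sum of each row's net flow,
# computed per group with ave()
df$balance <- ave(df$start + df$receipt - df$out, df$group, FUN = cumsum)
```

This works because start is nonzero only on a group's first row, so every row's net flow folds into the cumulative sum the same way.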
I'm trying to add a count marking each unique person (1 on a person's first row, 0 otherwise). Each person has multiple data points.
df <- data.frame(PERSON = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
Y = c(2, 5, 4, 1, 2, 5, 3, 7, 1))
This is what I'd like it to look like:
PERSON Y UNIQ_CT
1 A 2 1
2 A 5 0
3 A 4 0
4 B 1 1
5 B 2 0
6 C 5 1
7 C 3 0
8 C 7 0
9 C 1 0
You can use duplicated and negate it (note the column is PERSON, uppercase, since R is case-sensitive):
transform(df, UNIQ_CT = as.integer(!duplicated(PERSON)))
Since the question has a dplyr tag, here is an option:
library(dplyr)
df %>%
  group_by(PERSON) %>%
  mutate(UNIQ_CT = ifelse(row_number() == 1, 1, 0))
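The two answers can also be combined: inside a single mutate(), !duplicated() marks first occurrences without any grouping at all, as a sketch:

```r
library(dplyr)

df <- data.frame(PERSON = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 Y = c(2, 5, 4, 1, 2, 5, 3, 7, 1))

# duplicated(PERSON) is FALSE on each person's first row and TRUE on repeats,
# so negating and converting to integer gives the desired 1/0 flag
res <- df %>%
  mutate(UNIQ_CT = as.integer(!duplicated(PERSON)))
```

This avoids group_by() entirely, which can matter with many distinct persons.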