I have the following df structure:
category difference factor
a -0.12 1
a -0.12 2
b -0.17 3
b -0.21 4
I want to categorise this data so that each category is identified by a number and the rows are ranked according to decreasing differences. The expected result is something like this:
category difference factor catCount rank
a -0.12 1 2 2
a -0.12 2 2 1
b -0.17 3 1 2
b -0.21 4 1 1
I'm using the following code to achieve this:
df %>%
  group_by(category) %>%
  mutate(categoryNumber = n_distinct(category)) %>%
  mutate(rank = rank(difference, ties.method = 'last'))
but I'm getting the output as:
category difference factor catCount rank
a -0.12 1 2 2
a -0.12 2 2 1
b -0.17 3 2 2
b -0.21 4 2 1
Any suggestions for this?
Use this:
df %>% group_by(category, catcnt = dense_rank(desc(category))) %>%
mutate(rank = rank(difference, ties.method = 'last'))
# A tibble: 4 x 5
# Groups: category [2]
category difference factor catcnt rank
<chr> <dbl> <int> <int> <int>
1 a -0.12 1 2 2
2 a -0.12 2 2 1
3 b -0.17 3 1 2
4 b -0.21 4 1 1
Counting n_distinct(category) within each category group will always give 1. Try this:
library(dplyr)
df %>%
arrange(category, difference) %>%
group_by(category) %>%
mutate(catCount = cur_group_id(),
rank = row_number()) %>%
ungroup()
# category difference factor catCount rank
# <chr> <dbl> <int> <int> <int>
#1 a -0.12 1 1 1
#2 a -0.12 2 1 2
#3 b -0.21 4 2 1
#4 b -0.17 3 2 2
Here catCount is a unique number for each category, whereas rank ranks the rows within each category by difference, as in your expected output.
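For completeness, here is a minimal sketch that reproduces the exact expected columns, combining the dense_rank(desc(category)) idea from the first answer (computed before grouping, so b = 1 and a = 2 as in your expected output) with your ties.method = 'last' ranking:
library(dplyr)
df %>%
  mutate(catCount = dense_rank(desc(category))) %>%  # computed on the ungrouped data
  group_by(category) %>%
  mutate(rank = rank(difference, ties.method = 'last')) %>%
  ungroup()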
If I have a df and want to create a grouped ID, I would do:
df <- data.frame(id= rep(c(1,8,4), each = 3), score = runif(9))
df %>% group_by(id) %>% mutate(ID = cur_group_id())
following @Ronak Shah's answer to "How to create a consecutive group number".
Now I have a list of those dfs and want to assign consecutive group numbers, but they should not restart in every list element. In other words, the ID column in list element one runs from 1 to 10, in list element two from 11 to 15, and so on (so I can't simply run the same code with lapply).
I guess I could do something like:
names(df_list) <- c("a", "b")
df_list <- mapply(cbind, df_list, list = names(df_list), SIMPLIFY = FALSE)
df <- do.call(rbind, df_list)
df <- df %>% group_by(id) %>% mutate(ID = cur_group_id())
split(df, df$list)
but maybe someone has a more direct, clever way?
A dplyr way could be to use bind_rows followed by group_split (experimental):
library(dplyr)
df_list |>
bind_rows(.id = "origin") |>
mutate(ID = consecutive_id(id)) |> # If dplyr v.<1.1.0, use ID = cumsum(!duplicated(id))
group_split(origin, .keep = FALSE)
Output:
[[1]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 1 0.187 1
2 1 0.232 1
3 1 0.317 1
4 8 0.303 2
5 8 0.159 2
6 8 0.0400 2
7 4 0.219 3
8 4 0.811 3
9 4 0.526 3
[[2]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 3 0.915 4
2 3 0.831 4
3 3 0.0458 4
4 5 0.456 5
5 5 0.265 5
6 5 0.305 5
7 2 0.507 6
8 2 0.181 6
9 2 0.760 6
Data:
set.seed(1234)
df1 <- tibble(id = rep(c(1,8,4), each = 3), score = runif(9))
df2 <- tibble(id = rep(c(3,5,2), each = 3), score = runif(9))
df_list <- list(df1, df2)
Or, using cur_group_id() for the group number; this approach, however, gives a different ordering of the IDs than the one you expect in your question:
library(dplyr)
df_list |>
bind_rows(.id = "origin") |>
mutate(ID = cur_group_id(), .by = "id") |> # If dplyr v.<1.1.0, use group_by()-notation
group_split(origin, .keep = FALSE)
Output:
[[1]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 1 0.187 1
2 1 0.232 1
3 1 0.317 1
4 8 0.303 6
5 8 0.159 6
6 8 0.0400 6
7 4 0.219 4
8 4 0.811 4
9 4 0.526 4
[[2]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 3 0.915 3
2 3 0.831 3
3 3 0.0458 3
4 5 0.456 5
5 5 0.265 5
6 5 0.305 5
7 2 0.507 2
8 2 0.181 2
9 2 0.760 2
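A small alternative sketch (my own assumption, not taken from the answers above): match() against unique() numbers the ids in order of first appearance, so it reproduces the consecutive 1, 2, 3, ... numbering without any grouping:
library(dplyr)
df_list |>
  bind_rows(.id = "origin") |>
  mutate(ID = match(id, unique(id))) |>  # group number by order of first appearance
  group_split(origin, .keep = FALSE)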
Given this dataframe:
require(dplyr)
require(ggplot2)
require(forcats)
class <- c(1, 4,1,3, 2, 2,4, 1, 4, 5, 2, 4, 2,2,2)
prog <- c("Bac2", "Bac2","Bac2","Bac", "Master", "Master","Bac", "Bac", "DEA", "Doctorat",
"DEA", "Bac", "DEA","DEA","Bac")
mydata <- data.frame(height = class, prog)
res <- mydata %>%
  group_by(prog, height) %>%
  tally() %>%
  mutate(prop = n/sum(n))
I need to create a new column "new", per value of "prog": if no prop is > 0.5, return 99 in column "new"; if some prop is > 0.5, return the value of "height" that corresponds to the max prop.
desired output:
prog height n prop new
<chr> <dbl> <int> <dbl> <dbl>
1 Bac 1 1 0.2 99
2 Bac 2 1 0.2 99
3 Bac 3 1 0.2 99
4 Bac 4 2 0.4 99
5 Bac2 1 2 0.667 1
6 Bac2 4 1 0.333 1
7 DEA 2 3 0.75 2
8 DEA 4 1 0.25 2
9 Doctorat 5 1 1 5
10 Master 2 2 1 2
Group by prog and use ifelse:
library(dplyr)
res %>%
group_by(prog) %>%
mutate(new = ifelse(any(prop > 0.5), height[prop > 0.5], 99))
Output:
# A tibble: 10 × 5
# Groups: prog [5]
prog height n prop new
<chr> <dbl> <int> <dbl> <dbl>
1 Bac 1 1 0.2 99
2 Bac 2 1 0.2 99
3 Bac 3 1 0.2 99
4 Bac 4 2 0.4 99
5 Bac2 1 2 0.667 1
6 Bac2 4 1 0.333 1
7 DEA 2 3 0.75 2
8 DEA 4 1 0.25 2
9 Doctorat 5 1 1 5
10 Master 2 2 1 2
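A hedged variant, in case more than one prop per prog could ever exceed 0.5: picking the height at the maximum prop explicitly keeps the result a single value per group:
library(dplyr)
res %>%
  group_by(prog) %>%
  mutate(new = if (any(prop > 0.5)) height[which.max(prop)] else 99) %>%  # length-1 result, recycled over the group
  ungroup()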
So let's say I have two data frames
df1 <- data.frame(n = rep(c(0,1,2,3,4), times = 2), nn = c(rep(x = 1, 5), rep(x = 2, 5)),
                  y = rnorm(10), z = rnorm(10))
df2 <- data.frame(x = rnorm(20))
Here is the first df:
> head(df1)
n nn y z
1 0 1 1.5683647 0.48934096
2 1 1 1.2967556 -0.77891030
3 2 1 -0.2375963 1.74355935
4 3 1 -1.2241501 -0.07838729
5 4 1 -0.3278127 -0.97555379
6 0 2 -2.4124503 0.07065982
Here is the second df:
x
1 -0.4884289
2 0.9362939
3 -1.0624084
4 -0.9838209
5 0.4242479
6 -0.4513135
I'd like to subtract the x column values of df2 from the z column values of df1, and return the rows of both data frames for which the subtracted value is approximately equal to the y value of df1.
Is there a way to construct such a function, so that I could specify how close the values need to be to count as equal?
So that it's clear: I'd like to subtract all x values from all z values, then compare the result to the y column value of df1 and check whether there is an approximately matching value to y.
Here's an approach where I match every row of df1 with every row of df2, then subtract x and y from z (as implied by your logic of comparing z - x to y; this is the same as comparing z - x - y to zero). Finally, I look at each row of df1 and keep the match with the lowest absolute difference.
library(dplyr)
left_join(
df1 %>% mutate(dummy = 1, row = row_number()),
df2 %>% mutate(dummy = 1, row = row_number()), by = "dummy") %>%
mutate(diff = z - x - y) %>%
group_by(row.x) %>%
slice_min(abs(diff)) %>%
ungroup()
Result (I used set.seed(42) before generating df1+df2.)
# A tibble: 10 x 9
n nn y z dummy row.x x row.y diff
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
1 0 1 1.37 1.30 1 1 0.0361 20 -0.102
2 1 1 -0.565 2.29 1 2 1.90 5 0.956
3 2 1 0.363 -1.39 1 3 -1.76 8 0.0112
4 3 1 0.633 -0.279 1 4 -0.851 18 -0.0607
5 4 1 0.404 -0.133 1 5 -0.609 14 0.0713
6 0 2 -0.106 0.636 1 6 0.705 12 0.0372
7 1 2 1.51 -0.284 1 7 -1.78 2 -0.0145
8 2 2 -0.0947 -2.66 1 8 -2.41 19 -0.148
9 3 2 2.02 -2.44 1 9 -2.41 19 -2.04
10 4 2 -0.0627 1.32 1 10 1.21 4 0.168
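As a follow-up on the "approximately equal" part: instead of keeping only the single closest match per row of df1, you could keep every pair that falls within a tolerance of your choosing (tol below is a hypothetical name; set whatever precision you need):
library(dplyr)
tol <- 0.1  # hypothetical tolerance
left_join(
  df1 %>% mutate(dummy = 1, row = row_number()),
  df2 %>% mutate(dummy = 1, row = row_number()), by = "dummy") %>%
  mutate(diff = z - x - y) %>%
  filter(abs(diff) <= tol)  # keep all approximate matches, not just the closest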
I have a data frame, for example:
ID category value1 value2
1 A 2 5
1 A 3 6
1 A 5 7
1 B 6 **12**
2 A 1 3
2 A 2 5
2 B 5 **10**
Now I want to add a new column to calculate a percentage. For each ID, each value1 of category A is divided by the value2 of category B. For the value1 of category B, it is divided by its own value2 directly. The expected result looks like:
ID category value1 value2 percentage
1 A 2 5 0.17
1 A 3 6 0.25
1 A 5 7 0.42
1 B 6 **12** 0.50
2 A 1 3 0.10
2 A 2 5 0.20
2 B 5 **10** 0.50
Thank you very much.
Using dplyr you could do something like this, assuming there is only one category B value for each of your IDs as per your example:
library(dplyr)
df1 <- tibble(
ID = c(1,1,1,1,2,2,2),
category =c('A', 'A', 'A','B','A','A','B'),
value1 = c(2,3,5,6,1,2,5),
value2 = c(5,6,7,12,3,5,10)
)
df1 %>%
group_by(ID) %>%
mutate(percentage = value1 / value2[category == 'B'])
# # A tibble: 7 x 5
# # Groups: ID [2]
# ID category value1 value2 percentage
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 A 2 5 0.167
# 2 1 A 3 6 0.25
# 3 1 A 5 7 0.417
# 4 1 B 6 12 0.5
# 5 2 A 1 3 0.1
# 6 2 A 2 5 0.2
# 7 2 B 5 10 0.5
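One hedged note on the design: if an ID could ever contain more than one category-B row (not the case in your example), dividing by the first B value explicitly avoids a length mismatch:
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(percentage = value1 / first(value2[category == 'B'])) %>%  # first B value per ID
  ungroup()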
Suppose I have a tibble tbl_
tbl_ <- tibble(id = c(1,1,2,2,3,3), dta = 1:6)
tbl_
# A tibble: 6 x 2
id dta
<dbl> <int>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
There are 3 id groups. I want to resample entire id groups 3 times with replacement. For example the resulting tibble can be:
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 1 2
5 3 5
6 3 6
but not
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 2 4
5 3 5
6 3 6
or
id dta
<dbl> <int>
1 1 1
2 1 1
3 2 3
4 2 4
5 3 5
6 3 6
Here is one option with sample_n and distinct
library(tidyverse)
distinct(tbl_, id) %>%
sample_n(nrow(.), replace = TRUE) %>%
pull(id) %>%
map_df( ~ tbl_ %>%
filter(id == .x)) %>%
arrange(id)
# A tibble: 6 x 2
# id dta
# <dbl> <int>
#1 1.00 1
#2 1.00 2
#3 1.00 1
#4 1.00 2
#5 3.00 5
#6 3.00 6
An option can be to get the minimum row number for each id. That row number will be used to generate random samples with replace = TRUE.
library(dplyr)
tbl_ %>%
  mutate(rn = row_number()) %>%
  group_by(id) %>%
  summarise(minrow = min(rn)) -> min_row
indx <- rep(sample(min_row$minrow, nrow(min_row), replace = TRUE), each = 2) +
rep(c(0,1), 3)
tbl_[indx,]
# # A tibble: 6 x 2
# id dta
# <dbl> <int>
# 1 1.00 1
# 2 1.00 2
# 3 3.00 5
# 4 3.00 6
# 5 2.00 3
# 6 2.00 4
Note: In the above answer the number of rows for each id has been assumed to be 2, but the answer can handle any number of ids. The hard-coded each = 2 and c(0,1) need to be modified in order to scale it up to handle more than 2 rows per id.
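For reference, a size-agnostic sketch of the same idea (my own variant, not from the answer above): split by id, resample whole groups with replacement, and bind them back together, which also works when the groups have unequal sizes:
library(dplyr)
set.seed(1)
grps <- split(tbl_, tbl_$id)  # one data frame per id
bind_rows(sample(grps, length(grps), replace = TRUE))  # resample entire id groups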