Ranking observations within groups that are tied - r

I'm trying to rank groups by their counts using dense_rank, but it doesn't assign distinct ranks to groups that are tied. The other ranking functions I tried, the ones that take some sort of ties.method, don't give me rankings in consecutive 1, 2, 3 order. Example:
library(dplyr)
id <- c(rep(1, 8),
        rep(2, 8))
fruit <- c(rep('apple', 4), rep('orange', 1), rep('banana', 2), 'orange',
           rep('orange', 4), rep('banana', 1), rep('apple', 2), 'banana')
df <- data.frame(id, fruit, stringsAsFactors = FALSE)
df2 <- df %>%
  mutate(counter = 1) %>%
  group_by(id, fruit) %>%
  mutate(fruitCnt = sum(counter)) %>%
  ungroup() %>%
  group_by(id) %>%
  mutate(fruitCntRank = dense_rank(desc(fruitCnt))) %>%
  select(id, fruit, fruitCntRank)
df2
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 2
7 1 banana 2
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 2
15 2 apple 2
16 2 banana 2
It doesn't matter which of orange or banana are ranked 3, and it doesn't even need to be consistent. I just need the groups to be ranked 1, 2, 3.
Desired result:
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 3
7 1 banana 3
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 3
15 2 apple 3
16 2 banana 2

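We can add a count for each id and fruit combination, arrange the rows in descending order of count, and then get the rank with match. Within each id, match(fruit, unique(fruit)) numbers the fruits in order of first appearance, so tied groups still receive distinct ranks.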
library(dplyr)
df %>%
  add_count(id, fruit) %>%
  arrange(id, desc(n)) %>%
  group_by(id) %>%
  mutate(n = match(fruit, unique(fruit)))
# Another option with cumsum and duplicated:
#   mutate(n = cumsum(!duplicated(fruit)))
# id fruit n
# <dbl> <chr> <int>
# 1 1 apple 1
# 2 1 apple 1
# 3 1 apple 1
# 4 1 apple 1
# 5 1 orange 2
# 6 1 banana 3
# 7 1 banana 3
# 8 1 orange 2
# 9 2 orange 1
#10 2 orange 1
#11 2 orange 1
#12 2 orange 1
#13 2 banana 2
#14 2 apple 3
#15 2 apple 3
#16 2 banana 2
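A variant of the same idea, as a minimal sketch: rank the distinct (id, fruit) groups once and join the ranks back onto the rows. The names ranks and df_ranked are introduced here for illustration only. Ties are broken by row_number(), which follows the alphabetical order that count() produces, so banana rather than orange gets rank 2 for id 1; the question states that either assignment is acceptable.

```r
library(dplyr)

id <- c(rep(1, 8), rep(2, 8))
fruit <- c(rep("apple", 4), "orange", rep("banana", 2), "orange",
           rep("orange", 4), "banana", rep("apple", 2), "banana")
df <- data.frame(id, fruit, stringsAsFactors = FALSE)

# One row per (id, fruit) group with its count, then a tie-free rank per id.
ranks <- df %>%
  count(id, fruit, name = "fruitCnt") %>%
  group_by(id) %>%
  mutate(fruitCntRank = row_number(desc(fruitCnt))) %>%
  ungroup()

# Attach the group-level rank to every original row (row order is preserved).
df_ranked <- df %>%
  left_join(ranks, by = c("id", "fruit")) %>%
  select(id, fruit, fruitCntRank)

df_ranked
```

Because the ranking happens on the summarised table, the original row order never has to be changed.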

Related

How can I split sentence into new variables in R (with zero-one encoding)?

I have a data like below:
V1 V2
1 orange, apple
2 orange, lemon
3 lemon, apple
4 orange, lemon, apple
5 lemon
6 apple
7 orange
8 lemon, apple
I want to split the V2 variable as follows.
The V2 column contains three categories: "orange", "lemon", and "apple".
For each category I want to create a new column (variable) that indicates whether that name appears in V2 (0 or 1).
I tried this
df %>% separate(V2, into = c("orange", "lemon", "apple"))
.. and I got this result, but it's not what I expect.
V1 orange lemon apple
1 1 orange apple <NA>
2 2 orange lemon <NA>
3 3 lemon apple <NA>
4 4 orange lemon apple
5 5 lemon <NA> <NA>
6 6 apple <NA> <NA>
7 7 orange <NA> <NA>
8 8 lemon apple <NA>
The result I mean is below.
V1 orange lemon apple
1 1 0 1
2 1 1 0
3 0 1 1
4 1 1 1
5 0 1 0
6 0 0 1
7 1 0 0
8 0 1 1
You could try pivoting:
library(dplyr)
library(tidyr)
df |>
  separate_rows(V2, sep = ", ") |>
  mutate(ind = 1) |>
  pivot_wider(names_from = V2,
              values_from = ind,
              values_fill = 0)
Output is:
# A tibble: 8 × 4
V1 orange apple lemon
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1 0 1
3 3 0 1 1
4 4 1 1 1
5 5 0 0 1
6 6 0 1 0
7 7 1 0 0
8 8 0 1 1
data I used:
V1 <- 1:8
V2 <- c("orange, apple", "orange, lemon",
"lemon, apple", "orange, lemon, apple",
"lemon", "apple", "orange",
"lemon, apple")
df <- tibble(V1, V2)
We may use dummy_cols from the fastDummies package, then strip the "V2_" prefix from the new column names:
library(stringr)
library(fastDummies)
library(dplyr)
dummy_cols(df, "V2", split = ",\\s+", remove_selected_columns = TRUE) %>%
  rename_with(~ str_remove(.x, '.*_'))
Output:
# A tibble: 8 × 4
V1 apple lemon orange
<int> <int> <int> <int>
1 1 1 0 1
2 2 0 1 1
3 3 1 1 0
4 4 1 1 1
5 5 0 1 0
6 6 1 0 0
7 7 0 0 1
8 8 1 1 0
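For comparison, here is a base R sketch that needs no extra packages. It assumes the category names are known in advance; cats, parts, ind and res are names made up for this example. Listing the categories explicitly also fixes the column order to the one requested in the question.

```r
V1 <- 1:8
V2 <- c("orange, apple", "orange, lemon",
        "lemon, apple", "orange, lemon, apple",
        "lemon", "apple", "orange",
        "lemon, apple")
df <- data.frame(V1, V2, stringsAsFactors = FALSE)

# Categories to encode, in the desired column order (assumed known up front).
cats <- c("orange", "lemon", "apple")

# Split each string on a comma plus optional whitespace.
parts <- strsplit(df$V2, ",\\s*")

# For each row, flag which categories are present; transpose to rows x categories.
ind <- t(sapply(parts, function(p) as.integer(cats %in% p)))
colnames(ind) <- cats

res <- cbind(df["V1"], ind)
res
```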

How to pull values by reference number

I have a df of paired values and I want to be able to subset it by accessing only one value. This is my data:
df1 %>% head()
values pair_num
<ch> <int>
1 apple 1
2 pb 1
3 apple 2
4 ranch 2
5 apple 3
6 sauce 3
7 orange 4
8 soda 4
9 grape 5
10 juice 5
So for example I would like to access all values associated with apple without knowing what they are and end up with something like this:
df1 %>% head()
values pair_num
<ch> <int>
1 apple 1
2 pb 1
3 apple 2
4 ranch 2
5 apple 3
6 sauce 3
I'm not sure I understand the question, as I would have thought this would be the output (with row 6) that you'd want.
library(dplyr)
df1 %>%
filter(values == "apple") %>%
select(pair_num) %>%
left_join(df1)
Joining, by = "pair_num"
pair_num values
1 1 apple
2 1 pb
3 2 apple
4 2 ranch
5 3 apple
6 3 sauce
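The same subset can also be written without a join. This is only a sketch: df1 is rebuilt here from the rows printed in the question, and apple_pairs is a name chosen for illustration. It keeps every row whose pair_num also occurs on an "apple" row.

```r
library(dplyr)

# Data reconstructed from the question's printout.
df1 <- data.frame(
  values = c("apple", "pb", "apple", "ranch", "apple", "sauce",
             "orange", "soda", "grape", "juice"),
  pair_num = rep(1:5, each = 2),
  stringsAsFactors = FALSE
)

# Keep all rows that share a pair_num with any "apple" row.
apple_pairs <- df1 %>%
  filter(pair_num %in% pair_num[values == "apple"])

apple_pairs
```

Unlike the join, this keeps the original column order and cannot duplicate rows if "apple" appears twice in the same pair.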

Replace NA conditionally

This is an augmented version of my own question, as I could not explain it clearly in the comments.
There are only 2 farms, so each fruit appears twice in the df below. I'd like to replace NA with 0 only if at least one of the two rows for that fruit has a value. For example, pear at y2019 has the values c(NA, 7), and I'd like to output c(0, 7) instead.
Sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(NA,NA,3,12,NA,7,4,6),
'y2018' = c(5,3,NA,NA,8,2,NA,NA),'y2017' = c(4,5,7,15,NA,NA,1,NA))
> df
fruit farm y2019 y2018 y2017
1 apple 1 NA 5 4
2 apple 2 NA 3 5
3 peach 1 3 NA 7
4 peach 2 12 NA 15
5 pear 1 NA 8 NA
6 pear 2 7 2 NA
7 lime 1 4 NA 1
8 lime 2 6 NA NA
This is close:
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if (any(is.na(.))) 0 else .)) %>%
ungroup()
but:
the 7 gets wiped out in pear, producing c(0, 0);
I'd like to leave NA in place when both farms are NA.
#A tibble: 8 x 5
fruit farm y2019 y2018 y2017
<chr> <fct> <dbl> <dbl> <dbl>
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 0 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0
desired outcome:
> df
fruit farm y2019 y2018 y2017
1 apple 1 NA 5 4
2 apple 2 NA 3 5
3 peach 1 3 NA 7
4 peach 2 12 NA 15
5 pear 1 0 8 NA
6 pear 2 7 2 NA
7 lime 1 4 NA 1
8 lime 2 6 NA 0
You can try :
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if(any(!is.na(.)))
replace(., is.na(.), 0) else .)) %>%
ungroup()
# A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
So we replace NA with 0 only if the group contains at least one value that is not NA.
We can use replace_na from tidyr: if the group has any non-NA elements, replace NA with 0; otherwise return the values unchanged.
library(dplyr)
library(tidyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if(any(!is.na(.))) replace_na(., 0) else .)) %>%
ungroup()
# A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
Or another option without if/else: after grouping by 'fruit', combine two logical conditions inside replace.
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric),
~ replace(., sum(!is.na(.)) > 0 & is.na(.), 0)))
# A tibble: 8 x 5
# Groups: fruit [4]
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
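The same rule can be sketched in base R with ave, which applies a function within each group. The helper name fill_partial_na and the vector year_cols are assumptions made for this example, not part of any package.

```r
df <- data.frame(fruit = c("apple", "apple", "peach", "peach",
                           "pear", "pear", "lime", "lime"),
                 farm = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
                 y2019 = c(NA, NA, 3, 12, NA, 7, 4, 6),
                 y2018 = c(5, 3, NA, NA, 8, 2, NA, NA),
                 y2017 = c(4, 5, 7, 15, NA, NA, 1, NA))

# Hypothetical helper: leave an all-NA group alone, otherwise turn NA into 0.
fill_partial_na <- function(x) {
  if (all(is.na(x))) x else replace(x, is.na(x), 0)
}

year_cols <- c("y2019", "y2018", "y2017")

# Apply the helper to each year column, grouped by fruit.
df[year_cols] <- lapply(df[year_cols], function(col) {
  ave(col, df$fruit, FUN = fill_partial_na)
})

df
```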

ordered grouping of rows in R

I would like to create a new column that sequentially labels groups of rows. Original data:
> dt = data.table(index=(1:10), group = c("apple","apple","orange","orange","orange","orange","apple","apple","orange","apple"))
> dt
index group
1: 1 apple
2: 2 apple
3: 3 orange
4: 4 orange
5: 5 orange
6: 6 orange
7: 7 apple
8: 8 apple
9: 9 orange
10: 10 apple
Desired output:
index group id
1: 1 apple 1
2: 2 apple 1
3: 3 orange 1
4: 4 orange 1
5: 5 orange 1
6: 6 orange 1
7: 7 apple 2
8: 8 apple 2
9: 9 orange 2
10: 10 apple 3
dplyr attempt:
dt %>% group_by(group) %>% mutate( id= row_number())
# A tibble: 10 x 3
# Groups: group [2]
index group id
<int> <chr> <int>
1 1 apple 1
2 2 apple 2
3 3 orange 1
4 4 orange 2
5 5 orange 3
6 6 orange 4
7 7 apple 3
8 8 apple 4
9 9 orange 5
10 10 apple 5
How can I edit this to get the first group of apples as 1, then the first group of oranges as 1, then the second group of apples as 2 etc (see desired output above). Also open to data.table solution.
library(data.table)
dt[, id := cumsum(c(TRUE, diff(index) > 1)), by="group"]
dt
# index group id
# 1: 1 apple 1
# 2: 2 apple 1
# 3: 3 orange 1
# 4: 4 orange 1
# 5: 5 orange 1
# 6: 6 orange 1
# 7: 7 apple 2
# 8: 8 apple 2
# 9: 9 orange 2
# 10: 10 apple 3
Starting from original dt:
library(dplyr)
dt %>%
group_by(group) %>%
mutate(id = cumsum(c(TRUE, diff(index) > 1))) %>%
ungroup()
# # A tibble: 10 x 3
# index group id
# <int> <chr> <int>
# 1 1 apple 1
# 2 2 apple 1
# 3 3 orange 1
# 4 4 orange 1
# 5 5 orange 1
# 6 6 orange 1
# 7 7 apple 2
# 8 8 apple 2
# 9 9 orange 2
# 10 10 apple 3
Base R, perhaps a little clunky:
out <- do.call(rbind, by(dt, dt$group,
function(x) transform(x, id = cumsum(c(TRUE, diff(index) > 1)))))
out[order(out$index),]
# index group id
# apple.1 1 apple 1
# apple.2 2 apple 1
# orange.3 3 orange 1
# orange.4 4 orange 1
# orange.5 5 orange 1
# orange.6 6 orange 1
# apple.7 7 apple 2
# apple.8 8 apple 2
# orange.9 9 orange 2
# apple.10 10 apple 3
The names can be removed easily with rownames(out) <- NULL. The order part isn't necessary, but I wanted to present it in the same order as the other solutions, and do.call/by does not preserve the original order.
Another option using data.table::rleid twice:
dt[, gid := rleid(group)][, id := rleid(gid), .(group)]
We can also use rle from base R
with(dt, with(rle(group), rep(ave(seq_along(values),
values, FUN = seq_along), lengths)))
#[1] 1 1 1 1 1 1 2 2 2 3
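Note that the diff(index) > 1 solutions rely on index being consecutive integers. As a sketch for data where that does not hold, the run structure can be taken from the group column alone. The index values below are deliberately non-consecutive, and dat, r and run are illustrative names.

```r
dat <- data.frame(
  index = c(10, 20, 30, 45, 50, 70, 71, 90, 95, 99),  # gaps everywhere
  group = c("apple", "apple", "orange", "orange", "orange", "orange",
            "apple", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Number each run of identical consecutive group values: 1, 1, 2, 2, 2, 2, 3, ...
r <- rle(dat$group)
run <- rep(seq_along(r$lengths), r$lengths)

# Within each group, renumber its runs in order of appearance.
dat$id <- ave(run, dat$group, FUN = function(x) match(x, unique(x)))

dat
```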

R: convert data from wide to long - multiple conditions - getting error [duplicate]

I have a data like the following and I would like to convert it into long format.
id count a1 b1 c1 a2 b2 c2 a3 b3 c3 age
1 1 apple 2 3 orange 3 2 beer 2 1 50
1 2 orange 3 2 apple 2 2 beer 2 1 50
2 1 pear 3 2 apple 2 2 orange 2 2 45
[a1,b1,c1], [a2,b2,c2], and [a3,b3,c3] are sets of three attributes that the person with a given id faces. A person may face multiple choice situations, with count indicating the i-th choice situation. I want to convert the data to long format while keeping the other variables, like the following:
id count a b c age
1 1 apple 2 3 50
1 1 orange 3 2 50
1 1 beer 2 1 50
1 2 orange 3 2 50
1 2 apple 2 2 50
1 2 beer 2 1 50
2 1 pear 3 2 45
2 1 apple 2 2 45
2 1 orange 2 2 45
I have tried reshape with the following commands, but I get confused in terms of where to deal with timevar and times:
l <- reshape(df,
varying = df[,3:11],
v.names = c("a","b","c"),
timevar = "choice",
times = c("a","b","c"),
direction = "long")
With the above commands I cannot get the result I want. I would sincerely appreciate any help!
Use the melt function from data.table package:
library(data.table)
setDT(df)
melt(df, id.vars = c('id', 'count', 'age'),
measure = patterns('a\\d', 'b\\d', 'c\\d'),
# this needs to be regular expression to group `a1, a2, a3` etc together and
# the `\\d` is necessary because you have an age variable in the column.
value.name = c('a', 'b', 'c'))[, variable := NULL][order(id, count, -age)]
# id count age a b c
# 1: 1 1 50 apple 2 3
# 2: 1 1 50 orange 3 2
# 3: 1 1 50 beer 2 1
# 4: 1 2 50 orange 3 2
# 5: 1 2 50 apple 2 2
# 6: 1 2 50 beer 2 1
# 7: 2 1 45 pear 3 2
# 8: 2 1 45 apple 2 2
# 9: 2 1 45 orange 2 2
To use the reshape function, you just have to adjust the varying argument. It can be a list and you want to put the variables that will make up the same column together as vectors in a list:
reshape(df,
idvar=c("id", "count", "age"),
varying = list(c(3,6,9), c(4,7,10), c(5,8,11)),
timevar="time",
v.names=c("a", "b", "c"),
direction = "long")
This returns
id count age time a b c
1.1.50.1 1 1 50 1 apple 2 3
1.2.50.1 1 2 50 1 orange 3 2
2.1.45.1 2 1 45 1 pear 3 2
1.1.50.2 1 1 50 2 orange 3 2
1.2.50.2 1 2 50 2 apple 2 2
2.1.45.2 2 1 45 2 apple 2 2
1.1.50.3 1 1 50 3 beer 2 1
1.2.50.3 1 2 50 3 beer 2 1
2.1.45.3 2 1 45 3 orange 2 2
I also added in the idvars as I think this is usually good practice for others or for re-reading your old code.
data
df <- read.table(header=T, text="id count a1 b1 c1 a2 b2 c2 a3 b3 c3 age
1 1 apple 2 3 orange 3 2 beer 2 1 50
1 2 orange 3 2 apple 2 2 beer 2 1 50
2 1 pear 3 2 apple 2 2 orange 2 2 45")
We can use dplyr/tidyr
library(dplyr)
library(tidyr)
gather(df, Var, Val, a1:c3) %>%
  extract(Var, into = c("Var1", "Var2"), "(.)(.)") %>%
  spread(Var1, Val) %>%
  select(-Var2)
# id count age a b c
#1 1 1 50 apple 2 3
#2 1 1 50 orange 3 2
#3 1 1 50 beer 2 1
#4 1 2 50 orange 3 2
#5 1 2 50 apple 2 2
#6 1 2 50 beer 2 1
#7 2 1 45 pear 3 2
#8 2 1 45 apple 2 2
#9 2 1 45 orange 2 2
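With tidyr 1.0.0 or later, the same reshape can be sketched in one step using pivot_longer and the special ".value" name, which sends the letter part of each column name to an output column. The regular expression and the helper column name set are choices made for this example.

```r
library(tidyr)

df <- read.table(header = TRUE, stringsAsFactors = FALSE,
                 text = "id count a1 b1 c1 a2 b2 c2 a3 b3 c3 age
1 1 apple 2 3 orange 3 2 beer 2 1 50
1 2 orange 3 2 apple 2 2 beer 2 1 50
2 1 pear 3 2 apple 2 2 orange 2 2 45")

# "a1" splits into the output column "a" (.value) and the set number "1".
long <- pivot_longer(df,
                     cols = a1:c3,
                     names_to = c(".value", "set"),
                     names_pattern = "([abc])(\\d)")

# Drop the helper column if the set number is not needed.
long$set <- NULL
long
```

Unlike the gather/spread route, b and c keep their integer type because the values are never stacked into a single character column.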

Resources