How to break ties in a ranking with gaps in ranking [duplicate] - r

This question already has answers here:
Increment by one to each duplicate value
(4 answers)
Closed 1 year ago.
Say that I have these data:
data <- data.frame(orig=c(1,5,5,5,14,18,18,25))
orig
1 1
2 5
3 5
4 5
5 14
6 18
7 18
8 25
I would like to create the want column:
orig want
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
This column takes orig and copies its value, but breaks ties if they exist. What I am trying to do is to re-create the rankings so that there are no ties and the ties are broken based on the order of the rows in the dataset. If not for the spaces in the rankings (jump from 1 to 5, etc.), I could use
library(tidyverse)
data %>% mutate(test = rank(orig, ties.method="min"))
But this of course doesn't get me what I want:
orig test
1 1 1
2 5 2
3 5 2
4 5 2
5 14 5
6 18 6
7 18 6
8 25 8
What can I do?

We may add row_number() after grouping
library(dplyr)
data %>%
group_by(orig) %>%
mutate(want = orig + row_number() - 1) %>%
ungroup
-output
# A tibble: 8 x 2
orig want
<dbl> <dbl>
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
Or we may simplify this with rowid from data.table (dplyr is still needed for mutate)
library(data.table)
data %>%
mutate(want = orig + rowid(orig)-1)

A base R option using ave + seq_along
transform(
data,
want = orig + ave(orig, orig, FUN = seq_along) - 1
)
gives
orig want
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
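
The approaches above add a within-group counter to orig, which works as long as each run of ties is shorter than the gap to the next distinct value. As a hedged alternative that is not taken from the answers (break_ties is a hypothetical helper name), a running maximum also copes with a run of ties that would spill into the next value, assuming the rows are already sorted by orig:

```r
data <- data.frame(orig = c(1, 5, 5, 5, 14, 18, 18, 25))

# Each value becomes the larger of its own rank and (previous result + 1),
# so results are strictly increasing and untied ranks are left unchanged.
# Assumes x is sorted in ascending order.
break_ties <- function(x) {
  Reduce(function(prev, cur) max(cur, prev + 1), x, accumulate = TRUE)
}

data$want <- break_ties(data$orig)
data$want
# 1 5 6 7 14 18 19 25

# A run of ties longer than the gap no longer collides with the next value
break_ties(c(5, 5, 5, 6))
# 5 6 7 8
```

For the data in the question this gives the same want column as the grouped versions.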

Related

Adding lines in data frame for each observation

I have a data structure in long format, meaning that each individual has more than one observation (and each observation has one row). Now each individual has a different number of observations. I would like to structure my data in the way, that each individual will have the same number of observations. Therefore, it would be great to find the individual with the most observations and add lines with LOCF (depending on the number of missing lines).
For example:
# simulate data structure
d <- data.frame(
id = c(1,1,1,2,2,3,3,3,3,3),
value = c(10,11,12,5,9,55,14,12,20,7) )
Now individual 3 has the most observations (count = 5). I would like to add two lines for individual 1 (with 12 for value) and three lines for individual 2 (with 9 for value)
Any ideas?
Best wishes and thank you.
In case you wish to carry forward the last value for each individual you could do
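# number the observations within each id
d$seq <- ave(d$id, d$id, FUN = seq_along)
d <- merge(
  d,
  # cross join (by = NULL): last value of each id x every sequence position
  merge(
    aggregate(value ~ id, data = d, FUN = tail, 1),
    data.frame(seq = 1:max(table(d$id))),
    by = NULL
  ),
  by = c("id", "seq"),
  all.y = TRUE
)
# keep the observed value where present, otherwise the carried-forward one
d$value <- ifelse(is.na(d$value.x), d$value.y, d$value.x)
d <- d[, !grepl("^value\\.", colnames(d))]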
id seq value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 12
5 1 5 12
6 2 1 5
7 2 2 9
8 2 3 9
9 2 4 9
10 2 5 9
11 3 1 55
12 3 2 14
13 3 3 12
14 3 4 20
15 3 5 7
Here's a tidyverse solution. If we create a variable to hold the within ID count using seq_along then we can use complete and fill to expand the table and fill in the missing values.
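library(dplyr)
library(tidyr)
d |> group_by(id) |>
  mutate(n = seq_along(value)) |>  # position of each observation within its id
  ungroup() |>
  complete(id, n) |>               # add the missing id/position combinations
  fill(value) |>                   # carry the last observed value forward
  select(-n)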
# A tibble: 15 × 2
id value
<dbl> <dbl>
1 1 10
2 1 11
3 1 12
4 1 12
5 1 12
6 2 5
7 2 9
8 2 9
9 2 9
10 2 9
11 3 55
12 3 14
13 3 12
14 3 20
15 3 7
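
For comparison, here is a minimal base R sketch of the same padding idea. It is not taken from either answer: it splits the data by id and repeats each group's last row until the group reaches the size of the largest one. The names n_max and padded are illustrative only.

```r
d <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  value = c(10, 11, 12, 5, 9, 55, 14, 12, 20, 7)
)

# size of the largest group
n_max <- max(table(d$id))

# for each id, append copies of its last row until it has n_max rows
padded <- do.call(rbind, lapply(split(d, d$id), function(g) {
  extra <- n_max - nrow(g)
  rbind(g, g[rep(nrow(g), extra), ])
}))
rownames(padded) <- NULL

table(padded$id)  # every id now has n_max rows
```

This relies on the rows of each id already being in observation order, since "last value" here simply means the last row of the group.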

Sidestepping for-loops using dplyr 1.0.0

I am just starting to appreciate the power of the new dplyr 1.0.0. But after reading the vignettes I need to read some more, and of course there aren't any more so I turn once again to SO.
Say I have the following dataset
# using rowwise and c_across to calculate new variables
rm(list = ls())
library(tidyverse)
set.seed(1)
df <- tibble(d_1_a = round(sample(1:10,10,replace=T)),
d_1_b = round(sample(1:10,10,replace=T)),
d_1_c = round(sample(1:10,10,replace=T)),
d_1_d = round(sample(1:10,10,replace=T)),
d_2_a = round(sample(1:10,10,replace=T)),
d_2_b = round(sample(1:10,10,replace=T)),
d_2_c = round(sample(1:10,10,replace=T)),
d_2_d = round(sample(1:10,10,replace=T)))
And I want to calculate row sums for a subset of columns within the dataset and add them to the existing dataset. I came up with the following for-loop
for (i in 1:2) {
namesCols <- grep(paste0("^d_",i,"_[a-z]$"), names(df), perl = T) # indexes of subset of columns
newDF <- df %>% select(all_of(namesCols)) # extract subset of columns from main
totDF <- newDF %>% rowwise() %>%
mutate(!!paste0("sum_",i) := sum(c_across(everything()))) %>% # new column from old
select(starts_with("sum")) # now extract just the new column as a dataframe
df <- cbind(df,totDF) # binds the new column to the old dataframe
}
Now if we call the original dataset
df
d_1_a d_1_b d_1_c d_1_d d_2_a d_2_b d_2_c d_2_d sum_1 sum_2
1 9 5 5 10 9 2 6 7 29 24
2 4 10 5 6 7 2 8 6 25 23
3 7 6 2 4 8 6 7 1 19 22
4 1 10 10 4 6 6 1 5 25 18
5 2 7 9 10 10 1 4 6 28 21
6 7 9 1 9 7 3 8 1 26 19
7 2 5 4 7 3 3 9 9 18 24
8 3 5 3 6 10 8 9 7 17 34
9 1 9 6 9 6 6 7 7 25 26
10 5 9 10 8 8 7 4 3 32 22
We can see the two sum columns, each calculated from a different subset of the existing columns from the original dataset and then added on the end of that dataset.
I am keen to learn some of the new dplyr/purrr voodoo, but I don't know how the syntax works.
Can anyone suggest a tidyverse version of my for-loop?
Literal translation of the for loop would be -
library(dplyr)
library(purrr)
bind_cols(df, map_dfc(1:2, function(i) {
df %>%
transmute(!!paste0("sum_",i) := rowSums(
select(., matches(paste0("^d_",i,"_[a-z]$")))))
}))
# d_1_a d_1_b d_1_c d_1_d d_2_a d_2_b d_2_c d_2_d sum_1 sum_2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 9 5 5 10 9 2 6 7 29 24
# 2 4 10 5 6 7 2 8 6 25 23
# 3 7 6 2 4 8 6 7 1 19 22
# 4 1 10 10 4 6 6 1 5 25 18
# 5 2 7 9 10 10 1 4 6 28 21
# 6 7 9 1 9 7 3 8 1 26 19
# 7 2 5 4 7 3 3 9 9 18 24
# 8 3 5 3 6 10 8 9 7 17 34
# 9 1 9 6 9 6 6 7 7 25 26
#10 5 9 10 8 8 7 4 3 32 22
However, we can also use split.default -
bind_cols(df, df %>%
split.default(sub('.*(\\d+).*', '\\1', names(.))) %>%
imap_dfc(~.x %>% transmute(!!paste0("sum_",.y) := rowSums(.))))
where the sub call returns a group label for each column, which determines how the columns are split.
sub('.*(\\d+).*', '\\1', names(df))
#[1] "1" "1" "1" "1" "2" "2" "2" "2"
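
The column-splitting idea can also be sketched in base R alone. This is only an illustration under stated assumptions: the data frame below is stand-in random data with the same column naming scheme (not the exact values from the question), and groups, sums and result are illustrative names.

```r
set.seed(1)
# stand-in data: 8 columns named d_1_a ... d_1_d, d_2_a ... d_2_d
df <- as.data.frame(matrix(sample(1:10, 80, replace = TRUE), ncol = 8))
names(df) <- paste0("d_", rep(1:2, each = 4), "_", letters[1:4])

# group label per column: the digit between the underscores
groups <- sub("^d_(\\d+)_.*$", "\\1", names(df))

# split the columns by label and take row sums within each piece
sums <- sapply(split.default(df, groups), rowSums)
colnames(sums) <- paste0("sum_", colnames(sums))

result <- cbind(df, sums)
names(result)
```

Each new column is named after its group label, so adding a third block of d_3_* columns would produce sum_3 without changing the code.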

Matching the row value of a data frame with its corresponding values

The picture below is my data set in R :
reproducible example:
data <- data.frame(
time = rep(0.2, 5),
m1 = c(9,15,2,8,18),
m2 = c(11,1,13,12,NA),
m3 = c(16,NA,7,17,NA),
m4 = c(10,NA,3,4,NA),
m5 = c(14,NA,6,NA,NA),
m6 = c(NA,NA,5,NA,NA)
)
I want the following output, which is a table displaying each value in the dataset and below the number of the row to which the value belongs:
Thank you in advance for your help !
Remove the first column, transpose what is left, convert it back to a data frame, set the column names to the original row numbers, stack that and omit NA rows. Then re-order by values.
d <- na.omit(stack(setNames(as.data.frame(t(data[-1])), 1:nrow(data))))
d[order(d$values), ]
giving:
values ind
8 1 2
13 2 3
16 3 3
22 4 4
18 5 3
17 6 3
15 7 3
19 8 4
1 9 1
4 10 1
2 11 1
20 12 4
14 13 3
5 14 1
7 15 2
3 16 1
21 17 4
25 18 5
try this:
library(tidyverse)
data %>%
rownames_to_column("row_id") %>%
gather(key, value, -time, -row_id) %>%
select(1, 4) %>%
na.omit() %>%
spread(value, row_id)
output is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 2 3 3 4 3 3 3 4 1 1 1 4 3 1 2 1 4 5
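
As a further sketch that is not part of either answer, the same value-to-row lookup can be done on a matrix with row(), which returns the row index of every cell. The names m, keep and res are illustrative.

```r
data <- data.frame(
  time = rep(0.2, 5),
  m1 = c(9, 15, 2, 8, 18),
  m2 = c(11, 1, 13, 12, NA),
  m3 = c(16, NA, 7, 17, NA),
  m4 = c(10, NA, 3, 4, NA),
  m5 = c(14, NA, 6, NA, NA),
  m6 = c(NA, NA, 5, NA, NA)
)

m <- as.matrix(data[-1])  # drop the time column
keep <- !is.na(m)         # cells that actually hold a value

# pair each value with the row it came from, then sort by value
res <- data.frame(value = m[keep], row = row(m)[keep])
res <- res[order(res$value), ]
res$row
# 2 3 3 4 3 3 3 4 1 1 1 4 3 1 2 1 4 5
```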

Summing rows based on conditional in groups

I previously asked a question related to this one, but I need a more elegant and general way to solve it.
I have data separated into groups, and I want to sum some rows in a range based on a condition. I prefer to use 'dplyr' for this because it is more straightforward for me to understand.
The conditions I need are as follows:
1: for group 1: find the first occurrence of '10', sum the rows after this occurrence to the end of the group, and count how many rows there are.
2: for group 2: find the last occurrence of '10', sum the rows before this occurrence to the beginning of the group, and count how many rows there are.
3: for group 3: find the first occurrence of '10', sum the rows before this occurrence to the starting row of the group, and count how many rows there are.
df <- data.frame(gr=rep(c(1,2,3),c(7,9,11)),
y_value=c(c(0,0,10,8,8,6,0),c(10,10,10,8,7,6,2,0,0), c(8,5,8,7,6,2,10,10,8,7,0)))
> df
gr y_value
1 1 0
2 1 0
3 1 10
4 1 8
5 1 8
6 1 6
7 1 0
8 2 10
9 2 10
10 2 10
11 2 8
12 2 7
13 2 6
14 2 2
15 2 0
16 2 0
17 3 8
18 3 5
19 3 8
20 3 7
21 3 6
22 3 2
23 3 10
24 3 10
25 3 8
26 3 7
27 3 0
I guess something like this should work, but I cannot figure out how to implement it in dplyr
count <- function(y,gr){
if (any(y==10)&(gr==1)) {
*
*
*
if (any(y==10)&(gr==2))
*
*
*
*
}
}
library(dplyr)
df %>%
group_by(gr) %>%
do(data.frame(.,count_rows=count(y_value,gr)))
expected output
> df
gr y_value sum nrow
1 1 0 22 4
2 1 0 22 4
3 1 10 22 4
4 1 8 22 4
5 1 8 22 4
6 1 6 22 4
7 1 0 22 4
8 2 10 23 6
9 2 10 23 6
10 2 10 23 6
11 2 8 23 6
12 2 7 23 6
13 2 6 23 6
14 2 2 23 6
15 2 0 23 6
16 2 0 23 6
17 3 8 28 6
18 3 5 28 6
19 3 7 28 6
20 3 6 28 6
21 3 2 28 6
22 3 10 28 6
23 3 10 28 6
24 3 8 28 6
25 3 7 28 6
26 3 0 28 6
Hope this helps!
(Edit note: modified code after OP updated his original requirement)
#sample data - I slightly changed sample data (replaced 0 by 10 in 2nd row) for group 1 to satisfy your condition
df <- data.frame(gr=rep(c(1,2,3),c(7,9,11)),
y_value=c(c(0,10,10,8,8,6,0),c(10,10,10,8,7,6,2,0,0), c(8,5,8,7,6,2,10,10,8,7,0)))
library(dplyr)
df_temp <- df %>%
group_by(gr) %>%
mutate(rows_to_aggregate=cumsum(y_value==10)) %>%
filter(ifelse(gr==1, rows_to_aggregate !=0, ifelse(gr==2, rows_to_aggregate ==0 | y_value==10, rows_to_aggregate ==0))) %>%
filter(ifelse(gr==1, row_number(gr) != 1, ifelse(gr==2, row_number(gr) != n(), rows_to_aggregate ==0))) %>%
mutate(nrow=n(), sum=sum(y_value)) %>%
select(gr,sum,nrow) %>%
distinct()
#final output
df<- left_join(df,df_temp, by='gr')
I think you're after cummax:
df %>%
group_by(gr) %>%
mutate(in_scope = if_else(gr == 1,
cummax(lag(y_value == 10, default = FALSE)),
if_else(gr == 2,
cummax(lag(y_value == 10, default = FALSE) & y_value != 10),
1L - cummax(y_value == 10)))) %>%
ungroup %>%
group_by(gr) %>%
summarise(the_sum = sum(y_value * in_scope),
the_count = sum(in_scope))
# A tibble: 3 x 3
gr the_sum the_count
<dbl> <dbl> <int>
1 1 22 4
2 2 23 6
3 3 36 6
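
To see what the cummax trick does on a single group, here is a small base R sketch with illustrative variable names. Shifting the y == 10 flag down by one position and taking a running maximum marks every row strictly after the first 10; subtracting the unshifted running maximum from 1 marks every row strictly before it.

```r
# group 1 from the question: rows after the first 10
y1 <- c(0, 0, 10, 8, 8, 6, 0)
after_first <- cummax(c(FALSE, head(y1 == 10, -1)))
after_first
# 0 0 0 1 1 1 1
sum(y1 * after_first)  # 22
sum(after_first)       # 4

# group 3 from the question: rows before the first 10
y3 <- c(8, 5, 8, 7, 6, 2, 10, 10, 8, 7, 0)
before_first <- 1L - cummax(y3 == 10)
sum(y3 * before_first)  # 36
sum(before_first)       # 6
```

Here c(FALSE, head(x, -1)) plays the role of dplyr's lag(x, default = FALSE).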

How to generate an uneven sequence of numbers in R

Here's an example data frame:
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
I want to generate a sequence of numbers according to the number of observations of y per x group (e.g. there are 2 observations of y for x=1). I want the sequence to be continuously increasing and jumps by 2 after each x group.
The desired output for this example would be:
1,2,5,6,7,10,11,14,17,20,21,22,25,26
How can I do this simply in R?
To expand on my comment, the groupings can be arbitrary; you simply need to recast them to the correct ordering. There are a few ways to do this: @akrun has shown that it can be accomplished using the match function, or you can make use of the as.numeric function if that is easier for you to understand.
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
# these are equivalent
df$newx <- as.numeric(factor(df$x, levels=unique(df$x)))
df$newx <- match(df$x, unique(df$x))
Since you now have a "new" releveling which is sequential, we can use the logic that was discussed in the comments.
df$newNumber <- 1:nrow(df) + (df$newx-1)*2
For this example, this will result in the following dataframe:
x y newx newNumber
1 1 1 1
1 2 1 2
2 3 2 5
2 4 2 6
2 6 2 7
3 3 3 10
3 7 3 11
4 8 4 14
5 6 5 17
6 4 6 20
6 3 6 21
6 7 6 22
9 3 7 25
9 2 7 26
where df$newNumber is the output you wanted.
To create the sequence 0,0,4,4,4,9,..., what you are doing is taking the minimum of newNumber within each group and subtracting 1. The easiest way to do this is with dplyr.
library(dplyr)
df %>%
group_by(x) %>%
mutate(newNumber2 = min(newNumber) -1)
Which will have the output:
Source: local data frame [14 x 5]
Groups: x
x y newx newNumber newNumber2
1 1 1 1 1 0
2 1 2 1 2 0
3 2 3 2 5 4
4 2 4 2 6 4
5 2 6 2 7 4
6 3 3 3 10 9
7 3 7 3 11 9
8 4 8 4 14 13
9 5 6 5 17 16
10 6 4 6 20 19
11 6 3 6 21 19
12 6 7 6 22 19
13 9 3 7 25 24
14 9 2 7 26 24
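
As an alternative sketch that does not come from the answer, both sequences can be built in base R from the step between consecutive rows: a step of 1 inside a group and a step of 3 (the usual 1 plus the jump of 2) whenever x changes. This assumes rows with the same x are contiguous; step is an illustrative name.

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)

# step of 3 where x changes, 1 otherwise; start the running total at 1
step <- ifelse(diff(x) != 0, 3, 1)
newNumber <- cumsum(c(1, step))
newNumber
# 1 2 5 6 7 10 11 14 17 20 21 22 25 26

# group minimum minus 1, repeated for every row of the group
newNumber2 <- ave(newNumber, x, FUN = min) - 1
newNumber2
# 0 0 4 4 4 9 9 13 16 19 19 19 24 24
```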
