tidyverse calculate ranking per row across several columns - r

I have the following data frame:
dat <- data.frame(id = c("a", "b", "c", "d"),
x1 = c(1, 3, 5, 7),
x2 = c(4, 2, 6, 0),
x3 = c(2, 2, 5, 9))
I now want to calculate the ranking per row across my three x columns and want to store that result into my dat data frame.
So the result could be stored in two ways:
a) ideally, there will be 4 new columns with the respective ranks or
b) there will be a new nested column that I probably need to unnest somehow.
I tried the following which at least gives me a list column.
dat %>%
rowwise() %>%
mutate(my_ranks = list(rank(c_across(starts_with("x")))))
But when I try to unnest, it will give me the ranks but it does so by creating new rows (i.e. each original case now appears four times). Although I guess I could somehow reshape this result with pivot_wider, it feels wrong to follow that route.
Any better/easier idea? Thanks.

We can use unnest_wider
library(dplyr)
library(tidyr)
library(stringr)
dat %>%
rowwise() %>%
mutate(my_ranks = list(rank(c_across(starts_with("x"))))) %>%
unnest_wider(c(my_ranks)) %>%
rename_at(vars(starts_with("...")), ~ str_replace(., fixed("..."), "rank_x"))
# A tibble: 4 x 7
# id x1 x2 x3 rank_x1 rank_x2 rank_x3
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 1 4 2 1 3 2
#2 b 3 2 2 3 1.5 1.5
#3 c 5 6 5 1.5 3 1.5
#4 d 7 0 9 2 1 3
Another option is pmap/as_tibble_row
library(tibble)
library(purrr)
dat %>%
mutate(my_ranks = pmap(select(., starts_with('x')), ~
as_tibble_row(rank(c(...)),
.name_repair = ~ str_c('rank', seq_along(.))))) %>%
unnest(c(my_ranks))
# A tibble: 4 x 7
# id x1 x2 x3 rank1 rank2 rank3
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 1 4 2 1 3 2
#2 b 3 2 2 3 1.5 1.5
#3 c 5 6 5 1.5 3 1.5
#4 d 7 0 9 2 1 3
It can be done more straightforwardly with rowRanks from matrixStats
library(matrixStats)
nm1 <- names(dat)[-1]
dat[paste0('rank', nm1)] <- rowRanks(as.matrix(dat[nm1]), ties.method = 'average')

I guess this is sort of tidyverse:
dat %>%
bind_cols(as_tibble(`colnames<-`(t(apply(dat[-1], 1, rank)), paste0("rank_x", 1:3))))
#> id x1 x2 x3 rank_x1 rank_x2 rank_x3
#> 1 a 1 4 2 1.0 3.0 2.0
#> 2 b 3 2 2 3.0 1.5 1.5
#> 3 c 5 6 5 1.5 3.0 1.5
#> 4 d 7 0 9 2.0 1.0 3.0
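For completeness, the reshaping route mentioned in the question can be sketched as well. This is only an illustration, assuming that id uniquely identifies each row; the intermediate names var, val and rk are made up for the example.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(id = c("a", "b", "c", "d"),
                  x1 = c(1, 3, 5, 7),
                  x2 = c(4, 2, 6, 0),
                  x3 = c(2, 2, 5, 9))

ranks <- dat %>%
  # one row per id / x column
  pivot_longer(starts_with("x"), names_to = "var", values_to = "val") %>%
  # rank within each original row (assumes id is unique)
  group_by(id) %>%
  mutate(rk = rank(val)) %>%
  ungroup() %>%
  select(-val) %>%
  # back to one row per id, with columns rank_x1, rank_x2, rank_x3
  pivot_wider(names_from = var, values_from = rk, names_prefix = "rank_")

res <- left_join(dat, ranks, by = "id")
res
```

This is longer than unnest_wider or rowRanks, but it keeps the new column names tied to the original x column names.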


R Regex capture to remove/keep columns with repeats in their column names

This is an example dataframe
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
I have column names in my dataframe in R in this format
\S+_T\d+|\S+_T\d+
The syntax is something like (Name)_ (T)(Number) | (Name)_ (T)(Number)
Step 1) I want to select columns which contain the same (T)(Number) on both sides of the "|"
I did this with some manual labor :
means_t0 <- means2 %>% select(matches("\\S+_T0\\|\\S+_T0")) %>% rownames_to_column("id_cp_interaction")
means_t1 <- means2 %>% select(matches("\\S+_T1\\|\\S+_T1")) %>% rownames_to_column("id_cp_interaction")
means_t5 <- means2 %>% select(matches("\\S+_T5\\|\\S+_T5")) %>% rownames_to_column("id_cp_interaction")
means3 <- full_join(means_t0, means_t1) %>% full_join(means_t5)
This gives me what I want and it was easy to do because I only had 3 types - T0, T1 and T5. What do I do if I had a huge number?
Step 2) From the output of Step 1, I want to do the opposite for the names, i.e. select only those columns whose Names are not the same
For example B_T0|B_T0 should be removed but B_T0|Fibro_T0 should be retained
Is there a way to regex capture the part in front of the pipe (|) and match it to the part after the pipe (|)?
Thank you
If you have that much information in your column names, I like to transform the data into the long format and then separate the info from the column name into several columns. Then it's easy to filter by these columns:
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
means2 <- cbind(data.frame(id_cp_interaction = 1:5), means2)
library(tidyr)
library(dplyr)
library(stringr)
res <- means2 %>%
pivot_longer(
cols = -id_cp_interaction,
names_to = "names",
values_to = "values"
) %>%
mutate(
celltype_1 = str_extract(names, "^[^_]*"),
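# escape the pipe: an unescaped "|" is alternation, so the lookahead would match anywhere
timepoint_1 = str_extract(names, "[0-9]+(?=\\|)"),
celltype_2 = str_extract(names, "(?<=\\|)(.*?)(?=_)"),
# "+" so that multi-digit timepoints such as T10 are captured in full
timepoint_2 = str_extract(names, "[0-9]+$")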
)
head(res, n = 7)
#> # A tibble: 7 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|B… 1.68 B 0 B 0
#> 2 1 B_T0|B… 19.3 B 0 B 1
#> 3 1 B_T0|F… 10.6 B 0 Fibro 0
#> 4 1 B_T5|E… 12.5 B 5 Endo 5
#> 5 1 Macro_… 2.84 Macro 1 Fibro 1
#> 6 2 B_T0|B… 2.17 B 0 B 0
#> 7 2 B_T0|B… 10.1 B 0 B 1
# only keep interactions of different cell types
res %>%
filter(celltype_1 != celltype_2) %>%
head()
#> # A tibble: 6 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|F… 10.6 B 0 Fibro 0
#> 2 1 B_T5|E… 12.5 B 5 Endo 5
#> 3 1 Macro_… 2.84 Macro 1 Fibro 1
#> 4 2 B_T0|F… 1.47 B 0 Fibro 0
#> 5 2 B_T5|E… 11.3 B 5 Endo 5
#> 6 2 Macro_… 13.0 Macro 1 Fibro 1
Created on 2022-09-19 by the reprex package (v1.0.0)
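To answer the regex part of the question directly: a backreference (\\1) can match whatever the first capture group matched on the other side of the pipe. The following is a minimal sketch that works on the column names only; the variable names col_names, same_time, same_name and keep are made up for the illustration.

```r
col_names <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0",
               "B_T5|Endo_T5", "Macro_T1|Fibro_T1")

# Step 1: same timepoint number on both sides of the pipe
same_time <- grepl("^\\S+_T(\\d+)\\|\\S+_T\\1$", col_names, perl = TRUE)

# Step 2: same name on both sides of the pipe
same_name <- grepl("^(\\S+)_T\\d+\\|\\1_T\\d+$", col_names, perl = TRUE)

# keep columns with matching timepoints but different names
keep <- col_names[same_time & !same_name]
keep
```

The resulting vector can then be used to subset the data frame, e.g. means2[keep].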

How to Create Iterative Formula to calculate Z Score in R?

I have a number of large data frames that have the following basic format, where the final two rows are a mean (d) and standard deviation (e) - although these are calculated elsewhere.
a b c
a 4 3 4
b 3 2 6
c 2 1 8
d 3 2 6
e 1 1 2
I would like to create an iterative function that converts each raw data point into a z-score via the mean and sd value in d and e per column. The formula I would like to apply is ((x-mean)/SD).
The result would be the following:
a b c
a 1 1 -1
b 0 0 0
c -1 -1 1
I don't mind if this is added to the end, created as a new dataframe or the data is converted.
Thanks!
Here is one approach. Note that I do not use the mean/sd provided in the data but recalculate them on the fly.
Also note that the data should usually be in a tidy representation, which in your case would mean that a, b, c are in columns and the mean/sd are either calculated on the fly or stored in separate columns (note that this would require reshaping the data, not shown here).
# your input data
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
raw_data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
#> d 3 2 6
#> e 1 1 2
# remove the mean/sd values
data <- raw_data[!rownames(raw_data) %in% c("d", "e"), ]
data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
# quick way to recalculate the values
means <- apply(data, 2, mean)
means
#> a b c
#> 3 2 6
sds <- apply(data, 2, sd)
sds
#> a b c
#> 1 1 2
z_scores <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
z_scores
#> a b c
#> a 1 1 -1
#> b 0 0 0
#> c -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
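If the mean and sd stored in rows d and e should be used rather than recalculated, base R's sweep can apply them column by column. This is a sketch under the assumption that the rows are named exactly as in the question; the object names values, means, sds and z are only illustrative.

```r
raw_data <- data.frame(
  a = c(4, 3, 2, 3, 1),
  b = c(3, 2, 1, 2, 1),
  c = c(4, 6, 8, 6, 2),
  row.names = c("a", "b", "c", "d", "e")
)

# raw values only, as a numeric matrix
values <- as.matrix(raw_data[c("a", "b", "c"), ])

# stored mean (row d) and sd (row e) as named vectors, one entry per column
means <- unlist(raw_data["d", ])
sds <- unlist(raw_data["e", ])

# subtract the column means, then divide by the column sds
z <- sweep(sweep(values, 2, means, "-"), 2, sds, "/")
z
```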
Edit / Full Code
The following code is a bit longer but most of it is spent on getting the data into the right (long/tidy) format.
If you have any questions, feel free to use the comments.
Note that the tidyverse is really helpful, but might need some time to get used to. The code used here is mostly dplyr (included in the tidyverse).
If you understand the functions %>% (pipe), group_by(), mutate(), summarise(), and pivot_longer()/pivot_wider(), you have everything you need.
library(tidyverse)
# use your original dataset again
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
### 1) Turn the data into a nicer format
# match-table how to rename the variables
var_match <- c(d = "mean", e = "sd")
# convert the raw data into a nicer format, first we do some minor changes
# (variable names, etc)
data_mixed <- raw_data %>%
# have the rownames as explicit variable
rownames_to_column("metric") %>%
# nicer printing etc
as_tibble() %>%
# replace variable names with mean/sd
mutate(metric = ifelse(metric %in% c("d", "e"),
var_match[metric], metric))
data_mixed
#> # A tibble: 5 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 4
#> 2 b 3 2 6
#> 3 c 2 1 8
#> 4 mean 3 2 6
#> 5 sd 1 1 2
# separate the dataset into two:
# data holds the values
# data_vars holds the metrics mean and sd
data <- data_mixed %>% filter(!metric %in% var_match) %>% select(-metric)
data_vars <- data_mixed %>% filter(metric %in% var_match)
data
#> # A tibble: 3 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 4 3 4
#> 2 3 2 6
#> 3 2 1 8
data_vars
#> # A tibble: 2 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 mean 3 2 6
#> 2 sd 1 1 2
# turn the value dataset into its longer form, makes it easier to work with it later
data_long <- data %>%
pivot_longer(everything(), names_to = "var", values_to = "val")
data_long
#> # A tibble: 9 x 2
#> var val
#> <chr> <dbl>
#> 1 a 4
#> 2 b 3
#> 3 c 4
#> 4 a 3
#> 5 b 2
#> 6 c 6
#> 7 a 2
#> 8 b 1
#> 9 c 8
# turn the metric dataset into another long form, allowing easy combination in the next step
data_vars2 <- data_vars %>%
pivot_longer(-metric, names_to = "var", values_to = "val") %>%
pivot_wider(id_cols = var, names_from = metric, values_from = val)
data_vars2
#> # A tibble: 3 x 3
#> var mean sd
#> <chr> <dbl> <dbl>
#> 1 a 3 1
#> 2 b 2 1
#> 3 c 6 2
# combine the datasets
data_all <- left_join(data_long, data_vars2, by = "var")
data_all
#> # A tibble: 9 x 4
#> var val mean sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 1
#> 2 b 3 2 1
#> 3 c 4 6 2
#> 4 a 3 3 1
#> 5 b 2 2 1
#> 6 c 6 6 2
#> 7 a 2 3 1
#> 8 b 1 2 1
#> 9 c 8 6 2
## 2) calculate the z-score
# now comes the actual number crunching!
# per variable var (a, b, c) compute the variable val_z as the z-score
data_res <- data_all %>%
group_by(var) %>%
mutate(val_z = (val - mean) / sd)
data_res
#> # A tibble: 9 x 5
#> # Groups: var [3]
#> var val mean sd val_z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 4 3 1 1
#> 2 b 3 2 1 1
#> 3 c 4 6 2 -1
#> 4 a 3 3 1 0
#> 5 b 2 2 1 0
#> 6 c 6 6 2 0
#> 7 a 2 3 1 -1
#> 8 b 1 2 1 -1
#> 9 c 8 6 2 1
## 3) make the results more readable
# lastly pivot the results to its original form
data_res_wide <- data_res %>%
select(var, val_z) %>%
group_by(var) %>%
mutate(id = 1:n()) %>% # needed for easier identification of values
pivot_wider(id_cols = id, names_from = var, values_from = val_z)
data_res_wide
#> # A tibble: 3 x 4
#> id a b c
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 1 -1
#> 2 2 0 0 0
#> 3 3 -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
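When the mean and sd are recalculated from the raw values anyway, the whole pipeline can probably be shortened with across(). The sketch below assumes dplyr >= 1.0 and starts from the three raw rows only; the names values and z_scores are illustrative.

```r
library(dplyr)

# raw values only (rows a, b, c of the question's data)
values <- tibble(
  a = c(4, 3, 2),
  b = c(3, 2, 1),
  c = c(4, 6, 8)
)

# replace every column by its z-score, computed per column
z_scores <- values %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
z_scores
```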

How to transform a tibble from one column to two columns with repeated observations

I tried to transform df into df2. I did it in a very patchy way using df3. Is there a simpler and more elegant way of doing it?
library(tidyverse)
# I want to transform df
df <- tibble(id = c(1, 2, 1, 2, 1, 2),
time = c('t1', 't1', 't2', 't2', 't3', 't3'),
value = c(2, 3, 6, 4, 5, 7))
df
#> # A tibble: 6 x 3
#> id time value
#> <dbl> <chr> <dbl>
#> 1 1 t1 2
#> 2 2 t1 3
#> 3 1 t2 6
#> 4 2 t2 4
#> 5 1 t3 5
#> 6 2 t3 7
# into df2
df2 <- tibble(id = c(1, 2, 1, 2),
t = c(2, 3, 6, 4),
r = c(6, 4, 5, 7))
df2
#> # A tibble: 4 x 3
#> id t r
#> <dbl> <dbl> <dbl>
#> 1 1 2 6
#> 2 2 3 4
#> 3 1 6 5
#> 4 2 4 7
# This is how I did it, but I think there should be a better way
df3 <- df %>% pivot_wider(names_from = time, values_from = value)
b <- tibble(id = numeric(), t = numeric(), r = numeric())
for (i in 2:3){
a <- df3[,c(1,i,i+1)]
colnames(a) <- c('id', 't', 'r')
b <- bind_rows(a, b)
}
b
#> # A tibble: 4 x 3
#> id t r
#> <dbl> <dbl> <dbl>
#> 1 1 6 5
#> 2 2 4 7
#> 3 1 2 6
#> 4 2 3 4
Created on 2020-11-25 by the reprex package (v0.3.0)
For each id, you can use lead to take the next value as the r column, and then drop the rows where r is NA.
library(dplyr)
df %>%
group_by(id) %>%
mutate(t = value,
r = lead(value)) %>%
na.omit() %>%
select(id, t, r)
# id t r
# <dbl> <dbl> <dbl>
#1 1 2 6
#2 2 3 4
#3 1 6 5
#4 2 4 7
We can use summarise from dplyr version >= 1.0. Before 1.0, summarise had to return exactly one row per group. From 1.0 on, it can return any number of rows per group, so the result can be shorter or longer than the original number of rows. Note that dplyr 1.1.0 deprecated multi-row results from summarise in favour of reframe().
library(dplyr)
df %>%
group_by(id) %>%
summarise(t = value[-n()], r = value[-1], .groups = 'drop')
Output:
# A tibble: 4 x 3
# id t r
# <dbl> <dbl> <dbl>
#1 1 2 6
#2 1 6 5
#3 2 3 4
#4 2 4 7
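On dplyr >= 1.1.0 the same idea can be written with reframe(), which is intended for results with any number of rows per group. This is a sketch assuming that version is installed; the result name res is illustrative.

```r
library(dplyr)

df <- tibble(id = c(1, 2, 1, 2, 1, 2),
             time = c("t1", "t1", "t2", "t2", "t3", "t3"),
             value = c(2, 3, 6, 4, 5, 7))

res <- df %>%
  group_by(id) %>%
  # t drops the last value per id, r drops the first value per id
  reframe(t = value[-n()], r = value[-1])
res
```

reframe() always returns an ungrouped tibble, so no .groups argument is needed.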

dplyr get a group-level variable from the lagged group

Let's say I have a df with groups and a group-level variable like a mean. How do I produce a variable which is the group-level mean of the lagged group, where the only rows with NA for this variable are those in the first group?
e.g:
df <- data_frame(group = c(1,1,2,2),
grouped.mean = c(2.5,2.5,3.5,3.5))
# my attempt
df %<>%
group_by(group) %>%
mutate(lag.group.mean = lag(grouped.mean))
# A tibble: 4 x 3
# Groups: group [2]
group grouped.mean lag.group.mean
<dbl> <dbl> <dbl>
1 1. 2.50 NA
2 1. 2.50 2.50
3 2. 3.50 NA
4 2. 3.50 3.50
Desired output:
group grouped.mean lag.group.mean
<dbl> <dbl> <dbl>
1 1. 2.50 NA
2 1. 2.50 NA
3 2. 3.50 2.50
4 2. 3.50 2.50
Thanks!
EDIT: more challenging example:
df <- data_frame(group = c(1,1,2,3,3,3),
grouped.mean = c(2.5,2.5,3.5,4.5,4.5,4.5))
expected output:
group grouped.mean lag.grouped.mean
<dbl> <dbl> <dbl>
1 1. 2.50 NA
2 1. 2.50 NA
3 2. 3.50 2.50
4 3. 4.50 3.50
5 3. 4.50 3.50
6 3. 4.50 3.50
Here is an option. The key is to use distinct to remove duplicated rows, create the lag.group.mean column, and then left_join to the original data frame.
library(dplyr)
# tibble() replaces the deprecated data_frame()
df <- tibble(group = c(1,1,2,2),
grouped.mean = c(2.5,2.5,3.5,3.5))
df2 <- df %>%
distinct() %>%
mutate(lag.group.mean = lag(grouped.mean)) %>%
left_join(df, ., by = c("group", "grouped.mean"))
df2
# # A tibble: 4 x 3
# group grouped.mean lag.group.mean
# <dbl> <dbl> <dbl>
# 1 1 2.5 NA
# 2 1 2.5 NA
# 3 2 3.5 2.5
# 4 2 3.5 2.5
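The same distinct-then-join idea can be applied to the more challenging example from the question. The sketch below assumes that each group has exactly one mean and that "lagged" means the previous group in ascending order of group; group_means and res are illustrative names.

```r
library(dplyr)

df <- tibble(group = c(1, 1, 2, 3, 3, 3),
             grouped.mean = c(2.5, 2.5, 3.5, 4.5, 4.5, 4.5))

# one row per group, with the mean of the previous group
group_means <- df %>%
  distinct(group, grouped.mean) %>%
  arrange(group) %>%
  mutate(lag.grouped.mean = lag(grouped.mean)) %>%
  select(group, lag.grouped.mean)

# attach the lagged mean to every row of the original data
res <- left_join(df, group_means, by = "group")
res
```

Joining on group alone avoids matching on the floating-point mean column.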
The lagged group value is the first globally lagged value within each group:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df <- tibble(group = c(1, 1, 2, 3, 3, 3),
grouped.mean = c(2.5, 2.5, 3.5, 4.5, 4.5, 4.5))
df %>%
mutate(lag.grouped.mean = lag(grouped.mean)) %>%
group_by(group) %>%
mutate(lag.grouped.mean = first(lag.grouped.mean))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group grouped.mean lag.grouped.mean
#> <dbl> <dbl> <dbl>
#> 1 1 2.5 NA
#> 2 1 2.5 NA
#> 3 2 3.5 2.5
#> 4 3 4.5 3.5
#> 5 3 4.5 3.5
#> 6 3 4.5 3.5
But it's probably easier to see what's happening if you use a join, as in @www's answer.
Created on 2018-08-06 by the reprex package (v0.2.0.9000).

loop to multiply across columns

I have a data frame with columns labeled sales1, sales2, price1, price2 and I want to calculate revenues by multiplying sales1 * price1 and so-on across each number in an iterative fashion.
data <- data_frame(
"sales1" = c(1, 2, 3),
"sales2" = c(2, 3, 4),
"price1" = c(3, 2, 2),
"price2" = c(3, 3, 5))
data
# A tibble: 3 x 4
# sales1 sales2 price1 price2
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 3
#2 2 3 2 3
#3 3 4 2 5
Why doesn't the following code work?
data %>%
mutate (
for (i in seq_along(1:2)) {
paste0("revenue",i) = paste0("sales",i) * paste0("price",i)
}
)
Assuming your columns are already ordered (sales1, sales2, price1, price2), we can split the data frame into two parts and then multiply them:
data[grep("sales", names(data))] * data[grep("price", names(data))]
# sales1 sales2
#1 3 6
#2 4 9
#3 6 20
If the columns are not already sorted by name, we can sort them with order first and then use the command above.
data <- data[order(names(data))]
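As for why the loop in the question fails: paste0() returns character strings, not columns, so the loop tries to multiply two strings (an error) and to assign to the result of a function call. If a loop is preferred, the column names can be built as strings and used with [[ ]]. A minimal sketch, using a plain data.frame for the illustration:

```r
data <- data.frame(
  sales1 = c(1, 2, 3),
  sales2 = c(2, 3, 4),
  price1 = c(3, 2, 2),
  price2 = c(3, 3, 5)
)

for (i in 1:2) {
  # build the column names as strings, then look the columns up with [[ ]]
  data[[paste0("revenue", i)]] <-
    data[[paste0("sales", i)]] * data[[paste0("price", i)]]
}
data
```

Unlike the split-and-multiply approach, this does not depend on the column order.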
This answer is not brief. For that, @RonakShah's existing answer is the one to look at!
My response is intended to address a broader concern regarding the difficulty of trying to do this in the tidyverse. My understanding is this is difficult because the data is not currently in a "tidy" format. Instead, you can create a tidy data frame like so:
library(tidyverse)
tidy_df <- data %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
extract(key, c("variable", "id"), "([a-z]+)([0-9]+)") %>%
spread(variable, value)
This then makes the final calculation straightforward:
tidy_df %>% mutate(revenue = sales * price)
#> # A tibble: 6 x 5
#> rowname id price sales revenue
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1 1 3 1 3
#> 2 1 2 3 2 6
#> 3 2 1 2 2 4
#> 4 2 2 3 3 9
#> 5 3 1 2 3 6
#> 6 3 2 5 4 20
If you need to get the data back into the original format, you can, although this feels clunky to me (I'm sure it can be improved in some way).
tidy_df %>% mutate(revenue = sales * price) %>%
gather(key, value, -c(rowname, id)) %>%
unite(key, key, id, sep = "") %>%
spread(key, value) %>%
select(starts_with("price"),
starts_with("sales"),
starts_with("revenue"))
#> # A tibble: 3 x 6
#> price1 price2 sales1 sales2 revenue1 revenue2
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 3 1 2 3 6
#> 2 2 3 2 3 4 9
#> 3 2 5 3 4 6 20
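With current tidyr, the gather/extract/spread steps can probably be condensed using pivot_longer with the ".value" sentinel, and the way back with pivot_wider. This is a sketch assuming tidyr >= 1.0; the helper column row and the object names tidy_df and wide are illustrative.

```r
library(dplyr)
library(tidyr)

data <- tibble(
  sales1 = c(1, 2, 3),
  sales2 = c(2, 3, 4),
  price1 = c(3, 2, 2),
  price2 = c(3, 3, 5)
)

tidy_df <- data %>%
  mutate(row = row_number()) %>%
  # "sales1" is split into the output column "sales" (.value) and id "1"
  pivot_longer(-row,
               names_to = c(".value", "id"),
               names_pattern = "([a-z]+)([0-9]+)") %>%
  mutate(revenue = sales * price)

# back to the original wide layout, now with revenue1 and revenue2
wide <- tidy_df %>%
  pivot_wider(id_cols = row,
              names_from = id,
              values_from = c(sales, price, revenue),
              names_sep = "")
wide
```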
