Adding new column names to use in spread() - r

Big picture: I'm trying to set up an export that has one route as a row and columns for each value.
This code: I'm trying to select the top three transfers for each route. I use slice(1:3) rather than top_n() because I need no more than three values per route, and top_n() keeps ties. Then I'm trying to spread() to create six columns: a name and a pct for each transfer.
If I were to spread the data right now, the transfer names would become columns, but I need to keep the names in the rows (see Desired output). I want to create the new column names as a key column to pass to spread(). My approach produces an error, and I'm having trouble thinking of another strategy.
Data frame:
# A tibble: 7 x 3
route_shortname transfer_to pct
<chr> <chr> <dbl>
1 A D 0.5
2 A E 0.5
3 B F 0.667
4 B G 0.333
5 C D 0.111
6 C E 0.111
7 C G 0.111
Desired output:
# A tibble: 3 x 7
route_shortname transfer1 transfer1_pct transfer2 transfer2_pct transfer3 transfer3_pct
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A D 0.5 E 0.5 NA NA
2 B F 0.667 G 0.333 NA NA
3 C D 0.111 E 0.111 G 0.111
Reprex:
library(tidyverse)
sample_data <- tibble::tribble(
~route_shortname, ~transfer_to, ~pct,
"A", "D", 0.5,
"A", "E", 0.5,
"B", "F", 0.666666666666667,
"B", "G", 0.333333333333333,
"C", "D", 0.111111111111111,
"C", "E", 0.111111111111111,
"C", "G", 0.111111111111111
)
transfer_to_table <- sample_data %>%
group_by(route_shortname) %>%
mutate(key = c("transfer1", "transfer2", "transfer3"))
#> Error in mutate_impl(.data, dots): Column `key` must be length 2 (the group size) or one, not 3

df = read.table(text = "
route_shortname transfer_to pct
1 A D 0.5
2 A E 0.5
3 B F 0.667
4 B G 0.333
5 C D 0.111
6 C E 0.111
7 C G 0.111
", header=T)
library(tidyverse)
df %>%
group_by(route_shortname) %>%
mutate(id = paste0("transfer", row_number())) %>%
ungroup() %>%
unite(v, transfer_to, pct) %>%
spread(id, v) %>%
separate(transfer1, c("transfer1","transfer1_pct"), sep = "_", convert = T) %>%
separate(transfer2, c("transfer2","transfer2_pct"), sep = "_", convert = T) %>%
separate(transfer3, c("transfer3","transfer3_pct"), sep = "_", convert = T)
# route_shortname transfer1 transfer1_pct transfer2 transfer2_pct transfer3 transfer3_pct
# <fct> <chr> <dbl> <chr> <dbl> <chr> <dbl>
# 1 A D 0.5 E 0.5 NA NA
# 2 B F 0.667 G 0.333 NA NA
# 3 C D 0.111 E 0.111 G 0.111
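For readers on tidyr 1.0.0 or later, the same reshape can probably be written with pivot_wider(), which supersedes spread(). This is only a sketch under that assumption; the intermediate `id` column and the `names_glue` pattern are choices made for this illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

sample_data <- tibble::tribble(
  ~route_shortname, ~transfer_to, ~pct,
  "A", "D", 0.5,
  "A", "E", 0.5,
  "B", "F", 0.667,
  "B", "G", 0.333,
  "C", "D", 0.111,
  "C", "E", 0.111,
  "C", "G", 0.111
)

wide <- sample_data %>%
  group_by(route_shortname) %>%
  slice(1:3) %>%                      # at most three rows per route, no ties
  mutate(id = row_number()) %>%       # 1, 2, 3 within each route
  ungroup() %>%
  pivot_wider(
    names_from  = id,
    values_from = c(transfer_to, pct),
    names_glue  = "transfer{id}_{.value}"
  ) %>%
  # "transfer1_transfer_to" becomes "transfer1"; "transfer1_pct" is kept
  rename_with(~ sub("_transfer_to$", "", .x)) %>%
  # interleave the name and pct columns as in the desired output
  select(route_shortname,
         starts_with("transfer1"),
         starts_with("transfer2"),
         starts_with("transfer3"))

wide
```

Routes with fewer than three transfers are padded with NA by pivot_wider(), so no separate() step is needed.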

Though you tagged this question with tidyverse packages, here is an option using dcast from data.table, which lets you do the reshaping in one (admittedly long) line.
library(data.table)
setDT(sample_data)
dcast(sample_data, route_shortname ~ rowid(route_shortname), value.var = c('transfer_to', 'pct'))
# route_shortname transfer_to_1 transfer_to_2 transfer_to_3 pct_1 pct_2 pct_3
#1: A D E <NA> 0.5000000 0.5000000 NA
#2: B F G <NA> 0.6666667 0.3333333 NA
#3: C D E G 0.1111111 0.1111111 0.1111111
You could also use reshape from base R
sample_data <- as.data.frame(sample_data) # does not work with tibbles for some reason
sample_data$idx <- with(sample_data,
ave(route_shortname, route_shortname, FUN = seq_along))
reshape(sample_data, idvar = "route_shortname", timevar = "idx", direction = "wide", sep = "_")
# route_shortname transfer_to_1 pct_1 transfer_to_2 pct_2 transfer_to_3 pct_3
#1 A D 0.5000000 E 0.5000000 <NA> NA
#3 B F 0.6666667 G 0.3333333 <NA> NA
#5 C D 0.1111111 E 0.1111111 G 0.1111111
In both cases you'd need to rename the columns, but that shouldn't be too hard.
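One possible way to do that renaming is with two sub() calls on the column names. This is a minimal sketch that assumes the names look like the reshape() output above (transfer_to_1, pct_1, ...); `old_names` and `new_names` are illustrative variables.

```r
# Column names as produced by the reshape() call above
old_names <- c("route_shortname",
               "transfer_to_1", "pct_1",
               "transfer_to_2", "pct_2",
               "transfer_to_3", "pct_3")

# transfer_to_1 -> transfer1
new_names <- sub("^transfer_to_([0-9]+)$", "transfer\\1", old_names)
# pct_1 -> transfer1_pct
new_names <- sub("^pct_([0-9]+)$", "transfer\\1_pct", new_names)

new_names
```

The result could then be assigned back with names(x) <- new_names for a data.frame, or passed to setnames() for a data.table.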

Related

How can I calculate the quantile of the difference of each pair in R using dplyr?

Let's say I have a data frame in R that looks like this:
var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
rep("H",3),rep("I",3))
y2 = rnorm(24)
data2 = tibble(var2,y2)%>%base::print(n=30);data2
I want to calculate the 99% quantile of the difference of each distinct pair in R using dplyr.
Ideally the result should look like this (the third column contains the value for each pair):
var1 var2 value
A B quantile(A-B,0.99)
A C quantile(A-C,0.99)
A D quantile(A-D,0.99)
A E quantile(A-E,0.99)
B C quantile(B-C,0.99)
B D quantile(B-D,0.99)
B E quantile(B-E,0.99)
C D quantile(C-D,0.99)
C E quantile(C-E,0.99)
D E quantile(D-E,0.99)
My attempt so far is:
data2%>%
dplyr::mutate(index = sequence(rle(var2)$lengths))%>%
pivot_wider(index, names_from = "var2", values_from = "y2")%>%
dplyr::select(-index)
resulting in
# A tibble: 3 x 8
A B C D E F H I
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.33 -0.539 0.458 1.51 0.990 -1.24 0.306 -0.337
2 -0.542 -0.359 0.0107 -0.0449 0.0222 1.03 -0.238 0.354
3 -0.840 0.843 -1.73 -0.414 -0.874 0.522 -0.0762 -0.777
I can continue with
data2%>%
dplyr::mutate(index = sequence(rle(var2)$lengths))%>%
pivot_wider(index, names_from = "var2", values_from = "y2")%>%
dplyr::select(-index)%>%
quantile(,0.99)%>%
as.data.frame.table(responseName = "QUANTILE")%>%
dplyr::filter(format(Var1) < format(Var2))
But I don't know how to calculate the difference of each pair.
How can I do that in R?
Any help would be appreciated.
Here is an option in base R using outer.
vec <- with(data2, setNames(y2, var2)) # Turn `data.frame` into a named `vector`
lst <- split(vec, names(vec)) # Split `vector` on names
outer(
setNames(seq_along(lst), names(lst)),
setNames(seq_along(lst), names(lst)),
FUN = Vectorize(function(i, j) quantile(lst[[i]] - lst[[j]], probs = 0.99)))
# A B C D E F H I
#A 0.0000000 1.0022382 3.835410 2.339986 0.94314229 -0.3951486 0.7687369 0.9090856
#B 1.0670568 0.0000000 2.845121 3.407043 0.09522425 0.3434554 0.2499785 1.9761424
#C 2.5396299 2.0461416 0.000000 4.040152 1.37209725 1.4944558 1.8316670 2.6092508
#D 1.2580325 2.2602707 5.075953 0.000000 2.19736422 0.8561844 2.0267693 1.0734799
#E 1.1675326 0.6770104 2.892267 3.324613 0.00000000 0.2610250 0.4595697 1.8937120
#F 1.0451740 1.4040863 4.230558 3.063588 1.34117977 0.0000000 1.1705849 1.6326870
#H 0.8170783 0.2335014 3.068946 3.157065 0.17667867 0.0934769 0.0000000 1.7261639
#I 0.1845526 1.1867908 4.002474 1.430901 1.12388433 -0.2172954 0.9532894 0.0000000
This returns a matrix; you can convert this to a data.frame or tibble as needed using as.data.frame or as_tibble.
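As a rough sketch of that conversion, the matrix can be turned into the long var1/var2/value layout from the question via as.table(). The 3 x 3 matrix `m` below is a small hypothetical stand-in for the outer() result, with made-up numbers.

```r
# Hypothetical stand-in for the matrix returned by outer()
m <- matrix(c(0, 1, 2,
              3, 0, 4,
              5, 6, 0),
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# One row per (row name, column name) cell
long <- as.data.frame(as.table(m),
                      responseName = "value",
                      stringsAsFactors = FALSE)
names(long)[1:2] <- c("var1", "var2")

# Keep each unordered pair once, in the A-B orientation
long <- long[long$var1 < long$var2, ]
rownames(long) <- NULL
long
```

Note that the matrix is not symmetric (the quantile of A - B generally differs from that of B - A), so the filter decides which orientation is kept.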
You can use combn() to get all pairs of unique var2 values, compute the 99th percentile of the difference for each pair inside lapply(), and use bind_rows() to combine the separate results:
bind_rows(
lapply(combn(unique(data2$var2),2, simplify=F), \(x) {
data2 %>%
summarize(value = quantile(y2[var2==x[1]] - y2[var2==x[2]], probs=0.99)) %>%
mutate(var1=x[1], var2=x[2]) %>%
relocate(value, .after=everything())
})
)
Output:
# A tibble: 28 × 3
var1 var2 value
<chr> <chr> <dbl>
1 A B 0.339
2 A C 1.63
3 A D 2.89
4 A E 1.62
5 A F 1.35
6 A H 1.10
7 A I 0.200
8 B C 2.11
9 B D 3.37
10 B E 1.30
# … with 18 more rows

Is there an R function for conditional values across different columns?

Suppose you have a dataframe that looks something like this:
df <- tibble(PatientID = c(1,2,3,4,5),
Treat1 = c("R", "O", "C", "O", "C"),
Treat2 = c("O", "R", "R", NA, "O"),
Treat3 = c("C", NA, "O", NA, "R"),
Treat4 = c("H", NA, "H", NA, "H"),
Treat5 = c("H", NA, NA, NA, "H"))
Treat1:Treat5 are different treatments that a patient has had. I'm looking to create a new variable "Chemo" with 1 for yes and 0 for no, based on whether a patient has had treatment "C".
I've been using if_else(), but I have 10 different treatment variables in my actual dataset and would like to create such a column per treatment, so I wonder whether I can do it without writing such long if statements. Is there an easier way to do this?
Use if_any() to loop over the columns selected by starts_with("Treat") and build a logical vector with %in%. if_any() returns TRUE for a row if any of the selected columns contains "C"; the logical is then converted to binary with + (or as.integer()).
library(dplyr)
df <- df %>%
mutate(Chemo = +(if_any(starts_with("Treat"), ~ .x %in% "C")))
-output
df
# A tibble: 5 × 7
PatientID Treat1 Treat2 Treat3 Treat4 Treat5 Chemo
<dbl> <chr> <chr> <chr> <chr> <chr> <int>
1 1 R O C H H 1
2 2 O R <NA> <NA> <NA> 0
3 3 C R O H <NA> 1
4 4 O <NA> <NA> <NA> <NA> 0
5 5 C O R H H 1
Or using base R with rowSums
df$Chemo <- +(rowSums(df[startsWith(names(df), "Treat")] == "C",
na.rm = TRUE) > 0)
Another option using str_detect and any to determine if C occurs in any of the Treat columns for each row. The + converts the logical to an integer.
library(tidyverse)
df %>%
rowwise() %>%
mutate(Chemo = +any(str_detect(c_across(starts_with("Treat")), "C"), na.rm = TRUE)) %>%
ungroup
Output
PatientID Treat1 Treat2 Treat3 Treat4 Treat5 Chemo
<dbl> <chr> <chr> <chr> <chr> <chr> <int>
1 1 R O C H H 1
2 2 O R NA NA NA 0
3 3 C R O H NA 1
4 4 O NA NA NA NA 0
5 5 C O R H H 1
An alternative dplyr way:
library(dplyr)
df %>%
mutate(across(starts_with("Treat"), ~case_when(.=="C" ~1,
TRUE ~0), .names = 'new_{col}')) %>%
mutate(Chemo = rowSums(select(., starts_with("new")))) %>%
select(-starts_with("new"))
PatientID Treat1 Treat2 Treat3 Treat4 Treat5 Chemo
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 R O C H H 1
2 2 O R NA NA NA 0
3 3 C R O H NA 1
4 4 O NA NA NA NA 0
5 5 C O R H H 1
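The question also asks for one such column per treatment. A possible base R sketch is to loop over the treatment codes and apply the rowSums() idea from above to each one. The `has_` column prefix and the helper variables are names chosen for this illustration only.

```r
df <- data.frame(PatientID = c(1, 2, 3, 4, 5),
                 Treat1 = c("R", "O", "C", "O", "C"),
                 Treat2 = c("O", "R", "R", NA, "O"),
                 Treat3 = c("C", NA, "O", NA, "R"),
                 Treat4 = c("H", NA, "H", NA, "H"),
                 Treat5 = c("H", NA, NA, NA, "H"),
                 stringsAsFactors = FALSE)

# Fix the set of treatment columns by name before adding new columns
treat_cols <- grep("^Treat", names(df), value = TRUE)

# Every treatment code that occurs anywhere in those columns
codes <- sort(unique(na.omit(unlist(df[treat_cols]))))

for (code in codes) {
  # 1 if the code appears in any Treat column of the row, else 0
  df[[paste0("has_", code)]] <-
    +(rowSums(df[treat_cols] == code, na.rm = TRUE) > 0)
}

df
```

Selecting the columns by name rather than by a logical vector matters here: a logical index computed before the loop would be recycled once new columns are appended.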

find value furthest to the right in a table r

Let's say I've got some data:
data <- tibble(A = c("a", "b", "c", "d"),
B = c("e", "f", "g", NA_character_),
C = c("h", "i", NA_character_, NA_character_))
Which looks like this:
# A tibble: 4 x 3
A B C
<chr> <chr> <chr>
1 a e h
2 b f i
3 c g NA
4 d NA NA
What I'd like to do is get the value that's furthest to the right into a new column:
# A tibble: 4 x 4
A B C D
<chr> <chr> <chr> <chr>
1 a e h h
2 b f i i
3 c g NA g
4 d NA NA d
I know I could do it with case_when and a bunch of logical !is.na(A) ~ A, statements, but say I've got a load of columns and that's not feasible. I feel like there probably is an easy way that I just don't know about and haven't been able to find. Thanks
coalesce would be easier
library(dplyr)
data %>%
mutate(D = coalesce(C, B, A))
-output
# A tibble: 4 x 4
# A B C D
# <chr> <chr> <chr> <chr>
#1 a e h h
#2 b f i i
#3 c g <NA> g
#4 d <NA> <NA> d
Or, if there are many columns, rev the column names, convert them to symbols, and evaluate (!!!)
data %>%
mutate(D = coalesce(!!! rlang::syms(rev(names(.)))))
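For comparison, a base R sketch of the same idea uses max.col() to find the position of the last non-missing value in each row and then picks that cell by matrix indexing. This assumes all columns share one type (character here); `m` and `last_col` are illustrative names.

```r
data <- data.frame(A = c("a", "b", "c", "d"),
                   B = c("e", "f", "g", NA),
                   C = c("h", "i", NA, NA),
                   stringsAsFactors = FALSE)

m <- as.matrix(data)

# Column index of the rightmost non-NA value in each row:
# !is.na(m) is TRUE/FALSE, and ties among the TRUEs go to the last one
last_col <- max.col(!is.na(m), ties.method = "last")

# Pick one cell per row using a two-column (row, column) index matrix
data$D <- m[cbind(seq_len(nrow(m)), last_col)]
data
```

A row that is entirely NA would resolve to the last column and therefore yield NA.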

Fill in cells with alternating pattern

I am trying to fill in blank cells with the values of the rows above, similar to the na.locf function, but I have a pattern that needs to be repeated. I don't necessarily know how many rows there are between new values (i.e. between a,b and c,d).
I have used the na.locf and searched around for a solution to no avail.
df <- data.frame(col1 = c("a","b", NA, NA, NA, NA, "c", "d", NA, NA))
df
# col1
# 1 a
# 2 b
# 3 <NA>
# 4 <NA>
# 5 <NA>
# 6 <NA>
# 7 c
# 8 d
# 9 <NA>
# 10 <NA>
Solution I would like:
df
col1
a
b
a
b
a
b
c
d
c
d
ave(df$col1,
with(rle(!is.na(df$col1)), rep(cumsum(values), lengths)),
FUN = function(x){
rep(x[!is.na(x)], length.out = length(x))
})
# [1] a b a b a b c d c d
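To make the grouping in that ave() call easier to follow, here is the same logic split into steps on a plain character vector. It is a sketch of how the call above appears to work; `r`, `grp` and `filled` are names introduced for illustration.

```r
col1 <- c("a", "b", NA, NA, NA, NA, "c", "d", NA, NA)

# Runs of non-NA (TRUE) and NA (FALSE) values
r <- rle(!is.na(col1))

# cumsum() bumps the counter at each non-NA run, so every pattern and
# the NA run that follows it end up with the same group id
grp <- rep(cumsum(r$values), r$lengths)

# Within each group, recycle the non-NA pattern to the group's length
filled <- ave(col1, grp, FUN = function(x) {
  rep(x[!is.na(x)], length.out = length(x))
})
filled
```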
Here's a way with dplyr. You can drop the group column if needed:
df %>%
group_by(group = cumsum(is.na(lag(col1)) & !is.na(col1))) %>%
mutate(
col1 = rep(col1[!is.na(col1)], length.out = n())
) %>%
ungroup()
# A tibble: 10 x 2
col1 group
<chr> <int>
1 a 1
2 b 1
3 a 1
4 b 1
5 a 1
6 b 1
7 c 2
8 d 2
9 c 2
10 d 2

Can one separate column into several columns starting from the end of the line?

I wonder if there is some secret argument that would make separate split from the end of the string rather than the start. Some magic_argument?
The desired output would be as follows:
library(dplyr)
library(tidyr) # separate() lives in tidyr
df <- data.frame(x = c(NA, "a.b.b", "a.b.d", "b.c"))
df %>% separate(x, c("A", "B"), magic_argument = TRUE)
#> A B
#> 1 <NA> <NA>
#> 2 a.b b
#> 3 a.b d
#> 4 b c
Try:
df %>% separate(x, c("A", "B"), sep="\\.(?=[^\\.]+$)")
# A B
#1 <NA> <NA>
#2 a.b b
#3 a.b d
#4 b c
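The pattern matches a dot only when it is followed by non-dot characters up to the end of the string, which is to say the last dot. A small base R sketch of the same split, without tidyr, might look like this; `A` and `B` mirror the column names above and the approach is an alternative, not what separate() does internally.

```r
x <- c(NA, "a.b.b", "a.b.d", "b.c")

# Everything before the last dot: drop the last dot and what follows it
A <- sub("\\.[^.]*$", "", x)

# Everything after the last dot: the greedy ".*" runs up to the last dot
B <- sub("^.*\\.", "", x)

# The lookahead pattern itself also works in strsplit() with perl = TRUE
parts <- strsplit("a.b.b", "\\.(?=[^.]+$)", perl = TRUE)[[1]]

data.frame(A = A, B = B, stringsAsFactors = FALSE)
```

NA inputs stay NA in both columns, matching the separate() output shown above.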

Resources