Suppose you have a dataframe that looks something like this:
df <- tibble(PatientID = c(1,2,3,4,5),
             Treat1 = c("R", "O", "C", "O", "C"),
             Treat2 = c("O", "R", "R", NA, "O"),
             Treat3 = c("C", NA, "O", NA, "R"),
             Treat4 = c("H", NA, "H", NA, "H"),
             Treat5 = c("H", NA, NA, NA, "H"))
Treat1:Treat5 are different treatments that a patient has had. I'm looking to create a new variable "Chemo" with 1 for yes and 0 for no, based on whether a patient has had treatment "C".
I've been using if_else(), but I have 10 different treatment variables in my actual dataset and would like to create such a column per treatment, so I wonder if I can do it without writing such long if statements. Is there an easier way to do this?
Use if_any to loop over the columns whose names start with 'Treat' (selected with starts_with) and create a logical vector with %in%. if_any returns TRUE for a row if any of the selected columns contains 'C' and FALSE otherwise; the logical is then converted to binary with + (or as.integer).
library(dplyr)
df <- df %>%
mutate(Chemo = +(if_any(starts_with("Treat"), ~ .x %in% "C")))
Output
df
# A tibble: 5 × 7
  PatientID Treat1 Treat2 Treat3 Treat4 Treat5 Chemo
      <dbl> <chr>  <chr>  <chr>  <chr>  <chr>  <int>
1         1 R      O      C      H      H          1
2         2 O      R      <NA>   <NA>   <NA>       0
3         3 C      R      O      H      <NA>       1
4         4 O      <NA>   <NA>   <NA>   <NA>       0
5         5 C      O      R      H      H          1
Or using base R with rowSums
df$Chemo <- +(rowSums(df[startsWith(names(df), "Treat")] == "C",
na.rm = TRUE) > 0)
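Another option uses str_detect and any to determine whether C occurs in any of the Treat columns for each row. The + converts the logical to an integer. Note that str_detect matches a pattern rather than a whole value, so "C" would also match longer codes that contain a C.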
library(tidyverse)
df %>%
rowwise() %>%
mutate(Chemo = +any(str_detect(c_across(starts_with("Treat")), "C"), na.rm = TRUE)) %>%
ungroup()
Output
PatientID Treat1 Treat2 Treat3 Treat4 Treat5 Chemo
<dbl> <chr> <chr> <chr> <chr> <chr> <int>
1 1 R O C H H 1
2 2 O R NA NA NA 0
3 3 C R O H NA 1
4 4 O NA NA NA NA 0
5 5 C O R H H 1
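An alternative dplyr way. Note that rowSums here counts the columns equal to "C", so Chemo is 0/1 only as long as "C" appears at most once per patient: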
library(dplyr)
df %>%
mutate(across(starts_with("Treat"), ~case_when(.=="C" ~1,
TRUE ~0), .names = 'new_{col}')) %>%
mutate(Chemo = rowSums(select(., starts_with("new")))) %>%
select(-starts_with("new"))
PatientID Treat1 Treat2 Treat3 Treat4 Treat5 Chemo
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 R O C H H 1
2 2 O R NA NA NA 0
3 3 C R O H NA 1
4 4 O NA NA NA NA 0
5 5 C O R H H 1
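Since the question also asks for one such column per treatment, the base R rowSums idea could be wrapped in a loop. This is only a minimal sketch: the has_ column prefix and the vector codes are names made up for illustration, and the data is rebuilt with data.frame() so the block runs on its own.

```r
# Rebuild the example data (base R data.frame, so no packages are needed)
df <- data.frame(
  PatientID = c(1, 2, 3, 4, 5),
  Treat1 = c("R", "O", "C", "O", "C"),
  Treat2 = c("O", "R", "R", NA, "O"),
  Treat3 = c("C", NA, "O", NA, "R"),
  Treat4 = c("H", NA, "H", NA, "H"),
  Treat5 = c("H", NA, NA, NA, "H"),
  stringsAsFactors = FALSE
)

# Fix the treatment columns once, before any indicator columns are added
treat_cols <- names(df)[startsWith(names(df), "Treat")]

# Treatment codes to flag (assumption: the codes seen in the example data)
codes <- c("C", "R", "O", "H")

for (code in codes) {
  # TRUE where the row has `code` in any treatment column; NA counts as "no"
  hit <- rowSums(df[treat_cols] == code, na.rm = TRUE) > 0
  # Hypothetical naming scheme: has_C, has_R, ...
  df[[paste0("has_", code)]] <- as.integer(hit)
}

df
```

Here has_C reproduces the Chemo column from the answers above; in a real dataset the codes vector would list whatever treatment codes actually occur.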
I want to automatically add a new dataset identifier variable when using full_join() in R.
df1 <- tribble(~ID, ~x,
"A", 1,
"B", 2,
"C", 3)
df2 <- tribble(~ID, ~y,
"D", 4,
"E", 5,
"F", 6)
combined <- df1 %>% dplyr::full_join(df2)
I know from ?full_join that it joins all rows from df1 followed by df2. But, I couldn't find an option to create an index variable automatically.
Currently, I'm adding an extra variable in df1 first
df1 <- tribble(~ID, ~x, ~dataset,
"A", 1, 1,
"B", 2, 1,
"C", 3, 1)
and following it up with df1 %>% dplyr::full_join(df2) %>% dplyr::mutate(dataset = replace_na(dataset, 2))
Any suggestions to do it in a better way?
I'm not sure if it's more efficient than yours, but if the two data frames never share any columns other than ID, you could try
df1 %>%
full_join(df2) %>%
mutate(dataset = as.numeric(is.na(x))+1)
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
But to be safe, it might be better to just define the dataset index beforehand.
df1 %>%
mutate(dataset = 1) %>%
full_join(df2 %>% mutate(dataset = 2))
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
New data
df1 <- tribble(~ID, ~x,~y,
"A", 1,1,
"B", 2,1,
"C", 3,1)
df2 <- tribble(~ID, ~x,~y,
"D", 4,1,
"E", 5,1,
"F", 6,1)
full_join(df1, df2)
ID x y
<chr> <dbl> <dbl>
1 A 1 1
2 B 2 1
3 C 3 1
4 D 4 1
5 E 5 1
6 F 6 1
Instead of a "join", maybe try bind_rows from dplyr:
library(dplyr)
bind_rows(df1, df2, .id = "dataset")
This will bind rows, and the missing columns are filled in with NA. In addition, you can specify an ".id" argument with an identifier. If you provide a list of dataframes, the labels are taken from names in the list. If not, a numeric sequence is used (as seen below).
Output
dataset ID x y
<chr> <chr> <dbl> <dbl>
1 1 A 1 NA
2 1 B 2 NA
3 1 C 3 NA
4 2 D NA 4
5 2 E NA 5
6 2 F NA 6
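The named-list behaviour described above can be sketched as follows. The labels first and second are hypothetical names chosen for illustration, and the inputs are rebuilt as plain data frames so the block is self-contained.

```r
library(dplyr)

# The two original example data frames (no shared columns except ID)
df1 <- data.frame(ID = c("A", "B", "C"), x = c(1, 2, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("D", "E", "F"), y = c(4, 5, 6),
                  stringsAsFactors = FALSE)

# With a named list, the list names (hypothetical labels here)
# become the values of the .id column instead of "1", "2", ...
combined <- bind_rows(list(first = df1, second = df2), .id = "dataset")

combined
```

Descriptive labels like these tend to be easier to read downstream than a numeric sequence, but that is a matter of taste.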
Let's say I've got some data:
data <- tibble(A = c("a", "b", "c", "d"),
B = c("e", "f", "g", NA_character_),
C = c("h", "i", NA_character_, NA_character_))
Which looks like this:
# A tibble: 4 x 3
A B C
<chr> <chr> <chr>
1 a e h
2 b f i
3 c g NA
4 d NA NA
What I'd like to do is get the value that's furthest to the right into a new column:
# A tibble: 4 x 4
A B C D
<chr> <chr> <chr> <chr>
1 a e h h
2 b f i i
3 c g NA g
4 d NA NA d
I know I could do it with case_when and a bunch of logical !is.na(A) ~ A, statements, but say I've got a load of columns and that's not feasible. I feel like there probably is an easy way that I just don't know about and haven't been able to find. Thanks
coalesce would be easier
library(dplyr)
data %>%
mutate(D = coalesce(C, B, A))
Output
# A tibble: 4 x 4
# A B C D
# <chr> <chr> <chr> <chr>
#1 a e h h
#2 b f i i
#3 c g <NA> g
#4 d <NA> <NA> d
Or, if there are many columns, reverse the column names with rev, convert them to symbols, and splice them into coalesce with !!!
data %>%
mutate(D = coalesce(!!! rlang::syms(rev(names(.)))))
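For comparison, the same "rightmost non-missing value" idea can be sketched in base R with max.col. This is a swapped-in technique, not part of the answer above; the helper names not_na and last_col are made up for illustration.

```r
# Rebuild the example as a base R data.frame
data <- data.frame(A = c("a", "b", "c", "d"),
                   B = c("e", "f", "g", NA),
                   C = c("h", "i", NA, NA),
                   stringsAsFactors = FALSE)

# Logical matrix: TRUE where a value is present
not_na <- !is.na(data)

# For each row, the index of the last TRUE (ties are broken towards the right)
last_col <- max.col(not_na, ties.method = "last")

# Pick one cell per row via (row, column) matrix indexing
data$D <- as.matrix(data)[cbind(seq_len(nrow(data)), last_col)]

data
```

A row that is entirely NA would get NA in D, because max.col then falls back to the last column.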
I am new to R and have a simple 'how to' question, specifically, what is the best way to calculate Group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
the test data is this:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), row.names = c(NA,
-13L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b) %>%
mutate(resp_rate = round(no_resp/all*100))
this gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
which is fine, but I wondered how I could make this simpler? Also, how would I do an overall percentage? E.g. an overall distinct count of resp/distinct count of id, without grouping.
Many thanks
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate overall percentage you can divide the number by the sum of the column.
library(dplyr)
test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE),
all = n_distinct(id),
resp_rate = round(no_resp/all*100)) %>%
mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
Using base R we may apply tapply and table functions.
res <- transform(with(test, data.frame(no_resp=tapply(resp, group, sum, na.rm=TRUE),
all=colSums(table(id, group) > 0))),
resp_rate=round(no_resp/all*100),
overall_perc=prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
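The question also asks for an overall rate without grouping (total responses over distinct ids). One possible reading of that, using the same definitions as the grouped calculation, is sketched below in base R; the variable names are made up for illustration.

```r
# Rebuild the test data as a plain data.frame
test <- data.frame(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp  = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id    = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf", "gdf",
            "hfg", "hfg", "trw", "jyt", "ghj"),
  stringsAsFactors = FALSE
)

# Total number of responses, ignoring missing values
overall_resp <- sum(test$resp, na.rm = TRUE)

# Distinct ids across the whole table (an id is counted once even if it
# appears in more than one row or group)
overall_ids <- length(unique(test$id))

# Overall response rate in percent, rounded to one decimal place
overall_rate <- round(overall_resp / overall_ids * 100, 1)
overall_rate
```

Because "gdf" and "hfg" each appear twice, the denominator is 11 rather than 13; whether that is the intended definition depends on what an id represents in the real data.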
I am trying to remove duplicates from a dataset (caused by merging). However, one row contains a value and one does not; in some cases both rows are NA. I want to keep the ones with data, and if there are only NAs, then it does not matter which I keep. How do I do that? I am stuck.
I unsuccessfully tried the solutions from this question (I don't usually work with data.table, so I don't understand what's what):
R data.table remove rows where one column is duplicated if another column is NA
Some minimum example data:
df <- data.frame(ID = c("A", "A", "B", "B", "C", "D", "E", "G", "H", "J", "J"),
value = c(NA, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L))
ID value
A NA
A 1
B NA
B NA
C 1
D 1
E 1
G 1
H 1
J NA
J 1
and I want this:
ID value
A 1
B NA
C 1
D 1
E 1
G 1
H 1
J 1
One possibility using dplyr could be:
df %>%
group_by(ID) %>%
slice(which.max(!is.na(value)))
ID value
<chr> <int>
1 A 1
2 B NA
3 C 1
4 D 1
5 E 1
6 G 1
7 H 1
8 J 1
An alternative to @tmfmnk's answer, using slice_max() in dplyr.
library(dplyr)
df %>%
group_by(ID) %>%
slice_max(!is.na(value), with_ties = FALSE)
# # A tibble: 8 x 2
# # Groups: ID [8]
# ID value
# <chr> <int>
# 1 A 1
# 2 B NA
# 3 C 1
# 4 D 1
# 5 E 1
# 6 G 1
# 7 H 1
# 8 J 1
Here is a relatively simple data.table solution.
Group by ID: if all the values in a group are NA, just take the first value; if not, take all values that are not NA.
library(data.table)
setDT(df)
df[, .(value = if (all(is.na(value))) value[1] else value[!is.na(value)]), by = ID]
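The same "prefer a non-NA row per ID" logic can also be sketched in base R by sorting non-missing values first and then dropping repeated IDs. This is an alternative technique, not the data.table method above, and the names ordered and deduped are made up for illustration.

```r
# Rebuild the example data
df <- data.frame(ID = c("A", "A", "B", "B", "C", "D", "E", "G", "H", "J", "J"),
                 value = c(NA, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L),
                 stringsAsFactors = FALSE)

# Within each ID, put rows with a value before rows with NA
# (is.na() is FALSE for present values, and FALSE sorts before TRUE)
ordered <- df[order(df$ID, is.na(df$value)), ]

# Keep only the first row of each ID
deduped <- ordered[!duplicated(ordered$ID), ]

deduped
```

Unlike the data.table version, this keeps exactly one row per ID even when an ID has several non-NA values.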
Big picture: I'm trying to set up an export that has one route as a row and columns for each value.
This code: I'm trying to select the top three transfers for each route (using slice(1:3) because I need no more than three values; top_n() allows for ties). Then, I'm trying to spread() to create 6 columns: a name and a pct for each.
If I were to spread the data right now, the names would become columns, but I need to keep the names in the rows (see Desired Output). I want to create the column names as a key column to use to spread(). My approach is creating an error. I'm having trouble thinking of another strategy.
Data frame:
# A tibble: 7 x 3
route_shortname transfer_to pct
<chr> <chr> <dbl>
1 A D 0.5
2 A E 0.5
3 B F 0.667
4 B G 0.333
5 C D 0.111
6 C E 0.111
7 C G 0.111
Desired output:
# A tibble: 3 x 7
  route_shortname transfer1 transfer1_pct transfer2 transfer2_pct transfer3 transfer3_pct
  <chr>           <chr>             <dbl> <chr>             <dbl> <chr>             <dbl>
1 A               D                 0.5   E                 0.5   NA               NA
2 B               F                 0.667 G                 0.333 NA               NA
3 C               D                 0.111 E                 0.111 G                 0.111
Reprex:
library(tidyverse)
sample_data <- tibble::tribble(
~route_shortname, ~transfer_to, ~pct,
"A", "D", 0.5,
"A", "E", 0.5,
"B", "F", 0.666666666666667,
"B", "G", 0.333333333333333,
"C", "D", 0.111111111111111,
"C", "E", 0.111111111111111,
"C", "G", 0.111111111111111
)
transfer_to_table <- sample_data %>%
group_by(route_shortname) %>%
mutate(key = c("transfer1", "transfer2", "transfer3"))
#> Error in mutate_impl(.data, dots): Column `key` must be length 2 (the group size) or one, not 3
df = read.table(text = "
route_shortname transfer_to pct
1 A D 0.5
2 A E 0.5
3 B F 0.667
4 B G 0.333
5 C D 0.111
6 C E 0.111
7 C G 0.111
", header=T)
library(tidyverse)
df %>%
group_by(route_shortname) %>%
mutate(id = paste0("transfer", row_number())) %>%
ungroup() %>%
unite(v, transfer_to, pct) %>%
spread(id, v) %>%
separate(transfer1, c("transfer1","transfer1_pct"), sep = "_", convert = T) %>%
separate(transfer2, c("transfer2","transfer2_pct"), sep = "_", convert = T) %>%
separate(transfer3, c("transfer3","transfer3_pct"), sep = "_", convert = T)
# route_shortname transfer1 transfer1_pct transfer2 transfer2_pct transfer3 transfer3_pct
# <fct> <chr> <dbl> <chr> <dbl> <chr> <dbl>
# 1 A D 0.5 E 0.5 NA NA
# 2 B F 0.667 G 0.333 NA NA
# 3 C D 0.111 E 0.111 G 0.111
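In current tidyr, spread() is superseded by pivot_wider(), which can widen several value columns at once and so avoids the unite/separate round trip. The following is a sketch under that assumption rather than a drop-in replacement: the helper column n and the object name wide are made up for illustration, and the resulting column names follow pivot_wider's default pattern, not the names in the desired output.

```r
library(dplyr)
library(tidyr)

sample_data <- data.frame(
  route_shortname = c("A", "A", "B", "B", "C", "C", "C"),
  transfer_to     = c("D", "E", "F", "G", "D", "E", "G"),
  pct             = c(0.5, 0.5, 2/3, 1/3, 1/9, 1/9, 1/9),
  stringsAsFactors = FALSE
)

wide <- sample_data %>%
  group_by(route_shortname) %>%
  slice_head(n = 3) %>%            # at most three transfers per route
  mutate(n = row_number()) %>%     # position within the route: 1, 2, 3
  ungroup() %>%
  pivot_wider(names_from = n, values_from = c(transfer_to, pct))

names(wide)
```

Routes with fewer than three transfers get NA in the unused columns, as in the spread() version.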
Though you tagged this question with tidyverse packages, here is an option using dcast from data.table which lets you do the reshaping in one (admittedly long) line.
library(data.table)
setDT(sample_data)
dcast(sample_data, route_shortname ~ rowid(route_shortname), value.var = c('transfer_to', 'pct'))
# route_shortname transfer_to_1 transfer_to_2 transfer_to_3 pct_1 pct_2 pct_3
#1: A D E <NA> 0.5000000 0.5000000 NA
#2: B F G <NA> 0.6666667 0.3333333 NA
#3: C D E G 0.1111111 0.1111111 0.1111111
You could also use reshape from base R
sample_data <- as.data.frame(sample_data) # does not work with tibbles for some reason
sample_data$idx <- with(sample_data,
ave(route_shortname, route_shortname, FUN = seq_along))
reshape(sample_data, idvar = "route_shortname", timevar = "idx", direction = "wide", sep = "_")
# route_shortname transfer_to_1 pct_1 transfer_to_2 pct_2 transfer_to_3 pct_3
#1 A D 0.5000000 E 0.5000000 <NA> NA
#3 B F 0.6666667 G 0.3333333 <NA> NA
#5 C D 0.1111111 E 0.1111111 G 0.1111111
In both cases you'd need to rename columns, but that shouldn't be too hard.
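As a rough sketch of that renaming step, the reshape() result can be mapped onto the names from the desired output with two sub() calls. The regular expressions assume the transfer_to_N / pct_N pattern shown above; the object names wide and new_names are made up for illustration.

```r
sample_data <- data.frame(
  route_shortname = c("A", "A", "B", "B", "C", "C", "C"),
  transfer_to     = c("D", "E", "F", "G", "D", "E", "G"),
  pct             = c(0.5, 0.5, 2/3, 1/3, 1/9, 1/9, 1/9),
  stringsAsFactors = FALSE
)

# Running index within each route (1, 2, 3, ...)
sample_data$idx <- ave(seq_along(sample_data$route_shortname),
                       sample_data$route_shortname, FUN = seq_along)

wide <- reshape(sample_data, idvar = "route_shortname", timevar = "idx",
                direction = "wide", sep = "_")

# transfer_to_1 -> transfer1, pct_1 -> transfer1_pct, and so on
new_names <- sub("^transfer_to_([0-9]+)$", "transfer\\1", names(wide))
new_names <- sub("^pct_([0-9]+)$", "transfer\\1_pct", new_names)
names(wide) <- new_names

wide
```

The same two sub() calls would work on the dcast result, although its columns come out in a different order.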