Assign to a value its column name - r

I'm working with dplyr, and I have a tibble like this:
df <- tibble(one = c(1, NA, 1),
             two = c(NA, NA, 1),
             three = c(1, 1, 1)
)
----------------------------
# A tibble: 3 x 3
one two three
<dbl> <dbl> <dbl>
1 1 NA 1
2 NA NA 1
3 1 1 1
----------------------------
I need to obtain something like this:
----------------------------
# A tibble: 3 x 3
one two three
<dbl> <dbl> <dbl>
1 one NA three
2 NA NA three
3 one two three
----------------------------
I can use ifelse() inside mutate() for each column:
df %>%
  mutate(one = ifelse(!is.na(one), 'one', NA),
         two = ifelse(!is.na(two), 'two', NA),
         three = ifelse(!is.na(three), 'three', NA))
But my real df has many columns, so this way is very inefficient.
I need something more elegant, using mutate_at and the column name, but it seems very hard.
I have tried many approaches, but every time I get an error.
Any advice?

If there are only 1s and NAs in the dataset, multiply col(df) by the dataset and unlist the result. This gives the column index wherever the value is 1 and NA elsewhere, so it can be used to index names(df), and the result can be assigned back to the original data:
df[] <- names(df)[unlist(col(df)*df)]
df
# A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
#1 one <NA> three
#2 <NA> <NA> three
#3 one two three
Or, with the tidyverse, we can do this for each column (map2_df from purrr):
library(tidyverse)
df %>%
map2_df(names(.), ~replace(., !is.na(.), .y))
# A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
#1 one <NA> three
#2 <NA> <NA> three
#3 one two three
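The question asked for something in the spirit of mutate_at. A minimal sketch of that idea, assuming dplyr 1.0 or later, uses across() together with cur_column(), which returns the name of the column currently being processed. The name labelled is only illustrative.

```r
library(dplyr)

df <- tibble(one = c(1, NA, 1),
             two = c(NA, NA, 1),
             three = c(1, 1, 1))

# For every column, keep NA where the value is missing and
# write the column's own name everywhere else.
labelled <- df %>%
  mutate(across(everything(),
                ~ ifelse(is.na(.x), NA_character_, cur_column())))

labelled
```

Unlike the col(df) approach, this does not rely on the non-missing values being exactly 1.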

Related

How to remove missing values in summarise_all dplyr [duplicate]

This question already has answers here:
combine rows in data frame containing NA to make complete row
I'm having trouble excluding missing values in the summarise_all function.
I have a dataset (df) as shown below, and basically I'm having two problems:
excluding missing values, so that the output is only one number per column
additional rows with the same IDs that hold 'TRUE' instead of data (every second row of the output of my code below)
The "dream dataset" df1 further down is the one I'm trying to get to.
Here's the whole enchilada:
df #the original dataset
ID type of data genes1 genes2 genes3 ...
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1
...
df1 <- df %>% group_by(df$ID) %>% summarize_all(list, na.rm= TRUE) #my code
#output
ID type of data genes1 genes2 genes3 ...
1 c("new","old","suggested") c(2,NA,NA) c(0,NA,NA) c(2,NA,NA)
1 TRUE TRUE TRUE TRUE
2 c("new","old","suggested") c(1,NA,NA) c(1,NA,NA) c(1,NA,NA)
2 TRUE TRUE TRUE TRUE
...
#my main concern is the "genes" type of data and the rows with same IDs and NA values, I wanted something like this
df1 #dream dataset
ID type of data genes1 genes2 genes3 ...
1 #doesn't matter 2 0 2
2 #doesn't matter 1 1 1
...
I also tried using na.omit in summarise_all but it didn't really fix anything.
Does anybody have any ideas on how to fix it?
You could do:
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(starts_with('genes'), ~.[!is.na(.)]))
#> # A tibble: 2 × 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1
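The lambda above returns every non-missing value in a group, which is exactly one value per group in the example data. A hedged variation, assuming a group might hold zero or several non-missing values, takes only the first one. The helper first_non_na is a hypothetical name, and the data frame is rebuilt from the question so the snippet runs on its own.

```r
library(dplyr)

df <- tibble(
  ID     = c(1, 1, 1, 2, 2, 2),
  type   = rep(c("new", "old", "suggested"), 2),
  genes1 = c(2, NA, NA, 1, NA, NA),
  genes2 = c(NA, 0, NA, NA, 1, NA),
  genes3 = c(NA, NA, 2, NA, NA, 1)
)

# Hypothetical helper: first non-missing value, or NA if there is none.
first_non_na <- function(x) x[!is.na(x)][1]

collapsed <- df %>%
  group_by(ID) %>%
  summarise(across(starts_with("genes"), first_non_na))

collapsed
```

Indexing with [1] always yields a single value, so each group contributes exactly one row.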
Another way, using fill() from tidyr:
library(dplyr)
library(tidyr)
df[-2] |>
group_by(ID) |>
fill(genes1:genes3, .direction = "downup") |>
slice(1)
ID genes1 genes2 genes3
<int> <int> <int> <int>
1 1 2 0 2
2 2 1 1 1
An alternative approach based on the coalesce() function from dplyr
In the code below, we remove the type variable, since the OP indicated it isn't needed in the output. We then group_by() ID so that summarise() works on each ID's rows separately. Within each group, across() passes one gene column at a time to the coalesce_by_column() function we define, which converts that column's values into a list with one element per row.
We can then pass this list to coalesce(). coalesce() takes a set of vectors and, for each index, returns the first non-NA value across them. Here every list element has length one, so the result is the first non-NA value of that column for that ID, which collapses the group's rows into a single row.
Usually we would have to pass each vector to coalesce() as its own argument, but we can use the splice operator !!! (https://stackoverflow.com/questions/61180201/triple-exclamation-marks-on-r) to pass each element of our list as its own argument. See the last example in ?"!!!" for a demonstration.
library(dplyr)
library(tidyr)
# Coalesce the values of one column (within one group) into a single value
coalesce_by_column <- function(x) {
  coalesce(!!!as.list(x))
}
# Collapse each ID's rows into one row
df %>%
  select(-type) %>%
  group_by(ID) %>%
  summarise(across(everything(), coalesce_by_column))
#> # A tibble: 2 x 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1
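To make the behaviour of coalesce() and the splice operator concrete, here is a small standalone sketch. The vectors x, y and z are made up for illustration and are not taken from the question's data.

```r
library(dplyr)

x <- c(2, NA, NA)
y <- c(NA, 0, NA)
z <- c(NA, NA, 5)

# Position by position, take the first non-missing value.
by_position <- coalesce(x, y, z)

# The same call with the vectors stored in a list and spliced in with !!!.
vecs <- list(x, y, z)
spliced <- coalesce(!!!vecs)

# Splicing a single column's values: every element has length one,
# so the result is the first non-missing value of that column.
single <- coalesce(!!!as.list(c(NA, 7, NA)))
```

The last call mirrors what happens to each gene column within one ID group.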
If you are not worried about the type column you can do something like this
library(tidyverse)
" ID type genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1" %>%
read_table() -> df
df %>%
pivot_longer(-c(ID, type)) %>%
drop_na(value) %>%
select(-type) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 2 × 4
ID genes1 genes2 genes3
<dbl> <dbl> <dbl> <dbl>
1 1 2 0 2
2 2 1 1 1
If you want to keep the "type of data" column while using summarise, you can use the following code:
df <- read.table(text = "ID type_of_data genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1", header = TRUE)
library(dplyr)
library(tidyr)
df1 <- df %>%
group_by(ID) %>%
summarise(across(starts_with("genes"), na.omit),
type_of_data = type_of_data[genes1]) %>%
ungroup()
df1
#> # A tibble: 2 × 5
#> ID genes1 genes2 genes3 type_of_data
#> <int> <int> <int> <int> <chr>
#> 1 1 2 0 2 old
#> 2 2 1 1 1 new
Created on 2022-07-26 by the reprex package (v2.0.1)

How to remove variables from a dataset that are actually named NA?

I have an Excel dataset that I have to work on. The problem is that empty cells were filled with the text NA instead of being left empty.
I'm trying to remove NA values from the dataset. Usually I could use is.na() to find and omit them, but here they are the string "NA" rather than real missing values, so I don't know how to go about this.
Any ideas to point me in the right direction?
You can try something like below:
library(dplyr)
df %>% mutate(across(everything(), ~ na_if(., 'NA'))) %>% na.omit()
# A tibble: 3 x 2
Drinks ranked
<chr> <chr>
1 A 1
2 C 2
3 C 1
Data used:
df
# A tibble: 5 x 2
Drinks ranked
<chr> <chr>
1 A 1
2 B NA
3 NA 1
4 C 2
5 C 1
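For comparison, the same clean-up can be sketched in base R, assuming every column is character as in the example data. The data frame below is rebuilt from the "Data used" listing. If the data comes from readxl::read_excel(), its na argument (for example na = "NA") may avoid the problem at import time, though that depends on how the file was written.

```r
# Example data with the literal string "NA" standing in for missing values.
df <- data.frame(
  Drinks = c("A", "B", "NA", "C", "C"),
  ranked = c("1", "NA", "1", "2", "1"),
  stringsAsFactors = FALSE
)

# Turn the string "NA" into a real missing value in every column ...
df[df == "NA"] <- NA

# ... then drop the rows that contain any missing value.
clean <- na.omit(df)
clean
```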

R: Joining Tibbles Using Derived Column Values

Consider the following tibbles:
library(tidyverse)
tbl_base_ids = tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"), base_id_length = c(3, 6, 9), record_id_length = c(10, 12, 15))
tbl_records = tibble(record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456"))
I'd like to join matching rows to produce a third tibble:
tbl_records_with_base
record_id
base_id
base_id_length
record_id_length
As you can see, this is not a matter of joining one or more variables from each of the first two. This requires matching variable derivatives. In SQL, I'd do this:
SELECT A.record_id,
B.base_id,
B.base_id_length,
B.record_id_length
FROM tbl_records A
JOIN tbl_base_ids B
ON LENGTH(a.record_id) = B.record_id_length
AND LEFT(a.record_id, B.base_id_length) = B.base_id
I've tried variations of dplyr joins and the match function, but to no avail. Can someone help? Thank you.
You should come up with some logic to separate base_id from record_id, because joining only on record_id_length would not be enough. For this example, we can get base_id by removing all the digits from record_id. Depending on your actual dataset, you may need to change this.
Once we do that we can join tbl_records with tbl_base_ids by base_id and record_id_length.
library(dplyr)
tbl_records %>%
mutate(base_id = sub('\\d+', '', record_id),
record_id_length = nchar(record_id)) %>%
inner_join(tbl_base_ids, by = c("base_id", "record_id_length")) -> result
result
# record_id base_id record_id_length base_id_length
# <chr> <chr> <dbl> <dbl>
#1 ABC1234567 ABC 10 3
#2 ABCDEF123456 ABCDEF 12 6
#3 ABCDEFGHI123456 ABCDEFGHI 15 9
I suggest using the fuzzyjoin package.
library(dplyr)
library(fuzzyjoin)
tbl_base_ids %>%
mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
regex_full_join(tbl_records, ., by = c("record_id" = "record_ptn"))
# # A tibble: 3 x 5
# record_id base_id base_id_length record_id_length record_ptn
# <chr> <chr> <dbl> <dbl> <chr>
# 1 ABC1234567 ABC 3 10 ^ABC.{7}$
# 2 ABCDEF123456 ABCDEF 6 12 ^ABCDEF.{6}$
# 3 ABCDEFGHI123456 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$
A note about this: the order of tables matters, where the regex must reside on the RHS of the by= settings. For instance, this does not work if we reverse it:
tbl_base_ids %>%
mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
regex_full_join(., tbl_records, by = c("record_ptn" = "record_id"))
# # A tibble: 6 x 5
# base_id base_id_length record_id_length record_ptn record_id
# <chr> <dbl> <dbl> <chr> <chr>
# 1 ABC 3 10 ^ABC.{7}$ <NA>
# 2 ABCDEF 6 12 ^ABCDEF.{6}$ <NA>
# 3 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$ <NA>
# 4 <NA> NA NA <NA> ABC1234567
# 5 <NA> NA NA <NA> ABCDEF123456
# 6 <NA> NA NA <NA> ABCDEFGHI123456
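A more literal translation of the SQL in the question is a cross join followed by a filter on the two derived conditions. The sketch below assumes the tables are small enough for the Cartesian product to fit in memory. It uses base merge() with by = NULL for the cross join, and nchar() and substr() stand in for SQL's LENGTH and LEFT.

```r
library(dplyr)

tbl_base_ids <- tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"),
                       base_id_length = c(3, 6, 9),
                       record_id_length = c(10, 12, 15))
tbl_records <- tibble(record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456"))

# by = NULL makes merge() return every combination of rows (a cross join).
tbl_records_with_base <- merge(tbl_records, tbl_base_ids, by = NULL) %>%
  # Keep only the pairs that satisfy both ON conditions from the SQL query.
  filter(nchar(record_id) == record_id_length,
         substr(record_id, 1, base_id_length) == base_id) %>%
  arrange(record_id)

tbl_records_with_base
```

This does not depend on the IDs being letters followed by digits, but the intermediate table has nrow(tbl_records) * nrow(tbl_base_ids) rows.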

How do you remove rows from a tibble where the non-missing values match a subset of values in other rows?

I am looking for an efficient way to remove rows of a tibble whose non-missing values are identical to the corresponding values in another row. Consider this fake example:
library(tidyverse)
phony_genes <- tribble(
~mouse_entrez, ~mgi_symbol, ~human_entrez, ~hgnc_symbol,
1, "a", 2 , "A",
1, "a", 2 , NA,
1, NA, 2 , "A",
1, "a", 3 , NA,
4, "b", 3 , NA,
5, NA, 2 , "A"
)
Row 2 is a subset of row 1, because each non-missing value in row 2 is the same as in row 1. The same goes for row 3, but a different value is missing. I am looking for a way that uses the tidyverse (or other packages) to filter out rows 2 and 3, but keep the other rows. I can't filter out the NA values in hgnc_symbol or mgi_symbol because in both cases I will lose rows that I want to keep. I can't group by mouse_entrez and filter away the NA values within the groups because I want to keep row 4. This simple example could of course be expanded to a huge tibble. I could probably do this by coding something myself, but I am wondering if anyone has an elegant solution.
library(dplyr)
phony_genes %>%
group_by(mouse_entrez, mgi_symbol, human_entrez) %>%
arrange_all(~ is.na(.)) %>%
slice(1)
# # A tibble: 4 x 4
# # Groups: mouse_entrez, mgi_symbol, human_entrez [4]
# mouse_entrez mgi_symbol human_entrez hgnc_symbol
# <dbl> <chr> <dbl> <chr>
# 1 1 a 2 A
# 2 1 a 3 <NA>
# 3 4 b 3 <NA>
# 4 5 c 2 A
Here's a way to do it using the tidyverse:
library(dplyr)
library(purrr)
phony_genes %>%
mutate(col = pmap(., ~na.omit(c(...)))) %>%
filter(!map_lgl(seq_along(col), function(x)
any(map_lgl(col[-x], function(y) all(col[[x]] %in% y))))) %>%
select(-col)
# mouse_entrez mgi_symbol human_entrez hgnc_symbol
# <dbl> <chr> <dbl> <chr>
#1 1 a 2 A
#2 1 a 3 NA
#3 4 b 3 NA
#4 5 NA 2 A
Using pmap, we collect all the values in a row into a character vector with the NA values removed. Then, for each row, we check whether all of its values are contained in some other row, and remove such rows using filter.
You can group by all columns except the ones where you don't want to remove anything & then remove missing values where total count > 1, e.g.:
phony_genes %>%
group_by(mouse_entrez, human_entrez) %>%
filter_at(vars(2, 4), all_vars(!(is.na(.) & n() > 1)))
Output:
# A tibble: 4 x 4
# Groups: mouse_entrez, human_entrez [4]
mouse_entrez mgi_symbol human_entrez hgnc_symbol
<dbl> <chr> <dbl> <chr>
1 1 a 2 A
2 1 a 3 NA
3 4 b 3 NA
4 5 NA 2 A
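The "subset" rule from the question can also be written out directly in base R. The sketch below is one reading of that rule and not any answerer's method: a row is dropped when another row agrees with it at every position where it has a value and has strictly more non-missing values. The function name is_redundant is hypothetical, and the pairwise comparison is quadratic in the number of rows, so it may be slow on a huge tibble.

```r
phony_genes <- data.frame(
  mouse_entrez = c(1, 1, 1, 1, 4, 5),
  mgi_symbol   = c("a", "a", NA, "a", "b", NA),
  human_entrez = c(2, 2, 2, 3, 3, 2),
  hgnc_symbol  = c("A", NA, "A", NA, NA, "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows that are covered by a more complete row.
is_redundant <- function(df) {
  # Compare everything as text so mixed column types can share one matrix.
  m <- do.call(cbind, lapply(df, as.character))
  n <- nrow(m)
  vapply(seq_len(n), function(i) {
    known <- !is.na(m[i, ])
    any(vapply(setdiff(seq_len(n), i), function(j) {
      # Row j must match row i wherever row i has a value ...
      agrees <- all(!is.na(m[j, known]) & m[j, known] == m[i, known])
      # ... and must carry more information than row i.
      agrees && sum(!is.na(m[j, ])) > sum(known)
    }, logical(1)))
  }, logical(1))
}

kept <- phony_genes[!is_redundant(phony_genes), ]
kept
```

Because values are compared column by column, a value that appears in a different column of another row does not count as a match.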

dplyr - mutate with variable column names

I have a tibble containing time series of various blood parameters, such as CRP, over the course of several days. The tibble is tidy, with each time series in one column, plus a column for the day of measurement. The tibble contains another column with the day of infection. I want to replace each blood parameter with NA if the Day variable is greater than or equal to InfectionDay. Since I have a lot of variables, I'd like a function that accepts the column name dynamically and creates a new column name by appending "_censored" to the old one. I've tried the following:
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !!colname, NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
Running this, I expected
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 3
2 2 3 2 2
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
but I get
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 CRP
2 2 3 2 CRP
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
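In the original function, !!colname unquotes the string itself, so ifelse() returns the text "CRP" instead of the column's values. You can wrap the column name in sym() inside mutate() to convert the string to a symbol before unquoting it: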
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !! sym(colname), NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
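Another way to write the same function, sketched here under the assumption of dplyr 1.0 or later, uses the .data pronoun to look up a column by its string name and a glue-style name on the left of :=. The function name censor_infection is a hypothetical variant of the one in the question.

```r
library(dplyr)

# Hypothetical variant of censor.infection() using the .data pronoun.
censor_infection <- function(df, colname) {
  df %>%
    mutate("{colname}_censored" := if_else(Day < InfectionDay,
                                           .data[[colname]],
                                           NA_real_))
}

data <- tibble(Day = 1:5, InfectionDay = 3, CRP = c(3, 2, 5, 4, 1))
censored <- censor_infection(data, "CRP")
censored
```

if_else() requires both branches to have the same type, so NA_real_ is used here on the assumption that the blood parameters are stored as doubles.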
We can select columns on which we want to apply the function (cols) and use mutate_at which will also automatically rename the columns. Added an extra column in the data to show renaming.
library(dplyr)
cols <- c("CRP", "CRP1")
data %>%
mutate_at(cols, list(censored = ~replace(., Day >= InfectionDay, NA)))
# A tibble: 5 x 6
# Day InfectionDay CRP CRP1 CRP_censored CRP1_censored
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 3 3 3 3 3
#2 2 3 2 2 2 2
#3 3 3 5 5 NA NA
#4 4 3 4 4 NA NA
#5 5 3 1 1 NA NA
data
data <- tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1), CRP1 = c(3,2,5,4,1))
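In current dplyr, the scoped verbs such as mutate_at are superseded by across(). A sketch of the equivalent call, assuming dplyr 1.0 or later, uses the .names argument to build the "_censored" suffix. The result name censored is illustrative.

```r
library(dplyr)

data <- tibble(Day = 1:5, InfectionDay = 3,
               CRP = c(3, 2, 5, 4, 1), CRP1 = c(3, 2, 5, 4, 1))
cols <- c("CRP", "CRP1")

censored <- data %>%
  mutate(across(all_of(cols),
                # Blank out values measured on or after the infection day.
                ~ replace(.x, Day >= InfectionDay, NA),
                # Keep the originals and add new columns with a suffix.
                .names = "{.col}_censored"))

censored
```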

Resources