Changing values conditionally only for numerical variables in a dataframe - R

Imagine I have a dataframe. This dataframe consists of numerical and non-numerical variables.
For all the numerical variables I would like to do the following:
IF the value is bigger than the mean of the column it is in, change the value to "high";
ELSE change it to "low".
I have come very close to the solution with the following line of code:
df <- mutate_if(df, is.numeric, funs(ifelse(. > mean(.), "high", "low")))
However, I am aware that the mean(.) part is incorrect. So my question is:
How can I correct this part so I get the mean of the corresponding variable where . is in?
Also, I am assuming the rest is correct. If this is not the case I would appreciate someone telling me so I can try to correct it!
Here is an illustration of what I am trying to achieve:
duration amount sex
       6      2   F
       5      2   M
       3      9   M
       2      3   M
should become:
duration amount sex
    high    low   F
    high    low   M
     low   high   M
     low    low   M
EDIT:
The accepted answer made me realize my code was correct in the end!
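Indeed, the . in mean(.) already refers to the whole column being mutated, not the single current value, so the column mean is computed correctly. A minimal check against the example data (my own sketch, not part of the original question):
library(dplyr)
df <- data.frame(duration = c(6L, 5L, 3L, 2L), amount = c(2L, 2L, 9L, 3L),
                 sex = c("F", "M", "M", "M"))
sapply(df[c("duration", "amount")], mean)                # both column means are 4
mutate_if(df, is.numeric, ~ ifelse(. > mean(.), "high", "low"))
# reproduces the high/low table shown above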

In newer versions of dplyr (version >= 1.0), we use mutate with across, as the suffixed variants (mutate_if, mutate_at, mutate_all) are superseded:
library(dplyr)
df %>%
  mutate(across(where(is.numeric),
                ~ case_when(. > mean(., na.rm = TRUE) ~ "high", TRUE ~ "low")))
Output:
#  duration amount sex
#1     high    low   F
#2     high    low   M
#3      low   high   M
#4      low    low   M
Or with ifelse
df %>%
  mutate(across(where(is.numeric),
                ~ ifelse(. > mean(., na.rm = TRUE), "high", "low")))
Or using the syntax from previous dplyr versions:
df %>%
mutate_if(is.numeric, ~ ifelse(. > mean(.), "high", "low"))
data
df <- structure(list(duration = c(6L, 5L, 3L, 2L),
                     amount = c(2L, 2L, 9L, 3L),
                     sex = c("F", "M", "M", "M")),
                class = "data.frame", row.names = c(NA, -4L))
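For completeness, the same transformation is also possible without dplyr; a base R sketch (my own addition, not from the original answers):
num <- sapply(df, is.numeric)   # which columns are numeric
df[num] <- lapply(df[num], function(x) ifelse(x > mean(x, na.rm = TRUE), "high", "low"))
df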

Related

How to create a chord diagram in R?

I've never made a plot like this before, so apologies if this is a basic question. I am stuck on how to make a chord diagram in which the outer sections are my column headings (drug mechanisms) and the inner connections between the sections come from the rows (genes); the genes don't need to be named in the plot, as there are so many.
My data is rows of genes that are marked as interacting with columns of drug mechanisms by zeros or ones.
For example, a subset of my data looks like:
Gene  Diuretic Beta_blocker ACE_inhibitor
Gene1        1            0             0
Gene2        0            0             1
Gene3        1            1             1
Gene4        0            1             1
My total data is actually 700 genes by 15 columns of drug mechanisms, all zeros and ones. I am currently just creating a chord diagram with:
library(data.table) # fread
library(dplyr)
library(magrittr)   # set_rownames
library(circlize)   # circos.par, chordDiagram

df <- fread('df.csv')
df[is.na(df)] <- 0
df <- df %>% data.frame %>% set_rownames(.$Gene) %>% dplyr::select(-Gene)
mt <- as.matrix(df)
circos.par(gap.degree = 0.9) # set this as I was otherwise getting an error with my total data
chordDiagram(mt, transparency = 0.5)
With my total data the plot looks like this: [image omitted]
I've been getting various errors while trying to get this plot to have only 15 sections (and even just trying to get the sections to carry the column names).
Is there a way for me to plot a chord diagram with the sections representing each column, and to show a link for genes/rows that have an interaction (a 1 in the data) with that section and any other section? I don't need the gene names to be visible; I am looking to just visualize the amount of overlap between my columns/sections.
Example input data (for which my goal would be a diagram with only 3 sections, one per column, showing their overlap):
df <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"),
                     Diuretic = c(1L, 0L, 1L, 0L),
                     Beta_blocker = c(0L, 0L, 1L, 1L),
                     ACE_inhibitor = c(0L, 1L, 1L, 1L)),
                row.names = c(NA, -4L),
                class = c("data.table", "data.frame"))
If you have 15 different drug mechanisms, it would be best to count the genes that the various mechanisms have in common, and use these counts as weights for the links between the drug mechanisms.
Your sample data is too limited to give a feel for how this would look, but the code would be something like this:
library(dplyr)

new_df <- apply(df, 1, function(x) {
  x <- names(df)[which(x == 1)]                       # mechanisms this gene acts on
  m <- 1 - diag(length(x))
  dimnames(m) <- list(x, x)
  inds <- which(lower.tri(m), arr.ind = TRUE)
  data.frame(from = x[inds[, 1]], to = x[inds[, 2]])  # every pair of those mechanisms
}) %>%
  bind_rows() %>%
  mutate(wt = 1) %>%
  group_by(from, to) %>%
  summarize(wt = sum(wt), .groups = 'drop')

new_df
#> # A tibble: 3 x 3
#>   from          to              wt
#>   <chr>         <chr>        <dbl>
#> 1 ACE_inhibitor Beta_blocker     2
#> 2 ACE_inhibitor Diuretic         1
#> 3 Beta_blocker  Diuretic         1
We can see that two genes have a common action on the ACE inhibitor and beta blocker mechanisms (which is what your table implies), and that a single gene links diuretic to each of beta blocker and ACE inhibitor.
This produces the following rather dull chord diagram:
chordDiagram(new_df)
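As an aside, because the input is a 0/1 matrix, the same pairwise weights can be computed with crossprod, and chordDiagram also accepts such a matrix directly; a minimal sketch of this variant (my own, not part of the original answer):
library(circlize)
m <- as.matrix(data.frame(df)[, -1])  # drop the Gene column
co <- crossprod(m)                    # genes shared by each pair of mechanisms
diag(co) <- 0                         # drop self-links (diagonal holds per-mechanism totals)
co                                    # off-diagonals match the wt column above
chordDiagram(co, symmetric = TRUE)    # draw each pair once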
However, if we make a sample data set that is of the same scale as your real data, we get a more satisfactory result:
set.seed(123)
big_dat <- as.data.frame(matrix(rbinom(15 * 700, 1, 0.5), 700),
                         row.names = paste0('Gene', 1:700)) %>%
  setNames(c('ACE_inhibitor', 'Diuretic', 'Beta_Blocker',
             'CCB', 'Nitrate', 'K_channel', 'Aldosterone_blocker',
             'Vasodilator', 'PDEI', 'Central', 'Relaxant',
             'ARB', 'Alpha_blocker', 'Dopaminergic', 'Unknown'))
big_df <- apply(big_dat, 1, function(x) {
  x <- names(big_dat)[which(x == 1)]
  m <- 1 - diag(length(x))
  dimnames(m) <- list(x, x)
  inds <- which(lower.tri(m), arr.ind = TRUE)
  data.frame(from = x[inds[, 1]], to = x[inds[, 2]])
}) %>%
  bind_rows() %>%
  mutate(wt = 1) %>%
  subset(complete.cases(.)) %>%
  group_by(from, to) %>%
  summarize(wt = sum(wt), .groups = 'drop')
chordDiagram(big_df)
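One practical note when iterating on these plots (my addition, not from the original answer): circlize keeps graphical parameters between calls, so reset the layout before redrawing:
circos.clear()               # clear any circos.par() state from a previous plot
circos.par(gap.degree = 2)   # slightly larger gaps can help with 15 sectors
chordDiagram(big_df)
circos.clear()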

Conditionally replace values with NA in R

I'm trying to conditionally replace values with NA in R.
Here's what I've tried so far using dplyr package.
Data
have <- data.frame(id = 1:3,
                   gender = c("Female", "I Do Not Wish to Disclose", "Male"))
First try
want = as.data.frame(have %>%
  mutate(gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
)
This gives me an error.
Second try
want = as.data.frame(have %>%
  mutate(gender = ifelse(gender == "I Do Not Wish to Disclose", NA, gender))
)
This runs without an error but turns Female into 1, Male into 3 and I Do Not Wish to Disclose into 2...
This is a case where the column is a factor. Convert it to character and it should work:
library(dplyr)
have %>%
  mutate(gender = as.character(gender),
         gender = replace(gender, gender == "I Do Not Wish to Disclose", NA))
The change in the gender values happens when the factor is coerced to its integer storage codes:
as.integer(factor(c("Male", "Female", "Male")))
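The same fix also rescues the original ifelse attempt: wrap the factor in as.character so ifelse returns the labels rather than the integer codes (a minimal sketch):
have %>%
  mutate(gender = ifelse(gender == "I Do Not Wish to Disclose", NA, as.character(gender)))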
I would use the very neat function na_if() from dplyr.
library(dplyr)
have <- data.frame(gender = c("F", "M", "NB", "I Do Not Wish to Disclose"))
have |> mutate(gender2 = na_if(gender, "I Do Not Wish to Disclose"))
Output:
#>                      gender gender2
#> 1                         F       F
#> 2                         M       M
#> 3                        NB      NB
#> 4 I Do Not Wish to Disclose    <NA>
Created on 2022-04-19 by the reprex package (v2.0.1)

How to use weighting in dplyr package

I don't understand how weighting works in the dplyr::sample_n function. I have a list of very small numbers (ranging from 0.1020457 to 0.1789028) and I need to weight my sampling so that I get some on the lower end, some on the upper end and some in the middle. But since the numbers are so similar, I'm not sure how to do it. I also don't want to restrict my sampling to a certain range (e.g. numbers > 0.16); I just want those values to be far more likely to be sampled.
I can make the range larger (-1.552115 to 2.008253), but that means scaling my data, and I can't weight with negative numbers, so I have to do things like abs(numbers - maximum). Here's an example of what I'm doing:
sample_n(data.frame(scaledMeasurement$V1), 4,
         replace = FALSE,
         weight = abs(scaledMeasurement$V1 - max(scaledMeasurement$V1)))
Here's a section of my data:
Measurement  ID
 0.8022473    1
 1.6991193    2
 0.7262765    3
 0.3903775    4
-1.5521155    5
-0.7821887    6
If your goal is to get a sample that contains some values on the low end, some near the median, and some on the high end, it's far easier to avoid weights and just work with group_by + sample_n.
library(tidyverse)
df = tibble(my_nums = runif(10,0.1020457,0.1789028))
df %>%
  mutate(quantile = case_when(
    my_nums <= quantile(my_nums, probs = c(0.33)) ~ "a",
    my_nums <= quantile(my_nums, probs = c(0.67)) ~ "b",
    TRUE ~ "c"
  )) %>%
  group_by(quantile) %>%
  sample_n(2)
Produces:
  my_nums quantile
    <dbl> <chr>
1   0.105 a
2   0.105 a
3   0.151 b
4   0.124 b
5   0.173 c
6   0.172 c
However, if you wanted to use weights, sample_n requires that the weights be non-negative and the same length as the data (they are standardised internally to sum to 1). You could add a weight column based on a subdivision of your data into groups (quantiles, as shown above): group by that, generate a random rank between one and the group length, ungroup, and divide the values in that column by their sum. Like so:
df %>%
  mutate(quantile = case_when(
    my_nums <= quantile(my_nums, probs = c(0.33)) ~ "a",
    my_nums <= quantile(my_nums, probs = c(0.67)) ~ "b",
    TRUE ~ "c"
  )) %>%
  group_by(quantile) %>%
  mutate(weight = sample(seq(1, length(my_nums)), length(my_nums))) %>%
  ungroup %>%
  arrange(quantile) %>%
  mutate(weight = weight / sum(weight)) %>%
  sample_n(6, weight = weight)
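As a side note, sample_n is superseded in current dplyr; on dplyr >= 1.0 the grouped sampling can be written with slice_sample, which also accepts a weight_by argument (my paraphrase of the approach above, not from the original answer):
df %>%
  mutate(quantile = case_when(
    my_nums <= quantile(my_nums, probs = 0.33) ~ "a",
    my_nums <= quantile(my_nums, probs = 0.67) ~ "b",
    TRUE ~ "c"
  )) %>%
  group_by(quantile) %>%
  slice_sample(n = 2)   # two values from each stratum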

Conditional selection of repeated measures from data frame

I have data with repeat measurements on each subject (id) at a variable number of timepoints. I would like to retain two rows for each subject: timepoint == 0 and the timepoint closest to 4. In the case of two candidate timepoints equally distant from 4, e.g. (3, 5), I want to choose the lowest (3).
As shown in the 'choice' column of my original data (image omitted), rows marked with "x" would not be retained.
dat <- structure(list(id = c(172507L, 172507L, 172507L, 172525L, 172525L,
                             172525L, 172526L, 172526L, 172526L, 172527L,
                             172527L, 172527L, 172527L, 172527L),
                      timepoint = c(0L, 2L, 6L, 0L, 4L, 5L, 0L, 5L, 2L,
                                    2L, 3L, 5L, 6L, 0L)),
                 class = "data.frame", row.names = c(NA, -14L))
We could arrange by id and timepoint and, for every group, select the first occurrence of timepoint == 0 and the row with the minimum absolute value of 4 - timepoint. Since we have arranged by timepoint, which.min will select the first, i.e. lower, timepoint in case of a tie.
library(dplyr)
dat %>%
  arrange(id, timepoint) %>%
  group_by(id) %>%
  slice(c(which.max(timepoint == 0), which.min(abs(4 - timepoint))))
#       id timepoint
#    <int>     <int>
#1  172507         0
#2  172507         2
#3  172525         0
#4  172525         4
#5  172526         0
#6  172526         5
#7  172527         0
#8  172527         3
You can do something like this. Arranging by the distance and then the timepoint puts the smallest closest value first. Then you can use the first() function to grab the first value, or filter for when the timepoint is zero.
library(tidyverse)
dat %>%
  mutate(dist = abs(4 - timepoint)) %>%
  arrange(id, dist, timepoint) %>%
  group_by(id) %>%
  filter(timepoint %in% c(0, first(timepoint))) %>%
  ungroup() %>%
  arrange(id, timepoint)
Here's the data.table solution. It relies on the assumption that each ID has a timepoint of 0; otherwise, you should use which.max(timepoint == 0). Credit to Ronak Shah for the which.min approach.
Edit: changed to match(TRUE, timepoint == 0) and fixed an issue in the base R approach.
library(data.table)
dt <- as.data.table(dat)
dt[order(timepoint),
   .SD[c(match(TRUE, timepoint == 0), which.min(abs(4 - timepoint)))],
   by = id]
For kicks, here's base R (grouping by id, with the data ordered by timepoint first):
ord <- dat[order(dat$timepoint), ]
do.call(rbind,
        by(ord, ord$id,
           function(x) x[c(match(TRUE, x$timepoint == 0), which.min(abs(4 - x$timepoint))), ]))
Something like this should work:
zeros <- dat %>%
  filter(timepoint == 0) %>%
  transmute(id, timepoint)

nonzeros <- dat %>%
  filter(timepoint != 0) %>%
  mutate(diff = abs(timepoint - 4)) %>%
  group_by(id) %>%
  filter(diff == min(diff)) %>%
  arrange(timepoint) %>%
  slice(1) %>%
  ungroup() %>%
  transmute(id, timepoint)

df <- bind_rows(zeros, nonzeros) %>%
  arrange(id, timepoint)
There is probably a way to do this in one pipe but I had an easier time visualizing what's going on this way.
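For what it's worth, a condensed single-pipe sketch of the same logic (close in spirit to the slice() answer above; my own variant):
dat %>%
  arrange(id, abs(4 - timepoint), timepoint) %>%  # best candidate first; ties go to the lower timepoint
  group_by(id) %>%
  filter(timepoint == 0 | row_number() == 1) %>%  # keep the 0 row plus the top candidate
  ungroup() %>%
  arrange(id, timepoint)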

find duplicates with grouped variables

I guess this can be done with dplyr and some duplicate detection, yet I don't know how to address multiple columns while distinguishing between the levels of a grouping variable.
I have a df that looks like this:
from to group
   1  2 metro
   2  4 metro
   3  4 metro
   4  5 train
   6  1 train
   8  7 train
I want to find the ids which exist in more than one group.
The expected result for the sample df is 1 and 4, because they exist in both the metro and the train group.
Thank you in advance!
Using base R, we can split the first two columns based on group and find the ids shared between the groups using intersect:
Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4
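Note that with more than two groups, Reduce(intersect, ...) keeps only the ids present in every group. For "present in more than one group" in general, a small base R sketch along the lines of the table() answer further down (my addition):
ids <- unlist(df1[1:2])
grp <- rep(df1$group, 2)                       # group label for each from/to id
names(which(rowSums(table(ids, grp) > 0) > 1)) # ids seen in 2+ groups
#[1] "1" "4"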
We gather the 'from' and 'to' columns into 'long' format, group by 'val', filter for groups having more than one distinct 'group', then pull the unique 'val' elements:
library(dplyr)
library(tidyr)
df1 %>%
  gather(key, val, from:to) %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)
#[1] 1 4
Or using base R, we can just use table to find the frequencies and get the ids out of it:
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L),
                      to = c(2L, 4L, 4L, 5L, 1L, 7L),
                      group = c("metro", "metro", "metro",
                                "train", "train", "train")),
                 class = "data.frame", row.names = c(NA, -6L))
Convert the data to long format and count unique values, using data.table. melt is used to convert to long format, and data.table allows filtering in the i part of df1[i, j, k], grouping in the k part, and pulling in the j part:
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
  .[, .(n = uniqueN(group)), value] %>%
  .[n > 1, unique(value)]
# [1] 1 4
