Calculate column medians with NA's - r

I am trying to calculate the median of individual columns in R and then subtract that median from every value in the column. The problem is that my columns contain NAs, which I don't want to remove; they should simply stay NA, without the median being subtracted. For example
ID <- c("A","B","C","D","E")
Point_A <- c(1, NA, 3, NA, 5)
Point_B <- c(NA, NA, 1, 3, 2)
df <- data.frame(ID,Point_A ,Point_B)
Is it possible to calculate the median of a column that contains NAs? My desired output would be
+----+---------+---------+
| ID | Point_A | Point_B |
+----+---------+---------+
| A | -2 | NA |
| B | NA | NA |
| C | 0 | -1 |
| D | NA | 1 |
| E | 2 | 0 |
+----+---------+---------+

If we are talking about real NA values (as per the OP's comment), one could do
df[-1] <- lapply(df[-1], function(x) x - median(x, na.rm = TRUE))
df
# ID Point_A Point_B
# 1 A -2 NA
# 2 B NA NA
# 3 C 0 -1
# 4 D NA 1
# 5 E 2 0
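Or using the matrixStats package. The medians are indexed with col() so that each column is paired with its own median; subtracting the bare vector would recycle it down the rows and give a wrong value in Point_B.
library(matrixStats)
df[-1] <- df[-1] - colMedians(as.matrix(df[-1]), na.rm = TRUE)[col(df[-1])]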
When original df is
df <- structure(list(ID = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), Point_A = c(1, NA, 3, NA, 5), Point_B = c(NA,
NA, 1, 3, 2)), .Names = c("ID", "Point_A", "Point_B"), row.names = c(NA,
-5L), class = "data.frame")
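As a self-contained check of the approach above, here is a minimal base R sketch that rebuilds the example data and subtracts each numeric column's median while leaving NA entries as NA. The names num_cols, meds and centered are illustrative only and do not come from the original answer.

```r
# Example data from the question
df <- data.frame(
  ID = c("A", "B", "C", "D", "E"),
  Point_A = c(1, NA, 3, NA, 5),
  Point_B = c(NA, NA, 1, 3, 2)
)

# Pick the numeric columns and compute one median per column, ignoring NA
num_cols <- vapply(df, is.numeric, logical(1))
meds <- vapply(df[num_cols], median, numeric(1), na.rm = TRUE)

# Subtract each column's own median; NA - median stays NA
centered <- df
centered[num_cols] <- Map(`-`, df[num_cols], meds)
centered
```

Pairing columns and medians with Map() avoids any reliance on vector recycling, which is where column-wise subtraction tends to go wrong.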

Another option is dplyr (mutate_each() and funs() are deprecated in current dplyr, so this uses across())
library(dplyr)
df %>%
  mutate(across(-ID, ~ .x - median(.x, na.rm = TRUE)))

Of course it is possible.
median(df[,]$Point_A, na.rm = TRUE)
where df is the data frame and df[,] selects all rows and all columns. Note that the column is then specified afterwards by $Point_A. The same can be written in this notation:
median(df[,"Point_A"], na.rm = TRUE)
where, once again, df[,"Point_A"] means all rows of the column Point_A.
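A minimal sketch of why na.rm = TRUE matters here: without it, median() returns NA as soon as the vector contains a missing value. The vector x is simply the Point_A column from the question; the other names are illustrative.

```r
x <- c(1, NA, 3, NA, 5)

med_default <- median(x)             # NA, because missing values propagate
med_narm <- median(x, na.rm = TRUE)  # median of the non-missing values 1, 3, 5

# Subtracting the median leaves the NA entries untouched
shifted <- x - med_narm
shifted
```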

Related

Exchange the values between two columns based on a condition using R

I have the following df. If the value in the dm column is less than 20000, that value should go to the nd column. Similarly, if the value in the nd column is greater than 20000, that value should go to the dm column.
structure(list(id = c(1, 2, 3), nd = c(NA, 20076, NA), dm = c(10113,
NA, 10188)), class = "data.frame", row.names = c(NA, -3L))
I want my final df to look like this
structure(list(id = c(1, 2, 3), nd = c(10113, NA, 10188), dm = c(NA,
20076, NA)), class = "data.frame", row.names = c(NA, -3L))
Thank you
ifelse is your friend for this.
base R
transform(df,
nd = ifelse(dm < 20000, dm, nd),
dm = ifelse(nd > 20000, nd, dm)
)
# id nd dm
# 1 1 10113 NA
# 2 2 NA 20076
# 3 3 10188 NA
Note that this works in base R because, unlike dplyr::mutate, transform evaluates every expression against the original data: the dm= (second) expression (and any after it) does not see the changes made by earlier expressions, so the nd that it sees is the original, unchanged nd.
We can also use the temporary-variable trick illustrated in the dplyr example below:
df |>
transform(
nd2 = ifelse(dm < 20000, dm, nd),
dm2 = ifelse(nd > 20000, nd, dm)
) |>
subset(select = -c(nd, dm))
and then rename nd2 to nd and dm2 to dm.
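To illustrate the point about evaluation order, here is a small base R sketch comparing transform() with plain sequential assignment, which behaves like mutate in that the second step sees the result of the first. The object names res_transform and seq_df are illustrative only.

```r
df <- data.frame(id = c(1, 2, 3),
                 nd = c(NA, 20076, NA),
                 dm = c(10113, NA, 10188))

# transform(): both expressions are evaluated against the original columns
res_transform <- transform(df,
  nd = ifelse(dm < 20000, dm, nd),
  dm = ifelse(nd > 20000, nd, dm)
)

# Sequential assignment: the second step sees the already-updated nd,
# so the 20076 that should move to dm is lost
seq_df <- df
seq_df$nd <- ifelse(seq_df$dm < 20000, seq_df$dm, seq_df$nd)
seq_df$dm <- ifelse(seq_df$nd > 20000, seq_df$nd, seq_df$dm)

res_transform
seq_df
```

This is why the sequential style needs temporary columns such as nd2 and dm2.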
dplyr
Because mutate "sees" the changes immediately, we need to store into other variables and then reassign.
library(dplyr)
df %>%
mutate(
nd2 = ifelse(dm < 20000, dm, nd),
dm2 = ifelse(nd > 20000, nd, dm)
) %>%
select(-nd, -dm) %>%
rename(nd=nd2, dm=dm2)
# id nd dm
# 1 1 10113 NA
# 2 2 NA 20076
# 3 3 10188 NA
Another base R option using apply:
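out <- as.data.frame(t(apply(df, 1, function(x) {
  if (x[2] > 20000 | x[3] < 20000) x[c(1, 3, 2)] else x})))
# apply() carries over the swapped element names, so restore the original ones
names(out) <- names(df)
out
#>   id    nd    dm
#> 1  1 10113    NA
#> 2  2    NA 20076
#> 3  3 10188    NA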
Created on 2023-02-18 with reprex v2.0.2
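A different technique from the answers above, offered only as a sketch: if the task is read as "swap nd and dm in every row where either condition holds" (an assumption about the intent), the exchange can be done with a logical row index and a positional column swap. The name swap is illustrative.

```r
df <- data.frame(id = c(1, 2, 3),
                 nd = c(NA, 20076, NA),
                 dm = c(10113, NA, 10188))

# Rows where dm is below the threshold or nd is above it; NA counts as "no"
swap <- (!is.na(df$dm) & df$dm < 20000) | (!is.na(df$nd) & df$nd > 20000)

# Assignment into a data frame is positional, so this exchanges the two columns
df[swap, c("nd", "dm")] <- df[swap, c("dm", "nd")]
df
```

Guarding the comparisons with !is.na() keeps the index free of NA, which row indexing on the left-hand side of an assignment does not accept.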

Finding maximum difference between columns of same name in R

I have the following table in R. I have 2 A columns, 3 B columns and 1 C column. I need to calculate the maximum difference possible between any columns of the same name and return the column name as output.
For row 1
The max difference between A is 2
The max difference between B is 4
I need the output as B
For row 2
The max difference between A is 3
The max difference between B is 2
I need the output as A
| A  | A | B | B | B | C |
| 2  | 4 | 5 | 2 | 1 | 0 |
| -3 | 0 | 2 | 3 | 4 | 2 |
First of all, it's a bit dangerous (and not allowed in some cases) to have non-unique column names, so the first thing I did was to uniqueify the names using base::make.unique(). From there, I used tidyr::pivot_longer() so that the grouping information contained in the column names could be accessed more easily. Here I use a regex inside names_pattern to discard the differentiating parts of the column names so they will be the same again. Then we use dplyr::group_by() followed by dplyr::summarize() to get the largest difference in each id and grp which corresponds to your rows and similar columns in the original data. Finally we use dplyr::slice_max() to return only the largest difference per group.
library(tidyverse)
d <- structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L), B = 2:3, B = c(1L, 4L), C = c(0L, 2L)), row.names = c(NA, -2L), class = "data.frame")
# give unique names
names(d) <- make.unique(names(d), sep = "_")
d %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "grp", names_pattern = "([A-Z])*") %>%
  group_by(id, grp) %>%
  summarise(max_diff = max(value) - min(value)) %>%
  slice_max(order_by = max_diff, n = 1, with_ties = FALSE)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#> # Groups: id [2]
#> id grp max_diff
#> <int> <chr> <int>
#> 1 1 B 4
#> 2 2 A 3
Created on 2022-02-14 by the reprex package (v2.0.1)
Here is a base R option using aggregate + range + diff + which.max
df$max_diff <- with(
p <- aggregate(
. ~ id,
cbind(id = names(df), as.data.frame(t(df))),
function(v) diff(range(v))
),
id[sapply(p[-1],which.max)]
)
which gives
> df
A A B B B C max_diff
1 2 4 5 2 1 0 B
2 -3 0 2 3 4 2 A
data
> dput(df)
structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L),
B = 2:3, B = c(1L, 4L), C = c(0L, 2L), max_diff = c("B",
"A")), row.names = c(NA, -2L), class = "data.frame")
We may also use split.default to split based on the column names similarity and then with max.col find the index of the max diff
m1 <- sapply(split.default(df, names(df)), \(x)
apply(x, 1, \(u) diff(range(u))))
df$max_diff <- colnames(m1)[max.col(m1, "first")]
df$max_diff
[1] "B" "A"
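The same idea can be sketched without duplicate column names at all, by keeping the values in a plain matrix and the group labels in a separate vector. This is an illustrative variant, not one of the original answers; vals, grp, spread and winner are assumed names.

```r
# Values from the question, one row per observation
vals <- matrix(c( 2, 4, 5, 2, 1, 0,
                 -3, 0, 2, 3, 4, 2),
               nrow = 2, byrow = TRUE)
# Group label for each column
grp <- c("A", "A", "B", "B", "B", "C")

# For each group, the per-row spread (max minus min) of its columns
spread <- sapply(split(seq_along(grp), grp), function(idx) {
  apply(vals[, idx, drop = FALSE], 1, function(r) diff(range(r)))
})

# Name of the group with the largest spread in each row
winner <- colnames(spread)[max.col(spread, ties.method = "first")]
winner
```

A group with a single column, such as C, always has a spread of 0, so it can only win when every other group is also constant in that row.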

Comparing two columns in a dataframe using R

I am trying to compare two columns in a dataframe to find rows where the two columns are not equal.
I would do:
df %>% filter(column1 != column2)
This will give me cases where values exist in both columns and are not equal (e.g. column1 = 5, column2 = 6)
However it will not give me cases where one of the values is NA (e.g. column1 = NA, column2 = 7)
How can I include the latter case into the filter function?
Thanks
Or use xor:
df %>% filter(a != b | xor(is.na(a), is.na(b)))
Or as @thelatemail mentioned, you could use Base R:
df[which(df$a != df$b | xor(is.na(df$a), is.na(df$b))),]
Or as @runr mentioned, you could try subset in Base R:
subset(df, a != b | xor(is.na(a), is.na(b)))
You can include them with an OR (|) condition -
library(dplyr)
df <- data.frame(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8))
df %>% filter(a != b | is.na(a) | is.na(b))
# a b
#1 1 NA
#2 NA 3
#3 5 8
Another option is to change the NA values to the string "NA", after which a plain a != b works.
df %>%
  mutate(across(everything(), ~ replace(.x, is.na(.x), "NA"))) %>%
  filter(a != b) %>%
  type.convert(as.is = TRUE)
We can use if_any
library(dplyr)
df %>%
filter(a != b | if_any(everything(), is.na))
a b
1 1 NA
2 NA 3
3 5 8
data
df <- structure(list(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8)),
class = "data.frame", row.names = c(NA,
-5L))
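The answers above differ in how they treat rows where both values are NA: the xor() version drops them, while the is.na(a) | is.na(b) version keeps them. A minimal sketch of an NA-aware comparison that treats two NAs as equal follows; is_different is a hypothetical helper name, and the sixth element was added to the example vectors to show the both-NA case.

```r
# TRUE where x and y differ, counting "exactly one is NA" as a difference
# and "both are NA" as no difference
is_different <- function(x, y) {
  neq <- x != y  # NA wherever either side is NA
  ifelse(is.na(neq), xor(is.na(x), is.na(y)), neq)
}

a <- c(1, 2, NA, 4, 5, NA)
b <- c(NA, 2, 3, 4, 8, NA)

flags <- is_different(a, b)
flags
```

The resulting logical vector contains no NA, so it can be used directly inside filter(), subset() or base [ indexing.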

Remove columns from a dataframe based on number of rows with valid values

I have a dataframe:
df = data.frame(gene = c("a", "b", "c", "d", "e"),
value1 = c(NA, NA, NA, 2, 1),
value2 = c(NA, 1, 2, 3, 4),
value3 = c(NA, NA, NA, NA, 1))
I would like to keep all the columns (plus the first, gene) that have at least 2 valid (i.e., non-NA) values. How do I do this?
I am thinking something like this ...
df1 = df %>% select_if(function(.) ...)
Thanks
We can sum the non-NA elements and create a logical condition to select the columns of interest
library(dplyr)
df1 <- df %>%
select_if(~ sum(!is.na(.)) > 2)
df1
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or another option is keep
library(purrr)
keep(df, ~ sum(!is.na(.x)) > 2)
Or create the condition based on the number of rows
df %>%
select_if(~ mean(!is.na(.)) > 0.5)
Or use Filter from base R
Filter(function(x) sum(!is.na(x)) > 2, df)
We can use colSums in base R to count the non-NA value per column
df[colSums(!is.na(df)) > 2]
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or using apply
df[apply(!is.na(df), 2, sum) > 2]
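Note that the snippets above use > 2, which keeps columns with three or more valid values. If the requirement is read literally as "at least 2", the threshold becomes >= 2 and value1 is kept as well. A minimal sketch under that reading, with illustrative names n_valid, keep and df_kept:

```r
df <- data.frame(gene = c("a", "b", "c", "d", "e"),
                 value1 = c(NA, NA, NA, 2, 1),
                 value2 = c(NA, 1, 2, 3, 4),
                 value3 = c(NA, NA, NA, NA, 1))

# Number of non-NA values per column
n_valid <- colSums(!is.na(df))

# Keep columns with at least 2 valid values, and always keep gene
keep <- n_valid >= 2
keep["gene"] <- TRUE

df_kept <- df[, keep, drop = FALSE]
df_kept
```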

Select last value in a row & place it in another column

I have a data table like this
Col1 | Col2 | Colx
12 | 13 | 19
34 | NA | NA
13 | 33 | NA
To determine the last value in each row, I used Andrie's suggestion here for a previous question on the same subject.
But I'd like the output to be in a separate column. The expected output for the above example is:
Column
19
34
33
The original question linked above didn't solve my problem, as the output does not come in a new column.
We can do
apply(df, 1, function(x) tail(x[!is.na(x)], 1))
If you want the result in a new column, you can do:
df$newColumn <- apply(df, 1, function(x) tail(x[!is.na(x)], 1))
Another option is
i1 <- which(!is.na(df1), arr.ind=TRUE)
unname(tapply(df1[i1], i1[,1], FUN=tail,1))
#[1] 19 34 33
data
df1 <- structure(list(Col1 = c(12, 34, 13), Col2 = c(13,
NA, 33), Colx = c(19,
NA, NA)), .Names = c("Col1", "Col2", "Colx"),
row.names = c(NA, -3L), class = "data.frame")
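A vectorised alternative, sketched here with base R's max.col(): find the position of the last non-NA entry in each row, then pull those values out with matrix indexing. The names m, last_idx and the new column Last are illustrative.

```r
df1 <- data.frame(Col1 = c(12, 34, 13),
                  Col2 = c(13, NA, 33),
                  Colx = c(19, NA, NA))

m <- as.matrix(df1)

# Column index of the last non-NA value in each row
last_idx <- max.col(!is.na(m), ties.method = "last")

# Extract one value per row using (row, column) index pairs
df1$Last <- m[cbind(seq_len(nrow(m)), last_idx)]
df1
```

For a row that is entirely NA, this picks the last column and therefore returns NA for that row.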

Resources