Select columns that contain only values from an external list

Select columns that contain only values from an external list - r

I want to filter a data frame so that I only keep columns that have the values 1, 2, 3, 4, 5, or NA in them.
x = data.frame(col1 = c("a" , "b", "d", "e", "f", "g"),
col2 = c(12, 45, 235, 2134, NA, 1),
col3 = c(1, 2, 3, 1, 2, NA),
col4 = c(1, 2, 3, 4, 5, NA),
col5 = c(1, 2, 3, 4, 5, 6))
With this example data, I would like to return x with only col 3 and 4.

You can use the following solution:
library(dplyr)
x %>%
select(where(function(x) all(x %in% c(1:5, NA))))
col3 col4
1 1 1
2 2 2
3 3 3
4 1 4
5 2 5
6 NA NA
Or using a formula:
x %>%
select(where(~ all(.x %in% c(1:5, NA))))
Since the discussion just heated up on this, in case you would like to know how R interprets formulas created by ~ pronounced twiddle, just wrap it inside purrr::as_mapper. This is a function R calls behind the scene when you use this syntax for an anonymous function:
as_mapper(~ all(.x %in% c(1:5, NA)))
<lambda>
function (..., .x = ..1, .y = ..2, . = ..1)
all(.x %in% c(1:5, NA))
attr(,"class")
[1] "rlang_lambda_function" "function"
Here .x argument is equivalent to the first argument of our anonymous function.

Related

Counting the number of times one level of a variable occurs using dplyr, group_by and summarise in r

I want to summarise both factor and numerical variables using group_by and summarise. For example, if I have the following data frame:
group<- c(1, 1, 2, 2, 3, 3, 4, 4)
var1<- c(3, 6, 3, 2, 7, 5, 2, 5)
var2<- c("A", "B", "B", "B", "A", "A", "B", "A")
df<- data.frame(group, var1, var2)
I want to achieve the following output:
# A tibble: 4 x 3
group max_1 sum_A
<dbl> <dbl> <dbl>
1 6 1
2 3 0
3 7 2
4 5 1
I have tried various iterations of the following using "tally", and "n", and "sum", but none work
summary<- df %>% group_by (group) %>%
summarise(max_1 = max(var1)),
mutate (var2A = sum (var2 == "A"))
Thank you!

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9),
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?

This can also be a strategy. (Will work even if there are unequal number of conditions per group)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df adopted
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of above syntax
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill cond if desired so

You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
mutate(result0 = result[match(drug, unique(drug))]) %>%
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data :
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")

How to delete all the duplicates row based on two columns?

I have a data frame where I want to delete duplicates rows, but I want to delete them only if a value from another column is the same for all the rows. (To be more clear I want to delete the duplicates rows which have the same "Number" value for all rows)
There is a example of my data frame :
df <- data.frame("Name" = c("a", "a", "b", "b", "b", "c", "c", "c"),
"Number" = c(1, 1, 1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
And the result I expect is :
result <- data.frame("Name" = c("b", "b", "b", "c", "c", "c"),
"Number" = c(1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)

We can group_by Name and remove groups which have more than 1 row and have only one distinct value.
library(dplyr)
df %>%
group_by(Name) %>%
filter(!(n_distinct(Number) == 1 & n() > 1))
# Name Number
# <chr> <dbl>
#1 b 2
#2 b 2
#3 b 3
and using base R ave, the same logic can be written as
df[with(df, !as.logical(ave(Number, Name, FUN = function(x)
length(unique(x)) == 1 & length(x) > 1))), ]

Here is a solution with data.table
library("data.table")
df <- data.table("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3))
df[, if (uniqueN(Number)!=1 || .N==1) .SD, Name]
and here is a solution with base R:
df <- data.frame("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3), stringsAsFactors = FALSE)
df[as.logical(ave(df$Number, df$Name, FUN=function(x) length(unique(x))!=1 || length(x)==1)),]

We can use data.table methods
library(data.table)
setDT(df)[, .SD[uniqueN(Number) > 1] , Name]
# Name Number
#1: b 1
#2: b 2
#3: b 3
#4: c 4
#5: c 5
#6: c 5

Calculate median for multiple columns by group based on subsets defined by other columns

I am trying to calculate the median (but that could be substituted by similar metrics) by group for multiple columns based on subsets defined by other columns. This is direct follow-on question from this previous post of mine. I have attempted to incorporate calculating the median via aggregate into the Map(function(x,y) dosomething, x, y) solution kindly provided by #Frank, but that didn't work. Let me illustrate:
Calculate median for A and B by groups GRP1 and GRP2
df <- data.frame(GRP1 = c("A","A","A","A","A","A","B","B","B","B","B","B"), GRP2 = c("A","A","A","B","B","B","A","A","A","B","B","B"), A = c(0,4,6,7,0,1,9,0,0,8,3,4), B = c(6,0,4,8,6,7,0,9,9,7,3,0))
med <- aggregate(.~GRP1+GRP2,df,FUN=median)
Simple. Now add columns defining which rows to be used for calculating the median, i.e. rows with NAs should be dropped, column a defines which rows to be used for calculating the median in column A, same for columns b and B:
a <- c(1,4,7,3,NA,3,7,NA,NA,4,8,1)
b <- c(5,NA,7,9,5,6,NA,8,1,7,2,9)
df1 <- cbind(df,a,b)
As mentioned above, I have tried combining Map and aggregate, but that didn't work. I assume that Map doesn't know what to do with GRP1 and GRP2.
med1 <- Map(function(x,y) aggregate(.~GRP1+GRP2,df1[!is.na(y)],FUN=median), x=df1[,3:4], y=df1[, 5:6])
This is the result I'm looking for:
GRP1 GRP2 A B
1 A A 4 5
2 B A 9 9
3 A B 4 7
4 B B 4 3
Any help will be much appreciated!

Using data.table
library(data.table)
setDT(df1)
df1[, .(A = median(A[!is.na(a)]), B = median(B[!is.na(b)])), by = .(GRP1, GRP2)]
GRP1 GRP2 A B
1: A A 4 5
2: A B 4 7
3: B A 9 9
4: B B 4 3
Same logic in dplyr
library(dplyr)
df1 %>%
group_by(GRP1, GRP2) %>%
summarise(A = median(A[!is.na(a)]), B = median(B[!is.na(b)]))
The original df1:
df1 <- data.frame(
GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)

With dplyr:
library(dplyr)
df1 %>%
mutate(A = ifelse(is.na(a), NA, A),
B = ifelse(is.na(b), NA, B)) %>%
# I use this to put as NA the values we don't want to include
group_by(GRP1, GRP2) %>%
summarise(A = median(A, na.rm = T),
B = median(B, na.rm = T))
# A tibble: 4 x 4
# Groups: GRP1 [?]
GRP1 GRP2 A B
<fct> <fct> <dbl> <dbl>
1 A A 4 5
2 A B 4 7
3 B A 9 9
4 B B 4 3

if one observation meet criteria fill other with the same value for a new variable

I have data.frame like this
test <- data.frame(plot = c(1, 1, 2, 2, 3, 3), sort = c(10, 20, 11, 12, 15, 20))
I want to create a new variable callled treat that will be "A" if any sort in the plot is 20. Otherwise it should be B.
The expected output is
data.frame(plot = c(1, 1, 2, 2, 3, 3), sort = c(10, 20, 11, 12, 15, 20), treat = c("A", "A", "B", "B", "A", "A"))

We can use ave and group by plot variable. Check if any sort variable has value as 20 in it and assign the group accordingly
test$treat<-ave(test$sort,test$plot,FUN =function(x) ifelse(any(x ==20),"A","B"))
test
# plot sort treat
#1 1 10 A
#2 1 20 A
#3 2 11 B
#4 2 12 B
#5 3 15 A
#6 3 20 A
Similary with dplyr
library(dplyr)
test %>%
group_by(plot) %>%
mutate(treat = ifelse(any(sort == 20), "A", "B"))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Select columns that contain only values from an external list - r

Related

Counting the number of times one level of a variable occurs using dplyr, group_by and summarise in r

How to pivot_wider only a single condition using a single command in R

How to delete all the duplicates row based on two columns?

Calculate median for multiple columns by group based on subsets defined by other columns

if one observation meet criteria fill other with the same value for a new variable

Categories

Resources