Subset a new dataframe with binary columns - r

I would like to identify the binary columns in a data.frame and build a new data frame from them.
For example, take this table:
my.table <-read.table(text="a,b,c
0,2,0
0.25,1,1
1,0,0", header=TRUE, as.is=TRUE,sep = ",")

You can keep only the columns whose values are all 0 or 1:
Filter(function(x) all(x %in% c(0, 1)), my.table)
# c
#1 0
#2 1
#3 0
A few other variations that do the same thing:
library(dplyr)
library(purrr)
#2
my.table[colSums(my.table == 0 | my.table == 1) == nrow(my.table)]
#3
my.table %>% select(where(~all(. %in% c(0, 1))))
#4
keep(my.table, ~all(. %in% c(0, 1)))

We can use base R
my.table[colSums(sapply(my.table, `%in%`, c(0, 1))) == nrow(my.table)]
# c
#1 0
#2 1
#3 0
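The answers above assume the table has no missing values. As a minimal sketch, assuming you want NA entries ignored rather than treated as non-binary, the check can be wrapped in a small helper. The name is_binary is hypothetical and not part of any package:

```r
# Hypothetical helper: TRUE when every non-missing value is 0 or 1.
# Assumption: NA entries are ignored instead of disqualifying the column.
is_binary <- function(x) all(x[!is.na(x)] %in% c(0, 1))

my.table <- read.table(text = "a,b,c
0,2,0
0.25,1,1
1,0,0", header = TRUE, sep = ",")

# Filter() keeps the columns for which the predicate returns TRUE.
binary_df <- Filter(is_binary, my.table)
names(binary_df)
```

Note that a column consisting only of NA would also pass this check, because all() of an empty vector is TRUE; add a length test if that matters for your data.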

Related

Why does my table say “No data available” instead of zero?

I need to create multiple new data frames from different filters. Each should contain two variable counts, “d” and “e”, based on the values in columns “a”, “b” and “c”. I have written a function for this that works as long as at least one column has a value. However, sometimes certain groups will have no answer for a, b or c. When this happens I want both d and e to return zero, but instead the table says “No data available in table”. I’ve added my code below.
f_calculate_net = function(data)
{ data %>% mutate(a = ifelse("a" %in% colnames(data), a, 0)) %>%
mutate(b = ifelse("b" %in% colnames(data), b, 0)) %>%
mutate(c = ifelse("c" %in% colnames(data), c, 0)) %>%
mutate(d = ifelse(a + b + c == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b + c == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e) }
A sample of the dataframe is:
 wt beet ilo age country ine sex
647    a   3  19       1  24   1
875    b   3  18       1  27   2
647    c   1  24       1   3   2
875    b   3  20       1  27   2
435    b   2  66       4  31   1
643    a   1  32       3   5   1
496    b   2  47       2   1   2
511    c   2  23       4   2   1
774    a   2  37       5   5   1
550    b   1  24       1   1   2
I take the main dataset, apply a filter, and count the number of responses for the variable beet:
data2 <- df_beet %>% filter(age == 18 & sex == 1 & ilo == 2) %>%
count(beet, wt = wt) %>%
pivot_wider(names_from = beet, values_from = n) %>%
f_calculate_net()
There are no results, and the resulting dataframe has the columns d and e, but instead of zeros it shows “No data available”.
Your main problem here is in the way you are using ifelse. The expression "a" %in% colnames(data) always returns a length-1 logical vector (either TRUE or FALSE). So the output of the expression ifelse("a" %in% colnames(data), a, 0) will also be of length 1. It will return either the first element of a or a single 0. Since this is inside a mutate call, a will either be overwritten by the first element of a, or will be created as a column of zeros. Instead of ifelse you should use
if(!"a" %in% colnames(data)) data$a <- 0
And the same for columns b and c.
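The length-1 behaviour of ifelse described above can be checked directly in base R. This is only an illustrative sketch; the vector a stands in for a data frame column:

```r
a <- c(5, 6, 7)

# ifelse() returns a result shaped like its test, so a length-1 test
# yields a length-1 result: only the first element of `a` survives.
res_true  <- ifelse(TRUE, a, 0)
res_false <- ifelse(FALSE, a, 0)

# A plain if/else is the right tool for a single TRUE/FALSE condition:
# it returns the whole vector unchanged.
res_if <- if (TRUE) a else 0

res_true   # 5
res_false  # 0
res_if     # 5 6 7
```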
You will sometimes still get a NaN entry in columns d and e if both a and b are 0 but c isn't, since your expression ((a/(a+b))*c)+a divides by the sum of a and b. You should check only whether a + b == 0, since that is the case in which you want to return 0.
So the fixed function would be something like:
f_calculate_net = function(data) {
if(!"a" %in% colnames(data)) data$a <- 0
if(!"b" %in% colnames(data)) data$b <- 0
if(!"c" %in% colnames(data)) data$c <- 0
data %>%
mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e)
}
Let's create some random data to test this:
set.seed(123)
df <- data.frame(a = rpois(5, 1), b = rpois(5, 2), c = rpois(5, 1))
df
#> a b c
#> 1 0 0 3
#> 2 2 2 1
#> 3 1 4 1
#> 4 2 2 1
#> 5 3 2 0
And we see that we get the expected output:
f_calculate_net(df)
#> d e
#> 1 0.0 0.0
#> 2 2.5 2.5
#> 3 1.2 4.8
#> 4 2.5 2.5
#> 5 3.0 2.0
Created on 2022-08-15 by the reprex package (v2.0.1)
When a and b are both zero, a/(a+b) is NaN. If you want this case to be zero, try changing a + b + c == 0 to (a + b) == 0.
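A small base R sketch of the difference between the two guards, using made-up values in which the first row has a = b = 0 but c > 0:

```r
# Illustrative data only: row 1 triggers the 0/0 division.
df <- data.frame(a = c(0, 2), b = c(0, 2), c = c(3, 1))

# Original guard: a + b + c is 3 in row 1, so the division still runs.
d_old <- with(df, ifelse(a + b + c == 0, 0, (a / (a + b)) * c + a))

# Corrected guard: tests the actual divisor.
d_new <- with(df, ifelse(a + b == 0, 0, (a / (a + b)) * c + a))

d_old  # NaN 2.5
d_new  # 0.0 2.5
```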
Based on Allan's explanation and comment, another possibility is to make a logical vector of the same length as the number of rows:
f_calculate_net = function(data)
{ data %>%
mutate(a = ifelse(rep("a" %in% colnames(data), nrow(data)), a, 0)) %>%
mutate(b = ifelse(rep("b" %in% colnames(data), nrow(data)), b, 0)) %>%
mutate(c = ifelse(rep("c" %in% colnames(data), nrow(data)), c, 0)) %>%
mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e) }

Calculate row sums by variable names

What's the easiest way to calculate row-wise sums? For example, how would I calculate the sum of all variables whose names contain "txt_" (see the example below)?
df <- data.frame(var1 = c(1, 2, 3),
txt_1 = c(1, 1, 0),
txt_2 = c(1, 0, 0),
txt_3 = c(1, 0, 0))
base R
We can first use grepl to find the column names that contain txt_ (use the pattern "^txt_" to match only names that start with it), then use rowSums on the subset.
rowSums(df[, grepl("txt_", names(df))])
[1] 3 1 0
If you want to attach the sums to the original dataframe, use cbind:
cbind(df, sums = rowSums(df[, grepl("txt_", names(df))]))
var1 txt_1 txt_2 txt_3 sums
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Tidyverse
library(tidyverse)
df %>%
mutate(sum = rowSums(across(starts_with("txt_"))))
var1 txt_1 txt_2 txt_3 sum
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Or if you want just the vector, then we can use pull:
df %>%
mutate(sum = rowSums(across(starts_with("txt_")))) %>%
pull(sum)
[1] 3 1 0
Data Table
Here is a data.table option as well:
library(data.table)
dt <- as.data.table(df)
dt[ ,sum := rowSums(.SD), .SDcols = grep("txt_", names(dt))]
dt[["sum"]]
# [1] 3 1 0
Another dplyr option:
df %>%
rowwise() %>%
mutate(sum = sum(c_across(starts_with("txt"))))
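One detail worth illustrating: grepl("txt_", ...) matches the pattern anywhere in a column name, while starts_with("txt_") matches only a prefix. The sketch below adds a hypothetical column my_txt_4 (not in the original data) to show where the two differ, using base R's startsWith for the prefix match:

```r
df <- data.frame(var1 = c(1, 2, 3),
                 txt_1 = c(1, 1, 0),
                 txt_2 = c(1, 0, 0),
                 txt_3 = c(1, 0, 0),
                 my_txt_4 = c(9, 9, 9))  # hypothetical extra column

# Matches "txt_" anywhere in the name, so my_txt_4 is included.
sums_anywhere <- rowSums(df[grepl("txt_", names(df))])

# Matches only names that begin with "txt_".
sums_prefix <- rowSums(df[startsWith(names(df), "txt_")])

sums_anywhere  # 12 10 9
sums_prefix    # 3 1 0
```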

Add multiple columns with dplyr and fill cells based on condition

I am trying to:
1) add multiple columns that correspond to existing columns (e.g., a1 exists and add a1_yes).
2) Next, if a given cell contains 1:3, put 1 in a#_yes column, otherwise, put 0.
I can easily do this with base R, but I'm trying to also make it work with dplyr.
My data:
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
a2 = c(NA, 1, 2, 3, 3))
With base R:
df[paste0("a", 1:2, "_yes")] <- NA # add columns
for(c in 1:2) {
for(r in 1:nrow(df)) {
ifelse(df[r,c] %in% c(1,2,3), df[r,c+2] <- 1,df[r,c+2] <- 0)
}
}
> df
a1 a2 a1_yes a2_yes
1 1 NA 1 0
2 2 1 1 1
3 0 2 0 1
4 NA 3 0 1
5 NA 3 0 1
Thank you
Here is an option, assuming you want to do this to all columns of your dataframe
library(dplyr)
df %>%
mutate_all(., list('yes' = ~ifelse(.x %in% c(1:3), 1, 0)))
# a1 a2 a1_yes a2_yes
#1 1 NA 1 0
#2 2 1 1 1
#3 0 2 0 1
#4 NA 3 0 1
#5 NA 3 0 1
Edits
As @akrun mentioned, you can do this without ifelse using as.integer or +
df %>%
mutate_all(., list('yes' = ~as.integer(.x %in% 1:3)))
You can also use mutate_at to select specific vars
df %>%
mutate_at(vars(a1, a2), list('yes' = ~as.integer(.x %in% 1:3)))
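This will work without editing no matter how many columns you have, as long as they are all in this format. Note that it marks every non-zero, non-missing value as 1, which matches the 1:3 rule only when the columns contain no other values.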
df %>%
mutate_all(., function(x) ifelse(x == 0 | is.na(x), 0, 1)) %>%
rename_all(., function(x) paste0(x, "_yes")) %>%
bind_cols(df, .)
Here's a dplyr solution:
library(dplyr)
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
a2 = c(NA, 1, 2, 3, 3))
df2 <- df %>%
mutate(a1_yes = ifelse(a1 == 0 | is.na(a1), 0, 1),
a2_yes = ifelse(a2 == 0 | is.na(a2), 0, 1))
Instead of putting the conditions so that the new columns' values are 1, I put the conditions so that they're equal to zero.
Here is a solution
df <- data.frame( a1 = c(1,2,0,NA,NA),
a2 = c(NA,1,2,3,3))
check_values <- c(1,2,3)
df %>% mutate(a1_yes = ifelse(a1 %in% check_values,1,0),
a2_yes =ifelse(a2 %in% check_values,1,0))
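For comparison, the nested loop in the question can also be written in vectorised base R. This is a sketch under the assumption that every column of df should get a matching _yes column:

```r
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
                 a2 = c(NA, 1, 2, 3, 3))

cols <- names(df)  # assumption: all existing columns are processed

# %in% returns FALSE for NA, so missing values become 0 automatically.
df[paste0(cols, "_yes")] <- lapply(df[cols], function(x) as.integer(x %in% 1:3))

df
```

Because %in% never returns NA, no separate is.na() check is needed here, unlike the versions built on x == 0.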

comparing dates in R not working well (equal)

I want to compare two date columns in a dataframe and add a column that indicates whether the dates in "A" are <= or > the dates in "B".
df <- data.frame( list (A=c("15-10-2000", "15-10-2000", "15-10-2000","20-10-2000"),
B=c("15-10-2000", "16-10-2000", "14-10-2000","19-10-2000")))
What I would like to include is new column C = ( 1 , 1, 0, 0).
I have tried:
df$C = ifelse (df$A <= df$B, 1, 0)
It works except for the "equal" comparison.
I get: C = ( 0 , 1, 0, 0)
Sorry, but before doing the comparison I changed the format to Date and it still does not work:
df$A= as.Date(df$A, format = "%d-%m-%Y")
df$B = as.Date(df$B, format = "%d-%m-%Y")
The date columns are factors (or character strings in R >= 4.0, where stringsAsFactors defaults to FALSE). You need to first convert them to Date class and then compare.
library(dplyr)
df %>%
mutate_at(vars(A:B), as.Date, format = "%d-%m-%Y") %>%
mutate(C = as.integer(A <= B))
# A B C
#1 2000-10-15 2000-10-15 1
#2 2000-10-15 2000-10-16 1
#3 2000-10-15 2000-10-14 0
#4 2000-10-20 2000-10-19 0
Or in base R that would be
df[1:2] <- lapply(df[1:2], as.Date, format = "%d-%m-%Y")
df$C <- as.integer(df$A <= df$B)
You should convert the factors to dates (as Jon Spring pointed out). Then it should work:
library(dplyr)
df %>%
mutate_all(lubridate::dmy) %>%
mutate(C = ifelse(A<=B,1,0))
A B C
1 2000-10-15 2000-10-15 1
2 2000-10-15 2000-10-16 1
3 2000-10-15 2000-10-14 0
4 2000-10-20 2000-10-19 0
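To see why the conversion matters, the sketch below compares the same day-first values as text and as Date objects. The cross-month pair at the end is an invented example, not taken from the question:

```r
A <- c("15-10-2000", "15-10-2000", "15-10-2000", "20-10-2000")
B <- c("15-10-2000", "16-10-2000", "14-10-2000", "19-10-2000")

# Convert first, then compare as dates.
A_date <- as.Date(A, format = "%d-%m-%Y")
B_date <- as.Date(B, format = "%d-%m-%Y")
C <- as.integer(A_date <= B_date)
C  # 1 1 0 0

# Text is compared character by character, so day-first strings are
# ordered by day of month and ignore month and year.
as_text <- "20-10-2000" <= "19-11-2000"
as_date <- as.Date("20-10-2000", format = "%d-%m-%Y") <=
  as.Date("19-11-2000", format = "%d-%m-%Y")
as_text  # FALSE
as_date  # TRUE
```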

Variable names as Input in an R Function

I have a dataframe with several numeric variables along with factors. I wish to run over the numeric variables and replace the negative values with missing values. I couldn't do that.
My alternative idea was to write a function that takes a dataframe and a variable and does the replacement. It didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
what am I doing wrong ?
Thank you.
Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
With tidyverse, we can make use of mutate_if
library(tidyverse)
df1 %>%
mutate_if(is.numeric, ~ replace(., . < 0, NA)) # formula syntax; funs() is deprecated in current dplyr
If you still want to change only one selected variable, a solution with dplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3
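The original function fails because df$var looks for a column literally named "var" instead of using the value of the argument, and because the function never returns the modified data frame. A minimal base R sketch that fixes both, assuming the column name is passed as a string (the name negative_to_missing is just illustrative):

```r
negative_to_missing <- function(df, var) {
  # [[ ]] evaluates `var`, so "val1" selects the column val1;
  # df$var would instead look for a column called "var".
  df[[var]][df[[var]] < 0] <- NA
  df  # return the modified copy; the caller's data frame is untouched
}

data <- data.frame(val1 = c(1, -1, 2),
                   val2 = 1:3)

result <- negative_to_missing(data, "val1")
result$val1  # 1 NA 2
```

Since R copies on modification, the result has to be assigned back, for example data <- negative_to_missing(data, "val1").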

Resources