Subset a new dataframe with binary columns

I would like to identify the binary columns in a data.frame and build a new data frame from only those columns.
For example, given this table:
my.table <-read.table(text="a,b,c
0,2,0
0.25,1,1
1,0,0", header=TRUE, as.is=TRUE,sep = ",")

You can keep only the columns whose values are all 0 or 1.
Filter(function(x) all(x %in% c(0, 1)), my.table)
# c
#1 0
#2 1
#3 0
A few other variations that do the same thing:
library(dplyr)
library(purrr)
#2
my.table[colSums(my.table == 0 | my.table == 1) == nrow(my.table)]
#3
my.table %>% select(where(~all(. %in% c(0, 1))))
#4
keep(my.table, ~all(. %in% c(0, 1)))

We can use base R
my.table[colSums(sapply(my.table, `%in%`, c(0, 1))) == nrow(my.table)]
# c
#1 0
#2 1
#3 0
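The checks above can also be wrapped in a small helper. This is only a sketch: the name is_binary is made up for illustration, and it assumes that NA entries should be ignored and that non-numeric columns never count as binary.

```r
# Example table from the question
my.table <- read.table(text = "a,b,c
0,2,0
0.25,1,1
1,0,0", header = TRUE, sep = ",")

# Hypothetical helper: TRUE if a numeric column holds only 0/1 (NAs ignored)
is_binary <- function(x) {
  is.numeric(x) && all(x[!is.na(x)] %in% c(0, 1))
}

# vapply guarantees exactly one logical value per column
binary_cols <- vapply(my.table, is_binary, logical(1))
binary_df <- my.table[binary_cols]
names(binary_df)  # "c"
```

If NA should disqualify a column instead, drop the !is.na(x) subsetting.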

Related

Why does my table say “No data available” instead of zero?

I need to create multiple new data frames, each based on a different filter, that contain two derived counts, “d” and “e”, computed from the values in columns “a”, “b” and “c”. I have written a function for this that works as long as at least one column has a value. However, some groups have no responses for a, b or c. When that happens I want d and e to both be zero, but instead the table says “No data available in table”. My code is below.
f_calculate_net = function(data)
{ data %>% mutate(a = ifelse("a" %in% colnames(data), a, 0)) %>%
mutate(b = ifelse("b" %in% colnames(data), b, 0)) %>%
mutate(c = ifelse("c" %in% colnames(data), c, 0)) %>%
mutate(d = ifelse(a + b + c == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b + c == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e) }
A sample of the data frame:
wt   beet  ilo  age  country  ine  sex
647  a     3    19   1        24   1
875  b     3    18   1        27   2
647  c     1    24   1        3    2
875  b     3    20   1        27   2
435  b     2    66   4        31   1
643  a     1    32   3        5    1
496  b     2    47   2        1    2
511  c     2    23   4        2    1
774  a     2    37   5        5    1
550  b     1    24   1        1    2
I take the main dataset, apply a filter, and count the number of responses for each value of the variable beet:
data2 <- df_beet %>% filter(age == 18 & sex == 1 & ilo == 2) %>%
count(beet, wt = wt) %>%
pivot_wider(names_from = beet, values_from = n) %>%
f_calculate_net()
The filter matches no rows, and the resulting data frame has the columns d and e, but instead of zeros it shows “No data available”.

Your main problem here is the way you are using ifelse. The expression "a" %in% colnames(data) always returns a length-1 logical vector (either TRUE or FALSE), and ifelse returns a result of the same length as its test, so ifelse("a" %in% colnames(data), a, 0) is also of length 1: either the first element of a or a single 0. Because this is inside a mutate call, that single value is recycled, so a is either overwritten by its own first element or created as a column of zeros. Instead of ifelse you should use:
if(!"a" %in% colnames(data)) data$a <- 0
And the same for columns b and c.
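The length-1 behaviour of ifelse described above can be checked directly. A minimal sketch, with a made-up vector a:

```r
a <- c(5, 6, 7)

# ifelse() returns a result shaped like its test, so a length-1 test
# yields a length-1 result
res_true  <- ifelse(TRUE, a, 0)   # 5 (only the first element of a)
res_false <- ifelse(FALSE, a, 0)  # 0

# if / else evaluates the condition once and returns the whole object
res_if <- if (TRUE) a else 0      # 5 6 7
```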
You will sometimes still get a NaN entry in columns d and e if both a and b are 0 but c isn't, since your expression ((a/(a+b))*c)+a divides by the sum of a and b. You should therefore check only whether a + b == 0 and return 0 in that case.
So the fixed function would be something like:
f_calculate_net = function(data) {
if(!"a" %in% colnames(data)) data$a <- 0
if(!"b" %in% colnames(data)) data$b <- 0
if(!"c" %in% colnames(data)) data$c <- 0
data %>%
mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e)
}
Let's create some random data to test this:
set.seed(123)
df <- data.frame(a = rpois(5, 1), b = rpois(5, 2), c = rpois(5, 1))
df
#> a b c
#> 1 0 0 3
#> 2 2 2 1
#> 3 1 4 1
#> 4 2 2 1
#> 5 3 2 0
And we see that we get the expected output:
f_calculate_net(df)
#> d e
#> 1 0.0 0.0
#> 2 2.5 2.5
#> 3 1.2 4.8
#> 4 2.5 2.5
#> 5 3.0 2.0
Created on 2022-08-15 by the reprex package (v2.0.1)
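As a further check, here is a sketch of the missing-column case. The one-row input no_c is hypothetical (it stands in for a pivoted result in which no "c" responses were counted), and the function is repeated only so that the example runs on its own. A zero-row input is a separate case that this does not cover.

```r
library(dplyr)

# Fixed function from above, repeated so the example is self-contained
f_calculate_net <- function(data) {
  if (!"a" %in% colnames(data)) data$a <- 0
  if (!"b" %in% colnames(data)) data$b <- 0
  if (!"c" %in% colnames(data)) data$c <- 0
  data %>%
    mutate(d = ifelse(a + b == 0, 0, ((a / (a + b)) * c) + a)) %>%
    mutate(e = ifelse(a + b == 0, 0, ((b / (a + b)) * c) + b)) %>%
    select(d, e)
}

# Hypothetical input with no column c; c is filled in as 0
no_c <- data.frame(a = 2, b = 2)
res <- f_calculate_net(no_c)
res
#>   d e
#> 1 2 2
```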

When a and b are both zero, a/(a+b) is NaN. If you want this case to be zero, try changing a + b + c == 0 to (a + b) == 0.
Based on Allan's explanation and comment, another possibility is to make a logical vector of the same length as the number of rows:
f_calculate_net = function(data)
{ data %>%
mutate(a = ifelse(rep("a" %in% colnames(data), nrow(data)), a, 0)) %>%
mutate(b = ifelse(rep("b" %in% colnames(data), nrow(data)), b, 0)) %>%
mutate(c = ifelse(rep("c" %in% colnames(data), nrow(data)), c, 0)) %>%
mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e) }

Calculate row sums by variable names

What's the easiest way to calculate row-wise sums? For example, how would I calculate the sum of all variables whose names start with "txt_"? (See the example below.)
df <- data.frame(var1 = c(1, 2, 3),
txt_1 = c(1, 1, 0),
txt_2 = c(1, 0, 0),
txt_3 = c(1, 0, 0))

base R
We can first use grepl to find the column names that contain txt_, then use rowSums on that subset. (Use the pattern "^txt_" to match only names that start with txt_.)
rowSums(df[, grepl("txt_", names(df))])
[1] 3 1 0
If you want to add the sums to the original data frame, bind the output back with cbind:
cbind(df, sums = rowSums(df[, grepl("txt_", names(df))]))
var1 txt_1 txt_2 txt_3 sums
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Tidyverse
library(tidyverse)
df %>%
mutate(sum = rowSums(across(starts_with("txt_"))))
var1 txt_1 txt_2 txt_3 sum
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Or if you want just the vector, then we can use pull:
df %>%
mutate(sum = rowSums(across(starts_with("txt_")))) %>%
pull(sum)
[1] 3 1 0
Data Table
Here is a data.table option as well:
library(data.table)
dt <- as.data.table(df)
dt[ ,sum := rowSums(.SD), .SDcols = grep("txt_", names(dt))]
dt[["sum"]]
# [1] 3 1 0

Another dplyr option:
df %>%
rowwise() %>%
mutate(sum = sum(c_across(starts_with("txt"))))
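One thing to watch with the base R version: if only one column matches, df[, cols] drops to a plain vector and rowSums fails. A sketch that guards against this with drop = FALSE and anchors the pattern at the start of the name; na.rm = TRUE is an extra assumption about how missing values should be treated, not something the question asks for.

```r
df <- data.frame(var1  = c(1, 2, 3),
                 txt_1 = c(1, 1, 0),
                 txt_2 = c(1, 0, 0),
                 txt_3 = c(1, 0, 0))

# "^txt_" matches only names that start with txt_
txt_cols <- grepl("^txt_", names(df))

# drop = FALSE keeps a data frame even if a single column matches
df$sum <- rowSums(df[, txt_cols, drop = FALSE], na.rm = TRUE)
df$sum  # 3 1 0
```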

Add multiple columns with dplyr and fill cells based on condition

I am trying to:
1) add multiple columns that correspond to existing columns (e.g., a1 exists and add a1_yes).
2) Next, if a given cell contains a value in 1:3, put 1 in the a#_yes column; otherwise, put 0.
I can easily do this with base R, but I'm also trying to make it work with dplyr.
My data:
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
a2 = c(NA, 1, 2, 3, 3))
With base R:
df[paste0("a", 1:2, "_yes")] <- NA # add columns
for(c in 1:2) {
for(r in 1:nrow(df)) {
ifelse(df[r,c] %in% c(1,2,3), df[r,c+2] <- 1,df[r,c+2] <- 0)
}
}
> df
a1 a2 a1_yes a2_yes
1 1 NA 1 0
2 2 1 1 1
3 0 2 0 1
4 NA 3 0 1
5 NA 3 0 1
Thank you

Here is an option, assuming you want to do this to all columns of your dataframe
library(dplyr)
df %>%
mutate_all(., list('yes' = ~ifelse(.x %in% c(1:3), 1, 0)))
# a1 a2 a1_yes a2_yes
#1 1 NA 1 0
#2 2 1 1 1
#3 0 2 0 1
#4 NA 3 0 1
#5 NA 3 0 1
Edits
As @Akrun mentioned, you can do this without ifelse by using as.integer or unary +:
df %>%
mutate_all(., list('yes' = ~as.integer(.x %in% 1:3)))
You can also use mutate_at to select specific vars
df %>%
mutate_at(vars(a1, a2), list('yes' = ~as.integer(.x %in% 1:3)))

This works unchanged for any number of columns, as long as they all follow this format. Note that it treats every value other than 0 and NA as a yes, which agrees with the 1:3 check for this data:
df %>%
mutate_all(., function(x) ifelse(x == 0 | is.na(x), 0, 1)) %>%
rename_all(., function(x) paste0(x, "_yes")) %>%
bind_cols(df, .)

Here's a dplyr solution:
library(dplyr)
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
a2 = c(NA, 1, 2, 3, 3))
df2 <- df %>%
mutate(a1_yes = ifelse(a1 == 0 | is.na(a1), 0, 1),
a2_yes = ifelse(a2 == 0 | is.na(a2), 0, 1))
Instead of putting the conditions so that the new columns' values are 1, I put the conditions so that they're equal to zero.

Here is a solution
df <- data.frame( a1 = c(1,2,0,NA,NA),
a2 = c(NA,1,2,3,3))
check_values <- c(1,2,3)
df %>% mutate(a1_yes = ifelse(a1 %in% check_values,1,0),
a2_yes =ifelse(a2 %in% check_values,1,0))
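In current dplyr the scoped verbs mutate_all and mutate_at are superseded by across. A sketch of the same idea, assuming dplyr >= 1.0.0; the .names glue pattern adds the _yes suffix:

```r
library(dplyr)

df <- data.frame(a1 = c(1, 2, 0, NA, NA),
                 a2 = c(NA, 1, 2, 3, 3))

res <- df %>%
  mutate(across(everything(),
                ~ as.integer(.x %in% 1:3),   # NA %in% 1:3 is FALSE, so NA becomes 0
                .names = "{.col}_yes"))      # a1 -> a1_yes, a2 -> a2_yes
res
```

Replace everything() with c(a1, a2) or starts_with("a") to restrict the columns.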

comparing dates in R not working well (equal)

I want to compare two date columns of a data frame and add a column that indicates whether the date in "A" is <= the date in "B" or not.
df <- data.frame( list (A=c("15-10-2000", "15-10-2000", "15-10-2000","20-10-2000"),
B=c("15-10-2000", "16-10-2000", "14-10-2000","19-10-2000")))
What I would like to include is new column C = ( 1 , 1, 0, 0).
I have tried:
df$C = ifelse (df$A <= df$B, 1, 0)
It works except for the "equal" comparison.
I get: C = (0, 1, 0, 0)
Sorry, I should add that I converted the columns to Date before doing the comparison and it still does not work:
df$A= as.Date(df$A, format = "%d-%m-%Y")
df$B = as.Date(df$B, format = "%d-%m-%Y")

The date columns are factors (or character strings in R >= 4.0.0, where stringsAsFactors defaults to FALSE). You need to convert them to the Date class first and then compare:
library(dplyr)
df %>%
mutate_at(vars(A:B), as.Date, format = "%d-%m-%Y") %>%
mutate(C = as.integer(A <= B))
# A B C
#1 2000-10-15 2000-10-15 1
#2 2000-10-15 2000-10-16 1
#3 2000-10-15 2000-10-14 0
#4 2000-10-20 2000-10-19 0
Or in base R that would be
df[1:2] <- lapply(df[1:2], as.Date, format = "%d-%m-%Y")
df$C <- as.integer(df$A <= df$B)
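To see why the conversion matters, here is a small sketch with two made-up dates (not from the question). Compared as text, day-first strings are ordered character by character, which does not follow calendar order:

```r
a <- "20-10-2000"   # 20 October 2000
b <- "19-11-2000"   # 19 November 2000

# As text: "2" sorts after "1", so a is considered larger
text_cmp <- a <= b   # FALSE

# As dates: 20 October is before 19 November
date_cmp <- as.Date(a, format = "%d-%m-%Y") <= as.Date(b, format = "%d-%m-%Y")  # TRUE
```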

You should convert the factors to dates (as Jon Spring pointed out). Then it should work:
library(dplyr)
df %>%
mutate_all(lubridate::dmy) %>%
mutate(C = ifelse(A<=B,1,0))
A B C
1 2000-10-15 2000-10-15 1
2 2000-10-15 2000-10-16 1
3 2000-10-15 2000-10-14 0
4 2000-10-20 2000-10-19 0

Variable names as Input in an R Function

I have a data frame with several numeric variables along with factors. I want to go over the numeric variables and replace the negative values with missing values (NA), but I couldn't get that to work.
My alternative idea was to write a function that takes a data frame and a variable and does the replacement. That didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
What am I doing wrong?
Thank you.

Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
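As for the function in the question: df$var looks for a column literally named "var" rather than using the value of the argument, and the function never returns the modified data frame. A minimal sketch of a fix, assuming the column name is passed as a string:

```r
NegativeToMissing <- function(df, var) {
  # var is a column name given as a string; [[ ]] evaluates it, $ would not
  df[[var]][df[[var]] < 0] <- NA
  df  # return the modified copy
}

df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
                  col2 = 1:5,
                  col3 = LETTERS[1:5])

# The result has to be assigned back; the original is not changed in place
df1 <- NegativeToMissing(df1, "col1")
df1$col1  # NA 1 2 0 NA
```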

With tidyverse, we can make use of mutate_if
library(tidyverse)
df1 %>%
mutate_if(is.numeric, ~ replace(., . < 0, NA)) # funs() is deprecated since dplyr 0.8.0

If you still want to change only one selected variable, a solution with dplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3
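With newer versions of dplyr and rlang, the same function can be written with the embrace operator, so the caller passes a bare column name without quo(). This is a sketch assuming dplyr >= 1.0.0 and rlang >= 0.4.3; the name negative_to_missing is made up for illustration.

```r
library(dplyr)

negative_to_missing <- function(df, var) {
  df %>%
    # "{{ var }}" builds the output name; {{ var }} refers to the column itself
    mutate("{{ var }}" := ifelse({{ var }} < 0, NA, {{ var }}))
}

data <- data.frame(val1 = c(1, -1, 2),
                   val2 = 1:3)
res <- negative_to_missing(data, val1)
res$val1  # 1 NA 2
```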
