How to select columns whose elements are all 0 in R?

I am struggling to select all columns whose elements are all 0. Here is a simple example:
df <- data.frame(A=c(1,0,1,0),B=c(0,0,0,0),C=c(1,1,0,0),D=c(0,0,0,0))
df
A B C D
1 1 0 1 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 0
I want to select columns B and D. I know that dplyr::select allows selecting by column names, but how can I select columns by their values? Something like select(across(everything(), ~ .x == 0)), but across() cannot be used inside select().

We can use where inside the select statement to find the columns that have a sum of 0.
library(dplyr)
df %>%
  select(where(~ sum(.) == 0))
Or you could use colSums in base R:
df[,colSums(df) == 0]
Output
B D
1 0 0
2 0 0
3 0 0
4 0 0
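Note that a zero sum is only equivalent to "all zeros" when the values are non-negative, as in this example. If negative values could cancel out, a stricter check (a sketch along the same lines) would be:
library(dplyr)
# keep only the columns where every element is exactly 0
df %>%
  select(where(~ all(.x == 0)))
# base R equivalent; drop = FALSE keeps a data frame even if a single column matches
df[, sapply(df, function(x) all(x == 0)), drop = FALSE]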

Related

R - In new dataframe: if cell matches another column of same row, then

d <- data.frame(B1 = c(1,2,3,4),B2 = c(0,1,2,3))
d$total=rowSums(d)
B1 B2 total
1 0 1
2 1 3
3 2 5
4 3 7
Using the dataframe above, I want to create a new dataframe with the following logic:
Going by rows, if a cell in B1:B2 matches d$total, return 1, else 0.
Ideally output to look like:
B1n B2n
1 0
0 0
0 0
0 0
What is the best way to do this in R?
Thank you.
You can compare the first two columns with the total column.
res <- +(d[1:2] == d$total)
res
# B1 B2
#[1,] 1 0
#[2,] 0 0
#[3,] 0 0
#[4,] 0 0
The result is a matrix; if you want a data frame as output you can do res <- data.frame(res).
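For example, to get exactly the column names from the desired output (B1n and B2n, an assumed naming choice):
res <- data.frame(+(d[1:2] == d$total))
names(res) <- c("B1n", "B2n")
res
#  B1n B2n
#1   1   0
#2   0   0
#3   0   0
#4   0   0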
Here is an alternative way to solve this problem. You can use dplyr::transmute, which, unlike dplyr::mutate, keeps only the newly created columns, so you get the two flag columns on their own. Inside transmute are just the conditions.
library(dplyr)
newdf <- d %>%
  transmute(B1n = ifelse(B1 + B2 == B1, 1, 0),
            B2n = ifelse(B1 + B2 == B2, 1, 0))
> newdf
B1n B2n
1 1 0
2 0 0
3 0 0
4 0 0

Create a new conditional column based on binary values in other columns? (R)

My data looks like this:
A B C
1 0 0
0 1 0
0 1 0
0 0 1
This is what I’m going for:
A B C New_Column
1 0 0 A
0 1 0 B
0 1 0 B
0 0 1 C
So I'm creating a new column that tells me which of the three variables (A, B, or C) is present. Only one of the three columns will contain a 1 per row. What is the best way to go about this?
We can use max.col to get, for each row, the index of the column that holds the 1, and then index into the column names:
df1$New_Column <- names(df1)[max.col(df1, "first")]
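A quick check on the posted data (stored here as df1, an assumption about the object name):
df1 <- data.frame(A = c(1, 0, 0, 0), B = c(0, 1, 1, 0), C = c(0, 0, 0, 1))
df1$New_Column <- names(df1)[max.col(df1, "first")]
df1
#  A B C New_Column
#1 1 0 0          A
#2 0 1 0          B
#3 0 1 0          B
#4 0 0 1          C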

R spread columns with a specific pattern

Got a data.frame with a column like this:
Column_1
AAA
B
BBB
AAA_FACE
CCC
BBB_AAA
I want to spread the column into new columns, but not for all of my unique values (that would give me far too many columns), only for the values containing a specific pattern: "AAA".
After spreading the values, I want to make them binary, so ideally my new data.frame looks like this:
AAA AAA_FACE BBB_AAA
1 0 0
0 0 0
0 0 0
0 1 0
0 0 0
0 0 1
I tried tidyr's spread() function, but that spreads the data into many, many columns instead of only the columns containing the 'AAA' pattern.
One option with tidyverse would be
library(tidyverse)
df1 %>%
  mutate(i1 = as.integer(str_detect(Column_1, "AAA")),
         rn = row_number()) %>%
  spread(Column_1, i1, fill = 0) %>%
  select(matches("AAA"))
# AAA AAA_FACE BBB_AAA
#1 1 0 0
#2 0 0 0
#3 0 0 0
#4 0 1 0
#5 0 0 0
#6 0 0 1
It can be made a bit more efficient by replacing the other values with NA and then doing the spread:
df1 %>%
  mutate(i1 = as.integer(str_detect(Column_1, "AAA")),
         Column_1 = replace(Column_1, !i1, NA),
         rn = row_number()) %>%
  spread(Column_1, i1, fill = 0) %>%
  select(matches("AAA"))
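Since spread() has been superseded in newer tidyr versions, a sketch of the same approach with pivot_wider() (assuming a recent tidyr, which library(tidyverse) already attaches) would be:
df1 %>%
  mutate(i1 = as.integer(str_detect(Column_1, "AAA")),
         rn = row_number()) %>%
  pivot_wider(names_from = Column_1, values_from = i1, values_fill = 0) %>%
  select(matches("AAA"))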
Using base R:
Your data
db<-data.frame(Column_1=c("AAA","B","BBB","AAA_FACE","CCC","BBB_AAA"))
Identify "AAA" pattern
AAA_names<-as.character(db[grep("AAA",db$Column_1),"Column_1"])
Output dataframe creation:
out <- data.frame(lapply(AAA_names, function(x, y) x == y, y = as.character(db$Column_1)))
colnames(out)<-AAA_names
out[,AAA_names] <- lapply(out[,AAA_names], as.numeric)
Your output
out
AAA AAA_FACE BBB_AAA
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 0
6 0 0 1
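A more compact base R sketch of the same idea, using outer() to build the 0/1 indicator matrix in one step (the result is a matrix; wrap it in data.frame() if a data frame is needed):
AAA_names <- grep("AAA", db$Column_1, value = TRUE)
out <- +outer(as.character(db$Column_1), AAA_names, "==")
colnames(out) <- AAA_names
out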

Finding "similar" rows performing a conditional join with sqldf

Say I got a data.table (can also be data.frame, doesn't matter to me) which has numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
In particular, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I can't quite put it together. Any ideas (not necessarily using sqldf)?
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add another condition, and a.id < b.id, if each row should not be paired with itself and the reverse of each pair should be excluded as well; or add and not a.id = b.id to exclude only the self pairs.
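For example, the version that drops self pairs and reversed pairs would look like this (a sketch following the suggestion above):
sqldf("select a.id, b.id
       from ans a
       join ans b on abs(a.x1 - b.x1) <= 1 and
                     abs(a.x2 - b.x2) <= 1 and
                     abs(a.x3 - b.x3) <= 1 and
                     a.id < b.id")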
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
1 2 3 4 5 6 7 8 9 10 11
1 1 0 0 1 1 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 0 1 0
3 0 0 1 0 0 1 0 0 1 0 0
4 1 1 0 1 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 1 0 0
6 0 0 1 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 1 1
8 0 0 0 0 0 0 0 1 0 0 1
9 0 0 1 0 1 0 0 0 1 0 0
10 0 1 0 0 0 0 1 0 0 1 0
11 0 0 0 0 0 0 1 1 0 0 1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired, or diag(m) <- 0 to exclude only the self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
subset(as.data.frame.table(m), c(Var1) < c(Var2) & Freq == 1)[1:2]
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
Here is a network graph of the above (note that the answer continues after the graph):
# network graph
library(igraph)
g <- graph.adjacency(m)
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
I am quite new to R, so don't expect too much.
What if you create, from your values (which are basically vectors), a matrix of the pairwise distances between them? Then you can find those combinations which differ by at most 1 from each other. This way you can find the matching (a) pairs. Repeat this with (b) and (c) and keep the pairs that appear in all three.
Alternatively, this could probably be done as a cube as well.
Just a thought/hint.
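A rough sketch of that idea, run on the anscombe example from the first answer (columns x1, x2, x3 stand in for a, b, c):
# per-column matrices of absolute pairwise differences; a pair is similar
# only when all three columns differ by at most 1
ok <- Reduce(`&`, lapply(anscombe[1:3], function(col) abs(outer(col, col, "-")) <= 1))
diag(ok) <- FALSE                          # drop self pairs
which(ok & upper.tri(ok), arr.ind = TRUE)  # row-number pairs, each reported once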

How to refer to previous cell in a data-frame column (lagged cell), in R

I’m working in R and am trying to find a way to refer to the previous cell within a vector when that vector belongs to a data frame. By previous cell, I’m essentially hoping for a “lag” command of some sort, so that I can compare each cell to the previous cell. As an example, I have these data:
A <- c(1,0,0,0,1,0,0)
B <- c(1,1,1,1,1,0,0)
AB_df <- cbind(A, B)
What I want is, for a given cell in a given row: if that cell’s value is less than the previous cell’s value in the same column, return a value of 1, and if not, return a value of 0. For this example, the new columns would be called “A-flag” and “B-flag”, as below.
A B A-flag B-flag
1 1 0 0
0 1 1 0
0 1 0 0
0 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
Any suggestions for syntax that can do this? Ideally, to just create a new column variable into an existing data-frame.
Here is one solution using the dplyr package and its lag() function:
library(dplyr)
AB_df <- data.frame(A = A, B = B)
AB_df %>% mutate(A.flag = ifelse(A < lag(A, default = 0), 1, 0),
                 B.flag = ifelse(B < lag(B, default = 0), 1, 0))
A B A.flag B.flag
1 1 1 0 0
2 0 1 1 0
3 0 1 0 0
4 0 1 0 0
5 1 1 0 0
6 0 0 1 1
7 0 0 0 0
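A base R alternative (a sketch; it rebuilds AB_df as a data frame, since cbind() above actually returns a matrix): a cell gets flagged when the lagged difference within its column is negative.
AB_df <- data.frame(A = A, B = B)
# diff() returns x[i] - x[i - 1]; a negative difference means the value dropped
AB_df$A.flag <- c(0, as.integer(diff(AB_df$A) < 0))
AB_df$B.flag <- c(0, as.integer(diff(AB_df$B) < 0))
AB_df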
