Selecting columns based on row values in multiple columns using dplyr - r

I am trying to select columns where at least one row equals 1, only if the same row also has a certain value in a second column. I would prefer to achieve this using dplyr, but any computationally efficient solution is welcome.
Example:
Select columns among a1, a2, a3 containing at least one row where the value is 1 AND where column b=="B"
Example data:
rand <- function(S) {set.seed(S); sample(x = c(0,1),size = 3, replace=T)}
df <- data.frame(a1=rand(1),a2=rand(2),a3=rand(3),b=c("A","B","A"))
Input data:
  a1 a2 a3 b
1  0  0  0 A
2  0  1  1 B
3  1  1  0 A
Desired output:
  a2 a3
1  0  0
2  1  1
3  1  0
I managed to obtain the correct output with the following code. However, it is very inefficient, and I need to run it on a very large dataframe (365,000 rows x 314 columns).
df %>% select_if(function(x) any(paste0(x,.$b) == '1B'))

A solution, not using dplyr:
df[sapply(df[df$b == "B",], function(x) 1 %in% x)]

Here is my dplyr solution:
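library(dplyr)
ids <- df %>%
  reshape2::melt(id.vars = "b") %>%
  filter(value == 1 & b == "B") %>%
  select(variable)
# index by column name (not by factor code) and drop repeated matches
df[, unique(as.character(ids$variable)), drop = FALSE]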
# a2 a3
#1 0 0
#2 1 1
#3 1 0
As suggested by @docendo-discimus, it is easier to convert to long format.
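For a frame of the size mentioned in the question, a vectorised base R variant may be worth trying. This is only a sketch: it subsets the rows where b == "B" once and then uses colSums() on the resulting logical matrix. The names value_cols and hits are made up for illustration, and the example data is typed in by hand so it does not depend on the random seed.

```r
# Example data written out explicitly (same values as the input shown in the question)
df <- data.frame(a1 = c(0, 0, 1),
                 a2 = c(0, 1, 1),
                 a3 = c(0, 1, 0),
                 b  = c("A", "B", "A"))

# Columns to test (everything except the condition column)
value_cols <- setdiff(names(df), "b")

# Keep only rows with b == "B", compare to 1, and count matches per column
hits <- colSums(df[df$b == "B", value_cols, drop = FALSE] == 1) > 0

# Select the matching columns; drop = FALSE keeps a data frame even for one column
result <- df[, value_cols[hits], drop = FALSE]
result
```

Because the comparison runs on the whole subset at once, no string pasting or per-column closure is needed.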

Related

How can I use vectorisation in R to change a DF value based on a condition?

Suppose I have the following DF:
C1 C2
 0  0
 1  1
 1  1
 0  0
 .  .
 .  .
I now want to apply these following conditions on the Dataframe:
The value for C1 should be 1
A random integer between 0 and 5 should be less than 2
If both these conditions are true, I change the C1 and C2 value for that row to 2
I understand this can be done by using the apply function, and I have used the following:
C1 <- c(0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1)
C2 <- c(0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1)
df <- data.frame(C1, C2)
fun <- function(x) {
  if (sample(0:5, 1) < 2) {
    x[1:2] <- 2
  }
  return(x)
}
index <- df$C1 == 1 # First condition
processed_Df <- t(apply(df[index, ], 1, fun)) # Applies second condition
df[index, ] <- processed_Df
Output:
C1 C2
 0  0
 2  2
 1  1
 0  0
 .  .
 .  .
Some rows meet both conditions and some don't (this is the main functionality I would like to achieve).
Now I want to achieve the same result using vectorization, without loops or the apply function. My only confusion is: if I don't use apply, won't every row get the same result for the second condition? For example:
df$C1 <- ifelse(df$C1==1 & sample(0:5, 1) < 5, 2, df$C1)
This changes every row in my DF with C1 == 1 to 2, when many of them should probably stay 1.
Is there a way to get different results for the second condition for each row without using the apply function? Hopefully my question makes sense.
Thanks
You need to draw nrow(df) random values, one per row. Try this method:
set.seed(167814)
df[df$C1 == 1 & sample(0:5, nrow(df), replace = TRUE) < 2, ] <- 2
df
# C1 C2
#1 0 0
#2 2 2
#3 2 2
#4 0 0
#5 1 1
#6 0 0
#7 2 2
#8 0 0
#9 1 1
#10 0 0
#11 1 1
Here is a fully vectorized way. Create the logical index index just like in the question. Then sample all random integers r in one call to sample. Replace in place based on the conjunction of the index and the condition r < 2.
x <- 'C1 C2
0 0
1 1
1 1
0 0'
df1 <- read.table(textConnection(x), header = TRUE)
set.seed(1)
index <- df1$C1 == 1
r <- sample(0:5, length(index), TRUE)
df1[index & r < 2, c("C1", "C2")] <- 2
df1
#> C1 C2
#> 1 0 0
#> 2 1 1
#> 3 2 2
#> 4 0 0
Created on 2022-05-11 by the reprex package (v2.0.1)
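To make the difference between the two calls concrete, here is a small sketch (the names one_draw, per_row and was_one are only illustrative). A single call to sample(0:5, 1) yields a length-one result that R recycles across every row, whereas drawing nrow(df) values gives each row its own outcome.

```r
set.seed(42)
df <- data.frame(C1 = c(0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1))
df$C2 <- df$C1

# Remember which rows satisfy the first condition before modifying anything
was_one <- df$C1 == 1

# One draw: a single TRUE/FALSE that would be recycled over all rows
one_draw <- sample(0:5, 1) < 2

# One draw per row: a logical vector as long as the data frame
per_row <- sample(0:5, nrow(df), replace = TRUE) < 2

# Combine both conditions row by row and replace in place
hit <- was_one & per_row
df[hit, c("C1", "C2")] <- 2
df
```

Only rows where both was_one and per_row are TRUE are set to 2; which rows those are depends on the seed.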

If any value is present in a column of Dataframe, change the value to 1 else insert 0

I have a dataframe with about 1000 rows and 1000 columns. If a cell contains any non-zero value, I want to change it to 1; otherwise I want the cell to be 0. I am programming in R, so R code would be appreciated. The T column should stay unchanged; only the remaining columns should change.
For example
I have a dataframe like this:
T  A  B  C   D
1 29 90  0 100
2 30 12 76   0
3  0 12  0  32
convert it to:
T A B C D
1 1 1 0 1
2 1 1 1 0
3 0 1 0 1
To ignore the first column, you could combine it with a simple modification of akrun's first solution. For example,
data.frame(df[, 1, drop=FALSE], +(df[,-1] != 0))
We can convert to a logical matrix and coerce it to integer
df1 <- +(df != 0)
Or with replace
replace(df, df != 0, 1)
If we need to do this without taking the first column
df[-1] <- +(df[-1] != 0)
Or with sapply
+(sapply(df, `!=`, 0))
In tidyverse, we can use mutate_all
library(dplyr)
df <- df %>%
mutate_all(~ as.integer(. != 0))
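If the column to protect is not guaranteed to be first, selecting by name may be a little safer than df[-1]. The following is a base R sketch under that assumption; cols is just an illustrative name, and the data is the example from the question.

```r
df <- data.frame(T = 1:3,
                 A = c(29, 30, 0),
                 B = c(90, 12, 12),
                 C = c(0, 76, 0),
                 D = c(100, 0, 32))

# Every column except the one named "T"
cols <- setdiff(names(df), "T")

# Convert each selected column to 0/1; assigning into df[cols] keeps the data frame
df[cols] <- lapply(df[cols], function(x) as.integer(x != 0))
df
```

Note that NA values stay NA with this approach, since NA != 0 is NA.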

Sum a group of columns by row count

I'm trying to create a new dataset from an existing one. The new dataset is supposed to combine 60 rows from the original dataset in order to convert a sum of events occurring each second to the total by minute. The number of columns will generally not be known in advance.
For example, with this dataset, if we split it into groups of 3 rows:
d1
  a b c d
1 1 1 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 1 0
5 0 0 1 0
6 1 0 0 0
We'll get this data.frame. Row 1 contains the column sums for rows 1-3 of d1 and Row 2 contains the column sums for rows 4-6 of d1:
d2
  a b c d
1 1 3 0 2
2 1 0 2 0
I've tried d2<-colSums(d1[seq(1,NROW(d1),3),]) which is about as close as I've been able to get.
I've also considered recommendations from How to sum rows based on multiple conditions - R?, How to select every xth row from table, Remove last N rows in data frame with the arbitrary number of rows, sum two columns in R, and Merging multiple rows into single row. I'm all out of ideas. Any help would be greatly appreciated.
Create a grouping variable, group_by that variable, then summarise_all.
# your data
d <- data.frame(a = c(1,0,0,0,0,1),
                b = c(1,1,1,0,0,0),
                c = c(0,0,0,1,1,0),
                d = c(1,1,0,0,0,0))
# create the grouping variable
d$group <- rep(c("A","B"), each = 3)
# apply the sum to all columns
library(dplyr)
d %>%
  group_by(group) %>%
  summarise_all(sum)
Returns:
# A tibble: 2 x 5
  group     a     b     c     d
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A         1     3     0     2
2 B         1     0     2     0
Overview
After reading Split up a dataframe by number of rows, I realized the only thing you need to know is how you'd like to split() d1.
Here, you'd like to split d1 into multiple data frames of 3 rows each. To do so, use rep() to repeat each element of the sequence 1:2 three times (the number of rows divided by the length of the sequence).
After that, the logic involves using map() to sum each column for each data frame created after d1 %>% split(). Here, summarize_all() is helpful since you don't need to know the column names ahead of time.
Once the calculations are complete, you use bind_rows() to stack all the observations back into one data frame.
# load necessary package ----
library(tidyverse)
# load necessary data ----
df1 <-
  read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)
# perform operations --------
df2 <-
  df1 %>%
  # split df1 into two data frames
  # based on three consecutive rows
  split(f = rep(1:2, each = nrow(.) / length(1:2))) %>%
  # for each data frame, apply the sum() function to all the columns
  map(.f = ~ .x %>% summarize_all(.funs = sum)) %>%
  # collapse data frames together
  bind_rows()
# view results -----
df2
#   a b c d
# 1 1 3 0 2
# 2 1 0 2 0
# end of script #
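When the number of rows is not known in advance, the group index can be computed from the row number instead of being typed out. The sketch below uses integer division for that and base R's rowsum() for the per-group column sums; n and group are illustrative names, and n = 3 stands in for the 60 rows per minute in the real data.

```r
d1 <- data.frame(a = c(1, 0, 0, 0, 0, 1),
                 b = c(1, 1, 1, 0, 0, 0),
                 c = c(0, 0, 0, 1, 1, 0),
                 d = c(1, 1, 0, 0, 0, 0))

n <- 3  # rows per group (would be 60 for seconds -> minutes)

# Rows 1..n get group 1, rows n+1..2n get group 2, and so on
group <- (seq_len(nrow(d1)) - 1) %/% n + 1

# Column sums within each group; works for any number of columns
d2 <- rowsum(d1, group)
d2
```

If nrow(d1) is not a multiple of n, the last group simply contains the leftover rows.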

Counting occurrences by row

Imagine I have a data.frame (or matrix) with few different values such as this
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
test2 <- test
If I want to add extra columns with counts I could do:
test2$good <- apply(test,1, function(x) sum(x==1))
test2$bad <- apply(test,1, function(x) sum(x==-1))
test2$neutral <- apply(test,1, function(x) sum(x==0))
But if I had many possible values, I would have to write one line per value, which isn't elegant.
I've tried table(), but the output is not easily usable:
apply(test,1, function(x) table(x))
There is also a big problem: if a row doesn't contain any occurrence of some value, the result generated by table() has a different length and the results can't be bound together.
Is there a way to force table() to take that value into account and report zero occurrences?
Then I thought of using do.call or lapply and merge, but it's too difficult for me.
I've also read about dplyr's count, but I have no clue how to use it.
Could anyone provide a solution with dplyr or tidyr?
PS: What about a data.table solution?
We could melt the dataset to long format after converting to matrix, get the frequency using table and cbind with the original dataset.
library(reshape2)
cbind(test2, as.data.frame.matrix(table(melt(as.matrix(test2))[-2])))
Or use mtabulate on the transpose of 'test2' and cbind with the original dataset.
library(qdapTools)
cbind(test2, mtabulate(as.data.frame(t(test2))))
Or we can use gather/spread from tidyr after creating row id with add_rownames from dplyr
library(dplyr)
library(tidyr)
add_rownames(test2) %>%
  gather(Var, Val, -rowname) %>%
  group_by(rn = as.numeric(rowname), Val) %>%
  summarise(N = n()) %>%
  spread(Val, N, fill = 0) %>%
  bind_cols(test2, .)
You can use rowSums():
test2 <- cbind(test2, sapply(c(-1, 0, 1), function(x) rowSums(test==x)))
This is similar to the code in the comment from etienne, but without the call to apply().
Here is the answer using base R.
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
testCopy <- test
# find all unique values, note that data frame is a list
uniqVal <- unique(unlist(test))
# the new column names start with Y
for (val in uniqVal) {
test[paste0("Y",val)] <- apply(testCopy, 1, function(x) sum(x == val))
}
head(test)
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y-1 Y1 Y0
# 1 -1  0  1  1  1  0 -1 -1  1   1   3  5  2
# 2  1 -1  0  1  1 -1 -1  0  0   1   3  4  3
# 3 -1  0  1  0  1  1  1  1 -1   1   2  6  2
# 4  1  1  1  1  0  1  1  0  1   0   0  7  3
# 5  0 -1  1 -1 -1  0  0  1  0   0   3  2  5
# 6  1  1  0  1  1  1  1  1  1   1   0  9  1
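On the specific question of forcing table() to report zero counts: one way that should work is to convert each row to a factor with a fixed set of levels before tabulating, so every row yields a result of the same length. This is a sketch; lev and counts are illustrative names, and the n_ column prefix is an arbitrary choice.

```r
set.seed(1)
test <- data.frame(replicate(10, sample(c(-1, 0, 1), 20, replace = TRUE,
                                        prob = c(0.2, 0.2, 0.6))))

# All values that occur anywhere in the data, in sorted order
lev <- sort(unique(unlist(test)))

# table() on a factor counts every level, including those absent from a row
counts <- t(apply(test, 1, function(x) table(factor(x, levels = lev))))
colnames(counts) <- paste0("n_", lev)

# Attach the counts to the original data
test2 <- cbind(test, counts)
head(test2)
```

Since each row is tabulated against the same levels, the per-row results always line up and can be bound into a matrix.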

break a data.frame in groups based on a value using dplyr R

I have a big data.frame (200000) and I need to add a column for grouping, and the groups are separated by a row with a particular value e.g.
s<-"A B C
1 2 1
2 22 3
0 0 -1
2 12 2
0 0 -1
20 2 5
1 3 1
0 2 2"
d<-read.delim(textConnection(s),sep=" ",header=T)
C == -1 marks the break point between groups. What I need as a result is 3 groups:
require(dplyr)
Here I find the rows that separate the groups:
mutate(d,rn=row_number()) %>% filter(C==-1)
Then I can build the data.frame I need:
bind_rows(slice(d, 1:2) %>% mutate(grp=1),slice(d,4) %>%mutate(grp=2), slice(d,6:n()) %>% mutate(grp=3))
How can I make it without hard coding the breaks?
How about this:
d %>% mutate(grp = cumsum(C == -1) + 1) %>% filter(C != -1)
cumsum(C == -1) increases by one at every separator row, which gives you the group column; all that's left is to filter out the separator rows themselves.
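For readers who want to see the intermediate values, here is the same idea as a base R sketch on the question's data (is_break and res are illustrative names):

```r
d <- data.frame(A = c(1, 2, 0, 2, 0, 20, 1, 0),
                B = c(2, 22, 0, 12, 0, 2, 3, 2),
                C = c(1, 3, -1, 2, -1, 5, 1, 2))

# TRUE on the separator rows
is_break <- d$C == -1

# Running count of separators seen so far, shifted so the first group is 1
d$grp <- cumsum(is_break) + 1

# Drop the separator rows, keeping the group labels
res <- d[!is_break, ]
res
```

The separator rows get the label of the group that follows them, which does not matter because they are removed in the last step.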
