Check condition row-wise for a number of columns [duplicate]

This question already has an answer here:
How to subset all rows in a dataframe that have a particular value
(1 answer)
Closed 2 years ago.
Data example:
df <- data.frame("a" = c(1,2,3,4), "b" = c(4,3,2,1), "x_ind" = c(1,0,1,1), "y_ind" = c(0,0,1,1), "z_ind" = c(0,1,1,1) )
> df
a b x_ind y_ind z_ind
1 1 4 1 0 0
2 2 3 0 0 1
3 3 2 1 1 1
4 4 1 1 1 1
I want to add a new column that checks whether, in the columns ending in "_ind", every value in the row equals 1. If so, it returns 1, otherwise 0. The resulting dataframe would look like:
a b x_ind y_ind z_ind keep
1 1 4 1 0 0 0
2 2 3 0 0 1 0
3 3 2 1 1 1 1
4 4 1 1 1 1 1
I can select the columns using df %>% select(contains("_ind")), but I am not sure how to do a row-wise operation that checks whether every value in the row equals 1, and then append the result back to the original dataframe.
Any help would be appreciated! I am working with dplyr but appreciate any solution.

You can use rowSums on the comparison of your df with 1, i.e.
rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))])
#[1] FALSE FALSE TRUE TRUE
Continuing your dplyr attempt you can do,
df %>%
  select(contains("_ind")) %>%
  mutate(new = rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind new
#1 1 0 0 FALSE
#2 0 0 1 FALSE
#3 1 1 1 TRUE
#4 1 1 1 TRUE
# Or you can filter directly
df %>%
  select(contains("_ind")) %>%
  filter(rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind
#1 1 1 1
#2 1 1 1
If you want to also keep the original columns, you can use
df %>%
filter_at(vars(ends_with('_ind')), all_vars(. == 1))
# a b x_ind y_ind z_ind
#1 3 2 1 1 1
#2 4 1 1 1 1
NOTE: When we use (.), the dot refers to the resulting data frame. In this case, it refers to the columns specified in the condition (i.e. to the columns that end with _ind).
Similarly in base R,
df[rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))]),]
# a b x_ind y_ind z_ind
#3 3 2 1 1 1
#4 4 1 1 1 1

You can use rowwise with c_across in new dplyr :
library(dplyr)
df %>% rowwise() %>% mutate(keep = +all(c_across(ends_with('ind')) == 1))
# a b x_ind y_ind z_ind keep
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
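Newer dplyr (1.0.4+) also provides if_all(), which performs the same every-column check without the per-row cost of rowwise(); a minimal sketch giving the same keep column:
library(dplyr)
# if_all() is TRUE only when the predicate holds for every selected column;
# the unary + coerces the logical to 0/1
df %>% mutate(keep = +if_all(ends_with('_ind'), ~ .x == 1))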

You can use apply with all, using endsWith to get the columns ending with _ind and testing if they are == 1.
df$keep <- +(apply(df[,endsWith(colnames(df), "_ind")]==1, 1, all))
df
# a b x_ind y_ind z_ind keep
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
or using rowSums
df$keep <- +(rowSums(df[,endsWith(colnames(df), "_ind")]!=1) == 0)

Restarting a Counter By Groups Under Conditions [duplicate]

This question already has answers here:
Create counter within consecutive runs of certain values
(6 answers)
Closed 29 days ago.
I have the following dataset:
id = c("A","A","A","A","A","B", "B", "B", "B")
result = c(1,1,0,1,1,0,1,0,1)
my_data = data.frame(id, result)
For each unique id, I want to create a "counter variable" that:
if the first result value is 1, then counter = 1, else 0
increases by 1 each time result = 1
becomes 0 when result = 0
remains 0 until the next result = 1 is encountered
then resumes increasing by 1 for each result = 1
when the next unique id is encountered, the counter re-initializes to 1 if result = 1, else 0
I think the final result should look something like this:
id result counter
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
I have these two pieces of code that I am trying to use:
# creates counter by treating entire dataset as a single ID
my_data$counter = unlist(lapply(split(my_data$result, c(0, cumsum(abs(diff(!my_data$result == 1))))), function(x) (x[1] == 1) * seq(length(x))))
# creates counter by taking into consideration ID's
my_data$counter = ave(my_data$result, my_data$id, FUN = function(x) {tmp <- cumsum(x); tmp - cummax((!x) * tmp)})
But I am not sure how to interpret these correctly. For example, I am interested in learning how to write a general function that accomplishes this task under general conditions - e.g. if result = AAA then the counter resets to 0, if result = BBB then counter + 1, if result = CCC then counter + 2, if result = DDD then counter - 1.
Can someone please show me how to do this?
Thanks!
We may create a grouping column with rleid and then group by 'id' and the rleid of 'result'. Within each run of identical 'result' values, row_number() counts 1, 2, ..., and multiplying by 'result' zeroes out the runs of 0s:
library(dplyr)
library(data.table)
my_data %>%
  group_by(id) %>%
  mutate(grp = rleid(result)) %>%
  group_by(grp, .add = TRUE) %>%
  mutate(counter = row_number() * result) %>%
  ungroup() %>%
  select(-grp)
-output
# A tibble: 9 × 3
id result counter
<chr> <dbl> <dbl>
1 A 1 1
2 A 1 2
3 A 0 0
4 A 1 1
5 A 1 2
6 B 0 0
7 B 1 1
8 B 0 0
9 B 1 1
Or using data.table
library(data.table)
setDT(my_data)[, counter := seq_len(.N) * result, .(id, rleid(result))]
-output
> my_data
id result counter
1: A 1 1
2: A 1 2
3: A 0 0
4: A 1 1
5: A 1 2
6: B 0 0
7: B 1 1
8: B 0 0
9: B 1 1
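As for the general rules asked about in the question (value-specific increments plus a reset value), a hedged base R sketch: map each result value to an increment and accumulate per id with Reduce, resetting wherever the map says so. The steps mapping below is hypothetical, mirroring the AAA/BBB/CCC/DDD example:
# hypothetical mapping: NA marks "reset the counter to 0"
steps <- c(AAA = NA, BBB = 1, CCC = 2, DDD = -1)
counter <- function(x) {
  inc <- steps[as.character(x)]
  out <- Reduce(function(acc, s) if (is.na(s)) 0 else acc + s,
                inc, init = 0, accumulate = TRUE)
  out[-1] # drop the initial 0
}
# apply per id
my_data$counter <- ave(seq_along(my_data$result), my_data$id,
                       FUN = function(i) counter(my_data$result[i]))
With steps <- c(`0` = NA, `1` = 1) this reproduces the original 0/1 counter.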

sum of each pair of columns in R

I have a dataframe like this:
dd<-data.frame(col1=c(1,0,1),col2=c(1,1,1),col3=c(1,0,0),col4=c(1,0,1,0,1,0,1,0))
And I would like to get the sum of each pair of columns, like:
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
2 2 2 2 2 2
1 1 1 1 1 0
1 1 2 1 2 1
2 1 1 1 1 0
I didn't find any function that does this. Please help me.
One base R option might be combn + rowSums
setNames(
  as.data.frame(combn(dd, 2, rowSums)), # row sums of every 2-column combination
  combn(names(dd), 2, paste0, collapse = "+") # matching "colX+colY" names
)
which gives
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1
Data (col4 shortened to three values so that all columns have the same length):
dd<-data.frame(col1=c(1,0,1),col2=c(1,1,1),col3=c(1,0,0),col4=c(1,0,1))
One dplyr and purrr possibility could be:
library(dplyr)
library(purrr)
map_dfc(.x = combn(names(dd), 2, simplify = FALSE),
        ~ dd %>%
          rowwise() %>%
          transmute(!!paste(.x, collapse = "+") := sum(c_across(all_of(.x)))))
`col1+col2` `col1+col3` `col1+col4` `col2+col3` `col2+col4` `col3+col4`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1
Uglier, and slower than the base R above:
do.call("cbind", setNames(Map(function(i) dd[, i] + dd[, -seq_len(i)],
                              seq_len(ncol(dd) - 1)),
                          names(dd)[seq_len(ncol(dd) - 1)]))

Identifying Duplicates in `data.frame` Using `dplyr`

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.
Example:
| A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, rows 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in rows 1, 2, 3 and 4, since rows 3 and 4 are also identical.
Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing only by one column. That is, F in rows 1, 2 and 5 would be equal to 1, since they differ only in the B column.
I hope it is clear what I want to do and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R but I believe (hope) that there exists a smoother solution.
You can use dist() to compute the differences, and then a search in the resulting distance object gives the needed answers (E, F, etc.). Here is example code, where X is the original data.frame:
# Manhattan distance between 0/1 rows = number of differing columns
W <- as.matrix(dist(X, method = "manhattan"))
# E: some other row at distance 0 (an exact duplicate); F: at distance 1
X$E <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 1))
Just change D= to the number of differing columns needed.
It's all base R, though. Using plyr::laply instead of sapply has the same effect; dplyr looks like overkill here.
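If you only need the exact-duplicate flag E, a compact dplyr sketch (dplyr 1.0+ for across()) is to group by all columns at once; rows in a group of size greater than 1 are exact duplicates:
library(dplyr)
X %>%
  group_by(across(everything())) %>% # group on every column simultaneously
  mutate(E = as.integer(n() > 1)) %>%
  ungroup()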
Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same) - not sure if someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column - in your example all rows would get a 1, because rows 3 and 4 are one column different from row 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))
DT[, ind := .I]
DT_melt <- melt(DT, id.var = "ind", variable.name = "column")
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.table that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 common columns (duplicates). Rows 3 and 5 have 4 common columns, and 4 and 5 have 4 common columns. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
Here is a dplyr solution (note that it compares each row only with its neighbours via lag/lead, so duplicates must be adjacent):
test %>%
  mutate(flag = (A == lag(A) &
                   B == lag(B) &
                   C == lag(C) &
                   D == lag(D))) %>%
  mutate(twice = lead(flag) == TRUE) %>%
  mutate(E = ifelse(flag == TRUE | twice == TRUE, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, 1)) %>%
  mutate(FF = ifelse(((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D))) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
We could use fill from tidyr after changing to NA the 'B' values that correspond to 0 in 'A':
library(dplyr)
library(tidyr)
dat %>%
  mutate(B = NA^(!A) * B) %>% # NA^TRUE is NA, NA^FALSE is 1, so B becomes NA wherever A == 0
  fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
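A base R sketch of the same carry-forward, without zoo or tidyr, assuming the first row has A == 1 (otherwise the index would contain zeros): point each row at the most recent row where A == 1 and take B from there.
idx <- cummax(seq_along(dat$A) * (dat$A == 1)) # last position so far with A == 1
dat$B <- dat$B[idx]
dat$B
#[1] 0 0 0 0 1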
Here's a solution using grouping and rleid (run length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, and rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid where A == 1. Then we group and take the first B-value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, I switched the condition: it should be 'if all A != 1', not 'if not all A == 1'.)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
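For the general question of a mutate column that depends on its own previous values, purrr::accumulate expresses the recurrence directly; a hedged sketch for this example:
library(purrr)
dat <- data.frame(A = c(1,0,0,0,1), B = c(0,1,1,1,1))
# walk the rows top to bottom; each step sees the previous *updated* B
dat$B <- accumulate(2:nrow(dat), .init = dat$B[1],
                    function(prev, i) if (dat$A[i] == 0) prev else dat$B[i])
dat$B
#[1] 0 0 0 0 1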

Grouping and Counting instances?

Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted):
EDIT - explaining the transformation: x is what I'm grouping by. For each x group, I want to count how many times each value (0, 1, 2) appears in the other columns. For example, the first row of the transformed dataframe counts the value 0 among the rows where x = 1: it appears once in column a, twice in column b and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but here that is not necessary, since the default aggregation function is length. Without passing one explicitly, you will get a warning about that:
Aggregation function missing: defaulting to length
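To make the default explicit and silence the warning, the aggregation function can be passed directly; the result is identical:
dcast(melt(setDT(df), id.vars = "x"), x + value ~ variable,
      fun.aggregate = length)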
Furthermore, if you do not explicitly convert the dataframe to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
  gather(variable, value, -x) %>%
  count(x, variable, value) %>%
  spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>% gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to get the information on how often each value occurs in columns a to c. After that, you reshape the dataset to the required format using spread.
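gather() and spread() are superseded in current tidyr; a sketch of the same pipeline with the pivot verbs:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-x, names_to = "variable") %>% # default values_to = "value"
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)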
