I have a data set where a bunch of categorical variables were converted to dummy variables (all classes used, NOT n-1) and some were not. I'm trying to recode each group back into a single column.
For instance:
Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1
Is there a simple way to convert this to:
Q1 Q2 Q3
1 3 2
2 4 1
3 2 2
Right now I'm just using strsplit() (since all the dummied variable names contain '.') with a couple of loops, but I feel like there should be a better way. Any suggestions?
I wrote a function a while back that did this sort of thing.
MultChoiceCondense <- function(vars, indata) {
  tempvar <- matrix(NaN, ncol = 1, nrow = nrow(indata))
  dat <- indata[, vars]
  for (i in 1:length(vars)) {
    for (j in 1:nrow(indata)) {
      if (dat[j, i] == 1) tempvar[j] <- i
    }
  }
  return(tempvar)
}
If your data is called Dat, then:
Dat$Q1 <- MultChoiceCondense(c("Q1.1", "Q1.2", "Q1.3"), Dat)
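For comparison, here's a hedged vectorized sketch of the same idea using max.col(); it assumes every row has exactly one 1 among the dummy columns (an all-zero row would get an arbitrary index rather than the NaN the loop version produces):
## Hypothetical vectorized equivalent; assumes exactly one 1 per row.
## max.col() returns the column index of the (first) maximum in each row.
MultChoiceCondenseVec <- function(vars, indata) {
  max.col(indata[, vars])
}
Dat$Q1 <- MultChoiceCondenseVec(c("Q1.1", "Q1.2", "Q1.3"), Dat)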
Here's an approach that uses melt from "reshape2" and cSplit from my "splitstackshape" package along with some "data.table" fun. I've loaded dplyr so that we can pipe all the things.
library(splitstackshape)
library(reshape2)
library(dplyr)
mydf %>%
as.data.table(keep.rownames = TRUE) %>% # Convert to data.table. Keep rownames
melt(id.vars = "rn", variable.name = "V") %>% # Melt the dataset by rownames
.[value > 0] %>% # Subset for all non-zero values
cSplit("V", ".") %>% # Split the "V" column (names) by "."
.[is.na(V_2), V_2 := value] %>% # Replace NA values with actual values
dcast.data.table(rn ~ V_1, value.var = "V_2") # Go wide.
# rn Q1 Q2 Q3
# 1: 1 1 3 2
# 2: 2 2 4 1
# 3: 3 3 2 2
Here's a possible base R approach:
## Which columns are binary?
Bins <- sapply(mydf, function(x) {
all(x %in% c(0, 1))
})
## Two vectors: the part after the dot and the part before it
X <- gsub(".*\\.(.*)$", "\\1", names(mydf)[Bins])
Y <- unique(gsub("(.*)\\..*$", "\\1", names(mydf)[Bins]))
## Use `apply` to subset the X value based on the
## logical version of the binary variable
cbind(mydf[!Bins],
`colnames<-`(t(apply(mydf[Bins], 1, function(z) {
X[as.logical(z)]
})), Y))
# Q2 Q1 Q3
# 1 3 1 2
# 2 4 2 1
# 3 2 3 2
At the end, you can just reorder the columns as required. You may also need to convert them to numeric since, in this case, Q1 and Q3 will be factors.
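If you save the cbind() result first, a quick sketch of that cleanup (assuming it was stored as res; res is an illustrative name):
## Sketch: assuming the cbind() result above was saved as `res`
res[] <- lapply(res, function(x) as.numeric(as.character(x)))  # factors -> numeric
res <- res[c("Q1", "Q2", "Q3")]                                # restore column order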
Another base R approach:
dat <- read.table(header = TRUE, text = "Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1")
## this takes each unique question (Q1, Q2, Q3); if it is a single
## (non-dummied) column it is returned as-is; otherwise max.col()
## finds which dummy column holds the 1
res <- lapply(unique(gsub('\\..*', '', names(dat))), function(x) {
tmp <- dat[, grep(x, names(dat)), drop = FALSE]
if (ncol(tmp) == 1) unlist(tmp, use.names = FALSE) else max.col(tmp)
})
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 3 4 2
#
# [[3]]
# [1] 2 1 2
do.call('cbind', res)
# [,1] [,2] [,3]
# [1,] 1 3 2
# [2,] 2 4 1
# [3,] 3 2 2
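If you want the question names back on the combined matrix, a small sketch reusing the same gsub() call from above:
## Sketch: re-attach the question names to the combined result
out <- do.call(cbind, res)
colnames(out) <- unique(gsub('\\..*', '', names(dat)))
out
#      Q1 Q2 Q3
# [1,]  1  3  2
# [2,]  2  4  1
# [3,]  3  2  2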
I'm assuming your data looks like this, where the categorical columns are encoded with a dot-separated suffix. You may also have a case where all of the values in a row are zero, which indicates a base level (such as how dummyVars in caret works with fullRank=FALSE). If so, here is a vectorized solution.
library(dplyr)
dummyVars.undo = function(df, col_prefix) {
if (!endsWith(col_prefix, '.')) {
# If col_prefix doesn't end with a period, include one, but save the
# "pretty name" as the one without a period
pretty_col_prefix = col_prefix
col_prefix = paste0(col_prefix, '.')
} else {
# Otherwise, strip the period for the pretty column name
pretty_col_prefix = substr(col_prefix, 1, nchar(col_prefix)-1)
}
# Get all columns with that encoding prefix
cols = names(df)[names(df) %>% startsWith(col_prefix)]
# Find the rows where all values are zero. If this isn't the case
# with your data there's no worry, it won't hurt anything.
base_level.idx = rowSums(df[cols]) == 0
# Set the column value to a base value of zero
df[base_level.idx, pretty_col_prefix] = 0
# Go through the remaining columns and find where the maximum value (1) occurs
df[!base_level.idx, pretty_col_prefix] = cols[apply(df[!base_level.idx, cols], 1, which.max)] %>%
strsplit('\\.') %>%
sapply(tail, 1)
# Drop the encoded columns
df[cols] = NULL
return(df)
}
Usage:
# Collapse Q1
df = dummyVars.undo(df, 'Q1')
# Collapse Q3
df = dummyVars.undo(df, 'Q3')
This uses dplyr, but only for the pipe operator %>%. You could certainly remove that if you'd prefer to do base R instead.
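For reference, here is a sketch of those two piped expressions rewritten as plain nested calls (behavior unchanged; the variable names are the ones used inside dummyVars.undo above):
## Sketch: the piped lines from dummyVars.undo() without magrittr
cols = names(df)[startsWith(names(df), col_prefix)]
df[!base_level.idx, pretty_col_prefix] = sapply(
  strsplit(cols[apply(df[!base_level.idx, cols], 1, which.max)], ".", fixed = TRUE),
  tail, 1
)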
So I have this vector:
a = sample(0:3, size=30, replace = T)
[1] 0 1 3 3 0 1 1 1 3 3 2 1 1 3 0 2 1 1 2 0 1 1 3 2 2 3 0 1 3 2
What I want to have is a list of vectors with all the elements that are separated by n 0s. So in this case, with n = 0 (there can't be any 0 between the consecutive values), this would give:
res = c([1,3,3], [1,1,1,3,3,2,1,1,3], [2,1,1,2]....)
However, I would like the n parameter to be flexible, so that if I set it to, say, 2, then something like this:
b = c(1,2,0,3,0,0,4)
would result in:
res = c([1,2,3],[4])
I tried a lot of approaches with while loops inside for loops, trying to count the number of 0s, but I just could not get it to work.
Update
I tried to post the question in a more real-world setting here:
Flexibly calculate column based on consecutive counts in another column in R
Thank you all for the help. I just don't seem to manage to put your help into practice with my limited knowledge.
Here is a base R option using rle + split for general cases, i.e., the values in b are not limited to 0 to 3.
with(
rle(with(rle(b == 0), rep(values & lengths == n, lengths))),
Map(
function(x) x[x != 0],
unname(split(b, cut(seq_along(b), c(0, cumsum(lengths))))[!values])
)
)
which gives (assuming n=2)
[[1]]
[1] 1 2 3
[[2]]
[1] 4
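Wrapped up as a reusable function (a sketch; split_at_zeros is just an illustrative name):
## Illustrative wrapper around the rle() + split() idea above
split_at_zeros <- function(b, n) {
  # mark runs of exactly n zeros as separators
  seps <- with(rle(b == 0), rep(values & lengths == n, lengths))
  # number the alternating separator / non-separator stretches
  grp <- with(rle(seps), rep(seq_along(values), lengths))
  # keep the non-separator stretches and drop any leftover zeros
  keep <- unname(split(b, grp))[!with(rle(seps), values)]
  lapply(keep, function(x) x[x != 0])
}
split_at_zeros(c(1, 2, 0, 3, 0, 0, 4), n = 2)
# [[1]]
# [1] 1 2 3
# [[2]]
# [1] 4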
If your values are within the range 0 to 9, you can try the code below:
lapply(
unlist(strsplit(paste0(b, collapse = ""), strrep(0, n))),
function(x) {
as.numeric(
unlist(strsplit(gsub("0", "", x), ""))
)
}
)
which also gives
[[1]]
[1] 1 2 3
[[2]]
[1] 4
I also wanted to share a somewhat useful solution using the function SplitAt from DescTools:
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)])
where a is your initial vector. It gives you a list in which every entry contains the run of numbers between zeros.
If you then add another SplitAt call, you can split the result into as many sublists as you want, e.g.:
n <- 4
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)]) %>% SplitAt(., n)
which splits the resulting list again at position n.
set.seed(1)
a <- sample(0:3, size=30, replace = T)
a
[1] 0 3 2 0 1 0 2 2 1 1 2 2 0 0 0 1 1 1 1 2 0 2 0 0 0 0 1 0 0 1
a2 <- paste(a, collapse = "")    # collapse into a single string, making patterns easier to handle
a3 <- unlist(strsplit(a2, "0"))  # split on "0"; change to whatever pattern you want, like "00"
a3 <- a3[a3 != ""]               # remove empty elements
a3 <- as.numeric(a3)             # turn back to numeric
a3
[1] 32 1 221122 11112 2 1 1
I have a data frame like the one below, with name and email columns.
df <- data.frame(name=c("maay,bhtr","nsgu,nhuts thang","affat,nurfs","nukhyu,biyts","ngyst,muun","nsgyu,noon","utrs guus,book","thum,cryant","mumt,cant","bhan,btan","khtri,ntuk","ghaan,rstu","shaan,btqaan","nhue,bjtraan","wutys,cyun","hrtsh,jaan"),
email=c("maay.bhtr#email.com","nsgu.nhuts#gmail.com","asfa.1234#gmail.com","nukhyu.biyts#gmail.com","ngyst.muun#gmail.com","nsgyu.noon#gmail.com","utrs.book#hotmail.com","thum.cryant#live.com","mumt.cant#gmail.com","bhan.btan#gmail.com","khtri.ntuk#gmail.c.om","chang.lee#gmail.com","shaan.btqaan#gmail.com","nhue.bjtraan#gmail.com","wutys.cyun#gmailcom","hrtsh.jaan#gmail.com"))
I am looking for a way to check whether the first name or the last name appears in the email id, and to mutate a new column to TRUE if so.
In base R we can use Map() and sapply() to loop through the list and create a logical vector to append to df.
Since this code includes a lot of nested apply statements, let me try to explain what's going on. The code is probably best understood starting from the inside.
# strsplit() turns the names column into a list of first/last name pairs
strsplit(df[,1], ",")
# this next line checks whether a single name t occurs in the email address y
grepl(t, y, fixed = TRUE)
# wrapped in sapply(), this returns two TRUE/FALSE values (first and last name) per row;
# the outer Map()/sapply() pair does that for every row, and
# lastly we collapse this list into a single TRUE/FALSE for each df entry
Code:
a <- sapply(
  Map(function(x, y) {
    sapply(x, function(t) grepl(t, y, fixed = TRUE))
  }, strsplit(df[, 1], ","), df[, 2]),
  any  # TRUE if either the first or the last name was found
)
# result
cbind(df, a)
name email a
1 maay,bhtr maay.bhtr#email.com TRUE
2 nsgu,nhuts thang nsgu.nhuts#gmail.com TRUE
3 affat,nurfs asfa.1234#gmail.com FALSE
4 nukhyu,biyts nukhyu.biyts#gmail.com TRUE
5 ngyst,muun ngyst.muun#gmail.com TRUE
6 nsgyu,noon nsgyu.noon#gmail.com TRUE
7 utrs guus,book utrs.book#hotmail.com TRUE
8 thum,cryant thum.cryant#live.com TRUE
9 mumt,cant mumt.cant#gmail.com TRUE
10 bhan,btan bhan.btan#gmail.com TRUE
11 khtri,ntuk khtri.ntuk#gmail.c.om TRUE
12 ghaan,rstu chang.lee#gmail.com FALSE
13 shaan,btqaan shaan.btqaan#gmail.com TRUE
14 nhue,bjtraan nhue.bjtraan#gmail.com TRUE
15 wutys,cyun wutys.cyun#gmailcom TRUE
16 hrtsh,jaan hrtsh.jaan#gmail.com TRUE
Maybe you can try
within(
df,
consistent <- mapply(
function(x, y) 1 - any(mapply(grepl, x, y)),
strsplit(name, ","),
strsplit(gsub("#.*", "", email), "\\.")
)
)
which gives
name email consistent
1 maay,bhtr maay.bhtr#email.com 0
2 nsgu,nhuts thang nsgu.nhuts#gmail.com 0
3 affat,nurfs asfa.1234#gmail.com 1
4 nukhyu,biyts nukhyu.biyts#gmail.com 0
5 ngyst,muun ngyst.muun#gmail.com 0
6 nsgyu,noon nsgyu.noon#gmail.com 0
7 utrs guus,book utrs.book#hotmail.com 0
8 thum,cryant thum.cryant#live.com 0
9 mumt,cant mumt.cant#gmail.com 0
10 bhan,btan bhan.btan#gmail.com 0
11 khtri,ntuk khtri.ntuk#gmail.c.om 0
12 ghaan,rstu chang.lee#gmail.com 1
13 shaan,btqaan shaan.btqaan#gmail.com 0
14 nhue,bjtraan nhue.bjtraan#gmail.com 0
15 wutys,cyun wutys.cyun#gmailcom 0
16 hrtsh,jaan hrtsh.jaan#gmail.com 0
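If you would rather have TRUE/FALSE for matches (like the Map()/sapply() answer above), drop the 1 - and the any() result can be used directly; a sketch:
## Sketch: TRUE when any name part appears in the email local part
df$match <- mapply(
  function(x, y) any(mapply(grepl, x, y)),
  strsplit(df$name, ","),
  strsplit(gsub("#.*", "", df$email), "\\.")
)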
You could do this as follows; the code is commented below.
df <- data.frame(name=c("maay,bhtr","nsgu,nhuts thang","affat,nurfs","nukhyu,biyts","ngyst,muun","nsgyu,noon","utrs guus,book","thum,cryant","mumt,cant","bhan,btan","khtri,ntuk","ghaan,rstu","shaan,btqaan","nhue,bjtraan","wutys,cyun","hrtsh,jaan"),
email=c("maay.bhtr#email.com","nsgu.nhuts thang#gmail.com","asfa.1234#gmail.com","nukhyu.biyts#gmail.com","ngyst.muun#gmail.com","nsgyu.noon#gmail.com","utrs guus.book#hotmail.com","thum.cryant#live.com","mumt.cant#gmail.com","bhan.btan#gmail.com","khtri.ntuk#gmail.c.om","chang.lee#gmail.com","shaan.btqaan#gmail.com","nhue.bjtraan#gmail.com","wutys.cyun#gmailcom","hrtsh.jaan#gmail.com"))
library(stringr)
library(dplyr)
## extract all of the names: any string of letters unbroken by a space, punctuation, or a number
names <- str_extract_all(df$name, "[A-Za-z]*") %>%
## make a matrix out of the names
do.call(rbind, .) %>%
## turn the names into a data frame
as.data.frame()
## some of the columns have all "" in them, find which ones are all ""
w <- sapply(names, function(x)all(x == ""))
## if any of the columns are all "" then ...
if(any(w)){
## remove those columns from the dataset
names <- names[,-which(w)]
}
## add email into this dataset that has the individual names
names$email <- df$email
library(tidyr)
## pipe the names dataset (which has individual names and an e-mail address)
out <- names %>%
## switch from wide to long format
pivot_longer(-email, names_to="V", values_to="n") %>%
## create consistent = 1 if the name is not detected in the e-mail
mutate(consistent = !str_detect(email, n)) %>%
## group the data by e-mail
group_by(email) %>%
## take the maximum of consistent by group
## this will be 1 if any of the names are not detected in the e-mail
summarise(consistent = max(consistent)) %>%
## join back together with the original data
left_join(df) %>%
## change the variable ordering back
select(name, email, consistent)
out
# # A tibble: 16 x 3
# name email consistent
# <chr> <chr> <int>
# 1 affat,nurfs asfa.1234#gmail.com 1
# 2 bhan,btan bhan.btan#gmail.com 0
# 3 ghaan,rstu chang.lee#gmail.com 1
# 4 hrtsh,jaan hrtsh.jaan#gmail.com 0
# 5 khtri,ntuk khtri.ntuk#gmail.c.om 0
# 6 maay,bhtr maay.bhtr#email.com 0
# 7 mumt,cant mumt.cant#gmail.com 0
# 8 ngyst,muun ngyst.muun#gmail.com 0
# 9 nhue,bjtraan nhue.bjtraan#gmail.com 0
# 10 nsgu,nhuts thang nsgu.nhuts thang#gmail.com 0
# 11 nsgyu,noon nsgyu.noon#gmail.com 0
# 12 nukhyu,biyts nukhyu.biyts#gmail.com 0
# 13 shaan,btqaan shaan.btqaan#gmail.com 0
# 14 thum,cryant thum.cryant#live.com 0
# 15 utrs guus,book utrs guus.book#hotmail.com 0
# 16 wutys,cyun wutys.cyun#gmailcom 0
#
Note: I had to change two of the email values in your dataset to match the image you posted.
I want to remove all rows in which every numeric value is either zero or NA. In the code below I select the numeric variables and then filter out the all-zero rows; the problem is that the character variables are missing from the final output.
df <- read.table(header = TRUE, text =
"x y z
a 1 2
b 0 3
c 1 NA
d 0 NA
")
df %>% select_if(is.numeric) %>% filter(rowSums(., na.rm = T)!=0)
You can use filter_if:
library(dplyr)
df %>% filter_if(is.numeric, any_vars(. != 0 & !is.na(.)))
# x y z
#1 a 1 2
#2 b 0 3
#3 c 1 NA
Or using base R :
cols <- sapply(df, is.numeric)
df[rowSums(!is.na(df[cols]) & df[cols] != 0) > 0, ]
Another dplyr option could be:
df %>%
rowwise() %>%
filter(any(across(where(is.numeric)) != 0, na.rm = TRUE))
x y z
<fct> <int> <int>
1 a 1 2
2 b 0 3
3 c 1 NA
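With dplyr >= 1.0.0, c_across() is the documented companion to rowwise(); an equivalent sketch:
## Equivalent sketch using c_across() (requires dplyr >= 1.0.0)
df %>%
  rowwise() %>%
  filter(any(c_across(where(is.numeric)) != 0, na.rm = TRUE)) %>%
  ungroup()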
Following the suggestions written in this new doc page after the release of dplyr version 1.0.0, you can create a helper function to substitute the superseded functions filter_if and any_vars.
Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there's no direct replacement for any_vars(). However you can make a simple helper yourself.
From now on, this should be the reference method for this kind of filtering step.
rowAny <- function(x) {rowSums(x != 0 & !is.na(x)) > 0}
df %>% filter(rowAny(across(where(is.numeric))))
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA
You could simply do the following (the suppressed warnings come from coercing the character column to NA, which rowSums then ignores):
df[rowSums(suppressWarnings(sapply(df, as.double)), na.rm=TRUE) > 0, ]
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA
Hi, here is my reproducible example.
HAVE <- data.frame(ID=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
ABSENCE=c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
TIME=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
WANT <- data.frame(ID=c(1,2,3,4,5,6),
ABSENCE=c(NA,0,1,0,1,0),
TIME=c(NA,3,3,2,2,3))
The tall data file HAVE is the one I need to convert to WANT. Essentially, for each ID I need to identify the first non-zero value of ABSENCE, and that value goes into WANT. If all values of ABSENCE are NA, then TIME is NA. If all values of ABSENCE are 0, then I report the last possible row (as reflected in the TIME variable).
This is my attempt:
WANT <- group_by(HAVE,ID) %>% slice(seq_len(min(which(ABSENCE > 0), n())))
but I do not know how to take the last of the 0 rows if there are only 0s.
library(data.table)
setDT(HAVE)
res = unique(HAVE[, .(ID)])
# look up first ABSENCE > 0
res[, c("ABSENCE", "TIME") := unique(HAVE[ABSENCE > 0], by="ID")[.SD, on=.(ID), .(ABSENCE, TIME)]]
# if nothing was found, look up last ABSENCE == 0
res[is.na(ABSENCE), c("ABSENCE", "TIME") := unique(HAVE[ABSENCE == 0], by="ID", fromLast=TRUE)[.SD, on=.(ID), .(ABSENCE, TIME)]]
# check
all.equal(as.data.frame(res), WANT)
# [1] TRUE
ID ABSENCE TIME
1: 1 NA NA
2: 2 0 3
3: 3 1 3
4: 4 0 2
5: 5 1 2
6: 6 0 3
I'm using data.table since the tidyverse does not and never will support sub-assignment / modifying only rows selected by a condition (like the is.na(ABSENCE) here).
If the two rules can be made more consistent with each other, this should be doable in a left join or a single group_by + slice as the OP attempted. Okay, here's one way, though it looks impossible to debug:
HAVE %>%
  arrange(ID,
          -(ABSENCE > 0),        # rows with ABSENCE > 0 sort first within each ID
          TIME * (ABSENCE > 0),  # among those, earliest TIME first
          -TIME) %>%             # for all-zero IDs, latest TIME first
  distinct(ID, .keep_all = TRUE)
ID ABSENCE TIME
1 1 NA 3
2 2 0 3
3 3 1 3
4 4 0 2
5 5 1 2
6 6 0 3
Using data.table as well, based on subsetting the .I row counter:
WANT <- HAVE[
HAVE[,
if(all(is.na(ABSENCE))) .I[1] else
if(!any(ABSENCE > 0, na.rm=TRUE)) max(.I[ABSENCE==0], na.rm=TRUE) else
min(.I[ABSENCE > 0], na.rm=TRUE),
by=ID
]$V1,
]
WANT[is.na(ABSENCE), TIME := NA_integer_]
# ID ABSENCE TIME
#1: 1 NA NA
#2: 2 0 3
#3: 3 1 3
#4: 4 0 2
#5: 5 1 2
#6: 6 0 3
Here are two approaches using dplyr and custom functions. Both rely on the data being sorted by TIME.
Filter Approach
# We'll use this function inside filter() to keep only the desired rows
flag_wanted <- function(absence){
  flags <- rep(FALSE, length(absence))
  if (any(absence > 0, na.rm = TRUE)) {
    # There's a nonzero value somewhere in absence; we want the first one.
    flags[which.max(absence > 0)] <- TRUE
  } else if (any(absence == 0, na.rm = TRUE)) {
    # There's a zero value somewhere in absence; we want the last one.
    flags[max(which(absence == 0))] <- TRUE
  } else {
    # All values are NA; we want the last row.
    flags[length(absence)] <- TRUE
  }
  return(flags)
}
# After filtering, we have to flip TIME to NA if ABSENCE is NA
HAVE %>%
arrange(ID, TIME) %>%
group_by(ID) %>%
filter(flag_wanted(ABSENCE)) %>%
mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME)) %>%
ungroup()
# A tibble: 6 x 3
ID ABSENCE TIME
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 0. 3.
3 3. 1. 3.
4 4. 0. 2.
5 5. 1. 2.
6 6. 0. 3.
The filter() step reduces the dataframe to the rows you need. Since it doesn't modify the TIME values, we need to mutate() as well.
Summarize Approach
# This function captures the general logic of getting the value of one variable
# based on the value of another
get_wanted <- function(of_this, by_this){
# If there are any positive values of `by_this`, use the first
if (any(by_this > 0, na.rm = TRUE)) {
return( of_this[ which.max(by_this > 0) ] )
}
# If there are any zero values of `by_this`, use the last
if (any(by_this == 0, na.rm = TRUE)) {
return( of_this[ max(which(by_this == 0)) ] )
}
# Otherwise, use NA
return(NA)
}
HAVE %>%
arrange(ID, TIME) %>%
group_by(ID) %>%
summarize(TIME = get_wanted(of_this = TIME, by_this = ABSENCE),
          ABSENCE = get_wanted(of_this = ABSENCE, by_this = ABSENCE))
# A tibble: 6 x 3
ID TIME ABSENCE
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 3. 0.
3 3. 3. 1.
4 4. 2. 0.
5 5. 2. 1.
6 6. 3. 0.
The order of summarization matters because we're overwriting variables, so this approach is risky. It only produces the output WANT if you summarize TIME and then ABSENCE.
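One hedged way to sidestep that ordering hazard is to summarize into temporary names and rename afterwards; a sketch:
## Sketch: write into new names so neither summary sees an overwritten column
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  summarize(TIME_out = get_wanted(of_this = TIME, by_this = ABSENCE),
            ABSENCE_out = get_wanted(of_this = ABSENCE, by_this = ABSENCE)) %>%
  rename(TIME = TIME_out, ABSENCE = ABSENCE_out)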
I have a data frame structured something like:
a <- c(1,1,1,2,2,2,3,3,3,3,4,4)
b <- c(1,2,3,1,2,3,1,2,3,4,1,2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a,b,c)
Where a and b uniquely identify an observation. I want to create a new variable, d, which indicates if each observation's value for b is present at least once in c as grouped by a. Such that d would be:
[1] 0 1 0 1 0 0 1 0 0 0 0 0
I can write a for loop which will do the trick,
attach(df)
for (i in unique(a)) {
for (j in b[a == i]) {
df$d[a == i & b == j] <- ifelse(j %in% c[a == i], 1, 0)
}
}
But surely in R there must be a cleaner/faster way of achieving the same result?
Using data.table:
library(data.table)
setDT(df) #convert df to a data.table without copying
# +() is code golf for as.integer
df[ , d := +(b %in% c), by = a]
# a b c d
# 1: 1 1 NA 0
# 2: 1 2 NA 1
# 3: 1 3 2 0
# 4: 2 1 NA 1
# 5: 2 2 1 0
# 6: 2 3 1 0
# 7: 3 1 NA 1
# 8: 3 2 NA 0
# 9: 3 3 1 0
# 10: 3 4 1 0
# 11: 4 1 NA 0
# 12: 4 2 NA 0
Adding the dplyr version for those of that persuasion. All credit due to @akrun.
library(dplyr)
df %>% group_by(a) %>% mutate(d = +(b %in% c))
And for posterity, a base R version as well (via @thelatemail below):
df <- df[order(df$a, df$b), ]
df$d <- unlist(by(df, df$a, FUN = function(x) (x$b %in% x$c) + 0L ))
The above answer by MichaelChirico apparently works well and is correct. I rarely use data.table, so I don't understand the syntax; this is a way to get the same results without data.table.
invisible(lapply(unique(df$a), function(x) {
df$d[df$a==x] <<- 0L + (df$b[df$a==x] %in% df$c[df$a==x])
}))
This code gets all of the unique levels of a and then modifies the data.frame for each level using the logic you request. The <<- is necessary because df would otherwise be modified only in the scope of the lapply() and not in .GlobalEnv; with <<-, R searches the enclosing environments for an existing df and assigns there.
Also, note the slightly different version of the + "trick": the leading 0 makes it clearer to the reader that the resulting vector is an integer, because the logical vector must be cast that way for the addition to work. The L after the 0 indicates that 0 is an integer and not a double. Note that the notation used by MichaelChirico for this casting gives the same result (a column of class integer).
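For reference, a minimal illustration of both casting idioms:
## Both idioms coerce a logical vector to integer
+c(TRUE, FALSE, TRUE)     # [1] 1 0 1
0L + c(TRUE, FALSE, TRUE) # [1] 1 0 1
class(+c(TRUE, FALSE))    # [1] "integer"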