Create a dummy column based on other columns - R

Let's say I have this dataset
> example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
I want to create a new variable d that has the value 1 when at least one of the variables a, b or c contains the value 1, 2 or 3.
d should look like this:
d <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 1)
Thanks in advance.

You can use rowSums to get a logical vector indicating whether 1, 2 or 3 appears in each row, and wrap it in as.integer to convert to 0 and 1, i.e.
as.integer(rowSums(example == 1 | example == 2 | example == 3) > 0)
#[1] 1 1 1 0 0 1 1 1 1 1

This will work for any number of variables:
example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
x <- c(1, 2, 3)
as.integer(Reduce(function(a, b) (a %in% x) | (b %in% x), example))
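One thing worth noting about this Reduce() version: after the first step the accumulator is already a logical vector, yet the result stays correct because logicals coerce to numbers inside %in%. A quick check:
TRUE %in% c(1, 2, 3)   # TRUE  (TRUE is coerced to 1, which is in x)
FALSE %in% c(1, 2, 3)  # FALSE (FALSE is coerced to 0, which is not)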

With the dplyr package:
library(dplyr)
x <- 1:3
example %>% mutate(d = as.integer(a %in% x | b %in% x | c %in% x))
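If you are on a more recent dplyr (1.0.4 or later, where if_any() is available), a sketch that scales to any number of columns could be:
library(dplyr)
x <- 1:3
# TRUE if any column's value is in x; the unary + converts to 0/1
example %>% mutate(d = +if_any(everything(), ~ .x %in% x))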

Two other possibilities which work with any number of columns:
#option 1
example$d <- +(rowSums(sapply(example, `%in%`, 1:3)) > 0)
#option 2
library(matrixStats)
example$d <- rowMaxs(+(sapply(example, `%in%`, 1:3)))
which both give:
> example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1
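A related base R sketch combining the two ideas, assuming a fresh example without a d column yet: apply the %in% test per column with lapply, then OR-reduce the results:
example$d <- +Reduce(`|`, lapply(example, `%in%`, 1:3))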

You can do this using apply (although it's a little slow).
Logic: any checks whether any of the values 1, 2 or 3 is present in a row, and apply is used to iterate this logic over the rows. Finally, the boolean outcome is converted to numeric by adding + 0 (you may choose as.numeric here in case you want to be more expressive):
d <- apply(example, 1, function(x) any(x == 1 | x == 2 | x == 3)) + 0
In case you want to restrict the logic to certain columns, you can do this instead:
d <- apply(example[, c("a", "b", "c")], 1, function(x) any(x == 1 | x == 2 | x == 3)) + 0
Here you have control over which columns to include or ignore, based on your needs.
Output:
> d
[1] 1 1 1 0 0 1 1 1 1 1

A general solution (x as defined above):
example %>%
  sapply(function(i) i %in% x) %>%
  apply(1, any) %>%
  as.integer
#[1] 1 1 1 0 0 1 1 1 1 1

Try this method: it verifies, for each row, whether any column contains at least one element present in x.
x <- c(1, 2, 3)
example$d <- as.numeric(example$a %in% x | example$b %in% x | example$c %in% x)
example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1

Related

nested for loop in R, where the second index counts inside the first one

I have for example a datset like this:
data <- data.frame(matrix(c(1,2,2,3,4,5,5,"a","a","b","a","a","a","b"), nrow = 7, ncol = 2, byrow = F))
X1 X2
1 a
2 a
2 b
3 a
4 a
5 a
5 b
then I add another variable "tag", initially set to 0.
data$tag <- 0
X1 X2 tag
1 a 0
2 a 0
2 b 0
3 a 0
4 a 0
5 a 0
5 b 0
I'd like to have "tag" equal to 1 for each row that is repeated, like:
X1 X2 tag
1 a 0
2 a 1
2 b 1
3 a 0
4 a 0
5 a 1
5 b 1
I used the followed code:
for (i in data$X1) {
  for (j in 1:length(data$X1)) {
    if (j == 2) {data$tag[j] <- 1}
  }
}
but it doesn't work the way I would like. I'd like the inner loop (j) to run inside the outer one in order to obtain what I want, with j starting from 1 every time X1 changes.
How can I manage it?
Thanks a lot
Maybe you can try ave
within(
  data,
  tag <- +(ave(X1, X1, FUN = length) > 1)
)
which gives
X1 X2 tag
1 1 a 0
2 2 a 1
3 2 b 1
4 3 a 0
5 4 a 0
6 5 a 1
7 5 b 1
You can use duplicated from both the ends in base R :
data$tag <- as.integer(duplicated(data$X1) |
duplicated(data$X1, fromLast = TRUE))
data
# X1 X2 tag
#1 1 a 0
#2 2 a 1
#3 2 b 1
#4 3 a 0
#5 4 a 0
#6 5 a 1
#7 5 b 1
An option with add_count
library(dplyr)
data %>%
  add_count(X1) %>%
  mutate(n = +(n > 1))
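Note that add_count() names the new column n; if you want it called tag as in the question, a small variation (sketch) is:
data %>%
  add_count(X1) %>%                 # adds a column n with each group's size
  mutate(tag = as.integer(n > 1)) %>%
  select(-n)                        # drop the helper count column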

Identifying Duplicates in `data.frame` Using `dplyr`

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.
Example:
| A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, row 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in row 1,2,3 and 4 since row 3 and 4 are also identical.
Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing only by one column. That is, F in row 1,2 and 5 would be equal to 1 since they only differ in the B column.
I hope it is clear what I want to do and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R but I believe (hope) that there exists a smoother solution.
You can use dist() to compute the differences, and then a search in the resulting distance object can give the needed answers (E, F, etc.). Here is an example, where X is the original data.frame:
W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 1))
Just change D = to the number of differing columns needed.
It's all base R though. Using plyr::laply instead of sapply has the same effect; dplyr looks like overkill here.
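Since the question asked for dplyr: for the E column alone (exact duplicate rows), a short sketch with dplyr >= 1.0 might be:
library(dplyr)
# group by every column; rows in a group of size > 1 are exact duplicates
X %>%
  group_by(across(everything())) %>%
  mutate(E = as.integer(n() > 1)) %>%
  ungroup()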
Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same) - not sure if someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column - in your example all rows would get a 1 because rows 3 and 4 are one column different from row 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))
DT[, ind := .I]
DT_melt <- melt(DT, id.vars = "ind", variable.name = "column")
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.table that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 common columns (duplicates). Rows 3 and 5 have 4 common columns, and 4 and 5 have 4 common columns. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
Here is a dplyr solution (note that it relies on duplicate rows being adjacent, since lag() and lead() compare neighbouring rows):
test %>%
  mutate(flag = (A == lag(A) &
                 B == lag(B) &
                 C == lag(C) &
                 D == lag(D))) %>%
  mutate(twice = lead(flag) == T) %>%
  mutate(E = ifelse(flag == T | twice == T, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, 1)) %>%
  mutate(FF = ifelse(((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D))) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1

How to do group matching in R?

Suppose I have the data.frame below where treat == 1 means that the id received treatment and prob is the calculated probability that treat == 1.
set.seed(1)
df <- data.frame(id = 1:10, treat = sample(0:1, 10, replace = T))
df$prob <- ifelse(df$treat, rnorm(10, .8, .1), rnorm(10, .4, .4))
df
id treat prob
1 1 0 0.3820266
2 2 0 0.3935239
3 3 1 0.8738325
4 4 1 0.8575781
5 5 0 0.6375605
6 6 1 0.9511781
7 7 1 0.8389843
8 8 1 0.7378759
9 9 1 0.5785300
10 10 0 0.6479303
To minimize selection bias, I now wish to create pseudo treatment and control groups on the basis of the values of treat and prob:
When any id with treat == 1 is within 0.1 prob of any id with treat == 0, I want the value of group to be "treated".
When any id with treat == 0 is within 0.1 prob of any id with treat == 1, I want the value of group to be "control".
Below is an example of what I'd like the result to be.
df$group <- c(NA, NA, NA, NA, 'control', NA, NA, 'treated', 'treated', 'control')
df
id treat prob group
1 1 0 0.3820266 <NA>
2 2 0 0.3935239 <NA>
3 3 1 0.8738325 <NA>
4 4 1 0.8575781 <NA>
5 5 0 0.6375605 control
6 6 1 0.9511781 <NA>
7 7 1 0.8389843 <NA>
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control
How would I go about doing this? In the example above, matching is done with replacements, but a solution without replacements would be welcome, too.
You can try
foo <- function(x){
  TR <- range(x$prob[x$treat == 0])  # range of the control group's prob (despite the name)
  CT <- range(x$prob[x$treat == 1])  # range of the treated group's prob
  tmp <- sapply(1:nrow(x), function(y, z){
    if(z$treat[y] == 1){
      ifelse(any(abs(z$prob[y] - TR) <= 0.1), "treated", "NA")
    }else{
      ifelse(any(abs(z$prob[y] - CT) <= 0.1), "control", "NA")
    }}, x)
  cbind(x, group = tmp)
}
foo(df)
id treat prob group
1 1 0 0.3820266 NA
2 2 0 0.3935239 NA
3 3 1 0.8738325 NA
4 4 1 0.8575781 NA
5 5 0 0.6375605 control
6 6 1 0.9511781 NA
7 7 1 0.8389843 NA
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control
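A vectorized base R alternative (a sketch, assuming the "within 0.1 of any id in the opposite group" reading of the question): outer() builds the full matrix of absolute prob differences between the two groups, unlike foo() above, which only checks against the range endpoints of the opposite group.
tp <- df$prob[df$treat == 1]
cp <- df$prob[df$treat == 0]
near <- abs(outer(tp, cp, "-")) <= 0.1  # rows: treated ids, cols: control ids
df$group <- NA
df$group[df$treat == 1][rowSums(near) > 0] <- "treated"
df$group[df$treat == 0][colSums(near) > 0] <- "control"
This reproduces the expected group column for the example data.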
I think this problem is well suited for cut in base R. Here is how you can do it in a vectorized way:
f <- function(r) {
  x <- cut(df[r,]$prob, breaks = c(df[!r,]$prob - 0.1, df[!r,]$prob + 0.1))
  df[r,][!is.na(x),]$id
}
ones <- df$treat==1
df$group <- NA
df[df$id %in% f(ones),]$group <- "treated"
df[df$id %in% f(!ones),]$group <- "control"
> df
# id treat prob group
# 1 1 0 0.3820266 <NA>
# 2 2 0 0.3935239 <NA>
# 3 3 1 0.8738325 <NA>
# 4 4 1 0.8575781 <NA>
# 5 5 0 0.6375605 control
# 6 6 1 0.9511781 <NA>
# 7 7 1 0.8389843 <NA>
# 8 8 1 0.7378759 treated
# 9 9 1 0.5785300 treated
# 10 10 0 0.6479303 control
Perhaps not the most elegant, but it seems to work for me:
df %>%
  group_by(id, treat) %>%
  mutate(group2 = ifelse(treat == 1,
                         ifelse(any(abs(prob - df[df$treat == 0, 3]) < 0.1), "treated", "NA"),   # treat == 1
                         ifelse(any(abs(prob - df[df$treat == 1, 3]) < 0.1), "control", "NA")))  # treat == 0
Is this what you want?
#Base R:
apply(df[df$treat == 1, ], 1, function(x){
  p <- as.numeric(x['prob'])  # apply() converts each row to character, so convert back
  ifelse(any(df[df$treat == 0, 'prob'] - .1 < p & p < df[df$treat == 0, 'prob'] + .1), 'treated', NA)
})
You can invert the $treat clause to reflect the control group and attach the variables to your df.
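A sketch of that inversion, assigning both groups back into df:
df$group <- NA
# treated rows: within 0.1 of any control prob
df$group[df$treat == 1] <- apply(df[df$treat == 1, ], 1, function(x){
  p <- as.numeric(x['prob'])
  ifelse(any(abs(df$prob[df$treat == 0] - p) < 0.1), 'treated', NA)
})
# control rows: within 0.1 of any treated prob
df$group[df$treat == 0] <- apply(df[df$treat == 0, ], 1, function(x){
  p <- as.numeric(x['prob'])
  ifelse(any(abs(df$prob[df$treat == 1] - p) < 0.1), 'control', NA)
})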

Find consecutive values in dataframe

I have a dataframe. I wish to detect consecutive numbers and populate a new column as 1 or 0.
ID Val
1 a 8
2 a 7
3 a 5
4 a 4
5 a 3
6 a 1
Expected output
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
You could do this with the diff function in combination with abs and see whether the outcome is 1 or another value:
d$outP <- c(0, abs(diff(d$Val)) == 1)
which gives:
> d
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
If you only want to take decreasing consecutive values into account, you can use:
c(0, diff(d$Val) == -1)
When you want to do this for each ID, you can also do this in base R or with dplyr:
# base R
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
# dplyr
library(dplyr)
d %>%
  group_by(ID) %>%
  mutate(outP = c(0, abs(diff(Val)) == 1))
We can also use a faster option by comparing the previous value with the current one:
with(df1, as.integer(c(FALSE, Val[-length(Val)] - Val[-1]) == 1))
#[1] 0 1 0 1 1 0
If we need to group by "ID", one option is data.table:
library(data.table)
setDT(df1)[, outP := as.integer((shift(Val, fill = Val[1]) - Val) == 1), by = ID]
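For completeness, a dplyr equivalent of that grouped shift logic (a sketch, using the same df1):
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(outP = as.integer(lag(Val, default = first(Val)) - Val == 1)) %>%
  ungroup()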

Replacing Values in DataFrame

So a quick question jumping off of this one....
Fast replacing values in dataframe in R
If I want to do this replacement but only for certain rows of my data frame, is there a way to add a row specification to:
df[df < 0] = 0
Something like applying this to rows 40-52 (doesn't work):
df[df[40:52,] < 0] = 0
Any suggestions? Much appreciated.
Or simply:
df[40:52,][df[40:52,] < 0] <- 0
Here is a test:
test = data.frame(A = c(1,2,-1), B = c(4,-8,5), C = c(1,2,3), D = c(7,8,-9))
#> test
# A B C D
#1 1 4 1 7
#2 2 -8 2 8
#3 -1 5 3 -9
To replace the negative values with 0 for only rows 2 and 3, you can do:
test[2:3,][test[2:3,] < 0] <- 0
and you get
#> test
# A B C D
#1 1 4 1 7
#2 2 0 2 8
#3 0 5 3 0
This is another way, utilizing R's recycling behavior.
df[df < 0 & 1:nrow(df) %in% 40:52] <- 0
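To see why the recycling works, here is the idea applied to the 3-row test data from above, restricting the replacement to rows 2:3 (a sketch):
test <- data.frame(A = c(1, 2, -1), B = c(4, -8, 5),
                   C = c(1, 2, 3),  D = c(7, 8, -9))
# 'test < 0' is a 3x4 logical matrix; the length-3 selector
# '1:nrow(test) %in% 2:3' is recycled down each of the 4 columns,
# so the condition is TRUE only for negative cells in rows 2 and 3.
test[test < 0 & 1:nrow(test) %in% 2:3] <- 0
test
#   A B C D
# 1 1 4 1 7
# 2 2 0 2 8
# 3 0 5 3 0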
