Combining columns with replacement based on value - r

I am trying to combine 2 or more different columns, with the replacement of values. For example,
a b
1 0
1 1
<NA> 1
0 1
0 0
would become
c
1
1
1
1
0
Most functions I have found seem to merge on an index column, which is different from overwriting one column's values based on another's.
Is there a possible way to combine with replacement according to value?

We could do it with an ifelse statement:
library(dplyr)
df %>%
#mutate(c = ifelse(is.na(a) | a ==1, 1, b))
transmute(c = ifelse(is.na(a) | a ==1, 1, b))
c
1 1
2 1
3 1
4 1
5 0
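For reference, here is a minimal base R sketch of the same rule on the question's data. The data frame df is rebuilt here from the example table, and the pmax() variant is an assumption on my part (it treats the columns as 0/1 flags and takes the row-wise maximum), not something the answer above proposes.

```r
# Example data rebuilt from the question (a has one missing value)
df <- data.frame(a = c(1, 1, NA, 0, 0),
                 b = c(0, 1, 1, 1, 0))

# Same rule as the ifelse answer: use 1 when a is NA or 1, otherwise fall back to b
df$c <- ifelse(is.na(df$a) | df$a == 1, 1, df$b)

# Assumed alternative for 0/1 columns: row-wise maximum, ignoring NA
df$c_max <- pmax(df$a, df$b, na.rm = TRUE)

print(df)
```

The two versions agree on this data but not in general: for a row with a missing and b equal to 0, the ifelse rule returns 1 while pmax returns 0, so it is worth deciding which behaviour is wanted.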

Related

Custom mutate new column based on two other columns in R using dplyr

My aim is to create a new df column whose values are based on two other columns. My data set concerns recruitment into a study. I would like a column that defines whether or not a person was in a particular round of the study and, if so, whether it was their first involvement, their second, third and so on (up to 8 rounds). Currently I am attempting this with mutate(case_when()) in dplyr and using lag(). However, it works incorrectly if a person missed a round of the study and later came back into it. The data set looks like this:
person | round | in_round |
A 1 1
A 2 1
A 3 1
A 4 1
A 5 1
A 6 0
A 7 0
A 8 0
B 1 0
B 2 0
B 3 1
B 4 1
B 5 1
B 6 1
B 7 0
B 8 1
What I need is a separate column that uses round and in_round for each person to produce the following:
person | round | in_round | round_status
A 1 1 recruited
A 2 1 follow_up_1
A 3 1 follow_up_2
A 4 1 follow_up_3
A 5 1 follow_up_4
A 6 0 none
A 7 0 none
A 8 0 none
B 1 0 none
B 2 0 none
B 3 1 recruited
B 4 1 follow_up_1
B 5 1 follow_up_2
B 6 1 follow_up_3
B 7 0 none
B 8 1 follow_up_4
In summary:
where in_round == 0, round_status == "none"
the first time in_round == 1, round_status == "recruited"
subsequent times in_round == 1, round_status == "follow_up_X" (dependent on the number of previous waves the individual was present in).
Try this:
library(dplyr)
df %>%
  group_by(person) %>%
  arrange(round) %>%
  # cumsum() runs within each person, so it counts the rounds attended so far
  mutate(cum_round = cumsum(in_round),
         round_status = case_when(
           in_round == 0 ~ "none",
           cum_round == 1 ~ "recruited",
           TRUE ~ paste0("follow_up_", cum_round - 1)
         ))
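The same idea can be sketched without dplyr, which may help to check the logic on the question's data. This is only an illustration: the data frame is rebuilt from the example table, and ave() stands in for the grouped cumsum().

```r
# Example data rebuilt from the question
df <- data.frame(
  person   = rep(c("A", "B"), each = 8),
  round    = rep(1:8, times = 2),
  in_round = c(1, 1, 1, 1, 1, 0, 0, 0,
               0, 0, 1, 1, 1, 1, 0, 1)
)

# Make sure rounds are in order within each person
df <- df[order(df$person, df$round), ]

# Running count of attended rounds, computed separately for each person
df$cum_round <- ave(df$in_round, df$person, FUN = cumsum)

# Absent rounds are "none"; the first attended round is "recruited";
# later attended rounds are numbered by how many rounds were attended before
df$round_status <- ifelse(
  df$in_round == 0, "none",
  ifelse(df$cum_round == 1, "recruited",
         paste0("follow_up_", df$cum_round - 1))
)

print(df)
```

Because the count only increases in rounds where in_round is 1, person B's return in round 8 after missing round 7 is labelled follow_up_4, which is the case that lag() handled incorrectly.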

How to automate recoding of many variables using mutate_at and nested ifelse statement?

There is a large data set consisting of repeated measures of the same variable on each subject. An example data set is shown below:
df<-data.frame(
"id"=c(1:5),
"ax1"=c(1,6,8,15,17),
"bx1"=c(2,16,8,15,17))
where "x1" is measured repeatedly so we can have "ax1", "bx1", "cx1" and so on. I am trying to recode these variables. The plan is to recode 1 and any number on the range from 3 to 12 (inclusively) as 0 and recode 2 or any value greater than or equal to 13 as 1. Because it involves many variables I am making use of "mutate_at" to automate the recoding. Also, the numbers to take on the same code are not consecutive (e.g. 1 and 3-12 to be recoded as 0) so I used a nested "ifelse" statement. I tried the following
df1<-df %>%
mutate_at(vars(ends_with("x1")),factor,
ifelse(x1>=3 & x1 <=12,0,ifelse(x1==1, 0,
ifelse(x1==2, 1,0))))
However, this fails to work because R cannot recognize "x1". Any help on this is greatly appreciated in advance. The expected output would look like
> df1
id ax1 bx1
1 1 0 1
2 2 0 1
3 3 0 0
4 4 1 1
5 5 1 1
Using ifelse, we can proceed as follows:
df %>%
mutate_at(vars(ends_with("x1")),~ifelse(. ==1 | . %in% 3:12,0,
ifelse(. ==2 | .>=13,1,.)))
id ax1 bx1
1 1 0 1
2 2 0 1
3 3 0 0
4 4 1 1
5 5 1 1
We can use case_when
library(dplyr)
df %>%
mutate_at(vars(ends_with("x1")), ~case_when((. >= 3 & . <= 12) | . == 1 ~ 0,
. >= 13 | . == 2 ~ 1))
# id ax1 bx1
#1 1 0 1
#2 2 0 1
#3 3 0 0
#4 4 1 1
#5 5 1 1
Here is another solution similar to what you were attempting. I just added the "or" operator (|) to make a simpler ifelse and removed the factor part from your code.
library(dplyr)
df1<-df %>%
mutate_at(vars(ends_with("x1")), function(x)
ifelse(x >= 3 & x <= 12 | x == 1,0,
ifelse(x >= 13 | x == 2, 1,0)))
# id ax1 bx1
#1 1 0 1
#2 2 0 1
#3 3 0 0
#4 4 1 1
#5 5 1 1
If there are no other possible conditions apart from the ones you mention (for example, having zeros), I think you could simplify it more by just reducing it to the following:
df1<-df %>%
mutate_at(vars(ends_with("x1")), function(x)
ifelse(x >= 3 & x <= 12 | x == 1, 0, 1))
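For comparison, the recoding can also be sketched in base R by applying one helper to every column whose name ends in "x1". The helper name recode_x1 is hypothetical, and returning NA for values outside the stated ranges is my assumption rather than part of the question.

```r
df <- data.frame(
  id  = 1:5,
  ax1 = c(1, 6, 8, 15, 17),
  bx1 = c(2, 16, 8, 15, 17)
)

# Hypothetical helper: 1 and 3..12 become 0; 2 and anything >= 13 become 1.
# Values covered by neither rule (for example 0) become NA (an assumption).
recode_x1 <- function(x) {
  ifelse(x == 1 | (x >= 3 & x <= 12), 0,
         ifelse(x == 2 | x >= 13, 1, NA))
}

# Pick every column whose name ends in "x1" and recode it in place
cols <- grep("x1$", names(df), value = TRUE)
df1 <- df
df1[cols] <- lapply(df1[cols], recode_x1)

print(df1)
```

In dplyr 1.0.0 and later, mutate_at() is superseded by across(), so the same helper could presumably be passed as mutate(across(ends_with("x1"), recode_x1)).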

R: cumulative sum with conditions [duplicate]

I have a vector of numbers in a data.frame such as below.
df <- data.frame(a = c(1,2,3,4,2,3,4,5,8,9,10,1,2,1))
I need to create a new column which gives a running count of entries that are greater than their predecessor. The resulting column vector should be this:
0,1,2,3,0,1,2,3,4,5,6,0,1,0
My attempt is to create a "flag" column of diffs to mark when the values are greater.
df$flag <- c(0,diff(df$a)>0)
> df$flag
0 1 1 1 0 1 1 1 1 1 1 0 1 0
Then I can apply some dplyr group/sum magic to almost get the right answer, except that the sum doesn't reset when flag == 0:
df %>% group_by(flag) %>% mutate(run=cumsum(flag))
a flag run
1 1 0 0
2 2 1 1
3 3 1 2
4 4 1 3
5 2 0 0
6 3 1 4
7 4 1 5
8 5 1 6
9 8 1 7
10 9 1 8
11 10 1 9
12 1 0 0
13 2 1 10
14 1 0 0
I don't want to have to resort to a for() loop because I have several of these running sums to compute with several hundred thousand rows in a data.frame.
Here's one way with ave:
ave(df$a, cumsum(c(F, diff(df$a) < 0)), FUN=seq_along) - 1
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
We can get a running count grouped by diff(df$a) < 0, which flags the positions in the vector that are less than their predecessors. We prepend F with c(F, ..) to account for the first position, which has no predecessor. The cumulative sum of that vector creates an index for grouping. The function ave applies a function within each group of that index; we use seq_along for a running count. Since seq_along starts at 1, we subtract one (ave(...) - 1) to start from zero.
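To make those steps concrete, here is the same computation split into its intermediate vectors. The variable names drops, grp and run are mine, chosen only for illustration.

```r
a <- c(1, 2, 3, 4, 2, 3, 4, 5, 8, 9, 10, 1, 2, 1)

# TRUE wherever a value is smaller than its predecessor; the first value
# has no predecessor, so FALSE is prepended for it
drops <- c(FALSE, diff(a) < 0)

# Each TRUE starts a new group, so the cumulative sum is a group index
grp <- cumsum(drops)

# Count positions within each group, then shift so every group starts at 0
run <- ave(a, grp, FUN = seq_along) - 1

print(grp)
print(run)
```

Note that a repeated value (a difference of exactly zero) does not start a new group here, so the count keeps increasing across ties. If ties should reset the count, the condition would need to be diff(a) <= 0 instead.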
A similar approach using dplyr:
library(dplyr)
df %>%
group_by(cumsum(c(FALSE, diff(a) < 0))) %>%
mutate(row_number() - 1)
You don't need dplyr:
fun <- function(x) {
test <- diff(x) > 0
y <- cumsum(test)
c(0, y - cummax(y * !test))
}
fun(df$a)
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
a <- c(1,2,3,4,2,3,4,5,8,9,10,1,2,1)
f <- c(0, diff(a)>0)
ifelse(f, cumsum(f), f)
That is without the reset.
With the reset:
unlist(tapply(f, cumsum(c(0, diff(a) < 0)), cumsum))

How to compare the count of unique values

I need to check whether the number of elements of each unique value in the variable PPT in A is equal to the number of elements of each unique value in PPT in B, and whether there is any value unique only to A or only to B.
For example:
PPTa <- c("ppt0100109","ppt0301104","ppt0100109","ppt0100109","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0504409","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNa <- c(110,54,110,110,49,10,49,110,409,40,10,10,110)
LLa <- c(150,55,150,150,45,15,45,115,405,45,5,15,50)
A <-data.frame(PPTa,CNa,LLa)
PPTb <- c("ppt0100200","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0764091","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNb <- c(110,54,110,110,49,10,49,110,409,40)
LLb <- c(150,55,150,150,45,15,45,115,405,45)
B <-data.frame(PPTb,CNb,LLb)
In this case, we have these unique values which occur a certain amount of times:
A$PPTa TIMES
"ppt0100109" 6
"ppt0301104" 1
"ppt0300249" 2
"ppt0504409" 1
"ppt2303401" 1
"ppt0704210" 2
B$PPTb TIMES
"ppt0100200" 1
"ppt0300249" 2
"ppt0100109" 3
"ppt0764091" 1
"ppt2303401" 1
"ppt0704210" 2
I would like to create a new matrix (or anything you could suggest) with a value of 0 if the unique value exists both in A and B with the same number of elements, a value of 1 if it exists in both dataframes A and B but the number of elements differ, and a value of 2 if the value exists only in one of the two dataframes.
Something like:
A$PPTa TIMES OUTPUT
"ppt0100109" 6 1
"ppt0301104" 1 2
"ppt0300249" 2 0
"ppt0504409" 1 2
"ppt2303401" 1 0
"ppt0704210" 2 0
B$PPTb TIMES OUTPUT
"ppt0100200" 1 2
"ppt0300249" 2 0
"ppt0100109" 3 1
"ppt0764091" 1 2
"ppt2303401" 1 0
"ppt0704210" 2 0
You can use a nested ifelse statement,
ifelse(do.call(paste0, A) %in% do.call(paste0, B), 0, ifelse(A$PPTa %in% B$PPTb, 1, 2))
#[1] 1 0 2 2 0 0
ifelse(do.call(paste0, B) %in% do.call(paste0, A), 0, ifelse(B$PPTb %in% A$PPTa, 1, 2))
#[1] 1 2 0 0 2 0
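The six-element results above only make sense if A and B hold one row per unique value together with its count, rather than the raw data frames from the question. Under that assumption, a self-contained sketch that starts from the raw vectors could look like this. The helper classify_counts and the names n_a, n_b, out_a and out_b are hypothetical.

```r
PPTa <- c("ppt0100109", "ppt0301104", "ppt0100109", "ppt0100109", "ppt0300249",
          "ppt0100109", "ppt0300249", "ppt0100109", "ppt0504409", "ppt2303401",
          "ppt0704210", "ppt0704210", "ppt0100109")
PPTb <- c("ppt0100200", "ppt0300249", "ppt0100109", "ppt0300249", "ppt0100109",
          "ppt0764091", "ppt2303401", "ppt0704210", "ppt0704210", "ppt0100109")

# Named integer vectors of counts, one entry per unique value (sorted by name)
n_a <- lengths(split(PPTa, PPTa))
n_b <- lengths(split(PPTb, PPTb))

# Hypothetical helper: 0 = in both with equal counts,
# 1 = in both with different counts, 2 = only in `own`
classify_counts <- function(own, other) {
  idx <- match(names(own), names(other))  # NA where the value is absent from `other`
  other_n <- unname(other)[idx]
  code <- ifelse(is.na(idx), 2,
                 ifelse(unname(own) == other_n, 0, 1))
  data.frame(PPT = names(own), TIMES = unname(own), OUTPUT = code)
}

out_a <- classify_counts(n_a, n_b)
out_b <- classify_counts(n_b, n_a)

print(out_a)
print(out_b)
```

The rows come out sorted by value rather than in order of first appearance, which is why the codes appear in a different order from the tables in the question.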

finding if boolean is ever true by groups in R

I want a simple way to create a new variable determining whether a boolean is ever true in R data frame.
Here is an example:
Suppose in the dataset I have 2 variables (among other variables which are not relevant) 'a' and 'b' and 'a' determines a group, while 'b' is a boolean with values TRUE (1) or FALSE (0). I want to create a variable 'c', which is also a boolean being 1 for all entries in groups where 'b' is at least once 'TRUE', and 0 for all entries in groups in which 'b' is never TRUE.
From entries like below:
a b
-----
1 1
2 0
1 0
1 0
1 1
2 0
2 0
3 0
3 1
3 0
I want to get variable 'c' like below:
a b c
-----------
1 1 1
2 0 0
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
3 0 1
3 1 1
3 0 1
-----------
I know how to do it in Stata, but I haven't done similar things in R yet, and it is difficult to find information on that on the internet.
In fact I am doing that only in order to later remove all the observations for which 'c' is 0, so any other suggestions would be fine as well. The application of that relates to multinomial logit estimation, where the alternatives that are never-chosen need to be removed from the dataset before estimation.
If X is your data frame:
library(dplyr)
X <- X %>%
group_by(a) %>%
mutate(c = any(b == 1))
A base R option would be
df1$c <- with(df1, ave(b, a, FUN=any))
Or
library(sqldf)
sqldf('select * from df1
left join(select a, b,
(sum(b))>0 as c
from df1
group by a)
using(a)')
Simple data.table approach
require(data.table)
data <- data.table(data)
data[, c := any(b), by = a]
Even though logical and numeric (0-1) columns behave identically for all intents and purposes, if you'd like a numeric result you can simply wrap the call to any with as.numeric.
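An answer with base R, assuming a and b are in data frame x.
The value of c is a 1-to-1 mapping with a, so I create that mapping first:
cmap <- ifelse(sapply(split(x, x$a), function(x) sum(x[, "b"])) > 0, 1, 0)
Then just add the mapped value to the data frame, looking it up by group name rather than by position so that it also works when the values of a are not 1, 2, 3, ...
x$c <- unname(cmap[as.character(x$a)])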
Final output
> x
a b c
1 1 1 1
2 2 0 0
3 1 0 1
4 1 0 1
5 1 1 1
6 2 0 0
7 2 0 0
8 3 0 1
9 3 1 1
10 3 0 1
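Since the stated goal is to drop the groups in which b is never TRUE, here is a short base R sketch that builds the flag and then filters on it. The data frame is rebuilt from the question's example, and the name kept is mine.

```r
# Example data rebuilt from the question
df1 <- data.frame(
  a = c(1, 2, 1, 1, 1, 2, 2, 3, 3, 3),
  b = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0)
)

# 1 for every row of a group in which b equals 1 at least once, else 0
df1$c <- ave(df1$b, df1$a, FUN = function(v) as.numeric(any(v == 1)))

# Keep only the groups that were chosen at least once
kept <- df1[df1$c == 1, ]

print(df1)
print(kept)
```

On this data group 2 is removed entirely, leaving the seven rows that belong to groups 1 and 3.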
