Conditionally creating multiple variables - r

I am trying to create multiple variables based on a condition using R dplyr. Currently I have to write the same condition three times to get this to work, but I am guessing there is a more efficient way: write the condition once and use it to create all of the variables. The reason is that if the condition changes, it will be easier to update it in one place instead of three. Please help.
Example:
Current solution:
library(dplyr)
x <- c(12, 8, 3)
df <- data.frame(x)
y <- df %>% mutate(
  a = ifelse(x > 10, 1,
      ifelse(x > 5, 11, 0)),
  b = ifelse(x > 10, 2,
      ifelse(x > 5, 12, 0)),
  c = ifelse(x > 10, 3,
      ifelse(x > 5, 13, 0))
)
Looking for something like this:
if x>10 then
{a=1 b=2 c=3}
else if x>5 then
{a=11 b=12 c=13}
else
{a=0 b=0 c=0}

Define a function and use it three times:
cond <- function(x, x1, x2) {
  case_when(
    x > 10 ~ x1,
    x > 5 ~ x2,
    TRUE ~ 0
  )
}
df %>% mutate(a = cond(x, 1, 11), b = cond(x, 2, 12), c = cond(x, 3, 13))
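Another option, if the goal is to write the condition exactly once, is to turn it into a label and then join the values from a small lookup table. This is only a sketch: the label column grp, its values "high"/"mid"/"low" and the lookup table itself are names made up for illustration and are not part of the question.

```r
library(dplyr)

df <- data.frame(x = c(12, 8, 3))

# Hypothetical lookup table: one row per branch of the condition
lookup <- data.frame(
  grp = c("high", "mid", "low"),
  a   = c(1, 11, 0),
  b   = c(2, 12, 0),
  c   = c(3, 13, 0)
)

y <- df %>%
  # The condition is written once, here
  mutate(grp = case_when(x > 10 ~ "high", x > 5 ~ "mid", TRUE ~ "low")) %>%
  # Pull a, b and c for the matching branch
  left_join(lookup, by = "grp") %>%
  select(-grp)
```

With this layout, changing a threshold means editing the case_when() call only, and changing an output value means editing one cell of the lookup table.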

Related

Changing values in multiple column given one condition (preferably in dplyr)

I'm looking for an easy way to change several values for the same person. Preferably with dplyr or any other package from tidyverse.
Here is my example:
df <- data.frame(personid = 1:3, class = c("class1", "class3", "class3"), classlevel = c(1, 11, 3), education = c("BA", "Msc", "BA"))
df
My dataset contains an entry with several mistakes. Person #2 should be part of class 1, at classlevel 1, and his education is BA, not MSc. I use mutate with case_when a lot, but in this case I don't want to change one variable based on multiple conditions; I have one condition and want to change multiple values in other variables based on it.
Basically, I'm looking for shorter code that replaces this:
df$class[df$personid == 2] <- "class1"
df$classlevel[df$personid == 2] <- 1
df$education[df$personid == 2] <- "BA"
df
or this:
library(tidyverse)
df <- df |>
mutate(class = case_when(personid == 2 ~ "class1", TRUE ~ class)) |>
mutate(classlevel = case_when(personid == 2 ~ 1, TRUE ~ as.numeric(classlevel))) |>
mutate(education = case_when(personid == 2 ~ "BA", TRUE ~ education))
df
In my original data, there are several dozen cases like this, and I find it a bit tedious to use three lines of code for each person. Is there a shorter way?
Thanks for your input!
One way would be to create a data frame of the values to be updated and use rows_update(). Note that this assumes the rows are uniquely identified.
library(dplyr)
df_update <- tribble(
  ~personid, ~class, ~classlevel, ~education,
  2, "class1", 1, "BA"
)
df %>%
rows_update(df_update, by = "personid")
personid class classlevel education
1 1 class1 1 BA
2 2 class1 1 BA
3 3 class3 3 BA
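Since the original data has several dozen such cases, the same idea should extend to many people at once: put one row per person into the update table. A minimal sketch, assuming a second correction for person 3 that is invented here purely for illustration (the table name fixes is also made up):

```r
library(dplyr)

df <- data.frame(
  personid   = 1:3,
  class      = c("class1", "class3", "class3"),
  classlevel = c(1, 11, 3),
  education  = c("BA", "Msc", "BA")
)

# One row per person to correct; the person 3 values are hypothetical
fixes <- data.frame(
  personid   = c(2L, 3L),
  class      = c("class1", "class2"),
  classlevel = c(1, 2),
  education  = c("BA", "BA")
)

# Rows are matched on personid; every other column in fixes overwrites df
fixed <- rows_update(df, fixes, by = "personid")
```

By default rows_update() complains if a key in the update table has no match in df, which is a useful safety check against typos in the ids.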
I think I need a little more information to answer your question, but I'll try anyway.
If you want to change the value of some columns based on a single condition applied across all rows, I recommend doing this (I created new columns named col_name1 so you can see the original and the output):
df <- df %>% mutate(class1 = case_when(class != "class1" ~ "class1", TRUE ~ class),
classlevel1 = case_when( classlevel != 1 ~ 1, TRUE ~ as.numeric(classlevel)),
education1 = case_when( education != "BA" ~ "BA", TRUE ~ education))
If that was your problem, then you are probably not familiar with the concept of vectorization. Briefly, a vectorized function runs over all the rows or elements in your vector without you needing to specify that. There are a lot of examples and tutorials on the web if you search "vectorization in R" or something similar.
Otherwise, if your condition changes for each single id (or row) in your data, then the problem is more complicated.
Let me know if that helps and, if it doesn't, consider providing more information in your question.

Clustering rows by ID based on a column value condition multiple times

Some time ago I opened a related question in this post
Suppose I have the following df:
data <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,1, 1, 1,1,1,1,1,1,1,1,1,1),
Obs1 = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1,1,1,0,1),
Control = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48,24,20,21,10,10),
ClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,6))
And I want to obtain:
data <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,1, 1, 1,1,1,1,1,1,1,1,1,1),
Obs1 = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1,1,1,0,1),
Control = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48,24,20,21,10,10),
ClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,6),
DesiredResultClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,6,6,6,7,8,9,10,10,11))
The conditions are:
If value of 'Control' is higher than 12 and actual 'Obs1' value is equal to 1 and to previous 'Obs1' value, 'DesiredResultClusterObs1' value should add +1 (the main difference with the other question is that consecutive control values above 12 must be considered)
Any idea how I can achieve the desired result?
I don't know much about how to use the with() and rle() functions, but I've got a solution to the problem using ifelse.
data <- data %>% mutate (aux = ifelse (Control>12 & Obs1 == 1 & lag(Obs1) ==1,1,0),
DesiredResultClusterObs1 = ClusterObs1 + cumsum(aux))
The aux variable is not necessary; it just helps to see the result step by step. You can also do the following:
data <- data %>% mutate (DesiredResultClusterObs1 =
ClusterObs1 +
cumsum(ifelse (Control>12 & Obs1 == 1 & lag(Obs1) ==1,1,0)))
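One thing to watch with this approach: lag() returns NA for the first row, and a single NA in the flag turns every later cumsum() value into NA. It happens not to matter in the data above because the first Control value is 0, which makes the whole condition FALSE. A small sketch with made-up data (the column names Cluster, bump and NewCluster are assumptions for illustration), using the default argument of dplyr's lag() to avoid the problem:

```r
library(dplyr)

# Made-up data where the first row already has Control > 12
toy <- data.frame(
  Obs1    = c(1, 1, 1, 0, 1),
  Control = c(20, 20, 5, 30, 13),
  Cluster = c(1, 1, 1, 2, 3)
)

out <- toy %>%
  mutate(
    # default = 0 treats the missing "previous" value of row 1 as 0
    bump = as.integer(Control > 12 & Obs1 == 1 & lag(Obs1, default = 0) == 1),
    NewCluster = Cluster + cumsum(bump)
  )

out$NewCluster
```

Only row 2 satisfies all three parts of the condition here, so every cluster from row 2 onwards is shifted up by one. If the data had several IDs, adding group_by(ID) before mutate() would presumably be needed so that lag() and cumsum() restart for each ID.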

R limit output of dataframe?

I have a data frame of transactions.
I am using dplyr to filter the transaction by gender.
Gender in my case is 0 or 1.
I want to filter 2 rows one with Gender == 0 and the second with Gender == 1.
The closest was to do it like this
df %>% arrange(Gender)
and then select 2 transactions in the middle where one is 1 and the second is 0.
Please advise.
To randomly sample a row/cell where condition in another cell is satisfied you can use sample like this:
# Dummy data: X = value of interest, G = Gender (0,1)
df1 <- data.frame("X" = rnorm(10, 0, 1), "G" = sample(c(0,1), replace = TRUE, size = 10))
# Sampling
sample(df1[,'X'][df1[,'G'] == 1], size = 1)
sample(df1[,'X'][df1[,'G'] == 0], size = 1)
This takes one value of X for each gender; the condition on G is set by the logical index, e.g. [df1[,'G'] == 1].
Building from the comment by docendo discimus you can use the popular dplyr package, using the script below, but note that this runs considerably slower (5 times slower, 3M rows & 1000 iterations) than the sample approach I offered above:
pull(df1 %>% group_by(G) %>% sample_n(1), X)
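If whole rows are wanted rather than a single value of X, a grouped sample keeps all columns. A minimal sketch, assuming a dplyr version that provides slice_sample() (1.0.0 or later); the dummy data and the name one_per_gender are made up for illustration:

```r
library(dplyr)

set.seed(1)
# Dummy data: X = value of interest, G = gender (0, 1), five rows of each
df1 <- data.frame(X = rnorm(10), G = rep(c(0, 1), times = 5))

# One random row per gender, with every column kept
one_per_gender <- df1 %>%
  group_by(G) %>%
  slice_sample(n = 1) %>%
  ungroup()
```

The result has exactly two rows, one with G == 0 and one with G == 1.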

R function or loop that could go through a binary variable (1 and 0) in a dataframe and returns a third variable (y) value from a different column

I need some help. I am trying to build a function or a loop in R that goes through a binary variable (1 and 0) in a dataframe so that every time a 1 is followed by a 0, I save the value of a third variable (y) from the row where the 0 occurs into a vector. I tried a couple of options based on previous posts, but nothing gives me anything even close to that.
My data looks a bit like that:
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- cbind(ID, variable, y)
In this case, for example, the answer would give me a vector with the y values 30 and 50. Sorry if someone already has answered that, I could not find something similar. Thanks a lot!
Here's a vectorized solution. Basically, I paste together variable at positions i and i+1, then check whether the combination is "10". The position you want is actually the next one (i.e. i+1), so we add 1.
df <- data.frame(ID, variable, y)
idx <- which(paste0(df$variable[-nrow(df)], df$variable[-1]) == "10") + 1
df$y[idx]
Here is an approach with tidyverse:
library(tidyverse)
df %>%
as_tibble() %>%
mutate(y1 = ifelse(lag(variable) == 1 & variable == 0, y, NA)) %>%
pull(y1)
#output
[1] NA NA 30 NA 50
and in base R:
ifelse(c(NA, df[-nrow(df),2]) == 1 & df[, 2] == 0, df[, 3], NA)
In both versions, if the lagged value of variable is 1 and the current value is 0, y is returned; otherwise NA is returned.
If you would like to remove the NA values, wrap the result in na.omit().
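The same test can also be written as a plain logical index, which skips both the string pasting and the NA clean-up. A small base R sketch using the vectors from the question; the helper names prev and hit are made up for illustration:

```r
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)

# Value of `variable` one row earlier; the first row has no predecessor
prev <- c(NA, head(variable, -1))

# TRUE where a 0 directly follows a 1; the NA in row 1 is treated as no match
hit <- !is.na(prev) & prev == 1 & variable == 0

y[hit]
# [1] 30 50
```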

extracting data using dplyr

Say I have the following data
set.seed(123)
a <- c(rep(1,30),rep(2,30))
b <- rep(1:30)
c <- sample(20:60, 60, replace = TRUE)
data <- data.frame(a,b,c)
data
Now I want to extract data whereby:
For each unique value of a, extract/match data where the b value is the same and the c value is within a limit of +-5
so a desired output should produce:
You want to compare within each distinct b group (as they are unique within each a), thus you should group by b. It is also not possible to group by a and compare between them, thus a possible solution would be
data %>%
group_by(b) %>%
filter(abs(diff(c)) <= 5)
with data.table package this would be something like
library(data.table)
setDT(data)[, .SD[abs(diff(c)) <= 5], b]
Or
data[, if (abs(diff(c)) <= 5) .SD, b]
Or
data[b %in% data[, abs(diff(c)) <= 5, b][V1 == TRUE, b]]
In base R it would be something like
data[with(data, !!ave(c, b, FUN = function(x) abs(diff(x)) <= 5)), ]
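All of the versions above rely on each b value appearing exactly twice, once per a, so that diff() returns a single number. A small sketch with made-up numbers that compares the range within each b group instead, which should also cope with more than two values of a; the toy data and the name close_rows are assumptions for illustration:

```r
library(dplyr)

# Made-up data: two values of a, three values of b
data <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(1, 2, 3, 1, 2, 3),
  c = c(20, 40, 33, 24, 50, 38)
)

close_rows <- data %>%
  group_by(b) %>%
  # Keep a b group only if all of its c values lie within 5 of each other
  filter(max(c) - min(c) <= 5) %>%
  ungroup()
```

Here b = 1 (20 vs 24) and b = 3 (33 vs 38) are kept, while b = 2 (40 vs 50) is dropped.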
