R: Create Indicator Columns from list of conditions

I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.
This is what I tried:
# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")
df <-data.frame(age, sex)
# create the sample conditions
conditions <- list(
list("age", c(18:100)),
list("sex", c("f", "m"))
)
addIndicator <- function (df, columnName, validValues) {
indicator <- vector()
for (row in df[, toString(columnName)]) {
# for some strange reason, %in% doesn't work correctly here, but always returns FALSE
indicator <- append(indicator, row %in% validValues)
}
df <- cbind(df, indicator)
# rename the column
names(df)[length(names(df))] <- paste0("I_", columnName)
return(df)
}
for (condition in conditions){
columnName <- condition[1]
validValues <- condition[2]
df <- addIndicator(df, columnName, validValues)
}
print(df)
However, every condition is reported as not met, which is not what I expect:
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f FALSE FALSE
I figured that %in% does not return the expected result. I checked for the typeof(row) and tried to boil this down into a minimum example. In a simple ME, with the same type and values of the variables, the %in% works properly. So, something must be wrong within the context I try to apply this. Since this is my first attempt to write anything in R, I am stuck here.
What am I doing wrong and how can I achieve what I want?

If you prefer an approach that uses the tidyverse family of packages:
library(tidyverse)
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
rename_with(~ paste0('I_', .x)) %>%
bind_cols(df)
imap_dfr allows you to manipulate each column in df using a lambda function. .x references the column content and .y references the name.
rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.
I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.

conditions appears to be a nested list. When you use:
validValues <- condition[2]
in your for loop, your result is also a list.
To get the vector of values to use with %in%, extract the element itself with [[ instead:
validValues <- condition[[2]]
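The difference is easy to check at the console. The following is a minimal sketch that rebuilds one entry of conditions from the question and compares the two extraction operators (the names in_list and in_vector are only illustrative):

```r
# One entry of the nested list from the question
condition <- list("age", 18:100)

# Single brackets keep the list wrapper; double brackets return the element itself
class(condition[2])    # "list"
class(condition[[2]])  # "integer"

# With the list wrapper in place, %in% does not look inside the element
in_list   <- 45 %in% condition[2]    # FALSE
in_vector <- 45 %in% condition[[2]]  # TRUE
```

The same applies to columnName <- condition[1]; the toString() call in addIndicator happens to hide the problem for the column name, but condition[[1]] would be the cleaner choice.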
A simplified approach to obtaining indicators could be with a simple list:
conditions_lst <- list(age = 18:100, sex = c("f", "m"))
And using sapply instead of a for loop:
cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
df[[x]] %in% conditions_lst[[x]]
}))
Output
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f TRUE TRUE
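If the indicators are needed in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: add_indicators is a hypothetical name, and it assumes every name in the list of allowed values is also a column of the data frame.

```r
# Hypothetical helper: one logical indicator column per entry in `allowed`
add_indicators <- function(data, allowed) {
  flags <- lapply(names(allowed), function(col) data[[col]] %in% allowed[[col]])
  names(flags) <- paste0("I_", names(allowed))
  cbind(data, as.data.frame(flags))
}

df <- data.frame(age = c(120, 45), sex = c("x", "f"))
out <- add_indicators(df, list(age = 18:100, sex = c("f", "m")))
print(out)
```

Because the helper iterates over the names of the list rather than over names(df), columns without a condition are simply left without an indicator.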

An alternative approach using across and cur_column() (and leaning heavily on severin's solution):
library(tidyverse)
df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
mutate(across(c(age, sex),
c(valid = ~ .x %in% allowed_values[[cur_column()]])
)
)
Reference: https://dplyr.tidyverse.org/articles/colwise.html#current-column
Related question: Refering to column names inside dplyr's across()

Related

Changing values in multiple column given one condition (preferably in dplyr)

I'm looking for an easy way to change several values for the same person. Preferably with dplyr or any other package from tidyverse.
Here my example:
df <- data.frame(personid = 1:3, class = c("class1", "class3", "class3"), classlevel = c(1, 11, 3), education = c("BA", "Msc", "BA"))
df
My dataset contains an entry with several mistakes. Person #2 should be part of class 1, at classlevel 1, and his education is BA, not MSc. I use mutate with case_when a lot, but in this case I don't want to change one variable based on multiple conditions; I have one condition and want to change multiple values in other variables based on it.
Basically, I'm looking for shorter code that replaces this:
df$class[df$personid == 2] <- "class1"
df$classlevel[df$personid == 2] <- 1
df$education[df$personid == 2] <- "BA"
df
or this:
library(tidyverse)
df <- df |>
mutate(class = case_when(personid == 2 ~ "class1", TRUE ~ class)) |>
mutate(classlevel = case_when(personid == 2 ~ 1, TRUE ~ as.numeric(classlevel))) |>
mutate(education = case_when(personid == 2 ~ "BA", TRUE ~ education))
df
In my original data, there are several dozen cases like this, and I find it a bit tedious to use three lines of code for each person. Is there a shorter way?
Thanks for your input!
One way would be to create a data frame of the values to be updated and use rows_update(). Note that this assumes the rows are uniquely identified.
library(dplyr)
df_update <- tribble(
~personid, ~class, ~classlevel, ~education,
2, "class1", 1, "BA"
)
df %>%
rows_update(df_update, by = "personid")
personid class classlevel education
1 1 class1 1 BA
2 2 class1 1 BA
3 3 class3 3 BA
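For several dozen corrections, the same lookup-table idea can also be sketched in base R with match(). The corrections table below is hypothetical (the values for person 3 are invented purely for illustration), and the sketch assumes that every personid in it occurs exactly once in df.

```r
df <- data.frame(personid = 1:3,
                 class = c("class1", "class3", "class3"),
                 classlevel = c(1, 11, 3),
                 education = c("BA", "Msc", "BA"),
                 stringsAsFactors = FALSE)

# Hypothetical table of corrected rows, one row per person to fix
corrections <- data.frame(personid = c(2L, 3L),
                          class = c("class1", "class2"),
                          classlevel = c(1, 3),
                          education = c("BA", "MSc"),
                          stringsAsFactors = FALSE)

# Find the row of df that belongs to each corrected person
idx <- match(corrections$personid, df$personid)
stopifnot(!anyNA(idx))  # fail early if an id is unknown

# Overwrite those rows, column by column, with the corrected values
df[idx, names(corrections)] <- corrections
print(df)
```

Keeping the corrections in one table means adding a person is a single new row rather than three new assignments.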
I think I need a little more information to answer your question, but I'll try anyway.
If you want to change the value of some columns based on a single condition across all the rows, I recommend doing this (I created new columns with a 1 suffix so you can see the original and the output):
df <- df %>% mutate(class1 = case_when(class != "class1" ~ "class1", TRUE ~ class),
classlevel1 = case_when( classlevel != 1 ~ 1, TRUE ~ as.numeric(classlevel)),
education1 = case_when( education != "BA" ~ "BA", TRUE ~ education))
If that was your problem, then you are probably not familiar with the concept of vectorization. Briefly, a vectorized function runs for all the rows or elements in your vector, without you needing to specify that. There are a lot of examples and tutorials on the web if you search "vectorization in R" or something similar.
Otherwise, if your condition changes for each single id (or row) in your data, then the problem is more complicated.
Let me know if that helps and, if it doesn't, consider providing more information in your question.
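To make the idea of vectorization concrete, here is a small sketch that computes the same comparison twice, once with an explicit loop and once vectorized (the variable names are only illustrative):

```r
classlevel <- c(1, 11, 3)

# Explicit loop: test one element at a time
loop_result <- logical(length(classlevel))
for (i in seq_along(classlevel)) {
  loop_result[i] <- classlevel[i] != 1
}

# Vectorized: the comparison is applied to every element in one call
vec_result <- classlevel != 1

identical(loop_result, vec_result)  # TRUE
```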

How to use a %in% condition in the R which function?

I have a simple task, which I can do in loads of lines of individual code, but I would like to simplify it, as it will otherwise take a long time in the future.
My task is to transform hundreds of columns of a data frame into factors and relabel them accordingly.
With just a subset of my data, I tried to create a list of variables, as the 12 variables have different prefixes at each wave (year of collection). The code I ended up using was:
ghq <-c("scghqa", "scghqb", "scghqc", "scghqd", "scghqe", "scghqf", "scghqg",
"scghqh", "scghqi", "scghqj", "scghqk", "scghql")
waves <- c("a", "b", "c", "d", "e")
ghqa <- paste0(waves[1], sep = "_", ghq[1:12])
ghqb <- paste0(waves[2], sep = "_", ghq[1:12])
ghqc <- paste0(waves[3], sep = "_", ghq[1:12])
ghqd <- paste0(waves[4], sep = "_", ghq[1:12])
ghqe <- paste0(waves[5], sep = "_", ghq[1:12])
ghqv <- c(ghqa, ghqb, ghqc, ghqd, ghqe)
I tried this in a for loop, but I could not get it to produce the output in a list or character vector (only a matrix seemed to work), see the code for that at the bottom of this question, if you are curious.
From here to be able to use apply, I need to know the positions of these columns in the dataframe
apply(data[c(indexes of cols), 2, lfactor(c(values in the factor), levels =c(levels they will correspond to), labels=c(text labels to be attached to each level))
NOTE: I put this here because perhaps I am going the wrong way about things by trying to use apply.
So to identify the columns I want from the data I used
head(dat[colnames(dat) %in% ghqv]) # produced the data for the 60 columns I want
length(dat[colnames(dat) %in% ghqv]) # 60 (as expected)
so I tried:
which(dat[colnames(dat) %in% ghqv])
Error in which(dat[colnames(dat) %in% ghqv]) :
argument to 'which' is not logical
How can I transform this to a logical, please? Any time I use == with %in%, it does not seem to recognise it.
To help simplify this, given the silly variable names, I recreated the same issue with the mtcars data set:
cars <- mtcars
vars <- c("mpg", "qsec")
head(cars[colnames(cars) %in% vars])
which(cars[colnames(cars) %in% vars])
Error in which(cars[colnames(cars) %in% vars]) :
argument to 'which' is not logical
Any assistance would be very welcomed, thank you
Just as an aside; the for loop i couldn't change to create a single vector which appended
vars <- data.frame(matrix(nrow = 12, ncol = 5)) # we will create a container
colnames(vars) <- c("wave1", "wave2", "wave3", "wave4", "wave5")
rownames(vars) <- c("ghq1", "ghq2", "ghq3", "ghq4", "ghq5",
"ghq6", "ghq7", "ghq8", "ghq9", "ghq10",
"ghq11", "ghq12")
for(i in 1:5){
a <- paste(waves[i], ghqv[1:12], sep = "_")
vars[,i] <- a
print(a) # we print it to see in console
}
You're passing an entire data frame to which().
which(cars[colnames(cars) %in% vars]) runs which on cars[colnames(cars) %in% vars], which is a subset of the cars data.frame (incidentally, cars[colnames(cars) %in% vars] gives the same result here as cars[vars]).
If you just want the indices of the matching columns, run:
which(colnames(cars) %in% vars)
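A short sketch of the intermediate steps may help show why this version works: %in% on the column names yields a logical vector, which is exactly what which() expects (is_wanted and idx are illustrative names).

```r
cars <- mtcars
vars <- c("mpg", "qsec")

# A logical vector with one entry per column of cars
is_wanted <- colnames(cars) %in% vars

# which() turns the TRUE positions into column indices
idx <- which(is_wanted)
idx  # 1 7

# match() returns the same positions, in the order given by vars
match(vars, colnames(cars))
```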
There's probably a better way to do what you want to do
I would run
require(dplyr)
mutate(cars, across(all_of(vars), factor)) %>%
rename_at(vars, some_function_that_renames_columns)
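If you would rather stay in base R, the column indices are not needed at all, because a data frame can be indexed by name. The following is a sketch under stated assumptions: dat is a toy stand-in for the real data, and the levels 1:3 and their labels are invented for illustration.

```r
ghq <- paste0("scghq", letters[1:12])  # "scghqa" ... "scghql"
waves <- c("a", "b", "c", "d", "e")

# All 60 wave/item combinations in one call, no loop required
ghqv <- paste(rep(waves, each = length(ghq)), ghq, sep = "_")
length(ghqv)  # 60

# Toy data with two of the GHQ columns plus an unrelated column
dat <- data.frame(a_scghqa = c(1, 2, 3), b_scghqa = c(3, 3, 1), id = 1:3)

# Convert only the columns that actually exist in dat
present <- intersect(ghqv, colnames(dat))
dat[present] <- lapply(dat[present], factor,
                       levels = 1:3,
                       labels = c("low", "medium", "high"))
str(dat)
```

lapply() is used instead of apply() because apply() first converts the data frame to a matrix, which cannot hold factor columns.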

How to use the R pipe operator (%>%) in the following cases

1) I have a data frame named df, how can I include an if statement within the mutate function used within the pipe operator? The following does not work:
df %>%
mutate_if(myvar == "A", newColumn = oldColumn*3, newColumn = oldColumn)
The variable myvar is not included in the data frame and is a "flag" variable with values either "A" or "B". When "A", would like to create a new column named "newColumn" in the data frame that is three times the old column (named "oldColumn"), otherwise it is identical to the old column.
2) Would like to divide the column named "numbers" with the entry of numbers which has the minimum value in another column named "seconds", as follows:
df$newCol <- df$numbers / df[df$seconds== min(df$seconds),]$numbers
How can I do that with mutate command and "%>%", so that it looks more handy? Nothing that I tried works unfortunately.
Thanks for any answers,
J.
If myvar is just a variable floating around in the environment, you can use an if-else statement within mutate (similar question here)
library(dplyr)
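# Generate dataset and an example flag (myvar lives outside the data frame)
df <- tibble(oldColumn = rnorm(100))
myvar <- "A"
# Mutate with if-else: triple the column when the flag is "A", otherwise copy it
df <- df %>% mutate(newColumn = if (myvar == "A") oldColumn * 3 else oldColumn)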
If myvar is included as a column in the dataframe, then you can use case_when.
# Generate dataset
df <- tibble(myvar = sample(c("A", "B"), 100, replace = TRUE),
oldColumn = rnorm(100))
# Create a new column which depends on the value of myvar
df <- df %>%
mutate(newColumn = case_when(myvar == "A" ~ oldColumn*3,
myvar == "B" ~ oldColumn))
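The two situations differ in whether the condition is a single value or one value per row. A base R sketch of that distinction, with illustrative names (old, flag) that are not part of the question:

```r
old <- c(1, 2, 3)

# Case 1: one flag for the whole data set, so a plain if/else is enough
myvar <- "A"
new_scalar <- if (myvar == "A") old * 3 else old

# Case 2: one flag per row, so the test has to be vectorized
flag <- c("A", "B", "A")
new_vector <- ifelse(flag == "A", old * 3, old)

new_scalar  # 3 6 9
new_vector  # 3 2 9
```

case_when plays the same role as ifelse here and becomes more convenient once there are more than two branches.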
As for question 2, you can use mutate with the "." placeholder, which refers to the left-hand side of the pipe (i.e. "df") inside the right-hand side call. You can then filter down to the row with the minimum value of seconds (the top_n call with -1 as its argument) and pull out the value of the numbers variable.
# Generate data
df <- tibble(numbers = sample(1:60),
seconds = sample(1:60))
# Do computation
df <- df %>% mutate(newCol = numbers / top_n(.,-1,seconds) %>% pull(numbers))
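A possibly simpler variant, offered as a sketch rather than a replacement, indexes numbers directly with which.min(). The small data set below is invented so the result is easy to verify by hand.

```r
library(dplyr)

# Small deterministic example: the minimum of seconds is in row 2
df <- tibble(numbers = c(10, 20, 40),
             seconds = c(5, 2, 9))

# Divide every entry by the value of numbers at the row where seconds is smallest
df <- df %>%
  mutate(newCol = numbers / numbers[which.min(seconds)])

df$newCol  # 0.5 1.0 2.0
```

Note that which.min() returns only the first minimum, whereas top_n() keeps all tied rows, so the two versions can differ when the minimum of seconds is not unique.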

Using filter_ in dplyr where both field and value are in variables

I want to filter a dataframe using a field which is defined in a variable, to select a value that is also in a variable. Say I have
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
The value I want would be df[df$Unhappy == "Y", ].
I've read the nse vignette to try use filter_ but can't quite understand it. I tried
df %>% filter_(.dots = ~ fld == sval)
which returned nothing. I got what I wanted with
df %>% filter_(.dots = ~ Unhappy == sval)
but obviously that defeats the purpose of having a variable to store the field name. Any clues please? Eventually I want to use this where fld is a vector of field names and sval is a vector of filter values for each field in fld.
You can try with interp from lazyeval
library(lazyeval)
library(dplyr)
df %>%
filter_(interp(~v==sval, v=as.name(fld)))
# V Unhappy
#1 1 Y
#2 5 Y
#3 3 Y
For multiple key/value pairs, I found the following to work, although there is probably a better way.
df1 %>%
filter_(interp(~v==sval1[1] & y ==sval1[2],
.values=list(v=as.name(fld1[1]), y= as.name(fld1[2]))))
# V Unhappy Col2
#1 1 Y B
#2 5 Y B
For these cases, I find the base R option to be easier. For example, if we are trying to filter the rows based on the 'key' variables in 'fld1' with corresponding values in 'sval1', one option is using Map. We subset the dataset (df1[fld1]), apply the FUN (==) to each column of df1[fld1] with the corresponding value in 'sval1', and use & with Reduce to get a logical vector that can be used to filter the rows of 'df1'.
df1[Reduce(`&`, Map(`==`, df1[fld1],sval1)),]
# V Unhappy Col2
#2 1 Y B
#3 5 Y B
data
df1 <- cbind(df, Col2= c("A", "B", "B", "C", "A"))
fld1 <- c(fld, 'Col2')
sval1 <- c(sval, 'B')
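To see what Map and Reduce each contribute, the one-liner can be unpacked into its intermediate steps. This sketch rebuilds the data above in a self-contained form; per_column and keep are illustrative names.

```r
df1 <- data.frame(V = c(6, 1, 5, 3, 2),
                  Unhappy = c("N", "Y", "Y", "Y", "N"),
                  Col2 = c("A", "B", "B", "C", "A"))
fld1 <- c("Unhappy", "Col2")
sval1 <- c("Y", "B")

# Map: one logical vector per column, comparing it with its own target value
per_column <- Map(`==`, df1[fld1], sval1)

# Reduce: combine those vectors with &, so a row must match in every column
keep <- Reduce(`&`, per_column)
keep  # FALSE TRUE TRUE FALSE FALSE

df1[keep, ]
```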
rlang 0.4.0 introduced a new, more intuitive way to handle this type of use case:
packageVersion("rlang")
# [1] ‘0.4.0’
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
df %>% filter(.data[[fld]]==sval)
#OR
filter_col_val <- function(df, fld, sval) {
df %>% filter({{fld}}==sval)
}
filter_col_val(df, Unhappy, "Y")
More information can be found at https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/
Previous Answer
With dplyr 0.6.0 and later, this code works:
packageVersion("dplyr")
# [1] ‘0.7.1’
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
df %>% filter(UQ(rlang::sym(fld))==sval)
#OR
df %>% filter((!!rlang::sym(fld))==sval)
#OR
fld <- quo(Unhappy)
sval <- "Y"
df %>% filter(UQ(fld)==sval)
More about the dplyr syntax available at http://dplyr.tidyverse.org/articles/programming.html and the quosure usage in the rlang package https://cran.r-project.org/web/packages/rlang/index.html .
If you find mastering non-standard evaluation in dplyr 0.6+ challenging, Alex Hayes has an excellent write-up on the topic: https://www.alexpghayes.com/blog/gentle-tidy-eval-with-examples/
Original Answer
With dplyr version 0.5.0 and later, it is possible to use a simpler syntax that gets closer to what @Ricky originally wanted, and which I also find more readable than using lazyeval::interp
df %>% filter_(.dots = paste0(fld, "=='", sval, "'"))
# V Unhappy
#1 1 Y
#2 5 Y
#3 3 Y
#OR
df %>% filter_(.dots = glue::glue("{fld}=='{sval}'"))
Here's an alternative with base R, which is maybe not very elegant, but it might have the benefit of being rather easily understandable:
df[df[colnames(df)==fld]==sval,]
# V Unhappy
#2 1 Y
#3 5 Y
#4 3 Y
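A slightly shorter base R form of the same idea, shown here as a sketch, extracts the column by name with [[ and compares it directly:

```r
df <- data.frame(V = c(6, 1, 5, 3, 2), Unhappy = c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"

# df[[fld]] returns the column as a plain vector, so == yields one TRUE/FALSE per row
result <- df[df[[fld]] == sval, ]
result
```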
Following on from LmW; personally I prefer using a dplyr pipeline where the dots are specified before the pipeline so that it is easier to use programmatically, say in a loop of filters.
dots <- paste0(fld," == '",sval,"'")
df %>% filter_(.dots = dots)
LmW's example is correct but the values are hardcoded.
So I was trying to do the same thing, and it seems that now dplyr has a builtin functionality to address exactly this.
Check the last example here: https://dplyr.tidyverse.org/reference/filter.html
I'm also pasting it here for simplicity:
# To refer to column names that are stored as strings, use the `.data` pronoun:
vars <- c("mass", "height")
cond <- c(80, 150)
starwars %>%
filter(
.data[[vars[[1]]]] > cond[[1]],
.data[[vars[[2]]]] > cond[[2]]
)
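When the number of field/value pairs is not known in advance, the .data pronoun can be applied in a loop, one filter per pair. This is only a sketch: it reuses the df1, fld1 and sval1 example from the earlier answer and assumes all filter values are of the same type (a list would be needed otherwise).

```r
library(dplyr)

df1 <- data.frame(V = c(6, 1, 5, 3, 2),
                  Unhappy = c("N", "Y", "Y", "Y", "N"),
                  Col2 = c("A", "B", "B", "C", "A"))
fld1 <- c("Unhappy", "Col2")
sval1 <- c("Y", "B")

# Apply one equality filter per field/value pair; the filters combine like AND
result <- df1
for (i in seq_along(fld1)) {
  result <- filter(result, .data[[fld1[i]]] == sval1[i])
}
result
```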

dynamic subsetting in r

I have a data set that is something like the following, but with many more columns and rows:
a<-c("Fred","John","Mindy","Mike","Sally","Fred","Alex","Sam")
b<-c("M","M","F","M","F","M","M","F")
c<-c(40,35,25,50,25,40,35,40)
d<-c(9,7,8,10,10,9,5,8)
df<-data.frame(a,b,c,d)
colnames(df)<-c("Name", "Gender", "Age", "Score")
I need to create a function that will let me sum the scores for selected subsets of the data. However, the subsets selected may have different numbers of variables each time. One subset could be Name=="Fred" and another could be Gender == "M" & Age == 40. In my actual data set, there could be up to 20 columns used in a selected subset, so I need to make this as general as possible.
I tried using a sapply command that included eval(parse(text=...), but it takes a long time with only a sample of 20,000 or so records. I'm sure there's a much faster way, and I'd appreciate any help in finding it.
There are several ways to represent these two variables. One way is as two distinct objects, another is as two elements in a list.
However, using a named list might be the easiest:
# df is a function for the F distribution. Avoid using "df" as a variable name
DF <- df
example1 <- list(Name = c("Fred")) # c() not needed, used for emphasis
example2 <- list(Gender = c("M"), Age=c(40, 50))
## notice that the key portion is `DF[[nm]] %in% ll[[nm]]`
subByNmList <- function(ll, DF, colsToSum=c("Score")) {
ret <- vector("list", length(ll))
names(ret) <- names(ll)
for (nm in names(ll))
ret[[nm]] <- colSums(DF[DF[[nm]] %in% ll[[nm]] , colsToSum, drop=FALSE])
# optional
if (length(ret) == 1)
return(unlist(ret, use.names=FALSE))
return(ret)
}
subByNmList(example1, DF)
subByNmList(example2, DF)
lapply( subset( df, Gender == "M" & Age == 40, select=Score), sum)
#$Score
#[1] 18
I could have written just:
sum( subset( df, Gender == "M" & Age == 40, select=Score) )
But that would not generalize very well.
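One way to generalize it without eval(parse(text = ...)) is to pass the criteria as a named list and combine the per-column tests with &. The following is a sketch: sum_subset is a hypothetical helper, and it assumes each list name is a column and each list element holds the accepted values for that column.

```r
df <- data.frame(Name = c("Fred", "John", "Mindy", "Mike", "Sally", "Fred", "Alex", "Sam"),
                 Gender = c("M", "M", "F", "M", "F", "M", "M", "F"),
                 Age = c(40, 35, 25, 50, 25, 40, 35, 40),
                 Score = c(9, 7, 8, 10, 10, 9, 5, 8))

# Hypothetical helper: sum `col` over the rows that satisfy every criterion
sum_subset <- function(data, criteria, col = "Score") {
  tests <- Map(function(nm, vals) data[[nm]] %in% vals, names(criteria), criteria)
  # Start from all TRUE so an empty criteria list selects every row
  keep <- Reduce(`&`, tests, rep(TRUE, nrow(data)))
  sum(data[[col]][keep])
}

fred_total  <- sum_subset(df, list(Name = "Fred"))
m40_total   <- sum_subset(df, list(Gender = "M", Age = 40))
women_total <- sum_subset(df, list(Gender = "F"))
```

Because each test is a vectorized %in% on a whole column, the cost grows with the number of criteria rather than with a parsed expression per row.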
