Call a function with some parameters in a vector variable - r

I have a dataframe, and a function I use to mark specific cases in it (The following is an example to make things more concrete. The actual dataframe and function are much more complicated).
df = data.frame(number = c(1,2,3,4),
type1 = c("A","A","B","B"),
type2 = c("A","B","A","B"),
var1 = c(1,2,3,4),
var2 = c(1,2,3,4) )
FilterMark <- function (fun_data, cond_type1, cond_type2) {
fun_data$filter <- FALSE
fun_data$filter [which(fun_data$type1 == cond_type1 &
fun_data$type2 == cond_type2 )] <- TRUE
return(fun_data$filter)
}
I need to call this function multiple times, with various conditions. I want to use a vector to define the conditions for each such call. For example:
conds = c("A","A")
df$case1[FilterMark(df, conds)] <- TRUE
But this doesn't work. The function interprets conds as a single parameter, instead of splitting the vector into two parameters.
Is there a way to call the function in such a way?
(I already tried do.call, but couldn't get it to work...)

Do you really need a specific function for that?
library(tidyverse)
df = data.frame(
number = c(1, 2, 3, 4),
type1 = c("A", "A", "B", "B"),
type2 = c("A", "B", "A", "B"),
var1 = c(1, 2, 3, 4),
var2 = c(1, 2, 3, 4)
)
df |>
mutate(case1 = type1 == "A" & type2 == "A")
#> number type1 type2 var1 var2 case1
#> 1 1 A A 1 1 TRUE
#> 2 2 A B 2 2 FALSE
#> 3 3 B A 3 3 FALSE
#> 4 4 B B 4 4 FALSE
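Since the question mentions do.call, here is a minimal sketch of how such a call could look. It assumes the condition vector holds the arguments in the same order as the function's parameters, and it uses a shortened version of FilterMark that simply returns the logical vector; neither detail comes from the original post.

```r
df <- data.frame(number = c(1, 2, 3, 4),
                 type1 = c("A", "A", "B", "B"),
                 type2 = c("A", "B", "A", "B"))

# Shortened stand-in for the question's FilterMark (an assumption)
FilterMark <- function(fun_data, cond_type1, cond_type2) {
  fun_data$type1 == cond_type1 & fun_data$type2 == cond_type2
}

conds <- c("A", "A")

# list(df) keeps the data frame as one argument;
# as.list(conds) turns each vector element into its own argument
df$case1 <- do.call(FilterMark, c(list(df), as.list(conds)))
```

Wrapping the data frame in list() matters here: c(df, as.list(conds)) would instead pass every column of df as a separate argument.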

Related

For loop inside lapply

I have the following simplified dataframe df (dput below):
> df
group value
1 A 1
2 A 4
3 B 2
4 B 3
5 C 2
6 C 1
I would like to apply a for-loop per group using lapply. So first I created a list for each group using split. Then I would like to perform the for loop within each dataframe, but it doesn't work. Here is some reproducible code (note that the for loop is simplified to keep it reproducible):
df = data.frame(group = c("A", "A", "B", "B", "C", "C"),
value = c(1, 4, 2, 3, 2, 1))
l = split(df, df$group)
lapply(l, \(x) {
for(i in 1:nrow(x)) {
if(x$value[i] == 1) {
x$value[i] = x$value[i] + 1
} else
x$value[i] = x$value[i] + 2
}
})
#> $A
#> NULL
#>
#> $B
#> NULL
#>
#> $C
#> NULL
Created on 2023-02-15 with reprex v2.0.2
This should add 1 to the values that are 1 and add 2 to all others (this is really simple but reproducible). As you can see, it returns NULL. The expected output should look like this:
$A
group value
1 A 2
2 A 6
$B
group value
3 B 4
4 B 5
$C
group value
5 C 4
6 C 2
So I was wondering how we can apply a for-loop to each group within a lapply? Why is it returning NULL?
dput df:
df<-structure(list(group = c("A", "A", "B", "B", "C", "C"), value = c(1,
4, 2, 3, 2, 1)), class = "data.frame", row.names = c(NA, -6L))
A for loop always returns NULL (invisibly). Because the loop is the last expression evaluated in your function, lapply collects NULL for each group. If you want your function to return the updated x, add x as the last line of the function, or use return(x) to be more explicit.
l = split(df, df$group)
lapply(l, \(x) {
for(i in 1:nrow(x)) {
if(x$value[i] == 1) {
x$value[i] = x$value[i] + 1
} else
x$value[i] = x$value[i] + 2
}
x
})
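To illustrate the point about the loop's value, and as a possible alternative, here is a small sketch. The first part shows that the value of a for loop is NULL; the second part swaps the loop for a vectorized ifelse(), which is my own substitution rather than part of the original answer.

```r
# The value of a for loop is NULL, regardless of what the body does
res <- for (i in 1:3) i
is.null(res)

df <- data.frame(group = c("A", "A", "B", "B", "C", "C"),
                 value = c(1, 4, 2, 3, 2, 1))

# Vectorized alternative: no loop, and x is returned explicitly
out <- lapply(split(df, df$group), \(x) {
  x$value <- ifelse(x$value == 1, x$value + 1, x$value + 2)
  x
})
```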

Remove all records that have duplicates based on more than one variable

I have data like this
df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"), var2 = c(1, 2, 3, 4, 5, 5, 6 ))
# var1 var2
# 1 A 1
# 2 A 2
# 3 B 3
# 4 B 4
# 5 C 5
# 6 D 5
# 7 E 6
A is mapped to 1, 2
B is mapped to 3, 4
C and D are both mapped to 5 (and vice versa: 5 is mapped to C and D)
E is uniquely mapped to 6 and 6 is uniquely mapped to E
I would like to filter the dataset so that only
var1 var2
7 E 6
is returned. Base or tidyverse solutions are welcome.
I have tried
unique(df$var1, df$var2)
df[!duplicated(df),]
df %>% distinct(var1, var2)
but without the wanted result.
Using igraph::components.
Represent data as graph and get connected components:
library(igraph)
g = graph_from_data_frame(df)
cmp = components(g)
Grab components where cluster size (csize) is 2. Output vertices as a two-column character matrix:
matrix(names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)]),
ncol = 2, dimnames = list(NULL, names(df))) # wrap in as.data.frame if desired
# var1 var2
# [1,] "E" "6"
Alternatively, use names of relevant vertices to index original data frame:
v = names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)])
df[df$var1 %in% v[1:(length(v)/2)], ]
# var1 var2
# 7 E 6
Visualize the connections:
plot(g)
Using a custom function to determine if the mapping is unique you could achieve your desired result like so:
df <- data.frame(
var1 = c("A", "A", "B", "B", "C", "D", "E"),
var2 = c(1, 2, 3, 4, 5, 5, 6)
)
is_unique <- function(x, y) ave(as.numeric(factor(x)), y, FUN = function(x) length(unique(x)) == 1)
df[is_unique(df$var2, df$var1) & is_unique(df$var1, df$var2), ]
#> var1 var2
#> 7 E 6
Another igraph option
decompose(graph_from_data_frame(df)) %>%
subset(sapply(., vcount) == 2) %>%
sapply(function(g) names(V(g)))
which gives
[,1]
[1,] "E"
[2,] "6"
A base R solution:
df[!(duplicated(df$var1) | duplicated(df$var1, fromLast = TRUE) |
duplicated(df$var2) | duplicated(df$var2, fromLast = TRUE)), ]
var1 var2
7 E 6
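For a tidyverse variant of the same idea, one possible sketch is to count how often each value occurs in both columns and keep only rows where both counts are 1. The helper column names n_var1 and n_var2 are made up for this example.

```r
library(dplyr)

df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"),
                 var2 = c(1, 2, 3, 4, 5, 5, 6))

res <- df %>%
  add_count(var1, name = "n_var1") %>%  # occurrences of each var1 value
  add_count(var2, name = "n_var2") %>%  # occurrences of each var2 value
  filter(n_var1 == 1, n_var2 == 1) %>%  # keep values unique in both columns
  select(-n_var1, -n_var2)              # drop the helper columns
```

Both counts are computed before filtering on purpose: filtering on var1 first and then counting var2 on the reduced data could make a var2 value look unique when it was duplicated in the original data.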

R get column names for changed rows

I have two dataframes, old and new, in R. Is there a way to add a column (called changed) to the new dataframe that lists the column names (in this case, separated with a ";") where the values are different between the two dataframes? I am also trying to use this in a function where the column names that I am comparing are contained in other variables (x1, x2, x3). Ideally, I would only refer to x1, x2, x3 instead of the actual column names, but I can make do if this isn't possible. A tidy solution is preferable.
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"))
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"))
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"
#Output, adding a new column changed to new dataframe
var1 var2 changed
1 1 A NA
2 4 B var1
3 3 D var2
4 6 Z var1; var2
A tidyverse way -
library(dplyr)
library(purrr) # provides map2_df()
cols <- names(new)
bind_cols(new, map2_df(old, new, `!=`) %>%
rowwise() %>%
transmute(changed = {
x <- c_across(everything())
if(any(x)) paste0(cols[x], collapse = ';') else NA
}))
# var1 var2 changed
#1 1 A <NA>
#2 4 B var1
#3 3 D var2
#4 6 Z var1;var2
The same logic can be implemented in base R as well -
new$changed <- apply(mapply(`!=`, old, new), 1, function(x)
if(any(x)) paste0(cols[x], collapse = ';') else NA)
Here is a base R approach.
new$changed <- apply(old != new, 1L, \(r, nms) toString(nms[which(r)]), colnames(old))
Output
var1 var2 changed
1 1 A
2 4 B var1
3 3 D var2
4 6 Z var1, var2
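The question also asks for a function that refers only to x1, x2 and x3 rather than the literal column names. A minimal base R sketch of that could look like the following; the function name add_changed and its parameter names are hypothetical, and it assumes both data frames have the same rows in the same order.

```r
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"))
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"))
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"

# Hypothetical helper: compare_cols are the columns to compare,
# out_col is the name of the column to create
add_changed <- function(old, new, compare_cols, out_col) {
  # Logical matrix: TRUE where a value differs between old and new
  differs <- old[compare_cols] != new[compare_cols]
  new[[out_col]] <- unname(apply(differs, 1, function(r) {
    if (any(r)) paste(compare_cols[r], collapse = "; ") else NA_character_
  }))
  new
}

res <- add_changed(old, new, c(x1, x2), x3)
```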

How can I assign a value with case_when() from dplyr based on another column value?

Is there a way to assign the value of the column being created using an existing value from another column when using case_when() with mutate()?
The actual dataframe I'm dealing with is quite complicated so here is a trivial example of what I want:
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ Modifier * 3,
TRUE ~ Assay)) %>%
select(-Modifier)
Expected new_df:
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1
I can successfully assign the NA values to the column I am mutating when no cases match, but haven't found a way to assign a value based on the value of some other column in the data frame if I'm manipulating it. I get this error:
Error: Problem with `mutate()` column `Assay`.
i `Assay = case_when(...)`.
x must be a character vector, not a double vector.
Is there a way to do this?
I found that I was able to do this using paste() after experimenting. As noted by a commenter, paste() works because the underlying issue is a type mismatch: the Assay column is a character vector, but Modifier * 3 is numeric (a double), and case_when() needs all of its right-hand sides to have compatible types. paste() and paste0() work because they implicitly convert to character, but as.character() addresses the issue directly.
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ as.character(Modifier * 3),
TRUE ~ Assay)) %>%
select(-Modifier)
This is the output:
print(new_df)
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1

determining frequency of multiple variables based on multiple factors in R

Suppose I have a dataset like this:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test1 <- c(12,16, NA, 11, 15,NA, 0,12, 5)
test2 <- c(1,NA, 2, 2, 2,2, NA,NA, NA)
df <- data.frame(id,visit,test1,test2)
I want to know the number of data points per visit PER test so that the final output looks something like this:
visit test1 test2
A 3 3
B 3 1
C 1 1
I know I can use the aggregate function like this for one variable (here test1), as mentioned in this older post:
aggregate(x = df$id[!is.na(df$test1)], by = list(df$visit[!is.na(df$test1)]), FUN = length)
but how would I go about doing this for multiple tests?
You can also use data.table, which could be useful for a flexible number of columns:
library(data.table)
cols <- names(df)[grepl("test",names(df))]
setDT(df)[,lapply(.SD, function(x) sum(!is.na(x))), by = visit, .SDcols = cols]
# visit test1 test2
#1: A 3 3
#2: B 3 1
#3: C 1 1
Using table and rowSums in base R:
cols <- 3:4
sapply(cols, function(i) rowSums(table(df$visit, df[,i]), na.rm = TRUE))
# [,1] [,2]
#A 3 3
#B 3 1
#C 1 1
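Staying closer to the aggregate() call from the question, a possible base R sketch uses the formula interface with cbind() to cover several test columns at once. Setting na.action = na.pass is needed so that rows containing NA are not dropped before counting.

```r
id <- c(1, 1, 1, 2, 2, 3, 3, 4, 4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test1 <- c(12, 16, NA, 11, 15, NA, 0, 12, 5)
test2 <- c(1, NA, 2, 2, 2, 2, NA, NA, NA)
df <- data.frame(id, visit, test1, test2)

# Count non-missing values per visit for each test column
res <- aggregate(cbind(test1, test2) ~ visit, data = df,
                 FUN = function(x) sum(!is.na(x)),
                 na.action = na.pass)
```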
