For loop inside lapply

I have the following simplified dataframe df (dput below):
> df
group value
1 A 1
2 A 4
3 B 2
4 B 3
5 C 2
6 C 1
I would like to apply a for loop per group using lapply. So first I created a list for each group using split. Then I would like to perform the for loop within each data frame, but it doesn't work. Here is some reproducible code (note that the for loop is simplified to keep the example reproducible):
df = data.frame(group = c("A", "A", "B", "B", "C", "C"),
value = c(1, 4, 2, 3, 2, 1))
l = split(df, df$group)
lapply(l, \(x) {
  for(i in 1:nrow(x)) {
    if(x$value[i] == 1) {
      x$value[i] = x$value[i] + 1
    } else {
      x$value[i] = x$value[i] + 2
    }
  }
})
#> $A
#> NULL
#>
#> $B
#> NULL
#>
#> $C
#> NULL
Created on 2023-02-15 with reprex v2.0.2
This should add 1 to the values that are 1 and add 2 otherwise (a deliberately simple but reproducible rule). As you can see, it returns NULL. The expected output should look like this:
$A
group value
1 A 2
2 A 6
$B
group value
3 B 3
4 B 5
$C
group value
5 C 4
6 C 2
So I was wondering: how can we apply a for loop to each group within lapply? And why is it returning NULL?
dput df:
df<-structure(list(group = c("A", "A", "B", "B", "C", "C"), value = c(1,
4, 2, 3, 2, 1)), class = "data.frame", row.names = c(NA, -6L))

A for loop always returns NULL. If you want your function to return the updated x, make sure to add x at the end of the function, or return(x) to be more explicit.
l = split(df, df$group)
lapply(l, \(x) {
  for(i in 1:nrow(x)) {
    if(x$value[i] == 1) {
      x$value[i] = x$value[i] + 1
    } else {
      x$value[i] = x$value[i] + 2
    }
  }
  x
})
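As an aside, since the loop body is simple, the same per-group update can also be written without an explicit for loop; a minimal vectorized sketch using ifelse (an alternative to, not part of, the answer above):
lapply(l, \(x) {
  # add 1 where value == 1, otherwise add 2, then return the modified data frame
  x$value = ifelse(x$value == 1, x$value + 1, x$value + 2)
  x
})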

Related

Call a function with some parameters in a vector variable

I have a dataframe, and a function I use to mark specific cases in it (The following is an example to make things more concrete. The actual dataframe and function are much more complicated).
df = data.frame(number = c(1,2,3,4),
type1 = c("A","A","B","B"),
type2 = c("A","B","A","B"),
var1 = c(1,2,3,4),
var2 = c(1,2,3,4) )
FilterMark <- function(fun_data, cond_type1, cond_type2) {
  fun_data$filter <- FALSE
  fun_data$filter[which(fun_data$type1 == cond_type1 &
                        fun_data$type2 == cond_type2)] <- TRUE
  return(fun_data$filter)
}
I need to call this function multiple times, with various conditions. I want to use a vector to define the conditions for each such call. For example:
conds = c("A","A")
df$case1[FilterMark(df, conds)] <- TRUE
But this doesn't work. The function interprets conds as a single parameter, instead of breaking the vector into two parameters.
Is there a way to call the function in such a way?
(I already tried do.call, but couldn't get it to work...)
Do you really need a specific function for that?
library(tidyverse)
df = data.frame(
number = c(1, 2, 3, 4),
type1 = c("A", "A", "B", "B"),
type2 = c("A", "B", "A", "B"),
var1 = c(1, 2, 3, 4),
var2 = c(1, 2, 3, 4)
)
df |>
mutate(case1 = type1 == "A" & type2 == "A")
#> number type1 type2 var1 var2 case1
#> 1 1 A A 1 1 TRUE
#> 2 2 A B 2 2 FALSE
#> 3 3 B A 3 3 FALSE
#> 4 4 B B 4 4 FALSE
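If you do want to keep FilterMark and drive it from a vector of conditions, do.call can work once the vector is converted to a list of arguments; a minimal sketch under that assumption:
conds <- c("A", "A")
# do.call takes the arguments as a list: put the data frame first,
# then split the condition vector into separate elements
df$case1 <- do.call(FilterMark, c(list(df), as.list(conds)))
Since FilterMark already returns a logical vector, it can be assigned to the new column directly.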

Remove all records that have duplicates based on more than one variables

I have data like this
df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"), var2 = c(1, 2, 3, 4, 5, 5, 6 ))
# var1 var2
# 1 A 1
# 2 A 2
# 3 B 3
# 4 B 4
# 5 C 5
# 6 D 5
# 7 E 6
A is mapped to 1, 2
B is mapped to 3, 4
C and D are both mapped to 5 (and vice versa: 5 is mapped to C and D)
E is uniquely mapped to 6 and 6 is uniquely mapped to E
I would like to filter the dataset so that only
var1 var2
7 E 6
is returned. Base or tidyverse solutions are welcome.
I have tried
unique(df$var1, df$var2)
df[!duplicated(df),]
df %>% distinct(var1, var2)
but without the wanted result.
Using igraph::components.
Represent data as graph and get connected components:
library(igraph)
g = graph_from_data_frame(df)
cmp = components(g)
Grab components where cluster size (csize) is 2. Output vertices as a two-column character matrix:
matrix(names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)]),
ncol = 2, dimnames = list(NULL, names(df))) # wrap in as.data.frame if desired
# var1 var2
# [1,] "E" "6"
Alternatively, use names of relevant vertices to index original data frame:
v = names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)])
df[df$var1 %in% v[1:(length(v)/2)], ]
# var1 var2
# 7 E 6
Visualize the connections:
plot(g)
Using a custom function to determine whether the mapping is unique, you could achieve your desired result like so:
df <- data.frame(
var1 = c("A", "A", "B", "B", "C", "D", "E"),
var2 = c(1, 2, 3, 4, 5, 5, 6)
)
is_unique <- function(x, y) ave(as.numeric(factor(x)), y, FUN = function(x) length(unique(x)) == 1)
df[is_unique(df$var2, df$var1) & is_unique(df$var1, df$var2), ]
#> var1 var2
#> 7 E 6
Another igraph option
decompose(graph_from_data_frame(df)) %>%
subset(sapply(., vcount) == 2) %>%
sapply(function(g) names(V(g)))
which gives
[,1]
[1,] "E"
[2,] "6"
A base R solution:
df[!(duplicated(df$var1) | duplicated(df$var1, fromLast = TRUE) |
duplicated(df$var2) | duplicated(df$var2, fromLast = TRUE)), ]
var1 var2
7 E 6
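A dplyr sketch of the same idea, keeping only values of var1 that map to a single var2 and vice versa (assumes dplyr is installed):
library(dplyr)
df %>%
  group_by(var1) %>%
  filter(n_distinct(var2) == 1) %>%  # var1 maps to exactly one var2
  group_by(var2) %>%
  filter(n_distinct(var1) == 1) %>%  # var2 maps to exactly one var1
  ungroup()
# leaves only the E / 6 row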

Subset a data frame based on count of values of column x. Want only the top two in R

here is the data frame
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame such that I get a new data frame with only the top two "x" based on the count of "x".
One of the simplest ways to achieve what you want to do is with the package data.table. You can read more about it here. Basically, it allows for fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and c to p and x, respectively. This way, you won't see NAs when filtering the top two observations.
The idea is to sort your dataset and then use .SD, which is a convenient way of subsetting/filtering/extracting observations within each group.
Please, see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
add_count(x) %>%
slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4
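With the original nine-row df, a base R sketch of the same idea, using table() to count the values of x and keep the two most frequent ones (top_levels is just an illustrative name):
# count how often each value of x occurs and take the two most frequent
top_levels <- names(sort(table(df$x), decreasing = TRUE))[1:2]
df[df$x %in% top_levels, ]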

Find out the row with different value with in same name [duplicate]

This question already has answers here:
How to remove rows that have only 1 combination for a given ID
(4 answers)
Selecting & grouping dual-category data from a data frame
(4 answers)
Closed 5 years ago.
I have a df looks like
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3 and so on.
However, as you can see, B has "2" and "15". "15" is the wrong value and should not be here.
I would like to find the rows where the Value doesn't match within the same Name.
The ideal output would look like
B 2
B 15
You can use tidyverse functions like:
df %>%
group_by(Name, Value) %>%
unique()
giving:
Name Value
1 A 1
2 B 2
3 B 15
4 C 3
5 D 4
6 E 5
Then, to keep only the Names with multiple Values, continue the pipeline:
df %>%
  group_by(Name, Value) %>%
  unique() %>%
  group_by(Name) %>%
  filter(n() > 1)
Something like this? This searches for Names that are associated with more than one Value and outputs one copy of each {Name, Value} pair.
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), function(i) {
  if (length(unique(df[df$Name == i, ]$Value)) > 1) {
    out <- df[df$Name == i, ]
    out[!duplicated(out$Value), ]
  }
}))
res
Result as expected
Name Value
4 B 2
5 B 15
Filter(function(x) nrow(unique(x)) != 1, split(df, df$Name))
$B
Name Value
4 B 2
5 B 15
Or:
Reduce(rbind, by(df, df$Name, function(x) if (nrow(unique(x)) > 1) x))
Name Value
4 B 2
5 B 15
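A dplyr sketch of the same logic, keeping Names with more than one distinct Value and dropping exact duplicates (assumes dplyr is installed):
library(dplyr)
df %>%
  group_by(Name) %>%
  filter(n_distinct(Value) > 1) %>%  # keep Names whose Values disagree
  ungroup() %>%
  distinct()
# leaves the two B rows, 2 and 15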

determining frequency of multiple variables based on multiple factors in R

Suppose I have a dataset like this:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test1 <- c(12,16, NA, 11, 15,NA, 0,12, 5)
test2 <- c(1,NA, 2, 2, 2,2, NA,NA, NA)
df <- data.frame(id,visit,test1,test2)
I want to know the number of data points per visit PER test so that the final output looks something like this:
visit test1 test2
A 3 3
B 3 1
C 1 1
I know I can use the aggregate function like this for one variable, as mentioned in this older post:
aggregate(x = df$id[!is.na(df$test)], by = list(df$visit[!is.na(df$test)]), FUN = length)
but how would I go about doing this for multiple tests?
You can also use data.table which could be useful for a flexible number of columns:
cols <- names(df)[grepl("test",names(df))]
setDT(df)[, lapply(.SD, function(x) sum(!is.na(x))), by = visit, .SDcols = cols]
# visit test1 test2
#1: A 3 3
#2: B 3 1
#3: C 1 1
Using table and rowSums in base R:
cols <- 3:4
sapply(cols, function(i) rowSums(table(df$visit, df[,i]), na.rm = TRUE))
# [,1] [,2]
#A 3 3
#B 3 1
#C 1 1
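Staying closer to the aggregate approach from the question, the formula interface extends the single-column version to several test columns at once; a sketch, assuming the original data frame df:
# na.action = na.pass keeps rows containing NA so they can be counted per column
aggregate(cbind(test1, test2) ~ visit, data = df,
          FUN = function(x) sum(!is.na(x)), na.action = na.pass)
#   visit test1 test2
# 1     A     3     3
# 2     B     3     1
# 3     C     1     1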
