Looping through a dataframe to filter rows in R

In a given dataframe, I need to filter the rows on separate columns, one at a time, using the same condition. The following formulation does not work. Any suggestions?
DF <- data.frame(A = c(1, 4, 99),
                 B = c(2, 5, 6),
                 C = c(3, 99, 7))
r <- c("A", "C")
for (i in r) {
  column = as.formula(paste0("DF$", i))
  DF <- DF[column != 99, ]
  print(DF)
}
The desired outputs are the following two:
A B C
1 1 2 3
2 4 5 99
A B C
1 1 2 3
3 99 6 7

We may use
library(dplyr)
library(purrr)
map(r, ~ DF %>%
      filter(!! rlang::sym(.x) != 99))
Output:
[[1]]
A B C
1 1 2 3
2 4 5 99
[[2]]
A B C
1 1 2 3
2 99 6 7
Or in base R
lapply(r, \(x) subset(DF, DF[[x]] != 99))
[[1]]
A B C
1 1 2 3
2 4 5 99
[[2]]
A B C
1 1 2 3
3 99 6 7
If the goal is to filter, work with the subset, and then remove it inside a loop (for example to keep memory usage down):
library(data.table)
setDT(DF)
for (nm in r) {
  tmp <- DF[DF[[nm]] != 99]
  # ... do some calc ...
  rm(tmp)
  gc()
}
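For completeness, the original loop can also be fixed in base R by indexing the column with [[ instead of building a formula string; a minimal sketch that prints each filtered copy without overwriting DF:
for (i in r) {
  print(DF[DF[[i]] != 99, ])
}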

Related

How to find duplicated values in two columns between two dataframes and remove non-duplicates in R?

So let's say I have two dataframes that look like this
df1 <- data.frame(ID = c("A","B","F","G","B","B","A","G","G","F","A","A","A","B","F"),
                  code = c(1,2,2,3,3,1,2,2,1,1,3,2,2,1,1),
                  class = c(2,4,5,5,2,3,2,5,1,2,4,5,3,2,1))
df2 <- data.frame(ID = c("G","F","C","F","B","A","F","C","A","B","A","B","C","A","G"),
                  code = c(1,2,2,3,3,1,2,2,1,1,3,2,2,1,1),
                  class = c(2,4,5,5,2,3,2,5,1,2,4,5,3,2,1))
I want to check the duplicates in df1$ID and df2$ID and remove all the rows from df2 whose IDs are not present in df1, so that the new dataframe would look like this:
df3 <- data.frame(ID = c("G","F","F","B","A","F","A","B","A","B","A","G"),
                  code = c(1,2,3,3,1,2,1,1,3,2,1,1),
                  class = c(2,4,5,2,3,2,1,2,4,5,2,1))
With %in%:
df2[df2$ID %in% df1$ID, ]
ID code class
1 G 1 2
2 F 2 4
4 F 3 5
5 B 3 2
6 A 1 3
7 F 2 2
9 A 1 1
10 B 1 2
11 A 3 4
12 B 2 5
14 A 1 2
15 G 1 1
You can use the intersect() function to tackle the issue:
common_ids <- intersect(df1$ID, df2$ID)
df3 <- df2[df2$ID %in% common_ids, ]
ID code class
1 G 1 2
2 F 2 4
4 F 3 5
5 B 3 2
6 A 1 3
7 F 2 2
9 A 1 1
10 B 1 2
11 A 3 4
12 B 2 5
14 A 1 2
15 G 1 1
I want to throw semi_join in.
library(tidyverse)
df_test <- df2 |> semi_join(df1, by = "ID")
all.equal(df3, df_test)
#> [1] TRUE
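As an optional sanity check, setdiff() shows which IDs in df2 are being dropped; with the example data, only "C" is absent from df1:
setdiff(df2$ID, df1$ID)
#> [1] "C"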

Combining elements of one column into two columns by group in R

Given a two-column data.frame, with one column containing group labels and a second containing integer values ordered from smallest to largest, how can the data be expanded to create pairs of combinations of the integer column?
Not sure the best way to state this. I'm not interested in all possible combinations but instead all unique combinations starting from the lowest value.
In R, the combn function gives the desired output when groups are not considered, for example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one then apply a similar method by groups?
for example given a data.frame
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr)  # needed for filter()
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
j <- 1
for (i in unique(test$Group)) {
  if (j == 1) {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test1 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    j <- j + 1
  } else {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test2 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    test1 <- rbind(test1, test2)
  }
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library(data.table)
# make test a data.table
setDT(test)
# split by group
L <- split(test, by = "Group")
# get unique combinations of 2 Vals per group
L2 <- lapply(L, function(x) {
  as.data.table(t(combn(x$Val, m = 2, simplify = TRUE)))
})
# merge them back together
data.table::rbindlist(L2, idcol = "Group")
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
  group_by(Group) %>%
  summarise(Val = combn(Val, 2, simplify = FALSE)) %>%
  unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
Or with gtools::combinations:
library(tidyverse)
df2 <- split(test$Val, test$Group) %>%
  map(~ gtools::combinations(n = 4, r = 2, v = .x)) %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")
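A plain base R version of the same idea is sketched below; it splits Val by Group, builds the pair matrix per group with combn(), and stacks the pieces back together (the object name pieces and the Val1/Val2 labels are just illustrative choices):
# split Val by Group, build pairs per group, then bind the pieces with a Group column
pieces <- lapply(split(test$Val, test$Group), function(v) {
  setNames(as.data.frame(t(combn(v, 2))), c("Val1", "Val2"))
})
do.call(rbind, Map(cbind, Group = names(pieces), pieces))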

filter() or subset() all the dataframes stored in a list

If I want to remove all the rows that contain 0s in a specific column, I can just do:
library(dplyr)
df <- data.frame(a = c(0,1,2,3,0,5),
                 b = c(1,2,3,5,3,1))
df <- filter(df, a != 0)
How can I do the same if I'm working with lists?
My intuition tells me to use 'lapply' but I cannot seem to make the syntax work:
# same dataframe
df <- data.frame(a = c(0,1,2,3,0,5),
                 b = c(1,2,3,5,3,1))
df2 <- df
list.df <- list(df, df2)
lapply(list.df, filter(), a != 0)  # doesn't work. How do I fix this syntax?
Many thanks in advance!
One option involving purrr could be:
map(.x = list.df, ~ .x %>%
      filter(a != 0))
[[1]]
a b
1 1 2
2 2 3
3 3 5
4 5 1
[[2]]
a b
1 1 2
2 2 3
3 3 5
4 5 1
You have other options using lapply:
# Without dplyr
lapply(list.df, function(x) x[x$a != 0, ])
# With dplyr
library(dplyr)
lapply(list.df, function(x) filter(x, a != 0))
# Result
# [[1]]
# a b
# 1 1 2
# 2 2 3
# 3 3 5
# 4 5 1
#
# [[2]]
# a b
# 1 1 2
# 2 2 3
# 3 3 5
# 4 5 1
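If you prefer subset() over bracket indexing, the same pattern works inside lapply; a minimal sketch that also reassigns so the filtered list replaces the original:
list.df <- lapply(list.df, function(x) subset(x, a != 0))
list.df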

Dplyr select based on multiple strings in a column

I have a data frame containing the following columns:
sample.data
a_b_c d_b_e r_f_g c_b_a
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
How do I select only columns that contain both let's say "a" and "c" in the column name?
To select variables that contain a and c we could do:
library(dplyr)
df %>%
  select(matches("(a.*c)|(c.*a)"))
a_b_c c_b_a
1 1 1
2 2 2
3 3 3
4 4 4
Note that a_a_e is not selected because it doesn't contain c, and c_f_g is not selected because it doesn't contain a. Repeating one letter, as in a_a_e, doesn't help when the other letter is missing.
We could also use str_subset:
library(dplyr)
library(stringr)
df %>%
  select(str_subset(names(df), "(a.*c)|(c.*a)"))
Data:
df <- data.frame(
  a_b_c = 1:4,
  a_a_e = 1:4,
  c_f_g = 1:4,
  c_b_a = 1:4
)
Try df %>% dplyr::select(matches("(a|c)")). Note that this regex matches columns containing a or c, not necessarily both; it gives the desired result here only because the remaining columns contain neither letter.
library(dplyr)
df <- data.frame(
  a_b_c = 1:4,
  d_b_e = 1:4,
  r_f_g = 1:4,
  c_b_a = 1:4
)
Results
> df %>% dplyr::select(matches("(a|c)"))
a_b_c c_b_a
1 1 1
2 2 2
3 3 3
4 4 4
If you want to see how it works under the hood, use the following function:
contain_both <- function(data_frame, letter_a, letter_b) {
  j <- 0
  keep_columns <- NULL
  for (i in 1:ncol(data_frame)) {
    has_letters <- unlist(strsplit(names(data_frame)[i], '_'))
    if (is.element(letter_a, has_letters) && is.element(letter_b, has_letters)) {
      j <- j + 1
      keep_columns[j] <- i
    }
  }
  return(data_frame[, keep_columns])
}
Data:
df <- data.frame(seq(1:4), seq(1:4), seq(1:4), seq(1:4))
names(df) <- c('a_b_c', 'd_b_e', 'r_f_g', 'c_b_a')
Just pass in your data frame, along with your 2 letter choices:
Usage:
contain_both(df, 'b', 'c')
Hope this is what you are looking for:
a_b_c <- c(1,2,3,4)
d_b_e <- c(1,2,3,4)
yy <- cbind(a_b_c, d_b_e)
> yy
a_b_c d_b_e
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
yy <- as.data.frame(yy)
yy
a_b_c d_b_e
1 1 1
2 2 2
3 3 3
4 4 4
y <- yy[which(names(yy) %in% "a_b_c")]
> y
a_b_c
1 1
2 2
3 3
4 4
In your example, you can use this:
y <- sample.data[which(names(sample.data) %in% c("a_b_c", "c_b_a"))]
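Another base R option, which checks for both letters regardless of their order, is sketched below using the df with columns a_b_c, d_b_e, r_f_g, c_b_a defined above:
# keep only columns whose names contain both "a" and "c"
df[, grepl("a", names(df)) & grepl("c", names(df)), drop = FALSE]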

Dataframe create new column based on other columns

I have a dataframe:
df <- data.frame('a'=c(1,2,3,4,5), 'b'=c(1,20,3,4,50))
df
a b
1 1 1
2 2 20
3 3 3
4 4 4
5 5 50
and I want to create a new column based on existing columns. Something like this:
if (df[['a']] == df[['b']]) {
  df[['c']] <- df[['a']] + df[['b']]
} else {
  df[['c']] <- df[['b']] - df[['a']]
}
The problem is that the if condition is checked only for the first row... If I create a function from the above if statement and then use apply() (or mapply()...), the result is the same.
In Python/pandas I can use this:
df['c'] = df[['a', 'b']].apply(lambda x: x['a'] + x['b'] if (x['a'] == x['b']) \
else x['b'] - x['a'], axis=1)
I want something similar in R. So the result should look like this:
a b c
1 1 1 2
2 2 20 18
3 3 3 6
4 4 4 8
5 5 50 45
One option is ifelse, which is a vectorized version of if/else. If we are doing this for each row, the if/else as shown in the OP's pandas post can be done in either a for loop or lapply/sapply, but that would be inefficient in R.
df <- transform(df, c= ifelse(a==b, a+b, b-a))
df
# a b c
#1 1 1 2
#2 2 20 18
#3 3 3 6
#4 4 4 8
#5 5 50 45
This can be otherwise written as
df$c <- with(df, ifelse(a==b, a+b, b-a))
to create the 'c' column in the original dataset
As the OP wants a similar option in R using if/else:
df$c <- apply(df, 1, FUN = function(x) if(x[1]==x[2]) x[1]+x[2] else x[2]-x[1])
Here is a slightly more confusing algebraic method:
df$c <- with(df, b + ((-1)^((a==b)+1) * a))
df
a b c
1 1 1 2
2 2 20 18
3 3 3 6
4 4 4 8
5 5 50 45
The idea is that the "minus" operator is turned on or off based on the test a==b.
If you want an apply-family method, another way with mapply would be to create a function and apply it:
fun1 <- function(x, y) if (x == y) {x + y} else {y - x}
df$c <- mapply(fun1, df$a, df$b)
df
# a b c
#1 1 1 2
#2 2 20 18
#3 3 3 6
#4 4 4 8
#5 5 50 45
Using dplyr package:
library(dplyr)
df <- df %>%
  mutate(c = if_else(a == b, a + b, b - a))
df
# a b c
# 1 1 1 2
# 2 2 20 18
# 3 3 3 6
# 4 4 4 8
# 5 5 50 45
A solution with apply
myFunction <- function(x) {
  a <- x[1]
  b <- x[2]
  # further values ignored (if there are more than 2 columns)
  value <- if (a == b) a + b else b - a
  # or more complicated stuff
  return(value)
}
df$c <- apply(df, 1, myFunction)
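For larger data, data.table's fifelse() is a faster, type-stable alternative to ifelse(); a minimal sketch of the same computation, shown as an optional variation:
library(data.table)
setDT(df)
# assign c by reference using the vectorized fifelse
df[, c := fifelse(a == b, a + b, b - a)]
df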
